- Setting up Sun Studio
- Getting the source code
- Compiling the source code
- Open the project properties window: in the project window, right-click the project name and select Properties.
- Activate the OpenMP feature: in the Categories tree, select Configuration Properties - C/C++/Fortran - C Compiler - General. In the properties table, change Multithreading Level from None to OpenMP, then click the OK button.
- Build the project: on the menu bar, click Build - Build Main Project.
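For readers who do not have the source at hand, the kind of loop this setting affects can be sketched as follows. This is only a minimal illustration, not the actual benchmark source: the function name `matmul` and the row-major layout are assumptions. When OpenMP is not enabled, the compiler ignores the `#pragma omp` line, so the same file can be built as a serial program.

```c
#include <stddef.h>

/* Hypothetical kernel: c(m,n) = a(m,k) * b(k,n), all matrices row-major. */
void matmul(int m, int n, int k,
            const double *a, const double *b, double *c)
{
    int i;
    /* Rows of c are independent, so the outer loop can be split across threads. */
    #pragma omp parallel for
    for (i = 0; i < m; i++) {
        for (int j = 0; j < n; j++) {
            double sum = 0.0;   /* declared inside the loop, so private per thread */
            for (int p = 0; p < k; p++)
                sum += a[(size_t)i * k + p] * b[(size_t)p * n + j];
            c[(size_t)i * n + j] = sum;
        }
    }
}
```

Calling `matmul(600, 2400, 1200, a, b, c)` would correspond to the problem size used in this post, assuming the arrays are allocated with those dimensions.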
- Executing the binary file
$ ulimit -s unlimited
You also need to set the number of threads to use. I use an Intel Core 2 Duo processor, which has two cores, so I set it to 2.
$ export OMP_NUM_THREADS=2
$ export OMP_DYNAMIC=FALSE
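To confirm that the program actually picks up these settings, a small check like the following could be compiled into the project. It is a sketch, not part of the original benchmark: the helper name `max_threads` is an assumption, while `omp_get_max_threads()` and the `_OPENMP` macro come from the OpenMP API. With OpenMP enabled and OMP_NUM_THREADS=2 it should report 2, and in a build without OpenMP it falls back to 1.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: upper bound on the threads a parallel region may use. */
int max_threads(void)
{
#ifdef _OPENMP
    /* Reflects OMP_NUM_THREADS when that variable is set. */
    return omp_get_max_threads();
#else
    /* Built without OpenMP: the program is serial. */
    return 1;
#endif
}
```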
I named the binary built with the OpenMP feature sun_openmp. The following is its execution result.
$ ./sun_openmp
Using time() for wall clock time
Problem size: c(600,2400) = a(600,1200) * b(1200,2400)
Calculating product 5 time(s)
We are using 2 thread(s)
Finished calculations.
Matmul kernel wall clock time = 6.00 sec
Wall clock time/thread = 3.00 sec
MFlops = 2880.000000
I named the binary built without the OpenMP feature sun_not_openmp. The following is its execution result.
$ ./sun_not_openmp
Using time() for wall clock time
Problem size: c(600,2400) = a(600,1200) * b(1200,2400)
Calculating product 5 time(s)
We are using 1 thread(s)
Finished calculations.
Matmul kernel wall clock time = 17.00 sec
Wall clock time/thread = 17.00 sec
MFlops = 1016.470588
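The MFlops figures in both runs can be reproduced from the problem size. The sketch below assumes the benchmark counts one multiplication and one addition per inner-loop step, which is consistent with the numbers printed above. The helper names `mflops` and `speedup` are hypothetical.

```c
/* Hypothetical helper: MFlops for `reps` products c(m,n) = a(m,k) * b(k,n),
   counting one multiply and one add per inner-loop step. */
double mflops(int m, int k, int n, int reps, double seconds)
{
    double flops = 2.0 * m * k * n * reps;
    return flops / seconds / 1.0e6;
}

/* Speedup of the parallel run relative to the serial run. */
double speedup(double serial_seconds, double parallel_seconds)
{
    return serial_seconds / parallel_seconds;
}
```

With these definitions, `mflops(600, 1200, 2400, 5, 6.0)` gives 2880 and `mflops(600, 1200, 2400, 5, 17.0)` gives about 1016.47, matching the two runs, and `speedup(17.0, 6.0)` is about 2.83. Because the program measures with time(), which counts whole seconds, these values are only rough estimates.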
- System monitor
(1) Single processor working: only cpu0 is 100% active, and the process finishes more slowly.
(2) Dual processors working: both cpu0 and cpu1 are 100% active, and the process finishes faster.
- OpenMP API User's Guide
1 comment:
I understand your program is for demonstration purposes and it's a great, compact sample. Also note that included with the Sun Studio packages is the Sun Performance Library which implements things like Matrix Multiply and it is callable from Fortran, C, and C++. I took your example and replaced the computational loops with a single call to the DGEMM routine and got the following results:
slimbutte > ./a.out
Using time() for wall clock time
Problem size: c(600,2400) = a(600,1200) * b(1200,2400)
Calculating product 5 time(s)
We are using 1 thread(s)
Finished calculations.
Matmul kernel wall clock time = 2.00 sec
Wall clock time/thread = 2.00 sec
MFlops = 8640.000000
slimbutte > setenv OMP_NUM_THREADS 2
slimbutte > ./a.out
Using time() for wall clock time
Problem size: c(600,2400) = a(600,1200) * b(1200,2400)
Calculating product 5 time(s)
We are using 2 thread(s)
Finished calculations.
Matmul kernel wall clock time = 1.00 sec
Wall clock time/thread = 0.50 sec
MFlops = 17280.000000
So, there's a lot more performance to be had for this type of problem.
Of course, Intel also has a numerical library (MKL) which provides the same routines and would produce equally high performance for this type of problem.
Great article!