Friday, April 18, 2008

openmp c/c++ on intel and linux using sun compiler

i previously wrote another openmp post titled openmp c/c++ on intel and linux using intel compiler. this post shows how to do the same thing using the sun compiler. the steps are as follows:
  • setting sunstudio
sunstudio is a development package from sun microsystems. it contains the netbeans ide and the sun c compiler. we need both of them to create the project and compile the code. another post of mine, titled sun studio on linux, describes how to set up sunstudio on debian gnu/linux. at the end of that post, i give two links to the sunstudio documentation on how to use it.
  • getting source code
i use the same source code that i used with the intel compiler. you can find it under the sample code bullet of my other post titled openmp c/c++ on intel and linux using intel compiler.
  • compiling source code
after setting up the project and loading the source code, follow these steps to activate the openmp feature for the compilation process.
    • open the project properties window: in the project window, right click on the project name and select properties.
    • activate the openmp feature: in the categories tree box, select configuration properties - c/c++/fortran - c compiler - general. in the properties table box, change multithreading level from none to openmp. click the ok button.
    • build the project: on the menu bar, click build - build main project.
(screenshot: activating the openmp compilation feature)
  • executing the binary file
you need to set the maximum stack size with the following command. i still don't know how to run this command from the sunstudio ide before executing the binary file, so i execute the binary file from the console.
$ ulimit -s unlimited

you also need to set the number of threads to use. i use an intel core 2 duo processor. it is a dual-core processor, so i set it to 2.
$ export OMP_NUM_THREADS=2
$ export OMP_DYNAMIC=FALSE

i named the binary file built with the openmp feature sun_openmp. the following is the execution result.
$ ./sun_openmp
Using time() for wall clock time
Problem size: c(600,2400) = a(600,1200) * b(1200,2400)
Calculating product 5 time(s)

We are using 2 thread(s)

Finished calculations.
Matmul kernel wall clock time = 6.00 sec
Wall clock time/thread = 3.00 sec
MFlops = 2880.000000

i named the binary file built without the openmp feature sun_not_openmp. the following is the execution result.
$ ./sun_not_openmp
Using time() for wall clock time
Problem size: c(600,2400) = a(600,1200) * b(1200,2400)
Calculating product 5 time(s)

We are using 1 thread(s)

Finished calculations.
Matmul kernel wall clock time = 17.00 sec
Wall clock time/thread = 17.00 sec
MFlops = 1016.470588
  • system monitor
the following kde system guard (performance monitor) capture shows the cpu load of cpu0 and cpu1. the two numbers in purple mark:
(1) single processor working: only cpu0 is 100% active, and the process finishes more slowly
(2) dual processors working: both cpu0 and cpu1 are 100% active, and the process finishes faster

  • openmp api user's guide
for a deeper understanding, please refer to sun studio 12: openmp api user's guide.

1 comment:

Anonymous said...

I understand your program is for demonstration purposes and it's a great, compact sample. Also note that included with the Sun Studio packages is the Sun Performance Library which implements things like Matrix Multiply and it is callable from Fortran, C, and C++. I took your example and replaced the computational loops with a single call to the DGEMM routine and got the following results:

slimbutte > ./a.out
Using time() for wall clock time
Problem size: c(600,2400) = a(600,1200) * b(1200,2400)
Calculating product 5 time(s)

We are using 1 thread(s)

Finished calculations.
Matmul kernel wall clock time = 2.00 sec
Wall clock time/thread = 2.00 sec
MFlops = 8640.000000
slimbutte > setenv OMP_NUM_THREADS 2
slimbutte > ./a.out
Using time() for wall clock time
Problem size: c(600,2400) = a(600,1200) * b(1200,2400)
Calculating product 5 time(s)

We are using 2 thread(s)

Finished calculations.
Matmul kernel wall clock time = 1.00 sec
Wall clock time/thread = 0.50 sec
MFlops = 17280.000000

So, there's a lot more performance to be had for this type of problem.

Of course, Intel also has a numerical library (MKL) which provides the same routines and would produce equally high performance for this type of problem.

Great article!