Wednesday, April 2, 2008

openmp c/c++ on intel and linux using intel compiler

i'm using an intel core 2 duo processor and debian gnu/linux 4.0. i'm now learning openmp, and i want to know how much faster the program can run on two cores compared to a single core.
  • hardware and software specification
the following processor information is taken from the /proc/cpuinfo file:
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz
stepping        : 6
cpu MHz         : 2400.778
cache size      : 4096 KB
physical id     : 0
siblings        : 2
cpu cores       : 2

the operating system is debian gnu/linux 4.0:
$ uname -a
Linux l411v 2.6.18-6-686 #1 SMP Sun Feb 10 22:11:31 UTC 2008 i686 GNU/Linux

i downloaded and installed the intel c++ compiler professional edition for linux under the non-commercial use license.
$ icc --version
icc (ICC) 10.1 20080112
Copyright (C) 1985-2007 Intel Corporation. All rights reserved.
  • setting environment
add the intel c++ compiler binary directory to the executable search path.
$ export PATH=$PATH:/usr/share/intel/cc/10.1.012/bin/

point the library path to the intel c++ compiler's dynamic library directory.
$ export LD_LIBRARY_PATH=/usr/share/intel/cc/10.1.012/lib
$ echo $LD_LIBRARY_PATH
/usr/share/intel/cc/10.1.012/lib
  • fixing bug
the openmp sample code (openmp_sample.c) comes from the samples directory installed with the intel c++ compiler; the full listing is in the sample code section at the end of this post. to activate the openmp feature, add the -openmp compiler option. when i compiled the sample code without the -openmp option, the compiler reported the openmp pragmas as unrecognized:
$ icc -std=c99 openmp_sample.c
openmp_sample.c(106): warning #161: unrecognized #pragma
#pragma omp parallel private(i,j,k)
^

openmp_sample.c(109): warning #161: unrecognized #pragma
#pragma omp single nowait
^

openmp_sample.c(119): warning #161: unrecognized #pragma
#pragma omp for nowait
^

openmp_sample.c(126): warning #161: unrecognized #pragma
#pragma omp for nowait
^

the fix is to wrap every openmp pragma in a preprocessor guard so the pragmas are only seen when openmp is enabled (the compiler defines the _OPENMP macro when -openmp is used). so change from:
  #pragma omp parallel private(i,j,k)
to:
#ifdef _OPENMP
#pragma omp parallel private(i,j,k)
#endif
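
to double-check the guard pattern on its own, here is a minimal standalone sketch of my own (guard_check.c is a hypothetical name, not part of the intel sample); it compiles cleanly both with and without -openmp and simply reports how many threads it runs:

/* guard_check.c - hypothetical minimal test, not part of the intel sample */
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>                 /* only available when compiling with -openmp */
#endif

int main(void)
{
  int nthr = 1;                  /* the serial build falls back to one thread */
#ifdef _OPENMP
#pragma omp parallel
  {
#pragma omp single
    nthr = omp_get_num_threads();
  }
#endif
  printf("running with %d thread(s)\n", nthr);
  return 0;
}

compile it the same two ways, e.g. icc -std=c99 -openmp guard_check.c and icc -std=c99 guard_check.c; neither build should produce warning #161.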
  • compile and run
the ulimit command controls the resources available to processes started by the shell. you need to set the stack size to an appropriate value; otherwise, the application will generate a segmentation fault, because the three matrices are declared as local arrays and together need roughly 40 MB of stack. the following command sets the maximum stack size to unlimited.
$ ulimit -s unlimited
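
an alternative, if you would rather not raise the stack limit, is to put the three matrices on the heap instead. this is my own sketch, not what the intel sample does:

/* heap alternative (my own sketch): allocate the matrices with malloc so the
   typical 8 MB default stack limit is not a problem; the loop bodies of
   openmp_sample.c would stay the same */
#include <stdio.h>
#include <stdlib.h>

#define SIZE 4800                /* same constants as openmp_sample.c */
#define M (SIZE/8)
#define N (SIZE/4)
#define P (SIZE/2)

int main(void)
{
  /* roughly 5.8 MB, 23 MB and 11.5 MB respectively */
  double (*a)[N] = malloc(sizeof(double[M][N]));
  double (*b)[P] = malloc(sizeof(double[N][P]));
  double (*c)[P] = malloc(sizeof(double[M][P]));
  if (!a || !b || !c) {
    fprintf(stderr, "allocation failed\n");
    return 1;
  }
  /* ... same initialization and matmul loops as in openmp_sample.c ... */
  free(a);
  free(b);
  free(c);
  return 0;
}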

compile with openmp feature and static linking mode:
$ icc -std=c99 -openmp -static openmp_sample.c
openmp_sample.c(119): (col. 5) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
openmp_sample.c(126): (col. 5) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
openmp_sample.c(106): (col. 3) remark: OpenMP DEFINED REGION WAS PARALLELIZED.

$ ls -l
total 1092
-rwxr-xr-x 1 lain lain 1105135 2008-04-02 11:48 a.out
-rw-r--r-- 1 lain lain 4702 2008-04-02 11:06 openmp_sample.c

$ ./a.out

Using time() for wall clock time
Problem size: c(600,2400) = a(600,1200) * b(1200,2400)
Calculating product 5 time(s)

We are using 2 thread(s)

Finished calculations.
Matmul kernel wall clock time = 6.00 sec
Wall clock time/thread = 3.00 sec
MFlops = 2880.000000

compile with openmp feature and dynamic linking mode:
$ icc -std=c99 -openmp openmp_sample.c
openmp_sample.c(124): (col. 5) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
openmp_sample.c(133): (col. 5) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
openmp_sample.c(107): (col. 3) remark: OpenMP DEFINED REGION WAS PARALLELIZED.

$ ls -l
total 44
-rwxr-xr-x 1 lain lain 34087 2008-04-02 12:01 a.out
-rw-r--r-- 1 lain lain 4702 2008-04-02 11:06 openmp_sample.c

$ ./a.out
Using time() for wall clock time
Problem size: c(600,2400) = a(600,1200) * b(1200,2400)
Calculating product 5 time(s)

We are using 2 thread(s)

Finished calculations.
Matmul kernel wall clock time = 6.00 sec
Wall clock time/thread = 3.00 sec
MFlops = 2880.000000

compile without openmp feature and static linking mode:
$ icc -std=c99 -static openmp_sample.c

$ ls -l
total 512
-rwxr-xr-x 1 lain lain 509458 2008-04-02 11:49 a.out
-rw-r--r-- 1 lain lain 4702 2008-04-02 11:06 openmp_sample.c

$ ./a.out
Using time() for wall clock time
Problem size: c(600,2400) = a(600,1200) * b(1200,2400)
Calculating product 5 time(s)

We are using 1 thread(s)

Finished calculations.
Matmul kernel wall clock time = 17.00 sec
Wall clock time/thread = 17.00 sec
MFlops = 1016.470588

compile without openmp feature and dynamic linking mode:
$ icc -std=c99 openmp_sample.c

$ ls -l
total 32
-rwxr-xr-x 1 lain lain 23570 2008-04-02 12:03 a.out
-rw-r--r-- 1 lain lain 4702 2008-04-02 11:06 openmp_sample.c

$ ./a.out
Using time() for wall clock time
Problem size: c(600,2400) = a(600,1200) * b(1200,2400)
Calculating product 5 time(s)


We are using 1 thread(s)

Finished calculations.
Matmul kernel wall clock time = 17.00 sec
Wall clock time/thread = 17.00 sec
MFlops = 1016.470588
  • conclusion
| openmp  | number of | linking | file size |  time to  |     mega     |
| feature | threads   | mode    | (bytes)   | finish    | flops        |
|         |           |         |           | (seconds) |              |
+---------+-----------+---------+-----------+-----------+--------------+
| yes     | 2         | static  | 1,105,135 | 6         | 2,880.000000 |
| yes     | 2         | dynamic | 34,087    | 6         | 2,880.000000 |
| no      | 1         | static  | 509,458   | 17        | 1,016.470588 |
| no      | 1         | dynamic | 23,570    | 17        | 1,016.470588 |
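
reading the table: with openmp on both cores the kernel finishes in 6 seconds instead of 17, roughly 2.8 times faster, while the linking mode only changes the executable size, not the speed. the mega flops column follows directly from the formula at the end of the sample code (NTIMES x 2 x M x N x P operations divided by the wall clock time); here is a small standalone check of the arithmetic:

/* mflops_check.c - quick arithmetic check of the mega flops column above */
#include <stdio.h>

int main(void)
{
  const double ntimes = 5, m = 600, n = 1200, p = 2400;
  const double flops = ntimes * 2.0 * m * n * p;           /* total floating-point operations */
  printf("MFlops at  6 sec: %f\n", flops /  6.0 / 1.0e6);  /* prints 2880.000000 */
  printf("MFlops at 17 sec: %f\n", flops / 17.0 / 1.0e6);  /* prints 1016.470588 */
  return 0;
}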
  • system monitor
the following kde system guard (performance monitor) capture shows the cpu load of cpu0 and cpu1. the two purple graphs show:
(1) single processor working: only cpu0 is busy at 100%, so the run finishes slower
(2) dual processors working: both cpu0 and cpu1 are busy at 100%, so the run finishes faster

  • sample code
openmp_sample.c file:
/*
* Copyright (C) 2006-2007 Intel Corporation. All Rights Reserved.
*
* The source code contained or described herein and all
* documents related to the source code ("Material") are owned by
* Intel Corporation or its suppliers or licensors. Title to the
* Material remains with Intel Corporation or its suppliers and
* licensors. The Material is protected by worldwide copyright
* laws and treaty provisions. No part of the Material may be
* used, copied, reproduced, modified, published, uploaded,
* posted, transmitted, distributed, or disclosed in any way
* except as expressly provided in the license provided with the
* Materials. No license under any patent, copyright, trade
* secret or other intellectual property right is granted to or
* conferred upon you by disclosure or delivery of the Materials,
* either expressly, by implication, inducement, estoppel or
* otherwise, except as expressly provided in the license
* provided with the Materials.
*
* [DESCRIPTION]
* Each element of the product matrix c[i][j] is
* computed from a unique row and
* column of the factor matrices, a[i][k] and b[k][j].
*
* In the multithreaded implementation, each thread can
* concurrently compute some submatrix of the product without
* needing OpenMP data or control synchronization.
*
* The algorithm uses OpenMP* to parallelize the outer-most loop,
* using the "i" row index.
*
* Both the outer-most "i" loop and middle "k" loop are manually
* unrolled by 4. The inner-most "j" loop iterates one-by-one
* over the columns of the product and factor matrices.
*
* [COMPILE]
* Use the following compiler options to compile both multi- and
* single-threaded versions.
*
* Parallel compilation:
* You must set the stacksize to an appropriate size; otherwise,
* the application will generate a segmentation fault.
* Linux* and Mac OS* X: appropriate ulimit commands are shown for
* bash shell.
*
* Windows*: /Qstd=c99 /Qopenmp /F256000000
*
* Linux*: ulimit -s unlimited
* -std=c99 -openmp
*
* Mac OS* X: ulimit -s 64000
* -std=c99 -openmp
*
* Serial compilation:
*
* Use the same command, but omit the -openmp (Linux and Mac OS X)
* or /Qopenmp (Windows) option.
*
*/

#include <stdio.h>
#include <time.h>
#include <float.h>
#include <math.h>
#ifdef _OPENMP
#include <omp.h>
#endif
#define bool _Bool
#define true 1
#define false 0

// Matrix size constants
// Be careful to set your shell's stacksize limit to a high value if you
// wish to increase the SIZE.
#define SIZE 4800 // Must be a multiple of 8.
#define M SIZE/8
#define N SIZE/4
#define P SIZE/2
#define NTIMES 5 // product matrix calculations

int main(void)
{
  double a[M][N], b[N][P], c[M][P], walltime;
  bool nthr_checked = false;
  time_t start;

  int i, j, k, l, i1, i2, i3, k1, k2, k3, nthr = 1;

  printf("Using time() for wall clock time\n");
  printf("Problem size: c(%d,%d) = a(%d,%d) * b(%d,%d)\n",
         M, P, M, N, N, P);
  printf("Calculating product %d time(s)\n", NTIMES);

  // a is a matrix of all ones
  for (i=0; i<M; i++)
    for (j=0; j<N; j++)
      a[i][j] = 1.0;

  // each column of b is the sequence 1,2,...,N
  for (i=0; i<N; i++)
    for (j=0; j<P; j++)
      b[i][j] = i+1.;

  start = time(NULL);

#ifdef _OPENMP
#pragma omp parallel private(i,j,k)
#endif
  {
    for (l=0; l<NTIMES; l++) {
#ifdef _OPENMP
#pragma omp single nowait
#endif
      if (!nthr_checked) {
#ifdef _OPENMP
        nthr = omp_get_num_threads();
#endif
        printf("\nWe are using %d thread(s)\n", nthr);
        nthr_checked = true;
      }

      // Initialize product matrix
#ifdef _OPENMP
#pragma omp for nowait
#endif
      for (i=0; i<M; i++)
        for (j=0; j<P; j++)
          c[i][j] = 0.0;

      // Parallelize by row. The threads don't need to synchronize at
      // loop end, so "nowait" can be used.
#ifdef _OPENMP
#pragma omp for nowait
#endif
      for (i=0; i<M; i++) {
        for (k=0; k<N; k++) {
          // Each element of the product is just the sum 1+2+...+n
          for (j=0; j<P; j++) {
            c[i][j] += a[i][k] * b[k][j];
          }
        }
      }
    } // l=0,...NTIMES-1
  } // #pragma omp parallel private(i,j,k)

  walltime = time(NULL) - start;
  printf("\nFinished calculations.\n");
  printf("Matmul kernel wall clock time = %.2f sec\n", walltime);
  printf("Wall clock time/thread = %.2f sec\n", walltime/nthr);
  printf("MFlops = %f\n",
         (double)(NTIMES)*(double)(N*M*2)*(double)(P)/walltime/1.0e6);

  return 0;
}
