- hardware and software specification
$ cat /proc/cpuinfo
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz
stepping : 6
cpu MHz : 2400.778
cache size : 4096 KB
physical id : 0
siblings : 2
cpu cores : 2
I'm using Debian GNU/Linux 4.0.
$ uname -a
Linux l411v 2.6.18-6-686 #1 SMP Sun Feb 10 22:11:31 UTC 2008 i686 GNU/Linux
I downloaded and installed the Intel C++ Compiler Professional Edition for Linux, which is free for non-commercial use.
$ icc --version
icc (ICC) 10.1 20080112
Copyright (C) 1985-2007 Intel Corporation. All rights reserved.
- setting environment
Add the Intel compiler's bin directory to PATH:
$ export PATH=$PATH:/usr/share/intel/cc/10.1.012/bin/
Then point LD_LIBRARY_PATH at the Intel C++ shared-library directory:
$ export LD_LIBRARY_PATH=/usr/share/intel/cc/10.1.012/lib
$ echo $LD_LIBRARY_PATH
/usr/share/intel/cc/10.1.012/lib
- fixing the pragma warnings
Compiling the sample without -openmp produces a warning for every OpenMP pragma:
$ icc -std=c99 openmp_sample.c
openmp_sample.c(106): warning #161: unrecognized #pragma
#pragma omp parallel private(i,j,k)
^
openmp_sample.c(109): warning #161: unrecognized #pragma
#pragma omp single nowait
^
openmp_sample.c(119): warning #161: unrecognized #pragma
#pragma omp for nowait
^
openmp_sample.c(126): warning #161: unrecognized #pragma
#pragma omp for nowait
^
Since icc defines the _OPENMP macro only when -openmp is given, all I need to do is wrap each OpenMP pragma in a preprocessor guard. For example, change:
#pragma omp parallel private(i,j,k)
to:
#ifdef _OPENMP
#pragma omp parallel private(i,j,k)
#endif
- compile and run
$ ulimit -s unlimited
Compile with OpenMP enabled and static linking:
$ icc -std=c99 -openmp -static openmp_sample.c
openmp_sample.c(119): (col. 5) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
openmp_sample.c(126): (col. 5) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
openmp_sample.c(106): (col. 3) remark: OpenMP DEFINED REGION WAS PARALLELIZED.
$ ls -l
total 1092
-rwxr-xr-x 1 lain lain 1105135 2008-04-02 11:48 a.out
-rw-r--r-- 1 lain lain 4702 2008-04-02 11:06 openmp_sample.c
$ ./a.out
Using time() for wall clock time
Problem size: c(600,2400) = a(600,1200) * b(1200,2400)
Calculating product 5 time(s)
We are using 2 thread(s)
Finished calculations.
Matmul kernel wall clock time = 6.00 sec
Wall clock time/thread = 3.00 sec
MFlops = 2880.000000
Compile with OpenMP enabled and dynamic linking:
$ icc -std=c99 -openmp openmp_sample.c
openmp_sample.c(124): (col. 5) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
openmp_sample.c(133): (col. 5) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
openmp_sample.c(107): (col. 3) remark: OpenMP DEFINED REGION WAS PARALLELIZED.
$ ls -l
total 44
-rwxr-xr-x 1 lain lain 34087 2008-04-02 12:01 a.out
-rw-r--r-- 1 lain lain 4702 2008-04-02 11:06 openmp_sample.c
$ ./a.out
Using time() for wall clock time
Problem size: c(600,2400) = a(600,1200) * b(1200,2400)
Calculating product 5 time(s)
We are using 2 thread(s)
Finished calculations.
Matmul kernel wall clock time = 6.00 sec
Wall clock time/thread = 3.00 sec
MFlops = 2880.000000
Compile without OpenMP, static linking:
$ icc -std=c99 -static openmp_sample.c
$ ls -l
total 512
-rwxr-xr-x 1 lain lain 509458 2008-04-02 11:49 a.out
-rw-r--r-- 1 lain lain 4702 2008-04-02 11:06 openmp_sample.c
$ ./a.out
Using time() for wall clock time
Problem size: c(600,2400) = a(600,1200) * b(1200,2400)
Calculating product 5 time(s)
We are using 1 thread(s)
Finished calculations.
Matmul kernel wall clock time = 17.00 sec
Wall clock time/thread = 17.00 sec
MFlops = 1016.470588
Compile without OpenMP, dynamic linking:
$ icc -std=c99 openmp_sample.c
$ ls -l
total 32
-rwxr-xr-x 1 lain lain 23570 2008-04-02 12:03 a.out
-rw-r--r-- 1 lain lain 4702 2008-04-02 11:06 openmp_sample.c
$ ./a.out
Using time() for wall clock time
Problem size: c(600,2400) = a(600,1200) * b(1200,2400)
Calculating product 5 time(s)
We are using 1 thread(s)
Finished calculations.
Matmul kernel wall clock time = 17.00 sec
Wall clock time/thread = 17.00 sec
MFlops = 1016.470588
- conclusion
+---------+-----------+---------+-----------+-----------+--------------+
| openmp  | number of | linking | file size | time to   | MFlops       |
| feature | threads   | mode    | (bytes)   | finish    |              |
|         |           |         |           | (seconds) |              |
+---------+-----------+---------+-----------+-----------+--------------+
| yes     | 2         | static  | 1,105,135 | 6         | 2,880.000000 |
| yes     | 2         | dynamic | 34,087    | 6         | 2,880.000000 |
| no      | 1         | static  | 509,458   | 17        | 1,016.470588 |
| no      | 1         | dynamic | 23,570    | 17        | 1,016.470588 |
+---------+-----------+---------+-----------+-----------+--------------+
- system monitor
(1) serial run: only cpu0 is pegged at 100%, and the calculation finishes slower (17 s)
(2) OpenMP run: both cpu0 and cpu1 are pegged at 100%, and the calculation finishes faster (6 s)
- sample code
/*
* Copyright (C) 2006-2007 Intel Corporation. All Rights Reserved.
*
* The source code contained or described herein and all
* documents related to the source code ("Material") are owned by
* Intel Corporation or its suppliers or licensors. Title to the
* Material remains with Intel Corporation or its suppliers and
* licensors. The Material is protected by worldwide copyright
* laws and treaty provisions. No part of the Material may be
* used, copied, reproduced, modified, published, uploaded,
* posted, transmitted, distributed, or disclosed in any way
* except as expressly provided in the license provided with the
* Materials. No license under any patent, copyright, trade
* secret or other intellectual property right is granted to or
* conferred upon you by disclosure or delivery of the Materials,
* either expressly, by implication, inducement, estoppel or
* otherwise, except as expressly provided in the license
* provided with the Materials.
*
* [DESCRIPTION]
* Each element of the product matrix c[i][j] is
* computed from a unique row and
* column of the factor matrices, a[i][k] and b[k][j].
*
* In the multithreaded implementation, each thread can
* concurrently compute some submatrix of the product without
* needing OpenMP data or control synchronization.
*
* The algorithm uses OpenMP* to parallelize the outer-most loop,
* using the "i" row index.
*
* Both the outer-most "i" loop and middle "k" loop are manually
* unrolled by 4. The inner-most "j" loop iterates one-by-one
* over the columns of the product and factor matrices.
*
* [COMPILE]
* Use the following compiler options to compile both multi- and
* single-threaded versions.
*
* Parallel compilation:
* You must set the stacksize to an appropriate size; otherwise,
* the application will generate a segmentation fault.
* Linux* and Mac OS* X: appropriate ulimit commands are shown for
* bash shell.
*
* Windows*: /Qstd=c99 /Qopenmp /F256000000
*
* Linux*: ulimit -s unlimited
* -std=c99 -openmp
*
* Mac OS* X: ulimit -s 64000
* -std=c99 -openmp
*
* Serial compilation:
*
* Use the same command, but omit the -openmp (Linux and Mac OS X)
* or /Qopenmp (Windows) option.
*
*/
#include <stdio.h>
#include <time.h>
#include <float.h>
#include <math.h>
#ifdef _OPENMP
#include <omp.h>
#endif
#define bool _Bool
#define true 1
#define false 0
// Matrix size constants
// Be careful to set your shell's stacksize limit to a high value if you
// wish to increase the SIZE.
#define SIZE 4800 // Must be a multiple of 8.
#define M SIZE/8
#define N SIZE/4
#define P SIZE/2
#define NTIMES 5 // product matrix calculations
int main(void)
{
double a[M][N], b[N][P], c[M][P], walltime;
bool nthr_checked=false;
time_t start;
int i, j, k, l, i1, i2, i3, k1, k2, k3, nthr=1;
printf("Using time() for wall clock time\n");
printf("Problem size: c(%d,%d) = a(%d,%d) * b(%d,%d)\n",
M, P, M, N, N, P);
printf("Calculating product %d time(s)\n", NTIMES);
// a is a matrix of all ones
for (i=0; i<M; i++)
for (j=0; j<N; j++)
a[i][j] = 1.0;
// each column of b is the sequence 1,2,...,N
for (i=0; i<N; i++)
for (j=0; j<P; j++)
b[i][j] = i+1.;
start = time(NULL);
#ifdef _OPENMP
#pragma omp parallel private(i,j,k)
#endif
{
for (l=0; l<NTIMES; l++) {
#ifdef _OPENMP
#pragma omp single nowait
#endif
if (!nthr_checked) {
#ifdef _OPENMP
nthr = omp_get_num_threads();
#endif
printf( "\nWe are using %d thread(s)\n", nthr);
nthr_checked = true;
}
// Initialize product matrix
#ifdef _OPENMP
#pragma omp for nowait
#endif
for (i=0; i<M; i++)
for (j=0; j<P; j++)
c[i][j] = 0.0;
// Parallelize by row. The threads don't need to synchronize at
// loop end, so "nowait" can be used.
#ifdef _OPENMP
#pragma omp for nowait
#endif
for (i=0; i<M; i++) {
for (k=0; k<N; k++) {
// Each element of the product is just the sum 1+2+...+n
for (j=0; j<P; j++) {
c[i][j] += a[i][k] * b[k][j];
}
}
}
} // l=0,...NTIMES-1
} // #pragma omp parallel private(i,j,k)
walltime = time(NULL) - start;
printf("\nFinished calculations.\n");
printf("Matmul kernel wall clock time = %.2f sec\n", walltime);
printf("Wall clock time/thread = %.2f sec\n", walltime/nthr);
printf("MFlops = %f\n",
(double)(NTIMES)*(double)(N*M*2)*(double)(P)/walltime/1.0e6);
return 0;
}