Wednesday, April 2, 2008

openmp c/c++ on intel and linux using intel compiler

i'm using an intel core 2 duo processor and debian gnu/linux 4.0. i'm now learning openmp, and i want to know how much faster the program can run on two cores compared to a single core.
  • hardware and software specification
the following processor information is taken from the /proc/cpuinfo file:
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz
stepping        : 6
cpu MHz         : 2400.778
cache size      : 4096 KB
physical id     : 0
siblings        : 2
cpu cores       : 2

the operating system is debian gnu/linux 4.0:
$ uname -a
Linux l411v 2.6.18-6-686 #1 SMP Sun Feb 10 22:11:31 UTC 2008 i686 GNU/Linux

i downloaded and installed the intel c++ compiler professional edition for linux under the non-commercial use license.
$ icc --version
icc (ICC) 10.1 20080112
Copyright (C) 1985-2007 Intel Corporation. All rights reserved.
  • setting environment
add the intel c++ compiler binary directory to the executable search path.
$ export PATH=$PATH:/usr/share/intel/cc/10.1.012/bin/

point the library path to the intel c++ compiler's dynamic library directory.
$ export LD_LIBRARY_PATH=/usr/share/intel/cc/10.1.012/lib
$ echo $LD_LIBRARY_PATH
/usr/share/intel/cc/10.1.012/lib
  • fixing bug
the openmp sample code (openmp_sample.c) comes from the samples directory installed with the intel c++ compiler; the full listing is in the sample code section at the end of this post. to activate the openmp feature, add the -openmp compiler option. when i compiled the sample code without the -openmp option, the compiler reported the openmp pragmas as unrecognized:
$ icc -std=c99 openmp_sample.c
openmp_sample.c(106): warning #161: unrecognized #pragma
#pragma omp parallel private(i,j,k)
^

openmp_sample.c(109): warning #161: unrecognized #pragma
#pragma omp single nowait
^

openmp_sample.c(119): warning #161: unrecognized #pragma
#pragma omp for nowait
^

openmp_sample.c(126): warning #161: unrecognized #pragma
#pragma omp for nowait
^

the fix is to wrap every openmp pragma in a preprocessor guard so the pragmas are only seen when openmp is enabled (the compiler defines the _OPENMP macro when -openmp is used). so change from:
  #pragma omp parallel private(i,j,k)
to:
#ifdef _OPENMP
#pragma omp parallel private(i,j,k)
#endif
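
to double-check the guard pattern on its own, here is a minimal standalone sketch of my own (guard_check.c is a hypothetical name, not part of the intel sample); it compiles cleanly both with and without -openmp and simply reports how many threads it runs:

/* guard_check.c - hypothetical minimal test, not part of the intel sample */
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>                 /* only available when compiling with -openmp */
#endif

int main(void)
{
  int nthr = 1;                  /* the serial build falls back to one thread */
#ifdef _OPENMP
#pragma omp parallel
  {
#pragma omp single
    nthr = omp_get_num_threads();
  }
#endif
  printf("running with %d thread(s)\n", nthr);
  return 0;
}

compile it the same two ways, e.g. icc -std=c99 -openmp guard_check.c and icc -std=c99 guard_check.c; neither build should produce warning #161.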
  • compile and run
the ulimit command controls the resources available to processes started by the shell. you need to set the stack size to an appropriate value; otherwise, the application will generate a segmentation fault, because the three matrices are declared as local arrays and together need roughly 40 MB of stack. the following command sets the maximum stack size to unlimited.
$ ulimit -s unlimited
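
an alternative, if you would rather not raise the stack limit, is to put the three matrices on the heap instead. this is my own sketch, not what the intel sample does:

/* heap alternative (my own sketch): allocate the matrices with malloc so the
   typical 8 MB default stack limit is not a problem; the loop bodies of
   openmp_sample.c would stay the same */
#include <stdio.h>
#include <stdlib.h>

#define SIZE 4800                /* same constants as openmp_sample.c */
#define M (SIZE/8)
#define N (SIZE/4)
#define P (SIZE/2)

int main(void)
{
  /* roughly 5.8 MB, 23 MB and 11.5 MB respectively */
  double (*a)[N] = malloc(sizeof(double[M][N]));
  double (*b)[P] = malloc(sizeof(double[N][P]));
  double (*c)[P] = malloc(sizeof(double[M][P]));
  if (!a || !b || !c) {
    fprintf(stderr, "allocation failed\n");
    return 1;
  }
  /* ... same initialization and matmul loops as in openmp_sample.c ... */
  free(a);
  free(b);
  free(c);
  return 0;
}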

compile with openmp feature and static linking mode:
$ icc -std=c99 -openmp -static openmp_sample.c
openmp_sample.c(119): (col. 5) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
openmp_sample.c(126): (col. 5) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
openmp_sample.c(106): (col. 3) remark: OpenMP DEFINED REGION WAS PARALLELIZED.

$ ls -l
total 1092
-rwxr-xr-x 1 lain lain 1105135 2008-04-02 11:48 a.out
-rw-r--r-- 1 lain lain 4702 2008-04-02 11:06 openmp_sample.c

$ ./a.out

Using time() for wall clock time
Problem size: c(600,2400) = a(600,1200) * b(1200,2400)
Calculating product 5 time(s)

We are using 2 thread(s)

Finished calculations.
Matmul kernel wall clock time = 6.00 sec
Wall clock time/thread = 3.00 sec
MFlops = 2880.000000

compile with openmp feature and dynamic linking mode:
$ icc -std=c99 -openmp openmp_sample.c
openmp_sample.c(124): (col. 5) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
openmp_sample.c(133): (col. 5) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
openmp_sample.c(107): (col. 3) remark: OpenMP DEFINED REGION WAS PARALLELIZED.

$ ls -l
total 44
-rwxr-xr-x 1 lain lain 34087 2008-04-02 12:01 a.out
-rw-r--r-- 1 lain lain 4702 2008-04-02 11:06 openmp_sample.c

$ ./a.out
Using time() for wall clock time
Problem size: c(600,2400) = a(600,1200) * b(1200,2400)
Calculating product 5 time(s)

We are using 2 thread(s)

Finished calculations.
Matmul kernel wall clock time = 6.00 sec
Wall clock time/thread = 3.00 sec
MFlops = 2880.000000

compile without openmp feature and static linking mode:
$ icc -std=c99 -static openmp_sample.c

$ ls -l
total 512
-rwxr-xr-x 1 lain lain 509458 2008-04-02 11:49 a.out
-rw-r--r-- 1 lain lain 4702 2008-04-02 11:06 openmp_sample.c

$ ./a.out
Using time() for wall clock time
Problem size: c(600,2400) = a(600,1200) * b(1200,2400)
Calculating product 5 time(s)

We are using 1 thread(s)

Finished calculations.
Matmul kernel wall clock time = 17.00 sec
Wall clock time/thread = 17.00 sec
MFlops = 1016.470588

compile without openmp feature and dynamic linking mode:
$ icc -std=c99 openmp_sample.c

$ ls -l
total 32
-rwxr-xr-x 1 lain lain 23570 2008-04-02 12:03 a.out
-rw-r--r-- 1 lain lain 4702 2008-04-02 11:06 openmp_sample.c

$ ./a.out
Using time() for wall clock time
Problem size: c(600,2400) = a(600,1200) * b(1200,2400)
Calculating product 5 time(s)


We are using 1 thread(s)

Finished calculations.
Matmul kernel wall clock time = 17.00 sec
Wall clock time/thread = 17.00 sec
MFlops = 1016.470588
  • conclusion
| openmp  | number of | linking | file size |  time to  |     mega     |
| feature | threads   | mode    | (bytes)   | finish    | flops        |
|         |           |         |           | (seconds) |              |
+---------+-----------+---------+-----------+-----------+--------------+
| yes     | 2         | static  | 1,105,135 | 6         | 2,880.000000 |
| yes     | 2         | dynamic | 34,087    | 6         | 2,880.000000 |
| no      | 1         | static  | 509,458   | 17        | 1,016.470588 |
| no      | 1         | dynamic | 23,570    | 17        | 1,016.470588 |
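
reading the table: with openmp on both cores the kernel finishes in 6 seconds instead of 17, roughly 2.8 times faster, while the linking mode only changes the executable size, not the speed. the mega flops column follows directly from the formula at the end of the sample code (NTIMES x 2 x M x N x P operations divided by the wall clock time); here is a small standalone check of the arithmetic:

/* mflops_check.c - quick arithmetic check of the mega flops column above */
#include <stdio.h>

int main(void)
{
  const double ntimes = 5, m = 600, n = 1200, p = 2400;
  const double flops = ntimes * 2.0 * m * n * p;           /* total floating-point operations */
  printf("MFlops at  6 sec: %f\n", flops /  6.0 / 1.0e6);  /* prints 2880.000000 */
  printf("MFlops at 17 sec: %f\n", flops / 17.0 / 1.0e6);  /* prints 1016.470588 */
  return 0;
}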
  • system monitor
the following kde system guard (performance monitor) capture shows the cpu load of cpu0 and cpu1. the two purple graphs show:
(1) single processor working: only cpu0 is busy at 100%, so the run finishes slower
(2) dual processors working: both cpu0 and cpu1 are busy at 100%, so the run finishes faster

  • sample code
openmp_sample.c file:
/*
* Copyright (C) 2006-2007 Intel Corporation. All Rights Reserved.
*
* The source code contained or described herein and all
* documents related to the source code ("Material") are owned by
* Intel Corporation or its suppliers or licensors. Title to the
* Material remains with Intel Corporation or its suppliers and
* licensors. The Material is protected by worldwide copyright
* laws and treaty provisions. No part of the Material may be
* used, copied, reproduced, modified, published, uploaded,
* posted, transmitted, distributed, or disclosed in any way
* except as expressly provided in the license provided with the
* Materials. No license under any patent, copyright, trade
* secret or other intellectual property right is granted to or
* conferred upon you by disclosure or delivery of the Materials,
* either expressly, by implication, inducement, estoppel or
* otherwise, except as expressly provided in the license
* provided with the Materials.
*
* [DESCRIPTION]
* Each element of the product matrix c[i][j] is
* computed from a unique row and
* column of the factor matrices, a[i][k] and b[k][j].
*
* In the multithreaded implementation, each thread can
* concurrently compute some submatrix of the product without
* needing OpenMP data or control synchronization.
*
* The algorithm uses OpenMP* to parallelize the outer-most loop,
* using the "i" row index.
*
* Both the outer-most "i" loop and middle "k" loop are manually
* unrolled by 4. The inner-most "j" loop iterates one-by-one
* over the columns of the product and factor matrices.
*
* [COMPILE]
* Use the following compiler options to compile both multi- and
* single-threaded versions.
*
* Parallel compilation:
* You must set the stacksize to an appropriate size; otherwise,
* the application will generate a segmentation fault.
* Linux* and Mac OS* X: appropriate ulimit commands are shown for
* bash shell.
*
* Windows*: /Qstd=c99 /Qopenmp /F256000000
*
* Linux*: ulimit -s unlimited
* -std=c99 -openmp
*
* Mac OS* X: ulimit -s 64000
* -std=c99 -openmp
*
* Serial compilation:
*
* Use the same command, but omit the -openmp (Linux and Mac OS X)
* or /Qopenmp (Windows) option.
*
*/

#include <stdio.h>
#include <time.h>
#include <float.h>
#include <math.h>
#ifdef _OPENMP
#include <omp.h>
#endif
#define bool _Bool
#define true 1
#define false 0

// Matrix size constants
// Be careful to set your shell's stacksize limit to a high value if you
// wish to increase the SIZE.
#define SIZE 4800 // Must be a multiple of 8.
#define M SIZE/8
#define N SIZE/4
#define P SIZE/2
#define NTIMES 5 // product matrix calculations

int main(void)
{
  double a[M][N], b[N][P], c[M][P], walltime;
  bool nthr_checked = false;
  time_t start;

  int i, j, k, l, i1, i2, i3, k1, k2, k3, nthr = 1;

  printf("Using time() for wall clock time\n");
  printf("Problem size: c(%d,%d) = a(%d,%d) * b(%d,%d)\n",
         M, P, M, N, N, P);
  printf("Calculating product %d time(s)\n", NTIMES);

  // a is a matrix of all ones
  for (i=0; i<M; i++)
    for (j=0; j<N; j++)
      a[i][j] = 1.0;

  // each column of b is the sequence 1,2,...,N
  for (i=0; i<N; i++)
    for (j=0; j<P; j++)
      b[i][j] = i+1.;

  start = time(NULL);

#ifdef _OPENMP
#pragma omp parallel private(i,j,k)
#endif
  {
    for (l=0; l<NTIMES; l++) {
#ifdef _OPENMP
#pragma omp single nowait
#endif
      if (!nthr_checked) {
#ifdef _OPENMP
        nthr = omp_get_num_threads();
#endif
        printf("\nWe are using %d thread(s)\n", nthr);
        nthr_checked = true;
      }

      // Initialize product matrix
#ifdef _OPENMP
#pragma omp for nowait
#endif
      for (i=0; i<M; i++)
        for (j=0; j<P; j++)
          c[i][j] = 0.0;

      // Parallelize by row. The threads don't need to synchronize at
      // loop end, so "nowait" can be used.
#ifdef _OPENMP
#pragma omp for nowait
#endif
      for (i=0; i<M; i++) {
        for (k=0; k<N; k++) {
          // Each element of the product is just the sum 1+2+...+n
          for (j=0; j<P; j++) {
            c[i][j] += a[i][k] * b[k][j];
          }
        }
      }
    } // l=0,...NTIMES-1
  } // #pragma omp parallel private(i,j,k)

  walltime = time(NULL) - start;
  printf("\nFinished calculations.\n");
  printf("Matmul kernel wall clock time = %.2f sec\n", walltime);
  printf("Wall clock time/thread = %.2f sec\n", walltime/nthr);
  printf("MFlops = %f\n",
         (double)(NTIMES)*(double)(N*M*2)*(double)(P)/walltime/1.0e6);

  return 0;
}
