Slow speed after changing from serial to parallel (with ex2f.F)

Satish Balay balay at mcs.anl.gov
Tue Apr 15 22:45:25 CDT 2008


On Wed, 16 Apr 2008, Ben Tay wrote:

> I think you may be right. My school uses:

>   No of Nodes   Processor                    Qty per node   Total cores per node   Memory per node
>   4             Quad-Core Intel Xeon X5355   2              8                      16 GB
>   60            Dual-Core Intel Xeon 5160    2              4                      8 GB


I've run the same ex2f on a 2x quad-core Intel Xeon X5355
machine [with gcc and the latest mpich2, built with --with-device=ch3:nemesis:newtcp], and I get the following:

<< Logs for my run are attached >>

asterix:/home/balay/download-pine>grep MatMult *
ex2f-600-1p.log:MatMult             1192 1.0 9.7109e+00 1.0 3.86e+09 1.0 0.0e+00 0.0e+00 0.0e+00 14 11  0  0  0  14 11  0  0  0   397
ex2f-600-2p.log:MatMult             1217 1.0 6.2256e+00 1.0 1.97e+09 1.0 2.4e+03 4.8e+03 0.0e+00 14 11100100  0  14 11100100  0   632
ex2f-600-4p.log:MatMult              969 1.0 4.3311e+00 1.0 7.84e+08 1.0 5.8e+03 4.8e+03 0.0e+00 15 11100100  0  15 11100100  0   724
ex2f-600-8p.log:MatMult             1318 1.0 5.6966e+00 1.0 5.33e+08 1.0 1.8e+04 4.8e+03 0.0e+00 16 11100100  0  16 11100100  0   749
asterix:/home/balay/download-pine>grep KSPSolve *
ex2f-600-1p.log:KSPSolve               1 1.0 6.9165e+01 1.0 3.55e+10 1.0 0.0e+00 0.0e+00 2.3e+03100100  0  0100 100100  0  0100   513
ex2f-600-2p.log:KSPSolve               1 1.0 4.4005e+01 1.0 1.81e+10 1.0 2.4e+03 4.8e+03 2.4e+03100100100100100 100100100100100   824
ex2f-600-4p.log:KSPSolve               1 1.0 2.8139e+01 1.0 7.21e+09 1.0 5.8e+03 4.8e+03 1.9e+03100100100100 99 100100100100 99  1024
ex2f-600-8p.log:KSPSolve               1 1.0 3.6260e+01 1.0 4.90e+09 1.0 1.8e+04 4.8e+03 2.6e+03100100100100100 100100100100100  1081
asterix:/home/balay/download-pine>


You get the following [with intel compilers?]:

asterix:/home/balay/download-pine/x>grep MatMult *
log.1:MatMult             1192 1.0 1.6115e+01 1.0 2.39e+08 1.0 0.0e+00 0.0e+00 0.0e+00 13 11  0  0  0  13 11  0  0  0   239
log.2:MatMult             1217 1.0 1.2502e+01 1.2 1.88e+08 1.2 2.4e+03 4.8e+03 0.0e+00 11 11100100  0  11 11100100  0   315
log.4:MatMult              969 1.0 9.7564e+00 3.6 2.87e+08 3.6 5.8e+03 4.8e+03 0.0e+00  8 11100100  0   8 11100100  0   321
asterix:/home/balay/download-pine/x>grep KSPSolve *
log.1:KSPSolve               1 1.0 1.2159e+02 1.0 2.92e+08 1.0 0.0e+00 0.0e+00 2.3e+03100100  0  0100 100100  0  0100   292
log.2:KSPSolve               1 1.0 1.0289e+02 1.0 1.76e+08 1.0 2.4e+03 4.8e+03 2.4e+03 99100100100100  99100100100100   352
log.4:KSPSolve               1 1.0 6.2496e+01 1.0 1.15e+08 1.0 5.8e+03 4.8e+03 1.9e+03 98100100100 99  98100100100 99   461
asterix:/home/balay/download-pine/x>

What exact CPU was this run on?

A couple of comments:
- My MatMult runs have a max/min time ratio of 1.0 for the 2, 4, and 8-process runs, while yours
  show 1.2 and 3.6 for the 2- and 4-process runs [so there is much higher load imbalance on your machine]
- Your peak rates are also lower - not sure why: 397 Mflop/s for the 1-process MatMult on my machine vs 239 Mflop/s on yours
- The MatMult speedups I see [relative to 1 process, from the Mflop/s column - see the sketch after the table] are:

np    me     you

2     1.59   1.32
4     1.82   1.34
8     1.88    -
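
[A minimal sketch - not part of the original runs - showing how these speedups can be recomputed (to within rounding) from the Mflop/s column, i.e. the last field of the MatMult lines in the grep output above. The rates are hard-coded here just to show the arithmetic.]

  # Recompute the MatMult speedup table from the -log_summary Mflop/s column.
  # Rates copied from the grep output earlier in this mail.
  mflops_me  = {1: 397, 2: 632, 4: 724, 8: 749}   # ex2f-600-*p.log
  mflops_you = {1: 239, 2: 315, 4: 321}           # log.1, log.2, log.4

  for np in (2, 4, 8):
      me  = mflops_me[np] / float(mflops_me[1])
      you = mflops_you.get(np)
      you_txt = "%.2f" % (you / float(mflops_you[1])) if you else "-"
      print("%d   %.2f   %s" % (np, me, you_txt))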

--------------------------

The primary issue is the expectation of a 4x speedup from 4 cores and an 8x speedup from 8 cores.

As Matt indicated [perhaps in the "general question on speed using quad core Xeons" thread],
for sparse linear algebra the performance is limited by memory bandwidth, not by the CPU.

So you have to look at the memory architecture of the machine
if you expect scalability.

The 2x quad-core machine has a memory architecture that gives 11GB/s when
one CPU socket is used, but 22GB/s when both sockets are used [irrespective
of the number of cores used in each socket]. One inference is that a maximum
speedup of about 2 can be obtained from such a machine [due to its
2-memory-bank architecture].

So if you have 2 such machines [i.e. 4 memory banks], then you can
expect a theoretical max speedup of 4.
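
[A back-of-the-envelope sketch of that bandwidth bound - illustrative assumptions only, not measurements from these runs. For an AIJ (CSR) MatMult, each nonzero costs 2 flops (a multiply and an add) but requires streaming at least one 8-byte value and one 4-byte column index from memory, so the matrix data alone imposes about 6 bytes of traffic per flop. The achievable flop rate is therefore bounded by the available bandwidth divided by ~6, independent of how many cores share that bandwidth.]

  # Rough memory-bandwidth bound on AIJ MatMult [illustrative assumptions only].
  bw_per_socket = 11e9        # bytes/s per CPU socket [figure quoted above]
  sockets_used  = 2           # both memory banks in use
  bytes_per_nnz = 8 + 4       # one double value + one int column index per nonzero
  flops_per_nnz = 2           # one multiply + one add per nonzero

  bytes_per_flop = bytes_per_nnz / float(flops_per_nnz)     # 6 bytes/flop, matrix data only
  bound_mflops   = bw_per_socket * sockets_used / bytes_per_flop / 1e6
  print("MatMult bandwidth bound: ~%.0f Mflop/s" % bound_mflops)   # ~3667 with these numbers

The measured MatMult rates are well below this bound because the bound ignores vector and
row-pointer traffic and assumes the peak bandwidth is sustainable; the point is only that
the bound scales with the number of memory banks in use, not with the number of cores.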

We are generally used to evaluating performance per CPU [or per core], and
by that measure the scalability numbers here look poor.

However, if you measure performance per memory bank, things look much better.

It's just that we are used to expecting scalability per node and assuming it
translates into scalability per core. [However, before multicore CPUs took
over, scalability per node was really scalability per memory bank.]
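
For example, using the MatMult rates above: the 8-process run sustains 749 Mflop/s across
2 memory banks, i.e. about 375 Mflop/s per bank, which is close to the 397 Mflop/s the
1-process run gets out of a single bank - so per memory bank, the parallel run is nearly
as efficient as the serial one.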


There is also another measure - performance per dollar spent. The extra
cores are generally close to free, so by this measure things also hold up
ok.

Satish
-------------- next part --------------
************************************************************************************************************************
***             WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r -fCourier9' to print this document            ***
************************************************************************************************************************

---------------------------------------------- PETSc Performance Summary: ----------------------------------------------

./ex2f on a linux-tes named intel-loaner1 with 1 processor, by balay Tue Apr 15 22:02:38 2008
Using Petsc Development Version 2.3.3, Patch 12, unknown HG revision: unknown

                         Max       Max/Min        Avg      Total 
Time (sec):           6.936e+01      1.00000   6.936e+01
Objects:              4.400e+01      1.00000   4.400e+01
Flops:                3.547e+10      1.00000   3.547e+10  3.547e+10
Flops/sec:            5.113e+08      1.00000   5.113e+08  5.113e+08
MPI Messages:         0.000e+00      0.00000   0.000e+00  0.000e+00
MPI Message Lengths:  0.000e+00      0.00000   0.000e+00  0.000e+00
MPI Reductions:       2.349e+03      1.00000

Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
                            e.g., VecAXPY() for real vectors of length N --> 2N flops
                            and VecAXPY() for complex vectors of length N --> 8N flops

Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total 
 0:      Main Stage: 6.9359e+01 100.0%  3.5466e+10 100.0%  0.000e+00   0.0%  0.000e+00        0.0%  2.349e+03 100.0% 

------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
   Count: number of times phase was executed
   Time and Flops: Max - maximum over all processors
                   Ratio - ratio of maximum to minimum over all processors
   Mess: number of messages sent
   Avg. len: average message length
   Reduct: number of global reductions
   Global: entire computation
   Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
      %T - percent time in this phase         %F - percent flops in this phase
      %M - percent messages in this phase     %L - percent message lengths in this phase
      %R - percent reductions in this phase
   Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
------------------------------------------------------------------------------------------------------------------------
Event                Count      Time (sec)     Flops                             --- Global ---  --- Stage ---   Total
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------

--- Event Stage 0: Main Stage

MatMult             1192 1.0 9.7109e+00 1.0 3.86e+09 1.0 0.0e+00 0.0e+00 0.0e+00 14 11  0  0  0  14 11  0  0  0   397
MatSolve            1192 1.0 1.8658e+01 1.0 3.86e+09 1.0 0.0e+00 0.0e+00 0.0e+00 27 11  0  0  0  27 11  0  0  0   207
MatLUFactorNum         1 1.0 4.1455e-02 1.0 3.23e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0    78
MatILUFactorSym        1 1.0 2.9251e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatAssemblyBegin       1 1.0 9.5367e-07 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatAssemblyEnd         1 1.0 3.1618e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatGetRowIJ            1 1.0 1.9073e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatGetOrdering         1 1.0 5.1751e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecMDot             1153 1.0 1.6326e+01 1.0 1.28e+10 1.0 0.0e+00 0.0e+00 1.2e+03 24 36  0  0 49  24 36  0  0 49   783
VecNorm             1193 1.0 5.0365e+00 1.0 8.59e+08 1.0 0.0e+00 0.0e+00 1.2e+03  7  2  0  0 51   7  2  0  0 51   171
VecScale            1192 1.0 5.4950e-01 1.0 4.29e+08 1.0 0.0e+00 0.0e+00 0.0e+00  1  1  0  0  0   1  1  0  0  0   781
VecCopy               39 1.0 6.6555e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecSet                41 1.0 3.4185e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecAXPY               78 1.0 1.2492e-01 1.0 5.62e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   450
VecMAXPY            1192 1.0 1.8493e+01 1.0 1.36e+10 1.0 0.0e+00 0.0e+00 0.0e+00 27 38  0  0  0  27 38  0  0  0   736
VecNormalize        1192 1.0 5.5843e+00 1.0 1.29e+09 1.0 0.0e+00 0.0e+00 1.2e+03  8  4  0  0 51   8  4  0  0 51   231
KSPGMRESOrthog      1153 1.0 3.3669e+01 1.0 2.56e+10 1.0 0.0e+00 0.0e+00 1.2e+03 49 72  0  0 49  49 72  0  0 49   760
KSPSetup               1 1.0 1.1875e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
KSPSolve               1 1.0 6.9165e+01 1.0 3.55e+10 1.0 0.0e+00 0.0e+00 2.3e+03100100  0  0100 100100  0  0100   513
PCSetUp                1 1.0 7.5919e-02 1.0 3.23e+06 1.0 0.0e+00 0.0e+00 3.0e+00  0  0  0  0  0   0  0  0  0  0    43
PCApply             1192 1.0 1.8661e+01 1.0 3.86e+09 1.0 0.0e+00 0.0e+00 0.0e+00 27 11  0  0  0  27 11  0  0  0   207
------------------------------------------------------------------------------------------------------------------------

Memory usage is given in bytes:

Object Type          Creations   Destructions   Memory  Descendants' Mem.

--- Event Stage 0: Main Stage

              Matrix     2              2   54695580     0
                 Vec    37             37  106606176     0
       Krylov Solver     1              1      18016     0
      Preconditioner     1              1        720     0
           Index Set     3              3    4321464     0
========================================================================================================================
Average time to get PetscTime(): 9.53674e-08
OptionTable: -log_summary ex2f-600-1p.log
OptionTable: -m 600
OptionTable: -n 600
Compiled without FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8
Configure run at: Tue Apr 15 21:39:17 2008
Configure options: --with-mpi-dir=/home/balay/mpich2-svn --with-debugging=0 --download-f-blas-lapack=1 PETSC_ARCH=linux-test --with-shared=0
-----------------------------------------
Libraries compiled on Tue Apr 15 21:45:29 CDT 2008 on intel-loaner1 
Machine characteristics: Linux intel-loaner1 2.6.20-16-generic #2 SMP Tue Feb 12 02:11:24 UTC 2008 x86_64 GNU/Linux 
Using PETSc directory: /home/balay/petsc-dev
Using PETSc arch: linux-test
-----------------------------------------
Using C compiler: /home/balay/mpich2-svn/bin/mpicc -fPIC -O   
Using Fortran compiler: /home/balay/mpich2-svn/bin/mpif90 -I. -fPIC -O    
-----------------------------------------
Using include paths: -I/home/balay/petsc-dev -I/home/balay/petsc-dev/linux-test/include -I/home/balay/petsc-dev/include -I/home/balay/mpich2-svn/include -I. -I/home/balay/mpich2-svn/src/include -I/home/balay/mpich2-svn/src/binding/f90      
------------------------------------------
Using C linker: /home/balay/mpich2-svn/bin/mpicc -fPIC -O 
Using Fortran linker: /home/balay/mpich2-svn/bin/mpif90 -I. -fPIC -O  
Using libraries: -Wl,-rpath,/home/balay/petsc-dev/linux-test/lib -L/home/balay/petsc-dev/linux-test/lib -lpetscts -lpetscsnes -lpetscksp -lpetscdm -lpetscmat -lpetscvec -lpetsc           -Wl,-rpath,/home/balay/petsc-dev/linux-test/lib -L/home/balay/petsc-dev/linux-test/lib -lflapack -Wl,-rpath,/home/balay/petsc-dev/linux-test/lib -L/home/balay/petsc-dev/linux-test/lib -lfblas -lm -Wl,-rpath,/usr/lib/gcc/x86_64-linux-gnu/4.1.2 -L/usr/lib/gcc/x86_64-linux-gnu/4.1.2 -Wl,-rpath,/usr/lib/gcc/x86_64-linux-gnu/4.1.2/../../../../lib64 -L/usr/lib/gcc/x86_64-linux-gnu/4.1.2/../../../../lib64 -Wl,-rpath,/lib/../lib64 -L/lib/../lib64 -Wl,-rpath,/usr/lib/../lib64 -L/usr/lib/../lib64 -ldl -lgcc_s -lgfortranbegin -lgfortran -lm -Wl,-rpath,/usr/lib/gcc/x86_64-linux-gnu/4.1.2 -Wl,-rpath,/usr/lib/gcc/x86_64-linux-gnu/4.1.2/../../../../lib64 -Wl,-rpath,/lib/../lib64 -Wl,-rpath,/usr/lib/../lib64 -lm  -Wl,-rpath,/usr/lib/gcc/x86_64-linux-gnu/4.1.2 -L/usr/lib/gcc/x86_64-linux-gnu/4.1.2 -Wl,-rpath,/usr/lib/gcc/x86_64-linux-gnu/4.1.2/../../../../lib64 -L/usr/lib/gcc/x86_64-linux-gnu/4.1.2/../../../../lib64 -Wl,-rpath,/lib/../lib64 -L/lib/../lib64 -Wl,-rpath,/usr/lib/../lib64 -L/usr/lib/../lib64 -ldl -lgcc_s -ldl  
------------------------------------------
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ex2f-600-2p.log
Type: application/octet-stream
Size: 9562 bytes
Desc: 
URL: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20080415/80bfb8eb/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ex2f-600-4p.log
Type: application/octet-stream
Size: 9563 bytes
Desc: 
URL: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20080415/80bfb8eb/attachment-0001.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ex2f-600-8p.log
Type: application/octet-stream
Size: 9562 bytes
Desc: 
URL: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20080415/80bfb8eb/attachment-0002.obj>

