Slow speed after changing from serial to parallel (with ex2f.F)
Satish Balay
balay at mcs.anl.gov
Tue Apr 15 22:45:25 CDT 2008
On Wed, 16 Apr 2008, Ben Tay wrote:
> I think you may be right. My school uses:
>
>   No of Nodes   Processors                   Qty per node   Total cores per node   Memory per node
>   4             Quad-Core Intel Xeon X5355   2              8                      16 GB
>   60            Dual-Core Intel Xeon 5160    2              4                      8 GB
I've run the same ex2f on a 2x quad-core Intel Xeon X5355 machine
[with gcc and the latest MPICH2, configured with --with-device=ch3:nemesis:newtcp] - and I get the following:
<< Logs for my run are attached >>
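[For reference, each case was run along these lines - the exact
launcher invocation is assumed here; the options are taken from the
OptionTable entries in the attached logs:

  mpiexec -n <np> ./ex2f -m 600 -n 600 -log_summary ex2f-600-<np>p.log

with np = 1, 2, 4, 8.]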
asterix:/home/balay/download-pine>grep MatMult *
ex2f-600-1p.log:MatMult 1192 1.0 9.7109e+00 1.0 3.86e+09 1.0 0.0e+00 0.0e+00 0.0e+00 14 11 0 0 0 14 11 0 0 0 397
ex2f-600-2p.log:MatMult 1217 1.0 6.2256e+00 1.0 1.97e+09 1.0 2.4e+03 4.8e+03 0.0e+00 14 11100100 0 14 11100100 0 632
ex2f-600-4p.log:MatMult 969 1.0 4.3311e+00 1.0 7.84e+08 1.0 5.8e+03 4.8e+03 0.0e+00 15 11100100 0 15 11100100 0 724
ex2f-600-8p.log:MatMult 1318 1.0 5.6966e+00 1.0 5.33e+08 1.0 1.8e+04 4.8e+03 0.0e+00 16 11100100 0 16 11100100 0 749
asterix:/home/balay/download-pine>grep KSPSolve *
ex2f-600-1p.log:KSPSolve 1 1.0 6.9165e+01 1.0 3.55e+10 1.0 0.0e+00 0.0e+00 2.3e+03100100 0 0100 100100 0 0100 513
ex2f-600-2p.log:KSPSolve 1 1.0 4.4005e+01 1.0 1.81e+10 1.0 2.4e+03 4.8e+03 2.4e+03100100100100100 100100100100100 824
ex2f-600-4p.log:KSPSolve 1 1.0 2.8139e+01 1.0 7.21e+09 1.0 5.8e+03 4.8e+03 1.9e+03100100100100 99 100100100100 99 1024
ex2f-600-8p.log:KSPSolve 1 1.0 3.6260e+01 1.0 4.90e+09 1.0 1.8e+04 4.8e+03 2.6e+03100100100100100 100100100100100 1081
asterix:/home/balay/download-pine>
You get the following [with Intel compilers?]:
asterix:/home/balay/download-pine/x>grep MatMult *
log.1:MatMult 1192 1.0 1.6115e+01 1.0 2.39e+08 1.0 0.0e+00 0.0e+00 0.0e+00 13 11 0 0 0 13 11 0 0 0 239
log.2:MatMult 1217 1.0 1.2502e+01 1.2 1.88e+08 1.2 2.4e+03 4.8e+03 0.0e+00 11 11100100 0 11 11100100 0 315
log.4:MatMult 969 1.0 9.7564e+00 3.6 2.87e+08 3.6 5.8e+03 4.8e+03 0.0e+00 8 11100100 0 8 11100100 0 321
asterix:/home/balay/download-pine/x>grep KSPSolve *
log.1:KSPSolve 1 1.0 1.2159e+02 1.0 2.92e+08 1.0 0.0e+00 0.0e+00 2.3e+03100100 0 0100 100100 0 0100 292
log.2:KSPSolve 1 1.0 1.0289e+02 1.0 1.76e+08 1.0 2.4e+03 4.8e+03 2.4e+03 99100100100100 99100100100100 352
log.4:KSPSolve 1 1.0 6.2496e+01 1.0 1.15e+08 1.0 5.8e+03 4.8e+03 1.9e+03 98100100100 99 98100100100 99 461
asterix:/home/balay/download-pine/x>
What exact CPU was this run on?
A couple of comments:
- My MatMult runs show a max/min time ratio of 1.0 for the 2, 4, and 8 proc runs, while yours show 1.2 and 3.6 for the 2 and 4 proc runs [so there is higher load imbalance on your machine]
- The peak rates are also lower - not sure why. 397 Mflop/s for 1p MatMult for me - vs 239 for you
- The MatMult speedups I see [Mflop/s at np procs, divided by the 1-proc Mflop/s] are:

    np    me     you
     2    1.59   1.32
     4    1.82   1.34
     8    1.88
--------------------------
The primary issue is the expectation of a speedup of 4 from 4 cores,
and of 8 from 8 cores.
As Matt indicated [perhaps in the "Subject: general question on speed
using quad core Xeons" thread], for sparse linear algebra the
performance is limited by memory bandwidth - not by the CPU. So one
has to look at the memory architecture of the hardware to know what
scalability to expect.
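[To make the bandwidth limit concrete, here is a minimal C sketch
along the lines of the STREAM "triad" kernel - not the official STREAM
benchmark; the array size and the 24-bytes-per-iteration accounting
are assumptions chosen for illustration. Sparse MatMult must stream
the whole matrix from memory on every call, so a sustained rate like
this - not the peak flop rate - is what bounds it:]

  /* triad.c - rough sustained-memory-bandwidth probe (a sketch).
     Compile e.g.: cc -O2 -std=c99 triad.c [-lrt on older glibc] */
  #define _POSIX_C_SOURCE 199309L
  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>

  int main(void)
  {
    const size_t n = 20 * 1000 * 1000;  /* 3 arrays x 160 MB - far larger than cache */
    double *a = malloc(n * sizeof *a);
    double *b = malloc(n * sizeof *b);
    double *c = malloc(n * sizeof *c);
    if (!a || !b || !c) return 1;
    for (size_t i = 0; i < n; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    /* triad: 2 flops per iteration, 24 bytes moved (read b, read c, write a) */
    for (size_t i = 0; i < n; i++) a[i] = b[i] + 3.0 * c[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
    printf("triad: %.1f GB/s  %.0f Mflop/s  [check: %g]\n",
           24.0 * n / sec / 1e9, 2.0 * n / sec / 1e6, a[n - 1]);
    free(a); free(b); free(c);
    return 0;
  }

Running one copy per core shows the aggregate rate flattening out once
a socket's bandwidth is saturated - which matches the MatMult numbers
above.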
The 2x quad-core machine has a memory architecture that gives about
11GB/s when one CPU socket is used, but 22GB/s when both CPU sockets
are used [irrespective of the number of cores used within each
socket]. One inference is that a max speedup of about 2 can be
obtained from such a machine [due to its 2-memory-bank architecture].
So if you have 2 such machines [i.e. 4 memory banks], then you can
expect a theoretical max speedup of about 4.
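[A back-of-envelope sketch of that inference in C - the 11GB/s per
bank is the figure quoted above; for a purely bandwidth-bound kernel
the speedup cap is just the ratio of aggregate memory bandwidth in
use, and core counts never enter:]

  #include <stdio.h>

  int main(void)
  {
    const double bw_bank = 11.0;      /* GB/s per memory bank [socket], from above */
    const int machines = 2;           /* two such boxes ...        */
    const int banks_per_machine = 2;  /* ... with 2 banks each     */
    double bw_one = bw_bank;                                        /* 1 proc: 1 bank */
    double bw_all = (double)machines * banks_per_machine * bw_bank; /* 44 GB/s        */
    printf("max bandwidth-bound speedup ~ %.0f\n", bw_all / bw_one); /* prints 4      */
    return 0;
  }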
We are generally used to evaluating performance per CPU [or core] -
and by that measure the scalability numbers here suck. However, if
you measure performance per memory bank, things look much better.
It's just that we are used to expecting scalability per node and
assuming it translates to scalability per core [though before
multicore CPUs took over, scalability per node was really scalability
per memory bank].
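[For example, from my runs above: 397 Mflop/s on 1 proc comes out of
one memory bank, while 749 Mflop/s on 8 procs uses both banks - i.e.
~375 Mflop/s per bank, or roughly 94% per-bank efficiency.]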
There is also another measure - performance per dollar spent. The
extra cores come practically free, so by this measure things also
hold up OK.
Satish
-------------- next part --------------
************************************************************************************************************************
*** WIDEN YOUR WINDOW TO 120 CHARACTERS. Use 'enscript -r -fCourier9' to print this document ***
************************************************************************************************************************
---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
./ex2f on a linux-tes named intel-loaner1 with 1 processor, by balay Tue Apr 15 22:02:38 2008
Using Petsc Development Version 2.3.3, Patch 12, unknown HG revision: unknown
Max Max/Min Avg Total
Time (sec): 6.936e+01 1.00000 6.936e+01
Objects: 4.400e+01 1.00000 4.400e+01
Flops: 3.547e+10 1.00000 3.547e+10 3.547e+10
Flops/sec: 5.113e+08 1.00000 5.113e+08 5.113e+08
MPI Messages: 0.000e+00 0.00000 0.000e+00 0.000e+00
MPI Message Lengths: 0.000e+00 0.00000 0.000e+00 0.000e+00
MPI Reductions: 2.349e+03 1.00000
Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
e.g., VecAXPY() for real vectors of length N --> 2N flops
and VecAXPY() for complex vectors of length N --> 8N flops
Summary of Stages: ----- Time ------ ----- Flops ----- --- Messages --- -- Message Lengths -- -- Reductions --
Avg %Total Avg %Total counts %Total Avg %Total counts %Total
0: Main Stage: 6.9359e+01 100.0% 3.5466e+10 100.0% 0.000e+00 0.0% 0.000e+00 0.0% 2.349e+03 100.0%
------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
Count: number of times phase was executed
Time and Flops: Max - maximum over all processors
Ratio - ratio of maximum to minimum over all processors
Mess: number of messages sent
Avg. len: average message length
Reduct: number of global reductions
Global: entire computation
Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
%T - percent time in this phase %F - percent flops in this phase
%M - percent messages in this phase %L - percent message lengths in this phase
%R - percent reductions in this phase
Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
------------------------------------------------------------------------------------------------------------------------
Event Count Time (sec) Flops --- Global --- --- Stage --- Total
Max Ratio Max Ratio Max Ratio Mess Avg len Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------
--- Event Stage 0: Main Stage
MatMult 1192 1.0 9.7109e+00 1.0 3.86e+09 1.0 0.0e+00 0.0e+00 0.0e+00 14 11 0 0 0 14 11 0 0 0 397
MatSolve 1192 1.0 1.8658e+01 1.0 3.86e+09 1.0 0.0e+00 0.0e+00 0.0e+00 27 11 0 0 0 27 11 0 0 0 207
MatLUFactorNum 1 1.0 4.1455e-02 1.0 3.23e+06 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 78
MatILUFactorSym 1 1.0 2.9251e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatAssemblyBegin 1 1.0 9.5367e-07 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatAssemblyEnd 1 1.0 3.1618e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatGetRowIJ 1 1.0 1.9073e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatGetOrdering 1 1.0 5.1751e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecMDot 1153 1.0 1.6326e+01 1.0 1.28e+10 1.0 0.0e+00 0.0e+00 1.2e+03 24 36 0 0 49 24 36 0 0 49 783
VecNorm 1193 1.0 5.0365e+00 1.0 8.59e+08 1.0 0.0e+00 0.0e+00 1.2e+03 7 2 0 0 51 7 2 0 0 51 171
VecScale 1192 1.0 5.4950e-01 1.0 4.29e+08 1.0 0.0e+00 0.0e+00 0.0e+00 1 1 0 0 0 1 1 0 0 0 781
VecCopy 39 1.0 6.6555e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecSet 41 1.0 3.4185e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecAXPY 78 1.0 1.2492e-01 1.0 5.62e+07 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 450
VecMAXPY 1192 1.0 1.8493e+01 1.0 1.36e+10 1.0 0.0e+00 0.0e+00 0.0e+00 27 38 0 0 0 27 38 0 0 0 736
VecNormalize 1192 1.0 5.5843e+00 1.0 1.29e+09 1.0 0.0e+00 0.0e+00 1.2e+03 8 4 0 0 51 8 4 0 0 51 231
KSPGMRESOrthog 1153 1.0 3.3669e+01 1.0 2.56e+10 1.0 0.0e+00 0.0e+00 1.2e+03 49 72 0 0 49 49 72 0 0 49 760
KSPSetup 1 1.0 1.1875e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
KSPSolve 1 1.0 6.9165e+01 1.0 3.55e+10 1.0 0.0e+00 0.0e+00 2.3e+03100100 0 0100 100100 0 0100 513
PCSetUp 1 1.0 7.5919e-02 1.0 3.23e+06 1.0 0.0e+00 0.0e+00 3.0e+00 0 0 0 0 0 0 0 0 0 0 43
PCApply 1192 1.0 1.8661e+01 1.0 3.86e+09 1.0 0.0e+00 0.0e+00 0.0e+00 27 11 0 0 0 27 11 0 0 0 207
------------------------------------------------------------------------------------------------------------------------
Memory usage is given in bytes:
Object Type Creations Destructions Memory Descendants' Mem.
--- Event Stage 0: Main Stage
Matrix 2 2 54695580 0
Vec 37 37 106606176 0
Krylov Solver 1 1 18016 0
Preconditioner 1 1 720 0
Index Set 3 3 4321464 0
========================================================================================================================
Average time to get PetscTime(): 9.53674e-08
OptionTable: -log_summary ex2f-600-1p.log
OptionTable: -m 600
OptionTable: -n 600
Compiled without FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8
Configure run at: Tue Apr 15 21:39:17 2008
Configure options: --with-mpi-dir=/home/balay/mpich2-svn --with-debugging=0 --download-f-blas-lapack=1 PETSC_ARCH=linux-test --with-shared=0
-----------------------------------------
Libraries compiled on Tue Apr 15 21:45:29 CDT 2008 on intel-loaner1
Machine characteristics: Linux intel-loaner1 2.6.20-16-generic #2 SMP Tue Feb 12 02:11:24 UTC 2008 x86_64 GNU/Linux
Using PETSc directory: /home/balay/petsc-dev
Using PETSc arch: linux-test
-----------------------------------------
Using C compiler: /home/balay/mpich2-svn/bin/mpicc -fPIC -O
Using Fortran compiler: /home/balay/mpich2-svn/bin/mpif90 -I. -fPIC -O
-----------------------------------------
Using include paths: -I/home/balay/petsc-dev -I/home/balay/petsc-dev/linux-test/include -I/home/balay/petsc-dev/include -I/home/balay/mpich2-svn/include -I. -I/home/balay/mpich2-svn/src/include -I/home/balay/mpich2-svn/src/binding/f90
------------------------------------------
Using C linker: /home/balay/mpich2-svn/bin/mpicc -fPIC -O
Using Fortran linker: /home/balay/mpich2-svn/bin/mpif90 -I. -fPIC -O
Using libraries: -Wl,-rpath,/home/balay/petsc-dev/linux-test/lib -L/home/balay/petsc-dev/linux-test/lib -lpetscts -lpetscsnes -lpetscksp -lpetscdm -lpetscmat -lpetscvec -lpetsc -Wl,-rpath,/home/balay/petsc-dev/linux-test/lib -L/home/balay/petsc-dev/linux-test/lib -lflapack -Wl,-rpath,/home/balay/petsc-dev/linux-test/lib -L/home/balay/petsc-dev/linux-test/lib -lfblas -lm -Wl,-rpath,/usr/lib/gcc/x86_64-linux-gnu/4.1.2 -L/usr/lib/gcc/x86_64-linux-gnu/4.1.2 -Wl,-rpath,/usr/lib/gcc/x86_64-linux-gnu/4.1.2/../../../../lib64 -L/usr/lib/gcc/x86_64-linux-gnu/4.1.2/../../../../lib64 -Wl,-rpath,/lib/../lib64 -L/lib/../lib64 -Wl,-rpath,/usr/lib/../lib64 -L/usr/lib/../lib64 -ldl -lgcc_s -lgfortranbegin -lgfortran -lm -Wl,-rpath,/usr/lib/gcc/x86_64-linux-gnu/4.1.2 -Wl,-rpath,/usr/lib/gcc/x86_64-linux-gnu/4.1.2/../../../../lib64 -Wl,-rpath,/lib/../lib64 -Wl,-rpath,/usr/lib/../lib64 -lm -Wl,-rpath,/usr/lib/gcc/x86_64-linux-gnu/4.1.2 -L/usr/lib/gcc/x86_64-linux-gnu/4.1.2 -Wl,-rpath,/usr/lib/gcc/x86_64-linux-gnu/4.1.2/../../../../lib64 -L/usr/lib/gcc/x86_64-linux-gnu/4.1.2/../../../../lib64 -Wl,-rpath,/lib/../lib64 -L/lib/../lib64 -Wl,-rpath,/usr/lib/../lib64 -L/usr/lib/../lib64 -ldl -lgcc_s -ldl
------------------------------------------
-------------- next part --------------
Non-text attachments were scrubbed by the list archive:
  ex2f-600-2p.log (application/octet-stream, 9562 bytes):
    <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20080415/80bfb8eb/attachment.obj>
  ex2f-600-4p.log (application/octet-stream, 9563 bytes):
    <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20080415/80bfb8eb/attachment-0001.obj>
  ex2f-600-8p.log (application/octet-stream, 9562 bytes):
    <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20080415/80bfb8eb/attachment-0002.obj>