Hi,

In other words, my CFD code cannot be parallelized effectively because the problem is too small?

Is this true for all parallel solvers, or just for PETSc? I was hoping to reduce the runtime, since mine is an unsteady problem that requires many time steps to reach a periodic state, and it currently takes many hours to get there.

Lastly, if I run on 2 processors, is an improvement likely?

Thank you.

<div><span class="gmail_quote">On 2/11/07, <b class="gmail_sendername">Barry Smith</b> <<a href="mailto:bsmith@mcs.anl.gov">bsmith@mcs.anl.gov</a>> wrote:</span>
<blockquote class="gmail_quote" style="PADDING-LEFT: 1ex; MARGIN: 0px 0px 0px 0.8ex; BORDER-LEFT: #ccc 1px solid"><br><br>On Sat, 10 Feb 2007, Ben Tay wrote:<br><br>> Hi,<br>><br>> I've repeated the test with n,m = 800. Now serial takes around 11mins while
<br>> parallel with 4 processors took 6mins. Does it mean that the problem must be<br>> pretty large before it is more superior to use parallel? Moreover 800x800<br>> means there's 640000 unknowns. My problem is a 2D CFD code which typically
<br>> has 200x80=16000 unknowns. Does it mean that I won't be able to benefit from<br> ^^^^^^^^^^^<br>You'll never get much performance past 2 processors; its not even worth<br>all the work of having a parallel code in this case. I'd just optimize the
<br>heck out of the serial code.<br><br> Barry<br><br><br><br>> running in parallel?<br>><br>> Btw, this is the parallel's log_summary:<br>><br>><br>> Event Count Time (sec)<br>> Flops/sec --- Global --- --- Stage --- Total
<br>> Max Ratio Max Ratio Max Ratio Mess Avg len<br>> Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s<br>> ------------------------------------------------------------------------------------------------------------------------
<br>><br>> --- Event Stage 0: Main Stage<br>><br>> MatMult 1265 1.0 7.0615e+01 1.2 3.22e+07 1.2 7.6e+03 6.4e+03<br>> 0.0e+00 16 11100100 0 16 11100100 0 103<br>> MatSolve 1265
1.0 4.7820e+01 1.2 4.60e+07 1.2 0.0e+00 0.0e+00<br>> 0.0e+00 11 11 0 0 0 11 11 0 0 0 152<br>> MatLUFactorNum 1 1.0 2.5703e-01 2.3 1.27e+07 2.3 0.0e+00 0.0e+00<br>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 22
<br>> MatILUFactorSym 1 1.0 1.8933e-01 4.1 0.00e+00 0.0 0.0e+00 0.0e+00<br>> 2.0e+00 0 0 0 0 0 0 0 0 0 0 0<br>> MatAssemblyBegin 1 1.0 4.2153e-01 3.5 0.00e+00 0.0 0.0e+00 0.0e+00<br>>
2.0e+00 0 0 0 0 0 0 0 0 0 0 0<br>> MatAssemblyEnd 1 1.0 3.6475e-01 1.5 0.00e+00 0.0 6.0e+00 3.2e+03<br>> 1.3e+01 0 0 0 0 0 0 0 0 0 0 0<br>> MatGetOrdering 1 1.0 1.2088e-02
1.0 0.00e+00 0.0 0.0e+00 0.0e+00<br>> 2.0e+00 0 0 0 0 0 0 0 0 0 0 0<br>> VecMDot 1224 1.0 1.5314e+02 1.2 4.63e+07 1.2 0.0e+00 0.0e+00<br>> 1.2e+03 36 36 0 0 31 36 36 0 0 31 158<br>
> VecNorm 1266 1.0 1.0215e+02 1.1 4.31e+06 1.1 0.0e+00 0.0e+00<br>> 1.3e+03 24 2 0 0 33 24 2 0 0 33 16<br>> VecScale 1265 1.0 3.7467e+00 1.5 8.34e+07 1.5 0.0e+00 0.0e+00<br>> 0.0e+00
1 1 0 0 0 1 1 0 0 0 216<br>> VecCopy 41 1.0 2.5530e-01 2.8 0.00e+00 0.0 0.0e+00 0.0e+00<br>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0<br>> VecSet 1308 1.0 3.2717e+00 1.4
0.00e+00 0.0 0.0e+00 0.0e+00<br>> 0.0e+00 1 0 0 0 0 1 0 0 0 0 0<br>> VecAXPY 82 1.0 5.3338e-01 2.8 1.40e+08 2.8 0.0e+00 0.0e+00<br>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 197<br>> VecMAXPY 1265
1.0 4.6234e+01 1.2 1.74e+08 1.2 0.0e+00 0.0e+00<br>> 0.0e+00 10 38 0 0 0 10 38 0 0 0 557<br>> VecScatterBegin 1265 1.0 1.5684e-01 1.6 0.00e+00 0.0 7.6e+03 6.4e+03<br>> 0.0e+00 0 0100100 0 0 0100100 0 0
<br>> VecScatterEnd 1265 1.0 4.3167e+01 1.3 0.00e+00 0.0 0.0e+00 0.0e+00<br>> 0.0e+00 9 0 0 0 0 9 0 0 0 0 0<br>> VecNormalize 1265 1.0 1.0459e+02 1.1 6.21e+06 1.1 0.0e+00 0.0e+00<br>>
1.3e+03 25 4 0 0 32 25 4 0 0 32 23<br>> KSPGMRESOrthog 1224 1.0 1.9035e+02 1.1 7.00e+07 1.1 0.0e+00 0.0e+00<br>> 1.2e+03 45 72 0 0 31 45 72 0 0 31 254<br>> KSPSetup 2 1.0 5.1674e-01
1.2 0.00e+00 0.0 0.0e+00 0.0e+00<br>> 1.0e+01 0 0 0 0 0 0 0 0 0 0 0<br>> KSPSolve 1 1.0 4.0269e+02 1.0 4.16e+07 1.0 7.6e+03 6.4e+03<br>> 3.9e+03 99100100100 99 99100100100 99 166<br>
> PCSetUp 2 1.0 4.5924e-01 2.6 8.23e+06 2.6 0.0e+00 0.0e+00<br>> 6.0e+00 0 0 0 0 0 0 0 0 0 0 12<br>> PCSetUpOnBlocks 1 1.0 4.5847e-01 2.6 8.26e+06 2.6 0.0e+00 0.0e+00<br>> 4.0e+00
0 0 0 0 0 0 0 0 0 0 13<br>> PCApply 1265 1.0 5.0990e+01 1.2 4.33e+07 1.2 0.0e+00 0.0e+00<br>> 1.3e+03 12 11 0 0 32 12 11 0 0 32 143<br>> ------------------------------------------------------------------------------------------------------------------------
<br>><br>> Memory usage is given in bytes:<br>><br>> Object Type Creations Destructions Memory Descendants' Mem.<br>><br>> --- Event Stage 0: Main Stage<br>><br>> Matrix 4 4 643208 0
<br>> Index Set 5 5 1924296 0<br>> Vec 41 41 47379984 0<br>> Vec Scatter 1 1 0 0<br>> Krylov Solver 2 2 16880 0
<br>> Preconditioner 2 2 196 0<br>> ========================================================================================================================<br>> Average time to get PetscTime():
1.00136e-06<br>> Average time for MPI_Barrier(): 4.00066e-05<br>> Average time for zero size MPI_Send(): 1.70469e-05<br>> OptionTable: -log_summary<br>> Compiled without FORTRAN kernels<br>> Compiled with full precision matrices (default)
<br>> sizeof(short) 2 sizeof(int) 4 sizeof(long) 4 sizeof(void*) 4<br>> sizeof(PetscScalar) 8<br>> Configure run at: Thu Jan 18 12:23:31 2007<br>> Configure options: --with-vendor-compilers=intel --with-x=0 --with-shared
<br>> --with-blas-lapack-dir=/lsftmp/g0306332/inter/mkl/lib/32<br>> --with-mpi-dir=/opt/mpich/myrinet/intel/<br>> -----------------------------------------<br>><br>><br>><br>><br>><br>><br>><br>
> On 2/10/07, Ben Tay <zonexo@gmail.com> wrote:
> >
> > Hi,
> >
> > I tried to use ex2f.F as a test code. I've changed the number n,m from 3
> > to 500 each. I ran the code using 1 processor and then with 4 processor. I
> > then repeat the same with the following modification:
> >
> >     do i=1,10
> >        call KSPSolve(ksp,b,x,ierr)
> >     end do
> >
> > I've added to do loop to make the solving repeat 10 times.
> >
> > In both cases, the serial code is faster, e.g. 1 taking 2.4 min while the
> > other 3.3 min.
> >
> > Here's the log_summary:
> >
> > ---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
> >
> > ./ex2f on a linux-mpi named atlas12.nus.edu.sg with 4 processors, by g0306332 Sat Feb 10 16:21:36 2007
> > Using Petsc Release Version 2.3.2, Patch 8, Tue Jan 2 14:33:59 PST 2007
> > HG revision: ebeddcedcc065e32fc252af32cf1d01ed4fc7a80
> >
> >                          Max       Max/Min        Avg      Total
> > Time (sec):           2.213e+02      1.00051   2.212e+02
> > Objects:              5.500e+01      1.00000   5.500e+01
> > Flops:                4.718e+09      1.00019   4.718e+09  1.887e+10
> > Flops/sec:            2.134e+07      1.00070   2.133e+07  8.531e+07
> > Memory:               3.186e+07      1.00069              1.274e+08
> > MPI Messages:         1.832e+03      2.00000   1.374e+03  5.496e+03
> > MPI Message Lengths:  7.324e+06      2.00000   3.998e+03  2.197e+07
> > MPI Reductions:       7.112e+02      1.00000
> >
> > Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
> >                           e.g., VecAXPY() for real vectors of length N --> 2N flops
> >                           and VecAXPY() for complex vectors of length N --> 8N flops
> >
> > Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
> >                         Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total
> >  0:  Main Stage: 2.2120e+02 100.0%  1.8871e+10 100.0%  5.496e+03 100.0%  3.998e+03      100.0%  2.845e+03 100.0%
> >
> > ------------------------------------------------------------------------------------------------------------------------
> > See the 'Profiling' chapter of the users' manual for details on interpreting output.
> > Phase summary info:
> >    Count: number of times phase was executed
> >    Time and Flops/sec: Max - maximum over all processors
> >                        Ratio - ratio of maximum to minimum over all processors
> >    Mess: number of messages sent
> >    Avg. len: average message length
> >    Reduct: number of global reductions
> >    Global: entire computation
> >    Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
> >       %T - percent time in this phase         %F - percent flops in this phase
> >       %M - percent messages in this phase     %L - percent message lengths in this phase
> >       %R - percent reductions in this phase
> >    Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
> > ------------------------------------------------------------------------------------------------------------------------
> >
> > ##########################################################
> > #                                                        #
> > #                       WARNING!!!                       #
> > #                                                        #
> > #   This code was compiled with a debugging option,      #
> > #   To get timing results run config/configure.py        #
> > #   using --with-debugging=no, the performance will      #
> > #   be generally two or three times faster.              #
> > #                                                        #
> > ##########################################################
> >
> > ##########################################################
> > #                                                        #
> > #                       WARNING!!!                       #
> > #                                                        #
> > #   This code was run without the PreLoadBegin()         #
> > #   macros. To get timing results we always recommend    #
> > #   preloading. otherwise timing numbers may be          #
> > #   meaningless.                                         #
> > ##########################################################
> >
> > Event                Count      Time (sec)     Flops/sec                        --- Global ---  --- Stage ---   Total
> >                    Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct %T %F %M %L %R  %T %F %M %L %R Mflop/s
> > ------------------------------------------------------------------------------------------------------------------------
> >
> > --- Event Stage 0: Main Stage
> >
> > MatMult            915 1.0 4.4291e+01 1.3 1.50e+07 1.3 5.5e+03 4.0e+03 0.0e+00 18 11100100 0 18 11100100 0 46
> > MatSolve           915 1.0 1.5684e+01 1.1 3.56e+07 1.1 0.0e+00 0.0e+00 0.0e+00 7 11 0 0 0 7 11 0 0 0 131
> > MatLUFactorNum       1 1.0 5.1654e-02 1.4 1.48e+07 1.4 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 43
> > MatILUFactorSym      1 1.0 1.6838e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00 0 0 0 0 0 0 0 0 0 0 0
> > MatAssemblyBegin     1 1.0 3.2428e-01 1.6 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00 0 0 0 0 0 0 0 0 0 0 0
> > MatAssemblyEnd       1 1.0 1.3120e+00 1.1 0.00e+00 0.0 6.0e+00 2.0e+03 1.3e+01 1 0 0 0 0 1 0 0 0 0 0
> > MatGetOrdering       1 1.0 4.1590e-03 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00 0 0 0 0 0 0 0 0 0 0 0
> > VecMDot            885 1.0 8.5091e+01 1.1 2.27e+07 1.1 0.0e+00 0.0e+00 8.8e+02 36 36 0 0 31 36 36 0 0 31 80
> > VecNorm            916 1.0 6.6747e+01 1.1 1.81e+06 1.1 0.0e+00 0.0e+00 9.2e+02 29 2 0 0 32 29 2 0 0 32 7
> > VecScale           915 1.0 1.1430e+00 2.2 1.12e+08 2.2 0.0e+00 0.0e+00 0.0e+00 0 1 0 0 0 0 1 0 0 0 200
> > VecCopy             30 1.0 1.2816e-01 5.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> > VecSet             947 1.0 7.8979e-01 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> > VecAXPY             60 1.0 5.5332e-02 1.1 1.51e+08 1.1 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 542
> > VecMAXPY           915 1.0 1.5004e+01 1.3 1.54e+08 1.3 0.0e+00 0.0e+00 0.0e+00 6 38 0 0 0 6 38 0 0 0 483
> > VecScatterBegin    915 1.0 9.0358e-02 1.4 0.00e+00 0.0 5.5e+03 4.0e+03 0.0e+00 0 0100100 0 0 0100100 0 0
> > VecScatterEnd      915 1.0 3.5136e+01 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 14 0 0 0 0 14 0 0 0 0 0
> > VecNormalize       915 1.0 6.7272e+01 1.0 2.68e+06 1.0 0.0e+00 0.0e+00 9.2e+02 30 4 0 0 32 30 4 0 0 32 10
> > KSPGMRESOrthog     885 1.0 9.8478e+01 1.1 3.87e+07 1.1 0.0e+00 0.0e+00 8.8e+02 42 72 0 0 31 42 72 0 0 31 138
> > KSPSetup             2 1.0 6.1918e-01 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+01 0 0 0 0 0 0 0 0 0 0 0
> > KSPSolve             1 1.0 2.1892e+02 1.0 2.15e+07 1.0 5.5e+03 4.0e+03 2.8e+03 99100100100 99 99100100100 99 86
> > PCSetUp              2 1.0 7.3292e-02 1.3 9.84e+06 1.3 0.0e+00 0.0e+00 6.0e+00 0 0 0 0 0 0 0 0 0 0 30
> > PCSetUpOnBlocks      1 1.0 7.2706e-02 1.3 9.97e+06 1.3 0.0e+00 0.0e+00 4.0e+00 0 0 0 0 0 0 0 0 0 0 31
> > PCApply            915 1.0 1.6508e+01 1.1 3.27e+07 1.1 0.0e+00 0.0e+00 9.2e+02 7 11 0 0 32 7 11 0 0 32 124
> > ------------------------------------------------------------------------------------------------------------------------
> >
> > Memory usage is given in bytes:
> >
> > Object Type      Creations   Destructions     Memory   Descendants' Mem.
> >
> > --- Event Stage 0: Main Stage
> >
> > Matrix                   4              4     252008   0
> > Index Set                5              5     753096   0
> > Vec                     41             41   18519984   0
> > Vec Scatter              1              1          0   0
> > Krylov Solver            2              2      16880   0
> > Preconditioner           2              2        196   0
> > ========================================================================================================================
> >
> > Average time to get PetscTime(): 1.09673e-06
> > Average time for MPI_Barrier(): 4.18186e-05
> > Average time for zero size MPI_Send(): 2.62856e-05
> > OptionTable: -log_summary
> > Compiled without FORTRAN kernels
> > Compiled with full precision matrices (default)
> > sizeof(short) 2 sizeof(int) 4 sizeof(long) 4 sizeof(void*) 4 sizeof(PetscScalar) 8
> > Configure run at: Thu Jan 18 12:23:31 2007
> > Configure options: --with-vendor-compilers=intel --with-x=0 --with-shared --with-blas-lapack-dir=/lsftmp/g0306332/inter/mkl/lib/32 --with-mpi-dir=/opt/mpich/myrinet/intel/
> > -----------------------------------------
> > Libraries compiled on Thu Jan 18 12:24:41 SGT 2007 on atlas1.nus.edu.sg
> > Machine characteristics: Linux atlas1.nus.edu.sg 2.4.21-20.ELsmp #1 SMP Wed Sep 8 17:29:34 GMT 2004 i686 i686 i386 GNU/Linux
> > Using PETSc directory: /nas/lsftmp/g0306332/petsc-2.3.2-p8
> > Using PETSc arch: linux-mpif90
> > -----------------------------------------
> > Using C compiler: /opt/mpich/myrinet/intel/bin/mpicc -fPIC -g
> > Using Fortran compiler: /opt/mpich/myrinet/intel/bin/mpif90 -I. -fPIC -g -w90 -w
> > -----------------------------------------
> > Using include paths: -I/nas/lsftmp/g0306332/petsc-2.3.2-p8 -I/nas/lsftmp/g0306332/petsc-2.3.2-p8/bmake/linux-mpif90 -I/nas/lsftmp/g0306332/petsc-2.3.2-p8/include -I/opt/mpich/myrinet/intel/include
> > ------------------------------------------
> > Using C linker: /opt/mpich/myrinet/intel/bin/mpicc -fPIC -g
> > Using Fortran linker: /opt/mpich/myrinet/intel/bin/mpif90 -I. -fPIC -g -w90 -w
> > Using libraries:
> > -Wl,-rpath,/nas/lsftmp/g0306332/petsc-2.3.2-p8/lib/linux-mpif90
> > -L/nas/lsftmp/g0306332/petsc-2.3.2-p8/lib/linux-mpif90 -lpetscts
> > -lpetscsnes -lpetscksp -lpetscdm -lpetscmat -lpetscvec -lpetsc
> > -Wl,-rpath,/lsftmp/g0306332/inter/mkl/lib/32
> > -L/lsftmp/g0306332/inter/mkl/lib/32 -lmkl_lapack -lmkl_ia32 -lguide
> > -lPEPCF90 -Wl,-rpath,/opt/intel/compiler70/ia32/lib
> > -Wl,-rpath,/opt/mpich/myrinet/intel/lib -L/opt/mpich/myrinet/intel/lib
> > -Wl,-rpath,-rpath -Wl,-rpath,-ldl -L-ldl -lmpich -Wl,-rpath,-L -lgm
> > -lpthread -Wl,-rpath,/opt/intel/compiler70/ia32/lib
> > -Wl,-rpath,/opt/intel/compiler70/ia32/lib -L/opt/intel/compiler70/ia32/lib
> > -Wl,-rpath,/usr/lib -Wl,-rpath,/usr/lib -L/usr/lib -limf -lirc -lcprts -lcxa
> > -lunwind -ldl -lmpichf90 -Wl,-rpath,/opt/gm/lib -L/opt/gm/lib -lPEPCF90
> > -Wl,-rpath,/opt/intel/compiler70/ia32/lib -L/opt/intel/compiler70/ia32/lib
> > -Wl,-rpath,/usr/lib -L/usr/lib -lintrins -lIEPCF90 -lF90 -lm -Wl,-rpath,\
> > -Wl,-rpath,\ -L\ -ldl -lmpich -Wl,-rpath,\ -L\ -lgm -lpthread
> > -Wl,-rpath,/opt/intel/compiler70/ia32/lib -L/opt/intel/compiler70/ia32/lib
> > -Wl,-rpath,/usr/lib -L/usr/lib -limf -lirc -lcprts -lcxa -lunwind -ldl
> > ------------------------------------------
> >
> > So is there something wrong with the server's mpi implementation?
> >
> > Thank you.
> >
> > On 2/10/07, Satish Balay <balay@mcs.anl.gov> wrote:
> > >
> > > Looks like MatMult = 24sec Out of this the scatter time is: 22sec.
> > > Either something is wrong with your run - or MPI is really broken..
> > >
> > > Satish
> > >
> > > > > > MatMult          3927 1.0 2.4071e+01 1.3 6.14e+06 1.4 2.4e+04 1.3e+03
> > > > > > VecScatterBegin  3927 1.0 2.8672e-01 3.9 0.00e+00 0.0 2.4e+04 1.3e+03
> > > > > > VecScatterEnd    3927 1.0 2.2135e+01 1.5 0.00e+00 0.0 0.0e+00 0.0e+00
> > >
> >
>