Slow speed after changing from serial to parallel
Barry Smith
bsmith at mcs.anl.gov
Fri Apr 18 07:08:46 CDT 2008
On Apr 17, 2008, at 10:55 PM, Ben Tay wrote:
> Hi,
>
> I've emailed my school's supercomputing staff and they told me that the
> queue I was using is meant for testing, so its handling of the workload
> is not good. I've sent my job to another queue and it ran on 4
> processors. These results are from my own code; there seems to be
> something wrong with the server displaying the summary when using
> -log_summary with ex2f.F, so I'm trying that again.
>
> Anyway, comparing just KSPSolve between the two runs, the speedup is
> about 2.7. However, I noticed that for the 4-processor run,
> MatAssemblyBegin takes 1.5158e+02 s, which is far more than KSPSolve's
> 4.7041e+00 s.
You have a huge load imbalance in setting the values in the matrix
(the load imbalance is 2254.7). Are you sure each process is setting
roughly the same number of matrix entries? Also, are you doing an
accurate matrix preallocation (see the detailed manual pages for
MatMPIAIJSetPreallocation() and MatCreateMPIAIJ())? You can run with
-info and grep for malloc to see whether MatSetValues() is allocating
additional memory. If you get the matrix preallocation correct you
will see a HUGE speed improvement.
Barry
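
[For illustration, a minimal sketch of the kind of exact preallocation
suggested above. It is not the poster's actual code: it assumes a 5-point
stencil matrix on an n x n grid with a simple block row distribution, and
the grid size and all variable names are made up for this example.]

#include "petscmat.h"

int main(int argc, char **argv)
{
  Mat            A;
  PetscInt       n = 1140;               /* assumed: global problem is n x n unknowns */
  PetscInt       N, rstart, rend, nlocal, i;
  PetscInt       *d_nnz, *o_nnz;
  PetscMPIInt    rank, size;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, PETSC_NULL, PETSC_NULL); CHKERRQ(ierr);
  ierr = MPI_Comm_rank(PETSC_COMM_WORLD, &rank); CHKERRQ(ierr);
  ierr = MPI_Comm_size(PETSC_COMM_WORLD, &size); CHKERRQ(ierr);
  N = n * n;

  /* Simple block row distribution; it must match the rows each process
     actually sets with MatSetValues(). */
  nlocal = N / size + ((N % size) > rank ? 1 : 0);
  rstart = rank * (N / size) + PetscMin(rank, N % size);
  rend   = rstart + nlocal;

  /* For every locally owned row, count how many nonzeros fall in the
     diagonal block (locally owned columns) and in the off-diagonal block. */
  ierr = PetscMalloc(nlocal * sizeof(PetscInt), &d_nnz); CHKERRQ(ierr);
  ierr = PetscMalloc(nlocal * sizeof(PetscInt), &o_nnz); CHKERRQ(ierr);
  for (i = rstart; i < rend; i++) {
    PetscInt ix = i % n, iy = i / n, k, ncols = 0, cols[5];
    cols[ncols++] = i;                        /* diagonal entry */
    if (ix > 0)     cols[ncols++] = i - 1;    /* west  */
    if (ix < n - 1) cols[ncols++] = i + 1;    /* east  */
    if (iy > 0)     cols[ncols++] = i - n;    /* south */
    if (iy < n - 1) cols[ncols++] = i + n;    /* north */
    d_nnz[i - rstart] = o_nnz[i - rstart] = 0;
    for (k = 0; k < ncols; k++) {
      if (cols[k] >= rstart && cols[k] < rend) d_nnz[i - rstart]++;
      else                                     o_nnz[i - rstart]++;
    }
  }

  /* Exact preallocation: MatSetValues() should then trigger no mallocs. */
  ierr = MatCreateMPIAIJ(PETSC_COMM_WORLD, nlocal, nlocal, N, N,
                         0, d_nnz, 0, o_nnz, &A); CHKERRQ(ierr);
  ierr = PetscFree(d_nnz); CHKERRQ(ierr);
  ierr = PetscFree(o_nnz); CHKERRQ(ierr);

  /* ... each process then calls MatSetValues() only for rows
     rstart..rend-1, followed by MatAssemblyBegin()/MatAssemblyEnd() ... */

  ierr = MatDestroy(A); CHKERRQ(ierr);
  ierr = PetscFinalize(); CHKERRQ(ierr);
  return 0;
}

[With d_nnz/o_nnz computed this way, running with -info and grepping for
malloc should report no additional mallocs during MatSetValues(), and the
MatAssemblyBegin() imbalance should also shrink, provided every process
sets only the rows it owns.]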
> So is MatAssemblyBegin's time included in KSPSolve? If not, does it
> mean that there's something wrong with my MatAssemblyBegin?
>
> Thank you
>
> For 1 processor:
>
> ************************************************************************************************************************
> ***    WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r -fCourier9' to print this document    ***
> ************************************************************************************************************************
>
> ---------------------------------------------- PETSc Performance
> Summary: ----------------------------------------------
>
> ./a.out on a atlas3 named atlas3-c28 with 1 processor, by g0306332
> Fri Apr 18 08:46:11 2008
> Using Petsc Release Version 2.3.3, Patch 8, Fri Nov 16 17:03:40 CST
> 2007 HG revision: 414581156e67e55c761739b0deb119f7590d0f4b
>
> Max Max/Min Avg Total
> Time (sec): 1.322e+02 1.00000 1.322e+02
> Objects: 2.200e+01 1.00000 2.200e+01
> Flops: 2.242e+08 1.00000 2.242e+08 2.242e+08
> Flops/sec: 1.696e+06 1.00000 1.696e+06 1.696e+06
> MPI Messages: 0.000e+00 0.00000 0.000e+00 0.000e+00
> MPI Message Lengths: 0.000e+00 0.00000 0.000e+00 0.000e+00
> MPI Reductions: 2.100e+01 1.00000
>
> Flop counting convention: 1 flop = 1 real number operation of type
> (multiply/divide/add/subtract)
> e.g., VecAXPY() for real vectors of
> length N --> 2N flops
> and VecAXPY() for complex vectors of
> length N --> 8N flops
>
> Summary of Stages:  ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
>                       Avg    %Total      Avg    %Total    counts   %Total     Avg       %Total     counts   %Total
>  0: Main Stage:  1.3217e+02 100.0%  2.2415e+08 100.0%  0.000e+00   0.0%  0.000e+00      0.0%  2.100e+01 100.0%
>
> ------------------------------------------------------------------------------------------------------------------------
> See the 'Profiling' chapter of the users' manual for details on
> interpreting output.
> Phase summary info:
> Count: number of times phase was executed
> Time and Flops/sec: Max - maximum over all processors
> Ratio - ratio of maximum to minimum over all
> processors
> Mess: number of messages sent
> Avg. len: average message length
> Reduct: number of global reductions
> Global: entire computation
> Stage: stages of a computation. Set stages with
> PetscLogStagePush() and PetscLogStagePop().
> %T - percent time in this phase %F - percent flops in
> this phase
> %M - percent messages in this phase %L - percent message
> lengths in this phase
> %R - percent reductions in this phase
> Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max
> time over all processors)
> ------------------------------------------------------------------------------------------------------------------------
>
>
> ##########################################################
> # #
> # WARNING!!! #
> # #
> # This code was run without the PreLoadBegin() #
> # macros. To get timing results we always recommend #
> # preloading. otherwise timing numbers may be #
> # meaningless. #
> ##########################################################
>
>
>
> Event              Count      Time (sec)    Flops/sec                        --- Global ---  --- Stage ---  Total
>                    Max Ratio  Max    Ratio  Max  Ratio  Mess  Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R  Mflop/s
> ------------------------------------------------------------------------------------------------------------------------
>
> --- Event Stage 0: Main Stage
>
> MatMult                6 1.0 1.8572e-01 1.0 3.77e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0 31  0  0  0   0 31  0  0  0   377
> MatConvert             1 1.0 1.1636e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
> MatAssemblyBegin       1 1.0 2.1458e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatAssemblyEnd         1 1.0 8.8531e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatGetRow        1296000 1.0 2.6576e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatGetRowIJ            1 1.0 4.0531e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatZeroEntries         1 1.0 4.4700e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> KSPGMRESOrthog         6 1.0 2.1104e-01 1.0 5.16e+08 1.0 0.0e+00 0.0e+00 6.0e+00  0 49  0  0 29   0 49  0  0 29   516
> KSPSetup               1 1.0 6.5601e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> KSPSolve               1 1.0 1.2883e+01 1.0 1.74e+07 1.0 0.0e+00 0.0e+00 1.5e+01 10 100 0  0 71  10 100 0  0 71    17
> PCSetUp                1 1.0 4.4342e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00  3  0  0  0 10   3  0  0  0 10     0
> PCApply                7 1.0 7.7337e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  6  0  0  0  0   6  0  0  0  0     0
> VecMDot                6 1.0 9.8586e-02 1.0 5.52e+08 1.0 0.0e+00 0.0e+00 6.0e+00  0 24  0  0 29   0 24  0  0 29   552
> VecNorm                7 1.0 6.9757e-02 1.0 2.60e+08 1.0 0.0e+00 0.0e+00 7.0e+00  0  8  0  0 33   0  8  0  0 33   260
> VecScale               7 1.0 2.9803e-02 1.0 3.04e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  4  0  0  0   0  4  0  0  0   304
> VecCopy                1 1.0 6.1009e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecSet                 9 1.0 3.1438e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecAXPY                1 1.0 7.5161e-03 1.0 3.45e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  1  0  0  0   0  1  0  0  0   345
> VecMAXPY               7 1.0 1.4444e-01 1.0 4.85e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0 31  0  0  0   0 31  0  0  0   485
> VecAssemblyBegin       2 1.0 4.2915e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+00  0  0  0  0 29   0  0  0  0 29     0
> VecAssemblyEnd         2 1.0 6.9141e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecNormalize           7 1.0 9.9603e-02 1.0 2.73e+08 1.0 0.0e+00 0.0e+00 7.0e+00  0 12  0  0 33   0 12  0  0 33   273
> ------------------------------------------------------------------------------------------------------------------------
>
> Memory usage is given in bytes:
>
> Object Type          Creations   Destructions   Memory     Descendants' Mem.
>
> --- Event Stage 0: Main Stage
>
> Matrix 1 1 98496004 0
> Krylov Solver 1 1 17216 0
> Preconditioner 1 1 272 0
> Vec 19 19 186638392 0
> ========================================================================================================================
> Average time to get PetscTime(): 9.53674e-08
> OptionTable: -log_summary
> Compiled without FORTRAN kernels
> Compiled with full precision matrices (default)
> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
> sizeof(PetscScalar) 8
> Configure run at: Wed Jan 9 14:33:02 2008
> Configure options: --with-cc=icc --with-fc=ifort --with-x=0 --with-
> blas-lapack-dir=/opt/intel/cmkl/8.1.1/lib/em64t --with-shared --with-
> mpi-dir=/lsftmp/g0306332/mpich2/ --with-debugging=0 --with-hypre-
> dir=/home/enduser/g0306332/lib/hypre_shared
> -----------------------------------------
> Libraries compiled on Wed Jan 9 14:33:36 SGT 2008 on atlas3-c01
> Machine characteristics: Linux atlas3-c01 2.6.9-42.ELsmp #1 SMP Wed
> Jul 12 23:32:02 EDT 2006 x86_64 x86_64 x86_64 GNU/Linux
> Using PETSc directory: /home/enduser/g0306332/petsc-2.3.3-p8
> Using PETSc arch: atlas3
> -----------------------------------------
> Using C compiler: icc -fPIC -O
>
> for 4 processors
>
> ************************************************************************************************************************
> ***    WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r -fCourier9' to print this document    ***
> ************************************************************************************************************************
>
> ---------------------------------------------- PETSc Performance
> Summary: ----------------------------------------------
>
> ./a.out on a atlas3-mp named atlas3-c23 with 4 processors, by
> g0306332 Fri Apr 18 08:22:11 2008
> Using Petsc Release Version 2.3.3, Patch 8, Fri Nov 16 17:03:40 CST
> 2007 HG revision: 414581156e67e55c761739b0deb119f7590d0f4b
>
> Max Max/Min Avg Total
> 0.000000000000000E+000 58.1071298622710
> 0.000000000000000E+000 58.1071298622710
> 0.000000000000000E+000 58.1071298622710
> 0.000000000000000E+000 58.1071298622710
> Time (sec): 3.308e+02 1.00177 3.305e+02
> Objects: 2.900e+01 1.00000 2.900e+01
> Flops: 5.605e+07 1.00026 5.604e+07 2.242e+08
> Flops/sec: 1.697e+05 1.00201 1.695e+05 6.782e+05
> MPI Messages: 1.400e+01 2.00000 1.050e+01 4.200e+01
> MPI Message Lengths: 1.248e+05 2.00000 8.914e+03 3.744e+05
> MPI Reductions: 7.500e+00 1.00000
>
> Flop counting convention: 1 flop = 1 real number operation of type
> (multiply/divide/add/subtract)
> e.g., VecAXPY() for real vectors of
> length N --> 2N flops
> and VecAXPY() for complex vectors of
> length N --> 8N flops
>
> Summary of Stages:  ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
>                       Avg    %Total      Avg    %Total    counts   %Total     Avg       %Total     counts   %Total
>  0: Main Stage:  3.3051e+02 100.0%  2.2415e+08 100.0%  4.200e+01 100.0%  8.914e+03    100.0%  3.000e+01 100.0%
>
> ------------------------------------------------------------------------------------------------------------------------
> See the 'Profiling' chapter of the users' manual for details on
> interpreting output.
> Phase summary info:
> Count: number of times phase was executed
> Time and Flops/sec: Max - maximum over all processors
> Ratio - ratio of maximum to minimum over all
> processors
> Mess: number of messages sent
> Avg. len: average message length
> Reduct: number of global reductions
> Global: entire computation
> Stage: stages of a computation. Set stages with
> PetscLogStagePush() and PetscLogStagePop().
> %T - percent time in this phase %F - percent flops in
> this phase
> %M - percent messages in this phase %L - percent message
> lengths in this phase
> %R - percent reductions in this phase
> Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max
> time over all processors)
> ------------------------------------------------------------------------------------------------------------------------
>
>
> ##########################################################
> # #
> # WARNING!!! #
> # #
> # This code was run without the PreLoadBegin() #
> # macros. To get timing results we always recommend #
> # preloading. otherwise timing numbers may be #
> # meaningless. #
> ##########################################################
>
>
> Event              Count      Time (sec)    Flops/sec                        --- Global ---  --- Stage ---  Total
>                    Max Ratio  Max    Ratio  Max  Ratio  Mess  Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R  Mflop/s
> ------------------------------------------------------------------------------------------------------------------------
>
> --- Event Stage 0: Main Stage
>
> MatMult                6 1.0 8.2640e-02 1.6 3.37e+08 1.6 3.6e+01 9.6e+03 0.0e+00  0 31 86 92  0   0 31 86 92  0   846
> MatConvert             1 1.0 2.1472e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatAssemblyBegin       1 1.0 1.5158e+02 2254.7 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00 22  0  0  0  7  22  0  0  0  7     0
> MatAssemblyEnd         1 1.0 1.5766e-01 1.1 0.00e+00 0.0 6.0e+00 4.8e+03 7.0e+00  0  0 14  8 23   0  0 14  8 23     0
> MatGetRow         324000 1.0 8.9608e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatGetRowIJ            2 1.0 5.9605e-06 2.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatZeroEntries         1 1.0 5.8902e-02 2.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> KSPGMRESOrthog         6 1.0 1.1247e-01 1.7 4.11e+08 1.7 0.0e+00 0.0e+00 6.0e+00  0 49  0  0 20   0 49  0  0 20   968
> KSPSetup               1 1.0 1.5483e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> KSPSolve               1 1.0 4.7041e+00 1.0 1.19e+07 1.0 3.6e+01 9.6e+03 1.5e+01  1 100 86 92 50   1 100 86 92 50    48
> PCSetUp                1 1.0 1.5953e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00  0  0  0  0  7   0  0  0  0  7     0
> PCApply                7 1.0 2.6580e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
> VecMDot                6 1.0 7.3443e-02 2.2 4.13e+08 2.2 0.0e+00 0.0e+00 6.0e+00  0 24  0  0 20   0 24  0  0 20   741
> VecNorm                7 1.0 2.5193e-01 1.1 1.94e+07 1.1 0.0e+00 0.0e+00 7.0e+00  0  8  0  0 23   0  8  0  0 23    72
> VecScale               7 1.0 6.6319e-03 2.8 9.64e+08 2.8 0.0e+00 0.0e+00 0.0e+00  0  4  0  0  0   0  4  0  0  0  1368
> VecCopy                1 1.0 2.3100e-03 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecSet                 9 1.0 1.4173e-02 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecAXPY                1 1.0 2.9502e-03 1.7 3.72e+08 1.7 0.0e+00 0.0e+00 0.0e+00  0  1  0  0  0   0  1  0  0  0   879
> VecMAXPY               7 1.0 4.9046e-02 1.4 5.09e+08 1.4 0.0e+00 0.0e+00 0.0e+00  0 31  0  0  0   0 31  0  0  0  1427
> VecAssemblyBegin       2 1.0 4.3297e-04 3.1 0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+00  0  0  0  0 20   0  0  0  0 20     0
> VecAssemblyEnd         2 1.0 5.2452e-06 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecScatterBegin        6 1.0 6.9666e-04 6.3 0.00e+00 0.0 3.6e+01 9.6e+03 0.0e+00  0  0 86 92  0   0  0 86 92  0     0
> VecScatterEnd          6 1.0 1.4806e-02 102.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecNormalize           7 1.0 2.5431e-01 1.1 2.86e+07 1.1 0.0e+00 0.0e+00 7.0e+00  0 12  0  0 23   0 12  0  0 23   107
> ------------------------------------------------------------------------------------------------------------------------
>
> Memory usage is given in bytes:
>
> Object Type          Creations   Destructions   Memory     Descendants' Mem.
>
> --- Event Stage 0: Main Stage
>
> Matrix 3 3 49252812 0
> Krylov Solver 1 1 17216 0
> Preconditioner 1 1 272 0
> Index Set 2 2 5488 0
> Vec 21 21 49273624 0
> Vec Scatter 1 1 0 0
> ========================================================================================================================
> Average time to get PetscTime(): 1.90735e-07
> Average time for MPI_Barrier(): 5.62668e-06
> Average time for zero size MPI_Send(): 6.73532e-06
> OptionTable: -log_summary
> Compiled without FORTRAN kernels
> Compiled with full precision matrices (default)
> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
> sizeof(PetscScalar) 8
> Configure run at: Tue Jan 8 22:22:08 2008
>
>