[petsc-users] Slower performance using more MPI processes

Chris Hewson chris at resfrac.com
Fri Sep 8 17:52:10 CDT 2023


Thanks for the help.

Yeah, running the example again with core binding, I get results similar to
Junchao's:
1 core: 2808 ms
2 core: 1398 ms
4 core: 989 ms
8 core: 1083 ms
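
(For reference, those runs used core binding along the lines of Junchao's
command below, i.e. roughly:
    mpirun --bind-to core -n <np> ./main
with <np> = 1, 2, 4, 8; the exact flag spelling depends on the MPI launcher.)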

On Fri, Sep 8, 2023 at 4:39 PM Junchao Zhang <junchao.zhang at gmail.com>
wrote:

> Hi, Chris,
>   this is my result of running your test on a local machine (8 cores, 16
> hardware threads).  I used OpenMPI with a command line like
>     mpirun --bind-to core -n <np> ./main
> np=1 Solve duration: 2132
> np=2 Solve duration: 1116
> np=4 Solve duration: 990
> np=8 Solve duration: 1257
>
> Note that when I ran without binding:
> $ mpirun  -n 4 ./main
> Solve duration: 22693
>
> This suggests binding is important.
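>
> (To double-check where the ranks land, Open MPI's mpirun can also report
> the bindings it applies, e.g. something like
>     mpirun --bind-to core --report-bindings -n 2 ./main
> treat this as a sketch and check the options of your mpirun version.)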
>
> --Junchao Zhang
>
>
> On Fri, Sep 8, 2023 at 4:53 PM Barry Smith <bsmith at petsc.dev> wrote:
>
>>
>> For the size of problem you are running, this is very unexpected. The
>> reductions should only start to dominate at thousands of MPI ranks, not 2.
>>
>> The first thing I recommend is to run the streams benchmark. Then check the
>> binding that MPI is doing for the two processes: you want to bind to cores
>> in different NUMA regions. It could be that both processes are being bound to
>> cores that share the same cache. The MatSolve should be embarrassingly
>> parallel, but you are getting almost no speedup in it, so something very
>> "wrong" is happening.
>>
>>
>>
>> Event                Count      Time (sec)     Flop
>>        --- Global ---  --- Stage ----  Total
>>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen
>>  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
>>
>> 1 rank:
>> VecDot               182 1.0 1.9998e-01 1.0 2.26e+08 1.0 0.0e+00 0.0e+00
>> 0.0e+00  7  5  0  0  0   7  5  0  0  0  1129
>> VecDotNorm2           91 1.0 6.6214e-02 1.0 2.26e+08 1.0 0.0e+00 0.0e+00
>> 0.0e+00  2  5  0  0  0   2  5  0  0  0  3409
>> VecNorm               92 1.0 1.4790e-01 1.0 1.14e+08 1.0 0.0e+00 0.0e+00
>> 0.0e+00  5  3  0  0  0   5  3  0  0  0   771
>>
>> 2 ranks:
>> VecDot               198 1.0 2.1037e+00 1.1 1.23e+08 1.0 0.0e+00 0.0e+00
>> 2.0e+02 33  5  0  0 43  33  5  0  0 45   117
>> VecDotNorm2           99 1.0 5.0169e-01 1.2 1.23e+08 1.0 0.0e+00 0.0e+00
>> 9.9e+01  7  5  0  0 22   7  5  0  0 22   489
>> VecNorm              100 1.0 1.3131e+00 1.0 6.20e+07 1.0 0.0e+00 0.0e+00
>> 1.0e+02 21  3  0  0 22  21  3  0  0 23    94
>> VecScatterEnd        198 1.0 7.6160e-01 1.7 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00 10  0  0  0  0  10  0  0  0  0     0
>>
>> 1 rank:
>> MatSolve             183 1.0 1.0882e+00 1.0 1.43e+09 1.0 0.0e+00 0.0e+00
>> 0.0e+00 40 35  0  0  0  40 35  0  0  0  1318
>> 2 ranks:
>> MatSolve             199 1.0 8.9666e-01 1.2 7.75e+08 1.0 0.0e+00 0.0e+00
>> 0.0e+00 13 35  0  0  0  13 35  0  0  0  1729
>>
>>
>> On Sep 8, 2023, at 5:22 PM, Chris Hewson <chris at resfrac.com> wrote:
>>
>> Thanks for the quick response.
>>
>> The links to the log view files are below:
>> 2 ranks:
>>
>> https://drive.google.com/file/d/1PGRsiHypWtN5h3uxdJBKy9WzEkE0mUgO/view?usp=drive_link
>>
>> 1 rank:
>>
>> https://drive.google.com/file/d/1hB2XyoNtLMHseZUT7jCuiixTQeBi_tjJ/view?usp=drive_link
>>
>> I'll also attach them to this email:
>> **************************** 1 RANK ******************************
>> ------------------------------------------------------------------ PETSc
>> Performance Summary:
>> ------------------------------------------------------------------
>>
>> ./petsc-testing on a  named ubuntu-office with 1 processor, by chewson
>> Fri Sep  8 15:16:51 2023
>> Using Petsc Release Version 3.19.5, unknown
>>
>>                          Max       Max/Min     Avg       Total
>> Time (sec):           2.746e+00     1.000   2.746e+00
>> Objects:              2.100e+01     1.000   2.100e+01
>> Flops:                4.117e+09     1.000   4.117e+09  4.117e+09
>> Flops/sec:            1.499e+09     1.000   1.499e+09  1.499e+09
>> MPI Msg Count:        0.000e+00     0.000   0.000e+00  0.000e+00
>> MPI Msg Len (bytes):  0.000e+00     0.000   0.000e+00  0.000e+00
>> MPI Reductions:       0.000e+00     0.000
>>
>> Flop counting convention: 1 flop = 1 real number operation of type
>> (multiply/divide/add/subtract)
>>                             e.g., VecAXPY() for real vectors of length N
>> --> 2N flops
>>                             and VecAXPY() for complex vectors of length N
>> --> 8N flops
>>
>> Summary of Stages:   ----- Time ------  ----- Flop ------  --- Messages
>> ---  -- Message Lengths --  -- Reductions --
>>                         Avg     %Total     Avg     %Total    Count
>> %Total     Avg         %Total    Count   %Total
>>  0:      Main Stage: 2.7458e+00 100.0%  4.1167e+09 100.0%  0.000e+00
>> 0.0%  0.000e+00        0.0%  0.000e+00   0.0%
>>
>>
>> ------------------------------------------------------------------------------------------------------------------------
>> See the 'Profiling' chapter of the users' manual for details on
>> interpreting output.
>> Phase summary info:
>>    Count: number of times phase was executed
>>    Time and Flop: Max - maximum over all processors
>>                   Ratio - ratio of maximum to minimum over all processors
>>    Mess: number of messages sent
>>    AvgLen: average message length (bytes)
>>    Reduct: number of global reductions
>>    Global: entire computation
>>    Stage: stages of a computation. Set stages with PetscLogStagePush()
>> and PetscLogStagePop().
>>       %T - percent time in this phase         %F - percent flop in this
>> phase
>>       %M - percent messages in this phase     %L - percent message
>> lengths in this phase
>>       %R - percent reductions in this phase
>>    Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time
>> over all processors)
>>
>> ------------------------------------------------------------------------------------------------------------------------
>> Event                Count      Time (sec)     Flop
>>        --- Global ---  --- Stage ----  Total
>>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen
>>  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
>>
>> ------------------------------------------------------------------------------------------------------------------------
>>
>> --- Event Stage 0: Main Stage
>>
>> MatMult              182 1.0 8.0351e-01 1.0 1.43e+09 1.0 0.0e+00 0.0e+00
>> 0.0e+00 29 35  0  0  0  29 35  0  0  0  1775
>> MatSolve             183 1.0 1.0882e+00 1.0 1.43e+09 1.0 0.0e+00 0.0e+00
>> 0.0e+00 40 35  0  0  0  40 35  0  0  0  1318
>> MatLUFactorNum         1 1.0 1.3892e-02 1.0 1.30e+07 1.0 0.0e+00 0.0e+00
>> 0.0e+00  1  0  0  0  0   1  0  0  0  0   934
>> MatILUFactorSym        1 1.0 2.1567e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
>> MatAssemblyBegin       1 1.0 1.0420e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> MatAssemblyEnd         1 1.0 6.9049e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> MatGetRowIJ            1 1.0 3.8500e-07 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> MatGetOrdering         1 1.0 1.7026e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> MatLoad                1 1.0 6.6749e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00  2  0  0  0  0   2  0  0  0  0     0
>> VecDot               182 1.0 1.9998e-01 1.0 2.26e+08 1.0 0.0e+00 0.0e+00
>> 0.0e+00  7  5  0  0  0   7  5  0  0  0  1129
>> VecDotNorm2           91 1.0 6.6214e-02 1.0 2.26e+08 1.0 0.0e+00 0.0e+00
>> 0.0e+00  2  5  0  0  0   2  5  0  0  0  3409
>> VecNorm               92 1.0 1.4790e-01 1.0 1.14e+08 1.0 0.0e+00 0.0e+00
>> 0.0e+00  5  3  0  0  0   5  3  0  0  0   771
>> VecCopy                2 1.0 6.8473e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> VecSet                 3 1.0 1.3256e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> VecAXPBYCZ           182 1.0 1.6542e-01 1.0 4.51e+08 1.0 0.0e+00 0.0e+00
>> 0.0e+00  6 11  0  0  0   6 11  0  0  0  2729
>> VecWAXPY             182 1.0 1.4476e-01 1.0 2.26e+08 1.0 0.0e+00 0.0e+00
>> 0.0e+00  5  5  0  0  0   5  5  0  0  0  1559
>> VecLoad                2 1.0 1.0104e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> KSPSetUp               1 1.0 9.9204e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> KSPSolve               1 1.0 2.6210e+00 1.0 4.10e+09 1.0 0.0e+00 0.0e+00
>> 0.0e+00 95 100  0  0  0  95 100  0  0  0  1566
>> PCSetUp                1 1.0 3.7232e-02 1.0 1.30e+07 1.0 0.0e+00 0.0e+00
>> 0.0e+00  1  0  0  0  0   1  0  0  0  0   349
>> PCApply              183 1.0 1.0885e+00 1.0 1.43e+09 1.0 0.0e+00 0.0e+00
>> 0.0e+00 40 35  0  0  0  40 35  0  0  0  1318
>>
>> --- Event Stage 1: Unknown
>>
>>
>> ------------------------------------------------------------------------------------------------------------------------
>>
>> Object Type          Creations   Destructions. Reports information only
>> for process 0.
>>
>> --- Event Stage 0: Main Stage
>>
>>               Viewer     4              1
>>               Matrix     3              1
>>               Vector     9              1
>>        Krylov Solver     1              0
>>       Preconditioner     1              0
>>            Index Set     3              0
>>
>> --- Event Stage 1: Unknown
>>
>>
>> ========================================================================================================================
>> Average time to get PetscTime(): 1.51e-08
>> #PETSc Option Table entries:
>> -log_view # (source: command line)
>> #End of PETSc Option Table entries
>> Compiled without FORTRAN kernels
>> Compiled with full precision matrices (default)
>> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
>> sizeof(PetscScalar) 8 sizeof(PetscInt) 4
>> Configure options: --with-debugging=0 --prefix=/opt/anl/petsc-3.19.5
>> --download-mumps --download-scalapack --with-mpi=1
>> --with-mpi-dir=/opt/anl/mpich COPTFLAGS=-O2 CXXOPTFLAGS=-O2 FOPTFLAGS=-O2
>> -----------------------------------------
>> Libraries compiled on 2023-09-08 16:27:49 on ubuntu-office
>> Machine characteristics: Linux-6.2.0-26-generic-x86_64-with-glibc2.35
>> Using PETSc directory: /opt/anl/petsc-3.19.5
>> Using PETSc arch:
>> -----------------------------------------
>>
>> Using C compiler: /opt/anl/mpich/bin/mpicc  -fPIC -Wall -Wwrite-strings
>> -Wno-unknown-pragmas -Wno-lto-type-mismatch -Wno-stringop-overflow
>> -fstack-protector -fvisibility=hidden -O2
>> Using Fortran compiler: /opt/anl/mpich/bin/mpif90  -fPIC -Wall
>> -ffree-line-length-none -ffree-line-length-0 -Wno-lto-type-mismatch
>> -Wno-unused-dummy-argument -O2
>> -----------------------------------------
>>
>> Using include paths: -I/opt/anl/petsc-3.19.5/include
>> -I/opt/anl/mpich/include
>> -----------------------------------------
>>
>> Using C linker: /opt/anl/mpich/bin/mpicc
>> Using Fortran linker: /opt/anl/mpich/bin/mpif90
>> Using libraries: -Wl,-rpath,/opt/anl/petsc-3.19.5/lib
>> -L/opt/anl/petsc-3.19.5/lib -lpetsc -Wl,-rpath,/opt/anl/petsc-3.19.5/lib
>> -L/opt/anl/petsc-3.19.5/lib -Wl,-rpath,/opt/anl/mpich/lib
>> -L/opt/anl/mpich/lib -Wl,-rpath,/usr/lib/gcc/x86_64-linux-gnu/11
>> -L/usr/lib/gcc/x86_64-linux-gnu/11 -ldmumps -lmumps_common -lpord -lpthread
>> -lscalapack -llapack -lblas -lm -lX11 -lmpifort -lmpi -lgfortran -lm
>> -lgfortran -lm -lgcc_s -lquadmath -lstdc++ -lquadmath
>> -----------------------------------------
>>
>> ************************* 2 RANKS ************************************
>> ------------------------------------------------------------------ PETSc
>> Performance Summary:
>> ------------------------------------------------------------------
>>
>> ./petsc-testing on a  named ubuntu-office with 2 processors, by chewson
>> Fri Sep  8 15:16:43 2023
>> Using Petsc Release Version 3.19.5, unknown
>>
>>                          Max       Max/Min     Avg       Total
>> Time (sec):           6.167e+00     1.001   6.164e+00
>> Objects:              3.200e+01     1.000   3.200e+01
>> Flops:                2.233e+09     1.000   2.233e+09  4.467e+09
>> Flops/sec:            3.625e+08     1.001   3.623e+08  7.247e+08
>> MPI Msg Count:        2.050e+02     1.000   2.050e+02  4.100e+02
>> MPI Msg Len (bytes):  3.437e+07     1.000   1.676e+05  6.874e+07
>> MPI Reductions:       4.580e+02     1.000
>>
>> Flop counting convention: 1 flop = 1 real number operation of type
>> (multiply/divide/add/subtract)
>>                             e.g., VecAXPY() for real vectors of length N
>> --> 2N flops
>>                             and VecAXPY() for complex vectors of length N
>> --> 8N flops
>>
>> Summary of Stages:   ----- Time ------  ----- Flop ------  --- Messages
>> ---  -- Message Lengths --  -- Reductions --
>>                         Avg     %Total     Avg     %Total    Count
>> %Total     Avg         %Total    Count   %Total
>>  0:      Main Stage: 6.1642e+00 100.0%  4.4670e+09 100.0%  4.100e+02
>> 100.0%  1.676e+05      100.0%  4.400e+02  96.1%
>>
>>
>> ------------------------------------------------------------------------------------------------------------------------
>> See the 'Profiling' chapter of the users' manual for details on
>> interpreting output.
>> Phase summary info:
>>    Count: number of times phase was executed
>>    Time and Flop: Max - maximum over all processors
>>                   Ratio - ratio of maximum to minimum over all processors
>>    Mess: number of messages sent
>>    AvgLen: average message length (bytes)
>>    Reduct: number of global reductions
>>    Global: entire computation
>>    Stage: stages of a computation. Set stages with PetscLogStagePush()
>> and PetscLogStagePop().
>>       %T - percent time in this phase         %F - percent flop in this
>> phase
>>       %M - percent messages in this phase     %L - percent message
>> lengths in this phase
>>       %R - percent reductions in this phase
>>    Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time
>> over all processors)
>>
>> ------------------------------------------------------------------------------------------------------------------------
>> Event                Count      Time (sec)     Flop
>>        --- Global ---  --- Stage ----  Total
>>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen
>>  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
>>
>> ------------------------------------------------------------------------------------------------------------------------
>>
>> --- Event Stage 0: Main Stage
>>
>> BuildTwoSided          1 1.0 3.1824e-05 1.0 0.00e+00 0.0 2.0e+00 4.0e+00
>> 1.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> MatMult              198 1.0 1.3480e+00 1.4 7.76e+08 1.0 4.0e+02 9.4e+04
>> 0.0e+00 19 35 97 54  0  19 35 97 54  0  1151
>> MatSolve             199 1.0 8.9666e-01 1.2 7.75e+08 1.0 0.0e+00 0.0e+00
>> 0.0e+00 13 35  0  0  0  13 35  0  0  0  1729
>> MatLUFactorNum         1 1.0 7.1852e-03 1.0 6.43e+06 1.0 0.0e+00 0.0e+00
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0  1789
>> MatILUFactorSym        1 1.0 1.0472e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> MatAssemblyBegin       1 1.0 9.8700e-07 1.4 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> MatAssemblyEnd         1 1.0 6.8341e-03 1.1 0.00e+00 0.0 4.0e+00 2.3e+04
>> 5.0e+00  0  0  1  0  1   0  0  1  0  1     0
>> MatGetRowIJ            1 1.0 1.9930e-06 6.3 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> MatGetOrdering         1 1.0 7.4472e-04 1.1 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> MatLoad                1 1.0 9.8562e-02 1.0 0.00e+00 0.0 1.0e+01 2.7e+06
>> 1.7e+01  2  0  2 39  4   2  0  2 39  4     0
>> VecDot               198 1.0 2.1037e+00 1.1 1.23e+08 1.0 0.0e+00 0.0e+00
>> 2.0e+02 33  5  0  0 43  33  5  0  0 45   117
>> VecDotNorm2           99 1.0 5.0169e-01 1.2 1.23e+08 1.0 0.0e+00 0.0e+00
>> 9.9e+01  7  5  0  0 22   7  5  0  0 22   489
>> VecNorm              100 1.0 1.3131e+00 1.0 6.20e+07 1.0 0.0e+00 0.0e+00
>> 1.0e+02 21  3  0  0 22  21  3  0  0 23    94
>> VecCopy                2 1.0 7.4971e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> VecSet               202 1.0 8.0035e-02 1.5 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
>> VecAXPBYCZ           198 1.0 1.2889e-01 1.5 2.46e+08 1.0 0.0e+00 0.0e+00
>> 0.0e+00  2 11  0  0  0   2 11  0  0  0  3811
>> VecWAXPY             198 1.0 9.1526e-02 1.0 1.23e+08 1.0 0.0e+00 0.0e+00
>> 0.0e+00  1  5  0  0  0   1  5  0  0  0  2683
>> VecLoad                2 1.0 9.8983e-03 1.0 0.00e+00 0.0 4.0e+00 1.2e+06
>> 1.6e+01  0  0  1  7  3   0  0  1  7  4     0
>> VecScatterBegin      198 1.0 1.2941e-03 1.0 0.00e+00 0.0 4.0e+02 9.4e+04
>> 0.0e+00  0  0 97 54  0   0  0 97 54  0     0
>> VecScatterEnd        198 1.0 7.6160e-01 1.7 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00 10  0  0  0  0  10  0  0  0  0     0
>> SFSetGraph             1 1.0 7.6630e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> SFSetUp                1 1.0 1.2410e-04 1.0 0.00e+00 0.0 4.0e+00 2.3e+04
>> 1.0e+00  0  0  1  0  0   0  0  1  0  0     0
>> SFPack               198 1.0 5.1814e-05 1.3 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> SFUnpack             198 1.0 3.8273e-05 1.3 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> KSPSetUp               2 1.0 4.7077e-03 1.1 0.00e+00 0.0 0.0e+00 0.0e+00
>> 2.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> KSPSolve               1 1.0 6.0344e+00 1.0 2.23e+09 1.0 4.0e+02 9.4e+04
>> 4.0e+02 98 100 97 54 87  98 100 97 54 90   738
>> PCSetUp                2 1.0 1.8496e-02 1.0 6.43e+06 1.0 0.0e+00 0.0e+00
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0   695
>> PCSetUpOnBlocks        1 1.0 1.8435e-02 1.0 6.43e+06 1.0 0.0e+00 0.0e+00
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0   697
>> PCApply              199 1.0 9.5103e-01 1.1 7.75e+08 1.0 0.0e+00 0.0e+00
>> 0.0e+00 15 35  0  0  0  15 35  0  0  0  1630
>>
>> --- Event Stage 1: Unknown
>>
>>
>> ------------------------------------------------------------------------------------------------------------------------
>>
>> Object Type          Creations   Destructions. Reports information only
>> for process 0.
>>
>> --- Event Stage 0: Main Stage
>>
>>               Viewer     4              1
>>               Matrix     5              1
>>               Vector    13              2
>>            Index Set     5              2
>>    Star Forest Graph     1              0
>>        Krylov Solver     2              0
>>       Preconditioner     2              0
>>
>> --- Event Stage 1: Unknown
>>
>>
>> ========================================================================================================================
>> Average time to get PetscTime(): 2.47e-08
>> Average time for MPI_Barrier(): 4.406e-07
>> Average time for zero size MPI_Send(): 4.769e-06
>> #PETSc Option Table entries:
>> -log_view # (source: command line)
>> #End of PETSc Option Table entries
>> Compiled without FORTRAN kernels
>> Compiled with full precision matrices (default)
>> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
>> sizeof(PetscScalar) 8 sizeof(PetscInt) 4
>> Configure options: --with-debugging=0 --prefix=/opt/anl/petsc-3.19.5
>> --download-mumps --download-scalapack --with-mpi=1
>> --with-mpi-dir=/opt/anl/mpich COPTFLAGS=-O2 CXXOPTFLAGS=-O2 FOPTFLAGS=-O2
>> -----------------------------------------
>> Libraries compiled on 2023-09-08 16:27:49 on ubuntu-office
>> Machine characteristics: Linux-6.2.0-26-generic-x86_64-with-glibc2.35
>> Using PETSc directory: /opt/anl/petsc-3.19.5
>> Using PETSc arch:
>> -----------------------------------------
>>
>> Using C compiler: /opt/anl/mpich/bin/mpicc  -fPIC -Wall -Wwrite-strings
>> -Wno-unknown-pragmas -Wno-lto-type-mismatch -Wno-stringop-overflow
>> -fstack-protector -fvisibility=hidden -O2
>> Using Fortran compiler: /opt/anl/mpich/bin/mpif90  -fPIC -Wall
>> -ffree-line-length-none -ffree-line-length-0 -Wno-lto-type-mismatch
>> -Wno-unused-dummy-argument -O2
>> -----------------------------------------
>>
>> Using include paths: -I/opt/anl/petsc-3.19.5/include
>> -I/opt/anl/mpich/include
>> -----------------------------------------
>>
>> Using C linker: /opt/anl/mpich/bin/mpicc
>> Using Fortran linker: /opt/anl/mpich/bin/mpif90
>> Using libraries: -Wl,-rpath,/opt/anl/petsc-3.19.5/lib
>> -L/opt/anl/petsc-3.19.5/lib -lpetsc -Wl,-rpath,/opt/anl/petsc-3.19.5/lib
>> -L/opt/anl/petsc-3.19.5/lib -Wl,-rpath,/opt/anl/mpich/lib
>> -L/opt/anl/mpich/lib -Wl,-rpath,/usr/lib/gcc/x86_64-linux-gnu/11
>> -L/usr/lib/gcc/x86_64-linux-gnu/11 -ldmumps -lmumps_common -lpord -lpthread
>> -lscalapack -llapack -lblas -lm -lX11 -lmpifort -lmpi -lgfortran -lm
>> -lgfortran -lm -lgcc_s -lquadmath -lstdc++ -lquadmath
>> -----------------------------------------
>>
>> Chris
>>
>>
>> On Fri, Sep 8, 2023 at 3:00 PM Barry Smith <bsmith at petsc.dev> wrote:
>>
>>>
>>>   It would be very helpful if you could run on 1 and 2 ranks with
>>> -log_view and send all the output.
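>>>
>>>   For example (a sketch; substitute your own launcher and binary name):
>>>     mpiexec -n 1 ./petsc-testing -log_view
>>>     mpiexec -n 2 ./petsc-testing -log_view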
>>>
>>>
>>>
>>> On Sep 8, 2023, at 4:52 PM, Chris Hewson <chris at resfrac.com> wrote:
>>>
>>> Hi There,
>>>
>>> I am trying to solve a linear problem and am running into an issue where
>>> KSPSolve slows down considerably the more MPI processes I add.
>>>
>>> The matrix itself is 620100 x 620100 with ~5 million non-zero entries. I
>>> am using PETSc version 3.19.5 and have tried a couple of different versions
>>> of MPICH, getting the same behavior (v4.1.2 w/ device ch4:ofi and v3.3.2 w/
>>> ch3:sock).
>>>
>>> In testing, I've noticed the following trend for speed for the KSPSolve
>>> function call:
>>> 1 core: 4042 ms
>>> 2 core: 7085 ms
>>> 4 core: 26573 ms
>>> 8 core: 65745 ms
>>> 16 core: 149283 ms
>>>
>>> This was all done on a single-node machine with 16 non-hyperthreaded
>>> cores. We solve quite a few different matrices with PETSc using MPI and
>>> haven't noticed an impact like this on performance before.
>>>
>>> I am very confused by this and a little stumped at the moment as to why
>>> it is happening. I've been using the KSPBCGS solver to solve the problem,
>>> and I have tried multiple different solvers and preconditioners (we usually
>>> don't use a preconditioner for this part of our code).
>>>
>>> Using the pipelined BCGS solver (KSPPIPEBCGS) did seem to improve the
>>> parallel speed slightly (maybe 15%), but it still doesn't come close to the
>>> single-process speed.
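>>>
>>> (For concreteness, the two solver variants correspond roughly to these
>>> PETSc runtime options; this is a sketch, not the exact configuration in our
>>> code:
>>>     -ksp_type bcgs       # stock BiCGStab
>>>     -ksp_type pipebcgs   # pipelined BiCGStab, overlaps the global reductions
>>> The preconditioner is chosen independently, e.g. -pc_type none.)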
>>>
>>> Here is a link to a folder that contains the specific A matrix and the x
>>> and b vectors for this problem, as well as the main.cpp file I was using for
>>> testing.
>>>
>>>
>>> https://drive.google.com/drive/folders/1CEDinKxu8ZbKpLtwmqKqP1ZIDG7JvDI1?usp=sharing
>>>
>>> I was testing this in our main code base (not included here) and observed
>>> very similar speed results to the ones above. We use METIS for graph
>>> partitioning in our own code, and I checked the vector and matrix
>>> partitioning there; it all made sense. I could be doing the partitioning
>>> incorrectly in this example, though (I'm not 100% sure how it works with the
>>> viewer/load functions).
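>>>
>>> In case it helps anyone reproducing this without the linked main.cpp, here
>>> is a minimal sketch of the load-and-solve path. The file names and the lack
>>> of error handling are illustrative assumptions, not the exact test code:
>>>
>>> #include <petscksp.h>
>>>
>>> int main(int argc, char **argv)
>>> {
>>>   Mat         A;
>>>   Vec         b, x;
>>>   KSP         ksp;
>>>   PetscViewer viewer;
>>>
>>>   PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));
>>>
>>>   /* Load the matrix; by default MatLoad gives each rank a contiguous
>>>      block of rows, so no METIS partitioning is applied here. */
>>>   PetscCall(PetscViewerBinaryOpen(PETSC_COMM_WORLD, "A.bin", FILE_MODE_READ, &viewer));
>>>   PetscCall(MatCreate(PETSC_COMM_WORLD, &A));
>>>   PetscCall(MatSetFromOptions(A));
>>>   PetscCall(MatLoad(A, viewer));
>>>   PetscCall(PetscViewerDestroy(&viewer));
>>>
>>>   /* Load the right-hand side the same way. */
>>>   PetscCall(PetscViewerBinaryOpen(PETSC_COMM_WORLD, "b.bin", FILE_MODE_READ, &viewer));
>>>   PetscCall(VecCreate(PETSC_COMM_WORLD, &b));
>>>   PetscCall(VecSetFromOptions(b));
>>>   PetscCall(VecLoad(b, viewer));
>>>   PetscCall(PetscViewerDestroy(&viewer));
>>>   PetscCall(VecDuplicate(b, &x));
>>>
>>>   /* BiCGStab; the preconditioner can be changed at run time, e.g. -pc_type none. */
>>>   PetscCall(KSPCreate(PETSC_COMM_WORLD, &ksp));
>>>   PetscCall(KSPSetOperators(ksp, A, A));
>>>   PetscCall(KSPSetType(ksp, KSPBCGS));
>>>   PetscCall(KSPSetFromOptions(ksp));
>>>   PetscCall(KSPSolve(ksp, b, x));   /* this is the call being timed */
>>>
>>>   PetscCall(KSPDestroy(&ksp));
>>>   PetscCall(MatDestroy(&A));
>>>   PetscCall(VecDestroy(&b));
>>>   PetscCall(VecDestroy(&x));
>>>   PetscCall(PetscFinalize());
>>>   return 0;
>>> }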
>>>
>>> Any insight or thoughts on this would be greatly appreciated.
>>>
>>> Thanks,
>>>
>>> *Chris Hewson*
>>> Senior Reservoir Simulation Engineer
>>> ResFrac
>>> +1.587.575.9792
>>>
>>>
>>>
>>