<div dir="ltr">Thanks for the help.<div><br></div><div>Yes, running the example again with core binding, I get results similar to Junchao's:</div><div>1 core: 2808 ms<br></div><div>2 core: 1398 ms<br>4 core: 989 ms<br>8 core: 1083 ms<br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Sep 8, 2023 at 4:39 PM Junchao Zhang <<a href="mailto:junchao.zhang@gmail.com">junchao.zhang@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div>Hi, Chris,</div><div> This is my result of running your test on a local machine (8 cores, 16 hardware threads). I use OpenMPI with a command line like</div><div> mpirun --bind-to core -n &lt;np&gt; ./main<br></div><div>np=1 Solve duration: 2132<br></div><div>np=2 Solve duration: 1116<br></div><div>np=4 Solve duration: 990<br></div><div>np=8 Solve duration: 1257<br></div><div><br></div><div>Note that when I ran without binding, I got </div><div>$ mpirun -n 4 ./main<br>Solve duration: 22693<br></div><div><br></div><div>This suggests that binding is important. </div><br clear="all"><div><div dir="ltr" class="gmail_signature"><div dir="ltr">--Junchao Zhang</div></div></div><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Sep 8, 2023 at 4:53 PM Barry Smith <<a href="mailto:bsmith@petsc.dev" target="_blank">bsmith@petsc.dev</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div><div><br></div><div>For the size of problem you are running, this is very unexpected. The reductions should only start to dominate for thousands of MPI ranks, not 2.</div><div><br></div><div>The first thing I recommend is to run the streams benchmark. Then check the binding that MPI is doing for the two processes. You want to bind to cores in different NUMA regions. 
It could be that it is binding both processes to cores that share the same cache. The MatSolve should be embarrassingly parallel, but you are getting almost no speedup in it, so something very "wrong" is happening.</div><div><br></div><div><br></div><div><br></div><div><font face="Menlo">Event Count Time (sec) Flop --- Global --- --- Stage ---- Total</font></div><div><font face="Menlo"> Max Ratio Max Ratio Max Ratio Mess AvgLen Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s</font></div><div><font face="Andale Mono"><br></font></div><div><font face="Menlo">VecDot 182 1.0 1.9998e-01 1.0 2.26e+08 1.0 0.0e+00 0.0e+00 0.0e+00 7 5 0 0 0 7 5 0 0 0 1129</font></div><div><font face="Menlo">VecDotNorm2 91 1.0 6.6214e-02 1.0 2.26e+08 1.0 0.0e+00 0.0e+00 0.0e+00 2 5 0 0 0 2 5 0 0 0 3409</font></div><div><font face="Menlo">VecNorm 92 1.0 1.4790e-01 1.0 1.14e+08 1.0 0.0e+00 0.0e+00 0.0e+00 5 3 0 0 0 5 3 0 0 0 771</font></div><div><font face="Menlo"><br></font></div><div><div><font face="Menlo">VecDot 198 1.0 2.1037e+00 1.1 1.23e+08 1.0 0.0e+00 0.0e+00 2.0e+02 33 5 0 0 43 33 5 0 0 45 117</font></div><div><font face="Menlo">VecDotNorm2 99 1.0 5.0169e-01 1.2 1.23e+08 1.0 0.0e+00 0.0e+00 9.9e+01 7 5 0 0 22 7 5 0 0 22 489</font></div><div><font face="Menlo">VecNorm 100 1.0 1.3131e+00 1.0 6.20e+07 1.0 0.0e+00 0.0e+00 1.0e+02 21 3 0 0 22 21 3 0 0 23 94</font></div></div><div><font face="Menlo">VecScatterEnd 198 1.0 7.6160e-01 1.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 10 0 0 0 0 10 0 0 0 0 0</font></div><div><br></div><div>MatSolve 183 1.0 1.0882e+00 1.0 1.43e+09 1.0 0.0e+00 0.0e+00 0.0e+00 40 35 0 0 0 40 35 0 0 0 1318</div><div>MatSolve 199 1.0 8.9666e-01 1.2 7.75e+08 1.0 0.0e+00 0.0e+00 0.0e+00 13 35 0 0 0 13 35 0 0 0 1729</div><div><br></div><div><br><blockquote type="cite"><div>On Sep 8, 2023, at 5:22 PM, Chris Hewson <<a href="mailto:chris@resfrac.com" target="_blank">chris@resfrac.com</a>> wrote:</div><br><div><div dir="ltr">Thanks for the quick response.<div><br></div><div>The links 
to the log view files are below:</div><div>2 ranks:<br><a href="https://drive.google.com/file/d/1PGRsiHypWtN5h3uxdJBKy9WzEkE0mUgO/view?usp=drive_link" target="_blank">https://drive.google.com/file/d/1PGRsiHypWtN5h3uxdJBKy9WzEkE0mUgO/view?usp=drive_link</a><br><br>1 rank:<br><a href="https://drive.google.com/file/d/1hB2XyoNtLMHseZUT7jCuiixTQeBi_tjJ/view?usp=drive_link" target="_blank">https://drive.google.com/file/d/1hB2XyoNtLMHseZUT7jCuiixTQeBi_tjJ/view?usp=drive_link</a><br></div><div><br></div><div>I'll also attach them to this email:</div><div>**************************** 1 RANK ******************************</div><div>------------------------------------------------------------------ PETSc Performance Summary: ------------------------------------------------------------------<br><br>./petsc-testing on a named ubuntu-office with 1 processor, by chewson Fri Sep 8 15:16:51 2023<br>Using Petsc Release Version 3.19.5, unknown <br><br> Max Max/Min Avg Total<br>Time (sec): 2.746e+00 1.000 2.746e+00<br>Objects: 2.100e+01 1.000 2.100e+01<br>Flops: 4.117e+09 1.000 4.117e+09 4.117e+09<br>Flops/sec: 1.499e+09 1.000 1.499e+09 1.499e+09<br>MPI Msg Count: 0.000e+00 0.000 0.000e+00 0.000e+00<br>MPI Msg Len (bytes): 0.000e+00 0.000 0.000e+00 0.000e+00<br>MPI Reductions: 0.000e+00 0.000<br><br>Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)<br> e.g., VecAXPY() for real vectors of length N --> 2N flops<br> and VecAXPY() for complex vectors of length N --> 8N flops<br><br>Summary of Stages: ----- Time ------ ----- Flop ------ --- Messages --- -- Message Lengths -- -- Reductions --<br> Avg %Total Avg %Total Count %Total Avg %Total Count %Total<br> 0: Main Stage: 2.7458e+00 100.0% 4.1167e+09 100.0% 0.000e+00 0.0% 0.000e+00 0.0% 0.000e+00 0.0%<br><br>------------------------------------------------------------------------------------------------------------------------<br>See the 'Profiling' chapter of the users' manual for details on 
interpreting output.<br>Phase summary info:<br> Count: number of times phase was executed<br> Time and Flop: Max - maximum over all processors<br> Ratio - ratio of maximum to minimum over all processors<br> Mess: number of messages sent<br> AvgLen: average message length (bytes)<br> Reduct: number of global reductions<br> Global: entire computation<br> Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().<br> %T - percent time in this phase %F - percent flop in this phase<br> %M - percent messages in this phase %L - percent message lengths in this phase<br> %R - percent reductions in this phase<br> Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over all processors)<br>------------------------------------------------------------------------------------------------------------------------<br>Event Count Time (sec) Flop --- Global --- --- Stage ---- Total<br> Max Ratio Max Ratio Max Ratio Mess AvgLen Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s<br>------------------------------------------------------------------------------------------------------------------------<br><br>--- Event Stage 0: Main Stage<br><br>MatMult 182 1.0 8.0351e-01 1.0 1.43e+09 1.0 0.0e+00 0.0e+00 0.0e+00 29 35 0 0 0 29 35 0 0 0 1775<br>MatSolve 183 1.0 1.0882e+00 1.0 1.43e+09 1.0 0.0e+00 0.0e+00 0.0e+00 40 35 0 0 0 40 35 0 0 0 1318<br>MatLUFactorNum 1 1.0 1.3892e-02 1.0 1.30e+07 1.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 934<br>MatILUFactorSym 1 1.0 2.1567e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0<br>MatAssemblyBegin 1 1.0 1.0420e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0<br>MatAssemblyEnd 1 1.0 6.9049e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0<br>MatGetRowIJ 1 1.0 3.8500e-07 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0<br>MatGetOrdering 1 1.0 1.7026e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0<br>MatLoad 1 1.0 6.6749e-02 
1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 2 0 0 0 0 2 0 0 0 0 0<br>VecDot 182 1.0 1.9998e-01 1.0 2.26e+08 1.0 0.0e+00 0.0e+00 0.0e+00 7 5 0 0 0 7 5 0 0 0 1129<br>VecDotNorm2 91 1.0 6.6214e-02 1.0 2.26e+08 1.0 0.0e+00 0.0e+00 0.0e+00 2 5 0 0 0 2 5 0 0 0 3409<br>VecNorm 92 1.0 1.4790e-01 1.0 1.14e+08 1.0 0.0e+00 0.0e+00 0.0e+00 5 3 0 0 0 5 3 0 0 0 771<br>VecCopy 2 1.0 6.8473e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0<br>VecSet 3 1.0 1.3256e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0<br>VecAXPBYCZ 182 1.0 1.6542e-01 1.0 4.51e+08 1.0 0.0e+00 0.0e+00 0.0e+00 6 11 0 0 0 6 11 0 0 0 2729<br>VecWAXPY 182 1.0 1.4476e-01 1.0 2.26e+08 1.0 0.0e+00 0.0e+00 0.0e+00 5 5 0 0 0 5 5 0 0 0 1559<br>VecLoad 2 1.0 1.0104e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0<br>KSPSetUp 1 1.0 9.9204e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0<br>KSPSolve 1 1.0 2.6210e+00 1.0 4.10e+09 1.0 0.0e+00 0.0e+00 0.0e+00 95 100 0 0 0 95 100 0 0 0 1566<br>PCSetUp 1 1.0 3.7232e-02 1.0 1.30e+07 1.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 349<br>PCApply 183 1.0 1.0885e+00 1.0 1.43e+09 1.0 0.0e+00 0.0e+00 0.0e+00 40 35 0 0 0 40 35 0 0 0 1318<br><br>--- Event Stage 1: Unknown<br><br>------------------------------------------------------------------------------------------------------------------------<br><br>Object Type Creations Destructions. 
Reports information only for process 0.<br><br>--- Event Stage 0: Main Stage<br><br> Viewer 4 1<br> Matrix 3 1<br> Vector 9 1<br> Krylov Solver 1 0<br> Preconditioner 1 0<br> Index Set 3 0<br><br>--- Event Stage 1: Unknown<br><br>========================================================================================================================<br>Average time to get PetscTime(): 1.51e-08<br>#PETSc Option Table entries:<br>-log_view # (source: command line)<br>#End of PETSc Option Table entries<br>Compiled without FORTRAN kernels<br>Compiled with full precision matrices (default)<br>sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8 sizeof(PetscInt) 4<br>Configure options: --with-debugging=0 --prefix=/opt/anl/petsc-3.19.5 --download-mumps --download-scalapack --with-mpi=1 --with-mpi-dir=/opt/anl/mpich COPTFLAGS=-O2 CXXOPTFLAGS=-O2 FOPTFLAGS=-O2<br>-----------------------------------------<br>Libraries compiled on 2023-09-08 16:27:49 on ubuntu-office <br>Machine characteristics: Linux-6.2.0-26-generic-x86_64-with-glibc2.35<br>Using PETSc directory: /opt/anl/petsc-3.19.5<br>Using PETSc arch: <br>-----------------------------------------<br><br>Using C compiler: /opt/anl/mpich/bin/mpicc -fPIC -Wall -Wwrite-strings -Wno-unknown-pragmas -Wno-lto-type-mismatch -Wno-stringop-overflow -fstack-protector -fvisibility=hidden -O2 <br>Using Fortran compiler: /opt/anl/mpich/bin/mpif90 -fPIC -Wall -ffree-line-length-none -ffree-line-length-0 -Wno-lto-type-mismatch -Wno-unused-dummy-argument -O2 <br>-----------------------------------------<br><br>Using include paths: -I/opt/anl/petsc-3.19.5/include -I/opt/anl/mpich/include<br>-----------------------------------------<br><br>Using C linker: /opt/anl/mpich/bin/mpicc<br>Using Fortran linker: /opt/anl/mpich/bin/mpif90<br>Using libraries: -Wl,-rpath,/opt/anl/petsc-3.19.5/lib -L/opt/anl/petsc-3.19.5/lib -lpetsc -Wl,-rpath,/opt/anl/petsc-3.19.5/lib -L/opt/anl/petsc-3.19.5/lib 
-Wl,-rpath,/opt/anl/mpich/lib -L/opt/anl/mpich/lib -Wl,-rpath,/usr/lib/gcc/x86_64-linux-gnu/11 -L/usr/lib/gcc/x86_64-linux-gnu/11 -ldmumps -lmumps_common -lpord -lpthread -lscalapack -llapack -lblas -lm -lX11 -lmpifort -lmpi -lgfortran -lm -lgfortran -lm -lgcc_s -lquadmath -lstdc++ -lquadmath<br>-----------------------------------------<br></div><div><br></div><div>************************* 2 RANKS ************************************</div><div>------------------------------------------------------------------ PETSc Performance Summary: ------------------------------------------------------------------<br><br>./petsc-testing on a named ubuntu-office with 2 processors, by chewson Fri Sep 8 15:16:43 2023<br>Using Petsc Release Version 3.19.5, unknown <br><br> Max Max/Min Avg Total<br>Time (sec): 6.167e+00 1.001 6.164e+00<br>Objects: 3.200e+01 1.000 3.200e+01<br>Flops: 2.233e+09 1.000 2.233e+09 4.467e+09<br>Flops/sec: 3.625e+08 1.001 3.623e+08 7.247e+08<br>MPI Msg Count: 2.050e+02 1.000 2.050e+02 4.100e+02<br>MPI Msg Len (bytes): 3.437e+07 1.000 1.676e+05 6.874e+07<br>MPI Reductions: 4.580e+02 1.000<br><br>Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)<br> e.g., VecAXPY() for real vectors of length N --> 2N flops<br> and VecAXPY() for complex vectors of length N --> 8N flops<br><br>Summary of Stages: ----- Time ------ ----- Flop ------ --- Messages --- -- Message Lengths -- -- Reductions --<br> Avg %Total Avg %Total Count %Total Avg %Total Count %Total<br> 0: Main Stage: 6.1642e+00 100.0% 4.4670e+09 100.0% 4.100e+02 100.0% 1.676e+05 100.0% 4.400e+02 96.1%<br><br>------------------------------------------------------------------------------------------------------------------------<br>See the 'Profiling' chapter of the users' manual for details on interpreting output.<br>Phase summary info:<br> Count: number of times phase was executed<br> Time and Flop: Max - maximum over all processors<br> Ratio - ratio of maximum to 
minimum over all processors<br> Mess: number of messages sent<br> AvgLen: average message length (bytes)<br> Reduct: number of global reductions<br> Global: entire computation<br> Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().<br> %T - percent time in this phase %F - percent flop in this phase<br> %M - percent messages in this phase %L - percent message lengths in this phase<br> %R - percent reductions in this phase<br> Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over all processors)<br>------------------------------------------------------------------------------------------------------------------------<br>Event Count Time (sec) Flop --- Global --- --- Stage ---- Total<br> Max Ratio Max Ratio Max Ratio Mess AvgLen Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s<br>------------------------------------------------------------------------------------------------------------------------<br><br>--- Event Stage 0: Main Stage<br><br>BuildTwoSided 1 1.0 3.1824e-05 1.0 0.00e+00 0.0 2.0e+00 4.0e+00 1.0e+00 0 0 0 0 0 0 0 0 0 0 0<br>MatMult 198 1.0 1.3480e+00 1.4 7.76e+08 1.0 4.0e+02 9.4e+04 0.0e+00 19 35 97 54 0 19 35 97 54 0 1151<br>MatSolve 199 1.0 8.9666e-01 1.2 7.75e+08 1.0 0.0e+00 0.0e+00 0.0e+00 13 35 0 0 0 13 35 0 0 0 1729<br>MatLUFactorNum 1 1.0 7.1852e-03 1.0 6.43e+06 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 1789<br>MatILUFactorSym 1 1.0 1.0472e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0<br>MatAssemblyBegin 1 1.0 9.8700e-07 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0<br>MatAssemblyEnd 1 1.0 6.8341e-03 1.1 0.00e+00 0.0 4.0e+00 2.3e+04 5.0e+00 0 0 1 0 1 0 0 1 0 1 0<br>MatGetRowIJ 1 1.0 1.9930e-06 6.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0<br>MatGetOrdering 1 1.0 7.4472e-04 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0<br>MatLoad 1 1.0 9.8562e-02 1.0 0.00e+00 0.0 1.0e+01 2.7e+06 1.7e+01 2 0 2 39 4 2 0 2 39 4 0<br>VecDot 
198 1.0 2.1037e+00 1.1 1.23e+08 1.0 0.0e+00 0.0e+00 2.0e+02 33 5 0 0 43 33 5 0 0 45 117<br>VecDotNorm2 99 1.0 5.0169e-01 1.2 1.23e+08 1.0 0.0e+00 0.0e+00 9.9e+01 7 5 0 0 22 7 5 0 0 22 489<br>VecNorm 100 1.0 1.3131e+00 1.0 6.20e+07 1.0 0.0e+00 0.0e+00 1.0e+02 21 3 0 0 22 21 3 0 0 23 94<br>VecCopy 2 1.0 7.4971e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0<br>VecSet 202 1.0 8.0035e-02 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0<br>VecAXPBYCZ 198 1.0 1.2889e-01 1.5 2.46e+08 1.0 0.0e+00 0.0e+00 0.0e+00 2 11 0 0 0 2 11 0 0 0 3811<br>VecWAXPY 198 1.0 9.1526e-02 1.0 1.23e+08 1.0 0.0e+00 0.0e+00 0.0e+00 1 5 0 0 0 1 5 0 0 0 2683<br>VecLoad 2 1.0 9.8983e-03 1.0 0.00e+00 0.0 4.0e+00 1.2e+06 1.6e+01 0 0 1 7 3 0 0 1 7 4 0<br>VecScatterBegin 198 1.0 1.2941e-03 1.0 0.00e+00 0.0 4.0e+02 9.4e+04 0.0e+00 0 0 97 54 0 0 0 97 54 0 0<br>VecScatterEnd 198 1.0 7.6160e-01 1.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 10 0 0 0 0 10 0 0 0 0 0<br>SFSetGraph 1 1.0 7.6630e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0<br>SFSetUp 1 1.0 1.2410e-04 1.0 0.00e+00 0.0 4.0e+00 2.3e+04 1.0e+00 0 0 1 0 0 0 0 1 0 0 0<br>SFPack 198 1.0 5.1814e-05 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0<br>SFUnpack 198 1.0 3.8273e-05 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0<br>KSPSetUp 2 1.0 4.7077e-03 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00 0 0 0 0 0 0 0 0 0 0 0<br>KSPSolve 1 1.0 6.0344e+00 1.0 2.23e+09 1.0 4.0e+02 9.4e+04 4.0e+02 98 100 97 54 87 98 100 97 54 90 738<br>PCSetUp 2 1.0 1.8496e-02 1.0 6.43e+06 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 695<br>PCSetUpOnBlocks 1 1.0 1.8435e-02 1.0 6.43e+06 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 697<br>PCApply 199 1.0 9.5103e-01 1.1 7.75e+08 1.0 0.0e+00 0.0e+00 0.0e+00 15 35 0 0 0 15 35 0 0 0 1630<br><br>--- Event Stage 1: 
Unknown<br><br>------------------------------------------------------------------------------------------------------------------------<br><br>Object Type Creations Destructions. Reports information only for process 0.<br><br>--- Event Stage 0: Main Stage<br><br> Viewer 4 1<br> Matrix 5 1<br> Vector 13 2<br> Index Set 5 2<br> Star Forest Graph 1 0<br> Krylov Solver 2 0<br> Preconditioner 2 0<br><br>--- Event Stage 1: Unknown<br><br>========================================================================================================================<br>Average time to get PetscTime(): 2.47e-08<br>Average time for MPI_Barrier(): 4.406e-07<br>Average time for zero size MPI_Send(): 4.769e-06<br>#PETSc Option Table entries:<br>-log_view # (source: command line)<br>#End of PETSc Option Table entries<br>Compiled without FORTRAN kernels<br>Compiled with full precision matrices (default)<br>sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8 sizeof(PetscInt) 4<br>Configure options: --with-debugging=0 --prefix=/opt/anl/petsc-3.19.5 --download-mumps --download-scalapack --with-mpi=1 --with-mpi-dir=/opt/anl/mpich COPTFLAGS=-O2 CXXOPTFLAGS=-O2 FOPTFLAGS=-O2<br>-----------------------------------------<br>Libraries compiled on 2023-09-08 16:27:49 on ubuntu-office <br>Machine characteristics: Linux-6.2.0-26-generic-x86_64-with-glibc2.35<br>Using PETSc directory: /opt/anl/petsc-3.19.5<br>Using PETSc arch: <br>-----------------------------------------<br><br>Using C compiler: /opt/anl/mpich/bin/mpicc -fPIC -Wall -Wwrite-strings -Wno-unknown-pragmas -Wno-lto-type-mismatch -Wno-stringop-overflow -fstack-protector -fvisibility=hidden -O2 <br>Using Fortran compiler: /opt/anl/mpich/bin/mpif90 -fPIC -Wall -ffree-line-length-none -ffree-line-length-0 -Wno-lto-type-mismatch -Wno-unused-dummy-argument -O2 <br>-----------------------------------------<br><br>Using include paths: -I/opt/anl/petsc-3.19.5/include 
-I/opt/anl/mpich/include<br>-----------------------------------------<br><br>Using C linker: /opt/anl/mpich/bin/mpicc<br>Using Fortran linker: /opt/anl/mpich/bin/mpif90<br>Using libraries: -Wl,-rpath,/opt/anl/petsc-3.19.5/lib -L/opt/anl/petsc-3.19.5/lib -lpetsc -Wl,-rpath,/opt/anl/petsc-3.19.5/lib -L/opt/anl/petsc-3.19.5/lib -Wl,-rpath,/opt/anl/mpich/lib -L/opt/anl/mpich/lib -Wl,-rpath,/usr/lib/gcc/x86_64-linux-gnu/11 -L/usr/lib/gcc/x86_64-linux-gnu/11 -ldmumps -lmumps_common -lpord -lpthread -lscalapack -llapack -lblas -lm -lX11 -lmpifort -lmpi -lgfortran -lm -lgfortran -lm -lgcc_s -lquadmath -lstdc++ -lquadmath<br>-----------------------------------------<br><br></div><div>Chris</div><div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Sep 8, 2023 at 3:00 PM Barry Smith <<a href="mailto:bsmith@petsc.dev" target="_blank">bsmith@petsc.dev</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div><div><br></div> It would be very helpful if you could run on 1 and 2 ranks with -log_view and send all the output.<div><br></div><div> <br><div><br><blockquote type="cite"><div>On Sep 8, 2023, at 4:52 PM, Chris Hewson <<a href="mailto:chris@resfrac.com" target="_blank">chris@resfrac.com</a>> wrote:</div><br><div><div dir="ltr">Hi there,<div><br></div><div>I am trying to solve a linear problem and am seeing KSPSolve slow down considerably as I add more MPI processes.</div><div><br></div><div>The matrix is 620100 x 620100 with ~5 million nonzero entries. I am using PETSc version 3.19.5 and have tried a couple of different versions of MPICH, with the same behavior in each case (v4.1.2 w/ device ch4:ofi and v3.3.2 w/ ch3:sock).</div><div><br></div><div>In testing, I've noticed the following trend in the time taken by the KSPSolve function call:</div><div>1 core: 4042 ms<br>2 core: 7085 ms<br>4 core: 26573 ms<br>8 
core: 65745 ms<br>16 core: 149283 ms<br></div><div><br></div><div>This was all done on a single-node machine with 16 non-hyperthreaded cores. We solve quite a few different matrices with PETSc using MPI and haven't noticed an impact like this on performance before.</div><div><br></div><div>I am confused by this and, at the moment, stumped as to why it is happening. I've been using the KSPBCGS solver to solve the problem. I have tried multiple different solvers and preconditioners (we usually don't use a preconditioner for this part of our code). </div><div><br></div><div>Using the pipelined BCGS solver did seem to improve the parallel speed slightly (maybe 15%), but it still doesn't come close to the single-threaded speed. </div><div><br></div><div>I'll attach a link to a folder that contains the specific A matrix and the x and b vectors for this problem, as well as the main.cpp file that I was using for testing. </div><div><br></div><div><a href="https://drive.google.com/drive/folders/1CEDinKxu8ZbKpLtwmqKqP1ZIDG7JvDI1?usp=sharing" target="_blank">https://drive.google.com/drive/folders/1CEDinKxu8ZbKpLtwmqKqP1ZIDG7JvDI1?usp=sharing</a><br></div><div><br></div><div>I was originally testing this in our main code base, which I haven't included here, and observed timings very similar to the ones above. We use METIS for graph partitioning in our own code; we checked the vector and matrix partitioning, and it all made sense. I could be doing the partitioning incorrectly in the example (I'm not 100% sure how it works with the viewer/load functions).</div><div><br></div><div>Any insight or thoughts on this would be greatly appreciated.</div><div><br></div><div>Thanks,</div><div><div><div dir="ltr" class="gmail_signature"><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><b><br></b></div><div dir="ltr"><b>Chris Hewson</b><div>Senior Reservoir Simulation Engineer</div><div>ResFrac</div><div>+1.587.575.9792</div></div></div></div></div></div></div></div></div></div>
</div></blockquote></div><br></div></div></blockquote></div>
</div></blockquote></div><br></div></blockquote></div>
</blockquote></div>