[petsc-users] Slower performance in multi-node system
Luciano Siqueira
luciano.siqueira at usp.br
Wed Feb 3 13:41:10 CST 2021
Hello,
I'm evaluating the performance of an application in a distributed
environment, and I notice that it is much slower when running on many
nodes/cores than on a single node with fewer cores.
When running the application on 20 nodes, the Main Stage time reported
in PETSc's log is up to 10 times longer than when running the same
application on a single node, even with fewer cores per node.
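For context, per-stage timings like this come from PETSc's `-log_view` option. A minimal sketch of how the two runs might be captured for comparison (the executable name `ex4` and the process counts are placeholders, not taken from the original report):

```shell
# Single-node run: 24 MPI ranks on one node, with PETSc performance logging.
mpirun -np 24 ./ex4 -log_view > log_1node.txt

# Multi-node run: 20 nodes; Slurm's srun could be used instead of mpirun.
mpirun -np 240 ./ex4 -log_view > log_20nodes.txt

# Compare the "Main Stage" summary and the VecScatter/MPI message counts
# reported near the end of each log.
grep -A2 "Main Stage" log_1node.txt log_20nodes.txt
```

Comparing the event breakdown (not just the total) in the two logs usually shows whether the extra time is in communication-heavy events such as `VecScatter` or `MatMult`.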
The application I'm running is an example code provided by libmesh:
http://libmesh.github.io/examples/introduction_ex4.html
The application runs inside a Singularity container with OpenMPI 4.0.3
and PETSc 3.14.3. The distributed processes are managed by Slurm
17.02.11, and each node is equipped with two Intel Xeon E5-2695v2 Ivy
Bridge CPUs (12 cores @ 2.4 GHz) and 128 GB of RAM, with all
communication going through InfiniBand.
My questions are: Is this slowdown expected? Does the application need
to be specially tailored to work well in distributed environments?
Also, where (perhaps in the PETSc documentation or source code) can I
find information on how PETSc handles MPI communication? Do the KSP
solvers favor point-to-point communication over broadcast messages, or
vice versa? I suspect inter-process communication is the cause of the
poor performance when using many nodes, but I wouldn't expect it to
cost as much as I'm seeing.
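One diagnostic PETSc itself provides for questions like this is its STREAMS memory-bandwidth benchmark, since sparse solvers are typically bandwidth-bound and per-node bandwidth saturates well below the full core count. A hedged sketch of how it can be run (assuming `PETSC_DIR` points at the PETSc source tree used to build the library):

```shell
# From the PETSc source directory, run the STREAMS benchmark with an
# increasing number of MPI processes (up to 24 here, matching the 2x12
# cores per node) and report the achieved memory bandwidth.
cd "$PETSC_DIR"
make streams NPMAX=24
```

If the reported bandwidth stops scaling after a handful of processes per node, adding more cores per node will not speed up the solver, and spreading ranks across nodes changes the picture mainly through network latency in reductions (`VecDot`, `VecNorm`) and halo exchanges.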
Thank you in advance!
Luciano.