[petsc-users] Slower performance in multi-node system

Barry Smith bsmith at petsc.dev
Wed Feb 3 22:37:21 CST 2021



  https://www.mcs.anl.gov/petsc/documentation/faq.html#computers

  In particular, looking at the results of the parallel run I see:

Average time to get PetscTime(): 3.933e-07
Average time for MPI_Barrier(): 0.00498015
Average time for zero size MPI_Send(): 0.000194207

  So the times for communication are huge: 4.9 milliseconds for a synchronization of twenty processes. A millisecond is an eternity for parallel computing. It is not clear to me that this system is appropriate for tightly coupled parallel simulations.
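  For comparison, a zero-size MPI_Send over a properly configured InfiniBand fabric typically completes in a few microseconds, so 194 microseconds (and a 5 millisecond barrier) suggests the messages are not going over the fast interconnect, or that the MPI/container setup is getting in the way, rather than anything in PETSc. Below is a minimal MPI-only sketch of how such averages can be measured independently of PETSc; the repetition count and the choice of rank pair are arbitrary, and this is not necessarily how PETSc's own timing code works.

/* latency.c - rough, PETSc-independent reproduction of the "average time for
   MPI_Barrier / zero size MPI_Send" figures printed by -log_view.
   Build: mpicc latency.c -o latency    Run: mpiexec -n <p> ./latency */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
  int    rank, size, i, nreps = 1000;
  double t0, tbarrier, tsend = 0.0;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  /* Average MPI_Barrier time over nreps repetitions */
  MPI_Barrier(MPI_COMM_WORLD);
  t0 = MPI_Wtime();
  for (i = 0; i < nreps; i++) MPI_Barrier(MPI_COMM_WORLD);
  tbarrier = (MPI_Wtime() - t0) / nreps;

  /* Average zero-size send between ranks 0 and size-1: half the ping-pong
     round-trip time, so it includes one network traversal each way */
  if (size > 1 && (rank == 0 || rank == size - 1)) {
    int peer = (rank == 0) ? size - 1 : 0;
    t0 = MPI_Wtime();
    for (i = 0; i < nreps; i++) {
      if (rank == 0) {
        MPI_Send(NULL, 0, MPI_BYTE, peer, 0, MPI_COMM_WORLD);
        MPI_Recv(NULL, 0, MPI_BYTE, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      } else {
        MPI_Recv(NULL, 0, MPI_BYTE, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(NULL, 0, MPI_BYTE, peer, 0, MPI_COMM_WORLD);
      }
    }
    tsend = (MPI_Wtime() - t0) / nreps / 2.0;
  }

  if (rank == 0)
    printf("avg MPI_Barrier: %g s, avg zero-size MPI_Send: %g s\n", tbarrier, tsend);

  MPI_Finalize();
  return 0;
}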

  Barry




> On Feb 3, 2021, at 2:40 PM, Luciano Siqueira <luciano.siqueira at usp.br> wrote:
> 
> Here are the (attached) outputs of -log_view for both cases. The beginning of each file has some info from the libmesh app.
> 
> Running on 1 node, 32 cores: 01_node_log_view.txt
> 
> Running on 20 nodes, 32 cores each (640 cores in total): 20_node_log_view.txt
> 
> Thanks!
> 
> Luciano.
> 
> On 03/02/2021 16:43, Matthew Knepley wrote:
>> On Wed, Feb 3, 2021 at 2:42 PM Luciano Siqueira <luciano.siqueira at usp.br> wrote:
>> Hello,
>> 
>> I'm evaluating the performance of an application in a distributed 
>> environment and I notice that it's much slower when running on many 
>> nodes/cores than on a single node with fewer cores.
>> 
>> When running the application on 20 nodes, the Main Stage time reported 
>> in PETSc's log is up to 10 times longer than when running the same 
>> application on only 1 node, even with fewer cores per node.
>> 
>> The application I'm running is an example code provided by libmesh:
>> 
>> http://libmesh.github.io/examples/introduction_ex4.html
>> 
>> The application runs inside a Singularity container, with OpenMPI 4.0.3 
>> and PETSc 3.14.3. The distributed processes are managed by Slurm 
>> 17.02.11, and each node is equipped with two Intel Xeon E5-2695 v2 Ivy 
>> Bridge CPUs (12 cores @ 2.4 GHz) and 128 GB of RAM, with all 
>> communication going through InfiniBand.
>> 
>> My questions are: Is the slowdown expected? Should the application be 
>> specially tailored to work well in distributed environments?
>> 
>> Also, where (maybe in the PETSc documentation or source code) can I find 
>> information on how PETSc handles MPI communication? Do the KSP solvers 
>> favor one-to-one process communication over broadcast messages, or 
>> vice versa? I suspect inter-process communication is the cause of the 
>> poor performance on many nodes, but I wouldn't expect it to account for a slowdown this large.
>> 
>> Thank you in advance!
>> 
>> We can't say anything about the performance without some data. Please send us the output
>> of -log_view for both cases.
>> 
>>   Thanks,
>> 
>>      Matt
>>  
>> Luciano.
>> 
>> 
>> 
>> -- 
>> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
>> -- Norbert Wiener
>> 
>> https://www.cse.buffalo.edu/~knepley/
> <01_node_log_view.txt><20_node_log_view.txt>
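
  Regarding the -log_view request above: it is a runtime option, so no code changes to the libmesh example are needed; PetscInitialize() picks it up from the command line (or the PETSC_OPTIONS environment variable) and the performance summary, including the MPI_Barrier and zero-size MPI_Send averages quoted earlier when run on more than one process, is printed during PetscFinalize(). The sketch below is a self-contained stand-in, not the libmesh example itself; the 1D Laplacian, its size, the binary name, and the PETSc 3.14-style error handling are illustrative assumptions. Run it as, for example: mpiexec -n 32 ./minimal_ksp -ksp_type cg -log_view

static char help[] = "Tiny KSP solve used only to produce a -log_view summary.\n";

#include <petscksp.h>

int main(int argc, char **argv)
{
  Vec            x, b;
  Mat            A;
  KSP            ksp;
  PetscInt       i, Istart, Iend, n = 1000;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, help); if (ierr) return ierr;
  /* Assemble a 1D Laplacian distributed across the processes in PETSC_COMM_WORLD */
  ierr = MatCreate(PETSC_COMM_WORLD, &A); CHKERRQ(ierr);
  ierr = MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n); CHKERRQ(ierr);
  ierr = MatSetFromOptions(A); CHKERRQ(ierr);
  ierr = MatSetUp(A); CHKERRQ(ierr);
  ierr = MatGetOwnershipRange(A, &Istart, &Iend); CHKERRQ(ierr);
  for (i = Istart; i < Iend; i++) {
    ierr = MatSetValue(A, i, i, 2.0, INSERT_VALUES); CHKERRQ(ierr);
    if (i > 0)     { ierr = MatSetValue(A, i, i - 1, -1.0, INSERT_VALUES); CHKERRQ(ierr); }
    if (i < n - 1) { ierr = MatSetValue(A, i, i + 1, -1.0, INSERT_VALUES); CHKERRQ(ierr); }
  }
  ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY); CHKERRQ(ierr);
  ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY); CHKERRQ(ierr);
  ierr = MatCreateVecs(A, &x, &b); CHKERRQ(ierr);
  ierr = VecSet(b, 1.0); CHKERRQ(ierr);
  /* Solve Ax = b with whatever -ksp_type / -pc_type is given on the command line */
  ierr = KSPCreate(PETSC_COMM_WORLD, &ksp); CHKERRQ(ierr);
  ierr = KSPSetOperators(ksp, A, A); CHKERRQ(ierr);
  ierr = KSPSetFromOptions(ksp); CHKERRQ(ierr);
  ierr = KSPSolve(ksp, b, x); CHKERRQ(ierr);
  ierr = KSPDestroy(&ksp); CHKERRQ(ierr);
  ierr = VecDestroy(&x); CHKERRQ(ierr);
  ierr = VecDestroy(&b); CHKERRQ(ierr);
  ierr = MatDestroy(&A); CHKERRQ(ierr);
  /* With -log_view, the summary (stages, events, MPI timing averages) prints here */
  ierr = PetscFinalize();
  return ierr;
}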

