[petsc-dev] bad cpu/MPI performance problem

Mark Adams mfadams at lbl.gov
Sun Jan 8 14:56:03 CST 2023


>
>
> The code is four years old? How come this problem of calling the
> constructor on a subset of ranks hasn't come up since day 1?
>

No, this is a new code that runs fine.
It is just calling PETSc's Plex/GMG solvers with GAMG as the coarse grid
solver.
High level code.

This looks like an MPI bug ... I have moved to using 32 cores per node to
see if that fixes the issue.

TL;DR
Crusher recently changed the default maximum number of cores that can be
used per node to 56 (7 of 8 per GCD).
You can bypass that with "-S 0", which I did here.
I can actually do my study with 2x the nodes and 1/2 the cores per node, so
I am doing that.
That seems to be running well up to 32 nodes, but my one 128-node job that
ran today timed out, and it looks like it is fouling up internally somehow.
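
As a quick sanity check on the numbers above, the following sketch works out the rank counts. It assumes one MPI rank per core, which the thread does not state explicitly but which matches the "4K processes" mentioned in the quoted message; the helper name is hypothetical.

```python
# Illustrative arithmetic only; total_ranks is a hypothetical helper.
CORES_PER_NODE = 64          # cores per Crusher node, as stated in the thread
DEFAULT_USABLE_CORES = 56    # new default per-node limit mentioned above


def total_ranks(nodes: int, ranks_per_node: int) -> int:
    """Total MPI ranks when every node runs the same number of ranks."""
    return nodes * ranks_per_node


# Assumption: one MPI rank per core in both configurations.
full_nodes = total_ranks(64, 64)    # 64 nodes at 64 cores per node
half_nodes = total_ranks(128, 32)   # 2x the nodes, 1/2 the cores per node

# Fraction of each node's cores held back by the new default.
reserved_fraction = 1 - DEFAULT_USABLE_CORES / CORES_PER_NODE

print(full_nodes, half_nodes, reserved_fraction)
```

Under that assumption the 128-node, 32-core job runs the same number of ranks as the 64-node, 64-core job, so only the placement of ranks on nodes differs between the two.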

Thanks,


>
> On Jan 8, 2023, at 12:21 PM, Mark Adams <mfadams at lbl.gov> wrote:
>
> I am running on Crusher, CPU only, 64 cores per node with Plex/PetscFE.
> In going up to 64 nodes, something really catastrophic is happening.
> I understand I am not using the machine the way it was intended, but I
> just want to see if there are any options that I could try for a quick
> fix/help.
>
> In a debug build I get a stack trace on many but not all of the 4K
> processes.
> Alas, I am not sure why this job was terminated but every process that I
> checked, that had an "ERROR", had this stack:
>
> 11:57 main *+= crusher:/gpfs/alpine/csc314/scratch/adams/mg-m3dc1/src/data$ grep ERROR slurm-245063.out |g 3160
> [3160]PETSC ERROR: ------------------------------------------------------------------------
> [3160]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end
> [3160]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
> [3160]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind and https://petsc.org/release/faq/
> [3160]PETSC ERROR: ---------------------  Stack Frames ------------------------------------
> [3160]PETSC ERROR: The line numbers in the error traceback are not always exact.
> [3160]PETSC ERROR: #1 MPI function
> [3160]PETSC ERROR: #2 PetscCommDuplicate() at /gpfs/alpine/csc314/scratch/adams/petsc/src/sys/objects/tagm.c:248
> [3160]PETSC ERROR: #3 PetscHeaderCreate_Private() at /gpfs/alpine/csc314/scratch/adams/petsc/src/sys/objects/inherit.c:56
> [3160]PETSC ERROR: #4 PetscSFCreate() at /gpfs/alpine/csc314/scratch/adams/petsc/src/vec/is/sf/interface/sf.c:65
> [3160]PETSC ERROR: #5 DMLabelGather() at /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/label/dmlabel.c:1932
> [3160]PETSC ERROR: #6 DMPlexLabelComplete_Internal() at /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/impls/plex/plexsubmesh.c:177
> [3160]PETSC ERROR: #7 DMPlexLabelComplete() at /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/impls/plex/plexsubmesh.c:227
> [3160]PETSC ERROR: #8 DMCompleteBCLabels_Internal() at /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/interface/dm.c:5301
> [3160]PETSC ERROR: #9 DMCopyDS() at /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/interface/dm.c:6117
> [3160]PETSC ERROR: #10 DMCopyDisc() at /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/interface/dm.c:6143
> [3160]PETSC ERROR: #11 SetupDiscretization() at /gpfs/alpine/csc314/scratch/adams/mg-m3dc1/src/mhd_2field.c:755
>
> Maybe the MPI is just getting overwhelmed.
>
> I was able to get one run to work (one TS with beuler), but the solver
> performance was horrendous, and I see this (attached):
>
> Time (sec):           1.601e+02     1.001   1.600e+02
> VecMDot           111712 1.0 5.1684e+01 1.4 2.32e+07 12.8 0.0e+00 0.0e+00 1.1e+05 30  4  0  0 23  30  4  0  0 23   499
> VecNorm           163478 1.0 6.6660e+01 1.2 1.51e+07 21.5 0.0e+00 0.0e+00 1.6e+05 39  2  0  0 34  39  2  0  0 34   139
> VecNormalize      154599 1.0 6.3942e+01 1.2 2.19e+07 23.3 0.0e+00 0.0e+00 1.5e+05 38  2  0  0 32  38  2  0  0 32   189
> etc,
> KSPSolve               3 1.0 1.1553e+02 1.0 1.34e+09 47.1 2.8e+09 6.0e+01 2.8e+05 72 95 45 72 58  72 95 45 72 58  4772
>
> Any ideas would be welcome,
> Thanks,
> Mark
> <cushersolve.txt>
>
>
>
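
To make the quoted log rows easier to compare, here is a minimal sketch that parses them and reports the average wall time per call. The column positions are an assumption based on the usual PETSc -log_view event layout (event, count, count ratio, max time, time ratio, max flop, flop ratio, messages, average length, reductions, then percentages); the function and field names are hypothetical, and the rows are copied from the excerpt above with wrapped lines rejoined.

```python
# Rows copied from the quoted log excerpt, wrapped lines rejoined.
ROWS = """\
VecMDot 111712 1.0 5.1684e+01 1.4 2.32e+07 12.8 0.0e+00 0.0e+00 1.1e+05 30 4 0 0 23 30 4 0 0 23 499
VecNorm 163478 1.0 6.6660e+01 1.2 1.51e+07 21.5 0.0e+00 0.0e+00 1.6e+05 39 2 0 0 34 39 2 0 0 34 139
VecNormalize 154599 1.0 6.3942e+01 1.2 2.19e+07 23.3 0.0e+00 0.0e+00 1.5e+05 38 2 0 0 32 38 2 0 0 32 189
KSPSolve 3 1.0 1.1553e+02 1.0 1.34e+09 47.1 2.8e+09 6.0e+01 2.8e+05 72 95 45 72 58 72 95 45 72 58 4772
"""


def parse_row(line: str) -> dict:
    """Split one event row into named fields (assumed column order)."""
    f = line.split()
    return {
        "event": f[0],
        "count": int(f[1]),          # max number of calls on any rank
        "time": float(f[3]),         # max time over ranks, in seconds
        "time_ratio": float(f[4]),   # max/min time across ranks
        "flop_ratio": float(f[6]),   # max/min flops across ranks
        "reductions": float(f[9]),   # global reductions
        "pct_time": int(f[10]),      # percent of total run time
    }


events = [parse_row(line) for line in ROWS.splitlines() if line.strip()]

for e in events:
    # Average wall time per call, using the max-over-ranks time.
    e["sec_per_call"] = e["time"] / e["count"]
    print(f'{e["event"]:<14}{e["count"]:>8}  {e["sec_per_call"]:.3e} s/call'
          f'  {e["pct_time"]}% of run time')
```

Note that VecNormalize includes the norm it calls internally, so its time should not be added to the VecNorm time when estimating how much of the run is spent in these operations.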