[petsc-dev] bad cpu/MPI performance problem

Sun Jan 8 13:44:24 CST 2023

On Sun, Jan 8, 2023 at 9:28 AM Barry Smith <bsmith at petsc.dev> wrote:

>
>   Mark,
>
>   Looks like the error checking in PetscCommDuplicate() is doing its job.
> It is reporting an attempt to use a PETSc object constructor on a subset
> of ranks of an MPI_Comm (which is, of course, fundamentally impossible in
> the PETSc/MPI model).
>
> Note that nroots can be negative on a particular rank but
> DMPlexLabelComplete_Internal() is collective on sf based on the comment in
> the code below
>
>
> struct _p_PetscSF {
> ....
>   PetscInt     nroots;  /* Number of root vertices on current process (candidates for incoming edges) */
>
> But the next routine calls a collective only when nroots >= 0
>
> static PetscErrorCode DMPlexLabelComplete_Internal(DM dm, DMLabel label, PetscBool completeCells){
> ...
>   PetscCall(PetscSFGetGraph(sfPoint, &nroots, NULL, NULL, NULL));
>   if (nroots >= 0) {
>     DMLabel         lblRoots, lblLeaves;
>     IS              valueIS, pointIS;
>     const PetscInt *values;
>     PetscInt        numValues, v;
>
>     /* Pull point contributions from remote leaves into local roots */
>     PetscCall(DMLabelGather(label, sfPoint, &lblLeaves));
>
>
> The code is four years old? How come this problem of calling the
> constructor on a subset of ranks hasn't come up since day 1?
>

The contract here is that it should be impossible to have nroots < 0
(meaning the SF is not set up) on only a subset of processes. Do we know
that this is happening?
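
One way to test that contract is to compare the minimum and maximum of
nroots across ranks: the branch `if (nroots >= 0)` is taken uniformly only
if all values are >= 0 or all are < 0. The sketch below is a serial
stand-in, not PETSc code. The helper name branch_is_uniform is
hypothetical, plain int stands in for PetscInt, and the array plays the
role of the per-rank values. In a real run the min and max would have to
come from a reduction over the SF's communicator (for example
MPI_Allreduce with MPI_MIN and MPI_MAX), called on every rank before the
branch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not part of PETSc: given the nroots value each rank
   would see, report whether every rank takes the same side of the
   `if (nroots >= 0)` branch. The local array is an assumption made so the
   sketch runs without MPI. */
bool branch_is_uniform(const int *nroots, size_t nranks)
{
  assert(nroots != NULL && nranks > 0);

  int lo = nroots[0], hi = nroots[0];
  for (size_t r = 1; r < nranks; ++r) {
    if (nroots[r] < lo) lo = nroots[r];
    if (nroots[r] > hi) hi = nroots[r];
  }

  /* Uniform if every rank has nroots >= 0 (lo >= 0),
     or every rank has nroots < 0 (hi < 0). */
  return lo >= 0 || hi < 0;
}
```

With per-rank values {4, -1, 7} the helper returns false, which is the
situation that would leave some ranks outside the collective
DMLabelGather() call.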

  Thanks,

    Matt

> On Jan 8, 2023, at 12:21 PM, Mark Adams <mfadams at lbl.gov> wrote:
>
> I am running on Crusher, CPU only, 64 cores per node with Plex/PetscFE.
> In going up to 64 nodes, something really catastrophic is happening.
> I understand I am not using the machine the way it was intended, but I
> just want to see if there are any options that I could try for a quick
> fix/help.
>
> In a debug build I get a stack trace on many but not all of the 4K
> processes.
> Alas, I am not sure why this job was terminated but every process that I
> checked, that had an "ERROR", had this stack:
>
> 11:57 main *+=
> crusher:/gpfs/alpine/csc314/scratch/adams/mg-m3dc1/src/data$ grep ERROR
> slurm-245063.out |g 3160
> [3160]PETSC ERROR:
> ------------------------------------------------------------------------
> [3160]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the
> batch system) has told this process to end
> [3160]PETSC ERROR: Try option -start_in_debugger or
> -on_error_attach_debugger
> [3160]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind and
> https://petsc.org/release/faq/
> [3160]PETSC ERROR: ---------------------  Stack Frames
> ------------------------------------
> [3160]PETSC ERROR: The line numbers in the error traceback are not always
> exact.
> [3160]PETSC ERROR: #1 MPI function
> [3160]PETSC ERROR: #2 PetscCommDuplicate() at
> /gpfs/alpine/csc314/scratch/adams/petsc/src/sys/objects/tagm.c:248
> [3160]PETSC ERROR: #3 PetscHeaderCreate_Private() at
> /gpfs/alpine/csc314/scratch/adams/petsc/src/sys/objects/inherit.c:56
> [3160]PETSC ERROR: #4 PetscSFCreate() at
> /gpfs/alpine/csc314/scratch/adams/petsc/src/vec/is/sf/interface/sf.c:65
> [3160]PETSC ERROR: #5 DMLabelGather() at
> /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/label/dmlabel.c:1932
> [3160]PETSC ERROR: #6 DMPlexLabelComplete_Internal() at
> /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/impls/plex/plexsubmesh.c:177
> [3160]PETSC ERROR: #7 DMPlexLabelComplete() at
> /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/impls/plex/plexsubmesh.c:227
> [3160]PETSC ERROR: #8 DMCompleteBCLabels_Internal() at
> /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/interface/dm.c:5301
> [3160]PETSC ERROR: #9 DMCopyDS() at
> /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/interface/dm.c:6117
> [3160]PETSC ERROR: #10 DMCopyDisc() at
> /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/interface/dm.c:6143
> [3160]PETSC ERROR: #11 SetupDiscretization() at
> /gpfs/alpine/csc314/scratch/adams/mg-m3dc1/src/mhd_2field.c:755
>
> Maybe the MPI is just getting overwhelmed.
>
> And I was able to get one run to work (one TS with beuler), and the
> solver performance was horrendous and I see this (attached):
>
> Time (sec):           1.601e+02     1.001   1.600e+02
> VecMDot           111712 1.0 5.1684e+01 1.4 2.32e+07 12.8 0.0e+00 0.0e+00 1.1e+05 30  4  0  0 23  30  4  0  0 23   499
> VecNorm           163478 1.0 6.6660e+01 1.2 1.51e+07 21.5 0.0e+00 0.0e+00 1.6e+05 39  2  0  0 34  39  2  0  0 34   139
> VecNormalize      154599 1.0 6.3942e+01 1.2 2.19e+07 23.3 0.0e+00 0.0e+00 1.5e+05 38  2  0  0 32  38  2  0  0 32   189
> etc,
> KSPSolve               3 1.0 1.1553e+02 1.0 1.34e+09 47.1 2.8e+09 6.0e+01 2.8e+05 72 95 45 72 58  72 95 45 72 58  4772
>
> Any ideas would be welcome,
> Thanks,
> Mark
> <cushersolve.txt>
>
>
>

-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/ <http://www.cse.buffalo.edu/~knepley/>