[petsc-dev] bad cpu/MPI performance problem
Mark Adams
mfadams at lbl.gov
Mon Jan 9 18:56:45 CST 2023
OK, here is a timeout one with 4K processors.
On Mon, Jan 9, 2023 at 3:13 PM Barry Smith <bsmith at petsc.dev> wrote:
>
>
> On Jan 9, 2023, at 12:13 PM, Mark Adams <mfadams at lbl.gov> wrote:
>
> Sorry, I deleted it. I submitted a 128-node job with a debug version to try
> to reproduce it, but it finished w/o this error, with the same bad
> performance as my original log.
> I think this message comes from a (time-out) signal.
>
>
> Sure, but we would need to know if there are multiple different places in
> the code where the time-out is happening.
>
>
>
>
> What I am seeing, though, is just really bad performance on large jobs. It
> is sudden: good scaling up to about 32 nodes and then a 100x slowdown in the
> solver (the vec norms) on 128 nodes (and 64).
>
> Thanks,
> Mark
>
>
> On Sun, Jan 8, 2023 at 8:07 PM Barry Smith <bsmith at petsc.dev> wrote:
>
>>
>> Rather cryptic use of that integer variable for anyone who did not
>> write the code, but I guess one was short on bytes when it was written :-).
>>
>> Ok, based on what Jed said, if we are certain that either all or none of
>> the nroots are -1, then, since you are getting hangs in
>> PetscCommDuplicate(), this might indicate that some ranks have called a
>> constructor on some OTHER object that not all ranks are creating. Of
>> course, it could be an MPI bug.
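>>
>> Just to illustrate the failure mode (a minimal sketch, not your code): a
>> collective PETSc constructor called on only a subset of the ranks of a
>> communicator will deadlock, e.g.
>>
>>   #include <petscsf.h>
>>
>>   int main(int argc, char **argv)
>>   {
>>     PetscMPIInt rank;
>>     PetscSF     sf;
>>
>>     PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));
>>     PetscCallMPI(MPI_Comm_rank(PETSC_COMM_WORLD, &rank));
>>     if (rank % 2 == 0) {
>>       /* Only even ranks call the constructor; if the inner communicator has
>>          not been duplicated yet, these ranks block in the collective
>>          MPI_Comm_dup() inside PetscCommDuplicate() while the odd ranks sit
>>          in the next collective they reach (here PetscFinalize()). */
>>       PetscCall(PetscSFCreate(PETSC_COMM_WORLD, &sf));
>>       PetscCall(PetscSFDestroy(&sf));
>>     }
>>     PetscCall(PetscFinalize());
>>     return 0;
>>   }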
>>
>> Mark, can you send the file containing all the output from the Signal
>> Terminate run? Maybe there will be a hint in there of a different
>> constructor being called on some ranks.
>>
>> Barry
>>
>>
>>
>> On Jan 8, 2023, at 4:32 PM, Mark Adams <mfadams at lbl.gov> wrote:
>>
>>
>> On Sun, Jan 8, 2023 at 4:13 PM Barry Smith <bsmith at petsc.dev> wrote:
>>
>>>
>>> There is a bug in the routine DMPlexLabelComplete_Internal()! The
>>> code should definitely not route around the collective calls with if
>>> (nroots >= 0), because checking the nroots value to decide on the code
>>> path is simply nonsense (if one "knows" "by contract" that nroots is >= 0,
>>> then the if () test is not needed).
>>>
>>> The first thing to do is to fix the bug with a PetscCheck(): remove
>>> the nonsensical if (nroots >= 0) check and rerun your code to see what
>>> happens.
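>>>
>>> Something along these lines (an untested sketch; the error code and
>>> message are just placeholders):
>>>
>>>   PetscCall(PetscSFGetGraph(sfPoint, &nroots, NULL, NULL, NULL));
>>>   PetscCheck(nroots >= 0, PETSC_COMM_SELF, PETSC_ERR_ARG_WRONGSTATE, "Point SF is not set up (nroots = %" PetscInt_FMT ")", nroots);
>>>   /* then fall through unconditionally to the collective DMLabelGather() etc. */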
>>>
>>
>> This does not fix the bug, right? It just fails cleanly, right?
>>
>> I do have lots of empty processors on the first GAMG coarse grid. I just
>> saw that the first GAMG coarse grid reduces the processor count to 4, from
>> 4K.
>> This is one case where repartitioning the coarse grids could, for once,
>> actually be useful (options sketched below).
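>>
>> For reference, the knobs for that are along these lines (values only for
>> illustration, not a recommendation):
>>
>>   -pc_gamg_repartition true -pc_gamg_process_eq_limit 200 -pc_gamg_coarse_eq_limit 1000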
>>
>> Do you have a bug fix suggestion for me to try?
>>
>> Thanks
>>
>>
>>> Barry
>>>
>>> Yes, it is possible that in your run the nroots is always >= 0 and some
>>> MPI bug is causing the problem, but this doesn't change the fact that the
>>> current code is buggy and needs to be fixed before blaming some other bug
>>> for the problem.
>>>
>>>
>>>
>>> On Jan 8, 2023, at 4:04 PM, Mark Adams <mfadams at lbl.gov> wrote:
>>>
>>>
>>>
>>> On Sun, Jan 8, 2023 at 2:44 PM Matthew Knepley <knepley at gmail.com>
>>> wrote:
>>>
>>>> On Sun, Jan 8, 2023 at 9:28 AM Barry Smith <bsmith at petsc.dev> wrote:
>>>>
>>>>>
>>>>> Mark,
>>>>>
>>>>> Looks like the error checking in PetscCommDuplicate() is doing its
>>>>> job. It is reporting an attempt to use a PETSc object constructor on a
>>>>> subset of the ranks of an MPI_Comm (which is, of course, fundamentally
>>>>> impossible in the PETSc/MPI model).
>>>>>
>>>>> Note that nroots can be negative on a particular rank, but
>>>>> DMPlexLabelComplete_Internal() is collective on the SF, based on the
>>>>> comment in the code below:
>>>>>
>>>>>
>>>>> struct _p_PetscSF {
>>>>>   ....
>>>>>   PetscInt nroots; /* Number of root vertices on current process (candidates for incoming edges) */
>>>>>
>>>>> But the next routine calls a collective only when nroots >= 0
>>>>>
>>>>> static PetscErrorCode DMPlexLabelComplete_Internal(DM dm, DMLabel label, PetscBool completeCells)
>>>>> {
>>>>>   ...
>>>>>   PetscCall(PetscSFGetGraph(sfPoint, &nroots, NULL, NULL, NULL));
>>>>>   if (nroots >= 0) {
>>>>>     DMLabel         lblRoots, lblLeaves;
>>>>>     IS              valueIS, pointIS;
>>>>>     const PetscInt *values;
>>>>>     PetscInt        numValues, v;
>>>>>
>>>>>     /* Pull point contributions from remote leaves into local roots */
>>>>>     PetscCall(DMLabelGather(label, sfPoint, &lblLeaves));
>>>>>
>>>>>
>>>>> The code is four years old? How come this problem of calling the
>>>>> constructor on a subset of ranks hasn't come up since day 1?
>>>>>
>>>>
>>>> The contract here is that it should be impossible to have nroots < 0
>>>> (meaning the SF is not setup) on a subset of processes. Do we know that
>>>> this is happening?
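>>>>
>>>> One way to check (a rough sketch, assuming the SF in question is the point
>>>> SF from DMGetPointSF() on the same DM) would be to print nroots on every
>>>> rank right before the solve:
>>>>
>>>>   PetscSF     sfPoint;
>>>>   PetscInt    nroots;
>>>>   PetscMPIInt rank;
>>>>
>>>>   PetscCall(DMGetPointSF(dm, &sfPoint));
>>>>   PetscCall(PetscSFGetGraph(sfPoint, &nroots, NULL, NULL, NULL));
>>>>   PetscCallMPI(MPI_Comm_rank(PetscObjectComm((PetscObject)dm), &rank));
>>>>   /* synchronized output so the per-rank values come out in rank order */
>>>>   PetscCall(PetscSynchronizedPrintf(PetscObjectComm((PetscObject)dm), "[%d] nroots = %" PetscInt_FMT "\n", rank, nroots));
>>>>   PetscCall(PetscSynchronizedFlush(PetscObjectComm((PetscObject)dm), PETSC_STDOUT));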
>>>>
>>>
>>> Can't imagine a code bug here. Very simple code.
>>>
>>> This code does use GAMG as the coarse grid solver in a pretty extreme
>>> way.
>>> GAMG is fairly complicated and not normally used on such small problems
>>> with this much parallelism.
>>> It is conceivable that it's a GAMG bug, but that is not what was going on
>>> in my initial email here.
>>>
>>> Here is a run that timed out, but it should not have, so I think this is
>>> the same issue. I always have perfectly distributed grids like this.
>>>
>>> DM Object: box 2048 MPI processes
>>> type: plex
>>> box in 2 dimensions:
>>> Min/Max of 0-cells per rank: 8385/8580
>>> Min/Max of 1-cells per rank: 24768/24960
>>> Min/Max of 2-cells per rank: 16384/16384
>>> Labels:
>>> celltype: 3 strata with value/size (1 (24768), 3 (16384), 0 (8385))
>>> depth: 3 strata with value/size (0 (8385), 1 (24768), 2 (16384))
>>> marker: 1 strata with value/size (1 (385))
>>> Face Sets: 1 strata with value/size (1 (381))
>>> Defined by transform from:
>>> DM_0x84000002_1 in 2 dimensions:
>>> Min/Max of 0-cells per rank: 2145/2244
>>> Min/Max of 1-cells per rank: 6240/6336
>>> Min/Max of 2-cells per rank: 4096/4096
>>> Labels:
>>> celltype: 3 strata with value/size (1 (6240), 3 (4096), 0 (2145))
>>> depth: 3 strata with value/size (0 (2145), 1 (6240), 2 (4096))
>>> marker: 1 strata with value/size (1 (193))
>>> Face Sets: 1 strata with value/size (1 (189))
>>> Defined by transform from:
>>> DM_0x84000002_2 in 2 dimensions:
>>> Min/Max of 0-cells per rank: 561/612
>>> Min/Max of 1-cells per rank: 1584/1632
>>> Min/Max of 2-cells per rank: 1024/1024
>>> Labels:
>>> celltype: 3 strata with value/size (1 (1584), 3 (1024), 0 (561))
>>> depth: 3 strata with value/size (0 (561), 1 (1584), 2 (1024))
>>> marker: 1 strata with value/size (1 (97))
>>> Face Sets: 1 strata with value/size (1 (93))
>>> Defined by transform from:
>>> DM_0x84000002_3 in 2 dimensions:
>>> Min/Max of 0-cells per rank: 153/180
>>> Min/Max of 1-cells per rank: 408/432
>>> Min/Max of 2-cells per rank: 256/256
>>> Labels:
>>> celltype: 3 strata with value/size (1 (408), 3 (256), 0 (153))
>>> depth: 3 strata with value/size (0 (153), 1 (408), 2 (256))
>>> marker: 1 strata with value/size (1 (49))
>>> Face Sets: 1 strata with value/size (1 (45))
>>> Defined by transform from:
>>> DM_0x84000002_4 in 2 dimensions:
>>> Min/Max of 0-cells per rank: 45/60
>>> Min/Max of 1-cells per rank: 108/120
>>> Min/Max of 2-cells per rank: 64/64
>>> Labels:
>>> celltype: 3 strata with value/size (1 (108), 3 (64), 0 (45))
>>> depth: 3 strata with value/size (0 (45), 1 (108), 2 (64))
>>> marker: 1 strata with value/size (1 (25))
>>> Face Sets: 1 strata with value/size (1 (21))
>>> Defined by transform from:
>>> DM_0x84000002_5 in 2 dimensions:
>>> Min/Max of 0-cells per rank: 15/24
>>> Min/Max of 1-cells per rank: 30/36
>>> Min/Max of 2-cells per rank: 16/16
>>> Labels:
>>> celltype: 3 strata with value/size (1 (30), 3 (16), 0 (15))
>>> depth: 3 strata with value/size (0 (15), 1 (30), 2 (16))
>>> marker: 1 strata with value/size (1 (13))
>>> Face Sets: 1 strata with value/size (1 (9))
>>> Defined by transform from:
>>> DM_0x84000002_6 in 2 dimensions:
>>> Min/Max of 0-cells per rank: 6/12
>>> Min/Max of 1-cells per rank: 9/12
>>> Min/Max of 2-cells per rank: 4/4
>>> Labels:
>>> depth: 3 strata with value/size (0 (6), 1 (9), 2 (4))
>>> celltype: 3 strata with value/size (0 (6), 1 (9), 3 (4))
>>> marker: 1 strata with value/size (1 (7))
>>> Face Sets: 1 strata with value/size (1 (3))
>>> 0 TS dt 0.001 time 0.
>>> MHD 0) time = 0, Eergy= 2.3259668003585e+00 (plot ID 0)
>>> 0 SNES Function norm 5.415286407365e-03
>>> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
>>> slurmstepd: error: *** STEP 245100.0 ON crusher002 CANCELLED AT
>>> 2023-01-08T15:32:43 DUE TO TIME LIMIT ***
>>>
>>>
>>>
>>>>
>>>> Thanks,
>>>>
>>>> Matt
>>>>
>>>>
>>>>> On Jan 8, 2023, at 12:21 PM, Mark Adams <mfadams at lbl.gov> wrote:
>>>>>
>>>>> I am running on Crusher, CPU only, 64 cores per node with
>>>>> Plex/PetscFE.
>>>>> In going up to 64 nodes, something really catastrophic is happening.
>>>>> I understand I am not using the machine the way it was intended, but I
>>>>> just want to see if there are any options that I could try for a quick
>>>>> fix/help.
>>>>>
>>>>> In a debug build I get a stack trace on many but not all of the 4K
>>>>> processes.
>>>>> Alas, I am not sure why this job was terminated, but every process that
>>>>> I checked that had an "ERROR" had this stack:
>>>>>
>>>>> 11:57 main *+=
>>>>> crusher:/gpfs/alpine/csc314/scratch/adams/mg-m3dc1/src/data$ grep ERROR
>>>>> slurm-245063.out |g 3160
>>>>> [3160]PETSC ERROR:
>>>>> ------------------------------------------------------------------------
>>>>> [3160]PETSC ERROR: Caught signal number 15 Terminate: Some process (or
>>>>> the batch system) has told this process to end
>>>>> [3160]PETSC ERROR: Try option -start_in_debugger or
>>>>> -on_error_attach_debugger
>>>>> [3160]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind and
>>>>> https://petsc.org/release/faq/
>>>>> [3160]PETSC ERROR: --------------------- Stack Frames
>>>>> ------------------------------------
>>>>> [3160]PETSC ERROR: The line numbers in the error traceback are not
>>>>> always exact.
>>>>> [3160]PETSC ERROR: #1 MPI function
>>>>> [3160]PETSC ERROR: #2 PetscCommDuplicate() at
>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/sys/objects/tagm.c:248
>>>>> [3160]PETSC ERROR: #3 PetscHeaderCreate_Private() at
>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/sys/objects/inherit.c:56
>>>>> [3160]PETSC ERROR: #4 PetscSFCreate() at
>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/vec/is/sf/interface/sf.c:65
>>>>> [3160]PETSC ERROR: #5 DMLabelGather() at
>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/label/dmlabel.c:1932
>>>>> [3160]PETSC ERROR: #6 DMPlexLabelComplete_Internal() at
>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/impls/plex/plexsubmesh.c:177
>>>>> [3160]PETSC ERROR: #7 DMPlexLabelComplete() at
>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/impls/plex/plexsubmesh.c:227
>>>>> [3160]PETSC ERROR: #8 DMCompleteBCLabels_Internal() at
>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/interface/dm.c:5301
>>>>> [3160]PETSC ERROR: #9 DMCopyDS() at
>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/interface/dm.c:6117
>>>>> [3160]PETSC ERROR: #10 DMCopyDisc() at
>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/interface/dm.c:6143
>>>>> [3160]PETSC ERROR: #11 SetupDiscretization() at
>>>>> /gpfs/alpine/csc314/scratch/adams/mg-m3dc1/src/mhd_2field.c:755
>>>>>
>>>>> Maybe the MPI is just getting overwhelmed.
>>>>>
>>>>> And I was able to get one run to work (one TS with beuler), but the
>>>>> solver performance was horrendous and I see this (attached):
>>>>>
>>>>> Time (sec):          1.601e+02     1.001   1.600e+02
>>>>> VecMDot       111712 1.0 5.1684e+01 1.4 2.32e+07 12.8 0.0e+00 0.0e+00 1.1e+05 30  4  0  0 23  30  4  0  0 23   499
>>>>> VecNorm       163478 1.0 6.6660e+01 1.2 1.51e+07 21.5 0.0e+00 0.0e+00 1.6e+05 39  2  0  0 34  39  2  0  0 34   139
>>>>> VecNormalize  154599 1.0 6.3942e+01 1.2 2.19e+07 23.3 0.0e+00 0.0e+00 1.5e+05 38  2  0  0 32  38  2  0  0 32   189
>>>>> etc,
>>>>> KSPSolve           3 1.0 1.1553e+02 1.0 1.34e+09 47.1 2.8e+09 6.0e+01 2.8e+05 72 95 45 72 58  72 95 45 72 58  4772
>>>>>
>>>>> Any ideas would be welcome,
>>>>> Thanks,
>>>>> Mark
>>>>> <cushersolve.txt>
>>>>>
>>>>>
>>>>>
>>>>
>>>> --
>>>> What most experimenters take for granted before they begin their
>>>> experiments is infinitely more interesting than any results to which their
>>>> experiments lead.
>>>> -- Norbert Wiener
>>>>
>>>> https://www.cse.buffalo.edu/~knepley/
>>>> <http://www.cse.buffalo.edu/~knepley/>
>>>>
>>>
>>>
>>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: crushererror.txt.gz
Type: application/x-gzip
Size: 83279 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/petsc-dev/attachments/20230109/70545e38/attachment-0001.gz>