[petsc-dev] bad cpu/MPI performance problem
Mark Adams
mfadams at lbl.gov
Mon Jan 9 18:56:45 CST 2023
OK, here is a timeout one with 4K processors.
On Mon, Jan 9, 2023 at 3:13 PM Barry Smith <bsmith at petsc.dev> wrote:
>
>
> On Jan 9, 2023, at 12:13 PM, Mark Adams <mfadams at lbl.gov> wrote:
>
> Sorry, I deleted it. I submitted a 128-node job with a debug version to try
> to reproduce it, but it finished w/o this error, with the same bad
> performance as my original log.
> I think this message comes from a (time-out) signal.
>
>
> Sure, but we would need to know if there are multiple different places in
> the code where the time-out is happening.
>
>
>
>
> What I am seeing, though, is just really bad performance on large jobs. It
> is sudden: good scaling up to about 32 nodes and then a 100x slowdown in the
> solver (the vec norms) on 128 nodes (and 64).
>
> Thanks,
> Mark
>
>
> On Sun, Jan 8, 2023 at 8:07 PM Barry Smith <bsmith at petsc.dev> wrote:
>
>>
>> Rather cryptic use of that integer variable for anyone who did not
>> write the code, but I guess one was short on bytes when it was written :-).
>>
>> Ok, based on what Jed said, if we are certain that either all or none of
>> the nroots are -1, then, since you are getting hangs in
>> PetscCommDuplicate(), this might indicate that some ranks have called a
>> constructor on some OTHER object that not all ranks are creating. Of
>> course, it could be an MPI bug.
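>>
>> Just to illustrate the failure mode (a minimal sketch, not your code): a
>> collective PETSc constructor called on only a subset of the ranks of a
>> communicator will deadlock, e.g.
>>
>>   #include <petscsf.h>
>>
>>   int main(int argc, char **argv)
>>   {
>>     PetscMPIInt rank;
>>     PetscSF     sf;
>>
>>     PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));
>>     PetscCallMPI(MPI_Comm_rank(PETSC_COMM_WORLD, &rank));
>>     if (rank % 2 == 0) {
>>       /* Only even ranks call the constructor; if the inner communicator has
>>          not been duplicated yet, these ranks block in the collective
>>          MPI_Comm_dup() inside PetscCommDuplicate() while the odd ranks sit
>>          in the next collective they reach (here PetscFinalize()). */
>>       PetscCall(PetscSFCreate(PETSC_COMM_WORLD, &sf));
>>       PetscCall(PetscSFDestroy(&sf));
>>     }
>>     PetscCall(PetscFinalize());
>>     return 0;
>>   }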
>>
>> Mark, can you send the file containing all the output from the Signal
>> Terminate run? Maybe there will be a hint in there of a different
>> constructor being called on some ranks.
>>
>> Barry
>>
>>
>>
>> On Jan 8, 2023, at 4:32 PM, Mark Adams <mfadams at lbl.gov> wrote:
>>
>>
>> On Sun, Jan 8, 2023 at 4:13 PM Barry Smith <bsmith at petsc.dev> wrote:
>>
>>>
>>> There is a bug in the routine DMPlexLabelComplete_Internal()! The
>>> code should definitely not route around the collective calls with if
>>> (nroots >= 0), because checking the nroots value to decide on the code
>>> path is simply nonsense (if one "knows" "by contract" that nroots is >= 0,
>>> then the if () test is not needed).
>>>
>>> The first thing to do is to fix the bug with a PetscCheck(): remove
>>> the nonsensical if (nroots >= 0) check and rerun your code to see what
>>> happens.
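>>>
>>> Something along these lines (an untested sketch; the error code and
>>> message are just placeholders):
>>>
>>>   PetscCall(PetscSFGetGraph(sfPoint, &nroots, NULL, NULL, NULL));
>>>   PetscCheck(nroots >= 0, PETSC_COMM_SELF, PETSC_ERR_ARG_WRONGSTATE, "Point SF is not set up (nroots = %" PetscInt_FMT ")", nroots);
>>>   /* then fall through unconditionally to the collective DMLabelGather() etc. */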
>>>
>>
>> This does not fix the bug, right? It just fails cleanly, right?
>>
>> I do have lots of empty processors on the first GAMG coarse grid. I just
>> saw that the first GAMG coarse grid reduces the processor count to 4, from
>> 4K.
>> This is one case where repartitioning the coarse grids could, for once,
>> actually be useful (options sketched below).
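>>
>> For reference, the knobs for that are along these lines (values only for
>> illustration, not a recommendation):
>>
>>   -pc_gamg_repartition true -pc_gamg_process_eq_limit 200 -pc_gamg_coarse_eq_limit 1000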
>>
>> Do you have a bug fix suggestion for me to try?
>>
>> Thanks
>>
>>
>>> Barry
>>>
>>> Yes, it is possible that in your run the nroots is always >= 0 and some
>>> MPI bug is causing the problem, but this doesn't change the fact that the
>>> current code is buggy and needs to be fixed before blaming some other bug
>>> for the problem.
>>>
>>>
>>>
>>> On Jan 8, 2023, at 4:04 PM, Mark Adams <mfadams at lbl.gov> wrote:
>>>
>>>
>>>
>>> On Sun, Jan 8, 2023 at 2:44 PM Matthew Knepley <knepley at gmail.com>
>>> wrote:
>>>
>>>> On Sun, Jan 8, 2023 at 9:28 AM Barry Smith <bsmith at petsc.dev> wrote:
>>>>
>>>>>
>>>>> Mark,
>>>>>
>>>>> Looks like the error checking in PetscCommDuplicate() is doing its
>>>>> job. It is reporting an attempt to use a PETSc object constructor on a
>>>>> subset of the ranks of an MPI_Comm (which is, of course, fundamentally
>>>>> impossible in the PETSc/MPI model).
>>>>>
>>>>> Note that nroots can be negative on a particular rank, but
>>>>> DMPlexLabelComplete_Internal() is collective on the SF, based on the
>>>>> comment in the code below:
>>>>>
>>>>>
>>>>> struct _p_PetscSF {
>>>>>   ....
>>>>>   PetscInt nroots; /* Number of root vertices on current process (candidates for incoming edges) */
>>>>>
>>>>> But the next routine calls a collective only when nroots >= 0
>>>>>
>>>>> static PetscErrorCode DMPlexLabelComplete_Internal(DM dm, DMLabel label, PetscBool completeCells)
>>>>> {
>>>>>   ...
>>>>>   PetscCall(PetscSFGetGraph(sfPoint, &nroots, NULL, NULL, NULL));
>>>>>   if (nroots >= 0) {
>>>>>     DMLabel         lblRoots, lblLeaves;
>>>>>     IS              valueIS, pointIS;
>>>>>     const PetscInt *values;
>>>>>     PetscInt        numValues, v;
>>>>>
>>>>>     /* Pull point contributions from remote leaves into local roots */
>>>>>     PetscCall(DMLabelGather(label, sfPoint, &lblLeaves));
>>>>>
>>>>>
>>>>> The code is four years old? How come this problem of calling the
>>>>> constructor on a subset of ranks hasn't come up since day 1?
>>>>>
>>>>
>>>> The contract here is that it should be impossible to have nroots < 0
>>>> (meaning the SF is not setup) on a subset of processes. Do we know that
>>>> this is happening?
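>>>>
>>>> One way to check (a rough sketch, assuming the SF in question is the point
>>>> SF from DMGetPointSF() on the same DM) would be to print nroots on every
>>>> rank right before the solve:
>>>>
>>>>   PetscSF     sfPoint;
>>>>   PetscInt    nroots;
>>>>   PetscMPIInt rank;
>>>>
>>>>   PetscCall(DMGetPointSF(dm, &sfPoint));
>>>>   PetscCall(PetscSFGetGraph(sfPoint, &nroots, NULL, NULL, NULL));
>>>>   PetscCallMPI(MPI_Comm_rank(PetscObjectComm((PetscObject)dm), &rank));
>>>>   /* synchronized output so the per-rank values come out in rank order */
>>>>   PetscCall(PetscSynchronizedPrintf(PetscObjectComm((PetscObject)dm), "[%d] nroots = %" PetscInt_FMT "\n", rank, nroots));
>>>>   PetscCall(PetscSynchronizedFlush(PetscObjectComm((PetscObject)dm), PETSC_STDOUT));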
>>>>
>>>
>>> Can't imagine a code bug here. Very simple code.
>>>
>>> This code does use GAMG as the coarse grid solver in a pretty extreme
>>> way.
>>> GAMG is fairly complicated and not normally used on such small problems
>>> with this much parallelism.
>>> It is conceivable that it's a GAMG bug, but that is not what was going on
>>> in my initial email here.
>>>
>>> Here is a run that timed out, but it should not have, so I think this is
>>> the same issue. I always have perfectly distributed grids like this.
>>>
>>> DM Object: box 2048 MPI processes
>>> type: plex
>>> box in 2 dimensions:
>>> Min/Max of 0-cells per rank: 8385/8580
>>> Min/Max of 1-cells per rank: 24768/24960
>>> Min/Max of 2-cells per rank: 16384/16384
>>> Labels:
>>> celltype: 3 strata with value/size (1 (24768), 3 (16384), 0 (8385))
>>> depth: 3 strata with value/size (0 (8385), 1 (24768), 2 (16384))
>>> marker: 1 strata with value/size (1 (385))
>>> Face Sets: 1 strata with value/size (1 (381))
>>> Defined by transform from:
>>> DM_0x84000002_1 in 2 dimensions:
>>> Min/Max of 0-cells per rank: 2145/2244
>>> Min/Max of 1-cells per rank: 6240/6336
>>> Min/Max of 2-cells per rank: 4096/4096
>>> Labels:
>>> celltype: 3 strata with value/size (1 (6240), 3 (4096), 0 (2145))
>>> depth: 3 strata with value/size (0 (2145), 1 (6240), 2 (4096))
>>> marker: 1 strata with value/size (1 (193))
>>> Face Sets: 1 strata with value/size (1 (189))
>>> Defined by transform from:
>>> DM_0x84000002_2 in 2 dimensions:
>>> Min/Max of 0-cells per rank: 561/612
>>> Min/Max of 1-cells per rank: 1584/1632
>>> Min/Max of 2-cells per rank: 1024/1024
>>> Labels:
>>> celltype: 3 strata with value/size (1 (1584), 3 (1024), 0 (561))
>>> depth: 3 strata with value/size (0 (561), 1 (1584), 2 (1024))
>>> marker: 1 strata with value/size (1 (97))
>>> Face Sets: 1 strata with value/size (1 (93))
>>> Defined by transform from:
>>> DM_0x84000002_3 in 2 dimensions:
>>> Min/Max of 0-cells per rank: 153/180
>>> Min/Max of 1-cells per rank: 408/432
>>> Min/Max of 2-cells per rank: 256/256
>>> Labels:
>>> celltype: 3 strata with value/size (1 (408), 3 (256), 0 (153))
>>> depth: 3 strata with value/size (0 (153), 1 (408), 2 (256))
>>> marker: 1 strata with value/size (1 (49))
>>> Face Sets: 1 strata with value/size (1 (45))
>>> Defined by transform from:
>>> DM_0x84000002_4 in 2 dimensions:
>>> Min/Max of 0-cells per rank: 45/60
>>> Min/Max of 1-cells per rank: 108/120
>>> Min/Max of 2-cells per rank: 64/64
>>> Labels:
>>> celltype: 3 strata with value/size (1 (108), 3 (64), 0 (45))
>>> depth: 3 strata with value/size (0 (45), 1 (108), 2 (64))
>>> marker: 1 strata with value/size (1 (25))
>>> Face Sets: 1 strata with value/size (1 (21))
>>> Defined by transform from:
>>> DM_0x84000002_5 in 2 dimensions:
>>> Min/Max of 0-cells per rank: 15/24
>>> Min/Max of 1-cells per rank: 30/36
>>> Min/Max of 2-cells per rank: 16/16
>>> Labels:
>>> celltype: 3 strata with value/size (1 (30), 3 (16), 0 (15))
>>> depth: 3 strata with value/size (0 (15), 1 (30), 2 (16))
>>> marker: 1 strata with value/size (1 (13))
>>> Face Sets: 1 strata with value/size (1 (9))
>>> Defined by transform from:
>>> DM_0x84000002_6 in 2 dimensions:
>>> Min/Max of 0-cells per rank: 6/12
>>> Min/Max of 1-cells per rank: 9/12
>>> Min/Max of 2-cells per rank: 4/4
>>> Labels:
>>> depth: 3 strata with value/size (0 (6), 1 (9), 2 (4))
>>> celltype: 3 strata with value/size (0 (6), 1 (9), 3 (4))
>>> marker: 1 strata with value/size (1 (7))
>>> Face Sets: 1 strata with value/size (1 (3))
>>> 0 TS dt 0.001 time 0.
>>> MHD 0) time = 0, Eergy= 2.3259668003585e+00 (plot ID 0)
>>> 0 SNES Function norm 5.415286407365e-03
>>> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
>>> slurmstepd: error: *** STEP 245100.0 ON crusher002 CANCELLED AT
>>> 2023-01-08T15:32:43 DUE TO TIME LIMIT ***
>>>
>>>
>>>
>>>>
>>>> Thanks,
>>>>
>>>> Matt
>>>>
>>>>
>>>>> On Jan 8, 2023, at 12:21 PM, Mark Adams <mfadams at lbl.gov> wrote:
>>>>>
>>>>> I am running on Crusher, CPU only, 64 cores per node with
>>>>> Plex/PetscFE.
>>>>> In going up to 64 nodes, something really catastrophic is happening.
>>>>> I understand I am not using the machine the way it was intended, but I
>>>>> just want to see if there are any options that I could try for a quick
>>>>> fix/help.
>>>>>
>>>>> In a debug build I get a stack trace on many but not all of the 4K
>>>>> processes.
>>>>> Alas, I am not sure why this job was terminated, but every process that
>>>>> I checked that had an "ERROR" had this stack:
>>>>>
>>>>> 11:57 main *+=
>>>>> crusher:/gpfs/alpine/csc314/scratch/adams/mg-m3dc1/src/data$ grep ERROR
>>>>> slurm-245063.out |g 3160
>>>>> [3160]PETSC ERROR:
>>>>> ------------------------------------------------------------------------
>>>>> [3160]PETSC ERROR: Caught signal number 15 Terminate: Some process (or
>>>>> the batch system) has told this process to end
>>>>> [3160]PETSC ERROR: Try option -start_in_debugger or
>>>>> -on_error_attach_debugger
>>>>> [3160]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind and
>>>>> https://petsc.org/release/faq/
>>>>> [3160]PETSC ERROR: --------------------- Stack Frames
>>>>> ------------------------------------
>>>>> [3160]PETSC ERROR: The line numbers in the error traceback are not
>>>>> always exact.
>>>>> [3160]PETSC ERROR: #1 MPI function
>>>>> [3160]PETSC ERROR: #2 PetscCommDuplicate() at
>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/sys/objects/tagm.c:248
>>>>> [3160]PETSC ERROR: #3 PetscHeaderCreate_Private() at
>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/sys/objects/inherit.c:56
>>>>> [3160]PETSC ERROR: #4 PetscSFCreate() at
>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/vec/is/sf/interface/sf.c:65
>>>>> [3160]PETSC ERROR: #5 DMLabelGather() at
>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/label/dmlabel.c:1932
>>>>> [3160]PETSC ERROR: #6 DMPlexLabelComplete_Internal() at
>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/impls/plex/plexsubmesh.c:177
>>>>> [3160]PETSC ERROR: #7 DMPlexLabelComplete() at
>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/impls/plex/plexsubmesh.c:227
>>>>> [3160]PETSC ERROR: #8 DMCompleteBCLabels_Internal() at
>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/interface/dm.c:5301
>>>>> [3160]PETSC ERROR: #9 DMCopyDS() at
>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/interface/dm.c:6117
>>>>> [3160]PETSC ERROR: #10 DMCopyDisc() at
>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/interface/dm.c:6143
>>>>> [3160]PETSC ERROR: #11 SetupDiscretization() at
>>>>> /gpfs/alpine/csc314/scratch/adams/mg-m3dc1/src/mhd_2field.c:755
>>>>>
>>>>> Maybe the MPI is just getting overwhelmed.
>>>>>
>>>>> And I was able to get one run to work (one TS with beuler), but the
>>>>> solver performance was horrendous and I see this (attached):
>>>>>
>>>>> Time (sec):          1.601e+02     1.001   1.600e+02
>>>>> VecMDot       111712 1.0 5.1684e+01 1.4 2.32e+07 12.8 0.0e+00 0.0e+00 1.1e+05 30  4  0  0 23  30  4  0  0 23   499
>>>>> VecNorm       163478 1.0 6.6660e+01 1.2 1.51e+07 21.5 0.0e+00 0.0e+00 1.6e+05 39  2  0  0 34  39  2  0  0 34   139
>>>>> VecNormalize  154599 1.0 6.3942e+01 1.2 2.19e+07 23.3 0.0e+00 0.0e+00 1.5e+05 38  2  0  0 32  38  2  0  0 32   189
>>>>> etc,
>>>>> KSPSolve           3 1.0 1.1553e+02 1.0 1.34e+09 47.1 2.8e+09 6.0e+01 2.8e+05 72 95 45 72 58  72 95 45 72 58  4772
>>>>>
>>>>> Any ideas would be welcome,
>>>>> Thanks,
>>>>> Mark
>>>>> <cushersolve.txt>
>>>>>
>>>>>
>>>>>
>>>>
>>>> --
>>>> What most experimenters take for granted before they begin their
>>>> experiments is infinitely more interesting than any results to which their
>>>> experiments lead.
>>>> -- Norbert Wiener
>>>>
>>>> https://www.cse.buffalo.edu/~knepley/
>>>> <http://www.cse.buffalo.edu/~knepley/>
>>>>
>>>
>>>
>>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: crushererror.txt.gz
Type: application/x-gzip
Size: 83279 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/petsc-dev/attachments/20230109/70545e38/attachment-0001.gz>