[petsc-dev] bad cpu/MPI performance problem

Barry Smith bsmith at petsc.dev
Mon Jan 9 23:02:53 CST 2023


  Mark,

    You could write a simple pure MPI program to see if you can get a hang due to a messed-up MPI implementation.

    for (i = 0; i < something big; i++) {
       MPI_Allreduce()
       MPI_Bcast() from rank zero
    }

If this hangs, you have something to give to NERSC.
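
A minimal sketch of such a test (untested; the iteration count, datatype, and payload are arbitrary choices here):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
      int  rank, i, n = 100000;
      long local = 1, global = 0;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      for (i = 0; i < n; i++) {
        /* collective reduction followed by a broadcast from rank zero, as above */
        MPI_Allreduce(&local, &global, 1, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);
        MPI_Bcast(&global, 1, MPI_LONG, 0, MPI_COMM_WORLD);
      }
      if (rank == 0) printf("done after %d iterations: %ld\n", n, global);
      MPI_Finalize();
      return 0;
    }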

> On Jan 9, 2023, at 11:21 PM, Barry Smith <bsmith at petsc.dev> wrote:
> 
> 
>    Ok, this is a completely different location of the problem.  
> 
>    The code is hanging in two different locations (for different ranks)
> 
> MPI_Allreduce()
> [2180]PETSC ERROR: #1 MPI function
> [2180]PETSC ERROR: #2 DMPlexView_Ascii() at /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/impls/plex/plex.c:1439
> 
> MPI_Bcast()
> [72]PETSC ERROR: #1 MPI function
> [72]PETSC ERROR: #2 PetscObjectName() at /gpfs/alpine/csc314/scratch/adams/petsc/src/sys/objects/pname.c:123
> [72]PETSC ERROR: #3 PetscObjectGetName() at /gpfs/alpine/csc314/scratch/adams/petsc/src/sys/objects/pgname.c:27
> [72]PETSC ERROR: #4 DMPlexView_Ascii() at /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/impls/plex/plex.c:1434
> 
> 
> It appears that the object has a name on some ranks (most) (normally this is set with PetscObjectSetName()) but no name on other ranks.
> 
> -----
> There are some "bugs" in the current code (that should be fixed but are not the root of this problem)
> 
> 1)
> 
>     PetscCall(PetscObjectGetName((PetscObject)dm, &name));
>     if (name) PetscCall(PetscViewerASCIIPrintf(viewer, "%s in %" PetscInt_FMT " dimension%s:\n", name, dim, dim == 1 ? "" : "s"));
>     else PetscCall(PetscViewerASCIIPrintf(viewer, "Mesh in %" PetscInt_FMT " dimension%s:\n", dim, dim == 1 ? "" : "s"));
> 
>  PetscObjectGetName() ALWAYS returns a non-null name, so the if (name) check is nonsense and should be removed.
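> 
> With the check gone, the block could simply become (a sketch of the simplification, not tested):
> 
>     PetscCall(PetscObjectGetName((PetscObject)dm, &name));
>     PetscCall(PetscViewerASCIIPrintf(viewer, "%s in %" PetscInt_FMT " dimension%s:\n", name, dim, dim == 1 ? "" : "s"));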
> 
> 2) 
> 
> /*@C
>    PetscObjectSetName - Sets a string name associated with a PETSc object.
> 
>    Not Collective
> 
>    This is dangerous phrasing; while it is true that it does not need to be called on all ranks, if it is called on only a subset of ranks then PetscObjectGetName() will trigger a PetscObjectName(), which triggers an MPI_Bcast() on a subset of ranks, which will cause a hang.
> 
>     It should say Logically Collective, not Not Collective.
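> 
> For example (an illustrative sketch only; assume rank holds the MPI rank on the dm's communicator and "mesh" is an arbitrary name), this pattern would hang:
> 
>     if (rank == 0) PetscCall(PetscObjectSetName((PetscObject)dm, "mesh")); /* name set on a subset of ranks only */
>     /* ranks without a name enter PetscObjectName() and its MPI_Bcast(); the named ranks never do, so the Bcast never completes */
>     PetscCall(PetscObjectGetName((PetscObject)dm, &name));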
> 
> ----
> The problem occurs when viewing the coarsest of the six DMs; all the finer DMs are viewed correctly. The coarser DMs are all named automatically with PetscObjectName(), so they have names like DM_0x84000002_1.
> 
> 
> I cannot speculate on how it got into this state (since my last conclusion was so wrong :-)). 
> 
> Mark, 
> 
>   What happens if you run the same problem (same number of ranks, etc.) but do not use -dm_refine_hierarchy 6 (so the coarse DM is the only DM)?
> 
> Barry
> 
> 
> 
>> On Jan 9, 2023, at 7:56 PM, Mark Adams <mfadams at lbl.gov> wrote:
>> 
>> OK, here is a timeout one with 4K processors.
>> 
>> On Mon, Jan 9, 2023 at 3:13 PM Barry Smith <bsmith at petsc.dev <mailto:bsmith at petsc.dev>> wrote:
>>> 
>>> 
>>>> On Jan 9, 2023, at 12:13 PM, Mark Adams <mfadams at lbl.gov <mailto:mfadams at lbl.gov>> wrote:
>>>> 
>>>> Sorry, I deleted it. I put in a 128-node job with a debug version to try to reproduce it, but it finished w/o this error,
>>>> with the same bad performance as my original log.
>>>> I think this message comes from a (time out) signal.
>>> 
>>>   Sure, but we would need to know if there are multiple different places in the code where the time-out is happening. 
>>> 
>>> 
>>> 
>>>> 
>>>> What I am seeing, though, is just really bad performance on large jobs. It is sudden: good scaling up to about 32 nodes and then a 100x slowdown in the solver (the vec norms) on 128 nodes (and 64).
>>>> 
>>>> Thanks,
>>>> Mark
>>>> 
>>>> 
>>>> On Sun, Jan 8, 2023 at 8:07 PM Barry Smith <bsmith at petsc.dev <mailto:bsmith at petsc.dev>> wrote:
>>>>> 
>>>>>   Rather cryptic use of that integer variable for anyone who did not write the code, but I guess one was short on bytes when it was written :-).
>>>>> 
>>>>>   Ok, based on what Jed said, if we are certain that either all or none of the nroots are -1, then since you are getting hangs in PetscCommDuplicate() this might indicate that some ranks have called a constructor on some OTHER object that not all ranks are calling. Of course, it could be an MPI bug.
>>>>> 
>>>>>   Mark, can you send the file containing all the output from the Signal Terminate run? Maybe there will be a hint in there of a different constructor being called on some ranks.
>>>>> 
>>>>>   Barry
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>>> On Jan 8, 2023, at 4:32 PM, Mark Adams <mfadams at lbl.gov <mailto:mfadams at lbl.gov>> wrote:
>>>>>> 
>>>>>> 
>>>>>> On Sun, Jan 8, 2023 at 4:13 PM Barry Smith <bsmith at petsc.dev <mailto:bsmith at petsc.dev>> wrote:
>>>>>>> 
>>>>>>>    There is a bug in the routine DMPlexLabelComplete_Internal()! The code should definitely not have the code route around if (nroots >= 0), because checking the nroots value to decide on the code route is simply nonsense (if one "knows" "by contract" that nroots is >= 0, then the if () test is not needed).
>>>>>>> 
>>>>>>>    The first thing to do is to fix the bug with a PetscCheck(): remove the nonsensical if (nroots >= 0) check and rerun your code to see what happens.
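>>>>>>> 
>>>>>>>    Something like the following sketch (the error code and message here are only placeholders):
>>>>>>> 
>>>>>>>     PetscCall(PetscSFGetGraph(sfPoint, &nroots, NULL, NULL, NULL));
>>>>>>>     PetscCheck(nroots >= 0, PetscObjectComm((PetscObject)dm), PETSC_ERR_ARG_WRONGSTATE, "PetscSF graph not set up on this rank");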
>>>>>> 
>>>>>> This does not fix the bug, right? It just fails cleanly, right?
>>>>>> 
>>>>>> I do have lots of empty processors in the first GAMG coarse grid. I just saw that the first GAMG coarse grid reduces the processor count to 4, from 4K.
>>>>>> This is one case where repartitioning the coarse grids could, for once, be useful.
>>>>>> 
>>>>>> Do you have a bug fix suggestion for me to try?
>>>>>> 
>>>>>> Thanks
>>>>>> 
>>>>>>> 
>>>>>>>   Barry
>>>>>>> 
>>>>>>> Yes, it is possible that in your run nroots is always >= 0 and some MPI bug is causing the problem, but this doesn't change the fact that the current code is buggy and needs to be fixed before blaming some other bug for the problem.
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>> On Jan 8, 2023, at 4:04 PM, Mark Adams <mfadams at lbl.gov <mailto:mfadams at lbl.gov>> wrote:
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Sun, Jan 8, 2023 at 2:44 PM Matthew Knepley <knepley at gmail.com <mailto:knepley at gmail.com>> wrote:
>>>>>>>>> On Sun, Jan 8, 2023 at 9:28 AM Barry Smith <bsmith at petsc.dev <mailto:bsmith at petsc.dev>> wrote:
>>>>>>>>>> 
>>>>>>>>>>   Mark,
>>>>>>>>>> 
>>>>>>>>>>   Looks like the error checking in PetscCommDuplicate() is doing its job. It is reporting an attempt to use a PETSc object constructor on a subset of ranks of an MPI_Comm (which is, of course, fundamentally impossible in the PETSc/MPI model).
>>>>>>>>>> 
>>>>>>>>>> Note that nroots can be negative on a particular rank, but DMPlexLabelComplete_Internal() is collective on the sf, based on the comment in the code below.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> struct _p_PetscSF {
>>>>>>>>>> ....
>>>>>>>>>>   PetscInt     nroots;  /* Number of root vertices on current process (candidates for incoming edges) */
>>>>>>>>>> 
>>>>>>>>>> But the next routine calls a collective only when nroots >= 0 
>>>>>>>>>> 
>>>>>>>>>> static PetscErrorCode DMPlexLabelComplete_Internal(DM dm, DMLabel label, PetscBool completeCells){
>>>>>>>>>> ...
>>>>>>>>>>   PetscCall(PetscSFGetGraph(sfPoint, &nroots, NULL, NULL, NULL));
>>>>>>>>>>   if (nroots >= 0) {
>>>>>>>>>>     DMLabel         lblRoots, lblLeaves;
>>>>>>>>>>     IS              valueIS, pointIS;
>>>>>>>>>>     const PetscInt *values;
>>>>>>>>>>     PetscInt        numValues, v;
>>>>>>>>>> 
>>>>>>>>>>     /* Pull point contributions from remote leaves into local roots */
>>>>>>>>>>     PetscCall(DMLabelGather(label, sfPoint, &lblLeaves));
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> The code is four years old? How come this problem of calling the constructor on a subset of ranks hasn't come up since day 1? 
>>>>>>>>> 
>>>>>>>>> The contract here is that it should be impossible to have nroots < 0 (meaning the SF is not set up) on a subset of processes. Do we know that this is happening?
>>>>>>>> 
>>>>>>>> Can't imagine a code bug here. Very simple code.
>>>>>>>> 
>>>>>>>> This code does use GAMG as the coarse grid solver in a pretty extreme way.
>>>>>>>> GAMG is fairly complicated and not normally used on such small problems with this much parallelism.
>>>>>>>> It is conceivable that it's a GAMG bug, but that is not what was going on in my initial email here.
>>>>>>>> 
>>>>>>>> Here is a run that timed out, but it should not have, so I think this is the same issue. I always have perfectly distributed grids like this.
>>>>>>>> 
>>>>>>>> DM Object: box 2048 MPI processes
>>>>>>>>   type: plex
>>>>>>>> box in 2 dimensions:
>>>>>>>>   Min/Max of 0-cells per rank: 8385/8580
>>>>>>>>   Min/Max of 1-cells per rank: 24768/24960
>>>>>>>>   Min/Max of 2-cells per rank: 16384/16384
>>>>>>>> Labels:
>>>>>>>>   celltype: 3 strata with value/size (1 (24768), 3 (16384), 0 (8385))
>>>>>>>>   depth: 3 strata with value/size (0 (8385), 1 (24768), 2 (16384))
>>>>>>>>   marker: 1 strata with value/size (1 (385))
>>>>>>>>   Face Sets: 1 strata with value/size (1 (381))
>>>>>>>>   Defined by transform from:
>>>>>>>>   DM_0x84000002_1 in 2 dimensions:
>>>>>>>>     Min/Max of 0-cells per rank:   2145/2244
>>>>>>>>     Min/Max of 1-cells per rank:   6240/6336
>>>>>>>>     Min/Max of 2-cells per rank:   4096/4096
>>>>>>>>   Labels:
>>>>>>>>     celltype: 3 strata with value/size (1 (6240), 3 (4096), 0 (2145))
>>>>>>>>     depth: 3 strata with value/size (0 (2145), 1 (6240), 2 (4096))
>>>>>>>>     marker: 1 strata with value/size (1 (193))
>>>>>>>>     Face Sets: 1 strata with value/size (1 (189))
>>>>>>>>     Defined by transform from:
>>>>>>>>     DM_0x84000002_2 in 2 dimensions:
>>>>>>>>       Min/Max of 0-cells per rank:     561/612
>>>>>>>>       Min/Max of 1-cells per rank:     1584/1632
>>>>>>>>       Min/Max of 2-cells per rank:     1024/1024
>>>>>>>>     Labels:
>>>>>>>>       celltype: 3 strata with value/size (1 (1584), 3 (1024), 0 (561))
>>>>>>>>       depth: 3 strata with value/size (0 (561), 1 (1584), 2 (1024))
>>>>>>>>       marker: 1 strata with value/size (1 (97))
>>>>>>>>       Face Sets: 1 strata with value/size (1 (93))
>>>>>>>>       Defined by transform from:
>>>>>>>>       DM_0x84000002_3 in 2 dimensions:
>>>>>>>>         Min/Max of 0-cells per rank:       153/180
>>>>>>>>         Min/Max of 1-cells per rank:       408/432
>>>>>>>>         Min/Max of 2-cells per rank:       256/256
>>>>>>>>       Labels:
>>>>>>>>         celltype: 3 strata with value/size (1 (408), 3 (256), 0 (153))
>>>>>>>>         depth: 3 strata with value/size (0 (153), 1 (408), 2 (256))
>>>>>>>>         marker: 1 strata with value/size (1 (49))
>>>>>>>>         Face Sets: 1 strata with value/size (1 (45))
>>>>>>>>         Defined by transform from:
>>>>>>>>         DM_0x84000002_4 in 2 dimensions:
>>>>>>>>           Min/Max of 0-cells per rank:         45/60
>>>>>>>>           Min/Max of 1-cells per rank:         108/120
>>>>>>>>           Min/Max of 2-cells per rank:         64/64
>>>>>>>>         Labels:
>>>>>>>>           celltype: 3 strata with value/size (1 (108), 3 (64), 0 (45))
>>>>>>>>           depth: 3 strata with value/size (0 (45), 1 (108), 2 (64))
>>>>>>>>           marker: 1 strata with value/size (1 (25))
>>>>>>>>           Face Sets: 1 strata with value/size (1 (21))
>>>>>>>>           Defined by transform from:
>>>>>>>>           DM_0x84000002_5 in 2 dimensions:
>>>>>>>>             Min/Max of 0-cells per rank:           15/24
>>>>>>>>             Min/Max of 1-cells per rank:           30/36
>>>>>>>>             Min/Max of 2-cells per rank:           16/16
>>>>>>>>           Labels:
>>>>>>>>             celltype: 3 strata with value/size (1 (30), 3 (16), 0 (15))
>>>>>>>>             depth: 3 strata with value/size (0 (15), 1 (30), 2 (16))
>>>>>>>>             marker: 1 strata with value/size (1 (13))
>>>>>>>>             Face Sets: 1 strata with value/size (1 (9))
>>>>>>>>             Defined by transform from:
>>>>>>>>             DM_0x84000002_6 in 2 dimensions:
>>>>>>>>               Min/Max of 0-cells per rank:             6/12
>>>>>>>>               Min/Max of 1-cells per rank:             9/12
>>>>>>>>               Min/Max of 2-cells per rank:             4/4
>>>>>>>>             Labels:
>>>>>>>>               depth: 3 strata with value/size (0 (6), 1 (9), 2 (4))
>>>>>>>>               celltype: 3 strata with value/size (0 (6), 1 (9), 3 (4))
>>>>>>>>               marker: 1 strata with value/size (1 (7))
>>>>>>>>               Face Sets: 1 strata with value/size (1 (3))
>>>>>>>> 0 TS dt 0.001 time 0.
>>>>>>>> MHD    0) time =         0, Eergy=  2.3259668003585e+00 (plot ID 0)
>>>>>>>>     0 SNES Function norm 5.415286407365e-03
>>>>>>>> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
>>>>>>>> slurmstepd: error: *** STEP 245100.0 ON crusher002 CANCELLED AT 2023-01-08T15:32:43 DUE TO TIME LIMIT ***
>>>>>>>> 
>>>>>>>>  
>>>>>>>>> 
>>>>>>>>>   Thanks,
>>>>>>>>> 
>>>>>>>>>     Matt
>>>>>>>>>  
>>>>>>>>>>> On Jan 8, 2023, at 12:21 PM, Mark Adams <mfadams at lbl.gov <mailto:mfadams at lbl.gov>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> I am running on Crusher, CPU only, 64 cores per node with Plex/PetscFE. 
>>>>>>>>>>> In going up to 64 nodes, something really catastrophic is happening. 
>>>>>>>>>>> I understand I am not using the machine the way it was intended, but I just want to see if there are any options that I could try for a quick fix/help.
>>>>>>>>>>> 
>>>>>>>>>>> In a debug build I get a stack trace on many, but not all, of the 4K processes. 
>>>>>>>>>>> Alas, I am not sure why this job was terminated, but every process that I checked that had an "ERROR" had this stack:
>>>>>>>>>>> 
>>>>>>>>>>> 11:57 main *+= crusher:/gpfs/alpine/csc314/scratch/adams/mg-m3dc1/src/data$ grep ERROR slurm-245063.out |g 3160
>>>>>>>>>>> [3160]PETSC ERROR: ------------------------------------------------------------------------
>>>>>>>>>>> [3160]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end
>>>>>>>>>>> [3160]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
>>>>>>>>>>> [3160]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind and https://petsc.org/release/faq/
>>>>>>>>>>> [3160]PETSC ERROR: ---------------------  Stack Frames ------------------------------------
>>>>>>>>>>> [3160]PETSC ERROR: The line numbers in the error traceback are not always exact.
>>>>>>>>>>> [3160]PETSC ERROR: #1 MPI function
>>>>>>>>>>> [3160]PETSC ERROR: #2 PetscCommDuplicate() at /gpfs/alpine/csc314/scratch/adams/petsc/src/sys/objects/tagm.c:248
>>>>>>>>>>> [3160]PETSC ERROR: #3 PetscHeaderCreate_Private() at /gpfs/alpine/csc314/scratch/adams/petsc/src/sys/objects/inherit.c:56
>>>>>>>>>>> [3160]PETSC ERROR: #4 PetscSFCreate() at /gpfs/alpine/csc314/scratch/adams/petsc/src/vec/is/sf/interface/sf.c:65
>>>>>>>>>>> [3160]PETSC ERROR: #5 DMLabelGather() at /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/label/dmlabel.c:1932
>>>>>>>>>>> [3160]PETSC ERROR: #6 DMPlexLabelComplete_Internal() at /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/impls/plex/plexsubmesh.c:177
>>>>>>>>>>> [3160]PETSC ERROR: #7 DMPlexLabelComplete() at /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/impls/plex/plexsubmesh.c:227
>>>>>>>>>>> [3160]PETSC ERROR: #8 DMCompleteBCLabels_Internal() at /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/interface/dm.c:5301
>>>>>>>>>>> [3160]PETSC ERROR: #9 DMCopyDS() at /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/interface/dm.c:6117
>>>>>>>>>>> [3160]PETSC ERROR: #10 DMCopyDisc() at /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/interface/dm.c:6143
>>>>>>>>>>> [3160]PETSC ERROR: #11 SetupDiscretization() at /gpfs/alpine/csc314/scratch/adams/mg-m3dc1/src/mhd_2field.c:755
>>>>>>>>>>> 
>>>>>>>>>>> Maybe the MPI is just getting overwhelmed. 
>>>>>>>>>>> 
>>>>>>>>>>> And I was able to get one run to work (one TS with beuler), but the solver performance was horrendous and I see this (attached):
>>>>>>>>>>> 
>>>>>>>>>>> Time (sec):           1.601e+02     1.001   1.600e+02
>>>>>>>>>>> VecMDot           111712 1.0 5.1684e+01 1.4 2.32e+07 12.8 0.0e+00 0.0e+00 1.1e+05 30  4  0  0 23  30  4  0  0 23   499
>>>>>>>>>>> VecNorm           163478 1.0 6.6660e+01 1.2 1.51e+07 21.5 0.0e+00 0.0e+00 1.6e+05 39  2  0  0 34  39  2  0  0 34   139
>>>>>>>>>>> VecNormalize      154599 1.0 6.3942e+01 1.2 2.19e+07 23.3 0.0e+00 0.0e+00 1.5e+05 38  2  0  0 32  38  2  0  0 32   189
>>>>>>>>>>> etc,
>>>>>>>>>>> KSPSolve               3 1.0 1.1553e+02 1.0 1.34e+09 47.1 2.8e+09 6.0e+01 2.8e+05 72 95 45 72 58  72 95 45 72 58  4772
>>>>>>>>>>> 
>>>>>>>>>>> Any ideas would be welcome,
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Mark
>>>>>>>>>>> <cushersolve.txt>
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> -- 
>>>>>>>>> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
>>>>>>>>> -- Norbert Wiener
>>>>>>>>> 
>>>>>>>>> https://www.cse.buffalo.edu/~knepley/ <http://www.cse.buffalo.edu/~knepley/>
>>>>>>> 
>>>>> 
>>> 
>> <crushererror.txt.gz>
> 
