<div dir="ltr">OK, here is a timeout one with 4K processors.</div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Jan 9, 2023 at 3:13 PM Barry Smith <<a href="mailto:bsmith@petsc.dev">bsmith@petsc.dev</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div><br><div><br><blockquote type="cite"><div>On Jan 9, 2023, at 12:13 PM, Mark Adams <<a href="mailto:mfadams@lbl.gov" target="_blank">mfadams@lbl.gov</a>> wrote:</div><br><div><div dir="ltr">Sorry, I deleted it. I submitted a 128-node job with a debug build to try to reproduce it, but it finished without this error.<div>It had the same bad performance as my original log.<br><div>I think this message comes from a (time-out) signal.<br></div></div></div></div></blockquote><div><br></div> Sure, but we would need to know whether the time-out is happening at multiple different places in the code. </div><div><br></div><div><br></div><div><br><blockquote type="cite"><div><div dir="ltr"><div><div><div><br></div><div>What I am seeing, though, is just really bad performance on large jobs. It is sudden. 
Scaling is good up to about 32 nodes, and then there is a 100x slowdown in the solver (the vec norms) on 128 nodes (and 64).</div><div><br></div><div>Thanks,</div><div>Mark</div><div><br></div></div></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sun, Jan 8, 2023 at 8:07 PM Barry Smith <<a href="mailto:bsmith@petsc.dev" target="_blank">bsmith@petsc.dev</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div><div><br></div><div> Rather cryptic use of that integer variable for anyone who did not write the code, but I guess one was short on bytes when it was written :-).</div><div><br></div> Ok, based on what Jed said, if we are certain that either all or none of the nroots are -1, then since you are getting hangs in PetscCommDuplicate(), this might indicate that some ranks have called a constructor on some OTHER object that not all ranks are creating. Of course, it could be an MPI bug.<div><br></div><div> Mark, can you send the file containing all the output from the Signal Terminate run? Maybe there will be a hint in there of a different constructor being called on some ranks.</div><div><br></div><div> Barry</div><div><br><div><br></div><div><br><blockquote type="cite"><div dir="ltr"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><blockquote type="cite"><div dir="ltr"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">DMLabelGather</blockquote></div></div></blockquote></div></div></blockquote></blockquote></div></div></blockquote><div><br><blockquote type="cite"><div>On Jan 8, 2023, at 4:32 PM, Mark Adams <<a 
href="mailto:mfadams@lbl.gov" target="_blank">mfadams@lbl.gov</a>> wrote:</div><br><div><div dir="ltr"><div dir="ltr"></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sun, Jan 8, 2023 at 4:13 PM Barry Smith <<a href="mailto:bsmith@petsc.dev" target="_blank">bsmith@petsc.dev</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div><div><br></div> There is a bug in the routine DMPlexLabelComplete_Internal()! The code should definitely not route around the collective work with if (nroots >= 0), because checking the nroots value to decide on the code route is simply nonsense (if one "knows" "by contract" that nroots is >= 0, then the if () test is not needed).<div><br></div><div> The first thing to do is to fix the bug with a PetscCheck(): remove the nonsensical if (nroots >= 0) check and rerun your code to see what happens.</div></div></blockquote><div><br></div><div>This does not fix the bug, right? It just fails cleanly, right?</div><div><br></div><div>I do have lots of empty processors in the first GAMG coarse grid. 
I just saw that the first GAMG coarse grid reduces the processor count to 4, from 4K.</div><div>This is one case where the coarse grids could be repartitioned; for once that feature could be used.</div><div><br></div><div>Do you have a bug fix suggestion for me to try?</div><div><br></div><div>Thanks</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div><div><br></div><div> Barry</div><div><br></div><div>Yes, it is possible that in your run nroots is always >= 0 and some MPI bug is causing the problem, but this doesn't change the fact that the current code is buggy and needs to be fixed before blaming some other bug for the problem.</div><div><br></div><div><br><div><br><blockquote type="cite"><div>On Jan 8, 2023, at 4:04 PM, Mark Adams <<a href="mailto:mfadams@lbl.gov" target="_blank">mfadams@lbl.gov</a>> wrote:</div><br><div><div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sun, Jan 8, 2023 at 2:44 PM Matthew Knepley <<a href="mailto:knepley@gmail.com" target="_blank">knepley@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr">On Sun, Jan 8, 2023 at 9:28 AM Barry Smith <<a href="mailto:bsmith@petsc.dev" target="_blank">bsmith@petsc.dev</a>> wrote:<br></div><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div><div><br></div><div> Mark,</div><div><br></div><div> Looks like the error checking in PetscCommDuplicate() is doing its job. 
It is reporting an attempt to use a PETSc object constructor on a subset of ranks of an MPI_Comm (which is, of course, fundamentally impossible in the PETSc/MPI model).</div><div><br></div><div>Note that nroots can be negative on a particular rank, but DMPlexLabelComplete_Internal() is collective on the sf, based on the comment in the code below:</div><div><br></div><div><br></div><div>struct _p_PetscSF {</div><div>....</div><div> PetscInt nroots; /* Number of root vertices on current process (candidates for incoming edges) */</div><div><br></div><div>But the next routine calls a collective only when nroots >= 0:</div><div><br></div><div>static PetscErrorCode DMPlexLabelComplete_Internal(DM dm, DMLabel label, PetscBool completeCells){</div><div>...</div><div><div> PetscCall(PetscSFGetGraph(sfPoint, &nroots, NULL, NULL, NULL));</div><div> if (nroots >= 0) {</div><div> DMLabel lblRoots, lblLeaves;</div><div> IS valueIS, pointIS;</div><div> const PetscInt *values;</div><div> PetscInt numValues, v;</div><div><br></div><div> /* Pull point contributions from remote leaves into local roots */</div><div> PetscCall(DMLabelGather(label, sfPoint, &lblLeaves));</div><div><br></div></div><div><br></div><div>The code is four years old? How come this problem of calling the constructor on a subset of ranks hasn't come up since day 1? </div></div></blockquote><div><br></div><div>The contract here is that it should be impossible to have nroots < 0 (meaning the SF is not set up) on a subset of processes. Do we know that this is happening?</div></div></div></blockquote><div><br></div><div>Can't imagine a code bug here. 
Very simple code.</div><div><br></div><div>This code does use GAMG as the coarse grid solver in a pretty extreme way.</div><div>GAMG is fairly complicated and not used on such small problems with this much parallelism.</div><div>It is conceivable that it's a GAMG bug, but that is not what was going on in my initial email here.</div><div><br></div><div>Here is a run that timed out, but it should not have, so I think this is the same issue. I always have perfectly distributed grids like this.</div><div><br></div><div>DM Object: box 2048 MPI processes<br> type: plex<br>box in 2 dimensions:<br> Min/Max of 0-cells per rank: 8385/8580<br> Min/Max of 1-cells per rank: 24768/24960<br> Min/Max of 2-cells per rank: 16384/16384<br>Labels:<br> celltype: 3 strata with value/size (1 (24768), 3 (16384), 0 (8385))<br> depth: 3 strata with value/size (0 (8385), 1 (24768), 2 (16384))<br> marker: 1 strata with value/size (1 (385))<br> Face Sets: 1 strata with value/size (1 (381))<br> Defined by transform from:<br> DM_0x84000002_1 in 2 dimensions:<br> Min/Max of 0-cells per rank: 2145/2244<br> Min/Max of 1-cells per rank: 6240/6336<br> Min/Max of 2-cells per rank: 4096/4096<br> Labels:<br> celltype: 3 strata with value/size (1 (6240), 3 (4096), 0 (2145))<br> depth: 3 strata with value/size (0 (2145), 1 (6240), 2 (4096))<br> marker: 1 strata with value/size (1 (193))<br> Face Sets: 1 strata with value/size (1 (189))<br> Defined by transform from:<br> DM_0x84000002_2 in 2 dimensions:<br> Min/Max of 0-cells per rank: 561/612<br> Min/Max of 1-cells per rank: 1584/1632<br> Min/Max of 2-cells per rank: 1024/1024<br> Labels:<br> celltype: 3 strata with value/size (1 (1584), 3 (1024), 0 (561))<br> depth: 3 strata with value/size (0 (561), 1 (1584), 2 (1024))<br> marker: 1 strata with value/size (1 (97))<br> Face Sets: 1 strata with value/size (1 (93))<br> Defined by transform from:<br> DM_0x84000002_3 in 2 dimensions:<br> Min/Max of 0-cells per rank: 153/180<br> Min/Max of 1-cells per rank: 
408/432<br> Min/Max of 2-cells per rank: 256/256<br> Labels:<br> celltype: 3 strata with value/size (1 (408), 3 (256), 0 (153))<br> depth: 3 strata with value/size (0 (153), 1 (408), 2 (256))<br> marker: 1 strata with value/size (1 (49))<br> Face Sets: 1 strata with value/size (1 (45))<br> Defined by transform from:<br> DM_0x84000002_4 in 2 dimensions:<br> Min/Max of 0-cells per rank: 45/60<br> Min/Max of 1-cells per rank: 108/120<br> Min/Max of 2-cells per rank: 64/64<br> Labels:<br> celltype: 3 strata with value/size (1 (108), 3 (64), 0 (45))<br> depth: 3 strata with value/size (0 (45), 1 (108), 2 (64))<br> marker: 1 strata with value/size (1 (25))<br> Face Sets: 1 strata with value/size (1 (21))<br> Defined by transform from:<br> DM_0x84000002_5 in 2 dimensions:<br> Min/Max of 0-cells per rank: 15/24<br> Min/Max of 1-cells per rank: 30/36<br> Min/Max of 2-cells per rank: 16/16<br> Labels:<br> celltype: 3 strata with value/size (1 (30), 3 (16), 0 (15))<br> depth: 3 strata with value/size (0 (15), 1 (30), 2 (16))<br> marker: 1 strata with value/size (1 (13))<br> Face Sets: 1 strata with value/size (1 (9))<br> Defined by transform from:<br> DM_0x84000002_6 in 2 dimensions:<br> Min/Max of 0-cells per rank: 6/12<br> Min/Max of 1-cells per rank: 9/12<br> Min/Max of 2-cells per rank: 4/4<br> Labels:<br> depth: 3 strata with value/size (0 (6), 1 (9), 2 (4))<br> celltype: 3 strata with value/size (0 (6), 1 (9), 3 (4))<br> marker: 1 strata with value/size (1 (7))<br> Face Sets: 1 strata with value/size (1 (3))<br>0 TS dt 0.001 time 0.<br>MHD 0) time = 0, Eergy= 2.3259668003585e+00 (plot ID 0)<br> 0 SNES Function norm 5.415286407365e-03<br>srun: Job step aborted: Waiting up to 32 seconds for job step to finish.<br>slurmstepd: error: *** STEP 245100.0 ON crusher002 CANCELLED AT 2023-01-08T15:32:43 DUE TO TIME LIMIT ***<br></div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid 
rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div class="gmail_quote"><div><br></div><div> Thanks,</div><div><br></div><div> Matt</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div><div><blockquote type="cite"><div>On Jan 8, 2023, at 12:21 PM, Mark Adams <<a href="mailto:mfadams@lbl.gov" target="_blank">mfadams@lbl.gov</a>> wrote:</div><br><div><div dir="ltr">I am running on Crusher, CPU only, 64 cores per node with Plex/PetscFE. <div>In going up to 64 nodes, something really catastrophic is happening. </div>I understand I am not using the machine the way it was intended, but I just want to see if there are any options that I could try for a quick fix/help.<div><br><div>In a debug build I get a stack trace on many but not all of the 4K processes. <br></div><div>Alas, I am not sure why this job was terminated but every process that I checked, that had an "ERROR", had this stack:</div><div><br></div><div>11:57 main *+= crusher:/gpfs/alpine/csc314/scratch/adams/mg-m3dc1/src/data$ grep ERROR slurm-245063.out |g 3160<br>[3160]PETSC ERROR: ------------------------------------------------------------------------<br>[3160]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end<br>[3160]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger<br>[3160]PETSC ERROR: or see <a href="https://petsc.org/release/faq/#valgrind" target="_blank">https://petsc.org/release/faq/#valgrind</a> and <a href="https://petsc.org/release/faq/" target="_blank">https://petsc.org/release/faq/</a><br>[3160]PETSC ERROR: --------------------- Stack Frames ------------------------------------<br>[3160]PETSC ERROR: The line numbers in the error traceback are not always exact.<br>[3160]PETSC ERROR: #1 MPI function<br>[3160]PETSC ERROR: #2 PetscCommDuplicate() at 
/gpfs/alpine/csc314/scratch/adams/petsc/src/sys/objects/tagm.c:248<br>[3160]PETSC ERROR: #3 PetscHeaderCreate_Private() at /gpfs/alpine/csc314/scratch/adams/petsc/src/sys/objects/inherit.c:56<br>[3160]PETSC ERROR: #4 PetscSFCreate() at /gpfs/alpine/csc314/scratch/adams/petsc/src/vec/is/sf/interface/sf.c:65<br>[3160]PETSC ERROR: #5 DMLabelGather() at /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/label/dmlabel.c:1932<br>[3160]PETSC ERROR: #6 DMPlexLabelComplete_Internal() at /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/impls/plex/plexsubmesh.c:177<br>[3160]PETSC ERROR: #7 DMPlexLabelComplete() at /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/impls/plex/plexsubmesh.c:227<br>[3160]PETSC ERROR: #8 DMCompleteBCLabels_Internal() at /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/interface/dm.c:5301<br>[3160]PETSC ERROR: #9 DMCopyDS() at /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/interface/dm.c:6117<br>[3160]PETSC ERROR: #10 DMCopyDisc() at /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/interface/dm.c:6143<br>[3160]PETSC ERROR: #11 SetupDiscretization() at /gpfs/alpine/csc314/scratch/adams/mg-m3dc1/src/mhd_2field.c:755<br></div><div><br></div><div>Maybe the MPI is just getting overwhelmed. </div><div><br></div><div>And I was able to get one run to work (one TS with beuler), and the solver performance was horrendous, and I see this (attached):</div><div><br></div><div>Time (sec): 1.601e+02 1.001 1.600e+02<br></div><div>VecMDot 111712 1.0 5.1684e+01 1.4 2.32e+07 12.8 0.0e+00 0.0e+00 1.1e+05 30 4 0 0 23 30 4 0 0 23 499<br>VecNorm 163478 1.0 6.6660e+01 1.2 1.51e+07 21.5 0.0e+00 0.0e+00 1.6e+05 39 2 0 0 34 39 2 0 0 34 139<br></div><div>VecNormalize 154599 1.0 6.3942e+01 1.2 2.19e+07 23.3 0.0e+00 0.0e+00 1.5e+05 38 2 0 0 32 38 2 0 0 32 189<br></div><div>etc,</div><div>KSPSolve 3 1.0 1.1553e+02 1.0 1.34e+09 47.1 2.8e+09 6.0e+01 2.8e+05 72 95 45 72 58 72 95 45 72 58 4772<br></div><div><br></div><div>Any ideas would be 
welcome,</div><div>Thanks,</div><div>Mark</div></div></div>
<span id="m_4252934208922418232m_-638888774209831635m_1040180228679549843m_793772223963650823m_5996805930052341808m_-6643793360929026991m_-2264126495614651966cid:f_lcnn3feu0"><cushersolve.txt></span></div></blockquote></div><br></div></blockquote></div><br clear="all"><div><br></div>-- <br><div dir="ltr"><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div>What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.<br>-- Norbert Wiener</div><div><br></div><div><a href="http://www.cse.buffalo.edu/~knepley/" target="_blank">https://www.cse.buffalo.edu/~knepley/</a><br></div></div></div></div></div></div></div></div>
</blockquote></div></div>
</div></blockquote></div><br></div></div></blockquote></div></div>
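[Editor's note] Mark's remark about repartitioning the GAMG coarse grids corresponds to real PETSc options: GAMG can repartition coarse grids so reduced levels do not leave most ranks empty. A hedged sketch of an options fragment — the option names (PCGAMGSetRepartition / PCGAMGSetProcEqLim) exist in PETSc, but the numeric value is illustrative, not a tuned recommendation:

```
# Repartition coarse grids instead of leaving empty ranks behind
-pc_gamg_repartition true
# Aim for at least this many equations per process before reducing
# the active process count (illustrative value)
-pc_gamg_process_eq_limit 200
```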
</div></blockquote></div><br></div></div></div></blockquote></div>
</div></blockquote></div><br></div></blockquote></div>