<div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sun, Jan 8, 2023 at 2:44 PM Matthew Knepley <<a href="mailto:knepley@gmail.com" target="_blank">knepley@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr">On Sun, Jan 8, 2023 at 9:28 AM Barry Smith <<a href="mailto:bsmith@petsc.dev" target="_blank">bsmith@petsc.dev</a>> wrote:<br></div><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div><div><br></div><div> Mark,</div><div><br></div><div> Looks like the error checking in PetscCommDuplicate() is doing its job. It is reporting an attempt to use an PETSc object constructer on a subset of ranks of an MPI_Comm (which is, of course, fundamentally impossible in the PETSc/MPI model)</div><div><br></div><div>Note that nroots can be negative on a particular rank but DMPlexLabelComplete_Internal() is collective on sf based on the comment in the code below</div><div><br></div><div><br></div><div>struct _p_PetscSF {</div><div>....</div><div> PetscInt nroots; /* Number of root vertices on current process (candidates for incoming edges) */</div><div><br></div><div>But the next routine calls a collective only when nroots >= 0 </div><div><br></div><div>static PetscErrorCode DMPlexLabelComplete_Internal(DM dm, DMLabel label, PetscBool completeCells){</div><div>...</div><div><div> PetscCall(PetscSFGetGraph(sfPoint, &nroots, NULL, NULL, NULL));</div><div> if (nroots >= 0) {</div><div> DMLabel lblRoots, lblLeaves;</div><div> IS valueIS, pointIS;</div><div> const PetscInt *values;</div><div> PetscInt numValues, v;</div><div><br></div><div> /* Pull point contributions from remote leaves into local roots */</div><div> PetscCall(DMLabelGather(label, sfPoint, &lblLeaves));</div><div><br></div></div><div><br></div><div>The code is four years old? How come this problem of calling the constructure on a subset of ranks hasn't come up since day 1? </div></div></blockquote><div><br></div><div>The contract here is that it should be impossible to have nroots < 0 (meaning the SF is not setup) on a subset of processes. Do we know that this is happening?</div></div></div></blockquote><div><br></div><div>Can't imagine a code bug here. Very simple code.</div><div><br></div><div>This code does use GAMG as the coarse grid solver in a pretty extreme way.</div><div>GAMG is fairly complicated and not used on such small problems with high parallelism.</div><div>It is conceivable that its a GAMG bug, but that is not what was going on in my initial emal here.</div><div><br></div><div>Here is a run that timed out, but it should not have so I think this is the same issue. 
Here is a run that timed out, but it should not have, so I think this is the same issue. I always have perfectly distributed grids like this:

DM Object: box 2048 MPI processes
 type: plex
box in 2 dimensions:
 Min/Max of 0-cells per rank: 8385/8580
 Min/Max of 1-cells per rank: 24768/24960
 Min/Max of 2-cells per rank: 16384/16384
Labels:
 celltype: 3 strata with value/size (1 (24768), 3 (16384), 0 (8385))
 depth: 3 strata with value/size (0 (8385), 1 (24768), 2 (16384))
 marker: 1 strata with value/size (1 (385))
 Face Sets: 1 strata with value/size (1 (381))
 Defined by transform from:
 DM_0x84000002_1 in 2 dimensions:
 Min/Max of 0-cells per rank: 2145/2244
 Min/Max of 1-cells per rank: 6240/6336
 Min/Max of 2-cells per rank: 4096/4096
 Labels:
 celltype: 3 strata with value/size (1 (6240), 3 (4096), 0 (2145))
 depth: 3 strata with value/size (0 (2145), 1 (6240), 2 (4096))
 marker: 1 strata with value/size (1 (193))
 Face Sets: 1 strata with value/size (1 (189))
 Defined by transform from:
 DM_0x84000002_2 in 2 dimensions:
 Min/Max of 0-cells per rank: 561/612
 Min/Max of 1-cells per rank: 1584/1632
 Min/Max of 2-cells per rank: 1024/1024
 Labels:
 celltype: 3 strata with value/size (1 (1584), 3 (1024), 0 (561))
 depth: 3 strata with value/size (0 (561), 1 (1584), 2 (1024))
 marker: 1 strata with value/size (1 (97))
 Face Sets: 1 strata with value/size (1 (93))
 Defined by transform from:
 DM_0x84000002_3 in 2 dimensions:
 Min/Max of 0-cells per rank: 153/180
 Min/Max of 1-cells per rank: 408/432
 Min/Max of 2-cells per rank: 256/256
 Labels:
 celltype: 3 strata with value/size (1 (408), 3 (256), 0 (153))
 depth: 3 strata with value/size (0 (153), 1 (408), 2 (256))
 marker: 1 strata with value/size (1 (49))
 Face Sets: 1 strata with value/size (1 (45))
 Defined by transform from:
 DM_0x84000002_4 in 2 dimensions:
 Min/Max of 0-cells per rank: 45/60
 Min/Max of 1-cells per rank: 108/120
 Min/Max of 2-cells per rank: 64/64
 Labels:
 celltype: 3 strata with value/size (1 (108), 3 (64), 0 (45))
 depth: 3 strata with value/size (0 (45), 1 (108), 2 (64))
 marker: 1 strata with value/size (1 (25))
 Face Sets: 1 strata with value/size (1 (21))
 Defined by transform from:
 DM_0x84000002_5 in 2 dimensions:
 Min/Max of 0-cells per rank: 15/24
 Min/Max of 1-cells per rank: 30/36
 Min/Max of 2-cells per rank: 16/16
 Labels:
 celltype: 3 strata with value/size (1 (30), 3 (16), 0 (15))
 depth: 3 strata with value/size (0 (15), 1 (30), 2 (16))
 marker: 1 strata with value/size (1 (13))
 Face Sets: 1 strata with value/size (1 (9))
 Defined by transform from:
 DM_0x84000002_6 in 2 dimensions:
 Min/Max of 0-cells per rank: 6/12
 Min/Max of 1-cells per rank: 9/12
 Min/Max of 2-cells per rank: 4/4
 Labels:
 depth: 3 strata with value/size (0 (6), 1 (9), 2 (4))
 celltype: 3 strata with value/size (0 (6), 1 (9), 3 (4))
 marker: 1 strata with value/size (1 (7))
 Face Sets: 1 strata with value/size (1 (3))
0 TS dt 0.001 time 0.
MHD 0) time = 0, Eergy= 2.3259668003585e+00 (plot ID 0)
 0 SNES Function norm 5.415286407365e-03
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 245100.0 ON crusher002 CANCELLED AT 2023-01-08T15:32:43 DUE TO TIME LIMIT ***
>   Thanks,
>
>      Matt
>
>> On Jan 8, 2023, at 12:21 PM, Mark Adams <mfadams@lbl.gov> wrote:
>>
>>> I am running on Crusher, CPU only, 64 cores per node, with Plex/PetscFE.
>>> In going up to 64 nodes, something really catastrophic is happening.
>>> I understand I am not using the machine the way it was intended, but I just want to see if there are any options I could try for a quick fix or some help.
>>>
>>> In a debug build I get a stack trace on many, but not all, of the 4K processes.
>>> Alas, I am not sure why this job was terminated, but every process that I checked that had an "ERROR" had this stack:
>>>
>>> 11:57 main *+= crusher:/gpfs/alpine/csc314/scratch/adams/mg-m3dc1/src/data$ grep ERROR slurm-245063.out |g 3160
>>> [3160]PETSC ERROR: ------------------------------------------------------------------------
>>> [3160]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end
>>> [3160]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
>>> [3160]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind and https://petsc.org/release/faq/
>>> [3160]PETSC ERROR: --------------------- Stack Frames ------------------------------------
>>> [3160]PETSC ERROR: The line numbers in the error traceback are not always exact.
>>> [3160]PETSC ERROR: #1 MPI function
>>> [3160]PETSC ERROR: #2 PetscCommDuplicate() at /gpfs/alpine/csc314/scratch/adams/petsc/src/sys/objects/tagm.c:248
>>> [3160]PETSC ERROR: #3 PetscHeaderCreate_Private() at /gpfs/alpine/csc314/scratch/adams/petsc/src/sys/objects/inherit.c:56
>>> [3160]PETSC ERROR: #4 PetscSFCreate() at /gpfs/alpine/csc314/scratch/adams/petsc/src/vec/is/sf/interface/sf.c:65
>>> [3160]PETSC ERROR: #5 DMLabelGather() at /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/label/dmlabel.c:1932
>>> [3160]PETSC ERROR: #6 DMPlexLabelComplete_Internal() at /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/impls/plex/plexsubmesh.c:177
>>> [3160]PETSC ERROR: #7 DMPlexLabelComplete() at /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/impls/plex/plexsubmesh.c:227
>>> [3160]PETSC ERROR: #8 DMCompleteBCLabels_Internal() at /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/interface/dm.c:5301
>>> [3160]PETSC ERROR: #9 DMCopyDS() at /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/interface/dm.c:6117
>>> [3160]PETSC ERROR: #10 DMCopyDisc() at /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/interface/dm.c:6143
>>> [3160]PETSC ERROR: #11 SetupDiscretization() at /gpfs/alpine/csc314/scratch/adams/mg-m3dc1/src/mhd_2field.c:755
>>>
>>> Maybe the MPI is just getting overwhelmed.
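According to Barry's reading above, the check in PetscCommDuplicate() is reporting a collective PETSc constructor reached on only some ranks of a communicator. A stripped-down sketch of that pattern (purely illustrative, not taken from the application; the rank-dependent guard stands in for the nroots test):

  #include <petscsf.h>

  int main(int argc, char **argv)
  {
    PetscMPIInt rank;
    PetscSF     sf;

    PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));
    PetscCallMPI(MPI_Comm_rank(PETSC_COMM_WORLD, &rank));
    if (rank % 2 == 0) {                                /* rank-dependent guard: the bug */
      PetscCall(PetscSFCreate(PETSC_COMM_WORLD, &sf));  /* collective constructor on PETSC_COMM_WORLD */
      PetscCall(PetscSFDestroy(&sf));
    }
    PetscCall(PetscFinalize());
    return 0;
  }

The ranks that take the branch wait in the collective communicator setup inside PetscSFCreate()/PetscCommDuplicate(); the ranks that skip it move on to some other collective, so a run like this either aborts with an MPI error or hangs until something kills it, which is consistent with the trace above: the processes were sitting in an MPI call inside PetscCommDuplicate() when they were told to terminate.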
>>>
>>> And I was able to get one run to work (one TS with beuler), and the solver performance was horrendous, and I see this (attached):
>>>
>>> Time (sec):           1.601e+02     1.001   1.600e+02
>>> VecMDot       111712 1.0 5.1684e+01 1.4 2.32e+07 12.8 0.0e+00 0.0e+00 1.1e+05 30  4  0  0 23  30  4  0  0 23   499
>>> VecNorm       163478 1.0 6.6660e+01 1.2 1.51e+07 21.5 0.0e+00 0.0e+00 1.6e+05 39  2  0  0 34  39  2  0  0 34   139
>>> VecNormalize  154599 1.0 6.3942e+01 1.2 2.19e+07 23.3 0.0e+00 0.0e+00 1.5e+05 38  2  0  0 32  38  2  0  0 32   189
>>> etc.
>>> KSPSolve           3 1.0 1.1553e+02 1.0 1.34e+09 47.1 2.8e+09 6.0e+01 2.8e+05 72 95 45 72 58  72 95 45 72 58  4772
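For scale, assuming these are the usual -log_view columns (call count, max time and max/min ratio, max flop and max/min ratio, messages, average message length, reductions, then the percentage columns), a rough reading of those numbers is:

  VecNorm:           6.6660e+01 s / 163478 calls  ≈ 4.1e-4 s, about 0.4 ms per global reduction
  VecMDot + VecNorm: 30% + 39%  ≈ 69% of the 1.600e+02 s run

So roughly two thirds of the run is spent in allreduce-type operations, and the flop ratios (12.8, 21.5, 47.1) say that what little work there is is concentrated on a few ranks, which is what one would expect on the very small coarse levels.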
>>>
>>> Any ideas would be welcome,
>>> Thanks,
>>> Mark
>>>
>>> <cushersolve.txt>
>
> --
> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
> -- Norbert Wiener
>
> https://www.cse.buffalo.edu/~knepley/