<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On Mon, Apr 3, 2017 at 6:11 AM, Jed Brown <span dir="ltr"><<a href="mailto:jed@jedbrown.org" target="_blank">jed@jedbrown.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">Justin Chang <<a href="mailto:jychang48@gmail.com">jychang48@gmail.com</a>> writes:<br>

<br>

> So if I begin with a 128x128x8 grid on 1032 procs, it works fine for the<br>

> first two levels of da_refine. However, on the third level I get this error:<br>

><br>

> Level 3 domain size (m)    1e+04 x    1e+04 x    1e+03, num elements 1024 x<br>

> 1024 x 57 (59768832), size (m) 9.76562 x 9.76562 x 17.8571<br>

> Level 2 domain size (m)    1e+04 x    1e+04 x    1e+03, num elements 512 x<br>

> 512 x 29 (7602176), size (m) 19.5312 x 19.5312 x 35.7143<br>

> Level 1 domain size (m)    1e+04 x    1e+04 x    1e+03, num elements 256 x<br>

> 256 x 15 (983040), size (m) 39.0625 x 39.0625 x 71.4286<br>

> Level 0 domain size (m)    1e+04 x    1e+04 x    1e+03, num elements 128 x<br>

> 128 x 8 (131072), size (m) 78.125 x 78.125 x 142.857<br>

> [0]PETSC ERROR: --------------------- Error Message<br>

> ------------------------------<wbr>------------------------------<wbr>--<br>

> [0]PETSC ERROR: Petsc has generated inconsistent data<br>

> [0]PETSC ERROR: Eigen estimator failed: DIVERGED_NANORINF at iteration 0<br>

<br>

</span>Building with debugging and adding -fp_trap to get a stack trace would<br>

be really useful.  Or reproducing at smaller scale.</blockquote><div><br></div><div>I can't think why it would fail there, but DMDA really likes odd numbers of vertices, because it wants</div><div>to take every other point when coarsening, so 129 seems good. I will see if I can reproduce once I get a chance.</div><div><br></div><div>And now you see why it almost always takes a full-time person just to run jobs on one of these machines.</div><div>Horrible design flaws never get fixed.</div><div><br></div><div>   Matt</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div><div class="h5">

> [0]PETSC ERROR: See <a href="http://www.mcs.anl.gov/petsc/documentation/faq.html" rel="noreferrer" target="_blank">http://www.mcs.anl.gov/petsc/<wbr>documentation/faq.html</a> for<br>

> trouble shooting.<br>

> [0]PETSC ERROR: Petsc Development GIT revision: v3.7.5-3418-ge372536  GIT<br>

> Date: 2017-03-30 13:35:15 -0500<br>

> [0]PETSC ERROR: /scratch2/scratchdirs/jychang/<wbr>Icesheet/./ex48edison on a<br>

> arch-edison-c-opt named nid00865 by jychang Sun Apr  2 21:44:44 2017<br>

> [0]PETSC ERROR: Configure options --download-fblaslapack --with-cc=cc<br>

> --with-clib-autodetect=0 --with-cxx=CC --with-cxxlib-autodetect=0<br>

> --with-debugging=0 --with-fc=ftn --with-fortranlib-autodetect=0<br>

> --with-mpiexec=srun --with-64-bit-indices=1 COPTFLAGS=-O3 CXXOPTFLAGS=-O3<br>

> FOPTFLAGS=-O3 PETSC_ARCH=arch-edison-c-opt<br>

> [0]PETSC ERROR: #1 KSPSolve_Chebyshev() line 380 in<br>

> /global/u1/j/jychang/Software/<wbr>petsc/src/ksp/ksp/impls/cheby/<wbr>cheby.c<br>

> [0]PETSC ERROR: #2 KSPSolve() line 655 in /global/u1/j/jychang/Software/<br>

> petsc/src/ksp/ksp/interface/<wbr>itfunc.c<br>

> [0]PETSC ERROR: #3 PCMGMCycle_Private() line 19 in<br>

> /global/u1/j/jychang/Software/<wbr>petsc/src/ksp/pc/impls/mg/mg.c<br>

> [0]PETSC ERROR: #4 PCMGMCycle_Private() line 53 in<br>

> /global/u1/j/jychang/Software/<wbr>petsc/src/ksp/pc/impls/mg/mg.c<br>

> [0]PETSC ERROR: #5 PCApply_MG() line 331 in /global/u1/j/jychang/Software/<br>

> petsc/src/ksp/pc/impls/mg/mg.c<br>

> [0]PETSC ERROR: #6 PCApply() line 458 in /global/u1/j/jychang/Software/<br>

> petsc/src/ksp/pc/interface/<wbr>precon.c<br>

> [0]PETSC ERROR: #7 KSP_PCApply() line 251 in /global/homes/j/jychang/<br>

> Software/petsc/include/petsc/<wbr>private/kspimpl.h<br>

> [0]PETSC ERROR: #8 KSPInitialResidual() line 67 in<br>

> /global/u1/j/jychang/Software/<wbr>petsc/src/ksp/ksp/interface/<wbr>itres.c<br>

> [0]PETSC ERROR: #9 KSPSolve_GMRES() line 233 in<br>

> /global/u1/j/jychang/Software/<wbr>petsc/src/ksp/ksp/impls/gmres/<wbr>gmres.c<br>

> [0]PETSC ERROR: #10 KSPSolve() line 655 in /global/u1/j/jychang/Software/<br>

> petsc/src/ksp/ksp/interface/<wbr>itfunc.c<br>

> [0]PETSC ERROR: #11 SNESSolve_NEWTONLS() line 224 in<br>

> /global/u1/j/jychang/Software/<wbr>petsc/src/snes/impls/ls/ls.c<br>

> [0]PETSC ERROR: #12 SNESSolve() line 3967 in /global/u1/j/jychang/Software/<br>

> petsc/src/snes/interface/snes.<wbr>c<br>

> [0]PETSC ERROR: #13 main() line 1548 in /scratch2/scratchdirs/jychang/<br>

> Icesheet/ex48.c<br>

> [0]PETSC ERROR: PETSc Option Table entries:<br>

> [0]PETSC ERROR: -M 128<br>

> [0]PETSC ERROR: -N 128<br>

> [0]PETSC ERROR: -P 8<br>

> [0]PETSC ERROR: -da_refine 3<br>

> [0]PETSC ERROR: -mg_coarse_pc_type gamg<br>

> [0]PETSC ERROR: -pc_mg_levels 4<br>

> [0]PETSC ERROR: -pc_type mg<br>

> [0]PETSC ERROR: -thi_mat_type baij<br>

> [0]PETSC ERROR: ----------------End of Error Message -------send entire<br>

> error message to petsc-maint@mcs.anl.gov-------<wbr>---<br>

><br>

> If I changed the coarse grid to 129x129x8, no error whatsoever for up to 4<br>

> levels of refinement.<br>

><br>

> However, I am having trouble getting this started up on Cori's KNL...<br>

><br>

> I am using a coarse grid 136x136x8 across 1088 cores, and slurm is simply<br>

> cancelling the job. No other PETSc error was given. This is literally what<br>

> my log files say:<br>

><br>

> Level 1 domain size (m)    1e+04 x    1e+04 x    1e+03, num elements 272 x<br>

> 272 x 15 (1109760), size (m) 36.7647 x 36.7647 x 71.4286<br>

> Level 0 domain size (m)    1e+04 x    1e+04 x    1e+03, num elements 136 x<br>

> 136 x 8 (147968), size (m) 73.5294 x 73.5294 x 142.857<br>

<br>

</div></div>Why are levels 1 and 0 printed above, then 2, 1, 0 below?<br>

<span class=""><br>

> makefile:25: recipe for target 'runcori' failed<br>

<br>

</span>What is this makefile message doing?<br>

<div class="HOEnZb"><div class="h5"><br>

> Level 2 domain size (m)    1e+04 x    1e+04 x    1e+03, num elements 544 x<br>

> 544 x 29 (8582144), size (m) 18.3824 x 18.3824 x 35.7143<br>

> Level 1 domain size (m)    1e+04 x    1e+04 x    1e+03, num elements 272 x<br>

> 272 x 15 (1109760), size (m) 36.7647 x 36.7647 x 71.4286<br>

> Level 0 domain size (m)    1e+04 x    1e+04 x    1e+03, num elements 136 x<br>

> 136 x 8 (147968), size (m) 73.5294 x 73.5294 x 142.857<br>

> srun: error: nid04139: task 480: Killed<br>

> srun: Terminating job step 4387719.0<br>

> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.<br>

> slurmstepd: error: *** STEP 4387719.0 ON nid03873 CANCELLED AT<br>

> 2017-04-02T22:21:21 ***<br>

> srun: error: nid03960: task 202: Killed<br>

> srun: error: nid04005: task 339: Killed<br>

> srun: error: nid03873: task 32: Killed<br>

> srun: error: nid03960: task 203: Killed<br>

> srun: error: nid03873: task 3: Killed<br>

> srun: error: nid03960: task 199: Killed<br>

> srun: error: nid04004: task 264: Killed<br>

> srun: error: nid04141: task 660: Killed<br>

> srun: error: nid04139: task 539: Killed<br>

> srun: error: nid03873: task 63: Killed<br>

> srun: error: nid03960: task 170: Killed<br>

> srun: error: nid08164: task 821: Killed<br>

> srun: error: nid04139: task 507: Killed<br>

> srun: error: nid04005: task 299: Killed<br>

> srun: error: nid03960: tasks 136-169,171-198,200-201: Killed<br>

> srun: error: nid04005: task 310: Killed<br>

> srun: error: nid08166: task 1008: Killed<br>

> srun: error: nid04141: task 671: Killed<br>

> srun: error: nid03873: task 18: Killed<br>

> srun: error: nid04139: tasks 476-479,481-506,508-538,540-<wbr>543: Killed<br>

> srun: error: nid04005: tasks 272-298,300-309,311-338: Killed<br>

> srun: error: nid04140: tasks 544-611: Killed<br>

> srun: error: nid04142: tasks 680-747: Killed<br>

> srun: error: nid04138: tasks 408-475: Killed<br>

> srun: error: nid04006: tasks 340-407: Killed<br>

> srun: error: nid08163: tasks 748-815: Killed<br>

> srun: error: nid08166: tasks 952-1007,1009-1019: Killed<br>

> srun: error: nid03873: tasks 0-2,4-17,19-31,33-62,64-67: Killed<br>

> srun: error: nid08165: tasks 884-951: Killed<br>

> srun: error: nid03883: tasks 68-135: Killed<br>

> srun: error: nid08164: tasks 816-820,822-883: Killed<br>

> srun: error: nid08167: tasks 1020-1087: Killed<br>

> srun: error: nid04141: tasks 612-659,661-670,672-679: Killed<br>

> srun: error: nid04004: tasks 204-263,265-271: Killed<br>

> make: [runcori] Error 137 (ignored)<br>

> [257]PETSC ERROR:<br>

> ------------------------------<wbr>------------------------------<wbr>------------<br>

> [257]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the<br>

> batch system) has told this process to end<br>

> [257]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger<br>

> [257]PETSC ERROR: or see<br>

> <a href="http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind" rel="noreferrer" target="_blank">http://www.mcs.anl.gov/petsc/<wbr>documentation/faq.html#<wbr>valgrind</a><br>

> [257]PETSC ERROR: or try <a href="http://valgrind.org" rel="noreferrer" target="_blank">http://valgrind.org</a> on GNU/linux and Apple Mac OS<br>

> X to find memory corruption errors<br>

> [257]PETSC ERROR: configure using --with-debugging=yes, recompile, link,<br>

> and run<br>

> [257]PETSC ERROR: to get more information on the crash.<br>

> [878]PETSC ERROR:<br>

> ------------------------------<wbr>------------------------------<wbr>------------<br>

> [878]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the<br>

> batch system) has told this process to end<br>

> [878]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger<br>

> [878]PETSC ERROR: or see<br>

> <a href="http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind" rel="noreferrer" target="_blank">http://www.mcs.anl.gov/petsc/<wbr>documentation/faq.html#<wbr>valgrind</a><br>

> [878]PETSC ERROR: or try <a href="http://valgrind.org" rel="noreferrer" target="_blank">http://valgrind.org</a> on GNU/linux and Apple Mac OS<br>

> X to find memory corruption errors<br>

> [878]PETSC ERROR: configure using --with-debugging=yes, recompile, link,<br>

> and run<br>

> [878]PETSC ERROR: to get more information on the crash.<br>

> ....<br>

> [clipped]<br>

> ....<br>

><br>

><br>

><br>

> my job script for KNL looks like this:<br>

><br>

> #!/bin/bash<br>

> #SBATCH -N 16<br>

> #SBATCH -C knl,quad,cache<br>

> #SBATCH -p regular<br>

> #SBATCH -J knl1024<br>

> #SBATCH -L SCRATCH<br>

> #SBATCH -o knl1088.o%j<br>

> #SBATCH -e knl1088.e%j<br>

> #SBATCH --mail-type=ALL<br>

> #SBATCH --mail-user=<a href="mailto:jychang48@gmail.com">jychang48@gmail.<wbr>com</a><br>

> #SBATCH -t 00:20:00<br>

><br>

> srun -n 1088 -c 4 --cpu_bind=cores ./ex48 ....<br>

><br>

> Any ideas why this is happening? Or do I need to contact the NERSC folks?<br>

><br>

> Thanks,<br>

> Justin<br>

><br>

> On Sun, Apr 2, 2017 at 2:15 PM, Matthew Knepley <<a href="mailto:knepley@gmail.com">knepley@gmail.com</a>> wrote:<br>

><br>

>> On Sun, Apr 2, 2017 at 2:13 PM, Barry Smith <<a href="mailto:bsmith@mcs.anl.gov">bsmith@mcs.anl.gov</a>> wrote:<br>

>><br>

>>><br>

>>> > On Apr 2, 2017, at 9:25 AM, Justin Chang <<a href="mailto:jychang48@gmail.com">jychang48@gmail.com</a>> wrote:<br>

>>> ><br>

>>> > Thanks guys,<br>

>>> ><br>

>>> > So I want to run SNES ex48 across 1032 processes on Edison, but I keep<br>

>>> getting segmentation violations. These are the parameters I am trying:<br>

>>> ><br>

>>> > srun -n 1032 -c 2 ./ex48 -M 80 -N 80 -P 9 -da_refine 1 -pc_type mg<br>

>>> -thi_mat_type baij -mg_coarse_pc_type gamg<br>

>>> ><br>

>>> > The above works perfectly fine if I used 96 processes. I also tried to<br>

>>> use a finer coarse mesh on 1032 but the error persists.<br>

>>> ><br>

>>> > Any ideas why this is happening? What are the ideal parameters to use<br>

>>> if I want to use 1k+ cores?<br>

>>> ><br>

>>><br>

>>>    Hmm, one should never get segmentation violations. You should only get<br>

>>> not completely useful error messages about incompatible sizes etc. Send an<br>

>>> example of the segmentation violations. (I sure hope you are checking the<br>

>>> error return codes for all functions?).<br>

>><br>

>><br>

>> He is just running SNES ex48.<br>

>><br>

>>   Matt<br>

>><br>

>><br>

>>><br>

>>>   Barry<br>

>>><br>

>>> > Thanks,<br>

>>> > Justin<br>

>>> ><br>

>>> > On Fri, Mar 31, 2017 at 12:47 PM, Barry Smith <<a href="mailto:bsmith@mcs.anl.gov">bsmith@mcs.anl.gov</a>><br>

>>> wrote:<br>

>>> ><br>

>>> > > On Mar 31, 2017, at 10:00 AM, Jed Brown <<a href="mailto:jed@jedbrown.org">jed@jedbrown.org</a>> wrote:<br>

>>> > ><br>

>>> > > Justin Chang <<a href="mailto:jychang48@gmail.com">jychang48@gmail.com</a>> writes:<br>

>>> > ><br>

>>> > >> Yeah based on my experiments it seems setting pc_mg_levels to<br>

>>> $DAREFINE + 1<br>

>>> > >> has decent performance.<br>

>>> > >><br>

>>> > >> 1) is there ever a case where you'd want $MGLEVELS <= $DAREFINE? In<br>

>>> some of<br>

>>> > >> the PETSc tutorial slides (e.g., <a href="http://www.mcs.anl.gov/" rel="noreferrer" target="_blank">http://www.mcs.anl.gov/</a><br>

>>> > >> petsc/documentation/tutorials/<wbr>TutorialCEMRACS2016.pdf on slide<br>

>>> 203/227)<br>

>>> > >> they say to use $MGLEVELS = 4 and $DAREFINE = 5, but when I ran<br>

>>> this, it<br>

>>> > >> was almost twice as slow as if $MGLEVELS >= $DAREFINE<br>

>>> > ><br>

>>> > > Smaller coarse grids are generally more scalable -- when the problem<br>

>>> > > data is distributed, multigrid is a good solution algorithm.  But if<br>

>>> > > multigrid stops being effective because it is not preserving<br>

>>> sufficient<br>

>>> > > coarse grid accuracy (e.g., for transport-dominated problems in<br>

>>> > > complicated domains) then you might want to stop early and use a more<br>

>>> > > robust method (like direct solves).<br>

>>> ><br>

>>> > Basically for symmetric positive definite operators you can make the<br>

>>> coarse problem as small as you like (even 1 point) in theory. For<br>

>>> indefinite and non-symmetric problems the theory says the "coarse grid must<br>

>>> be sufficiently fine" (loosely speaking the coarse grid has to resolve the<br>

>>> eigenmodes for the eigenvalues to the left of the x = 0).<br>

>>> ><br>

>>> > <a href="https://www.jstor.org/stable/2158375?seq=1#page_scan_tab_contents" rel="noreferrer" target="_blank">https://www.jstor.org/stable/<wbr>2158375?seq=1#page_scan_tab_<wbr>contents</a><br>

>>> ><br>

>>> ><br>

>>> ><br>

>>><br>

>>><br>

>><br>

>><br>

>> --<br>

>> What most experimenters take for granted before they begin their<br>

>> experiments is infinitely more interesting than any results to which their<br>

>> experiments lead.<br>

>> -- Norbert Wiener<br>

>><br>

</div></div></blockquote></div>
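<div><br></div><div>For reference, the level sizes in the quoted logs are consistent with a simple rule: the two horizontal directions double on each refinement (128, 256, 512, 1024) while the vertical direction goes from n to 2n - 1 (8, 15, 29, 57). The sketch below only reproduces that arithmetic. The function names and the periodic/non-periodic interpretation are assumptions made for illustration, not PETSc API.</div>

```python
def refine(n, periodic):
    """Return the point count after one refinement of a 1-D grid with n points."""
    # Assumption read off the quoted logs: a periodic direction doubles,
    # a non-periodic direction inserts one new point between neighbours.
    return 2 * n if periodic else 2 * n - 1


def hierarchy(dims, periodic, levels):
    """List grid dimensions from level 0 (coarsest) up to level `levels`."""
    grids = [tuple(dims)]
    for _ in range(levels):
        grids.append(tuple(refine(n, p) for n, p in zip(grids[-1], periodic)))
    return grids


if __name__ == "__main__":
    # Coarse grid and refinement count from the Edison run (-M 128 -N 128 -P 8 -da_refine 3).
    for level, (m, n, p) in enumerate(hierarchy((128, 128, 8), (True, True, False), 3)):
        print(f"Level {level}: {m} x {n} x {p} ({m * n * p})")
```

<div>Under the 2n - 1 rule every refined non-periodic level has an odd number of points, so taking every other point recovers the next coarser level exactly.</div>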

</div></div>