[petsc-users] Correlation between da_refine and pg_mg_levels

Matthew Knepley knepley at gmail.com
Mon Apr 3 06:15:18 CDT 2017


On Mon, Apr 3, 2017 at 6:11 AM, Jed Brown <jed at jedbrown.org> wrote:

> Justin Chang <jychang48 at gmail.com> writes:
>
> > So if I begin with a 128x128x8 grid on 1032 procs, it works fine for the first two levels of da_refine. However, on the third level I get this error:
> >
> > Level 3 domain size (m)    1e+04 x    1e+04 x    1e+03, num elements 1024 x 1024 x 57 (59768832), size (m) 9.76562 x 9.76562 x 17.8571
> > Level 2 domain size (m)    1e+04 x    1e+04 x    1e+03, num elements 512 x 512 x 29 (7602176), size (m) 19.5312 x 19.5312 x 35.7143
> > Level 1 domain size (m)    1e+04 x    1e+04 x    1e+03, num elements 256 x 256 x 15 (983040), size (m) 39.0625 x 39.0625 x 71.4286
> > Level 0 domain size (m)    1e+04 x    1e+04 x    1e+03, num elements 128 x 128 x 8 (131072), size (m) 78.125 x 78.125 x 142.857
> > [0]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
> > [0]PETSC ERROR: Petsc has generated inconsistent data
> > [0]PETSC ERROR: Eigen estimator failed: DIVERGED_NANORINF at iteration 0
>
> Building with debugging and adding -fp_trap to get a stack trace would
> be really useful.  Or reproducing at smaller scale.


I can't think why it would fail there, but DMDA really likes odd numbers of
vertices, because coarsening takes every other point; 129 seems good. I will
see if I can reproduce once I get a chance.
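To make the sizes concrete, here is a minimal sketch of the refinement
arithmetic (my reconstruction from the log above, not anything ex48 itself
prints): DMDA regular refinement takes a periodic dimension from M points to
2M and a non-periodic one from M vertices to 2M - 1, which matches the
128 -> 256 -> 512 -> 1024 and 8 -> 15 -> 29 -> 57 progressions in the log.

#include <stdio.h>

/* Sketch, under the assumptions above: x/y refine as periodic
 * (M -> 2M), z as non-periodic (M -> 2M - 1).  Coarsening a
 * non-periodic dimension keeps every other point, so it wants an odd
 * count: 129 -> 257 -> 513 stays odd and coarsens cleanly, which is
 * why 129 behaves better than 128. */
int main(void)
{
  int Mxy = 128, Mz = 8; /* coarse grid from -M 128 -N 128 -P 8 */

  for (int lev = 0; lev <= 3; lev++) {
    printf("Level %d: %4d x %4d x %2d\n", lev, Mxy, Mxy, Mz);
    Mxy = 2 * Mxy;    /* periodic: count doubles            */
    Mz  = 2 * Mz - 1; /* non-periodic: midpoints interleave */
  }
  return 0;
}

Note also that -da_refine 3 gives 4 grids in total, which is consistent with
-pc_mg_levels 4 in the option table below reaching all the way back to the
128x128x8 grid as the multigrid coarse level.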

And now you see why it almost always takes a full-time person just to run
jobs on one of these machines.
Horrible design flaws never get fixed.

   Matt


> > [0]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
> > [0]PETSC ERROR: Petsc Development GIT revision: v3.7.5-3418-ge372536  GIT Date: 2017-03-30 13:35:15 -0500
> > [0]PETSC ERROR: /scratch2/scratchdirs/jychang/Icesheet/./ex48edison on a arch-edison-c-opt named nid00865 by jychang Sun Apr  2 21:44:44 2017
> > [0]PETSC ERROR: Configure options --download-fblaslapack --with-cc=cc --with-clib-autodetect=0 --with-cxx=CC --with-cxxlib-autodetect=0 --with-debugging=0 --with-fc=ftn --with-fortranlib-autodetect=0 --with-mpiexec=srun --with-64-bit-indices=1 COPTFLAGS=-O3 CXXOPTFLAGS=-O3 FOPTFLAGS=-O3 PETSC_ARCH=arch-edison-c-opt
> > [0]PETSC ERROR: #1 KSPSolve_Chebyshev() line 380 in /global/u1/j/jychang/Software/petsc/src/ksp/ksp/impls/cheby/cheby.c
> > [0]PETSC ERROR: #2 KSPSolve() line 655 in /global/u1/j/jychang/Software/petsc/src/ksp/ksp/interface/itfunc.c
> > [0]PETSC ERROR: #3 PCMGMCycle_Private() line 19 in /global/u1/j/jychang/Software/petsc/src/ksp/pc/impls/mg/mg.c
> > [0]PETSC ERROR: #4 PCMGMCycle_Private() line 53 in /global/u1/j/jychang/Software/petsc/src/ksp/pc/impls/mg/mg.c
> > [0]PETSC ERROR: #5 PCApply_MG() line 331 in /global/u1/j/jychang/Software/petsc/src/ksp/pc/impls/mg/mg.c
> > [0]PETSC ERROR: #6 PCApply() line 458 in /global/u1/j/jychang/Software/petsc/src/ksp/pc/interface/precon.c
> > [0]PETSC ERROR: #7 KSP_PCApply() line 251 in /global/homes/j/jychang/Software/petsc/include/petsc/private/kspimpl.h
> > [0]PETSC ERROR: #8 KSPInitialResidual() line 67 in /global/u1/j/jychang/Software/petsc/src/ksp/ksp/interface/itres.c
> > [0]PETSC ERROR: #9 KSPSolve_GMRES() line 233 in /global/u1/j/jychang/Software/petsc/src/ksp/ksp/impls/gmres/gmres.c
> > [0]PETSC ERROR: #10 KSPSolve() line 655 in /global/u1/j/jychang/Software/petsc/src/ksp/ksp/interface/itfunc.c
> > [0]PETSC ERROR: #11 SNESSolve_NEWTONLS() line 224 in /global/u1/j/jychang/Software/petsc/src/snes/impls/ls/ls.c
> > [0]PETSC ERROR: #12 SNESSolve() line 3967 in /global/u1/j/jychang/Software/petsc/src/snes/interface/snes.c
> > [0]PETSC ERROR: #13 main() line 1548 in /scratch2/scratchdirs/jychang/Icesheet/ex48.c
> > [0]PETSC ERROR: PETSc Option Table entries:
> > [0]PETSC ERROR: -M 128
> > [0]PETSC ERROR: -N 128
> > [0]PETSC ERROR: -P 8
> > [0]PETSC ERROR: -da_refine 3
> > [0]PETSC ERROR: -mg_coarse_pc_type gamg
> > [0]PETSC ERROR: -pc_mg_levels 4
> > [0]PETSC ERROR: -pc_type mg
> > [0]PETSC ERROR: -thi_mat_type baij
> > [0]PETSC ERROR: ----------------End of Error Message -------send entire error message to petsc-maint at mcs.anl.gov----------
> >
> > If I change the coarse grid to 129x129x8, there is no error whatsoever for up to 4 levels of refinement.
> >
> > However, I am having trouble getting this started up on Cori's KNL...
> >
> > I am using a coarse grid of 136x136x8 across 1088 cores, and slurm is simply cancelling the job. No other PETSc error was given. This is literally what my log files say:
> >
> > Level 1 domain size (m)    1e+04 x    1e+04 x    1e+03, num elements 272 x 272 x 15 (1109760), size (m) 36.7647 x 36.7647 x 71.4286
> > Level 0 domain size (m)    1e+04 x    1e+04 x    1e+03, num elements 136 x 136 x 8 (147968), size (m) 73.5294 x 73.5294 x 142.857
>
> Why are levels 1 and 0 printed above, then 2, 1, 0 below?
>
> > makefile:25: recipe for target 'runcori' failed
>
> What is this makefile message doing?
>
> > Level 2 domain size (m)    1e+04 x    1e+04 x    1e+03, num elements 544 x 544 x 29 (8582144), size (m) 18.3824 x 18.3824 x 35.7143
> > Level 1 domain size (m)    1e+04 x    1e+04 x    1e+03, num elements 272 x 272 x 15 (1109760), size (m) 36.7647 x 36.7647 x 71.4286
> > Level 0 domain size (m)    1e+04 x    1e+04 x    1e+03, num elements 136 x 136 x 8 (147968), size (m) 73.5294 x 73.5294 x 142.857
> > srun: error: nid04139: task 480: Killed
> > srun: Terminating job step 4387719.0
> > srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> > slurmstepd: error: *** STEP 4387719.0 ON nid03873 CANCELLED AT
> > 2017-04-02T22:21:21 ***
> > srun: error: nid03960: task 202: Killed
> > srun: error: nid04005: task 339: Killed
> > srun: error: nid03873: task 32: Killed
> > srun: error: nid03960: task 203: Killed
> > srun: error: nid03873: task 3: Killed
> > srun: error: nid03960: task 199: Killed
> > srun: error: nid04004: task 264: Killed
> > srun: error: nid04141: task 660: Killed
> > srun: error: nid04139: task 539: Killed
> > srun: error: nid03873: task 63: Killed
> > srun: error: nid03960: task 170: Killed
> > srun: error: nid08164: task 821: Killed
> > srun: error: nid04139: task 507: Killed
> > srun: error: nid04005: task 299: Killed
> > srun: error: nid03960: tasks 136-169,171-198,200-201: Killed
> > srun: error: nid04005: task 310: Killed
> > srun: error: nid08166: task 1008: Killed
> > srun: error: nid04141: task 671: Killed
> > srun: error: nid03873: task 18: Killed
> > srun: error: nid04139: tasks 476-479,481-506,508-538,540-543: Killed
> > srun: error: nid04005: tasks 272-298,300-309,311-338: Killed
> > srun: error: nid04140: tasks 544-611: Killed
> > srun: error: nid04142: tasks 680-747: Killed
> > srun: error: nid04138: tasks 408-475: Killed
> > srun: error: nid04006: tasks 340-407: Killed
> > srun: error: nid08163: tasks 748-815: Killed
> > srun: error: nid08166: tasks 952-1007,1009-1019: Killed
> > srun: error: nid03873: tasks 0-2,4-17,19-31,33-62,64-67: Killed
> > srun: error: nid08165: tasks 884-951: Killed
> > srun: error: nid03883: tasks 68-135: Killed
> > srun: error: nid08164: tasks 816-820,822-883: Killed
> > srun: error: nid08167: tasks 1020-1087: Killed
> > srun: error: nid04141: tasks 612-659,661-670,672-679: Killed
> > srun: error: nid04004: tasks 204-263,265-271: Killed
> > make: [runcori] Error 137 (ignored)
> > [257]PETSC ERROR: ------------------------------------------------------------------------
> > [257]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end
> > [257]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
> > [257]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
> > [257]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
> > [257]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run
> > [257]PETSC ERROR: to get more information on the crash.
> > [878]PETSC ERROR: ------------------------------------------------------------------------
> > [878]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end
> > [878]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
> > [878]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
> > [878]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
> > [878]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run
> > [878]PETSC ERROR: to get more information on the crash.
> > ....
> > [clipped]
> > ....
> >
> >
> >
> > My job script for KNL looks like this:
> >
> > #!/bin/bash
> > #SBATCH -N 16
> > #SBATCH -C knl,quad,cache
> > #SBATCH -p regular
> > #SBATCH -J knl1024
> > #SBATCH -L SCRATCH
> > #SBATCH -o knl1088.o%j
> > #SBATCH -e knl1088.e%j
> > #SBATCH --mail-type=ALL
> > #SBATCH --mail-user=jychang48 at gmail.com
> > #SBATCH -t 00:20:00
> >
> > srun -n 1088 -c 4 --cpu_bind=cores ./ex48 ....
> >
> > Any ideas why this is happening? Or do I need to contact the NERSC folks?
> >
> > Thanks,
> > Justin
> >
> > On Sun, Apr 2, 2017 at 2:15 PM, Matthew Knepley <knepley at gmail.com> wrote:
> >
> >> On Sun, Apr 2, 2017 at 2:13 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
> >>
> >>>
> >>> > On Apr 2, 2017, at 9:25 AM, Justin Chang <jychang48 at gmail.com> wrote:
> >>> >
> >>> > Thanks guys,
> >>> >
> >>> > So I want to run SNES ex48 across 1032 processes on Edison, but I keep getting segmentation violations. These are the parameters I am trying:
> >>> >
> >>> > srun -n 1032 -c 2 ./ex48 -M 80 -N 80 -P 9 -da_refine 1 -pc_type mg -thi_mat_type baij -mg_coarse_pc_type gamg
> >>> >
> >>> > The above works perfectly fine if I use 96 processes. I also tried to use a finer coarse mesh on 1032 but the error persists.
> >>> >
> >>> > Any ideas why this is happening? What are the ideal parameters to use if I want to use 1k+ cores?
> >>> >
> >>>
> >>>    Hmm, one should never get segmentation violations; at worst you should get (not always useful) error messages about incompatible sizes etc. Send an example of the segmentation violations. (I sure hope you are checking the error return codes for all functions?)
> >>
> >>
> >> He is just running SNES ex48.
> >>
> >>   Matt
> >>
> >>
> >>>
> >>>   Barry
> >>>
> >>> > Thanks,
> >>> > Justin
> >>> >
> >>> > On Fri, Mar 31, 2017 at 12:47 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
> >>> >
> >>> > > On Mar 31, 2017, at 10:00 AM, Jed Brown <jed at jedbrown.org> wrote:
> >>> > >
> >>> > > Justin Chang <jychang48 at gmail.com> writes:
> >>> > >
> >>> > >> Yeah, based on my experiments it seems setting pc_mg_levels to $DAREFINE + 1 has decent performance.
> >>> > >>
> >>> > >> 1) Is there ever a case where you'd want $MGLEVELS <= $DAREFINE? In some of the PETSc tutorial slides (e.g., http://www.mcs.anl.gov/petsc/documentation/tutorials/TutorialCEMRACS2016.pdf on slide 203/227) they say to use $MGLEVELS = 4 and $DAREFINE = 5, but when I ran this, it was almost twice as slow as with $MGLEVELS >= $DAREFINE.
> >>> > >
> >>> > > Smaller coarse grids are generally more scalable -- when the problem data is distributed, multigrid is a good solution algorithm.  But if multigrid stops being effective because it is not preserving sufficient coarse grid accuracy (e.g., for transport-dominated problems in complicated domains) then you might want to stop early and use a more robust method (like direct solves).
> >>> >
> >>> > Basically, for symmetric positive definite operators you can make the coarse problem as small as you like (even 1 point) in theory. For indefinite and non-symmetric problems the theory says the "coarse grid must be sufficiently fine" (loosely speaking, the coarse grid has to resolve the eigenmodes for the eigenvalues to the left of x = 0).
> >>> >
> >>> > https://www.jstor.org/stable/2158375?seq=1#page_scan_tab_contents
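
In symbols, as a loose paraphrase of the condition above (my restatement,
not lifted from the cited paper): writing the eigenpairs of the operator as
A u_j = \lambda_j u_j and the coarse-to-fine interpolation as P, the
requirement is roughly

\[
  \operatorname{span}\{\, u_j : \operatorname{Re}\lambda_j \le 0 \,\}
  \subseteq \operatorname{range}(P) \quad \text{(approximately)},
\]

i.e. the coarse space must be rich enough to represent every eigenmode whose
eigenvalue lies to the left of x = 0. For an SPD operator all
\lambda_j > 0, the set is empty, and the coarse problem can shrink to a
single point.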
> >>> >
> >>> >
> >>> >
> >>>
> >>>
> >>
> >>
> >> --
> >> What most experimenters take for granted before they begin their
> >> experiments is infinitely more interesting than any results to which
> their
> >> experiments lead.
> >> -- Norbert Wiener
> >>
>



-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener