[petsc-dev] [petsc-maint] running CUDA on SUMMIT

Sat Aug 31 20:04:17 CDT 2019

On Sat, Aug 31, 2019 at 4:28 PM Smith, Barry F. <bsmith at mcs.anl.gov> wrote:

>
>   Any explanation for why the scaling is much better for CPUs than
> GPUs? Is it the "extra" time needed for communication from the GPUs?
>

The work that runs on the GPU is well load balanced, so it weak scales
perfectly. When you run that same work on the CPU it takes longer, so you
add more perfectly scalable time to the total and the scaling looks better.
For instance, the 98K dof/proc times go up by about 1/2 sec. from the
1 node to the 512 node case for both GPU and CPU, because the non-scaling
part is communication, which is the same in both cases.
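
To make the arithmetic concrete, here is a rough sketch. Only the 1/2 sec.
growth and the roughly 4x GPU speedup come from this thread. The single-node
times (8 s on CPU, 2 s on GPU) are made-up numbers picked to match that 4x
ratio, not measurements, and the function name is just for illustration.

```python
def weak_scaling_efficiency(t_one_node, t_comm_growth):
    """Weak-scaling efficiency T(1 node) / T(N nodes), assuming the only
    growth in run time from 1 to N nodes is the communication term."""
    return t_one_node / (t_one_node + t_comm_growth)


# From the thread: run time grows by about 0.5 s from 1 to 512 nodes,
# and that growth is the same for the CPU and GPU runs.
COMM_GROWTH = 0.5

# Hypothetical single-node solve times (assumption, not measured data),
# chosen only so that the GPU is about 4x faster than the CPU.
T_CPU_ONE_NODE = 8.0
T_GPU_ONE_NODE = 2.0

cpu_eff = weak_scaling_efficiency(T_CPU_ONE_NODE, COMM_GROWTH)
gpu_eff = weak_scaling_efficiency(T_GPU_ONE_NODE, COMM_GROWTH)

print(f"CPU weak-scaling efficiency: {cpu_eff:.2f}")
print(f"GPU weak-scaling efficiency: {gpu_eff:.2f}")
```

With these assumed times the same 0.5 s of extra communication costs the CPU
run about 6% and the GPU run 20%, which is why the CPU curve looks flatter
even though the absolute growth is identical.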

>
>   Perhaps you could try the GPU version with Junchao's new MPI-aware CUDA
> branch (in the gitlab merge requests)  that can speed up the communication
> from GPUs?
>

Sure. Do I just check out jczhang/feature-sf-on-gpu and run as usual?

>
>    Barry
>
>
> > On Aug 30, 2019, at 11:56 AM, Mark Adams <mfadams at lbl.gov> wrote:
> >
> > Here is some more weak scaling data with a fixed number of iterations (I
> have given a test with the numerical problems to ORNL and they said they
> would give it to Nvidia).
> >
> > I implemented an option to "spread" the reduced coarse grids across the
> whole machine as opposed to a "compact" layout where active processes are
> laid out in simple lexicographical order. This spread approach looks a
> little better.
> >
> > Mark
> >
> > On Wed, Aug 14, 2019 at 10:46 PM Smith, Barry F. <bsmith at mcs.anl.gov>
> wrote:
> >
> >   Ahh, PGI compiler, that explains it :-)
> >
> >   Ok, thanks. Don't worry about the runs right now. We'll figure out the
> fix. The code is just
> >
> >   *a = (PetscReal)strtod(name,endptr);
> >
> >   could be a compiler bug.
> >
> >
> >
> >
> > > On Aug 14, 2019, at 9:23 PM, Mark Adams <mfadams at lbl.gov> wrote:
> > >
> > > I am getting this error with single:
> > >
> > > 22:21  /gpfs/alpine/geo127/scratch/adams$ jsrun -n 1 -a 1 -c 1 -g 1
> ./ex56_single -cells 2,2,2 -ex56_dm_vec_type cuda -ex56_dm_mat_type
> aijcusparse -fp_trap
> > > [0] 81 global equations, 27 vertices
> > > [0]PETSC ERROR: *** unknown floating point error occurred ***
> > > [0]PETSC ERROR: The specific exception can be determined by running in
> a debugger.  When the
> > > [0]PETSC ERROR: debugger traps the signal, the exception can be found
> with fetestexcept(0x3e000000)
> > > [0]PETSC ERROR: where the result is a bitwise OR of the following
> flags:
> > > [0]PETSC ERROR: FE_INVALID=0x20000000 FE_DIVBYZERO=0x4000000
> FE_OVERFLOW=0x10000000 FE_UNDERFLOW=0x8000000 FE_INEXACT=0x2000000
> > > [0]PETSC ERROR: Try option -start_in_debugger
> > > [0]PETSC ERROR: likely location of problem given in stack below
> > > [0]PETSC ERROR: ---------------------  Stack Frames
> ------------------------------------
> > > [0]PETSC ERROR: Note: The EXACT line numbers in the stack are not
> available,
> > > [0]PETSC ERROR:       INSTEAD the line number of the start of the
> function
> > > [0]PETSC ERROR:       is given.
> > > [0]PETSC ERROR: [0] PetscDefaultFPTrap line 355
> /autofs/nccs-svm1_home1/adams/petsc/src/sys/error/fp.c
> > > [0]PETSC ERROR: [0] PetscStrtod line 1964
> /autofs/nccs-svm1_home1/adams/petsc/src/sys/objects/options.c
> > > [0]PETSC ERROR: [0] PetscOptionsStringToReal line 2021
> /autofs/nccs-svm1_home1/adams/petsc/src/sys/objects/options.c
> > > [0]PETSC ERROR: [0] PetscOptionsGetReal line 2321
> /autofs/nccs-svm1_home1/adams/petsc/src/sys/objects/options.c
> > > [0]PETSC ERROR: [0] PetscOptionsReal_Private line 1015
> /autofs/nccs-svm1_home1/adams/petsc/src/sys/objects/aoptions.c
> > > [0]PETSC ERROR: [0] KSPSetFromOptions line 329
> /autofs/nccs-svm1_home1/adams/petsc/src/ksp/ksp/interface/itcl.c
> > > [0]PETSC ERROR: [0] SNESSetFromOptions line 869
> /autofs/nccs-svm1_home1/adams/petsc/src/snes/interface/snes.c
> > > [0]PETSC ERROR: --------------------- Error Message
> --------------------------------------------------------------
> > > [0]PETSC ERROR: Floating point exception
> > > [0]PETSC ERROR: trapped floating point error
> > > [0]PETSC ERROR: See
> https://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
> > > [0]PETSC ERROR: Petsc Development GIT revision: v3.11.3-1685-gd3eb2e1
> GIT Date: 2019-08-13 06:33:29 -0400
> > > [0]PETSC ERROR: ./ex56_single on a arch-summit-dbg-single-pgi-cuda
> named h36n11 by adams Wed Aug 14 22:21:56 2019
> > > [0]PETSC ERROR: Configure options --with-cc=mpicc --with-cxx=mpiCC
> --with-fc=mpif90 COPTFLAGS="-g -Mfcon" CXXOPTFLAGS="-g -Mfcon"
> FOPTFLAGS="-g -Mfcon" --with-precision=single --with-ssl=0 --with-batch=0
> --with-mpiexec="jsrun -g 1" --with-cuda=1 --with-cudac=nvcc
> CUDAFLAGS="-ccbin pgc++" --download-metis --download-parmetis
> --download-fblaslapack --with-x=0 --with-64-bit-indices=0
> --with-debugging=1 PETSC_ARCH=arch-summit-dbg-single-pgi-cuda
> > > [0]PETSC ERROR: #1 User provided function() line 0 in Unknown file
> > >
> --------------------------------------------------------------------------
> > >
> > > On Wed, Aug 14, 2019 at 9:51 PM Smith, Barry F. <bsmith at mcs.anl.gov>
> wrote:
> > >
> > >   Oh, doesn't even have to be that large. We just need to be able to
> look at the flop rates (as a surrogate for run times) and compare with the
> previous runs. So long as the size per process is pretty much the same that
> is good enough.
> > >
> > >    Barry
> > >
> > >
> > > > On Aug 14, 2019, at 8:45 PM, Mark Adams <mfadams at lbl.gov> wrote:
> > > >
> > > > I can run single, I just can't scale up. But I can use like 1500
> processors.
> > > >
> > > > On Wed, Aug 14, 2019 at 9:31 PM Smith, Barry F. <bsmith at mcs.anl.gov>
> wrote:
> > > >
> > > >   Oh, are all your integers 8 bytes? Even on one node?
> > > >
> > > >   Once Karl's new middleware is in place we should see about
> reducing to 4 bytes on the GPU.
> > > >
> > > >    Barry
> > > >
> > > >
> > > > > On Aug 14, 2019, at 7:44 PM, Mark Adams <mfadams at lbl.gov> wrote:
> > > > >
> > > > > > OK, I'll run single. It is a bit perverse to run with 4 byte floats
> and 8 byte integers ... I could use 32 bit ints and just not scale out.
> > > > >
> > > > > On Wed, Aug 14, 2019 at 6:48 PM Smith, Barry F. <
> bsmith at mcs.anl.gov> wrote:
> > > > >
> > > > >  Mark,
> > > > >
> > > > >    Oh, I don't even care if it converges, just put in a fixed
> number of iterations. The idea is to just get a baseline of the possible
> improvement.
> > > > >
> > > > >     ECP is literally dropping millions into research on "multi
> precision" computations on GPUs, we need to have some actual numbers for
> the best potential benefit to determine how much we invest in further
> investigating it, or not.
> > > > >
> > > > >     I am not expressing any opinions on the approach, we are just
> in the fact gathering stage.
> > > > >
> > > > >
> > > > >    Barry
> > > > >
> > > > >
> > > > > > On Aug 14, 2019, at 2:27 PM, Mark Adams <mfadams at lbl.gov> wrote:
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Wed, Aug 14, 2019 at 2:35 PM Smith, Barry F. <
> bsmith at mcs.anl.gov> wrote:
> > > > > >
> > > > > >   Mark,
> > > > > >
> > > > > >    Would you be able to make one run using single precision?
> Just single everywhere since that is all we support currently?
> > > > > >
> > > > > >
> > > > > > > Experience in engineering, at least, is that single does not work for
> FE elasticity. I have tried it many years ago and have heard this from
> others. This problem is pretty simple other than using Q2. I suppose I
> could try it, but just be aware the FE people might say that single sucks.
> > > > > >
> > > > > >    The results will give us motivation (or anti-motivation) to
> have support for running KSP (or PC (or Mat)) in single precision while the
> simulation is double.
> > > > > >
> > > > > >    Thanks.
> > > > > >
> > > > > >      Barry
> > > > > >
> > > > > > For example if the GPU speed on KSP is a factor of 3 over the
> double on GPUs this is serious motivation.
> > > > > >
> > > > > >
> > > > > > > On Aug 14, 2019, at 12:45 PM, Mark Adams <mfadams at lbl.gov>
> wrote:
> > > > > > >
> > > > > > > FYI, Here is some scaling data of GAMG on SUMMIT. Getting
> about 4x GPU speedup with 98K dof/proc (3D Q2 elasticity).
> > > > > > >
> > > > > > > This is weak scaling of a solve. There is growth in iteration
> count folded in here. I should put rtol in the title and/or run a fixed
> number of iterations and make it clear in the title.
> > > > > > >
> > > > > > > Comments welcome.
> > > > > > >
> <out_cpu_012288><out_cpu_001536><out_cuda_012288><out_cpu_000024><out_cpu_000192><out_cuda_001536><out_cuda_000192><out_cuda_000024><weak_scaling_cpu.png><weak_scaling_cuda.png>
> > > > > >
> > > > >
> > > >
> > >
> >
> >
> <weak_scaling_gpu_compact_spread.png><weak_scaling_cpu.png><spread.tar><compact.tar>
>
>