[petsc-dev] [petsc-maint] running CUDA on SUMMIT

Mark Adams mfadams at lbl.gov
Sun Sep 1 09:50:41 CDT 2019


Junchao and Barry,

I am using mark/fix-cuda-with-gamg-pintocpu, which is built on Barry's
robustify branch. Is this in master yet? If so, I'd like to get my branch
merged to master, then merge Junchao's branch, and then use it.

I think we were waiting for some refactoring from Karl to proceed.

Anyway, I'm not sure how to proceed.

Thanks,
Mark


On Sun, Sep 1, 2019 at 8:45 AM Zhang, Junchao <jczhang at mcs.anl.gov> wrote:

>
>
>
> On Sat, Aug 31, 2019 at 8:04 PM Mark Adams <mfadams at lbl.gov> wrote:
>
>>
>>
>> On Sat, Aug 31, 2019 at 4:28 PM Smith, Barry F. <bsmith at mcs.anl.gov>
>> wrote:
>>
>>>
>>>   Any explanation for why the scaling is much better for CPUs than for
>>> GPUs? Is it the "extra" time needed for communication from the GPUs?
>>>
>>
>> The GPU work is well load balanced, so it weak scales perfectly. When you
>> put that work on the CPU you add more perfectly scalable work, so the CPU
>> scaling looks better. For instance, the 98K dof/proc data goes up by about
>> 1/2 sec. from the 1 node to the 512 node case for both GPU and CPU, because
>> this non-scaling part comes from communication that is the same in both cases.
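>>
>> As a rough illustration with made-up numbers (not from these runs): if that
>> same 1/2 sec. of communication appears at 512 nodes in both cases, and the
>> scalable work takes 2 sec. on the GPU but 8 sec. on the CPU, then the GPU
>> weak-scaling efficiency drops to 2/2.5 = 80% while the CPU only drops to
>> 8/8.5 = 94%, even though the added cost is identical.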
>>
>>
>>>
>>>   Perhaps you could try the GPU version with Junchao's new CUDA-aware MPI
>>> branch (in the gitlab merge requests) that can speed up the communication
>>> from GPUs?
>>>
>>
>> Sure, do I just check out jczhang/feature-sf-on-gpu and run as usual?
>>
>
> Use jsrun --smpiargs="-gpu" to enable IBM MPI's CUDA-aware support, then
> add the -use_gpu_aware_mpi option to let PETSc use that feature.
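>
> For example, a sketch based on the ex56 command used earlier in this thread
> (the executable name and the -n/-a/-c/-g resource flags are illustrative):
>
>   jsrun --smpiargs="-gpu" -n 1 -a 1 -c 1 -g 1 ./ex56 -cells 2,2,2 \
>     -ex56_dm_vec_type cuda -ex56_dm_mat_type aijcusparse -use_gpu_aware_mpi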
>
>
>>
>>
>>>
>>>    Barry
>>>
>>>
>>> > On Aug 30, 2019, at 11:56 AM, Mark Adams <mfadams at lbl.gov> wrote:
>>> >
>>> > Here is some more weak scaling data with a fixed number of iterations
>>> (I have given a test case exhibiting the numerical problems to ORNL and they
>>> said they would pass it on to NVIDIA).
>>> >
>>> > I implemented an option to "spread" the reduced coarse grids across
>>> the whole machine as opposed to a "compact" layout where active processes
>>> are laid out in simple lexicographical order. This spread approach looks a
>>> little better.
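>>> >
>>> > As a simplified sketch of the idea (illustrative only, not the actual
>>> > PETSc code): with nactive of size ranks kept for a reduced coarse grid,
>>> > "compact" packs them into the first ranks while "spread" spaces them
>>> > evenly across the machine:
>>> >
>>> >   #include <stdio.h>
>>> >   int main(void)
>>> >   {
>>> >     int size = 512, nactive = 8;   /* total ranks, active coarse-grid ranks */
>>> >     int stride = size / nactive;   /* assume it divides evenly */
>>> >     for (int rank = 0; rank < size; rank++) {
>>> >       int compact = rank < nactive;                                    /* ranks 0..nactive-1 */
>>> >       int spread  = (rank % stride == 0) && (rank < nactive * stride); /* every stride-th rank */
>>> >       if (compact || spread) printf("rank %3d: compact=%d spread=%d\n", rank, compact, spread);
>>> >     }
>>> >     return 0;
>>> >   }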
>>> >
>>> > Mark
>>> >
>>> > On Wed, Aug 14, 2019 at 10:46 PM Smith, Barry F. <bsmith at mcs.anl.gov>
>>> wrote:
>>> >
>>> >   Ahh, PGI compiler, that explains it :-)
>>> >
>>> >   Ok, thanks. Don't worry about the runs right now. We'll figure out
>>> the fix. The code is just
>>> >
>>> >   *a = (PetscReal)strtod(name,endptr);
>>> >
>>> >   could be a compiler bug.
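>>> >
>>> >   A minimal standalone sketch to isolate that cast when PetscReal is
>>> >   single (hypothetical test code, not part of PETSc; compile with e.g.
>>> >   cc test.c -lm):
>>> >
>>> >   #include <fenv.h>
>>> >   #include <stdio.h>
>>> >   #include <stdlib.h>
>>> >
>>> >   int main(void)
>>> >   {
>>> >     char *endptr;
>>> >     feclearexcept(FE_ALL_EXCEPT);
>>> >     /* the value is arbitrary; the point is which flags the double->float
>>> >        conversion raises with this compiler */
>>> >     float a = (float)strtod("1.0e-30", &endptr);
>>> >     printf("a = %g, raised FP flags = %#x\n", a, (unsigned)fetestexcept(FE_ALL_EXCEPT));
>>> >     return 0;
>>> >   }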
>>> >
>>> >
>>> >
>>> >
>>> > > On Aug 14, 2019, at 9:23 PM, Mark Adams <mfadams at lbl.gov> wrote:
>>> > >
>>> > > I am getting this error with single:
>>> > >
>>> > > 22:21  /gpfs/alpine/geo127/scratch/adams$ jsrun -n 1 -a 1 -c 1 -g 1
>>> ./ex56_single -cells 2,2,2 -ex56_dm_vec_type cuda -ex56_dm_mat_type
>>> aijcusparse -fp_trap
>>> > > [0] 81 global equations, 27 vertices
>>> > > [0]PETSC ERROR: *** unknown floating point error occurred ***
>>> > > [0]PETSC ERROR: The specific exception can be determined by running
>>> in a debugger.  When the
>>> > > [0]PETSC ERROR: debugger traps the signal, the exception can be
>>> found with fetestexcept(0x3e000000)
>>> > > [0]PETSC ERROR: where the result is a bitwise OR of the following
>>> flags:
>>> > > [0]PETSC ERROR: FE_INVALID=0x20000000 FE_DIVBYZERO=0x4000000
>>> FE_OVERFLOW=0x10000000 FE_UNDERFLOW=0x8000000 FE_INEXACT=0x2000000
>>> > > [0]PETSC ERROR: Try option -start_in_debugger
>>> > > [0]PETSC ERROR: likely location of problem given in stack below
>>> > > [0]PETSC ERROR: ---------------------  Stack Frames
>>> ------------------------------------
>>> > > [0]PETSC ERROR: Note: The EXACT line numbers in the stack are not
>>> available,
>>> > > [0]PETSC ERROR:       INSTEAD the line number of the start of the
>>> function
>>> > > [0]PETSC ERROR:       is given.
>>> > > [0]PETSC ERROR: [0] PetscDefaultFPTrap line 355
>>> /autofs/nccs-svm1_home1/adams/petsc/src/sys/error/fp.c
>>> > > [0]PETSC ERROR: [0] PetscStrtod line 1964
>>> /autofs/nccs-svm1_home1/adams/petsc/src/sys/objects/options.c
>>> > > [0]PETSC ERROR: [0] PetscOptionsStringToReal line 2021
>>> /autofs/nccs-svm1_home1/adams/petsc/src/sys/objects/options.c
>>> > > [0]PETSC ERROR: [0] PetscOptionsGetReal line 2321
>>> /autofs/nccs-svm1_home1/adams/petsc/src/sys/objects/options.c
>>> > > [0]PETSC ERROR: [0] PetscOptionsReal_Private line 1015
>>> /autofs/nccs-svm1_home1/adams/petsc/src/sys/objects/aoptions.c
>>> > > [0]PETSC ERROR: [0] KSPSetFromOptions line 329
>>> /autofs/nccs-svm1_home1/adams/petsc/src/ksp/ksp/interface/itcl.c
>>> > > [0]PETSC ERROR: [0] SNESSetFromOptions line 869
>>> /autofs/nccs-svm1_home1/adams/petsc/src/snes/interface/snes.c
>>> > > [0]PETSC ERROR: --------------------- Error Message
>>> --------------------------------------------------------------
>>> > > [0]PETSC ERROR: Floating point exception
>>> > > [0]PETSC ERROR: trapped floating point error
>>> > > [0]PETSC ERROR: See
>>> https://www.mcs.anl.gov/petsc/documentation/faq.html for trouble
>>> shooting.
>>> > > [0]PETSC ERROR: Petsc Development GIT revision:
>>> v3.11.3-1685-gd3eb2e1  GIT Date: 2019-08-13 06:33:29 -0400
>>> > > [0]PETSC ERROR: ./ex56_single on a arch-summit-dbg-single-pgi-cuda
>>> named h36n11 by adams Wed Aug 14 22:21:56 2019
>>> > > [0]PETSC ERROR: Configure options --with-cc=mpicc --with-cxx=mpiCC
>>> --with-fc=mpif90 COPTFLAGS="-g -Mfcon" CXXOPTFLAGS="-g -Mfcon"
>>> FOPTFLAGS="-g -Mfcon" --with-precision=single --with-ssl=0 --with-batch=0
>>> --with-mpiexec="jsrun -g 1" --with-cuda=1 --with-cudac=nvcc
>>> CUDAFLAGS="-ccbin pgc++" --download-metis --download-parmetis
>>> --download-fblaslapack --with-x=0 --with-64-bit-indices=0
>>> --with-debugging=1 PETSC_ARCH=arch-summit-dbg-single-pgi-cuda
>>> > > [0]PETSC ERROR: #1 User provided function() line 0 in Unknown file
>>> > >
>>> --------------------------------------------------------------------------
>>> > >
>>> > > On Wed, Aug 14, 2019 at 9:51 PM Smith, Barry F. <bsmith at mcs.anl.gov>
>>> wrote:
>>> > >
>>> > >   Oh, doesn't even have to be that large. We just need to be able to
>>> look at the flop rates (as a surrogate for run times) and compare with the
>>> previous runs. So long as the size per process is pretty much the same that
>>> is good enough.
>>> > >
>>> > >    Barry
>>> > >
>>> > >
>>> > > > On Aug 14, 2019, at 8:45 PM, Mark Adams <mfadams at lbl.gov> wrote:
>>> > > >
>>> > > > I can run single; I just can't scale up. But I can use something
>>> like 1500 processors.
>>> > > >
>>> > > > On Wed, Aug 14, 2019 at 9:31 PM Smith, Barry F. <
>>> bsmith at mcs.anl.gov> wrote:
>>> > > >
>>> > > >   Oh, are all your integers 8 bytes? Even on one node?
>>> > > >
>>> > > >   Once Karl's new middleware is in place we should see about
>>> reducing to 4 bytes on the GPU.
>>> > > >
>>> > > >    Barry
>>> > > >
>>> > > >
>>> > > > > On Aug 14, 2019, at 7:44 PM, Mark Adams <mfadams at lbl.gov> wrote:
>>> > > > >
>>> > > > > OK, I'll run single. It's a bit perverse to run with 4 byte floats
>>> and 8 byte integers ... I could use 32 bit ints and just not scale out.
>>> > > > >
>>> > > > > On Wed, Aug 14, 2019 at 6:48 PM Smith, Barry F. <
>>> bsmith at mcs.anl.gov> wrote:
>>> > > > >
>>> > > > >  Mark,
>>> > > > >
>>> > > > >    Oh, I don't even care if it converges, just put in a fixed
>>> number of iterations. The idea is to just get a baseline of the possible
>>> improvement.
>>> > > > >
>>> > > > >     ECP is literally dropping millions into research on "multi
>>> precision" computations on GPUs; we need some actual numbers for the best
>>> potential benefit to determine how much we invest in investigating it
>>> further, or not.
>>> > > > >
>>> > > > >     I am not expressing any opinions on the approach; we are
>>> just in the fact-gathering stage.
>>> > > > >
>>> > > > >
>>> > > > >    Barry
>>> > > > >
>>> > > > >
>>> > > > > > On Aug 14, 2019, at 2:27 PM, Mark Adams <mfadams at lbl.gov>
>>> wrote:
>>> > > > > >
>>> > > > > >
>>> > > > > >
>>> > > > > > On Wed, Aug 14, 2019 at 2:35 PM Smith, Barry F. <
>>> bsmith at mcs.anl.gov> wrote:
>>> > > > > >
>>> > > > > >   Mark,
>>> > > > > >
>>> > > > > >    Would you be able to make one run using single precision?
>>> Just single everywhere since that is all we support currently?
>>> > > > > >
>>> > > > > >
>>> > > > > > Experience in engineering, at least, is that single does not work
>>> for FE elasticity. I tried it many years ago and have heard the same from
>>> others. This problem is pretty simple other than using Q2. I suppose I
>>> could try it, but just be aware the FE people might say that single sucks.
>>> > > > > >
>>> > > > > >    The results will give us motivation (or anti-motivation) to
>>> have support for running KSP (or PC, or Mat) in single precision while the
>>> simulation is double.
>>> > > > > >
>>> > > > > >    Thanks.
>>> > > > > >
>>> > > > > >      Barry
>>> > > > > >
>>> > > > > > For example, if single-precision KSP on the GPU is a factor of 3
>>> faster than double on the GPU, that is serious motivation.
>>> > > > > >
>>> > > > > >
>>> > > > > > > On Aug 14, 2019, at 12:45 PM, Mark Adams <mfadams at lbl.gov>
>>> wrote:
>>> > > > > > >
>>> > > > > > > FYI, here is some scaling data for GAMG on SUMMIT. We are getting
>>> about a 4x GPU speedup with 98K dof/proc (3D Q2 elasticity).
>>> > > > > > >
>>> > > > > > > This is weak scaling of a solve. There is growth in
>>> iteration count folded in here. I should put rtol in the title and/or run a
>>> fixed number of iterations and make it clear in the title.
>>> > > > > > >
>>> > > > > > > Comments welcome.
>>> > > > > > >
>>> <out_cpu_012288><out_cpu_001536><out_cuda_012288><out_cpu_000024><out_cpu_000192><out_cuda_001536><out_cuda_000192><out_cuda_000024><weak_scaling_cpu.png><weak_scaling_cuda.png>
>>> > > > > >
>>> > > > >
>>> > > >
>>> > >
>>> >
>>> >
>>> <weak_scaling_gpu_compact_spread.png><weak_scaling_cpu.png><spread.tar><compact.tar>
>>>
>>>