<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
<div dir="ltr">
<div dir="ltr"><br>
<br>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Sat, Aug 31, 2019 at 8:04 PM Mark Adams <<a href="mailto:mfadams@lbl.gov">mfadams@lbl.gov</a>> wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="ltr">
<div dir="ltr"><br>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Sat, Aug 31, 2019 at 4:28 PM Smith, Barry F. <<a href="mailto:bsmith@mcs.anl.gov" target="_blank">bsmith@mcs.anl.gov</a>> wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<br>
Any explanation for why the scaling is much better for CPUs than for GPUs? Is it the "extra" time needed for communication from the GPUs?
<br>
</blockquote>
<div><br>
</div>
<div>The GPU work is well load balanced, so it weak scales perfectly. When you put that work on the CPU you add more perfectly scalable work, so the non-scalable part is a smaller fraction of the total and the scaling looks better. For instance, the 98K dof/proc data goes up by about 1/2 sec. from the 1-node to the 512-node case for both GPU and CPU, because this non-scaling comes from communication, which is the same in both cases.</div>
<div> </div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<br>
Perhaps you could try the GPU version with Junchao's new MPI-aware CUDA branch (in the gitlab merge requests) that can speed up the communication from GPUs?<br>
</blockquote>
<div><br>
</div>
<div>Sure. Do I just check out jczhang/feature-sf-on-gpu and run as usual?</div>
</div>
</div>
</blockquote>
<div><br>
</div>
Use jsrun --smpiargs="-gpu" to enable IBM MPI's CUDA-aware support, then add the -use_gpu_aware_mpi option to let PETSc use that feature.
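<div><br>
</div>
<div>For example, a run could look something like the following (an illustrative, untested sketch: the jsrun resource flags and the ex56 options are modeled on the runs quoted below, so adjust the executable name and counts to your job):</div>
<div><br>
</div>
<div>jsrun -n 1 -a 1 -c 1 -g 1 --smpiargs="-gpu" ./ex56 -cells 2,2,2 -ex56_dm_vec_type cuda -ex56_dm_mat_type aijcusparse -use_gpu_aware_mpi</div>
<div><br>
</div>
<div>Both pieces are needed: --smpiargs="-gpu" turns on the MPI side, and -use_gpu_aware_mpi tells PETSc it can pass GPU buffers directly to MPI instead of staging them through the host.</div>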
<div> </div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="ltr">
<div class="gmail_quote">
<div> </div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<br>
Barry<br>
<br>
<br>
> On Aug 30, 2019, at 11:56 AM, Mark Adams <<a href="mailto:mfadams@lbl.gov" target="_blank">mfadams@lbl.gov</a>> wrote:<br>
> <br>
> Here is some more weak scaling data with a fixed number of iterations (I have given a test case that exhibits the numerical problems to ORNL, and they said they would give it to Nvidia).<br>
> <br>
> I implemented an option to "spread" the reduced coarse grids across the whole machine as opposed to a "compact" layout where active processes are laid out in simple lexicographical order. This spread approach looks a little better.<br>
> <br>
> Mark<br>
> <br>
> On Wed, Aug 14, 2019 at 10:46 PM Smith, Barry F. <<a href="mailto:bsmith@mcs.anl.gov" target="_blank">bsmith@mcs.anl.gov</a>> wrote:<br>
> <br>
> Ahh, PGI compiler, that explains it :-)<br>
> <br>
> Ok, thanks. Don't worry about the runs right now. We'll figure out the fix. The code is just<br>
> <br>
> *a = (PetscReal)strtod(name,endptr);<br>
> <br>
> could be a compiler bug.<br>
> <br>
> <br>
> <br>
> <br>
> > On Aug 14, 2019, at 9:23 PM, Mark Adams <<a href="mailto:mfadams@lbl.gov" target="_blank">mfadams@lbl.gov</a>> wrote:<br>
> > <br>
> > I am getting this error with single:<br>
> > <br>
> > 22:21 /gpfs/alpine/geo127/scratch/adams$ jsrun -n 1 -a 1 -c 1 -g 1 ./ex56_single -cells 2,2,2 -ex56_dm_vec_type cuda -ex56_dm_mat_type aijcusparse -fp_trap
<br>
> > [0] 81 global equations, 27 vertices<br>
> > [0]PETSC ERROR: *** unknown floating point error occurred ***<br>
> > [0]PETSC ERROR: The specific exception can be determined by running in a debugger. When the<br>
> > [0]PETSC ERROR: debugger traps the signal, the exception can be found with fetestexcept(0x3e000000)<br>
> > [0]PETSC ERROR: where the result is a bitwise OR of the following flags:<br>
> > [0]PETSC ERROR: FE_INVALID=0x20000000 FE_DIVBYZERO=0x4000000 FE_OVERFLOW=0x10000000 FE_UNDERFLOW=0x8000000 FE_INEXACT=0x2000000<br>
> > [0]PETSC ERROR: Try option -start_in_debugger<br>
> > [0]PETSC ERROR: likely location of problem given in stack below<br>
> > [0]PETSC ERROR: --------------------- Stack Frames ------------------------------------<br>
> > [0]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,<br>
> > [0]PETSC ERROR: INSTEAD the line number of the start of the function<br>
> > [0]PETSC ERROR: is given.<br>
> > [0]PETSC ERROR: [0] PetscDefaultFPTrap line 355 /autofs/nccs-svm1_home1/adams/petsc/src/sys/error/fp.c<br>
> > [0]PETSC ERROR: [0] PetscStrtod line 1964 /autofs/nccs-svm1_home1/adams/petsc/src/sys/objects/options.c<br>
> > [0]PETSC ERROR: [0] PetscOptionsStringToReal line 2021 /autofs/nccs-svm1_home1/adams/petsc/src/sys/objects/options.c<br>
> > [0]PETSC ERROR: [0] PetscOptionsGetReal line 2321 /autofs/nccs-svm1_home1/adams/petsc/src/sys/objects/options.c<br>
> > [0]PETSC ERROR: [0] PetscOptionsReal_Private line 1015 /autofs/nccs-svm1_home1/adams/petsc/src/sys/objects/aoptions.c<br>
> > [0]PETSC ERROR: [0] KSPSetFromOptions line 329 /autofs/nccs-svm1_home1/adams/petsc/src/ksp/ksp/interface/itcl.c<br>
> > [0]PETSC ERROR: [0] SNESSetFromOptions line 869 /autofs/nccs-svm1_home1/adams/petsc/src/snes/interface/snes.c<br>
> > [0]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------<br>
> > [0]PETSC ERROR: Floating point exception<br>
> > [0]PETSC ERROR: trapped floating point error<br>
> > [0]PETSC ERROR: See <a href="https://www.mcs.anl.gov/petsc/documentation/faq.html" rel="noreferrer" target="_blank">
https://www.mcs.anl.gov/petsc/documentation/faq.html</a> for trouble shooting.<br>
> > [0]PETSC ERROR: Petsc Development GIT revision: v3.11.3-1685-gd3eb2e1 GIT Date: 2019-08-13 06:33:29 -0400<br>
> > [0]PETSC ERROR: ./ex56_single on a arch-summit-dbg-single-pgi-cuda named h36n11 by adams Wed Aug 14 22:21:56 2019<br>
> > [0]PETSC ERROR: Configure options --with-cc=mpicc --with-cxx=mpiCC --with-fc=mpif90 COPTFLAGS="-g -Mfcon" CXXOPTFLAGS="-g -Mfcon" FOPTFLAGS="-g -Mfcon" --with-precision=single --with-ssl=0 --with-batch=0 --with-mpiexec="jsrun -g 1" --with-cuda=1 --with-cudac=nvcc
CUDAFLAGS="-ccbin pgc++" --download-metis --download-parmetis --download-fblaslapack --with-x=0 --with-64-bit-indices=0 --with-debugging=1 PETSC_ARCH=arch-summit-dbg-single-pgi-cuda<br>
> > [0]PETSC ERROR: #1 User provided function() line 0 in Unknown file<br>
> > --------------------------------------------------------------------------<br>
> > <br>
> > On Wed, Aug 14, 2019 at 9:51 PM Smith, Barry F. <<a href="mailto:bsmith@mcs.anl.gov" target="_blank">bsmith@mcs.anl.gov</a>> wrote:<br>
> > <br>
> > Oh, doesn't even have to be that large. We just need to be able to look at the flop rates (as a surrogate for run times) and compare with the previous runs. So long as the size per process is pretty much the same that is good enough.<br>
> > <br>
> > Barry<br>
> > <br>
> > <br>
> > > On Aug 14, 2019, at 8:45 PM, Mark Adams <<a href="mailto:mfadams@lbl.gov" target="_blank">mfadams@lbl.gov</a>> wrote:<br>
> > > <br>
> > > I can run single, I just can't scale up. But I can use like 1500 processors.<br>
> > > <br>
> > > On Wed, Aug 14, 2019 at 9:31 PM Smith, Barry F. <<a href="mailto:bsmith@mcs.anl.gov" target="_blank">bsmith@mcs.anl.gov</a>> wrote:<br>
> > > <br>
> > > Oh, are all your integers 8 bytes? Even on one node?<br>
> > > <br>
> > > Once Karl's new middleware is in place we should see about reducing to 4 bytes on the GPU.<br>
> > > <br>
> > > Barry<br>
> > > <br>
> > > <br>
> > > > On Aug 14, 2019, at 7:44 PM, Mark Adams <<a href="mailto:mfadams@lbl.gov" target="_blank">mfadams@lbl.gov</a>> wrote:<br>
> > > > <br>
> > > > OK, I'll run single. It's a bit perverse to run with 4-byte floats and 8-byte integers ... I could use 32-bit ints and just not scale out.<br>
> > > > <br>
> > > > On Wed, Aug 14, 2019 at 6:48 PM Smith, Barry F. <<a href="mailto:bsmith@mcs.anl.gov" target="_blank">bsmith@mcs.anl.gov</a>> wrote:<br>
> > > > <br>
> > > > Mark,<br>
> > > > <br>
> > > > Oh, I don't even care if it converges, just put in a fixed number of iterations. The idea is to just get a baseline of the possible improvement.
<br>
> > > > <br>
> > > > ECP is literally dropping millions into research on "multi precision" computations on GPUs, so we need some actual numbers for the best potential benefit to determine how much, if at all, we invest in investigating it further.<br>
> > > > <br>
> > > > I am not expressing any opinions on the approach; we are just in the fact-gathering stage.<br>
> > > > <br>
> > > > <br>
> > > > Barry<br>
> > > > <br>
> > > > <br>
> > > > > On Aug 14, 2019, at 2:27 PM, Mark Adams <<a href="mailto:mfadams@lbl.gov" target="_blank">mfadams@lbl.gov</a>> wrote:<br>
> > > > > <br>
> > > > > <br>
> > > > > <br>
> > > > > On Wed, Aug 14, 2019 at 2:35 PM Smith, Barry F. <<a href="mailto:bsmith@mcs.anl.gov" target="_blank">bsmith@mcs.anl.gov</a>> wrote:<br>
> > > > > <br>
> > > > > Mark,<br>
> > > > > <br>
> > > > > Would you be able to make one run using single precision? Just single everywhere since that is all we support currently?
<br>
> > > > > <br>
> > > > > <br>
> > > > > Experience in engineering, at least, is that single precision does not work for FE elasticity. I tried it many years ago and have heard the same from others. This problem is pretty simple other than using Q2. I suppose I could try it, but just be aware that the FE people might say that single sucks.<br>
> > > > > <br>
> > > > > The results will give us motivation (or anti-motivation) to have support for running KSP (or PC, or Mat) in single precision while the simulation is in double.<br>
> > > > > <br>
> > > > > Thanks.<br>
> > > > > <br>
> > > > > Barry<br>
> > > > > <br>
> > > > > For example, if single precision KSP on GPUs is a factor of 3 faster than double precision on GPUs, that is serious motivation.
<br>
> > > > > <br>
> > > > > <br>
> > > > > > On Aug 14, 2019, at 12:45 PM, Mark Adams <<a href="mailto:mfadams@lbl.gov" target="_blank">mfadams@lbl.gov</a>> wrote:<br>
> > > > > > <br>
> > > > > > FYI, here is some scaling data for GAMG on SUMMIT. We are getting about a 4x GPU speedup with 98K dof/proc (3D Q2 elasticity).<br>
> > > > > > <br>
> > > > > > This is weak scaling of a solve. There is growth in iteration count folded in here. I should put rtol in the title and/or run a fixed number of iterations and make it clear in the title.<br>
> > > > > > <br>
> > > > > > Comments welcome.<br>
> > > > > > <out_cpu_012288><out_cpu_001536><out_cuda_012288><out_cpu_000024><out_cpu_000192><out_cuda_001536><out_cuda_000192><out_cuda_000024><weak_scaling_cpu.png><weak_scaling_cuda.png><br>
> > > > > <br>
> > > > <br>
> > > <br>
> > <br>
> <br>
> <weak_scaling_gpu_compact_spread.png><weak_scaling_cpu.png><spread.tar><compact.tar><br>
<br>
</blockquote>
</div>
</div>
</blockquote>
</div>
</div>
</body>
</html>