[petsc-dev] [petsc-maint #88993] Petsc with Cuda 4.0 and Multiple GPUs

Dave Nystrom Dave.Nystrom at tachyonlogic.com
Sun Oct 2 21:08:11 CDT 2011


Barry Smith writes:
 > On Oct 2, 2011, at 6:39 PM, Dave Nystrom wrote:
 > 
 >> Thanks for the update.  I don't believe I have gotten a run with good
 >> performance yet, either from C or Fortran.  I wish there was an easy way for
 >> me to force use of only one of my gpus.  I don't want to have to pull one of
 >> the gpus in order to see if that is complicating things with Cuda 4.0.  I'll
 >> be eager to hear if you make any progress on figuring things out.
 >> 
 >> Do you understand yet why the petsc ex2.c example is able to parse the
 >> "-cuda_show_devices" argument but ex2f.F is not?
 > 
 > Matt put the code in the wrong place in PETSc, that is all, no big
 > existentialist reason. I will fix that.

Thanks.  I'll look forward to testing out the new version.

Dave
 
 > Barry
 > 
 >> 
 >> Thanks,
 >> 
 >> Dave
 >> 
 >> Barry Smith writes:
 >>> It is not doing the MatMult operation on the GPU and hence needs to move
 >>> the vectors back and forth for each operation (since MatMult is done on
 >>> the CPU with the vector while vector operations are done on the GPU);
 >>> that is why the performance is so terrible.
 >>> 
 >>> Not sure why yet. It is copying the Mat down for me from C.
 >>> 
 >>> Barry
 >>> 
 >>> On Oct 2, 2011, at 4:18 PM, Dave Nystrom wrote:
 >>> 
 >>>> In case it might be useful, I have attached two log files of runs with the
 >>>> ex2f petsc example from src/ksp/ksp/examples/tutorials.  One was run back in
 >>>> April with petsc-dev linked to Cuda 3.2.  It shows excellent runtime
 >>>> performance.  The other was run today with petsc-dev checked out of the
 >>>> mercurial repo yesterday morning and linked to Cuda 4.0.  In addition to the
 >>>> differences in run time performance, I also do not see an entry for
 >>>> MatCUSPCopyTo in the profiling section.  I'm not sure what the significance
 >>>> of that is.  I do observe that the run time for PCApply is about the same for
 >>>> the two cases.  I think I would expect that to be the case even if the
 >>>> problem were partitioned across two gpus.  However, it does make me wonder if
 >>>> the absence of MatCUSPCopyTo in the profiling section of the Cuda 4.0 log
 >>>> file is an indication that the matrix was not actually copied to the gpu.
 >>>> I'm not sure yet how to check for that.  Hope this might be useful.
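
One crude way to check whether the matrix really made it onto the card is to
watch the free device memory around matrix assembly.  Below is a small sketch
of that idea using cudaMemGetInfo(); it assumes you can drop a couple of lines
into a C driver like ex2.c and link against the CUDA runtime, and it is not
anything PETSc provides itself.

#include <stdio.h>
#include <cuda_runtime.h>

/* Print free/total memory on the currently selected GPU.  Calling this
   before and after MatAssemblyEnd() (or the first KSPSolve()) should show
   a drop in free memory if the matrix was really copied to the device. */
static void report_gpu_memory(const char *label)
{
  size_t free_bytes, total_bytes;
  if (cudaMemGetInfo(&free_bytes, &total_bytes) == cudaSuccess) {
    printf("%s: %.1f MB free of %.1f MB on the GPU\n",
           label, free_bytes / 1.0e6, total_bytes / 1.0e6);
  } else {
    printf("%s: cudaMemGetInfo failed\n", label);
  }
}
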
 >>>> 
 >>>> Thanks,
 >>>> 
 >>>> Dave
 >>>> 
 >>>> 
 >>>> <ex2f_3200_3200_cuda_yes_cuda_3.2.log><ex2f_3200_3200_cuda_yes_cuda_4.0.log>
 >>>> Dave Nystrom writes:
 >>>>> Matthew Knepley writes:
 >>>>>> On Sat, Oct 1, 2011 at 11:26 PM, Dave Nystrom <Dave.Nystrom at tachyonlogic.com> wrote:
 >>>>>>> Barry Smith writes:
 >>>>>>>> On Oct 1, 2011, at 9:22 PM, Dave Nystrom wrote:
 >>>>>>>>> Hi Barry,
 >>>>>>>>> 
 >>>>>>>>> I've sent a couple more emails on this topic.  What I am trying to do at the
 >>>>>>>>> moment is to figure out how to have a problem run on only one gpu if it will
 >>>>>>>>> fit in the memory of that gpu.  Back in April when I had built petsc-dev with
 >>>>>>>>> Cuda 3.2, petsc would only use one gpu if you had multiple gpus on your
 >>>>>>>>> machine.  In order to use multiple gpus for a problem, one had to use
 >>>>>>>>> multiple threads with a separate thread assigned to control each gpu.  But
 >>>>>>>>> Cuda 4.0 has, I believe, made that transparent, handled under the hood.  So now
 >>>>>>>>> when I run a small example problem such as
 >>>>>>>>> src/ksp/ksp/examples/tutorials/ex2f.F with an 800x800 problem, it gets
 >>>>>>>>> partitioned to run on both of the gpus in my machine.  The result is a very
 >>>>>>>>> large performance hit because of communication back and forth from one gpu to
 >>>>>>>>> the other via the cpu.
 >>>>>>>> 
 >>>>>>>> How do you know there is lots of communication from the GPU to the CPU? In
 >>>>>>>> the -log_summary? Nope because PETSc does not manage anything like that
 >>>>>>>> (that is one CPU process using both GPUs).
 >>>>>>> 
 >>>>>>> What I believe is that it is being managed by Cuda 4.0, not by petsc.
 >>>>>>> 
 >>>>>>>>> So this problem with a 3200x3200 grid runs 5x slower
 >>>>>>>>> now than it did with Cuda 3.2.  I believe if one is programming down at the
 >>>>>>>>> cuda level, it is possible to have a smaller problem run on only one gpu so
 >>>>>>>>> that there is communication only between the cpu and gpu and only at the
 >>>>>>>>> start and end of the calculation.
 >>>>>>>>> 
 >>>>>>>>> To me, it seems like what is needed is a petsc option to specify the number
 >>>>>>>>> of gpus to run on that can somehow get passed down to the cuda level through
 >>>>>>>>> cusp and thrust.  I fear that the short term solution is going to have to be
 >>>>>>>>> for me to pull one of the gpus out of my desktop system but it would be nice
 >>>>>>>>> if there was a way to tell petsc and friends to just use one gpu when I want
 >>>>>>>>> it to.
 >>>>>>>>> 
 >>>>>>>>> If necessary, I can send a couple of log files to demonstrate what I am
 >>>>>>>>> trying to describe regarding the performance hit.
 >>>>>>>> 
 >>>>>>>> I am not convinced that the poor performance you are getting now has
 >>>>>>>> anything to do with using both GPUs. Please run a PETSc program with the
 >>>>>>>> option -cuda_show_devices
 >>>>>>> 
 >>>>>>> I ran the following command:
 >>>>>>> 
 >>>>>>> ex2f -m 8 -n 8 -ksp_type cg -pc_type jacobi -log_summary -cuda_show_devices
 >>>>>>> -mat_type aijcusp -vec_type cusp -options_left
 >>>>>>> 
 >>>>>>> The result was a report that there was one option left, that being
 >>>>>>> -cuda_show_devices.  I am using a copy of petsc-dev that I cloned and built
 >>>>>>> this morning.
 >>>>>> 
 >>>>>> What do you have at src/sys/objects/pinit.c:825? You should see the code
 >>>>>> that processes this option. You should be able to break there in the
 >>>>>> debugger and see what happens. This sounds again like you are not
 >>>>>> processing options correctly.
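
Just to make the intent of those two options concrete, here is a rough sketch,
using only the CUDA runtime API, of what processing -cuda_show_devices and
-cuda_set_device presumably boils down to.  This is not the code in pinit.c;
the option names are the ones from this thread and the parsing is only
illustrative.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <cuda_runtime.h>

/* Rough sketch only: enumerate CUDA devices and pin one of them,
   driven by command-line options spelled like the ones in this thread. */
int main(int argc, char **argv)
{
  int i, d, ndev = 0;

  cudaGetDeviceCount(&ndev);
  for (i = 1; i < argc; i++) {
    if (strcmp(argv[i], "-cuda_show_devices") == 0) {
      for (d = 0; d < ndev; d++) {
        struct cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("CUDA device %d: %s\n", d, prop.name);
      }
    } else if (strcmp(argv[i], "-cuda_set_device") == 0 && i + 1 < argc) {
      cudaSetDevice(atoi(argv[i + 1]));  /* all subsequent work uses this device */
    }
  }
  return 0;
}
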
 >>>>> 
 >>>>> Hi Matt,
 >>>>> 
 >>>>> I'll take a look at that in a bit and see if I can figure out what is going
 >>>>> on.  I do see the code you mention that processes the options that
 >>>>> Barry referred to.  In terms of processing options correctly, at least in this
 >>>>> case I am actually running one of the petsc examples rather than my own
 >>>>> code.  And it seems to correctly process the other command line arguments.
 >>>>> Anyway, I'll write more after I have had a chance to investigate more.
 >>>>> 
 >>>>> Thanks,
 >>>>> 
 >>>>> Dave
 >>>>> 
 >>>>>> Matt
 >>>>>> 
 >>>>>>>> What are the choices?  You can then pick one of them and run with
 >>>>>>>> -cuda_set_device integer
 >>>>>>> 
 >>>>>>> The -cuda_set_device option does not appear to be recognized either, even
 >>>>>>> if I choose an integer like 0.
 >>>>>>> 
 >>>>>>>> Does this change things?
 >>>>>>> 
 >>>>>>> I suspect it would change things if I could get it to work.
 >>>>>>> 
 >>>>>>> Thanks,
 >>>>>>> 
 >>>>>>> Dave
 >>>>>>> 
 >>>>>>>> Barry
 >>>>>>>> 
 >>>>>>>>> 
 >>>>>>>>> Thanks,
 >>>>>>>>> 
 >>>>>>>>> Dave
 >>>>>>>>> 
 >>>>>>>>> Barry Smith writes:
 >>>>>>>>>> Dave,
 >>>>>>>>>> 
 >>>>>>>>>> We have no mechanism in the PETSc code for a PETSc single CPU process to
 >>>>>>>>>> use two GPUs at the same time. However you could have two MPI processes
 >>>>>>>>>> each using their own GPU.
 >>>>>>>>>> 
 >>>>>>>>>> The one tricky part is you need to make sure each MPI process uses a
 >>>>>>>>>> different GPU. We currently do not have a mechanism to do this assignment
 >>>>>>>>>> automatically. I think it can be done with cudaSetDevice(). But I don't
 >>>>>>>>>> know the details, so I am sending this to petsc-dev at mcs.anl.gov where more
 >>>>>>>>>> people may know.
 >>>>>>>>>> 
 >>>>>>>>>> PETSc-folks,
 >>>>>>>>>> 
 >>>>>>>>>> We need a way to have this setup automatically.
 >>>>>>>>>> 
 >>>>>>>>>> Barry
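
For what it is worth, a minimal sketch of the one-GPU-per-rank assignment Barry
describes might look like the following.  It assumes a single node and a simple
rank-modulo mapping, initializes MPI by hand so the device can be chosen before
PetscInitialize(), and is not an existing PETSc mechanism.

#include <mpi.h>
#include <cuda_runtime.h>
#include <petscsys.h>

int main(int argc, char **argv)
{
  int rank, ndev = 0;

  MPI_Init(&argc, &argv);                     /* init MPI by hand so the device  */
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);       /* can be picked before PETSc runs */
  cudaGetDeviceCount(&ndev);
  if (ndev > 0) cudaSetDevice(rank % ndev);   /* one GPU per rank (single node assumed) */

  PetscInitialize(&argc, &argv, NULL, NULL);  /* PETSc sees MPI is already initialized */
  /* ... create Mat/Vec with -mat_type aijcusp -vec_type cusp and solve ... */
  PetscFinalize();
  MPI_Finalize();                             /* we initialized MPI, so we finalize it */
  return 0;
}
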
 >>>>>>>>>> 
 >>>>>>>>>> On Oct 1, 2011, at 5:43 PM, Dave Nystrom wrote:
 >>>>>>>>>> 
 >>>>>>>>>>> I'm running petsc on a machine with Cuda 4.0 and 2 gpus.  This is a desktop
 >>>>>>>>>>> machine with a single processor.  I know that Cuda 4.0 has support for
 >>>>>>>>>>> running on multiple gpus but don't know if petsc uses that.  But suppose I
 >>>>>>>>>>> have a problem that will fit in the memory of a single gpu.  Will petsc run
 >>>>>>>>>>> the problem on a single gpu, or will it split it between the 2 gpus and incur
 >>>>>>>>>>> the communication overhead of copying data between the two gpus?
 >>>>>>>>>>> 
 >>>>>>>>>>> Thanks,
 >>>>>>>>>>> 
 >>>>>>>>>>> Dave
 >>>>>> 
 >>>>>> -- 
 >>>>>> What most experimenters take for granted before they begin their experiments
 >>>>>> is infinitely more interesting than any results to which their experiments
 >>>>>> lead.
 >>>>>> -- Norbert Wiener
 >>> 
 > 


