[petsc-dev] [petsc-maint #88993] Petsc with Cuda 4.0 and Multiple GPUs

Sun Oct 2 16:18:30 CDT 2011

In case it is useful, I have attached two log files from runs of the ex2f
PETSc example in src/ksp/ksp/examples/tutorials.  One was run back in April
with petsc-dev linked to Cuda 3.2, and it shows excellent runtime
performance.  The other was run today with petsc-dev checked out of the
mercurial repo yesterday morning and linked to Cuda 4.0.  In addition to the
difference in runtime performance, I do not see an entry for MatCUSPCopyTo
in the profiling section of the Cuda 4.0 log.  I'm not sure what the
significance of that is.  I do observe that the run time for PCApply is about
the same in the two cases, which I think I would expect even if the problem
were partitioned across two gpus.  However, it does make me wonder whether
the absence of MatCUSPCopyTo from the Cuda 4.0 log means that the matrix was
not actually copied to the gpu.  I'm not sure yet how to check for that.
Hope this helps.

Thanks,

Dave
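
P.S. For reference, here is a minimal sketch of the per-process GPU
assignment Barry describes in the quoted thread below, where each MPI process
should end up on a different GPU.  It only illustrates the arithmetic: the
names device_for_rank, local_rank and num_devices are hypothetical and are not
part of PETSc or CUDA, and Python is used purely for brevity.  In a real
program the device count would presumably come from cudaGetDeviceCount() and
the resulting index would be passed to cudaSetDevice().

```python
def device_for_rank(local_rank, num_devices):
    """Pick a GPU index for a process (hypothetical helper, not PETSc API).

    local_rank  -- rank of the process among those on the same node
    num_devices -- number of GPUs visible on that node
    Processes are spread round-robin over the available devices.
    """
    if num_devices <= 0:
        raise ValueError("no GPUs available")
    if local_rank < 0:
        raise ValueError("local_rank must be non-negative")
    return local_rank % num_devices


if __name__ == "__main__":
    # Two GPUs, four processes: ranks wrap around the device list.
    for rank in range(4):
        print("rank", rank, "-> device", device_for_rank(rank, 2))
```

With one process per GPU on a two-GPU machine, ranks 0 and 1 map to devices 0
and 1.  A single process always maps to device 0, which is the one-GPU
behavior I am after.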

-------------- next part --------------
A non-text attachment was scrubbed...
Name: ex2f_3200_3200_cuda_yes_cuda_3.2.log
Type: application/octet-stream
Size: 10966 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/petsc-dev/attachments/20111002/4cb2699c/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ex2f_3200_3200_cuda_yes_cuda_4.0.log
Type: application/octet-stream
Size: 9782 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/petsc-dev/attachments/20111002/4cb2699c/attachment-0001.obj>
-------------- next part --------------

Dave Nystrom writes:
 > Matthew Knepley writes:
 >  > On Sat, Oct 1, 2011 at 11:26 PM, Dave Nystrom <Dave.Nystrom at tachyonlogic.com> wrote:
 >  > > Barry Smith writes:
 >  > >  > On Oct 1, 2011, at 9:22 PM, Dave Nystrom wrote:
 >  > >  > > Hi Barry,
 >  > >  > >
 >  > >  > > I've sent a couple more emails on this topic.  What I am trying to do at the
 >  > >  > > moment is to figure out how to have a problem run on only one gpu if it will
 >  > >  > > fit in the memory of that gpu.  Back in April when I had built petsc-dev with
 >  > >  > > Cuda 3.2, petsc would only use one gpu if you had multiple gpus on your
 >  > >  > > machine.  In order to use multiple gpus for a problem, one had to use
 >  > >  > > multiple threads with a separate thread assigned to control each gpu.  But
 >  > >  > > Cuda 4.0 has, I believe, made that transparent and under the hood.  So now
 >  > >  > > when I run a small example problem such as
 >  > >  > > src/ksp/ksp/examples/tutorials/ex2f.F with an 800x800 problem, it gets
 >  > >  > > partitioned to run on both of the gpus in my machine.  The result is a very
 >  > >  > > large performance hit because of communication back and forth from one gpu to
 >  > >  > > the other via the cpu.
 >  > >  >
 >  > >  > How do you know there is lots of communication from the GPU to the CPU? In
 >  > >  > the -log_summary? Nope because PETSc does not manage anything like that
 >  > >  > (that is one CPU process using both GPUs).
 >  > >
 >  > > What I believe is that it is being managed by Cuda 4.0, not by petsc.
 >  > >
 >  > >  > > So this problem with a 3200x3200 grid runs 5x slower
 >  > >  > > now than it did with Cuda 3.2.  I believe if one is programming down at the
 >  > >  > > cuda level, it is possible to have a smaller problem run on only one gpu so
 >  > >  > > that there is communication only between the cpu and gpu and only at the
 >  > >  > > start and end of the calculation.
 >  > >  > >
 >  > >  > > To me, it seems like what is needed is a petsc option to specify the number
 >  > >  > > of gpus to run on that can somehow get passed down to the cuda level through
 >  > >  > > cusp and thrust.  I fear that the short term solution is going to have to be
 >  > >  > > for me to pull one of the gpus out of my desktop system but it would be nice
 >  > >  > > if there was a way to tell petsc and friends to just use one gpu when I want
 >  > >  > > it to.
 >  > >  > >
 >  > >  > > If necessary, I can send a couple of log files to demonstrate what I am
 >  > >  > > trying to describe regarding the performance hit.
 >  > >  >
 >  > >  > I am not convinced that the poor performance you are getting now has
 >  > >  > anything to do with using both GPUs. Please run a PETSc program with the
 >  > >  > command -cuda_show_devices
 >  > >
 >  > > I ran the following command:
 >  > >
 >  > > ex2f -m 8 -n 8 -ksp_type cg -pc_type jacobi -log_summary -cuda_show_devices
 >  > > -mat_type aijcusp -vec_type cusp -options_left
 >  > >
 >  > > The result was a report that there was one option left, that being
 >  > > -cuda_show_devices.  I am using a copy of petsc-dev that I cloned and built
 >  > > this morning.
 >  > 
 >  > What do you have at src/sys/objects/pinit.c:825? You should see the code
 >  > that processes this option. You should be able to break there in the
 >  > debugger and see what happens. This sounds again like you are not
 >  > processing options correctly.
 > 
 > Hi Matt,
 > 
 > I'll take a look at that in a bit and see if I can figure out what is going
 > on.  I do see the code that you mention that processes the arguments that
 > Barry mentioned.  In terms of processing options correctly, at least in this
 > case I am actually running one of the petsc examples rather than my own
 > code.  And it seems to correctly process the other command line arguments.
 > Anyway, I'll write more after I have had a chance to investigate more.
 > 
 > Thanks,
 > 
 > Dave
 > 
 >  > Matt
 >  > 
 >  > >  > What are the choices?  You can then pick one of them and run with
 >  > >  > -cuda_set_device integer
 >  > >
 >  > > The -cuda_set_device option does not appear to be recognized either, even
 >  > > if I choose an integer like 0.
 >  > >
 >  > >  > Does this change things?
 >  > >
 >  > > I suspect it would change things if I could get it to work.
 >  > >
 >  > > Thanks,
 >  > >
 >  > > Dave
 >  > >
 >  > >  > Barry
 >  > >  >
 >  > >  > >
 >  > >  > > Thanks,
 >  > >  > >
 >  > >  > > Dave
 >  > >  > >
 >  > >  > > Barry Smith writes:
 >  > >  > >> Dave,
 >  > >  > >>
 >  > >  > >> We have no mechanism in the PETSc code for a PETSc single CPU process to
 >  > >  > >> use two GPUs at the same time. However you could have two MPI processes
 >  > >  > >> each using their own GPU.
 >  > >  > >>
 >  > >  > >> The one tricky part is you need to make sure each MPI process uses a
 >  > >  > >> different GPU. We currently do not have a mechanism to do this assignment
 >  > >  > >> automatically. I think it can be done with cudaSetDevice(). But I don't
 >  > >  > >> know the details, sending this to petsc-dev at mcs.anl.gov where more people
 >  > >  > >> may know.
 >  > >  > >>
 >  > >  > >> PETSc-folks,
 >  > >  > >>
 >  > >  > >> We need a way to have this setup automatically.
 >  > >  > >>
 >  > >  > >> Barry
 >  > >  > >>
 >  > >  > >> On Oct 1, 2011, at 5:43 PM, Dave Nystrom wrote:
 >  > >  > >>
 >  > >  > >>> I'm running petsc on a machine with Cuda 4.0 and 2 gpus.  This is a desktop
 >  > >  > >>> machine with a single processor.  I know that Cuda 4.0 has support for
 >  > >  > >>> running on multiple gpus but don't know if petsc uses that.  But suppose I
 >  > >  > >>> have a problem that will fit in the memory for a single gpu.  Will petsc run
 >  > >  > >>> the problem on a single gpu or does it split it between the 2 gpus and incur
 >  > >  > >>> the communication overhead of copying data between the two gpus?
 >  > >  > >>>
 >  > >  > >>> Thanks,
 >  > >  > >>>
 >  > >  > >>> Dave
 >  > 
 >  > -- 
 >  > What most experimenters take for granted before they begin their experiments
 >  > is infinitely more interesting than any results to which their experiments
 >  > lead.
 >  > -- Norbert Wiener