[petsc-dev] [petsc-maint #88993] Petsc with Cuda 4.0 and Multiple GPUs

Dave Nystrom Dave.Nystrom at tachyonlogic.com
Mon Oct 3 08:51:04 CDT 2011


Matthew Knepley writes:
 > On Sun, Oct 2, 2011 at 10:50 PM, Dave Nystrom <Dave.Nystrom at tachyonlogic.com> wrote:
 > 
 > > Hi Barry,
 > >
 > > Barry Smith writes:
 > >  > Dave,
 > >  >
 > >  > I cannot explain why it does not use the MatMult_SeqAIJCusp() - it does for me.
 > >
 > > Do you get good performance running a problem like ex2?
 > >
 > 
 > Okay, now the problem is clear. This does not have to do with having 2
 > GPUs; rather, you were not running MatMult on any GPU.
 > 
 > This problem has to do with the 'da_mat_type aijcusp' option being passed
 > in. Somehow this is not being acted on. So, we need
 > 
 >   - The full input
 >   - The full output of the test using -log_summary

Hi Matt,

I'm not sure what you are asking for.  The two problems I have been running
are two of the petsc examples, i.e. src/ksp/ksp/examples/tutorials/ex2f.F
and src/ksp/ksp/examples/tutorials/ex2.c.  Are you not able to reproduce the
problem?  If you are convinced that this is not dependent on having 2 gpus,
then it would seem that my platform is not that unusual.  I did send, in an
earlier email, a couple of log files from runs with -log_summary.  One was
from a run yesterday with my current petsc/cuda install.

Anyway, I'm happy to perform any runs you want and pass you the data and
results.  But I'm not sure what you mean by "The full input", since I am
running unmodified petsc examples and -log_summary captures my command line
options.  Just let me know what you need.
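
For concreteness, here is the sort of invocation I have been using (the
grid size and solver options are the ones from my earlier 3200x3200 logs,
so treat this as representative rather than exact):

  ./ex2 -m 3200 -n 3200 -ksp_type cg -pc_type jacobi \
        -mat_type aijcusp -vec_type cusp -log_summary -options_left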

I do have, on my list from Barry, to reinstall petsc with the mercurial
versions of cusp and thrust and without the txpetscgpu package.  That will
probably have to happen this evening, as I have to head off to work.
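
In case it clarifies what I mean, the reconfigure I have in mind would look
roughly like the following.  The flag names here are from memory of my
configure script, so treat them as approximate; the point is just dropping
--download-txpetscgpu and pointing configure at the mercurial cusp and
thrust checkouts:

  ./configure --with-cuda=1 \
              --with-cusp-dir=/path/to/cusp \
              --with-thrust-dir=/path/to/thrust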

Related to the above two petsc examples, ex2.c and ex2f.F, I have observed
different behavior in their processing of command line options, but I
believe Barry is working on a fix for that.
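
One more thought on the multiple-gpu question further down in this thread,
where Barry mentions cudaSetDevice(): in case it is useful, here is a
minimal sketch of the per-rank device assignment as I understand it, done
before any other cuda work.  This is just my own illustration of the idea,
not anything petsc does today:

  #include <mpi.h>
  #include <cuda_runtime.h>

  int main(int argc, char **argv)
  {
    int rank, ndev;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaGetDeviceCount(&ndev);       /* gpus visible to this process */
    if (ndev > 0)
      cudaSetDevice(rank % ndev);    /* pin each MPI rank to one gpu */
    /* ... PetscInitialize() and the rest of the application ... */
    MPI_Finalize();
    return 0;
  }

With something like this, a one-process run would stay on a single gpu and
a two-process run would put one rank on each.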

Thanks,

Dave

 > Thanks,
 > 
 > Matt
 > 
 > 
 > >  > Have you updated to the latest cusp/thrust?  From the mercurial
 > >  > repositories?
 > >
 > > I did try the latest version of cusp from mercurial initially but the
 > > build failed.  So I am currently using the latest cusp tarball.  I did
 > > not try the latest version of thrust but instead was just using what
 > > came with the released version of Cuda 4.0.  I could try the mercurial
 > > versions of both.
 > >
 > >  > There is a difference: in your new 4.0 build you added
 > >  > --download-txpetscgpu=yes.  BTW, that doesn't work for me with the
 > >  > latest cusp and thrust from the mercurial repositories.  Can you try
 > >  > reconfiguring and making without that?
 > >
 > > Yes, I can try that.  Maybe that is why my original build with cusp from
 > > mercurial failed.
 > >
 > > Thanks for your help,
 > >
 > > Dave
 > >
 > >  > Barry
 > >  >
 > >  > On Oct 2, 2011, at 9:08 PM, Dave Nystrom wrote:
 > >  >
 > >  >> Barry Smith writes:
 > >  >>> On Oct 2, 2011, at 6:39 PM, Dave Nystrom wrote:
 > >  >>>
 > >  >>>> Thanks for the update.  I don't believe I have gotten a run with
 > >  >>>> good performance yet, either from C or Fortran.  I wish there were
 > >  >>>> an easy way for me to force use of only one of my gpus.  I don't
 > >  >>>> want to have to pull one of the gpus in order to see if that is
 > >  >>>> complicating things with Cuda 4.0.  I'll be eager to hear if you
 > >  >>>> make any progress on figuring things out.
 > >  >>>>
 > >  >>>> Do you understand yet why the petsc ex2.c example is able to
 > >  >>>> parse the "-cuda_show_devices" argument but ex2f.F does not?
 > >  >>>
 > >  >>> Matt put the code in the wrong place in PETSc, that is all, no big
 > >  >>> existentialist reason. I will fix that.
 > >  >>
 > >  >> Thanks.  I'll look forward to testing out the new version.
 > >  >>
 > >  >> Dave
 > >  >>
 > >  >>> Barry
 > >  >>>
 > >  >>>>
 > >  >>>> Thanks,
 > >  >>>>
 > >  >>>> Dave
 > >  >>>>
 > >  >>>> Barry Smith writes:
 > >  >>>>> It is not doing the MatMult operation on the GPU and hence needs
 > >  >>>>> to move the vectors back and forth for each operation (since
 > >  >>>>> MatMult is done on the CPU with the vector while vector
 > >  >>>>> operations are done on the GPU), hence the terrible performance.
 > >  >>>>>
 > >  >>>>> Not sure why yet. It is copying the Mat down for me from C.
 > >  >>>>>
 > >  >>>>> Barry
 > >  >>>>>
 > >  >>>>> On Oct 2, 2011, at 4:18 PM, Dave Nystrom wrote:
 > >  >>>>>
 > >  >>>>>> In case it might be useful, I have attached two log files of
 > >  >>>>>> runs with the ex2f petsc example from
 > >  >>>>>> src/ksp/ksp/examples/tutorials.  One was run back in April with
 > >  >>>>>> petsc-dev linked to Cuda 3.2.  It shows excellent runtime
 > >  >>>>>> performance.  The other was run today with petsc-dev checked out
 > >  >>>>>> of the mercurial repo yesterday morning and linked to Cuda 4.0.
 > >  >>>>>> In addition to the differences in run time performance, I also
 > >  >>>>>> do not see an entry for MatCUSPCopyTo in the profiling section.
 > >  >>>>>> I'm not sure what the significance of that is.  I do observe
 > >  >>>>>> that the run time for PCApply is about the same for the two
 > >  >>>>>> cases.  I think I would expect that to be the case even if the
 > >  >>>>>> problem were partitioned across two gpus.  However, it does make
 > >  >>>>>> me wonder if the absence of MatCUSPCopyTo in the profiling
 > >  >>>>>> section of the Cuda 4.0 log file is an indication that the
 > >  >>>>>> matrix was not actually copied to the gpu.  I'm not sure yet how
 > >  >>>>>> to check for that.  Hope this might be useful.
 > >  >>>>>>
 > >  >>>>>> Thanks,
 > >  >>>>>>
 > >  >>>>>> Dave
 > >  >>>>>>
 > >  >>>>>>
 > >  >>>>>>
 > >  >>>>>> <ex2f_3200_3200_cuda_yes_cuda_3.2.log>
 > >  >>>>>> <ex2f_3200_3200_cuda_yes_cuda_4.0.log>
 > >  >>>>>> Dave Nystrom writes:
 > >  >>>>>>> Matthew Knepley writes:
 > >  >>>>>>>> On Sat, Oct 1, 2011 at 11:26 PM, Dave Nystrom
 > >  >>>>>>>> <Dave.Nystrom at tachyonlogic.com> wrote:
 > >  >>>>>>>>> Barry Smith writes:
 > >  >>>>>>>>>> On Oct 1, 2011, at 9:22 PM, Dave Nystrom wrote:
 > >  >>>>>>>>>>> Hi Barry,
 > >  >>>>>>>>>>>
 > >  >>>>>>>>>>> I've sent a couple more emails on this topic.  What I am
 > >  >>>>>>>>>>> trying to do at the moment is to figure out how to have a
 > >  >>>>>>>>>>> problem run on only one gpu if it will fit in the memory of
 > >  >>>>>>>>>>> that gpu.  Back in April when I had built petsc-dev with
 > >  >>>>>>>>>>> Cuda 3.2, petsc would only use one gpu if you had multiple
 > >  >>>>>>>>>>> gpus on your machine.  In order to use multiple gpus for a
 > >  >>>>>>>>>>> problem, one had to use multiple threads with a separate
 > >  >>>>>>>>>>> thread assigned to control each gpu.  But Cuda 4.0 has, I
 > >  >>>>>>>>>>> believe, made that transparent and under the hood.  So now
 > >  >>>>>>>>>>> when I run a small example problem such as
 > >  >>>>>>>>>>> src/ksp/ksp/examples/tutorials/ex2f.F with an 800x800
 > >  >>>>>>>>>>> problem, it gets partitioned to run on both of the gpus in
 > >  >>>>>>>>>>> my machine.  The result is a very large performance hit
 > >  >>>>>>>>>>> because of communication back and forth from one gpu to the
 > >  >>>>>>>>>>> other via the cpu.
 > >  >>>>>>>>>>
 > >  >>>>>>>>>> How do you know there is lots of communication from the GPU
 > >  >>>>>>>>>> to the CPU?  In the -log_summary?  Nope, because PETSc does
 > >  >>>>>>>>>> not manage anything like that (that is, one CPU process
 > >  >>>>>>>>>> using both GPUs).
 > >  >>>>>>>>>
 > >  >>>>>>>>> What I believe is that it is being managed by Cuda 4.0, not
 > >  >>>>>>>>> by petsc.
 > >  >>>>>>>>>
 > >  >>>>>>>>>>> So this problem with a 3200x3200 grid runs 5x slower now
 > >  >>>>>>>>>>> than it did with Cuda 3.2.  I believe that if one is
 > >  >>>>>>>>>>> programming down at the cuda level, it is possible to have
 > >  >>>>>>>>>>> a smaller problem run on only one gpu, so that there is
 > >  >>>>>>>>>>> communication only between the cpu and gpu and only at the
 > >  >>>>>>>>>>> start and end of the calculation.
 > >  >>>>>>>>>>>
 > >  >>>>>>>>>>> To me, it seems like what is needed is a petsc option to
 > >  >>>>>>>>>>> specify the number of gpus to run on that can somehow get
 > >  >>>>>>>>>>> passed down to the cuda level through cusp and thrust.  I
 > >  >>>>>>>>>>> fear that the short term solution is going to have to be
 > >  >>>>>>>>>>> for me to pull one of the gpus out of my desktop system,
 > >  >>>>>>>>>>> but it would be nice if there were a way to tell petsc and
 > >  >>>>>>>>>>> friends to just use one gpu when I want it to.
 > >  >>>>>>>>>>>
 > >  >>>>>>>>>>> If necessary, I can send a couple of log files to
 > >  >>>>>>>>>>> demonstrate what I am trying to describe regarding the
 > >  >>>>>>>>>>> performance hit.
 > >  >>>>>>>>>>
 > >  >>>>>>>>>> I am not convinced that the poor performance you are
 > >  >>>>>>>>>> getting now has anything to do with using both GPUs.
 > >  >>>>>>>>>> Please run a PETSc program with the option
 > >  >>>>>>>>>> -cuda_show_devices
 > >  >>>>>>>>>
 > >  >>>>>>>>> I ran the following command:
 > >  >>>>>>>>>
 > >  >>>>>>>>>   ex2f -m 8 -n 8 -ksp_type cg -pc_type jacobi -log_summary \
 > >  >>>>>>>>>        -cuda_show_devices -mat_type aijcusp -vec_type cusp \
 > >  >>>>>>>>>        -options_left
 > >  >>>>>>>>>
 > >  >>>>>>>>> The result was a report that there was one option left, that
 > >  >>>>>>>>> being -cuda_show_devices.  I am using a copy of petsc-dev
 > >  >>>>>>>>> that I cloned and built this morning.
 > >  >>>>>>>>
 > >  >>>>>>>> What do you have at src/sys/object/pinit.c:825?  You should
 > >  >>>>>>>> see the code that processes this option.  You should be able
 > >  >>>>>>>> to break there in the debugger and see what happens.  This
 > >  >>>>>>>> sounds again like you are not processing options correctly.
 > >  >>>>>>>
 > >  >>>>>>> Hi Matt,
 > >  >>>>>>>
 > >  >>>>>>> I'll take a look at that in a bit and see if I can figure out
 > >  >>>>>>> what is going on.  I do see the code you mention that processes
 > >  >>>>>>> the arguments Barry mentioned.  As for processing options
 > >  >>>>>>> correctly, at least in this case I am actually running one of
 > >  >>>>>>> the petsc examples rather than my own code, and it seems to
 > >  >>>>>>> correctly process the other command line arguments.  Anyway,
 > >  >>>>>>> I'll write more after I have had a chance to investigate more.
 > >  >>>>>>>
 > >  >>>>>>> Thanks,
 > >  >>>>>>>
 > >  >>>>>>> Dave
 > >  >>>>>>>
 > >  >>>>>>>> Matt
 > >  >>>>>>>>
 > >  >>>>>>>>>> What are the choices?  You can then pick one of them and
 > >  >>>>>>>>>> run with -cuda_set_device integer
 > >  >>>>>>>>>
 > >  >>>>>>>>> The -cuda_set_device option does not appear to be recognized
 > >  >>>>>>>>> either, even if I choose an integer like 0.
 > >  >>>>>>>>>
 > >  >>>>>>>>>> Does this change things?
 > >  >>>>>>>>>
 > >  >>>>>>>>> I suspect it would change things if I could get it to work.
 > >  >>>>>>>>>
 > >  >>>>>>>>> Thanks,
 > >  >>>>>>>>>
 > >  >>>>>>>>> Dave
 > >  >>>>>>>>>
 > >  >>>>>>>>>> Barry
 > >  >>>>>>>>>>
 > >  >>>>>>>>>>>
 > >  >>>>>>>>>>> Thanks,
 > >  >>>>>>>>>>>
 > >  >>>>>>>>>>> Dave
 > >  >>>>>>>>>>>
 > >  >>>>>>>>>>> Barry Smith writes:
 > >  >>>>>>>>>>>> Dave,
 > >  >>>>>>>>>>>>
 > >  >>>>>>>>>>>> We have no mechanism in the PETSc code for a single PETSc
 > >  >>>>>>>>>>>> CPU process to use two GPUs at the same time.  However,
 > >  >>>>>>>>>>>> you could have two MPI processes, each using its own GPU.
 > >  >>>>>>>>>>>>
 > >  >>>>>>>>>>>> The one tricky part is that you need to make sure each
 > >  >>>>>>>>>>>> MPI process uses a different GPU.  We currently do not
 > >  >>>>>>>>>>>> have a mechanism to do this assignment automatically.  I
 > >  >>>>>>>>>>>> think it can be done with cudaSetDevice(), but I don't
 > >  >>>>>>>>>>>> know the details; sending this to petsc-dev at mcs.anl.gov
 > >  >>>>>>>>>>>> where more people may know.
 > >  >>>>>>>>>>>>
 > >  >>>>>>>>>>>> PETSc-folks,
 > >  >>>>>>>>>>>>
 > >  >>>>>>>>>>>> We need a way to have this setup automatically.
 > >  >>>>>>>>>>>>
 > >  >>>>>>>>>>>> Barry
 > >  >>>>>>>>>>>>
 > >  >>>>>>>>>>>> On Oct 1, 2011, at 5:43 PM, Dave Nystrom wrote:
 > >  >>>>>>>>>>>>
 > >  >>>>>>>>>>>>> I'm running petsc on a machine with Cuda 4.0 and 2 gpus.
 > >  >>>>>>>>>>>>> This is a desktop machine with a single processor.  I
 > >  >>>>>>>>>>>>> know that Cuda 4.0 has support for running on multiple
 > >  >>>>>>>>>>>>> gpus but don't know if petsc uses that.  But suppose I
 > >  >>>>>>>>>>>>> have a problem that will fit in the memory of a single
 > >  >>>>>>>>>>>>> gpu.  Will petsc run the problem on a single gpu, or does
 > >  >>>>>>>>>>>>> it split it between the 2 gpus and incur the
 > >  >>>>>>>>>>>>> communication overhead of copying data between the two
 > >  >>>>>>>>>>>>> gpus?
 > >  >>>>>>>>>>>>>
 > >  >>>>>>>>>>>>> Thanks,
 > >  >>>>>>>>>>>>>
 > >  >>>>>>>>>>>>> Dave
 > >  >>>>>>>>
 > >  >>>>>>>> --
 > >  >>>>>>>> What most experimenters take for granted before they begin
 > >  >>>>>>>> their experiments is infinitely more interesting than any
 > >  >>>>>>>> results to which their experiments lead.
 > >  >>>>>>>> -- Norbert Wiener
 > >  >>>>>
 > >  >>>
 > >  >
 > >
 > >
 > 
 > 
 > -- 
 > What most experimenters take for granted before they begin their experiments
 > is infinitely more interesting than any results to which their experiments
 > lead.
 > -- Norbert Wiener


