[petsc-dev] [petsc-maint #88993] Petsc with Cuda 4.0 and Multiple GPUs
Matthew Knepley
petsc-maint at mcs.anl.gov
Mon Oct 3 08:24:01 CDT 2011
On Sun, Oct 2, 2011 at 10:50 PM, Dave Nystrom <Dave.Nystrom at tachyonlogic.com> wrote:
> Hi Barry,
>
> Barry Smith writes:
> > Dave,
> >
> > I cannot explain why it does not use the MatMult_SeqAIJCusp() - it does
> > for me.
>
> Do you get good performance running a problem like ex2?
>
Okay, now the problem is clear. This does not have to do with having 2 GPUs;
rather, you were not running MatMult on any GPU.

This problem has to do with the 'da_mat_type aijcusp' option being passed in;
somehow it is not being acted on. So, we need:

- The full input
- The full output of the test using -log_summary

Thanks,

Matt
> > Have you updated to the latest cusp/thrust? From the mercurial
> > repositories?
>
> I did try the latest version of cusp from mercurial initially but the build
> failed. So I am currently using the latest cusp tarball. I did not try the
> latest version of thrust but instead was just using what came with the
> released version of Cuda 4.0. I could try the mercurial versions of both.
>
> > There is a difference: in your new 4.0 build you added
> > --download-txpetscgpu=yes. BTW, that doesn't work for me with the latest
> > cusp and thrust from the mercurial repositories. Can you try reconfiguring
> > and making without that?
>
> Yes, I can try that. Maybe that is why my original build with cusp from
> mercurial failed.
>
> Thanks for your help,
>
> Dave
>
> > Barry
> >
> > On Oct 2, 2011, at 9:08 PM, Dave Nystrom wrote:
> >
> >> Barry Smith writes:
> >>> On Oct 2, 2011, at 6:39 PM, Dave Nystrom wrote:
> >>>
> >>>> Thanks for the update. I don't believe I have gotten a run with good
> >>>> performance yet, either from C or Fortran. I wish there was an easy way
> >>>> for me to force use of only one of my gpus. I don't want to have to pull
> >>>> one of the gpus in order to see if that is complicating things with Cuda
> >>>> 4.0. I'll be eager to hear if you make any progress on figuring things
> >>>> out.
> >>>>
> >>>> Do you understand yet why the petsc ex2.c example is able to parse the
> >>>> "-cuda_show_devices" argument but ex2f.F does not?
> >>>
> >>> Matt put the code in the wrong place in PETSc, that is all, no big
> >>> existentialist reason. I will fix that.
> >>
> >> Thanks. I'll look forward to testing out the new version.
> >>
> >> Dave
> >>
> >>> Barry
> >>>
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Dave
> >>>>
> >>>> Barry Smith writes:
> >>>>> It is not doing the MatMult operation on the GPU and hence needs to
> >>>>> move the vectors back and forth for each operation (since MatMult is
> >>>>> done on the CPU with the vector while vector operations are done on the
> >>>>> GPU), hence the terrible performance.
> >>>>>
> >>>>> Not sure why yet. It is copying the Mat down for me from C.
> >>>>>
> >>>>> Barry
> >>>>>
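(A minimal sketch, for reference only, of how a program could request the
CUSP types in code rather than via -mat_type/-vec_type, so that MatMult runs
on the GPU. This is not code from the thread or from PETSc itself; the helper
name is invented, the type strings simply mirror the option values quoted
below, and the exact calls may differ between petsc-dev revisions.)

    /* Hedged sketch: ask for GPU (CUSP) storage before preallocation and
       assembly.  Type strings follow the -mat_type aijcusp / -vec_type cusp
       options used elsewhere in this thread. */
    #include <petscksp.h>

    PetscErrorCode UseCuspTypes(Mat A, Vec x, Vec b)
    {
      PetscErrorCode ierr;
      ierr = MatSetType(A, "aijcusp");CHKERRQ(ierr); /* MatMult_SeqAIJCusp path */
      ierr = VecSetType(x, "cusp");CHKERRQ(ierr);
      ierr = VecSetType(b, "cusp");CHKERRQ(ierr);
      return 0;
    }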
> >>>>> On Oct 2, 2011, at 4:18 PM, Dave Nystrom wrote:
> >>>>>
> >>>>>> In case it might be useful, I have attached two log files of runs with
> >>>>>> the ex2f petsc example from src/ksp/ksp/examples/tutorials. One was run
> >>>>>> back in April with petsc-dev linked to Cuda 3.2. It shows excellent
> >>>>>> runtime performance. The other was run today with petsc-dev checked out
> >>>>>> of the mercurial repo yesterday morning and linked to Cuda 4.0. In
> >>>>>> addition to the differences in run time performance, I also do not see
> >>>>>> an entry for MatCUSPCopyTo in the profiling section. I'm not sure what
> >>>>>> the significance of that is. I do observe that the run time for PCApply
> >>>>>> is about the same for the two cases. I think I would expect that to be
> >>>>>> the case even if the problem were partitioned across two gpus. However,
> >>>>>> it does make me wonder if the absence of MatCUSPCopyTo in the profiling
> >>>>>> section of the Cuda 4.0 log file is an indication that the matrix was
> >>>>>> not actually copied to the gpu. I'm not sure yet how to check for that.
> >>>>>> Hope this might be useful.
> >>>>>>
> >>>>>> Thanks,
> >>>>>>
> >>>>>> Dave
> >>>>>>
> >>>>>>
> >>>>>>
> <ex2f_3200_3200_cuda_yes_cuda_3.2.log><ex2f_3200_3200_cuda_yes_cuda_4.0.log>
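(On "I'm not sure yet how to check for that": one hedged way, sketched below
and not taken from the attached logs, is to query the run-time types of the
assembled Mat and Vec and print them; if the matrix type does not come back
as an aijcusp variant, the -mat_type option never took effect. The helper
name here is invented.)

    /* Hypothetical check: print the run-time Mat/Vec types so you can confirm
       that -mat_type aijcusp / -vec_type cusp actually took effect before
       blaming GPU<->CPU traffic for the slowdown. */
    #include <petscksp.h>

    PetscErrorCode PrintTypes(Mat A, Vec x)
    {
      PetscErrorCode ierr;
      MatType        mtype;
      VecType        vtype;
      ierr = MatGetType(A, &mtype);CHKERRQ(ierr);
      ierr = VecGetType(x, &vtype);CHKERRQ(ierr);
      ierr = PetscPrintf(PETSC_COMM_WORLD, "Mat type: %s  Vec type: %s\n",
                         mtype, vtype);CHKERRQ(ierr);
      return 0;
    }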
> >>>>>> Dave Nystrom writes:
> >>>>>>> Matthew Knepley writes:
> >>>>>>>> On Sat, Oct 1, 2011 at 11:26 PM, Dave Nystrom <Dave.Nystrom at tachyonlogic.com> wrote:
> >>>>>>>>> Barry Smith writes:
> >>>>>>>>>> On Oct 1, 2011, at 9:22 PM, Dave Nystrom wrote:
> >>>>>>>>>>> Hi Barry,
> >>>>>>>>>>>
> >>>>>>>>>>> I've sent a couple more emails on this topic. What I am trying to
> >>>>>>>>>>> do at the moment is to figure out how to have a problem run on only
> >>>>>>>>>>> one gpu if it will fit in the memory of that gpu. Back in April
> >>>>>>>>>>> when I had built petsc-dev with Cuda 3.2, petsc would only use one
> >>>>>>>>>>> gpu if you had multiple gpus on your machine. In order to use
> >>>>>>>>>>> multiple gpus for a problem, one had to use multiple threads with a
> >>>>>>>>>>> separate thread assigned to control each gpu. But Cuda 4.0 has, I
> >>>>>>>>>>> believe, made that transparent and under the hood. So now when I
> >>>>>>>>>>> run a small example problem such as
> >>>>>>>>>>> src/ksp/ksp/examples/tutorials/ex2f.F with an 800x800 problem, it
> >>>>>>>>>>> gets partitioned to run on both of the gpus in my machine. The
> >>>>>>>>>>> result is a very large performance hit because of communication
> >>>>>>>>>>> back and forth from one gpu to the other via the cpu.
> >>>>>>>>>>
> >>>>>>>>>> How do you know there is lots of communication from the GPU to the
> >>>>>>>>>> CPU? In the -log_summary? Nope, because PETSc does not manage
> >>>>>>>>>> anything like that (that is, one CPU process using both GPUs).
> >>>>>>>>>
> >>>>>>>>> What I believe is that it is being managed by Cuda 4.0, not by petsc.
> >>>>>>>>>
> >>>>>>>>>>> So this problem with a 3200x3200 grid runs 5x slower now than it
> >>>>>>>>>>> did with Cuda 3.2. I believe if one is programming down at the cuda
> >>>>>>>>>>> level, it is possible to have a smaller problem run on only one gpu
> >>>>>>>>>>> so that there is communication only between the cpu and gpu and
> >>>>>>>>>>> only at the start and end of the calculation.
> >>>>>>>>>>>
> >>>>>>>>>>> To me, it seems like what is needed is a petsc option to specify
> >>>>>>>>>>> the number of gpus to run on that can somehow get passed down to
> >>>>>>>>>>> the cuda level through cusp and thrust. I fear that the short term
> >>>>>>>>>>> solution is going to have to be for me to pull one of the gpus out
> >>>>>>>>>>> of my desktop system, but it would be nice if there was a way to
> >>>>>>>>>>> tell petsc and friends to just use one gpu when I want it to.
> >>>>>>>>>>>
> >>>>>>>>>>> If necessary, I can send a couple of log files to demonstrate what
> >>>>>>>>>>> I am trying to describe regarding the performance hit.
> >>>>>>>>>>
> >>>>>>>>>> I am not convinced that the poor performance you are getting now
> >>>>>>>>>> has anything to do with using both GPUs. Please run a PETSc program
> >>>>>>>>>> with the command -cuda_show_devices
> >>>>>>>>>
> >>>>>>>>> I ran the following command:
> >>>>>>>>>
> >>>>>>>>> ex2f -m 8 -n 8 -ksp_type cg -pc_type jacobi -log_summary
> >>>>>>>>> -cuda_show_devices -mat_type aijcusp -vec_type cusp -options_left
> >>>>>>>>>
> >>>>>>>>> The result was a report that there was one option left, that being
> >>>>>>>>> -cuda_show_devices. I am using a copy of petsc-dev that I cloned and
> >>>>>>>>> built this morning.
> >>>>>>>>
> >>>>>>>> What do you have at src/sys/object/pinit.c:825? You should see the
> >>>>>>>> code that processes this option. You should be able to break there in
> >>>>>>>> the debugger and see what happens. This sounds again like you are not
> >>>>>>>> processing options correctly.
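(For orientation, the general pattern by which options such as
-cuda_show_devices and -cuda_set_device would be picked up is sketched below.
It assumes the petsc-3.2-era options API and is not the actual code at
pinit.c:825; the function name is invented.)

    /* Sketch only: how CUDA-related options might be queried during startup. */
    #include <petscsys.h>
    #include <cuda_runtime.h>

    static PetscErrorCode HandleCudaOptions(void)
    {
      PetscErrorCode ierr;
      PetscBool      flg = PETSC_FALSE;
      PetscInt       dev = 0;

      ierr = PetscOptionsHasName(PETSC_NULL, "-cuda_show_devices", &flg);CHKERRQ(ierr);
      if (flg) {
        int n = 0;
        if (cudaGetDeviceCount(&n) == cudaSuccess) {
          ierr = PetscPrintf(PETSC_COMM_WORLD, "CUDA devices available: %d\n", n);CHKERRQ(ierr);
        }
      }
      ierr = PetscOptionsGetInt(PETSC_NULL, "-cuda_set_device", &dev, &flg);CHKERRQ(ierr);
      if (flg && cudaSetDevice((int)dev) != cudaSuccess) {
        ierr = PetscPrintf(PETSC_COMM_WORLD, "cudaSetDevice(%d) failed\n", (int)dev);CHKERRQ(ierr);
      }
      return 0;
    }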
> >>>>>>>
> >>>>>>> Hi Matt,
> >>>>>>>
> >>>>>>> I'll take a look at that in a bit and see if I can figure out what is
> >>>>>>> going on. I do see the code that you mention that processes the
> >>>>>>> arguments that Barry mentioned. In terms of processing options
> >>>>>>> correctly, at least in this case I am actually running one of the
> >>>>>>> petsc examples rather than my own code. And it seems to correctly
> >>>>>>> process the other command line arguments. Anyway, I'll write more
> >>>>>>> after I have had a chance to investigate more.
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>>
> >>>>>>> Dave
> >>>>>>>
> >>>>>>>> Matt
> >>>>>>>>
> >>>>>>>>>> What are the choices? You can then pick one of them and run with
> >>>>>>>>>> -cuda_set_device integer
> >>>>>>>>>
> >>>>>>>>> The -cuda_set_device option does not appear to be recognized either,
> >>>>>>>>> even if I choose an integer like 0.
> >>>>>>>>>
> >>>>>>>>>> Does this change things?
> >>>>>>>>>
> >>>>>>>>> I suspect it would change things if I could get it to work.
> >>>>>>>>>
> >>>>>>>>> Thanks,
> >>>>>>>>>
> >>>>>>>>> Dave
> >>>>>>>>>
> >>>>>>>>>> Barry
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks,
> >>>>>>>>>>>
> >>>>>>>>>>> Dave
> >>>>>>>>>>>
> >>>>>>>>>>> Barry Smith writes:
> >>>>>>>>>>>> Dave,
> >>>>>>>>>>>>
> >>>>>>>>>>>> We have no mechanism in the PETSc code for a PETSc single CPU
> >>>>>>>>>>>> process to use two GPUs at the same time. However, you could have
> >>>>>>>>>>>> two MPI processes, each using their own GPU.
> >>>>>>>>>>>>
> >>>>>>>>>>>> The one tricky part is you need to make sure each MPI process
> >>>>>>>>>>>> uses a different GPU. We currently do not have a mechanism to do
> >>>>>>>>>>>> this assignment automatically. I think it can be done with
> >>>>>>>>>>>> cudaSetDevice(), but I don't know the details; sending this to
> >>>>>>>>>>>> petsc-dev at mcs.anl.gov where more people may know.
> >>>>>>>>>>>>
> >>>>>>>>>>>> PETSc-folks,
> >>>>>>>>>>>>
> >>>>>>>>>>>> We need a way to have this setup automatically.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Barry
> >>>>>>>>>>>>
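(A minimal sketch of the per-process assignment Barry describes, assuming
plain MPI plus the CUDA runtime API; this is not existing PETSc code and the
function name is invented. Each rank picks one of the visible devices,
round-robin, before its first CUDA call.)

    /* Hedged sketch: one GPU per MPI process, chosen round-robin by rank.
       A real setup would use a per-node rank when running on several nodes. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    static int AssignGpuToRank(MPI_Comm comm)
    {
      int rank = 0, ndev = 0;
      MPI_Comm_rank(comm, &rank);
      if (cudaGetDeviceCount(&ndev) != cudaSuccess || ndev < 1) return -1;
      return (cudaSetDevice(rank % ndev) == cudaSuccess) ? 0 : -1;
    }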
> >>>>>>>>>>>> On Oct 1, 2011, at 5:43 PM, Dave Nystrom wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> I'm running petsc on a machine with Cuda 4.0 and 2 gpus. This is
> >>>>>>>>>>>>> a desktop machine with a single processor. I know that Cuda 4.0
> >>>>>>>>>>>>> has support for running on multiple gpus but don't know if petsc
> >>>>>>>>>>>>> uses that. But suppose I have a problem that will fit in the
> >>>>>>>>>>>>> memory for a single gpu. Will petsc run the problem on a single
> >>>>>>>>>>>>> gpu or does it split it between the 2 gpus and incur the
> >>>>>>>>>>>>> communication overhead of copying data between the two gpus?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Dave
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> What most experimenters take for granted before they begin their
> >>>>>>>> experiments is infinitely more interesting than any results to which
> >>>>>>>> their experiments lead.
> >>>>>>>> -- Norbert Wiener
> >>>>>
> >>>
> >
>
>
--
What most experimenters take for granted before they begin their experiments
is infinitely more interesting than any results to which their experiments
lead.
-- Norbert Wiener