[petsc-dev] [petsc-maint #88993] Petsc with Cuda 4.0 and Multiple GPUs
Barry Smith
bsmith at mcs.anl.gov
Mon Oct 3 19:52:54 CDT 2011
Dave,
I have found the cause of the problem you were seeing and have fixed it. It was caused by bad code when --download-txpetscgpu was used.
To eliminate the problem:
1) upgrade to the latest cusp and thrust via mercurial
2) rm -rf externalpackages/txpetscgpu*
3) hg pull; hg update
4) rerun ./configure (you can use the option --download-txpetscgpu)
Barry
On Oct 3, 2011, at 8:51 AM, Dave Nystrom wrote:
> Matthew Knepley writes:
>> On Sun, Oct 2, 2011 at 10:50 PM, Dave Nystrom <Dave.Nystrom at tachyonlogic.com> wrote:
>>
>>> Hi Barry,
>>>
>>> Barry Smith writes:
>>>> Dave,
>>>>
>>>> I cannot explain why it does not use the MatMult_SeqAIJCusp() - it does for me.
>>>
>>> Do you get good performance running a problem like ex2?
>>>
>>
>> Okay, now the problem is clear. This does not have to do with having 2
>> GPUs, rather you were not running MatMult on any GPU.
>>
>> This problem has to do with the 'da_mat_type aijcusp' option being passed
>> in. Somehow this is not being acted on. So, we need
>>
>> - The full input
>> - The full output of the test using -log_summary
>
> Hi Matt,
>
> I'm not sure what you are asking for. The two problems I have been running
> are two of the petsc examples i.e. src/ksp/ksp/examples/tutorials/ex2f.F and
> src/ksp/ksp/examples/tutorials/ex2.c. Are you not able to reproduce the
> problem? If you are convinced that this is not dependent on having 2 gpus,
> then it would seem that my platform is not unique. I did send a couple
> of log files in an earlier email that were run with -log_summary. One was
> run yesterday with my current petsc/cuda install.
>
> Anyway, I'm happy to perform any runs you want and pass you the data and
> results. But I'm not sure what you mean by "The full input" since I am
> running unmodified petsc examples and -log_summary captures my command line
> options. Just let me know what you need.
>
> I do have on my list from Barry to reinstall petsc with the mercurial
> versions of cusp and thrust and without the txpetscgpu package. That will
> probably have to happen this evening as I have to head off to work.
>
> Related to the above two petsc examples, ex2.c and ex2f.F, I have observed
> different behavior in the processing of command line options but I believe
> Barry is working on a fix for that.
>
> Thanks,
>
> Dave
>
>> Thanks,
>>
>> Matt
>>
>>
>>>> Have you updated to the latest cusp/thrust? From the mercurial
>>>> repositories?
>>>
>>> I did try the latest version of cusp from mercurial initially but the build
>>> failed. So I am currently using the latest cusp tarball. I did not try the
>>> latest version of thrust but instead was just using what came with the
>>> released version of Cuda 4.0. I could try the mercurial versions of both.
>>>
>>>> There is a difference: in your new 4.0 build you added
>>>> --download-txpetscgpu=yes. BTW, that doesn't work for me with the latest
>>>> cusp and thrust from the mercurial repositories. Can you try reconfiguring
>>>> and making without that?
>>>
>>> Yes, I can try that. Maybe that is why my original build with cusp from
>>> mercurial failed.
>>>
>>> Thanks for your help,
>>>
>>> Dave
>>>
>>>> Barry
>>>>
>>>> On Oct 2, 2011, at 9:08 PM, Dave Nystrom wrote:
>>>>
>>>>> Barry Smith writes:
>>>>>> On Oct 2, 2011, at 6:39 PM, Dave Nystrom wrote:
>>>>>>
>>>>>>> Thanks for the update. I don't believe I have gotten a run with good
>>>>>>> performance yet, either from C or Fortran. I wish there were an easy
>>>>>>> way for me to force use of only one of my gpus. I don't want to have
>>>>>>> to pull one of the gpus in order to see if that is complicating things
>>>>>>> with Cuda 4.0. I'll be eager to hear if you make any progress on
>>>>>>> figuring things out.
>>>>>>>
>>>>>>> Do you understand yet why the petsc ex2.c example is able to parse the
>>>>>>> "-cuda_show_devices" argument but ex2f.F does not?
>>>>>>
>>>>>> Matt put the code in the wrong place in PETSc, that is all, no big
>>>>>> existentialist reason. I will fix that.
>>>>>
>>>>> Thanks. I'll look forward to testing out the new version.
>>>>>
>>>>> Dave
>>>>>
>>>>>> Barry
>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Dave
>>>>>>>
>>>>>>> Barry Smith writes:
>>>>>>>> It is not doing the MatMult operation on the GPU and hence needs to
>>>>>>>> move the vectors back and forth for each operation (since MatMult is
>>>>>>>> done on the CPU with the vector while vector operations are done on
>>>>>>>> the GPU), hence the terrible performance.
>>>>>>>>
>>>>>>>> Not sure why yet. It is copying the Mat down for me from C.
>>>>>>>>
>>>>>>>> Barry
>>>>>>>>
>>>>>>>> On Oct 2, 2011, at 4:18 PM, Dave Nystrom wrote:
>>>>>>>>
>>>>>>>>> In case it might be useful, I have attached two log files of runs
>>>>>>>>> with the ex2f petsc example from src/ksp/ksp/examples/tutorials. One
>>>>>>>>> was run back in April with petsc-dev linked to Cuda 3.2. It shows
>>>>>>>>> excellent runtime performance. The other was run today with petsc-dev
>>>>>>>>> checked out of the mercurial repo yesterday morning and linked to
>>>>>>>>> Cuda 4.0. In addition to the differences in run time performance, I
>>>>>>>>> also do not see an entry for MatCUSPCopyTo in the profiling section.
>>>>>>>>> I'm not sure what the significance of that is. I do observe that the
>>>>>>>>> run time for PCApply is about the same for the two cases. I think I
>>>>>>>>> would expect that to be the case even if the problem were partitioned
>>>>>>>>> across two gpus. However, it does make me wonder if the absence of
>>>>>>>>> MatCUSPCopyTo in the profiling section of the Cuda 4.0 log file is an
>>>>>>>>> indication that the matrix was not actually copied to the gpu. I'm
>>>>>>>>> not sure yet how to check for that. Hope this might be useful.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Dave
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> <ex2f_3200_3200_cuda_yes_cuda_3.2.log><ex2f_3200_3200_cuda_yes_cuda_4.0.log>
>>>>>>>>> Dave Nystrom writes:
>>>>>>>>>> Matthew Knepley writes:
>>>>>>>>>>> On Sat, Oct 1, 2011 at 11:26 PM, Dave Nystrom
>>>>>>>>>>> <Dave.Nystrom at tachyonlogic.com> wrote:
>>>>>>>>>>>> Barry Smith writes:
>>>>>>>>>>>>> On Oct 1, 2011, at 9:22 PM, Dave Nystrom wrote:
>>>>>>>>>>>>>> Hi Barry,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I've sent a couple more emails on this topic. What I am trying
>>>>>>>>>>>>>> to do at the moment is to figure out how to have a problem run
>>>>>>>>>>>>>> on only one gpu if it will fit in the memory of that gpu. Back
>>>>>>>>>>>>>> in April when I had built petsc-dev with Cuda 3.2, petsc would
>>>>>>>>>>>>>> only use one gpu if you had multiple gpus on your machine. In
>>>>>>>>>>>>>> order to use multiple gpus for a problem, one had to use
>>>>>>>>>>>>>> multiple threads with a separate thread assigned to control each
>>>>>>>>>>>>>> gpu. But Cuda 4.0 has, I believe, made that transparent and
>>>>>>>>>>>>>> under the hood. So now when I run a small example problem such
>>>>>>>>>>>>>> as src/ksp/ksp/examples/tutorials/ex2f.F with an 800x800
>>>>>>>>>>>>>> problem, it gets partitioned to run on both of the gpus in my
>>>>>>>>>>>>>> machine. The result is a very large performance hit because of
>>>>>>>>>>>>>> communication back and forth from one gpu to the other via the
>>>>>>>>>>>>>> cpu.
>>>>>>>>>>>>>
>>>>>>>>>>>>> How do you know there is lots of communication from the GPU to
>>>>>>>>>>>>> the CPU? In the -log_summary? Nope, because PETSc does not manage
>>>>>>>>>>>>> anything like that (that is, one CPU process using both GPUs).
>>>>>>>>>>>>
>>>>>>>>>>>> What I believe is that it is being managed by Cuda 4.0, not by
>>>>>>>>>>>> petsc.
>>>>>>>>>>>>
>>>>>>>>>>>>>> So this problem with a 3200x3200 grid runs 5x slower now than
>>>>>>>>>>>>>> it did with Cuda 3.2. I believe if one is programming down at
>>>>>>>>>>>>>> the cuda level, it is possible to have a smaller problem run on
>>>>>>>>>>>>>> only one gpu, so that there is communication only between the
>>>>>>>>>>>>>> cpu and gpu and only at the start and end of the calculation.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> To me, it seems like what is needed is a petsc option to
>>>>>>>>>>>>>> specify the number of gpus to run on that can somehow get passed
>>>>>>>>>>>>>> down to the cuda level through cusp and thrust. I fear that the
>>>>>>>>>>>>>> short term solution is going to have to be for me to pull one of
>>>>>>>>>>>>>> the gpus out of my desktop system, but it would be nice if there
>>>>>>>>>>>>>> were a way to tell petsc and friends to just use one gpu when I
>>>>>>>>>>>>>> want it to.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If necessary, I can send a couple of log files to demonstrate
>>>>>>>>>>>>>> what I am trying to describe regarding the performance hit.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I am not convinced that the poor performance you are getting now
>>>>>>>>>>>>> has anything to do with using both GPUs. Please run a PETSc
>>>>>>>>>>>>> program with the option -cuda_show_devices
>>>>>>>>>>>>
>>>>>>>>>>>> I ran the following command:
>>>>>>>>>>>>
>>>>>>>>>>>> ex2f -m 8 -n 8 -ksp_type cg -pc_type jacobi -log_summary
>>>>>>>>>>>> -cuda_show_devices -mat_type aijcusp -vec_type cusp -options_left
>>>>>>>>>>>>
>>>>>>>>>>>> The result was a report that there was one option left, that
>>>>>>>>>>>> being -cuda_show_devices. I am using a copy of petsc-dev that I
>>>>>>>>>>>> cloned and built this morning.
>>>>>>>>>>>
>>>>>>>>>>> What do you have at src/sys/objects/pinit.c:825? You should see
>>>>>>>>>>> the code that processes this option. You should be able to break
>>>>>>>>>>> there in the debugger and see what happens. This sounds again like
>>>>>>>>>>> you are not processing options correctly.
>>>>>>>>>>
>>>>>>>>>> Hi Matt,
>>>>>>>>>>
>>>>>>>>>> I'll take a look at that in a bit and see if I can figure out what
>>>>>>>>>> is going on. I do see the code you mention that processes the
>>>>>>>>>> arguments Barry referred to. In terms of processing options
>>>>>>>>>> correctly, at least in this case I am actually running one of the
>>>>>>>>>> petsc examples rather than my own code. And it seems to correctly
>>>>>>>>>> process the other command line arguments. Anyway, I'll write more
>>>>>>>>>> after I have had a chance to investigate more.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> Dave
>>>>>>>>>>
>>>>>>>>>>> Matt
>>>>>>>>>>>
>>>>>>>>>>>>> What are the choices? You can then pick one of them and run with
>>>>>>>>>>>>> -cuda_set_device integer
>>>>>>>>>>>>
>>>>>>>>>>>> The -cuda_set_device option does not appear to be recognized
>>>>>>>>>>>> either, even if I choose an integer like 0.
>>>>>>>>>>>>
>>>>>>>>>>>>> Does this change things?
>>>>>>>>>>>>
>>>>>>>>>>>> I suspect it would change things if I could get it to work.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>
>>>>>>>>>>>> Dave
>>>>>>>>>>>>
>>>>>>>>>>>>> Barry
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Dave
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Barry Smith writes:
>>>>>>>>>>>>>>> Dave,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We have no mechanism in the PETSc code for a single PETSc CPU
>>>>>>>>>>>>>>> process to use two GPUs at the same time. However, you could
>>>>>>>>>>>>>>> have two MPI processes, each using its own GPU.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The one tricky part is you need to make sure each MPI process
>>>>>>>>>>>>>>> uses a different GPU. We currently do not have a mechanism to
>>>>>>>>>>>>>>> do this assignment automatically. I think it can be done with
>>>>>>>>>>>>>>> cudaSetDevice(). But I don't know the details; sending this to
>>>>>>>>>>>>>>> petsc-dev at mcs.anl.gov where more people may know.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> PETSc-folks,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We need a way to have this setup automatically.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Barry
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Oct 1, 2011, at 5:43 PM, Dave Nystrom wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'm running petsc on a machine with Cuda 4.0 and 2 gpus.
>>>>>>>>>>>>>>>> This is a desktop machine with a single processor. I know
>>>>>>>>>>>>>>>> that Cuda 4.0 has support for running on multiple gpus but
>>>>>>>>>>>>>>>> don't know if petsc uses that. But suppose I have a problem
>>>>>>>>>>>>>>>> that will fit in the memory of a single gpu. Will petsc run
>>>>>>>>>>>>>>>> the problem on a single gpu, or does it split it between the
>>>>>>>>>>>>>>>> 2 gpus and incur the communication overhead of copying data
>>>>>>>>>>>>>>>> between the two gpus?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Dave
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> What most experimenters take for granted before they begin their
>>>>>>>>>>> experiments is infinitely more interesting than any results to
>>>>>>>>>>> which their experiments lead.
>>>>>>>>>>> -- Norbert Wiener
>>>>>>>>
>>>>>>
>>>>
>>>
>>>
>>
>>
>> --
>> What most experimenters take for granted before they begin their experiments
>> is infinitely more interesting than any results to which their experiments
>> lead.
>> -- Norbert Wiener