[petsc-dev] [petsc-maint #88993] Petsc with Cuda 4.0 and Multiple GPUs
Dave Nystrom
Dave.Nystrom at tachyonlogic.com
Sun Oct 2 21:08:11 CDT 2011
Barry Smith writes:
> On Oct 2, 2011, at 6:39 PM, Dave Nystrom wrote:
>> Thanks for the update. I don't believe I have gotten a run with good
>> performance yet, either from C or Fortran. I wish there was an easy way for
>> me to force use of only one of my gpus. I don't want to have to pull one of
>> the gpus in order to see if that is complicating things with Cuda 4.0. I'll
>> be eager to hear if you make any progress on figuring things out.
>> Do you understand yet why the petsc ex2.c example is able to parse the
>> "-cuda_show_devices" argument but ex2f.F does not?
> Matt put the code in the wrong place in PETSc, that is all, no big
> existentialist reason. I will fix that.
Thanks. I'll look forward to testing out the new version.
> Barry
>> Thanks,
>> Dave
>> Barry Smith writes:
>>> It is not doing the MatMult operation on the GPU, so it has to move the
>>> vectors back and forth for each operation (MatMult is done on the CPU
>>> with the vector, while vector operations are done on the GPU); hence
>>> the terrible performance.
>>> Not sure why yet. It is copying the Mat down for me from C.
>>> Barry
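The effect Barry describes can be illustrated with a toy count of vector copies. This is only a sketch under simplifying assumptions, not PETSc's actual bookkeeping: the function name `count_transfers` is hypothetical, and the model assumes exactly one MatMult per iteration with one copy in each direction when MatMult runs on the CPU.

```python
def count_transfers(iterations, matmult_on_gpu):
    """Count host<->device vector copies in a toy Krylov-style loop.

    Assumed model (not taken from PETSc): vector operations always run
    on the GPU. If MatMult runs on the CPU, its input vector is copied
    GPU -> CPU before the product and the result is copied CPU -> GPU
    afterwards.
    """
    transfers = 0
    for _ in range(iterations):
        if not matmult_on_gpu:
            transfers += 1  # input vector: GPU -> CPU
            transfers += 1  # result vector: CPU -> GPU
        # dot products, axpy, etc. stay on the GPU: no copies counted
    return transfers


if __name__ == "__main__":
    print("MatMult on GPU:", count_transfers(100, True))
    print("MatMult on CPU:", count_transfers(100, False))
```

In this model the copy count grows linearly with the iteration count when MatMult stays on the CPU, which is consistent with the slowdown reported in the thread but does not by itself prove the cause.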
>>> On Oct 2, 2011, at 4:18 PM, Dave Nystrom wrote:
>>>> In case it might be useful, I have attached two log files of runs with the
>>>> ex2f petsc example from src/ksp/ksp/examples/tutorials. One was run back in
>>>> April with petsc-dev linked to Cuda 3.2. It shows excellent runtime
>>>> performance. The other was run today with petsc-dev checked out of the
>>>> mercurial repo yesterday morning and linked to Cuda 4.0. In addition to the
>>>> differences in run time performance, I also do not see an entry for
>>>> MatCUSPCopyTo in the profiling section. I'm not sure what the significance
>>>> of that is. I do observe that the run time for PCApply is about the same for
>>>> the two cases. I think I would expect that to be the case even if the
>>>> problem were partitioned across two gpus. However, it does make me wonder if
>>>> the absence of MatCUSPCopyTo in the profiling section of the Cuda 4.0 log
>>>> file is an indication that the matrix was not actually copied to the gpu.
>>>> I'm not sure yet how to check for that. Hope this might be useful.
>>>> Thanks,
>>>> Dave
>>>> <ex2f_3200_3200_cuda_yes_cuda_3.2.log><ex2f_3200_3200_cuda_yes_cuda_4.0.log>
>>>> Dave Nystrom writes:
>>>>> Matthew Knepley writes:
>>>>>> On Sat, Oct 1, 2011 at 11:26 PM, Dave Nystrom <Dave.Nystrom at tachyonlogic.com> wrote:
>>>>>>> Barry Smith writes:
>>>>>>>> On Oct 1, 2011, at 9:22 PM, Dave Nystrom wrote:
>>>>>>>>> Hi Barry,
>>>>>>>>> I've sent a couple more emails on this topic. What I am trying to do at the
>>>>>>>>> moment is to figure out how to have a problem run on only one gpu if it will
>>>>>>>>> fit in the memory of that gpu. Back in April when I had built petsc-dev with
>>>>>>>>> Cuda 3.2, petsc would only use one gpu if you had multiple gpus on your
>>>>>>>>> machine. In order to use multiple gpus for a problem, one had to use
>>>>>>>>> multiple threads with a separate thread assigned to control each gpu. But
>>>>>>>>> Cuda 4.0 has, I believe, made that transparent and under the hood. So now
>>>>>>>>> when I run a small example problem such as
>>>>>>>>> src/ksp/ksp/examples/tutorials/ex2f.F with an 800x800 problem, it gets
>>>>>>>>> partitioned to run on both of the gpus in my machine. The result is a very
>>>>>>>>> large performance hit because of communication back and forth from one gpu to
>>>>>>>>> the other via the cpu.
>>>>>>>> How do you know there is lots of communication from the GPU to the CPU? From
>>>>>>>> the -log_summary? Nope, because PETSc does not manage anything like that
>>>>>>>> (that is, one CPU process using both GPUs).
>>>>>>> What I believe is that it is being managed by Cuda 4.0, not by petsc.
>>>>>>>>> So this problem with a 3200x3200 grid runs 5x slower
>>>>>>>>> now than it did with Cuda 3.2. I believe if one is programming down at the
>>>>>>>>> cuda level, it is possible to have a smaller problem run on only one gpu so
>>>>>>>>> that there is communication only between the cpu and gpu and only at the
>>>>>>>>> start and end of the calculation.
>>>>>>>>> To me, it seems like what is needed is a petsc option to specify the number
>>>>>>>>> of gpus to run on that can somehow get passed down to the cuda level through
>>>>>>>>> cusp and thrust. I fear that the short term solution is going to have to be
>>>>>>>>> for me to pull one of the gpus out of my desktop system but it would be nice
>>>>>>>>> if there was a way to tell petsc and friends to just use one gpu when I want
>>>>>>>>> it to.
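One way to restrict a process to a single GPU without touching PETSc or the hardware is the CUDA runtime's `CUDA_VISIBLE_DEVICES` environment variable. The sketch below only builds the environment for a child process; the helper name `single_gpu_env` is hypothetical, and it is an assumption that the installed CUDA runtime honors the variable. The thread does not establish that limiting the visible devices would fix the slowdown discussed here.

```python
import os


def single_gpu_env(device_index, base_env=None):
    """Return a copy of the environment exposing only one CUDA device.

    device_index is the index of the GPU to keep visible. base_env
    defaults to the current process environment.
    """
    env = dict(os.environ if base_env is None else base_env)
    # The CUDA runtime enumerates only the devices listed here.
    env["CUDA_VISIBLE_DEVICES"] = str(device_index)
    return env


if __name__ == "__main__":
    env = single_gpu_env(0)
    print("CUDA_VISIBLE_DEVICES =", env["CUDA_VISIBLE_DEVICES"])
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when launching an example such as ex2f, so the restriction applies only to that run.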
>>>>>>>>> If necessary, I can send a couple of log files to demonstrate what I am
>>>>>>>>> trying to describe regarding the performance hit.
>>>>>>>> I am not convinced that the poor performance you are getting now has
>>>>>>>> anything to do with using both GPUs. Please run a PETSc program with the
>>>>>>>> command -cuda_show_devices
>>>>>>> I ran the following command:
>>>>>>> ex2f -m 8 -n 8 -ksp_type cg -pc_type jacobi -log_summary -cuda_show_devices
>>>>>>> -mat_type aijcusp -vec_type cusp -options_left
>>>>>>> The result was a report that there was one option left, that being
>>>>>>> -cuda_show_devices. I am using a copy of petsc-dev that I cloned and built
>>>>>>> this morning.
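The "option left" report can be understood with a toy options database: an option counts as used only if some code asks for it, so an option that no code path queries is reported as left over. The class below is a hypothetical illustration of that idea, not PETSc's implementation, and its parsing rule (a token starting with "-" is an option name) is a simplification.

```python
class OptionsDB:
    """Toy options database that tracks which options were queried."""

    def __init__(self, argv):
        self._values = {}
        self._used = set()
        i = 0
        while i < len(argv):
            tok = argv[i]
            if tok.startswith("-"):
                # Simplification: the next token is this option's value
                # unless it also starts with "-" (so negative numbers
                # would be misread as option names).
                if i + 1 < len(argv) and not argv[i + 1].startswith("-"):
                    self._values[tok] = argv[i + 1]
                    i += 2
                else:
                    self._values[tok] = None
                    i += 1
            else:
                i += 1

    def has(self, name):
        """Query a flag; marks it as used if it was given."""
        if name in self._values:
            self._used.add(name)
            return True
        return False

    def get(self, name, default=None):
        """Query an option's value; marks it as used if it was given."""
        if name in self._values:
            self._used.add(name)
            return self._values[name]
        return default

    def left(self):
        """Options that were given but never queried."""
        return sorted(n for n in self._values if n not in self._used)


if __name__ == "__main__":
    db = OptionsDB(["-m", "8", "-n", "8", "-cuda_show_devices"])
    db.get("-m")
    db.get("-n")
    # Nothing queried -cuda_show_devices, so it is reported as left.
    print(db.left())
```

In this picture, an option shows up as left over when the code that would query it never runs for that program, whatever the reason.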
>>>>>> What do you have at src/sys/object/pinit.c:825? You should see the code
>>>>>> that processes this option. You should be able to break there in the
>>>>>> debugger and see what happens. This sounds again like you are not
>>>>>> processing options correctly.
>>>>> Hi Matt,
>>>>> I'll take a look at that in a bit and see if I can figure out what is going
>>>>> on. I do see the code that you mention that processes the arguments that
>>>>> Barry mentioned. In terms of processing options correctly, at least in this
>>>>> case I am actually running one of the petsc examples rather than my own
>>>>> code. And it seems to correctly process the other command line arguments.
>>>>> Anyway, I'll write more after I have had a chance to investigate more.
>>>>> Thanks,
>>>>> Dave
>>>>>> Matt
>>>>>>>> What are the choices? You can then pick one of them and run with
>>>>>>> -cuda_set_device integer
>>>>>>> The -cuda_set_device option does not appear to be recognized either, even
>>>>>>> if I choose an integer like 0.
>>>>>>>> Does this change things?
>>>>>>> I suspect it would change things if I could get it to work.
>>>>>>> Thanks,
>>>>>>> Dave
>>>>>>>> Barry
>>>>>>>>> Thanks,
>>>>>>>>> Dave
>>>>>>>>> Barry Smith writes:
>>>>>>>>>> Dave,
>>>>>>>>>> We have no mechanism in the PETSc code for a PETSc single CPU process to
>>>>>>>>>> use two GPUs at the same time. However you could have two MPI processes
>>>>>>>>>> each using their own GPU.
>>>>>>>>>> The one tricky part is you need to make sure each MPI process uses a
>>>>>>>>>> different GPU. We currently do not have a mechanism to do this assignment
>>>>>>>>>> automatically. I think it can be done with cudaSetDevice(), but I don't
>>>>>>>>>> know the details, so I am sending this to petsc-dev at mcs.anl.gov where more people
>>>>>>>>>> may know.
>>>>>>>>>> PETSc-folks,
>>>>>>>>>> We need a way to have this set up automatically.
>>>>>>>>>> Barry
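A common way to give each MPI process its own GPU is a round-robin mapping from the process's rank on its node to a device index, which would then be passed to cudaSetDevice(). The sketch below shows only the index computation. The function name `device_for_rank` and the notion of a node-local rank are assumptions for illustration, and this is not a mechanism PETSc is stated to provide in this thread.

```python
def device_for_rank(local_rank, device_count):
    """Pick a GPU index for a process by round-robin assignment.

    local_rank is the process's rank among the processes on the same
    node (assumed to be known to the caller); device_count is the
    number of GPUs on that node.
    """
    if device_count <= 0:
        raise ValueError("device_count must be positive")
    if local_rank < 0:
        raise ValueError("local_rank must be non-negative")
    return local_rank % device_count


if __name__ == "__main__":
    # Two GPUs, four processes on the node.
    for rank in range(4):
        print("rank", rank, "-> device", device_for_rank(rank, 2))
```

With two processes and two GPUs each process gets a distinct device; with more processes than GPUs, devices are shared.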
>>>>>>>>>> On Oct 1, 2011, at 5:43 PM, Dave Nystrom wrote:
>>>>>>>>>>> I'm running petsc on a machine with Cuda 4.0 and 2 gpus. This is a desktop
>>>>>>>>>>> machine with a single processor. I know that Cuda 4.0 has support for
>>>>>>>>>>> running on multiple gpus but don't know if petsc uses that. But suppose I
>>>>>>>>>>> have a problem that will fit in the memory for a single gpu. Will petsc run
>>>>>>>>>>> the problem on a single gpu or does it split it between the 2 gpus and incur
>>>>>>>>>>> the communication overhead of copying data between the two gpus?
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Dave
>>>>>> --
>>>>>> What most experimenters take for granted before they begin their experiments
>>>>>> is infinitely more interesting than any results to which their experiments
>>>>>> lead.
>>>>>> -- Norbert Wiener