[petsc-dev] [petsc-maint #88993] Petsc with Cuda 4.0 and Multiple GPUs
Barry Smith
bsmith at mcs.anl.gov
Mon Oct 3 19:52:54 CDT 2011
Dave,
I have found the cause of the problem you were seeing and have fixed it. It was caused by bad code when --download-txpetscgpu was used.
To eliminate the problem:
1) upgrade to the latest cusp and thrust via mercurial
2) rm -rf externalpackages/txpetscgpu*
3) hg pull; hg update
4) rerun ./configure (you can use the option --download-txpetscgpu)
Barry
On Oct 3, 2011, at 8:51 AM, Dave Nystrom wrote:
> Matthew Knepley writes:
>> On Sun, Oct 2, 2011 at 10:50 PM, Dave Nystrom <Dave.Nystrom at tachyonlogic.com> wrote:
>>
>>> Hi Barry,
>>>
>>> Barry Smith writes:
>>>> Dave,
>>>>
>>>> I cannot explain why it does not use the MatMult_SeqAIJCusp() - it does for me.
>>>
>>> Do you get good performance running a problem like ex2?
>>>
>>
>> Okay, now the problem is clear. This does not have to do with having 2
>> GPUs, rather you were not running MatMult on any GPU.
>>
>> This problem has to do with the 'da_mat_type aijcusp' option being passed
>> in. Somehow this is not being acted on. So, we need
>>
>> - The full input
>> - The full output of the test using -log_summary
>
> Hi Matt,
>
> I'm not sure what you are asking for. The two problems I have been running
> are two of the petsc examples i.e. src/ksp/ksp/examples/tutorials/ex2f.F and
> src/ksp/ksp/examples/tutorials/ex2.c. Are you not able to reproduce the
> problem? If you are convinced that this is not dependent on having 2 gpus,
> then it would seem that my platform is not unique. I did send a couple
> of log files in an earlier email that were run with -log_summary. One was
> run yesterday with my current petsc/cuda install.
>
> Anyway, I'm happy to perform any runs you want and pass you the data and
> results. But I'm not sure what you mean by "The full input" since I am
> running unmodified petsc examples and -log_summary captures my command line
> options. Just let me know what you need.
>
> I do have on my list from Barry to reinstall petsc with the mercurial
> versions of cusp and thrust and without the txpetscgpu package. That will
> probably have to happen this evening as I have to head off to work.
>
> Related to the above two petsc examples, ex2.c and ex2f.F, I have observed
> different behavior in the processing of command line options but I believe
> Barry is working on a fix for that.
>
> Thanks,
>
> Dave
>
>> Thanks,
>>
>> Matt
>>
>>
>>>> Have you updated to the latest cusp/thrust? From the mercurial
>>>> repositories?
>>>
>>> I did try the latest version of cusp from mercurial initially but the build
>>> failed. So I am currently using the latest cusp tarball. I did not try the
>>> latest version of thrust but instead was just using what came with the
>>> released version of Cuda 4.0. I could try the mercurial versions of both.
>>>
>>>> There is a difference: in your new 4.0 build you added
>>>> --download-txpetscgpu=yes. BTW, that doesn't work for me with the latest
>>>> cusp and thrust from the mercurial repositories. Can you try reconfiguring
>>>> and making without that?
>>>
>>> Yes, I can try that. Maybe that is why my original build with cusp from
>>> mercurial failed.
>>>
>>> Thanks for your help,
>>>
>>> Dave
>>>
>>>> Barry
>>>>
>>>> On Oct 2, 2011, at 9:08 PM, Dave Nystrom wrote:
>>>>
>>>>> Barry Smith writes:
>>>>>> On Oct 2, 2011, at 6:39 PM, Dave Nystrom wrote:
>>>>>>
>>>>>>> Thanks for the update. I don't believe I have gotten a run with good
>>>>>>> performance yet, either from C or Fortran. I wish there were an easy
>>>>>>> way for me to force use of only one of my gpus. I don't want to have
>>>>>>> to pull one of the gpus in order to see if that is complicating things
>>>>>>> with Cuda 4.0. I'll be eager to hear if you make any progress on
>>>>>>> figuring things out.
>>>>>>>
>>>>>>> Do you understand yet why the petsc ex2.c example is able to parse the
>>>>>>> "-cuda_show_devices" argument but ex2f.F does not?
>>>>>>
>>>>>> Matt put the code in the wrong place in PETSc, that is all, no big
>>>>>> existentialist reason. I will fix that.
>>>>>
>>>>> Thanks. I'll look forward to testing out the new version.
>>>>>
>>>>> Dave
>>>>>
>>>>>> Barry
>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Dave
>>>>>>>
>>>>>>> Barry Smith writes:
>>>>>>>> It is not doing the MatMult operation on the GPU and hence needs to
>>>>>>>> move the vectors back and forth for each operation (since MatMult is
>>>>>>>> done on the CPU with the vector while vector operations are done on
>>>>>>>> the GPU), hence the terrible performance.
>>>>>>>>
>>>>>>>> Not sure why yet. It is copying the Mat down for me from C.
>>>>>>>>
>>>>>>>> Barry
>>>>>>>>
>>>>>>>> On Oct 2, 2011, at 4:18 PM, Dave Nystrom wrote:
>>>>>>>>
>>>>>>>>> In case it might be useful, I have attached two log files of runs
>>>>>>>>> with the ex2f petsc example from src/ksp/ksp/examples/tutorials. One
>>>>>>>>> was run back in April with petsc-dev linked to Cuda 3.2. It shows
>>>>>>>>> excellent runtime performance. The other was run today with petsc-dev
>>>>>>>>> checked out of the mercurial repo yesterday morning and linked to
>>>>>>>>> Cuda 4.0. In addition to the differences in run time performance, I
>>>>>>>>> also do not see an entry for MatCUSPCopyTo in the profiling section.
>>>>>>>>> I'm not sure what the significance of that is. I do observe that the
>>>>>>>>> run time for PCApply is about the same for the two cases. I think I
>>>>>>>>> would expect that to be the case even if the problem were partitioned
>>>>>>>>> across two gpus. However, it does make me wonder if the absence of
>>>>>>>>> MatCUSPCopyTo in the profiling section of the Cuda 4.0 log file is an
>>>>>>>>> indication that the matrix was not actually copied to the gpu. I'm
>>>>>>>>> not sure yet how to check for that. Hope this might be useful.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Dave
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> <ex2f_3200_3200_cuda_yes_cuda_3.2.log><ex2f_3200_3200_cuda_yes_cuda_4.0.log>
>>>>>>>>> Dave Nystrom writes:
>>>>>>>>>> Matthew Knepley writes:
>>>>>>>>>>> On Sat, Oct 1, 2011 at 11:26 PM, Dave Nystrom
>>>>>>>>>>> <Dave.Nystrom at tachyonlogic.com> wrote:
>>>>>>>>>>>> Barry Smith writes:
>>>>>>>>>>>>> On Oct 1, 2011, at 9:22 PM, Dave Nystrom wrote:
>>>>>>>>>>>>>> Hi Barry,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I've sent a couple more emails on this topic. What I am trying
>>>>>>>>>>>>>> to do at the moment is to figure out how to have a problem run
>>>>>>>>>>>>>> on only one gpu if it will fit in the memory of that gpu. Back
>>>>>>>>>>>>>> in April when I had built petsc-dev with Cuda 3.2, petsc would
>>>>>>>>>>>>>> only use one gpu if you had multiple gpus on your machine. In
>>>>>>>>>>>>>> order to use multiple gpus for a problem, one had to use
>>>>>>>>>>>>>> multiple threads with a separate thread assigned to control each
>>>>>>>>>>>>>> gpu. But Cuda 4.0 has, I believe, made that transparent and
>>>>>>>>>>>>>> under the hood. So now when I run a small example problem such
>>>>>>>>>>>>>> as src/ksp/ksp/examples/tutorials/ex2f.F with an 800x800
>>>>>>>>>>>>>> problem, it gets partitioned to run on both of the gpus in my
>>>>>>>>>>>>>> machine. The result is a very large performance hit because of
>>>>>>>>>>>>>> communication back and forth from one gpu to the other via the
>>>>>>>>>>>>>> cpu.
>>>>>>>>>>>>>
>>>>>>>>>>>>> How do you know there is lots of communication from the GPU to
>>>>>>>>>>>>> the CPU? In the -log_summary? Nope, because PETSc does not manage
>>>>>>>>>>>>> anything like that (that is, one CPU process using both GPUs).
>>>>>>>>>>>>
>>>>>>>>>>>> What I believe is that it is being managed by Cuda 4.0, not by
>>>>>>>>>>>> petsc.
>>>>>>>>>>>>
>>>>>>>>>>>>>> So this problem with a 3200x3200 grid runs 5x slower now than
>>>>>>>>>>>>>> it did with Cuda 3.2. I believe if one is programming down at
>>>>>>>>>>>>>> the cuda level, it is possible to have a smaller problem run on
>>>>>>>>>>>>>> only one gpu, so that there is communication only between the
>>>>>>>>>>>>>> cpu and gpu and only at the start and end of the calculation.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> To me, it seems like what is needed is a petsc option to
>>>>>>>>>>>>>> specify the number of gpus to run on that can somehow get passed
>>>>>>>>>>>>>> down to the cuda level through cusp and thrust. I fear that the
>>>>>>>>>>>>>> short term solution is going to have to be for me to pull one of
>>>>>>>>>>>>>> the gpus out of my desktop system, but it would be nice if there
>>>>>>>>>>>>>> were a way to tell petsc and friends to just use one gpu when I
>>>>>>>>>>>>>> want it to.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If necessary, I can send a couple of log files to demonstrate
>>>>>>>>>>>>>> what I am trying to describe regarding the performance hit.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I am not convinced that the poor performance you are getting now
>>>>>>>>>>>>> has anything to do with using both GPUs. Please run a PETSc
>>>>>>>>>>>>> program with the option -cuda_show_devices
>>>>>>>>>>>>
>>>>>>>>>>>> I ran the following command:
>>>>>>>>>>>>
>>>>>>>>>>>> ex2f -m 8 -n 8 -ksp_type cg -pc_type jacobi -log_summary
>>>>>>>>>>>> -cuda_show_devices -mat_type aijcusp -vec_type cusp -options_left
>>>>>>>>>>>>
>>>>>>>>>>>> The result was a report that there was one option left, that
>>>>>>>>>>>> being -cuda_show_devices. I am using a copy of petsc-dev that I
>>>>>>>>>>>> cloned and built this morning.
>>>>>>>>>>>
>>>>>>>>>>> What do you have at src/sys/objects/pinit.c:825? You should see
>>>>>>>>>>> the code that processes this option. You should be able to break
>>>>>>>>>>> there in the debugger and see what happens. This sounds again like
>>>>>>>>>>> you are not processing options correctly.
>>>>>>>>>>
>>>>>>>>>> Hi Matt,
>>>>>>>>>>
>>>>>>>>>> I'll take a look at that in a bit and see if I can figure out what
>>>>>>>>>> is going on. I do see the code you mention that processes the
>>>>>>>>>> arguments Barry referred to. In terms of processing options
>>>>>>>>>> correctly, at least in this case I am actually running one of the
>>>>>>>>>> petsc examples rather than my own code. And it seems to correctly
>>>>>>>>>> process the other command line arguments. Anyway, I'll write more
>>>>>>>>>> after I have had a chance to investigate more.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> Dave
>>>>>>>>>>
>>>>>>>>>>> Matt
>>>>>>>>>>>
>>>>>>>>>>>>> What are the choices? You can then pick one of them and run with
>>>>>>>>>>>>> -cuda_set_device integer
>>>>>>>>>>>>
>>>>>>>>>>>> The -cuda_set_device option does not appear to be recognized
>>>>>>>>>>>> either, even if I choose an integer like 0.
>>>>>>>>>>>>
>>>>>>>>>>>>> Does this change things?
>>>>>>>>>>>>
>>>>>>>>>>>> I suspect it would change things if I could get it to work.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>
>>>>>>>>>>>> Dave
>>>>>>>>>>>>
>>>>>>>>>>>>> Barry
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Dave
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Barry Smith writes:
>>>>>>>>>>>>>>> Dave,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We have no mechanism in the PETSc code for a single PETSc CPU
>>>>>>>>>>>>>>> process to use two GPUs at the same time. However, you could
>>>>>>>>>>>>>>> have two MPI processes, each using its own GPU.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The one tricky part is you need to make sure each MPI process
>>>>>>>>>>>>>>> uses a different GPU. We currently do not have a mechanism to
>>>>>>>>>>>>>>> do this assignment automatically. I think it can be done with
>>>>>>>>>>>>>>> cudaSetDevice(). But I don't know the details; sending this to
>>>>>>>>>>>>>>> petsc-dev at mcs.anl.gov where more people may know.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> PETSc-folks,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We need a way to have this setup automatically.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Barry
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Oct 1, 2011, at 5:43 PM, Dave Nystrom wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'm running petsc on a machine with Cuda 4.0 and 2 gpus.
>>>>>>>>>>>>>>>> This is a desktop machine with a single processor. I know
>>>>>>>>>>>>>>>> that Cuda 4.0 has support for running on multiple gpus but
>>>>>>>>>>>>>>>> don't know if petsc uses that. But suppose I have a problem
>>>>>>>>>>>>>>>> that will fit in the memory of a single gpu. Will petsc run
>>>>>>>>>>>>>>>> the problem on a single gpu, or does it split it between the
>>>>>>>>>>>>>>>> 2 gpus and incur the communication overhead of copying data
>>>>>>>>>>>>>>>> between the two gpus?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Dave
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> What most experimenters take for granted before they begin their
>>>>>>>>>>> experiments is infinitely more interesting than any results to
>>>>>>>>>>> which their experiments lead.
>>>>>>>>>>> -- Norbert Wiener
>>>>>>>>
>>>>>>
>>>>
>>>
>>>
>>
>>
>> --
>> What most experimenters take for granted before they begin their experiments
>> is infinitely more interesting than any results to which their experiments
>> lead.
>> -- Norbert Wiener