[petsc-dev] [petsc-maint #88993] Petsc with Cuda 4.0 and Multiple GPUs

Matthew Knepley petsc-maint at mcs.anl.gov
Mon Oct 3 09:33:19 CDT 2011

On Sun, Oct 2, 2011 at 4:43 PM, Dave Nystrom
<Dave.Nystrom at tachyonlogic.com> wrote:

> Dave Nystrom writes:
>  > In case it might be useful, I have attached two log files of runs with the
>  > ex2f petsc example from src/ksp/ksp/examples/tutorials.  One was run back in
>  > April with petsc-dev linked to Cuda 3.2.  It shows excellent runtime
>  > performance.  The other was run today with petsc-dev checked out of the
>  > mercurial repo yesterday morning and linked to Cuda 4.0.  In addition to the
>  > differences in run time performance, I also do not see an entry for
>  > MatCUSPCopyTo in the profiling section.  I'm not sure what the significance
>  > of that is.  I do observe that the run time for PCApply is about the same for
>  > the two cases.  I think I would expect that to be the case even if the
>  > problem were partitioned across two gpus.  However, it does make me wonder if
>  > the absence of MatCUSPCopyTo in the profiling section of the Cuda 4.0 log
>  > file is an indication that the matrix was not actually copied to the gpu.
>  > I'm not sure yet how to check for that.  Hope this might be useful.
>
> I have been able to get the option "-cuda_show_devices" to work if I use the
> C version of the ex2 example rather than the Fortran version.  So it would
> seem that there are some issues associated with command line option
> processing for the petsc case.  To be more explicit, I am running the
> following C petsc example:
>
> src/ksp/ksp/examples/tutorials/ex2.c
>
> However, when I ran this example with the "-cuda_set_device 0" option, I did
> not see any change in the run time performance.  The option was recognized
> and parsed by the C example.
>
> I'm not sure how to proceed.  It would seem that one of two scenarios may be
> at play here.
>
> 1.  The problem is being partitioned across the two gpus under the hood by
> Cuda 4.0 regardless of whether the problem would fit on one gpu.  And this
> has the result that the matvec requires communication each iteration between
> the two gpus.

Dave, this is definitely not happening. There is no evidence for this.
Instead, the matrix is not using the GPU at all. There must be

MatCUSPCopyToGPU ---

in the -log_summary in order to be using the GPU.
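
A quick way to run this check on saved logs is sketched below. This is only
an illustration, not part of PETSc: the helper name has_event is made up, and
it assumes each event row in the -log_summary output begins with the event
name as its first whitespace-separated field.

```python
import sys

# Event name taken from this thread; everything else here is hypothetical.
EVENT = "MatCUSPCopyToGPU"


def has_event(log_text, event=EVENT):
    """Return True if some line of the log has `event` as its first field.

    Assumes event rows in the profiling section start with the event name.
    """
    for line in log_text.splitlines():
        fields = line.split()
        if fields and fields[0] == event:
            return True
    return False


if __name__ == "__main__":
    # Pass one or more saved -log_summary outputs as arguments.
    for path in sys.argv[1:]:
        with open(path) as f:
            found = has_event(f.read())
        status = "present" if found else "MISSING"
        print(f"{path}: {EVENT} {status}")
```

Running it on both the April log and the new log should show the event
present in the first and, if the matrix really is staying on the CPU, missing
in the second.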

> 2.  For some reason, the matrix may not be copied to the gpu at all meaning
> that the matvec requires communication with the gpu on each iteration.
> Any thoughts on what might be happening?  I certainly got excellent
> performance back in April.

Look at your April log. It has that event. Something else is happening in
this code.
I can confirm that my run of ex2f executes MatCUSPCopyToGPU.

You can look at the MatMult call in the debugger, and see what it is
dispatching to.


> Thanks,
> Dave

What most experimenters take for granted before they begin their experiments
is infinitely more interesting than any results to which their experiments
lead.
-- Norbert Wiener