[petsc-dev] [petsc-maint #88993] Petsc with Cuda 4.0 and Multiple GPUs

Dave Nystrom Dave.Nystrom at tachyonlogic.com
Sun Oct 2 16:43:48 CDT 2011


Dave Nystrom writes:
 > In case it might be useful, I have attached two log files of runs with the
 > ex2f petsc example from src/ksp/ksp/examples/tutorials.  One was run back in
 > April with petsc-dev linked to Cuda 3.2.  It shows excellent runtime
 > performance.  The other was run today with petsc-dev checked out of the
 > mercurial repo yesterday morning and linked to Cuda 4.0.  In addition to the
 > differences in run time performance, I also do not see an entry for
 > MatCUSPCopyTo in the profiling section.  I'm not sure what the significance
 > of that is.  I do observe that the run time for PCApply is about the same for
 > the two cases.  I think I would expect that to be the case even if the
 > problem were partitioned across two gpus.  However, it does make me wonder if
 > the absence of MatCUSPCopyTo in the profiling section of the Cuda 4.0 log
 > file is an indication that the matrix was not actually copied to the gpu.
 > I'm not sure yet how to check for that.  Hope this might be useful.

I have been able to get the option "-cuda_show_devices" to work if I use the
C version of the ex2 example rather than the Fortran version.  So it would
seem that there are some issues with command line option processing in the
Fortran case.  To be more explicit, I am running the following C petsc
example:

src/ksp/ksp/examples/tutorials/ex2.c

However, when I ran this example with the "-cuda_set_device 0" option, I did
not see any change in the run time performance.  The option was recognized
and parsed by the C example.

I'm not sure how to proceed.  It would seem that one of two scenarios may be
at play here.

1.  The problem is being partitioned across the two gpus under the hood by
Cuda 4.0, regardless of whether the problem would fit on one gpu, so that
the matvec requires communication between the two gpus on each iteration.

2.  For some reason, the matrix may not be copied to the gpu at all, meaning
that the matvec requires communication between the host and the gpu on each
iteration.
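
One rough way to look into the second scenario is to compare the profiling
sections of the two logs and list the events that show up in the April run
but not in the new one (MatCUSPCopyTo being the one already noticed).  Below
is a minimal sketch of that idea.  It assumes each event line in the log
starts with the event name, followed by a call count, a ratio and a time in
exponent notation; the helper names event_counts and missing_events are made
up for illustration and are not part of petsc.

```python
import re
import sys

# Assumed layout of an event line in the profiling section of the log:
#   EventName   <count> <ratio> <max time, e.g. 1.2345e-01> ...
EVENT_LINE = re.compile(r"^(\w+)\s+(\d+)\s+\d+\.\d+\s+\d+\.\d+e[+-]\d+")


def event_counts(log_text):
    """Return {event name: call count} for lines matching EVENT_LINE."""
    counts = {}
    for line in log_text.splitlines():
        match = EVENT_LINE.match(line)
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts


def missing_events(old_log, new_log):
    """Return the sorted event names found in old_log but not in new_log."""
    return sorted(set(event_counts(old_log)) - set(event_counts(new_log)))


# Command line use: pass the older log first, then the newer one.
if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1]) as f_old, open(sys.argv[2]) as f_new:
        for name in missing_events(f_old.read(), f_new.read()):
            print("only in first log:", name)
```

If the script were saved as compare_events.py (a hypothetical name), it
could be run with the Cuda 3.2 log as the first argument and the Cuda 4.0
log as the second.  A missing event only shows that the event was never
logged in that run; it does not by itself prove that the matrix stayed on
the host.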

Any thoughts on what might be happening?  I certainly got excellent
performance back in April.

Thanks,

Dave


