On Sun, Oct 2, 2011 at 4:43 PM, Dave Nystrom <Dave.Nystrom@tachyonlogic.com> wrote:
<div><div></div><div class="h5">Dave Nystrom writes:<br>
> In case it might be useful, I have attached two log files of runs with the<br>
> ex2f petsc example from src/ksp/ksp/examples/tutorials. One was run back in<br>
> April with petsc-dev linked to Cuda 3.2. It shows excellent runtime<br>
> performance. The other was run today with petsc-dev checked out of the<br>
> mercurial repo yesterday morning and linked to Cuda 4.0. In addition to the<br>
> differences in run time performance, I also do not see an entry for<br>
> MatCUSPCopyTo in the profiling section. I'm not sure what the significance<br>
> of that is. I do observe that the run time for PCApply is about the same for<br>
> the two cases. I think I would expect that to be the case even if the<br>
> problem were partitioned across two gpus. However, it does make me wonder if<br>
> the absence of MatCUSPCopyTo in the profiling section of the Cuda 4.0 log<br>
> file is an indication that the matrix was not actually copied to the gpu.<br>
> I'm not sure yet how to check for that. Hope this might be useful.<br>
<br>
> I have been able to get the option "-cuda_show_devices" to work if I use the
> C version of the ex2 example rather than the Fortran version. So it would
> seem that there are some issues associated with command line option
> processing for the petsc Fortran case. To be more explicit, I am running the
> following C petsc example:
>
> src/ksp/ksp/examples/tutorials/ex2.c
>
> However, when I ran this example with the "-cuda_set_device 0" option, I did
> not see any change in the run time performance. The option was recognized
> and parsed by the C example.
>
> I'm not sure how to proceed. It would seem that one of two scenarios may be
> at play here.
>
> 1. The problem is being partitioned across the two gpus under the hood by
> Cuda 4.0 regardless of whether the problem would fit on one gpu. And this
> has the result that the matvec requires communication each iteration between
> the two gpus.

Dave, this is definitely not happening. There is no evidence for this. Instead, the
matrix is not using the GPU at all. There must be a

  MatCUSPCopyToGPU

event in the -log_summary output in order for the run to be using the GPU.
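
The first thing to check is that the run actually requests the GPU types.
As a rough sketch (the problem size and solver options below are only
illustrative, not your exact command; the type names are the petsc-dev
CUSP types of this era):

  ./ex2 -m 100 -n 100 -ksp_type cg -pc_type jacobi \
        -vec_type cusp -mat_type aijcusp -log_summary

Without -mat_type aijcusp the MatMult stays on the CPU, and no
MatCUSPCopyToGPU event can appear in the log.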
> 2. For some reason, the matrix may not be copied to the gpu at all, meaning
> that the matvec requires communication with the gpu on each iteration.
>
> Any thoughts on what might be happening? I certainly got excellent
> performance back in April.

Look at your April log. It has that event. Something else is happening in this code.
I can confirm that my run of ex2f executes MatCUSPCopyToGPU.
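
A quick way to compare the two runs (the log file names here are
hypothetical; substitute whatever you saved):

  grep MatCUSPCopyToGPU april_cuda32.log
  grep MatCUSPCopyToGPU new_cuda40.log

The event should show up in the event table of the April log and,
from what you describe, be missing from the new one.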
You can look at the MatMult call in the debugger and see what it is dispatching to.
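For example, with a debug build (a sketch; MatMult_SeqAIJCUSP is the
implementation I would expect on the GPU path, MatMult_SeqAIJ on the CPU path):

  gdb --args ./ex2 -vec_type cusp -mat_type aijcusp
  (gdb) break MatMult
  (gdb) run
  (gdb) step   # repeat until you land in the MatMult_* implementation

   Matt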
> Thanks,
>
> Dave

--
What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener