On Sun, Oct 2, 2011 at 4:43 PM, Dave Nystrom <span dir="ltr">&lt;<a href="mailto:Dave.Nystrom@tachyonlogic.com">Dave.Nystrom@tachyonlogic.com</a>&gt;</span> wrote:<br><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">

<div><div></div><div class="h5">Dave Nystrom writes:<br>

 > In case it might be useful, I have attached two log files of runs with the<br>

 > ex2f petsc example from src/ksp/ksp/examples/tutorials.  One was run back in<br>

 > April with petsc-dev linked to Cuda 3.2.  It shows excellent runtime<br>

 > performance.  The other was run today with petsc-dev checked out of the<br>

 > mercurial repo yesterday morning and linked to Cuda 4.0.  In addition to the<br>

 > differences in run time performance, I also do not see an entry for<br>

 > MatCUSPCopyTo in the profiling section.  I'm not sure what the significance<br>

 > of that is.  I do observe that the run time for PCApply is about the same for<br>

 > the two cases.  I think I would expect that to be the case even if the<br>

 > problem were partitioned across two gpus.  However, it does make me wonder if<br>

 > the absence of MatCUSPCopyTo in the profiling section of the Cuda 4.0 log<br>

 > file is an indication that the matrix was not actually copied to the gpu.<br>

 > I'm not sure yet how to check for that.  Hope this might be useful.<br>

<br>

</div></div>I have been able to get the option "-cuda_show_devices" to work if I use the<br>

C version of the ex2 example rather than the Fortran version.  So it would<br>

seem that there are some issues associated with command line option<br>

processing for the petsc case.  To be more explicit, I am running the<br>

following C petsc example:<br>

<br>

src/ksp/ksp/examples/tutorials/ex2.c<br>

<br>

However, when I ran this example with the "-cuda_set_device 0" option, I did<br>

not see any change in the run time performance.  The option was recognized<br>

and parsed by the C example.<br>

<br>

I'm not sure how to proceed.  It would seem that one of two scenarios may be<br>

at play here.<br>

<br>

1.  The problem is being partitioned across the two gpus under the hood by<br>

Cuda 4.0 regardless of whether the problem would fit on one gpu.  And this<br>

has the result that the matvec requires communication each iteration between<br>

the two gpus.<br></blockquote><div><br></div><div>Dave, this is definitely not happening. There is no evidence for this. Instead, the</div><div>matrix is not using the GPU at all. There must be</div><div><br></div><div>

MatCUSPCopyToGPU ---</div><div><br></div><div>in the -log_summary in order to be using the GPU.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">

2.  For some reason, the matrix may not be copied to the gpu at all meaning<br>

that the matvec requires communication with the gpu on each iteration.<br>

<br>

Any thoughts on what might be happening?  I certainly got excellent<br>

performance back in April.<br></blockquote><div><br></div><div>Look at your April log. It has that event. Something else is happening in this code.</div><div>I can confirm that my run of ex2f executes MatCUSPCopyToGPU.</div>
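<div><br></div><div>A quick way to compare the two logs is to scan each one for the event name. This is only a sketch: it assumes the -log_summary output was saved to a text file and that each event row begins with the event name. The function name and the file names in the usage comment are hypothetical. It matches on the prefix MatCUSPCopyTo so that either spelling used in this thread is caught.</div>

```python
def log_has_event(log_text, event_prefix="MatCUSPCopyTo"):
    """Return True if any line's first token starts with event_prefix.

    Assumption: each profiling row in the saved log begins with the
    event name, followed by whitespace-separated columns.
    """
    for line in log_text.splitlines():
        tokens = line.split()
        if tokens and tokens[0].startswith(event_prefix):
            return True
    return False


# Usage (file names are hypothetical):
#   with open("ex2f_cuda32.log") as f:
#       print(log_has_event(f.read()))
#   with open("ex2f_cuda40.log") as f:
#       print(log_has_event(f.read()))
```

<div>If the function returns False for a log, the matrix copy event was never recorded in that run, which would be consistent with the matrix not using the GPU.</div>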

<div><br></div><div>You can look at the MatMult call in the debugger, and see what it is dispatching to.</div><div><br></div><div>   Matt</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">


Thanks,<br>

<font color="#888888"><br>

Dave<br>

</font></blockquote></div><br><br clear="all"><div><br></div>-- <br>What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.<br>

-- Norbert Wiener<br>