<div class="gmail_quote">On Sun, Oct 2, 2011 at 10:50 PM, Dave Nystrom <span dir="ltr"><<a href="mailto:Dave.Nystrom@tachyonlogic.com">Dave.Nystrom@tachyonlogic.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
Hi Barry,<br>
<br>
Barry Smith writes:<br>
> Dave,<br>
><br>
> I cannot explain why it does not use the MatMult_SeqAIJCusp() - it does for me.<br>
<br>
Do you get good performance running a problem like ex2?<br></blockquote><div><br></div><div>Okay, now the problem is clear. This has nothing to do with having 2 GPUs; rather, you</div><div>were not running MatMult on any GPU.</div>
<div><br></div><div>This problem has to do with the 'da_mat_type aijcusp' option being passed in. Somehow this</div><div>is not being acted on. So, we need</div><div><br></div><div> - The full input</div><div> - The full output of the test using -log_summary</div>
<div><br></div><div> Thanks,</div><div><br></div><div> Matt</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
> Have you updated to the latest cusp/thrust? From the mercurial repositories?<br>
<br>
I did try the latest version of cusp from mercurial initially but the build<br>
failed. So I am currently using the latest cusp tarball. I did not try the<br>
latest version of thrust but instead was just using what came with the<br>
released version of Cuda 4.0. I could try the mercurial versions of both.<br>
<div class="im"><br>
> There is a difference: in your new 4.0 build you added<br>
> --download-txpetscgpu=yes. BTW, that doesn't work for me with the latest<br>
> cusp and thrust from the mercurial repositories. Can you try reconfiguring<br>
> and making without that?<br>
<br>
</div>Yes, I can try that. Maybe that is why my original build with cusp from<br>
mercurial failed.<br>
<br>
Thanks for your help,<br>
<font color="#888888"><br>
Dave<br>
</font><div><div></div><div class="h5"><br>
> Barry<br>
><br>
> On Oct 2, 2011, at 9:08 PM, Dave Nystrom wrote:<br>
><br>
>> Barry Smith writes:<br>
>>> On Oct 2, 2011, at 6:39 PM, Dave Nystrom wrote:<br>
>>><br>
>>>> Thanks for the update. I don't believe I have gotten a run with good<br>
>>>> performance yet, either from C or Fortran. I wish there was an easy way for<br>
>>>> me to force use of only one of my gpus. I don't want to have to pull one of<br>
>>>> the gpus in order to see if that is complicating things with Cuda 4.0. I'll<br>
>>>> be eager to hear if you make any progress on figuring things out.<br>
>>>><br>
>>>> Do you understand yet why the petsc ex2.c example is able to parse the<br>
>>>> "-cuda_show_devices" argument but ex2f.F does not?<br>
>>><br>
>>> Matt put the code in the wrong place in PETSc, that is all, no big<br>
>>> existentialist reason. I will fix that.<br>
>><br>
>> Thanks. I'll look forward to testing out the new version.<br>
>><br>
>> Dave<br>
>><br>
>>> Barry<br>
>>><br>
>>>><br>
>>>> Thanks,<br>
>>>><br>
>>>> Dave<br>
>>>><br>
>>>> Barry Smith writes:<br>
>>>>> It is not doing the MatMult operation on the GPU and hence needs to move<br>
>>>>> the vectors back and forth for each operation (since MatMult is done on<br>
>>>>> the CPU with the vector while vector operations are done on the GPU);<br>
>>>>> hence the terrible performance.<br>
>>>>><br>
>>>>> Not sure why yet. It is copying the Mat down for me from C.<br>
>>>>><br>
>>>>> Barry<br>
>>>>><br>
>>>>> On Oct 2, 2011, at 4:18 PM, Dave Nystrom wrote:<br>
>>>>><br>
>>>>>> In case it might be useful, I have attached two log files of runs with the<br>
>>>>>> ex2f petsc example from src/ksp/ksp/examples/tutorials. One was run back in<br>
>>>>>> April with petsc-dev linked to Cuda 3.2. It shows excellent runtime<br>
>>>>>> performance. The other was run today with petsc-dev checked out of the<br>
>>>>>> mercurial repo yesterday morning and linked to Cuda 4.0. In addition to the<br>
>>>>>> differences in run time performance, I also do not see an entry for<br>
>>>>>> MatCUSPCopyTo in the profiling section. I'm not sure what the significance<br>
>>>>>> of that is. I do observe that the run time for PCApply is about the same for<br>
>>>>>> the two cases. I think I would expect that to be the case even if the<br>
>>>>>> problem were partitioned across two gpus. However, it does make me wonder if<br>
>>>>>> the absence of MatCUSPCopyTo in the profiling section of the Cuda 4.0 log<br>
>>>>>> file is an indication that the matrix was not actually copied to the gpu.<br>
>>>>>> I'm not sure yet how to check for that. Hope this might be useful.<br>
>>>>>><br>
>>>>>> Thanks,<br>
>>>>>><br>
>>>>>> Dave<br>
>>>>>><br>
>>>>>><br>
>>>>>> <ex2f_3200_3200_cuda_yes_cuda_3.2.log><ex2f_3200_3200_cuda_yes_cuda_4.0.log><br>
>>>>>> Dave Nystrom writes:<br>
>>>>>>> Matthew Knepley writes:<br>
>>>>>>>> On Sat, Oct 1, 2011 at 11:26 PM, Dave Nystrom <<a href="mailto:Dave.Nystrom@tachyonlogic.com">Dave.Nystrom@tachyonlogic.com</a>> wrote:<br>
>>>>>>>>> Barry Smith writes:<br>
>>>>>>>>>> On Oct 1, 2011, at 9:22 PM, Dave Nystrom wrote:<br>
>>>>>>>>>>> Hi Barry,<br>
>>>>>>>>>>><br>
>>>>>>>>>>> I've sent a couple more emails on this topic. What I am trying to do at the<br>
>>>>>>>>>>> moment is to figure out how to have a problem run on only one gpu if it will<br>
>>>>>>>>>>> fit in the memory of that gpu. Back in April when I had built petsc-dev with<br>
>>>>>>>>>>> Cuda 3.2, petsc would only use one gpu if you had multiple gpus on your<br>
>>>>>>>>>>> machine. In order to use multiple gpus for a problem, one had to use<br>
>>>>>>>>>>> multiple threads with a separate thread assigned to control each gpu. But<br>
>>>>>>>>>>> Cuda 4.0 has, I believe, made that transparent and under the hood. So now<br>
>>>>>>>>>>> when I run a small example problem such as<br>
>>>>>>>>>>> src/ksp/ksp/examples/tutorials/ex2f.F with an 800x800 problem, it gets<br>
>>>>>>>>>>> partitioned to run on both of the gpus in my machine. The result is a very<br>
>>>>>>>>>>> large performance hit because of communication back and forth from one gpu to<br>
>>>>>>>>>>> the other via the cpu.<br>
>>>>>>>>>><br>
>>>>>>>>>> How do you know there is lots of communication from the GPU to the CPU? In<br>
>>>>>>>>>> the -log_summary? Nope, because PETSc does not manage anything like that<br>
>>>>>>>>>> (that is one CPU process using both GPUs).<br>
>>>>>>>>><br>
>>>>>>>>> What I believe is that it is being managed by Cuda 4.0, not by petsc.<br>
>>>>>>>>><br>
>>>>>>>>>>> So this problem with a 3200x3200 grid runs 5x slower<br>
>>>>>>>>>>> now than it did with Cuda 3.2. I believe if one is programming down at the<br>
>>>>>>>>>>> cuda level, it is possible to have a smaller problem run on only one gpu so<br>
>>>>>>>>>>> that there is communication only between the cpu and gpu and only at the<br>
>>>>>>>>>>> start and end of the calculation.<br>
>>>>>>>>>>><br>
>>>>>>>>>>> To me, it seems like what is needed is a petsc option to specify the number<br>
>>>>>>>>>>> of gpus to run on that can somehow get passed down to the cuda level through<br>
>>>>>>>>>>> cusp and thrust. I fear that the short term solution is going to have to be<br>
>>>>>>>>>>> for me to pull one of the gpus out of my desktop system but it would be nice<br>
>>>>>>>>>>> if there was a way to tell petsc and friends to just use one gpu when I want<br>
>>>>>>>>>>> it to.<br>
>>>>>>>>>>><br>
>>>>>>>>>>> If necessary, I can send a couple of log files to demonstrate what I am<br>
>>>>>>>>>>> trying to describe regarding the performance hit.<br>
>>>>>>>>>><br>
>>>>>>>>>> I am not convinced that the poor performance you are getting now has<br>
>>>>>>>>>> anything to do with using both GPUs. Please run a PETSc program with the<br>
>>>>>>>>>> command -cuda_show_devices<br>
>>>>>>>>><br>
>>>>>>>>> I ran the following command:<br>
>>>>>>>>><br>
>>>>>>>>> ex2f -m 8 -n 8 -ksp_type cg -pc_type jacobi -log_summary -cuda_show_devices<br>
>>>>>>>>> -mat_type aijcusp -vec_type cusp -options_left<br>
>>>>>>>>><br>
>>>>>>>>> The result was a report that there was one option left, that being<br>
>>>>>>>>> -cuda_show_devices. I am using a copy of petsc-dev that I cloned and built<br>
>>>>>>>>> this morning.<br>
>>>>>>>><br>
>>>>>>>> What do you have at src/sys/object/pinit.c:825? You should see the code<br>
>>>>>>>> that processes this option. You should be able to break there in the<br>
>>>>>>>> debugger and see what happens. This sounds again like you are not<br>
>>>>>>>> processing options correctly.<br>
>>>>>>><br>
>>>>>>> Hi Matt,<br>
>>>>>>><br>
>>>>>>> I'll take a look at that in a bit and see if I can figure out what is going<br>
>>>>>>> on. I do see the code that you mention that processes the arguments that<br>
>>>>>>> Barry mentioned. In terms of processing options correctly, at least in this<br>
>>>>>>> case I am actually running one of the petsc examples rather than my own<br>
>>>>>>> code. And it seems to correctly process the other command line arguments.<br>
>>>>>>> Anyway, I'll write more after I have had a chance to investigate more.<br>
>>>>>>><br>
>>>>>>> Thanks,<br>
>>>>>>><br>
>>>>>>> Dave<br>
>>>>>>><br>
>>>>>>>> Matt<br>
>>>>>>>><br>
>>>>>>>>>> What are the choices? You can then pick one of them and run with<br>
>>>>>>>>>> -cuda_set_device integer<br>
>>>>>>>>><br>
>>>>>>>>> The -cuda_set_device option does not appear to be recognized either, even<br>
>>>>>>>>> if I choose an integer like 0.<br>
>>>>>>>>><br>
>>>>>>>>>> Does this change things?<br>
>>>>>>>>><br>
>>>>>>>>> I suspect it would change things if I could get it to work.<br>
>>>>>>>>><br>
>>>>>>>>> Thanks,<br>
>>>>>>>>><br>
>>>>>>>>> Dave<br>
>>>>>>>>><br>
>>>>>>>>>> Barry<br>
>>>>>>>>>><br>
>>>>>>>>>>><br>
>>>>>>>>>>> Thanks,<br>
>>>>>>>>>>><br>
>>>>>>>>>>> Dave<br>
>>>>>>>>>>><br>
>>>>>>>>>>> Barry Smith writes:<br>
>>>>>>>>>>>> Dave,<br>
>>>>>>>>>>>><br>
>>>>>>>>>>>> We have no mechanism in the PETSc code for a PETSc single CPU process to<br>
>>>>>>>>>>>> use two GPUs at the same time. However you could have two MPI processes<br>
>>>>>>>>>>>> each using their own GPU.<br>
>>>>>>>>>>>><br>
>>>>>>>>>>>> The one tricky part is you need to make sure each MPI process uses a<br>
>>>>>>>>>>>> different GPU. We currently do not have a mechanism to do this assignment<br>
>>>>>>>>>>>> automatically. I think it can be done with cudaSetDevice(). But I don't<br>
>>>>>>>>>>>> know the details, sending this to <a href="mailto:petsc-dev@mcs.anl.gov">petsc-dev@mcs.anl.gov</a> where more people<br>
>>>>>>>>>>>> may know.<br>
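<div>A minimal sketch of the kind of per-rank assignment described above (a hypothetical helper, not existing PETSc code): each rank on a node picks a device round-robin, and would then pass the result to cudaSetDevice(). The function name and the source of num_devices (e.g. cudaGetDeviceCount()) are assumptions for illustration.</div>

```c
/* Hypothetical sketch, not existing PETSc code: round-robin mapping from a
 * node-local MPI rank to a GPU index, so ranks sharing a node get distinct
 * devices.  The caller would invoke cudaSetDevice() on the result right
 * after MPI_Init(); num_devices would come from cudaGetDeviceCount(). */
int device_for_rank(int node_local_rank, int num_devices)
{
    return node_local_rank % num_devices;
}
```

<div>With 2 GPUs per node this gives ranks 0 and 1 distinct devices, and wraps around for oversubscribed nodes.</div>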
>>>>>>>>>>>><br>
>>>>>>>>>>>> PETSc-folks,<br>
>>>>>>>>>>>><br>
>>>>>>>>>>>> We need a way to have this setup automatically.<br>
>>>>>>>>>>>><br>
>>>>>>>>>>>> Barry<br>
>>>>>>>>>>>><br>
>>>>>>>>>>>> On Oct 1, 2011, at 5:43 PM, Dave Nystrom wrote:<br>
>>>>>>>>>>>><br>
>>>>>>>>>>>>> I'm running petsc on a machine with Cuda 4.0 and 2 gpus. This is a desktop<br>
>>>>>>>>>>>>> machine with a single processor. I know that Cuda 4.0 has support for<br>
>>>>>>>>>>>>> running on multiple gpus but don't know if petsc uses that. But suppose I<br>
>>>>>>>>>>>>> have a problem that will fit in the memory for a single gpu. Will petsc run<br>
>>>>>>>>>>>>> the problem on a single gpu or does it split it between the 2 gpus and incur<br>
>>>>>>>>>>>>> the communication overhead of copying data between the two gpus?<br>
>>>>>>>>>>>>><br>
>>>>>>>>>>>>> Thanks,<br>
>>>>>>>>>>>>><br>
>>>>>>>>>>>>> Dave<br>
>>>>>>>><br>
>>>>>>>> --<br>
>>>>>>>> What most experimenters take for granted before they begin their experiments<br>
>>>>>>>> is infinitely more interesting than any results to which their experiments<br>
>>>>>>>> lead.<br>
>>>>>>>> -- Norbert Wiener<br>
>>>>><br>
>>><br>
><br>
<br>
</div></div></blockquote></div><br><br clear="all"><div><br></div>-- <br>What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.<br>
-- Norbert Wiener<br>