<div dir="ltr">On Thu, May 2, 2013 at 3:29 PM, Florian Rathgeber <span dir="ltr"><<a href="mailto:florian.rathgeber@gmail.com" target="_blank">florian.rathgeber@gmail.com</a>></span> wrote:<br><div class="gmail_extra">
<div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="im">On 02/05/13 03:12, Matthew Knepley wrote:<br>
> On Wed, May 1, 2013 at 8:52 PM, Karl Rupp <<a href="mailto:rupp@mcs.anl.gov">rupp@mcs.anl.gov</a><br>
</div><div><div class="h5">> <mailto:<a href="mailto:rupp@mcs.anl.gov">rupp@mcs.anl.gov</a>>> wrote:<br>
> >
> >     Hi Florian,
> >
> >     > This is loosely a follow up to [1]. In this thread a few potential
> >     > ways for making GPU assembly work with PETSc were discussed and to
> >     > me the two most promising appeared to be:
> >     > 1) Create a PETSc matrix from a pre-assembled CSR structure, or
> >     > 2) Preallocate a PETSc matrix and get the handle to pass the row
> >     >    pointer, column indices and values array to a custom assembly
> >     >    routine.
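
For reference, option 1) with the host-side API looks roughly like the
untested sketch below (toy data, purely illustrative);
MatCreateSeqAIJWithArrays() wraps the caller-owned CSR arrays in a Mat
without copying them:

  #include <petscmat.h>

  int main(int argc, char **argv)
  {
    /* 3x3 tridiagonal toy matrix, already assembled in CSR form */
    PetscInt    ia[] = {0, 2, 5, 7};
    PetscInt    ja[] = {0, 1, 0, 1, 2, 1, 2};
    PetscScalar a[]  = {2, -1, -1, 2, -1, -1, 2};
    Mat         A;

    PetscInitialize(&argc, &argv, NULL, NULL);
    /* PETSc keeps pointers to ia/ja/a rather than copying the CSR structure */
    MatCreateSeqAIJWithArrays(PETSC_COMM_SELF, 3, 3, ia, ja, a, &A);
    MatView(A, PETSC_VIEWER_STDOUT_SELF);
    MatDestroy(&A);
    PetscFinalize();
    return 0;
  }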
> >
> >     I still consider these two to be the most promising (and general)
> >     approaches. On the other hand, to my knowledge the infrastructure
> >     hasn't changed a lot since then. Some additional functionality from
> >     CUSPARSE was added, while I added ViennaCL bindings to branch 'next'
> >     (i.e. still a few corners to polish). This means that you could
> >     technically use the much more JIT-friendly OpenCL (and, as a
> >     follow-up, complain to NVIDIA and AMD about the higher latencies
> >     than with CUDA).
> >
> >     > We compute local assembly matrices on the GPU, and a crucial
> >     > requirement is that the matrix *only* lives in device memory: we
> >     > want to avoid any host <-> device data transfers.
> >
> >     One of the reasons why - despite its attractiveness - this hasn't
> >     taken off is that good preconditioners are typically still required
> >     in such a setting. Other than the smoothed aggregation in CUSP,
> >     there is not much which does *not* require a copy to the host.
> >     Particularly when thinking about multi-GPU, you're entering the
> >     regime where a good preconditioner on the CPU will still outperform
> >     a GPU assembly with a poor preconditioner.
> >
> >     > So far we have been using CUSP with a custom (generated) assembly
> >     > into our own CUSP-compatible CSR data structure for a single GPU.
> >     > Since CUSP doesn't give us multi-GPU solvers out of the box, we'd
> >     > rather use existing infrastructure that works than roll our own.
> >
> >     I guess this is good news for you: Steve Dalton will work with us
> >     during the summer to extend the CUSP-SA-AMG to distributed memory.
> >     Other than that, I think there's currently only the functionality
> >     from CUSPARSE and polynomial preconditioners, available through the
> >     txpetscgpu package.
> >
> >     Aside from that, I also have a couple of plans on that front
> >     spinning in my head, but I haven't found the time to implement them
> >     yet.
> >
> >     > At the time of [1], supporting GPU assembly in one form or another
> >     > was on the roadmap, but the implementation direction did not seem
> >     > to have been finally decided. Has there been any progress since
> >     > then, or anything to add to the discussion? Is there even
> >     > (experimental) code we might be able to use? Note that we're using
> >     > petsc4py to interface to PETSc.
> >
> >     Did you have a look at snes/examples/tutorials/ex52? I'm currently
> >     converting/extending this to OpenCL, so it serves as a playground
> >     for a future interface. Matt might have some additional comments on
> >     this.
> >
> > I like to be very precise in the terminology. Doing the cell integrals
> > on the GPU (integration) is worthwhile, whereas inserting the element
> > matrices into a global representation like CSR (assembly) takes no time
> > and can be done almost any way, including on the CPU. I stopped working
> > on assembly because it made no difference.
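
To pin the terminology down: "assembly" here just means the scatter-add of
each dense element matrix into the global operator, along the lines of the
untested sketch below (nelem, nen, elem_dofs and Ke are placeholder names
for whatever the integration kernel produced):

  #include <petscmat.h>

  /* elem_dofs has nelem*nen entries; Ke has nelem*nen*nen entries, stored
     as row-major nen x nen blocks, both produced by the integration kernel. */
  PetscErrorCode insert_element_matrices(Mat A, PetscInt nelem, PetscInt nen,
                                         const PetscInt *elem_dofs,
                                         const PetscScalar *Ke)
  {
    PetscInt e;
    for (e = 0; e < nelem; ++e) {
      const PetscInt    *dofs = elem_dofs + e*nen;
      const PetscScalar *vals = Ke + e*nen*nen;
      MatSetValues(A, nen, dofs, nen, dofs, vals, ADD_VALUES);
    }
    MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
    MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);
    return 0;
  }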
>
> The actual insertion (as in MatSetValues) may not take up much time on
> either the CPU or the GPU, provided it is done where the integration was
> done. As I mentioned before, we do both the integration and the solve on
> the GPU. We don't even allocate data in host memory. Therefore it
> wouldn't make much sense to do the addto on the host, since it would
> require device -> host data transfer of all the cell integrals and host
> -> device transfer of the CSR, which would make it quite expensive.
>
> One option we considered was creating a MatShell and providing an SPMV
> callback, probably calling a CUSP kernel on each MPI rank. That
> restricts the available preconditioners, but as mentioned, without doing
> any data transfers we'd be restricted to GPU-only preconditioners
> anyway. Any thoughts on this compared to the strategies mentioned above?

What about just creating your CUSP matrix and then shoving it into a
MATAIJCUSP? That is what I did for my assembly tests.
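
Through the public API that looks roughly like the untested sketch below; it
assumes a PETSc build configured with CUSP/CUDA support, and it feeds values
through the generic MatSetValues() path rather than handing over an existing
cusp::csr_matrix directly:

  #include <petscmat.h>

  /* Untested sketch: a CUSP-backed PETSc matrix, so that MatMult and the
     Krylov solve stay on the GPU. */
  PetscErrorCode create_gpu_matrix(MPI_Comm comm, PetscInt N, Mat *A)
  {
    MatCreate(comm, A);
    MatSetSizes(*A, PETSC_DECIDE, PETSC_DECIDE, N, N);
    MatSetType(*A, "aijcusp");   /* MATAIJCUSP; or -mat_type aijcusp */
    MatSetUp(*A);
    /* ... MatSetValues(...) with the element contributions, then: */
    MatAssemblyBegin(*A, MAT_FINAL_ASSEMBLY);
    MatAssemblyEnd(*A, MAT_FINAL_ASSEMBLY);
    return 0;
  }

Once it is assembled, the rest of the KSP machinery just sees an ordinary Mat.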

For GPU-only preconditioners, I would focus on the CUSP AMG using Chebyshev
for the smoothers.
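
An untested sketch of selecting that from code (the "sacusp" type name, i.e.
the CUSP smoothed-aggregation preconditioner, may differ between PETSc
versions, so double-check it against your build):

  #include <petscksp.h>

  PetscErrorCode setup_gpu_solver(MPI_Comm comm, Mat A, KSP *ksp)
  {
    PC pc;
    KSPCreate(comm, ksp);
    /* NOTE: in PETSc >= 3.5 KSPSetOperators() takes three arguments */
    KSPSetOperators(*ksp, A, A, SAME_NONZERO_PATTERN);
    KSPSetType(*ksp, KSPCG);
    KSPGetPC(*ksp, &pc);
    PCSetType(pc, "sacusp");   /* CUSP smoothed-aggregation AMG */
    /* With a multigrid PC such as GAMG the smoother would instead be picked
       via options, e.g. -mg_levels_ksp_type chebyshev. */
    KSPSetFromOptions(*ksp);
    return 0;
  }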

   Matt

> Thanks,
> Florian
<div class="HOEnZb"><div class="h5"><br>
> � Thanks,<br>
><br>
> � � �Matt<br>
><br>
><br>
> � � Best regards,<br>
> � � Karli<br>
<br>
</div></div></blockquote></div><br><br clear="all"><div><br></div>-- <br>What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.<br>
-- Norbert Wiener
</div></div>