<div dir="ltr">On Thu, Jun 6, 2013 at 12:17 PM, Florian Rathgeber <span dir="ltr"><<a href="mailto:florian.rathgeber@gmail.com" target="_blank">florian.rathgeber@gmail.com</a>></span> wrote:<br><div class="gmail_extra">
<div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="im">On 02/05/13 21:35, Matthew Knepley wrote:<br>
> On Thu, May 2, 2013 at 3:29 PM, Florian Rathgeber<br>
</div><div class="im">> <<a href="mailto:florian.rathgeber@gmail.com">florian.rathgeber@gmail.com</a> <mailto:<a href="mailto:florian.rathgeber@gmail.com">florian.rathgeber@gmail.com</a>>> wrote:<br>
><br>
> On 02/05/13 03:12, Matthew Knepley wrote:<br>
> > On Wed, May 1, 2013 at 8:52 PM, Karl Rupp <<a href="mailto:rupp@mcs.anl.gov">rupp@mcs.anl.gov</a><br>
> <mailto:<a href="mailto:rupp@mcs.anl.gov">rupp@mcs.anl.gov</a>><br>
</div><div><div class="h5">> > <mailto:<a href="mailto:rupp@mcs.anl.gov">rupp@mcs.anl.gov</a> <mailto:<a href="mailto:rupp@mcs.anl.gov">rupp@mcs.anl.gov</a>>>> wrote:<br>
> > > > > Hi Florian,
> > > > >
> > > > > > This is loosely a follow up to [1]. In this thread a few potential
> > > > > > ways for making GPU assembly work with PETSc were discussed and to
> > > > > > me the two most promising appeared to be:
> > > > > > 1) Create a PETSc matrix from a pre-assembled CSR structure, or
> > > > > > 2) Preallocate a PETSc matrix and get the handle to pass the row
> > > > > >    pointer, column indices and values array to a custom assembly
> > > > > >    routine.
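
For reference, option 1 in its plain, host-side form is roughly the sketch
below (toy 2x2 data). Note that MatCreateSeqAIJWithArrays() takes host
pointers, so on its own it keeps nothing in device memory:

  /* Sketch: hand a finished CSR structure directly to PETSc. */
  #include <petscmat.h>

  int main(int argc, char **argv)
  {
    PetscInt    rowptr[3] = {0, 2, 4};           /* CSR row pointer    */
    PetscInt    colidx[4] = {0, 1, 0, 1};        /* CSR column indices */
    PetscScalar vals[4]   = {4.0, -1.0, -1.0, 4.0};
    Mat         A;

    PetscInitialize(&argc, &argv, NULL, NULL);
    /* The arrays are used directly (not copied) and must outlive A. */
    MatCreateSeqAIJWithArrays(PETSC_COMM_SELF, 2, 2, rowptr, colidx, vals, &A);
    MatView(A, PETSC_VIEWER_STDOUT_SELF);
    MatDestroy(&A);
    PetscFinalize();
    return 0;
  }
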
> > > > >
> > > > > I still consider these two to be the most promising (and general)
> > > > > approaches. On the other hand, to my knowledge the infrastructure
> > > > > hasn't changed a lot since then. Some additional functionality from
> > > > > CUSPARSE was added, while I added ViennaCL bindings to branch 'next'
> > > > > (i.e. still a few corners to polish). This means that you could
> > > > > technically use the much more JIT-friendly OpenCL (and, as a
> > > > > follow-up, complain at NVIDIA and AMD over the higher latencies than
> > > > > with CUDA).
> > > > > >
> > > > > > We compute local assembly matrices on the GPU and a crucial
> > > > > > requirement is that the matrix *only* lives in device memory; we
> > > > > > want to avoid any host <-> device data transfers.
> > > > >
> > > > > One of the reasons why - despite its attractiveness - this hasn't
> > > > > taken off is that good preconditioners are typically still required
> > > > > in such a setting. Other than the smoothed aggregation in CUSP, there
> > > > > is not much which does *not* require a copy to the host. Particularly
> > > > > when thinking about multi-GPU you're entering the regime where a good
> > > > > preconditioner on the CPU will still outperform a GPU assembly with a
> > > > > poor preconditioner.
> > > > > >
> > > > > > So far we have been using CUSP with a custom (generated) assembly
> > > > > > into our own CUSP-compatible CSR data structure for a single GPU.
> > > > > > Since CUSP doesn't give us multi-GPU solvers out of the box, we'd
> > > > > > rather use existing infrastructure that works than roll our own.
> > > > >
> > > > > I guess this is good news for you: Steve Dalton will work with us
> > > > > during the summer to extend the CUSP-SA-AMG to distributed memory.
> > > > > Other than that, I think there's currently only the functionality
> > > > > from CUSPARSE and polynomial preconditioners, available through the
> > > > > txpetscgpu package.
> > > > >
> > > > > Aside from that I also have a couple of plans on that front spinning
> > > > > in my head, but I haven't yet found the time to implement them.
> > > > > >
> > > > > > At the time of [1], supporting GPU assembly in one form or the
> > > > > > other was on the roadmap, but the implementation direction seemed
> > > > > > not to have been finally decided. Has there been any progress since
> > > > > > then, or anything to add to the discussion? Is there even
> > > > > > (experimental) code we might be able to use? Note that we're using
> > > > > > petsc4py to interface to PETSc.
> > > > >
> > > > > Did you have a look at snes/examples/tutorials/ex52? I'm currently
> > > > > converting/extending this to OpenCL, so it serves as a playground for
> > > > > a future interface. Matt might have some additional comments on this.
> > > >
> > > > I like to be very precise in the terminology. Doing the cell integrals
> > > > on the GPU (integration) is worthwhile, whereas inserting the element
> > > > matrices into a global representation like CSR (assembly) takes no time
> > > > and can be done almost any way, including on the CPU. I stopped working
> > > > on assembly because it made no difference.
> > >
> > > The actual insertion (as in MatSetValues) may not take up much time on
> > > either the CPU or the GPU, provided it is done where the integration was
> > > done. As I mentioned before, we do both the integration and the solve on
> > > the GPU. We don't even allocate data in host memory. Therefore it
> > > wouldn't make much sense to do the addto on the host, since it would
> > > require a device -> host transfer of all the cell integrals and a
> > > host -> device transfer of the CSR, which would make it quite expensive.
> > >
> > > One option we considered was creating a MatShell and providing an SpMV
> > > callback, probably calling a CUSP kernel on each MPI rank. That restricts
> > > the available preconditioners, but as mentioned, without doing any data
> > > transfers we'd be restricted to GPU-only preconditioners anyway. Any
> > > thoughts on this compared to the strategies mentioned above?
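
A minimal sketch of that route with the C API; the GPU SpMV call itself is
only a placeholder comment, and the context layout is made up:

  /* Sketch: a MatShell whose mult callback is where a device-side SpMV
     (e.g. a CUSP kernel) would be launched. */
  #include <petscmat.h>

  typedef struct {
    void *device_mat;            /* handle to the device-side CSR matrix */
  } ShellCtx;

  static PetscErrorCode MyMult(Mat A, Vec x, Vec y)
  {
    ShellCtx *ctx;

    PetscFunctionBeginUser;
    MatShellGetContext(A, &ctx);
    /* launch_gpu_spmv(ctx->device_mat, x, y);  placeholder for the kernel */
    PetscFunctionReturn(0);
  }

  int main(int argc, char **argv)
  {
    Mat      A;
    ShellCtx ctx = {NULL};
    PetscInt nlocal = 10;        /* example local size */

    PetscInitialize(&argc, &argv, NULL, NULL);
    MatCreateShell(PETSC_COMM_WORLD, nlocal, nlocal, PETSC_DETERMINE,
                   PETSC_DETERMINE, &ctx, &A);
    MatShellSetOperation(A, MATOP_MULT, (void (*)(void))MyMult);
    /* A can now be handed to a KSP, with a PC that only needs the action. */
    MatDestroy(&A);
    PetscFinalize();
    return 0;
  }
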
> >
> > What about just creating your CUSP matrix and then shoving it into a
> > MATAIJCUSP? That is what I did for my assembly tests.
>
> That'd be the ideal solution. Does this work with MPIAIJ? We're only
> really interested in multi-GPU with MPI. In the sequential case we can
> just call Cusp directly, but for the MPI distributed case we'd rather
> rely on PETSc to help us out.

You would have to create the diagonal and off-diagonal matrices yourself.
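
Concretely, MatCreateMPIAIJWithSplitArrays() takes exactly that split (it is
also how MPIAIJ stores its data internally). A sketch with toy data, giving
each rank a 2x2 diagonal block and an empty off-diagonal block:

  #include <petscmat.h>

  int main(int argc, char **argv)
  {
    Mat         A;
    PetscMPIInt size;
    PetscInt    di[3] = {0, 1, 2}, dj[2] = {0, 1};   /* diagonal block CSR   */
    PetscScalar da[2] = {4.0, 4.0};
    PetscInt    oi[3] = {0, 0, 0}, oj[1] = {0};      /* off-diag: no entries */
    PetscScalar oa[1] = {0.0};

    PetscInitialize(&argc, &argv, NULL, NULL);
    MPI_Comm_size(PETSC_COMM_WORLD, &size);
    /* See the man page for the index conventions of the two blocks. */
    MatCreateMPIAIJWithSplitArrays(PETSC_COMM_WORLD, 2, 2, 2*size, 2*size,
                                   di, dj, da, oi, oj, oa, &A);
    MatView(A, PETSC_VIEWER_STDOUT_WORLD);
    MatDestroy(&A);
    PetscFinalize();
    return 0;
  }
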
> Presumably you're referring to the experiments you did for the TOMS
> paper? Is that code available somewhere?

No, it's for the TOMS paper I did not write because the result was not
interesting enough, I thought. The code is in PETSc.
> > For GPU-only preconditioners, I would focus on the Cusp AMG using
> > Chebyshev for the smoothers.
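
For illustration, with PETSc's own GAMG standing in for the Cusp
smoothed-aggregation AMG, that smoother setup is selected purely through
runtime options, roughly:

  -pc_type gamg -mg_levels_ksp_type chebyshev -mg_levels_pc_type jacobi
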
>
> OK. Again we'd have to create our own PCShell for this when using
> MatShell, if I understand correctly?

I don't think so, since Cheby just uses a matrix action.
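
For example, Chebyshev can be applied straight to a shell matrix like the
one sketched above, with no PCShell involved. The eigenvalue bounds below
are made-up placeholders, and KSPSetOperators() is the current two-matrix
calling sequence:

  /* Sketch: Chebyshev only needs MatMult, so a MatShell works directly. */
  #include <petscksp.h>

  static PetscErrorCode SolveWithChebyshev(Mat A_shell, Vec b, Vec x)
  {
    KSP ksp;
    PC  pc;

    PetscFunctionBeginUser;
    KSPCreate(PETSC_COMM_WORLD, &ksp);
    KSPSetOperators(ksp, A_shell, A_shell);
    KSPSetType(ksp, KSPCHEBYSHEV);
    KSPChebyshevSetEigenvalues(ksp, 2.0, 0.1);  /* (emax, emin) placeholders */
    KSPGetPC(ksp, &pc);
    PCSetType(pc, PCNONE);                      /* only the matrix action */
    KSPSolve(ksp, b, x);
    KSPDestroy(&ksp);
    PetscFunctionReturn(0);
  }
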
<span class="HOEnZb"><font color="#888888"><br>
Florian<br>
<br>
</font></span></blockquote></div><br><br clear="all"><div><br></div>-- <br>What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.<br>
-- Norbert Wiener
</div></div>