<div dir="ltr">On Wed, May 1, 2013 at 8:52 PM, Karl Rupp <span dir="ltr"><<a href="mailto:rupp@mcs.anl.gov" target="_blank">rupp@mcs.anl.gov</a>></span> wrote:<br><div class="gmail_extra"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Hi Florian,<div class="im"><br>

<br>

> This is loosely a follow up to [1]. In this thread a few potential ways<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

for making GPU assembly work with PETSc were discussed and to me the two<br>

most promising appeared to be:<br>

1) Create a PETSc matrix from a pre-assembled CSR structure, or<br>

2) Preallocate a PETSc matrix and get the handle to pass the row<br>

pointer, column indices and values array to a custom assembly routine.<br>

</blockquote>

<br></div>

I still consider these two to be the most promising (and general) approaches. On the other hand, to my knowledge the infrastructure hasn't changed a lot since then. Some additional functionality from CUSPARSE was added, and I added ViennaCL bindings to branch 'next' (i.e. there are still a few corners to polish). This means that you could technically use the much more JIT-friendly OpenCL (and, as a follow-up, complain to NVIDIA and AMD about the higher latencies compared to CUDA).<div class="im">

<br>

<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

We compute<br>

local assembly matrices on the GPU and a crucial requirement is that the<br>

matrix *only* lives in device memory, we want to avoid any host <-><br>

device data transfers.<br>

</blockquote>

<br></div>

One of the reasons why this hasn't taken off, despite its attractiveness, is that good preconditioners are typically still required in such a setting. Other than the smoothed aggregation in CUSP, there is not much which does *not* require a copy to the host. Particularly when thinking about multi-GPU, you're entering the regime where a good preconditioner on the CPU will still outperform a GPU assembly with a poor preconditioner.<div class="im">

<br>

<br>

<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

So far we have been using CUSP with a custom (generated) assembly into<br>

our own CUSP-compatible CSR data structure for a single GPU. Since CUSP<br>

doesn't give us multi-GPU solvers out of the box we'd rather use<br>

existing infrastructure that works rather than rolling our own.<br>

</blockquote>

<br></div>

I guess this is good news for you: Steve Dalton will work with us during the summer to extend the CUSP-SA-AMG to distributed memory. Other than that, I think there's currently only the functionality from CUSPARSE and polynomial preconditioners, available through the txpetscgpu package.<br>


<br>

Aside from that I also have a couple of plans on that front spinning in my head, but I haven't found the time to implement them yet.<div class="im"><br>

<br>

<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

At the time of [1] supporting GPU assembly in one form or the other was<br>

on the roadmap, but the implementation direction seemed to not have been<br>

finally decided. Was there any progress since then or anything to add to<br>

the discussion? Is there even (experimental) code we might be able to<br>

use? Note that we're using petsc4py to interface to PETSc.<br>

</blockquote>

<br></div>

Did you have a look at snes/examples/tutorials/ex52? I'm currently converting/extending this to OpenCL, so it serves as a playground for a future interface. Matt might have some additional comments on this.<br></blockquote>

<div><br></div><div style>I like to be very precise in the terminology. Doing the cell integrals on the GPU (integration) is worthwhile, whereas</div><div style>inserting the element matrices into a global representation like CSR (assembly) takes no time and can be done</div>

<div style>almost any way including on the CPU. I stopped working on assembly because it made no difference.</div><div style><br></div><div style>  Thanks,</div><div style><br></div><div style>     Matt</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">


Best regards,<br>

Karli<br>

<br>

</blockquote></div><br><br clear="all"><div><br></div>-- <br>What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.<br>

-- Norbert Wiener

</div></div>
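<div>To make the integration/assembly distinction from the reply above concrete, here is a minimal pure-Python sketch. It is not PETSc or CUSP code: the function names <code>integrate_cells</code> and <code>assemble_csr</code> are hypothetical, and the 1D Laplacian on a uniform mesh with linear elements is an assumption chosen only to keep the example small. The first function stands in for the per-cell work that would run on the GPU, the second for the scatter into a global CSR structure.</div>

```python
def integrate_cells(n_cells, h):
    """Integration: compute one 2x2 element matrix per cell.

    Each cell is independent of the others, which is why this step
    maps well onto a GPU. (Assumed model problem: 1D Laplacian,
    linear elements, uniform cell width h.)
    """
    k = 1.0 / h
    return [[[k, -k], [-k, k]] for _ in range(n_cells)]


def assemble_csr(element_matrices):
    """Assembly: scatter the element matrices into a global CSR matrix."""
    n_nodes = len(element_matrices) + 1
    rows = [dict() for _ in range(n_nodes)]  # column -> value, per row

    for cell, ke in enumerate(element_matrices):
        dofs = (cell, cell + 1)  # global node numbers of this cell
        for a in range(2):
            for b in range(2):
                row, col = dofs[a], dofs[b]
                rows[row][col] = rows[row].get(col, 0.0) + ke[a][b]

    # Flatten the per-row dictionaries into the three CSR arrays.
    indptr, indices, values = [0], [], []
    for row in rows:
        for col in sorted(row):
            indices.append(col)
            values.append(row[col])
        indptr.append(len(indices))
    return indptr, indices, values


indptr, indices, values = assemble_csr(integrate_cells(n_cells=3, h=0.5))
print(indptr)
print(indices)
print(values)
```

<div>The resulting row pointer, column index and value arrays are the kind of pre-assembled CSR data that approach 1) in the quoted message would hand over to PETSc.</div>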