[petsc-dev] PETSc multi-GPU assembly - current status

Paul Mullowney paulm at txcorp.com
Thu May 2 12:41:30 CDT 2013


A few things.
(1) Our implementation of the LSPP preconditioner is in a PCShell.
(2) The algorithm uses a Lanczos algorithm (I think) to compute the 
polynomial coefficients. However, it is limited to SPD matrices. The 
technique could be extended to non-symmetric matrices, I believe.

It would not be very hard to make LSPP available in PETSc so that it 
could be used on any hardware for any matrix. All one needs is an AXPY 
and a MatMult to apply the preconditioner. The setup phase would 
require porting our shell code into a PETSc class.
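To illustrate why AXPY and MatMult suffice for the application phase, here is a minimal sketch in plain NumPy. The helper `apply_poly_pc` and its coefficients are purely illustrative (a real LSPP would obtain the coefficients from the Lanczos-based setup; here they are arbitrary), and `matvec` stands in for MatMult while the scaled additions stand in for AXPY:

```python
import numpy as np

def apply_poly_pc(matvec, coeffs, r):
    """Apply z = p(A) r with p(A) = sum_k coeffs[k] * A^k.

    Only a mat-vec (MatMult) and scaled vector additions (AXPY) are
    used, so the same loop runs on any backend providing those two.
    """
    z = coeffs[0] * r          # AXPY-style update
    Akr = r
    for c in coeffs[1:]:
        Akr = matvec(Akr)      # MatMult: next power of A applied to r
        z = z + c * Akr        # AXPY
    return z

# Demo on a small SPD matrix with made-up polynomial coefficients.
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
r = np.array([1.0, 2.0])
coeffs = [1.0, -0.5, 0.25]

z = apply_poly_pc(lambda v: A @ v, coeffs, r)

# Direct evaluation of the same polynomial, for comparison.
expected = coeffs[0] * r + coeffs[1] * (A @ r) + coeffs[2] * (A @ (A @ r))
```

In a PCShell this loop would sit in the apply callback (PCShellSetApply), with PETSc Vec/Mat calls replacing the NumPy operations.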

I would be happy to share the PCShell and then discuss how to move it 
into the code.

-Paul


> Hmm, Paul mentioned the following paper a couple of weeks back:
>
> http://ieeexplore.ieee.org/xpl/articleDetails.jsp?reload=true&arnumber=6319205&contentType=Conference+Publications 
>
>
> from which I concluded that this is already part of the txpetscgpu 
> package. Paul, this is the case, isn't it?
>
> Best regards,
> Karli
>
>
>
>
>>
>> ________________________________________
>> From: petsc-dev-bounces at mcs.anl.gov [petsc-dev-bounces at mcs.anl.gov] 
>> on behalf of Karl Rupp [rupp at mcs.anl.gov]
>> Sent: Wednesday, May 01, 2013 7:52 PM
>> To: petsc-dev at mcs.anl.gov
>> Subject: Re: [petsc-dev] PETSc multi-GPU assembly - current status
>>
>> Hi Florian,
>>
>>> This is loosely a follow-up to [1]. In that thread a few potential ways
>>> for making GPU assembly work with PETSc were discussed, and to me the
>>> two most promising appeared to be:
>>> 1) Create a PETSc matrix from a pre-assembled CSR structure, or
>>> 2) Preallocate a PETSc matrix and get the handle to pass the row
>>> pointer, column indices and values arrays to a custom assembly routine.
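For reference, option 1) amounts to handing PETSc the three standard CSR arrays (row pointer, column indices, values). A small sketch using SciPy as a stand-in assembler; the petsc4py call in the trailing comment is how such arrays would typically be wrapped, but it is left unexecuted here as an assumption:

```python
import numpy as np
from scipy.sparse import csr_matrix

# A 3x3 example matrix in the three-array CSR form PETSc expects.
indptr  = np.array([0, 2, 3, 5], dtype=np.int32)  # row pointer
indices = np.array([0, 2, 1, 0, 2], dtype=np.int32)  # column indices
data    = np.array([1.0, 2.0, 3.0, 4.0, 5.0])        # values

A = csr_matrix((data, indices, indptr), shape=(3, 3))

# Sanity check: SpMV through the CSR arrays matches the dense product.
x = np.ones(3)
y = A @ x

# With petsc4py the same arrays could be wrapped without copying
# (sketch, assuming petsc4py is available; not executed here):
#   from petsc4py import PETSc
#   M = PETSc.Mat().createAIJWithArrays((3, 3), (indptr, indices, data))
```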
>>
>> I still consider these two to be the most promising (and general)
>> approaches. On the other hand, to my knowledge the infrastructure hasn't
>> changed a lot since then. Some additional functionality from CUSPARSE
>> was added, while I added ViennaCL-bindings to branch 'next' (i.e. still
>> a few corners to polish). This means that you could technically use the
>> much more JIT-friendly OpenCL (and, as a follow-up, complain to NVIDIA
>> and AMD about the higher latencies than with CUDA).
>>
>>> We compute
>>> local assembly matrices on the GPU, and a crucial requirement is that
>>> the matrix *only* lives in device memory; we want to avoid any host <->
>>> device data transfers.
>>
>> One of the reasons why - despite its attractiveness - this hasn't taken
>> off is that good preconditioners are typically still required in such
>> a setting. Other than the smoothed aggregation in CUSP, there is not
>> much that does *not* require a copy to the host. Particularly when
>> thinking about multi-GPU, you're entering the regime where a good
>> preconditioner on the CPU will still outperform a GPU assembly with a
>> poor preconditioner.
>>
>>
>>> So far we have been using CUSP with a custom (generated) assembly into
>>> our own CUSP-compatible CSR data structure for a single GPU. Since CUSP
>>> doesn't give us multi-GPU solvers out of the box we'd rather use
>>> existing infrastructure that works rather than rolling our own.
>>
>> I guess this is good news for you: Steve Dalton will work with us during
>> the summer to extend the CUSP-SA-AMG to distributed memory. Other than
>> that, I think there's currently only the functionality from CUSPARSE and
>> polynomial preconditioners, available through the txpetscgpu package.
>>
>> Aside from that, I have a couple of plans of my own on that front, but
>> I haven't found the time to implement them yet.
>>
>>
>>> At the time of [1], supporting GPU assembly in one form or another was
>>> on the roadmap, but the implementation direction did not seem to have
>>> been finally decided. Has there been any progress since then, or
>>> anything to add to the discussion? Is there even (experimental) code we
>>> might be able to use? Note that we're using petsc4py to interface to
>>> PETSc.
>>
>> Did you have a look at snes/examples/tutorials/ex52? I'm currently
>> converting/extending this to OpenCL, so it serves as a playground for a
>> future interface. Matt might have some additional comments on this.
>>
>> Best regards,
>> Karli
>>

More information about the petsc-dev mailing list