PETSc on GPUs
Harald Pfeiffer
harald at tapir.caltech.edu
Thu Sep 25 22:44:27 CDT 2008
Hi,
I ran across another cool way to speed up PETSc on GPUs:
GPUs have much higher single-precision FLOP rates than double-precision
ones. For iterative solvers, the update computation can easily be done in
single precision, and that is where almost all of the work is. For example,
one could run Newton-Raphson such that:
- the non-linear residual is computed in double precision;
- the inversion of the Jacobian is done in single precision.
One gets the single-precision FLOP rate while keeping double-precision
accuracy in the final result.
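A minimal sketch of the idea (plain C, not PETSc code; the toy problem
f(x) = x*x - 2 and the tolerances are just illustrative assumptions):

#include <stdio.h>
#include <math.h>

int main(void)
{
  double x = 1.0;                 /* iterate kept in double           */
  for (int it = 0; it < 50; it++) {
    double r = x*x - 2.0;         /* non-linear residual in double    */
    if (fabs(r) < 1e-14) break;   /* converged to double accuracy     */
    float J  = (float)(2.0*x);    /* Jacobian formed in single        */
    float dx = (float)r / J;      /* Jacobian "inversion" in single   */
    x -= (double)dx;              /* update accumulated in double     */
  }
  printf("x = %.16g, sqrt(2) = %.16g\n", x, sqrt(2.0));
  return 0;
}

The cheap single-precision correction limits each step, but the
double-precision residual still drives the iterate to full double accuracy.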
Harald
On 09/18/2008 02:09 PM, Barry Smith wrote:
>
> In implementing Vec on the GPU, a VecScatter would take the entries
> directly from the GPU copy of one vector and put them appropriately into
> the GPU copy of the other vector; thus it would be completely transparent
> to the MatMult_MPIAIJ() code. In other words, if you have to modify
> MatMult_MPIAIJ IN ANY WAY for the GPU then something is wrong.
>
> Barry
>
> Note the VecScatter would need to marshal the values up from the GPU into
> regular memory, do the appropriate message passing to the destination
> processes, and then marshal the values back down to the destination GPU.
> Since the current VecScatter handles all the message passing, I think the
> GPU VecScatter() could be written to 1) move the needed GPU values up to
> regular memory (inside work MPI Vecs), then 2) use a regular VecScatter on
> those work vectors, then 3) move the results down from regular memory to
> the GPU.
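A rough sketch of those three steps (the routine name and the work-Vec
handles xwork/ywork are hypothetical, and a real double-precision PETSc
build is assumed; only the VecScatter/VecGetArray/cudaMemcpy calls are
actual API):

#include <petscvec.h>
#include <cuda_runtime.h>

/* x_gpu/y_gpu: device copies of the source and target Vec entries;
   xwork/ywork: host work Vecs with the same local sizes;
   scat: an ordinary (host) VecScatter built for xwork -> ywork.     */
PetscErrorCode VecScatterGPU_Sketch(const PetscScalar *x_gpu,PetscScalar *y_gpu,
                                    Vec xwork,Vec ywork,VecScatter scat)
{
  PetscErrorCode ierr;
  PetscScalar    *a;
  PetscInt       n;

  PetscFunctionBegin;
  /* 1) move the needed GPU values up to regular memory */
  ierr = VecGetLocalSize(xwork,&n);CHKERRQ(ierr);
  ierr = VecGetArray(xwork,&a);CHKERRQ(ierr);
  cudaMemcpy(a,x_gpu,n*sizeof(PetscScalar),cudaMemcpyDeviceToHost);
  ierr = VecRestoreArray(xwork,&a);CHKERRQ(ierr);

  /* 2) use a regular VecScatter on the host work vectors */
  ierr = VecScatterBegin(scat,xwork,ywork,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
  ierr = VecScatterEnd(scat,xwork,ywork,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);

  /* 3) move the results back down from regular memory to the GPU */
  ierr = VecGetLocalSize(ywork,&n);CHKERRQ(ierr);
  ierr = VecGetArray(ywork,&a);CHKERRQ(ierr);
  cudaMemcpy(y_gpu,a,n*sizeof(PetscScalar),cudaMemcpyHostToDevice);
  ierr = VecRestoreArray(ywork,&a);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}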
>
> In other words, PETSc has been designed since 1991 to be able to reuse
> almost all of its infrastructure; only new low-level kernels need to be
> written to run on GPUs. :-)
>
> Fun stuff, sure beats working :-)
>
>
>
> On Sep 18, 2008, at 3:18 PM, Richard Tran Mills wrote:
>
>> Ahmed,
>>
>> If you take a look in
>>
>> src/mat/impls/aij/mpi/
>>
>> at mpiaij.h and mpiaij.c, you can see that an MPIAIJ matrix is stored
>> locally as two "sequential" matrices, A and B. A contains the
>> "diagonal" portion of the matrix, and B contains the "off-diagonal"
>> portion. See page 59 of the PETSc Users Manual for a quick
>> illustration of how this partitioning works. Because vectors are
>> partitioned row-wise among processors in the same manner as the
>> matrix, each process already has the vector entries required to form
>> the portion of the matrix-vector product ("mat-vec") involving the
>> diagonal portion 'A', so that part can be done with data that reside
>> locally. The off-diagonal portion 'B', however, requires gathering
>> vector entries that reside off-processor. If you look in mpiaij.c at
>> MatMult_MPIAIJ, you will see the following lines:
>>
>> ierr = VecScatterBegin(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
>> ierr = (*a->A->ops->mult)(a->A,xx,yy);CHKERRQ(ierr);
>> ierr = VecScatterEnd(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
>> ierr = (*a->B->ops->multadd)(a->B,a->lvec,yy,yy);CHKERRQ(ierr);
>>
>> The mat-vec involving a->A can be completed without any VecScatter,
>> but the one involving a->B needs the VecScatter to complete before it
>> can be done. (Note that the VecScatterBegin() occurs before the
>> mat-vec for a->A so that there might be some overlap of communication
>> and computation).
>>
>> Hopefully this helps elucidate what Matt meant, and didn't just
>> confuse you.
>>
>> --Richard
>>
>> Ahmed El Zein wrote:
>>> On Thu, 2008-09-18 at 10:20 -0500, Matthew Knepley wrote:
>>>>> A question that I had regarding the PETSc code when I was thinking
>>>>> about this was:
>>>>> You have the SeqAIJ matrix type and the MPIAIJ type built around it
>>>>> (or that is what I understand from the code). So basically you
>>>>> implement the SeqAIJ type for the GPU and you get the MPI type for
>>>>> free?
>>>> Yes, that is true. However, note that in the MPI step, you will need
>>>> a gather operation to get the second matrix multiply to work.
>>> Matt,
>>> Could you explain a bit more?
>>> Ahmed
>>
>> --
>> Richard Tran Mills, Ph.D. | E-mail: rmills at climate.ornl.gov
>> Computational Scientist | Phone: (865) 241-3198
>> Computational Earth Sciences Group | Fax: (865) 574-0405
>> Oak Ridge National Laboratory | http://climate.ornl.gov/~rmills
>>
--
Harald P. Pfeiffer harald at tapir.caltech.edu
Theoretical Astrophysics Phone (626) 395-8413
Caltech 130-33 Fax (626) 796-5675
Pasadena, CA 91125, USA