PETSc on GPUs

Harald Pfeiffer harald at tapir.caltech.edu
Fri Sep 19 15:35:30 CDT 2008


Hi All,

thanks for the many responses - people are certainly excited about GPUs 
and their possibilities.

Our group at Caltech & Cornell uses PETSc to compute initial data for 
collisions between two black holes.  The black hole evolution code is 
currently fully explicit, but we are working on implicit time-stepping 
(which heavily uses PETSc).  The trick Barry mentions below (isolate 
heavy calculations into a few computational kernels; rewrite those 
kernels for GPUs) will apply to our code, too:  There is one 
"data-class" which holds essentially all data, and it could easily be 
changed to keep the data in GPU memory.   All operations on the data are 
performed via expression templates, which end up in only a handful of 
routines with explicit loops over grid-points.  Making these few 
routines GPU-aware would then speed up all calculations (or so I hope).
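
As a concrete illustration, here is a minimal sketch of what porting one 
such explicit-loop routine to CUDA might look like.  All names are 
hypothetical, not taken from our code; the data-class is assumed to 
already hold its arrays in GPU memory:

    #include <cuda_runtime.h>

    /* Hypothetical pointwise kernel: y[i] = a*x[i] + y[i] over all grid
       points.  One thread per grid point replaces the CPU loop that the
       expression templates currently generate. */
    __global__ void axpy_kernel(int n, double a, const double *x, double *y)
    {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) y[i] = a * x[i] + y[i];
    }

    /* Host wrapper: if the data-class keeps x and y in GPU memory,
       this is the only kind of routine that has to change. */
    void axpy_gpu(int n, double a, const double *d_x, double *d_y)
    {
      int threads = 256;
      int blocks  = (n + threads - 1) / threads;
      axpy_kernel<<<blocks, threads>>>(n, a, d_x, d_y);
      cudaDeviceSynchronize();
    }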

My original email was driven by curiosity.  But when this gets more 
serious for us, I'll certainly write back with an update.

Cheers,
Harald


On 09/18/2008 02:09 PM, Barry Smith wrote:
> 
>    In implementing Vec on the GPU, a VecScatter would take the entries 
> directly from the GPU copy of the vector and put them appropriately 
> into the GPU copy of the other vector; thus it would be completely 
> transparent to the MatMult_MPIAIJ() code.  In other words, if you have 
> to modify MatMult_MPIAIJ IN ANY WAY for the GPU then there is 
> something wrong.
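
(A minimal sketch, in plain C, of the function-pointer dispatch that 
makes this transparency possible; hypothetical types and names, not 
actual PETSc source.  The matrix code calls through the vector's 
function table, so a GPU vector simply installs different pointers:)

    #include <stdio.h>

    typedef struct Vec_ Vec;
    struct Vec_ {
      void (*scatterbegin)(Vec *from, Vec *to);  /* set per Vec type */
      double *data;                              /* host or device pointer */
    };

    static void scatterbegin_cpu(Vec *from, Vec *to) { puts("CPU scatter"); }
    static void scatterbegin_gpu(Vec *from, Vec *to) { puts("GPU scatter"); }

    /* The "MatMult" layer never mentions the GPU: it just dispatches. */
    static void matmult_generic(Vec *x, Vec *lvec)
    {
      x->scatterbegin(x, lvec);  /* same call for CPU and GPU vectors */
      /* ... local part of the multiply would go here ... */
    }

    int main(void)
    {
      Vec lvec  = { 0, 0 };
      Vec x_cpu = { scatterbegin_cpu, 0 };
      Vec x_gpu = { scatterbegin_gpu, 0 };
      matmult_generic(&x_cpu, &lvec);  /* prints "CPU scatter" */
      matmult_generic(&x_gpu, &lvec);  /* prints "GPU scatter" */
      return 0;
    }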
> 
>    Barry
> 
> Note the VecScatter would need to marshal the values up from the GPU 
> into regular memory, do the appropriate message passing to the 
> receiving processes, and then marshal the values back down to the 
> destination GPU.  Since the current VecScatter handles all the message 
> passing, I think the GPU VecScatter() could be written to 1) move the 
> needed GPU values up to regular memory (inside work MPI Vecs), then 
> 2) use a regular VecScatter on those work vectors, then 3) move the 
> results back down from regular memory to the GPU.
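
(A minimal sketch of those three steps; the names are hypothetical, the 
work Vecs and the scatter context are assumed to have been created 
already, and PetscScalar is assumed to be double:)

    #include <petscvec.h>
    #include <cuda_runtime.h>

    PetscErrorCode GPUVecScatter(double *d_src, double *d_dst,
                                 PetscInt nsrc, PetscInt ndst,
                                 Vec xwork, Vec ywork, VecScatter ctx)
    {
      PetscErrorCode ierr;
      PetscScalar    *a;

      /* 1) move the needed GPU values up into the regular-memory work Vec */
      ierr = VecGetArray(xwork, &a); CHKERRQ(ierr);
      cudaMemcpy(a, d_src, nsrc*sizeof(double), cudaMemcpyDeviceToHost);
      ierr = VecRestoreArray(xwork, &a); CHKERRQ(ierr);

      /* 2) a regular VecScatter (which handles all the MPI) on the work Vecs */
      ierr = VecScatterBegin(ctx, xwork, ywork, INSERT_VALUES, SCATTER_FORWARD); CHKERRQ(ierr);
      ierr = VecScatterEnd(ctx, xwork, ywork, INSERT_VALUES, SCATTER_FORWARD); CHKERRQ(ierr);

      /* 3) move the results back down from regular memory to the GPU */
      ierr = VecGetArray(ywork, &a); CHKERRQ(ierr);
      cudaMemcpy(d_dst, a, ndst*sizeof(double), cudaMemcpyHostToDevice);
      ierr = VecRestoreArray(ywork, &a); CHKERRQ(ierr);
      return 0;
    }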
> 
> In other words, PETSc has been designed since 1991 to be able to reuse 
> almost all of its infrastructure, needing only new low-level kernels 
> written to run on GPUs. :-)
> 
> Fun stuff, sure beats working :-)
> 
> 
> 
> On Sep 18, 2008, at 3:18 PM, Richard Tran Mills wrote:
> 
>> Ahmed,
>>
>> If you take a look in
>>
>>  src/mat/impls/aij/mpi/
>>
>> at mpiaij.h and mpiaij.c, you can see that an MPIAIJ matrix is stored 
>> locally as two "sequential" matrices, A and B.  A contains the 
>> "diagonal" portion of the matrix, and B contains the "off-diagonal" 
>> portion.  See page 59 of the PETSc Users Manual for a quick 
>> illustration of how this partitioning works.  Because vectors are 
>> partitioned row-wise among processors in the same manner as the 
>> matrix, each process already has the vector entries required to form 
>> the portion of the matrix-vector product ("mat-vec") involving the 
>> diagonal portion 'A'; that part can be done with data that reside 
>> locally.  The off-diagonal portion 'B', however, requires gathering 
>> vector entries that reside off-processor.  If you look in mpiaij.c at 
>> MatMult_MPIAIJ, you will see the following lines:
>>
>>  ierr = VecScatterBegin(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
>>  ierr = (*a->A->ops->mult)(a->A,xx,yy);CHKERRQ(ierr);
>>  ierr = VecScatterEnd(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
>>  ierr = (*a->B->ops->multadd)(a->B,a->lvec,yy,yy);CHKERRQ(ierr);
>>
>> The mat-vec involving a->A can be completed without any VecScatter, 
>> but the one involving a->B needs the VecScatter to complete before it 
>> can be done.  (Note that the VecScatterBegin() occurs before the 
>> mat-vec for a->A so that communication may overlap with computation.)
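
(For completeness, a minimal user-level sketch using the PETSc calling 
conventions of this era, with made-up sizes; the scatter described 
above happens entirely inside MatMult, so user code never touches it:)

    #include <petscmat.h>

    int main(int argc, char **argv)
    {
      Mat            A;
      Vec            x, y;
      PetscErrorCode ierr;

      ierr = PetscInitialize(&argc, &argv, NULL, NULL); CHKERRQ(ierr);
      /* 10 local rows and columns per process; rough preallocation */
      ierr = MatCreateMPIAIJ(PETSC_COMM_WORLD, 10, 10, PETSC_DETERMINE,
                             PETSC_DETERMINE, 5, NULL, 2, NULL, &A); CHKERRQ(ierr);
      /* ... MatSetValues() calls would go here ... */
      ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY); CHKERRQ(ierr);
      ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY); CHKERRQ(ierr);
      ierr = VecCreateMPI(PETSC_COMM_WORLD, 10, PETSC_DETERMINE, &x); CHKERRQ(ierr);
      ierr = VecDuplicate(x, &y); CHKERRQ(ierr);
      ierr = VecSet(x, 1.0); CHKERRQ(ierr);
      /* VecScatterBegin/End and both local mat-vecs happen inside here */
      ierr = MatMult(A, x, y); CHKERRQ(ierr);
      ierr = MatDestroy(A); CHKERRQ(ierr);
      ierr = VecDestroy(x); CHKERRQ(ierr);
      ierr = VecDestroy(y); CHKERRQ(ierr);
      ierr = PetscFinalize();
      return 0;
    }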
>>
>> Hopefully this helps elucidate what Matt meant, and didn't just 
>> confuse you.
>>
>> --Richard
>>
>> Ahmed El Zein wrote:
>>> On Thu, 2008-09-18 at 10:20 -0500, Matthew Knepley wrote:
>>>>> A question that I had regarding the PETSc code when I was thinking
>>>>> about this was:
>>>>> You have the SeqAIJ matrix type and the MPIAIJ type built around it
>>>>> (or that is what I understand from the code).  So basically you
>>>>> implement the SeqAIJ type for the GPU and you get the MPI type for
>>>>> free?
>>>> Yes, that is true.  However, note that in the MPI step, you will
>>>> need a gather operation to get the second matrix multiply to work.
>>> Matt,
>>> Could you explain a bit more?
>>> Ahmed
>>
>> -- 
>> Richard Tran Mills, Ph.D.            |   E-mail: rmills at climate.ornl.gov
>> Computational Scientist              |   Phone:  (865) 241-3198
>> Computational Earth Sciences Group   |   Fax:    (865) 574-0405
>> Oak Ridge National Laboratory        |   http://climate.ornl.gov/~rmills
>>

-- 
Harald P. Pfeiffer                harald at tapir.caltech.edu
Theoretical Astrophysics          Phone (626) 395-8413
Caltech 130-33                    Fax   (626) 796-5675
Pasadena, CA 91125, USA



