PETSc on GPUs

Barry Smith bsmith at mcs.anl.gov
Thu Sep 18 16:09:35 CDT 2008


    In implementing Vec on the GPU, a VecScatter would take the entries
directly from the GPU copy of one vector and put them appropriately
into the GPU copy of the other vector; thus it would be completely
transparent to the MatMult_MPIAIJ() code. In other words, if you have
to modify MatMult_MPIAIJ() in ANY WAY for the GPU then there is
something wrong.
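
The reason this works is that every Vec operation dispatches through a
per-object function-pointer table. A minimal sketch of the idea,
assuming a hypothetical GPU Vec implementation registered under the
name "gpu" (not an actual PETSc type); callers above this level never
change:

  Vec            x;
  PetscErrorCode ierr;

  ierr = VecCreate(PETSC_COMM_WORLD,&x);CHKERRQ(ierr);
  ierr = VecSetSizes(x,PETSC_DECIDE,n);CHKERRQ(ierr);
  /* "gpu" is a hypothetical implementation name; the GPU Vec would
     fill in its ops table (VecDot, VecAXPY, ...) with device kernels */
  ierr = VecSetType(x,"gpu");CHKERRQ(ierr);
  /* from here on, VecScatter, MatMult_MPIAIJ(), KSPSolve(), etc. call
     through x's ops table and run the GPU kernels unmodified */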

    Barry

Note the VecScatter would need to marshal the values up from the GPU
into regular memory, do the appropriate message passing to the
destination processes, and then marshal the values back down to the
destination GPU. Since the current VecScatter already handles all the
message passing, I think the GPU VecScatter() could be written to
1) move the needed GPU values up to regular memory (inside work MPI
Vecs), then 2) use a regular VecScatter on those work vectors, then
3) move the results down from regular memory to the GPU.
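
A minimal sketch of steps 1)-3), assuming CUDA and taking the device
arrays, the host work Vecs, and an existing host-side VecScatter as
given (the function and parameter names here are illustrative, not
PETSc API). For simplicity it copies the whole local array rather than
gathering only the entries the scatter actually sends:

  #include <petscvec.h>
  #include <cuda_runtime.h>

  static PetscErrorCode VecScatterGPU(VecScatter scat,Vec xwork,Vec ywork,
                                      PetscScalar *xdev,PetscScalar *ydev,
                                      PetscInt nx,PetscInt ny)
  {
    PetscErrorCode ierr;
    PetscScalar    *xhost,*yhost;

    /* 1) marshal the needed values up from the GPU into a host work Vec */
    ierr = VecGetArray(xwork,&xhost);CHKERRQ(ierr);
    cudaMemcpy(xhost,xdev,nx*sizeof(PetscScalar),cudaMemcpyDeviceToHost);
    ierr = VecRestoreArray(xwork,&xhost);CHKERRQ(ierr);

    /* 2) reuse the existing host VecScatter for all the message passing */
    ierr = VecScatterBegin(scat,xwork,ywork,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
    ierr = VecScatterEnd(scat,xwork,ywork,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);

    /* 3) marshal the results back down into the destination GPU array */
    ierr = VecGetArray(ywork,&yhost);CHKERRQ(ierr);
    cudaMemcpy(ydev,yhost,ny*sizeof(PetscScalar),cudaMemcpyHostToDevice);
    ierr = VecRestoreArray(ywork,&yhost);CHKERRQ(ierr);
    return 0;
  }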

In other words, PETSc has been designed since 1991 to reuse almost all
of its infrastructure, so that only new low-level kernels need to be
written to run on GPUs. :-)

Fun stuff, sure beats working :-)



On Sep 18, 2008, at 3:18 PM, Richard Tran Mills wrote:

> Ahmed,
>
> If you take a look in
>
>  src/mat/impls/aij/mpi/
>
> at mpiaij.h and mpiaij.c, you can see that an MPIAIJ matrix is
> stored locally as two "sequential" matrices, A and B.  A contains the
> "diagonal" portion of the matrix, and B contains the "off-diagonal"
> portion.  See page 59 of the PETSc User manual for a quick
> illustration of how this partitioning works.  Because vectors are
> partitioned row-wise among processes in the same manner as the
> matrix, each process already has the vector entries required to form
> the portion of the matrix-vector product ("mat-vec") involving the
> diagonal portion 'A'; that part can be done entirely with data that
> reside locally.  The off-diagonal portion 'B', however, requires
> gathering vector entries that reside off-processor.  If you look in
> mpiaij.c at MatMult_MPIAIJ(), you will see the following lines:
>
>  /* begin gathering the off-process entries of xx into a->lvec */
>  ierr = VecScatterBegin(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
>  /* yy = A*xx, the local (diagonal-block) part, overlapping the communication */
>  ierr = (*a->A->ops->mult)(a->A,xx,yy);CHKERRQ(ierr);
>  /* wait for the gathered entries to arrive */
>  ierr = VecScatterEnd(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
>  /* yy += B*lvec, the off-diagonal-block part, using the gathered entries */
>  ierr = (*a->B->ops->multadd)(a->B,a->lvec,yy,yy);CHKERRQ(ierr);
>
> The mat-vec involving a->A can be completed without any VecScatter,
> but the one involving a->B needs the VecScatter to complete before it
> can be done.  (Note that the VecScatterBegin() occurs before the
> mat-vec for a->A so that communication can overlap with the local
> computation.)
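>
> For example, for a 4x4 matrix distributed across two processes, with
> rows 0-1 on process 0 and rows 2-3 on process 1, process 0's A is
> the 2x2 block in columns 0-1 and its B is the 2x2 block in columns
> 2-3.  B's contribution therefore needs x_2 and x_3, which live on
> process 1 and arrive through the VecScatter into the work vector
> a->lvec.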
>
> Hopefully this helps elucidate what Matt meant, and didn't just  
> confuse you.
>
> --Richard
>
> Ahmed El Zein wrote:
>> On Thu, 2008-09-18 at 10:20 -0500, Matthew Knepley wrote:
>>>> A question that I had regarding the PETSc code when I was thinking
>>>> about this was:
>>>> You have the SeqAIJ matrix type and the MPIAIJ type built around it
>>>> (or that is what I understand from the code). So basically you
>>>> implement the SeqAIJ type for the GPU and you get the MPI type for
>>>> free?
>>> Yes, that is true. However, note that in the MPI step, you will
>>> need a gather operation to get the second matrix multiply to work.
>> Matt,
>> Could you explain a bit more?
>> Ahmed
>
> -- 
> Richard Tran Mills, Ph.D.            |   E-mail: rmills at climate.ornl.gov
> Computational Scientist              |   Phone:  (865) 241-3198
> Computational Earth Sciences Group   |   Fax:    (865) 574-0405
> Oak Ridge National Laboratory        |   http://climate.ornl.gov/~rmills
>



