[petsc-users] MatMult

Mon Dec 13 01:29:16 CST 2010

Hi,

Is the MatMult function performed on the GPU? When I prepared a program that
just executes this function with the parameters -vec_type cuda and -mat_type
seqaijcuda, I did not see any VecCUDACopyTo entry in the summary log.

On Sat, 2010-12-11 at 11:50 -0600, Barry Smith wrote:
> To answer this you need to understand that PETSc copies vectors and matrices to the GPU memory "on demand" (that is exactly when they are first needed on the GPU, and not before) and once it has copied to the GPU it keeps track of it and will NOT copy it down again if it is already there.
> 
>    Hence in your run below, yes it includes the copy time down. 
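
The "on demand" bookkeeping described above can be pictured with a small toy model. This is only a sketch and not PETSc's actual implementation: the class LazyDeviceArray, its fields, and the function device_dot are hypothetical names made up for illustration, and the rule that changing the host data invalidates the device copy is an assumption, not something stated in this thread.

```python
class LazyDeviceArray:
    """Toy model of on-demand host-to-device copying (hypothetical, not PETSc code)."""

    def __init__(self, host_data):
        self.host_data = list(host_data)
        self.device_data = None      # stands in for GPU memory
        self.device_valid = False    # True once the device copy is up to date
        self.copy_count = 0          # number of simulated host-to-device copies

    def to_device(self):
        # Copy only if the device copy is missing or stale.
        if not self.device_valid:
            self.device_data = list(self.host_data)
            self.device_valid = True
            self.copy_count += 1
        return self.device_data

    def set_host(self, host_data):
        # ASSUMPTION: changing the host data invalidates the device copy.
        self.host_data = list(host_data)
        self.device_valid = False


def device_dot(x, y):
    """Simulated GPU kernel: triggers a copy only on first use of each operand."""
    xd, yd = x.to_device(), y.to_device()
    return sum(a * b for a, b in zip(xd, yd))


x = LazyDeviceArray([1.0, 2.0, 3.0])
y = LazyDeviceArray([4.0, 5.0, 6.0])
results = [device_dot(x, y) for _ in range(20)]
print(results[0], x.copy_count, y.copy_count)
```

In this model the first call pays for the copies and the remaining 19 calls reuse the data already on the device, which is why a single timed call includes the copy.
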
> 
>    But note that ONE multiply on the GPU is absurd; it does not make sense to copy a matrix down to the GPU and then do ONE multiply with it. Thus I NEVER do "standalone" benchmarking where a single kernel is called by itself once; the timing results are useless. Always run a FULL application with -log_summary; for example, in this case, a full KSPSolve() that requires a bunch of iterations. Then you can look at the performance of each kernel. The reason to do it this way is that the numbers can be very different, and what matters is performance in APPLICATIONS, so that is what should be measured.
> 
>    If, say, you run KSP with 20 iterations, then the time to copy the matrix down to the GPU is amortized over those 20 iterations and thus may be OK. You should see the flop rate for the MatMult() go up in this case.
> 
>    You may have noticed we have a log entry for VecCopyToGPU(); we will be adding one for matrices as well, so you will be able to see how long the copy takes. But note that the copy time is still counted in the MatMult() time if the first copy of the matrix to the GPU is triggered by the MatMult. You can subtract the copy time from the mult time to get the per-multiply time; this corresponds to the multiply time in the limit of a single copy down and many, many multiplies on the GPU.
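
The amortization argument can be worked through with the numbers from the MatMult line quoted further down (2.0237e-02 s and 2.98e+06 flops for one call). The log does not report the copy time, so the 90% copy fraction below is purely a hypothetical value chosen for illustration, and the helper name mflops is made up. A rough sketch:

```python
# Values taken from the MatMult line of the quoted -log_summary output.
MATMULT_TIME_S = 2.0237e-02   # time for 1 MatMult call, including the first copy
MATMULT_FLOPS = 2.98e+06      # flops for 1 MatMult call


def mflops(n_mults, copy_time_s, mult_time_s, flops_per_mult=MATMULT_FLOPS):
    """Flop rate in Mflop/s when one copy is amortized over n_mults multiplies."""
    total_time = copy_time_s + n_mults * mult_time_s
    return n_mults * flops_per_mult / total_time / 1e6


# HYPOTHETICAL: assume the copy accounts for 90% of the measured time.
# The real fraction is not given anywhere in this thread.
ASSUMED_COPY_FRACTION = 0.9
copy_time = ASSUMED_COPY_FRACTION * MATMULT_TIME_S
# Subtract the copy time from the measured time to get the per-multiply time.
mult_time = MATMULT_TIME_S - copy_time

for n in (1, 20, 1000):
    print(n, round(mflops(n, copy_time, mult_time)))
```

With n = 1 this reproduces the 147 Mflop/s shown in the log. For larger n the rate rises toward the copy-free limit MATMULT_FLOPS / mult_time, whatever the true copy fraction turns out to be.
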
> 
>    Barry
> 
> 
> 
> 
> On Dec 11, 2010, at 8:32 AM, Jakub Pola wrote:
> 
> > Hello again,
> > 
> > I compiled one of the examples. I used a sparse matrix called 02-raefsky3.
> > I used -vec_type cuda and -mat_type seqaijcuda. 
> > 
> > When I look at the summary of the operations performed by the program, there is
> > 
> > MatMult 1 1.0 2.0237e-02 1.0 2.98e+06 1.0 0.0e+00 0.0e+00 0.0e+00  2100  0  0  0   2100  0  0  0   147
> > 
> > Does the MatMult time include the memory transfer for loading the
> > matrix into GPU memory, or just the computation time?
> > 
> > Thanks in advance. 
> > Kuba.
> > 
>