[petsc-dev] Performance of VecMDot_SeqCUSP

Tue Apr 24 14:09:20 CDT 2012

On Tue, Apr 24, 2012 at 11:55, Jose E. Roman <jroman at dsic.upv.es> wrote:

> It seems that VecMDot_SeqCUSP has rather poor performance. This has a large
> impact on SLEPc because it is the main kernel used in the
> orthogonalization of vectors.
>
> Is this due to the version of Thrust? I am using CUDA Toolkit 4.0.
>
> I tried a naive replacement that copies the contents of the vectors into a
> matrix and calls CUBLAS dgemv. The improvement is significant, despite the
> data movement overhead. In some tests I see a reduction of time
> (VecReduceArith) from 24.5 seconds to 9.6 seconds (with up to 200 vectors
> of length 10000) on a Fermi.
>
> I can send the code for you to try.
>

That would be useful. The current code is horribly over-synchronous because
it launches a new kernel for each set of 3 vectors. We want a single kernel
that processes all the vectors. Unless there is another implementation
somewhere, we'll probably want to write a custom CUDA kernel for this (to
avoid having to copy all the vectors into contiguous memory).
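
To make the access pattern concrete, here is a minimal CPU-side sketch in
plain C of the single-pass idea. It is not the PETSc or CUSP implementation:
the name multi_dot and its signature are hypothetical, and real scalars are
assumed (no complex conjugation). The y vectors are passed as an array of
pointers, so nothing has to be copied into contiguous memory.

```c
#include <stddef.h>

/* Hypothetical helper, not PETSc API.
 * Computes z[j] = sum_i x[i] * y[j][i] for j = 0..nv-1 in one sweep over x.
 * y is an array of nv pointers, so the vectors may live anywhere in memory. */
void multi_dot(size_t n, size_t nv, const double *x,
               const double *const *y, double *z)
{
    for (size_t j = 0; j < nv; j++)
        z[j] = 0.0;

    for (size_t i = 0; i < n; i++) {
        const double xi = x[i];      /* read x[i] once, reuse it for every y[j] */
        for (size_t j = 0; j < nv; j++)
            z[j] += xi * y[j][i];
    }
}
```

A CUDA version would presumably follow the same shape: one launch that takes
the array of device pointers, with the sum over i done as a parallel
reduction. The dgemv replacement computes the same z, but only after packing
the vectors into a matrix.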