[petsc-dev] Performance of VecMDot_SeqCUSP

Daniel Lowell redratio1 at gmail.com
Tue Apr 24 14:12:33 CDT 2012


I'm writing a vector type which uses flag syncing like the Vec CUSP type in
PETSc, but it uses asynchronous kernel launches (pipelining, etc.) and
autotuned kernels. Not quite ready for prime time, but we have seen its value
in terms of speedup.
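For context, here is a minimal sketch of the kind of stream-based pipelining
described above; the chunking scheme, the trivial kernel, and the helper name
are illustrative assumptions, not the actual implementation:

#include <cuda_runtime.h>

__global__ void scale(double *x, double a, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] *= a;
}

/* Illustration only: split the work into chunks and give each chunk its own
 * stream so the copy for one chunk can overlap the kernel for another.
 * For the copies to actually overlap, hx and hy must be pinned host memory
 * (cudaHostAlloc). */
void pipelined_scale(const double *hx, double *hy, int n, double a)
{
  const int nchunks = 4;
  const int chunk   = (n + nchunks - 1) / nchunks;
  cudaStream_t stream[nchunks];
  double *dx;

  cudaMalloc(&dx, n * sizeof(double));
  for (int c = 0; c < nchunks; c++) cudaStreamCreate(&stream[c]);
  for (int c = 0; c < nchunks; c++) {
    int off = c * chunk;
    int len = (off + chunk > n) ? n - off : chunk;
    if (len <= 0) break;
    /* copy in, compute, and copy out this chunk on its own stream */
    cudaMemcpyAsync(dx + off, hx + off, len * sizeof(double),
                    cudaMemcpyHostToDevice, stream[c]);
    scale<<<(len + 255) / 256, 256, 0, stream[c]>>>(dx + off, a, len);
    cudaMemcpyAsync(hy + off, dx + off, len * sizeof(double),
                    cudaMemcpyDeviceToHost, stream[c]);
  }
  cudaDeviceSynchronize();                    /* wait for all streams */
  for (int c = 0; c < nchunks; c++) cudaStreamDestroy(stream[c]);
  cudaFree(dx);
}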

On Tue, Apr 24, 2012 at 2:09 PM, Jed Brown <jedbrown at mcs.anl.gov> wrote:

> On Tue, Apr 24, 2012 at 11:55, Jose E. Roman <jroman at dsic.upv.es> wrote:
>
>> It seems that VecMDot_SeqCUSP has rather poor performance. This has a
>> large impact on SLEPc because it is the main kernel used in the
>> orthogonalization of vectors.
>>
>> Is this due to the version of Thrust? I am using CUDA Toolkit 4.0.
>>
>> I tried a naive replacement that copies the contents of the vectors into
>> a matrix and calls CUBLAS dgemv (a sketch of this packing approach appears
>> after the quoted text below). The improvement is significant, despite the
>> data movement overhead. In some tests I see the time (VecReduceArith) drop
>> from 24.5 seconds to 9.6 seconds (with up to 200 vectors of length 10000)
>> on a Fermi.
>>
>> I can send the code for you to try.
>>
>
> That would be useful. The current code is horribly over-synchronous since
> it launches a new kernel for each set of three vectors. We want to use one
> kernel to process all the vectors. Unless there is another implementation
> somewhere, we'll probably want to write a custom CUDA kernel for this (to
> avoid having to copy all the vectors into contiguous memory); a sketch of
> such a kernel also appears below.
>
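As a concrete illustration of the workaround Jose describes (not his actual
code; the helper name and interface are assumptions), packing the vectors
into a column-major matrix and issuing a single dgemv looks roughly like this:

#include <cublas_v2.h>
#include <cuda_runtime.h>

/* dx: device vector of length n; dy[j]: device pointers to the m vectors;
 * hval: host array receiving the m dot products. Real (double) case only. */
void mdot_via_dgemv(cublasHandle_t handle, int n, int m,
                    const double *dx, const double **dy, double *hval)
{
  double *dY, *dval;
  const double one = 1.0, zero = 0.0;

  cudaMalloc(&dY, (size_t)n * m * sizeof(double));
  cudaMalloc(&dval, m * sizeof(double));
  /* the data movement overhead: copy each vector into a column of Y */
  for (int j = 0; j < m; j++)
    cudaMemcpy(dY + (size_t)j * n, dy[j], n * sizeof(double),
               cudaMemcpyDeviceToDevice);
  /* val = Y^T x : one gemv computes all m dot products */
  cublasDgemv(handle, CUBLAS_OP_T, n, m, &one, dY, n, dx, 1, &zero, dval, 1);
  cudaMemcpy(hval, dval, m * sizeof(double), cudaMemcpyDeviceToHost);
  cudaFree(dY);
  cudaFree(dval);
}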
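And a rough sketch of the kind of custom kernel Jed suggests, processing all
the vectors in one launch without packing them into contiguous memory (again
illustrative only, not the kernel that ended up in PETSc):

#include <cuda_runtime.h>

/* One thread block per vector y[j]; each block reduces y[j] . x into val[j].
 * y is a device array of m device pointers, so no packing is needed. */
__global__ void mdot_all(int n, const double *x, const double *const *y,
                         double *val)
{
  extern __shared__ double sdata[];
  const double *yj = y[blockIdx.x];   /* vector handled by this block */
  double sum = 0.0;
  for (int i = threadIdx.x; i < n; i += blockDim.x) sum += yj[i] * x[i];
  sdata[threadIdx.x] = sum;
  __syncthreads();
  /* tree reduction within the block (blockDim.x must be a power of two) */
  for (int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (threadIdx.x < s) sdata[threadIdx.x] += sdata[threadIdx.x + s];
    __syncthreads();
  }
  if (threadIdx.x == 0) val[blockIdx.x] = sdata[0];
}

/* launch with m blocks, one per vector, e.g.
 *   mdot_all<<<m, 256, 256*sizeof(double)>>>(n, dx, dy, dval);   */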



-- 
Daniel Lowell