[petsc-dev] VecMDot_SeqCUSP improved

Tue Mar 26 09:35:16 CDT 2013

Karli,
Thanks for doing this. Once I get my current PETSc fork into next, I 
will test this asap. I have a couple examples using GMRES and I hope 
this will make a big difference. I'll let you know what I find.
-Paul
> Hi Jose, Paul, and others,
>
> I worked today on VecMDot and came up with an implementation which is 
> faster than an iterated application of the standard cusp::blas::dot() 
> (which, if I'm not mistaken, just forwards to CUBLAS) if enough 
> vectors (more than about 6) are involved. For complex arithmetic, an iterated 
> application of cusp::blas::dotc() is used, since passing complex types 
> to CUDA kernels is fairly tricky within PETSc. Jose, any performance 
> feedback from within SLEPc is appreciated :-)
>
> The new implementation is based on custom kernels, only allocates a 
> little scratchpad memory and is thus more memory efficient than the 
> old version. Also, any unnecessary copying of data is avoided. This 
> should speed up GMRES quite a bit, though I haven't run any dedicated 
> GMRES benchmarks. Paul, I guess you have some samples at hand, don't you?
>
> Best regards,
> Karli
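
For anyone following the thread, here is a minimal CPU-side sketch in plain C of the difference between an iterated dot product and a fused multi-dot product for real scalars. It is not the CUDA kernel described above. The names mdot_iterated and mdot_fused are hypothetical, and the idea that the gain comes from loading each entry of x only once is an assumption, not something stated in the message.

```c
#include <assert.h>
#include <stddef.h>

/* Baseline (hypothetical name): one full pass over x per dot product,
 * so x is traversed nv times in total. */
void mdot_iterated(size_t n, size_t nv, const double *x,
                   const double *const *y, double *result)
{
    for (size_t j = 0; j < nv; ++j) {
        double sum = 0.0;
        for (size_t i = 0; i < n; ++i)
            sum += x[i] * y[j][i];
        result[j] = sum;
    }
}

/* Fused variant (hypothetical name): a single pass over x. Each x[i] is
 * loaded once and reused for all nv partial sums. */
void mdot_fused(size_t n, size_t nv, const double *x,
                const double *const *y, double *result)
{
    assert(x != NULL || n == 0);
    for (size_t j = 0; j < nv; ++j)
        result[j] = 0.0;
    for (size_t i = 0; i < n; ++i) {
        const double xi = x[i];
        for (size_t j = 0; j < nv; ++j)
            result[j] += xi * y[j][i];
    }
}
```

Both functions fill result[j] with the dot product of x and y[j]. A GPU version would additionally need a parallel reduction per result entry, which is presumably where a small scratchpad buffer comes in.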