[petsc-dev] VecMDot_SeqCUSP improved

Mon Mar 25 20:41:42 CDT 2013

Hi Jose, Paul, and others,

I worked on VecMDot today and came up with an implementation which is 
faster than an iterated application of the standard cusp::blas::dot() 
(which, if I'm not mistaken, just forwards to CUBLAS) once enough vectors 
(more than about 6) are involved. For complex arithmetic, an iterated application of 
cusp::blas::dotc() is used, since passing complex types to CUDA kernels 
is fairly tricky within PETSc. Jose, any performance feedback from 
within SLEPc is appreciated :-)
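
To illustrate why a fused multi-dot can beat repeated single dot products, 
here is a minimal CPU sketch in C++. It is not the PETSc or CUSP code: the 
function name multi_dot_fused and the use of std::vector are assumptions made 
for illustration only, and it covers real arithmetic only. The point is that 
each entry of x is loaded once and reused for every y[j], whereas an iterated 
dot() re-reads x for each vector.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not a PETSc/CUSP function):
// computes result[j] = sum_i x[i] * ys[j][i] in a single sweep over x.
std::vector<double> multi_dot_fused(const std::vector<double>& x,
                                    const std::vector<std::vector<double>>& ys) {
  for (const auto& y : ys)
    assert(y.size() == x.size());    // all vectors must have the same length

  std::vector<double> result(ys.size(), 0.0);
  for (std::size_t i = 0; i < x.size(); ++i) {
    const double xi = x[i];          // x[i] is loaded once ...
    for (std::size_t j = 0; j < ys.size(); ++j)
      result[j] += xi * ys[j][i];    // ... and reused for every y[j]
  }
  return result;
}
```

With m vectors of length n, the iterated variant reads roughly 2*m*n values, 
while the fused variant reads roughly (m+1)*n, which is presumably where the 
gain for larger m comes from.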

The new implementation is based on custom kernels, allocates only a 
small amount of scratchpad memory, and is thus more memory efficient than 
the old version. Unnecessary copying of data is also avoided. This should 
speed up GMRES quite a bit, though I haven't run any dedicated GMRES 
benchmarks yet. Paul, I guess you have some samples at hand, don't you?
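
As a rough picture of how a small scratchpad can suffice, the following C++ 
sketch emulates a two-stage reduction on the CPU. This is an assumption about 
the general technique, not a description of the actual kernels: the function 
name multi_dot_two_stage, the num_blocks parameter, and the scratch layout are 
all hypothetical. Each "block" writes one partial sum per vector into the 
scratchpad, and a second stage adds the partial sums up.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU emulation of a two-stage multi-dot reduction.
// Names and the block layout are illustrative assumptions.
std::vector<double> multi_dot_two_stage(const std::vector<double>& x,
                                        const std::vector<std::vector<double>>& ys,
                                        std::size_t num_blocks) {
  assert(num_blocks > 0);
  const std::size_t n = x.size();
  const std::size_t m = ys.size();
  for (const auto& y : ys)
    assert(y.size() == n);

  // Scratchpad: one partial sum per (block, vector) pair.
  // Its size depends on num_blocks and m, not on the vector length n.
  std::vector<double> scratch(num_blocks * m, 0.0);
  const std::size_t chunk = (n + num_blocks - 1) / num_blocks;

  // Stage 1: each block reduces its own slice of x against every y[j].
  for (std::size_t b = 0; b < num_blocks; ++b) {
    const std::size_t begin = std::min(n, b * chunk);
    const std::size_t end = std::min(n, begin + chunk);
    for (std::size_t i = begin; i < end; ++i)
      for (std::size_t j = 0; j < m; ++j)
        scratch[b * m + j] += x[i] * ys[j][i];
  }

  // Stage 2: sum the per-block partial results for each vector.
  std::vector<double> result(m, 0.0);
  for (std::size_t b = 0; b < num_blocks; ++b)
    for (std::size_t j = 0; j < m; ++j)
      result[j] += scratch[b * m + j];
  return result;
}
```

On a GPU, stage 1 would run as parallel thread blocks and only the small 
scratch array would need to be reduced or copied back, so no temporary of 
length n is required.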

Best regards,
Karli