[petsc-dev] VecMDot_SeqCUSP improved

Mon Mar 25 20:42:43 CDT 2013

Forgot to mention: the code is in the 'next' branch:

https://bitbucket.org/petsc/petsc/commits/78e6257bdd411e017b354225da1226dab51c07b7

On 03/25/2013 08:41 PM, Karl Rupp wrote:
> Hi Jose, Paul, and others,
>
> I worked today on VecMDot and came up with an implementation which is
> faster than an iterated application of the standard cusp::blas::dot()
> (which, if I'm not mistaken, just forwards to CUBLAS) if enough vectors
> (>~6) are involved. For complex arithmetic, an iterated application of
> cusp::blas::dotc() is used, since passing complex types to CUDA kernels
> is fairly tricky within PETSc. Jose, any performance feedback from
> within SLEPc is appreciated :-)
>
> The new implementation is based on custom kernels, only allocates a
> little scratchpad memory and is thus more memory efficient than the old
> version. Also, any unnecessary copying of data is avoided. This should
> speed up GMRES quite a bit, though I haven't run any dedicated GMRES
> benchmarks yet. Paul, I guess you have some samples at hand, don't you?
>
> Best regards,
> Karli
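
For readers who don't have VecMDot in mind: it computes the dot products of one vector x with several vectors y_0, ..., y_{k-1}. Below is a minimal CPU sketch in C++ of the two strategies compared above, iterating a plain dot product versus one fused pass. It is not the CUSP kernel from the commit, only an illustration of why a fused version might pay off once enough vectors are involved: each entry of x is loaded once and reused for every y. The names dot, mdot_iterated and mdot_fused are hypothetical, and only real arithmetic is covered.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: plain dot product of two equally sized vectors.
double dot(const std::vector<double>& x, const std::vector<double>& y) {
    assert(x.size() == y.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) sum += x[i] * y[i];
    return sum;
}

// Iterated variant: one full pass over x for every vector in ys.
std::vector<double> mdot_iterated(const std::vector<double>& x,
                                  const std::vector<std::vector<double>>& ys) {
    std::vector<double> result;
    result.reserve(ys.size());
    for (const auto& y : ys) result.push_back(dot(x, y));
    return result;
}

// Fused variant: a single pass over x; each x[i] is read once and
// accumulated into all partial sums.
std::vector<double> mdot_fused(const std::vector<double>& x,
                               const std::vector<std::vector<double>>& ys) {
    for (const auto& y : ys) assert(y.size() == x.size());
    std::vector<double> result(ys.size(), 0.0);
    for (std::size_t i = 0; i < x.size(); ++i) {
        const double xi = x[i];
        for (std::size_t k = 0; k < ys.size(); ++k) result[k] += xi * ys[k][i];
    }
    return result;
}
```

Both functions return one value per vector in ys, so mdot_fused(x, ys) can be swapped in wherever mdot_iterated(x, ys) is called. Whether the fused form is actually faster depends on the hardware and the number of vectors, which is why the benchmarks asked for above matter.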