[petsc-dev] VecMDot_SeqCUSP improved
Karl Rupp
rupp at mcs.anl.gov
Mon Mar 25 20:41:42 CDT 2013
Hi Jose, Paul, and others,
I worked on VecMDot today and came up with an implementation which is
faster than iterated calls to the standard cusp::blas::dot()
(which, if I'm not mistaken, just forwards to CUBLAS) once enough vectors
(roughly six or more) are involved. For complex arithmetic, iterated calls to
cusp::blas::dotc() are still used, since passing complex types to CUDA kernels
is fairly tricky within PETSc. Jose, any performance feedback from
within SLEPc is appreciated :-)
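To make the comparison concrete, here is a CPU-side sketch in C of the two strategies. It is not the actual CUDA code, and `mdot_fused`/`mdot_iterated` are hypothetical names, not PETSc or CUSP API. For nv vectors of length n, the iterated variant makes one full pass over `x` per dot product (roughly 2*nv*n loads), while the fused variant streams `x` only once (about (nv+1)*n loads). That saving in memory traffic is one plausible reason why a fused kernel pays off once enough vectors are involved; the actual crossover point depends on the hardware.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not PETSc API): result[j] = dot(x, y[j]) for
 * j = 0..nv-1, computed in a single pass over x. */
void mdot_fused(size_t n, const double *x, size_t nv,
                const double *const *y, double *result)
{
    assert(x != NULL && result != NULL);
    for (size_t j = 0; j < nv; ++j)
        result[j] = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double xi = x[i];              /* x[i] is loaded once ... */
        for (size_t j = 0; j < nv; ++j)
            result[j] += xi * y[j][i]; /* ... and reused for every y[j] */
    }
}

/* Baseline: one independent dot product (one full pass over x) per y[j],
 * analogous to calling a BLAS dot routine nv times. */
void mdot_iterated(size_t n, const double *x, size_t nv,
                   const double *const *y, double *result)
{
    for (size_t j = 0; j < nv; ++j) {
        double s = 0.0;
        for (size_t i = 0; i < n; ++i)
            s += x[i] * y[j][i];
        result[j] = s;
    }
}
```

Both functions sum over i in the same order for each j, so in this sketch they return bit-identical results; only the memory access pattern differs.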
The new implementation is based on custom kernels, only allocates a
small scratchpad buffer, and is thus more memory-efficient than the old
version. It also avoids any unnecessary copying of data. This should
speed up GMRES quite a bit, although I haven't run any dedicated GMRES
benchmarks yet. Paul, I guess you have some samples at hand, don't you?
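One common way to keep the scratchpad small is a two-stage reduction: each GPU thread block reduces its chunk of the vectors into nv partial sums, and a second, tiny step sums those partials. The C sketch below emulates that on the CPU, with one loop iteration per "block". It is an assumption about how such kernels are typically structured, not the actual kernels described above, and `mdot_two_stage` and its `nchunks` parameter are hypothetical. The scratch buffer holds nchunks*nv doubles, independent of the vector length n.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical sketch: two-stage multi-dot, result[j] = dot(x, y[j]).
 * Stage 1 emulates GPU thread blocks: chunk c writes nv partial sums into
 * scratch[c*nv .. c*nv+nv-1]. Stage 2 reduces over the chunks.
 * Returns 0 on success, -1 if the scratch allocation fails. */
int mdot_two_stage(size_t n, const double *x, size_t nv,
                   const double *const *y, double *result, size_t nchunks)
{
    assert(nchunks > 0);
    double *scratch = calloc(nchunks * nv, sizeof *scratch);
    if (scratch == NULL && nchunks * nv > 0)
        return -1;
    size_t chunk = (n + nchunks - 1) / nchunks; /* entries per chunk, rounded up */

    /* Stage 1: independent partial sums (one GPU block per chunk). */
    for (size_t c = 0; c < nchunks; ++c) {
        size_t begin = c * chunk;
        size_t end = begin + chunk < n ? begin + chunk : n;
        for (size_t i = begin; i < end; ++i)
            for (size_t j = 0; j < nv; ++j)
                scratch[c * nv + j] += x[i] * y[j][i];
    }

    /* Stage 2: sum the nchunks partial results of each dot product. */
    for (size_t j = 0; j < nv; ++j) {
        double s = 0.0;
        for (size_t c = 0; c < nchunks; ++c)
            s += scratch[c * nv + j];
        result[j] = s;
    }
    free(scratch);
    return 0;
}
```

On a GPU, stage 2 would typically be a small follow-up kernel or a host-side sum over the nchunks*nv values copied back, so only that small buffer, rather than whole vectors, crosses the PCIe bus.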
Best regards,
Karli