[petsc-dev] Performance of VecMDot_SeqCUSP
Jose E. Roman
jroman at dsic.upv.es
Tue Apr 24 11:55:16 CDT 2012
It seems that VecMDot_SeqCUSP has rather poor performance. This has a large impact on SLEPc, where it is the main kernel used in the orthogonalization of vectors.
Is this due to the version of Thrust? I am using CUDA Toolkit 4.0.
I tried a naive replacement that copies the contents of the vectors into a matrix and calls CUBLAS dgemv. The improvement is significant, despite the data movement overhead. In some tests on a Fermi GPU, the time logged under VecReduceArith drops from 24.5 seconds to 9.6 seconds (with up to 200 vectors of length 10000).
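As a rough illustration of the idea only (this is not the code referred to in this message, and it runs on the CPU rather than through CUBLAS), the sketch below shows why a set of dot products against the same vector can be recast as one transposed matrix-vector product. All function names here are hypothetical, real scalars are assumed, and the hand-written `gemv_transposed` loop merely stands in for the place where a dgemv call in transposed mode would go.

```python
def mdot_naive(x, vecs):
    """One separate dot product per vector (the many-small-reductions pattern)."""
    return [sum(xi * vi for xi, vi in zip(x, v)) for v in vecs]


def pack_columns(vecs):
    """Copy the vectors into one flat column-major buffer.

    This copy is the data-movement overhead mentioned in the message.
    """
    buf = []
    for v in vecs:
        buf.extend(v)
    return buf


def gemv_transposed(a, n, m, x):
    """Compute y = A^T x for an n-by-m column-major matrix stored in the flat list a.

    Stand-in for a BLAS-style dgemv call in transposed mode (assumption:
    real scalars, so no complex conjugation is needed).
    """
    y = [0.0] * m
    for j in range(m):
        col = j * n  # offset of column j in the flat buffer
        acc = 0.0
        for i in range(n):
            acc += a[col + i] * x[i]
        y[j] = acc
    return y


def mdot_gemv(x, vecs):
    """All dot products <x, v_j> at once, via a single transposed matrix-vector product."""
    n = len(x)
    if any(len(v) != n for v in vecs):
        raise ValueError("all vectors must have the same length as x")
    a = pack_columns(vecs)
    return gemv_transposed(a, n, len(vecs), x)


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0]
    vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 1.0]]
    print(mdot_naive(x, vecs))  # [1.0, 2.0, 6.0]
    print(mdot_gemv(x, vecs))   # [1.0, 2.0, 6.0]
```

Both functions return the same values; the difference is that the second form issues one large operation over a contiguous buffer instead of many small reductions, which is presumably where the reported gain comes from.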
I can send the code for you to try.
Jose