[petsc-dev] VecMDot_SeqCUSP improved

Tue Mar 26 14:15:42 CDT 2013

Hi Jose,

Here's the benchmark data (times in seconds) obtained on my local machine 
running an NVIDIA GTX 285, for vectors of size 100k:

# Master

./ex43 -n 100000 -k 200 -mdot -log_summary
VecMDot  5.6363e+01

./ex43 -n 100000 -k 200 -mdot -log_summary -vec_type cusp
VecMDot  2.1936e+01

./ex43 -n 100000 -k 200 -log_summary
VecDot   5.1124e+01

./ex43 -n 100000 -k 200 -log_summary -vec_type cusp
VecDot   4.0968e+01

# Next

./ex43 -n 100000 -k 200 -mdot -log_summary
VecMDot  5.6417e+01

./ex43 -n 100000 -k 200 -mdot -log_summary -vec_type cusp
VecMDot  1.0281e+01

./ex43 -n 100000 -k 200 -log_summary
VecDot   5.0886e+01

./ex43 -n 100000 -k 200 -log_summary -vec_type cusp
VecDot   4.1905e+01

This gives 10.3 sec. on next vs. 21.9 sec. on master for the 
CUDA-accelerated mdot case. The factor of two is as expected, because the 
'old' kernel moves twice as much data as the custom kernel version. The 
factor of four with respect to VecDot (41.9 sec.) is not entirely clear to 
me; I would have expected a factor close to 2. Presumably the more 
frequent host <-> device transfers add extra overhead.
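
To illustrate why a factor close to 2 relative to VecDot is a reasonable 
expectation, here is a minimal sketch of a load-count model. It is an 
assumption made for illustration only, not the CUSP kernel in PETSc: the 
function names iterated_dots and fused_mdot are hypothetical, and the model 
ignores kernel launches, reductions, and host <-> device transfers. The 
timings at the bottom are copied from the -log_summary lines above.

```python
def iterated_dots(x, ys):
    """k separate dot products: x is re-read for every y_i."""
    results, loads = [], 0
    for y in ys:
        acc = 0.0
        for xj, yj in zip(x, y):
            acc += xj * yj
            loads += 2  # one element of x, one element of y_i
        results.append(acc)
    return results, loads


def fused_mdot(x, ys):
    """Single pass over x: each x[j] is loaded once and reused for all y_i."""
    results, loads = [0.0] * len(ys), 0
    for j, xj in enumerate(x):
        loads += 1  # x[j], loaded once
        for i, y in enumerate(ys):
            results[i] += xj * y[j]
            loads += 1  # y_i[j]
    return results, loads


# Small synthetic problem (hypothetical sizes, chosen to run quickly).
n, k = 1000, 200
x = [float(j % 7) for j in range(n)]
ys = [[float((i + j) % 5) for j in range(n)] for i in range(k)]

dots, loads_iter = iterated_dots(x, ys)
mdots, loads_fused = fused_mdot(x, ys)
assert dots == mdots  # both variants compute the same k dot products

model_ratio = loads_iter / loads_fused  # 2*k*n / ((k + 1)*n)

# Measured times in seconds with -vec_type cusp (n = 100000, k = 200).
master_mdot_cusp = 2.1936e+01
next_mdot_cusp = 1.0281e+01
next_dot_cusp = 4.1905e+01

mdot_master_vs_next = master_mdot_cusp / next_mdot_cusp
dot_vs_mdot_next = next_dot_cusp / next_mdot_cusp

print(f"model load ratio (iterated / fused): {model_ratio:.2f}")
print(f"measured VecMDot master / next:      {mdot_master_vs_next:.2f}")
print(f"measured VecDot / VecMDot on next:   {dot_vs_mdot_next:.2f}")
```

In this model the load ratio is 2k/(k+1), which approaches 2 as k grows, 
whereas the measured VecDot / VecMDot ratio on next is about 4.1. The model 
therefore accounts for only half of the observed gap.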

Best regards,
Karli

On 03/26/2013 10:39 AM, Jose E. Roman wrote:
>
> On 26/03/2013, at 02:41, Karl Rupp wrote:
>
>> Hi Jose, Paul, and others,
>>
>> I worked today on VecMDot and came up with an implementation which is faster than an iterated application of the standard cusp::blas::dot() (which, if I'm not mistaken, just forwards to CUBLAS) if enough vectors (>~6) are involved. For complex arithmetic, an iterated application of cusp::blas::dotc() is used, since passing complex types to CUDA kernels is fairly tricky within PETSc. Jose, any performance feedback from within SLEPc is appreciated :-)
>>
>> The new implementation is based on custom kernels, allocates only a little scratchpad memory, and is thus more memory efficient than the old version. Also, any unnecessary copying of data is avoided. This should speed up GMRES quite a bit, although I haven't run any dedicated GMRES benchmarks yet. Paul, I guess you have some samples at hand, don't you?
>>
>> Best regards,
>> Karli
>
> In my tests, the new implementation is actually slower. I tried src/vec/vec/examples/tests/ex43.c with 200 vectors of length 10000. The VecMDot time with -vec_type cusp increases from 4.1 to 7.2 seconds. Can anyone try to repeat the tests below?
>
> I have an Intel Core i7 with two Tesla C2050.
>
> Jose
>
>
> master
> ---------------
>
> $ ./ex43 -n 10000 -k 200 -mdot -log_summary
>
> VecMDot             3980 1.0 3.6485e+00 1.0 7.96e+09 1.0 0.0e+00 0.0e+00 0.0e+00 11100  0  0  0  11100  0  0  0  2182
>
> $ ./ex43 -n 10000 -k 200 -mdot -log_summary -vec_type cusp
>
> VecMDot             3980 1.0 4.1368e+00 1.0 7.96e+09 1.0 0.0e+00 0.0e+00 0.0e+00 40100  0  0  0  40100  0  0  0  1924
>
> $ ./ex43 -n 10000 -k 200 -log_summary
>
> VecDot            398000 1.0 2.1585e+01 1.0 7.96e+09 1.0 0.0e+00 0.0e+00 0.0e+00 78100  0  0  0  78100  0  0  0   369
>
> $ ./ex43 -n 10000 -k 200 -log_summary -vec_type cusp
>
> VecDot            398000 1.0 2.9228e+01 1.0 7.96e+09 1.0 0.0e+00 0.0e+00 0.0e+00 82100  0  0  0  82100  0  0  0   272
>
>
> next
> ---------------
>
> $ ./ex43 -n 10000 -k 200 -mdot -log_summary
>
> VecMDot             3980 1.0 3.6899e+00 1.0 7.96e+09 1.0 0.0e+00 0.0e+00 0.0e+00 39100  0  0  0  39100  0  0  0  2157
>
> $ ./ex43 -n 10000 -k 200 -mdot -log_summary -vec_type cusp
>
> VecMDot             3980 1.0 7.1823e+00 1.0 7.96e+09 1.0 0.0e+00 0.0e+00 0.0e+00 54100  0  0  0  54100  0  0  0  1108
>
> $ ./ex43 -n 10000 -k 200 -log_summary
>
> VecDot            398000 1.0 2.1702e+01 1.0 7.96e+09 1.0 0.0e+00 0.0e+00 0.0e+00 79100  0  0  0  79100  0  0  0   367
>
> $ ./ex43 -n 10000 -k 200 -log_summary -vec_type cusp
>
> VecDot            398000 1.0 2.8953e+01 1.0 7.96e+09 1.0 0.0e+00 0.0e+00 0.0e+00 82100  0  0  0  82100  0  0  0   275
>