[petsc-dev] VecMDot_SeqCUSP improved
Karl Rupp
rupp at mcs.anl.gov
Tue Mar 26 14:15:42 CDT 2013
Hi Jose,
here's the benchmark data obtained on my local machine running an NVIDIA
GTX 285 for vectors of size 100k:
# Master
./ex43 -n 100000 -k 200 -mdot -log_summary
VecMDot 5.6363e+01
./ex43 -n 100000 -k 200 -mdot -log_summary -vec_type cusp
VecMDot 2.1936e+01
./ex43 -n 100000 -k 200 -log_summary
VecDot 5.1124e+01
./ex43 -n 100000 -k 200 -log_summary -vec_type cusp
VecDot 4.0968e+01
# Next
./ex43 -n 100000 -k 200 -mdot -log_summary
VecMDot 5.6417e+01
./ex43 -n 100000 -k 200 -mdot -log_summary -vec_type cusp
VecMDot 1.0281e+01
./ex43 -n 100000 -k 200 -log_summary
VecDot 5.0886e+01
./ex43 -n 100000 -k 200 -log_summary -vec_type cusp
VecDot 4.1905e+01
This gives about 10.3 s on next vs. 21.9 s on master for the CUDA-accelerated
mdot case. The factor of two is as expected, because the 'old' kernel
moves twice as much data as the custom kernel version. The factor of
four with respect to VecDot is not entirely clear to me; I'd rather
expect a factor close to 2. Presumably the more frequent host <-> device
transfers add extra overhead.
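
One plausible way to read the factor-of-two argument is to count memory loads. The plain C sketch below is not the PETSc/CUSP kernel. The function names mdot_iterated and mdot_fused are hypothetical, and it assumes the old path behaves like one dot product per vector while the new path reads each entry of x once and reuses it for all vectors.

```c
#include <stddef.h>

/* Hypothetical reference: one dot product per vector y[i].
 * x is re-read for every y[i], so 2*k*n values are loaded in total.
 * Returns the number of loads performed. */
static size_t mdot_iterated(size_t n, size_t k, const double *x,
                            const double *const *y, double *result)
{
  size_t loads = 0;
  for (size_t i = 0; i < k; ++i) {
    double sum = 0.0;
    for (size_t j = 0; j < n; ++j) {
      sum += x[j] * y[i][j]; /* loads x[j] and y[i][j] */
      loads += 2;
    }
    result[i] = sum;
  }
  return loads;
}

/* Hypothetical fused variant: each x[j] is loaded once and reused for
 * all k vectors, so (k+1)*n values are loaded in total.
 * Returns the number of loads performed. */
static size_t mdot_fused(size_t n, size_t k, const double *x,
                         const double *const *y, double *result)
{
  size_t loads = 0;
  for (size_t i = 0; i < k; ++i) result[i] = 0.0;
  for (size_t j = 0; j < n; ++j) {
    const double xj = x[j]; /* single load of x[j] */
    loads += 1;
    for (size_t i = 0; i < k; ++i) {
      result[i] += xj * y[i][j]; /* loads y[i][j] only */
      loads += 1;
    }
  }
  return loads;
}
```

Under these assumptions, n = 100000 and k = 200 give 2*k*n = 40,000,000 loads for the iterated variant and (k+1)*n = 20,100,000 for the fused one, a ratio of about 1.99. That is close to the measured 21.9 s / 10.3 s = 2.1 for a memory-bound operation.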
Best regards,
Karli
On 03/26/2013 10:39 AM, Jose E. Roman wrote:
>
> On 26/03/2013, at 02:41, Karl Rupp wrote:
>
>> Hi Jose, Paul, and others,
>>
>> I worked today on VecMDot and came up with an implementation which is faster than an iterated application of the standard cusp::blas::dot() (which, if I'm not mistaken, just forwards to CUBLAS) if enough vectors (>~6) are involved. For complex arithmetic, an iterated application of cusp::blas::dotc() is used, since passing complex types to CUDA kernels is fairly tricky within PETSc. Jose, any performance feedback from within SLEPc is appreciated :-)
>>
>> The new implementation is based on custom kernels, only allocates a little scratchpad memory and is thus more memory efficient than the old version. Also, any unnecessary copying of data is avoided. This should speed up GMRES quite a bit, yet I haven't run any dedicated GMRES benchmarks. Paul, I guess you have some samples at hand, don't you?
>>
>> Best regards,
>> Karli
>
> In my tests, the new implementation is actually slower. I tried src/vec/vec/examples/tests/ex43.c with 200 vectors of length 10000. Time increases from 4.1 s to 7.2 s. Can anyone try to repeat the tests below?
>
> I have an Intel Core i7 with two Tesla C2050.
>
> Jose
>
>
> master
> ---------------
>
> $ ./ex43 -n 10000 -k 200 -mdot -log_summary
>
> VecMDot 3980 1.0 3.6485e+00 1.0 7.96e+09 1.0 0.0e+00 0.0e+00 0.0e+00 11100 0 0 0 11100 0 0 0 2182
>
> $ ./ex43 -n 10000 -k 200 -mdot -log_summary -vec_type cusp
>
> VecMDot 3980 1.0 4.1368e+00 1.0 7.96e+09 1.0 0.0e+00 0.0e+00 0.0e+00 40100 0 0 0 40100 0 0 0 1924
>
> $ ./ex43 -n 10000 -k 200 -log_summary
>
> VecDot 398000 1.0 2.1585e+01 1.0 7.96e+09 1.0 0.0e+00 0.0e+00 0.0e+00 78100 0 0 0 78100 0 0 0 369
>
> $ ./ex43 -n 10000 -k 200 -log_summary -vec_type cusp
>
> VecDot 398000 1.0 2.9228e+01 1.0 7.96e+09 1.0 0.0e+00 0.0e+00 0.0e+00 82100 0 0 0 82100 0 0 0 272
>
>
> next
> ---------------
>
> $ ./ex43 -n 10000 -k 200 -mdot -log_summary
>
> VecMDot 3980 1.0 3.6899e+00 1.0 7.96e+09 1.0 0.0e+00 0.0e+00 0.0e+00 39100 0 0 0 39100 0 0 0 2157
>
> $ ./ex43 -n 10000 -k 200 -mdot -log_summary -vec_type cusp
>
> VecMDot 3980 1.0 7.1823e+00 1.0 7.96e+09 1.0 0.0e+00 0.0e+00 0.0e+00 54100 0 0 0 54100 0 0 0 1108
>
> $ ./ex43 -n 10000 -k 200 -log_summary
>
> VecDot 398000 1.0 2.1702e+01 1.0 7.96e+09 1.0 0.0e+00 0.0e+00 0.0e+00 79100 0 0 0 79100 0 0 0 367
>
> $ ./ex43 -n 10000 -k 200 -log_summary -vec_type cusp
>
> VecDot 398000 1.0 2.8953e+01 1.0 7.96e+09 1.0 0.0e+00 0.0e+00 0.0e+00 82100 0 0 0 82100 0 0 0 275
>