[petsc-dev] VecMDot_SeqCUSP improved

Karl Rupp rupp at mcs.anl.gov
Sun Apr 7 23:28:00 CDT 2013


Hi Paul,

you're very welcome. I'm glad it works out fine for your case as well.

Best regards,
Karli



On 04/07/2013 11:17 PM, Paul Mullowney wrote:
> VecMDot is performing great for my examples. About 2X faster than the
> trivial implementation that I originally suggested when I reported the
> problem.
> Thanks Karl.
> -Paul
>> On 26/03/2013, at 20:15, Karl Rupp wrote:
>>
>>> Hi Jose,
>>>
>>> here's the benchmark data obtained on my local machine running an
>>> NVIDIA GTX 285 for vectors of size 100k:
>>>
>>> # Master
>>>
>>> ./ex43 -n 100000 -k 200 -mdot -log_summary
>>> VecMDot  5.6363e+01
>>>
>>> ./ex43 -n 100000 -k 200 -mdot -log_summary -vec_type cusp
>>> VecMDot  2.1936e+01
>>>
>>> ./ex43 -n 100000 -k 200 -log_summary
>>> VecDot   5.1124e+01
>>>
>>> ./ex43 -n 100000 -k 200 -log_summary -vec_type cusp
>>> VecDot   4.0968e+01
>>>
>>>
>>> # Next
>>>
>>> ./ex43 -n 100000 -k 200 -mdot -log_summary
>>> VecMDot  5.6417e+01
>>>
>>> ./ex43 -n 100000 -k 200 -mdot -log_summary -vec_type cusp
>>> VecMDot  1.0281e+01
>>>
>>> ./ex43 -n 100000 -k 200 -log_summary
>>> VecDot   5.0886e+01
>>>
>>> ./ex43 -n 100000 -k 200 -log_summary -vec_type cusp
>>> VecDot   4.1905e+01
>>>
>>>
>>> This makes 10 seconds in 'next' vs. 20 seconds on 'master' for the
>>> CUDA-accelerated mdot case. The factor of two is as expected, because
>>> the data movement in the 'old' kernel is twice what it is in the
>>> custom kernel version. The factor of four with respect to VecDot is
>>> not entirely clear to me; I'd rather expect a factor closer to 2.
>>> Presumably the more frequent host <-> device transfers add extra
>>> overhead.
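>>>
>>> For reference, here is a simplified sketch of the fused-kernel idea
>>> (assumptions: four y-vectors per pass, double precision, a power-of-two
>>> block size of at most 256; this is not the actual VecMDot_SeqCUSP
>>> kernel, just an illustration of where the savings come from):
>>>
>>> // Each thread loads x[i] once and reuses it for all four accumulations,
>>> // so x is streamed from global memory once instead of four times.
>>> __global__ void mdot4(const double *x,
>>>                       const double *y0, const double *y1,
>>>                       const double *y2, const double *y3,
>>>                       double *partial, /* 4 * gridDim.x partial results */
>>>                       int n)
>>> {
>>>   double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
>>>   for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
>>>        i += gridDim.x * blockDim.x) {
>>>     const double xi = x[i];          /* single read of x[i] */
>>>     s0 += xi * y0[i]; s1 += xi * y1[i];
>>>     s2 += xi * y2[i]; s3 += xi * y3[i];
>>>   }
>>>   /* block-level reduction of the four partial sums */
>>>   __shared__ double red[4][256];     /* assumes blockDim.x <= 256 */
>>>   red[0][threadIdx.x] = s0; red[1][threadIdx.x] = s1;
>>>   red[2][threadIdx.x] = s2; red[3][threadIdx.x] = s3;
>>>   __syncthreads();
>>>   for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
>>>     if (threadIdx.x < stride)
>>>       for (int j = 0; j < 4; j++)
>>>         red[j][threadIdx.x] += red[j][threadIdx.x + stride];
>>>     __syncthreads();
>>>   }
>>>   if (threadIdx.x == 0)              /* host sums the per-block results */
>>>     for (int j = 0; j < 4; j++)
>>>       partial[4 * blockIdx.x + j] = red[j][0];
>>> }
>>>
>>> Each y-vector is read once either way, but k individual dot products
>>> re-read x for every one of them (about 2*k*n loads), while grouped
>>> kernels re-read x only once per group, so the total traffic approaches
>>> half of that as the group size grows.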
>>>
>>> Best regards,
>>> Karli
>> Here are my numbers for this size. They are similar to yours (a bit
>> worse, though). Also, I tried with ViennaCL, which gave very poor
>> performance (is this normal?).
>>
>> # Master
>>
>> ./ex43 -n 100000 -k 200 -mdot -log_summary
>> VecMDot  4.0681e+01
>>
>> ./ex43 -n 100000 -k 200 -mdot -log_summary -vec_type cusp
>> VecMDot  2.4489e+01
>>
>> ./ex43 -n 100000 -k 200 -log_summary
>> VecDot   5.9457e+01
>>
>> ./ex43 -n 100000 -k 200 -log_summary -vec_type cusp
>> VecDot   5.0021e+01
>>
>> # Next
>>
>> ./ex43 -n 100000 -k 200 -mdot -log_summary
>> VecMDot  4.4252e+01
>>
>> ./ex43 -n 100000 -k 200 -mdot -log_summary -vec_type cusp
>> VecMDot  1.2176e+01
>>
>> ./ex43 -n 100000 -k 200 -log_summary
>> VecDot   5.9847e+01
>>
>> ./ex43 -n 100000 -k 200 -log_summary -vec_type cusp
>> VecDot   5.0080e+01
>>
>> # ViennaCL
>>
>> ./ex43 -n 100000 -k 200 -mdot -log_summary -vec_type viennacl
>> VecMDot 9.4478e+01
>>
>> ./ex43 -n 100000 -k 200 -log_summary -vec_type viennacl
>> VecDot   1.2311e+02
>>
>>
>> I tried a full SLEPc computation, with a matrix of order 256,000 and
>> VecMDot operating on 40 vectors. The gain from 'master' to 'next' is
>> 91 seconds down to 53 seconds, so yes, it is a good improvement. Thanks.
>> However, I still see only a modest speedup (about 4x) with respect to
>> the CPU (since we do some optimizations for the CPU). Also, performance
>> depends a lot on the matrix dimensions. I have to figure out how to
>> optimize it more for the GPU as well.
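>>
>> For context, the operation being timed is a single multi-dot call per
>> iteration; a minimal usage sketch (not copied from ex43, and the
>> function name is just illustrative) looks like this:
>>
>> #include <petscvec.h>
>>
>> /* Compute vals[i] = y[i]^H * x for i = 0..nv-1 in one call; with
>>    -vec_type cusp the CUSP/GPU implementation is dispatched. */
>> PetscErrorCode MDotExample(Vec x, PetscInt nv, const Vec y[])
>> {
>>   PetscScalar    *vals;
>>   PetscErrorCode ierr;
>>
>>   PetscFunctionBegin;
>>   ierr = PetscMalloc(nv*sizeof(PetscScalar),&vals);CHKERRQ(ierr);
>>   ierr = VecMDot(x,nv,y,vals);CHKERRQ(ierr);
>>   ierr = PetscFree(vals);CHKERRQ(ierr);
>>   PetscFunctionReturn(0);
>> }
>>
>> Grouping all 40 vectors into one VecMDot call per iteration is what lets
>> the fused kernel amortize the reads of x.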
>>
>> Jose
>>
>



