[petsc-dev] VecMDot_SeqCUSP improved

Karl Rupp rupp at mcs.anl.gov
Fri Apr 12 16:15:32 CDT 2013


Hi guys,

FYI: I've squeezed a bit more performance out of the mdot operation
(about ten percent for the test case below). The changes are pushed to
next and will be merged to master tomorrow unless issues show up in the
nightly tests.
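
For anyone who doesn't have the routine in mind: VecMDot computes the
inner products of one vector against a whole group of vectors in a
single call, which is what the -mdot runs below exercise. A minimal
usage sketch (sizes and values are made up for illustration and not
taken from ex43):

#include <petscvec.h>

int main(int argc,char **argv)
{
  Vec            x,y[3];
  PetscScalar    val[3];
  PetscInt       i,n = 100000;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc,&argv,NULL,NULL);CHKERRQ(ierr);
  ierr = VecCreate(PETSC_COMM_SELF,&x);CHKERRQ(ierr);
  ierr = VecSetSizes(x,PETSC_DECIDE,n);CHKERRQ(ierr);
  ierr = VecSetFromOptions(x);CHKERRQ(ierr);   /* picks up -vec_type cusp etc. */
  ierr = VecSet(x,1.0);CHKERRQ(ierr);
  for (i=0; i<3; i++) {
    ierr = VecDuplicate(x,&y[i]);CHKERRQ(ierr);
    ierr = VecSet(y[i],(PetscScalar)(i+1));CHKERRQ(ierr);
  }
  ierr = VecMDot(x,3,y,val);CHKERRQ(ierr);     /* val[j] = (x,y[j]), all in one call */
  for (i=0; i<3; i++) {ierr = VecDestroy(&y[i]);CHKERRQ(ierr);}
  ierr = VecDestroy(&x);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}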

Best regards,
Karli



On 04/07/2013 11:17 PM, Paul Mullowney wrote:
> VecMDot is performing great for my examples. About 2X faster than the
> trivial implementation that I originally suggested when I reported the
> problem.
> Thanks Karl.
> -Paul
>> On 26/03/2013, at 20:15, Karl Rupp wrote:
>>
>>> Hi Jose,
>>>
>>> here's the benchmark data obtained on my local machine running an
>>> NVIDIA GTX 285 for vectors of size 100k:
>>>
>>> # Master
>>>
>>> ./ex43 -n 100000 -k 200 -mdot -log_summary
>>> VecMDot  5.6363e+01
>>>
>>> ./ex43 -n 100000 -k 200 -mdot -log_summary -vec_type cusp
>>> VecMDot  2.1936e+01
>>>
>>> ./ex43 -n 100000 -k 200 -log_summary
>>> VecDot   5.1124e+01
>>>
>>> ./ex43 -n 100000 -k 200 -log_summary -vec_type cusp
>>> VecDot   4.0968e+01
>>>
>>>
>>> # Next
>>>
>>> ./ex43 -n 100000 -k 200 -mdot -log_summary
>>> VecMDot  5.6417e+01
>>>
>>> ./ex43 -n 100000 -k 200 -mdot -log_summary -vec_type cusp
>>> VecMDot  1.0281e+01
>>>
>>> ./ex43 -n 100000 -k 200 -log_summary
>>> VecDot   5.0886e+01
>>>
>>> ./ex43 -n 100000 -k 200 -log_summary -vec_type cusp
>>> VecDot   4.1905e+01
>>>
>>>
>>> This makes 10 sec on next vs. 20 sec on master for the
>>> CUDA-accelerated mdot case. The factor of two is actually as
>>> expected, because the data movement in the 'old' kernel is twice
>>> that of the custom kernel version. The factor of four with respect
>>> to VecDot is not entirely clear to me; I'd rather expect a factor
>>> close to 2. Presumably the more frequent host <-> device transfers
>>> add extra overhead.
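>>>
>>> To make that counting explicit, here is a plain host-side analogue
>>> (only to illustrate the load counting, not the actual CUSP/CUDA
>>> kernel): k separate dot products against a common x stream x from
>>> memory k times, i.e. roughly 2*k*n loads, whereas a fused mdot
>>> streams x once and reuses it for all y[j], i.e. roughly (k+1)*n
>>> loads, which is about half for large k.
>>>
>>> #include <stddef.h>
>>>
>>> /* k separate dots: x is re-read for every y[j], ~2*k*n loads */
>>> void dots_separate(size_t n,size_t k,const double *x,
>>>                    const double *const *y,double *val)
>>> {
>>>   size_t i,j;
>>>   for (j=0; j<k; j++) {
>>>     double s = 0.0;
>>>     for (i=0; i<n; i++) s += x[i]*y[j][i];
>>>     val[j] = s;
>>>   }
>>> }
>>>
>>> /* fused mdot: x is streamed once and reused for all y[j],
>>>    ~(k+1)*n loads */
>>> void dots_fused(size_t n,size_t k,const double *x,
>>>                 const double *const *y,double *val)
>>> {
>>>   size_t i,j;
>>>   for (j=0; j<k; j++) val[j] = 0.0;
>>>   for (i=0; i<n; i++) {
>>>     const double xi = x[i];
>>>     for (j=0; j<k; j++) val[j] += xi*y[j][i];
>>>   }
>>> }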
>>>
>>> Best regards,
>>> Karli
>> Here are my numbers for this size. They are similar to yours (a bit
>> worse, though). I also tried ViennaCL, which gave very poor
>> performance (is this normal?).
>>
>> # Master
>>
>> ./ex43 -n 100000 -k 200 -mdot -log_summary
>> VecMDot  4.0681e+01
>>
>> ./ex43 -n 100000 -k 200 -mdot -log_summary -vec_type cusp
>> VecMDot  2.4489e+01
>>
>> ./ex43 -n 100000 -k 200 -log_summary
>> VecDot   5.9457e+01
>>
>> ./ex43 -n 100000 -k 200 -log_summary -vec_type cusp
>> VecDot   5.0021e+01
>>
>> # Next
>>
>> ./ex43 -n 100000 -k 200 -mdot -log_summary
>> VecMDot  4.4252e+01
>>
>> ./ex43 -n 100000 -k 200 -mdot -log_summary -vec_type cusp
>> VecMDot  1.2176e+01
>>
>> ./ex43 -n 100000 -k 200 -log_summary
>> VecDot   5.9847e+01
>>
>> ./ex43 -n 100000 -k 200 -log_summary -vec_type cusp
>> VecDot   5.0080e+01
>>
>> # ViennaCL
>>
>> ./ex43 -n 100000 -k 200 -mdot -log_summary -vec_type viennacl
>> VecMDot 9.4478e+01
>>
>> ./ex43 -n 100000 -k 200 -log_summary -vec_type viennacl
>> VecDot  1.2311e+02
>>
>>
>> I tried a full SLEPc computation, with a matrix of order 256,000 and
>> VecMDot operating on 40 vectors. The time drops from 91 seconds on
>> 'master' to 53 seconds on 'next', so yes, it is a good improvement.
>> Thanks. However, I still see only a modest speedup (about 4x) with
>> respect to the CPU (since we already do some optimizations for the
>> CPU). Also, performance depends a lot on the matrix dimensions. I
>> have to figure out how to optimize it more for the GPU as well.
>>
>> Jose
>>
>
