[petsc-dev] [petsc-maint] running CUDA on SUMMIT

Jed Brown jed at jedbrown.org
Wed Aug 14 18:22:01 CDT 2019


"Smith, Barry F." <bsmith at mcs.anl.gov> writes:

>> On Aug 14, 2019, at 5:58 PM, Jed Brown <jed at jedbrown.org> wrote:
>> 
>> "Smith, Barry F." <bsmith at mcs.anl.gov> writes:
>> 
>>>> On Aug 14, 2019, at 2:37 PM, Jed Brown <jed at jedbrown.org> wrote:
>>>> 
>>>> Mark Adams via petsc-dev <petsc-dev at mcs.anl.gov> writes:
>>>> 
>>>>> On Wed, Aug 14, 2019 at 2:35 PM Smith, Barry F. <bsmith at mcs.anl.gov> wrote:
>>>>> 
>>>>>> 
>>>>>> Mark,
>>>>>> 
>>>>>>  Would you be able to make one run using single precision? Just single
>>>>>> everywhere since that is all we support currently?
>>>>>> 
>>>>>> 
>>>>> Experience in engineering, at least, is that single precision does not
>>>>> work for FE elasticity. I tried it many years ago and have heard the
>>>>> same from others. This problem is pretty simple other than using Q2. I
>>>>> suppose I could try it, but just be aware that the FE people might say
>>>>> that single sucks.
>>>> 
>>>> When they say that single sucks, is it for the definition of the
>>>> operator or the preconditioner?
>>>> 
>>>> As point of reference, we can apply Q2 elasticity operators in double
>>>> precision at nearly a billion dofs/second per GPU.
>>> 
>>>  And in single you get what?
>> 
>> I don't have exact numbers, but <2x faster on V100, and it sort of
>> doesn't matter because preconditioning cost will dominate.  
>
>    When using block formats, a much higher percentage of the bandwidth goes to moving the double precision matrix entries, so switching to single could conceivably help by up to almost a factor of two.
>
>     Depending on the matrix structure, perhaps the column indices could be handled by a shift plus short j indices, or two shifts and two sets of j indices.
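The factor-of-two estimate above can be checked with back-of-envelope arithmetic. A sketch (the formats are illustrative: scalar CSR with one 4-byte index per entry, and a blocked format like BAIJ with one index per b-by-b block):

```python
# Matrix bytes moved per stored nonzero in a sparse matrix-vector product,
# counting only matrix data (values + column indices), not vectors.
def bytes_per_nonzero(value_bytes, block_size=1):
    """Value bytes plus a 4-byte index amortized over a block."""
    index_bytes = 4 / (block_size * block_size)
    return value_bytes + index_bytes

# Scalar CSR: indices are a big share of the traffic, so the
# double -> single ratio is well under 2x.
csr_ratio = bytes_per_nonzero(8) / bytes_per_nonzero(4)      # 12/8 = 1.5

# Blocked format with 3x3 blocks (3D elasticity): indices are amortized
# over the block, so the ratio approaches 2x.
baij3_ratio = bytes_per_nonzero(8, 3) / bytes_per_nonzero(4, 3)

print(f"CSR double/single traffic ratio:     {csr_ratio:.3f}")
print(f"BAIJ(3) double/single traffic ratio: {baij3_ratio:.3f}")
```

With larger blocks the index traffic vanishes and the ratio tends to exactly 2, which is why blocked formats stand to gain the most from single precision.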

Shorts are a problem, but a lot of matrices are actually quite
compressible if you subtract the row index from all the column indices.
I've done some experiments using zstd, and the CPU decode rate is
competitive with, or better than, DRAM bandwidth.  But that gives up
random access, which seems important for vectorization.  Maybe someone
who knows more about decompression on GPUs can comment?
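The subtract-the-row idea can be illustrated with a small sketch (using zlib from the Python standard library as a stand-in for zstd, and a made-up tridiagonal CSR matrix for the test data):

```python
import zlib
from array import array

# Column indices of a synthetic tridiagonal CSR matrix: each row i
# references columns near i, so the deltas (col - row) are tiny and
# highly repetitive, while the absolute indices span the whole range.
n = 100_000
cols, deltas = [], []
for i in range(n):
    for j in (i - 1, i, i + 1):
        if 0 <= j < n:
            cols.append(j)
            deltas.append(j - i)        # subtract the row index

raw = array('i', cols).tobytes()        # 4-byte absolute indices
packed = array('b', deltas).tobytes()   # deltas fit in one byte here
compressed = zlib.compress(packed, level=6)

print(f"raw int32 indices: {len(raw)} bytes")
print(f"compressed deltas: {len(compressed)} bytes "
      f"({len(raw) / len(compressed):.0f}x smaller)")
```

Of course this trades away random access into the index array: you have to decode a whole row (or chunk) before you can use it, which is the vectorization concern above.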

>> The big win for single is on consumer-grade GPUs, which DOE doesn't
>> install and NVIDIA forbids from being used in data centers (because
>> they're so cost-effective ;-)).
>
>    DOE LCFs are not our only customers. Cheap-o engineering professors
>    might stack a bunch of consumer-grade cards in their labs; would
>    they benefit? Satish's basement could hold a great many
>    consumer-grade cards.

Fair point.  Time is also important, so most companies buy the more
expensive hardware on the assumption that it means less frequent
problems (from the consumer cards' lack of ECC, etc.).

