[petsc-dev] PETSc issue I cannot post combine WaitForCUDA(); inside PetscLogGpuTimeEnd();

Sat Aug 29 00:16:50 CDT 2020

>>>>>  Since we cannot post issues (reported here 
>>>>> https://forum.gitlab.com/t/creating-new-issue-gives-cannot-create-issue-getting-whoops-something-went-wrong-on-our-end/41966?u=bsmith) 
>>>>> here is my issue so I don't forget it.
>>>>>  I think
>>>>> err  = WaitForCUDA();CHKERRCUDA(err);
>>>>> ierr = PetscLogGpuTimeEnd();CHKERRQ(ierr);
>>>>> should be changed so that the WaitForCUDA() call, or rather a generic 
>>>>> WaitForDevice(), is made inside PetscLogGpuTimeEnd().
>>>>> Currently the WaitForCUDA() is missing in a few places, 
>>>>> resulting in bad timings.
>>>>> Also, some _SeqCUDA() routines lack the PetscLogGpuTimeEnd() call 
>>>>> and need to be fixed.
>>>>> The current model is a maintenance nightmare.
>>>>> Does anyone see a problem with making this change?
>>>>
>>>> I'm fine with this change, as the maintenance benefits outweigh the 
>>>> performance cost for typical use cases.
>>>>
>>>> I propose to also add a WaitForDevice() at 
>>>> PetscLogGpuTimeBegin(). This will ensure that no previous GPU kernel 
>>>> executions spill over into the timed section.
> 
>    Karl,
> 
>     When synchronization is turned on, the previous GPU kernels should 
> always have their own WaitForDevice(), so are you concerned about buggy 
> code that does not include WaitForDevice()?

I'm primarily thinking of user callback routines here. For example, a 
FormFunction provided by the user that is running some GPU kernels. We 
have no guarantee that these user kernels have completed before entering 
the timed sections inside PETSc, so the logs will be skewed to report an 
unusually slow kernel in PETSc (the one right after the user form 
function). Arguably we could add a WaitForDevice() after user callback 
invocations.
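
To make the skew concrete, here is a minimal sketch of the effect in plain 
C. It is a toy model rather than PETSc or CUDA code: the Sim struct, 
launch(), wait_for_device() and time_after_user_callback() are all 
hypothetical names, and the kernel costs (5 and 1 time units) are 
made up for illustration.

```c
#include <assert.h>

/* Toy model, not PETSc code: one host clock plus a device that runs
   queued kernels in order. All names here are hypothetical. */
typedef struct {
  double host_now;    /* current host time */
  double device_free; /* time at which all queued device work is done */
} Sim;

/* Asynchronous launch: the kernel is queued behind pending work and the
   host clock does not advance. */
static void launch(Sim *s, double cost)
{
  assert(cost >= 0.0);
  double start = (s->device_free > s->host_now) ? s->device_free : s->host_now;
  s->device_free = start + cost;
}

/* Stand-in for WaitForDevice(): block the host until the device is idle
   and return how long the host had to wait. */
static double wait_for_device(Sim *s)
{
  double waited = (s->device_free > s->host_now) ? s->device_free - s->host_now : 0.0;
  s->host_now += waited;
  return waited;
}

/* A user callback queues a 5-unit kernel and returns without waiting;
   a 1-unit library kernel is then timed. Returns the time charged to
   the library kernel; *begin_wait receives the time spent at the
   optional barrier before the timed section. */
static double time_after_user_callback(int barrier_at_begin, double *begin_wait)
{
  Sim s = {0.0, 0.0};
  launch(&s, 5.0);                 /* user kernel, still running */
  *begin_wait = barrier_at_begin ? wait_for_device(&s) : 0.0;
  double t0 = s.host_now;          /* timed section begins */
  launch(&s, 1.0);                 /* library kernel */
  wait_for_device(&s);             /* barrier at the end */
  return s.host_now - t0;          /* timed section ends */
}
```

In this model, time_after_user_callback(0, &w) charges 6 units to the 
library kernel, whereas time_after_user_callback(1, &w) charges 1 unit 
and returns w = 5 as the wait at the first barrier, which is the quantity 
one would log separately in a two-barrier model.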

I didn't think of the WaitForDevice() after each kernel call in PETSc; 
with that we do get reasonable timings within PETSc (except for the user 
callbacks mentioned above), so the two-barrier model is not needed.
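
For the change proposed at the top of the thread, the maintenance argument 
can be sketched with the same kind of toy model. Again, this is not PETSc 
code: Log, time_begin(), time_end_caller_waits(), time_end_with_wait() 
and forgetful_call_site() are hypothetical stand-ins, and the 2-unit 
kernel cost is an assumption. The sketch compares a timer end that relies 
on the caller to wait with one that waits internally, at a call site that 
forgot the explicit wait.

```c
#include <assert.h>

/* Toy model, not PETSc code. All names are hypothetical. */
typedef struct {
  double host_now;    /* current host time */
  double device_free; /* time at which all queued device work is done */
  double t_begin;     /* host time when the timed section started */
  double gpu_time;    /* accumulated logged GPU time */
} Log;

/* Asynchronous launch: queue the kernel, do not advance the host clock. */
static void launch(Log *l, double cost)
{
  assert(cost >= 0.0);
  double start = (l->device_free > l->host_now) ? l->device_free : l->host_now;
  l->device_free = start + cost;
}

/* Stand-in for WaitForDevice(): block the host until the device is idle. */
static void wait_for_device(Log *l)
{
  if (l->device_free > l->host_now) l->host_now = l->device_free;
}

static void time_begin(Log *l) { l->t_begin = l->host_now; }

/* Current pattern: correct only if the caller waited beforehand. */
static void time_end_caller_waits(Log *l)
{
  l->gpu_time += l->host_now - l->t_begin;
}

/* Proposed pattern: the wait is part of ending the timer. */
static void time_end_with_wait(Log *l)
{
  wait_for_device(l);
  l->gpu_time += l->host_now - l->t_begin;
}

/* A call site that launches a 2-unit kernel and forgets the explicit
   wait. Returns the GPU time that ends up in the log. */
static double forgetful_call_site(int wait_inside_end)
{
  Log l = {0.0, 0.0, 0.0, 0.0};
  time_begin(&l);
  launch(&l, 2.0);
  if (wait_inside_end) time_end_with_wait(&l);
  else                 time_end_caller_waits(&l);
  return l.gpu_time;
}
```

Here forgetful_call_site(0) logs 0 units for a kernel that costs 2, while 
forgetful_call_site(1) logs the full 2 units without the call site having 
to remember anything.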

Best regards,
Karli

> 
>>>
>>>  Might this incur an extra overhead checking the device? Or will it 
>>> always be true that if there are no outstanding kernels it will not 
>>> go to the GPU and the check will return immediately?
>>
>> If we want to have a two-barrier model, I propose we log the timing 
>> for waiting at the first barrier separately.
>>>
>>> Barry
>>>
>>>>
>>>> Best regards,
>>>> Karli
>>
>