[petsc-dev] PETSc issue I cannot post combine WaitForCUDA(); inside PetscLogGpuTimeEnd();
Barry Smith
bsmith at petsc.dev
Sat Aug 29 11:28:39 CDT 2020
Karl,
You are right.
If we have some way of knowing that a user callback runs on the GPU, then we will want to wrap the call to that callback in the GPU timing, which should resolve the problem. But we would need to know which callbacks are on the GPU; perhaps this will become clear as our understanding grows of how users will use the GPU for their own code through the PETSc API.
Barry
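
The idea of wrapping a GPU-resident user callback in its own timed, synchronized section can be illustrated with a small mock. This is only a sketch under stated assumptions: it does not use PETSc or CUDA, the "device" is a simulated counter of queued work, and every name in it (launch, wait_for_device, UserCallback, the on_gpu flag, call_user_callback, timed_library_kernel) is hypothetical and not part of the PETSc API.

```c
#include <assert.h>
#include <stdbool.h>

/* Mock device (hypothetical): 'pending' is queued, unfinished work;
   'now' is a simulated clock that only advances when we synchronize. */
static int pending = 0;
static int now     = 0;

/* Asynchronous launch: returns immediately, work stays queued. */
static void launch(int cost) { assert(cost >= 0); pending += cost; }

/* Stand-in for WaitForDevice(): blocks until all queued work is done. */
static void wait_for_device(void) { now += pending; pending = 0; }

/* Hypothetical callback record; 'on_gpu' is the piece of knowledge the
   library would need to have about the user's code. */
typedef struct {
  void (*fn)(void);
  bool on_gpu;
  int  gpu_time;   /* time attributed to this callback */
} UserCallback;

/* Call the user routine; if it is known to run on the GPU, synchronize
   afterwards and charge the elapsed time to the callback itself. */
static void call_user_callback(UserCallback *cb) {
  int start = now;
  cb->fn();
  if (cb->on_gpu) {
    wait_for_device();
    cb->gpu_time += now - start;
  }
}

/* A library kernel timed with a single barrier at the end. */
static int timed_library_kernel(int cost) {
  int start = now;
  launch(cost);
  wait_for_device();
  return now - start;
}

/* Example user routine that queues 5 units of GPU work. */
static void user_form_function(void) { launch(5); }
```

In this mock, a callback flagged on_gpu is charged its own 5 units and the following library kernel of cost 2 reports 2. With the flag unset, the callback's work spills over and the same library kernel reports 7, which is the skew described in the quoted message below.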
> On Aug 29, 2020, at 12:16 AM, Karl Rupp <rupp at iue.tuwien.ac.at> wrote:
>
>
>>>>>> Since we cannot post issues (reported here https://forum.gitlab.com/t/creating-new-issue-gives-cannot-create-issue-getting-whoops-something-went-wrong-on-our-end/41966?u=bsmith) here is my issue so I don't forget it.
>>>>>> I think
>>>>>> err = WaitForCUDA();CHKERRCUDA(err);
>>>>>> ierr = PetscLogGpuTimeEnd();CHKERRQ(ierr);
>>>>>> should be changed so that the WaitForCUDA() (actually WaitForDevice()) call is made inside PetscLogGpuTimeEnd().
>>>>>> Currently the WaitForCUDA() is missing in a few places, resulting in bad timings.
>>>>>> Also, some _SeqCUDA() routines don't have the PetscLogGpuTimeEnd() and need to be fixed.
>>>>>> The current model is a maintenance nightmare.
>>>>>> Does anyone see a problem with making this change?
>>>>>
>>>>> I'm fine with this change, as the maintenance benefits outweigh the performance cost for typical use cases.
>>>>>
>>>>> I propose to also add the WaitForDevice(); at PetscLogGpuTimeBegin(). This will ensure that no previous GPU kernel executions spill over into the timed section.
>> Karl,
>> When synchronization is turned on, the previous GPU kernels should always have their own WaitForDevice(), so are you concerned about buggy code that does not include WaitForDevice()?
>
> I'm primarily thinking of user callback routines here. For example, a FormFunction provided by the user that is running some GPU kernels. We have no guarantee that these user kernels have completed before entering the timed sections inside PETSc, so the logs will be skewed to report an unusually slow kernel in PETSc (the one right after the user form function). Arguably we could add a WaitForDevice() after user callback invocations.
>
> I didn't think of the WaitForDevice() after each kernel call in PETSc; with that we do get reasonable timings within PETSc (except for the user callbacks mentioned above), so the two-barrier model is not needed.
>
> Best regards,
> Karli
>
>
>
>>>>
>>>> Might this incur an extra overhead checking the device? Or will it always be true that if there are no outstanding kernels it will not go to the GPU and the check will return immediately?
>>>
>>> If we want to have a two barrier model, I propose we log the timing for waiting at the first barrier separately.
>>>>
>>>> Barry
>>>>
>>>>>
>>>>> Best regards,
>>>>> Karli
>>>
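
The change proposed at the start of the quoted thread, moving the device synchronization inside the end-of-timing call so that individual routines cannot forget it, might look roughly like the following. This is a self-contained mock, not PETSc code: the mock_ functions only stand in for WaitForDevice(), PetscLogGpuTimeBegin() and PetscLogGpuTimeEnd(), the simulated clock is an assumption made so the example runs without a GPU, and error checking (CHKERRQ and friends) is omitted.

```c
#include <assert.h>
#include <stdbool.h>

/* Simulated device state (hypothetical, for illustration only). */
static int  pending_work    = 0;     /* queued, unfinished kernel work   */
static int  sim_clock       = 0;     /* advances only on synchronization */
static int  timer_start     = 0;
static bool timer_running   = false;
static int  logged_gpu_time = 0;     /* total time recorded by the timer */

/* Asynchronous kernel launch: returns before the work completes. */
static void mock_launch_kernel(int cost) { pending_work += cost; }

/* Stand-in for WaitForDevice(). */
static void mock_wait_for_device(void) {
  sim_clock += pending_work;
  pending_work = 0;
}

/* Stand-in for PetscLogGpuTimeBegin(): no barrier here (one-barrier model). */
static void mock_log_gpu_time_begin(void) {
  assert(!timer_running);
  timer_running = true;
  timer_start   = sim_clock;
}

/* Stand-in for PetscLogGpuTimeEnd() with the synchronization folded in,
   so callers no longer need a separate wait before it. */
static void mock_log_gpu_time_end(void) {
  assert(timer_running);
  mock_wait_for_device();
  logged_gpu_time += sim_clock - timer_start;
  timer_running = false;
}

/* A _SeqCUDA-style routine: begin, launch, end. No explicit wait. */
static int mock_vec_op(void) {
  mock_log_gpu_time_begin();
  mock_launch_kernel(3);
  mock_log_gpu_time_end();
  return logged_gpu_time;
}
```

With the barrier inside the end call, mock_vec_op() records the full 3 units of kernel time even though the routine itself never synchronizes. In the mock, the barrier costs nothing when no work is pending; whether that holds on a real device is the overhead question raised in the thread.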