[petsc-users] hypre / hip usage

Paul T. Bauman ptbauman at gmail.com
Mon Jan 24 09:43:45 CST 2022


On Mon, Jan 24, 2022 at 9:31 AM Mark Adams <mfadams at lbl.gov> wrote:

> Thanks Paul,
>
> How do I get a stack trace? I have been relying on PETSc's stack trace,
> which piggybacks on timers, so it does not get very deep here.
>

I'm not sure what the "PETSc way" is, but I just run the executable through
`rocgdb` as one would with `gdb`. (`rocgdb` is literally `gdb` built with
extra AMD support, which is either already upstreamed or being upstreamed to
gdb.) You can also run it in batch mode, so you can dump the logs from each
MPI process.
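
As a rough sketch of the batch-mode approach, a small wrapper script can run
each rank under the debugger and write one log per rank. The rank variables
(`SLURM_PROCID`, `OMPI_COMM_WORLD_RANK`), the `DEBUGGER` override, the script
name, and the application name are assumptions for illustration; the `-batch`,
`-ex`, and `--args` options are standard `gdb` options, which `rocgdb` should
accept since it is `gdb`.

```shell
#!/bin/sh
# Sketch: run each MPI rank under rocgdb in batch mode, one log per rank.
# Assumption: the launcher exports SLURM_PROCID (Slurm) or
# OMPI_COMM_WORLD_RANK (Open MPI); otherwise rank 0 is assumed.

# Print the log file name for the current rank.
rank_log_name() {
    rank="${SLURM_PROCID:-${OMPI_COMM_WORLD_RANK:-0}}"
    printf 'rocgdb.rank%s.log\n' "$rank"
}

# Run the given command under the debugger, non-interactively.
# DEBUGGER is a hypothetical override so plain gdb can be substituted.
run_under_debugger() {
    log="$(rank_log_name)"
    "${DEBUGGER:-rocgdb}" -batch \
        -ex run \
        -ex "info threads" \
        -ex "thread apply all bt" \
        --args "$@" > "$log" 2>&1
}

# Only act when a program is given, e.g. (hypothetical names):
#   srun -n 4 ./rocgdb_wrap.sh ./my_app -my_option
if [ "$#" -gt 0 ]; then
    run_under_debugger "$@"
fi
```

If the program faults, the debugger stops at that point and the remaining
`-ex` commands dump the thread list and backtraces into the per-rank log.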


>
> On Mon, Jan 24, 2022 at 10:16 AM Paul T. Bauman <ptbauman at gmail.com>
> wrote:
>
>> On Mon, Jan 24, 2022 at 8:53 AM Matthew Knepley <knepley at gmail.com>
>> wrote:
>>
>>> On Mon, Jan 24, 2022 at 9:24 AM Mark Adams <mfadams at lbl.gov> wrote:
>>>
>>>> What is the fastest way to rebuild hypre? Reconfiguring did not work
>>>> and is slow.
>>>>
>>>> I am printf debugging to find this HSA_STATUS_ERROR_MEMORY_FAULT  (no
>>>> debuggers other than valgrind on Crusher??!?!)
>>>>
>>>
>> Again, apologies for interjecting, but I wanted to offer a few pointers
>> here.
>>
>> 1. `rocgdb` will be in your PATH when the `rocm` module is loaded. This
>> is gdb, but with some extra AMDGPU goodies. AFAIK, you cannot yet step
>> through a kernel at the source level (only at the ISA level), but you can
>> query device variables in host code, print their values, etc.
>> 1a. Note that multiple threads can be spawned by the HIP runtime.
>> Furthermore, it's likely the thread you'll be on when you catch the error
>> is (one of) the runtime thread(s). You'll need to do `info threads` and
>> then select your host thread to get back to it.
>> 2. To get an accurate stacktrace (meaning get the line in the host code
>> where the error is actually happening), I recommend setting the following
>> environment variables for debugging that will force the serialization of
>> async memcopies and kernel launches:
>> AMD_SERIALIZE_KERNEL=3
>> AMD_SERIALIZE_COPY=3
>>
>> Thanks,
>>
>> Paul
>>
>
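
The serialization settings and thread selection described in the quoted
message can be sketched as follows. The two variable names and values are
taken from that message; the application name and the thread number in the
comments are hypothetical, and the commented commands are ordinary `gdb`
commands.

```shell
#!/bin/sh
# Force serialization of kernel launches and async memory copies so the
# host-side backtrace points at the call that actually triggers the error.
export AMD_SERIALIZE_KERNEL=3
export AMD_SERIALIZE_COPY=3

# Then, in an interactive session (hypothetical program and thread number):
#   rocgdb --args ./my_app
#   (gdb) run             # wait for the memory fault to be reported
#   (gdb) info threads    # list host and HIP runtime threads
#   (gdb) thread 1        # switch to the host thread of interest
#   (gdb) bt              # backtrace in the host code
```

These settings are meant for debugging only, since serializing launches and
copies removes the overlap the runtime normally provides.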