[petsc-users] hypre / hip usage

Mark Adams mfadams at lbl.gov
Mon Jan 24 09:54:39 CST 2022


OK, you meant in gdb. rocgdb seems to be hung here. Do you see a problem?
Thanks,

+ srun -n1 -N1 --ntasks-per-gpu=1 --gpu-bind=closest rocgdb --args ../ex13
-dm_plex_box_faces 2,2,2 -petscpartitioner_simple_process_grid 2,2,2
-dm_plex_box_upper 1,1,1 -petscpartitioner_simple_node_grid 1,1,1
-dm_refine 6 -dm_view -dm_mat_type aijkokkos -dm_vec_type kokkos -pc_type
jacobi -log_view -ksp_view -use_gpu_aware_mpi true
GNU gdb (rocm-rel-4.5-56) 11.1
Copyright (C) 2021 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://github.com/ROCm-Developer-Tools/ROCgdb/issues>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ../ex13...


On Mon, Jan 24, 2022 at 10:43 AM Paul T. Bauman <ptbauman at gmail.com> wrote:

>
>
> On Mon, Jan 24, 2022 at 9:31 AM Mark Adams <mfadams at lbl.gov> wrote:
>
>> Thanks Paul,
>>
>> How do I get a stack trace? I have been relying on PETSc's stack trace,
>> which piggybacks on timers, so it is not getting very deep here.
>>
>
> I'm not sure what the "PETSc way" is, but I just run the executable
> through `rocgdb` as one would do with `gdb` (`rocgdb` is literally `gdb`
> built with extra AMD stuff (that stuff is either upstreamed or being
> upstreamed to gdb BTW)). You can do it in batch mode as well so you can
> dump the logs from each MPI process.
>
>
>>
>> On Mon, Jan 24, 2022 at 10:16 AM Paul T. Bauman <ptbauman at gmail.com>
>> wrote:
>>
>>> On Mon, Jan 24, 2022 at 8:53 AM Matthew Knepley <knepley at gmail.com>
>>> wrote:
>>>
>>>> On Mon, Jan 24, 2022 at 9:24 AM Mark Adams <mfadams at lbl.gov> wrote:
>>>>
>>>>> What is the fastest way to rebuild hypre? reconfiguring did not work
>>>>> and is slow.
>>>>>
>>>>> I am printf debugging to find this HSA_STATUS_ERROR_MEMORY_FAULT  (no
>>>>> debuggers other than valgrind on Crusher??!?!)
>>>>>
>>>>
>>> Again, apologies for interjecting, but I wanted to offer a few pointers
>>> here.
>>>
>>> 1. `rocgdb` will be in your PATH when the `rocm` module is loaded. This
>>> is gdb, but with some extra AMDGPU goodies. AFAIK, you cannot yet step
>>> through a kernel at the source level (only at the ISA level), but you can
>>> query device variables in host code, print their values, etc.
>>> 1a. Note that multiple threads can be spawned by the HIP runtime.
>>> Furthermore, it's likely the thread you'll be on when you catch the error
>>> is (one of) the runtime thread(s). You'll need to do `info threads` and
>>> then select your host thread to get back to it.
>>> 2. To get an accurate stacktrace (meaning get the line in the host code
>>> where the error is actually happening), I recommend setting the following
>>> environment variables for debugging that will force the serialization of
>>> async memcopies and kernel launches:
>>> AMD_SERIALIZE_KERNEL=3
>>> AMD_SERIALIZE_COPY=3
>>>
>>> Thanks,
>>>
>>> Paul
>>>
>>
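
For reference, the two suggestions quoted above (the serialization variables and running `rocgdb` in batch mode so each MPI process dumps its own log) could be combined roughly as follows. This is only a sketch under assumptions: the helper name `rocgdb_batch_cmd` and the log file name are made up for illustration, and the example only prints the command rather than launching the debugger. The `-batch`, `-ex` and `--args` options and the `info threads` / `thread apply all bt` commands are standard gdb, which `rocgdb` is built from.

```shell
#!/bin/sh
# Debug-only settings from the quoted advice: serialize kernel launches
# and async copies so a fault is reported at the host call that caused it.
export AMD_SERIALIZE_KERNEL=3
export AMD_SERIALIZE_COPY=3

# Hypothetical helper (not part of PETSc or ROCm): print a batch-mode
# rocgdb command line for the given program and its arguments.
rocgdb_batch_cmd() {
    printf 'rocgdb -batch -ex run -ex "info threads" -ex "thread apply all bt" --args %s\n' "$*"
}

# srun sets SLURM_PROCID for each task; fall back to 0 outside Slurm.
rank="${SLURM_PROCID:-0}"

# Dry run: show the per-rank log name (an assumed naming scheme) and
# the command that would be executed for this rank.
echo "rank ${rank} -> rocgdb.rank${rank}.log"
rocgdb_batch_cmd ../ex13 -pc_type jacobi -log_view
```

To use it for real, the printed command could go into a small wrapper script that redirects output to the per-rank log, launched with the same srun line as above. Batch mode needs no terminal input, which may also sidestep the apparent hang at the prompt, though nothing in this thread confirms that is the cause.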