<div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Jan 24, 2022 at 9:53 AM Jed Brown <<a href="mailto:jed@jedbrown.org">jed@jedbrown.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">"Paul T. Bauman" <<a href="mailto:ptbauman@gmail.com" target="_blank">ptbauman@gmail.com</a>> writes:<br>
<br>
> 1. `rocgdb` will be in your PATH when the `rocm` module is loaded. This is<br>
> gdb, but with some extra AMDGPU goodies. AFAIK, you cannot yet step<br>
> through a kernel at the source level (only at the ISA level), but you can query<br>
> device variables in host code, print their values, etc.<br>
> 1a. Note that multiple threads can be spawned by the HIP runtime.<br>
> Furthermore, it's likely the thread you'll be on when you catch the error<br>
> is (one of) the runtime thread(s). You'll need to do `info threads` and<br>
> then select your host thread to get back to it.<br>
> 2. To get an accurate stacktrace (meaning get the line in the host code<br>
> where the error is actually happening), I recommend setting the following<br>
> environment variables for debugging that will force the serialization of<br>
> async memcopies and kernel launches:<br>
> AMD_SERIALIZE_KERNEL=3<br>
> AMD_SERIALIZE_COPY=3<br>
<br>
Is there a tutorial on this? I bet a 10-minute screencast demo would make a big impact on the use of these tools.<br></blockquote><div><br></div><div>The one that springs to mind is a 3-day (virtual) workshop from last May at OLCF. There was also a recent workshop on Crusher that may cover this.</div><div><br></div><div><a href="https://www.olcf.ornl.gov/calendar/2021hip/">https://www.olcf.ornl.gov/calendar/2021hip/</a></div><div><a href="https://www.olcf.ornl.gov/wp-content/uploads/2021/04/rocgdb_hipmath_ornl_2021_v2.pdf">https://www.olcf.ornl.gov/wp-content/uploads/2021/04/rocgdb_hipmath_ornl_2021_v2.pdf</a></div><div> </div><div>They recorded it, but I can't seem to find the recordings, and I'm not sure what OLCF did with them. Justin did live demos of the debugger during his talk. :(</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
AMD_SERIALIZE_COPY isn't documented at all and AMD_SERIALIZE_KERNEL isn't mentioned in this context.<br>
<br>
<a href="https://rocmdocs.amd.com/en/latest/search.html?q=amd_serialize_copy&check_keywords=yes&area=default" rel="noreferrer" target="_blank">https://rocmdocs.amd.com/en/latest/search.html?q=amd_serialize_copy&check_keywords=yes&area=default</a></blockquote><div><br></div><div>Sigh. This is a never-ending source of frustration on my end. Sorry, it is really unacceptable. This link is probably the best description at this moment: <a href="https://github.com/ROCm-Developer-Tools/HIP/blob/develop/docs/markdown/hip_debugging.md">https://github.com/ROCm-Developer-Tools/HIP/blob/develop/docs/markdown/hip_debugging.md</a></div></div></div>
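<div><br></div><div>In the meantime, here is a rough sketch of the workflow from the quoted steps above. The binary name ./my_hip_app is hypothetical, the thread id will differ from run to run, and the debugger commands are shown as comments because they are typed interactively at the gdb prompt.</div>

```shell
# Sketch only: ./my_hip_app is a hypothetical binary name.
# Force async copies and kernel launches to serialize so the error is
# reported at the host line that actually triggered it.
export AMD_SERIALIZE_KERNEL=3
export AMD_SERIALIZE_COPY=3

# Confirm what the application will see in its environment.
echo "AMD_SERIALIZE_KERNEL=${AMD_SERIALIZE_KERNEL} AMD_SERIALIZE_COPY=${AMD_SERIALIZE_COPY}"

# With the rocm module loaded, start the debugger on the application:
#   rocgdb --args ./my_hip_app
# At the (gdb) prompt, once the error has been caught:
#   info threads    # list all threads, including HIP runtime threads
#   thread 1        # switch to the host thread (the id will vary)
#   bt              # backtrace in the host code
```

<div>Serialization slows the run down, so these variables are best set only for the debugging session, not left in a job script.</div>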