<div dir="ltr">Thanks Paul,<div><br></div><div>How do I get a stack trace? I have been relying on PETSc's, which piggybacks on timers, so it does not go very deep here.</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Jan 24, 2022 at 10:16 AM Paul T. Bauman &lt;<a href="mailto:ptbauman@gmail.com">ptbauman@gmail.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Jan 24, 2022 at 8:53 AM Matthew Knepley &lt;<a href="mailto:knepley@gmail.com" target="_blank">knepley@gmail.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr">On Mon, Jan 24, 2022 at 9:24 AM Mark Adams &lt;<a href="mailto:mfadams@lbl.gov" target="_blank">mfadams@lbl.gov</a>&gt; wrote:<br></div><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div><div>What is the fastest way to rebuild hypre? Reconfiguring did not work and is slow.</div><div><br></div></div><div>I am printf debugging to find this HSA_STATUS_ERROR_MEMORY_FAULT (no debuggers other than valgrind on Crusher?!) </div></div></blockquote></div></div></blockquote><div><br></div><div>Again, apologies for interjecting, but I wanted to offer a few pointers here.</div><div><br></div><div>1. `rocgdb` will be in your PATH when the `rocm` module is loaded. This is gdb, but with some extra AMDGPU goodies. AFAIK, you cannot yet step through a kernel at the source level (only at the ISA level), but you can query device variables in host code, print their values, etc.<br></div><div>1a. Note that the HIP runtime can spawn multiple threads. 
Furthermore, the thread you are on when you catch the error is likely (one of) the runtime thread(s). You'll need to run `info threads` and then select your host thread to get back to it.</div><div>2. To get an accurate stack trace (meaning the line in the host code where the error is actually happening), I recommend setting the following environment variables while debugging; they force the serialization of async memcopies and kernel launches: <br></div><div><span style="color:rgb(255,255,255);font-family:'Segoe UI',system-ui,'Apple Color Emoji','Segoe UI Emoji',sans-serif;font-size:16px;font-style:normal;font-variant-ligatures:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:left;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(41,41,41);text-decoration-style:initial;text-decoration-color:initial;display:inline;float:none">AMD_SERIALIZE_KERNEL=3<br></span></div><div><span style="color:rgb(255,255,255);font-family:'Segoe UI',system-ui,'Apple Color Emoji','Segoe UI Emoji',sans-serif;font-size:16px;font-style:normal;font-variant-ligatures:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:left;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(41,41,41);text-decoration-style:initial;text-decoration-color:initial;display:inline;float:none">AMD_SERIALIZE_COPY=3</span></div><div><span style="color:rgb(255,255,255);font-family:'Segoe UI',system-ui,'Apple Color Emoji','Segoe UI Emoji',sans-serif;font-size:16px;font-style:normal;font-variant-ligatures:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:left;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(41,41,41);text-decoration-style:initial;text-decoration-color:initial;display:inline;float:none"><br></span></div><div>Thanks,</div><div><br></div><div>Paul<br></div></div></div>
</blockquote></div>
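<div><br></div><div>For reference, here is a minimal shell sketch of the setup described in the quoted message. It only exports the two serialization variables with the value given there and echoes them back. The debugger steps are left as comments because they need the `rocm` module and a real binary; `./my_app` and the thread number `N` are hypothetical placeholders, not names from this thread.</div>

```shell
#!/bin/sh
# Force HIP to serialize kernel launches and async copies, so a fault is
# reported at the host call that triggered it (values from the quoted message).
export AMD_SERIALIZE_KERNEL=3
export AMD_SERIALIZE_COPY=3

# Confirm what the debugged process will inherit.
echo "AMD_SERIALIZE_KERNEL=${AMD_SERIALIZE_KERNEL}"
echo "AMD_SERIALIZE_COPY=${AMD_SERIALIZE_COPY}"

# With the rocm module loaded, start the debugger (./my_app is hypothetical):
#   rocgdb --args ./my_app
# Then, at the (gdb) prompt once the fault is caught:
#   info threads   # list threads; the current one may be a HIP runtime thread
#   thread N       # switch to your host thread (N taken from the list above)
#   bt             # print the host-side stack trace
```

<div>Exporting the variables in the launching shell, rather than setting them inside the debugger, should also cover runs outside `rocgdb`, such as a batch job that only prints the failing call.</div>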