<div dir="ltr">Glad you figured it out!<div><br clear="all"><div><div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr">--Junchao Zhang</div></div></div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sun, Feb 4, 2024 at 7:56 PM Yesypenko, Anna <<a href="mailto:anna@oden.utexas.edu">anna@oden.utexas.edu</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div class="msg-5214566323276710640">
<div dir="ltr">
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
Hi Junchao, Victor,</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
I fixed the issue! The problem was the CPU bindings: a Python process effectively runs on only one core.</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
I had to modify the MPI launch script so that each Python instance is bound to exactly one physical core.</div>
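<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
A minimal sketch of such a launch wrapper, adapted from Victor's script quoted below. The helper name core_for_rank is hypothetical, and the core chosen for each local rank (the first core of each of Victor's ranges) is an assumption for illustration, not necessarily the exact binding used here.</div>
<pre>
```shell
#!/bin/bash
# Sketch only: bind each MPI rank to a single core (mvapich2-gdr assumed).
# The core ids below are assumptions taken from the start of each range
# in Victor's script; adjust them to the node's actual topology.

# Map a node-local rank to one core id.
core_for_rank() {
  case "$1" in
    0) echo 0 ;;
    1) echo 64 ;;
    2) echo 72 ;;
    *) echo "no core assigned to local rank '$1'" >&2; return 1 ;;
  esac
}

# Launch only when a command was passed as arguments.
if [ "$#" -gt 0 ]; then
  local_rank=${MV2_COMM_WORLD_LOCAL_RANK:?MV2_COMM_WORLD_LOCAL_RANK is not set}
  core=$(core_for_rank "$local_rank") || exit 1
  # One GPU per rank, as in the original script.
  export CUDA_VISIBLE_DEVICES=$local_rank
  # Pin the process to exactly one core, then run the given command.
  exec numactl --physcpubind="$core" "$@"
fi
```
</pre>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
It is invoked the same way as Victor's script, with the Python command as the trailing arguments.</div>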
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
Thank you both very much for your patience and help!</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
Best,</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
Anna</div>
<div id="m_-5214566323276710640appendonsend"></div>
<hr style="display:inline-block;width:98%">
<div id="m_-5214566323276710640divRplyFwdMsg" dir="ltr"><font face="Calibri, sans-serif" style="font-size:11pt" color="#000000"><b>From:</b> Yesypenko, Anna <<a href="mailto:anna@oden.utexas.edu" target="_blank">anna@oden.utexas.edu</a>><br>
<b>Sent:</b> Friday, February 2, 2024 2:12 PM<br>
<b>To:</b> Junchao Zhang <<a href="mailto:junchao.zhang@gmail.com" target="_blank">junchao.zhang@gmail.com</a>><br>
<b>Cc:</b> Victor Eijkhout <<a href="mailto:eijkhout@tacc.utexas.edu" target="_blank">eijkhout@tacc.utexas.edu</a>>; <a href="mailto:petsc-users@mcs.anl.gov" target="_blank">petsc-users@mcs.anl.gov</a> <<a href="mailto:petsc-users@mcs.anl.gov" target="_blank">petsc-users@mcs.anl.gov</a>><br>
<b>Subject:</b> Re: [petsc-users] errors with hypre with MPI and multiple GPUs on a node</font>
<div> </div>
</div>
<div dir="ltr">
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
Hi Junchao,</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div><span style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">Unfortunately I don't have access to other cuda machines with multiple GPUs.</span></div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
I'm pretty stuck, and I think running on a different machine would help isolate the issue.</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
I'm sharing the Python script and the launch script that Victor wrote.</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
A comment in the launch script gives the MPI command I was using to run the Python script.</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
I configured hypre without unified memory. In case it's useful, I also attached the configure.log.</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
If the issue is with petsc/hypre, it may be in the environment variables described here (e.g. HYPRE_MEMORY_DEVICE):</div>
<div><span style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)"><a href="https://github.com/hypre-space/hypre/wiki/GPUs" id="m_-5214566323276710640LPlnk" target="_blank">https://github.com/hypre-space/hypre/wiki/GPUs</a></span></div>
<div><br>
</div>
<div><span style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">Thank you for helping me troubleshoot this issue!</span></div>
<div><span style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">Best,</span></div>
<div><span style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">Anna<br>
</span></div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div id="m_-5214566323276710640x_appendonsend"></div>
<hr style="display:inline-block;width:98%">
<div id="m_-5214566323276710640x_divRplyFwdMsg" dir="ltr"><font face="Calibri, sans-serif" color="#000000" style="font-size:11pt"><b>From:</b> Junchao Zhang <<a href="mailto:junchao.zhang@gmail.com" target="_blank">junchao.zhang@gmail.com</a>><br>
<b>Sent:</b> Thursday, February 1, 2024 9:07 PM<br>
<b>To:</b> Yesypenko, Anna <<a href="mailto:anna@oden.utexas.edu" target="_blank">anna@oden.utexas.edu</a>><br>
<b>Cc:</b> Victor Eijkhout <<a href="mailto:eijkhout@tacc.utexas.edu" target="_blank">eijkhout@tacc.utexas.edu</a>>; <a href="mailto:petsc-users@mcs.anl.gov" target="_blank">petsc-users@mcs.anl.gov</a> <<a href="mailto:petsc-users@mcs.anl.gov" target="_blank">petsc-users@mcs.anl.gov</a>><br>
<b>Subject:</b> Re: [petsc-users] errors with hypre with MPI and multiple GPUs on a node</font>
<div> </div>
</div>
<div>
<div dir="ltr">
<div>Hi, Anna,</div>
<div> Do you have other CUDA machines to try? If you can share your test, then I will run on Polaris@Argonne to see if it is a petsc/hypre issue. If not, then it must be a GPU-MPI binding problem on TACC. </div>
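<div>One way to check the binding hypothesis is to have every rank report what it is allowed to use. This is only a sketch: it assumes Linux (it reads Cpus_allowed_list from /proc/self/status) and mvapich2's MV2_COMM_WORLD_LOCAL_RANK, and the function name describe_binding is hypothetical.</div>
<pre>
```shell
#!/bin/bash
# Sketch only: print one line describing the GPU and CPU binding
# seen by the calling process. Linux-specific.
describe_binding() {
  local cpus
  # Cpus_allowed_list holds the CPU ids this process may run on.
  cpus=$(awk '/^Cpus_allowed_list:/ {print $2}' /proc/self/status 2>/dev/null)
  echo "host=${HOSTNAME:-unknown} local_rank=${MV2_COMM_WORLD_LOCAL_RANK:-unset}" \
       "gpus=${CUDA_VISIBLE_DEVICES:-unset} cpus=${cpus:-unknown}"
}

describe_binding
```
</pre>
<div>Running this under the launch script on every rank should show whether two ranks end up sharing a GPU or a CPU range.</div>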
<div><br>
</div>
Thanks<br clear="all">
<div>
<div dir="ltr">
<div dir="ltr">--Junchao Zhang</div>
</div>
</div>
<br>
</div>
<br>
<div>
<div dir="ltr">On Thu, Feb 1, 2024 at 5:31 PM Yesypenko, Anna <<a href="mailto:anna@oden.utexas.edu" target="_blank">anna@oden.utexas.edu</a>> wrote:<br>
</div>
<blockquote style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
<div dir="ltr">
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
Hi Victor, Junchao,</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
Thank you for providing the script; it is very useful! </div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
There are still issues with hypre not binding correctly, and I'm getting the error message occasionally (but much less often).</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
I added some additional environment variables to the script that seem to make the behavior more consistent.</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div><span style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">export CUDA_DEVICE_ORDER=PCI_BUS_ID</span></div>
<div><span style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">export CUDA_VISIBLE_DEVICES=$MV2_COMM_WORLD_LOCAL_RANK ## as Victor suggested</span></div>
<div><span style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">export HYPRE_MEMORY_DEVICE=$MV2_COMM_WORLD_LOCAL_RANK</span></div>
<div><span style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)"><br>
</span></div>
<div><span style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">The last environment variable is from hypre's documentation on GPUs.</span></div>
<div><span style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">Out of 30 runs at a small problem size, 4 failed with a hypre-related error. Do you have any other thoughts or suggestions?</span></div>
<div><span style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)"><br>
</span></div>
<div><span style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">Best,</span></div>
<div><span style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">Anna</span></div>
<div><span style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)"><br>
</span></div>
<div id="m_-5214566323276710640x_x_m_-278907308663054568m_5798670685468534043appendonsend"></div>
<hr style="display:inline-block;width:98%">
<div dir="ltr" id="m_-5214566323276710640x_x_m_-278907308663054568m_5798670685468534043divRplyFwdMsg"><span style="font-family:Calibri,sans-serif;font-size:11pt;color:rgb(0,0,0)"><b>From:</b> Victor Eijkhout <<a href="mailto:eijkhout@tacc.utexas.edu" target="_blank">eijkhout@tacc.utexas.edu</a>><br>
<b>Sent:</b> Thursday, February 1, 2024 11:26 AM<br>
<b>To:</b> Junchao Zhang <<a href="mailto:junchao.zhang@gmail.com" target="_blank">junchao.zhang@gmail.com</a>>; Yesypenko, Anna <<a href="mailto:anna@oden.utexas.edu" target="_blank">anna@oden.utexas.edu</a>><br>
<b>Cc:</b> <a href="mailto:petsc-users@mcs.anl.gov" target="_blank">petsc-users@mcs.anl.gov</a> <<a href="mailto:petsc-users@mcs.anl.gov" target="_blank">petsc-users@mcs.anl.gov</a>><br>
<b>Subject:</b> Re: [petsc-users] errors with hypre with MPI and multiple GPUs on a node</span>
<div> </div>
</div>
<p style="margin:0in;font-family:Calibri,sans-serif;font-size:11pt"><span style="color:rgb(0,0,0)">Only for mvapich2-gdr:</span></p>
<p style="margin:0in;font-family:Calibri,sans-serif;font-size:11pt"><span style="color:rgb(0,0,0)"> </span></p>
<div id="m_-5214566323276710640x_x_m_-278907308663054568m_5798670685468534043x_mail-editor-reference-message-container">
<div id="m_-5214566323276710640x_x_m_-278907308663054568m_5798670685468534043x_mail-editor-reference-message-container">
<p style="margin:0in 0in 3pt;font-family:Calibri,sans-serif;font-size:11pt"><span style="font-family:Monaco;font-size:9pt;color:rgb(29,28,29)">#!/bin/bash</span></p>
<p style="margin:0in 0in 3pt;font-family:Calibri,sans-serif;font-size:11pt"><span style="font-family:Monaco;font-size:9pt;color:rgb(29,28,29)"># Usage: mpirun -n &lt;num_proc&gt; MV2_USE_AFFINITY=0 MV2_ENABLE_AFFINITY=0 ./launch ./bin</span></p>
<p style="margin:0in 0in 3pt;font-family:Calibri,sans-serif;font-size:11pt"><span style="font-family:Monaco;font-size:9pt;color:rgb(29,28,29)"> </span></p>
<p style="margin:0in 0in 3pt;font-family:Calibri,sans-serif;font-size:11pt"><span style="font-family:Monaco;font-size:9pt;color:rgb(29,28,29)">export CUDA_VISIBLE_DEVICES=$MV2_COMM_WORLD_LOCAL_RANK</span></p>
<p style="margin:0in 0in 3pt;font-family:Calibri,sans-serif;font-size:11pt"><span style="font-family:Monaco;font-size:9pt;color:rgb(29,28,29)">case $MV2_COMM_WORLD_LOCAL_RANK in</span></p>
<p style="margin:0in 0in 3pt;font-family:Calibri,sans-serif;font-size:11pt"><span style="font-family:Monaco;font-size:9pt;color:rgb(29,28,29)"> [0]) cpus=0-3 ;;</span></p>
<p style="margin:0in 0in 3pt;font-family:Calibri,sans-serif;font-size:11pt"><span style="font-family:Monaco;font-size:9pt;color:rgb(29,28,29)"> [1]) cpus=64-67 ;;</span></p>
<p style="margin:0in 0in 3pt;font-family:Calibri,sans-serif;font-size:11pt"><span style="font-family:Monaco;font-size:9pt;color:rgb(29,28,29)"> [2]) cpus=72-75 ;;</span></p>
<p style="margin:0in 0in 3pt;font-family:Calibri,sans-serif;font-size:11pt"><span style="font-family:Monaco;font-size:9pt;color:rgb(29,28,29)">esac</span></p>
<p style="margin:0in 0in 3pt;font-family:Calibri,sans-serif;font-size:11pt"><span style="font-family:Monaco;font-size:9pt;color:rgb(29,28,29)"> </span></p>
<p style="margin:0in 0in 3pt;font-family:Calibri,sans-serif;font-size:11pt"><span style="font-family:Monaco;font-size:9pt;color:rgb(29,28,29)">numactl --physcpubind="$cpus" "$@"</span></p>
<p style="margin:0in;font-family:Calibri,sans-serif;font-size:11pt"> </p>
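<p style="margin:0in;font-family:Calibri,sans-serif;font-size:11pt"><span style="color:rgb(0,0,0)">Other launchers export the node-local rank under different names. A hedged sketch of a lookup that tries several in turn: the function name local_rank is hypothetical, and the choice of variables (MVAPICH2, Open MPI, Slurm) is an assumption about which launchers matter.</span></p>
<pre>
```shell
#!/bin/bash
# Sketch only: find the node-local rank under several launchers.
# Checked in order: MVAPICH2, Open MPI, Slurm (srun).
local_rank() {
  local rank
  # Nested defaults: the first variable that is set and non-empty wins.
  rank=${MV2_COMM_WORLD_LOCAL_RANK:-${OMPI_COMM_WORLD_LOCAL_RANK:-${SLURM_LOCALID:-}}}
  if [ -z "$rank" ]; then
    echo "could not determine the node-local rank" >&2
    return 1
  fi
  echo "$rank"
}
```
</pre>
<p style="margin:0in;font-family:Calibri,sans-serif;font-size:11pt"><span style="color:rgb(0,0,0)">In the script above, the result of local_rank could stand in for $MV2_COMM_WORLD_LOCAL_RANK; the affinity settings in the usage line remain specific to mvapich2.</span></p>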
</div>
</div>
</div>
</div>
</blockquote>
</div>
</div>
</div>
</div>
</div></blockquote></div>