<div dir="ltr"><div>Hi Anna,</div><div>  Do you have other CUDA machines to try? If you can share your test, I will run it on Polaris@Argonne to see whether it is a petsc/hypre issue. If it is not, then it must be a GPU-MPI binding problem on TACC. </div><div><br></div>  Thanks<br clear="all"><div><div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr">--Junchao Zhang</div></div></div><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Feb 1, 2024 at 5:31 PM Yesypenko, Anna &lt;<a href="mailto:anna@oden.utexas.edu" target="_blank">anna@oden.utexas.edu</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div>




<div dir="ltr">
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
Hi Victor, Junchao,</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
Thank you for providing the script; it is very useful! </div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
There are still issues with hypre not binding correctly, and I'm getting the error message occasionally (but much less often).</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
I added some additional environment variables to the script that seem to make the behavior more consistent.</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div><span style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">export CUDA_DEVICE_ORDER=PCI_BUS_ID</span></div>
<div><span style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">export CUDA_VISIBLE_DEVICES=$MV2_COMM_WORLD_LOCAL_RANK    ## as Victor suggested</span></div>
<div><span style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">export HYPRE_MEMORY_DEVICE=$MV2_COMM_WORLD_LOCAL_RANK</span></div>
<div><span style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)"><br>
</span></div>
<div><span style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">The last environment variable is from hypre's documentation on GPUs.</span></div>
<div><span style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">In 30 runs of a small problem, 4 failed with a hypre-related error. Do you have
 any other thoughts or suggestions?</span></div>
<div><span style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)"><br>
</span></div>
<div><span style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">Best,</span></div>
<div><span style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">Anna</span></div>
<div><span style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)"><br>
</span></div>
<div id="m_-278907308663054568m_5798670685468534043appendonsend"></div>
<hr style="display:inline-block;width:98%">
<div dir="ltr" id="m_-278907308663054568m_5798670685468534043divRplyFwdMsg"><span style="font-family:Calibri,sans-serif;font-size:11pt;color:rgb(0,0,0)"><b>From:</b> Victor Eijkhout &lt;<a href="mailto:eijkhout@tacc.utexas.edu" target="_blank">eijkhout@tacc.utexas.edu</a>&gt;<br>
<b>Sent:</b> Thursday, February 1, 2024 11:26 AM<br>
<b>To:</b> Junchao Zhang &lt;<a href="mailto:junchao.zhang@gmail.com" target="_blank">junchao.zhang@gmail.com</a>&gt;; Yesypenko, Anna &lt;<a href="mailto:anna@oden.utexas.edu" target="_blank">anna@oden.utexas.edu</a>&gt;<br>
<b>Cc:</b> <a href="mailto:petsc-users@mcs.anl.gov" target="_blank">petsc-users@mcs.anl.gov</a> &lt;<a href="mailto:petsc-users@mcs.anl.gov" target="_blank">petsc-users@mcs.anl.gov</a>&gt;<br>
<b>Subject:</b> Re: [petsc-users] errors with hypre with MPI and multiple GPUs on a node</span>
<div> </div>
</div>
<p style="margin:0in;font-family:Calibri,sans-serif;font-size:11pt"><span style="color:rgb(0,0,0)">Only for mvapich2-gdr:</span></p>
<p style="margin:0in;font-family:Calibri,sans-serif;font-size:11pt"><span style="color:rgb(0,0,0)"> </span></p>
<div id="m_-278907308663054568m_5798670685468534043x_mail-editor-reference-message-container">
<div id="m_-278907308663054568m_5798670685468534043x_mail-editor-reference-message-container">
<p style="margin:0in 0in 3pt;font-family:Calibri,sans-serif;font-size:11pt">
<span style="font-family:Monaco;font-size:9pt;color:rgb(29,28,29)">#!/bin/bash</span></p>
<p style="margin:0in 0in 3pt;font-family:Calibri,sans-serif;font-size:11pt">
<span style="font-family:Monaco;font-size:9pt;color:rgb(29,28,29)"># Usage: mpirun -n &lt;num_proc&gt; MV2_USE_AFFINITY=0 MV2_ENABLE_AFFINITY=0 ./launch ./bin</span></p>
<p style="margin:0in 0in 3pt;font-family:Calibri,sans-serif;font-size:11pt">
<span style="font-family:Monaco;font-size:9pt;color:rgb(29,28,29)"> </span></p>
<p style="margin:0in 0in 3pt;font-family:Calibri,sans-serif;font-size:11pt">
<span style="font-family:Monaco;font-size:9pt;color:rgb(29,28,29)">export CUDA_VISIBLE_DEVICES=$MV2_COMM_WORLD_LOCAL_RANK</span></p>
<p style="margin:0in 0in 3pt;font-family:Calibri,sans-serif;font-size:11pt">
<span style="font-family:Monaco;font-size:9pt;color:rgb(29,28,29)">case $MV2_COMM_WORLD_LOCAL_RANK in</span></p>
<p style="margin:0in 0in 3pt;font-family:Calibri,sans-serif;font-size:11pt">
<span style="font-family:Monaco;font-size:9pt;color:rgb(29,28,29)">        [0]) cpus=0-3 ;;</span></p>
<p style="margin:0in 0in 3pt;font-family:Calibri,sans-serif;font-size:11pt">
<span style="font-family:Monaco;font-size:9pt;color:rgb(29,28,29)">        [1]) cpus=64-67 ;;</span></p>
<p style="margin:0in 0in 3pt;font-family:Calibri,sans-serif;font-size:11pt">
<span style="font-family:Monaco;font-size:9pt;color:rgb(29,28,29)">        [2]) cpus=72-75 ;;</span></p>
<p style="margin:0in 0in 3pt;font-family:Calibri,sans-serif;font-size:11pt">
<span style="font-family:Monaco;font-size:9pt;color:rgb(29,28,29)">esac</span></p>
<p style="margin:0in 0in 3pt;font-family:Calibri,sans-serif;font-size:11pt">
<span style="font-family:Monaco;font-size:9pt;color:rgb(29,28,29)"> </span></p>
<p style="margin:0in 0in 3pt;font-family:Calibri,sans-serif;font-size:11pt">
<span style="font-family:Monaco;font-size:9pt;color:rgb(29,28,29)">numactl --physcpubind=$cpus "$@"</span></p>
<p style="margin:0in;font-family:Calibri,sans-serif;font-size:11pt"> </p>
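<p style="margin:0in;font-family:Calibri,sans-serif;font-size:11pt"><span style="color:rgb(0,0,0)">The script above leaves <code>cpus</code> unset for any local rank other than 0, 1, or 2, and it assumes <code>MV2_COMM_WORLD_LOCAL_RANK</code> is always set. Below is a minimal sketch of a more defensive variant, which may help rule out the wrapper itself when failures are intermittent. It reuses the same CPU ranges and adds the <code>CUDA_DEVICE_ORDER</code> setting from the message above. The helper name <code>cpus_for_rank</code> is hypothetical, and the sketch has not been tested on TACC.</span></p>
<pre>
```shell
#!/bin/bash
# Sketch of a per-rank launch wrapper for mvapich2-gdr (an untested assumption,
# not a verified TACC recipe).
# Usage: mpirun -n NUM_PROC MV2_USE_AFFINITY=0 MV2_ENABLE_AFFINITY=0 ./launch ./bin

# Map a node-local rank to a CPU range (same ranges as the script above).
# Returns non-zero for ranks that have no entry.
cpus_for_rank() {
    case "$1" in
        0) echo "0-3" ;;
        1) echo "64-67" ;;
        2) echo "72-75" ;;
        *) return 1 ;;
    esac
}

# Only launch when a command was given on the command line.
if [ "$#" -gt 0 ]; then
    # Abort with a message if the MPI launcher did not set the local rank.
    rank="${MV2_COMM_WORLD_LOCAL_RANK:?not set; run under mvapich2-gdr mpirun}"

    # Abort instead of calling numactl with an empty CPU list.
    cpus="$(cpus_for_rank "$rank")" || cpus=""
    : "${cpus:?no CPU range configured for local rank $rank}"

    export CUDA_DEVICE_ORDER=PCI_BUS_ID
    # The process now sees exactly one GPU, which CUDA enumerates as device 0.
    export CUDA_VISIBLE_DEVICES="$rank"

    # Quote "$@" so arguments containing spaces reach the binary intact.
    exec numactl --physcpubind="$cpus" "$@"
fi
```
</pre>
<p style="margin:0in;font-family:Calibri,sans-serif;font-size:11pt"><span style="color:rgb(0,0,0)"><code>HYPRE_MEMORY_DEVICE</code> is left out of this sketch: once <code>CUDA_VISIBLE_DEVICES</code> restricts a process to one GPU, the device index inside that process is 0 rather than the local rank, and whether hypre reads that name from the environment at all is worth checking against its documentation.</span></p>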
</div>
</div>
</div>

</div></blockquote></div>