Hi, Marcos,
  If you look at the PIDs in the nvidia-smi output, you will find only 8 unique PIDs, which is expected since you allocated 8 MPI ranks per node.
  The duplicate entries are usually threads spawned by the MPI runtime (for example, progress threads in the MPI implementation). So your job script and output are all good.

  Thanks.

On Mon, Aug 21, 2023 at 2:00 PM Vanella, Marcos (Fed) <marcos.vanella@nist.gov> wrote:
Hi Junchao, something I'm noticing when running with the CUDA-enabled linear solvers (CG+HYPRE, CG+GAMG) is that, for multi-CPU/multi-GPU calculations, GPU 0 on each node seems to be taking the sub-matrices corresponding to all the MPI processes on that node. This is the output of nvidia-smi on a node with 8 MPI processes (each advancing the same number of unknowns in the calculation) and 4 V100 GPUs:
Mon Aug 21 14:36:07 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla V100-SXM2-16GB           On  | 00000004:04:00.0 Off |                    0 |
| N/A   34C    P0              63W / 300W |   2488MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2-16GB           On  | 00000004:05:00.0 Off |                    0 |
| N/A   38C    P0              56W / 300W |    638MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2-16GB           On  | 00000035:03:00.0 Off |                    0 |
| N/A   35C    P0              52W / 300W |    638MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2-16GB           On  | 00000035:04:00.0 Off |                    0 |
| N/A   38C    P0              53W / 300W |    638MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    214626      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux      318MiB |
|    0   N/A  N/A    214627      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux      308MiB |
|    0   N/A  N/A    214628      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux      308MiB |
|    0   N/A  N/A    214629      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux      308MiB |
|    0   N/A  N/A    214630      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux      318MiB |
|    0   N/A  N/A    214631      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux      308MiB |
|    0   N/A  N/A    214632      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux      308MiB |
|    0   N/A  N/A    214633      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux      308MiB |
|    1   N/A  N/A    214627      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux      318MiB |
|    1   N/A  N/A    214631      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux      318MiB |
|    2   N/A  N/A    214628      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux      318MiB |
|    2   N/A  N/A    214632      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux      318MiB |
|    3   N/A  N/A    214629      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux      318MiB |
|    3   N/A  N/A    214633      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux      318MiB |
+---------------------------------------------------------------------------------------+
You can see that GPU 0 is connected to all 8 MPI processes, each taking about 300 MiB on it, whereas GPUs 1, 2 and 3 are each working with 2 MPI processes. I'm wondering whether this is expected, or whether there are changes I need to make to my submission script or runtime parameters.
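As a quick cross-check, the per-GPU attachments and the number of unique PIDs can also be tallied with nvidia-smi's query mode (a sketch only; it assumes the gpu_bus_id and pid query fields are available with this driver version):

# one line per (GPU, process) pair
nvidia-smi --query-compute-apps=gpu_bus_id,pid --format=csv,noheader > gpu_procs.csv
cut -d, -f1 gpu_procs.csv | sort | uniq -c    # processes attached to each GPU
cut -d, -f2 gpu_procs.csv | sort -u | wc -l   # number of unique PIDs on the node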
This is the script in this case (2 nodes, 8 MPI processes/node, 4 GPUs/node):
#!/bin/bash
# ../../Utilities/Scripts/qfds.sh -p 2  -T db -d test.fds
#SBATCH -J test
#SBATCH -e /home/mnv/Firemodels_fork/fds/Issues/PETSc/test.err
#SBATCH -o /home/mnv/Firemodels_fork/fds/Issues/PETSc/test.log
#SBATCH --partition=gpu
#SBATCH --ntasks=16
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=1
#SBATCH --nodes=2
#SBATCH --time=01:00:00
#SBATCH --gres=gpu:4

export OMP_NUM_THREADS=1
# modules
module load cuda/11.7
module load gcc/11.2.1/toolset
module load openmpi/4.1.4/gcc-11.2.1-cuda-11.7

cd /home/mnv/Firemodels_fork/fds/Issues/PETSc

srun -N 2 -n 16 /home/mnv/Firemodels_fork/fds/Build/ompi_gnu_linux/fds_ompi_gnu_linux test.fds -pc_type gamg -mat_type aijcusparse -vec_type cuda
Thank you for the advice,
Marcos