<div dir="ltr">That is a good question. Looking at <a href="https://slurm.schedmd.com/gres.html#GPU_Management">https://slurm.schedmd.com/gres.html#GPU_Management</a>, could you share the output of your job so we can search for CUDA_VISIBLE_DEVICES and see how the GPUs were allocated?<div><br><div><div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr">--Junchao Zhang</div></div></div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Aug 21, 2023 at 2:38 PM Vanella, Marcos (Fed) <<a href="mailto:marcos.vanella@nist.gov">marcos.vanella@nist.gov</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div class="msg3869060330462788085">
<div dir="ltr">
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
OK, thanks Junchao. So is GPU 0 actually allocating memory for all 8 MPI processes' meshes, but only doing compute work on 2 of them?</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
The nvidia-smi output shows it has allocated 2.4 GB.</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
Best,</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
Marcos<br>
</div>
<div id="m_3869060330462788085appendonsend"></div>
<hr style="display:inline-block;width:98%">
<div id="m_3869060330462788085divRplyFwdMsg" dir="ltr"><font face="Calibri, sans-serif" style="font-size:11pt" color="#000000"><b>From:</b> Junchao Zhang <<a href="mailto:junchao.zhang@gmail.com" target="_blank">junchao.zhang@gmail.com</a>><br>
<b>Sent:</b> Monday, August 21, 2023 3:29 PM<br>
<b>To:</b> Vanella, Marcos (Fed) <<a href="mailto:marcos.vanella@nist.gov" target="_blank">marcos.vanella@nist.gov</a>><br>
<b>Cc:</b> PETSc users list <<a href="mailto:petsc-users@mcs.anl.gov" target="_blank">petsc-users@mcs.anl.gov</a>>; Guan, Collin X. (Fed) <<a href="mailto:collin.guan@nist.gov" target="_blank">collin.guan@nist.gov</a>><br>
<b>Subject:</b> Re: [petsc-users] CUDA error trying to run a job with two mpi processes and 1 GPU</font>
<div> </div>
</div>
<div>
<div dir="ltr">Hi, Marcos,
<div> If you look at the PIDs of the nvidia-smi output, you will only find 8 unique PIDs, which is expected since you allocated 8 MPI ranks per node.</div>
<div> The duplicate PIDs usually correspond to threads spawned by the MPI runtime (for example, progress threads in the MPI implementation). So your job script and output all look good.<br>
<div><br>
</div>
</div>
<div> Thanks.</div>
</div>
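As a quick sanity check, the unique PIDs can be counted directly from nvidia-smi's compute-apps query rather than by eye. This is a sketch: the <span style="font-family:&quot;Courier New&quot;,monospace">count_unique_pids</span> helper is a hypothetical name, while <span style="font-family:&quot;Courier New&quot;,monospace">--query-compute-apps</span> is a standard nvidia-smi option.

```shell
# count_unique_pids: count distinct PIDs fed on stdin, one per line.
# (hypothetical helper: de-duplicate, then count)
count_unique_pids() {
  sort -u | wc -l
}

# On a GPU node, this should print 8 for the job above, matching the
# 8 MPI ranks per node:
#   nvidia-smi --query-compute-apps=pid --format=csv,noheader | count_unique_pids
```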
<br>
<div>
<div dir="ltr">On Mon, Aug 21, 2023 at 2:00 PM Vanella, Marcos (Fed) <<a href="mailto:marcos.vanella@nist.gov" target="_blank">marcos.vanella@nist.gov</a>> wrote:<br>
</div>
<blockquote style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
<div dir="ltr">
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
Hi Junchao, something I'm noticing when running with CUDA-enabled linear solvers (CG+HYPRE, CG+GAMG) in multi-CPU/multi-GPU calculations is that GPU 0 on the node seems to be taking all the sub-matrices corresponding to all the MPI processes on
the node. This is the output of the nvidia-smi command on a node with 8 MPI processes (each advancing the same number of unknowns in the calculation) and 4 V100 GPUs:</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<span style="font-family:"Courier New",monospace">Mon Aug 21 14:36:07 2023 </span>
<div><span style="font-family:"Courier New",monospace">+---------------------------------------------------------------------------------------+</span></div>
<div><span style="font-family:"Courier New",monospace">| NVIDIA-SMI 535.54.03 Driver Version: 535.54.03 CUDA Version: 12.2 |</span></div>
<div><span style="font-family:"Courier New",monospace">|-----------------------------------------+----------------------+----------------------+</span></div>
<div><span style="font-family:"Courier New",monospace">| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |</span></div>
<div><span style="font-family:"Courier New",monospace">| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |</span></div>
<div><span style="font-family:"Courier New",monospace">| | | MIG M. |</span></div>
<div><span style="font-family:"Courier New",monospace">|=========================================+======================+======================|</span></div>
<div><span style="font-family:"Courier New",monospace">| 0 Tesla V100-SXM2-16GB On | 00000004:04:00.0 Off | 0 |</span></div>
<div><span style="font-family:"Courier New",monospace">| N/A 34C P0 63W / 300W | 2488MiB / 16384MiB | 0% Default |</span></div>
<div><span style="font-family:"Courier New",monospace">| | | N/A |</span></div>
<div><span style="font-family:"Courier New",monospace">+-----------------------------------------+----------------------+----------------------+</span></div>
<div><span style="font-family:"Courier New",monospace">| 1 Tesla V100-SXM2-16GB On | 00000004:05:00.0 Off | 0 |</span></div>
<div><span style="font-family:"Courier New",monospace">| N/A 38C P0 56W / 300W | 638MiB / 16384MiB | 0% Default |</span></div>
<div><span style="font-family:"Courier New",monospace">| | | N/A |</span></div>
<div><span style="font-family:"Courier New",monospace">+-----------------------------------------+----------------------+----------------------+</span></div>
<div><span style="font-family:"Courier New",monospace">| 2 Tesla V100-SXM2-16GB On | 00000035:03:00.0 Off | 0 |</span></div>
<div><span style="font-family:"Courier New",monospace">| N/A 35C P0 52W / 300W | 638MiB / 16384MiB | 0% Default |</span></div>
<div><span style="font-family:"Courier New",monospace">| | | N/A |</span></div>
<div><span style="font-family:"Courier New",monospace">+-----------------------------------------+----------------------+----------------------+</span></div>
<div><span style="font-family:"Courier New",monospace">| 3 Tesla V100-SXM2-16GB On | 00000035:04:00.0 Off | 0 |</span></div>
<div><span style="font-family:"Courier New",monospace">| N/A 38C P0 53W / 300W | 638MiB / 16384MiB | 0% Default |</span></div>
<div><span style="font-family:"Courier New",monospace">| | | N/A |</span></div>
<div><span style="font-family:"Courier New",monospace">+-----------------------------------------+----------------------+----------------------+</span></div>
<div><span style="font-family:"Courier New",monospace"> </span></div>
<div><span style="font-family:"Courier New",monospace">+---------------------------------------------------------------------------------------+</span></div>
<div><span style="font-family:"Courier New",monospace">| Processes: |</span></div>
<div><span style="font-family:"Courier New",monospace">| GPU GI CI PID Type Process name GPU Memory |</span></div>
<div><span style="font-family:"Courier New",monospace">| ID ID Usage |</span></div>
<div><span style="font-family:"Courier New",monospace">|=======================================================================================|</span></div>
<div><span style="font-family:"Courier New",monospace">| 0 N/A N/A 214626 C ...d/ompi_gnu_linux/fds_ompi_gnu_linux 318MiB |</span></div>
<div><span style="font-family:"Courier New",monospace">| 0 N/A N/A 214627 C ...d/ompi_gnu_linux/fds_ompi_gnu_linux 308MiB |</span></div>
<div><span style="font-family:"Courier New",monospace">| 0 N/A N/A 214628 C ...d/ompi_gnu_linux/fds_ompi_gnu_linux 308MiB |</span></div>
<div><span style="font-family:"Courier New",monospace">| 0 N/A N/A 214629 C ...d/ompi_gnu_linux/fds_ompi_gnu_linux 308MiB |</span></div>
<div><span style="font-family:"Courier New",monospace">| 0 N/A N/A 214630 C ...d/ompi_gnu_linux/fds_ompi_gnu_linux 318MiB |</span></div>
<div><span style="font-family:"Courier New",monospace">| 0 N/A N/A 214631 C ...d/ompi_gnu_linux/fds_ompi_gnu_linux 308MiB |</span></div>
<div><span style="font-family:"Courier New",monospace">| 0 N/A N/A 214632 C ...d/ompi_gnu_linux/fds_ompi_gnu_linux 308MiB |</span></div>
<div><span style="font-family:"Courier New",monospace">| 0 N/A N/A 214633 C ...d/ompi_gnu_linux/fds_ompi_gnu_linux 308MiB |</span></div>
<div><span style="font-family:"Courier New",monospace">| 1 N/A N/A 214627 C ...d/ompi_gnu_linux/fds_ompi_gnu_linux 318MiB |</span></div>
<div><span style="font-family:"Courier New",monospace">| 1 N/A N/A 214631 C ...d/ompi_gnu_linux/fds_ompi_gnu_linux 318MiB |</span></div>
<div><span style="font-family:"Courier New",monospace">| 2 N/A N/A 214628 C ...d/ompi_gnu_linux/fds_ompi_gnu_linux 318MiB |</span></div>
<div><span style="font-family:"Courier New",monospace">| 2 N/A N/A 214632 C ...d/ompi_gnu_linux/fds_ompi_gnu_linux 318MiB |</span></div>
<div><span style="font-family:"Courier New",monospace">| 3 N/A N/A 214629 C ...d/ompi_gnu_linux/fds_ompi_gnu_linux 318MiB |</span></div>
<div><span style="font-family:"Courier New",monospace">| 3 N/A N/A 214633 C ...d/ompi_gnu_linux/fds_ompi_gnu_linux 318MiB |</span></div>
<span style="font-family:"Courier New",monospace">+---------------------------------------------------------------------------------------+</span><br>
</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
You can see that GPU 0 is connected to all 8 MPI processes, each taking about 300 MiB on it, whereas GPUs 1, 2, and 3 are each working with 2 MPI processes. I'm wondering if this is expected, or whether there are changes I need to make to my submission script or runtime parameters.</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
This is the script in this case (2 nodes, 8 MPI processes/node, 4 GPUs/node):</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<div><span style="font-family:"Courier New",monospace">#!/bin/bash</span></div>
<div><span style="font-family:"Courier New",monospace"># ../../Utilities/Scripts/qfds.sh -p 2 -T db -d test.fds</span></div>
<div><span style="font-family:"Courier New",monospace">#SBATCH -J test </span></div>
<div><span style="font-family:"Courier New",monospace">#SBATCH -e /home/mnv/Firemodels_fork/fds/Issues/PETSc/test.err</span></div>
<div><span style="font-family:"Courier New",monospace">#SBATCH -o /home/mnv/Firemodels_fork/fds/Issues/PETSc/test.log</span></div>
<div><span style="font-family:"Courier New",monospace">#SBATCH --partition=gpu</span></div>
<div><span style="font-family:"Courier New",monospace">#SBATCH --ntasks=16</span></div>
<div><span style="font-family:"Courier New",monospace">#SBATCH --ntasks-per-node=8</span></div>
<div><span style="font-family:"Courier New",monospace">#SBATCH --cpus-per-task=1</span></div>
<div><span style="font-family:"Courier New",monospace">#SBATCH --nodes=2</span></div>
<div><span style="font-family:"Courier New",monospace">#SBATCH --time=01:00:00</span></div>
<div><span style="font-family:"Courier New",monospace">#SBATCH --gres=gpu:4</span></div>
<br>
<div><span style="font-family:"Courier New",monospace">export OMP_NUM_THREADS=1</span></div>
<div><span style="font-family:"Courier New",monospace"># modules</span></div>
<div><span style="font-family:"Courier New",monospace">module load cuda/11.7</span></div>
<div><span style="font-family:"Courier New",monospace">module load gcc/11.2.1/toolset</span></div>
<div><span style="font-family:"Courier New",monospace">module load openmpi/4.1.4/gcc-11.2.1-cuda-11.7</span></div>
<div><br>
</div>
<div><span style="font-family:"Courier New",monospace">cd /home/mnv/Firemodels_fork/fds/Issues/PETSc</span></div>
<div><br>
</div>
<div></div>
<span style="font-family:"Courier New",monospace">srun -N 2 -n 16 /home/mnv/Firemodels_fork/fds/Build/ompi_gnu_linux/fds_ompi_gnu_linux test.fds -pc_type gamg -mat_type aijcusparse -vec_type cuda</span>
<div></div>
<span style="font-family:"Courier New",monospace"> </span><br>
</div>
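If the issue turns out to be a binding problem, one possible workaround is a small wrapper that maps each local rank to a single GPU before the solver starts. This is only a sketch: <span style="font-family:&quot;Courier New&quot;,monospace">gpu_bind.sh</span> is a hypothetical name, and it assumes 4 GPUs per node and that Slurm exports <span style="font-family:&quot;Courier New&quot;,monospace">SLURM_LOCALID</span> (the rank's index on its node) for every task, as srun normally does.

```shell
#!/bin/bash
# gpu_bind.sh (hypothetical): restrict each MPI rank to one GPU.
# SLURM_LOCALID runs 0..7 on each node here; with 4 GPUs per node,
# local ranks 0-3 get GPUs 0-3 and ranks 4-7 wrap around.
NGPUS=${NGPUS:-4}
export CUDA_VISIBLE_DEVICES=$(( ${SLURM_LOCALID:-0} % NGPUS ))
exec "$@"
```

It would then be launched as, e.g., <span style="font-family:&quot;Courier New&quot;,monospace">srun -N 2 -n 16 ./gpu_bind.sh /home/mnv/Firemodels_fork/fds/Build/ompi_gnu_linux/fds_ompi_gnu_linux test.fds ...</span> with the same PETSc options as before.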
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
Thank you for the advice,</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
Marcos<br>
</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div id="m_3869060330462788085x_m_-2525567993800845248appendonsend"></div>
<br>
</div>
</div>
</blockquote>
</div>
</div>
</div>
</div></blockquote></div>
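To gather the per-rank CUDA_VISIBLE_DEVICES values requested at the top of the thread, each srun task can simply echo its own environment. This is a sketch assuming bash; <span style="font-family:&quot;Courier New&quot;,monospace">report_rank</span> is a hypothetical helper, while <span style="font-family:&quot;Courier New&quot;,monospace">SLURM_PROCID</span> is the global rank index that srun sets for every task.

```shell
# report_rank (hypothetical helper): print this task's global rank and
# the GPU list it can see; "unset" means no restriction was applied.
report_rank() {
  echo "rank ${SLURM_PROCID:-?}: CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-unset}"
}

# Under Slurm, the equivalent one-liner would be something like:
#   srun -N 2 -n 16 bash -c \
#     'echo "rank $SLURM_PROCID: CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-unset}"'
report_rank
```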