<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<style type="text/css" style="display:none;"> P {margin-top:0;margin-bottom:0;} </style>
</head>
<body dir="ltr">
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof">
Hi Junchao, one thing I'm noticing when running the CUDA-enabled linear solvers (CG+HYPRE, CG+GAMG) is that, for multi-CPU, multi-GPU calculations, GPU 0 on the node appears to be holding the sub-matrices of all the MPI processes on
the node. This is the output of the nvidia-smi command on a node with 8 MPI processes (each advancing the same number of unknowns in the calculation) and 4 V100 GPUs:</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof ContentPasted0">
<pre class="ContentPasted0" style="font-family: 'Courier New', monospace;">
Mon Aug 21 14:36:07 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla V100-SXM2-16GB           On  | 00000004:04:00.0 Off |                    0 |
| N/A   34C    P0              63W / 300W |   2488MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2-16GB           On  | 00000004:05:00.0 Off |                    0 |
| N/A   38C    P0              56W / 300W |    638MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2-16GB           On  | 00000035:03:00.0 Off |                    0 |
| N/A   35C    P0              52W / 300W |    638MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2-16GB           On  | 00000035:04:00.0 Off |                    0 |
| N/A   38C    P0              53W / 300W |    638MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    214626      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux      318MiB |
|    0   N/A  N/A    214627      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux      308MiB |
|    0   N/A  N/A    214628      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux      308MiB |
|    0   N/A  N/A    214629      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux      308MiB |
|    0   N/A  N/A    214630      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux      318MiB |
|    0   N/A  N/A    214631      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux      308MiB |
|    0   N/A  N/A    214632      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux      308MiB |
|    0   N/A  N/A    214633      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux      308MiB |
|    1   N/A  N/A    214627      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux      318MiB |
|    1   N/A  N/A    214631      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux      318MiB |
|    2   N/A  N/A    214628      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux      318MiB |
|    2   N/A  N/A    214632      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux      318MiB |
|    3   N/A  N/A    214629      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux      318MiB |
|    3   N/A  N/A    214633      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux      318MiB |
+---------------------------------------------------------------------------------------+
</pre>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof ContentPasted0">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof ContentPasted0">
You can see that all 8 MPI processes are attached to GPU 0, each taking about 300 MiB on it, whereas GPUs 1, 2 and 3 are each working with 2 MPI processes. I'm wondering whether this is expected, or whether there are changes I need to make to my submission script or runtime parameters.</div>
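<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof">
To make those counts explicit, here is a small sketch that tallies the compute processes and their memory per GPU from the nvidia-smi process table. The function name tally_gpu_procs is just something I made up, and it assumes the process rows keep the column layout shown above (GPU index in the second field, type C in the sixth, memory in the eighth):</div>
<pre style="font-family: 'Courier New', monospace;">
```shell
# tally_gpu_procs (hypothetical helper): read nvidia-smi text on stdin and
# print one line per GPU: "gpu_index process_count total_MiB".
tally_gpu_procs() {
    awk '$6 == "C" {
             sub(/MiB/, "", $8)      # strip the unit so the value sums as a number
             count[$2]++             # one more compute process on this GPU
             mem[$2] += $8           # accumulate its memory in MiB
         }
         END {
             for (g in count) printf "%s %d %d\n", g, count[g], mem[g]
         }' | sort -n
}
```
</pre>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof">
Run as nvidia-smi | tally_gpu_procs. For the table above it gives 8 processes and 2484 MiB on GPU 0, and 2 processes and 636 MiB on each of GPUs 1, 2 and 3.</div>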
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof ContentPasted0">
This is the submission script for this case (2 nodes, 8 MPI processes/node, 4 GPUs/node):</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof ContentPasted0">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof ContentPasted0 ContentPasted1">
<pre class="ContentPasted1" style="font-family: 'Courier New', monospace;">
#!/bin/bash
# ../../Utilities/Scripts/qfds.sh -p 2 -T db -d test.fds
#SBATCH -J test
#SBATCH -e /home/mnv/Firemodels_fork/fds/Issues/PETSc/test.err
#SBATCH -o /home/mnv/Firemodels_fork/fds/Issues/PETSc/test.log
#SBATCH --partition=gpu
#SBATCH --ntasks=16
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=1
#SBATCH --nodes=2
#SBATCH --time=01:00:00
#SBATCH --gres=gpu:4

export OMP_NUM_THREADS=1
# modules
module load cuda/11.7
module load gcc/11.2.1/toolset
module load openmpi/4.1.4/gcc-11.2.1-cuda-11.7

cd /home/mnv/Firemodels_fork/fds/Issues/PETSc

srun -N 2 -n 16 /home/mnv/Firemodels_fork/fds/Build/ompi_gnu_linux/fds_ompi_gnu_linux test.fds -pc_type gamg -mat_type aijcusparse -vec_type cuda
</pre>
</div>
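<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof">
One thing I could try on my end, as an untested sketch, is a small wrapper that restricts each rank to a single GPU through CUDA_VISIBLE_DEVICES, chosen from Slurm's node-local task id SLURM_LOCALID. The file name gpu_bind.sh, the GPUS_PER_NODE variable and the round-robin mapping are my own assumptions, and I don't know yet whether this is the recommended way or whether it removes the extra allocations on GPU 0:</div>
<pre style="font-family: 'Courier New', monospace;">
```shell
#!/bin/bash
# gpu_bind.sh (hypothetical helper): pin each MPI rank to one GPU on its node.
# Assumes Slurm exports SLURM_LOCALID (node-local task id) and that the node
# exposes GPUS_PER_NODE devices numbered from 0.

# Round-robin map from node-local rank to GPU index.
pick_gpu() {
    local local_id=$1 ngpus=$2
    echo $(( local_id % ngpus ))
}

GPUS_PER_NODE=${GPUS_PER_NODE:-4}

# Only bind and launch when a command is given, so the file can also be sourced.
if [ "$#" -gt 0 ]; then
    export CUDA_VISIBLE_DEVICES=$(pick_gpu "${SLURM_LOCALID:-0}" "$GPUS_PER_NODE")
    exec "$@"
fi
```
</pre>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof">
It would be launched by putting ./gpu_bind.sh between the srun options and the fds executable in the srun line of the script, with GPUS_PER_NODE left at its default of 4 for these nodes.</div>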
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof ContentPasted0">
Thank you for the advice,</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof ContentPasted0">
Marcos<br>
</div>
</body>
</html>