[petsc-users] CUDA error trying to run a job with two mpi processes and 1 GPU
Junchao Zhang
junchao.zhang at gmail.com
Mon Aug 21 14:29:21 CDT 2023
Hi Marcos,
If you look at the PIDs in the nvidia-smi output, you will find only 8
unique PIDs, which is expected since you allocated 8 MPI ranks per node.
The duplicate PIDs are usually for threads spawned by the MPI runtime
(for example, progress threads in the MPI implementation). So your job
script and output are all good.
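
As a rough way to check that count, the sketch below tallies the rows of the process table quoted further down. The (GPU, PID, MiB) tuples are transcribed by hand from that table; the `summarize` helper and the variable names are illustrative only and are not part of PETSc or nvidia-smi.

```python
from collections import defaultdict

# (GPU index, PID, memory in MiB) rows transcribed by hand from the
# nvidia-smi process table quoted in this thread.
ROWS = [
    (0, 214626, 318), (0, 214627, 308), (0, 214628, 308), (0, 214629, 308),
    (0, 214630, 318), (0, 214631, 308), (0, 214632, 308), (0, 214633, 308),
    (1, 214627, 318), (1, 214631, 318),
    (2, 214628, 318), (2, 214632, 318),
    (3, 214629, 318), (3, 214633, 318),
]


def summarize(rows):
    """Return (sorted unique PIDs, {gpu: sorted PIDs listed on that GPU})."""
    pids_by_gpu = defaultdict(set)
    for gpu, pid, _mem in rows:
        pids_by_gpu[gpu].add(pid)
    unique_pids = sorted({pid for _gpu, pid, _mem in rows})
    grouped = {gpu: sorted(pids) for gpu, pids in sorted(pids_by_gpu.items())}
    return unique_pids, grouped


unique_pids, pids_by_gpu = summarize(ROWS)
print(len(ROWS), "table rows,", len(unique_pids), "unique PIDs")
for gpu, pids in pids_by_gpu.items():
    print("GPU", gpu, "->", pids)
```

On this data it reports 14 table rows but only 8 distinct PIDs: every PID appears on GPU 0, and each of GPUs 1, 2, and 3 lists two of them.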
Thanks.
On Mon, Aug 21, 2023 at 2:00 PM Vanella, Marcos (Fed) <
marcos.vanella at nist.gov> wrote:
> Hi Junchao, something I'm noticing when running with CUDA-enabled
> linear solvers (CG+HYPRE, CG+GAMG) is that for multi-CPU, multi-GPU
> calculations, GPU 0 in the node seems to be taking the sub-matrices
> of all the MPI processes in the node. This is the output of the
> nvidia-smi command on a node with 8 MPI processes (each advancing
> the same number of unknowns in the calculation) and 4 V100 GPUs:
>
> Mon Aug 21 14:36:07 2023
>
> +---------------------------------------------------------------------------------------+
> | NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
> |-----------------------------------------+----------------------+----------------------+
> | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
> | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
> |                                         |                      |               MIG M. |
> |=========================================+======================+======================|
> |   0  Tesla V100-SXM2-16GB           On  | 00000004:04:00.0 Off |                    0 |
> | N/A   34C    P0              63W / 300W |   2488MiB / 16384MiB |      0%      Default |
> |                                         |                      |                  N/A |
> +-----------------------------------------+----------------------+----------------------+
> |   1  Tesla V100-SXM2-16GB           On  | 00000004:05:00.0 Off |                    0 |
> | N/A   38C    P0              56W / 300W |    638MiB / 16384MiB |      0%      Default |
> |                                         |                      |                  N/A |
> +-----------------------------------------+----------------------+----------------------+
> |   2  Tesla V100-SXM2-16GB           On  | 00000035:03:00.0 Off |                    0 |
> | N/A   35C    P0              52W / 300W |    638MiB / 16384MiB |      0%      Default |
> |                                         |                      |                  N/A |
> +-----------------------------------------+----------------------+----------------------+
> |   3  Tesla V100-SXM2-16GB           On  | 00000035:04:00.0 Off |                    0 |
> | N/A   38C    P0              53W / 300W |    638MiB / 16384MiB |      0%      Default |
> |                                         |                      |                  N/A |
> +-----------------------------------------+----------------------+----------------------+
>
>
>
> +---------------------------------------------------------------------------------------+
> | Processes:                                                                            |
> |  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
> |        ID   ID                                                             Usage      |
> |=======================================================================================|
> |    0   N/A  N/A    214626      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux      318MiB |
> |    0   N/A  N/A    214627      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux      308MiB |
> |    0   N/A  N/A    214628      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux      308MiB |
> |    0   N/A  N/A    214629      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux      308MiB |
> |    0   N/A  N/A    214630      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux      318MiB |
> |    0   N/A  N/A    214631      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux      308MiB |
> |    0   N/A  N/A    214632      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux      308MiB |
> |    0   N/A  N/A    214633      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux      308MiB |
> |    1   N/A  N/A    214627      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux      318MiB |
> |    1   N/A  N/A    214631      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux      318MiB |
> |    2   N/A  N/A    214628      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux      318MiB |
> |    2   N/A  N/A    214632      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux      318MiB |
> |    3   N/A  N/A    214629      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux      318MiB |
> |    3   N/A  N/A    214633      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux      318MiB |
> +---------------------------------------------------------------------------------------+
>
>
> You can see that GPU 0 is connected to all 8 MPI processes, each taking
> about 300 MiB on it, whereas GPUs 1, 2, and 3 are each working with 2 MPI
> processes. I'm wondering if this is expected or if there are changes I
> need to make to my submission script or runtime parameters.
> This is the script in this case (2 nodes, 8 MPI processes/node, 4
> GPUs/node):
>
> #!/bin/bash
> # ../../Utilities/Scripts/qfds.sh -p 2 -T db -d test.fds
> #SBATCH -J test
> #SBATCH -e /home/mnv/Firemodels_fork/fds/Issues/PETSc/test.err
> #SBATCH -o /home/mnv/Firemodels_fork/fds/Issues/PETSc/test.log
> #SBATCH --partition=gpu
> #SBATCH --ntasks=16
> #SBATCH --ntasks-per-node=8
> #SBATCH --cpus-per-task=1
> #SBATCH --nodes=2
> #SBATCH --time=01:00:00
> #SBATCH --gres=gpu:4
>
> export OMP_NUM_THREADS=1
> # modules
> module load cuda/11.7
> module load gcc/11.2.1/toolset
> module load openmpi/4.1.4/gcc-11.2.1-cuda-11.7
>
> cd /home/mnv/Firemodels_fork/fds/Issues/PETSc
>
> srun -N 2 -n 16 \
>   /home/mnv/Firemodels_fork/fds/Build/ompi_gnu_linux/fds_ompi_gnu_linux \
>   test.fds -pc_type gamg -mat_type aijcusparse -vec_type cuda
>
> Thank you for the advice,
> Marcos
>
>
>
>