[petsc-users] CUDA error trying to run a job with two mpi processes and 1 GPU
Junchao Zhang
junchao.zhang at gmail.com
Mon Aug 21 15:17:25 CDT 2023
That is a good question. Looking at
https://slurm.schedmd.com/gres.html#GPU_Management, could you share the
output of your job so we can check CUDA_VISIBLE_DEVICES and see how the
GPUs were allocated?
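For example, one quick way to get that (a minimal sketch, assuming the same
two-node allocation as in your batch script) is to have each rank report the
GPUs it sees:

# Print hostname, rank, and the GPUs Slurm exposes to each task
srun -N 2 -n 16 bash -c 'echo "host=$(hostname) rank=$SLURM_PROCID CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"'

If Slurm is constraining devices per task, CUDA_VISIBLE_DEVICES should differ
across the ranks on a node; if every rank reports 0,1,2,3, the GPU selection is
left to the application.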
--Junchao Zhang
On Mon, Aug 21, 2023 at 2:38 PM Vanella, Marcos (Fed) <
marcos.vanella at nist.gov> wrote:
> Ok, thanks Junchao. So is GPU 0 actually allocating memory for the meshes of
> all 8 MPI processes, but only doing the work for 2 of them?
> The nvidia-smi output shows it has allocated 2.4GB.
> Best,
> Marcos
> ------------------------------
> *From:* Junchao Zhang <junchao.zhang at gmail.com>
> *Sent:* Monday, August 21, 2023 3:29 PM
> *To:* Vanella, Marcos (Fed) <marcos.vanella at nist.gov>
> *Cc:* PETSc users list <petsc-users at mcs.anl.gov>; Guan, Collin X. (Fed) <
> collin.guan at nist.gov>
> *Subject:* Re: [petsc-users] CUDA error trying to run a job with two mpi
> processes and 1 GPU
>
> Hi, Marcos,
>   If you look at the PIDs of the nvidia-smi output, you will only find 8
> unique PIDs, which is expected since you allocated 8 MPI ranks per node.
> The duplicate PIDs are usually threads spawned by the MPI runtime (for
> example, progress threads in the MPI implementation). So your job script
> and output are all good.
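> (If you want to confirm that, one quick check, assuming you can get a shell
> on the compute node while the job is running, is to list the threads of one
> of the reported PIDs:
>
> ps -T -p 214626
>
> which should show several threads under that single fds_ompi_gnu_linux
> process.)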
>
> Thanks.
>
> On Mon, Aug 21, 2023 at 2:00 PM Vanella, Marcos (Fed) <
> marcos.vanella at nist.gov> wrote:
>
> Hi Junchao, something I'm noticing when running with the CUDA-enabled
> linear solvers (CG+HYPRE, CG+GAMG) is that, for multi-CPU multi-GPU
> calculations, GPU 0 in the node seems to be taking all the sub-matrices
> corresponding to all the MPI processes in the node. This is the output of
> the nvidia-smi command on a node with 8 MPI processes (each advancing the
> same number of unknowns in the calculation) and 4 V100 GPUs:
>
> Mon Aug 21 14:36:07 2023
> +---------------------------------------------------------------------------------------+
> | NVIDIA-SMI 535.54.03              Driver Version: 535.54.03      CUDA Version: 12.2    |
> |-----------------------------------------+----------------------+----------------------+
> | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
> | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
> |                                         |                      |               MIG M. |
> |=========================================+======================+======================|
> |   0  Tesla V100-SXM2-16GB           On  | 00000004:04:00.0 Off |                    0 |
> | N/A   34C    P0             63W / 300W  |   2488MiB / 16384MiB |      0%      Default |
> |                                         |                      |                  N/A |
> +-----------------------------------------+----------------------+----------------------+
> |   1  Tesla V100-SXM2-16GB           On  | 00000004:05:00.0 Off |                    0 |
> | N/A   38C    P0             56W / 300W  |    638MiB / 16384MiB |      0%      Default |
> |                                         |                      |                  N/A |
> +-----------------------------------------+----------------------+----------------------+
> |   2  Tesla V100-SXM2-16GB           On  | 00000035:03:00.0 Off |                    0 |
> | N/A   35C    P0             52W / 300W  |    638MiB / 16384MiB |      0%      Default |
> |                                         |                      |                  N/A |
> +-----------------------------------------+----------------------+----------------------+
> |   3  Tesla V100-SXM2-16GB           On  | 00000035:04:00.0 Off |                    0 |
> | N/A   38C    P0             53W / 300W  |    638MiB / 16384MiB |      0%      Default |
> |                                         |                      |                  N/A |
> +-----------------------------------------+----------------------+----------------------+
>
> +---------------------------------------------------------------------------------------+
> | Processes:                                                                             |
> |  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
> |        ID   ID                                                             Usage      |
> |=========================================================================================|
> |    0   N/A  N/A    214626      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux       318MiB |
> |    0   N/A  N/A    214627      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux       308MiB |
> |    0   N/A  N/A    214628      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux       308MiB |
> |    0   N/A  N/A    214629      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux       308MiB |
> |    0   N/A  N/A    214630      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux       318MiB |
> |    0   N/A  N/A    214631      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux       308MiB |
> |    0   N/A  N/A    214632      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux       308MiB |
> |    0   N/A  N/A    214633      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux       308MiB |
> |    1   N/A  N/A    214627      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux       318MiB |
> |    1   N/A  N/A    214631      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux       318MiB |
> |    2   N/A  N/A    214628      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux       318MiB |
> |    2   N/A  N/A    214632      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux       318MiB |
> |    3   N/A  N/A    214629      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux       318MiB |
> |    3   N/A  N/A    214633      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux       318MiB |
> +---------------------------------------------------------------------------------------+
>
>
> You can see that GPU 0 is connected to all 8 MPI processes, each taking
> about 300MB on it, whereas GPUs 1, 2 and 3 are each working with 2 MPI
> processes. I'm wondering if this is expected, or whether there are changes
> I need to make to my submission script/runtime parameters.
> This is the script in this case (2 nodes, 8 MPI processes per node, 4 GPUs
> per node):
>
> #!/bin/bash
> # ../../Utilities/Scripts/qfds.sh -p 2 -T db -d test.fds
> #SBATCH -J test
> #SBATCH -e /home/mnv/Firemodels_fork/fds/Issues/PETSc/test.err
> #SBATCH -o /home/mnv/Firemodels_fork/fds/Issues/PETSc/test.log
> #SBATCH --partition=gpu
> #SBATCH --ntasks=16
> #SBATCH --ntasks-per-node=8
> #SBATCH --cpus-per-task=1
> #SBATCH --nodes=2
> #SBATCH --time=01:00:00
> #SBATCH --gres=gpu:4
>
> export OMP_NUM_THREADS=1
> # modules
> module load cuda/11.7
> module load gcc/11.2.1/toolset
> module load openmpi/4.1.4/gcc-11.2.1-cuda-11.7
>
> cd /home/mnv/Firemodels_fork/fds/Issues/PETSc
>
> srun -N 2 -n 16 \
>   /home/mnv/Firemodels_fork/fds/Build/ompi_gnu_linux/fds_ompi_gnu_linux \
>   test.fds -pc_type gamg -mat_type aijcusparse -vec_type cuda
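>
> I was also wondering whether adding GPU binding options to srun would help,
> for example something like the following (just a sketch; I'm assuming our
> Slurm installation supports the --gpus-per-node and --gpu-bind options):
>
> # Bind each task to the GPU closest to its allocated CPUs (ideally 2 ranks per V100 here)
> srun -N 2 -n 16 --gpus-per-node=4 --gpu-bind=closest \
>   /home/mnv/Firemodels_fork/fds/Build/ompi_gnu_linux/fds_ompi_gnu_linux \
>   test.fds -pc_type gamg -mat_type aijcusparse -vec_type cuda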
>
> Thank you for the advice,
> Marcos
>
>
>
>