[petsc-users] CUDA error trying to run a job with two mpi processes and 1 GPU
Vanella, Marcos (Fed)
marcos.vanella at nist.gov
Fri Aug 11 12:43:03 CDT 2023
Hi Junchao, thank you for replying. I compiled PETSc in debug mode, and this is what I get for this case:
terminate called after throwing an instance of 'thrust::system::system_error'
what(): merge_sort: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
Program received signal SIGABRT: Process abort signal.
Backtrace for this error:
#0 0x15264731ead0 in ???
#1 0x15264731dc35 in ???
#2 0x15264711551f in ???
#3 0x152647169a7c in ???
#4 0x152647115475 in ???
#5 0x1526470fb7f2 in ???
#6 0x152647678bbd in ???
#7 0x15264768424b in ???
#8 0x1526476842b6 in ???
#9 0x152647684517 in ???
#10 0x55bb46342ebb in _ZN6thrust8cuda_cub14throw_on_errorE9cudaErrorPKc
at /usr/local/cuda/include/thrust/system/cuda/detail/util.h:224
#11 0x55bb46342ebb in _ZN6thrust8cuda_cub12__merge_sort10merge_sortINS_6detail17integral_constantIbLb1EEENS4_IbLb0EEENS0_3tagENS_12zip_iteratorINS_5tupleINS_10device_ptrIiEESB_NS_9null_typeESC_SC_SC_SC_SC_SC_SC_EEEENS3_15normal_iteratorISB_EE9IJCompareEEvRNS0_16execution_policyIT1_EET2_SM_T3_T4_
at /usr/local/cuda/include/thrust/system/cuda/detail/sort.h:1316
#12 0x55bb46342ebb in _ZN6thrust8cuda_cub12__smart_sort10smart_sortINS_6detail17integral_constantIbLb1EEENS4_IbLb0EEENS0_16execution_policyINS0_3tagEEENS_12zip_iteratorINS_5tupleINS_10device_ptrIiEESD_NS_9null_typeESE_SE_SE_SE_SE_SE_SE_EEEENS3_15normal_iteratorISD_EE9IJCompareEENS1_25enable_if_comparison_sortIT2_T4_E4typeERT1_SL_SL_T3_SM_
at /usr/local/cuda/include/thrust/system/cuda/detail/sort.h:1544
#13 0x55bb46342ebb in _ZN6thrust8cuda_cub11sort_by_keyINS0_3tagENS_12zip_iteratorINS_5tupleINS_10device_ptrIiEES6_NS_9null_typeES7_S7_S7_S7_S7_S7_S7_EEEENS_6detail15normal_iteratorIS6_EE9IJCompareEEvRNS0_16execution_policyIT_EET0_SI_T1_T2_
at /usr/local/cuda/include/thrust/system/cuda/detail/sort.h:1669
#14 0x55bb46317bc5 in _ZN6thrust11sort_by_keyINS_8cuda_cub3tagENS_12zip_iteratorINS_5tupleINS_10device_ptrIiEES6_NS_9null_typeES7_S7_S7_S7_S7_S7_S7_EEEENS_6detail15normal_iteratorIS6_EE9IJCompareEEvRKNSA_21execution_policy_baseIT_EET0_SJ_T1_T2_
at /usr/local/cuda/include/thrust/detail/sort.inl:115
#15 0x55bb46317bc5 in _ZN6thrust11sort_by_keyINS_12zip_iteratorINS_5tupleINS_10device_ptrIiEES4_NS_9null_typeES5_S5_S5_S5_S5_S5_S5_EEEENS_6detail15normal_iteratorIS4_EE9IJCompareEEvT_SC_T0_T1_
at /usr/local/cuda/include/thrust/detail/sort.inl:305
#16 0x55bb46317bc5 in MatSetPreallocationCOO_SeqAIJCUSPARSE_Basic
at /home/mnv/Software/petsc/src/mat/impls/aij/seq/seqcusparse/aijcusparse.cu:4452
#17 0x55bb46c5b27c in MatSetPreallocationCOO_MPIAIJCUSPARSE_Basic
at /home/mnv/Software/petsc/src/mat/impls/aij/mpi/mpicusparse/mpiaijcusparse.cu:173
#18 0x55bb46c5b27c in MatSetPreallocationCOO_MPIAIJCUSPARSE
at /home/mnv/Software/petsc/src/mat/impls/aij/mpi/mpicusparse/mpiaijcusparse.cu:222
#19 0x55bb468e01cf in MatSetPreallocationCOO
at /home/mnv/Software/petsc/src/mat/utils/gcreate.c:606
#20 0x55bb46b39c9b in MatProductSymbolic_MPIAIJBACKEND
at /home/mnv/Software/petsc/src/mat/impls/aij/mpi/mpiaij.c:7547
#21 0x55bb469015e5 in MatProductSymbolic
at /home/mnv/Software/petsc/src/mat/interface/matproduct.c:803
#22 0x55bb4694ade2 in MatPtAP
at /home/mnv/Software/petsc/src/mat/interface/matrix.c:9897
#23 0x55bb4696d3ec in MatCoarsenApply_MISK_private
at /home/mnv/Software/petsc/src/mat/coarsen/impls/misk/misk.c:283
#24 0x55bb4696eb67 in MatCoarsenApply_MISK
at /home/mnv/Software/petsc/src/mat/coarsen/impls/misk/misk.c:368
#25 0x55bb4695bd91 in MatCoarsenApply
at /home/mnv/Software/petsc/src/mat/coarsen/coarsen.c:97
#26 0x55bb478294d8 in PCGAMGCoarsen_AGG
at /home/mnv/Software/petsc/src/ksp/pc/impls/gamg/agg.c:524
#27 0x55bb471d1cb4 in PCSetUp_GAMG
at /home/mnv/Software/petsc/src/ksp/pc/impls/gamg/gamg.c:631
#28 0x55bb464022cf in PCSetUp
at /home/mnv/Software/petsc/src/ksp/pc/interface/precon.c:994
#29 0x55bb4718b8a7 in KSPSetUp
at /home/mnv/Software/petsc/src/ksp/ksp/interface/itfunc.c:406
#30 0x55bb4718f22e in KSPSolve_Private
at /home/mnv/Software/petsc/src/ksp/ksp/interface/itfunc.c:824
#31 0x55bb47192c0c in KSPSolve
at /home/mnv/Software/petsc/src/ksp/ksp/interface/itfunc.c:1070
#32 0x55bb463efd35 in kspsolve_
at /home/mnv/Software/petsc/src/ksp/ksp/interface/ftn-auto/itfuncf.c:320
#33 0x55bb45e94b32 in ???
#34 0x55bb46048044 in ???
#35 0x55bb46052ea1 in ???
#36 0x55bb45ac5f8e in ???
#37 0x1526470fcd8f in ???
#38 0x1526470fce3f in ???
#39 0x55bb45aef55d in ???
#40 0xffffffffffffffff in ???
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 1771753 on node dgx02 exited on signal 6 (Aborted).
--------------------------------------------------------------------------
BTW, I'm curious: if I set n MPI processes, each of them building a part of the linear system, and g GPUs, how does PETSc distribute those n pieces of the system matrix and RHS among the g GPUs? Does it use some load-balancing algorithm? Where can I read about this?
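In case it helps clarify what I'm asking: this is roughly the kind of by-hand rank-to-GPU binding I could try on my side (just a sketch of mine, using Open MPI's OMPI_COMM_WORLD_LOCAL_RANK; the wrapper name gpu_bind.sh is made up), though I'd rather understand what PETSc does by default:

#!/bin/bash
# gpu_bind.sh - rough per-rank GPU binding wrapper (name and approach are my own sketch).
# Each MPI rank picks one device based on its node-local rank, wrapping around
# when there are more ranks than GPUs on the node.
NGPUS=$(nvidia-smi -L | wc -l)
LOCAL_RANK=${OMPI_COMM_WORLD_LOCAL_RANK:-0}
export CUDA_VISIBLE_DEVICES=$(( LOCAL_RANK % NGPUS ))
exec "$@"

which would then be launched as, e.g.:
mpirun -n 2 ./gpu_bind.sh /home/fds/Build/ompi_gnu_linux/fds_ompi_gnu_linux test.fds -vec_type mpicuda -mat_type mpiaijcusparse -pc_type gamg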
Thank you, and I can also point you to my code repo on GitHub if you want to take a closer look.
Best Regards,
Marcos
________________________________
From: Junchao Zhang <junchao.zhang at gmail.com>
Sent: Friday, August 11, 2023 10:52 AM
To: Vanella, Marcos (Fed) <marcos.vanella at nist.gov>
Cc: petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>
Subject: Re: [petsc-users] CUDA error trying to run a job with two mpi processes and 1 GPU
Hi, Marcos,
Could you build petsc in debug mode and then copy and paste the whole error stack message?
Thanks
--Junchao Zhang
On Thu, Aug 10, 2023 at 5:51 PM Vanella, Marcos (Fed) via petsc-users <petsc-users at mcs.anl.gov> wrote:
Hi, I'm trying to run a parallel matrix-vector build and linear solve with PETSc on 2 MPI processes + one V100 GPU. I tested that the matrix build and solve complete successfully on CPUs only. I'm using CUDA 11.5, CUDA-enabled OpenMPI, and gcc 9.3. When I run the job with GPU enabled I get the following error:
terminate called after throwing an instance of 'thrust::system::system_error'
what(): merge_sort: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
Program received signal SIGABRT: Process abort signal.
Backtrace for this error:
terminate called after throwing an instance of 'thrust::system::system_error'
what(): merge_sort: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
Program received signal SIGABRT: Process abort signal.
I'm new to submitting jobs in Slurm that also use GPU resources, so I might be doing something wrong in my submission script. This is it:
#!/bin/bash
#SBATCH -J test
#SBATCH -e /home/Issues/PETSc/test.err
#SBATCH -o /home/Issues/PETSc/test.log
#SBATCH --partition=batch
#SBATCH --ntasks=2
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH --ntasks-per-node=2
#SBATCH --time=01:00:00
#SBATCH --gres=gpu:1
export OMP_NUM_THREADS=1
module load cuda/11.5
module load openmpi/4.1.1
cd /home/Issues/PETSc
mpirun -n 2 /home/fds/Build/ompi_gnu_linux/fds_ompi_gnu_linux test.fds -vec_type mpicuda -mat_type mpiaijcusparse -pc_type gamg
If anyone has any suggestions on how to troubleshoot this, please let me know.
Thanks!
Marcos