[petsc-users] CUDA error trying to run a job with two mpi processes and 1 GPU

Vanella, Marcos (Fed) marcos.vanella at nist.gov
Fri Aug 11 12:43:03 CDT 2023


Hi Junchao, thank you for replying. I compiled PETSc in debug mode and this is what I get for this case:

terminate called after throwing an instance of 'thrust::system::system_error'
  what():  merge_sort: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
#0  0x15264731ead0 in ???
#1  0x15264731dc35 in ???
#2  0x15264711551f in ???
#3  0x152647169a7c in ???
#4  0x152647115475 in ???
#5  0x1526470fb7f2 in ???
#6  0x152647678bbd in ???
#7  0x15264768424b in ???
#8  0x1526476842b6 in ???
#9  0x152647684517 in ???
#10  0x55bb46342ebb in _ZN6thrust8cuda_cub14throw_on_errorE9cudaErrorPKc
      at /usr/local/cuda/include/thrust/system/cuda/detail/util.h:224
#11  0x55bb46342ebb in _ZN6thrust8cuda_cub12__merge_sort10merge_sortINS_6detail17integral_constantIbLb1EEENS4_IbLb0EEENS0_3tagENS_12zip_iteratorINS_5tupleINS_10device_ptrIiEESB_NS_9null_typeESC_SC_SC_SC_SC_SC_SC_EEEENS3_15normal_iteratorISB_EE9IJCompareEEvRNS0_16execution_policyIT1_EET2_SM_T3_T4_
      at /usr/local/cuda/include/thrust/system/cuda/detail/sort.h:1316
#12  0x55bb46342ebb in _ZN6thrust8cuda_cub12__smart_sort10smart_sortINS_6detail17integral_constantIbLb1EEENS4_IbLb0EEENS0_16execution_policyINS0_3tagEEENS_12zip_iteratorINS_5tupleINS_10device_ptrIiEESD_NS_9null_typeESE_SE_SE_SE_SE_SE_SE_EEEENS3_15normal_iteratorISD_EE9IJCompareEENS1_25enable_if_comparison_sortIT2_T4_E4typeERT1_SL_SL_T3_SM_
      at /usr/local/cuda/include/thrust/system/cuda/detail/sort.h:1544
#13  0x55bb46342ebb in _ZN6thrust8cuda_cub11sort_by_keyINS0_3tagENS_12zip_iteratorINS_5tupleINS_10device_ptrIiEES6_NS_9null_typeES7_S7_S7_S7_S7_S7_S7_EEEENS_6detail15normal_iteratorIS6_EE9IJCompareEEvRNS0_16execution_policyIT_EET0_SI_T1_T2_
      at /usr/local/cuda/include/thrust/system/cuda/detail/sort.h:1669
#14  0x55bb46317bc5 in _ZN6thrust11sort_by_keyINS_8cuda_cub3tagENS_12zip_iteratorINS_5tupleINS_10device_ptrIiEES6_NS_9null_typeES7_S7_S7_S7_S7_S7_S7_EEEENS_6detail15normal_iteratorIS6_EE9IJCompareEEvRKNSA_21execution_policy_baseIT_EET0_SJ_T1_T2_
      at /usr/local/cuda/include/thrust/detail/sort.inl:115
#15  0x55bb46317bc5 in _ZN6thrust11sort_by_keyINS_12zip_iteratorINS_5tupleINS_10device_ptrIiEES4_NS_9null_typeES5_S5_S5_S5_S5_S5_S5_EEEENS_6detail15normal_iteratorIS4_EE9IJCompareEEvT_SC_T0_T1_
      at /usr/local/cuda/include/thrust/detail/sort.inl:305
#16  0x55bb46317bc5 in MatSetPreallocationCOO_SeqAIJCUSPARSE_Basic
      at /home/mnv/Software/petsc/src/mat/impls/aij/seq/seqcusparse/aijcusparse.cu:4452
#17  0x55bb46c5b27c in MatSetPreallocationCOO_MPIAIJCUSPARSE_Basic
      at /home/mnv/Software/petsc/src/mat/impls/aij/mpi/mpicusparse/mpiaijcusparse.cu:173
#18  0x55bb46c5b27c in MatSetPreallocationCOO_MPIAIJCUSPARSE
      at /home/mnv/Software/petsc/src/mat/impls/aij/mpi/mpicusparse/mpiaijcusparse.cu:222
#19  0x55bb468e01cf in MatSetPreallocationCOO
      at /home/mnv/Software/petsc/src/mat/utils/gcreate.c:606
#20  0x55bb46b39c9b in MatProductSymbolic_MPIAIJBACKEND
      at /home/mnv/Software/petsc/src/mat/impls/aij/mpi/mpiaij.c:7547
#21  0x55bb469015e5 in MatProductSymbolic
      at /home/mnv/Software/petsc/src/mat/interface/matproduct.c:803
#22  0x55bb4694ade2 in MatPtAP
      at /home/mnv/Software/petsc/src/mat/interface/matrix.c:9897
#23  0x55bb4696d3ec in MatCoarsenApply_MISK_private
      at /home/mnv/Software/petsc/src/mat/coarsen/impls/misk/misk.c:283
#24  0x55bb4696eb67 in MatCoarsenApply_MISK
      at /home/mnv/Software/petsc/src/mat/coarsen/impls/misk/misk.c:368
#25  0x55bb4695bd91 in MatCoarsenApply
      at /home/mnv/Software/petsc/src/mat/coarsen/coarsen.c:97
#26  0x55bb478294d8 in PCGAMGCoarsen_AGG
      at /home/mnv/Software/petsc/src/ksp/pc/impls/gamg/agg.c:524
#27  0x55bb471d1cb4 in PCSetUp_GAMG
      at /home/mnv/Software/petsc/src/ksp/pc/impls/gamg/gamg.c:631
#28  0x55bb464022cf in PCSetUp
      at /home/mnv/Software/petsc/src/ksp/pc/interface/precon.c:994
#29  0x55bb4718b8a7 in KSPSetUp
      at /home/mnv/Software/petsc/src/ksp/ksp/interface/itfunc.c:406
#30  0x55bb4718f22e in KSPSolve_Private
      at /home/mnv/Software/petsc/src/ksp/ksp/interface/itfunc.c:824
#31  0x55bb47192c0c in KSPSolve
      at /home/mnv/Software/petsc/src/ksp/ksp/interface/itfunc.c:1070
#32  0x55bb463efd35 in kspsolve_
      at /home/mnv/Software/petsc/src/ksp/ksp/interface/ftn-auto/itfuncf.c:320
#33  0x55bb45e94b32 in ???
#34  0x55bb46048044 in ???
#35  0x55bb46052ea1 in ???
#36  0x55bb45ac5f8e in ???
#37  0x1526470fcd8f in ???
#38  0x1526470fce3f in ???
#39  0x55bb45aef55d in ???
#40  0xffffffffffffffff in ???
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 1771753 on node dgx02 exited on signal 6 (Aborted).
--------------------------------------------------------------------------

BTW, I'm curious: if I run n MPI processes, each of them building a part of the linear system, and g GPUs, how does PETSc distribute those n pieces of the system matrix and RHS across the g GPUs? Does it apply some load-balancing algorithm? Where can I read about this?
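
My current mental model (which may well be wrong, hence the question) is that each rank's local block of the matrix and RHS simply lives on whatever GPU that rank ends up bound to, for example through a launcher wrapper like the sketch below. I'd like to confirm whether PETSc does anything smarter than that. This is only the generic round-robin binding pattern for Open MPI, not PETSc's internal logic, and the wrapper name gpu_bind.sh is made up:

#!/bin/bash
# gpu_bind.sh (hypothetical): round-robin mapping of local MPI ranks to GPUs,
# using Open MPI's OMPI_COMM_WORLD_LOCAL_RANK environment variable.
ngpus=$(nvidia-smi -L | wc -l)
export CUDA_VISIBLE_DEVICES=$(( OMPI_COMM_WORLD_LOCAL_RANK % ngpus ))
exec "$@"

It would be used as, e.g., mpirun -n <n> ./gpu_bind.sh /path/to/app <args>.
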
Thank you! I can also point you to my code repo on GitHub if you want to take a closer look.

Best Regards,
Marcos

________________________________
From: Junchao Zhang <junchao.zhang at gmail.com>
Sent: Friday, August 11, 2023 10:52 AM
To: Vanella, Marcos (Fed) <marcos.vanella at nist.gov>
Cc: petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>
Subject: Re: [petsc-users] CUDA error trying to run a job with two mpi processes and 1 GPU

Hi, Marcos,
  Could you build petsc in debug mode and then copy and paste the whole error stack message?

   Thanks
--Junchao Zhang


On Thu, Aug 10, 2023 at 5:51 PM Vanella, Marcos (Fed) via petsc-users <petsc-users at mcs.anl.gov> wrote:
Hi, I'm trying to run a parallel matrix-vector build and linear solve with PETSc on 2 MPI processes + one V100 GPU. I have verified that the matrix build and solve succeed when running on CPUs only. I'm using CUDA 11.5, CUDA-enabled Open MPI, and GCC 9.3. When I run the job with the GPU enabled I get the following error:

terminate called after throwing an instance of 'thrust::system::system_error'
  what():  merge_sort: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
terminate called after throwing an instance of 'thrust::system::system_error'
  what():  merge_sort: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered

Program received signal SIGABRT: Process abort signal.

I'm new to submitting Slurm jobs that also use GPU resources, so I might be doing something wrong in my submission script. This is it:

#!/bin/bash
#SBATCH -J test
#SBATCH -e /home/Issues/PETSc/test.err
#SBATCH -o /home/Issues/PETSc/test.log
#SBATCH --partition=batch
#SBATCH --ntasks=2
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH --ntasks-per-node=2
#SBATCH --time=01:00:00
#SBATCH --gres=gpu:1

export OMP_NUM_THREADS=1
module load cuda/11.5
module load openmpi/4.1.1

cd /home/Issues/PETSc
mpirun -n 2 /home/fds/Build/ompi_gnu_linux/fds_ompi_gnu_linux test.fds -vec_type mpicuda -mat_type mpiaijcusparse -pc_type gamg
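
As a sanity check (just my guess at something worth inspecting, not a known fix), I may add a couple of diagnostic lines like these before the mpirun call to confirm which GPU the job and each rank actually see:

# list the GPU(s) Slurm allocated to the job
nvidia-smi -L
# print the device list visible to each Open MPI rank
mpirun -n 2 bash -c 'echo "rank ${OMPI_COMM_WORLD_RANK}: CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}"'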

If anyone has any suggestions on how to troubleshoot this, please let me know.
Thanks!
Marcos


