[petsc-users] CUDA error trying to run a job with two mpi processes and 1 GPU

Vanella, Marcos (Fed) marcos.vanella at nist.gov
Mon Aug 14 13:05:29 CDT 2023


Hi Junchao, I compiled and ran ex60 through slurm on our Enki system. The batch script for the slurm submission, ex60.log, and the gpu stats file are attached.
Nothing stands out as wrong to me, but please have a look.
I'll revisit running the original 2 MPI process + 1 GPU Poisson problem.
Thanks!
Marcos
________________________________
From: Junchao Zhang <junchao.zhang at gmail.com>
Sent: Friday, August 11, 2023 5:52 PM
To: Vanella, Marcos (Fed) <marcos.vanella at nist.gov>
Cc: PETSc users list <petsc-users at mcs.anl.gov>; Satish Balay <balay at mcs.anl.gov>
Subject: Re: [petsc-users] CUDA error trying to run a job with two mpi processes and 1 GPU

Before digging into the details, could you try running src/ksp/ksp/tests/ex60.c to make sure the environment is OK?

The comment at the end of the file shows how to run it:
   test:
      requires: cuda
      suffix: 1_cuda
      nsize: 4
      args: -ksp_view -mat_type aijcusparse -sub_pc_factor_mat_solver_type cusparse
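
Roughly, that corresponds to a manual run like the sketch below (the launcher and the per-directory make target are assumptions on my part; adapt the paths and mpiexec/srun invocation to your system):

# Build and run ex60 by hand with 4 MPI ranks (sketch only)
cd $PETSC_DIR/src/ksp/ksp/tests
make ex60
mpiexec -n 4 ./ex60 -ksp_view -mat_type aijcusparse -sub_pc_factor_mat_solver_type cusparse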

--Junchao Zhang


On Fri, Aug 11, 2023 at 4:36 PM Vanella, Marcos (Fed) <marcos.vanella at nist.gov> wrote:
Hi Junchao, thank you for the info. I compiled the main branch of PETSc on another machine that has the openmpi/4.1.4/gcc-11.2.1-cuda-11.7 toolchain and no longer see the Fortran compilation error. It might have been related to gcc 9.3.
I tried the case again with 2 MPI processes and one GPU and now get this error (both ranks abort with the same thrust exception and identical backtraces, shown once below):

terminate called after throwing an instance of 'thrust::system::system_error'
  what():  parallel_for failed: cudaErrorInvalidConfiguration: invalid configuration argument

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
#0  0x2000397fcd8f in ???
#1  0x2000397fb657 in ???
#2  0x2000000604d7 in ???
#3  0x200039cb9628 in ???
#4  0x200039c93eb3 in ???
#5  0x200039364a97 in ???
#6  0x20003935f6d3 in ???
#7  0x20003935f78f in ???
#8  0x20003935fc6b in ???
#9  0x11ec425b in _ZN6thrust8cuda_cub14throw_on_errorE9cudaErrorPKc
      at /usr/local/cuda-11.7/include/thrust/system/cuda/detail/util.h:225
#10  0x11ec425b in _ZN6thrust8cuda_cub20uninitialized_fill_nINS0_3tagENS_10device_ptrIiEEmiEET0_RNS0_16execution_policyIT_EES5_T1_RKT2_
      at /usr/local/cuda-11.7/include/thrust/system/cuda/detail/uninitialized_fill.h:88
#11  0x11efa263 in _ZN6thrust20uninitialized_fill_nINS_8cuda_cub3tagENS_10device_ptrIiEEmiEET0_RKNS_6detail21execution_policy_baseIT_EES5_T1_RKT2_
      at /usr/local/cuda-11.7/include/thrust/detail/uninitialized_fill.inl:55
#12  0x11efa263 in _ZN6thrust6detail23allocator_traits_detail23default_construct_rangeINS_16device_allocatorIiEENS_10device_ptrIiEEmEENS0_10disable_ifIXsrNS1_37needs_default_construct_via_allocatorIT_NS0_15pointer_elementIT0_E4typeEEE5valueEvE4typeERS9_SB_T1_
      at /usr/local/cuda-11.7/include/thrust/detail/allocator/default_construct_range.inl:93
#13  0x11efa263 in _ZN6thrust6detail23default_construct_rangeINS_16device_allocatorIiEENS_10device_ptrIiEEmEEvRT_T0_T1_
      at /usr/local/cuda-11.7/include/thrust/detail/allocator/default_construct_range.inl:104
#14  0x11efa263 in _ZN6thrust6detail18contiguous_storageIiNS_16device_allocatorIiEEE19default_construct_nENS0_15normal_iteratorINS_10device_ptrIiEEEEm
      at /usr/local/cuda-11.7/include/thrust/detail/contiguous_storage.inl:254
#15  0x11efa263 in _ZN6thrust6detail11vector_baseIiNS_16device_allocatorIiEEE12default_initEm
      at /usr/local/cuda-11.7/include/thrust/detail/vector_base.inl:220
#16  0x11efa263 in _ZN6thrust6detail11vector_baseIiNS_16device_allocatorIiEEE12default_initEm
      at /usr/local/cuda-11.7/include/thrust/detail/vector_base.inl:213
#17  0x11efa263 in _ZN6thrust6detail11vector_baseIiNS_16device_allocatorIiEEEC2Em
      at /usr/local/cuda-11.7/include/thrust/detail/vector_base.inl:65
#18  0x11ed7e47 in _ZN6thrust13device_vectorIiNS_16device_allocatorIiEEEC4Em
      at /usr/local/cuda-11.7/include/thrust/device_vector.h:88
#19  0x11ed7e47 in MatSeqAIJCUSPARSECopyToGPU
      at /home/mnv/Software/petsc/src/mat/impls/aij/seq/seqcusparse/aijcusparse.cu:2488
#20  0x11eef623 in MatSeqAIJCUSPARSEMergeMats
      at /home/mnv/Software/petsc/src/mat/impls/aij/seq/seqcusparse/aijcusparse.cu:4696
#21  0x11f0682b in MatMPIAIJGetLocalMatMerge_MPIAIJCUSPARSE
      at /home/mnv/Software/petsc/src/mat/impls/aij/mpi/mpicusparse/mpiaijcusparse.cu:251
#22  0x133f141f in MatMPIAIJGetLocalMatMerge
      at /home/mnv/Software/petsc/src/mat/impls/aij/mpi/mpiaij.c:5342
#23  0x133fe9cb in MatProductSymbolic_MPIAIJBACKEND
      at /home/mnv/Software/petsc/src/mat/impls/aij/mpi/mpiaij.c:7368
#24  0x1377e1df in MatProductSymbolic
      at /home/mnv/Software/petsc/src/mat/interface/matproduct.c:795
#25  0x11e4dd1f in MatPtAP
      at /home/mnv/Software/petsc/src/mat/interface/matrix.c:9934
#26  0x130d792f in MatCoarsenApply_MISK_private
      at /home/mnv/Software/petsc/src/mat/coarsen/impls/misk/misk.c:283
#27  0x130db89b in MatCoarsenApply_MISK
      at /home/mnv/Software/petsc/src/mat/coarsen/impls/misk/misk.c:368
#28  0x130bf5a3 in MatCoarsenApply
      at /home/mnv/Software/petsc/src/mat/coarsen/coarsen.c:97
#29  0x141518ff in PCGAMGCoarsen_AGG
      at /home/mnv/Software/petsc/src/ksp/pc/impls/gamg/agg.c:524
#30  0x13b3a43f in PCSetUp_GAMG
      at /home/mnv/Software/petsc/src/ksp/pc/impls/gamg/gamg.c:631
#31  0x1276845b in PCSetUp
      at /home/mnv/Software/petsc/src/ksp/pc/interface/precon.c:1069
#32  0x127d6cbb in KSPSetUp
      at /home/mnv/Software/petsc/src/ksp/ksp/interface/itfunc.c:415
#33  0x127dddbf in KSPSolve_Private
      at /home/mnv/Software/petsc/src/ksp/ksp/interface/itfunc.c:836
#34  0x127e4987 in KSPSolve
      at /home/mnv/Software/petsc/src/ksp/ksp/interface/itfunc.c:1082
#35  0x1280b18b in kspsolve_
      at /home/mnv/Software/petsc/arch-linux-c-dbg/src/ksp/ksp/interface/ftn-auto/itfuncf.c:335
#36  0x1140945f in __globmat_solver_MOD_glmat_solver
      at ../../Source/pres.f90:3128
#37  0x119f8853 in pressure_iteration_scheme
      at ../../Source/main.f90:1449
#38  0x11969bd3 in fds
      at ../../Source/main.f90:688
#39  0x11a10167 in main
      at ../../Source/main.f90:6
srun: error: enki12: tasks 0-1: Aborted (core dumped)


This was the slurm submission script in this case:

#!/bin/bash
# ../../Utilities/Scripts/qfds.sh -p 2  -T db -d test.fds
#SBATCH -J test
#SBATCH -e /home/mnv/Firemodels_fork/fds/Issues/PETSc/test.err
#SBATCH -o /home/mnv/Firemodels_fork/fds/Issues/PETSc/test.log
#SBATCH --partition=debug
#SBATCH --ntasks=2
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH --ntasks-per-node=2
#SBATCH --time=01:00:00
#SBATCH --gres=gpu:1

export OMP_NUM_THREADS=1

# PETSc dir and arch:
export PETSC_DIR=/home/mnv/Software/petsc
export PETSC_ARCH=arch-linux-c-dbg

# SYSTEM name:
export MYSYSTEM=enki

# modules
module load cuda/11.7
module load gcc/11.2.1/toolset
module load openmpi/4.1.4/gcc-11.2.1-cuda-11.7

cd /home/mnv/Firemodels_fork/fds/Issues/PETSc
srun -N 1 -n 2 --ntasks-per-node 2 --mpi=pmi2 /home/mnv/Firemodels_fork/fds/Build/ompi_gnu_linux_db/fds_ompi_gnu_linux_db test.fds -vec_type mpicuda -mat_type mpiaijcusparse -pc_type gamg

The configure.log for the PETSc build is attached. Another clue to what is happening: even when I set the matrices/vectors to plain MPI types (-vec_type mpi -mat_type mpiaij) and do not request a GPU, I still get a GPU error:

[0]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
[1]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
[1]PETSC ERROR: GPU error
[1]PETSC ERROR: Cannot lazily initialize PetscDevice: cuda error 100 (cudaErrorNoDevice) : no CUDA-capable device is detected
[1]PETSC ERROR: WARNING! There are unused option(s) set! Could be the program crashed before usage or a spelling mistake, etc!
[0]PETSC ERROR: GPU error
[0]PETSC ERROR: Cannot lazily initialize PetscDevice: cuda error 100 (cudaErrorNoDevice) : no CUDA-capable device is detected
[0]PETSC ERROR: WARNING! There are unused option(s) set! Could be the program crashed before usage or a spelling mistake, etc!
[0]PETSC ERROR:   Option left: name:-pc_type value: gamg source: command line
[0]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting.
[1]PETSC ERROR:   Option left: name:-pc_type value: gamg source: command line
[1]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting.
[1]PETSC ERROR: Petsc Development GIT revision: v3.19.4-946-g590ad0f52ad  GIT Date: 2023-08-11 15:13:02 +0000
[0]PETSC ERROR: Petsc Development GIT revision: v3.19.4-946-g590ad0f52ad  GIT Date: 2023-08-11 15:13:02 +0000
[0]PETSC ERROR: /home/mnv/Firemodels_fork/fds/Build/ompi_gnu_linux_db/fds_ompi_gnu_linux_db on a arch-linux-c-dbg named enki11.adlp by mnv Fri Aug 11 17:04:55 2023
[0]PETSC ERROR: Configure options COPTFLAGS="-g -O2" CXXOPTFLAGS="-g -O2" FOPTFLAGS="-g -O2" FCOPTFLAGS="-g -O2" CUDAOPTFLAGS="-g -O2" --with-debugging=yes --with-shared-libraries=0 --download-suitesparse --download-hypre --download-fblaslapack --with-cuda
...

I would not have expected GPU errors to be printed, given that I did not request CUDA matrices/vectors. The case ran anyway; I assume it defaulted to the CPU solver.
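If I understand the PETSc device options correctly, something like the run below might skip CUDA device initialization for the CPU-only case; the -device_enable option name is my assumption from the docs, so please correct me if that is not the right way to do it:

# CPU-only sketch: plain MPI types plus (assumed) -device_enable none to avoid touching the CUDA device
srun -N 1 -n 2 --ntasks-per-node 2 --mpi=pmi2 /home/mnv/Firemodels_fork/fds/Build/ompi_gnu_linux_db/fds_ompi_gnu_linux_db test.fds -vec_type mpi -mat_type mpiaij -pc_type gamg -device_enable none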
Let me know if you have any ideas as to what is happening. Thanks,
Marcos


________________________________
From: Junchao Zhang <junchao.zhang at gmail.com>
Sent: Friday, August 11, 2023 3:35 PM
To: Vanella, Marcos (Fed) <marcos.vanella at nist.gov>; PETSc users list <petsc-users at mcs.anl.gov>; Satish Balay <balay at mcs.anl.gov>
Subject: Re: [petsc-users] CUDA error trying to run a job with two mpi processes and 1 GPU

Marcos,
  We do not have good petsc/gpu documentation, but see https://petsc.org/main/faq/#doc-faq-gpuhowto, and also search for "requires: cuda" in the petsc tests and you will find examples using GPUs.
  For the Fortran compile errors, attach your configure.log and Satish (Cc'ed) or others should know how to fix them.

  Thanks.
--Junchao Zhang


On Fri, Aug 11, 2023 at 2:22 PM Vanella, Marcos (Fed) <marcos.vanella at nist.gov> wrote:
Hi Junchao, thanks for the explanation. Is there some development documentation on the GPU work? I'm interested in learning about it.
I checked out the main branch and configured PETSc. When compiling with gcc/gfortran I come across this error:

....
      CUDAC arch-linux-c-opt/obj/src/mat/impls/aij/seq/seqcusparse/aijcusparse.o
  CUDAC.dep arch-linux-c-opt/obj/src/mat/impls/aij/seq/seqcusparse/aijcusparse.o
         FC arch-linux-c-opt/obj/src/ksp/f90-mod/petsckspdefmod.o
         FC arch-linux-c-opt/obj/src/ksp/f90-mod/petscpcmod.o
/home/mnv/Software/petsc/include/../src/ksp/f90-mod/ftn-auto-interfaces/petscpc.h90:37:61:

   37 |       subroutine PCASMCreateSubdomains2D(a,b,c,d,e,f,g,h,i,z)
      |                                                             1
Error: Symbol ‘pcasmcreatesubdomains2d’ at (1) already has an explicit interface
/home/mnv/Software/petsc/include/../src/ksp/f90-mod/ftn-auto-interfaces/petscpc.h90:38:13:

   38 |        import tIS
      |             1
Error: IMPORT statement at (1) only permitted in an INTERFACE body
/home/mnv/Software/petsc/include/../src/ksp/f90-mod/ftn-auto-interfaces/petscpc.h90:39:80:

   39 |        PetscInt a ! PetscInt
      |                                                                                1
Error: Unexpected data declaration statement in INTERFACE block at (1)
/home/mnv/Software/petsc/include/../src/ksp/f90-mod/ftn-auto-interfaces/petscpc.h90:40:80:

   40 |        PetscInt b ! PetscInt
      |                                                                                1
Error: Unexpected data declaration statement in INTERFACE block at (1)
/home/mnv/Software/petsc/include/../src/ksp/f90-mod/ftn-auto-interfaces/petscpc.h90:41:80:

   41 |        PetscInt c ! PetscInt
      |                                                                                1
Error: Unexpected data declaration statement in INTERFACE block at (1)
/home/mnv/Software/petsc/include/../src/ksp/f90-mod/ftn-auto-interfaces/petscpc.h90:42:80:

   42 |        PetscInt d ! PetscInt
      |                                                                                1
Error: Unexpected data declaration statement in INTERFACE block at (1)
/home/mnv/Software/petsc/include/../src/ksp/f90-mod/ftn-auto-interfaces/petscpc.h90:43:80:

   43 |        PetscInt e ! PetscInt
      |                                                                                1
Error: Unexpected data declaration statement in INTERFACE block at (1)
/home/mnv/Software/petsc/include/../src/ksp/f90-mod/ftn-auto-interfaces/petscpc.h90:44:80:

   44 |        PetscInt f ! PetscInt
      |                                                                                1
Error: Unexpected data declaration statement in INTERFACE block at (1)
/home/mnv/Software/petsc/include/../src/ksp/f90-mod/ftn-auto-interfaces/petscpc.h90:45:80:

   45 |        PetscInt g ! PetscInt
      |                                                                                1
Error: Unexpected data declaration statement in INTERFACE block at (1)
/home/mnv/Software/petsc/include/../src/ksp/f90-mod/ftn-auto-interfaces/petscpc.h90:46:30:

   46 |        IS h ! IS
      |                              1
Error: Unexpected data declaration statement in INTERFACE block at (1)
/home/mnv/Software/petsc/include/../src/ksp/f90-mod/ftn-auto-interfaces/petscpc.h90:47:30:

   47 |        IS i ! IS
      |                              1
Error: Unexpected data declaration statement in INTERFACE block at (1)
/home/mnv/Software/petsc/include/../src/ksp/f90-mod/ftn-auto-interfaces/petscpc.h90:48:43:

   48 |        PetscErrorCode z
      |                                           1
Error: Unexpected data declaration statement in INTERFACE block at (1)
/home/mnv/Software/petsc/include/../src/ksp/f90-mod/ftn-auto-interfaces/petscpc.h90:49:10:

   49 |        end subroutine PCASMCreateSubdomains2D
      |          1
Error: Expecting END INTERFACE statement at (1)
make[3]: *** [gmakefile:225: arch-linux-c-opt/obj/src/ksp/f90-mod/petscpcmod.o] Error 1
make[3]: *** Waiting for unfinished jobs....
         CC arch-linux-c-opt/obj/src/tao/leastsquares/impls/pounders/pounders.o
         CC arch-linux-c-opt/obj/src/ksp/pc/impls/bddc/bddcprivate.o
      CUDAC arch-linux-c-opt/obj/src/vec/vec/impls/seq/cupm/cuda/vecseqcupm.o
  CUDAC.dep arch-linux-c-opt/obj/src/vec/vec/impls/seq/cupm/cuda/vecseqcupm.o
make[3]: Leaving directory '/home/mnv/Software/petsc'
make[2]: *** [/home/mnv/Software/petsc/lib/petsc/conf/rules.doc:28: libs] Error 2
make[2]: Leaving directory '/home/mnv/Software/petsc'
**************************ERROR*************************************
  Error during compile, check arch-linux-c-opt/lib/petsc/conf/make.log
  Send it and arch-linux-c-opt/lib/petsc/conf/configure.log to petsc-maint at mcs.anl.gov
********************************************************************
make[1]: *** [makefile:45: all] Error 1
make: *** [GNUmakefile:9: all] Error 2
________________________________
From: Junchao Zhang <junchao.zhang at gmail.com>
Sent: Friday, August 11, 2023 3:04 PM
To: Vanella, Marcos (Fed) <marcos.vanella at nist.gov>
Cc: petsc-users at mcs.anl.gov
Subject: Re: [petsc-users] CUDA error trying to run a job with two mpi processes and 1 GPU

Hi, Marcos,
  I saw MatSetPreallocationCOO_MPIAIJCUSPARSE_Basic() in the error stack.  We recently refactored the COO code and got rid of that function.  So could you try petsc/main?
  We map MPI processes to GPUs in a round-robin fashion. We query the number of visible CUDA devices (g) and assign device (rank%g) to MPI process (rank). In that sense, the work distribution is entirely determined by your MPI work partition (i.e., by you).
  On clusters, this MPI-process-to-GPU binding is usually done by the job scheduler, e.g. slurm. You need to check your cluster's user guide to see how to bind MPI processes to GPUs. If the job scheduler has already done that, the number of CUDA devices visible to a process might just be 1, making PETSc's own mapping a no-op.
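  For example, a quick sanity check under slurm (assuming nvidia-smi is installed on the compute nodes) is to print what each rank can see; if the scheduler already binds one GPU per rank, each rank should report a single device:

# Show each rank's visible GPUs inside the allocation (SLURM_PROCID and CUDA_VISIBLE_DEVICES are set by slurm)
srun -N 1 -n 2 bash -c 'echo "rank $SLURM_PROCID: CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-unset}"; nvidia-smi -L'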

   Thanks.
--Junchao Zhang


On Fri, Aug 11, 2023 at 12:43 PM Vanella, Marcos (Fed) <marcos.vanella at nist.gov> wrote:
Hi Junchao, thank you for replying. I compiled petsc in debug mode and this is what I get for the case:

terminate called after throwing an instance of 'thrust::system::system_error'
  what():  merge_sort: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
#0  0x15264731ead0 in ???
#1  0x15264731dc35 in ???
#2  0x15264711551f in ???
#3  0x152647169a7c in ???
#4  0x152647115475 in ???
#5  0x1526470fb7f2 in ???
#6  0x152647678bbd in ???
#7  0x15264768424b in ???
#8  0x1526476842b6 in ???
#9  0x152647684517 in ???
#10  0x55bb46342ebb in _ZN6thrust8cuda_cub14throw_on_errorE9cudaErrorPKc
      at /usr/local/cuda/include/thrust/system/cuda/detail/util.h:224
#11  0x55bb46342ebb in _ZN6thrust8cuda_cub12__merge_sort10merge_sortINS_6detail17integral_constantIbLb1EEENS4_IbLb0EEENS0_3tagENS_12zip_iteratorINS_5tupleINS_10device_ptrIiEESB_NS_9null_typeESC_SC_SC_SC_SC_SC_SC_EEEENS3_15normal_iteratorISB_EE9IJCompareEEvRNS0_16execution_policyIT1_EET2_SM_T3_T4_
      at /usr/local/cuda/include/thrust/system/cuda/detail/sort.h:1316
#12  0x55bb46342ebb in _ZN6thrust8cuda_cub12__smart_sort10smart_sortINS_6detail17integral_constantIbLb1EEENS4_IbLb0EEENS0_16execution_policyINS0_3tagEEENS_12zip_iteratorINS_5tupleINS_10device_ptrIiEESD_NS_9null_typeESE_SE_SE_SE_SE_SE_SE_EEEENS3_15normal_iteratorISD_EE9IJCompareEENS1_25enable_if_comparison_sortIT2_T4_E4typeERT1_SL_SL_T3_SM_
      at /usr/local/cuda/include/thrust/system/cuda/detail/sort.h:1544
#13  0x55bb46342ebb in _ZN6thrust8cuda_cub11sort_by_keyINS0_3tagENS_12zip_iteratorINS_5tupleINS_10device_ptrIiEES6_NS_9null_typeES7_S7_S7_S7_S7_S7_S7_EEEENS_6detail15normal_iteratorIS6_EE9IJCompareEEvRNS0_16execution_policyIT_EET0_SI_T1_T2_
      at /usr/local/cuda/include/thrust/system/cuda/detail/sort.h:1669
#14  0x55bb46317bc5 in _ZN6thrust11sort_by_keyINS_8cuda_cub3tagENS_12zip_iteratorINS_5tupleINS_10device_ptrIiEES6_NS_9null_typeES7_S7_S7_S7_S7_S7_S7_EEEENS_6detail15normal_iteratorIS6_EE9IJCompareEEvRKNSA_21execution_policy_baseIT_EET0_SJ_T1_T2_
      at /usr/local/cuda/include/thrust/detail/sort.inl:115
#15  0x55bb46317bc5 in _ZN6thrust11sort_by_keyINS_12zip_iteratorINS_5tupleINS_10device_ptrIiEES4_NS_9null_typeES5_S5_S5_S5_S5_S5_S5_EEEENS_6detail15normal_iteratorIS4_EE9IJCompareEEvT_SC_T0_T1_
      at /usr/local/cuda/include/thrust/detail/sort.inl:305
#16  0x55bb46317bc5 in MatSetPreallocationCOO_SeqAIJCUSPARSE_Basic
      at /home/mnv/Software/petsc/src/mat/impls/aij/seq/seqcusparse/aijcusparse.cu:4452
#17  0x55bb46c5b27c in MatSetPreallocationCOO_MPIAIJCUSPARSE_Basic
      at /home/mnv/Software/petsc/src/mat/impls/aij/mpi/mpicusparse/mpiaijcusparse.cu:173
#18  0x55bb46c5b27c in MatSetPreallocationCOO_MPIAIJCUSPARSE
      at /home/mnv/Software/petsc/src/mat/impls/aij/mpi/mpicusparse/mpiaijcusparse.cu:222
#19  0x55bb468e01cf in MatSetPreallocationCOO
      at /home/mnv/Software/petsc/src/mat/utils/gcreate.c:606
#20  0x55bb46b39c9b in MatProductSymbolic_MPIAIJBACKEND
      at /home/mnv/Software/petsc/src/mat/impls/aij/mpi/mpiaij.c:7547
#21  0x55bb469015e5 in MatProductSymbolic
      at /home/mnv/Software/petsc/src/mat/interface/matproduct.c:803
#22  0x55bb4694ade2 in MatPtAP
      at /home/mnv/Software/petsc/src/mat/interface/matrix.c:9897
#23  0x55bb4696d3ec in MatCoarsenApply_MISK_private
      at /home/mnv/Software/petsc/src/mat/coarsen/impls/misk/misk.c:283
#24  0x55bb4696eb67 in MatCoarsenApply_MISK
      at /home/mnv/Software/petsc/src/mat/coarsen/impls/misk/misk.c:368
#25  0x55bb4695bd91 in MatCoarsenApply
      at /home/mnv/Software/petsc/src/mat/coarsen/coarsen.c:97
#26  0x55bb478294d8 in PCGAMGCoarsen_AGG
      at /home/mnv/Software/petsc/src/ksp/pc/impls/gamg/agg.c:524
#27  0x55bb471d1cb4 in PCSetUp_GAMG
      at /home/mnv/Software/petsc/src/ksp/pc/impls/gamg/gamg.c:631
#28  0x55bb464022cf in PCSetUp
      at /home/mnv/Software/petsc/src/ksp/pc/interface/precon.c:994
#29  0x55bb4718b8a7 in KSPSetUp
      at /home/mnv/Software/petsc/src/ksp/ksp/interface/itfunc.c:406
#30  0x55bb4718f22e in KSPSolve_Private
      at /home/mnv/Software/petsc/src/ksp/ksp/interface/itfunc.c:824
#31  0x55bb47192c0c in KSPSolve
      at /home/mnv/Software/petsc/src/ksp/ksp/interface/itfunc.c:1070
#32  0x55bb463efd35 in kspsolve_
      at /home/mnv/Software/petsc/src/ksp/ksp/interface/ftn-auto/itfuncf.c:320
#33  0x55bb45e94b32 in ???
#34  0x55bb46048044 in ???
#35  0x55bb46052ea1 in ???
#36  0x55bb45ac5f8e in ???
#37  0x1526470fcd8f in ???
#38  0x1526470fce3f in ???
#39  0x55bb45aef55d in ???
#40  0xffffffffffffffff in ???
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 1771753 on node dgx02 exited on signal 6 (Aborted).
--------------------------------------------------------------------------

BTW, I'm curious: if I set up n MPI processes, each building a part of the linear system, and g GPUs, how does PETSc distribute those n pieces of the system matrix and RHS among the g GPUs? Does it use some load-balancing algorithm? Where can I read about this?
Thank you. I can also point you to my code repo on GitHub if you want to take a closer look.

Best Regards,
Marcos

________________________________
From: Junchao Zhang <junchao.zhang at gmail.com>
Sent: Friday, August 11, 2023 10:52 AM
To: Vanella, Marcos (Fed) <marcos.vanella at nist.gov>
Cc: petsc-users at mcs.anl.gov
Subject: Re: [petsc-users] CUDA error trying to run a job with two mpi processes and 1 GPU

Hi, Marcos,
  Could you build petsc in debug mode and then copy and paste the whole error stack message?

   Thanks
--Junchao Zhang


On Thu, Aug 10, 2023 at 5:51 PM Vanella, Marcos (Fed) via petsc-users <petsc-users at mcs.anl.gov> wrote:
Hi, I'm trying to run a parallel matrix-vector build and linear solve with PETSc on 2 MPI processes + one V100 GPU. I verified that the matrix build and solve succeed on CPUs only. I'm using CUDA 11.5, CUDA-enabled OpenMPI, and gcc 9.3. When I run the job with the GPU enabled I get the following error:

terminate called after throwing an instance of 'thrust::system::system_error'
  what():  merge_sort: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
terminate called after throwing an instance of 'thrust::system::system_error'
  what():  merge_sort: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered

Program received signal SIGABRT: Process abort signal.

I'm new to submitting jobs in slurm that also use GPU resources, so I might be doing something wrong in my submission script. Here it is:

#!/bin/bash
#SBATCH -J test
#SBATCH -e /home/Issues/PETSc/test.err
#SBATCH -o /home/Issues/PETSc/test.log
#SBATCH --partition=batch
#SBATCH --ntasks=2
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH --ntasks-per-node=2
#SBATCH --time=01:00:00
#SBATCH --gres=gpu:1

export OMP_NUM_THREADS=1
module load cuda/11.5
module load openmpi/4.1.1

cd /home/Issues/PETSc
mpirun -n 2 /home/fds/Build/ompi_gnu_linux/fds_ompi_gnu_linux test.fds -vec_type mpicuda -mat_type mpiaijcusparse -pc_type gamg

If anyone has any suggestions on how to troubleshoot this, please let me know.
Thanks!
Marcos


