[petsc-users] CUDA error trying to run a job with two mpi processes and 1 GPU
Junchao Zhang
junchao.zhang at gmail.com
Mon Aug 14 18:01:13 CDT 2023
Marcos,
These are my findings. I successfully ran the test in the end.
$ mpirun -n 2 ./fds_ompi_gnu_linux_db test.fds -log_view
Starting FDS ...
...
[0]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
[0]PETSC ERROR: Invalid argument
[0]PETSC ERROR: HYPRE_MEMORY_DEVICE expects a device vector. You need to enable PETSc device support, for example, in some cases, -vec_type cuda
Now I see why you hit errors in the "CPU runs". You configured and built
hypre through PETSc. Because you passed --with-cuda, PETSc configured hypre
with GPU support. However, hypre has a limitation: if it is configured with
GPU support, you must pass it GPU vectors. Hence the error. In other words,
if you remove --with-cuda, you should be able to run the command above.
With the GPU matrix and vector types selected at run time, the same case
runs to completion:
$ mpirun -n 2 ./fds_ompi_gnu_linux_db test.fds -log_view -mat_type aijcusparse -vec_type cuda
Starting FDS ...
MPI Process 0 started on hong-gce-workstation
MPI Process 1 started on hong-gce-workstation
Reading FDS input file ...
At line 3014 of file ../../Source/read.f90
Fortran runtime warning: An array temporary was created
At line 3461 of file ../../Source/read.f90
Fortran runtime warning: An array temporary was created
WARNING: SPEC REAC_FUEL is not in the table of pre-defined species. Any
unassigned SPEC variables in the input were assigned the properties of
nitrogen.
At line 3014 of file ../../Source/read.f90
..
Fire Dynamics Simulator
...
STOP: FDS completed successfully (CHID: test)
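The two runs above differ only in their runtime options: a PETSc build configured with --with-cuda needs the device matrix and vector types, while a CPU-only build does not. As a rough sketch of how a launch script could choose between the two, the shell function below looks for the PETSC_HAVE_CUDA macro in the build's petscconf.h. The function name petsc_gpu_opts is hypothetical, and inspecting petscconf.h this way is an assumption about the build layout rather than anything FDS or PETSc does automatically.

```shell
#!/bin/sh
# Hypothetical helper (not part of PETSc or FDS): print the extra runtime
# options needed so that a CUDA-enabled PETSc build hands device data to hypre.
# Arguments: $1 = PETSC_DIR, $2 = PETSC_ARCH
petsc_gpu_opts() {
    conf="$1/$2/include/petscconf.h"
    if [ -r "$conf" ] && grep -q '^#define PETSC_HAVE_CUDA ' "$conf"; then
        # CUDA build: request device matrices and vectors, as in the run above.
        printf '%s\n' '-mat_type aijcusparse -vec_type cuda'
    else
        # CPU-only build, or petscconf.h not readable: no extra options.
        printf '\n'
    fi
}

# Possible use (assumes PETSC_DIR and PETSC_ARCH are set in the environment):
#   mpirun -n 2 ./fds_ompi_gnu_linux_db test.fds -log_view \
#       $(petsc_gpu_opts "$PETSC_DIR" "$PETSC_ARCH")
```

The command substitution is left unquoted on purpose so that the two options split into separate arguments, and so that an empty result adds nothing to the command line.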
I suspect the remaining problems came from the link line in your makefile.
In fact, my first build attempt failed with
mpifort -m64 -O0 -std=f2018 -ggdb -Wall -Wunused-parameter
-Wcharacter-truncation -Wno-target-lifetime -fcheck=all -fbacktrace
-ffpe-trap=invalid,zero,overflow -frecursive -ffpe-summary=none
-fall-intrinsics -fbounds-check -cpp
-DGITHASH_PP=\"FDS6.7.0-11263-g04d5df7-FireX\" -DGITDATE_PP=\""Mon Aug 14
17:07:20 2023 -0400\"" -DBUILDDATE_PP=\""Aug 14, 2023 17:32:12\""
-DCOMPVER_PP=\""Gnu gfortran 11.4.0-1ubuntu1~22.04)"\" -DWITH_PETSC
-I"/home/jczhang/petsc/include/"
-I"/home/jczhang/petsc/arch-kokkos-dbg/include" -fopenmp -o
fds_ompi_gnu_linux_db prec.o cons.o prop.o devc.o type.o data.o mesh.o
func.o gsmv.o smvv.o rcal.o turb.o soot.o pois.o geom.o ccib.o radi.o
part.o vege.o ctrl.o hvac.o mass.o imkl.o wall.o fire.o velo.o pres.o
init.o dump.o read.o divg.o main.o -Wl,-rpath
-Wl,/apps/ubuntu-20.04.2/openmpi/4.1.1/gcc-9.3.0/lib -Wl,--enable-new-dtags
-L/apps/ubuntu-20.04.2/openmpi/4.1.1/gcc-9.3.0/lib -lmpi
-Wl,-rpath,/home/jczhang/petsc/arch-kokkos-dbg/lib
-L/home/jczhang/petsc/arch-kokkos-dbg/lib -lpetsc -ldl -lspqr -lumfpack
-lklu -lcholmod -lbtf -lccolamd -lcolamd -lcamd -lamd -lsuitesparseconfig
-lHYPRE -Wl,-rpath,/usr/local/cuda/lib64 -L/usr/local/cuda/lib64
-L/usr/local/cuda/lib64/stubs -lcudart -lnvToolsExt -lcufft -lcublas
-lcusparse -lcusolver -lcurand -lcuda -lflapack -lfblas -lstdc++
-L/usr/lib64 -lX11
/usr/bin/ld: cannot find -lflapack: No such file or directory
/usr/bin/ld: cannot find -lfblas: No such file or directory
collect2: error: ld returned 1 exit status
make: *** [../makefile:357: ompi_gnu_linux_db] Error 1
That is because your fds/Build/makefile hardwires many link flags, including
libraries such as -lflapack and -lfblas that the linker could not find on my
machine. I then changed LFLAGS_PETSC to
LFLAGS_PETSC = -Wl,-rpath,${PETSC_DIR}/${PETSC_ARCH}/lib -L${PETSC_DIR}/${PETSC_ARCH}/lib -lpetsc
and everything worked. Could you also try it?
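
For anyone who wants to check what that makefile line expands to before editing the makefile, here is a minimal shell sketch that builds the same three flags from PETSC_DIR and PETSC_ARCH. The function name petsc_lflags is hypothetical. It also assumes that linking against -lpetsc alone is enough, which holds when libpetsc is a shared library; a static build (for example one configured with --with-shared-libraries=0) would likely need the external libraries listed as well.

```shell
#!/bin/sh
# Hypothetical helper: print the minimal PETSc link flags derived from
# PETSC_DIR and PETSC_ARCH, mirroring the LFLAGS_PETSC line above.
petsc_lflags() {
    # Stop with a message if either variable is unset or empty.
    : "${PETSC_DIR:?PETSC_DIR is not set}"
    : "${PETSC_ARCH:?PETSC_ARCH is not set}"
    libdir="$PETSC_DIR/$PETSC_ARCH/lib"
    # rpath so the executable finds libpetsc at run time, -L so the linker
    # finds it at build time, and -lpetsc as the only library named.
    printf '%s\n' "-Wl,-rpath,$libdir -L$libdir -lpetsc"
}

# Possible use: inspect the flags for the current environment.
#   petsc_lflags
```

Keeping the library list this short means the makefile no longer depends on which optional packages a particular PETSc build happened to download.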
--Junchao Zhang
On Mon, Aug 14, 2023 at 4:53 PM Vanella, Marcos (Fed) <
marcos.vanella at nist.gov> wrote:
> Attached is the test.fds test case. Thanks!
> ------------------------------
> *From:* Vanella, Marcos (Fed) <marcos.vanella at nist.gov>
> *Sent:* Monday, August 14, 2023 5:45 PM
> *To:* Junchao Zhang <junchao.zhang at gmail.com>; petsc-users at mcs.anl.gov <
> petsc-users at mcs.anl.gov>; Satish Balay <balay at mcs.anl.gov>
> *Cc:* McDermott, Randall J. (Fed) <randall.mcdermott at nist.gov>
> *Subject:* Re: [petsc-users] CUDA error trying to run a job with two mpi
> processes and 1 GPU
>
> All right Junchao, thank you for looking at this!
>
> So, I checked out the main branch of PETSc in /dir_to_petsc/petsc and set
> up the PETSc environment variables:
>
> # PETSc dir and arch, set MYSYSTEM to nisaba for FDS:
> export PETSC_DIR=/dir_to_petsc/petsc
> export PETSC_ARCH=arch-linux-c-dbg
> export MYSYSTEM=nisaba
>
> and configured the library with:
>
> $ ./configure COPTFLAGS="-g -O2" CXXOPTFLAGS="-g -O2" FOPTFLAGS="-g -O2"
> FCOPTFLAGS="-g -O2" CUDAOPTFLAGS="-g -O2" --with-debugging=yes
> --with-shared-libraries=0 --download-suitesparse --download-hypre
> --download-fblaslapack --with-cuda
>
> Then I built PETSc and checked the build.
>
> Then for FDS:
>
> 1. Clone my fds repo into a ~/fds_dir directory that you create, and
> check out the FireX branch:
>
> $ cd ~/fds_dir
> $ git clone https://github.com/marcosvanella/fds.git
> $ cd fds
> $ git checkout FireX
>
>
> 2. With PETSC_DIR, PETSC_ARCH and MYSYSTEM=nisaba defined, compile a
> debug target for fds (this is with CUDA-enabled Open MPI compiled with gcc,
> in my case gcc-11.2 + PETSc):
>
> $ cd Build/ompi_gnu_linux_db
> $ ./make_fds.sh
>
> You should see compilation lines like these, with the WITH_PETSC
> preprocessor variable defined:
>
> Building ompi_gnu_linux_db
> mpifort -c -m64 -O0 -std=f2018 -ggdb -Wall -Wunused-parameter
> -Wcharacter-truncation -Wno-target-lifetime -fcheck=all -fbacktrace
> -ffpe-trap=invalid,zero,overflow -frecursive -ffpe-summary=none
> -fall-intrinsics -fbounds-check -cpp
> -DGITHASH_PP=\"FDS-6.8.0-556-g04d5df7-dirty-FireX\" -DGITDATE_PP=\""Mon Aug
> 14 17:07:20 2023 -0400\"" -DBUILDDATE_PP=\""Aug 14, 2023 17:34:36\""
> -DCOMPVER_PP=\""Gnu gfortran 11.2.1"\" *-DWITH_PETSC*
> -I"/home/mnv/Software/petsc/include/"
> -I"/home/mnv/Software/petsc/arch-linux-c-dbg/include" ../../Source/prec.f90
> mpifort -c -m64 -O0 -std=f2018 -ggdb -Wall -Wunused-parameter
> -Wcharacter-truncation -Wno-target-lifetime -fcheck=all -fbacktrace
> -ffpe-trap=invalid,zero,overflow -frecursive -ffpe-summary=none
> -fall-intrinsics -fbounds-check -cpp
> -DGITHASH_PP=\"FDS-6.8.0-556-g04d5df7-dirty-FireX\" -DGITDATE_PP=\""Mon Aug
> 14 17:07:20 2023 -0400\"" -DBUILDDATE_PP=\""Aug 14, 2023 17:34:36\""
> -DCOMPVER_PP=\""Gnu gfortran 11.2.1"\" *-DWITH_PETSC*
> -I"/home/mnv/Software/petsc/include/"
> -I"/home/mnv/Software/petsc/arch-linux-c-dbg/include" ../../Source/cons.f90
> mpifort -c -m64 -O0 -std=f2018 -ggdb -Wall -Wunused-parameter
> -Wcharacter-truncation -Wno-target-lifetime -fcheck=all -fbacktrace
> -ffpe-trap=invalid,zero,overflow -frecursive -ffpe-summary=none
> -fall-intrinsics -fbounds-check -cpp
> -DGITHASH_PP=\"FDS-6.8.0-556-g04d5df7-dirty-FireX\" -DGITDATE_PP=\""Mon Aug
> 14 17:07:20 2023 -0400\"" -DBUILDDATE_PP=\""Aug 14, 2023 17:34:36\""
> -DCOMPVER_PP=\""Gnu gfortran 11.2.1"\" *-DWITH_PETSC*
> -I"/home/mnv/Software/petsc/include/"
> -I"/home/mnv/Software/petsc/arch-linux-c-dbg/include" ../../Source/prop.f90
> mpifort -c -m64 -O0 -std=f2018 -ggdb -Wall -Wunused-parameter
> -Wcharacter-truncation -Wno-target-lifetime -fcheck=all -fbacktrace
> -ffpe-trap=invalid,zero,overflow -frecursive -ffpe-summary=none
> -fall-intrinsics -fbounds-check -cpp
> -DGITHASH_PP=\"FDS-6.8.0-556-g04d5df7-dirty-FireX\" -DGITDATE_PP=\""Mon Aug
> 14 17:07:20 2023 -0400\"" -DBUILDDATE_PP=\""Aug 14, 2023 17:34:36\""
> -DCOMPVER_PP=\""Gnu gfortran 11.2.1"\" *-DWITH_PETSC*
> -I"/home/mnv/Software/petsc/include/"
> -I"/home/mnv/Software/petsc/arch-linux-c-dbg/include" ../../Source/devc.f90
> ...
> ...
>
> If you are compiling on a Power9 node you might come across this error
> right off the bat:
>
> ../../Source/prec.f90:34:8:
>
> 34 | REAL(QB), PARAMETER :: TWO_EPSILON_QB=2._QB*EPSILON(1._QB) !< A
> very small number 16 byte accuracy
> | 1
> Error: Kind -3 not supported for type REAL at (1)
>
> which means that, for some reason, gcc on the Power9 does not accept the
> quad precision definition written this way. A workaround is to add the
> intrinsic Fortran 2008 module iso_fortran_env:
>
> use, intrinsic :: iso_fortran_env
>
> in the fds/Source/prec.f90 file and change the quad precision kind
> parameter to:
>
> INTEGER, PARAMETER :: QB = REAL128
>
> in there. We are investigating why this happens. It is unrelated to the
> PETSc part of the code: everything passed to PETSc calls is integers and
> double precision reals.
>
> After the code compiles you get the executable in
> ~/fds_dir/fds/Build/ompi_gnu_linux_db/fds_ompi_gnu_linux_db
>
> With which you can run the attached 2 mesh case as:
>
> $ mpirun -n 2 ~/fds_dir/fds/Build/ompi_gnu_linux_db/fds_ompi_gnu_linux_db
> test.fds -log_view
>
> and change the PETSc ksp, pc runtime flags, etc. The default is PCG + HYPRE,
> which is what I was testing on the CPU. This is the result I get from the
> previous command in an interactive job on Enki (batch submissions, and
> runs with gmres ksp and gamg pc, give similar results):
>
>
> Starting FDS ...
>
> MPI Process 1 started on enki11.adlp
> MPI Process 0 started on enki11.adlp
>
> Reading FDS input file ...
>
> WARNING: SPEC REAC_FUEL is not in the table of pre-defined species. Any
> unassigned SPEC variables in the input were assigned the properties of
> nitrogen.
> At line 3014 of file ../../Source/read.f90
> Fortran runtime warning: An array temporary was created
> At line 3014 of file ../../Source/read.f90
> Fortran runtime warning: An array temporary was created
> At line 3461 of file ../../Source/read.f90
> Fortran runtime warning: An array temporary was created
> At line 3461 of file ../../Source/read.f90
> Fortran runtime warning: An array temporary was created
> WARNING: DEVC Device is not within any mesh.
>
> Fire Dynamics Simulator
>
> Current Date : August 14, 2023 17:26:22
> Revision : FDS6.7.0-11263-g04d5df7-dirty-FireX
> Revision Date : Mon Aug 14 17:07:20 2023 -0400
> Compiler : Gnu gfortran 11.2.1
> Compilation Date : Aug 14, 2023 17:11:05
>
> MPI Enabled; Number of MPI Processes: 2
> OpenMP Enabled; Number of OpenMP Threads: 1
>
> MPI version: 3.1
> MPI library version: Open MPI v4.1.4, package: Open MPI xng4 at enki01.adlp
> Distribution, ident: 4.1.4, repo rev: v4.1.4, May 26, 2022
>
> Job TITLE :
> Job ID string : test
>
> terminate called after throwing an instance of
> 'thrust::system::system_error'
> terminate called after throwing an instance of
> 'thrust::system::system_error'
> what(): parallel_for failed: cudaErrorInvalidConfiguration: invalid
> configuration argument
> what(): parallel_for failed: cudaErrorInvalidConfiguration: invalid
> configuration argument
>
> Program received signal SIGABRT: Process abort signal.
>
> Backtrace for this error:
>
> Program received signal SIGABRT: Process abort signal.
>
> Backtrace for this error:
> #0 0x2000397fcd8f in ???
> #1 0x2000397fb657 in ???
> #2 0x2000000604d7 in ???
> #3 0x200039cb9628 in ???
> #0 0x2000397fcd8f in ???
> #1 0x2000397fb657 in ???
> #2 0x2000000604d7 in ???
> #3 0x200039cb9628 in ???
> #4 0x200039c93eb3 in ???
> #5 0x200039364a97 in ???
> #4 0x200039c93eb3 in ???
> #5 0x200039364a97 in ???
> #6 0x20003935f6d3 in ???
> #7 0x20003935f78f in ???
> #8 0x20003935fc6b in ???
> #6 0x20003935f6d3 in ???
> #7 0x20003935f78f in ???
> #8 0x20003935fc6b in ???
> #9 0x11ec67db in _ZN6thrust8cuda_cub14throw_on_errorE9cudaErrorPKc
> at /usr/local/cuda-11.7/include/thrust/system/cuda/detail/util.h:225
> #10 0x11ec67db in
> _ZN6thrust8cuda_cub20uninitialized_fill_nINS0_3tagENS_10device_ptrIiEEmiEET0_RNS0_16execution_policyIT_EES5_T1_RKT2_
> at
> /usr/local/cuda-11.7/include/thrust/system/cuda/detail/uninitialized_fill.h:88
> #11 0x11efc7e3 in
> _ZN6thrust20uninitialized_fill_nINS_8cuda_cub3tagENS_10device_ptrIiEEmiEET0_RKNS_6detail21execution_policy_baseIT_EES5_T1_RKT2_
> at /usr/local/cuda-11.7/include/thrust/detail/uninitialized_fill.inl:55
> #9 0x11ec67db in _ZN6thrust8cuda_cub14throw_on_errorE9cudaErrorPKc
> at /usr/local/cuda-11.7/include/thrust/system/cuda/detail/util.h:225
> #10 0x11ec67db in
> _ZN6thrust8cuda_cub20uninitialized_fill_nINS0_3tagENS_10device_ptrIiEEmiEET0_RNS0_16execution_policyIT_EES5_T1_RKT2_
> at
> /usr/local/cuda-11.7/include/thrust/system/cuda/detail/uninitialized_fill.h:88
> #11 0x11efc7e3 in
> _ZN6thrust20uninitialized_fill_nINS_8cuda_cub3tagENS_10device_ptrIiEEmiEET0_RKNS_6detail21execution_policy_baseIT_EES5_T1_RKT2_
> at /usr/local/cuda-11.7/include/thrust/detail/uninitialized_fill.inl:55
> #12 0x11efc7e3 in
> _ZN6thrust6detail23allocator_traits_detail23default_construct_rangeINS_16device_allocatorIiEENS_10device_ptrIiEEmEENS0_10disable_ifIXsrNS1_37needs_default_construct_via_allocatorIT_NS0_15pointer_elementIT0_E4typeEEE5valueEvE4typeERS9_SB_T1_
> at
> /usr/local/cuda-11.7/include/thrust/detail/allocator/default_construct_range.inl:93
> #12 0x11efc7e3 in
> _ZN6thrust6detail23allocator_traits_detail23default_construct_rangeINS_16device_allocatorIiEENS_10device_ptrIiEEmEENS0_10disable_ifIXsrNS1_37needs_default_construct_via_allocatorIT_NS0_15pointer_elementIT0_E4typeEEE5valueEvE4typeERS9_SB_T1_
> at
> /usr/local/cuda-11.7/include/thrust/detail/allocator/default_construct_range.inl:93
> #13 0x11efc7e3 in
> _ZN6thrust6detail23default_construct_rangeINS_16device_allocatorIiEENS_10device_ptrIiEEmEEvRT_T0_T1_
> at
> /usr/local/cuda-11.7/include/thrust/detail/allocator/default_construct_range.inl:104
> #14 0x11efc7e3 in
> _ZN6thrust6detail18contiguous_storageIiNS_16device_allocatorIiEEE19default_construct_nENS0_15normal_iteratorINS_10device_ptrIiEEEEm
> at /usr/local/cuda-11.7/include/thrust/detail/contiguous_storage.inl:254
> #15 0x11efc7e3 in
> _ZN6thrust6detail11vector_baseIiNS_16device_allocatorIiEEE12default_initEm
> at /usr/local/cuda-11.7/include/thrust/detail/vector_base.inl:220
> #16 0x11efc7e3 in
> _ZN6thrust6detail11vector_baseIiNS_16device_allocatorIiEEE12default_initEm
> at /usr/local/cuda-11.7/include/thrust/detail/vector_base.inl:213
> #17 0x11efc7e3 in
> _ZN6thrust6detail11vector_baseIiNS_16device_allocatorIiEEEC2Em
> at /usr/local/cuda-11.7/include/thrust/detail/vector_base.inl:65
> #13 0x11efc7e3 in
> _ZN6thrust6detail23default_construct_rangeINS_16device_allocatorIiEENS_10device_ptrIiEEmEEvRT_T0_T1_
> at
> /usr/local/cuda-11.7/include/thrust/detail/allocator/default_construct_range.inl:104
> #14 0x11efc7e3 in
> _ZN6thrust6detail18contiguous_storageIiNS_16device_allocatorIiEEE19default_construct_nENS0_15normal_iteratorINS_10device_ptrIiEEEEm
> at /usr/local/cuda-11.7/include/thrust/detail/contiguous_storage.inl:254
> #15 0x11efc7e3 in
> _ZN6thrust6detail11vector_baseIiNS_16device_allocatorIiEEE12default_initEm
> at /usr/local/cuda-11.7/include/thrust/detail/vector_base.inl:220
> #16 0x11efc7e3 in
> _ZN6thrust6detail11vector_baseIiNS_16device_allocatorIiEEE12default_initEm
> at /usr/local/cuda-11.7/include/thrust/detail/vector_base.inl:213
> #17 0x11efc7e3 in
> _ZN6thrust6detail11vector_baseIiNS_16device_allocatorIiEEEC2Em
> at /usr/local/cuda-11.7/include/thrust/detail/vector_base.inl:65
> #18 0x11eda3c7 in
> _ZN6thrust13device_vectorIiNS_16device_allocatorIiEEEC4Em
> at /usr/local/cuda-11.7/include/thrust/device_vector.h:88
> *#19 0x11eda3c7 in MatSeqAIJCUSPARSECopyToGPU*
> at /home/mnv/Software/petsc/src/mat/impls/aij/seq/seqcusparse/
> aijcusparse.cu:2488
> *#20 0x11edc6b7 in MatSetPreallocationCOO_SeqAIJCUSPARSE*
> at /home/mnv/Software/petsc/src/mat/impls/aij/seq/seqcusparse/
> aijcusparse.cu:4300
> #18 0x11eda3c7 in
> _ZN6thrust13device_vectorIiNS_16device_allocatorIiEEEC4Em
> at /usr/local/cuda-11.7/include/thrust/device_vector.h:88
> #*19 0x11eda3c7 in MatSeqAIJCUSPARSECopyToGPU*
> at /home/mnv/Software/petsc/src/mat/impls/aij/seq/seqcusparse/
> aijcusparse.cu:2488
> *#20 0x11edc6b7 in MatSetPreallocationCOO_SeqAIJCUSPARSE*
> at /home/mnv/Software/petsc/src/mat/impls/aij/seq/seqcusparse/
> aijcusparse.cu:4300
> #21 0x11e91bc7 in MatSetPreallocationCOO
> at /home/mnv/Software/petsc/src/mat/utils/gcreate.c:650
> #21 0x11e91bc7 in MatSetPreallocationCOO
> at /home/mnv/Software/petsc/src/mat/utils/gcreate.c:650
> #22 0x1316d5ab in MatConvert_AIJ_HYPRE
> at /home/mnv/Software/petsc/src/mat/impls/hypre/mhypre.c:648
> #22 0x1316d5ab in MatConvert_AIJ_HYPRE
> at /home/mnv/Software/petsc/src/mat/impls/hypre/mhypre.c:648
> #23 0x11e3b463 in MatConvert
> at /home/mnv/Software/petsc/src/mat/interface/matrix.c:4428
> #23 0x11e3b463 in MatConvert
> at /home/mnv/Software/petsc/src/mat/interface/matrix.c:4428
> #24 0x14072213 in PCSetUp_HYPRE
> at /home/mnv/Software/petsc/src/ksp/pc/impls/hypre/hypre.c:254
> #24 0x14072213 in PCSetUp_HYPRE
> at /home/mnv/Software/petsc/src/ksp/pc/impls/hypre/hypre.c:254
> #25 0x1276a9db in PCSetUp
> at /home/mnv/Software/petsc/src/ksp/pc/interface/precon.c:1069
> #25 0x1276a9db in PCSetUp
> at /home/mnv/Software/petsc/src/ksp/pc/interface/precon.c:1069
> #26 0x127d923b in KSPSetUp
> at /home/mnv/Software/petsc/src/ksp/ksp/interface/itfunc.c:415
> #27 0x127e033f in KSPSolve_Private
> #26 0x127d923b in KSPSetUp
> at /home/mnv/Software/petsc/src/ksp/ksp/interface/itfunc.c:415
> #27 0x127e033f in KSPSolve_Private
> at /home/mnv/Software/petsc/src/ksp/ksp/interface/itfunc.c:836
> at /home/mnv/Software/petsc/src/ksp/ksp/interface/itfunc.c:836
> #28 0x127e6f07 in KSPSolve
> at /home/mnv/Software/petsc/src/ksp/ksp/interface/itfunc.c:1082
> #28 0x127e6f07 in KSPSolve
> at /home/mnv/Software/petsc/src/ksp/ksp/interface/itfunc.c:1082
> #29 0x1280d70b in kspsolve_
> at
> /home/mnv/Software/petsc/arch-linux-c-dbg/src/ksp/ksp/interface/ftn-auto/itfuncf.c:335
> #29 0x1280d70b in kspsolve_
> at
> /home/mnv/Software/petsc/arch-linux-c-dbg/src/ksp/ksp/interface/ftn-auto/itfuncf.c:335
> #30 0x1140858f in __globmat_solver_MOD_glmat_solver
> at ../../Source/pres.f90:3130
> #30 0x1140858f in __globmat_solver_MOD_glmat_solver
> at ../../Source/pres.f90:3130
> #31 0x119faddf in pressure_iteration_scheme
> at ../../Source/main.f90:1449
> #32 0x1196c15f in fds
> at ../../Source/main.f90:688
> #31 0x119faddf in pressure_iteration_scheme
> at ../../Source/main.f90:1449
> #32 0x1196c15f in fds
> at ../../Source/main.f90:688
> #33 0x11a126f3 in main
> at ../../Source/main.f90:6
> #33 0x11a126f3 in main
> at ../../Source/main.f90:6
> --------------------------------------------------------------------------
> Primary job terminated normally, but 1 process returned
> a non-zero exit code. Per user-direction, the job has been aborted.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that process rank 1 with PID 3028180 on node enki11 exited
> on signal 6 (Aborted).
> --------------------------------------------------------------------------
>
> It seems the issue stems from the call to KSPSOLVE at line 3130 of
> fds/Source/pres.f90.
>
> Well, thank you for taking the time to look at this. Also, let me know
> if these threads should be moved to the issue tracker or another venue.
> Best,
> Marcos
>
>
>
>
> ------------------------------
> *From:* Junchao Zhang <junchao.zhang at gmail.com>
> *Sent:* Monday, August 14, 2023 4:37 PM
> *To:* Vanella, Marcos (Fed) <marcos.vanella at nist.gov>; PETSc users list <
> petsc-users at mcs.anl.gov>
> *Subject:* Re: [petsc-users] CUDA error trying to run a job with two mpi
> processes and 1 GPU
>
> I don't see a problem in the matrix assembly.
> If you point me to your repo and show me how to build it, I can try to
> reproduce.
>
> --Junchao Zhang
>
>
> On Mon, Aug 14, 2023 at 2:53 PM Vanella, Marcos (Fed) <
> marcos.vanella at nist.gov> wrote:
>
> Hi Junchao, I've tried my case with -ksp_type gmres and -pc_type asm,
> together with -mat_type aijcusparse -sub_pc_factor_mat_solver_type cusparse,
> as (I understand) is done in ex60. The error is always the same, so it
> does not seem to be related to the ksp or pc choice. Instead, it seems to
> happen when trying to offload the matrix to the GPU:
>
> terminate called after throwing an instance of
> 'thrust::system::system_error'
> terminate called after throwing an instance of
> 'thrust::system::system_error'
> what(): parallel_for failed: cudaErrorInvalidConfiguration: invalid
> configuration argument
> what(): parallel_for failed: cudaErrorInvalidConfiguration: invalid
> configuration argument
>
> Program received signal SIGABRT: Process abort signal.
>
> Backtrace for this error:
>
> Program received signal SIGABRT: Process abort signal.
>
> Backtrace for this error:
> #0 0x2000397fcd8f in ???
> ...
> #8 0x20003935fc6b in ???
> #9 0x11ec769b in _ZN6thrust8cuda_cub14throw_on_errorE9cudaErrorPKc
> at /usr/local/cuda-11.7/include/thrust/system/cuda/detail/util.h:225
> #10 0x11ec769b in
> _ZN6thrust8cuda_cub20uninitialized_fill_nINS0_3tagENS_10device_ptrIiEEmiEET0_RNS0_16execution_policyIT_EES5_T1_RKT2_
> at
> /usr/local/cuda-11.7/include/thrust/system/cuda/detail/uninitialized_fill.h:88
> #11 0x11efd6a3 in
> _ZN6thrust20uninitialized_fill_nINS_8cuda_cub3tagENS_10device_ptrIiEEmiEET0_RKNS_6detail21execution_policy_baseIT_EES5_T1_RKT2_
> #9 0x11ec769b in _ZN6thrust8cuda_cub14throw_on_errorE9cudaErrorPKc
> at /usr/local/cuda-11.7/include/thrust/system/cuda/detail/util.h:225
> #10 0x11ec769b in
> _ZN6thrust8cuda_cub20uninitialized_fill_nINS0_3tagENS_10device_ptrIiEEmiEET0_RNS0_16execution_policyIT_EES5_T1_RKT2_
> at
> /usr/local/cuda-11.7/include/thrust/system/cuda/detail/uninitialized_fill.h:88
> #11 0x11efd6a3 in
> _ZN6thrust20uninitialized_fill_nINS_8cuda_cub3tagENS_10device_ptrIiEEmiEET0_RKNS_6detail21execution_policy_baseIT_EES5_T1_RKT2_
> at /usr/local/cuda-11.7/include/thrust/detail/uninitialized_fill.inl:55
> #12 0x11efd6a3 in
> _ZN6thrust6detail23allocator_traits_detail23default_construct_rangeINS_16device_allocatorIiEENS_10device_ptrIiEEmEENS0_10disable_ifIXsrNS1_37needs_default_construct_via_allocatorIT_NS0_15pointer_elementIT0_E4typeEEE5valueEvE4typeERS9_SB_T1_
> at
> /usr/local/cuda-11.7/include/thrust/detail/allocator/default_construct_range.inl:93
> at /usr/local/cuda-11.7/include/thrust/detail/uninitialized_fill.inl:55
> #12 0x11efd6a3 in
> _ZN6thrust6detail23allocator_traits_detail23default_construct_rangeINS_16device_allocatorIiEENS_10device_ptrIiEEmEENS0_10disable_ifIXsrNS1_37needs_default_construct_via_allocatorIT_NS0_15pointer_elementIT0_E4typeEEE5valueEvE4typeERS9_SB_T1_
> at
> /usr/local/cuda-11.7/include/thrust/detail/allocator/default_construct_range.inl:93
> #13 0x11efd6a3 in
> _ZN6thrust6detail23default_construct_rangeINS_16device_allocatorIiEENS_10device_ptrIiEEmEEvRT_T0_T1_
> at
> /usr/local/cuda-11.7/include/thrust/detail/allocator/default_construct_range.inl:104
> #14 0x11efd6a3 in
> _ZN6thrust6detail18contiguous_storageIiNS_16device_allocatorIiEEE19default_construct_nENS0_15normal_iteratorINS_10device_ptrIiEEEEm
> at /usr/local/cuda-11.7/include/thrust/detail/contiguous_storage.inl:254
> #15 0x11efd6a3 in
> _ZN6thrust6detail11vector_baseIiNS_16device_allocatorIiEEE12default_initEm
> #13 0x11efd6a3 in
> _ZN6thrust6detail23default_construct_rangeINS_16device_allocatorIiEENS_10device_ptrIiEEmEEvRT_T0_T1_
> at
> /usr/local/cuda-11.7/include/thrust/detail/allocator/default_construct_range.inl:104
> #14 0x11efd6a3 in
> _ZN6thrust6detail18contiguous_storageIiNS_16device_allocatorIiEEE19default_construct_nENS0_15normal_iteratorINS_10device_ptrIiEEEEm
> at /usr/local/cuda-11.7/include/thrust/detail/contiguous_storage.inl:254
> #15 0x11efd6a3 in
> _ZN6thrust6detail11vector_baseIiNS_16device_allocatorIiEEE12default_initEm
> at /usr/local/cuda-11.7/include/thrust/detail/vector_base.inl:220
> at /usr/local/cuda-11.7/include/thrust/detail/vector_base.inl:220
> #16 0x11efd6a3 in
> _ZN6thrust6detail11vector_baseIiNS_16device_allocatorIiEEE12default_initEm
> at /usr/local/cuda-11.7/include/thrust/detail/vector_base.inl:213
> #17 0x11efd6a3 in
> _ZN6thrust6detail11vector_baseIiNS_16device_allocatorIiEEEC2Em
> at /usr/local/cuda-11.7/include/thrust/detail/vector_base.inl:65
> #18 0x11edb287 in
> _ZN6thrust13device_vectorIiNS_16device_allocatorIiEEEC4Em
> at /usr/local/cuda-11.7/include/thrust/device_vector.h:88
> #19 0x11edb287 in *MatSeqAIJCUSPARSECopyToGPU*
> at /home/mnv/Software/petsc/src/mat/impls/aij/seq/seqcusparse/
> aijcusparse.cu:2488
> #20 0x11edfd1b in *MatSeqAIJCUSPARSEGetIJ*
> ...
> ...
>
> This is the piece of fortran code I have doing this within my Poisson
> solver:
>
> ! Create parallel PETSc sparse matrix for this ZSL: set diag/off-diag
> ! block nonzeros per row to 7.
> CALL MATCREATEAIJ(MPI_COMM_WORLD,ZSL%NUNKH_LOCAL,ZSL%NUNKH_LOCAL,&
>                   ZSL%NUNKH_TOTAL,ZSL%NUNKH_TOTAL,&
>                   7,PETSC_NULL_INTEGER,7,PETSC_NULL_INTEGER,&
>                   ZSL%PETSC_ZS%A_H,PETSC_IERR)
> CALL MATSETFROMOPTIONS(ZSL%PETSC_ZS%A_H,PETSC_IERR)
> DO IROW=1,ZSL%NUNKH_LOCAL
>    DO JCOL=1,ZSL%NNZ_D_MAT_H(IROW)
>       ! PETSc expects zero-based indices. Arguments: 1 row, global I
>       ! position (zero based), 1 column, global J position (zero based).
>       CALL MATSETVALUES(ZSL%PETSC_ZS%A_H,1,ZSL%UNKH_IND(NM_START)+IROW-1,&
>                         1,ZSL%JD_MAT_H(JCOL,IROW)-1,&
>                         ZSL%D_MAT_H(JCOL,IROW),INSERT_VALUES,PETSC_IERR)
>    ENDDO
> ENDDO
> CALL MATASSEMBLYBEGIN(ZSL%PETSC_ZS%A_H, MAT_FINAL_ASSEMBLY, PETSC_IERR)
> CALL MATASSEMBLYEND(ZSL%PETSC_ZS%A_H, MAT_FINAL_ASSEMBLY, PETSC_IERR)
>
> Note that I allocate d_nz=7 and o_nz=7 per row (more than enough size),
> and add nonzero values one by one. I wonder if there is something related
> to this that the copying to GPU does not like.
> Thanks,
> Marcos
>
> ------------------------------
> *From:* Junchao Zhang <junchao.zhang at gmail.com>
> *Sent:* Monday, August 14, 2023 3:24 PM
> *To:* Vanella, Marcos (Fed) <marcos.vanella at nist.gov>
> *Cc:* PETSc users list <petsc-users at mcs.anl.gov>; Satish Balay <
> balay at mcs.anl.gov>
> *Subject:* Re: [petsc-users] CUDA error trying to run a job with two mpi
> processes and 1 GPU
>
> Yeah, it looks like ex60 was run correctly.
> Double check your code again and if you still run into errors, we can try
> to reproduce on our end.
>
> Thanks.
> --Junchao Zhang
>
>
> On Mon, Aug 14, 2023 at 1:05 PM Vanella, Marcos (Fed) <
> marcos.vanella at nist.gov> wrote:
>
> Hi Junchao, I compiled and ran ex60 through Slurm on our Enki system. The
> batch script for the Slurm submission, ex60.log and GPU stats files are
> attached.
> Nothing stands out as wrong to me, but please have a look.
> I'll revisit running the original 2 MPI process + 1 GPU Poisson problem.
> Thanks!
> Marcos
> ------------------------------
> *From:* Junchao Zhang <junchao.zhang at gmail.com>
> *Sent:* Friday, August 11, 2023 5:52 PM
> *To:* Vanella, Marcos (Fed) <marcos.vanella at nist.gov>
> *Cc:* PETSc users list <petsc-users at mcs.anl.gov>; Satish Balay <
> balay at mcs.anl.gov>
> *Subject:* Re: [petsc-users] CUDA error trying to run a job with two mpi
> processes and 1 GPU
>
> Before digging into the details, could you try to run
> src/ksp/ksp/tests/ex60.c to make sure the environment is ok.
>
> The comment at the end shows how to run it
> test:
> requires: cuda
> suffix: 1_cuda
> nsize: 4
> args: -ksp_view -mat_type aijcusparse -sub_pc_factor_mat_solver_type
> cusparse
>
> --Junchao Zhang
>
>
> On Fri, Aug 11, 2023 at 4:36 PM Vanella, Marcos (Fed) <
> marcos.vanella at nist.gov> wrote:
>
> Hi Junchao, thank you for the info. I compiled the main branch of PETSc on
> another machine that has the openmpi/4.1.4/gcc-11.2.1-cuda-11.7 toolchain
> and don't see the Fortran compilation error. It might have been related to
> gcc-9.3.
> I tried the case again with 2 CPUs and one GPU, and now get this error:
>
> terminate called after throwing an instance of
> 'thrust::system::system_error'
> terminate called after throwing an instance of
> 'thrust::system::system_error'
> what(): parallel_for failed: cudaErrorInvalidConfiguration: invalid
> configuration argument
> what(): parallel_for failed: cudaErrorInvalidConfiguration: invalid
> configuration argument
>
> Program received signal SIGABRT: Process abort signal.
>
> Backtrace for this error:
>
> Program received signal SIGABRT: Process abort signal.
>
> Backtrace for this error:
> #0 0x2000397fcd8f in ???
> #1 0x2000397fb657 in ???
> #0 0x2000397fcd8f in ???
> #1 0x2000397fb657 in ???
> #2 0x2000000604d7 in ???
> #2 0x2000000604d7 in ???
> #3 0x200039cb9628 in ???
> #4 0x200039c93eb3 in ???
> #5 0x200039364a97 in ???
> #6 0x20003935f6d3 in ???
> #7 0x20003935f78f in ???
> #8 0x20003935fc6b in ???
> #3 0x200039cb9628 in ???
> #4 0x200039c93eb3 in ???
> #5 0x200039364a97 in ???
> #6 0x20003935f6d3 in ???
> #7 0x20003935f78f in ???
> #8 0x20003935fc6b in ???
> #9 0x11ec425b in _ZN6thrust8cuda_cub14throw_on_errorE9cudaErrorPKc
> at /usr/local/cuda-11.7/include/thrust/system/cuda/detail/util.h:225
> #10 0x11ec425b in
> _ZN6thrust8cuda_cub20uninitialized_fill_nINS0_3tagENS_10device_ptrIiEEmiEET0_RNS0_16execution_policyIT_EES5_T1_RKT2_
> #9 0x11ec425b in _ZN6thrust8cuda_cub14throw_on_errorE9cudaErrorPKc
> at /usr/local/cuda-11.7/include/thrust/system/cuda/detail/util.h:225
> #10 0x11ec425b in
> _ZN6thrust8cuda_cub20uninitialized_fill_nINS0_3tagENS_10device_ptrIiEEmiEET0_RNS0_16execution_policyIT_EES5_T1_RKT2_
> at
> /usr/local/cuda-11.7/include/thrust/system/cuda/detail/uninitialized_fill.h:88
> #11 0x11efa263 in
> _ZN6thrust20uninitialized_fill_nINS_8cuda_cub3tagENS_10device_ptrIiEEmiEET0_RKNS_6detail21execution_policy_baseIT_EES5_T1_RKT2_
> at
> /usr/local/cuda-11.7/include/thrust/system/cuda/detail/uninitialized_fill.h:88
> #11 0x11efa263 in
> _ZN6thrust20uninitialized_fill_nINS_8cuda_cub3tagENS_10device_ptrIiEEmiEET0_RKNS_6detail21execution_policy_baseIT_EES5_T1_RKT2_
> at /usr/local/cuda-11.7/include/thrust/detail/uninitialized_fill.inl:55
> #12 0x11efa263 in
> _ZN6thrust6detail23allocator_traits_detail23default_construct_rangeINS_16device_allocatorIiEENS_10device_ptrIiEEmEENS0_10disable_ifIXsrNS1_37needs_default_construct_via_allocatorIT_NS0_15pointer_elementIT0_E4typeEEE5valueEvE4typeERS9_SB_T1_
> at
> /usr/local/cuda-11.7/include/thrust/detail/allocator/default_construct_range.inl:93
> #13 0x11efa263 in
> _ZN6thrust6detail23default_construct_rangeINS_16device_allocatorIiEENS_10device_ptrIiEEmEEvRT_T0_T1_
> at
> /usr/local/cuda-11.7/include/thrust/detail/allocator/default_construct_range.inl:104
> at /usr/local/cuda-11.7/include/thrust/detail/uninitialized_fill.inl:55
> #12 0x11efa263 in
> _ZN6thrust6detail23allocator_traits_detail23default_construct_rangeINS_16device_allocatorIiEENS_10device_ptrIiEEmEENS0_10disable_ifIXsrNS1_37needs_default_construct_via_allocatorIT_NS0_15pointer_elementIT0_E4typeEEE5valueEvE4typeERS9_SB_T1_
> at
> /usr/local/cuda-11.7/include/thrust/detail/allocator/default_construct_range.inl:93
> #13 0x11efa263 in
> _ZN6thrust6detail23default_construct_rangeINS_16device_allocatorIiEENS_10device_ptrIiEEmEEvRT_T0_T1_
> at
> /usr/local/cuda-11.7/include/thrust/detail/allocator/default_construct_range.inl:104
> #14 0x11efa263 in
> _ZN6thrust6detail18contiguous_storageIiNS_16device_allocatorIiEEE19default_construct_nENS0_15normal_iteratorINS_10device_ptrIiEEEEm
> at /usr/local/cuda-11.7/include/thrust/detail/contiguous_storage.inl:254
> #15 0x11efa263 in
> _ZN6thrust6detail11vector_baseIiNS_16device_allocatorIiEEE12default_initEm
> at /usr/local/cuda-11.7/include/thrust/detail/vector_base.inl:220
> #14 0x11efa263 in
> _ZN6thrust6detail18contiguous_storageIiNS_16device_allocatorIiEEE19default_construct_nENS0_15normal_iteratorINS_10device_ptrIiEEEEm
> at /usr/local/cuda-11.7/include/thrust/detail/contiguous_storage.inl:254
> #15 0x11efa263 in
> _ZN6thrust6detail11vector_baseIiNS_16device_allocatorIiEEE12default_initEm
> at /usr/local/cuda-11.7/include/thrust/detail/vector_base.inl:220
> #16 0x11efa263 in
> _ZN6thrust6detail11vector_baseIiNS_16device_allocatorIiEEE12default_initEm
> at /usr/local/cuda-11.7/include/thrust/detail/vector_base.inl:213
> #17 0x11efa263 in
> _ZN6thrust6detail11vector_baseIiNS_16device_allocatorIiEEEC2Em
> at /usr/local/cuda-11.7/include/thrust/detail/vector_base.inl:65
> #18 0x11ed7e47 in
> _ZN6thrust13device_vectorIiNS_16device_allocatorIiEEEC4Em
> at /usr/local/cuda-11.7/include/thrust/device_vector.h:88
> #19 0x11ed7e47 in MatSeqAIJCUSPARSECopyToGPU
> at /home/mnv/Software/petsc/src/mat/impls/aij/seq/seqcusparse/
> aijcusparse.cu:2488
> #20 0x11eef623 in MatSeqAIJCUSPARSEMergeMats
> at /home/mnv/Software/petsc/src/mat/impls/aij/seq/seqcusparse/
> aijcusparse.cu:4696
> #16 0x11efa263 in
> _ZN6thrust6detail11vector_baseIiNS_16device_allocatorIiEEE12default_initEm
> at /usr/local/cuda-11.7/include/thrust/detail/vector_base.inl:213
> #17 0x11efa263 in
> _ZN6thrust6detail11vector_baseIiNS_16device_allocatorIiEEEC2Em
> at /usr/local/cuda-11.7/include/thrust/detail/vector_base.inl:65
> #18 0x11ed7e47 in
> _ZN6thrust13device_vectorIiNS_16device_allocatorIiEEEC4Em
> at /usr/local/cuda-11.7/include/thrust/device_vector.h:88
> #19 0x11ed7e47 in MatSeqAIJCUSPARSECopyToGPU
> at /home/mnv/Software/petsc/src/mat/impls/aij/seq/seqcusparse/
> aijcusparse.cu:2488
> #20 0x11eef623 in MatSeqAIJCUSPARSEMergeMats
> at /home/mnv/Software/petsc/src/mat/impls/aij/seq/seqcusparse/
> aijcusparse.cu:4696
> #21 0x11f0682b in MatMPIAIJGetLocalMatMerge_MPIAIJCUSPARSE
> at /home/mnv/Software/petsc/src/mat/impls/aij/mpi/mpicusparse/
> mpiaijcusparse.cu:251
> #21 0x11f0682b in MatMPIAIJGetLocalMatMerge_MPIAIJCUSPARSE
> at /home/mnv/Software/petsc/src/mat/impls/aij/mpi/mpicusparse/
> mpiaijcusparse.cu:251
> #22 0x133f141f in MatMPIAIJGetLocalMatMerge
> at /home/mnv/Software/petsc/src/mat/impls/aij/mpi/mpiaij.c:5342
> #22 0x133f141f in MatMPIAIJGetLocalMatMerge
> at /home/mnv/Software/petsc/src/mat/impls/aij/mpi/mpiaij.c:5342
> #23 0x133fe9cb in MatProductSymbolic_MPIAIJBACKEND
> at /home/mnv/Software/petsc/src/mat/impls/aij/mpi/mpiaij.c:7368
> #23 0x133fe9cb in MatProductSymbolic_MPIAIJBACKEND
> at /home/mnv/Software/petsc/src/mat/impls/aij/mpi/mpiaij.c:7368
> #24 0x1377e1df in MatProductSymbolic
> at /home/mnv/Software/petsc/src/mat/interface/matproduct.c:795
> #24 0x1377e1df in MatProductSymbolic
> at /home/mnv/Software/petsc/src/mat/interface/matproduct.c:795
> #25 0x11e4dd1f in MatPtAP
> at /home/mnv/Software/petsc/src/mat/interface/matrix.c:9934
> #25 0x11e4dd1f in MatPtAP
> at /home/mnv/Software/petsc/src/mat/interface/matrix.c:9934
> #26 0x130d792f in MatCoarsenApply_MISK_private
> at /home/mnv/Software/petsc/src/mat/coarsen/impls/misk/misk.c:283
> #26 0x130d792f in MatCoarsenApply_MISK_private
> at /home/mnv/Software/petsc/src/mat/coarsen/impls/misk/misk.c:283
> #27 0x130db89b in MatCoarsenApply_MISK
> at /home/mnv/Software/petsc/src/mat/coarsen/impls/misk/misk.c:368
> #27 0x130db89b in MatCoarsenApply_MISK
> at /home/mnv/Software/petsc/src/mat/coarsen/impls/misk/misk.c:368
> #28 0x130bf5a3 in MatCoarsenApply
> at /home/mnv/Software/petsc/src/mat/coarsen/coarsen.c:97
> #28 0x130bf5a3 in MatCoarsenApply
> at /home/mnv/Software/petsc/src/mat/coarsen/coarsen.c:97
> #29 0x141518ff in PCGAMGCoarsen_AGG
> at /home/mnv/Software/petsc/src/ksp/pc/impls/gamg/agg.c:524
> #29 0x141518ff in PCGAMGCoarsen_AGG
> at /home/mnv/Software/petsc/src/ksp/pc/impls/gamg/agg.c:524
> #30 0x13b3a43f in PCSetUp_GAMG
> at /home/mnv/Software/petsc/src/ksp/pc/impls/gamg/gamg.c:631
> #30 0x13b3a43f in PCSetUp_GAMG
> at /home/mnv/Software/petsc/src/ksp/pc/impls/gamg/gamg.c:631
> #31 0x1276845b in PCSetUp
> at /home/mnv/Software/petsc/src/ksp/pc/interface/precon.c:1069
> #31 0x1276845b in PCSetUp
> at /home/mnv/Software/petsc/src/ksp/pc/interface/precon.c:1069
> #32 0x127d6cbb in KSPSetUp
> at /home/mnv/Software/petsc/src/ksp/ksp/interface/itfunc.c:415
> #32 0x127d6cbb in KSPSetUp
> at /home/mnv/Software/petsc/src/ksp/ksp/interface/itfunc.c:415
> #33 0x127dddbf in KSPSolve_Private
> at /home/mnv/Software/petsc/src/ksp/ksp/interface/itfunc.c:836
> #33 0x127dddbf in KSPSolve_Private
> at /home/mnv/Software/petsc/src/ksp/ksp/interface/itfunc.c:836
> #34 0x127e4987 in KSPSolve
> at /home/mnv/Software/petsc/src/ksp/ksp/interface/itfunc.c:1082
> #34 0x127e4987 in KSPSolve
> at /home/mnv/Software/petsc/src/ksp/ksp/interface/itfunc.c:1082
> #35 0x1280b18b in kspsolve_
> at
> /home/mnv/Software/petsc/arch-linux-c-dbg/src/ksp/ksp/interface/ftn-auto/itfuncf.c:335
> #35 0x1280b18b in kspsolve_
> at
> /home/mnv/Software/petsc/arch-linux-c-dbg/src/ksp/ksp/interface/ftn-auto/itfuncf.c:335
> #36 0x1140945f in __globmat_solver_MOD_glmat_solver
> at ../../Source/pres.f90:3128
> #36 0x1140945f in __globmat_solver_MOD_glmat_solver
> at ../../Source/pres.f90:3128
> #37 0x119f8853 in pressure_iteration_scheme
> at ../../Source/main.f90:1449
> #37 0x119f8853 in pressure_iteration_scheme
> at ../../Source/main.f90:1449
> #38 0x11969bd3 in fds
> at ../../Source/main.f90:688
> #38 0x11969bd3 in fds
> at ../../Source/main.f90:688
> #39 0x11a10167 in main
> at ../../Source/main.f90:6
> #39 0x11a10167 in main
> at ../../Source/main.f90:6
> srun: error: enki12: tasks 0-1: Aborted (core dumped)
>
>
> This was the slurm submission script in this case:
>
> #!/bin/bash
> # ../../Utilities/Scripts/qfds.sh -p 2 -T db -d test.fds
> #SBATCH -J test
> #SBATCH -e /home/mnv/Firemodels_fork/fds/Issues/PETSc/test.err
> #SBATCH -o /home/mnv/Firemodels_fork/fds/Issues/PETSc/test.log
> #SBATCH --partition=debug
> #SBATCH --ntasks=2
> #SBATCH --nodes=1
> #SBATCH --cpus-per-task=1
> #SBATCH --ntasks-per-node=2
> #SBATCH --time=01:00:00
> #SBATCH --gres=gpu:1
>
> export OMP_NUM_THREADS=1
>
> # PETSc dir and arch:
> export PETSC_DIR=/home/mnv/Software/petsc
> export PETSC_ARCH=arch-linux-c-dbg
>
> # SYSTEM name:
> export MYSYSTEM=enki
>
> # modules
> module load cuda/11.7
> module load gcc/11.2.1/toolset
> module load openmpi/4.1.4/gcc-11.2.1-cuda-11.7
>
> cd /home/mnv/Firemodels_fork/fds/Issues/PETSc
> srun -N 1 -n 2 --ntasks-per-node 2 --mpi=pmi2 \
>   /home/mnv/Firemodels_fork/fds/Build/ompi_gnu_linux_db/fds_ompi_gnu_linux_db \
>   test.fds -vec_type mpicuda -mat_type mpiaijcusparse -pc_type gamg
>
> The configure.log for the PETSc build is attached. Another clue to what
> is happening is that even when I set the matrices/vectors to be MPI
> (-vec_type mpi -mat_type mpiaij) and do not request a GPU, I still get a
> GPU error:
>
> [0]PETSC ERROR: --------------------- Error Message
> --------------------------------------------------------------
> [1]PETSC ERROR: --------------------- Error Message
> --------------------------------------------------------------
> [1]PETSC ERROR: GPU error
> [1]PETSC ERROR: Cannot lazily initialize PetscDevice: cuda error 100
> (cudaErrorNoDevice) : no CUDA-capable device is detected
> [1]PETSC ERROR: WARNING! There are unused option(s) set! Could be the
> program crashed before usage or a spelling mistake, etc!
> [0]PETSC ERROR: GPU error
> [0]PETSC ERROR: Cannot lazily initialize PetscDevice: cuda error 100
> (cudaErrorNoDevice) : no CUDA-capable device is detected
> [0]PETSC ERROR: WARNING! There are unused option(s) set! Could be the
> program crashed before usage or a spelling mistake, etc!
> [0]PETSC ERROR: Option left: name:-pc_type value: gamg source: command
> line
> [0]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting.
> [1]PETSC ERROR: Option left: name:-pc_type value: gamg source: command
> line
> [1]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting.
> [1]PETSC ERROR: Petsc Development GIT revision: v3.19.4-946-g590ad0f52ad
> GIT Date: 2023-08-11 15:13:02 +0000
> [0]PETSC ERROR: Petsc Development GIT revision: v3.19.4-946-g590ad0f52ad
> GIT Date: 2023-08-11 15:13:02 +0000
> [0]PETSC ERROR:
> /home/mnv/Firemodels_fork/fds/Build/ompi_gnu_linux_db/fds_ompi_gnu_linux_db
> on a arch-linux-c-dbg named enki11.adlp by mnv Fri Aug 11 17:04:55 2023
> [0]PETSC ERROR: Configure options COPTFLAGS="-g -O2" CXXOPTFLAGS="-g -O2"
> FOPTFLAGS="-g -O2" FCOPTFLAGS="-g -O2" CUDAOPTFLAGS="-g -O2"
> --with-debugging=yes --with-shared-libraries=0 --download-suitesparse
> --download-hypre --download-fblaslapack --with-cuda
> ...
>
> I would have expected not to see GPU errors printed, given that I did
> not request CUDA matrices/vectors. The case ran anyway; I assume it
> defaulted to the CPU solver.
> Let me know if you have any ideas as to what is happening. Thanks,
> Marcos
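
A minimal diagnostic sketch for the "no CUDA-capable device is detected"
message above, assuming the scheduler communicates the GPU assignment to each
task through the CUDA_VISIBLE_DEVICES environment variable (common with Slurm
GRES, but an assumption about this cluster). The function names are
hypothetical; SLURM_PROCID is the task rank that Slurm exports.

```python
import os


def visible_cuda_devices(env):
    """Return the device entries listed in CUDA_VISIBLE_DEVICES.

    None means the variable is unset (the CUDA runtime applies no restriction);
    an empty list means it is set but empty, which hides every device.
    """
    value = env.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return None
    return [item.strip() for item in value.split(",") if item.strip()]


def describe_task(env):
    """Build a one-line summary of what this task has been told it can see."""
    rank = env.get("SLURM_PROCID", "?")
    devices = visible_cuda_devices(env)
    if devices is None:
        return f"task {rank}: CUDA_VISIBLE_DEVICES unset"
    if not devices:
        return f"task {rank}: CUDA_VISIBLE_DEVICES empty, no device visible"
    return f"task {rank}: {len(devices)} device(s) listed: {','.join(devices)}"


if __name__ == "__main__":
    print(describe_task(os.environ))
```

Launched once per task with the same srun options as the real job, it would
show whether both tasks see the single requested GPU. It only reports the
environment; it does not query the CUDA driver.
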
>
>
> ------------------------------
> *From:* Junchao Zhang <junchao.zhang at gmail.com>
> *Sent:* Friday, August 11, 2023 3:35 PM
> *To:* Vanella, Marcos (Fed) <marcos.vanella at nist.gov>; PETSc users list <
> petsc-users at mcs.anl.gov>; Satish Balay <balay at mcs.anl.gov>
> *Subject:* Re: [petsc-users] CUDA error trying to run a job with two mpi
> processes and 1 GPU
>
> Marcos,
> We do not have good petsc/gpu documentation, but see
> https://petsc.org/main/faq/#doc-faq-gpuhowto, and also search "requires:
> cuda" in petsc tests and you will find examples using GPU.
> For the Fortran compile errors, attach your configure.log and Satish
> (Cc'ed) or others should know how to fix them.
>
> Thanks.
> --Junchao Zhang
>
>
> On Fri, Aug 11, 2023 at 2:22 PM Vanella, Marcos (Fed) <
> marcos.vanella at nist.gov> wrote:
>
> Hi Junchao, thanks for the explanation. Is there some development
> documentation on the GPU work? I'm interested in learning about it.
> I checked out the main branch and configured petsc. When compiling with
> gcc/gfortran I come across this error:
>
> ....
> CUDAC
> arch-linux-c-opt/obj/src/mat/impls/aij/seq/seqcusparse/aijcusparse.o
> CUDAC.dep
> arch-linux-c-opt/obj/src/mat/impls/aij/seq/seqcusparse/aijcusparse.o
> FC arch-linux-c-opt/obj/src/ksp/f90-mod/petsckspdefmod.o
> FC arch-linux-c-opt/obj/src/ksp/f90-mod/petscpcmod.o
>
> /home/mnv/Software/petsc/include/../src/ksp/f90-mod/ftn-auto-interfaces/petscpc.h90:37:61:
>
> 37 | subroutine PCASMCreateSubdomains2D(a,b,c,d,e,f,g,h,i,z)
> | 1
> *Error: Symbol ‘pcasmcreatesubdomains2d’ at (1) already has an explicit
> interface*
>
> /home/mnv/Software/petsc/include/../src/ksp/f90-mod/ftn-auto-interfaces/petscpc.h90:38:13:
>
> 38 | import tIS
> | 1
> Error: IMPORT statement at (1) only permitted in an INTERFACE body
>
> /home/mnv/Software/petsc/include/../src/ksp/f90-mod/ftn-auto-interfaces/petscpc.h90:39:80:
>
> 39 | PetscInt a ! PetscInt
> |
> 1
> Error: Unexpected data declaration statement in INTERFACE block at (1)
>
> /home/mnv/Software/petsc/include/../src/ksp/f90-mod/ftn-auto-interfaces/petscpc.h90:40:80:
>
> 40 | PetscInt b ! PetscInt
> |
> 1
> Error: Unexpected data declaration statement in INTERFACE block at (1)
>
> /home/mnv/Software/petsc/include/../src/ksp/f90-mod/ftn-auto-interfaces/petscpc.h90:41:80:
>
> 41 | PetscInt c ! PetscInt
> |
> 1
> Error: Unexpected data declaration statement in INTERFACE block at (1)
>
> /home/mnv/Software/petsc/include/../src/ksp/f90-mod/ftn-auto-interfaces/petscpc.h90:42:80:
>
> 42 | PetscInt d ! PetscInt
> |
> 1
> Error: Unexpected data declaration statement in INTERFACE block at (1)
>
> /home/mnv/Software/petsc/include/../src/ksp/f90-mod/ftn-auto-interfaces/petscpc.h90:43:80:
>
> 43 | PetscInt e ! PetscInt
> |
> 1
> Error: Unexpected data declaration statement in INTERFACE block at (1)
>
> /home/mnv/Software/petsc/include/../src/ksp/f90-mod/ftn-auto-interfaces/petscpc.h90:44:80:
>
> 44 | PetscInt f ! PetscInt
> |
> 1
> Error: Unexpected data declaration statement in INTERFACE block at (1)
>
> /home/mnv/Software/petsc/include/../src/ksp/f90-mod/ftn-auto-interfaces/petscpc.h90:45:80:
>
> 45 | PetscInt g ! PetscInt
> |
> 1
> Error: Unexpected data declaration statement in INTERFACE block at (1)
>
> /home/mnv/Software/petsc/include/../src/ksp/f90-mod/ftn-auto-interfaces/petscpc.h90:46:30:
>
> 46 | IS h ! IS
> | 1
> Error: Unexpected data declaration statement in INTERFACE block at (1)
>
> /home/mnv/Software/petsc/include/../src/ksp/f90-mod/ftn-auto-interfaces/petscpc.h90:47:30:
>
> 47 | IS i ! IS
> | 1
> Error: Unexpected data declaration statement in INTERFACE block at (1)
>
> /home/mnv/Software/petsc/include/../src/ksp/f90-mod/ftn-auto-interfaces/petscpc.h90:48:43:
>
> 48 | PetscErrorCode z
> | 1
> Error: Unexpected data declaration statement in INTERFACE block at (1)
>
> /home/mnv/Software/petsc/include/../src/ksp/f90-mod/ftn-auto-interfaces/petscpc.h90:49:10:
>
> 49 | end subroutine PCASMCreateSubdomains2D
> | 1
> Error: Expecting END INTERFACE statement at (1)
> make[3]: *** [gmakefile:225:
> arch-linux-c-opt/obj/src/ksp/f90-mod/petscpcmod.o] Error 1
> make[3]: *** Waiting for unfinished jobs....
> CC
> arch-linux-c-opt/obj/src/tao/leastsquares/impls/pounders/pounders.o
> CC arch-linux-c-opt/obj/src/ksp/pc/impls/bddc/bddcprivate.o
> CUDAC
> arch-linux-c-opt/obj/src/vec/vec/impls/seq/cupm/cuda/vecseqcupm.o
> CUDAC.dep
> arch-linux-c-opt/obj/src/vec/vec/impls/seq/cupm/cuda/vecseqcupm.o
> make[3]: Leaving directory '/home/mnv/Software/petsc'
> make[2]: *** [/home/mnv/Software/petsc/lib/petsc/conf/rules.doc:28: libs]
> Error 2
> make[2]: Leaving directory '/home/mnv/Software/petsc'
> **************************ERROR*************************************
> Error during compile, check arch-linux-c-opt/lib/petsc/conf/make.log
> Send it and arch-linux-c-opt/lib/petsc/conf/configure.log to
> petsc-maint at mcs.anl.gov
> ********************************************************************
> make[1]: *** [makefile:45: all] Error 1
> make: *** [GNUmakefile:9: all] Error 2
> ------------------------------
> *From:* Junchao Zhang <junchao.zhang at gmail.com>
> *Sent:* Friday, August 11, 2023 3:04 PM
> *To:* Vanella, Marcos (Fed) <marcos.vanella at nist.gov>
> *Cc:* petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>
> *Subject:* Re: [petsc-users] CUDA error trying to run a job with two mpi
> processes and 1 GPU
>
> Hi, Marcos,
> I saw MatSetPreallocationCOO_MPIAIJCUSPARSE_Basic() in the error stack.
> We recently refactored the COO code and got rid of that function. So could
> you try petsc/main?
> We map MPI processes to GPUs in a round-robin fashion. We query the
> number of visible CUDA devices (g), and assign the device (rank%g) to the
> MPI process (rank). In that sense, the work distribution is totally
> determined by your MPI work partition (i.e., by you).
> On clusters, this MPI process to GPU binding is usually done by the job
> scheduler like slurm. You need to check your cluster's users' guide to see
> how to bind MPI processes to GPUs. If the job scheduler has done that, the
> number of visible CUDA devices to a process might just appear to be 1,
> making petsc's own mapping void.
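
The round-robin rule described above can be sketched as follows. This is only
an illustration of the rank % g arithmetic, not PETSc's actual code; the
function name device_for_rank is hypothetical.

```python
def device_for_rank(rank: int, num_visible_devices: int) -> int:
    """Round-robin rule: MPI process `rank` is assigned device rank % g."""
    if num_visible_devices < 1:
        raise ValueError("no visible CUDA device")
    return rank % num_visible_devices


if __name__ == "__main__":
    # g = 1 mirrors the two-process, one-GPU job in this thread:
    # every rank lands on device 0.
    for g in (1, 2):
        mapping = {rank: device_for_rank(rank, g) for rank in range(4)}
        print(f"g={g}: {mapping}")
```

With one visible device, all ranks share device 0, so the split of work across
ranks is decided entirely by the application's MPI partition.
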
>
> Thanks.
> --Junchao Zhang
>
>
> On Fri, Aug 11, 2023 at 12:43 PM Vanella, Marcos (Fed) <
> marcos.vanella at nist.gov> wrote:
>
> Hi Junchao, thank you for replying. I compiled petsc in debug mode and
> this is what I get for the case:
>
> terminate called after throwing an instance of
> 'thrust::system::system_error'
> what(): merge_sort: failed to synchronize: cudaErrorIllegalAddress: an
> illegal memory access was encountered
>
> Program received signal SIGABRT: Process abort signal.
>
> Backtrace for this error:
> #0 0x15264731ead0 in ???
> #1 0x15264731dc35 in ???
> #2 0x15264711551f in ???
> #3 0x152647169a7c in ???
> #4 0x152647115475 in ???
> #5 0x1526470fb7f2 in ???
> #6 0x152647678bbd in ???
> #7 0x15264768424b in ???
> #8 0x1526476842b6 in ???
> #9 0x152647684517 in ???
> #10 0x55bb46342ebb in _ZN6thrust8cuda_cub14throw_on_errorE9cudaErrorPKc
> at /usr/local/cuda/include/thrust/system/cuda/detail/util.h:224
> #11 0x55bb46342ebb in
> _ZN6thrust8cuda_cub12__merge_sort10merge_sortINS_6detail17integral_constantIbLb1EEENS4_IbLb0EEENS0_3tagENS_12zip_iteratorINS_5tupleINS_10device_ptrIiEESB_NS_9null_typeESC_SC_SC_SC_SC_SC_SC_EEEENS3_15normal_iteratorISB_EE9IJCompareEEvRNS0_16execution_policyIT1_EET2_SM_T3_T4_
> at /usr/local/cuda/include/thrust/system/cuda/detail/sort.h:1316
> #12 0x55bb46342ebb in
> _ZN6thrust8cuda_cub12__smart_sort10smart_sortINS_6detail17integral_constantIbLb1EEENS4_IbLb0EEENS0_16execution_policyINS0_3tagEEENS_12zip_iteratorINS_5tupleINS_10device_ptrIiEESD_NS_9null_typeESE_SE_SE_SE_SE_SE_SE_EEEENS3_15normal_iteratorISD_EE9IJCompareEENS1_25enable_if_comparison_sortIT2_T4_E4typeERT1_SL_SL_T3_SM_
> at /usr/local/cuda/include/thrust/system/cuda/detail/sort.h:1544
> #13 0x55bb46342ebb in
> _ZN6thrust8cuda_cub11sort_by_keyINS0_3tagENS_12zip_iteratorINS_5tupleINS_10device_ptrIiEES6_NS_9null_typeES7_S7_S7_S7_S7_S7_S7_EEEENS_6detail15normal_iteratorIS6_EE9IJCompareEEvRNS0_16execution_policyIT_EET0_SI_T1_T2_
> at /usr/local/cuda/include/thrust/system/cuda/detail/sort.h:1669
> #14 0x55bb46317bc5 in
> _ZN6thrust11sort_by_keyINS_8cuda_cub3tagENS_12zip_iteratorINS_5tupleINS_10device_ptrIiEES6_NS_9null_typeES7_S7_S7_S7_S7_S7_S7_EEEENS_6detail15normal_iteratorIS6_EE9IJCompareEEvRKNSA_21execution_policy_baseIT_EET0_SJ_T1_T2_
> at /usr/local/cuda/include/thrust/detail/sort.inl:115
> #15 0x55bb46317bc5 in
> _ZN6thrust11sort_by_keyINS_12zip_iteratorINS_5tupleINS_10device_ptrIiEES4_NS_9null_typeES5_S5_S5_S5_S5_S5_S5_EEEENS_6detail15normal_iteratorIS4_EE9IJCompareEEvT_SC_T0_T1_
> at /usr/local/cuda/include/thrust/detail/sort.inl:305
> #16 0x55bb46317bc5 in MatSetPreallocationCOO_SeqAIJCUSPARSE_Basic
> at /home/mnv/Software/petsc/src/mat/impls/aij/seq/seqcusparse/
> aijcusparse.cu:4452
> #17 0x55bb46c5b27c in MatSetPreallocationCOO_MPIAIJCUSPARSE_Basic
> at /home/mnv/Software/petsc/src/mat/impls/aij/mpi/mpicusparse/
> mpiaijcusparse.cu:173
> #18 0x55bb46c5b27c in MatSetPreallocationCOO_MPIAIJCUSPARSE
> at /home/mnv/Software/petsc/src/mat/impls/aij/mpi/mpicusparse/
> mpiaijcusparse.cu:222
> #19 0x55bb468e01cf in MatSetPreallocationCOO
> at /home/mnv/Software/petsc/src/mat/utils/gcreate.c:606
> #20 0x55bb46b39c9b in MatProductSymbolic_MPIAIJBACKEND
> at /home/mnv/Software/petsc/src/mat/impls/aij/mpi/mpiaij.c:7547
> #21 0x55bb469015e5 in MatProductSymbolic
> at /home/mnv/Software/petsc/src/mat/interface/matproduct.c:803
> #22 0x55bb4694ade2 in MatPtAP
> at /home/mnv/Software/petsc/src/mat/interface/matrix.c:9897
> #23 0x55bb4696d3ec in MatCoarsenApply_MISK_private
> at /home/mnv/Software/petsc/src/mat/coarsen/impls/misk/misk.c:283
> #24 0x55bb4696eb67 in MatCoarsenApply_MISK
> at /home/mnv/Software/petsc/src/mat/coarsen/impls/misk/misk.c:368
> #25 0x55bb4695bd91 in MatCoarsenApply
> at /home/mnv/Software/petsc/src/mat/coarsen/coarsen.c:97
> #26 0x55bb478294d8 in PCGAMGCoarsen_AGG
> at /home/mnv/Software/petsc/src/ksp/pc/impls/gamg/agg.c:524
> #27 0x55bb471d1cb4 in PCSetUp_GAMG
> at /home/mnv/Software/petsc/src/ksp/pc/impls/gamg/gamg.c:631
> #28 0x55bb464022cf in PCSetUp
> at /home/mnv/Software/petsc/src/ksp/pc/interface/precon.c:994
> #29 0x55bb4718b8a7 in KSPSetUp
> at /home/mnv/Software/petsc/src/ksp/ksp/interface/itfunc.c:406
> #30 0x55bb4718f22e in KSPSolve_Private
> at /home/mnv/Software/petsc/src/ksp/ksp/interface/itfunc.c:824
> #31 0x55bb47192c0c in KSPSolve
> at /home/mnv/Software/petsc/src/ksp/ksp/interface/itfunc.c:1070
> #32 0x55bb463efd35 in kspsolve_
> at /home/mnv/Software/petsc/src/ksp/ksp/interface/ftn-auto/itfuncf.c:320
> #33 0x55bb45e94b32 in ???
> #34 0x55bb46048044 in ???
> #35 0x55bb46052ea1 in ???
> #36 0x55bb45ac5f8e in ???
> #37 0x1526470fcd8f in ???
> #38 0x1526470fce3f in ???
> #39 0x55bb45aef55d in ???
> #40 0xffffffffffffffff in ???
> --------------------------------------------------------------------------
> Primary job terminated normally, but 1 process returned
> a non-zero exit code. Per user-direction, the job has been aborted.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that process rank 0 with PID 1771753 on node dgx02 exited
> on signal 6 (Aborted).
> --------------------------------------------------------------------------
>
> BTW, I'm curious. If I set n MPI processes, each of them building a part
> of the linear system, and g GPUs, how does PETSc distribute those n pieces
> of the system matrix and rhs among the g GPUs? Does it use some
> load-balancing algorithm? Where can I read about this?
> Thank you. I can also point you to my code repo on GitHub if you want to
> take a closer look.
>
> Best Regards,
> Marcos
>
> ------------------------------
> *From:* Junchao Zhang <junchao.zhang at gmail.com>
> *Sent:* Friday, August 11, 2023 10:52 AM
> *To:* Vanella, Marcos (Fed) <marcos.vanella at nist.gov>
> *Cc:* petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>
> *Subject:* Re: [petsc-users] CUDA error trying to run a job with two mpi
> processes and 1 GPU
>
> Hi, Marcos,
> Could you build petsc in debug mode and then copy and paste the whole
> error stack message?
>
> Thanks
> --Junchao Zhang
>
>
> On Thu, Aug 10, 2023 at 5:51 PM Vanella, Marcos (Fed) via petsc-users <
> petsc-users at mcs.anl.gov> wrote:
>
> Hi, I'm trying to run a parallel matrix/vector build and linear solution
> with PETSc on 2 MPI processes + one V100 GPU. I verified that the matrix
> build and solution succeed on CPUs only. I'm using CUDA 11.5, CUDA-enabled
> OpenMPI, and gcc 9.3. When I run the job with GPU enabled I get the
> following error:
>
> terminate called after throwing an instance of
> 'thrust::system::system_error'
> *what(): merge_sort: failed to synchronize: cudaErrorIllegalAddress:
> an illegal memory access was encountered*
>
> Program received signal SIGABRT: Process abort signal.
>
> Backtrace for this error:
> terminate called after throwing an instance of
> 'thrust::system::system_error'
> what(): merge_sort: failed to synchronize: cudaErrorIllegalAddress: an
> illegal memory access was encountered
>
> Program received signal SIGABRT: Process abort signal.
>
> I'm new to submitting jobs in slurm that also use GPU resources, so I
> might be doing something wrong in my submission script. This is it:
>
> #!/bin/bash
> #SBATCH -J test
> #SBATCH -e /home/Issues/PETSc/test.err
> #SBATCH -o /home/Issues/PETSc/test.log
> #SBATCH --partition=batch
> #SBATCH --ntasks=2
> #SBATCH --nodes=1
> #SBATCH --cpus-per-task=1
> #SBATCH --ntasks-per-node=2
> #SBATCH --time=01:00:00
> #SBATCH --gres=gpu:1
>
> export OMP_NUM_THREADS=1
> module load cuda/11.5
> module load openmpi/4.1.1
>
> cd /home/Issues/PETSc
> mpirun -n 2 /home/fds/Build/ompi_gnu_linux/fds_ompi_gnu_linux test.fds \
>   -vec_type mpicuda -mat_type mpiaijcusparse -pc_type gamg
>
> If anyone has any suggestions on how to troubleshoot this please let me
> know.
> Thanks!
> Marcos
>
>
>
>