[petsc-users] CUDA error trying to run a job with two mpi processes and 1 GPU
Junchao Zhang
junchao.zhang at gmail.com
Fri Aug 11 14:35:48 CDT 2023
Marcos,
We do not have good petsc/gpu documentation, but see
https://petsc.org/main/faq/#doc-faq-gpuhowto. You can also search for
"requires: cuda" in the petsc tests to find examples that use the GPU.
For the Fortran compile errors, attach your configure.log and Satish
(Cc'ed) or others should know how to fix them.
Thanks.
--Junchao Zhang
On Fri, Aug 11, 2023 at 2:22 PM Vanella, Marcos (Fed) <
marcos.vanella at nist.gov> wrote:
> Hi Junchao, thanks for the explanation. Is there some development
> documentation on the GPU work? I'm interested in learning about it.
> I checked out the main branch and configured petsc. When compiling with
> gcc/gfortran I come across this error:
>
> ....
> CUDAC
> arch-linux-c-opt/obj/src/mat/impls/aij/seq/seqcusparse/aijcusparse.o
> CUDAC.dep
> arch-linux-c-opt/obj/src/mat/impls/aij/seq/seqcusparse/aijcusparse.o
> FC arch-linux-c-opt/obj/src/ksp/f90-mod/petsckspdefmod.o
> FC arch-linux-c-opt/obj/src/ksp/f90-mod/petscpcmod.o
>
> /home/mnv/Software/petsc/include/../src/ksp/f90-mod/ftn-auto-interfaces/petscpc.h90:37:61:
>
> 37 | subroutine PCASMCreateSubdomains2D(a,b,c,d,e,f,g,h,i,z)
> | 1
> *Error: Symbol ‘pcasmcreatesubdomains2d’ at (1) already has an explicit
> interface*
>
> /home/mnv/Software/petsc/include/../src/ksp/f90-mod/ftn-auto-interfaces/petscpc.h90:38:13:
>
> 38 | import tIS
> | 1
> Error: IMPORT statement at (1) only permitted in an INTERFACE body
>
> /home/mnv/Software/petsc/include/../src/ksp/f90-mod/ftn-auto-interfaces/petscpc.h90:39:80:
>
> 39 | PetscInt a ! PetscInt
> |
> 1
> Error: Unexpected data declaration statement in INTERFACE block at (1)
>
> /home/mnv/Software/petsc/include/../src/ksp/f90-mod/ftn-auto-interfaces/petscpc.h90:40:80:
>
> 40 | PetscInt b ! PetscInt
> |
> 1
> Error: Unexpected data declaration statement in INTERFACE block at (1)
>
> /home/mnv/Software/petsc/include/../src/ksp/f90-mod/ftn-auto-interfaces/petscpc.h90:41:80:
>
> 41 | PetscInt c ! PetscInt
> |
> 1
> Error: Unexpected data declaration statement in INTERFACE block at (1)
>
> /home/mnv/Software/petsc/include/../src/ksp/f90-mod/ftn-auto-interfaces/petscpc.h90:42:80:
>
> 42 | PetscInt d ! PetscInt
> |
> 1
> Error: Unexpected data declaration statement in INTERFACE block at (1)
>
> /home/mnv/Software/petsc/include/../src/ksp/f90-mod/ftn-auto-interfaces/petscpc.h90:43:80:
>
> 43 | PetscInt e ! PetscInt
> |
> 1
> Error: Unexpected data declaration statement in INTERFACE block at (1)
>
> /home/mnv/Software/petsc/include/../src/ksp/f90-mod/ftn-auto-interfaces/petscpc.h90:44:80:
>
> 44 | PetscInt f ! PetscInt
> |
> 1
> Error: Unexpected data declaration statement in INTERFACE block at (1)
>
> /home/mnv/Software/petsc/include/../src/ksp/f90-mod/ftn-auto-interfaces/petscpc.h90:45:80:
>
> 45 | PetscInt g ! PetscInt
> |
> 1
> Error: Unexpected data declaration statement in INTERFACE block at (1)
>
> /home/mnv/Software/petsc/include/../src/ksp/f90-mod/ftn-auto-interfaces/petscpc.h90:46:30:
>
> 46 | IS h ! IS
> | 1
> Error: Unexpected data declaration statement in INTERFACE block at (1)
>
> /home/mnv/Software/petsc/include/../src/ksp/f90-mod/ftn-auto-interfaces/petscpc.h90:47:30:
>
> 47 | IS i ! IS
> | 1
> Error: Unexpected data declaration statement in INTERFACE block at (1)
>
> /home/mnv/Software/petsc/include/../src/ksp/f90-mod/ftn-auto-interfaces/petscpc.h90:48:43:
>
> 48 | PetscErrorCode z
> | 1
> Error: Unexpected data declaration statement in INTERFACE block at (1)
>
> /home/mnv/Software/petsc/include/../src/ksp/f90-mod/ftn-auto-interfaces/petscpc.h90:49:10:
>
> 49 | end subroutine PCASMCreateSubdomains2D
> | 1
> Error: Expecting END INTERFACE statement at (1)
> make[3]: *** [gmakefile:225:
> arch-linux-c-opt/obj/src/ksp/f90-mod/petscpcmod.o] Error 1
> make[3]: *** Waiting for unfinished jobs....
> CC
> arch-linux-c-opt/obj/src/tao/leastsquares/impls/pounders/pounders.o
> CC arch-linux-c-opt/obj/src/ksp/pc/impls/bddc/bddcprivate.o
> CUDAC
> arch-linux-c-opt/obj/src/vec/vec/impls/seq/cupm/cuda/vecseqcupm.o
> CUDAC.dep
> arch-linux-c-opt/obj/src/vec/vec/impls/seq/cupm/cuda/vecseqcupm.o
> make[3]: Leaving directory '/home/mnv/Software/petsc'
> make[2]: *** [/home/mnv/Software/petsc/lib/petsc/conf/rules.doc:28: libs]
> Error 2
> make[2]: Leaving directory '/home/mnv/Software/petsc'
> **************************ERROR*************************************
> Error during compile, check arch-linux-c-opt/lib/petsc/conf/make.log
> Send it and arch-linux-c-opt/lib/petsc/conf/configure.log to
> petsc-maint at mcs.anl.gov
> ********************************************************************
> make[1]: *** [makefile:45: all] Error 1
> make: *** [GNUmakefile:9: all] Error 2
> ------------------------------
> *From:* Junchao Zhang <junchao.zhang at gmail.com>
> *Sent:* Friday, August 11, 2023 3:04 PM
> *To:* Vanella, Marcos (Fed) <marcos.vanella at nist.gov>
> *Cc:* petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>
> *Subject:* Re: [petsc-users] CUDA error trying to run a job with two mpi
> processes and 1 GPU
>
> Hi, Marcos,
> I saw MatSetPreallocationCOO_MPIAIJCUSPARSE_Basic() in the error stack.
> We recently refactored the COO code and got rid of that function. So could
> you try petsc/main?
> We map MPI processes to GPUs in a round-robin fashion. We query the
> number of visible CUDA devices (g), and assign the device (rank%g) to the
> MPI process (rank). In that sense, the work distribution is totally
> determined by your MPI work partition (i.e., by you).
> On clusters, this MPI process to GPU binding is usually done by the job
> scheduler like slurm. You need to check your cluster's users' guide to see
> how to bind MPI processes to GPUs. If the job scheduler has done that, the
> number of visible CUDA devices to a process might just appear to be 1,
> making petsc's own mapping void.
>
> Thanks.
> --Junchao Zhang
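
The round-robin mapping described above (device = rank % g) can be illustrated with a minimal sketch. This is not PETSc code: `assign_device` is a hypothetical helper name used only for illustration, and the sketch assumes nothing beyond the rule stated in the message, namely that each MPI rank gets the device index rank modulo the number of visible CUDA devices.

```python
def assign_device(rank: int, num_visible_devices: int) -> int:
    """Return the device index for an MPI rank under round-robin mapping.

    Hypothetical helper: mirrors the rule 'device = rank % g' from the
    message above, where g is the number of visible CUDA devices.
    """
    if num_visible_devices < 1:
        raise ValueError("at least one visible device is required")
    return rank % num_visible_devices


# Two MPI ranks sharing one visible GPU (the case in this thread):
# both ranks map to device 0.
two_ranks_one_gpu = [assign_device(r, 1) for r in range(2)]

# Four MPI ranks with two visible GPUs: ranks alternate between devices.
four_ranks_two_gpus = [assign_device(r, 2) for r in range(4)]

print(two_ranks_one_gpu)
print(four_ranks_two_gpus)
```

If the job scheduler has already restricted each process to a single visible device, g is 1 for every process and the modulo always yields 0, which is why the library-side mapping then has no effect.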
>
>
> On Fri, Aug 11, 2023 at 12:43 PM Vanella, Marcos (Fed) <
> marcos.vanella at nist.gov> wrote:
>
> Hi Junchao, thank you for replying. I compiled petsc in debug mode and
> this is what I get for the case:
>
> terminate called after throwing an instance of
> 'thrust::system::system_error'
> what(): merge_sort: failed to synchronize: cudaErrorIllegalAddress: an
> illegal memory access was encountered
>
> Program received signal SIGABRT: Process abort signal.
>
> Backtrace for this error:
> #0 0x15264731ead0 in ???
> #1 0x15264731dc35 in ???
> #2 0x15264711551f in ???
> #3 0x152647169a7c in ???
> #4 0x152647115475 in ???
> #5 0x1526470fb7f2 in ???
> #6 0x152647678bbd in ???
> #7 0x15264768424b in ???
> #8 0x1526476842b6 in ???
> #9 0x152647684517 in ???
> #10 0x55bb46342ebb in _ZN6thrust8cuda_cub14throw_on_errorE9cudaErrorPKc
> at /usr/local/cuda/include/thrust/system/cuda/detail/util.h:224
> #11 0x55bb46342ebb in
> _ZN6thrust8cuda_cub12__merge_sort10merge_sortINS_6detail17integral_constantIbLb1EEENS4_IbLb0EEENS0_3tagENS_12zip_iteratorINS_5tupleINS_10device_ptrIiEESB_NS_9null_typeESC_SC_SC_SC_SC_SC_SC_EEEENS3_15normal_iteratorISB_EE9IJCompareEEvRNS0_16execution_policyIT1_EET2_SM_T3_T4_
> at /usr/local/cuda/include/thrust/system/cuda/detail/sort.h:1316
> #12 0x55bb46342ebb in
> _ZN6thrust8cuda_cub12__smart_sort10smart_sortINS_6detail17integral_constantIbLb1EEENS4_IbLb0EEENS0_16execution_policyINS0_3tagEEENS_12zip_iteratorINS_5tupleINS_10device_ptrIiEESD_NS_9null_typeESE_SE_SE_SE_SE_SE_SE_EEEENS3_15normal_iteratorISD_EE9IJCompareEENS1_25enable_if_comparison_sortIT2_T4_E4typeERT1_SL_SL_T3_SM_
> at /usr/local/cuda/include/thrust/system/cuda/detail/sort.h:1544
> #13 0x55bb46342ebb in
> _ZN6thrust8cuda_cub11sort_by_keyINS0_3tagENS_12zip_iteratorINS_5tupleINS_10device_ptrIiEES6_NS_9null_typeES7_S7_S7_S7_S7_S7_S7_EEEENS_6detail15normal_iteratorIS6_EE9IJCompareEEvRNS0_16execution_policyIT_EET0_SI_T1_T2_
> at /usr/local/cuda/include/thrust/system/cuda/detail/sort.h:1669
> #14 0x55bb46317bc5 in
> _ZN6thrust11sort_by_keyINS_8cuda_cub3tagENS_12zip_iteratorINS_5tupleINS_10device_ptrIiEES6_NS_9null_typeES7_S7_S7_S7_S7_S7_S7_EEEENS_6detail15normal_iteratorIS6_EE9IJCompareEEvRKNSA_21execution_policy_baseIT_EET0_SJ_T1_T2_
> at /usr/local/cuda/include/thrust/detail/sort.inl:115
> #15 0x55bb46317bc5 in
> _ZN6thrust11sort_by_keyINS_12zip_iteratorINS_5tupleINS_10device_ptrIiEES4_NS_9null_typeES5_S5_S5_S5_S5_S5_S5_EEEENS_6detail15normal_iteratorIS4_EE9IJCompareEEvT_SC_T0_T1_
> at /usr/local/cuda/include/thrust/detail/sort.inl:305
> #16 0x55bb46317bc5 in MatSetPreallocationCOO_SeqAIJCUSPARSE_Basic
> at /home/mnv/Software/petsc/src/mat/impls/aij/seq/seqcusparse/aijcusparse.cu:4452
> #17 0x55bb46c5b27c in MatSetPreallocationCOO_MPIAIJCUSPARSE_Basic
> at /home/mnv/Software/petsc/src/mat/impls/aij/mpi/mpicusparse/mpiaijcusparse.cu:173
> #18 0x55bb46c5b27c in MatSetPreallocationCOO_MPIAIJCUSPARSE
> at /home/mnv/Software/petsc/src/mat/impls/aij/mpi/mpicusparse/mpiaijcusparse.cu:222
> #19 0x55bb468e01cf in MatSetPreallocationCOO
> at /home/mnv/Software/petsc/src/mat/utils/gcreate.c:606
> #20 0x55bb46b39c9b in MatProductSymbolic_MPIAIJBACKEND
> at /home/mnv/Software/petsc/src/mat/impls/aij/mpi/mpiaij.c:7547
> #21 0x55bb469015e5 in MatProductSymbolic
> at /home/mnv/Software/petsc/src/mat/interface/matproduct.c:803
> #22 0x55bb4694ade2 in MatPtAP
> at /home/mnv/Software/petsc/src/mat/interface/matrix.c:9897
> #23 0x55bb4696d3ec in MatCoarsenApply_MISK_private
> at /home/mnv/Software/petsc/src/mat/coarsen/impls/misk/misk.c:283
> #24 0x55bb4696eb67 in MatCoarsenApply_MISK
> at /home/mnv/Software/petsc/src/mat/coarsen/impls/misk/misk.c:368
> #25 0x55bb4695bd91 in MatCoarsenApply
> at /home/mnv/Software/petsc/src/mat/coarsen/coarsen.c:97
> #26 0x55bb478294d8 in PCGAMGCoarsen_AGG
> at /home/mnv/Software/petsc/src/ksp/pc/impls/gamg/agg.c:524
> #27 0x55bb471d1cb4 in PCSetUp_GAMG
> at /home/mnv/Software/petsc/src/ksp/pc/impls/gamg/gamg.c:631
> #28 0x55bb464022cf in PCSetUp
> at /home/mnv/Software/petsc/src/ksp/pc/interface/precon.c:994
> #29 0x55bb4718b8a7 in KSPSetUp
> at /home/mnv/Software/petsc/src/ksp/ksp/interface/itfunc.c:406
> #30 0x55bb4718f22e in KSPSolve_Private
> at /home/mnv/Software/petsc/src/ksp/ksp/interface/itfunc.c:824
> #31 0x55bb47192c0c in KSPSolve
> at /home/mnv/Software/petsc/src/ksp/ksp/interface/itfunc.c:1070
> #32 0x55bb463efd35 in kspsolve_
> at /home/mnv/Software/petsc/src/ksp/ksp/interface/ftn-auto/itfuncf.c:320
> #33 0x55bb45e94b32 in ???
> #34 0x55bb46048044 in ???
> #35 0x55bb46052ea1 in ???
> #36 0x55bb45ac5f8e in ???
> #37 0x1526470fcd8f in ???
> #38 0x1526470fce3f in ???
> #39 0x55bb45aef55d in ???
> #40 0xffffffffffffffff in ???
> --------------------------------------------------------------------------
> Primary job terminated normally, but 1 process returned
> a non-zero exit code. Per user-direction, the job has been aborted.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that process rank 0 with PID 1771753 on node dgx02 exited
> on signal 6 (Aborted).
> --------------------------------------------------------------------------
>
> BTW, I'm curious. If I set n MPI processes, each of them building a part
> of the linear system, and g GPUs, how does PETSc distribute those n pieces
> of the system matrix and rhs among the g GPUs? Does it use some
> load-balancing algorithm? Where can I read about this?
> Thank you. I can also point you to my code repo on GitHub
> if you want to take a closer look.
>
> Best Regards,
> Marcos
>
> ------------------------------
> *From:* Junchao Zhang <junchao.zhang at gmail.com>
> *Sent:* Friday, August 11, 2023 10:52 AM
> *To:* Vanella, Marcos (Fed) <marcos.vanella at nist.gov>
> *Cc:* petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>
> *Subject:* Re: [petsc-users] CUDA error trying to run a job with two mpi
> processes and 1 GPU
>
> Hi, Marcos,
> Could you build petsc in debug mode and then copy and paste the whole
> error stack message?
>
> Thanks
> --Junchao Zhang
>
>
> On Thu, Aug 10, 2023 at 5:51 PM Vanella, Marcos (Fed) via petsc-users <
> petsc-users at mcs.anl.gov> wrote:
>
> Hi, I'm trying to run a parallel matrix-vector build and linear solution
> with PETSc on 2 MPI processes + one V100 GPU. I tested that the matrix
> build and solution succeed on CPUs only. I'm using cuda 11.5, a
> CUDA-enabled openmpi, and gcc 9.3. When I run the job with the GPU enabled
> I get the following error:
>
> terminate called after throwing an instance of
> 'thrust::system::system_error'
> *what(): merge_sort: failed to synchronize: cudaErrorIllegalAddress:
> an illegal memory access was encountered*
>
> Program received signal SIGABRT: Process abort signal.
>
> Backtrace for this error:
> terminate called after throwing an instance of
> 'thrust::system::system_error'
> what(): merge_sort: failed to synchronize: cudaErrorIllegalAddress: an
> illegal memory access was encountered
>
> Program received signal SIGABRT: Process abort signal.
>
> I'm new to submitting jobs in slurm that also use GPU resources, so I
> might be doing something wrong in my submission script. This is it:
>
> #!/bin/bash
> #SBATCH -J test
> #SBATCH -e /home/Issues/PETSc/test.err
> #SBATCH -o /home/Issues/PETSc/test.log
> #SBATCH --partition=batch
> #SBATCH --ntasks=2
> #SBATCH --nodes=1
> #SBATCH --cpus-per-task=1
> #SBATCH --ntasks-per-node=2
> #SBATCH --time=01:00:00
> #SBATCH --gres=gpu:1
>
> export OMP_NUM_THREADS=1
> module load cuda/11.5
> module load openmpi/4.1.1
>
> cd /home/Issues/PETSc
> mpirun -n 2 /home/fds/Build/ompi_gnu_linux/fds_ompi_gnu_linux test.fds -vec_type mpicuda -mat_type mpiaijcusparse -pc_type gamg
>
> If anyone has any suggestions on how to troubleshoot this please let me
> know.
> Thanks!
> Marcos
>
>
>
>