[petsc-users] errors with hypre with MPI and multiple GPUs on a node

Junchao Zhang junchao.zhang at gmail.com
Thu Feb 1 09:47:43 CST 2024


Cc'ing Victor at TACC, who might have some ideas.

--Junchao Zhang


On Thu, Feb 1, 2024 at 9:28 AM Yesypenko, Anna <anna at oden.utexas.edu> wrote:

> Hi Junchao,
>
> Thank you for your suggestion; you're right that binding MPI ranks to GPUs
> seems to be the issue.
> I looked at the TACC documentation, and I'm not sure they provide this
> utility.
> I'm trying to set the CUDA_VISIBLE_DEVICES environment variable according
> to the MPI rank.
>
> This works sometimes now! The environment variables are set properly, but
> it still fails with the same error half the time.
> How do I know that hypre is binding MPI ranks to GPUs properly?  The error
> originates from a call to hypre.
>
> I also tried to set the environment variable (using mpi4py) before
> importing PETSc, but this doesn't seem to work either.
>
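
A minimal sketch of one way to do this binding, offered as an illustration rather than a confirmed fix. CUDA reads CUDA_VISIBLE_DEVICES when the runtime initializes in a process, so the variable has to be set before anything in the process initializes CUDA, which in practice means before petsc4py is imported or initialized. The sketch takes the node-local rank from environment variables that some launchers export (MV2_COMM_WORLD_LOCAL_RANK for MVAPICH2, OMPI_COMM_WORLD_LOCAL_RANK for Open MPI, SLURM_LOCALID for Slurm). Whether the launcher on this system exports any of them is an assumption, and the helper names `local_rank_from_env` and `bind_gpu` are hypothetical.

```python
import os

# Environment variables that some launchers set to the node-local rank.
# Which one (if any) is present depends on the MPI library and scheduler;
# this list is an assumption, not something the thread confirms for TACC.
LOCAL_RANK_VARS = (
    "MV2_COMM_WORLD_LOCAL_RANK",   # MVAPICH2
    "OMPI_COMM_WORLD_LOCAL_RANK",  # Open MPI
    "SLURM_LOCALID",               # Slurm (srun)
)


def local_rank_from_env(env):
    """Return the node-local rank as an int, or None if no known variable is set."""
    for name in LOCAL_RANK_VARS:
        value = env.get(name)
        if value is not None:
            return int(value)
    return None


def bind_gpu(env, gpus_per_node):
    """Set CUDA_VISIBLE_DEVICES in env from the local rank.

    Returns the chosen device id, or None (leaving env untouched) when the
    local rank cannot be determined.
    """
    local_rank = local_rank_from_env(env)
    if local_rank is None:
        return None
    device = local_rank % gpus_per_node  # round-robin if ranks outnumber GPUs
    env["CUDA_VISIBLE_DEVICES"] = str(device)
    return device


# Intended use, at the very top of the script, before any petsc4py import:
#     bind_gpu(os.environ, gpus_per_node=3)
#     import sys, petsc4py
#     petsc4py.init(sys.argv)
```

If none of the variables is set, `bind_gpu` returns None, which is a signal to check what the launcher actually exports (for example by printing `os.environ` from each task).
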
> Here is the preamble I added to the top of the script. I'm running on a
> single node with 3 GPUs.
> ``
> import numpy,petsc4py,sys,os,time
> from time import time
> petsc4py.init(sys.argv)
> from petsc4py import PETSc
>
> comm  = PETSc.COMM_WORLD
>
> os.environ['CUDA_VISIBLE_DEVICES'] = "%d" % comm.Get_rank()
> PETSc.Sys.syncPrint("\t Processor %d of %d gets GPU %d"
>                     % (comm.Get_rank(), comm.Get_size(), comm.Get_rank()),
>                     comm=comm, flush=True)
> comm.Barrier()
>
> ### Petsc Matrix initialization here
>
> ### I confirm that the matrix is partitioned into indices as I expect
> PETSc.Sys.syncPrint("\t Processor %d with GPU %s gets indices %d:%d"
>                     % (comm.Get_rank(), os.environ['CUDA_VISIBLE_DEVICES'], rstart, rend),
>                     flush=True, comm=comm)
> ``
>
> When the script fails, I get the following stack trace.
> ``
> TACC:  Starting up job 1491828
> TACC:  Setting up parallel environment for MVAPICH2+mpispawn.
> TACC:  Starting parallel tasks...
> Processor 0 of 3 gets GPU 0
> Processor 1 of 3 gets GPU 1
> Processor 2 of 3 gets GPU 2
> Processor 0 with GPU 0 gets indices 0:166667
> Processor 1 with GPU 1 gets indices 166667:333334
> Processor 2 with GPU 2 gets indices 333334:500000
> [0]PETSC ERROR: ------------------------------------------------------------------------
> [0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
> [0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
> [0]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind and https://petsc.org/release/faq/
> [0]PETSC ERROR: or try https://docs.nvidia.com/cuda/cuda-memcheck/index.html on NVIDIA CUDA systems to find memory corruption errors
> [0]PETSC ERROR: ---------------------  Stack Frames ------------------------------------
> [0]PETSC ERROR: The line numbers in the error traceback are not always exact.
> [0]PETSC ERROR: #1 hypre_ParCSRMatrixMigrate()
> [0]PETSC ERROR: #2 MatBindToCPU_HYPRE() at /work/06368/annayesy/ls6/petsc/src/mat/impls/hypre/mhypre.c:1394
> [0]PETSC ERROR: #3 MatAssemblyEnd_HYPRE() at /work/06368/annayesy/ls6/petsc/src/mat/impls/hypre/mhypre.c:1471
> [0]PETSC ERROR: #4 MatAssemblyEnd() at /work/06368/annayesy/ls6/petsc/src/mat/interface/matrix.c:5773
> [0]PETSC ERROR: #5 MatConvert_AIJ_HYPRE() at /work/06368/annayesy/ls6/petsc/src/mat/impls/hypre/mhypre.c:660
> [0]PETSC ERROR: #6 MatConvert() at /work/06368/annayesy/ls6/petsc/src/mat/interface/matrix.c:4421
> [0]PETSC ERROR: #7 PCSetUp_HYPRE() at /work/06368/annayesy/ls6/petsc/src/ksp/pc/impls/hypre/hypre.c:245
> [0]PETSC ERROR: #8 PCSetUp() at /work/06368/annayesy/ls6/petsc/src/ksp/pc/interface/precon.c:1080
> [0]PETSC ERROR: #9 KSPSetUp() at /work/06368/annayesy/ls6/petsc/src/ksp/ksp/interface/itfunc.c:415
> [0]PETSC ERROR: #10 KSPSolve_Private() at /work/06368/annayesy/ls6/petsc/src/ksp/ksp/interface/itfunc.c:833
> [0]PETSC ERROR: #11 KSPSolve() at /work/06368/annayesy/ls6/petsc/src/ksp/ksp/interface/itfunc.c:1080
> application called MPI_Abort(MPI_COMM_WORLD, 59) - process 0
> ``
>
> ------------------------------
> *From:* Junchao Zhang <junchao.zhang at gmail.com>
> *Sent:* Wednesday, January 31, 2024 5:36 PM
> *To:* Yesypenko, Anna <anna at oden.utexas.edu>
> *Cc:* petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>
> *Subject:* Re: [petsc-users] errors with hypre with MPI and multiple GPUs
> on a node
>
> Hi Anna,
>   Since you said "The code works with pc-type hypre on a single GPU.", I
> wonder whether this is a problem with binding CUDA devices to MPI ranks.
>   You can search the TACC documentation to find out how its job scheduler
> binds GPUs to MPI ranks (usually by manipulating the CUDA_VISIBLE_DEVICES
> environment variable).
>
>   Please follow up if you cannot solve it.
>
>   Thanks.
> --Junchao Zhang
>
>
> On Wed, Jan 31, 2024 at 4:07 PM Yesypenko, Anna <anna at oden.utexas.edu>
> wrote:
>
> Dear PETSc devs,
>
> I'm encountering an error running hypre on a single node with multiple
> GPUs.
> The issue is in the setup phase. I'm trying to troubleshoot, but don't
> know where to start.
> Are the system routines PetscCUDAInitialize and PetscCUDAInitializeCheck
> available in Python?
> How do I verify that GPUs are assigned properly to each MPI process? In
> this case, I have 3 tasks and 3 GPUs.
>
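
One way to check the assignment from inside the job, as a rough sketch: have every rank report its hostname and its CUDA_VISIBLE_DEVICES value, then look for ranks on the same host that share a setting. This only inspects the environment each rank sees; it does not show which device hypre actually selects, so it can rule out a launcher-level binding problem but cannot confirm hypre's behavior. The function names are hypothetical, and `gather_records` assumes mpi4py is installed.

```python
import os
from collections import defaultdict


def find_gpu_conflicts(records):
    """Find ranks on the same host that share a CUDA_VISIBLE_DEVICES value.

    records: iterable of (rank, hostname, visible_devices) tuples, where
    visible_devices is the raw string value or None if the variable is unset
    (an unset variable means the rank can see every GPU on the node).
    Returns {(hostname, visible_devices): [ranks]} for each shared setting.
    """
    groups = defaultdict(list)
    for rank, host, visible in records:
        groups[(host, visible)].append(rank)
    return {key: sorted(ranks) for key, ranks in groups.items() if len(ranks) > 1}


def gather_records():
    """Collect (rank, hostname, CUDA_VISIBLE_DEVICES) from every MPI rank."""
    from mpi4py import MPI  # imported lazily so the checker works without MPI

    comm = MPI.COMM_WORLD
    record = (
        comm.Get_rank(),
        MPI.Get_processor_name(),
        os.environ.get("CUDA_VISIBLE_DEVICES"),
    )
    return comm.allgather(record)
```

In a job, each rank could call `find_gpu_conflicts(gather_records())` and rank 0 could print the result; an empty dictionary means no two ranks on a host share the same visible-device setting.
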
> The code works with pc-type hypre on a single GPU.
> Any suggestions are appreciated!
>
> Below is the error trace:
> ``
> TACC:  Starting up job 1490124
> TACC:  Setting up parallel environment for MVAPICH2+mpispawn.
> TACC:  Starting parallel tasks...
> [0]PETSC ERROR: ------------------------------------------------------------------------
> [0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
> [0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
> [0]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind and https://petsc.org/release/faq/
> [0]PETSC ERROR: or try https://docs.nvidia.com/cuda/cuda-memcheck/index.html on NVIDIA CUDA systems to find memory corruption errors
> [0]PETSC ERROR: ---------------------  Stack Frames ------------------------------------
> [0]PETSC ERROR: The line numbers in the error traceback are not always exact.
> [0]PETSC ERROR: #1 hypre_ParCSRMatrixMigrate()
> [0]PETSC ERROR: #2 MatBindToCPU_HYPRE() at /work/06368/annayesy/ls6/petsc/src/mat/impls/hypre/mhypre.c:1394
> [0]PETSC ERROR: #3 MatAssemblyEnd_HYPRE() at /work/06368/annayesy/ls6/petsc/src/mat/impls/hypre/mhypre.c:1471
> [0]PETSC ERROR: #4 MatAssemblyEnd() at /work/06368/annayesy/ls6/petsc/src/mat/interface/matrix.c:5773
> [0]PETSC ERROR: #5 MatConvert_AIJ_HYPRE() at /work/06368/annayesy/ls6/petsc/src/mat/impls/hypre/mhypre.c:660
> [0]PETSC ERROR: #6 MatConvert() at /work/06368/annayesy/ls6/petsc/src/mat/interface/matrix.c:4421
> [0]PETSC ERROR: #7 PCSetUp_HYPRE() at /work/06368/annayesy/ls6/petsc/src/ksp/pc/impls/hypre/hypre.c:245
> [0]PETSC ERROR: #8 PCSetUp() at /work/06368/annayesy/ls6/petsc/src/ksp/pc/interface/precon.c:1080
> [0]PETSC ERROR: #9 KSPSetUp() at /work/06368/annayesy/ls6/petsc/src/ksp/ksp/interface/itfunc.c:415
> [0]PETSC ERROR: #10 KSPSolve_Private() at /work/06368/annayesy/ls6/petsc/src/ksp/ksp/interface/itfunc.c:833
> [0]PETSC ERROR: #11 KSPSolve() at /work/06368/annayesy/ls6/petsc/src/ksp/ksp/interface/itfunc.c:1080
> application called MPI_Abort(MPI_COMM_WORLD, 59) - process 0
> ``
>
> Below is a minimum working example:
> ``
> import numpy,petsc4py,sys,time
> petsc4py.init(sys.argv)
> from petsc4py import PETSc
> from time import time
>
> n     = int(5e5);
> comm  = PETSc.COMM_WORLD
>
> pA = PETSc.Mat(comm=comm)
> pA.create(comm=comm)
> pA.setSizes((n,n))
> pA.setType(PETSc.Mat.Type.AIJ)
> pA.setPreallocationNNZ(3)
> rstart,rend=pA.getOwnershipRange()
>
> print("\t Processor %d of %d gets indices %d:%d"
>       % (comm.Get_rank(), comm.Get_size(), rstart, rend))
> if (rstart == 0):
>     pA.setValue(0,0,2); pA.setValue(0,1,-1)
> if (rend == n):
>     pA.setValue(n-1,n-2,-1); pA.setValue(n-1,n-1,2)
>
> for index in range(rstart,rend):
>     if (rstart > 0):
>         pA.setValue(index,index-1,-1)
>     pA.setValue(index,index,2)
>     if (rend < n):
>         pA.setValue(index,index+1,-1)
>
> pA.assemble()
> pA = pA.convert(mat_type='aijcusparse')
>
> px,pb = pA.createVecs()
> pb.set(1.0); px.set(1.0)
>
> ksp = PETSc.KSP().create()
> ksp.setOperators(pA)
> ksp.setConvergenceHistory()
> ksp.setType('cg')
> ksp.getPC().setType('hypre')
> ksp.setTolerances(rtol=1e-10)
>
> ksp.solve(pb, px)                           # error is generated here
> ``
>
> Best,
> Anna
>
>