[petsc-users] errors with hypre with MPI and multiple GPUs on a node

Yesypenko, Anna anna at oden.utexas.edu
Thu Feb 1 09:28:10 CST 2024


Hi Junchao,

Thank you for your suggestion; you're right that binding MPI ranks to GPUs seems to be the issue.
I looked at the TACC documentation, and I'm not sure they provide a utility for this.
I'm trying to set the CUDA_VISIBLE_DEVICES environment variable according to the MPI rank.

This now works some of the time. The environment variable is set correctly, but the script still fails with the same error about half the time.
How can I check that hypre is mapping MPI ranks to GPUs properly? The error originates from a call to hypre.

I also tried to set the environment variable (using mpi4py) before importing PETSc, but this doesn't seem to work either.
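
For reference, here is a minimal sketch of that attempt (illustrative only; it assumes one GPU per MPI rank on a single node, so the global rank can be used directly as the device index):
``
import os
from mpi4py import MPI

# select the GPU before petsc4py (and hence CUDA) is initialized
os.environ['CUDA_VISIBLE_DEVICES'] = "%d" % MPI.COMM_WORLD.Get_rank()

import sys, petsc4py
petsc4py.init(sys.argv)
from petsc4py import PETSc
``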

Here is the preamble I added to the top of the script. I'm running on a single node with 3 GPUs.
``
import numpy,petsc4py,sys,os,time
from time import time
petsc4py.init(sys.argv)
from petsc4py import PETSc

comm  = PETSc.COMM_WORLD

os.environ['CUDA_VISIBLE_DEVICES'] = "%d" % comm.Get_rank()
PETSc.Sys.syncPrint("\t Processor %d of %d gets GPU %d"%\
        (comm.Get_rank(),comm.Get_size(),comm.Get_rank()),comm=comm,flush=True)
comm.Barrier()

### Petsc Matrix initialization here

### I confirm that the matrix is partitioned into indices as I expect
PETSc.Sys.syncPrint("\t Processor %d with GPU %s gets indices %d:%d"\
        %(comm.Get_rank(),os.environ['CUDA_VISIBLE_DEVICES'],rstart,rend),flush=True,comm=comm)
``

When the script fails, I get the following stack trace.
``
TACC:  Starting up job 1491828
TACC:  Setting up parallel environment for MVAPICH2+mpispawn.
TACC:  Starting parallel tasks...
       Processor 0 of 3 gets GPU 0
       Processor 1 of 3 gets GPU 1
       Processor 2 of 3 gets GPU 2
       Processor 0 with GPU 0 gets indices 0:166667
       Processor 1 with GPU 1 gets indices 166667:333334
       Processor 2 with GPU 2 gets indices 333334:500000
[0]PETSC ERROR: ------------------------------------------------------------------------
[0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[0]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind and https://petsc.org/release/faq/
[0]PETSC ERROR: or try https://docs.nvidia.com/cuda/cuda-memcheck/index.html on NVIDIA CUDA systems to find memory corruption errors
[0]PETSC ERROR: ---------------------  Stack Frames ------------------------------------
[0]PETSC ERROR: The line numbers in the error traceback are not always exact.
[0]PETSC ERROR: #1 hypre_ParCSRMatrixMigrate()
[0]PETSC ERROR: #2 MatBindToCPU_HYPRE() at /work/06368/annayesy/ls6/petsc/src/mat/impls/hypre/mhypre.c:1394
[0]PETSC ERROR: #3 MatAssemblyEnd_HYPRE() at /work/06368/annayesy/ls6/petsc/src/mat/impls/hypre/mhypre.c:1471
[0]PETSC ERROR: #4 MatAssemblyEnd() at /work/06368/annayesy/ls6/petsc/src/mat/interface/matrix.c:5773
[0]PETSC ERROR: #5 MatConvert_AIJ_HYPRE() at /work/06368/annayesy/ls6/petsc/src/mat/impls/hypre/mhypre.c:660
[0]PETSC ERROR: #6 MatConvert() at /work/06368/annayesy/ls6/petsc/src/mat/interface/matrix.c:4421
[0]PETSC ERROR: #7 PCSetUp_HYPRE() at /work/06368/annayesy/ls6/petsc/src/ksp/pc/impls/hypre/hypre.c:245
[0]PETSC ERROR: #8 PCSetUp() at /work/06368/annayesy/ls6/petsc/src/ksp/pc/interface/precon.c:1080
[0]PETSC ERROR: #9 KSPSetUp() at /work/06368/annayesy/ls6/petsc/src/ksp/ksp/interface/itfunc.c:415
[0]PETSC ERROR: #10 KSPSolve_Private() at /work/06368/annayesy/ls6/petsc/src/ksp/ksp/interface/itfunc.c:833
[0]PETSC ERROR: #11 KSPSolve() at /work/06368/annayesy/ls6/petsc/src/ksp/ksp/interface/itfunc.c:1080
application called MPI_Abort(MPI_COMM_WORLD, 59) - process 0
``

________________________________
From: Junchao Zhang <junchao.zhang at gmail.com>
Sent: Wednesday, January 31, 2024 5:36 PM
To: Yesypenko, Anna <anna at oden.utexas.edu>
Cc: petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>
Subject: Re: [petsc-users] errors with hypre with MPI and multiple GPUs on a node

Hi Anna,
  Since you said "The code works with pc-type hypre on a single GPU," I was wondering if this is a problem with binding CUDA devices to MPI ranks.
  You can search the TACC documentation to find out how its job scheduler binds GPUs to MPI ranks (usually by manipulating the CUDA_VISIBLE_DEVICES environment variable).
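
  If the scheduler does not do this automatically, a rough sketch of setting it by hand from the launcher's local-rank variable is below (illustrative only; the variable name depends on the MPI launcher, e.g. MVAPICH2 typically exports MV2_COMM_WORLD_LOCAL_RANK and Slurm exports SLURM_LOCALID, so check the TACC docs for the exact name):
``
import os, sys

# pick the GPU from the launcher's local-rank variable, before PETSc
# initializes CUDA; the variable names below are typical, not guaranteed
local_rank = os.environ.get("MV2_COMM_WORLD_LOCAL_RANK",
                            os.environ.get("SLURM_LOCALID", "0"))
os.environ["CUDA_VISIBLE_DEVICES"] = local_rank

import petsc4py
petsc4py.init(sys.argv)
from petsc4py import PETSc
``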

  Please follow up if you cannot solve it.

  Thanks.
--Junchao Zhang


On Wed, Jan 31, 2024 at 4:07 PM Yesypenko, Anna <anna at oden.utexas.edu<mailto:anna at oden.utexas.edu>> wrote:
Dear PETSc devs,

I'm encountering an error running hypre on a single node with multiple GPUs.
The issue is in the setup phase. I'm trying to troubleshoot but don't know where to start.
Are the system routines PetscCUDAInitialize and PetscCUDAInitializeCheck available in Python?
How do I verify that GPUs are assigned properly to each MPI process? In this case, I have 3 tasks and 3 GPUs.

The code works with pc-type hypre on a single GPU.
Any suggestions are appreciated!

Below is the error trace:
``
TACC:  Starting up job 1490124
TACC:  Setting up parallel environment for MVAPICH2+mpispawn.
TACC:  Starting parallel tasks...
[0]PETSC ERROR: ------------------------------------------------------------------------
[0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[0]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind and https://petsc.org/release/faq/
[0]PETSC ERROR: or try https://docs.nvidia.com/cuda/cuda-memcheck/index.html on NVIDIA CUDA systems to find memory corruption errors
[0]PETSC ERROR: ---------------------  Stack Frames ------------------------------------
[0]PETSC ERROR: The line numbers in the error traceback are not always exact.
[0]PETSC ERROR: #1 hypre_ParCSRMatrixMigrate()
[0]PETSC ERROR: #2 MatBindToCPU_HYPRE() at /work/06368/annayesy/ls6/petsc/src/mat/impls/hypre/mhypre.c:1394
[0]PETSC ERROR: #3 MatAssemblyEnd_HYPRE() at /work/06368/annayesy/ls6/petsc/src/mat/impls/hypre/mhypre.c:1471
[0]PETSC ERROR: #4 MatAssemblyEnd() at /work/06368/annayesy/ls6/petsc/src/mat/interface/matrix.c:5773
[0]PETSC ERROR: #5 MatConvert_AIJ_HYPRE() at /work/06368/annayesy/ls6/petsc/src/mat/impls/hypre/mhypre.c:660
[0]PETSC ERROR: #6 MatConvert() at /work/06368/annayesy/ls6/petsc/src/mat/interface/matrix.c:4421
[0]PETSC ERROR: #7 PCSetUp_HYPRE() at /work/06368/annayesy/ls6/petsc/src/ksp/pc/impls/hypre/hypre.c:245
[0]PETSC ERROR: #8 PCSetUp() at /work/06368/annayesy/ls6/petsc/src/ksp/pc/interface/precon.c:1080
[0]PETSC ERROR: #9 KSPSetUp() at /work/06368/annayesy/ls6/petsc/src/ksp/ksp/interface/itfunc.c:415
[0]PETSC ERROR: #10 KSPSolve_Private() at /work/06368/annayesy/ls6/petsc/src/ksp/ksp/interface/itfunc.c:833
[0]PETSC ERROR: #11 KSPSolve() at /work/06368/annayesy/ls6/petsc/src/ksp/ksp/interface/itfunc.c:1080
application called MPI_Abort(MPI_COMM_WORLD, 59) - process 0
``

Below is a minimal working example:
``
import numpy,petsc4py,sys,time
petsc4py.init(sys.argv)
from petsc4py import PETSc
from time import time

n     = int(5e5);
comm  = PETSc.COMM_WORLD

pA = PETSc.Mat(comm=comm)
pA.create(comm=comm)
pA.setSizes((n,n))
pA.setType(PETSc.Mat.Type.AIJ)
pA.setPreallocationNNZ(3)
rstart,rend=pA.getOwnershipRange()

print("\t Processor %d of %d gets indices %d:%d"%(comm.Get_rank(),comm.Get_size(),rstart,rend))
if (rstart == 0):
    pA.setValue(0,0,2); pA.setValue(0,1,-1)
if (rend == n):
    pA.setValue(n-1,n-2,-1); pA.setValue(n-1,n-1,2)

# fill the tridiagonal (1D Laplacian) entries of the locally owned rows
for index in range(rstart,rend):
    if (index > 0):
        pA.setValue(index,index-1,-1)
    pA.setValue(index,index,2)
    if (index < n-1):
        pA.setValue(index,index+1,-1)

pA.assemble()
pA = pA.convert(mat_type='aijcusparse')

px,pb = pA.createVecs()
pb.set(1.0); px.set(1.0)

ksp = PETSc.KSP().create()
ksp.setOperators(pA)
ksp.setConvergenceHistory()
ksp.setType('cg')
ksp.getPC().setType('hypre')
ksp.setTolerances(rtol=1e-10)

ksp.solve(pb, px)                           # error is generated here
``

Best,
Anna