[petsc-users] errors with hypre with MPI and multiple GPUs on a node

Yesypenko, Anna anna at oden.utexas.edu
Wed Jan 31 16:06:51 CST 2024


Dear PETSc devs,

I'm encountering an error running hypre on a single node with multiple GPUs.
The failure happens during the preconditioner setup phase. I'm trying to troubleshoot, but I don't know where to start.
Are the system routines PetscCUDAInitialize and PetscCUDAInitializeCheck available from Python (petsc4py)?
How can I verify that GPUs are assigned correctly to each MPI process? In this case, I have 3 MPI tasks and 3 GPUs on the node.
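For reference, this is the kind of per-rank check I have in mind (a minimal sketch; the use of cupy and of CUDA_VISIBLE_DEVICES as the binding mechanism are assumptions about my environment rather than anything PETSc mandates):
``
import os, sys
import petsc4py
petsc4py.init(sys.argv)
from petsc4py import PETSc

comm = PETSc.COMM_WORLD
rank, size = comm.getRank(), comm.getSize()

# what the launcher/scheduler exposed to this rank
visible = os.environ.get("CUDA_VISIBLE_DEVICES", "<unset>")

# if cupy happens to be installed, ask the CUDA runtime directly
try:
    import cupy
    dev, ndev = cupy.cuda.runtime.getDevice(), cupy.cuda.runtime.getDeviceCount()
except ImportError:
    dev, ndev = "n/a", "n/a"

print("rank %d of %d: CUDA_VISIBLE_DEVICES=%s, current device %s of %s"
      % (rank, size, visible, dev, ndev))
``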

The code works with -pc_type hypre on a single GPU.
Any suggestions are appreciated!

Below is the error trace:
``
TACC:  Starting up job 1490124
TACC:  Setting up parallel environment for MVAPICH2+mpispawn.
TACC:  Starting parallel tasks...
[0]PETSC ERROR: ------------------------------------------------------------------------
[0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[0]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind and https://petsc.org/release/faq/
[0]PETSC ERROR: or try https://docs.nvidia.com/cuda/cuda-memcheck/index.html on NVIDIA CUDA systems to find memory corruption errors
[0]PETSC ERROR: ---------------------  Stack Frames ------------------------------------
[0]PETSC ERROR: The line numbers in the error traceback are not always exact.
[0]PETSC ERROR: #1 hypre_ParCSRMatrixMigrate()
[0]PETSC ERROR: #2 MatBindToCPU_HYPRE() at /work/06368/annayesy/ls6/petsc/src/mat/impls/hypre/mhypre.c:1394
[0]PETSC ERROR: #3 MatAssemblyEnd_HYPRE() at /work/06368/annayesy/ls6/petsc/src/mat/impls/hypre/mhypre.c:1471
[0]PETSC ERROR: #4 MatAssemblyEnd() at /work/06368/annayesy/ls6/petsc/src/mat/interface/matrix.c:5773
[0]PETSC ERROR: #5 MatConvert_AIJ_HYPRE() at /work/06368/annayesy/ls6/petsc/src/mat/impls/hypre/mhypre.c:660
[0]PETSC ERROR: #6 MatConvert() at /work/06368/annayesy/ls6/petsc/src/mat/interface/matrix.c:4421
[0]PETSC ERROR: #7 PCSetUp_HYPRE() at /work/06368/annayesy/ls6/petsc/src/ksp/pc/impls/hypre/hypre.c:245
[0]PETSC ERROR: #8 PCSetUp() at /work/06368/annayesy/ls6/petsc/src/ksp/pc/interface/precon.c:1080
[0]PETSC ERROR: #9 KSPSetUp() at /work/06368/annayesy/ls6/petsc/src/ksp/ksp/interface/itfunc.c:415
[0]PETSC ERROR: #10 KSPSolve_Private() at /work/06368/annayesy/ls6/petsc/src/ksp/ksp/interface/itfunc.c:833
[0]PETSC ERROR: #11 KSPSolve() at /work/06368/annayesy/ls6/petsc/src/ksp/ksp/interface/itfunc.c:1080
application called MPI_Abort(MPI_COMM_WORLD, 59) - process 0
``

Below is a minimal working example:
``
import sys
import petsc4py
petsc4py.init(sys.argv)
from petsc4py import PETSc

n     = int(5e5)
comm  = PETSc.COMM_WORLD

pA = PETSc.Mat().create(comm=comm)
pA.setSizes((n,n))
pA.setType(PETSc.Mat.Type.AIJ)
pA.setPreallocationNNZ(3)
rstart,rend=pA.getOwnershipRange()

print("\t Processor %d of %d gets indices %d:%d"%(comm.Get_rank(),comm.Get_size(),rstart,rend))
# tridiagonal 1D Laplacian stencil: row i gets (-1, 2, -1), truncated at the ends
for index in range(rstart,rend):
    if (index > 0):
        pA.setValue(index,index-1,-1)
    pA.setValue(index,index,2)
    if (index < n-1):
        pA.setValue(index,index+1,-1)

pA.assemble()
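# convert the assembled AIJ matrix to the GPU (aijcusparse) format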
pA = pA.convert(mat_type='aijcusparse')

px,pb = pA.createVecs()
pb.set(1.0); px.set(1.0)

ksp = PETSc.KSP().create()
ksp.setOperators(pA)
ksp.setConvergenceHistory()
ksp.setType('cg')
ksp.getPC().setType('hypre')
ksp.setTolerances(rtol=1e-10)

ksp.solve(pb, px)                           # error is generated here
``
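In case it is useful while narrowing this down, the solver block can equivalently be driven from the command line instead of being hard-coded; a minimal sketch (it reuses pA, pb, px from the script above, and the options shown are the standard PETSc ones):
``
# same solve, but take the KSP/PC configuration from the command line, e.g.
#   -ksp_type cg -pc_type hypre -ksp_rtol 1e-10
# (or -pc_type gamg / -pc_type jacobi to compare against hypre)
ksp = PETSc.KSP().create(comm=comm)
ksp.setOperators(pA)          # pA is the aijcusparse matrix assembled above
ksp.setFromOptions()          # picks up -ksp_type, -pc_type, -ksp_rtol, ...
ksp.solve(pb, px)
``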

Best,
Anna