[petsc-users] errors with hypre with MPI and multiple GPUs on a node

Junchao Zhang junchao.zhang at gmail.com
Wed Jan 31 17:36:36 CST 2024


Hi Anna,
  Since you said "The code works with pc-type hypre on a single GPU.", I
was wondering if this is a CUDA devices to MPI ranks binding problem.
  You can search TACC documentation to find how its job scheduler binds
GPUs to MPI ranks (usually via manipulating the CUDA_VISIBLE_DEVICES
environment variable)
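
  A quick way to check the binding (just a minimal, untested sketch; the
script name is made up) is to print what each rank sees before PETSc
initializes the device, launched with the same launcher and task count as
the failing job:

``
# check_binding.py: print the GPUs visible to each MPI rank
import os, sys
import petsc4py
petsc4py.init(sys.argv)
from petsc4py import PETSc

comm = PETSc.COMM_WORLD
visible = os.environ.get("CUDA_VISIBLE_DEVICES", "<not set>")
print("rank %d of %d: CUDA_VISIBLE_DEVICES=%s"
      % (comm.getRank(), comm.getSize(), visible))
``

  With 3 tasks and 3 GPUs you would expect each rank to report a different
device; if every rank reports the same device (or the variable is not set at
all), the binding is a likely culprit.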

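  If the scheduler is not doing that binding for you, one workaround to try
(a rough sketch, not a tested fix) is to pin each rank to one GPU yourself
before petsc4py initializes CUDA. This assumes your MVAPICH2 environment
exports MV2_COMM_WORLD_LOCAL_RANK for each task (check your environment);
other MPI stacks use different variable names:

``
import os, sys

# Expose exactly one GPU per rank before PETSc/hypre touch the device.
local_rank = os.environ.get("MV2_COMM_WORLD_LOCAL_RANK")
if local_rank is not None and "CUDA_VISIBLE_DEVICES" not in os.environ:
    os.environ["CUDA_VISIBLE_DEVICES"] = local_rank

import petsc4py
petsc4py.init(sys.argv)
from petsc4py import PETSc
# ... build the matrix and KSP as in your example ...
``
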
  Please follow up if you could not solve it.

  Thanks.
--Junchao Zhang


On Wed, Jan 31, 2024 at 4:07 PM Yesypenko, Anna <anna at oden.utexas.edu>
wrote:

> Dear Petsc devs,
>
> I'm encountering an error when running hypre on a single node with
> multiple GPUs.
> The error occurs in the preconditioner setup phase. I'm trying to
> troubleshoot, but I don't know where to start.
> Are the system routines PetscCUDAInitialize and PetscCUDAInitializeCheck
> available from Python (petsc4py)?
> How do I verify that GPUs are assigned properly to each MPI process? In
> this case, I have 3 tasks and 3 GPUs.
>
> The code works with pc-type hypre on a single GPU.
> Any suggestions are appreciated!
>
> Below is the error trace:
> ``
> TACC:  Starting up job 1490124
> TACC:  Setting up parallel environment for MVAPICH2+mpispawn.
> TACC:  Starting parallel tasks...
> [0]PETSC ERROR:
> ------------------------------------------------------------------------
> [0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation,
> probably memory access out of range
> [0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
> [0]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind and
> https://petsc.org/release/faq/
> [0]PETSC ERROR: or try
> https://docs.nvidia.com/cuda/cuda-memcheck/index.html on NVIDIA CUDA
> systems to find memory corruption errors
> [0]PETSC ERROR: ---------------------  Stack Frames
> ------------------------------------
> [0]PETSC ERROR: The line numbers in the error traceback are not always
> exact.
> [0]PETSC ERROR: #1 hypre_ParCSRMatrixMigrate()
> [0]PETSC ERROR: #2 MatBindToCPU_HYPRE() at
> /work/06368/annayesy/ls6/petsc/src/mat/impls/hypre/mhypre.c:1394
> [0]PETSC ERROR: #3 MatAssemblyEnd_HYPRE() at
> /work/06368/annayesy/ls6/petsc/src/mat/impls/hypre/mhypre.c:1471
> [0]PETSC ERROR: #4 MatAssemblyEnd() at
> /work/06368/annayesy/ls6/petsc/src/mat/interface/matrix.c:5773
> [0]PETSC ERROR: #5 MatConvert_AIJ_HYPRE() at
> /work/06368/annayesy/ls6/petsc/src/mat/impls/hypre/mhypre.c:660
> [0]PETSC ERROR: #6 MatConvert() at
> /work/06368/annayesy/ls6/petsc/src/mat/interface/matrix.c:4421
> [0]PETSC ERROR: #7 PCSetUp_HYPRE() at
> /work/06368/annayesy/ls6/petsc/src/ksp/pc/impls/hypre/hypre.c:245
> [0]PETSC ERROR: #8 PCSetUp() at
> /work/06368/annayesy/ls6/petsc/src/ksp/pc/interface/precon.c:1080
> [0]PETSC ERROR: #9 KSPSetUp() at
> /work/06368/annayesy/ls6/petsc/src/ksp/ksp/interface/itfunc.c:415
> [0]PETSC ERROR: #10 KSPSolve_Private() at
> /work/06368/annayesy/ls6/petsc/src/ksp/ksp/interface/itfunc.c:833
> [0]PETSC ERROR: #11 KSPSolve() at
> /work/06368/annayesy/ls6/petsc/src/ksp/ksp/interface/itfunc.c:1080
> application called MPI_Abort(MPI_COMM_WORLD, 59) - process 0
> ``
>
> Below is a minimum working example:
> ``
> import sys
> import petsc4py
> petsc4py.init(sys.argv)
> from petsc4py import PETSc
>
> n     = int(5e5);
> comm  = PETSc.COMM_WORLD
>
> pA = PETSc.Mat(comm=comm)
> pA.create(comm=comm)
> pA.setSizes((n,n))
> pA.setType(PETSc.Mat.Type.AIJ)
> pA.setPreallocationNNZ(3)
> rstart,rend=pA.getOwnershipRange()
>
> print("\t Processor %d of %d gets indices
> %d:%d"%(comm.Get_rank(),comm.Get_size(),rstart,rend))
> if (rstart == 0):
>     pA.setValue(0,0,2); pA.setValue(0,1,-1)
> if (rend == n):
>     pA.setValue(n-1,n-2,-1); pA.setValue(n-1,n-1,2)
>
> for index in range(rstart,rend):
>     if (index > 0):
>         pA.setValue(index,index-1,-1)
>     pA.setValue(index,index,2)
>     if (index < n-1):
>         pA.setValue(index,index+1,-1)
>
> pA.assemble()
> pA = pA.convert(mat_type='aijcusparse')
>
> px,pb = pA.createVecs()
> pb.set(1.0); px.set(1.0)
>
> ksp = PETSc.KSP().create()
> ksp.setOperators(pA)
> ksp.setConvergenceHistory()
> ksp.setType('cg')
> ksp.getPC().setType('hypre')
> ksp.setTolerances(rtol=1e-10)
>
> ksp.solve(pb, px)                           # error is generated here
> ``
>
> Best,
> Anna
>