[petsc-users] GAMG crash during setup when using multiple GPUs

Sajid Ali Syed sasyed at fnal.gov
Thu Feb 10 13:21:44 CST 2022


Hi PETSc-developers,

I’m seeing a crash during the preconditioner setup phase when running on multiple GPUs. The relevant part of the error trace is shown below:

(GTL DEBUG: 26) cuIpcOpenMemHandle: resource already mapped, CUDA_ERROR_ALREADY_MAPPED, line no 272
[24]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
[24]PETSC ERROR: General MPI error
[24]PETSC ERROR: MPI error 1 Invalid buffer pointer
[24]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting.
[24]PETSC ERROR: Petsc Development GIT revision: f351d5494b5462f62c419e00645ac2e477b88cae  GIT Date: 2022-02-08 15:08:19 +0000
...
[24]PETSC ERROR: #1 PetscSFLinkWaitRequests_MPI() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/vec/is/sf/impls/basic/sfmpi.c:54
[24]PETSC ERROR: #2 PetscSFLinkFinishCommunication() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/include/../src/vec/is/sf/impls/basic/sfpack.h:274
[24]PETSC ERROR: #3 PetscSFBcastEnd_Basic() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/vec/is/sf/impls/basic/sfbasic.c:218
[24]PETSC ERROR: #4 PetscSFBcastEnd() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/vec/is/sf/interface/sf.c:1499
[24]PETSC ERROR: #5 VecScatterEnd_Internal() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/vec/is/sf/interface/vscat.c:87
[24]PETSC ERROR: #6 VecScatterEnd() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/vec/is/sf/interface/vscat.c:1366
[24]PETSC ERROR: #7 MatMult_MPIAIJCUSPARSE() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/mat/impls/aij/mpi/mpicusparse/mpiaijcusparse.cu:302
[24]PETSC ERROR: #8 MatMult() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/mat/interface/matrix.c:2438
[24]PETSC ERROR: #9 PCApplyBAorAB() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/ksp/pc/interface/precon.c:730
[24]PETSC ERROR: #10 KSP_PCApplyBAorAB() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/include/petsc/private/kspimpl.h:421
[24]PETSC ERROR: #11 KSPGMRESCycle() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/ksp/ksp/impls/gmres/gmres.c:162
[24]PETSC ERROR: #12 KSPSolve_GMRES() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/ksp/ksp/impls/gmres/gmres.c:247
[24]PETSC ERROR: #13 KSPSolve_Private() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/ksp/ksp/interface/itfunc.c:925
[24]PETSC ERROR: #14 KSPSolve() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/ksp/ksp/interface/itfunc.c:1103
[24]PETSC ERROR: #15 PCGAMGOptProlongator_AGG() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/ksp/pc/impls/gamg/agg.c:1127
[24]PETSC ERROR: #16 PCSetUp_GAMG() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/ksp/pc/impls/gamg/gamg.c:626
[24]PETSC ERROR: #17 PCSetUp() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/ksp/pc/interface/precon.c:1017
[24]PETSC ERROR: #18 KSPSetUp() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/ksp/ksp/interface/itfunc.c:417
[24]PETSC ERROR: #19 main() at poisson3d.c:69
[24]PETSC ERROR: PETSc Option Table entries:
[24]PETSC ERROR: -dm_mat_type aijcusparse
[24]PETSC ERROR: -dm_vec_type cuda
[24]PETSC ERROR: -ksp_monitor
[24]PETSC ERROR: -ksp_norm_type unpreconditioned
[24]PETSC ERROR: -ksp_type cg
[24]PETSC ERROR: -ksp_view
[24]PETSC ERROR: -log_view
[24]PETSC ERROR: -mg_levels_esteig_ksp_type cg
[24]PETSC ERROR: -mg_levels_ksp_type chebyshev
[24]PETSC ERROR: -mg_levels_pc_type jacobi
[24]PETSC ERROR: -pc_gamg_agg_nsmooths 1
[24]PETSC ERROR: -pc_gamg_square_graph 1
[24]PETSC ERROR: -pc_gamg_threshold 0.0
[24]PETSC ERROR: -pc_gamg_threshold_scale 0.0
[24]PETSC ERROR: -pc_gamg_type agg
[24]PETSC ERROR: -pc_type gamg
[24]PETSC ERROR: ----------------End of Error Message -------send entire error message to petsc-maint at mcs.anl.gov----------
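
In case it helps with reading the attached logs, here is a minimal sketch (Python, standard library only) that condenses stack frames like the ones above into one call chain per rank. The names parse_frames and format_chain are made up for illustration, and the regular expression assumes the frame layout seen in this trace; other PETSc versions may print frames differently.

```python
import re
from collections import defaultdict

# Matches PETSc stack-frame lines of the form seen in the trace above, e.g.
#   [24]PETSC ERROR: #19 main() at poisson3d.c:69
# This layout is an assumption based on this one log, not a documented format.
FRAME_RE = re.compile(
    r"^\[(?P<rank>\d+)\]PETSC ERROR: "
    r"#(?P<depth>\d+) (?P<func>\w+)\(\) at (?P<path>\S+):(?P<line>\d+)$"
)


def parse_frames(log_text):
    """Return {rank: [(depth, function, file basename, line), ...]}, sorted by depth."""
    frames = defaultdict(list)
    for raw in log_text.splitlines():
        m = FRAME_RE.match(raw.strip())
        if m is None:
            continue  # option-table entries, banners, and GTL lines are skipped
        basename = m["path"].rsplit("/", 1)[-1]  # drop the long build-stage prefix
        frames[int(m["rank"])].append(
            (int(m["depth"]), m["func"], basename, int(m["line"]))
        )
    return {rank: sorted(entries) for rank, entries in frames.items()}


def format_chain(entries):
    """Render the frames outermost-first as 'main -> KSPSetUp -> ...'."""
    return " -> ".join(func for _, func, _, _ in reversed(entries))
```

Calling parse_frames on the contents of a log file and then format_chain on each rank's entries gives a quick view of whether every failing rank went through the same path (here, PCSetUp_GAMG down to PetscSFLinkWaitRequests_MPI).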


Attached are the full error log and the submit script for the failing 8-node/32-GPU/32-MPI-rank job. The same program did not crash on 2 or 4 nodes (with 8 and 16 GPUs/MPI ranks, respectively); I have attached those logs and submit scripts as well in case they help. Could someone let me know what this error means and what can be done to prevent it?

Thank you,
Sajid Ali (he/him) | Research Associate

Scientific Computing Division

Fermi National Accelerator Laboratory

http://s-sajid-ali.github.io
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 2n8g.sh
Type: application/x-sh
Size: 686 bytes
Desc: 2n8g.sh
URL: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20220210/dc1d3c96/attachment-0003.sh>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 2n8g-log
Type: application/octet-stream
Size: 88908 bytes
Desc: 2n8g-log
URL: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20220210/dc1d3c96/attachment-0003.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 4n16g.sh
Type: application/x-sh
Size: 687 bytes
Desc: 4n16g.sh
URL: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20220210/dc1d3c96/attachment-0004.sh>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 4n16g-log
Type: application/octet-stream
Size: 89091 bytes
Desc: 4n16g-log
URL: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20220210/dc1d3c96/attachment-0004.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 8n32g.sh
Type: application/x-sh
Size: 687 bytes
Desc: 8n32g.sh
URL: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20220210/dc1d3c96/attachment-0005.sh>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 8n32g-errlog
Type: application/octet-stream
Size: 179380 bytes
Desc: 8n32g-errlog
URL: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20220210/dc1d3c96/attachment-0005.obj>

