[petsc-users] GAMG crash during setup when using multiple GPUs

Sajid Ali Syed sasyed at fnal.gov
Thu Feb 10 18:04:25 CST 2022


Hi Junchao,

With "-use_gpu_aware_mpi 0" there is no error. I'm attaching the log for this case with this email.

I also ran with gpu aware mpi to see if I could reproduce the error and got the error but from a different location. This logfile is also attached.

This was using the newest cray-mpich on NERSC-perlmutter (8.1.12). Let me know if I can share further information to help with debugging this.
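
For clarity, the two runs differ only in the PETSc option shown below; the launcher invocation and executable name here are placeholders, not the exact contents of the submit script:

  # crashing run: PETSc hands device buffers directly to MPI
  srun <job flags> ./poisson3d <solver options>

  # clean run: PETSc stages communication through host buffers
  srun <job flags> ./poisson3d <solver options> -use_gpu_aware_mpi 0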

Thank You,
Sajid Ali (he/him) | Research Associate
Scientific Computing Division
Fermi National Accelerator Laboratory
s-sajid-ali.github.io

________________________________
From: Junchao Zhang <junchao.zhang at gmail.com>
Sent: Thursday, February 10, 2022 1:43 PM
To: Sajid Ali Syed <sasyed at fnal.gov>
Cc: petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>
Subject: Re: [petsc-users] GAMG crash during setup when using multiple GPUs

Also, try "-use_gpu_aware_mpi 0" to see if there is a difference.

--Junchao Zhang


On Thu, Feb 10, 2022 at 1:40 PM Junchao Zhang <junchao.zhang at gmail.com> wrote:
Did it fail without GPU at 64 MPI ranks?

--Junchao Zhang


On Thu, Feb 10, 2022 at 1:22 PM Sajid Ali Syed <sasyed at fnal.gov> wrote:

Hi PETSc-developers,

I’m seeing a crash during the setup phase of the preconditioner when running with multiple GPUs. The relevant error trace is shown below:

(GTL DEBUG: 26) cuIpcOpenMemHandle: resource already mapped, CUDA_ERROR_ALREADY_MAPPED, line no 272
[24]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
[24]PETSC ERROR: General MPI error
[24]PETSC ERROR: MPI error 1 Invalid buffer pointer
[24]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting.
[24]PETSC ERROR: Petsc Development GIT revision: f351d5494b5462f62c419e00645ac2e477b88cae  GIT Date: 2022-02-08 15:08:19 +0000
...
[24]PETSC ERROR: #1 PetscSFLinkWaitRequests_MPI() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/vec/is/sf/impls/basic/sfmpi.c:54
[24]PETSC ERROR: #2 PetscSFLinkFinishCommunication() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/include/../src/vec/is/sf/impls/basic/sfpack.h:274
[24]PETSC ERROR: #3 PetscSFBcastEnd_Basic() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/vec/is/sf/impls/basic/sfbasic.c:218
[24]PETSC ERROR: #4 PetscSFBcastEnd() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/vec/is/sf/interface/sf.c:1499
[24]PETSC ERROR: #5 VecScatterEnd_Internal() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/vec/is/sf/interface/vscat.c:87
[24]PETSC ERROR: #6 VecScatterEnd() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/vec/is/sf/interface/vscat.c:1366
[24]PETSC ERROR: #7 MatMult_MPIAIJCUSPARSE() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/mat/impls/aij/mpi/mpicusparse/mpiaijcusparse.cu:302
[24]PETSC ERROR: #8 MatMult() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/mat/interface/matrix.c:2438
[24]PETSC ERROR: #9 PCApplyBAorAB() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/ksp/pc/interface/precon.c:730
[24]PETSC ERROR: #10 KSP_PCApplyBAorAB() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/include/petsc/private/kspimpl.h:421
[24]PETSC ERROR: #11 KSPGMRESCycle() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/ksp/ksp/impls/gmres/gmres.c:162
[24]PETSC ERROR: #12 KSPSolve_GMRES() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/ksp/ksp/impls/gmres/gmres.c:247
[24]PETSC ERROR: #13 KSPSolve_Private() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/ksp/ksp/interface/itfunc.c:925
[24]PETSC ERROR: #14 KSPSolve() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/ksp/ksp/interface/itfunc.c:1103
[24]PETSC ERROR: #15 PCGAMGOptProlongator_AGG() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/ksp/pc/impls/gamg/agg.c:1127
[24]PETSC ERROR: #16 PCSetUp_GAMG() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/ksp/pc/impls/gamg/gamg.c:626
[24]PETSC ERROR: #17 PCSetUp() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/ksp/pc/interface/precon.c:1017
[24]PETSC ERROR: #18 KSPSetUp() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/ksp/ksp/interface/itfunc.c:417
[24]PETSC ERROR: #19 main() at poisson3d.c:69
[24]PETSC ERROR: PETSc Option Table entries:
[24]PETSC ERROR: -dm_mat_type aijcusparse
[24]PETSC ERROR: -dm_vec_type cuda
[24]PETSC ERROR: -ksp_monitor
[24]PETSC ERROR: -ksp_norm_type unpreconditioned
[24]PETSC ERROR: -ksp_type cg
[24]PETSC ERROR: -ksp_view
[24]PETSC ERROR: -log_view
[24]PETSC ERROR: -mg_levels_esteig_ksp_type cg
[24]PETSC ERROR: -mg_levels_ksp_type chebyshev
[24]PETSC ERROR: -mg_levels_pc_type jacobi
[24]PETSC ERROR: -pc_gamg_agg_nsmooths 1
[24]PETSC ERROR: -pc_gamg_square_graph 1
[24]PETSC ERROR: -pc_gamg_threshold 0.0
[24]PETSC ERROR: -pc_gamg_threshold_scale 0.0
[24]PETSC ERROR: -pc_gamg_type agg
[24]PETSC ERROR: -pc_type gamg
[24]PETSC ERROR: ----------------End of Error Message -------send entire error message to petsc-maint at mcs.anl.gov----------
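
For convenience, here are the options from the table above collected into a single launch line; the srun resource flags and executable name are guesses based on the job description below, not a copy of the attached submit script:

  srun -N 8 -n 64 ./poisson3d \
    -dm_mat_type aijcusparse -dm_vec_type cuda \
    -ksp_type cg -ksp_norm_type unpreconditioned -ksp_monitor -ksp_view \
    -pc_type gamg -pc_gamg_type agg -pc_gamg_agg_nsmooths 1 \
    -pc_gamg_square_graph 1 -pc_gamg_threshold 0.0 -pc_gamg_threshold_scale 0.0 \
    -mg_levels_ksp_type chebyshev -mg_levels_esteig_ksp_type cg -mg_levels_pc_type jacobi \
    -log_view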


Attached to this email are the full error log and the submit script for an 8-node/64-GPU/64-MPI-rank job. I'll also note that the same program did not crash when using either 2 or 4 nodes (with 8 and 16 GPUs/MPI ranks, respectively); I'm attaching those logs as well in case they help. Could someone let me know what this error means and what can be done to prevent it?
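
For additional context on where in the program the failure occurs, below is a minimal sketch of the kind of driver that reaches the code path in the trace. It is a hypothetical reconstruction, not the actual poisson3d.c:

/* Hypothetical minimal driver: a 3D Poisson problem on a DMDA, with KSP/PC
 * configured entirely from the command line. With -dm_mat_type aijcusparse and
 * -dm_vec_type cuda the matrix and vectors are GPU-resident, and KSPSetUp()
 * triggers PCSetUp_GAMG(), which is where the trace above fails. */
#include <petscdmda.h>
#include <petscksp.h>

int main(int argc, char **argv)
{
  DM             da;
  Mat            A;
  Vec            x, b;
  KSP            ksp;
  PetscInt       i, j, k, xs, ys, zs, xm, ym, zm, M, N, P;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;

  /* 3D structured grid; -dm_mat_type and -dm_vec_type are read in DMSetFromOptions() */
  ierr = DMDACreate3d(PETSC_COMM_WORLD, DM_BOUNDARY_NONE, DM_BOUNDARY_NONE, DM_BOUNDARY_NONE,
                      DMDA_STENCIL_STAR, 64, 64, 64, PETSC_DECIDE, PETSC_DECIDE, PETSC_DECIDE,
                      1, 1, NULL, NULL, NULL, &da);CHKERRQ(ierr);
  ierr = DMSetFromOptions(da);CHKERRQ(ierr);
  ierr = DMSetUp(da);CHKERRQ(ierr);
  ierr = DMDAGetInfo(da, NULL, &M, &N, &P, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL);CHKERRQ(ierr);

  /* Assemble a 7-point Laplacian with homogeneous Dirichlet boundary conditions */
  ierr = DMCreateMatrix(da, &A);CHKERRQ(ierr);
  ierr = DMDAGetCorners(da, &xs, &ys, &zs, &xm, &ym, &zm);CHKERRQ(ierr);
  for (k = zs; k < zs + zm; k++) {
    for (j = ys; j < ys + ym; j++) {
      for (i = xs; i < xs + xm; i++) {
        MatStencil  row, col[7];
        PetscScalar v[7];
        PetscInt    n = 0;
        row.i = i; row.j = j; row.k = k; row.c = 0;
        col[n] = row; v[n++] = 6.0;
        if (i > 0)     { col[n] = row; col[n].i = i - 1; v[n++] = -1.0; }
        if (i < M - 1) { col[n] = row; col[n].i = i + 1; v[n++] = -1.0; }
        if (j > 0)     { col[n] = row; col[n].j = j - 1; v[n++] = -1.0; }
        if (j < N - 1) { col[n] = row; col[n].j = j + 1; v[n++] = -1.0; }
        if (k > 0)     { col[n] = row; col[n].k = k - 1; v[n++] = -1.0; }
        if (k < P - 1) { col[n] = row; col[n].k = k + 1; v[n++] = -1.0; }
        ierr = MatSetValuesStencil(A, 1, &row, n, col, v, INSERT_VALUES);CHKERRQ(ierr);
      }
    }
  }
  ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);

  ierr = DMCreateGlobalVector(da, &b);CHKERRQ(ierr);
  ierr = VecDuplicate(b, &x);CHKERRQ(ierr);
  ierr = VecSet(b, 1.0);CHKERRQ(ierr);

  /* KSP/PC are configured from the options shown in the table above */
  ierr = KSPCreate(PETSC_COMM_WORLD, &ksp);CHKERRQ(ierr);
  ierr = KSPSetOperators(ksp, A, A);CHKERRQ(ierr);
  ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);
  ierr = KSPSetUp(ksp);CHKERRQ(ierr);   /* triggers PCSetUp_GAMG(); the trace above fails here */
  ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);

  ierr = KSPDestroy(&ksp);CHKERRQ(ierr);
  ierr = VecDestroy(&x);CHKERRQ(ierr);
  ierr = VecDestroy(&b);CHKERRQ(ierr);
  ierr = MatDestroy(&A);CHKERRQ(ierr);
  ierr = DMDestroy(&da);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}

Per the trace, the failure is reached inside KSPSetUp(): GAMG's smoothed-aggregation setup runs an inner KSPSolve() (used by PCGAMGOptProlongator_AGG to estimate eigenvalues for smoothing the prolongator), and the MatMult()/VecScatterEnd() communication inside that inner solve is where the GPU-aware MPI exchange fails.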

Thank You,
Sajid Ali (he/him) | Research Associate
Scientific Computing Division
Fermi National Accelerator Laboratory
s-sajid-ali.github.io

-------------- next part --------------
A non-text attachment was scrubbed...
Name: 8n32g-nogpuawarempi-log
Type: application/octet-stream
Size: 90432 bytes
Desc: 8n32g-nogpuawarempi-log
URL: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20220211/88f26c63/attachment-0002.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 8n32g-newerr-log
Type: application/octet-stream
Size: 170257 bytes
Desc: 8n32g-newerr-log
URL: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20220211/88f26c63/attachment-0003.obj>

