[petsc-users] GAMG crash during setup when using multiple GPUs

Sajid Ali Syed sasyed at fnal.gov
Fri Feb 11 10:17:13 CST 2022


Hi Mark,

Thanks for the information.

@Junchao: Given the known issues with GPU-aware MPI, it might be best to wait until there is an updated version of cray-mpich (which hopefully contains the relevant fixes).

Thank You,
Sajid Ali (he/him) | Research Associate
Scientific Computing Division
Fermi National Accelerator Laboratory
s-sajid-ali.github.io

________________________________
From: Mark Adams <mfadams at lbl.gov>
Sent: Thursday, February 10, 2022 8:47 PM
To: Junchao Zhang <junchao.zhang at gmail.com>
Cc: Sajid Ali Syed <sasyed at fnal.gov>; petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>
Subject: Re: [petsc-users] GAMG crash during setup when using multiple GPUs

Perlmutter has problems with GPU-aware MPI.
This is being actively worked on at NERSC.

Mark

On Thu, Feb 10, 2022 at 9:22 PM Junchao Zhang <junchao.zhang at gmail.com> wrote:
Hi, Sajid Ali,
  I have no clue yet. I have access to Perlmutter and am thinking about how to debug this.
  If your app is open source and easy to build, I can build and debug it myself. Otherwise, if you build and install PETSc (only with the options your app needs) to a shared directory where I can access your executable (which uses RPATH for its libraries), then maybe I can debug it there (I would only need to install my own PETSc to that shared directory).

--Junchao Zhang


On Thu, Feb 10, 2022 at 6:04 PM Sajid Ali Syed <sasyed at fnal.gov> wrote:
Hi Junchao,

With "-use_gpu_aware_mpi 0" there is no error. I'm attaching the log for this case to this email.

I also ran with GPU-aware MPI to see if I could reproduce the error; the crash occurred again, but at a different location. That logfile is also attached.

This was with the newest cray-mpich (8.1.12) on NERSC Perlmutter. Let me know if I can share further information to help with debugging this.
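For reference, a run with GPU-aware MPI disabled can be sketched as below; the executable name, launcher, and resource flags are assumptions for illustration, not taken from the actual submit script:

```shell
# Minimal sketch of a launch line with PETSc's GPU-aware MPI path disabled
# via -use_gpu_aware_mpi 0 (the option discussed in this thread).
# "./poisson3d" and the srun flags are illustrative assumptions.
PETSC_OPTS="-dm_mat_type aijcusparse -dm_vec_type cuda -pc_type gamg -use_gpu_aware_mpi 0"
echo "srun -n 64 --gpus-per-node=4 ./poisson3d ${PETSC_OPTS}"
```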

Thank You,
Sajid Ali (he/him) | Research Associate
Scientific Computing Division
Fermi National Accelerator Laboratory
s-sajid-ali.github.io

________________________________
From: Junchao Zhang <junchao.zhang at gmail.com>
Sent: Thursday, February 10, 2022 1:43 PM
To: Sajid Ali Syed <sasyed at fnal.gov>
Cc: petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>
Subject: Re: [petsc-users] GAMG crash during setup when using multiple GPUs

Also, try "-use_gpu_aware_mpi 0" to see if there is a difference.

--Junchao Zhang


On Thu, Feb 10, 2022 at 1:40 PM Junchao Zhang <junchao.zhang at gmail.com> wrote:
Did it fail without GPU at 64 MPI ranks?
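A CPU-only control run at the same rank count would simply swap the CUDA matrix and vector types back to the host defaults; a hypothetical sketch (the executable name and launcher flags are assumptions):

```shell
# Same solver stack on host memory: the aij/standard types avoid the
# GPU-aware MPI code path entirely. "./poisson3d" is an illustrative assumption.
CPU_OPTS="-dm_mat_type aij -dm_vec_type standard -pc_type gamg -ksp_type cg"
echo "srun -n 64 ./poisson3d ${CPU_OPTS}"
```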

--Junchao Zhang


On Thu, Feb 10, 2022 at 1:22 PM Sajid Ali Syed <sasyed at fnal.gov> wrote:

Hi PETSc-developers,

I’m seeing the following crash that occurs during the setup phase of the preconditioner when using multiple GPUs. The relevant error trace is shown below:

(GTL DEBUG: 26) cuIpcOpenMemHandle: resource already mapped, CUDA_ERROR_ALREADY_MAPPED, line no 272
[24]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
[24]PETSC ERROR: General MPI error
[24]PETSC ERROR: MPI error 1 Invalid buffer pointer
[24]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting.
[24]PETSC ERROR: Petsc Development GIT revision: f351d5494b5462f62c419e00645ac2e477b88cae  GIT Date: 2022-02-08 15:08:19 +0000
...
[24]PETSC ERROR: #1 PetscSFLinkWaitRequests_MPI() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/vec/is/sf/impls/basic/sfmpi.c:54
[24]PETSC ERROR: #2 PetscSFLinkFinishCommunication() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/include/../src/vec/is/sf/impls/basic/sfpack.h:274
[24]PETSC ERROR: #3 PetscSFBcastEnd_Basic() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/vec/is/sf/impls/basic/sfbasic.c:218
[24]PETSC ERROR: #4 PetscSFBcastEnd() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/vec/is/sf/interface/sf.c:1499
[24]PETSC ERROR: #5 VecScatterEnd_Internal() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/vec/is/sf/interface/vscat.c:87
[24]PETSC ERROR: #6 VecScatterEnd() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/vec/is/sf/interface/vscat.c:1366
[24]PETSC ERROR: #7 MatMult_MPIAIJCUSPARSE() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/mat/impls/aij/mpi/mpicusparse/mpiaijcusparse.cu:302
[24]PETSC ERROR: #8 MatMult() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/mat/interface/matrix.c:2438
[24]PETSC ERROR: #9 PCApplyBAorAB() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/ksp/pc/interface/precon.c:730
[24]PETSC ERROR: #10 KSP_PCApplyBAorAB() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/include/petsc/private/kspimpl.h:421
[24]PETSC ERROR: #11 KSPGMRESCycle() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/ksp/ksp/impls/gmres/gmres.c:162
[24]PETSC ERROR: #12 KSPSolve_GMRES() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/ksp/ksp/impls/gmres/gmres.c:247
[24]PETSC ERROR: #13 KSPSolve_Private() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/ksp/ksp/interface/itfunc.c:925
[24]PETSC ERROR: #14 KSPSolve() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/ksp/ksp/interface/itfunc.c:1103
[24]PETSC ERROR: #15 PCGAMGOptProlongator_AGG() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/ksp/pc/impls/gamg/agg.c:1127
[24]PETSC ERROR: #16 PCSetUp_GAMG() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/ksp/pc/impls/gamg/gamg.c:626
[24]PETSC ERROR: #17 PCSetUp() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/ksp/pc/interface/precon.c:1017
[24]PETSC ERROR: #18 KSPSetUp() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/ksp/ksp/interface/itfunc.c:417
[24]PETSC ERROR: #19 main() at poisson3d.c:69
[24]PETSC ERROR: PETSc Option Table entries:
[24]PETSC ERROR: -dm_mat_type aijcusparse
[24]PETSC ERROR: -dm_vec_type cuda
[24]PETSC ERROR: -ksp_monitor
[24]PETSC ERROR: -ksp_norm_type unpreconditioned
[24]PETSC ERROR: -ksp_type cg
[24]PETSC ERROR: -ksp_view
[24]PETSC ERROR: -log_view
[24]PETSC ERROR: -mg_levels_esteig_ksp_type cg
[24]PETSC ERROR: -mg_levels_ksp_type chebyshev
[24]PETSC ERROR: -mg_levels_pc_type jacobi
[24]PETSC ERROR: -pc_gamg_agg_nsmooths 1
[24]PETSC ERROR: -pc_gamg_square_graph 1
[24]PETSC ERROR: -pc_gamg_threshold 0.0
[24]PETSC ERROR: -pc_gamg_threshold_scale 0.0
[24]PETSC ERROR: -pc_gamg_type agg
[24]PETSC ERROR: -pc_type gamg
[24]PETSC ERROR: ----------------End of Error Message -------send entire error message to petsc-maint at mcs.anl.gov----------


Attached to this email are the full error log and the submit script for an 8-node job (64 GPUs, 64 MPI ranks). I’ll also note that the same program did not crash when using either 2 or 4 nodes (with 8 and 16 GPUs/MPI ranks, respectively); those logs are attached as well in case they help. Could someone let me know what this error means and what can be done to prevent it?

Thank You,
Sajid Ali (he/him) | Research Associate
Scientific Computing Division
Fermi National Accelerator Laboratory
s-sajid-ali.github.io
