[petsc-users] [MPI GPU Aware] KSP_DIVERGED

Junchao Zhang junchao.zhang at gmail.com
Wed Sep 18 09:58:54 CDT 2024


Pierre, thanks for the additional information.
This is scary. If this is really an OpenMPI bug, we don't have a PETSc test
to catch it, and currently we have no clue what went wrong.
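
For what it's worth, here is a minimal sketch of the kind of standalone
reproducer that could help (a plain SPD tridiagonal system solved through
KSP; the matrix, sizes, and options are illustrative assumptions, not taken
from Pierre's code). Run with, e.g., -mat_type mpiaijcusparse -ksp_type cg
-pc_type gamg -pc_gamg_type classical:

#include <petscksp.h>

int main(int argc, char **argv)
{
  Mat      A;
  Vec      x, b;
  KSP      ksp;
  PetscInt n = 64000, Istart, Iend, i;

  PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));
  /* Assemble a simple SPD tridiagonal operator */
  PetscCall(MatCreate(PETSC_COMM_WORLD, &A));
  PetscCall(MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n));
  PetscCall(MatSetFromOptions(A)); /* picks up -mat_type mpiaijcusparse */
  PetscCall(MatSetUp(A));
  PetscCall(MatGetOwnershipRange(A, &Istart, &Iend));
  for (i = Istart; i < Iend; i++) {
    if (i > 0) PetscCall(MatSetValue(A, i, i - 1, -1.0, INSERT_VALUES));
    if (i < n - 1) PetscCall(MatSetValue(A, i, i + 1, -1.0, INSERT_VALUES));
    PetscCall(MatSetValue(A, i, i, 2.0, INSERT_VALUES));
  }
  PetscCall(MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY));
  PetscCall(MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY));
  /* Solve A x = b with solver options taken from the command line */
  PetscCall(MatCreateVecs(A, &x, &b));
  PetscCall(VecSet(b, 1.0));
  PetscCall(KSPCreate(PETSC_COMM_WORLD, &ksp));
  PetscCall(KSPSetOperators(ksp, A, A));
  PetscCall(KSPSetFromOptions(ksp));
  PetscCall(KSPSolve(ksp, b, x));
  PetscCall(KSPDestroy(&ksp));
  PetscCall(VecDestroy(&x));
  PetscCall(VecDestroy(&b));
  PetscCall(MatDestroy(&A));
  PetscCall(PetscFinalize());
  return 0;
}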

--Junchao Zhang


On Wed, Sep 18, 2024 at 5:09 AM LEDAC Pierre <Pierre.LEDAC at cea.fr> wrote:

> Junchao, I just tried the latest PETSc version from the main branch with
> OpenMPI 4.1.5-cuda. I still get KSP_DIVERGED on some parallel calculations.
>
> The problem is solved when moving to OpenMPI 5.0.5-cuda.
>
>
> Pierre LEDAC
> Commissariat à l’énergie atomique et aux énergies alternatives
> Centre de SACLAY
> DES/ISAS/DM2S/SGLS/LCAN
> Bâtiment 451 – point courrier n°43
> F-91191 Gif-sur-Yvette
> +33 1 69 08 04 03
> +33 6 83 42 05 79
> ------------------------------
> *From:* LEDAC Pierre
> *Sent:* Tuesday, September 17, 2024 19:43:32
> *To:* Junchao Zhang
> *Cc:* petsc-users; ROUMET Elie
> *Subject:* RE: [petsc-users] [MPI GPU Aware] KSP_DIVERGED
>
>
> Yes. Only OpenMPI 5.0.5 with PETSc 3.20.
>
>
> Pierre LEDAC
> Commissariat à l’énergie atomique et aux énergies alternatives
> Centre de SACLAY
> DES/ISAS/DM2S/SGLS/LCAN
> Bâtiment 451 – point courrier n°43
> F-91191 Gif-sur-Yvette
> +33 1 69 08 04 03
> +33 6 83 42 05 79
> ------------------------------
> *From:* Junchao Zhang <junchao.zhang at gmail.com>
> *Sent:* Tuesday, September 17, 2024 18:09:44
> *To:* LEDAC Pierre
> *Cc:* petsc-users; ROUMET Elie
> *Subject:* Re: [petsc-users] [MPI GPU Aware] KSP_DIVERGED
>
> Did you "fix" the problem with OpenMPI 5, but keep PETSc unchanged (i.e.,
> still 3.20)?
>
> --Junchao Zhang
>
>
> On Tue, Sep 17, 2024 at 9:47 AM LEDAC Pierre <Pierre.LEDAC at cea.fr> wrote:
>
>> Thanks Satish, and nice guess for OpenMPI 5!
>>
>>
>> It seems to solve the issue (at least on my GPU box, where I reproduced
>> the issue with 8 MPI ranks and OpenMPI 4.x).
>>
>>
>> Unfortunately, none of the clusters we currently use has a module with
>> OpenMPI 5.x. It seems I need to build it myself to really confirm.
>>
>>
>> We will probably prevent users from configuring our code with
>> OpenMPI-cuda 4.x, because it is a really weird bug.
>>
>>
>> Pierre LEDAC
>> Commissariat à l’énergie atomique et aux énergies alternatives
>> Centre de SACLAY
>> DES/ISAS/DM2S/SGLS/LCAN
>> Bâtiment 451 – point courrier n°43
>> F-91191 Gif-sur-Yvette
>> +33 1 69 08 04 03
>> +33 6 83 42 05 79
>> ------------------------------
>> *From:* Satish Balay <balay.anl at fastmail.org>
>> *Sent:* Tuesday, September 17, 2024 15:39:22
>> *To:* LEDAC Pierre
>> *Cc:* Junchao Zhang; petsc-users; ROUMET Elie
>> *Subject:* Re: [petsc-users] [MPI GPU Aware] KSP_DIVERGED
>>
>> On Tue, 17 Sep 2024, LEDAC Pierre wrote:
>>
>> > Thanks all, I will try and report.
>> >
>> >
>> > Last question: if I use the "-use_gpu_aware_mpi 0" flag with a GPU-aware
>> > MPI library, does PETSc disable GPU intra/inter-node communications and
>> > send MPI buffers as usual (with extra Device<->Host copies)?
>>
>> Yes.
>>
>> Note: with regard to using MPI that is not GPU-aware, we are changing the
>> default behavior so that the "-use_gpu_aware_mpi 0" flag is no longer
>> required.
>>
>> https://gitlab.com/petsc/petsc/-/merge_requests/7813
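>>
>> For example (a sketch; ./app stands in for a hypothetical application
>> linked against PETSc), with a GPU-aware MPI you can still force staging
>> through the host:
>>
>> mpiexec -n 8 ./app -ksp_type cg -pc_type hypre -use_gpu_aware_mpi 0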
>>
>>
>> Satish
>>
>> >
>> >
>> > Thanks,
>> >
>> >
>> > Pierre LEDAC
>> > Commissariat à l’énergie atomique et aux énergies alternatives
>> > Centre de SACLAY
>> > DES/ISAS/DM2S/SGLS/LCAN
>> > Bâtiment 451 – point courrier n°43
>> > F-91191 Gif-sur-Yvette
>> > +33 1 69 08 04 03
>> > +33 6 83 42 05 79
>> > ________________________________
>> > From: Satish Balay <balay.anl at fastmail.org>
>> > Sent: Monday, September 16, 2024 18:57:02
>> > To: Junchao Zhang
>> > Cc: LEDAC Pierre; petsc-users at mcs.anl.gov; ROUMET Elie
>> > Subject: Re: [petsc-users] [MPI GPU Aware] KSP_DIVERGED
>> >
>> > And/or try the latest OpenMPI [or MPICH] and see if that makes a
>> > difference.
>> >
>> > --download-mpich or --download-openmpi with the latest PETSc should build
>> > a GPU-aware MPI.
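>> >
>> > For example (a sketch; options beyond these two depend on your site), a
>> > CUDA build that downloads and builds a GPU-aware OpenMPI could be
>> > configured with:
>> >
>> > ./configure --with-cuda --download-openmpi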
>> >
>> > Satish
>> >
>> > On Mon, 16 Sep 2024, Junchao Zhang wrote:
>> >
>> > > Could you try petsc/main to see if the problem persists?
>> > >
>> > > --Junchao Zhang
>> > >
>> > >
>> > > On Mon, Sep 16, 2024 at 10:51 AM LEDAC Pierre <Pierre.LEDAC at cea.fr>
>> wrote:
>> > >
>> > > > Hi all,
>> > > >
>> > > >
>> > > > We are using PETSc 3.20 in our code and successfully running several
>> > > > solvers on NVIDIA GPUs with an OpenMPI library that is not GPU-aware
>> > > > (so I need to add the flag -use_gpu_aware_mpi 0).
>> > > >
>> > > >
>> > > > But now, when using a GPU-aware OpenMPI library (OpenMPI 4.0.5 or
>> > > > 4.1.5 from NVHPC), some parallel calculations fail with
>> > > > KSP_DIVERGED_ITS or KSP_DIVERGED_DTOL in several configurations.
>> > > >
>> > > > It may run well on a small test case (the matrix is symmetric) with:
>> > > >
>> > > > -ksp_type cg -pc_type gamg -pc_gamg_type classical
>> > > >
>> > > > But with a larger number of devices, for instance more than 4 or 8,
>> > > > it may suddenly fail.
>> > > >
>> > > >
>> > > > If I switch to another solver (BiCGStab), it may converge:
>> > > >
>> > > > -ksp_type bcgs -pc_type gamg -pc_gamg_type classical
>> > > >
>> > > > The most sensitive cases, where it diverges, are the following:
>> > > >
>> > > > -ksp_type cg -pc_type hypre -pc_hypre_type boomeramg
>> > > > -ksp_type cg -pc_type gamg -pc_gamg_type classical
>> > > >
>> > > > And the bcgs workaround doesn't work every time...
>> > > >
>> > > >
>> > > > It seems to work without problems with aggregation (on at least 128
>> > > > GPUs in my simulation):
>> > > >
>> > > > -ksp_type cg -pc_type gamg -pc_gamg_type agg
>> > > >
>> > > > So I guess something weird is happening during the PETSc solve with
>> > > > GPU-aware MPI, as all the previous configurations work with
>> > > > non-GPU-aware MPI.
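>> > > >
>> > > > For reference, here is a sketch of how we launch these runs (./app is
>> > > > a hypothetical stand-in for our executable):
>> > > >
>> > > > mpirun -np 8 ./app -ksp_type cg -pc_type gamg -pc_gamg_type classical
>> > > > -ksp_converged_reason -ksp_view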
>> > > >
>> > > >
>> > > > Here is the -ksp_view log from one failure with the first
>> > > > configuration:
>> > > >
>> > > > KSP Object: () 8 MPI processes
>> > > >   type: cg
>> > > >   maximum iterations=10000, nonzero initial guess
>> > > >   tolerances:  relative=0., absolute=0.0001, divergence=10000.
>> > > >   left preconditioning
>> > > >   using UNPRECONDITIONED norm type for convergence test
>> > > > PC Object: () 8 MPI processes
>> > > >   type: hypre
>> > > >     HYPRE BoomerAMG preconditioning
>> > > >       Cycle type V
>> > > >       Maximum number of levels 25
>> > > >       Maximum number of iterations PER hypre call 1
>> > > >       Convergence tolerance PER hypre call 0.
>> > > >       Threshold for strong coupling 0.7
>> > > >       Interpolation truncation factor 0.
>> > > >       Interpolation: max elements per row 0
>> > > >       Number of levels of aggressive coarsening 0
>> > > >       Number of paths for aggressive coarsening 1
>> > > >       Maximum row sums 0.9
>> > > >       Sweeps down         1
>> > > >       Sweeps up           1
>> > > >       Sweeps on coarse    1
>> > > >       Relax down          l1scaled-Jacobi
>> > > >       Relax up            l1scaled-Jacobi
>> > > >       Relax on coarse     Gaussian-elimination
>> > > >       Relax weight  (all)      1.
>> > > >       Outer relax weight (all) 1.
>> > > >       Maximum size of coarsest grid 9
>> > > >       Minimum size of coarsest grid 1
>> > > >       Not using CF-relaxation
>> > > >       Not using more complex smoothers.
>> > > >       Measure type        local
>> > > >       Coarsen type        PMIS
>> > > >       Interpolation type  ext+i
>> > > >       SpGEMM type         cusparse
>> > > >   linear system matrix = precond matrix:
>> > > >   Mat Object: () 8 MPI processes
>> > > >     type: mpiaijcusparse
>> > > >     rows=64000, cols=64000
>> > > >     total: nonzeros=311040, allocated nonzeros=311040
>> > > >     total number of mallocs used during MatSetValues calls=0
>> > > >       not using I-node (on process 0) routines
>> > > >
>> > > >
>> > > > For the moment, I haven't succeeded in creating a reproducer with the
>> > > > ex.c examples...
>> > > >
>> > > >
>> > > > Have you seen this kind of behaviour before?
>> > > >
>> > > > Should I update my PETSc version?
>> > > >
>> > > >
>> > > > Thanks for any advice,
>> > > >
>> > > >
>> > > > Pierre LEDAC
>> > > > Commissariat à l’énergie atomique et aux énergies alternatives
>> > > > Centre de SACLAY
>> > > > DES/ISAS/DM2S/SGLS/LCAN
>> > > > Bâtiment 451 – point courrier n°43
>> > > > F-91191 Gif-sur-Yvette
>> > > > +33 1 69 08 04 03
>> > > > +33 6 83 42 05 79
>> > > >
>> > >
>> >
>>
>