[petsc-users] [MPI GPU Aware] KSP_DIVERGED

Junchao Zhang junchao.zhang at gmail.com
Mon Sep 16 11:14:21 CDT 2024


Could you try petsc/main to see if the problem persists?

--Junchao Zhang


On Mon, Sep 16, 2024 at 10:51 AM LEDAC Pierre <Pierre.LEDAC at cea.fr> wrote:

> Hi all,
>
>
> We are using PETSc 3.20 in our code and successfully running several
> solvers on NVIDIA GPUs with an OpenMPI library that is not GPU-aware (so I
> need to add the flag -use_gpu_aware_mpi 0).
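>
> For reference, a typical run on the non-GPU-aware build looks roughly like
> this (the binary name and process count are just placeholders for our
> application; the matrix/vector types are already set to CUDA types in our code):
>
> *mpirun -np 8 ./our_app -ksp_type cg -pc_type gamg -pc_gamg_type classical
> -use_gpu_aware_mpi 0*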
>
>
> But now, when using a GPU-aware OpenMPI library (OpenMPI 4.0.5 or 4.1.5 from
> NVHPC), some parallel calculations fail with *KSP_DIVERGED_ITS* or
> *KSP_DIVERGED_DTOL* in several configurations. It may run fine on a small
> test case (the matrix is symmetric) with:
>
>
> *-ksp_type cg -pc_type gamg -pc_gamg_type classical*
>
>
> But with a larger number of devices, for instance more than 4 or 8, it may
> suddenly fail.
>
>
> If I switch to another solver (BiCGstab), it may converge:
>
>
> *-ksp_type bcgs -pc_type gamg -pc_gamg_type classical*
>
>
> The most sensitive cases, where it diverges, are the following:
>
>
> *-ksp_type cg -pc_type hypre -pc_hypre_type boomeramg*
>
> *-ksp_type cg -pc_type gamg  -pc_gamg_type classical*
>
>
> And the *bcgs* workaround doesn't work every time...
>
>
> It seems to work without problems with aggregation (up to at least 128 GPUs
> in my simulation):
>
> *-ksp_type cg -pc_type gamg -pc_gamg_type agg*
>
>
> So I suspect something odd is happening in my code during the PETSc solve
> with GPU-aware MPI, as all the previous configurations work with
> non-GPU-aware MPI.
>
>
> Here is the -ksp_view output from one failing run with the first configuration:
>
>
> KSP Object: () 8 MPI processes
>   type: cg
>   maximum iterations=10000, nonzero initial guess
>   tolerances:  relative=0., absolute=0.0001, divergence=10000.
>   left preconditioning
>   using UNPRECONDITIONED norm type for convergence test
> PC Object: () 8 MPI processes
>   type: hypre
>     HYPRE BoomerAMG preconditioning
>       Cycle type V
>       Maximum number of levels 25
>       Maximum number of iterations PER hypre call 1
>       Convergence tolerance PER hypre call 0.
>       Threshold for strong coupling 0.7
>       Interpolation truncation factor 0.
>       Interpolation: max elements per row 0
>       Number of levels of aggressive coarsening 0
>       Number of paths for aggressive coarsening 1
>       Maximum row sums 0.9
>       Sweeps down         1
>       Sweeps up           1
>       Sweeps on coarse    1
>       Relax down          l1scaled-Jacobi
>       Relax up            l1scaled-Jacobi
>       Relax on coarse     Gaussian-elimination
>       Relax weight  (all)      1.
>       Outer relax weight (all) 1.
>       Maximum size of coarsest grid 9
>       Minimum size of coarsest grid 1
>       Not using CF-relaxation
>       Not using more complex smoothers.
>       Measure type        local
>       Coarsen type        PMIS
>       Interpolation type  ext+i
>       SpGEMM type         cusparse
>   linear system matrix = precond matrix:
>   Mat Object: () 8 MPI processes
>     type: mpiaijcusparse
>     rows=64000, cols=64000
>     total: nonzeros=311040, allocated nonzeros=311040
>     total number of mallocs used during MatSetValues calls=0
>       not using I-node (on process 0) routines
>
>
> For the moment, I haven't succeeded in creating a reproducer from the ex.c examples...
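>
> For what it's worth, the kind of standalone driver I have been trying looks
> roughly like the sketch below (the 1D Laplacian stencil and the problem size
> are placeholders, not our real operator):
>
> #include <petscksp.h>
>
> int main(int argc, char **argv)
> {
>   Mat      A;
>   Vec      x, b;
>   KSP      ksp;
>   PC       pc;
>   PetscInt n = 64000, Istart, Iend, i;
>
>   PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));
>
>   /* Assemble a simple symmetric test matrix directly as a CUDA matrix */
>   PetscCall(MatCreate(PETSC_COMM_WORLD, &A));
>   PetscCall(MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n));
>   PetscCall(MatSetType(A, MATAIJCUSPARSE));
>   PetscCall(MatSetUp(A));
>   PetscCall(MatGetOwnershipRange(A, &Istart, &Iend));
>   for (i = Istart; i < Iend; i++) {
>     if (i > 0)     PetscCall(MatSetValue(A, i, i - 1, -1.0, INSERT_VALUES));
>     if (i < n - 1) PetscCall(MatSetValue(A, i, i + 1, -1.0, INSERT_VALUES));
>     PetscCall(MatSetValue(A, i, i, 2.0, INSERT_VALUES));
>   }
>   PetscCall(MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY));
>   PetscCall(MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY));
>
>   PetscCall(MatCreateVecs(A, &x, &b));
>   PetscCall(VecSet(b, 1.0));
>
>   /* CG + GAMG; remaining options (-pc_gamg_type, -use_gpu_aware_mpi, ...)
>      are picked up from the command line */
>   PetscCall(KSPCreate(PETSC_COMM_WORLD, &ksp));
>   PetscCall(KSPSetOperators(ksp, A, A));
>   PetscCall(KSPSetType(ksp, KSPCG));
>   PetscCall(KSPGetPC(ksp, &pc));
>   PetscCall(PCSetType(pc, PCGAMG));
>   PetscCall(KSPSetFromOptions(ksp));
>   PetscCall(KSPSolve(ksp, b, x));
>
>   PetscCall(KSPDestroy(&ksp));
>   PetscCall(MatDestroy(&A));
>   PetscCall(VecDestroy(&x));
>   PetscCall(VecDestroy(&b));
>   PetscCall(PetscFinalize());
>   return 0;
> }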
>
>
> Have you seen this kind of behaviour before?
>
> Should I update my PETSc version?
>
>
> Thanks for any advice,
>
>
> Pierre LEDAC
> Commissariat à l’énergie atomique et aux énergies alternatives
> Centre de SACLAY
> DES/ISAS/DM2S/SGLS/LCAN
> Bâtiment 451 – point courrier n°43
> F-91191 Gif-sur-Yvette
> +33 1 69 08 04 03
> +33 6 83 42 05 79
>

