[petsc-users] A bad commit affects MOOSE

Satish Balay balay at mcs.anl.gov
Tue Apr 3 10:32:50 CDT 2018


On Tue, 3 Apr 2018, Stefano Zampini wrote:

> 
> > On Apr 3, 2018, at 4:58 PM, Satish Balay <balay at mcs.anl.gov> wrote:
> > 
> > On Tue, 3 Apr 2018, Kong, Fande wrote:
> > 
> >> On Tue, Apr 3, 2018 at 1:17 AM, Smith, Barry F. <bsmith at mcs.anl.gov> wrote:
> >> 
> >>> 
> >>>   Each external package definitely needs its own duplicated communicator;
> >>> cannot share between packages.
> >>> 
> >>>   The only problem with the dups below is if they are in a loop and get
> >>> called many times.
> >>> 
> >> 
> >> 
> >> The "standard test" that has this issue actually has 1K fields. MOOSE
> >> creates its own field-split preconditioner (not based on the PETSc
> >> fieldsplit), and each field is associated with one PCHYPRE.  If PETSc
> >> duplicates communicators, we could easily reach the 2048 limit.
> >> 
> >> I also want to confirm what extra communicators are introduced in the bad
> >> commit.
> > 
> > To me it looks like there is 1 extra comm created [for MATHYPRE] for each PCHYPRE that is created [which also creates one comm for this object].
> > 
> 
> You’re right; however, it was the same before the commit.

https://bitbucket.org/petsc/petsc/commits/49a781f5cee36db85e8d5b951eec29f10ac13593
Before the commit, PCHYPRE was not calling MatConvert(MATHYPRE) [which results in an additional call to MPI_Comm_dup() for hypre calls]; instead, PCHYPRE was calling MatHYPRE_IJMatrixCreate() directly [which, I presume, reuses the comm created by the call to MPI_Comm_dup() in PCHYPRE for hypre calls].
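As a concrete picture of the scenario Fande describes (1K fields, each holding its own PCHYPRE), here is a hypothetical, untested sketch - not MOOSE code, just the pattern - showing why one extra MPI_Comm_dup() per PC matters once roughly a thousand preconditioners are alive at the same time:

/* Hypothetical reproducer sketch (untested): keep one PCHYPRE alive per
 * "field", as MOOSE's own field-split preconditioner does.  If every
 * PCSetUp() now holds two duplicated communicators (one for PCHYPRE, one
 * for its internal MATHYPRE) instead of one, ~1K live PCs can exhaust
 * MPICH's 2048-communicator limit. */
#include <petscpc.h>

int main(int argc, char **argv)
{
  PetscErrorCode ierr;
  Mat            A;
  PC             pc[1024];
  PetscInt       i, n = 1024, row, rstart, rend;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL);if (ierr) return ierr;

  /* a tiny placeholder operator; each field would have its own in MOOSE */
  ierr = MatCreateAIJ(PETSC_COMM_WORLD, PETSC_DECIDE, PETSC_DECIDE, 10, 10, 1, NULL, 0, NULL, &A);CHKERRQ(ierr);
  ierr = MatGetOwnershipRange(A, &rstart, &rend);CHKERRQ(ierr);
  for (row = rstart; row < rend; row++) {
    ierr = MatSetValue(A, row, row, 1.0, INSERT_VALUES);CHKERRQ(ierr);
  }
  ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);

  for (i = 0; i < n; i++) {                /* one live PCHYPRE per field */
    ierr = PCCreate(PETSC_COMM_WORLD, &pc[i]);CHKERRQ(ierr);
    ierr = PCSetType(pc[i], PCHYPRE);CHKERRQ(ierr);
    ierr = PCSetOperators(pc[i], A, A);CHKERRQ(ierr);
    ierr = PCSetUp(pc[i]);CHKERRQ(ierr);   /* each setup duplicates comm(s) */
  }

  for (i = 0; i < n; i++) {ierr = PCDestroy(&pc[i]);CHKERRQ(ierr);}
  ierr = MatDestroy(&A);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}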



> I don’t understand how this specific commit is related to this issue, since the error is not in the MPI_Comm_dup which is inside MatCreate_MATHYPRE. Actually, the error comes from MPI_Comm_create:
> 
>     frame #5: 0x00000001068defd4 libmpi.12.dylib`MPI_Comm_create + 3492
>     frame #6: 0x00000001061345d9 libpetsc.3.07.dylib`hypre_GenerateSubComm(comm=-1006627852, participate=<unavailable>, new_comm_ptr=<unavailable>) + 409 at gen_redcs_mat.c:531 [opt]
>     frame #7: 0x000000010618f8ba libpetsc.3.07.dylib`hypre_GaussElimSetup(amg_data=0x00007fe7ff857a00, level=<unavailable>, relax_type=9) + 74 at par_relax.c:4209 [opt]
>     frame #8: 0x0000000106140e93 libpetsc.3.07.dylib`hypre_BoomerAMGSetup(amg_vdata=<unavailable>, A=0x00007fe80842aff0, f=0x00007fe80842a980, u=0x00007fe80842a510) + 17699 at par_amg_setup.c:2108 [opt]
>     frame #9: 0x0000000105ec773c libpetsc.3.07.dylib`PCSetUp_HYPRE(pc=<unavailable>) + 2540 at hypre.c:226 [opt

I thought this trace comes up after applying your patch:

-    ierr = MatDestroy(&jac->hpmat);CHKERRQ(ierr);
-    ierr = MatConvert(pc->pmat,MATHYPRE,MAT_INITIAL_MATRIX,&jac->hpmat);CHKERRQ(ierr);
+    ierr = MatConvert(pc->pmat,MATHYPRE,jac->hpmat ? MAT_REUSE_MATRIX : MAT_INITIAL_MATRIX,&jac->hpmat);CHKERRQ(ierr);

The stack before this patch was: [it's in a different format - so was it obtained in a different way than the above method?]

preconditioners/pbp.lots_of_variables: Other MPI error, error stack:
preconditioners/pbp.lots_of_variables: PMPI_Comm_dup(177)..................: MPI_Comm_dup(comm=0x84000001, new_comm=0x97d1068) failed
preconditioners/pbp.lots_of_variables: PMPI_Comm_dup(162)..................:
preconditioners/pbp.lots_of_variables: MPIR_Comm_dup_impl(57)..............:
preconditioners/pbp.lots_of_variables: MPIR_Comm_copy(739).................:
preconditioners/pbp.lots_of_variables: MPIR_Get_contextid_sparse_group(614): Too many communicators (0/2048 free on this process; ignore_id=0)

Satish

> 
> How did you perform the bisection? make clean + make all ? Which version of HYPRE are you using?
> 
> > But you might want to verify [by linking with mpi trace library?]
> > 
> > 
> > There are some debugging hints at https://lists.mpich.org/pipermail/discuss/2012-December/000148.html [wrt mpich] - which I haven't checked..
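One lightweight way to do that verification without a full trace library is a tiny PMPI interposer - just a sketch, compiled into its own library and linked ahead of MPI - that counts every MPI_Comm_dup()/MPI_Comm_create():

/* Sketch of a PMPI interposer that counts (and prints) every MPI_Comm_dup
 * and MPI_Comm_create, to verify how many communicators each package
 * really creates.  The MPI_ wrappers intercept the calls; the PMPI_
 * versions forward to the real MPI library. */
#include <mpi.h>
#include <stdio.h>

static int dup_count = 0, create_count = 0;

int MPI_Comm_dup(MPI_Comm comm, MPI_Comm *newcomm)
{
  int rank;
  PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
  if (!rank) fprintf(stderr, "MPI_Comm_dup call #%d\n", ++dup_count);
  return PMPI_Comm_dup(comm, newcomm);
}

int MPI_Comm_create(MPI_Comm comm, MPI_Group group, MPI_Comm *newcomm)
{
  int rank;
  PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
  if (!rank) fprintf(stderr, "MPI_Comm_create call #%d\n", ++create_count);
  return PMPI_Comm_create(comm, group, newcomm);
}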
> > 
> > Satish
> > 
> >> 
> >> 
> >> Fande,
> >> 
> >> 
> >> 
> >>> 
> >>>    To debug the hypre/duplication issue in MOOSE I would run in the
> >>> debugger with a break point in MPI_Comm_dup() and see
> >>> who keeps calling it an unreasonable number of times. (My guess is this is
> >>> a new "feature" in hypre that they will need to fix but only debugging will
> >>> tell)
> >>> 
> >>>   Barry
> >>> 
> >>> 
> >>>> On Apr 2, 2018, at 7:44 PM, Balay, Satish <balay at mcs.anl.gov> wrote:
> >>>> 
> >>>> We do a MPI_Comm_dup() for objects related to externalpackages.
> >>>> 
> >>>> Looks like we added a new mat type, MATHYPRE, in 3.8 that PCHYPRE is
> >>>> using. Previously there was one MPI_Comm_dup() per PCHYPRE - now I think
> >>>> there is one more for MATHYPRE - so more calls to MPI_Comm_dup() in 3.8 vs 3.7.
> >>>> 
> >>>> src/dm/impls/da/hypre/mhyp.c:  ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)B),&(ex->hcomm));CHKERRQ(ierr);
> >>>> src/dm/impls/da/hypre/mhyp.c:  ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)B),&(ex->hcomm));CHKERRQ(ierr);
> >>>> src/dm/impls/swarm/data_ex.c:  ierr = MPI_Comm_dup(comm,&d->comm);CHKERRQ(ierr);
> >>>> src/ksp/pc/impls/hypre/hypre.c:  ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)pc),&(jac->comm_hypre));CHKERRQ(ierr);
> >>>> src/ksp/pc/impls/hypre/hypre.c:  ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)pc),&(ex->hcomm));CHKERRQ(ierr);
> >>>> src/ksp/pc/impls/hypre/hypre.c:  ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)pc),&(ex->hcomm));CHKERRQ(ierr);
> >>>> src/ksp/pc/impls/spai/ispai.c:  ierr      = MPI_Comm_dup(PetscObjectComm((PetscObject)pc),&(ispai->comm_spai));CHKERRQ(ierr);
> >>>> src/mat/examples/tests/ex152.c:  ierr   = MPI_Comm_dup(MPI_COMM_WORLD,&comm);CHKERRQ(ierr);
> >>>> src/mat/impls/aij/mpi/mkl_cpardiso/mkl_cpardiso.c:  ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)A),&(mat_mkl_cpardiso->comm_mkl_cpardiso));CHKERRQ(ierr);
> >>>> src/mat/impls/aij/mpi/mumps/mumps.c:  ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)A),&(mumps->comm_mumps));CHKERRQ(ierr);
> >>>> src/mat/impls/aij/mpi/pastix/pastix.c:    ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)A),&(lu->pastix_comm));CHKERRQ(ierr);
> >>>> src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c:  ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)A),&(lu->comm_superlu));CHKERRQ(ierr);
> >>>> src/mat/impls/hypre/mhypre.c:  ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)B),&hB->comm);CHKERRQ(ierr);
> >>>> src/mat/partition/impls/pmetis/pmetis.c:    ierr   = MPI_Comm_dup(pcomm,&comm);CHKERRQ(ierr);
> >>>> src/sys/mpiuni/mpi.c:    MPI_COMM_SELF, MPI_COMM_WORLD, and a MPI_Comm_dup() of each of these (duplicates of duplicates return the same communicator)
> >>>> src/sys/mpiuni/mpi.c:int MPI_Comm_dup(MPI_Comm comm,MPI_Comm *out)
> >>>> src/sys/objects/pinit.c:      ierr = MPI_Comm_dup(MPI_COMM_WORLD,&local_comm);CHKERRQ(ierr);
> >>>> src/sys/objects/pinit.c:      ierr = MPI_Comm_dup(MPI_COMM_WORLD,&local_comm);CHKERRQ(ierr);
> >>>> src/sys/objects/tagm.c:      ierr = MPI_Comm_dup(comm_in,comm_out);CHKERRQ(ierr);
> >>>> src/sys/utils/mpiu.c:  ierr = MPI_Comm_dup(comm,&local_comm);CHKERRQ(ierr);
> >>>> src/ts/impls/implicit/sundials/sundials.c:  ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)ts),&(cvode->comm_sundials));CHKERRQ(ierr);
> >>>>
> >>>> Perhaps we need a PetscCommDuplicateExternalPkg() to somehow avoid these MPI_Comm_dup() calls?
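[A purely hypothetical sketch of what such a PetscCommDuplicateExternalPkg() could look like - this routine does not exist in PETSc; the idea is to cache one duplicated comm per user comm in an MPI attribute and hand the same duplicate to every external package:]

#include <petscsys.h>

static PetscMPIInt Petsc_ExtPkg_keyval = MPI_KEYVAL_INVALID;

PetscErrorCode PetscCommDuplicateExternalPkg(MPI_Comm comm, MPI_Comm *pkgcomm)
{
  MPI_Comm      *cached;
  PetscMPIInt    flg;
  PetscErrorCode ierr;

  PetscFunctionBegin;
  if (Petsc_ExtPkg_keyval == MPI_KEYVAL_INVALID) {
    ierr = MPI_Comm_create_keyval(MPI_COMM_NULL_COPY_FN,MPI_COMM_NULL_DELETE_FN,&Petsc_ExtPkg_keyval,NULL);CHKERRQ(ierr);
  }
  ierr = MPI_Comm_get_attr(comm,Petsc_ExtPkg_keyval,&cached,&flg);CHKERRQ(ierr);
  if (!flg) {  /* first external package on this comm: do the single real dup */
    ierr = PetscMalloc1(1,&cached);CHKERRQ(ierr);
    ierr = MPI_Comm_dup(comm,cached);CHKERRQ(ierr);
    ierr = MPI_Comm_set_attr(comm,Petsc_ExtPkg_keyval,cached);CHKERRQ(ierr);
  }
  *pkgcomm = *cached;  /* every subsequent package reuses the cached duplicate */
  PetscFunctionReturn(0);
}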
> >>>> 
> >>>> Satish
> >>>> 
> >>>> On Tue, 3 Apr 2018, Smith, Barry F. wrote:
> >>>> 
> >>>>> 
> >>>>> Are we sure this is a PETSc comm issue and not a hypre comm duplication issue?
> >>>>>
> >>>>> frame #6: 0x00000001061345d9 libpetsc.3.07.dylib`hypre_GenerateSubComm(comm=-1006627852, participate=<unavailable>, new_comm_ptr=<unavailable>) + 409 at gen_redcs_mat.c:531 [opt]
> >>>>>
> >>>>> Looks like hypre needs to generate subcomms; perhaps it generates too many?
> >>>>> 
> >>>>>  Barry
> >>>>> 
> >>>>> 
> >>>>>> On Apr 2, 2018, at 7:07 PM, Derek Gaston <friedmud at gmail.com> wrote:
> >>>>>> 
> >>>>>> I’m working with Fande on this and I would like to add a bit more. There are many circumstances where we aren’t working on COMM_WORLD at all (e.g. working on a sub-communicator) but PETSc was initialized using MPI_COMM_WORLD (think multi-level solves)… and we need to create arbitrarily many PETSc vecs/mats/solvers/preconditioners and solve. We definitely can’t rely on using PETSC_COMM_WORLD to avoid triggering duplication.
> >>>>>> 
> >>>>>> Can you explain why PETSc needs to duplicate the communicator so much?
> >>>>>> 
> >>>>>> Thanks for your help in tracking this down!
> >>>>>> 
> >>>>>> Derek
> >>>>>> 
> >>>>>> On Mon, Apr 2, 2018 at 5:44 PM Kong, Fande <fande.kong at inl.gov> wrote:
> >>>>>> Why do we not use user-level MPI communicators directly? What are the potential risks here?
> >>>>>> 
> >>>>>> 
> >>>>>> Fande,
> >>>>>> 
> >>>>>> On Mon, Apr 2, 2018 at 5:08 PM, Satish Balay <balay at mcs.anl.gov> wrote:
> >>>>>> PETSC_COMM_WORLD [via PetscCommDuplicate()] attempts to minimize calls to MPI_Comm_dup() - thus potentially avoiding such errors.
> >>>>>> 
> >>>>>> http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Sys/PetscCommDuplicate.html
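[A minimal usage sketch of the pattern that man page describes - my_package_setup() is a made-up helper, while PetscCommDuplicate()/PetscCommDestroy() are the real calls - repeated requests on the same user communicator return the same cached inner comm instead of triggering a fresh MPI_Comm_dup():]

#include <petscsys.h>

PetscErrorCode my_package_setup(MPI_Comm usercomm)  /* hypothetical helper */
{
  MPI_Comm       pkgcomm;
  PetscMPIInt    tag;
  PetscErrorCode ierr;

  PetscFunctionBegin;
  ierr = PetscCommDuplicate(usercomm,&pkgcomm,&tag);CHKERRQ(ierr); /* real dup happens at most once per usercomm */
  /* ... hand pkgcomm (and the unique tag) to the library here ... */
  ierr = PetscCommDestroy(&pkgcomm);CHKERRQ(ierr);                 /* drops a reference; the inner comm stays cached */
  PetscFunctionReturn(0);
}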
> >>>>>> 
> >>>>>> 
> >>>>>> Satish
> >>>>>> 
> >>>>>> On Mon, 2 Apr 2018, Kong, Fande wrote:
> >>>>>> 
> >>>>>>> On Mon, Apr 2, 2018 at 4:23 PM, Satish Balay <balay at mcs.anl.gov> wrote:
> >>>>>>> 
> >>>>>>>> Does this 'standard test' use MPI_COMM_WORLD to create PETSc objects?
> >>>>>>>> 
> >>>>>>>> If so - you could try changing to PETSC_COMM_WORLD
> >>>>>>>> 
> >>>>>>> 
> >>>>>>> 
> >>>>>>> I do not think we are using PETSC_COMM_WORLD when creating PETSc objects. Why can we not use MPI_COMM_WORLD?
> >>>>>>> 
> >>>>>>> 
> >>>>>>> Fande,
> >>>>>>> 
> >>>>>>> 
> >>>>>>>> 
> >>>>>>>> Satish
> >>>>>>>> 
> >>>>>>>> 
> >>>>>>>> On Mon, 2 Apr 2018, Kong, Fande wrote:
> >>>>>>>> 
> >>>>>>>>> Hi All,
> >>>>>>>>> 
> >>>>>>>>> I am trying to upgrade PETSc from 3.7.6 to 3.8.3 for MOOSE and its
> >>>>>>>>> applications. I have an error message for a standard test:
> >>>>>>>>> 
> >>>>>>>>> preconditioners/pbp.lots_of_variables: MPI had an error
> >>>>>>>>> preconditioners/pbp.lots_of_variables: ------------------------------------------------
> >>>>>>>>> preconditioners/pbp.lots_of_variables: Other MPI error, error stack:
> >>>>>>>>> preconditioners/pbp.lots_of_variables: PMPI_Comm_dup(177)..................: MPI_Comm_dup(comm=0x84000001, new_comm=0x97d1068) failed
> >>>>>>>>> preconditioners/pbp.lots_of_variables: PMPI_Comm_dup(162)..................:
> >>>>>>>>> preconditioners/pbp.lots_of_variables: MPIR_Comm_dup_impl(57)..............:
> >>>>>>>>> preconditioners/pbp.lots_of_variables: MPIR_Comm_copy(739).................:
> >>>>>>>>> preconditioners/pbp.lots_of_variables: MPIR_Get_contextid_sparse_group(614): Too many communicators (0/2048 free on this process; ignore_id=0)
> >>>>>>>>> 
> >>>>>>>>> 
> >>>>>>>>> I did "git bisect', and the following commit introduces this issue:
> >>>>>>>>> 
> >>>>>>>>> commit 49a781f5cee36db85e8d5b951eec29f10ac13593
> >>>>>>>>> Author: Stefano Zampini <stefano.zampini at gmail.com>
> >>>>>>>>> Date:   Sat Nov 5 20:15:19 2016 +0300
> >>>>>>>>>
> >>>>>>>>>     PCHYPRE: use internal Mat of type MatHYPRE
> >>>>>>>>>
> >>>>>>>>>     hpmat already stores two HYPRE vectors
> >>>>>>>>> 
> >>>>>>>>> Before I debug line-by-line, does anyone have a clue about this?
> >>>>>>>>> 
> >>>>>>>>> 
> >>>>>>>>> Fande,
> >>>>>>>>> 
> >>>>>>>> 
> >>>>>>>> 
> >>>>>>> 
> >>>>>> 
> >>>>> 
> >>> 
> >>> 
> 
> 

