[petsc-users] A bad commit affects MOOSE

Jed Brown jed at jedbrown.org
Tue Apr 3 09:50:17 CDT 2018


The PETSc model is that the "outer" communicator (passed by the caller)
is dup'd to create an "inner" communicator, which is attached (using MPI
attributes) to the outer communicator.  On subsequent calls, PETSc finds
the inner communicator and uses that, instead of dup'ing again.
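
A minimal sketch of that dup-and-attach pattern, in plain MPI with hypothetical
names (the real code lives inside PetscCommDuplicate(), which also manages tags
and reference counts):

    #include <mpi.h>
    #include <stdlib.h>

    static int inner_keyval = MPI_KEYVAL_INVALID;

    /* Look up the inner communicator attached to 'outer'; dup and attach
       it on first use.  Later calls with the same outer comm reuse the
       cached inner comm instead of calling MPI_Comm_dup() again.  (A real
       implementation would also install a delete callback that frees the
       inner comm when the outer comm is freed.) */
    static int GetInnerComm(MPI_Comm outer, MPI_Comm *inner)
    {
      MPI_Comm *cached;
      int       found;

      if (inner_keyval == MPI_KEYVAL_INVALID) {
        MPI_Comm_create_keyval(MPI_COMM_NULL_COPY_FN, MPI_COMM_NULL_DELETE_FN,
                               &inner_keyval, NULL);
      }
      MPI_Comm_get_attr(outer, inner_keyval, &cached, &found);
      if (!found) {
        cached = (MPI_Comm *)malloc(sizeof(MPI_Comm));
        MPI_Comm_dup(outer, cached);                 /* the only actual dup */
        MPI_Comm_set_attr(outer, inner_keyval, cached);
      }
      *inner = *cached;
      return MPI_SUCCESS;
    }

Note that the lookup is keyed on whichever communicator comes in, so each
sub-communicator gets (and later reuses) its own cached dup.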

Derek Gaston <friedmud at gmail.com> writes:

> I like the idea that Hypre (as a package) would get _one_ comm (for all the
> solvers/matrices created) that was duped from the one given to PETSc in
> Vec/MatCreate().
>
> Seems like the tricky part would be figuring out _which_ comm that is based
> on the incoming comm.  For instance - we would definitely have the case
> where we are doing a Hypre solve on effectively COMM_WORLD… and then many
> Hypre solves on sub-communicators (and even Hypre solves on
> sub-communicators of those sub-communicators).  The system for getting
> “the” Hypre Comm would have to match up the incoming Comm in the
> Vec/MatCreate() call and find the correct Hypre comm to use.
>
> Derek
>
>
>
> On Tue, Apr 3, 2018 at 7:46 AM Satish Balay <balay at mcs.anl.gov> wrote:
>
>> Fande claimed 49a781f5cee36db85e8d5b951eec29f10ac13593 made a difference.
>> [so assuming same hypre version was used before and after this commit - for
>> this bisection]
>>
>> So the extra MPI_Comm_dup() calls due to MATHYPRE must be pushing the
>> total communicators over the limit.
>>
>> And wrt debugging - perhaps we need to check MPI_Comm_free() as well?
>> Presumably freed communicators can get reused, so we have to look for
>> outstanding/unfreed communicators?
>>
>> Per message below - MPICH[?] provides a max of 2048 communicators. And
>> there is some discussion of this issue at:
>> https://lists.mpich.org/pipermail/discuss/2012-December/000148.html
>>
>> And wrt 'sharing' - I was thinking in terms of: Can one use MPI_COMM_WORLD
>> with all hypre objects we create? If so - we could somehow attach one more
>> inner-comm - that could be obtained and reused with multiple hypre objects
>> [that got created off the same petsc_comm?]
>>
>> Satish
>>
>> On Tue, 3 Apr 2018, Smith, Barry F. wrote:
>>
>> >
>> >    Each external package definitely needs its own duplicated
>> communicator; it cannot be shared between packages.
>> >
>> >    The only problem with the dups below is if they are in a loop and get
>> called many times.
>> >
>> >     To debug the hypre/duplication issue in MOOSE I would run in the
>> debugger with a breakpoint in MPI_Comm_dup() and see
>> > who keeps calling it an unreasonable number of times. (My guess is this
>> is a new "feature" in hypre that they will need to fix, but only debugging
>> will tell.)
>> >
>> >    Barry
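
As an alternative to a debugger breakpoint, one can interpose on MPI_Comm_dup()
through the standard PMPI profiling interface and log a backtrace for every
call; a rough sketch (assumes glibc's execinfo backtrace facilities, not part
of PETSc or hypre):

    #include <mpi.h>
    #include <stdio.h>
    #include <execinfo.h>

    /* Linked into the application ahead of the MPI library, this wrapper
       counts MPI_Comm_dup() calls and prints who made each one, then
       forwards the call to the real implementation via PMPI_Comm_dup(). */
    int MPI_Comm_dup(MPI_Comm comm, MPI_Comm *newcomm)
    {
      static int count = 0;
      void *frames[32];
      int   depth = backtrace(frames, 32);

      fprintf(stderr, "MPI_Comm_dup call #%d\n", ++count);
      backtrace_symbols_fd(frames, depth, 2 /* stderr */);
      return PMPI_Comm_dup(comm, newcomm);
    }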
>> >
>> >
>> > > On Apr 2, 2018, at 7:44 PM, Balay, Satish <balay at mcs.anl.gov> wrote:
>> > >
>> > > We do an MPI_Comm_dup() for objects related to external packages.
>> > >
>> > > Looks like we added a new mat type, MATHYPRE, in 3.8 that PCHYPRE is
>> > > using. Previously there was one MPI_Comm_dup() for PCHYPRE - now I think
>> > > there is one more for MATHYPRE - so more calls to MPI_Comm_dup() in 3.8 vs 3.7.
>> > >
>> > > src/dm/impls/da/hypre/mhyp.c:  ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)B),&(ex->hcomm));CHKERRQ(ierr);
>> > > src/dm/impls/da/hypre/mhyp.c:  ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)B),&(ex->hcomm));CHKERRQ(ierr);
>> > > src/dm/impls/swarm/data_ex.c:  ierr = MPI_Comm_dup(comm,&d->comm);CHKERRQ(ierr);
>> > > src/ksp/pc/impls/hypre/hypre.c:  ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)pc),&(jac->comm_hypre));CHKERRQ(ierr);
>> > > src/ksp/pc/impls/hypre/hypre.c:  ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)pc),&(ex->hcomm));CHKERRQ(ierr);
>> > > src/ksp/pc/impls/hypre/hypre.c:  ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)pc),&(ex->hcomm));CHKERRQ(ierr);
>> > > src/ksp/pc/impls/spai/ispai.c:  ierr      = MPI_Comm_dup(PetscObjectComm((PetscObject)pc),&(ispai->comm_spai));CHKERRQ(ierr);
>> > > src/mat/examples/tests/ex152.c:  ierr   = MPI_Comm_dup(MPI_COMM_WORLD, &comm);CHKERRQ(ierr);
>> > > src/mat/impls/aij/mpi/mkl_cpardiso/mkl_cpardiso.c:  ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)A),&(mat_mkl_cpardiso->comm_mkl_cpardiso));CHKERRQ(ierr);
>> > > src/mat/impls/aij/mpi/mumps/mumps.c:  ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)A),&(mumps->comm_mumps));CHKERRQ(ierr);
>> > > src/mat/impls/aij/mpi/pastix/pastix.c:    ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)A),&(lu->pastix_comm));CHKERRQ(ierr);
>> > > src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c:  ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)A),&(lu->comm_superlu));CHKERRQ(ierr);
>> > > src/mat/impls/hypre/mhypre.c:  ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)B),&hB->comm);CHKERRQ(ierr);
>> > > src/mat/partition/impls/pmetis/pmetis.c:    ierr   = MPI_Comm_dup(pcomm,&comm);CHKERRQ(ierr);
>> > > src/sys/mpiuni/mpi.c:    MPI_COMM_SELF, MPI_COMM_WORLD, and a MPI_Comm_dup() of each of these (duplicates of duplicates return the same communictor)
>> > > src/sys/mpiuni/mpi.c:int MPI_Comm_dup(MPI_Comm comm,MPI_Comm *out)
>> > > src/sys/objects/pinit.c:      ierr = MPI_Comm_dup(MPI_COMM_WORLD,&local_comm);CHKERRQ(ierr);
>> > > src/sys/objects/pinit.c:      ierr = MPI_Comm_dup(MPI_COMM_WORLD,&local_comm);CHKERRQ(ierr);
>> > > src/sys/objects/tagm.c:      ierr = MPI_Comm_dup(comm_in,comm_out);CHKERRQ(ierr);
>> > > src/sys/utils/mpiu.c:  ierr = MPI_Comm_dup(comm,&local_comm);CHKERRQ(ierr);
>> > > src/ts/impls/implicit/sundials/sundials.c:  ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)ts),&(cvode->comm_sundials));CHKERRQ(ierr);
>> > >
>> > > Perhaps we need a PetscCommDuplicateExternalPkg() to somehow avoid
>> these MPI_Comm_dup() calls?
>> > >
>> > > Satish
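
The PetscCommDuplicateExternalPkg() suggested above does not exist; purely as a
hypothetical sketch, it could cache one dup per (communicator, package) pair via
a per-package MPI keyval, so every hypre (or MUMPS, SuperLU_DIST, ...) object
created on the same PETSc communicator reuses a single inner comm while packages
still never share one:

    #include <petscsys.h>

    /* Hypothetical helper.  pkg_keyval is created once per package with
       MPI_Comm_create_keyval(); the first call on a given communicator
       dups it, later calls return the cached dup. */
    PetscErrorCode PetscCommDuplicateExternalPkg(MPI_Comm comm,int pkg_keyval,MPI_Comm *pkgcomm)
    {
      MPI_Comm      *cached;
      int            found;
      PetscErrorCode ierr;

      PetscFunctionBegin;
      ierr = MPI_Comm_get_attr(comm,pkg_keyval,&cached,&found);CHKERRQ(ierr);
      if (!found) {
        ierr = PetscMalloc1(1,&cached);CHKERRQ(ierr);
        ierr = MPI_Comm_dup(comm,cached);CHKERRQ(ierr);   /* one dup per package per comm */
        ierr = MPI_Comm_set_attr(comm,pkg_keyval,cached);CHKERRQ(ierr);
      }
      *pkgcomm = *cached;
      PetscFunctionReturn(0);
    }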
>> > >
>> > > On Tue, 3 Apr 2018, Smith, Barry F. wrote:
>> > >
>> > >>
>> > >>  Are we sure this is a PETSc comm issue and not a hypre comm
>> duplication issue?
>> > >>
>> > >> frame #6: 0x00000001061345d9 libpetsc.3.07.dylib`hypre_GenerateSubComm(comm=-1006627852, participate=<unavailable>, new_comm_ptr=<unavailable>) + 409 at gen_redcs_mat.c:531 [opt]
>> > >>
>> > >> Looks like hypre needs to generate subcomms; perhaps it generates
>> too many?
>> > >>
>> > >>   Barry
>> > >>
>> > >>
>> > >>> On Apr 2, 2018, at 7:07 PM, Derek Gaston <friedmud at gmail.com> wrote:
>> > >>>
>> > >>> I’m working with Fande on this and I would like to add a bit more.
>> There are many circumstances where we aren’t working on COMM_WORLD at all
>> (e.g. working on a sub-communicator) but PETSc was initialized using
>> MPI_COMM_WORLD (think multi-level solves)… and we need to create
>> arbitrarily many PETSc vecs/mats/solvers/preconditioners and solve.  We
>> definitely can’t rely on using PETSC_COMM_WORLD to avoid triggering
>> duplication.
>> > >>>
>> > >>> Can you explain why PETSc needs to duplicate the communicator so
>> much?
>> > >>>
>> > >>> Thanks for your help in tracking this down!
>> > >>>
>> > >>> Derek
>> > >>>
>> > >>> On Mon, Apr 2, 2018 at 5:44 PM Kong, Fande <fande.kong at inl.gov>
>> wrote:
>> > >>> Why do we not use user-level MPI communicators directly? What are
>> the potential risks here?
>> > >>>
>> > >>>
>> > >>> Fande,
>> > >>>
>> > >>> On Mon, Apr 2, 2018 at 5:08 PM, Satish Balay <balay at mcs.anl.gov>
>> wrote:
>> > >>> PETSC_COMM_WORLD [via PetscCommDuplicate()] attempts to minimize
>> calls to MPI_Comm_dup() - thus potentially avoiding such errors.
>> > >>>
>> > >>>
>> http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Sys/PetscCommDuplicate.html
>> > >>>
>> > >>>
>> > >>> Satish
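
For reference, the caching that PetscCommDuplicate() provides can be seen from a
small usage sketch (not from this thread): repeated calls on the same user
communicator hand back the same inner communicator, so only the first call pays
for an MPI_Comm_dup().

    #include <petscsys.h>

    int main(int argc,char **argv)
    {
      MPI_Comm       inner1,inner2;
      PetscMPIInt    tag1,tag2;
      PetscErrorCode ierr;

      ierr = PetscInitialize(&argc,&argv,NULL,NULL);if (ierr) return ierr;
      ierr = PetscCommDuplicate(MPI_COMM_WORLD,&inner1,&tag1);CHKERRQ(ierr);
      ierr = PetscCommDuplicate(MPI_COMM_WORLD,&inner2,&tag2);CHKERRQ(ierr);
      /* inner1 == inner2: the second call found the attached inner comm and
         only handed out a new tag */
      ierr = PetscCommDestroy(&inner1);CHKERRQ(ierr);
      ierr = PetscCommDestroy(&inner2);CHKERRQ(ierr);
      ierr = PetscFinalize();
      return ierr;
    }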
>> > >>>
>> > >>> On Mon, 2 Apr 2018, Kong, Fande wrote:
>> > >>>
>> > >>>> On Mon, Apr 2, 2018 at 4:23 PM, Satish Balay <balay at mcs.anl.gov>
>> wrote:
>> > >>>>
>> > >>>>> Does this 'standard test' use MPI_COMM_WORLD to create PETSc
>> objects?
>> > >>>>>
>> > >>>>> If so - you could try changing to PETSC_COMM_WORLD
>> > >>>>>
>> > >>>>
>> > >>>>
>> > >>>> I do not think we are using PETSC_COMM_WORLD when creating PETSc
>> objects.
>> > >>>> Why can we not use MPI_COMM_WORLD?
>> > >>>>
>> > >>>>
>> > >>>> Fande,
>> > >>>>
>> > >>>>
>> > >>>>>
>> > >>>>> Satish
>> > >>>>>
>> > >>>>>
>> > >>>>> On Mon, 2 Apr 2018, Kong, Fande wrote:
>> > >>>>>
>> > >>>>>> Hi All,
>> > >>>>>>
>> > >>>>>> I am trying to upgrade PETSc from 3.7.6 to 3.8.3 for MOOSE and its
>> > >>>>>> applications. I have an error message for a standard test:
>> > >>>>>>
>> > >>>>>> preconditioners/pbp.lots_of_variables: MPI had an error
>> > >>>>>> preconditioners/pbp.lots_of_variables: ------------------------------------------------
>> > >>>>>> preconditioners/pbp.lots_of_variables: Other MPI error, error stack:
>> > >>>>>> preconditioners/pbp.lots_of_variables: PMPI_Comm_dup(177)..................: MPI_Comm_dup(comm=0x84000001, new_comm=0x97d1068) failed
>> > >>>>>> preconditioners/pbp.lots_of_variables: PMPI_Comm_dup(162)..................:
>> > >>>>>> preconditioners/pbp.lots_of_variables: MPIR_Comm_dup_impl(57)..............:
>> > >>>>>> preconditioners/pbp.lots_of_variables: MPIR_Comm_copy(739).................:
>> > >>>>>> preconditioners/pbp.lots_of_variables: MPIR_Get_contextid_sparse_group(614): Too many communicators (0/2048 free on this process; ignore_id=0)
>> > >>>>>>
>> > >>>>>>
>> > >>>>>> I did 'git bisect', and the following commit introduces this
>> issue:
>> > >>>>>>
>> > >>>>>> commit 49a781f5cee36db85e8d5b951eec29f10ac13593
>> > >>>>>> Author: Stefano Zampini <stefano.zampini at gmail.com>
>> > >>>>>> Date:   Sat Nov 5 20:15:19 2016 +0300
>> > >>>>>>
>> > >>>>>>     PCHYPRE: use internal Mat of type MatHYPRE
>> > >>>>>>
>> > >>>>>>     hpmat already stores two HYPRE vectors
>> > >>>>>>
>> > >>>>>> Before I debug line-by-line, does anyone have a clue about this?
>> > >>>>>>
>> > >>>>>>
>> > >>>>>> Fande,
>> > >>>>>>
>> > >>>>>
>> > >>>>>
>> > >>>>
>> > >>>
>> > >>
>> >
>> >
>>

