[petsc-users] A bad commit affects MOOSE
Derek Gaston
friedmud at gmail.com
Tue Apr 3 09:37:55 CDT 2018
I like the idea that Hypre (as a package) would get _one_ comm (for all the
solvers/matrices created) that was duped from the one given to PETSc in
Vec/MatCreate().
Seems like the tricky part would be figuring out _which_ comm that is based
on the incoming comm. For instance - we would definitely have the case
where we are doing a Hypre solve on effectively COMM_WORLD… and then many
Hypre solves on sub-communicators (and even Hypre solves on
sub-communicators of those sub-communicators). The system for getting
“the” Hypre Comm would have to match up the incoming Comm in the
Vec/MatCreate() call and find the correct Hypre comm to use.
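To make that concrete, here is a rough sketch of the kind of caching I mean (my own illustration only - nothing that exists in PETSc or hypre today; the names PackageCommGet and package_comm_keyval are made up). It uses MPI attribute caching so the first request on a given incoming comm dups it once, and every later request on that same comm (including each sub-communicator, which gets its own entry) returns the cached duplicate:

#include <stdlib.h>
#include <mpi.h>

/* Hypothetical sketch - not PETSc or hypre API.  One inner communicator is
   duplicated per incoming communicator and cached on it via MPI attribute
   caching, so later requests on the same comm reuse the same duplicate. */
static int package_comm_keyval = MPI_KEYVAL_INVALID;

static int PackageCommDelete(MPI_Comm comm, int keyval, void *attr, void *extra)
{
  MPI_Comm *inner = (MPI_Comm*)attr;
  MPI_Comm_free(inner);   /* release the cached duplicate when the outer comm goes away */
  free(inner);
  return MPI_SUCCESS;
}

int PackageCommGet(MPI_Comm outer, MPI_Comm *inner)
{
  MPI_Comm *cached;
  int       found;

  if (package_comm_keyval == MPI_KEYVAL_INVALID) {
    MPI_Comm_create_keyval(MPI_COMM_NULL_COPY_FN, PackageCommDelete, &package_comm_keyval, NULL);
  }
  MPI_Comm_get_attr(outer, package_comm_keyval, &cached, &found);
  if (!found) {           /* first request on this comm: dup once and remember it */
    cached = (MPI_Comm*)malloc(sizeof(MPI_Comm));
    MPI_Comm_dup(outer, cached);
    MPI_Comm_set_attr(outer, package_comm_keyval, cached);
  }
  *inner = *cached;
  return 0;
}

Something like that would make the "match up the incoming Comm" step automatic: the lookup key is simply the incoming comm itself.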
Derek
On Tue, Apr 3, 2018 at 7:46 AM Satish Balay <balay at mcs.anl.gov> wrote:
> Fande claimed 49a781f5cee36db85e8d5b951eec29f10ac13593 made a difference.
> [so assuming the same hypre version was used before and after this commit - for
> this bisection]
>
> So the extra MPI_Comm_dup() calls due to MATHYPRE must be pushing the
> total communicators over the limit.
>
> And wrt debugging - perhaps we need to check MPI_Comm_free() as well?
> Presumably freed communicators can get reused so we have to look for
> outstanding/unfreed communicators?
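>
> One way to check the dup/free balance: a tiny PMPI interposition layer
> [just a debugging sketch of mine - not anything in PETSc or hypre -
> assuming the MPI implementation provides the standard profiling
> interface, which both MPICH and OpenMPI do]:
>
> #include <mpi.h>
> #include <stdio.h>
>
> /* count MPI_Comm_dup()/MPI_Comm_free() calls and report how many
>    communicators are still outstanding */
> static int dups, frees;
>
> int MPI_Comm_dup(MPI_Comm comm, MPI_Comm *newcomm)
> {
>   dups++;
>   printf("MPI_Comm_dup #%d (outstanding: %d)\n", dups, dups - frees);
>   return PMPI_Comm_dup(comm, newcomm);
> }
>
> int MPI_Comm_free(MPI_Comm *comm)
> {
>   frees++;
>   return PMPI_Comm_free(comm);
> }
>
> Linking that ahead of the MPI library would show whether we hit the 2048
> limit because communicators are never freed, or simply because too many
> get duplicated.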
>
> Per message below - MPICH[?] provides a max of 2048 communicators. And
> there is some discussion of this issue at:
> https://lists.mpich.org/pipermail/discuss/2012-December/000148.html
>
> And wrt 'sharing' - I was thinking in terms of: Can one use MPI_COMM_WORLD
> with all hypre objects we create? If so - we could somehow attach one more
> inner-comm - that could be obtained and reused with multiple hypre objects
> [that got created off the same petsc_comm?]
>
> Satish
>
> On Tue, 3 Apr 2018, Smith, Barry F. wrote:
>
> >
> > Each external package definitely needs its own duplicated communicator; it cannot be shared between packages.
> >
> > The only problem with the dups below is if they are in a loop and get
> called many times.
> >
> > To debug the hypre/duplication issue in MOOSE I would run in the
> debugger with a break point in MPI_Comm_dup() and see
> > who keeps calling it an unreasonable number of times. (My guess is this
> is a new "feature" in hypre that they will need to fix but only debugging
> will tell)
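> >
> >    If attaching the debugger is awkward, an equivalent trick [a sketch
> > only, not PETSc code] is to intercept MPI_Comm_dup() through the PMPI
> > profiling interface and print each caller's stack with backtrace() from
> > <execinfo.h>:
> >
> > #include <mpi.h>
> > #include <stdio.h>
> > #include <execinfo.h>
> >
> > /* print who is calling MPI_Comm_dup(), then forward to the real call */
> > int MPI_Comm_dup(MPI_Comm comm, MPI_Comm *newcomm)
> > {
> >   void *frames[32];
> >   int   n = backtrace(frames, 32);
> >   backtrace_symbols_fd(frames, n, fileno(stderr));
> >   return PMPI_Comm_dup(comm, newcomm);
> > }
> >
> > Whichever stack keeps repeating is the caller to look at.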
> >
> > Barry
> >
> >
> > > On Apr 2, 2018, at 7:44 PM, Balay, Satish <balay at mcs.anl.gov> wrote:
> > >
> > > We do an MPI_Comm_dup() for objects related to external packages.
> > >
> > > Looks like we added a new mat type, MATHYPRE, in 3.8 that PCHYPRE is
> > > using. Previously there was one MPI_Comm_dup() for PCHYPRE - now I think
> > > there is one more for MATHYPRE - so there are more calls to MPI_Comm_dup() in 3.8 vs 3.7.
> > >
> > > src/dm/impls/da/hypre/mhyp.c: ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)B),&(ex->hcomm));CHKERRQ(ierr);
> > > src/dm/impls/da/hypre/mhyp.c: ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)B),&(ex->hcomm));CHKERRQ(ierr);
> > > src/dm/impls/swarm/data_ex.c: ierr = MPI_Comm_dup(comm,&d->comm);CHKERRQ(ierr);
> > > src/ksp/pc/impls/hypre/hypre.c: ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)pc),&(jac->comm_hypre));CHKERRQ(ierr);
> > > src/ksp/pc/impls/hypre/hypre.c: ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)pc),&(ex->hcomm));CHKERRQ(ierr);
> > > src/ksp/pc/impls/hypre/hypre.c: ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)pc),&(ex->hcomm));CHKERRQ(ierr);
> > > src/ksp/pc/impls/spai/ispai.c: ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)pc),&(ispai->comm_spai));CHKERRQ(ierr);
> > > src/mat/examples/tests/ex152.c: ierr = MPI_Comm_dup(MPI_COMM_WORLD, &comm);CHKERRQ(ierr);
> > > src/mat/impls/aij/mpi/mkl_cpardiso/mkl_cpardiso.c: ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)A),&(mat_mkl_cpardiso->comm_mkl_cpardiso));CHKERRQ(ierr);
> > > src/mat/impls/aij/mpi/mumps/mumps.c: ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)A),&(mumps->comm_mumps));CHKERRQ(ierr);
> > > src/mat/impls/aij/mpi/pastix/pastix.c: ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)A),&(lu->pastix_comm));CHKERRQ(ierr);
> > > src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c: ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)A),&(lu->comm_superlu));CHKERRQ(ierr);
> > > src/mat/impls/hypre/mhypre.c: ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)B),&hB->comm);CHKERRQ(ierr);
> > > src/mat/partition/impls/pmetis/pmetis.c: ierr = MPI_Comm_dup(pcomm,&comm);CHKERRQ(ierr);
> > > src/sys/mpiuni/mpi.c: MPI_COMM_SELF, MPI_COMM_WORLD, and a MPI_Comm_dup() of each of these (duplicates of duplicates return the same communicator)
> > > src/sys/mpiuni/mpi.c:int MPI_Comm_dup(MPI_Comm comm,MPI_Comm *out)
> > > src/sys/objects/pinit.c: ierr = MPI_Comm_dup(MPI_COMM_WORLD,&local_comm);CHKERRQ(ierr);
> > > src/sys/objects/pinit.c: ierr = MPI_Comm_dup(MPI_COMM_WORLD,&local_comm);CHKERRQ(ierr);
> > > src/sys/objects/tagm.c: ierr = MPI_Comm_dup(comm_in,comm_out);CHKERRQ(ierr);
> > > src/sys/utils/mpiu.c: ierr = MPI_Comm_dup(comm,&local_comm);CHKERRQ(ierr);
> > > src/ts/impls/implicit/sundials/sundials.c: ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)ts),&(cvode->comm_sundials));CHKERRQ(ierr);
> > >
> > > Perhaps we need a PetscCommDuplicateExternalPkg() to somehow avoid
> these MPI_Comm_dup() calls?
> > >
> > > Satish
> > >
> > > On Tue, 3 Apr 2018, Smith, Barry F. wrote:
> > >
> > >>
> > >> Are we sure this is a PETSc comm issue and not a hypre comm duplication issue?
> > >>
> > >> frame #6: 0x00000001061345d9 libpetsc.3.07.dylib`hypre_GenerateSubComm(comm=-1006627852, participate=<unavailable>, new_comm_ptr=<unavailable>) + 409 at gen_redcs_mat.c:531 [opt]
> > >>
> > >> Looks like hypre needs to generate subcomms; perhaps it generates too many?
> > >>
> > >> Barry
> > >>
> > >>
> > >>> On Apr 2, 2018, at 7:07 PM, Derek Gaston <friedmud at gmail.com> wrote:
> > >>>
> > >>> I’m working with Fande on this and I would like to add a bit more.
> There are many circumstances where we aren’t working on COMM_WORLD at all
> (e.g. working on a sub-communicator) but PETSc was initialized using
> MPI_COMM_WORLD (think multi-level solves)… and we need to create
> arbitrarily many PETSc vecs/mats/solvers/preconditioners and solve. We
> definitely can’t rely on using PETSC_COMM_WORLD to avoid triggering
> duplication.
> > >>>
> > >>> Can you explain why PETSc needs to duplicate the communicator so
> much?
> > >>>
> > >>> Thanks for your help in tracking this down!
> > >>>
> > >>> Derek
> > >>>
> > >>> On Mon, Apr 2, 2018 at 5:44 PM Kong, Fande <fande.kong at inl.gov>
> wrote:
> > >>> Why do we not use user-level MPI communicators directly? What are the potential risks here?
> > >>>
> > >>>
> > >>> Fande,
> > >>>
> > >>> On Mon, Apr 2, 2018 at 5:08 PM, Satish Balay <balay at mcs.anl.gov>
> wrote:
> > >>> PETSC_COMM_WORLD [via PetscCommDuplicate()] attempts to minimize
> calls to MPI_Comm_dup() - thus potentially avoiding such errors
> > >>>
> > >>>
> http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Sys/PetscCommDuplicate.html
> > >>>
> > >>>
> > >>> Satish
> > >>>
> > >>> On Mon, 2 Apr 2018, Kong, Fande wrote:
> > >>>
> > >>>> On Mon, Apr 2, 2018 at 4:23 PM, Satish Balay <balay at mcs.anl.gov>
> wrote:
> > >>>>
> > >>>>> Does this 'standard test' use MPI_COMM_WORLD to create PETSc objects?
> > >>>>>
> > >>>>> If so - you could try changing to PETSC_COMM_WORLD
> > >>>>>
> > >>>>
> > >>>>
> > >>>> I do not think we are using PETSC_COMM_WORLD when creating PETSc
> objects.
> > >>>> Why can we not use MPI_COMM_WORLD?
> > >>>>
> > >>>>
> > >>>> Fande,
> > >>>>
> > >>>>
> > >>>>>
> > >>>>> Satish
> > >>>>>
> > >>>>>
> > >>>>> On Mon, 2 Apr 2018, Kong, Fande wrote:
> > >>>>>
> > >>>>>> Hi All,
> > >>>>>>
> > >>>>>> I am trying to upgrade PETSc from 3.7.6 to 3.8.3 for MOOSE and its
> > >>>>>> applications. I have an error message for a standard test:
> > >>>>>>
> > >>>>>> preconditioners/pbp.lots_of_variables: MPI had an error
> > >>>>>> preconditioners/pbp.lots_of_variables: ------------------------------------------------
> > >>>>>> preconditioners/pbp.lots_of_variables: Other MPI error, error stack:
> > >>>>>> preconditioners/pbp.lots_of_variables: PMPI_Comm_dup(177)..................: MPI_Comm_dup(comm=0x84000001, new_comm=0x97d1068) failed
> > >>>>>> preconditioners/pbp.lots_of_variables: PMPI_Comm_dup(162)..................:
> > >>>>>> preconditioners/pbp.lots_of_variables: MPIR_Comm_dup_impl(57)..............:
> > >>>>>> preconditioners/pbp.lots_of_variables: MPIR_Comm_copy(739).................:
> > >>>>>> preconditioners/pbp.lots_of_variables: MPIR_Get_contextid_sparse_group(614): Too many communicators (0/2048 free on this process; ignore_id=0)
> > >>>>>>
> > >>>>>> I did 'git bisect', and the following commit introduces this issue:
> > >>>>>>
> > >>>>>> commit 49a781f5cee36db85e8d5b951eec29f10ac13593
> > >>>>>> Author: Stefano Zampini <stefano.zampini at gmail.com>
> > >>>>>> Date:   Sat Nov 5 20:15:19 2016 +0300
> > >>>>>>
> > >>>>>>     PCHYPRE: use internal Mat of type MatHYPRE
> > >>>>>>     hpmat already stores two HYPRE vectors
> > >>>>>>
> > >>>>>> Before I debug line by line, does anyone have a clue about this?
> > >>>>>>
> > >>>>>>
> > >>>>>> Fande,
> > >>>>>>
> > >>>>>
> > >>>>>
> > >>>>
> > >>>
> > >>
> >
> >
>