[petsc-users] A bad commit affects MOOSE

Derek Gaston friedmud at gmail.com
Tue Apr 3 10:43:54 CDT 2018


One thing I want to be clear about here: we're not trying to solve
this particular problem (where we're creating 1000 instances of Hypre to
precondition each variable independently)... this particular problem is
just a test (one that has been in our test suite for a long time) to stress
test some of this capability.

We really do have a need for thousands (tens of thousands) of simultaneous
solves (each with its own Hypre instance).  That's not what this
particular problem is doing - but it is representative of a class of
problems we need to solve.

Which does bring up a point: I have previously run ~50,000 separate PETSc
solves without issue.  Is that because I was working
with MVAPICH on a cluster?  Does it just have a higher limit?

Derek

On Tue, Apr 3, 2018 at 9:13 AM Stefano Zampini <stefano.zampini at gmail.com>
wrote:

> On Apr 3, 2018, at 4:58 PM, Satish Balay <balay at mcs.anl.gov> wrote:
>
> On Tue, 3 Apr 2018, Kong, Fande wrote:
>
> On Tue, Apr 3, 2018 at 1:17 AM, Smith, Barry F. <bsmith at mcs.anl.gov>
> wrote:
>
>
>   Each external package definitely needs its own duplicated communicator;
> cannot share between packages.
>
>   The only problem with the dups below is if they are in a loop and get
> called many times.
>
>
>
> The "standard test" that has this issue actually has 1K fields. MOOSE
> creates its own field-split preconditioner (not based on the PETSc
> fieldsplit), and each field is associated with one PCHYPRE.  If PETSc
> duplicates communicators, we should easily reach the limit of 2048.
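> To make the scale concrete, here is a minimal sketch (hypothetical code, not
> the actual MOOSE setup) of one independent KSP/PCHYPRE per field.  If each
> PCHYPRE ends up with a duplicated communicator for itself plus one for its
> internal MATHYPRE, ~1K fields can exhaust MPICH's pool of 2048 context ids:
>
>   #include <petscksp.h>
>
>   #define NFIELDS 1000    /* assumed number of independent fields */
>
>   /* Sketch: one solver per field, each with its own hypre preconditioner */
>   PetscErrorCode SetupFieldSolvers(Mat *A, KSP *ksp)   /* one Mat/KSP per field */
>   {
>     PetscErrorCode ierr;
>     PetscInt       i;
>     PC             pc;
>
>     PetscFunctionBegin;
>     for (i = 0; i < NFIELDS; i++) {
>       ierr = KSPCreate(PetscObjectComm((PetscObject)A[i]), &ksp[i]);CHKERRQ(ierr);
>       ierr = KSPSetOperators(ksp[i], A[i], A[i]);CHKERRQ(ierr);
>       ierr = KSPGetPC(ksp[i], &pc);CHKERRQ(ierr);
>       ierr = PCSetType(pc, PCHYPRE);CHKERRQ(ierr);   /* each instance gets its own duplicated comm(s) */
>       ierr = KSPSetFromOptions(ksp[i]);CHKERRQ(ierr);
>     }
>     PetscFunctionReturn(0);
>   }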
>
> I also want to confirm what extra communicators are introduced in the bad
> commit.
>
>
> To me it looks like there is 1 extra comm created [for MATHYPRE] for each
> PCHYPRE that is created [which also creates one comm for this object].
>
>
> You’re right; however, it was the same before the commit.
> I don’t understand how this specific commit is related to this issue,
> since the error is not in the MPI_Comm_dup that is inside MatCreate_MATHYPRE.
> Actually, the error comes from MPI_Comm_create:
>
>
>
>
>
> frame #5: 0x00000001068defd4 libmpi.12.dylib`MPI_Comm_create + 3492
> frame #6: 0x00000001061345d9 libpetsc.3.07.dylib`hypre_GenerateSubComm(comm=-1006627852, participate=<unavailable>, new_comm_ptr=<unavailable>) + 409 at gen_redcs_mat.c:531 [opt]
> frame #7: 0x000000010618f8ba libpetsc.3.07.dylib`hypre_GaussElimSetup(amg_data=0x00007fe7ff857a00, level=<unavailable>, relax_type=9) + 74 at par_relax.c:4209 [opt]
> frame #8: 0x0000000106140e93 libpetsc.3.07.dylib`hypre_BoomerAMGSetup(amg_vdata=<unavailable>, A=0x00007fe80842aff0, f=0x00007fe80842a980, u=0x00007fe80842a510) + 17699 at par_amg_setup.c:2108 [opt]
> frame #9: 0x0000000105ec773c libpetsc.3.07.dylib`PCSetUp_HYPRE(pc=<unavailable>) + 2540 at hypre.c:226 [opt]
>
> How did you perform the bisection? make clean + make all ? Which version
> of HYPRE are you using?
>
> But you might want to verify [by linking with mpi trace library?]
>
>
> There are some debugging hints at
> https://lists.mpich.org/pipermail/discuss/2012-December/000148.html [wrt
> mpich] - which I haven't checked.
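> A minimal sketch of such an MPI trace shim (an assumption on my part, not an
> existing library) would use the standard PMPI profiling interface to count
> MPI_Comm_dup() calls:
>
>   #include <mpi.h>
>   #include <stdio.h>
>
>   static int dup_count = 0;
>
>   /* Intercept MPI_Comm_dup(), then forward to the real implementation */
>   int MPI_Comm_dup(MPI_Comm comm, MPI_Comm *newcomm)
>   {
>     dup_count++;
>     return PMPI_Comm_dup(comm, newcomm);
>   }
>
>   /* Report the per-rank tally just before MPI shuts down */
>   int MPI_Finalize(void)
>   {
>     int rank;
>     PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     fprintf(stderr, "[rank %d] MPI_Comm_dup() called %d times\n", rank, dup_count);
>     return PMPI_Finalize();
>   }
>
> Linked (or preloaded) ahead of the MPI library, this would show whether the
> number of duplications really changed between 3.7 and 3.8.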
>
> Satish
>
>
>
> Fande,
>
>
>
>
>    To debug the hypre/duplication issue in MOOSE I would run in the
> debugger with a break point in MPI_Comm_dup() and see
> who keeps calling it an unreasonable number of times. (My guess is this is
> a new "feature" in hypre that they will need to fix, but only debugging will
> tell.)
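> For example, a debugger session along these lines (the binary and input file
> names here are made up):
>
>   $ lldb -- ./moose_app-opt -i pbp.lots_of_variables.i
>   (lldb) breakpoint set --name MPI_Comm_dup
>   (lldb) run
>   (lldb) bt         # at each stop, note who is calling MPI_Comm_dup()
>   (lldb) continue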
>
>   Barry
>
>
> On Apr 2, 2018, at 7:44 PM, Balay, Satish <balay at mcs.anl.gov> wrote:
>
> We do an MPI_Comm_dup() for objects related to external packages.
>
> Looks like we added a new mat type MATHYPRE in 3.8 that PCHYPRE is
> using. Previously there was one MPI_Comm_dup() per PCHYPRE - now I think
> there is one more for MATHYPRE - so more calls to MPI_Comm_dup() in 3.8 vs 3.7.
>
> src/dm/impls/da/hypre/mhyp.c:  ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)B),&(ex->hcomm));CHKERRQ(ierr);
> src/dm/impls/da/hypre/mhyp.c:  ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)B),&(ex->hcomm));CHKERRQ(ierr);
> src/dm/impls/swarm/data_ex.c:  ierr = MPI_Comm_dup(comm,&d->comm);CHKERRQ(ierr);
> src/ksp/pc/impls/hypre/hypre.c:  ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)pc),&(jac->comm_hypre));CHKERRQ(ierr);
> src/ksp/pc/impls/hypre/hypre.c:  ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)pc),&(ex->hcomm));CHKERRQ(ierr);
> src/ksp/pc/impls/hypre/hypre.c:  ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)pc),&(ex->hcomm));CHKERRQ(ierr);
> src/ksp/pc/impls/spai/ispai.c:  ierr      = MPI_Comm_dup(PetscObjectComm((PetscObject)pc),&(ispai->comm_spai));CHKERRQ(ierr);
> src/mat/examples/tests/ex152.c:  ierr   = MPI_Comm_dup(MPI_COMM_WORLD, &comm);CHKERRQ(ierr);
> src/mat/impls/aij/mpi/mkl_cpardiso/mkl_cpardiso.c:  ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)A),&(mat_mkl_cpardiso->comm_mkl_cpardiso));CHKERRQ(ierr);
> src/mat/impls/aij/mpi/mumps/mumps.c:  ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)A),&(mumps->comm_mumps));CHKERRQ(ierr);
> src/mat/impls/aij/mpi/pastix/pastix.c:    ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)A),&(lu->pastix_comm));CHKERRQ(ierr);
> src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c:  ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)A),&(lu->comm_superlu));CHKERRQ(ierr);
> src/mat/impls/hypre/mhypre.c:  ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)B),&hB->comm);CHKERRQ(ierr);
> src/mat/partition/impls/pmetis/pmetis.c:    ierr   = MPI_Comm_dup(pcomm,&comm);CHKERRQ(ierr);
> src/sys/mpiuni/mpi.c:    MPI_COMM_SELF, MPI_COMM_WORLD, and a MPI_Comm_dup() of each of these (duplicates of duplicates return the same communictor)
> src/sys/mpiuni/mpi.c:int MPI_Comm_dup(MPI_Comm comm,MPI_Comm *out)
> src/sys/objects/pinit.c:      ierr = MPI_Comm_dup(MPI_COMM_WORLD,&local_comm);CHKERRQ(ierr);
> src/sys/objects/pinit.c:      ierr = MPI_Comm_dup(MPI_COMM_WORLD,&local_comm);CHKERRQ(ierr);
> src/sys/objects/tagm.c:      ierr = MPI_Comm_dup(comm_in,comm_out);CHKERRQ(ierr);
> src/sys/utils/mpiu.c:  ierr = MPI_Comm_dup(comm,&local_comm);CHKERRQ(ierr);
> src/ts/impls/implicit/sundials/sundials.c:  ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)ts),&(cvode->comm_sundials));CHKERRQ(ierr);
>
>
> Perhaps we need a PetscCommDuplicateExternalPkg() to somehow avoid
> these MPI_Comm_dup() calls?
>
>
> Satish
>
> On Tue, 3 Apr 2018, Smith, Barry F. wrote:
>
>
> Are we sure this is a PETSc comm issue and not a hypre comm
> duplication issue?
>
>
> frame #6: 0x00000001061345d9 libpetsc.3.07.dylib`hypre_GenerateSubComm(comm=-1006627852, participate=<unavailable>, new_comm_ptr=<unavailable>) + 409 at gen_redcs_mat.c:531 [opt]
>
>
> Looks like hypre needs to generate subcomms - perhaps it generates
> too many?
>
>
>  Barry
>
>
> On Apr 2, 2018, at 7:07 PM, Derek Gaston <friedmud at gmail.com> wrote:
>
> I’m working with Fande on this and I would like to add a bit more.
>
> There are many circumstances where we aren’t working on COMM_WORLD at all
> (e.g. working on a sub-communicator) but PETSc was initialized using
> MPI_COMM_WORLD (think multi-level solves)… and we need to create
> arbitrarily many PETSc vecs/mats/solvers/preconditioners and solve.  We
> definitely can’t rely on using PETSC_COMM_WORLD to avoid triggering
> duplication.
>
>
> Can you explain why PETSc needs to duplicate the communicator so much?
>
> Thanks for your help in tracking this down!
>
> Derek
>
> On Mon, Apr 2, 2018 at 5:44 PM Kong, Fande <fande.kong at inl.gov> wrote:
> Why do we not use user-level MPI communicators directly? What are the
> potential risks here?
>
>
>
> Fande,
>
> On Mon, Apr 2, 2018 at 5:08 PM, Satish Balay <balay at mcs.anl.gov>
>
> wrote:
>
> PETSC_COMM_WORLD [via PetscCommDuplicate()] attempts to minimize calls
>
> to MPI_Comm_dup() - thus potentially avoiding such errors
>
>
> http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Sys/PetscCommDuplicate.html
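> A minimal sketch of what PetscCommDuplicate() buys you (assuming the
> PETSc 3.8-era C API):
>
>   #include <petscsys.h>
>
>   int main(int argc, char **argv)
>   {
>     PetscErrorCode ierr;
>     MPI_Comm       c1, c2;
>     PetscMPIInt    tag1, tag2;
>
>     ierr = PetscInitialize(&argc, &argv, NULL, NULL);if (ierr) return ierr;
>     /* Both calls return the same inner communicator: the duplicate is cached
>        as an attribute on PETSC_COMM_WORLD, so MPI_Comm_dup() runs only once
>        no matter how many PETSc objects are created on this communicator. */
>     ierr = PetscCommDuplicate(PETSC_COMM_WORLD, &c1, &tag1);CHKERRQ(ierr);
>     ierr = PetscCommDuplicate(PETSC_COMM_WORLD, &c2, &tag2);CHKERRQ(ierr);
>     ierr = PetscCommDestroy(&c1);CHKERRQ(ierr);
>     ierr = PetscCommDestroy(&c2);CHKERRQ(ierr);
>     ierr = PetscFinalize();
>     return ierr;
>   }
>
> External packages still need a raw MPI communicator of their own (as Barry
> notes above, they cannot share one), which is where the per-object
> MPI_Comm_dup() calls in the grep listing come from.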
>
>
>
> Satish
>
> On Mon, 2 Apr 2018, Kong, Fande wrote:
>
> On Mon, Apr 2, 2018 at 4:23 PM, Satish Balay <balay at mcs.anl.gov>
>
> wrote:
>
>
> Does this 'standard test' use MPI_COMM_WORLD to create PETSc objects?
>
> If so - you could try changing to PETSC_COMM_WORLD
>
>
>
> I do not think we are using PETSC_COMM_WORLD when creating PETSc
> objects.
>
> Why can we not use MPI_COMM_WORLD?
>
>
> Fande,
>
>
>
> Satish
>
>
> On Mon, 2 Apr 2018, Kong, Fande wrote:
>
> Hi All,
>
> I am trying to upgrade PETSc from 3.7.6 to 3.8.3 for MOOSE and its
> applications. I get an error message for a standard test:
>
> preconditioners/pbp.lots_of_variables: MPI had an error
> preconditioners/pbp.lots_of_variables: ------------------------------------------------
> preconditioners/pbp.lots_of_variables: Other MPI error, error stack:
> preconditioners/pbp.lots_of_variables: PMPI_Comm_dup(177)..................: MPI_Comm_dup(comm=0x84000001, new_comm=0x97d1068) failed
> preconditioners/pbp.lots_of_variables: PMPI_Comm_dup(162)..................:
> preconditioners/pbp.lots_of_variables: MPIR_Comm_dup_impl(57)..............:
> preconditioners/pbp.lots_of_variables: MPIR_Comm_copy(739).................:
> preconditioners/pbp.lots_of_variables: MPIR_Get_contextid_sparse_group(614): Too many communicators (0/2048 free on this process; ignore_id=0)
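> For what it's worth, the 2048 limit itself is easy to reproduce outside of
> PETSc/MOOSE with a toy program (a sketch; the 2048 context-id pool is an
> MPICH implementation detail):
>
>   #include <mpi.h>
>   #include <stdio.h>
>
>   int main(int argc, char **argv)
>   {
>     MPI_Comm dups[4096];
>     int      i, err;
>
>     MPI_Init(&argc, &argv);
>     /* Return errors instead of aborting so we can see where duplication fails */
>     MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
>     for (i = 0; i < 4096; i++) {
>       err = MPI_Comm_dup(MPI_COMM_WORLD, &dups[i]);
>       if (err != MPI_SUCCESS) {
>         printf("MPI_Comm_dup failed after %d duplicates\n", i);
>         break;
>       }
>     }
>     MPI_Finalize();
>     return 0;
>   }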
>
>
> I did "git bisect", and the following commit introduced this issue:
>
> commit 49a781f5cee36db85e8d5b951eec29f10ac13593
> Author: Stefano Zampini <stefano.zampini at gmail.com>
> Date:   Sat Nov 5 20:15:19 2016 +0300
>
>     PCHYPRE: use internal Mat of type MatHYPRE
>
>     hpmat already stores two HYPRE vectors
>
> Before I debug line-by-line, does anyone have a clue about this?
>
>
> Fande,
>
>
>
>
>
>
>
>
>