<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Apr 3, 2018 at 9:32 AM, Satish Balay <span dir="ltr"><<a href="mailto:balay@mcs.anl.gov" target="_blank">balay@mcs.anl.gov</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">On Tue, 3 Apr 2018, Stefano Zampini wrote:<br>
<br>
><br>
> > On Apr 3, 2018, at 4:58 PM, Satish Balay <<a href="mailto:balay@mcs.anl.gov">balay@mcs.anl.gov</a>> wrote:<br>
> ><br>
> > On Tue, 3 Apr 2018, Kong, Fande wrote:<br>
> ><br>
> >> On Tue, Apr 3, 2018 at 1:17 AM, Smith, Barry F. <<a href="mailto:bsmith@mcs.anl.gov">bsmith@mcs.anl.gov</a>> wrote:<br>
> >><br>
> >>><br>
> >>> Each external package definitely needs its own duplicated communicator;<br>
> >>> cannot share between packages.<br>
> >>><br>
> >>> The only problem with the dups below is if they are in a loop and get<br>
> >>> called many times.<br>
> >>><br>
> >><br>
> >><br>
> >> The "standard test" that has this issue actually has 1K fields. MOOSE<br>
> >> creates its own field-split preconditioner (not based on the PETSc<br>
> >> fieldsplit), and each field is associated with one PCHYPRE. If PETSc<br>
> >> duplicates communicators, we would easily reach the limit of 2048.<br>
> >><br>
> >> I also want to confirm what extra communicators are introduced in the bad<br>
> >> commit.<br>
> ><br>
> > To me it looks like there is 1 extra comm created [for MATHYPRE] for each PCHYPRE that is created [which also creates one comm for this object].<br>
> ><br>
><br>
> You’re right; however, it was the same before the commit.<br>
<br>
<a href="https://bitbucket.org/petsc/petsc/commits/49a781f5cee36db85e8d5b951eec29f10ac13593" rel="noreferrer" target="_blank">https://bitbucket.org/petsc/petsc/commits/49a781f5cee36db85e8d5b951eec29f10ac13593</a><br>
Before the commit, PCHYPRE was not calling MatConvert(MATHYPRE) [which results in an additional call to MPI_Comm_dup() for hypre calls]; PCHYPRE was calling MatHYPRE_IJMatrixCreate() directly [which, I presume, reuses the comm created by the call to MPI_Comm_dup() in PCHYPRE - for hypre calls].<br>
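For reference, the 2048 limit itself is easy to hit once each field costs two duplications. A standalone sketch in plain MPI (not PETSc or MOOSE code; the 1024 "fields" are made up to mirror the 1K-field test) that reproduces the "Too many communicators" failure under MPICH's default context-id limit:<br>
<br>
/* comm_dup_limit.c - standalone sketch, not PETSc code: each "field" dups<br>
 * the communicator twice (one for the PC, one for the internal Mat), so<br>
 * ~1K fields exhaust MPICH's ~2048 context ids and abort with<br>
 * "Too many communicators".<br>
 * Build/run: mpicc comm_dup_limit.c -o comm_dup_limit && mpiexec -n 2 ./comm_dup_limit */<br>
#include <mpi.h><br>
#include <stdio.h><br>
<br>
int main(int argc, char **argv)<br>
{<br>
  MPI_Comm pc_comm[1024], mat_comm[1024];<br>
  MPI_Init(&argc, &argv);<br>
  for (int i = 0; i < 1024; i++) {<br>
    MPI_Comm_dup(MPI_COMM_WORLD, &pc_comm[i]);   /* one dup for the "PC"  */<br>
    MPI_Comm_dup(MPI_COMM_WORLD, &mat_comm[i]);  /* one dup for the "Mat" */<br>
  }<br>
  printf("created %d duplicated communicators\n", 2 * 1024);  /* not reached with MPICH defaults */<br>
  for (int i = 0; i < 1024; i++) {<br>
    MPI_Comm_free(&pc_comm[i]);<br>
    MPI_Comm_free(&mat_comm[i]);<br>
  }<br>
  MPI_Finalize();<br>
  return 0;<br>
}<br>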
<br>
<br>
<br>
> I don’t understand how this specific commit is related to this issue, since the error is not in the MPI_Comm_dup which is inside MatCreate_MATHYPRE. Actually, the error comes from MPI_Comm_create:<br>
><br>
> frame #5: 0x00000001068defd4 libmpi.12.dylib`MPI_Comm_create + 3492<br>
> frame #6: 0x00000001061345d9 libpetsc.3.07.dylib`hypre_GenerateSubComm(comm=-1006627852, participate=<unavailable>, new_comm_ptr=<unavailable>) + 409 at gen_redcs_mat.c:531 [opt]<br>
> frame #7: 0x000000010618f8ba libpetsc.3.07.dylib`hypre_GaussElimSetup(amg_data=0x00007fe7ff857a00, level=<unavailable>, relax_type=9) + 74 at par_relax.c:4209 [opt]<br>
> frame #8: 0x0000000106140e93 libpetsc.3.07.dylib`hypre_BoomerAMGSetup(amg_vdata=<unavailable>, A=0x00007fe80842aff0, f=0x00007fe80842a980, u=0x00007fe80842a510) + 17699 at par_amg_setup.c:2108 [opt]<br>
> frame #9: 0x0000000105ec773c libpetsc.3.07.dylib`PCSetUp_HYPRE(pc=<unavailable>) + 2540 at hypre.c:226 [opt]<br>
<br>
I thought this trace comes up after applying your patch.<br></blockquote><div><br></div><div>This trace comes from a Mac.<br></div><div><br> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
- ierr = MatDestroy(&jac->hpmat);CHKERRQ(ierr);<br>
- ierr = MatConvert(pc->pmat,MATHYPRE,MAT_INITIAL_MATRIX,&jac->hpmat);CHKERRQ(ierr);<br>
+ ierr = MatConvert(pc->pmat,MATHYPRE,jac->hpmat ? MAT_REUSE_MATRIX : MAT_INITIAL_MATRIX,&jac->hpmat);CHKERRQ(ierr);<br>
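The reuse flag matters exactly in the looping case Barry mentions above; a rough user-level fragment of the same convert-once-then-reuse pattern (hypothetical application code, not the PETSc source; nsolves and the matrix A are placeholders):<br>
<br>
Mat A, hA = NULL;   /* hA is created on the first pass, then reused */<br>
PetscErrorCode ierr;<br>
for (PetscInt k = 0; k < nsolves; k++) {   /* nsolves: made-up loop count */<br>
  /* ... update A ... */<br>
  ierr = MatConvert(A,MATHYPRE,hA ? MAT_REUSE_MATRIX : MAT_INITIAL_MATRIX,&hA);CHKERRQ(ierr);<br>
  /* ... use hA; no new Mat (and no new MPI_Comm_dup) per iteration ... */<br>
}<br>
ierr = MatDestroy(&hA);CHKERRQ(ierr);<br>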
<br>
The stack before this patch was: [it's a different format - so it was obtained in a different way than the method above?]<br>
<span class=""><br>
preconditioners/pbp.lots_of_variables: Other MPI error, error stack:<br>
preconditioners/pbp.lots_of_variables: PMPI_Comm_dup(177)..................: MPI_Comm_dup(comm=0x84000001, new_comm=0x97d1068) failed<br>
preconditioners/pbp.lots_of_variables: PMPI_Comm_dup(162)..................:<br>
preconditioners/pbp.lots_of_variables: MPIR_Comm_dup_impl(57)..............:<br>
preconditioners/pbp.lots_of_variables: MPIR_Comm_copy(739).................:<br>
preconditioners/pbp.lots_of_variables: MPIR_Get_contextid_sparse_group(614): Too many communicators (0/2048 free on this process; ignore_id=0)<br></blockquote><div><br></div><div>This comes from a Linux machine (it is a test box), and I do not have access to it.<br><br><br></div><div>Fande,<br></div><div><br> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
Satish<br>
<br>
><br>
> How did you perform the bisection? make clean + make all ? Which version of HYPRE are you using?<br>
><br>
> > But you might want to verify [by linking with an MPI trace library?]<br>
> ><br>
> ><br>
</span>> > There are some debugging hints at <a href="https://urldefense.proofpoint.com/v2/url?u=https-3A__lists.mpich.org_pipermail_discuss_2012-2DDecember_000148.html&d=DwIDaQ&c=54IZrppPQZKX9mLzcGdPfFD1hxrcB__aEkJFOKJFd00&r=DUUt3SRGI0_JgtNaS3udV68GRkgV4ts7XKfj2opmiCY&m=6_ukwovpDrK5BL_94S4ezasw2a3S15SM59R41rSY-Yw&s=XUy9n2kmdq262Gwrn_RMXYR-bIyiKViCvp4fRfGCP9w&e=" rel="noreferrer" target="_blank">https://urldefense.proofpoint.<wbr>com/v2/url?u=https-3A__lists.<wbr>mpich.org_pipermail_discuss_<wbr>2012-2DDecember_000148.html&d=<wbr>DwIDaQ&c=<wbr>54IZrppPQZKX9mLzcGdPfFD1hxrcB_<wbr>_aEkJFOKJFd00&r=DUUt3SRGI0_<wbr>JgtNaS3udV68GRkgV4ts7XKfj2opmi<wbr>CY&m=6_ukwovpDrK5BL_<wbr>94S4ezasw2a3S15SM59R41rSY-Yw&<wbr>s=XUy9n2kmdq262Gwrn_RMXYR-<wbr>bIyiKViCvp4fRfGCP9w&e=</a> [wrt mpich] - which I haven't checked..<br>
> ><br>
> > Satish<br>
> ><br>
> >><br>
> >><br>
> >> Fande,<br>
> >><br>
> >><br>
> >><br>
> >>><br>
> >>> To debug the hypre/duplication issue in MOOSE I would run in the<br>
> >>> debugger with a breakpoint in MPI_Comm_dup() and see<br>
> >>> who keeps calling it an unreasonable number of times. (My guess is this is<br>
> >>> a new "feature" in hypre that they will need to fix but only debugging will<br>
> >>> tell)<br>
> >>><br>
> >>> Barry<br>
> >>><br>
> >>><br>
> >>>> On Apr 2, 2018, at 7:44 PM, Balay, Satish <<a href="mailto:balay@mcs.anl.gov">balay@mcs.anl.gov</a>> wrote:<br>
> >>>><br>
> >>>> We do an MPI_Comm_dup() for objects related to external packages.<br>
> >>>><br>
> >>>> Looks like we added a new mat type, MATHYPRE, in 3.8 that PCHYPRE is<br>
> >>>> using. Previously there was one MPI_Comm_dup() per PCHYPRE - now I think<br>
> >>>> there is one more for MATHYPRE - so more calls to MPI_Comm_dup() in 3.8 vs 3.7.<br>
> >>>><br>
> >>>> src/dm/impls/da/hypre/mhyp.c: ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)B),&(ex->hcomm));CHKERRQ(ierr);<br>
> >>>> src/dm/impls/da/hypre/mhyp.c: ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)B),&(ex->hcomm));CHKERRQ(ierr);<br>
> >>>> src/dm/impls/swarm/data_ex.c: ierr = MPI_Comm_dup(comm,&d->comm);CHKERRQ(ierr);<br>
> >>>> src/ksp/pc/impls/hypre/hypre.c: ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)pc),&(jac->comm_hypre));CHKERRQ(ierr);<br>
> >>>> src/ksp/pc/impls/hypre/hypre.c: ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)pc),&(ex->hcomm));CHKERRQ(ierr);<br>
> >>>> src/ksp/pc/impls/hypre/hypre.c: ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)pc),&(ex->hcomm));CHKERRQ(ierr);<br>
> >>>> src/ksp/pc/impls/spai/ispai.c: ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)pc),&(ispai->comm_spai));CHKERRQ(ierr);<br>
> >>>> src/mat/examples/tests/ex152.c: ierr = MPI_Comm_dup(MPI_COMM_WORLD, &comm);CHKERRQ(ierr);<br>
> >>>> src/mat/impls/aij/mpi/mkl_cpardiso/mkl_cpardiso.c: ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)A),&(mat_mkl_cpardiso->comm_mkl_cpardiso));CHKERRQ(ierr);<br>
> >>>> src/mat/impls/aij/mpi/mumps/mumps.c: ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)A),&(mumps->comm_mumps));CHKERRQ(ierr);<br>
> >>>> src/mat/impls/aij/mpi/pastix/pastix.c: ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)A),&(lu->pastix_comm));CHKERRQ(ierr);<br>
> >>>> src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c: ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)A),&(lu->comm_superlu));CHKERRQ(ierr);<br>
> >>>> src/mat/impls/hypre/mhypre.c: ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)B),&hB->comm);CHKERRQ(ierr);<br>
> >>>> src/mat/partition/impls/pmetis/pmetis.c: ierr = MPI_Comm_dup(pcomm,&comm);CHKERRQ(ierr);<br>
> >>>> src/sys/mpiuni/mpi.c: MPI_COMM_SELF, MPI_COMM_WORLD, and a MPI_Comm_dup() of each of these (duplicates of duplicates return the same communictor)<br>
> >>>> src/sys/mpiuni/mpi.c:int MPI_Comm_dup(MPI_Comm comm,MPI_Comm *out)<br>
> >>>> src/sys/objects/pinit.c: ierr = MPI_Comm_dup(MPI_COMM_WORLD,&local_comm);CHKERRQ(ierr);<br>
> >>>> src/sys/objects/pinit.c: ierr = MPI_Comm_dup(MPI_COMM_WORLD,&local_comm);CHKERRQ(ierr);<br>
> >>>> src/sys/objects/tagm.c: ierr = MPI_Comm_dup(comm_in,comm_out);CHKERRQ(ierr);<br>
> >>>> src/sys/utils/mpiu.c: ierr = MPI_Comm_dup(comm,&local_comm);CHKERRQ(ierr);<br>
> >>>> src/ts/impls/implicit/sundials/sundials.c: ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)ts),&(cvode->comm_sundials));CHKERRQ(ierr);<br>
> >>>><br>
> >>>> Perhaps we need a PetscCommDuplicateExternalPkg() to somehow avoid<br>
> >>>> these MPI_Comm_dup() calls?<br>
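One possible shape for such a helper - purely a hypothetical sketch in plain MPI, since no PetscCommDuplicateExternalPkg() exists - would be to cache one duplicated communicator per (communicator, package) pair in an MPI attribute and reference-count it, so each package still gets its own duplicate but every object of that package on the same comm shares it:<br>
<br>
/* Hypothetical sketch only - none of these names exist in PETSc. Error<br>
 * checking, keyval creation, and the matching release/free are omitted. */<br>
#include <mpi.h><br>
#include <stdlib.h><br>
<br>
typedef struct { MPI_Comm dup; int refct; } PkgComm;<br>
<br>
/* pkg_keyval: an MPI keyval created once per external package */<br>
static int CommDupForExternalPkg(MPI_Comm comm, int pkg_keyval, MPI_Comm *pkgcomm)<br>
{<br>
  PkgComm *pc;<br>
  int      flag;<br>
  MPI_Comm_get_attr(comm, pkg_keyval, &pc, &flag);<br>
  if (!flag) {                       /* first request on this comm: dup once */<br>
    pc = (PkgComm *)malloc(sizeof(*pc));<br>
    MPI_Comm_dup(comm, &pc->dup);<br>
    pc->refct = 0;<br>
    MPI_Comm_set_attr(comm, pkg_keyval, pc);<br>
  }<br>
  pc->refct++;<br>
  *pkgcomm = pc->dup;                /* all objects of this package share one duplicate */<br>
  return 0;<br>
}<br>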
> >>>><br>
> >>>> Satish<br>
> >>>><br>
> >>>> On Tue, 3 Apr 2018, Smith, Barry F. wrote:<br>
> >>>><br>
> >>>>><br>
> >>>>> Are we sure this is a PETSc comm issue and not a hypre comm<br>
> >>>>> duplication issue?<br>
> >>>>><br>
> >>>>> frame #6: 0x00000001061345d9 libpetsc.3.07.dylib`hypre_GenerateSubComm(comm=-1006627852, participate=<unavailable>, new_comm_ptr=<unavailable>) + 409 at gen_redcs_mat.c:531 [opt]<br>
> >>>>><br>
> >>>>> Looks like hypre needs to generate subcomms; perhaps it generates<br>
> >>>>> too many?<br>
> >>>>><br>
> >>>>> Barry<br>
> >>>>><br>
> >>>>><br>
> >>>>>> On Apr 2, 2018, at 7:07 PM, Derek Gaston <<a href="mailto:friedmud@gmail.com">friedmud@gmail.com</a>> wrote:<br>
> >>>>>><br>
> >>>>>> I’m working with Fande on this and I would like to add a bit more.<br>
> >>>>>> There are many circumstances where we aren’t working on COMM_WORLD at all<br>
> >>>>>> (e.g. working on a sub-communicator) but PETSc was initialized using<br>
> >>>>>> MPI_COMM_WORLD (think multi-level solves)… and we need to create<br>
> >>>>>> arbitrarily many PETSc vecs/mats/solvers/preconditioners and solve. We<br>
> >>>>>> definitely can’t rely on using PETSC_COMM_WORLD to avoid triggering<br>
> >>>>>> duplication.<br>
> >>>>>><br>
> >>>>>> Can you explain why PETSc needs to duplicate the communicator so much?<br>
> >>>>>><br>
> >>>>>> Thanks for your help in tracking this down!<br>
> >>>>>><br>
> >>>>>> Derek<br>
> >>>>>><br>
> >>>>>> On Mon, Apr 2, 2018 at 5:44 PM Kong, Fande <<a href="mailto:fande.kong@inl.gov">fande.kong@inl.gov</a>> wrote:<br>
> >>>>>> Why do we not use user-level MPI communicators directly? What are the<br>
> >>>>>> potential risks here?<br>
> >>>>>><br>
> >>>>>><br>
> >>>>>> Fande,<br>
> >>>>>><br>
> >>>>>> On Mon, Apr 2, 2018 at 5:08 PM, Satish Balay <<a href="mailto:balay@mcs.anl.gov">balay@mcs.anl.gov</a>><br>
> >>> wrote:<br>
> >>>>>> PETSC_COMM_WORLD [via PetscCommDuplicate()] attempts to minimize calls<br>
> >>>>>> to MPI_Comm_dup() - thus potentially avoiding such errors<br>
> >>>>>><br>
> >>>>>> <a href="http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Sys/PetscCommDuplicate.html" rel="noreferrer" target="_blank">http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Sys/PetscCommDuplicate.html</a><br>
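In user terms, the point of that manual page is that PETSc attaches one inner duplicated communicator to the user's comm and reuses it for every object, so a loop like the following rough fragment (hypothetical sizes and names) does not consume one MPI context id per object:<br>
<br>
/* Sketch: many PETSc objects on one user communicator "comm".<br>
 * PetscCommDuplicate() caches a single inner duplicated comm on "comm"<br>
 * and reuses it, so this loop does not burn one context id per Vec. */<br>
Vec v[1000];<br>
for (int i = 0; i < 1000; i++) {<br>
  ierr = VecCreateMPI(comm, PETSC_DECIDE, 100, &v[i]);CHKERRQ(ierr);<br>
}<br>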
> >>>>>><br>
> >>>>>><br>
> >>>>>> Satish<br>
> >>>>>><br>
> >>>>>> On Mon, 2 Apr 2018, Kong, Fande wrote:<br>
> >>>>>><br>
> >>>>>>> On Mon, Apr 2, 2018 at 4:23 PM, Satish Balay <<a href="mailto:balay@mcs.anl.gov">balay@mcs.anl.gov</a>><br>
> >>> wrote:<br>
> >>>>>>><br>
> >>>>>>>> Does this 'standard test' use MPI_COMM_WORLD to create PETSc objects?<br>
> >>>>>>>><br>
> >>>>>>>> If so - you could try changing to PETSC_COMM_WORLD<br>
> >>>>>>>><br>
> >>>>>>><br>
> >>>>>>><br>
> >>>>>>> I do not think we are using PETSC_COMM_WORLD when creating PETSc<br>
> >>>>>>> objects.<br>
> >>>>>>> Why can we not use MPI_COMM_WORLD?<br>
> >>>>>>><br>
> >>>>>>><br>
> >>>>>>> Fande,<br>
> >>>>>>><br>
> >>>>>>><br>
> >>>>>>>><br>
> >>>>>>>> Satish<br>
> >>>>>>>><br>
> >>>>>>>><br>
> >>>>>>>> On Mon, 2 Apr 2018, Kong, Fande wrote:<br>
> >>>>>>>><br>
> >>>>>>>>> Hi All,<br>
> >>>>>>>>><br>
> >>>>>>>>> I am trying to upgrade PETSc from 3.7.6 to 3.8.3 for MOOSE and its<br>
> >>>>>>>>> applications. I have an error message for a standard test:<br>
> >>>>>>>>><br>
> >>>>>>>>> preconditioners/pbp.lots_of_variables: MPI had an error<br>
> >>>>>>>>> preconditioners/pbp.lots_of_variables: ------------------------------------------------<br>
> >>>>>>>>> preconditioners/pbp.lots_of_variables: Other MPI error, error stack:<br>
> >>>>>>>>> preconditioners/pbp.lots_of_variables: PMPI_Comm_dup(177)..................: MPI_Comm_dup(comm=0x84000001, new_comm=0x97d1068) failed<br>
> >>>>>>>>> preconditioners/pbp.lots_of_variables: PMPI_Comm_dup(162)..................:<br>
> >>>>>>>>> preconditioners/pbp.lots_of_variables: MPIR_Comm_dup_impl(57)..............:<br>
> >>>>>>>>> preconditioners/pbp.lots_of_variables: MPIR_Comm_copy(739).................:<br>
> >>>>>>>>> preconditioners/pbp.lots_of_variables: MPIR_Get_contextid_sparse_group(614): Too many communicators (0/2048 free on this process; ignore_id=0)<br>
> >>>>>>>>><br>
> >>>>>>>>><br>
> >>>>>>>>> I did 'git bisect', and the following commit introduces this issue:<br>
> >>>>>>>>><br>
> >>>>>>>>> commit 49a781f5cee36db85e8d5b951eec29f10ac13593<br>
> >>>>>>>>> Author: Stefano Zampini <<a href="mailto:stefano.zampini@gmail.com">stefano.zampini@gmail.com</a>><br>
> >>>>>>>>> Date:   Sat Nov 5 20:15:19 2016 +0300<br>
> >>>>>>>>><br>
> >>>>>>>>>     PCHYPRE: use internal Mat of type MatHYPRE<br>
> >>>>>>>>><br>
> >>>>>>>>>     hpmat already stores two HYPRE vectors<br>
> >>>>>>>>><br>
> >>>>>>>>> Before I debug line-by-line, does anyone have a clue on this?<br>
> >>>>>>>>><br>
> >>>>>>>>><br>
> >>>>>>>>> Fande,<br>
> >>>>>>>>><br>
> >>>>>>>><br>
> >>>>>>>><br>
> >>>>>>><br>
> >>>>>><br>
> >>>>><br>
> >>><br>
> >>><br>
><br>
> </blockquote></div><br></div></div>