<div dir="ltr"><div>The first bad commit:<br><br><i>commit 49a781f5cee36db85e8d5b951eec29f10ac13593<br>Author: Stefano Zampini <<a href="mailto:stefano.zampini@gmail.com">stefano.zampini@gmail.com</a>><br>Date: Sat Nov 5 20:15:19 2016 +0300<br><br> PCHYPRE: use internal Mat of type MatHYPRE<br> <br> hpmat already stores two HYPRE vectors<br></i><br><br></div><div>Hypre version:<br><br>~/projects/petsc/arch-darwin-c-opt-bisect_bad/externalpackages/git.hypre]> git branch <br>* (HEAD detached at 83b1f19)<br></div><div><br><br><br></div>The last good commit:<br><br><i>commit 63c07aad33d943fe85193412d077a1746a7c55aa<br>Author: Stefano Zampini <<a href="mailto:stefano.zampini@gmail.com">stefano.zampini@gmail.com</a>><br>Date: Sat Nov 5 19:30:12 2016 +0300<br><br> MatHYPRE: create new matrix type<br> <br> The conversion from AIJ to HYPRE has been taken from src/dm/impls/da/hypre/mhyp.c<br> HYPRE to AIJ is new</i><br><br><div><div>Hypre version:<br><br>/projects/petsc/arch-darwin-c-opt-bisect/externalpackages/git.hypre]> git branch <br>* (HEAD detached at 83b1f19)<br><br><br><br><br><br></div><div>We are using the same HYPRE version.<br><br><br></div><div>I will narrow down line-by-line.<br><br><br></div><div>Fande,<br></div><div><br></div></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Apr 3, 2018 at 9:50 AM, Stefano Zampini <span dir="ltr"><<a href="mailto:stefano.zampini@gmail.com" target="_blank">stefano.zampini@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style="word-wrap:break-word"><br><div><div><div class="h5"><blockquote type="cite"><div>On Apr 3, 2018, at 5:43 PM, Fande Kong <<a href="mailto:fdkong.jd@gmail.com" target="_blank">fdkong.jd@gmail.com</a>> wrote:</div><br class="m_7931168383490294883Apple-interchange-newline"><div><div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Apr 3, 2018 at 9:12 AM, Stefano Zampini <span dir="ltr"><<a href="mailto:stefano.zampini@gmail.com" target="_blank">stefano.zampini@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex"><div style="word-wrap:break-word"><br><div><span class="m_7931168383490294883gmail-"><blockquote type="cite"><div>On Apr 3, 2018, at 4:58 PM, Satish Balay <<a href="mailto:balay@mcs.anl.gov" target="_blank">balay@mcs.anl.gov</a>> wrote:</div><br class="m_7931168383490294883gmail-m_2524865371699403345Apple-interchange-newline"><div><div>On Tue, 3 Apr 2018, Kong, Fande wrote:<br><br><blockquote type="cite">On Tue, Apr 3, 2018 at 1:17 AM, Smith, Barry F. <<a href="mailto:bsmith@mcs.anl.gov" target="_blank">bsmith@mcs.anl.gov</a>> wrote:<br><br><blockquote type="cite"><br> Each external package definitely needs its own duplicated communicator;<br>cannot share between packages.<br><br> The only problem with the dups below is if they are in a loop and get<br>called many times.<br><br></blockquote><br><br>The "standard test" that has this issue actually has 1K fields. MOOSE<br>creates its own field-split preconditioner (not based on the PETSc<br>fieldsplit), and each filed is associated with one PC HYPRE. 
>>>>> If PETSc duplicates communicators, we could easily reach the 2048 limit.
>>>>>
>>>>> I also want to confirm which extra communicators are introduced by the bad commit.
>>>
>>>> To me it looks like there is 1 extra comm created [for MATHYPRE] for each
>>>> PCHYPRE that is created [which also creates one comm for this object].
>>
>>> You're right; however, it was the same before the commit. I don't understand
>>> how this specific commit is related to the issue, since the error is not in the
>>> MPI_Comm_dup that is inside MatCreate_MATHYPRE. Actually, the error comes from
>>> MPI_Comm_create:
>>>
>>>   frame #5: 0x00000001068defd4 libmpi.12.dylib`MPI_Comm_create + 3492
>>>   frame #6: 0x00000001061345d9 libpetsc.3.07.dylib`hypre_GenerateSubComm(comm=-1006627852, participate=<unavailable>, new_comm_ptr=<unavailable>) + 409 at gen_redcs_mat.c:531 [opt]
>>>   frame #7: 0x000000010618f8ba libpetsc.3.07.dylib`hypre_GaussElimSetup(amg_data=0x00007fe7ff857a00, level=<unavailable>, relax_type=9) + 74 at par_relax.c:4209 [opt]
>>>   frame #8: 0x0000000106140e93 libpetsc.3.07.dylib`hypre_BoomerAMGSetup(amg_vdata=<unavailable>, A=0x00007fe80842aff0, f=0x00007fe80842a980, u=0x00007fe80842a510) + 17699 at par_amg_setup.c:2108 [opt]
>>>   frame #9: 0x0000000105ec773c libpetsc.3.07.dylib`PCSetUp_HYPRE(pc=<unavailable>) + 2540 at hypre.c:226 [opt]
>>>
>>> How did you perform the bisection? make clean + make all? Which version of HYPRE are you using?
>
>> I did it more aggressively:
>>
>>   rm -rf arch-darwin-c-opt-bisect
>>
>>   ./configure --optionsModule=config.compilerOptions -with-debugging=no --with-shared-libraries=1 --with-mpi=1 --download-fblaslapack=1 --download-metis=1 --download-parmetis=1 --download-superlu_dist=1 --download-hypre=1 --download-mumps=1 --download-scalapack=1 PETSC_ARCH=arch-darwin-c-opt-bisect
>
> Good, so this removes some possible sources of errors.
>
>> HYPRE version:
>>
>>   self.gitcommit = 'v2.11.1-55-g2ea0e43'
>>   self.download  = ['git://https://github.com/LLNL/hypre','https://github.com/LLNL/hypre/archive/'+self.gitcommit+'.tar.gz']
>
> When reconfiguring, the HYPRE version can be different too (that commit is from
> 11/2016, so the HYPRE version used by the PETSc configure may have been upgraded
> since then as well).
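For reference, one quick way to see who is actually creating communicators (in the spirit of the "mpi trace library" and the break-point-in-MPI_Comm_dup() suggestions quoted below) is a tiny interposer built on the standard MPI profiling (PMPI) layer. This is only a sketch; the file name and messages are made up, and it is not part of PETSc or MOOSE:

  /* comm_trace.c - a sketch of a PMPI interposer (standard MPI profiling layer).
   * Compile with mpicc and link (or LD_PRELOAD) it ahead of the application to
   * count every communicator-creating call. */
  #include <mpi.h>
  #include <stdio.h>

  static int dup_count = 0, create_count = 0;

  int MPI_Comm_dup(MPI_Comm comm, MPI_Comm *newcomm)
  {
    fprintf(stderr, "[comm_trace] MPI_Comm_dup #%d\n", ++dup_count);
    return PMPI_Comm_dup(comm, newcomm);          /* forward to the real call */
  }

  int MPI_Comm_create(MPI_Comm comm, MPI_Group group, MPI_Comm *newcomm)
  {
    fprintf(stderr, "[comm_trace] MPI_Comm_create #%d\n", ++create_count);
    return PMPI_Comm_create(comm, group, newcomm);
  }

  int MPI_Finalize(void)
  {
    fprintf(stderr, "[comm_trace] totals: dup=%d create=%d\n", dup_count, create_count);
    return PMPI_Finalize();
  }

Running the failing test with this linked in would show whether the communicators that pile up come mainly from MPI_Comm_dup() on the PETSc side or from MPI_Comm_create() inside hypre's Gaussian-elimination setup seen in the backtrace above.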
class="h5"><br><blockquote type="cite"><div><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><div>I do not think this is caused by HYPRE.</div></div></div></div></div></blockquote><blockquote type="cite"><div><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><div><br></div><div>Fande,</div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex"><div style="word-wrap:break-word"><div><div><div class="m_7931168383490294883gmail-h5"><br><blockquote type="cite"><div><div>But you might want to verify [by linking with mpi trace library?]<br><br><br>There are some debugging hints at <a href="https://urldefense.proofpoint.com/v2/url?u=https-3A__lists.mpich.org_pipermail_discuss_2012-2DDecember_000148.html&d=DwMFaQ&c=54IZrppPQZKX9mLzcGdPfFD1hxrcB__aEkJFOKJFd00&r=DUUt3SRGI0_JgtNaS3udV68GRkgV4ts7XKfj2opmiCY&m=LTXwlyqefohCW3djvHLnK_QFKia-PIJn5cgBbNxC91A&s=LJXoNthyvs72jBCfo6sXph3GVLiniaQcr4e1hMetpIc&e=" target="_blank">https://lists.mpich.org/piperm<wbr>ail/discuss/2012-December/<wbr>000148.html</a> [wrt mpich] - which I haven't checked..<br><br>Satish<br><br><blockquote type="cite"><br><br>Fande,<br><br><br><br><blockquote type="cite"><br> To debug the hypre/duplication issue in MOOSE I would run in the<br>debugger with a break point in MPI_Comm_dup() and see<br>who keeps calling it an unreasonable amount of times. (My guess is this is<br>a new "feature" in hypre that they will need to fix but only debugging will<br>tell)<br><br> Barry<br><br><br><blockquote type="cite">On Apr 2, 2018, at 7:44 PM, Balay, Satish <<a href="mailto:balay@mcs.anl.gov" target="_blank">balay@mcs.anl.gov</a>> wrote:<br><br>We do a MPI_Comm_dup() for objects related to externalpackages.<br><br>Looks like we added a new mat type MATHYPRE - in 3.8 that PCHYPRE is<br>using. 
>>>>>>> Previously there was one MPI_Comm_dup() for PCHYPRE - now I think there is
>>>>>>> one more for MATHYPRE - so there are more calls to MPI_Comm_dup in 3.8 vs 3.7:
>>>>>>>
>>>>>>> src/dm/impls/da/hypre/mhyp.c:  ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)B),&(ex->hcomm));CHKERRQ(ierr);
>>>>>>> src/dm/impls/da/hypre/mhyp.c:  ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)B),&(ex->hcomm));CHKERRQ(ierr);
>>>>>>> src/dm/impls/swarm/data_ex.c:  ierr = MPI_Comm_dup(comm,&d->comm);CHKERRQ(ierr);
>>>>>>> src/ksp/pc/impls/hypre/hypre.c:  ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)pc),&(jac->comm_hypre));CHKERRQ(ierr);
>>>>>>> src/ksp/pc/impls/hypre/hypre.c:  ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)pc),&(ex->hcomm));CHKERRQ(ierr);
>>>>>>> src/ksp/pc/impls/hypre/hypre.c:  ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)pc),&(ex->hcomm));CHKERRQ(ierr);
>>>>>>> src/ksp/pc/impls/spai/ispai.c:  ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)pc),&(ispai->comm_spai));CHKERRQ(ierr);
>>>>>>> src/mat/examples/tests/ex152.c:  ierr = MPI_Comm_dup(MPI_COMM_WORLD,&comm);CHKERRQ(ierr);
>>>>>>> src/mat/impls/aij/mpi/mkl_cpardiso/mkl_cpardiso.c:  ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)A),&(mat_mkl_cpardiso->comm_mkl_cpardiso));CHKERRQ(ierr);
>>>>>>> src/mat/impls/aij/mpi/mumps/mumps.c:  ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)A),&(mumps->comm_mumps));CHKERRQ(ierr);
>>>>>>> src/mat/impls/aij/mpi/pastix/pastix.c:  ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)A),&(lu->pastix_comm));CHKERRQ(ierr);
>>>>>>> src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c:  ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)A),&(lu->comm_superlu));CHKERRQ(ierr);
>>>>>>> src/mat/impls/hypre/mhypre.c:  ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)B),&hB->comm);CHKERRQ(ierr);
>>>>>>> src/mat/partition/impls/pmetis/pmetis.c:  ierr = MPI_Comm_dup(pcomm,&comm);CHKERRQ(ierr);
>>>>>>> src/sys/mpiuni/mpi.c:  MPI_COMM_SELF, MPI_COMM_WORLD, and a MPI_Comm_dup() of each of these (duplicates of duplicates return the same communicator)
>>>>>>> src/sys/mpiuni/mpi.c:  int MPI_Comm_dup(MPI_Comm comm,MPI_Comm *out)
>>>>>>> src/sys/objects/pinit.c:  ierr = MPI_Comm_dup(MPI_COMM_WORLD,&local_comm);CHKERRQ(ierr);
>>>>>>> src/sys/objects/pinit.c:  ierr = MPI_Comm_dup(MPI_COMM_WORLD,&local_comm);CHKERRQ(ierr);
>>>>>>> src/sys/objects/tagm.c:  ierr = MPI_Comm_dup(comm_in,comm_out);CHKERRQ(ierr);
>>>>>>> src/sys/utils/mpiu.c:  ierr = MPI_Comm_dup(comm,&local_comm);CHKERRQ(ierr);
>>>>>>> src/ts/impls/implicit/sundials/sundials.c:  ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)ts),&(cvode->comm_sundials));CHKERRQ(ierr);
>>>>>>>
>>>>>>> Perhaps we need a PetscCommDuplicateExternalPkg() to somehow avoid these
>>>>>>> MPI_Comm_dup() calls?
>>>>>>>
>>>>>>> Satish
>>>>>>>
>>>>>>> On Tue, 3 Apr 2018, Smith, Barry F. wrote:
>>>>>>>
>>>>>>>> Are we sure this is a PETSc comm issue and not a hypre comm duplication issue?
>>>>>>>>
>>>>>>>>   frame #6: 0x00000001061345d9 libpetsc.3.07.dylib`hypre_GenerateSubComm(comm=-1006627852, participate=<unavailable>, new_comm_ptr=<unavailable>) + 409 at gen_redcs_mat.c:531 [opt]
>>>>>>>>
>>>>>>>> It looks like hypre is needed to generate subcomms; perhaps it generates too many?
>>>>>>>>
>>>>>>>>   Barry
>>>>>>>>
>>>>>>>> On Apr 2, 2018, at 7:07 PM, Derek Gaston <friedmud@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> I'm working with Fande on this and I would like to add a bit more. There are
>>>>>>>>> many circumstances where we aren't working on COMM_WORLD at all (e.g. working
>>>>>>>>> on a sub-communicator) but PETSc was initialized using MPI_COMM_WORLD (think
>>>>>>>>> multi-level solves)… and we need to create arbitrarily many PETSc
>>>>>>>>> vecs/mats/solvers/preconditioners and solve. We definitely can't rely on using
>>>>>>>>> PETSC_COMM_WORLD to avoid triggering duplication.
>>>>>>>>>
>>>>>>>>> Can you explain why PETSc needs to duplicate the communicator so much?
>>>>>>>>>
>>>>>>>>> Thanks for your help in tracking this down!
>>>>>>>>>
>>>>>>>>> Derek
>>>>>>>>>
>>>>>>>>> On Mon, Apr 2, 2018 at 5:44 PM Kong, Fande <fande.kong@inl.gov> wrote:
>>>>>>>>>
>>>>>>>>> Why do we not use user-level MPI communicators directly?
>>>>>>>>> What are the potential risks here?
>>>>>>>>>
>>>>>>>>> Fande,
>>>>>>>>>
>>>>>>>>> On Mon, Apr 2, 2018 at 5:08 PM, Satish Balay <balay@mcs.anl.gov> wrote:
>>>>>>>>>
>>>>>>>>> PETSC_COMM_WORLD [via PetscCommDuplicate()] attempts to minimize calls to
>>>>>>>>> MPI_Comm_dup() - thus potentially avoiding such errors:
>>>>>>>>>
>>>>>>>>> http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Sys/PetscCommDuplicate.html
>>>>>>>>>
>>>>>>>>> Satish
>>>>>>>>>
>>>>>>>>> On Mon, 2 Apr 2018, Kong, Fande wrote:
>>>>>>>>>
>>>>>>>>>> On Mon, Apr 2, 2018 at 4:23 PM, Satish Balay <balay@mcs.anl.gov> wrote:
>>>>>>>>>>
>>>>>>>>>>> Does this 'standard test' use MPI_COMM_WORLD to create PETSc objects?
>>>>>>>>>>>
>>>>>>>>>>> If so - you could try changing to PETSC_COMM_WORLD.
>>>>>>>>>>
>>>>>>>>>> I do not think we are using PETSC_COMM_WORLD when creating PETSc objects.
>>>>>>>>>> Why can we not use MPI_COMM_WORLD?
>>>>>>>>>>
>>>>>>>>>> Fande,
>>>>>>>>>>
>>>>>>>>>>> Satish
>>>>>>>>>>>
>>>>>>>>>>> On Mon, 2 Apr 2018, Kong, Fande wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>
>>>>>>>>>>>> I am trying to upgrade PETSc from 3.7.6 to 3.8.3 for MOOSE and its
>>>>>>>>>>>> applications.
>>>>>>>>>>>> I have an error message for a standard test:
>>>>>>>>>>>>
>>>>>>>>>>>>   preconditioners/pbp.lots_of_variables: MPI had an error
>>>>>>>>>>>>   preconditioners/pbp.lots_of_variables: ------------------------------------------------
>>>>>>>>>>>>   preconditioners/pbp.lots_of_variables: Other MPI error, error stack:
>>>>>>>>>>>>   preconditioners/pbp.lots_of_variables: PMPI_Comm_dup(177)..................: MPI_Comm_dup(comm=0x84000001, new_comm=0x97d1068) failed
>>>>>>>>>>>>   preconditioners/pbp.lots_of_variables: PMPI_Comm_dup(162)..................:
>>>>>>>>>>>>   preconditioners/pbp.lots_of_variables: MPIR_Comm_dup_impl(57)..............:
>>>>>>>>>>>>   preconditioners/pbp.lots_of_variables: MPIR_Comm_copy(739).................:
>>>>>>>>>>>>   preconditioners/pbp.lots_of_variables: MPIR_Get_contextid_sparse_group(614): Too many communicators (0/2048 free on this process; ignore_id=0)
>>>>>>>>>>>>
>>>>>>>>>>>> I did "git bisect", and the following commit introduces this issue:
>>>>>>>>>>>>
>>>>>>>>>>>>   commit 49a781f5cee36db85e8d5b951eec29f10ac13593
>>>>>>>>>>>>   Author: Stefano Zampini <stefano.zampini@gmail.com>
>>>>>>>>>>>>   Date:   Sat Nov 5 20:15:19 2016 +0300
>>>>>>>>>>>>
>>>>>>>>>>>>       PCHYPRE: use internal Mat of type MatHYPRE
>>>>>>>>>>>>
>>>>>>>>>>>>       hpmat already stores two HYPRE vectors
>>>>>>>>>>>>
>>>>>>>>>>>> Before I debug line-by-line, does anyone have a clue on this?
>>>>>>>>>>>>
>>>>>>>>>>>> Fande,
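The 2048 in "Too many communicators (0/2048 free)" quoted above is MPICH's per-process pool of context ids; every MPI_Comm_dup()/MPI_Comm_create() that is never freed consumes one. A toy reproducer, independent of PETSc and MOOSE (assuming MPICH; the file name is made up), hits the same wall:

  /* dup_until_fail.c - duplicate MPI_COMM_WORLD without ever freeing, until the
   * implementation runs out of context ids (MPICH reports "Too many communicators"). */
  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
    MPI_Init(&argc, &argv);
    /* Return errors instead of aborting so we can report the count. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
    for (int i = 0; ; i++) {
      MPI_Comm dup;
      if (MPI_Comm_dup(MPI_COMM_WORLD, &dup) != MPI_SUCCESS) {
        printf("MPI_Comm_dup failed after %d duplicates\n", i);
        break;
      }
    }
    MPI_Finalize();
    return 0;
  }

Freeing each duplicate with MPI_Comm_free() keeps the count bounded, which is why a dup-per-object scheme only breaks once thousands of objects are alive at the same time, as with the 1K-field test.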
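On Satish's point that PETSC_COMM_WORLD goes through PetscCommDuplicate() to minimize MPI_Comm_dup() calls: the idea is that the inner duplicate is cached on the user's communicator (as an MPI attribute) and reused, so many objects on the same comm share a single dup. A rough sketch of that idea in plain MPI follows; it mirrors the concept only, not PETSc's actual code, and the names are made up:

  /* cached_dup.c - cache one duplicate per user communicator as an MPI attribute
   * and hand the same inner comm to every object, instead of one dup per object.
   * (A real implementation would also install a delete callback that frees the
   * cached duplicate when the user communicator is destroyed.) */
  #include <mpi.h>
  #include <stdlib.h>

  static int comm_dup_keyval = MPI_KEYVAL_INVALID;

  static int get_cached_dup(MPI_Comm comm, MPI_Comm *inner)
  {
    void *attr;
    int   found, err;

    if (comm_dup_keyval == MPI_KEYVAL_INVALID) {
      err = MPI_Comm_create_keyval(MPI_COMM_NULL_COPY_FN, MPI_COMM_NULL_DELETE_FN,
                                   &comm_dup_keyval, NULL);
      if (err != MPI_SUCCESS) return err;
    }
    err = MPI_Comm_get_attr(comm, comm_dup_keyval, &attr, &found);
    if (err != MPI_SUCCESS) return err;
    if (found) {                          /* reuse the duplicate made earlier */
      *inner = *(MPI_Comm *)attr;
      return MPI_SUCCESS;
    }
    MPI_Comm *dup = (MPI_Comm *)malloc(sizeof(*dup));
    err = MPI_Comm_dup(comm, dup);        /* the only MPI_Comm_dup per user comm */
    if (err != MPI_SUCCESS) { free(dup); return err; }
    *inner = *dup;
    return MPI_Comm_set_attr(comm, comm_dup_keyval, dup);
  }

With caching like this, a thousand PCHYPRE/MATHYPRE objects living on the same communicator would cost one context id instead of one each; the external-package MPI_Comm_dup() calls listed above are the places that still pay per object.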