[petsc-users] A bad commit affects MOOSE

Smith, Barry F. bsmith at mcs.anl.gov
Tue Apr 3 12:24:20 CDT 2018


   Fande,

      Please try the branch https://bitbucket.org/petsc/petsc/pull-requests/921/boomeramg-unlike-2-other-hypre/diff  

       It does not "solve" the problem, but it should get your current test, which now fails, running again.

   Barry


> On Apr 3, 2018, at 10:14 AM, Kong, Fande <fande.kong at inl.gov> wrote:
> 
> The first bad commit:
> 
> commit 49a781f5cee36db85e8d5b951eec29f10ac13593
> Author: Stefano Zampini <stefano.zampini at gmail.com>
> Date:   Sat Nov 5 20:15:19 2016 +0300
> 
>     PCHYPRE: use internal Mat of type MatHYPRE
>     
>     hpmat already stores two HYPRE vectors
> 
> 
> Hypre version:
> 
> ~/projects/petsc/arch-darwin-c-opt-bisect_bad/externalpackages/git.hypre]> git branch 
> * (HEAD detached at 83b1f19)
> 
> 
> 
> The last good commit:
> 
> commit 63c07aad33d943fe85193412d077a1746a7c55aa
> Author: Stefano Zampini <stefano.zampini at gmail.com>
> Date:   Sat Nov 5 19:30:12 2016 +0300
> 
>     MatHYPRE: create new matrix type
>     
>     The conversion from AIJ to HYPRE has been taken from src/dm/impls/da/hypre/mhyp.c
>     HYPRE to AIJ is new
> 
> Hypre version:
> 
> /projects/petsc/arch-darwin-c-opt-bisect/externalpackages/git.hypre]> git branch 
> * (HEAD detached at 83b1f19)
> 
> 
> 
> 
> 
> We are using the same HYPRE version.
> 
> 
> I will narrow it down line by line.
> 
> 
> Fande,
> 
> 
> On Tue, Apr 3, 2018 at 9:50 AM, Stefano Zampini <stefano.zampini at gmail.com> wrote:
> 
>> On Apr 3, 2018, at 5:43 PM, Fande Kong <fdkong.jd at gmail.com> wrote:
>> 
>> 
>> 
>> On Tue, Apr 3, 2018 at 9:12 AM, Stefano Zampini <stefano.zampini at gmail.com> wrote:
>> 
>>> On Apr 3, 2018, at 4:58 PM, Satish Balay <balay at mcs.anl.gov> wrote:
>>> 
>>> On Tue, 3 Apr 2018, Kong, Fande wrote:
>>> 
>>>> On Tue, Apr 3, 2018 at 1:17 AM, Smith, Barry F. <bsmith at mcs.anl.gov> wrote:
>>>> 
>>>>> 
>>>>>   Each external package definitely needs its own duplicated communicator;
>>>>> cannot share between packages.
>>>>> 
>>>>>   The only problem with the dups below is if they are in a loop and get
>>>>> called many times.
>>>>> 
>>>> 
>>>> 
>>>> The "standard test" that has this issue actually has 1K fields. MOOSE
>>>> creates its own field-split preconditioner (not based on the PETSc
>>>> fieldsplit), and each field is associated with one PCHYPRE. If PETSc
>>>> duplicates a communicator for each of these, we easily reach the limit of 2048.
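>>>> 
>>>> (To make the arithmetic concrete: with roughly two duplicated communicators per
>>>> field, ~1K fields is already close to MPICH's default limit of about 2048 context
>>>> ids per process. A plain-MPI sketch of that failure mode, nothing PETSc- or
>>>> MOOSE-specific, the numbers being my assumption:
>>>> 
>>>>     /* dup_until_fail.c: duplicate MPI_COMM_WORLD until the MPI library runs
>>>>        out of context ids, never freeing the duplicates. With MPICH this is
>>>>        expected to fail after roughly 2000 duplications with the same
>>>>        "Too many communicators" error reported by the test. */
>>>>     #include <mpi.h>
>>>>     #include <stdio.h>
>>>> 
>>>>     int main(int argc, char **argv)
>>>>     {
>>>>       static MPI_Comm dup[5000];
>>>>       int i, err;
>>>> 
>>>>       MPI_Init(&argc, &argv);
>>>>       /* Return error codes instead of aborting, so we can report the count. */
>>>>       MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
>>>>       for (i = 0; i < 5000; i++) {
>>>>         err = MPI_Comm_dup(MPI_COMM_WORLD, &dup[i]);
>>>>         if (err != MPI_SUCCESS) {
>>>>           printf("MPI_Comm_dup failed after %d successful duplications\n", i);
>>>>           break;
>>>>         }
>>>>       }
>>>>       MPI_Finalize();
>>>>       return 0;
>>>>     }
>>>> 
>>>> The point is only that a few thousand never-freed duplicates per process is
>>>> enough to hit the wall.)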
>>>> 
>>>> I also want to confirm what extra communicators are introduced in the bad
>>>> commit.
>>> 
>>> To me it looks like there is 1 extra comm created [for MATHYPRE] for each PCHYPRE that is created [which also creates one comm for this object].
>>> 
>> 
>> You’re right; however, it was the same before the commit.
>> I don’t understand how this specific commit is related to this issue, since the error is not in the MPI_Comm_dup that is inside MatCreate_MATHYPRE. Actually, the error comes from MPI_Comm_create:
>> 
>>     frame #5: 0x00000001068defd4 libmpi.12.dylib`MPI_Comm_create + 3492
>>     frame #6: 0x00000001061345d9 libpetsc.3.07.dylib`hypre_GenerateSubComm(comm=-1006627852, participate=<unavailable>, new_comm_ptr=<unavailable>) + 409 at gen_redcs_mat.c:531 [opt]
>>     frame #7: 0x000000010618f8ba libpetsc.3.07.dylib`hypre_GaussElimSetup(amg_data=0x00007fe7ff857a00, level=<unavailable>, relax_type=9) + 74 at par_relax.c:4209 [opt]
>>     frame #8: 0x0000000106140e93 libpetsc.3.07.dylib`hypre_BoomerAMGSetup(amg_vdata=<unavailable>, A=0x00007fe80842aff0, f=0x00007fe80842a980, u=0x00007fe80842a510) + 17699 at par_amg_setup.c:2108 [opt]
>>     frame #9: 0x0000000105ec773c libpetsc.3.07.dylib`PCSetUp_HYPRE(pc=<unavailable>) + 2540 at hypre.c:226 [opt]
>> 
>> How did you perform the bisection? make clean + make all ? Which version of HYPRE are you using?
>> 
>> I did it more aggressively:
>> 
>> "rm -rf  arch-darwin-c-opt-bisect   "
>> 
>> "./configure  --optionsModule=config.compilerOptions -with-debugging=no --with-shared-libraries=1 --with-mpi=1 --download-fblaslapack=1 --download-metis=1 --download-parmetis=1 --download-superlu_dist=1 --download-hypre=1 --download-mumps=1 --download-scalapack=1 PETSC_ARCH=arch-darwin-c-opt-bisect"
>> 
> 
> Good, so this removes some possible sources of errors
>> 
>> HYPRE version:
>> 
>> 
>>     self.gitcommit = 'v2.11.1-55-g2ea0e43'
>>     self.download  = ['git://https://github.com/LLNL/hypre','https://github.com/LLNL/hypre/archive/'+self.gitcommit+'.tar.gz']
>> 
>> 
> 
> When reconfiguring, the HYPRE version can be different too (that commit is from 11/2016, so the HYPRE version used by the PETSc configure may have been upgraded since then).
> 
>> I do not think this is caused by HYPRE.
>> 
>> Fande,
>> 
>>  
>> 
>>> But you might want to verify [by linking with mpi trace library?]
>>> 
>>> 
>>> There are some debugging hints at https://lists.mpich.org/pipermail/discuss/2012-December/000148.html [wrt mpich] - which I haven't checked..
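>>> 
>>> A minimal counting wrapper using only the standard PMPI profiling interface
>>> would be one (untested) way to do that - compile it into the application or
>>> into a small shared library:
>>> 
>>>     /* count_comm_dup.c: intercept MPI_Comm_dup and print a running count,
>>>        forwarding to the real implementation via PMPI_Comm_dup. */
>>>     #include <mpi.h>
>>>     #include <stdio.h>
>>> 
>>>     static int dup_count = 0;
>>> 
>>>     int MPI_Comm_dup(MPI_Comm comm, MPI_Comm *newcomm)
>>>     {
>>>       dup_count++;
>>>       fprintf(stderr, "MPI_Comm_dup call #%d\n", dup_count);
>>>       return PMPI_Comm_dup(comm, newcomm);
>>>     }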
>>> 
>>> Satish
>>> 
>>>> 
>>>> 
>>>> Fande,
>>>> 
>>>> 
>>>> 
>>>>> 
>>>>>    To debug the hypre/duplication issue in MOOSE I would run in the
>>>>> debugger with a breakpoint in MPI_Comm_dup() and see
>>>>> who keeps calling it an unreasonable number of times. (My guess is this is
>>>>> a new "feature" in hypre that they will need to fix, but only debugging will
>>>>> tell.)
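>>>>> 
>>>>>    For example, with lldb on macOS, something along these lines should show each
>>>>> caller (gdb's "break MPI_Comm_dup" plus "commands" works the same way):
>>>>> 
>>>>>     (lldb) breakpoint set -n MPI_Comm_dup
>>>>>     (lldb) run
>>>>>     ... each time the breakpoint is hit ...
>>>>>     (lldb) bt
>>>>>     (lldb) continue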
>>>>> 
>>>>>   Barry
>>>>> 
>>>>> 
>>>>>> On Apr 2, 2018, at 7:44 PM, Balay, Satish <balay at mcs.anl.gov> wrote:
>>>>>> 
>>>>>> We do a MPI_Comm_dup() for objects related to externalpackages.
>>>>>> 
>>>>>> Looks like we added a new mat type, MATHYPRE, in 3.8 that PCHYPRE is
>>>>>> using. Previously there was one MPI_Comm_dup() per PCHYPRE - now I think
>>>>>> there is one more for MATHYPRE - so there are more calls to MPI_Comm_dup in 3.8 vs 3.7.
>>>>>> 
>>>>>> src/dm/impls/da/hypre/mhyp.c:  ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)B),&(ex->hcomm));CHKERRQ(ierr);
>>>>>> src/dm/impls/da/hypre/mhyp.c:  ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)B),&(ex->hcomm));CHKERRQ(ierr);
>>>>>> src/dm/impls/swarm/data_ex.c:  ierr = MPI_Comm_dup(comm,&d->comm);CHKERRQ(ierr);
>>>>>> src/ksp/pc/impls/hypre/hypre.c:  ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)pc),&(jac->comm_hypre));CHKERRQ(ierr);
>>>>>> src/ksp/pc/impls/hypre/hypre.c:  ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)pc),&(ex->hcomm));CHKERRQ(ierr);
>>>>>> src/ksp/pc/impls/hypre/hypre.c:  ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)pc),&(ex->hcomm));CHKERRQ(ierr);
>>>>>> src/ksp/pc/impls/spai/ispai.c:  ierr      = MPI_Comm_dup(PetscObjectComm((PetscObject)pc),&(ispai->comm_spai));CHKERRQ(ierr);
>>>>>> src/mat/examples/tests/ex152.c:  ierr   = MPI_Comm_dup(MPI_COMM_WORLD,&comm);CHKERRQ(ierr);
>>>>>> src/mat/impls/aij/mpi/mkl_cpardiso/mkl_cpardiso.c:  ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)A),&(mat_mkl_cpardiso->comm_mkl_cpardiso));CHKERRQ(ierr);
>>>>>> src/mat/impls/aij/mpi/mumps/mumps.c:  ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)A),&(mumps->comm_mumps));CHKERRQ(ierr);
>>>>>> src/mat/impls/aij/mpi/pastix/pastix.c:    ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)A),&(lu->pastix_comm));CHKERRQ(ierr);
>>>>>> src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c:  ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)A),&(lu->comm_superlu));CHKERRQ(ierr);
>>>>>> src/mat/impls/hypre/mhypre.c:  ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)B),&hB->comm);CHKERRQ(ierr);
>>>>>> src/mat/partition/impls/pmetis/pmetis.c:    ierr   = MPI_Comm_dup(pcomm,&comm);CHKERRQ(ierr);
>>>>>> src/sys/mpiuni/mpi.c:    MPI_COMM_SELF, MPI_COMM_WORLD, and a MPI_Comm_dup() of each of these (duplicates of duplicates return the same communictor)
>>>>>> src/sys/mpiuni/mpi.c:int MPI_Comm_dup(MPI_Comm comm,MPI_Comm *out)
>>>>>> src/sys/objects/pinit.c:      ierr = MPI_Comm_dup(MPI_COMM_WORLD,&local_comm);CHKERRQ(ierr);
>>>>>> src/sys/objects/pinit.c:      ierr = MPI_Comm_dup(MPI_COMM_WORLD,&local_comm);CHKERRQ(ierr);
>>>>>> src/sys/objects/tagm.c:      ierr = MPI_Comm_dup(comm_in,comm_out);CHKERRQ(ierr);
>>>>>> src/sys/utils/mpiu.c:  ierr = MPI_Comm_dup(comm,&local_comm);CHKERRQ(ierr);
>>>>>> src/ts/impls/implicit/sundials/sundials.c:  ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)ts),&(cvode->comm_sundials));CHKERRQ(ierr);
>>>>>> 
>>>>>> Perhaps we need a PetscCommDuplicateExternalPkg() to somehow avoid these MPI_Comm_dup() calls?
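>>>>>> 
>>>>>> (Purely a sketch of the idea - no such routine exists today: each package could
>>>>>> keep at most one duplicate per user communicator by caching it in an MPI
>>>>>> attribute, much as PetscCommDuplicate() caches PETSc's inner communicator.
>>>>>> Packages would still get their own duplicates, they just stop paying for one
>>>>>> per object. Error checking omitted; the names are made up.
>>>>>> 
>>>>>>     #include <mpi.h>
>>>>>>     #include <stdlib.h>
>>>>>> 
>>>>>>     /* Each package holds its own keyval, statically initialized to
>>>>>>        MPI_KEYVAL_INVALID, so different packages never share a duplicate. */
>>>>>>     int PetscCommDuplicateExternalPkg(MPI_Comm comm, int *pkg_keyval, MPI_Comm *pkgcomm)
>>>>>>     {
>>>>>>       MPI_Comm *cached;
>>>>>>       int       found;
>>>>>> 
>>>>>>       if (*pkg_keyval == MPI_KEYVAL_INVALID) {
>>>>>>         MPI_Comm_create_keyval(MPI_COMM_NULL_COPY_FN, MPI_COMM_NULL_DELETE_FN, pkg_keyval, NULL);
>>>>>>       }
>>>>>>       MPI_Comm_get_attr(comm, *pkg_keyval, &cached, &found);
>>>>>>       if (!found) {                 /* first request on this comm: dup once, cache it */
>>>>>>         cached = (MPI_Comm *)malloc(sizeof(MPI_Comm));
>>>>>>         MPI_Comm_dup(comm, cached);
>>>>>>         MPI_Comm_set_attr(comm, *pkg_keyval, cached);
>>>>>>       }
>>>>>>       *pkgcomm = *cached;
>>>>>>       return 0;
>>>>>>     }
>>>>>> 
>>>>>> A real version would also need a delete callback that frees the cached
>>>>>> duplicate when the user communicator is freed; that is left out here.)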
>>>>>> 
>>>>>> Satish
>>>>>> 
>>>>>> On Tue, 3 Apr 2018, Smith, Barry F. wrote:
>>>>>> 
>>>>>>> 
>>>>>>> Are we sure this is a PETSc comm issue and not a hypre comm duplication issue?
>>>>>>> 
>>>>>>> frame #6: 0x00000001061345d9 libpetsc.3.07.dylib`hypre_GenerateSubComm(comm=-1006627852, participate=<unavailable>, new_comm_ptr=<unavailable>) + 409 at gen_redcs_mat.c:531 [opt]
>>>>>>> 
>>>>>>> Looks like hypre needs to generate subcomms here; perhaps it generates too many?
>>>>>>> 
>>>>>>>  Barry
>>>>>>> 
>>>>>>> 
>>>>>>>> On Apr 2, 2018, at 7:07 PM, Derek Gaston <friedmud at gmail.com> wrote:
>>>>>>>> 
>>>>>>>> I’m working with Fande on this and I would like to add a bit more. There are
>>>>>>>> many circumstances where we aren’t working on COMM_WORLD at all (e.g. working
>>>>>>>> on a sub-communicator) but PETSc was initialized using MPI_COMM_WORLD (think
>>>>>>>> multi-level solves)… and we need to create arbitrarily many PETSc
>>>>>>>> vecs/mats/solvers/preconditioners and solve. We definitely can’t rely on using
>>>>>>>> PETSC_COMM_WORLD to avoid triggering duplication.
>>>>>>>> 
>>>>>>>> Can you explain why PETSc needs to duplicate the communicator so much?
>>>>>>>> 
>>>>>>>> Thanks for your help in tracking this down!
>>>>>>>> 
>>>>>>>> Derek
>>>>>>>> 
>>>>>>>> On Mon, Apr 2, 2018 at 5:44 PM Kong, Fande <fande.kong at inl.gov> wrote:
>>>>>>>> Why do we not use user-level MPI communicators directly? What are the potential risks here?
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Fande,
>>>>>>>> 
>>>>>>>> On Mon, Apr 2, 2018 at 5:08 PM, Satish Balay <balay at mcs.anl.gov> wrote:
>>>>>>>> PETSC_COMM_WORLD [via PetscCommDuplicate()] attempts to minimize calls to
>>>>>>>> MPI_Comm_dup() - thus potentially avoiding such errors.
>>>>>>>> 
>>>>>>>> http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Sys/PetscCommDuplicate.html
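>>>>>>>> 
>>>>>>>> Roughly, PETSc-side code obtains its communicator along these lines (a sketch,
>>>>>>>> not a literal call site), which is why repeated object creation on the same
>>>>>>>> user communicator does not keep consuming context ids:
>>>>>>>> 
>>>>>>>>     #include <petscsys.h>
>>>>>>>> 
>>>>>>>>     PetscErrorCode UseCachedComm(MPI_Comm user_comm)
>>>>>>>>     {
>>>>>>>>       PetscErrorCode ierr;
>>>>>>>>       MPI_Comm       work_comm;
>>>>>>>>       PetscMPIInt    tag;
>>>>>>>> 
>>>>>>>>       PetscFunctionBegin;
>>>>>>>>       /* Returns PETSc's cached inner duplicate of user_comm (created only on
>>>>>>>>          first use and reference counted) plus an unused tag, instead of a
>>>>>>>>          fresh MPI_Comm_dup() every time. */
>>>>>>>>       ierr = PetscCommDuplicate(user_comm, &work_comm, &tag);CHKERRQ(ierr);
>>>>>>>>       /* ... communicate on work_comm using tag ... */
>>>>>>>>       ierr = PetscCommDestroy(&work_comm);CHKERRQ(ierr); /* drops the reference */
>>>>>>>>       PetscFunctionReturn(0);
>>>>>>>>     }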
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Satish
>>>>>>>> 
>>>>>>>> On Mon, 2 Apr 2018, Kong, Fande wrote:
>>>>>>>> 
>>>>>>>>> On Mon, Apr 2, 2018 at 4:23 PM, Satish Balay <balay at mcs.anl.gov> wrote:
>>>>>>>>> 
>>>>>>>>>> Does this 'standard test' use MPI_COMM_WORLD to create PETSc objects?
>>>>>>>>>> 
>>>>>>>>>> If so - you could try changing to PETSC_COMM_WORLD
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> I do not think we are using PETSC_COMM_WORLD when creating PETSc objects.
>>>>>>>>> Why can we not use MPI_COMM_WORLD?
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Fande,
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Satish
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Mon, 2 Apr 2018, Kong, Fande wrote:
>>>>>>>>>> 
>>>>>>>>>>> Hi All,
>>>>>>>>>>> 
>>>>>>>>>>> I am trying to upgrade PETSc from 3.7.6 to 3.8.3 for MOOSE and its
>>>>>>>>>>> applications. I get an error message from a standard test:
>>>>>>>>>>> 
>>>>>>>>>>> preconditioners/pbp.lots_of_variables: MPI had an error
>>>>>>>>>>> preconditioners/pbp.lots_of_variables: ------------------------------------------------
>>>>>>>>>>> preconditioners/pbp.lots_of_variables: Other MPI error, error stack:
>>>>>>>>>>> preconditioners/pbp.lots_of_variables: PMPI_Comm_dup(177)..................: MPI_Comm_dup(comm=0x84000001, new_comm=0x97d1068) failed
>>>>>>>>>>> preconditioners/pbp.lots_of_variables: PMPI_Comm_dup(162)..................:
>>>>>>>>>>> preconditioners/pbp.lots_of_variables: MPIR_Comm_dup_impl(57)..............:
>>>>>>>>>>> preconditioners/pbp.lots_of_variables: MPIR_Comm_copy(739).................:
>>>>>>>>>>> preconditioners/pbp.lots_of_variables: MPIR_Get_contextid_sparse_group(614): Too many communicators (0/2048 free on this process; ignore_id=0)
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> I did "git bisect", and the following commit introduced this issue:
>>>>>>>>>>> 
>>>>>>>>>>> commit 49a781f5cee36db85e8d5b951eec29f10ac13593
>>>>>>>>>>> Author: Stefano Zampini <stefano.zampini at gmail.com>
>>>>>>>>>>> Date:   Sat Nov 5 20:15:19 2016 +0300
>>>>>>>>>>> 
>>>>>>>>>>>     PCHYPRE: use internal Mat of type MatHYPRE
>>>>>>>>>>> 
>>>>>>>>>>>     hpmat already stores two HYPRE vectors
>>>>>>>>>>> 
>>>>>>>>>>> Before I debug line by line, does anyone have a clue about this?
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Fande,
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>> 
>>>>> 
>> 
>> 
> 
> 


