[petsc-users] A bad commit affects MOOSE
Kong, Fande
fande.kong at inl.gov
Tue Apr 3 11:14:44 CDT 2018
The first bad commit:
commit 49a781f5cee36db85e8d5b951eec29f10ac13593
Author: Stefano Zampini <stefano.zampini at gmail.com>
Date:   Sat Nov 5 20:15:19 2016 +0300

    PCHYPRE: use internal Mat of type MatHYPRE

    hpmat already stores two HYPRE vectors

Hypre version:
~/projects/petsc/arch-darwin-c-opt-bisect_bad/externalpackages/git.hypre]> git branch
* (HEAD detached at 83b1f19)

The last good commit:
commit 63c07aad33d943fe85193412d077a1746a7c55aa
Author: Stefano Zampini <stefano.zampini at gmail.com>
Date:   Sat Nov 5 19:30:12 2016 +0300

    MatHYPRE: create new matrix type

    The conversion from AIJ to HYPRE has been taken from
    src/dm/impls/da/hypre/mhyp.c
    HYPRE to AIJ is new

Hypre version:
/projects/petsc/arch-darwin-c-opt-bisect/externalpackages/git.hypre]> git branch
* (HEAD detached at 83b1f19)

We are using the same HYPRE version.
I will narrow it down line by line.
Fande,
On Tue, Apr 3, 2018 at 9:50 AM, Stefano Zampini <stefano.zampini at gmail.com>
wrote:
>
> On Apr 3, 2018, at 5:43 PM, Fande Kong <fdkong.jd at gmail.com> wrote:
>
>
>
> On Tue, Apr 3, 2018 at 9:12 AM, Stefano Zampini <stefano.zampini at gmail.com> wrote:
>
>>
>> On Apr 3, 2018, at 4:58 PM, Satish Balay <balay at mcs.anl.gov> wrote:
>>
>> On Tue, 3 Apr 2018, Kong, Fande wrote:
>>
>> On Tue, Apr 3, 2018 at 1:17 AM, Smith, Barry F. <bsmith at mcs.anl.gov>
>> wrote:
>>
>>
>> Each external package definitely needs its own duplicated communicator;
>> cannot share between packages.
>>
>> The only problem with the dups below is if they are in a loop and get
>> called many times.
>>
>>
>>
>> The "standard test" that has this issue actually has 1K fields. MOOSE
>> creates its own field-split preconditioner (not based on the PETSc
>> fieldsplit), and each field is associated with one PC HYPRE. If PETSc
>> duplicates communicators, we should easily reach the limit of 2048.
>>
>> I also want to confirm what extra communicators are introduced in the bad
>> commit.
>>
>>
>> To me it looks like there is 1 extra comm created [for MATHYPRE] for each
>> PCHYPRE that is created [which also creates one comm for this object].
>>
>>
>> You’re right; however, it was the same before the commit.
>> I don’t understand how this specific commit is related to this issue,
>> since the error is not in the MPI_Comm_dup inside MatCreate_MATHYPRE.
>> Actually, the error comes from MPI_Comm_create.
>>
>>
>>
>>
>>
>>   * frame #5: 0x00000001068defd4 libmpi.12.dylib`MPI_Comm_create + 3492
>>     frame #6: 0x00000001061345d9 libpetsc.3.07.dylib`hypre_GenerateSubComm(comm=-1006627852, participate=<unavailable>, new_comm_ptr=<unavailable>) + 409 at gen_redcs_mat.c:531 [opt]
>>     frame #7: 0x000000010618f8ba libpetsc.3.07.dylib`hypre_GaussElimSetup(amg_data=0x00007fe7ff857a00, level=<unavailable>, relax_type=9) + 74 at par_relax.c:4209 [opt]
>>     frame #8: 0x0000000106140e93 libpetsc.3.07.dylib`hypre_BoomerAMGSetup(amg_vdata=<unavailable>, A=0x00007fe80842aff0, f=0x00007fe80842a980, u=0x00007fe80842a510) + 17699 at par_amg_setup.c:2108 [opt]
>>     frame #9: 0x0000000105ec773c libpetsc.3.07.dylib`PCSetUp_HYPRE(pc=<unavailable>) + 2540 at hypre.c:226 [opt]
>>
>> How did you perform the bisection? make clean + make all ? Which version
>> of HYPRE are you using?
>>
>
> I did it more aggressively.
>
> "rm -rf arch-darwin-c-opt-bisect "
>
> "./configure --optionsModule=config.compilerOptions -with-debugging=no
> --with-shared-libraries=1 --with-mpi=1 --download-fblaslapack=1
> --download-metis=1 --download-parmetis=1 --download-superlu_dist=1
> --download-hypre=1 --download-mumps=1 --download-scalapack=1
> PETSC_ARCH=arch-darwin-c-opt-bisect"
>
>
> Good, so this removes some possible sources of errors
>
>
> HYPRE version:
>
>
> self.gitcommit = 'v2.11.1-55-g2ea0e43'
> self.download = ['git://https://github.com/LLNL/hypre','https://github.com/LLNL/hypre/archive/'+self.gitcommit+'.tar.gz']
>
>
>
> When reconfiguring, the HYPRE version can be different too (that commit
> is from 11/2016, so the HYPRE version used by the PETSc configure may have
> been upgraded since then)
>
> I do not think this is caused by HYPRE.
>
>
> Fande,
>
>
>
>>
>> But you might want to verify [by linking with mpi trace library?]
>>
>>
>> There are some debugging hints at https://lists.mpich.org/pipermail/discuss/2012-December/000148.html
>> [wrt mpich] - which I haven't checked..
>>
>> Satish
>>
>>
>>
>> Fande,
>>
>>
>>
>>
>> To debug the hypre/duplication issue in MOOSE I would run in the
>> debugger with a break point in MPI_Comm_dup() and see
>> who keeps calling it an unreasonable number of times. (My guess is this is
>> a new "feature" in hypre that they will need to fix, but only debugging
>> will tell.)
>>
>> Barry
>>
>>
>> On Apr 2, 2018, at 7:44 PM, Balay, Satish <balay at mcs.anl.gov> wrote:
>>
>> We do a MPI_Comm_dup() for objects related to externalpackages.
>>
>> Looks like we added a new mat type MATHYPRE in 3.8 that PCHYPRE is
>> using. Previously there was one MPI_Comm_dup() for PCHYPRE; now I think
>> there is one more for MATHYPRE, so more calls to MPI_Comm_dup in 3.8 vs 3.7.
>>
>> src/dm/impls/da/hypre/mhyp.c:  ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)B),&(ex->hcomm));CHKERRQ(ierr);
>> src/dm/impls/da/hypre/mhyp.c:  ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)B),&(ex->hcomm));CHKERRQ(ierr);
>> src/dm/impls/swarm/data_ex.c:  ierr = MPI_Comm_dup(comm,&d->comm);CHKERRQ(ierr);
>> src/ksp/pc/impls/hypre/hypre.c:  ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)pc),&(jac->comm_hypre));CHKERRQ(ierr);
>> src/ksp/pc/impls/hypre/hypre.c:  ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)pc),&(ex->hcomm));CHKERRQ(ierr);
>> src/ksp/pc/impls/hypre/hypre.c:  ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)pc),&(ex->hcomm));CHKERRQ(ierr);
>> src/ksp/pc/impls/spai/ispai.c:  ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)pc),&(ispai->comm_spai));CHKERRQ(ierr);
>> src/mat/examples/tests/ex152.c:  ierr = MPI_Comm_dup(MPI_COMM_WORLD,&comm);CHKERRQ(ierr);
>> src/mat/impls/aij/mpi/mkl_cpardiso/mkl_cpardiso.c:  ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)A),&(mat_mkl_cpardiso->comm_mkl_cpardiso));CHKERRQ(ierr);
>> src/mat/impls/aij/mpi/mumps/mumps.c:  ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)A),&(mumps->comm_mumps));CHKERRQ(ierr);
>> src/mat/impls/aij/mpi/pastix/pastix.c:  ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)A),&(lu->pastix_comm));CHKERRQ(ierr);
>> src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c:  ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)A),&(lu->comm_superlu));CHKERRQ(ierr);
>> src/mat/impls/hypre/mhypre.c:  ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)B),&hB->comm);CHKERRQ(ierr);
>> src/mat/partition/impls/pmetis/pmetis.c:  ierr = MPI_Comm_dup(pcomm,&comm);CHKERRQ(ierr);
>> src/sys/mpiuni/mpi.c:    MPI_COMM_SELF, MPI_COMM_WORLD, and a MPI_Comm_dup() of each of these (duplicates of duplicates return the same communictor)
>> src/sys/mpiuni/mpi.c:int MPI_Comm_dup(MPI_Comm comm,MPI_Comm *out)
>> src/sys/objects/pinit.c:  ierr = MPI_Comm_dup(MPI_COMM_WORLD,&local_comm);CHKERRQ(ierr);
>> src/sys/objects/pinit.c:  ierr = MPI_Comm_dup(MPI_COMM_WORLD,&local_comm);CHKERRQ(ierr);
>> src/sys/objects/tagm.c:  ierr = MPI_Comm_dup(comm_in,comm_out);CHKERRQ(ierr);
>> src/sys/utils/mpiu.c:  ierr = MPI_Comm_dup(comm,&local_comm);CHKERRQ(ierr);
>> src/ts/impls/implicit/sundials/sundials.c:  ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)ts),&(cvode->comm_sundials));CHKERRQ(ierr);
>>
>>
>> Perhaps we need a PetscCommDuplicateExternalPkg() to somehow avoid
>> these MPI_Comm_dup() calls?
>>
>>
>> Satish
>>
>> On Tue, 3 Apr 2018, Smith, Barry F. wrote:
>>
>>
>> Are we sure this is a PETSc comm issue and not a hypre comm
>> duplication issue?
>>
>>
>> frame #6: 0x00000001061345d9 libpetsc.3.07.dylib`hypre_GenerateSubComm(comm=-1006627852, participate=<unavailable>, new_comm_ptr=<unavailable>) + 409 at gen_redcs_mat.c:531 [opt]
>>
>>
>> Looks like hypre needs to generate subcomms; perhaps it generates
>> too many?
>>
>>
>> Barry
>>
>>
>> On Apr 2, 2018, at 7:07 PM, Derek Gaston <friedmud at gmail.com> wrote:
>>
>> I’m working with Fande on this and I would like to add a bit more.
>>
>> There are many circumstances where we aren’t working on COMM_WORLD at all
>> (e.g. working on a sub-communicator) but PETSc was initialized using
>> MPI_COMM_WORLD (think multi-level solves)… and we need to create
>> arbitrarily many PETSc vecs/mats/solvers/preconditioners and solve. We
>> definitely can’t rely on using PETSC_COMM_WORLD to avoid triggering
>> duplication.
>>
>>
>> Can you explain why PETSc needs to duplicate the communicator so much?
>>
>> Thanks for your help in tracking this down!
>>
>> Derek
>>
>> On Mon, Apr 2, 2018 at 5:44 PM Kong, Fande <fande.kong at inl.gov> wrote:
>> Why do we not use user-level MPI communicators directly? What are the
>> potential risks here?
>>
>>
>>
>> Fande,
>>
>> On Mon, Apr 2, 2018 at 5:08 PM, Satish Balay <balay at mcs.anl.gov> wrote:
>>
>> PETSC_COMM_WORLD [via PetscCommDuplicate()] attempts to minimize calls
>> to MPI_Comm_dup() - thus potentially avoiding such errors
>>
>>
>> http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Sys/PetscCommDuplicate.html
>>
>>
>>
>> Satish
>>
>> On Mon, 2 Apr 2018, Kong, Fande wrote:
>>
>> On Mon, Apr 2, 2018 at 4:23 PM, Satish Balay <balay at mcs.anl.gov> wrote:
>>
>>
>> Does this 'standard test' use MPI_COMM_WORLD to create PETSc objects?
>> If so - you could try changing to PETSC_COMM_WORLD
>>
>>
>>
>> I do not think we are using PETSC_COMM_WORLD when creating PETSc
>> objects.
>>
>> Why can we not use MPI_COMM_WORLD?
>>
>>
>> Fande,
>>
>>
>>
>> Satish
>>
>>
>> On Mon, 2 Apr 2018, Kong, Fande wrote:
>>
>> Hi All,
>>
>> I am trying to upgrade PETSc from 3.7.6 to 3.8.3 for MOOSE and its
>> applications. I get an error message for a standard test:
>>
>> preconditioners/pbp.lots_of_variables: MPI had an error
>> preconditioners/pbp.lots_of_variables: ------------------------------------------------
>> preconditioners/pbp.lots_of_variables: Other MPI error, error stack:
>> preconditioners/pbp.lots_of_variables: PMPI_Comm_dup(177)..................: MPI_Comm_dup(comm=0x84000001, new_comm=0x97d1068) failed
>> preconditioners/pbp.lots_of_variables: PMPI_Comm_dup(162)..................:
>> preconditioners/pbp.lots_of_variables: MPIR_Comm_dup_impl(57)..............:
>> preconditioners/pbp.lots_of_variables: MPIR_Comm_copy(739).................:
>> preconditioners/pbp.lots_of_variables: MPIR_Get_contextid_sparse_group(614): Too many communicators (0/2048 free on this process; ignore_id=0)
>>
>>
>> I did "git bisect", and the following commit introduces this issue:
>>
>> commit 49a781f5cee36db85e8d5b951eec29f10ac13593
>> Author: Stefano Zampini <stefano.zampini at gmail.com>
>> Date:   Sat Nov 5 20:15:19 2016 +0300
>>
>>     PCHYPRE: use internal Mat of type MatHYPRE
>>
>>     hpmat already stores two HYPRE vectors
>>
>> Before I debug line by line, does anyone have a clue about this?
>>
>>
>> Fande,
>>