[petsc-users] A bad commit affects MOOSE
Kong, Fande
fande.kong at inl.gov
Tue Apr 3 12:41:01 CDT 2018
On Tue, Apr 3, 2018 at 11:29 AM, Smith, Barry F. <bsmith at mcs.anl.gov> wrote:
>
> Fande,
>
> The reason for MPI_Comm_dup() and the inner communicator is that this
> communicator is used by hypre and so cannot "just" be a PETSc communicator.
> We cannot have PETSc and hypre using the same communicator since they may
> capture each other's messages, etc.
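
(To make "capture each other's messages" concrete: below is a minimal two-rank sketch of the hazard when two libraries post receives on one shared communicator. The "library A" / "library B" roles are purely illustrative; this is not PETSc or hypre code.)

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
  int         rank, data, stolen = -1;
  MPI_Request req;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  if (rank == 1) {
    /* "library B" posts a wildcard receive on the shared communicator */
    MPI_Irecv(&stolen, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &req);
  }
  MPI_Barrier(MPI_COMM_WORLD);
  if (rank == 0) {
    /* "library A" sends a message intended for its own receive with tag 7 */
    data = 42;
    MPI_Send(&data, 1, MPI_INT, 1, 7, MPI_COMM_WORLD);
  } else if (rank == 1) {
    /* library B's wildcard receive matches library A's message; any
       MPI_Recv that library A later posts for tag 7 would hang */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
    printf("library B captured a message it was never meant to see: %d\n", stolen);
  }
  MPI_Finalize();
  return 0;
}

Giving hypre a duplicated communicator puts its traffic in a separate matching space, so neither library can intercept the other's point-to-point messages.
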
>
> See my pull request that I think should resolve the issue in the
> short term,
>
Yes, that helps as well.
The question then becomes: we cannot have more than 2000 AMG solvers in one
application, because each Hypre instance owns its own communicator. Is there no
way to have all AMG solvers share the same HYPRE-side communicator, just like
what we are doing for PETSc objects?
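
(To sketch the kind of sharing being asked about, under the assumption that one duplicated hypre-side communicator per PETSc communicator would be acceptable to hypre: cache the duplicate as an MPI attribute and reference-count it, so thousands of preconditioners cost a single MPI_Comm_dup. All names below are hypothetical; this is an illustration, not Barry's pull request.)

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct { MPI_Comm comm; int refct; } HypreCommLink;
static int hypre_comm_keyval = MPI_KEYVAL_INVALID;

/* get (or create on first use) the shared hypre-side communicator */
static int GetSharedHypreComm(MPI_Comm petsccomm, MPI_Comm *hcomm)
{
  HypreCommLink *link;
  int            flag;

  if (hypre_comm_keyval == MPI_KEYVAL_INVALID) {
    MPI_Comm_create_keyval(MPI_COMM_NULL_COPY_FN, MPI_COMM_NULL_DELETE_FN, &hypre_comm_keyval, NULL);
  }
  MPI_Comm_get_attr(petsccomm, hypre_comm_keyval, &link, &flag);
  if (!flag) {                              /* first hypre object on this communicator */
    link = (HypreCommLink*)malloc(sizeof(*link));
    MPI_Comm_dup(petsccomm, &link->comm);   /* the only dup ever done for this comm */
    link->refct = 0;
    MPI_Comm_set_attr(petsccomm, hypre_comm_keyval, link);
  }
  link->refct++;
  *hcomm = link->comm;
  return 0;
}

/* release one reference; free the duplicate only when the last user is gone */
static int RestoreSharedHypreComm(MPI_Comm petsccomm)
{
  HypreCommLink *link;
  int            flag;

  MPI_Comm_get_attr(petsccomm, hypre_comm_keyval, &link, &flag);
  if (flag && --link->refct == 0) {
    MPI_Comm_free(&link->comm);
    MPI_Comm_delete_attr(petsccomm, hypre_comm_keyval);
    free(link);
  }
  return 0;
}

int main(int argc, char **argv)
{
  MPI_Comm hcomm1, hcomm2;
  MPI_Init(&argc, &argv);
  GetSharedHypreComm(MPI_COMM_WORLD, &hcomm1);   /* e.g. first hypre preconditioner */
  GetSharedHypreComm(MPI_COMM_WORLD, &hcomm2);   /* second one: same comm, no new dup */
  printf("same hypre-side communicator: %s\n", hcomm1 == hcomm2 ? "yes" : "no");
  RestoreSharedHypreComm(MPI_COMM_WORLD);
  RestoreSharedHypreComm(MPI_COMM_WORLD);        /* last release frees the duplicate */
  MPI_Finalize();
  return 0;
}

The bookkeeping is the price: the cached duplicate may only be freed after the last hypre object that uses it has been destroyed.
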
Fande,
>
> Barry
>
>
> > On Apr 3, 2018, at 11:21 AM, Kong, Fande <fande.kong at inl.gov> wrote:
> >
> > Figured it out:
> >
> > The reason is that in MatCreate_HYPRE(Mat B), we call MPI_Comm_dup
> > instead of PetscCommDuplicate. PetscCommDuplicate is better: it
> > does not actually create a new communicator if the communicator is
> > already known to PETSc.
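
(A minimal sketch of that behaviour, not MOOSE or PETSc library source: PetscCommDuplicate() attaches one inner communicator to the given communicator, reference-counts it, and hands out fresh tags, so the ten thousand "solver-like" duplications below create a single MPI communicator, whereas the raw MPI_Comm_dup() loop in Satish's test further down fails after a couple of thousand.)

#include <petscsys.h>

#define NSOLVERS 10000  /* stand-in for thousands of simultaneous solver objects */

int main(int argc, char **argv)
{
  PetscErrorCode ierr;
  MPI_Comm       inner[NSOLVERS];
  PetscMPIInt    tag;
  PetscInt       i;
  PetscBool      all_same = PETSC_TRUE;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;
  for (i = 0; i < NSOLVERS; i++) {
    /* each call returns the one cached inner communicator plus a fresh tag */
    ierr = PetscCommDuplicate(PETSC_COMM_WORLD, &inner[i], &tag);CHKERRQ(ierr);
  }
  for (i = 1; i < NSOLVERS; i++) if (inner[i] != inner[0]) all_same = PETSC_FALSE;
  ierr = PetscPrintf(PETSC_COMM_WORLD, "one shared inner communicator: %s\n", all_same ? "yes" : "no");CHKERRQ(ierr);
  for (i = 0; i < NSOLVERS; i++) {
    /* drops one reference; the inner communicator is freed only at the last one */
    ierr = PetscCommDestroy(&inner[i]);CHKERRQ(ierr);
  }
  ierr = PetscFinalize();
  return ierr;
}
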
> >
> > Furthermore, I do not think we should have a comm in
> >
> > typedef struct {
> >   HYPRE_IJMatrix ij;
> >   HYPRE_IJVector x;
> >   HYPRE_IJVector b;
> >   MPI_Comm       comm;
> > } Mat_HYPRE;
> >
> > It is inner data of the Mat, and it should already have the same comm as the
> > Mat. I do not understand why the internal data has its own comm.
> >
> > The following patch fixed the issue (just deleted this extra comm).
> >
> > diff --git a/src/mat/impls/hypre/mhypre.c b/src/mat/impls/hypre/mhypre.c
> > index dc19892..d8cfe3d 100644
> > --- a/src/mat/impls/hypre/mhypre.c
> > +++ b/src/mat/impls/hypre/mhypre.c
> > @@ -74,7 +74,7 @@ static PetscErrorCode MatHYPRE_CreateFromMat(Mat A, Mat_HYPRE *hA)
> >    rend   = A->rmap->rend;
> >    cstart = A->cmap->rstart;
> >    cend   = A->cmap->rend;
> > -  PetscStackCallStandard(HYPRE_IJMatrixCreate,(hA->comm,rstart,rend-1,cstart,cend-1,&hA->ij));
> > +  PetscStackCallStandard(HYPRE_IJMatrixCreate,(PetscObjectComm((PetscObject)A),rstart,rend-1,cstart,cend-1,&hA->ij));
> >    PetscStackCallStandard(HYPRE_IJMatrixSetObjectType,(hA->ij,HYPRE_PARCSR));
> >    {
> >      PetscBool same;
> > @@ -434,7 +434,7 @@ PetscErrorCode MatDestroy_HYPRE(Mat A)
> >    if (hA->x) PetscStackCallStandard(HYPRE_IJVectorDestroy,(hA->x));
> >    if (hA->b) PetscStackCallStandard(HYPRE_IJVectorDestroy,(hA->b));
> >    if (hA->ij) PetscStackCallStandard(HYPRE_IJMatrixDestroy,(hA->ij));
> > -  if (hA->comm) { ierr = MPI_Comm_free(&hA->comm);CHKERRQ(ierr);}
> > +  /*if (hA->comm) { ierr = MPI_Comm_free(&hA->comm);CHKERRQ(ierr);}*/
> >    ierr = PetscObjectComposeFunction((PetscObject)A,"MatConvert_hypre_aij_C",NULL);CHKERRQ(ierr);
> >    ierr = PetscFree(A->data);CHKERRQ(ierr);
> >    PetscFunctionReturn(0);
> > @@ -500,7 +500,8 @@ PETSC_EXTERN PetscErrorCode MatCreate_HYPRE(Mat B)
> >    B->ops->destroy     = MatDestroy_HYPRE;
> >    B->ops->assemblyend = MatAssemblyEnd_HYPRE;
> >
> > -  ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)B),&hB->comm);CHKERRQ(ierr);
> > +  /*ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)B),&hB->comm);CHKERRQ(ierr);*/
> > +  /*ierr = PetscCommDuplicate(PetscObjectComm((PetscObject)B),&hB->comm,NULL);CHKERRQ(ierr);*/
> >    ierr = PetscObjectChangeTypeName((PetscObject)B,MATHYPRE);CHKERRQ(ierr);
> >    ierr = PetscObjectComposeFunction((PetscObject)B,"MatConvert_hypre_aij_C",MatConvert_HYPRE_AIJ);CHKERRQ(ierr);
> >    PetscFunctionReturn(0);
> > diff --git a/src/mat/impls/hypre/mhypre.h b/src/mat/impls/hypre/mhypre.h
> > index 3d9ddd2..1189020 100644
> > --- a/src/mat/impls/hypre/mhypre.h
> > +++ b/src/mat/impls/hypre/mhypre.h
> > @@ -10,7 +10,7 @@ typedef struct {
> >    HYPRE_IJMatrix ij;
> >    HYPRE_IJVector x;
> >    HYPRE_IJVector b;
> > -  MPI_Comm       comm;
> > +  /*MPI_Comm       comm;*/
> > } Mat_HYPRE;
> >
> >
> >
> > Fande,
> >
> >
> >
> >
> > On Tue, Apr 3, 2018 at 10:35 AM, Satish Balay <balay at mcs.anl.gov> wrote:
> > On Tue, 3 Apr 2018, Satish Balay wrote:
> >
> > > On Tue, 3 Apr 2018, Derek Gaston wrote:
> > >
> > > > One thing I want to be clear about here: we're not trying to solve
> > > > this particular problem (where we're creating 1000 instances of Hypre
> > > > to precondition each variable independently)... this particular problem
> > > > is just a test (that we've had in our test suite for a long time) to
> > > > stress test some of this capability.
> > > >
> > > > We really do have needs for thousands (tens of thousands) of
> > > > simultaneous solves (each with their own Hypre instances). That's not
> > > > what this particular problem is doing - but it is representative of a
> > > > class of our problems we need to solve.
> > > >
> > > > Which does bring up a point: I have been able to do solves before with
> > > > ~50,000 separate PETSc solves without issue. Is it because I was
> > > > working with MVAPICH on a cluster? Does it just have a higher limit?
> > >
> > > Don't know - but that's easy to find out with a simple test code:
> > >
> > > >>>>>>
> > > $ cat comm_dup_test.c
> > > #include <mpi.h>
> > > #include <stdio.h>
> > >
> > > int main(int argc, char** argv) {
> > >   MPI_Comm newcomm;
> > >   int i, err;
> > >   MPI_Init(NULL, NULL);
> > >   for (i=0; i<100000; i++) {
> > >     err = MPI_Comm_dup(MPI_COMM_WORLD, &newcomm);
> > >     if (err) {
> > >       printf("%5d - fail\n",i); fflush(stdout);
> > >       break;
> > >     } else {
> > >       printf("%5d - success\n",i); fflush(stdout);
> > >     }
> > >   }
> > >   MPI_Finalize();
> > > }
> > > <<<<<<<
> > >
> > > OpenMPI fails after '65531' and MPICH after '2044'. MVAPICH is derived
> > > from MPICH - but it's possible they have a different limit than MPICH.
> >
> > BTW: the above is with: openmpi-2.1.2 and mpich-3.3b1
> >
> > mvapich2-1.9.5 - and I get an error after '2044' comm dupes
> >
> > Satish
> >
>
>