[petsc-users] A bad commit affects MOOSE

Satish Balay balay at mcs.anl.gov
Tue Apr 3 12:27:28 CDT 2018


On Tue, 3 Apr 2018, Kong, Fande wrote:

> Figured out:
> 
> The reason is that in MatCreate_HYPRE(Mat B), we call MPI_Comm_dup
> instead of PetscCommDuplicate. PetscCommDuplicate is better: it does
> not actually create a new communicator if the communicator is already
> known to PETSc.
> 
> Furthermore, I do not think we should have a comm in
> 
> typedef struct {
>   HYPRE_IJMatrix ij;
>   HYPRE_IJVector x;
>   HYPRE_IJVector b;
>   MPI_Comm       comm;
> } Mat_HYPRE;
> 
> It is internal data of the Mat, and it should already have the same comm
> as the Mat. I do not understand why the internal data has its own comm.

As mentioned before - we create a separate comm using MPI_Comm_dup() to pass down to the hypre routines [corresponding to this PETSc object]. PetscCommDuplicate() is used for all communicator usage required within PETSc itself.
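
To illustrate the difference, here is a minimal standalone sketch (hypothetical test code, not from the original patch; assumes a standard PETSc build):

>>>>>>
#include <petscsys.h>

int main(int argc, char **argv)
{
  MPI_Comm       pcomm1, pcomm2, mcomm;
  PetscMPIInt    tag;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL);if (ierr) return ierr;

  /* PetscCommDuplicate() attaches an inner comm to the given comm the
     first time around, and returns that same inner comm on every later
     call - no new MPI communicator (no new context id) is consumed */
  ierr = PetscCommDuplicate(PETSC_COMM_WORLD,&pcomm1,&tag);CHKERRQ(ierr);
  ierr = PetscCommDuplicate(PETSC_COMM_WORLD,&pcomm2,&tag);CHKERRQ(ierr);
  ierr = PetscPrintf(PETSC_COMM_WORLD,"same inner comm: %d\n",(int)(pcomm1 == pcomm2));CHKERRQ(ierr);

  /* MPI_Comm_dup() creates a brand-new communicator on every call -
     each one consumes one of the implementation's finite context ids */
  ierr = MPI_Comm_dup(PETSC_COMM_WORLD,&mcomm);CHKERRQ(ierr);
  ierr = MPI_Comm_free(&mcomm);CHKERRQ(ierr);

  ierr = PetscCommDestroy(&pcomm1);CHKERRQ(ierr);
  ierr = PetscCommDestroy(&pcomm2);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}
<<<<<<<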

The problem with this change is that now two libraries are using the same communicator but have no way to manage 'tags' in separate spaces - potentially resulting in messages from hypre being picked up by petsc, or vice versa.
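
Roughly how the tag management works on the PETSc side (a sketch using the usual PetscCommGetNewTag() API; hypre has no equivalent hook into this counter):

>>>>>>
#include <petscsys.h>

int main(int argc, char **argv)
{
  MPI_Comm       pcomm;
  PetscMPIInt    tag1, tag2;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL);if (ierr) return ierr;

  /* every PETSc operation that needs a tag draws a fresh one from the
     counter attached to the duplicated comm, so PETSc's own messages
     cannot collide with each other */
  ierr = PetscCommDuplicate(PETSC_COMM_WORLD,&pcomm,&tag1);CHKERRQ(ierr);
  ierr = PetscCommGetNewTag(pcomm,&tag2);CHKERRQ(ierr);
  ierr = PetscPrintf(PETSC_COMM_WORLD,"distinct tags: %d %d\n",tag1,tag2);CHKERRQ(ierr);

  /* hypre picks its tag values independently of this counter; on a
     shared communicator its sends/receives could match tags PETSc has
     already handed out - hence the separate MPI_Comm_dup()'d comm */

  ierr = PetscCommDestroy(&pcomm);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}
<<<<<<<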

Satish

> 
> The following patch fixed the issue (just deleted this extra comm).
> 
> diff --git a/src/mat/impls/hypre/mhypre.c b/src/mat/impls/hypre/mhypre.c
> index dc19892..d8cfe3d 100644
> --- a/src/mat/impls/hypre/mhypre.c
> +++ b/src/mat/impls/hypre/mhypre.c
> @@ -74,7 +74,7 @@ static PetscErrorCode MatHYPRE_CreateFromMat(Mat A, Mat_HYPRE *hA)
>    rend   = A->rmap->rend;
>    cstart = A->cmap->rstart;
>    cend   = A->cmap->rend;
> -  PetscStackCallStandard(HYPRE_IJMatrixCreate,(hA->comm,rstart,rend-1,cstart,cend-1,&hA->ij));
> +  PetscStackCallStandard(HYPRE_IJMatrixCreate,(PetscObjectComm((PetscObject)A),rstart,rend-1,cstart,cend-1,&hA->ij));
>    PetscStackCallStandard(HYPRE_IJMatrixSetObjectType,(hA->ij,HYPRE_PARCSR));
>    {
>      PetscBool      same;
> @@ -434,7 +434,7 @@ PetscErrorCode MatDestroy_HYPRE(Mat A)
>    if (hA->x) PetscStackCallStandard(HYPRE_IJVectorDestroy,(hA->x));
>    if (hA->b) PetscStackCallStandard(HYPRE_IJVectorDestroy,(hA->b));
>    if (hA->ij) PetscStackCallStandard(HYPRE_IJMatrixDestroy,(hA->ij));
> -  if (hA->comm) { ierr = MPI_Comm_free(&hA->comm);CHKERRQ(ierr);}
> +  /*if (hA->comm) { ierr = MPI_Comm_free(&hA->comm);CHKERRQ(ierr);}*/
>    ierr = PetscObjectComposeFunction((PetscObject)A,"MatConvert_hypre_aij_C",NULL);CHKERRQ(ierr);
>    ierr = PetscFree(A->data);CHKERRQ(ierr);
>    PetscFunctionReturn(0);
> @@ -500,7 +500,8 @@ PETSC_EXTERN PetscErrorCode MatCreate_HYPRE(Mat B)
>    B->ops->destroy       = MatDestroy_HYPRE;
>    B->ops->assemblyend   = MatAssemblyEnd_HYPRE;
> 
> -  ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)B),&hB->comm);CHKERRQ(ierr);
> +  /*ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)B),&hB->comm);CHKERRQ(ierr); */
> +  /*ierr = PetscCommDuplicate(PetscObjectComm((PetscObject)B),&hB->comm,NULL);CHKERRQ(ierr);*/
>    ierr = PetscObjectChangeTypeName((PetscObject)B,MATHYPRE);CHKERRQ(ierr);
>    ierr = PetscObjectComposeFunction((PetscObject)B,"MatConvert_hypre_aij_C",MatConvert_HYPRE_AIJ);CHKERRQ(ierr);
>    PetscFunctionReturn(0);
> diff --git a/src/mat/impls/hypre/mhypre.h b/src/mat/impls/hypre/mhypre.h
> index 3d9ddd2..1189020 100644
> --- a/src/mat/impls/hypre/mhypre.h
> +++ b/src/mat/impls/hypre/mhypre.h
> @@ -10,7 +10,7 @@ typedef struct {
>    HYPRE_IJMatrix ij;
>    HYPRE_IJVector x;
>    HYPRE_IJVector b;
> -  MPI_Comm       comm;
> +  /*MPI_Comm       comm;*/
>  } Mat_HYPRE;
> 
> 
> 
> Fande,
> 
> 
> 
> 
> On Tue, Apr 3, 2018 at 10:35 AM, Satish Balay <balay at mcs.anl.gov> wrote:
> 
> > On Tue, 3 Apr 2018, Satish Balay wrote:
> >
> > > On Tue, 3 Apr 2018, Derek Gaston wrote:
> > >
> > > > One thing I want to be clear about here is that we're not trying to
> > > > solve this particular problem (where we're creating 1000 instances of
> > > > Hypre to precondition each variable independently)... this particular
> > > > problem is just a test (that we've had in our test suite for a long
> > > > time) to stress test some of this capability.
> > > >
> > > > We really do have needs for thousands (tens of thousands) of
> > > > simultaneous solves (each with their own Hypre instances). That's not
> > > > what this particular problem is doing - but it is representative of a
> > > > class of our problems we need to solve.
> > > >
> > > > Which does bring up a point: I have been able to do solves before with
> > > > ~50,000 separate PETSc solves without issue. Is it because I was
> > > > working with MVAPICH on a cluster? Does it just have a higher limit?
> > >
> > > Don't know - but that's easy to find out with a simple test code:
> > >
> > > >>>>>>
> > > $ cat comm_dup_test.c
> > > #include <mpi.h>
> > > #include <stdio.h>
> > >
> > > int main(int argc, char** argv) {
> > >     MPI_Comm newcomm;
> > >     int i, err;
> > >     MPI_Init(NULL, NULL);
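> > >     /* dup COMM_WORLD repeatedly - fails once the MPI library runs out of communicator context ids */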
> > >     for (i=0; i<100000; i++) {
> > >       err = MPI_Comm_dup(MPI_COMM_WORLD, &newcomm);
> > >       if (err) {
> > >           printf("%5d - fail\n",i);fflush(stdout);
> > >           break;
> > >         } else {
> > >           printf("%5d - success\n",i);fflush(stdout);
> > >       }
> > >     }
> > >     MPI_Finalize();
> > > }
> > > <<<<<<<
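> > >
> > > To try it yourself [compiler wrapper/launcher names vary across MPI installs]:
> > >
> > >   mpicc comm_dup_test.c -o comm_dup_test
> > >   mpiexec -n 1 ./comm_dup_test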
> > >
> > > OpenMPI fails after '65531' and mpich after '2044'. MVAPICH is derived
> > > from MPICH - but it's possible they have a different limit than MPICH.
> >
> > BTW: the above is with openmpi-2.1.2 and mpich-3.3b1
> >
> > mvapich2-1.9.5 - and I get an error after '2044' comm dupes
> >
> > Satish
> >
> 


