[petsc-users] A bad commit affects MOOSE

Smith, Barry F. bsmith at mcs.anl.gov
Tue Apr 3 12:29:16 CDT 2018


  Fande,

     The reason for the MPI_Comm_dup() and the inner communicator is that this communicator is used by hypre and so cannot "just" be a PETSc communicator. We cannot have PETSc and hypre using the same communicator since they may capture each other's messages, etc.
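
     For illustration, a minimal MPI-only sketch of the behavior involved (the names petsc_comm/hypre_comm are only illustrative): a message posted on one duplicate of a communicator can never be matched on another duplicate, even with the same source and tag, because MPI_Comm_dup creates a new communication context.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
  MPI_Comm    petsc_comm, hypre_comm;
  MPI_Request req;
  int         rank, msg = 42, recvbuf, flag;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  /* two independent duplicates of the same parent communicator */
  MPI_Comm_dup(MPI_COMM_WORLD, &petsc_comm);
  MPI_Comm_dup(MPI_COMM_WORLD, &hypre_comm);

  if (rank == 0) {
    /* post a send to ourselves on the first communicator */
    MPI_Isend(&msg, 1, MPI_INT, 0, 0, petsc_comm, &req);
    /* probe on the second communicator: same source, same tag, but a
       different communication context, so the message is never visible here */
    MPI_Iprobe(0, 0, hypre_comm, &flag, MPI_STATUS_IGNORE);
    printf("visible on the other communicator? %s\n", flag ? "yes" : "no");
    /* receive it where it was actually sent */
    MPI_Recv(&recvbuf, 1, MPI_INT, 0, 0, petsc_comm, MPI_STATUS_IGNORE);
    MPI_Wait(&req, MPI_STATUS_IGNORE);
  }

  MPI_Comm_free(&petsc_comm);
  MPI_Comm_free(&hypre_comm);
  MPI_Finalize();
  return 0;
}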

      See my pull request, which I think should resolve the issue in the short term.

    Barry


> On Apr 3, 2018, at 11:21 AM, Kong, Fande <fande.kong at inl.gov> wrote:
> 
> Figured it out:
> 
> The reason is that in MatCreate_HYPRE(Mat B) we call MPI_Comm_dup instead of PetscCommDuplicate. PetscCommDuplicate is better because it does not actually create a new communicator if the communicator is already known to PETSc.
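> 
> A minimal sketch of that behavior (standalone, just illustrating the reference counting; the direct use of PetscCommDuplicate/PetscCommDestroy here is only for illustration):
> 
> #include <petscsys.h>
> 
> int main(int argc, char **argv)
> {
>   MPI_Comm c1, c2;
> 
>   PetscInitialize(&argc, &argv, NULL, NULL);
>   /* Both calls return the same inner communicator that PETSc attaches to
>      PETSC_COMM_WORLD; the second call only bumps a reference count, so
>      no additional MPI communicator is consumed. */
>   PetscCommDuplicate(PETSC_COMM_WORLD, &c1, NULL);
>   PetscCommDuplicate(PETSC_COMM_WORLD, &c2, NULL);
>   PetscPrintf(PETSC_COMM_WORLD, "same communicator: %d\n", (int)(c1 == c2));
>   /* PetscCommDestroy decrements the reference count rather than calling
>      MPI_Comm_free right away */
>   PetscCommDestroy(&c1);
>   PetscCommDestroy(&c2);
>   PetscFinalize();
>   return 0;
> }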
> 
> Furthermore, I do not think we should keep a comm in
> 
> typedef struct {
>   HYPRE_IJMatrix ij;
>   HYPRE_IJVector x;
>   HYPRE_IJVector b;
>   MPI_Comm       comm;
> } Mat_HYPRE;
> 
> It is internal data of the Mat, and it should already have the same comm as the Mat. I do not understand why the internal data needs its own comm.
> 
> The following patch fixed the issue (just deleted this extra comm).
> 
> diff --git a/src/mat/impls/hypre/mhypre.c b/src/mat/impls/hypre/mhypre.c
> index dc19892..d8cfe3d 100644
> --- a/src/mat/impls/hypre/mhypre.c
> +++ b/src/mat/impls/hypre/mhypre.c
> @@ -74,7 +74,7 @@ static PetscErrorCode MatHYPRE_CreateFromMat(Mat A, Mat_HYPRE *hA)
>    rend   = A->rmap->rend;
>    cstart = A->cmap->rstart;
>    cend   = A->cmap->rend;
> -  PetscStackCallStandard(HYPRE_IJMatrixCreate,(hA->comm,rstart,rend-1,cstart,cend-1,&hA->ij));
> +  PetscStackCallStandard(HYPRE_IJMatrixCreate,(PetscObjectComm((PetscObject)A),rstart,rend-1,cstart,cend-1,&hA->ij));
>    PetscStackCallStandard(HYPRE_IJMatrixSetObjectType,(hA->ij,HYPRE_PARCSR));
>    {
>      PetscBool      same;
> @@ -434,7 +434,7 @@ PetscErrorCode MatDestroy_HYPRE(Mat A)
>    if (hA->x) PetscStackCallStandard(HYPRE_IJVectorDestroy,(hA->x));
>    if (hA->b) PetscStackCallStandard(HYPRE_IJVectorDestroy,(hA->b));
>    if (hA->ij) PetscStackCallStandard(HYPRE_IJMatrixDestroy,(hA->ij));
> -  if (hA->comm) { ierr = MPI_Comm_free(&hA->comm);CHKERRQ(ierr);}
> +  /*if (hA->comm) { ierr = MPI_Comm_free(&hA->comm);CHKERRQ(ierr);}*/
>    ierr = PetscObjectComposeFunction((PetscObject)A,"MatConvert_hypre_aij_C",NULL);CHKERRQ(ierr);
>    ierr = PetscFree(A->data);CHKERRQ(ierr);
>    PetscFunctionReturn(0);
> @@ -500,7 +500,8 @@ PETSC_EXTERN PetscErrorCode MatCreate_HYPRE(Mat B)
>    B->ops->destroy       = MatDestroy_HYPRE;
>    B->ops->assemblyend   = MatAssemblyEnd_HYPRE;
>  
> -  ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)B),&hB->comm);CHKERRQ(ierr);
> +  /*ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)B),&hB->comm);CHKERRQ(ierr); */
> +  /*ierr = PetscCommDuplicate(PetscObjectComm((PetscObject)B),&hB->comm,NULL);CHKERRQ(ierr);*/
>    ierr = PetscObjectChangeTypeName((PetscObject)B,MATHYPRE);CHKERRQ(ierr);
>    ierr = PetscObjectComposeFunction((PetscObject)B,"MatConvert_hypre_aij_C",MatConvert_HYPRE_AIJ);CHKERRQ(ierr);
>    PetscFunctionReturn(0);
> diff --git a/src/mat/impls/hypre/mhypre.h b/src/mat/impls/hypre/mhypre.h
> index 3d9ddd2..1189020 100644
> --- a/src/mat/impls/hypre/mhypre.h
> +++ b/src/mat/impls/hypre/mhypre.h
> @@ -10,7 +10,7 @@ typedef struct {
>    HYPRE_IJMatrix ij;
>    HYPRE_IJVector x;
>    HYPRE_IJVector b;
> -  MPI_Comm       comm;
> +  /*MPI_Comm       comm;*/
>  } Mat_HYPRE;
>  
> 
> 
> Fande,
> 
> 
> 
> 
> On Tue, Apr 3, 2018 at 10:35 AM, Satish Balay <balay at mcs.anl.gov> wrote:
> On Tue, 3 Apr 2018, Satish Balay wrote:
> 
> > On Tue, 3 Apr 2018, Derek Gaston wrote:
> >
> > > One thing I want to be clear about here is that we're not trying to solve
> > > this particular problem (where we're creating 1000 instances of Hypre to
> > > precondition each variable independently)... this particular problem is
> > > just a test (that we've had in our test suite for a long time) to stress
> > > test some of this capability.
> > >
> > > We really do have a need for thousands (tens of thousands) of simultaneous
> > > solves (each with their own Hypre instances).  That's not what this
> > > particular problem is doing - but it is representative of a class of
> > > problems we need to solve.
> > >
> > > Which does bring up a point: I have been able to run ~50,000 separate
> > > PETSc solves before without issue.  Is it because I was working
> > > with MVAPICH on a cluster?  Does it just have a higher limit?
> >
> > Don't know - but that's easy to find out with a simple test code:
> >
> > >>>>>>
> > $ cat comm_dup_test.c
> > #include <mpi.h>
> > #include <stdio.h>
> >
> > int main(int argc, char** argv) {
> >     MPI_Comm newcomm;
> >     int i, err;
> >     MPI_Init(NULL, NULL);
> >     /* keep duplicating MPI_COMM_WORLD (deliberately never freed) until
> >        the MPI implementation runs out of communicators/contexts */
> >     for (i=0; i<100000; i++) {
> >       err = MPI_Comm_dup(MPI_COMM_WORLD, &newcomm);
> >       if (err) {
> >         printf("%5d - fail\n",i);fflush(stdout);
> >         break;
> >       } else {
> >         printf("%5d - success\n",i);fflush(stdout);
> >       }
> >     }
> >     MPI_Finalize();
> >     return 0;
> > }
> > <<<<<<<
> >
> > OpenMPI fails after '65531' dups and MPICH after '2044'. MVAPICH is derived
> > from MPICH - but it's possible it has a different limit than MPICH.
> 
> BTW: the above is with openmpi-2.1.2 and mpich-3.3b1.
> 
> mvapich2-1.9.5 - and I get an error after '2044' comm dups.
> 
> Satish
> 


