[petsc-dev] [issue1595] Issues of limited number of MPI communicators when having many instances of hypre boomerAMG with Moose

Jed Brown jed at jedbrown.org
Wed Apr 4 07:43:46 CDT 2018


Yes, this is a real issue for MOOSE, which sometimes has thousands of active single-field solvers.  PETSc can limit the number of fine-level communicators by retaining the dup'd communicator so that the same communicator is passed to hypre for every solver, but it cannot control the MPI_Comm_create for a parallel coarse level.  Hypre could do that internally by attaching the coarse communicator as an attribute (on the relevant ranks) of the larger communicator.
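For illustration, here is a minimal sketch of the attribute-caching idea in plain MPI (this is not hypre's or PETSc's actual code; the names get_inner_comm and inner_comm_keyval are invented for the sketch).  The first solver built on a user communicator dup's it once and caches the dup as an attribute on that communicator; every later solver retrieves the cached dup instead of creating a new communicator.

    #include <mpi.h>
    #include <stdlib.h>

    /* Key used to cache the dup'd inner communicator on the user's
     * communicator.  MPI_KEYVAL_INVALID marks it as not yet created. */
    static int inner_comm_keyval = MPI_KEYVAL_INVALID;

    /* Delete callback: free the cached dup when the outer comm is freed. */
    static int free_inner_comm(MPI_Comm comm, int keyval, void *attr_val, void *extra)
    {
      MPI_Comm inner = *(MPI_Comm *)attr_val;
      MPI_Comm_free(&inner);
      free(attr_val);
      return MPI_SUCCESS;
    }

    /* Return a dup'd communicator for user_comm, creating and caching it on
     * first use so every solver built on user_comm shares the same dup. */
    static MPI_Comm get_inner_comm(MPI_Comm user_comm)
    {
      void *attr;
      int   found;

      if (inner_comm_keyval == MPI_KEYVAL_INVALID)
        MPI_Comm_create_keyval(MPI_COMM_NULL_COPY_FN, free_inner_comm,
                               &inner_comm_keyval, NULL);

      MPI_Comm_get_attr(user_comm, inner_comm_keyval, &attr, &found);
      if (!found) {
        MPI_Comm *inner = malloc(sizeof(MPI_Comm));
        MPI_Comm_dup(user_comm, inner);               /* one dup per user comm */
        MPI_Comm_set_attr(user_comm, inner_comm_keyval, inner);
        return *inner;
      }
      return *(MPI_Comm *)attr;
    }

The same pattern, applied inside hypre, would let a coarse-level communicator be cached on the larger communicator, so a thousand solvers on the same communicator would share one coarse communicator rather than each triggering its own MPI_Comm_create.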

The separate tag space is important because point-to-point messaging can be pending when hypre is called -- that does not lead to deadlock, but it is important that hypre not post sends or receives with those tags lest messages be delivered incorrectly.  
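To make the tag-space point concrete, here is a hypothetical fragment (not hypre code; library_phase is an invented name) showing why a library is safe communicating on a dup even when the caller has point-to-point messages in flight:

    #include <mpi.h>

    void library_phase(MPI_Comm user_comm)
    {
      /* The caller may have sends pending on user_comm.  If the library
       * posted a wildcard receive directly on user_comm, e.g.
       *   MPI_Recv(buf, n, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, user_comm, &st);
       * it could consume one of the caller's messages. */

      /* Communicating on a dup instead keeps the library's traffic in a
       * separate matching context, so the caller's pending messages are
       * untouched.  MPI_Comm_dup is collective; ideally it is done once
       * and the result reused, as in the caching sketch above. */
      MPI_Comm lib_comm;
      MPI_Comm_dup(user_comm, &lib_comm);
      /* ... library sends/receives use lib_comm exclusively ... */
      MPI_Comm_free(&lib_comm);
    }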

I feel like your response below is a false economy.  Nobody would fault hypre for dup'ing once.  But with the current interface, it is laborious or outright impossible (in the case of a parallel coarse solve) to create a thousand hypre solvers without ending up with a thousand communicators.  Assuming you are not convinced, we will handle this in PETSc the same way PETSc handles its own inner communicators, but (a) we still can't control the communicator for a parallel coarse solve, and (b) this issue may crop up again if some other user attempts this sort of solve without going through PETSc.

Rob Falgout hypre Tracker <hypre-support at llnl.gov> writes:

> Rob Falgout <rfalgout at llnl.gov> added the comment:
>
> Is somebody actually having a problem with communicator conflicts right now?
>
> I thought the reason for this thread was to reduce the number of communicators because of limits in MPI implementations.  Somebody has to reduce the Comm_create() and Comm_dup() calls.  We responded with one way to reduce the create() calls in BoomerAMG, but now you are asking us to put them back in by calling dup()?  I'm confused about what we are trying to achieve here now.
>
> The reason I suggested that the user be responsible for calling dup() is twofold: 1) I don't think it is common for users to run hypre in parallel with other user code where both are using the same communicator (I'm not sure how this could even work without deadlocking since hypre calls are collective); 2) Making libraries lower down on the call stack be responsible for calling dup() seems less scalable than the other way around and more likely to increase the number of communicators used.
>
> Anyway, I'm still confused about what we are trying to achieve so maybe somebody can try to summarize again?
>
> -Rob
>
> ____________________________________________
> hypre Issue Tracker <hypre-support at llnl.gov>
> <http://cascb1.llnl.gov/hypre/issue1595>
> ____________________________________________

