[petsc-users] Reaching the limit on the number of communicators with Spectrum MPI

Junchao Zhang junchao.zhang at gmail.com
Sat Aug 21 10:30:46 CDT 2021


I checked and found MPI_Comm_dup() and MPI_Comm_free() were called in
pairs. So the MPI runtime should not complain about running out of
resources.
I guess there might be pending communications on those communicators, but I
have no way to know for sure. Per the MPI manual, MPI_Comm_free() only marks a
communicator object for deallocation; the actual deallocation is deferred until
any pending communication on it completes.
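A minimal sketch illustrating that deferred deallocation (just an illustration,
not the reproducer): one can dup a comm, start a request on it, free the comm,
and only then complete the request.

  #include <mpi.h>

  int main(int argc, char **argv)
  {
    MPI_Comm    dup;
    MPI_Request req;
    int         rank, size, buf = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_dup(MPI_COMM_WORLD, &dup);
    MPI_Comm_rank(dup, &rank);
    MPI_Comm_size(dup, &size);
    if (size < 2) MPI_Abort(MPI_COMM_WORLD, 1);   /* needs at least 2 ranks */

    if (rank == 0)      MPI_Isend(&buf, 1, MPI_INT, 1, 0, dup, &req);
    else if (rank == 1) MPI_Irecv(&buf, 1, MPI_INT, 0, 0, dup, &req);

    MPI_Comm_free(&dup);   /* only marks the comm; deallocation waits for the request */
    if (rank < 2) MPI_Wait(&req, MPI_STATUS_IGNORE);

    MPI_Finalize();
    return 0;
  }
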
We can file a bug report to OLCF. With MPI source code, it should be easy
for them to debug.

--Junchao Zhang


On Fri, Aug 20, 2021 at 4:14 PM Junchao Zhang <junchao.zhang at gmail.com>
wrote:

> Feimi,
>   I'm able to reproduce the problem. I will have a look. Thanks a lot for
> the example.
> --Junchao Zhang
>
>
> On Fri, Aug 20, 2021 at 2:02 PM Feimi Yu <yuf2 at rpi.edu> wrote:
>
>> Sorry, I forgot to destroy the matrix after the loop, but anyway, the
>> in-loop preconditioners are destroyed. I've updated the code here and on the
>> Google Drive.
>>
>> Feimi
>> On 8/20/21 2:54 PM, Feimi Yu wrote:
>>
>> Hi Barry and Junchao,
>>
>> Actually I did a simple MPI "dup and free" test before with Spectrum MPI,
>> but that one did not have any problem. I'm not a PETSc programmer as I
>> mainly use deal.ii's PETSc wrappers, but I managed to write a minimal
>> program based on petsc/src/mat/tests/ex98.c to reproduce my problem. This
>> piece of code creates and destroys 10,000 instances of Hypre ParaSails
>> preconditioners (my own code uses Euclid, but I don't think that
>> matters). It runs fine with OpenMPI but reports the out-of-communicators
>> error with Spectrum MPI. The code is attached to this email. In case the
>> attachment is not available, I also uploaded a copy to my Google Drive:
>>
>>
>> https://drive.google.com/drive/folders/1DCf7lNlks8GjazvoP7c211ojNHLwFKL6?usp=sharing
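>>
>> Roughly, the loop in that program does the following (a simplified sketch,
>> not the attached code itself; error checking is omitted and the matrix setup
>> is abbreviated):
>>
>>   #include <petsc.h>
>>
>>   int main(int argc, char **argv)
>>   {
>>     Mat      A;
>>     PetscInt i, Istart, Iend;
>>
>>     PetscInitialize(&argc, &argv, NULL, NULL);
>>     MatCreate(PETSC_COMM_WORLD, &A);
>>     MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, 100, 100);
>>     MatSetFromOptions(A);
>>     MatSetUp(A);
>>     MatGetOwnershipRange(A, &Istart, &Iend);
>>     for (i = Istart; i < Iend; i++) MatSetValue(A, i, i, 2.0, INSERT_VALUES);
>>     MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
>>     MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);
>>
>>     for (i = 0; i < 10000; i++) {     /* each hypre PC dups a communicator */
>>       PC pc;
>>       PCCreate(PETSC_COMM_WORLD, &pc);
>>       PCSetType(pc, PCHYPRE);
>>       PCHYPRESetType(pc, "parasails");
>>       PCSetOperators(pc, A, A);
>>       PCSetUp(pc);
>>       PCDestroy(&pc);                 /* should also free the dup'ed comm */
>>     }
>>
>>     MatDestroy(&A);
>>     PetscFinalize();
>>     return 0;
>>   }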
>>
>> Thanks!
>>
>> Feimi
>> On 8/20/21 9:58 AM, Junchao Zhang wrote:
>>
>> Feimi, if it is easy to reproduce, could you give instructions on how to
>> reproduce that?
>>
>> PS: Spectrum MPI is based on OpenMPI.  I don't understand why it has the
>> problem but OpenMPI does not.  It could be a bug in petsc or in the user's
>> code. For reference counting on MPI_Comm, we already have the petsc inner
>> comm; I think we can reuse that.
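>>
>> As a side note, a tiny sketch of what that inner-comm reference counting
>> looks like (my own example, not from this thread): PetscCommDuplicate()
>> hands back the same hidden duplicate each time and only bumps a count, and
>> PetscCommDestroy() frees it when the last reference goes away.
>>
>>   #include <petscsys.h>
>>
>>   int main(int argc, char **argv)
>>   {
>>     MPI_Comm       inner1, inner2;
>>     PetscMPIInt    tag1, tag2;
>>     PetscErrorCode ierr;
>>
>>     ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;
>>     ierr = PetscCommDuplicate(PETSC_COMM_WORLD, &inner1, &tag1);CHKERRQ(ierr);
>>     ierr = PetscCommDuplicate(PETSC_COMM_WORLD, &inner2, &tag2);CHKERRQ(ierr); /* same inner comm, new tag */
>>     ierr = PetscCommDestroy(&inner1);CHKERRQ(ierr);  /* only decrements the reference count */
>>     ierr = PetscCommDestroy(&inner2);CHKERRQ(ierr);  /* last reference frees the inner comm */
>>     ierr = PetscFinalize();
>>     return ierr;
>>   }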
>>
>> --Junchao Zhang
>>
>>
>> On Fri, Aug 20, 2021 at 12:33 AM Barry Smith <bsmith at petsc.dev> wrote:
>>
>>>
>>>   It sounds like maybe Spectrum MPI's MPI_Comm_free() is not returning the
>>> comm to the "pool" as available for future use; that would be a very buggy
>>> MPI implementation. This can easily be checked in a tiny standalone MPI
>>> program that simply dups and frees a comm thousands of times in a loop. It
>>> could even be a configure test (one that requires running an MPI program).
>>> I do not remember if we ever tested this possibility; maybe we did and I
>>> forgot.
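>>>
>>>   Such a standalone check could be as small as the following sketch (the
>>> iteration count is arbitrary; a correct MPI implementation should run it to
>>> completion):
>>>
>>>   #include <mpi.h>
>>>
>>>   int main(int argc, char **argv)
>>>   {
>>>     MPI_Init(&argc, &argv);
>>>     for (int i = 0; i < 100000; i++) {
>>>       MPI_Comm dup;
>>>       MPI_Comm_dup(MPI_COMM_WORLD, &dup);   /* consumes a context id */
>>>       MPI_Comm_free(&dup);                  /* should make it available again */
>>>     }
>>>     MPI_Finalize();
>>>     return 0;
>>>   }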
>>>
>>>   If this is the problem, we can provide a workaround that attaches the
>>> new comm (to be passed to hypre) to the old comm as an attribute, with a
>>> reference count also stored in the attribute. When the hypre matrix is
>>> created, that count is set to 1; when the hypre matrix is freed, the count
>>> is set back to zero (but the comm is not freed). In the next call to create
>>> a hypre matrix, the attribute is found with a count of zero, so PETSc knows
>>> it can pass the same comm again to the new hypre matrix.
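>>>
>>>   A rough sketch of that attribute mechanism in plain MPI terms (the names
>>> here are made up for illustration; this is not PETSc code):
>>>
>>>   #include <mpi.h>
>>>   #include <stdlib.h>
>>>
>>>   typedef struct { MPI_Comm hypre_comm; int in_use; } HypreCommLink;
>>>   static int hypre_keyval = MPI_KEYVAL_INVALID;
>>>
>>>   /* hand out a comm for hypre, dup'ing the outer comm only the first time */
>>>   static MPI_Comm GetHypreComm(MPI_Comm outer)
>>>   {
>>>     HypreCommLink *link;
>>>     int            found;
>>>
>>>     if (hypre_keyval == MPI_KEYVAL_INVALID)
>>>       MPI_Comm_create_keyval(MPI_COMM_NULL_COPY_FN, MPI_COMM_NULL_DELETE_FN,
>>>                              &hypre_keyval, NULL);
>>>     MPI_Comm_get_attr(outer, hypre_keyval, &link, &found);
>>>     if (!found) {
>>>       link = (HypreCommLink *)malloc(sizeof(*link));
>>>       MPI_Comm_dup(outer, &link->hypre_comm);
>>>       link->in_use = 0;
>>>       MPI_Comm_set_attr(outer, hypre_keyval, link);
>>>     }
>>>     link->in_use = 1;   /* the matching "release" would set this back to 0 */
>>>     return link->hypre_comm;
>>>   }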
>>>
>>> This will only allow one simultaneous hypre matrix to be created from
>>> the original comm. To allow multiple simultaneous hypre matrices, one could
>>> keep multiple comms and counts in the attribute and just check them until
>>> one finds an available one to reuse (or create yet another one if all the
>>> current ones are busy with hypre matrices). So it is the same model as
>>> DMGetXXVector() where vectors are checked out and then checked in to be
>>> available later. This would solve the currently reported problem (if it is
>>> a buggy MPI that does not properly free comms), but not solve the MOOSE
>>> problem where 10,000 comms are needed at the same time.
>>>
>>>   Barry
>>>
>>>
>>>
>>>
>>>
>>> On Aug 19, 2021, at 3:29 PM, Junchao Zhang <junchao.zhang at gmail.com>
>>> wrote:
>>>
>>>
>>>
>>>
>>> On Thu, Aug 19, 2021 at 2:08 PM Feimi Yu <yuf2 at rpi.edu> wrote:
>>>
>>>> Hi Jed,
>>>>
>>>> In my case, I only have 2 hypre preconditioners at the same time, and
>>>> they do not solve simultaneously, so it might not be case 1.
>>>>
>>>> I checked the stack for all the calls of MPI_Comm_dup/MPI_Comm_free on
>>>> my own machine (with OpenMPI), all the communicators are freed from my
>>>> observation. I could not test it with Spectrum MPI on the clusters
>>>> immediately because all the dependencies were built in release mode.
>>>> However, as I mentioned, I haven't had this problem with OpenMPI before,
>>>> so I'm not sure whether this is really an MPI implementation problem, or
>>>> just that Spectrum MPI has a lower limit on the number of communicators,
>>>> and/or it also depends on how many MPI ranks are used, as only 2 out
>>>> of 40 ranks reported the error.
>>>>
>>> You can add printf around MPI_Comm_dup/MPI_Comm_free sites on the two
>>> ranks, e.g., if (myrank == 38) printf(...), to see if the dup/free are
>>> paired.
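>>>
>>> Another option, if you'd rather not touch the call sites: intercept the two
>>> calls through the standard PMPI profiling layer and count them. A sketch
>>> (rank 38 hard-coded, as in your case):
>>>
>>>   /* compile and link this file into the executable; it shadows the MPI calls */
>>>   #include <mpi.h>
>>>   #include <stdio.h>
>>>
>>>   static int ndup = 0, nfree = 0;
>>>
>>>   int MPI_Comm_dup(MPI_Comm comm, MPI_Comm *newcomm)
>>>   {
>>>     int rank;
>>>     PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>     if (rank == 38) printf("[%d] MPI_Comm_dup  #%d\n", rank, ++ndup);
>>>     return PMPI_Comm_dup(comm, newcomm);
>>>   }
>>>
>>>   int MPI_Comm_free(MPI_Comm *comm)
>>>   {
>>>     int rank;
>>>     PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>     if (rank == 38) printf("[%d] MPI_Comm_free #%d\n", rank, ++nfree);
>>>     return PMPI_Comm_free(comm);
>>>   }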
>>>
>>>> As a workaround, I replaced the MPI_Comm_dup() at
>>>> petsc/src/mat/impls/hypre/mhypre.c:2120 with a copy assignment, and also
>>>> removed the MPI_Comm_free() in the hypre destroyer. My code runs fine
>>>> with Spectrum MPI now, but I don't think this is a long-term solution.
>>>>
>>>> Thanks!
>>>>
>>>> Feimi
>>>>
>>>> On 8/19/21 9:01 AM, Jed Brown wrote:
>>>> > Junchao Zhang <junchao.zhang at gmail.com> writes:
>>>> >
>>>> >> Hi, Feimi,
>>>> >>    I need to consult Jed (cc'ed).
>>>> >>    Jed, is this an example of
>>>> >>
>>>> https://lists.mcs.anl.gov/mailman/htdig/petsc-dev/2018-April/thread.html#22663
>>>> ?
>>>> >> If Feimi really can not free matrices, then we just need to attach a
>>>> >> hypre-comm to a petsc inner comm, and pass that to hypre.
>>>> > Are there a bunch of solves as in that case?
>>>> >
>>>> > My understanding is that one should be able to
>>>> > MPI_Comm_dup/MPI_Comm_free as many times as one likes, but the
>>>> > implementation has limits on how many communicators can co-exist at any
>>>> > one time. The many-at-once is what we encountered in that 2018 thread.
>>>> >
>>>> > One way to check would be to use a debugger or tracer to examine the
>>>> > stack every time (P)MPI_Comm_dup and (P)MPI_Comm_free are called.
>>>> >
>>>> > case 1: we'll find lots of dups without frees (until the end) because
>>>> > the user really wants lots of these existing at the same time.
>>>> >
>>>> > case 2: dups are left unfreed because of a reference counting
>>>> > issue/inessential references
>>>> >
>>>> >
>>>> > In case 1, I think the solution is as outlined in the thread: PETSc
>>>> > can create an inner comm for Hypre. I think I'd prefer to attach it to
>>>> > the outer comm instead of the PETSc inner comm, but perhaps a case could
>>>> > be made either way.
>>>>
>>>
>>>