[petsc-users] Vexing deadlock situation with petsc4py

Lawrence Mitchell wence at gmx.li
Wed Oct 28 12:21:02 CDT 2020


> On 28 Oct 2020, at 16:35, Guyer, Jonathan E. Dr. (Fed) via petsc-users <petsc-users at mcs.anl.gov> wrote:
> 
> We use petsc4py as a solver suite in our [FiPy](https://www.ctcms.nist.gov/fipy) Python-based PDE solver package. Some time back, I refactored some of the code and provoked a deadlock situation in our test suite. I have been tearing what remains of my hair out trying to isolate things and am at a loss. I’ve gone through the refactoring line-by-line and I just don’t think I’ve changed anything substantive, just how the code is organized.
> 
> I have posted a branch that exhibits the issue at https://github.com/usnistgov/fipy/pull/761
> 
> I explain in greater detail in that “pull request” how to reproduce, but in short, after a substantial number of our tests run, the code either deadlocks or raises exceptions:
> 
> On processor 0 in 
> 
>   matrix.setUp() 
> 
> specifically in
> 
>   [0] PetscSplitOwnership() line 93 in /Users/runner/miniforge3/conda-bld/petsc_1601473259434/work/src/sys/utils/psplit.c
> 
> and on other processors a few lines earlier in
> 
>   matrix.create(comm)
> 
> specifically in 
> 
>   [1] PetscCommDuplicate() line 126 in /Users/runner/miniforge3/conda-bld/petsc_1601473259434/work/src/sys/objects/tagm.c
> 
> 
> The circumstances that lead to this failure are really fragile and it seems likely due to some memory corruption. Particularly likely given that I can make the failure go away by removing seemingly irrelevant things like
> 
>     >>> from scipy.stats.mstats import argstoarray
> 
> Note that when I run the full test suite after taking out this scipy import, the same problem just arises elsewhere without any obvious similar import trigger.
> 
> Running with `-malloc_debug true` doesn’t illuminate anything.
> 
> I’ve run with `-info` and `-log_trace` and don’t see any obvious issues, but there’s a ton of output.
> 
> 
> 
> I have tried reducing things to a minimal reproducible example, but unfortunately things remain way too complicated and idiosyncratic to FiPy. I’m grateful for any help anybody can offer despite the mess that I’m offering.

My crystal ball guess is the following:

PETSc objects have collective destroy semantics.

When using petsc4py, XXX.destroy() is called on an object when its Python refcount drops to zero, or when it is collected by the generational garbage collector.

In the absence of reference-cycles, all allocated objects will be collected by the refcounting part of the collector. This is (unless you do something funky like hold more references on one process than another) deterministic, and if you do normal SPMD programming, you'll call XXX.destroy() in the same order on the same objects on all processes.

If you have reference cycles, then the refcounting part of the collector will not collect these objects. Now you are at the mercy of the generational collector. This is definitely not deterministic. If different Python processes do different things (for example, rank 0 might open files) then when the generational collector runs is no longer in sync across processes.

A consequence is that you now might have rank 0 collect XXX then YYY, whereas rank 1 might collect YYY then XXX => deadlock.

You can test this hypothesis by turning off the garbage collector in your test that provokes the failure:

import gc
gc.disable()
...

If this turns out to be the case, I don't think there's a good solution here. You can audit your code base and ensure that objects that hold PETSc objects never participate in reference cycles. This is fragile.

Another option, is to explicitly require that the user of the API call XXX.destroy() on all your objects (and then PETSc objects). This is the decision taken for mpi4py: you are responsible for freeing any objects that you create.

That is, your API becomes more like the C API with

x = Foo(...) # holds some petsc object XX
... # use x
x.destroy() # calls XX.destroy()

you could make this more pythonic by wrapping this pattern in contextmanagers:

with Foo(...) as x:
   ...


Thanks,

Lawrence


More information about the petsc-users mailing list