[petsc-users] Vexing deadlock situation with petsc4py

Guyer, Jonathan E. Dr. (Fed) jonathan.guyer at nist.gov
Wed Oct 28 12:08:21 CDT 2020


I should note that I’m running with a --with-debugging build that I’ve [forked from conda-forge/petsc-feedstock](https://github.com/guyer/petsc-feedstock/), but it doesn’t highlight any problems.
When I run with `-start_in_debugger`, I drop into lldb[*], but there are no symbols. The last assembler I knew was for the 6502 and I haven’t known that for a looooong time.

How can I get symbols included in my build?

If I drop into the [i]pdb Python debugger, the problem goes away.

[*] I’m running on a Mac, but the same deadlock happens on our Linux builds.
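
Since attaching an interactive debugger changes the timing enough to hide the hang, one less intrusive option might be Python's standard `faulthandler` module, which can dump the Python-level traceback of every thread after a timeout without stopping the process. The following is only a sketch: the helper names `arm_hang_watchdog` and `disarm_hang_watchdog` and the 300-second default are hypothetical, and the dump shows Python frames only, not the C frames inside PETSc.

```python
import faulthandler
import sys


def arm_hang_watchdog(timeout_s=300.0, stream=sys.stderr, repeat=False):
    """Dump all Python thread tracebacks to `stream` after `timeout_s` seconds.

    `stream` must be backed by a real file descriptor, because faulthandler
    writes to the descriptor directly. The process keeps running afterwards
    (exit=False), so a slow but healthy run is not killed.
    """
    faulthandler.dump_traceback_later(
        timeout_s, repeat=repeat, file=stream, exit=False
    )


def disarm_hang_watchdog():
    """Cancel a pending dump, e.g. once the suspect test has finished."""
    faulthandler.cancel_dump_traceback_later()
```

Calling `arm_hang_watchdog()` at the start of the test run on every rank, and `disarm_hang_watchdog()` at the end, would leave one traceback per rank in the captured stderr if the run stalls, which may be enough to confirm which Python call each rank is sitting in.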

On Oct 28, 2020, at 12:35 PM, Guyer, Jonathan E. Dr. (Fed) via petsc-users <petsc-users at mcs.anl.gov> wrote:

We use petsc4py as a solver suite in our [FiPy](https://www.ctcms.nist.gov/fipy) Python-based PDE solver package. Some time back, I refactored some of the code and provoked a deadlock situation in our test suite. I have been tearing what remains of my hair out trying to isolate things and am at a loss. I’ve gone through the refactoring line-by-line and I just don’t think I’ve changed anything substantive, just how the code is organized.

I have posted a branch that exhibits the issue at https://github.com/usnistgov/fipy/pull/761

I explain in greater detail in that “pull request” how to reproduce, but in short, after a substantial number of our tests run, the code either deadlocks or raises exceptions:

On processor 0 in

  matrix.setUp()

specifically in

  [0] PetscSplitOwnership() line 93 in /Users/runner/miniforge3/conda-bld/petsc_1601473259434/work/src/sys/utils/psplit.c

and on other processors a few lines earlier in

  matrix.create(comm)

specifically in

  [1] PetscCommDuplicate() line 126 in /Users/runner/miniforge3/conda-bld/petsc_1601473259434/work/src/sys/objects/tagm.c


The circumstances that lead to this failure are really fragile, and memory corruption seems the likely cause, particularly because I can make the failure go away by removing seemingly irrelevant things like

    >>> from scipy.stats.mstats import argstoarray

Note that when I run the full test suite after taking out this scipy import, the same problem just arises elsewhere without any obvious similar import trigger.

Running with `-malloc_debug true` doesn’t illuminate anything.

I’ve run with `-info` and `-log_trace` and don’t see any obvious issues, but there’s a ton of output.
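
To cut that volume down, a small filter that keeps only the last few lines per rank can make it easier to see where the ranks diverge. This is a sketch that assumes each relevant line starts with a `[rank]` prefix, as in the stack lines quoted above; the function name `tail_by_rank` and the `keep` parameter are hypothetical.

```python
import re
from collections import defaultdict, deque

# Assumed line shape: "[<rank>] <message>", as in the stack lines quoted above.
RANK_PREFIX = re.compile(r"^\[(\d+)\]\s?(.*)$")


def tail_by_rank(lines, keep=5):
    """Return {rank: last `keep` messages}, ignoring lines with no rank prefix."""
    tails = defaultdict(lambda: deque(maxlen=keep))
    for line in lines:
        match = RANK_PREFIX.match(line.strip())
        if match:
            tails[int(match.group(1))].append(match.group(2))
    return {rank: list(tail) for rank, tail in sorted(tails.items())}
```

Feeding it the captured output, for example `tail_by_rank(open("run.log"))` with `run.log` standing in for wherever the output was saved, shows the final messages from each rank side by side, so a rank that stopped in a different call from the others stands out.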



I have tried reducing things to a minimal reproducible example, but unfortunately things remain way too complicated and idiosyncratic to FiPy. I’m grateful for any help anybody can offer despite the mess that I’m offering.
