[petsc-users] Vexing deadlock situation with petsc4py
Guyer, Jonathan E. Dr. (Fed)
jonathan.guyer at nist.gov
Wed Oct 28 12:08:21 CDT 2020
I should note that I’m running with a --with-debugging build that I’ve [forked from conda-forge/petsc-feedstock](https://github.com/guyer/petsc-feedstock/), but it doesn’t highlight any problems.
When I run with `-start_in_debugger`, I drop into lldb[*], but there are no symbols. The last assembler I knew was for the 6502, and I haven’t known that for a looooong time.
How can I get symbols included in my build?
If I drop into the [i]pdb Python debugger, the problem goes away.
[*] I’m running on a Mac, but the same deadlock happens on our Linux builds.
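Since the hang disappears under an interactive debugger, a less intrusive option may be a timeout-based traceback dump. The following is a minimal sketch using only the standard-library `faulthandler` module; the helper names `arm_hang_watchdog` and `disarm_hang_watchdog` and the 300-second timeout are assumptions for illustration, not part of PETSc, petsc4py, or FiPy. It reports only Python frames, so it would show which Python line called into PETSc, not the C stack.

```python
import faulthandler
import sys


def arm_hang_watchdog(timeout_s=300.0, stream=None):
    """Dump all threads' Python tracebacks to `stream` if the process is
    still running after `timeout_s` seconds (hypothetical helper)."""
    if stream is None:
        stream = sys.stderr
    # The dump is written by a watchdog thread straight to the file
    # descriptor, so it still fires while the main thread is blocked in C.
    faulthandler.dump_traceback_later(
        timeout_s, repeat=False, file=stream, exit=False
    )


def disarm_hang_watchdog():
    """Cancel a pending dump, e.g. after the tests finish normally."""
    faulthandler.cancel_dump_traceback_later()


# Assumed usage, once per MPI rank at the top of the test runner:
# arm_hang_watchdog(300.0)
```

The stream must stay open until the dump fires; with one log file per rank, the last Python frame on each rank can then be compared.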
On Oct 28, 2020, at 12:35 PM, Guyer, Jonathan E. Dr. (Fed) via petsc-users <petsc-users at mcs.anl.gov> wrote:
We use petsc4py as a solver suite in our [FiPy](https://www.ctcms.nist.gov/fipy) Python-based PDE solver package. Some time back, I refactored some of the code and provoked a deadlock situation in our test suite. I have been tearing what remains of my hair out trying to isolate things and am at a loss. I’ve gone through the refactoring line-by-line and I just don’t think I’ve changed anything substantive, just how the code is organized.
I have posted a branch that exhibits the issue at https://github.com/usnistgov/fipy/pull/761
I explain in greater detail in that “pull request” how to reproduce, but in short, after a substantial number of our tests run, the code either deadlocks or raises exceptions:
On processor 0 in
matrix.setUp()
specifically in
[0] PetscSplitOwnership() line 93 in /Users/runner/miniforge3/conda-bld/petsc_1601473259434/work/src/sys/utils/psplit.c
and on other processors a few lines earlier in
matrix.create(comm)
specifically in
[1] PetscCommDuplicate() line 126 in /Users/runner/miniforge3/conda-bld/petsc_1601473259434/work/src/sys/objects/tagm.c
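The two traces above show ranks sitting in different calls at the same moment. One way to look for where the ranks first part ways is to have each rank log the calls it enters and then compare the logs. This is a rough sketch in plain Python, assuming such per-rank logs exist; the `first_divergence` helper and the example traces are hypothetical and not part of PETSc or FiPy.

```python
from itertools import zip_longest


def first_divergence(traces):
    """Return (step, {rank: call}) for the first step at which the ranks
    disagree, or None if all traces match (hypothetical helper).

    `traces` maps rank -> list of call names in the order they were entered.
    A value of None for a rank means that rank never reached that step.
    """
    ranks = sorted(traces)
    for step, calls in enumerate(zip_longest(*(traces[r] for r in ranks))):
        if len(set(calls)) > 1:
            return step, dict(zip(ranks, calls))
    return None


# Made-up traces for illustration only, not output from the failing tests.
traces = {
    0: ["matrix.create", "matrix.setUp"],
    1: ["matrix.create", "matrix.create"],
}
print(first_divergence(traces))
```

A divergence at a given step does not by itself identify the cause, but it narrows down which call sequence to inspect on each rank.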
The circumstances that lead to this failure are really fragile, and memory corruption seems a likely cause, particularly given that I can make the failure go away by removing seemingly irrelevant things like
>>> from scipy.stats.mstats import argstoarray
Note that when I run the full test suite after taking out this scipy import, the same problem just arises elsewhere without any obvious similar import trigger.
Running with `-malloc_debug true` doesn’t illuminate anything.
I’ve run with `-info` and `-log_trace` and don’t see any obvious issues, but there’s a ton of output.
I have tried reducing things to a minimal reproducible example, but unfortunately things remain way too complicated and idiosyncratic to FiPy. I’m grateful for any help anybody can offer despite the mess that I’m offering.