[petsc-users] KSP_Solve crashes in debug mode

Matthew Knepley knepley at gmail.com
Wed Feb 22 18:28:58 CST 2023


On Wed, Feb 22, 2023 at 6:32 PM Sajid Ali Syed <sasyed at fnal.gov> wrote:

> Hi Matt,
>
> Adding `-checkstack` does not prevent the crash, either on my laptop or on
> the cluster.
>

It will not prevent a crash. The output is intended to show us where the
stack problem originates. Can you send the output?
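
For readers following the thread, here is a toy C model of why an unmatched
begin/return pair can end in a crash like this one. It assumes the debug call
stack is a fixed-capacity array (as the `PETSCSTACKSIZE` constant mentioned
below suggests). It is not PETSc's implementation, and every `toy_*` / `Toy*`
name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: NOT PETSc's implementation. All names are hypothetical. */
#define TOY_STACK_CAPACITY 64

typedef struct {
  const char *function[TOY_STACK_CAPACITY]; /* recorded function names */
  int         depth;                        /* frames currently recorded */
} ToyStack;

/* Plays the role of a "function begin" macro: record the caller's name.
   Returns 0 on success, -1 if the stack is already full. */
static int toy_push(ToyStack *s, const char *name)
{
  if (s->depth >= TOY_STACK_CAPACITY) return -1;
  s->function[s->depth++] = name;
  return 0;
}

/* Plays the role of a "function return" macro: drop the top frame. */
static void toy_pop(ToyStack *s)
{
  assert(s->depth > 0);
  s->function[--s->depth] = NULL;
}

/* A routine with an early exit that forgets to pop when skip != 0. */
static int toy_leaky_routine(ToyStack *s, int skip)
{
  if (toy_push(s, "toy_leaky_routine") != 0) return -1;
  if (skip) return 0; /* BUG: plain return, no matching pop */
  toy_pop(s);
  return 0;
}

/* Call the leaky routine until a push fails; return the number of calls
   that succeeded before the stack was full. */
static int toy_calls_until_overflow(ToyStack *s, int max_calls)
{
  int n;
  for (n = 0; n < max_calls; n++) {
    if (toy_leaky_routine(s, 1) != 0) break;
  }
  return n;
}
```

In this model each leaking call costs one slot, so nothing goes wrong until
the array is full. Raising the capacity only moves that point further out,
which would be consistent with a larger stack size hiding the crash rather
than fixing it.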

  Thanks,

    Matt


> What does prevent the crash (on my laptop at least) is changing
> `PETSCSTACKSIZE` from 64 to 256 here:
> https://github.com/petsc/petsc/blob/main/include/petscerror.h#L1153
>
>
> Thank You,
> Sajid Ali (he/him) | Research Associate
> Data Science, Simulation, and Learning Division
> Fermi National Accelerator Laboratory
> s-sajid-ali.github.io
>
> ------------------------------
> *From:* Matthew Knepley <knepley at gmail.com>
> *Sent:* Wednesday, February 22, 2023 5:23 PM
> *To:* Sajid Ali Syed <sasyed at fnal.gov>
> *Cc:* Barry Smith <bsmith at petsc.dev>; petsc-users at mcs.anl.gov <
> petsc-users at mcs.anl.gov>
> *Subject:* Re: [petsc-users] KSP_Solve crashes in debug mode
>
> On Wed, Feb 22, 2023 at 6:18 PM Sajid Ali Syed via petsc-users <
> petsc-users at mcs.anl.gov> wrote:
>
> One thing to note in relation to the trace attached in the previous email
> is that there are no warnings until the 36th call to KSP_Solve. The first
> error (as indicated by ASAN) occurs somewhere before the 40th call to
> KSP_Solve (part of what the application marks as turn 10 of the
> propagator). The crash finally occurs on the 43rd call to KSP_Solve.
>
>
> Looking at the trace, it appears that stack handling is messed up and
> eventually it causes the crash. This can happen when
> PetscFunctionBegin is not matched up with PetscFunctionReturn. Can you try
> running this with
>
>   -checkstack
>
>   Thanks,
>
>      Matt
>
>
> Thank You,
> Sajid Ali (he/him) | Research Associate
> Data Science, Simulation, and Learning Division
> Fermi National Accelerator Laboratory
> s-sajid-ali.github.io
>
> ------------------------------
> *From:* Sajid Ali Syed <sasyed at fnal.gov>
> *Sent:* Wednesday, February 22, 2023 5:11 PM
> *To:* Barry Smith <bsmith at petsc.dev>
> *Cc:* petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>
> *Subject:* Re: [petsc-users] KSP_Solve crashes in debug mode
>
> Hi Barry,
>
> Thanks a lot for fixing this issue. I ran the same problem on a Linux
> machine and have the following trace for the same crash (with ASAN turned
> on for both PETSc (on the latest commit of the branch) and the application):
> https://gist.github.com/s-sajid-ali/85bdf689eb8452ef8702c214c4df6940
>
> The trace seems to indicate a couple of buffer overflows, one of which
> causes the crash. I'm not sure what causes them.
>
> Thank You,
> Sajid Ali (he/him) | Research Associate
> Data Science, Simulation, and Learning Division
> Fermi National Accelerator Laboratory
> s-sajid-ali.github.io
>
> ------------------------------
> *From:* Barry Smith <bsmith at petsc.dev>
> *Sent:* Wednesday, February 15, 2023 2:01 PM
> *To:* Sajid Ali Syed <sasyed at fnal.gov>
> *Cc:* petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>
> *Subject:* Re: [petsc-users] KSP_Solve crashes in debug mode
>
>
> https://gitlab.com/petsc/petsc/-/merge_requests/6075 should
> fix the possible recursive error condition Matt pointed out.
>
>
> On Feb 9, 2023, at 6:24 PM, Matthew Knepley <knepley at gmail.com> wrote:
>
> On Thu, Feb 9, 2023 at 6:05 PM Sajid Ali Syed via petsc-users <
> petsc-users at mcs.anl.gov> wrote:
>
> I added “-malloc_debug” in a .petscrc file and ran it again. The backtrace
> from lldb is in the attached file. The crash now seems to be at:
>
> Process 32660 stopped
> * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=2, address=0x16f603fb8)
>     frame #0: 0x0000000112ecc8f8 libpetsc.3.018.dylib`PetscFPrintf(comm=0, fd=0x0000000000000000, format=0x0000000000000000) at mprint.c:601
>    598               `PetscViewerASCIISynchronizedPrintf()`, `PetscSynchronizedFlush()`
>    599      @*/
>    600      PetscErrorCode PetscFPrintf(MPI_Comm comm, FILE *fd, const char format[], ...)
> -> 601      {
>    602       PetscMPIInt rank;
>    603
>    604       PetscFunctionBegin;
> (lldb) frame info
> frame #0: 0x0000000112ecc8f8 libpetsc.3.018.dylib`PetscFPrintf(comm=0, fd=0x0000000000000000, format=0x0000000000000000) at mprint.c:601
> (lldb)
>
> The trace seems to indicate some sort of infinite loop causing an overflow.
>
>
> Yes, I have also seen this. What happens is that we have a memory error.
> The error is reported inside PetscMallocValidate()
> using PetscErrorPrintf, which eventually calls PetscCallMPI, which calls
> PetscMallocValidate again, which fails. We need to
> remove all error checking from the prints inside Validate.
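
The cycle described above, and the effect of taking the checks out of the
validator's own prints, can be sketched with a toy C model. This is only an
illustration under stated assumptions: every `toy_*` name is hypothetical,
none is a PETSc API, and it does not show the code in the merge request.

```c
#include <assert.h>
#include <stdio.h>

/* Toy model only; every name is hypothetical and none is a PETSc API. */
static int toy_heap_corrupted = 1; /* pretend the heap check fails */
static int toy_validate_calls = 0; /* how often the validator ran */
static int toy_lines_printed  = 0; /* lines written to stderr */

/* Unchecked print: writes straight to stderr and validates nothing. */
static void toy_print_unchecked(const char *msg)
{
  assert(msg != NULL);
  fprintf(stderr, "%s\n", msg);
  toy_lines_printed++;
}

/* Validator: reports through the unchecked print, so it cannot re-enter
   itself. Returns 1 if corruption was found, 0 otherwise. */
static int toy_validate(void)
{
  toy_validate_calls++;
  if (!toy_heap_corrupted) return 0;
  toy_print_unchecked("toy: corrupted memory detected");
  return 1;
}

/* Checked print: validates before printing. If toy_validate() reported
   through this function instead, the two would call each other without
   end and exhaust the machine stack. */
static void toy_print_checked(const char *msg)
{
  (void)toy_validate();
  toy_print_unchecked(msg);
}
```

Calling `toy_print_checked()` while the heap is flagged as corrupted runs the
validator exactly once and prints two lines, the report and the message,
instead of recursing.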
>
>   Thanks,
>
>      Matt
>
>
> PS: I'm using an arm64 Mac, so I don't have access to valgrind.
>
> Thank You,
> Sajid Ali (he/him) | Research Associate
> Scientific Computing Division
> Fermi National Accelerator Laboratory
> s-sajid-ali.github.io
>
>
>
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which their
> experiments lead.
> -- Norbert Wiener
>
> https://www.cse.buffalo.edu/~knepley/
>
>


-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/