[petsc-users] KSP_Solve crashes in debug mode

Barry Smith bsmith at petsc.dev
Fri Feb 24 11:47:19 CST 2023


Hmm, here is the macro

    #define PetscCallAbort(comm, ...) \
    do { \
      PetscErrorCode ierr_petsc_call_abort_; \
      PetscStackUpdateLine; \
      ierr_petsc_call_abort_ = __VA_ARGS__; \
      if (PetscUnlikely(ierr_petsc_call_abort_ != PETSC_SUCCESS)) { \
        ierr_petsc_call_abort_ = PetscError(PETSC_COMM_SELF, __LINE__, PETSC_FUNCTION_NAME, __FILE__, ierr_petsc_call_abort_, PETSC_ERROR_REPEAT, " "); \
        (void)MPI_Abort(comm, (PetscMPIInt)ierr_petsc_call_abort_); \
      } \
    } while (0)

It does not seem to increment anything in the stack, so I think the call itself should be OK.

Perhaps your function has a PetscFunctionBegin but no PetscFunctionReturn(), or in some other way increases the stack size without decreasing it?




> On Feb 24, 2023, at 12:39 PM, Sajid Ali Syed <sasyed at fnal.gov> wrote:
> 
> Hi Barry, 
> 
> The application calls PetscCallAbort in a loop, i.e.
> 
> for i in range:
>   void routine(PetscCallAbort(function_returning_petsc_error_code))
> 
> From the prior logs it looks like the stack grows every time PetscCallAbort is called (in other words, the stack does not shrink upon successful exit from PetscCallAbort). 
> 
> Is this usage pattern not recommended? Should I be manually checking for success of the `function_returning_petsc_error_code` and throw instead of relying on PetscCallAbort?
> 
> 
> 
> Thank You,
> Sajid Ali (he/him) | Research Associate
> Data Science, Simulation, and Learning Division
> Fermi National Accelerator Laboratory
> s-sajid-ali.github.io <http://s-sajid-ali.github.io/>
> From: Barry Smith <bsmith at petsc.dev>
> Sent: Wednesday, February 22, 2023 6:49 PM
> To: Sajid Ali Syed <sasyed at fnal.gov>
> Cc: Matthew Knepley <knepley at gmail.com>; petsc-users at mcs.anl.gov
> Subject: Re: [petsc-users] KSP_Solve crashes in debug mode
>  
> 
>   Hmm, there could be a bug in our handling of the stack when it reaches the maximum. It is supposed to just stop collecting additional levels at that point, but likely it has not been tested since a lot of refactoring.
> 
>    What are you doing to have so many stack frames? 
> 
>> On Feb 22, 2023, at 6:32 PM, Sajid Ali Syed <sasyed at fnal.gov> wrote:
>> 
>> Hi Matt, 
>> 
>> Adding `-checkstack` does not prevent the crash, both on my laptop and on the cluster. 
>> 
>> What does prevent the crash (on my laptop at least) is changing `PETSCSTACKSIZE` from 64 to 256 here: https://github.com/petsc/petsc/blob/main/include/petscerror.h#L1153
>> 
>> 
>> Thank You,
>> Sajid Ali (he/him) | Research Associate
>> Data Science, Simulation, and Learning Division
>> Fermi National Accelerator Laboratory
>> s-sajid-ali.github.io
>> From: Matthew Knepley <knepley at gmail.com>
>> Sent: Wednesday, February 22, 2023 5:23 PM
>> To: Sajid Ali Syed <sasyed at fnal.gov>
>> Cc: Barry Smith <bsmith at petsc.dev>; petsc-users at mcs.anl.gov
>> Subject: Re: [petsc-users] KSP_Solve crashes in debug mode
>>  
>> On Wed, Feb 22, 2023 at 6:18 PM Sajid Ali Syed via petsc-users <petsc-users at mcs.anl.gov> wrote:
>> One thing to note in relation to the trace attached in the previous email is that there are no warnings until the 36th call to KSP_Solve. The first error (as indicated by ASAN) occurs somewhere before the 40th call to KSP_Solve (part of what the application marks as turn 10 of the propagator). The crash finally occurs on the 43rd call to KSP_Solve.
>> 
>> Looking at the trace, it appears that stack handling is messed up and eventually it causes the crash. This can happen when
>> PetscFunctionBegin is not matched up with PetscFunctionReturn. Can you try running this with
>> 
>>   -checkstack
>> 
>>   Thanks,
>> 
>>      Matt
>>  
>> Thank You,
>> Sajid Ali (he/him) | Research Associate
>> Data Science, Simulation, and Learning Division
>> Fermi National Accelerator Laboratory
>> s-sajid-ali.github.io
>> From: Sajid Ali Syed <sasyed at fnal.gov>
>> Sent: Wednesday, February 22, 2023 5:11 PM
>> To: Barry Smith <bsmith at petsc.dev>
>> Cc: petsc-users at mcs.anl.gov
>> Subject: Re: [petsc-users] KSP_Solve crashes in debug mode
>>  
>> Hi Barry, 
>> 
>> Thanks a lot for fixing this issue. I ran the same problem on a Linux machine and have the following trace for the same crash (with ASAN turned on for both PETSc (on the latest commit of the branch) and the application): https://gist.github.com/s-sajid-ali/85bdf689eb8452ef8702c214c4df6940
>> 
>> The trace seems to indicate a couple of buffer overflows, one of which causes the crash. I'm not sure as to what causes them. 
>> 
>> Thank You,
>> Sajid Ali (he/him) | Research Associate
>> Data Science, Simulation, and Learning Division
>> Fermi National Accelerator Laboratory
>> s-sajid-ali.github.io
>> From: Barry Smith <bsmith at petsc.dev>
>> Sent: Wednesday, February 15, 2023 2:01 PM
>> To: Sajid Ali Syed <sasyed at fnal.gov>
>> Cc: petsc-users at mcs.anl.gov
>> Subject: Re: [petsc-users] KSP_Solve crashes in debug mode
>>  
>> 
>> https://gitlab.com/petsc/petsc/-/merge_requests/6075 should fix the possible recursive error condition Matt pointed out.
>> 
>> 
>>> On Feb 9, 2023, at 6:24 PM, Matthew Knepley <knepley at gmail.com> wrote:
>>> 
>>> On Thu, Feb 9, 2023 at 6:05 PM Sajid Ali Syed via petsc-users <petsc-users at mcs.anl.gov> wrote:
>>> I added `-malloc_debug` in a .petscrc file and ran it again. The backtrace from lldb is in the attached file. The crash now seems to be at:
>>> 
>>> Process 32660 stopped
>>> * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=2, address=0x16f603fb8)
>>>     frame #0: 0x0000000112ecc8f8 libpetsc.3.018.dylib`PetscFPrintf(comm=0, fd=0x0000000000000000, format=0x0000000000000000) at mprint.c:601
>>>    598               `PetscViewerASCIISynchronizedPrintf()`, `PetscSynchronizedFlush()`
>>>    599      @*/
>>>    600      PetscErrorCode PetscFPrintf(MPI_Comm comm, FILE *fd, const char format[], ...)
>>> -> 601      {
>>>    602       PetscMPIInt rank;
>>>    603      
>>>    604       PetscFunctionBegin;
>>> (lldb) frame info
>>> frame #0: 0x0000000112ecc8f8 libpetsc.3.018.dylib`PetscFPrintf(comm=0, fd=0x0000000000000000, format=0x0000000000000000) at mprint.c:601
>>> (lldb)
>>> The trace seems to indicate some sort of infinite loop causing an overflow.
>>> 
>>> 
>>> Yes, I have also seen this. What happens is that we have a memory error. The error is reported inside PetscMallocValidate()
>>> using PetscErrorPrintf, which eventually calls PetscCallMPI, which calls PetscMallocValidate again, which fails. We need to
>>> remove all error checking from the prints inside Validate.
>>> 
>>>   Thanks,
>>> 
>>>      Matt
>>>  
>>> PS: I'm using an arm64 Mac, so I don't have access to valgrind.
>>> 
>>> Thank You,
>>> Sajid Ali (he/him) | Research Associate
>>> Scientific Computing Division
>>> Fermi National Accelerator Laboratory
>>> s-sajid-ali.github.io
>>> 
>>> 
>>> -- 
>>> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
>>> -- Norbert Wiener
>>> 
>>> https://www.cse.buffalo.edu/~knepley/
>> 
>> 
>> 