[petsc-users] Bus Error

Jed Brown jed at jedbrown.org
Mon Aug 24 16:00:26 CDT 2020


Do you potentially have a memory or other resource leak?  SIGBUS would be an odd result, but the symptom of crashing after running for a long time sometimes fits with a resource leak.
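
One cheap way to test the leak hypothesis is to log process-level resource use every so often during the run and see whether it climbs steadily before the crash. The sketch below is only an illustration, not anything from PETSc: the helper names (peak_rss, open_fd_count, log_resources) are made up for this example, the descriptor count assumes a Linux /proc filesystem, and the units of ru_maxrss differ by platform (kilobytes on Linux, bytes on macOS).

```c
#include <dirent.h>
#include <stdio.h>
#include <sys/resource.h>

/* Peak resident set size as reported by getrusage().
 * Kilobytes on Linux, bytes on macOS. Returns -1 on failure. */
long peak_rss(void) {
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0) return -1;
    return ru.ru_maxrss;
}

/* Number of open file descriptors, counted by listing /proc/self/fd.
 * Assumes Linux; returns -1 if the directory cannot be opened. */
int open_fd_count(void) {
    DIR *d = opendir("/proc/self/fd");
    if (!d) return -1;
    int n = 0;
    struct dirent *e;
    while ((e = readdir(d)) != NULL) {
        if (e->d_name[0] != '.') n++; /* skip "." and ".." */
    }
    closedir(d);
    return n - 1; /* exclude the descriptor opendir() itself held */
}

/* Hypothetical helper: call once per time step or per checkpoint. */
void log_resources(FILE *out, long step) {
    fprintf(out, "step %ld: peak RSS %ld, open fds %d\n",
            step, peak_rss(), open_fd_count());
}
```

Calling log_resources(stderr, step) at each file save would show whether peak memory or the descriptor count keeps growing over the 1 to 20 hours before the failure. Flat numbers would point away from a leak and back toward corruption.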

Mark Lohry <mlohry at gmail.com> writes:

> I queued up some jobs with Barry's patch, so we'll see.
>
> Re Jed's suggestion about checkpointing, I don't *think* this is something
> coming from the state of the solution -- running from the same point I'm
> seeing it crash anywhere between 1 hour and 20 hours in. I'll increase my
> file save frequency in case I'm wrong there though.
>
> My intel build with different blas just made it through a 6 hour time slot
> without crash, whereas yesterday the same thing crashed after 3 hours. But
> given the randomness so far I'd bet that's just dumb luck.
>
> On Mon, Aug 24, 2020 at 4:22 PM Barry Smith <bsmith at petsc.dev> wrote:
>
>>
>>
>> > On Aug 24, 2020, at 2:34 PM, Jed Brown <jed at jedbrown.org> wrote:
>> >
>> > I'm thinking of something such as writing floating point data into the
>> return address, which would be unaligned/garbage.
>>
>>   Ok, my patch will detect this. This is what I was talking about, messing
>> up the BLAS arguments which are the addresses of arrays.
>>
>>   Valgrind is by far the preferred approach.
>>
>>   Barry
>>
>>   Another feature we could add to the malloc checking is when a SEGV or
>> BUS error is encountered and we catch it we should run the
>> PetscMallocVerify() and check our memory for corruption reporting any we
>> find.
>>
>>
>>
>> >
>> > Reproducing under Valgrind would help a lot.  Perhaps it's possible to
>> checkpoint such that the breakage can be reproduced more quickly?
>> >
>> > Barry Smith <bsmith at petsc.dev> writes:
>> >
>> >> https://en.wikipedia.org/wiki/Bus_error
>> >>
>> >> But perhaps not true for Intel?
>> >>
>> >>
>> >>
>> >>> On Aug 24, 2020, at 1:06 PM, Matthew Knepley <knepley at gmail.com>
>> wrote:
>> >>>
>> >>> On Mon, Aug 24, 2020 at 1:46 PM Barry Smith <bsmith at petsc.dev> wrote:
>> >>>
>> >>>
>> >>>> On Aug 24, 2020, at 12:39 PM, Jed Brown <jed at jedbrown.org> wrote:
>> >>>>
>> >>>> Barry Smith <bsmith at petsc.dev> writes:
>> >>>>
>> >>>>>> On Aug 24, 2020, at 12:31 PM, Jed Brown <jed at jedbrown.org> wrote:
>> >>>>>>
>> >>>>>> Barry Smith <bsmith at petsc.dev> writes:
>> >>>>>>
>> >>>>>>> So if a BLAS errors with SIGBUS then it is always an input error
>> of just not proper double/complex alignment? Or some other very strange
>> thing?
>> >>>>>>
>> >>>>>> I would suspect memory corruption.
>> >>>>>
>> >>>>>
>> >>>>> Corruption meaning what specifically?
>> >>>>>
>> >>>>> The routines crashing are dgemv which only take double precision
arrays, regardless of what garbage is in those arrays I don't think there
>> can be BUS errors resulting. They don't take integer arrays whose
>> corruption could result in bad indexing and then BUS errors.
>> >>>>>
>> >>>>> So then it can only be corruption of the pointers passed in, correct?
>> >>>>
>> >>>> Such as those pointers pointing into data on the stack with incorrect
>> sizes.
>> >>>
>> >>> But won't incorrect sizes "usually" lead to SEGV not SIGBUS?
>> >>>
>> >>> My understanding was that roughly memory errors in the heap are SEGV
>> and memory errors on the stack are SIGBUS. Is that not true?
>> >>>
>> >>>   Matt
>> >>>
>> >>> --
>> >>> What most experimenters take for granted before they begin their
>> experiments is infinitely more interesting than any results to which their
>> experiments lead.
>> >>> -- Norbert Wiener
>> >>>
>> >>> https://www.cse.buffalo.edu/~knepley/ <
>> http://www.cse.buffalo.edu/~knepley/>
>>
>>

