[petsc-users] Bus Error

Mark Lohry mlohry at gmail.com
Mon Aug 24 15:36:36 CDT 2020


I queued up some jobs with Barry's patch, so we'll see.

Re Jed's suggestion about checkpointing, I don't *think* this is something
coming from the state of the solution -- running from the same point I'm
seeing it crash anywhere between 1 hour and 20 hours in. I'll increase my
file save frequency in case I'm wrong there though.

My Intel build with a different BLAS just made it through a 6 hour time slot
without crashing, whereas yesterday the same thing crashed after 3 hours. But
given the randomness so far I'd bet that's just dumb luck.

On Mon, Aug 24, 2020 at 4:22 PM Barry Smith <bsmith at petsc.dev> wrote:

>
>
> > On Aug 24, 2020, at 2:34 PM, Jed Brown <jed at jedbrown.org> wrote:
> >
> > I'm thinking of something such as writing floating point data into the
> > return address, which would be unaligned/garbage.
>
>   Ok, my patch will detect this. This is what I was talking about, messing
> up the BLAS arguments which are the addresses of arrays.
>
>   Valgrind is by far the preferred approach.
>
>   Barry
>
>   Another feature we could add to the malloc checking: when a SEGV or
> BUS error is encountered and we catch it, we should run
> PetscMallocValidate() and check our memory for corruption, reporting any
> we find.
>
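
For readers unfamiliar with this kind of malloc checking, here is a minimal sketch in C of the general idea: put a known guard word on either side of each allocation and verify later that both are still intact. All names here (guarded_malloc, guarded_check, guarded_free, GUARD_WORD) are hypothetical and this is not PETSc's implementation; hooking the check into a SEGV/BUS signal handler, as suggested above, is left out.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical names throughout; an illustration, not PETSc code. */
#define GUARD_WORD 0xDEADBEEFCAFEF00DULL

typedef struct {
  size_t   nbytes; /* size the caller asked for */
  uint64_t front;  /* guard word stored just before the user data */
} GuardHeader;

/* Allocate nbytes of user data with a guard word before and after it. */
void *guarded_malloc(size_t nbytes)
{
  uint64_t     guard = GUARD_WORD;
  GuardHeader *h = malloc(sizeof(GuardHeader) + nbytes + sizeof(guard));
  if (!h) return NULL;
  h->nbytes = nbytes;
  h->front  = guard;
  /* memcpy because the trailing guard may not be 8-byte aligned */
  memcpy((char *)(h + 1) + nbytes, &guard, sizeof(guard));
  return h + 1;
}

/* Returns 1 if both guard words are intact, 0 if either was overwritten. */
int guarded_check(const void *p)
{
  const GuardHeader *h;
  uint64_t           back;
  assert(p != NULL);
  h = (const GuardHeader *)p - 1;
  memcpy(&back, (const char *)p + h->nbytes, sizeof(back));
  return h->front == GUARD_WORD && back == GUARD_WORD;
}

/* Release a block obtained from guarded_malloc. */
void guarded_free(void *p)
{
  if (p) free((GuardHeader *)p - 1);
}
```

A write one element past the end of an array allocated this way lands on the trailing guard, so a later guarded_check() on that pointer returns 0. A check like this only sees overruns that hit a guard word; writes that land elsewhere go unnoticed, which is part of why Valgrind is preferred.
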
>
>
> >
> > Reproducing under Valgrind would help a lot.  Perhaps it's possible to
> > checkpoint such that the breakage can be reproduced more quickly?
> >
> > Barry Smith <bsmith at petsc.dev> writes:
> >
> >> https://en.wikipedia.org/wiki/Bus_error
> >>
> >> But perhaps not true for Intel?
> >>
> >>
> >>
> >>> On Aug 24, 2020, at 1:06 PM, Matthew Knepley <knepley at gmail.com> wrote:
> >>>
> >>> On Mon, Aug 24, 2020 at 1:46 PM Barry Smith <bsmith at petsc.dev> wrote:
> >>>
> >>>
> >>>> On Aug 24, 2020, at 12:39 PM, Jed Brown <jed at jedbrown.org> wrote:
> >>>>
> >>>> Barry Smith <bsmith at petsc.dev> writes:
> >>>>
> >>>>>> On Aug 24, 2020, at 12:31 PM, Jed Brown <jed at jedbrown.org> wrote:
> >>>>>>
> >>>>>> Barry Smith <bsmith at petsc.dev> writes:
> >>>>>>
> >>>>>>> So if a BLAS routine errors with SIGBUS then it is always an input
> >>>>>>> error of just improper double/complex alignment? Or some other very
> >>>>>>> strange thing?
> >>>>>>
> >>>>>> I would suspect memory corruption.
> >>>>>
> >>>>>
> >>>>> Corruption meaning what specifically?
> >>>>>
> >>>>> The routines crashing are dgemv, which only takes double precision
> >>>>> arrays; regardless of what garbage is in those arrays I don't think
> >>>>> BUS errors can result. It doesn't take integer arrays whose
> >>>>> corruption could result in bad indexing and then BUS errors.
> >>>>>
> >>>>> So then it can only be corruption of the pointers passed in, correct?
> >>>>
> >>>> Such as those pointers pointing into data on the stack with incorrect
> >>>> sizes.
> >>>
> >>> But won't incorrect sizes "usually" lead to SEGV not SIGBUS?
> >>>
> >>> My understanding was that roughly memory errors in the heap are SEGV
> >>> and memory errors on the stack are SIGBUS. Is that not true?
> >>>
> >>>   Matt
> >>>
> >>> --
> >>> What most experimenters take for granted before they begin their
> >>> experiments is infinitely more interesting than any results to which
> >>> their experiments lead.
> >>> -- Norbert Wiener
> >>>
> >>> https://www.cse.buffalo.edu/~knepley/
>
>
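
The point made in the quoted discussion, that dgemv only takes double precision arrays and so the suspicion falls on the pointers themselves, suggests a cheap sanity check on those pointers before the call. A minimal sketch, assuming C11; the function names are hypothetical, and this is not Barry's patch, whose contents are not shown in this thread.

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if p is non-NULL and suitably aligned for double, else 0. */
int ptr_ok_for_double(const void *p)
{
  return p != NULL && ((uintptr_t)p % alignof(double)) == 0;
}

/* Hypothetical pre-check for the array arguments of a dgemv-style call:
   the matrix A, the input vector x, and the output vector y. */
int dgemv_args_ok(const double *A, const double *x, const double *y)
{
  return ptr_ok_for_double(A) && ptr_ok_for_double(x) && ptr_ok_for_double(y);
}
```

This only catches null or misaligned addresses. A pointer that is properly aligned but points at the wrong memory passes the check, so it complements rather than replaces a run under Valgrind.
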