<div dir="ltr"><div>I queued up some jobs with Barry's patch, so we'll see.</div><div><br></div><div>Re Jed's suggestion about checkpointing, I don't *think* this is something coming from the state of the solution -- running from the same point, I'm seeing it crash anywhere between 1 hour and 20 hours in. I'll increase my file save frequency in case I'm wrong there, though.</div><div><br></div><div>My Intel build with a different BLAS just made it through a 6-hour time slot without crashing, whereas yesterday the same thing crashed after 3 hours. But given the randomness so far, I'd bet that's just dumb luck.<br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Aug 24, 2020 at 4:22 PM Barry Smith <<a href="mailto:bsmith@petsc.dev">bsmith@petsc.dev</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><br>
<br>
> On Aug 24, 2020, at 2:34 PM, Jed Brown <<a href="mailto:jed@jedbrown.org" target="_blank">jed@jedbrown.org</a>> wrote:<br>
> <br>
> I'm thinking of something such as writing floating point data into the return address, which would be unaligned/garbage.<br>
<br>
OK, my patch will detect this. This is what I was talking about: messing up the BLAS arguments, which are the addresses of arrays.<br>
<br>
Valgrind is by far the preferred approach.<br>
<br>
Barry<br>
<br>
Another feature we could add to the malloc checking: when a SEGV or BUS error is encountered and we catch it, we should run PetscMallocVerify() and check our memory for corruption, reporting any we find.<br>
<br>
<br>
<br>
> <br>
> Reproducing under Valgrind would help a lot. Perhaps it's possible to checkpoint such that the breakage can be reproduced more quickly?<br>
> <br>
> Barry Smith <<a href="mailto:bsmith@petsc.dev" target="_blank">bsmith@petsc.dev</a>> writes:<br>
> <br>
>> <a href="https://en.wikipedia.org/wiki/Bus_error" rel="noreferrer" target="_blank">https://en.wikipedia.org/wiki/Bus_error</a><br>
>> <br>
>> But perhaps not true for Intel? <br>
>> <br>
>> <br>
>> <br>
>>> On Aug 24, 2020, at 1:06 PM, Matthew Knepley <<a href="mailto:knepley@gmail.com" target="_blank">knepley@gmail.com</a>> wrote:<br>
>>> <br>
>>> On Mon, Aug 24, 2020 at 1:46 PM Barry Smith <<a href="mailto:bsmith@petsc.dev" target="_blank">bsmith@petsc.dev</a>> wrote:<br>
>>> <br>
>>> <br>
>>>> On Aug 24, 2020, at 12:39 PM, Jed Brown <<a href="mailto:jed@jedbrown.org" target="_blank">jed@jedbrown.org</a>> wrote:<br>
>>>> <br>
>>>> Barry Smith <<a href="mailto:bsmith@petsc.dev" target="_blank">bsmith@petsc.dev</a>> writes:<br>
>>>> <br>
>>>>>> On Aug 24, 2020, at 12:31 PM, Jed Brown <<a href="mailto:jed@jedbrown.org" target="_blank">jed@jedbrown.org</a>> wrote:<br>
>>>>>> <br>
>>>>>> Barry Smith <<a href="mailto:bsmith@petsc.dev" target="_blank">bsmith@petsc.dev</a>> writes:<br>
>>>>>> <br>
>>>>>>> So if a BLAS routine errors with SIGBUS, then is it always an input error such as improper double/complex alignment? Or some other very strange thing?<br>
>>>>>> <br>
>>>>>> I would suspect memory corruption.<br>
>>>>> <br>
>>>>> <br>
>>>>> Corruption meaning what specifically?<br>
>>>>> <br>
>>>>> The routines crashing are dgemv calls, which only take double precision arrays; regardless of what garbage is in those arrays, I don't think BUS errors can result. They don't take integer arrays whose corruption could result in bad indexing and then BUS errors. <br>
>>>>> <br>
>>>>> So then it can only be corruption of the pointers passed in, correct?<br>
>>>> <br>
>>>> Such as those pointers pointing into data on the stack with incorrect sizes.<br>
>>> <br>
>>> But won't incorrect sizes "usually" lead to SEGV, not SIGBUS?<br>
>>> <br>
>>> My understanding was that, roughly, memory errors in the heap are SEGV and memory errors on the stack are SIGBUS. Is that not true?<br>
>>> <br>
>>> Matt<br>
>>> <br>
>>> -- <br>
>>> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.<br>
>>> -- Norbert Wiener<br>
>>> <br>
>>> <a href="https://www.cse.buffalo.edu/~knepley/" rel="noreferrer" target="_blank">https://www.cse.buffalo.edu/~knepley/</a><br>
<br>
</blockquote></div>
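<div><br></div>
<div>A small illustration of Jed's point above about floating point data being written into a return address: the sketch below reinterprets the 8 bytes of a few doubles as a 64-bit integer and checks whether that value could even be a usable address. It assumes x86-64 with 48-bit canonical virtual addresses, and the helper names (double_as_address, is_canonical_x86_64) are made up for this example; it only shows what such garbage looks like and is not a diagnosis of the crash discussed in this thread.</div>

```python
import struct


def double_as_address(value):
    """Reinterpret the 8 bytes of an IEEE-754 double as an unsigned 64-bit integer."""
    return struct.unpack("=Q", struct.pack("=d", value))[0]


def is_canonical_x86_64(addr):
    """Assumption: 48-bit virtual addressing, so bits 47..63 must be all 0 or all 1."""
    top = addr // 2**47  # the upper 17 bits of the 64-bit value
    return top in (0, 2**17 - 1)


if __name__ == "__main__":
    # Typical solver-like values; none of these were taken from the real run.
    for value in (1.0, -2.5, 1e-3, 6.02e23):
        addr = double_as_address(value)
        print(
            f"{value!r:10} as address 0x{addr:016x}"
            f"  canonical={is_canonical_x86_64(addr)}"
            f"  8-byte aligned={addr % 8 == 0}"
        )
```

<div>Ordinary nonzero doubles carry their exponent in the top bits, so under the stated assumption they tend to land outside the canonical address range, which fits the "unaligned/garbage" description in the quoted message.</div>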