[petsc-users] Bus Error

Barry Smith bsmith at petsc.dev
Wed Aug 12 12:46:14 CDT 2020


   Mark,

    When valgrind is not feasible (as on many centrally controlled batch systems), you can run PETSc with an extra flag that performs some memory error checks:

 -malloc_debug

 This flag

1) fills all malloc'd memory with NaN, so that code using uninitialized memory may be detected, and
2) checks the beginning and end of each allocated memory region for out-of-bounds writes at each malloc and free.

It will slow the code down a little, but generally not by a huge amount.

It is nowhere near as good as valgrind or other memory corruption tools, but it has the advantage that you can run it anywhere, on a job of any size.
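
To make the two checks concrete, here is a minimal, self-contained C sketch of the general technique: NaN-filling the payload and placing guard words at both ends of each allocation. It is an illustration only and is not PETSc's implementation; the names dbg_malloc, dbg_check, dbg_free, demo_overrun and GUARD_WORD are hypothetical, and a real debugging allocator would also track every live allocation so that all of them can be validated at each malloc and free.

```c
#include <math.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical names throughout: an illustration of guard-word checking,
   not PETSc's actual allocator. */
#define GUARD_WORD 0xDEADBEEFCAFEF00DULL

/* Header stored immediately before the user's memory (16 bytes). */
typedef struct {
    uint64_t front_guard; /* detects writes before the start of the region */
    uint64_t nbytes;      /* payload size, needed to locate the back guard */
} dbg_header;

/* Allocate nbytes, fill the payload with NaN, and bracket it with guards. */
static void *dbg_malloc(size_t nbytes)
{
    unsigned char *raw = malloc(sizeof(dbg_header) + nbytes + sizeof(uint64_t));
    if (!raw) return NULL;

    dbg_header h = { GUARD_WORD, (uint64_t)nbytes };
    memcpy(raw, &h, sizeof h);

    unsigned char *user = raw + sizeof(dbg_header);
    const double nan_val = NAN;
    size_t i;
    /* Any double read before it is written will be NaN and poison results. */
    for (i = 0; i + sizeof(double) <= nbytes; i += sizeof(double))
        memcpy(user + i, &nan_val, sizeof nan_val);
    memset(user + i, 0xFF, nbytes - i); /* leftover bytes, if any */

    const uint64_t guard = GUARD_WORD;
    memcpy(user + nbytes, &guard, sizeof guard); /* back guard */
    return user;
}

/* Return 0 if both guards are intact, 1 if the front guard was
   overwritten, 2 if the back guard was overwritten. */
static int dbg_check(const void *ptr)
{
    const unsigned char *user = ptr;
    dbg_header h;
    uint64_t back;

    memcpy(&h, user - sizeof(dbg_header), sizeof h);
    if (h.front_guard != GUARD_WORD) return 1;
    memcpy(&back, user + h.nbytes, sizeof back);
    return back == GUARD_WORD ? 0 : 2;
}

/* Check the guards, release the block, and report what the check found. */
static int dbg_free(void *ptr)
{
    int status = dbg_check(ptr);
    free((unsigned char *)ptr - sizeof(dbg_header));
    return status;
}

/* Off-by-one loop: writes one double past the end of a 4-element array.
   The stray write lands in the back guard, so the free reports status 2. */
static int demo_overrun(void)
{
    const size_t n = 4;
    double *a = dbg_malloc(n * sizeof *a);
    if (!a) return -1;
    for (size_t i = 0; i <= n; i++) /* bug: should be i < n */
        a[i] = (double)i;
    return dbg_free(a);
}
```

Calling demo_overrun() from a main function shows the limitation mentioned above: the overrun is only noticed when the guards are next inspected (here, at the free), not at the moment of the bad write, which is why a tool like valgrind gives much more precise information when it can be used.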


  Barry





> On Aug 12, 2020, at 7:46 AM, Matthew Knepley <knepley at gmail.com> wrote:
> 
> On Wed, Aug 12, 2020 at 7:53 AM Mark Lohry <mlohry at gmail.com> wrote:
> I'm getting seemingly random failures of late:
> Caught signal number 7 BUS: Bus Error, possibly illegal memory access
> 
> The first thing I would do is run valgrind on as wide an array of tests as you can. This will find problems
> even in cases that appear to run completely fine.
> 
>   Thanks,
> 
>      Matt
>  
> Symptoms:
> 1) Seems to only happen (so far) on larger cases, 400-2000 cores
> 2) It doesn't happen right away -- this was running happily for several hours over several hundred time steps with no indication of bad health in the numerics
> 3) The total memory consumption, at least, seems to be within bounds, though I'm not sure about individual processes. For example, Slurm reported Memory Efficiency: 75.23% of 1.76 TB (180.00 GB/node)
> 4) Running the same setup twice, it fails at different points
> 
> Any suggestions on what to look for? This is a bit painful to work on as I can only reproduce it on large runs and then it's seemingly random.
> 
> 
> Thanks,
> Mark
> 
> 
> -- 
> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
> -- Norbert Wiener
> 
> https://www.cse.buffalo.edu/~knepley/
