[petsc-users] Tough to reproduce petsctablefind error

Sat Sep 26 21:53:12 CDT 2020

Ok, I think we've made some progress.

We were already calling the function like this: ierr = PetscCall(); if
(ierr != 0) {do something to handle error}. We actually are doing that on
every single call made to Petsc, just to be careful. This is what was
confusing to me. Why was the program terminating from within Petsc and not
returning out with an error? We'd written our code so that if Petsc did
return with an error, we'd discard the full timestep, destroy all the Petsc
data structures and redo everything with a smaller dt. So if Petsc did hit
this error very rarely, we might be able to recover gracefully.
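
For anyone following along, the recovery logic described above looks roughly like the sketch below. This is not our production code: attempt_timestep and advance_with_retry are hypothetical names, and the stub fails on a made-up condition (dt > 0.25) so that the example compiles without Petsc. In the real code, the failure branch is where the Petsc data structures get destroyed and rebuilt.

```c
#include <stdio.h>

/* Hypothetical stand-in for one full timestep that makes many Petsc calls.
   Returns 0 on success and nonzero if any call reported an error.
   Assumption for illustration only: the step fails whenever dt > 0.25. */
static int attempt_timestep(double dt)
{
    return (dt > 0.25) ? 1 : 0;
}

/* Try a timestep; on failure, discard it and retry with dt halved.
   Returns the accepted dt, or -1.0 if every attempt failed. */
static double advance_with_retry(double dt, int max_retries)
{
    for (int attempt = 0; attempt <= max_retries; ++attempt) {
        int ierr = attempt_timestep(dt);
        if (ierr == 0) {
            return dt; /* step accepted */
        }
        /* Real code would destroy and recreate the Petsc objects here. */
        fprintf(stderr, "timestep failed with dt = %g, retrying\n", dt);
        dt *= 0.5;
    }
    return -1.0;
}
```

Calling advance_with_retry(1.0, 5) with this stub tries dt = 1.0 and 0.5, then accepts 0.25. None of this can work unless the Petsc call actually returns the error code to the caller instead of aborting.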

It does not appear to be seg faulting. So it seemed that the program was
being terminated intentionally from within Petsc, which was puzzling, and
why I was asking about that in my previous email.

So - Chris made a great find. Turns out that right after PetscInitialize in
our main.cpp, we had the line:

PetscPushErrorHandler(PetscAbortErrorHandler, NULL);

Which was telling Petsc to call MPI_Abort if there was an error. I probably
put that line into the code years ago and forgot it was there. So, as Barry
said, if we switch the error handler to PetscIgnoreErrorHandler, then at least we
can avoid the program aborting on the error, and hopefully be able to
recover with our existing code logic.

Also, I may have found a clue to the root cause of the error. I had thought
that we were checking all of our inputs to Petsc for issues such as out
of range index values. But I went back and saw that, due to a versioning
mistake, one particular error check on our inputs was being
removed from production builds by a preprocessor definition. That means
bad inputs would not have been caught in our production builds and could
have been passed into Petsc. I don't know
for sure, but it is plausible. The missing error check was doing the
following: checking to see if the fourth entry to MatSetValues ("n", for
the number of nonzero values in the row) (
https://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Mat/MatSetValues.html)
was equal to the sum of the number of diagonal and off diagonal values that
we had specified in our previous call to MatMPIAIJSetPreallocation.
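
A minimal sketch of that check, for illustration only. The function name check_row_count and the array names d_nnz / o_nnz are hypothetical, "row" is assumed to be a local row index, and plain int is used in place of PetscInt so the snippet compiles without Petsc. The important part is that the check sits outside any #ifdef, so it cannot be compiled out of production builds.

```c
#include <stdio.h>

/* Compare the "n" about to be passed to MatSetValues for one row against
   the counts promised for that row in MatMPIAIJSetPreallocation.
   d_nnz[row] = promised diagonal-block entries (hypothetical array)
   o_nnz[row] = promised off-diagonal-block entries (hypothetical array)
   Returns 0 if the counts agree, nonzero otherwise. */
static int check_row_count(int row, int n, const int d_nnz[], const int o_nnz[])
{
    int promised = d_nnz[row] + o_nnz[row];
    if (n != promised) {
        fprintf(stderr, "row %d: n = %d but preallocation promised %d\n",
                row, n, promised);
        return 1;
    }
    return 0;
}
```

The idea would be to call this right before each MatSetValues and treat a nonzero return the same way as a Petsc error, i.e. discard the timestep.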

So that is at least a theory for what was happening. The theory would be:
very rarely, due to a bug in our code, we were running MatSetValues with
"n" set to a value not equal to the number of nonzero values promised in
the call to MatMPIAIJSetPreallocation. Maybe this led to the Petsc "key
7556 is greater than largest key allowed 5693" error message, and then our
setting of 'PetscAbortErrorHandler' was causing the program to abort.

Mark

On Sat, Sep 26, 2020 at 9:51 PM Barry Smith <bsmith at petsc.dev> wrote:

>
>
> On Sep 26, 2020, at 5:58 PM, Junchao Zhang <junchao.zhang at gmail.com>
> wrote:
>
>
>
> On Sat, Sep 26, 2020 at 5:44 PM Mark Adams <mfadams at lbl.gov> wrote:
>
>>
>>
>> On Sat, Sep 26, 2020 at 1:07 PM Matthew Knepley <knepley at gmail.com>
>> wrote:
>>
>>> On Sat, Sep 26, 2020 at 11:17 AM Mark McClure <mark at resfrac.com> wrote:
>>>
>>>> Thank you, all for the explanations.
>>>>
>>>> Following Matt's suggestion, we'll use -g (and not use
>>>> -with-debugging=0) in all future compiles for all users, so in future we can
>>>> provide better information.
>>>>
>>>> Second, Chris is going to boil our function down to a minimal stub and
>>>> share in case there is some subtle issue with the way the functions are
>>>> being called.
>>>>
>>>> Third, I have a question/request - Petsc is, in fact, detecting an error.
>>>> As far as I can tell, this is not an uncontrolled 'seg fault'. It seems to
>>>> me that maybe Petsc could choose to return out from the function when it
>>>> detects this error, returning an error code, rather than dumping the core
>>>> and terminating the program. If Petsc simply returned out with an error
>>>> message, this would resolve the problem for us. After the Petsc call, we
>>>> check for Petsc error messages. If Petsc returns an error - that's fine -
>>>> we use a direct solver as a backup, and the simulation continues. So - I am
>>>> not sure whether this is feasible - but if Petsc could return out with an
>>>> error message - rather than dumping the core and terminating the program -
>>>> then that would effectively resolve the issue for us. Would this change be
>>>> possible?
>>>>
>>>
>>> At some level, I think it is currently doing what you want. CHKERRQ()
>>> simply returns an error code from that function call, printing an error
>>> message. Suppressing the message is harder I think,
>>>
>>
>> He does not need this.
>>
>>
>>> but for now, if you know what function call is causing the error, you
>>> can just catch the (ierr != 0) yourself instead of using CHKERRQ.
>>>
>>
>> This is what I suggested earlier but maybe I was not clear enough.
>>
>> Your code calls something like
>>
>> ierr = SNESSolve(....); CHKERRQ(ierr);
>>
>> You can replace this with:
>>
>>  ierr = SNESSolve(....);
>>  if (ierr) {
>>
> How to deal with CHKERRQ(ierr); inside SNESSolve()?
>
>
>
>    PetscPushErrorHandler(PetscIgnoreErrorHandler,NULL);
>
>    But the problem in this code's runs appears to be due to corrupt data,
> why and how it gets corrupted is not known. Continuing with an alternative
> solver because a solver failed for numerical or algorithmic reasons is
> generally fine but continuing when there is corrupted data is always iffy
> because one doesn't know how far the corruption has spread.
> SNESDestroy(&snes); SNESCreate(&snes); may likely clean out any potentially
> corrupted data but if the corruption got into the mesh data structures it
> will still be there.
>
>    A very sophisticated library code would, when it detects this type of
> corruption, sweep through all the data structures looking for any
> indications of corruption, to help determine the cause and/or even fix the
> problem. We don't have this code in place, though we could add some,
> because generally we rely on valgrind or -malloc_debug to detect such
> corruption, the problem is valgrind and -malloc_debug don't fit well in a
> production environment. Handling corruption that comes up in production but
> not testing is difficult.
>
>  Barry
>
>
>
>
>     ....
>>  }
>>
>> I suggested something earlier to do here. Maybe call KSPView. You could
>> even destroy the solver and start the solver from scratch and see if that
>> works.
>>
>> Mark
>>
>>
>>> The drawback here is that we might not have cleaned up
>>> all the state so that restarting makes sense. It should be possible to
>>> just kill the solve, reset the solver, and retry, although it is not clear
>>> to me at first glance if MPI will be in an okay state.
>>>
>>>
>