[petsc-users] Tough to reproduce petsctablefind error

Mark McClure mark at resfrac.com
Tue Nov 3 07:23:13 CST 2020


Sorry, meant to say "are *not* within the range from 0 to size of matrix".

On Tue, Nov 3, 2020 at 8:22 AM Mark McClure <mark at resfrac.com> wrote:

> Hi, all.
>
> I am emailing to close the loop on this.
>
> There were two things combining to cause our issue.
>
> 1. At some point, several years ago, I had set
> PetscPushErrorHandler(PetscAbortErrorHandler, NULL), and then forgotten
> about it. This caused the program to terminate when an error was
> encountered.
>
> 2. I believe our code had a very rare, nonreproducible bug (occurring
> perhaps once every few thousand hours of runtime) where it passed column
> and/or row values to Petsc that were greater than the size of the matrix.
>
> We changed the error handler, and we also put in a special error check
> that validates the column/row assignments and aborts the timestep (and
> then reruns it with a smaller dt) if they are out of range. Having done
> that, we've run for over a month and have not seen the problem reproduce.
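>
> For the record, here is a minimal sketch of the kind of guard we added
> (illustrative only - CheckAndSetValues, badInput, and the insert mode are
> made-up for this example, not our actual code):
>
> #include <petscmat.h>
>
> /* Validate row/column indices against the global matrix size before
>    they ever reach Petsc; flag bad input so the caller can discard the
>    timestep and rerun it with a smaller dt. */
> PetscErrorCode CheckAndSetValues(Mat A, PetscInt row, PetscInt n,
>                                  const PetscInt cols[],
>                                  const PetscScalar vals[],
>                                  PetscBool *badInput)
> {
>   PetscInt       M, N, k;
>   PetscErrorCode ierr;
>
>   *badInput = PETSC_FALSE;
>   ierr = MatGetSize(A, &M, &N); if (ierr) return ierr;
>   if (row < 0 || row >= M) { *badInput = PETSC_TRUE; return 0; }
>   for (k = 0; k < n; k++) {
>     if (cols[k] < 0 || cols[k] >= N) { *badInput = PETSC_TRUE; return 0; }
>   }
>   return MatSetValues(A, 1, &row, n, cols, vals, INSERT_VALUES);
> }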
>
> Two ideas that might be helpful to avoid this problem in the future:
>
> 1. When Petsc aborts with an error because
> PetscPushErrorHandler(PetscAbortErrorHandler, NULL) is set, it might be
> helpful to add to the error message that is printed to cout: "Petsc is
> aborting the program because the Petsc Error Handler has been set to abort
> on errors. This can be changed by modifying the option passed into the
> error handler." Otherwise, it may be unclear to a user why Petsc is
> aborting (if abort-on-error is set, but they don't realize it). We had
> written error checks to catch errors passed out of Petsc and handle them
> gracefully, but we had overlooked the error handler setting, and so we were
> confused about why this error was causing the entire program to terminate.
> A bit more explanation in the abort message could help avoid that kind of
> user confusion. I had thought it was a hard crash inside Petsc, since I was
> confused about why the function wasn't returning.
>
> 2. When the "greater than largest key allowed" error is encountered, it
> might be helpful to add to the warning message printed to users to
> additionally say: "This error may occur if the column and row values passed
> into Petsc are within the range from 0 to size of matrix." Not positive,
> but I am about 90% sure that is why we were hitting the error.
>
> Thank you again for your help. I really appreciate your rapid responses
> and help!
>
> Mark
>
>
>
>
>
> On Sat, Sep 26, 2020 at 10:53 PM Mark McClure <mark at resfrac.com> wrote:
>
>> Ok, I think we've made some progress.
>>
>> We were already calling every Petsc function like this: ierr =
>> SomePetscFunction(...); if (ierr != 0) {handle the error}. We do that on
>> every single call made to Petsc, just to be careful. This is what was
>> confusing to me: why was the program terminating from within Petsc rather
>> than returning with an error? We'd written our code so that if Petsc did
>> return with an error, we'd discard the full timestep, destroy all the
>> Petsc data structures, and redo everything with a smaller dt. So if Petsc
>> did hit this error very rarely, we might be able to recover gracefully.
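>>
>> In sketch form, our existing handling looks like this (variable names
>> illustrative, not our actual code):
>>
>>   ierr = SNESSolve(snes, NULL, x);
>>   if (ierr) {
>>     /* Discard the full timestep: destroy all the Petsc data structures */
>>     SNESDestroy(&snes); MatDestroy(&A); VecDestroy(&x);
>>     dt *= 0.5; /* ...and redo everything with a smaller dt */
>>     /* (rebuild the Petsc objects and retry the step) */
>>   }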
>>
>> It does not appear to be seg faulting. So it seemed that the program was
>> being terminated intentionally from within Petsc, which was puzzling, and
>> is why I was asking about that in my previous email.
>>
>> So - Chris made a great find. Turns out that right after PetscInitialize
>> in our main.cpp, we had the line:
>>
>> PetscPushErrorHandler(PetscAbortErrorHandler, NULL);
>>
>> That line was telling Petsc to call MPI_Abort if there was an error. I
>> probably put it into the code years ago and forgot it was there. So, as
>> Barry said, if we change the PetscErrorHandler option to ignore, then at
>> least we can avoid the program aborting on the error, and hopefully be
>> able to recover with our existing code logic.
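>>
>> Concretely, the fix is a one-line change, something like this sketch:
>>
>>   /* Old: any Petsc error triggers MPI_Abort */
>>   PetscPushErrorHandler(PetscAbortErrorHandler, NULL);
>>
>>   /* New (per Barry's suggestion below): do not abort; let the error
>>      code come back to the caller so our own (ierr != 0) checks can
>>      handle it. Simply deleting the push also works, since the default
>>      traceback handler returns the error code as well. */
>>   PetscPushErrorHandler(PetscIgnoreErrorHandler, NULL);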
>>
>> Also, we may have found a clue on the root cause of the error. I had
>> thought that we were checking all of our inputs to Petsc for issues such
>> as out-of-range index values. But I went back and saw that, due to a
>> versioning mistake, one particular error check on our inputs was being
>> removed from production builds by a preprocessor definition. That means
>> the problem wouldn't be caught in our production builds, which means it
>> is possible that bad inputs could have been passed into Petsc. I don't
>> know for sure - but it is plausible. The missing error check did the
>> following: it checked whether the fourth entry to MatSetValues ("n", for
>> the number of nonzero values in the row) (
>> https://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Mat/MatSetValues.html)
>> was equal to the sum of the number of diagonal and off diagonal values that
>> we had specified in our previous call to MatMPIAIJSetPreallocation.
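>>
>> In sketch form, the check that got compiled out was essentially this
>> (d_nnz, o_nnz, and localRow are illustrative names; d_nnz and o_nnz are
>> the per-row counts we computed for preallocation):
>>
>>   ierr = MatMPIAIJSetPreallocation(A, 0, d_nnz, 0, o_nnz);
>>   /* ... later, before each MatSetValues call on a local row ... */
>>   if (n != d_nnz[localRow] + o_nnz[localRow]) {
>>     /* bad input: skip the insert and discard the timestep */
>>   }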
>>
>> So that is at least a theory for what was happening. The theory would be:
>> very rarely, due to a bug in our code, we were running MatSetValues with
>> "n" set to a value not equal to the number of nonzero values promised in
>> the call to MatMPIAIJSetPreallocation. Maybe this led to the Petsc "key
>> 7556 is greater than largest key allowed 5693" error message, and then our
>> setting of 'PetscAbortErrorHandler' was causing the program to abort.
>>
>> Mark
>>
>>
>>
>>
>>
>>
>>
>> On Sat, Sep 26, 2020 at 9:51 PM Barry Smith <bsmith at petsc.dev> wrote:
>>
>>>
>>>
>>> On Sep 26, 2020, at 5:58 PM, Junchao Zhang <junchao.zhang at gmail.com>
>>> wrote:
>>>
>>>
>>>
>>> On Sat, Sep 26, 2020 at 5:44 PM Mark Adams <mfadams at lbl.gov> wrote:
>>>
>>>>
>>>>
>>>> On Sat, Sep 26, 2020 at 1:07 PM Matthew Knepley <knepley at gmail.com>
>>>> wrote:
>>>>
>>>>> On Sat, Sep 26, 2020 at 11:17 AM Mark McClure <mark at resfrac.com>
>>>>> wrote:
>>>>>
>>>>>> Thank you, all for the explanations.
>>>>>>
>>>>>> Following Matt's suggestion, we'll use -g (and not use
>>>>>> --with-debugging=0) for all future builds we send to users, so that in
>>>>>> the future we can provide better information.
>>>>>>
>>>>>> Second, Chris is going to boil our function down to a minimal stub and
>>>>>> share it, in case there is some subtle issue with the way the functions
>>>>>> are being called.
>>>>>>
>>>>>> Third, I have a question/request - Petsc is, in fact, detecting an
>>>>>> error. As far as I can tell, this is not an uncontrolled 'seg fault'. It
>>>>>> seems to me that Petsc could choose to return from the function when it
>>>>>> detects this error, returning an error code rather than dumping core and
>>>>>> terminating the program. After each Petsc call, we check for Petsc error
>>>>>> codes. If Petsc returns an error - that's fine - we use a direct solver
>>>>>> as a backup, and the simulation continues. So - I am not sure whether
>>>>>> this is feasible - but if Petsc could return with an error message,
>>>>>> rather than dumping core and terminating the program, that would
>>>>>> effectively resolve the issue for us. Would this change be possible?
>>>>>>
>>>>>
>>>>> At some level, I think it is currently doing what you want. CHKERRQ()
>>>>> simply returns an error code from that function call, after printing an
>>>>> error message. Suppressing the message is harder, I think,
>>>>>
>>>>
>>>> He does not need this.
>>>>
>>>>
>>>>> but for now, if you know what function call is causing the error, you
>>>>> can just catch the (ierr != 0) yourself instead of using CHKERRQ.
>>>>>
>>>>
>>>> This is what I suggested earlier but maybe I was not clear enough.
>>>>
>>>> Your code calls something like
>>>>
>>>> ierr = SNESSolve(....); CHKERRQ(ierr);
>>>>
>>>> You can replace this with:
>>>>
>>>>  ierr = SNESSolve(....);
>>>>  if (ierr) {
>>>>
>>> How do we deal with the CHKERRQ(ierr); calls inside SNESSolve()?
>>>
>>>
>>>
>>>    PetscPushErrorHandler(PetscIgnoreErrorHandler,NULL);
>>>
>>>    But the problem in this code's runs appears to be due to corrupt data;
>>> why and how it gets corrupted is not known. Continuing with an alternative
>>> solver because a solver failed for numerical or algorithmic reasons is
>>> generally fine, but continuing when there is corrupted data is always
>>> iffy, because one doesn't know how far the corruption has spread.
>>> SNESDestroy(&snes); SNESCreate(comm, &snes); will likely clean out any
>>> potentially corrupted data, but if the corruption got into the mesh data
>>> structures it will still be there.
>>>
>>>    A very sophisticated library code would, when it detects this type of
>>> corruption, sweep through all the data structures looking for any
>>> indication of corruption, to help determine the cause and even fix the
>>> problem. We don't have this code in place, though we could add some,
>>> because generally we rely on valgrind or -malloc_debug to detect such
>>> corruption. The problem is that valgrind and -malloc_debug don't fit well
>>> in a production environment. Handling corruption that comes up in
>>> production but not in testing is difficult.
>>>
>>>  Barry
>>>
>>>
>>>
>>>
>>>     ....
>>>>  }
>>>>
>>>> I suggested something earlier to do here. Maybe call KSPView. You could
>>>> even destroy the solver and start the solver from scratch and see if that
>>>> works.
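>>>>
>>>> Something like this sketch, for example (snes being your existing
>>>> solver object):
>>>>
>>>>   KSP ksp;
>>>>   SNESGetKSP(snes, &ksp);
>>>>   KSPView(ksp, PETSC_VIEWER_STDOUT_WORLD); /* record the solver state */
>>>>   SNESDestroy(&snes);                      /* then rebuild from scratch */
>>>>   SNESCreate(PETSC_COMM_WORLD, &snes);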
>>>>
>>>> Mark
>>>>
>>>>
>>>>> The drawback here is that we might not have cleaned up
>>>>> all the state so that restarting makes sense. It should be possible to
>>>>> just kill the solve, reset the solver, and retry, although it is not clear
>>>>> to me at first glance if MPI will be in an okay state.
>>>>>
>>>>>
>>>

