[petsc-dev] handling user domain errors

Fri May 1 11:26:31 CDT 2015

 This is an attempt to merge, in a simple and consistent way, the handling of user domain errors that Dmitry brought up (for example, in Jacobian-free matrix-vector products) with the handling of failures such as a zero pivot in an LU factorization, which Ed brought up.

  I have a new branch, barry/propagate-pcsetup-failures, that attempts to propagate errors up from the lower levels of KSP solves, PC setups, PC applies, etc. using NaN/Inf. (Note, Dmitry: I am not doing the matrix-free domain error setting that you proposed; that can be added at any time.)

   1) This simplifies the needed code, since we won't need to check for failure on returns all over the place, nor do we need to worry about propagating errors from one process to another (the NaN/Inf gets moved by the MPI_Allreduce()).

    2) I am propagating the KSPSetErrorIfNotConverged() flag down to all the inner solvers, so if the user DOES want an immediate error stop they can get it by simply setting the flag at the highest level of KSP.

    3) Eventually we would like to propagate up not only the fact that an error happened but also information about the type of error. I think we can do this orthogonally to propagating up the FACT that we have an error with the NaN/Inf. In other words, if an error is detected by a NaN/Inf norm or inner product, then eventually the code would be able to query where the problem started, for example a zero pivot inside the coarse grid solve inside a multigrid inside a fieldsplit.
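
   To illustrate points 1) and 2), here is a minimal sketch in plain C, with no PETSc and no MPI. It is not the code in the branch. The "ranks" are simulated by an array, sketch_allreduce_sum() only stands in for the sum reduction that MPI_Allreduce() would perform, and every name prefixed with sketch_ or SKETCH_ is hypothetical. It shows a NaN set on one rank reaching the reduced value that all ranks would see, and a single check after the reduction choosing between a hard error and a "diverged" status depending on the flag.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical status codes; they only mimic the role of a converged reason. */
typedef enum {
  SKETCH_OK = 0,
  SKETCH_DIVERGED_NANORINF = 1,
  SKETCH_HARD_ERROR = 2
} SketchStatus;

/* Stand-in for a sum reduction: every "rank" would receive this same value. */
double sketch_allreduce_sum(const double *local, size_t nranks)
{
  double sum = 0.0;
  assert(nranks > 0);
  for (size_t r = 0; r < nranks; r++) sum += local[r];
  return sum;
}

/* Local contribution of one rank: the sum of squares of its entries. */
double sketch_local_sumsq(const double *x, size_t n)
{
  double s = 0.0;
  for (size_t i = 0; i < n; i++) s += x[i] * x[i];
  return s;
}

/* One check after the reduction replaces per-rank error-code plumbing. */
SketchStatus sketch_check(double reduced, int error_if_not_converged)
{
  if (isnan(reduced) || isinf(reduced)) {
    return error_if_not_converged ? SKETCH_HARD_ERROR : SKETCH_DIVERGED_NANORINF;
  }
  return SKETCH_OK;
}

/* Three simulated ranks with two entries each; poisoned_rank < 0 means no failure. */
SketchStatus sketch_demo(int poisoned_rank, int error_if_not_converged)
{
  double x[3][2] = {{1.0, 2.0}, {3.0, 4.0}, {5.0, 6.0}};
  double local[3];
  /* A failing user function on one rank just writes a NaN into its result. */
  if (poisoned_rank >= 0 && poisoned_rank < 3) x[poisoned_rank][0] = NAN;
  for (size_t r = 0; r < 3; r++) local[r] = sketch_local_sumsq(x[r], 2);
  return sketch_check(sketch_allreduce_sum(local, 3), error_if_not_converged);
}
```

   With no failure, sketch_demo(-1, 0) returns SKETCH_OK. With a NaN on any single rank, the reduced sum is NaN, so the same call returns SKETCH_DIVERGED_NANORINF with the flag off and SKETCH_HARD_ERROR with the flag on.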

   Thoughts?

   Barry

> On Apr 29, 2015, at 10:11 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
> 
> 
>  Indeed, you proposed exactly this. I would be happy if you tried to make a branch off master that used this approach.
> 
>  Barry
> 
>> On Apr 29, 2015, at 9:28 PM, Dmitry Karpeyev <dkarpeev at gmail.com> wrote:
>> 
>> Barry,
>> Sorry, I must have missed this -- I really ought to make a better filter for catching email like this.
>> I think using NaNs is an excellent solution, in fact, I was proposing it a few months ago here :-)
>> http://lists.mcs.anl.gov/pipermail/petsc-dev/2015-February/016958.html
>> It ensures that the error is collective (the norm reduction will ensure every rank gets a NaN), and
>> the "error condition" is cleared automatically on the next MatMult, etc.
>> I'm all for it.
>> Should I put it in?
>> 
>> Dmitry.
>> 
>> On Wed, Apr 29, 2015 at 8:26 PM Barry Smith <bsmith at mcs.anl.gov> wrote:
>> 
>>  Dmitry,
>> 
>>    I haven't heard back from you on this. Any thoughts?
>> 
>>  Barry
>> 
>>> On Apr 20, 2015, at 6:23 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>>> 
>>> 
>>> Dmitry,
>>> 
>>>  Rather than introducing a whole new layer of flags for indicating domain errors in user functions, just do the following.
>>> 
>>>  1) just stick a NaN into the function's result
>>>  2) remove the VecValidValues() at the END of routines like MatMult()
>>>  3) when NaN or Inf pops up in Krylov methods (which will happen within VecNorm() or VecDot(), and thus we get free collective knowledge of the problem even if it happened on only one node), generate the appropriate KSP_DIVERGED_NANORINF. This is already handled sometimes (most of the time?); for example, KSPSolve_CG contains this code:
>>> ierr = VecXDot(Z,R,&beta);CHKERRQ(ierr);         /*  beta <- z'*r       */
>>>   if (PetscIsInfOrNanScalar(beta)) {
>>>     if (ksp->errorifnotconverged) SETERRQ(PetscObjectComm((PetscObject)ksp),PETSC_ERR_NOT_CONVERGED,"KSPSolve has not converged due to Nan or Inf inner product");
>>>     else {
>>>       ksp->reason = KSP_DIVERGED_NANORINF;
>>>       PetscFunctionReturn(0);
>>>     }
>>>   }
>>> 
>>>  4) SNES already handles a KSP that failed to converge, and
>>>  5) TS already handles a SNES that failed to converge, for example by cutting the timestep.
>>> 
>>> Barry
>>> 
>>> 
>> 
>