[petsc-dev] handling user domain errors

Mon May 4 23:18:31 CDT 2015

> On May 4, 2015, at 10:54 PM, Dmitry Karpeyev <dkarpeev at gmail.com> wrote:
> 
> 
> 
> On Mon, May 4, 2015 at 6:20 PM Barry Smith <bsmith at mcs.anl.gov> wrote:
> 
>   My first reaction to this was "man that is ugly and cumbersome, I can do it much cleaner than that". It turns out it isn't as simple as I thought, but with a couple of macros I think I've incorporated much of what is needed in
> 
> https://bitbucket.org/petsc/petsc/pull-request/315/propagating-solver-errors-instead-of/diff
> 
> Some work still needs to be done on getting the most appropriate SNES converged reason set. In fact, one could argue that passing the converged reason up as a single enum may not be the best model, since there may be more information that one wishes to convey, such as a function domain error that happened while differencing the function with coloring to compute the Jacobian.
> Are you arguing for a more full-fledged exception handling?

  No. Actually, more full-fledged exception handling has to deal with the parallel collective issues, which is tough.

> Note that you are essentially having to insert various custom "exception condition" checks (e.g., SNESCheckKSPSolve(), if(ksp->reason) break; KSPCheckDot(), etc.) along the whole call path on which an exception might be propagating. This strikes me as brittle and error-prone, not to mention threatening to get rather complex if the number of these exceptions and their combinations starts to grow.

   Propose something better.
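
For reference, the check-on-the-call-path pattern being discussed looks roughly like the sketch below. This is only an illustration in plain C: the `Solver` type, the `SolveReason` values, and the `SOLVER_CHECK_DOT` macro are hypothetical stand-ins and are not the PETSc macros from the pull request. The point it shows is that each reduction needs its own check, and each caller must look at the reason in addition to the return code.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are NOT PETSc's types, enums, or macros. */
typedef enum {
  SOLVE_ITERATING         = 0,
  SOLVE_CONVERGED         = 1,
  SOLVE_DIVERGED_NANORINF = -1,
  SOLVE_DIVERGED_ITS      = -2,
  SOLVE_DIVERGED_INNER    = -3
} SolveReason;

typedef struct {
  SolveReason reason;
  int         its;
} Solver;

/* If a reduction result is NaN or Inf, record the reason and leave the
   enclosing function with a normal (zero) return code, not an error. */
#define SOLVER_CHECK_DOT(s, val) \
  do { \
    if (!isfinite(val)) { \
      (s)->reason = SOLVE_DIVERGED_NANORINF; \
      return 0; \
    } \
  } while (0)

static double dot(const double *x, const double *y, size_t n)
{
  double sum = 0.0;
  for (size_t i = 0; i < n; i++) sum += x[i] * y[i];
  return sum;
}

/* Toy inner "solver": halves the residual r until its squared norm drops
   below tol*tol. Every reduction inside the loop needs its own check. */
int toy_solve(Solver *s, double *r, size_t n, int max_its, double tol)
{
  assert(s != NULL && r != NULL);
  s->reason = SOLVE_ITERATING;
  for (s->its = 0; s->its < max_its; s->its++) {
    double rr = dot(r, r, n);
    SOLVER_CHECK_DOT(s, rr); /* early exit with the reason set */
    if (rr < tol * tol) {
      s->reason = SOLVE_CONVERGED;
      return 0;
    }
    for (size_t i = 0; i < n; i++) r[i] *= 0.5;
  }
  s->reason = SOLVE_DIVERGED_ITS;
  return 0;
}

/* Toy outer solver: must inspect the inner reason, not just the return code,
   and translate it into its own reason for its caller. */
int toy_outer(Solver *inner, SolveReason *outer_reason, double *r, size_t n)
{
  int err = toy_solve(inner, r, n, 100, 1e-3);
  if (err) return err;
  *outer_reason = (inner->reason < 0) ? SOLVE_DIVERGED_INNER : SOLVE_CONVERGED;
  return 0;
}
```

Note that a NaN in the residual makes both functions return 0; the failure is visible only through the reason fields, which is why every level of the call path needs a check of this kind.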

> 
>   Anyways in particular look at the test example ex69.c
> Looks pretty good.  Thanks! 
> 
> 
>   Barry
> 
> > On May 1, 2015, at 10:52 PM, Dmitry Karpeyev <dkarpeev at gmail.com> wrote:
> >
> > Here's the first crack at it: https://bitbucket.org/petsc/petsc/branch/karpeev/ksp-diverged-on-matmult-nanorinf.
> > Messier than I had expected (GMRES only for now).
> >
> > On Fri, May 1, 2015 at 8:06 PM Dmitry Karpeyev <dkarpeev at gmail.com> wrote:
> > On Fri, May 1, 2015 at 7:32 PM Barry Smith <bsmith at mcs.anl.gov> wrote:
> >
> > > On May 1, 2015, at 6:43 PM, Jed Brown <jed at jedbrown.org> wrote:
> > >
> > > Barry Smith <bsmith at mcs.anl.gov> writes:
> > >>   1) This simplifies the needed code since we won't need to put
> > >>   checks all over the place on returns about failure nor do we need
> > >>   to worry about propagating errors from one process to another
> > >>   (since the NaN/Inf get moved by the MPI_Allreduce()).
> > >
> > > My concern is that -fp_trap will become a lot less useful.
> >
> >   I agree there is a tradeoff, but under "normal" circumstances where there are no NaN or Inf around (which I think is most of the time) -fp_trap will be just as useful as now. For the other cases the user will have to have some idea where (and when) in the code to turn on the trapping to catch the "true" problems.
> >
> >    Barry
> >
> >   The only other way I see to do it is to carry a validity flag around with each vector and reduce that flag in all the vector reductions. But this alone is not enough: we would also need some propagation code for things like a zero pivot, for example setting a validity flag in the Mat factor (saying the factor is not valid) and propagating those flags up. We get all these things "for free" with the Inf/NaN approach.
> > There is an additional benefit to the NaN approach: a validity flag would have to be cleared by the caller to avoid "false positives" on subsequent calls. That's an opportunity for bugs. With NaN the "error condition" (i.e., the NaN entry) gets cleared automatically by a subsequent successful vector operation.
> >
> >
> > What exactly caused the NaN would have to be signaled "out-of-band" as the saying goes. One way to "signal" it is by the code path that led to the error condition: that's why calling through KSP_MatMult() is useful.  It's not ideal, but covers the cases of immediate interest.
> > Dmitry.
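
The two designs compared above can be sketched in plain C as follows. This is a toy model, not PETSc or MPI code: `allreduce_sum` only simulates what a sum all-reduce would hand back to every rank, and the `FlaggedPartial` and `FlaggedResult` types are hypothetical. It shows that a NaN on one rank reaches the global sum through ordinary floating-point arithmetic, while the flag design needs a second reduction carried alongside every value.

```c
#include <assert.h>
#include <math.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for a sum all-reduce over per-rank partial results:
   every rank would receive this same total. */
double allreduce_sum(const double *local, size_t nranks)
{
  assert(nranks > 0);
  double total = 0.0;
  for (size_t i = 0; i < nranks; i++) total += local[i];
  return total; /* NaN if any rank contributed NaN */
}

/* Hypothetical alternative: each partial result carries a validity flag. */
typedef struct {
  double value; /* this rank's partial result */
  bool   valid; /* has to be reduced alongside the value */
} FlaggedPartial;

typedef struct {
  double total;
  bool   valid;
} FlaggedResult;

FlaggedResult allreduce_flagged(const FlaggedPartial *local, size_t nranks)
{
  assert(nranks > 0);
  FlaggedResult r = {0.0, true};
  for (size_t i = 0; i < nranks; i++) {
    r.total += local[i].value;
    r.valid = r.valid && local[i].valid; /* extra logical-AND reduction */
  }
  return r;
}
```

In the first version the caller tests the result with `isnan()` or `isfinite()` and nothing else has to be communicated. In the second, the total can look perfectly reasonable while `valid` is false, so every consumer has to remember to check, propagate, and later reset the flag.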
> >
> > >
> >
>