[petsc-dev] handling user domain errors

Dmitry Karpeyev dkarpeev at gmail.com
Mon May 4 23:30:17 CDT 2015


On Mon, May 4, 2015 at 11:18 PM Barry Smith <bsmith at mcs.anl.gov> wrote:

>
> > On May 4, 2015, at 10:54 PM, Dmitry Karpeyev <dkarpeev at gmail.com> wrote:
> >
> >
> >
> > On Mon, May 4, 2015 at 6:20 PM Barry Smith <bsmith at mcs.anl.gov> wrote:
> >
> >   My first reaction to this was "man that is ugly and cumbersome, I can
> > do it much cleaner than that"; turns out it isn't as simple as I thought
> > but with a couple of macros I think I've incorporated much of what is
> > needed in
> >
> >
> https://bitbucket.org/petsc/petsc/pull-request/315/propagating-solver-errors-instead-of/diff
> >
> > Some work needs to be done on getting the most appropriate SNES
> > converged reason set. In fact, one could argue that trying to pass the
> > converged reason up as a single enum type may not be the best model, since
> > there may be more information that one wishes to convey, such as a function
> > domain error that happened while differencing the function with coloring to
> > compute the Jacobian.
> > Are you arguing for more full-fledged exception handling?
>
>   No. Actually, more full-fledged exception handling has to handle the
> parallel collective issues, which is tough.
>
Yes, we'd have to ensure that every rank raises the exception.
That's why a NaN/Inf in a norm or dot product is so attractive: the reduction
delivers it to every rank.
Maybe if we allowed only reductions to raise exceptions?
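To illustrate the point about reductions, here is a minimal sketch in plain
Python (not PETSc code). `allreduce_sum` is a hypothetical stand-in for
MPI_Allreduce() with a sum, and the "ranks" are just list entries. It shows
one rank's NaN reaching every rank through the norm, so all ranks can take the
same branch without any extra communication.

```python
import math

def allreduce_sum(local_values):
    """Hypothetical stand-in for a sum Allreduce: every rank gets the same total."""
    total = sum(local_values)
    return [total] * len(local_values)

def norm2_all_ranks(local_chunks):
    """Each 'rank' sums the squares of its chunk; the reduction combines them."""
    local_sums = [sum(x * x for x in chunk) for chunk in local_chunks]
    return [math.sqrt(s) for s in allreduce_sum(local_sums)]

# Rank 1 hit a domain error and marked its piece with NaN; ranks 0 and 2 are fine.
chunks = [[3.0, 4.0], [float("nan"), 1.0], [2.0]]
norms = norm2_all_ranks(chunks)

# Every rank sees NaN in its copy of the norm, including the healthy ranks.
all_ranks_see_failure = all(math.isnan(n) for n in norms)
```

Because the check happens on the result of the reduction, no rank can decide
differently from the others, which is the collective property that explicit
exceptions would have to provide by other means.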

>
> > Note that you are essentially having to insert various custom "exception
> > condition" checks (e.g., SNESCheckKSPSolve(), if(ksp->reason) break;
> > KSPCheckDot(), etc.) on the whole call path, along which an exception might
> > be propagating.  This strikes me as brittle and error-prone, not to mention
> > threatening to get rather complex if the number of these exceptions and
> > their combinations starts to grow.
>
>    Propose something better
>
> >
> >   Anyway, in particular look at the test example ex69.c
> > Looks pretty good.  Thanks!
> >
> >
> >   Barry
> >
> > > On May 1, 2015, at 10:52 PM, Dmitry Karpeyev <dkarpeev at gmail.com> wrote:
> > >
> > > Here's the first crack at it:
> > > https://bitbucket.org/petsc/petsc/branch/karpeev/ksp-diverged-on-matmult-nanorinf
> > > Messier than I had expected (GMRES only for now).
> > >
> > > On Fri, May 1, 2015 at 8:06 PM Dmitry Karpeyev <dkarpeev at gmail.com> wrote:
> > > On Fri, May 1, 2015 at 7:32 PM Barry Smith <bsmith at mcs.anl.gov> wrote:
> > >
> > > > On May 1, 2015, at 6:43 PM, Jed Brown <jed at jedbrown.org> wrote:
> > > >
> > > > Barry Smith <bsmith at mcs.anl.gov> writes:
> > > > >>   1) This simplifies the needed code since we won't need to put
> > > > >>   checks for failure on returns all over the place, nor do we need
> > > > >>   to worry about propagating errors from one process to another
> > > > >>   (since the NaN/Inf gets moved by the MPI_Allreduce()).
> > > >
> > > > My concern is that -fp_trap will become a lot less useful.
> > >
> > >   I agree there is a tradeoff; but under "normal" circumstances where
> > > there are no NaN or Inf around (which I think is most of the time) -fp_trap
> > > will be just as useful as now. For the other cases the user will have to
> > > have some idea where (and when) in the code to turn on the trapping to
> > > catch the "true" problems.
> > >
> > >    Barry
> > >
> > >   The only other way I see to do it is to carry a validity flag around
> > > with each vector and reduce that flag in all the vector reductions; but
> > > this alone is not enough: we would also have to have some propagation code
> > > for things like zero pivot, for example setting a validity flag in the Mat
> > > factor (saying the factor is not valid) and propagating up those flags. We
> > > get all these things "for free" with the Inf/NaN approach.
> > > There is an additional benefit to the NaN approach: the validity flag
> > > would have to be cleared by the caller to avoid "false positives" on
> > > subsequent calls. That's an opportunity for bugs.  With NaN the "error
> > > condition" (i.e., the NaN entry) gets cleared automatically by a
> > > subsequent successful vector operation.
> > >
> > >
> > > What exactly caused the NaN would have to be signaled "out-of-band" as
> > > the saying goes. One way to "signal" it is by the code path that led to the
> > > error condition: that's why calling through KSP_MatMult() is useful.  It's
> > > not ideal, but covers the cases of immediate interest.
> > > Dmitry.
> > >
> > > >
> > >
> >
>
>
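For the validity-flag alternative discussed in the quoted messages, here is a
minimal sketch in plain Python (not PETSc code; `FlaggedVec` and its `valid`
field are hypothetical names invented for illustration). It contrasts a sticky
flag that the caller must remember to clear with a NaN entry that disappears
as soon as the data is overwritten by a successful operation.

```python
import math

class FlaggedVec:
    """Hypothetical vector that carries an explicit validity flag."""
    def __init__(self, n):
        self.data = [0.0] * n
        self.valid = True

    def set(self, values):
        self.data = list(values)
        if any(math.isnan(v) for v in values):
            self.valid = False  # sticky: stays False until someone resets it

def norm(data):
    """2-norm; returns NaN if any entry is NaN."""
    return math.sqrt(sum(x * x for x in data))

v = FlaggedVec(2)
v.set([float("nan"), 1.0])  # failed evaluation leaves a NaN entry
v.set([3.0, 4.0])           # later successful evaluation overwrites the data

# The flag still reports an error nobody cleared: a false positive.
stale_flag = not v.valid
# The NaN marker vanished together with the overwritten data.
nan_cleared = not math.isnan(norm(v.data))
```

The flag version needs an explicit reset after every handled failure, while
the NaN version has nothing to reset.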