[petsc-dev] handling user domain errors

Dmitry Karpeyev dkarpeev at gmail.com
Fri May 1 22:52:29 CDT 2015


Here's the first crack at it:
https://bitbucket.org/petsc/petsc/branch/karpeev/ksp-diverged-on-matmult-nanorinf
Messier than I had expected (GMRES only for now).
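
To make the idea concrete, here is a minimal sketch of "diverge on NaN/Inf coming out of the operator application". It is not the code on the branch: it assumes a plain Richardson iteration instead of GMRES, and every name in it (checked_matmult, richardson, the reason labels, the two toy operators) is hypothetical and not PETSc API.

```python
import math

# Hypothetical reason labels, for illustration only.
CONVERGED = "converged"
DIVERGED_ITS = "diverged_max_it"
DIVERGED_MATMULT_NANORINF = "diverged_matmult_nanorinf"


def checked_matmult(apply_A, x):
    """Apply the user operator and report whether the result is all finite."""
    y = apply_A(x)
    ok = all(math.isfinite(v) for v in y)
    return y, ok


def richardson(apply_A, b, omega=0.5, rtol=1e-10, max_it=200):
    """Richardson iteration that stops as soon as the operator returns NaN/Inf."""
    x = [0.0] * len(b)
    bnorm = math.sqrt(sum(v * v for v in b))
    for it in range(max_it):
        Ax, ok = checked_matmult(apply_A, x)
        if not ok:
            # The code path tells us *where* the bad value came from.
            return x, DIVERGED_MATMULT_NANORINF, it
        r = [bi - ai for bi, ai in zip(b, Ax)]
        rnorm = math.sqrt(sum(v * v for v in r))
        if rnorm <= rtol * bnorm:
            return x, CONVERGED, it
        x = [xi + omega * ri for xi, ri in zip(x, r)]
    return x, DIVERGED_ITS, max_it


def apply_good(x):
    """Toy operator A = diag(1, 2)."""
    return [x[0], 2.0 * x[1]]


def apply_bad(x):
    """Same operator, but the 'user code' signals a domain error with NaN."""
    first = math.nan if x[0] > 0.4 else x[0]
    return [first, 2.0 * x[1]]


b = [1.0, 2.0]
x_good, reason_good, its_good = richardson(apply_good, b)
x_bad, reason_bad, its_bad = richardson(apply_bad, b)
```

With the well-behaved operator the loop converges to roughly [1, 1]. With the second one the user code returns NaN once the iterate leaves its domain, and the solver reports the NaN-specific reason instead of erroring out.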

On Fri, May 1, 2015 at 8:06 PM Dmitry Karpeyev <dkarpeev at gmail.com> wrote:

> On Fri, May 1, 2015 at 7:32 PM Barry Smith <bsmith at mcs.anl.gov> wrote:
>
>>
>> > On May 1, 2015, at 6:43 PM, Jed Brown <jed at jedbrown.org> wrote:
>> >
>> > Barry Smith <bsmith at mcs.anl.gov> writes:
>> >>   1) This simplifies the needed code since we won't need to put
>> >>   checks all over the place on returns about failure nor do we need
>> >>   to worry about propagating errors from one process to another
>> >>   (since the NaN/Inf gets propagated by the MPI_Allreduce()).
>> >
>> > My concern is that -fp_trap will become a lot less useful.
>>
>>   I agree there is a tradeoff; but under "normal" circumstances where
>> there are no NaN or Inf around (which I think is most of the time) -fp_trap
>> will be just as useful as now. For the other cases the user will have to
>> have some idea where (and when) in the code to turn on the trapping to
>> catch the "true" problems.
>>
>>    Barry
>>
>>   The only other way I see to do it is to carry a validity flag around with
>> each vector and reduce that flag in all the vector reductions; but this
>> alone is not enough. We would also need propagation code for things like a
>> zero pivot, for example setting a validity flag in the Mat factor (saying
>> the factor is not valid) and propagating those flags up. We get all these
>> things "for free" with the Inf/NaN approach.
>>
> There is an additional benefit to the NaN approach: a validity flag would have
> to be cleared by the caller to avoid "false positives" on subsequent calls.
> That's an opportunity for bugs.  With NaN the "error condition" (i.e., the NaN
> entry) gets cleared automatically by a subsequent successful vector operation.
>
>
> What exactly caused the NaN would have to be signaled "out-of-band" as the
> saying goes. One way to "signal" it is by the code path that led to the
> error condition: that's why calling through KSP_MatMult() is useful.  It's
> not ideal, but covers the cases of immediate interest.
> Dmitry.
>
>>
>> >
>>
>>
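
The propagation argument in the quoted discussion can be illustrated without MPI. The sketch below simulates the ranks as list entries; allreduce_sum is a hypothetical stand-in for a sum-type MPI_Allreduce(), and global_dot is an assumed helper, not a PETSc routine. It relies only on IEEE arithmetic: a sum that includes a NaN is NaN.

```python
import math


def allreduce_sum(partials):
    """Stand-in for a sum MPI_Allreduce(): every 'rank' receives the same total."""
    total = 0.0
    for p in partials:
        total += p
    return [total] * len(partials)


def local_dot(x, y):
    """Local part of a dot product on one simulated rank."""
    return sum(a * b for a, b in zip(x, y))


def global_dot(x_parts, y_parts):
    """Dot product over all simulated ranks; returns one value per rank."""
    partials = [local_dot(x, y) for x, y in zip(x_parts, y_parts)]
    return allreduce_sum(partials)


# Only the middle rank holds a NaN entry.
x_parts = [[1.0, 2.0], [math.nan, 4.0], [5.0, 6.0]]
y_parts = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
bad = global_dot(x_parts, y_parts)

# Overwriting the entry with a valid value clears the "error condition";
# there is no separate flag that a caller could forget to reset.
x_parts[1][0] = 3.0
good = global_dot(x_parts, y_parts)
```

After the first reduction every simulated rank sees NaN, although only one of them produced it, so all ranks can take the same branch without extra communication. After the entry is overwritten, the next reduction is finite again on every rank.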