[petsc-dev] [petsc-users] SNESSetFunctionDomainError

Barry Smith bsmith at mcs.anl.gov
Thu Feb 19 14:41:09 CST 2015


> On Feb 19, 2015, at 2:06 PM, Dmitry Karpeyev <karpeev at mcs.anl.gov> wrote:
> 
> 
> 
> On Thu Feb 19 2015 at 12:59:12 PM Barry Smith <bsmith at mcs.anl.gov> wrote:
> 
> > On Feb 19, 2015, at 1:56 PM, Dmitry Karpeyev <karpeev at mcs.anl.gov> wrote:
> >
> >
> >
> > On Thu Feb 19 2015 at 12:41:59 PM Barry Smith <bsmith at mcs.anl.gov> wrote:
> >
> > Yeah, that sounds like a good fix, except for this: we have to make sure all ranks diverge with this failure so that the user can retry the solve, if necessary.
> 
>    This is the business of the person calling MatSetFailure(). We can require that this routine be collective.
> Yes, but this way any "resilient" code is required to carry out an extra reduction every KSP iteration, and we are not giving them an opportunity to coalesce it with anything else. 

   For each Krylov method the "point" in the code where the coalescence can take place is different. With methods like pipelined GMRES, the point where the coalescence can be done is AFTER the point where the results of the (failed) multiply are used, so coding can be tricky.  For now I won't worry about coalescence and would just require MatSetFailure() to be collective. We need more experience before we try to be fancy.
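
   To make the trade-off above concrete, here is a minimal sketch in plain C (not PETSc code, and no MPI) of the "piggyback on the norm" idea from the quoted discussion: a rank that fails contributes NaN to the sum-of-squares reduction the solver performs anyway, so every rank sees the failure without a separate reduction. All names here (LocalResult, local_sum_of_squares, simulated_allreduce_sum, global_norm_squared, is_failure) are hypothetical, and the loop over "ranks" only stands in for a real sum reduction such as MPI_Allreduce with MPI_SUM.

```c
#include <assert.h>
#include <math.h>   /* NAN macro only; no libm functions are called */
#include <stddef.h>

#define MAX_RANKS 16 /* arbitrary cap for this sketch */

/* Hypothetical stand-in for one rank's share of a MatMult result. */
typedef struct {
  const double *values; /* locally owned entries */
  size_t        n;      /* number of locally owned entries */
  int           failed; /* nonzero if the local evaluation produced no valid output */
} LocalResult;

/* Local contribution to the squared 2-norm; a failed rank contributes NaN. */
double local_sum_of_squares(const LocalResult *r)
{
  double sum = 0.0;
  size_t i;
  if (r->failed) return NAN;
  for (i = 0; i < r->n; i++) sum += r->values[i] * r->values[i];
  return sum;
}

/* Stand-in for the sum reduction the norm computation already needs. */
double simulated_allreduce_sum(const double *contrib, size_t nranks)
{
  double total = 0.0;
  size_t i;
  for (i = 0; i < nranks; i++) total += contrib[i];
  return total;
}

/* Squared global norm (a solver would take the square root);
   the result is NaN if any rank failed, since NaN + x is NaN. */
double global_norm_squared(const LocalResult *ranks, size_t nranks)
{
  double contrib[MAX_RANKS];
  size_t i;
  assert(nranks <= MAX_RANKS);
  for (i = 0; i < nranks; i++) contrib[i] = local_sum_of_squares(&ranks[i]);
  return simulated_allreduce_sum(contrib, nranks);
}

/* NaN is the only value that compares unequal to itself
   (assumes the compiler is not run with fast-math style flags). */
int is_failure(double norm_squared)
{
  return norm_squared != norm_squared;
}
```

   Every rank receives the same reduced value, so every rank would reach the same "diverged" decision. A collective MatSetFailure() would instead need its own reduction of the failure flags, which is simpler to place correctly but costs the extra reduction per iteration mentioned above.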

   If you do any coding use my branch barry/use-ksp_matmult_pcapply-in-ksp-methods

Barry

> 
> In any event, I can put this MatSetFailure()/MatSetFailureCollective() in.
> Dmitry. 
> 
>  Barry
> 
> > That would require an extra reduction every KSP iteration.
> > With Inf or NaN we could piggyback on the norm computation.
> >
> > Dmitry.
> >
> >    Barry
> >
> >
> >
> >
> > > On Feb 19, 2015, at 9:33 AM, Dmitry Karpeyev <karpeev at mcs.anl.gov> wrote:
> > >
> > > I wanted to revive this thread and move it to petsc-dev. This problem seems to be harder than I realized.
> > >
> > > Suppose MatMult inside KSPSolve() inside SNESSolve() cannot compute a valid output vector.
> > > For example, it's a MatMFFD and as part of its function evaluation it has to evaluate an implicitly-defined
> > > constitutive model (e.g., solve an equation of state) and this inner solve diverges
> > > (e.g., the time step is too big).  I want to be able to abort the linear
> > > solve and the nonlinear solve, return a suitable "converged" reason and let the user retry, maybe with a
> > > different timestep size.  This is for a hand-rolled time stepper, but TS would face similar issues.
> > >
> > > Based on the previous thread here http://lists.mcs.anl.gov/pipermail/petsc-users/2014-August/022597.html
> > > I tried marking the result of MatMult as "invalid" and let it propagate up to KSPSolve() where it can be handled.
> > > This quickly gets out of control, since the invalid Vec isn't returned to the KSP immediately.  It could be a work
> > > vector, which is fed into PCApply() along various code paths, depending on the side of the preconditioner, whether it's a
> > > transpose solve, etc.  Each of these transformations (e.g., PCApply()) would then have to check the validity of
> > > the input argument, clear its error condition and set it on the output argument, etc.  Very error-prone and fragile.
> > > Not to mention the large amount of code to sift through.
> > >
> > > This is a general problem of exception handling -- we want to "unwind" the stack to the point where the problem should
> > > be handled, but there doesn't seem to be a good way to do it.  We also want to be able to clear all of the error conditions
> > > on the way up (e.g., mark vectors as valid again, but not too early), otherwise we leave the solver in an invalid state.
> > >
> > >
> > > Instead of passing an exception condition up the stack I could try storing that condition in one of the more globally-visible
> > > objects (e.g., the Mat), but if the error occurs inside the evaluation of the residual that's being differenced, it doesn't really
> > > have access to the Mat.  This probably raises various thread safety issues as well.
> > >
> > > Using SNESSetFunctionDomainError() doesn't seem to be a solution: a MatMFFD created with MatCreateSNESMF()
> > > has a pointer to SNES, but the function evaluation code actually has no clue about that. More generally, I don't
> > > know whether we want to wait for the linear solve to complete before handling this exception: it is unnecessary,
> > > it might be an expensive linear solve and the result of such a KSPSolve() is probably undefined and might blow up in
> > > unexpected ways.  I suppose if there is a way to get a hold of SNES, each subsequent MatMult_MFFD has to check
> > > whether the domain error is set and return early in that case?  We would still have to wait for the linear solve to grind
> > > through the rest of its iterations.  I don't know, however, if there is a good way to guarantee that the linear solver will get
> > > through this quickly and without unintended consequences. Should MatMFFD also get a hold of the KSP and set a flag
> > > there to abort?  I still don't know what the intervening code (e.g., the various PCApply()) will do before the KSP has a
> > > chance to deal with this.
> > >
> > > I'm now thinking that setting some vector entries to NaN might be a good solution: I hope this NaN will propagate all the
> > > way up through the subsequent arithmetic operations (does IEEE floating-point arithmetic guarantee this?), and this "error
> > > condition" gets automatically cleared the next time the vector is recomputed, since its values are reset.  Finally, I want
> > > this exception to be detected globally but without incurring an extra reduction every time the residual is evaluated,
> > > and NaN will show up in the norm that (most) KSPs would compute anyway.  That way KSP could diverge with a
> > > KSP_DIVERGED_NAN or a similar reason and the user would have an option to retry.  The problem with this approach
> > > is that VecValidEntries() in MatMult() and PCApply() will throw an error before this can work, so I'm trying to think about
> > > good ways of turning it off.  Any ideas about how to do this?
> > >
> > > Incidentally, I realized that I don't understand how SNESFunctionDomainError can be handled gracefully in the current
> > > setup: it's not set or checked collectively, so there isn't a good way to abort and retry across the whole comm, is there?
> > >
> > > Dmitry.
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > On Sun Aug 31 2014 at 10:12:53 PM Jed Brown <jed at jedbrown.org> wrote:
> > > Dmitry Karpeyev <karpeev at mcs.anl.gov> writes:
> > >
> > > > Handling this at the KSP level (I actually think the Mat level is more
> > > > appropriate, since the operator, not the solver, knows its domain),
> > >
> > > We are dynamically discovering the domain, but I don't think it's
> > > appropriate for Mat to refuse to evaluate any more matrix actions until
> > > some action is taken at the MatMFFD/SNESMF level.  Marking the Vec
> > > invalid is fine, but some action needs to be taken and if Derek wants
> > > the SNES to skip further evaluations, we need to propagate the
> > > information up the stack somehow.
> >
> 



