[petsc-dev] Recover from TS failure

Jed Brown jed at jedbrown.org
Wed Nov 1 17:14:14 CDT 2017


Mark Adams <mfadams at lbl.gov> writes:

> I am trying to recover from a failed time step with this code. But PETSc
> does a hard abort with this code:
>
>  } else {
>      /* do not print error messages since process 0 will print them, sleep
> before aborting so will not accidently kill process 0*/
>      PetscSleep(10.0);
>      abort();
>    }

You're in PetscTraceBackErrorHandler which is way too late.  Back up.
What caused the error?

> Can I get TS to not abort and return?
>
>   do {
>     TSConvergedReason reason;
>     ierr = TSGetTime(ts, &old_time);CHKERRQ(ierr);
>     ierr = TSSolve(ts, u);
>     if (!ierr) break;
>     ierr = TSGetConvergedReason(ts,&reason);CHKERRQ(ierr);
>     if (reason==TS_DIVERGED_NONLINEAR_SOLVE) { /* recover */
>       PetscReal dt;
>       ierr = TSGetTime(ts, &t);CHKERRQ(ierr);
>       if (t == old_time && failed) {
> PetscPrintf(PETSC_COMM_WORLD,"Failed to recover\n");
> break; /* failed */
>       } else {
> old_time = t;
> failed = PETSC_TRUE;
>       }
>       ierr = TSGetTimeStep(ts, &dt);CHKERRQ(ierr);
>       ierr = PetscPrintf(PETSC_COMM_WORLD,"recover from %s, reason=%d
> time=%g, dt=%g --> ",TSConvergedReasons[reason],reason,t,dt);CHKERRQ(ierr);
>       dt = dt/ctx.dt_reduction_factor;
>       ierr = PetscPrintf(PETSC_COMM_WORLD,"%g\n",dt);CHKERRQ(ierr);
>       ierr = TSSetTimeStep(ts, dt);CHKERRQ(ierr);
>       ierr = VecCopy(ctx.u0,u);CHKERRQ(ierr); /* recover state */
>     } else {
>       SETERRQ1(PETSC_COMM_WORLD,PETSC_ERR_ARG_WRONG,"Unhandled error
> %s",TSConvergedReasons[reason]);
>     }
>
> On Wed, Nov 1, 2017 at 2:45 PM, Mark Adams <mfadams at lbl.gov> wrote:
>
>>
>>
>> On Wed, Nov 1, 2017 at 2:34 PM, Jed Brown <jed at jedbrown.org> wrote:
>>
>>> Mark Adams <mfadams at lbl.gov> writes:
>>>
>>> > Oh, Maybe the Jacobian has Nans or Infs even though the last time step
>>> > survived. Maybe it was going crazy. I'll check
>>>
>>> If that is the case you would use TSSetFunctionDomainError().
>>>
>>
>> Ah, very cool. Good to know. That does not seem to be the problem. Not
>> sure what would cause this linear solver error. I'll try different solvers
>> and see if I get any hints,
>> Thanks again,
>>
>>
>>>
>>> > On Wed, Nov 1, 2017 at 2:13 PM, Mark Adams <mfadams at lbl.gov> wrote:
>>> >
>>> >> Yea, I don't understand the linear solve error:
>>> >>
>>> >> -ts_monitor -ts_type beuler -pc_type lu -pc_factor_mat_solver_package
>>> >> mumps -ksp_type preonly -snes_monitor -snes_rtol 1.e-10 -snes_stol
>>> 1.e-10
>>> >> -snes_converged_reason -snes_atol 1.e-18 -snes_converged_reason
>>> >>
>>> >> Maybe mumps?
>>> >>
>>> >> On Wed, Nov 1, 2017 at 2:02 PM, Jed Brown <jed at jedbrown.org> wrote:
>>> >>
>>> >>> Mark Adams <mfadams at lbl.gov> writes:
>>> >>>
>>> >>> > On Wed, Nov 1, 2017 at 1:46 PM, Jed Brown <jed at jedbrown.org> wrote:
>>> >>> >
>>> >>> >> Mark Adams <mfadams at lbl.gov> writes:
>>> >>> >>
>>> >>> >> > I have added some code in a TS post step method to look at the
>>> >>> number of
>>> >>> >> > nonlinear iterations and cut the time step if it took too many
>>> SNES
>>> >>> >> > iterations.  That helped but now I want to go one step further
>>> and
>>> >>> >> recover
>>> >>> >> > from a failed time step (see appended error message).
>>> >>> >> >
>>> >>> >> > How can/should I recover from a SNES failure inside of TS? I
>>> could
>>> >>> just
>>> >>> >> put
>>> >>> >> > TSSolve in a loop and if TS failed because of the solver, then
>>> >>> reduce the
>>> >>> >> > time step and try to recover the state, and try again.
>>> >>> >> >
>>> >>> >> > Thoughts?
>>> >>> >> >
>>> >>> >> > Thanks,
>>> >>> >> > Mark
>>> >>> >> >
>>> >>> >> > Note, this is a direct solver and the linear solver is diverging?
>>> >>> I've
>>> >>> >> > seen the same thing with the maximum its in SNES and failed
>>> >>> linesearch.
>>> >>> >> >
>>> >>> >> > 23 TS dt 0.0025 time 0.1025
>>> >>> >> >     0 SNES Function norm 9.997473571601e+04
>>> >>> >> >     1 SNES Function norm 2.295851692533e-01
>>> >>> >> >   Nonlinear solve did not converge due to DIVERGED_LINEAR_SOLVE
>>> >>> >> iterations 1
>>> >>> >> > [0]PETSC ERROR: --------------------- Error Message
>>> >>> >> > --------------------------------------------------------------
>>> >>> >> > [0]PETSC ERROR:
>>> >>> >> > [0]PETSC ERROR: TSStep has failed due to
>>> DIVERGED_NONLINEAR_SOLVE,
>>> >>> >> increase
>>> >>> >> > -ts_max_snes_failures or make negative to attempt recovery
>>> >>> >>
>>> >>> >> Did you increase -ts_max_snes_failures (TSSetMaxSNESFailures)?
>>> >>> >>
>>> >>> >
>>> >>> > Ah, read the error message :) with a negative value I get this.
>>> >>> >
>>> >>> > TS can not, or is not going to try, to reconstruct my state and
>>> reduce
>>> >>> the
>>> >>> > time step. So I think I need to do this myself.
>>> >>>
>>> >>> I don't understand.  Why is the linear solve diverging
>>> >>> (-ksp_converged_reason)?  What TS are you using?
>>> >>>
>>> >>> > I guess just making loop and testing in the TSConvergedReason, is
>>> the
>>> >>> way
>>> >>> > to go. We already do something like this for mesh adpativity.
>>> >>> >
>>> >>> > Thanks,
>>> >>> >
>>> >>> > 23 TS dt 0.0025 time 0.1025
>>> >>> >     0 SNES Function norm 9.997473571601e+04
>>> >>> >     1 SNES Function norm 2.295851692533e-01
>>> >>> >   Nonlinear solve did not converge due to DIVERGED_LINEAR_SOLVE
>>> >>> iterations 1
>>> >>> >     0 SNES Function norm 9.997473571601e+04
>>> >>> >   Nonlinear solve did not converge due to DIVERGED_LINEAR_SOLVE
>>> >>> iterations 0
>>> >>> >     0 SNES Function norm 9.997473571601e+04
>>> >>> >   Nonlinear solve did not converge due to DIVERGED_LINEAR_SOLVE
>>> >>> iterations 0
>>> >>> >     0 SNES Function norm 9.997473571601e+04
>>> >>> >   Nonlinear solve did not converge due to DIVERGED_LINEAR_SOLVE
>>> >>> iterations 0
>>> >>> >     0 SNES Function norm 9.997473571601e+04
>>> >>> >   Nonlinear solve did not converge due to DIVERGED_LINEAR_SOLVE
>>> >>> iterations 0
>>> >>> >     0 SNES Function norm 9.997473571601e+04
>>> >>> >   Nonlinear solve did not converge due to DIVERGED_LINEAR_SOLVE
>>> >>> iterations 0
>>> >>> >     0 SNES Function norm 9.997473571601e+04
>>> >>> >   Nonlinear solve did not converge due to DIVERGED_LINEAR_SOLVE
>>> >>> iterations 0
>>> >>> >     0 SNES Function norm 9.997473571601e+04
>>> >>> >   Nonlinear solve did not converge due to DIVERGED_LINEAR_SOLVE
>>> >>> iterations 0
>>> >>> >     0 SNES Function norm 9.997473571601e+04
>>> >>> >   Nonlinear solve did not converge due to DIVERGED_LINEAR_SOLVE
>>> >>> iterations 0
>>> >>> >     0 SNES Function norm 9.997473571601e+04
>>> >>> >   Nonlinear solve did not converge due to DIVERGED_LINEAR_SOLVE
>>> >>> iterations 0
>>> >>> >     0 SNES Function norm 9.997473571601e+04
>>> >>> >   Nonlinear solve did not converge due to DIVERGED_LINEAR_SOLVE
>>> >>> iterations 0
>>> >>> > [0]PETSC ERROR: --------------------- Error Message
>>> >>> > --------------------------------------------------------------
>>> >>> > [0]PETSC ERROR:
>>> >>> > [0]PETSC ERROR: TSStep has failed due to DIVERGED_STEP_REJECTED
>>> >>> > [0]PETSC ERROR: See http://www.mcs.anl.gov/petsc/d
>>> ocumentation/faq.html
>>> >>> for
>>> >>> > trouble shooting.
>>> >>> > [0]PETSC ERROR: Petsc Development GIT revision:
>>> v3.8-53-ga2d93b2d43  GIT
>>> >>> > Date: 2017-10-09 12:56:49 -0400
>>> >>> > [0]PETSC ERROR: ./ex48 on a arch-macosx-gnu-O named
>>> MarksMac-5.local by
>>> >>> > markadams Wed Nov  1 13:51:24 2017
>>> >>> > [0]PETSC ERROR: Configure options --with-cc=clang
>>> --with-cc++=clang++
>>> >>> > COPTFLAGS="-O2 -g -mavx2" CXXOPTFLAGS="-O2 -g -mavx2"
>>> FOPTFLAGS="-O2 -g
>>> >>> > -mavx2" --download-mpich=1 --download-parmetis=1 --download-metis=1
>>> >>> > --download-hypre=1 --download-ml=1 --download-triangle=1
>>> >>> > --download-ctetgen=1 --download-p4est=1 --download-fftw
>>> >>> --download-mumps=1
>>> >>> > --download-scalapack=1 --with-x=0 --download-superlu_dist
>>> >>> > --download-superlu --download-ctetgen --with-debugging=0
>>> >>> --download-hdf5=1
>>> >>> > PETSC_ARCH=arch-macosx-gnu-O --download-chaco
>>> --with-viewfromoptions=1
>>> >>> > [0]PETSC ERROR: #1 TSStep() line 4130 in
>>> >>> > /Users/markadams/Codes/petsc/src/ts/interface/ts.c
>>> >>> > [0]PETSC ERROR: #2 TSSolve() line 4374 in
>>> >>> > /Users/markadams/Codes/petsc/src/ts/interface/ts.c
>>> >>> > *********** DIVERGED_STEP_REJECTED ierr=91 reason=-2
>>> >>> > L_2 Diff: 1.32241e+07 at time 0.1025
>>> >>> >
>>> >>> >
>>> >>> >
>>> >>> >
>>> >>> >>
>>> >>> >> > [0]PETSC ERROR: See http://www.mcs.anl.gov/petsc/d
>>> >>> ocumentation/faq.html
>>> >>> >> for
>>> >>> >> > trouble shooting.
>>> >>> >> > [0]PETSC ERROR: Petsc Development GIT revision:
>>> v3.8-53-ga2d93b2d43
>>> >>> GIT
>>> >>> >> > Date: 2017-10-09 12:56:49 -0400
>>> >>> >> > [0]PETSC ERROR: ./ex48 on a arch-macosx-gnu-O named
>>> MarksMac-5.local
>>> >>> by
>>> >>> >> > markadams Wed Nov  1 13:17:15 2017
>>> >>> >> > [0]PETSC ERROR: Configure options --with-cc=clang
>>> --with-cc++=clang++
>>> >>> >> > COPTFLAGS="-O2 -g -mavx2" CXXOPTFLAGS="-O2 -g -mavx2"
>>> FOPTFLAGS="-O2
>>> >>> -g
>>> >>> >> > -mavx2" --download-mpich=1 --download-parmetis=1
>>> --download-metis=1
>>> >>> >> > --download-hypre=1 --download-ml=1 --download-triangle=1
>>> >>> >> > --download-ctetgen=1 --download-p4est=1 --download-fftw
>>> >>> >> --download-mumps=1
>>> >>> >> > --download-scalapack=1 --with-x=0 --download-superlu_dist
>>> >>> >> > --download-superlu --download-ctetgen --with-debugging=0
>>> >>> >> --download-hdf5=1
>>> >>> >> > PETSC_ARCH=arch-macosx-gnu-O --download-chaco
>>> >>> --with-viewfromoptions=1
>>> >>> >> > [0]PETSC ERROR: #1 TSStep() line 4129 in
>>> >>> >> > /Users/markadams/Codes/petsc/src/ts/interface/ts.c
>>> >>> >> > [0]PETSC ERROR: #2 TSSolve() line 4374 in
>>> >>> >> > /Users/markadams/Codes/petsc/src/ts/interface/ts.c
>>> >>> >> > [0]PETSC ERROR: #3 main() line 1161 in
>>> /Users/markadams/Codes/mhd/
>>> >>> >> src/ex48.c
>>> >>> >>
>>> >>>
>>> >>
>>> >>
>>>
>>
>>


More information about the petsc-dev mailing list