[petsc-users] KSP_CONVERGED_STEP_LENGTH

Wed Sep 14 13:31:13 CDT 2016

Thanks. I now put a watchpoint on

watch *( (PetscErrorCode (**)(KSP, PetscInt, PetscReal, KSPConvergedReason
*, void *)) &(ksp->converged) )

The function pointer changes in the first iteration of SNES. It changed at
the following place:

Old value =
    (PetscErrorCode (*)(KSP, PetscInt, PetscReal, KSPConvergedReason *,
void *)) 0x2b54acdd00aa <KSPConvergedDefault(KSP, PetscInt, PetscReal,
KSPConvergedReason*, void*)>
New value =
    (PetscErrorCode (*)(KSP, PetscInt, PetscReal, KSPConvergedReason *,
void *)) 0x2b54ad436ce8 <SNES_TR_KSPConverged_Private(_p_KSP*, int, double,
KSPConvergedReason*, void*)>
KSPSetConvergenceTest (ksp=0x22bf090, converge=0x2b54ad436ce8
<SNES_TR_KSPConverged_Private(_p_KSP*, int, double, KSPConvergedReason*,
void*)>, cctx=0x1c8b3e0,
    destroy=0x2b54ad437210 <SNES_TR_KSPConverged_Destroy(void*)>)
    at
/depot/ncn/apps/conte/conte-gcc-petsc35-dbg/libs/petsc/build-real/src/ksp/ksp/interface/itfunc.c:1768
1767      ksp->converged        = converge;

Here's the backtrace:

#0  KSPSetConvergenceTest (ksp=0x22bf090, converge=0x2b54ad436ce8
<SNES_TR_KSPConverged_Private(_p_KSP*, int, double, KSPConvergedReason*,
void*)>, cctx=0x1c8b3e0,
    destroy=0x2b54ad437210 <SNES_TR_KSPConverged_Destroy(void*)>)
    at
/depot/ncn/apps/conte/conte-gcc-petsc35-dbg/libs/petsc/build-real/src/ksp/ksp/interface/itfunc.c:1768
#1  0x00002b54ad43865a in SNESSolve_NEWTONTR (snes=0x1d9e490) at
/depot/ncn/apps/conte/conte-gcc-petsc35-dbg/libs/petsc/build-real/src/snes/impls/tr/tr.c:146
#2  0x00002b54acedab57 in SNESSolve (snes=0x1d9e490, b=0x0, x=0x1923420)
    at
/depot/ncn/apps/conte/conte-gcc-petsc35-dbg/libs/petsc/build-real/src/snes/interface/snes.c:3743
#3  0x00002b54abe8b780 in libMesh::PetscNonlinearSolver<double>::solve
(this=0x19198c0, jac_in=..., x_in=..., r_in=...) at
src/solvers/petsc_nonlinear_solver.C:714
#4  0x00002b54abefa61d in libMesh::NonlinearImplicitSystem::solve
(this=0x1910fe0) at src/systems/nonlinear_implicit_system.C:183
#5  0x00002b54a5dcdceb in NonlinearPoisson::execute_solver (this=0x100c500)
at NonlinearPoisson.cpp:1191
#6  0x00002b54a5da733c in NonlinearPoisson::do_solve (this=0x100c500) at
NonlinearPoisson.cpp:948
#7  0x00002b54a6423785 in Simulation::solve (this=0x100c500) at
Simulation.cpp:781
#8  0x00002b54a634826e in Nemo::run_simulations (this=0x63b020) at
Nemo.cpp:1313
#9  0x0000000000426d0d in main (argc=6, argv=0x7ffcdb910768) at main.cpp:447

Thanks,
Harshad

On Wed, Sep 14, 2016 at 2:07 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:

>
>   Super strange, it should never have switched to the function
> SNES_TR_KSPConverged_Private
>
>    Fortunately you can use the same technique to track down where the
> function pointer changes. Just watch ksp->converged to see when the
> function pointer gets changed and send back the new stack trace.
>
>    Barry
>
>
>
>
> > On Sep 14, 2016, at 12:39 PM, Harshad Sahasrabudhe <hsahasra at purdue.edu>
> wrote:
> >
> > Hi Barry,
> >
> > I put a watchpoint on *((KSP_CONVERGED_REASON*) &(
> ((_p_KSP*)ksp)->reason )) in gdb. The ksp->reason switched between:
> >
> > Old value = KSP_CONVERGED_ITERATING
> > New value = KSP_CONVERGED_RTOL
> > 0x00002b143054bef2 in KSPConvergedDefault (ksp=0x23c3090, n=12,
> rnorm=5.3617149831259514e-08, reason=0x23c3310, ctx=0x2446210)
> >     at /depot/ncn/apps/conte/conte-gcc-petsc35-dbg/libs/petsc/
> build-real/src/ksp/ksp/interface/iterativ.c:764
> > 764           *reason = KSP_CONVERGED_RTOL;
> >
> > and
> >
> > Old value = KSP_CONVERGED_RTOL
> > New value = KSP_CONVERGED_ITERATING
> > KSPSetUp (ksp=0x23c3090) at /depot/ncn/apps/conte/conte-
> gcc-petsc35-dbg/libs/petsc/build-real/src/ksp/ksp/interface/itfunc.c:226
> > 226       if (!((PetscObject)ksp)->type_name) {
> >
> > However, after iteration 6, it changed to KSP_CONVERGED_STEP_LENGTH
> >
> > Old value = KSP_CONVERGED_ITERATING
> > New value = KSP_CONVERGED_STEP_LENGTH
> > SNES_TR_KSPConverged_Private (ksp=0x23c3090, n=1,
> rnorm=0.097733468578376406, reason=0x23c3310, cctx=0x1d8f3e0)
> >     at /depot/ncn/apps/conte/conte-gcc-petsc35-dbg/libs/petsc/
> build-real/src/snes/impls/tr/tr.c:36
> > 36        PetscFunctionReturn(0);
> >
> > Any ideas why that function was executed? Backtrace when the program
> stopped here:
> >
> > #0  SNES_TR_KSPConverged_Private (ksp=0x23c3090, n=1,
> rnorm=0.097733468578376406, reason=0x23c3310, cctx=0x1d8f3e0)
> >     at /depot/ncn/apps/conte/conte-gcc-petsc35-dbg/libs/petsc/
> build-real/src/snes/impls/tr/tr.c:36
> > #1  0x00002b14305d3fda in KSPGMRESCycle (itcount=0x7ffdcf2d4ffc,
> ksp=0x23c3090)
> >     at /depot/ncn/apps/conte/conte-gcc-petsc35-dbg/libs/petsc/
> build-real/src/ksp/ksp/impls/gmres/gmres.c:182
> > #2  0x00002b14305d4711 in KSPSolve_GMRES (ksp=0x23c3090) at
> /depot/ncn/apps/conte/conte-gcc-petsc35-dbg/libs/petsc/
> build-real/src/ksp/ksp/impls/gmres/gmres.c:235
> > #3  0x00002b1430526a8a in KSPSolve (ksp=0x23c3090, b=0x1a916c0,
> x=0x1d89dc0)
> >     at /depot/ncn/apps/conte/conte-gcc-petsc35-dbg/libs/petsc/
> build-real/src/ksp/ksp/interface/itfunc.c:460
> > #4  0x00002b1430bb3905 in SNESSolve_NEWTONTR (snes=0x1ea2490) at
> /depot/ncn/apps/conte/conte-gcc-petsc35-dbg/libs/petsc/
> build-real/src/snes/impls/tr/tr.c:160
> > #5  0x00002b1430655b57 in SNESSolve (snes=0x1ea2490, b=0x0, x=0x1a27420)
> >     at /depot/ncn/apps/conte/conte-gcc-petsc35-dbg/libs/petsc/
> build-real/src/snes/interface/snes.c:3743
> > #6  0x00002b142f606780 in libMesh::PetscNonlinearSolver<double>::solve
> (this=0x1a1d8c0, jac_in=..., x_in=..., r_in=...) at
> src/solvers/petsc_nonlinear_solver.C:714
> > #7  0x00002b142f67561d in libMesh::NonlinearImplicitSystem::solve
> (this=0x1a14fe0) at src/systems/nonlinear_implicit_system.C:183
> > #8  0x00002b1429548ceb in NonlinearPoisson::execute_solver
> (this=0x1110500) at NonlinearPoisson.cpp:1191
> > #9  0x00002b142952233c in NonlinearPoisson::do_solve (this=0x1110500) at
> NonlinearPoisson.cpp:948
> > #10 0x00002b1429b9e785 in Simulation::solve (this=0x1110500) at
> Simulation.cpp:781
> > #11 0x00002b1429ac326e in Nemo::run_simulations (this=0x63b020) at
> Nemo.cpp:1313
> > #12 0x0000000000426d0d in main (argc=6, argv=0x7ffdcf2d7908) at
> main.cpp:447
> >
> >
> > Thanks!
> > Harshad
> >
> > On Wed, Sep 14, 2016 at 10:10 AM, Harshad Sahasrabudhe <
> hsahasra at purdue.edu> wrote:
> > I think I found the problem. I configured PETSc with COPTFLAGS=-O3. I'll
> remove that option and try again.
> >
> > Thanks!
> > Harshad
> >
> > On Wed, Sep 14, 2016 at 10:06 AM, Harshad Sahasrabudhe <
> hsahasra at purdue.edu> wrote:
> > Hi Barry,
> >
> > Thanks for your inputs. I tried to set a watchpoint on
> ((_p_KSP*)ksp)->reason, but gdb says no symbol _p_KSP in context.
> Basically, GDB isn't able to find the PETSc source code. I built PETSc with
> --with-debugging=1 statically and -fPIC, but it seems the libpetsc.a I get
> doesn't contain debugging symbols (checked using objdump -g). How do I get
> PETSc library to have debugging info?
> >
> > Thanks,
> > Harshad
> >
> > On Tue, Sep 13, 2016 at 2:47 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
> >
> > > On Sep 13, 2016, at 1:01 PM, Harshad Sahasrabudhe <hsahasra at purdue.edu>
> wrote:
> > >
> > > Hi Barry,
> > >
> > > I compiled with mpich configured using --enable-g=meminit to get rid
> of MPI errors in Valgrind. Doing this reduced the number of errors to 2. I
> have attached the Valgrind output.
> >
> >    This isn't helpful but it seems not to be a memory corruption issue
> :-(
> > >
> > > I'm using GAMG+GMRES for in each linear iteration of SNES. The linear
> solver converges with CONVERGED_RTOL for the first 6 iterations and with
> CONVERGED_STEP_LENGTH after that. I'm still very confused about why this is
> happening. Any thoughts/ideas?
> >
> >    Does this happen on one process? If so I would run in the debugger
> and track the variable to see everyplace the variable is changed, this
> would point to exactly what piece of code is changing the variable to this
> unexpected value.
> >
> >    For example with lldb one can use watch
> http://lldb.llvm.org/tutorial.html to see each time a variable gets
> changed. Similar thing with gdb.
> >
> >    The variable to watch is ksp->reason  Once you get the hang of this
> it can take just a few minutes to track down the code that is making this
> unexpected value, though I understand if you haven't done it before it can
> be intimidating.
> >
> >   Barry
> >
> > You can do the same thing in parallel (like on two processes) if you
> need to but it is more cumbersome since you need run multiple debuggers.
> You can have PETSc start up multiple debuggers with mpiexec -n 2 ./ex
> -start_in_debugger
> >
> >
> >
> >
> > >
> > > Thanks,
> > > Harshad
> > >
> > > On Thu, Sep 8, 2016 at 11:26 PM, Barry Smith <bsmith at mcs.anl.gov>
> wrote:
> > >
> > >   Install your MPI with --download-mpich as a PETSc ./configure
> option, this will eliminate all the MPICH valgrind errors. Then send as an
> attachment the resulting valgrind file.
> > >
> > >   I do not 100 % trust any code that produces such valgrind errors.
> > >
> > >    Barry
> > >
> > >
> > >
> > > > On Sep 8, 2016, at 10:12 PM, Harshad Sahasrabudhe <
> hsahasra at purdue.edu> wrote:
> > > >
> > > > Hi Barry,
> > > >
> > > > Thanks for the reply. My code is in C. I ran with Valgrind and found
> many "Conditional jump or move depends on uninitialized value(s)", "Invalid
> read" and "Use of uninitialized value" errors. I think all of them are from
> the libraries I'm using (LibMesh, Boost, MPI, etc.). I'm not really sure
> what I'm looking for in the Valgrind output. At the end of the file, I get:
> > > >
> > > > ==40223== More than 10000000 total errors detected.  I'm not
> reporting any more.
> > > > ==40223== Final error counts will be inaccurate.  Go fix your
> program!
> > > > ==40223== Rerun with --error-limit=no to disable this cutoff.  Note
> > > > ==40223== that errors may occur in your program without prior
> warning from
> > > > ==40223== Valgrind, because errors are no longer being displayed.
> > > >
> > > > Can you give some suggestions on how I should proceed?
> > > >
> > > > Thanks,
> > > > Harshad
> > > >
> > > > On Thu, Sep 8, 2016 at 1:59 PM, Barry Smith <bsmith at mcs.anl.gov>
> wrote:
> > > >
> > > >    This is very odd. CONVERGED_STEP_LENGTH for KSP is very
> specialized and should never occur with GMRES.
> > > >
> > > >    Can you run with valgrind to make sure there is no memory
> corruption? http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
> > > >
> > > >    Is your code fortran or C?
> > > >
> > > >    Barry
> > > >
> > > > > On Sep 8, 2016, at 10:38 AM, Harshad Sahasrabudhe <
> hsahasra at purdue.edu> wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > I'm using GAMG + GMRES for my Poisson problem. The solver
> converges with KSP_CONVERGED_STEP_LENGTH at a residual of
> 9.773346857844e-02, which is much higher than what I need (I need a
> tolerance of at least 1E-8). I am not able to figure out which tolerance I
> need to set to avoid convergence due to CONVERGED_STEP_LENGTH.
> > > > >
> > > > > Any help is appreciated! Output of -ksp_view and -ksp_monitor:
> > > > >
> > > > >     0 KSP Residual norm 3.121347818142e+00
> > > > >     1 KSP Residual norm 9.773346857844e-02
> > > > >   Linear solve converged due to CONVERGED_STEP_LENGTH iterations 1
> > > > > KSP Object: 1 MPI processes
> > > > >   type: gmres
> > > > >     GMRES: restart=30, using Classical (unmodified) Gram-Schmidt
> Orthogonalization with no iterative refinement
> > > > >     GMRES: happy breakdown tolerance 1e-30
> > > > >   maximum iterations=10000, initial guess is zero
> > > > >   tolerances:  relative=1e-08, absolute=1e-50, divergence=10000
> > > > >   left preconditioning
> > > > >   using PRECONDITIONED norm type for convergence test
> > > > > PC Object: 1 MPI processes
> > > > >   type: gamg
> > > > >     MG: type is MULTIPLICATIVE, levels=2 cycles=v
> > > > >       Cycles per PCApply=1
> > > > >       Using Galerkin computed coarse grid matrices
> > > > >   Coarse grid solver -- level -------------------------------
> > > > >     KSP Object:    (mg_coarse_)     1 MPI processes
> > > > >       type: preonly
> > > > >       maximum iterations=1, initial guess is zero
> > > > >       tolerances:  relative=1e-05, absolute=1e-50, divergence=10000
> > > > >       left preconditioning
> > > > >       using NONE norm type for convergence test
> > > > >     PC Object:    (mg_coarse_)     1 MPI processes
> > > > >       type: bjacobi
> > > > >         block Jacobi: number of blocks = 1
> > > > >         Local solve is same for all blocks, in the following KSP
> and PC objects:
> > > > >         KSP Object:        (mg_coarse_sub_)         1 MPI processes
> > > > >           type: preonly
> > > > >           maximum iterations=1, initial guess is zero
> > > > >           tolerances:  relative=1e-05, absolute=1e-50,
> divergence=10000
> > > > >           left preconditioning
> > > > >           using NONE norm type for convergence test
> > > > >         PC Object:        (mg_coarse_sub_)         1 MPI processes
> > > > >           type: lu
> > > > >             LU: out-of-place factorization
> > > > >             tolerance for zero pivot 2.22045e-14
> > > > >             using diagonal shift on blocks to prevent zero pivot
> [INBLOCKS]
> > > > >             matrix ordering: nd
> > > > >             factor fill ratio given 5, needed 1.91048
> > > > >               Factored matrix follows:
> > > > >                 Mat Object:                 1 MPI processes
> > > > >                   type: seqaij
> > > > >                   rows=284, cols=284
> > > > >                   package used to perform factorization: petsc
> > > > >                   total: nonzeros=7726, allocated nonzeros=7726
> > > > >                   total number of mallocs used during MatSetValues
> calls =0
> > > > >                     using I-node routines: found 133 nodes, limit
> used is 5
> > > > >           linear system matrix = precond matrix:
> > > > >           Mat Object:           1 MPI processes
> > > > >             type: seqaij
> > > > >             rows=284, cols=284
> > > > >             total: nonzeros=4044, allocated nonzeros=4044
> > > > >             total number of mallocs used during MatSetValues calls
> =0
> > > > >               not using I-node routines
> > > > >       linear system matrix = precond matrix:
> > > > >       Mat Object:       1 MPI processes
> > > > >         type: seqaij
> > > > >         rows=284, cols=284
> > > > >         total: nonzeros=4044, allocated nonzeros=4044
> > > > >         total number of mallocs used during MatSetValues calls =0
> > > > >           not using I-node routines
> > > > >   Down solver (pre-smoother) on level 1
> -------------------------------
> > > > >     KSP Object:    (mg_levels_1_)     1 MPI processes
> > > > >       type: chebyshev
> > > > >         Chebyshev: eigenvalue estimates:  min = 0.195339, max =
> 4.10212
> > > > >       maximum iterations=2
> > > > >       tolerances:  relative=1e-05, absolute=1e-50, divergence=10000
> > > > >       left preconditioning
> > > > >       using nonzero initial guess
> > > > >       using NONE norm type for convergence test
> > > > >     PC Object:    (mg_levels_1_)     1 MPI processes
> > > > >       type: sor
> > > > >         SOR: type = local_symmetric, iterations = 1, local
> iterations = 1, omega = 1
> > > > >       linear system matrix = precond matrix:
> > > > >       Mat Object:      ()       1 MPI processes
> > > > >         type: seqaij
> > > > >         rows=9036, cols=9036
> > > > >         total: nonzeros=192256, allocated nonzeros=192256
> > > > >         total number of mallocs used during MatSetValues calls =0
> > > > >           not using I-node routines
> > > > >   Up solver (post-smoother) same as down solver (pre-smoother)
> > > > >   linear system matrix = precond matrix:
> > > > >   Mat Object:  ()   1 MPI processes
> > > > >     type: seqaij
> > > > >     rows=9036, cols=9036
> > > > >     total: nonzeros=192256, allocated nonzeros=192256
> > > > >     total number of mallocs used during MatSetValues calls =0
> > > > >       not using I-node routines
> > > > >
> > > > > Thanks,
> > > > > Harshad
> > > >
> > > >
> > >
> > >
> > > <valgrind.log.33199>
> >
> >
> >
> >
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20160914/d851aecd/attachment-0001.html>