[petsc-users] KSP_CONVERGED_STEP_LENGTH

Wed Sep 14 13:35:58 CDT 2016

Hi Hsahara,

I am not sure whether or not the dbg will also trace your libmesh code. We
have a similar issue in the MOOSE with PETSc-3.7.x. No issues with the old
PETSc. We finally get the problem fixed. In KSP, we could plugin any user
converged test function, and there is a plugin-in function in MOOSE.
PETSc-3.7.x. calls  the PETSc default converged test first, and  if the
algorithm converges or diverges, PETSc just stop solving the linear system
and NOT going to call the user converged test any more. A variable stores
the converged reason in MOOSE. Try to solve another updated linear system
using the same KSP, and KSP just stops solving because the old converged
reason is reused again.

I think the old version of PETSc always call the user converged test
function first.

This is possibly not related to your issue, but just give you one more
thought.

Fande Kong

On Wed, Sep 14, 2016 at 11:39 AM, Harshad Sahasrabudhe <hsahasra at purdue.edu>
wrote:

> Hi Barry,
>
> I put a watchpoint on *((KSP_CONVERGED_REASON*) &( ((_p_KSP*)ksp)->reason ))
> in gdb. The ksp->reason switched between:
>
> Old value = KSP_CONVERGED_ITERATING
> New value = KSP_CONVERGED_RTOL
> 0x00002b143054bef2 in KSPConvergedDefault (ksp=0x23c3090, n=12,
> rnorm=5.3617149831259514e-08, reason=0x23c3310, ctx=0x2446210)
>     at /depot/ncn/apps/conte/conte-gcc-petsc35-dbg/libs/petsc/
> build-real/src/ksp/ksp/interface/iterativ.c:764
> 764           *reason = KSP_CONVERGED_RTOL;
>
> *and*
>
> Old value = KSP_CONVERGED_RTOL
> New value = KSP_CONVERGED_ITERATING
> KSPSetUp (ksp=0x23c3090) at /depot/ncn/apps/conte/conte-
> gcc-petsc35-dbg/libs/petsc/build-real/src/ksp/ksp/interface/itfunc.c:226
> 226       if (!((PetscObject)ksp)->type_name) {
>
> However, after iteration 6, it changed to KSP_CONVERGED_STEP_LENGTH
>
> Old value = KSP_CONVERGED_ITERATING
> New value = KSP_CONVERGED_STEP_LENGTH
> SNES_TR_KSPConverged_Private (ksp=0x23c3090, n=1,
> rnorm=0.097733468578376406, reason=0x23c3310, cctx=0x1d8f3e0)
>     at /depot/ncn/apps/conte/conte-gcc-petsc35-dbg/libs/petsc/
> build-real/src/snes/impls/tr/tr.c:36
> 36        PetscFunctionReturn(0);
>
> Any ideas why that function was executed? Backtrace when the program
> stopped here:
>
> #0  SNES_TR_KSPConverged_Private (ksp=0x23c3090, n=1,
> rnorm=0.097733468578376406, reason=0x23c3310, cctx=0x1d8f3e0)
>     at /depot/ncn/apps/conte/conte-gcc-petsc35-dbg/libs/petsc/
> build-real/src/snes/impls/tr/tr.c:36
> #1  0x00002b14305d3fda in KSPGMRESCycle (itcount=0x7ffdcf2d4ffc,
> ksp=0x23c3090)
>     at /depot/ncn/apps/conte/conte-gcc-petsc35-dbg/libs/petsc/
> build-real/src/ksp/ksp/impls/gmres/gmres.c:182
> #2  0x00002b14305d4711 in KSPSolve_GMRES (ksp=0x23c3090) at
> /depot/ncn/apps/conte/conte-gcc-petsc35-dbg/libs/petsc/
> build-real/src/ksp/ksp/impls/gmres/gmres.c:235
> #3  0x00002b1430526a8a in KSPSolve (ksp=0x23c3090, b=0x1a916c0,
> x=0x1d89dc0)
>     at /depot/ncn/apps/conte/conte-gcc-petsc35-dbg/libs/petsc/
> build-real/src/ksp/ksp/interface/itfunc.c:460
> #4  0x00002b1430bb3905 in SNESSolve_NEWTONTR (snes=0x1ea2490) at
> /depot/ncn/apps/conte/conte-gcc-petsc35-dbg/libs/petsc/
> build-real/src/snes/impls/tr/tr.c:160
> #5  0x00002b1430655b57 in SNESSolve (snes=0x1ea2490, b=0x0, x=0x1a27420)
>     at /depot/ncn/apps/conte/conte-gcc-petsc35-dbg/libs/petsc/
> build-real/src/snes/interface/snes.c:3743
> #6  0x00002b142f606780 in libMesh::PetscNonlinearSolver<double>::solve
> (this=0x1a1d8c0, jac_in=..., x_in=..., r_in=...) at
> src/solvers/petsc_nonlinear_solver.C:714
> #7  0x00002b142f67561d in libMesh::NonlinearImplicitSystem::solve
> (this=0x1a14fe0) at src/systems/nonlinear_implicit_system.C:183
> #8  0x00002b1429548ceb in NonlinearPoisson::execute_solver
> (this=0x1110500) at NonlinearPoisson.cpp:1191
> #9  0x00002b142952233c in NonlinearPoisson::do_solve (this=0x1110500) at
> NonlinearPoisson.cpp:948
> #10 0x00002b1429b9e785 in Simulation::solve (this=0x1110500) at
> Simulation.cpp:781
> #11 0x00002b1429ac326e in Nemo::run_simulations (this=0x63b020) at
> Nemo.cpp:1313
> #12 0x0000000000426d0d in main (argc=6, argv=0x7ffdcf2d7908) at
> main.cpp:447
>
>
> Thanks!
> Harshad
>
> On Wed, Sep 14, 2016 at 10:10 AM, Harshad Sahasrabudhe <
> hsahasra at purdue.edu> wrote:
>
>> I think I found the problem. I configured PETSc with COPTFLAGS=-O3. I'll
>> remove that option and try again.
>>
>> Thanks!
>> Harshad
>>
>> On Wed, Sep 14, 2016 at 10:06 AM, Harshad Sahasrabudhe <
>> hsahasra at purdue.edu> wrote:
>>
>>> Hi Barry,
>>>
>>> Thanks for your inputs. I tried to set a watchpoint on
>>> ((_p_KSP*)ksp)->reason, but gdb says no symbol _p_KSP in context.
>>> Basically, GDB isn't able to find the PETSc source code. I built PETSc with
>>> --with-debugging=1 statically and -fPIC, but it seems the libpetsc.a I get
>>> doesn't contain debugging symbols (checked using objdump -g). How do I get
>>> PETSc library to have debugging info?
>>>
>>> Thanks,
>>> Harshad
>>>
>>> On Tue, Sep 13, 2016 at 2:47 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>>>
>>>>
>>>> > On Sep 13, 2016, at 1:01 PM, Harshad Sahasrabudhe <
>>>> hsahasra at purdue.edu> wrote:
>>>> >
>>>> > Hi Barry,
>>>> >
>>>> > I compiled with mpich configured using --enable-g=meminit to get rid
>>>> of MPI errors in Valgrind. Doing this reduced the number of errors to 2. I
>>>> have attached the Valgrind output.
>>>>
>>>>    This isn't helpful but it seems not to be a memory corruption issue
>>>> :-(
>>>> >
>>>> > I'm using GAMG+GMRES for in each linear iteration of SNES. The linear
>>>> solver converges with CONVERGED_RTOL for the first 6 iterations and with
>>>> CONVERGED_STEP_LENGTH after that. I'm still very confused about why this is
>>>> happening. Any thoughts/ideas?
>>>>
>>>>    Does this happen on one process? If so I would run in the debugger
>>>> and track the variable to see everyplace the variable is changed, this
>>>> would point to exactly what piece of code is changing the variable to this
>>>> unexpected value.
>>>>
>>>>    For example with lldb one can use watch
>>>> http://lldb.llvm.org/tutorial.html
>>>> <https://urldefense.proofpoint.com/v2/url?u=http-3A__lldb.llvm.org_tutorial.html&d=CwMFaQ&c=54IZrppPQZKX9mLzcGdPfFD1hxrcB__aEkJFOKJFd00&r=DUUt3SRGI0_JgtNaS3udV68GRkgV4ts7XKfj2opmiCY&m=E9U2Th1JCvLXgyxKVx27Uc1_irgC5Iztv3fndq0JRQk&s=eqp6I596uzgPRsUhmyYPwwTomBUCcAHnJ7vy_pvXEkI&e=>
>>>> to see each time a variable gets changed. Similar thing with gdb.
>>>>
>>>>    The variable to watch is ksp->reason  Once you get the hang of this
>>>> it can take just a few minutes to track down the code that is making this
>>>> unexpected value, though I understand if you haven't done it before it can
>>>> be intimidating.
>>>>
>>>>   Barry
>>>>
>>>> You can do the same thing in parallel (like on two processes) if you
>>>> need to but it is more cumbersome since you need run multiple debuggers.
>>>> You can have PETSc start up multiple debuggers with mpiexec -n 2 ./ex
>>>> -start_in_debugger
>>>>
>>>>
>>>>
>>>>
>>>> >
>>>> > Thanks,
>>>> > Harshad
>>>> >
>>>> > On Thu, Sep 8, 2016 at 11:26 PM, Barry Smith <bsmith at mcs.anl.gov>
>>>> wrote:
>>>> >
>>>> >   Install your MPI with --download-mpich as a PETSc ./configure
>>>> option, this will eliminate all the MPICH valgrind errors. Then send as an
>>>> attachment the resulting valgrind file.
>>>> >
>>>> >   I do not 100 % trust any code that produces such valgrind errors.
>>>> >
>>>> >    Barry
>>>> >
>>>> >
>>>> >
>>>> > > On Sep 8, 2016, at 10:12 PM, Harshad Sahasrabudhe <
>>>> hsahasra at purdue.edu> wrote:
>>>> > >
>>>> > > Hi Barry,
>>>> > >
>>>> > > Thanks for the reply. My code is in C. I ran with Valgrind and
>>>> found many "Conditional jump or move depends on uninitialized value(s)",
>>>> "Invalid read" and "Use of uninitialized value" errors. I think all of them
>>>> are from the libraries I'm using (LibMesh, Boost, MPI, etc.). I'm not
>>>> really sure what I'm looking for in the Valgrind output. At the end of the
>>>> file, I get:
>>>> > >
>>>> > > ==40223== More than 10000000 total errors detected.  I'm not
>>>> reporting any more.
>>>> > > ==40223== Final error counts will be inaccurate.  Go fix your
>>>> program!
>>>> > > ==40223== Rerun with --error-limit=no to disable this cutoff.  Note
>>>> > > ==40223== that errors may occur in your program without prior
>>>> warning from
>>>> > > ==40223== Valgrind, because errors are no longer being displayed.
>>>> > >
>>>> > > Can you give some suggestions on how I should proceed?
>>>> > >
>>>> > > Thanks,
>>>> > > Harshad
>>>> > >
>>>> > > On Thu, Sep 8, 2016 at 1:59 PM, Barry Smith <bsmith at mcs.anl.gov>
>>>> wrote:
>>>> > >
>>>> > >    This is very odd. CONVERGED_STEP_LENGTH for KSP is very
>>>> specialized and should never occur with GMRES.
>>>> > >
>>>> > >    Can you run with valgrind to make sure there is no memory
>>>> corruption? http://www.mcs.anl.gov/petsc/d
>>>> ocumentation/faq.html#valgrind
>>>> <https://urldefense.proofpoint.com/v2/url?u=http-3A__www.mcs.anl.gov_petsc_documentation_faq.html-23valgrind&d=CwMFaQ&c=54IZrppPQZKX9mLzcGdPfFD1hxrcB__aEkJFOKJFd00&r=DUUt3SRGI0_JgtNaS3udV68GRkgV4ts7XKfj2opmiCY&m=E9U2Th1JCvLXgyxKVx27Uc1_irgC5Iztv3fndq0JRQk&s=UF7DuHQ2byckAJuFV9ZyezN4kklsP1x_y-Bd_v4Uj0I&e=>
>>>> > >
>>>> > >    Is your code fortran or C?
>>>> > >
>>>> > >    Barry
>>>> > >
>>>> > > > On Sep 8, 2016, at 10:38 AM, Harshad Sahasrabudhe <
>>>> hsahasra at purdue.edu> wrote:
>>>> > > >
>>>> > > > Hi,
>>>> > > >
>>>> > > > I'm using GAMG + GMRES for my Poisson problem. The solver
>>>> converges with KSP_CONVERGED_STEP_LENGTH at a residual of
>>>> 9.773346857844e-02, which is much higher than what I need (I need a
>>>> tolerance of at least 1E-8). I am not able to figure out which tolerance I
>>>> need to set to avoid convergence due to CONVERGED_STEP_LENGTH.
>>>> > > >
>>>> > > > Any help is appreciated! Output of -ksp_view and -ksp_monitor:
>>>> > > >
>>>> > > >     0 KSP Residual norm 3.121347818142e+00
>>>> > > >     1 KSP Residual norm 9.773346857844e-02
>>>> > > >   Linear solve converged due to CONVERGED_STEP_LENGTH iterations 1
>>>> > > > KSP Object: 1 MPI processes
>>>> > > >   type: gmres
>>>> > > >     GMRES: restart=30, using Classical (unmodified) Gram-Schmidt
>>>> Orthogonalization with no iterative refinement
>>>> > > >     GMRES: happy breakdown tolerance 1e-30
>>>> > > >   maximum iterations=10000, initial guess is zero
>>>> > > >   tolerances:  relative=1e-08, absolute=1e-50, divergence=10000
>>>> > > >   left preconditioning
>>>> > > >   using PRECONDITIONED norm type for convergence test
>>>> > > > PC Object: 1 MPI processes
>>>> > > >   type: gamg
>>>> > > >     MG: type is MULTIPLICATIVE, levels=2 cycles=v
>>>> > > >       Cycles per PCApply=1
>>>> > > >       Using Galerkin computed coarse grid matrices
>>>> > > >   Coarse grid solver -- level -------------------------------
>>>> > > >     KSP Object:    (mg_coarse_)     1 MPI processes
>>>> > > >       type: preonly
>>>> > > >       maximum iterations=1, initial guess is zero
>>>> > > >       tolerances:  relative=1e-05, absolute=1e-50,
>>>> divergence=10000
>>>> > > >       left preconditioning
>>>> > > >       using NONE norm type for convergence test
>>>> > > >     PC Object:    (mg_coarse_)     1 MPI processes
>>>> > > >       type: bjacobi
>>>> > > >         block Jacobi: number of blocks = 1
>>>> > > >         Local solve is same for all blocks, in the following KSP
>>>> and PC objects:
>>>> > > >         KSP Object:        (mg_coarse_sub_)         1 MPI
>>>> processes
>>>> > > >           type: preonly
>>>> > > >           maximum iterations=1, initial guess is zero
>>>> > > >           tolerances:  relative=1e-05, absolute=1e-50,
>>>> divergence=10000
>>>> > > >           left preconditioning
>>>> > > >           using NONE norm type for convergence test
>>>> > > >         PC Object:        (mg_coarse_sub_)         1 MPI processes
>>>> > > >           type: lu
>>>> > > >             LU: out-of-place factorization
>>>> > > >             tolerance for zero pivot 2.22045e-14
>>>> > > >             using diagonal shift on blocks to prevent zero pivot
>>>> [INBLOCKS]
>>>> > > >             matrix ordering: nd
>>>> > > >             factor fill ratio given 5, needed 1.91048
>>>> > > >               Factored matrix follows:
>>>> > > >                 Mat Object:                 1 MPI processes
>>>> > > >                   type: seqaij
>>>> > > >                   rows=284, cols=284
>>>> > > >                   package used to perform factorization: petsc
>>>> > > >                   total: nonzeros=7726, allocated nonzeros=7726
>>>> > > >                   total number of mallocs used during
>>>> MatSetValues calls =0
>>>> > > >                     using I-node routines: found 133 nodes, limit
>>>> used is 5
>>>> > > >           linear system matrix = precond matrix:
>>>> > > >           Mat Object:           1 MPI processes
>>>> > > >             type: seqaij
>>>> > > >             rows=284, cols=284
>>>> > > >             total: nonzeros=4044, allocated nonzeros=4044
>>>> > > >             total number of mallocs used during MatSetValues
>>>> calls =0
>>>> > > >               not using I-node routines
>>>> > > >       linear system matrix = precond matrix:
>>>> > > >       Mat Object:       1 MPI processes
>>>> > > >         type: seqaij
>>>> > > >         rows=284, cols=284
>>>> > > >         total: nonzeros=4044, allocated nonzeros=4044
>>>> > > >         total number of mallocs used during MatSetValues calls =0
>>>> > > >           not using I-node routines
>>>> > > >   Down solver (pre-smoother) on level 1
>>>> -------------------------------
>>>> > > >     KSP Object:    (mg_levels_1_)     1 MPI processes
>>>> > > >       type: chebyshev
>>>> > > >         Chebyshev: eigenvalue estimates:  min = 0.195339, max =
>>>> 4.10212
>>>> > > >       maximum iterations=2
>>>> > > >       tolerances:  relative=1e-05, absolute=1e-50,
>>>> divergence=10000
>>>> > > >       left preconditioning
>>>> > > >       using nonzero initial guess
>>>> > > >       using NONE norm type for convergence test
>>>> > > >     PC Object:    (mg_levels_1_)     1 MPI processes
>>>> > > >       type: sor
>>>> > > >         SOR: type = local_symmetric, iterations = 1, local
>>>> iterations = 1, omega = 1
>>>> > > >       linear system matrix = precond matrix:
>>>> > > >       Mat Object:      ()       1 MPI processes
>>>> > > >         type: seqaij
>>>> > > >         rows=9036, cols=9036
>>>> > > >         total: nonzeros=192256, allocated nonzeros=192256
>>>> > > >         total number of mallocs used during MatSetValues calls =0
>>>> > > >           not using I-node routines
>>>> > > >   Up solver (post-smoother) same as down solver (pre-smoother)
>>>> > > >   linear system matrix = precond matrix:
>>>> > > >   Mat Object:  ()   1 MPI processes
>>>> > > >     type: seqaij
>>>> > > >     rows=9036, cols=9036
>>>> > > >     total: nonzeros=192256, allocated nonzeros=192256
>>>> > > >     total number of mallocs used during MatSetValues calls =0
>>>> > > >       not using I-node routines
>>>> > > >
>>>> > > > Thanks,
>>>> > > > Harshad
>>>> > >
>>>> > >
>>>> >
>>>> >
>>>> > <valgrind.log.33199>
>>>>
>>>>
>>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20160914/66410831/attachment-0001.html>