<div dir="ltr"><div><div>Hi Hsahara,<br><br></div>I am not sure whether the dbg will also trace your libMesh code. We had a similar issue in MOOSE with PETSc-3.7.x; there were no issues with older PETSc versions. We eventually got the problem fixed. In KSP, one can plug in any user-defined convergence test function, and MOOSE has such a plug-in function. PETSc-3.7.x calls the PETSc default convergence test first, and if the algorithm converges or diverges, PETSc just stops solving the linear system and does NOT call the user convergence test any more. In MOOSE, a variable stores the converged reason. When we then try to solve another, updated linear system using the same KSP, the KSP just stops solving because the old converged reason is reused.<br><br></div><div>I think the old version of PETSc always called the user convergence test function first. <br></div><div><br></div><div>This is possibly not related to your issue, but it may give you one more thing to consider.<br></div><div><br><br></div>Fande Kong<br></div><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Sep 14, 2016 at 11:39 AM, Harshad Sahasrabudhe <span dir="ltr"><<a href="mailto:hsahasra@purdue.edu" target="_blank">hsahasra@purdue.edu</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Hi Barry,<div><br></div><div>I put a watchpoint on *((KSP_CONVERGED_REASON*) &(<span style="font-size:12.8px"> </span><span style="font-size:12.8px">((_p_KSP*)ksp)->reason </span>)) in gdb. 
The ksp->reason switched between:</div><div><br></div><div><div>Old value = KSP_CONVERGED_ITERATING</div><div>New value = KSP_CONVERGED_RTOL</div><div>0x00002b143054bef2 in KSPConvergedDefault (ksp=0x23c3090, n=12, rnorm=5.3617149831259514e-08, reason=0x23c3310, ctx=0x2446210)</div><div>    at /depot/ncn/apps/conte/conte-<wbr>gcc-petsc35-dbg/libs/petsc/<wbr>build-real/src/ksp/ksp/<wbr>interface/iterativ.c:764</div><div>764           *reason = KSP_CONVERGED_RTOL;</div></div><div><br></div><div><b>and</b></div><div><br></div><div><div>Old value = KSP_CONVERGED_RTOL</div><div>New value = KSP_CONVERGED_ITERATING</div><div>KSPSetUp (ksp=0x23c3090) at /depot/ncn/apps/conte/conte-<wbr>gcc-petsc35-dbg/libs/petsc/<wbr>build-real/src/ksp/ksp/<wbr>interface/itfunc.c:226</div><div>226       if (!((PetscObject)ksp)->type_<wbr>name) {</div></div><div><br></div><div>However, after iteration 6, it changed to KSP_CONVERGED_STEP_LENGTH</div><div><br></div><div><div>Old value = KSP_CONVERGED_ITERATING</div><div>New value = KSP_CONVERGED_STEP_LENGTH</div><div>SNES_TR_KSPConverged_Private (ksp=0x23c3090, n=1, rnorm=0.097733468578376406, reason=0x23c3310, cctx=0x1d8f3e0)</div><div>    at /depot/ncn/apps/conte/conte-<wbr>gcc-petsc35-dbg/libs/petsc/<wbr>build-real/src/snes/impls/tr/<wbr>tr.c:36</div><div>36        PetscFunctionReturn(0);</div></div><div><br></div><div>Any ideas why that function was executed? 
Backtrace when the program stopped here:</div><div><br></div><div><div>#0  SNES_TR_KSPConverged_Private (ksp=0x23c3090, n=1, rnorm=0.097733468578376406, reason=0x23c3310, cctx=0x1d8f3e0)</div><div>    at /depot/ncn/apps/conte/conte-<wbr>gcc-petsc35-dbg/libs/petsc/<wbr>build-real/src/snes/impls/tr/<wbr>tr.c:36</div><div>#1  0x00002b14305d3fda in KSPGMRESCycle (itcount=0x7ffdcf2d4ffc, ksp=0x23c3090)</div><div>    at /depot/ncn/apps/conte/conte-<wbr>gcc-petsc35-dbg/libs/petsc/<wbr>build-real/src/ksp/ksp/impls/<wbr>gmres/gmres.c:182</div><div>#2  0x00002b14305d4711 in KSPSolve_GMRES (ksp=0x23c3090) at /depot/ncn/apps/conte/conte-<wbr>gcc-petsc35-dbg/libs/petsc/<wbr>build-real/src/ksp/ksp/impls/<wbr>gmres/gmres.c:235</div><div>#3  0x00002b1430526a8a in KSPSolve (ksp=0x23c3090, b=0x1a916c0, x=0x1d89dc0)</div><div>    at /depot/ncn/apps/conte/conte-<wbr>gcc-petsc35-dbg/libs/petsc/<wbr>build-real/src/ksp/ksp/<wbr>interface/itfunc.c:460</div><div>#4  0x00002b1430bb3905 in SNESSolve_NEWTONTR (snes=0x1ea2490) at /depot/ncn/apps/conte/conte-<wbr>gcc-petsc35-dbg/libs/petsc/<wbr>build-real/src/snes/impls/tr/<wbr>tr.c:160</div><div>#5  0x00002b1430655b57 in SNESSolve (snes=0x1ea2490, b=0x0, x=0x1a27420)</div><div>    at /depot/ncn/apps/conte/conte-<wbr>gcc-petsc35-dbg/libs/petsc/<wbr>build-real/src/snes/interface/<wbr>snes.c:3743</div><div>#6  0x00002b142f606780 in libMesh::PetscNonlinearSolver<<wbr>double>::solve (this=0x1a1d8c0, jac_in=..., x_in=..., r_in=...) 
at src/solvers/petsc_nonlinear_<wbr>solver.C:714</div><div>#7  0x00002b142f67561d in libMesh::<wbr>NonlinearImplicitSystem::solve (this=0x1a14fe0) at src/systems/nonlinear_<wbr>implicit_system.C:183</div><div>#8  0x00002b1429548ceb in NonlinearPoisson::execute_<wbr>solver (this=0x1110500) at NonlinearPoisson.cpp:1191</div><div>#9  0x00002b142952233c in NonlinearPoisson::do_solve (this=0x1110500) at NonlinearPoisson.cpp:948</div><div>#10 0x00002b1429b9e785 in Simulation::solve (this=0x1110500) at Simulation.cpp:781</div><div>#11 0x00002b1429ac326e in Nemo::run_simulations (this=0x63b020) at Nemo.cpp:1313</div><div>#12 0x0000000000426d0d in main (argc=6, argv=0x7ffdcf2d7908) at main.cpp:447</div></div><div><br></div><div><br></div><div>Thanks!</div><span class="HOEnZb"><font color="#888888"><div>Harshad</div></font></span></div><div class="HOEnZb"><div class="h5"><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Sep 14, 2016 at 10:10 AM, Harshad Sahasrabudhe <span dir="ltr"><<a href="mailto:hsahasra@purdue.edu" target="_blank">hsahasra@purdue.edu</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">I think I found the problem. I configured PETSc with COPTFLAGS=-O3. I'll remove that option and try again.<div><br></div><div>Thanks!</div><span><font color="#888888"><div>Harshad</div></font></span></div><div><div><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Sep 14, 2016 at 10:06 AM, Harshad Sahasrabudhe <span dir="ltr"><<a href="mailto:hsahasra@purdue.edu" target="_blank">hsahasra@purdue.edu</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Hi Barry,<div><br></div><div>Thanks for your inputs. I tried to set a watchpoint on ((_p_KSP*)ksp)->reason, but gdb says no symbol _p_KSP in context. Basically, GDB isn't able to find the PETSc source code. 
I built PETSc with --with-debugging=1 statically and -fPIC, but it seems the libpetsc.a I get doesn't contain debugging symbols (checked using objdump -g). How do I get the PETSc library to have debugging info?</div><div><br></div><div>Thanks,</div><div>Harshad</div><div><div><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Sep 13, 2016 at 2:47 PM, Barry Smith <span dir="ltr"><<a href="mailto:bsmith@mcs.anl.gov" target="_blank">bsmith@mcs.anl.gov</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span><br>

> On Sep 13, 2016, at 1:01 PM, Harshad Sahasrabudhe <<a href="mailto:hsahasra@purdue.edu" target="_blank">hsahasra@purdue.edu</a>> wrote:<br>

><br>

> Hi Barry,<br>

><br>

</span><span>> I compiled with mpich configured using --enable-g=meminit to get rid of MPI errors in Valgrind. Doing this reduced the number of errors to 2. I have attached the Valgrind output.<br>

<br>

</span>   This isn't helpful but it seems not to be a memory corruption issue :-(<br>

<span>><br>

> I'm using GAMG+GMRES in each linear iteration of SNES. The linear solver converges with CONVERGED_RTOL for the first 6 iterations and with CONVERGED_STEP_LENGTH after that. I'm still very confused about why this is happening. Any thoughts/ideas?<br>

<br>

</span>   Does this happen on one process? If so, I would run in the debugger and track the variable to see every place it is changed. This would point to exactly which piece of code is changing the variable to this unexpected value.<br>

<br>

   For example with lldb one can use watch <a href="https://urldefense.proofpoint.com/v2/url?u=http-3A__lldb.llvm.org_tutorial.html&d=CwMFaQ&c=54IZrppPQZKX9mLzcGdPfFD1hxrcB__aEkJFOKJFd00&r=DUUt3SRGI0_JgtNaS3udV68GRkgV4ts7XKfj2opmiCY&m=E9U2Th1JCvLXgyxKVx27Uc1_irgC5Iztv3fndq0JRQk&s=eqp6I596uzgPRsUhmyYPwwTomBUCcAHnJ7vy_pvXEkI&e=" rel="noreferrer" target="_blank">http://lldb.llvm.org/tutorial.<wbr>html</a> to see each time a variable gets changed. Similar thing with gdb.<br>

<br>

   The variable to watch is ksp->reason. Once you get the hang of this, it can take just a few minutes to track down the code that is producing this unexpected value, though I understand that if you haven't done it before it can be intimidating.<br>

<br>

  Barry<br>

<br>

You can do the same thing in parallel (like on two processes) if you need to, but it is more cumbersome since you need to run multiple debuggers. You can have PETSc start up multiple debuggers with mpiexec -n 2 ./ex -start_in_debugger<br>

<div><div><br>

<br>

<br>

<br>

><br>

> Thanks,<br>

> Harshad<br>

><br>

> On Thu, Sep 8, 2016 at 11:26 PM, Barry Smith <<a href="mailto:bsmith@mcs.anl.gov" target="_blank">bsmith@mcs.anl.gov</a>> wrote:<br>

><br>

>   Install your MPI with --download-mpich as a PETSc ./configure option; this will eliminate all the MPICH valgrind errors. Then send the resulting valgrind file as an attachment.<br>

><br>

>   I do not 100 % trust any code that produces such valgrind errors.<br>

><br>

>    Barry<br>

><br>

><br>

><br>

> > On Sep 8, 2016, at 10:12 PM, Harshad Sahasrabudhe <<a href="mailto:hsahasra@purdue.edu" target="_blank">hsahasra@purdue.edu</a>> wrote:<br>

> ><br>

> > Hi Barry,<br>

> ><br>

> > Thanks for the reply. My code is in C. I ran with Valgrind and found many "Conditional jump or move depends on uninitialized value(s)", "Invalid read" and "Use of uninitialized value" errors. I think all of them are from the libraries I'm using (LibMesh, Boost, MPI, etc.). I'm not really sure what I'm looking for in the Valgrind output. At the end of the file, I get:<br>

> ><br>

> > ==40223== More than 10000000 total errors detected.  I'm not reporting any more.<br>

> > ==40223== Final error counts will be inaccurate.  Go fix your program!<br>

> > ==40223== Rerun with --error-limit=no to disable this cutoff.  Note<br>

> > ==40223== that errors may occur in your program without prior warning from<br>

> > ==40223== Valgrind, because errors are no longer being displayed.<br>

> ><br>

> > Can you give some suggestions on how I should proceed?<br>

> ><br>

> > Thanks,<br>

> > Harshad<br>

> ><br>

> > On Thu, Sep 8, 2016 at 1:59 PM, Barry Smith <<a href="mailto:bsmith@mcs.anl.gov" target="_blank">bsmith@mcs.anl.gov</a>> wrote:<br>

> ><br>

> >    This is very odd. CONVERGED_STEP_LENGTH for KSP is very specialized and should never occur with GMRES.<br>

> ><br>

> >    Can you run with valgrind to make sure there is no memory corruption? <a href="https://urldefense.proofpoint.com/v2/url?u=http-3A__www.mcs.anl.gov_petsc_documentation_faq.html-23valgrind&d=CwMFaQ&c=54IZrppPQZKX9mLzcGdPfFD1hxrcB__aEkJFOKJFd00&r=DUUt3SRGI0_JgtNaS3udV68GRkgV4ts7XKfj2opmiCY&m=E9U2Th1JCvLXgyxKVx27Uc1_irgC5Iztv3fndq0JRQk&s=UF7DuHQ2byckAJuFV9ZyezN4kklsP1x_y-Bd_v4Uj0I&e=" rel="noreferrer" target="_blank">http://www.mcs.anl.gov/petsc/d<wbr>ocumentation/faq.html#valgrind</a><br>

> ><br>

> >    Is your code Fortran or C?<br>

> ><br>

> >    Barry<br>

> ><br>

> > > On Sep 8, 2016, at 10:38 AM, Harshad Sahasrabudhe <<a href="mailto:hsahasra@purdue.edu" target="_blank">hsahasra@purdue.edu</a>> wrote:<br>

> > ><br>

> > > Hi,<br>

> > ><br>

> > > I'm using GAMG + GMRES for my Poisson problem. The solver converges with KSP_CONVERGED_STEP_LENGTH at a residual of 9.773346857844e-02, which is much higher than what I need (I need a tolerance of at least 1E-8). I am not able to figure out which tolerance I need to set to avoid convergence due to CONVERGED_STEP_LENGTH.<br>

> > ><br>

> > > Any help is appreciated! Output of -ksp_view and -ksp_monitor:<br>

> > ><br>

> > >     0 KSP Residual norm 3.121347818142e+00<br>

> > >     1 KSP Residual norm 9.773346857844e-02<br>

> > >   Linear solve converged due to CONVERGED_STEP_LENGTH iterations 1<br>

> > > KSP Object: 1 MPI processes<br>

> > >   type: gmres<br>

> > >     GMRES: restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement<br>

> > >     GMRES: happy breakdown tolerance 1e-30<br>

> > >   maximum iterations=10000, initial guess is zero<br>

> > >   tolerances:  relative=1e-08, absolute=1e-50, divergence=10000<br>

> > >   left preconditioning<br>

> > >   using PRECONDITIONED norm type for convergence test<br>

> > > PC Object: 1 MPI processes<br>

> > >   type: gamg<br>

> > >     MG: type is MULTIPLICATIVE, levels=2 cycles=v<br>

> > >       Cycles per PCApply=1<br>

> > >       Using Galerkin computed coarse grid matrices<br>

> > >   Coarse grid solver -- level ------------------------------<wbr>-<br>

> > >     KSP Object:    (mg_coarse_)     1 MPI processes<br>

> > >       type: preonly<br>

> > >       maximum iterations=1, initial guess is zero<br>

> > >       tolerances:  relative=1e-05, absolute=1e-50, divergence=10000<br>

> > >       left preconditioning<br>

> > >       using NONE norm type for convergence test<br>

> > >     PC Object:    (mg_coarse_)     1 MPI processes<br>

> > >       type: bjacobi<br>

> > >         block Jacobi: number of blocks = 1<br>

> > >         Local solve is same for all blocks, in the following KSP and PC objects:<br>

> > >         KSP Object:        (mg_coarse_sub_)         1 MPI processes<br>

> > >           type: preonly<br>

> > >           maximum iterations=1, initial guess is zero<br>

> > >           tolerances:  relative=1e-05, absolute=1e-50, divergence=10000<br>

> > >           left preconditioning<br>

> > >           using NONE norm type for convergence test<br>

> > >         PC Object:        (mg_coarse_sub_)         1 MPI processes<br>

> > >           type: lu<br>

> > >             LU: out-of-place factorization<br>

> > >             tolerance for zero pivot 2.22045e-14<br>

> > >             using diagonal shift on blocks to prevent zero pivot [INBLOCKS]<br>

> > >             matrix ordering: nd<br>

> > >             factor fill ratio given 5, needed 1.91048<br>

> > >               Factored matrix follows:<br>

> > >                 Mat Object:                 1 MPI processes<br>

> > >                   type: seqaij<br>

> > >                   rows=284, cols=284<br>

> > >                   package used to perform factorization: petsc<br>

> > >                   total: nonzeros=7726, allocated nonzeros=7726<br>

> > >                   total number of mallocs used during MatSetValues calls =0<br>

> > >                     using I-node routines: found 133 nodes, limit used is 5<br>

> > >           linear system matrix = precond matrix:<br>

> > >           Mat Object:           1 MPI processes<br>

> > >             type: seqaij<br>

> > >             rows=284, cols=284<br>

> > >             total: nonzeros=4044, allocated nonzeros=4044<br>

> > >             total number of mallocs used during MatSetValues calls =0<br>

> > >               not using I-node routines<br>

> > >       linear system matrix = precond matrix:<br>

> > >       Mat Object:       1 MPI processes<br>

> > >         type: seqaij<br>

> > >         rows=284, cols=284<br>

> > >         total: nonzeros=4044, allocated nonzeros=4044<br>

> > >         total number of mallocs used during MatSetValues calls =0<br>

> > >           not using I-node routines<br>

> > >   Down solver (pre-smoother) on level 1 ------------------------------<wbr>-<br>

> > >     KSP Object:    (mg_levels_1_)     1 MPI processes<br>

> > >       type: chebyshev<br>

> > >         Chebyshev: eigenvalue estimates:  min = 0.195339, max = 4.10212<br>

> > >       maximum iterations=2<br>

> > >       tolerances:  relative=1e-05, absolute=1e-50, divergence=10000<br>

> > >       left preconditioning<br>

> > >       using nonzero initial guess<br>

> > >       using NONE norm type for convergence test<br>

> > >     PC Object:    (mg_levels_1_)     1 MPI processes<br>

> > >       type: sor<br>

> > >         SOR: type = local_symmetric, iterations = 1, local iterations = 1, omega = 1<br>

> > >       linear system matrix = precond matrix:<br>

> > >       Mat Object:      ()       1 MPI processes<br>

> > >         type: seqaij<br>

> > >         rows=9036, cols=9036<br>

> > >         total: nonzeros=192256, allocated nonzeros=192256<br>

> > >         total number of mallocs used during MatSetValues calls =0<br>

> > >           not using I-node routines<br>

> > >   Up solver (post-smoother) same as down solver (pre-smoother)<br>

> > >   linear system matrix = precond matrix:<br>

> > >   Mat Object:  ()   1 MPI processes<br>

> > >     type: seqaij<br>

> > >     rows=9036, cols=9036<br>

> > >     total: nonzeros=192256, allocated nonzeros=192256<br>

> > >     total number of mallocs used during MatSetValues calls =0<br>

> > >       not using I-node routines<br>

> > ><br>

> > > Thanks,<br>

> > > Harshad<br>

> ><br>

> ><br>

><br>

><br>

</div></div>> <valgrind.log.33199><br>

<br>

</blockquote></div><br></div></div></div></div>

</blockquote></div><br></div>

</div></div></blockquote></div><br></div>

</div></div></blockquote></div><br></div>
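<div>The stale converged-reason pitfall described in the reply at the top of this thread can be illustrated with a toy model. This is only a sketch under stated assumptions: ToySolver, Reason, and the reset_reason flag are hypothetical names invented for illustration, and none of this is PETSc or MOOSE code. It shows a solver object that caches its converged reason, and what happens when a second solve reuses that cached value instead of resetting it.</div>

```python
from dataclasses import dataclass
from enum import Enum


class Reason(Enum):
    # Hypothetical stand-ins for "still iterating" and "converged by relative tolerance".
    ITERATING = 0
    CONVERGED_RTOL = 1


@dataclass
class ToySolver:
    """Toy scalar solver that caches the reason from its most recent solve."""
    reason: Reason = Reason.ITERATING
    its: int = 0

    def solve(self, a, b, rtol=1e-8, reset_reason=True, max_it=1000):
        """Solve a*x = b with the damped iteration x += 0.5*(b - a*x)."""
        x = 0.0
        r0 = abs(b)
        if reset_reason:
            # Start every solve from a clean state.
            self.reason = Reason.ITERATING
        self.its = 0
        # If the cached reason is stale (already "converged"), this loop never runs.
        while self.reason is Reason.ITERATING and self.its < max_it:
            x += 0.5 * (b - a * x)
            self.its += 1
            if abs(b - a * x) <= rtol * r0:
                self.reason = Reason.CONVERGED_RTOL
        return x


solver = ToySolver()
x_first = solver.solve(1.0, 2.0, reset_reason=False)  # first solve iterates normally
x_stale = solver.solve(1.0, 3.0, reset_reason=False)  # stale reason: stops immediately
stale_its = solver.its
x_fixed = solver.solve(1.0, 3.0, reset_reason=True)   # reason reset: iterates again
print(x_first, x_stale, stale_its, x_fixed, solver.its)
```

<div>In this toy model the second solve returns the untouched initial guess after zero iterations, which mirrors the symptom of a solver that reports convergence without doing any work. Resetting the cached reason at the start of each solve avoids it.</div>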