[petsc-users] strange error using fgmres

Dave May dave.mayhem23 at gmail.com
Mon May 6 00:34:34 CDT 2019


On Mon, 6 May 2019 at 02:18, Smith, Barry F. via petsc-users <
petsc-users at mcs.anl.gov> wrote:

>
>
>   Even if you don't get failures on the smaller version of a code, it can
> still be worth running with valgrind (when you can't run valgrind on the
> massive problem): often the problem is still present in the smaller run,
> just less directly visible, and valgrind can still find it.
>
>
> > [13]PETSC ERROR: Object is in wrong state
> > [13]PETSC ERROR: Clearing DM of global vectors that has a global vector
> obtained with DMGetGlobalVector()
>
>    You probably have a work vector obtained with DMGetGlobalVector() that
> you forgot to return with DMRestoreGlobalVector(). Though I would expect
> that this would reproduce on any size problem.


I'd fix the DM issue first before addressing the solver problem. I suspect
the DM error could cause the solver error.

Yep - something is wrong with your management of vectors associated with
one of your DMs. You can check whether this is the case by running with
-log_view. Make sure the object summary it reports shows that the numbers
of Vecs created and destroyed match. At the very least, if there is a
mismatch, make sure the difference does not grow as you perform additional
optimization solves (or time steps).

As Barry says, you don't need to run a large scale job to detect this, nor
do you need to run through many optimization solves - the problem exists
and is detectable and thus fixable for all job sizes.


>
>    Barry
>
>
> > On May 5, 2019, at 5:21 PM, Randall Mackie via petsc-users <
> petsc-users at mcs.anl.gov> wrote:
> >
> > In solving a nonlinear optimization problem, I was recently
> experimenting with fgmres using the following options:
> >
> > -nlcg_ksp_type fgmres \
> > -nlcg_pc_type ksp \
> > -nlcg_ksp_ksp_type bcgs \
> > -nlcg_ksp_pc_type jacobi \
> > -nlcg_ksp_rtol 1e-6 \
> > -nlcg_ksp_ksp_max_it 300 \
> > -nlcg_ksp_max_it 200 \
> > -nlcg_ksp_converged_reason \
> > -nlcg_ksp_monitor_true_residual \
> >
> > I sometimes randomly will get an error like the following:
> >
> > Residual norms for nlcg_ solve.
> >   0 KSP unpreconditioned resid norm 3.371606868500e+04 true resid norm
> 3.371606868500e+04 ||r(i)||/||b|| 1.000000000000e+00
> >   1 KSP unpreconditioned resid norm 2.322590778002e+02 true resid norm
> 2.322590778002e+02 ||r(i)||/||b|| 6.888676137487e-03
> >   2 KSP unpreconditioned resid norm 8.262440884758e+01 true resid norm
> 8.262440884758e+01 ||r(i)||/||b|| 2.450594392232e-03
> >   3 KSP unpreconditioned resid norm 3.660428333809e+01 true resid norm
> 3.660428333809e+01 ||r(i)||/||b|| 1.085662853522e-03
> >   3 KSP unpreconditioned resid norm 0.000000000000e+00 true resid norm
>          -nan ||r(i)||/||b||           -nan
> > Linear nlcg_ solve did not converge due to DIVERGED_PC_FAILED iterations
> 3
> >                PC_FAILED due to SUBPC_ERROR
> >
> > This usually happens after a few nonlinear optimization iterations,
> meaning that it’s worked perfectly fine until this point.
> > How can using jacobi pc all of a sudden cause a NaN, if it’s worked
> perfectly fine before?
> >
> > Some other errors in the output log file are as follows, although I have
> no idea if they result from the above error or not:
> >
> > [13]PETSC ERROR: Object is in wrong state
> > [13]PETSC ERROR: Clearing DM of global vectors that has a global vector
> obtained with DMGetGlobalVector()
> > [13]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html
> for trouble shooting.
> > [13]PETSC ERROR: Petsc Release Version 3.11.1, Apr, 12, 2019
> >
> >
> > [27]PETSC ERROR: #1 DMClearGlobalVectors() line 196 in
> /state/std2/FEMI/PETSc/petsc-3.11.1/src/dm/interface/dmget.c
> > [27]PETSC ERROR: Configure options --with-clean=1 --with-scalar-type=complex
> > --with-debugging=0 --with-fortran=1
> > --with-blaslapack-dir=/state/std2/intel_2018/mkl
> > --with-mkl_pardiso-dir=/state/std2/intel_2018/mkl
> > --with-mkl_cpardiso-dir=/state/std2/intel_2018/mkl
> > --download-mumps=../external/mumps_v5.1.2-p1.tar.gz
> > --download-scalapack=../external/scalapack-2.0.2.tgz
> > --with-cc=mpiicc --with-fc=mpiifort --with-cxx=mpiicc
> > --FOPTFLAGS="-O3 -xHost" --COPTFLAGS="-O3 -xHost" --CXXOPTFLAGS="-O3 -xHost"
> >
> >
> > #2 DMDestroy() line 752 in
> /state/std2/FEMI/PETSc/petsc-3.11.1/src/dm/interface/dm.c
> > [72]PETSC ERROR: #3 PetscObjectDereference() line 624 in
> /state/std2/FEMI/PETSc/petsc-3.11.1/src/sys/objects/inherit.c
> > [72]PETSC ERROR: #4 PetscObjectListDestroy() line 156 in
> /state/std2/FEMI/PETSc/petsc-3.11.1/src/sys/objects/olist.c
> > [72]PETSC ERROR: #5 PetscHeaderDestroy_Private() line 122 in
> /state/std2/FEMI/PETSc/petsc-3.11.1/src/sys/objects/inherit.c
> > [72]PETSC ERROR: #6 VecDestroy() line 412 in
> /state/std2/FEMI/PETSc/petsc-3.11.1/src/vec/vec/interface/vector.c
> >
> >
> >
> > This is a large run taking many hours to get to this problem. I will try
> to run in debug mode, but given that this seems to be randomly happening
> (this has happened maybe 30% of the time I have used the fgmres option),
> there is no guarantee that will show anything useful. Valgrind is obviously
> out of the question for a large run, and I have yet to reproduce this on a
> smaller run.
> >
> > Anyone have any ideas as to what’s causing this?
> >
> > Thanks in advance,
> >
> > Randy M.
>
>