[petsc-users] How to debug PETSc error

Barry Smith bsmith at mcs.anl.gov
Mon Jun 30 01:37:57 CDT 2014


On Jun 30, 2014, at 1:30 AM, TAY wee-beng <zonexo at gmail.com> wrote:

> On 30/6/2014 1:53 PM, Barry Smith wrote:
>> On Jun 30, 2014, at 12:00 AM, TAY wee-beng <zonexo at gmail.com> wrote:
>> 
>>> Hi,
>>> 
>>> I have a CFD code which gives an error when solving the momentum eqn at time step = 1109. With the optimized build, KSPGetConvergedReason gives a value < 0.
>>    What value < 0? It is possible there is no bug. Bi-CG-stab (though it is stabilized) is not always stable and it can come to grief even if the matrix and right hand side are “reasonable”. Or the preconditioner may be generating inappropriately huge values (for example if ILU is being used inside it).
>> 
>>    Yes, don’t try to print the matrix or anything like that.
>> 
>>    I would start by trying with KSPBCGSL (manual page below). It is designed to be more stable than Bi-CG-stab. Try it with the default options; you can also increase the ell if it fails.
>> 
>>    GMRES is always a good bet but I am thinking you are not using it because it requires too much memory due to restart length.
>> 
>>   Barry
>> 
>> 
>> KSPBCGSL - Implements a slight variant of the Enhanced
>>                 BiCGStab(L) algorithm in (3) and (2).  The variation
>>                 concerns cases when either kappa0**2 or kappa1**2 is
>>                 negative due to round-off. Kappa0 has also been pulled
>>                 out of the denominator in the formula for ghat.
>> 
>>     References:
>>       1. G.L.G. Sleijpen, H.A. van der Vorst, "An overview of
>>          approaches for the stable computation of hybrid BiCG
>>          methods", Applied Numerical Mathematics: Transactions
>>          of IMACS, 19(3), pp 235-54, 1996.
>>       2. G.L.G. Sleijpen, H.A. van der Vorst, D.R. Fokkema,
>>          "BiCGStab(L) and other hybrid Bi-CG methods",
>>           Numerical Algorithms, 7, pp 75-109, 1994.
>>       3. D.R. Fokkema, "Enhanced implementation of BiCGStab(L)
>>          for solving linear systems of equations", preprint
>>          from www.citeseer.com.
>> 
>>    Contributed by: Joel M. Malard, email jm.malard at pnl.gov
>> 
>>    Options Database Keys:
>> +  -ksp_bcgsl_ell <ell> Number of Krylov search directions, defaults to 2 -- KSPBCGSLSetEll()
>> .  -ksp_bcgsl_cxpol - Use a convex function of the MinRes and OR polynomials after the BiCG step instead of default MinRes -- KSPBCGSLSetPol()
>> .  -ksp_bcgsl_mrpoly - Use the default MinRes polynomial after the BiCG step  -- KSPBCGSLSetPol()
>> .  -ksp_bcgsl_xres <res> Threshold used to decide when to refresh computed residuals -- KSPBCGSLSetXRes()
>> -  -ksp_bcgsl_pinv <true/false> - (de)activate use of pseudoinverse -- KSPBCGSLSetUsePseudoinverse()
>> 
>>    Level: beginner
>> 
>> .seealso:  KSPCreate(), KSPSetType(), KSPType (for list of available types), KSP, KSPFGMRES, KSPBCGS, KSPSetPCSide(), KSPBCGSLSetEll(), KSPBCGSLSetXRes()
> Hi Barry,
> 
> I mean when I run:
> 
> KSPGetConvergedReason(ksp_semi_xyz,reason,ierr)
> 
> reason < 0.

   Yes but exactly what value of reason? 
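
   The runtime option -ksp_converged_reason makes PETSc print the reason at the end of each solve. To turn an integer printed from the Fortran code into a name, a small lookup like the sketch below may help. The table is transcribed from the KSPConvergedReason enum in petscksp.h as an assumption; please check it against the header of the PETSc version in use. The helper name describe_reason is made up for this illustration.

```python
# Negative KSPConvergedReason codes, transcribed from PETSc's petscksp.h.
# Assumption: confirm the values against the header of your PETSc version.
KSP_DIVERGED_REASONS = {
    -2: "KSP_DIVERGED_NULL",
    -3: "KSP_DIVERGED_ITS",
    -4: "KSP_DIVERGED_DTOL",
    -5: "KSP_DIVERGED_BREAKDOWN",
    -6: "KSP_DIVERGED_BREAKDOWN_BICG",
    -7: "KSP_DIVERGED_NONSYMMETRIC",
    -8: "KSP_DIVERGED_INDEFINITE_PC",
    -9: "KSP_DIVERGED_NANORINF",
    -10: "KSP_DIVERGED_INDEFINITE_MAT",
}


def describe_reason(reason):
    """Return a readable label for an integer KSP converged reason."""
    if reason > 0:
        return "converged (reason %d)" % reason
    if reason == 0:
        return "still iterating (reason 0)"
    # Negative values mean the solve diverged or broke down.
    return KSP_DIVERGED_REASONS.get(reason, "unknown diverged reason %d" % reason)
```

   For example, a breakdown code would point at the Krylov method itself, while the NaN/Inf code would point at bad values in the matrix, right hand side, or preconditioner.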

> 
> I forgot to add that the problem happens with my newly modified code. My old code works perfectly. So during my modification, the matrix or vector may have been changed unintentionally. By right, the new and old code should give the same matrix, except for small differences due to truncation error. Based on this info, is there a better way to debug? I will also change to KSPBCGSL as suggested.
> 
   If you run the two versions next to each other do they produce very similar results for all those time steps? 
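
   One way to make that comparison concrete, as a rough sketch: have both versions write one scalar per time step (for instance the norm of the right hand side obtained with VecNorm) and look for the first step where the two sequences disagree. The function name first_divergence and the tolerance are assumptions for illustration; reading the values from the two runs is left out.

```python
def first_divergence(old_values, new_values, rtol=1e-6):
    """Return the index of the first step where the two runs disagree.

    Returns None if every compared pair agrees to relative tolerance rtol.
    The index is 0-based into the lists, not a time step number.
    """
    for step, (old, new) in enumerate(zip(old_values, new_values)):
        scale = max(abs(old), abs(new), 1e-300)
        # Written as `not (... <= ...)` so that NaN, which fails every
        # comparison, is reported as a disagreement.
        if not (abs(old - new) <= rtol * scale):
            return step
    return None
```

   If the first disagreement shows up long before step 1109, the two codes were already assembling different systems and the solver failure is only the symptom.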

   Can you very slowly change the old code to the new form and run these intermediate versions until you hit upon the change that causes the problem?
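
   If the modification can be split into an ordered list of small changes, the search can be shortened by bisection instead of testing every intermediate version in turn. This is a different technique from the step-by-step walk described above, and it only works under the assumption that once a change breaks the run, every later version is broken too. In the sketch below, is_bad is a hypothetical callback that builds and runs the code with the first k changes applied and reports whether the failure appears.

```python
def first_bad_change(num_changes, is_bad):
    """Return the smallest k in 1..num_changes for which is_bad(k) is True.

    Assumes the version with 0 changes is good, the version with all
    changes is bad, and that failure persists once it is introduced.
    """
    good, bad = 0, num_changes
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(mid):
            bad = mid   # failure already present with the first `mid` changes
        else:
            good = mid  # still fine, so the culprit comes later
    return bad
```

   With this, about log2(num_changes) builds are needed rather than num_changes.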

   Unfortunately I don’t have any easy answer.

  Barry


> Thanks
> 
> Regards.
>>> I retried with a debug build and it gives the error below. I submitted the job to a job scheduler on 32 procs. So what is the best way to debug? Should I print out the matrix? It is very big, since the grid size is 13 million.
>>> 
>>> Thanks. Regards.
>>> 
>>> [n12-10:13681] 31 more processes have sent help message help-mpi-btl-base.txt / btl:no-nics
>>> [n12-10:13681] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>>> [17]PETSC ERROR: ------------------------------------------------------------------------
>>> [17]PETSC ERROR: Caught signal number 8 FPE: Floating Point Exception,probably divide by zero
>>> [17]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
>>> [17]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
>>> [17]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
>>> [17]PETSC ERROR: likely location of problem given in stack below
>>> [17]PETSC ERROR: ---------------------  Stack Frames ------------------------------------
>>> [17]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
>>> [17]PETSC ERROR:       INSTEAD the line number of the start of the function
>>> [17]PETSC ERROR:       is given.
>>> [17]PETSC ERROR: [17] VecNorm_MPI line 57 /home/wtay/Codes/petsc-3.4.4/src/vec/vec/impls/mpi/pvec2.c
>>> [17]PETSC ERROR: [17] VecNorm line 224 /home/wtay/Codes/petsc-3.4.4/src/vec/vec/interface/rvector.c
>>> [17]PETSC ERROR: [17] KSPSolve_BCGS line 39 /home/wtay/Codes/petsc-3.4.4/src/ksp/ksp/impls/bcgs/bcgs.c
>>> [17]PETSC ERROR: [17] KSPSolve line 356 /home/wtay/Codes/petsc-3.4.4/src/ksp/ksp/interface/itfunc.c
>>> 
>>> -- 
>>> Thank you
>>> 
>>> Yours sincerely,
>>> 
>>> TAY wee-beng


