[petsc-dev] MUMPS silent errors
Zhang, Junchao
jczhang at mcs.anl.gov
Thu Oct 11 10:25:26 CDT 2018
I checked with the MUMPS developers, and they said MUMPS would give a new error code if a previous step had failed. That is nice. Hong, I found code like this in mumps.c, which does not follow the 'no error if failure' rule, does it?
mumps->id.job = JOB_SOLVE;
PetscMUMPS_c(mumps);
if (mumps->id.INFOG(1) < 0) SETERRQ1(PETSC_COMM_SELF,PETSC_ERR_LIB,"Error reported by MUMPS in solve phase: INFOG(1)=%d\n",mumps->id.INFOG(1));
________________________________
From: Zhang, Junchao
Sent: Wednesday, October 10, 2018 10:54:21 PM
To: Smith, Barry F.
Cc: petsc-dev
Subject: Re: [petsc-dev] MUMPS silent errors
What is embarrassing is that the user sent me beautiful -log_view output and began doing performance comparisons. The whole thing was meaningless only because he forgot to check the converged reason on a direct solver.
The MUMPS manual says "A call to MUMPS with JOB=2 must be preceded by a call with JOB=1 on the same instance", and has similar language for the other phases. That implies we should at least not call MatSolve_MUMPS after a failed factorization, since it might crash the code.
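A minimal sketch of the kind of guard suggested above, written as a plain C model rather than PETSc or MUMPS code. The names `DirectSolver`, `factor_numeric`, `solve`, and `ERR_WRONG_STATE` are hypothetical; the only thing borrowed from the thread is the convention that a negative status (like INFOG(1) < 0) means the phase failed.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-ins for illustration; not the PETSc or MUMPS API. */
#define ERR_WRONG_STATE (-1000)  /* made-up code: "solve called without a valid factorization" */

typedef struct {
  bool factored;   /* true only after a successful numeric factorization */
  int  last_info;  /* last status reported; negative means failure */
} DirectSolver;

/* Pretend numeric factorization. `status` simulates what the library reports. */
static int factor_numeric(DirectSolver *s, int status)
{
  s->last_info = status;
  s->factored  = (status >= 0);  /* a failed factorization invalidates any earlier one */
  return status < 0 ? status : 0;
}

/* The solve phase refuses to run unless the factorization succeeded. */
static int solve(DirectSolver *s)
{
  if (!s->factored) {
    fprintf(stderr, "solve skipped: factorization failed (info=%d)\n", s->last_info);
    return ERR_WRONG_STATE;
  }
  return 0;  /* the real solve would go here */
}
```

With this model, a caller that ignores the return value of `factor_numeric` still cannot reach the solve phase with an invalid factorization; `solve` returns `ERR_WRONG_STATE` instead of touching the factors.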
________________________________
From: Smith, Barry F.
Sent: Wednesday, October 10, 2018 6:41:20 PM
To: Zhang, Junchao
Cc: petsc-dev
Subject: Re: [petsc-dev] MUMPS silent errors
I looked at the code and it is handled in the PETSc way. The user should not expect KSP to error just because it was unable to solve a linear system; they should be calling KSPGetConvergedReason() after KSPSolve() to check that the solution was computed successfully.
Barry
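The two behaviors discussed in this thread (record the failure and let the caller check it, or turn the failure into an error on request) can be modeled in a few lines of plain C. This is only a sketch of the policy, not PETSc code: `LinearSolve`, `linear_solve`, `solution_is_usable`, and the `REASON_*` values are hypothetical, and the flag merely plays the role that -ksp_error_if_not_converged plays in PETSc. The sign convention (positive reason means converged, negative means failed) follows the one used by KSPConvergedReason.

```c
#include <stdbool.h>

/* Hypothetical model of the reporting policy; not the PETSc API. */
enum { REASON_OK = 1, REASON_FACTOR_FAILED = -1 };  /* made-up values */

typedef struct {
  bool error_if_not_converged;  /* plays the role of -ksp_error_if_not_converged */
  int  reason;                  /* > 0 converged, < 0 failed; valid after linear_solve() */
} LinearSolve;

/* Returns nonzero only when the caller asked for failures to be hard errors.
   `factor_status` simulates the status of the factorization (negative = failed). */
static int linear_solve(LinearSolve *ls, int factor_status)
{
  ls->reason = (factor_status < 0) ? REASON_FACTOR_FAILED : REASON_OK;
  if (ls->reason < 0 && ls->error_if_not_converged) return 1;
  return 0;  /* "success" here only means no error was raised */
}

/* The check a caller must make after every solve when errors are not forced. */
static bool solution_is_usable(const LinearSolve *ls)
{
  return ls->reason > 0;
}
```

In the default mode `linear_solve` returns 0 even when the factorization failed, so a caller that skips `solution_is_usable` proceeds with a wrong answer, which is the situation described in the original report.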
> On Oct 10, 2018, at 2:12 PM, Zhang, Junchao <jczhang at mcs.anl.gov> wrote:
>
> I ran into a case where MUMPS numeric factorization returned error code -9 in mumps->id.INFOG(1), but A->erroriffailure was false in the following code in mumps.c:
> PetscErrorCode MatFactorNumeric_MUMPS(Mat F,Mat A,const MatFactorInfo *info)
> {
>   ...
>   PetscMUMPS_c(mumps);
>   if (mumps->id.INFOG(1) < 0) {
>     if (A->erroriffailure) {
>       SETERRQ2(PETSC_COMM_SELF,PETSC_ERR_LIB,"Error reported by MUMPS in numerical factorization phase: INFOG(1)=%d, INFO(2)=%d\n",mumps->id.INFOG(1),mumps->id.INFO(2));
>     } else {
>       if (mumps->id.INFOG(1) == -10) { /* numerically singular matrix */
>         PetscInfo2(F,"matrix is numerically singular, INFOG(1)=%d, INFO(2)=%d\n",mumps->id.INFOG(1),mumps->id.INFO(2));
>         F->factorerrortype = MAT_FACTOR_NUMERIC_ZEROPIVOT;
>
> The code continued to KSPSolve and finished successfully (with a wrong answer). The user did not call KSPGetConvergedReason() after KSPSolve. I found I had to either add -ksp_error_if_not_converged or call KSPSetErrorIfNotConverged(ksp,PETSC_TRUE) to make the code fail.
> Is this expected? In my view, it is dangerous. If MUMPS fails in one stage, PETSc should not proceed to the next stage because it may hang there.