[petsc-users] superlu_dist produces random results

Wed Nov 15 16:24:35 CST 2017

On Wed, Nov 15, 2017 at 2:52 PM, Smith, Barry F. <bsmith at mcs.anl.gov> wrote:

>
>
> > On Nov 15, 2017, at 3:36 PM, Kong, Fande <fande.kong at inl.gov> wrote:
> >
> > Hi Barry,
> >
> > Thanks for your reply. I was wondering why this happens only when we use
> > superlu_dist. I am trying to understand the algorithm in superlu_dist. If
> > we use ASM or MUMPS, we do not see these differences.
> >
> > The differences actually are NOT meaningless. In fact, we have a real
> > transient application that presents this issue. When we run the
> > simulation with superlu_dist in parallel for thousands of time steps, the
> > final physics solution looks totally different from one run to another. The
> > differences are no longer acceptable. For a steady problem, the
> > difference may be meaningless, but it is significant for the transient
> > problem.
>
>   I submit that the "physics solution" of all of these runs is equally
> right and equally wrong. If the solutions are very different due to a small
> perturbation, then something is wrong with the model or the integrator; I
> don't think you can blame the linear solver (see below).
>
>
> > This makes the solution not reproducible, and we cannot even set a
> > target solution in the test system because the solution is so different
> > from one run to another. I guess there may be a tiny bug in
> > superlu_dist or the PETSc interface to superlu_dist.
>
>   This is possible, but it is also possible that this is due to normal
> round-off inside of SuperLU_DIST.
>
>    Since you have SuperLU_DIST inside a nonlinear iteration, it shouldn't
> really matter exactly how well SuperLU_DIST does. The nonlinear iteration
> essentially does defect correction for you; are you making sure that the
> nonlinear iteration always works for every timestep? For example, confirm
> that SNESGetConvergedReason() is always positive.
>

Definitely it could be something wrong on my side.  But let us focus on the
simple question first.

To make the discussion a little simpler, let us go back to the simple problem
(heat conduction). Now I want to understand why this happens with
superlu_dist only. When we use ASM or MUMPS, why do we not see any
differences from one run to another? I posted the residual histories for
MUMPS and ASM. The residual norms show no differences between runs when
using MUMPS or ASM. Does superlu_dist have higher round-off than
other solvers?
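
For reference, here is a minimal sketch of the general mechanism behind
order-dependent round-off. It is not taken from superlu_dist, and nothing in
this thread establishes that this is what happens inside it; the helper name
sum_in_order and the three sample values are made up for illustration. The
point is only that floating-point addition is not associative, so any code
that accumulates the same terms in a different order from run to run can
return slightly different results.

```python
def sum_in_order(values, order):
    """Accumulate values[i] for i in order using plain double-precision addition."""
    total = 0.0
    for i in order:
        total += values[i]
    return total


# Hypothetical data chosen so the effect is easy to see: 1.0 is smaller than
# the spacing between adjacent doubles near 1e16, so it is lost when added there.
values = [1.0e16, 1.0, -1.0e16]

a = sum_in_order(values, [0, 1, 2])  # (1e16 + 1.0) - 1e16
b = sum_in_order(values, [0, 2, 1])  # (1e16 - 1e16) + 1.0

print("order 0,1,2:", a)
print("order 0,2,1:", b)
print("same result:", a == b)
```

Both calls add exactly the same three numbers; only the order differs. A
solver that always accumulates in a fixed order would be bitwise reproducible
even though it has the same round-off.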

MUMPS run1:

 0 Nonlinear |R| = 9.447423e+03
      0 Linear |R| = 9.447423e+03
      1 Linear |R| = 1.013384e-02
      2 Linear |R| = 4.020993e-08
 1 Nonlinear |R| = 1.404678e-02
      0 Linear |R| = 1.404678e-02
      1 Linear |R| = 4.836162e-08
      2 Linear |R| = 7.055620e-14
 2 Nonlinear |R| = 4.836392e-08

MUMPS run2:

 0 Nonlinear |R| = 9.447423e+03
      0 Linear |R| = 9.447423e+03
      1 Linear |R| = 1.013384e-02
      2 Linear |R| = 4.020993e-08
 1 Nonlinear |R| = 1.404678e-02
      0 Linear |R| = 1.404678e-02
      1 Linear |R| = 4.836162e-08
      2 Linear |R| = 7.055620e-14
 2 Nonlinear |R| = 4.836392e-08

MUMPS run3:

 0 Nonlinear |R| = 9.447423e+03
      0 Linear |R| = 9.447423e+03
      1 Linear |R| = 1.013384e-02
      2 Linear |R| = 4.020993e-08
 1 Nonlinear |R| = 1.404678e-02
      0 Linear |R| = 1.404678e-02
      1 Linear |R| = 4.836162e-08
      2 Linear |R| = 7.055620e-14
 2 Nonlinear |R| = 4.836392e-08

MUMPS run4:

 0 Nonlinear |R| = 9.447423e+03
      0 Linear |R| = 9.447423e+03
      1 Linear |R| = 1.013384e-02
      2 Linear |R| = 4.020993e-08
 1 Nonlinear |R| = 1.404678e-02
      0 Linear |R| = 1.404678e-02
      1 Linear |R| = 4.836162e-08
      2 Linear |R| = 7.055620e-14
 2 Nonlinear |R| = 4.836392e-08

ASM run1:

 0 Nonlinear |R| = 9.447423e+03
      0 Linear |R| = 9.447423e+03
      1 Linear |R| = 6.189229e+03
      2 Linear |R| = 3.252487e+02
      3 Linear |R| = 3.485174e+01
      4 Linear |R| = 8.600695e+00
      5 Linear |R| = 3.333942e+00
      6 Linear |R| = 1.706112e+00
      7 Linear |R| = 5.047863e-01
      8 Linear |R| = 2.337297e-01
      9 Linear |R| = 1.071627e-01
     10 Linear |R| = 4.692177e-02
     11 Linear |R| = 1.340717e-02
     12 Linear |R| = 4.753951e-03
 1 Nonlinear |R| = 2.320271e-02
      0 Linear |R| = 2.320271e-02
      1 Linear |R| = 4.367880e-03
      2 Linear |R| = 1.407852e-03
      3 Linear |R| = 6.036360e-04
      4 Linear |R| = 1.867661e-04
      5 Linear |R| = 8.760076e-05
      6 Linear |R| = 3.260519e-05
      7 Linear |R| = 1.435418e-05
      8 Linear |R| = 4.532875e-06
      9 Linear |R| = 2.439053e-06
     10 Linear |R| = 7.998549e-07
     11 Linear |R| = 2.428064e-07
     12 Linear |R| = 4.766918e-08
     13 Linear |R| = 1.713748e-08
 2 Nonlinear |R| = 3.671573e-07

ASM run2:

 0 Nonlinear |R| = 9.447423e+03
      0 Linear |R| = 9.447423e+03
      1 Linear |R| = 6.189229e+03
      2 Linear |R| = 3.252487e+02
      3 Linear |R| = 3.485174e+01
      4 Linear |R| = 8.600695e+00
      5 Linear |R| = 3.333942e+00
      6 Linear |R| = 1.706112e+00
      7 Linear |R| = 5.047863e-01
      8 Linear |R| = 2.337297e-01
      9 Linear |R| = 1.071627e-01
     10 Linear |R| = 4.692177e-02
     11 Linear |R| = 1.340717e-02
     12 Linear |R| = 4.753951e-03
 1 Nonlinear |R| = 2.320271e-02
      0 Linear |R| = 2.320271e-02
      1 Linear |R| = 4.367880e-03
      2 Linear |R| = 1.407852e-03
      3 Linear |R| = 6.036360e-04
      4 Linear |R| = 1.867661e-04
      5 Linear |R| = 8.760076e-05
      6 Linear |R| = 3.260519e-05
      7 Linear |R| = 1.435418e-05
      8 Linear |R| = 4.532875e-06
      9 Linear |R| = 2.439053e-06
     10 Linear |R| = 7.998549e-07
     11 Linear |R| = 2.428064e-07
     12 Linear |R| = 4.766918e-08
     13 Linear |R| = 1.713748e-08
 2 Nonlinear |R| = 3.671573e-07

ASM run3:

 0 Nonlinear |R| = 9.447423e+03
      0 Linear |R| = 9.447423e+03
      1 Linear |R| = 6.189229e+03
      2 Linear |R| = 3.252487e+02
      3 Linear |R| = 3.485174e+01
      4 Linear |R| = 8.600695e+00
      5 Linear |R| = 3.333942e+00
      6 Linear |R| = 1.706112e+00
      7 Linear |R| = 5.047863e-01
      8 Linear |R| = 2.337297e-01
      9 Linear |R| = 1.071627e-01
     10 Linear |R| = 4.692177e-02
     11 Linear |R| = 1.340717e-02
     12 Linear |R| = 4.753951e-03
 1 Nonlinear |R| = 2.320271e-02
      0 Linear |R| = 2.320271e-02
      1 Linear |R| = 4.367880e-03
      2 Linear |R| = 1.407852e-03
      3 Linear |R| = 6.036360e-04
      4 Linear |R| = 1.867661e-04
      5 Linear |R| = 8.760076e-05
      6 Linear |R| = 3.260519e-05
      7 Linear |R| = 1.435418e-05
      8 Linear |R| = 4.532875e-06
      9 Linear |R| = 2.439053e-06
     10 Linear |R| = 7.998549e-07
     11 Linear |R| = 2.428064e-07
     12 Linear |R| = 4.766918e-08
     13 Linear |R| = 1.713748e-08
 2 Nonlinear |R| = 3.671573e-07

ASM run4:

 0 Nonlinear |R| = 9.447423e+03
      0 Linear |R| = 9.447423e+03
      1 Linear |R| = 6.189229e+03
      2 Linear |R| = 3.252487e+02
      3 Linear |R| = 3.485174e+01
      4 Linear |R| = 8.600695e+00
      5 Linear |R| = 3.333942e+00
      6 Linear |R| = 1.706112e+00
      7 Linear |R| = 5.047863e-01
      8 Linear |R| = 2.337297e-01
      9 Linear |R| = 1.071627e-01
     10 Linear |R| = 4.692177e-02
     11 Linear |R| = 1.340717e-02
     12 Linear |R| = 4.753951e-03
 1 Nonlinear |R| = 2.320271e-02
      0 Linear |R| = 2.320271e-02
      1 Linear |R| = 4.367880e-03
      2 Linear |R| = 1.407852e-03
      3 Linear |R| = 6.036360e-04
      4 Linear |R| = 1.867661e-04
      5 Linear |R| = 8.760076e-05
      6 Linear |R| = 3.260519e-05
      7 Linear |R| = 1.435418e-05
      8 Linear |R| = 4.532875e-06
      9 Linear |R| = 2.439053e-06
     10 Linear |R| = 7.998549e-07
     11 Linear |R| = 2.428064e-07
     12 Linear |R| = 4.766918e-08
     13 Linear |R| = 1.713748e-08
 2 Nonlinear |R| = 3.671573e-07
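
The run-to-run comparison can also be done mechanically instead of by eye.
Below is a minimal sketch, assuming the logs have exactly the
"N Nonlinear |R| = x" / "N Linear |R| = x" layout shown in this thread. The
function names parse_history and max_relative_difference are hypothetical
helpers, not part of PETSc or MOOSE. The two sample logs are superlu_dist
run 1 and run 2 from my first message, quoted further down.

```python
import re

# Matches lines such as " 1 Nonlinear |R| = 1.404678e-02".
LINE = re.compile(r"^\s*(\d+)\s+(Nonlinear|Linear)\s+\|R\|\s+=\s+(\S+)")


def parse_history(text):
    """Return a list of (iteration, kind, residual) tuples from a residual log."""
    history = []
    for line in text.splitlines():
        match = LINE.match(line)
        if match:
            history.append((int(match.group(1)), match.group(2), float(match.group(3))))
    return history


def max_relative_difference(h1, h2):
    """Largest relative difference between corresponding residuals of two runs."""
    if [(i, k) for i, k, _ in h1] != [(i, k) for i, k, _ in h2]:
        raise ValueError("the two histories have different iteration patterns")
    worst = 0.0
    for (_, _, r1), (_, _, r2) in zip(h1, h2):
        scale = max(abs(r1), abs(r2))
        if scale > 0.0:
            worst = max(worst, abs(r1 - r2) / scale)
    return worst


SUPERLU_RUN1 = """
 0 Nonlinear |R| = 9.447423e+03
      0 Linear |R| = 9.447423e+03
      1 Linear |R| = 1.013384e-02
      2 Linear |R| = 4.020995e-08
 1 Nonlinear |R| = 1.404678e-02
      0 Linear |R| = 1.404678e-02
      1 Linear |R| = 5.104757e-08
      2 Linear |R| = 7.699637e-14
 2 Nonlinear |R| = 5.106418e-08
"""

SUPERLU_RUN2 = """
 0 Nonlinear |R| = 9.447423e+03
      0 Linear |R| = 9.447423e+03
      1 Linear |R| = 1.013384e-02
      2 Linear |R| = 4.020995e-08
 1 Nonlinear |R| = 1.404678e-02
      0 Linear |R| = 1.404678e-02
      1 Linear |R| = 5.109913e-08
      2 Linear |R| = 7.189091e-14
 2 Nonlinear |R| = 5.111591e-08
"""

run1 = parse_history(SUPERLU_RUN1)
run2 = parse_history(SUPERLU_RUN2)
diff = max_relative_difference(run1, run2)
print("max relative difference between runs:", diff)
```

For these two runs the printed residuals agree through the first Newton step
and only start to differ in the second one, with the largest relative gap in
the final linear residual (the one at the 1e-14 level). Comparing any two of
the MUMPS or ASM logs above the same way gives zero at the printed precision.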

>
>
> >
> >
> > Fande,
> >
> >
> >
> >
> > On Wed, Nov 15, 2017 at 1:59 PM, Smith, Barry F. <bsmith at mcs.anl.gov> wrote:
> >
> >   Meaningless differences
> >
> >
> > > On Nov 15, 2017, at 2:26 PM, Kong, Fande <fande.kong at inl.gov> wrote:
> > >
> > > Hi,
> > >
> > > There is a heat conduction problem. When superlu_dist is used as a
> > > preconditioner, the results vary randomly from one run to another. Is
> > > there a randomized algorithm in superlu_dist? If we use ASM or MUMPS as
> > > the preconditioner, we don't have this issue.
> > >
> > > run 1:
> > >
> > >  0 Nonlinear |R| = 9.447423e+03
> > >       0 Linear |R| = 9.447423e+03
> > >       1 Linear |R| = 1.013384e-02
> > >       2 Linear |R| = 4.020995e-08
> > >  1 Nonlinear |R| = 1.404678e-02
> > >       0 Linear |R| = 1.404678e-02
> > >       1 Linear |R| = 5.104757e-08
> > >       2 Linear |R| = 7.699637e-14
> > >  2 Nonlinear |R| = 5.106418e-08
> > >
> > >
> > > run 2:
> > >
> > >  0 Nonlinear |R| = 9.447423e+03
> > >       0 Linear |R| = 9.447423e+03
> > >       1 Linear |R| = 1.013384e-02
> > >       2 Linear |R| = 4.020995e-08
> > >  1 Nonlinear |R| = 1.404678e-02
> > >       0 Linear |R| = 1.404678e-02
> > >       1 Linear |R| = 5.109913e-08
> > >       2 Linear |R| = 7.189091e-14
> > >  2 Nonlinear |R| = 5.111591e-08
> > >
> > > run 3:
> > >
> > >  0 Nonlinear |R| = 9.447423e+03
> > >       0 Linear |R| = 9.447423e+03
> > >       1 Linear |R| = 1.013384e-02
> > >       2 Linear |R| = 4.020995e-08
> > >  1 Nonlinear |R| = 1.404678e-02
> > >       0 Linear |R| = 1.404678e-02
> > >       1 Linear |R| = 5.104942e-08
> > >       2 Linear |R| = 7.465572e-14
> > >  2 Nonlinear |R| = 5.106642e-08
> > >
> > > run 4:
> > >
> > >  0 Nonlinear |R| = 9.447423e+03
> > >       0 Linear |R| = 9.447423e+03
> > >       1 Linear |R| = 1.013384e-02
> > >       2 Linear |R| = 4.020995e-08
> > >  1 Nonlinear |R| = 1.404678e-02
> > >       0 Linear |R| = 1.404678e-02
> > >       1 Linear |R| = 5.102730e-08
> > >       2 Linear |R| = 7.132220e-14
> > >  2 Nonlinear |R| = 5.104442e-08
> > >
> > > Solver details:
> > >
> > > SNES Object: 8 MPI processes
> > >   type: newtonls
> > >   maximum iterations=15, maximum function evaluations=10000
> > >   tolerances: relative=1e-08, absolute=1e-11, solution=1e-50
> > >   total number of linear solver iterations=4
> > >   total number of function evaluations=7
> > >   norm schedule ALWAYS
> > >   SNESLineSearch Object: 8 MPI processes
> > >     type: basic
> > >     maxstep=1.000000e+08, minlambda=1.000000e-12
> > >     tolerances: relative=1.000000e-08, absolute=1.000000e-15, lambda=1.000000e-08
> > >     maximum iterations=40
> > >   KSP Object: 8 MPI processes
> > >     type: gmres
> > >       restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement
> > >       happy breakdown tolerance 1e-30
> > >     maximum iterations=100, initial guess is zero
> > >     tolerances:  relative=1e-06, absolute=1e-50, divergence=10000.
> > >     right preconditioning
> > >     using UNPRECONDITIONED norm type for convergence test
> > >   PC Object: 8 MPI processes
> > >     type: lu
> > >       out-of-place factorization
> > >       tolerance for zero pivot 2.22045e-14
> > >       matrix ordering: natural
> > >       factor fill ratio given 0., needed 0.
> > >         Factored matrix follows:
> > >           Mat Object: 8 MPI processes
> > >             type: superlu_dist
> > >             rows=7925, cols=7925
> > >             package used to perform factorization: superlu_dist
> > >             total: nonzeros=0, allocated nonzeros=0
> > >             total number of mallocs used during MatSetValues calls =0
> > >               SuperLU_DIST run parameters:
> > >                 Process grid nprow 4 x npcol 2
> > >                 Equilibrate matrix TRUE
> > >                 Matrix input mode 1
> > >                 Replace tiny pivots FALSE
> > >                 Use iterative refinement TRUE
> > >                 Processors in row 4 col partition 2
> > >                 Row permutation LargeDiag
> > >                 Column permutation METIS_AT_PLUS_A
> > >                 Parallel symbolic factorization FALSE
> > >                 Repeated factorization SamePattern
> > >     linear system matrix followed by preconditioner matrix:
> > >     Mat Object: 8 MPI processes
> > >       type: mffd
> > >       rows=7925, cols=7925
> > >         Matrix-free approximation:
> > >           err=1.49012e-08 (relative error in function evaluation)
> > >           Using wp compute h routine
> > >               Does not compute normU
> > >     Mat Object: () 8 MPI processes
> > >       type: mpiaij
> > >       rows=7925, cols=7925
> > >       total: nonzeros=63587, allocated nonzeros=63865
> > >       total number of mallocs used during MatSetValues calls =0
> > >         not using I-node (on process 0) routines
> > >
> > >
> > > Fande,
> > >
> > >
> >
> >
>
>