[petsc-users] roundoff differences
Dominik Szczerba
dominik at itis.ethz.ch
Thu Aug 18 03:15:21 CDT 2011
Generally you should not expect digit-for-digit identical results between
parallel runs, and yes, roundoff differences can indeed result from the
machine load varying between runs. If a problem is ill-posed or poorly
preconditioned, these differences may even lead to divergence in some cases.
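
As a minimal illustration in plain C (no MPI needed): floating-point
addition is not associative, so the same numbers combined in a different
order can give a different result, and a parallel reduction is free to
change that order between runs.

    #include <stdio.h>

    int main(void)
    {
        /* the same four numbers, summed in two different orders */
        double a = 1.0e16, b = 1.0, c = -1.0e16, d = 1.0;
        double s1 = ((a + b) + c) + d;  /* 1e16 + 1 rounds back to 1e16, so s1 = 1 */
        double s2 = ((a + c) + b) + d;  /* exact cancellation first, so s2 = 2 */
        printf("s1 = %g, s2 = %g\n", s1, s2);
        return 0;
    }
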
Dominik
On Thu, Aug 18, 2011 at 9:49 AM, Harald Pfeiffer
<pfeiffer at cita.utoronto.ca> wrote:
> Hello,
>
> we use PETSc to solve the nonlinear system arising from pseudo-spectral
> discretization of certain elliptic PDEs in Einstein's equations. When
> running the same job multiple times on the same number of processors on the
> same workstation, we find roundoff differences. Is this expected, e.g.
> because MPI reduction calls may behave differently depending on the load of
> the machine? Or should we be concerned and investigate further?
>
> Thanks,
> Harald
>
> -------- Original Message --------
> Subject: Re: Quick question about derivatives in SpEC
> Date: Tue, 16 Aug 2011 09:45:27 -0400
> From: Gregory B. Cook <cookgb at wfu.edu>
> To: Harald Pfeiffer <pfeiffer at cita.utoronto.ca>
> CC: Larry Kidder <kidder at astro.cornell.edu>, Mark Scheel <scheel at tapir.caltech.edu>
>
> Hi Harald,
>
> All of the tests I was doing were on the same 8 cores on my office
> workstation. It is running Ubuntu 11, and uses the default OpenMPI
> communication approach. To make sure it wasn't something I was doing, I
> ran two elliptic solves of the ExtendedConformalThinSandwich() volume
> terms. Here are the outputs of snes.dat for the different levels:
>
> Run 1 Run 2
> Six0/snes.dat
> 0 7.3385297958166698 0 7.3385297958166698
> 1 5.1229060531500723 1 5.1229060531500723
> 2 0.32616852761238285 2 0.32616852761238285
> 3 0.012351417186533147 3 0.012351417186800266 <*****
> 4 9.7478354935351385e-06 4 9.7478351511500114e-06
> Six1/snes.dat
> 0 0.13405558402489681 0 0.13405558402540407
> 1 0.00068002100028642610 1 0.00068002089609322440
> 2 6.8764357250058596e-08 2 6.3738394418031232e-08
> Six2/snes.dat
> 0 0.0063028244769771681 0 0.0063028058475922306
> 1 1.4538921141731714e-06 1 1.4545032695605256e-06
> Six3/snes.dat
> 0 0.00061476105672438877 0 0.00061476093499534406
> 1 6.0267672358059814e-08 1 5.4897793428123648e-08
> Six4/snes.dat
> 0 0.00053059501859595651 0 0.00053059591479892143
> 1 4.8003269489205705e-08 1 4.8079799390886591e-08
> Six5/snes.dat
> 0 3.6402372419546429e-05 0 3.6402169997838670e-05
> 1 5.3117360561476420e-09 1 5.2732089856727503e-09
>
> The differences are clearly at the level of roundoff, but it is
> "strange" that you cannot reproduce identical results.
>
> I've attached all of the .input files for this run in case you want to
> try to reproduce my findings.
>
> Greg
>
> On 08/16/2011 06:21 AM, Harald Pfeiffer wrote:
> > Hi Greg,
> >
> > some thoughts:
> >
> > PETSc uses standard MPI reduction calls, which may give results that
> > differ by roundoff. We have definitely seen this happen for different
> > numbers of processes, but perhaps it also depends on where in a cluster
> > your jobs run (the network topology differs depending on whether all
> > processors sit on the same rack or are split among racks, and network
> > communication may be load-balanced dynamically).
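> >
> > For what it's worth, here is a minimal sketch (with a made-up per-rank
> > value) of the kind of global sum that sits inside PETSc's vector dot
> > products and norms; the MPI standard leaves the association order of
> > the partial sums unspecified, so the last digits of "global" may vary
> > from run to run or with process placement:
> >
> >     #include <mpi.h>
> >     #include <stdio.h>
> >
> >     int main(int argc, char **argv)
> >     {
> >         int rank;
> >         double local, global;
> >         MPI_Init(&argc, &argv);
> >         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> >         local = 1.0 / (1.0 + rank);  /* hypothetical local value */
> >         MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM,
> >                       MPI_COMM_WORLD);
> >         if (rank == 0) printf("global = %.17g\n", global);
> >         MPI_Finalize();
> >         return 0;
> >     }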
> >
> > You might want to try reserving a few nodes interactively, and then
> > running the elliptic solver multiple times on this same set of nodes.
> >
> > The Mover does indeed do load-balanced interpolation, but the MPI
> > communication involved when doing so should not prevent identical results.
> >
> > Once there are roundoff differences, they are typically amplified during
> > a PETSc linear solve: the iterative algorithm takes a different path
> > toward the solution, and a difference of 1e-10 doesn't seem excessive.
> >
> > Harald
> >
> > ps. Preconditioning is done differently on-processor and off-processor,
> > and therefore depends strongly on the processor count. So if you were to
> > change the number of processors, the iterative solve would proceed very
> > differently.
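> >
> > (Assuming you have not overridden the defaults: in parallel PETSc
> > preconditions with block Jacobi, one ILU(0) block per process, i.e.
> > the equivalent of the runtime options
> >
> >     -pc_type bjacobi -sub_pc_type ilu
> >
> > so the preconditioner itself changes with the partitioning.)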
> >
> > On 8/15/11 10:27 PM, Gregory B. Cook wrote:
> >> Hi Larry,
> >>
> >> I ran a check using the default ExtendedConformalThinSandwich() volume
> >> terms and this also produced roundoff error differences between
> >> identical runs, so I feel better about that. I am using the same
> >> number of processors, but if there is any kind of dynamic load
> >> balancing for interpolation/communication/etc, then I can see that
> >> different runs might end up using different boundary communications.
> >> Maybe that's all there is to it?
> >>
> >> Greg
> >>
> >> On 08/15/2011 04:16 PM, Larry Kidder wrote:
> >>> Hi Greg,
> >>>
> >>> Harald is traveling, so I am not sure when he will answer.
> >>> My vague recollection is that there is something about how PETSc does
> >>> preconditioning in parallel that keeps it from producing the same result,
> >>> but I don't recall whether this happens in general or only if you change
> >>> the distribution of processes.
> >>>
> >>> Larry
> >>>
> >>> Gregory B. Cook wrote:
> >>>> Hi Guys,
> >>>>
> >>>> I have a follow-up question that may be tangentially related to my
> >>>> original question about derivatives. This one is targeted at Harald.
> >>>>
> >>>> When I run a version of my code where the very small errors in the
> >>>> derivative of the metric are not present (I code them in differently),
> >>>> I find that running the exact same input files successively does not
> >>>> produce exactly the same results. This is a multi-level elliptic solve
> >>>> on a complex domain for binary black holes. On Level-0, the
> >>>> results returned in snes.dat are identical. On Level-1, the initial
> >>>> and second snes norms are identical, but the third differs. After
> >>>> this, all snes norms differ.
> >>>>
> >>>> Is this to be expected? Does PETSc not produce identical results on
> >>>> consecutive solves with the same starting point? Is there something in
> >>>> the MPI communication that means that the results should differ? The
> >>>> differences start at the order of 10^-13, but grow by the 6th level to
> >>>> be of order 10^-10.
> >>>>
> >>>> Greg
> >>>>
> >>>> On 08/15/2011 01:02 PM, Larry Kidder wrote:
> >>>>> Hi Greg,
> >>>>>
> >>>>> Did you compute the norm of the metric itself?
> >>>>> What domain did you use?
> >>>>>
> >>>>> Larry
> >>>>>
> >>>>> Gregory B. Cook wrote:
> >>>>>> Hi Guys,
> >>>>>>
> >>>>>> I was doing a simple test as part of debugging some code I'm writing.
> >>>>>> I ended up placing the following relevant lines of code into the
> >>>>>> EllipticItems.input and EllipticObservers.input files:
> >>>>>>
> >>>>>> ---EllipticItems.input---
> >>>>>> EvaluateMatrixFormula(Output=InvConformalMetric; Dim=3; Symm=11;
> >>>>>> M[0,0]=1; M[1,1]=1; M[2,2]=1),
> >>>>>> FirstDeriv(Input=InvConformalMetric; Output=dInvConformalMetric),
> >>>>>> SecondDeriv(Input=InvConformalMetric; Output=ddInvConformalMetric),
> >>>>>>
> >>>>>> FlattenDeriv(Input=dInvConformalMetric;
> >>>>>> Output=fdInvConformalMetric;DerivPosition=Last),
> >>>>>> FlattenDeriv(Input=ddInvConformalMetric;
> >>>>>> Output=fddInvConformalMetric;DerivPosition=Last),
> >>>>>>
> >>>>>> ---EllipticObservers.input---
> >>>>>> NormOfTensor(Input=fdInvConformalMetric, fddInvConformalMetric;
> >>>>>> Filename=dInvCM_L2.dat;Op=L2; MetricForTensors=None),
> >>>>>> NormOfTensor(Input=fdInvConformalMetric, fddInvConformalMetric;
> >>>>>> Filename=dInvCM_Linf.dat;Op=Linf; MetricForTensors=None),
> >>>>>>
> >>>>>>
> >>>>>> The odd thing is that the norms that I get out are not exactly zero.
> >>>>>> They are very small, but I'm taking the first and second derivatives
> >>>>>> of the identity matrix, so I would expect them to evaluate to exactly
> >>>>>> zero. The fact that they don't leads me to think that there is
> >>>>>> something wrong either in my code or in how I have written the input
> >>>>>> files.
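> >>>>>>
> >>>>>> For comparison, here is a standalone check, independent of SpEC (a
> >>>>>> minimal sketch; the resolution N is arbitrary): a bare Chebyshev
> >>>>>> differentiation matrix applied to the constant function 1 also leaves
> >>>>>> roundoff-sized residuals, because its rows sum to zero only in exact
> >>>>>> arithmetic.
> >>>>>>
> >>>>>>     #include <stdio.h>
> >>>>>>     #include <math.h>
> >>>>>>
> >>>>>>     #define N 16  /* arbitrary resolution; any N shows the effect */
> >>>>>>
> >>>>>>     /* Spectral derivative of f(x) = 1 on Chebyshev points: every
> >>>>>>        row of the differentiation matrix D sums to zero analytically,
> >>>>>>        so any nonzero value below is pure double-precision roundoff. */
> >>>>>>     int main(void)
> >>>>>>     {
> >>>>>>         const double pi = acos(-1.0);
> >>>>>>         double x[N + 1], c[N + 1], maxres = 0.0;
> >>>>>>         for (int i = 0; i <= N; i++) {
> >>>>>>             x[i] = cos(pi * i / N);
> >>>>>>             c[i] = (i == 0 || i == N) ? 2.0 : 1.0;
> >>>>>>             if (i % 2) c[i] = -c[i];  /* absorb the (-1)^i sign */
> >>>>>>         }
> >>>>>>         for (int i = 0; i <= N; i++) {
> >>>>>>             double sum;  /* start with the diagonal entry D_ii */
> >>>>>>             if (i == 0)      sum =  (2.0 * N * N + 1.0) / 6.0;
> >>>>>>             else if (i == N) sum = -(2.0 * N * N + 1.0) / 6.0;
> >>>>>>             else             sum = -x[i] / (2.0 * (1.0 - x[i] * x[i]));
> >>>>>>             for (int j = 0; j <= N; j++)  /* add the off-diagonals */
> >>>>>>                 if (j != i) sum += (c[i] / c[j]) / (x[i] - x[j]);
> >>>>>>             if (fabs(sum) > maxres) maxres = fabs(sum);
> >>>>>>         }
> >>>>>>         printf("max |(D*1)_i| = %g\n", maxres);  /* tiny, not 0 */
> >>>>>>         return 0;
> >>>>>>     }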
> >>>>>>
> >>>>>> Should these derivatives evaluate to exactly zero?
> >>>>>>
> >>>>>> Greg
> >>>>>
> >>>
> >
>
>
>