[petsc-users] Big discrepancy between machines

Timothée Nicolas timothee.nicolas at gmail.com
Thu Dec 17 20:16:53 CST 2015


OK, I can build the residuals with this routine. It allowed me to see that
the residuals computed by PETSc (with KSPGetResidualNorm), which do
converge rapidly, are totally inconsistent with what I compute from

  call KSPBuildSolution(ksp,PETSC_NULL_OBJECT,x,ierr)
  call VecDuplicate(x,y,ierr)
  call KSPGetRHS(ksp,b,ierr)
  call KSPGetOperators(ksp,A,PETSC_NULL_OBJECT,ierr)
  call MatMult(A,x,y,ierr)
  call VecAXPY(y,-one,b,ierr)
  call VecNorm(y,NORM_2,norm,ierr)
  call PrintReal('      KSP Residual norm',norm,itwelve,ierr)
  call VecDestroy(y,ierr)

Now, the residuals computed with this method *are* consistent with the next
residual computed by SNES. In other words, I get the following output (see
below, the first KSP residual is the internal PETSc one, the second is
mine). As you can see, the one I compute is consistent (up to a few digits)
with what goes inside the updated solution of SNES. How is that possible ?
I tried to see what KSPGetResidualNorm does but I could only find the
instruction

  *rnorm = ksp->rnorm

and I don't know where to look next.

Here's what the output looks like

0 SNES Function norm 4.867111713544e-03

    0 KSP Residual norm 4.958714442097e-03
      KSP Residual norm 4.867111713544E-03

    1 KSP Residual norm 3.549907385578e-04
      KSP Residual norm 1.651154147130E-03

    2 KSP Residual norm 3.522862963778e-05
      KSP Residual norm 1.557509645650E-03

    3 KSP Residual norm 3.541384239147e-06
      KSP Residual norm 1.561088378958E-03

    4 KSP Residual norm 3.641326695942e-07
      KSP Residual norm 1.560783284631E-03

    5 KSP Residual norm 3.850512634373e-08
      KSP Residual norm 1.560806669239E-03

  1 SNES Function norm 1.560806759605e-03
      etc.. etc...

On the cluster, I don't see exactly the behavior above: the two norms
agree rather well up to a few digits, at least at the beginning, but they
become very different at the end of the iterations (by up to 2 orders of
magnitude, which also worries me).
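
One thing worth ruling out, though it is not confirmed anywhere in this
thread: with left preconditioning (the default for GMRES in PETSc), the
norm printed by the KSP monitor is that of the preconditioned residual
M^{-1}(b - A x), whereas the snippet above measures the unpreconditioned
residual A x - b. The two can differ by orders of magnitude when the
preconditioner strongly damps some component. The sketch below is plain
Python rather than PETSc, and the matrix, right-hand side and "Minv"
are made-up illustrative values, not taken from the actual problem.

```python
import math

def matvec(A, x):
    """Dense matrix-vector product for small lists of lists."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def norm2(v):
    """Euclidean norm."""
    return math.sqrt(sum(c * c for c in v))

def true_residual(A, x, b):
    """Unpreconditioned residual b - A x."""
    return [bi - ai for bi, ai in zip(b, matvec(A, x))]

def preconditioned_residual(Minv, A, x, b):
    """Left-preconditioned residual Minv (b - A x)."""
    return matvec(Minv, true_residual(A, x, b))

# Hypothetical 2x2 system (illustrative values only).
A = [[2.0, 0.0],
     [0.0, 1.0]]
b = [2.0, 1.0]

# Hypothetical preconditioner that almost annihilates the second component.
Minv = [[0.5, 0.0],
        [0.0, 1.0e-6]]

# Iterate that is exact in the first component and wrong in the second.
x = [1.0, 0.0]

true_norm = norm2(true_residual(A, x, b))
pc_norm = norm2(preconditioned_residual(Minv, A, x, b))

print("true residual norm           :", true_norm)
print("preconditioned residual norm :", pc_norm)
```

Here the preconditioned norm is about six orders of magnitude smaller
than the true one, although the iterate is far from the solution. In
PETSc itself, the option -ksp_monitor_true_residual prints both norms
side by side, which would show directly whether this is what happens.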

Thx

Timothee



2015-12-18 1:45 GMT+09:00 Barry Smith <bsmith at mcs.anl.gov>:

>
> > On Dec 17, 2015, at 7:47 AM, Timothée Nicolas <
> timothee.nicolas at gmail.com> wrote:
> >
> > It works very smoothly for the SNES :-), but KSPGetSolution keeps
> returning a zero vector... KSPGetResidualNorm gives me the norm alright,
> but I would like to actually see the vectors. Is KSPGetSolution the wrong
> routine ?
>
>   If you are using GMRES, which is the default, it actually DOES NOT have
> a representation of the solution at each step. Yes that seems odd but it
> only computes the solution vector at a restart or when the iteration ends.
> This is why KSPGetSolution doesn't provide anything useful.  You can use
> KSPBuildSolution() to have it construct the "current" solution whenever you
> need it.
>
>   Barry
>
>
>
> >
> > Thx
> >
> > Timothée
> >
> > 2015-12-17 19:19 GMT+09:00 Dave May <dave.mayhem23 at gmail.com>:
> >
> >
> > On 17 December 2015 at 11:00, Timothée Nicolas <
> timothee.nicolas at gmail.com> wrote:
> > Hi,
> >
> > So, valgrind is OK (at least on the local machine. Actually on the
> cluster helios, it produces strange results even for the simplest petsc
> program PetscInitialize followed by PetscFinalize, I will try to figure
> this out with their technical team), and I have also tried with exactly the
> same versions (3.6.0) and it does not change the behavior.
> >
> > So now I would like to know how to get a grip on what comes in and out
> of the SNES and the KSP internal to the SNES. That is, I would like to
> inspect manually the vector which enters the SNES in the first place
> (should be zero I believe), what is being fed to the KSP, and the vector
> which comes out of it, etc. if possible at each iteration of the SNES. I
> want to actually see these vectors, and compute their norm by hand. The
> trouble is, it is really hard to understand why the Newton residuals are
> not reduced since the KSP converges so nicely. This does not make any sense
> to me, so I want to know what happens to the vectors. But on the SNES list
> of routines, I did not find the tools that would allow me to do that (and
> messing around with the C code is too hard for me, it would take me weeks).
> Does someone have a hint ?
> >
> > The only sane way to do this is to write a custom monitor for your SNES
> object.
> >
> >
> http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/SNES/SNESMonitorSet.html
> >
> > Inside your monitor, you have access to the SNES and everything it
> defines, e.g. the current solution, non-linear residual, KSP etc. See these
> pages
> >
> >
> http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/SNES/SNESGetSolution.html
> >
> >
> http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/SNES/SNESGetFunction.html
> >
> >
> http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/SNES/SNESGetKSP.html
> >
> > Then you can pull apart the residual and compute specific norms (or plot
> the residual).
> >
> > Hopefully you can access everything you need to perform your analysis.
> >
> > Cheers,
> >   Dave
> >
> >
> > Thx
> >
> > Timothee
> >
> >
> >
> >
> > 2015-12-15 14:20 GMT+09:00 Matthew Knepley <knepley at gmail.com>:
> > On Mon, Dec 14, 2015 at 11:06 PM, Timothée Nicolas <
> timothee.nicolas at gmail.com> wrote:
> > There is a difference in valgrind indeed between the two. It seems to be
> clean on my desktop Mac OS X but not on the cluster. I'll try to see what's
> causing this. I still don't understand well what's causing memory leaks in
> the case where all PETSc objects are freed correctly (as can be checked
> with -log_summary).
> >
> > Also, I have tried running either
> >
> > valgrind ./my_code -option1 -option2...
> >
> > or
> >
> > valgrind mpiexec -n 1 ./my_code -option1 -option2...
> >
> > Note here you would need --trace-children=yes for valgrind.
> >
> >   Matt
> >
> > It seems the second is the correct way to proceed right ? This gives
> very different behaviour for valgrind.
> >
> > Timothee
> >
> >
> >
> > 2015-12-14 17:38 GMT+09:00 Timothée Nicolas <timothee.nicolas at gmail.com
> >:
> > OK, I'll try that, thx
> >
> > 2015-12-14 17:38 GMT+09:00 Dave May <dave.mayhem23 at gmail.com>:
> > You have the configure line, so it should be relatively straightforward
> to configure / build petsc in your home directory.
> >
> >
> > On 14 December 2015 at 09:34, Timothée Nicolas <
> timothee.nicolas at gmail.com> wrote:
> > OK, The problem is that I don't think I can change this easily as far as
> the cluster is concerned. I obtain access to petsc by loading the petsc
> module, and even if I have a few choices, I don't see any debug builds...
> >
> > 2015-12-14 17:26 GMT+09:00 Dave May <dave.mayhem23 at gmail.com>:
> >
> >
> > On Monday, 14 December 2015, Timothée Nicolas <
> timothee.nicolas at gmail.com> wrote:
> > Hum, OK. I use FORTRAN by the way. Is your comment still valid ?
> >
> > No. Fortran compilers init variables to zero.
> > In this case, I would run a debug build on your OSX machine through
> valgrind and make sure it is clean.
> >
> > The other obvious thing to check is what happens if you use exactly the
> same petsc builds on both machines. I see 3.6.1 and 3.6.0 are being used.
> >
> > For all this type of checking, I would definitely use debug builds on
> both machines. Your cluster build is using the highest level of
> optimization...
> >
> >
> >
> > I'll check anyway, but I thought I had been careful about this sort of
> thing.
> >
> > Also, I thought the problem on Mac OS X may have been due to the fact I
> used the version with debugging on, so I reran configure with
> --with-debugging=no, which did not change anything.
> >
> > Thx
> >
> > Timothee
> >
> >
> > 2015-12-14 17:04 GMT+09:00 Dave May <dave.mayhem23 at gmail.com>:
> > One suggestion is you have some uninitialized variables in your pcshell.
> Despite your arch being called "debug", your configure options indicate you
> have turned debugging off.
> >
> > The C standard doesn't prescribe how uninit variables should be treated -
> the behavior is labelled as undefined. As a result, different compilers on
> different archs with the same optimization flags can and will treat uninit
> variables differently. I find OSX c compilers tend to set them to zero.
> >
> > I suggest compiling a debug build on both machines and trying your test
> again. Also, consider running the debug builds through valgrind.
> >
> > Thanks,
> >   Dave
> >
> > On Monday, 14 December 2015, Timothée Nicolas <
> timothee.nicolas at gmail.com> wrote:
> > Hi,
> >
> > I have noticed I have a VERY big difference in behaviour between two
> machines in my problem, solved with SNES. I can't explain it, because I
> have tested my operators which give the same result. I also checked that
> the vectors fed to the SNES are the same. The problem happens only with my
> shell preconditioner. When I don't use it, and simply solve using -snes_mf,
> I don't see any more than the usual 3-4 changing digits at the end of the
> residuals. However, when I use my pcshell, the results are completely
> different between the two machines.
> >
> > I have attached output_SuperComputer.txt and output_DesktopComputer.txt,
> which correspond to the output from the exact same code and options (and of
> course same input data file !). More precisely
> >
> > output_SuperComputer.txt : output on a supercomputer called Helios,
> sorry I don't know the exact specs.
> > In this case, the SNES norms are reduced successively:
> > 0 SNES Function norm 4.867111712420e-03
> > 1 SNES Function norm 5.632325929998e-08
> > 2 SNES Function norm 7.427800084502e-15
> >
> > output_DesktopComputer.txt : output on a Mac OS X Yosemite 3.4 GHz Intel
> Core i5 16GB 1600 MHz DDR3 (the same happens on another laptop with Mac
> OS X Mavericks).
> > In this case, by contrast, I obtain the following for the SNES norms:
> > 0 SNES Function norm 4.867111713544e-03
> > 1 SNES Function norm 1.560094052222e-03
> > 2 SNES Function norm 1.552118650943e-03
> > 3 SNES Function norm 1.552106297094e-03
> > 4 SNES Function norm 1.552106277949e-03
> > which I can't explain, because otherwise the KSP residuals (with the same
> operator, which I checked) behave well.
> >
> > As you can see, the first time the preconditioner is applied (DB_, DP_,
> Drho_ and PS_ solves), the two outputs coincide (except for the few last
> digits, up to 9 actually, which is more than I would expect), and
> everything starts to diverge at the first print of the main KSP (the one
> stemming from the SNES) residual norms.
> >
> > Do you have an idea what may cause such a strange behaviour ?
> >
> > Best
> >
> > Timothee
> >
> >
> >
> >
> >
> >
> >
> >
> > --
> > What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which their
> experiments lead.
> > -- Norbert Wiener
> >
> >
> >
>
>