[petsc-users] Big discrepancy between machines

Timothée Nicolas timothee.nicolas at gmail.com
Thu Dec 17 21:41:27 CST 2015


Ah ! That was the reason !

I do have actually four GMRES inside the outside main ksp GMRES. I do a lot
of shenanigans in my preconditioning. Namely, I take the input vector x,
and extract the different degrees of freedom, putting them inside 4 vectors
(2 vectors with 3 dof, and 2 vectors of scalars). Then I run a GMRES on
each of these vectors (preconditioned with multigrid for at least one of
them). When this is all done I combine all the vectors again (there are
some matrix multiplication involved, with the different vectors coupled) to
put them in a vector with same format (8 dof) as the initial input vector,
and the preconditioning is done.

So, simply adding -ksp_type fgmres seems to fix my problem, I just checked.
You have just saved me several days/weeks of scratching my head, if not
more !

I had no clue standard GMRES was picky about preconditioners. So now, after
telling you what I do in the preconditioner, you confirm indeed that simply
using FGMRES is OK right ?

Thx !

Best

Timothee



2015-12-18 11:51 GMT+09:00 Barry Smith <bsmith at mcs.anl.gov>:

>
>    There is something inconsistent with the preconditioners or the matrix
> vector product. Do you have a gmres outside a gmres? That definitely won't
> work and would produce this problem. If you have a gmres somewhere inside
> the preconditioner then you need to have a fgmres outside of that (and not
> a gmres). Run with -ksp_view and send the output showing what your solver is
>
>   Barry
>
> > On Dec 17, 2015, at 8:16 PM, Timothée Nicolas <
> timothee.nicolas at gmail.com> wrote:
> >
> > OK, I can build the residuals with this routine. It allowed me to see
> that the residuals computed by PETSc (with KSPGetResidualNorm), which do
> converge rapidly, are totally inconsistent with what I compute from
> >
> >   call KSPBuildSolution(ksp,PETSC_NULL_OBJECT,x,ierr)
> >   call VecDuplicate(x,y,ierr)
> >   call KSPGetRHS(ksp,b,ierr)
> >   call KSPGetOperators(ksp,A,PETSC_NULL_OBJECT,ierr)
> >   call MatMult(A,x,y,ierr)
> >   call VecAXPY(y,-one,b,ierr)
> >   call VecNorm(y,NORM_2,norm,ierr)
> >   call PrintReal('      KSP Residual norm',norm,itwelve,ierr)
> >   call VecDestroy(y,ierr)
> >
> > Now, the residuals computed with this method are consistent with the
> next residual computed by SNES. In other words, I get the following output
> (see below, the first KSP residual is the internal PETSc one, the second is
> mine). As you can see, the one I compute is consistent (up to a few digits)
> with what goes inside the updated solution of SNES. How is that possible ?
> I tried to see what KSPGetResidualNorm does but I could only find the
> instruction
> >
> >   *rnorm = ksp->rnorm
> >
> > and then I don't know where to look for...
> >
> > Here's what the output looks like
> >
> > 0 SNES Function norm 4.867111713544e-03
> >
> >     0 KSP Residual norm 4.958714442097e-03
> >       KSP Residual norm 4.867111713544E-03
> >
> >     1 KSP Residual norm 3.549907385578e-04
> >       KSP Residual norm 1.651154147130E-03
> >
> >     2 KSP Residual norm 3.522862963778e-05
> >       KSP Residual norm 1.557509645650E-03
> >
> >     3 KSP Residual norm 3.541384239147e-06
> >       KSP Residual norm 1.561088378958E-03
> >
> >     4 KSP Residual norm 3.641326695942e-07
> >       KSP Residual norm 1.560783284631E-03
> >
> >     5 KSP Residual norm 3.850512634373e-08
> >       KSP Residual norm 1.560806669239E-03
> >
> >   1 SNES Function norm 1.560806759605e-03
> >       etc.. etc...
> >
> > On the cluster, I don't find exactly the behavior above, the two norms
> agree rather well up to a few digits, at least at the beginning, but they
> start to be very different at the end of the iterations (up to 2 orders of
> magnitude, which also gets me quite worried).
> >
> > Thx
> >
> > Timothee
> >
> >
> >
> > 2015-12-18 1:45 GMT+09:00 Barry Smith <bsmith at mcs.anl.gov>:
> >
> > > On Dec 17, 2015, at 7:47 AM, Timothée Nicolas <
> timothee.nicolas at gmail.com> wrote:
> > >
> > > It works very smoothly for the SNES :-), but KSPGetSolution keeps
> returning a zero vector... KSPGetResidualNorm gives me the norm alright,
> but I would like to actually see the vectors. Is KSPGetSolution the wrong
> routine ?
> >
> >   If you are using GMRES, which is the default, it actually DOES NOT
> have a representation of the solution at each step. Yes that seems odd but
> it only computes the solution vector at a restart or when the iteration
> ends. This is why KSPGetSolution doesn't provide anything useful.  You can
> use KSPBuildSolution() to have it construct the "current" solution whenever
> you need it.
> >
> >   Barry
> >
> >
> >
> > >
> > > Thx
> > >
> > > Timothée
> > >
> > > 2015-12-17 19:19 GMT+09:00 Dave May <dave.mayhem23 at gmail.com>:
> > >
> > >
> > > On 17 December 2015 at 11:00, Timothée Nicolas <
> timothee.nicolas at gmail.com> wrote:
> > > Hi,
> > >
> > > So, valgrind is OK (at least on the local machine. Actually on the
> cluster helios, it produces strange results even for the simplest petsc
> program PetscInitialize followed by PetscFinalize, I will try to figure
> this out with their technical team), and I have also tried with exactly the
> same versions (3.6.0) and it does not change the behavior.
> > >
> > > So now I would like to now how to have a grip on what comes in and out
> of the SNES and the KSP internal to the SNES. That is, I would like to
> inspect manually the vector which enters the SNES in the first place
> (should be zero I believe), what is being fed to the KSP, and the vector
> which comes out of it, etc. if possible at each iteration of the SNES. I
> want to actually see these vectors, and compute there norm by hand. The
> trouble is, it is really hard to understand why the newton residuals are
> not reduced since the KSP converges so nicely. This does not make any sense
> to me, so I want to know what happens to the vectors. But on the SNES list
> of routines, I did not find the tools that would allow me to do that (and
> messing around with the C code is too hard for me, it would take me weeks).
> Does someone have a hint ?
> > >
> > > The only sane way to do this is to write a custom monitor for your
> SNES object.
> > >
> > >
> http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/SNES/SNESMonitorSet.html
> > >
> > > Inside your monitor, you have access the SNES, and everything it
> defines, e.g. the current solution, non-linear residual, KSP etc. See these
> pages
> > >
> > >
> http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/SNES/SNESGetSolution.html
> > >
> > >
> http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/SNES/SNESGetFunction.html
> > >
> > >
> http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/SNES/SNESGetKSP.html
> > >
> > > Then you can pull apart the residual and compute specific norms (or
> plot the residual).
> > >
> > > Hopefully you can access everything you need to perform your analysis.
> > >
> > > Cheers,
> > >   Dave
> > >
> > >
> > > Thx
> > >
> > > Timothee
> > >
> > >
> > >
> > >
> > > 2015-12-15 14:20 GMT+09:00 Matthew Knepley <knepley at gmail.com>:
> > > On Mon, Dec 14, 2015 at 11:06 PM, Timothée Nicolas <
> timothee.nicolas at gmail.com> wrote:
> > > There is a diference in valgrind indeed between the two. It seems to
> be clean on my desktop Mac OS X but not on the cluster. I'll try to see
> what's causing this. I still don't understand well what's causing memory
> leaks in the case where all PETSc objects are freed correctly (as can pbe
> checked with -log_summary).
> > >
> > > Also, I have tried running either
> > >
> > > valgrind ./my_code -option1 -option2...
> > >
> > > or
> > >
> > > valgrind mpiexec -n 1 ./my_code -option1 -option2...
> > >
> > > Note here you would need --trace-children=yes for valgrind.
> > >
> > >   Matt
> > >
> > > It seems the second is the correct way to proceed right ? This gives
> very different behaviour for valgrind.
> > >
> > > Timothee
> > >
> > >
> > >
> > > 2015-12-14 17:38 GMT+09:00 Timothée Nicolas <
> timothee.nicolas at gmail.com>:
> > > OK, I'll try that, thx
> > >
> > > 2015-12-14 17:38 GMT+09:00 Dave May <dave.mayhem23 at gmail.com>:
> > > You have the configure line, so it should be relatively straight
> forward to configure / build petsc in your home directory.
> > >
> > >
> > > On 14 December 2015 at 09:34, Timothée Nicolas <
> timothee.nicolas at gmail.com> wrote:
> > > OK, The problem is that I don't think I can change this easily as far
> as the cluster is concerned. I obtain access to petsc by loading the petsc
> module, and even if I have a few choices, I don't see any debug builds...
> > >
> > > 2015-12-14 17:26 GMT+09:00 Dave May <dave.mayhem23 at gmail.com>:
> > >
> > >
> > > On Monday, 14 December 2015, Timothée Nicolas <
> timothee.nicolas at gmail.com> wrote:
> > > Hum, OK. I use FORTRAN by the way. Is your comment still valid ?
> > >
> > > No. Fortran compilers init variables to zero.
> > > In this case, I would run a debug build on your OSX machine through
> valgrind and make sure it is clean.
> > >
> > > Other obvious thing to check what happens if use exactly the same
> petsc builds on both machines. I see 3.6.1 and 3.6.0 are being used.
> > >
> > > For all this type of checking, I would definitely use debug builds on
> both machines. Your cluster build is using the highest level of
> optimization...
> > >
> > >
> > >
> > > I'll check anyway, but I thought I had been careful about this sort of
> things.
> > >
> > > Also, I thought the problem on Mac OS X may have been due to the fact
> I used the version with debugging on, so I rerun configure with
> --with-debugging=no, which did not change anything.
> > >
> > > Thx
> > >
> > > Timothee
> > >
> > >
> > > 2015-12-14 17:04 GMT+09:00 Dave May <dave.mayhem23 at gmail.com>:
> > > One suggestion is you have some uninitialized variables in your
> pcshell. Despite your arch being called "debug", your configure options
> indicate you have turned debugging off.
> > >
> > > C standard doesn't prescribe how uninit variables should be treated -
> the behavior is labelled as undefined. As a result, different compilers on
> different archs with the same optimization flags can and will treat uninit
> variables differently. I find OSX c compilers tend to set them to zero.
> > >
> > > I suggest compiling a debug build on both machines and trying your
> test again. Also, consider running the debug builds through valgrind.
> > >
> > > Thanks,
> > >   Dave
> > >
> > > On Monday, 14 December 2015, Timothée Nicolas <
> timothee.nicolas at gmail.com> wrote:
> > > Hi,
> > >
> > > I have noticed I have a VERY big difference in behaviour between two
> machines in my problem, solved with SNES. I can't explain it, because I
> have tested my operators which give the same result. I also checked that
> the vectors fed to the SNES are the same. The problem happens only with my
> shell preconditioner. When I don't use it, and simply solve using -snes_mf,
> I don't see anymore than the usual 3-4 changing digits at the end of the
> residuals. However, when I use my pcshell, the results are completely
> different between the two machines.
> > >
> > > I have attached output_SuperComputer.txt and
> output_DesktopComputer.txt, which correspond to the output from the exact
> same code and options (and of course same input data file !). More precisely
> > >
> > > output_SuperComputer.txt : output on a supercomputer called Helios,
> sorry I don't know the exact specs.
> > > In this case, the SNES norms are reduced successively:
> > > 0 SNES Function norm 4.867111712420e-03
> > > 1 SNES Function norm 5.632325929998e-08
> > > 2 SNES Function norm 7.427800084502e-15
> > >
> > > output_DesktopComputer.txt : output on a Mac OS X Yosemite 3.4 GHz
> Intel Core i5 16GB 1600 MHz DDr3. (the same happens on an other laptop with
> Mac OS X Mavericks).
> > > In this case, I obtain the following for the SNES norms,
> > > while in the other, I obtain
> > > 0 SNES Function norm 4.867111713544e-03
> > > 1 SNES Function norm 1.560094052222e-03
> > > 2 SNES Function norm 1.552118650943e-03
> > > 3 SNES Function norm 1.552106297094e-03
> > > 4 SNES Function norm 1.552106277949e-03
> > > which I can't explain, because otherwise the KSP residual (with the
> same operator, which I checked) behave well.
> > >
> > > As you can see, the first time the preconditioner is applied (DB_,
> DP_, Drho_ and PS_ solves), the two outputs coincide (except for the few
> last digits, up to 9 actually, which is more than I would expect), and
> everything starts to diverge at the first print of the main KSP (the one
> stemming from the SNES) residual norms.
> > >
> > > Do you have an idea what may cause such a strange behaviour ?
> > >
> > > Best
> > >
> > > Timothee
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > --
> > > What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which their
> experiments lead.
> > > -- Norbert Wiener
> > >
> > >
> > >
> >
> >
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20151218/07fa919d/attachment.html>


More information about the petsc-users mailing list