[petsc-users] Big discrepancy between machines

Timothée Nicolas timothee.nicolas at gmail.com
Thu Dec 17 04:00:21 CST 2015


Hi,

So, valgrind is OK, at least on the local machine. On the cluster Helios,
it produces strange results even for the simplest PETSc program
(PetscInitialize followed by PetscFinalize); I will try to figure this out
with their technical team. I have also tried with exactly the same PETSc
version (3.6.0) on both machines, and it does not change the behavior.

So now I would like to know how to get a grip on what goes in and out of
the SNES and the KSP internal to the SNES. That is, I would like to inspect
manually the vector which enters the SNES in the first place (should be
zero I believe), what is being fed to the KSP, and the vector which comes
out of it, etc., if possible at each iteration of the SNES. I want to
actually *see* these vectors, and compute their norms by hand. The trouble
is, it is really hard to understand why the Newton residuals are not
reduced when the KSP converges so nicely. This does not make any sense to
me, so I want to know what happens to the vectors. But in the list of SNES
routines, I did not find the tools that would allow me to do that (and
messing around with the C code is too hard for me, it would take me weeks).
Does someone have a hint ?
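
To make concrete what kind of inspection is meant, here is a minimal sketch
in plain Python (not PETSc, and not the code discussed in this thread). It
runs Newton's method on a hypothetical 2x2 toy problem and calls a
user-supplied monitor at every iteration with the current iterate, the
nonlinear residual, the right-hand side given to the linear solve, the
update that comes out of it, and the linear residual. All names (residual,
jacobian, linear_solve, monitor, ...) are assumptions made up for
illustration. In PETSc the analogous hook would presumably be a user monitor
registered with SNESMonitorSet, using SNESGetSolution, SNESGetFunction and
SNESGetSolutionUpdate together with VecNorm or VecView.

```python
import math

def norm2(v):
    """Euclidean norm computed by hand, with no library help."""
    return math.sqrt(sum(x * x for x in v))

def residual(u):
    """Hypothetical toy nonlinear function F(u); one root is (1, 2)."""
    x, y = u
    return [x * x + y - 3.0, x + y * y - 5.0]

def jacobian(u):
    """Exact Jacobian of the toy residual above."""
    x, y = u
    return [[2.0 * x, 1.0], [1.0, 2.0 * y]]

def linear_solve(A, b):
    """Cramer's rule for a 2x2 system; stands in for the inner linear solver."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(b[0] * A[1][1] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det]

def matvec(A, v):
    """Product of a 2x2 matrix with a vector of length 2."""
    return [A[0][0] * v[0] + A[0][1] * v[1],
            A[1][0] * v[0] + A[1][1] * v[1]]

def newton(u0, monitor, max_it=50, tol=1e-12):
    """Plain Newton iteration (no line search) that reports every vector."""
    u = list(u0)
    for it in range(max_it):
        F = residual(u)
        if norm2(F) < tol:
            break
        J = jacobian(u)
        rhs = [-f for f in F]            # what is fed to the linear solve
        du = linear_solve(J, rhs)        # what comes out of it
        lin_res = [a - b for a, b in zip(matvec(J, du), rhs)]
        monitor(it, u, F, rhs, du, lin_res)
        u = [a + b for a, b in zip(u, du)]
    return u

history = []

def monitor(it, u, F, rhs, du, lin_res):
    """Record and print the hand-computed norms of all vectors involved."""
    record = {
        "it": it,
        "norm_u": norm2(u),
        "norm_F": norm2(F),
        "norm_rhs": norm2(rhs),
        "norm_du": norm2(du),
        "norm_lin_res": norm2(lin_res),
    }
    history.append(record)
    print("%3d |u|=%.6e |F|=%.6e |rhs|=%.6e |du|=%.6e |J du - rhs|=%.3e"
          % (it, record["norm_u"], record["norm_F"], record["norm_rhs"],
             record["norm_du"], record["norm_lin_res"]))

# Start from the zero vector, as the initial guess is believed to be here.
solution = newton([0.0, 0.0], monitor)
```

With such a log one can compare |J du - rhs| (how well the linear solve did)
against |F| at the next iteration. If the former is small at every step
while the latter stagnates, the discrepancy lies outside the linear solve
itself.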

Thx

Timothee




2015-12-15 14:20 GMT+09:00 Matthew Knepley <knepley at gmail.com>:

> On Mon, Dec 14, 2015 at 11:06 PM, Timothée Nicolas <
> timothee.nicolas at gmail.com> wrote:
>
>> There is indeed a difference in valgrind between the two. It seems to be
>> clean on my desktop Mac OS X but not on the cluster. I'll try to see what's
>> causing this. I still don't understand well what's causing memory leaks in
>> the case where all PETSc objects are freed correctly (as can be checked
>> with -log_summary).
>>
>> Also, I have tried running either
>>
>> valgrind ./my_code -option1 -option2...
>>
>> or
>>
>> valgrind mpiexec -n 1 ./my_code -option1 -option2...
>>
>
> Note here you would need --trace-children=yes for valgrind.
>
>   Matt
>
>
>> It seems the second is the correct way to proceed right ? This gives very
>> different behaviour for valgrind.
>>
>> Timothee
>>
>>
>>
>> 2015-12-14 17:38 GMT+09:00 Timothée Nicolas <timothee.nicolas at gmail.com>:
>>
>>> OK, I'll try that, thx
>>>
>>> 2015-12-14 17:38 GMT+09:00 Dave May <dave.mayhem23 at gmail.com>:
>>>
>>>> You have the configure line, so it should be relatively
>>>> straightforward to configure / build petsc in your home directory.
>>>>
>>>>
>>>> On 14 December 2015 at 09:34, Timothée Nicolas <
>>>> timothee.nicolas at gmail.com> wrote:
>>>>
>>>>> OK, The problem is that I don't think I can change this easily as far
>>>>> as the cluster is concerned. I obtain access to petsc by loading the petsc
>>>>> module, and even if I have a few choices, I don't see any debug builds...
>>>>>
>>>>> 2015-12-14 17:26 GMT+09:00 Dave May <dave.mayhem23 at gmail.com>:
>>>>>
>>>>>>
>>>>>>
>>>>>> On Monday, 14 December 2015, Timothée Nicolas <
>>>>>> timothee.nicolas at gmail.com> wrote:
>>>>>>
>>>>>>> Hum, OK. I use FORTRAN by the way. Is your comment still valid ?
>>>>>>>
>>>>>>
>>>>>> No. Fortran compilers init variables to zero.
>>>>>> In this case, I would run a debug build on your OSX machine through
>>>>>> valgrind and make sure it is clean.
>>>>>>
>>>>>> The other obvious thing to check is what happens if you use exactly
>>>>>> the same petsc builds on both machines. I see 3.6.1 and 3.6.0 are
>>>>>> being used.
>>>>>>
>>>>>> For all this type of checking, I would definitely use debug builds on
>>>>>> both machines. Your cluster build is using the highest level of
>>>>>> optimization...
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>> I'll check anyway, but I thought I had been careful about this sort
>>>>>>> of things.
>>>>>>>
>>>>>>> Also, I thought the problem on Mac OS X may have been due to the
>>>>>>> fact I used the version with debugging on, so I reran configure with
>>>>>>> --with-debugging=no, which did not change anything.
>>>>>>>
>>>>>>> Thx
>>>>>>>
>>>>>>> Timothee
>>>>>>>
>>>>>>>
>>>>>>> 2015-12-14 17:04 GMT+09:00 Dave May <dave.mayhem23 at gmail.com>:
>>>>>>>
>>>>>>>> One suggestion is you have some uninitialized variables in your
>>>>>>>> pcshell. Despite your arch being called "debug", your configure options
>>>>>>>> indicate you have turned debugging off.
>>>>>>>>
>>>>>>>> C standard doesn't prescribe how uninit variables should be treated
>>>>>>>> - the behavior is labelled as undefined. As a result, different compilers
>>>>>>>> on different archs with the same optimization flags can and will treat
>>>>>>>> uninit variables differently. I find OSX c compilers tend to set them to
>>>>>>>> zero.
>>>>>>>>
>>>>>>>> I suggest compiling a debug build on both machines and trying your
>>>>>>>> test again. Also, consider running the debug builds through valgrind.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>   Dave
>>>>>>>>
>>>>>>>> On Monday, 14 December 2015, Timothée Nicolas <
>>>>>>>> timothee.nicolas at gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I have noticed I have a VERY big difference in behaviour between
>>>>>>>>> two machines in my problem, solved with SNES. I can't explain it, because I
>>>>>>>>> have tested my operators which give the same result. I also checked that
>>>>>>>>> the vectors fed to the SNES are the same. The problem happens only with my
>>>>>>>>> shell preconditioner. When I don't use it, and simply solve using -snes_mf,
>>>>>>>>> I don't see any more than the usual 3-4 changing digits at the end of the
>>>>>>>>> residuals. However, when I use my pcshell, the results are completely
>>>>>>>>> different between the two machines.
>>>>>>>>>
>>>>>>>>> I have attached output_SuperComputer.txt and
>>>>>>>>> output_DesktopComputer.txt, which correspond to the output from the exact
>>>>>>>>> same code and options (and of course same input data file !). More precisely
>>>>>>>>>
>>>>>>>>> output_SuperComputer.txt : output on a supercomputer called
>>>>>>>>> Helios, sorry I don't know the exact specs.
>>>>>>>>> In this case, the SNES norms are reduced successively:
>>>>>>>>> 0 SNES Function norm 4.867111712420e-03
>>>>>>>>> 1 SNES Function norm 5.632325929998e-08
>>>>>>>>> 2 SNES Function norm 7.427800084502e-15
>>>>>>>>>
>>>>>>>>> output_DesktopComputer.txt : output on a Mac OS X Yosemite 3.4 GHz
>>>>>>>>> Intel Core i5 16GB 1600 MHz DDR3 (the same happens on another laptop with
>>>>>>>>> Mac OS X Mavericks).
>>>>>>>>> In this case, I obtain the following for the SNES norms:
>>>>>>>>> 0 SNES Function norm 4.867111713544e-03
>>>>>>>>> 1 SNES Function norm 1.560094052222e-03
>>>>>>>>> 2 SNES Function norm 1.552118650943e-03
>>>>>>>>> 3 SNES Function norm 1.552106297094e-03
>>>>>>>>> 4 SNES Function norm 1.552106277949e-03
>>>>>>>>> which I can't explain, because otherwise the KSP residuals (with
>>>>>>>>> the same operator, which I checked) behave well.
>>>>>>>>>
>>>>>>>>> As you can see, the first time the preconditioner is applied (DB_,
>>>>>>>>> DP_, Drho_ and PS_ solves), the two outputs coincide (except for the few
>>>>>>>>> last digits, up to 9 actually, which is more than I would expect), and
>>>>>>>>> everything starts to diverge at the first print of the main KSP (the one
>>>>>>>>> stemming from the SNES) residual norms.
>>>>>>>>>
>>>>>>>>> Do you have an idea what may cause such a strange behaviour ?
>>>>>>>>>
>>>>>>>>> Best
>>>>>>>>>
>>>>>>>>> Timothee
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>
>>>>
>>>
>>
>
>
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which their
> experiments lead.
> -- Norbert Wiener
>