VecSetValues and irreproducible results in parallel
Barry Smith
bsmith at mcs.anl.gov
Thu Jan 29 19:13:06 CST 2009
Since the code would only be compiled in when the special
reproducibility flag is on, we can live with the C++.
Barry
On Jan 28, 2009, at 12:27 PM, Boyce Griffith wrote:
> Hi, Barry --
>
> I suppose that each of the runs could be thought of as a particular
> "realization" or something like that. In my case, I figure they are
> all "equally wrong." ;-) I have a hunch that the discrepancies
> between runs are exacerbated by under-resolution, but I haven't
> confirmed this to be the case.
>
> What got me started on this was I was looking into the effect of
> changing a convergence threshold: how tight do I need to make my
> convergence tolerance before I start getting the "same" answer? I
> was distressed to discover that no matter how tight the tolerance,
> the results were always different. Then I noticed that they were
> different even for the same tolerance. This made it a little
> difficult to choose the tolerance!
>
> I'd be happy to supply the code for inclusion in petsc-dev. I use
> C++ STL vectors and sorting algorithms in VecAssemblyEnd, however.
> Can I toss you all the code with C++ stuff in it, or would it really
> only be useful if I turn it into C-only code first?
>
> -- Boyce
>
> Barry Smith wrote:
>> Boyce,
>> As you know, any of the runs give equally valid answers (all
>> right or all wrong depending on how you look at it). I assume you
>> want reproducibility to help track down other bugs in the code and
>> to check that changes which are not supposed to affect the solution
>> really leave it unchanged (which is hard to tell when every run
>> gives a different answer).
>> We would be interested in including your additions in petsc-dev,
>> with a config/configure.py option (say --with-reproducibility) that
>> turns on the more precise (but slower) version, if you would like
>> to contribute it.
>> Barry
>> We see this issue every once in a while and never had the energy
>> to comprehensively go through the code and add support to make the
>> computations much more reproducible.
>> On Jan 28, 2009, at 11:55 AM, Boyce Griffith wrote:
>>> Hi, Folks --
>>>
>>> I recently noticed an irreproducibility issue with some parallel
>>> code which uses PETSc. Basically, I can run the same parallel
>>> simulation on the same number of processors and get different
>>> answers. The discrepancies were only apparent to the eye after
>>> very long integration times (e.g., approx. 20k timesteps), but in
>>> fact the computations start diverging after only a few handfuls
>>> (30--40) of timesteps. (Running the same code in serial yields
>>> reproducible results, but changing the MPI library or running the
>>> code on a different system did not make any difference.)
>>>
>>> I believe that I have tracked this issue down to VecSetValues/
>>> VecSetValuesBlocked. A part of the computation uses VecSetValues
>>> to accumulate the values of forces at the nodes of a mesh. At
>>> some nodes, force contributions come from multiple processors. It
>>> appears that there is some "randomness" in the way that this
>>> accumulation is performed in parallel, presumably related to the
>>> order in which values are communicated.
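(The underlying cause is that floating-point addition is not associative, so the order in which remote contributions arrive changes the rounded result. A minimal, generic illustration, not code from this thread:)

```c
/* Floating-point addition is not associative: regrouping the same
 * three values can change the rounded result.  This is why the
 * arrival order of remote contributions changes the assembled sum. */
double sum_left(double a, double b, double c)  { return (a + b) + c; }
double sum_right(double a, double b, double c) { return a + (b + c); }
```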
>>>
>>> I have found that I can make modified versions of
>>> VecSetValues_MPI(), VecSetValuesBlocked_MPI(), and
>>> VecAssemblyEnd_MPI() to ensure reproducible results. I make the
>>> following modifications:
>>>
>>> (1) VecSetValues_MPI() and VecSetValuesBlocked_MPI() place all
>>> values into the stash instead of immediately adding the local
>>> components.
>>>
>>> (2) VecAssemblyEnd_MPI() "extracts" all of the values provided by
>>> the stash and bstash, placing these values into lists
>>> corresponding to each of the local entries in the vector. Next,
>>> once all values have been extracted, I sort each of these lists
>>> (e.g., by descending or ascending magnitude).
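A minimal sketch of the sort-then-sum idea in step (2), with hypothetical helper names (this is not the actual VecAssemblyEnd_MPI() code): collect all contributions destined for one local entry, impose a fixed order on them, and only then sum.

```c
#include <stdlib.h>
#include <math.h>

/* Sort contributions to one vector entry into a fixed order
 * (ascending magnitude, ties broken by signed value so the order is
 * total), making the sum independent of message-arrival order.
 * Hypothetical helpers, not PETSc source. */
static int cmp_contrib(const void *pa, const void *pb)
{
    double a = *(const double *)pa, b = *(const double *)pb;
    double ma = fabs(a), mb = fabs(b);
    if (ma < mb) return -1;
    if (ma > mb) return  1;
    return (a < b) ? -1 : (a > b) ? 1 : 0;
}

double reproducible_entry_sum(double *contribs, size_t n)
{
    double s = 0.0;
    qsort(contribs, n, sizeof(double), cmp_contrib);
    for (size_t i = 0; i < n; i++) s += contribs[i];
    return s;
}
```

The same multiset of contributions then always produces bit-identical results, whatever order the stash delivered them in.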
>>>
>>> Making these changes appears to yield exactly reproducible
>>> results, although I am still performing tests to try to shake out
>>> any other problems. Another approach which seems to work is to
>>> perform the final summation in higher precision (e.g., using
>>> "double-double precision" arithmetic). Using double-double
>>> precision allows me to skip the sorting step, although since the
>>> number of values to sort is relatively small, it may be cheaper to
>>> sort.
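For reference, the core of a double-double accumulation in the style of the QD library (a sketch under that assumption, not the poster's implementation): each addition captures its exact rounding error, so any sum that fits in two doubles comes out the same in every order.

```c
/* Double-double accumulator: hi + lo carries roughly 32 significant
 * digits.  two_sum() computes the exact rounding error of a + b, so
 * partial sums representable in two doubles are tracked exactly and
 * the final result no longer depends on summation order. */
typedef struct { double hi, lo; } dd_t;

static dd_t two_sum(double a, double b)
{
    double s = a + b;
    double v = s - a;
    double e = (a - (s - v)) + (b - v);  /* exact error of a + b */
    return (dd_t){s, e};
}

dd_t dd_add(dd_t acc, double x)
{
    dd_t s = two_sum(acc.hi, x);
    s.lo += acc.lo;
    /* renormalize so hi holds the leading part */
    double hi = s.hi + s.lo;
    double lo = s.lo - (hi - s.hi);
    return (dd_t){hi, lo};
}

double dd_result(dd_t acc) { return acc.hi + acc.lo; }
```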
>>>
>>> Using more accurate summation methods (e.g., compensated
>>> summation) does not appear to fix the lack of reproducibility
>>> problem.
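(For comparison, classic Kahan compensated summation, a generic sketch: it shrinks the rounding error of each step but still adds values in arrival order, consistent with the observation that it does not restore reproducibility.)

```c
#include <stddef.h>

/* Kahan compensated summation: carries a running correction c that
 * recovers most of each step's rounding error.  More accurate than a
 * naive loop, but the result still depends on the input order. */
double kahan_sum(const double *x, size_t n)
{
    double s = 0.0, c = 0.0;
    for (size_t i = 0; i < n; i++) {
        double y = x[i] - c;
        double t = s + y;
        c = (t - s) - y;   /* what was lost when adding y to s */
        s = t;
    }
    return s;
}
```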
>>>
>>> I was wondering if anyone has run into similar issues with PETSc
>>> and has a better solution.
>>>
>>> Thanks!
>>>
>>> -- Boyce
>>>
>>> PS: These tests were done using PETSc 2.3.3. I have just finished
>>> upgrading the code to PETSc 3.0.0 and will re-run them. However,
>>> looking at VecSetValues_MPI()/VecSetValuesBlocked_MPI()/etc, it
>>> looks like this code is essentially the same in PETSc 2.3.3 and
>>> 3.0.0.