VecSetValues and irreproducible results in parallel

Barry Smith bsmith at mcs.anl.gov
Wed Jan 28 12:14:13 CST 2009


    Boyce,

     As you know, any of the runs gives an equally valid answer (all
right or all wrong, depending on how you look at it). I assume you
want reproducibility to help track down other bugs in the code and to
verify that changes in places that are not supposed to affect the
solution are not, in fact, changing it (which is hard to check when
each run gives a different answer).

    We would be interested in including your additions in petsc-dev,
with a config/configure.py option (say --with-reproducibility) that
turns on the more precise (but slower) version, if you would like to
contribute them.

    Barry

   We see this issue every once in a while but have never had the
energy to go through the code comprehensively and add support to make
the computations more reproducible.


On Jan 28, 2009, at 11:55 AM, Boyce Griffith wrote:

> Hi, Folks --
>
> I recently noticed an irreproducibility issue with some parallel  
> code which uses PETSc.  Basically, I can run the same parallel  
> simulation on the same number of processors and get different  
> answers.  The discrepancies were only apparent to the eye after very  
> long integration times (e.g., approx. 20k timesteps), but in fact  
> the computations start diverging after only a few handfuls (30--40)  
> of timesteps.  (Running the same code in serial yields reproducible
> results; changing the MPI library or running the code on a different
> system made no difference to the parallel irreproducibility.)
>
> I believe that I have tracked this issue down to VecSetValues/ 
> VecSetValuesBlocked.  A part of the computation uses VecSetValues to  
> accumulate the values of forces at the nodes of a mesh.  At some  
> nodes, force contributions come from multiple processors.  It  
> appears that there is some "randomness" in the way that this  
> accumulation is performed in parallel, presumably related to the  
> order in which values are communicated.
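>
> To illustrate the pattern (the node indices and force values below
> are made up for the example, not taken from the actual code), the
> accumulation looks roughly like this:
>
> #include <petscvec.h>
>
> /* Each rank adds its force contributions into a global force vector F.
>    Node 4 is "shared": several ranks add to the same global entry, and
>    the floating-point result of that sum depends on the order in which
>    the additions happen to be performed. */
> PetscErrorCode AccumulateNodalForces(Vec F, PetscMPIInt rank)
> {
>   PetscInt       nodes[2];
>   PetscScalar    forces[2];
>   PetscErrorCode ierr;
>
>   nodes[0] = 4;  forces[0] = 1.0e-1 * (rank + 1);
>   nodes[1] = 5;  forces[1] = 2.0e-3;
>   ierr = VecSetValues(F, 2, nodes, forces, ADD_VALUES); CHKERRQ(ierr);
>
>   /* Off-process contributions are communicated and summed here. */
>   ierr = VecAssemblyBegin(F); CHKERRQ(ierr);
>   ierr = VecAssemblyEnd(F);   CHKERRQ(ierr);
>   return 0;
> }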
>
> I have found that I can make modified versions of  
> VecSetValues_MPI(), VecSetValuesBlocked_MPI(), and  
> VecAssemblyEnd_MPI() to ensure reproducible results.  I make the  
> following modifications:
>
> (1) VecSetValues_MPI() and VecSetValuesBlocked_MPI() place all  
> values into the stash instead of immediately adding the local  
> components.
>
> (2) VecAssemblyEnd_MPI() "extracts" all of the values provided by  
> the stash and bstash, placing these values into lists corresponding  
> to each of the local entries in the vector.  Next, once all values  
> have been extracted, I sort each of these lists (e.g., by descending  
> or ascending magnitude).
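>
> Stripped of the stash bookkeeping, the sorting step just fixes the
> summation order for each local entry before the contributions are
> combined.  A standalone sketch of the idea (not the actual code I put
> into VecAssemblyEnd_MPI()):
>
> #include <math.h>
> #include <stdlib.h>
>
> /* qsort comparator: ascending magnitude. */
> static int cmp_abs(const void *a, const void *b)
> {
>   double x = fabs(*(const double *)a), y = fabs(*(const double *)b);
>   return (x > y) - (x < y);
> }
>
> /* Sum the contributions to one local vector entry in a fixed (sorted)
>    order, so the result does not depend on the order in which the
>    values happened to arrive from the stash/bstash. */
> static double sum_sorted(double *vals, int n)
> {
>   double s = 0.0;
>   int    i;
>   qsort(vals, n, sizeof(double), cmp_abs);
>   for (i = 0; i < n; i++) s += vals[i];
>   return s;
> }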
>
> Making these changes appears to yield exactly reproducible results,  
> although I am still performing tests to try to shake out any other  
> problems.  Another approach which seems to work is to perform the  
> final summation in higher precision (e.g., using "double-double  
> precision" arithmetic).  Using double-double precision allows me to  
> skip the sorting step, although since the number of values to sort  
> is relatively small, it may be cheaper to sort.
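>
> For reference, the double-double accumulation I am using is the usual
> TwoSum-based scheme, sketched below with illustrative names and
> independent of the PETSc data structures:
>
> /* Double-double accumulator: the value is hi + lo. */
> typedef struct { double hi, lo; } dd_t;
>
> /* Knuth's TwoSum: s + err == a + b exactly. */
> static void two_sum(double a, double b, double *s, double *err)
> {
>   double bb;
>   *s   = a + b;
>   bb   = *s - a;
>   *err = (a - (*s - bb)) + (b - bb);
> }
>
> /* Add an ordinary double into the double-double accumulator. */
> static void dd_add(dd_t *acc, double x)
> {
>   double s, e, t;
>   two_sum(acc->hi, x, &s, &e);
>   e += acc->lo;
>   t       = s + e;          /* renormalize (FastTwoSum) */
>   acc->lo = e - (t - s);
>   acc->hi = t;
> }
>
> /* Accumulate n contributions; with ~106 bits of working precision the
>    rounded double result is, in practice, insensitive to the order of
>    the additions for the small counts that occur per vector entry. */
> static double dd_sum(const double *vals, int n)
> {
>   dd_t acc = {0.0, 0.0};
>   int  i;
>   for (i = 0; i < n; i++) dd_add(&acc, vals[i]);
>   return acc.hi;
> }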
>
> Using more accurate summation methods (e.g., compensated summation)
> does not appear to fix the lack-of-reproducibility problem.
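>
> (By compensated summation I mean the usual Kahan-style loop, sketched
> below; it reduces the rounding error of the sum, but the error that
> remains still depends on the order of the terms, which is presumably
> why it does not give bit-wise reproducibility.)
>
> /* Kahan compensated summation: more accurate than a plain loop, but
>    the residual rounding error still depends on the order of the
>    terms, so runs that deliver the contributions in different orders
>    can still differ in the last bits. */
> static double kahan_sum(const double *vals, int n)
> {
>   double s = 0.0, c = 0.0;
>   int    i;
>   for (i = 0; i < n; i++) {
>     double y = vals[i] - c;
>     double t = s + y;
>     c = (t - s) - y;   /* captures the low-order bits of y lost in t */
>     s = t;
>   }
>   return s;
> }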
>
> I was wondering if anyone has run into similar issues with PETSc and  
> has a better solution.
>
> Thanks!
>
> -- Boyce
>
> PS: These tests were done using PETSc 2.3.3.  I have just finished
> upgrading the code to PETSc 3.0.0 and will re-run them.  However,
> judging from VecSetValues_MPI()/VecSetValuesBlocked_MPI()/etc., this
> code appears to be essentially the same in PETSc 2.3.3 and 3.0.0.


