[MPICH] tests are failing with different numbers

Jerry Mersel jerry.mersel at weizmann.ac.il
Wed Sep 21 07:10:25 CDT 2005


Another question...

  Is MPICH (and/or MPICH2) NFS-safe?

                       Regards,
                        Jerry


> On Tue, 2005-09-20 at 16:54 +0300, Jerry Mersel wrote:
>> Hi:
>>   I've installed MPICH 1.2.6 on a cluster which consists of several
>> dual Opteron machines running Red Hat AS 4.0.
>>   A user running an application on 4 processors has brought to my
>> attention that two runs of the same binary produce two different
>> sets of results.
>>
>>   I then ran the tests that come with mpich (I know I should have
>> done it before; we just won't tell anybody) and they came up with
>> errors. Here they are:
>>     Differences in issendtest.out
>>     Differences in structf.out
>>     *** Checking for differences from expected output ***
>>     Differences in issendtest.out
>>     Differences in structf.out
>>     p0_3896:  p4_error: net_recv read:  probable EOF on socket: 1
>>     p0_10524:  p4_error: : 972
>>     0 - MPI_ADDRESS : Address of location given to MPI_ADDRESS
>>         does not fit in Fortran integer
>>
>>   I've tried this with the gcc compiler and the pgi compiler, with
>>   no options and with many different options - the results are the same.
>
>>   I tried using the MPICH from the pgi site; the user still got
>>   different results on different runs.
>
> I've seen something similar to this. Firstly, structf is testing
> something that is mathematically impossible, i.e. storing a (in your
> case 64-bit) pointer in a 32-bit integer. This sometimes works
> (depending on the value of the pointer) but often doesn't. We have
> some patches for this (I believe pgi also ships them), but they are
> still not a 100% cure. Unless you actually have an application that
> suffers from this, you don't need the patch.
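
As an illustration of the truncation Ashley describes, here is a minimal
sketch in C (not from the original thread; it uses MPI-1's MPI_Address,
which is what MPICH 1.2.6 provides - the Fortran MPI_ADDRESS binding has
the same problem because a default INTEGER is only 32 bits wide):

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        double   buf[4];
        MPI_Aint addr;    /* wide enough for a 64-bit address */
        int      faddr;   /* what a default Fortran INTEGER can hold */

        MPI_Init(&argc, &argv);
        MPI_Address(buf, &addr);   /* MPI-1 call, as in MPICH 1.2.6 */
        faddr = (int) addr;        /* truncates whenever addr >= 2^31 */
        if ((MPI_Aint) faddr != addr)
            printf("address 0x%lx does not fit in 32 bits\n",
                   (unsigned long) addr);
        MPI_Finalize();
        return 0;
    }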
>
> Secondly, Red Hat AS 4.0 has some odd features that effectively mean
> you're supposed to get different results from running the same program
> twice. In particular it has exec-shield-randomize enabled, which moves
> the stack about between runs, and it runs a cron job overnight which
> randomises the load addresses of your installed shared libraries. This
> means that the same binary on two different nodes will have a
> different address for the stack and different addresses for
> mmap()/shared libraries.
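
The randomization is easy to observe (a hypothetical check, not from
the thread): the plain C program below prints a stack address and a
heap address, and on a kernel with exec-shield-randomize enabled the
stack value (and typically the heap value too) changes from run to run:

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        int   on_stack = 0;
        void *on_heap  = malloc(16);

        /* with address randomization enabled these values change
           between runs of the very same binary */
        printf("stack: %p   heap: %p\n", (void *) &on_stack, on_heap);
        free(on_heap);
        return 0;
    }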
>
> Most applications, however, should hide this away from you, and MPI
> itself is designed to be independent of this type of configuration
> change; it is fairly easy to introduce artificial dependencies without
> meaning to, though.
>
> And of course there are the normal problems of floating-point
> accuracy; some apps aren't actually supposed to get identical results
> between runs, merely similar ones...
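
For instance (a standalone sketch, not from the thread), single-precision
addition is not associative, so any change in the order of a parallel
reduction can legitimately change the answer:

    #include <stdio.h>

    int main(void)
    {
        float a = 1.0e8f, b = -1.0e8f, c = 1.0f;

        /* (a + b) + c == 1 but a + (b + c) == 0: the 1.0f is lost
           when added to -1.0e8f, since a float carries only about
           7 decimal digits */
        printf("%g vs %g\n", (a + b) + c, a + (b + c));
        return 0;
    }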
>
> Ashley,
>
>