[MPICH] tests are failing with different numbers

Jerry Mersel jerry.mersel at weizmann.ac.il
Wed Sep 21 06:58:55 CDT 2005


Thank you Ashley,

  The problem on AS 4.0 would also explain the trouble I've been
having testing MPICH2. I'm seeing failures like "-2652 expected 2651",
which look as if the high-order and low-order bytes are being reversed.

In either case, which Linux OS would you recommend?
(We try to use only Red Hat or SUSE, but I don't think anyone would
complain about Fedora.)

                                  regards,
                                    Jerry



> On Tue, 2005-09-20 at 16:54 +0300, Jerry Mersel wrote:
>> Hi:
>>   I've installed MPICH 1.2.6 on a cluster which consists of several dual
>> opteron machines running redhat AS 4.0.
>>    A user running an application on 4 processors has brought to my
>>    attention that two runs of the same binary produce two different
>>    sets of results.
>>
>>    I then ran the tests that come with MPICH (I know I should have
>>    done this earlier; we just won't tell anybody) and got errors.
>>    Here they are:
>>           Differences in issendtest.out
>>           Differences in structf.out
>>           *** Checking for differences from expected output ***
>>           Differences in issendtest.out
>>           Differences in structf.out
>>            p0_3896:  p4_error: net_recv read:  probable EOF on socket: 1
>>            p0_10524:  p4_error: : 972
>>           0 - MPI_ADDRESS : Address of location given to MPI_ADDRESS
>>           does not fit in Fortran integer
>>
>>    I've tried this with both the gcc and pgi compilers, with no
>>    options and with many different ones - the results are the same.
>
>>    I tried using MPICH from the pgi site, the user still got different
>>    results on different runs.
>
> I've seen something similar to this. Firstly, structf is testing
> something that is mathematically impossible, i.e. storing a (in your
> case 64-bit) pointer in a 32-bit integer.  This sometimes works
> (depending on the value of the pointer) but often doesn't.  We have
> some patches for this (I believe pgi also ships them), but they're
> still not a 100% cure.  Unless you actually have an application that
> suffers from this, you don't need the patch.
>
> Secondly, Red Hat AS 4.0 has some odd features that effectively mean
> you're supposed to get different results from running the same
> program twice. In particular it has exec-shield-randomize enabled,
> which moves the stack about between runs, and it runs a cron job
> overnight which randomises the load addresses of your installed
> shared libraries.  This means that the same binary on two different
> nodes will have a different address for the stack and a different
> address for mmap()/shared libraries.
>
> Most applications, however, should hide this away from you, and MPI
> itself is designed to be independent of this type of configuration
> change; it is fairly easy to introduce artificial dependencies
> without meaning to, though.
>
> And of course there are the normal problems of floating-point
> accuracy; some apps aren't actually supposed to get identical results
> between runs, merely similar ones...
>
> Ashley,
>
>



