[MPICH] tests are failing with different numbers

Rob Ross rross at mcs.anl.gov
Fri Sep 23 10:12:06 CDT 2005


What do you mean by that?

Rob

Jerry Mersel wrote:
> Another question...
> 
>   Is MPICH (and/or MPICH2) NFS safe?
> 
>                        Regards,
>                         Jerry
> 
> 
> 
>>On Tue, 2005-09-20 at 16:54 +0300, Jerry Mersel wrote:
>>
>>>Hi:
>>>  I've installed MPICH 1.2.6 on a cluster which consists of several
>>>dual-Opteron machines running Red Hat AS 4.0.
>>>   A user, while running an application on 4 processors, has brought
>>>   to my attention that two runs with the same binary produce two
>>>   different sets of results.
>>>
>>>    I then ran the tests that come with MPICH (I know I should have
>>>done this before; we just won't tell anybody) and came up with errors.
>>>     Here they are:
>>>          Differences in issendtest.out
>>>          Differences in structf.out
>>>          *** Checking for differences from expected output ***
>>>          Differences in issendtest.out
>>>          Differences in structf.out
>>>           p0_3896:  p4_error: net_recv read:  probable EOF on socket: 1
>>>           p0_10524:  p4_error: : 972
>>>          0 - MPI_ADDRESS : Address of location given to MPI_ADDRESS does not fit in Fortran integer
>>>
>>>   I've tried this with the gcc compiler and the pgi compiler, with no
>>>   options and with many different options - the results are the same.
>>
>>>   I tried using MPICH from the pgi site; the user still got different
>>>   results on different runs.
>>
>>I've seen something similar to this. Firstly, structf is testing
>>something that is mathematically impossible, i.e. storing a (in your
>>case 64-bit) pointer in a 32-bit integer.  This sometimes works
>>(depending on the value of the pointer) but often doesn't. We have
>>some patches for this (I believe pgi also ship them), but they're
>>still not a 100% cure.  Unless you actually have an application that
>>suffers from this, you don't need the patch.
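>>
>>For illustration, here is a minimal C sketch of the mismatch (my own
>>example, not one of the MPICH tests): on a 64-bit machine the address
>>that MPI_Address hands back generally won't fit in a 32-bit integer,
>>which is exactly what structf's Fortran INTEGER forces it into.
>>
>>    #include <mpi.h>
>>    #include <stdio.h>
>>
>>    int main(int argc, char **argv)
>>    {
>>        double buf;
>>        MPI_Aint addr;
>>        int as_int32;
>>
>>        MPI_Init(&argc, &argv);
>>        /* MPI-1 call used by the old tests; MPI_Aint is wide enough
>>           to hold an address, a 32-bit Fortran INTEGER is not. */
>>        MPI_Address(&buf, &addr);
>>        as_int32 = (int)addr;   /* what the Fortran test does, in effect */
>>        if ((MPI_Aint)as_int32 != addr)
>>            printf("address 0x%lx does not fit in 32 bits\n",
>>                   (unsigned long)addr);
>>        MPI_Finalize();
>>        return 0;
>>    }
>>
>>Whether the check fires depends on where buf happens to land, which is
>>why the failure is intermittent.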
>>
>>Secondly, Red Hat AS 4.0 has some odd features that effectively mean
>>you're supposed to get different results from running the same program
>>twice. In particular, it has exec-shield-randomize enabled, which
>>moves the stack about between runs, and it runs a cron job overnight
>>which randomises the load addresses of your installed shared
>>libraries.  This means that the same binary on two different nodes
>>will have a different address for the stack and a different address
>>for mmap()/shared libraries.
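>>
>>You can see the randomisation with a tiny sketch (nothing MPI- or
>>RedHat-specific in the code itself): print a couple of addresses and
>>run the binary twice.
>>
>>    #include <stdio.h>
>>    #include <stdlib.h>
>>
>>    int main(void)
>>    {
>>        int on_stack;
>>        void *on_heap = malloc(16);
>>
>>        /* With exec-shield-randomize on, the stack address changes
>>           from run to run; shared-library/mmap addresses can differ
>>           from node to node as well. */
>>        printf("stack %p  heap %p\n", (void *)&on_stack, on_heap);
>>        free(on_heap);
>>        return 0;
>>    }
>>
>>If memory serves, the knob on those kernels is
>>/proc/sys/kernel/exec-shield-randomize; writing 0 there turns the
>>stack randomisation off.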
>>
>>Most applications, however, should hide this away from you, and MPI
>>itself is designed to be independent of this type of configuration
>>change; it is fairly easy to introduce artificial dependencies without
>>meaning to, though.
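>>
>>A hypothetical example of such an accidental dependency (illustrative,
>>not from any real application): seeding a random number generator from
>>a pointer value ties your results to the load addresses, which now
>>change on every run.
>>
>>    #include <stdio.h>
>>    #include <stdlib.h>
>>
>>    int main(void)
>>    {
>>        double x;
>>        /* BAD: the seed is a stack address, so once the stack is
>>           randomised it differs per run - and so does the output. */
>>        srand((unsigned)(size_t)&x);
>>        printf("first value: %d\n", rand());
>>        return 0;
>>    }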
>>
>>And of course there are the normal problems of floating-point
>>accuracy: some apps aren't actually supposed to get identical results
>>between runs, merely similar ones...
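>>
>>The root cause there is that floating-point addition isn't
>>associative, so anything that changes the order of a sum (a different
>>process layout, a different reduction tree) changes the rounding. A
>>short demonstration:
>>
>>    #include <stdio.h>
>>
>>    int main(void)
>>    {
>>        double a = 1e16, b = -1e16, c = 1.0;
>>        /* Same three numbers, different order: prints 1 and 0. */
>>        printf("%g %g\n", (a + b) + c, a + (b + c));
>>        return 0;
>>    }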
>>
>>Ashley,
>>
>>
> 
> 



