[MPICH] tests are failing with different numbers

Jerry Mersel jerry.mersel at weizmann.ac.il
Mon Sep 26 03:37:06 CDT 2005


By "NFS safe" I mean that data written to the exported/mounted directory
actually gets written and does not interfere with what's going on with
other machines.

                            Regards,
                             Jerry

e.g. certain uw-imap folder formats are not NFS safe.


> What do you mean by that?
>
> Rob
>
> Jerry Mersel wrote:
>> Another question...
>>
>>   Is MPICH (and/or MPICH2) NFS safe?
>>
>>                        Regards,
>>                         Jerry
>>
>>
>>
>>>On Tue, 2005-09-20 at 16:54 +0300, Jerry Mersel wrote:
>>>
>>>>Hi:
>>>>  I've installed MPICH 1.2.6 on a cluster which consists of several
>>>>dual Opteron machines running Red Hat AS 4.0.
>>>>   A user, while running an application using 4 processors, has brought
>>>>   to my attention that 2 runs with the same binary result in 2
>>>>   different sets of results.
>>>>
>>>>    I then ran the tests that come with mpich (I know I should have
>>>>done it before, we just won't tell anybody) and came up with errors.
>>>>Here they are:
>>>>          Differences in issendtest.out
>>>>          Differences in structf.out
>>>>          *** Checking for differences from expected output ***
>>>>          Differences in issendtest.out
>>>>          Differences in structf.out
>>>>           p0_3896:  p4_error: net_recv read:  probable EOF on socket: 1
>>>>           p0_10524:  p4_error: : 972
>>>>          0 - MPI_ADDRESS : Address of location given to MPI_ADDRESS does not fit in Fortran integer
>>>>
>>>>   I've tried this with the gcc compiler and the pgi compiler, with
>>>>   no options and with many different options - the results are the same.
>>>
>>>>   I tried using MPICH from the pgi site, the user still got different
>>>>   results on different runs.
>>>
>>>I've seen something similar to this. Firstly, structf is testing
>>>something that is mathematically impossible, i.e. storing a (in your
>>>case 64-bit) pointer in a 32-bit integer.  This sometimes works (depending
>>>on what the pointer is) but often doesn't. We have some patches for this
>>>(I believe pgi also ship them) but it's still not a 100% cure.  Unless
>>>you actually have an application that suffers from this, you don't
>>>need the patch.
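
A minimal Python sketch of why this "sometimes works", assuming the Fortran
integer is a signed 32-bit value. The two addresses are hypothetical examples,
not values taken from the failing test; the helper names are made up for
illustration.

```python
INT32_MIN = -(2 ** 31)
INT32_MAX = 2 ** 31 - 1


def fits_in_int32(address):
    """True if the address can be stored in a signed 32-bit integer unchanged."""
    return INT32_MIN <= address <= INT32_MAX


def truncate_to_int32(address):
    """Keep only the low 32 bits, reinterpreted as a signed 32-bit value."""
    low = address & 0xFFFFFFFF
    return low - 2 ** 32 if low > INT32_MAX else low


# Hypothetical addresses: one low in the address space, one high (stack-like).
low_address = 0x00000000006B2A40
high_address = 0x00007FFFC3A1E2D8

if __name__ == "__main__":
    for addr in (low_address, high_address):
        print(hex(addr), "fits:", fits_in_int32(addr),
              "stored as:", truncate_to_int32(addr))
```

A low address round-trips unchanged, while a high one loses its upper bits,
so whether the test passes depends on where the variable happens to live.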
>>>
>>>Secondly, RedHat AS 4.0 has some odd features that effectively mean you're
>>>supposed to get different results from running the same program twice.
>>>In particular it's got exec-shield-randomize enabled, which moves the
>>>stack about between runs, and it runs a cron job overnight which
>>>randomises the load address of your installed shared libraries.  This
>>>means that the same binary on two different nodes will have a different
>>>address for the stack and a different address for mmap()/shared
>>>libraries.
>>>
>>>Most applications, however, should hide this away from you, and MPI itself
>>>is designed to be independent of this type of configuration change. It
>>>is fairly easy to introduce artificial dependencies without meaning to,
>>>though.
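
One way such an artificial dependency can creep in, sketched in Python under
the assumption that a program orders its work by memory address. The buffer
names and address layouts below are hypothetical and only stand in for two
runs with different randomisation.

```python
def order_by_address(layout):
    """Return buffer names sorted by the address each one happens to have."""
    return sorted(layout, key=layout.get)


# Hypothetical addresses of the same three buffers in two separate runs.
run_1 = {"a": 0x2A95557000, "b": 0x2A95600000, "c": 0x7FBFFFE000}
run_2 = {"a": 0x2A95700000, "b": 0x2A95558000, "c": 0x7FBFFFD000}

if __name__ == "__main__":
    print("run 1 order:", order_by_address(run_1))
    print("run 2 order:", order_by_address(run_2))
```

The data is identical in both runs, but the order in which it would be
processed is not, because it was derived from addresses.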
>>>
>>>And of course there are the normal problems of floating point accuracy:
>>>some apps aren't actually supposed to get identical results between
>>>runs, merely similar ones...
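
A small Python illustration of the floating point point, assuming the
difference between runs comes from partial results being added in a different
order (for example, contributions arriving from processes in a different
sequence). The values are arbitrary and the function name is made up.

```python
def running_sum(values):
    """Add the values strictly left to right, as a simple accumulation loop would."""
    total = 0.0
    for value in values:
        total += value
    return total


# The same three contributions, combined in two different orders.
one_order = [0.1, 0.2, 0.3]
other_order = [0.3, 0.2, 0.1]

if __name__ == "__main__":
    first = running_sum(one_order)
    second = running_sum(other_order)
    print(repr(first), repr(second), "identical:", first == second)
```

Floating point addition is not associative, so the two totals agree only to
within rounding error. Comparing results with a tolerance rather than for
exact equality is the usual way to handle this.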
>>>
>>>Ashley,
>>>
>>>
>>
>>
>
>



