[MPICH] tests are failing with different numbers

Jerry Mersel jerry.mersel at weizmann.ac.il
Mon Sep 26 03:44:06 CDT 2005


Here's another wrench in the works. When I build the executable with
either MPICH2 or MPICH (Portland build), the binary differs from one
build to the next. The size is the same, but cmp or diff reports the
files as different.
Has anyone ever seen that before?

                            Regards,
                             Jerry

P.S. I did the echo 0 and prelink steps as written below.


>
> You can temporarily disable both these features of RedHat by running the
> following as root on each of your compute nodes. I'd do this first and
> then re-run the tests.  The prelink command might take a few seconds.
>
> $ echo 0 > /proc/sys/kernel/exec-shield-randomize
> $ prelink -ua
>
> There is also a cron job that runs overnight to change the library
> mappings; if doing the above does help then you'll want to disable this
> as well (or at least re-run prelink -ua before running more tests in the
> future).
>
> If this does fix your problems then you should really try to work out
> why it's happening and make your program more robust against this kind
> of thing. With some programs (more commonly shmem or MPI one-sided) it's
> not possible (or at least very difficult), but other times it can be a
> symptom of an underlying bug somewhere.
>
> As for which OS to use: I recommend picking one you are familiar with,
> one that has a level of support you can live with, and configuring it to
> meet your requirements.
>
> Ashley,
>
> On Wed, 2005-09-21 at 14:58 +0300, Jerry Mersel wrote:
>> Thank you Ashley,
>>
>>   The problem on AS 4.0 would also explain the problems that I've been
>> having testing MPICH2, namely things like "-2652 expected         2651" -
>> it seems like the high-order and low-order bytes are reversed.
>>
>> In either case, which Linux OS would you recommend?
>> (We try to use only Red Hat or SUSE, but I don't think anyone would
>> complain about Fedora.)
>>
>>                                   regards,
>>                                     Jerry
>>
>>
>>
>> > On Tue, 2005-09-20 at 16:54 +0300, Jerry Mersel wrote:
>> >> Hi:
>> >>   I've installed MPICH 1.2.6 on a cluster which consists of several
>> >> dual Opteron machines running Red Hat AS 4.0.
>> >>   A user, while running an application on 4 processors, has brought
>> >> to my attention that 2 runs with the same binary produce 2 different
>> >> sets of results.
>> >>
>> >>   I then ran the tests that come with mpich (I know I should have
>> >> done it before; we just won't tell anybody) and came up with errors.
>> >> Here they are:
>> >>           Differences in issendtest.out
>> >>           Differences in structf.out
>> >>           *** Checking for differences from expected output ***
>> >>           Differences in issendtest.out
>> >>           Differences in structf.out
>> >>            p0_3896:  p4_error: net_recv read:  probable EOF on socket: 1
>> >>            p0_10524:  p4_error: : 972
>> >>           0 - MPI_ADDRESS : Address of location given to MPI_ADDRESS
>> >>           does not fit in Fortran integer
>> >>
>> >>   I've tried this with the gcc compiler and the pgi compiler, with
>> >> no options and with many different options - the results are the same.
>> >
>> >>   I tried using MPICH from the PGI site; the user still got
>> >> different results on different runs.
>> >
>> > I've seen something similar to this. Firstly, structf is testing
>> > something that is mathematically impossible, i.e. storing a (in your
>> > case 64-bit) pointer in a 32-bit integer.  This sometimes works
>> > (depending on what the pointer is) but often doesn't; we have some
>> > patches for this (I believe PGI also ship them) but it's still not a
>> > 100% cure.  Unless you actually have an application that suffers from
>> > this, you don't need the patch.
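>> >
>> > Roughly, the failure looks like the untested C sketch below (the
>> > variable names are made up; the MPI-1 call used by MPICH 1.2.6 is
>> > MPI_Address):
>> >
>> >   #include <stdio.h>
>> >   #include <mpi.h>
>> >
>> >   int main(int argc, char **argv)
>> >   {
>> >       double   buf[4];
>> >       MPI_Aint addr;        /* wide enough for a 64-bit address */
>> >       int      truncated;   /* what a Fortran INTEGER would hold */
>> >
>> >       MPI_Init(&argc, &argv);
>> >       MPI_Address(buf, &addr);
>> >       truncated = (int) addr;          /* drops the high bits */
>> >       if ((MPI_Aint) truncated != addr)
>> >           printf("address 0x%lx does not fit in a 32-bit integer\n",
>> >                  (unsigned long) addr);
>> >       MPI_Finalize();
>> >       return 0;
>> >   }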
>> >
>> > Secondly, RedHat AS 4.0 has some odd features that effectively mean
>> > you're supposed to get different results from running the same program
>> > twice. In particular it has exec-shield-randomize enabled, which moves
>> > the stack about between runs, and it runs a cron job overnight which
>> > randomises the load addresses of your installed shared libraries.  This
>> > means that the same binary on two different nodes will have a different
>> > address for the stack and a different address for mmap()/shared
>> > libraries.
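>> >
>> > A quick way to see this (a minimal sketch, nothing MPI-specific): run
>> > the little program below twice. With the randomisation on, the
>> > addresses it prints change between runs; after the echo 0 and
>> > prelink -ua steps they should stay the same.
>> >
>> >   #include <stdio.h>
>> >
>> >   int main(void)
>> >   {
>> >       int on_stack;
>> >
>> >       /* &on_stack moves with the stack randomisation; &stdin moves
>> >          when prelink re-randomises the shared libraries.           */
>> >       printf("stack: %p   libc data (stdin): %p\n",
>> >              (void *) &on_stack, (void *) &stdin);
>> >       return 0;
>> >   }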
>> >
>> > Most applications, however, should hide this away from you, and MPI
>> > itself is designed to be independent of this type of configuration
>> > change; it is fairly easy to introduce artificial dependencies without
>> > meaning to, though.
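>> >
>> > One made-up example of such an artificial dependency (not from any
>> > real application): deriving something observable, here a message tag,
>> > from a pointer value, so the program's behaviour follows whatever the
>> > load addresses happen to be on a given run.
>> >
>> >   #include <stdio.h>
>> >   #include <mpi.h>
>> >
>> >   int main(int argc, char **argv)
>> >   {
>> >       double work[64] = {0};
>> >       int rank, tag;
>> >       MPI_Status st;
>> >
>> >       MPI_Init(&argc, &argv);
>> >       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>> >
>> >       /* Bad idea: the tag depends on where 'work' sits on the stack,
>> >          so it changes whenever the stack address changes.           */
>> >       tag = (int) (((long) &work[0]) & 0x7fff);
>> >
>> >       if (rank == 0)
>> >           MPI_Send(work, 64, MPI_DOUBLE, 1, tag, MPI_COMM_WORLD);
>> >       else if (rank == 1) {
>> >           MPI_Recv(work, 64, MPI_DOUBLE, 0, MPI_ANY_TAG,
>> >                    MPI_COMM_WORLD, &st);
>> >           printf("saw tag %d\n", st.MPI_TAG);  /* varies run to run */
>> >       }
>> >       MPI_Finalize();
>> >       return 0;
>> >   }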
>> >
>> > And of course there are the normal problems of floating-point
>> > accuracy; some apps aren't actually supposed to get identical results
>> > between runs, merely similar ones...
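>> >
>> > For completeness, a tiny (non-MPI) illustration of that: summing the
>> > same numbers in a different order, as a different reduction tree or
>> > process count would, need not give bit-identical answers.
>> >
>> >   #include <stdio.h>
>> >
>> >   int main(void)
>> >   {
>> >       double a = 1.0e16, b = -1.0e16, c = 1.0;
>> >
>> >       /* floating-point addition is not associative */
>> >       printf("%g vs %g\n", (a + b) + c, a + (b + c));   /* 1 vs 0 */
>> >       return 0;
>> >   }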
>> >
>> > Ashley,
>> >
>> >
>>
>
>



