[MPICH] tests are failing with different numbers

Ashley Pittman ashley at quadrics.com
Wed Sep 21 07:13:27 CDT 2005


You can temporarily disable both these features of RedHat by running the
following as root on each of your compute nodes; I'd do this first and
then re-run the tests.  The prelink command might take a few seconds.

$ echo 0 > /proc/sys/kernel/exec-shield-randomize
$ prelink -ua

There is also a cron job that runs overnight to change the library
mappings; if doing the above does help, you'll want to disable this
as well (or at least re-run prelink -ua before running more tests in the
future).
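If you want those changes to stick across reboots and nightly cron runs,
something like the following should work; the exact paths here are an
assumption based on a typical RHEL-style layout, so check your own system
first.

```shell
# Assumption: the nightly job lives in /etc/cron.daily/prelink and reads
# its configuration from /etc/sysconfig/prelink (typical RHEL layout).
# Turn prelinking off so the cron job becomes a no-op:
sed -i 's/^PRELINKING=.*/PRELINKING=no/' /etc/sysconfig/prelink

# Make the exec-shield setting survive a reboot:
echo 'kernel.exec-shield-randomize = 0' >> /etc/sysctl.conf
```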

If this does fix your problems then you should really try to work out
why it's happening and make your program more robust against this kind
of thing.  With some programs (more commonly shmem or MPI one-sided) that's
not possible (or at least very difficult), but at other times it can be a
symptom of an underlying bug somewhere.
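To see concretely why storing an address in a 32-bit integer (the
structf/MPI_ADDRESS case) is fragile: only the low 32 bits survive the
truncation, so whether it "works" depends entirely on where the pointer
happens to land.  A quick sketch, using a made-up 64-bit address:

```shell
# Hypothetical x86-64 user-space address (the value itself is made up):
addr=0x00002aaaab123456

# The low 32 bits are all a 32-bit Fortran INTEGER can hold;
# the high bits (0x00002aaa here) are silently lost:
printf 'full:      0x%016x\n' "$addr"
printf 'truncated: 0x%08x\n' $(( addr & 0xffffffff ))
```

With address randomization enabled the high bits vary from run to run,
which is why the same binary can pass one time and fail the next.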

As for which OS to use?  I'd recommend picking one you are familiar with,
one that has a level of support you can live with, and configuring it to
meet your requirements.

Ashley,

On Wed, 2005-09-21 at 14:58 +0300, Jerry Mersel wrote:
> Thank you Ashley,
> 
>   The problem on AS 4.0 would also explain the problems that I've been
> having testing MPICH2. Namely things like "-2652 expected         2651"
> seems like the high order and low order bytes are reversed.
> 
> In either case which linux OS would you recommend.
> (We try to use only REDHAT or SUSE, but I don't think anyone would
> complain about fedora).
> 
>                                   regards,
>                                     Jerry
> 
> 
> 
> > On Tue, 2005-09-20 at 16:54 +0300, Jerry Mersel wrote:
> >> Hi:
> >>   I've installed MPICH 1.2.6 on a cluster which consists of several dual
> >> opteron machines running redhat AS 4.0.
> >>    A user, while running an application using 4 processors, has brought
> >>    to my attention that 2 runs with the same binary result in 2 different
> >>    sets of results.
> >>
> >>     I then ran the tests that come with mpich (I know I should have done
> >>     it before, we just won't tell anybody), and came up with errors.  Here
> >>     they are:
> >>           Differences in issendtest.out
> >>           Differences in structf.out
> >>           *** Checking for differences from expected output ***
> >>           Differences in issendtest.out
> >>           Differences in structf.out
> >>            p0_3896:  p4_error: net_recv read:  probable EOF on socket: 1
> >>            p0_10524:  p4_error: : 972
> >>           0 - MPI_ADDRESS : Address of location given to MPI_ADDRESS
> >>               does not fit in Fortran integer
> >>
> >>    I've tried this with the gcc compiler and the pgi compiler, with
> >>    no options and with many different ones - the results are the same.
> >
> >>    I tried using MPICH from the pgi site, the user still got different
> >>    results on different runs.
> >
> > I've seen something similar to this.  Firstly, structf is testing
> > something that is mathematically impossible, i.e. storing a (in your
> > case 64-bit) pointer in a 32-bit integer.  This sometimes works (depending
> > on what the pointer is) but often doesn't; we have some patches for this
> > (I believe pgi also ship them) but it's still not a 100% cure.  Unless
> > you actually have an application that suffers from this then you don't
> > need the patch.
> >
> > Secondly, RedHat AS 4.0 has some odd features that effectively mean you're
> > supposed to get different results from running the same program twice.
> > In particular it's got exec-shield-randomize enabled, which moves the
> > stack about between runs, and it runs a cron job overnight which
> > randomises the load addresses of your installed shared libraries.  This
> > means that the same binary on two different nodes will have a different
> > address for the stack and a different address for mmap()/shared
> > libraries.
> >
> > Most applications, however, should hide this away from you, and MPI itself
> > is designed to be independent of this type of configuration change; it
> > is fairly easy to introduce artificial dependencies without meaning to,
> > though.
> >
> > And of course there are the normal problems of floating point accuracy,
> > some apps aren't actually supposed to get identical results between
> > runs, merely similar ones...
> >
> > Ashley,
> >
> >
> 
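The floating-point caveat in the quoted reply is easy to demonstrate:
IEEE-754 double addition is not associative, so a reduction that sums the
same values in a different order can legitimately produce a result that
differs in the last few bits.  A quick sketch using awk:

```shell
# Sum the same three values in two different orders; the printed
# results differ in the trailing digits because double-precision
# addition is not associative:
awk 'BEGIN { printf "%.17g\n", (0.1 + 0.2) + 0.3 }'
awk 'BEGIN { printf "%.17g\n", 0.1 + (0.2 + 0.3) }'
```

In a parallel run the reduction order can change with process placement
or timing, so "similar, not identical" is the right expectation for such
applications.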




More information about the mpich-discuss mailing list