[MPICH] tests are failing with different numbers

Ashley Pittman ashley at quadrics.com
Wed Sep 21 05:02:39 CDT 2005


On Tue, 2005-09-20 at 16:54 +0300, Jerry Mersel wrote:
> Hi:
>   I've installed MPICH 1.2.6 on a cluster which consists of several
> dual-Opteron machines running Red Hat AS 4.0.
>   A user running an application on 4 processors has brought to my
> attention that two runs with the same binary produce two different
> sets of results.
>
>   I then ran the tests that come with MPICH (I know I should have
> done it before; we just won't tell anybody), and they come up with
> errors.  Here they are:
>           Differences in issendtest.out
>           Differences in structf.out
>           *** Checking for differences from expected output ***
>           Differences in issendtest.out
>           Differences in structf.out
>            p0_3896:  p4_error: net_recv read:  probable EOF on socket: 1
>            p0_10524:  p4_error: : 972
>           0 - MPI_ADDRESS : Address of location given to MPI_ADDRESS
>               does not fit in Fortran integer
> 
>    I've tried this with both the gcc and pgi compilers, with no
>    options and with many different options; the results are the same.

>    I also tried the MPICH build from the pgi site; the user still got
>    different results on different runs.

I've seen something similar to this.  Firstly, structf is testing
something that is mathematically impossible, i.e. storing a (in your
case 64-bit) pointer in a 32-bit integer.  This sometimes works
(depending on what the pointer is) but often doesn't.  We have some
patches for this (I believe pgi also ship them), but they are still not
a 100% cure.  Unless you actually have an application that suffers from
this, you don't need the patch.
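
To make the failure mode concrete, here is a minimal C sketch (not the
structf test itself, which is Fortran): MPI_Address() returns the
absolute address as an MPI_Aint, which is 64 bits on an Opteron, while
a default Fortran INTEGER is only 32 bits.

  /* Whether the address survives truncation to 32 bits depends on
   * where the buffer happens to live, which is why the test can pass
   * on one run and fail on another. */
  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char **argv)
  {
      double buf[16];
      MPI_Aint addr;

      MPI_Init(&argc, &argv);
      MPI_Address(buf, &addr);  /* MPI-1 call; the Fortran binding
                                   returns this into an INTEGER */

      /* Roughly the check behind the "does not fit in Fortran
       * integer" error above: does the 64-bit address round-trip
       * through a 32-bit integer? */
      if (addr != (MPI_Aint)(int)addr)
          printf("address of buf does not fit in a 32-bit integer\n");
      else
          printf("got lucky: the address fits this time\n");

      MPI_Finalize();
      return 0;
  }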

Secondly, Red Hat AS 4.0 has some odd features that effectively mean
you're supposed to get different results from running the same program
twice.  In particular it has exec-shield-randomize enabled, which moves
the stack about between runs, and it runs a cron job overnight which
randomises the load addresses of your installed shared libraries.  This
means that the same binary on two different nodes will have a different
address for the stack and different addresses for mmap() and the shared
libraries.
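
You can see this for yourself with a trivial (non-MPI) program; run it
twice and compare the addresses it prints.

  /* Prints the address of a stack variable and of a large malloc()'d
   * block (big enough that glibc services it with mmap()).  With
   * exec-shield-randomize enabled both change from run to run; after
   * the overnight cron job the shared-library addresses can also
   * differ from node to node. */
  #include <stdio.h>
  #include <stdlib.h>

  int main(void)
  {
      int on_stack;
      void *mapped = malloc(1 << 20);  /* ~1MB, served via mmap() */

      printf("stack: %p   mmap: %p\n", (void *)&on_stack, mapped);
      free(mapped);
      return 0;
  }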

Most applications, however, should hide this away from you, and MPI
itself is designed to be independent of this type of configuration
change.  It is fairly easy to introduce artificial dependencies without
meaning to, though.
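
A hypothetical example of such an accidental dependency: anything that
derives a value from a pointer, say seeding a random number generator
with a buffer's address, will behave differently once the address-space
layout changes between runs.

  /* The seed comes from a stack address, so the "random" stream, and
   * anything computed from it, changes whenever the stack moves. */
  #include <stdio.h>
  #include <stdlib.h>
  #include <stdint.h>

  int main(void)
  {
      double data[8];
      int i;

      srand((unsigned)(uintptr_t)data);  /* seed depends on layout! */
      for (i = 0; i < 8; i++)
          data[i] = rand() / (double)RAND_MAX;

      printf("first value: %f\n", data[0]);
      return 0;
  }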

And of course there are the usual problems of floating-point accuracy;
some apps aren't actually supposed to get identical results between
runs, merely similar ones...
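
For instance, floating-point addition is not associative, so a
reduction that combines partial sums in a different order (as can
happen when message arrival order varies between runs) can
legitimately produce a slightly different answer:

  #include <stdio.h>

  int main(void)
  {
      double a = 1e16, b = -1e16, c = 1.0;

      printf("(a + b) + c = %.1f\n", (a + b) + c);  /* prints 1.0 */
      printf("a + (b + c) = %.1f\n", a + (b + c));  /* prints 0.0 */
      return 0;
  }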

Ashley,



