[MPICH] tests are failing with different numbers

Jerry Mersel jerry.mersel at weizmann.ac.il
Tue Sep 20 08:54:32 CDT 2005


Hi:
  I've installed MPICH 1.2.6 on a cluster consisting of several
dual-Opteron machines running Red Hat AS 4.0.
   A user running an application on 4 processors has brought to my
   attention that two runs of the same binary produce two different
   sets of results.

    I then ran the tests that come with MPICH (I know I should have done
it before; we just won't tell anybody) and came up with errors. Here
they are:
          Differences in issendtest.out
          Differences in structf.out
          *** Checking for differences from expected output ***
          Differences in issendtest.out
          Differences in structf.out
           p0_3896:  p4_error: net_recv read:  probable EOF on socket: 1
           p0_10524:  p4_error: : 972
          0 - MPI_ADDRESS : Address of location given to MPI_ADDRESS does not fit in Fortran integer
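
   A side note on the MPI_ADDRESS message: these are 64-bit Opteron
   nodes, so one possible explanation (an assumption on my part, not
   something I have confirmed) is that addresses no longer fit in a
   default Fortran INTEGER. Here is a minimal sketch in Python, assuming
   INTEGER is a signed 32-bit value and using the pointer value that
   appears in the MPI_Comm_connect error stack in the MPICH2 output
   (newcomm=0x7fbffff5c0). The function name is only illustrative.

```python
# Assumption: the default Fortran INTEGER is a signed 32-bit value.
INT32_MIN = -(2**31)
INT32_MAX = 2**31 - 1


def fits_in_fortran_integer(address: int) -> bool:
    """Return True if the address can be stored in a signed 32-bit integer."""
    return INT32_MIN <= address <= INT32_MAX


# Pointer value copied from the MPI_Comm_connect error stack (newcomm=...).
sample_address = 0x7fbffff5c0
print(hex(sample_address), fits_in_fortran_integer(sample_address))
```

   If that assumption holds, any stack address of this size would
   trigger the message regardless of compiler options.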

   I've tried this with both the GCC and PGI compilers, with no
   options and with many different options; the results are the same.

   I also tried MPICH from the PGI site; the user still got different
   results on different runs.

   I then tried MPICH2, and its tests also failed with:

      copy function return code was MPI_SUCCESS in dup
 Found 1 errors
<WORKDIR>./f77/spawn</WORKDIR>
<STATUS>fail</STATUS>
<TESTDIFF>
 Error in Lookup name
 Invalid service name (see MPI_Publish_name), error stack:
MPID_NS_Lookup(182): Lookup failed for service name MyTest
  Found  1 errors

<WORKDIR>./f77/spawn</WORKDIR>
<STATUS>fail</STATUS>
<TESTDIFF>
aborting job:
Fatal error in MPI_Comm_connect: Other MPI error, error stack:
MPI_Comm_connect(118): MPI_Comm_connect(port=" ", MPI_INFO_NULL, root=0, comm=0x84000000, newcomm=0x7fbffff5c0) failed
MPID_Comm_connect(28):
MPIDI_CH3_Comm_connect(76):
MPIDI_CH3I_Connect_to_root(1519): no space for the host description
rank 1 in job 610  wiccopt-1_34510   caused collective abort of all ranks
  exit status of rank 1: return code 13
<NAME>iwriteatf</NAME>
<NP>4</NP>
<WORKDIR>./f77/io</WORKDIR>
<STATUS>fail</STATUS>
<TESTDIFF>
 Error class  17(See the MPI_ERROR field in MPI_Status for the error code)
 0 buf( 1) =  -49 expected  48
 Error class  17(See the MPI_ERROR field in MPI_Status for the error code)
 0 buf( 1) =  -177 expected  176
 Error class  17(See the MPI_ERROR field in MPI_Status for the error code)
 0 buf2( 1) =  -397
 Error class  17(See the MPI_ERROR field in MPI_Status for the error code)
 0 buf( 1) =  -929 expected  928
 Error class  17(See the MPI_ERROR field in MPI_Status for the error code)
 0 buf2( 1) =  -1005
 Error class  17(See the MPI_ERROR field in MPI_Status for the error code)
 3 buf2( 1) =  -1968
 Error class  17(See the MPI_ERROR field in MPI_Status for the error code)
 3 buf2( 1) =  -5304
 Error class  17(See the MPI_ERROR field in MPI_Status for the error code)
 3 buf( 1) =  -5636 expected  5635
 Error class  17(See the MPI_ERROR field in MPI_Status for the error code)
 3 buf2( 1) =  -5896
 Error class  17(See the MPI_ERROR field in MPI_Status for the error code)
 3 buf2( 1) =  -6152
mpdrun_wiccopt-1 (handle_sig_occurred 519): job terminating due to timeout
<NAME>iwritef</NAME>
<NP>4</NP>
<WORKDIR>./f77/io</WORKDIR>
<STATUS>fail</STATUS>
<TESTDIFF>
 Error class  17(See the MPI_ERROR field in MPI_Status for the error code)
 3 buf2( 1) =  -1080
 Error class  17(See the MPI_ERROR field in MPI_Status for the error code)
 3 buf2( 1) =  -1672
 Error class  17(See the MPI_ERROR field in MPI_Status for the error code)
 3 buf2( 1) =  -2256
 Error class  17(See the MPI_ERROR field in MPI_Status for the error code)
 3 buf2( 1) =  -2360
 Error class  17(See the MPI_ERROR field in MPI_Status for the error code)
 3 buf( 1) =  -2460 expected  2459
 Error class  17(See the MPI_ERROR field in MPI_Status for the error code)
 0 buf( 1) =  -5105 expected  5104
 Error class  17(See the MPI_ERROR field in MPI_Status for the error code)
 0 buf2( 1) =  -5869
 Error class  17(See the MPI_ERROR field in MPI_Status for the error code)
 0 buf2( 1) =  -6061
 Error class  17(See the MPI_ERROR field in MPI_Status for the error code)
 0 buf( 1) =  -6153 expected  6152
 Error class  17(See the MPI_ERROR field in MPI_Status for the error code)
 0 buf2( 1) =  -6741
mpdrun_wiccopt-1 (handle_sig_occurred 519): job terminating due to timeout
</TESTDIFF>
</MPITEST>
<MPITEST>
<NAME>iwriteshf</NAME>
<NP>4</NP>
<WORKDIR>./f77/io</WORKDIR>
<STATUS>fail</STATUS>
<TESTDIFF>
mpdrun_wiccopt-1 (handle_sig_occurred 519): job terminating due to timeout


I can post more of the output, but that's the general idea.
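
In case it helps anyone going through the excerpts above, here is a
minimal sketch of how the failing test names could be pulled out of the
MPICH2 test summary. It assumes the file uses the <NAME> and <STATUS>
tags shown above, with trailing carriage returns; the function name and
the sample text are illustrative and not part of MPICH2.

```python
import re


def failed_tests(summary_text: str) -> list:
    """Return the names of tests whose <STATUS> is 'fail', in file order."""
    failed = []
    current_name = None
    # splitlines() also handles the carriage returns at the line ends.
    for line in summary_text.replace("\r", "").splitlines():
        name = re.match(r"\s*<NAME>(.*?)</NAME>", line)
        status = re.match(r"\s*<STATUS>(.*?)</STATUS>", line)
        if name:
            current_name = name.group(1)
        elif status:
            if status.group(1) == "fail":
                # Some excerpts show a status without a preceding name.
                failed.append(current_name or "(unnamed)")
            current_name = None
    return failed


# Sample text modelled on the excerpts above (shortened).
sample = (
    "<NAME>iwriteatf</NAME>\r\n<NP>4</NP>\r\n"
    "<WORKDIR>./f77/io</WORKDIR>\r\n<STATUS>fail</STATUS>\r\n"
    "<NAME>iwritef</NAME>\r\n<NP>4</NP>\r\n<STATUS>fail</STATUS>\r\n"
)
print(failed_tests(sample))
```

Entries that appear without a <NAME> line, like the spawn ones above,
are reported as "(unnamed)" rather than being attributed to an earlier
test.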

  Any recommendations?
  Could it be because of NFS? MPICH is installed on a file server, and I
  mount it with NFS version 3.

                             Thanks for your help,
                               Jerry Mersel




