[MPICH] tests are failing with different numbers
Jerry Mersel
jerry.mersel at weizmann.ac.il
Tue Sep 20 08:54:32 CDT 2005
Hi:
I've installed MPICH 1.2.6 on a cluster that consists of several
dual-Opteron machines running Red Hat AS 4.0.
A user running an application on 4 processors has brought to my
attention that two runs of the same binary produce two different
sets of results.
I then ran the tests that come with MPICH (I know I should have done
that before; we just won't tell anybody) and came up with errors.
Here they are:
Differences in issendtest.out
Differences in structf.out
*** Checking for differences from expected output ***
Differences in issendtest.out
Differences in structf.out
p0_3896: p4_error: net_recv read: probable EOF on socket: 1
p0_10524: p4_error: : 972
0 - MPI_ADDRESS : Address of location given to MPI_ADDRESS does
not fit in Fortran integer
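One possible reading of the MPI_ADDRESS message, offered as an assumption rather than a diagnosis: on a 64-bit machine such as an Opteron, an address can exceed the range of a Fortran INTEGER if that INTEGER is the usual 32-bit default. The sketch below only illustrates the range check. The sample value is the pointer printed in the MPI_Comm_connect error stack later in this post, and the helper name fits_in_int32 is hypothetical.

```python
# Range of a signed 32-bit integer (assumed size of a default Fortran INTEGER).
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1

def fits_in_int32(value):
    """Return True if value is representable as a signed 32-bit integer."""
    return INT32_MIN <= value <= INT32_MAX

# Pointer value copied from the MPI_Comm_connect error stack in this post.
sample_address = 0x7fbffff5c0

print(fits_in_int32(sample_address))  # False: the value needs more than 32 bits
print(fits_in_int32(0x7fffffff))      # True: largest signed 32-bit value
```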
I've tried this with the gcc compiler and the PGI compiler, with no
options and with many different options; the results are the same.
I tried using MPICH from the PGI site; the user still got different
results on different runs.
I then tried MPICH2, and its tests also failed with:
copy function return code was MPI_SUCCESS in dup
Found 1 errors
<WORKDIR>./f77/spawn</WORKDIR>
<STATUS>fail</STATUS>
<TESTDIFF>
Error in Lookup name
Invalid service name (see MPI_Publish_name), error stack:
MPID_NS_Lookup(182): Lookup failed for service name MyTest
Found 1 errors
<WORKDIR>./f77/spawn</WORKDIR>
<STATUS>fail</STATUS>
<TESTDIFF>
aborting job:
Fatal error in MPI_Comm_connect: Other MPI error, error stack:
MPI_Comm_connect(118): MPI_Comm_connect(port=" ", MPI_INFO_NULL, root=0,
comm=0x84000000, newcomm=0x7fbffff5c0) failed
MPID_Comm_connect(28):
MPIDI_CH3_Comm_connect(76):
MPIDI_CH3I_Connect_to_root(1519): no space for the host description
rank 1 in job 610 wiccopt-1_34510 caused collective abort of all ranks
exit status of rank 1: return code 13
<NAME>iwriteatf</NAME>
<NP>4</NP>
<WORKDIR>./f77/io</WORKDIR>
<STATUS>fail</STATUS>
<TESTDIFF>
Error class 17(See the MPI_ERROR field in MPI_Status for the error code)
0 buf( 1) = -49 expected 48
Error class 17(See the MPI_ERROR field in MPI_Status for the error code)
0 buf( 1) = -177 expected 176
Error class 17(See the MPI_ERROR field in MPI_Status for the error code)
0 buf2( 1) = -397
Error class 17(See the MPI_ERROR field in MPI_Status for the error code)
0 buf( 1) = -929 expected 928
Error class 17(See the MPI_ERROR field in MPI_Status for the error code)
0 buf2( 1) = -1005
Error class 17(See the MPI_ERROR field in MPI_Status for the error code)
3 buf2( 1) = -1968
Error class 17(See the MPI_ERROR field in MPI_Status for the error code)
3 buf2( 1) = -5304
Error class 17(See the MPI_ERROR field in MPI_Status for the error code)
3 buf( 1) = -5636 expected 5635
Error class 17(See the MPI_ERROR field in MPI_Status for the error code)
3 buf2( 1) = -5896
Error class 17(See the MPI_ERROR field in MPI_Status for the error code)
3 buf2( 1) = -6152
mpdrun_wiccopt-1 (handle_sig_occurred 519): job terminating due to timeout
<NAME>iwritef</NAME>
<NP>4</NP>
<WORKDIR>./f77/io</WORKDIR>
<STATUS>fail</STATUS>
<TESTDIFF>
Error class 17(See the MPI_ERROR field in MPI_Status for the error code)
3 buf2( 1) = -1080
Error class 17(See the MPI_ERROR field in MPI_Status for the error code)
3 buf2( 1) = -1672
Error class 17(See the MPI_ERROR field in MPI_Status for the error code)
3 buf2( 1) = -2256
Error class 17(See the MPI_ERROR field in MPI_Status for the error code)
3 buf2( 1) = -2360
Error class 17(See the MPI_ERROR field in MPI_Status for the error code)
3 buf( 1) = -2460 expected 2459
Error class 17(See the MPI_ERROR field in MPI_Status for the error code)
0 buf( 1) = -5105 expected 5104
Error class 17(See the MPI_ERROR field in MPI_Status for the error code)
0 buf2( 1) = -5869
Error class 17(See the MPI_ERROR field in MPI_Status for the error code)
0 buf2( 1) = -6061
Error class 17(See the MPI_ERROR field in MPI_Status for the error code)
0 buf( 1) = -6153 expected 6152
Error class 17(See the MPI_ERROR field in MPI_Status for the error code)
0 buf2( 1) = -6741
mpdrun_wiccopt-1 (handle_sig_occurred 519): job terminating due to timeout
</TESTDIFF>
</MPITEST>
<MPITEST>
<NAME>iwriteshf</NAME>
<NP>4</NP>
<WORKDIR>./f77/io</WORKDIR>
<STATUS>fail</STATUS>
<TESTDIFF>
mpdrun_wiccopt-1 (handle_sig_occurred 519): job terminating due to timeout
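A pattern worth noting in the iwriteatf and iwritef output above: on every line that prints both a value and an expected value, the reported value equals -(expected + 1). The short check below verifies only that arithmetic, using pairs copied from the output. What the pattern means (for example, a buffer that still holds an initial sentinel value because the data never arrived) is only a guess and is not stated anywhere in the logs. The helper name matches_pattern is hypothetical.

```python
# (reported, expected) pairs copied from the iwriteatf and iwritef output.
pairs = [
    (-49, 48), (-177, 176), (-929, 928), (-5636, 5635),
    (-2460, 2459), (-5105, 5104), (-6153, 6152),
]

def matches_pattern(reported, expected):
    """Return True when reported == -(expected + 1)."""
    return reported == -(expected + 1)

# Check every pair against the pattern.
all_match = all(matches_pattern(r, e) for r, e in pairs)
print(all_match)  # True
```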
I can add more, but that's the general idea.
Any recommendations?
Could it be because of NFS? MPICH is on a file server but I mount it with
NFS version 3.
Thanks for your help,
Jerry Mersel