[MPICH] troubles with mpich2-1.0.3
    Philip Sydney Lavers 
    psl02 at uow.edu.au
       
    Wed Jan 18 02:04:35 CST 2006
    
    
  
Hello folks,
I seek help with the following problem:
My Opteron/Athlon64 cluster has been working well with MPICH2 on FreeBSD for some months, but when I added new nodes, including a dual-core Athlon64 (highly recommended), I decided to upgrade to mpich2-1.0.3. It compiled and installed successfully on some nodes but not on others - see below.
1) The nodes with mpich2-1.0.3 would not join the ring with nodes that had the older version and vice versa.
2) The new version fails to build on some nodes even though it builds without problems on identical machines.
Here is the make error:
"lots of output
...........make /home/psl/downloads/mpich2-1.0.3/src/mpe2/lib/libmpe.a
`/home/psl/downloads/mpich2-1.0.3/src/mpe2/lib/libmpe.a' is up to date.
make /home/psl/downloads/mpich2-1.0.3/src/mpe2/bin/clog2_print
gcc -O3 -march=athlon64 -I.. -I/home/psl/downloads/mpich2-1.0.3/src/mpe2/src/logging/include  -I../../.. -I/home/psl/downloads/mpich2-1.0.3/src/mpe2/src/logging/../../include    -DCLOG_NOMPI -c clog_print.c
gcc  -O3 -march=athlon64  -o /home/psl/downloads/mpich2-1.0.3/src/mpe2/bin/clog2_print clog_print.o  -L/home/psl/downloads/mpich2-1.0.3/src/mpe2/lib -lmpe_nompi
clog_print.o(.text+0x2e): In function `main':
: undefined reference to `CLOG_Rec_sizes_init'
clog_print.o(.text+0x33): In function `main':
: undefined reference to `CLOG_Preamble_create'
clog_print.o(.text+0x41): In function `main':
: undefined reference to `CLOG_Preamble_read'
clog_print.o(.text+0x51): In function `main':
: undefined reference to `CLOG_Preamble_print'
clog_print.o(.text+0x5d): In function `main':
: undefined reference to `CLOG_BlockData_create'
clog_print.o(.text+0x80): In function `main':
: undefined reference to `CLOG_BlockData_reset'
clog_print.o(.text+0x91): In function `main':
: undefined reference to `CLOG_BlockData_print'
clog_print.o(.text+0xea): In function `main':
: undefined reference to `CLOG_BlockData_free'
clog_print.o(.text+0xf2): In function `main':
: undefined reference to `CLOG_Preamble_free'
clog_print.o(.text+0x10e): In function `main':
: undefined reference to `CLOG_BlockData_swap_bytes_first'
*** Error code 1
Stop in /home/psl/downloads/mpich2-1.0.3/src/mpe2/src/logging/src.
*** Error code 1
Stop in /home/psl/downloads/mpich2-1.0.3/src/mpe2/src/logging/src.
*** Error code 1
Stop in /home/psl/downloads/mpich2-1.0.3/src/mpe2/src/logging.
*** Error code 1
Stop in /home/psl/downloads/mpich2-1.0.3/src/mpe2.
*** Error code 1
Stop in /home/psl/downloads/mpich2-1.0.3/src.
*** Error code 1
Stop in /home/psl/downloads/mpich2-1.0.3.
"
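All ten missing symbols in the log above carry the CLOG_ prefix, and the failing link line pulls in only -lmpe_nompi. As a minimal sketch, assuming the make output has been saved to a text file, the symbol names can be extracted like this so they can be compared against what the library actually contains (for example with nm). The function name undefined_symbols and the two-entry sample are illustrative only; they are not part of MPICH2.

```python
import re

def undefined_symbols(log_text):
    """Return unique symbol names from 'undefined reference to' lines, in order seen."""
    # GNU ld of this era quotes the symbol as `name'
    pattern = re.compile(r"undefined reference to `([^']+)'")
    seen = []
    for name in pattern.findall(log_text):
        if name not in seen:
            seen.append(name)
    return seen

# Two entries copied from the make error above, used as sample input.
sample = """clog_print.o(.text+0x2e): In function `main':
: undefined reference to `CLOG_Rec_sizes_init'
clog_print.o(.text+0x33): In function `main':
: undefined reference to `CLOG_Preamble_create'
"""

for symbol in undefined_symbols(sample):
    print(symbol)
```

Running the same function over the logs of a failing and a succeeding build on the same node would show whether the set of missing symbols is stable between attempts.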
This problem is an old "friend" - I remember once changing source code to fix the build, but I don't want to do that now that I have ten machines.
3) The above problem goes away randomly by just repeating the command 'make' - either the error will recur or the build will complete.
I then su to root and 'make install' .
Great - but I suspect the MPI installation on some nodes is broken.
4) Here is a broken production run output: 
'
psl@claude1$ mpdtrace -l
claude1_56774 (192.168.1.11)
Paul1_49487 (192.168.1.221)
claude8_64665 (192.168.1.18)
claude10_54749 (192.168.1.20)
psl@claude1$ mpiexec -n 6 ./rogfn 1 100 10  789 500 .0000001
Process 0 of 6 is on claude1
Process 1 of 6 is on Paul1Process 2 of 6 is on claude8
Process 3 of 6 is on claude10
Process 4 of 6 is on claude1
Process 5 of 6 is on Paul1
[cli_3]: aborting job:
Fatal error in MPI_Barrier: Error message texts are not available, error stack:
(unknown)(): Error message texts are not available
rank 3 in job 1  claude1_56774   caused collective abort of all ranks
  exit status of rank 3: return code 13 
'
Is this problem related to the build problem?
Can anyone help?
rogfn is a very large MD-type programme that was running very well on the smaller cluster. claude1 and Paul1 are dual-core Athlon and dual-Opteron machines respectively. Asking for six processes on the above four hosts commits claude1 and Paul1 to using both of their processors.
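The placement in the run output is consistent with ranks being dealt out round-robin in ring order. The following is only a sketch of that pattern, inferred from the "Process N of 6" lines above and not taken from the MPD documentation; place_ranks is a hypothetical helper, and the host order is the one reported by mpdtrace.

```python
def place_ranks(hosts, nprocs):
    """Assign ranks 0..nprocs-1 to hosts round-robin (assumed behaviour)."""
    return [hosts[rank % len(hosts)] for rank in range(nprocs)]

# Ring order as reported by mpdtrace -l
hosts = ["claude1", "Paul1", "claude8", "claude10"]
placement = place_ranks(hosts, 6)

for rank, host in enumerate(placement):
    print(f"Process {rank} of {len(placement)} is on {host}")

# Number of ranks landing on each host
counts = {host: placement.count(host) for host in hosts}
print(counts)
```

Under this assumption ranks 4 and 5 wrap around to claude1 and Paul1, so those two machines each run two processes, while claude8 and claude10 (where the failing rank 3 ran) each run one.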
Thanks for any advice,
Phil Lavers
    
    