[mpich-discuss] mpich2-1.2.1, only starts 5 mpd's and cpi won't run, compiler flags issue?

David Mathog mathog at caltech.edu
Mon Feb 8 15:58:14 CST 2010


On one of the compute nodes (an Athlon MP) I did:

./configure --prefix=/opt/mpich2_121 --enable_sharedlibs=gcc

and then on the primary server:

make 
make install

Copied /opt/mpich2_121 to all compute nodes.  This resolved the cpi
problem, as that program now runs on all nodes.
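
(A simple rsync loop is enough for that copy step, something like the
sketch below; the monkeyNN.cluster hostnames are an assumption based on
the node names in the output further down, so adjust to the real node
list.)

 for i in $(seq -w 1 15); do
     rsync -a /opt/mpich2_121/ monkey$i.cluster:/opt/mpich2_121/
 done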

>  If you  
> are running it in the background because it hangs, then you are likely  
> hitting the hang bug described here:
> https://trac.mcs.anl.gov/projects/mpich2/ticket/974

Yes, that was causing mpdboot to hang.  As suggested in that bug thread,
installing the newer mpd.py from the link provided resolved the issue.
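
(For anyone else hitting ticket 974: the fix amounted to dropping the
newer mpd.py from that ticket over the installed copy and pushing it
back out to the nodes.  The bin/ path below is an assumption based on
the --prefix used above:)

 cp mpd.py /opt/mpich2_121/bin/mpd.py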

Test runs of cpi with small -n worked, up to

 mpiexec -n 8 /opt/mpich2_121/examples/cpi

where it exploded.  Possibly because that run included node monkey15:
the build was done on monkey01, which is an Athlon MP, while monkey15
is an Athlon XP.  A pretty subtle difference in processor, but maybe it
is still enough to be a problem.  The run blows up like this:
Process 0 of 8 is on safserver.bio.caltech.edu
Process 1 of 8 is on monkey04.cluster
Process 3 of 8 is on monkey11.cluster
Process 5 of 8 is on monkey09.cluster
Process 4 of 8 is on monkey10.cluster
Process 2 of 8 is on monkey12.cluster
Process 7 of 8 is on monkey15.cluster
Process 6 of 8 is on monkey02.cluster
Fatal error in PMPI_Bcast: Other MPI error, error stack:
PMPI_Bcast(1302)..................: MPI_Bcast(buf=0xbfe1dcc8, count=1,
MPI_INT, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast(1031)..................: 
MPIR_Bcast_binomial(187)..........: 
MPIC_Send(41).....................: 
MPIC_Wait(513)....................: 
MPIDI_CH3I_Progress(150)..........: 
MPID_nem_mpich2_blocking_recv(948): 
MPID_nem_tcp_connpoll(1709).......: Communication error
rank 6 in job 1  safserver.bio.caltech.edu_43019   caused collective
abort of all ranks
  exit status of rank 6: return code 1 
Fatal error in PMPI_Reduce: Other MPI error, error stack:
PMPI_Reduce(1198).................: MPI_Reduce(sbuf=0xbfbe4278,
rbuf=0xbfbe4270, count=1, MPI_DOUBLE, MPI_SUM, root=0, MPI_COMM_WORLD)
failed
MPIR_Reduce(764)..................: 
MPIR_Reduce_binomial(172).........: 
MPIC_Recv(83).....................: 
MPIC_Wait(513)....................: 
MPIDI_CH3I_Progress(150)..........: 
MPID_nem_mpich2_blocking_recv(948): 
MPID_nem_tcp_connpoll(1720).......: 
state_commrdy_handler(1556).......: 
MPID_nem_tcp_recv_handler(1446)...: socket closed
rank 4 in job 1  safserver.bio.caltech.edu_43019   caused collective
abort of all ranks
  exit status of rank 4: return code 1 
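
For reference, the two collectives that die in the trace are exactly
the pattern cpi uses: rank 0 broadcasts the interval count, every rank
integrates its slice, and the partial sums are reduced back to rank 0.
A stripped-down sketch of that pattern (not the actual examples/cpi.c)
looks roughly like this:

 #include <mpi.h>
 #include <stdio.h>

 int main(int argc, char *argv[])
 {
     int n = 10000, rank, size, i;
     double h, sum, x, mypi, pi;

     MPI_Init(&argc, &argv);
     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
     MPI_Comm_size(MPI_COMM_WORLD, &size);

     /* rank 0 sends the interval count to everyone -- this is the
        MPI_Bcast that fails in the trace above */
     MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

     /* each rank integrates its slice of 4/(1+x^2) over [0,1] */
     h = 1.0 / (double) n;
     sum = 0.0;
     for (i = rank + 1; i <= n; i += size) {
         x = h * ((double) i - 0.5);
         sum += 4.0 / (1.0 + x * x);
     }
     mypi = h * sum;

     /* partial sums are combined on rank 0 -- the MPI_Reduce in the trace */
     MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

     if (rank == 0)
         printf("pi is approximately %.16f\n", pi);
     MPI_Finalize();
     return 0;
 }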

> 2) Configuring with "--with-atomic-primitives=opa_gcc_intel_32_64_p3.h".
> This tells MPICH2 to use atomics suitable for older x86 processors that
> don't support mfence.

Guess I will try that, just to eliminate the possibility that some
slight difference between Athlon XP and MP is triggering this last issue.
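
(For the record, that would mean rerunning configure with the same
options as before plus the new one, presumably something like

 ./configure --prefix=/opt/mpich2_121 --enable_sharedlibs=gcc \
             --with-atomic-primitives=opa_gcc_intel_32_64_p3.h

and then repeating the make / make install / copy-to-nodes steps.)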

Thanks,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech

