[mpich-discuss] mpich2-1.2.1, only starts 5 mpd's and cpi won't run, compiler flags issue?

David Mathog mathog at caltech.edu
Mon Feb 8 16:55:39 CST 2010


> Test runs of cpi for small -n worked up to
> 
>  mpiexec -n 8 /opt/mpich2_121/examples/cpi
> 
> where it exploded.  Possibly because it included node monkey15.  The
> build was on monkey01; the former is an Athlon XP, the latter an Athlon
> MP.  A pretty subtle difference in processor, but maybe that is still a
> problem.  The run blows up like this:
> 
> Process 0 of 8 is on safserver.bio.caltech.edu
> Process 1 of 8 is on monkey04.cluster
> Process 3 of 8 is on monkey11.cluster
> Process 5 of 8 is on monkey09.cluster
> Process 4 of 8 is on monkey10.cluster
> Process 2 of 8 is on monkey12.cluster
> Process 7 of 8 is on monkey15.cluster
> Process 6 of 8 is on monkey02.cluster
> Fatal error in PMPI_Bcast: Other MPI error, error stack:
> PMPI_Bcast(1302)..................: MPI_Bcast(buf=0xbfe1dcc8, count=1,
> MPI_INT, root=0, MPI_COMM_WORLD) failed
> MPIR_Bcast(1031)..................: 
> MPIR_Bcast_binomial(187)..........: 
> MPIC_Send(41).....................: 
> MPIC_Wait(513)....................: 
> MPIDI_CH3I_Progress(150)..........: 
> MPID_nem_mpich2_blocking_recv(948): 
> MPID_nem_tcp_connpoll(1709).......: Communication error
> rank 6 in job 1  safserver.bio.caltech.edu_43019   caused collective
> abort of all ranks
>   exit status of rank 6: return code 1 
> Fatal error in PMPI_Reduce: Other MPI error, error stack:
> PMPI_Reduce(1198).................: MPI_Reduce(sbuf=0xbfbe4278,
> rbuf=0xbfbe4270, count=1, MPI_DOUBLE, MPI_SUM, root=0, MPI_COMM_WORLD)
> failed
> MPIR_Reduce(764)..................: 
> MPIR_Reduce_binomial(172).........: 
> MPIC_Recv(83).....................: 
> MPIC_Wait(513)....................: 
> MPIDI_CH3I_Progress(150)..........: 
> MPID_nem_mpich2_blocking_recv(948): 
> MPID_nem_tcp_connpoll(1720).......: 
> state_commrdy_handler(1556).......: 
> MPID_nem_tcp_recv_handler(1446)...: socket closed
> rank 4 in job 1  safserver.bio.caltech.edu_43019   caused collective
> abort of all ranks
>   exit status of rank 4: return code 1 
> 
> > 2) Configuring with "--with-atomic-primitives=opa_gcc_intel_32_64_p3.h".
> > This tells MPICH2 to use atomics suitable for older x86 processors
> > that don't support mfence.
> 
> Guess I will try that, just to eliminate the possibility that some
> slight difference between Athlon XP and MP is triggering this last issue.

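For the record, the rebuild was just the previous configure line with the
atomic-primitives option added; roughly the following, assuming the tree
under /opt/mpich2_121 is also the install prefix (adjust if the examples
are being run out of the build tree):

  ./configure --prefix=/opt/mpich2_121 \
              --with-atomic-primitives=opa_gcc_intel_32_64_p3.h
  make && make install
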
Using --with-atomic-primitives=opa_gcc_intel_32_64_p3.h did not resolve
the issue.  It blew up again after the following commands:

mpdboot -f /usr/common/etc/machines.LINUX_INTEL_Safserver -n 21 -r rsh \
    --ifhn=192.168.1.220 -v

mpiexec -n 8 /opt/mpich2_121/examples/cpi
Process 0 of 8 is on safserver.bio.caltech.edu
Process 1 of 8 is on monkey02.cluster
Process 3 of 8 is on monkey11.cluster
Process 5 of 8 is on monkey09.cluster
Process 2 of 8 is on monkey12.cluster
Process 4 of 8 is on monkey10.cluster
Process 6 of 8 is on monkey04.cluster
Process 7 of 8 is on monkey15.cluster
Fatal error in PMPI_Bcast: Other MPI error, error stack:
PMPI_Bcast(1302)..................: MPI_Bcast(buf=0xbfc3eae8, count=1,
MPI_INT, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast(1031)..................: 
MPIR_Bcast_binomial(187)..........: 
MPIC_Send(41).....................: 
MPIC_Wait(513)....................: 
MPIDI_CH3I_Progress(150)..........: 
MPID_nem_mpich2_blocking_recv(948): 
MPID_nem_tcp_connpoll(1709).......: Communication error
Fatal error in PMPI_Reduce: Other MPI error, error stack:
PMPI_Reduce(1198).................: MPI_Reduce(sbuf=0xbf9f0088,
rbuf=0xbf9f0080, count=1, MPI_DOUBLE, MPI_SUM, root=0, MPI_COMM_WORLD)
failed
MPIR_Reduce(764)..................: 
MPIR_Reduce_binomial(172).........: 
MPIC_Recv(83).....................: 
MPIC_Wait(513)....................: 
MPIDI_CH3I_Progress(150)..........: 
MPID_nem_mpich2_blocking_recv(948): 
MPID_nem_tcp_connpoll(1720).......: 
state_commrdy_handler(1556).......: 
MPID_nem_tcp_recv_handler(1446)...: socket closed
Fatal error in PMPI_Reduce: Other MPI error, error stack:
PMPI_Reduce(1198).................: MPI_Reduce(sbuf=0xbfcb2348,
rbuf=0xbfcb2340, count=1, MPI_DOUBLE, MPI_SUM, root=0, MPI_COMM_WORLD)
failed
MPIR_Reduce(764)..................: 
MPIR_Reduce_binomial(172).........: 
MPIC_Recv(83).....................: 
MPIC_Wait(513)....................: 
MPIDI_CH3I_Progress(150)..........: 
MPID_nem_mpich2_blocking_recv(948): 
MPID_nem_tcp_connpoll(1720).......: 
state_commrdy_handler(1556).......: 
MPID_nem_tcp_recv_handler(1446)...: socket closed
rank 6 in job 2  safserver.bio.caltech.edu_49241   caused collective
abort of all ranks
  exit status of rank 6: killed by signal 9 
rank 4 in job 2  safserver.bio.caltech.edu_49241   caused collective
abort of all ranks
  exit status of rank 4: killed by signal 9 

No problems for mpiexec when -n is less than 8.  The only thing I see
that is (slightly) off is that it reports the wrong interface name on
safserver: it should probably say safserver.cluster (the 192.168 address
on eth1).  That is likely just the result of it using "hostname" to
figure out its own name, and on the server that command returns the
public side.

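If that interface name turns out to matter, one thing that might be worth
trying is forcing the cluster-side name for the process on safserver.  If
I remember the MPICH2 docs right, the sock/nemesis channels look at
MPICH_INTERFACE_HOSTNAME when choosing a network interface; something
along these lines (untested, and the variable may need to go through
mpiexec's -env/-genv options rather than the shell if mpd does not pass
the environment along):

  # force the eth1 (192.168) name instead of the public one
  export MPICH_INTERFACE_HOSTNAME=safserver.cluster
  mpiexec -n 8 /opt/mpich2_121/examples/cpi
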
icpi has the same problems as cpi.

Note that if -n 9 or larger is used it just locks up; there is no error
dump.  So, in summary:

-n < 8   runs
-n == 8  throws a long error
-n > 8   locks up without saying why

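Two quick checks that come to mind (not yet run): see whether plain
process startup survives past 8, and see whether -n 8 still fails when
the Athlon XP node is left out.  Roughly (the machinefile path here is
just a placeholder):

  # does startup itself get past 8 processes, without any MPI traffic?
  mpiexec -l -n 9 hostname

  # does cpi at -n 8 work when monkey15 is excluded?
  mpiexec -machinefile /tmp/hosts.no_monkey15 -n 8 /opt/mpich2_121/examples/cpi
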
Any idea what might be going on here???  

Thanks,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech

