[mpich-discuss] mpich2-1.2.1p1

Costa, Michael MCosta at fanshawec.ca
Fri Mar 26 17:41:37 CDT 2010


I have been struggling with communication errors whenever I run mpiexec. This installation is on a PA-RISC based cluster running mpich2-1.2.1p1, which I configured with --with-device=ch3:nemesis.
 
Currently only two nodes, hp20 and hp14, are in the ring for testing and setup purposes.
 
The following steps may shed some light on the problem, which I'm sure is caused by something I omitted or failed to do during the initial installation/configuration. Non-MPI programs appear to run OK, but MPI programs such as cpi or hello fail. As the output below shows, cpi works with one process but fails with two.
 
 
hp20:~$ mpdallexit

mikec at hp20:~$ mpdboot -v -n 2 -f /etc/mpd.hosts
running mpdallexit on hp20
LAUNCHED mpd on hp20  via
RUNNING: mpd on hp20
LAUNCHED mpd on hp14  via  hp20
RUNNING: mpd on hp14

mikec at hp20:~$ mpdtrace
hp20
hp14

mikec at hp20:~$ mpdtrace -l
hp20_44192 (172.17.81.20)
hp14_51832 (172.17.81.14)

mikec at hp20:~$ mpdringtest 10
time for 10 loops = 0.0491678714752 seconds

mikec at hp20:~$ mpiexec -n 2 uname -a
Linux hp20 2.6.32-trunk-parisc #1 Mon Jan 11 03:07:31 UTC 2010 parisc GNU/Linux
Linux hp14 2.6.32-trunk-parisc #1 Mon Jan 11 03:07:31 UTC 2010 parisc GNU/Linux
 
mikec at hp20:~$ mpiexec -n 1 ./cpi
Process 0 of 1 is on hp20
pi is approximately 3.1415926544231341, Error is 0.0000000008333410
wall clock time = 0.003888


mikec at hp20:~$ mpiexec -n 2 ./cpi
Process 0 of 2 is on hp20
Process 1 of 2 is on hp14
Fatal error in PMPI_Bcast: Other MPI error, error stack:
PMPI_Bcast(1302)..................: MPI_Bcast(buf=0xc016e33c, count=1, MPI_INT, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast(1031)..................:
MPIR_Bcast_binomial(157)..........:
MPIC_Recv(83).....................:
MPIC_Wait(513)....................:
MPIDI_CH3I_Progress(150)..........:
MPID_nem_mpich2_blocking_recv(948):
MPID_nem_tcp_connpoll(1720).......:
state_listening_handler(1787).....: accept of socket fd failed - Resource temporarily unavailable
rank 1 in job 2  hp20_44192   caused collective abort of all ranks
  exit status of rank 1: return code 1
Fatal error in PMPI_Bcast: Other MPI error, error stack:
PMPI_Bcast(1302)..................: MPI_Bcast(buf=0xc067f33c, count=1, MPI_INT, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast(1031)..................:
MPIR_Bcast_binomial(187)..........:
MPIC_Send(41).....................:
MPIC_Wait(513)....................:
MPIDI_CH3I_Progress(150)..........:
MPID_nem_mpich2_blocking_recv(948):
MPID_nem_tcp_connpoll(1709).......: Communication error
 
 
Any comments or suggestions are greatly appreciated.
 
 
Mike C.
 
 
 
Michael A. Costa
SET (RCC), CCAI-CCNA/CCNP (Cisco), MInfTech (Griffith)
Professor - Information Technology Division
Fanshawe College
G3001
1001 Fanshawe College Boulevard
P.O. Box 7005
London, ON N5Y 5R6
Tel: (519) 452-4291  Fax: (519) 452-1801
 