[mpich-discuss] mpich2-1.2.1p1

Costa, Michael MCosta at fanshawec.ca
Fri Apr 2 15:05:39 CDT 2010


Rajeev,
 
I ran /usr/local/sbin/mpeuninstall, ran make clean, and then reconfigured with the ch3:sock device. It built and installed OK, but I'm still getting communication errors, albeit fewer this time. Any more thoughts?
 
 
Mike C.
 
 
 
root at hp20:~# mpdallexit
root at hp20:~# mpdboot -n 2 -v -f /etc/mpd.hosts
running mpdallexit on hp20
LAUNCHED mpd on hp20  via
RUNNING: mpd on hp20
LAUNCHED mpd on hp14  via  hp20
RUNNING: mpd on hp14
root at hp20:~# mpiexec -n 1 ./cpi
Process 0 of 1 is on hp20
pi is approximately 3.1415926544231341, Error is 0.0000000008333410
wall clock time = 0.003967
root at hp20:~# mpiexec -n 2 ./cpi
Process 0 of 2 is on hp20
Process 1 of 2 is on hp14
Fatal error in PMPI_Bcast: Other MPI error, error stack:
PMPI_Bcast(1302)..................: MPI_Bcast(buf=0xc00ee33c, count=1, MPI_INT, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast(1031)..................:
MPIR_Bcast_binomial(157)..........:
MPIC_Recv(83).....................:
MPIC_Wait(513)....................:
MPIDI_CH3I_Progress(150)..........:
MPID_nem_mpich2_blocking_recv(948):
MPID_nem_tcp_connpoll(1720).......:
state_listening_handler(1787).....: accept of socket fd failed - Resource temporarily unavailable
rank 1 in job 2  hp20_58916   caused collective abort of all ranks
  exit status of rank 1: return code 1

 
 
 
 
 
Michael A. Costa
SET (RCC), CCAI-CCNA/CCNP (Cisco), MInfTech (Griffith)
Professor - Information Technology Division
Fanshawe College
G3001
1001 Fanshawe College Boulevard
P.O. Box 7005
London, ON N5Y 5R6
Tel: (519) 452-4291  Fax: (519) 452-1801
 

________________________________

From: mpich-discuss-bounces at mcs.anl.gov on behalf of Rajeev Thakur
Sent: Fri 4/2/2010 2:52 PM
To: mpich-discuss at mcs.anl.gov
Subject: Re: [mpich-discuss] mpich2-1.2.1p1


Can you try configuring with --with-device=ch3:sock and see if that works?
 
Rajeev


________________________________

	From: mpich-discuss-bounces at mcs.anl.gov [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Costa, Michael
	Sent: Friday, April 02, 2010 1:01 PM
	To: mpich-discuss at mcs.anl.gov
	Subject: [mpich-discuss] mpich2-1.2.1p1
	
	
	I have been struggling with communication errors whenever I run mpiexec. This installation is on a PA-RISC-based cluster. I am running mpich2-1.2.1p1, configured with --with-device=ch3:nemesis.
	 
	Currently only two nodes, hp20 and hp14, are in the ring, for testing/setup purposes.
	 
	The following steps may shed some light on the problem, which I'm sure is something I omitted or failed to do during the initial installation/configuration. It appears that I can run non-MPI programs OK, but MPI code like cpi or hello fails.
	 
	 
	hp20:~$ mpdallexit
	

	mikec at hp20:~$ mpdboot -v -n 2 -f /etc/mpd.hosts
	running mpdallexit on hp20
	LAUNCHED mpd on hp20  via
	RUNNING: mpd on hp20
	LAUNCHED mpd on hp14  via  hp20
	RUNNING: mpd on hp14
	
	mikec at hp20:~$ mpdtrace
	hp20
	hp14

	mikec at hp20:~$ mpdtrace -l
	hp20_44192 (172.17.81.20)
	hp14_51832 (172.17.81.14)

	mikec at hp20:~$ mpdringtest 10
	time for 10 loops = 0.0491678714752 seconds

	mikec at hp20:~$ mpiexec -n 2 uname -a
	Linux hp20 2.6.32-trunk-parisc #1 Mon Jan 11 03:07:31 UTC 2010 parisc GNU/Linux
	Linux hp14 2.6.32-trunk-parisc #1 Mon Jan 11 03:07:31 UTC 2010 parisc GNU/Linux
	 
	mikec at hp20:~$ mpiexec -n 1 ./cpi
	Process 0 of 1 is on hp20
	pi is approximately 3.1415926544231341, Error is 0.0000000008333410
	wall clock time = 0.003888
	
	
	mikec at hp20:~$ mpiexec -n 2 ./cpi
	Process 0 of 2 is on hp20
	Process 1 of 2 is on hp14
	Fatal error in PMPI_Bcast: Other MPI error, error stack:
	PMPI_Bcast(1302)..................: MPI_Bcast(buf=0xc016e33c, count=1, MPI_INT, root=0, MPI_COMM_WORLD) failed
	MPIR_Bcast(1031)..................:
	MPIR_Bcast_binomial(157)..........:
	MPIC_Recv(83).....................:
	MPIC_Wait(513)....................:
	MPIDI_CH3I_Progress(150)..........:
	MPID_nem_mpich2_blocking_recv(948):
	MPID_nem_tcp_connpoll(1720).......:
	state_listening_handler(1787).....: accept of socket fd failed - Resource temporarily unavailable
	rank 1 in job 2  hp20_44192   caused collective abort of all ranks
	  exit status of rank 1: return code 1
	Fatal error in PMPI_Bcast: Other MPI error, error stack:
	PMPI_Bcast(1302)..................: MPI_Bcast(buf=0xc067f33c, count=1, MPI_INT, root=0, MPI_COMM_WORLD) failed
	MPIR_Bcast(1031)..................:
	MPIR_Bcast_binomial(187)..........:
	MPIC_Send(41).....................:
	MPIC_Wait(513)....................:
	MPIDI_CH3I_Progress(150)..........:
	MPID_nem_mpich2_blocking_recv(948):
	MPID_nem_tcp_connpoll(1709).......: Communication error
	 
	 
	Any comments and/or suggestions are greatly appreciated.
	 
	 
	Mike C.
	 
	 
	
	
	


