[mpich-discuss] mpich2-1.2.1p1

Costa, Michael MCosta at fanshawec.ca
Fri Apr 2 16:43:57 CDT 2010


Rajeev,
 
FYI, it's running Debian (squeeze) with a 2.6.32 kernel.
I ended up re-making cpi and scp'ing it to all the nodes. Lo and behold, it works. Just to be safe I added another node, and everything is still functional (see below). So configuring with --with-device=ch3:sock was the fix!
Wonders never cease.
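
For anyone following along, the recompile-and-redistribute step looked roughly like the sketch below. The source path and destination directory are assumptions (a default MPICH2 source tree and the home directory seen in the transcripts); mpicc and scp are the standard tools.

  cd ~/mpich2-1.2.1p1/examples      # assumed location of the unpacked MPICH2 source
  mpicc -o cpi cpi.c                # rebuild cpi against the freshly installed ch3:sock build
  scp cpi hp4-master:~/             # push the same binary to every node in the ring
  scp cpi hp14:~/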
 
Truly, thanks for all your help with this.
 
Mike C.
 
 
root at hp20:~# mpdboot -n 3 -v -f /etc/mpd.hosts
running mpdallexit on hp20
LAUNCHED mpd on hp20  via
RUNNING: mpd on hp20
LAUNCHED mpd on hp4-master  via  hp20
LAUNCHED mpd on hp14  via  hp20
RUNNING: mpd on hp4-master
RUNNING: mpd on hp14
root at hp20:~# mpiexec -n 3 ./cpi
Process 0 of 3 is on hp20
Process 2 of 3 is on hp4-master
Process 1 of 3 is on hp14
pi is approximately 3.1415926544231318, Error is 0.0000000008333387
wall clock time = 0.011971
root at hp20:~# mpiexec -n 6 ./cpi
Process 0 of 6 is on hp20
Process 1 of 6 is on hp14
Process 2 of 6 is on hp4-master
Process 5 of 6 is on hp4-master
Process 3 of 6 is on hp20
Process 4 of 6 is on hp14
pi is approximately 3.1415926544231243, Error is 0.0000000008333312
wall clock time = 0.027467
root at hp20:~#
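
For reference, the /etc/mpd.hosts behind that mpdboot -n 3 would normally just list the other nodes, one hostname per line (mpdboot starts the local mpd on hp20 itself); the exact contents shown here are an assumption:

  hp4-master
  hp14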

 
 
 
 
Michael A. Costa
SET (RCC), CCAI-CCNA/CCNP (Cisco), MInfTech (Griffith)
Professor - Information Technology Division
Fanshawe College
G3001
1001 Fanshawe College Boulevard
P.O. Box 7005
London, ON N5Y 5R6
Tel: (519) 452-4291  Fax: (519) 452-1801
 

________________________________

From: mpich-discuss-bounces at mcs.anl.gov on behalf of Rajeev Thakur
Sent: Fri 4/2/2010 4:32 PM
To: mpich-discuss at mcs.anl.gov
Subject: Re: [mpich-discuss] mpich2-1.2.1p1


Not really. We don't have access to a PA-RISC system. Are you using HP-UX or Linux?
 
Rajeev


________________________________

	From: mpich-discuss-bounces at mcs.anl.gov [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Costa, Michael
	Sent: Friday, April 02, 2010 3:06 PM
	To: mpich-discuss at mcs.anl.gov
	Subject: Re: [mpich-discuss] mpich2-1.2.1p1
	
	
	Rajeev,
	 
	I ran /usr/local/sbin/mpeuninstall, then make clean, re-configured with the ch3:sock device, and built and installed it OK, but I'm still getting communication errors, albeit fewer this time. Any more thoughts?
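	
	For the record, the rebuild sequence was roughly the following, run from the mpich2-1.2.1p1 build directory. Only the device option is known from this thread; the --prefix value (and any other configure flags) are assumptions based on the /usr/local/sbin path above.
	
	  /usr/local/sbin/mpeuninstall        # uninstall script from the previous build
	  make clean                          # clear out the old ch3:nemesis objects
	  ./configure --with-device=ch3:sock --prefix=/usr/local
	  make
	  make install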
	 
	 
	Mike C.
	 
	 
	 
	root at hp20:~# mpdallexit
	root at hp20:~# mpdboot -n 2 -v -f /etc/mpd.hosts
	running mpdallexit on hp20
	LAUNCHED mpd on hp20  via
	RUNNING: mpd on hp20
	LAUNCHED mpd on hp14  via  hp20
	RUNNING: mpd on hp14
	root at hp20:~# mpiexec -n 1 ./cpi
	Process 0 of 1 is on hp20
	pi is approximately 3.1415926544231341, Error is 0.0000000008333410
	wall clock time = 0.003967
	root at hp20:~# mpiexec -n 2 ./cpi
	Process 0 of 2 is on hp20
	Process 1 of 2 is on hp14
	Fatal error in PMPI_Bcast: Other MPI error, error stack:
	PMPI_Bcast(1302)..................: MPI_Bcast(buf=0xc00ee33c, count=1, MPI_INT, root=0, MPI_COMM_WORLD) failed
	MPIR_Bcast(1031)..................:
	MPIR_Bcast_binomial(157)..........:
	MPIC_Recv(83).....................:
	MPIC_Wait(513)....................:
	MPIDI_CH3I_Progress(150)..........:
	MPID_nem_mpich2_blocking_recv(948):
	MPID_nem_tcp_connpoll(1720).......:
	state_listening_handler(1787).....: accept of socket fd failed - Resource temporarily unavailable
	rank 1 in job 2  hp20_58916   caused collective abort of all ranks
	  exit status of rank 1: return code 1
	
	 
	 
	 
	 
	 
	
	
	
	Michael A. Costa
	SET (RCC), CCAI-CCNA/CCNP (Cisco), MInfTech (Griffith)
	Professor - Information Technology Division
	Fanshawe College
	G3001
	1001 Fanshawe College Boulevard
	P.O. Box 7005
	London, ON N5Y 5R6
	Tel: (519) 452-4291  Fax: (519) 452-1801
	 
	
	

________________________________

	From: mpich-discuss-bounces at mcs.anl.gov on behalf of Rajeev Thakur
	Sent: Fri 4/2/2010 2:52 PM
	To: mpich-discuss at mcs.anl.gov
	Subject: Re: [mpich-discuss] mpich2-1.2.1p1
	
	
	Can you try configuring with --with-device=ch3:sock and see if that works?
	 
	Rajeev


________________________________

		From: mpich-discuss-bounces at mcs.anl.gov [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Costa, Michael
		Sent: Friday, April 02, 2010 1:01 PM
		To: mpich-discuss at mcs.anl.gov
		Subject: [mpich-discuss] mpich2-1.2.1p1
		
		
		I have been struggling with communication errors whenever I run mpiexec. This installation is on a PA-RISC-based cluster. Running mpich2-1.2.1p1, I have configured it with --with-device=ch3:nemesis.
		 
		Currently only two nodes, hp20 and hp14, are in the ring for testing/setup purposes.
		 
		The following steps may shed some light on the problem, which I'm sure is something I have omitted or failed to do during the initial installation/configuration. It appears that I can run non-MPI programs OK, but MPI code like cpi or hello fails.
		 
		 
		hp20:~$ mpdallexit
		

		mikec at hp20:~$ mpdboot -v -n 2 -f /etc/mpd.hosts
		running mpdallexit on hp20
		LAUNCHED mpd on hp20  via
		RUNNING: mpd on hp20
		LAUNCHED mpd on hp14  via  hp20
		RUNNING: mpd on hp14
		
		mikec at hp20:~$ mpdtrace
		hp20
		hp14

		mikec at hp20:~$ mpdtrace -l
		hp20_44192 (172.17.81.20)
		hp14_51832 (172.17.81.14)

		mikec at hp20:~$ mpdringtest 10
		time for 10 loops = 0.0491678714752 seconds

		mikec at hp20:~$ mpiexec -n 2 uname -a
		Linux hp20 2.6.32-trunk-parisc #1 Mon Jan 11 03:07:31 UTC 2010 parisc GNU/Linux
		Linux hp14 2.6.32-trunk-parisc #1 Mon Jan 11 03:07:31 UTC 2010 parisc GNU/Linux
		 
		mikec at hp20:~$ mpiexec -n 1 ./cpi
		Process 0 of 1 is on hp20
		pi is approximately 3.1415926544231341, Error is 0.0000000008333410
		wall clock time = 0.003888
		
		
		mikec at hp20:~$ mpiexec -n 2 ./cpi
		Process 0 of 2 is on hp20
		Process 1 of 2 is on hp14
		Fatal error in PMPI_Bcast: Other MPI error, error stack:
		PMPI_Bcast(1302)..................: MPI_Bcast(buf=0xc016e33c, count=1, MPI_INT, root=0, MPI_COMM_WORLD) failed
		MPIR_Bcast(1031)..................:
		MPIR_Bcast_binomial(157)..........:
		MPIC_Recv(83).....................:
		MPIC_Wait(513)....................:
		MPIDI_CH3I_Progress(150)..........:
		MPID_nem_mpich2_blocking_recv(948):
		MPID_nem_tcp_connpoll(1720).......:
		state_listening_handler(1787).....: accept of socket fd failed - Resource temporarily unavailable
		rank 1 in job 2  hp20_44192   caused collective abort of all ranks
		  exit status of rank 1: return code 1
		Fatal error in PMPI_Bcast: Other MPI error, error stack:
		PMPI_Bcast(1302)..................: MPI_Bcast(buf=0xc067f33c, count=1, MPI_INT, root=0, MPI_COMM_WORLD) failed
		MPIR_Bcast(1031)..................:
		MPIR_Bcast_binomial(187)..........:
		MPIC_Send(41).....................:
		MPIC_Wait(513)....................:
		MPIDI_CH3I_Progress(150)..........:
		MPID_nem_mpich2_blocking_recv(948):
		MPID_nem_tcp_connpoll(1709).......: Communication error
		 
		 
		Any comments and/or suggestions are greatly appreciated.
		 
		 
		Mike C.
		 
		 
		
		
		
		Michael A. Costa
		SET (RCC), CCAI-CCNA/CCNP (Cisco), MInfTech (Griffith)
		Professor - Information Technology Division
		Fanshawe College
		G3001
		1001 Fanshawe College Boulevard
		P.O. Box 7005
		London, ON N5Y 5R6
		Tel: (519) 452-4291  Fax: (519) 452-1801
		 
		
		
