[MPICH] MPI_Bcast hangs in Windows XP

Jayesh Krishna jayesh at mcs.anl.gov
Mon Sep 10 09:38:50 CDT 2007


Hi,
 It looks like a problem with name resolution for the hostnames. Can
you try one run specifying just the IP addresses of the hosts in config.txt?
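
For example, a config.txt that bypasses hostname lookup entirely might look like the following (the addresses are placeholders; substitute the actual IPs of your hosts):

```
192.168.0.10:1
192.168.0.11:2
```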
 
Regards,
Jayesh

  _____  

From: Richard Li [mailto:xs_li at hotmail.com] 
Sent: Friday, September 07, 2007 9:00 PM
To: Jayesh Krishna
Cc: mpich-discuss at mcs.anl.gov
Subject: RE: [MPICH] MPI_Bcast hangs in Windows XP


Jayesh,

Thanks a lot for your info.

After a lot of tries, I was finally able to run cpi.exe across multiple
hosts. It seems to have something to do with the settings in my machine file.
The details follow.

I use the following command to run cpi.exe:

mpiexec -n 2 -machinefile config.txt -channel ssm (or others) cpi.exe

a) If I have a config.txt file like the following:
    host1name:1 -ifhn host1_ipaddress
    host2name:2 -ifhn host2_ipaddress
   everything works fine (for all channels).
b) If I have a config.txt like the following:
   host1name:1
   host2name:2
 then, with the sock channel, it hangs in MPI_Bcast. With auto and ssm, I get
the following error message:
  

C:\public\bin>mpiexec -n 2 -machinefile Config.txt -channel ssm (or auto) C:\public\bin\cpi.exe
Enter the number of intervals: (0 quits) 100
 
job aborted:
rank: node: exit code[: error message]
0: B0016350B383E: 1: Fatal error in MPI_Bcast: Other MPI error, error stack:
MPI_Bcast(784).................: MPI_Bcast(buf=0012FE88, count=1, MPI_INT, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast(230)................:
MPIC_Send(36)..................:
MPIDI_EagerContigSend(146).....: failure occurred while attempting to send an eager message
MPIDI_CH3_iStartMsgv(224)......:
MPIDI_CH3I_VC_post_connect(555): [ch3:sock] rank 0 unable to connect to rank 1 using business card <port=3872 description=B001279FD7C60.corp.bankofamerica.com ifname=171.188.32.154 shm_host=B001279FD7C60.corp.bankofamerica.com shm_queue=39E4F281-FCC0-4f4a-B540-EDC8D517F065 shm_pid=2484 >
MPIDU_Sock_post_connect(1228)..: unable to connect to B001279FD7C60.corp.bankofamerica.com on port 3872, exhausted all endpoints (errno -1)
MPIDU_Sock_post_connect(1244)..: gethostbyname failed, The requested name is valid and was found in the database, but it does not have the correct associated data being resolved for. (errno 11004)
1: B001279FD7C60: 1

I know this has something to do with my network settings, but I just can't
figure out why.
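
For what it's worth, a quick way to check what each hostname actually resolves to, independent of MPICH, is a short script like the one below (a sketch; the hostname list is a placeholder, substitute the names from your machinefile):

```python
import socket

# Substitute the hostnames from your machinefile here.
hosts = ["localhost"]

for host in hosts:
    try:
        # gethostbyname performs the same IPv4 lookup MPICH relies on.
        print(host, "->", socket.gethostbyname(host))
    except socket.gaierror as err:
        print(host, "-> resolution failed:", err)
```

If a hostname here fails or resolves to an unexpected interface, that would be consistent with the errno 11004 (gethostbyname) error in the stack above.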

Any ideas?

Thanks

Richard






  _____  

From: jayesh at mcs.anl.gov
To: xs_li at hotmail.com
CC: mpich-discuss at mcs.anl.gov
Subject: RE: [MPICH] MPI_Bcast hangs in Windows XP
Date: Thu, 6 Sep 2007 09:25:36 -0500


Hi,
 The process manager (smpd) is responsible for launching the MPI processes
on the various machines and for providing each MPI process with information on
how to communicate with the other MPI processes.
 The SMPD process manager listens (by default) on port 8676 and then asks
the client PM to connect to a new port. So you should allow the SMPD process
manager (smpd.exe --- installed as a service on Windows) to communicate on
all ports. (This is the easiest way; however, you can also restrict the port
range used by SMPD. Refer to the Windows developer's guide available at
http://www-unix.mcs.anl.gov/mpi/mpich/ for details.)
 Make sure that no firewall (1. running on the individual machines, 2. OR
on the network, filtering the traffic between the machines) is preventing the
process managers & the MPI procs on the individual machines from contacting
each other.
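 
On Windows XP SP2, the built-in firewall exception can be added from the command line; a sketch, assuming the default MPICH2 install path (adjust the path and rule names to your setup):

```
netsh firewall add allowedprogram program="C:\Program Files\MPICH2\bin\smpd.exe" name="MPICH2 SMPD" mode=ENABLE
netsh firewall add portopening protocol=TCP port=8676 name="SMPD listen port"
```

The same exceptions need to exist on every machine in the machinefile.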
 
(Note: Since you do not know what changed in your network, it might help to
analyze the network packets sent between the machines using a packet
sniffer like Ethereal.)
 
Regards,
Jayesh

  _____  

From: owner-mpich-discuss at mcs.anl.gov
[mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Richard Li
Sent: Wednesday, September 05, 2007 8:21 PM
To: mpich-discuss at mcs.anl.gov
Subject: [MPICH] MPI_Bcast hangs in Windows XP



Hi there,
 
I am writing an application on Windows XP/VC8 and am having a problem with
MPI_Bcast(). I am working in a corporate environment and suspect it may have
something to do with our security policies; however, I don't know exactly
which low-level operation failed.
 
Here is the symptom: my application (as well as the cpi.exe example) works fine
as long as there is only one machine in the machine file; whether it is the
local machine or a remote one does not matter. It hangs at MPI_Bcast() when I
have more than one machine in MPI_COMM_WORLD. I am using
mpich2-1.0.5p2-win32-ia32.msi.
 
The same application worked perfectly a year ago, and there have been many
security policy changes since that time (as usual, all policies reduce our
freedom). My question is: what communication mechanism is used for
inter-node communication? I tried nothing, auto, sock, and ssm as communication
channels and had no luck.
 
Thanks for your help.

Richard


