[mpich-discuss] mpiexec woes

Ralph Butler rbutler at mtsu.edu
Tue Aug 25 14:10:13 CDT 2009


One mpd is failing to obtain correct info about its own host or about
the other one.
My guess is that the hostname you provide on the command line when
running mpdcheck differs from the hostname that the system identifies
itself by.  I would have thought, however, that running mpdcheck like this:
         mpdcheck -v
might shed some light on that.

You can try some of the checks that mpdcheck performs by hand.  For
example, I have two hosts named b01 and b02.  I can do some quick,
non-exhaustive verification that they correctly identify each other:

First, on b01:
(b01:51)% python
Python 2.5.2 (r252:60911, Jan  4 2009, 17:40:26)
[GCC 4.3.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
 >>> import socket
 >>> socket.gethostname()
'b01'
 >>> socket.gethostbyname_ex('b01')
('b01.cs.mtsu.edu', ['b01'], ['161.45.166.1'])
 >>> socket.gethostbyname_ex('b02')
('b02.cs.mtsu.edu', [], ['161.45.166.2'])

Then, on b02:
(b02:51)% python
Python 2.5.2 (r252:60911, Jan  4 2009, 17:40:26)
[GCC 4.3.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
 >>> import socket
 >>> socket.gethostname()
'b02'
 >>> socket.gethostbyname_ex('b01')
('b01.cs.mtsu.edu', [], ['161.45.166.1'])
 >>> socket.gethostbyname_ex('b02')
('b02.cs.mtsu.edu', ['b02'], ['161.45.166.2'])
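
If you have more than two nodes, the same lookups can be scripted and run on each host so the output can be compared side by side. What follows is only a minimal sketch of the manual session above, not anything mpdcheck itself does: the function names (resolve_all, find_problems) and the file name hostcheck.py are made up for illustration, and the loopback warning is my own addition, on the assumption that a hostname resolving only to a 127.x address is worth a second look.

```python
import socket
import sys


def resolve_all(hostnames, resolver=socket.gethostbyname_ex):
    """Map each hostname to (fqdn, aliases, addresses), or None if lookup fails."""
    results = {}
    for name in hostnames:
        try:
            results[name] = resolver(name)
        except socket.error:  # also covers gaierror and herror
            results[name] = None
    return results


def find_problems(results):
    """Return warnings for failed lookups and loopback-only results."""
    problems = []
    for name in sorted(results):
        entry = results[name]
        if entry is None:
            problems.append("%s: lookup failed" % name)
        elif entry[2] and all(addr.startswith("127.") for addr in entry[2]):
            # Assumption: a loopback-only answer is suspicious on a cluster node.
            problems.append("%s: resolves only to loopback %s" % (name, entry[2]))
    return problems


if __name__ == "__main__":
    # Hypothetical usage: python hostcheck.py b01 b02
    names = [socket.gethostname()] + sys.argv[1:]
    results = resolve_all(names)
    for name in names:
        print("%s -> %r" % (name, results[name]))
    for warning in find_problems(results):
        print("WARNING: " + warning)
```

Run it on every node with the full list of hostnames as arguments. Each node should report the same address for a given name, and no node should report a failed lookup for itself or for any other node.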


On Tue Aug 25, at 10:18 AM, Janzen Brewer wrote:

> I performed the steps in the troubleshooting guide and was able to  
> get two nodes to handshake (with mpdcheck -s / -c), but the test for  
> running /bin/hostname locally AND remotely failed. The command  
> simply hung and didn't produce any output. I was able to use  
> mpdlistjobs and mpdkilljob on one of the slave nodes to kill the  
> job, so the nodes can obviously communicate.
>
> Janzen
>
> Rajeev Thakur wrote:
>> It could be a problem with the networking settings on the machines. To
>> debug, you could follow the steps outlined in the Appendix of the MPICH2
>> installation guide (using mpdcheck). And try with a smaller set of nodes
>> first.
>>
>> Rajeev
>>
>>
>>
>>> -----Original Message-----
>>> From: mpich-discuss-bounces at mcs.anl.gov [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Janzen Brewer
>>> Sent: Thursday, August 20, 2009 8:14 AM
>>> To: mpich-discuss at mcs.anl.gov
>>> Subject: [mpich-discuss] mpiexec woes
>>>
>>> I'm implementing MPICH2 on a small GPU cluster. It will eventually  
>>> be integrated with Condor 7.2, but for now it's running by itself.  
>>> The cluster is set up such that the command:
>>>
>>> $ mpdboot -n 10 --ifhn=192.168.1.100 --rsh=rsh
>>>
>>> appears to start the daemon on all nodes. Running mpdtrace returns  
>>> all the nodes' hostnames and 'mpdringtest 100' runs successfully.  
>>> However, when I try to run anything with mpiexec, the shell hangs  
>>> indefinitely and I have to kill it with mpdallexit from a separate  
>>> shell. Here's the particular command I've been using:
>>>
>>>
>>> $ mpiexec -n 10 /bin/hostname
>>>
>>> This command works when the daemon is only booted on the master  
>>> node (i.e. -n argument is 1 for both commands above). I've lurked  
>>> around but have been unable to find the solution.
>>>
>>> Thanks!
>>> Janzen
>>>
>>>
>>
>>
>


