[mpich-discuss] mpiexec woes
Ralph Butler
rbutler at mtsu.edu
Tue Aug 25 14:10:13 CDT 2009
One mpd is failing to obtain correct information about its own host
or about the other one.
My guess is that the hostname you provide on the command line when
running mpdcheck differs from the hostname by which the system
identifies itself. I would have thought, however, that running
mpdcheck like this:
mpdcheck -v
would shed some light on that.
You can try some of the checks that mpdcheck performs by hand. For
example, I have two hosts named b01 and b02. I can do a quick,
non-exhaustive verification that they correctly identify each other:
First, on b01:
(b01:51)% python
Python 2.5.2 (r252:60911, Jan 4 2009, 17:40:26)
[GCC 4.3.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import socket
>>> socket.gethostname()
'b01'
>>> socket.gethostbyname_ex('b01')
('b01.cs.mtsu.edu', ['b01'], ['161.45.166.1'])
>>> socket.gethostbyname_ex('b02')
('b02.cs.mtsu.edu', [], ['161.45.166.2'])
Then, on b02:
(b02:51)% python
Python 2.5.2 (r252:60911, Jan 4 2009, 17:40:26)
[GCC 4.3.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import socket
>>> socket.gethostname()
'b02'
>>> socket.gethostbyname_ex('b01')
('b01.cs.mtsu.edu', [], ['161.45.166.1'])
>>> socket.gethostbyname_ex('b02')
('b02.cs.mtsu.edu', ['b02'], ['161.45.166.2'])
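The same by-hand check can be scripted so that it is easy to repeat on
every node. What follows is only a minimal sketch, not part of mpdcheck:
the function names (resolve, build_report, mismatches) and the script
name used below are hypothetical, and it assumes the addresses reported
for each name should be identical no matter which host does the lookup.

```python
import socket
import sys


def resolve(name, resolver=socket.gethostbyname_ex):
    """Return (canonical_name, aliases, addresses), or None if lookup fails."""
    try:
        return resolver(name)
    except socket.error:  # covers gaierror/herror on Python 2 and 3
        return None


def build_report(names, resolver=socket.gethostbyname_ex):
    """Map each name to its sorted address list (None if unresolved)."""
    report = {}
    for name in names:
        result = resolve(name, resolver)
        report[name] = None if result is None else sorted(result[2])
    return report


def mismatches(report_a, report_b):
    """Names whose addresses differ between reports taken on two hosts."""
    names = set(report_a) | set(report_b)
    return sorted(n for n in names if report_a.get(n) != report_b.get(n))


if __name__ == "__main__":
    # Hypothetical usage: python hostcheck.py b01 b02 (run on each host)
    print("local hostname: %s" % socket.gethostname())
    for name, addrs in sorted(build_report(sys.argv[1:]).items()):
        print("%s -> %s" % (name, addrs))
```

Run it on each host with the same list of names and compare the output.
A name that fails to resolve on one host, or that resolves to different
addresses on different hosts, is worth investigating first.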
On Tue Aug 25, at 10:18 AM, Janzen Brewer wrote:
> I performed the steps in the troubleshooting guide and was able to
> get two nodes to handshake (with mpdcheck -s / -c), but the test for
> running /bin/hostname locally AND remotely failed. The command
> simply hung and didn't produce any output. I was able to use
> mpdlistjobs and mpdkilljob on one of the slave nodes to kill the
> job, so the nodes can obviously communicate.
>
> Janzen
>
> Rajeev Thakur wrote:
>> It could be a problem with the networking settings on the machines.
>> To debug, you could follow the steps outlined in the Appendix of the
>> MPICH2 installation guide (using mpdcheck). And try with a smaller
>> set of nodes first.
>>
>> Rajeev
>>
>>
>>
>>> -----Original Message-----
>>> From: mpich-discuss-bounces at mcs.anl.gov [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Janzen Brewer
>>> Sent: Thursday, August 20, 2009 8:14 AM
>>> To: mpich-discuss at mcs.anl.gov
>>> Subject: [mpich-discuss] mpiexec woes
>>>
>>> I'm implementing MPICH2 on a small GPU cluster. It will eventually
>>> be integrated with Condor 7.2, but for now it's running by itself.
>>> The cluster is set up such that the command:
>>>
>>> $ mpdboot -n 10 --ifhn=192.168.1.100 --rsh=rsh
>>>
>>> appears to start the daemon on all nodes. Running mpdtrace returns
>>> all the nodes' hostnames and 'mpdringtest 100' runs successfully.
>>> However, when I try to run anything with mpiexec, the shell hangs
>>> indefinitely and I have to kill it with mpdallexit from a separate
>>> shell. Here's the particular command I've been using:
>>>
>>>
>>> $ mpiexec -n 10 /bin/hostname
>>>
>>> This command works when the daemon is only booted on the master
>>> node (i.e. -n argument is 1 for both commands above). I've lurked
>>> around but have been unable to find the solution.
>>>
>>> Thanks!
>>> Janzen
>>>
>>>
>>
>>
>