[mpich-discuss] mpdboot returning error codes 582, 404, 347, and 476
BAKER, MYLES D. (LARC-E302)
myles.d.baker at nasa.gov
Thu Jul 5 09:26:39 CDT 2012
Dear List,
NOTE: This is my first post to the list, so please let me know how I can write more effectively and follow the community's etiquette.
I am having trouble launching mpd, although it has booted successfully in the past. My system administrator is upgrading to a Hydra-capable MPICH, but for now I'm stuck with mpd and mpich2-1.2.1p1. Here is some information about my system:
1. Multi-node POWER6 cluster using the General Parallel File System (GPFS).
2. The master is an interactive node (tisa2-blue); the slaves are dedicated compute nodes (bb101, bb102).
3. The ~/.mpd.conf file contains MPD_SECRETWORD=password on each node (master and slaves).
4. The ~/mpd.hosts file contains two lines: bb101:4 and bb102:4. Like ~/.mpd.conf, this file is accessible from all nodes.
5. Passwordless ssh works between all nodes (3 nodes, fully connected digraph).
6. The /etc/hosts files on the respective machines are configured correctly (i.e., no 127.0.1.1 entries for the slave nodes, etc.; I have added them at the bottom of the post for reference).
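For reference, the host file in item 4 can be checked against the -n value passed to mpdboot. The sketch below is a minimal Python 3 illustration and not part of MPICH2: parse_mpd_hosts and max_mpds are hypothetical helper names, and the parsing assumes a host[:ncpus] layout with optional '#' comments. As I understand the mpdboot documentation, -n counts the local mpd as well, so it can be at most one more than the number of hosts listed.

```python
def parse_mpd_hosts(text):
    """Parse mpd.hosts content into (hostname, ncpus) pairs.

    Assumption: each non-comment line starts with host or host:ncpus;
    any further fields on the line are ignored here.
    """
    hosts = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        name, _, count = line.split()[0].partition(":")
        hosts.append((name, int(count) if count else 1))
    return hosts


def max_mpds(hosts):
    """Largest -n value: the local mpd plus one mpd per listed host."""
    return 1 + len(hosts)


hosts = parse_mpd_hosts("bb101:4\nbb102:4\n")
print(hosts, max_mpds(hosts))
```

With the two-line file from item 4 this gives 3, which corresponds to a ring of tisa2-blue plus the two compute nodes.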
Given the above setup, when I run mpdboot -v -n 3 -f ~/mpd.hosts on tisa2-blue, I get the following output:
running mpdallexit on tisa2-blue
LAUNCHED mpd on tisa2-blue via
[2] Done mpd
RUNNING: mpd on tisa2-blue
LAUNCHED mpd on bb101 via tisa2-blue
LAUNCHED mpd on bb102 via tisa2-blue
Next, the terminal window waits, and I get the following error:
mpdboot_tisa2-blue (handle_mpd_output 406): failed to handshake with mpd on bb101; recvd output={}
If I press ^C to interrupt it, I get the following error:
mpdboot_tisa2-blue (recv_dict_msg 582):recv_dict_msg: errmsg=::
mpdtb:
/SPG_ops/utils/ppc64/mpich2-1.2.1p1/bin/mpdlib.py, 582, recv_dict_msg
/usr/local/bin/mpdboot, 404, handle_mpd_output
/usr/local/bin/mpdboot, 347, mpdboot
/usr/local/bin/mpdboot, 476, ?
mpdboot_tisa2-blue (handle_mpd_output 406): failed to handshake with mpd on bb101; recvd output={}
mpdboot_tisa2-blue: failure doing recv exceptions.KeyboardInterrupt ::
0
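To make the mpdtb block easier to read, the lines can be split into fields. This is only a sketch, assuming each line is a source path, a line number, and a function name separated by commas; parse_mpdtb is a hypothetical helper name, not an mpd function.

```python
import re

# Assumed layout of one mpdtb line: <path>, <line number>, <function>
TB_LINE = re.compile(r"^\s*(?P<path>\S+),\s*(?P<line>\d+),\s*(?P<func>\S+)\s*$")


def parse_mpdtb(text):
    """Return (path, line_number, function) tuples for lines that match."""
    frames = []
    for raw in text.splitlines():
        m = TB_LINE.match(raw)
        if m:
            frames.append((m.group("path"), int(m.group("line")), m.group("func")))
    return frames


sample = """mpdtb:
/SPG_ops/utils/ppc64/mpich2-1.2.1p1/bin/mpdlib.py, 582, recv_dict_msg
/usr/local/bin/mpdboot, 404, handle_mpd_output
/usr/local/bin/mpdboot, 347, mpdboot
/usr/local/bin/mpdboot, 476, ?
"""

for path, line, func in parse_mpdtb(sample):
    print(f"{func} at {path}:{line}")
```

Under that assumption, the numbers 582, 404, 347, and 476 would be positions in mpdlib.py and mpdboot, which may help whoever reads the scripts alongside this output.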
At this point I'm not sure what I can do to fix this. I have looked up the error codes, and I don't think I have done anything wrong. Can anyone give me some guidance or ideas on how to fix this issue?
Thank you so much!
Myles Baker
tisa2-blue: /etc/hosts
127.0.0.1 localhost
#192.168.18.115 bc206 bc206.cluster.net
#192.168.18.115 tisa2-blue.larc.nasa.gov
# made sure the larc-facing address for tisa2-blue is here and
# uncommented... w/o it, daacget commands fail because it reverse
# lookup's and gets the AMI interface name and it doesn't match. crjones, 07/03/12
198.119.135.140 tisa2-blue.larc.nasa.gov tisa2-blue
192.168.18.130 magneto.cluster.net magneto.magneto
198.119.135.180 snfsmdc1.larc.nasa.gov snfsmdc1
198.119.135.181 snfsmdc2.larc.nasa.gov snfsmdc2
192.168.18.162 bk17.cluster.net bk17
192.168.18.164 bk21.cluster.net bk21
192.168.18.207 ab01-p.cluster.net ab01-p
192.168.18.1 coil-blue.cluster.net coil-blue
192.168.18.2 nsd1.cluster.net nsd1
192.168.18.3 nsd2.cluster.net nsd2
bb101: /etc/hosts
127.0.0.1 localhost
192.168.18.50 bb101 bb101.cluster.net
192.168.18.1 coil-blue coil-blue.cluster.net
192.168.18.2 nsd1 nsd1.cluster.net
192.168.18.5 ab3950.cluster.net ab3950
192.168.18.3 nsd2 nsd2.cluster.net
192.168.18.10 ba101 ba101.cluster.net
192.168.18.90 bc101 bc101.cluster.net
192.168.18.173 ab19 ab19.cluster.net
192.168.18.175 ac19 ac19.cluster.net
192.168.18.130 magneto.cluster.net magneto.magneto
198.119.135.180 snfsmdc1.larc.nasa.gov snfsmdc1
198.119.135.181 snfsmdc2.larc.nasa.gov snfsmdc2
192.168.18.162 bk17.cluster.net bk17
192.168.18.164 bk21.cluster.net bk21
192.168.18.207 ab01-p.cluster.net ab01-p
192.168.18.1 coil-blue.cluster.net coil-blue
192.168.18.2 nsd1.cluster.net nsd1
192.168.18.3 nsd2.cluster.net nsd2
bb102: /etc/hosts
127.0.0.1 localhost
192.168.16.51 bb102m
192.168.18.51 bb102 bb102.cluster.net
192.168.18.130 magneto.cluster.net magneto.magneto
198.119.135.180 snfsmdc1.larc.nasa.gov snfsmdc1
198.119.135.181 snfsmdc2.larc.nasa.gov snfsmdc2
192.168.18.162 bk17.cluster.net bk17
192.168.18.164 bk21.cluster.net bk21
192.168.18.207 ab01-p.cluster.net ab01-p
192.168.18.1 coil-blue.cluster.net coil-blue
192.168.18.2 nsd1.cluster.net nsd1
192.168.18.3 nsd2.cluster.net nsd2
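The claim in item 6 (no node name mapped to a 127.x address) can be checked mechanically against files like the ones above. The following is a minimal Python 3 sketch; loopback_names is a hypothetical helper, and treating any name that starts with "localhost" as acceptable is my own assumption.

```python
import ipaddress


def loopback_names(hosts_text):
    """Return hostnames (other than localhost*) that map to a loopback address."""
    flagged = []
    for raw in hosts_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # ignore comments and blank lines
        if not line:
            continue
        fields = line.split()
        try:
            addr = ipaddress.ip_address(fields[0])
        except ValueError:
            continue  # first field is not an IP address, skip the line
        if addr.is_loopback:
            flagged += [n for n in fields[1:] if not n.startswith("localhost")]
    return flagged


sample = """127.0.0.1 localhost
192.168.18.50 bb101 bb101.cluster.net
"""
print(loopback_names(sample))
```

An empty list means no node name resolves to loopback through /etc/hosts; a line such as 127.0.1.1 bb101 would be reported.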
BAKER, MYLES D. (LARC-E302)
-----------------------------------------------------
Mail Stop 420, B1250 R177
myles.d.baker at nasa.gov
LaRC Ext: x46393