[mpich-discuss] mpdboot returning error codes 582, 404, 347, and 476

BAKER, MYLES D. (LARC-E302) myles.d.baker at nasa.gov
Thu Jul 5 09:26:39 CDT 2012


Dear List,

NOTE: This is my first list post, so please let me know how to write more effectively and honor the etiquette of the community.

I am having trouble launching mpd, although it has booted successfully in the past. My system administrator is updating to a Hydra-capable MPICH, but for now I'm stuck with mpd and mpich2-1.2.1p1. Here is some information about my system:

1. Multi-node POWER6 cluster using the General Parallel File System (GPFS).
2. The master is an interactive node (tisa2-blue); the slaves are dedicated compute nodes (bb101, bb102).
3. The ~/.mpd.conf file contains MPD_SECRETWORD=password on each node (master and slaves).
4. The ~/mpd.hosts file contains two lines, bb101:4 and bb102:4. Once again, this file is accessible from all nodes.
5. Passwordless ssh works between all nodes (3 nodes, fully connected digraph).
6. The /etc/hosts files on the respective machines are configured correctly (i.e., no 127.0.1.1 entries for the slave nodes, etc.; I have added them at the bottom of the post for reference).
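
To make item 4 concrete, here is a minimal Python sketch of how the host[:ncpus] lines in ~/mpd.hosts can be read. The function name parse_mpd_hosts is hypothetical (it is not part of mpd), and defaulting to 1 CPU when no count is given is an assumption of this sketch. It also reflects my understanding that the -n count passed to mpdboot includes the local mpd, so two hosts in the file correspond to -n 3.

```python
def parse_mpd_hosts(text):
    """Parse mpd.hosts-style text into a list of (hostname, ncpus) pairs.

    Hypothetical helper, not part of mpd. Assumes one host[:ncpus]
    token per line; '#' comments and blank lines are skipped.
    """
    entries = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and padding
        if not line:
            continue
        token = line.split()[0]  # ignore anything after the host token
        host, _, count = token.partition(":")
        # Assumption: a host listed without a count contributes 1 CPU.
        entries.append((host, int(count) if count else 1))
    return entries


if __name__ == "__main__":
    hosts = parse_mpd_hosts("bb101:4\nbb102:4\n")
    print(hosts)
    # Assumption: mpdboot -n counts the local mpd as well as the listed hosts.
    print("expected -n value:", len(hosts) + 1)
```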

Given the above setup, when I run mpdboot -v -n 3 -f ~/mpd.hosts on tisa2-blue, I get the following output:
running mpdallexit on tisa2-blue
LAUNCHED mpd on tisa2-blue  via

[2]    Done                          mpd
RUNNING: mpd on tisa2-blue
LAUNCHED mpd on bb101  via  tisa2-blue
LAUNCHED mpd on bb102  via  tisa2-blue

Next, the terminal window waits, and I get the following error:
mpdboot_tisa2-blue (handle_mpd_output 406): failed to handshake with mpd on bb101; recvd output={}

If I press ^C to end it, I get the following error:
mpdboot_tisa2-blue (recv_dict_msg 582):recv_dict_msg: errmsg=::
  mpdtb:
    /SPG_ops/utils/ppc64/mpich2-1.2.1p1/bin/mpdlib.py,  582,  recv_dict_msg
    /usr/local/bin/mpdboot,  404,  handle_mpd_output
    /usr/local/bin/mpdboot,  347,  mpdboot
    /usr/local/bin/mpdboot,  476,  ?

mpdboot_tisa2-blue (handle_mpd_output 406): failed to handshake with mpd on bb101; recvd output={}
mpdboot_tisa2-blue: failure doing recv exceptions.KeyboardInterrupt ::
0
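
For anyone reading the mpdtb block above: each row appears to have the form "file, number, function", so the four numbers may be source line numbers within those files rather than error codes. That reading is an assumption on my part. The sketch below only splits the rows into fields and lists the directories the files live in; parse_mpdtb and install_dirs are hypothetical names, not part of mpd.

```python
import posixpath
import re

# One mpdtb row as printed above: "<path>,  <number>,  <function>"
FRAME = re.compile(r"^\s*(?P<path>\S+),\s+(?P<num>\d+),\s+(?P<func>\S+)\s*$")

SAMPLE = """\
mpdboot_tisa2-blue (recv_dict_msg 582):recv_dict_msg: errmsg=::
  mpdtb:
    /SPG_ops/utils/ppc64/mpich2-1.2.1p1/bin/mpdlib.py,  582,  recv_dict_msg
    /usr/local/bin/mpdboot,  404,  handle_mpd_output
    /usr/local/bin/mpdboot,  347,  mpdboot
    /usr/local/bin/mpdboot,  476,  ?
"""


def parse_mpdtb(text):
    """Return (path, number, function) tuples for rows that look like frames."""
    frames = []
    for raw in text.splitlines():
        match = FRAME.match(raw)
        if match:
            frames.append(
                (match.group("path"), int(match.group("num")), match.group("func"))
            )
    return frames


def install_dirs(frames):
    """Return the sorted set of directories that the listed files live in."""
    return sorted({posixpath.dirname(path) for path, _, _ in frames})


if __name__ == "__main__":
    frames = parse_mpdtb(SAMPLE)
    for path, num, func in frames:
        print(f"{func:20s} {posixpath.basename(path)}:{num}")
    print(install_dirs(frames))
```

Run on the output above, this shows that the listed files come from two different directories (/SPG_ops/utils/ppc64/mpich2-1.2.1p1/bin and /usr/local/bin). I do not know whether that matters here.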

So, at this point I'm not sure what I can do to fix this. I have looked up the error codes and I don't think that I have done anything wrong. Can anyone give me some guidance or ideas on how to fix this issue?

Thank you so much!
Myles Baker

tisa2-blue: /etc/hosts
127.0.0.1       localhost
#192.168.18.115  bc206 bc206.cluster.net
#192.168.18.115  tisa2-blue.larc.nasa.gov
# made sure the larc-facing address for tisa2-blue is here and
# uncommented... w/o it, daacget commands fail because it reverse
# lookup's and gets the AMI interface name and it doesn't match.  crjones, 07/03/12
198.119.135.140 tisa2-blue.larc.nasa.gov tisa2-blue
192.168.18.130  magneto.cluster.net magneto.magneto
198.119.135.180 snfsmdc1.larc.nasa.gov snfsmdc1
198.119.135.181 snfsmdc2.larc.nasa.gov snfsmdc2
192.168.18.162  bk17.cluster.net bk17
192.168.18.164  bk21.cluster.net bk21
192.168.18.207  ab01-p.cluster.net ab01-p
192.168.18.1    coil-blue.cluster.net coil-blue
192.168.18.2    nsd1.cluster.net nsd1
192.168.18.3    nsd2.cluster.net nsd2

bb101: /etc/hosts
127.0.0.1       localhost
192.168.18.50   bb101 bb101.cluster.net
192.168.18.1    coil-blue coil-blue.cluster.net
192.168.18.2    nsd1 nsd1.cluster.net
192.168.18.5    ab3950.cluster.net ab3950
192.168.18.3    nsd2 nsd2.cluster.net
192.168.18.10   ba101 ba101.cluster.net
192.168.18.90   bc101 bc101.cluster.net
192.168.18.173  ab19 ab19.cluster.net
192.168.18.175  ac19 ac19.cluster.net
192.168.18.130 magneto.cluster.net magneto.magneto
198.119.135.180 snfsmdc1.larc.nasa.gov snfsmdc1
198.119.135.181 snfsmdc2.larc.nasa.gov snfsmdc2
192.168.18.162 bk17.cluster.net bk17
192.168.18.164 bk21.cluster.net bk21
192.168.18.207 ab01-p.cluster.net ab01-p
192.168.18.1    coil-blue.cluster.net coil-blue
192.168.18.2    nsd1.cluster.net nsd1
192.168.18.3    nsd2.cluster.net nsd2

bb102: /etc/hosts
127.0.0.1         localhost
192.168.16.51 bb102m
192.168.18.51 bb102 bb102.cluster.net
192.168.18.130 magneto.cluster.net magneto.magneto
198.119.135.180 snfsmdc1.larc.nasa.gov snfsmdc1
198.119.135.181 snfsmdc2.larc.nasa.gov snfsmdc2
192.168.18.162 bk17.cluster.net bk17
192.168.18.164 bk21.cluster.net bk21
192.168.18.207 ab01-p.cluster.net ab01-p
192.168.18.1    coil-blue.cluster.net coil-blue
192.168.18.2    nsd1.cluster.net nsd1
192.168.18.3    nsd2.cluster.net nsd2
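
The check described in item 6 can be sketched in Python as follows. This is an illustration under stated assumptions, not a tool shipped with mpd: parse_hosts, loopback_names and missing_names are hypothetical helpers, and the sample text is a shortened hosts file rather than a full copy of the ones above. A name that is absent from /etc/hosts may still resolve through DNS or another name service, so a "missing" result is only a prompt to check further.

```python
import ipaddress


def parse_hosts(text):
    """Return {name: [ip, ...]} from /etc/hosts-style text."""
    table = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and padding
        if not line:
            continue
        fields = line.split()
        ip, names = fields[0], fields[1:]
        for name in names:
            table.setdefault(name, []).append(ip)
    return table


def loopback_names(table):
    """Names bound to a loopback address such as 127.0.0.1 or 127.0.1.1."""
    return sorted(
        name
        for name, ips in table.items()
        if any(ipaddress.ip_address(ip).is_loopback for ip in ips)
    )


def missing_names(table, wanted):
    """Names from `wanted` that have no entry in this hosts table."""
    return [name for name in wanted if name not in table]


if __name__ == "__main__":
    # Shortened sample in the style of the bb101 file above.
    sample = "127.0.0.1       localhost\n192.168.18.50   bb101 bb101.cluster.net\n"
    table = parse_hosts(sample)
    print("loopback:", loopback_names(table))
    print("not listed:", missing_names(table, ["tisa2-blue", "bb101", "bb102"]))
```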



BAKER, MYLES D. (LARC-E302)
-----------------------------------------------------
Mail Stop 420, B1250 R177
myles.d.baker at nasa.gov
LaRC Ext:  x46393





