<html><head></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; ">Dear List, <div><br></div><div>NOTE: This is my first post to the list, so please let me know how I can write more effectively and honor the etiquette of the community. <br><div><br></div><div>I am having trouble launching mpd, although it has booted successfully in the past. My system administrator is updating to a Hydra-capable MPICH, but for now I'm stuck with mpd and mpich2-1.2.1p1. Here is some information about my system: </div><div><br></div><div>1. Multi-node POWER6 cluster using the General Parallel File System (GPFS).</div><div>2. The master is an interactive node (tisa2-blue); the slaves are dedicated compute nodes (bb101, bb102). </div><div>3. The ~/.mpd.conf file contains MPD_SECRETWORD=password on each node (master and slaves).</div><div>4. The ~/mpd.hosts file contains two lines: bb101:4 and bb102:4. Once again, this file is accessible from all nodes. </div><div>5. Passwordless ssh works between all nodes (3 nodes, fully connected digraph).</div><div>6. 
The /etc/hosts files on the respective machines are configured correctly (i.e., no 127.0.1.1 entries for the slave nodes, etc.; I have appended them to the bottom of this post in case you need to reference them). </div><div><br></div><div>Given the above setup, when I try to boot with the command mpdboot -v -n 3 -f ~/mpd.hosts on tisa2-blue, I get the following output: </div><div><div>running mpdallexit on tisa2-blue</div><div>LAUNCHED mpd on tisa2-blue via </div><div><br></div><div>[2] Done mpd</div><div>RUNNING: mpd on tisa2-blue</div><div>LAUNCHED mpd on bb101 via tisa2-blue</div><div>LAUNCHED mpd on bb102 via tisa2-blue</div></div><div><br></div><div>Next, the terminal window waits, and then I get the following error: </div><div><div>mpdboot_tisa2-blue (handle_mpd_output 406): failed to handshake with mpd on bb101; recvd output={}</div><div><br></div></div><div>If I press ^C to end it, I get the following error: </div><div><div>mpdboot_tisa2-blue (recv_dict_msg 582):recv_dict_msg: errmsg=::</div><div> mpdtb:</div><div> /SPG_ops/utils/ppc64/mpich2-1.2.1p1/bin/mpdlib.py, 582, recv_dict_msg</div><div> /usr/local/bin/mpdboot, 404, handle_mpd_output</div><div> /usr/local/bin/mpdboot, 347, mpdboot</div><div> /usr/local/bin/mpdboot, 476, ?</div><div><br></div><div>mpdboot_tisa2-blue (handle_mpd_output 406): failed to handshake with mpd on bb101; recvd output={}</div><div>mpdboot_tisa2-blue: failure doing recv exceptions.KeyboardInterrupt ::</div><div>0</div></div><div><br></div><div>So, at this point I'm not sure what I can do to fix this. I have looked up the error messages, and I don't think that I have done anything wrong. Can anyone give me some guidance or ideas on how to fix this issue? </div><div><br></div><div>Thank you so much! 
</div><div>Myles Baker</div><div><br></div><div>tisa2-blue: /etc/hosts</div><div><div>127.0.0.1 localhost</div><div>#192.168.18.115 bc206 <a href="http://bc206.cluster.net">bc206.cluster.net</a></div><div>#192.168.18.115 <a href="http://tisa2-blue.larc.nasa.gov">tisa2-blue.larc.nasa.gov</a></div><div># made sure the larc-facing address for tisa2-blue is here and </div><div># uncommented... w/o it, daacget commands fail because it reverse </div><div># lookup's and gets the AMI interface name and it doesn't match. crjones, 07/03/12</div><div>198.119.135.140 <a href="http://tisa2-blue.larc.nasa.gov">tisa2-blue.larc.nasa.gov</a> tisa2-blue</div><div>192.168.18.130 <a href="http://magneto.cluster.net">magneto.cluster.net</a> magneto.magneto</div><div>198.119.135.180 <a href="http://snfsmdc1.larc.nasa.gov">snfsmdc1.larc.nasa.gov</a> snfsmdc1</div><div>198.119.135.181 <a href="http://snfsmdc2.larc.nasa.gov">snfsmdc2.larc.nasa.gov</a> snfsmdc2</div><div>192.168.18.162 <a href="http://bk17.cluster.net">bk17.cluster.net</a> bk17</div><div>192.168.18.164 <a href="http://bk21.cluster.net">bk21.cluster.net</a> bk21</div><div>192.168.18.207 <a href="http://ab01-p.cluster.net">ab01-p.cluster.net</a> ab01-p</div><div>192.168.18.1 <a href="http://coil-blue.cluster.net">coil-blue.cluster.net</a> coil-blue</div><div>192.168.18.2 <a href="http://nsd1.cluster.net">nsd1.cluster.net</a> nsd1</div><div>192.168.18.3 <a href="http://nsd2.cluster.net">nsd2.cluster.net</a> nsd2</div></div><div><br></div><div>bb101: /etc/hosts</div><div><div>127.0.0.1 localhost</div><div>192.168.18.50 bb101 <a href="http://bb101.cluster.net">bb101.cluster.net</a></div><div>192.168.18.1 coil-blue <a href="http://coil-blue.cluster.net">coil-blue.cluster.net</a></div><div>192.168.18.2 nsd1 <a href="http://nsd1.cluster.net">nsd1.cluster.net</a></div><div>192.168.18.5 <a href="http://ab3950.cluster.net">ab3950.cluster.net</a> ab3950</div><div>192.168.18.3 nsd2 <a 
href="http://nsd2.cluster.net">nsd2.cluster.net</a></div><div>192.168.18.10 ba101 <a href="http://ba101.cluster.net">ba101.cluster.net</a></div><div>192.168.18.90 bc101 <a href="http://bc101.cluster.net">bc101.cluster.net</a></div><div>192.168.18.173 ab19 <a href="http://ab19.cluster.net">ab19.cluster.net</a></div><div>192.168.18.175 ac19 <a href="http://ac19.cluster.net">ac19.cluster.net</a></div><div>192.168.18.130 <a href="http://magneto.cluster.net">magneto.cluster.net</a> magneto.magneto</div><div>198.119.135.180 <a href="http://snfsmdc1.larc.nasa.gov">snfsmdc1.larc.nasa.gov</a> snfsmdc1</div><div>198.119.135.181 <a href="http://snfsmdc2.larc.nasa.gov">snfsmdc2.larc.nasa.gov</a> snfsmdc2</div><div>192.168.18.162 <a href="http://bk17.cluster.net">bk17.cluster.net</a> bk17</div><div>192.168.18.164 <a href="http://bk21.cluster.net">bk21.cluster.net</a> bk21</div><div>192.168.18.207 <a href="http://ab01-p.cluster.net">ab01-p.cluster.net</a> ab01-p</div><div>192.168.18.1 <a href="http://coil-blue.cluster.net">coil-blue.cluster.net</a> coil-blue</div><div>192.168.18.2 <a href="http://nsd1.cluster.net">nsd1.cluster.net</a> nsd1</div><div>192.168.18.3 <a href="http://nsd2.cluster.net">nsd2.cluster.net</a> nsd2</div><div><br></div></div><div>bb102: /etc/hosts</div><div><div>127.0.0.1 localhost</div><div>192.168.16.51<span class="Apple-tab-span" style="white-space:pre"> </span>bb102m</div><div>192.168.18.51<span class="Apple-tab-span" style="white-space:pre"> </span>bb102 <a href="http://bb102.cluster.net">bb102.cluster.net</a></div><div>192.168.18.130 <a href="http://magneto.cluster.net">magneto.cluster.net</a> magneto.magneto</div><div>198.119.135.180 <a href="http://snfsmdc1.larc.nasa.gov">snfsmdc1.larc.nasa.gov</a> snfsmdc1</div><div>198.119.135.181 <a href="http://snfsmdc2.larc.nasa.gov">snfsmdc2.larc.nasa.gov</a> snfsmdc2</div><div>192.168.18.162 <a href="http://bk17.cluster.net">bk17.cluster.net</a> bk17</div><div>192.168.18.164 <a 
href="http://bk21.cluster.net">bk21.cluster.net</a> bk21</div><div>192.168.18.207 <a href="http://ab01-p.cluster.net">ab01-p.cluster.net</a> ab01-p</div><div>192.168.18.1 <a href="http://coil-blue.cluster.net">coil-blue.cluster.net</a> coil-blue</div><div>192.168.18.2 <a href="http://nsd1.cluster.net">nsd1.cluster.net</a> nsd1</div><div>192.168.18.3 <a href="http://nsd2.cluster.net">nsd2.cluster.net</a> nsd2</div><div><br></div><div><br></div></div><div><br><div apple-content-edited="true">
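<div>In case it helps anyone cross-check the three listings above, here is a minimal Python sketch of the check described in item 6. It is not part of MPICH; the function names parse_hosts and check_node are my own, and it only reads the text of a hosts file. It assumes the standard one-address-then-names layout of /etc/hosts.</div>
<pre>
```python
def parse_hosts(text):
    """Map every hostname or alias in a hosts file to the IPs it is listed under."""
    table = {}
    for raw in text.splitlines():
        # Drop comments and surrounding whitespace; skip empty lines.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        ip, names = fields[0], fields[1:]
        for name in names:
            table.setdefault(name, []).append(ip)
    return table


def check_node(hosts_text, own_name, peers):
    """Return a list of human-readable findings for one node's hosts file."""
    findings = []
    table = parse_hosts(hosts_text)
    # A node whose own name maps to a loopback address (e.g. 127.0.1.1)
    # may advertise an address its peers cannot reach.
    for ip in table.get(own_name, []):
        if ip.startswith("127."):
            findings.append(f"{own_name} maps to loopback address {ip}")
    # Peers without an entry must be resolved some other way (DNS, NIS, ...).
    for peer in peers:
        if peer not in table:
            findings.append(f"{peer} has no entry in this hosts file")
    return findings


if __name__ == "__main__":
    # Hypothetical usage: run on each node against its own /etc/hosts.
    with open("/etc/hosts") as handle:
        for finding in check_node(handle.read(), "bb101", ["tisa2-blue", "bb102"]):
            print(finding)
```
</pre>
<div>A name that is missing from /etc/hosts may still resolve through DNS or another name service, so a finding from this sketch is only a prompt to check further, not proof of a misconfiguration.</div>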
<div style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; "><b>BAKER, MYLES D. (LARC-E302)</b></div><div style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; "><b>-----------------------------------------------------<br></b>Mail Stop 420, B1250 R177<br><a href="mailto:myles.d.baker@nasa.gov">myles.d.baker@nasa.gov</a><br>LaRC Ext: x46393</div>
</div>
<br></div></div></body></html>