Great! After bringing both nodes up to the latest MPICH2 version, both hydra and mpd are working:<br><br>% mpiexec.hydra -f machinefile -n 4 hostname<br><br>mpdboot still hangs, but after a Ctrl-C, mpdtrace shows that the mpd daemons started successfully, and mpiexec works fine.<br>
<br>Thanks again to Pavan and Dave.<br clear="all">- Rajnish.<br>
<br><br><div class="gmail_quote">On Fri, Feb 5, 2010 at 1:26 PM, Dave Goodell <span dir="ltr">&lt;<a href="mailto:goodell@mcs.anl.gov">goodell@mcs.anl.gov</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
FYI, if you install version 1.2.1 and your mpdboot hangs, you might need to apply the fix described in ticket #963: <a href="https://trac.mcs.anl.gov/projects/mpich2/ticket/963" target="_blank">https://trac.mcs.anl.gov/projects/mpich2/ticket/963</a><br>
<font color="#888888">
<br>
-Dave</font><div><div></div><div class="h5"><br>
<br>
On Feb 5, 2010, at 11:59 AM, Rajnish wrote:<br>
<br>
<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
Thank you, Pavan, for suggesting hydra. I did not have it in the default installation, so I am installing it right now.<br>
<br>
Thank you, Dave, for pointing out the version incompatibility. Yes, the md5sums are indeed different...<br>
<br>
I had installed 1.2.1 on node n1. Node n2 already had MPICH2 installed, but I am not sure which version; it must be an incompatible one.<br>
<br>
I am going to do a fresh install and will send updates.<br>
<br>
- Rajnish.<br>
<br>
<br>
On Fri, Feb 5, 2010 at 12:37 PM, Dave Goodell &lt;<a href="mailto:goodell@mcs.anl.gov" target="_blank">goodell@mcs.anl.gov</a>&gt; wrote:<br>
On Feb 5, 2010, at 7:24 AM, Rajnish wrote:<br>
<br>
[snip]<br>
<br>
However, when I schedule tasks across both nodes, with n2 as the master node, I get the following message on n1:<br>
<br>
mpd_uncaught_except_tb handling:<br>
exceptions.KeyError: 'process_mapping'<br>
/usr/local/bin/mpd 1354 do_mpdrun<br>
msg['process_mapping'][lorank] = self.myHost<br>
/usr/local/bin/mpd 984 handle_lhs_input<br>
self.do_mpdrun(msg)<br>
/usr/local/bin/mpdlib.py 780 handle_active_streams<br>
handler(stream,*args)<br>
/usr/local/bin/mpd 301 runmainloop<br>
rv = self.streamHandler.handle_active_streams(timeout=8.0)<br>
/usr/local/bin/mpd 270 run<br>
self.runmainloop()<br>
/usr/local/bin/mpd 1643 ?<br>
mpd.run()<br>
n1-wulf.myhost.org_mpdman_1 (run 287): invalid msg from lhs; expecting ringsize got: {}<br>
<br>
What version of MPICH2 are you running?<br>
<br>
This message seems to indicate that you have somehow installed incompatible versions of "mpd.py" between the two hosts. What output do you get from running the following commands on both hosts?<br>
<br>
-------8<-------<br>
tail -n +2 `which mpd.py` | md5sum<br>
tail -n +2 `which mpdman.py` | md5sum<br>
-------8<-------<br>
<br>
(the "tail" business is needed because the shebang line is usually altered by the install step)<br>
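<br>
Where "tail" or "md5sum" is not available, the same check can be sketched in Python. This is only an illustration of the idea above, not part of MPICH2: the function name md5_without_first_line is made up for this example, and it assumes both scripts are on the PATH, as the "which" calls above do.<br>
<pre>
```python
import hashlib
import shutil


def md5_without_first_line(path):
    """Return the MD5 hex digest of a file, ignoring its first line.

    Mirrors `tail -n +2 FILE | md5sum`: the first (shebang) line is
    skipped because the install step may rewrite it per host.
    """
    with open(path, "rb") as f:
        f.readline()  # discard the shebang line
        return hashlib.md5(f.read()).hexdigest()


if __name__ == "__main__":
    # Assumption: the scripts are installed under these names on the PATH.
    for name in ("mpd.py", "mpdman.py"):
        path = shutil.which(name)
        if path is None:
            print(name + ": not found on PATH")
        else:
            print(md5_without_first_line(path) + "  " + name)
```
</pre>
Run it on both hosts and compare the digests; if they differ, the two hosts have different versions of the script installed.<br>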
<br>
Results for a few releases:<br>
-------8<-------<br>
release mpich2-1.2.1<br>
68a128402fb44c6fdebe631bbc1c4b7f mpd.py<br>
b79fd98d6e4f9d9b80c295e05e01591c mpdman.py<br>
release mpich2-1.2<br>
be37cc1347b915a0ec32cba54c928f63 mpd.py<br>
5a9cd3f44b5986584b27a648f889bf31 mpdman.py<br>
release mpich2-1.1.1p1<br>
be37cc1347b915a0ec32cba54c928f63 mpd.py<br>
5a9cd3f44b5986584b27a648f889bf31 mpdman.py<br>
release mpich2-1.1.1<br>
550958d41e76cdef0ceaa74d540760de mpd.py<br>
5a9cd3f44b5986584b27a648f889bf31 mpdman.py<br>
release mpich2-1.1<br>
f59c7e766dd2d3488b6df212a663ccb9 mpd.py<br>
07129c1f68cd815c56bd186eb1b59038 mpdman.py<br>
release mpich2-1.0.8<br>
65fb3b8b1c9e3d053bb97d5ef2ae86ad mpd.py<br>
2083f8908d0b9698eb0550c32ef3d153 mpdman.py<br>
-------8<-------<br>
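<br>
The table above can also be used as a lookup, sketched here in Python. The names KNOWN_CHECKSUMS and releases_matching are hypothetical, chosen for this example; the digests are copied from the list above. Because mpich2-1.2 and mpich2-1.1.1p1 share both digests, the lookup returns a list rather than a single release.<br>
<pre>
```python
# Digests copied from the table above, as (mpd.py, mpdman.py) pairs.
KNOWN_CHECKSUMS = {
    "mpich2-1.2.1": ("68a128402fb44c6fdebe631bbc1c4b7f",
                     "b79fd98d6e4f9d9b80c295e05e01591c"),
    "mpich2-1.2": ("be37cc1347b915a0ec32cba54c928f63",
                   "5a9cd3f44b5986584b27a648f889bf31"),
    "mpich2-1.1.1p1": ("be37cc1347b915a0ec32cba54c928f63",
                       "5a9cd3f44b5986584b27a648f889bf31"),
    "mpich2-1.1.1": ("550958d41e76cdef0ceaa74d540760de",
                     "5a9cd3f44b5986584b27a648f889bf31"),
    "mpich2-1.1": ("f59c7e766dd2d3488b6df212a663ccb9",
                   "07129c1f68cd815c56bd186eb1b59038"),
    "mpich2-1.0.8": ("65fb3b8b1c9e3d053bb97d5ef2ae86ad",
                     "2083f8908d0b9698eb0550c32ef3d153"),
}


def releases_matching(mpd_md5, mpdman_md5):
    """Return every listed release whose digest pair matches exactly."""
    pair = (mpd_md5.lower(), mpdman_md5.lower())
    return [rel for rel, sums in KNOWN_CHECKSUMS.items() if sums == pair]
```
</pre>
An empty result means the pair is not among the releases listed here, for example a different release or a mixed installation.<br>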
<br>
<br>
After doing mpdallexit, n2 shows the following message:<br>
<br>
mpiexec_n2-wulf.myhost.org (mpiexec 377): no msg recvd from mpd when expecting ack of request<br>
<br>
You can ignore this message; it's a consequence of the earlier error.<br>
<br>
-Dave<br>
<br>
_______________________________________________<br>
mpich-discuss mailing list<br>
<a href="mailto:mpich-discuss@mcs.anl.gov" target="_blank">mpich-discuss@mcs.anl.gov</a><br>
<a href="https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss" target="_blank">https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss</a><br>
<br>
</blockquote>
<br>
</div></div></blockquote></div><br>