[mpich-discuss] Getting runtime exception
Dave Goodell
goodell at mcs.anl.gov
Fri Feb 5 12:26:35 CST 2010
FYI, if you install version 1.2.1 and your mpdboot hangs, you might
need to apply the fix described in ticket #963: https://trac.mcs.anl.gov/projects/mpich2/ticket/963
-Dave
On Feb 5, 2010, at 11:59 AM, Rajnish wrote:
> Thank you, Pavan, for suggesting hydra. I did not have it in
> the default installation, so I am setting that up right now.
>
> Thank you, Dave, for pointing out the version incompatibility.
> Yes, the md5sums are indeed different...
>
> I had installed 1.2.1 on node n1, and node n2 already had MPICH2
> installed, but I am not sure which version; it must be incompatible.
>
> I am going to install fresh and will give you updates.
>
> - Rajnish.
>
>
> On Fri, Feb 5, 2010 at 12:37 PM, Dave Goodell <goodell at mcs.anl.gov>
> wrote:
> On Feb 5, 2010, at 7:24 AM, Rajnish wrote:
>
> [snip]
>
> However, when I schedule tasks across both nodes, with n2 as the
> master node, I get the following message on n1:
>
> mpd_uncaught_except_tb handling:
> exceptions.KeyError: 'process_mapping'
> /usr/local/bin/mpd 1354 do_mpdrun
> msg['process_mapping'][lorank] = self.myHost
> /usr/local/bin/mpd 984 handle_lhs_input
> self.do_mpdrun(msg)
> /usr/local/bin/mpdlib.py 780 handle_active_streams
> handler(stream,*args)
> /usr/local/bin/mpd 301 runmainloop
> rv = self.streamHandler.handle_active_streams(timeout=8.0)
> /usr/local/bin/mpd 270 run
> self.runmainloop()
> /usr/local/bin/mpd 1643 ?
> mpd.run()
> n1-wulf.myhost.org_mpdman_1 (run 287): invalid msg from lhs;
> expecting ringsize got: {}
>
> What version of MPICH2 are you running?
>
> This message seems to indicate that you have somehow installed
> incompatible versions of "mpd.py" between the two hosts. What
> output do you get from running the following commands on both hosts?
>
> -------8<-------
> tail -n +2 `which mpd.py` | md5sum
> tail -n +2 `which mpdman.py` | md5sum
> -------8<-------
>
> (the "tail" business is needed because the shebang line is usually
> altered by the install step)
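
For hosts where tail or md5sum is not handy, the same check can be approximated in Python. This is only a sketch: the function name md5_skip_first_line is made up for illustration and is not part of MPICH2 or mpd. It hashes everything after the first line, which is what the tail -n +2 pipeline above feeds to md5sum.

```python
import hashlib


def md5_skip_first_line(path):
    """Return the MD5 hex digest of a file, ignoring its first line.

    Rough equivalent of `tail -n +2 FILE | md5sum`: the shebang line is
    skipped because the install step usually rewrites it.
    """
    with open(path, "rb") as f:
        f.readline()  # discard line 1 (the shebang)
        return hashlib.md5(f.read()).hexdigest()
```

Run it on both hosts against whatever path `which mpd.py` and `which mpdman.py` report, and compare the digests.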
>
> Results for a few releases:
> -------8<-------
> release mpich2-1.2.1
> 68a128402fb44c6fdebe631bbc1c4b7f mpd.py
> b79fd98d6e4f9d9b80c295e05e01591c mpdman.py
> release mpich2-1.2
> be37cc1347b915a0ec32cba54c928f63 mpd.py
> 5a9cd3f44b5986584b27a648f889bf31 mpdman.py
> release mpich2-1.1.1p1
> be37cc1347b915a0ec32cba54c928f63 mpd.py
> 5a9cd3f44b5986584b27a648f889bf31 mpdman.py
> release mpich2-1.1.1
> 550958d41e76cdef0ceaa74d540760de mpd.py
> 5a9cd3f44b5986584b27a648f889bf31 mpdman.py
> release mpich2-1.1
> f59c7e766dd2d3488b6df212a663ccb9 mpd.py
> 07129c1f68cd815c56bd186eb1b59038 mpdman.py
> release mpich2-1.0.8
> 65fb3b8b1c9e3d053bb97d5ef2ae86ad mpd.py
> 2083f8908d0b9698eb0550c32ef3d153 mpdman.py
> -------8<-------
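
If it helps, the table above can be turned into a small lookup. Again, just a sketch: the names KNOWN_RELEASES and identify_release are hypothetical, and the digests are copied from the list above and nothing else. Some releases share digests (1.2 and 1.1.1p1 are identical for both files), so a pair can map to more than one release.

```python
# (mpd.py digest, mpdman.py digest) -> releases, copied from the table above.
KNOWN_RELEASES = {
    ("68a128402fb44c6fdebe631bbc1c4b7f", "b79fd98d6e4f9d9b80c295e05e01591c"):
        ["mpich2-1.2.1"],
    ("be37cc1347b915a0ec32cba54c928f63", "5a9cd3f44b5986584b27a648f889bf31"):
        ["mpich2-1.2", "mpich2-1.1.1p1"],
    ("550958d41e76cdef0ceaa74d540760de", "5a9cd3f44b5986584b27a648f889bf31"):
        ["mpich2-1.1.1"],
    ("f59c7e766dd2d3488b6df212a663ccb9", "07129c1f68cd815c56bd186eb1b59038"):
        ["mpich2-1.1"],
    ("65fb3b8b1c9e3d053bb97d5ef2ae86ad", "2083f8908d0b9698eb0550c32ef3d153"):
        ["mpich2-1.0.8"],
}


def identify_release(mpd_md5, mpdman_md5):
    """Return the listed releases matching both digests, or [] if unknown."""
    key = (mpd_md5.strip().lower(), mpdman_md5.strip().lower())
    return KNOWN_RELEASES.get(key, [])
```

An empty result only means the pair is not in this list (a different release, or a locally modified file). If the two hosts return different results, the installs do not match.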
>
>
> After doing mpdallexit, n2 shows the following message:
>
> mpiexec_n2-wulf.myhost.org (mpiexec 377): no msg recvd from mpd when
> expecting ack of request
>
> You can ignore this message; it's a consequence of the earlier error.
>
> -Dave
>
> _______________________________________________
> mpich-discuss mailing list
> mpich-discuss at mcs.anl.gov
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss