[mpich-discuss] Getting runtime exception

Dave Goodell goodell at mcs.anl.gov
Fri Feb 5 11:37:51 CST 2010


On Feb 5, 2010, at 7:24 AM, Rajnish wrote:

[snip]
> However, when I schedule tasks across both nodes, with n2 as the  
> master node, I get the following message on n1:
>
> mpd_uncaught_except_tb handling:
>  exceptions.KeyError: 'process_mapping'
>    /usr/local/bin/mpd  1354  do_mpdrun
>        msg['process_mapping'][lorank] = self.myHost
>    /usr/local/bin/mpd  984  handle_lhs_input
>        self.do_mpdrun(msg)
>    /usr/local/bin/mpdlib.py  780  handle_active_streams
>        handler(stream,*args)
>    /usr/local/bin/mpd  301  runmainloop
>        rv = self.streamHandler.handle_active_streams(timeout=8.0)
>    /usr/local/bin/mpd  270  run
>        self.runmainloop()
>    /usr/local/bin/mpd  1643  ?
>        mpd.run()
> n1-wulf.myhost.org_mpdman_1 (run 287): invalid msg from lhs;  
> expecting ringsize got: {}

What version of MPICH2 are you running?

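The KeyError means the mpd on n1 expected a 'process_mapping' field
that was missing from the message it received.  That seems to indicate
that you have somehow installed incompatible versions of "mpd.py" on
the two hosts.  What output do you get from running the following
commands on both hosts?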

-------8<-------
tail -n +2 `which mpd.py` | md5sum
tail -n +2 `which mpdman.py` | md5sum
-------8<-------

(The "tail -n +2" skips the first line of each file, which is needed
because the shebang line is usually altered by the install step.)
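
If tail or md5sum is not handy on one of the hosts, the same check can
be sketched in Python.  This is only an illustration and not part of
MPICH2; the function name md5_without_first_line is made up for this
example.

```python
import hashlib


def md5_without_first_line(path):
    """Return the MD5 hex digest of a file, ignoring its first line.

    Mirrors `tail -n +2 FILE | md5sum`: the first line (normally the
    shebang) is skipped because the install step usually rewrites it.
    """
    with open(path, "rb") as f:
        f.readline()  # discard the first line
        return hashlib.md5(f.read()).hexdigest()
```

Calling it on the paths printed by `which mpd.py` and `which mpdman.py`
on each host should give the same digests as the shell commands.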

Results for a few releases:
-------8<-------
release mpich2-1.2.1
68a128402fb44c6fdebe631bbc1c4b7f  mpd.py
b79fd98d6e4f9d9b80c295e05e01591c  mpdman.py
release mpich2-1.2
be37cc1347b915a0ec32cba54c928f63  mpd.py
5a9cd3f44b5986584b27a648f889bf31  mpdman.py
release mpich2-1.1.1p1
be37cc1347b915a0ec32cba54c928f63  mpd.py
5a9cd3f44b5986584b27a648f889bf31  mpdman.py
release mpich2-1.1.1
550958d41e76cdef0ceaa74d540760de  mpd.py
5a9cd3f44b5986584b27a648f889bf31  mpdman.py
release mpich2-1.1
f59c7e766dd2d3488b6df212a663ccb9  mpd.py
07129c1f68cd815c56bd186eb1b59038  mpdman.py
release mpich2-1.0.8
65fb3b8b1c9e3d053bb97d5ef2ae86ad  mpd.py
2083f8908d0b9698eb0550c32ef3d153  mpdman.py
-------8<-------
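
To compare a host's output against the list, a small lookup like the
following could be used.  It is a sketch only: the checksums are copied
from the list above, while the names KNOWN_RELEASES and
matching_releases are invented for this example.

```python
# (mpd.py checksum, mpdman.py checksum) for each release listed above.
KNOWN_RELEASES = {
    "mpich2-1.2.1": ("68a128402fb44c6fdebe631bbc1c4b7f",
                     "b79fd98d6e4f9d9b80c295e05e01591c"),
    "mpich2-1.2": ("be37cc1347b915a0ec32cba54c928f63",
                   "5a9cd3f44b5986584b27a648f889bf31"),
    "mpich2-1.1.1p1": ("be37cc1347b915a0ec32cba54c928f63",
                       "5a9cd3f44b5986584b27a648f889bf31"),
    "mpich2-1.1.1": ("550958d41e76cdef0ceaa74d540760de",
                     "5a9cd3f44b5986584b27a648f889bf31"),
    "mpich2-1.1": ("f59c7e766dd2d3488b6df212a663ccb9",
                   "07129c1f68cd815c56bd186eb1b59038"),
    "mpich2-1.0.8": ("65fb3b8b1c9e3d053bb97d5ef2ae86ad",
                     "2083f8908d0b9698eb0550c32ef3d153"),
}


def matching_releases(mpd_sum, mpdman_sum):
    """Return every listed release whose checksum pair matches exactly."""
    wanted = (mpd_sum.lower(), mpdman_sum.lower())
    return [name for name, sums in KNOWN_RELEASES.items() if sums == wanted]
```

Note that mpich2-1.2 and mpich2-1.1.1p1 have identical checksums for
both files, so this check cannot tell those two releases apart.  An
empty result only means the files match none of the releases listed
here.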

> After doing mpdallexit, n2 shows the following message:
>
> mpiexec_n2-wulf.myhost.org (mpiexec 377): no msg recvd from mpd when  
> expecting ack of request

You can ignore this message; it's a consequence of the earlier error.

-Dave

