[mpich-discuss] Getting runtime exception

Rajnish rajnish99 at gmail.com
Fri Feb 5 11:59:08 CST 2010


Thank you, Pavan, for suggesting the use of hydra; it was not part of the
default installation here, so I am adding it right now.
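
In case it is useful to anyone else, this is roughly what I am running; I am
assuming the 1.2.1 configure option --with-pm=hydra and the mpiexec.hydra
launcher name are correct, and the prefix just mirrors my current /usr/local
layout:

-------8<-------
# rebuild MPICH2 1.2.1 with the hydra process manager enabled (mpd kept as well)
tar xzf mpich2-1.2.1.tar.gz
cd mpich2-1.2.1
./configure --prefix=/usr/local --with-pm=hydra:mpd
make && make install

# launch through hydra instead of mpd; hostfile lists n1 and n2
mpiexec.hydra -f hostfile -n 4 ./examples/cpi
-------8<-------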

Thank you, Dave, for pointing out the possible version incompatibility. Yes,
the md5sums are indeed different...
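
For the record, this is roughly how I compared the two nodes, using Dave's
commands quoted below (it assumes password-less ssh and that the short
hostnames are simply n1 and n2):

-------8<-------
# compare the mpd.py / mpdman.py checksums across both nodes
for h in n1 n2; do
    echo "== $h =="
    ssh $h 'tail -n +2 `which mpd.py` | md5sum; tail -n +2 `which mpdman.py` | md5sum'
done
-------8<-------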

I had installed 1.2.1 on node n1; node n2 already had MPICH2 on it, but I am
not sure which version, so the two must be incompatible.

I am going to do a fresh install and will post an update.
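
Once the reinstall is done I plan to sanity-check that both nodes report the
same build before bringing the ring up again (assuming mpich2version ends up
on the PATH of both machines):

-------8<-------
# both nodes should print the same MPICH2 version and configure information
ssh n1 mpich2version
ssh n2 mpich2version
-------8<-------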

- Rajnish.


On Fri, Feb 5, 2010 at 12:37 PM, Dave Goodell <goodell at mcs.anl.gov> wrote:

> On Feb 5, 2010, at 7:24 AM, Rajnish wrote:
>
> [snip]
>
>> However, when I schedule tasks across both nodes, with n2 as the master
>> node, I get the following message on n1:
>>
>> mpd_uncaught_except_tb handling:
>>  exceptions.KeyError: 'process_mapping'
>>   /usr/local/bin/mpd  1354  do_mpdrun
>>       msg['process_mapping'][lorank] = self.myHost
>>   /usr/local/bin/mpd  984  handle_lhs_input
>>       self.do_mpdrun(msg)
>>   /usr/local/bin/mpdlib.py  780  handle_active_streams
>>       handler(stream,*args)
>>   /usr/local/bin/mpd  301  runmainloop
>>       rv = self.streamHandler.handle_active_streams(timeout=8.0)
>>   /usr/local/bin/mpd  270  run
>>       self.runmainloop()
>>   /usr/local/bin/mpd  1643  ?
>>       mpd.run()
>> n1-wulf.myhost.org_mpdman_1 (run 287): invalid msg from lhs; expecting
>> ringsize got: {}
>>
>
> What version of MPICH2 are you running?
>
> This message seems to indicate that you have somehow installed incompatible
> versions of "mpd.py" between the two hosts.  What output do you get from
> running the following commands on both hosts?
>
> -------8<-------
> tail -n +2 `which mpd.py` | md5sum
> tail -n +2 `which mpdman.py` | md5sum
> -------8<-------
>
> (the "tail" business is needed because the shebang line is usually altered
> by the install step)
>
> Results for a few releases:
> -------8<-------
> release mpich2-1.2.1
> 68a128402fb44c6fdebe631bbc1c4b7f  mpd.py
> b79fd98d6e4f9d9b80c295e05e01591c  mpdman.py
> release mpich2-1.2
> be37cc1347b915a0ec32cba54c928f63  mpd.py
> 5a9cd3f44b5986584b27a648f889bf31  mpdman.py
> release mpich2-1.1.1p1
> be37cc1347b915a0ec32cba54c928f63  mpd.py
> 5a9cd3f44b5986584b27a648f889bf31  mpdman.py
> release mpich2-1.1.1
> 550958d41e76cdef0ceaa74d540760de  mpd.py
> 5a9cd3f44b5986584b27a648f889bf31  mpdman.py
> release mpich2-1.1
> f59c7e766dd2d3488b6df212a663ccb9  mpd.py
> 07129c1f68cd815c56bd186eb1b59038  mpdman.py
> release mpich2-1.0.8
> 65fb3b8b1c9e3d053bb97d5ef2ae86ad  mpd.py
> 2083f8908d0b9698eb0550c32ef3d153  mpdman.py
> -------8<-------
>
>
>> After doing mpdallexit, n2 shows the following message:
>>
>> mpiexec_n2-wulf.myhost.org (mpiexec 377): no msg recvd from mpd when
>> expecting ack of request
>>
>
> You can ignore this message; it's a consequence of the earlier error.
>
> -Dave
>
> _______________________________________________
> mpich-discuss mailing list
> mpich-discuss at mcs.anl.gov
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
>