[mpich-discuss] Getting runtime exception

Rajnish rajnish99 at gmail.com
Fri Feb 5 13:03:11 CST 2010


Great! After bringing both nodes up to the latest MPICH2 version, both
hydra and mpd are working:

% mpiexec.hydra -f machinefile -n 4 hostname

mpdboot still hangs, but after a Ctrl-C, mpdtrace shows that the mpd daemons
started successfully, and mpiexec works fine.

Thanks again to Pavan and Dave.
- Rajnish.


On Fri, Feb 5, 2010 at 1:26 PM, Dave Goodell <goodell at mcs.anl.gov> wrote:

> FYI, if you install version 1.2.1 and your mpdboot hangs, you might need to
> apply the fix described in ticket #963:
> https://trac.mcs.anl.gov/projects/mpich2/ticket/963
>
> -Dave
>
>
> On Feb 5, 2010, at 11:59 AM, Rajnish wrote:
>
>> Thank you, Pavan, for suggesting hydra. It was not in my default
>> installation, so I am installing it right now.
>>
>> Thank you, Dave, for pointing out the version incompatibility. Yes, the
>> md5sums are indeed different...
>>
>> I had installed 1.2.1 on node n1. Node n2 already had MPICH2 installed, but
>> I am not sure which version; it must be an incompatible one.
>>
>> I am going to install fresh and will give you updates.
>>
>> - Rajnish.
>>
>>
>> On Fri, Feb 5, 2010 at 12:37 PM, Dave Goodell <goodell at mcs.anl.gov>
>> wrote:
>> On Feb 5, 2010, at 7:24 AM, Rajnish wrote:
>>
>> [snip]
>>
>> However, when I schedule tasks across both nodes, with n2 as the master
>> node, I get the following message on n1:
>>
>> mpd_uncaught_except_tb handling:
>>  exceptions.KeyError: 'process_mapping'
>>  /usr/local/bin/mpd  1354  do_mpdrun
>>      msg['process_mapping'][lorank] = self.myHost
>>  /usr/local/bin/mpd  984  handle_lhs_input
>>      self.do_mpdrun(msg)
>>  /usr/local/bin/mpdlib.py  780  handle_active_streams
>>      handler(stream,*args)
>>  /usr/local/bin/mpd  301  runmainloop
>>      rv = self.streamHandler.handle_active_streams(timeout=8.0)
>>  /usr/local/bin/mpd  270  run
>>      self.runmainloop()
>>  /usr/local/bin/mpd  1643  ?
>>      mpd.run()
>> n1-wulf.myhost.org_mpdman_1 (run 287): invalid msg from lhs; expecting
>> ringsize got: {}
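
A minimal sketch of how the KeyError in the traceback above can arise,
assuming the message is a plain Python dict. The function name and both
messages are hypothetical; this thread does not say which MPICH2 releases
send the 'process_mapping' field.

```python
def record_host(msg, lorank, host):
    """Store host under msg['process_mapping'][lorank], like the traceback line."""
    msg["process_mapping"][lorank] = host


# Hypothetical message from a peer that includes the field.
new_style_msg = {"process_mapping": {}}
record_host(new_style_msg, 0, "n1")

# Hypothetical message from a peer that never sets the field.
old_style_msg = {}
try:
    record_host(old_style_msg, 0, "n1")
except KeyError as exc:
    # Indexing a dict with a missing key raises KeyError naming that key.
    error = repr(exc)
```

The lookup of the missing key fails before any assignment happens, which is
why the exception names 'process_mapping' rather than the rank.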
>>
>> What version of MPICH2 are you running?
>>
>> This message seems to indicate that you have somehow installed
>> incompatible versions of "mpd.py" between the two hosts.  What output do you
>> get from running the following commands on both hosts?
>>
>> -------8<-------
>> tail -n +2 `which mpd.py` | md5sum
>> tail -n +2 `which mpdman.py` | md5sum
>> -------8<-------
>>
>> (the "tail" business is needed because the shebang line is usually altered
>> by the install step)
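
The same check can be sketched in Python for hosts where scripting it is more
convenient than the shell pipeline. This is only an illustration under the
assumption that it should match `tail -n +2 FILE | md5sum`; the function name
is hypothetical and not part of MPICH2.

```python
import hashlib
from pathlib import Path


def md5_without_first_line(path):
    """Return the MD5 hex digest of a file with its first line removed.

    Everything up to and including the first newline is skipped and the
    remaining bytes are hashed unchanged.
    """
    data = Path(path).read_bytes()
    newline = data.find(b"\n")
    # With no newline at all the file is a single line, so nothing remains.
    body = b"" if newline == -1 else data[newline + 1:]
    return hashlib.md5(body).hexdigest()
```

Two installed copies that differ only in their shebang line give the same
digest, so a mismatch between hosts points at a real difference in the code.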
>>
>> Results for a few releases:
>> -------8<-------
>> release mpich2-1.2.1
>> 68a128402fb44c6fdebe631bbc1c4b7f  mpd.py
>> b79fd98d6e4f9d9b80c295e05e01591c  mpdman.py
>> release mpich2-1.2
>> be37cc1347b915a0ec32cba54c928f63  mpd.py
>> 5a9cd3f44b5986584b27a648f889bf31  mpdman.py
>> release mpich2-1.1.1p1
>> be37cc1347b915a0ec32cba54c928f63  mpd.py
>> 5a9cd3f44b5986584b27a648f889bf31  mpdman.py
>> release mpich2-1.1.1
>> 550958d41e76cdef0ceaa74d540760de  mpd.py
>> 5a9cd3f44b5986584b27a648f889bf31  mpdman.py
>> release mpich2-1.1
>> f59c7e766dd2d3488b6df212a663ccb9  mpd.py
>> 07129c1f68cd815c56bd186eb1b59038  mpdman.py
>> release mpich2-1.0.8
>> 65fb3b8b1c9e3d053bb97d5ef2ae86ad  mpd.py
>> 2083f8908d0b9698eb0550c32ef3d153  mpdman.py
>> -------8<-------
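
As a rough sketch, the table above can be turned into a lookup that maps a
pair of digests back to the releases it is consistent with. The digests are
copied from the table; the dictionary and function names are hypothetical.

```python
# (mpd.py digest, mpdman.py digest) for each release, copied from the table above.
KNOWN_DIGESTS = {
    "mpich2-1.2.1": ("68a128402fb44c6fdebe631bbc1c4b7f",
                     "b79fd98d6e4f9d9b80c295e05e01591c"),
    "mpich2-1.2": ("be37cc1347b915a0ec32cba54c928f63",
                   "5a9cd3f44b5986584b27a648f889bf31"),
    "mpich2-1.1.1p1": ("be37cc1347b915a0ec32cba54c928f63",
                       "5a9cd3f44b5986584b27a648f889bf31"),
    "mpich2-1.1.1": ("550958d41e76cdef0ceaa74d540760de",
                     "5a9cd3f44b5986584b27a648f889bf31"),
    "mpich2-1.1": ("f59c7e766dd2d3488b6df212a663ccb9",
                   "07129c1f68cd815c56bd186eb1b59038"),
    "mpich2-1.0.8": ("65fb3b8b1c9e3d053bb97d5ef2ae86ad",
                     "2083f8908d0b9698eb0550c32ef3d153"),
}


def candidate_releases(mpd_digest, mpdman_digest):
    """Return every listed release whose two digests both match."""
    pair = (mpd_digest.strip().lower(), mpdman_digest.strip().lower())
    return [name for name, digests in KNOWN_DIGESTS.items() if digests == pair]
```

The result is a list because 1.2 and 1.1.1p1 share both digests, so the
checksums alone cannot tell those two releases apart. An empty list means the
installed files match none of the releases listed here.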
>>
>>
>> After doing mpdallexit, n2 shows the following message:
>>
>> mpiexec_n2-wulf.myhost.org (mpiexec 377): no msg recvd from mpd when
>> expecting ack of request
>>
>> You can ignore this message; it's a consequence of the earlier error.
>>
>> -Dave
>>
>> _______________________________________________
>> mpich-discuss mailing list
>> mpich-discuss at mcs.anl.gov
>> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss