[mpich-discuss] Getting runtime exception

Dave Goodell goodell at mcs.anl.gov
Fri Feb 5 12:26:35 CST 2010


FYI, if you install version 1.2.1 and your mpdboot hangs, you might  
need to apply the fix described in ticket #963: https://trac.mcs.anl.gov/projects/mpich2/ticket/963

-Dave

On Feb 5, 2010, at 11:59 AM, Rajnish wrote:

> Thank you, Pavan, for suggesting hydra. It was not in my default  
> installation, so I am installing it right now.
>
> Thank you, Dave, for pointing out the version incompatibility.  
> Yes, the md5sums are indeed different...
>
> I had installed 1.2.1 on node n1. Node n2 already had MPICH2  
> installed, but I am not sure which version; it must be an incompatible one.
>
> I am going to install fresh and will give you updates.
>
> - Rajnish.
>
>
> On Fri, Feb 5, 2010 at 12:37 PM, Dave Goodell <goodell at mcs.anl.gov>  
> wrote:
> On Feb 5, 2010, at 7:24 AM, Rajnish wrote:
>
> [snip]
>
> However, when I schedule tasks across both nodes, with n2 as the  
> master node, I get the following message on n1:
>
> mpd_uncaught_except_tb handling:
>  exceptions.KeyError: 'process_mapping'
>   /usr/local/bin/mpd  1354  do_mpdrun
>       msg['process_mapping'][lorank] = self.myHost
>   /usr/local/bin/mpd  984  handle_lhs_input
>       self.do_mpdrun(msg)
>   /usr/local/bin/mpdlib.py  780  handle_active_streams
>       handler(stream,*args)
>   /usr/local/bin/mpd  301  runmainloop
>       rv = self.streamHandler.handle_active_streams(timeout=8.0)
>   /usr/local/bin/mpd  270  run
>       self.runmainloop()
>   /usr/local/bin/mpd  1643  ?
>       mpd.run()
> n1-wulf.myhost.org_mpdman_1 (run 287): invalid msg from lhs;  
> expecting ringsize got: {}
>
> What version of MPICH2 are you running?
>
> This message seems to indicate that you have somehow installed  
> incompatible versions of "mpd.py" between the two hosts.  What  
> output do you get from running the following commands on both hosts?
>
> -------8<-------
> tail -n +2 `which mpd.py` | md5sum
> tail -n +2 `which mpdman.py` | md5sum
> -------8<-------
>
> (the "tail" business is needed because the shebang line is usually  
> altered by the install step)
>
> Results for a few releases:
> -------8<-------
> release mpich2-1.2.1
> 68a128402fb44c6fdebe631bbc1c4b7f  mpd.py
> b79fd98d6e4f9d9b80c295e05e01591c  mpdman.py
> release mpich2-1.2
> be37cc1347b915a0ec32cba54c928f63  mpd.py
> 5a9cd3f44b5986584b27a648f889bf31  mpdman.py
> release mpich2-1.1.1p1
> be37cc1347b915a0ec32cba54c928f63  mpd.py
> 5a9cd3f44b5986584b27a648f889bf31  mpdman.py
> release mpich2-1.1.1
> 550958d41e76cdef0ceaa74d540760de  mpd.py
> 5a9cd3f44b5986584b27a648f889bf31  mpdman.py
> release mpich2-1.1
> f59c7e766dd2d3488b6df212a663ccb9  mpd.py
> 07129c1f68cd815c56bd186eb1b59038  mpdman.py
> release mpich2-1.0.8
> 65fb3b8b1c9e3d053bb97d5ef2ae86ad  mpd.py
> 2083f8908d0b9698eb0550c32ef3d153  mpdman.py
> -------8<-------
>
>
> After doing mpdallexit, n2 shows the following message:
>
> mpiexec_n2-wulf.myhost.org (mpiexec 377): no msg recvd from mpd when  
> expecting ack of request
>
> You can ignore this message, it's a consequence of the earlier error.
>
> -Dave
>
> _______________________________________________
> mpich-discuss mailing list
> mpich-discuss at mcs.anl.gov
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
>
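
For anyone repeating the checksum comparison quoted above on several hosts, here is a minimal Python sketch of the same check. It hashes a file with its first line dropped (the equivalent of the "tail -n +2 ... | md5sum" pipeline) and looks the result up in the table of release checksums from this thread. The names KNOWN_MD5, md5_without_first_line and matching_releases are made up for illustration and are not part of MPICH2. The handling of a file installed as "mpd" rather than "mpd.py" is also an assumption.

```python
import hashlib
import os
import sys

# Checksums copied from the table quoted in this thread
# (md5 of each file with its first line removed).
KNOWN_MD5 = {
    "mpd.py": {
        "68a128402fb44c6fdebe631bbc1c4b7f": ["1.2.1"],
        "be37cc1347b915a0ec32cba54c928f63": ["1.2", "1.1.1p1"],
        "550958d41e76cdef0ceaa74d540760de": ["1.1.1"],
        "f59c7e766dd2d3488b6df212a663ccb9": ["1.1"],
        "65fb3b8b1c9e3d053bb97d5ef2ae86ad": ["1.0.8"],
    },
    "mpdman.py": {
        "b79fd98d6e4f9d9b80c295e05e01591c": ["1.2.1"],
        "5a9cd3f44b5986584b27a648f889bf31": ["1.2", "1.1.1p1", "1.1.1"],
        "07129c1f68cd815c56bd186eb1b59038": ["1.1"],
        "2083f8908d0b9698eb0550c32ef3d153": ["1.0.8"],
    },
}


def md5_without_first_line(data):
    """Hash everything after the first newline, like `tail -n +2 FILE | md5sum`."""
    _, _, rest = data.partition(b"\n")
    return hashlib.md5(rest).hexdigest()


def matching_releases(script_name, data):
    """Return the releases from the thread whose checksum matches, or []."""
    digest = md5_without_first_line(data)
    return KNOWN_MD5.get(script_name, {}).get(digest, [])


if __name__ == "__main__":
    # Usage (hypothetical script name): python check_mpd.py /path/to/mpd.py ...
    for path in sys.argv[1:]:
        with open(path, "rb") as fh:
            data = fh.read()
        name = os.path.basename(path)
        if not name.endswith(".py"):
            name += ".py"  # assumption: "mpd" is the same script as "mpd.py"
        releases = matching_releases(name, data)
        print(path, md5_without_first_line(data), releases or "no match in table")
```

Running it on each host and comparing the printed lines should show whether the hosts agree. Several releases share a checksum, so a match narrows the version down but does not always pin it to one release.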


