[mpich-discuss] mpdboot returning error codes 582, 404, 347, and 476

Dave Goodell goodell at mcs.anl.gov
Fri Jul 6 08:30:01 CDT 2012


To elaborate on this: you can install a newer version of MPICH2, or just a newer version of hydra, since we also ship hydra as a separate tarball.  IIRC a modern hydra should still be compatible with 1.2.1p1.  Just make sure the install directory for your copy of MPICH2/hydra comes earlier in your $PATH than the system-installed copy.
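
For the $PATH point, a quick sanity check can help. The following is only a sketch, not part of MPICH2: the helper name first_on_path and the ~/mpich2-install prefix are made up for illustration, so substitute whatever prefix you actually installed to. It reports which copy of a launcher would be found first when a PATH-style string is searched left to right:

```python
import os
import shutil


def first_on_path(program, path=None):
    """Return the full path of the first executable named `program` found
    when searching `path` (a PATH-style string) left to right, or None."""
    if path is None:
        path = os.environ.get("PATH", "")
    return shutil.which(program, path=path)


if __name__ == "__main__":
    # Hypothetical install prefix (an assumption, not from this thread).
    prefix = os.path.expanduser("~/mpich2-install")
    # Put the private bin directory in front of the existing PATH.
    new_path = os.path.join(prefix, "bin") + os.pathsep + os.environ.get("PATH", "")
    for prog in ("mpiexec", "mpiexec.hydra"):
        print(prog, "->", first_on_path(prog, new_path))
```

If the printed path for the launcher still points at the system-installed copy, your private bin directory is either missing the binary or is not first in $PATH.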

-Dave

On Jul 6, 2012, at 7:56 AM CDT, Pavan Balaji wrote:

> 
> Installing MPICH2 does not require root privileges.  So you can download it and install the latest version in your home directory, if you don't want to wait for your system administrator.
> 
> -- Pavan
> 
> On 07/06/2012 07:51 AM, Jeff Hammond wrote:
>> Hi Myles,
>> 
>> Given http://wiki.mcs.anl.gov/mpich2/index.php/Frequently_Asked_Questions#Q:_I_don.27t_like_.3CWHATEVER.3E_about_mpd.2C_or_I.27m_having_a_problem_with_mpdboot.2C_can_you_fix_it.3F,
>> my guess is that you'll have to wait for Hydra-capable MPICH2.
>> 
>> As frustrating as it may be to have to wait for the solution, I'm sure
>> that it improves the life quality of the MPICH2 guys to not have to
>> revisit deprecated code that has been replaced with something that
>> almost certainly solves your problem without any additional effort.
>> 
>> Best,
>> 
>> Jeff
>> 
>> On Thu, Jul 5, 2012 at 9:26 AM, BAKER, MYLES D. (LARC-E302)
>> <myles.d.baker at nasa.gov> wrote:
>>> Dear List,
>>> 
>>> NOTE: This is my first list post, so please let me know how to write more
>>> effectively and honor the etiquette of the community.
>>> 
>>> I am having trouble launching mpd after successful past boots. My system
>>> administrator is updating to a hydra-capable mpich but for now I'm stuck
>>> with mpd and mpich2-1.2.1p1. Here is some information about my system:
>>> 
>>> 1. Multi node POWER6 cluster using General Parallel File System (GPFS).
>>> 2. Master is an interactive node (tisa2-blue); slaves are dedicated compute
>>> nodes (bb101, bb102).
>>> 3. ~/.mpd.conf file contains MPD_SECRETWORD=password for each node (master
>>> and slaves).
>>> 4. ~/mpd.hosts file contains bb101:4\n bb102:4. Once again, this file is
>>> accessible by all nodes.
>>> 5. Passwordless ssh between all nodes (3 node, fully connected digraph)
>>> 6. The /etc/hosts files on the respective machines are configured correctly
>>> (i.e., no 127.0.1.1 for the slave nodes, etc.; I have added them to the
>>> bottom of the post in case you need to reference them)
>>> 
>>> Given the above setup, when I try to boot using the command
>>> mpdboot -v -n 3 -f ~/mpd.hosts on tisa2-blue, I get the following output:
>>> running mpdallexit on tisa2-blue
>>> LAUNCHED mpd on tisa2-blue  via
>>> 
>>> [2]    Done                          mpd
>>> RUNNING: mpd on tisa2-blue
>>> LAUNCHED mpd on bb101  via  tisa2-blue
>>> LAUNCHED mpd on bb102  via  tisa2-blue
>>> 
>>> Next, the terminal window waits, and I get the following error:
>>> mpdboot_tisa2-blue (handle_mpd_output 406): failed to handshake with mpd on
>>> bb101; recvd output={}
>>> 
>>> If I press ^C to end it, I get the following error:
>>> mpdboot_tisa2-blue (recv_dict_msg 582):recv_dict_msg: errmsg=::
>>>   mpdtb:
>>>     /SPG_ops/utils/ppc64/mpich2-1.2.1p1/bin/mpdlib.py,  582,  recv_dict_msg
>>>     /usr/local/bin/mpdboot,  404,  handle_mpd_output
>>>     /usr/local/bin/mpdboot,  347,  mpdboot
>>>     /usr/local/bin/mpdboot,  476,  ?
>>> 
>>> mpdboot_tisa2-blue (handle_mpd_output 406): failed to handshake with mpd on
>>> bb101; recvd output={}
>>> mpdboot_tisa2-blue: failure doing recv exceptions.KeyboardInterrupt ::
>>> 0
>>> 
>>> So, at this point I'm not sure what I can do to fix this. I have looked up
>>> the error codes and I don't think that I have done anything wrong. Can
>>> anyone give me some guidance or ideas on how to fix this issue?
>>> 
>>> Thank you so much!
>>> Myles Baker
>>> 
>>> tisa2-blue: /etc/hosts
>>> 127.0.0.1       localhost
>>> #192.168.18.115  bc206 bc206.cluster.net
>>> #192.168.18.115  tisa2-blue.larc.nasa.gov
>>> # made sure the larc-facing address for tisa2-blue is here and
>>> # uncommented... w/o it, daacget commands fail because it reverse
>>> # lookup's and gets the AMI interface name and it doesn't match.  crjones, 07/03/12
>>> 198.119.135.140 tisa2-blue.larc.nasa.gov tisa2-blue
>>> 192.168.18.130  magneto.cluster.net magneto.magneto
>>> 198.119.135.180 snfsmdc1.larc.nasa.gov snfsmdc1
>>> 198.119.135.181 snfsmdc2.larc.nasa.gov snfsmdc2
>>> 192.168.18.162  bk17.cluster.net bk17
>>> 192.168.18.164  bk21.cluster.net bk21
>>> 192.168.18.207  ab01-p.cluster.net ab01-p
>>> 192.168.18.1    coil-blue.cluster.net coil-blue
>>> 192.168.18.2    nsd1.cluster.net nsd1
>>> 192.168.18.3    nsd2.cluster.net nsd2
>>> 
>>> bb101: /etc/hosts
>>> 127.0.0.1       localhost
>>> 192.168.18.50   bb101 bb101.cluster.net
>>> 192.168.18.1    coil-blue coil-blue.cluster.net
>>> 192.168.18.2    nsd1 nsd1.cluster.net
>>> 192.168.18.5    ab3950.cluster.net ab3950
>>> 192.168.18.3    nsd2 nsd2.cluster.net
>>> 192.168.18.10   ba101 ba101.cluster.net
>>> 192.168.18.90   bc101 bc101.cluster.net
>>> 192.168.18.173  ab19 ab19.cluster.net
>>> 192.168.18.175  ac19 ac19.cluster.net
>>> 192.168.18.130 magneto.cluster.net magneto.magneto
>>> 198.119.135.180 snfsmdc1.larc.nasa.gov snfsmdc1
>>> 198.119.135.181 snfsmdc2.larc.nasa.gov snfsmdc2
>>> 192.168.18.162 bk17.cluster.net bk17
>>> 192.168.18.164 bk21.cluster.net bk21
>>> 192.168.18.207 ab01-p.cluster.net ab01-p
>>> 192.168.18.1    coil-blue.cluster.net coil-blue
>>> 192.168.18.2    nsd1.cluster.net nsd1
>>> 192.168.18.3    nsd2.cluster.net nsd2
>>> 
>>> bb102: /etc/hosts
>>> 127.0.0.1         localhost
>>> 192.168.16.51 bb102m
>>> 192.168.18.51 bb102 bb102.cluster.net
>>> 192.168.18.130 magneto.cluster.net magneto.magneto
>>> 198.119.135.180 snfsmdc1.larc.nasa.gov snfsmdc1
>>> 198.119.135.181 snfsmdc2.larc.nasa.gov snfsmdc2
>>> 192.168.18.162 bk17.cluster.net bk17
>>> 192.168.18.164 bk21.cluster.net bk21
>>> 192.168.18.207 ab01-p.cluster.net ab01-p
>>> 192.168.18.1    coil-blue.cluster.net coil-blue
>>> 192.168.18.2    nsd1.cluster.net nsd1
>>> 192.168.18.3    nsd2.cluster.net nsd2
>>> 
>>> 
>>> 
>>> BAKER, MYLES D. (LARC-E302)
>>> -----------------------------------------------------
>>> Mail Stop 420, B1250 R177
>>> myles.d.baker at nasa.gov
>>> LaRC Ext:  x46393
>>> 
>>> _______________________________________________
>>> mpich-discuss mailing list     mpich-discuss at mcs.anl.gov
>>> To manage subscription options or unsubscribe:
>>> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
>>> 
>> 
>> 
>> 
> 
> -- 
> Pavan Balaji
> http://www.mcs.anl.gov/~balaji
> 
> 
