[mpich-discuss] mpdboot returning error codes 582, 404, 347, and 476
Jeff Hammond
jhammond at alcf.anl.gov
Fri Jul 6 07:51:44 CDT 2012
Hi Myles,
Given http://wiki.mcs.anl.gov/mpich2/index.php/Frequently_Asked_Questions#Q:_I_don.27t_like_.3CWHATEVER.3E_about_mpd.2C_or_I.27m_having_a_problem_with_mpdboot.2C_can_you_fix_it.3F,
my guess is that you'll have to wait for Hydra-capable MPICH2.
As frustrating as it may be to wait for the fix, I'm sure it improves
the quality of life of the MPICH2 developers not to have to revisit
deprecated code, especially when its replacement almost certainly
solves your problem without any additional effort.
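In the meantime, if you want to rule out name resolution, a small
sketch like the one below might help. It is only a guess at what to
check, not a diagnosis. The function names are hypothetical (not part
of mpd), the handling of "#" comments in mpd.hosts is an assumption,
and it only inspects /etc/hosts-style text, so a name it reports as
missing could still resolve through DNS.

```python
def parse_mpd_hosts(text):
    """Return (hostname, ncpus) pairs from mpd.hosts-style text.

    Accepts 'host' or 'host:ncpus' per line. Skipping '#' comments is an
    assumption made for this sketch.
    """
    entries = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        first = line.split()[0]            # ignore anything after the host field
        host, _, ncpus = first.partition(":")
        entries.append((host, int(ncpus) if ncpus else 1))
    return entries


def parse_etc_hosts(text):
    """Map every name and alias in /etc/hosts-style text to its IP address."""
    table = {}
    for raw in text.splitlines():
        fields = raw.split("#", 1)[0].split()
        if len(fields) < 2:
            continue                       # blank, comment-only, or malformed line
        ip, names = fields[0], fields[1:]
        for name in names:
            table.setdefault(name, ip)     # keep the first entry for a name
    return table


def check_names(etc_hosts_text, names):
    """Return (missing, loopback): names absent from the hosts text, and
    names that map to a 127.x.x.x address."""
    table = parse_etc_hosts(etc_hosts_text)
    missing = [n for n in names if n not in table]
    loopback = [n for n in names if table.get(n, "").startswith("127.")]
    return missing, loopback
```

To use it, you would read /etc/hosts and ~/mpd.hosts on each node, pass
the host names from parse_mpd_hosts (plus the master's name) to
check_names, and look at which names come back as missing or loopback.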
Best,
Jeff
On Thu, Jul 5, 2012 at 9:26 AM, BAKER, MYLES D. (LARC-E302)
<myles.d.baker at nasa.gov> wrote:
> Dear List,
>
> NOTE: This is my first list post, so please let me know how to write more
> effectively and honor the etiquette of the community.
>
> I am having trouble launching mpd after successful past boots. My system
> administrator is updating to a hydra-capable mpich but for now I'm stuck
> with mpd and mpich2-1.2.1p1. Here is some information about my system:
>
> 1. Multi-node POWER6 cluster using General Parallel File System (GPFS).
> 2. Master is an interactive node (tisa2-blue); slaves are dedicated
> compute nodes (bb101, bb102).
> 3. ~/.mpd.conf file contains MPD_SECRETWORD=password for each node (master
> and slaves).
> 4. ~/mpd.hosts file contains bb101:4\n bb102:4. Once again, this file is
> accessible by all nodes.
> 5. Passwordless ssh between all nodes (3 nodes, fully connected digraph).
> 6. The /etc/hosts files on the respective machines are configured correctly
> (i.e., no 127.0.1.1 for the slave nodes, etc.; I have added them to the
> bottom of the post in case you need to reference them).
>
> Given the above setup, when I try to boot using the command
> mpdboot -v -n 3 -f ~/mpd.hosts on tisa2-blue, I get the following output:
> running mpdallexit on tisa2-blue
> LAUNCHED mpd on tisa2-blue via
>
> [2] Done mpd
> RUNNING: mpd on tisa2-blue
> LAUNCHED mpd on bb101 via tisa2-blue
> LAUNCHED mpd on bb102 via tisa2-blue
>
> Next, the terminal window waits, and I get the following error:
> mpdboot_tisa2-blue (handle_mpd_output 406): failed to handshake with mpd on
> bb101; recvd output={}
>
> If I use ^C to end it, I get the following error:
> mpdboot_tisa2-blue (recv_dict_msg 582):recv_dict_msg: errmsg=::
> mpdtb:
> /SPG_ops/utils/ppc64/mpich2-1.2.1p1/bin/mpdlib.py, 582, recv_dict_msg
> /usr/local/bin/mpdboot, 404, handle_mpd_output
> /usr/local/bin/mpdboot, 347, mpdboot
> /usr/local/bin/mpdboot, 476, ?
>
> mpdboot_tisa2-blue (handle_mpd_output 406): failed to handshake with mpd on
> bb101; recvd output={}
> mpdboot_tisa2-blue: failure doing recv exceptions.KeyboardInterrupt ::
> 0
>
> So, at this point I'm not sure what I can do to fix this. I have looked up
> the error codes and I don't think that I have done anything wrong. Can
> anyone give me some guidance or ideas on how to fix this issue?
>
> Thank you so much!
> Myles Baker
>
> tisa2-blue: /etc/hosts
> 127.0.0.1 localhost
> #192.168.18.115 bc206 bc206.cluster.net
> #192.168.18.115 tisa2-blue.larc.nasa.gov
> # made sure the larc-facing address for tisa2-blue is here and
> # uncommented... w/o it, daacget commands fail because it reverse
> # lookup's and gets the AMI interface name and it doesn't match.
> # crjones, 07/03/12
> 198.119.135.140 tisa2-blue.larc.nasa.gov tisa2-blue
> 192.168.18.130 magneto.cluster.net magneto.magneto
> 198.119.135.180 snfsmdc1.larc.nasa.gov snfsmdc1
> 198.119.135.181 snfsmdc2.larc.nasa.gov snfsmdc2
> 192.168.18.162 bk17.cluster.net bk17
> 192.168.18.164 bk21.cluster.net bk21
> 192.168.18.207 ab01-p.cluster.net ab01-p
> 192.168.18.1 coil-blue.cluster.net coil-blue
> 192.168.18.2 nsd1.cluster.net nsd1
> 192.168.18.3 nsd2.cluster.net nsd2
>
> bb101: /etc/hosts
> 127.0.0.1 localhost
> 192.168.18.50 bb101 bb101.cluster.net
> 192.168.18.1 coil-blue coil-blue.cluster.net
> 192.168.18.2 nsd1 nsd1.cluster.net
> 192.168.18.5 ab3950.cluster.net ab3950
> 192.168.18.3 nsd2 nsd2.cluster.net
> 192.168.18.10 ba101 ba101.cluster.net
> 192.168.18.90 bc101 bc101.cluster.net
> 192.168.18.173 ab19 ab19.cluster.net
> 192.168.18.175 ac19 ac19.cluster.net
> 192.168.18.130 magneto.cluster.net magneto.magneto
> 198.119.135.180 snfsmdc1.larc.nasa.gov snfsmdc1
> 198.119.135.181 snfsmdc2.larc.nasa.gov snfsmdc2
> 192.168.18.162 bk17.cluster.net bk17
> 192.168.18.164 bk21.cluster.net bk21
> 192.168.18.207 ab01-p.cluster.net ab01-p
> 192.168.18.1 coil-blue.cluster.net coil-blue
> 192.168.18.2 nsd1.cluster.net nsd1
> 192.168.18.3 nsd2.cluster.net nsd2
>
> bb102: /etc/hosts
> 127.0.0.1 localhost
> 192.168.16.51 bb102m
> 192.168.18.51 bb102 bb102.cluster.net
> 192.168.18.130 magneto.cluster.net magneto.magneto
> 198.119.135.180 snfsmdc1.larc.nasa.gov snfsmdc1
> 198.119.135.181 snfsmdc2.larc.nasa.gov snfsmdc2
> 192.168.18.162 bk17.cluster.net bk17
> 192.168.18.164 bk21.cluster.net bk21
> 192.168.18.207 ab01-p.cluster.net ab01-p
> 192.168.18.1 coil-blue.cluster.net coil-blue
> 192.168.18.2 nsd1.cluster.net nsd1
> 192.168.18.3 nsd2.cluster.net nsd2
>
> BAKER, MYLES D. (LARC-E302)
> -----------------------------------------------------
> Mail Stop 420, B1250 R177
> myles.d.baker at nasa.gov
> LaRC Ext: x46393
--
Jeff Hammond
Argonne Leadership Computing Facility
University of Chicago Computation Institute
jhammond at alcf.anl.gov / (630) 252-5381
http://www.linkedin.com/in/jeffhammond
https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond