[mpich-discuss] mpdboot returning error codes 582, 404, 347, and 476
Gus Correa
gus at ldeo.columbia.edu
Fri Jul 6 11:01:34 CDT 2012
Hi Myles
First, if the cluster has Infiniband, does it perhaps already have
an up-to-date MVAPICH2 installed?
Second, can tisa2-blue resolve bb101 and bb102 and vice-versa?
Looking at the complex /etc/hosts files that you sent I am not sure.
Maybe this is done via DNS?
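
One way to check the second point from the files alone: a minimal Python
sketch that parses the text of an /etc/hosts file and lists which node names
it does not define. The function names parse_hosts and missing_names are mine,
for illustration only, and the sample lines are copied from the tisa2-blue
file quoted below. It says nothing about what DNS would return.

```python
def parse_hosts(text):
    """Map every hostname and alias in /etc/hosts-style text to its address."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        address, names = fields[0], fields[1:]
        for name in names:
            # Keep the first address seen for a name (an assumption of this sketch).
            table.setdefault(name, address)
    return table


def missing_names(text, wanted):
    """Return the names in `wanted` that the hosts text does not define."""
    table = parse_hosts(text)
    return [name for name in wanted if name not in table]


# Sample lines copied from the tisa2-blue /etc/hosts quoted in this thread.
TISA2_BLUE_HOSTS = """\
127.0.0.1 localhost
#192.168.18.115 bc206 bc206.cluster.net
198.119.135.140 tisa2-blue.larc.nasa.gov tisa2-blue
192.168.18.1 coil-blue.cluster.net coil-blue
"""

if __name__ == "__main__":
    print(missing_names(TISA2_BLUE_HOSTS, ["tisa2-blue", "bb101", "bb102"]))
```

If a name is missing from the file and DNS does not supply it either, the
nodes may have no way to find each other, which would be worth ruling out
before looking further at mpdboot itself.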
Third, following up on what Pavan said, if your home directory is
accessible from all nodes [via GPFS, NFS, or another shared
file system], you can install MPICH2 there from the head node
and use it on the compute nodes as well.
This recent thread may help:
http://lists.mcs.anl.gov/pipermail/mpich-discuss/2012-July/012752.html
BTW, there is a Wiki page for the current MPICH2 launcher, hydra:
http://wiki.mcs.anl.gov/mpich2/index.php/Using_the_Hydra_Process_Manager
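
As far as I know, Hydra's host file uses the same host:count lines you
already have in ~/mpd.hosts. Here is a rough Python sketch that totals the
process counts and builds an mpiexec argument list with the -f and -n options.
The names count_slots and hydra_command and the program ./a.out are
placeholders of mine, and the sketch only builds the command, it does not
launch anything.

```python
def count_slots(hostfile_text):
    """Sum the process counts in 'host:count' lines; a bare host counts as 1."""
    total = 0
    for line in hostfile_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        _host, _, count = line.partition(":")
        total += int(count) if count else 1
    return total


def hydra_command(hostfile_path, hostfile_text, program):
    """Build an mpiexec argument list: host file via -f, process count via -n."""
    return ["mpiexec", "-f", hostfile_path,
            "-n", str(count_slots(hostfile_text)), program]


if __name__ == "__main__":
    hosts = "bb101:4\nbb102:4\n"  # the two lines described for ~/mpd.hosts
    print(" ".join(hydra_command("mpd.hosts", hosts, "./a.out")))
```

With the two lines from your mpd.hosts this prints
mpiexec -f mpd.hosts -n 8 ./a.out, with no mpdboot step beforehand.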
I hope this helps,
Gus Correa
On 07/06/2012 08:56 AM, Pavan Balaji wrote:
>
> Installing MPICH2 does not require root privileges. So you can download
> it and install the latest version in your home directory, if you don't
> want to wait for your system administrator.
>
> -- Pavan
>
> On 07/06/2012 07:51 AM, Jeff Hammond wrote:
>> Hi Myles,
>>
>> Given
>> http://wiki.mcs.anl.gov/mpich2/index.php/Frequently_Asked_Questions#Q:_I_don.27t_like_.3CWHATEVER.3E_about_mpd.2C_or_I.27m_having_a_problem_with_mpdboot.2C_can_you_fix_it.3F,
>>
>> my guess is that you'll have to wait for Hydra-capable MPICH2.
>>
>> As frustrating as it may be to have to wait for the solution, I'm sure
>> it improves the quality of life of the MPICH2 developers not to have to
>> revisit deprecated code that has been replaced with something that
>> almost certainly solves your problem without any additional effort.
>>
>> Best,
>>
>> Jeff
>>
>> On Thu, Jul 5, 2012 at 9:26 AM, BAKER, MYLES D. (LARC-E302)
>> <myles.d.baker at nasa.gov> wrote:
>>> Dear List,
>>>
>>> NOTE: This is my first list post, so please let me know how to write more
>>> effectively and honor the etiquette of the community.
>>>
>>> I am having trouble launching mpd after successful past boots. My system
>>> administrator is updating to a hydra-capable mpich but for now I'm stuck
>>> with mpd and mpich2-1.2.1p1. Here is some information about my system:
>>>
>>> 1. Multi-node POWER6 cluster using General Parallel File System (GPFS).
>>> 2. Master is an interactive node (tisa2-blue); slaves are dedicated compute
>>> nodes (bb101, bb102).
>>> 3. ~/.mpd.conf file contains MPD_SECRETWORD=password for each node (master
>>> and slaves).
>>> 4. ~/mpd.hosts file contains bb101:4\n bb102:4. Once again, this file is
>>> accessible by all nodes.
>>> 5. Passwordless ssh between all nodes (3-node, fully connected digraph).
>>> 6. The /etc/hosts files on the respective machines are configured correctly
>>> (i.e., no 127.0.1.1 for the slave nodes, etc.; I have added them to the
>>> bottom of the post in case you need to reference them).
>>>
>>> Given the above setup, when I run the command mpdboot -v -n 3 -f ~/mpd.hosts
>>> on tisa2-blue, I get the following output:
>>> running mpdallexit on tisa2-blue
>>> LAUNCHED mpd on tisa2-blue via
>>>
>>> [2] Done mpd
>>> RUNNING: mpd on tisa2-blue
>>> LAUNCHED mpd on bb101 via tisa2-blue
>>> LAUNCHED mpd on bb102 via tisa2-blue
>>>
>>> Next, the terminal window waits, and I get the following error:
>>> mpdboot_tisa2-blue (handle_mpd_output 406): failed to handshake with mpd on bb101; recvd output={}
>>>
>>> If I use ^C to end I get the following error:
>>> mpdboot_tisa2-blue (recv_dict_msg 582):recv_dict_msg: errmsg=::
>>> mpdtb:
>>> /SPG_ops/utils/ppc64/mpich2-1.2.1p1/bin/mpdlib.py, 582, recv_dict_msg
>>> /usr/local/bin/mpdboot, 404, handle_mpd_output
>>> /usr/local/bin/mpdboot, 347, mpdboot
>>> /usr/local/bin/mpdboot, 476, ?
>>>
>>> mpdboot_tisa2-blue (handle_mpd_output 406): failed to handshake with mpd on bb101; recvd output={}
>>> mpdboot_tisa2-blue: failure doing recv exceptions.KeyboardInterrupt :: 0
>>>
>>> So, at this point I'm not sure what I can do to fix this. I have looked up
>>> the error codes and I don't think that I have done anything wrong. Can
>>> anyone give me some guidance or ideas on how to fix this issue?
>>>
>>> Thank you so much!
>>> Myles Baker
>>>
>>> tisa2-blue: /etc/hosts
>>> 127.0.0.1 localhost
>>> #192.168.18.115 bc206 bc206.cluster.net
>>> #192.168.18.115 tisa2-blue.larc.nasa.gov
>>> # made sure the larc-facing address for tisa2-blue is here and
>>> # uncommented... w/o it, daacget commands fail because it reverse
>>> # lookup's and gets the AMI interface name and it doesn't match. crjones, 07/03/12
>>> 198.119.135.140 tisa2-blue.larc.nasa.gov tisa2-blue
>>> 192.168.18.130 magneto.cluster.net magneto.magneto
>>> 198.119.135.180 snfsmdc1.larc.nasa.gov snfsmdc1
>>> 198.119.135.181 snfsmdc2.larc.nasa.gov snfsmdc2
>>> 192.168.18.162 bk17.cluster.net bk17
>>> 192.168.18.164 bk21.cluster.net bk21
>>> 192.168.18.207 ab01-p.cluster.net ab01-p
>>> 192.168.18.1 coil-blue.cluster.net coil-blue
>>> 192.168.18.2 nsd1.cluster.net nsd1
>>> 192.168.18.3 nsd2.cluster.net nsd2
>>>
>>> bb101: /etc/hosts
>>> 127.0.0.1 localhost
>>> 192.168.18.50 bb101 bb101.cluster.net
>>> 192.168.18.1 coil-blue coil-blue.cluster.net
>>> 192.168.18.2 nsd1 nsd1.cluster.net
>>> 192.168.18.5 ab3950.cluster.net ab3950
>>> 192.168.18.3 nsd2 nsd2.cluster.net
>>> 192.168.18.10 ba101 ba101.cluster.net
>>> 192.168.18.90 bc101 bc101.cluster.net
>>> 192.168.18.173 ab19 ab19.cluster.net
>>> 192.168.18.175 ac19 ac19.cluster.net
>>> 192.168.18.130 magneto.cluster.net magneto.magneto
>>> 198.119.135.180 snfsmdc1.larc.nasa.gov snfsmdc1
>>> 198.119.135.181 snfsmdc2.larc.nasa.gov snfsmdc2
>>> 192.168.18.162 bk17.cluster.net bk17
>>> 192.168.18.164 bk21.cluster.net bk21
>>> 192.168.18.207 ab01-p.cluster.net ab01-p
>>> 192.168.18.1 coil-blue.cluster.net coil-blue
>>> 192.168.18.2 nsd1.cluster.net nsd1
>>> 192.168.18.3 nsd2.cluster.net nsd2
>>>
>>> bb102: /etc/hosts
>>> 127.0.0.1 localhost
>>> 192.168.16.51 bb102m
>>> 192.168.18.51 bb102 bb102.cluster.net
>>> 192.168.18.130 magneto.cluster.net magneto.magneto
>>> 198.119.135.180 snfsmdc1.larc.nasa.gov snfsmdc1
>>> 198.119.135.181 snfsmdc2.larc.nasa.gov snfsmdc2
>>> 192.168.18.162 bk17.cluster.net bk17
>>> 192.168.18.164 bk21.cluster.net bk21
>>> 192.168.18.207 ab01-p.cluster.net ab01-p
>>> 192.168.18.1 coil-blue.cluster.net coil-blue
>>> 192.168.18.2 nsd1.cluster.net nsd1
>>> 192.168.18.3 nsd2.cluster.net nsd2
>>>
>>>
>>>
>>> BAKER, MYLES D. (LARC-E302)
>>> -----------------------------------------------------
>>> Mail Stop 420, B1250 R177
>>> myles.d.baker at nasa.gov
>>> LaRC Ext: x46393