[MPICH] Problem setting up a ring

Ralph Butler rbutler at mtsu.edu
Wed Apr 4 10:36:51 CDT 2007


First, I should point out that mpdallexit will probably fail if you
have not successfully debugged host/net config problems; use kill -9
on any leftover mpd processes instead.
Then, I note that you are trying mpdboot with 2 hosts.  But the manual
suggests that when there are multiple hosts, the config has to be
verified for all of them.  This includes running mpdcheck as server and
client on each, then reversing their roles, running mpd by hand on each,
etc.
Here is a typical blurb I send out as a reminder:

Sometimes there are problems with mpd or mpdboot while following
the Quick Start portion of the mpich2 install guide.  This typically
happens somewhere during Steps 10-13, but may occur during other
steps as well.  The guide suggests that when mpd/mpdboot problems
arise, you follow the procedures in Appendix A (Troubleshooting MPDs).

Section A.1 (Getting Started with MPD) provides a 7-step procedure
to follow to get one or more mpds working, first by hand, and
then via mpdboot.  However, some of the early steps begin with a
pre-MPD program called mpdcheck.  That program is designed to help
determine in advance if there will be problems associated with host
or network configuration.  The instructions in section A.1 suggest
first using mpdcheck on individual machines, and then pair-wise.
It is particularly important to try the pair-wise experiments where
one machine plays the role of the server and the other the client,
and then to reverse the roles.
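
To make the server/client idea concrete, here is a rough Python sketch
of the kind of handshake the pair-wise experiment exercises.  It is not
mpdcheck's actual code: the function names (serve_once, client_check,
loopback_check) are made up for illustration, and only the two message
strings are borrowed from real mpdcheck output.  It assumes plain TCP
over IPv4 and that each short message arrives in a single recv.

```python
import socket
import threading

# Message strings as they appear in mpdcheck's output; everything else
# in this sketch is hypothetical.
HELLO = b"hello_from_client_to_server"
ACK = b"ack_from_server_to_client"


def serve_once(listener, result):
    """Accept one connection, record the greeting, and reply with an ack."""
    conn, peer = listener.accept()
    with conn:
        result["msg"] = conn.recv(64)
        result["peer"] = peer
        conn.sendall(ACK)


def client_check(host, port, timeout=5.0):
    """Connect to host:port, send the greeting, and return the reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(HELLO)
        return sock.recv(64)


def loopback_check(host="127.0.0.1"):
    """Play both roles on one machine; return (server saw, client saw)."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.settimeout(5.0)          # do not wait forever for a client
    listener.bind((host, 0))          # port 0: let the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]
    result = {}
    server = threading.Thread(target=serve_once, args=(listener, result))
    server.start()
    try:
        reply = client_check(host, port)
    finally:
        server.join()
        listener.close()
    return result.get("msg"), reply


if __name__ == "__main__":
    seen_by_server, seen_by_client = loopback_check()
    print("server received:", seen_by_server.decode())
    print("client received:", seen_by_client.decode())
```

For a two-machine test in this style, the listening half would run on
one host and client_check(host, port) on the other, and then the roles
would be swapped.  A connection that succeeds in one direction but not
the other is the sort of asymmetry the role reversal is meant to catch.
This is only an illustration; use mpdcheck itself for real diagnosis.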

Sometimes the procedures in A.1 indicate that MPDs are not likely
to run on your systems due to problems with host and/or network
configuration.  At those points, you are referred to subsequent
sections, e.g. A.2 Debugging host/network configuration problems,
or A.3 Firewalls, etc.
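
As a rough illustration of one kind of host-configuration check, the
Python sketch below reports what a name resolves to and flags the case
where it resolves only to loopback addresses, which other machines
cannot use to reach the host.  The helper name resolve_report is
hypothetical, and this is an assumption about what is worth checking,
not a description of what mpdcheck does internally.

```python
import ipaddress
import socket


def resolve_report(name):
    """Resolve name to IPv4 addresses and note whether all are loopback.

    Hypothetical helper for illustration only.
    """
    report = {"name": name, "addresses": [], "loopback_only": False,
              "error": None}
    try:
        _, _, addrs = socket.gethostbyname_ex(name)
    except OSError as exc:            # covers socket.gaierror and herror
        report["error"] = str(exc)
        return report
    report["addresses"] = addrs
    report["loopback_only"] = bool(addrs) and all(
        ipaddress.ip_address(a).is_loopback for a in addrs
    )
    return report


if __name__ == "__main__":
    # Check this machine's own name as other hosts would have to see it.
    print(resolve_report(socket.gethostname()))
```

Running it on each machine, both for the local host name and for the
names listed in mpd.hosts, gives a quick picture of whether every host
can resolve every other host to a routable address.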


On Wed Apr 4, at 9:41AM, Brett Gordon wrote:

> Hi Ralph,
>
> Thanks for your response.
>
> I did as you suggested, and it seems to work, but I still can't get
> the ring running.
>
> Terminal 1:
> brgordon at veritas:~> mpdallexit    //Just to make sure nothing else is running
> mpdallexit: cannot connect to local mpd (/tmp/mpd2.console_brgordon);
> possible causes:
>  1. no mpd is running on this host
>  2. an mpd is running but was started without a "console" (-n option)
> In case 1, you can start an mpd on this host with:
>    mpd &
> and you will be able to run jobs just on this host.
> For more details on starting mpds on a set of hosts, see
> the MPICH2 Installation Guide.
> brgordon at veritas:~> mpdcheck
> brgordon at veritas:~> mpdcheck -s
> server listening at INADDR_ANY on: veritas 23768
> server has conn on <socket._socketobject object at 0x2aaaaab42650>
> from ('128.2.93.142', 25125)
> server successfully recvd msg from client: hello_from_client_to_server
> brgordon at veritas:~>
>
> Terminal 2:
> brgordon at veritas:~> mpdcheck -c veritas 23768
> client successfully recvd ack from server: ack_from_server_to_client
> brgordon at veritas:~>
>
> I then tried to run mpdboot from the computer 'veritas', hoping to
> bring up a ring with 'veritas' and 'elaine', and got the following:
>
> brgordon at veritas:~> mpdboot -n 2 -f mpd.hosts --user=brgordon --verbose --chkup
> checking elaine
> there are 2 hosts up (counting local)
> running mpdallexit on veritas
> LAUNCHED mpd on veritas  via
> RUNNING: mpd on veritas
> LAUNCHED mpd on elaine  via  veritas
> mpdboot_veritas (handle_mpd_output 383): failed to connect to mpd on elaine
>
> brgordon at veritas:~> less mpd.hosts
> elaine
> brgordon at veritas:~> less .mpd.conf
> secretword=<my password>
>
> Same files exist on 'elaine', but the host is listed as 'veritas'.
>
> Thanks,
> Brett
>
>
>
> On 4/4/07, Ralph Butler <rbutler at mtsu.edu> wrote:
>> In your output for mpdcheck below, there are lots of jumbled error
>> msgs from an mpd as well.  Apparently you had started an mpd in that
>> same window at some point.  Anyway, it is best to make sure that all
>> mpd processes are killed before doing the mpdcheck.  Then, try it
>> again.  As the manual suggests, it is pointless to try using mpdboot
>> unless you have cleared up all issues first.  Even starting an mpd
>> ring by hand is only recommended after successfully debugging with
>> mpdcheck.  So, I suggest trying mpdcheck again, first with no
>> options.  Then, with -s in one window and -c in another.
>>
>> On Tue Apr 3, at 10:34PM, Brett Gordon wrote:
>>
>> > Hello,
>> >
>> > I have successfully installed mpich2-1.0.5 on two linux boxes. Both
>> > succeed in the standard tests involving one host solving the 'cpi'
>> > program.
>> >
>> > However, I'm running into two (probably related) problems:
>> >
>> > 1) When I try to run mpd as a server and client on the same computer
>> > (as on page 31 of the install documentation), I get the following:
>> >
>> > brgordon at veritas:~> mpdcheck -s
>> > server listening at INADDR_ANY on: veritas 23761
>> > brgordon at veritas:~> mpdcheck -c veritas 23761
>> > veritas_23761 (recv_dict_msg 549):recv_dict_msg: errmsg=:invalid
>> > literal for int(): hello_fr:
>> >  mpdtb:
>> >    /home/brgordon/mpich2-install/bin/mpdlib.py,  549,  recv_dict_msg
>> >    /home/brgordon/mpich2-install/bin/mpdlib.py,  989,
>> > handle_ring_listener_connection
>> >    /home/brgordon/mpich2-install/bin/mpdlib.py,  743,
>> > handle_active_streams
>> >    /home/brgordon/mpich2-install/bin/mpd,  286,  runmainloop
>> >    /home/brgordon/mpich2-install/bin/mpd,  255,  run
>> >    /home/brgordon/mpich2-install/bin/mpd,  1470,  ?
>> >
>> > veritas_23761 (handle_ring_listener_connection 993): INVALID msg from
>> > new connection :('128.2.93.142', 16587): msg=:{}:
>> > Traceback (most recent call last):
>> >  File "/home/brgordon/mpich2-install/bin/mpdcheck", line 105, in ?
>> >    msg = sock.recv(64)
>> > socket.error: (104, 'Connection reset by peer')
>> >
>> > 2) I also can't get a ring to work. I have set up ssh to work without
>> > using passwords ('ssh veritas date' works fine). The workaround for
>> > mpdboot on page 9 of the install doc does not work for me, nor does
>> > running 'mpdcheck -f mpd.hosts -ssh'.
>> >
>> > When I try to run mpdboot, I get
>> > brgordon at elaine ~]$ mpdboot -n 2 -f mpd.hosts
>> > mpdboot_elaine.tepper.cmu.edu (handle_mpd_output 383): failed to
>> > connect to mpd on veritas
>> >
>> >
>> > I feel like I'm getting close to having this working, so I would
>> > greatly appreciate any help. Please let me know if there is more
>> > information I can provide.
>> >
>> > Thanks,
>> > Brett
>> >
>>
>>



