Error during mpdboot

Rusty Lusk lusk at mcs.anl.gov
Tue May 31 11:55:19 CDT 2005


From: Bruno Nyffeler <bruno.nyffeler at isb-sib.ch>
Subject: Error during mpdboot
Date: Tue, 31 May 2005 11:39:21 +0200

> Hello
> 
> I have been installing MPICH2 on several clusters, and it worked
> flawlessly on almost all machines. Only on one small cluster (Red Hat
> 7.3) does mpdboot fail with an error. So I tried to start mpd directly
> on two of the machines (as described in the installer's guide), using
> something like:
> 
> host1> mpd -e &
> 47684
> host2> mpd -h host1 -p 47684
> 
> The server (on host1) prints this message:
> 
>     host1_47684 (_handle_new_connection 940): INVALID msg from new
>       connection :('10.255.255.254', 50674): msg=:{}:
> 
> while mpd on host2 quits with:
> 
>     host2_50673 failed ; cause: invalid challenge msg: {}
>         traceback: [('/tmp/mpich2/mpich2-1.0.1/bin/mpd', '1158', '_enter_existing_ring'),
>                     ('/tmp/mpich2/mpich2-1.0.1/bin/mpd', '175', '_mpd_init'),
>                     ('/tmp/mpich2/mpich2-1.0.1/bin/mpd', '1398', '?')]
> 
> 
> I also tried the following using mpdcheck:
> 
> host1> mpdcheck -s
> server listening at INADDR_ANY on: host1 47700
> 
> host2> mpdcheck -c host1 47700
> 
> after which the server on host1 exits with:
> 
>     server has conn on <socket object, fd=5, family=2, type=1,
>       protocol=0> from ('10.255.255.254', 50677)
>     server successfully recvd msg from client: hello_from_client_to_server
> 
> and the client on host2 exits with:
> 
>     client successfully recvd ack from server: ack_from_server_to_client
> 
> The machines are based on Intel Xeon CPUs and the OS is Red Hat 7.3,
> kernel 2.4.18-27.
> Do you have any ideas about this?
> 
> Thank you in advance,
> Bruno Nyffeler

We are at least temporarily stumped by this, since there is a clear
problem (an unexpectedly empty challenge message), yet mpdcheck seems
to report that everything is fine.

Here are two things to try.

1.  Run the mpdcheck program with the roles of client and server
    reversed.
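
    For example, using the same hostnames as in your test above (the
    port number printed by the server will differ from run to run):

      host2> mpdcheck -s
      server listening at INADDR_ANY on: host2 <port>

      host1> mpdcheck -c host2 <port>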

2.  Try the new version of mpd that will be in our next release.
    Actually, it is already present in the release you have, but you
    have to configure for it specially:

      configure --with-pm=rmpd ....

    That will at least exercise a different batch of code, possibly
    illuminating the problem.
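
    Concretely, a rebuild from your source tree would look roughly like
    this (the install prefix is a placeholder; reuse whatever configure
    options you passed the first time):

      host1> cd /tmp/mpich2/mpich2-1.0.1
      host1> ./configure --with-pm=rmpd --prefix=<install-dir>
      host1> make
      host1> make install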

Regards,
Rusty Lusk



