[MPICH] Problems running mpd with n > mpd processes

Ralph M. Butler rbutler at mtsu.edu
Tue Sep 20 10:32:01 CDT 2005


Hi Tony:

These kinds of problems pop up fairly often.  They are generally due to
the 'networking setup' as you say.  I am attaching a new copy of the
install.pdf file in case it contains useful scoop which you may not
have.  The appendix is about debugging MPD problems and networking
problems associated with getting MPDs up and running on clusters.  In
particular, it guides you through the use of a program named mpdcheck.py
which comes with mpd.  But, I am also attaching a new copy of mpdcheck
just to be sure all the parts match.  Getting these kinds of things
going can be a bit of a hassle.  Your particular problem is somewhat
exacerbated by the fact that you apparently have a head node with 2
interfaces and the 'second' one is the one on the net with the rest of
the cluster.  I do not mean to imply that this makes things impossible;
it just adds to the confusion a bit.
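
The loopback address in the "failed to connect to lhs at 127.0.0.1" message is the sort of thing mpdcheck looks for. As a rough illustration only (this is not part of mpd, it is written for Python 3 rather than the older Python the attached script targets, and the function name loopback_addresses is made up for this sketch), a check of whether a host name resolves to a loopback address might look like this:

```python
import ipaddress
import socket


def loopback_addresses(hostname):
    """Return the IPv4 addresses `hostname` resolves to that are loopback.

    Returns None if the name cannot be resolved at all.
    (Hypothetical helper for illustration; not an mpd function.)
    """
    try:
        _name, _aliases, addrs = socket.gethostbyname_ex(hostname)
    except OSError:  # covers socket.gaierror and socket.herror
        return None
    return [a for a in addrs if ipaddress.ip_address(a).is_loopback]


if __name__ == "__main__":
    # Check both the unqualified and the fully qualified name of this host.
    for name in (socket.gethostname(), socket.getfqdn()):
        bad = loopback_addresses(name)
        if bad is None:
            print("*** could not resolve %s" % name)
        elif bad:
            print("*** %s resolves to loopback address(es): %s" % (name, bad))
        else:
            print("%s does not resolve to a loopback address" % name)
```

If a host reports a loopback address for its own name, other hosts that are handed that address will try to connect to themselves, which would be consistent with a "Connection refused" symptom.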

If you get past the mpdcheck stuff and are still having some
problems which you suspect are due to the separate interfaces, I
can probably run a couple of demos on a similarly configured net
here and show you what I had to do.
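
To separate interface problems from name-resolution problems, it can help to test one specific address directly, in the spirit of the mpdcheck -s / -c pair. The following is only a sketch under assumptions: it is Python 3, the helper names serve_once, check_connect and roundtrip are invented here, and it runs server and client in one process for brevity. On a real cluster you would start the server on the head node bound to the cluster-side address (192.168.1.1 in Tony's message) and run the client from a compute node.

```python
import socket
import threading


def serve_once(bind_addr, ready, result):
    """Listen on bind_addr (anonymous port), accept one client, send an ack."""
    with socket.socket() as lsock:
        lsock.bind((bind_addr, 0))          # port 0: let the OS pick a port
        lsock.listen(1)
        result["port"] = lsock.getsockname()[1]
        ready.set()                          # tell the caller the port is known
        conn, _peer = lsock.accept()
        with conn:
            result["msg"] = conn.recv(64)
            conn.sendall(b"ack_from_server_to_client")


def check_connect(addr, port, timeout=5.0):
    """Connect to addr:port, send a greeting, and return the server's reply."""
    with socket.create_connection((addr, port), timeout=timeout) as sock:
        sock.sendall(b"hello_from_client_to_server")
        return sock.recv(64)


def roundtrip(addr):
    """Run server and client against one address; return (received, ack)."""
    ready = threading.Event()
    result = {}
    server = threading.Thread(target=serve_once,
                              args=(addr, ready, result), daemon=True)
    server.start()
    if not ready.wait(5):
        raise RuntimeError("server could not listen on %s" % addr)
    ack = check_connect(addr, result["port"])
    server.join(5)
    return result["msg"], ack


if __name__ == "__main__":
    # 127.0.0.1 is only a placeholder; substitute the interface address
    # you want to test (an assumption to adapt to your own network).
    print(roundtrip("127.0.0.1"))
```

Binding to a single address rather than INADDR_ANY means the test fails if that interface is not the one the other hosts can actually reach.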

--ralph

> Date: Mon, 19 Sep 2005 16:58:56 -0400
> From: Tony Keating <akeating at eng.umd.edu>
> To: mpich-discuss at mcs.anl.gov
> Subject: [MPICH] Problems running mpd with n > mpd processes
>
> Hi,
>
> I'm trying to get mpd up and running on a small (2 dual processor) cluster.
>
> I have it working fine with one process per mpd process (per box),
> but I'm having difficulties when running two processes per mpd
> process. Here is some more info:
>
> On the head node:
>
> ~# mpd --ifhn=192.168.1.1
> barolo.umd.edu_mpdman_2: conn error in connect_lhs: Connection refused
> barolo.umd.edu_mpdman_2: conn error in connect_lhs: Connection refused
> barolo.umd.edu_mpdman_2: conn error in connect_lhs: Connection refused
> barolo.umd.edu_mpdman_2: conn error in connect_lhs: Connection refused
> barolo.umd.edu_mpdman_2: conn error in connect_lhs: Connection refused
> barolo.umd.edu_mpdman_2: conn error in connect_lhs: Connection refused
> barolo.umd.edu_mpdman_2: conn error in connect_lhs: Connection refused
> barolo.umd.edu_mpdman_2: conn error in connect_lhs: Connection refused
> barolo.umd.edu_mpdman_2 (connect_lhs 542): failed to connect to lhs at
> 127.0.0.1 33093
> barolo.umd.edu_mpdman_2 (run 172): lhs connect failed
>
> I tried running 2 processes which works fine, then with four things just
> hang and I get the above errors and need to press ctrl-C to break out:
>
> ~# mpdrun -n 2 hostname
> barolo.umd.edu
> c01
> ~# mpdrun -n 4 hostname
> mpdrun_barolo.umd.edu (mpdrun 276): mpdrun: failed to obtain sock from
> manager
>
> On the other node (c01)
>
> ~# mpd -h barolo.umd.edu -p 33450
> c01_mpdman_3 (connect_lhs 554): invalid challenge from 192.168.1.1 33471: {}
> c01_mpdman_3 (run 155): lhs connect failed
>
> Anybody have any ideas? I have a feeling it has to do with the
> networking setup here, but I'm not 100% sure how to fix it.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: install.pdf
Type: application/pdf
Size: 216007 bytes
Desc: 
URL: <http://lists.mcs.anl.gov/pipermail/mpich-discuss/attachments/20050920/f608de06/attachment.pdf>
-------------- next part --------------
#!/usr/bin/env python
#
#   (C) 2001 by Argonne National Laboratory.
#       See COPYRIGHT in top-level directory.
#

"""
mpdcheck

This script is a work in progress and may change frequently as we work
with users and gain additional insights into how to improve it.

This script prints useful information about the host on which it runs.
It is here to help us help users detect problems with configurations of
their computers.  For example, some computers are configured to think
of themselves simply as 'localhost' with 127.0.0.1 as the IP address.
This might present problems if a process on that computer wishes to
identify itself by host and port to a process on another computer.
The process on the other computer would try to contact 'localhost'.

If you are having problems running parallel jobs via mpd on one or more
hosts, you might try running this script once on each of those hosts.

Any output with *** at the beginning indicates a potential problem
that you may have to resolve before being able to run parallel jobs
via mpd.

For help:
    mpdcheck -h (or --help)
        prints this message

In the following modes, the -v (verbose) option provides info about what
mpdcheck is doing; the -l (long messages) option causes long informational
messages to print in situations where problems are spotted.

The major modes of operation for this program are:

    mpdcheck
        looks for config problems on 'this' host; prints as necessary

    mpdcheck -pc
        print config info about 'this' host, e.g. contents of /etc/hosts, etc.

    mpdcheck -f some_file [-ssh]
        prints info about 'this' host and locatability info about the ones
        listed in some_file as well (note the file might be mpd.hosts);
        the -ssh option can be used in conjunction with the -f option to
        cause ssh tests to be run to each remote host

    mpdcheck -s
        runs this program as a server on one host
    mpdcheck -c server_host server_port
        runs a client on another (or same) host; connects to the specified
        host/port where you previously started the server
"""
from time import ctime
__author__ = "Ralph Butler and Rusty Lusk"
__date__ = ctime()
__version__ = "$Revision: 1.18 $"
__credits__ = ""

import re

from  sys      import argv, exit, stdout
from  os       import path, kill, system
from  signal   import SIGKILL
from  socket   import gethostname, getfqdn, gethostbyname_ex, gethostbyaddr, socket
from  popen2   import Popen3
from  select   import select, error
from  commands import getoutput


if __name__ == '__main__':    # so I can be imported by pydoc
    do_ssh = 0
    fullDirName = path.abspath(path.split(argv[0])[0])  # normalize
    hostsFromFile = []
    verbose = 0
    long_messages = 0
    argidx = 1
    while argidx < len(argv):
        if argv[argidx] == '-h'  or argv[argidx] == '--help':
            print __doc__
            exit(0)
        elif argv[argidx] == '-s':
            lsock = socket()
            lsock.bind(('',0)) # anonymous port
            lsock.listen(5)
            print "server listening at INADDR_ANY on: %s %s" % (gethostname(),lsock.getsockname()[1])
            stdout.flush()
            (tsock,taddr) = lsock.accept()
            print "server has conn on %s from %s" % (tsock,taddr)
            msg = tsock.recv(64)
            if not msg:
                print "*** server failed to recv msg from client"
            else:
                print "server successfully recvd msg from client: %s" % (msg)
            tsock.sendall('ack_from_server_to_client')
            tsock.close()
            lsock.close()
            exit(0)
        elif argv[argidx] == '-c':
            sock = socket()
            sock.connect((argv[argidx+1],int(argv[argidx+2])))  # note double parens
            sock.sendall('hello_from_client_to_server')
            msg = sock.recv(64)
            if not msg:
                print "*** client failed to recv ack from server"
            else:
                print "client successfully recvd ack from server: %s" % (msg)
                stdout.flush()
            sock.close()
            exit(0)
        elif argv[argidx] == '-pc':
            print "--- print results of: gethostbyname_ex(gethostname())"
            print gethostbyname_ex(gethostname())
            print "--- try to run /bin/hostname"
            linesAsStr = getoutput("/bin/hostname")
            print linesAsStr
            print "--- try to run uname -a"
            linesAsStr = getoutput("/bin/uname -a")
            print linesAsStr
            print "--- try to print /etc/hosts"
            linesAsStr = getoutput("/bin/cat /etc/hosts")
            print linesAsStr
            print "--- try to print /etc/resolv.conf"
            linesAsStr = getoutput("/bin/cat /etc/resolv.conf")
            print linesAsStr
            print "--- try to run /sbin/ifconfig -a"
            linesAsStr = getoutput("/sbin/ifconfig -a")
            print linesAsStr
            print "--- try to print /etc/nsswitch.conf"
            linesAsStr = getoutput("/bin/cat /etc/nsswitch.conf")
            print linesAsStr
            exit(0)
        elif argv[argidx] == '-v':
            verbose = 1
            argidx += 1
        elif argv[argidx] == '-l':
            long_messages = 1
            argidx += 1
        elif argv[argidx] == '-f':
            try:
                hostsFile = open(argv[argidx+1])
            except:
                print 'unable to open file ', argv[argidx+1]
                exit(-1)
            for line in hostsFile:
                line = line.rstrip()
                if not line  or  line[0] == '#':
                    continue
                splitLine = re.split(r'\s+',line)
                host = splitLine[0]
                if ':' in host:
                    (host,ncpus) = host.split(':')
                hostsFromFile.append(host)
            argidx += 2
        elif argv[argidx] == '-ssh':
            do_ssh = 1
            argidx += 1
        else:
            print 'unrecognized arg:', argv[argidx]
            exit(0)
    
    
    # See if we can do gethostXXX, etc. for this host
    if verbose:
        print 'obtaining hostname via gethostname and getfqdn'
    uqhn1 = gethostname()
    fqhn1 = getfqdn()
    if verbose:
        print "gethostname gives ", uqhn1
        print "getfqdn gives ", fqhn1
    if verbose:
        print 'checking out unqualified hostname; make sure is not "localhost", etc.'
    if uqhn1.startswith('localhost'):
        if long_messages:
            msg = """
            **********
            The unqualified hostname seems to be localhost. This generally
            means that the machine's hostname is not set. You may change
            it by using the 'hostname' command, e.g.:
                hostname mybox1
            However, this will not persist after a reboot. To make the
            change permanent, consult the operating system's documentation. On
            Debian Linux systems, this can be done by:
                echo "mybox1" > /etc/hostname
            **********
            """
        else:
            msg = "*** the uq hostname seems to be localhost"
        print msg.strip().replace('        ','')
    elif uqhn1 == '':
        if long_messages:
            msg = """
            **********
            The unqualified hostname seems to be blank. This generally
            means that the machine's hostname is not set. You may change
            it by using the 'hostname' command, e.g.:
                hostname mybox1
            However, this will not persist after a reboot. To make the
            change permanent, consult the operating system's documentation. On
            Debian Linux systems, this can be done by:
                echo "mybox1" > /etc/hostname
            **********
            """
        else:
            msg = "*** the uq hostname seems to be blank"
        print msg.replace('        ','')
    if verbose:
        print 'checking out qualified hostname; make sure is not "localhost", etc.'
    if fqhn1.startswith('localhost'):
        if long_messages:
            msg = """
            **********
            Your fully qualified hostname seems to be set to 'localhost'.
            This generally means that your machine's /etc/hosts file contains a line
            similar to this:
                127.0.0.1 mybox1 localhost.localdomain localhost
            You probably want to remove your hostname from this line and place it on
            a line by itself with your ipaddress, like this:
                $ipaddr mybox1
            **********
            """
        else:
            msg =  "*** the fq hostname seems to be localhost"
        print msg.rstrip().replace('        ','')
    elif fqhn1 == '':
        if long_messages:
            msg = """
            **********
            Your fully qualified hostname seems to be blank.
            **********
            """
        else:
            msg = "*** the fq hostname is blank"
        print msg.replace('        ','')
    
    if verbose:
        print 'obtain IP addrs via qualified and unqualified hostnames;',
        print ' make sure other than 127.0.0.1'
    uipaddr1 = 0
    try:
        ghbnu = gethostbyname_ex(uqhn1)
        if verbose:
            print "gethostbyname_ex: ", ghbnu
        uipaddr1 = ghbnu[2][0]
        if uipaddr1.startswith('127'):
            if long_messages:
                msg = """
                **********
                Your unqualified hostname resolves to 127.0.0.1, which is
                the IP address reserved for localhost. This likely means that
                you have a line similar to this one in your /etc/hosts file:
                127.0.0.1   $uqhn
                This should perhaps be changed to the following:
                127.0.0.1   localhost.localdomain localhost
                **********
                """
            else:
                msg = "*** first ipaddr for this host (via %s) is: %s" % (uqhn1,uipaddr1)
            print msg.replace('            ','')
        try:
            ghbau = gethostbyaddr(uipaddr1)
        except:
            print "*** gethostbyaddr failed for this host's IP %s" % (uipaddr1)
    except:
        if long_messages:
            msg = """
            **********
            The system call gethostbyname(3) failed to resolve your
            unqualified hostname, or $uqhn. This can be caused by
            missing info from your /etc/hosts file or your system not
            having correctly configured name resolvers, or by your IP 
            address not existing in resolution services.
            If you run DNS, you may wish to make sure that your
            DNS server has the correct forward A record set up for your machine's
            hostname. If you are not using DNS and are only using hosts
            files, please check that a line similar to the one below exists
            in your /etc/hosts file:
                $ipaddr $uqhn
            If you plan to use DNS but you are not sure that it is
            correctly configured, please check that the file /etc/resolv.conf
            contains entries similar to the following:
                nameserver 1.2.3.4
            where 1.2.3.4 is an actual IP of one of your nameservers.
            **********
            """
        else:
            msg = "*** gethostbyname_ex failed for this host %s" % (uqhn1)
        print msg.replace('        ','')
    
    fipaddr1 = 0
    try:
        ghbnf = gethostbyname_ex(fqhn1)
        if verbose:
            print "gethostbyname_ex: ", ghbnf
        fipaddr1 = ghbnf[2][0]
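        if fipaddr1.startswith('127'):
            # report the problem; previously this message was built but never printed
            if long_messages:
                msg = """
                **********
                Your fully qualified hostname resolves to 127.0.0.1, which
                is the IP address reserved for localhost. This likely means
                that you have a line similar to this one in your /etc/hosts file:
                127.0.0.1   $fqhn
                This should perhaps be changed to the following:
                127.0.0.1   localhost.localdomain localhost
                **********
                """
            else:
                msg = "*** first ipaddr for this host (via %s) is: %s" % (fqhn1,fipaddr1)
            print msg.replace('            ','')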
        try:
            ghbaf = gethostbyaddr(fipaddr1)
        except:
            print "*** gethostbyaddr failed for this host's IP %s" % (fipaddr1)
    except:
        if long_messages:
            msg = """
            **********
            The system call gethostbyname(3) failed to resolve your
            fully qualified hostname, or $fqhn. This can be caused by
            missing info from your /etc/hosts file or your system not
            having correctly configured name resolvers, or by your IP 
            address not existing in resolution services.
            If you run DNS, please check and make sure that your
            DNS server has the correct forward A record set up for your
            machine's hostname. If you are not using DNS and are only using
            hosts files, please check that a line similar to the one below
            exists in your /etc/hosts file:
                $ipaddr $fqhn
            If you intend to use DNS but you are not sure that it is
            correctly configured, please check that the file /etc/resolv.conf
            contains entries similar to the following:
                nameserver 1.2.3.4
            where 1.2.3.4 is an actual IP of one of your nameservers.
            **********
            """
        else:
            msg = "*** gethostbyname_ex failed for host %s" % (fqhn1)
        print msg.replace('        ','')
    
    if verbose:
        print 'checking that IP addrs resolve to same host'
    if uipaddr1 and fipaddr1 and uipaddr1 != fipaddr1:
        msg = """
            **********
            Your fully qualified and unqualified names do not resolve to
            the same IP. This likely means that your DNS domain name is not
            set correctly.  This might be fixed by adding a line similar
            to the following to your /etc/hosts:
                 $ipaddr             $fqhn   $uqhn
            **********
            """
        print msg.replace('        ','')
    
    
    if verbose:
        print 'now do some gethostbyaddr and gethostbyname_ex for machines in hosts file'
    # See if we can do gethostXXX, etc. for hosts in hostsFromFile
    for host in hostsFromFile:
        uqhn2 = host
        fqhn2 = getfqdn(uqhn2)
        uipaddr2 = 0
        if verbose:
            print 'checking gethostbyXXX for unqualified %s' % (uqhn2)
        try:
            ghbnu = gethostbyname_ex(uqhn2)
            if verbose:
                print "gethostbyname_ex: ", ghbnu
            uipaddr2 = ghbnu[2][0]
            try:
                ghbau = gethostbyaddr(uipaddr2)
            except:
                print "*** gethostbyaddr failed for remote host's IP %s" % (uipaddr2)
        except:
            print "*** gethostbyname_ex failed for host %s" % (uqhn2)
        if verbose:
            print 'checking gethostbyXXX for qualified %s' % (fqhn2)
        try:
            ghbnf = gethostbyname_ex(fqhn2)
            if verbose:
                print "gethostbyname_ex: ", ghbnf
            fipaddr2 = ghbnf[2][0]
            if uipaddr2  and  fipaddr2 != uipaddr2:
                print "*** ipaddr via uqn (%s) does not match via fqn (%s)" % (uipaddr2,fipaddr2)
            try:
                ghbaf = gethostbyaddr(fipaddr2)
            except:
                print "*** gethostbyaddr failed for remote host's IP %s" % (fipaddr2)
        except:
            print "*** gethostbyname_ex failed for host %s" % (fqhn2)
    
    
    # see if we can run a simple command (/bin/echo) on remote hosts via ssh
    if not do_ssh:
        exit(0)
    
    for host in hostsFromFile:
        cmd = "ssh %s -x -n /bin/echo hello" % (host)
        if verbose:
            print 'trying: %s' % (cmd)
        runner = Popen3(cmd,1,0)
        runout = runner.fromchild
        runerr = runner.childerr
        runin  = runner.tochild
        runpid = runner.pid
        try:
            (readyFDs,unused1,unused2) = select([runout],[],[],9)
        except Exception, data:
            print 'select 1 error: %s ; %s' % ( data.__class__, data)
            exit(-1)
        if len(readyFDs) == 0:
            print '** ssh timed out to %s' % (host)
        line = ''
        failed = 0
        if runout in readyFDs:
            line = runout.readline()
            if not line.startswith('hello'):
                failed = 1
        else:
            failed = 1
        if failed:
            print '** ssh failed to %s' % (host)
            print '** here is the output:'
            if line:
                print line,
            done = 0
            fds = [runout,runerr]
            while not done:
                try:
                    (readyFDs,unused1,unused2) = select(fds,[],[],1)
                except Exception, data:
                    print 'select 2 error: %s ; %s' % ( data.__class__, data)
                    exit(-1)
                if runout in readyFDs:
                    line = runout.readline()
                    if line:
                        print line,
                    else:
                        fds.remove(runout)
                elif runerr in readyFDs:
                    line = runerr.readline()
                    if line:
                        print line,
                    else:
                        fds.remove(runerr)
                else:
                    done = 1
        try:
            kill(runpid,SIGKILL)
            runout.close()
            runerr.close()
            runin.close()
        except:
            pass
        if failed:
            exit(-1)
    
    # see if we can run mpdcheck on remote hosts
    for host in hostsFromFile:
        cmd1 = path.join(fullDirName,'mpdcheck.py') + ' -s'
        if verbose:
            print 'starting server: %s' % (cmd1)
        runner1 = Popen3(cmd1,1,0)
        runout1 = runner1.fromchild
        runerr1 = runner1.childerr
        runin1  = runner1.tochild
        runpid1 = runner1.pid
        try:
            (readyFDs,unused1,unused2) = select([runout1],[],[],9)
        except Exception, data:
            print 'select 3 error: %s ; %s' % ( data.__class__, data)
            exit(-1)
        if len(readyFDs) == 0:
            print '** timed out waiting for local server to produce output'
        line = ''
        failed = 0
        port = 0
        if runout1 in readyFDs:
            line = runout1.readline()
            if line.startswith('server listening at '):
                port = line.rstrip().split(' ')[-1]
            else:
                failed = 1
        else:
            failed = 1
        if failed:
            print 'could not start mpdcheck server'
            print 'here is the output:'
            if line:
                print line,
            done = 0
            fds = [runout1,runerr1]
            while not done:
                try:
                    (readyFDs,unused1,unused2) = select(fds,[],[],1)
                except Exception, data:
                    print 'select 4 error: %s ; %s' % ( data.__class__, data)
                    exit(-1)
                if runout1 in readyFDs:
                    line = runout1.readline()
                    if line:
                        print line,
                    else:
                        fds.remove(runout1)
                elif runerr1 in readyFDs:
                    line = runerr1.readline()
                    if line:
                        print line,
                    else:
                        fds.remove(runerr1)
                else:
                    done = 1
        if failed:
            try:
                kill(runpid1,SIGKILL)
            except:
                pass
            exit(-1)
        cmd2 = "ssh %s -x -n %s%smpdcheck.py -c %s %s" % (host,fullDirName,path.sep,fqhn1,port)
        if verbose:
            print 'starting client: %s' % (cmd2)
        runner2 = Popen3(cmd2,1,0)
        runout2 = runner2.fromchild
        runerr2 = runner2.childerr
        runin2  = runner2.tochild
        runpid2 = runner2.pid
        try:
            (readyFDs,unused1,unused2) = select([runout2],[],[],9)
        except Exception, data:
            print 'select 5 error: %s ; %s' % ( data.__class__, data)
            exit(-1)
        if len(readyFDs) == 0:
            print '** timed out waiting for client on %s to produce output' % (host)
        line = ''
        failed = 0
        port = 0
        if runout2 in readyFDs:
            line = runout2.readline()
            if not line.startswith('client successfully recvd'):
                failed = 1
        else:
            failed = 1
        if failed:
            print 'client on %s failed to access the server' % (host)
            print 'here is the output:'
            if line:
                print line,
            done = 0
            fds = [runout2,runerr2]
            while not done:
                try:
                    (readyFDs,unused1,unused2) = select(fds,[],[],1)
                except Exception, data:
                    print 'select 6 error: %s ; %s' % ( data.__class__, data)
                    exit(-1)
                if runout2 in readyFDs:
                    line = runout2.readline()
                    if line:
                        print line,
                    else:
                        fds.remove(runout2)
                elif runerr2 in readyFDs:
                    line = runerr2.readline()
                    if line:
                        print line,
                    else:
                        fds.remove(runerr2)
                else:
                    done = 1
        try:
            kill(runpid2,SIGKILL)
        except:
            pass
        if failed:
            try:
                kill(runpid1,SIGKILL)
            except:
                pass
            exit(-1)