[mpich-discuss] mvapich2 on multiple nodes: 2 problems

abc def cannonjunk at hotmail.co.uk
Wed Apr 21 08:31:34 CDT 2010


Hello,
I was wondering if someone can help me.

I have 3 quad-core computers that I would like to combine into a 12-core cluster.
At the moment, mvapich2 is installed on each of the machines, and we can successfully run 4-process simulations on each machine individually.
What we want to do is hook these machines together to run 1 simulation on all 12 cores.

However, I am having trouble doing this. I have followed the quick-setup guide in the installguide PDF (page 31 onwards):

mpdcheck -s (checking communication between 2 of the computers) - OK

The 3 computers are called Quad, November and December.
When I run:
mpd &
mpiexec -n 1 /bin/hostname

November hangs, but December and Quad are fine. So my first question is: why does this hang? Running "/bin/hostname" directly, without mpiexec, works on all the computers. After hanging for a while, November does eventually produce the following error messages, although I don't know what they mean:

november_mpdman_0: mpd_uncaught_except_tb handling:
  <class 'socket.error'>: [Errno 110] Connection timed out
    /usr/local/mpich/bin/mpdlib.py  397  connect
        raise socket.error, errinfo
    /usr/local/mpich/bin/mpdman.py  235  run
        self.conSock.connect((self.conIfhn,self.conPort))
    /usr/local/mpich/bin/mpd  1430  launch_mpdman_via_fork
        mpdman.run()
    /usr/local/mpich/bin/mpd  1331  run_one_cli
        (manPid,toManSock) = self.launch_mpdman_via_fork(msg,man_env)
    /usr/local/mpich/bin/mpd  1205  do_mpdrun
        self.run_one_cli(lorank,msg)
    /usr/local/mpich/bin/mpd  618  handle_console_input
        self.do_mpdrun(msg)
    /usr/local/mpich/bin/mpdlib.py  762  handle_active_streams
        handler(stream,*args)
    /usr/local/mpich/bin/mpd  290  runmainloop
        rv = self.streamHandler.handle_active_streams(timeout=8.0)
    /usr/local/mpich/bin/mpd  259  run
        self.runmainloop()
    /usr/local/mpich/bin/mpd  1492  <module>
        mpd.run()
    mpd_cli_app=/bin/hostname
    cwd=/home/me

and when I eventually ctrl-C, I get "mpiexec: failed to obtain sock from manager". I'm assuming it's not referring to the woolly variety.
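
The traceback above ends in a plain TCP connect (self.conSock.connect) that times out. A minimal sketch of how that step could be probed by hand, assuming Python 3 is available on each machine (the mpd scripts themselves are older Python). The helper names resolve and can_connect are hypothetical, and the port to test is a placeholder, since the port mpd actually uses is not shown in the output:

```python
import errno
import socket


def resolve(name):
    """Return the IPv4 address `name` resolves to, or None if the lookup fails."""
    try:
        return socket.gethostbyname(name)
    except OSError:
        return None


def can_connect(host, port, timeout=5.0):
    """Attempt a plain TCP connect, the kind of call that times out in mpdman.py.

    Returns (True, None) on success, or (False, reason) on failure.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, None
    except socket.timeout:
        return False, "ETIMEDOUT"
    except OSError as exc:
        # Map the numeric errno to its symbolic name where one is available.
        return False, errno.errorcode.get(exc.errno, str(exc))


if __name__ == "__main__":
    # Show how this machine sees itself and its peers. A hostname that maps
    # to a loopback address or to an unexpected interface is one possible
    # cause of a connect that never completes (an assumption, not a diagnosis).
    me = socket.gethostname()
    print("this host:", me, "->", resolve(me))
    for peer in ("quad", "november", "december"):
        print(peer, "->", resolve(peer))
```

Running this on each of the three machines and comparing the printed addresses would show whether November resolves its own name differently from the other two. can_connect can then be pointed at any port where a listener is known to be running on the peer.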

Secondly, continuing with Quad and December which don't hang, when I try to launch the simulation software using:

mpiexec -n 8 software.ex &

I get the following error for 4 of the 8 processes (2 machines, with 4 processes on each):
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(310): Initialization failed
MPID_Init(113).......: channel initialization failed
MPIDI_CH3_Init(244)..: process not on the same host (quad != december)

I've tried looking on the internet for ways to launch across multiple hosts, but nothing seems to work. So my second question is: how can I get this working?

Any help is greatly appreciated, since I really need to get this working asap.

Thanks!

James