[mpich-discuss] mvapich2 on multiple nodes: 2 problems
abc def
cannonjunk at hotmail.co.uk
Wed Apr 21 08:31:34 CDT 2010
Hello,
I was wondering if someone can help me.
I have 3 quad-core computers that I would like to combine into a 12-core cluster.
At the moment, mvapich2 is installed on each of the machines, and we can successfully run 4-process simulations on each machine individually.
What we want to do is hook these machines together, to run 1 simulation on 12 processors.
However, I am having trouble doing this. I have followed the quick-setup guide in the installguide PDF (page 31 onwards):
mpdcheck -s (checking communication between 2 of the computers) - OK
The 3 computers are called Quad, November and December.
When I run:
mpd &
mpiexec -n 1 /bin/hostname
November hangs, but December and Quad are fine. So my first question is: why does this hang? Running plain "/bin/hostname" directly on each of the computers does work. After hanging for a while, November eventually produces the following error messages, although I don't know what they mean:
november_mpdman_0: mpd_uncaught_except_tb handling:
<class 'socket.error'>: [Errno 110] Connection timed out
/usr/local/mpich/bin/mpdlib.py 397 connect
raise socket.error, errinfo
/usr/local/mpich/bin/mpdman.py 235 run
self.conSock.connect((self.conIfhn,self.conPort))
/usr/local/mpich/bin/mpd 1430 launch_mpdman_via_fork
mpdman.run()
/usr/local/mpich/bin/mpd 1331 run_one_cli
(manPid,toManSock) = self.launch_mpdman_via_fork(msg,man_env)
/usr/local/mpich/bin/mpd 1205 do_mpdrun
self.run_one_cli(lorank,msg)
/usr/local/mpich/bin/mpd 618 handle_console_input
self.do_mpdrun(msg)
/usr/local/mpich/bin/mpdlib.py 762 handle_active_streams
handler(stream,*args)
/usr/local/mpich/bin/mpd 290 runmainloop
rv = self.streamHandler.handle_active_streams(timeout=8.0)
/usr/local/mpich/bin/mpd 259 run
self.runmainloop()
/usr/local/mpich/bin/mpd 1492 <module>
mpd.run()
mpd_cli_app=/bin/hostname
cwd=/home/me
When I eventually press Ctrl-C, I get "mpiexec: failed to obtain sock from manager". I'm assuming it's not referring to the woolly variety.
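The traceback above ends in a plain TCP connect that times out (Errno 110), so one thing worth checking is whether each machine can open a TCP connection to the others at all. Below is a minimal sketch of such a check in Python 3. It is not part of mpd: the function name `can_connect` and the port number in the usage comment are hypothetical placeholders. It also assumes the timeout comes from the network path (for example a firewall or a hostname resolving to the wrong address), which the traceback alone does not confirm.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) opens within `timeout` seconds.

    Hypothetical helper for illustration only; it is not part of mpd or mvapich2.
    """
    try:
        # create_connection resolves the hostname and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and hostname resolution failures.
        return False


# Example usage (the port is a placeholder; substitute one that is actually
# being listened on, on the target machine):
#
#     for host in ("quad", "november", "december"):
#         print(host, can_connect(host, 50000))
```

Running this from November towards the other two machines, and from them towards November, should show whether the failure is specific to one direction or one host.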
Secondly, continuing with Quad and December, which don't hang: when I try to launch the simulation software using:
mpiexec -n 8 software.ex &
I get the following error for 4 out of the 8 processes (2 machines, each running 4 processes):
MPIR_Init_thread(310): Initialization failed
MPID_Init(113).......: channel initialization failed
MPIDI_CH3_Init(244)..: process not on the same host (quad != december)
Fatal error in MPI_Init: Other MPI error, error stack:
I've tried looking on the internet for ways to launch across multiple hosts, but nothing seems to work. So my second question is: how can I get this working?
Any help is greatly appreciated, since I really need to get this working asap.
Thanks!
James