[mpich-discuss] mvapich2 on multiple nodes: 2 problems
Dhabaleswar Panda
panda at cse.ohio-state.edu
Wed Apr 21 08:38:54 CDT 2010
Which version of mvapich2 are you using? The latest version is mvapich2
1.4.1. Are these nodes connected with InfiniBand or some other network?
There is a scalable job startup scheme called `mpirun_rsh'. You should use
that instead of mpd. Please follow the user guide to learn how to run jobs
with mpirun_rsh.
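
As a rough sketch of what that could look like for the three machines
in this thread (not taken from the user guide verbatim): the script
below writes a hostfile with one line per process and prints the launch
command instead of running it. The hostnames quad, november and
december and the binary software.ex come from the original message. The
hostfile path ./hosts and the one-line-per-core layout are assumptions,
and the -np and -hostfile options should be checked against the user
guide for the installed version. mpirun_rsh starts remote processes
over ssh by default, so passwordless ssh between the nodes is assumed.

```shell
#!/bin/sh
# Sketch only: build a hostfile and print an mpirun_rsh command line.
# HOSTFILE path is an assumption; hostnames are taken from the thread.
HOSTFILE="${HOSTFILE:-./hosts}"

# Start from an empty file.
: > "$HOSTFILE"

# List each machine once per core (3 quad-core machines, 12 lines).
for host in quad november december; do
    for core in 1 2 3 4; do
        echo "$host" >> "$HOSTFILE"
    done
done

# One MPI process per hostfile line.
NP=$(wc -l < "$HOSTFILE" | tr -d ' ')

# Print the command rather than executing it, so it can be reviewed first.
echo "mpirun_rsh -np $NP -hostfile $HOSTFILE ./software.ex"
```

Once ssh to each node works without a password prompt, the printed
command can be run from any one of the machines.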
DK
On Wed, 21 Apr 2010, abc def wrote:
>
> Hello,
> I was wondering if someone can help me.
>
> I have 3 quad-core computers that I would like to construct a 12-core cluster out of.
> At the moment, mvapich2 is installed on each of the machines, and we can successfully run 4-node simulations on each of the machines.
> What we want to do is hook these machines together, to run 1 simulation on 12 processors.
>
> I am having trouble doing this however. I have followed the quick-setup guide in the installguide PDF (page 31 onwards):
>
> mpdcheck -s (checking communication between 2 of the computers) - OK
>
> The 3 computers are called Quad, November and December.
> When I run:
> mpd &
> mpiexec -n 1 /bin/hostname
>
> November hangs, but December and Quad are fine. So my first question is, why does this hang? Running simply "/bin/hostname" on all the computers does work. After hanging for a while, November does eventually produce the following error messages, although I don't know what this means:
>
> november_mpdman_0: mpd_uncaught_except_tb handling:
> <class 'socket.error'>: [Errno 110] Connection timed out
> /usr/local/mpich/bin/mpdlib.py 397 connect
> raise socket.error, errinfo
> /usr/local/mpich/bin/mpdman.py 235 run
> self.conSock.connect((self.conIfhn,self.conPort))
> /usr/local/mpich/bin/mpd 1430 launch_mpdman_via_fork
> mpdman.run()
> /usr/local/mpich/bin/mpd 1331 run_one_cli
> (manPid,toManSock) = self.launch_mpdman_via_fork(msg,man_env)
> /usr/local/mpich/bin/mpd 1205 do_mpdrun
> self.run_one_cli(lorank,msg)
> /usr/local/mpich/bin/mpd 618 handle_console_input
> self.do_mpdrun(msg)
> /usr/local/mpich/bin/mpdlib.py 762 handle_active_streams
> handler(stream,*args)
> /usr/local/mpich/bin/mpd 290 runmainloop
> rv = self.streamHandler.handle_active_streams(timeout=8.0)
> /usr/local/mpich/bin/mpd 259 run
> self.runmainloop()
> /usr/local/mpich/bin/mpd 1492 <module>
> mpd.run()
> mpd_cli_app=/bin/hostname
> cwd=/home/me
>
> and when I eventually ctrl-C, I get "mpiexec: failed to obtain sock from manager". I'm assuming it's not referring to the woolly variety.
>
> Secondly, continuing with Quad and December which don't hang, when I try to launch the simulation software using:
>
> mpiexec -n 8 software.ex &
>
> I get the following error for 4 out of the 8 nodes (each machine having 4 nodes, with 2 machines):
> MPIR_Init_thread(310): Initialization failed
> MPID_Init(113).......: channel initialization failed
> MPIDI_CH3_Init(244)..: process not on the same host (quad != december)
> Fatal error in MPI_Init: Other MPI error, error stack:
>
> I've tried looking on the internet for ways to launch with multiple hosts, but nothing seems to work. So my 2nd question is, how can I get this working?
>
> Any help is greatly appreciated, since I really need to get this working asap.
>
> Thanks!
>
> James