[mpich-discuss] mvapich2 on multiple nodes: 2 problems

Dhabaleswar Panda panda at cse.ohio-state.edu
Wed Apr 21 08:38:54 CDT 2010


Which version of mvapich2 are you using? The latest version is mvapich2
1.4.1. Are these nodes connected with InfiniBand or some other network?
There is a scalable job startup scheme called `mpirun_rsh'. You should use
that rather than mpd. Please follow the user guide to learn how to run jobs
with mpirun_rsh.
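
For illustration only, here is a minimal sketch of what such a launch might
look like across the three machines named in the original post. It assumes
passwordless ssh between the machines, a hostfile called "hosts" (a
hypothetical name), and that ./software.ex exists at the same path on every
node. The exact options may differ between versions, so treat the user guide
as the authority.

```shell
#!/bin/sh
# Sketch only: host names are taken from the original post; the hostfile
# name "hosts" and the path ./software.ex are assumptions for illustration.

# One line per MPI process: 4 entries for each quad-core machine.
: > hosts
for h in quad november december; do
    for i in 1 2 3 4; do
        echo "$h" >> hosts
    done
done

# Number of processes = number of lines in the hostfile.
NP=$(wc -l < hosts | tr -d ' ')

# Launch only if mpirun_rsh is on PATH; otherwise just print the command.
CMD="mpirun_rsh -np $NP -hostfile hosts ./software.ex"
if command -v mpirun_rsh >/dev/null 2>&1; then
    $CMD
else
    echo "would run: $CMD"
fi
```

Listing each host once per process keeps the process count and the hostfile
in step; to try only two machines first, drop one name from the loop.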

DK

On Wed, 21 Apr 2010, abc def wrote:

>
> Hello,
> I was wondering if someone can help me.
>
> I have 3 quad-core computers that I would like to construct a 12-core cluster out of.
> At the moment, mvapich2 is installed on each of the machines, and we can successfully run 4-process simulations on each machine individually.
> What we want to do is hook these machines together, to run 1 simulation on 12 processors.
>
> I am having trouble doing this, however. I have followed the quick-setup guide in the install guide PDF (page 31 onwards):
>
> mpdcheck -s (checking communication between 2 of the computers) - OK
>
> The 3 computers are called Quad, November and December.
> When I run:
> mpd &
> mpiexec -n 1 /bin/hostname
>
> November hangs, but December and Quad are fine. So my first question is, why does this hang? Running simply "/bin/hostname" on all the computers does work. After hanging for a while, November does eventually produce the following error messages, although I don't know what this means:
>
> november_mpdman_0: mpd_uncaught_except_tb handling:
>   <class 'socket.error'>: [Errno 110] Connection timed out
>     /usr/local/mpich/bin/mpdlib.py  397  connect
>         raise socket.error, errinfo
>     /usr/local/mpich/bin/mpdman.py  235  run
>         self.conSock.connect((self.conIfhn,self.conPort))
>     /usr/local/mpich/bin/mpd  1430  launch_mpdman_via_fork
>         mpdman.run()
>     /usr/local/mpich/bin/mpd  1331  run_one_cli
>         (manPid,toManSock) = self.launch_mpdman_via_fork(msg,man_env)
>     /usr/local/mpich/bin/mpd  1205  do_mpdrun
>         self.run_one_cli(lorank,msg)
>     /usr/local/mpich/bin/mpd  618  handle_console_input
>         self.do_mpdrun(msg)
>     /usr/local/mpich/bin/mpdlib.py  762  handle_active_streams
>         handler(stream,*args)
>     /usr/local/mpich/bin/mpd  290  runmainloop
>         rv = self.streamHandler.handle_active_streams(timeout=8.0)
>     /usr/local/mpich/bin/mpd  259  run
>         self.runmainloop()
>     /usr/local/mpich/bin/mpd  1492  <module>
>         mpd.run()
>     mpd_cli_app=/bin/hostname
>     cwd=/home/me
>
> and when I eventually ctrl-C, I get "mpiexec: failed to obtain sock from manager". I'm assuming it's not referring to the woolly variety.
>
> Secondly, continuing with Quad and December which don't hang, when I try to launch the simulation software using:
>
> mpiexec -n 8 software.ex &
>
> I get the following error for 4 out of the 8 processes (2 machines, each running 4 processes):
> MPIR_Init_thread(310): Initialization failed
> MPID_Init(113).......: channel initialization failed
> MPIDI_CH3_Init(244)..: process not on the same host (quad != december)
> Fatal error in MPI_Init: Other MPI error, error stack:
>
> I've tried looking on the internet for ways to launch with multiple hosts, but nothing seems to work. So my 2nd question is, how can I get this working?
>
> Any help is greatly appreciated, since I really need to get this working asap.
>
> Thanks!
>
> James


