[MPICH] Running on root's MPD as either root or another user
Matthew Chambers
matthew.chambers at vanderbilt.edu
Tue Oct 2 11:00:53 CDT 2007
OK, the one problem with your test is that it doesn't seem to use two
machines. I upgraded to 1.0.6 and still had the same problem. However,
I started running the MPD by hand to see what the debug messages would
be, and hit the jackpot:
fenchurch01_man_8036 (launch_mpdman_via_fork 1405): invalid username :rslebos: on fenchurch01
fenchurch01_man_8036: mpd_uncaught_except_tb handling:
  exceptions.AttributeError: 'int' object has no attribute 'send_dict_msg'
    /frogstar/usr/ppc/bin/mpd  1408  launch_mpdman_via_fork
        self.conSock.send_dict_msg(msgToSend)
    /frogstar/usr/ppc/bin/mpd  1329  run_one_cli
        (manPid,toManSock) = self.launch_mpdman_via_fork(msg,man_env)
    /frogstar/usr/ppc/bin/mpd  1203  do_mpdrun
        self.run_one_cli(lorank,msg)
    /frogstar/usr/ppc/bin/mpd  857  handle_lhs_input
        self.do_mpdrun(msg)
    /frogstar/usr/ppc/bin/mpdlib.py  762  handle_active_streams
        handler(stream,*args)
    /frogstar/usr/ppc/bin/mpd  289  runmainloop
        rv = self.streamHandler.handle_active_streams(timeout=8.0)
    /frogstar/usr/ppc/bin/mpd  258  run
        self.runmainloop()
    /frogstar/usr/ppc/bin/mpd  1490  ?
        mpd.run()
So clearly it was an issue with the user not having an account on the
other machine. Once I created an account for that user, the MPI job from
that user worked fine. But I asked about this before (whether the user
has to have an account on every machine in the ring), and I was under
the impression that it isn't supposed to be required. Any ideas?
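In case it helps anyone else who hits that "invalid username" message,
a quick way to check whether a user resolves on every host in the ring
might be something like this (just a sketch on my part, not from the
MPICH docs; it assumes mpdtrace prints one hostname per line and that
root can ssh to each node):

    for host in $(mpdtrace); do
        echo -n "$host: "
        ssh $host 'getent passwd rslebos >/dev/null && echo "user found" || echo "no such user"'
    done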
Thanks,
Matt
Ralph Butler wrote:
> We were unaware of any bugs of this type on prior versions. However,
> it is possible that 1.0.6 could help.
>
> On Mon, Oct 1, 2007, at 11:21 PM, Matt Chambers wrote:
>
>> Thanks for your thoroughness, Ralph. I'm still using 1.0.3 because
>> it's worked great up until now. Is it possible/likely that this
>> behavior will be fixed by upgrading to the latest version? Upgrading
>> won't be a problem, but I just didn't even think about it until you
>> mentioned building 1.0.6.
>>
>> -Matt
>>
>> Ralph Butler wrote:
>>> OK. So I tried to reproduce the problem but could not. Here is the
>>> sequence of steps I followed on 2 nodes of my cluster:
>>>
>>> - build mpich2-1.0.6
>>> - su to root
>>> - install mpich2 in /tmp/mpich2i (make sure mpdroot is +s)
>>> - create /etc/mpd.conf with secretword=foobar
>>> - install in the same way on a second machine
>>> - on 1st machine, start mpd by hand
>>> - on 2nd machine, start mpd by hand using the -h and -p options to
>>> join the first mpd
>>> - (still as root) run mpdtrace and some mpiexec jobs to make sure
>>> everything works
>>> - log out as root and log in to an unused student acct
>>> - as student:
>>>     setenv MPD_USE_ROOT_MPD 1
>>>     /tmp/mpich2i/bin/mpiexec -n 2 hostname
>>>
>>> I did not even create a .mpd.conf file for the student.
>>>
>>
>
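For anyone finding this in the archives, Ralph's recipe above translates
roughly as follows under bash (just a sketch; the install prefix and
secretword come from his example, the hostname and port are placeholders,
and the chmod 600 reflects the permissions mpd expects on its conf file):

    # on each node, as root: build, install, and configure mpd
    ./configure --prefix=/tmp/mpich2i && make && make install
    echo "secretword=foobar" > /etc/mpd.conf
    chmod 600 /etc/mpd.conf
    chmod u+s /tmp/mpich2i/bin/mpdroot     # mpdroot must be setuid

    # first node: start the ring and note the host/port it reports
    /tmp/mpich2i/bin/mpd --daemon
    /tmp/mpich2i/bin/mpdtrace -l

    # second node: join the first mpd
    /tmp/mpich2i/bin/mpd --daemon -h <first-host> -p <port-from-mpdtrace>

    # then as an ordinary user (bash equivalent of Ralph's setenv line)
    export MPD_USE_ROOT_MPD=1
    /tmp/mpich2i/bin/mpiexec -n 2 hostname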