[MPICH] Running on root's MPD as either root or another user

Matthew Chambers matthew.chambers at vanderbilt.edu
Tue Oct 2 11:00:53 CDT 2007


OK, the one problem with your test is that it doesn't seem to use two 
machines.  I upgraded to 1.0.6 and still had the same problem.  However, 
I started running the MPD by hand to see what debug messages it would 
print, and hit the jackpot:

fenchurch01_man_8036 (launch_mpdman_via_fork 1405): invalid username :rslebos: on fenchurch01
fenchurch01_man_8036: mpd_uncaught_except_tb handling:
  exceptions.AttributeError: 'int' object has no attribute 'send_dict_msg'
    /frogstar/usr/ppc/bin/mpd  1408  launch_mpdman_via_fork
        self.conSock.send_dict_msg(msgToSend)
    /frogstar/usr/ppc/bin/mpd  1329  run_one_cli
        (manPid,toManSock) = self.launch_mpdman_via_fork(msg,man_env)
    /frogstar/usr/ppc/bin/mpd  1203  do_mpdrun
        self.run_one_cli(lorank,msg)
    /frogstar/usr/ppc/bin/mpd  857  handle_lhs_input
        self.do_mpdrun(msg)
    /frogstar/usr/ppc/bin/mpdlib.py  762  handle_active_streams
        handler(stream,*args)
    /frogstar/usr/ppc/bin/mpd  289  runmainloop
        rv = self.streamHandler.handle_active_streams(timeout=8.0)
    /frogstar/usr/ppc/bin/mpd  258  run
        self.runmainloop()
    /frogstar/usr/ppc/bin/mpd  1490  ?
        mpd.run()
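
(For reference, "by hand" here just means running each mpd in the 
foreground so its messages print straight to the terminal, roughly like 
this; <first-machine> and <port> are placeholders, and mpdtrace -l on 
the first machine should report the port:)

    # on the first machine, run mpd in the foreground
    /frogstar/usr/ppc/bin/mpd

    # on the second machine, join the first machine's ring
    /frogstar/usr/ppc/bin/mpd -h <first-machine> -p <port>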

So clearly, the issue was that the user didn't have an account on the 
other machine.  Once I created the account there, the MPI job from that 
user worked fine.  But I asked about this before (whether the user would 
need an account on each machine in the ring), and my understanding was 
that this shouldn't be necessary.  Any ideas?
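
In case it's useful to anyone else, a quick way to double-check this 
sort of thing (assuming the standard id utility is available on the 
nodes) is to list the ring and confirm the account resolves on every 
host in it:

    # show which hosts are currently in the mpd ring
    mpdtrace

    # then, on each of those hosts, confirm the account exists
    # (rslebos is the account from the error message above)
    id rslebos || echo "no account for rslebos on $(hostname)"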

Thanks,
Matt

Ralph Butler wrote:
> We were unaware of any bugs of this type in prior versions.  However, 
> it is possible that 1.0.6 could help.
>
> On Mon Oct 1, at 11:21 PM, Matt Chambers wrote:
>
>> Thanks for your thoroughness, Ralph.  I'm still using 1.0.3 because 
>> it's worked great up until now.  Is it possible/likely that this 
>> behavior will be fixed by upgrading to the latest version?  Upgrading 
>> won't be a problem, but I just didn't even think about it until you 
>> mentioned building 1.0.6.
>>
>> -Matt
>>
>> Ralph Butler wrote:
>>> OK.  So I tried to reproduce the problem but could not.  Here is the 
>>> sequence of steps I followed on 2 nodes of my cluster:
>>>
>>> - build mpich2-1.0.6
>>> - su to root
>>> - install mpich2 in /tmp/mpich2i (make sure mpdroot is +s)
>>> - create /etc/mpd.conf with secretword=foobar
>>> - install in the same way on a second machine
>>> - on 1st machine, start mpd by hand
>>> - on 2nd machine, start mpd by hand using the -h and -p options to 
>>> join the first mpd
>>> - (still as root) run mpdtrace and some mpiexec jobs to make sure 
>>> all works
>>> - logout as root and login to an unused student acct
>>> - as student:
>>>       setenv MPD_USE_ROOT_MPD 1
>>>       /tmp/mpich2i/bin/mpiexec -n 2 hostname
>>>
>>> I did not even create a .mpd.conf file for the student.
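>>>
>>> Roughly, as a shell transcript (node1/node2 and <port> are just 
>>> placeholders for the two machines and the first mpd's listen port):
>>>
>>>     # as root on node1
>>>     /tmp/mpich2i/bin/mpd &
>>>
>>>     # as root on node2, join node1's ring
>>>     /tmp/mpich2i/bin/mpd -h node1 -p <port> &
>>>
>>>     # still as root: check the ring and run a test job
>>>     /tmp/mpich2i/bin/mpdtrace
>>>     /tmp/mpich2i/bin/mpiexec -n 2 hostname
>>>
>>>     # then, as the student (csh syntax, as above)
>>>     setenv MPD_USE_ROOT_MPD 1
>>>     /tmp/mpich2i/bin/mpiexec -n 2 hostname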
>>>
>>
>



