[MPICH] MPICH2 not running as 'normal' users

Matthew Chambers matthew.chambers at vanderbilt.edu
Wed Oct 10 13:07:29 CDT 2007


You need to set /etc/mpd.conf to be setuid: "chmod +s /etc/mpd.conf"

-Matt
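
(For reference, a rough sketch of the root-side setup this refers to, assuming a
default MPICH2 install; the secretword value is a placeholder, and depending on
how MPICH2 was installed the setuid bit may belong on the mpdroot helper binary
rather than on the conf file itself:)

# echo "secretword=<placeholder>" > /etc/mpd.conf   # root's mpd conf, one secretword line
# chmod 600 /etc/mpd.conf                           # keep it readable by root only
# chmod +s /etc/mpd.conf                            # the change suggested above
# ls -l `which mpdroot`                             # check whether mpdroot is installed setuid root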

Troy Telford wrote:
> I wish it were just multiple MPD rings, or MPD being set up by root; 
> unfortunately, there's just one, and it is started by the user before job 
> start, and 'mpdallexit' is called after the job completes.
>
> I've checked for MPD processes on all of the nodes (it's a smallish cluster; 
> 10 nodes).  There aren't any MPD rings before or after the job runs (and only 
> one MPD ring when I'm trying to run the job - so mpdallexit is apparently 
> working).
>
> But, you did give me the idea of starting MPD as root, and then attempting to 
> run the job using root's MPD (and the environment variable you mention).
>
> I'm getting the following error when I try to use root's MPD:
> [ make sure no MPDs are running anywhere ]
> # mpdboot -n 8
> # su - user
> $ export MPD_USE_ROOT_MPD=1
> $ mpdtrace -l
> mpdroot: open failed for root's mpd conf filempdtrace (__init__ 1171): forked 
> process failed; status=255
> $ mpirun -np 1 ./test
> mpdroot: open failed for root's mpd conf filempiexec_power1 (__init__ 1171): 
> forked process failed; status=255
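
(A quick way to check the permissions involved in that "open failed" message,
assuming mpdroot is on the PATH and the conf file is in its default location;
both paths are assumptions about this install:)

$ ls -l /etc/mpd.conf      # should exist and be owned by root
$ ls -l `which mpdroot`    # if it is not setuid root, opening root's conf as a normal user will typically fail
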
>
> I don't think I mentioned what I'm running on:
> RHEL4 update 4 (I have SELinux Disabled)
> x86_64
>
>
> On Wednesday 10 October 2007, Rajeev Thakur wrote:
>   
>> Is the MPD ring set up by root? In that case, if you are running as a
>> regular user, you need to set the environment variable MPD_USE_ROOT_MPD to
>> 1. If both root and non-root MPD rings are running at the same time, there
>> might be problems, so use only one of them.
>>
>> Rajeev
>>
>>     
>>> -----Original Message-----
>>> From: owner-mpich-discuss at mcs.anl.gov
>>> [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Troy Telford
>>> Sent: Tuesday, October 09, 2007 4:06 PM
>>> To: mpich-discuss at mcs.anl.gov
>>> Subject: [MPICH] MPICH2 not running as 'normal' users
>>>
>>> I'm hoping this is a simple oversight.  I've had no real issues
>>> using MPICH2 in the past, so this is a bit of a surprise to me.
>>>
>>> I've got a new cluster I'm setting up to use MPICH2.
>>>
>>> The program I'm running is a simple hello world that reports
>>> the rank of the
>>> process, and the node it's running on.
>>>
>>> I can execute it fine when running from the login node as a
>>> non-privileged
>>> user.
>>>
>>> It also executes fine when running on the compute nodes as 'root'.
>>>
>>> However, when I try to run as an unprivileged user on the
>>> compute nodes, the
>>> job quits with an error:
>>>
>>> Here's a rundown of sorts:
>>> (from the login node, running on itself)
>>> $ mpdboot -n 8
>>> $ mpdtrace -l
>>> login.default.domain_48762 (10.254.1.250)
>>> n001_47142 (10.254.1.1)
>>> n002_40636 (10.254.1.2)
>>> n003_40697 (10.254.1.3)
>>> n004_40394 (10.254.1.4)
>>> n005_40151 (10.254.1.5)
>>> n006_39487 (10.254.1.6)
>>> n007_39540 (10.254.1.7)
>>> [mpdringtest works fine]
>>> $ mpiexec -n 1 ./test
>>> login.default.domain : proc (0)
>>>
>>> Same thing, but including compute nodes:
>>> $ mpiexec -n 2 ./test
>>> [cli_1]: aborting job:
>>> Fatal error in MPI_Init: Other MPI error, Null value
>>> rank 1 in job 1  ls1host.default.domain_48874   caused
>>> collective abort of all
>>> ranks
>>>   exit status of rank 1: return code 1
>>>
>>> Now, if I log into the compute node and try running it, the
>>> error is similar:
>>> $ mpiexec -n 1 ./test
>>> [cli_0]: aborting job:
>>> Fatal error in MPI_Init: Other MPI error, Null value
>>> rank 0 in job 1  n001_47208   caused collective abort of all ranks
>>>   exit status of rank 0: killed by signal 9
>>>
>>> If I use just one process, and specify 'host' as the login
>>> node, it does work
>>> (which I'd expect to see).
>>> $ mpiexec -n 1 -host login ./test
>>> login.default.domain : proc (0)
>>>
>>>
>>> None of this happens when the user is 'root'.  There aren't
>>> any login issues
>>> (ssh keys are fine, rsh is fine, etc.).  I noticed an mpd logfile
>>> in /tmp/mpd2.logfile_<user>, but its contents are just:
>>>   logfile for mpd with pid 25772
>>>
>>>
>>> Could anybody please give me a clue about what may be
>>> happening such that I'm
>>> able to run as root, but not as a 'normal' user?
>>> --
>>> Troy Telford
>>>       
>
>
>
>   



