[MPICH] MPICH2 not running as 'normal' users

Troy Telford ttelford.groups at gmail.com
Wed Oct 10 12:36:45 CDT 2007


I wish it were just multiple MPD rings, or an MPD ring set up by root; 
unfortunately, there's just one ring.  The user starts it before the job 
starts, and 'mpdallexit' is called after the job completes.

I've checked for MPD processes on all of the nodes (it's a smallish cluster; 
10 nodes).  There aren't any MPD rings before or after the job runs (and only 
one MPD ring when I'm trying to run the job - so mpdallexit is apparently 
working).
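
The per-node check described above can be scripted. What follows is a minimal sketch, assuming passwordless ssh to every node and node names like the n001 through n007 shown in the mpdtrace output quoted below. `check_mpds` and the `REMOTE` override are hypothetical names for illustration, not MPICH2 tools.

```shell
#!/bin/sh
# Report whether any mpd process (owned by any user) is running on each node.
# check_mpds and REMOTE are hypothetical names for this sketch, not MPICH2 tools.
check_mpds() {
    remote=${REMOTE:-ssh}   # command used to reach a node; ssh is an assumption
    for node in "$@"; do
        # The [m]pd pattern keeps grep from matching its own command line.
        out=$($remote "$node" "ps -eo user,pid,args | grep '[m]pd'" 2>/dev/null)
        status=$?
        case $status in
            0) printf '%s:\n%s\n' "$node" "$out" ;;
            1) printf '%s: no mpd running\n' "$node" ;;
            *) printf '%s: check failed (exit status %s)\n' "$node" "$status" ;;
        esac
    done
}
```

It would be called as `check_mpds n001 n002 n003`, before and after a job. Any exit status other than 0 or 1 is reported separately, so an unreachable node is not mistaken for a clean one.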

But you did give me the idea of starting MPD as root and then attempting to 
run the job using root's MPD (with the environment variable you mention).

I'm getting the following error when I try to use root's MPD:
[ make sure no MPDs are running anywhere ]
# mpdboot -n 8
# su - user
$ export MPD_USE_ROOT_MPD=1
$ mpdtrace -l
mpdroot: open failed for root's mpd conf file
mpdtrace (__init__ 1171): forked process failed; status=255
$ mpirun -np 1 ./test
mpdroot: open failed for root's mpd conf file
mpiexec_power1 (__init__ 1171): forked process failed; status=255

I don't think I mentioned what I'm running on:
RHEL4 update 4 (I have SELinux Disabled)
x86_64


On Wednesday 10 October 2007, Rajeev Thakur wrote:
> Is the MPD ring set up by root? In that case if you are running as a
> regular user, you need to set the environment variable MPD_USE_ROOT_MPD to
> 1. If both root and non-root MPD rings are running at the same time, there
> might be problems, so use only one of them.
>
> Rajeev
>
> > -----Original Message-----
> > From: owner-mpich-discuss at mcs.anl.gov
> > [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Troy Telford
> > Sent: Tuesday, October 09, 2007 4:06 PM
> > To: mpich-discuss at mcs.anl.gov
> > Subject: [MPICH] MPICH2 not running as 'normal' users
> >
> > I'm hoping this is a simple oversight.  I've had no real issues
> > using MPICH2 in the past, so this is a bit of a surprise to me.
> >
> > I've got a new cluster I'm setting up to use MPICH2.
> >
> > The program I'm running is a simple hello world that reports
> > the rank of the
> > process, and the node it's running on.
> >
> > I can execute it fine when running from the login node as a
> > non-privileged
> > user.
> >
> > It also executes fine when running on the compute nodes as 'root'.
> >
> > However, when I try to run as an unprivileged user on the
> > compute nodes, the
> > job quits with an error:
> >
> > Here's a rundown of sorts:
> > (from the login node, running on itself)
> > $ mpdboot -n 8
> > $ mpdtrace -l
> > login.default.domain_48762 (10.254.1.250)
> > n001_47142 (10.254.1.1)
> > n002_40636 (10.254.1.2)
> > n003_40697 (10.254.1.3)
> > n004_40394 (10.254.1.4)
> > n005_40151 (10.254.1.5)
> > n006_39487 (10.254.1.6)
> > n007_39540 (10.254.1.7)
> > [mpdringtest works fine]
> > $ mpiexec -n 1 ./test
> > login.default.domain : proc (0)
> >
> > Same thing, but including compute nodes:
> > $ mpiexec -n 2 ./test
> > [cli_1]: aborting job:
> > Fatal error in MPI_Init: Other MPI error, Null value
> > rank 1 in job 1  ls1host.default.domain_48874   caused collective abort of all ranks
> >   exit status of rank 1: return code 1
> >
> > Now, if I log into the compute node, and try running it, the
> > error is similar
> > $ mpiexec -n 1 ./test
> > [cli_0]: aborting job:
> > Fatal error in MPI_Init: Other MPI error, Null value
> > rank 0 in job 1  n001_47208   caused collective abort of all ranks
> >   exit status of rank 0: killed by signal 9
> >
> > If I use just one process, and specify 'host' as the login
> > node, it does work
> > (which I'd expect to see)
> > $ mpiexec -n 1 -host login ./test
> > login.default.domain : proc (0)
> >
> >
> > None of this happens when the user is 'root'.  There aren't
> > any login issues
> > (ssh keys are fine, rsh is fine, etc.)  I noticed an mpd logfile
> > in /tmp/mpd2.logfile_<user>, but its contents are just:
> >   logfile for mpd with pid 25772
> >
> >
> > Could anybody please give me a clue about what may be
> > happening such that I'm
> > able to run as root, but not as a 'normal' user?
> > --
> > Troy Telford



-- 
Troy Telford



