[MPICH] MPICH2 not running as 'normal' users

Troy Telford ttelford.groups at gmail.com
Wed Oct 10 13:21:51 CDT 2007


I've since tried re-compiling MPICH2, and now I can use root's MPD ring.

The error I receive is more or less the same:
[cli_0]: aborting job:
Fatal error in MPI_Init: Other MPI error, Null value
rank 0 in job 2  host_32933   caused collective abort of all ranks
  exit status of rank 0: killed by signal 9



I've tried starting and stopping MPD, running it as both root and as the user.
When using mpdboot --verbose [etc, etc], I am seeing something in the MPD log
(though not on every mpiexec command):
host_mpdman_0 (handle_console_input 1313): cannot send stdin to client

Is this message at all helpful?  (I can't think of why it shouldn't be able to
do this, but I don't know much about MPD beyond the basics of how to use it.)

On Wednesday 10 October 2007, Troy Telford wrote:
> I wish it were just multiple MPD rings, or MPD being set up by root;
> unfortunately, there's just one, started by the user before the job starts,
> and 'mpdallexit' is called after the job completes.
>
> I've checked for MPD processes on all of the nodes (it's a smallish
> cluster; 10 nodes).  There aren't any MPD rings before or after the job
> runs (and only one MPD ring when I'm trying to run the job - so mpdallexit
> is apparently working).
>
> But you did give me the idea of starting MPD as root, and then attempting
> to run the job using root's MPD (and the environment variable you mention).
>
> I'm getting the following error when I try to use root's MPD:
> [ make sure no MPDs are running anywhere ]
> # mpdboot -n 8
> # su - user
> $ export MPD_USE_ROOT_MPD=1
> $ mpdtrace -l
> mpdroot: open failed for root's mpd conf filempdtrace (__init__ 1171):
> forked process failed; status=255
> $ mpirun -np 1 ./test
> mpdroot: open failed for root's mpd conf filempiexec_power1 (__init__
> 1171): forked process failed; status=255
>
> I don't think I mentioned what I'm running on:
> RHEL4 update 4 (I have SELinux disabled)
> x86_64
>
> On Wednesday 10 October 2007, Rajeev Thakur wrote:
> > Is the MPD ring set up by root? In that case if you are running as a
> > regular user, you need to set the environment variable MPD_USE_ROOT_MPD
> > to 1. If both root and non-root MPD rings are running at the same time,
> > there might be problems, so use only one of them.
> >
> > Rajeev
> >
> > > -----Original Message-----
> > > From: owner-mpich-discuss at mcs.anl.gov
> > > [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Troy Telford
> > > Sent: Tuesday, October 09, 2007 4:06 PM
> > > To: mpich-discuss at mcs.anl.gov
> > > Subject: [MPICH] MPICH2 not running as 'normal' users
> > >
> > > I'm hoping this is a simple oversight.  I've had no real issues using
> > > MPICH2 in the past, so this is a bit of a surprise to me.
> > >
> > > I've got a new cluster I'm setting up to use MPICH2.
> > >
> > > The program I'm running is a simple hello world that reports the rank
> > > of the process, and the node it's running on.
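> > >
> > > Roughly, it's something like the following minimal sketch (the exact
> > > source isn't important; it just calls MPI_Init, prints the rank and the
> > > hostname, then calls MPI_Finalize):
> > >
> > > #include <mpi.h>
> > > #include <stdio.h>
> > > #include <unistd.h>   /* gethostname() */
> > >
> > > int main(int argc, char **argv)
> > > {
> > >     char host[256];
> > >     int rank;
> > >
> > >     MPI_Init(&argc, &argv);                /* this is the call that fails */
> > >     MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* rank of this process */
> > >     gethostname(host, sizeof(host));       /* node it's running on */
> > >     printf("%s : proc (%d)\n", host, rank);
> > >     MPI_Finalize();
> > >     return 0;
> > > }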
> > >
> > > I can execute it fine when running from the login node as a
> > > non-privileged user.
> > >
> > > It also executes fine when running on the compute nodes as 'root'.
> > >
> > > However, when I try to run as an unprivileged user on the compute
> > > nodes, the job quits with an error:
> > >
> > > Here's a rundown of sorts:
> > > (from the login node, running on itself)
> > > $ mpdboot -n 8
> > > $ mpdtrace -l
> > > login.default.domain_48762 (10.254.1.250)
> > > n001_47142 (10.254.1.1)
> > > n002_40636 (10.254.1.2)
> > > n003_40697 (10.254.1.3)
> > > n004_40394 (10.254.1.4)
> > > n005_40151 (10.254.1.5)
> > > n006_39487 (10.254.1.6)
> > > n007_39540 (10.254.1.7)
> > > [mpdringtest works fine]
> > > $ mpiexec -n 1 ./test
> > > login.default.domain : proc (0)
> > >
> > > Same thing, but including compute nodes:
> > > $ mpiexec -n 2 ./test
> > > [cli_1]: aborting job:
> > > Fatal error in MPI_Init: Other MPI error, Null value
> > > rank 1 in job 1  ls1host.default.domain_48874   caused collective abort
> > > of all ranks
> > >   exit status of rank 1: return code 1
> > >
> > > Now, if I log into the compute node and try running it, the error is
> > > similar:
> > > $ mpiexec -n 1 ./test
> > > [cli_0]: aborting job:
> > > Fatal error in MPI_Init: Other MPI error, Null value
> > > rank 0 in job 1  n001_47208   caused collective abort of all ranks
> > >   exit status of rank 0: killed by signal 9
> > >
> > > If I use just one process and specify the login node as the host, it
> > > does work (which I'd expect to see):
> > > $ mpiexec -n 1 -host login ./test
> > > login.default.domain : proc (0)
> > >
> > >
> > > None of this happens when the user is 'root'.  There aren't any login
> > > issues (ssh keys are fine, rsh is fine, etc.).  I noticed an mpd logfile
> > > in /tmp/mpd2.logfile_<user>, but its contents are just:
> > >   logfile for mpd with pid 25772
> > >
> > >
> > > Could anybody please give me a clue about what may be happening such
> > > that I'm able to run as root, but not as a 'normal' user?
> > > --
> > > Troy Telford



-- 
Troy Telford



