[MPICH] MPICH2 not running as 'normal' users
Rajeev Thakur
thakur at mcs.anl.gov
Wed Oct 10 11:17:22 CDT 2007
Is the MPD ring set up by root? If so, and you are running as a regular
user, you need to set the environment variable MPD_USE_ROOT_MPD to 1. Note
that if both root and non-root MPD rings are running at the same time, there
can be conflicts, so run only one of them.
Rajeev
> -----Original Message-----
> From: owner-mpich-discuss at mcs.anl.gov
> [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Troy Telford
> Sent: Tuesday, October 09, 2007 4:06 PM
> To: mpich-discuss at mcs.anl.gov
> Subject: [MPICH] MPICH2 not running as 'normal' users
>
> I'm hoping this is simple oversight. I've had no real issues
> using MPICH2 in
> the past, so this is a bit of a surprise to me.
>
> I've got a new cluster I'm setting up to use MPICH2.
>
> The program I'm running is a simple hello world that reports
> the rank of the
> process, and the node it's running on.
>
> I can execute it fine when running from the login node as a
> non-privileged
> user.
>
> It also executes fine when running on the compute nodes as 'root'
>
> However, when I try to run as an unprivileged user on the
> compute nodes, the
> job quits with an error:
>
> Here's a rundown of sorts:
> (from the login node, running on itself)
> $ mpdboot -n 8
> $ mpdtrace -l
> login.default.domain_48762 (10.254.1.250)
> n001_47142 (10.254.1.1)
> n002_40636 (10.254.1.2)
> n003_40697 (10.254.1.3)
> n004_40394 (10.254.1.4)
> n005_40151 (10.254.1.5)
> n006_39487 (10.254.1.6)
> n007_39540 (10.254.1.7)
> [mpdringtest works fine]
> $ mpiexec -n 1 ./test
> login.default.domain : proc (0)
>
> Same thing, but including compute nodes:
> $ mpiexec -n 2 ./test
> [cli_1]: aborting job:
> Fatal error in MPI_Init: Other MPI error, Null value
> rank 1 in job 1 ls1host.default.domain_48874 caused collective abort of all ranks
> exit status of rank 1: return code 1
>
> Now, if I log into the compute node, and try running it, the
> error is similar
> $ mpiexec -n 1 ./test
> [cli_0]: aborting job:
> Fatal error in MPI_Init: Other MPI error, Null value
> rank 0 in job 1 n001_47208 caused collective abort of all ranks
> exit status of rank 0: killed by signal 9
>
> If I use just one process, and specify 'host' as the login
> node, it does work
> (which I'd expect to see)
> $ mpiexec -n 1 -host login ./test
> login.default.domain : proc (0)
>
>
> None of this happens when the user is 'root'. There aren't
> any login issues
> (ssh keys are fine, rsh is fine, etc.) I noticed an mpd logfile
> in /tmp/mpd2.logfile_<user>, but its contents are just:
> logfile for mpd with pid 25772
>
>
> Could anybody please give me a clue about what may be
> happening such that I'm
> able to run as root, but not as a 'normal' user?
> --
> Troy Telford
>
>