[MPICH] MPICH2 not running as 'normal' users

Rajeev Thakur thakur at mcs.anl.gov
Wed Oct 10 11:17:22 CDT 2007


Is the MPD ring set up by root? In that case if you are running as a regular
user, you need to set the environment variable MPD_USE_ROOT_MPD to 1. If
both root and non-root MPD rings are running at the same time, there might
be problems, so use only one of them.
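
A minimal sketch of the above, assuming a Bourne-style shell (sh/bash) and
that root's ring is already up on all the nodes. The mpdallexit step is an
assumption about your setup: it only applies if you started your own ring
with mpdboot earlier. The MPD commands are left as comments so the snippet
does nothing but set the variable; uncomment them on the cluster.

```shell
# Step 1 (only if you started a per-user ring earlier with mpdboot):
# shut it down first, so that only root's ring is left running.
#   mpdallexit

# Step 2: point the MPD client tools at the ring started by root.
export MPD_USE_ROOT_MPD=1

# Step 3: check the ring and rerun the job from the original message.
#   mpdtrace -l
#   mpiexec -n 2 ./test
```

To make this permanent, the export line could go in the user's shell
startup file.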

Rajeev

> -----Original Message-----
> From: owner-mpich-discuss at mcs.anl.gov 
> [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Troy Telford
> Sent: Tuesday, October 09, 2007 4:06 PM
> To: mpich-discuss at mcs.anl.gov
> Subject: [MPICH] MPICH2 not running as 'normal' users
> 
> I'm hoping this is a simple oversight.  I've had no real issues using
> MPICH2 in the past, so this is a bit of a surprise to me.
> 
> I've got a new cluster I'm setting up to use MPICH2.
> 
> The program I'm running is a simple hello world that reports the rank
> of the process and the node it's running on.
> 
> I can execute it fine when running from the login node as a
> non-privileged user.
> 
> It also executes fine when running on the compute nodes as 'root'.
> 
> However, when I try to run as an unprivileged user on the compute
> nodes, the job quits with an error.
> 
> Here's a rundown of sorts:
> (from the login node, running on itself)
> $ mpdboot -n 8
> $ mpdtrace -l
> login.default.domain_48762 (10.254.1.250)
> n001_47142 (10.254.1.1)
> n002_40636 (10.254.1.2)
> n003_40697 (10.254.1.3)
> n004_40394 (10.254.1.4)
> n005_40151 (10.254.1.5)
> n006_39487 (10.254.1.6)
> n007_39540 (10.254.1.7)
> [mpdringtest works fine]
> $ mpiexec -n 1 ./test
> login.default.domain : proc (0)
> 
> Same thing, but including compute nodes:
> $ mpiexec -n 2 ./test
> [cli_1]: aborting job:
> Fatal error in MPI_Init: Other MPI error, Null value
> rank 1 in job 1  ls1host.default.domain_48874   caused collective abort of all ranks
>   exit status of rank 1: return code 1
> 
> Now, if I log into the compute node and try running it, the error is
> similar:
> $ mpiexec -n 1 ./test
> [cli_0]: aborting job:
> Fatal error in MPI_Init: Other MPI error, Null value
> rank 0 in job 1  n001_47208   caused collective abort of all ranks
>   exit status of rank 0: killed by signal 9
> 
> If I use just one process and specify 'host' as the login node, it
> does work (which I'd expect to see):
> $ mpiexec -n 1 -host login ./test
> login.default.domain : proc (0)
> 
> 
> None of this happens when the user is 'root'.  There aren't any login
> issues (ssh keys are fine, rsh is fine, etc.).  I noticed an mpd logfile
> in /tmp/mpd2.logfile_<user>, but its contents are just:
>   logfile for mpd with pid 25772
> 
> 
> Could anybody please give me a clue about what may be happening such
> that I'm able to run as root, but not as a 'normal' user?
> -- 
> Troy Telford
> 
> 



