[mpich-discuss] mpd daemon not starting on all requested nodes

Pavan Balaji balaji at mcs.anl.gov
Wed Nov 18 10:55:53 CST 2009


mpd expects a slightly different hostfile format than the one PBS
produces. The MPICH2 users' guide has more information on this.
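
As a rough illustration of the difference (a sketch under stated
assumptions, not text from the users' guide): Torque's $PBS_NODEFILE
normally lists each host once per allocated processor, while an mpd hosts
file is usually written with one line per host, optionally as host:ncpus.
As far as I know, mpdboot also starts only a single mpd unless it is told
how many to start with -n. The helper name pbs_to_mpd_hosts and the file
name mpd.hosts are made up for this example.

```shell
#!/bin/bash
# Convert a PBS-style node list (one line per processor slot) into
# mpd-style "host:ncpus" lines, one line per unique host.
# pbs_to_mpd_hosts is a hypothetical helper, not part of MPICH2 or Torque.
pbs_to_mpd_hosts() {
    # uniq -c prints "<count> <host>"; awk swaps that into "<host>:<count>"
    sort "$1" | uniq -c | awk '{ print $2 ":" $1 }'
}

# Inside a PBS job it could be used roughly like this (assumed usage):
#   pbs_to_mpd_hosts "$PBS_NODEFILE" > mpd.hosts
#   mpdboot -n "$(wc -l < mpd.hosts)" -f mpd.hosts
#   mpiexec -n 8 ./rmsss.exe ./binary.setup.txt > output.txt
#   mpdallexit
```

With nodes=4:ppn=2 this should yield four lines of the form host:2, and
mpdboot would then be asked to start four mpds rather than one.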

Alternatively, you should consider upgrading to the latest version of
MPICH2:

% yum install mpich2

... and use the Hydra process manager instead, which can avoid mpd
completely and directly use mpiexec:

% mpiexec.hydra -rmk pbs -n 4 ./foo

Hydra is not available in the mpich2-1.0.x series (started in 1.1.x).
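
For what it is worth, a PBS script using Hydra might look roughly like
the sketch below. It assumes mpiexec.hydra is on the PATH and that the
process count should match the number of slots Torque allocated;
count_slots is a made-up helper name and ./foo is a placeholder binary.

```shell
#!/bin/bash
#PBS -l nodes=4:ppn=2
# Sketch only: count the slots Torque allocated (one line per slot in
# $PBS_NODEFILE) and pass that number to Hydra's mpiexec.
# count_slots is a hypothetical helper, not part of MPICH2 or Torque.
count_slots() {
    # grep -c . counts the non-empty lines in the file
    grep -c . "$1"
}

# Only launch when running inside a PBS job with Hydra installed.
if [ -n "$PBS_NODEFILE" ] && command -v mpiexec.hydra >/dev/null 2>&1; then
    mpiexec.hydra -rmk pbs -n "$(count_slots "$PBS_NODEFILE")" ./foo
fi
```

No mpdboot or mpdallexit calls are needed in this variant.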

 -- Pavan

On 11/18/2009 10:39 AM, Mary Ellen Fitzpatrick wrote:
> Hi,
> I have compiled/installed mpich2-1.0.8 with torque on my Centos-5.3
> cluster.  I can get an mpd ring running as root on all my nodes.
> I want users to start the mpd daemon on the nodes via their pbs script. 
> I have an mpd.conf file in my home dir which has a secretword...
> 
> When I submit my job, requesting 4 nodes with 2 processors/node, I get
> all 8 processes running on one node.  The mpd daemons do not start on
> the other 3 nodes, but the pbs job is listed as running on all the
> nodes.  If I cat my machine file via the pbs script, it shows 4 nodes
> and processors, but mpd does not start on the requested/listed nodes.
> The job runs on that one node and output is generated correctly; it
> just does not run on all 4 nodes/processors.
> 
> Any help would be appreciated.
> Mary Ellen
> 
> Here is my pbs script:
> 
> #!/bin/bash
> 
> # This is a simple script that cd's into scratch
> # directory, copies some input files to /scr on compute node,
> # copy the output file from /scr to user's directory (user or storage)
> # give the job a name
> #PBS -N rmsss
> # send email notification
> # request 4 nodes with 2 processors per node
> #PBS -l nodes=4:ppn=2
> # join stderr and stdout and write them to a file
> #PBS -j oe
> #PBS -o /fs/userB1/mfitzpat/sss_clus/Output/rmsss.o
> 
> # cd into the scratch directory created for this job
> cd /scr/$PBS_JOBID
> # print out some diagnostic stuff
> echo Running on host `hostname`
> echo Directory is `pwd`
> echo Start time is `date`
> 
> # copy the executable and data files to scratch
> cp /fs/userB1/mfitzpat/sss_clus/Code/rmsss.exe .
> cp /fs/userB1/mfitzpat/sss_clus/Examples/xdata.txt .
> cp /fs/userB1/mfitzpat/sss_clus/Examples/ybinarydata.txt .
> cp /fs/userB1/mfitzpat/sss_clus/Examples/wdata.txt .
> cp /fs/userB1/mfitzpat/sss_clus/Examples/binary.setup.txt .
> 
> # run my commands
> cat $PBS_NODEFILE > machinefile
> more machinefile
> 
> /usr/local/mpich2/bin/mpdboot --file=machinefile
> /usr/local/mpich2/bin/mpiexec -np 8 ./rmsss.exe ./binary.setup.txt > output.txt
> /usr/local/mpich2/bin/mpdallexit
> 
> # copy the output files someplace permanent
> cp binarymodels.out binarymodels.null binarymodels.summary binarymodels.iter output.txt /fs/userB1/mfitzpat/sss_clus/Output
> 
> 
> 

-- 
Pavan Balaji
http://www.mcs.anl.gov/~balaji

