[mpich-discuss] mpd daemon not starting on all requested nodes

Mary Ellen Fitzpatrick mfitzpat at bu.edu
Wed Nov 18 10:39:31 CST 2009


Hi,
I have compiled/installed mpich2-1.0.8 with torque on my Centos-5.3 
cluster.  I can get an mpd ring running as root on all my nodes.
I want users to start the mpd daemon on the nodes via their pbs script.  
I have an mpd.conf file in my home dir which has a secretword...

When I submit my job, requesting 4 nodes with 2 processors/node, I get 
all 8 processes running on one node.  The mpd daemons do not start on 
the other 3 nodes, but the pbs job is listed as running on all the 
nodes.  If I cat my machine file via the pbs script, it shows the 4 nodes 
and their processors, but mpd does not start on the requested/listed nodes.  
The job runs on that one node and the output is generated correctly; it 
just does not run across all 4 nodes.

Any help would be appreciated.
Mary Ellen

Here is my pbs script:

#!/bin/bash

# This is a simple script that cd's into the scratch
# directory, copies some input files to /scr on the compute node, runs the
# job, and copies the output files from /scr to the user's directory (user or storage)
# give the job a name
#PBS -N rmsss
# send email notification
# request 4 nodes with 2 processors per node
#PBS -l nodes=4:ppn=2
# join stderr and stdout and write them to a file
#PBS -j oe
#PBS -o /fs/userB1/mfitzpat/sss_clus/Output/rmsss.o

# cd into the scratch directory created for this job
cd /scr/$PBS_JOBID
# print out some diagnostic stuff
echo Running on host `hostname`
echo Directory is `pwd`
echo Start time is `date`

# copy the data files to scratch
cp /fs/userB1/mfitzpat/sss_clus/Code/rmsss.exe .
cp /fs/userB1/mfitzpat/sss_clus/Examples/xdata.txt .
cp /fs/userB1/mfitzpat/sss_clus/Examples/ybinarydata.txt .
cp /fs/userB1/mfitzpat/sss_clus/Examples/wdata.txt .
cp /fs/userB1/mfitzpat/sss_clus/Examples/binary.setup.txt .

# run my commands
cat $PBS_NODEFILE > machinefile
more machinefile

/usr/local/mpich2/bin/mpdboot --file=machinefile
/usr/local/mpich2/bin/mpiexec -np 8 ./rmsss.exe ./binary.setup.txt > output.txt
/usr/local/mpich2/bin/mpdallexit

# copy the output files someplace permanent
cp binarymodels.out binarymodels.null binarymodels.summary \
   binarymodels.iter output.txt /fs/userB1/mfitzpat/sss_clus/Output
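
One possible explanation for the behaviour described above, offered as a
sketch and not a confirmed diagnosis: when mpdboot is given no -n
(--totalnum) option it starts a single mpd on the local host only, so
mpiexec would place all 8 processes there.  Also, $PBS_NODEFILE lists each
node once per requested processor (twice per node with ppn=2), while an mpd
hosts file is normally written with one line per host.  The snippet below
collapses the node file to unique hosts and prints an mpdboot command that
asks for one mpd per host.  The helper names unique_hosts and
mpdboot_command and the file name mpd.hosts are made up for illustration;
the install prefix is the one used in the script above.

```shell
#!/bin/bash
# Sketch only: build an mpd hosts file from the Torque node file and
# print an mpdboot command that requests one mpd per unique host.

# Collapse a node file (one line per processor slot) to one line per host.
unique_hosts() {
    sort -u "$1"
}

# Print the mpdboot command for a given hosts file.
# Assumes mpdboot lives under /usr/local/mpich2/bin, as in the job script.
mpdboot_command() {
    local hostsfile=$1
    local nhosts
    nhosts=$(wc -l < "$hostsfile" | tr -d ' ')
    echo "/usr/local/mpich2/bin/mpdboot -n $nhosts --file=$hostsfile"
}

# Only act inside a Torque job, where PBS_NODEFILE is set.
if [ -n "${PBS_NODEFILE:-}" ] && [ -f "$PBS_NODEFILE" ]; then
    unique_hosts "$PBS_NODEFILE" > mpd.hosts
    mpdboot_command mpd.hosts
fi
```

In the job script, the printed command would take the place of the plain
mpdboot line.  Running /usr/local/mpich2/bin/mpdtrace right after it should
then list every host in the ring, which is a quick way to check whether the
daemons actually started on all 4 nodes before mpiexec is called.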



-- 
Thanks
Mary Ellen


Mary Ellen FitzPatrick
Systems Analyst 
Bioinformatics
Boston University
24 Cummington St.
Boston, MA 02215
office 617-358-2771
cell 617-797-7856 
mfitzpat at bu.edu


