[mpich-discuss] mpd daemon not starting on all requested nodes
Mary Ellen Fitzpatrick
mfitzpat at bu.edu
Wed Nov 18 10:39:31 CST 2009
Hi,
I have compiled and installed mpich2-1.0.8 with Torque on my CentOS 5.3
cluster. I can get an mpd ring running as root on all my nodes.
I want users to start the mpd daemon on the nodes via their PBS script.
I have an mpd.conf file in my home directory that contains a secretword.
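For comparison, here is a minimal sketch of how a per-user mpd configuration file is usually set up, assuming the standard MPICH2 location `~/.mpd.conf` and a `secretword=` line. The secret word `change-me` is a hypothetical placeholder. mpd refuses to use a configuration file that other users can read, so the sketch restricts it to mode 600; it does not overwrite an existing file.

```shell
#!/bin/bash
# Sketch: create a per-user mpd configuration file if one does not exist.
# Assumption: mpd reads ~/.mpd.conf; "change-me" is a placeholder, not a real secret.
set -eu

conf="$HOME/.mpd.conf"

if [ ! -e "$conf" ]; then
    # New files are created readable/writable by the owner only.
    umask 077
    # mpd looks for a line of the form secretword=<word>.
    printf 'secretword=%s\n' 'change-me' > "$conf"
fi

# mpd rejects a config file that is accessible by others, so enforce mode 600.
chmod 600 "$conf"
```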
When I submit my job requesting 4 nodes with 2 processors per node, all
8 processes run on one node. The mpd daemons do not start on the other
3 nodes, even though the PBS job is listed as running on all of them.
If I cat my machine file from the PBS script, it lists the 4 nodes and
their processors, but mpd does not start on those nodes.
The job runs on that one node and the output is generated correctly; it
just does not run across all 4 nodes and 8 processors.
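Torque's `$PBS_NODEFILE` contains one line per allocated processor slot, so `nodes=4:ppn=2` yields 8 lines naming 4 distinct hosts. Since mpd normally runs one daemon per host, a minimal sketch for inspecting the node file is to count both the slots and the unique hosts. The file name `mpd.hosts` and the sample host names `node01`..`node04` are hypothetical; the sample is used only when the script runs outside a PBS job. A unique-host count like this is the kind of value mpdboot's `-n` (`--totalnum`) option takes when deciding how many daemons to start; check `mpdboot --help` for your build.

```shell
#!/bin/bash
# Sketch: inspect the Torque node file before starting an mpd ring.
# Assumption: outside a PBS job, fall back to a hypothetical sample node file.
set -eu

if [ -n "${PBS_NODEFILE:-}" ]; then
    nodefile=$PBS_NODEFILE
else
    nodefile=$(mktemp)
    trap 'rm -f "$nodefile"' EXIT
    # Hypothetical allocation: 4 nodes x ppn=2, one line per processor slot.
    printf '%s\n' node01 node01 node02 node02 node03 node03 node04 node04 > "$nodefile"
fi

# One line per slot; arithmetic expansion strips any padding from wc.
nslots=$(( $(wc -l < "$nodefile") ))

# mpd runs one daemon per host, so keep each host name once.
sort -u "$nodefile" > mpd.hosts
nhosts=$(( $(wc -l < mpd.hosts) ))

echo "slots=$nslots hosts=$nhosts"
```

Run outside a job, the fallback prints `slots=8 hosts=4`.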
Any help would be appreciated.
Mary Ellen
Here is my PBS script:
#!/bin/bash
# This is a simple script that cd's into the scratch
# directory, copies some input files to /scr on the compute node,
# and copies the output files from /scr to the user's directory (user or storage)
# give the job a name
#PBS -N rmsss
# send email notification
# request 4 nodes with 2 processors per node
#PBS -l nodes=4:ppn=2
# join stderr and stdout and write them to a file
#PBS -j oe
#PBS -o /fs/userB1/mfitzpat/sss_clus/Output/rmsss.o
# cd into the scratch directory created for this job
cd /scr/$PBS_JOBID
# print out some diagnostic stuff
echo Running on host `hostname`
echo Directory is `pwd`
echo Start time is `date`
# copy the data files to scratch
cp /fs/userB1/mfitzpat/sss_clus/Code/rmsss.exe .
cp /fs/userB1/mfitzpat/sss_clus/Examples/xdata.txt .
cp /fs/userB1/mfitzpat/sss_clus/Examples/ybinarydata.txt .
cp /fs/userB1/mfitzpat/sss_clus/Examples/wdata.txt .
cp /fs/userB1/mfitzpat/sss_clus/Examples/binary.setup.txt .
# run my commands
cat $PBS_NODEFILE > machinefile
more machinefile
/usr/local/mpich2/bin/mpdboot --file=machinefile
/usr/local/mpich2/bin/mpiexec -np 8 ./rmsss.exe ./binary.setup.txt > output.txt
/usr/local/mpich2/bin/mpdallexit
# copy the output files someplace permanent
cp binarymodels.out binarymodels.null binarymodels.summary binarymodels.iter output.txt /fs/userB1/mfitzpat/sss_clus/Output
--
Thanks
Mary Ellen
Mary Ellen FitzPatrick
Systems Analyst
Bioinformatics
Boston University
24 Cummington St.
Boston, MA 02215
office 617-358-2771
cell 617-797-7856
mfitzpat at bu.edu