[Swift-devel] workers not initiated on all nodes/cpus in a block

Allan Espinosa aespinosa at cs.uchicago.edu
Thu Jul 2 14:32:30 CDT 2009


Looking at the submit script, even though the coaster block requested
8 nodes, it still runs only 1 worker.

submit script found:
 cat PBS2252235058660926788.submit
#PBS -S /bin/sh
#PBS -N null
#PBS -m n
#PBS -l nodes=8
#PBS -l walltime=00:04:00
#PBS -q short
#PBS -o /home/aespinosa/.globus/scripts/PBS2252235058660926788.submit.stdout
#PBS -e /home/aespinosa/.globus/scripts/PBS2252235058660926788.submit.stderr
/usr/bin/perl /home/aespinosa/.globus/coasters/cscript5633964488332337528.pl http://128.135.125.116:47679 0702-050234-000004 1
/bin/echo $? >/home/aespinosa/.globus/scripts/PBS2252235058660926788.submit.exitcode


The /usr/bin/perl line should be prefixed with "pbsdsh" or an
equivalent utility so that the script is executed on all nodes/cpus. I
think this is why, in some instances, a block requests more nodes than
are actually active.
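
A rough sketch of what I mean, not a patch against the coaster provider. The helper name wrap_worker_cmd is made up for illustration; it only builds the launch line that the generated submit script would contain. It assumes pbsdsh is on the PATH of the job. Whether the coaster service accepts several workers registering with the same block id is an open question this sketch does not answer.

```shell
#!/bin/sh
# Sketch only: build the worker launch line for a PBS submit script.
# wrap_worker_cmd is a hypothetical helper, not part of Swift or the
# coaster provider.
# Usage: wrap_worker_cmd NODES COMMAND [ARGS...]
wrap_worker_cmd() {
    nodes=$1
    shift
    if [ "$nodes" -gt 1 ]; then
        # pbsdsh starts the given program on the nodes allocated to the
        # job, instead of only on the node where the job script runs.
        printf 'pbsdsh %s\n' "$*"
    else
        # A single-node block needs no fan-out.
        printf '%s\n' "$*"
    fi
}

# Launch line for the 8-node block from the submit script above.
wrap_worker_cmd 8 /usr/bin/perl \
    /home/aespinosa/.globus/coasters/cscript5633964488332337528.pl \
    http://128.135.125.116:47679 0702-050234-000004 1
```

With nodes=1 the command comes back unchanged, so single-node blocks would behave as they do now.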

host information:
[aespinosa@communicado ~]$ screen -r
IWD: [NONE]  Executable:  [NONE]
Bypass: 0  StartCount: 1
PartitionMask: [ALL]
Flags:       RESTARTABLE

Reservation '1122120' (-00:05:07 -> 00:22:53  Duration: 00:28:00)
PE:  8.00  StartPriority:  1800

[aespinosa@tp-c105 scripts]$ ssh tp-c114 ps x
Password:
  PID TTY      STAT   TIME COMMAND
31815 ?        Ss     0:00 -sh
32054 ?        S      0:00 pbs_demux
32229 ?        S      0:00 -sh
32230 ?        S      0:00 /usr/bin/perl /home/aespinosa/.globus/coasters/cscript5633964488332337528.pl http://128.135.125.116:47679 0702-050234-000003 1
32231 ?        S      0:00 /usr/bin/perl /home/aespinosa/.globus/coasters/cscript5633964488332337528.pl http://128.135.125.116:47679 0702-050234-000003 1
32233 ?        S      0:00 /bin/bash /home/aespinosa/work/ampl/ampl-teraport_coaster/shared/_swiftwrap run_ampl-slru44dj -jobdir s -e /home/zzhang/SEE/static/run_ampl -out result/run1416/stdout -err stderr.txt -i -d |subproblems|result/run1416 -if template|armington.mod|armington_process.cmd|armington_output.cmd|subproblems/producer_tree.mod|ces.so -of result/run1416/expend.dat|result/run1416/limits.dat|result/run1416/price.dat|result/run1416/ratio.dat|result/run1416/solve.dat|result/run1416/stdout -k  -status files -a run1416 template armington.mod armington_process.cmd armington_output.cmd subproblems/producer_tree.mod ces.so
32256 ?        S      0:00 /bin/bash /home/zzhang/SEE/static/run_ampl run1416 template armington.mod armington_process.cmd armington_output.cmd subproblems/producer_tree.mod ces.so
32258 ?        S      0:19 ampl arm_test.cmd
32716 ?        R      0:37 pathampl /tmp/at32258 -AMPL
32726 ?        S      0:00 sshd: aespinosa@notty
32727 ?        Rs     0:00 ps x
[aespinosa@tp-c105 scripts]$ ssh tp-c105 ps x
Password:
  PID TTY      STAT   TIME COMMAND
30721 ?        S      0:00 sshd: aespinosa@pts/0
30722 pts/0    Ss     0:00 -bash
30951 pts/0    S+     0:00 ssh tp-c105 ps x
30955 ?        S      0:00 sshd: aespinosa@notty
30956 ?        Rs     0:00 ps x
[aespinosa@tp-c105 scripts]$ ssh tp-c102 ps x
The authenticity of host 'tp-c102 (10.135.125.108)' can't be established.
RSA key fingerprint is 60:dc:28:eb:f3:1b:ca:80:48:f2:32:f5:1e:3b:b3:d7.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'tp-c102,10.135.125.108' (RSA) to the list of known hosts.
Password:
  PID TTY      STAT   TIME COMMAND
10274 ?        S      0:00 sshd: aespinosa@notty
10275 ?        Rs     0:00 ps x
...
...
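
The per-host check above could be scripted roughly like this. count_workers is a made-up helper name; it reads `ps x` output on stdin and counts lines that mention a coaster worker script, matching on the "coasters/cscript" path seen in the listings. The loop in the comments assumes it runs where $PBS_NODEFILE is set (inside the job) and that ssh to the compute nodes works.

```shell
#!/bin/sh
# count_workers is a hypothetical helper: it reads `ps x` output on
# stdin and prints the number of lines that mention a coaster worker
# script.
count_workers() {
    # grep -c prints 0 but exits non-zero when nothing matches;
    # "|| true" keeps the helper usable under "set -e".
    grep -c 'coasters/cscript' || true
}

# Possible use from inside a job, one line of output per allocated host:
#
#   for host in $(sort -u "$PBS_NODEFILE"); do
#       printf '%s: ' "$host"
#       ssh "$host" ps x | count_workers
#   done
```

A host that prints 0 was allocated to the block but has no worker on it, which is what tp-c105 and tp-c102 show above.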


swift session snapshot:
Progress:  Selecting site:1014  Submitted:8  Active:1
Progress:  Selecting site:1014  Submitted:8  Active:1
Progress:  Selecting site:1014  Submitted:8  Active:1
Progress:  Selecting site:1014  Submitted:8  Active:1

queue information:
ACTIVE JOBS--------------------
JOBNAME            USERNAME      STATE  PROC   REMAINING            STARTTIME

1122120            aespinosa    Running     8    00:19:53  Thu Jul  2 14:22:19

     1 Active Job      171 of  200 Processors Active (85.50%)
                       100 of  100 Nodes Active      (100.00%)





-- 
Allan M. Espinosa <http://allan.88-mph.net/blog>
PhD student, Computer Science
University of Chicago <http://people.cs.uchicago.edu/~aespinosa>


