[Swift-devel] workers not initiated on all nodes/cpus in a block
Mihael Hategan
hategan at mcs.anl.gov
Thu Jul 2 14:39:03 CDT 2009
This is with the PBS provider rather than Globus, right?
On Thu, 2009-07-02 at 14:32 -0500, Allan Espinosa wrote:
> Looking at the submit script from earlier: even though the coaster block
> requested 8 nodes, it still runs only 1 worker.
>
> submit script found:
> cat PBS2252235058660926788.submit
> #PBS -S /bin/sh
> #PBS -N null
> #PBS -m n
> #PBS -l nodes=8
> #PBS -l walltime=00:04:00
> #PBS -q short
> #PBS -o /home/aespinosa/.globus/scripts/PBS2252235058660926788.submit.stdout
> #PBS -e /home/aespinosa/.globus/scripts/PBS2252235058660926788.submit.stderr
> /usr/bin/perl /home/aespinosa/.globus/coasters/cscript5633964488332337528.pl http://128.135.125.116:47679 0702-050234-000004 1
> /bin/echo $? >/home/aespinosa/.globus/scripts/PBS2252235058660926788.submit.exitcode
>
>
> The /usr/bin/perl line should be prefixed with "pbsdsh" or an
> equivalent utility so that the script executes on all nodes/cpus. I think
> this is why, in some instances, the block requests more nodes
> but not all of them are active.
>
> host information:
> [aespinosa@communicado ~]$ screen -r
> IWD: [NONE] Executable: [NONE]
> Bypass: 0 StartCount: 1
> PartitionMask: [ALL]
> Flags: RESTARTABLE
>
> Reservation '1122120' (-00:05:07 -> 00:22:53 Duration: 00:28:00)
> PE: 8.00 StartPriority: 1800
>
> [aespinosa@tp-c105 scripts]$ ssh tp-c114 ps x
> Password:
> PID TTY STAT TIME COMMAND
> 31815 ? Ss 0:00 -sh
> 32054 ? S 0:00 pbs_demux
> 32229 ? S 0:00 -sh
> 32230 ? S 0:00 /usr/bin/perl /home/aespinosa/.globus/coasters/cscript5633964488332337528.pl http://128.135.125.116:47679 0702-050234-000003 1
> 32231 ? S 0:00 /usr/bin/perl /home/aespinosa/.globus/coasters/cscript5633964488332337528.pl http://128.135.125.116:47679 0702-050234-000003 1
> 32233 ? S 0:00 /bin/bash /home/aespinosa/work/ampl/ampl-teraport_coaster/shared/_swiftwrap run_ampl-slru44dj -jobdir s -e /home/zzhang/SEE/static/run_ampl -out result/run1416/stdout -err stderr.txt -i -d |subproblems|result/run1416 -if template|armington.mod|armington_process.cmd|armington_output.cmd|subproblems/producer_tree.mod|ces.so -of result/run1416/expend.dat|result/run1416/limits.dat|result/run1416/price.dat|result/run1416/ratio.dat|result/run1416/solve.dat|result/run1416/stdout -k -status files -a run1416 template armington.mod armington_process.cmd armington_output.cmd subproblems/producer_tree.mod ces.so
> 32256 ? S 0:00 /bin/bash /home/zzhang/SEE/static/run_ampl run1416 template armington.mod armington_process.cmd armington_output.cmd subproblems/producer_tree.mod ces.so
> 32258 ? S 0:19 ampl arm_test.cmd
> 32716 ? R 0:37 pathampl /tmp/at32258 -AMPL
> 32726 ? S 0:00 sshd: aespinosa@notty
> 32727 ? Rs 0:00 ps x
> [aespinosa@tp-c105 scripts]$ ssh tp-c105 ps x
> Password:
> PID TTY STAT TIME COMMAND
> 30721 ? S 0:00 sshd: aespinosa@pts/0
> 30722 pts/0 Ss 0:00 -bash
> 30951 pts/0 S+ 0:00 ssh tp-c105 ps x
> 30955 ? S 0:00 sshd: aespinosa@notty
> 30956 ? Rs 0:00 ps x
> [aespinosa@tp-c105 scripts]$ ssh tp-c102 ps x
> The authenticity of host 'tp-c102 (10.135.125.108)' can't be established.
> RSA key fingerprint is 60:dc:28:eb:f3:1b:ca:80:48:f2:32:f5:1e:3b:b3:d7.
> Are you sure you want to continue connecting (yes/no)? yes
> Warning: Permanently added 'tp-c102,10.135.125.108' (RSA) to the list of known hosts.
> Password:
> PID TTY STAT TIME COMMAND
> 10274 ? S 0:00 sshd: aespinosa@notty
> 10275 ? Rs 0:00 ps x
> ...
> ...
>
>
> swift session snapshot:
> Progress: Selecting site:1014 Submitted:8 Active:1
> Progress: Selecting site:1014 Submitted:8 Active:1
> Progress: Selecting site:1014 Submitted:8 Active:1
> Progress: Selecting site:1014 Submitted:8 Active:1
>
> queue information:
> ACTIVE JOBS--------------------
> JOBNAME USERNAME STATE PROC REMAINING STARTTIME
>
> 1122120 aespinosa Running 8 00:19:53 Thu Jul 2 14:22:19
>
> 1 Active Job 171 of 200 Processors Active (85.50%)
> 100 of 100 Nodes Active (100.00%)
>
> --
> Allan M. Espinosa <http://allan.88-mph.net/blog>
> PhD student, Computer Science
> University of Chicago <http://people.cs.uchicago.edu/~aespinosa>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
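
A minimal sketch of the change proposed in the quoted message, assuming the cluster's PBS installation provides pbsdsh for starting a command on the nodes allocated to a job. The helper name add_launcher and the shortened sample paths are hypothetical; this is not how the Swift PBS provider generates its submit scripts, only an illustration of prefixing the worker line with a launcher.

```shell
#!/bin/sh
# Hypothetical helper (not part of Swift): read a PBS submit script on stdin
# and prefix the coaster worker command with a launcher such as pbsdsh, so
# that the worker is started through the launcher rather than directly.
add_launcher() {
    launcher=$1
    # Only lines that begin with the perl worker invocation are rewritten;
    # "#PBS" directives and the exit-code line pass through unchanged.
    # Assumes the launcher string contains no "|" or "&" characters.
    sed "s|^/usr/bin/perl |${launcher} /usr/bin/perl |"
}

# Example input modelled on the quoted submit script (paths and arguments
# are shortened placeholders, not real values).
sample_script() {
    printf '%s\n' \
        '#PBS -l nodes=8' \
        '/usr/bin/perl /path/to/cscript.pl http://host:port blockid 1' \
        '/bin/echo $? >/path/to/submit.exitcode'
}

# Print the rewritten script to stdout.
sample_script | add_launcher pbsdsh
```

With a launcher in front, the exit code captured by the following /bin/echo line would be the launcher's rather than the worker's, so the provider would probably need to account for that as well.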