[Swift-devel] workers not initiated on all nodes/cpus in a block

Mihael Hategan hategan at mcs.anl.gov
Thu Jul 2 14:39:03 CDT 2009


This is with the PBS provider rather than Globus, right?

On Thu, 2009-07-02 at 14:32 -0500, Allan Espinosa wrote:
> Looking at the submit script, even though the coaster block requested
> 8 nodes, it still runs only 1 worker.
> 
> submit script found:
>  cat PBS2252235058660926788.submit
> #PBS -S /bin/sh
> #PBS -N null
> #PBS -m n
> #PBS -l nodes=8
> #PBS -l walltime=00:04:00
> #PBS -q short
> #PBS -o /home/aespinosa/.globus/scripts/PBS2252235058660926788.submit.stdout
> #PBS -e /home/aespinosa/.globus/scripts/PBS2252235058660926788.submit.stderr
> /usr/bin/perl /home/aespinosa/.globus/coasters/cscript5633964488332337528.pl http://128.135.125.116:47679 0702-050234-000004 1
> /bin/echo $? >/home/aespinosa/.globus/scripts/PBS2252235058660926788.submit.exitcode
> 
> 
> The /usr/bin/perl line should be prefixed with "pbsdsh" or an
> equivalent utility so that the script executes on all nodes/cpus. I think
> this is why in some instances the block requests more nodes
> but not all of them are active.
> 
> host information:
> [aespinosa@communicado ~]$ screen -r
> IWD: [NONE]  Executable:  [NONE]
> Bypass: 0  StartCount: 1
> PartitionMask: [ALL]
> Flags:       RESTARTABLE
> 
> Reservation '1122120' (-00:05:07 -> 00:22:53  Duration: 00:28:00)
> PE:  8.00  StartPriority:  1800
> 
> [aespinosa@tp-c105 scripts]$ ssh tp-c114 ps x
> Password:
>   PID TTY      STAT   TIME COMMAND
> 31815 ?        Ss     0:00 -sh
> 32054 ?        S      0:00 pbs_demux
> 32229 ?        S      0:00 -sh
> 32230 ?        S      0:00 /usr/bin/perl
> /home/aespinosa/.globus/coasters/cscript5633964488332337528.pl
> http://128.135.125.116:47679 0702-050234-000003 1
> 32231 ?        S      0:00 /usr/bin/perl
> /home/aespinosa/.globus/coasters/cscript5633964488332337528.pl
> http://128.135.125.116:47679 0702-050234-000003 1
> 32233 ?        S      0:00 /bin/bash
> /home/aespinosa/work/ampl/ampl-teraport_coaster/shared/_swiftwrap
> run_ampl-slru44dj -jobdir s -e /home/zzhang/SEE/static/run_ampl -out
> result/run1416/stdout -err stderr.txt -i -d
> |subproblems|result/run1416 -if
> template|armington.mod|armington_process.cmd|armington_output.cmd|subproblems/producer_tree.mod|ces.so
> -of result/run1416/expend.dat|result/run1416/limits.dat|result/run1416/price.dat|result/run1416/ratio.dat|result/run1416/solve.dat|result/run1416/stdout
> -k  -status files -a run1416 template armington.mod
> armington_process.cmd armington_output.cmd
> subproblems/producer_tree.mod ces.so
> 32256 ?        S      0:00 /bin/bash /home/zzhang/SEE/static/run_ampl
> run1416 template armington.mod armington_process.cmd
> armington_output.cmd subproblems/producer_tree.mod ces.so
> 32258 ?        S      0:19 ampl arm_test.cmd
> 32716 ?        R      0:37 pathampl /tmp/at32258 -AMPL
> 32726 ?        S      0:00 sshd: aespinosa@notty
> 32727 ?        Rs     0:00 ps x
> [aespinosa@tp-c105 scripts]$ ssh tp-c105 ps x
> Password:
>   PID TTY      STAT   TIME COMMAND
> 30721 ?        S      0:00 sshd: aespinosa@pts/0
> 30722 pts/0    Ss     0:00 -bash
> 30951 pts/0    S+     0:00 ssh tp-c105 ps x
> 30955 ?        S      0:00 sshd: aespinosa@notty
> 30956 ?        Rs     0:00 ps x
> [aespinosa@tp-c105 scripts]$ ssh tp-c102 ps x
> The authenticity of host 'tp-c102 (10.135.125.108)' can't be established.
> RSA key fingerprint is 60:dc:28:eb:f3:1b:ca:80:48:f2:32:f5:1e:3b:b3:d7.
> Are you sure you want to continue connecting (yes/no)? yes
> Warning: Permanently added 'tp-c102,10.135.125.108' (RSA) to the list
> of known hosts.
> Password:
>   PID TTY      STAT   TIME COMMAND
> 10274 ?        S      0:00 sshd: aespinosa@notty
> 10275 ?        Rs     0:00 ps x
> ...
> ...
> 
> 
> swift session snapshot:
> Progress:  Selecting site:1014  Submitted:8  Active:1
> Progress:  Selecting site:1014  Submitted:8  Active:1
> Progress:  Selecting site:1014  Submitted:8  Active:1
> Progress:  Selecting site:1014  Submitted:8  Active:1
> 
> queue information:
> ACTIVE JOBS--------------------
> JOBNAME            USERNAME      STATE  PROC   REMAINING            STARTTIME
> 
> 1122120            aespinosa    Running     8    00:19:53  Thu Jul  2 14:22:19
> 
>      1 Active Job      171 of  200 Processors Active (85.50%)
>                        100 of  100 Nodes Active      (100.00%)
> 
> 
> 
> 
> 
> -- 
> Allan M. Espinosa <http://allan.88-mph.net/blog>
> PhD student, Computer Science
> University of Chicago <http://people.cs.uchicago.edu/~aespinosa>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
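
The suggestion in the quoted message (prefixing the worker command with a
launcher so it starts on every allocated slot) could be sketched roughly as
below. This is only an illustration, not what the PBS provider actually
generates: make_submit is a hypothetical helper, the worker arguments are
placeholders, and it assumes Torque's pbsdsh is installed on the cluster.
As far as I understand it, pbsdsh with no options spawns one copy of the
command per allocated virtual processor, and "pbsdsh -u" runs it once per
unique node.

```shell
#!/bin/sh
# Hypothetical helper: print a PBS submit script whose worker command is
# optionally prefixed with a launcher (assumed here to be Torque's pbsdsh).
make_submit() {
    nodes=$1      # number of nodes requested for the block
    launcher=$2   # e.g. "pbsdsh"; an empty string gives the unprefixed command
    shift 2       # remaining arguments are the worker command line
    cat <<EOF
#PBS -S /bin/sh
#PBS -l nodes=$nodes
${launcher:+$launcher }$*
EOF
}

# Placeholder worker arguments; the real ones come from the coaster service.
make_submit 8 pbsdsh /usr/bin/perl worker.pl http://host:port block-id 1
```

Passing an empty launcher reproduces the single-worker form seen in the
submit script above, which makes it easy to compare the two variants.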



