[Swift-devel] workers not initiated on all nodes/cpus in a block

Allan Espinosa aespinosa at cs.uchicago.edu
Thu Jul 2 14:42:25 CDT 2009


Yup, PBS provider. I'll check whether the same happens with the Globus GT2 provider.

-Allan

2009/7/2 Mihael Hategan <hategan at mcs.anl.gov>:
> This is with the PBS provider rather than Globus, right?
>
> On Thu, 2009-07-02 at 14:32 -0500, Allan Espinosa wrote:
>> Looking at the submit script from before, even though the coaster block
>> requested 8 nodes, it still runs only 1 worker.
>>
>> submit script found:
>>  cat PBS2252235058660926788.submit
>> #PBS -S /bin/sh
>> #PBS -N null
>> #PBS -m n
>> #PBS -l nodes=8
>> #PBS -l walltime=00:04:00
>> #PBS -q short
>> #PBS -o /home/aespinosa/.globus/scripts/PBS2252235058660926788.submit.stdout
>> #PBS -e /home/aespinosa/.globus/scripts/PBS2252235058660926788.submit.stderr
>> /usr/bin/perl /home/aespinosa/.globus/coasters/cscript5633964488332337528.pl http://128.135.125.116:47679 0702-050234-000004 1
>> /bin/echo $? >/home/aespinosa/.globus/scripts/PBS2252235058660926788.submit.exitcode
>>
>>
>> The /usr/bin/perl line should be prefixed with "pbsdsh" or an equivalent
>> utility so that the script executes on all nodes/CPUs. I think this is
>> the reason why in some instances the block requests more nodes but not
>> all of them are active.
>>
>> host information:
>> [aespinosa@communicado ~]$ screen -r
>> IWD: [NONE]  Executable:  [NONE]
>> Bypass: 0  StartCount: 1
>> PartitionMask: [ALL]
>> Flags:       RESTARTABLE
>>
>> Reservation '1122120' (-00:05:07 -> 00:22:53  Duration: 00:28:00)
>> PE:  8.00  StartPriority:  1800
>>
>> [aespinosa@tp-c105 scripts]$ ssh tp-c114 ps x
>> Password:
>>   PID TTY      STAT   TIME COMMAND
>> 31815 ?        Ss     0:00 -sh
>> 32054 ?        S      0:00 pbs_demux
>> 32229 ?        S      0:00 -sh
>> 32230 ?        S      0:00 /usr/bin/perl
>> /home/aespinosa/.globus/coasters/cscript5633964488332337528.pl
>> http://128.135.125.116:47679 0702-050234-000003 1
>> 32231 ?        S      0:00 /usr/bin/perl
>> /home/aespinosa/.globus/coasters/cscript5633964488332337528.pl
>> http://128.135.125.116:47679 0702-050234-000003 1
>> 32233 ?        S      0:00 /bin/bash
>> /home/aespinosa/work/ampl/ampl-teraport_coaster/shared/_swiftwrap
>> run_ampl-slru44dj -jobdir s -e /home/zzhang/SEE/static/run_ampl -out
>> result/run1416/stdout -err stderr.txt -i -d
>> |subproblems|result/run1416 -if
>> template|armington.mod|armington_process.cmd|armington_output.cmd|subproblems/producer_tree.mod|ces.so
>> -of result/run1416/expend.dat|result/run1416/limits.dat|result/run1416/price.dat|result/run1416/ratio.dat|result/run1416/solve.dat|result/run1416/stdout
>> -k  -status files -a run1416 template armington.mod
>> armington_process.cmd armington_output.cmd
>> subproblems/producer_tree.mod ces.so
>> 32256 ?        S      0:00 /bin/bash /home/zzhang/SEE/static/run_ampl
>> run1416 template armington.mod armington_process.cmd
>> armington_output.cmd subproblems/producer_tree.mod ces.so
>> 32258 ?        S      0:19 ampl arm_test.cmd
>> 32716 ?        R      0:37 pathampl /tmp/at32258 -AMPL
>> 32726 ?        S      0:00 sshd: aespinosa@notty
>> 32727 ?        Rs     0:00 ps x
>> [aespinosa@tp-c105 scripts]$ ssh tp-c105 ps x
>> Password:
>>   PID TTY      STAT   TIME COMMAND
>> 30721 ?        S      0:00 sshd: aespinosa at pts/0
>> 30722 pts/0    Ss     0:00 -bash
>> 30951 pts/0    S+     0:00 ssh tp-c105 ps x
>> 30955 ?        S      0:00 sshd: aespinosa at notty
>> 30956 ?        Rs     0:00 ps x
>> [aespinosa@tp-c105 scripts]$ ssh tp-c102 ps x
>> The authenticity of host 'tp-c102 (10.135.125.108)' can't be established.
>> RSA key fingerprint is 60:dc:28:eb:f3:1b:ca:80:48:f2:32:f5:1e:3b:b3:d7.
>> Are you sure you want to continue connecting (yes/no)? yes
>> Warning: Permanently added 'tp-c102,10.135.125.108' (RSA) to the list
>> of known hosts.
>> Password:
>>   PID TTY      STAT   TIME COMMAND
>> 10274 ?        S      0:00 sshd: aespinosa at notty
>> 10275 ?        Rs     0:00 ps x
>> ...
>> ...
>>
>>
>> swift session snapshot:
>> Progress:  Selecting site:1014  Submitted:8  Active:1
>> Progress:  Selecting site:1014  Submitted:8  Active:1
>> Progress:  Selecting site:1014  Submitted:8  Active:1
>> Progress:  Selecting site:1014  Submitted:8  Active:1
>>
>> queue information:
>> ACTIVE JOBS--------------------
>> JOBNAME            USERNAME      STATE  PROC   REMAINING            STARTTIME
>>
>> 1122120            aespinosa    Running     8    00:19:53  Thu Jul  2 14:22:19
>>
>>      1 Active Job      171 of  200 Processors Active (85.50%)
>>                        100 of  100 Nodes Active      (100.00%)
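
A minimal sketch of the change suggested in the quoted message, assuming a
TORQUE-style "pbsdsh" is available on the node where the submit script runs.
The helper name wrap_worker_line, the file name worker.pl, and the URL and
block id in the example are hypothetical placeholders, not part of Swift or
the coaster code. It only shows the rewritten launch line; whether several
workers can safely share one block id is not settled in this thread.

```shell
#!/bin/sh
# Hypothetical helper: read a PBS submit script on stdin and write it to
# stdout with the worker launch line prefixed by pbsdsh, so that the worker
# script is spawned on the allocated nodes instead of only on the first one.
wrap_worker_line() {
    # Only lines that start with "/usr/bin/perl " are changed; #PBS
    # directives and the exit-code line pass through untouched.
    sed 's|^/usr/bin/perl |pbsdsh /usr/bin/perl |'
}

# Example with placeholder values (worker.pl, URL, and block-id are made up).
printf '%s\n' \
    '#PBS -l nodes=8' \
    '/usr/bin/perl worker.pl http://example.invalid:1 block-id 1' |
    wrap_worker_line
```

By default pbsdsh starts one copy of the program per allocated slot, while
"pbsdsh -u" starts one copy per unique host. Which of the two fits depends on
how many workers a block is expected to run per node.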


