[Swift-devel] workers not initiated on all nodes/cpus in a block
Allan Espinosa
aespinosa at cs.uchicago.edu
Thu Jul 2 14:42:25 CDT 2009
Yup, PBS provider. I'll check whether the same happens with the Globus GT2 provider.
-Allan
2009/7/2 Mihael Hategan <hategan at mcs.anl.gov>:
> This is with the PBS provider rather than Globus, right?
>
> On Thu, 2009-07-02 at 14:32 -0500, Allan Espinosa wrote:
>> Looking at the earlier submit script: even though the coaster block
>> requested 8 nodes, it still runs only 1 worker.
>>
>> submit script found:
>> cat PBS2252235058660926788.submit
>> #PBS -S /bin/sh
>> #PBS -N null
>> #PBS -m n
>> #PBS -l nodes=8
>> #PBS -l walltime=00:04:00
>> #PBS -q short
>> #PBS -o /home/aespinosa/.globus/scripts/PBS2252235058660926788.submit.stdout
>> #PBS -e /home/aespinosa/.globus/scripts/PBS2252235058660926788.submit.stderr
>> /usr/bin/perl /home/aespinosa/.globus/coasters/cscript5633964488332337528.pl http://128.135.125.116:47679 0702-050234-000004 1
>> /bin/echo $? >/home/aespinosa/.globus/scripts/PBS2252235058660926788.submit.exitcode
>>
>>
>> The /usr/bin/perl line should be prefixed with "pbsdsh" or an
>> equivalent utility so that the script runs on all nodes/CPUs. I think
>> this is why in some instances a block requests several nodes
>> but not all of them become active.
>>
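A minimal sketch of the proposed change, assuming the utility meant above is TORQUE/PBS's pbsdsh, which spawns a program on the nodes allocated to the job. The helper name build_launch_line is hypothetical and not part of Swift or the PBS provider; it only illustrates how the launch line in the generated submit script might differ for multi-node blocks. Whether the coaster service accepts several workers started with the same block ID argument is not established in this thread.

```shell
#!/bin/sh
# Sketch only: build the worker launch line for a PBS submit script.
# WORKER_CMD is the perl command from the submit script quoted above.
WORKER_CMD="/usr/bin/perl /home/aespinosa/.globus/coasters/cscript5633964488332337528.pl http://128.135.125.116:47679 0702-050234-000004 1"

# build_launch_line NODES CMD... (hypothetical helper):
# for a multi-node request, prefix the command with pbsdsh so that it is
# spawned on the allocated nodes instead of only on the first one.
build_launch_line() {
    nodes=$1
    shift
    if [ "$nodes" -gt 1 ]; then
        echo "pbsdsh $*"
    else
        echo "$*"
    fi
}

# WORKER_CMD is left unquoted on purpose so it splits into separate words.
build_launch_line 8 $WORKER_CMD
```

With nodes=8 the printed line is the original perl command prefixed with pbsdsh; with a single node the command is printed unchanged. Note that pbsdsh only works from inside a running PBS job, so the resulting line belongs in the submit script itself, not on the submit host.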
>> host information:
>> [aespinosa at communicado ~]$ screen -r
>> IWD: [NONE] Executable: [NONE]
>> Bypass: 0 StartCount: 1
>> PartitionMask: [ALL]
>> Flags: RESTARTABLE
>>
>> Reservation '1122120' (-00:05:07 -> 00:22:53 Duration: 00:28:00)
>> PE: 8.00 StartPriority: 1800
>>
>> [aespinosa at tp-c105 scripts]$ ssh tp-c114 ps x
>> Password:
>> PID TTY STAT TIME COMMAND
>> 31815 ? Ss 0:00 -sh
>> 32054 ? S 0:00 pbs_demux
>> 32229 ? S 0:00 -sh
>> 32230 ? S 0:00 /usr/bin/perl
>> /home/aespinosa/.globus/coasters/cscript5633964488332337528.pl
>> http://128.135.125.116:47679 0702-050234-000003 1
>> 32231 ? S 0:00 /usr/bin/perl
>> /home/aespinosa/.globus/coasters/cscript5633964488332337528.pl
>> http://128.135.125.116:47679 0702-050234-000003 1
>> 32233 ? S 0:00 /bin/bash
>> /home/aespinosa/work/ampl/ampl-teraport_coaster/shared/_swiftwrap
>> run_ampl-slru44dj -jobdir s -e /home/zzhang/SEE/static/run_ampl -out
>> result/run1416/stdout -err stderr.txt -i -d
>> |subproblems|result/run1416 -if
>> template|armington.mod|armington_process.cmd|armington_output.cmd|subproblems/producer_tree.mod|ces.so
>> -of result/run1416/expend.dat|result/run1416/limits.dat|result/run1416/price.dat|result/run1416/ratio.dat|result/run1416/solve.dat|result/run1416/stdout
>> -k -status files -a run1416 template armington.mod
>> armington_process.cmd armington_output.cmd
>> subproblems/producer_tree.mod ces.so
>> 32256 ? S 0:00 /bin/bash /home/zzhang/SEE/static/run_ampl
>> run1416 template armington.mod armington_process.cmd
>> armington_output.cmd subproblems/producer_tree.mod ces.so
>> 32258 ? S 0:19 ampl arm_test.cmd
>> 32716 ? R 0:37 pathampl /tmp/at32258 -AMPL
>> 32726 ? S 0:00 sshd: aespinosa at notty
>> 32727 ? Rs 0:00 ps x
>> [aespinosa at tp-c105 scripts]$ ssh tp-c105 ps x
>> Password:
>> PID TTY STAT TIME COMMAND
>> 30721 ? S 0:00 sshd: aespinosa at pts/0
>> 30722 pts/0 Ss 0:00 -bash
>> 30951 pts/0 S+ 0:00 ssh tp-c105 ps x
>> 30955 ? S 0:00 sshd: aespinosa at notty
>> 30956 ? Rs 0:00 ps x
>> [aespinosa at tp-c105 scripts]$ ssh tp-c102 ps x
>> The authenticity of host 'tp-c102 (10.135.125.108)' can't be established.
>> RSA key fingerprint is 60:dc:28:eb:f3:1b:ca:80:48:f2:32:f5:1e:3b:b3:d7.
>> Are you sure you want to continue connecting (yes/no)? yes
>> Warning: Permanently added 'tp-c102,10.135.125.108' (RSA) to the list
>> of known hosts.
>> Password:
>> PID TTY STAT TIME COMMAND
>> 10274 ? S 0:00 sshd: aespinosa at notty
>> 10275 ? Rs 0:00 ps x
>> ...
>> ...
>>
>>
>> swift session snapshot:
>> Progress: Selecting site:1014 Submitted:8 Active:1
>> Progress: Selecting site:1014 Submitted:8 Active:1
>> Progress: Selecting site:1014 Submitted:8 Active:1
>> Progress: Selecting site:1014 Submitted:8 Active:1
>>
>> queue information:
>> ACTIVE JOBS--------------------
>> JOBNAME USERNAME STATE PROC REMAINING STARTTIME
>>
>> 1122120 aespinosa Running 8 00:19:53 Thu Jul 2 14:22:19
>>
>> 1 Active Job 171 of 200 Processors Active (85.50%)
>> 100 of 100 Nodes Active (100.00%)