[Swift-devel] active jobs vs available processors on submitted coaster queues
Allan Espinosa
aespinosa at cs.uchicago.edu
Wed Jun 10 16:42:35 CDT 2009
Here's a run with 1k jobs: only 2 jobs were active. I think the 18 procs
shown here in the LRM are from the 2nd block request:
[aespinosa at tg-login1 ~]$ showq -u $USER
active jobs------------------------
JOBID USERNAME STATE PROCS REMAINING STARTTIME
2016757 aespinos Running 18 00:15:09 Wed Jun 10 16:29:31
1 active job 18 of 114 processors in use by local jobs (15.79%)
50 of 57 nodes active (87.72%)
eligible jobs----------------------
JOBID USERNAME STATE PROCS WCLIMIT QUEUETIME
0 eligible jobs
blocked jobs-----------------------
JOBID USERNAME STATE PROCS WCLIMIT QUEUETIME
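To put a number on the mismatch, here is a rough Python sketch that pulls the processor counts out of the showq summary line and compares them with the 2 active Swift jobs. The helper names are made up for illustration, and the second figure assumes one Swift job per held processor, which may not match how the coaster workers are actually laid out.

```python
import re

# Matches the "N of M processors in use" part of the showq summary line.
SUMMARY_RE = re.compile(r"(\d+) of (\d+) processors in use")

def processors_in_use(line):
    """Return (used, total) from a showq summary line, or None if absent."""
    m = SUMMARY_RE.search(line)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2))

def percent(part, whole):
    """Percentage of whole represented by part, rounded to 2 decimals."""
    return round(100.0 * part / whole, 2)

line = "1 active job 18 of 114 processors in use by local jobs (15.79%)"
used, total = processors_in_use(line)

# Share of the cluster's processors reported in use by showq.
print(percent(used, total))

# Share of the held processors running a Swift job,
# ASSUMING one job per processor and the 2 active jobs seen in this run.
active_swift_jobs = 2
print(percent(active_swift_jobs, used))
```

The first value should agree with the percentage showq prints; the second is only as good as the one-job-per-processor assumption.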
swift session:
Swift svn swift-r2949 cog-r2406
RunID: out.run_000
Progress:
Progress: uninitialized:1
Progress: Initializing:1000 Selecting site:1
Progress: Selecting site:1000 Initializing site shared directory:1
Progress: Selecting site:999 Initializing site shared directory:1 Stage in:1
Progress: Selecting site:996 Stage in:5
Progress: Selecting site:996 Stage in:5
Progress: Selecting site:995 Stage in:6
Progress: Selecting site:994 Stage in:7
Progress: Selecting site:994 Stage in:7
Progress: Selecting site:993 Stage in:8
Progress: Selecting site:993 Stage in:8
Progress: Selecting site:993 Stage in:8
Progress: Selecting site:993 Stage in:8
Progress: Selecting site:992 Stage in:9
Progress: Selecting site:992 Stage in:9
Progress: Selecting site:992 Stage in:8 Submitting:1
Progress: Selecting site:991 Stage in:1 Submitting:8 Submitted:1
Progress: Selecting site:991 Submitted:9 Active:1
Progress: Selecting site:991 Submitted:9 Active:1
Progress: Selecting site:991 Submitted:8 Active:2
Progress: Selecting site:991 Submitted:1 Active:2 Checking
status:6 Failed but can retry:1
Progress: Selecting site:991 Active:1 Checking status:4 Failed but
can retry:5
Progress: Selecting site:990 Stage in:1 Active:1 Failed but can retry:9
Progress: Selecting site:990 Active:1 Checking status:1 Failed but
can retry:9
Progress: Selecting site:989 Submitting:1 Active:1 Failed but can retry:10
Progress: Selecting site:989 Active:1 Checking status:1 Failed but
can retry:10
Progress: Selecting site:988 Submitting:1 Active:1 Failed but can retry:11
Progress: Selecting site:988 Active:1 Checking status:1 Failed but
can retry:11
Progress: Selecting site:987 Submitting:1 Active:1 Failed but can retry:12
Progress: Selecting site:987 Active:1 Checking status:1 Failed but
can retry:12
Progress: Selecting site:986 Stage in:1 Active:1 Failed but can retry:13
Progress: Selecting site:986 Active:1 Checking status:1 Failed but
can retry:13
Progress: Selecting site:985 Stage in:1 Active:1 Failed but can retry:14
Progress: Selecting site:985 Active:1 Checking status:1 Failed but
can retry:14
Progress: Selecting site:984 Stage in:1 Active:1 Failed but can retry:15
Progress: Selecting site:984 Active:1 Checking status:1 Failed but
can retry:15
Progress: Selecting site:983 Stage in:1 Active:1 Failed but can retry:16
Progress: Selecting site:983 Active:2 Failed but can retry:16
Progress: Selecting site:983 Active:2 Failed but can retry:16
Progress: Selecting site:983 Active:1 Checking status:1 Failed but
can retry:16
Progress: Selecting site:982 Stage in:1 Active:1 Finished
successfully:1 Failed but can retry:16
Progress: Selecting site:982 Active:1 Checking status:1 Finished
successfully:1 Failed but can retry:16
Progress: Selecting site:981 Submitting:1 Active:1 Finished
successfully:1 Failed but can retry:17
Progress: Selecting site:981 Active:1 Finished successfully:1
Failed but can retry:18
Progress: Selecting site:980 Submitting:1 Active:1 Finished
successfully:1 Failed but can retry:18
Progress: Selecting site:980 Active:1 Checking status:1 Finished
successfully:1 Failed but can retry:18
Progress: Selecting site:979 Stage in:1 Active:1 Finished
successfully:1 Failed but can retry:19
Progress: Selecting site:979 Active:1 Checking status:1 Finished
successfully:1 Failed but can retry:19
Progress: Selecting site:979 Active:1 Finished successfully:1
Failed but can retry:20
Progress: Selecting site:978 Stage in:1 Active:1 Finished
successfully:1 Failed but can retry:20
Progress: Selecting site:978 Active:1 Checking status:1 Finished
successfully:1 Failed but can retry:20
Progress: Selecting site:977 Stage in:1 Active:1 Finished
successfully:1 Failed but can retry:21
Progress: Selecting site:977 Active:1 Checking status:1 Finished
successfully:1 Failed but can retry:21
Progress: Selecting site:976 Stage in:1 Active:1 Finished
successfully:1 Failed but can retry:22
Progress: Selecting site:976 Submitted:1 Active:1 Finished
successfully:1 Failed but can retry:22
Progress: Selecting site:976 Submitted:1 Active:1 Finished
successfully:1 Failed but can retry:22
Progress: Selecting site:976 Submitted:1 Active:1 Finished
successfully:1 Failed but can retry:22
Progress: Selecting site:976 Submitted:1 Active:1 Finished
successfully:1 Failed but can retry:22
Progress: Selecting site:976 Submitted:1 Finished successfully:1
Failed but can retry:23
Progress: Selecting site:975 Stage in:1 Submitted:1 Finished
successfully:1 Failed but can retry:23
Progress: Selecting site:975 Submitted:2 Finished successfully:1
Failed but can retry:23
Progress: Selecting site:975 Submitted:2 Finished successfully:1
Failed but can retry:23
Progress: Selecting site:975 Submitted:2 Finished successfully:1
Failed but can retry:23
Progress: Selecting site:975 Submitted:2 Finished successfully:1
Failed but can retry:23
Progress: Selecting site:975 Submitted:2 Finished successfully:1
Failed but can retry:23
Progress: Selecting site:975 Submitted:2 Finished successfully:1
Failed but can retry:23
Progress: Selecting site:975 Submitted:2 Finished successfully:1
Failed but can retry:23
Progress: Selecting site:975 Submitted:2 Finished successfully:1
Failed but can retry:23
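For anyone who wants to tabulate these counts instead of eyeballing them, below is a minimal Python sketch that parses the "Progress:" lines into per-state counts. It assumes the format is exactly what appears in this session (state names that may contain spaces, each followed by a colon and a count), and it rejoins entries that were wrapped onto a second line in the mail. The function names are hypothetical and not part of Swift.

```python
import re

# Matches "<state name>:<count>" pairs; state names may contain spaces.
STATE_RE = re.compile(r"([A-Za-z][A-Za-z ]*?):(\d+)")

def parse_progress(line):
    """Return {state: count} for one 'Progress:' line, or None otherwise."""
    prefix = "Progress:"
    if not line.startswith(prefix):
        return None
    return {name: int(n) for name, n in STATE_RE.findall(line[len(prefix):])}

def unwrap(lines):
    """Rejoin entries that were wrapped onto a continuation line."""
    out = []
    for raw in lines:
        line = raw.strip()
        if line.startswith("Progress:") or not out:
            out.append(line)
        else:
            out[-1] += " " + line
    return out

def peak_active(lines):
    """Largest 'Active' count seen across all progress entries."""
    counts = [parse_progress(l) for l in unwrap(lines)]
    return max((c.get("Active", 0) for c in counts if c is not None), default=0)

sample = [
    "Progress: Selecting site:991 Submitted:8 Active:2",
    "Progress: Selecting site:991 Submitted:1 Active:2 Checking",
    "status:6 Failed but can retry:1",
]
print(peak_active(sample))
```

Running peak_active over the whole session text gives the maximum number of simultaneously active jobs, which can then be compared against the processors the LRM reports.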
2009/6/10 Allan Espinosa <aespinosa at cs.uchicago.edu>:
> hi mihael,
>
> I reran the job and attached the log files (coaster log, swift-log, gram logs).
>
> swift session:
> Progress: Submitted:1 Active:1 Finished successfully:4
> Progress: Submitted:1 Active:1 Finished successfully:4
> Progress: Submitted:1 Active:1 Finished successfully:4
> Progress: Submitted:1 Checking status:1 Finished successfully:4
> Progress: Active:1 Finished successfully:5
> Progress: Active:1 Finished successfully:5
> Progress: Active:1 Finished successfully:5
> Progress: Active:1 Finished successfully:5
> Progress: Active:1 Finished successfully:5
> Progress: Active:1 Finished successfully:5
> Progress: Active:1 Finished successfully:5
> Progress: Active:1 Finished successfully:5
> Progress: Checking status:1 Finished successfully:5
> Progress: Stage out:1 Finished successfully:5
> Progress: Submitted:1 Finished successfully:6
> Progress: Submitted:1 Finished successfully:6
> Progress: Submitted:1 Finished successfully:6
> Progress: Submitted:1 Finished successfully:6
> Progress: Submitted:1 Finished successfully:6
> ...
>
> sites.xml (i may have changed it during this run):
> <config>
> <pool handle="UCANL" sysinfo="INTEL32::LINUX">
> <execution provider="coaster"
> url="tg-grid.uc.teragrid.org" jobmanager="gt2:gt2:pbs" />
> <gridftp url="gsiftp://tg-gridftp.uc.teragrid.org" />
> <workdirectory >/home/aespinosa/blast-runs</workdirectory>
>
> <profile namespace="karajan" key="initialScore">1</profile>
> <profile namespace="karajan" key="jobThrottle">1.26</profile>
>
> <profile namespace="globus"
> key="host_types">ia64-compute</profile>
> <profile namespace="globus" key="slots">4</profile>
> <profile namespace="globus" key="maxnodes">2</profile>
> </pool>
> </config>
>
> It looks like the last job was submitted but has not yet registered
> with the GRAM service at the UCANL remote site. At this point the
> coaster for the previous 5 jobs had already ended.
> -Allan
>
> 2009/6/10 Mihael Hategan <hategan at mcs.anl.gov>:
>> I need to look at the coaster log.
>>
>> On Tue, 2009-06-09 at 15:10 -0500, Allan Espinosa wrote:
>>> I was expecting to have 2 active jobs at a time from the swift log but
>>> instead got only one at a time:
>>> Swift svn swift-r2949 cog-r2406
>>>
>>> RunID: out.run_000
>>> Progress:
>>> Progress: Selecting site:4 Initializing site shared directory:1 Stage in:1
>>> Progress: Stage in:6
>>> Progress: Stage in:6
>>>
>>>
>>>
>>> Progress: Stage in:6
>>> Progress: Stage in:6
>>> Progress: Stage in:6
>>> Progress: Stage in:6
>>> Progress: Stage in:5 Submitting:1
>>> Progress: Submitting:5 Submitted:1
>>> Progress: Submitted:6
>>> Progress: Submitted:5 Active:1
>>> Progress: Submitted:5 Active:1
>>> Progress: Submitted:5 Active:1
>>> Progress: Submitted:5 Active:1
>>> Progress: Submitted:5 Active:1
>>> Progress: Submitted:5 Checking status:1
>>> Progress: Submitted:4 Active:1 Finished successfully:1
>>> Progress: Submitted:4 Active:1 Finished successfully:1
>>> Progress: Submitted:4 Active:1 Finished successfully:1
>>> Progress: Submitted:4 Active:1 Finished successfully:1
>>> Progress: Submitted:4 Active:1 Finished successfully:1
>>> Progress: Submitted:4 Active:1 Finished successfully:1
>>> Progress: Submitted:4 Checking status:1 Finished successfully:1
>>> Progress: Submitted:3 Active:1 Finished successfully:2
>>> Progress: Submitted:3 Active:1 Finished successfully:2
>>> Progress: Submitted:3 Active:1 Finished successfully:2
>>> Progress: Submitted:3 Checking status:1 Finished successfully:2
>>> Progress: Submitted:2 Active:1 Finished successfully:3
>>> ...
>>> ...
>>>
>>>
>>> uc-teragrid queue status: $showq -u $USER
>>> [aespinosa at tg-login1 ~]$ showq -u $USER
>>>
>>> active jobs------------------------
>>> JOBID USERNAME STATE PROCS REMAINING STARTTIME
>>>
>>> 2015982 aespinos Running 2 00:55:41 Tue Jun 9 15:02:18
>>>
>>> 1 active job 2 of 116 processors in use by local jobs (1.72%)
>>> 42 of 58 nodes active (72.41%)
>>>
>>> eligible jobs----------------------
>>> JOBID USERNAME STATE PROCS WCLIMIT QUEUETIME
>>>
>>>
>>> 0 eligible jobs
>>>
>>> blocked jobs-----------------------
>>> JOBID USERNAME STATE PROCS WCLIMIT QUEUETIME
>>>
>>>
>>> 0 blocked jobs
>>>
>>> Total job: 1
>>>
>>>
>>> sites.xml:
>>> <config>
>>> <pool handle="UCANL" sysinfo="INTEL32::LINUX">
>>> <execution provider="coaster"
>>> url="tg-grid.uc.teragrid.org" jobmanager="gt2:gt2:pbs" />
>>> <gridftp url="gsiftp://tg-gridftp.uc.teragrid.org" />
>>> <workdirectory >/home/aespinosa/blast-runs</workdirectory>
>>>
>>> <profile namespace="karajan" key="initialScore">5</profile>
>>> <profile namespace="karajan" key="jobThrottle">1.26</profile>
>>>
>>> <profile namespace="globus"
>>> key="host_types">ia64-compute</profile>
>>> <profile namespace="globus" key="slots">4</profile>
>>> <profile namespace="globus" key="maxnodes">16</profile>
>>> </pool>
>>> </config>
>
--
Allan M. Espinosa <http://allan.88-mph.net/blog>
PhD student, Computer Science
University of Chicago <http://people.cs.uchicago.edu/~aespinosa>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: tarball.tar.gz
Type: application/x-gzip
Size: 182455 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20090610/40c88233/attachment.bin>