[Swift-user] Queuing issue with SWIFT on Orthros, APS cluster
Hemant Sharma
hsharma at anl.gov
Tue Jun 24 12:25:07 CDT 2014
Hi guys,
I'm having a queuing problem with SWIFT. I have a Swift script that
should execute about 27000 iterations. To limit the initial memory
size, I created the following config file:
use.provider.staging=false
provider.staging.pin.swiftfiles=false
use.wrapper.staging=false
status.mode=provider
wrapperlog.always.transfer=true
execution.retries=0
lazy.errors=true
sitedir.keep=true
file.gc.enabled=false
wrapper.parameter.mode=files
foreach.max.threads=330
Sometimes, when I execute the script, it starts with 320 active jobs
(on 320 processors), but after some time it gets stuck with 330
submitted jobs, none of which are active. Example screen output:
Swift 0.94 swift-r6637 cog-r3742
RunID: 20140624-1200-qke4z0v8
Progress: time: Tue, 24 Jun 2014 12:00:30 -0500
Progress: time: Tue, 24 Jun 2014 12:00:31 -0500 Selecting site:328
Initializing site shared directory:1 Stage in:1
Progress: time: Tue, 24 Jun 2014 12:00:32 -0500 Selecting site:10
Stage in:277 Submitting:3 Submitted:40
Progress: time: Tue, 24 Jun 2014 12:00:38 -0500 Selecting site:10
Submitted:319 Active:1
Progress: time: Tue, 24 Jun 2014 12:00:43 -0500 Selecting site:10
Active:319 Checking status:1
Progress: time: Tue, 24 Jun 2014 12:00:44 -0500 Selecting site:1 Stage
in:20 Active:208 Checking status:31 Stage out:70 Finished
successfully:21
Progress: time: Tue, 24 Jun 2014 12:00:45 -0500 Stage in:11
Active:120 Checking status:10 Stage out:189 Finished successfully:31
Progress: time: Tue, 24 Jun 2014 12:00:46 -0500 Stage in:25
Active:117 Stage out:188 Finished successfully:54
Progress: time: Tue, 24 Jun 2014 12:00:47 -0500 Initializing:1
Selecting site:1 Stage in:46 Active:118 Stage out:164 Finished
successfully:86
Progress: time: Tue, 24 Jun 2014 12:00:48 -0500 Selecting site:2 Stage
in:102 Submitting:1 Submitted:2 Active:165 Checking status:1 Stage
out:57 Finished successfully:199
Progress: time: Tue, 24 Jun 2014 12:00:49 -0500 Submitted:5
Active:324 Checking status:1 Finished successfully:265
Progress: time: Tue, 24 Jun 2014 12:00:50 -0500 Submitted:12
Active:317 Checking status:1 Finished successfully:272
Progress: time: Tue, 24 Jun 2014 12:00:51 -0500 Submitted:22
Active:307 Finished successfully:283
Progress: time: Tue, 24 Jun 2014 12:00:52 -0500 Selecting site:1 Stage
in:13 Submitted:47 Active:223 Stage out:46 Finished successfully:321
Progress: time: Tue, 24 Jun 2014 12:00:53 -0500 Stage in:28
Submitted:73 Active:153 Stage out:75 Finished successfully:362
Progress: time: Tue, 24 Jun 2014 12:00:55 -0500 Submitted:182
Active:147 Checking status:1 Finished successfully:442
Progress: time: Tue, 24 Jun 2014 12:00:57 -0500 Submitted:183
Active:146 Checking status:1 Finished successfully:443
Progress: time: Tue, 24 Jun 2014 12:01:00 -0500 Submitted:185
Active:144 Checking status:1 Finished successfully:445
Progress: time: Tue, 24 Jun 2014 12:01:01 -0500 Submitted:186
Active:143 Checking status:1 Finished successfully:446
Progress: time: Tue, 24 Jun 2014 12:01:02 -0500 Submitted:190
Active:139 Checking status:1 Finished successfully:450
Progress: time: Tue, 24 Jun 2014 12:01:05 -0500 Submitted:193
Active:136 Checking status:1 Finished successfully:453
Progress: time: Tue, 24 Jun 2014 12:01:07 -0500 Submitted:196
Active:133 Checking status:1 Finished successfully:456
Progress: time: Tue, 24 Jun 2014 12:01:09 -0500 Submitted:198
Active:131 Checking status:1 Finished successfully:458
Progress: time: Tue, 24 Jun 2014 12:01:10 -0500 Stage in:5
Submitted:202 Active:63 Stage out:60 Finished successfully:467
Progress: time: Tue, 24 Jun 2014 12:01:11 -0500 Submitted:273
Active:56 Checking status:1 Finished successfully:533
Progress: time: Tue, 24 Jun 2014 12:01:13 -0500 Submitted:282
Active:47 Checking status:1 Finished successfully:542
Progress: time: Tue, 24 Jun 2014 12:01:14 -0500 Submitting:1
Submitted:292 Active:37 Finished successfully:553
Progress: time: Tue, 24 Jun 2014 12:01:15 -0500 Submitted:298
Active:31 Checking status:1 Finished successfully:558
Progress: time: Tue, 24 Jun 2014 12:01:16 -0500 Submitted:305
Active:24 Checking status:1 Finished successfully:565
Progress: time: Tue, 24 Jun 2014 12:01:17 -0500 Submitted:307
Active:22 Checking status:1 Finished successfully:567
Progress: time: Tue, 24 Jun 2014 12:01:18 -0500 Submitted:313
Active:16 Checking status:1 Finished successfully:573
Progress: time: Tue, 24 Jun 2014 12:01:20 -0500 Submitted:315
Active:14 Checking status:1 Finished successfully:575
Progress: time: Tue, 24 Jun 2014 12:01:21 -0500 Submitted:317
Active:12 Checking status:1 Finished successfully:577
Progress: time: Tue, 24 Jun 2014 12:01:22 -0500 Submitted:319
Active:10 Checking status:1 Finished successfully:579
Progress: time: Tue, 24 Jun 2014 12:01:23 -0500 Submitted:320
Active:9 Checking status:1 Finished successfully:580
Progress: time: Tue, 24 Jun 2014 12:01:25 -0500 Submitted:323
Active:6 Checking status:1 Finished successfully:583
Progress: time: Tue, 24 Jun 2014 12:01:26 -0500 Submitted:324
Active:5 Checking status:1 Finished successfully:584
Progress: time: Tue, 24 Jun 2014 12:01:27 -0500 Submitted:325
Active:4 Checking status:1 Finished successfully:585
Progress: time: Tue, 24 Jun 2014 12:01:29 -0500 Submitted:326
Active:3 Checking status:1 Finished successfully:586
Progress: time: Tue, 24 Jun 2014 12:01:36 -0500 Submitted:327
Active:2 Checking status:1 Finished successfully:587
Progress: time: Tue, 24 Jun 2014 12:01:39 -0500 Submitted:328
Active:1 Checking status:1 Finished successfully:588
Progress: time: Tue, 24 Jun 2014 12:01:50 -0500 Submitted:329 Checking
status:1 Finished successfully:589
Progress: time: Tue, 24 Jun 2014 12:02:00 -0500 Submitted:330 Finished
successfully:590
Progress: time: Tue, 24 Jun 2014 12:02:30 -0500 Submitted:330 Finished
successfully:590
Progress: time: Tue, 24 Jun 2014 12:03:00 -0500 Submitted:330 Finished
successfully:590
Progress: time: Tue, 24 Jun 2014 12:03:30 -0500 Submitted:330 Finished
successfully:590
Progress: time: Tue, 24 Jun 2014 12:04:00 -0500 Submitted:330 Finished
successfully:590
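To pin down when a run stalls without watching the screen, the progress
output above can be scanned for the pattern in its last few records: a
nonzero Submitted count, no Active jobs, and no change between
consecutive records. The following is a minimal Python sketch, not part
of Swift. The names parse_progress and find_stall and the min_repeats
threshold are hypothetical, and it assumes the progress format shown in
this message.

```python
import re

# Matches "Name:count" pairs such as "Submitted:330" or
# "Finished successfully:590".
STATE_RE = re.compile(r"([A-Z][A-Za-z ]*?):(\d+)")


def parse_progress(text):
    """Return one {state: count} dict per 'Progress:' record in text."""
    records = []
    # Records may be wrapped across lines, so split on the marker itself.
    for chunk in text.split("Progress:")[1:]:
        chunk = " ".join(chunk.split())  # undo line wrapping
        # Drop the timestamp, which ends with a UTC offset like -0500.
        m = re.search(r"[+-]\d{4}\b", chunk)
        body = chunk[m.end():] if m else chunk
        records.append(
            {name.strip(): int(n) for name, n in STATE_RE.findall(body)}
        )
    return records


def find_stall(records, min_repeats=3):
    """Return the index of the first record in a run of at least
    min_repeats identical records that have submitted jobs but no
    active ones, or None if there is no such run."""
    run = 0
    for i, rec in enumerate(records):
        idle = rec.get("Submitted", 0) > 0 and rec.get("Active", 0) == 0
        if idle and i > 0 and rec == records[i - 1]:
            run += 1
        else:
            run = 1 if idle else 0
        if run >= min_repeats:
            return i - run + 1
    return None
```

Calling find_stall(parse_progress(open("swift.out").read())) on saved
screen output (the file name is only an example) returns the index of
the record where the run appears to have stopped making progress.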
The issue is not reliably reproducible, and the number of jobs that
finish successfully before the stall varies from run to run. Any ideas
how to solve this problem? I'm attaching the log file.
Thanks,
Hemant
Hemant Sharma
Post-doctoral Researcher
Advanced Photon Source
Argonne National Laboratory
Lemont IL 60429
USA
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Indexer-20140624-1200-qke4z0v8.log.zip
Type: application/zip
Size: 438282 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/swift-user/attachments/20140624/a8ca13cb/attachment.zip>