[Swift-user] Queuing issue with Swift on Orthros, APS cluster

Hemant Sharma hsharma at anl.gov
Tue Jun 24 12:25:07 CDT 2014


Hi guys,

I'm having a queuing problem with Swift. I have a Swift script that 
should execute about 27000 iterations. To limit the initial memory 
footprint, I created the following config file:

use.provider.staging=false
provider.staging.pin.swiftfiles=false
use.wrapper.staging=false
status.mode=provider
wrapperlog.always.transfer=true
execution.retries=0
lazy.errors=true
sitedir.keep=true
file.gc.enabled=false
wrapper.parameter.mode=files
foreach.max.threads=330

Sometimes when I execute the script, it starts with 320 active jobs 
(on 320 processors), but after a while it gets stuck with 330 
submitted jobs, none of them active. Example screen output:

Swift 0.94 swift-r6637 cog-r3742

RunID: 20140624-1200-qke4z0v8
Progress:  time: Tue, 24 Jun 2014 12:00:30 -0500
Progress:  time: Tue, 24 Jun 2014 12:00:31 -0500  Selecting site:328  
Initializing site shared directory:1  Stage in:1
Progress:  time: Tue, 24 Jun 2014 12:00:32 -0500  Selecting site:10 
Stage in:277  Submitting:3  Submitted:40
Progress:  time: Tue, 24 Jun 2014 12:00:38 -0500  Selecting site:10 
Submitted:319  Active:1
Progress:  time: Tue, 24 Jun 2014 12:00:43 -0500  Selecting site:10 
Active:319  Checking status:1
Progress:  time: Tue, 24 Jun 2014 12:00:44 -0500  Selecting site:1 Stage 
in:20  Active:208  Checking status:31  Stage out:70  Finished 
successfully:21
Progress:  time: Tue, 24 Jun 2014 12:00:45 -0500  Stage in:11 
Active:120  Checking status:10  Stage out:189  Finished successfully:31
Progress:  time: Tue, 24 Jun 2014 12:00:46 -0500  Stage in:25 
Active:117  Stage out:188  Finished successfully:54
Progress:  time: Tue, 24 Jun 2014 12:00:47 -0500  Initializing:1 
Selecting site:1  Stage in:46  Active:118  Stage out:164  Finished 
successfully:86
Progress:  time: Tue, 24 Jun 2014 12:00:48 -0500  Selecting site:2 Stage 
in:102  Submitting:1  Submitted:2  Active:165  Checking status:1  Stage 
out:57  Finished successfully:199
Progress:  time: Tue, 24 Jun 2014 12:00:49 -0500  Submitted:5 
Active:324  Checking status:1  Finished successfully:265
Progress:  time: Tue, 24 Jun 2014 12:00:50 -0500  Submitted:12 
Active:317  Checking status:1  Finished successfully:272
Progress:  time: Tue, 24 Jun 2014 12:00:51 -0500  Submitted:22 
Active:307  Finished successfully:283
Progress:  time: Tue, 24 Jun 2014 12:00:52 -0500  Selecting site:1 Stage 
in:13  Submitted:47  Active:223  Stage out:46  Finished successfully:321
Progress:  time: Tue, 24 Jun 2014 12:00:53 -0500  Stage in:28 
Submitted:73  Active:153  Stage out:75  Finished successfully:362
Progress:  time: Tue, 24 Jun 2014 12:00:55 -0500  Submitted:182 
Active:147  Checking status:1  Finished successfully:442
Progress:  time: Tue, 24 Jun 2014 12:00:57 -0500  Submitted:183 
Active:146  Checking status:1  Finished successfully:443
Progress:  time: Tue, 24 Jun 2014 12:01:00 -0500  Submitted:185 
Active:144  Checking status:1  Finished successfully:445
Progress:  time: Tue, 24 Jun 2014 12:01:01 -0500  Submitted:186 
Active:143  Checking status:1  Finished successfully:446
Progress:  time: Tue, 24 Jun 2014 12:01:02 -0500  Submitted:190 
Active:139  Checking status:1  Finished successfully:450
Progress:  time: Tue, 24 Jun 2014 12:01:05 -0500  Submitted:193 
Active:136  Checking status:1  Finished successfully:453
Progress:  time: Tue, 24 Jun 2014 12:01:07 -0500  Submitted:196 
Active:133  Checking status:1  Finished successfully:456
Progress:  time: Tue, 24 Jun 2014 12:01:09 -0500  Submitted:198 
Active:131  Checking status:1  Finished successfully:458
Progress:  time: Tue, 24 Jun 2014 12:01:10 -0500  Stage in:5 
Submitted:202  Active:63  Stage out:60  Finished successfully:467
Progress:  time: Tue, 24 Jun 2014 12:01:11 -0500  Submitted:273 
Active:56  Checking status:1  Finished successfully:533
Progress:  time: Tue, 24 Jun 2014 12:01:13 -0500  Submitted:282 
Active:47  Checking status:1  Finished successfully:542
Progress:  time: Tue, 24 Jun 2014 12:01:14 -0500  Submitting:1 
Submitted:292  Active:37  Finished successfully:553
Progress:  time: Tue, 24 Jun 2014 12:01:15 -0500  Submitted:298 
Active:31  Checking status:1  Finished successfully:558
Progress:  time: Tue, 24 Jun 2014 12:01:16 -0500  Submitted:305 
Active:24  Checking status:1  Finished successfully:565
Progress:  time: Tue, 24 Jun 2014 12:01:17 -0500  Submitted:307 
Active:22  Checking status:1  Finished successfully:567
Progress:  time: Tue, 24 Jun 2014 12:01:18 -0500  Submitted:313 
Active:16  Checking status:1  Finished successfully:573
Progress:  time: Tue, 24 Jun 2014 12:01:20 -0500  Submitted:315 
Active:14  Checking status:1  Finished successfully:575
Progress:  time: Tue, 24 Jun 2014 12:01:21 -0500  Submitted:317 
Active:12  Checking status:1  Finished successfully:577
Progress:  time: Tue, 24 Jun 2014 12:01:22 -0500  Submitted:319 
Active:10  Checking status:1  Finished successfully:579
Progress:  time: Tue, 24 Jun 2014 12:01:23 -0500  Submitted:320 
Active:9  Checking status:1  Finished successfully:580
Progress:  time: Tue, 24 Jun 2014 12:01:25 -0500  Submitted:323 
Active:6  Checking status:1  Finished successfully:583
Progress:  time: Tue, 24 Jun 2014 12:01:26 -0500  Submitted:324 
Active:5  Checking status:1  Finished successfully:584
Progress:  time: Tue, 24 Jun 2014 12:01:27 -0500  Submitted:325 
Active:4  Checking status:1  Finished successfully:585
Progress:  time: Tue, 24 Jun 2014 12:01:29 -0500  Submitted:326 
Active:3  Checking status:1  Finished successfully:586
Progress:  time: Tue, 24 Jun 2014 12:01:36 -0500  Submitted:327 
Active:2  Checking status:1  Finished successfully:587
Progress:  time: Tue, 24 Jun 2014 12:01:39 -0500  Submitted:328 
Active:1  Checking status:1  Finished successfully:588
Progress:  time: Tue, 24 Jun 2014 12:01:50 -0500  Submitted:329 Checking 
status:1  Finished successfully:589
Progress:  time: Tue, 24 Jun 2014 12:02:00 -0500  Submitted:330 Finished 
successfully:590
Progress:  time: Tue, 24 Jun 2014 12:02:30 -0500  Submitted:330 Finished 
successfully:590
Progress:  time: Tue, 24 Jun 2014 12:03:00 -0500  Submitted:330 Finished 
successfully:590
Progress:  time: Tue, 24 Jun 2014 12:03:30 -0500  Submitted:330 Finished 
successfully:590
Progress:  time: Tue, 24 Jun 2014 12:04:00 -0500  Submitted:330 Finished 
successfully:590

The issue does not occur on every run, and the number of jobs that 
finish successfully before the stall varies from run to run. Any ideas 
how to solve this problem? I'm attaching the log file.
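
To pin down when a run like the one above stalls, the Progress lines can be scanned for the point where jobs are still in "Submitted" but none are "Active" and the counts stop changing. Below is a minimal Python sketch, assuming the screen output has been saved to a text file and that the lines may be wrapped as in this message. The function names (join_wrapped, parse_record, find_stall) and the min_repeats threshold are hypothetical helpers for illustration, not part of Swift.

```python
import re

# Splits "Progress:  time: <timestamp with +/-zone>  <states>" into two parts.
RECORD_RE = re.compile(r"^Progress:\s+time:\s+(.*?[+-]\d{4})\s*(.*)$")
# Matches one "State name:count" pair, such as "Stage in:20".
STATE_RE = re.compile(r"([A-Za-z][A-Za-z ]*?):(\d+)")


def join_wrapped(lines):
    """Rejoin wrapped output so that each Progress record is one string."""
    records, open_record = [], False
    for raw in lines:
        line = raw.strip()
        if line.startswith("Progress:"):
            records.append(line)
            open_record = True
        elif line and open_record:
            # Continuation of the previous Progress record.
            records[-1] += " " + line
        else:
            # A blank or unrelated line ends the current record.
            open_record = False
    return records


def parse_record(record):
    """Return (timestamp, {state: count}), or None if not a Progress record."""
    m = RECORD_RE.match(record)
    if m is None:
        return None
    timestamp, rest = m.groups()
    counts = {name: int(n) for name, n in STATE_RE.findall(rest)}
    return timestamp, counts


def find_stall(records, min_repeats=3):
    """Return the timestamp at which the counts first froze with jobs
    submitted but none active for min_repeats records in a row, else None."""
    run_start, run_counts, run_len = None, None, 0
    for record in records:
        parsed = parse_record(record)
        if parsed is None:
            continue
        timestamp, counts = parsed
        stalled = counts.get("Submitted", 0) > 0 and counts.get("Active", 0) == 0
        if stalled and counts == run_counts:
            run_len += 1
        elif stalled:
            run_start, run_counts, run_len = timestamp, counts, 1
        else:
            run_start, run_counts, run_len = None, None, 0
        if run_len >= min_repeats:
            return run_start
    return None
```

Reading the saved output with open(path).read().splitlines(), passing it through join_wrapped, and then calling find_stall gives the time of the first frozen record, which could then be matched against timestamps in the attached log.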

Thanks,
Hemant

Hemant Sharma
Post-doctoral Researcher
Advanced Photon Source
Argonne National Laboratory
Lemont IL 60429
USA
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Indexer-20140624-1200-qke4z0v8.log.zip
Type: application/zip
Size: 438282 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/swift-user/attachments/20140624/a8ca13cb/attachment.zip>

