[Swift-user] Queuing issue with SWIFT on Orthros, APS cluster

Justin M Wozniak wozniak at mcs.anl.gov
Fri Aug 22 08:30:23 CDT 2014


This message was held in our mailing list. Is this still an issue?

On 6/24/2014 7:32 PM, Hemant Sharma wrote:
> Hi guys,
>
> I'm having a problem with queuing using SWIFT. I have a Swift script 
> that should execute about 27000 iterations. To limit the initial 
> memory footprint, I created the following config file:
>
> use.provider.staging=false
> provider.staging.pin.swiftfiles=false
> use.wrapper.staging=false
> status.mode=provider
> wrapperlog.always.transfer=true
> execution.retries=0
> lazy.errors=true
> sitedir.keep=true
> file.gc.enabled=false
> wrapper.parameter.mode=files
> foreach.max.threads=330
>
> Sometimes, when I execute the script, it starts with 320 active jobs 
> (on 320 processors), but after some time it gets stuck with 330 
> submitted jobs, none of which are active. Example screen output:
>
> Swift 0.94 swift-r6637 cog-r3742
>
> RunID: 20140624-1200-qke4z0v8
> Progress:  time: Tue, 24 Jun 2014 12:00:30 -0500
> Progress:  time: Tue, 24 Jun 2014 12:00:31 -0500  Selecting site:328  
> Initializing site shared directory:1  Stage in:1
> Progress:  time: Tue, 24 Jun 2014 12:00:32 -0500  Selecting site:10 
> Stage in:277  Submitting:3  Submitted:40
> Progress:  time: Tue, 24 Jun 2014 12:00:38 -0500  Selecting site:10 
> Submitted:319  Active:1
> Progress:  time: Tue, 24 Jun 2014 12:00:43 -0500  Selecting site:10 
> Active:319  Checking status:1
> Progress:  time: Tue, 24 Jun 2014 12:00:44 -0500  Selecting site:1 
> Stage in:20  Active:208  Checking status:31  Stage out:70 Finished 
> successfully:21
> Progress:  time: Tue, 24 Jun 2014 12:00:45 -0500  Stage in:11 
> Active:120  Checking status:10  Stage out:189  Finished successfully:31
> Progress:  time: Tue, 24 Jun 2014 12:00:46 -0500  Stage in:25 
> Active:117  Stage out:188  Finished successfully:54
> Progress:  time: Tue, 24 Jun 2014 12:00:47 -0500  Initializing:1 
> Selecting site:1  Stage in:46  Active:118  Stage out:164  Finished 
> successfully:86
> Progress:  time: Tue, 24 Jun 2014 12:00:48 -0500  Selecting site:2 
> Stage in:102  Submitting:1  Submitted:2  Active:165  Checking 
> status:1  Stage out:57  Finished successfully:199
> Progress:  time: Tue, 24 Jun 2014 12:00:49 -0500  Submitted:5 
> Active:324  Checking status:1  Finished successfully:265
> Progress:  time: Tue, 24 Jun 2014 12:00:50 -0500  Submitted:12 
> Active:317  Checking status:1  Finished successfully:272
> Progress:  time: Tue, 24 Jun 2014 12:00:51 -0500  Submitted:22 
> Active:307  Finished successfully:283
> Progress:  time: Tue, 24 Jun 2014 12:00:52 -0500  Selecting site:1 
> Stage in:13  Submitted:47  Active:223  Stage out:46  Finished 
> successfully:321
> Progress:  time: Tue, 24 Jun 2014 12:00:53 -0500  Stage in:28 
> Submitted:73  Active:153  Stage out:75  Finished successfully:362
> Progress:  time: Tue, 24 Jun 2014 12:00:55 -0500  Submitted:182 
> Active:147  Checking status:1  Finished successfully:442
> Progress:  time: Tue, 24 Jun 2014 12:00:57 -0500  Submitted:183 
> Active:146  Checking status:1  Finished successfully:443
> Progress:  time: Tue, 24 Jun 2014 12:01:00 -0500  Submitted:185 
> Active:144  Checking status:1  Finished successfully:445
> Progress:  time: Tue, 24 Jun 2014 12:01:01 -0500  Submitted:186 
> Active:143  Checking status:1  Finished successfully:446
> Progress:  time: Tue, 24 Jun 2014 12:01:02 -0500  Submitted:190 
> Active:139  Checking status:1  Finished successfully:450
> Progress:  time: Tue, 24 Jun 2014 12:01:05 -0500  Submitted:193 
> Active:136  Checking status:1  Finished successfully:453
> Progress:  time: Tue, 24 Jun 2014 12:01:07 -0500  Submitted:196 
> Active:133  Checking status:1  Finished successfully:456
> Progress:  time: Tue, 24 Jun 2014 12:01:09 -0500  Submitted:198 
> Active:131  Checking status:1  Finished successfully:458
> Progress:  time: Tue, 24 Jun 2014 12:01:10 -0500  Stage in:5 
> Submitted:202  Active:63  Stage out:60  Finished successfully:467
> Progress:  time: Tue, 24 Jun 2014 12:01:11 -0500  Submitted:273 
> Active:56  Checking status:1  Finished successfully:533
> Progress:  time: Tue, 24 Jun 2014 12:01:13 -0500  Submitted:282 
> Active:47  Checking status:1  Finished successfully:542
> Progress:  time: Tue, 24 Jun 2014 12:01:14 -0500  Submitting:1 
> Submitted:292  Active:37  Finished successfully:553
> Progress:  time: Tue, 24 Jun 2014 12:01:15 -0500  Submitted:298 
> Active:31  Checking status:1  Finished successfully:558
> Progress:  time: Tue, 24 Jun 2014 12:01:16 -0500  Submitted:305 
> Active:24  Checking status:1  Finished successfully:565
> Progress:  time: Tue, 24 Jun 2014 12:01:17 -0500  Submitted:307 
> Active:22  Checking status:1  Finished successfully:567
> Progress:  time: Tue, 24 Jun 2014 12:01:18 -0500  Submitted:313 
> Active:16  Checking status:1  Finished successfully:573
> Progress:  time: Tue, 24 Jun 2014 12:01:20 -0500  Submitted:315 
> Active:14  Checking status:1  Finished successfully:575
> Progress:  time: Tue, 24 Jun 2014 12:01:21 -0500  Submitted:317 
> Active:12  Checking status:1  Finished successfully:577
> Progress:  time: Tue, 24 Jun 2014 12:01:22 -0500  Submitted:319 
> Active:10  Checking status:1  Finished successfully:579
> Progress:  time: Tue, 24 Jun 2014 12:01:23 -0500  Submitted:320 
> Active:9  Checking status:1  Finished successfully:580
> Progress:  time: Tue, 24 Jun 2014 12:01:25 -0500  Submitted:323 
> Active:6  Checking status:1  Finished successfully:583
> Progress:  time: Tue, 24 Jun 2014 12:01:26 -0500  Submitted:324 
> Active:5  Checking status:1  Finished successfully:584
> Progress:  time: Tue, 24 Jun 2014 12:01:27 -0500  Submitted:325 
> Active:4  Checking status:1  Finished successfully:585
> Progress:  time: Tue, 24 Jun 2014 12:01:29 -0500  Submitted:326 
> Active:3  Checking status:1  Finished successfully:586
> Progress:  time: Tue, 24 Jun 2014 12:01:36 -0500  Submitted:327 
> Active:2  Checking status:1  Finished successfully:587
> Progress:  time: Tue, 24 Jun 2014 12:01:39 -0500  Submitted:328 
> Active:1  Checking status:1  Finished successfully:588
> Progress:  time: Tue, 24 Jun 2014 12:01:50 -0500  Submitted:329 
> Checking status:1  Finished successfully:589
> Progress:  time: Tue, 24 Jun 2014 12:02:00 -0500  Submitted:330 
> Finished successfully:590
> Progress:  time: Tue, 24 Jun 2014 12:02:30 -0500  Submitted:330 
> Finished successfully:590
> Progress:  time: Tue, 24 Jun 2014 12:03:00 -0500  Submitted:330 
> Finished successfully:590
> Progress:  time: Tue, 24 Jun 2014 12:03:30 -0500  Submitted:330 
> Finished successfully:590
> Progress:  time: Tue, 24 Jun 2014 12:04:00 -0500  Submitted:330 
> Finished successfully:590
>
> The issue does not occur on every run, and the number of jobs that 
> finish successfully before the stall varies. Any ideas how to solve 
> this problem? I'm attaching the log file.
>
> Thanks,
> Hemant
>
> Hemant Sharma
> Post-doctoral Researcher
> Advanced Photon Source
> Argonne National Laboratory
> Lemont IL 60429
> USA
>
>
> _______________________________________________
> Swift-user mailing list
> Swift-user at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user
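
The stall pattern in the quoted output (the Submitted count stays fixed, no
jobs are Active, and the counters stop changing) can be checked for
mechanically. The following is a minimal Python sketch, not part of Swift:
the function names parse_progress and stalled and the window parameter are
hypothetical, and it assumes the "Progress:" line format shown above,
including the mail quoting and line wrapping.

```python
import re

# Matches state counters such as "Submitted:330" or "Finished successfully:590".
COUNTER = re.compile(r"([A-Z][A-Za-z ]*?):(\d+)")


def parse_progress(text):
    """Return one {state: count} dict per 'Progress:' record in the text."""
    # Undo mail quoting ("> ") and line wrapping before splitting into records.
    flat = " ".join(line.lstrip("> ").strip() for line in text.splitlines())
    snapshots = []
    for record in flat.split("Progress:")[1:]:
        # Drop the "time: ... -0500" prefix so only the counters remain.
        body = re.sub(r"^\s*time:.*?[+-]\d{4}", "", record)
        snapshots.append({k.strip(): int(v) for k, v in COUNTER.findall(body)})
    return snapshots


def stalled(snapshots, window=3):
    """True if the last `window` snapshots are identical, with jobs
    submitted but none active (hypothetical stall criterion)."""
    if len(snapshots) < window:
        return False
    tail = snapshots[-window:]
    last = tail[-1]
    return (all(s == last for s in tail)
            and last.get("Submitted", 0) > 0
            and last.get("Active", 0) == 0)


# The last three records from the quoted output.
sample = """\
> Progress:  time: Tue, 24 Jun 2014 12:03:00 -0500  Submitted:330
> Finished successfully:590
> Progress:  time: Tue, 24 Jun 2014 12:03:30 -0500  Submitted:330
> Finished successfully:590
> Progress:  time: Tue, 24 Jun 2014 12:04:00 -0500  Submitted:330
> Finished successfully:590
"""

print(stalled(parse_progress(sample)))
```

A check like this only flags the symptom. It does not say why the submitted
jobs never become active, so the attached Swift log is still needed to find
the cause.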


-- 
Justin M Wozniak
