From yizhu at cs.uchicago.edu Thu Aug 6 15:15:23 2009
From: yizhu at cs.uchicago.edu (Yi Zhu)
Date: Thu, 06 Aug 2009 15:15:23 -0500
Subject: [Swift-devel] How to set the maximum number of concurrent jobs allowed on a site to a fixed size?
Message-ID: <4A7B39DB.3030602@cs.uchicago.edu>

Hi, all

As we already know, Swift dynamically changes the maximum number of concurrent jobs allowed on a site based on the performance history of that site. According to the Swift documentation: each site is assigned a score (initially 1), which can increase or decrease based on whether the site yields successful or faulty job runs. The score for a site can take values in the (0.1, 100) interval. The number of allowed jobs is calculated using the following formula:

2 + score*throttle.score.job.factor

We can change throttle.score.job.factor in the sites.xml or swift.properties files, but since the "score" value can increase or decrease during execution, it seems that we cannot really set the maximum number of concurrent jobs allowed on a site to a fixed number. Does anyone have an idea?

Many thanks.

-Yi Zhu

From hategan at mcs.anl.gov Thu Aug 6 16:41:54 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 06 Aug 2009 16:41:54 -0500
Subject: [Swift-devel] How to set the maximum number of concurrent jobs allowed on a site to a fixed size?
In-Reply-To: <4A7B39DB.3030602@cs.uchicago.edu>
References: <4A7B39DB.3030602@cs.uchicago.edu>
Message-ID: <1249594914.28410.81.camel@blabla>

On Thu, 2009-08-06 at 15:15 -0500, Yi Zhu wrote:
> Hi, all
>
> As we already know, Swift dynamically changes the maximum number of concurrent jobs allowed on a site based on the performance history of that site. According to the Swift documentation: each site is assigned a score (initially 1), which can increase or decrease based on whether the site yields successful or faulty job runs. The score for a site can take values in the (0.1, 100) interval. The number of allowed jobs is calculated using the following formula:
>
> 2 + score*throttle.score.job.factor
>
> We can change throttle.score.job.factor in the sites.xml or swift.properties files, but since the "score" value can increase or decrease during execution, it seems that we cannot really set the maximum number of concurrent jobs allowed on a site to a fixed number. Does anyone have an idea?

Can you rephrase the question?

The number of jobs running on a site is a function of the current demand for that site and some monotonically increasing function of the score:

nj = f(d, g(s)) = min(d, g(s))

The score is a function of time (roughly):

s = s(t)

Assuming demand is higher than the job limit g (which is the case when you're interested in limiting nj):

d > g(s) => min(d, g(s)) = g(s)

So

nj = g(s(t))

Now, you know that s(t) is bounded (by default (0.01, 100) - the max is open, so assume limits instead of equality), and since g is monotonically increasing and g(max_score) is finite, it follows that max(g(x)) is g(max_score). So there is a fixed upper bound on the number of concurrent jobs regardless of time/score (max(g(t))), as well as a maximum number of concurrent jobs at each time point (i.e. for each score) (g(t)).

Mihael

From yizhu at cs.uchicago.edu Thu Aug 6 16:50:35 2009
From: yizhu at cs.uchicago.edu (Yi Zhu)
Date: Thu, 06 Aug 2009 16:50:35 -0500
Subject: [Swift-devel] How to set the maximum number of concurrent jobs allowed on a site to a fixed size?
In-Reply-To: <1249594914.28410.81.camel@blabla>
References: <4A7B39DB.3030602@cs.uchicago.edu> <1249594914.28410.81.camel@blabla>
Message-ID: <4A7B502B.1080809@cs.uchicago.edu>

Hi Mihael

Now, I just set the initialScore to a ridiculously high value (e.g.
10000), and Swift seems to be able to scale it down into that range automatically; I then set throttle.factor accordingly, so I can get a fixed maximum number according to the formula:

2 + score (range 0.1-100) * throttle.factor

-Yi

Mihael Hategan wrote:
> On Thu, 2009-08-06 at 15:15 -0500, Yi Zhu wrote:
>> Hi, all
>>
>> As we already know, Swift dynamically changes the maximum number of concurrent jobs allowed on a site based on the performance history of that site. According to the Swift documentation: each site is assigned a score (initially 1), which can increase or decrease based on whether the site yields successful or faulty job runs. The score for a site can take values in the (0.1, 100) interval. The number of allowed jobs is calculated using the following formula:
>>
>> 2 + score*throttle.score.job.factor
>>
>> We can change throttle.score.job.factor in the sites.xml or swift.properties files, but since the "score" value can increase or decrease during execution, it seems that we cannot really set the maximum number of concurrent jobs allowed on a site to a fixed number. Does anyone have an idea?
>
> Can you rephrase the question?
>
> The number of jobs running on a site is a function of the current demand for that site and some monotonically increasing function of the score:
>
> nj = f(d, g(s)) = min(d, g(s))
>
> The score is a function of time (roughly):
>
> s = s(t)
>
> Assuming demand is higher than the job limit g (which is the case when you're interested in limiting nj):
>
> d > g(s) => min(d, g(s)) = g(s)
>
> So
>
> nj = g(s(t))
>
> Now, you know that s(t) is bounded (by default (0.01, 100) - the max is open, so assume limits instead of equality), and since g is monotonically increasing and g(max_score) is finite, it follows that max(g(x)) is g(max_score). So there is a fixed upper bound on the number of concurrent jobs regardless of time/score (max(g(t))), as well as a maximum number of concurrent jobs at each time point (i.e. for each score) (g(t)).
>
> Mihael

From hategan at mcs.anl.gov Thu Aug 6 16:58:21 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 06 Aug 2009 16:58:21 -0500
Subject: [Swift-devel] How to set the maximum number of concurrent jobs allowed on a site to a fixed size?
In-Reply-To: <4A7B502B.1080809@cs.uchicago.edu>
References: <4A7B39DB.3030602@cs.uchicago.edu> <1249594914.28410.81.camel@blabla> <4A7B502B.1080809@cs.uchicago.edu>
Message-ID: <1249595901.28410.84.camel@blabla>

On Thu, 2009-08-06 at 16:50 -0500, Yi Zhu wrote:
> Hi Mihael
>
> Now, I just set the initialScore to a ridiculously high value (e.g. 10000), and Swift seems to be able to scale it down into that range automatically; I then set throttle.factor accordingly, so I can get a fixed maximum number according to the formula:
>
> 2 + score (range 0.1-100) * throttle.factor

Exactly.

From bugzilla-daemon at mcs.anl.gov Tue Aug 25 10:35:53 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Tue, 25 Aug 2009 10:35:53 -0500 (CDT)
Subject: [Swift-devel] [Bug 218] New: Coasters failure in shutdown processing
Message-ID:

https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=218

Summary: Coasters failure in shutdown processing
Product: Swift
Version: unspecified
Platform: All
OS/Version: All
Status: NEW
Severity: normal
Priority: P2
Component: General
AssignedTo: hategan at mcs.anl.gov
ReportedBy: wilde at mcs.anl.gov

Hi, I have a processing step that takes somewhere ~2-5 min. It takes as input two ~5MB files and produces a small text file, which I need to store. I need to compute a large number of such jobs, using different parameters. It seems to me "coaster" is the best execution provider for my application.
Trying to start simple, I am running the first.swift (echo) example that comes with Swift using different providers: GT2, GT4, GT2/coaster, and GT4/coaster. All of this is done on the Abe NCSA cluster. Here's my sites.xml (only the work directory paths survive in the archive):

/u/ac/fedorov/scratch-global/scratch
/u/ac/fedorov/scratch-global/scratch
/u/ac/fedorov/scratch-global/scratch
/u/ac/fedorov/scratch-global/scratch

And tc.data is simply:

Abe-GT4-coasters  echo  /bin/echo  INSTALLED  INTEL32::LINUX  null

and I change the site to test different providers. Now, the results:

1) both the GT2 and GT4 providers work fine, and the script completes

2) with the GT2+coaster provider, I can see the job in the PBS queue (the requested time is 01:41; I guess this comes from the default coaster parameters, which I didn't change). The job appears to finish successfully, and it seems like the output file is fetched back, but then I get this error:

Final status: Finished successfully:1
START cleanups=[[first-20090825-0925-emkt2qt0, Abe-GT2-coasters]]
START dir=first-20090825-0925-emkt2qt0 host=Abe-GT2-coasters
Sending Command(21, SUBMITJOB) on GSSSChannel-null(1)
Command(21, SUBMITJOB) CMD: Command(21, SUBMITJOB)
GSSSChannel-null(1) REPL: Command(21, SUBMITJOB)
Submitted task Task(type=JOB_SUBMISSION, identity=urn:0-1-1251210343871). Job id: urn:1251210343871-1251210376098-1251210376099
Unregistering Command(21, SUBMITJOB)
GSSSChannel-null(1) REQ: Handler(JOBSTATUS)
GSSSChannel-null(1) REQ: Handler(JOBSTATUS)
Task(type=JOB_SUBMISSION, identity=urn:0-1-1251210343871) Completed. Waiting: 0, Running: 0. Heap size: 65M, Heap free: 42M, Max heap: 227M
END dir=first-20090825-0925-emkt2qt0 host=Abe-GT2-coasters
Cleaning up...
Shutting down service at https://141.142.68.180:45552
Got channel MetaChannel: 500265006 -> GSSSChannel-null(1)
Sending Command(22, SHUTDOWNSERVICE) on GSSSChannel-null(1)
Command(22, SHUTDOWNSERVICE) CMD: Command(22, SHUTDOWNSERVICE)
Command(22, SHUTDOWNSERVICE): handling reply timeout
Command(22, SHUTDOWNSERVICE): failed too many times
org.globus.cog.karajan.workflow.service.ReplyTimeoutException
        at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:241)
        at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:246)
        at java.util.TimerThread.mainLoop(Timer.java:512)
        at java.util.TimerThread.run(Timer.java:462)
- Done

-- 
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.
You are watching the reporter.

From hategan at mcs.anl.gov Thu Aug 27 12:58:51 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 27 Aug 2009 12:58:51 -0500
Subject: [Swift-devel] coaster update
Message-ID: <1251395931.18897.17.camel@localhost>

Cog r2453 contains a few updates:

- there was a busy spin in some cases in the worker queue processing; this should be gone, and a new log message should be printed every 10 seconds that says how much that particular thread worked and how much it slept

- there's a new option (wrongly) called "parallelism". Short: parallelism = 0 means attempt to maximize parallelism; parallelism = 1 means the old behavior (if the workers can eventually run all the jobs, don't request new ones)

Long: a bit of detail about the scheduling problem: coaster blocks are a bunch of 2D boxes. They have a width (number of workers) and a height (walltime). Jobs are pretty much the same, except they have a width of 1. The problem is that of "ordering" boxes subject to some constraints (e.g. widths can only be a multiple of a certain number, only n boxes can be had at one time, etc.)
and fitting the jobs into the boxes. In order to amortize the queuing cost, boxes need to be a few times taller than the jobs, so that multiple jobs can eventually be stacked on top of each other inside a box. The allocator looks at the current set of jobs, the current boxes, and the constraints to figure out whether to order more boxes and what sizes those boxes should be. It won't order more boxes if the jobs fit.

So that brings us to the notion of size. It used to be that the size metric was w*h, so a sufficiently tall box could fit multiple jobs by itself. It was pointed out that while this is OK, it may be desirable to try to maximize parallelism, such that an attempt is at least made to get boxes that hold only one stack of jobs each. But this is pretty much the same as saying that the "size" of a box is now w (rather than w*h) and the size of a job is 1. Hence the parallelism option, which dictates what the sizes of a box and a job are, using sz = w * h^parallelism. If parallelism = 1, size = w*h; if parallelism = 0, size = w.

The name "parallelism" is obviously wrong. If anybody feels like making it size = w*h^(1-parallelism) and/or changing the name to something more sensible, feel free to do so.
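
The size metric above can be sketched in a few lines. This is a minimal illustration only; the function name and the sample numbers are assumptions for the example, not taken from the Cog source:

```python
def size(width, height, parallelism):
    """Size of a 2D box (or job) under the "parallelism" option.

    parallelism = 1 -> size = w*h (old behavior: a tall box by itself
    can absorb several jobs stacked on top of each other)
    parallelism = 0 -> size = w (a box counts only by its width, so
    extra jobs push the allocator to order more boxes, maximizing
    parallelism)
    """
    return width * height ** parallelism

# A hypothetical block 8 workers wide and 3600 s tall:
print(size(8, 3600, 1))  # 28800 under the old w*h metric
print(size(8, 3600, 0))  # 8 under the width-only metric

# A job has width 1, so with parallelism = 0 this block "fits"
# only 8 jobs no matter how tall it is, while with parallelism = 1
# it fits 28800 / 600 = 48 jobs of height 600.
```

Intermediate values of parallelism between 0 and 1 then interpolate between the two behaviors, which is why the option is a real-valued exponent rather than a boolean.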