From yizhu at cs.uchicago.edu Thu Aug 6 15:15:23 2009
From: yizhu at cs.uchicago.edu (Yi Zhu)
Date: Thu, 06 Aug 2009 15:15:23 -0500
Subject: [Swift-devel] How to set the maximum number of concurrent jobs allowed on a site to a fixed size?
Message-ID: <4A7B39DB.3030602@cs.uchicago.edu>

Hi, all

As we already know, Swift dynamically changes the maximum number of concurrent jobs allowed on a site based on the performance history of that site. According to the Swift documentation: each site is assigned a score (initially 1), which can increase or decrease based on whether the site yields successful or faulty job runs. The score for a site can take values in the (0.1, 100) interval. The number of allowed jobs is calculated using the following formula:

2 + score*throttle.score.job.factor

We can change throttle.score.job.factor in the sites.xml or swift.properties files, but since the "score" value can increase or decrease during execution, it seems that we cannot really set the maximum number of concurrent jobs allowed on a site to a fixed number. Does anyone have an idea?

Many thanks.

-Yi Zhu

From hategan at mcs.anl.gov Thu Aug 6 16:41:54 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 06 Aug 2009 16:41:54 -0500
Subject: [Swift-devel] How to set the maximum number of concurrent jobs allowed on a site to a fixed size?
In-Reply-To: <4A7B39DB.3030602@cs.uchicago.edu>
References: <4A7B39DB.3030602@cs.uchicago.edu>
Message-ID: <1249594914.28410.81.camel@blabla>

On Thu, 2009-08-06 at 15:15 -0500, Yi Zhu wrote:
> Hi, all
>
> As we already know, Swift dynamically changes the maximum number of concurrent jobs allowed on a site based on the performance history of that site. According to the Swift documentation: each site is assigned a score (initially 1), which can increase or decrease based on whether the site yields successful or faulty job runs. The score for a site can take values in the (0.1, 100) interval. The number of allowed jobs is calculated using the following formula:
>
> 2 + score*throttle.score.job.factor
>
> We can change throttle.score.job.factor in the sites.xml or swift.properties files, but since the "score" value can increase or decrease during execution, it seems that we cannot really set the maximum number of concurrent jobs allowed on a site to a fixed number. Does anyone have an idea?

Can you rephrase the question?

The number of jobs running on a site is a function of the current demand for that site and some monotonically increasing function of the score:

nj = f(d, g(s)) = min(d, g(s))

The score is a function of time (roughly):

s = s(t)

Assuming demand is higher than the job limit g (which is the case when you're interested in limiting nj):

d > g(s) => min(d, g(s)) = g(s)

So

nj = g(s(t))

Now, you know that s(t) is bounded (by default (0.01, 100) - the max is open, so assume limits instead of equality), and since g is monotonically increasing and g(max_score) is finite, it follows that max(g(x)) is g(max_score). So there is a fixed upper bound on the number of concurrent jobs regardless of time/score (max(g(t))), as well as a maximum number of concurrent jobs at each time point (i.e. for each score) (g(t)).

Mihael

From yizhu at cs.uchicago.edu Thu Aug 6 16:50:35 2009
From: yizhu at cs.uchicago.edu (Yi Zhu)
Date: Thu, 06 Aug 2009 16:50:35 -0500
Subject: [Swift-devel] How to set the maximum number of concurrent jobs allowed on a site to a fixed size?
In-Reply-To: <1249594914.28410.81.camel@blabla>
References: <4A7B39DB.3030602@cs.uchicago.edu> <1249594914.28410.81.camel@blabla>
Message-ID: <4A7B502B.1080809@cs.uchicago.edu>

Hi Mihael

Now, I just set the initialScore to a ridiculously high value (e.g.
10000), and Swift seems to be able to scale it down into that range automatically; I then set throttle.factor accordingly, so I can get a fixed maximum number according to the formula:

2 + score (range 0.1-100) * throttle.factor

-Yi

Mihael Hategan wrote:
> On Thu, 2009-08-06 at 15:15 -0500, Yi Zhu wrote:
>> Hi, all
>>
>> As we already know, Swift dynamically changes the maximum number of concurrent jobs allowed on a site based on the performance history of that site. According to the Swift documentation: each site is assigned a score (initially 1), which can increase or decrease based on whether the site yields successful or faulty job runs. The score for a site can take values in the (0.1, 100) interval. The number of allowed jobs is calculated using the following formula:
>>
>> 2 + score*throttle.score.job.factor
>>
>> We can change throttle.score.job.factor in the sites.xml or swift.properties files, but since the "score" value can increase or decrease during execution, it seems that we cannot really set the maximum number of concurrent jobs allowed on a site to a fixed number. Does anyone have an idea?
>
> Can you rephrase the question?
>
> The number of jobs running on a site is a function of the current demand for that site and some monotonically increasing function of the score:
>
> nj = f(d, g(s)) = min(d, g(s))
>
> The score is a function of time (roughly):
>
> s = s(t)
>
> Assuming demand is higher than the job limit g (which is the case when you're interested in limiting nj):
>
> d > g(s) => min(d, g(s)) = g(s)
>
> So
>
> nj = g(s(t))
>
> Now, you know that s(t) is bounded (by default (0.01, 100) - the max is open, so assume limits instead of equality), and since g is monotonically increasing and g(max_score) is finite, it follows that max(g(x)) is g(max_score). So there is a fixed upper bound on the number of concurrent jobs regardless of time/score (max(g(t))), as well as a maximum number of concurrent jobs at each time point (i.e. for each score) (g(t)).
>
> Mihael

From hategan at mcs.anl.gov Thu Aug 6 16:58:21 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 06 Aug 2009 16:58:21 -0500
Subject: [Swift-devel] How to set the maximum number of concurrent jobs allowed on a site to a fixed size?
In-Reply-To: <4A7B502B.1080809@cs.uchicago.edu>
References: <4A7B39DB.3030602@cs.uchicago.edu> <1249594914.28410.81.camel@blabla> <4A7B502B.1080809@cs.uchicago.edu>
Message-ID: <1249595901.28410.84.camel@blabla>

On Thu, 2009-08-06 at 16:50 -0500, Yi Zhu wrote:
> Hi Mihael
>
> Now, I just set the initialScore to a ridiculously high value (e.g. 10000), and Swift seems to be able to scale it down into that range automatically; I then set throttle.factor accordingly, so I can get a fixed maximum number according to the formula:
>
> 2 + score (range 0.1-100) * throttle.factor

Exactly.

From bugzilla-daemon at mcs.anl.gov Tue Aug 25 10:35:53 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Tue, 25 Aug 2009 10:35:53 -0500 (CDT)
Subject: [Swift-devel] [Bug 218] New: Coasters failure in shutdown processing
Message-ID:

https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=218

Summary: Coasters failure in shutdown processing
Product: Swift
Version: unspecified
Platform: All
OS/Version: All
Status: NEW
Severity: normal
Priority: P2
Component: General
AssignedTo: hategan at mcs.anl.gov
ReportedBy: wilde at mcs.anl.gov

Hi, I have a processing step that takes somewhere ~2-5 min. It takes as input two ~5MB files and produces a small text file, which I need to store. I need to compute a large number of such jobs, using different parameters. It seems to me "coaster" is the best execution provider for my application.
Trying to start simple, I am running the first.swift (echo) example that comes with Swift using different providers: GT2, GT4, GT2/coaster, and GT4/coaster. All of this is done on the Abe NCSA cluster. Here's my sites.xml (only the work directory paths survive in the archive):

/u/ac/fedorov/scratch-global/scratch
/u/ac/fedorov/scratch-global/scratch
/u/ac/fedorov/scratch-global/scratch
/u/ac/fedorov/scratch-global/scratch

And tc.data is simply:

Abe-GT4-coasters  echo  /bin/echo  INSTALLED  INTEL32::LINUX  null

and I change the site to test different providers. Now, the results:

1) both the GT2 and GT4 providers work fine, and the script completes

2) with the GT2+coaster provider, I can see the job in the PBS queue (the requested time is 01:41; I guess this comes from the default coaster parameters, which I didn't change). The job appears to finish successfully, and it seems like the output file is fetched back, but then I get this error:

Final status: Finished successfully:1
START cleanups=[[first-20090825-0925-emkt2qt0, Abe-GT2-coasters]]
START dir=first-20090825-0925-emkt2qt0 host=Abe-GT2-coasters
Sending Command(21, SUBMITJOB) on GSSSChannel-null(1)
Command(21, SUBMITJOB) CMD: Command(21, SUBMITJOB)
GSSSChannel-null(1) REPL: Command(21, SUBMITJOB)
Submitted task Task(type=JOB_SUBMISSION, identity=urn:0-1-1251210343871). Job id: urn:1251210343871-1251210376098-1251210376099
Unregistering Command(21, SUBMITJOB)
GSSSChannel-null(1) REQ: Handler(JOBSTATUS)
GSSSChannel-null(1) REQ: Handler(JOBSTATUS)
Task(type=JOB_SUBMISSION, identity=urn:0-1-1251210343871) Completed. Waiting: 0, Running: 0. Heap size: 65M, Heap free: 42M, Max heap: 227M
END dir=first-20090825-0925-emkt2qt0 host=Abe-GT2-coasters
Cleaning up...
Shutting down service at https://141.142.68.180:45552
Got channel MetaChannel: 500265006 -> GSSSChannel-null(1)
Sending Command(22, SHUTDOWNSERVICE) on GSSSChannel-null(1)
Command(22, SHUTDOWNSERVICE) CMD: Command(22, SHUTDOWNSERVICE)
Command(22, SHUTDOWNSERVICE): handling reply timeout
Command(22, SHUTDOWNSERVICE): failed too many times
org.globus.cog.karajan.workflow.service.ReplyTimeoutException
        at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:241)
        at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:246)
        at java.util.TimerThread.mainLoop(Timer.java:512)
        at java.util.TimerThread.run(Timer.java:462)
- Done

-- 
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.
You are watching the reporter.

From hategan at mcs.anl.gov Thu Aug 27 12:58:51 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 27 Aug 2009 12:58:51 -0500
Subject: [Swift-devel] coaster update
Message-ID: <1251395931.18897.17.camel@localhost>

Cog r2453 contains a few updates:

- there was a busy spin in some cases in the worker queue processing; this should be gone, and a new log message should be printed every 10 seconds that says how much that particular thread worked and how much it slept

- there's a new option (wrongly) called "parallelism". Short: parallelism = 0 means attempt to maximize parallelism; parallelism = 1 means the old behavior (if the workers can eventually run all the jobs, don't request new ones)

Long: a bit of detail about the scheduling problem: coaster blocks are a bunch of 2D boxes. They have a width (number of workers) and a height (walltime). Jobs are pretty much the same, except they have a width of 1. The problem is that of "ordering" boxes subject to some constraints (e.g. widths can only be a multiple of a certain number, only n boxes can be had at one time, etc.)
and fitting the jobs into the boxes. In order to amortize the queuing cost, boxes need to be a few times taller than the jobs, so that multiple jobs can eventually be stacked on top of each other inside a box. The allocator looks at the current set of jobs, the current boxes, and the constraints to figure out whether to order more boxes and what sizes those boxes should be. It won't order more boxes if the jobs fit.

So that brings us to the notion of size. It used to be that the size metric was w*h, so a sufficiently tall box could fit multiple jobs by itself. It was pointed out that while this is OK, it may be desirable to try to maximize parallelism, such that an attempt is at least made to get boxes that hold only one stack of jobs each. But this is pretty much the same as saying that the "size" of a box is now w (rather than w*h) and the size of a job is 1. Hence the parallelism option, which dictates what the sizes of a box and a job are, using sz = w * h^parallelism. If parallelism = 1, size = w*h; if parallelism = 0, size = w.

The name "parallelism" is obviously wrong. If anybody feels like making it size = w*h^(1-parallelism) and/or changing the name to something more sensible, feel free to do so.
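
The size metric above can be sketched in a few lines. This is a minimal illustration only; the function name and the sample numbers are assumptions for the example, not taken from the Cog source:

```python
def size(width, height, parallelism):
    """Size of a 2D box (or job) under the "parallelism" option.

    parallelism = 1 -> size = w*h (old behavior: a tall box by itself
    can absorb several jobs stacked on top of each other)
    parallelism = 0 -> size = w (a box counts only by its width, so
    extra jobs push the allocator to order more boxes, maximizing
    parallelism)
    """
    return width * height ** parallelism

# A hypothetical block 8 workers wide and 3600 s tall:
print(size(8, 3600, 1))  # 28800 under the old w*h metric
print(size(8, 3600, 0))  # 8 under the width-only metric

# A job has width 1, so with parallelism = 0 this block "fits"
# only 8 jobs no matter how tall it is, while with parallelism = 1
# it fits 28800 / 600 = 48 jobs of height 600.
```

Intermediate values of parallelism between 0 and 1 then interpolate between the two behaviors, which is why the option is a real-valued exponent rather than a boolean.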