From bugzilla-daemon at mcs.anl.gov Wed Apr 1 07:32:24 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Wed, 1 Apr 2009 07:32:24 -0500 (CDT)
Subject: [Swift-devel] [Bug 191] New: procedures invoked inside iterate{}
don't get unique execution IDs
Message-ID:
https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=191
Summary: procedures invoked inside iterate{} don't get unique
execution IDs
Product: Swift
Version: unspecified
Platform: PC
OS/Version: Mac OS
Status: NEW
Severity: normal
Priority: P2
Component: Log processing and plotting
AssignedTo: benc at hawaga.org.uk
ReportedBy: benc at hawaga.org.uk
CC: swift-devel at ci.uchicago.edu
iterate {} is more serialised than I intended. It executes each body inside the
same single thread. Consequently, each iteration of the loop body does not end
up with a unique thread prefix, and so execute IDs, which are based on thread
ID, end up duplicated between invocations.
I made the following hack for the specific purpose of Provenance Challenge 3, as it
provides enough uniqueness, albeit inelegantly, for that project. More
properly, fixing bug 154 (iterate construct causes overserialisation of
execution) could make this problem go away.
Author: Ben Clifford
Date: Tue Mar 31 16:20:41 2009 +0100
make iterate give each iteration a unique thread ID. this is possibly
unsafe. in addition, it does not give a unique ID to the termination
condition distinct from the body of the loop, which can probably give
non-unique IDs when procedure calls are made both in the loop body and
in the termination condition
diff --git a/src/org/griphyn/vdl/karajan/lib/InfiniteCountingWhile.java b/src/org/griphyn/vdl/karajan/lib/InfiniteCountingWhile.java
index 0d173c3..c6d4e89 100644
--- a/src/org/griphyn/vdl/karajan/lib/InfiniteCountingWhile.java
+++ b/src/org/griphyn/vdl/karajan/lib/InfiniteCountingWhile.java
@@ -9,6 +9,7 @@ package org.griphyn.vdl.karajan.lib;
import java.util.Arrays;
import java.util.List;
+import org.globus.cog.karajan.util.ThreadingContext;
import org.globus.cog.karajan.workflow.nodes.*;
import org.globus.cog.karajan.stack.VariableStack;
import org.globus.cog.karajan.util.TypeUtil;
@@ -26,6 +27,8 @@ public class InfiniteCountingWhile extends Sequential {
     public void pre(VariableStack stack) throws ExecutionException {
         stack.setVar("#condition", new Condition());
+        ThreadingContext tc = (ThreadingContext)stack.getVar("#thread");
+        stack.setVar("#thread", tc.split(666));
         stack.setVar(VAR, "$");
         String counterName = (String)stack.getVar(VAR);
         stack.setVar(counterName, Arrays.asList(new Integer[] {new Integer(0)}));
@@ -54,6 +57,8 @@ public class InfiniteCountingWhile extends Sequential {
         }
         if (index >= elementCount()) {
             // starting new iteration
+            ThreadingContext tc = (ThreadingContext)stack.getVar("#thread");
+            stack.setVar("#thread", tc.split(666));
             setIndex(stack, 1);
             fn = (FlowElement) getElement(0);
--
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.
You are watching the assignee of the bug.
You are watching the reporter.
From bugzilla-daemon at mcs.anl.gov Wed Apr 1 21:00:52 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Wed, 1 Apr 2009 21:00:52 -0500 (CDT)
Subject: [Swift-devel] [Bug 116] simple_mapper handling of numbered files in
arrays broken
In-Reply-To:
References:
Message-ID: <20090402020052.537612CC70@wind.mcs.anl.gov>
https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=116
Mihael Hategan changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |hategan at mcs.anl.gov
--- Comment #3 from Mihael Hategan 2009-04-01 21:00:52 ---
Additionally, if non-numerically named files exist (say "test.in"), the simple
mapper tries to use that name as the index, which may or may not be the right
thing, but it causes a consistency check failure on the array:
Execution failed:
java.lang.RuntimeException: Array element has index 'test' that does not
parse as an integer.
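As an illustration, a minimal Java sketch of that consistency check; the class
and method names here are hypothetical, not the actual mapper code:

// Hypothetical sketch: a file name component used as an array index must
// parse as an integer, otherwise the check quoted above fails.
public class MapperIndexCheck {
    static int parseIndex(String name) {
        try {
            return Integer.parseInt(name);
        } catch (NumberFormatException e) {
            throw new RuntimeException("Array element has index '" + name
                    + "' that does not parse as an integer.");
        }
    }

    public static void main(String[] args) {
        System.out.println(parseIndex("0007")); // numeric names map to indices
        parseIndex("test");                     // throws the error shown above
    }
}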
--
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.
You are watching someone on the CC list of the bug.
You are watching the assignee of the bug.
You are watching the reporter.
From aespinosa at cs.uchicago.edu Wed Apr 1 21:13:55 2009
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Wed, 1 Apr 2009 21:13:55 -0500
Subject: [Swift-devel] array args in function apps
Message-ID: <50b07b4b0904011913u4fafafc8qe2624780137f1dad@mail.gmail.com>
type file;
app (file out) cat(file infile[]) {
cat @infile stdout=@out;
}
file infile[] ;
file out <"test.out">;
out= cat(infile);
Swift svn swift-r2748 cog-r2341
RunID: 20090401-2105-aevaa3o9
Progress:
Execution failed:
Exception in cat:
Arguments: [1.in 2.in 3.in 4.in 5.in 6.in 7.in 8.in 9.in 10.in]
Host: localhost
Directory: manyargs-20090401-2105-aevaa3o9/jobs/x/cat-xszn9s8j
stderr.txt: /bin/cat: 1.in 2.in 3.in 4.in 5.in 6.in 7.in 8.in 9.in
10.in: No such file or directory
stdout.txt:
----
Caused by:
Exit code 1
---
I remember that when I ran using regular arguments, there were commas
separating them in Swift's logs. Maybe 1.in ... 10.in is seen as one
string?
log is in http://www.ci.uchicago.edu/~aespinosa/swift/manyargs-20090401-2105-aevaa3o9.log
From aespinosa at cs.uchicago.edu Wed Apr 1 21:44:20 2009
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Wed, 1 Apr 2009 21:44:20 -0500
Subject: [Swift-devel] Re: array args in function apps
In-Reply-To: <50b07b4b0904011913u4fafafc8qe2624780137f1dad@mail.gmail.com>
References: <50b07b4b0904011913u4fafafc8qe2624780137f1dad@mail.gmail.com>
Message-ID: <50b07b4b0904011944q236ede8dg1c6c2175c336855@mail.gmail.com>
I got it working using @filenames:
type file;
app (file out) cat(file infile[]) {
cat @filenames(infile) stdout=@out;
}
file infile[] ;
file out <"test.out">;
out= cat(infile);
On Wed, Apr 1, 2009 at 9:13 PM, Allan Espinosa
wrote:
> type file;
>
> app (file out) cat(file infile[]) {
>   cat @filenames(infile) stdout=@out;
> }
>
> file infile[] ;
> file out <"test.out">;
> out= cat(infile);
>
> wift svn swift-r2748 cog-r2341
>
> RunID: 20090401-2105-aevaa3o9
> Progress:
> Execution failed:
>        Exception in cat:
> Arguments: [1.in 2.in 3.in 4.in 5.in 6.in 7.in 8.in 9.in 10.in]
> Host: localhost
> Directory: manyargs-20090401-2105-aevaa3o9/jobs/x/cat-xszn9s8j
> stderr.txt: /bin/cat: 1.in 2.in 3.in 4.in 5.in 6.in 7.in 8.in 9.in
> 10.in: No such file or directory
>
> stdout.txt:
> ----
>
> Caused by:
>        Exit code 1
>
> ---
> I remember when i ran using regular arguments, there are commas
> separating them in swift's logs. maybe 1.in ... 10.in is seen as one
> string?
>
>
> log is in http://www.ci.uchicago.edu/~aespinosa/swift/manyargs-20090401-2105-aevaa3o9.log
>
--
Allan M. Espinosa
PhD student, Computer Science
University of Chicago
From benc at hawaga.org.uk Thu Apr 2 03:10:47 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 2 Apr 2009 08:10:47 +0000 (GMT)
Subject: [Swift-devel] array args in function apps
In-Reply-To: <50b07b4b0904011913u4fafafc8qe2624780137f1dad@mail.gmail.com>
References: <50b07b4b0904011913u4fafafc8qe2624780137f1dad@mail.gmail.com>
Message-ID:
On Wed, 1 Apr 2009, Allan Espinosa wrote:
> maybe 1.in ... 10.in is seen as one
> string?
Yes. @filename(a) is even documented that way (and @a is an abbreviation for
@filename(a)).
--
From bugzilla-daemon at mcs.anl.gov Thu Apr 2 03:21:10 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Thu, 2 Apr 2009 03:21:10 -0500 (CDT)
Subject: [Swift-devel] [Bug 192] New: displeasing stack trace when pwd is
not writable
Message-ID:
https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=192
Summary: displeasing stack trace when pwd is not writable
Product: Swift
Version: unspecified
Platform: PC
OS/Version: Mac OS
Status: NEW
Severity: normal
Priority: P2
Component: General
AssignedTo: benc at hawaga.org.uk
ReportedBy: benc at hawaga.org.uk
The logging system outputs the following trace in swift 0.8 when pwd is
unwritable:
train02 at vm-125-58:/sw/swift$ swift
log4j:ERROR setFile(null,true) call failed.
java.io.FileNotFoundException: swift.log (Permission denied)
        at java.io.FileOutputStream.openAppend(Native Method)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:177)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:102)
        at org.apache.log4j.FileAppender.setFile(FileAppender.java:272)
        at org.apache.log4j.FileAppender.activateOptions(FileAppender.java:151)
        at org.apache.log4j.config.PropertySetter.activate(PropertySetter.java:247)
        at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:123)
        at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:87)
        at org.apache.log4j.PropertyConfigurator.parseAppender(PropertyConfigurator.java:645)
        at org.apache.log4j.PropertyConfigurator.parseCategory(PropertyConfigurator.java:603)
        at org.apache.log4j.PropertyConfigurator.configureRootCategory(PropertyConfigurator.java:500)
        at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:406)
        at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:432)
        at org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:460)
        at org.apache.log4j.LogManager.<clinit>(LogManager.java:113)
        at org.apache.log4j.Logger.getLogger(Logger.java:94)
        at org.globus.cog.karajan.Loader.<clinit>(Loader.java:43)
No SwiftScript program specified
--
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.
You are watching the reporter.
From wilde at mcs.anl.gov Thu Apr 2 21:31:39 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 02 Apr 2009 21:31:39 -0500
Subject: [Swift-devel] Discussion on next steps for Coasters
Message-ID: <49D5750B.10603@mcs.anl.gov>
I had a brief off-list discussion with Mihael on next steps for
coasters. I'm posting it here for group discussion and to get us started
on the same page.
This follows up on a discussion a few weeks ago on the same topic.
Rather than try to reorganize the email below, I'm posting it largely as-is
in the interest of effort and time.
Bottom line: Mihael will work on Coasters next, as he suggested in a
prior email, taking the next steps to harden them for users, establish a
better test mechanism and procedure, and work on some usability &
enhancement issues.
- Mike
-------- Original Message --------
Subject: Re: Hi / status / next ?
Date: Thu, 02 Apr 2009 21:01:14 -0500
From: Michael Wilde
To: Mihael Hategan
References: <49D551B8.5010105 at mcs.anl.gov>
<1238721084.19231.18.camel at localhost>
OK, all sounds good. Many more details to work out, but a short followup
below.
On 4/2/09 8:11 PM, Mihael Hategan wrote:
> On Thu, 2009-04-02 at 19:00 -0500, Michael Wilde wrote:
>> Hi Mihael,
>>
...
>> So next on Swift: I think you should do a fairly intensive burst of
>> effort on Coaster stabilization and portability, like you suggested on
>> the list a little while ago.
>
> Right.
>
>> At a very high level, what I want to see is:
>>
>> - solid test suite, so we know it's working on an agreed-on and growing
>> set of platforms, mainly the TG, OSG, and a few miscellaneous sites the
>> users need
>>
>> - solve the "GT2 / OSG thing", which I *think* involves starting coaster
>> workers from the submit host with GT2 using Condor-G.
>
> The complexity of adding condor-g into the loop will likely be nasty.
> But I'll try.
Before you start, then, especially if it's not an obvious answer, let's
sanity-check it with discussion on the list, as a proposed update to your
design doc.
>
>> - check that coaster shutdown is working.
>
> Is there any reason to believe it's not?
Yeah, some suspicious behavior that we (me, Glen) haven't been able to pin
down but suspect may be happening.
>
>> Then lower priority:
>>
>> - make it possible to allocate a persistent pool of Coaster workers all
>> at once (say, "gimme 1000 nodes on Ranger for 1 hour").
>
> That I think isn't a good idea. Here's why, and correct me if I'm
> missing something:
> - regardless of whether you use it or not, you need to wait for nodes to
> be available. Whether that waiting happens while swift is running or
> not, it still happens.
true
> - once you have a pre-set number of nodes, you need to quickly start
> swift and use them, otherwise you lose allocations. By contrast, in
> automatic mode, swift will use them as soon as they are available
true
> - allocation of a pre-set number of nodes may be delayed if that number
> of nodes is not available. In the automatic mode, swift will use fewer
> nodes when they are available and ramp up to whatever it can get. A
> limiting case, when your 1k nodes will not be available at all, shows
> that the automatic case will yield better performance (your workflow will
> finish).
true
> - better balancing can be done if there are multiple sites with
> automatic allocation.
all true ;)
The only case where it's handy is benchmarking a workflow on a known quantity
of nodes.
Driven in part by the fact that on the BGP, this is how they are allocated.
(But even there we could do multi-block allocation in varying chunks if
the allocator was aware of the scheduling policy of the cluster.)
So what I was thinking was "ask for N nodes all at once". In all cases,
it would be assumed "...and then start your workflow". So it would not
need to be a separate allocation.
Tied to an option to say "leave my nodes running when the wf is done", this
would, I think, meet all needs. But your points above are compelling,
hence this feature needs deliberation and is nowhere near the top of the
list. Higher on the list would be demand-based grow/shrink of the pool, but
in varying-sized blocks. And on all systems, I think, you need to free
in the same-sized blocks (of CPUs) that you allocated in.
It raises another Q: for some sites like TeraPort, which I think places
jobs on all cores independently, in today's coasters implementation, I
am assuming the user should not specify coastersPerNode > 1. True? (Even
though it has 2-core nodes.) We should clarify this in the user's guide.
I will ask this on the list right now so all can get the answer.
> One advantage to allocating blocks of coasters may be the possibility
> that a single multi-node job is started (so it solves the gt2
> scalability problem, but so does your provisioning point below).
I would be interested in this, both for its intrinsic performance
benefits and as a short-term solution to the OSG GT2 "overheating"
problem, especially if the Condor-G solution gets complex and takes a
long time to implement and perfect. I.e., as a short-term fix with long-term
benefits, it might make sense to do it first, assuming that *it* is not
harder than the Condor-G provider and coaster integration.
>
>> - other ways to leave coasters running for the next jobs
>
> Right. That may be possible with persistent services instead of the
> current transient scheme.
>
>> - ensure that coaster time allocations are being done sensibly
>>
>> - revisit the coaster "provisioning" mechanism in terms of in what
>> increments workers are allocated and released in
>>
>> - some kind of coaster status display
>>
>> - some way to probe a job thats running on a coaster?
>
> Define "probe".
- ps -f on the running process.
- probe its resource usage (/proc, also ps, etc)
- ls -lR of its jobdir (as these will more often be on /tmp)
We have these needs today; on the BGP under falkon we manually login to
the node, but thats cumbersome: hard to find the node; 2-stage login
process.
Low prio, a pipe dream. But theoretically do-able.
So, very cool, we are converging on a plan.
I'll cc most of the above to the list now.
>
>> Issue a shell
>> command on the worker of the job?
>>
>> - other things I missed.
>>
>> I'll send this to the list for discussion; what I mainly want to
>> understand from you first is your time availability, what you feel you
>> owe swift in terms of compensating from i2u2 hours, and anything you
>> know of on swift that is higher priority than the coaster things above?
>> (I don't, but I may be missing something)
>>
>> Lastly, how is Phong doing, and to what extent can he be self-sufficient
>> if you were to go 100% swift for a while?
>
> I think he'll be able to take over most things. However, with the
> current big push, he's probably not confident enough, so it may have to
> happen after the new version is put into production.
...
From wilde at mcs.anl.gov Fri Apr 3 08:38:08 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 03 Apr 2009 08:38:08 -0500
Subject: [Swift-devel] Probing running jobs
In-Reply-To: <1238732253.22128.12.camel@localhost>
References: <49D551B8.5010105@mcs.anl.gov>
<1238721084.19231.18.camel@localhost> <49D56DEA.104@mcs.anl.gov>
<1238732253.22128.12.camel@localhost>
Message-ID: <49D61140.1090109@mcs.anl.gov>
Following up on Mihael's question about a feature I listed in the to-do
list I proposed for coasters:
On 4/2/09 11:17 PM, Mihael Hategan wrote:
> On Thu, 2009-04-02 at 21:01 -0500, Michael Wilde wrote:
>>>> - some way to probe a job thats running on a coaster?
>>> Define "probe".
>> - ps -f on the running process.
>> - probe its resource usage (/proc, also ps, etc)
>> - ls -lR of its jobdir (as these will more often be on /tmp)
>>
>> We have these needs today; on the BGP under falkon we manually login to
>> the node, but thats cumbersome: hard to find the node; 2-stage login
>> process.
>>
>> Low prio, a pipe dream. But theoretically do-able.
>
> It should be possible (and somewhat interesting) to have a simple shell
> that can execute stuff on the workers while the job is running, so that
> you can issue your own commands.
>
> The question of how to find the right worker remains. Can you go a bit
> deeper into the details? How do you find the node currently (be as
> specific as you can be)?
In the oops workflow, I recall these cases at the moment:
1) Have my (large set of similar) jobs started?
2) Most jobs have finished. Are the remaining ones hung, or proceeding
normally but slower for some application- or data-specific reason?
--
For (1), on the BGP, if most or all cores in the partition have apps
running on them, we pick any core and log in to it. Then to see what that
particular app is doing, we tail its log file for progress compared to
its CPU time consumption (from ps). Note that its log file is on local
disk, because we set the "jobdir on local" option of swiftwrapper.
Logging in to a node means finding its IO node IP addr from a Falkon
dynamic config file, ssh-ing to the ION, then telnetting to an arbitrary
worker node (these are on 192.168.1.[0-63] private addrs), then running
ps and tail. If not all the worker nodes in a processor set are busy,
it's a nuisance to find one that is. If few are busy, it's not practical.
Overall, this technique is just a spot-check to see "are *any* of my
jobs running right", i.e. to see if we've (finally) got their arguments
correct, etc.
(1) is better solved with the same technique needed for (2) - given a
job, find its ION and worker node IPs, and ssh/telnet directly there,
which does not exist but is straightforward. On BGP the WNs are not
running ssh, hence the additional nuisance of telnet.
(2) is theoretically possible, but impractical, until we add a few
scripts to trace from a Swift job to the Falkon service that's running it
to the Falkon agent that's running it (again, in the BGP case). The data
for this exists. So we occasionally need (2) but can't do it.
Regarding the "question of how to find the right worker" - this starts with
having some sort of ID for each job that the user can use to go from
"source code based identity" through job status and then to job
location. (By job here I mean an execution of an app() proc.)
I have not yet looked at your status monitor, but am eager to try it. So
I don't know if you took any steps in there to correlate a job's proc
name and args to its status. But that's what I think the user ultimately
needs and wants.
For example, in oops, the majority of tasks are either of app "runrama"
or "runoops". They have a mixture of scalar and file args.
I'd like to see in the status something sort of like strace, where syscalls
have potentially long arg lists (when formatted) but there's a canonical
way to present them in an acceptably compact format, with ... ellipsis as
needed.
So as app invocations become known to swift, they get IDs starting from
0, (PID-like but not wrapping around), and are listed in the progress
log as:
Job 123 is Proc runrama Args 456 input/prot/.../...00.019.1ubq.pdb etc
Job 123 input transfer OK
Job 123 submitted - teraport/coaster92
Job 124 is
Job 125 is
Job 123 output transfer OK
job 123 ended OK
And then, I can say:
probe 123 "ps -ef | grep runrama; tail -3 /tmp/work/*/*/runrama.log"
(for starters).
So the capability depends on having usable IDs for jobs and coasters,
maybe more objects, so that the user can specify a job of interest and
the system can send the user's probe to that job.
Something simple, flexible, and shell-like is good to start with, so we
can explore what's needed and ideally create scripts to wrap more
powerful capabilities.
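As an illustration of the plumbing this needs, here is a minimal Java sketch;
the job-ID-to-worker registry and the use of plain ssh are assumptions made for
the sketch, not how coasters currently expose workers:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of a "probe <job-id> <command>" helper. Only the idea of mapping
// a job ID to its worker and shipping a shell command there is illustrated.
public class JobProbe {
    private final Map<String, String> jobToWorker = new ConcurrentHashMap<String, String>();

    // Called when a job becomes active on a known worker host.
    public void register(String jobId, String workerHost) {
        jobToWorker.put(jobId, workerHost);
    }

    // Run a shell command on the worker executing the given job and
    // return its combined output.
    public String probe(String jobId, String command)
            throws IOException, InterruptedException {
        String host = jobToWorker.get(jobId);
        if (host == null) {
            return "no active worker known for job " + jobId;
        }
        Process p = new ProcessBuilder("ssh", host, command)
                .redirectErrorStream(true).start();
        StringBuilder out = new StringBuilder();
        BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()));
        for (String line; (line = r.readLine()) != null; ) {
            out.append(line).append('\n');
        }
        p.waitFor();
        return out.toString();
    }
}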
From benc at hawaga.org.uk Fri Apr 3 09:33:13 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 3 Apr 2009 14:33:13 +0000 (GMT)
Subject: [Swift-devel] Probing running jobs
In-Reply-To: <49D61140.1090109@mcs.anl.gov>
References: <49D551B8.5010105@mcs.anl.gov>
<1238721084.19231.18.camel@localhost>
<49D56DEA.104@mcs.anl.gov> <1238732253.22128.12.camel@localhost>
<49D61140.1090109@mcs.anl.gov>
Message-ID:
Not addressing the bulk of your email, just the bit about IDs: almost
everything in Swift that can be identified has some identifier on it from
the log-processing and provenance work - at least datasets, procedure
invocations, job executions, and file transfers.
--
From benc at hawaga.org.uk Fri Apr 3 14:10:10 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 3 Apr 2009 19:10:10 +0000 (GMT)
Subject: [Swift-devel] sync
Message-ID:
Just now I got this on a pbs+nfs cluster (the one at the University of
Johannesburg that I am involved with).
It seems a little degenerate that, in failing to record restart information
for reliability, the run dies.
Caused by: java.io.SyncFailedException: sync failed
        at java.io.FileDescriptor.sync(Native Method)
        at org.globus.cog.karajan.workflow.nodes.restartLog.FlushableLockedFileWriter.flush(FlushableLockedFileWriter.java:40)
        at org.globus.cog.karajan.workflow.nodes.restartLog.LogVargOperator.update(LogVargOperator.java:37)
        ... 37 more
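A minimal sketch of the more forgiving behaviour this suggests, assuming a
hypothetical writer class (not the actual FlushableLockedFileWriter code):
degrade a failed sync to a warning so the restart log is best-effort instead
of fatal.

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.SyncFailedException;

// Hypothetical restart-log writer that tolerates fsync failures (e.g. on
// NFS). Restart information may be lost on a crash, but the run continues.
public class TolerantRestartLogWriter {
    private final FileOutputStream out;

    public TolerantRestartLogWriter(FileOutputStream out) {
        this.out = out;
    }

    public void flush() throws IOException {
        out.flush();
        try {
            out.getFD().sync();
        } catch (SyncFailedException e) {
            System.err.println("Warning: restart log sync failed: " + e.getMessage());
        }
    }
}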
--
From hategan at mcs.anl.gov Sat Apr 4 16:34:32 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sat, 04 Apr 2009 16:34:32 -0500
Subject: [Swift-devel] Re: Probing running jobs
In-Reply-To: <49D61140.1090109@mcs.anl.gov>
References: <49D551B8.5010105@mcs.anl.gov>
<1238721084.19231.18.camel@localhost> <49D56DEA.104@mcs.anl.gov>
<1238732253.22128.12.camel@localhost> <49D61140.1090109@mcs.anl.gov>
Message-ID: <1238880872.8212.13.camel@localhost>
On Fri, 2009-04-03 at 08:38 -0500, Michael Wilde wrote:
> Following up on Mihael's question about a feature I listed in the to-do
> list I proposed for coasters:
>
> On 4/2/09 11:17 PM, Mihael Hategan wrote:
> > On Thu, 2009-04-02 at 21:01 -0500, Michael Wilde wrote:
> >>>> - some way to probe a job thats running on a coaster?
> >>> Define "probe".
> >> - ps -f on the running process.
> >> - probe its resource usage (/proc, also ps, etc)
> >> - ls -lR of its jobdir (as these will more often be on /tmp)
> >>
> >> We have these needs today; on the BGP under falkon we manually login to
> >> the node, but thats cumbersome: hard to find the node; 2-stage login
> >> process.
> >>
> >> Low prio, a pipe dream. But theoretically do-able.
> >
> > It should be possible (and somewhat interesting) to have a simple shell
> > that can execute stuff on the workers while the job is running, so that
> > you can issue your own commands.
> >
> > The question of how to find the right worker remains. Can you go a bit
> > deeper into the details? How do you find the node currently (be as
> > specific as you can be)?
>
> In the oops workflow, I recall these cases at the moment:
>
> 1) Have my (large set of similar) jobs started?
>
> 2) Most jobs have finished. Are the remaining ones hung, or proceeding
> normally but slower for some application- or data-specific reason?
[...]
In swift r2821 cog r2365 (I think), there is such a feature.
If you start with the console monitor, you can go to the list of jobs.
Then select desired job, and push enter to display a detail pane. If the
job is in the active state and if it's running on a coaster worker, that
detail pane will have an extra button named "Worker Terminal". Pressing
that will pop up a simple terminal that can be used to run relatively
arbitrary commands on the worker that the job is running on.
It won't run commands that require console input (e.g., vi), so don't
try.
It won't start you in the job directory, but the swift workflow
directory. That's because at some point we stopped using the GRAM
directory attribute for setting the initial job dir because some silly
site on OSG doesn't honor it. I think we should revisit the issue (I
suspect there is a solution that works in both cases).
From wilde at mcs.anl.gov Sat Apr 4 16:59:44 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Sat, 04 Apr 2009 16:59:44 -0500
Subject: [Swift-devel] coaster status report
Message-ID: <49D7D850.9060701@mcs.anl.gov>
With OOPS Glen was able to get some promising runs queued on Ranger,
using the default properties and the sites setting from the SEM runs.
Looking great so far, and above all was very easy to get it going.
That's very exciting!
One run shows a few (3 out of 100 or so) failures that were retried
successfully. We need to track these down and see if it was a transient
app failure or something in Swift, etc.
Then we turned to Abe and Queenbee. That was amazingly easy to configure
and get running. Glen is scaling it up as we speak, trying for 2 sites x
40 jobs x 8 cores = 640 cores between the two.
In initial small tests, though - 50 parallel app() calls - it's sending
all jobs to Abe, none to Queenbee. We checked the usual sites and tc
things; it *seems* ok there. Possibly either a bug or a scheduler anomaly?
We'll try with more jobs, and see; will send logs and sites etc files if
that anomaly persists at larger scales.
Seems like both these sites have WS-GRAM enabled; we'd like to try that
as well, to expand beyond the 40-job per site suggested limit. Would
like to get 1000 cores active on this problem. 2 x 60 x 8 or so.
Then will add in a few more fruitful TG sites.
Towards this end, Mihael, if you have the urge to probe at a
setting/config that lets us start coasters in 4-8 node batches, this
would be a great time to try that. I suspect you don't know yet whether that
will be easy, hard, or in between?
Another note on coaster boot:
- old problems on Abe with funky limitations on non-login shells seem
to have gone away, either from the latest coaster strategy (-l issues?)
or from Abe changes.
- on queenbee, initial run got this error:
Could not start coaster service
Caused by:
Task ended before registration was received.
STDOUT: Warning: -jar not understood. Ignoring.
Exception in thread "main" java.lang.NoClassDefFoundError:
.tmp.bootstrap.y10420
at gnu.gcj.runtime.FirstThread.run() (/usr/lib64/libgcj.so.5.0.0)
Turns out default java was 1.4.2 something.
We added @default to .soft to get Java 1.6.
Then coasters bootstrapped fine. This was nice to see, that a simple
workaround was easy!
At any rate, very productive, very promising, very pleasing to use.
Nice work!
- Mike
From wilde at mcs.anl.gov Sat Apr 4 17:01:53 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Sat, 04 Apr 2009 17:01:53 -0500
Subject: [Swift-devel] Re: Probing running jobs
In-Reply-To: <1238880872.8212.13.camel@localhost>
References: <49D551B8.5010105@mcs.anl.gov>
<1238721084.19231.18.camel@localhost> <49D56DEA.104@mcs.anl.gov>
<1238732253.22128.12.camel@localhost>
<49D61140.1090109@mcs.anl.gov> <1238880872.8212.13.camel@localhost>
Message-ID: <49D7D8D1.3030304@mcs.anl.gov>
Wow! Way cool - I can't wait to try this and the monitor.
But need to clone myself.
Maybe Glen, you can try this on oops tests...
- Mike
On 4/4/09 4:34 PM, Mihael Hategan wrote:
> On Fri, 2009-04-03 at 08:38 -0500, Michael Wilde wrote:
>> Following up on Mihael's question about a feature I listed in the to-do
>> list I proposed for coasters:
>>
>> On 4/2/09 11:17 PM, Mihael Hategan wrote:
>>> On Thu, 2009-04-02 at 21:01 -0500, Michael Wilde wrote:
>>>>>> - some way to probe a job thats running on a coaster?
>>>>> Define "probe".
>>>> - ps -f on the running process.
>>>> - probe its resource usage (/proc, also ps, etc)
>>>> - ls -lR of its jobdir (as these will more often be on /tmp)
>>>>
>>>> We have these needs today; on the BGP under falkon we manually login to
>>>> the node, but thats cumbersome: hard to find the node; 2-stage login
>>>> process.
>>>>
>>>> Low prio, a pipe dream. But theoretically do-able.
>>> It should be possible (and somewhat interesting) to have a simple shell
>>> that can execute stuff on the workers while the job is running, so that
>>> you can issue your own commands.
>>>
>>> The question of how to find the right worker remains. Can you go a bit
>>> deeper into the details? How do you find the node currently (be as
>>> specific as you can be)?
>> In the oops workflow, I recall these cases at the moment:
>>
>> 1) Have my (large set of similar) jobs started?
>>
>> 2) Most jobs have finished. Are the remaining ones hung, or proceeding
>> normally but slower for some application- or data-specific reason?
> [...]
>
> In swift r2821 cog r2365 (I think), there is such a feature.
>
> If you start with the console monitor, you can go to the list of jobs.
> Then select desired job, and push enter to display a detail pane. If the
> job is in the active state and if it's running on a coaster worker, that
> detail pane will have an extra button named "Worker Terminal". Pressing
> that will pop up a simple terminal that can be used to run relatively
> arbitrary commands on the worker that the job is running on.
>
> It won't run commands that require console input (e.g., vi), so don't
> try.
>
> It won't start you in the job directory, but the swift workflow
> directory. That's because at some point we stopped using the GRAM
> directory attribute for setting the initial job dir because some silly
> site on OSG doesn't honor it. I think we should revisit the issue (I
> suspect there is a solution that works in both cases).
>
From wilde at mcs.anl.gov Sat Apr 4 17:03:55 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Sat, 04 Apr 2009 17:03:55 -0500
Subject: [Swift-devel] coaster status report
In-Reply-To: <49D7D850.9060701@mcs.anl.gov>
References: <49D7D850.9060701@mcs.anl.gov>
Message-ID: <49D7D94B.1090504@mcs.anl.gov>
A small clarification here -
we had to turn away from Ranger because the queue was gruesome.
The 3-failure issue was on Abe. Not much to say till we find and examine
the log on this one.
- Mike
On 4/4/09 4:59 PM, Michael Wilde wrote:
> With OOPS Glen was able to get some promising runs queued on Ranger,
> using the default properties and the sites setting from the SEM runs.
>
> Looking great so far, and above all was very easy to get it going.
>
> Thats very exciting!
>
> One run shows a few (3 out of 100 or so) failures that were retried
> successfully. We need to trak these down, and see if it was a transient
> app failure or something in swift etc.
>
> Then we turned to Abe and Queenbee. That was amazingly easy to configure
> and get running. Glen is scaling it up as we speak, trying for 2 sites x
> 40 jobs x 8 cores = 640 cores tween the two.
>
> In initial small tests, though - 50 parallel app() calls - its sending
> all jobs to abe, none to queenbee. We checked the usual sites, tc
> things, *seems* ok there. Possibly either a bg or a scheduler anomaly?
> We'll try with more jobs, and see; will send logs and sites etc files if
> that anomaly persists at larger scales.
>
> Seems like both these sites have WS-GRAM enabled; we'd like to try that
> as well, to expand beyond the 40-job per site suggested limit. Would
> like to get 1000 cores active on this problem. 2 x 60 x 8 or so.
>
> Then will add in a few more fruitful TG sites.
>
> Towards this end, Mihael, if you have the urge to probe at a
> setting/config that lets us start coasters in 4-8 node batches, this
> would be a great time to try that. I suspect you dont know yet if that
> will be easy, hard, or in between?
>
> Another note on coaster boot:
>
> - old problems on Abe with funky limitations on non-login shells seems
> to have gone away, either from the latest coaster strategy (-l issues?)
> or from Abe changes.
>
> - on queenbee, initial run got this error:
>
> Could not start coaster service
> Caused by:
> Task ended before registration was received.
> STDOUT: Warning: -jar not understood. Ignoring.
> Exception in thread "main" java.lang.NoClassDefFoundError:
> .tmp.bootstrap.y10420
> at gnu.gcj.runtime.FirstThread.run() (/usr/lib64/libgcj.so.5.0.0)
>
> Turns out default java was 1.4.2 something.
>
> We added @default to .soft to get Java 1.6.
> Then coasters bootstrapped fine. This was nice to see, that a simple
> workaround was easy!
>
> At any rate, very productive, very promising, very pleasing to use.
>
> Nice work!
>
> - Mike
>
>
>
>
>
>
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
From hategan at mcs.anl.gov Sat Apr 4 17:06:43 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sat, 04 Apr 2009 17:06:43 -0500
Subject: [Swift-devel] coaster status report
In-Reply-To: <49D7D850.9060701@mcs.anl.gov>
References: <49D7D850.9060701@mcs.anl.gov>
Message-ID: <1238882803.9038.1.camel@localhost>
On Sat, 2009-04-04 at 16:59 -0500, Michael Wilde wrote:
> - on queenbee, initial run got this error:
>
> Could not start coaster service
> Caused by:
> Task ended before registration was received.
> STDOUT: Warning: -jar not understood. Ignoring.
> Exception in thread "main" java.lang.NoClassDefFoundError:
> .tmp.bootstrap.y10420
> at gnu.gcj.runtime.FirstThread.run() (/usr/lib64/libgcj.so.5.0.0)
>
> Turns out default java was 1.4.2 something.
It looks like the default java is GCJ, which I wouldn't dare call "Java"
because it probably fails too many compliance tests.
>
> We added @default to .soft to get Java 1.6.
> Then coasters bootstrapped fine. This was nice to see, that a simple
> workaround was easy!
Right. Good call.
From hategan at mcs.anl.gov Sat Apr 4 17:08:59 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sat, 04 Apr 2009 17:08:59 -0500
Subject: [Swift-devel] coaster status report
In-Reply-To: <49D7D94B.1090504@mcs.anl.gov>
References: <49D7D850.9060701@mcs.anl.gov> <49D7D94B.1090504@mcs.anl.gov>
Message-ID: <1238882939.9038.4.camel@localhost>
On Sat, 2009-04-04 at 17:03 -0500, Michael Wilde wrote:
> small clarification here -
>
> we had to turn away from range because the queue was gruesome.
Yeah, but when it starts, it goooees.
The beauty of multi-site runs (with replication enabled, which may or
may not work properly) is that swift will make the best use of what's
there.
From wilde at mcs.anl.gov Sat Apr 4 17:15:00 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Sat, 04 Apr 2009 17:15:00 -0500
Subject: [Swift-devel] coaster status report
In-Reply-To: <1238882939.9038.4.camel@localhost>
References: <49D7D850.9060701@mcs.anl.gov> <49D7D94B.1090504@mcs.anl.gov>
<1238882939.9038.4.camel@localhost>
Message-ID: <49D7DBE4.3070604@mcs.anl.gov>
On 4/4/09 5:08 PM, Mihael Hategan wrote:
> On Sat, 2009-04-04 at 17:03 -0500, Michael Wilde wrote:
>> small clarification here -
>>
>> we had to turn away from range because the queue was gruesome.
>
> Yeah, but when it starts, it goooees.
>
> The beauty of multi-site runs (with replication enabled, which may or
> may not work properly) is that swift will make the best use of what's
> there.
Exactly. And I think Glen's group is eager to use it in exactly that way
- send to TeraGrid and walk away, and not even bother manually
checking traffic and load etc. Very promising.
OOPS seems to compile cleanly everywhere we have tried, including BGP and
Sicortex, and Glen has tested Zhengxiong's ADEM installer on OSG, where
he got it installed on 8 sites in a blink.
Glen is also working on a tgsites command that generates a correct
user-specific sites.xml for TG, so ADEM and general use for both grids
is within reach.
It's all coming together very nicely.
From benc at hawaga.org.uk Sat Apr 4 17:21:44 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sat, 4 Apr 2009 22:21:44 +0000 (GMT)
Subject: [Swift-devel] Re: Probing running jobs
In-Reply-To: <1238880872.8212.13.camel@localhost>
References: <49D551B8.5010105@mcs.anl.gov>
<1238721084.19231.18.camel@localhost>
<49D56DEA.104@mcs.anl.gov> <1238732253.22128.12.camel@localhost>
<49D61140.1090109@mcs.anl.gov> <1238880872.8212.13.camel@localhost>
Message-ID:
On Sat, 4 Apr 2009, Mihael Hategan wrote:
> It won't start you in the job directory, but the swift workflow
> directory. That's because at some point we stopped using the GRAM
> directory attribute for setting the initial job dir because some silly
> site on OSG doesn't honor it. I think we should revisit the issue (I
> suspect there is a solution that works in both cases).
I think that has not been the case since at least before the CI SVN
started to be used.
The first mention of specifying a directory attribute for task:execute in
execute2 was r127, which specified wfdir. Before that, no directory was
specified at all.
The job directory has seemingly always been passed as a parameter of one
kind or another to the wrapper script.
--
From benc at hawaga.org.uk Sun Apr 5 06:11:28 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sun, 5 Apr 2009 11:11:28 +0000 (GMT)
Subject: [Swift-devel] Swift 0.9 release for ~2nd April
In-Reply-To:
References:
Message-ID:
On Mon, 23 Mar 2009, Ben Clifford wrote:
> > I'd like to put out the Swift 0.9 release on the 2nd of April, with the
> > release candidate being made from SVN on the 23rd of March.
>
> the present trunk seems way too unstable for a release candidate. so not
> today.
for now I'm planning on looking at making 0.9 again in the 2nd half of
April.
--
From hategan at mcs.anl.gov Sun Apr 5 09:59:29 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sun, 05 Apr 2009 09:59:29 -0500
Subject: [Swift-devel] Re: Probing running jobs
In-Reply-To:
References: <49D551B8.5010105@mcs.anl.gov>
<1238721084.19231.18.camel@localhost> <49D56DEA.104@mcs.anl.gov>
<1238732253.22128.12.camel@localhost> <49D61140.1090109@mcs.anl.gov>
<1238880872.8212.13.camel@localhost>
Message-ID: <1238943569.14220.0.camel@localhost>
On Sat, 2009-04-04 at 22:21 +0000, Ben Clifford wrote:
> On Sat, 4 Apr 2009, Mihael Hategan wrote:
>
> > It won't start you in the job directory, but the swift workflow
> > directory. That's because at some point we stopped using the GRAM
> > directory attribute for setting the initial job dir because some silly
> > site on OSG doesn't honor it. I think we should revisit the issue (I
> > suspect there is a solution that works in both cases).
>
> I think that has not been the case since at least before the CI SVN
> started to be used.
>
> The first mention of specifying a directory attribute for task:execute in
> execute2 was r127, which specified wfdir. Before that, no directory was
> specified at all.
>
> The job directory has seemingly always been passed as a parameter of one
> kind or another to the wrapper script.
>
You are right. I was confused.
From benc at hawaga.org.uk Sun Apr 5 16:51:26 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sun, 5 Apr 2009 21:51:26 +0000 (GMT)
Subject: [Swift-devel] too many site initializations
Message-ID:
vdl-int.k has this code, which I think is meant to make site
initialization happen only once per site (and have only one job in the
"Initializing site shared directory" progress ticker state):
element(initSharedDir, [rhost]
    once(list(rhost, "shared")
        vdl:setprogress("Initializing site shared directory")
However I see things like this:
Progress: Selecting site:2932 Initializing site shared directory:102
Submitted:64 Active:69 Finished successfully:204
when I run with around 20..30 OSG sites, which suggests the onceness
isn't happening there.
I don't have time to investigate properly now, but it seemed interesting to
comment on.
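For reference, the intended "once per key" semantics sketched in Java; this
only illustrates the semantics, it is not the Karajan once() implementation:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustration of "run initialization at most once per site", keyed by
// the remote host. Names here are hypothetical.
public class OncePerSite {
    private final Map<String, Boolean> initialized = new ConcurrentHashMap<String, Boolean>();

    public void initSharedDir(String rhost, Runnable init) {
        // putIfAbsent returns null only for the first caller for a given
        // key, so init runs once per site even with concurrent jobs.
        if (initialized.putIfAbsent(rhost, Boolean.TRUE) == null) {
            init.run();
        }
    }
}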
--
From bugzilla-daemon at mcs.anl.gov Sun Apr 5 18:09:42 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Sun, 5 Apr 2009 18:09:42 -0500 (CDT)
Subject: [Swift-devel] [Bug 193] New: replication job cancellation using pbs
provider causes spurious console output
Message-ID:
https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=193
Summary: replication job cancellation using pbs provider causes
spurious console output
Product: Swift
Version: unspecified
Platform: PC
OS/Version: Mac OS
Status: NEW
Severity: normal
Priority: P2
Component: General
AssignedTo: benc at hawaga.org.uk
ReportedBy: benc at hawaga.org.uk
CC: swift-devel at ci.uchicago.edu
I get lines like this when job replicas are cancelled, where grid.uj.ac.za is a
site I submitted to using provider=pbs
Canceling job 33353.gridvm.grid.uj.ac.za
I guess this is either a spurious print in CoG or an incorrect log setting
in Swift, but I have not looked any deeper.
--
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.
You are watching the assignee of the bug.
You are watching the reporter.
From wilde at mcs.anl.gov Sun Apr 5 18:14:34 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Sun, 05 Apr 2009 18:14:34 -0500
Subject: [Swift-devel] using WS GRAM
Message-ID: <49D93B5A.8040103@mcs.anl.gov>
Glen, try this:
to try a few jobs "plain" on abe and qb
then try coasters using gt4:gt4:pbs
- Mike
ps. beware, both might be blazing new territory
From benc at hawaga.org.uk Sun Apr 5 18:18:46 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sun, 5 Apr 2009 23:18:46 +0000 (GMT)
Subject: [Swift-devel] using WS GRAM
In-Reply-To: <49D93B5A.8040103@mcs.anl.gov>
References: <49D93B5A.8040103@mcs.anl.gov>
Message-ID:
On Sun, 5 Apr 2009, Michael Wilde wrote:
> ps. beware, both might be blazing new territory
Is that code for "every time anyone tries GRAM4 on teragrid, it doesn't
work?" ;)
--
From bugzilla-daemon at mcs.anl.gov Mon Apr 6 06:29:53 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Mon, 6 Apr 2009 06:29:53 -0500 (CDT)
Subject: [Swift-devel] [Bug 194] New: more analysis for replication
Message-ID:
https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=194
Summary: more analysis for replication
Product: Swift
Version: unspecified
Platform: PC
OS/Version: Mac OS
Status: NEW
Severity: enhancement
Priority: P2
Component: Log processing and plotting
AssignedTo: benc at hawaga.org.uk
ReportedBy: benc at hawaga.org.uk
One tab/page with information about replication.
Stuff that exists already:
Comparison for each site of how many jobs were submitted, executed
successfully, cancelled for replication (so similar to the execute2 sites
table)
Queue length distribution - something like the chart 'karajan queued
JOB_SUBMISSION cumulative duration'.
Stuff that could be collected/generated:
over the duration of the run, how the replication threshold (which is based on
mean queue time at the moment) varies.
--
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.
You are watching the reporter.
From benc at hawaga.org.uk Mon Apr 6 06:34:47 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 6 Apr 2009 11:34:47 +0000 (GMT)
Subject: [Swift-devel] goodness metrics for replication
Message-ID:
Wondering what 'goodness' metrics are for replication.
One is "how many jobs were replicated but the first submission executed
(so the replication was in some sense wasted)".
I'd be interested in ideas for other metrics.
--
From benc at hawaga.org.uk Mon Apr 6 08:26:42 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 6 Apr 2009 13:26:42 +0000 (GMT)
Subject: [Swift-devel] replication vs site score
Message-ID:
More ongoing ramblings as I'm making slides about this...
I'm not sure at the moment whether a job being cancelled due to
replication causes the site's score to change.
Maybe cancellation-due-to-other-replica-starting should be regarded as
badness and reduce that site's score - "we asked you to run this job but
were so slow we essentially regarded you as failing". Maybe they shouldn't
be.
Either way, it should be documented.
--
From bugzilla-daemon at mcs.anl.gov Mon Apr 6 08:47:40 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Mon, 6 Apr 2009 08:47:40 -0500 (CDT)
Subject: [Swift-devel] [Bug 195] New: info vs karajan states graph doesn't
work well when an info file is missing
Message-ID:
https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=195
Summary: info vs karajan states graph doesn't work well when an
info file is missing
Product: Swift
Version: unspecified
Platform: PC
OS/Version: Mac OS
Status: NEW
Severity: normal
Priority: P2
Component: Log processing and plotting
AssignedTo: benc at hawaga.org.uk
ReportedBy: benc at hawaga.org.uk
For graphs such as this: "Offsets between job submission Active events and
start times reported by info."
when an info file is missing, it appears to be regarded as having a start time of
the Unix epoch, which causes the automatic axis scaling to hide the actual
useful information in this graph (which is for jobs where both a karajan and
info start/end time are known).
Such jobs should probably be omitted from these charts entirely.
--
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.
You are watching the reporter.
From bugzilla-daemon at mcs.anl.gov Mon Apr 6 09:08:41 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Mon, 6 Apr 2009 09:08:41 -0500 (CDT)
Subject: [Swift-devel] [Bug 196] New: site score page should show site
scores colour coded by site
Message-ID:
https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=196
Summary: site score page should show site scores colour coded
by site
Product: Swift
Version: unspecified
Platform: PC
OS/Version: Mac OS
Status: NEW
Severity: enhancement
Priority: P2
Component: Log processing and plotting
AssignedTo: benc at hawaga.org.uk
ReportedBy: benc at hawaga.org.uk
site score page should show site scores colour coded by site
--
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.
You are watching the reporter.
From wilde at mcs.anl.gov Mon Apr 6 09:13:10 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 06 Apr 2009 09:13:10 -0500
Subject: [Swift-devel] Swift issues and next steps on OOPS app
In-Reply-To: <49D800B5.4060109@uchicago.edu>
References: <49D800B5.4060109@uchicago.edu>
Message-ID: <49DA0DF6.5030300@mcs.anl.gov>
was: Re: status update
On 4/4/09 7:52 PM, Glen Hocky wrote:
> Things seem to be kind of working on all machines (including ranger,
> which picked up some speed) but not totally.
So for Ranger at the moment we can run default params and hope for 640
cores at a time. We should queue up several science runs of full-scale
rounds, and assess the results and run times.
> Problems to investigate this week:
> swift choking after running lots of jobs successfully (should probably
> just ignore these things)
I'm not sure which errors you mean here - let's examine them first. Do
you mean the "successfully retried" errors?
> swift not balancing load across different sites (dumps all ones for my
> teragrid sites file onto one site, grr!)
Can you send a log of this to the Swift developers? They need that in
order to look at this problem.
I will do a sanity test of WS-GRAM with coasters on Abe and Queenbee. If
it works, we should expand our science runs there.
These are good things to do today while BG/P is down.
- Mike
>
> Glen
From benc at hawaga.org.uk Mon Apr 6 09:19:10 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 6 Apr 2009 14:19:10 +0000 (GMT)
Subject: [Swift-devel] Swift issues and next steps on OOPS app
In-Reply-To: <49DA0DF6.5030300@mcs.anl.gov>
References: <49D800B5.4060109@uchicago.edu> <49DA0DF6.5030300@mcs.anl.gov>
Message-ID:
On Mon, 6 Apr 2009, Michael Wilde wrote:
> > swift not balancing load accross different sites (dumps all ones for my
> > teragrid sites file onto one site, grr!)
>
> Can you send a log of this to the Swift developers? They need that in order to
> look at this problem.
For this, please also send the command line that you invoke Swift with, your
sites file, and your tc.data.
--
From wilde at mcs.anl.gov Mon Apr 6 09:35:16 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 06 Apr 2009 09:35:16 -0500
Subject: [Swift-devel] Swift issues and next steps on OOPS app
In-Reply-To:
References: <49D800B5.4060109@uchicago.edu> <49DA0DF6.5030300@mcs.anl.gov>
Message-ID: <49DA1324.2010204@mcs.anl.gov>
and swift.properties?
(Aside to devel team: can we snapshot *all* this info into the start of
the log? It's trivially short compared to the length of most logs.)
- command line
- sites file, tc file
- swift.properties
I'll file it as an enhancement bug if there is agreement.
On 4/6/09 9:19 AM, Ben Clifford wrote:
> On Mon, 6 Apr 2009, Michael Wilde wrote:
>
>>> swift not balancing load accross different sites (dumps all ones for my
>>> teragrid sites file onto one site, grr!)
>> Can you send a log of this to the Swift developers? They need that in order to
>> look at this problem.
>
> For this also please sent the commandline that invoke Swift with, your
> sites file and your tc.data.
>
From benc at hawaga.org.uk Mon Apr 6 09:39:11 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 6 Apr 2009 14:39:11 +0000 (GMT)
Subject: [Swift-devel] Re: replication vs site score
In-Reply-To:
References:
Message-ID:
even more rambling... in the context of a scheduler that is doing things
like prioritising jobs based on more than the order that Swift happened to
submit them (hopefully I will have a student for this in the summer), I
think a replicant job should be pushed toward later execution rather than
earlier execution to reduce the number of replicant jobs in the system at
any one time.
This is because I suspect (though I have gathered no numerical evidence)
that given the choice between submitting a fresh job and a replicant job
(making up terminology here too... mmm), it is almost always better to
submit the fresh job. Either we end up submitting the replicant job
eventually (in which case we are no worse off than if we submitted the
replicant first and then a fresh job); or by delaying the replicant job we
give that replicant's original a chance to start running and thus do not
discard our precious time-and-load-dollars that we have already spent on
queueing that replicant's original.
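A minimal Java sketch of that ordering (the Job class is a hypothetical
stand-in, not Swift's internal task representation): fresh jobs sort ahead of
replicas, and within each group submission order is preserved.

import java.util.Comparator;
import java.util.PriorityQueue;

// Illustration only: push replicant jobs toward later execution.
class Job {
    final long seq;        // submission order
    final boolean replica; // true if this is a replica of a still-queued job
    Job(long seq, boolean replica) { this.seq = seq; this.replica = replica; }
}

class ReplicaLastQueue {
    private long counter = 0;
    private final PriorityQueue<Job> queue = new PriorityQueue<>(
            Comparator.comparing((Job j) -> j.replica)   // fresh (false) first
                      .thenComparingLong(j -> j.seq));   // then FIFO

    void submit(boolean replica) { queue.add(new Job(counter++, replica)); }
    Job next() { return queue.poll(); }
}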
--
From benc at hawaga.org.uk Mon Apr 6 09:43:56 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 6 Apr 2009 14:43:56 +0000 (GMT)
Subject: [Swift-devel] Swift issues and next steps on OOPS app
In-Reply-To: <49DA1324.2010204@mcs.anl.gov>
References: <49D800B5.4060109@uchicago.edu> <49DA0DF6.5030300@mcs.anl.gov>
<49DA1324.2010204@mcs.anl.gov>
Message-ID:
On Mon, 6 Apr 2009, Michael Wilde wrote:
> I'll file as enh bug if there is agreement.
yep
--
From foster at anl.gov Mon Apr 6 09:46:29 2009
From: foster at anl.gov (Ian Foster)
Date: Mon, 6 Apr 2009 09:46:29 -0500
Subject: [Swift-devel] Re: replication vs site score
In-Reply-To:
References:
Message-ID: <1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov>
Ben:
You may recall the work that was done by Greg Maleciwz (sp?) on
prioritizing jobs that enable new jobs to run. Those ideas seem
relevant here.
I met last week with a smart fellow in Singapore, Qin Zheng (CCed
here), who has been working on the scheduling of replicant jobs. His
interest is in doing this for jobs that have failed, while I think
your interest is in scheduling for jobs that may have failed--a
somewhat different thing. But there may be a connection.
Ian.
On Apr 6, 2009, at 9:39 AM, Ben Clifford wrote:
> even more rambling... in the context of a scheduler that is doing
> things
> like prioritising jobs based on more than the order that Swift
> happened to
> submit them (hopefully I will have a student for this in the
> summer), I
> think a replicant job should be pushed toward later execution rather
> than
> earlier execution to reduce the number of replicant jobs in the
> system at
> any one time.
>
> This is because I suspect (though I have gathered no numerical
> evidence)
> that given the choice between submitting a fresh job and a replicant
> job
> (making up terminology here too... mmm), it is almost always better to
> submit the fresh job. Either we end up submitting the replicant job
> eventually (in which case we are no worse off than if we submitted the
> replicant first and then a fresh job); or by delaying the replicant
> job we
> give that replicant's original a chance to start running and thus do
> not
> discard our precious time-and-load-dollars that we have already
> spent on
> queueing that replicant's original.
>
> --
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
From benc at hawaga.org.uk Mon Apr 6 10:00:08 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 6 Apr 2009 15:00:08 +0000 (GMT)
Subject: [Swift-devel] Re: replication vs site score
In-Reply-To: <1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov>
References:
<1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov>
Message-ID:
> You may recall the work that was done by Greg Maleciwz (sp?) on prioritizing
> jobs that enable new jobs to run. Those ideas seem relevant here.
Yes, it's ongoing thoughts based on that which lead me to thinking about
this - more generally, what are the useful things to prioritise work on
(both at the Swift level - a SwiftScript procedure call - and at the lower
level of file transfers and remote job submissions)?
> I met last week with a smart fellow in Singapore, Qin Zheng (CCed here), who
> has been working on the scheduling of replicant jobs. His interest is in doing
> this for jobs that have failed, while I think your interest is in scheduling
> for jobs that may have failed--a somewhat different thing. But there may be a
> connection.
Replicated jobs are jobs that the remote job submission system (e.g. GRAM)
says are in a queue but that we think we can probably run better
(i.e. quicker, or even run at all) by resubmitting; when doing that, we
don't cancel the original job, and potentially it will be that original job
that runs, not the replica. Sometimes that is because the remote queue is
"infinitely long" (the site is taking jobs and losing them); sometimes it's
because it is "very long" (e.g. Teraport's 14-day queue when my laptop has a
local CPU free and no queue).
In your above paragraph, that sounds more like Swift's retry mechanism -
when a Swift-level job (SwiftScript procedure call) fails, we submit it
again, basically using the same mechanism as with replicated jobs.
However, in that case, the original job does not exist any more.
--
From bugzilla-daemon at mcs.anl.gov Mon Apr 6 10:00:21 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Mon, 6 Apr 2009 10:00:21 -0500 (CDT)
Subject: [Swift-devel] [Bug 197] New: Include more runtime environment info
in Swift log for debugging
Message-ID:
https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=197
Summary: Include more runtime environment info in Swift log for
debugging
Product: Swift
Version: unspecified
Platform: All
OS/Version: All
Status: NEW
Keywords: debug
Severity: enhancement
Priority: P2
Component: SwiftScript language
AssignedTo: benc at hawaga.org.uk
ReportedBy: wilde at mcs.anl.gov
More info on the user's runtime environment should be included automatically at
the start of the swift .log file, so that developers can do most debugging with
just the single .log file.
This should include:
- command line
- sites file
- tc file
- swift.properties file
and could include the swift source code itself (at least for now, when most
scripts are very short).
--
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.
You are watching the reporter.
From bugzilla-daemon at mcs.anl.gov Mon Apr 6 10:09:21 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Mon, 6 Apr 2009 10:09:21 -0500 (CDT)
Subject: [Swift-devel] [Bug 198] New: Add ability to specify execution sites
on swift command line
Message-ID:
https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=198
Summary: Add ability to specify execution sites on swift
command line
Product: Swift
Version: unspecified
Platform: All
OS/Version: All
Status: NEW
Keywords: running
Severity: enhancement
Priority: P2
Component: SwiftScript language
AssignedTo: benc at hawaga.org.uk
ReportedBy: wilde at mcs.anl.gov
Allow a set of sites to be specified on the command line for each run, rather
than needing to edit the sites file to make such choices.
This feature is (or should be) tied to improvements in how the site data is
generated and maintained.
Discussion of a design for this feature on the devel list should precede any
development.
The related issues are:
- where to keep the site data that the command line options select from
- how to parameterize that data and add options to the selected sites
- how site data is generated and customized
- how a variety of choices can be specified for the selected sites (eg, use
coasters or not; which data movement strategy to use).
--
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.
You are watching the reporter.
From wilde at mcs.anl.gov Mon Apr 6 10:20:16 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 06 Apr 2009 10:20:16 -0500
Subject: [Swift-devel] bugzilla keywords
Message-ID: <49DA1DB0.8060107@mcs.anl.gov>
I've noticed these more in the new bugzilla interface, and so started
using them, although I realize the keywords I've created may need rethinking.
Are bug keywords of any use to us, or should I stop doing this?
If they are of use, we should define a small set that we like and that works for all.
From benc at hawaga.org.uk Mon Apr 6 10:27:42 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 6 Apr 2009 15:27:42 +0000 (GMT)
Subject: [Swift-devel] bugzilla keywords
In-Reply-To: <49DA1DB0.8060107@mcs.anl.gov>
References: <49DA1DB0.8060107@mcs.anl.gov>
Message-ID:
On Mon, 6 Apr 2009, Michael Wilde wrote:
> Ive noticed these more in the new bugzilla interface, and so started using
> them, although I realize the keywords Ive created may need rethinking.
>
> Are bug keywords of any use to us, or should I stop doing this?
>
> If of use, we should define a small set that we like and works for all.
I've never come up with a particularly useful use for them within Swift,
and I don't think we should use them just because they are there.
For the most part, I think even the component classification list is
barely used.
If you find some use in them, though, I see no reason why you shouldn't
do so.
--
From hategan at mcs.anl.gov Mon Apr 6 10:40:38 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 06 Apr 2009 10:40:38 -0500
Subject: [Swift-devel] Re: replication vs site score
In-Reply-To:
References:
Message-ID: <1239032438.30386.8.camel@localhost>
On Mon, 2009-04-06 at 14:39 +0000, Ben Clifford wrote:
> even more rambling... in the context of a scheduler that is doing things
> like prioritising jobs based on more than the order that Swift happened to
> submit them (hopefully I will have a student for this in the summer), I
> think a replicant job should be pushed toward later execution rather than
> earlier execution to reduce the number of replicant jobs in the system at
> any one time.
You have two extremes:
1. Send each job to all sites instantly.
2. Replicate after +inf time (see _too_much_ below)
You're suggesting moving from somewhere in the middle, to somewhere in
the middle, but a little to the right.
>
> This is because I suspect (though I have gathered no numerical evidence)
> that given the choice between submitting a fresh job and a replicant job
> (making up terminology here too... mmm), it is almost always better to
> submit the fresh job. Either we end up submitting the replicant job
> eventually (in which case we are no worse off than if we submitted the
> replicant first and then a fresh job); or by delaying the replicant job we
> give that replicant's original a chance to start running and thus do not
> discard our precious time-and-load-dollars that we have already spent on
> queueing that replicant's original.
You are saying this with the awareness of the fact that replicas are
only sent after the prototype job sat in the queue (and didn't start
running) for what is deemed _too_much_?
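(For reference, the knobs that control this live in swift.properties and are
along these lines - the names are from memory and the values here are purely
illustrative, not what any particular run used:)

  # turn replication on (it is off by default)
  replication.enabled=true
  # seconds a submitted job must sit in a remote queue before it becomes
  # eligible for replication (this is the _too_much_ threshold)
  replication.min.queue.time=60
  # maximum number of replicas created for any one job
  replication.limit=3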
From benc at hawaga.org.uk Mon Apr 6 10:50:04 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 6 Apr 2009 15:50:04 +0000 (GMT)
Subject: [Swift-devel] Re: replication vs site score
In-Reply-To: <1239032438.30386.8.camel@localhost>
References:
<1239032438.30386.8.camel@localhost>
Message-ID:
On Mon, 6 Apr 2009, Mihael Hategan wrote:
> You are saying this with the awareness of the fact that replicas are
> only sent after the prototype job sat in the queue (and didn't start
> running) for what is deemed _too_much_?
I'm not suggesting that we reduce any submission load to remote sites. I'm
suggesting a different order for those submissions.
The queue delay is not so _too_much_ that we cancel the original on
replication; and it appears (though I don't have stats on real runs) that
the originals do run sometimes (though it would be interesting to know in
real situations how often)
Given that, I'm suggesting that a better use of our load capacity is to do
it with the ordering I suggested.
As far as I can tell, it will not result in slower runs. In the case where
originals do run eventually, it should result in faster runs.
Thinking about it more, I can see a situation where a site is pretty fully
loaded queuewise by swift yet never actually runs a job, because by the
time a job gets near the front of the queue it has been replicated and run
elsewhere. That's an extreme, but I think it's the extreme of the same
situation I talk about in my original message.
--
From hategan at mcs.anl.gov Mon Apr 6 11:06:31 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 06 Apr 2009 11:06:31 -0500
Subject: [Swift-devel] Re: replication vs site score
In-Reply-To:
References:
<1239032438.30386.8.camel@localhost>
Message-ID: <1239033991.31063.3.camel@localhost>
On Mon, 2009-04-06 at 15:50 +0000, Ben Clifford wrote:
> On Mon, 6 Apr 2009, Mihael Hategan wrote:
>
> > You are saying this with the awareness of the fact that replicas are
> > only sent after the prototype job sat in the queue (and didn't start
> > running) for what is deemed _too_much_?
>
> I'm not suggesting that we reduce any submission load to remote sites. I'm
> suggesting a different order for those submissions.
>
> The queue delay is not so _too_much_ that we cancel the original on
> replication; and it appears (though I don't have stats on real runs) that
> the originals do run sometimes (though it would be interesting to know in
> real situations how often)
>
> Given that, I'm suggesting that a better use of our load capacity is to do
> it with the ordering I suggested.
I'm still not following. From what I understand, you are suggesting
what's already there. So either that is true and you think the current
scheme is not what it is, or I don't understand how your suggestion is
different than the current scheme.
From benc at hawaga.org.uk Mon Apr 6 11:11:31 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 6 Apr 2009 16:11:31 +0000 (GMT)
Subject: [Swift-devel] Re: replication vs site score
In-Reply-To: <1239033991.31063.3.camel@localhost>
References:
<1239032438.30386.8.camel@localhost>
<1239033991.31063.3.camel@localhost>
Message-ID:
On Mon, 6 Apr 2009, Mihael Hategan wrote:
> I'm still not following. From what I understand, you are suggesting
> what's already there. So either that is true and you think the current
> scheme is not what it is, or I don't understand how your suggestion is
> different than the current scheme.
It's not the case, as I understand it, that replica jobs will always be run
after primary jobs - they will be run in the order they arrive in the job
queue. Jobs that Swift puts in the queue after that replication decision
has been made (for example, jobs that were waiting for dependent data)
will run after replicas that were submitted before that dependent data
became available.
a=p(x)
b=p(y)
c=q(a);
a, b run. eventually swift gets bored and resubmits b to the local job
queue. then a completes, and so c gets queued in the local job queue.
replica_of_b gets submitted to a site before c does.
or not?
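(For concreteness, a complete SwiftScript version of that fragment might look
like the following - the file names and the use of cat as the app are made up,
and cat would need a tc.data entry:)

  type file;

  (file o) p(file i) {
      app {
          cat @filename(i) stdout=@filename(o);
      }
  }

  (file o) q(file i) {
      app {
          cat @filename(i) stdout=@filename(o);
      }
  }

  file x <"x.dat">;
  file y <"y.dat">;
  file a <"a.out">;
  file b <"b.out">;
  file c <"c.out">;

  a = p(x);
  b = p(y);
  c = q(a);   // cannot be submitted until a has been produced

so c only enters the local job queue once a finishes, by which point a
replica of b may already be sitting ahead of it.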
--
From hategan at mcs.anl.gov Mon Apr 6 11:34:16 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 06 Apr 2009 11:34:16 -0500
Subject: [Swift-devel] Re: replication vs site score
In-Reply-To:
References:
<1239032438.30386.8.camel@localhost>
<1239033991.31063.3.camel@localhost>
Message-ID: <1239035656.31668.10.camel@localhost>
On Mon, 2009-04-06 at 16:11 +0000, Ben Clifford wrote:
> On Mon, 6 Apr 2009, Mihael Hategan wrote:
>
> > I'm still not following. From what I understand, you are suggesting
> > what's already there. So either that is true and you think the current
> > scheme is not what it is, or I don't understand how your suggestion is
> > different than the current scheme.
>
> Its not the case, as I understand it, that replica jobs will always be run
> after primary jobs - they will be run in the order they arrive in the job
> queue. Jobs that Swift puts in the queue after that replication decision
> has been made (for example, jobs that were waiting for dependent data)
> will run after the replicas submitted before that dependent data become
> available.
>
> a=p(x)
> b=p(y)
> c=q(a);
>
> a, b run. eventually swift gets bored and resubmits b to the local job
> queue. then a completes, and so c gets queued in the local job queue.
>
> replica_of_b gets submitted to a site before c does.
I see what you're saying now.
I think scheduler priorities are not a bad idea.
From benc at hawaga.org.uk Mon Apr 6 11:34:57 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 6 Apr 2009 16:34:57 +0000 (GMT)
Subject: [Swift-devel] Re: replication vs site score
In-Reply-To: <1239035656.31668.10.camel@localhost>
References:
<1239032438.30386.8.camel@localhost>
<1239033991.31063.3.camel@localhost>
<1239035656.31668.10.camel@localhost>
Message-ID:
On Mon, 6 Apr 2009, Mihael Hategan wrote:
> I think scheduler priorities are not a bad idea.
right - it's likely that I'll get a summer student to play with that sort of
stuff, which is what is making me think about what sort of things to
prioritise on.
--
From wilde at mcs.anl.gov Mon Apr 6 12:10:51 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 06 Apr 2009 12:10:51 -0500
Subject: [Swift-devel] coaster problem with jobmanager=gt2:pbs
Message-ID: <49DA379B.7080403@mcs.anl.gov>
With this sites entry:
TG-CDA070002T
/home/ux454325/swiftwork
I get the error below. Files are on CI net at /home/wilde/swift/lab.
I will try to copy coaster boot logs and gram logs to same place when I
find them, in subdirs named by $RunID.logs.
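(The XML tags of that entry seem to have been eaten in transit; for reference
it is a pool definition of roughly the following shape - only the handle,
project, jobManager and work directory come from this run, the service and
gridftp contacts below are placeholders:)

  <pool handle="qb">
    <!-- placeholder contact strings, not the real queenbee endpoints -->
    <execution provider="coaster" url="gatekeeper.example.org:2119"
               jobManager="gt2:pbs"/>
    <gridftp url="gsiftp://gridftp.example.org"/>
    <profile namespace="globus" key="project">TG-CDA070002T</profile>
    <workdirectory>/home/ux454325/swiftwork</workdirectory>
  </pool>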
--
com$ swift -tc.file tc.data -sites.file qb.coasters.xml cat.swift
Swift svn swift-r2809 cog-r2350
RunID: 20090406-1155-pgc5nj00
Progress:
Progress: Stage in:1
Progress: Submitted:1
Failed to transfer wrapper log from cat-20090406-1155-pgc5nj00/info/m on qb
Progress: Failed:1
Execution failed:
Exception in cat:
Arguments: [data.txt]
Host: qb
Directory: cat-20090406-1155-pgc5nj00/jobs/m/cat-m91ku09j
stderr.txt:
stdout.txt:
----
Caused by:
Cannot submit job: Cannot run program "qsub":
java.io.IOException: error=2, No such file or directory
org.globus.cog.abstraction.impl.common.task.TaskSubmissionException:
Cannot submit job: Cannot run program "qsub": java.io.IOException:
error=2, No such file or directory
at
org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.submit(AbstractJobSubmissionTaskHandler.java:63)
at
org.globus.cog.abstraction.impl.common.AbstractTaskHandler.submit(AbstractTaskHandler.java:46)
at
org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.submit(ExecutionTaskHandler.java:50)
at
org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager.startWorker(WorkerManager.java:221)
at
org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager.run(WorkerManager.java:145)
Caused by: java.io.IOException: Cannot run program "qsub":
java.io.IOException: error=2, No such file or directory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:459)
at java.lang.Runtime.exec(Runtime.java:593)
at
org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.start(AbstractExecutor.java:73)
at
org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.submit(AbstractJobSubmissionTaskHandler.java:53)
... 4 more
Caused by: java.io.IOException: java.io.IOException: error=2, No such
file or directory
at java.lang.UNIXProcess.(UNIXProcess.java:148)
at java.lang.ProcessImpl.start(ProcessImpl.java:65)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:452)
... 7 more
Cleaning up...
Shutting down service at https://208.100.92.21:44166
Got channel MetaChannel: 24235184 -> GSSSChannel-null(1)
- Done
com$ pwd
From wilde at mcs.anl.gov Mon Apr 6 12:29:21 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 06 Apr 2009 12:29:21 -0500
Subject: [Swift-devel] coaster problem with jobmanager=gt4:gt4:pbs
Message-ID: <49DA3BF1.1080206@mcs.anl.gov>
With this sites entry:
TG-CDA070002T
/home/ux454325/swiftwork
I get the error below. Files are on CI net at /home/wilde/swift/lab.
Coaster boot log is in 20090406-1216-f5k8chdg.logs/
There was no GRAM log on the queenbee site.
--
com$ swift -tc.file tc.data -sites.file qb.coasters-gt4-gt4-pbs.xml
cat.swift
Swift svn swift-r2809 cog-r2350
RunID: 20090406-1216-f5k8chdg
Progress:
Progress: Stage in:1
The GT4 provider does not support redirection. Redirection requests will
be ignored without further warnings.
Progress: Submitted:1
Failed to transfer wrapper log from cat-20090406-1216-f5k8chdg/info/0 on qb
Progress: Failed:1
Execution failed:
Exception in cat:
Arguments: [data.txt]
Host: qb
Directory: cat-20090406-1216-f5k8chdg/jobs/0/cat-0cjfv09j
stderr.txt:
stdout.txt:
----
Caused by:
Could not submit job
Caused by:
Could not start coaster service
Caused by:
Task ended before registration was received: Job failed with an
exit code of 1
Cleaning up...
Done
com$
From hategan at mcs.anl.gov Mon Apr 6 13:02:17 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 06 Apr 2009 13:02:17 -0500
Subject: [Swift-devel] coaster problem with jobmanager=gt2:pbs
In-Reply-To: <49DA379B.7080403@mcs.anl.gov>
References: <49DA379B.7080403@mcs.anl.gov>
Message-ID: <1239040937.2410.3.camel@localhost>
Yes. This is one of those "can't find executable unless run through
'bash -l' or maybe not" which we saw using wget and md5sum. I'm thinking
how to deal with the situation.
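(Roughly this, as seen from a shell on the head node - the qsub path is just
an example:)

  # non-login shell, the way the bootstrap runs things: the profile that
  # prepends the PBS bin directory to PATH is never sourced
  $ bash -c 'which qsub'
  $ echo $?
  1

  # login shell: the profile is sourced and qsub is found
  $ bash -l -c 'which qsub'
  /usr/local/pbs/bin/qsub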
On Mon, 2009-04-06 at 12:10 -0500, Michael Wilde wrote:
> With this sites entry:
>
>
> TG-CDA070002T
> jobManager="gt2:pbs" />
>
> /home/ux454325/swiftwork
>
>
> I get the error below. Files are on CI net at /home/wilde/swift/lab.
>
> I will try to copy coaster boot logs and gram logs to same place when I
> find them, in subdirs named by $RunID.logs.
>
> [...]
From hategan at mcs.anl.gov Mon Apr 6 13:28:03 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 06 Apr 2009 13:28:03 -0500
Subject: [Swift-devel] coaster problem with jobmanager=gt4:gt4:pbs
In-Reply-To: <49DA3BF1.1080206@mcs.anl.gov>
References: <49DA3BF1.1080206@mcs.anl.gov>
Message-ID: <1239042483.3445.2.camel@localhost>
On Mon, 2009-04-06 at 12:29 -0500, Michael Wilde wrote:
> With this sites entry:
>
>
>
> TG-CDA070002T
> jobManager="gt4:gt4:pbs" />
>
> /home/ux454325/swiftwork
>
>
>
> I get the error below. Files are on CI net at /home/wilde/swift/lab.
>
> Coaster boot log is in 20090406-1216-f5k8chdg.logs/
I'll also need ~/.globus/coasters/coasters.log. Sorry for not mentioning
it earlier.
Normally with gt2, there would be a stdout explanation of what happened,
but with gt4 there is no stdout streaming back.
From wilde at mcs.anl.gov Mon Apr 6 13:33:32 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 06 Apr 2009 13:33:32 -0500
Subject: [Swift-devel] coaster problem with jobmanager=gt4:gt4:pbs
In-Reply-To: <1239042483.3445.2.camel@localhost>
References: <49DA3BF1.1080206@mcs.anl.gov> <1239042483.3445.2.camel@localhost>
Message-ID: <49DA4AFC.104@mcs.anl.gov>
I just copied coaster.log to that same dir.
On 4/6/09 1:28 PM, Mihael Hategan wrote:
> On Mon, 2009-04-06 at 12:29 -0500, Michael Wilde wrote:
>> With this sites entry:
>>
>>
>>
>> TG-CDA070002T
>> > jobManager="gt4:gt4:pbs" />
>>
>> /home/ux454325/swiftwork
>>
>>
>>
>> I get the error below. Files are on CI net at /home/wilde/swift/lab.
>>
>> Coaster boot log is in 20090406-1216-f5k8chdg.logs/
>
> I'll also need ~/.globus/coasters/coasters.log. Sorry for not mentioning
> it earlier.
>
> Normally with gt2, there would be a stdout explanation of what happened,
> but with gt4 there is no stdout streaming back.
>
>
From hategan at mcs.anl.gov Mon Apr 6 14:24:59 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 06 Apr 2009 14:24:59 -0500
Subject: [Swift-devel] coaster problem with jobmanager=gt4:gt4:pbs
In-Reply-To: <49DA4AFC.104@mcs.anl.gov>
References: <49DA3BF1.1080206@mcs.anl.gov>
<1239042483.3445.2.camel@localhost> <49DA4AFC.104@mcs.anl.gov>
Message-ID: <1239045899.4203.3.camel@localhost>
On Mon, 2009-04-06 at 13:33 -0500, Michael Wilde wrote:
> I just copied coaster.log to that same dir.
Unfortunately it does not contain any information on the unfortunate
run.
I committed a patch to also log to the bootstrap log any errors that may
occur during bootstrap.jar startup that may otherwise not be logged to
the coasters log (nor reported back to the client due to the
middleware).
From wilde at mcs.anl.gov Mon Apr 6 15:17:29 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 06 Apr 2009 15:17:29 -0500
Subject: [Swift-devel] coaster problem with jobmanager=gt2:pbs
In-Reply-To: <1239040937.2410.3.camel@localhost>
References: <49DA379B.7080403@mcs.anl.gov> <1239040937.2410.3.camel@localhost>
Message-ID: <49DA6359.8010207@mcs.anl.gov>
Mihael, I just updated our test swift+cog source and rebuilt.
Glen is now getting:
Caused by:
Invalid GSSCredentials
org.globus.cog.abstraction.impl.common.task.InvalidSecurityContextException:
Invalid GSSCredentials
at
org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submitSingleJob(JobSubmissionTaskHandler.java:149)
at
org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submit(JobSubmissionTaskHandler.java:99)
at
org.globus.cog.abstraction.impl.common.AbstractTaskHandler.submit(AbstractTaskHandler.java:46)
at
org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.submit(ExecutionTaskHandler.java:50)
at
org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager.startWorker(WorkerManager.java:222)
at
org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager.run(WorkerManager.java:145)
Caused by: GSSException: Invalid name provided [Caused by: [JGLOBUS-112]
Malformed name, "=" missing in "38356/jobmanager-pbs"]
at
org.globus.gsi.gssapi.GlobusGSSName.(GlobusGSSName.java:137)
at
org.globus.gsi.gssapi.GlobusGSSManagerImpl.createName(GlobusGSSManagerImpl.java:304)
at
org.globus.gsi.gssapi.auth.IdentityAuthorization.getExpectedName(IdentityAuthorization.java:82)
at org.globus.gram.Gram.gatekeeperConnect(Gram.java:85)
at org.globus.gram.Gram.request(Gram.java:310)
at org.globus.gram.GramJob.request(GramJob.java:262)
at
org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submitSingleJob(JobSubmissionTaskHandler.java:133)
... 5 more
what's up here?
Any chance I picked up code in transition, or a new problem in recent
commits?
- Mike
On 4/6/09 1:02 PM, Mihael Hategan wrote:
> Yes. This is one of those "can't find executable unless run through
> 'bash -l' or maybe not" which we saw using wget and md5sum. I'm thinking
> how to deal with the situation.
>
> On Mon, 2009-04-06 at 12:10 -0500, Michael Wilde wrote:
>> With this sites entry:
>>
>>
>> TG-CDA070002T
>> > jobManager="gt2:pbs" />
>>
>> /home/ux454325/swiftwork
>>
>>
>> I get the error below. Files are on CI net at /home/wilde/swift/lab.
>>
>> I will try to copy coaster boot logs and gram logs to same place when I
>> find them, in subdirs named by $RunID.logs.
>>
>> [...]
>
From wilde at mcs.anl.gov Mon Apr 6 15:20:37 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 06 Apr 2009 15:20:37 -0500
Subject: [Swift-devel] coaster problem with jobmanager=gt2:pbs
In-Reply-To: <1239040937.2410.3.camel@localhost>
References: <49DA379B.7080403@mcs.anl.gov> <1239040937.2410.3.camel@localhost>
Message-ID: <49DA6415.1010402@mcs.anl.gov>
Mihael, when I svn updated our test swift+cog source and rebuilt, Glen
gets the errors below.
When I reverted back to last Tuesday, Mar 31, this new error does not occur.
Does "Caused by: GSSException: Invalid name provided [Caused by:
[JGLOBUS-112] Malformed name, "=" missing in "38356/jobmanager-pbs"]"
suggest a new error introduced in the commits since Tuesday?
This is with coasters and gt2:gt2:pbs.
- Mike
On 4/6/09 1:02 PM, Mihael Hategan wrote:
> Yes. This is one of those "can't find executable unless run through
> 'bash -l' or maybe not" which we saw using wget and md5sum. I'm thinking
> how to deal with the situation.
>
> On Mon, 2009-04-06 at 12:10 -0500, Michael Wilde wrote:
>> With this sites entry:
>>
>>
>> TG-CDA070002T
>> > jobManager="gt2:pbs" />
>>
>> /home/ux454325/swiftwork
>>
>>
>> I get the error below. Files are on CI net at /home/wilde/swift/lab.
>>
>> I will try to copy coaster boot logs and gram logs to same place when I
>> find them, in subdirs named by $RunID.logs.
>>
>> --
>>
>> com$ swift -tc.file tc.data -sites.file qb.coasters.xml cat.swift
>> Swift svn swift-r2809 cog-r2350
>>
>> RunID: 20090406-1155-pgc5nj00
>> Progress:
>> Progress: Stage in:1
>> Progress: Submitted:1
>> Failed to transfer wrapper log from cat-20090406-1155-pgc5nj00/info/m on qb
>> Progress: Failed:1
>> Execution failed:
>> Exception in cat:
>> Arguments: [data.txt]
>> Host: qb
>> Directory: cat-20090406-1155-pgc5nj00/jobs/m/cat-m91ku09j
>> stderr.txt:
>>
>> stdout.txt:
>>
>> ----
>>
>> Caused by:
>> Cannot submit job: Cannot run program "qsub":
>> java.io.IOException: error=2, No such file or directory
>> org.globus.cog.abstraction.impl.common.task.TaskSubmissionException:
>> Cannot submit job: Cannot run program "qsub": java.io.IOException:
>> error=2, No such file or directory
>> at
>> org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.submit(AbstractJobSubmissionTaskHandler.java:63)
>> at
>> org.globus.cog.abstraction.impl.common.AbstractTaskHandler.submit(AbstractTaskHandler.java:46)
>> at
>> org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.submit(ExecutionTaskHandler.java:50)
>> at
>> org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager.startWorker(WorkerManager.java:221)
>> at
>> org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager.run(WorkerManager.java:145)
>> Caused by: java.io.IOException: Cannot run program "qsub":
>> java.io.IOException: error=2, No such file or directory
>> at java.lang.ProcessBuilder.start(ProcessBuilder.java:459)
>> at java.lang.Runtime.exec(Runtime.java:593)
>> at
>> org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.start(AbstractExecutor.java:73)
>> at
>> org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.submit(AbstractJobSubmissionTaskHandler.java:53)
>> ... 4 more
>> Caused by: java.io.IOException: java.io.IOException: error=2, No such
>> file or directory
>> at java.lang.UNIXProcess.(UNIXProcess.java:148)
>> at java.lang.ProcessImpl.start(ProcessImpl.java:65)
>> at java.lang.ProcessBuilder.start(ProcessBuilder.java:452)
>> ... 7 more
>>
>> Cleaning up...
>> Shutting down service at https://208.100.92.21:44166
>> Got channel MetaChannel: 24235184 -> GSSSChannel-null(1)
>> - Done
>> com$ pwd
>> _______________________________________________
>> Swift-devel mailing list
>> Swift-devel at ci.uchicago.edu
>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>
From hategan at mcs.anl.gov Mon Apr 6 15:25:09 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 06 Apr 2009 15:25:09 -0500
Subject: [Swift-devel] coaster problem with jobmanager=gt2:pbs
In-Reply-To: <49DA6359.8010207@mcs.anl.gov>
References: <49DA379B.7080403@mcs.anl.gov>
<1239040937.2410.3.camel@localhost> <49DA6359.8010207@mcs.anl.gov>
Message-ID: <1239049509.5350.0.camel@localhost>
Oops. cog r2367 should fix that.
On Mon, 2009-04-06 at 15:17 -0500, Michael Wilde wrote:
> Mihael, I just updated our test swift+cog source and rebuilt.
>
> Glen is now getting:
>
> Caused by:
> Invalid GSSCredentials
> [...]
From wilde at mcs.anl.gov Mon Apr 6 16:25:51 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 06 Apr 2009 16:25:51 -0500
Subject: [Swift-devel] coaster problem with jobmanager=gt2:pbs
In-Reply-To: <1239049509.5350.0.camel@localhost>
References: <49DA379B.7080403@mcs.anl.gov> <1239040937.2410.3.camel@localhost>
<49DA6359.8010207@mcs.anl.gov> <1239049509.5350.0.camel@localhost>
Message-ID: <49DA735F.1020300@mcs.anl.gov>
We just tested that rev, and now it seems as if the jobs are getting
submitted to the fork JM instead of to PBS.
Need a log for that, or is the cause obvious?
On 4/6/09 3:25 PM, Mihael Hategan wrote:
> Oops. cog r2367 should fix that.
>
> On Mon, 2009-04-06 at 15:17 -0500, Michael Wilde wrote:
>> Mihael, I just updated our test swift+cog source and rebuilt.
>>
>> Glen is now getting:
>>
>> Caused by:
>> Invalid GSSCredentials
>> [...]
>
From hategan at mcs.anl.gov Mon Apr 6 16:44:43 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 06 Apr 2009 16:44:43 -0500
Subject: [Swift-devel] coaster problem with jobmanager=gt2:pbs
In-Reply-To: <49DA735F.1020300@mcs.anl.gov>
References: <49DA379B.7080403@mcs.anl.gov>
<1239040937.2410.3.camel@localhost> <49DA6359.8010207@mcs.anl.gov>
<1239049509.5350.0.camel@localhost> <49DA735F.1020300@mcs.anl.gov>
Message-ID: <1239054283.6821.0.camel@localhost>
On Mon, 2009-04-06 at 16:25 -0500, Michael Wilde wrote:
> We just tested that rev, and now it seems as if the jobs are getting
> submitted to the fork JM instead of to PBS.
>
> Need a log for that, or is the cause obvious?
No. I'll debug and see.
From hategan at mcs.anl.gov Mon Apr 6 16:46:50 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 06 Apr 2009 16:46:50 -0500
Subject: [Swift-devel] coaster problem with jobmanager=gt2:pbs
In-Reply-To: <49DA735F.1020300@mcs.anl.gov>
References: <49DA379B.7080403@mcs.anl.gov>
<1239040937.2410.3.camel@localhost> <49DA6359.8010207@mcs.anl.gov>
<1239049509.5350.0.camel@localhost> <49DA735F.1020300@mcs.anl.gov>
Message-ID: <1239054410.6821.2.camel@localhost>
On Mon, 2009-04-06 at 16:25 -0500, Michael Wilde wrote:
> We just tested that rev, and now it seems as if the jobs are getting
> submitted to the fork JM instead of to PBS.
>
> Need a log for that, or is the cause obvious?
Actually yes, it just became obvious.
From wilde at mcs.anl.gov Mon Apr 6 17:00:26 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 06 Apr 2009 17:00:26 -0500
Subject: [Swift-devel] Expected behavior for scheduler slow-start with
coasters?
Message-ID: <49DA7B7A.6070802@mcs.anl.gov>
We are seeing the following on Ranger:
Swift has 2 coaster jobs running in SGE, coastersPerNode is set to 16,
yet it seems to be doing "slow start" as if it doesn't know that it can
quickly fill the available coaster slots.
For example, Glen sees the trace below, and is surprised that it's not
running at least 32 app() procs by this point, instead of 2.
Is this expected behavior, or would you have expected the scheduler to
fill all available coaster slots?
--
Every 2.0s: showq | grep hockyg                      Mon Apr  6 16:51:38 2009

641061   data   hockyg   Running   16   01:31:50   Mon Apr  6 16:42:30
641062   data   hockyg   Running   16   01:31:50   Mon Apr  6 16:42:30

that's ranger
Progress: Selecting site:98 Stage in:1 Submitting:1 Finished successfully:4
Progress: Selecting site:98 Submitting:2 Finished successfully:4
Progress: Selecting site:98 Submitting:1 Submitted:1 Finished
successfully:4
Progress: Selecting site:98 Submitted:2 Finished successfully:4
Progress: Selecting site:98 Submitted:1 Active:1 Finished successfully:4
Progress: Selecting site:98 Active:2 Finished successfully:4
Progress: Selecting site:98 Active:2 Finished successfully:4
Progress: Selecting site:98 Active:2 Finished successfully:4
Progress: Selecting site:98 Active:2 Finished successfully:4
Progress: Selecting site:98 Active:2 Finished successfully:4
Progress: Selecting site:98 Active:2 Finished successfully:4
Progress: Selecting site:98 Active:2 Finished successfully:4
Progress: Selecting site:98 Active:2 Finished successfully:4
Progress: Selecting site:98 Active:2 Finished successfully:4
Progress: Selecting site:98 Active:2 Finished successfully:4
Progress: Selecting site:98 Active:2 Finished successfully:4
Progress: Selecting site:98 Active:2 Finished successfully:4
Progress: Selecting site:98 Active:2 Finished successfully:4
Progress: Selecting site:98 Active:2 Finished successfully:4
Progress: Selecting site:98 Active:2 Finished successfully:4
Progress: Selecting site:98 Active:2 Finished successfully:4
Progress: Selecting site:98 Active:1 Stage out:1 Finished successfully:4
Progress: Selecting site:98 Active:1 Finished successfully:5
Progress: Selecting site:97 Stage in:1 Active:1 Finished successfully:5
Progress: Selecting site:96 Active:3 Finished successfully:5
Progress: Selecting site:96 Active:3 Finished successfully:5
Progress: Selecting site:96 Active:2 Stage out:1 Finished successfully:5
Progress: Selecting site:96 Active:2 Finished successfully:6
Progress: Selecting site:95 Stage in:1 Active:2 Finished successfully:6
From hategan at mcs.anl.gov Mon Apr 6 17:04:36 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 06 Apr 2009 17:04:36 -0500
Subject: [Swift-devel] coaster problem with jobmanager=gt2:pbs
In-Reply-To: <1239054410.6821.2.camel@localhost>
References: <49DA379B.7080403@mcs.anl.gov>
<1239040937.2410.3.camel@localhost> <49DA6359.8010207@mcs.anl.gov>
<1239049509.5350.0.camel@localhost> <49DA735F.1020300@mcs.anl.gov>
<1239054410.6821.2.camel@localhost>
Message-ID: <1239055476.6821.9.camel@localhost>
On Mon, 2009-04-06 at 16:46 -0500, Mihael Hategan wrote:
> On Mon, 2009-04-06 at 16:25 -0500, Michael Wilde wrote:
> > We just tested that rev, and now it seems as if the jobs are getting
> > submitted to the fork JM instead of to PBS.
> >
> > Need a log for that, or is the cause obvious?
>
> Actually yes, it just became obvious.
I've corrected the initial fix. Hopefully it works properly this time.
The issue was related to a badly thought-out change needed for the worker
terminal to function.
From hategan at mcs.anl.gov Mon Apr 6 17:08:23 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 06 Apr 2009 17:08:23 -0500
Subject: [Swift-devel] Expected behavior for scheduler slow-start with
coasters?
In-Reply-To: <49DA7B7A.6070802@mcs.anl.gov>
References: <49DA7B7A.6070802@mcs.anl.gov>
Message-ID: <1239055703.6821.11.camel@localhost>
On Mon, 2009-04-06 at 17:00 -0500, Michael Wilde wrote:
> We are seeing the following on Ranger:
>
> Swift has 2 coaster jobs running in SGE, coastersPerNode is set to 16,
> yet it seems to be doing "slow start" as if it doesnt know that it ca
> quickly fill the available coaster slots.
Right. Swift doing a slow start is a given.
Coasters allocating more workers than needed is the issue.
From wilde at mcs.anl.gov Mon Apr 6 17:32:22 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 06 Apr 2009 17:32:22 -0500
Subject: [Swift-devel] Expected behavior for scheduler slow-start with
coasters?
In-Reply-To: <1239055703.6821.11.camel@localhost>
References: <49DA7B7A.6070802@mcs.anl.gov> <1239055703.6821.11.camel@localhost>
Message-ID: <49DA82F6.9090502@mcs.anl.gov>
OK, this one seems to be more of a nuisance/anomaly that we can set
aside for now, I think.
Opening up the throttle a bit should make this a minor issue.
Eventually, you'd hope it would fill the available coasters when there is
demand, or at least base the ramp-up on the fact that jobs started, and
not wait for them to finish. Then it would sense faster that there were
more ready workers.
On 4/6/09 5:08 PM, Mihael Hategan wrote:
> On Mon, 2009-04-06 at 17:00 -0500, Michael Wilde wrote:
>> We are seeing the following on Ranger:
>>
>> Swift has 2 coaster jobs running in SGE, coastersPerNode is set to 16,
>> yet it seems to be doing "slow start" as if it doesnt know that it ca
>> quickly fill the available coaster slots.
>
> Right. Swift doing a slow start is a given.
>
> Coasters allocating more workers than needed is the issue.
>
>
From hategan at mcs.anl.gov Mon Apr 6 17:43:23 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 06 Apr 2009 17:43:23 -0500
Subject: [Swift-devel] Expected behavior for scheduler slow-start with
coasters?
In-Reply-To: <49DA82F6.9090502@mcs.anl.gov>
References: <49DA7B7A.6070802@mcs.anl.gov>
<1239055703.6821.11.camel@localhost> <49DA82F6.9090502@mcs.anl.gov>
Message-ID: <1239057803.8721.4.camel@localhost>
On Mon, 2009-04-06 at 17:32 -0500, Michael Wilde wrote:
> OK, this one seems to be more of a nuisance/anomaly that we can set
> aside for now I think.
>
> Opening up the throttle a bit should make this a minor issue.
> Eventually, you'd hope it would fill available coasters when there is
> demand, or at least base the rampup on the fast that jobs started, and
> not wait for them to finish. Then it would sense faster that there were
> more ready workers.
Yes. I mentioned this a while ago, that with coasters, throttling
guesses become unnecessary. You simply throttle to the number of
available workers.
This, however, falls out of the model we started with, so there are some
possibly non-trivial changes to swift needed in order to support this
with coasters, while still keeping the old behaviour without coasters.
From wilde at mcs.anl.gov Mon Apr 6 17:53:17 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 06 Apr 2009 17:53:17 -0500
Subject: [Swift-devel] Expected behavior for scheduler slow-start with
coasters?
In-Reply-To: <1239057803.8721.4.camel@localhost>
References: <49DA7B7A.6070802@mcs.anl.gov>
<1239055703.6821.11.camel@localhost> <49DA82F6.9090502@mcs.anl.gov>
<1239057803.8721.4.camel@localhost>
Message-ID: <49DA87DD.1010704@mcs.anl.gov>
OK, sounds reasonable.
For what it's worth, Glen provided another example of coasters going idle
while there are jobs ready to run.
Nothing more to say on this, except to point out that it affects more
than just startup.
Is there a simpler, alternate scheduler algorithm that you could plug in
as a global, settable alternative to the current one when all sites are
using coasters?
(No need to answer that now; we'll see how far we can get with things as
they are, in various combinations of sites and settings).
We're digging into the imbalance problem at the moment; that one may be
more worth your time, as is the larger-nodes-per-job allocation
enhancement.
--- from Glen:
again, not using these coasters effectively
5:42
Michael Wilde
?
5:42
Glen Hocky
e.g.
qb now has qb2:
                                                           Req'd  Req'd   Elap
Job ID               Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
-------------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
94741.qb2            hockyg   workq    scheduler_  30786   1   1    --  01:41 R 00:53
94742.qb2            hockyg   workq    scheduler_  31391   1   1    --  01:41 R 00:53
94808.qb2            hockyg   workq    scheduler_   2274   1   1    --  01:41 R 00:22
94809.qb2            hockyg   workq    scheduler_  27186   1   1    --  01:41 R 00:21
94811.qb2            hockyg   workq    scheduler_  31647   1   1    --  01:41 R 00:21
94812.qb2            hockyg   workq    scheduler_   4773   1   1    --  01:41 R 00:18
but only 4 active jobs
4 submitted
*7 submitted
all the rest done
so what is it doing with all those extra cpus
5:43
...
Glen Hocky
for my run on only qb
Progress: Submitted:7 Active:4 Finished successfully:93
5:43
Glen Hocky
again, the problem may be that these jobs are taking 15 minutes or more
so they don't end very often
On 4/6/09 5:43 PM, Mihael Hategan wrote:
> On Mon, 2009-04-06 at 17:32 -0500, Michael Wilde wrote:
>> OK, this one seems to be more of a nuisance/anomaly that we can set
>> aside for now I think.
>>
>> Opening up the throttle a bit should make this a minor issue.
>> Eventually, you'd hope it would fill available coasters when there is
>> demand, or at least base the rampup on the fast that jobs started, and
>> not wait for them to finish. Then it would sense faster that there were
>> more ready workers.
>
> Yes. I mentioned this a while ago, that with coasters, throttling
> guesses become unnecessary. You simply throttle to the number of
> available workers.
>
> This, however, falls out of the model we started with, so there are some
> possibly non-trivial changes to swift needed in order to support this
> with coasters, while still keeping the old behaviour without coasters.
>
>
>
From hategan at mcs.anl.gov Mon Apr 6 18:18:59 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 06 Apr 2009 18:18:59 -0500
Subject: [Swift-devel] Expected behavior for scheduler slow-start with
coasters?
In-Reply-To: <49DA87DD.1010704@mcs.anl.gov>
References: <49DA7B7A.6070802@mcs.anl.gov>
<1239055703.6821.11.camel@localhost> <49DA82F6.9090502@mcs.anl.gov>
<1239057803.8721.4.camel@localhost> <49DA87DD.1010704@mcs.anl.gov>
Message-ID: <1239059939.8843.21.camel@localhost>
On Mon, 2009-04-06 at 17:53 -0500, Michael Wilde wrote:
> OK, sounds reasonable.
>
> For what its worth, Glen provided another example of coasters going idle
> while there are jobs ready to run.
Or maybe the jobs don't fit in the time some of the workers have left.
In other words, don't be surprised that workers are not the same as the
jobs they are meant to run, because that's obvious.
There are only two promises related to how workers are allocated: no
more workers than jobs will be started (modulo the broken
coastersPerNode issue - and this promise may have to be dropped if we do
block allocations) and that no worker will stay idle for more than a
certain amount of time, which is currently 10 minutes (probably too
large).
>
> Nothing more to say on this, except to point out that it affects more
> than just startup.
Where "it" may be a very different it.
From wilde at mcs.anl.gov Mon Apr 6 18:28:04 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 06 Apr 2009 18:28:04 -0500
Subject: [Swift-devel] Imbalanced scheduling with coasters and multiple sites
Message-ID: <49DA9004.8010409@mcs.anl.gov>
Glen seems to have a good example of this in:
/home/hockyg/oops/swift/output/teragridoutdir.1
com$ grep 'execute2 THREAD_ASSOCIATION' *ji3.log | awk '{print $8}' |
sort | uniq -c
159 host=abe
8 host=localhost
13 host=qb
11 host=ranger
com$
---
But then I looked in the log and I see that for qb and ranger, it tries
to start jobs there and gets an exception on each of them, while jobs
for abe keep on zipping through.
As far as I can tell, there is, e.g. on queenbee, no coaster boot log at
the time of the exception, and I can't glean any clues from the GRAM log
at that time either (no obvious errors in it).
I am trying now to reproduce this with simple echo-like jobs under my
own id & cert where I can see all the server-side logs.
I *think* that for the run above, Glen first tested each of the 3
sites.xml pool elements separately, for the 3 sites, before trying the
3-site test. I *think* he verified that all three sites worked separately.
But when put together, it *seems* that only the first one works, as if
the ability to start coasters on 3 sites at once is broken.
I am not at all sure, and will try to isolate this with a simpler test that
you can run as well, but at the moment that's a plausible theory.
Btw, this is still with the Mar 31 code rev. I need to catch up on mail
to see if I can now go back to testing on trunk.
From wilde at mcs.anl.gov Mon Apr 6 23:25:45 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 06 Apr 2009 23:25:45 -0500
Subject: [Swift-devel] Imbalanced scheduling with coasters and multiple
sites
In-Reply-To: <49DA9004.8010409@mcs.anl.gov>
References: <49DA9004.8010409@mcs.anl.gov>
Message-ID: <49DAD5C9.7080607@mcs.anl.gov>
I tried this test and discovered some more things about coaster time
management that I don't understand.
It seems that on Queenbee coasters were timing out, while on abe the
workers were getting queued, but abe's coasters.log showed lots of java
exceptions.
If you're interested, all the logs for this run, including the coasters.logs
from the two sites' .globus dirs, are on the CI network at
/home/wilde/swift/lab/20090406-2120-04ythaie
I will re-run with the latest cog/swift revs to see if the behavior
persists.
- Mike
On 4/6/09 6:28 PM, Michael Wilde wrote:
> Glen seems to have a good example of this in:
> /home/hockyg/oops/swift/output/teragridoutdir.1
>
> com$ grep 'execute2 THREAD_ASSOCIATION' *ji3.log | awk '{print $8}' |
> sort | uniq -c
> 159 host=abe
> 8 host=localhost
> 13 host=qb
> 11 host=ranger
> com$
>
> ---
>
> But then I looked in the log and I see that for qb and ranger, it tries
> to start jobs there and gets an exception on each of them, while jobs
> for abe keep on zipping through.
>
> As far as I can tell, there is, eg on queenbee, no coaster boot log at
> the time of the exception, and I cant glean any clues from the GRAM log
> at the time of the exception (no obvious errors in it).
>
> I am trying now to reproduce this with simple echo-like jobs under my
> own id & cert where I can see all the server-side logs.
>
> I *think* that for the run above, Glen first tested ach of the 3
> sites.xml pool elements separately, for the 3 sites, before trying the
> 3-site test. I *think* he verified that all three sites worked separately.
>
> But when put together, it *seems* that only the first one works, as if
> the ability to start coasters on 3 sites at once is broken.
>
> I am not at all sure, and will try to isolate with a simpler test that
> you can run as well, but at the moment thats a plausible theory.
>
> Btw, this is still with the Mar 31 code rev. I need to catch up on mail
> to see if I can no go back to testing on trunk.
>
>
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
From hategan at mcs.anl.gov Mon Apr 6 23:45:45 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 06 Apr 2009 23:45:45 -0500
Subject: [Swift-devel] Imbalanced scheduling with coasters and multiple
sites
In-Reply-To: <49DAD5C9.7080607@mcs.anl.gov>
References: <49DA9004.8010409@mcs.anl.gov> <49DAD5C9.7080607@mcs.anl.gov>
Message-ID: <1239079545.15719.3.camel@localhost>
On Mon, 2009-04-06 at 23:25 -0500, Michael Wilde wrote:
> I tried this test and discovered some more things about coaster time
> management that I dont understand.
>
> It seems that on Queenbee coasters were timing out, while on abe the
> workers were getting queued, but abe's coasters.log showed lots of java
> exceptions.
Yes. It still seems to have been run with the unfortunate version. I
can't tell which exceptions are legit and which ones are the result of
the coaster code being in that particular bad state.
>
> If you're interested, all logs for this run including coasters.logs from
> the two sites .globus dirs is on ci net at
> /home/wilde/swift/lab/20090406-2120-04ythaie
>
> I will re-run with the latest cog/swift revs to see if the behavior
> persists.
>
> - Mike
>
>
> On 4/6/09 6:28 PM, Michael Wilde wrote:
> > Glen seems to have a good example of this in:
> > /home/hockyg/oops/swift/output/teragridoutdir.1
> >
> > com$ grep 'execute2 THREAD_ASSOCIATION' *ji3.log | awk '{print $8}' |
> > sort | uniq -c
> > 159 host=abe
> > 8 host=localhost
> > 13 host=qb
> > 11 host=ranger
> > com$
> >
> > ---
> >
> > But then I looked in the log and I see that for qb and ranger, it tries
> > to start jobs there and gets an exception on each of them, while jobs
> > for abe keep on zipping through.
> >
> > As far as I can tell, there is, eg on queenbee, no coaster boot log at
> > the time of the exception, and I cant glean any clues from the GRAM log
> > at the time of the exception (no obvious errors in it).
> >
> > I am trying now to reproduce this with simple echo-like jobs under my
> > own id & cert where I can see all the server-side logs.
> >
> > I *think* that for the run above, Glen first tested ach of the 3
> > sites.xml pool elements separately, for the 3 sites, before trying the
> > 3-site test. I *think* he verified that all three sites worked separately.
> >
> > But when put together, it *seems* that only the first one works, as if
> > the ability to start coasters on 3 sites at once is broken.
> >
> > I am not at all sure, and will try to isolate with a simpler test that
> > you can run as well, but at the moment thats a plausible theory.
> >
> > Btw, this is still with the Mar 31 code rev. I need to catch up on mail
> > to see if I can no go back to testing on trunk.
> >
> >
> >
> > _______________________________________________
> > Swift-devel mailing list
> > Swift-devel at ci.uchicago.edu
> > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
From wilde at mcs.anl.gov Mon Apr 6 23:56:54 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 06 Apr 2009 23:56:54 -0500
Subject: [Swift-devel] Imbalanced scheduling with coasters and multiple
sites
In-Reply-To: <49DAD5C9.7080607@mcs.anl.gov>
References: <49DA9004.8010409@mcs.anl.gov> <49DAD5C9.7080607@mcs.anl.gov>
Message-ID: <49DADD16.2010507@mcs.anl.gov>
The latest rev shows a similar failure on the surface, but I think
different patterns in the coaster logs.
The workflow is 40 simple "cat" jobs, data.txt to a default-mapped outfile.
This time 39 of 40 jobs ran on abe, and then the workflow lingered and
finally failed, with 39 ok, 1 failure.
All the logs for this run are in
/home/wilde/swift/lab/20090406-2330-72p9ale0
Below that are dirs for the abe and qb coaster and GRAM logs.
Abe had no GRAM log for this run.
I suspect this one is worth looking at.
On 4/6/09 11:25 PM, Michael Wilde wrote:
> I tried this test and discovered some more things about coaster time
> management that I dont understand.
>
> It seems that on Queenbee coasters were timing out, while on abe the
> workers were getting queued, but abe's coasters.log showed lots of java
> exceptions.
>
> If you're interested, all logs for this run including coasters.logs from
> the two sites .globus dirs is on ci net at
> /home/wilde/swift/lab/20090406-2120-04ythaie
>
> I will re-run with the latest cog/swift revs to see if the behavior
> persists.
>
> - Mike
>
>
> On 4/6/09 6:28 PM, Michael Wilde wrote:
>> Glen seems to have a good example of this in:
>> /home/hockyg/oops/swift/output/teragridoutdir.1
>>
>> com$ grep 'execute2 THREAD_ASSOCIATION' *ji3.log | awk '{print $8}' |
>> sort | uniq -c
>> 159 host=abe
>> 8 host=localhost
>> 13 host=qb
>> 11 host=ranger
>> com$
>>
>> ---
>>
>> But then I looked in the log and I see that for qb and ranger, it
>> tries to start jobs there and gets an exception on each of them, while
>> jobs for abe keep on zipping through.
>>
>> As far as I can tell, there is, eg on queenbee, no coaster boot log at
>> the time of the exception, and I cant glean any clues from the GRAM
>> log at the time of the exception (no obvious errors in it).
>>
>> I am trying now to reproduce this with simple echo-like jobs under my
>> own id & cert where I can see all the server-side logs.
>>
>> I *think* that for the run above, Glen first tested ach of the 3
>> sites.xml pool elements separately, for the 3 sites, before trying the
>> 3-site test. I *think* he verified that all three sites worked
>> separately.
>>
>> But when put together, it *seems* that only the first one works, as if
>> the ability to start coasters on 3 sites at once is broken.
>>
>> I am not at all sure, and will try to isolate with a simpler test that
>> you can run as well, but at the moment thats a plausible theory.
>>
>> Btw, this is still with the Mar 31 code rev. I need to catch up on
>> mail to see if I can no go back to testing on trunk.
>>
>>
>>
>> _______________________________________________
>> Swift-devel mailing list
>> Swift-devel at ci.uchicago.edu
>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
From hategan at mcs.anl.gov Tue Apr 7 00:09:44 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 07 Apr 2009 00:09:44 -0500
Subject: [Swift-devel] Imbalanced scheduling with coasters and multiple
sites
In-Reply-To: <49DADD16.2010507@mcs.anl.gov>
References: <49DA9004.8010409@mcs.anl.gov> <49DAD5C9.7080607@mcs.anl.gov>
<49DADD16.2010507@mcs.anl.gov>
Message-ID: <1239080984.16125.1.camel@localhost>
On Mon, 2009-04-06 at 23:56 -0500, Michael Wilde wrote:
> The latest rev shows a similar failure on the surface, but I think
> different patterns in the coaster logs.
>
> The workflow is 40 simple "cat" jobs, data.txt to a default-mapped outfile.
>
> This time 39 of 40 jobs ran on abe, and then the workflow lingered and
> finally failed, with 39 ok, 1 failure.
>
> All the logs for this run are in
> /home/wilde/swift/lab/20090406-2330-72p9ale0
>
> below that are dirs for the abe and qb logs coaster and gram logs.
> Abe had no gram log for this run.
>
> I suspect this one is worth looking at.
Indeed. Can you paste your sites file?
There's some oddity there.
From wilde at mcs.anl.gov Tue Apr 7 00:09:58 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 07 Apr 2009 00:09:58 -0500
Subject: [Swift-devel] Imbalanced scheduling with coasters and multiple
sites
In-Reply-To: <1239080984.16125.1.camel@localhost>
References: <49DA9004.8010409@mcs.anl.gov> <49DAD5C9.7080607@mcs.anl.gov>
<49DADD16.2010507@mcs.anl.gov> <1239080984.16125.1.camel@localhost>
Message-ID: <49DAE026.3040909@mcs.anl.gov>
com$ cat abe+qb.xml
TG-CDA070002T
8
02:30:00
/u/ac/wilde/swiftwork
TG-CDA070002T
8
02:30:00
/home/ux454325/swiftwork
com$
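(Roughly, with its tags intact, each pool entry above would look something
like the sketch below. The service URLs and exact key names are
placeholders/assumptions rather than the actual abe/qb values; only the
project, the "8", the walltime and the work directories come from the
paste above and the attribute fragments quoted further down.)

<config>
  <pool handle="abe">
    <execution provider="coaster" url="GATEKEEPER.HOST.PLACEHOLDER"
               jobManager="gt2:gt2:pbs" />
    <profile namespace="globus" key="project">TG-CDA070002T</profile>
    <profile namespace="globus" key="coastersPerNode">8</profile>
    <profile namespace="globus"
             key="coasterWorkerMaxwalltime">02:30:00</profile>
    <gridftp url="gsiftp://GRIDFTP.HOST.PLACEHOLDER" />
    <workdirectory>/u/ac/wilde/swiftwork</workdirectory>
  </pool>
  <pool handle="qb">
    <execution provider="coaster" url="GATEKEEPER.HOST.PLACEHOLDER"
               jobManager="gt2:gt2:pbs" />
    <profile namespace="globus" key="project">TG-CDA070002T</profile>
    <profile namespace="globus" key="coastersPerNode">8</profile>
    <profile namespace="globus"
             key="coasterWorkerMaxwalltime">02:30:00</profile>
    <gridftp url="gsiftp://GRIDFTP.HOST.PLACEHOLDER" />
    <workdirectory>/home/ux454325/swiftwork</workdirectory>
  </pool>
</config>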
On 4/7/09 12:09 AM, Mihael Hategan wrote:
> On Mon, 2009-04-06 at 23:56 -0500, Michael Wilde wrote:
>> The latest rev shows a similar failure on the surface, but I think
>> different patterns in the coaster logs.
>>
>> The workflow is 40 simple "cat" jobs, data.txt to a default-mapped outfile.
>>
>> This time 39 of 40 jobs ran on abe, and then the workflow lingered and
>> finally failed, with 39 ok, 1 failure.
>>
>> All the logs for this run are in
>> /home/wilde/swift/lab/20090406-2330-72p9ale0
>>
>> below that are dirs for the abe and qb logs coaster and gram logs.
>> Abe had no gram log for this run.
>>
>> I suspect this one is worth looking at.
>
> Indeed. Can you paste your sites file?
>
> There's some oddity there.
>
>
From wilde at mcs.anl.gov Tue Apr 7 00:15:23 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 07 Apr 2009 00:15:23 -0500
Subject: [Swift-devel] Imbalanced scheduling with coasters and multiple
sites
In-Reply-To: <49DAE026.3040909@mcs.anl.gov>
References: <49DA9004.8010409@mcs.anl.gov>
<49DAD5C9.7080607@mcs.anl.gov> <49DADD16.2010507@mcs.anl.gov>
<1239080984.16125.1.camel@localhost> <49DAE026.3040909@mcs.anl.gov>
Message-ID: <49DAE16B.6000508@mcs.anl.gov>
Note on below: I used 2hr30min as the time to match Glen's time, for the
runs in which he first saw the "imbalance".
In my first tests, I had used 5 min for coasterWorkerMaxwalltime and
specified no site or tc maxwalltime. I thought that would work, based on
our earlier lengthy exchanges on this topic. But apparently coasters was
calculating some default max walltime for "cat" and it gave me an error
about insufficient time. I was trying to gather that along with several
other anomalies in another report.
On 4/7/09 12:09 AM, Michael Wilde wrote:
> com$ cat abe+qb.xml
>
>
>
>
> TG-CDA070002T
> 8
> key="coasterWorkerMaxwalltime">02:30:00
>
> jobManager="gt2:gt2:pbs" />
>
> /u/ac/wilde/swiftwork
>
>
>
>
>
> TG-CDA070002T
> 8
> key="coasterWorkerMaxwalltime">02:30:00
>
> jobManager="gt2:gt2:pbs" />
>
> /home/ux454325/swiftwork
>
>
>
>
> com$
>
>
> On 4/7/09 12:09 AM, Mihael Hategan wrote:
>> On Mon, 2009-04-06 at 23:56 -0500, Michael Wilde wrote:
>>> The latest rev shows a similar failure on the surface, but I think
>>> different patterns in the coaster logs.
>>>
>>> The workflow is 40 simple "cat" jobs, data.txt to a default-mapped
>>> outfile.
>>>
>>> This time 39 of 40 jobs ran on abe, and then the workflow lingered
>>> and finally failed, with 39 ok, 1 failure.
>>>
>>> All the logs for this run are in
>>> /home/wilde/swift/lab/20090406-2330-72p9ale0
>>>
>>> below that are dirs for the abe and qb logs coaster and gram logs.
>>> Abe had no gram log for this run.
>>>
>>> I suspect this one is worth looking at.
>>
>> Indeed. Can you paste your sites file?
>>
>> There's some oddity there.
>>
>>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
From hategan at mcs.anl.gov Tue Apr 7 00:26:35 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 07 Apr 2009 00:26:35 -0500
Subject: [Swift-devel] Imbalanced scheduling with coasters and multiple
sites
In-Reply-To: <49DAE16B.6000508@mcs.anl.gov>
References: <49DA9004.8010409@mcs.anl.gov> <49DAD5C9.7080607@mcs.anl.gov>
<49DADD16.2010507@mcs.anl.gov> <1239080984.16125.1.camel@localhost>
<49DAE026.3040909@mcs.anl.gov> <49DAE16B.6000508@mcs.anl.gov>
Message-ID: <1239081995.16125.8.camel@localhost>
On Tue, 2009-04-07 at 00:15 -0500, Michael Wilde wrote:
> Note on below: I used 2hr30min as the time to match Glen's time, for the
> runs in which he first saw the "imbalance".
>
> In my first tests,I had used 5 min for coasterWorkerMaxwalltime and
> specified no site or tc maxwalltime. I thought that would work, based on
> our earlier lengthy exchanges on this topic. But apparantly coasters was
> calculating some default max walltime for "cat" and it gave me an error
> about insufficient time.
Right. Previously it would just loop starting workers and then not using
them because they didn't have enough time. The default walltime is 10
minutes.
> I was trying to gather that alolng with several
> other anomalies in another report.
Now, the oddity below is that both coaster services are started with the
same service id. Not only that, but the same service id was used for
subsequent runs (the bootstrap logs contain multiple "runs"). This,
roughly, makes no sense, and I can't imagine it being a cause for
goodness.
>
>
> On 4/7/09 12:09 AM, Michael Wilde wrote:
> > com$ cat abe+qb.xml
> >
> >
> >
> >
> > TG-CDA070002T
> > 8
> > > key="coasterWorkerMaxwalltime">02:30:00
> >
> > > jobManager="gt2:gt2:pbs" />
> >
> > /u/ac/wilde/swiftwork
> >
> >
> >
> >
> >
> > TG-CDA070002T
> > 8
> > > key="coasterWorkerMaxwalltime">02:30:00
> >
> > > jobManager="gt2:gt2:pbs" />
> >
> > /home/ux454325/swiftwork
> >
> >
> >
> >
> > com$
> >
> >
> > On 4/7/09 12:09 AM, Mihael Hategan wrote:
> >> On Mon, 2009-04-06 at 23:56 -0500, Michael Wilde wrote:
> >>> The latest rev shows a similar failure on the surface, but I think
> >>> different patterns in the coaster logs.
> >>>
> >>> The workflow is 40 simple "cat" jobs, data.txt to a default-mapped
> >>> outfile.
> >>>
> >>> This time 39 of 40 jobs ran on abe, and then the workflow lingered
> >>> and finally failed, with 39 ok, 1 failure.
> >>>
> >>> All the logs for this run are in
> >>> /home/wilde/swift/lab/20090406-2330-72p9ale0
> >>>
> >>> below that are dirs for the abe and qb logs coaster and gram logs.
> >>> Abe had no gram log for this run.
> >>>
> >>> I suspect this one is worth looking at.
> >>
> >> Indeed. Can you paste your sites file?
> >>
> >> There's some oddity there.
> >>
> >>
> > _______________________________________________
> > Swift-devel mailing list
> > Swift-devel at ci.uchicago.edu
> > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
From wilde at mcs.anl.gov Tue Apr 7 00:33:54 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 07 Apr 2009 00:33:54 -0500
Subject: [Swift-devel] Imbalanced scheduling with coasters and multiple
sites
In-Reply-To: <1239081995.16125.8.camel@localhost>
References: <49DA9004.8010409@mcs.anl.gov> <49DAD5C9.7080607@mcs.anl.gov>
<49DADD16.2010507@mcs.anl.gov>
<1239080984.16125.1.camel@localhost>
<49DAE026.3040909@mcs.anl.gov> <49DAE16B.6000508@mcs.anl.gov>
<1239081995.16125.8.camel@localhost>
Message-ID: <49DAE5C2.6070806@mcs.anl.gov>
On 4/7/09 12:26 AM, Mihael Hategan wrote:
> On Tue, 2009-04-07 at 00:15 -0500, Michael Wilde wrote:
>> Note on below: I used 2hr30min as the time to match Glen's time, for the
>> runs in which he first saw the "imbalance".
>>
>> In my first tests,I had used 5 min for coasterWorkerMaxwalltime and
>> specified no site or tc maxwalltime. I thought that would work, based on
>> our earlier lengthy exchanges on this topic. But apparantly coasters was
>> calculating some default max walltime for "cat" and it gave me an error
>> about insufficient time.
>
> Right. Previously it would just loop starting workers and then not using
> them because they didn't have enough time. The default walltime is 10
> minutes.
That makes sense then. The error I got was:
2009-04-06 20:52:35,397-0500 DEBUG vdl:execute2 APPLICATION_EXCEPTION
jobid=cat-e3agg19j - Application exception: Job cannot be run with the
given max walltime worker constraint
The other few anomalies I saw I will ignore unless they happen again, as
I was using the bad 3/31 revision. These were things like starting a new
service with some strange default max time ("01:41:00", or 101 minutes)
after the initial services were started with the correct time, and some
strange error retry behavior.
Bear with me - these things are very difficult and tedious to report.
>> I was trying to gather that alolng with several
>> other anomalies in another report.
>
> Now, the oddity below is that both coaster services are started with the
> same service id. Not only that, the same service id was used for
> subsequent runs (the bootstrap logs contain multiple "runs"). This,
> roughly, makes no sense, but I can't imagine it being cause for
> goodness.
OK. Any chance I messed up copying log files (and duplicated one) or are
you seeing the duplicate service id in truly distinct logs?
(No need for reply - I'm assuming that if there was a chance I duplicated
a log it would be obvious...)
>
>>
>> On 4/7/09 12:09 AM, Michael Wilde wrote:
>>> com$ cat abe+qb.xml
>>>
>>>
>>>
>>>
>>> TG-CDA070002T
>>> 8
>>> >> key="coasterWorkerMaxwalltime">02:30:00
>>>
>>> >> jobManager="gt2:gt2:pbs" />
>>>
>>> /u/ac/wilde/swiftwork
>>>
>>>
>>>
>>>
>>>
>>> TG-CDA070002T
>>> 8
>>> >> key="coasterWorkerMaxwalltime">02:30:00
>>>
>>> >> jobManager="gt2:gt2:pbs" />
>>>
>>> /home/ux454325/swiftwork
>>>
>>>
>>>
>>>
>>> com$
>>>
>>>
>>> On 4/7/09 12:09 AM, Mihael Hategan wrote:
>>>> On Mon, 2009-04-06 at 23:56 -0500, Michael Wilde wrote:
>>>>> The latest rev shows a similar failure on the surface, but I think
>>>>> different patterns in the coaster logs.
>>>>>
>>>>> The workflow is 40 simple "cat" jobs, data.txt to a default-mapped
>>>>> outfile.
>>>>>
>>>>> This time 39 of 40 jobs ran on abe, and then the workflow lingered
>>>>> and finally failed, with 39 ok, 1 failure.
>>>>>
>>>>> All the logs for this run are in
>>>>> /home/wilde/swift/lab/20090406-2330-72p9ale0
>>>>>
>>>>> below that are dirs for the abe and qb logs coaster and gram logs.
>>>>> Abe had no gram log for this run.
>>>>>
>>>>> I suspect this one is worth looking at.
>>>> Indeed. Can you paste your sites file?
>>>>
>>>> There's some oddity there.
>>>>
>>>>
>>> _______________________________________________
>>> Swift-devel mailing list
>>> Swift-devel at ci.uchicago.edu
>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>
From hategan at mcs.anl.gov Tue Apr 7 00:39:14 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 07 Apr 2009 00:39:14 -0500
Subject: [Swift-devel] Imbalanced scheduling with coasters and multiple
sites
In-Reply-To: <49DAE5C2.6070806@mcs.anl.gov>
References: <49DA9004.8010409@mcs.anl.gov> <49DAD5C9.7080607@mcs.anl.gov>
<49DADD16.2010507@mcs.anl.gov> <1239080984.16125.1.camel@localhost>
<49DAE026.3040909@mcs.anl.gov> <49DAE16B.6000508@mcs.anl.gov>
<1239081995.16125.8.camel@localhost> <49DAE5C2.6070806@mcs.anl.gov>
Message-ID: <1239082754.16125.12.camel@localhost>
On Tue, 2009-04-07 at 00:33 -0500, Michael Wilde wrote:
>
> On 4/7/09 12:26 AM, Mihael Hategan wrote:
> > On Tue, 2009-04-07 at 00:15 -0500, Michael Wilde wrote:
> >> Note on below: I used 2hr30min as the time to match Glen's time, for the
> >> runs in which he first saw the "imbalance".
> >>
> >> In my first tests,I had used 5 min for coasterWorkerMaxwalltime and
> >> specified no site or tc maxwalltime. I thought that would work, based on
> >> our earlier lengthy exchanges on this topic. But apparantly coasters was
> >> calculating some default max walltime for "cat" and it gave me an error
> >> about insufficient time.
> >
> > Right. Previously it would just loop starting workers and then not using
> > them because they didn't have enough time. The default walltime is 10
> > minutes.
>
> That makes sense then. The error I got was:
>
> 2009-04-06 20:52:35,397-0500 DEBUG vdl:execute2 APPLICATION_EXCEPTION
> jobid=cat-e3agg19j - Application exception: Job cannot be run with the
> given max walltime worker constraint
>
> The other few anomalies I saw I will ignore unless they happen again, as
> I was using the bad 3/31 revision. This was things like starting a new
> service with some strange default max time ("01:41:00" or 101 minutes)
Not strange. 101 = 10 * 10 + 1, i.e. DEFAULT_MAXWALLTIME *
OVERALLOCATION_FACTOR + RESERVE: the 10-minute default walltime times an
overallocation factor of 10, plus a 1-minute reserve, gives 101 minutes,
which displays as 01:41:00.
> after the initial services were started with the correct time, and some
> strange error retry behavior.
>
> Bear with me - these things are very difficult and tedious to report.
No problem. I'm glad you're exercising the code.
From hategan at mcs.anl.gov Tue Apr 7 01:04:22 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 07 Apr 2009 01:04:22 -0500
Subject: [Swift-devel] Imbalanced scheduling with coasters and multiple
sites
In-Reply-To: <1239081995.16125.8.camel@localhost>
References: <49DA9004.8010409@mcs.anl.gov> <49DAD5C9.7080607@mcs.anl.gov>
<49DADD16.2010507@mcs.anl.gov> <1239080984.16125.1.camel@localhost>
<49DAE026.3040909@mcs.anl.gov> <49DAE16B.6000508@mcs.anl.gov>
<1239081995.16125.8.camel@localhost>
Message-ID: <1239084262.16125.18.camel@localhost>
On Tue, 2009-04-07 at 00:26 -0500, Mihael Hategan wrote:
> > I was trying to gather that alolng with several
> > other anomalies in another report.
>
> Now, the oddity below is that both coaster services are started with the
> same service id. Not only that, the same service id was used for
> subsequent runs (the bootstrap logs contain multiple "runs"). This,
> roughly, makes no sense, but I can't imagine it being cause for
> goodness.
That was just another one of my brilliant ideas. It was dimmed a bit in
cog r2369. Prior to that (and after the big fiddle with the bootstrap
script a while ago), multi-site coaster runs were broken.
From benc at hawaga.org.uk Tue Apr 7 03:37:31 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 7 Apr 2009 08:37:31 +0000 (GMT)
Subject: [Swift-devel] Expected behavior for scheduler slow-start with
coasters?
In-Reply-To: <49DA87DD.1010704@mcs.anl.gov>
References: <49DA7B7A.6070802@mcs.anl.gov> <1239055703.6821.11.camel@localhost>
<49DA82F6.9090502@mcs.anl.gov> <1239057803.8721.4.camel@localhost>
<49DA87DD.1010704@mcs.anl.gov>
Message-ID:
> Is there a simpler, alternate scheduler algorithm that you could plug in as a
> global, settable alternative to the current one when all sites are using
> coasters?
You can set the initialScore profile key very high[1] so that Swift
starts at full load rather than low load. This is basically "the simpler,
alternative scheduler algorithm" that you are looking for.
You will however run into a different manifestation of the same problem:
coastersPerNode does not work properly and will likely attempt to
massively overallocate workers.
It's not a bug in the scheduler - it's a bug in the implementation of
coastersPerNode that causes it to attempt to allocate one node per excess
job.
In the longer term, as Mihael said, the interface between the scheduler
and execution systems needs to change because coasters don't fit in the
present abstraction very well.
[1] (to about 100 - the actual formula is rather opaque and I have to
rederive it every time because I never write it down)
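For reference, a minimal sketch of how that might be expressed in a
sites.xml pool entry (the namespace and exact key spelling here are
assumptions, and the value of 100 just follows the rough guidance in [1]):

  <profile namespace="karajan" key="initialScore">100</profile>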
--
From qinz at ihpc.a-star.edu.sg Tue Apr 7 04:07:20 2009
From: qinz at ihpc.a-star.edu.sg (Qin Zheng)
Date: Tue, 7 Apr 2009 17:07:20 +0800
Subject: [Swift-devel] Re: replication vs site score
In-Reply-To: <1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov>
References:
<1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov>
Message-ID:
Prof Foster, thanks for introducing me to the team.
My research interest is in scheduling workflows (DAGs). Ben, we decided not to use resubmission because a DAG cannot complete when any of its tasks fails, and each failure would trigger resubmission/retry of the DAG. Instead, we provide fault tolerance by pre-scheduling a replica (backup) for each task (see enclosure for details). The objective is to guarantee that the DAG can complete (in a preplanned manner, with fast failover to the backup upon failure) before its deadline.
Currently I am also working on workflow scheduling under uncertainty in task running times. This work includes prioritizing tasks based on the impact of the variation in their running times on the overall response time, offline planning for high-priority tasks, and runtime adaptation for all tasks once up-to-date information is available.
I am looking forward to talking to you and learning about your research!
Regards,
Qin Zheng
________________________________
From: Ian Foster [mailto:foster at anl.gov]
Sent: Monday, April 06, 2009 10:46 PM
To: Ben Clifford
Cc: swift-devel; Qin Zheng
Subject: Re: [Swift-devel] Re: replication vs site score
Ben:
You may recall the work that was done by Greg Maleciwz (sp?) on prioritizing jobs that enable new jobs to run. Those ideas seem relevant here.
I met last week with a smart fellow in Singapore, Qin Zheng (CCed here), who has been working on the scheduling of replicant jobs. His interest is in doing this for jobs that have failed, while I think your interest is in scheduling for jobs that may have failed--a somewhat different thing. But there may be a connection.
Ian.
On Apr 6, 2009, at 9:39 AM, Ben Clifford wrote:
even more rambling... in the context of a scheduler that is doing things
like prioritising jobs based on more than the order that Swift happened to
submit them (hopefully I will have a student for this in the summer), I
think a replicant job should be pushed toward later execution rather than
earlier execution to reduce the number of replicant jobs in the system at
any one time.
This is because I suspect (though I have gathered no numerical evidence)
that given the choice between submitting a fresh job and a replicant job
(making up terminology here too... mmm), it is almost always better to
submit the fresh job. Either we end up submitting the replicant job
eventually (in which case we are no worse off than if we submitted the
replicant first and then a fresh job); or by delaying the replicant job we
give that replicant's original a chance to start running and thus do not
discard our precious time-and-load-dollars that we have already spent on
queueing that replicant's original.
--
_______________________________________________
Swift-devel mailing list
Swift-devel at ci.uchicago.edu
http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
________________________________
This email is confidential and may be privileged. If you are not the intended recipient, please delete it and notify us immediately. Please do not copy or use it for any purpose, or disclose its contents to any other person. Thank you.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Fault Tolerance_TC_Mar09.pdf
Type: application/pdf
Size: 2142133 bytes
Desc: Fault Tolerance_TC_Mar09.pdf
URL:
From wilde at mcs.anl.gov Tue Apr 7 06:09:15 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 07 Apr 2009 06:09:15 -0500
Subject: [Swift-devel] Imbalanced scheduling with coasters and multiple
sites
In-Reply-To: <1239082754.16125.12.camel@localhost>
References: <49DA9004.8010409@mcs.anl.gov> <49DAD5C9.7080607@mcs.anl.gov>
<49DADD16.2010507@mcs.anl.gov>
<1239080984.16125.1.camel@localhost>
<49DAE026.3040909@mcs.anl.gov> <49DAE16B.6000508@mcs.anl.gov>
<1239081995.16125.8.camel@localhost> <49DAE5C2.6070806@mcs.anl.gov>
<1239082754.16125.12.camel@localhost>
Message-ID: <49DB345B.3030406@mcs.anl.gov>
>> The other few anomalies I saw I will ignore unless they happen
>> again, as
>> I was using the bad 3/31 revision. This was things like starting a new
>> service with some strange default max time ("01:41:00" or 101 minutes)
>
> Not strange. 101 = 10 * 10 + 1 or DEFAULT_MAXWALLTIME *
> OVERALLOCATION_FACTOR + RESERVE.
I assumed 1:41 was derived from some formula. The unexpected behavior
here was that it looked like a job was submitted by coasters that
ignored the specified coasterWorkerMaxwalltime, after the initial jobs
honored it.
But again, the code base was suspect. I'll keep an eye out for it
happening again.
From wilde at mcs.anl.gov Tue Apr 7 06:13:47 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 07 Apr 2009 06:13:47 -0500
Subject: [Swift-devel] Imbalanced scheduling with coasters and multiple
sites
In-Reply-To: <1239084262.16125.18.camel@localhost>
References: <49DA9004.8010409@mcs.anl.gov> <49DAD5C9.7080607@mcs.anl.gov>
<49DADD16.2010507@mcs.anl.gov>
<1239080984.16125.1.camel@localhost>
<49DAE026.3040909@mcs.anl.gov> <49DAE16B.6000508@mcs.anl.gov>
<1239081995.16125.8.camel@localhost>
<1239084262.16125.18.camel@localhost>
Message-ID: <49DB356B.4050808@mcs.anl.gov>
putting Glen back on cc: Multi-site coaster runs will not work until
Mihael posts a fix.
On 4/7/09 1:04 AM, Mihael Hategan wrote:
> On Tue, 2009-04-07 at 00:26 -0500, Mihael Hategan wrote:
>
>>> I was trying to gather that alolng with several
>>> other anomalies in another report.
>> Now, the oddity below is that both coaster services are started with the
>> same service id. Not only that, the same service id was used for
>> subsequent runs (the bootstrap logs contain multiple "runs"). This,
>> roughly, makes no sense, but I can't imagine it being cause for
>> goodness.
>
> That was just another one of my brilliant ideas. It was dimmed a bit in
> cog r2369. Previous to that, and after the big fiddle with the bootstrap
> script a while ago, multi-site coaster runs are broken.
>
From benc at hawaga.org.uk Tue Apr 7 06:30:59 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 7 Apr 2009 11:30:59 +0000 (GMT)
Subject: [Swift-devel] Re: replication vs site score
In-Reply-To:
References:
<1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov>
Message-ID:
Hi.
Most/all of the work that we've done with Swift works with fairly
opportunistic use of resources - we submit work into job queues on one or
more sites, where those job queues are shared with many other users, and
where the runtimes for both our jobs and other users' jobs are not well
defined ahead of time.
So whilst we use the word 'scheduling' sometimes in Swift, it's more a case
of "what do we think is the best site to queue a job on right now?" rather
than making an execution plan that we think will be valid for a long
period of time.
Our replication mechanism sounds fairly similar to your pre-scheduled
backups, but I think there are these important differences:
* we don't launch a replica until we think there is a reasonable chance
that the replica will run instead of the original (based on queue time)
* as soon as one of the jobs *starts* running, we cancel all the others.
From what I understand, you do that when one of the jobs *ends*
successfully.
We do have one situation where we have some pre-allocation of resources,
and that is when coasters are being used. These use the above
opportunistic queuing methods to acquire a worker node for a long period
of time, and then run Swift-level jobs in there, at present on a
first-come, first-served basis. It's likely that we'll change that to have
some other job prioritisation, but still pre-scheduling the jobs.
Where Swift would have trouble working with an ahead-of-time
planner/scheduler is that the module that generates file transfer and
execution tasks from high level SwiftScripts does not submit a dependent
task for scheduling and execution until its predecessors have been
successfully executed.
What the scheduler sees is a stream, over time, of file transfer and
execution tasks that are safe to run immediately.
It might be easy, or it might be hard, to make the Swift code submit more
eagerly, with a description of task dependencies, which would allow you to
plug in a pre-planner underneath.
On Tue, 7 Apr 2009, Qin Zheng wrote:
> Prof Foster, thanks for introducing me to the team.
>
> My research interest is on scheduling workflows (DAGs). Ben, we decided
> not to use resubmission in the consideration that a DAG cannot be
> completed when any of its tasks fails, which each time would trigger the
> resubmission\retry of the DAG. Instead, we use fault tolerance by
> pre-scheduling replica (backup) for each task (see enclosure for
> details). The objective is to guarantee that this DAG can be completed
> (in a preplanned manner with fast failover to the backup upon failure)
> before its deadline.
>
> Currently I am also working on workflow scheduling under uncertainties
> of task running times. This work includes priorities tasks based on the
> impact of the variation of its running time on the overall response time
> and offline planning for high-priority tasks as well as runtime
> adaptation for all tasks once up-to-date information is available.
>
> I am looking forward to talking to you guys and knowing your research!
>
> Regards,
> Qin Zheng
> ________________________________
> From: Ian Foster [mailto:foster at anl.gov]
> Sent: Monday, April 06, 2009 10:46 PM
> To: Ben Clifford
> Cc: swift-devel; Qin Zheng
> Subject: Re: [Swift-devel] Re: replication vs site score
>
> Ben:
>
> You may recall the work that was done by Greg Maleciwz (sp?) on prioritizing jobs that enable new jobs to run. Those ideas seem relevant here.
>
> I met last week with a smart fellow in Singapore, Qin Zheng (CCed here), who has been working on the scheduling of replicant jobs. His interest is in doing this for jobs that have failed, while I think your interest is in scheduling for jobs that may have failed--a somewhat different thing. But there may be a connection.
>
> Ian.
>
>
> On Apr 6, 2009, at 9:39 AM, Ben Clifford wrote:
>
>
> even more rambling... in the context of a scheduler that is doing things
> like prioritising jobs based on more than the order that Swift happened to
> submit them (hopefully I will have a student for this in the summer), I
> think a replicant job should be pushed toward later execution rather than
> earlier execution to reduce the number of replicant jobs in the system at
> any one time.
>
> This is because I suspect (though I have gathered no numerical evidence)
> that given the choice between submitting a fresh job and a replicant job
> (making up terminology here too... mmm), it is almost always better to
> submit the fresh job. Either we end up submitting the replicant job
> eventually (in which case we are no worse off than if we submitted the
> replicant first and then a fresh job); or by delaying the replicant job we
> give that replicant's original a chance to start running and thus do not
> discard our precious time-and-load-dollars that we have already spent on
> queueing that replicant's original.
>
> --
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>
>
> ________________________________
> This email is confidential and may be privileged. If you are not the intended recipient, please delete it and notify us immediately. Please do not copy or use it for any purpose, or disclose its contents to any other person. Thank you.
>
From foster at anl.gov Tue Apr 7 07:33:02 2009
From: foster at anl.gov (Ian Foster)
Date: Tue, 7 Apr 2009 07:33:02 -0500
Subject: [Swift-devel] Imbalanced scheduling with coasters and multiple
sites
In-Reply-To: <49DB356B.4050808@mcs.anl.gov>
References: <49DA9004.8010409@mcs.anl.gov> <49DAD5C9.7080607@mcs.anl.gov>
<49DADD16.2010507@mcs.anl.gov>
<1239080984.16125.1.camel@localhost>
<49DAE026.3040909@mcs.anl.gov> <49DAE16B.6000508@mcs.anl.gov>
<1239081995.16125.8.camel@localhost>
<1239084262.16125.18.camel@localhost>
<49DB356B.4050808@mcs.anl.gov>
Message-ID: <26D2255E-D8B2-4816-A3D5-FF26743E519C@anl.gov>
Is there a description somewhere of the algorithms used for starting
coasters and submitting jobs to them?
Ian.
From benc at hawaga.org.uk Tue Apr 7 07:36:10 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 7 Apr 2009 12:36:10 +0000 (GMT)
Subject: [Swift-devel] Imbalanced scheduling with coasters and multiple
sites
In-Reply-To: <26D2255E-D8B2-4816-A3D5-FF26743E519C@anl.gov>
References: <49DA9004.8010409@mcs.anl.gov> <49DAD5C9.7080607@mcs.anl.gov>
<49DADD16.2010507@mcs.anl.gov> <1239080984.16125.1.camel@localhost>
<49DAE026.3040909@mcs.anl.gov> <49DAE16B.6000508@mcs.anl.gov>
<1239081995.16125.8.camel@localhost>
<1239084262.16125.18.camel@localhost>
<49DB356B.4050808@mcs.anl.gov>
<26D2255E-D8B2-4816-A3D5-FF26743E519C@anl.gov>
Message-ID:
On Tue, 7 Apr 2009, Ian Foster wrote:
> Is there a description somewhere of the algorithms used for starting coasters
> and submitting jobs to them?
Plenty in the archives of this list, I expect.
Basically: if a job arrives and there is a free coaster slot, launch a new
coaster worker. If there is no free coaster slot existing for it, launch a
new coaster worker.
--
From wilde at mcs.anl.gov Tue Apr 7 07:42:25 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 07 Apr 2009 07:42:25 -0500
Subject: [Swift-devel] Imbalanced scheduling with coasters and multiple
sites
In-Reply-To:
References: <49DA9004.8010409@mcs.anl.gov> <49DAD5C9.7080607@mcs.anl.gov>
<49DADD16.2010507@mcs.anl.gov> <1239080984.16125.1.camel@localhost>
<49DAE026.3040909@mcs.anl.gov> <49DAE16B.6000508@mcs.anl.gov>
<1239081995.16125.8.camel@localhost>
<1239084262.16125.18.camel@localhost>
<49DB356B.4050808@mcs.anl.gov>
<26D2255E-D8B2-4816-A3D5-FF26743E519C@anl.gov>
Message-ID: <49DB4A31.80108@mcs.anl.gov>
This contains a lot of the startup details:
http://wiki.cogkit.org/wiki/Coasters
On 4/7/09 7:36 AM, Ben Clifford wrote:
> On Tue, 7 Apr 2009, Ian Foster wrote:
>
>> Is there a description somewhere of the algorithms used for starting coasters
>> and submitting jobs to them?
>
> Plenty in the archives of this list, I expect.
>
> Basically: if a job arrives and there is a free coaster slot, launch a new
> coaster worker. If there is no free coaster slot existing for it, launch a
> new coaster worker.
>
From benc at hawaga.org.uk Tue Apr 7 07:47:39 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 7 Apr 2009 12:47:39 +0000 (GMT)
Subject: [Swift-devel] Imbalanced scheduling with coasters and multiple
sites
In-Reply-To: <49DB4A31.80108@mcs.anl.gov>
References: <49DA9004.8010409@mcs.anl.gov> <49DAD5C9.7080607@mcs.anl.gov>
<49DADD16.2010507@mcs.anl.gov> <1239080984.16125.1.camel@localhost>
<49DAE026.3040909@mcs.anl.gov> <49DAE16B.6000508@mcs.anl.gov>
<1239081995.16125.8.camel@localhost>
<1239084262.16125.18.camel@localhost>
<49DB356B.4050808@mcs.anl.gov>
<26D2255E-D8B2-4816-A3D5-FF26743E519C@anl.gov>
<49DB4A31.80108@mcs.anl.gov>
Message-ID:
On Tue, 7 Apr 2009, Michael Wilde wrote:
> This contains a lot of the startup details:
>
> http://wiki.cogkit.org/wiki/Coasters
Would be good to link to that from the Swift user guide.
--
From wilde at mcs.anl.gov Tue Apr 7 08:03:47 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 07 Apr 2009 08:03:47 -0500
Subject: [Swift-devel] Imbalanced scheduling with coasters and multiple
sites
In-Reply-To:
References: <49DA9004.8010409@mcs.anl.gov> <49DAD5C9.7080607@mcs.anl.gov>
<49DADD16.2010507@mcs.anl.gov> <1239080984.16125.1.camel@localhost>
<49DAE026.3040909@mcs.anl.gov> <49DAE16B.6000508@mcs.anl.gov>
<1239081995.16125.8.camel@localhost>
<1239084262.16125.18.camel@localhost>
<49DB356B.4050808@mcs.anl.gov>
<26D2255E-D8B2-4816-A3D5-FF26743E519C@anl.gov>
<49DB4A31.80108@mcs.anl.gov>
Message-ID: <49DB4F33.5040502@mcs.anl.gov>
done. (but not tested)
On 4/7/09 7:47 AM, Ben Clifford wrote:
>
> On Tue, 7 Apr 2009, Michael Wilde wrote:
>
>> This contains a lot of the startup details:
>>
>> http://wiki.cogkit.org/wiki/Coasters
>
> Would be good to link to that from the Swift user guide.
>
From wilde at mcs.anl.gov Tue Apr 7 08:14:54 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 07 Apr 2009 08:14:54 -0500
Subject: [Swift-devel] Imbalanced scheduling with coasters and multiple
sites
In-Reply-To: <49DB4F33.5040502@mcs.anl.gov>
References: <49DA9004.8010409@mcs.anl.gov>
<49DAD5C9.7080607@mcs.anl.gov> <49DADD16.2010507@mcs.anl.gov>
<1239080984.16125.1.camel@localhost> <49DAE026.3040909@mcs.anl.gov>
<49DAE16B.6000508@mcs.anl.gov> <1239081995.16125.8.camel@localhost> <1239084262.16125.18.camel@localhost> <49DB356B.4050808@mcs.anl.gov> <26D2255E-D8B2-4816-A3D5-FF26743E519C@anl.gov> <49DB4A31.80108@mcs.anl.gov>
<49DB4F33.5040502@mcs.anl.gov>
Message-ID: <49DB51CE.3090502@mcs.anl.gov>
On 4/7/09 8:03 AM, Michael Wilde wrote:
> done. (but not tested)
but I should have. Fixed, *and* tested.
>
> On 4/7/09 7:47 AM, Ben Clifford wrote:
>>
>> On Tue, 7 Apr 2009, Michael Wilde wrote:
>>
>>> This contains a lot of the startup details:
>>>
>>> http://wiki.cogkit.org/wiki/Coasters
>>
>> Would be good to link to that from the Swift user guide.
>>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
From hategan at mcs.anl.gov Tue Apr 7 10:08:10 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 07 Apr 2009 10:08:10 -0500
Subject: [Swift-devel] Imbalanced scheduling with coasters and multiple
sites
In-Reply-To: <49DB356B.4050808@mcs.anl.gov>
References: <49DA9004.8010409@mcs.anl.gov> <49DAD5C9.7080607@mcs.anl.gov>
<49DADD16.2010507@mcs.anl.gov> <1239080984.16125.1.camel@localhost>
<49DAE026.3040909@mcs.anl.gov> <49DAE16B.6000508@mcs.anl.gov>
<1239081995.16125.8.camel@localhost>
<1239084262.16125.18.camel@localhost>
<49DB356B.4050808@mcs.anl.gov>
Message-ID: <1239116890.18531.0.camel@localhost>
On Tue, 2009-04-07 at 06:13 -0500, Michael Wilde wrote:
> putting Glen back on cc: Multi-site coaster runs will not work until
> Mihael posts a fix.
What I'm saying below is that the fix is in cog r2369.
>
> On 4/7/09 1:04 AM, Mihael Hategan wrote:
> > On Tue, 2009-04-07 at 00:26 -0500, Mihael Hategan wrote:
> >
> >>> I was trying to gather that alolng with several
> >>> other anomalies in another report.
> >> Now, the oddity below is that both coaster services are started with the
> >> same service id. Not only that, the same service id was used for
> >> subsequent runs (the bootstrap logs contain multiple "runs"). This,
> >> roughly, makes no sense, but I can't imagine it being cause for
> >> goodness.
> >
> > That was just another one of my brilliant ideas. It was dimmed a bit in
> > cog r2369. Previous to that, and after the big fiddle with the bootstrap
> > script a while ago, multi-site coaster runs are broken.
> >
From wilde at mcs.anl.gov Tue Apr 7 10:13:23 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 07 Apr 2009 10:13:23 -0500
Subject: [Swift-devel] Imbalanced scheduling with coasters and multiple
sites
In-Reply-To: <1239116890.18531.0.camel@localhost>
References: <49DA9004.8010409@mcs.anl.gov> <49DAD5C9.7080607@mcs.anl.gov>
<49DADD16.2010507@mcs.anl.gov>
<1239080984.16125.1.camel@localhost>
<49DAE026.3040909@mcs.anl.gov> <49DAE16B.6000508@mcs.anl.gov>
<1239081995.16125.8.camel@localhost>
<1239084262.16125.18.camel@localhost>
<49DB356B.4050808@mcs.anl.gov> <1239116890.18531.0.camel@localhost>
Message-ID: <49DB6D93.7010900@mcs.anl.gov>
Cool. I interpreted your note below as meaning it's still broken; didn't
realize that r2369 was the latest. Got it, and am building now for Glen and
me to test. I'll re-run the "cats" test.
On 4/7/09 10:08 AM, Mihael Hategan wrote:
> On Tue, 2009-04-07 at 06:13 -0500, Michael Wilde wrote:
>> putting Glen back on cc: Multi-site coaster runs will not work until
>> Mihael posts a fix.
>
> What I'm saying below is that the fix is in cog r2369.
>
>> On 4/7/09 1:04 AM, Mihael Hategan wrote:
>>> On Tue, 2009-04-07 at 00:26 -0500, Mihael Hategan wrote:
>>>
>>>>> I was trying to gather that alolng with several
>>>>> other anomalies in another report.
>>>> Now, the oddity below is that both coaster services are started with the
>>>> same service id. Not only that, the same service id was used for
>>>> subsequent runs (the bootstrap logs contain multiple "runs"). This,
>>>> roughly, makes no sense, but I can't imagine it being cause for
>>>> goodness.
>>> That was just another one of my brilliant ideas. It was dimmed a bit in
>>> cog r2369. Previous to that, and after the big fiddle with the bootstrap
>>> script a while ago, multi-site coaster runs are broken.
>>>
>
From wilde at mcs.anl.gov Tue Apr 7 17:36:01 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 07 Apr 2009 17:36:01 -0500
Subject: [Swift-devel] Possible problem in coaster data transfer
Message-ID: <49DBD551.3000201@mcs.anl.gov>
It looks as if something in swift is garbling data files.
We see this when trying coaster data transfer to circumvent a problem
that the abe gridftp server was reporting (when using gridftp data
transfer).
The oops "pdt" file is the main output of the simulation (the
coordinates of each atom in the folded protein). These files should have
very regular multi-column lines, but in a few we see garbled lines.
This is in run: ci:/home/hockyg/oops/swift/output/abeoutdir.20
These files range from 1.5MB to 3MB in this test. There's one per job, 50
files in this run.
The lines on top are normal; the lines on the bottom are long due to
file corruption.
We've used coaster transfer off and on; we usually do gridftp transfer
and were using coaster transfer in this case while Mihael debugs a
problem that's manifesting as a gridftp error.
Glen suspected he saw such corruption earlier; this run seems to confirm it.
I'm not inclined to go deep into this at the moment, but rather to say
that we'll stick to gridftp transfer for the duration of this paper
writing effort.
- Mike
TOM 337 N GLN 0 57 135 CA ALA 0 23 -17.316 8.147
0.047 1.00 0.00
ATOM 337 N GLN 0 57 135 CA ALA 0 23 -17.316 8.147
0.047 1.00 0.00
ATOM 337 N GLN 0 57 135 CA ALA 0 23 -17.316 8.147
0.047 1.00 0.00
ATOM 337 N GLN 0 57 135 CA ALA 0 23 -17.316 8.147
0.047 1.00 0.00
ATOM 337 N GLN 0 57 135 CA ALA 0 23 -17.316 8.147
0.047 1.00 0.00
ATOM 337 N GLN 0 57 135 CA ALA 0 23 -17.316 8.147
0.047 1.00 0.00
com$
com$ awk 'length($0) > 150 {print $0}' `find | grep pdt`
com$ awk 'length($0) > 120 {print $0}' `find | grep pdt`
ATOM 335 C ASN 0 56 -34.964 1.477 15.043 1.00 0.00
0 13 -2.528 18.017 -1.927 1.00 0.00
ATOM 335 C ASN 0 56 -21.516 -6.860 -31.404 1.00 0.00
91 C ASN 0 32 -10.865 31.809 -15.581 1.00 0.00
ATOM 404 HN ALA 0 68 -10.285 -33.690 -26.233 1.00 0.00
135 CA ALA 0 23 12.808 -6.713 -11.148 1.00 0.00
ATOM 335 C ASN 0 56 0.510 -30.608 0.783 1.00 0.00
LEU 0 2 0.505 3.186 -1.484 1.00 0.00
ATOM 335 C ASN 0 56 -3.155 25.367 -4.095 1.00 83 C
ALA 0 64 5.541 11.559 -1.063 1.00 0.00
ATOM 404 HN ALA 0 68 66.525 32.704 -21.958 GLN 0 57
135 CA ALA 0 23 19.234 14.087 -7.779 1.00 0.00
ATOM 335 C ASN 0 56 16.926 -3.414 -5.774 1.00 0.00
EU 0 43 13.554 22.230 19.827 1.00 0.00
ATOM 335 C ASN 0 56 14.805 34.413 23.907 1.00 0.00
59 -18.300 2.743 -27.536 1.00 0.00
ATOM 335 C ASN 0 56 19.787 15.477 24.896 1.00 0.00
0 13 9.613 11.599 -1.295 1.00 0.00
ATOM 404 HN ALA 0 68 11.882 -14.3 337 N GLN 0 57
135 CA ALA 0 23 21.798 -14.600 -6.379 1.00 0.00
ATOM 112 HA2 GLY 0 19 3.632 -11.142 -24.657 1.00 0. 315
CA LEU 0 53 0.180 -22.479 -33.671 1.00 0.00
ATOM 335 C ASN 0 56 -3.145 -30.419 -39.260 1.00 E 0 66
-8.925 -40.775 -24.402 1.00 0.00
com$
From aespinosa at cs.uchicago.edu Wed Apr 8 02:12:16 2009
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Wed, 8 Apr 2009 02:12:16 -0500
Subject: [Swift-devel] jobs finishes but swift reports "execution failed".
Message-ID: <50b07b4b0904080012u668c6921y9012ac066c8156f7@mail.gmail.com>
Waiting for notification for 0 ms
Received notification with 1 messages
Progress: Submitted:1 Active:1
Progress: uninitialized:1 Finished successfully:2
Execution failed:
Could not find any valid host for task "Task(type=UNKNOWN,
identity=urn:cog-1239170783751)" with constraints
{filecache=org.griphyn.vdl.karajan.lib.cache.CacheMapAdapter at 53f253f2,
filenames=[Ljava.lang.String;@53945394, trfqn=cat, tr=cat}
probably in one of the staging components
cog2365 swift 2824 on surveyor BGP
the modification made is just the conversion of "|" to "^". Right, Zhao?
log: http://www.ci.uchicago.edu/~aespinosa/swift/blast-20090408-0144-evyvbf93.log
--
Allan M. Espinosa
PhD student, Computer Science
University of Chicago
From qinz at ihpc.a-star.edu.sg Wed Apr 8 03:03:34 2009
From: qinz at ihpc.a-star.edu.sg (Qin Zheng)
Date: Wed, 8 Apr 2009 16:03:34 +0800
Subject: [Swift-devel] Re: replication vs site score
In-Reply-To:
References:
<1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov>
Message-ID:
Dear Ben,
Thanks for your detailed reply; it helps me understand scheduling in Swift better.
I wrote from a researcher's perspective, and I understand that for development there are many more practical issues, which are more challenging. I agree with you that scheduling a task after its parents complete is cost effective. It is the best "time" given all the updated info on the completion times of its parents. Also, it makes DAG submission easy (without dependency descriptions) and minimizes the number of job instances in queues. The concern is that at that point the task still needs to be submitted to a queue and wait. This may not be sufficient for workflows with deadlines, where a certain delivery guarantee on response time is necessary. The same applies to the other remaining tasks in the workflow.
I feel that besides offline planning, runtime adaptation is necessary, considering task duration variation (overruns) and faults. But the number of updates should be kept to a minimum and only for the very near future as the workflow proceeds. I am writing a paper on this and hopefully I can share it with you guys in a few weeks. This implies that Swift could submit tasks a little more eagerly, with a short-sighted look-ahead.
Yes, your points on the differences are valid: the replica in my case is used for fault tolerance, while in Swift it could enable a task to run earlier (by submitting a replica to a shorter queue). You mentioned queue time; can you share more on it, for example its accuracy, and also the planned change to some other job prioritization for coasters?
I will be on a Star cruise to Malaysia in a few hours :). If I cannot access email there, I will reply to you guys on Friday when I return to Singapore.
Qin Zheng
-----Original Message-----
From: Ben Clifford [mailto:benc at hawaga.org.uk]
Sent: Tuesday, April 07, 2009 7:31 PM
To: Qin Zheng
Cc: Ian Foster; swift-devel
Subject: RE: [Swift-devel] Re: replication vs site score
Hi.
Most/all of the work that we've done with Swift works with fairly
opportunistic use of resources - we submit work into job queues on one or
more sites, where those job queues are shared with many other users, and
where the runtimes for both our jobs and other users jobs are not well
defined ahead of time.
So whilst we use the word 'scheduling' sometimes in Swift, it's more a case
of "what do we think is the best site to queue a job on right now?" rather
than making an execution plan that we think will be valid for a long
period of time.
Our replication mechanism sounds fairly similar to your pre-scheduled
backups, but I think there are these important differences:
* we don't launch a replica until we think there is a reasonable chance
that the replica will run instead of the original (based on queue time)
* as soon as one of the jobs *starts* running, we cancel all the others.
From what I understand, you do that when one of the jobs *ends*
successfully.
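As a rough sketch of that rule (not the actual Swift/Karajan code; all the
names below are invented for illustration):

    import java.util.List;

    // Illustration only: the replication rule described above.
    // None of these classes exist in Swift; names are made up.
    class ReplicationPolicy {
        private final double expectedQueueTime; // e.g. a moving average of observed queue times

        ReplicationPolicy(double expectedQueueTime) {
            this.expectedQueueTime = expectedQueueTime;
        }

        // launch a replica only once the original has waited longer than we expected
        boolean shouldReplicate(double timeInQueueSoFar) {
            return timeInQueueSoFar > expectedQueueTime;
        }

        // as soon as any copy *starts* running, cancel all the others
        void onJobStarted(Job started, List<Job> copies) {
            for (Job j : copies) {
                if (j != started) {
                    j.cancel();
                }
            }
        }
    }

    interface Job {
        void cancel();
    }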
We do have one situation where we have some pre-allocation of resources,
and that is when coasters are being used. These use the above
opportunistic queuing methods to acquire a worker node for a long period
of time, and then run Swift-level jobs in there, at present on a
first-come, first-served basis. It's likely that we'll change that to have
some other job prioritisation, but still pre-scheduling the jobs.
Where Swift would have trouble working with an ahead-of-time
planner/scheduler is that the module that generates file transfer and
execution tasks from high level SwiftScripts does not submit a dependent
task for scheduling and execution until its predecessors have been
successfully executed.
What the scheduler sees is a stream, over time, of file transfer and
execution tasks that are safe to run immediately.
It might be easy, or it might be hard, to make the Swift code submit more
eagerly, with description of task dependencies, which would allow you to
plug in a pre-planner underneath.
On Tue, 7 Apr 2009, Qin Zheng wrote:
> Prof Foster, thanks for introducing me to the team.
>
> My research interest is on scheduling workflows (DAGs). Ben, we decided
> not to use resubmission in the consideration that a DAG cannot be
> completed when any of its tasks fails, which each time would trigger the
> resubmission\retry of the DAG. Instead, we use fault tolerance by
> pre-scheduling replica (backup) for each task (see enclosure for
> details). The objective is to guarantee that this DAG can be completed
> (in a preplanned manner with fast failover to the backup upon failure)
> before its deadline.
>
> Currently I am also working on workflow scheduling under uncertainties
> of task running times. This work includes priorities tasks based on the
> impact of the variation of its running time on the overall response time
> and offline planning for high-priority tasks as well as runtime
> adaptation for all tasks once up-to-date information is available.
>
> I am looking forward to talking to you guys and knowing your research!
>
> Regards,
> Qin Zheng
> ________________________________
> From: Ian Foster [mailto:foster at anl.gov]
> Sent: Monday, April 06, 2009 10:46 PM
> To: Ben Clifford
> Cc: swift-devel; Qin Zheng
> Subject: Re: [Swift-devel] Re: replication vs site score
>
> Ben:
>
> You may recall the work that was done by Greg Maleciwz (sp?) on prioritizing jobs that enable new jobs to run. Those ideas seem relevant here.
>
> I met last week with a smart fellow in Singapore, Qin Zheng (CCed here), who has been working on the scheduling of replicant jobs. His interest is in doing this for jobs that have failed, while I think your interest is in scheduling for jobs that may have failed--a somewhat different thing. But there may be a connection.
>
> Ian.
>
>
> On Apr 6, 2009, at 9:39 AM, Ben Clifford wrote:
>
>
> even more rambling... in the context of a scheduler that is doing things
> like prioritising jobs based on more than the order that Swift happened to
> submit them (hopefully I will have a student for this in the summer), I
> think a replicant job should be pushed toward later execution rather than
> earlier execution to reduce the number of replicant jobs in the system at
> any one time.
>
> This is because I suspect (though I have gathered no numerical evidence)
> that given the choice between submitting a fresh job and a replicant job
> (making up terminology here too... mmm), it is almost always better to
> submit the fresh job. Either we end up submitting the replicant job
> eventually (in which case we are no worse off than if we submitted the
> replicant first and then a fresh job); or by delaying the replicant job we
> give that replicant's original a chance to start running and thus do not
> discard our precious time-and-load-dollars that we have already spent on
> queueing that replicant's original.
>
> --
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>
>
> ________________________________
> This email is confidential and may be privileged. If you are not the intended recipient, please delete it and notify us immediately. Please do not copy or use it for any purpose, or disclose its contents to any other person. Thank you.
>
This email is confidential and may be privileged. If you are not the intended recipient, please delete it and notify us immediately. Please do not copy or use it for any purpose, or disclose its contents to any other person. Thank you.
From benc at hawaga.org.uk Wed Apr 8 04:48:31 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 8 Apr 2009 09:48:31 +0000 (GMT)
Subject: [Swift-devel] jobs finishes but swift reports "execution failed".
In-Reply-To: <50b07b4b0904080012u668c6921y9012ac066c8156f7@mail.gmail.com>
References: <50b07b4b0904080012u668c6921y9012ac066c8156f7@mail.gmail.com>
Message-ID:
that looks to me like you have tc.data entries for mockblast but not for
cat.
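For reference, a missing entry is a line in tc.data along these lines (the
site name and platform column here are just placeholders and need to match
your sites file):

    sitename   cat   /bin/cat   INSTALLED   INTEL32::LINUX   null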
--
From benc at hawaga.org.uk Wed Apr 8 07:20:41 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 8 Apr 2009 12:20:41 +0000 (GMT)
Subject: [Swift-devel] Possible problem in coaster data transfer
In-Reply-To: <49DBD551.3000201@mcs.anl.gov>
References: <49DBD551.3000201@mcs.anl.gov>
Message-ID:
if you do decide to dig deeper, you can turn on sitedir.keep in
swift.properties and check that the file in the remote shared directory is
uncorrupted for the same run where the staged-out copy appears corrupted.
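i.e. something like this in swift.properties (a minimal sketch; normally the
remote site directory is cleaned up at the end of the run):

    sitedir.keep=true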
--
From hockyg at uchicago.edu Wed Apr 8 09:03:50 2009
From: hockyg at uchicago.edu (Glen Hocky)
Date: Wed, 08 Apr 2009 09:03:50 -0500
Subject: [Swift-devel] Possible problem in coaster data transfer
In-Reply-To:
References: <49DBD551.3000201@mcs.anl.gov>
Message-ID: <49DCAEC6.1090905@uchicago.edu>
Here you go. Same file from the remote site and from communicado after
transfer by coasterIO
Ben Clifford wrote:
> if you do decide to dig deeper, you can turn on sitedir.keep in
> swift.properties and check that the file in the remote shared directory is
> uncorrupted for the same run that the staged out copy appears corrupted.
>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pdt_ci.gz
Type: application/gzip
Size: 46773 bytes
Desc: not available
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: abe_pdt.gz
Type: application/gzip
Size: 576854 bytes
Desc: not available
URL:
From hategan at mcs.anl.gov Wed Apr 8 10:03:07 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 08 Apr 2009 10:03:07 -0500
Subject: [Swift-devel] Re: replication vs site score
In-Reply-To:
References:
<1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov>
Message-ID: <1239202987.12586.17.camel@localhost>
On Wed, 2009-04-08 at 16:03 +0800, Qin Zheng wrote:
> Dear Ben,
>
> Thanks for your detailed reply and it helps me understand scheduling in Swift better.
>
> I wrote from a researcher perspective and I understand that for
> development, there are much more practical issues and are more
> challenging. I agree with you that scheduling a task after its parents
> completes is cost effective. It is the best "time" given all the
> updated info on the completion times of its parents. Also, it makes
> DAG submission easy (without dependency description) and minimizes the
> number of job instances in queues.
The main reasoning was that it can be dealt with efficiently and that
planning the whole workflow buys us little in a (very) dynamic
environment in which submitting a job one minute later may mean the
difference between 1 minute of queue time and one hour of queue time
(though that's statistically a rare occurrence).
> The concern is that at this time, the task still needs to be
> submitted in queue and wait. This may not be sufficient for workflows
> with deadlines, where certain delivery guarantee in response time is
> necessary.
You need some SLA/QOS to address that. Guessing the average queue time
does not reduce its variation hence the risk of not finishing it by the
time promised. You can use replication (i.e. race competing jobs) to
reduce that variation (assuming that it follows some reasonable
distribution), but I don't see how there could be a guarantee.
> The same applies for other remaining tasks in the workflow.
>
> I felt besides offline planning, runtime adaptation is necessary
> considering task duration variation (overrun) and faults. But the
> number of updates should be kept minimum and only for the very near
> future as the workflow proceeds. I am writing a paper on this and
> hopefully I could share it with you guys in a few weeks. This implies
> that the Swift code could be submitted a little bit more eagerly with
> a short-sighted look ahead.
I remember somebody mentioning (or having implemented) a similar scheme.
If we have dependent jobs A and B, in Swift that would go something
like:
Qa + Ra + Qb + Rb (where Qx = queuing time and Rx = run time)
But there's also the possibility of submitting B earlier, by the average
queue time or less, and then having it wait until A produces its results.
But then again, that's pretty much what glide-ins/coasters do.
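As a made-up illustration of the trade-off: with Qa = Qb = 30 min and
Ra = 60 min, submitting B only after A finishes gives 30 + 60 + 30 + Rb =
120 min + Rb, while submitting B about 30 min before A is expected to finish
lets Qb overlap with Ra, for roughly 90 min + Rb, at the cost of B possibly
sitting idle on a node if A overruns.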
Mihael
From benc at hawaga.org.uk Wed Apr 8 10:08:04 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 8 Apr 2009 15:08:04 +0000 (GMT)
Subject: [Swift-devel] Re: replication vs site score
In-Reply-To: <1239202987.12586.17.camel@localhost>
References:
<1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov>
<1239202987.12586.17.camel@localhost>
Message-ID:
On Wed, 8 Apr 2009, Mihael Hategan wrote:
This:
> planning the whole workflow buys us little in a (very) dynamic
> environment in which submitting a job one minute later may mean the
> difference between 1 minute of queue time and one hour of queue time
and this:
> You need some SLA/QOS to address that.
seem to be significant characteristics that make the environments we run
on not amenable to scheduling in the traditional sense. The lack of any
meaningful guarantees about almost anything time-related makes everything
basically opportunistic rather than scheduled.
--
From hategan at mcs.anl.gov Wed Apr 8 14:53:28 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 08 Apr 2009 14:53:28 -0500
Subject: [Swift-devel] updates
Message-ID: <1239220408.15551.4.camel@localhost>
There are some fixes in cog r2381, most notably:
- gridftp sessions were sometimes left in a messed state leading to
subsequent transfers throwing obscure errors
- coaster workers were left in an inconsistent state when jobs submitted
to them exceeded their walltimes and the remaining runtime of the
workers
- an alleged fix for "qsub not found". This tied in to our earlier
problems with finding executables. Even though, for example, java was
found using bash -l, the process wasn't subsequently started using bash
-l, leading to qsub not being in the path. The current scheme assumes
that either everything needed can be found using bash -l or everything
needed can be found without bash -l. I suppose some corner cases may
still exist, but they seem unlikely.
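The difference is easy to see by hand; something like the following (the path
shown is made up):

    # login shell: profile scripts run, so the scheduler's bin directory is on PATH
    bash -l -c 'which qsub'    # e.g. /usr/local/torque/bin/qsub (hypothetical)

    # non-login shell: profile scripts are skipped, so the same lookup fails
    bash -c 'which qsub'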
From hategan at mcs.anl.gov Wed Apr 8 15:44:44 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 08 Apr 2009 15:44:44 -0500
Subject: [Swift-devel] Re: replication vs site score
In-Reply-To: <49DD0B3F.7050903@cs.uchicago.edu>
References:
<1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov>
<1239202987.12586.17.camel@localhost>
<49DD0B3F.7050903@cs.uchicago.edu>
Message-ID: <1239223485.26815.2.camel@localhost>
On Wed, 2009-04-08 at 13:38 -0700, Ioan Raicu wrote:
> Does a batch-queue prediction service help things in any way?
> https://portal.teragrid.org/gridsphere/gridsphere?cid=queue-prediction
>
> I've always wondered how the Swift scheduler would behave differently
> if it had statistical information about queue times.
It would help. Statistically.
> Qin, have you compared your job replication strategy with one that
> was cognizant of the expected wait queue time, in order to meet
> deadlines? On the surface, assuming that the batch queue prediction is
> accurate, it would seem that scheduling with known queue times might
> solve the same deadline cognizant scheduling problem, but without
> wasting resources by unnecessary replication.
The replication isn't unnecessary. If it starts, it starts because the
queue time is larger than the expected queue time.
> The place where the queue prediction doesn't help, is when there is a
> bad node which causes an application to be slow or fail.
No. The prediction doesn't help when it fails to predict accurately.
> In this case, replication is probably the better recourse to
> guarantee meeting deadlines.
From iraicu at cs.uchicago.edu Wed Apr 8 15:38:23 2009
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Wed, 08 Apr 2009 13:38:23 -0700
Subject: [Swift-devel] Re: replication vs site score
In-Reply-To:
References:
<1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov>
<1239202987.12586.17.camel@localhost>
Message-ID: <49DD0B3F.7050903@cs.uchicago.edu>
Does a batch-queue prediction service help things in any way?
https://portal.teragrid.org/gridsphere/gridsphere?cid=queue-prediction
I've always wondered how the Swift scheduler would behave differently if
it had statistical information about queue times. Qin, have you compared
your job replication strategy with one that was cognizant of the
expected wait queue time, in order to meet deadlines? On the surface,
assuming that the batch queue prediction is accurate, it would seem that
scheduling with known queue times might solve the same deadline
cognizant scheduling problem, but without wasting resources by
unnecessary replication. The place where the queue prediction doesn't
help, is when there is a bad node which causes an application to be slow
or fail. In this case, replication is probably the better recourse to
guarantee meeting deadlines.
Here is their latest paper on this:
http://www.springerlink.com/content/7552901360631246/fulltext.pdf. The
system is deployed on the TeraGrid, and has been for a few years now. As
far as I have heard, it is quite robust and accurate.
Cheers,
Ioan
Ben Clifford wrote:
> On Wed, 8 Apr 2009, Mihael Hategan wrote:
>
> This:
>
>
>> planning the whole workflow buys us little in a (very) dynamic
>> environment in which submitting a job one minute later may mean the
>> difference between 1 minute of queue time and one hour of queue time
>>
>
> and this:
>
>
>> You need some SLA/QOS to address that.
>>
>
> seem to be significant characteristics that make the environments we run
> on not amenable to scheduling in the traditional sense. The lack of any
> meaningful guarantees about almost anything time-related makes everything
> basically opportunistic rather than scheduled.
>
>
--
===================================================
Ioan Raicu, Ph.D.
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web: http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================
===================================================
From iraicu at cs.uchicago.edu Wed Apr 8 15:46:28 2009
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Wed, 08 Apr 2009 13:46:28 -0700
Subject: [Swift-devel] Re: replication vs site score
In-Reply-To: <1239223485.26815.2.camel@localhost>
References:
<1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov>
<1239202987.12586.17.camel@localhost>
<49DD0B3F.7050903@cs.uchicago.edu>
<1239223485.26815.2.camel@localhost>
Message-ID: <49DD0D24.2010909@cs.uchicago.edu>
Mihael Hategan wrote:
>> The place where the queue prediction doesn't help, is when there is a
>> bad node which causes an application to be slow or fail.
>>
>
> No. The prediction doesn't help when it fails to predict accurately.
>
>
The prediction that I was referring to was only for the queue time, not
the execution time. A failed node, causing an application run time to be
longer than expected, has no impact on the prediction of the wait queue
time.
Ioan
--
===================================================
Ioan Raicu, Ph.D.
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web: http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================
===================================================
From hategan at mcs.anl.gov Wed Apr 8 15:54:30 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 08 Apr 2009 15:54:30 -0500
Subject: [Swift-devel] Re: replication vs site score
In-Reply-To: <49DD0D24.2010909@cs.uchicago.edu>
References:
<1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov>
<1239202987.12586.17.camel@localhost>
<49DD0B3F.7050903@cs.uchicago.edu> <1239223485.26815.2.camel@localhost>
<49DD0D24.2010909@cs.uchicago.edu>
Message-ID: <1239224070.27089.4.camel@localhost>
On Wed, 2009-04-08 at 13:46 -0700, Ioan Raicu wrote:
>
>
> Mihael Hategan wrote:
> > > The place where the queue prediction doesn't help, is when there is a
> > > bad node which causes an application to be slow or fail.
> > >
> >
> > No. The prediction doesn't help when it fails to predict accurately.
> >
> >
> The prediction that I was referring to was only for the queue time,
> not the execution time. A failed node, causing an application run time
> to be longer than expected, has no impact on the prediction of the
> wait queue time.
You're right. I was trying to say that fundamentally the problem of
uncertainty in queue times will remain, by virtue of the fact that the
times when people submit jobs (as well as the number of jobs) are
unpredictable and can affect other people's job queue times.
The predictor in the paper answers the question "if you were to submit
your job before the state of the queue changes in any way, what would be
the expected queue time for the job" and not "what will be the queue
time for the job".
From iraicu at cs.uchicago.edu Wed Apr 8 15:58:10 2009
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Wed, 08 Apr 2009 13:58:10 -0700
Subject: [Swift-devel] Re: replication vs site score
In-Reply-To: <1239224070.27089.4.camel@localhost>
References:
<1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov>
<1239202987.12586.17.camel@localhost>
<49DD0B3F.7050903@cs.uchicago.edu>
<1239223485.26815.2.camel@localhost>
<49DD0D24.2010909@cs.uchicago.edu>
<1239224070.27089.4.camel@localhost>
Message-ID: <49DD0FE2.3000505@cs.uchicago.edu>
Mihael Hategan wrote:
>
> You're right. I was trying to say that fundamentally the problem of
> uncertainty in queue times will remain by virtue of the fact that the
> times when people submit jobs (as well as the amount of jobs) is
> unpredictable and it can affect other people's job queue times.
>
> The predictor in the paper answers the question "if you were to submit
> your job before the state of the queue changes in any way, what would be
> the expected queue time for the job" and not "what will be the queue
> time for the job".
>
>
Yes, it's possible that between a prediction query and the actual
submission the state of the queues changes, and therefore so does the
actual result. But every prediction comes with some error bounds, so
it's possible that the change in queue state might be reflected in the
error bars. Nevertheless, I think it might be an interesting improvement
to the current Swift scheduler. Ben, was this on the list of Google
summer of code projects? If not, perhaps you might want to add it.
Ioan
--
===================================================
Ioan Raicu, Ph.D.
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web: http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================
===================================================
From hategan at mcs.anl.gov Wed Apr 8 16:32:00 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 08 Apr 2009 16:32:00 -0500
Subject: [Swift-devel] Re: replication vs site score
In-Reply-To: <49DD0FE2.3000505@cs.uchicago.edu>
References:
<1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov>
<1239202987.12586.17.camel@localhost>
<49DD0B3F.7050903@cs.uchicago.edu> <1239223485.26815.2.camel@localhost>
<49DD0D24.2010909@cs.uchicago.edu> <1239224070.27089.4.camel@localhost>
<49DD0FE2.3000505@cs.uchicago.edu>
Message-ID: <1239226320.3974.5.camel@localhost>
On Wed, 2009-04-08 at 13:58 -0700, Ioan Raicu wrote:
>
>
> Mihael Hategan wrote:
> >
> > You're right. I was trying to say that fundamentally the problem of
> > uncertainty in queue times will remain by virtue of the fact that the
> > times when people submit jobs (as well as the amount of jobs) is
> > unpredictable and it can affect other people's job queue times.
> >
> > The predictor in the paper answers the question "if you were to submit
> > your job before the state of the queue changes in any way, what would be
> > the expected queue time for the job" and not "what will be the queue
> > time for the job".
> >
> >
> Yes, its possible that between a query of prediction, and actual
> submission, the state of the queues change, and therefore the actual
> result change. But, every prediction comes with some error bounds, so
> its possible that the change in queue state, might be reflected in the
> error bars.
I don't know... The system predicted that a 2 minute job on Abe would
sit 11.2 hours in the queue and 2.4 hours on QueenBee, but I've run 20
such jobs on both in the past 15 minutes.
From iraicu at cs.uchicago.edu Wed Apr 8 18:00:37 2009
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Wed, 08 Apr 2009 16:00:37 -0700
Subject: [Swift-devel] Re: replication vs site score
In-Reply-To: <1239226320.3974.5.camel@localhost>
References:
<1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov>
<1239202987.12586.17.camel@localhost>
<49DD0B3F.7050903@cs.uchicago.edu>
<1239223485.26815.2.camel@localhost>
<49DD0D24.2010909@cs.uchicago.edu>
<1239224070.27089.4.camel@localhost>
<49DD0FE2.3000505@cs.uchicago.edu>
<1239226320.3974.5.camel@localhost>
Message-ID: <49DD2C95.3030706@cs.uchicago.edu>
Aha, but I think the predictions are upper bounds, not upper and lower
bounds. In essence, when they predict that your job will wait for 11.2
hours, with 95% confidence, and your job runs in 15 minutes, then in no
way have they made a prediction in error. Now, if they had predicted 1
minute, and it took 15 minutes, then it would have been an error. It is
possible that they do not use knowledge of back-filling, which would make
small jobs run immediately although they would predict a long queue wait
time, as if no back-filling were enabled. It's not clear how customized
the predictor is to the scheduler and the features of the LRM, so there is
certainly room for being pessimistic in their predictions.
Ioan
Mihael Hategan wrote:
> On Wed, 2009-04-08 at 13:58 -0700, Ioan Raicu wrote:
>
>> Mihael Hategan wrote:
>>
>>> You're right. I was trying to say that fundamentally the problem of
>>> uncertainty in queue times will remain by virtue of the fact that the
>>> times when people submit jobs (as well as the amount of jobs) is
>>> unpredictable and it can affect other people's job queue times.
>>>
>>> The predictor in the paper answers the question "if you were to submit
>>> your job before the state of the queue changes in any way, what would be
>>> the expected queue time for the job" and not "what will be the queue
>>> time for the job".
>>>
>>>
>>>
>> Yes, its possible that between a query of prediction, and actual
>> submission, the state of the queues change, and therefore the actual
>> result change. But, every prediction comes with some error bounds, so
>> its possible that the change in queue state, might be reflected in the
>> error bars.
>>
>
> I don't know... The system predicted that a 2 minute job on Abe would
> sit 11.2 hours in the queue and 2.4 hours on QueenBee, but I've ran 20
> such jobs on both in the past 15 minutes.
>
>
>
>
--
===================================================
Ioan Raicu, Ph.D.
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web: http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================
===================================================
From hategan at mcs.anl.gov Wed Apr 8 21:16:58 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 08 Apr 2009 21:16:58 -0500
Subject: [Swift-devel] Re: replication vs site score
In-Reply-To: <49DD2C95.3030706@cs.uchicago.edu>
References:
<1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov>
<1239202987.12586.17.camel@localhost>
<49DD0B3F.7050903@cs.uchicago.edu> <1239223485.26815.2.camel@localhost>
<49DD0D24.2010909@cs.uchicago.edu> <1239224070.27089.4.camel@localhost>
<49DD0FE2.3000505@cs.uchicago.edu> <1239226320.3974.5.camel@localhost>
<49DD2C95.3030706@cs.uchicago.edu>
Message-ID: <1239243418.17988.39.camel@localhost>
On Wed, 2009-04-08 at 16:00 -0700, Ioan Raicu wrote:
> Aha, but I think the predictions are upper bounds, not upper and lower
> bounds. In essence, when they predict that your job will wait for 11.2
> hours, with 95% confidence, and your job runs in 15 minutes, then in
> no way have they made a prediction in error.
Heh. "It's not even wrong".
From foster at anl.gov Thu Apr 9 06:27:31 2009
From: foster at anl.gov (Ian Foster)
Date: Thu, 9 Apr 2009 06:27:31 -0500
Subject: [Swift-devel] Re: replication vs site score
In-Reply-To: <49DD0B3F.7050903@cs.uchicago.edu>
References:
<1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov>
<1239202987.12586.17.camel@localhost>
<49DD0B3F.7050903@cs.uchicago.edu>
Message-ID: <4DED210A-3AB9-4FE6-B82D-60B0966280C5@anl.gov>
Hi,
I wanted to point out that when we use Falkon/coasters, we have full
control over scheduling, so in that case we could in principle pre-
compute schedules. However, in practice we still don't tend to have
enough information about execution times for this to be that useful.
At least that's my belief.
I assume that estimates of queue time bounds would surely be helpful
for determining where to send things, and whether a job was stuck.
Ian.
From hategan at mcs.anl.gov Thu Apr 9 10:30:54 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 09 Apr 2009 10:30:54 -0500
Subject: [Swift-devel] Re: replication vs site score
In-Reply-To: <4DED210A-3AB9-4FE6-B82D-60B0966280C5@anl.gov>
References:
<1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov>
<1239202987.12586.17.camel@localhost>
<49DD0B3F.7050903@cs.uchicago.edu>
<4DED210A-3AB9-4FE6-B82D-60B0966280C5@anl.gov>
Message-ID: <1239291054.32146.14.camel@localhost>
On Thu, 2009-04-09 at 06:27 -0500, Ian Foster wrote:
> Hi,
>
> I wanted to point out that when we use Falkon/coasters, we have full
> control over scheduling,
Once we get the nodes, yes.
> so in that case we could in principle pre-
> compute schedules. However, in practice we still don't tend to have
> enough information about execution times for this to be that useful.
> At least that's my belief.
>
> I assume that estimates of queue time bounds would surely be helpful
> for determining where to send things, and whether a job was stuck.
>
> Ian.
From benc at hawaga.org.uk Thu Apr 9 10:30:37 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 9 Apr 2009 15:30:37 +0000 (GMT)
Subject: [Swift-devel] Re: replication vs site score
In-Reply-To: <4DED210A-3AB9-4FE6-B82D-60B0966280C5@anl.gov>
References:
<1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov>
<1239202987.12586.17.camel@localhost>
<49DD0B3F.7050903@cs.uchicago.edu>
<4DED210A-3AB9-4FE6-B82D-60B0966280C5@anl.gov>
Message-ID:
On Thu, 9 Apr 2009, Ian Foster wrote:
> I wanted to point out that when we use Falkon/coasters, we have full control
> over scheduling, so in that case we could in principle pre-compute schedules.
Coasters as they are now are still allocated on an opportunistic basis, so
once we have a coaster, stuff could be scheduled to it, but when coaster
workers actually exist is as unknown as when jobs will run in the
non-coaster case, I think.
Where Falkon has been used for pre-allocated resources on machines, with
no dynamic allocation/unallocation, though, the available resources
probably are known well enough for this.
> However, in practice we still don't tend to have enough information about
> execution times for this to be that useful. At least that's my belief.
yes.
--
From hategan at mcs.anl.gov Thu Apr 9 10:49:25 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 09 Apr 2009 10:49:25 -0500
Subject: [Swift-devel] Re: replication vs site score
In-Reply-To:
References:
<1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov>
<1239202987.12586.17.camel@localhost>
<49DD0B3F.7050903@cs.uchicago.edu>
<4DED210A-3AB9-4FE6-B82D-60B0966280C5@anl.gov>
Message-ID: <1239292165.339.1.camel@localhost>
On Thu, 2009-04-09 at 15:30 +0000, Ben Clifford wrote:
> On Thu, 9 Apr 2009, Ian Foster wrote:
>
> > I wanted to point out that when we use Falkon/coasters, we have full control
> > over scheduling, so in that case we could in principle pre-compute schedules.
>
> Coasters as they are now are still allocated on an opportunistic basis, so
> once we have a coaster stuff could be scheduled to it, but when coaster
> workers actually exist is as unknown as when jobs will run in the
> non-coaster case, I think.
>
> Where Falkon has been used for pre-allocated resources on machines, with
> no dynamic allocation/unallocation, though, the available resources
> probably are known well enough for this.
Except when using pre-allocated resources, you are still waiting for
them, but the waiting is not automated.
>
> > However, in practice we still don't tend to have enough information about
> > execution times for this to be that useful. At least that's my belief.
>
> yes.
>
From benc at hawaga.org.uk Thu Apr 9 10:49:50 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 9 Apr 2009 15:49:50 +0000 (GMT)
Subject: [Swift-devel] Re: replication vs site score
In-Reply-To: <1239292165.339.1.camel@localhost>
References:
<1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov>
<1239202987.12586.17.camel@localhost>
<49DD0B3F.7050903@cs.uchicago.edu>
<4DED210A-3AB9-4FE6-B82D-60B0966280C5@anl.gov>
<1239292165.339.1.camel@localhost>
Message-ID:
On Thu, 9 Apr 2009, Mihael Hategan wrote:
> Except when using pre-allocated resources, you are still waiting for
> them, but the waiting is not automated.
Also you have chosen to not attempt to opportunistically get any more once
you have decided you have waited enough.
--
From hategan at mcs.anl.gov Thu Apr 9 10:57:14 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 09 Apr 2009 10:57:14 -0500
Subject: [Swift-devel] Re: replication vs site score
In-Reply-To:
References:
<1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov>
<1239202987.12586.17.camel@localhost>
<49DD0B3F.7050903@cs.uchicago.edu>
<4DED210A-3AB9-4FE6-B82D-60B0966280C5@anl.gov>
<1239292165.339.1.camel@localhost>
Message-ID: <1239292634.406.6.camel@localhost>
On Thu, 2009-04-09 at 15:49 +0000, Ben Clifford wrote:
> On Thu, 9 Apr 2009, Mihael Hategan wrote:
>
> > Except when using pre-allocated resources, you are still waiting for
> > them, but the waiting is not automated.
>
> Also you have chosen to not attempt to opportunistically get any more once
> you have decided you have waited enough.
>
Right. Overall it leads to inefficiencies and wasted cpu-hours, but it
gives you a known set of resources, which is valuable. I think the known
set of resources part can be achieved anyway if there were that
back-channel mentioned in random chatter that informed Swift about the
available nodes.
From iraicu at cs.uchicago.edu Thu Apr 9 11:49:33 2009
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Thu, 09 Apr 2009 09:49:33 -0700
Subject: [Swift-devel] Re: replication vs site score
In-Reply-To:
References:
<1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov>
<1239202987.12586.17.camel@localhost>
<49DD0B3F.7050903@cs.uchicago.edu>
<4DED210A-3AB9-4FE6-B82D-60B0966280C5@anl.gov>
Message-ID: <49DE271D.8030003@cs.uchicago.edu>
Right, Falkon supports both static and dynamic allocation of resources.
I believe coaster only supports dynamic allocation of resources. We have
lots of information under static allocation, that could help scheduling,
but under dynamic allocation, there is a mixture of known information
(the already allocated resources) and the unknown (the jobs in the wait
queue). In a sense, a smarter scheduler could make use of at least known
information, although this information might frequently change, and the
scheduler would have to adapt frequently.
Ioan
Ben Clifford wrote:
> On Thu, 9 Apr 2009, Ian Foster wrote:
>
>
>> I wanted to point out that when we use Falkon/coasters, we have full control
>> over scheduling, so in that case we could in principle pre-compute schedules.
>>
>
> Coasters as they are now are still allocated on an opportunistic basis, so
> once we have a coaster stuff could be scheduled to it, but when coaster
> workers actually exist is as unknown as when jobs will run in the
> non-coaster case, I think.
>
> Where Falkon has been used for pre-allocated resources on machines, with
> no dynamic allocation/unallocation, though, the available resources
> probably are known well enough for this.
>
>
>> However, in practice we still don't tend to have enough information about
>> execution times for this to be that useful. At least that's my belief.
>>
>
> yes.
>
>
--
===================================================
Ioan Raicu, Ph.D.
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web: http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================
===================================================
From foster at anl.gov Thu Apr 9 11:51:27 2009
From: foster at anl.gov (Ian Foster)
Date: Thu, 9 Apr 2009 11:51:27 -0500
Subject: [Swift-devel] Re: replication vs site score
In-Reply-To: <49DE271D.8030003@cs.uchicago.edu>
References:
<1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov>
<1239202987.12586.17.camel@localhost>
<49DD0B3F.7050903@cs.uchicago.edu>
<4DED210A-3AB9-4FE6-B82D-60B0966280C5@anl.gov>
<49DE271D.8030003@cs.uchicago.edu>
Message-ID: <7A6429F1-CDE4-4B09-95BB-C7DB24E78905@anl.gov>
I didn't appreciate that about Coaster. It should (IMHO) support
static allocation, as a special case. People will clearly want that.
On Apr 9, 2009, at 11:49 AM, Ioan Raicu wrote:
> Right, Falkon supports both static and dynamic allocation of
> resources. I believe coaster only supports dynamic allocation of
> resources. We have lots of information under static allocation, that
> could help scheduling, but under dynamic allocation, there is a
> mixture of known information (the already allocated resources) and
> the unknown (the jobs in the wait queue). In a sense, a smarter
> scheduler could make use of at least known information, although
> this information might frequently change, and the scheduler would
> have to adapt frequently.
>
> Ioan
>
> Ben Clifford wrote:
>>
>> On Thu, 9 Apr 2009, Ian Foster wrote:
>>
>>
>>> I wanted to point out that when we use Falkon/coasters, we have
>>> full control
>>> over scheduling, so in that case we could in principle pre-compute
>>> schedules.
>>>
>> Coasters as they are now are still allocated on an opportunistic
>> basis, so
>> once we have a coaster stuff could be scheduled to it, but when
>> coaster
>> workers actually exist is as unknown as when jobs will run in the
>> non-coaster case, I think.
>>
>> Where Falkon has been used for pre-allocated resources on machines,
>> with
>> no dynamic allocation/unallocation, though, the available resources
>> probably are known well enough for this.
>>
>>
>>> However, in practice we still don't tend to have enough
>>> information about
>>> execution times for this to be that useful. At least that's my
>>> belief.
>>>
>> yes.
>>
>>
>
> --
> ===================================================
> Ioan Raicu, Ph.D.
> ===================================================
> Distributed Systems Laboratory
> Computer Science Department
> University of Chicago
> 1100 E. 58th Street, Ryerson Hall
> Chicago, IL 60637
> ===================================================
> Email: iraicu at cs.uchicago.edu
> Web: http://www.cs.uchicago.edu/~iraicu
> http://dev.globus.org/wiki/Incubator/Falkon
> http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
> ===================================================
> ===================================================
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
From hategan at mcs.anl.gov Thu Apr 9 11:56:30 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 09 Apr 2009 11:56:30 -0500
Subject: [Swift-devel] Re: replication vs site score
In-Reply-To: <7A6429F1-CDE4-4B09-95BB-C7DB24E78905@anl.gov>
References:
<1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov>
<1239202987.12586.17.camel@localhost>
<49DD0B3F.7050903@cs.uchicago.edu>
<4DED210A-3AB9-4FE6-B82D-60B0966280C5@anl.gov>
<49DE271D.8030003@cs.uchicago.edu>
<7A6429F1-CDE4-4B09-95BB-C7DB24E78905@anl.gov>
Message-ID: <1239296190.1717.1.camel@localhost>
On Thu, 2009-04-09 at 11:51 -0500, Ian Foster wrote:
> I didn't appreciate that about Coaster. It should (IMHO) support
> static allocation, as a special case. People will clearly want that.
Yes. People clearly make irrational choices.
>
>
>
> On Apr 9, 2009, at 11:49 AM, Ioan Raicu wrote:
>
> > Right, Falkon supports both static and dynamic allocation of
> > resources. I believe coaster only supports dynamic allocation of
> > resources. We have lots of information under static allocation, that
> > could help scheduling, but under dynamic allocation, there is a
> > mixture of known information (the already allocated resources) and
> > the unknown (the jobs in the wait queue). In a sense, a smarter
> > scheduler could make use of at least known information, although
> > this information might frequently change, and the scheduler would
> > have to adapt frequently.
> >
> > Ioan
> >
> > Ben Clifford wrote:
> > > On Thu, 9 Apr 2009, Ian Foster wrote:
> > >
> > >
> > > > I wanted to point out that when we use Falkon/coasters, we have full control
> > > > over scheduling, so in that case we could in principle pre-compute schedules.
> > > >
> > > Coasters as they are now are still allocated on an opportunistic basis, so
> > > once we have a coaster stuff could be scheduled to it, but when coaster
> > > workers actually exist is as unknown as when jobs will run in the
> > > non-coaster case, I think.
> > >
> > > Where Falkon has been used for pre-allocated resources on machines, with
> > > no dynamic allocation/unallocation, though, the available resources
> > > probably are known well enough for this.
> > >
> > >
> > > > However, in practice we still don't tend to have enough information about
> > > > execution times for this to be that useful. At least that's my belief.
> > > >
> > > yes.
> > >
> > >
> >
> > --
> > ===================================================
> > Ioan Raicu, Ph.D.
> > ===================================================
> > Distributed Systems Laboratory
> > Computer Science Department
> > University of Chicago
> > 1100 E. 58th Street, Ryerson Hall
> > Chicago, IL 60637
> > ===================================================
> > Email: iraicu at cs.uchicago.edu
> > Web: http://www.cs.uchicago.edu/~iraicu
> > http://dev.globus.org/wiki/Incubator/Falkon
> > http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
> > ===================================================
> > ===================================================
> > _______________________________________________
> > Swift-devel mailing list
> > Swift-devel at ci.uchicago.edu
> > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
From iraicu at cs.uchicago.edu Thu Apr 9 11:52:48 2009
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Thu, 09 Apr 2009 09:52:48 -0700
Subject: [Swift-devel] Re: replication vs site score
In-Reply-To: <1239292165.339.1.camel@localhost>
References: