From bugzilla-daemon at mcs.anl.gov Sun Mar 1 15:47:05 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Sun, 1 Mar 2009 15:47:05 -0600 (CST)
Subject: [Swift-devel] [Bug 86] recompilation should not be suppressed if
compiler version has changed
In-Reply-To:
Message-ID: <20090301214705.B633B164B3@foxtrot.mcs.anl.gov>
http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=86
benc at hawaga.org.uk changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |RESOLVED
Resolution| |FIXED
------- Comment #2 from benc at hawaga.org.uk 2009-03-01 15:47 -------
Implemented in r2583.
--
Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.
From bugzilla-daemon at mcs.anl.gov Sun Mar 1 15:47:53 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Sun, 1 Mar 2009 15:47:53 -0600 (CST)
Subject: [Swift-devel] [Bug 180] multi-node coasters?
In-Reply-To:
Message-ID: <20090301214753.04AE7164CF@foxtrot.mcs.anl.gov>
http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=180
benc at hawaga.org.uk changed:
What |Removed |Added
----------------------------------------------------------------------------
AssignedTo|benc at hawaga.org.uk |hategan at mcs.anl.gov
--
Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From wilde at mcs.anl.gov Sun Mar 1 22:53:39 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Sun, 01 Mar 2009 22:53:39 -0600
Subject: [Swift-devel] continued questions on iterate
Message-ID: <49AB6653.6010306@mcs.anl.gov>
This program:
string s[];
s[0]="hi ";
iterate i {
  s[i+1] = @strcat(s[i],"hi ");
  trace(s[i]);
} until(i==5);
Gives:
com$ swift it4.swift
Could not start execution.
variable s has multiple writers.
--
It's similar to the tutorial example:
counterfile a[] ;
a[0] = echo("793578934574893");
iterate v {
  a[v+1] = countstep(a[v]);
  print("extract int value ", @extractint(a[v+1]));
} until (@extractint(a[v+1]) <= 1);
--
...which I reported earlier as having problems (I think in addition to
the one above?)
This is using the latest swift, rev 2631, and latest cog.
I thought I had issues like this licked, but then updated the code to
get closer to what the user needs.
In this example, I don't see any violation of single-assignment, but
apparently swift does.
The full example that the test case above is for is at:
www.ci.uchicago.edu/~wilde/oops8.swift, which encounters the same
multiple-writer problem.
I start with an initial "secondary structure" string of all A's, same
length as the protein sequence. After each folding round, a new
structure is derived for analysis and used as the starting point for the
next round. This has the same data access pattern as array s[] above:
foreach p, pn in protein {
  OOPSOut result[][] ;
  SecSeq secseq[] ;
  OOPSIn oopsin ;
  secseq[0] = sedfasta(oopsin.fasta, ["-e","s/./A/g"]);
  boolean converged[];
  iterate i {
    SecSeq s;
    result[i] = doRound(p,oopsin,secseq[i],i);
    (converged[i],s) = analyzeResult(result[i], p, i, secseq[i]);
    secseq[i+1] = s;
  } until (converged[i] || (i==3));
}
In this case, I get the same message for array secseq (variable has
multiple writers).
I
From foster at uchicago.edu Sun Mar 1 23:25:11 2009
From: foster at uchicago.edu (Ian Foster)
Date: Sun, 1 Mar 2009 23:25:11 -0600
Subject: [Swift-devel] Swift user guide
Message-ID: <08A89F3C-222A-4B5F-B497-75435EE6CB0A@uchicago.edu>
Reading a recent email about "iterate" got me looking at the Swift
manual. This is looking very nice. I had a couple of comments:
1) "Conceptually, a parallel can be drawn between Swift mapped
variables and Java reference types. In both cases there is no
syntactic distinction between primitive types and mapped types or
reference types respectively. Additionally, the semantic distinction
is also kept to a minimum."
--> I don't think we should assume that readers know what a Java
reference type is. Most will not.
2) Arrays: I gather from below that the size of an array is defined by
assignments to it. This seems confusing and dangerous to me: doesn't
it require a global analysis, which must ultimately be undecidable, to
determine whether an array is closed?
Statements which deal with the array as a whole will often wait for
the array to be closed before executing (thus, a closed array is the
equivalent of a non-array type being assigned). However, a foreach
statement will apply its body to elements of an array as they become
known. It will not wait until the array is closed.
Consider this script:
file a[];
file b[];
foreach v,i in a {
  b[i] = p(v);
}
a[0] = r();
a[1] = s();
Initially, the foreach statement will have nothing to execute, as the
array a has not been assigned any values. The procedures r and s will
execute. As soon as either of them is finished, the corresponding
invocation of procedure p will occur. After both r and s have
completed, the array a will be closed since no other statements in the
script make an assignment to a.
3) In the following text, the (,index) is presumably meant to indicate
an optional element. But as you don't use a different font, or indeed
have indicated what conventions you are using, readers may not realize
that.
foreach statements have the general form:
foreach controlvariable (,index) in expression {
statements
}
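As an aside on point 2, the closing behaviour quoted from the guide can be modelled outside Swift. What follows is a minimal Python sketch, not Swift's implementation: the class SwiftArray, its method names, the explicit close() call, and the use of str.upper as a stand-in for procedure p are all assumptions made for illustration. It shows a foreach body firing as each element becomes known, without waiting for the array to be closed.

```python
class SwiftArray:
    """Toy model of a Swift array: write-once elements, closed explicitly."""

    def __init__(self):
        self.elements = {}    # index -> value, each index written at most once
        self.closed = False
        self.listeners = []   # foreach bodies waiting for new elements

    def foreach(self, body):
        # Run the body for elements already known, then for later ones.
        self.listeners.append(body)
        for index, value in sorted(self.elements.items()):
            body(value, index)

    def assign(self, index, value):
        if self.closed:
            raise RuntimeError("array is closed")
        if index in self.elements:
            raise RuntimeError(f"element {index} assigned twice")
        self.elements[index] = value
        for body in self.listeners:
            body(value, index)

    def close(self):
        # Assumption: in Swift this happens implicitly once no writer remains.
        self.closed = True


a = SwiftArray()
b = SwiftArray()

# Models: foreach v,i in a { b[i] = p(v); }  with p replaced by str.upper
a.foreach(lambda v, i: b.assign(i, v.upper()))

a.assign(1, "s-output")   # s() happens to finish first
a.assign(0, "r-output")   # then r()
a.close()                 # no statement left that writes to a
b.close()
```

In this model the foreach body runs twice, once per assignment to a, before a is closed. The hard part that the sketch sidesteps is deciding when close() may be called, which is exactly the analysis being questioned above.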
From wilde at mcs.anl.gov Sun Mar 1 23:39:42 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Sun, 01 Mar 2009 23:39:42 -0600
Subject: [Swift-devel] continued questions on iterate
In-Reply-To: <49AB6653.6010306@mcs.anl.gov>
References: <49AB6653.6010306@mcs.anl.gov>
Message-ID: <49AB711E.4010208@mcs.anl.gov>
I'm able to work around this by moving the s[0] assignment inside the
iterate block, in an if(i==0) {} else {} construct.
Still, it seems the restriction is not intended.
- Mike
On 3/1/09 10:53 PM, Michael Wilde wrote:
> This program:
>
> string s[];
> s[0]="hi ";
> iterate i {
> s[i+1] = @strcat(s[i],"hi ");
> trace(s[i]);
> } until(i==5);
>
> Gives:
>
> com$ swift it4.swift
> Could not start execution.
> variable s has multiple writers.
>
> --
> Its similar to the tutorial example:
>
> counterfile a[] ;
>
> a[0] = echo("793578934574893");
>
> iterate v {
> a[v+1] = countstep(a[v]);
> print("extract int value ", @extractint(a[v+1]));
> } until (@extractint(a[v+1]) <= 1);
>
> --
>
> ...which I reported earlier as having problems (I think in addition to
> the one above?)
>
> This is using the latest swift, rev 2631, and latest cog.
>
> I thought I had issues like this licked, but then updated the code to
> get closer to what the user needs.
>
> In this example, I dont see any violation of single-assignment, but
> apparently swift does.
>
> The full example that the test case above is for is at:
> www.ci.uchicago.edu/~wilde/oops8.swift, which encounters the same
> multiple-writer problem.
>
> I start with an initial "secondary structure" string of all A's, same
> length as the protein sequence. After each folding round, a new
> structure is derived for analysis and used as the starting point for the
> next round. This has the same data access pattern as array s[] above:
>
> foreach p, pn in protein {
> OOPSOut result[][] ;
> SecSeq secseq[] prefix=@strcat("seqseq/",p,"/"),suffix=".secseq">;
> OOPSIn oopsin ;
> secseq[0] = sedfasta(oopsin.fasta, ["-e","s/./A/g"]);
> boolean converged[];
> iterate i {
> SecSeq s;
> result[i] = doRound(p,oopsin,secseq[i],i);
> (converged[i],s) = analyzeResult(result[i], p, i, secseq[i]);
> secseq[i+1] = s;
> } until (converged[i] || (i==3));
> }
>
> In this case, I get the same message for array secseq (varable has
> multiple writers).
>
> I
>
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
From hategan at mcs.anl.gov Mon Mar 2 00:27:38 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 02 Mar 2009 00:27:38 -0600
Subject: [Swift-devel] Swift user guide
In-Reply-To: <08A89F3C-222A-4B5F-B497-75435EE6CB0A@uchicago.edu>
References: <08A89F3C-222A-4B5F-B497-75435EE6CB0A@uchicago.edu>
Message-ID: <1235975258.20059.39.camel@localhost>
On Sun, 2009-03-01 at 23:25 -0600, Ian Foster wrote:
> 2) Arrays: I gather from below that the size of an array is defined by
> assignments to it. This seems confusing and dangerous to me: doesn't
> it require a global analysis, which must ultimately be undecidable, to
> determine whether an array is closed?
>
You don't need to add a level of indirection :)
Iterate() makes swift termination and array closing undecidable (and so
does recursion*). And I think we want it that way in order to support
problems of the kind "repeat until sufficiently good results are
obtained".
(*) except with recursion you're guaranteed to run out of memory
eventually
From benc at hawaga.org.uk Mon Mar 2 01:40:56 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 2 Mar 2009 07:40:56 +0000 (GMT)
Subject: [Swift-devel] Swift user guide
In-Reply-To: <08A89F3C-222A-4B5F-B497-75435EE6CB0A@uchicago.edu>
References: <08A89F3C-222A-4B5F-B497-75435EE6CB0A@uchicago.edu>
Message-ID:
On Sun, 1 Mar 2009, Ian Foster wrote:
> 2) Arrays: I gather from below that the size of an array is defined by
> assignments to it. This seems confusing and dangerous to me: doesn't it
> require a global analysis, which must ultimately be undecidable, to determine
> whether an array is closed?
yes.
They're a very unpleasant structure at the moment. As I've written in
other mails in more depth, I'd prefer to see them behave as
single-assignment structures constructed by something that looks similar
to, but slightly different from, the present loop constructs, so that you'd
say:
array = foreach ....
rather than
foreach ... {
  array[i]=...
}
There would be no requirement for this to actually execute as some atomic
operation, and could happen over time interleaved with other tasks, as
happens for foreach at the moment; but from a code analysis perspective,
it's much clearer when the array is closed - after the single statement
that assigns to it has fully completed, like non-array variables.
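To make the contrast concrete, here is a small Python sketch of the "array = foreach ..." idea. It is only a model under stated assumptions: build_array is a hypothetical helper, not Swift syntax or an existing Swift feature, and the read-only mapping stands in for "closed". The point is that the array has exactly one writer, the statement that builds it.

```python
from types import MappingProxyType


def build_array(source, body):
    """Model of `array = foreach ...`: one statement yields the whole array."""
    result = {index: body(value, index) for index, value in enumerate(source)}
    # A read-only view models "closed as soon as the statement completes".
    return MappingProxyType(result)


inputs = ["x", "y", "z"]
array = build_array(inputs, lambda v, i: f"{v}{i}")
```

Nothing outside build_array can add elements afterwards, so a reader of the code (or an analysis pass) never has to search the rest of the program for other writers.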
--
From benc at hawaga.org.uk Mon Mar 2 01:45:01 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 2 Mar 2009 07:45:01 +0000 (GMT)
Subject: [Swift-devel] continued questions on iterate
In-Reply-To: <49AB6653.6010306@mcs.anl.gov>
References: <49AB6653.6010306@mcs.anl.gov>
Message-ID:
On Sun, 1 Mar 2009, Michael Wilde wrote:
> com$ swift it4.swift
> Could not start execution.
> variable s has multiple writers.
> In this example, I dont see any violation of single-assignment, but apparently
> swift does.
Yes, it's a bug in the "let's try to guess who is allowed to write to the
array" code.
--
From benc at hawaga.org.uk Mon Mar 2 05:08:45 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 2 Mar 2009 11:08:45 +0000 (GMT)
Subject: [Swift-devel] provenance challenge 3 participation
Message-ID:
The 3rd Provenance Challenge starts today. Mike and I entered the VDS VDC
in the first provenance challenge; we did not participate in the second
challenge. Luiz M. R. Gadelha Jr. and I intend to work
on an entry based around the provenance database prototype that lives in
the provenancedb/ directory of the SVN, with the goals of making that code
more useable and useful and of making some presentation at the provenance
challenge workshop in the summer.
Information about the challenge is here:
http://twiki.ipaw.info/bin/view/Challenge/ThirdProvenanceChallenge
We've had some discussion in private correspondence but will continue our
discussions on this list.
--
From foster at uchicago.edu Mon Mar 2 07:09:38 2009
From: foster at uchicago.edu (Ian Foster)
Date: Mon, 2 Mar 2009 07:09:38 -0600
Subject: [Swift-devel] Swift user guide
In-Reply-To:
References: <08A89F3C-222A-4B5F-B497-75435EE6CB0A@uchicago.edu>
Message-ID: <4358C77B-E378-45BE-8C2D-116C411E4CE7@uchicago.edu>
Ben:
The example I saw in the code had the input array (e.g., "a", in
"foreach v in a") constructed elsewhere in the program, with a[0] =
v1, a[1] = v2. That seemed particularly challenging.
Ian.
On Mar 2, 2009, at 1:40 AM, Ben Clifford wrote:
>
> On Sun, 1 Mar 2009, Ian Foster wrote:
>
>> 2) Arrays: I gather from below that the size of an array is defined
>> by
>> assignments to it. This seems confusing and dangerous to me:
>> doesn't it
>> require a global analysis, which must ultimately be undecidable, to
>> determine
>> whether an array is closed?
>
> yes.
>
> They're a very unpleasant structure at the moment. As I've written in
> other mails in more depth, I'd prefer to see them behave as
> single-assignment structures constructued by something like looks
> similar
> to but slightly different to the present loop constructs, so that
> you'd
> say:
>
> array = foreach ....
>
> rather than
>
> foreach ... {
> array[i]=...
> }
>
> There would be no requirement for this to actually execute as some
> atomic
> operation, and could happen over time interleaved with other tasks, as
> happens for foreach at the moment; but from a code analysis
> perspective,
> its much clearer when the array is closed - after the single statement
> that assigns to it has fully completed, like non-array variables.
>
> --
From foster at anl.gov Mon Mar 2 08:24:34 2009
From: foster at anl.gov (Ian Foster)
Date: Mon, 2 Mar 2009 08:24:34 -0600
Subject: [Swift-devel] provenance challenge 3 participation
In-Reply-To:
References:
Message-ID:
Ben:
That's great news.
Are we in a position whereby most Swift runs are recorded in the
database?
Ian.
On Mar 2, 2009, at 5:08 AM, Ben Clifford wrote:
>
> The 3rd Provenance Challenge starts today. Mike and I entered the
> VDS VDC
> in the first provenance challenge; we did not participate in the
> second
> challenge. Luiz M. R. Gadelha Jr. and I intend to
> work
> on an entry based around the provenance database prototype that
> lives in
> the provenancedb/ directory of the SVN, with the goals of making
> that code
> more useable and useful and of making some presentation at the
> provenance
> challenge workshop in the summer.
>
> Information about the challenge is here:
>
> http://twiki.ipaw.info/bin/view/Challenge/ThirdProvenanceChallenge
>
> We've had some discussion in private correspondence but will
> continue our
> discussions on this list.
>
> --
>
From benc at hawaga.org.uk Mon Mar 2 08:33:14 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 2 Mar 2009 14:33:14 +0000 (GMT)
Subject: [Swift-devel] provenance challenge 3 participation
In-Reply-To:
References:
Message-ID:
On Mon, 2 Mar 2009, Ian Foster wrote:
> Are we in a position whereby most Swift runs are recorded in the
> database?
Lots of Swift runs are going into my log repository, though I don't really
know what proportion of every run in the whole universe.
"The database" doesn't exist - there is a database implementation in SVN
which you can deploy and import data to. However, it's still at the stage
where ongoing development changes the database schema often (*), which
usually requires a reimport of all the logs; or modifies the data that
Swift is producing so that old logs no longer provide all the needed data.
Hopefully it will converge on something fairly stable.
(*) not much in recent months, but that's because the provenance project has
been dead.
--
From benc at hawaga.org.uk Mon Mar 2 09:43:50 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 2 Mar 2009 15:43:50 +0000 (GMT)
Subject: [Swift-devel] Swift user guide
In-Reply-To: <4358C77B-E378-45BE-8C2D-116C411E4CE7@uchicago.edu>
References: <08A89F3C-222A-4B5F-B497-75435EE6CB0A@uchicago.edu>
<4358C77B-E378-45BE-8C2D-116C411E4CE7@uchicago.edu>
Message-ID:
On Mon, 2 Mar 2009, Ian Foster wrote:
> Ben:
>
> The example I saw in the code had the input array (e.g., "a", in "foreach v in
> a") constructed elsewhere in the program, with a[0] = v1, a[1] = v2. That
> seemed particularly challenging.
You can have single assignment with an explicit list of array contents
like this:
a = [v1, v2];
That works in the present implementation in some, but not all, cases - it
depends on whether the expressions v1 and v2 are primitive types like int
(in which case it does work, and has worked since before Swift was called
Swift) or if they are mapped to files (in which case it doesn't work in
the present implementation, but could be made to).
--
From wilde at mcs.anl.gov Mon Mar 2 17:26:49 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 02 Mar 2009 17:26:49 -0600
Subject: [Swift-devel] Re: Fault tolerance in "many task computing"?
In-Reply-To: <49AAC1DF.7090502@cs.uchicago.edu>
References: <21662876.222571235010770270.JavaMail.root@zimbra>
<499CC801.3090304@cs.uchicago.edu>
<25796E93-7DB0-44FF-8884-0201B4EC620F@anl.gov>
<499CEF6C.1020700@mcs.anl.gov> <49AAC1DF.7090502@cs.uchicago.edu>
Message-ID: <49AC6B39.8010407@mcs.anl.gov>
All,
Pete suggested we take a look at CIFTS's message logging system and
consider integrating it into our stack. Rinku gave me, Allan, and Zhao
an excellent overview and demo of the system. (Thanks, Rinku!)
Here are my notes from this meeting. My intent is just to start a
discussion for longer-term consideration, not any near-term action.
(Although Jing Tie may find some of these concepts fruitful for her
troubleshooting research).
CIFTS is the DOE SciDAC project "Coordinated and Improved Fault
Tolerance for High Performance Computing Systems", PI'd by Pete:
http://www.mcs.anl.gov/research/cifts/index.php
It produces "FTB", a backplane for distributing logging information
within a distributed system:
http://www.mcs.anl.gov/research/cifts/docs/files/ftb_developers_guide.pdf
I pointed Rinku to Swift and Falkon info, as well as Netlogger and
activities related to it in the CEDPS project, and we have a joint
action item to understand the possible overlap and integration issues
and possibilities between these two systems.
Netlogger and CEDPS info is at:
http://acs.lbl.gov/NetLoggerWiki/index.php/Main_Page
http://dev.globus.org/wiki/Incubator/NetLogger
http://www.cedps.net/index.php/Troubleshooting#Work-in-progress
I mentioned that we have invested a small bit of effort in integrating
Netlogger log publishing capabilities into Swift.
Potential overlap notwithstanding, CIFTS (and in particular the Fault
Tolerant Backplane, FTB), could serve as a very nice consolidation
service for log information originating in the many different components
involved in executing a Swift program:
- the application program wrapper script
- the Falkon or Coaster worker agent
- the Globus job manager and/or local scheduler
- the worker node
- the remote site fileserver/filesystem
- a site system management facility like BG/P's RAS service
- Falkon and Coaster servers and bootstrappers
- the swift client-side engine
- GridFTP and other transport protocols and services
- etc
FTB would enable us to readily capture and consolidate all these
information sources and funnel the data into streams related to specific
Swift program executions. It has the infrastructure to route messages
out of distributed systems, and to permit publication of and
subscription to message streams. Its agents, it seems, can help messages
traverse firewalls and deal with other transport and delivery issues.
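As a rough picture of the publish/subscribe pattern described above, here is a toy in-process sketch in Python. It is emphatically not the FTB API (which is a C API) and involves no networking: the Backplane class, its method names, the run identifiers and the sample messages are all hypothetical, chosen only to show events from several components being funnelled into a per-run stream.

```python
from collections import defaultdict


class Backplane:
    """Hypothetical in-process stand-in for a publish/subscribe backplane."""

    def __init__(self):
        self.subscribers = defaultdict(list)   # run id -> list of callbacks

    def subscribe(self, run_id, callback):
        self.subscribers[run_id].append(callback)

    def publish(self, run_id, source, message):
        # Deliver the event to everyone subscribed to this run's stream.
        for callback in self.subscribers[run_id]:
            callback(source, message)


backplane = Backplane()
events = []
backplane.subscribe("run-001", lambda source, message: events.append((source, message)))

backplane.publish("run-001", "wrapper-script", "job 17 exited with status 1")
backplane.publish("run-001", "worker-node", "scratch filesystem full")
backplane.publish("run-002", "worker-node", "unrelated run")   # no subscriber, dropped
```

After these calls, events holds only the two run-001 messages, tagged with the component that produced them. Routing across hosts and firewalls, which is the part FTB would actually contribute, is outside the scope of this sketch.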
FTB is implemented as a C API, and comes with a set of example clients.
From this a simple set of command line interfaces could be derived to
permit low-cost experimentation with the system in, e.g., Falkon on the
BG/P, where Rinku and others are implementing collectors to gather log
information from different parts of ZeptoOS and the BG/P hardware complex.
It's not clear that any of us has the cycles within the next two months
to explore this, but it would make an interesting student project, to
compare CIFTS and NetLogger, and to test some initial integrations into
Swift, Falkon, and Coasters. (I feel it's a good Summer of Code project).
My initial question is whether some CIFTS/FTB hooks could be planted in
a lightweight Swift experiment, and we could try to get a feel for
whether the infrastructure gives us something that we can't readily get
today. My gut feel is that it does.
I think it would be a great research/development topic to explore how
close this could bring us to the point where all distributed errors are
cleanly routed back to the centralized user to more quickly pinpoint the
cause of remote and distributed failures. Swift does a *pretty* good
job of this today, albeit in a somewhat ad-hoc fashion. FTB would make
it easier to integrate information from additional sources like the
remote scheduler and BG/P RAS logs into the debugging process.
And all that is before we even consider the goals of automating fault
tolerance, which I think is the ultimate vision of CIFTS.
Thoughts and discussion welcome. Once any of us get a day or so to play
with FTB, we'll know more about the possibilities.
Regards,
Mike
On 3/1/09 11:11 AM, Ioan Raicu wrote:
> Hi Rinku,
> It looks like I am not going to be able to make the meeting tomorrow. On
> Friday, another interview opportunity came up, and the only open slot
> for the next 2 weeks was this Monday. Sorry about the short notice. Go
> ahead and meet without me, and I'll catch up with what was discussed at
> the meeting from Mike.
>
> Thanks,
> Ioan
>
> Michael Wilde wrote:
>> Zhao, Allan, you're welcome to join this discussion, at Argonne Mar 2,
>> or by phone.
>>
>> - Mike
>>
>>
>> On 2/18/09 10:30 PM, Ian Foster wrote:
>>> Hi,
>>>
>>> This sounds like a really fun project. Maybe we should involve Zhao
>>> and Allen as well, given that Ioan has (sadly) graduated, and will
>>> leave us?
>>> I'd love to participate, I will need to do so by phone--could we do
>>> that? I'll just listen in, and see what I can learn.
>>>
>>> Ian.
>>>
>>>
>>> On Feb 18, 2009, at 8:46 PM, Ioan Raicu wrote:
>>>
>>>> Great!
>>>>
>>>> I added Ian as a cc, maybe he wants to come to this meeting as well.
>>>> Ian, the original message from Pete was:
>>>>> Ioan and Mike,
>>>>>
>>>>> The CIFTS project is a DOE project to provide a "fault tolerant
>>>>> backplane". I'm the PI of the project which involved ORNL, LBL,
>>>>> IU, Ohio State, and UTK. Below is a suggestion to hook CIFTS to
>>>>> Falkon, so faults could be monitored. Rinku (on the cc: line) is
>>>>> the lead developer for CIFTS. Maybe when one of you is on campus
>>>>> (ANL) you can meet with Rinku, and brainstorm if there is any way
>>>>> to link the two systems efficiently. Email below is from an ORNL
>>>>> participant in the CIFTS framework.
>>>>>
>>>>> -Pete
>>>> The meeting is scheduled with Rinku, Mike, Pete (?), and I for March
>>>> 2nd, at 11AM in Rinku's office (ANL, D-231 in the MCS building).
>>>>
>>>> Ioan
>>>>
>>>> Rinku Gupta wrote:
>>>>> We can meet at my office (D-231 in the MCS building) and then sneak
>>>>> into Pete's room, if it is empty.
>>>>>
>>>>> Rinku
>>>>>
>>>>>
>>>>>
>>>>> ----- "Ioan Raicu" wrote:
>>>>>
>>>>>
>>>>>> Works for me! I assume we are meeting at ANL. Whose office are we
>>>>>> meeting in?
>>>>>>
>>>>>> Ioan
>>>>>>
>>>>>> Rinku Gupta wrote:
>>>>>>
>>>>>> Based on everyones availability, how does 11:00am on March 2nd sound?
>>>>>>
>>>>>> Thanks
>>>>>> Rinku
>>>>>>
>>>>>>
>>>>>> ----- "Michael Wilde" wrote:
>>>>>>
>>>>>> Rinku, Ioan,
>>>>>>
>>>>>> I can do Thu Feb 26 10:30 (I assume you meant AM not PM).
>>>>>>
>>>>>> But if Rinku is just arriving back in the US that morning, it seems
>>>>>> better to postpone to the week after.
>>>>>>
>>>>>> I can be at Argonne any time week of March 2. Mornings are free,
>>>>>> Mon-Thu are best.
>>>>>>
>>>>>> Can we tentatively then meet at 11AM Mon Mar 2?
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Mike
>>>>>>
>>>>>>
>>>>>> On 2/18/09 9:37 AM, Ioan Raicu wrote:
>>>>>>
>>>>>> Hi Rinku,
>>>>>> Next Thursday (February 26th) at 10:30AM would work for me. If we need
>>>>>> to meet the following week, I could meet Monday (March 2nd) and Thursday
>>>>>> (March 5th) any time.
>>>>>>
>>>>>> Cheers,
>>>>>> Ioan
>>>>>>
>>>>>> Rinku Gupta wrote:
>>>>>>
>>>>>> Hi Michael, Ioan
>>>>>>
>>>>>> I am currently on travel and will arrive back to the USA only Thursday
>>>>>> (Feb 26th) early morning. Will you be available anytime the
>>>>>> week after next? If not, then we can try to schedule a meeting
>>>>>> sometime around 10:30/11pm next Thursday at ANL.
>>>>>>
>>>>>> Thanks
>>>>>> Rinku
>>>>>>
>>>>>>
>>>>>> ----- "Ioan Raicu" wrote:
>>>>>>
>>>>>> Hi Rinku,
>>>>>> I can meet next week on Wednesday any time, and Thursday morning
>>>>>> before noon, as I have a flight to catch early afternoon from O'Hare.
>>>>>> I can meet either at UC or ANL. Let me know what works best for everyone.
>>>>>>
>>>>>> Thanks,
>>>>>> Ioan
>>>>>>
>>>>>> Michael Wilde wrote:
>>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> Rinku, let's set up a meeting for next week to discuss. I can meet Wed
>>>>>> or Thu, at Argonne or UChicago.
>>>>>>
>>>>>> Do either of those dates work for you, and which place is best?
>>>>>>
>>>>>> In the meantime I'll read up on CIFTS at
>>>>>> http://www.mcs.anl.gov/research/cifts/docs/index.php and the wiki that
>>>>>> this refers to.
>>>>>>
>>>>>> If you have any other docs we should read, please send them.
>>>>>>
>>>>>> Thanks and regards,
>>>>>>
>>>>>> Mike
>>>>>>
>>>>>>
>>>>>> On 2/17/09 12:42 PM, Pete Beckman wrote:
>>>>>>
>>>>>> Ioan and Mike,
>>>>>>
>>>>>> The CIFTS project is a DOE project to provide a "fault tolerant
>>>>>> backplane". I'm the PI of the project which involved ORNL, LBL,
>>>>>> IU, Ohio State, and UTK. Below is a suggestion to hook CIFTS to
>>>>>> Falkon, so faults could be monitored. Rinku (on the cc: line) is
>>>>>> the lead developer for CIFTS. Maybe when one of you is on campus
>>>>>> (ANL) you can meet with Rinku, and brainstorm if there is any way
>>>>>> to link the two systems efficiently. Email below is from an ORNL
>>>>>> participant in the CIFTS framework.
>>>>>>
>>>>>> -Pete
>>>>>>
>>>>>>
>>>>>> Begin forwarded message:
>>>>>>
>>>>>> From: bernholdtde at ornl.gov
>>>>>> Date: February 12, 2009 10:29:47 AM CST
>>>>>> To: cifts at googlegroups.com
>>>>>> Cc: bernholdtde at ornl.gov
>>>>>> Subject: Fault tolerance in "many task computing"?
>>>>>> Reply-To: cifts at googlegroups.com
>>>>>>
>>>>>> Pete (and other ANL folks),
>>>>>>
>>>>>> I recently read the SC08 paper on many task computing on which you're
>>>>>> a co-author. (http://portal.acm.org/citation.cfm?doid=1413370.1413393)
>>>>>> I wonder if it would be viable to build a CIFTS demonstration scenario
>>>>>> around the software system described in this paper?
>>>>>>
>>>>>> In the paper, there's a paragraph discussing reliability that
>>>>>> discusses some of the issues at a high level. It strikes me as both
>>>>>> interesting and challenging because you have both system components
>>>>>> (i.e. Cobalt) and multiple user-space components (Falkon, Swift,
>>>>>> application tasks) interacting.
>>>>>>
>>>>>> It might also be worth looking at this environment to help understand
>>>>>> the use cases and requirements for the policy/control channels (as
>>>>>> opposed to the FTB's informational channel).
>>>>>>
>>>>>> Just some ideas, db
>>>>>> --
>>>>>> David E. Bernholdt | Email: bernholdtde at ornl.gov
>>>>>> Oak Ridge National Laboratory | Phone: +1 (865) 574 3147
>>>>>> http://www.csm.ornl.gov/~bernhold/ | Fax: +1 (865) 576 5491
>>>>>> --~--~---------~--~----~------------~-------~--~----~
>>>>>> You received this message because you are subscribed to the Google
>>>>>> Groups "CIFTS" group.
>>>>>> To post to this group, send email to cifts at googlegroups.com
>>>>>> To unsubscribe from this group, send email to
>>>>>> cifts+unsubscribe at googlegroups.com
>>>>>> For more options, visit this group at
>>>>>> http://groups.google.com/group/cifts?hl=en
>>>>>> -~----------~----~----~----~------~----~------~--~---
>>>>>> --
>>>>>> ===================================================
>>>>>> Ioan Raicu, Ph.D.
>>>>>> ===================================================
>>>>>> Distributed Systems Laboratory
>>>>>> Computer Science Department
>>>>>> University of Chicago
>>>>>> 1100 E. 58th Street, Ryerson Hall
>>>>>> Chicago, IL 60637
>>>>>> ===================================================
>>>>>> Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu
>>>>>> http://dev.globus.org/wiki/Incubator/Falkon
>>>>>> http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
>>>>>> ===================================================
>>>>>> ===================================================
From hategan at mcs.anl.gov Mon Mar 2 17:50:42 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 02 Mar 2009 17:50:42 -0600
Subject: [Swift-devel] Re: Fault tolerance in "many task computing"?
In-Reply-To: <49AC6B39.8010407@mcs.anl.gov>
References: <21662876.222571235010770270.JavaMail.root@zimbra>
<499CC801.3090304@cs.uchicago.edu>
<25796E93-7DB0-44FF-8884-0201B4EC620F@anl.gov>
<499CEF6C.1020700@mcs.anl.gov> <49AAC1DF.7090502@cs.uchicago.edu>
<49AC6B39.8010407@mcs.anl.gov>
Message-ID: <1236037842.2575.7.camel@localhost>
Is there a Java library for FTB?
What does FTB bring new to the table compared to a distributed messaging
system?
Mihael
On Mon, 2009-03-02 at 17:26 -0600, Michael Wilde wrote:
> All,
>
> Pete suggested we take a look at CIFTS's message logging system and
> consider integrating it into our stack. Rinku gave me, Allan, and Zhao
> an excellent overview and demo of the system. (Thanks, Rinku!)
>
> Here are my notes from this meeting. My intent is just to start a
> discussion for longer-term consideration, not any near-term action.
> (Although Jing Tie may find some of these concepts fruitful for her
> troubleshooting research).
>
> CIFTS is the DOE SciDAC project "Coordinated and Improved Fault
> Tolerance for High Performance Computing Systems", PI'd by Pete:
> http://www.mcs.anl.gov/research/cifts/index.php
>
> It produces "FTB", a backplane for distributing logging information
> within a distributed system:
>
> http://www.mcs.anl.gov/research/cifts/docs/files/ftb_developers_guide.pdf
>
> I pointed Rinku to Swift and Falkon info, as well as Netlogger and
> activities related to it in the CEDPS project, and we have a joint
> action item to understand the possible overlap and integration issues
> and possibilities between these two systems.
>
> Netlogger and CEDPS info is at:
>
> http://acs.lbl.gov/NetLoggerWiki/index.php/Main_Page
> http://dev.globus.org/wiki/Incubator/NetLogger
> http://www.cedps.net/index.php/Troubleshooting#Work-in-progress
>
> I mentioned that we have invested a small bit of effort in integrating
> Netlogger log publishing capabilities into Swift.
>
> Potential overlap notwithstanding, CIFTS (and in particular the Fault
> Tolerant Backplane, FTB), could serve as a very nice consolidation
> service for log information originating in the many different components
> involved in executing a Swift program:
>
> - the application program wrapper script
> - the Falkon or Coaster worker agent
> - the Globus job manager and/or local scheduler
> - the worker node
> - the remote site fileserver/filesystem
> - a site system management facility like BG/P's RAS service
> - Falkon and Coaster servers and bootstrappers
> - the swift client-side engine
> - GridFTP and other transport protocols and services
> - etc
>
> FTB would enable us to readily capture and consolidate all these
> information sources and funnel the data into streams related to specific
> Swift program executions. It has the infrastructure to route messages
> out of distributed systems, and to permit publication of and
> subscription to message streams. Its agents, it seems, can help messages
> traverse firewalls and deal with other transport and delivery issues.
>
> FTB is implemented as a C API, and comes with a set of example clients.
> From this a simple set of command line interfaces could be derived to
> permit low-cost experimentation with the system in, eg, Falkon on the
> BG/P, where Rinku and others are implementing collectors to gather log
> information from different parts of ZeptoOS and the BG/P hardware complex.
>
> It's not clear that any of us have the cycles within the next two months
> to explore this, but it would make an interesting student project, to
> compare CIFTS and NetLogger, and to test some initial integrations into
> Swift, Falkon, and Coasters. (I feel it's a good Summer of Code project).
>
> My initial question is whether some CIFTS/FTB hooks could be planted in
> a lightweight Swift experiment, and we could try to get a feel for
> whether the infrastructure gives us something that we can't readily get
> today. My gut feel is that it does.
>
> I think it would be a great research/development topic to explore how
> close this could bring us to the point where all distributed errors are
> cleanly routed back to the centralized user to more quickly pinpoint the
> cause of remote and distributed failures. Swift does a *pretty* good
> job of this today, albeit in a somewhat ad-hoc fashion. FTB would make
> it easier to integrate information from additional sources like the
> remote scheduler and BG/P RAS logs into the debugging process.
>
> And all that is before we even consider the goals of automating fault
> tolerance, which I think is the ultimate vision of CIFTS.
>
> Thoughts and discussion welcome. Once any of us get a day or so to play
> with FTB, we'll know more about the possibilities.
>
> Regards,
>
> Mike
>
>
> On 3/1/09 11:11 AM, Ioan Raicu wrote:
> > Hi Rinku,
> > It looks like I am not going to be able to make the meeting tomorrow. On
> > Friday, another interview opportunity came up, and the only open slot
> > for the next 2 weeks was this Monday. Sorry about the short notice. Go
> > ahead and meet without me, and I'll catch up with what was discussed at
> > the meeting from Mike.
> >
> > Thanks,
> > Ioan
> >
> > Michael Wilde wrote:
> >> Zhao, Allan, you're welcome to join this discussion, at Argonne Mar 2,
> >> or by phone.
> >>
> >> - Mike
> >>
> >>
> >> On 2/18/09 10:30 PM, Ian Foster wrote:
> >>> Hi,
> >>>
> >>> This sounds like a really fun project. Maybe we should involve Zhao
> >>> and Allan as well, given that Ioan has (sadly) graduated, and will
> >>> leave us?
> >>> I'd love to participate, I will need to do so by phone--could we do
> >>> that? I'll just listen in, and see what I can learn.
> >>>
> >>> Ian.
> >>>
> >>>
> >>> On Feb 18, 2009, at 8:46 PM, Ioan Raicu wrote:
> >>>
> >>>> Great!
> >>>>
> >>>> I added Ian as a cc, maybe he wants to come to this meeting as well.
> >>>> Ian, the original message from Pete was:
> >>>>> Ioan and Mike,
> >>>>>
> >>>>> The CIFTS project is a DOE project to provide a "fault tolerant
> >>>>> backplane". I'm the PI of the project which involved ORNL, LBL,
> >>>>> IU, Ohio State, and UTK. Below is a suggestion to hook CIFTS to
> >>>>> Falkon, so faults could be monitored. Rinku (on the cc: line) is
> >>>>> the lead developer for CIFTS. Maybe when one of you is on campus
> >>>>> (ANL) you can meet with Rinku, and brainstorm if there is any way
> >>>>> to link the two systems efficiently. Email below is from an ORNL
> >>>>> participant in the CIFTS framework.
> >>>>>
> >>>>> -Pete
> >>>> The meeting is scheduled with Rinku, Mike, Pete (?), and me for March
> >>>> 2nd, at 11AM in Rinku's office (ANL, D-231 in the MCS building).
> >>>>
> >>>> Ioan
> >>>>
> >>>> Rinku Gupta wrote:
> >>>>> We can meet at my office (D-231 in the MCS building) and then sneak
> >>>>> into Pete's room, if it is empty.
> >>>>>
> >>>>> Rinku
> >>>>>
> >>>>>
> >>>>>
> >>>>> ----- "Ioan Raicu" wrote:
> >>>>>
> >>>>>
> >>>>>> Works for me! I assume we are meeting at ANL. Whose office are we
> >>>>>> meeting in?
> >>>>>>
> >>>>>> Ioan
> >>>>>>
> >>>>>> Rinku Gupta wrote:
> >>>>>>
> >>>>>> Based on everyone's availability, how does 11:00am on March 2nd sound?
> >>>>>>
> >>>>>> Thanks
> >>>>>> Rinku
> >>>>>>
> >>>>>>
> >>>>>> ----- "Michael Wilde" wrote:
> >>>>>>
> >>>>>> Rinku, Ioan,
> >>>>>>
> >>>>>> I can do Thu Feb 26 10:30 (I assume you meant AM not PM).
> >>>>>>
> >>>>>> But if Rinku is just arriving back in the US that morning, it seems
> >>>>>> better to postpone to the week after.
> >>>>>>
> >>>>>> I can be at Argonne any time week of March 2. Mornings are free,
> >>>>>> Mon-Thu are best.
> >>>>>>
> >>>>>> Can we tentatively then meet at 11AM Mon Mar 2?
> >>>>>>
> >>>>>> Regards,
> >>>>>>
> >>>>>> Mike
> >>>>>>
> >>>>>>
> >>>>>> On 2/18/09 9:37 AM, Ioan Raicu wrote:
> >>>>>>
> >>>>>> Hi Rinku,
> >>>>>> Next Thursday (February 26th) at 10:30AM would work for me. If we need
> >>>>>> to meet the following week, I could meet Monday (March 2nd) and Thursday
> >>>>>> (March 5th) any time.
> >>>>>>
> >>>>>> Cheers,
> >>>>>> Ioan
> >>>>>>
> >>>>>> Rinku Gupta wrote:
> >>>>>>
> >>>>>> Hi Michael, Ioan
> >>>>>>
> >>>>>> I am currently on travel and will arrive back to the USA only Thursday
> >>>>>> (Feb 26th) early morning. Will you be available anytime the
> >>>>>> week after next? If not, then we can try to schedule a meeting
> >>>>>> sometime around 10:30/11pm next Thursday at ANL.
> >>>>>>
> >>>>>> Thanks
> >>>>>> Rinku
> >>>>>>
> >>>>>>
> >>>>>> ----- "Ioan Raicu" wrote:
> >>>>>>
> >>>>>> Hi Rinku,
> >>>>>> I can meet next week on Wednesday any time, and Thursday morning before
> >>>>>> noon, as I have a flight to catch early afternoon from O'Hare. I can
> >>>>>> meet either at UC or ANL. Let me know what works best for everyone.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Ioan
> >>>>>>
> >>>>>> Michael Wilde wrote:
> >>>>>>
> >>>>>> Hi All,
> >>>>>>
> >>>>>> Rinku, let's set up a meeting for next week to discuss. I can meet Wed
> >>>>>> or Thu, at Argonne or UChicago.
> >>>>>>
> >>>>>> Do either of those dates work for you, and which place is best?
> >>>>>>
> >>>>>> In the meantime I'll read up on CIFTS at
> >>>>>> http://www.mcs.anl.gov/research/cifts/docs/index.php and the wiki that
> >>>>>> this refers to.
> >>>>>>
> >>>>>> If you have any other docs we should read, please send them.
> >>>>>>
> >>>>>> Thanks and regards,
> >>>>>>
> >>>>>> Mike
> >>>>>>
> >>>>>>
> >>>>>> On 2/17/09 12:42 PM, Pete Beckman wrote:
> >>>>>>
> >>>>>> Ioan and Mike,
> >>>>>>
> >>>>>> The CIFTS project is a DOE project to provide a "fault tolerant
> >>>>>> backplane". I'm the PI of the project which involved ORNL, LBL, IU,
> >>>>>> Ohio State, and UTK. Below is a suggestion to hook CIFTS to Falkon,
> >>>>>> so faults could be monitored. Rinku (on the cc: line) is the lead
> >>>>>> developer for CIFTS. Maybe when one of you is on campus (ANL) you
> >>>>>> can meet with Rinku, and brainstorm if there is any way to link the
> >>>>>> two systems efficiently. Email below is from an ORNL participant in
> >>>>>> the CIFTS framework.
> >>>>>>
> >>>>>> -Pete
> >>>>>>
> >>>>>>
> >>>>>> Begin forwarded message:
> >>>>>>
> >>>>>> From: bernholdtde at ornl.gov
> >>>>>> Date: February 12, 2009 10:29:47 AM CST
> >>>>>> To: cifts at googlegroups.com
> >>>>>> Cc: bernholdtde at ornl.gov
> >>>>>> Subject: Fault tolerance in "many task computing"?
> >>>>>> Reply-To: cifts at googlegroups.com
> >>>>>>
> >>>>>> Pete (and other ANL folks),
> >>>>>>
> >>>>>> I recently read the SC08 paper on many task computing on which you're
> >>>>>> a co-author. ( http://portal.acm.org/citation.cfm?doid=1413370.1413393 )
> >>>>>>
> >>>>>> I wonder if it would be viable to build a CIFTS demonstration scenario
> >>>>>> around the software system described in this paper?
> >>>>>>
> >>>>>> In the paper, there's a paragraph discussing reliability that
> >>>>>> discusses some of the issues at a high level. It strikes me as both
> >>>>>> interesting and challenging because you have both system components
> >>>>>> (i.e. Cobalt) and multiple user-space components (Falkon, Swift,
> >>>>>> application tasks) interacting.
> >>>>>>
> >>>>>> It might also be worth looking at this environment to help understand
> >>>>>> the use cases and requirements for the policy/control channels (as
> >>>>>> opposed to the FTB's informational channel).
> >>>>>>
> >>>>>> Just some ideas, db
> >>>>>> --
> >>>>>> David E. Bernholdt | Email: bernholdtde at ornl.gov
> >>>>>> Oak Ridge National Laboratory | Phone: +1 (865) 574 3147
> >>>>>> http://www.csm.ornl.gov/~bernhold/ | Fax: +1 (865) 576 5491
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
From wilde at mcs.anl.gov Mon Mar 2 18:10:11 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 02 Mar 2009 18:10:11 -0600
Subject: [Swift-devel] Re: Fault tolerance in "many task computing"?
In-Reply-To: <1236037842.2575.7.camel@localhost>
References: <21662876.222571235010770270.JavaMail.root@zimbra>
<499CC801.3090304@cs.uchicago.edu>
<25796E93-7DB0-44FF-8884-0201B4EC620F@anl.gov>
<499CEF6C.1020700@mcs.anl.gov> <49AAC1DF.7090502@cs.uchicago.edu>
<49AC6B39.8010407@mcs.anl.gov> <1236037842.2575.7.camel@localhost>
Message-ID: <49AC7563.2070209@mcs.anl.gov>
On 3/2/09 5:50 PM, Mihael Hategan wrote:
> Is there a Java library for FTB?
No, my understanding is that it's only C at the moment.
>
> What does FTB bring new to the table compared to a distributed messaging
> system?
Pete and Rinku (and a bit of reading) can certainly make a better case,
but this is my general impression:
To me, it seems simple, lightweight, and well-structured for pub-sub of
messages that pertain to system/application operation. I think it
defines a nice model of endpoints, priorities, message codes, etc. while
leaving a payload for the user to send message-specific details.
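As a rough illustration of that model (not the FTB C API, which this sketch
does not attempt to reproduce), here is a toy in-process pub-sub backplane in
Python. The class names, the four severity levels, and the prefix-based
source filter are all assumptions made up for the example.

```python
from dataclasses import dataclass

# Hypothetical severity ladder; a real system may define different levels.
SEVERITIES = ["info", "warning", "error", "fatal"]


@dataclass
class Event:
    source: str        # publishing endpoint, e.g. "falkon.worker"
    code: str          # short message code
    severity: str      # one of SEVERITIES
    payload: str = ""  # free-form, message-specific details


class Backplane:
    """Toy stand-in for a pub-sub service; everything stays in one process."""

    def __init__(self):
        # Each entry is (minimum severity index, source prefix, callback).
        self._subs = []

    def subscribe(self, callback, min_severity="info", source_prefix=""):
        self._subs.append(
            (SEVERITIES.index(min_severity), source_prefix, callback)
        )

    def publish(self, event):
        """Deliver the event to matching subscribers; return how many got it."""
        level = SEVERITIES.index(event.severity)
        delivered = 0
        for min_level, prefix, callback in self._subs:
            if level >= min_level and event.source.startswith(prefix):
                callback(event)
                delivered += 1
        return delivered


# Usage: collect only errors (or worse) coming from Falkon components.
seen = []
bp = Backplane()
bp.subscribe(seen.append, min_severity="error", source_prefix="falkon.")
bp.publish(Event("falkon.worker", "TASK_FAILED", "error", "exit code 137"))
bp.publish(Event("swift.client", "RUN_STARTED", "info"))
```

After the two publish calls, `seen` holds only the Falkon error event; the
Swift info event is filtered out by both severity and source.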
Its agents implement a spanning tree to route messages from distributed
components, so the user doesn't need to worry about this. I think it has
some redundancy in this delivery model.
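To make the spanning-tree idea concrete, here is a small Python sketch that
builds a breadth-first spanning tree over a set of agent links and follows it
from a worker back to the root. This is a generic BFS tree, not necessarily
the algorithm FTB's agents use, and the agent names and topology are invented
for the example.

```python
from collections import deque


def spanning_tree(links, root):
    """Return {agent: parent} for a BFS spanning tree over undirected links."""
    neighbours = {}
    for a, b in links:
        neighbours.setdefault(a, []).append(b)
        neighbours.setdefault(b, []).append(a)
    parent = {root: None}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for nxt in neighbours.get(node, []):
            if nxt not in parent:  # first visit fixes the tree edge
                parent[nxt] = node
                queue.append(nxt)
    return parent


def route_to_root(parent, agent):
    """List the hops a message takes from an agent up to the tree root."""
    path = [agent]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path


# Hypothetical topology; worker-b has a redundant link to io-node-2.
links = [
    ("login", "io-node-1"),
    ("login", "io-node-2"),
    ("io-node-1", "worker-a"),
    ("io-node-1", "worker-b"),
    ("io-node-2", "worker-b"),
]
tree = spanning_tree(links, "login")
print(route_to_root(tree, "worker-b"))  # ['worker-b', 'io-node-1', 'login']
```

The extra worker-b/io-node-2 link is not part of the tree, but it is the kind
of spare edge a rebuilt tree could fall back on if io-node-1 went away.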
It seems to be designed to be lightweight to handle high traffic (eg
from errant system components).
Just seems well-tailored to the log message routing job.
- Mike
>
> Mihael
>
> On Mon, 2009-03-02 at 17:26 -0600, Michael Wilde wrote:
>> All,
>>
>> Pete suggested we take a look at CIFTS's message logging system and
>> consider integrating it into our stack. Rinku gave me, Allan, and Zhao
>> and excellent overview and demo of the system. (Thanks, Rinku!)
>>
>> Here's my notes from this meeting. My intent is just to start a
>> discussion for longer-term consideration, not any near-term action.
>> (Although Jing Tie may find some of these concepts fruitful for er
>> troubleshooting research).
>>
>> CIFTS is the DOE SciDAC project "Coordinated and Improved Fault
>> Tolerance for High Performance Computing Systems", PI'd by Pete:
>> http://www.mcs.anl.gov/research/cifts/index.php
>>
>> It produces "FTB", a backplane for distributing logging information
>> within a distributed system:
>>
>> http://www.mcs.anl.gov/research/cifts/docs/files/ftb_developers_guide.pdf
>>
>> I pointed Rinku to Swift and Falkon info, as well as Netlogger and
>> activities related to it in the CEDPS project, and we have a joint
>> action item to understand the possible overlap and integration issues
>> and possibilities between these two systems.
>>
>> Netlogger and CEDPS info is at:
>>
>> http://acs.lbl.gov/NetLoggerWiki/index.php/Main_Page
>> http://dev.globus.org/wiki/Incubator/NetLogger
>> http://www.cedps.net/index.php/Troubleshooting#Work-in-progress
>>
>> I mentioned that we have invested a small bit of effort in integrating
>> Netlogger log publishing capabilities into Swift.
>>
>> Potential overlap notwithstanding, CIFTS (and in particular the Fault
>> Tolerant Backplane, FTB), could serve as a very nice consolidation
>> service for log information originating in the many different components
>> involved in executing a Swift program:
>>
>> - the application program wrapper script
>> - the Falkon or Coaster worker agent
>> - the Globus job manager and/or local scheduler
>> - the worker node
>> - the remote site fileserver/filesystem
>> - a site system management facility like BG/P's RAS service
>> - Falkon and Coaster servers and bootstrappers
>> - the swift client-side engine
>> - GrifFTP and other transport protocols and services
>> - etc
>>
>> FTB would enable us to readily capture and consolidate all these
>> information sources and funnel the data into streams related to specific
>> Swift program executions. It has the infrastructure to route messages
>> out of distributed systems, and to permit publication of and
>> subscription to message streams. Its agents, it seems, can help messages
>> traverse firewalls and deal with other transport and delivery issues.
>>
>> FTB is implemented as a C API, and comes with a set of example clients.
>> From this a simple set of command line interfaces could be derived to
>> permit low-cost experimentation with the system in, eg, Falkon on the
>> BG/P, where Rinku and others are implementing collectors to gather log
>> information from different parts of ZeptoOS and the BG/P hardware complex.
>>
>> Its not clear that any of us have the cycles within the next two months
>> to explore this, but it would make an interesting student project, to
>> compare CIFTS and NetLogger, and to test some initial integrations into
>> Swift, Falkon, and Coasters. (I feel its a good Summer of Code project).
>>
>> My initial question is whether some CIFTS/FTB hooks could be planted in
>> a lightweight Swift experiment, and we could try to get a feel for
>> whether the infrastructure gives us something that we cant readily get
>> today. My gut feel is that is does.
>>
>> I think it would be a great research/development topic to explore how
>> close this could bring us to the point where all distributed errors are
>> cleanly routed back to the centralized user to more quickly pinpoint the
>> cause of remote and distributed failures. Swift does a *pretty* good
>> job of this today, albeit in a somewhat ad-hoc fashion. FTB would make
>> it easier to integrate information from additional sources like the
>> remote scheduler and BGP RAS logs into the debugging process.
>>
>> And all that is before we even consider the goals of automating fault
>> tolerance, which I think is the ultimate vision of CIFTS.
>>
>> Thoughts and discussion welcome. Once any of us get a day or so to play
>> with FTB, we'll know more about the possibilities.
>>
>> Regards,
>>
>> Mike
>>
>>
>> On 3/1/09 11:11 AM, Ioan Raicu wrote:
>>> Hi Rinku,
>>> It looks like I am not going to be able to make the meeting tomorrow. On
>>> Friday, another interview opportunity came up, and the only open slot
>>> for the next 2 weeks was this Monday. Sorry about the short notice. Go
>>> ahead and meet without me, and I'll catch up with what was discussed at
>>> the meeting from Mike.
>>>
>>> Thanks,
>>> Ioan
>>>
>>> Michael Wilde wrote:
>>>> Zhao, Allan, you're welcome to join this discussion, at Argonne Mar 2,
>>>> or by phone.
>>>>
>>>> - Mike
>>>>
>>>>
>>>> On 2/18/09 10:30 PM, Ian Foster wrote:
>>>>> Hi,
>>>>>
>>>>> This sounds like a really fun project. Maybe we should involve Zhao
>>>>> and Allen as well, given that Ioan has (sadly) graduated, and will
>>>>> leave us?
>>>>> I'd love to participate, I will need to do so by phone--could we do
>>>>> that? I'll just listen in, and see what I can learn.
>>>>>
>>>>> Ian.
>>>>>
>>>>>
>>>>> On Feb 18, 2009, at 8:46 PM, Ioan Raicu wrote:
>>>>>
>>>>>> Great!
>>>>>>
>>>>>> I added Ian as a cc, maybe he wants to come to this meeting as well.
>>>>>> Ian, the original message from Pete was:
>>>>>>> Ioan and Mike,
>>>>>>>
>>>>>>> The CIFTS project is a DOE project to provide a "fault tolerant
>>>>>>> backplane". I'm the PI of the project which involved ORNL, LBL,
>>>>>>> IU, Ohio State, and UTK. Below is a suggestion to hook CIFTS to
>>>>>>> Falkon, so faults could be monitored. Rinku (on the cc: line) is
>>>>>>> the lead developer for CIFTS. Maybe when one of you is on campus
>>>>>>> (ANL) you can meet with Rinku, and brainstorm if there is any way
>>>>>>> to link the two systems efficiently. Email below is from an ORNL
>>>>>>> participant in the CIFTS framework.
>>>>>>>
>>>>>>> -Pete
>>>>>> The meeting is scheduled with Rinku, Mike, Pete (?), and I for March
>>>>>> 2nd, at 11AM in Rinku's office (ANL, D-231 in the MCS building).
>>>>>>
>>>>>> Ioan
>>>>>>
>>>>>> Rinku Gupta wrote:
>>>>>>> We can meet at my office (D-231 in the MCS building) and then sneak
>>>>>>> into Pete's room, if it is empty.
>>>>>>>
>>>>>>> Rinku
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> ----- "Ioan Raicu" wrote:
>>>>>>>
>>>>>>>
>>>>>>>> Works for me! I assume we are meeting at ANL. Whose office are we
>>>>>>>> meeting in?
>>>>>>>>
>>>>>>>> Ioan
>>>>>>>>
>>>>>>>> Rinku Gupta wrote:
>>>>>>>>
>>>>>>>> Based on everyones availability, how does 11:00am on March 2nd sound?
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Rinku
>>>>>>>>
>>>>>>>>
>>>>>>>> ----- "Michael Wilde" wrote:
>>>>>>>>
>>>>>>>> Rinku, Ioan,
>>>>>>>>
>>>>>>>> I can do Thu Feb 26 10:30 (I assume you meant AM not PM).
>>>>>>>>
>>>>>>>> But if Rinku is just arriving back in the US that morning, it seems
>>>>>>>> better to postpone to the week after.
>>>>>>>>
>>>>>>>> I can be at Argonne any time week of March 2. Mornings are free,
>>>>>>>> Mon-Thu
>>>>>>>> are best.
>>>>>>>>
>>>>>>>> Can we tentatively then meet at 11AM Mon Mar 2?
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>>
>>>>>>>> Mike
>>>>>>>>
>>>>>>>>
>>>>>>>> On 2/18/09 9:37 AM, Ioan Raicu wrote:
>>>>>>>>
>>>>>>>> Hi Rinku,
>>>>>>>> Next Thursday (February 26th) at 10:30AM would work for me. If we
>>>>>>>> need
>>>>>>>>
>>>>>>>> to meet the following week, I could meet Monday (March 2nd) and
>>>>>>>> Thursday
>>>>>>>>
>>>>>>>> (March 5th) any time.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Ioan
>>>>>>>>
>>>>>>>> Rinku Gupta wrote:
>>>>>>>>
>>>>>>>> Hi Michael, Ioan
>>>>>>>>
>>>>>>>> I am currently on travel and will arrive back to the USA only
>>>>>>>> Thursday
>>>>>>>> (Feb 26th) early morning. Will you be available anytime the
>>>>>>>> week after next? If not, then we can try to schedule a meeting
>>>>>>>> sometime around 10:30/11pm next Thursday at ANL.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Rinku
>>>>>>>>
>>>>>>>>
>>>>>>>> ----- "Ioan Raicu" wrote:
>>>>>>>>
>>>>>>>> Hi Rinku,
>>>>>>>> I can meet next week on Wednesday any time, and Thursday morning
>>>>>>>> before
>>>>>>>> noon, as I have a flight to catch early afternoon from O'Hare. I can
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> meet either at UC or ANL. Let me know what works best for everyone.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Ioan
>>>>>>>>
>>>>>>>> Michael Wilde wrote:
>>>>>>>>
>>>>>>>> Hi All,
>>>>>>>>
>>>>>>>> Rinku, lets set up a meeting for next week to discuss. I can meet
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Wed
>>>>>>>>
>>>>>>>> of Thu, at Argonne or UChicago.
>>>>>>>>
>>>>>>>> Do either of those dates work for you, and which place is best?
>>>>>>>>
>>>>>>>> In the meantime I'll read up on CIFTS at
>>>>>>>> http://www.mcs.anl.gov/research/cifts/docs/index.php and the wiki
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> that
>>>>>>>>
>>>>>>>> this refers to.
>>>>>>>>
>>>>>>>> If you have any other docs we should read, please send them.
>>>>>>>>
>>>>>>>> Thanks and regards,
>>>>>>>>
>>>>>>>> Mike
>>>>>>>>
>>>>>>>>
>>>>>>>> On 2/17/09 12:42 PM, Pete Beckman wrote:
>>>>>>>>
>>>>>>>> Ioan and Mike,
>>>>>>>>
>>>>>>>> The CIFTS project is a DOE project to provide a "fault tolerant
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> backplane". I'm the PI of the project which involved ORNL, LBL,
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> IU,
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Ohio State, and UTK. Below is a suggestion to hook CIFTS to Falkon,
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> so faults could be monitored. Rinku (on the cc: line) is the lead
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> developer for CIFTS. Maybe when one of you is on campus (ANL) you
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> can meet with Rinku, and brainstorm if there is any way to link the
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> two systems efficiently. Email below is from an ORNL participant
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> in
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> the CIFTS framework.
>>>>>>>>
>>>>>>>> -Pete
>>>>>>>>
>>>>>>>>
>>>>>>>> Begin forwarded message:
>>>>>>>>
>>>>>>>> From: bernholdtde at ornl.gov Date: February 12, 2009 10:29:47 AM CST
>>>>>>>> To: cifts at googlegroups.com Cc: bernholdtde at ornl.gov Subject: Fault
>>>>>>>> tolerance in "many task computing"?
>>>>>>>> Reply-To: cifts at googlegroups.com Pete (and other ANL folks),
>>>>>>>>
>>>>>>>> I recently read the SC08 paper on many task computing on which you're
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> a co-author. (
>>>>>>>> http://portal.acm.org/citation.cfm?doid=1413370.1413393
>>>>>>>> )
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> I wonder if it would be viable to build a CIFTS demonstration
>>>>>>>> scenario
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> around the software system described in this paper?
>>>>>>>>
>>>>>>>> In the paper, there's a paragraph discussing reliability that
>>>>>>>> discusses some of the issues at a high level. It strikes me as
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> both
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> interesting and challenging because you have both system components
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> (i.e. Cobalt) and multiple user-space components (Falken, Swift,
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> application tasks) interacting.
>>>>>>>>
>>>>>>>> It might also be worth looking at this environment to help understand
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> the use cases and requirements for the policy/control channels (as
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> opposed to the FTB's informational channel).
>>>>>>>>
>>>>>>>> Just some ideas, db
>>>>>>>> --
>>>>>>>> David E. Bernholdt | Email: bernholdtde at ornl.gov
>>>>>>>> Oak Ridge National Laboratory | Phone: +1 (865) 574 3147
>>>>>>>> http://www.csm.ornl.gov/~bernhold/ | Fax: +1 (865) 576 5491
>>>>>>>>
>>>>>>>> --~--~---------~--~----~------------~-------~--~----~
>>>>>>>> You received this message because you are subscribed to the Google
>>>>>>>> Groups "CIFTS" group.
>>>>>>>> To post to this group, send email to cifts at googlegroups.com
>>>>>>>> To unsubscribe from this group, send email to
>>>>>>>> cifts+unsubscribe at googlegroups.com
>>>>>>>> For more options, visit this group at
>>>>>>>> http://groups.google.com/group/cifts?hl=en
>>>>>>>> -~----------~----~----~----~------~----~------~--~---
>>>>>>>> --
>>>>>>>> ===================================================
>>>>>>>> Ioan Raicu, Ph.D.
>>>>>>>> ===================================================
>>>>>>>> Distributed Systems Laboratory
>>>>>>>> Computer Science Department
>>>>>>>> University of Chicago
>>>>>>>> 1100 E. 58th Street, Ryerson Hall
>>>>>>>> Chicago, IL 60637
>>>>>>>> ===================================================
>>>>>>>> Email: iraicu at cs.uchicago.edu
>>>>>>>> Web: http://www.cs.uchicago.edu/~iraicu
>>>>>>>> http://dev.globus.org/wiki/Incubator/Falkon
>>>>>>>> http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
>>>>>>>> ===================================================
>>>>>>>> ===================================================
>>>>>>>>
>>>>>>>
>>>>>> --
>>>>>> ===================================================
>>>>>> Ioan Raicu, Ph.D.
>>>>>> ===================================================
>>>>>> Distributed Systems Laboratory
>>>>>> Computer Science Department
>>>>>> University of Chicago
>>>>>> 1100 E. 58th Street, Ryerson Hall
>>>>>> Chicago, IL 60637
>>>>>> ===================================================
>>>>>> Email: iraicu at cs.uchicago.edu
>>>>>> Web: http://www.cs.uchicago.edu/~iraicu
>>>>>> http://dev.globus.org/wiki/Incubator/Falkon
>>>>>> http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
>>>>>> ===================================================
>>>>>> ===================================================
>> _______________________________________________
>> Swift-devel mailing list
>> Swift-devel at ci.uchicago.edu
>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>
From benc at hawaga.org.uk Tue Mar 3 07:31:33 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 3 Mar 2009 13:31:33 +0000 (GMT)
Subject: [Swift-devel] Re: Fault tolerance in "many task computing"?
In-Reply-To: <49AC6B39.8010407@mcs.anl.gov>
References: <21662876.222571235010770270.JavaMail.root@zimbra>
<499CC801.3090304@cs.uchicago.edu>
<25796E93-7DB0-44FF-8884-0201B4EC620F@anl.gov>
<499CEF6C.1020700@mcs.anl.gov> <49AAC1DF.7090502@cs.uchicago.edu>
<49AC6B39.8010407@mcs.anl.gov>
Message-ID:
Sounds interesting as a research project.
How to hook different logging systems in is fairly well defined in the
submit side (through log4j) and in the worker code (through a single bash
function).
Integration into the globus toolkit stack is something that ties in with
CEDPS, not Swift.
I would not be averse to more pluggable logging mechanisms in the swift
core code, although I am as always resistant to adding in unnecessary
dependencies.
--
From hategan at mcs.anl.gov Wed Mar 4 14:04:41 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 04 Mar 2009 14:04:41 -0600
Subject: [Swift-devel] different host CN expectations in gram and
gridftp server
In-Reply-To: <1235510279.7676.0.camel@localhost>
References: <50b07b4b0902221621s52239835xf920e665e8cfce5f@mail.gmail.com>
<1235361343.1273.6.camel@localhost>
<50b07b4b0902230848v15e1394dh829fcb2bbf94a578@mail.gmail.com>
<1235408072.10242.0.camel@localhost>
<50b07b4b0902230856g18e11118v5f5a27d2d5eb7afc@mail.gmail.com>
<50b07b4b0902241314t7ea23b28g832c70e26877c5f6@mail.gmail.com>
<1235510279.7676.0.camel@localhost>
Message-ID: <1236197081.24081.2.camel@localhost>
http://bugzilla.mcs.anl.gov/globus/show_bug.cgi?id=6678
Use login3.ranger.tacc.utexas.edu instead of
gatekeeper.ranger.tacc.teragrid.org if your submit host is a node on
ranger. Though I'd recommend against running swift on one of ranger's
head nodes.
On Tue, 2009-02-24 at 15:17 -0600, Mihael Hategan wrote:
> Ok. I'll look into this.
>
> On Tue, 2009-02-24 at 15:14 -0600, Allan Espinosa wrote:
> > I still get the same gram authentication error message:
> >
> > Caused by: org.globus.cog.abstraction.impl.common.task.TaskSubmissionException:
> > Cannot submit job
> > Caused by: org.globus.gram.GramException: Data transfer to the server
> > failed [Caused by: Authentication failed [Caused by: Operation
> > unauthorized (Mechanism level: [JGLOBUS-56] Authorization failed.
> > Expected "/CN=host/129.114.50.163" target but received
> > "/C=US/O=UTAustin/OU=TACC/CN=login3.ranger.tacc.utexas.edu")]]
> > 2009-02-24 15:12:07,215-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION
> > jobid=hostname-8tx7p37j - Application exception: Cannot submit job
> >
> > This is using both the fork and sge job manager via gram2-only
> >
> > -aallan
> >
> >
> > On Mon, Feb 23, 2009 at 10:58 AM, Ben Clifford wrote:
> > >
> > > If you use gram2 instead of coasters+gram2, what happens?
> > >
From benc at hawaga.org.uk Thu Mar 5 08:51:24 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 5 Mar 2009 14:51:24 +0000 (GMT)
Subject: [Swift-devel] Open Provenance Model log exporter
Message-ID:
Part of Provenance Challenge 3 (PC3) is to export data into the open
provenance model (OPM).
I've committed a crude exporter for that into provenancedb/ in the SVN in
r2633
I've also added details in the provenance.xml docbook page in that
directory about how open provenance model vocabulary relates to Swift
concepts.
My experience so far has been that OPM is quite in sympathy with the
thoughts that I've had so far about Swift provenance. I'll be interested
to see (in PC3) if any meaningful interop between implementations can be
achieved.
At present, my thoughts tend towards us being able to export Swift data
into some other provenance system, rather than importing other people's
data from not-Swift into our database.
--
From foster at anl.gov Thu Mar 5 12:56:47 2009
From: foster at anl.gov (Ian Foster)
Date: Thu, 5 Mar 2009 12:56:47 -0600
Subject: [Swift-devel] Open Provenance Model log exporter
In-Reply-To:
References:
Message-ID: <73AE54CA-2F68-4AE4-BC7E-44F753C4A1B4@anl.gov>
Ben:
That sounds interesting. Are there any decent tools for analyzing/
processing OPM logs that we can make use of?
One issue that I recall from past discussions was that Swift's
functional model makes provenance in some ways "simpler" than in other
systems. Do we lose that simplicity when we export to OPM?
Maybe we can discuss these issues when you get some experience with OPM.
Ian.
On Mar 5, 2009, at 8:51 AM, Ben Clifford wrote:
> Part of Provenance Challenge 3 (PC3) is to export data into the open
> provenance model (OPM).
>
> I've committed a crude exporter for that into provenancedb/ in the
> SVN in
> r2633
>
> I've also added details in the provenance.xml docbook page in that
> directory about how open provenance model vocabulary relates to Swift
> concepts.
>
> My experience so far has been that OPM is quite in sympathy with the
> thoughts that I've had so far about Swift provenance. I'll be
> interested
> to see (in PC3) if any meaningful interop between implementations
> can be
> achieved.
>
> At present, my thoughts tend towards us being able to export Swift
> data
> into some other provenance system, rather than importing other peoples
> data from not-Swift into our database.
>
> --
From benc at hawaga.org.uk Thu Mar 5 16:45:21 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 5 Mar 2009 22:45:21 +0000 (GMT)
Subject: [Swift-devel] Open Provenance Model log exporter
In-Reply-To: <73AE54CA-2F68-4AE4-BC7E-44F753C4A1B4@anl.gov>
References:
<73AE54CA-2F68-4AE4-BC7E-44F753C4A1B4@anl.gov>
Message-ID:
On Thu, 5 Mar 2009, Ian Foster wrote:
> That sounds interesting. Are there any decent tools for
> analyzing/processing OPM logs that we can make use of?
Not that I'm aware of, as it's all very new - a (perhaps vain - cf CEDPS)
hope in participating in PC3 is that we'll entangle ourselves with things
that consume what we produce.
> One issue that I recall from past discussions was that Swift's
> functional model makes provenance in some ways "simpler" than in other
> systems. Do we lose that simplicity when we export to OPM?
I think that OPM doesn't make things any more complicated. What looks like
a basic relation in the Swift provenance work so far pretty much maps into
a basic relation in OPM. So I am (surprisingly?) not very cynical about
the model.
--
From bugzilla-daemon at mcs.anl.gov Thu Mar 5 20:55:17 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Thu, 5 Mar 2009 20:55:17 -0600 (CST)
Subject: [Swift-devel] [Bug 61] semantics of [*] and multi-return-values
need clarifying
In-Reply-To:
Message-ID: <20090306025517.CA7E3164CE@foxtrot.mcs.anl.gov>
http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=61
gabri.turcu at gmail.com changed:
What |Removed |Added
----------------------------------------------------------------------------
AssignedTo|benc at hawaga.org.uk |gabri.turcu at gmail.com
Status|ASSIGNED |NEW
--
Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You reported the bug, or are watching the reporter.
You are the assignee for the bug, or are watching the assignee.
You are the assignee for the bug, or are watching the assignee.
From wilde at mcs.anl.gov Thu Mar 5 21:33:47 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 05 Mar 2009 21:33:47 -0600
Subject: [Swift-devel] Quick Start Guide examples are mangled
Message-ID: <49B0999B.7060706@mcs.anl.gov>
I don't know when this appeared, but some of the example text in the
Swift Quick Start Guide at:
http://www.ci.uchicago.edu/swift/guides/quickstartguide.php#install
is getting rendered wrong - it's full of HTML tags and unreadable. It looks
like:
>
tar -xzvf
swift-.tar.gz
And one of our new users was scratching his head saying, "wow, this is
really rather cryptic!" ;) (seriously...)
- Mike
From wilde at mcs.anl.gov Thu Mar 5 21:36:41 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 05 Mar 2009 21:36:41 -0600
Subject: [Swift-devel] Quick Start Guide examples are mangled
In-Reply-To: <49B0999B.7060706@mcs.anl.gov>
References: <49B0999B.7060706@mcs.anl.gov>
Message-ID: <49B09A49.1010800@mcs.anl.gov>
Same with the Really Quick version:
http://www.ci.uchicago.edu/swift/guides/reallyquickstartguide.php
and it's all the example text, on both pages, that's gone bad.
On 3/5/09 9:33 PM, Michael Wilde wrote:
> I dont know when this appeared, but some of the example text in the
> Swift Quick Start Guide at:
>
> http://www.ci.uchicago.edu/swift/guides/quickstartguide.php#install
>
> Is getting rendered wrong - its full of html tags and unreadable. Looks
> like:
>
> >
class="command">tar -xzvf
> swift-.tar.gz
>
> And one of our new users was scratching his head saying, "wow, this is
> really rather cryptic!" ;) (seriously...)
>
> - Mike
From benc at hawaga.org.uk Fri Mar 6 02:41:11 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 6 Mar 2009 08:41:11 +0000 (GMT)
Subject: [Swift-devel] Quick Start Guide examples are mangled
In-Reply-To: <49B0999B.7060706@mcs.anl.gov>
References: <49B0999B.7060706@mcs.anl.gov>
Message-ID:
On Thu, 5 Mar 2009, Michael Wilde wrote:
> I dont know when this appeared, but some of the example text in the Swift
> Quick Start Guide at:
>
> http://www.ci.uchicago.edu/swift/guides/quickstartguide.php#install
>
> Is getting rendered wrong - its full of html tags and unreadable. Looks like:
I changed a style sheet there the other day, which likely caused it. oops.
--
From benc at hawaga.org.uk Fri Mar 6 07:42:32 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 6 Mar 2009 13:42:32 +0000 (GMT)
Subject: [Swift-devel] Quick Start Guide examples are mangled
In-Reply-To:
References: <49B0999B.7060706@mcs.anl.gov>
Message-ID:
On Fri, 6 Mar 2009, Ben Clifford wrote:
> > I dont know when this appeared, but some of the example text in the Swift
> > Quick Start Guide at:
> >
> > http://www.ci.uchicago.edu/swift/guides/quickstartguide.php#install
> >
> > Is getting rendered wrong - its full of html tags and unreadable. Looks like:
>
> I changed a style sheet there the other day, which likely caused it. oops.
fixed in r2646:
Change use of elements for console interactions into
elements.
elements undergo magic syntax highlighting under the
swiftsh_html.xsl style sheet that the quickstart guides were switched
to in r2614
--
From bugzilla-daemon at mcs.anl.gov Fri Mar 6 18:34:07 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Fri, 6 Mar 2009 18:34:07 -0600 (CST)
Subject: [Swift-devel] [Bug 181] New: Poor error message for sites.xml
syntax error
Message-ID:
http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=181
Summary: Poor error message for sites.xml syntax error
Product: Swift
Version: unspecified
Platform: All
OS/Version: Linux
Status: NEW
Severity: minor
Priority: P4
Component: SwiftScript language
AssignedTo: benc at hawaga.org.uk
ReportedBy: wilde at mcs.anl.gov
A missing / on a /> closing bracket on the tag below yields a
cryptic (and duplicated) error message:
fast
05:00:00
/home/wilde/swiftwork
Gives:
Execution failed:
Could not load file teraport.xml:
com.thoughtworks.xstream.converters.ConversionException: : end tag name
must match start tag name from line 5 (position: TEXT seen
...\n... @8:8) : : end tag name must match
start tag name from line 5 (position: TEXT seen
...\n... @8:8)
--
Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You reported the bug, or are watching the reporter.
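A friendlier message for the kind of sites.xml mistake reported in Bug 181 could come from a well-formedness check run before the file reaches the real loader. The sketch below is illustrative only: it uses Python's standard xml.etree.ElementTree rather than the XStream-based loader that Swift uses, the function name check_well_formed is hypothetical, and the element names in the sample are stand-ins, since the original fragment did not survive in the archive.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return None if xml_text parses, else a single-line error message."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        # ParseError.position is a (line, column) tuple from the parser.
        line, column = err.position
        return f"XML syntax error at line {line}, column {column}: {err}"
    return None


# Stand-in fragment (element names are assumptions): the <execution> tag is
# missing the "/" that would make it self-closing.
BROKEN = """<config>
  <pool handle="example">
    <execution provider="pbs">
    <workdirectory>/tmp/work</workdirectory>
  </pool>
</config>
"""

# Same fragment with the "/" restored.
FIXED = BROKEN.replace('provider="pbs">', 'provider="pbs"/>')

if __name__ == "__main__":
    print(check_well_formed(BROKEN))
    print(check_well_formed(FIXED))
```

The point of the sketch is only that the message is reported once and names a line; how such a check would be wired into Swift is not something this thread settles.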
From wilde at mcs.anl.gov Fri Mar 6 18:45:37 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 06 Mar 2009 18:45:37 -0600
Subject: [Swift-devel] PBS time-exceeds-queue conflict yields cryptic error
Message-ID: <49B1C3B1.9000309@mcs.anl.gov>
A low prio issue:
When I ask for more time than the selected PBS queue allows, I get a
cryptic error. The fact that this condition yields a PBS error is known
and has been discussed on the list.
Is it tracked as bug 133, or does that refer to exceeding the allotted
time at runtime? If so, I can add this note; else I can file a new bug.
In my case, I gave:
fast
05:00:00
/home/wilde/swiftwork
My error was asking for 5 hours (05:00:00) instead of 5 minutes.
I got the unhelpful error:
tp$ swift -tc.file tc.data -sites.file teraport.xml floop.swift
Swift svn swift-r2631 cog-r2306
RunID: 20090306-1822-cerzn3y8
Progress:
Failed to transfer wrapper log from floop-20090306-1822-cerzn3y8/info/0
on teraport
Execution failed:
Failed to transfer wrapper log from floop-20090306-1822-cerzn3y8/info/r
on teraport
Failed to transfer wrapper log from floop-20090306-1822-cerzn3y8/info/y
on teraport
Progress: Submitting:7 Failed:3
Exception in echo:
Arguments: [42]
Host: teraport
Directory: floop-20090306-1822-cerzn3y8/jobs/0/echo-0hikdk7j
stderr.txt:
stdout.txt:
----
Caused by:
Cannot submit job: Could not submit job (qsub reported an exit code of
188). no error output
Failed to transfer wrapper log from floop-20090306-1822-cerzn3y8/info/s
on teraport
Failed to transfer wrapper log from floop-20090306-1822-cerzn3y8/info/q
on teraport
tp$ fg
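The mistake described in this message (05:00:00 where five minutes was intended) is the sort of thing a submit-side sanity check could flag before qsub is ever called. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the HH:MM:SS format is taken from the value quoted above, and the one-hour queue limit in the example is a made-up number, not the real limit of any Teraport queue.

```python
def parse_walltime(text):
    """Convert an HH:MM:SS string into a number of seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def walltime_problem(requested, queue_limit):
    """Return a warning string if requested exceeds queue_limit, else None."""
    if parse_walltime(requested) > parse_walltime(queue_limit):
        return (f"requested walltime {requested} exceeds "
                f"the queue limit {queue_limit}")
    return None


if __name__ == "__main__":
    # Hypothetical limit of one hour for the queue.
    print(walltime_problem("05:00:00", "01:00:00"))
    print(walltime_problem("00:05:00", "01:00:00"))
```

Such a check only helps if the queue limit is known on the submit side, which is itself an assumption here.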
From hategan at mcs.anl.gov Fri Mar 6 21:24:06 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 06 Mar 2009 21:24:06 -0600
Subject: [Swift-devel] PBS time-exceeds-queue conflict yields cryptic error
In-Reply-To: <49B1C3B1.9000309@mcs.anl.gov>
References: <49B1C3B1.9000309@mcs.anl.gov>
Message-ID: <1236396246.14742.4.camel@localhost>
On Fri, 2009-03-06 at 18:45 -0600, Michael Wilde wrote:
> A low prio issue:
>
> When I ask for more time than the selected PBS queue allows, I get a
> cryptic error. The fact that this condition yields a PBS error is known
> and has been discussed on the list.
At a quick glance, I couldn't find a list of qsub exit codes and their
meanings.
So I'm wondering whether it's reasonable to assume some portability for
them (assuming I can find out from the Torque sources what they are).
From hategan at mcs.anl.gov Fri Mar 6 21:33:47 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 06 Mar 2009 21:33:47 -0600
Subject: [Swift-devel] PBS time-exceeds-queue conflict yields cryptic error
In-Reply-To: <1236396246.14742.4.camel@localhost>
References: <49B1C3B1.9000309@mcs.anl.gov> <1236396246.14742.4.camel@localhost>
Message-ID: <1236396827.14742.6.camel@localhost>
On Fri, 2009-03-06 at 21:24 -0600, Mihael Hategan wrote:
> On Fri, 2009-03-06 at 18:45 -0600, Michael Wilde wrote:
> > A low prio issue:
> >
> > When I ask for more time than the selected PBS queue allows, I get a
> > cryptic error. The fact that this condition yields a PBS error is known
> > and has been discussed on the list.
>
> On a quick glance, I couldn't find a list of qsub exit codes and their
> meanings.
>
> So I'm thinking whether it's reasonable to assume some portability for
> them (assuming I can find out from the Torque sources what they are).
There's also the possibility that stderr from qsub isn't displayed
properly.
Could you add log4j.logger.org.globus.cog.abstraction.impl=DEBUG to
etc/log4j.properties, run it again, and send me the log?
Mihael
From benc at hawaga.org.uk Sat Mar 7 03:49:01 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sat, 7 Mar 2009 09:49:01 +0000 (GMT)
Subject: [Swift-devel] PBS time-exceeds-queue conflict yields cryptic error
In-Reply-To: <1236396246.14742.4.camel@localhost>
References: <49B1C3B1.9000309@mcs.anl.gov> <1236396246.14742.4.camel@localhost>
Message-ID:
On Fri, 6 Mar 2009, Mihael Hategan wrote:
> On a quick glance, I couldn't find a list of qsub exit codes and their
> meanings.
When I've been googling round previously for such, I've not had much luck.
> So I'm thinking whether it's reasonable to assume some portability for
> them (assuming I can find out from the Torque sources what they are).
leading to delightful error messages like "this *might* be a walltime
violation", which I suppose is better than no error at all.
--
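The "this *might* be a walltime violation" idea from this thread can be sketched as a lookup that adds a hedged hint to the existing error text. Everything here is an assumption for illustration: the table and function names are hypothetical, and the only entry, 188, is the exit code observed in this thread on one Torque installation, not a documented meaning.

```python
# Hypothetical hint table. The 188 entry reflects a single observation from
# this thread (too much walltime requested); it is not a documented mapping
# and may not be portable across PBS/Torque versions.
QSUB_HINTS = {
    188: "this *might* be a walltime violation",
}


def describe_qsub_failure(exit_code, stderr_text=""):
    """Build a submission error message, adding a hint when one is known."""
    base = f"Could not submit job (qsub reported an exit code of {exit_code})."
    detail = stderr_text.strip() or "no error output"
    hint = QSUB_HINTS.get(exit_code)
    if hint is None:
        return f"{base} {detail}"
    return f"{base} {detail}; {hint}"


if __name__ == "__main__":
    print(describe_qsub_failure(188))
    print(describe_qsub_failure(1, "qsub: unknown queue"))
```

Keeping the hint separate from the raw exit code and stderr means the original information is still shown even when the guess is wrong.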
From bugzilla-daemon at mcs.anl.gov Sun Mar 8 17:13:15 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Sun, 8 Mar 2009 17:13:15 -0500 (CDT)
Subject: [Swift-devel] [Bug 109] Change default max heap size to 256M
In-Reply-To:
Message-ID: <20090308221315.3A3CD164CE@foxtrot.mcs.anl.gov>
http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=109
hategan at mcs.anl.gov changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |RESOLVED
Resolution| |FIXED
------- Comment #1 from hategan at mcs.anl.gov 2009-03-08 17:13 -------
Fixed in r2647.
--
Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.
From benc at hawaga.org.uk Mon Mar 9 04:56:05 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 9 Mar 2009 09:56:05 +0000 (GMT)
Subject: [Swift-devel] #caller in karajan (fwd)
Message-ID:
yet again I demonstrate my ability to typo Chicago...
---------- Forwarded message ----------
Date: Mon, 9 Mar 2009 09:53:44 +0000 (GMT)
From: Ben Clifford
To: hategan at mcs.anl.gov
Cc: swift-devel at ci.uchciago.edu
Subject: #caller in karajan
This failed in the daily tests
http://nmi-s005.cs.wisc.edu:80/rundir/benc/2009/03/benc_nmi-s005.batlab.cs.wisc.edu_1236561792_24345/userdir/nmi:x86_64_rhas_3/remote_task.out
with an error that I haven't seen before that I think is something karajan
related.
The nmi platform description is x86_64_rhas_3
The same version has worked on other platforms.
I haven't really investigated it for reproducibility. The same test passed
yesterday, so it has the air of some race condition.
Running test 065-delay
Swift svn which: no svn in
(/prereq/java-1.4.2_05/bin:/prereq/apache-ant-1.7.0/bin:/bin:/usr/bin:/home/condor/execute/dir_27780/userdir)
which: no svn in
(/prereq/java-1.4.2_05/bin:/prereq/apache-ant-1.7.0/bin:/bin:/usr/bin:/home/condor/execute/dir_27780/userdir)
swift-unknown cog-unknown
RunID: 20090309-0236-rv96a6pg
Uncaught exception: java.util.EmptyStackException in
org.globus.cog.karajan.workflow.nodes.Sequential @ execute-default.k,
line: 1
java.util.EmptyStackException
at
org.globus.cog.karajan.stack.LinkedStack.leave(LinkedStack.java:54)
at
org.globus.cog.karajan.workflow.nodes.FlowNode.failImmediately(FlowNode.java:127)
at
org.globus.cog.karajan.workflow.nodes.FlowNode.failImmediately(FlowNode.java:151)
at
org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:366)
at
org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
at
org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
at
org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java:69)
Event was No #caller found on stack for sys:element @ execute-default.k,
line: 1
sys:element @ execute-default.k, line: 1
Exception is: java.util.EmptyStackException
Cannot fail element
java.util.EmptyStackException
at
org.globus.cog.karajan.stack.LinkedStack.leave(LinkedStack.java:54)
at
org.globus.cog.karajan.workflow.nodes.FlowNode.failImmediately(FlowNode.java:127)
at
org.globus.cog.karajan.workflow.nodes.FlowNode.failImmediately(FlowNode.java:147)
at
org.globus.cog.karajan.workflow.events.EventBus.failElement(EventBus.java:189)
at
org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:155)
at
org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
at
org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java:69)
SWIFT RETURN CODE NON-ZERO - test 065-delay.swift
From zhaozhang at uchicago.edu Mon Mar 9 14:03:46 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Mon, 09 Mar 2009 14:03:46 -0500
Subject: [Swift-devel] how to write a data provider
Message-ID: <49B56812.2020308@uchicago.edu>
Hi,
Is there any documentation online, from which I could learn how to write
a data provider? Thanks
best regards
zhao
From aespinosa at cs.uchicago.edu Mon Mar 9 14:12:36 2009
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Mon, 9 Mar 2009 14:12:36 -0500
Subject: [Swift-devel] how to write a data provider
In-Reply-To: <49B56812.2020308@uchicago.edu>
References: <49B56812.2020308@uchicago.edu>
Message-ID: <50b07b4b0903091212w159c8fc9r92466348001037a5@mail.gmail.com>
Hi Zhao,
I am also not familiar with how to write one, but I'm sort of starting to
have an idea.
http://wiki.cogkit.org/wiki/Java_CoG_Kit_Abstraction_Guide
Looking at the SSH provider, my guess is that these are the entry
points from Swift (or CoG in general) to the provider:
*TaskHandler.java
*TaskImpl.java
Let's set up a reading/study group to get to know the internals of this.
-Allan
On Mon, Mar 9, 2009 at 2:03 PM, Zhao Zhang wrote:
> Hi,
>
> Is there any documentation online, from which I could learn how to write a
> data provider? Thanks
>
> best regards
> zhao
From hategan at mcs.anl.gov Mon Mar 9 15:00:25 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 09 Mar 2009 15:00:25 -0500
Subject: [Swift-devel] how to write a data provider
In-Reply-To: <50b07b4b0903091212w159c8fc9r92466348001037a5@mail.gmail.com>
References: <49B56812.2020308@uchicago.edu>
<50b07b4b0903091212w159c8fc9r92466348001037a5@mail.gmail.com>
Message-ID: <1236628825.15244.4.camel@localhost>
On Mon, 2009-03-09 at 14:12 -0500, Allan Espinosa wrote:
> HI Zhao,
>
> I am also not familiar on how to write one but I'm sort of starting to
> have an idea.
>
> http://wiki.cogkit.org/wiki/Java_CoG_Kit_Abstraction_Guide
Yes, that's useful reading material.
>
> Looking at the SSH provider, my guess is that these are the entry
> points from Swift (or CoG in general) to the provider:
>
> *TaskHandler.java
> *TaskImpl.java
>
For code examples, take a look at
org.globus.cog.abstraction.impl.file.local.FileResourceImpl.java in
provider-local.
I'd start by making a copy of provider-local, updating
project.properties and resources/cog-provider.properties, and then
hacking at FileResourceImpl.java.
From hategan at mcs.anl.gov Mon Mar 9 15:25:00 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 09 Mar 2009 15:25:00 -0500
Subject: [Swift-devel] scalability updates
Message-ID: <1236630300.16421.17.camel@localhost>
I've committed two main things today:
1. A foreach thread limiting patch, which limits the maximum number of
concurrent threads that a foreach can have at any time. The default is
1024, but it is configurable in swift.properties. For scripts whose
main memory hog is large numbers of iterations in foreach loops, this
should allow things to run with considerably less memory.
2. A lazy range function (the [x:y] operator). The previous one was
silly. Simply writing [0:1000000] would cause swift to run out of memory
because it was trying to create a swift array with 1000000 elements
before running a single iteration on it.
In principle, these two would roughly translate into the following:
- a likely demise of several of our swift-runs-out-of-memory scenarios.
Though there's still a bit to go here, because arrays in general in
swift keep too many things in memory.
- skenny type scripts (foreach i in [1:65535] { doStuff(); }) will not
see that 5 minute delay before the first job is submitted.
- Ben's provenance stuff may break if it relies on items returned by
range() reporting a path-from-root containing the array itself (as array
elements are roots themselves).
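The two changes described in this message can be illustrated outside of Swift. The sketch below is Python, not the Karajan/Java code that was committed, and all names in it are hypothetical. It assumes that [x:y] includes both endpoints, and it borrows the default of 1024 from the message above. It shows a range that produces values on demand instead of building an array first, and a foreach that never has more than a fixed number of iterations in flight.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def lazy_range(first, last):
    """Yield first..last (assumed inclusive) one value at a time."""
    i = first
    while i <= last:
        yield i
        i += 1


def bounded_foreach(items, body, max_threads=1024):
    """Run body(item) for every item, with at most max_threads in flight."""
    slots = threading.Semaphore(max_threads)
    results = []
    results_lock = threading.Lock()

    def run(item):
        try:
            value = body(item)
            with results_lock:
                results.append(value)
        finally:
            # Free the slot even if body() raised, so the loop keeps going.
            slots.release()

    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        for item in items:
            # Blocks while max_threads iterations are running, so the lazy
            # source is only advanced when there is room for another one.
            slots.acquire()
            pool.submit(run, item)
    # Completion order is not deterministic, so neither is the result order.
    return results


if __name__ == "__main__":
    doubled = bounded_foreach(lazy_range(1, 100), lambda i: i * 2, max_threads=4)
    print(len(doubled))
```

Because the loop only pulls the next value when a slot is free, the first iteration can start immediately and memory use tracks the thread limit rather than the size of the range. Errors raised by body() are silently dropped in this sketch.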
From wilde at mcs.anl.gov Tue Mar 10 00:01:53 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 10 Mar 2009 00:01:53 -0500
Subject: [Swift-devel] scalability updates
In-Reply-To: <1236630300.16421.17.camel@localhost>
References: <1236630300.16421.17.camel@localhost>
Message-ID: <49B5F441.7060502@mcs.anl.gov>
Very nice! These look very promising. One interesting test would be
doing a million localhost echos in a simple foreach loop on a
range-initialized array, and looking at the memory needs.
These 2 enhancements seem to pave the way to making streamed (or
on-demand) mappers useful. For those, I think we need a mapper paradigm
design adjustment discussion.
But I think the next thing to work on in scalability would be the
Condor-G provider, so we can run large coaster runs with more
concurrency. The multi-cpu coaster allocator might be a workaround to
re-consider if a Condor-G provider is too far off.
Assuming (or when) there's agreement that this is the best solution for
coaster scalability, I'd like to propose that as the next big task on
your to-do list.
On 3/9/09 3:25 PM, Mihael Hategan wrote:
> I've committed two main things today:
>
> 1. A foreach thread limiting patch, which limits the maximum number of
> concurrent threads that a foreach can have at any time. The default is
> at 1024, but it is configurable in swift.properties. For scripts whose
> main memory hog is large numbers of iterations in foreach loops, this
> should allow things to run with considerably less memory.
>
> 2. A lazy range function (the [x:y] operator). The previous one was
> silly. Simply writing [0:1000000] would cause swift to run out of memory
> because it was trying to create a swift array with 1000000 elements
> before running a single iteration on it.
>
> In principle, these two would roughly translate into the following:
> - a likely demise to several of our swift-runs-out-of-memory scenarios.
> Though there's still a bit to go here, because arrays in general in
> swift keep too much things in memory.
> - skenny type scripts (foreach i in [1:65535] { doStuff(); }) will not
> see that 5 minute delay before the first job is submitted.
> - Ben's provenance stuff may break if it relies on items returned by
> range() reporting a path-from-root containing the array itself (as array
> elements are roots themselves).
From hategan at mcs.anl.gov Tue Mar 10 00:07:58 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 10 Mar 2009 00:07:58 -0500
Subject: [Swift-devel] scalability updates
In-Reply-To: <49B5F441.7060502@mcs.anl.gov>
References: <1236630300.16421.17.camel@localhost>
<49B5F441.7060502@mcs.anl.gov>
Message-ID: <1236661678.26700.3.camel@localhost>
On Tue, 2009-03-10 at 00:01 -0500, Michael Wilde wrote:
> Very nice! These look very promising. One interesting test would be
> doing a million localhost echos in a simple foreach loop on a
> range-initialized array, and looking at the memory needs.
>
> These 2 enhancements seem to pave the way to making streamed (or
> on-demand) mappers useful. For those, I think we need a mapper paradigm
> design adjustment discussion.
I think we thought out the mappers to be procedural (i.e. not hold
state) from the beginning, so the problem does not seem to be in the
design of the mappers. Rather, it's the implementation of some of the
mappers and the implementation of swift data structures (parts of an
array cannot be garbage-collected independently).
>
> But I think the next thing to work on in scalability would be the
> Condor-G provider, so we can run large coaster runs with more
> concurrency. The multi-cpu coaster allocator might be a workaround to
> re-consider if a condor-G provider is too far off.
>
> Assuming (or when) there's agreement that this is the best solution for
> coaster scalability, I'd like to propose that as the next big task on
> your to-do list.
Yes. Sounds reasonable. Now, if only I could find a condor/condor-g
installation that I have access to...
From wilde at mcs.anl.gov Tue Mar 10 00:18:31 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 10 Mar 2009 00:18:31 -0500
Subject: [Swift-devel] scalability updates
In-Reply-To: <1236661678.26700.3.camel@localhost>
References: <1236630300.16421.17.camel@localhost>
<49B5F441.7060502@mcs.anl.gov> <1236661678.26700.3.camel@localhost>
Message-ID: <49B5F827.2050003@mcs.anl.gov>
On 3/10/09 12:07 AM, Mihael Hategan wrote:
> On Tue, 2009-03-10 at 00:01 -0500, Michael Wilde wrote:
>> Very nice! These look very promising. One interesting test would be
>> doing a million localhost echos in a simple foreach loop on a
>> range-initialized array, and looking at the memory needs.
>>
>> These 2 enhancements seem to pave the way to making streamed (or
>> on-demand) mappers useful. For those, I think we need a mapper paradigm
>> design adjustment discussion.
>
> I think we thought out the mappers to be procedural (i.e. not hold
> state) from the beginning, so the problem does not seem to be in the
> design of the mappers. Rather, it's the implementation of some of the
> mappers and the implementation of swift data structures (parts of an
> array cannot be garbage-collected independently).
By that I meant the functionality issues in mappers (i.e., ability to map
various user patterns easily, like the things I ran into in the OOPS
app). Other than the streaming thing, I wasn't concerned with the
performance of the mappers.
>
>> But I think the next thing to work on in scalability would be the
>> Condor-G provider, so we can run large coaster runs with more
>> concurrency. The multi-cpu coaster allocator might be a workaround to
>> re-consider if a condor-G provider is too far off.
>>
>> Assuming (or when) there's agreement that this is the best solution for
>> coaster scalability, I'd like to propose that as the next big task on
>> your to-do list.
>
> Yes. Sounds reasonable. Now, only if I could find a condor/condor-g
> installation that I have access to...
Great, we'll find you one. The local ITB site that Suchandra maintains
is a good choice, and I think we can get Greg to install the client
package on TeraGrid if it's not already there, or Ti on communicado.
From benc at hawaga.org.uk Tue Mar 10 02:55:18 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 10 Mar 2009 07:55:18 +0000 (GMT)
Subject: [Swift-devel] scalability updates
In-Reply-To: <1236630300.16421.17.camel@localhost>
References: <1236630300.16421.17.camel@localhost>
Message-ID:
On Mon, 9 Mar 2009, Mihael Hategan wrote:
> - Ben's provenance stuff may break if it relies on items returned by
> range() reporting a path-from-root containing the array itself (as array
> elements are roots themselves).
Perhaps. But the path-from-root stuff leaves a slightly bad taste in my
mouth anyway and is I think broken in other places due to aliasing (eg
this code fragment:
int i = 3; j = [i];
is a little ambiguous in what should be the root of j[1])
--
From benc at hawaga.org.uk Tue Mar 10 02:59:45 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 10 Mar 2009 07:59:45 +0000 (GMT)
Subject: [Swift-devel] scalability updates
In-Reply-To: <1236661678.26700.3.camel@localhost>
References: <1236630300.16421.17.camel@localhost>
<49B5F441.7060502@mcs.anl.gov> <1236661678.26700.3.camel@localhost>
Message-ID:
On Tue, 10 Mar 2009, Mihael Hategan wrote:
> Yes. Sounds reasonable. Now, only if I could find a condor/condor-g
> installation that I have access to...
communicado has Condor running on it. That should be enough.
--
From wilde at mcs.anl.gov Tue Mar 10 08:28:53 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 10 Mar 2009 08:28:53 -0500
Subject: [Swift-devel] scalability updates
In-Reply-To:
References: <1236630300.16421.17.camel@localhost>
<49B5F441.7060502@mcs.anl.gov> <1236661678.26700.3.camel@localhost>
Message-ID: <49B66B15.3010908@mcs.anl.gov>
Yes, provided it's configured to send jobs to OSG and TG, and that's
working. It *should* be, to support the OSG hands-on tutorial, but has
occasional problems as I recall.
On 3/10/09 2:59 AM, Ben Clifford wrote:
> On Tue, 10 Mar 2009, Mihael Hategan wrote:
>
>> Yes. Sounds reasonable. Now, only if I could find a condor/condor-g
>> installation that I have access to...
>
> communicado has Condor running on it. That should be enough.
>
From wilde at mcs.anl.gov Tue Mar 10 08:35:53 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 10 Mar 2009 08:35:53 -0500
Subject: [Swift-devel] scalability updates
In-Reply-To: <49B66B15.3010908@mcs.anl.gov>
References: <1236630300.16421.17.camel@localhost> <49B5F441.7060502@mcs.anl.gov>
<1236661678.26700.3.camel@localhost>
<49B66B15.3010908@mcs.anl.gov>
Message-ID: <49B66CB9.4010103@mcs.anl.gov>
But yes, I agree: communicado is the best place to test from. This is a
good motivation to keep its Condor in working state. Can you test it and
keep on maintaining it? Should this be your job, Ben, or CI Support's?
On 3/10/09 8:28 AM, Michael Wilde wrote:
> Yes, provided it's configured to send jobs to OSG and TG, and that's
> working. It *should* be, to support the OSG hands-on tutorial, but has
> occasional problems as I recall.
>
> On 3/10/09 2:59 AM, Ben Clifford wrote:
>> On Tue, 10 Mar 2009, Mihael Hategan wrote:
>>
>>> Yes. Sounds reasonable. Now, only if I could find a condor/condor-g
>>> installation that I have access to...
>>
>> communicado has Condor running on it. That should be enough.
>>
From benc at hawaga.org.uk Tue Mar 10 08:37:43 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 10 Mar 2009 13:37:43 +0000 (GMT)
Subject: [Swift-devel] scalability updates
In-Reply-To: <49B66CB9.4010103@mcs.anl.gov>
References: <1236630300.16421.17.camel@localhost>
<49B5F441.7060502@mcs.anl.gov> <1236661678.26700.3.camel@localhost>
<49B66B15.3010908@mcs.anl.gov> <49B66CB9.4010103@mcs.anl.gov>
Message-ID:
On Tue, 10 Mar 2009, Michael Wilde wrote:
> But yes, I agree: communicado is the best place to test from. This is a good
> motivation to keep its Condor in working state. Can you test it and keep on
> maintaining it? Should this be your job, Ben, or CI Support's?
Maintenance of that is CI support's job.
--
From benc at hawaga.org.uk Tue Mar 10 09:22:04 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 10 Mar 2009 14:22:04 +0000 (GMT)
Subject: [Swift-devel] scalability updates
In-Reply-To: <1236630300.16421.17.camel@localhost>
References: <1236630300.16421.17.camel@localhost>
Message-ID:
I ran 066-many with 30000 jobs (10 x more than is in the SVN at the
moment).
This eventually died, with it not being entirely clear to me why.
Throughout the run I saw null pointer exceptions being thrown.
I also noticed a slowdown during the run - at the start more than 1000
jobs are waiting for site selection; towards the end this was down to
about 980.
This is on communicado.
The log for that run is
http://www.ci.uchicago.edu/~benc/tmp/066-many-20090310-0853-ezpsjt20.log
--
From hategan at mcs.anl.gov Tue Mar 10 09:24:04 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 10 Mar 2009 09:24:04 -0500
Subject: [Swift-devel] scalability updates
In-Reply-To: <49B5F827.2050003@mcs.anl.gov>
References: <1236630300.16421.17.camel@localhost>
<49B5F441.7060502@mcs.anl.gov> <1236661678.26700.3.camel@localhost>
<49B5F827.2050003@mcs.anl.gov>
Message-ID: <1236695045.3324.0.camel@localhost>
On Tue, 2009-03-10 at 00:18 -0500, Michael Wilde wrote:
>
> On 3/10/09 12:07 AM, Mihael Hategan wrote:
> By that I meant the functionality issues in mappers (i.e., ability to map
> various user patterns easily, like the things I ran into in the OOPS
> app). Other than the streaming thing, I wasn't concerned with the
> performance of the mappers.
My misunderstanding. Sorry.
From hategan at mcs.anl.gov Tue Mar 10 09:27:12 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 10 Mar 2009 09:27:12 -0500
Subject: [Swift-devel] scalability updates
In-Reply-To: <49B5F441.7060502@mcs.anl.gov>
References: <1236630300.16421.17.camel@localhost>
<49B5F441.7060502@mcs.anl.gov>
Message-ID: <1236695232.3324.4.camel@localhost>
On Tue, 2009-03-10 at 00:01 -0500, Michael Wilde wrote:
> Very nice! These look very promising. One interesting test would be
> doing a million localhost echos in a simple foreach loop on a
> range-initialized array, and looking at the memory needs.
It depends. Should the echos return anything, and should the result be
put in an array without being used, that won't work. A 1M swift integer
array takes more than 300MB of memory.
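
As a rough illustration of what that figure implies per element, here is a
back-of-envelope sketch. It assumes 1 MB = 10^6 bytes, treats 300MB as a
lower bound, and assumes memory grows linearly with the element count; none
of these assumptions is stated in the thread, and the function names are
made up for the example.

```python
# Back-of-envelope arithmetic for the "1M integers > 300MB" figure.
# Assumptions (not from the thread): MB means 10**6 bytes, and memory
# use scales linearly with the number of array elements.
ARRAY_ELEMENTS = 1_000_000
ARRAY_BYTES = 300 * 10**6  # lower bound quoted in the message


def bytes_per_element(total_bytes, elements):
    """Average memory cost of one array element, in bytes."""
    return total_bytes / elements


def estimated_megabytes(elements, per_element_bytes):
    """Projected memory, in MB, for an array of the given size."""
    return elements * per_element_bytes / 10**6


per_element = bytes_per_element(ARRAY_BYTES, ARRAY_ELEMENTS)
print(per_element)                               # 300.0
print(estimated_megabytes(30_000, per_element))  # 9.0
```

Under those assumptions each element costs at least about 300 bytes, so an
array sized like the 30000-job runs of 066-many would need at least about
9MB for the array alone.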
From wilde at mcs.anl.gov Tue Mar 10 11:13:20 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 10 Mar 2009 11:13:20 -0500
Subject: [Swift-devel] scalability updates
In-Reply-To: <1236695232.3324.4.camel@localhost>
References: <1236630300.16421.17.camel@localhost>
<49B5F441.7060502@mcs.anl.gov> <1236695232.3324.4.camel@localhost>
Message-ID: <49B691A0.50601@mcs.anl.gov>
I think 1M echos would be a good milestone, even if it takes several GB
of RAM. Communicado has 14GB total, so it's a good place for such a test.
I realize that it will take time to work up to that level.
But even more important than 1M is just to know how the system scales,
and document it in the user guide along with resource needs, whatever
the level.
On 3/10/09 9:27 AM, Mihael Hategan wrote:
> On Tue, 2009-03-10 at 00:01 -0500, Michael Wilde wrote:
>> Very nice! These look very promising. One interesting test would be
>> doing a million localhost echos in a simple foreach loop on a
>> range-initialized array, and looking at the memory needs.
>
> It depends. Should the echos return anything, and should the result be
> put in an array without being used, that won't work. A 1M swift integer
> array takes more than 300MB of memory.
>
From hategan at mcs.anl.gov Tue Mar 10 14:24:01 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 10 Mar 2009 14:24:01 -0500
Subject: [Swift-devel] scalability updates
In-Reply-To:
References: <1236630300.16421.17.camel@localhost>
Message-ID: <1236713041.22637.0.camel@localhost>
Should be fixed in cog r2325 & swift r2677.
On Tue, 2009-03-10 at 14:22 +0000, Ben Clifford wrote:
> I ran 066-many with 30000 jobs (10 x more than is in the SVN at the
> moment).
>
> This eventually died, with it not being entirely clear to me why.
>
> Throughout the run I saw null pointer exceptions being thrown.
>
> I also noticed a slowdown during the run - at the start more than 1000
> jobs are waiting for site selection; towards the end this was down to
> about 980.
>
> This is on communicado.
>
> The log for that run is
> http://www.ci.uchicago.edu/~benc/tmp/066-many-20090310-0853-ezpsjt20.log
>
From benc at hawaga.org.uk Wed Mar 11 03:47:53 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 11 Mar 2009 08:47:53 +0000 (GMT)
Subject: [Swift-devel] scalability updates
In-Reply-To: <1236713041.22637.0.camel@localhost>
References: <1236630300.16421.17.camel@localhost>
<1236713041.22637.0.camel@localhost>
Message-ID:
Here's a different looking one.
Trying to run 066-many with 30000 iterations, in one run I got the below
error within a couple of seconds; and in a subsequent run it happened
after 20s or so.
http://www.ci.uchicago.edu/~benc/tmp/066-many-20090311-0336-toznl2a0.log
Full exception is in the log but in summary:
> No named arguments on current frame
> at org.globus.cog.karajan.arguments.Arg.getNamed(Arg.java:52)
> at org.globus.cog.karajan.arguments.Arg.getValue0(Arg.java:66)
> at org.globus.cog.karajan.arguments.Arg.getValue(Arg.java:62)
> at
> org.globus.cog.karajan.arguments.Arg$Positional.getValue(Arg.java:131)
> at org.griphyn.vdl.karajan.lib.Log.post(Log.java:71)
I got what looks like a similar error when I tried running with 3000
iterations, in
http://www.ci.uchicago.edu/~benc/tmp/066-many-20090311-0343-tmpq92td.log
> Execution failed:
> Missing argument level
--
From hategan at mcs.anl.gov Wed Mar 11 08:38:11 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 11 Mar 2009 08:38:11 -0500
Subject: [Swift-devel] scalability updates
In-Reply-To:
References: <1236630300.16421.17.camel@localhost>
<1236713041.22637.0.camel@localhost>
Message-ID: <1236778692.20470.0.camel@localhost>
Yes, I saw those. Did you get this with pre or post r2677?
On Wed, 2009-03-11 at 08:47 +0000, Ben Clifford wrote:
> Here's a different looking one.
>
> Trying to run 066-many with 30000 iterations, in one run I got the below
> error within a couple of seconds; and in a subsequent run it happened
> after 20s or so.
>
> http://www.ci.uchicago.edu/~benc/tmp/066-many-20090311-0336-toznl2a0.log
>
> Full exception is in the log but in summary:
>
> > No named arguments on current frame
> > at org.globus.cog.karajan.arguments.Arg.getNamed(Arg.java:52)
> > at org.globus.cog.karajan.arguments.Arg.getValue0(Arg.java:66)
> > at org.globus.cog.karajan.arguments.Arg.getValue(Arg.java:62)
> > at
> > org.globus.cog.karajan.arguments.Arg$Positional.getValue(Arg.java:131)
> > at org.griphyn.vdl.karajan.lib.Log.post(Log.java:71)
>
>
> I got what looks like a similar error when I tried running with 3000
> iterations, in
> http://www.ci.uchicago.edu/~benc/tmp/066-many-20090311-0343-tmpq92td.log
>
> > Execution failed:
> > Missing argument level
>
>
From benc at hawaga.org.uk Wed Mar 11 10:05:04 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 11 Mar 2009 15:05:04 +0000 (GMT)
Subject: [Swift-devel] scalability updates
In-Reply-To: <1236778692.20470.0.camel@localhost>
References: <1236630300.16421.17.camel@localhost>
<1236713041.22637.0.camel@localhost>
<1236778692.20470.0.camel@localhost>
Message-ID:
On Wed, 11 Mar 2009, Mihael Hategan wrote:
> Yes, I saw those. Did you get this with pre or post r2677?
2009-03-11 03:43:54,179-0500 INFO unknown Swift svn swift-r2677 (swift
modified locally) cog-r2325
I think I have occasionally seen this error in the past, but never enough
to recreate it.
--
From hategan at mcs.anl.gov Wed Mar 11 10:07:49 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 11 Mar 2009 10:07:49 -0500
Subject: [Swift-devel] scalability updates
In-Reply-To:
References: <1236630300.16421.17.camel@localhost>
<1236713041.22637.0.camel@localhost>
<1236778692.20470.0.camel@localhost>
Message-ID: <1236784069.21789.0.camel@localhost>
On Wed, 2009-03-11 at 15:05 +0000, Ben Clifford wrote:
> On Wed, 11 Mar 2009, Mihael Hategan wrote:
>
> > Yes, I saw those. Did you get this with pre or post r2677?
Hmm. I need to dig more.
>
> 2009-03-11 03:43:54,179-0500 INFO unknown Swift svn swift-r2677 (swift
> modified locally) cog-r2325
>
> I think I have occasionally seen this error in the past, but never enough
> to recreate it.
>
From hategan at mcs.anl.gov Wed Mar 11 13:18:23 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 11 Mar 2009 13:18:23 -0500
Subject: [Swift-devel] scalability updates
In-Reply-To:
References: <1236630300.16421.17.camel@localhost>
<1236713041.22637.0.camel@localhost>
<1236778692.20470.0.camel@localhost>
Message-ID: <1236795503.15465.1.camel@localhost>
On Wed, 2009-03-11 at 15:05 +0000, Ben Clifford wrote:
> On Wed, 11 Mar 2009, Mihael Hategan wrote:
>
> > Yes, I saw those. Did you get this with pre or post r2677?
>
> 2009-03-11 03:43:54,179-0500 INFO unknown Swift svn swift-r2677 (swift
> modified locally) cog-r2325
>
> I think I have occasionally seen this error in the past, but never enough
> to recreate it.
Right. It doesn't seem specific to the foreach limiting changes.
Are you getting this on any machine I have access to?
From hategan at mcs.anl.gov Wed Mar 11 15:57:11 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 11 Mar 2009 15:57:11 -0500
Subject: [Swift-devel] scalability updates
In-Reply-To: <1236795503.15465.1.camel@localhost>
References: <1236630300.16421.17.camel@localhost>
<1236713041.22637.0.camel@localhost>
<1236778692.20470.0.camel@localhost>
<1236795503.15465.1.camel@localhost>
Message-ID: <1236805031.7892.0.camel@localhost>
On Wed, 2009-03-11 at 13:18 -0500, Mihael Hategan wrote:
> On Wed, 2009-03-11 at 15:05 +0000, Ben Clifford wrote:
> > On Wed, 11 Mar 2009, Mihael Hategan wrote:
> >
> > > Yes, I saw those. Did you get this with pre or post r2677?
> >
> > 2009-03-11 03:43:54,179-0500 INFO unknown Swift svn swift-r2677 (swift
> > modified locally) cog-r2325
> >
> > I think I have occasionally seen this error in the past, but never enough
> > to recreate it.
>
> Right. It doesn't seem specific to the foreach limiting changes.
>
> Are you getting this on any machine I have access to?
Never mind. I can reproduce it (them).
From hategan at mcs.anl.gov Wed Mar 11 21:20:04 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 11 Mar 2009 21:20:04 -0500
Subject: [Swift-devel] scalability updates
In-Reply-To:
References: <1236630300.16421.17.camel@localhost>
<1236713041.22637.0.camel@localhost>
<1236778692.20470.0.camel@localhost>
Message-ID: <1236824404.12524.0.camel@localhost>
I got 4 successful runs with cog r2326.
On Wed, 2009-03-11 at 15:05 +0000, Ben Clifford wrote:
> On Wed, 11 Mar 2009, Mihael Hategan wrote:
>
> > Yes, I saw those. Did you get this with pre or post r2677?
>
> 2009-03-11 03:43:54,179-0500 INFO unknown Swift svn swift-r2677 (swift
> modified locally) cog-r2325
>
> I think I have occasionally seen this error in the past, but never enough
> to recreate it.
>
From benc at hawaga.org.uk Thu Mar 12 06:09:29 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 12 Mar 2009 11:09:29 +0000 (GMT)
Subject: [Swift-devel] provenance 'processes'
Message-ID:
The provenance work that I've done so far links datasets (DSHandles) by
saying they are inputs or outputs to app procedure invocations; it does
not represent the processing of data by @functions or operators.
Separately there is a containment graph to link DSHandles that represent
arrays or structs with their members; in that graph there's no causal
information - if you construct an array, and then extract a member, the
representation in this graph is the same as if you construct the members,
then use [] syntax to make an array of those members.
OPM has a different representation of containment, explicitly representing
'processes' that construct collections or extract collections.
Having mulled that over, and tried to do some other things, I think that
the provenance representation in Swift should record @functions and
operators (and probably composite procedures) in the same way that it
records app procedure executions. Mapper parameters should also be
recorded in some more sensible way.
This is likely to generate much more provenance logging information, but I
think will give decent information about every single DSHandle.
I have a niggling fear that this will generate enough information to cause
undesirable slowdown; but I think making provenance recording
turn-off-and-onable can reduce that problem for people who don't care
about provenance.
--
From bugzilla-daemon at mcs.anl.gov Thu Mar 12 23:08:08 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Thu, 12 Mar 2009 23:08:08 -0500 (CDT)
Subject: [Swift-devel] [Bug 182] New: Error messages summarized at end of
Swift output should also be printed when they occur
Message-ID:
https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=182
Summary: Error messages summarized at end of Swift output
should also be printed when they occur
Product: Swift
Version: unspecified
Platform: PC
OS/Version: All
Status: NEW
Severity: normal
Priority: P2
Component: SwiftScript language
AssignedTo: benc at hawaga.org.uk
ReportedBy: wilde at mcs.anl.gov
Based on the premise that we want most users to get most of the debugging info
they need from just looking at Swift output, rather than the .log file,
job execution error messages should be listed when they occur, rather than
summarized at the end. This is especially true for long workflows.
Currently, all the user sees in the output is the running success/failure
count, e.g.:
Progress: Submitted:1 Active:1 Failed:5 Finished successfully:29
But at the end, they see:
Final status: Failed:6 Finished successfully:32
The following errors have occurred:
1. Application "render_round" failed (Job failed with an exit code of 254)
Arguments: "output/T1af7/0001_0000.SecStr,
viz/T1af7/T1af7.round.0001.result.png, T1af7, 1"
Host: localhost
Directory: oops8-20090312-2254-hrdvk4lg/jobs/r/render_round-rzcihu7j
STDERR:
STDOUT:
--
These errors should also be listed when they occur.
From bugzilla-daemon at mcs.anl.gov Thu Mar 12 23:16:10 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Thu, 12 Mar 2009 23:16:10 -0500 (CDT)
Subject: [Swift-devel] [Bug 183] New: Print better error message when app
executable is not found
Message-ID:
https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=183
Summary: Print better error message when app executable is not
found
Product: Swift
Version: unspecified
Platform: All
OS/Version: All
Status: NEW
Severity: normal
Priority: P2
Component: SwiftScript language
AssignedTo: benc at hawaga.org.uk
ReportedBy: wilde at mcs.anl.gov
Currently, when an app listed in tc.data doesn't exist on the site, one gets
this error:
(Job failed with an exit code of 254)
Swift should generate an error that says exactly what happened, e.g.:
Application render_round not found on site localhost at path
"/home/wilde/oops/oops-r026/bin/render_round.sh".
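
A minimal sketch of the kind of check requested here, assuming the
executable path from tc.data can be tested on the machine where the job
runs. The function name check_executable and the wording of the second
message are hypothetical; this is not how Swift is implemented.

```python
import os


def check_executable(app, site, path):
    """Return None if path is a runnable file, else a descriptive message.

    Hypothetical helper: app, site and path stand for the application
    name, site name and executable path of a tc.data entry.
    """
    if not os.path.isfile(path):
        return 'Application %s not found on site %s at path "%s".' % (
            app, site, path)
    if not os.access(path, os.X_OK):
        return 'Application %s on site %s is not executable at path "%s".' % (
            app, site, path)
    return None


# A path that does not exist yields the message proposed in this report.
print(check_executable("render_round", "localhost",
                       "/nonexistent/bin/render_round.sh"))
```

Running a check like this before launching the application would let the
specific message replace the generic exit code 254.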
From zhaozhang at uchicago.edu Fri Mar 13 17:07:34 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Fri, 13 Mar 2009 17:07:34 -0500
Subject: [Swift-devel] How does swift know if a task is successful
Message-ID: <49BAD926.1030607@uchicago.edu>
Hi, All
I have a question on how swift knows if a task is successful.
In my case, I am using a status notification instead of a status file.
So my question is: is this status notification the only thing swift is
waiting for, or is swift also waiting for the output data to appear to
say that a job is successful? Thanks.
zhao
From hategan at mcs.anl.gov Fri Mar 13 17:13:33 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 13 Mar 2009 17:13:33 -0500
Subject: [Swift-devel] How does swift know if a task is successful
In-Reply-To: <49BAD926.1030607@uchicago.edu>
References: <49BAD926.1030607@uchicago.edu>
Message-ID: <1236982413.13026.1.camel@localhost>
On Fri, 2009-03-13 at 17:07 -0500, Zhao Zhang wrote:
> Hi, All
>
> I have a question on how swift knows if a task is successful.
> In my case, I am using a status notification instead of a status file.
>
> So my question is: is this status notification the only thing swift is
> waiting for, or is swift also waiting for the output data to appear to
> say that a job is successful?
Once the job is done, swift will attempt to stage out all the files that
it expects the job to have produced.
Should one of those files not be there, there will be failures.
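
The rule described above can be sketched as follows. This is a simplified
model, not Swift's code: stage-out is replaced by a local file-existence
check, and the names missing_outputs and job_succeeded are made up for the
example.

```python
import os
import tempfile


def missing_outputs(job_dir, expected_files):
    """List the expected output files that are absent from job_dir."""
    return [name for name in expected_files
            if not os.path.isfile(os.path.join(job_dir, name))]


def job_succeeded(notified_ok, job_dir, expected_files):
    """Success needs both a good status and every expected output file."""
    return bool(notified_ok and not missing_outputs(job_dir, expected_files))


# Usage: the job reported success but produced only one of two outputs.
with tempfile.TemporaryDirectory() as job_dir:
    open(os.path.join(job_dir, "a.out"), "w").close()
    print(missing_outputs(job_dir, ["a.out", "b.out"]))      # ['b.out']
    print(job_succeeded(True, job_dir, ["a.out", "b.out"]))  # False
    print(job_succeeded(True, job_dir, ["a.out"]))           # True
```

In this model a successful status notification alone is not enough: a
missing expected output still turns the job into a failure.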
From hategan at mcs.anl.gov Fri Mar 13 22:17:52 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 13 Mar 2009 22:17:52 -0500
Subject: [Swift-devel] condor provider
Message-ID: <1237000672.21050.7.camel@localhost>
I committed an update to the local schedulers. This includes a bit of
refactoring of the existing providers and the addition of a condor
provider.
The condor provider is not a globus-through-condor thing. It submits to
the default condor queue in the vanilla (or mpi) universe.
I've tested it on communicado with loads like 256 parallel jobs.
Seems to behave. It needs more work, but it's a start.
In the refactoring process, I probably screwed the other two, and I'm
not in the mood to test them now.
There's one thing I haven't figured out though, and that is whether
condor has any notion of a walltime limit.
From benc at hawaga.org.uk Sun Mar 15 12:29:23 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sun, 15 Mar 2009 17:29:23 +0000 (GMT)
Subject: [Swift-devel] scalability updates
In-Reply-To: <1236805031.7892.0.camel@localhost>
References: <1236630300.16421.17.camel@localhost>
<1236713041.22637.0.camel@localhost>
<1236778692.20470.0.camel@localhost>
<1236795503.15465.1.camel@localhost>
<1236805031.7892.0.camel@localhost>
Message-ID:
I tried to run 066-many with a million (as a suitably large number)
iterations to see where it got and then promptly forgot that I'd left
it running.
It got to about 200000 (2*10^5) jobs, and then died.
If you're interested, the log is in
http://www.ci.uchciago.edu/~benc/066-many-20090312-0550-z7jxehb4.log
The restart log for that is empty apart from the timestamp
(s/.log/.0.rlog on the above URL for that). I think it should contain
much more than that - one line for each of the 2*10^5 jobs that are
alleged by the log files to have been completed.
From hategan at mcs.anl.gov Sun Mar 15 19:41:22 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sun, 15 Mar 2009 19:41:22 -0500
Subject: [Swift-devel] scalability updates
In-Reply-To:
References: <1236630300.16421.17.camel@localhost>
<1236713041.22637.0.camel@localhost>
<1236778692.20470.0.camel@localhost>
<1236795503.15465.1.camel@localhost>
<1236805031.7892.0.camel@localhost>
Message-ID: <1237164082.20046.0.camel@localhost>
On Sun, 2009-03-15 at 17:29 +0000, Ben Clifford wrote:
> I tried to run 066-many with a million (as a suitably large number)
> iterations to see where it got and then promptly forgot that I'd left
> it running.
>
> It got to about 200000 (2*10^5) jobs, and then died.
>
> If you're interested, the log is in
> http://www.ci.uchciago.edu/~benc/066-many-20090312-0550-z7jxehb4.log
>
> The restart log for that is empty apart from the timestamp
> (s/.log/.0.rlog on the above URL for that). I think it should contain
> much more than that - one line for each of the 2*10^5 jobs that are
> alleged by the log files to have been completed.
The jobs produce no data. What exactly should be in the restart log
you'd say?
From hategan at mcs.anl.gov Sun Mar 15 19:44:11 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sun, 15 Mar 2009 19:44:11 -0500
Subject: [Swift-devel] scalability updates
In-Reply-To:
References: <1236630300.16421.17.camel@localhost>
<1236713041.22637.0.camel@localhost>
<1236778692.20470.0.camel@localhost>
<1236795503.15465.1.camel@localhost>
<1236805031.7892.0.camel@localhost>
Message-ID: <1237164251.20046.2.camel@localhost>
On Sun, 2009-03-15 at 17:29 +0000, Ben Clifford wrote:
> I tried to run 066-many with a million (as a suitably large number)
> iterations to see where it got and then promptly forgot that I'd left
> it running.
>
> It got to about 200000 (2*10^5) jobs, and then died.
>
> If you're interested, the log is in
> http://www.ci.uchciago.edu/~benc/066-many-20090312-0550-z7jxehb4.log
mike at blabla tmp$ wget
http://www.ci.uchicago.edu/~benc/066-many-20090312-0550-z7jxehb4.log
--2009-03-15 19:43:28--
http://www.ci.uchicago.edu/~benc/066-many-20090312-0550-z7jxehb4.log
Resolving www.ci.uchicago.edu... 128.135.125.142
Connecting to www.ci.uchicago.edu|128.135.125.142|:80... connected.
HTTP request sent, awaiting response... 404 Not Found
2009-03-15 19:43:28 ERROR 404: Not Found.
mike at blabla tmp$
From benc at hawaga.org.uk Mon Mar 16 03:54:36 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 16 Mar 2009 08:54:36 +0000 (GMT)
Subject: [Swift-devel] scalability updates
In-Reply-To: <1237164251.20046.2.camel@localhost>
References: <1236630300.16421.17.camel@localhost>
<1236713041.22637.0.camel@localhost>
<1236778692.20470.0.camel@localhost>
<1236795503.15465.1.camel@localhost>
<1236805031.7892.0.camel@localhost>
<1237164251.20046.2.camel@localhost>
Message-ID:
On Sun, 15 Mar 2009, Mihael Hategan wrote:
> mike at blabla tmp$ wget
> http://www.ci.uchicago.edu/~benc/066-many-20090312-0550-z7jxehb4.log
> --2009-03-15 19:43:28--
oops. Try ~benc/tmp/
--
From benc at hawaga.org.uk Mon Mar 16 03:56:11 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 16 Mar 2009 08:56:11 +0000 (GMT)
Subject: [Swift-devel] scalability updates
In-Reply-To: <1237164082.20046.0.camel@localhost>
References: <1236630300.16421.17.camel@localhost>
<1236713041.22637.0.camel@localhost>
<1236778692.20470.0.camel@localhost>
<1236795503.15465.1.camel@localhost>
<1236805031.7892.0.camel@localhost>
<1237164082.20046.0.camel@localhost>
Message-ID:
On Sun, 15 Mar 2009, Mihael Hategan wrote:
> The jobs produce no data. What exactly should be in the restart log
> you'd say?
oh yes, I forgot that they need an output variable for that.
Procedures with no outputs are weird - they aren't restartable and don't
get optimised away as already having all (0 of) their outputs there.
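
A toy model of that behaviour, assuming the restart log is just a set of
names of outputs already produced. The function can_skip and the dataset
names are hypothetical, and the real restart logic may differ.

```python
def can_skip(outputs, logged):
    """Decide whether an invocation can be skipped on restart.

    outputs: names of the datasets the invocation produces
    logged:  set of dataset names already recorded in the restart log
    """
    # An invocation with no outputs leaves nothing in the log to match,
    # so in this model it is always re-run.
    if not outputs:
        return False
    return all(name in logged for name in outputs)


logged = {"out.0001", "out.0002"}
print(can_skip(["out.0001"], logged))  # True
print(can_skip(["out.0003"], logged))  # False
print(can_skip([], logged))            # False
```

Without the explicit empty check, all() over an empty list is True in
Python, which would correspond to the "already having all (0 of) their
outputs" case that does not apply here.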
--
From benc at hawaga.org.uk Mon Mar 16 07:50:09 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 16 Mar 2009 12:50:09 +0000 (GMT)
Subject: [Swift-devel] Swift 0.9 release for ~2nd April
Message-ID:
I'd like to put out the Swift 0.9 release on the 2nd of April, with the
release candidate being made from SVN on the 23rd of March.
Things that have been screwed around with since 0.8 that aren't getting
substantial automated testing are coasters, the PBS provider and the
log-processing code. So effort to test (or automate tests for) those in
the next few weeks would be good.
--
From hategan at mcs.anl.gov Mon Mar 16 10:18:54 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 16 Mar 2009 10:18:54 -0500
Subject: [Swift-devel] scalability updates
In-Reply-To:
References: <1236630300.16421.17.camel@localhost>
<1236713041.22637.0.camel@localhost>
<1236778692.20470.0.camel@localhost>
<1236795503.15465.1.camel@localhost>
<1236805031.7892.0.camel@localhost>
<1237164082.20046.0.camel@localhost>
Message-ID: <1237216734.3397.2.camel@localhost>
On Mon, 2009-03-16 at 08:56 +0000, Ben Clifford wrote:
> On Sun, 15 Mar 2009, Mihael Hategan wrote:
>
> > The jobs produce no data. What exactly should be in the restart log
> > you'd say?
>
> oh yes, I forgot that they need an output variable for that.
>
> Procedures with no outputs are weird - they aren't restartable and don't
> get optimised away as already having all (0 of) their outputs there.
Do they even exist? :)
>
From wilde at mcs.anl.gov Mon Mar 16 13:21:28 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 16 Mar 2009 13:21:28 -0500
Subject: [Swift-devel] Coaster run fails from communicado to ranger
Message-ID: <49BE98A8.5060109@mcs.anl.gov>
I'm trying a simple test script to ranger, from communicado, using
latest svn rev (swift-r2692 cog-r2329).
I get the error below (/bin/bash: line 39: eval: --: invalid option)
Sites file is:
/share/home/00306/tg455797/swiftwork
1
8
00:01:00
TG-CCR080022N
16
--
Output is:
Swift svn swift-r2692 cog-r2329
RunID: 20090316-1220-kfipom0f
Progress:
Progress: Stage in:1
Progress: Submitted:1
Failed to transfer wrapper log from cat-20090316-1220-kfipom0f/info/r on
ranger
Progress: Failed:1
Execution failed:
Exception in cat:
Arguments: [data.txt]
Host: ranger
Directory: cat-20090316-1220-kfipom0f/jobs/r/cat-rsrhc18j
stderr.txt:
stdout.txt:
----
Caused by:
Could not submit job
Caused by:
Could not start coaster service
Caused by:
Task ended before registration was received.
STDOUT: /bin/bash: line 39: eval: --: invalid option
eval: usage: eval [arg ...]
STDERR: null
Cleaning up...
Done
From aespinosa at cs.uchicago.edu Mon Mar 16 14:13:04 2009
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Mon, 16 Mar 2009 14:13:04 -0500
Subject: [Swift-devel] "any valid host for task" in Swift + deef provider
Message-ID: <50b07b4b0903161213r292fb025o1bc52a34608192d@mail.gmail.com>
Hi,
I'm using swift r2682, cogkit 2326 and provider-deef 2507
RunID: 20090316-1354-ocn573c3
Progress:
Execution failed:
Could not find any valid host for task "Task(type=UNKNOWN,
identity=urn:cog-1237229648327)" with constraints {tr=hostname,
filenames=[Ljava.lang.String;@14aa453, trfqn=hostname,
filecache=org.griphyn.vdl.karajan.lib.cache.CacheMapAdapter at 16a4aef}
Sites.xml:
/work/01035/tg802895/swift-runs
The run did not initialize the work directory.
-Allan
From hategan at mcs.anl.gov Mon Mar 16 14:24:35 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 16 Mar 2009 14:24:35 -0500
Subject: [Swift-devel] "any valid host for task" in Swift + deef provider
In-Reply-To: <50b07b4b0903161213r292fb025o1bc52a34608192d@mail.gmail.com>
References: <50b07b4b0903161213r292fb025o1bc52a34608192d@mail.gmail.com>
Message-ID: <1237231475.8617.12.camel@localhost>
Can you post your tc.data?
On Mon, 2009-03-16 at 14:13 -0500, Allan Espinosa wrote:
> Hi,
>
> I'm using swift r2682, cogkit 2326 and provider-deef 2507
>
> RunID: 20090316-1354-ocn573c3
> Progress:
> Execution failed:
> Could not find any valid host for task "Task(type=UNKNOWN,
> identity=urn:cog-1237229648327)" with constraints {tr=hostname,
> filenames=[Ljava.lang.String;@14aa453, trfqn=hostname,
> filecache=org.griphyn.vdl.karajan.lib.cache.CacheMapAdapter at 16a4aef}
>
>
> Sites.xml:
>
>
>
> url="http://129.114.102.179:50001/wsrf/services/GenericPortal/core/WS/GPFactoryService"/>
> /work/01035/tg802895/swift-runs
>
>
>
> The run did not initialize the work directory.
>
> -Allan
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
From hategan at mcs.anl.gov Mon Mar 16 14:32:19 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 16 Mar 2009 14:32:19 -0500
Subject: [Swift-devel] Coaster run fails from communicado to ranger
In-Reply-To: <49BE98A8.5060109@mcs.anl.gov>
References: <49BE98A8.5060109@mcs.anl.gov>
Message-ID: <1237231939.10524.1.camel@localhost>
On Mon, 2009-03-16 at 13:21 -0500, Michael Wilde wrote:
> I'm trying a simple test script to ranger, from communicado, using
> latest svn rev (swift-r2692 cog-r2329).
>
> I get the error below (/bin/bash: line 39: eval: --: invalid option)
That's unfortunately because "/bin/bash -l -c 'which wget'" returns:
--------------------- Project balances for user tg455678
---------------------- | Name Avail SUs Expires | Name Avail SUs Expires
|
| TG-CCR080022N 7008 2009-05-31 | TG-DBS080004N 999508 2009-12-31 |
------------------------ Disk quotas for user tg455678 ----------
-------------- | Disk Usage (GB) Limit %Used File Usage Limit %Used |
| /share 3.8 6 64.13 3086 100000 3.09 | ------------------------
------------------------------------------------------- /usr/bin/wget
(with slight variations).
I guess another strategy is needed here.
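To illustrate the failure: the login shell prints its banner on stdout ahead of the path, so the captured text starts with dashes, and eval then reads that as an option. Below is a minimal Python sketch of one possible workaround. It assumes the path is always the last whitespace-separated token. extract_tool_path is a hypothetical helper, not part of the coaster code, and the banner is abbreviated from the output above.

```python
from typing import Optional


def extract_tool_path(captured: str) -> Optional[str]:
    """Return the trailing absolute path from noisy shell output, if any."""
    tokens = captured.split()
    # Only trust the very last token; banner text such as "/share" also
    # starts with a slash, so scanning further back would be unsafe.
    if tokens and tokens[-1].startswith("/"):
        return tokens[-1]
    return None


# Abbreviated from the banner quoted in this message.
noisy = (
    "--------------------- Project balances for user tg455678\n"
    "| /share 3.8 6 64.13 3086 100000 3.09 |\n"
    "------------------------------------------------------- /usr/bin/wget\n"
)

print(extract_tool_path(noisy))
print(extract_tool_path("| /share 3.8 6 64.13 |"))
```

This kind of filtering is only a heuristic: it depends on the banner layout, which varies between sites and users.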
>
> Sites file is:
>
>
>
> url="gatekeeper.ranger.tacc.teragrid.org" jobManager="gt2:gt2:SGE"/>
>
> /share/home/00306/tg455797/swiftwork
> 1
> 8
> 00:01:00
> TG-CCR080022N
> 16
>
>
>
> --
>
> Output is:
>
> Swift svn swift-r2692 cog-r2329
>
> RunID: 20090316-1220-kfipom0f
> Progress:
> Progress: Stage in:1
> Progress: Submitted:1
> Failed to transfer wrapper log from cat-20090316-1220-kfipom0f/info/r on
> ranger
> Progress: Failed:1
> Execution failed:
> Exception in cat:
> Arguments: [data.txt]
> Host: ranger
> Directory: cat-20090316-1220-kfipom0f/jobs/r/cat-rsrhc18j
> stderr.txt:
>
> stdout.txt:
>
> ----
>
> Caused by:
> Could not submit job
> Caused by:
> Could not start coaster service
> Caused by:
> Task ended before registration was received.
> STDOUT: /bin/bash: line 39: eval: --: invalid option
> eval: usage: eval [arg ...]
>
> STDERR: null
> Cleaning up...
> Done
From aespinosa at cs.uchicago.edu Mon Mar 16 14:44:03 2009
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Mon, 16 Mar 2009 14:44:03 -0500
Subject: [Swift-devel] Coaster run fails from communicado to ranger
In-Reply-To: <1237231939.10524.1.camel@localhost>
References: <49BE98A8.5060109@mcs.anl.gov> <1237231939.10524.1.camel@localhost>
Message-ID: <50b07b4b0903161244p311666d2jb9764897ca92b187@mail.gmail.com>
What I did before was hardwire wget, md5sum and other binaries needed
for coasters because the environment does not like you doing a "bash
-l". You get access to TTY errors.
-Allan
On Mon, Mar 16, 2009 at 2:32 PM, Mihael Hategan wrote:
> On Mon, 2009-03-16 at 13:21 -0500, Michael Wilde wrote:
>> I'm trying a simple test script to ranger, from communicado, using
>> latest svn rev (swift-r2692 cog-r2329).
>>
>> I get the error below (/bin/bash: line 39: eval: --: invalid option)
>
> That's unfortunately because "/bin/bash -l -c 'which wget'" returns:
> --------------------- Project balances for user tg455678
> ---------------------- | Name Avail SUs Expires | Name Avail SUs Expires
> |
> | TG-CCR080022N 7008 2009-05-31 | TG-DBS080004N 999508 2009-12-31 |
> ------------------------ Disk quotas for user tg455678 ----------
> -------------- | Disk Usage (GB) Limit %Used File Usage Limit %Used |
> | /share 3.8 6 64.13 3086 100000 3.09 | ------------------------
> ------------------------------------------------------- /usr/bin/wget
>
> (with slight variations).
>
> I guess another strategy is needed here.
>
>>
>> Sites file is:
>>
>>
>>
>> > url="gatekeeper.ranger.tacc.teragrid.org" jobManager="gt2:gt2:SGE"/>
>>
>> /share/home/00306/tg455797/swiftwork
>> 1
>> 8
>> 00:01:00
>> TG-CCR080022N
>> 16
>>
>>
>>
>> --
>>
>> Output is:
>>
>> Swift svn swift-r2692 cog-r2329
>>
>> RunID: 20090316-1220-kfipom0f
>> Progress:
>> Progress: Stage in:1
>> Progress: Submitted:1
>> Failed to transfer wrapper log from cat-20090316-1220-kfipom0f/info/r on
>> ranger
>> Progress: Failed:1
>> Execution failed:
>> Exception in cat:
>> Arguments: [data.txt]
>> Host: ranger
>> Directory: cat-20090316-1220-kfipom0f/jobs/r/cat-rsrhc18j
>> stderr.txt:
>>
>> stdout.txt:
>>
>> ----
>>
>> Caused by:
>> Could not submit job
>> Caused by:
>> Could not start coaster service
>> Caused by:
>> Task ended before registration was received.
>> STDOUT: /bin/bash: line 39: eval: --: invalid option
>> eval: usage: eval [arg ...]
>>
>> STDERR: null
>> Cleaning up...
>> Done
--
Allan M. Espinosa
PhD student, Computer Science
University of Chicago
From hategan at mcs.anl.gov Mon Mar 16 15:08:56 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 16 Mar 2009 15:08:56 -0500
Subject: [Swift-devel] Coaster run fails from communicado to ranger
In-Reply-To: <50b07b4b0903161244p311666d2jb9764897ca92b187@mail.gmail.com>
References: <49BE98A8.5060109@mcs.anl.gov> <1237231939.10524.1.camel@localhost>
<50b07b4b0903161244p311666d2jb9764897ca92b187@mail.gmail.com>
Message-ID: <1237234136.10524.9.camel@localhost>
On Mon, 2009-03-16 at 14:44 -0500, Allan Espinosa wrote:
> What I did before was hardwire wget, md5sum and other binaries needed
> for coasters because the environment does not like you doing a "bash
> -l" . You get access to TTY errors.
That's stty. Something that is used to configure the terminal, and
doesn't work well in non-terminals. But by itself it is a benign issue.
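A common guard for this kind of problem is to configure the terminal only when the descriptor really is one. The Python sketch below shows the check. is_terminal is a hypothetical name, and the pipe is only a stand-in for the non-interactive channel a remote job runs under.

```python
import os


def is_terminal(fd: int) -> bool:
    """True when the file descriptor is attached to a terminal."""
    return os.isatty(fd)


# A pipe stands in for the non-interactive channel a remote job runs under.
read_end, write_end = os.pipe()
try:
    pipe_is_tty = is_terminal(read_end)
finally:
    os.close(read_end)
    os.close(write_end)

print(pipe_is_tty)
```

Terminal setup commands such as stty would be skipped whenever the check returns False.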
>
> -Allan
>
> On Mon, Mar 16, 2009 at 2:32 PM, Mihael Hategan wrote:
> > On Mon, 2009-03-16 at 13:21 -0500, Michael Wilde wrote:
> >> I'm trying a simple test script to ranger, from communicado, using
> >> latest svn rev (swift-r2692 cog-r2329).
> >>
> >> I get the error below (/bin/bash: line 39: eval: --: invalid option)
> >
> > That's unfortunately because "/bin/bash -l -c 'which wget'" returns:
> > --------------------- Project balances for user tg455678
> > ---------------------- | Name Avail SUs Expires | Name Avail SUs Expires
> > |
> > | TG-CCR080022N 7008 2009-05-31 | TG-DBS080004N 999508 2009-12-31 |
> > ------------------------ Disk quotas for user tg455678 ----------
> > -------------- | Disk Usage (GB) Limit %Used File Usage Limit %Used |
> > | /share 3.8 6 64.13 3086 100000 3.09 | ------------------------
> > ------------------------------------------------------- /usr/bin/wget
> >
> > (with slight variations).
> >
> > I guess another strategy is needed here.
> >
> >>
> >> Sites file is:
> >>
> >>
> >>
> >> >> url="gatekeeper.ranger.tacc.teragrid.org" jobManager="gt2:gt2:SGE"/>
> >>
> >> /share/home/00306/tg455797/swiftwork
> >> 1
> >> 8
> >> 00:01:00
> >> TG-CCR080022N
> >> 16
> >>
> >>
> >>
> >> --
> >>
> >> Output is:
> >>
> >> Swift svn swift-r2692 cog-r2329
> >>
> >> RunID: 20090316-1220-kfipom0f
> >> Progress:
> >> Progress: Stage in:1
> >> Progress: Submitted:1
> >> Failed to transfer wrapper log from cat-20090316-1220-kfipom0f/info/r on
> >> ranger
> >> Progress: Failed:1
> >> Execution failed:
> >> Exception in cat:
> >> Arguments: [data.txt]
> >> Host: ranger
> >> Directory: cat-20090316-1220-kfipom0f/jobs/r/cat-rsrhc18j
> >> stderr.txt:
> >>
> >> stdout.txt:
> >>
> >> ----
> >>
> >> Caused by:
> >> Could not submit job
> >> Caused by:
> >> Could not start coaster service
> >> Caused by:
> >> Task ended before registration was received.
> >> STDOUT: /bin/bash: line 39: eval: --: invalid option
> >> eval: usage: eval [arg ...]
> >>
> >> STDERR: null
> >> Cleaning up...
> >> Done
From hategan at mcs.anl.gov Mon Mar 16 16:50:11 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 16 Mar 2009 16:50:11 -0500
Subject: [Swift-devel] Coaster run fails from communicado to ranger
In-Reply-To: <1237231939.10524.1.camel@localhost>
References: <49BE98A8.5060109@mcs.anl.gov> <1237231939.10524.1.camel@localhost>
Message-ID: <1237240211.14096.1.camel@localhost>
I committed a fix in cog r2330. It now writes the output of "which" (run
under bash -l -c) to a temporary file instead of reading it from stdout.
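The idea can be sketched in Python as follows. This is not the code from cog r2330, only an illustration under stated assumptions: it uses /bin/sh and "command -v" in place of "bash -l" and "which" so that it runs on any POSIX system, the echo simulates a login banner, and which_via_tempfile is a hypothetical name.

```python
import os
import subprocess
import tempfile


def which_via_tempfile(tool: str, shell: str = "/bin/sh") -> str:
    """Resolve a tool's path without reading the shell's stdout."""
    fd, result_path = tempfile.mkstemp()
    os.close(fd)
    try:
        # The echo stands in for a login banner; only the lookup result
        # is redirected into the temporary file.
        script = 'echo "---- banner noise ----"; command -v "$1" > "$2"'
        subprocess.run(
            [shell, "-c", script, "lookup", tool, result_path],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
        with open(result_path) as handle:
            return handle.read().strip()
    finally:
        os.remove(result_path)


print(which_via_tempfile("ls"))
```

Because the lookup result never travels over stdout, whatever the profile scripts print there cannot end up in the value handed to eval.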
On Mon, 2009-03-16 at 14:32 -0500, Mihael Hategan wrote:
> On Mon, 2009-03-16 at 13:21 -0500, Michael Wilde wrote:
> > I'm trying a simple test script to ranger, from communicado, using
> > latest svn rev (swift-r2692 cog-r2329).
> >
> > I get the error below (/bin/bash: line 39: eval: --: invalid option)
>
> That's unfortunately because "/bin/bash -l -c 'which wget'" returns:
> --------------------- Project balances for user tg455678
> ---------------------- | Name Avail SUs Expires | Name Avail SUs Expires
> |
> | TG-CCR080022N 7008 2009-05-31 | TG-DBS080004N 999508 2009-12-31 |
> ------------------------ Disk quotas for user tg455678 ----------
> -------------- | Disk Usage (GB) Limit %Used File Usage Limit %Used |
> | /share 3.8 6 64.13 3086 100000 3.09 | ------------------------
> ------------------------------------------------------- /usr/bin/wget
>
> (with slight variations).
>
> I guess another strategy is needed here.
>
> >
> > Sites file is:
> >
> >
> >
> > > url="gatekeeper.ranger.tacc.teragrid.org" jobManager="gt2:gt2:SGE"/>
> >
> > /share/home/00306/tg455797/swiftwork
> > 1
> > 8
> > 00:01:00
> > TG-CCR080022N
> > 16
> >
> >
> >
> > --
> >
> > Output is:
> >
> > Swift svn swift-r2692 cog-r2329
> >
> > RunID: 20090316-1220-kfipom0f
> > Progress:
> > Progress: Stage in:1
> > Progress: Submitted:1
> > Failed to transfer wrapper log from cat-20090316-1220-kfipom0f/info/r on
> > ranger
> > Progress: Failed:1
> > Execution failed:
> > Exception in cat:
> > Arguments: [data.txt]
> > Host: ranger
> > Directory: cat-20090316-1220-kfipom0f/jobs/r/cat-rsrhc18j
> > stderr.txt:
> >
> > stdout.txt:
> >
> > ----
> >
> > Caused by:
> > Could not submit job
> > Caused by:
> > Could not start coaster service
> > Caused by:
> > Task ended before registration was received.
> > STDOUT: /bin/bash: line 39: eval: --: invalid option
> > eval: usage: eval [arg ...]
> >
> > STDERR: null
> > Cleaning up...
> > Done
From wilde at mcs.anl.gov Mon Mar 16 21:51:01 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 16 Mar 2009 21:51:01 -0500
Subject: [Swift-devel] Coaster run fails from communicado to ranger
In-Reply-To: <1237240211.14096.1.camel@localhost>
References: <49BE98A8.5060109@mcs.anl.gov>
<1237231939.10524.1.camel@localhost>
<1237240211.14096.1.camel@localhost>
Message-ID: <49BF1015.2010400@mcs.anl.gov>
With this rev I was able to submit a simple test job to ranger from a
swift script. The job never started, though, so I suspect I got some
queuing parameter wrong (or ranger is very congested), and I need to
debug further. But it certainly got past the problem that started this
thread. Thanks.
On 3/16/09 4:50 PM, Mihael Hategan wrote:
> I committed a fix in cog r2330. It uses a temporary file for the output
> of which from bash -l -c.
>
> On Mon, 2009-03-16 at 14:32 -0500, Mihael Hategan wrote:
>> On Mon, 2009-03-16 at 13:21 -0500, Michael Wilde wrote:
>>> I'm trying a simple test script to ranger, from communicado, using
>>> latest svn rev (swift-r2692 cog-r2329).
>>>
>>> I get the error below (/bin/bash: line 39: eval: --: invalid option)
>> That's unfortunately because "/bin/bash -l -c 'which wget'" returns:
>> --------------------- Project balances for user tg455678
>> ---------------------- | Name Avail SUs Expires | Name Avail SUs Expires
>> |
>> | TG-CCR080022N 7008 2009-05-31 | TG-DBS080004N 999508 2009-12-31 |
>> ------------------------ Disk quotas for user tg455678 ----------
>> -------------- | Disk Usage (GB) Limit %Used File Usage Limit %Used |
>> | /share 3.8 6 64.13 3086 100000 3.09 | ------------------------
>> ------------------------------------------------------- /usr/bin/wget
>>
>> (with slight variations).
>>
>> I guess another strategy is needed here.
>>
>>> Sites file is:
>>>
>>>
>>>
>>> >> url="gatekeeper.ranger.tacc.teragrid.org" jobManager="gt2:gt2:SGE"/>
>>>
>>> /share/home/00306/tg455797/swiftwork
>>> 1
>>> 8
>>> 00:01:00
>>> TG-CCR080022N
>>> 16
>>>
>>>
>>>
>>> --
>>>
>>> Output is:
>>>
>>> Swift svn swift-r2692 cog-r2329
>>>
>>> RunID: 20090316-1220-kfipom0f
>>> Progress:
>>> Progress: Stage in:1
>>> Progress: Submitted:1
>>> Failed to transfer wrapper log from cat-20090316-1220-kfipom0f/info/r on
>>> ranger
>>> Progress: Failed:1
>>> Execution failed:
>>> Exception in cat:
>>> Arguments: [data.txt]
>>> Host: ranger
>>> Directory: cat-20090316-1220-kfipom0f/jobs/r/cat-rsrhc18j
>>> stderr.txt:
>>>
>>> stdout.txt:
>>>
>>> ----
>>>
>>> Caused by:
>>> Could not submit job
>>> Caused by:
>>> Could not start coaster service
>>> Caused by:
>>> Task ended before registration was received.
>>> STDOUT: /bin/bash: line 39: eval: --: invalid option
>>> eval: usage: eval [arg ...]
>>>
>>> STDERR: null
>>> Cleaning up...
>>> Done
From benc at hawaga.org.uk Tue Mar 17 05:29:57 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 17 Mar 2009 10:29:57 +0000 (GMT)
Subject: [Swift-devel] another attempt at a getting started with provenance
section (fwd)
Message-ID:
I've fiddled a lot with the provenance db code to make installation and
configuration easier; and the associated docbook page at
http://www.ci.uchicago.edu/~benc/provenance.html to have more of a focus
on running your own db (in either sqlite3 for a personal-sized db or in
postgres for a larger db).
Section 2 of the above web page gives notes on importing your own log
files into a database of your choosing, and section 3 gives some notes on
query commands that I implemented a while ago and fixed up yesterday.
I also added some more commentary in the SQL schema, prov-init.sql, in the
provenancedb checkout, to help with creating your own queries.
If you're going to play with this, I recommend starting with the sqlite3
mode - that provides a substantially easier to administer low-end database
compared to postgres.
I think this is basically the form I want the provenance db to have for
the next few months. I plan on adding more information (i.e. more tables
and more columns) and functionality, but largely in a backwards compatible
way.
--
From hategan at mcs.anl.gov Tue Mar 17 11:30:10 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 17 Mar 2009 11:30:10 -0500
Subject: [Swift-devel] testing
Message-ID: <1237307410.26969.8.camel@localhost>
So I think that, at this point, if we're serious about running things
production-style on teragrid or osg, an appropriate testing effort is
required.
In the past our hands were tied due to our inability to fix and deploy
gram issues on both those places. With the shift towards coasters and
local providers, we have, at least in theory, overcome the issue.
However, in order for it to also be in practice, we need to make sure
that things actually work, and that can only be done with testing or a
magic wand. I don't have the latter, so we'll have to do testing.
There are probably a few issues still left to address, one of which is
to make sure that coasters are an acceptable way of running things on
OSG. I suspect this would require some negotiation with the right people
from OSG, and I don't know who those people are.
From wilde at mcs.anl.gov Tue Mar 17 12:04:12 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 17 Mar 2009 12:04:12 -0500
Subject: [Swift-devel] testing
In-Reply-To: <1237307410.26969.8.camel@localhost>
References: <1237307410.26969.8.camel@localhost>
Message-ID: <49BFD80C.7080903@mcs.anl.gov>
On 3/17/09 11:30 AM, Mihael Hategan wrote:
> So I think that, at this point, if we're serious about running things
> production-style on teragrid or osg, an appropriate testing effort is
> required.
I agree.
>
> In the past our hands were tied due to our inability to fix and deploy
> gram issues on both those places. With the shift towards coasters and
> local providers, we have, at least in theory, overcome the issue.
I think your new Condor provider provides the hopefully final missing piece.
> However, in order for it to also be in practice, we need to make sure
> that things actually work, and that can only be done with testing or a
> magic wand. I don't have the latter, so we'll have to do testing.
yes.
> There are probably a few issues still left to address, one of which is
> to make sure that coasters are an acceptable way of running things on
> OSG. I suspect this would require some negotiation with the right people
> from OSG, and I don't know who those people are.
I do: It's Ruth, and several people in various working groups whose names
I can gather and send out. I'll make the initial contacts.
What I need is test data from runs at various scales that prove Swift is
fast, scalable, and "safe" (i.e. doesn't harm things).
This is all coming together well. For example, a user (Glen Hocky) was
able to run ZHangiong's "ADEM" installer to push OOPS to 5-8 OSG sites
and then run a swift workflow using them. A good test effort would allow
us to expand that to a great success story.
I'd like to see this effort build on the OSG site list scripts, and
equivalent REST-based scripts that are now available at info.teragrid.org.
I think you should make this your next focus, Mihael, and just get
started; then as you go we can gradually through discussion align this
into a testing effort that really opens up OSG and TG to Swift users.
That'll be a great step.
- Mike
From zhaozhang at uchicago.edu Tue Mar 17 12:14:29 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Tue, 17 Mar 2009 12:14:29 -0500
Subject: [Swift-devel] How does swift know if a task is successful
In-Reply-To: <1236982413.13026.1.camel@localhost>
References: <49BAD926.1030607@uchicago.edu>
<1236982413.13026.1.camel@localhost>
Message-ID: <49BFDA75.9070803@uchicago.edu>
Here is another question: is there a setting I can use to
disable Swift's waiting-for-data feature?
Or is there a way to trick Swift into believing that the data is already
there? Thanks.
zhao
Mihael Hategan wrote:
> On Fri, 2009-03-13 at 17:07 -0500, Zhao Zhang wrote:
>
>> Hi, All
>>
>> I have a question on how swift knows if a task is successful.
>> In my case, I am using a status notification instead of a status file.
>>
>> So my question is is this status notification the only thing swift is
>> waiting for, or is swift also waiting for the output data to appear to
>> say that a job is successful?
>>
>
> Once the job is done, swift will attempt to stage out all the files that
> it expects the job to have produced.
>
> Should one of those files not be there, there will be failures.
>
>
>
>
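The rule described in the quoted reply (a job only counts as successful if every file Swift expects from it can be staged out) can be sketched in Python as below. This is an illustration of the rule only, not Swift's implementation. missing_outputs, job_succeeded and the file names are hypothetical.

```python
import os
import tempfile
from typing import List


def missing_outputs(job_dir: str, expected: List[str]) -> List[str]:
    """List the expected output files that the job did not produce."""
    return [name for name in expected
            if not os.path.exists(os.path.join(job_dir, name))]


def job_succeeded(notified_ok: bool, job_dir: str, expected: List[str]) -> bool:
    """Success needs both a good status notification and all expected files."""
    return notified_ok and not missing_outputs(job_dir, expected)


with tempfile.TemporaryDirectory() as job_dir:
    # Create one of the two files the second check will expect.
    open(os.path.join(job_dir, "hello.txt"), "w").close()
    complete = job_succeeded(True, job_dir, ["hello.txt"])
    incomplete = job_succeeded(True, job_dir, ["hello.txt", "other.txt"])

print(complete, incomplete)
```

Under this rule a successful status notification is necessary but not sufficient: one missing expected file is enough to fail the job.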
From hategan at mcs.anl.gov Tue Mar 17 12:18:55 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 17 Mar 2009 12:18:55 -0500
Subject: [Swift-devel] testing
In-Reply-To: <49BFD80C.7080903@mcs.anl.gov>
References: <1237307410.26969.8.camel@localhost> <49BFD80C.7080903@mcs.anl.gov>
Message-ID: <1237310335.29738.8.camel@localhost>
On Tue, 2009-03-17 at 12:04 -0500, Michael Wilde wrote:
> On 3/17/09 11:30 AM, Mihael Hategan wrote:
> > So I think that, at this point, if we're serious about running things
> > production-style on teragrid or osg, an appropriate testing effort is
> > required.
>
> I agree.
> >
> > In the past our hands were tied due to our inability to fix and deploy
> > gram issues on both those places. With the shift towards coasters and
> > local providers, we have, at least in theory, overcome the issue.
>
> I think your new Condor provider provides the hopefully final missing piece.
There are the LSF and SGE providers still to be done ;)
>
> > However, in order for it to also be in practice, we need to make sure
> > that things actually work, and that can only be done with testing or a
> > magic wand. I don't have the latter, so we'll have to do testing.
>
> yes.
>
> > There are probably a few issues still left to address, one of which is
> > to make sure that coasters are an acceptable way of running things on
> > OSG. I suspect this would require some negotiation with the right people
> > from OSG, and I don't know who those people are.
>
> I do: It's Ruth, and several people in various working groups whose names
> I can gather and send out. I'll make the initial contacts.
Ok.
>
> What I need are test data from various scale runs that prove Swift is
> fast, scalable, and "safe" (i.e. doesn't harm things).
This isn't in particular a swift issue, but a coaster issue. There is no
proof of safety, and swift being fast, scalable, and safe comes after
this testing, not before. But I would like to "negotiate" the ability
to:
- have one process on the head node, hopefully one that doesn't hog it.
- the ability to submit from the head node to the queuing system
directly (as if running qsub manually - something that isn't exactly
"the way" on OSG).
From hategan at mcs.anl.gov Tue Mar 17 12:20:55 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 17 Mar 2009 12:20:55 -0500
Subject: [Swift-devel] How does swift know if a task is successful
In-Reply-To: <49BFDA75.9070803@uchicago.edu>
References: <49BAD926.1030607@uchicago.edu>
<1236982413.13026.1.camel@localhost> <49BFDA75.9070803@uchicago.edu>
Message-ID: <1237310455.29738.11.camel@localhost>
On Tue, 2009-03-17 at 12:14 -0500, Zhao Zhang wrote:
> Here comes another question, is there any place that I could set to
> disable swift's waiting for data feature?
Do you mean disable the stage-outs?
> Or is there any way for me to cheat swift that the data is already
> there? thanks.
>
> zhao
>
> Mihael Hategan wrote:
> > On Fri, 2009-03-13 at 17:07 -0500, Zhao Zhang wrote:
> >
> >> Hi, All
> >>
> >> I have a question on how swift knows if a task is successful.
> >> In my case, I am using a status notification instead of a status file.
> >>
> >> So my question is is this status notification the only thing swift is
> >> waiting for, or is swift also waiting for the output data to appear to
> >> say that a job is successful?
> >>
> >
> > Once the job is done, swift will attempt to stage out all the files that
> > it expects the job to have produced.
> >
> > Should one of those files not be there, there will be failures.
> >
> >
> >
> >
From zhaozhang at uchicago.edu Tue Mar 17 12:23:04 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Tue, 17 Mar 2009 12:23:04 -0500
Subject: [Swift-devel] How does swift know if a task is successful
In-Reply-To: <1237310455.29738.11.camel@localhost>
References: <49BAD926.1030607@uchicago.edu>
<1236982413.13026.1.camel@localhost>
<49BFDA75.9070803@uchicago.edu>
<1237310455.29738.11.camel@localhost>
Message-ID: <49BFDC78.8040506@uchicago.edu>
Hi, Mihael
yes, can I do that?
zhao
Mihael Hategan wrote:
> On Tue, 2009-03-17 at 12:14 -0500, Zhao Zhang wrote:
>
>> Here comes another question, is there any place that I could set to
>> disable swift's waiting for data feature?
>>
>
> Do you mean disable the stage-outs?
>
>
>> Or is there any way for me to cheat swift that the data is already
>> there? thanks.
>>
>> zhao
>>
>> Mihael Hategan wrote:
>>
>>> On Fri, 2009-03-13 at 17:07 -0500, Zhao Zhang wrote:
>>>
>>>
>>>> Hi, All
>>>>
>>>> I have a question on how swift knows if a task is successful.
>>>> In my case, I am using a status notification instead of a status file.
>>>>
>>>> So my question is is this status notification the only thing swift is
>>>> waiting for, or is swift also waiting for the output data to appear to
>>>> say that a job is successful?
>>>>
>>>>
>>> Once the job is done, swift will attempt to stage out all the files that
>>> it expects the job to have produced.
>>>
>>> Should one of those files not be there, there will be failures.
>>>
>>>
>>>
>>>
>>>
>
>
>
From hategan at mcs.anl.gov Tue Mar 17 12:29:07 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 17 Mar 2009 12:29:07 -0500
Subject: [Swift-devel] How does swift know if a task is successful
In-Reply-To: <49BFDC78.8040506@uchicago.edu>
References: <49BAD926.1030607@uchicago.edu>
<1236982413.13026.1.camel@localhost> <49BFDA75.9070803@uchicago.edu>
<1237310455.29738.11.camel@localhost> <49BFDC78.8040506@uchicago.edu>
Message-ID: <1237310948.30064.2.camel@localhost>
On Tue, 2009-03-17 at 12:23 -0500, Zhao Zhang wrote:
> Hi, Mihael
>
> yes, can I do that?
You should know this by now:
in vdl-int.k, in doStageout, comment out the task:transfer invocation
(and dir:make).
>
> zhao
>
> Mihael Hategan wrote:
> > On Tue, 2009-03-17 at 12:14 -0500, Zhao Zhang wrote:
> >
> >> Here comes another question, is there any place that I could set to
> >> disable swift's waiting for data feature?
> >>
> >
> > Do you mean disable the stage-outs?
> >
> >
> >> Or is there any way for me to cheat swift that the data is already
> >> there? thanks.
> >>
> >> zhao
> >>
> >> Mihael Hategan wrote:
> >>
> >>> On Fri, 2009-03-13 at 17:07 -0500, Zhao Zhang wrote:
> >>>
> >>>
> >>>> Hi, All
> >>>>
> >>>> I have a question on how swift knows if a task is successful.
> >>>> In my case, I am using a status notification instead of a status file.
> >>>>
> >>>> So my question is is this status notification the only thing swift is
> >>>> waiting for, or is swift also waiting for the output data to appear to
> >>>> say that a job is successful?
> >>>>
> >>>>
> >>> Once the job is done, swift will attempt to stage out all the files that
> >>> it expects the job to have produced.
> >>>
> >>> Should one of those files not be there, there will be failures.
> >>>
> >>>
> >>>
> >>>
> >>>
> >
> >
> >
From zhaozhang at uchicago.edu Tue Mar 17 12:31:31 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Tue, 17 Mar 2009 12:31:31 -0500
Subject: [Swift-devel] How does swift know if a task is successful
In-Reply-To: <1237310948.30064.2.camel@localhost>
References: <49BAD926.1030607@uchicago.edu>
<1236982413.13026.1.camel@localhost>
<49BFDA75.9070803@uchicago.edu>
<1237310455.29738.11.camel@localhost>
<49BFDC78.8040506@uchicago.edu>
<1237310948.30064.2.camel@localhost>
Message-ID: <49BFDE73.2070600@uchicago.edu>
ok, thanks, I will try it out.
zhao
Mihael Hategan wrote:
> On Tue, 2009-03-17 at 12:23 -0500, Zhao Zhang wrote:
>
>> Hi, Mihael
>>
>> yes, can I do that?
>>
>
> You should know this by now:
> in vdl-int.k, in doStageout, comment out the task:transfer invocation
> (and dir:make).
>
>
>> zhao
>>
>> Mihael Hategan wrote:
>>
>>> On Tue, 2009-03-17 at 12:14 -0500, Zhao Zhang wrote:
>>>
>>>
>>>> Here comes another question, is there any place that I could set to
>>>> disable swift's waiting for data feature?
>>>>
>>>>
>>> Do you mean disable the stage-outs?
>>>
>>>
>>>
>>>> Or is there any way for me to cheat swift that the data is already
>>>> there? thanks.
>>>>
>>>> zhao
>>>>
>>>> Mihael Hategan wrote:
>>>>
>>>>
>>>>> On Fri, 2009-03-13 at 17:07 -0500, Zhao Zhang wrote:
>>>>>
>>>>>
>>>>>
>>>>>> Hi, All
>>>>>>
>>>>>> I have a question on how swift knows if a task is successful.
>>>>>> In my case, I am using a status notification instead of a status file.
>>>>>>
>>>>>> So my question is is this status notification the only thing swift is
>>>>>> waiting for, or is swift also waiting for the output data to appear to
>>>>>> say that a job is successful?
>>>>>>
>>>>>>
>>>>>>
>>>>> Once the job is done, swift will attempt to stage out all the files that
>>>>> it expects the job to have produced.
>>>>>
>>>>> Should one of those files not be there, there will be failures.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>
>>>
>
>
>
From zhaozhang at uchicago.edu Tue Mar 17 13:36:32 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Tue, 17 Mar 2009 13:36:32 -0500
Subject: [Swift-devel] How does swift know if a task is successful
In-Reply-To: <1237310948.30064.2.camel@localhost>
References: <49BAD926.1030607@uchicago.edu>
<1236982413.13026.1.camel@localhost>
<49BFDA75.9070803@uchicago.edu>
<1237310455.29738.11.camel@localhost>
<49BFDC78.8040506@uchicago.edu>
<1237310948.30064.2.camel@localhost>
Message-ID: <49BFEDB0.5070409@uchicago.edu>
Hi, Mihael
I commented out the following lines:
/*dir:make(ldir)
restartOnError(".*", 2
task:transfer(srchost=host, srcfile=bname,
srcdir=rdir, destdir=ldir, desthost=dhost, destprovider=provider)
)*/
Then I modified wrapper.sh so that it does not copy the output file back, but I
still got an error.
The log file is at
http://www.ci.uchicago.edu/~zzhang/first-20090317-1327-oqgttus8.log
Thanks
zhao
zzhang at login6.surveyor:~/new_dock6> ./run_swift_ssh.sh 1117 64 first.swift
waiting for at least 64 nodes to register before submitting workload...
waiting to find at least 1 services in file /home/falkon/users/zzhang/1117/config/Client-service-URIs.config...
all done, file has found at least 1 services
found at least 64 registered, submitting workload...
Swift svn swift-r2676 (swift modified locally) cog-r2305
RunID: 20090317-1327-oqgttus8
Progress:
Progress: Selecting site:1 Stage in:1
Progress: Submitting:1 Submitted:1
Progress: Submitted:1 Failed but can retry:1
Failed to transfer wrapper log from first-20090317-1327-oqgttus8/info/b/n/bgp000
Progress: Submitted:1 Active:1
Failed to transfer wrapper log from first-20090317-1327-oqgttus8/info/e/n/bgp000
Progress: Submitted:1 Active:1
Failed to transfer wrapper log from first-20090317-1327-oqgttus8/info/g/n/bgp000
Execution failed:
Exception in echo:
Arguments: [Hello, world!]
Host: bgp000
Directory: first-20090317-1327-oqgttus8/jobs/g/n/echo-gnlq238j
stderr.txt:
stdout.txt:
----
Caused by:
Cannot transfer "/tmp/first-20090317-1327-oqgttus8/shared/hello.txt" to "/gpfs/home/zzhang/new_dock6/./hello.txt"
Caused by:
No such file
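The "No such file" failure above matches the behaviour described earlier in the thread: once the job is done, every expected output file is staged out, and a missing file is a failure. The following is a toy model of that rule in Python, not Swift's implementation. The names stage_out and StageOutError are made up for illustration. It shows why a wrapper that no longer copies hello.txt back still fails for as long as the stage-out step itself runs.

```python
import shutil
from pathlib import Path


class StageOutError(Exception):
    """Raised when an expected output file cannot be staged out (hypothetical name)."""


def stage_out(expected, shared_dir, dest_dir):
    """Copy each expected output from shared_dir to dest_dir.

    Fails on the first expected file that is not present in shared_dir.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    for name in expected:
        src = Path(shared_dir) / name
        if not src.is_file():
            raise StageOutError(
                f'Cannot transfer "{src}" to "{dest / name}": No such file'
            )
        shutil.copy(src, dest / name)
```

In this model the only ways to avoid the error are to skip the stage_out call entirely or to make sure the file exists in the shared directory before it runs.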
Mihael Hategan wrote:
> On Tue, 2009-03-17 at 12:23 -0500, Zhao Zhang wrote:
>
>> Hi, Mihael
>>
>> yes, can I do that?
>>
>
> You should know this by now:
> in vdl-int.k, in doStageout, comment out the task:transfer invocation
> (and dir:make).
>
>
>> zhao
>>
>> Mihael Hategan wrote:
>>
>>> On Tue, 2009-03-17 at 12:14 -0500, Zhao Zhang wrote:
>>>
>>>
>>>> Here comes another question, is there any place that I could set to
>>>> disable swift's waiting for data feature?
>>>>
>>>>
>>> Do you mean disable the stage-outs?
>>>
>>>
>>>
>>>> Or is there any way for me to cheat swift that the data is already
>>>> there? thanks.
>>>>
>>>> zhao
>>>>
>>>> Mihael Hategan wrote:
>>>>
>>>>
>>>>> On Fri, 2009-03-13 at 17:07 -0500, Zhao Zhang wrote:
>>>>>
>>>>>
>>>>>
>>>>>> Hi, All
>>>>>>
>>>>>> I have a question on how swift knows if a task is successful.
>>>>>> In my case, I am using a status notification instead of a status file.
>>>>>>
>>>>>> So my question is is this status notification the only thing swift is
>>>>>> waiting for, or is swift also waiting for the output data to appear to
>>>>>> say that a job is successful?
>>>>>>
>>>>>>
>>>>>>
>>>>> Once the job is done, swift will attempt to stage out all the files that
>>>>> it expects the job to have produced.
>>>>>
>>>>> Should one of those files not be there, there will be failures.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>
>>>
>
>
>
From hategan at mcs.anl.gov Tue Mar 17 13:40:30 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 17 Mar 2009 13:40:30 -0500
Subject: [Swift-devel] How does swift know if a task is successful
In-Reply-To: <49BFEDB0.5070409@uchicago.edu>
References: <49BAD926.1030607@uchicago.edu>
<1236982413.13026.1.camel@localhost> <49BFDA75.9070803@uchicago.edu>
<1237310455.29738.11.camel@localhost> <49BFDC78.8040506@uchicago.edu>
<1237310948.30064.2.camel@localhost> <49BFEDB0.5070409@uchicago.edu>
Message-ID: <1237315230.31264.1.camel@localhost>
On Tue, 2009-03-17 at 13:36 -0500, Zhao Zhang wrote:
> Hi, Mihael
>
> I commented the following lines
> /*dir:make(ldir)
> restartOnError(".*", 2
> task:transfer(srchost=host, srcfile=bname,
> srcdir=rdir, destdir=ldir, desthost=dhost, destprovider=provider)
> )*/
>
Did you modify this file in dist/?/libexec? If not, did you re-compile
swift after the modification?
Put an echo or a log message in place, to see if your change is picked
up by swift next time.
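One way to check the first question is to compare the edited file with the copy that the running installation actually loads. The Python sketch below only compares file digests. Both paths are supplied by the caller and are placeholders, since the real location under dist/ is not spelled out in this thread.

```python
import hashlib
import sys
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def same_contents(edited, deployed):
    """True if the edited file and the deployed copy are byte-identical."""
    return file_digest(edited) == file_digest(deployed)


if __name__ == "__main__":
    # Hypothetical usage: python check_copy.py <edited vdl-int.k> <deployed vdl-int.k>
    if len(sys.argv) == 3:
        if same_contents(sys.argv[1], sys.argv[2]):
            print("deployed copy matches the edited file")
        else:
            print("deployed copy differs: the edit may not have been picked up")
```

A digest mismatch suggests that the edit was made in the source tree only and that a rebuild or a direct edit of the deployed copy is still needed. A match does not prove the file is the one being loaded, which is why the echo or log message is still worth adding.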
> Then I modified wrapper.sh to not to copy output file back, but I still
> got an error.
> The log file is at
> http://www.ci.uchicago.edu/~zzhang/first-20090317-1327-oqgttus8.log
> Thanks
>
> zhao
>
> zzhang at login6.surveyor:~/new_dock6> ./run_swift_ssh.sh 1117 64 first.swift
> waiting for at least 64 nodes to register before submitting workload...
> waiting to find at least 1 services in file
> /home/falkon/users/zzhang/1117/config/Client-service-URIs.config...
> all done, file has found at least 1 services
> found at least 64 registered, submitting workload...
> Swift svn swift-r2676 (swift modified locally) cog-r2305
>
> RunID: 20090317-1327-oqgttus8
> Progress:
> Progress: Selecting site:1 Stage in:1
> Progress: Submitting:1 Submitted:1
> Progress: Submitted:1 Failed but can retry:1
> Failed to transfer wrapper log from
> first-20090317-1327-oqgttus8/info/b/n/bgp000
> Progress: Submitted:1 Active:1
> Failed to transfer wrapper log from
> first-20090317-1327-oqgttus8/info/e/n/bgp000
> Progress: Submitted:1 Active:1
> Failed to transfer wrapper log from
> first-20090317-1327-oqgttus8/info/g/n/bgp000
> Execution failed:
> Exception in echo:
> Arguments: [Hello, world!]
> Host: bgp000
> Directory: first-20090317-1327-oqgttus8/jobs/g/n/echo-gnlq238j
> stderr.txt:
>
> stdout.txt:
>
> ----
>
> Caused by:
> Cannot transfer
> "/tmp/first-20090317-1327-oqgttus8/shared/hello.txt" to
> "/gpfs/home/zzhang/new_dock6/./hello.txt"
> Caused by:
> No such file
>
>
> Mihael Hategan wrote:
> > On Tue, 2009-03-17 at 12:23 -0500, Zhao Zhang wrote:
> >
> >> Hi, Mihael
> >>
> >> yes, can I do that?
> >>
> >
> > You should know this by now:
> > in vdl-int.k, in doStageout, comment out the task:transfer invocation
> > (and dir:make).
> >
> >
> >> zhao
> >>
> >> Mihael Hategan wrote:
> >>
> >>> On Tue, 2009-03-17 at 12:14 -0500, Zhao Zhang wrote:
> >>>
> >>>
> >>>> Here comes another question, is there any place that I could set to
> >>>> disable swift's waiting for data feature?
> >>>>
> >>>>
> >>> Do you mean disable the stage-outs?
> >>>
> >>>
> >>>
> >>>> Or is there any way for me to cheat swift that the data is already
> >>>> there? thanks.
> >>>>
> >>>> zhao
> >>>>
> >>>> Mihael Hategan wrote:
> >>>>
> >>>>
> >>>>> On Fri, 2009-03-13 at 17:07 -0500, Zhao Zhang wrote:
> >>>>>
> >>>>>
> >>>>>
> >>>>>> Hi, All
> >>>>>>
> >>>>>> I have a question on how swift knows if a task is successful.
> >>>>>> In my case, I am using a status notification instead of a status file.
> >>>>>>
> >>>>>> So my question is is this status notification the only thing swift is
> >>>>>> waiting for, or is swift also waiting for the output data to appear to
> >>>>>> say that a job is successful?
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>> Once the job is done, swift will attempt to stage out all the files that
> >>>>> it expects the job to have produced.
> >>>>>
> >>>>> Should one of those files not be there, there will be failures.
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>
> >>>
> >
> >
> >
From wilde at mcs.anl.gov Tue Mar 17 16:07:10 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 17 Mar 2009 16:07:10 -0500
Subject: [Swift-devel] null pointer exception from nested loops
Message-ID: <49C010FE.4070503@mcs.anl.gov>
I just expanded my oops protein folding script to add another level of
parameter sweep. This script is getting pretty complex now (at least,
for a swift script).
I got the following NPE on my first two tries. I'm going to start
debugging, but any clues as to the cause would be helpful.
The outer loops are:
main()
{
  string protein[] = readData(@arg("plist"));
  string startTemp[] = ["10","20"];
  string tempUpdate[] = ["1","2","3"];
  foreach p in protein {
    foreach st in startTemp {
      foreach tu in tempUpdate {
        doRoundSet(p,st,tu);
      }
    }
  }
}
There are two levels of inner loops further down below doRoundSet().
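To make the size of the sweep concrete, the outer loops can be enumerated with a short Python sketch. This is an illustration only, not part of the SwiftScript program. The protein list is an assumption: "T1af7" is the one name visible in the trace output, and the real list comes from the plist file.

```python
from itertools import product


def sweep(proteins, start_temps, temp_updates):
    """Return one (protein, start_temp, temp_update) tuple per doRoundSet call."""
    return list(product(proteins, start_temps, temp_updates))


if __name__ == "__main__":
    # Assumed protein list; the two temperature arrays are copied from the script.
    combos = sweep(["T1af7"], ["10", "20"], ["1", "2", "3"])
    for p, st, tu in combos:
        print(p, st, tu)
    print(len(combos), "calls to doRoundSet")
```

Each protein contributes 2 x 3 = 6 calls to doRoundSet. The sketch lists them in a fixed order purely for counting; the trace lines in the run output arrive out of order, so no ordering should be assumed for the real run.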
The script, output, command line args and log are in:
http://ww.ci.uchicago.edu/~wilde/swift3.tar.gz
I suspect it will take a while to narrow the cause to a simpler test
case that's easy to reproduce without a lot of setup.
I'll try on a vanilla swift on local execution; this is on bgp with Falkon.
Thanks.
--
...
Progress: uninitialized:1 Selecting site:2
SwiftScript trace: T1af7, Round, 0, Sim, 7
SwiftScript trace: T1af7, Round, 0, Sim, 2
SwiftScript trace: T1af7, Round, 0, Sim, 8
SwiftScript trace: T1af7, Round, 0, Sim, 0
SwiftScript trace: T1af7, Round, 0, Sim, 5
SwiftScript trace: T1af7, Round, 0, Sim, 9
SwiftScript trace: T1af7, Round, 0, Sim, 1
SwiftScript trace: T1af7, Round, 0, Sim, 6
SwiftScript trace: T1af7, Round, 0, Sim, 3
SwiftScript trace: T1af7, Round, 0, Sim, 4
Ex098
java.lang.NullPointerException
    at org.griphyn.vdl.karajan.lib.VDLFunction.leafFileName(VDLFunction.java:285)
    at org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:201)
    at org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:182)
    at org.griphyn.vdl.karajan.lib.swiftscript.FileName.function(FileName.java:19)
    at org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67)
    at org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192)
    at org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33)
    at org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:332)
    at org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
    at org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
    at org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:176)
    at org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:296)
    at org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58)
    at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:46)
    at org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:51)
    at org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:27)
    at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.executeChildren(AbstractFunction.java:40)
    at org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63)
    at org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:233)
    at org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:278)
    at org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:391)
    at org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:329)
    at org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:227)
    at org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
    at org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
    at org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java:69)
Execution failed:
java.lang.NullPointerException
    at org.griphyn.vdl.karajan.lib.VDLFunction.leafFileName(VDLFunction.java:285)
    at org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:201)
    at org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:182)
    at org.griphyn.vdl.karajan.lib.swiftscript.FileName.function(FileName.java:19)
    at org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67)
    at org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192)
    at org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33)
    at org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:332)
    at org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
    at org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
    at org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:176)
    at org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:296)
    at org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58)
    at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:46)
    at org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:51)
    at org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:27)
    at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.executeChildren(AbstractFunction.java:40)
    at org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63)
    at org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:233)
    at org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:278)
    at org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:391)
    at org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:329)
    at org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:227)
    at org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
    at org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
    at org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java:69)
Ex098
java.lang.NullPointerException
    at org.griphyn.vdl.karajan.lib.VDLFunction.leafFileName(VDLFunction.java:285)
    at org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:201)
    at org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:182)
    at org.griphyn.vdl.karajan.lib.swiftscript.FileName.function(FileName.java:19)
    at org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67)
    at org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192)
    at org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33)
    at org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:332)
    at org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
    at org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
    at org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:176)
    at org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:296)
    at org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58)
    at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:46)
    at org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:51)
    at org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:27)
    at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.executeChildren(AbstractFunction.java:40)
    at org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63)
    at org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:233)
    at org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:278)
    at org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:391)
    at org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:329)
    at org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:227)
    at org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
    at org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
    at org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java:69)
SwiftScript trace: T1af7, Round, 0, Sim, 7
From wilde at mcs.anl.gov Tue Mar 17 16:17:05 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 17 Mar 2009 16:17:05 -0500
Subject: [Swift-devel] null pointer exception from nested loops
In-Reply-To: <49C010FE.4070503@mcs.anl.gov>
References: <49C010FE.4070503@mcs.anl.gov>
Message-ID: <49C01351.1050807@mcs.anl.gov>
It does not seem to be related to scale or Falkon.
It occurs when running on localhost (but on bgp) and when I cut all the
loops down to a single iteration.
I'm still debugging.
From wilde at mcs.anl.gov Tue Mar 17 16:25:32 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 17 Mar 2009 16:25:32 -0500
Subject: [Swift-devel] null pointer exception from nested loops
In-Reply-To: <49C01351.1050807@mcs.anl.gov>
References: <49C010FE.4070503@mcs.anl.gov> <49C01351.1050807@mcs.anl.gov>
Message-ID: <49C0154C.7000608@mcs.anl.gov>
The log contains this just before the NPE, including the suspicious
message: WARN FlowNode Ex098
That's giving me a clue as to the offending statements.
---
2009-03-17 16:20:34,723-0500 INFO AbstractDataNode Adding handle listener "F/org.griphyn.vdl.mapping.DataNode identifier tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000071 type SecSeq with no value at dataset=secseq path=[0] (not closed)" to "org.griphyn.vdl.mapping.DataNode identifier tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000071 type SecSeq with no value at dataset=secseq path=[0] (not closed)"
2009-03-17 16:20:34,724-0500 INFO AbstractDataNode closed org.griphyn.vdl.mapping.RootDataNode identifier tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000092 type string value=s/@DIT@/10/ dataset=unnamed SwiftScript value (closed)
2009-03-17 16:20:34,724-0500 INFO AbstractDataNode ROOTPATH dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000092 path=$
2009-03-17 16:20:34,724-0500 INFO AbstractDataNode VALUE dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000092 VALUE=s/@DIT@/10/
2009-03-17 16:20:34,725-0500 INFO AbstractDataNode closed org.griphyn.vdl.mapping.RootDataNode identifier tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000093 type string value=s/@TUI@/1/ dataset=unnamed SwiftScript value (closed)
2009-03-17 16:20:34,725-0500 INFO AbstractDataNode ROOTPATH dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000093 path=$
2009-03-17 16:20:34,725-0500 INFO AbstractDataNode VALUE dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000093 VALUE=s/@TUI@/1/
2009-03-17 16:20:34,726-0500 INFO AbstractDataNode closed org.griphyn.vdl.mapping.RootDataNode identifier tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000094 type string value=params.tloop dataset=unnamed SwiftScript value (closed)
2009-03-17 16:20:34,726-0500 INFO AbstractDataNode ROOTPATH dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000094 path=$
2009-03-17 16:20:34,726-0500 INFO AbstractDataNode VALUE dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000094 VALUE=params.tloop
2009-03-17 16:20:34,727-0500 WARN FlowNode Ex098
java.lang.NullPointerException
    at org.griphyn.vdl.karajan.lib.VDLFunction.leafFileName(VDLFunction.java:285)
    at org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:201)
    at org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:182)
    at org.griphyn.vdl.karajan.lib.swiftscript.FileName.function(FileName.java:19)
    at org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67)
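The same kind of excerpt can be pulled out of a log mechanically. The Python sketch below groups physical lines into log records by their leading timestamp and returns the records just before the first one containing a marker. The timestamp pattern is inferred from the lines quoted above, and the function names are made up for illustration; neither comes from Swift itself.

```python
import re

# Start of a record, e.g. "2009-03-17 16:20:34,727-0500 WARN FlowNode Ex098".
RECORD_START = re.compile(
    r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}[+-]\d{4} "
)


def records(lines):
    """Group physical lines into records; lines without a timestamp
    (wrapped text, stack frames) are appended to the previous record."""
    out = []
    for line in lines:
        line = line.rstrip("\n")
        if RECORD_START.match(line) or not out:
            out.append(line)
        else:
            out[-1] += "\n" + line
    return out


def context_before(lines, marker="WARN FlowNode", count=3):
    """Return up to `count` records preceding the first record containing `marker`."""
    recs = records(lines)
    for i, rec in enumerate(recs):
        if marker in rec:
            return recs[max(0, i - count):i]
    return []
```

With a log file opened as `fh`, `context_before(fh, count=10)` would give the ten records leading up to the first WARN FlowNode entry. In the excerpt above those records carry the string values s/@DIT@/10/, s/@TUI@/1/ and params.tloop.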
On 3/17/09 4:17 PM, Michael Wilde wrote:
> It seems not related to scale or Falkon.
>
> It occurs when running on localhost (but on bgp) and when I cut all the
> loops down to a single iteration.
>
> I'm still debugging.
>
> On 3/17/09 4:07 PM, Michael Wilde wrote:
>> I just expanded my oops protein folding script to add another level of
>> parameter sweep. This script is getting pretty complex now (at least,
>> for a swift script).
>>
>> I got the following npe on my first two tries. Im going to start
>> debugging, but any clues as to the cause would be helpful.
>>
>> The outer loops are:
>>
>> main()
>> {
>> string protein[] = readData(@arg("plist"));
>> string startTemp[] = ["10","20"];
>> string tempUpdate[] = ["1","2","3"];
>>
>> foreach p in protein {
>> foreach st in startTemp {
>> foreach tu in tempUpdate {
>> doRoundSet(p,st,tu);
>> }
>> }
>> }
>> }
>>
>> There are two levels of inner loops further down below doRoundSet().
>>
>> The script, output, command line args and log are in:
>> http://ww.ci.uchicago.edu/~wilde/swift3.tar.gz
>>
>> I suspect it will take a while to narrow the cause to a simpler test
>> case thats easy tp reproduce without a lot of setup.
>>
>> I'll try on a vanilla swift on local execution; this is on bgp with
>> Falkon.
>>
>> Thanks.
>>
>> --
>>
>> ...
>> Progress: uninitialized:1 Selecting site:2
>> SwiftScript trace: T1af7, Round, 0, Sim, 7
>> SwiftScript trace: T1af7, Round, 0, Sim, 2
>> SwiftScript trace: T1af7, Round, 0, Sim, 8
>> SwiftScript trace: T1af7, Round, 0, Sim, 0
>> SwiftScript trace: T1af7, Round, 0, Sim, 5
>> SwiftScript trace: T1af7, Round, 0, Sim, 9
>> SwiftScript trace: T1af7, Round, 0, Sim, 1
>> SwiftScript trace: T1af7, Round, 0, Sim, 6
>> SwiftScript trace: T1af7, Round, 0, Sim, 3
>> SwiftScript trace: T1af7, Round, 0, Sim, 4
>> Ex098
>> java.lang.NullPointerException
>> at
>> org.griphyn.vdl.karajan.lib.VDLFunction.leafFileName(VDLFunction.java:285)
>>
>> at
>> org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:201)
>> at
>> org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:182)
>> at
>> org.griphyn.vdl.karajan.lib.swiftscript.FileName.function(FileName.java:19)
>>
>> at org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67)
>> at
>> org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192)
>>
>> at
>> org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33)
>>
>> at
>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:332)
>> at
>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
>> at
>> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
>>
>> at
>> org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:176)
>>
>> at
>> org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:296)
>>
>> at
>> org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58)
>>
>> at
>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:46)
>>
>> at
>> org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:51)
>>
>> at
>> org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:27)
>>
>> at
>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.executeChildren(AbstractFunction.java:40)
>>
>> at
>> org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63)
>>
>> at
>> org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:233)
>> at
>> org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:278)
>> at
>> org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:391)
>>
>> at
>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:329)
>> at
>> org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:227)
>>
>> at
>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
>> at
>> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
>>
>> at
>> org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java:69)
>>
>> Execution failed:
>> java.lang.NullPointerException
>> at
>> org.griphyn.vdl.karajan.lib.VDLFunction.leafFileName(VDLFunction.java:285)
>>
>> at
>> org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:201)
>> at
>> org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:182)
>> at
>> org.griphyn.vdl.karajan.lib.swiftscript.FileName.function(FileName.java:19)
>>
>> at org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67)
>> at
>> org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192)
>>
>> at
>> org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33)
>>
>> at
>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:332)
>> at
>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
>> at
>> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
>>
>> at
>> org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:176)
>>
>> at
>> org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:296)
>>
>> at
>> org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58)
>>
>> at
>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:46)
>>
>> at
>> org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:51)
>>
>> at
>> org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:27)
>>
>> at
>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.executeChildren(AbstractFunction.java:40)
>>
>> at
>> org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63)
>>
>> at
>> org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:233)
>> at
>> org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:278)
>> at
>> org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:391)
>>
>> at
>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:329)
>> at
>> org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:227)
>>
>> at
>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
>> at
>> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
>>
>> at
>> org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java:69)
>>
>>
>> Ex098
>> java.lang.NullPointerException
>> at
>> org.griphyn.vdl.karajan.lib.VDLFunction.leafFileName(VDLFunction.java:285)
>>
>> at
>> org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:201)
>> at
>> org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:182)
>> at
>> org.griphyn.vdl.karajan.lib.swiftscript.FileName.function(FileName.java:19)
>>
>> at org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67)
>> at
>> org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192)
>>
>> at
>> org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33)
>>
>> at
>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:332)
>> at
>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
>> at
>> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
>>
>> at
>> org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:176)
>>
>> at
>> org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:296)
>>
>> at
>> org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58)
>>
>> at
>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:46)
>>
>> at
>> org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:51)
>>
>> at
>> org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:27)
>>
>> at
>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.executeChildren(AbstractFunction.java:40)
>>
>> at
>> org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63)
>>
>> at
>> org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:233)
>> at
>> org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:278)
>> at
>> org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:391)
>>
>> at
>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:329)
>> at
>> org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:227)
>>
>> at
>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
>> at
>> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
>>
>> at
>> org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java:69)
>>
>> SwiftScript trace: T1af7, Round, 0, Sim, 7
>> _______________________________________________
>> Swift-devel mailing list
>> Swift-devel at ci.uchicago.edu
>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
From foster at anl.gov Tue Mar 17 16:26:25 2009
From: foster at anl.gov (Ian Foster)
Date: Tue, 17 Mar 2009 16:26:25 -0500
Subject: [Swift-devel] null pointer exception from nested loops
In-Reply-To: <49C0154C.7000608@mcs.anl.gov>
References: <49C010FE.4070503@mcs.anl.gov> <49C01351.1050807@mcs.anl.gov>
<49C0154C.7000608@mcs.anl.gov>
Message-ID: <17AE2924-B021-46EC-9649-C80B1EBFFECD@anl.gov>
Just curious, is the whole thing working with just Falkon?
On Mar 17, 2009, at 4:25 PM, Michael Wilde wrote:
> The log contains this just before the NPE, including the suspicious
> message: WARN FlowNode Ex098:
>
> That's giving me a clue as to the offending statements.
>
> ---
>
> 2009-03-17 16:20:34,723-0500 INFO AbstractDataNode Adding handle
> listener "F/org.griphyn.vdl.mapping.DataNode identifier tag:benc at ci.uchicago.edu
> ,2008:swift:dataset:20090\
> 317-1620-e1n1bz3g:720000000071 type SecSeq with no value at
> dataset=secseq path=[0] (not closed)" to
> "org.griphyn.vdl.mapping.DataNode identifier
> tag:benc at ci.uchicago.edu,\
> 2008:swift:dataset:20090317-1620-e1n1bz3g:720000000071 type SecSeq
> with no value at dataset=secseq path=[0] (not closed)"
> 2009-03-17 16:20:34,724-0500 INFO AbstractDataNode closed
> org.griphyn.vdl.mapping.RootDataNode identifier tag:benc at ci.uchicago.edu
> ,2008:swift:dataset:20090317-1620-e1n1bz\
> 3g:720000000092 type string value=s/@DIT@/10/ dataset=unnamed
> SwiftScript value (closed)
> 2009-03-17 16:20:34,724-0500 INFO AbstractDataNode ROOTPATH
> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-
> e1n1bz3g:720000000092 path=$
> 2009-03-17 16:20:34,724-0500 INFO AbstractDataNode VALUE
> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-
> e1n1bz3g:720000000092 VALUE=s/@DIT@/10/
> 2009-03-17 16:20:34,725-0500 INFO AbstractDataNode closed
> org.griphyn.vdl.mapping.RootDataNode identifier tag:benc at ci.uchicago.edu
> ,2008:swift:dataset:20090317-1620-e1n1bz\
> 3g:720000000093 type string value=s/@TUI@/1/ dataset=unnamed
> SwiftScript value (closed)
> 2009-03-17 16:20:34,725-0500 INFO AbstractDataNode ROOTPATH
> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-
> e1n1bz3g:720000000093 path=$
> 2009-03-17 16:20:34,725-0500 INFO AbstractDataNode VALUE
> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-
> e1n1bz3g:720000000093 VALUE=s/@TUI@/1/
> 2009-03-17 16:20:34,726-0500 INFO AbstractDataNode closed
> org.griphyn.vdl.mapping.RootDataNode identifier tag:benc at ci.uchicago.edu
> ,2008:swift:dataset:20090317-1620-e1n1bz\
> 3g:720000000094 type string value=params.tloop dataset=unnamed
> SwiftScript value (closed)
> 2009-03-17 16:20:34,726-0500 INFO AbstractDataNode ROOTPATH
> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-
> e1n1bz3g:720000000094 path=$
> 2009-03-17 16:20:34,726-0500 INFO AbstractDataNode VALUE
> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-
> e1n1bz3g:720000000094 VALUE=params.tloop
> 2009-03-17 16:20:34,727-0500 WARN FlowNode Ex098
> java.lang.NullPointerException
> at org.griphyn.vdl.karajan.lib.VDLFunction.leafFileName(VDLFunction.java:285)
> at org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:201)
> at org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:182)
> at org.griphyn.vdl.karajan.lib.swiftscript.FileName.function(FileName.java:19)
> at org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67)
>
>
>
> On 3/17/09 4:17 PM, Michael Wilde wrote:
>> It seems not related to scale or Falkon.
>> It occurs when running on localhost (but on bgp) and when I cut all
>> the loops down to a single iteration.
>> I'm still debugging.
>> On 3/17/09 4:07 PM, Michael Wilde wrote:
>>> I just expanded my oops protein folding script to add another
>>> level of parameter sweep. This script is getting pretty complex
>>> now (at least, for a swift script).
>>>
>>> I got the following NPE on my first two tries. I'm going to start
>>> debugging, but any clues as to the cause would be helpful.
>>>
>>> The outer loops are:
>>>
>>> main()
>>> {
>>> string protein[] = readData(@arg("plist"));
>>> string startTemp[] = ["10","20"];
>>> string tempUpdate[] = ["1","2","3"];
>>>
>>> foreach p in protein {
>>> foreach st in startTemp {
>>> foreach tu in tempUpdate {
>>> doRoundSet(p,st,tu);
>>> }
>>> }
>>> }
>>> }
>>>
>>> There are two levels of inner loops further down below doRoundSet().
>>>
>>> The script, output, command line args and log are in:
>>> http://ww.ci.uchicago.edu/~wilde/swift3.tar.gz
>>>
>>> I suspect it will take a while to narrow the cause to a simpler
>>> test case that's easy to reproduce without a lot of setup.
>>>
>>> I'll try on a vanilla swift on local execution; this is on bgp
>>> with Falkon.
>>>
>>> Thanks.
>>>
>>> --
>>>
>>> ...
>>> Progress: uninitialized:1 Selecting site:2
>>> SwiftScript trace: T1af7, Round, 0, Sim, 7
>>> SwiftScript trace: T1af7, Round, 0, Sim, 2
>>> SwiftScript trace: T1af7, Round, 0, Sim, 8
>>> SwiftScript trace: T1af7, Round, 0, Sim, 0
>>> SwiftScript trace: T1af7, Round, 0, Sim, 5
>>> SwiftScript trace: T1af7, Round, 0, Sim, 9
>>> SwiftScript trace: T1af7, Round, 0, Sim, 1
>>> SwiftScript trace: T1af7, Round, 0, Sim, 6
>>> SwiftScript trace: T1af7, Round, 0, Sim, 3
>>> SwiftScript trace: T1af7, Round, 0, Sim, 4
>>> Ex098
>>> java.lang.NullPointerException
>>> at org.griphyn.vdl.karajan.lib.VDLFunction.leafFileName(VDLFunction.java:285)
>>> at org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:201)
>>> at org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:182)
>>> at org.griphyn.vdl.karajan.lib.swiftscript.FileName.function(FileName.java:19)
>>> at org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67)
>>> at org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192)
>>> at org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33)
>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:332)
>>> at org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
>>> at org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:176)
>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:296)
>>> at org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58)
>>> at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:46)
>>> at org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:51)
>>> at org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:27)
>>> at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.executeChildren(AbstractFunction.java:40)
>>> at org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63)
>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:233)
>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:278)
>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:391)
>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:329)
>>> at org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:227)
>>> at org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
>>> at org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
>>> at org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java:69)
>>> Execution failed:
>>> java.lang.NullPointerException
>>> at org.griphyn.vdl.karajan.lib.VDLFunction.leafFileName(VDLFunction.java:285)
>>> at org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:201)
>>> at org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:182)
>>> at org.griphyn.vdl.karajan.lib.swiftscript.FileName.function(FileName.java:19)
>>> at org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67)
>>> at org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192)
>>> at org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33)
>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:332)
>>> at org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
>>> at org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:176)
>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:296)
>>> at org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58)
>>> at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:46)
>>> at org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:51)
>>> at org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:27)
>>> at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.executeChildren(AbstractFunction.java:40)
>>> at org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63)
>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:233)
>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:278)
>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:391)
>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:329)
>>> at org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:227)
>>> at org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
>>> at org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
>>> at org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java:69)
>>>
>>> Ex098
>>> java.lang.NullPointerException
>>> at org.griphyn.vdl.karajan.lib.VDLFunction.leafFileName(VDLFunction.java:285)
>>> at org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:201)
>>> at org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:182)
>>> at org.griphyn.vdl.karajan.lib.swiftscript.FileName.function(FileName.java:19)
>>> at org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67)
>>> at org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192)
>>> at org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33)
>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:332)
>>> at org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
>>> at org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:176)
>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:296)
>>> at org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58)
>>> at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:46)
>>> at org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:51)
>>> at org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:27)
>>> at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.executeChildren(AbstractFunction.java:40)
>>> at org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63)
>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:233)
>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:278)
>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:391)
>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:329)
>>> at org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:227)
>>> at org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
>>> at org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
>>> at org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java:69)
>>> SwiftScript trace: T1af7, Round, 0, Sim, 7
From wilde at mcs.anl.gov Tue Mar 17 16:46:21 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 17 Mar 2009 16:46:21 -0500
Subject: [Swift-devel] null pointer exception from nested loops
In-Reply-To: <49C0154C.7000608@mcs.anl.gov>
References: <49C010FE.4070503@mcs.anl.gov> <49C01351.1050807@mcs.anl.gov>
<49C0154C.7000608@mcs.anl.gov>
Message-ID: <49C01A2D.6050408@mcs.anl.gov>
I got the error narrowed down to this example:
type file;
app (file editedParams) setTemps ( file inParams )
{
echo @inParams stdout=@editedParams;
}
file inParams;
string config [] = readData( setTemps(inParams ) );
trace(0,config[0]);
trace(1,config[1]);
trace(2,config[2]);
--
params.tloops contains:
DEFAULT_INIT_TEMP_=_@DTI@
TEMP_UPDATE_INTERVAL_=_@TUI@
KILL_TIME_=_3
MAX_NUMBER_OF_ANNEALING_STEPS_=_0
--
In other code, readData() seemed happy to take a file *or* a filename
string as input, but I wonder if it was not as happy as it seemed.
I'd been taking advantage of the flexibility with good results (luck?)
so far, though.
On 3/17/09 4:25 PM, Michael Wilde wrote:
> The log contains this just before the NPE, including the suspicious
> message: WARN FlowNode Ex098:
>
> That's giving me a clue as to the offending statements.
>
> ---
>
> 2009-03-17 16:20:34,723-0500 INFO AbstractDataNode Adding handle
> listener "F/org.griphyn.vdl.mapping.DataNode identifier
> tag:benc at ci.uchicago.edu,2008:swift:dataset:20090\
> 317-1620-e1n1bz3g:720000000071 type SecSeq with no value at
> dataset=secseq path=[0] (not closed)" to
> "org.griphyn.vdl.mapping.DataNode identifier tag:benc at ci.uchicago.edu,\
> 2008:swift:dataset:20090317-1620-e1n1bz3g:720000000071 type SecSeq with
> no value at dataset=secseq path=[0] (not closed)"
> 2009-03-17 16:20:34,724-0500 INFO AbstractDataNode closed
> org.griphyn.vdl.mapping.RootDataNode identifier
> tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz\
> 3g:720000000092 type string value=s/@DIT@/10/ dataset=unnamed
> SwiftScript value (closed)
> 2009-03-17 16:20:34,724-0500 INFO AbstractDataNode ROOTPATH
> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000092
> path=$
> 2009-03-17 16:20:34,724-0500 INFO AbstractDataNode VALUE
> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000092
> VALUE=s/@DIT@/10/
> 2009-03-17 16:20:34,725-0500 INFO AbstractDataNode closed
> org.griphyn.vdl.mapping.RootDataNode identifier
> tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz\
> 3g:720000000093 type string value=s/@TUI@/1/ dataset=unnamed SwiftScript
> value (closed)
> 2009-03-17 16:20:34,725-0500 INFO AbstractDataNode ROOTPATH
> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000093
> path=$
> 2009-03-17 16:20:34,725-0500 INFO AbstractDataNode VALUE
> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000093
> VALUE=s/@TUI@/1/
> 2009-03-17 16:20:34,726-0500 INFO AbstractDataNode closed
> org.griphyn.vdl.mapping.RootDataNode identifier
> tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz\
> 3g:720000000094 type string value=params.tloop dataset=unnamed
> SwiftScript value (closed)
> 2009-03-17 16:20:34,726-0500 INFO AbstractDataNode ROOTPATH
> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000094
> path=$
> 2009-03-17 16:20:34,726-0500 INFO AbstractDataNode VALUE
> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000094
> VALUE=params.tloop
> 2009-03-17 16:20:34,727-0500 WARN FlowNode Ex098
> java.lang.NullPointerException
> at
> org.griphyn.vdl.karajan.lib.VDLFunction.leafFileName(VDLFunction.java:285)
> at
> org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:201)
> at
> org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:182)
> at
> org.griphyn.vdl.karajan.lib.swiftscript.FileName.function(FileName.java:19)
> at
> org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67)
>
>
>
> On 3/17/09 4:17 PM, Michael Wilde wrote:
>> It seems not related to scale or Falkon.
>>
>> It occurs when running on localhost (but on bgp) and when I cut all
>> the loops down to a single iteration.
>>
>> I'm still debugging.
>>
>> On 3/17/09 4:07 PM, Michael Wilde wrote:
>>> I just expanded my oops protein folding script to add another level
>>> of parameter sweep. This script is getting pretty complex now (at
>>> least, for a swift script).
>>>
>>> I got the following NPE on my first two tries. I'm going to start
>>> debugging, but any clues as to the cause would be helpful.
>>>
>>> The outer loops are:
>>>
>>> main()
>>> {
>>> string protein[] = readData(@arg("plist"));
>>> string startTemp[] = ["10","20"];
>>> string tempUpdate[] = ["1","2","3"];
>>>
>>> foreach p in protein {
>>> foreach st in startTemp {
>>> foreach tu in tempUpdate {
>>> doRoundSet(p,st,tu);
>>> }
>>> }
>>> }
>>> }
>>>
>>> There are two levels of inner loops further down below doRoundSet().
>>>
>>> The script, output, command line args and log are in:
>>> http://ww.ci.uchicago.edu/~wilde/swift3.tar.gz
>>>
>>> I suspect it will take a while to narrow the cause to a simpler test
>>> case that's easy to reproduce without a lot of setup.
>>>
>>> I'll try on a vanilla swift on local execution; this is on bgp with
>>> Falkon.
>>>
>>> Thanks.
>>>
>>> --
>>>
>>> ...
>>> Progress: uninitialized:1 Selecting site:2
>>> SwiftScript trace: T1af7, Round, 0, Sim, 7
>>> SwiftScript trace: T1af7, Round, 0, Sim, 2
>>> SwiftScript trace: T1af7, Round, 0, Sim, 8
>>> SwiftScript trace: T1af7, Round, 0, Sim, 0
>>> SwiftScript trace: T1af7, Round, 0, Sim, 5
>>> SwiftScript trace: T1af7, Round, 0, Sim, 9
>>> SwiftScript trace: T1af7, Round, 0, Sim, 1
>>> SwiftScript trace: T1af7, Round, 0, Sim, 6
>>> SwiftScript trace: T1af7, Round, 0, Sim, 3
>>> SwiftScript trace: T1af7, Round, 0, Sim, 4
>>> Ex098
>>> java.lang.NullPointerException
>>> at
>>> org.griphyn.vdl.karajan.lib.VDLFunction.leafFileName(VDLFunction.java:285)
>>>
>>> at
>>> org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:201)
>>> at
>>> org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:182)
>>> at
>>> org.griphyn.vdl.karajan.lib.swiftscript.FileName.function(FileName.java:19)
>>>
>>> at org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67)
>>> at
>>> org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192)
>>>
>>> at
>>> org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33)
>>>
>>> at
>>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:332)
>>> at
>>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
>>> at
>>> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
>>>
>>> at
>>> org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:176)
>>>
>>> at
>>> org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:296)
>>>
>>> at
>>> org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58)
>>>
>>> at
>>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:46)
>>>
>>> at
>>> org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:51)
>>>
>>> at
>>> org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:27)
>>>
>>> at
>>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.executeChildren(AbstractFunction.java:40)
>>>
>>> at
>>> org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63)
>>>
>>> at
>>> org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:233)
>>>
>>> at
>>> org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:278)
>>> at
>>> org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:391)
>>>
>>> at
>>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:329)
>>> at
>>> org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:227)
>>>
>>> at
>>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
>>> at
>>> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
>>>
>>> at
>>> org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java:69)
>>>
>>> Execution failed:
>>> java.lang.NullPointerException
>>> at
>>> org.griphyn.vdl.karajan.lib.VDLFunction.leafFileName(VDLFunction.java:285)
>>>
>>> at
>>> org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:201)
>>> at
>>> org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:182)
>>> at
>>> org.griphyn.vdl.karajan.lib.swiftscript.FileName.function(FileName.java:19)
>>>
>>> at org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67)
>>> at
>>> org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192)
>>>
>>> at
>>> org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33)
>>>
>>> at
>>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:332)
>>> at
>>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
>>> at
>>> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
>>>
>>> at
>>> org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:176)
>>>
>>> at
>>> org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:296)
>>>
>>> at
>>> org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58)
>>>
>>> at
>>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:46)
>>>
>>> at
>>> org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:51)
>>>
>>> at
>>> org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:27)
>>>
>>> at
>>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.executeChildren(AbstractFunction.java:40)
>>>
>>> at
>>> org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63)
>>>
>>> at
>>> org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:233)
>>>
>>> at
>>> org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:278)
>>> at
>>> org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:391)
>>>
>>> at
>>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:329)
>>> at
>>> org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:227)
>>>
>>> at
>>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
>>> at
>>> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
>>>
>>> at
>>> org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java:69)
>>>
>>>
>>> Ex098
>>> java.lang.NullPointerException
>>> at
>>> org.griphyn.vdl.karajan.lib.VDLFunction.leafFileName(VDLFunction.java:285)
>>>
>>> at
>>> org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:201)
>>> at
>>> org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:182)
>>> at
>>> org.griphyn.vdl.karajan.lib.swiftscript.FileName.function(FileName.java:19)
>>>
>>> at org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67)
>>> at org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192)
>>> at org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33)
>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:332)
>>> at org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
>>> at org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:176)
>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:296)
>>> at org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58)
>>> at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:46)
>>> at org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:51)
>>> at org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:27)
>>> at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.executeChildren(AbstractFunction.java:40)
>>> at org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63)
>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:233)
>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:278)
>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:391)
>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:329)
>>> at org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:227)
>>> at org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
>>> at org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
>>> at org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java:69)
>>> SwiftScript trace: T1af7, Round, 0, Sim, 7
>>> _______________________________________________
>>> Swift-devel mailing list
>>> Swift-devel at ci.uchicago.edu
>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
From wilde at mcs.anl.gov Tue Mar 17 17:02:31 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 17 Mar 2009 17:02:31 -0500
Subject: [Swift-devel] null pointer exception from nested loops
In-Reply-To: <49C01A2D.6050408@mcs.anl.gov>
References: <49C010FE.4070503@mcs.anl.gov>
<49C01351.1050807@mcs.anl.gov> <49C0154C.7000608@mcs.anl.gov>
<49C01A2D.6050408@mcs.anl.gov>
Message-ID: <49C01DF7.4000204@mcs.anl.gov>
OK, I think this is it:
The following script *works* with the commented-out code and fails with
the currently enabled alternative statement:
--
type file;
app (file editedParams) setTemps ( file inParams )
{
sed "-e" "s/@DTI@/123/" @inParams stdout=@editedParams;
}
file inParams;
/* works:
file o<"pout">;
o = setTemps(inParams);
string config [] = readData(o);
*/
/* Fails: */
string config [] = readData( setTemps(inParams ) );
trace(0,config[0]);
trace(1,config[1]);
trace(2,config[2]);
--
So readData is indeed happy to take a file-valued variable as an argument,
but not a file-valued expression (a procedure's return value, in this case).
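For reference, a minimal sketch of the working variant with the workaround uncommented. It uses only the names from the script above; the "pout" mapping is just the one used in the test, and inParams is left unmapped exactly as in the test case, so this is an illustration of the pattern rather than a verified fix:

```
type file;

app (file editedParams) setTemps ( file inParams )
{
sed "-e" "s/@DTI@/123/" @inParams stdout=@editedParams;
}

file inParams;

/* Bind the procedure result to a mapped variable first. */
file o<"pout">;
o = setTemps(inParams);

/* Then pass the variable, not the call expression, to readData. */
string config [] = readData(o);

trace(0,config[0]);
```

The only difference from the failing form is the intermediate variable o; whether the mapping on o is what readData actually needs is an assumption based on the NPE coming from the filename lookup.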
On 3/17/09 4:46 PM, Michael Wilde wrote:
> I got the error narrowed down to this example:
>
> type file;
>
> app (file editedParams) setTemps ( file inParams )
> {
> echo @inParams stdout=@editedParams;
> }
>
> file inParams;
>
> string config [] = readData( setTemps(inParams ) );
> trace(0,config[0]);
> trace(1,config[1]);
> trace(2,config[2]);
>
> --
>
> params.tloops contains:
>
> DEFAULT_INIT_TEMP_=_@DTI@
> TEMP_UPDATE_INTERVAL_=_@TUI@
> KILL_TIME_=_3
> MAX_NUMBER_OF_ANNEALING_STEPS_=_0
>
> --
>
> In other code, readData() seemed happy to take a file *or* a filename
> string as input, but I wonder if it was not as happy as it seemed.
>
> I'd been taking advantage of the flexibility with good results (luck?)
> so far, though.
>
> On 3/17/09 4:25 PM, Michael Wilde wrote:
>> The log contains this just before the NPE, including the suspicious
>> message: WARN FlowNode Ex098:
>>
>> That's giving me a clue as to the offending statements.
>>
>> ---
>>
>> 2009-03-17 16:20:34,723-0500 INFO AbstractDataNode Adding handle listener "F/org.griphyn.vdl.mapping.DataNode identifier tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000071 type SecSeq with no value at dataset=secseq path=[0] (not closed)" to "org.griphyn.vdl.mapping.DataNode identifier tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000071 type SecSeq with no value at dataset=secseq path=[0] (not closed)"
>> 2009-03-17 16:20:34,724-0500 INFO AbstractDataNode closed org.griphyn.vdl.mapping.RootDataNode identifier tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000092 type string value=s/@DIT@/10/ dataset=unnamed SwiftScript value (closed)
>> 2009-03-17 16:20:34,724-0500 INFO AbstractDataNode ROOTPATH dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000092 path=$
>> 2009-03-17 16:20:34,724-0500 INFO AbstractDataNode VALUE dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000092 VALUE=s/@DIT@/10/
>> 2009-03-17 16:20:34,725-0500 INFO AbstractDataNode closed org.griphyn.vdl.mapping.RootDataNode identifier tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000093 type string value=s/@TUI@/1/ dataset=unnamed SwiftScript value (closed)
>> 2009-03-17 16:20:34,725-0500 INFO AbstractDataNode ROOTPATH dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000093 path=$
>> 2009-03-17 16:20:34,725-0500 INFO AbstractDataNode VALUE dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000093 VALUE=s/@TUI@/1/
>> 2009-03-17 16:20:34,726-0500 INFO AbstractDataNode closed org.griphyn.vdl.mapping.RootDataNode identifier tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000094 type string value=params.tloop dataset=unnamed SwiftScript value (closed)
>> 2009-03-17 16:20:34,726-0500 INFO AbstractDataNode ROOTPATH dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000094 path=$
>> 2009-03-17 16:20:34,726-0500 INFO AbstractDataNode VALUE dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000094 VALUE=params.tloop
>> 2009-03-17 16:20:34,727-0500 WARN FlowNode Ex098
>> java.lang.NullPointerException
>> at org.griphyn.vdl.karajan.lib.VDLFunction.leafFileName(VDLFunction.java:285)
>> at org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:201)
>> at org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:182)
>> at org.griphyn.vdl.karajan.lib.swiftscript.FileName.function(FileName.java:19)
>> at org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67)
>>
>>
>>
>> On 3/17/09 4:17 PM, Michael Wilde wrote:
>>> It seems not related to scale or Falkon.
>>>
>>> It occurs when running on localhost (but on bgp) and when I cut all
>>> the loops down to a single iteration.
>>>
>>> I'm still debugging.
>>>
>>> On 3/17/09 4:07 PM, Michael Wilde wrote:
>>>> I just expanded my oops protein folding script to add another level
>>>> of parameter sweep. This script is getting pretty complex now (at
>>>> least, for a swift script).
>>>>
>>>> I got the following NPE on my first two tries. I'm going to start
>>>> debugging, but any clues as to the cause would be helpful.
>>>>
>>>> The outer loops are:
>>>>
>>>> main()
>>>> {
>>>> string protein[] = readData(@arg("plist"));
>>>> string startTemp[] = ["10","20"];
>>>> string tempUpdate[] = ["1","2","3"];
>>>>
>>>> foreach p in protein {
>>>> foreach st in startTemp {
>>>> foreach tu in tempUpdate {
>>>> doRoundSet(p,st,tu);
>>>> }
>>>> }
>>>> }
>>>> }
>>>>
>>>> There are two levels of inner loops further down below doRoundSet().
>>>>
>>>> The script, output, command line args and log are in:
>>>> http://ww.ci.uchicago.edu/~wilde/swift3.tar.gz
>>>>
>>>> I suspect it will take a while to narrow the cause to a simpler test
>>>> case that's easy to reproduce without a lot of setup.
>>>>
>>>> I'll try on a vanilla swift on local execution; this is on bgp with
>>>> Falkon.
>>>>
>>>> Thanks.
>>>>
>>>> --
>>>>
>>>> ...
>>>> Progress: uninitialized:1 Selecting site:2
>>>> SwiftScript trace: T1af7, Round, 0, Sim, 7
>>>> SwiftScript trace: T1af7, Round, 0, Sim, 2
>>>> SwiftScript trace: T1af7, Round, 0, Sim, 8
>>>> SwiftScript trace: T1af7, Round, 0, Sim, 0
>>>> SwiftScript trace: T1af7, Round, 0, Sim, 5
>>>> SwiftScript trace: T1af7, Round, 0, Sim, 9
>>>> SwiftScript trace: T1af7, Round, 0, Sim, 1
>>>> SwiftScript trace: T1af7, Round, 0, Sim, 6
>>>> SwiftScript trace: T1af7, Round, 0, Sim, 3
>>>> SwiftScript trace: T1af7, Round, 0, Sim, 4
>>>> Ex098
>>>> java.lang.NullPointerException
>>>> at org.griphyn.vdl.karajan.lib.VDLFunction.leafFileName(VDLFunction.java:285)
>>>> at org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:201)
>>>> at org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:182)
>>>> at org.griphyn.vdl.karajan.lib.swiftscript.FileName.function(FileName.java:19)
>>>> at org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67)
>>>> at org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192)
>>>> at org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33)
>>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:332)
>>>> at org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
>>>> at org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
>>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:176)
>>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:296)
>>>> at org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58)
>>>> at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:46)
>>>> at org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:51)
>>>> at org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:27)
>>>> at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.executeChildren(AbstractFunction.java:40)
>>>> at org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63)
>>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:233)
>>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:278)
>>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:391)
>>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:329)
>>>> at org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:227)
>>>> at org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
>>>> at org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
>>>> at org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java:69)
>>>>
>>>> Execution failed:
>>>> java.lang.NullPointerException
>>>> at org.griphyn.vdl.karajan.lib.VDLFunction.leafFileName(VDLFunction.java:285)
>>>> at org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:201)
>>>> at org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:182)
>>>> at org.griphyn.vdl.karajan.lib.swiftscript.FileName.function(FileName.java:19)
>>>> at org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67)
>>>> at org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192)
>>>> at org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33)
>>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:332)
>>>> at org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
>>>> at org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
>>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:176)
>>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:296)
>>>> at org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58)
>>>> at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:46)
>>>> at org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:51)
>>>> at org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:27)
>>>> at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.executeChildren(AbstractFunction.java:40)
>>>> at org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63)
>>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:233)
>>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:278)
>>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:391)
>>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:329)
>>>> at org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:227)
>>>> at org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
>>>> at org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
>>>> at org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java:69)
>>>>
>>>>
>>>> Ex098
>>>> java.lang.NullPointerException
>>>> at org.griphyn.vdl.karajan.lib.VDLFunction.leafFileName(VDLFunction.java:285)
>>>> at org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:201)
>>>> at org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:182)
>>>> at org.griphyn.vdl.karajan.lib.swiftscript.FileName.function(FileName.java:19)
>>>> at org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67)
>>>> at org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192)
>>>> at org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33)
>>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:332)
>>>> at org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
>>>> at org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
>>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:176)
>>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:296)
>>>> at org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58)
>>>> at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:46)
>>>> at org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:51)
>>>> at org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:27)
>>>> at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.executeChildren(AbstractFunction.java:40)
>>>> at org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63)
>>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:233)
>>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:278)
>>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:391)
>>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:329)
>>>> at org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:227)
>>>> at org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
>>>> at org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
>>>> at org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java:69)
>>>>
>>>> SwiftScript trace: T1af7, Round, 0, Sim, 7
From wilde at mcs.anl.gov Tue Mar 17 17:05:37 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 17 Mar 2009 17:05:37 -0500
Subject: [Swift-devel] null pointer exception from nested loops
In-Reply-To: <17AE2924-B021-46EC-9649-C80B1EBFFECD@anl.gov>
References: <49C010FE.4070503@mcs.anl.gov> <49C01351.1050807@mcs.anl.gov>
<49C0154C.7000608@mcs.anl.gov>
<17AE2924-B021-46EC-9649-C80B1EBFFECD@anl.gov>
Message-ID: <49C01EB1.5060801@mcs.anl.gov>
Not quite sure what you're asking, Ian.
The latest tests have been on BGP w/ Falkon.
Earlier tests were on other clusters.
The script has grown over the last week or so, on BGP, and grew some more
today to explore some new science code algorithm questions.
It's not yet running at the full desired scale on the BGP; we are now scaling
up carefully so as not to impact other users.
This is a test case for the "cio" work as well.
- Mike
On 3/17/09 4:26 PM, Ian Foster wrote:
> Just curious, is the whole thing working with just Falkon?
From hategan at mcs.anl.gov Tue Mar 17 17:10:02 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 17 Mar 2009 17:10:02 -0500
Subject: [Swift-devel] null pointer exception from nested loops
In-Reply-To: <49C01DF7.4000204@mcs.anl.gov>
References: <49C010FE.4070503@mcs.anl.gov> <49C01351.1050807@mcs.anl.gov>
<49C0154C.7000608@mcs.anl.gov> <49C01A2D.6050408@mcs.anl.gov>
<49C01DF7.4000204@mcs.anl.gov>
Message-ID: <1237327802.2668.0.camel@localhost>
Looks like it might be the nested procedure call interacting badly with
readData.
On Tue, 2009-03-17 at 17:02 -0500, Michael Wilde wrote:
> OK, I think this is it:
>
> The following script *works* with the commented-out code and fails with
> the currently enabled alternative statement:
>
> --
>
> type file;
>
> app (file editedParams) setTemps ( file inParams )
> {
> sed "-e" "s/@DTI@/123/" @inParams stdout=@editedParams;
> }
>
> file inParams;
>
> /* works:
> file o<"pout">;
> o = setTemps(inParams);
> string config [] = readData(o);
> */
>
> /* Fails: */
> string config [] = readData( setTemps(inParams ) );
>
> trace(0,config[0]);
> trace(1,config[1]);
> trace(2,config[2]);
>
> --
>
> So readData is indeed happy to take a file-valued variable as an argument,
> but not a file-valued expression (a procedure's return value, in this case).
>
>
>
> On 3/17/09 4:46 PM, Michael Wilde wrote:
> > I got the error narrowed down to this example:
> >
> > type file;
> >
> > app (file editedParams) setTemps ( file inParams )
> > {
> > echo @inParams stdout=@editedParams;
> > }
> >
> > file inParams;
> >
> > string config [] = readData( setTemps(inParams ) );
> > trace(0,config[0]);
> > trace(1,config[1]);
> > trace(2,config[2]);
> >
> > --
> >
> > params.tloops contains:
> >
> > DEFAULT_INIT_TEMP_=_@DTI@
> > TEMP_UPDATE_INTERVAL_=_@TUI@
> > KILL_TIME_=_3
> > MAX_NUMBER_OF_ANNEALING_STEPS_=_0
> >
> > --
> >
> > In other code, readData() seemed happy to take a file *or* a filename
> > string as input, but I wonder if it was not as happy as it seemed.
> >
> > I'd been taking advantage of the flexibility with good results (luck?)
> > so far, though.
> >
> > On 3/17/09 4:25 PM, Michael Wilde wrote:
> >> The log contains this just before the NPE, including the suspicious
> >> message: WARN FlowNode Ex098:
> >>
> >> That's giving me a clue as to the offending statements.
> >>
> >> ---
> >>
> >> 2009-03-17 16:20:34,723-0500 INFO AbstractDataNode Adding handle
> >> listener "F/org.griphyn.vdl.mapping.DataNode identifier
> >> tag:benc at ci.uchicago.edu,2008:swift:dataset:20090\
> >> 317-1620-e1n1bz3g:720000000071 type SecSeq with no value at
> >> dataset=secseq path=[0] (not closed)" to
> >> "org.griphyn.vdl.mapping.DataNode identifier tag:benc at ci.uchicago.edu,\
> >> 2008:swift:dataset:20090317-1620-e1n1bz3g:720000000071 type SecSeq
> >> with no value at dataset=secseq path=[0] (not closed)"
> >> 2009-03-17 16:20:34,724-0500 INFO AbstractDataNode closed
> >> org.griphyn.vdl.mapping.RootDataNode identifier
> >> tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz\
> >> 3g:720000000092 type string value=s/@DIT@/10/ dataset=unnamed
> >> SwiftScript value (closed)
> >> 2009-03-17 16:20:34,724-0500 INFO AbstractDataNode ROOTPATH
> >> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000092
> >> path=$
> >> 2009-03-17 16:20:34,724-0500 INFO AbstractDataNode VALUE
> >> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000092
> >> VALUE=s/@DIT@/10/
> >> 2009-03-17 16:20:34,725-0500 INFO AbstractDataNode closed
> >> org.griphyn.vdl.mapping.RootDataNode identifier
> >> tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz\
> >> 3g:720000000093 type string value=s/@TUI@/1/ dataset=unnamed
> >> SwiftScript value (closed)
> >> 2009-03-17 16:20:34,725-0500 INFO AbstractDataNode ROOTPATH
> >> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000093
> >> path=$
> >> 2009-03-17 16:20:34,725-0500 INFO AbstractDataNode VALUE
> >> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000093
> >> VALUE=s/@TUI@/1/
> >> 2009-03-17 16:20:34,726-0500 INFO AbstractDataNode closed
> >> org.griphyn.vdl.mapping.RootDataNode identifier
> >> tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz\
> >> 3g:720000000094 type string value=params.tloop dataset=unnamed
> >> SwiftScript value (closed)
> >> 2009-03-17 16:20:34,726-0500 INFO AbstractDataNode ROOTPATH
> >> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000094
> >> path=$
> >> 2009-03-17 16:20:34,726-0500 INFO AbstractDataNode VALUE
> >> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000094
> >> VALUE=params.tloop
> >> 2009-03-17 16:20:34,727-0500 WARN FlowNode Ex098
> >> java.lang.NullPointerException
> >> at
> >> org.griphyn.vdl.karajan.lib.VDLFunction.leafFileName(VDLFunction.java:285)
> >>
> >> at
> >> org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:201)
> >> at
> >> org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:182)
> >> at
> >> org.griphyn.vdl.karajan.lib.swiftscript.FileName.function(FileName.java:19)
> >>
> >> at
> >> org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67)
> >>
> >>
> >>
> >> On 3/17/09 4:17 PM, Michael Wilde wrote:
> >>> It seems not related to scale or Falkon.
> >>>
> >>> It occurs when running on localhost (but on bgp) and when I cut all
> >>> the loops down to a single iteration.
> >>>
> >>> I'm still debugging.
> >>>
> >>> On 3/17/09 4:07 PM, Michael Wilde wrote:
> >>>> I just expanded my oops protein folding script to add another level
> >>>> of parameter sweep. This script is getting pretty complex now (at
> >>>> least, for a swift script).
> >>>>
> >>>> I got the following NPE on my first two tries. I'm going to start
> >>>> debugging, but any clues as to the cause would be helpful.
> >>>>
> >>>> The outer loops are:
> >>>>
> >>>> main()
> >>>> {
> >>>> string protein[] = readData(@arg("plist"));
> >>>> string startTemp[] = ["10","20"];
> >>>> string tempUpdate[] = ["1","2","3"];
> >>>>
> >>>> foreach p in protein {
> >>>> foreach st in startTemp {
> >>>> foreach tu in tempUpdate {
> >>>> doRoundSet(p,st,tu);
> >>>> }
> >>>> }
> >>>> }
> >>>> }
> >>>>
> >>>> There are two levels of inner loops further down below doRoundSet().
> >>>>
> >>>> The script, output, command line args and log are in:
> >>>> http://ww.ci.uchicago.edu/~wilde/swift3.tar.gz
> >>>>
> >>>> I suspect it will take a while to narrow the cause to a simpler test
> >>>> case that's easy to reproduce without a lot of setup.
> >>>>
> >>>> I'll try on a vanilla swift on local execution; this is on bgp with
> >>>> Falkon.
> >>>>
> >>>> Thanks.
> >>>>
> >>>> --
> >>>>
> >>>> ...
> >>>> Progress: uninitialized:1 Selecting site:2
> >>>> SwiftScript trace: T1af7, Round, 0, Sim, 7
> >>>> SwiftScript trace: T1af7, Round, 0, Sim, 2
> >>>> SwiftScript trace: T1af7, Round, 0, Sim, 8
> >>>> SwiftScript trace: T1af7, Round, 0, Sim, 0
> >>>> SwiftScript trace: T1af7, Round, 0, Sim, 5
> >>>> SwiftScript trace: T1af7, Round, 0, Sim, 9
> >>>> SwiftScript trace: T1af7, Round, 0, Sim, 1
> >>>> SwiftScript trace: T1af7, Round, 0, Sim, 6
> >>>> SwiftScript trace: T1af7, Round, 0, Sim, 3
> >>>> SwiftScript trace: T1af7, Round, 0, Sim, 4
> >>>> Ex098
> >>>> java.lang.NullPointerException
> >>>> at
> >>>> org.griphyn.vdl.karajan.lib.VDLFunction.leafFileName(VDLFunction.java:285)
> >>>>
> >>>> at
> >>>> org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:201)
> >>>> at
> >>>> org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:182)
> >>>> at
> >>>> org.griphyn.vdl.karajan.lib.swiftscript.FileName.function(FileName.java:19)
> >>>>
> >>>> at
> >>>> org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67)
> >>>> at
> >>>> org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192)
> >>>>
> >>>> at
> >>>> org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33)
> >>>>
> >>>> at
> >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:332)
> >>>> at
> >>>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
> >>>> at
> >>>> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
> >>>>
> >>>> at
> >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:176)
> >>>>
> >>>> at
> >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:296)
> >>>>
> >>>> at
> >>>> org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58)
> >>>>
> >>>> at
> >>>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:46)
> >>>>
> >>>> at
> >>>> org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:51)
> >>>>
> >>>> at
> >>>> org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:27)
> >>>>
> >>>> at
> >>>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.executeChildren(AbstractFunction.java:40)
> >>>>
> >>>> at
> >>>> org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63)
> >>>>
> >>>> at
> >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:233)
> >>>>
> >>>> at
> >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:278)
> >>>> at
> >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:391)
> >>>>
> >>>> at
> >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:329)
> >>>> at
> >>>> org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:227)
> >>>>
> >>>> at
> >>>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
> >>>> at
> >>>> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
> >>>>
> >>>> at
> >>>> org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java:69)
> >>>>
> >>>> Execution failed:
> >>>> java.lang.NullPointerException
> >>>> at
> >>>> org.griphyn.vdl.karajan.lib.VDLFunction.leafFileName(VDLFunction.java:285)
> >>>>
> >>>> at
> >>>> org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:201)
> >>>> at
> >>>> org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:182)
> >>>> at
> >>>> org.griphyn.vdl.karajan.lib.swiftscript.FileName.function(FileName.java:19)
> >>>>
> >>>> at
> >>>> org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67)
> >>>> at
> >>>> org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192)
> >>>>
> >>>> at
> >>>> org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33)
> >>>>
> >>>> at
> >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:332)
> >>>> at
> >>>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
> >>>> at
> >>>> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
> >>>>
> >>>> at
> >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:176)
> >>>>
> >>>> at
> >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:296)
> >>>>
> >>>> at
> >>>> org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58)
> >>>>
> >>>> at
> >>>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:46)
> >>>>
> >>>> at
> >>>> org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:51)
> >>>>
> >>>> at
> >>>> org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:27)
> >>>>
> >>>> at
> >>>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.executeChildren(AbstractFunction.java:40)
> >>>>
> >>>> at
> >>>> org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63)
> >>>>
> >>>> at
> >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:233)
> >>>>
> >>>> at
> >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:278)
> >>>> at
> >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:391)
> >>>>
> >>>> at
> >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:329)
> >>>> at
> >>>> org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:227)
> >>>>
> >>>> at
> >>>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
> >>>> at
> >>>> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
> >>>>
> >>>> at
> >>>> org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java:69)
> >>>>
> >>>>
> >>>> Ex098
> >>>> java.lang.NullPointerException
> >>>> at
> >>>> org.griphyn.vdl.karajan.lib.VDLFunction.leafFileName(VDLFunction.java:285)
> >>>>
> >>>> at
> >>>> org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:201)
> >>>> at
> >>>> org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:182)
> >>>> at
> >>>> org.griphyn.vdl.karajan.lib.swiftscript.FileName.function(FileName.java:19)
> >>>>
> >>>> at
> >>>> org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67)
> >>>> at
> >>>> org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192)
> >>>>
> >>>> at
> >>>> org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33)
> >>>>
> >>>> at
> >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:332)
> >>>> at
> >>>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
> >>>> at
> >>>> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
> >>>>
> >>>> at
> >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:176)
> >>>>
> >>>> at
> >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:296)
> >>>>
> >>>> at
> >>>> org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58)
> >>>>
> >>>> at
> >>>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:46)
> >>>>
> >>>> at
> >>>> org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:51)
> >>>>
> >>>> at
> >>>> org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:27)
> >>>>
> >>>> at
> >>>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.executeChildren(AbstractFunction.java:40)
> >>>>
> >>>> at
> >>>> org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63)
> >>>>
> >>>> at
> >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:233)
> >>>>
> >>>> at
> >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:278)
> >>>> at
> >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:391)
> >>>>
> >>>> at
> >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:329)
> >>>> at
> >>>> org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:227)
> >>>>
> >>>> at
> >>>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
> >>>> at
> >>>> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
> >>>>
> >>>> at
> >>>> org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java:69)
> >>>>
> >>>> SwiftScript trace: T1af7, Round, 0, Sim, 7
From foster at anl.gov Tue Mar 17 17:22:07 2009
From: foster at anl.gov (Ian Foster)
Date: Tue, 17 Mar 2009 17:22:07 -0500
Subject: [Swift-devel] null pointer exception from nested loops
In-Reply-To: <49C01EB1.5060801@mcs.anl.gov>
References: <49C010FE.4070503@mcs.anl.gov> <49C01351.1050807@mcs.anl.gov>
<49C0154C.7000608@mcs.anl.gov>
<17AE2924-B021-46EC-9649-C80B1EBFFECD@anl.gov>
<49C01EB1.5060801@mcs.anl.gov>
Message-ID:
I wasn't sure if this ran without Swift--with just Falkon
On Mar 17, 2009, at 5:05 PM, Michael Wilde wrote:
> Not quite sure what you're asking, Ian.
>
> The latest tests have been on BGP w/ Falkon.
> Earlier tests were on other clusters.
>
> The script has grown in the last week or so, on BGP, and grew some more
> today to explore some new science code algorithm questions.
>
> It's not yet running at full desired scale on the BGP; we are now
> scaling up carefully so as not to impact other users.
>
> This is a test case for the "cio" work as well.
>
> - Mike
>
>
> On 3/17/09 4:26 PM, Ian Foster wrote:
>> Just curious, is the whole thing working with just Falkon?
>> On Mar 17, 2009, at 4:25 PM, Michael Wilde wrote:
>>> The log contains this just before the NPE, including the
>>> suspicious message: WARN FlowNode Ex098:
>>>
>>> That's giving me a clue as to the offending statements.
>>>
>>> ---
>>>
>>> 2009-03-17 16:20:34,723-0500 INFO AbstractDataNode Adding handle
>>> listener "F/org.griphyn.vdl.mapping.DataNode identifier
>>> tag:benc at ci.uchicago.edu,2008:swift:dataset:20090\
>>> 317-1620-e1n1bz3g:720000000071 type SecSeq with no value at
>>> dataset=secseq path=[0] (not closed)" to
>>> "org.griphyn.vdl.mapping.DataNode identifier tag:benc at ci.uchicago.edu,\
>>> 2008:swift:dataset:20090317-1620-e1n1bz3g:720000000071 type SecSeq
>>> with no value at dataset=secseq path=[0] (not closed)"
>>> 2009-03-17 16:20:34,724-0500 INFO AbstractDataNode closed
>>> org.griphyn.vdl.mapping.RootDataNode identifier
>>> tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz\
>>> 3g:720000000092 type string value=s/@DIT@/10/ dataset=unnamed
>>> SwiftScript value (closed)
>>> 2009-03-17 16:20:34,724-0500 INFO AbstractDataNode ROOTPATH
>>> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000092
>>> path=$
>>> 2009-03-17 16:20:34,724-0500 INFO AbstractDataNode VALUE
>>> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000092
>>> VALUE=s/@DIT@/10/
>>> 2009-03-17 16:20:34,725-0500 INFO AbstractDataNode closed
>>> org.griphyn.vdl.mapping.RootDataNode identifier
>>> tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz\
>>> 3g:720000000093 type string value=s/@TUI@/1/ dataset=unnamed
>>> SwiftScript value (closed)
>>> 2009-03-17 16:20:34,725-0500 INFO AbstractDataNode ROOTPATH
>>> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000093
>>> path=$
>>> 2009-03-17 16:20:34,725-0500 INFO AbstractDataNode VALUE
>>> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000093
>>> VALUE=s/@TUI@/1/
>>> 2009-03-17 16:20:34,726-0500 INFO AbstractDataNode closed
>>> org.griphyn.vdl.mapping.RootDataNode identifier
>>> tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz\
>>> 3g:720000000094 type string value=params.tloop dataset=unnamed
>>> SwiftScript value (closed)
>>> 2009-03-17 16:20:34,726-0500 INFO AbstractDataNode ROOTPATH
>>> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000094
>>> path=$
>>> 2009-03-17 16:20:34,726-0500 INFO AbstractDataNode VALUE
>>> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000094
>>> VALUE=params.tloop
>>> 2009-03-17 16:20:34,727-0500 WARN FlowNode Ex098
>>> java.lang.NullPointerException
>>> at org.griphyn.vdl.karajan.lib.VDLFunction.leafFileName(VDLFunction.java:285)
>>> at org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:201)
>>> at org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:182)
>>> at org.griphyn.vdl.karajan.lib.swiftscript.FileName.function(FileName.java:19)
>>> at org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67)
>>>
>>>
>>>
>>> On 3/17/09 4:17 PM, Michael Wilde wrote:
>>>> It seems not related to scale or Falkon.
>>>> It occurs when running on localhost (but on bgp) and when I cut
>>>> all the loops down to a single iteration.
>>>> I'm still debugging.
>>>> On 3/17/09 4:07 PM, Michael Wilde wrote:
>>>>> I just expanded my oops protein folding script to add another
>>>>> level of parameter sweep. This script is getting pretty complex
>>>>> now (at least, for a swift script).
>>>>>
>>>>> I got the following NPE on my first two tries. I'm going to start
>>>>> debugging, but any clues as to the cause would be helpful.
>>>>>
>>>>> The outer loops are:
>>>>>
>>>>> main()
>>>>> {
>>>>> string protein[] = readData(@arg("plist"));
>>>>> string startTemp[] = ["10","20"];
>>>>> string tempUpdate[] = ["1","2","3"];
>>>>>
>>>>> foreach p in protein {
>>>>> foreach st in startTemp {
>>>>> foreach tu in tempUpdate {
>>>>> doRoundSet(p,st,tu);
>>>>> }
>>>>> }
>>>>> }
>>>>> }
>>>>>
>>>>> There are two levels of inner loops further down below
>>>>> doRoundSet().
>>>>>
>>>>> The script, output, command line args and log are in:
>>>>> http://ww.ci.uchicago.edu/~wilde/swift3.tar.gz
>>>>>
>>>>> I suspect it will take a while to narrow the cause to a simpler
>>>>> test case that's easy to reproduce without a lot of setup.
>>>>>
>>>>> I'll try on a vanilla swift on local execution; this is on bgp
>>>>> with Falkon.
>>>>>
>>>>> Thanks.
>>>>>
>>>>> --
>>>>>
>>>>> ...
>>>>> Progress: uninitialized:1 Selecting site:2
>>>>> SwiftScript trace: T1af7, Round, 0, Sim, 7
>>>>> SwiftScript trace: T1af7, Round, 0, Sim, 2
>>>>> SwiftScript trace: T1af7, Round, 0, Sim, 8
>>>>> SwiftScript trace: T1af7, Round, 0, Sim, 0
>>>>> SwiftScript trace: T1af7, Round, 0, Sim, 5
>>>>> SwiftScript trace: T1af7, Round, 0, Sim, 9
>>>>> SwiftScript trace: T1af7, Round, 0, Sim, 1
>>>>> SwiftScript trace: T1af7, Round, 0, Sim, 6
>>>>> SwiftScript trace: T1af7, Round, 0, Sim, 3
>>>>> SwiftScript trace: T1af7, Round, 0, Sim, 4
>>>>> Ex098
>>>>> java.lang.NullPointerException
>>>>> at org.griphyn.vdl.karajan.lib.VDLFunction.leafFileName(VDLFunction.java:285)
>>>>> at org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:201)
>>>>> at org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:182)
>>>>> at org.griphyn.vdl.karajan.lib.swiftscript.FileName.function(FileName.java:19)
>>>>> at org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67)
>>>>> at org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192)
>>>>> at org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33)
>>>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:332)
>>>>> at org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
>>>>> at org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
>>>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:176)
>>>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:296)
>>>>> at org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58)
>>>>> at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:46)
>>>>> at org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:51)
>>>>> at org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:27)
>>>>> at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.executeChildren(AbstractFunction.java:40)
>>>>> at org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63)
>>>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:233)
>>>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:278)
>>>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:391)
>>>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:329)
>>>>> at org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:227)
>>>>> at org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
>>>>> at org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
>>>>> at org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java:69)
>>>>> Execution failed:
>>>>> java.lang.NullPointerException
>>>>> at org.griphyn.vdl.karajan.lib.VDLFunction.leafFileName(VDLFunction.java:285)
>>>>> at org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:201)
>>>>> at org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:182)
>>>>> at org.griphyn.vdl.karajan.lib.swiftscript.FileName.function(FileName.java:19)
>>>>> at org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67)
>>>>> at org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192)
>>>>> at org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33)
>>>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:332)
>>>>> at org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
>>>>> at org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
>>>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:176)
>>>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:296)
>>>>> at org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58)
>>>>> at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:46)
>>>>> at org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:51)
>>>>> at org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:27)
>>>>> at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.executeChildren(AbstractFunction.java:40)
>>>>> at org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63)
>>>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:233)
>>>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:278)
>>>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:391)
>>>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:329)
>>>>> at org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:227)
>>>>> at org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
>>>>> at org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
>>>>> at org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java:69)
>>>>>
>>>>> Ex098
>>>>> java.lang.NullPointerException
>>>>> at org.griphyn.vdl.karajan.lib.VDLFunction.leafFileName(VDLFunction.java:285)
>>>>> at org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:201)
>>>>> at org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:182)
>>>>> at org.griphyn.vdl.karajan.lib.swiftscript.FileName.function(FileName.java:19)
>>>>> at org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67)
>>>>> at org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192)
>>>>> at org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33)
>>>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:332)
>>>>> at org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
>>>>> at org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
>>>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:176)
>>>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:296)
>>>>> at org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58)
>>>>> at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:46)
>>>>> at org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:51)
>>>>> at org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:27)
>>>>> at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.executeChildren(AbstractFunction.java:40)
>>>>> at org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63)
>>>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:233)
>>>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:278)
>>>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:391)
>>>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:329)
>>>>> at org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:227)
>>>>> at org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
>>>>> at org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
>>>>> at org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java:69)
>>>>> SwiftScript trace: T1af7, Round, 0, Sim, 7
From wilde at mcs.anl.gov Tue Mar 17 18:03:34 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 17 Mar 2009 18:03:34 -0500
Subject: [Swift-devel] null pointer exception from nested loops
In-Reply-To:
References: <49C010FE.4070503@mcs.anl.gov> <49C01351.1050807@mcs.anl.gov>
<49C0154C.7000608@mcs.anl.gov>
<17AE2924-B021-46EC-9649-C80B1EBFFECD@anl.gov>
<49C01EB1.5060801@mcs.anl.gov>
Message-ID: <49C02C46.2070106@mcs.anl.gov>
I see. No, we've been doing all tests through Swift, but testing the app
standalone at various points on both local hosts and bgp compute nodes.
On 3/17/09 5:22 PM, Ian Foster wrote:
> I wasn't sure if this ran without Swift--with just Falkon
>
>
> On Mar 17, 2009, at 5:05 PM, Michael Wilde wrote:
>
>> Not quite sure what you're asking, Ian.
>>
>> The latest tests have been on BGP w/ Falkon.
>> Earlier tests were on other clusters.
>>
>> The script has grown in the last week or so, on BGP, and grew some more
>> today to explore some new science code algorithm questions.
>>
>> It's not yet running at full desired scale on the BGP; we are now
>> scaling up carefully so as not to impact other users.
>>
>> This is a test case for the "cio" work as well.
>>
>> - Mike
>>
>>
>> On 3/17/09 4:26 PM, Ian Foster wrote:
>>> Just curious, is the whole thing working with just Falkon?
>>> On Mar 17, 2009, at 4:25 PM, Michael Wilde wrote:
>>>> The log contains this just before the NPE, including the suspicious
>>>> message: WARN FlowNode Ex098:
>>>>
>>>> That's giving me a clue as to the offending statements.
>>>>
>>>> ---
>>>>
>>>> 2009-03-17 16:20:34,723-0500 INFO AbstractDataNode Adding handle listener "F/org.griphyn.vdl.mapping.DataNode identifier tag:benc@ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000071 type SecSeq with no value at dataset=secseq path=[0] (not closed)" to "org.griphyn.vdl.mapping.DataNode identifier tag:benc@ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000071 type SecSeq with no value at dataset=secseq path=[0] (not closed)"
>>>> 2009-03-17 16:20:34,724-0500 INFO AbstractDataNode closed org.griphyn.vdl.mapping.RootDataNode identifier tag:benc@ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000092 type string value=s/@DIT@/10/ dataset=unnamed SwiftScript value (closed)
>>>> 2009-03-17 16:20:34,724-0500 INFO AbstractDataNode ROOTPATH dataset=tag:benc@ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000092 path=$
>>>> 2009-03-17 16:20:34,724-0500 INFO AbstractDataNode VALUE dataset=tag:benc@ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000092 VALUE=s/@DIT@/10/
>>>> 2009-03-17 16:20:34,725-0500 INFO AbstractDataNode closed org.griphyn.vdl.mapping.RootDataNode identifier tag:benc@ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000093 type string value=s/@TUI@/1/ dataset=unnamed SwiftScript value (closed)
>>>> 2009-03-17 16:20:34,725-0500 INFO AbstractDataNode ROOTPATH dataset=tag:benc@ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000093 path=$
>>>> 2009-03-17 16:20:34,725-0500 INFO AbstractDataNode VALUE dataset=tag:benc@ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000093 VALUE=s/@TUI@/1/
>>>> 2009-03-17 16:20:34,726-0500 INFO AbstractDataNode closed org.griphyn.vdl.mapping.RootDataNode identifier tag:benc@ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000094 type string value=params.tloop dataset=unnamed SwiftScript value (closed)
>>>> 2009-03-17 16:20:34,726-0500 INFO AbstractDataNode ROOTPATH dataset=tag:benc@ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000094 path=$
>>>> 2009-03-17 16:20:34,726-0500 INFO AbstractDataNode VALUE dataset=tag:benc@ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000094 VALUE=params.tloop
>>>> 2009-03-17 16:20:34,727-0500 WARN FlowNode Ex098
>>>> java.lang.NullPointerException
>>>> at org.griphyn.vdl.karajan.lib.VDLFunction.leafFileName(VDLFunction.java:285)
>>>> at org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:201)
>>>> at org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:182)
>>>> at org.griphyn.vdl.karajan.lib.swiftscript.FileName.function(FileName.java:19)
>>>> at org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67)
>>>>
>>>>
>>>>
>>>> On 3/17/09 4:17 PM, Michael Wilde wrote:
>>>>> It seems not related to scale or Falkon.
>>>>> It occurs when running on localhost (but on bgp) and when I cut all
>>>>> the loops down to a single iteration.
>>>>> I'm still debugging.
>>>>> On 3/17/09 4:07 PM, Michael Wilde wrote:
>>>>>> I just expanded my oops protein folding script to add another
>>>>>> level of parameter sweep. This script is getting pretty complex
>>>>>> now (at least, for a swift script).
>>>>>>
>>>>>> I got the following NPE on my first two tries. I'm going to start
>>>>>> debugging, but any clues as to the cause would be helpful.
>>>>>>
>>>>>> The outer loops are:
>>>>>>
>>>>>> main()
>>>>>> {
>>>>>> string protein[] = readData(@arg("plist"));
>>>>>> string startTemp[] = ["10","20"];
>>>>>> string tempUpdate[] = ["1","2","3"];
>>>>>>
>>>>>> foreach p in protein {
>>>>>> foreach st in startTemp {
>>>>>> foreach tu in tempUpdate {
>>>>>> doRoundSet(p,st,tu);
>>>>>> }
>>>>>> }
>>>>>> }
>>>>>> }
>>>>>>
>>>>>> There are two levels of inner loops further down below doRoundSet().
>>>>>>
>>>>>> The script, output, command line args and log are in:
>>>>>> http://ww.ci.uchicago.edu/~wilde/swift3.tar.gz
>>>>>>
>>>>>> I suspect it will take a while to narrow the cause to a simpler
>>>>>> test case that's easy to reproduce without a lot of setup.
>>>>>>
>>>>>> I'll try on a vanilla swift on local execution; this is on bgp
>>>>>> with Falkon.
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>> --
>>>>>>
>>>>>> ...
>>>>>> Progress: uninitialized:1 Selecting site:2
>>>>>> SwiftScript trace: T1af7, Round, 0, Sim, 7
>>>>>> SwiftScript trace: T1af7, Round, 0, Sim, 2
>>>>>> SwiftScript trace: T1af7, Round, 0, Sim, 8
>>>>>> SwiftScript trace: T1af7, Round, 0, Sim, 0
>>>>>> SwiftScript trace: T1af7, Round, 0, Sim, 5
>>>>>> SwiftScript trace: T1af7, Round, 0, Sim, 9
>>>>>> SwiftScript trace: T1af7, Round, 0, Sim, 1
>>>>>> SwiftScript trace: T1af7, Round, 0, Sim, 6
>>>>>> SwiftScript trace: T1af7, Round, 0, Sim, 3
>>>>>> SwiftScript trace: T1af7, Round, 0, Sim, 4
>>>>>> Ex098
>>>>>> java.lang.NullPointerException
>>>>>> at
>>>>>> org.griphyn.vdl.karajan.lib.VDLFunction.leafFileName(VDLFunction.java:285)
>>>>>>
>>>>>> at
>>>>>> org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:201)
>>>>>> at
>>>>>> org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:182)
>>>>>> at
>>>>>> org.griphyn.vdl.karajan.lib.swiftscript.FileName.function(FileName.java:19)
>>>>>>
>>>>>> at org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67)
>>>>>> at
>>>>>> org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192)
>>>>>>
>>>>>> at
>>>>>> org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33)
>>>>>>
>>>>>> at
>>>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:332)
>>>>>> at
>>>>>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
>>>>>> at
>>>>>> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
>>>>>>
>>>>>> at
>>>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:176)
>>>>>>
>>>>>> at
>>>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:296)
>>>>>>
>>>>>> at
>>>>>> org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58)
>>>>>>
>>>>>> at
>>>>>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:46)
>>>>>>
>>>>>> at
>>>>>> org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:51)
>>>>>>
>>>>>> at
>>>>>> org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:27)
>>>>>>
>>>>>> at
>>>>>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.executeChildren(AbstractFunction.java:40)
>>>>>>
>>>>>> at
>>>>>> org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63)
>>>>>>
>>>>>> at
>>>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:233)
>>>>>>
>>>>>> at
>>>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:278)
>>>>>> at
>>>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:391)
>>>>>>
>>>>>> at
>>>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:329)
>>>>>> at
>>>>>> org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:227)
>>>>>>
>>>>>> at
>>>>>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
>>>>>> at
>>>>>> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
>>>>>>
>>>>>> at
>>>>>> org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java:69)
>>>>>>
>>>>>> Execution failed:
>>>>>> java.lang.NullPointerException
>>>>>> at
>>>>>> org.griphyn.vdl.karajan.lib.VDLFunction.leafFileName(VDLFunction.java:285)
>>>>>>
>>>>>> at
>>>>>> org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:201)
>>>>>> at
>>>>>> org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:182)
>>>>>> at
>>>>>> org.griphyn.vdl.karajan.lib.swiftscript.FileName.function(FileName.java:19)
>>>>>>
>>>>>> at org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67)
>>>>>> at
>>>>>> org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192)
>>>>>>
>>>>>> at
>>>>>> org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33)
>>>>>>
>>>>>> at
>>>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:332)
>>>>>> at
>>>>>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
>>>>>> at
>>>>>> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
>>>>>>
>>>>>> at
>>>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:176)
>>>>>>
>>>>>> at
>>>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:296)
>>>>>>
>>>>>> at
>>>>>> org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58)
>>>>>>
>>>>>> at
>>>>>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:46)
>>>>>>
>>>>>> at
>>>>>> org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:51)
>>>>>>
>>>>>> at
>>>>>> org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:27)
>>>>>>
>>>>>> at
>>>>>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.executeChildren(AbstractFunction.java:40)
>>>>>>
>>>>>> at
>>>>>> org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63)
>>>>>>
>>>>>> at
>>>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:233)
>>>>>>
>>>>>> at
>>>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:278)
>>>>>> at
>>>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:391)
>>>>>>
>>>>>> at
>>>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:329)
>>>>>> at
>>>>>> org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:227)
>>>>>>
>>>>>> at
>>>>>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
>>>>>> at
>>>>>> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
>>>>>>
>>>>>> at
>>>>>> org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java:69)
>>>>>>
>>>>>>
>>>>>> Ex098
>>>>>> java.lang.NullPointerException
>>>>>> at
>>>>>> org.griphyn.vdl.karajan.lib.VDLFunction.leafFileName(VDLFunction.java:285)
>>>>>>
>>>>>> at
>>>>>> org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:201)
>>>>>> at
>>>>>> org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:182)
>>>>>> at
>>>>>> org.griphyn.vdl.karajan.lib.swiftscript.FileName.function(FileName.java:19)
>>>>>>
>>>>>> at org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67)
>>>>>> at
>>>>>> org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192)
>>>>>>
>>>>>> at
>>>>>> org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33)
>>>>>>
>>>>>> at
>>>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:332)
>>>>>> at
>>>>>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
>>>>>> at
>>>>>> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
>>>>>>
>>>>>> at
>>>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:176)
>>>>>>
>>>>>> at
>>>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:296)
>>>>>>
>>>>>> at
>>>>>> org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58)
>>>>>>
>>>>>> at
>>>>>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:46)
>>>>>>
>>>>>> at
>>>>>> org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:51)
>>>>>>
>>>>>> at
>>>>>> org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:27)
>>>>>>
>>>>>> at
>>>>>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.executeChildren(AbstractFunction.java:40)
>>>>>>
>>>>>> at
>>>>>> org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63)
>>>>>>
>>>>>> at
>>>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:233)
>>>>>>
>>>>>> at
>>>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:278)
>>>>>> at
>>>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:391)
>>>>>>
>>>>>> at
>>>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:329)
>>>>>> at
>>>>>> org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:227)
>>>>>>
>>>>>> at
>>>>>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
>>>>>> at
>>>>>> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
>>>>>>
>>>>>> at
>>>>>> org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java:69)
>>>>>>
>>>>>> SwiftScript trace: T1af7, Round, 0, Sim, 7
>
From skenny at uchicago.edu Tue Mar 17 22:14:40 2009
From: skenny at uchicago.edu (skenny at uchicago.edu)
Date: Tue, 17 Mar 2009 22:14:40 -0500 (CDT)
Subject: [Swift-devel] How does swift know if a task is successful
Message-ID: <20090317221440.BUF44237@m4500-02.uchicago.edu>
hey zhao, did you get this to work? was thinking i might try
it on ranger, but i was wondering if you also then have to
hack something else to prevent swift from cleaning up your
work directory? that is, i assume you actually DO want the
output, you just don't want to have to wait for the stageouts (?)
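The rule Mihael states in the quoted thread below (once a job is done, Swift tries to stage out every file it expects the job to have produced, and a missing file is a failure) can be sketched roughly as follows. This is a hypothetical Python model for illustration only, not Swift's implementation; the names `missing_outputs`, `job_succeeded`, `status_ok` and `expected_outputs` are assumptions made up for this sketch.

```python
import os
import tempfile


def missing_outputs(work_dir, expected_outputs):
    """Return the expected output files that are absent from work_dir."""
    return [name for name in expected_outputs
            if not os.path.isfile(os.path.join(work_dir, name))]


def job_succeeded(status_ok, work_dir, expected_outputs):
    """Hypothetical success rule: a good status notification is necessary
    but not sufficient; every expected output must also be present so that
    it can be staged out."""
    return status_ok and not missing_outputs(work_dir, expected_outputs)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as shared:
        # Status says "done", but hello.txt was never written: counts as failed.
        print(job_succeeded(True, shared, ["hello.txt"]))
        # Once the file exists, the same job counts as successful.
        open(os.path.join(shared, "hello.txt"), "w").close()
        print(job_succeeded(True, shared, ["hello.txt"]))
```

Under this model, skipping the copy-back in the wrapper while the stage-out step still runs would give the "No such file" failure reported in the quoted run.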
---- Original message ----
>Date: Tue, 17 Mar 2009 13:40:30 -0500
>From: Mihael Hategan
>Subject: Re: [Swift-devel] How does swift know if a task is successful
>To: Zhao Zhang
>Cc: swift-devel
>
>On Tue, 2009-03-17 at 13:36 -0500, Zhao Zhang wrote:
>> Hi, Mihael
>>
>> I commented the following lines
>> /*dir:make(ldir)
>> restartOnError(".*", 2
>> task:transfer(srchost=host, srcfile=bname,
>> srcdir=rdir, destdir=ldir, desthost=dhost, destprovider=provider)
>> )*/
>>
>
>Did you modify this file in dist/?/libexec? If not, did you re-compile
>swift after the modification?
>
>Put an echo or a log message in place, to see if your change is picked
>up by swift next time.
>
>> Then I modified wrapper.sh to not to copy output file back, but I still
>> got an error.
>> The log file is at
>> http://www.ci.uchicago.edu/~zzhang/first-20090317-1327-oqgttus8.log
>> Thanks
>>
>> zhao
>>
>> zzhang at login6.surveyor:~/new_dock6> ./run_swift_ssh.sh 1117 64 first.swift
>> waiting for at least 64 nodes to register before submitting workload...
>> waiting to find at least 1 services in file
>> /home/falkon/users/zzhang/1117/config/Client-service-URIs.config...
>> all done, file has found at least 1 services
>> found at least 64 registered, submitting workload...
>> Swift svn swift-r2676 (swift modified locally) cog-r2305
>>
>> RunID: 20090317-1327-oqgttus8
>> Progress:
>> Progress: Selecting site:1 Stage in:1
>> Progress: Submitting:1 Submitted:1
>> Progress: Submitted:1 Failed but can retry:1
>> Failed to transfer wrapper log from
>> first-20090317-1327-oqgttus8/info/b/n/bgp000
>> Progress: Submitted:1 Active:1
>> Failed to transfer wrapper log from
>> first-20090317-1327-oqgttus8/info/e/n/bgp000
>> Progress: Submitted:1 Active:1
>> Failed to transfer wrapper log from
>> first-20090317-1327-oqgttus8/info/g/n/bgp000
>> Execution failed:
>> Exception in echo:
>> Arguments: [Hello, world!]
>> Host: bgp000
>> Directory: first-20090317-1327-oqgttus8/jobs/g/n/echo-gnlq238j
>> stderr.txt:
>>
>> stdout.txt:
>>
>> ----
>>
>> Caused by:
>> Cannot transfer
>> "/tmp/first-20090317-1327-oqgttus8/shared/hello.txt" to
>> "/gpfs/home/zzhang/new_dock6/./hello.txt"
>> Caused by:
>> No such file
>>
>>
>> Mihael Hategan wrote:
>> > On Tue, 2009-03-17 at 12:23 -0500, Zhao Zhang wrote:
>> >
>> >> Hi, Mihael
>> >>
>> >> yes, can I do that?
>> >>
>> >
>> > You should know this by now:
>> > in vdl-int.k, in doStageout, comment out the task:transfer invocation
>> > (and dir:make).
>> >
>> >
>> >> zhao
>> >>
>> >> Mihael Hategan wrote:
>> >>
>> >>> On Tue, 2009-03-17 at 12:14 -0500, Zhao Zhang wrote:
>> >>>
>> >>>
>> >>>> Here comes another question, is there any place that I could set to
>> >>>> disable swift's waiting for data feature?
>> >>>>
>> >>>>
>> >>> Do you mean disable the stage-outs?
>> >>>
>> >>>
>> >>>
>> >>>> Or is there any way for me to cheat swift that the data is already
>> >>>> there? thanks.
>> >>>>
>> >>>> zhao
>> >>>>
>> >>>> Mihael Hategan wrote:
>> >>>>
>> >>>>
>> >>>>> On Fri, 2009-03-13 at 17:07 -0500, Zhao Zhang wrote:
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>> Hi, All
>> >>>>>>
>> >>>>>> I have a question on how swift knows if a task is successful.
>> >>>>>> In my case, I am using a status notification instead of a status file.
>> >>>>>>
>> >>>>>> So my question is is this status notification the only thing swift is
>> >>>>>> waiting for, or is swift also waiting for the output data to appear to
>> >>>>>> say that a job is successful?
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>> Once the job is done, swift will attempt to stage out all the files that
>> >>>>> it expects the job to have produced.
>> >>>>>
>> >>>>> Should one of those files not be there, there will be failures.
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>
>> >>>
>> >
>> >
>> >
>
From zhaozhang at uchicago.edu Tue Mar 17 23:04:59 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Tue, 17 Mar 2009 23:04:59 -0500
Subject: [Swift-devel] How does swift know if a task is successful
In-Reply-To: <20090317221440.BUF44237@m4500-02.uchicago.edu>
References: <20090317221440.BUF44237@m4500-02.uchicago.edu>
Message-ID: <49C072EB.2060109@uchicago.edu>
Hi, Sarah
skenny at uchicago.edu wrote:
> hey zhao, did you get this to work?
Not yet, I am still working on it.
> was thinking i might try
> it on ranger, but i was wondering if you also then have to
> hack something else to prevent swift from cleaning up your
> work directory?
I think preventing Swift from cleaning up the work dir is just an option in
swift.properties.
> that is, i assume you actually DO want the
> output, you just don't want to have to wait for the stageouts (?)
>
Exactly. We are currently building a collective IO (CIO) system on BGP, so
CIO will take care of staging out results.
zhao
> ---- Original message ----
>
>> Date: Tue, 17 Mar 2009 13:40:30 -0500
>> From: Mihael Hategan
>> Subject: Re: [Swift-devel] How does swift know if a task is
>>
> successful
>
>> To: Zhao Zhang
>> Cc: swift-devel
>>
>> On Tue, 2009-03-17 at 13:36 -0500, Zhao Zhang wrote:
>>
>>> Hi, Mihael
>>>
>>> I commented the following lines
>>> /*dir:make(ldir)
>>> restartOnError(".*", 2
>>> task:transfer(srchost=host, srcfile=bname,
>>> srcdir=rdir, destdir=ldir, desthost=dhost,
>>>
> destprovider=provider)
>
>>> )*/
>>>
>>>
>> Did you modify this file in dist/?/libexec? If not, did you
>>
> re-compile
>
>> swift after the modification?
>>
>> Put an echo or a log message in place, to see if your change
>>
> is picked
>
>> up by swift next time.
>>
>>
>>> Then I modified wrapper.sh to not to copy output file back,
>>>
> but I still
>
>>> got an error.
>>> The log file is at
>>>
>>>
> http://www.ci.uchicago.edu/~zzhang/first-20090317-1327-oqgttus8.log
>
>>> Thanks
>>>
>>> zhao
>>>
>>> zzhang at login6.surveyor:~/new_dock6> ./run_swift_ssh.sh 1117
>>>
> 64 first.swift
>
>>> waiting for at least 64 nodes to register before submitting
>>>
> workload...
>
>>> waiting to find at least 1 services in file
>>>
>>>
> /home/falkon/users/zzhang/1117/config/Client-service-URIs.config...
>
>>> all done, file has found at least 1 services
>>> found at least 64 registered, submitting workload...
>>> Swift svn swift-r2676 (swift modified locally) cog-r2305
>>>
>>> RunID: 20090317-1327-oqgttus8
>>> Progress:
>>> Progress: Selecting site:1 Stage in:1
>>> Progress: Submitting:1 Submitted:1
>>> Progress: Submitted:1 Failed but can retry:1
>>> Failed to transfer wrapper log from
>>> first-20090317-1327-oqgttus8/info/b/n/bgp000
>>> Progress: Submitted:1 Active:1
>>> Failed to transfer wrapper log from
>>> first-20090317-1327-oqgttus8/info/e/n/bgp000
>>> Progress: Submitted:1 Active:1
>>> Failed to transfer wrapper log from
>>> first-20090317-1327-oqgttus8/info/g/n/bgp000
>>> Execution failed:
>>> Exception in echo:
>>> Arguments: [Hello, world!]
>>> Host: bgp000
>>> Directory: first-20090317-1327-oqgttus8/jobs/g/n/echo-gnlq238j
>>> stderr.txt:
>>>
>>> stdout.txt:
>>>
>>> ----
>>>
>>> Caused by:
>>> Cannot transfer
>>> "/tmp/first-20090317-1327-oqgttus8/shared/hello.txt" to
>>> "/gpfs/home/zzhang/new_dock6/./hello.txt"
>>> Caused by:
>>> No such file
>>>
>>>
>>> Mihael Hategan wrote:
>>>
>>>> On Tue, 2009-03-17 at 12:23 -0500, Zhao Zhang wrote:
>>>>
>>>>
>>>>> Hi, Mihael
>>>>>
>>>>> yes, can I do that?
>>>>>
>>>>>
>>>> You should know this by now:
>>>> in vdl-int.k, in doStageout, comment out the
>>>>
> task:transfer invocation
>
>>>> (and dir:make).
>>>>
>>>>
>>>>
>>>>> zhao
>>>>>
>>>>> Mihael Hategan wrote:
>>>>>
>>>>>
>>>>>> On Tue, 2009-03-17 at 12:14 -0500, Zhao Zhang wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>> Here comes another question, is there any place that I
>>>>>>>
> could set to
>
>>>>>>> disable swift's waiting for data feature?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>> Do you mean disable the stage-outs?
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>> Or is there any way for me to cheat swift that the
>>>>>>>
> data is already
>
>>>>>>> there? thanks.
>>>>>>>
>>>>>>> zhao
>>>>>>>
>>>>>>> Mihael Hategan wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> On Fri, 2009-03-13 at 17:07 -0500, Zhao Zhang wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> Hi, All
>>>>>>>>>
>>>>>>>>> I have a question on how swift knows if a task is
>>>>>>>>>
> successful.
>
>>>>>>>>> In my case, I am using a status notification instead
>>>>>>>>>
> of a status file.
>
>>>>>>>>> So my question is is this status notification the
>>>>>>>>>
> only thing swift is
>
>>>>>>>>> waiting for, or is swift also waiting for the output
>>>>>>>>>
> data to appear to
>
>>>>>>>>> say that a job is successful?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>> Once the job is done, swift will attempt to stage out
>>>>>>>>
> all the files that
>
>>>>>>>> it expects the job to have produced.
>>>>>>>>
>>>>>>>> Should one of those files not be there, there will be
>>>>>>>>
> failures.
>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>
>>>>
>>
>
>
From zhaozhang at uchicago.edu Tue Mar 17 23:58:10 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Tue, 17 Mar 2009 23:58:10 -0500
Subject: [Swift-devel] How does swift know if a task is successful
In-Reply-To: <1237315230.31264.1.camel@localhost>
References: <49BAD926.1030607@uchicago.edu>
<1236982413.13026.1.camel@localhost>
<49BFDA75.9070803@uchicago.edu>
<1237310455.29738.11.camel@localhost>
<49BFDC78.8040506@uchicago.edu>
<1237310948.30064.2.camel@localhost>
<49BFEDB0.5070409@uchicago.edu>
<1237315230.31264.1.camel@localhost>
Message-ID: <49C07F62.3000309@uchicago.edu>
Hi, Mihael
I modified vdl-int.k in cog/module/swift/libexec, rebuilt Swift, and
used my customized wrapper.sh.
I ran first.swift as a test: the job returned successfully, but the
output file was still staged out. Any ideas?
Thanks.
zhao
Mihael Hategan wrote:
> On Tue, 2009-03-17 at 13:36 -0500, Zhao Zhang wrote:
>
>> Hi, Mihael
>>
>> I commented the following lines
>> /*dir:make(ldir)
>> restartOnError(".*", 2
>> task:transfer(srchost=host, srcfile=bname,
>> srcdir=rdir, destdir=ldir, desthost=dhost, destprovider=provider)
>> )*/
>>
>>
>
> Did you modify this file in dist/?/libexec? If not, did you re-compile
> swift after the modification?
>
> Put an echo or a log message in place, to see if your change is picked
> up by swift next time.
>
>
>> Then I modified wrapper.sh to not to copy output file back, but I still
>> got an error.
>> The log file is at
>> http://www.ci.uchicago.edu/~zzhang/first-20090317-1327-oqgttus8.log
>> Thanks
>>
>> zhao
>>
>> zzhang at login6.surveyor:~/new_dock6> ./run_swift_ssh.sh 1117 64 first.swift
>> waiting for at least 64 nodes to register before submitting workload...
>> waiting to find at least 1 services in file
>> /home/falkon/users/zzhang/1117/config/Client-service-URIs.config...
>> all done, file has found at least 1 services
>> found at least 64 registered, submitting workload...
>> Swift svn swift-r2676 (swift modified locally) cog-r2305
>>
>> RunID: 20090317-1327-oqgttus8
>> Progress:
>> Progress: Selecting site:1 Stage in:1
>> Progress: Submitting:1 Submitted:1
>> Progress: Submitted:1 Failed but can retry:1
>> Failed to transfer wrapper log from
>> first-20090317-1327-oqgttus8/info/b/n/bgp000
>> Progress: Submitted:1 Active:1
>> Failed to transfer wrapper log from
>> first-20090317-1327-oqgttus8/info/e/n/bgp000
>> Progress: Submitted:1 Active:1
>> Failed to transfer wrapper log from
>> first-20090317-1327-oqgttus8/info/g/n/bgp000
>> Execution failed:
>> Exception in echo:
>> Arguments: [Hello, world!]
>> Host: bgp000
>> Directory: first-20090317-1327-oqgttus8/jobs/g/n/echo-gnlq238j
>> stderr.txt:
>>
>> stdout.txt:
>>
>> ----
>>
>> Caused by:
>> Cannot transfer
>> "/tmp/first-20090317-1327-oqgttus8/shared/hello.txt" to
>> "/gpfs/home/zzhang/new_dock6/./hello.txt"
>> Caused by:
>> No such file
>>
>>
>> Mihael Hategan wrote:
>>
>>> On Tue, 2009-03-17 at 12:23 -0500, Zhao Zhang wrote:
>>>
>>>
>>>> Hi, Mihael
>>>>
>>>> yes, can I do that?
>>>>
>>>>
>>> You should know this by now:
>>> in vdl-int.k, in doStageout, comment out the task:transfer invocation
>>> (and dir:make).
>>>
>>>
>>>
>>>> zhao
>>>>
>>>> Mihael Hategan wrote:
>>>>
>>>>
>>>>> On Tue, 2009-03-17 at 12:14 -0500, Zhao Zhang wrote:
>>>>>
>>>>>
>>>>>
>>>>>> Here comes another question, is there any place that I could set to
>>>>>> disable swift's waiting for data feature?
>>>>>>
>>>>>>
>>>>>>
>>>>> Do you mean disable the stage-outs?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> Or is there any way for me to cheat swift that the data is already
>>>>>> there? thanks.
>>>>>>
>>>>>> zhao
>>>>>>
>>>>>> Mihael Hategan wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>> On Fri, 2009-03-13 at 17:07 -0500, Zhao Zhang wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> Hi, All
>>>>>>>>
>>>>>>>> I have a question on how swift knows if a task is successful.
>>>>>>>> In my case, I am using a status notification instead of a status file.
>>>>>>>>
>>>>>>>> So my question is is this status notification the only thing swift is
>>>>>>>> waiting for, or is swift also waiting for the output data to appear to
>>>>>>>> say that a job is successful?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>> Once the job is done, swift will attempt to stage out all the files that
>>>>>>> it expects the job to have produced.
>>>>>>>
>>>>>>> Should one of those files not be there, there will be failures.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>>>>>
>>>>>
>>>
>>>
>
>
>
From benc at hawaga.org.uk Wed Mar 18 05:29:36 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 18 Mar 2009 10:29:36 +0000 (GMT)
Subject: [Swift-devel] testing
In-Reply-To: <1237307410.26969.8.camel@localhost>
References: <1237307410.26969.8.camel@localhost>
Message-ID:
On Tue, 17 Mar 2009, Mihael Hategan wrote:
> There are probably a few issues still left to address, one of which is
> to make sure that coasters are an acceptable way of running things on
> OSG. I suspect this would require some negotiation with the right people
> from OSG, and I don't know who those people are.
Mats can probably comment on what people in OSG are going to say.
My sense is:
i) coasters as a general concept is fine - the big VOs do stuff like that.
ii) running anything on the head nodes is bad
iii) running anything through gram2 is bad - any base job submissions
need to be through condor-g using its hybrid gram2+gridmanager system.
--
From benc at hawaga.org.uk Wed Mar 18 05:31:02 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 18 Mar 2009 10:31:02 +0000 (GMT)
Subject: [Swift-devel] testing
In-Reply-To: <1237310335.29738.8.camel@localhost>
References: <1237307410.26969.8.camel@localhost> <49BFD80C.7080903@mcs.anl.gov>
<1237310335.29738.8.camel@localhost>
Message-ID:
On Tue, 17 Mar 2009, Mihael Hategan wrote:
> this testing, not before. But I would like to "negotiate" the ability
> to:
>
> - have one process on the head node, hopefully one that doesn't hog it.
> - the ability to submit from the head node to the queuing system
> directly (as if running qsub manually - something that isn't exactly
> "the way" on OSG).
I think you're likely to get 'no' to both of those.
--
From wilde at mcs.anl.gov Wed Mar 18 07:27:55 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 18 Mar 2009 07:27:55 -0500
Subject: [Swift-devel] testing
In-Reply-To:
References: <1237307410.26969.8.camel@localhost>
Message-ID: <49C0E8CB.3080600@mcs.anl.gov>
On 3/18/09 5:29 AM, Ben Clifford wrote:
> On Tue, 17 Mar 2009, Mihael Hategan wrote:
>
>> There are probably a few issues still left to address, one of which is
>> to make sure that coasters are an acceptable way of running things on
>> OSG. I suspect this would require some negotiation with the right people
>> from OSG, and I don't know who those people are.
>
> Mats can probably make comment on what people in OSG are going to say.
>
> My sense is:
> i) coasters as a general concept is fine - the big VOs do stuff like that.
> ii) running anything on the head nodes is bad
I agree in principle. The immediate issue is whether the load we place
on head nodes will be light enough not to cause trouble, or whether it
will be yet another obstacle for us. We need to cope with managed head
nodes and their time limiter. I don't know that that's been tested yet.
Can coasters architecturally cope with no head node access, if we use a
worker node for this function and it connects back to the submitting
Swift process? This assumes that outbound connectivity from workers
will be more commonly available than inbound.
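The connect-back idea in the question above (a process on a worker node dials out to the submitting process, so only outbound connectivity from the worker is needed) can be sketched roughly as follows. This is a hypothetical Python illustration over loopback sockets, not how coasters are implemented; the `REGISTER`/`RUN` message format and all function names are assumptions invented for the sketch.

```python
import socket
import threading


def start_submit_side(host="127.0.0.1"):
    """Listen on an ephemeral port; stands in for the submitting process."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.settimeout(5)          # do not wait forever for a worker
    server.bind((host, 0))        # port 0: let the OS pick a free port
    server.listen(1)
    return server


def serve_one_worker(server, job):
    """Accept one inbound registration and hand the worker a job line."""
    conn, _ = server.accept()
    with conn, conn.makefile("r") as reader:
        registration = reader.readline().strip()
        conn.sendall(f"{job}\n".encode())
    return registration


def worker_connect_back(address, worker_id):
    """Worker side: open an outbound connection, register, read one job."""
    with socket.create_connection(address, timeout=5) as conn:
        conn.sendall(f"REGISTER {worker_id}\n".encode())
        with conn.makefile("r") as reader:
            return reader.readline().strip()


def demo():
    """Run both sides in one process and return what each side received."""
    server = start_submit_side()
    result = {}
    listener = threading.Thread(
        target=lambda: result.update(
            reg=serve_one_worker(server, "RUN echo hello")))
    listener.start()
    result["job"] = worker_connect_back(server.getsockname(), "worker-1")
    listener.join()
    server.close()
    return result


if __name__ == "__main__":
    print(demo())
```

The only listening socket is on the submit side; the worker never accepts a connection, which is the property the question depends on.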
> iii) running anything through gram2 is bad - any base job submissions
> need to be through condor-g using its hybrid gram2+gridmanager system.
I agree, and was assuming that on OSG we would only use the new Condor
provider, and run jobs in this manner.
From benc at hawaga.org.uk Wed Mar 18 07:35:35 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 18 Mar 2009 12:35:35 +0000 (GMT)
Subject: [Swift-devel] null pointer exception from nested loops
In-Reply-To: <1237327802.2668.0.camel@localhost>
References: <49C010FE.4070503@mcs.anl.gov> <49C01351.1050807@mcs.anl.gov>
<49C0154C.7000608@mcs.anl.gov> <49C01A2D.6050408@mcs.anl.gov>
<49C01DF7.4000204@mcs.anl.gov> <1237327802.2668.0.camel@localhost>
Message-ID:
r2707 fixes this. It was the nested procedure call interacting badly with
anything that returns mapped content rather than in-memory values.
On Tue, 17 Mar 2009, Mihael Hategan wrote:
> Looks like it might be the nested procedure call interacting badly with
> readData.
>
> On Tue, 2009-03-17 at 17:02 -0500, Michael Wilde wrote:
> > OK, I think this is it:
> >
> > The following script *works* with the commented out code and fails with
> > the currently enabled alternative statement:
> >
> > --
> >
> > type file;
> >
> > app (file editedParams) setTemps ( file inParams )
> > {
> > sed "-e" "s/@DTI@/123/" @inParams stdout=@editedParams;
> > }
> >
> > file inParams;
> >
> > /* works:
> > file o<"pout">;
> > o = setTemps(inParams);
> > string config [] = readData(o);
> > */
> >
> > /* Fails: */
> > string config [] = readData( setTemps(inParams ) );
> >
> > trace(0,config[0]);
> > trace(1,config[1]);
> > trace(2,config[2]);
> >
> > --
> >
> > So readData is indeed happy to take a file-value var as an arg but not a
> > file-valued expression (procedure return in this case).
> >
> >
> >
> > On 3/17/09 4:46 PM, Michael Wilde wrote:
> > > I got the error narrowed down to this example:
> > >
> > > type file;
> > >
> > > app (file editedParams) setTemps ( file inParams )
> > > {
> > > echo @inParams stdout=@editedParams;
> > > }
> > >
> > > file inParams;
> > >
> > > string config [] = readData( setTemps(inParams ) );
> > > trace(0,config[0]);
> > > trace(1,config[1]);
> > > trace(2,config[2]);
> > >
> > > --
> > >
> > > params.tloops contains:
> > >
> > > DEFAULT_INIT_TEMP_=_@DTI@
> > > TEMP_UPDATE_INTERVAL_=_@TUI@
> > > KILL_TIME_=_3
> > > MAX_NUMBER_OF_ANNEALING_STEPS_=_0
> > >
> > > --
> > >
> > > In other code, readData() seemed happy to take a file *or* a filename
> > > string as input, but I wonder if it was not as happy as it seemed.
> > >
> > > I'd been taking advantage of the flexibility with good results (luck?)
> > > so far, though.
> > >
> > > On 3/17/09 4:25 PM, Michael Wilde wrote:
> > >> The log contains this just before the NPE, including the suspicious
> > >> message: WARN FlowNode Ex098:
> > >>
> > >> That's giving me a clue as to the offending statements.
> > >>
> > >> ---
> > >>
> > >> 2009-03-17 16:20:34,723-0500 INFO AbstractDataNode Adding handle
> > >> listener "F/org.griphyn.vdl.mapping.DataNode identifier
> > >> tag:benc at ci.uchicago.edu,2008:swift:dataset:20090\
> > >> 317-1620-e1n1bz3g:720000000071 type SecSeq with no value at
> > >> dataset=secseq path=[0] (not closed)" to
> > >> "org.griphyn.vdl.mapping.DataNode identifier tag:benc at ci.uchicago.edu,\
> > >> 2008:swift:dataset:20090317-1620-e1n1bz3g:720000000071 type SecSeq
> > >> with no value at dataset=secseq path=[0] (not closed)"
> > >> 2009-03-17 16:20:34,724-0500 INFO AbstractDataNode closed
> > >> org.griphyn.vdl.mapping.RootDataNode identifier
> > >> tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz\
> > >> 3g:720000000092 type string value=s/@DIT@/10/ dataset=unnamed
> > >> SwiftScript value (closed)
> > >> 2009-03-17 16:20:34,724-0500 INFO AbstractDataNode ROOTPATH
> > >> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000092
> > >> path=$
> > >> 2009-03-17 16:20:34,724-0500 INFO AbstractDataNode VALUE
> > >> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000092
> > >> VALUE=s/@DIT@/10/
> > >> 2009-03-17 16:20:34,725-0500 INFO AbstractDataNode closed
> > >> org.griphyn.vdl.mapping.RootDataNode identifier
> > >> tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz\
> > >> 3g:720000000093 type string value=s/@TUI@/1/ dataset=unnamed
> > >> SwiftScript value (closed)
> > >> 2009-03-17 16:20:34,725-0500 INFO AbstractDataNode ROOTPATH
> > >> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000093
> > >> path=$
> > >> 2009-03-17 16:20:34,725-0500 INFO AbstractDataNode VALUE
> > >> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000093
> > >> VALUE=s/@TUI@/1/
> > >> 2009-03-17 16:20:34,726-0500 INFO AbstractDataNode closed
> > >> org.griphyn.vdl.mapping.RootDataNode identifier
> > >> tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz\
> > >> 3g:720000000094 type string value=params.tloop dataset=unnamed
> > >> SwiftScript value (closed)
> > >> 2009-03-17 16:20:34,726-0500 INFO AbstractDataNode ROOTPATH
> > >> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000094
> > >> path=$
> > >> 2009-03-17 16:20:34,726-0500 INFO AbstractDataNode VALUE
> > >> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000094
> > >> VALUE=params.tloop
> > >> 2009-03-17 16:20:34,727-0500 WARN FlowNode Ex098
> > >> java.lang.NullPointerException
> > >> at org.griphyn.vdl.karajan.lib.VDLFunction.leafFileName(VDLFunction.java:285)
> > >> at org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:201)
> > >> at org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:182)
> > >> at org.griphyn.vdl.karajan.lib.swiftscript.FileName.function(FileName.java:19)
> > >> at org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67)
> > >>
> > >>
> > >>
> > >> On 3/17/09 4:17 PM, Michael Wilde wrote:
> > >>> It seems not related to scale or Falkon.
> > >>>
> > >>> It occurs when running on localhost (but on bgp) and when I cut all
> > >>> the loops down to a single iteration.
> > >>>
> > >>> I'm still debugging.
> > >>>
> > >>> On 3/17/09 4:07 PM, Michael Wilde wrote:
> > >>>> I just expanded my oops protein folding script to add another level
> > >>>> of parameter sweep. This script is getting pretty complex now (at
> > >>>> least, for a swift script).
> > >>>>
> > >>>> I got the following NPE on my first two tries. I'm going to start
> > >>>> debugging, but any clues as to the cause would be helpful.
> > >>>>
> > >>>> The outer loops are:
> > >>>>
> > >>>> main()
> > >>>> {
> > >>>> string protein[] = readData(@arg("plist"));
> > >>>> string startTemp[] = ["10","20"];
> > >>>> string tempUpdate[] = ["1","2","3"];
> > >>>>
> > >>>> foreach p in protein {
> > >>>> foreach st in startTemp {
> > >>>> foreach tu in tempUpdate {
> > >>>> doRoundSet(p,st,tu);
> > >>>> }
> > >>>> }
> > >>>> }
> > >>>> }
> > >>>>
> > >>>> There are two levels of inner loops further down below doRoundSet().
> > >>>>
> > >>>> The script, output, command line args and log are in:
> > >>>> http://ww.ci.uchicago.edu/~wilde/swift3.tar.gz
> > >>>>
> > >>>> I suspect it will take a while to narrow the cause to a simpler test
> > >>>> case that's easy to reproduce without a lot of setup.
> > >>>>
> > >>>> I'll try on a vanilla swift on local execution; this is on bgp with
> > >>>> Falkon.
> > >>>>
> > >>>> Thanks.
> > >>>>
> > >>>> --
> > >>>>
> > >>>> ...
> > >>>> Progress: uninitialized:1 Selecting site:2
> > >>>> SwiftScript trace: T1af7, Round, 0, Sim, 7
> > >>>> SwiftScript trace: T1af7, Round, 0, Sim, 2
> > >>>> SwiftScript trace: T1af7, Round, 0, Sim, 8
> > >>>> SwiftScript trace: T1af7, Round, 0, Sim, 0
> > >>>> SwiftScript trace: T1af7, Round, 0, Sim, 5
> > >>>> SwiftScript trace: T1af7, Round, 0, Sim, 9
> > >>>> SwiftScript trace: T1af7, Round, 0, Sim, 1
> > >>>> SwiftScript trace: T1af7, Round, 0, Sim, 6
> > >>>> SwiftScript trace: T1af7, Round, 0, Sim, 3
> > >>>> SwiftScript trace: T1af7, Round, 0, Sim, 4
> > >>>> Ex098
> > >>>> java.lang.NullPointerException
> > >>>> at org.griphyn.vdl.karajan.lib.VDLFunction.leafFileName(VDLFunction.java:285)
> > >>>> at org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:201)
> > >>>> at org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:182)
> > >>>> at org.griphyn.vdl.karajan.lib.swiftscript.FileName.function(FileName.java:19)
> > >>>> at org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67)
> > >>>> at org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192)
> > >>>> at org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33)
> > >>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:332)
> > >>>> at org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
> > >>>> at org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
> > >>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:176)
> > >>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:296)
> > >>>> at org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58)
> > >>>> at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:46)
> > >>>> at org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:51)
> > >>>> at org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:27)
> > >>>> at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.executeChildren(AbstractFunction.java:40)
> > >>>> at org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63)
> > >>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:233)
> > >>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:278)
> > >>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:391)
> > >>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:329)
> > >>>> at org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:227)
> > >>>> at org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
> > >>>> at org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
> > >>>> at org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java:69)
> > >>>>
> > >>>> Execution failed:
> > >>>> java.lang.NullPointerException
> > >>>> at org.griphyn.vdl.karajan.lib.VDLFunction.leafFileName(VDLFunction.java:285)
> > >>>> at org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:201)
> > >>>> at org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:182)
> > >>>> at org.griphyn.vdl.karajan.lib.swiftscript.FileName.function(FileName.java:19)
> > >>>> at org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67)
> > >>>> at org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192)
> > >>>> at org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33)
> > >>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:332)
> > >>>> at org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
> > >>>> at org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
> > >>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:176)
> > >>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:296)
> > >>>> at org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58)
> > >>>> at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:46)
> > >>>> at org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:51)
> > >>>> at org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:27)
> > >>>> at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.executeChildren(AbstractFunction.java:40)
> > >>>> at org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63)
> > >>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:233)
> > >>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:278)
> > >>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:391)
> > >>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:329)
> > >>>> at org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:227)
> > >>>> at org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
> > >>>> at org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
> > >>>> at org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java:69)
> > >>>>
> > >>>>
> > >>>> Ex098
> > >>>> java.lang.NullPointerException
> > >>>> at org.griphyn.vdl.karajan.lib.VDLFunction.leafFileName(VDLFunction.java:285)
> > >>>> at org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:201)
> > >>>> at org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:182)
> > >>>> at org.griphyn.vdl.karajan.lib.swiftscript.FileName.function(FileName.java:19)
> > >>>> at org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67)
> > >>>> at org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192)
> > >>>> at org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33)
> > >>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:332)
> > >>>> at org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
> > >>>> at org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
> > >>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:176)
> > >>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:296)
> > >>>> at org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58)
> > >>>> at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:46)
> > >>>> at org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:51)
> > >>>> at org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:27)
> > >>>> at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.executeChildren(AbstractFunction.java:40)
> > >>>> at org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63)
> > >>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:233)
> > >>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:278)
> > >>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:391)
> > >>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:329)
> > >>>> at org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:227)
> > >>>> at org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
> > >>>> at org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
> > >>>> at org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java:69)
> > >>>>
> > >>>> SwiftScript trace: T1af7, Round, 0, Sim, 7
> > >>>> _______________________________________________
> > >>>> Swift-devel mailing list
> > >>>> Swift-devel at ci.uchicago.edu
> > >>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
From benc at hawaga.org.uk Wed Mar 18 07:39:42 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 18 Mar 2009 12:39:42 +0000 (GMT)
Subject: [Swift-devel] testing
In-Reply-To:
References: <1237307410.26969.8.camel@localhost>
Message-ID:
On Wed, 18 Mar 2009, Ben Clifford wrote:
> My sense is:
By that I mean, my sense of what OSG people will accept, not my personal
opinions.
--
From benc at hawaga.org.uk Wed Mar 18 07:51:33 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 18 Mar 2009 12:51:33 +0000 (GMT)
Subject: [Swift-devel] How does swift know if a task is successful
In-Reply-To: <49C072EB.2060109@uchicago.edu>
References: <20090317221440.BUF44237@m4500-02.uchicago.edu>
<49C072EB.2060109@uchicago.edu>
Message-ID:
On Tue, 17 Mar 2009, Zhao Zhang wrote:
> exactly, currently, we are building a collective IO system on BGP, so CIO will
> take
> care of stage out results.
Is it possible to interface between Swift and your work more cleanly?
(maybe, for example, by doing something with the cog file transfer
provider API)
Hacking essential pieces of the Swift code out feels really unpleasant,
and will pretty much definitely break some functionality, which will cause
you trouble later on if you try to run arbitrary SwiftScript programs.
When we sat down in November/December, it sounded like you wouldn't need
to do anything like this to make the CIO stuff work with Swift; so I'd be
interested in more explanation/discussion about what the CIO work looks
like now.
--
From foster at anl.gov Wed Mar 18 08:38:36 2009
From: foster at anl.gov (Ian Foster)
Date: Wed, 18 Mar 2009 08:38:36 -0500
Subject: [Swift-devel] How does swift know if a task is successful
In-Reply-To:
References: <20090317221440.BUF44237@m4500-02.uchicago.edu>
<49C072EB.2060109@uchicago.edu>
Message-ID:
I think that at this point we are experimenting to see what can be
done. Not to say that we shouldn't do this, but the first focus is on
seeing what might work at all.
On Mar 18, 2009, at 7:51 AM, Ben Clifford wrote:
>
> On Tue, 17 Mar 2009, Zhao Zhang wrote:
>
>> exactly, currently, we are building a collective IO system on BGP,
>> so CIO will
>> take
>> care of stage out results.
>
> Is it possible to interface between Swift and your work more cleanly?
>
> (maybe, for example, by doing something with the cog file transfer
> provider API)
>
> Hacking essential pieces of the Swift code out feels really
> unpleasant,
> and will pretty much definitely break some functionality, which will
> cause
> you trouble later on if you try to run arbitrary SwiftScript programs.
>
> When we sat down in November/December, it sounded like you wouldn't
> need
> to do anything like this to make the CIO stuff work with Swift; so
> I'd be
> interested in more explanation/discussion about what the CIO work
> looks
> like now.
>
> --
>
From zhaozhang at uchicago.edu Wed Mar 18 09:04:13 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Wed, 18 Mar 2009 09:04:13 -0500
Subject: [Swift-devel] How does swift know if a task is successful
In-Reply-To:
References: <20090317221440.BUF44237@m4500-02.uchicago.edu>
<49C072EB.2060109@uchicago.edu>
Message-ID: <49C0FF5D.6000407@uchicago.edu>
Hi, Ben
Ben Clifford wrote:
> On Tue, 17 Mar 2009, Zhao Zhang wrote:
>
>
>> exactly, currently, we are building a collective IO system on BGP, so CIO will
>> take
>> care of stage out results.
>>
>
> Is it possible to interface between Swift and your work more cleanly?
>
> (maybe, for example, by doing something with the cog file transfer
> provider API)
>
Yes, making a new provider for Swift is another way to do this.
> Hacking essential pieces of the Swift code out feels really unpleasant,
> and will pretty much definitely break some functionality, which will cause
> you trouble later on if you try to run arbitrary SwiftScript programs.
>
Well, I agree with your point for production use. But what we are
doing now is research into a better
architecture for Swift on BGP.
> When we sat down in November/December, it sounded like you wouldn't need
> to do anything like this to make the CIO stuff work with Swift;
When I implemented what we discussed last time, a new problem came up.
Consider a 2-stage computation in which the second stage takes the output
of the first as its input. With either the ssh provider or the gridftp
provider, this intermediate data has to be copied back to GPFS, since the
job that consumes it could be sent to any "site" (pset).
To solve this problem, we built a P2P data network on BGP over the torus
network. The basic logic is that when a wrapper.sh finds a piece of
intermediate data, it registers this data as (name, rank of the CN) in a
Centralized Hash Table (CHT). Later, when a job needs this data, it first
looks the data up in the CHT, gets the rank of the remote node, converts
the rank to an IP address, and fetches the data directly.
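The register/lookup logic described above could be sketched roughly as follows. This is only an illustration in Python, not the code in wrapper.sh: the class and function names are hypothetical, the rank-to-IP rule (a base address plus the rank) is an assumption made for the example, and the actual data fetch is left out.

```python
import ipaddress


class CentralizedHashTable:
    """Hypothetical CHT: maps a data item's name to the rank of the CN holding it."""

    def __init__(self):
        self._owner = {}

    def register(self, name, rank):
        # Called by the producing job once it has written intermediate data.
        self._owner[name] = rank

    def lookup(self, name):
        # Returns the rank of the node holding the data, or None if unknown.
        return self._owner.get(name)


def rank_to_ip(rank, base="10.0.0.0"):
    # Assumed mapping for illustration only: base address plus rank.
    return str(ipaddress.IPv4Address(base) + rank)


def locate(cht, name):
    """Return the IP address to fetch `name` from, or None if it is not registered."""
    rank = cht.lookup(name)
    if rank is None:
        return None
    return rank_to_ip(rank)
```

A consuming job would call locate() and fall back to the shared filesystem when it returns None.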
With the above solution, intermediate data does not have to be copied back
to GPFS, but Swift still waits for that intermediate data to determine
whether the first-stage jobs succeeded. As a result, Swift won't send out
the second-stage jobs. That's why we need to disable Swift's stage-out and
let Swift determine job status only from the provider's status
notification.
zhao
> so I'd be
> interested in more explanation/discussion about what the CIO work looks
> like now.
>
>
From wilde at mcs.anl.gov Wed Mar 18 09:04:46 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 18 Mar 2009 09:04:46 -0500
Subject: [Swift-devel] How does swift know if a task is successful
In-Reply-To:
References: <20090317221440.BUF44237@m4500-02.uchicago.edu> <49C072EB.2060109@uchicago.edu>
Message-ID: <49C0FF7E.6040405@mcs.anl.gov>
I'll talk to Zhao and Allan about this; I haven't been following this
thread because I thought it was a mechanical detail, and in fact thought
Swift already did what's needed. But I'll read up and work with Zhao
and Allan to see if we can avoid unnecessary changes. Not yet sure,
we'll see.
On 3/18/09 8:38 AM, Ian Foster wrote:
> I think that at this point we are experimenting to see what can be done.
> Not to say that we shouldn't do this, but the first focus is on seeing
> what might work at all.
>
>
> On Mar 18, 2009, at 7:51 AM, Ben Clifford wrote:
>
>>
>> On Tue, 17 Mar 2009, Zhao Zhang wrote:
>>
>>> exactly, currently, we are building a collective IO system on BGP, so
>>> CIO will
>>> take
>>> care of stage out results.
>>
>> Is it possible to interface between Swift and your work more cleanly?
>>
>> (maybe, for example, by doing something with the cog file transfer
>> provider API)
>>
>> Hacking essential pieces of the Swift code out feels really unpleasant,
>> and will pretty much definitely break some functionality, which will
>> cause
>> you trouble later on if you try to run arbitrary SwiftScript programs.
>>
>> When we sat down in November/December, it sounded like you wouldn't need
>> to do anything like this to make the CIO stuff work with Swift; so I'd be
>> interested in more explanation/discussion about what the CIO work looks
>> like now.
>>
>> --
>>
From benc at hawaga.org.uk Wed Mar 18 09:13:56 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 18 Mar 2009 14:13:56 +0000 (GMT)
Subject: [Swift-devel] How does swift know if a task is successful
In-Reply-To: <49C0FF5D.6000407@uchicago.edu>
References: <20090317221440.BUF44237@m4500-02.uchicago.edu>
<49C072EB.2060109@uchicago.edu>
<49C0FF5D.6000407@uchicago.edu>
Message-ID:
So if Swift could remove the dependency between staging out and starting
subsequent jobs (a subset of what has been talked about before), would you
still need to hack out the stageout code?
> To solve this problem, we built a P2P data network on BGP over torus
> network. So the basic logic for this is that if a wrapper.sh found a
> piece of intermediate data, it registered this data with (name, rank of
> the CN) to a Centralized Hash Table(CHT). Next time, when a job needs
> this data, first it looks this data up in CHT, gets the rank of the
> remote node, convert the RANK to IP, fetch the data directly.
When we talked in December, I think this bit was done with posix
filesystem access. But it sounds like you are doing something different
now.
I've looked at abstracting that worker<->site shared filesystem code in
the past (and have some patches floating round in half-written state) -
can you send me your modified wrapper.sh so I can see how you do things?
--
From zhaozhang at uchicago.edu Wed Mar 18 09:21:15 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Wed, 18 Mar 2009 09:21:15 -0500
Subject: [Swift-devel] How does swift know if a task is successful
In-Reply-To:
References: <20090317221440.BUF44237@m4500-02.uchicago.edu>
<49C072EB.2060109@uchicago.edu>
<49C0FF5D.6000407@uchicago.edu>
Message-ID: <49C1035B.3070606@uchicago.edu>
Hi, Ben
Ben Clifford wrote:
> So if Swift could remove the dependency between staging out and starting
> subsequent jobs (a subset of what has been talked about before), would you
> still need to hack out the stageout code?
>
I think swift still needs to hold the 2nd stage computation until the
1st completes. If we simply remove
the dependency, swift would send all jobs (both 1st and 2nd) out, right?
>
>> To solve this problem, we built a P2P data network on BGP over torus
>> network. So the basic logic for this is that if a wrapper.sh found a
>> piece of intermediate data, it registered this data with (name, rank of
>> the CN) to a Centralized Hash Table(CHT). Next time, when a job needs
>> this data, first it looks this data up in CHT, gets the rank of the
>> remote node, convert the RANK to IP, fetch the data directly.
>>
>
> When we talked in December, I think this bit was done with posix
> filesystem access.
We missed this point in our last talk.
> But it sounds like you are doing something different
> now.
>
> I've looked at abstracting that worker<->site shared filesystem code in
> the past (and have some patches floating round in half-written state) -
> can you send me your modified wrapper.sh so I can see how you do things?
>
Here it is: http://www.ci.uchicago.edu/~zzhang/wrapper.sh
zhao
From wilde at mcs.anl.gov Wed Mar 18 09:29:51 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 18 Mar 2009 09:29:51 -0500
Subject: [Swift-devel] How does swift know if a task is successful
In-Reply-To:
References: <20090317221440.BUF44237@m4500-02.uchicago.edu> <49C072EB.2060109@uchicago.edu> <49C0FF5D.6000407@uchicago.edu>
Message-ID: <49C1055F.8040508@mcs.anl.gov>
I just reviewed the thread and think I understand the issue now.
To re-iterate (for my benefit):
The status.mode=provider property works but does not do all that Zhao
needs here: Swift still insists that all the expected app output files
are placed in the work directory (and then copies them back to their
mapped destination directory).
Zhao is experimenting with a "pull model" where data transfer can be
done by the compute nodes pulling their input files from where those
files were left by the previous job, rather than the swift engine
pushing their input data to the shared work directory.
So, Ben, I think your solution below *might* work.
Zhao, Allan, and I should document the data flow changes that we're
testing, to help us all discuss this.
On 3/18/09 9:13 AM, Ben Clifford wrote:
> So if Swift could remove the dependency between staging out and starting
> subsequent jobs (a subset of what has been talked about before), would you
> still need to hack out the stageout code?
>
>> To solve this problem, we built a P2P data network on BGP over torus
>> network. So the basic logic for this is that if a wrapper.sh found a
>> piece of intermediate data, it registered this data with (name, rank of
>> the CN) to a Centralized Hash Table(CHT). Next time, when a job needs
>> this data, first it looks this data up in CHT, gets the rank of the
>> remote node, convert the RANK to IP, fetch the data directly.
>
> When we talked in December, I think this bit was done with posix
> filesystem access. But it sounds like you are doing something different
> now.
>
> I've looked at abstracting that worker<->site shared filesystem code in
> the past (and have some patches floating round in half-written state) -
> can you send me your modified wrapper.sh so I can see how you do things?
>
From benc at hawaga.org.uk Wed Mar 18 09:30:51 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 18 Mar 2009 14:30:51 +0000 (GMT)
Subject: [Swift-devel] How does swift know if a task is successful
In-Reply-To: <49C1035B.3070606@uchicago.edu>
References: <20090317221440.BUF44237@m4500-02.uchicago.edu>
<49C072EB.2060109@uchicago.edu>
<49C0FF5D.6000407@uchicago.edu>
<49C1035B.3070606@uchicago.edu>
Message-ID:
On Wed, 18 Mar 2009, Zhao Zhang wrote:
> I think swift still needs to hold the 2nd stage computation until the 1st
> completes. If we simply remove
> the dependency, swift would send all jobs (both 1st and 2nd) out, right?
I don't mean the dependency between Swift jobs. That would still exist.
I mean make Swift so that it can start the next job when it has determined
that the first job has completed successfully, with stageout happening
separately.
At the moment, the dependencies are:
stagein(job A) < run(job A) < stageout(job A) < stagein(job B) < run(job B)
But they could become more like these two chains:
stagein(job A) < run(job A) < stagein(job B) < run(job B)
run(job A) < stageout(job A)
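As a rough illustration of the two orderings above, the constraints can be written as dependency graphs and compared. This is a sketch in Python using the standard graphlib module (Python 3.9+); the task names are just labels for the steps listed above, not identifiers from the Swift code.

```python
from graphlib import TopologicalSorter

# Each key maps a task to the set of tasks that must finish before it starts.
CURRENT = {
    "run(A)": {"stagein(A)"},
    "stageout(A)": {"run(A)"},
    "stagein(B)": {"stageout(A)"},
    "run(B)": {"stagein(B)"},
}

PROPOSED = {
    "run(A)": {"stagein(A)"},
    "stagein(B)": {"run(A)"},
    "run(B)": {"stagein(B)"},
    "stageout(A)": {"run(A)"},  # separate chain: no longer blocks job B
}


def must_precede(graph, first, second):
    """Return True if `first` is a direct or transitive prerequisite of `second`."""
    pending = list(graph.get(second, ()))
    seen = set()
    while pending:
        task = pending.pop()
        if task == first:
            return True
        if task not in seen:
            seen.add(task)
            pending.extend(graph.get(task, ()))
    return False


def valid_order(graph):
    """Return one execution order that respects every dependency in `graph`."""
    return list(TopologicalSorter(graph).static_order())
```

In CURRENT, must_precede(CURRENT, "stageout(A)", "run(B)") is True; in PROPOSED it is False, while run(A) still has to finish before run(B).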
--
From benc at hawaga.org.uk Wed Mar 18 09:37:01 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 18 Mar 2009 14:37:01 +0000 (GMT)
Subject: [Swift-devel] How does swift know if a task is successful
In-Reply-To: <49C0FF5D.6000407@uchicago.edu>
References: <20090317221440.BUF44237@m4500-02.uchicago.edu>
<49C072EB.2060109@uchicago.edu>
<49C0FF5D.6000407@uchicago.edu>
Message-ID:
On Wed, 18 Mar 2009, Zhao Zhang wrote:
> To solve this problem, we built a P2P data network on BGP over torus network.
> So the basic logic for this is
> that if a wrapper.sh found a piece of intermediate data, it registered this
> data with (name, rank of the CN) to a
> Centralized Hash Table(CHT). Next time, when a job needs this data, first it
> looks this data up in CHT, gets
> the rank of the remote node, convert the RANK to IP, fetch the data directly.
I don't see any of that in the wrapper.sh that you just sent me. I see
input and output files moved around with cp and dd using posix fs access.
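For comparison, the register/lookup flow described in the quoted text might look roughly like the sketch below. This is purely illustrative and not taken from any wrapper.sh: the class CentralizedHashTable, the function names, and in particular the rank-to-IP formula are all hypothetical, since the thread does not say how ranks map to addresses.

```python
class CentralizedHashTable:
    """Toy in-memory stand-in for the CHT: maps a data name to a compute-node rank."""

    def __init__(self):
        self._owner = {}

    def register(self, name, rank):
        # Called when a wrapper finds a piece of intermediate data on its node.
        self._owner[name] = rank

    def lookup(self, name):
        # Returns the rank holding `name`, or None if it was never registered.
        return self._owner.get(name)


def rank_to_ip(rank):
    # Hypothetical mapping, for illustration only; the real conversion
    # is not described in this thread.
    return f"10.0.{rank // 256}.{rank % 256}"


def locate(cht, name):
    """Return the address to fetch `name` from, or None to fall back to the shared fs."""
    rank = cht.lookup(name)
    if rank is None:
        return None
    return rank_to_ip(rank)


if __name__ == "__main__":
    cht = CentralizedHashTable()
    cht.register("intermediate-0001.dat", 300)
    print(locate(cht, "intermediate-0001.dat"))
    print(locate(cht, "missing.dat"))
```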
--
From zhaozhang at uchicago.edu Wed Mar 18 09:59:53 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Wed, 18 Mar 2009 09:59:53 -0500
Subject: [Swift-devel] How does swift know if a task is successful
In-Reply-To:
References: <20090317221440.BUF44237@m4500-02.uchicago.edu>
<49C072EB.2060109@uchicago.edu>
<49C0FF5D.6000407@uchicago.edu>
Message-ID: <49C10C69.7070801@uchicago.edu>
Oh, you want the wrapper.sh working with P2P data transfer? It is not
implemented yet. I am still working on the
basic API for this wrapper.sh.
zhao
Ben Clifford wrote:
> On Wed, 18 Mar 2009, Zhao Zhang wrote:
>
>
>> To solve this problem, we built a P2P data network on BGP over torus network.
>> So the basic logic for this is
>> that if a wrapper.sh found a piece of intermediate data, it registered this
>> data with (name, rank of the CN) to a
>> Centralized Hash Table(CHT). Next time, when a job needs this data, first it
>> looks this data up in CHT, gets
>> the rank of the remote node, convert the RANK to IP, fetch the data directly.
>>
>
> I don't see any of that in the wrapper.sh that you just sent me. I see
> input and output files moved around with cp and dd using posix fs access.
>
>
From hategan at mcs.anl.gov Wed Mar 18 10:11:11 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 18 Mar 2009 10:11:11 -0500
Subject: [Swift-devel] testing
In-Reply-To: <49C0E8CB.3080600@mcs.anl.gov>
References: <1237307410.26969.8.camel@localhost>
<49C0E8CB.3080600@mcs.anl.gov>
Message-ID: <1237389071.5032.1.camel@localhost>
On Wed, 2009-03-18 at 07:27 -0500, Michael Wilde wrote:
> > iii) running anything through gram2 is bad - any base job submissions
> > need to be through condor-g using its hybrid gram2+gridmanager system.
>
> I agree, and was assuming that on OSG we would only use the new Condor
> provider, and run jobs in this manner.
There seems to be some confusion here.
Ben, the point is to run with one of the scheduler providers, not gram2.
Mike, the condor provider is not a condor-through-gram provider. It only
submits to the local condor queue.
From hategan at mcs.anl.gov Wed Mar 18 10:16:00 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 18 Mar 2009 10:16:00 -0500
Subject: [Swift-devel] How does swift know if a task is successful
In-Reply-To:
References: <20090317221440.BUF44237@m4500-02.uchicago.edu>
<49C072EB.2060109@uchicago.edu>
<49C0FF5D.6000407@uchicago.edu>
Message-ID: <1237389360.5032.4.camel@localhost>
On Wed, 2009-03-18 at 14:13 +0000, Ben Clifford wrote:
> So if Swift could remove the dependency between staging out and starting
> subsequent jobs (a subset of what has been talked about before), would you
> still need to hack out the stageout code?
Or the yet to be CIO provider could, without doing much, say that the
files were staged out.
From zhaozhang at uchicago.edu Wed Mar 18 10:16:50 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Wed, 18 Mar 2009 10:16:50 -0500
Subject: [Swift-devel] How does swift know if a task is successful
In-Reply-To: <1237389360.5032.4.camel@localhost>
References: <20090317221440.BUF44237@m4500-02.uchicago.edu>
<49C072EB.2060109@uchicago.edu>
<49C0FF5D.6000407@uchicago.edu>
<1237389360.5032.4.camel@localhost>
Message-ID: <49C11062.9010304@uchicago.edu>
Yes, that is another plan. I think Allan is hacking this.
zhao
Mihael Hategan wrote:
> On Wed, 2009-03-18 at 14:13 +0000, Ben Clifford wrote:
>
>> So if Swift could remove the dependency between staging out and starting
>> subsequent jobs (a subset of what has been talked about before), would you
>> still need to hack out the stageout code?
>>
>
> Or the yet to be CIO provider could, without doing much, say that the
> files were staged out.
>
>
>
>
From benc at hawaga.org.uk Wed Mar 18 10:20:07 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 18 Mar 2009 15:20:07 +0000 (GMT)
Subject: [Swift-devel] testing
In-Reply-To: <1237389071.5032.1.camel@localhost>
References: <1237307410.26969.8.camel@localhost>
<49C0E8CB.3080600@mcs.anl.gov> <1237389071.5032.1.camel@localhost>
Message-ID:
On Wed, 18 Mar 2009, Mihael Hategan wrote:
> Ben, the point is to run with one of the scheduler providers, not gram2.
That is also bad for OSG, as far as I can tell, because it screws up
accounting, which is done in GRAM, not in the LRM.
Pretty much what I'm asserting is that both the head job and the worker
jobs need to be run through Condor-G/gridmanager/gram2 (in my
understanding of what OSG requires).
--
From benc at hawaga.org.uk Wed Mar 18 10:24:42 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 18 Mar 2009 15:24:42 +0000 (GMT)
Subject: [Swift-devel] How does swift know if a task is successful
In-Reply-To: <1237389360.5032.4.camel@localhost>
References: <20090317221440.BUF44237@m4500-02.uchicago.edu>
<49C072EB.2060109@uchicago.edu>
<49C0FF5D.6000407@uchicago.edu>
<1237389360.5032.4.camel@localhost>
Message-ID:
On Wed, 18 Mar 2009, Mihael Hategan wrote:
> Or the yet to be CIO provider could, without doing much, say that the
> files were staged out.
That would still be giving mistruths to Swift about what files had been
copied where, and so would break when Swift relied on those mistruths
(such as trying to run a job on a different site).
What I was talking about is the ever-in-the-future in-Swift manager that
tracks which files are where.
--
From hategan at mcs.anl.gov Wed Mar 18 10:40:07 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 18 Mar 2009 10:40:07 -0500
Subject: [Swift-devel] testing
In-Reply-To:
References: <1237307410.26969.8.camel@localhost>
<49C0E8CB.3080600@mcs.anl.gov> <1237389071.5032.1.camel@localhost>
Message-ID: <1237390807.5032.19.camel@localhost>
On Wed, 2009-03-18 at 15:20 +0000, Ben Clifford wrote:
> On Wed, 18 Mar 2009, Mihael Hategan wrote:
>
> > Ben, the point is to run with one of the scheduler providers, not gram2.
>
> That is also bad for OSG, as far as I can tell because it screws up
> accounting which is done in GRAM, not in the LRM.
>
> Pretty much what I'm asserting is that both the head job and the worker
> jobs need to be run through Condor-G/gridmanager/gram2 (in my
> understanding of what OSG requires).
>
Ok. Back to the drawing board.
From zhaozhang at uchicago.edu Wed Mar 18 11:00:27 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Wed, 18 Mar 2009 11:00:27 -0500
Subject: [Swift-devel] How does swift know if a task is successful
In-Reply-To:
References: <20090317221440.BUF44237@m4500-02.uchicago.edu>
<49C072EB.2060109@uchicago.edu>
<49C0FF5D.6000407@uchicago.edu>
<49C1035B.3070606@uchicago.edu>
Message-ID: <49C11A9B.8040007@uchicago.edu>
Yes, I think that is what we need. Thanks.
zhao
Ben Clifford wrote:
> On Wed, 18 Mar 2009, Zhao Zhang wrote:
>
>
>> I think swift still needs to hold the 2nd stage computation until the 1st
>> completes. If we simply remove
>> the dependency, swift would send all jobs (both 1st and 2nd) out, right?
>>
>
> I don't mean the dependency between Swift jobs. That would still exist.
>
> I mean make Swift so that it can start the next job when it has determined
> that the first job has completed successfully, with stageout happening
> separately.
>
> At the moment, the dependencies are:
>
> stagein(job A) < run(job A) < stageout(job A) < stagein(job B) < run(job B)
>
> But they could become more like these two chains:
>
> stagein(job A) < run(job A) < stagein(job B) < run(job B)
> run(job A) < stageout(job A)
>
>
From wilde at mcs.anl.gov Wed Mar 18 12:05:40 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 18 Mar 2009 12:05:40 -0500
Subject: [Swift-devel] testing
In-Reply-To: <1237389071.5032.1.camel@localhost>
References: <1237307410.26969.8.camel@localhost>
<49C0E8CB.3080600@mcs.anl.gov> <1237389071.5032.1.camel@localhost>
Message-ID: <49C129E4.9020003@mcs.anl.gov>
On 3/18/09 10:11 AM, Mihael Hategan wrote:
> On Wed, 2009-03-18 at 07:27 -0500, Michael Wilde wrote:
>
>>> iii) running anything through gram2 is bad - any base job submissions
>>> need to be through condor-g using its hybrid gram2+gridmanager system.
>> I agree, and was assuming that on OSG we would only use the new Condor
>> provider, and run jobs in this manner.
>
> There seems to be some confusion here.
>
> Ben, the point is to run with one of the scheduler providers, not gram2.
>
> Mike, the condor provider is not a condor-through-gram provider. It only
> submits to the local condor queue.
I was thinking/hoping that the condor provider would have a setting that
submitted swift apps as condor-g jobs to N grid sites, *via* the local
condor queue.
Isn't that how condor-g works? I send my local condor (via condor_submit)
a .sub file that says e.g.:
universe=grid
grid_resource=gt2 osg-edu.cs.wisc.edu/jobmanager-condor
If I can't do that yet through the condor provider, was it your intent
that users eventually be able to do that, or was that not what you were
implementing?
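To spell out the kind of .sub file I mean, here is a small sketch that generates one. This is not part of Swift or the condor provider; condor_g_submit() is a made-up helper and the output/error/log file names are placeholders. The submit commands themselves (universe, grid_resource, executable, output, error, log, queue) are standard Condor ones.

```python
def condor_g_submit(executable, gatekeeper, jobmanager="jobmanager-condor"):
    """Build the text of a Condor-G submit description for a gt2 (GRAM2) site.

    The job is handed to the local condor queue, which forwards it to the
    remote gatekeeper through the gridmanager.
    """
    lines = [
        "universe = grid",
        f"grid_resource = gt2 {gatekeeper}/{jobmanager}",
        f"executable = {executable}",
        "output = job.out",   # placeholder file names
        "error = job.err",
        "log = job.log",
        "queue",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Write the description out; it would then be handed to condor_submit.
    text = condor_g_submit("/bin/hostname", "osg-edu.cs.wisc.edu")
    with open("hostname.sub", "w") as f:
        f.write(text)
    print(text)
```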
From hategan at mcs.anl.gov Wed Mar 18 12:11:53 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 18 Mar 2009 12:11:53 -0500
Subject: [Swift-devel] testing
In-Reply-To: <49C129E4.9020003@mcs.anl.gov>
References: <1237307410.26969.8.camel@localhost>
<49C0E8CB.3080600@mcs.anl.gov> <1237389071.5032.1.camel@localhost>
<49C129E4.9020003@mcs.anl.gov>
Message-ID: <1237396313.9356.1.camel@localhost>
On Wed, 2009-03-18 at 12:05 -0500, Michael Wilde wrote:
> On 3/18/09 10:11 AM, Mihael Hategan wrote:
> > On Wed, 2009-03-18 at 07:27 -0500, Michael Wilde wrote:
> >
> >>> iii) running anything through gram2 is bad - any base job submissions
> >>> need to be through condor-g using its hybrid gram2+gridmanager system.
> >> I agree, and was assuming that on OSG we would only use the new Condor
> >> provider, and run jobs in this manner.
> >
> > There seems to be some confusion here.
> >
> > Ben, the point is to run with one of the scheduler providers, not gram2.
> >
> > Mike, the condor provider is not a condor-through-gram provider. It only
> > submits to the local condor queue.
>
> I was thinking/hoping that the condor provider would have a setting that
> submitted swift apps as condor-g jobs to N grid sites, *via* the local
> condor queue.
>
> Isn't that how condor-g works? I send my local condor (via condor_submit)
> a .sub file that says e.g.:
>
> universe=grid
> grid_resource=gt2 osg-edu.cs.wisc.edu/jobmanager-condor
>
> If I can't do that yet through the condor provider, was it your intent
> that users eventually be able to do that, or was that not what you were
> implementing?
That was not what I was implementing.
What I was aiming for was a local condor provider, similar to the PBS
provider, that would address the scalability issues with gram2 for sites
using condor as a queuing system.
From wilde at mcs.anl.gov Wed Mar 18 12:18:45 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 18 Mar 2009 12:18:45 -0500
Subject: [Swift-devel] testing
In-Reply-To: <1237390807.5032.19.camel@localhost>
References: <1237307410.26969.8.camel@localhost>
<49C0E8CB.3080600@mcs.anl.gov> <1237389071.5032.1.camel@localhost>
<1237390807.5032.19.camel@localhost>
Message-ID: <49C12CF5.8090708@mcs.anl.gov>
Mihael, please create a design note on how coaster bootstrap and
communication works, and use that as the basis for getting agreement on
the approach and the range of options needed, and for getting input.
That design description probably exists in various material you have,
and can be brief.
On 3/18/09 10:40 AM, Mihael Hategan wrote:
> On Wed, 2009-03-18 at 15:20 +0000, Ben Clifford wrote:
>> On Wed, 18 Mar 2009, Mihael Hategan wrote:
>>
>>> Ben, the point is to run with one of the scheduler providers, not gram2.
>> That is also bad for OSG, as far as I can tell because it screws up
>> accounting which is done in GRAM, not in the LRM.
>>
>> Pretty much what I'm asserting is that both the head job and the worker
>> jobs need to be run through Condor-G/gridmanager/gram2 (in my
>> understanding of what OSG requires).
>>
>
> Ok. Back to the drawing board.
>
>
From skenny at uchicago.edu Wed Mar 18 12:28:15 2009
From: skenny at uchicago.edu (skenny at uchicago.edu)
Date: Wed, 18 Mar 2009 12:28:15 -0500 (CDT)
Subject: [Swift-devel] log-processing tools
Message-ID: <20090318122815.BUG11518@m4500-02.uchicago.edu>
hi, i was trying to grab the log processing tools, but getting
this:
login3% svn co https://svn.ci.uchicago.edu/svn/vdl2/log-processing
svn: URL 'https://svn.ci.uchicago.edu/svn/vdl2/log-processing'
doesn't exist
should i look elsewhere?
thanks
~skenny
From hategan at mcs.anl.gov Wed Mar 18 12:29:47 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 18 Mar 2009 12:29:47 -0500
Subject: [Swift-devel] log-processing tools
In-Reply-To: <20090318122815.BUG11518@m4500-02.uchicago.edu>
References: <20090318122815.BUG11518@m4500-02.uchicago.edu>
Message-ID: <1237397387.9622.0.camel@localhost>
On Wed, 2009-03-18 at 12:28 -0500, skenny at uchicago.edu wrote:
> hi, i was trying to grab the log processing tools, but getting
> this:
>
> login3% svn co https://svn.ci.uchicago.edu/svn/vdl2/log-processing
> svn: URL 'https://svn.ci.uchicago.edu/svn/vdl2/log-processing'
> doesn't exist
>
> should i look elsewhere?
Yes, change "vdl2" to "swift" in the above url.
>
> thanks
> ~skenny
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
From skenny at uchicago.edu Wed Mar 18 12:35:51 2009
From: skenny at uchicago.edu (skenny at uchicago.edu)
Date: Wed, 18 Mar 2009 12:35:51 -0500 (CDT)
Subject: [Swift-devel] log-processing tools
Message-ID: <20090318123551.BUG12524@m4500-02.uchicago.edu>
[skenny at login swift]$ svn co
https://svn.ci.uchicago.edu/svn/swift/log-processing
Authentication realm: SVN Login
Password for 'skenny':
svn: PROPFIND request failed on '/svn/swift/log-processing'
svn: PROPFIND of '/svn/swift/log-processing': 403 Forbidden
(https://svn.ci.uchicago.edu)
[skenny at login swift]$
do i (or you) need to request access from support?
---- Original message ----
>Date: Wed, 18 Mar 2009 12:29:47 -0500
>From: Mihael Hategan
>Subject: Re: [Swift-devel] log-processing tools
>To: skenny at uchicago.edu
>Cc: Ben Clifford ,
swift-devel at ci.uchicago.edu
>
>On Wed, 2009-03-18 at 12:28 -0500, skenny at uchicago.edu wrote:
>> hi, i was trying to grab the log processing tools, but getting
>> this:
>>
>> login3% svn co
https://svn.ci.uchicago.edu/svn/vdl2/log-processing
>> svn: URL 'https://svn.ci.uchicago.edu/svn/vdl2/log-processing'
>> doesn't exist
>>
>> should i look elsewhere?
>
>Yes, change "vdl2" to "swift" in the above url.
>
>>
>> thanks
>> ~skenny
>
From wilde at mcs.anl.gov Wed Mar 18 12:38:17 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 18 Mar 2009 12:38:17 -0500
Subject: [Swift-devel] testing
In-Reply-To: <49C12CF5.8090708@mcs.anl.gov>
References: <1237307410.26969.8.camel@localhost> <49C0E8CB.3080600@mcs.anl.gov>
<1237389071.5032.1.camel@localhost> <1237390807.5032.19.camel@localhost>
<49C12CF5.8090708@mcs.anl.gov>
Message-ID: <49C13189.4090001@mcs.anl.gov>
Also, please put in this description the requirements that we're working
under:
- node IP connectivity
- security
- central (head) node load limits
- central (head) node job time limits
(eg managed headnodes)
- accounting info for all resources used
(ie generates OSG accounting records)
- etc
That will help the design converge.
On 3/18/09 12:18 PM, Michael Wilde wrote:
> Mihael, please create a design note on how coaster bootstrap and
> communication works, and use that as the basis for getting agreement on
> the approach and the range of options needed, and for getting input.
>
> That design description probably exists in various material you have,
> and can be brief.
>
> On 3/18/09 10:40 AM, Mihael Hategan wrote:
>> On Wed, 2009-03-18 at 15:20 +0000, Ben Clifford wrote:
>>> On Wed, 18 Mar 2009, Mihael Hategan wrote:
>>>
>>>> Ben, the point is to run with one of the scheduler providers, not
>>>> gram2.
>>> That is also bad for OSG, as far as I can tell because it screws up
>>> accounting which is done in GRAM, not in the LRM.
>>>
>>> Pretty much what I'm asserting is that both the head job and the
>>> worker jobs need to be run through Condor-G/gridmanager/gram2 (in my
>>> understanding of what OSG requires).
>>>
>>
>> Ok. Back to the drawing board.
>>
>>
From wilde at mcs.anl.gov Wed Mar 18 12:51:13 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 18 Mar 2009 12:51:13 -0500
Subject: [Swift-devel] log-processing tools
In-Reply-To: <20090318123551.BUG12524@m4500-02.uchicago.edu>
References: <20090318123551.BUG12524@m4500-02.uchicago.edu>
Message-ID: <49C13491.6060107@mcs.anl.gov>
There is no svn/swift as far as I can tell (although I'd be happy to see
svn/vdl2 renamed to svn/swift)
I think log-processing is under trunk as per:
http://www.ci.uchicago.edu/trac/swift/changeset/2683
I'm about to look for it.
On 3/18/09 12:35 PM, skenny at uchicago.edu wrote:
> [skenny at login swift]$ svn co
> https://svn.ci.uchicago.edu/svn/swift/log-processing
> Authentication realm: SVN Login
> Password for 'skenny':
> svn: PROPFIND request failed on '/svn/swift/log-processing'
> svn: PROPFIND of '/svn/swift/log-processing': 403 Forbidden
> (https://svn.ci.uchicago.edu)
> [skenny at login swift]$
>
> do i (or you) need to request access from support?
>
>
> ---- Original message ----
>> Date: Wed, 18 Mar 2009 12:29:47 -0500
>> From: Mihael Hategan
>> Subject: Re: [Swift-devel] log-processing tools
>> To: skenny at uchicago.edu
>> Cc: Ben Clifford ,
> swift-devel at ci.uchicago.edu
>> On Wed, 2009-03-18 at 12:28 -0500, skenny at uchicago.edu wrote:
>>> hi, i was trying to grab the log processing tools, but getting
>>> this:
>>>
>>> login3% svn co
> https://svn.ci.uchicago.edu/svn/vdl2/log-processing
>>> svn: URL 'https://svn.ci.uchicago.edu/svn/vdl2/log-processing'
>>> doesn't exist
>>>
>>> should i look elsewhere?
>> Yes, change "vdl2" to "swift" in the above url.
>>
>>> thanks
>>> ~skenny
From wilde at mcs.anl.gov Wed Mar 18 12:53:03 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 18 Mar 2009 12:53:03 -0500
Subject: [Swift-devel] log-processing tools
In-Reply-To: <49C13491.6060107@mcs.anl.gov>
References: <20090318123551.BUG12524@m4500-02.uchicago.edu>
<49C13491.6060107@mcs.anl.gov>
Message-ID: <49C134FF.208@mcs.anl.gov>
Indeed, it's under libexec/log-processing, as that changeset describes.
On 3/18/09 12:51 PM, Michael Wilde wrote:
> There is no svn/swift as far as I can tell (although I'd be happy to see
> svn/vdl2 renamed to svn/swift)
>
> I think log-processing is under trunk as per:
> http://www.ci.uchicago.edu/trac/swift/changeset/2683
>
> I'm about to look for it.
>
> On 3/18/09 12:35 PM, skenny at uchicago.edu wrote:
>> [skenny at login swift]$ svn co
>> https://svn.ci.uchicago.edu/svn/swift/log-processing
>> Authentication realm: SVN Login
>> Password for 'skenny':
>> svn: PROPFIND request failed on '/svn/swift/log-processing'
>> svn: PROPFIND of '/svn/swift/log-processing': 403 Forbidden
>> (https://svn.ci.uchicago.edu)
>> [skenny at login swift]$
>>
>> do i (or you) need to request access from support?
>>
>>
>> ---- Original message ----
>>> Date: Wed, 18 Mar 2009 12:29:47 -0500
>>> From: Mihael Hategan
>>> Subject: Re: [Swift-devel] log-processing tools
>>> To: skenny at uchicago.edu
>>> Cc: Ben Clifford ,
>> swift-devel at ci.uchicago.edu
>>> On Wed, 2009-03-18 at 12:28 -0500, skenny at uchicago.edu wrote:
>>>> hi, i was trying to grab the log processing tools, but getting
>>>> this:
>>>>
>>>> login3% svn co
>> https://svn.ci.uchicago.edu/svn/vdl2/log-processing
>>>> svn: URL 'https://svn.ci.uchicago.edu/svn/vdl2/log-processing'
>>>> doesn't exist
>>>>
>>>> should i look elsewhere?
>>> Yes, change "vdl2" to "swift" in the above url.
>>>
>>>> thanks
>>>> ~skenny
From hategan at mcs.anl.gov Wed Mar 18 12:54:42 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 18 Mar 2009 12:54:42 -0500
Subject: [Swift-devel] testing
In-Reply-To: <49C12CF5.8090708@mcs.anl.gov>
References: <1237307410.26969.8.camel@localhost>
<49C0E8CB.3080600@mcs.anl.gov> <1237389071.5032.1.camel@localhost>
<1237390807.5032.19.camel@localhost> <49C12CF5.8090708@mcs.anl.gov>
Message-ID: <1237398882.9622.2.camel@localhost>
On Wed, 2009-03-18 at 12:18 -0500, Michael Wilde wrote:
> Mihael, please create a design note on how coaster bootstrap and
> communication works, and use that as the basis for getting agreement on
> the approach and the range of options needed, and for getting input.
http://wiki.cogkit.org/wiki/Coasters
>
> That design description probably exists in various material you have,
> and can be brief.
>
> On 3/18/09 10:40 AM, Mihael Hategan wrote:
> > On Wed, 2009-03-18 at 15:20 +0000, Ben Clifford wrote:
> >> On Wed, 18 Mar 2009, Mihael Hategan wrote:
> >>
> >>> Ben, the point is to run with one of the scheduler providers, not gram2.
> >> That is also bad for OSG, as far as I can tell because it screws up
> >> accounting which is done in GRAM, not in the LRM.
> >>
> >> Pretty much what I'm asserting is that both the head job and the worker
> >> jobs need to be run through Condor-G/gridmanager/gram2 (in my
> >> understanding of what OSG requires).
> >>
> >
> > Ok. Back to the drawing board.
> >
> >
From hategan at mcs.anl.gov Wed Mar 18 12:56:49 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 18 Mar 2009 12:56:49 -0500
Subject: [Swift-devel] log-processing tools
In-Reply-To: <49C13491.6060107@mcs.anl.gov>
References: <20090318123551.BUG12524@m4500-02.uchicago.edu>
<49C13491.6060107@mcs.anl.gov>
Message-ID: <1237399009.9622.5.camel@localhost>
On Wed, 2009-03-18 at 12:51 -0500, Michael Wilde wrote:
> There is no svn/swift as far as I can tell (although I'd be happy to see
> svn/vdl2 renamed to svn/swift)
Oh, sorry. I was confusing the swift directory with the SVN module.
>
> I think log-processing is under trunk as per:
> http://www.ci.uchicago.edu/trac/swift/changeset/2683
>
> I'm about to look for it.
>
> On 3/18/09 12:35 PM, skenny at uchicago.edu wrote:
> > [skenny at login swift]$ svn co
> > https://svn.ci.uchicago.edu/svn/swift/log-processing
> > Authentication realm: SVN Login
> > Password for 'skenny':
> > svn: PROPFIND request failed on '/svn/swift/log-processing'
> > svn: PROPFIND of '/svn/swift/log-processing': 403 Forbidden
> > (https://svn.ci.uchicago.edu)
> > [skenny at login swift]$
> >
> > do i (or you) need to request access from support?
> >
> >
> > ---- Original message ----
> >> Date: Wed, 18 Mar 2009 12:29:47 -0500
> >> From: Mihael Hategan
> >> Subject: Re: [Swift-devel] log-processing tools
> >> To: skenny at uchicago.edu
> >> Cc: Ben Clifford ,
> > swift-devel at ci.uchicago.edu
> >> On Wed, 2009-03-18 at 12:28 -0500, skenny at uchicago.edu wrote:
> >>> hi, i was trying to grab the log processing tools, but getting
> >>> this:
> >>>
> >>> login3% svn co
> > https://svn.ci.uchicago.edu/svn/vdl2/log-processing
> >>> svn: URL 'https://svn.ci.uchicago.edu/svn/vdl2/log-processing'
> >>> doesn't exist
> >>>
> >>> should i look elsewhere?
> >> Yes, change "vdl2" to "swift" in the above url.
> >>
> >>> thanks
> >>> ~skenny
From hategan at mcs.anl.gov Wed Mar 18 17:14:28 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 18 Mar 2009 17:14:28 -0500
Subject: [Swift-devel] testing
In-Reply-To: <1237398882.9622.2.camel@localhost>
References: <1237307410.26969.8.camel@localhost>
<49C0E8CB.3080600@mcs.anl.gov> <1237389071.5032.1.camel@localhost>
<1237390807.5032.19.camel@localhost> <49C12CF5.8090708@mcs.anl.gov>
<1237398882.9622.2.camel@localhost>
Message-ID: <1237414468.15019.0.camel@localhost>
On Wed, 2009-03-18 at 12:54 -0500, Mihael Hategan wrote:
> On Wed, 2009-03-18 at 12:18 -0500, Michael Wilde wrote:
> > Mihael, please create a design note on how coaster bootstrap and
> > communication works, and use that as the basis for getting agreement on
> > the approach and the range of options needed, and for getting input.
>
> http://wiki.cogkit.org/wiki/Coasters
And now, with pretty pictures!
From aespinosa at cs.uchicago.edu Wed Mar 18 22:19:27 2009
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Wed, 18 Mar 2009 22:19:27 -0500
Subject: [Swift-devel] "any valid host for task" in Swift + deef provider
In-Reply-To: <1237231475.8617.12.camel@localhost>
References: <50b07b4b0903161213r292fb025o1bc52a34608192d@mail.gmail.com>
<1237231475.8617.12.camel@localhost>
Message-ID: <50b07b4b0903182019x695fb817ua9a05cc45b7fd8c8@mail.gmail.com>
Oops, I made the mistake of not making the corresponding entries in
tc.data for each site description.
Sorry.
On Mon, Mar 16, 2009 at 2:24 PM, Mihael Hategan wrote:
> Can you post your tc.data?
>
> On Mon, 2009-03-16 at 14:13 -0500, Allan Espinosa wrote:
>> Hi,
>>
>> I'm using swift r2682, cogkit 2326 and provider-deef 2507
>>
>> RunID: 20090316-1354-ocn573c3
>> Progress:
>> Execution failed:
>>         Could not find any valid host for task "Task(type=UNKNOWN,
>> identity=urn:cog-1237229648327)" with constraints {tr=hostname,
>> filenames=[Ljava.lang.String;@14aa453, trfqn=hostname,
>> filecache=org.griphyn.vdl.karajan.lib.cache.CacheMapAdapter at 16a4aef}
>>
>>
>> Sites.xml:
>>
>>
>>
>>   > url="http://129.114.102.179:50001/wsrf/services/GenericPortal/core/WS/GPFactoryService"/>
>>   /work/01035/tg802895/swift-runs
>>
>>
>>
>> The run did not initialize the work directory.
>>
From wilde at mcs.anl.gov Thu Mar 19 07:13:53 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 19 Mar 2009 07:13:53 -0500
Subject: [Swift-devel] Small issues in coasters on local:pbs
Message-ID: <49C23701.9060802@mcs.anl.gov>
Regarding: [Swift-user] Swift/PBS Scheduler Slow to Report "Finished"?
I'm retesting coasters on local:pbs (on teraport), as I think this may
partially alleviate Andrew's problem.
A simple foreach() works nice and fast, but I see two things:
1) I first tested without a valid proxy. I forgot that coasters requires
a proxy (presumably for its secure channels) even when it's not using
GRAM to reach its "RRM". The error returned if you don't have a proxy is
cryptic and buried in the coaster bootstrap log. So, three things: (a) do a
check for a proxy early on and print a nice message if there's not a valid
proxy; (b) bring the errors from the bootstrap log back to the user
or, if that's not possible, point the user to look in that log;
(c) document that you need a proxy.
2) When the script finishes you get this message on stdout/err which
looks like a leftover debugging message:
--
Swift svn swift-r2701 cog-r2332
RunID: 20090319-0658-3ejpl9xc
Progress:
Progress: Submitting:9 Submitted:1
Progress: Submitted:9 Active:1
Progress: Submitted:4 Active:3 Stage out:1 Finished successfully:2
Final status: Finished successfully:10
Cleaning up...
Shutting down service at https://128.135.125.117:50002
Got channel MetaChannel: 101224864 -> GSSSChannel-null(1)
- Done
--
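For point 1(a), an early check could be as simple as the sketch below. This is only an illustration, not what coasters does: find_proxy() and require_proxy() are made-up names, and it assumes the usual Globus convention of X509_USER_PROXY falling back to /tmp/x509up_u<uid>. It only checks that a non-empty proxy file exists; it does not verify that the proxy is unexpired.

```python
import os


def find_proxy(environ=None, uid=None):
    """Return the path of an existing, non-empty proxy file, or None."""
    environ = os.environ if environ is None else environ
    uid = os.getuid() if uid is None else uid
    # An empty X509_USER_PROXY (as in the bootstrap log) falls through to the default.
    path = environ.get("X509_USER_PROXY") or f"/tmp/x509up_u{uid}"
    if os.path.isfile(path) and os.path.getsize(path) > 0:
        return path
    return None


def require_proxy(environ=None, uid=None):
    """Exit with a readable message when no proxy file can be found."""
    path = find_proxy(environ, uid)
    if path is None:
        raise SystemExit(
            "No grid proxy found. Coasters needs a valid proxy even for "
            "local:pbs; create one (e.g. with grid-proxy-init) and retry."
        )
    return path


if __name__ == "__main__":
    print("Using proxy:", require_proxy())
```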
- Mike
The errors you get when you don't have a proxy are:
tp$ swift hellos.swift -sites.file sites.xml -tc.file tc.data
Swift svn swift-r2701 cog-r2332
RunID: 20090319-0655-9ufl1r2g
Progress:
Progress: Submitting:9 Submitted:1
Failed to transfer wrapper log from hellos-20090319-0655-9ufl1r2g/info/l
on teraport
Execution failed:
Exception in echo:
Arguments: [Output of run, 6]
Host: teraport
Directory: hellos-20090319-0655-9ufl1r2g/jobs/l/echo-lde8x58j
stderr.txt:
stdout.txt:
----
Caused by:
Could not submit job
Caused by:
Could not start coaster service
Caused by:
Task ended before registration was received.
STDOUT:
STDERR:
Caused by:
Job failed with an exit code of 1
Cleaning up...
Done
tp$
tp$ cat /home/wilde/coaster-bootstrap-01709350024.log
BS: http://tp-login2.ci.uchicago.edu:50001
find wget = /usr/bin/wget
-->/usr/bin/wget -c -q
http://tp-login2.ci.uchicago.edu:50001/coaster-bootstrap.jar -O
/tmp/bootstrap.YJ4129 >>/home/wilde/coaster-bootstrap-01709350024.log
2>&1<--
which: no gmd5sum in
(/autonfs/home/wilde/tutorials/osgedu/build/docbook-xsl/tools/bin:/home/wilde/bin:/soft/java-1.6.0_11-sun-r1/bin:/soft/java-1.6.0_11-sun-r1/jre/bin:/soft/apache-ant-1.7.1-r1/bin:/software/common/gx-map-0.5.3.3-r1/bin:/soft/condor-7.0.5-r1/bin:/soft/globus-4.2.1-r1/bin:/soft/globus-4.2.1-r1/sbin:/usr/kerberos/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-r1/bin:/home/wilde/bin/linux-rhel5-x86_64:/home/wilde/bin:/soft/torque-2.3.6-r1/bin:/soft/maui-3.2.6p21-r1/bin:/soft/maui-3.2.6p21-r1/sbin:/soft/matlab-7.7-r1/bin:/soft/osg-client-1.0.0-r1/lcg/bin:/soft/osg-client-1.0.0-r1/srm-client-lbnl/bin:/soft/osg-client-1.0.0-r1/srm-client-fermi/sbin:/soft/osg-client-1.0.0-r1/srm-client-fermi/bin:/soft/osg-client-1.0.0-r1/curl/bin:/soft/osg-client-1.0.0-r1/wget/bin:/soft/osg-client-1.0.0-r1/cert-scripts/bin:/soft/osg-client-1.0.0-r1/glite/sbin:/soft/osg-client-1.0.0-r1/glite/bin:/soft/osg-client-1.0.0-r1/pyglobus-url-copy/bin:
/soft/osg-client-1.0.0-r1/pegasus/bin:/soft/osg-client-1.0.0-r1/ant/bin:/soft/osg-client-1.0.0-r1/gpt/sbin:/soft/osg-client-1.0.0-r1/globus/bin:/soft/osg-client-1.0.0-r1/globus/sbin:/soft/osg-client-1.0.0-r1/jdk1.5/bin:/soft/osg-client-1.0.0-r1/condor/sbin:/soft/osg-client-1.0.0-r1/condor/bin:/soft/osg-client-1.0.0-r1/logrotate/sbin:/software/common/pacman-3.26-r1/bin:/soft/osg-client-1.0.0-r1/vdt/sbin:/soft/osg-client-1.0.0-r1/vdt/bin:/home/grog/bin/linux-rhel4-ia32:/home/grog/bin:/sw/bin:/sbin:/usr/sbin:/home/wilde/swift/tools:/home/wilde/swift/rev/latest/bin:/home/wilde/blast/ncbi/bin:/home/wilde/adem/adem-osg/bin)
which: no gmd5sum in
(/autonfs/home/wilde/tutorials/osgedu/build/docbook-xsl/tools/bin:/home/wilde/bin:/soft/java-1.6.0_11-sun-r1/bin:/soft/java-1.6.0_11-sun-r1/jre/bin:/soft/apache-ant-1.7.1-r1/bin:/software/common/gx-map-0.5.3.3-r1/bin:/soft/condor-7.0.5-r1/bin:/soft/globus-4.2.1-r1/bin:/soft/globus-4.2.1-r1/sbin:/usr/kerberos/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-r1/bin:/home/wilde/bin/linux-rhel5-x86_64:/home/wilde/bin:/soft/torque-2.3.6-r1/bin:/soft/maui-3.2.6p21-r1/bin:/soft/maui-3.2.6p21-r1/sbin:/soft/matlab-7.7-r1/bin:/soft/osg-client-1.0.0-r1/lcg/bin:/soft/osg-client-1.0.0-r1/srm-client-lbnl/bin:/soft/osg-client-1.0.0-r1/srm-client-fermi/sbin:/soft/osg-client-1.0.0-r1/srm-client-fermi/bin:/soft/osg-client-1.0.0-r1/curl/bin:/soft/osg-client-1.0.0-r1/wget/bin:/soft/osg-client-1.0.0-r1/cert-scripts/bin:/soft/osg-client-1.0.0-r1/glite/sbin:/soft/osg-client-1.0.0-r1/glite/bin:/soft/osg-client-1.0.0-r1/pyglobus-url-copy/bin:
/soft/osg-client-1.0.0-r1/pegasus/bin:/soft/osg-client-1.0.0-r1/ant/bin:/soft/osg-client-1.0.0-r1/gpt/sbin:/soft/osg-client-1.0.0-r1/globus/bin:/soft/osg-client-1.0.0-r1/globus/sbin:/soft/osg-client-1.0.0-r1/jdk1.5/bin:/soft/osg-client-1.0.0-r1/condor/sbin:/soft/osg-client-1.0.0-r1/condor/bin:/soft/osg-client-1.0.0-r1/logrotate/sbin:/software/common/pacman-3.26-r1/bin:/soft/osg-client-1.0.0-r1/vdt/sbin:/soft/osg-client-1.0.0-r1/vdt/bin:/home/grog/bin/linux-rhel4-ia32:/home/grog/bin:/sw/bin:/sbin:/usr/sbin:/home/wilde/swift/tools:/home/wilde/swift/rev/latest/bin:/home/wilde/blast/ncbi/bin:/home/wilde/adem/adem-osg/bin)
find gmd5sum =
find md5sum = /usr/bin/md5sum
Expected checksum: 33170989491a2e007a1c7c68eb907832
Computed checksum: 33170989491a2e007a1c7c68eb907832
find java = /soft/java-1.6.0_11-sun-r1/bin/java
JAVA=/soft/java-1.6.0_11-sun-r1/bin/java
/soft/java-1.6.0_11-sun-r1/bin/java
-Djava=/soft/java-1.6.0_11-sun-r1/bin/java
-DGLOBUS_TCP_PORT_RANGE=50000,51000 -DX509_USER_PROXY=
-DX509_CERT_DIR=/soft/osg-client-1.0.0-r1/globus/TRUSTED_CA
-DGLOBUS_HOSTNAME=none -jar /tmp/bootstrap.YJ4129
http://tp-login2.ci.uchicago.edu:50001 https://128.135.125.117:50000
01709350024
java.lang.RuntimeException: Failed to register service
    at org.globus.cog.abstraction.coaster.service.CoasterService.start(CoasterService.java:111)
    at org.globus.cog.abstraction.coaster.service.CoasterService.main(CoasterService.java:226)
Caused by: org.globus.cog.karajan.workflow.service.channels.ChannelException: Failed to start channel GSSCChannel-https://128.135.125.117:50000(1)
    at org.globus.cog.karajan.workflow.service.channels.GSSChannel.reconnect(GSSChannel.java:104)
    at org.globus.cog.karajan.workflow.service.channels.GSSChannel.start(GSSChannel.java:63)
    at org.globus.cog.karajan.workflow.service.ChannelFactory.newChannel(ChannelFactory.java:43)
    at org.globus.cog.karajan.workflow.service.Client.connect(Client.java:115)
    at org.globus.cog.karajan.workflow.service.Client.newClient(Client.java:72)
    at org.globus.cog.karajan.workflow.service.channels.ChannelManager.connect(ChannelManager.java:211)
    at org.globus.cog.karajan.workflow.service.channels.ChannelManager.reserveChannel(ChannelManager.java:230)
    at org.globus.cog.karajan.workflow.service.channels.ChannelManager.reserveChannel(ChannelManager.java:186)
    at org.globus.cog.abstraction.coaster.service.CoasterService.start(CoasterService.java:100)
    ... 1 more
Caused by: org.globus.gsi.GlobusCredentialException: [JGLOBUS-5] Proxy file (/tmp/x509up_u1031) not found.
    at org.globus.gsi.GlobusCredential.<init>(GlobusCredential.java:114)
    at org.globus.gsi.GlobusCredential.reloadDefaultCredential(GlobusCredential.java:590)
    at org.globus.gsi.GlobusCredential.getDefaultCredential(GlobusCredential.java:575)
    at org.globus.cog.karajan.workflow.service.GSSService.initializeCredentials(GSSService.java:99)
    at org.globus.cog.karajan.workflow.service.channels.GSSChannel.reconnect(GSSChannel.java:77)
    ... 9 more
EC: 1
BS: http://tp-login2.ci.uchicago.edu:50001
find wget = /usr/bin/wget
-->/usr/bin/wget -c -q
http://tp-login2.ci.uchicago.edu:50001/coaster-bootstrap.jar -O
/tmp/bootstrap.DS4363 >>/home/wilde/coaster-bootstrap-01709350024.log
2>&1<--
which: no gmd5sum in
(/autonfs/home/wilde/tutorials/osgedu/build/docbook-xsl/tools/bin:/home/wilde/bin:/soft/java-1.6.0_11-sun-r1/bin:/soft/java-1.6.0_11-sun-r1/jre/bin:/soft/apache-ant-1.7.1-r1/bin:/software/common/gx-map-0.5.3.3-r1/bin:/soft/condor-7.0.5-r1/bin:/soft/globus-4.2.1-r1/bin:/soft/globus-4.2.1-r1/sbin:/usr/kerberos/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-r1/bin:/home/wilde/bin/linux-rhel5-x86_64:/home/wilde/bin:/soft/torque-2.3.6-r1/bin:/soft/maui-3.2.6p21-r1/bin:/soft/maui-3.2.6p21-r1/sbin:/soft/matlab-7.7-r1/bin:/soft/osg-client-1.0.0-r1/lcg/bin:/soft/osg-client-1.0.0-r1/srm-client-lbnl/bin:/soft/osg-client-1.0.0-r1/srm-client-fermi/sbin:/soft/osg-client-1.0.0-r1/srm-client-fermi/bin:/soft/osg-client-1.0.0-r1/curl/bin:/soft/osg-client-1.0.0-r1/wget/bin:/soft/osg-client-1.0.0-r1/cert-scripts/bin:/soft/osg-client-1.0.0-r1/glite/sbin:/soft/osg-client-1.0.0-r1/glite/bin:/soft/osg-client-1.0.0-r1/pyglobus-url-copy/bin:/soft/osg-client-1.0.0-r1/peg
asus/bin:/soft/osg-client-1.0.0-r1/ant/bin:/soft/osg-client-1.0.0-r1/gpt/sbin:/soft/osg-client-1.0.0-r1/globus/bin:/soft/osg-client-1.0.0-r1/globus/sbin:/soft/osg-client-1.0.0-r1/jdk1.5/bin:/soft/osg-client-1.0.0-r1/condor/sbin:/soft/osg-client-1.0.0-r1/condor/bin:/soft/osg-client-1.0.0-r1/logrotate/sbin:/software/common/pacman-3.26-r1/bin:/soft/osg-client-1.0.0-r1/vdt/sbin:/soft/osg-client-1.0.0-r1/vdt/bin:/home/grog/bin/linux-rhel4-ia32:/home/grog/bin:/sw/bin:/sbin:/usr/sbin:/home/wilde/swift/tools:/home/wilde/swift/rev/latest/bin:/home/wilde/blast/ncbi/bin:/home/wilde/adem/adem-osg/bin)
which: no gmd5sum in
(/autonfs/home/wilde/tutorials/osgedu/build/docbook-xsl/tools/bin:/home/wilde/bin:/soft/java-1.6.0_11-sun-r1/bin:/soft/java-1.6.0_11-sun-r1/jre/bin:/soft/apache-ant-1.7.1-r1/bin:/software/common/gx-map-0.5.3.3-r1/bin:/soft/condor-7.0.5-r1/bin:/soft/globus-4.2.1-r1/bin:/soft/globus-4.2.1-r1/sbin:/usr/kerberos/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-r1/bin:/home/wilde/bin/linux-rhel5-x86_64:/home/wilde/bin:/soft/torque-2.3.6-r1/bin:/soft/maui-3.2.6p21-r1/bin:/soft/maui-3.2.6p21-r1/sbin:/soft/matlab-7.7-r1/bin:/soft/osg-client-1.0.0-r1/lcg/bin:/soft/osg-client-1.0.0-r1/srm-client-lbnl/bin:/soft/osg-client-1.0.0-r1/srm-client-fermi/sbin:/soft/osg-client-1.0.0-r1/srm-client-fermi/bin:/soft/osg-client-1.0.0-r1/curl/bin:/soft/osg-client-1.0.0-r1/wget/bin:/soft/osg-client-1.0.0-r1/cert-scripts/bin:/soft/osg-client-1.0.0-r1/glite/sbin:/soft/osg-client-1.0.0-r1/glite/bin:/soft/osg-client-1.0.0-r1/pyglobus-url-copy/bin:/soft/osg-client-1.0.0-r1/peg
asus/bin:/soft/osg-client-1.0.0-r1/ant/bin:/soft/osg-client-1.0.0-r1/gpt/sbin:/soft/osg-client-1.0.0-r1/globus/bin:/soft/osg-client-1.0.0-r1/globus/sbin:/soft/osg-client-1.0.0-r1/jdk1.5/bin:/soft/osg-client-1.0.0-r1/condor/sbin:/soft/osg-client-1.0.0-r1/condor/bin:/soft/osg-client-1.0.0-r1/logrotate/sbin:/software/common/pacman-3.26-r1/bin:/soft/osg-client-1.0.0-r1/vdt/sbin:/soft/osg-client-1.0.0-r1/vdt/bin:/home/grog/bin/linux-rhel4-ia32:/home/grog/bin:/sw/bin:/sbin:/usr/sbin:/home/wilde/swift/tools:/home/wilde/swift/rev/latest/bin:/home/wilde/blast/ncbi/bin:/home/wilde/adem/adem-osg/bin)
find gmd5sum =
find md5sum = /usr/bin/md5sum
Expected checksum: 33170989491a2e007a1c7c68eb907832
Computed checksum: 33170989491a2e007a1c7c68eb907832
find java = /soft/java-1.6.0_11-sun-r1/bin/java
JAVA=/soft/java-1.6.0_11-sun-r1/bin/java
/soft/java-1.6.0_11-sun-r1/bin/java
-Djava=/soft/java-1.6.0_11-sun-r1/bin/java
-DGLOBUS_TCP_PORT_RANGE=50000,51000 -DX509_USER_PROXY=
-DX509_CERT_DIR=/soft/osg-client-1.0.0-r1/globus/TRUSTED_CA
-DGLOBUS_HOSTNAME=none -jar /tmp/bootstrap.DS4363
http://tp-login2.ci.uchicago.edu:50001 https://128.135.125.117:50000
01709350024
java.lang.RuntimeException: Failed to register service
    at org.globus.cog.abstraction.coaster.service.CoasterService.start(CoasterService.java:111)
    at org.globus.cog.abstraction.coaster.service.CoasterService.main(CoasterService.java:226)
Caused by: org.globus.cog.karajan.workflow.service.channels.ChannelException: Failed to start channel GSSCChannel-https://128.135.125.117:50000(1)
    at org.globus.cog.karajan.workflow.service.channels.GSSChannel.reconnect(GSSChannel.java:104)
    at org.globus.cog.karajan.workflow.service.channels.GSSChannel.start(GSSChannel.java:63)
    at org.globus.cog.karajan.workflow.service.ChannelFactory.newChannel(ChannelFactory.java:43)
    at org.globus.cog.karajan.workflow.service.Client.connect(Client.java:115)
    at org.globus.cog.karajan.workflow.service.Client.newClient(Client.java:72)
    at org.globus.cog.karajan.workflow.service.channels.ChannelManager.connect(ChannelManager.java:211)
    at org.globus.cog.karajan.workflow.service.channels.ChannelManager.reserveChannel(ChannelManager.java:230)
    at org.globus.cog.karajan.workflow.service.channels.ChannelManager.reserveChannel(ChannelManager.java:186)
    at org.globus.cog.abstraction.coaster.service.CoasterService.start(CoasterService.java:100)
    ... 1 more
Caused by: org.globus.gsi.GlobusCredentialException: [JGLOBUS-5] Proxy file (/tmp/x509up_u1031) not found.
    at org.globus.gsi.GlobusCredential.<init>(GlobusCredential.java:114)
    at org.globus.gsi.GlobusCredential.reloadDefaultCredential(GlobusCredential.java:590)
    at org.globus.gsi.GlobusCredential.getDefaultCredential(GlobusCredential.java:575)
    at org.globus.cog.karajan.workflow.service.GSSService.initializeCredentials(GSSService.java:99)
    at org.globus.cog.karajan.workflow.service.channels.GSSChannel.reconnect(GSSChannel.java:77)
    ... 9 more
EC: 1
tp$
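The innermost cause in both attempts above is the missing proxy file: the service is started with an empty -DX509_USER_PROXY= and the credential lookup then reports /tmp/x509up_u1031 as not found. A minimal sketch of a pre-flight check for that situation follows. The function names are hypothetical, and the fallback rule (a non-empty X509_USER_PROXY first, then /tmp/x509up_u<uid>) is an assumption read off the log and the usual Globus convention, not taken from the CoG source.

```python
import os

def resolve_proxy_path(env, uid):
    """Return the proxy path a GSI client would likely look for.

    Assumption: a non-empty X509_USER_PROXY wins; otherwise the
    conventional default /tmp/x509up_u<uid> is used.
    """
    explicit = env.get("X509_USER_PROXY", "")
    if explicit:
        return explicit
    return "/tmp/x509up_u%d" % uid

def proxy_available(env, uid):
    """True if the resolved proxy file exists and is non-empty."""
    path = resolve_proxy_path(env, uid)
    return os.path.isfile(path) and os.path.getsize(path) > 0

if __name__ == "__main__":
    # Report the state of the current user's proxy before bootstrapping.
    uid = os.getuid()
    status = "found" if proxy_available(os.environ, uid) else "NOT found"
    print(resolve_proxy_path(os.environ, uid), status)
```

Running this on the host where the service is started, before the bootstrap, would show whether a proxy is present for the account the job actually runs under.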
-------- Original Message --------
Subject: [Swift-user] Swift/PBS Scheduler Slow to Report "Finished"?
Date: Thu, 19 Mar 2009 02:42:25 -0500
From: Andrew Boyce
To: swift-user at ci.uchicago.edu
Hello,
I am currently running Swift in conjunction with the PBS scheduler. My
annoyance at the moment is this:
When running any script, even a simple script such as first.swift
(which normally finishes almost instantaneously), Swift always takes
precisely five minutes to tell me that my job Finished successfully
and copy the files back to the appropriate folder. It is always almost
exactly five minutes; I've checked many logs - it polls the scheduler
for five minutes. When I run a script (like first.swift) without using
the PBS scheduler, everything happens as normal; execution and
"Finished successfully" are nearly immediate.
I think I know what the problem is: even after the scheduler says that
the job is 'completed,' (which is generally right away) the scheduler
keeps the job up on qstat and such for 5 minutes after (this setting
is a PBS server attribute known as 'keep_completed', and I have
checked that it is indeed set to 300 seconds; unfortunately I don't
have permissions to change it). So when Swift polls the scheduler, the
job is still up on qstat, and Swift must think that the task has not
yet "Finished successfully."
My question is this:
Am I indeed right that Swift does not "understand" that when the PBS
scheduler says a job is 'completed', the job really has "Finished
successfully"?
Can this be changed so that Swift does "understand" that a 'completed'
job has "Finished successfully"?
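The behaviour being asked for can be sketched as follows. This is a hypothetical illustration, not the actual Swift/CoG PBS provider code: it assumes the poller reads the single-letter state from the "S" column of default-format qstat output, and it treats 'C' the same as the job having left the queue. The function names and the column position are assumptions.

```python
# Hypothetical sketch: not the actual Swift/CoG PBS provider code.
# Torque/PBS single-letter job states that mean "not done yet".
ACTIVE_STATES = {"Q", "R", "E", "H", "W", "T", "S"}

def job_finished(state):
    """Decide whether a polled job should be reported as finished.

    state is the qstat state letter, or None if qstat no longer lists the job.
    """
    if state is None:
        return True   # job has left the queue entirely
    if state == "C":
        return True   # completed, but still listed for keep_completed seconds
    if state in ACTIVE_STATES:
        return False
    raise ValueError("unknown PBS job state: %r" % (state,))

def state_from_qstat_line(line):
    """Extract the state letter from one default-format qstat row.

    Assumes the column order: job id, name, user, time use, state, queue.
    """
    return line.split()[4]
```

With a rule like this, a job retained by keep_completed=300 would be reported as finished on the first poll that sees state 'C', rather than only after it disappears from qstat.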
I have not included any files because I think I have narrowed the
problem down to a question that does not require the ones I would
usually provide, but if I am wrong, I can send them.
Thank you and sorry for the length.
Regards,
Andrew Boyce
_______________________________________________
Swift-user mailing list
Swift-user at ci.uchicago.edu
http://mail.ci.uchicago.edu/mailman/listinfo/swift-user
From wilde at mcs.anl.gov Thu Mar 19 09:22:35 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 19 Mar 2009 09:22:35 -0500
Subject: [Swift-devel] User script gets null pointer exception
Message-ID: <49C2552B.20409@mcs.anl.gov>
Yue, I should have clarified: this is a Swift bug.
It's perhaps caused by some error in your program, but Swift should never
get a null pointer exception.
Ben, Mihael - The log is in:
/home/yuechen/PTMapZhao/PTMap/PTMap-20090310-1746-zszi94b6.log
The error seems to be related to some boundary condition on the array
"texts" which maps Yue's 375 fasta files, fasta01 .. fasta375.
This message is in the log:
2009-03-10 17:46:36,188-0500 INFO AbstractDataNode Found data
org.griphyn.vdl.mapping.RootDataNode identifier
tag:benc at ci.uchicago.edu,2008:swift:dataset:20090310-1746-f12dnzdd:720000000382
\
type Inputfile with no value at dataset=parameter (closed).$
2009-03-10 17:46:36,188-0500 INFO New NEW
id=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090310-1746-f12dnzdd:720000000382
2009-03-10 17:46:36,190-0500 DEBUG VDL2ExecutionContext Getting value
for array texts.$[]/375 which is not permitted.
Getting value for array texts.$[]/375 which is not permitted.
vdl:getfieldvalue @ PTMap.kml, line: 124
sys:parallelfor @ PTMap.kml, line: 124
sys:sequential @ PTMap.kml, line: 123
doall @ PTMap.kml, line: 170
sys:sequential @ PTMap.kml, line: 169
vdl:mainp @ PTMap.kml, line: 168
mainp @ vdl.k, line: 165
vdl:mains @ PTMap.kml, line: 166
vdl:mains @ PTMap.kml, line: 166
rlog:restartlog @ PTMap.kml, line: 164
kernel:project @ PTMap.kml, line: 2
PTMap-20090310-1746-zszi94b6
Caused by: java.lang.RuntimeException: Getting value for array texts.$[]/375 which is not permitted.
    at org.griphyn.vdl.karajan.lib.GetFieldValue.function(GetFieldValue.java:53)
    at org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67)
    at org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192)
    at org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33)
    at org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:335)
    at org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
-------- Original Message --------
Subject: Re: [Swift-user] swift execution problem
Date: Thu, 19 Mar 2009 08:27:02 -0500
From: Michael Wilde
To: Yue, Chen - BMD
CC: swift-user at ci.uchicago.edu
References:
Yue, what version of Swift are you using?
Please send the first few lines of your output file, where it says
something like:
Swift svn swift-r2701 cog-r2332
RunID: 20090319-0820-19zttiq9
(in fact please send the whole output file, stdout/err)
I've tried to run your code in a near-identical test and I can't reproduce
the failure. I've tried with both Swift 0.8 and the latest svn rev, and
both seem to work.
Also please can you post the pathname of the directory in which you are
testing (I assume you are running this on a CI machine?) so that I can
look at your logfile? And make it publicly accessible?
Thanks,
- Mike
On 3/18/09 6:32 PM, Yue, Chen - BMD wrote:
> Hi,
>
> I'm new to Swift programming. I was able to run a swift script before,
> but I couldn't run it now. I'm wondering if someone can help me figure
> out why. The swift script, sites.xml, tc.data, and all the error
> messages are copied in this email. Thank you!
>
> Regards,
>
> Chen, Yue
>
> *********************
> Swift script
> *********************
> type Fasta {}
> type PTMapOut {}
> type Solution {}
> type Inputfile {}
> app (PTMapOut ofile) PTMap (Solution sfile, Fasta fastafile, Inputfile
> input, Inputfile parameter)
> {
> PTMap @filename(sfile) @filename(fastafile) @filename(input)
> @filename(parameter) stdout=@filename(ofile);
> }
> Fasta texts[] ;
>
> doall(Fasta texts[])
> {
> Solution sfile <"BSASolution.mzXML">;
> Inputfile input <"inputs.txt">;
> Inputfile parameter <"parameters.txt">;
> foreach p in texts {
> PTMapOut r <regexp_mapper; source=@p,
> match="fasta(.*)",
> transform="\\1.out">;
> r = PTMap(sfile, p, input, parameter);
> }
> }
> // Main
> doall(texts);
> **************
> sites.xml
> **************
>
>
>
> /var/tmp
> 0
>
> **************
> tc.data
> **************
> localhost echo /bin/echo INSTALLED INTEL32::LINUX null
> localhost cat /bin/cat INSTALLED INTEL32::LINUX null
> localhost ls /bin/ls INSTALLED INTEL32::LINUX null
> localhost grep /bin/grep INSTALLED INTEL32::LINUX null
> localhost sort /bin/sort INSTALLED INTEL32::LINUX null
> localhost paste /bin/paste INSTALLED INTEL32::LINUX null
> localhost PTMap /home/yuechen/PTMap/PTMap INSTALLED INTEL32::LINUX null
> **************
> Error messages
> **************
> [yuechen at communicado PTMap]$ swift PTMap.swift
> Execution failed:
> java.lang.NullPointerException
> at
> org.globus.cog.abstraction.impl.common.task.ServiceImpl.toString(ServiceImpl.java:156)
> at java.lang.String.valueOf(String.java:2577)
> at java.lang.StringBuffer.append(StringBuffer.java:220)
> at
> org.globus.cog.karajan.workflow.nodes.grid.GridNode.function(GridNode.java:31)
> at
> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:45)
> at
> org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192)
> at
> org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33)
> at
> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:335)
> at
> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
> at
> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
> at
> org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:173)
> at
> org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:299)
> at
> org.globus.cog.karajan.workflow.nodes.ExecuteFile.notificationEvent(ExecuteFile.java:163)
> at
> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:335)
> at
> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
> at
> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
> at
> org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:173)
> at
> org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:299)
> at
> org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58)
> at
> org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:51)
> at
> org.globus.cog.karajan.workflow.nodes.Sequential.childCompleted(Sequential.java:45)
> at
> org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33)
> at
> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:335)
> at
> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
> at
> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
> at
> org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:173)
> at
> org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:299)
> at
> org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58)
> at
> org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192)
> at
> org.globus.cog.karajan.workflow.nodes.user.UserDefinedElement.childCompleted(UserDefinedElement.java:283)
> at
> org.globus.cog.karajan.workflow.nodes.user.SequentialImplicitExecutionUDE.childCompleted(SequentialImplicitExecutionUDE.java:85)
> at
> org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33)
> at
> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:335)
> at
> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
> at
> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
> at
> org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:173)
> at
> org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:299)
> at
> org.globus.cog.karajan.workflow.nodes.If.childCompleted(If.java:30)
> at
> org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33)
> at
> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:335)
> at
> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
> at
> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
> at
> org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:173)
> at
> org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:299)
> at
> org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58)
> at
> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:46)
> at
> org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192)
> at
> org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33)
> at
> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:335)
> at
> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
> at
> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
> at
> org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:173)
> at
> org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:299)
> at
> org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58)
> at
> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:46)
> at
> org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:51)
> at
> org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:27)
> at
> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.executeChildren(AbstractFunction.java:40)
> at
> org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63)
> at
> org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:240)
> at
> org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:281)
> at
> org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:393)
> at
> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:332)
> at
> org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:227)
> at
> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
> at
> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
> at
> org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java:69)
From benc at hawaga.org.uk Thu Mar 19 09:24:08 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 19 Mar 2009 14:24:08 +0000 (GMT)
Subject: [Swift-devel] Re: log-processing tools
In-Reply-To: <20090318122815.BUG11518@m4500-02.uchicago.edu>
References: <20090318122815.BUG11518@m4500-02.uchicago.edu>
Message-ID:
On Wed, 18 Mar 2009, skenny at uchicago.edu wrote:
> hi, i was trying to grab the log processing tools, but getting
> this:
>
> login3% svn co https://svn.ci.uchicago.edu/svn/vdl2/log-processing
> svn: URL 'https://svn.ci.uchicago.edu/svn/vdl2/log-processing'
> doesn't exist
>
> should i look elsewhere?
Hopefully you figured this out based on what others said in this thread.
If not - the log-processing stuff is now part of the main swift build.
You should get a swift-plot-log command in the same place as the base
swift command (i.e. if you can run swift, you should also be able to run
swift-plot-log)
--
From yuechen at bsd.uchicago.edu Thu Mar 19 10:05:30 2009
From: yuechen at bsd.uchicago.edu (Yue, Chen - BMD)
Date: Thu, 19 Mar 2009 10:05:30 -0500
Subject: [Swift-devel] RE: User script gets null pointer exception
References: <49C2552B.20409@mcs.anl.gov>
Message-ID:
Hi Michael,
I used Swift version 0.8. I was able to run the same Swift script last week, but I don't know why I cannot run it now. Yesterday I tried to set up sites.xml and tc.data for the NCSA and SDSC clusters, but it didn't run. So I removed those configurations and went back to the original files to see if I could still run Swift. That's when I ran into the errors. The sites.xml and tc.data are in /home/yuechen/swift-0.8/etc/. PTMap seems to run normally by itself. To test PTMap, I use the following command in the PTMap directory at /home/yuechen/PTMap/:
$ ./PTMap BSASolution.mzXML fasta172 inputs.txt parameters.txt
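For reference, the output name that the script's mapper settings (match="fasta(.*)", transform="\\1.out") would derive for this input can be sketched with Python's re module. This is only an illustration of the match/transform idea: map_output_name is a hypothetical helper, and Swift's own regular-expression handling may differ in details.

```python
import re

def map_output_name(source, match=r"fasta(.*)", transform=r"\1.out"):
    """Apply a match/transform pair to a source file name.

    Mirrors the idea of the script's mapper settings using Python's re
    module; this is an illustration, not Swift's implementation.
    """
    if re.fullmatch(match, source) is None:
        raise ValueError("%r does not match %r" % (source, match))
    # Replace the matched text with the transform, expanding group 1.
    return re.sub(match, transform, source, count=1)

print(map_output_name("fasta172"))  # prints 172.out
```

Under these assumptions, the input fasta172 used in the command above would produce an output file named 172.out.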
Thank you very much for your help.
Regards,
Chen, Yue
________________________________
From: Michael Wilde [mailto:wilde at mcs.anl.gov]
Sent: Thu 3/19/2009 9:22 AM
To: swift-devel
Cc: Yue, Chen - BMD; Zhao Zhang
Subject: User script gets null pointer exception
Yue, I should have clarified: this is a Swift bug.
It's perhaps caused by some error in your program, but Swift should never
get a null pointer exception.
Ben, Mihael - The log is in:
/home/yuechen/PTMapZhao/PTMap/PTMap-20090310-1746-zszi94b6.log
The error seems to be related to some boundary condition on the array
"texts" which maps Yue's 375 fasta files, fasta01 .. fasta375.
This message is in the log:
2009-03-10 17:46:36,188-0500 INFO AbstractDataNode Found data
org.griphyn.vdl.mapping.RootDataNode identifier
tag:benc at ci.uchicago.edu,2008:swift:dataset:20090310-1746-f12dnzdd:720000000382
\
type Inputfile with no value at dataset=parameter (closed).$
2009-03-10 17:46:36,188-0500 INFO New NEW
id=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090310-1746-f12dnzdd:720000000382
2009-03-10 17:46:36,190-0500 DEBUG VDL2ExecutionContext Getting value
for array texts.$[]/375 which is not permitted.
Getting value for array texts.$[]/375 which is not permitted.
vdl:getfieldvalue @ PTMap.kml, line: 124
sys:parallelfor @ PTMap.kml, line: 124
sys:sequential @ PTMap.kml, line: 123
doall @ PTMap.kml, line: 170
sys:sequential @ PTMap.kml, line: 169
vdl:mainp @ PTMap.kml, line: 168
mainp @ vdl.k, line: 165
vdl:mains @ PTMap.kml, line: 166
vdl:mains @ PTMap.kml, line: 166
rlog:restartlog @ PTMap.kml, line: 164
kernel:project @ PTMap.kml, line: 2
PTMap-20090310-1746-zszi94b6
Caused by: java.lang.RuntimeException: Getting value for array texts.$[]/375 which is not permitted.
    at org.griphyn.vdl.karajan.lib.GetFieldValue.function(GetFieldValue.java:53)
    at org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67)
    at org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192)
    at org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33)
    at org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:335)
    at org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
-------- Original Message --------
Subject: Re: [Swift-user] swift execution problem
Date: Thu, 19 Mar 2009 08:27:02 -0500
From: Michael Wilde
To: Yue, Chen - BMD
CC: swift-user at ci.uchicago.edu
References:
Yue, what version of Swift are you using?
Please send the first few lines of your output file, where it says
something like:
Swift svn swift-r2701 cog-r2332
RunID: 20090319-0820-19zttiq9
(in fact please send the whole output file, stdout/err)
I've tried to run your code in a near-identical test and I can't reproduce
the failure. I've tried with both Swift 0.8 and the latest svn rev, and
both seem to work.
Also please can you post the pathname of the directory in which you are
testing (I assume you are running this on a CI machine?) so that I can
look at your logfile? And make it publicly accessible?
Thanks,
- Mike
On 3/18/09 6:32 PM, Yue, Chen - BMD wrote:
> Hi,
>
> I'm new to Swift programming. I was able to run a swift script before,
> but I couldn't run it now. I'm wondering if someone can help me figure
> out why. The swift script, sites.xml, tc.data, and all the error
> messages are copied in this email. Thank you!
>
> Regards,
>
> Chen, Yue
>
> *********************
> Swift script
> *********************
> type Fasta {}
> type PTMapOut {}
> type Solution {}
> type Inputfile {}
> app (PTMapOut ofile) PTMap (Solution sfile, Fasta fastafile, Inputfile
> input, Inputfile parameter)
> {
> PTMap @filename(sfile) @filename(fastafile) @filename(input)
> @filename(parameter) stdout=@filename(ofile);
> }
> Fasta texts[] ;
>
> doall(Fasta texts[])
> {
> Solution sfile <"BSASolution.mzXML">;
> Inputfile input <"inputs.txt">;
> Inputfile parameter <"parameters.txt">;
> foreach p in texts {
> PTMapOut r <regexp_mapper; source=@p,
> match="fasta(.*)",
> transform="\\1.out">;
> r = PTMap(sfile, p, input, parameter);
> }
> }
> // Main
> doall(texts);
> **************
> sites.xml
> **************
>
>
>
> /var/tmp
> 0
>
> **************
> tc.data
> **************
> localhost echo /bin/echo INSTALLED INTEL32::LINUX null
> localhost cat /bin/cat INSTALLED INTEL32::LINUX null
> localhost ls /bin/ls INSTALLED INTEL32::LINUX null
> localhost grep /bin/grep INSTALLED INTEL32::LINUX null
> localhost sort /bin/sort INSTALLED INTEL32::LINUX null
> localhost paste /bin/paste INSTALLED INTEL32::LINUX null
> localhost PTMap /home/yuechen/PTMap/PTMap INSTALLED INTEL32::LINUX null
> **************
> Error messages
> **************
> [yuechen at communicado PTMap]$ swift PTMap.swift
> Execution failed:
> java.lang.NullPointerException
> at
> org.globus.cog.abstraction.impl.common.task.ServiceImpl.toString(ServiceImpl.java:156)
> at java.lang.String.valueOf(String.java:2577)
> at java.lang.StringBuffer.append(StringBuffer.java:220)
> at
> org.globus.cog.karajan.workflow.nodes.grid.GridNode.function(GridNode.java:31)
> at
> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:45)
> at
> org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192)
> at
> org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33)
> at
> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:335)
> at
> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
> at
> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
> at
> org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:173)
> at
> org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:299)
> at
> org.globus.cog.karajan.workflow.nodes.ExecuteFile.notificationEvent(ExecuteFile.java:163)
> at
> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:335)
> at
> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
> at
> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
> at
> org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:173)
> at
> org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:299)
> at
> org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58)
> at
> org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:51)
> at
> org.globus.cog.karajan.workflow.nodes.Sequential.childCompleted(Sequential.java:45)
> at
> org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33)
> at
> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:335)
> at
> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
> at
> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
> at
> org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:173)
> at
> org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:299)
> at
> org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58)
> at
> org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192)
> at
> org.globus.cog.karajan.workflow.nodes.user.UserDefinedElement.childCompleted(UserDefinedElement.java:283)
> at
> org.globus.cog.karajan.workflow.nodes.user.SequentialImplicitExecutionUDE.childCompleted(SequentialImplicitExecutionUDE.java:85)
> at
> org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33)
> at
> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:335)
> at
> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
> at
> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
> at
> org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:173)
> at
> org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:299)
> at
> org.globus.cog.karajan.workflow.nodes.If.childCompleted(If.java:30)
> at
> org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33)
> at
> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:335)
> at
> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
> at
> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
> at
> org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:173)
> at
> org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:299)
> at
> org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58)
> at
> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:46)
> at
> org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192)
> at
> org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33)
> at
> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:335)
> at
> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
> at
> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
> at
> org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:173)
> at
> org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:299)
> at
> org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58)
> at
> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:46)
> at
> org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:51)
> at
> org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:27)
> at
> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.executeChildren(AbstractFunction.java:40)
> at
> org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63)
> at
> org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:240)
> at
> org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:281)
> at
> org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:393)
> at
> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:332)
> at
> org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:227)
> at
> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
> at
> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
> at
> org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java:69)
>
> This email is intended only for the use of the individual or entity to
> which it is addressed and may contain information that is privileged and
> confidential. If the reader of this email message is not the intended
> recipient, you are hereby notified that any dissemination, distribution,
> or copying of this communication is prohibited. If you have received
> this email in error, please notify the sender and destroy/delete all
> copies of the transmittal. Thank you.
>
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> Swift-user mailing list
> Swift-user at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user
From benc at hawaga.org.uk Thu Mar 19 10:12:51 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 19 Mar 2009 15:12:51 +0000 (GMT)
Subject: [Swift-devel] User script gets null pointer exception
In-Reply-To: <49C2552B.20409@mcs.anl.gov>
References: <49C2552B.20409@mcs.anl.gov>
Message-ID:
Note that this is a different error from the one in the original message that
Yue posted (and that I just answered) on swift-user.
The log that you show below is 9 days old and is not a log for the error
that Yue reported.
On Thu, 19 Mar 2009, Michael Wilde wrote:
> Yue, I should have clarified: this is a Swift bug.
>
> It's perhaps caused by some error in your program, but Swift should never get a
> null pointer exception.
>
> Ben, Mihael - The log is in:
>
> /home/yuechen/PTMapZhao/PTMap/PTMap-20090310-1746-zszi94b6.log
>
> The error seems to be related to some boundary condition on the array "texts"
> which maps Yue's 375 fasta files, fast01 .. fasta375
>
> This message is in the log:
>
> 2009-03-10 17:46:36,188-0500 INFO AbstractDataNode Found data
> org.griphyn.vdl.mapping.RootDataNode identifier
> tag:benc at ci.uchicago.edu,2008:swift:dataset:20090310-1746-f12dnzdd:720000000382
> \
> type Inputfile with no value at dataset=parameter (closed).$
> 2009-03-10 17:46:36,188-0500 INFO New NEW
> id=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090310-1746-f12dnzdd:720000000382
> 2009-03-10 17:46:36,190-0500 DEBUG VDL2ExecutionContext Getting value for
> array texts.$[]/375 which is not permitted.
> Getting value for array texts.$[]/375 which is not permitted.
> vdl:getfieldvalue @ PTMap.kml, line: 124
> sys:parallelfor @ PTMap.kml, line: 124
> sys:sequential @ PTMap.kml, line: 123
> doall @ PTMap.kml, line: 170
> sys:sequential @ PTMap.kml, line: 169
> vdl:mainp @ PTMap.kml, line: 168
> mainp @ vdl.k, line: 165
> vdl:mains @ PTMap.kml, line: 166
> vdl:mains @ PTMap.kml, line: 166
> rlog:restartlog @ PTMap.kml, line: 164
> kernel:project @ PTMap.kml, line: 2
> PTMap-20090310-1746-zszi94b6
> Caused by: java.lang.RuntimeException: Getting value for array texts.$[]/375
> which is not permitted.
> at
> org.griphyn.vdl.karajan.lib.GetFieldValue.function(GetFieldValue.java:53)
> at org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67)
> at
> org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192)
> at
> org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33)
> at
> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:335)
> at
> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
>
>
>
> -------- Original Message --------
> Subject: Re: [Swift-user] swift execution problem
> Date: Thu, 19 Mar 2009 08:27:02 -0500
> From: Michael Wilde