From iraicu at cs.uchicago.edu  Tue Nov  2 14:47:36 2010
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Tue, 02 Nov 2010 14:47:36 -0500
Subject: [Swift-devel] Call for Participation: 3rd IEEE Workshop on
 Many-Task Computing on Grids and Supercomputers (MTAGS10),
 co-located with Supercomputing 2010 -- win an Apple iPad!!!
Message-ID: <4CD06AD8.1050201@cs.uchicago.edu>

Dear all,
We invite you to participate in the 3rd workshop on Many-Task Computing 
on Grids and Supercomputers (MTAGS10) on Monday, November 15th, 2010, 
co-located with IEEE/ACM Supercomputing 2010 in New Orleans LA. MTAGS 
will provide the scientific community a dedicated forum for presenting 
new research, development, and deployment efforts of large-scale 
many-task computing (MTC) applications on large scale clusters, Grids, 
Supercomputers, and Cloud Computing infrastructure.

A few highlights of the workshop:

    * *Workshop Program: *The program can be found at
      http://www.cs.iit.edu/~iraicu/MTAGS10/program.htm; papers and
      slides will be posted by November 15th, 2010
    * *Keynote speaker: *Roger Barga, PhD, Architect, Extreme Computing
      Group, Microsoft Research
    * *Best Paper Nominees: *
          o Timothy Armstrong, Mike Wilde, Daniel Katz, Zhao Zhang, Ian
            Foster. "/Scheduling Many-Task Workloads on Supercomputers:
            Dealing with Trailing Tasks/", 3rd IEEE Workshop on
            Many-Task Computing on Grids and Supercomputers (MTAGS10) 2010
          o Thomas Budnik, Brant Knudson, Mark Megerian, Sam Miller,
            Mike Mundy, Will Stockdell. "/Blue Gene/Q Resource
            Management Architecture/", 3rd IEEE Workshop on Many-Task
            Computing on Grids and Supercomputers (MTAGS10) 2010
    * *Attendance Prize: *There will be a /*free Apple iPad */giveaway
      at the end of the workshop; must attend at least 1 talk throughout
      the day at the workshop, and must be present to win at the end of
      the workshop at 6:15PM

The workshop program is:

    * 9:00AM     Opening Remarks
    * 9:10AM *Keynote: Data Laden Clouds, Roger Barga, PhD, Architect,
      Extreme Computing Group, Microsoft Research *
    * Session 1: Applications
          o 10:30AM     Many Task Computing for Modeling the Fate of Oil
            Discharged from the Deep Water Horizon Well Blowout
          o 11:00AM     Many-Task Applications in the Integrated Plasma
            Simulator
          o 11:30AM     Compute and data management strategies for grid
            deployment of high throughput protein structure studies
    * Session 2: Storage
          o 1:30PM     Processing Massive Sized Graphs Using Sector/Sphere
          o 2:00PM     Easy and Instantaneous Processing for
            Data-Intensive Workflows
          o 2:30PM     Detecting Bottlenecks in Parallel DAG-based Data
            Flow Programs
    * Session 3: Resource Management
          o 3:30PM     Improving Many-Task Computing in Scientific
            Workflows Using P2P Techniques
          o 4:00PM     Dynamic Task Scheduling for the Uintah Framework
          o 4:30PM     Automatic and Coordinated Job Recovery for High
            Performance Computing
    * Session 4: Best Papers Nominees
          o 5:15PM     Scheduling Many-Task Workloads on Supercomputers:
            Dealing with Trailing Tasks
          o 5:45PM     Blue Gene/Q Resource Management Architecture
    * 6:15PM     Best Paper Award, Attendees Prizes, & Closing Remarks

We look forward to seeing you at the workshop in less than 2 weeks!

Regards,
Ioan Raicu, Yong Zhao, and Ian Foster
MTAGS10 Chairs
http://www.cs.iit.edu/~iraicu/MTAGS10/

-- 
=================================================================
Ioan Raicu, Ph.D.
Assistant Professor
=================================================================
Computer Science Department
Illinois Institute of Technology
10 W. 31st Street
Stuart Building, Room 237D
Chicago, IL 60616
=================================================================
Cel:    1-847-722-0876
Office: 1-312-567-5704
Email:  iraicu at cs.iit.edu
Web:    http://www.cs.iit.edu/~iraicu/
=================================================================
=================================================================


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20101102/bb345a75/attachment.html>

From wilde at mcs.anl.gov  Wed Nov  3 10:52:03 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 3 Nov 2010 10:52:03 -0500 (CDT)
Subject: [Swift-devel] Swift parser fails under Java 1.6.0_07
In-Reply-To: <1214002773.7635.1288798811790.JavaMail.root@zimbra.anl.gov>
Message-ID: <287862175.7750.1288799523344.JavaMail.root@zimbra.anl.gov>

I (with the help of a new user) just painfully re-discoverd that Swift's parser fails under the (somewhat old) Java JRE release 1.6.0_07 which happens to be the default under on the UChicago IBI cluster.

[wilde at ibicluster t2]$ java -version
java version "1.6.0_07"
Java(TM) SE Runtime Environment (build 1.6.0_07-b06)
Java HotSpot(TM) 64-Bit Server VM (build 10.0-b23, mixed mode)

When I run under 1.6.0_20 the problem does not occur.

Under _07, Swift fails compiling even the most trivial script, as in the example below, in which I include the complete log file.

Has anyone else seen this, and/or know the cause?

Im puzzled that the swift .log file doesn't start with the typical version and environment info. Its almost like swift is taking some strangely different execution path under 1.6.0_07.

This is not yet a major issue, but its unsettling that a specific java version could trigger this behavior.

- Mike

[wilde at ibicluster t4]$ ls -l
total 4
-rw-r--r-- 1 wilde brdfuser 14 Nov  3 10:42 hw.swift
[wilde at ibicluster t4]$ cat hw.swift
trace ("hi");
[wilde at ibicluster t4]$ swift hw.swift >stdout 2>stderr
[wilde at ibicluster t4]$ ls -l
total 272
-rw-r--r-- 1 wilde brdfuser   1559 Nov  3 10:44 hw-20101103-1045-3ljqjlz3.log
-rw-r--r-- 1 wilde brdfuser     14 Nov  3 10:42 hw.swift
-rw-r--r-- 1 wilde brdfuser      1 Nov  3 10:44 hw.xml
-rw-r--r-- 1 wilde brdfuser 257393 Nov  3 10:44 stderr
-rw-r--r-- 1 wilde brdfuser      0 Nov  3 10:44 stdout
-rw-r--r-- 1 wilde brdfuser     57 Nov  3 10:44 swift.log
[wilde at ibicluster t4]$ cat hw-20101103-1045-3ljqjlz3.log 
2010-11-03 10:45:38,587-0500 DEBUG Loader Max heap: 238616576
2010-11-03 10:45:38,588-0500 INFO  Loader hw.swift: source file is new. Recompiling.
2010-11-03 10:45:39,066-0500 DEBUG Loader Detailed exception:
org.griphyn.vdl.karajan.CompilationException: Unable to parse intermediate XML
	at org.griphyn.vdl.engine.Karajan.compile(Karajan.java:98)
	at org.griphyn.vdl.karajan.Loader.compile(Loader.java:295)
	at org.griphyn.vdl.karajan.Loader.main(Loader.java:140)
Caused by: org.apache.xmlbeans.XmlException: /userhom2/2/wilde/swift/lab/t4/hw.xml:2:1: error: Unexpected end of file after null
	at org.apache.xmlbeans.impl.store.Locale$SaxLoader.load(Locale.java:3467)
	at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1270)
	at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1257)
	at org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(SchemaTypeLoaderBase.java:345)
	at org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(SchemaTypeLoaderBase.java:252)
	at org.globus.swift.language.ProgramDocument$Factory.parse(ProgramDocument.java:499)
	at org.griphyn.vdl.engine.Karajan.parseProgramXML(Karajan.java:122)
	at org.griphyn.vdl.engine.Karajan.compile(Karajan.java:96)
	... 2 more
Caused by: org.xml.sax.SAXParseException: Unexpected end of file after null
	at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.reportFatalError(Piccolo.java:1038)
	at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.parse(Piccolo.java:723)
	at org.apache.xmlbeans.impl.store.Locale$SaxLoader.load(Locale.java:3435)
	... 9 more
[wilde at ibicluster t4]$ wc -l stderr
3529 stderr
[wilde at ibicluster t4]$ head -30 stderr
action parse error in group XDTM line 3; template context is [program]
line 1:1: unexpected token: program
	at org.antlr.stringtemplate.language.ActionParser.primaryExpr(ActionParser.java:722)
	at org.antlr.stringtemplate.language.ActionParser.expr(ActionParser.java:430)
	at org.antlr.stringtemplate.language.ActionParser.templatesExpr(ActionParser.java:212)
	at org.antlr.stringtemplate.language.ActionParser.action(ActionParser.java:126)
	at org.antlr.stringtemplate.StringTemplate.parseAction(StringTemplate.java:957)
	at org.antlr.stringtemplate.language.TemplateParser.action(TemplateParser.java:161)
	at org.antlr.stringtemplate.language.TemplateParser.template(TemplateParser.java:127)
	at org.antlr.stringtemplate.StringTemplate.breakTemplateIntoChunks(StringTemplate.java:931)
	at org.antlr.stringtemplate.StringTemplate.setTemplate(StringTemplate.java:532)
	at org.antlr.stringtemplate.language.GroupParser.template(GroupParser.java:327)
	at org.antlr.stringtemplate.language.GroupParser.group(GroupParser.java:186)
	at org.antlr.stringtemplate.StringTemplateGroup.parseGroup(StringTemplateGroup.java:769)
	at org.antlr.stringtemplate.StringTemplateGroup.<init>(StringTemplateGroup.java:271)
	at org.antlr.stringtemplate.StringTemplateGroup.<init>(StringTemplateGroup.java:241)
	at org.griphyn.vdl.toolkit.VDLt2VDLx.compile(VDLt2VDLx.java:57)
	at org.griphyn.vdl.toolkit.VDLt2VDLx.compile(VDLt2VDLx.java:45)
	at org.griphyn.vdl.karajan.Loader.compile(Loader.java:290)
	at org.griphyn.vdl.karajan.Loader.main(Loader.java:140)
Can't parse chunk: program xmlns=$defaultNS()$
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xmlns:xs="http://www.w3.org/2001/XMLSchema"$if(!namespaces)$
line 1:15: unexpected char: '$'
	at org.antlr.stringtemplate.language.ActionLexer.nextToken(ActionLexer.java:219)
	at antlr.TokenBuffer.fill(TokenBuffer.java:69)
	at antlr.TokenBuffer.LA(TokenBuffer.java:80)
	at antlr.LLkParser.LA(LLkParser.java:52)
	at antlr.Parser.consumeUntil(Parser.java:149)
	at antlr.Parser.recover(Parser.java:312)
[wilde at ibicluster t4]$ which java
/usr/java/latest/bin/java
[wilde at ibicluster t4]$ 


-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory


From hategan at mcs.anl.gov  Wed Nov  3 15:19:55 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 03 Nov 2010 13:19:55 -0700
Subject: [Swift-devel] Swift parser fails under Java 1.6.0_07
In-Reply-To: <287862175.7750.1288799523344.JavaMail.root@zimbra.anl.gov>
References: <287862175.7750.1288799523344.JavaMail.root@zimbra.anl.gov>
Message-ID: <1288815595.32143.4.camel@blabla2.none>

On Wed, 2010-11-03 at 10:52 -0500, Michael Wilde wrote:
> I (with the help of a new user) just painfully re-discoverd that Swift's parser fails under the (somewhat old) Java JRE release 1.6.0_07 which happens to be the default under on the UChicago IBI cluster.
> 
> [wilde at ibicluster t2]$ java -version
> java version "1.6.0_07"
> Java(TM) SE Runtime Environment (build 1.6.0_07-b06)
> Java HotSpot(TM) 64-Bit Server VM (build 10.0-b23, mixed mode)
> 
> When I run under 1.6.0_20 the problem does not occur.
> 
> Under _07, Swift fails compiling even the most trivial script, as in the example below, in which I include the complete log file.
> 
> Has anyone else seen this, and/or know the cause?
> 
> Im puzzled that the swift .log file doesn't start with the typical
> version and environment info. Its almost like swift is taking some
> strangely different execution path under 1.6.0_07.

I think the version is printed after compilation.

> 
> This is not yet a major issue, but its unsettling that a specific java version could trigger this behavior.

I suspect it's the version of SAX (xml parser) included with the JVM.

Which may also suggest a solution: use a good version of the SAX jar and
override the JVM provided one.

Mihael


From bugzilla-daemon at mcs.anl.gov  Sun Nov  7 13:23:19 2010
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Sun,  7 Nov 2010 13:23:19 -0600 (CST)
Subject: [Swift-devel] [Bug 3] VDL2 quickstart guide issues
In-Reply-To: <bug-3-21@http.bugzilla.mcs.anl.gov/swift/>
References: <bug-3-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20101107192319.D52502C9EC@wind.mcs.anl.gov>

https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=3


skenny <skenny at uchicago.edu> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
                 CC|                            |skenny at uchicago.edu
         Resolution|                            |FIXED


--- Comment #3 from skenny <skenny at uchicago.edu>  2010-11-07 13:23:19 ---
link to the really quick start guide has been removed (and the code commented
out) in the doc and many of the other issues here have been resolved (as noted
below). 

the remainder of the bullet points seem to be general comments/questions, so
i'm not sure this is the right place for them (and as mihael points out this
ambiguity could keep the bug open indefinitely). thus i'm closing this bug. but
if others feel there is a clear significant bug here in the comments that
hasn't been addressed please re-open it as a separate bug report.

-- 
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.
You are watching someone on the CC list of the bug.


From hategan at mcs.anl.gov  Sun Nov  7 18:33:11 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sun, 07 Nov 2010 16:33:11 -0800
Subject: [Swift-devel] Standard Swift coaster behavior doesnt work
	wellfor sporadic jobs
In-Reply-To: <1287642685.15118.46.camel@blabla2.none>
References: <57986955.982201286845836821.JavaMail.root@zimbra.anl.gov>
	<4CBF0D27.80302@gmail.com><1287619089.13330.2.camel@blabla2.none>
	<1287642685.15118.46.camel@blabla2.none>
Message-ID: <1289176391.11359.1.camel@blabla2.none>

I can't reproduce this.

I tried to have both the blocks and the services shut down between the
jobs, and both scenarios lead to new blocks/services being started and
the jobs completing. Though I did clean up the code a bit after that.

I'll need full logs from failed runs (both swift and coaster).

Mihael

On Wed, 2010-10-20 at 23:31 -0700, Mihael Hategan wrote:
> On Thu, 2010-10-21 at 00:00 +0000, jon.monette at gmail.com wrote:
> > Ok. I can try to put together a script that does it. But I think it just need to be a script in which between two jobs that are submitted to a site there is a long time so all the workers time out. 
> 
> I can probably do that with a local sleep sandwiched between two coaster
> jobs. So don't worry about it.
> 
> Mihael
> 
> 
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel


From jon.monette at gmail.com  Sun Nov  7 19:55:42 2010
From: jon.monette at gmail.com (Jonathan Monette)
Date: Sun, 07 Nov 2010 19:55:42 -0600
Subject: [Swift-devel] Standard Swift coaster behavior doesnt work wellfor
	sporadic jobs
In-Reply-To: <1289176391.11359.1.camel@blabla2.none>
References: <57986955.982201286845836821.JavaMail.root@zimbra.anl.gov>	
	<4CBF0D27.80302@gmail.com><1287619089.13330.2.camel@blabla2.none>	
	<1287642685.15118.46.camel@blabla2.none>
	<1289176391.11359.1.camel@blabla2.none>
Message-ID: <4CD7589E.80706@gmail.com>

Ok.  I can send you those.  I have a test tomorrow so I can send them 
later in the week.

On 11/7/10 6:33 PM, Mihael Hategan wrote:
> I can't reproduce this.
>
> I tried to have both the blocks and the services shut down between the
> jobs, and both scenarios lead to new blocks/services being started and
> the jobs completing. Though I did clean up the code a bit after that.
>
> I'll need full logs from failed runs (both swift and coaster).
>
> Mihael
>
> On Wed, 2010-10-20 at 23:31 -0700, Mihael Hategan wrote:
>> On Thu, 2010-10-21 at 00:00 +0000, jon.monette at gmail.com wrote:
>>> Ok. I can try to put together a script that does it. But I think it just need to be a script in which between two jobs that are submitted to a site there is a long time so all the workers time out.
>> I can probably do that with a local sleep sandwiched between two coaster
>> jobs. So don't worry about it.
>>
>> Mihael
>>
>>
>> _______________________________________________
>> Swift-devel mailing list
>> Swift-devel at ci.uchicago.edu
>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>

-- 
Jon

Computers are incredibly fast, accurate, and stupid. Human beings are incredibly slow, inaccurate, and brilliant. Together they are powerful beyond imagination.
- Albert Einstein


From bugzilla-daemon at mcs.anl.gov  Mon Nov  8 14:46:01 2010
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Mon,  8 Nov 2010 14:46:01 -0600 (CST)
Subject: [Swift-devel] [Bug 39] a poor syntax error
In-Reply-To: <bug-39-21@http.bugzilla.mcs.anl.gov/swift/>
References: <bug-39-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20101108204601.28D142CB90@wind.mcs.anl.gov>

https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=39


skenny <skenny at uchicago.edu> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
                 CC|                            |skenny at uchicago.edu
         Resolution|                            |FIXED


--- Comment #2 from skenny <skenny at uchicago.edu>  2010-11-08 14:46:00 ---
the latest swift reports the next token (after the missing semicolon) as the
unexpected token, which seems appropriate...that is, it does not misinterpret
the > as a greater-than symbol:

[skenny at login2]$ cat 119-missing-semi.swift
type file {};
type student {
  file name;
  file age;
  file gpa;
}
app (file t) getname(string n) {
        echo n stdout=@filename(t);
}

file results <single_file_mapper; file="sinfo.txt">;
student fnames[] <csv_mapper;file="stu_list.txt">
results = getname(@filename(fnames[0]));

[skenny at login2]$ swift 119-missing-semi.swift
Could not compile SwiftScript source: line 13:1: unexpected token: results

a test for this has been added to swift/tests/language/should-not-work

-- 
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.
You are watching the reporter.


From bugzilla-daemon at mcs.anl.gov  Mon Nov  8 15:16:51 2010
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Mon,  8 Nov 2010 15:16:51 -0600 (CST)
Subject: [Swift-devel] [Bug 5] Directory names seem wrong,
	and files are missing
In-Reply-To: <bug-5-21@http.bugzilla.mcs.anl.gov/swift/>
References: <bug-5-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20101108211651.BABBA2B87F@wind.mcs.anl.gov>

https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=5


skenny <skenny at uchicago.edu> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
                 CC|                            |skenny at uchicago.edu
         Resolution|                            |FIXED


--- Comment #2 from skenny <skenny at uchicago.edu>  2010-11-08 15:16:51 ---
this appears to refer to outdated tutorial material. first.swift is the current
example in section 2 of the tutorial and does not refer to any non-existent
files.

-- 
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.


From bugzilla-daemon at mcs.anl.gov  Mon Nov  8 16:19:44 2010
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Mon,  8 Nov 2010 16:19:44 -0600 (CST)
Subject: [Swift-devel] [Bug 6] Not globally unique temporary file names
In-Reply-To: <bug-6-21@http.bugzilla.mcs.anl.gov/swift/>
References: <bug-6-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20101108221944.A57BF2CB98@wind.mcs.anl.gov>

https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=6


skenny <skenny at uchicago.edu> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED
                 CC|                            |skenny at uchicago.edu


--- Comment #3 from skenny <skenny at uchicago.edu>  2010-11-08 16:19:44 ---
both instances produce 10 files:

[skenny at login2]$ cat t2_outnames.swift
type messagefile;

app (messagefile t) greeting(string s) {
    echo s stdout=@filename(t);
}

myprog() {
     messagefile outfile;
     outfile=greeting("this file");
 }

int idx[] = [1:10];

foreach i in idx {
     myprog();
 }


foreach i in idx {
     messagefile outfile;
     outfile = greeting("this file");
 }

[skenny at login2]$ swift t2_outnames.swift
Swift svn swift-r3680 cog-r2913

RunID: 20101108-1615-7j7nvdc9
Progress:
Progress:  Selecting site:18  Stage in:1  Finished successfully:1
...

[skenny at login2]$ ls _concurrent/
outfile-3976bc53-6010-41b4-adc2-823c3cb17d91-1-0 
outfile-7e53f096-cf13-4eea-b52c-ee0c341f03bf-2-0
outfile-3976bc53-6010-41b4-adc2-823c3cb17d91-1-1 
outfile-7e53f096-cf13-4eea-b52c-ee0c341f03bf-2-1
outfile-3976bc53-6010-41b4-adc2-823c3cb17d91-1-2 
outfile-7e53f096-cf13-4eea-b52c-ee0c341f03bf-2-2
outfile-3976bc53-6010-41b4-adc2-823c3cb17d91-1-3 
outfile-7e53f096-cf13-4eea-b52c-ee0c341f03bf-2-3
outfile-3976bc53-6010-41b4-adc2-823c3cb17d91-1-4 
outfile-7e53f096-cf13-4eea-b52c-ee0c341f03bf-2-4
outfile-3976bc53-6010-41b4-adc2-823c3cb17d91-1-5 
outfile-7e53f096-cf13-4eea-b52c-ee0c341f03bf-2-5
outfile-3976bc53-6010-41b4-adc2-823c3cb17d91-1-6 
outfile-7e53f096-cf13-4eea-b52c-ee0c341f03bf-2-6
outfile-3976bc53-6010-41b4-adc2-823c3cb17d91-1-7 
outfile-7e53f096-cf13-4eea-b52c-ee0c341f03bf-2-7
outfile-3976bc53-6010-41b4-adc2-823c3cb17d91-1-8 
outfile-7e53f096-cf13-4eea-b52c-ee0c341f03bf-2-8
outfile-3976bc53-6010-41b4-adc2-823c3cb17d91-1-9 
outfile-7e53f096-cf13-4eea-b52c-ee0c341f03bf-2-9

this addresses ian's issue...then if i re-run it the output files produced are
given the same names (so they just overwrite the output from the previous run)
which, i *think* is what mihael was referring to (?)

this bug might need a bit of clarification as it's not clear to me if it's been
resolved...

-- 
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.
You are watching the reporter.


From bugzilla-daemon at mcs.anl.gov  Mon Nov  8 17:11:09 2010
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Mon,  8 Nov 2010 17:11:09 -0600 (CST)
Subject: [Swift-devel] [Bug 6] Not globally unique temporary file names
In-Reply-To: <bug-6-21@http.bugzilla.mcs.anl.gov/swift/>
References: <bug-6-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20101108231109.CBBA72B86D@wind.mcs.anl.gov>

https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=6


--- Comment #4 from Mihael Hategan <hategan at mcs.anl.gov>  2010-11-08 17:11:09 ---
(In reply to comment #3)
> this addresses ian's issue...then if i re-run it the output files produced are
> given the same names (so they just overwrite the output from the previous run)
> which, i *think* is what mihael was referring to (?)

Right. Which is Ian's initial complaint.

The question is whether you get the same behavior if you were to import that in
a different script and repeatedly call that program, would it still work? I
suspect that it now does, since swift will use both the static random name, as
well as the thread id, which will be a mix of the parent thread id and the
stand-alone thread ids.

Though if you had hard-mapped files, you would still run into problems.

Mihael

-- 
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.
You are watching the reporter.


From bugzilla-daemon at mcs.anl.gov  Mon Nov  8 17:43:56 2010
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Mon,  8 Nov 2010 17:43:56 -0600 (CST)
Subject: [Swift-devel] [Bug 31] error message when mapper parameter is wrong
In-Reply-To: <bug-31-21@http.bugzilla.mcs.anl.gov/swift/>
References: <bug-31-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20101108234357.022232B8BF@wind.mcs.anl.gov>

https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=31


skenny <skenny at uchicago.edu> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |skenny at uchicago.edu


--- Comment #2 from skenny <skenny at uchicago.edu>  2010-11-08 17:43:56 ---
currently gives the appropriate mapper parameter error (though still shows that
there's a java exception):

[skenny at communicado]$ swift mapperparam.swift
Swift svn swift-r3680 cog-r2913

RunID: 20101108-1737-9nkngll7
Execution failed:
        java.lang.RuntimeException:
org.griphyn.vdl.mapping.InvalidMapperException: csv_mapper: CSV mapper must
have a file parameter.

-- 
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.
You are watching the reporter.


From bugzilla-daemon at mcs.anl.gov  Mon Nov  8 17:52:36 2010
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Mon,  8 Nov 2010 17:52:36 -0600 (CST)
Subject: [Swift-devel] [Bug 43] simple_mapper and ClassCastException
In-Reply-To: <bug-43-21@http.bugzilla.mcs.anl.gov/swift/>
References: <bug-43-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20101108235236.0B9642B89A@wind.mcs.anl.gov>

https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=43


skenny <skenny at uchicago.edu> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
                 CC|                            |skenny at uchicago.edu
         Resolution|                            |WORKSFORME


-- 
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching someone on the CC list of the bug.
You are watching the assignee of the bug.
You are watching the reporter.


From bugzilla-daemon at mcs.anl.gov  Mon Nov  8 19:27:56 2010
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Mon,  8 Nov 2010 19:27:56 -0600 (CST)
Subject: [Swift-devel] [Bug 71] Develop a Matlab version for the SIDGrid
	workflow
In-Reply-To: <bug-71-21@http.bugzilla.mcs.anl.gov/swift/>
References: <bug-71-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20101109012756.124B12CBAF@wind.mcs.anl.gov>

https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=71


skenny <skenny at uchicago.edu> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
                 CC|                            |skenny at uchicago.edu
         Resolution|                            |WONTFIX


--- Comment #1 from skenny <skenny at uchicago.edu>  2010-11-08 19:27:55 ---
Bennett and his students are gone.

-- 
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.
You are watching the reporter.


From bugzilla-daemon at mcs.anl.gov  Mon Nov  8 19:51:02 2010
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Mon,  8 Nov 2010 19:51:02 -0600 (CST)
Subject: [Swift-devel] [Bug 77] Remote access to the CNARI Data
In-Reply-To: <bug-77-21@http.bugzilla.mcs.anl.gov/swift/>
References: <bug-77-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20101109015102.A2F0F2CBAF@wind.mcs.anl.gov>

https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=77


skenny <skenny at uchicago.edu> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
                 CC|                            |skenny at uchicago.edu
         Resolution|                            |FIXED


--- Comment #1 from skenny <skenny at uchicago.edu>  2010-11-08 19:51:02 ---
the CNARI project has many workflows for accessing their databases remotely...
currently documented primarily on the wiki:
http://www.ci.uchicago.edu/wiki/bin/view/CNARI/WebHome

-- 
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.
You are watching the reporter.


From bugzilla-daemon at mcs.anl.gov  Mon Nov  8 19:59:53 2010
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Mon,  8 Nov 2010 19:59:53 -0600 (CST)
Subject: [Swift-devel] [Bug 75] Run the SIDGrid workflow through Falkon
In-Reply-To: <bug-75-21@http.bugzilla.mcs.anl.gov/swift/>
References: <bug-75-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20101109015953.BA95D2CBB0@wind.mcs.anl.gov>

https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=75


skenny <skenny at uchicago.edu> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
                 CC|                            |skenny at uchicago.edu
         Resolution|                            |WONTFIX


--- Comment #1 from skenny <skenny at uchicago.edu>  2010-11-08 19:59:53 ---
the SIDGrid project was comprised of a portal+processing scripts+grid
resources. i'm not sure which 'tool' this bug was referring to, but those
processing scripts have become part of the CNARI project.

-- 
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.
You are watching the reporter.


From bugzilla-daemon at mcs.anl.gov  Mon Nov  8 22:35:06 2010
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Mon,  8 Nov 2010 22:35:06 -0600 (CST)
Subject: [Swift-devel] [Bug 69] Experiment Management Frontend
In-Reply-To: <bug-69-21@http.bugzilla.mcs.anl.gov/swift/>
References: <bug-69-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20101109043506.34F032B8FF@wind.mcs.anl.gov>

https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=69


skenny <skenny at uchicago.edu> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
                 CC|                            |skenny at uchicago.edu
         Resolution|                            |WONTFIX


--- Comment #2 from skenny <skenny at uchicago.edu>  2010-11-08 22:35:06 ---
i *think* this is referring to some iteration of the old SIDGrid portal. the
link provided does not work so i'm not entirely certain what this refers to.
i'm closing the bug...please feel free to re-open if you can provide more
information and you believe this to be a valid swift bug.

-- 
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.
You are watching the reporter.


From bugzilla-daemon at mcs.anl.gov  Mon Nov  8 22:47:31 2010
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Mon,  8 Nov 2010 22:47:31 -0600 (CST)
Subject: [Swift-devel] [Bug 70] LEB problem: Worker vs Investor occupation
	selection
In-Reply-To: <bug-70-21@http.bugzilla.mcs.anl.gov/swift/>
References: <bug-70-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20101109044731.ACB162CABB@wind.mcs.anl.gov>

https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=70


skenny <skenny at uchicago.edu> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
                 CC|                            |skenny at uchicago.edu
         Resolution|                            |WONTFIX


--- Comment #3 from skenny <skenny at uchicago.edu>  2010-11-08 22:47:31 ---
this does not appear to be a swift bug/ticket but rather something specific to
a matlab application (?) link is broken. closing ticket, please re-open if more
detail/info can be provided.

-- 
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.
You are watching the reporter.


From bugzilla-daemon at mcs.anl.gov  Tue Nov  9 00:14:50 2010
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Tue,  9 Nov 2010 00:14:50 -0600 (CST)
Subject: [Swift-devel] [Bug 6] Not globally unique temporary file names
In-Reply-To: <bug-6-21@http.bugzilla.mcs.anl.gov/swift/>
References: <bug-6-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20101109061450.1F3112CBC2@wind.mcs.anl.gov>

https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=6


skenny <skenny at uchicago.edu> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|                            |FIXED


--- Comment #5 from skenny <skenny at uchicago.edu>  2010-11-09 00:14:49 ---
(In reply to comment #4)
> (In reply to comment #3)
> > this addresses ian's issue...then if i re-run it the output files produced are
> > given the same names (so they just overwrite the output from the previous run)
> > which, i *think* is what mihael was referring to (?)
> 
> Right. Which is Ian's initial complaint.
> 
> The question is whether you get the same behavior if you were to import that in
> a different script and repeatedly call that program, would it still work? I
> suspect that it now does

correct...it still works when you import the function from another script and
then call it repeatedly. thus it seems to have been resolved.

-- 
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.
You are watching the reporter.


From bugzilla-daemon at mcs.anl.gov  Tue Nov  9 00:56:36 2010
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Tue,  9 Nov 2010 00:56:36 -0600 (CST)
Subject: [Swift-devel] [Bug 82] Request for a centralized installed
	applications catalog
In-Reply-To: <bug-82-21@http.bugzilla.mcs.anl.gov/swift/>
References: <bug-82-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20101109065636.91C352CBD4@wind.mcs.anl.gov>

https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=82


skenny <skenny at uchicago.edu> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
                 CC|                            |skenny at uchicago.edu
         Resolution|                            |WONTFIX


--- Comment #2 from skenny <skenny at uchicago.edu>  2010-11-09 00:56:36 ---
as mentioned by ben, this is a discussion topic more so than a bug/ticket.
therefore, i am closing the bug. however, if others feel it is a valid ticket,
please re-open with further, detailed specification. 

it's worth noting that there has been a general shift amongst users (HNL users
specifically) to having a single shell wrapper executable entered into tc.data
which is used to call all other binaries on a give site.

-- 
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.
You are watching someone on the CC list of the bug.
You are watching the reporter.


From wilde at mcs.anl.gov  Tue Nov  9 18:21:56 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 9 Nov 2010 18:21:56 -0600 (CST)
Subject: [Swift-devel] Hostnames vs BoundContacts in vdl-int.k
In-Reply-To: <1312386496.38446.1289347608250.JavaMail.root@zimbra.anl.gov>
Message-ID: <1632030314.38521.1289348516452.JavaMail.root@zimbra.anl.gov>

Im trying to extend Justin's initial cut of "external" CDM file types to connect to Globus Online.

The current code in trunk only handles input. Im trying to hook it into doStageOut as well.

Im making progress but stuck at the moment on passing hosts from vdl-int.k into the CDM external java functions.

Specifically I'm getting a string as a hostname argument to the cdm:external() element, where its expecting a "BoundContact". It pulls its args off the Karajan stack as follows:
---
    public void cdm_external(VariableStack stack)
    throws ExecutionException
    {
        String provider = (String) PA_PROVIDER.getValue(stack);
        String srchost  = (String) PA_SRCHOST.getValue(stack);
        String srcfile  = (String) PA_SRCFILE.getValue(stack);
        String srcdir   = (String) PA_SRCDIR.getValue(stack);
        BoundContact bc = (BoundContact) PA_DESTHOST.getValue(stack);
        String destdir  = (String) PA_DESTDIR.getValue(stack);
---

Is it the case that within vdl-int.k some host variables are simple strings (site names) whereas others are structured objects representing the site?

I'm having difficulty tracing this and would appreciate any guidance you can offer.

Thanks,

Mike


-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory


From wilde at mcs.anl.gov  Tue Nov  9 22:53:12 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 9 Nov 2010 22:53:12 -0600 (CST)
Subject: [Swift-devel] Re: Hostnames vs BoundContacts in vdl-int.k
In-Reply-To: <1632030314.38521.1289348516452.JavaMail.root@zimbra.anl.gov>
Message-ID: <1356016262.39112.1289364792649.JavaMail.root@zimbra.anl.gov>

OK, I think I get it. The "host" argument to doStagein and doStageOut is a BoundContact (ie, a "site" or pool entry).  The dhost and srchost parameters that those elements pass down to the lower level elements are computed from filename arguments and are simple strings.

- Mike

----- Original Message -----
> Im trying to extend Justin's initial cut of "external" CDM file types
> to connect to Globus Online.
> 
> The current code in trunk only handles input. Im trying to hook it
> into doStageOut as well.
> 
> Im making progress but stuck at the moment on passing hosts from
> vdl-int.k into the CDM external java functions.
> 
> Specifically I'm getting a string as a hostname argument to the
> cdm:external() element, where its expecting a "BoundContact". It pulls
> its args off the Karajan stack as follows:
> ---
> public void cdm_external(VariableStack stack)
> throws ExecutionException
> {
> String provider = (String) PA_PROVIDER.getValue(stack);
> String srchost = (String) PA_SRCHOST.getValue(stack);
> String srcfile = (String) PA_SRCFILE.getValue(stack);
> String srcdir = (String) PA_SRCDIR.getValue(stack);
> BoundContact bc = (BoundContact) PA_DESTHOST.getValue(stack);
> String destdir = (String) PA_DESTDIR.getValue(stack);
> ---
> 
> Is it the case that within vdl-int.k some host variables are simple
> strings (site names) whereas others are structured objects
> representing the site?
> 
> I'm having difficulty tracing this and would appreciate any guidance
> you can offer.
> 
> Thanks,
> 
> Mike
> 
> 
> --
> Michael Wilde
> Computation Institute, University of Chicago
> Mathematics and Computer Science Division
> Argonne National Laboratory

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory


From wilde at mcs.anl.gov  Wed Nov 10 14:35:00 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 10 Nov 2010 14:35:00 -0600 (CST)
Subject: [Swift-devel] File names in gridftp provider seem to need an extra
	leading /
Message-ID: <267320207.42737.1289421300954.JavaMail.root@zimbra.anl.gov>

Both when I map files to a physical name starting with gsiftp://, and when I copy files using the swift version of globus-url-copy, I seem to need an extra "/" at the start of the file's pathname.

Here's an example of the issue from within a .swift script:

login1$ swift -config cf -sites.file sites.xml -tc.file tc.data gcat.swift
Swift svn swift-r3702 (swift modified locally) cog-r2924 (cog modified locally)

RunID: 20101110-1426-uouuvdf1
Progress:
Failed to transfer wrapper log from gcat-20101110-1426-uouuvdf1/info/g on localhost
Execution failed:
        Exception in cp:
Arguments: [etc/group, home/wilde/godata/gridoutput.txt]
Host: localhost
Directory: gcat-20101110-1426-uouuvdf1/jobs/g/cp-g3yt5f1k
stderr.txt: 

stdout.txt: 

----

Caused by:
        org.globus.cog.abstraction.impl.file.FileResourceException: Failed to retrieve file information about etc/group
                                ^^^^^^^^^ 
Caused by:
        Server refused performing the request. Custom message: Server refused MLST command (error code 1) [Nested exception message:  Custom message: Unexpected reply: 500-Command failed : globus_gridftp_server_file.c:globus_l_gfs_file_stat:389:
500-System error in stat: No such file or directory
500-A system call failed: No such file or directory
500 End.]
login1$ cat gcat.swift

type file;

app (file o) copy (file i)
{
  cp @i @o;
}

file f1<"gsiftp://pf-grid.unl.edu/etc/group">;
file f2<"gsiftp://gridftp.pads.ci.uchicago.edu/home/wilde/godata/gridoutput.txt">;
f2 = copy(f1);
login1$ 

When I put 2 slashes after the hostname in the URIs above, it works.

A similar issue occurs using Swift globus-url-copy, using the file:// protocol.  Rather then the usual 3 slashes after file:, I need *4*. With 3, it looks (in my test) for etc/group instead of /etc/group. With 4 it works.  With 2, it drops of etc entirely and looks for the file "group".

Are both of these the normal/expected behavior from the Swift gridftp code, or is this an error?

- Mike

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory


From wilde at mcs.anl.gov  Wed Nov 10 14:53:46 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 10 Nov 2010 14:53:46 -0600 (CST)
Subject: [Swift-devel] Time/hires.pm used by coaster worker.pl is not
 available on Ranger compute nodes
In-Reply-To: <485388925.42818.1289421889393.JavaMail.root@zimbra.anl.gov>
Message-ID: <1124957525.42929.1289422426208.JavaMail.root@zimbra.anl.gov>

Glen Hocky observed this in recent runs. It makes worker.pl fail to start.

I verified that this module is available on the login nodes but not the worker nodes.

For now, worker.pl works OK if you just comment out the line:

# use Time::HiRes qw(time);

The error you get on Ranger is below.

Sarah, you may want to watch for this if you run on Ranger.

- Mike

Can't locate Time/HiRes.pm in @INC (@INC contains: /usr/lib64/perl5/5.8.5/x86_64-linux-thread-multi /usr/lib/perl5/5.8.5 /usr/lib64/perl5/site_perl/5.8.5/x86_64-linux-thread-multi /usr/lib64/perl5/site_perl/5.8.4/x86_64-linux-thread-multi /usr/lib64/perl5/site_perl/5.8.3/x86_64-linux-thread-multi /usr/lib64/perl5/site_perl/5.8.2/x86_64-linux-thread-multi /usr/lib64/perl5/site_perl/5.8.1/x86_64-linux-thread-multi /usr/lib64/perl5/site_perl/5.8.0/x86_64-linux-thread-multi /usr/lib/perl5/site_perl/5.8.5 /usr/lib/perl5/site_perl/5.8.4 /usr/lib/perl5/site_perl/5.8.3 /usr/lib/perl5/site_perl/5.8.2 /usr/lib/perl5/site_perl/5.8.1 /usr/lib/perl5/site_perl/5.8.0 /usr/lib/perl5/site_perl /usr/lib64/perl5/vendor_perl/5.8.5/x86_64-linux-thread-multi /usr/lib64/perl5/vendor_perl/5.8.4/x86_64-linux-thread-multi /usr/lib64/perl5/vendor_perl/5.8.3/x86_64-linux-thread-multi /usr/lib64/perl5/vendor_perl/5.8.2/x86_64-linux-thread-multi /usr/lib64/perl5/vendor_perl/5.8.1/x86_64-linux-thread-multi /usr/lib64/perl5/vendor_perl/5.8.0/x86_64-linux-thread-multi /usr/lib/perl5/vendor_perl/5.8.5 /usr/lib/perl5/vendor_perl/5.8.4 /usr/lib/perl5/vendor_perl/5.8.3 /usr/lib/perl5/vendor_perl/5.8.2 /usr/lib/perl5/vendor_perl/5.8.1 /usr/lib/perl5/vendor_perl/5.8.0 /usr/lib/perl5/vendor_perl .) at ./worker.pl line 17.
BEGIN failed--compilation aborted at ./worker.pl line 17.
i115-206$ fg

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory


From hategan at mcs.anl.gov  Wed Nov 10 15:55:01 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 10 Nov 2010 13:55:01 -0800
Subject: [Swift-devel] Re: Hostnames vs BoundContacts in vdl-int.k
In-Reply-To: <1356016262.39112.1289364792649.JavaMail.root@zimbra.anl.gov>
References: <1356016262.39112.1289364792649.JavaMail.root@zimbra.anl.gov>
Message-ID: <1289426101.24457.2.camel@blabla2.none>

On Tue, 2010-11-09 at 22:53 -0600, Michael Wilde wrote:
> OK, I think I get it. The "host" argument to doStagein and doStageOut
> is a BoundContact (ie, a "site" or pool entry).  The dhost and srchost
> parameters that those elements pass down to the lower level elements
> are computed from filename arguments and are simple strings.

Right. I think transfer() accepts both a host object and a
hostname[:port] for the srchost/desthost arguments.

One of them is passed as a host object while the other is passed as a
string.

Mihael

> 
> - Mike
> 
> ----- Original Message -----
> > Im trying to extend Justin's initial cut of "external" CDM file types
> > to connect to Globus Online.
> > 
> > The current code in trunk only handles input. Im trying to hook it
> > into doStageOut as well.
> > 
> > Im making progress but stuck at the moment on passing hosts from
> > vdl-int.k into the CDM external java functions.
> > 
> > Specifically I'm getting a string as a hostname argument to the
> > cdm:external() element, where its expecting a "BoundContact". It pulls
> > its args off the Karajan stack as follows:
> > ---
> > public void cdm_external(VariableStack stack)
> > throws ExecutionException
> > {
> > String provider = (String) PA_PROVIDER.getValue(stack);
> > String srchost = (String) PA_SRCHOST.getValue(stack);
> > String srcfile = (String) PA_SRCFILE.getValue(stack);
> > String srcdir = (String) PA_SRCDIR.getValue(stack);
> > BoundContact bc = (BoundContact) PA_DESTHOST.getValue(stack);
> > String destdir = (String) PA_DESTDIR.getValue(stack);
> > ---
> > 
> > Is it the case that within vdl-int.k some host variables are simple
> > strings (site names) whereas others are structured objects
> > representing the site?
> > 
> > I'm having difficulty tracing this and would appreciate any guidance
> > you can offer.
> > 
> > Thanks,
> > 
> > Mike
> > 
> > 
> > --
> > Michael Wilde
> > Computation Institute, University of Chicago
> > Mathematics and Computer Science Division
> > Argonne National Laboratory
> 


From hategan at mcs.anl.gov  Wed Nov 10 15:59:04 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 10 Nov 2010 13:59:04 -0800
Subject: [Swift-devel] Re: File names in gridftp provider seem to need an
	extra leading /
In-Reply-To: <267320207.42737.1289421300954.JavaMail.root@zimbra.anl.gov>
References: <267320207.42737.1289421300954.JavaMail.root@zimbra.anl.gov>
Message-ID: <1289426344.24457.6.camel@blabla2.none>

On Wed, 2010-11-10 at 14:35 -0600, Michael Wilde wrote:
> Both when I map files to a physical name starting with gsiftp://, and
> when I copy files using the swift version of globus-url-copy, I seem
> to need an extra "/" at the start of the file's pathname.

Absolute paths need an extra "/". The first one (i.e. the one between
the host name and the path name) is considered a separator and not
counted as part of the path name.

Mihael

> 
> Here's an example of the issue from within a .swift script:
> 
> login1$ swift -config cf -sites.file sites.xml -tc.file tc.data gcat.swift
> Swift svn swift-r3702 (swift modified locally) cog-r2924 (cog modified locally)
> 
> RunID: 20101110-1426-uouuvdf1
> Progress:
> Failed to transfer wrapper log from gcat-20101110-1426-uouuvdf1/info/g on localhost
> Execution failed:
>         Exception in cp:
> Arguments: [etc/group, home/wilde/godata/gridoutput.txt]
> Host: localhost
> Directory: gcat-20101110-1426-uouuvdf1/jobs/g/cp-g3yt5f1k
> stderr.txt: 
> 
> stdout.txt: 
> 
> ----
> 
> Caused by:
>         org.globus.cog.abstraction.impl.file.FileResourceException: Failed to retrieve file information about etc/group
>                                 ^^^^^^^^^ 
> Caused by:
>         Server refused performing the request. Custom message: Server refused MLST command (error code 1) [Nested exception message:  Custom message: Unexpected reply: 500-Command failed : globus_gridftp_server_file.c:globus_l_gfs_file_stat:389:
> 500-System error in stat: No such file or directory
> 500-A system call failed: No such file or directory
> 500 End.]
> login1$ cat gcat.swift
> 
> type file;
> 
> app (file o) copy (file i)
> {
>   cp @i @o;
> }
> 
> file f1<"gsiftp://pf-grid.unl.edu/etc/group">;
> file f2<"gsiftp://gridftp.pads.ci.uchicago.edu/home/wilde/godata/gridoutput.txt">;
> f2 = copy(f1);
> login1$ 
> 
> When I put 2 slashes after the hostname in the URIs above, it works.
> 
> A similar issue occurs using Swift globus-url-copy, using the file:// protocol.  Rather then the usual 3 slashes after file:, I need *4*. With 3, it looks (in my test) for etc/group instead of /etc/group. With 4 it works.  With 2, it drops of etc entirely and looks for the file "group".
> 
> Are both of these the normal/expected behavior from the Swift gridftp code, or is this an error?
> 
> - Mike
> 


From hategan at mcs.anl.gov  Wed Nov 10 16:06:10 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 10 Nov 2010 14:06:10 -0800
Subject: [Swift-devel] Time/hires.pm used by coaster worker.pl is not
	available on Ranger compute nodes
In-Reply-To: <1124957525.42929.1289422426208.JavaMail.root@zimbra.anl.gov>
References: <1124957525.42929.1289422426208.JavaMail.root@zimbra.anl.gov>
Message-ID: <1289426770.24457.8.camel@blabla2.none>

That, I think, is only used for the timestamps in the log file.
Otherwise the granularity of localtime() is seconds (and not very useful
for timing the worker script).

I'm curious whether there is a way to "only include it if it's there".
Essentially it re-defines the standard localtime(), so no other changes
would be needed.

Mihael

On Wed, 2010-11-10 at 14:53 -0600, Michael Wilde wrote:
> Glen Hocky observed this in recent runs. It makes worker.pl fail to start.
> 
> I verified that this module is available on the login nodes but not the worker nodes.
> 
> For now, worker.pl works OK if you just comment out the line:
> 
> # use Time::HiRes qw(time);
> 
> The error you get on Ranger is below.
> 
> Sarah, you may want to watch for this if you run on Ranger.
> 
> - Mike
> 
> Can't locate Time/HiRes.pm in @INC (@INC contains: /usr/lib64/perl5/5.8.5/x86_64-linux-thread-multi /usr/lib/perl5/5.8.5 /usr/lib64/perl5/site_perl/5.8.5/x86_64-linux-thread-multi /usr/lib64/perl5/site_perl/5.8.4/x86_64-linux-thread-multi /usr/lib64/perl5/site_perl/5.8.3/x86_64-linux-thread-multi /usr/lib64/perl5/site_perl/5.8.2/x86_64-linux-thread-multi /usr/lib64/perl5/site_perl/5.8.1/x86_64-linux-thread-multi /usr/lib64/perl5/site_perl/5.8.0/x86_64-linux-thread-multi /usr/lib/perl5/site_perl/5.8.5 /usr/lib/perl5/site_perl/5.8.4 /usr/lib/perl5/site_perl/5.8.3 /usr/lib/perl5/site_perl/5.8.2 /usr/lib/perl5/site_perl/5.8.1 /usr/lib/perl5/site_perl/5.8.0 /usr/lib/perl5/site_perl /usr/lib64/perl5/vendor_perl/5.8.5/x86_64-linux-thread-multi /usr/lib64/perl5/vendor_perl/5.8.4/x86_64-linux-thread-multi /usr/lib64/perl5/vendor_perl/5.8.3/x86_64-linux-thread-multi /usr/lib64/perl5/vendor_perl/5.8.2/x86_64-linux-thread-multi /usr/lib64/perl5/vendor_perl/5.8.1/x86_64-linux-thread-multi /usr/lib64/perl5/vendor_perl/5.8.0/x86_64-linux-thread-multi /usr/lib/perl5/vendor_perl/5.8.5 /usr/lib/perl5/vendor_perl/5.8.4 /usr/lib/perl5/vendor_perl/5.8.3 /usr/lib/perl5/vendor_perl/5.8.2 /usr/lib/perl5/vendor_perl/5.8.1 /usr/lib/perl5/vendor_perl/5.8.0 /usr/lib/perl5/vendor_perl .) at ./worker.pl line 17.
> BEGIN failed--compilation aborted at ./worker.pl line 17.
> i115-206$ fg
> 


From wilde at mcs.anl.gov  Thu Nov 11 09:11:15 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 11 Nov 2010 09:11:15 -0600 (CST)
Subject: [Swift-devel] Problems with provider.staging.pin.swiftfiles
In-Reply-To: <1788797735.45828.1289488198159.JavaMail.root@zimbra.anl.gov>
Message-ID: <1669558788.45842.1289488275177.JavaMail.root@zimbra.anl.gov>

Justin,

When Tom Uram turns on this option and runs a simple test script (a foreach and an app that just collects node info), he gets an error "520" returned in the swift log, as if from the app. I am thinking that the 520 is somehow coming from worker.

This is going from bridled to Eureka worker nodes, with provider staging turned on in proxy mode.

When we set provider.staging.pin.swiftfiles to false, the script runs ok.

We'll need to collect and send logs and a test case, but I wanted to alert you to a potential problem with this feature.

- Mike


-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory


From wozniak at mcs.anl.gov  Thu Nov 11 09:41:43 2010
From: wozniak at mcs.anl.gov (Justin M Wozniak)
Date: Thu, 11 Nov 2010 09:41:43 -0600 (CST)
Subject: [Swift-devel] Re: Problems with provider.staging.pin.swiftfiles
In-Reply-To: <1669558788.45842.1289488275177.JavaMail.root@zimbra.anl.gov>
References: <1669558788.45842.1289488275177.JavaMail.root@zimbra.anl.gov>
Message-ID: <alpine.DEB.2.00.1011110940050.10341@wozniak-desktop.mcs.anl.gov>

Hello
 	Yes, this was broken a few weeks ago- I will try to restore it 
ASAP.  (Cf. swift-devel post from 9/27.)
 	Justin

On Thu, 11 Nov 2010, Michael Wilde wrote:

> Justin,
>
> When Tom Uram turns on this option and runs a simple test script (a 
> foreach and an app that just collects node info), he gets an error "520" 
> returned in the swift log, as if from the app. I am thinking that the 
> 520 is somehow coming from worker.
>
> This is going from bridled to Eureka worker nodes, with provider staging 
> turned on in proxy mode.
>
> When we set provider.staging.pin.swiftfiles to false, the script runs 
> ok.
>
> We'll need to collect and send logs and a test case, but I wanted to 
> alert you to a potential problem with this feature.
>
> - Mike

-- 
Justin M Wozniak


From wilde at mcs.anl.gov  Thu Nov 11 16:28:41 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 11 Nov 2010 16:28:41 -0600 (CST)
Subject: [Swift-devel] Problems with karajan lists
In-Reply-To: <162047948.49532.1289514442117.JavaMail.root@zimbra.anl.gov>
Message-ID: <514919228.49534.1289514521745.JavaMail.root@zimbra.anl.gov>

I'm trying to append to a list multiple times, but when I try to append the second time, I get an error:

import("sys.k")
sequential(
  vec := list()
  append(vec,10)
  print(vec)
  //  append(vec,20)
  //  print(vec)
)

If I uncomment the 2 lines above, I get the error:

login1$ swift foo.k
[10.0]
[10.0, 20.0]
Ex098
org.globus.cog.karajan.workflow.KarajanRuntimeException: Illegal extra argument `[10.0, 20.0]' to kernel:karajan @ foo.k, line: 1
        at org.globus.cog.karajan.arguments.NameBindingVariableArguments.append(NameBindingVariableArguments.java:58)
   
Its almost as if append is not consuming its arguments and Karajan is finding extra stuff on the stack when it exits???

Can you help me understand what Im doing wrong here?

Thanks,

Mike


-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory


From hategan at mcs.anl.gov  Thu Nov 11 16:35:11 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 11 Nov 2010 14:35:11 -0800
Subject: [Swift-devel] Re: Problems with karajan lists
In-Reply-To: <514919228.49534.1289514521745.JavaMail.root@zimbra.anl.gov>
References: <514919228.49534.1289514521745.JavaMail.root@zimbra.anl.gov>
Message-ID: <1289514911.27871.3.camel@blabla2.none>

Append is append(list, items) -> list. So it returns a list.

If you are only interested in the side-effect of append, you can
probably do:
discard(append(list, 10)).

Alternatively you could say:
vec := list()
vec := append(list, 10)

Mihael

On Thu, 2010-11-11 at 16:28 -0600, Michael Wilde wrote:
> I'm trying to append to a list multiple times, but when I try to append the second time, I get an error:
> 
> import("sys.k")
> sequential(
>   vec := list()
>   append(vec,10)
>   print(vec)
>   //  append(vec,20)
>   //  print(vec)
> )
> 
> If I uncomment the 2 lines above, I get the error:
> 
> login1$ swift foo.k
> [10.0]
> [10.0, 20.0]
> Ex098
> org.globus.cog.karajan.workflow.KarajanRuntimeException: Illegal extra argument `[10.0, 20.0]' to kernel:karajan @ foo.k, line: 1
>         at org.globus.cog.karajan.arguments.NameBindingVariableArguments.append(NameBindingVariableArguments.java:58)
>    
> Its almost as if append is not consuming its arguments and Karajan is finding extra stuff on the stack when it exits???
> 
> Can you help me understand what Im doing wrong here?
> 
> Thanks,
> 
> Mike
> 
> 


From bugzilla-daemon at mcs.anl.gov  Fri Nov 12 13:41:56 2010
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Fri, 12 Nov 2010 13:41:56 -0600 (CST)
Subject: [Swift-devel] [Bug 130] submitting to TG NCSA Mercury PBS with PATH
 env profile set causes job to hang
In-Reply-To: <bug-130-21@http.bugzilla.mcs.anl.gov/swift/>
References: <bug-130-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20101112194156.E74B82BF1E@wind.mcs.anl.gov>

https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=130


skenny <skenny at uchicago.edu> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |WONTFIX


--- Comment #2 from skenny <skenny at uchicago.edu>  2010-11-12 13:41:56 ---
ncsa mercury has been deprecated

-- 
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the reporter.


From wilde at mcs.anl.gov  Sat Nov 13 18:32:15 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Sat, 13 Nov 2010 18:32:15 -0600 (CST)
Subject: [Swift-devel] Case handling bug
Message-ID: <1329526766.57544.1289694735438.JavaMail.root@zimbra.anl.gov>

Here's a cute one I just stumbled on - the 2-line script:

int n=10;
int N = 12;

gives this error:

login1$ swift casebug.swift
Swift svn swift-r3702 (swift modified locally) cog-r2924 (cog modified locally)

RunID: 20101113-1830-nkmmf1b5
Progress:
Execution failed:
        java.lang.IllegalArgumentException: N is closed with a value of 12.0
login1$ 


-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory


From wilde at mcs.anl.gov  Sun Nov 14 12:45:19 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Sun, 14 Nov 2010 12:45:19 -0600 (CST)
Subject: [Swift-devel] Concurrent dostagein calls limited to 8 ?
Message-ID: <15696286.58351.1289760319639.JavaMail.root@zimbra.anl.gov>

Im using Justin's new external CDM policy to talk to globusonline.org.

Ive got the two IO throttles set to 30:

throttle.transfers=30
throttle.file.operations=30

...but I still see exactly 8 concurrent dostagein calls (as seen by calls to my external IO handler external.sh)

Do you know what is limiting this concurrency to 8? I'd like to open it up much wider.

My swift command and relevant files are below.
Logs are in CI: /home/wilde/swift/demo/modis

Thanks,

Mike


login1$ cat rundemo.go.local.sh
rm -f external.*.log
swift -config cf -tc.file tc.local -sites.file sites.local.xml -cdm.file fs.ftponly modis.go.swift -nfiles=30
# -location= -n= -site==

login1$ cat cf
wrapperlog.always.transfer=true
sitedir.keep=true
execution.retries=0
lazy.errors=false
status.mode=provider
use.provider.staging=false
provider.staging.pin.swiftfiles=false
throttle.transfers=30
throttle.file.operations=30

login1$ cat sites.local.xml
<config>

  <pool handle="localhost">
    <execution provider="local" url="none"/>
    <profile namespace="karajan" key="jobThrottle">.63</profile>
    <profile namespace="karajan" key="initialScore">10000</profile>
    <filesystem provider="local"/>
    <workdirectory>/home/wilde/swiftwork</workdirectory>
  </pool>

</config>
login1$ 

login1$ cat fs.ftponly 
rule .*gsiftp://.* EXTERNAL /home/wilde/swift/lab/go/external.sh
login1$ 


-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory


From hategan at mcs.anl.gov  Sun Nov 14 15:38:29 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sun, 14 Nov 2010 13:38:29 -0800
Subject: [Swift-devel] Re: Concurrent dostagein calls limited to 8 ?
In-Reply-To: <15696286.58351.1289760319639.JavaMail.root@zimbra.anl.gov>
References: <15696286.58351.1289760319639.JavaMail.root@zimbra.anl.gov>
Message-ID: <1289770709.10824.0.camel@blabla2.none>

My bad. Try cog trunk r2932.

Mihael

On Sun, 2010-11-14 at 12:45 -0600, Michael Wilde wrote:
> Im using Justin's new external CDM policy to talk to globusonline.org.
> 
> Ive got the two IO throttles set to 30:
> 
> throttle.transfers=30
> throttle.file.operations=30
> 
> ...but I still see exactly 8 concurrent dostagein calls (as seen by calls to my external IO handler external.sh)
> 
> Do you know what is limiting this concurrency to 8? I'd like to open it up much wider.
> 
> My swift command and relevant files are below.
> Logs are in CI: /home/wilde/swift/demo/modis
> 
> Thanks,
> 
> Mike
> 
> 
> login1$ cat rundemo.go.local.sh
> rm -f external.*.log
> swift -config cf -tc.file tc.local -sites.file sites.local.xml -cdm.file fs.ftponly modis.go.swift -nfiles=30
> # -location= -n= -site==
> 
> login1$ cat cf
> wrapperlog.always.transfer=true
> sitedir.keep=true
> execution.retries=0
> lazy.errors=false
> status.mode=provider
> use.provider.staging=false
> provider.staging.pin.swiftfiles=false
> throttle.transfers=30
> throttle.file.operations=30
> 
> login1$ cat sites.local.xml
> <config>
> 
>   <pool handle="localhost">
>     <execution provider="local" url="none"/>
>     <profile namespace="karajan" key="jobThrottle">.63</profile>
>     <profile namespace="karajan" key="initialScore">10000</profile>
>     <filesystem provider="local"/>
>     <workdirectory>/home/wilde/swiftwork</workdirectory>
>   </pool>
> 
> </config>
> login1$ 
> 
> login1$ cat fs.ftponly 
> rule .*gsiftp://.* EXTERNAL /home/wilde/swift/lab/go/external.sh
> login1$ 
> 
> 


From wilde at mcs.anl.gov  Sun Nov 14 16:12:39 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Sun, 14 Nov 2010 16:12:39 -0600 (CST)
Subject: [Swift-devel] Re: Concurrent dostagein calls limited to 8 ?
In-Reply-To: <1289770709.10824.0.camel@blabla2.none>
Message-ID: <2026767406.58546.1289772759569.JavaMail.root@zimbra.anl.gov>

As far as I can tell its still showing the same behavior - only 8 stageins are active:

wilde    17568 17564 17564  4873  0 16:05 pts/8    00:00:00             /bin/sh /scratch/local/wilde/swift/src/trunk/cog/modules/swift
wilde    17632 17568 17564  4873  5 16:05 pts/8    00:00:09               java -Xmx256M -Djava.endorsed.dirs=/scratch/local/wilde/swif
wilde    19193 17632 17564  4873  0 16:08 pts/8    00:00:00                 /bin/bash /home/wilde/swift/lab/go/external.sh getlanduse-
wilde    19357 19193 17564  4873  0 16:08 pts/8    00:00:00                   sleep 3
wilde    19198 17632 17564  4873  0 16:08 pts/8    00:00:00                 /bin/bash /home/wilde/swift/lab/go/external.sh getlanduse-
wilde    19358 19198 17564  4873  0 16:08 pts/8    00:00:00                   sleep 3
wilde    19205 17632 17564  4873  0 16:08 pts/8    00:00:00                 /bin/bash /home/wilde/swift/lab/go/external.sh getlanduse-
wilde    19359 19205 17564  4873  0 16:08 pts/8    00:00:00                   sleep 3
wilde    19210 17632 17564  4873  0 16:08 pts/8    00:00:00                 /bin/bash /home/wilde/swift/lab/go/external.sh getlanduse-
wilde    19361 19210 17564  4873  0 16:08 pts/8    00:00:00                   sleep 3
wilde    19214 17632 17564  4873  0 16:08 pts/8    00:00:00                 /bin/bash /home/wilde/swift/lab/go/external.sh getlanduse-
wilde    19362 19214 17564  4873  0 16:08 pts/8    00:00:00                   sleep 3
wilde    19218 17632 17564  4873  0 16:08 pts/8    00:00:00                 /bin/bash /home/wilde/swift/lab/go/external.sh getlanduse-
wilde    19363 19218 17564  4873  0 16:08 pts/8    00:00:00                   sleep 3
wilde    19224 17632 17564  4873  0 16:08 pts/8    00:00:00                 /bin/bash /home/wilde/swift/lab/go/external.sh getlanduse-
wilde    19364 19224 17564  4873  0 16:08 pts/8    00:00:00                   sleep 3
wilde    19240 17632 17564  4873  0 16:08 pts/8    00:00:00                 /bin/bash /home/wilde/swift/lab/go/external.sh getlanduse-
wilde    19365 19240 17564  4873  0 16:08 pts/8    00:00:00                   sleep 3


- Mike


----- Original Message -----
> My bad. Try cog trunk r2932.
> 
> Mihael
> 
> On Sun, 2010-11-14 at 12:45 -0600, Michael Wilde wrote:
> > Im using Justin's new external CDM policy to talk to
> > globusonline.org.
> >
> > Ive got the two IO throttles set to 30:
> >
> > throttle.transfers=30
> > throttle.file.operations=30
> >
> > ...but I still see exactly 8 concurrent dostagein calls (as seen by
> > calls to my external IO handler external.sh)
> >
> > Do you know what is limiting this concurrency to 8? I'd like to open
> > it up much wider.
> >
> > My swift command and relevant files are below.
> > Logs are in CI: /home/wilde/swift/demo/modis
> >
> > Thanks,
> >
> > Mike
> >
> >
> > login1$ cat rundemo.go.local.sh
> > rm -f external.*.log
> > swift -config cf -tc.file tc.local -sites.file sites.local.xml
> > -cdm.file fs.ftponly modis.go.swift -nfiles=30
> > # -location= -n= -site==
> >
> > login1$ cat cf
> > wrapperlog.always.transfer=true
> > sitedir.keep=true
> > execution.retries=0
> > lazy.errors=false
> > status.mode=provider
> > use.provider.staging=false
> > provider.staging.pin.swiftfiles=false
> > throttle.transfers=30
> > throttle.file.operations=30
> >
> > login1$ cat sites.local.xml
> > <config>
> >
> >   <pool handle="localhost">
> >     <execution provider="local" url="none"/>
> >     <profile namespace="karajan" key="jobThrottle">.63</profile>
> >     <profile namespace="karajan" key="initialScore">10000</profile>
> >     <filesystem provider="local"/>
> >     <workdirectory>/home/wilde/swiftwork</workdirectory>
> >   </pool>
> >
> > </config>
> > login1$
> >
> > login1$ cat fs.ftponly
> > rule .*gsiftp://.* EXTERNAL /home/wilde/swift/lab/go/external.sh
> > login1$
> >
> >

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory


From hategan at mcs.anl.gov  Sun Nov 14 16:21:56 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sun, 14 Nov 2010 14:21:56 -0800
Subject: [Swift-devel] Re: Concurrent dostagein calls limited to 8 ?
In-Reply-To: <2026767406.58546.1289772759569.JavaMail.root@zimbra.anl.gov>
References: <2026767406.58546.1289772759569.JavaMail.root@zimbra.anl.gov>
Message-ID: <1289773316.10896.10.camel@blabla2.none>

I need more info.

Some useful questions to answer:
- how many stage-ins are queued to the scheduler?
- how quick are the stage-ins?
- who (i.e. what code) starts the external.sh script and how?

In theory, if a sufficient number of tasks are queued then the only
limit (after my last commit) should be the throttles and whatever
inherent limits are in the mechanism used to implement the stage-ins.

Mihael

On Sun, 2010-11-14 at 16:12 -0600, Michael Wilde wrote:
> As far as I can tell its still showing the same behavior - only 8 stageins are active:
> 
> wilde    17568 17564 17564  4873  0 16:05 pts/8    00:00:00             /bin/sh /scratch/local/wilde/swift/src/trunk/cog/modules/swift
> wilde    17632 17568 17564  4873  5 16:05 pts/8    00:00:09               java -Xmx256M -Djava.endorsed.dirs=/scratch/local/wilde/swif
> wilde    19193 17632 17564  4873  0 16:08 pts/8    00:00:00                 /bin/bash /home/wilde/swift/lab/go/external.sh getlanduse-
> wilde    19357 19193 17564  4873  0 16:08 pts/8    00:00:00                   sleep 3
> wilde    19198 17632 17564  4873  0 16:08 pts/8    00:00:00                 /bin/bash /home/wilde/swift/lab/go/external.sh getlanduse-
> wilde    19358 19198 17564  4873  0 16:08 pts/8    00:00:00                   sleep 3
> wilde    19205 17632 17564  4873  0 16:08 pts/8    00:00:00                 /bin/bash /home/wilde/swift/lab/go/external.sh getlanduse-
> wilde    19359 19205 17564  4873  0 16:08 pts/8    00:00:00                   sleep 3
> wilde    19210 17632 17564  4873  0 16:08 pts/8    00:00:00                 /bin/bash /home/wilde/swift/lab/go/external.sh getlanduse-
> wilde    19361 19210 17564  4873  0 16:08 pts/8    00:00:00                   sleep 3
> wilde    19214 17632 17564  4873  0 16:08 pts/8    00:00:00                 /bin/bash /home/wilde/swift/lab/go/external.sh getlanduse-
> wilde    19362 19214 17564  4873  0 16:08 pts/8    00:00:00                   sleep 3
> wilde    19218 17632 17564  4873  0 16:08 pts/8    00:00:00                 /bin/bash /home/wilde/swift/lab/go/external.sh getlanduse-
> wilde    19363 19218 17564  4873  0 16:08 pts/8    00:00:00                   sleep 3
> wilde    19224 17632 17564  4873  0 16:08 pts/8    00:00:00                 /bin/bash /home/wilde/swift/lab/go/external.sh getlanduse-
> wilde    19364 19224 17564  4873  0 16:08 pts/8    00:00:00                   sleep 3
> wilde    19240 17632 17564  4873  0 16:08 pts/8    00:00:00                 /bin/bash /home/wilde/swift/lab/go/external.sh getlanduse-
> wilde    19365 19240 17564  4873  0 16:08 pts/8    00:00:00                   sleep 3
> 
> 
> - Mike
> 
> 
> ----- Original Message -----
> > My bad. Try cog trunk r2932.
> > 
> > Mihael
> > 
> > On Sun, 2010-11-14 at 12:45 -0600, Michael Wilde wrote:
> > > Im using Justin's new external CDM policy to talk to
> > > globusonline.org.
> > >
> > > Ive got the two IO throttles set to 30:
> > >
> > > throttle.transfers=30
> > > throttle.file.operations=30
> > >
> > > ...but I still see exactly 8 concurrent dostagein calls (as seen by
> > > calls to my external IO handler external.sh)
> > >
> > > Do you know what is limiting this concurrency to 8? I'd like to open
> > > it up much wider.
> > >
> > > My swift command and relevant files are below.
> > > Logs are in CI: /home/wilde/swift/demo/modis
> > >
> > > Thanks,
> > >
> > > Mike
> > >
> > >
> > > login1$ cat rundemo.go.local.sh
> > > rm -f external.*.log
> > > swift -config cf -tc.file tc.local -sites.file sites.local.xml
> > > -cdm.file fs.ftponly modis.go.swift -nfiles=30
> > > # -location= -n= -site==
> > >
> > > login1$ cat cf
> > > wrapperlog.always.transfer=true
> > > sitedir.keep=true
> > > execution.retries=0
> > > lazy.errors=false
> > > status.mode=provider
> > > use.provider.staging=false
> > > provider.staging.pin.swiftfiles=false
> > > throttle.transfers=30
> > > throttle.file.operations=30
> > >
> > > login1$ cat sites.local.xml
> > > <config>
> > >
> > >   <pool handle="localhost">
> > >     <execution provider="local" url="none"/>
> > >     <profile namespace="karajan" key="jobThrottle">.63</profile>
> > >     <profile namespace="karajan" key="initialScore">10000</profile>
> > >     <filesystem provider="local"/>
> > >     <workdirectory>/home/wilde/swiftwork</workdirectory>
> > >   </pool>
> > >
> > > </config>
> > > login1$
> > >
> > > login1$ cat fs.ftponly
> > > rule .*gsiftp://.* EXTERNAL /home/wilde/swift/lab/go/external.sh
> > > login1$
> > >
> > >
> 


From hategan at mcs.anl.gov  Sun Nov 14 17:45:36 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sun, 14 Nov 2010 15:45:36 -0800
Subject: [Swift-devel] Concurrent dostagein calls limited to 8 ?
In-Reply-To: <AANLkTimns4pt47rOSAQEvmb2iRqieynV0S1FLSZg23cF@mail.gmail.com>
References: <2026767406.58546.1289772759569.JavaMail.root@zimbra.anl.gov>
	<1289773316.10896.10.camel@blabla2.none>
	<AANLkTimns4pt47rOSAQEvmb2iRqieynV0S1FLSZg23cF@mail.gmail.com>
Message-ID: <1289778336.11636.2.camel@blabla2.none>

On Sun, 2010-11-14 at 17:23 -0600, Michael Wilde wrote:
> Some answers from my handheld:
> - foreach loop has 317 files so ample parallelism

I would have assumed it's > 8. But I suspect, given one of the answers
below, that it does not matter.

> - throttle in sites entry set to .63 to run 64 jobs at once
> - the "active" external.sh is called from end of dostagein and
> dostageout in vdl-int.k (after all files for the job were put in a
> list by prior calls to externa.sh from within those functions

How is this call actually implemented. I.e. can you post the respective
snippet of vdl-int?

> - the actual staging op by globusonline take 30-60 seconds, sometimes
> more. I batch them up.


From wilde at mcs.anl.gov  Sun Nov 14 22:42:14 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Sun, 14 Nov 2010 22:42:14 -0600 (CST)
Subject: [Swift-devel] SGE provider error parsing qstat output
In-Reply-To: <AANLkTi=9qX_cn=CM3gZTSNvtOkhc2=GGSnfjcEr9RMs8@mail.gmail.com>
Message-ID: <1001141783.59115.1289796134679.JavaMail.root@zimbra.anl.gov>

was: Re: Provenance DB Diagrams

Hi David,

The first bug seems very familiar, and I thought Mihael fixed it once.

qstat was giving slightly different output between older versions (eg I think sisboombah) and later ones (eg Ranger).

Perhaps this is a different manifestation of similar problems in command output format variations?  Feel free to dig into the code. Do you have access to an SGE system where this works?  (Let me know if you need access to the UC IBI Cluster; also try godzilla or ranger).

Regarding the coaster error: whats happening here is that the PE is not being passed from the coaster pool attributes to the attributes of the SGE jobs that the coaster provider is creating.

Do you have access to Ranger? I have a fix for this there that needs to be tested and integrated into trunk.

Marc Parisien of UChicago IBD is trying to run coasters on the IBI cluster, and getting the same error. If you could find and integrate the fix that would be great.

I attach my mods from Ranger. I think the mods related to "coresPerNode" can be removed as hopefully Mihael's PPN fix addresses them.  Whats needed is just the code that passes PE from the coasters pool to the job spces of the jSGE jobs it creates.

My svn diff is below and modified files are attached.

This was done on the stable branch, but the SGE provider has since been moved to trunk.

You should either get guidance from Mihael on this, or discuss with him if you'd rather he make the fix.

>From svn status:

M      modules/provider-localscheduler/src/org/globus/cog/abstraction/impl/scheduler/sge/SGEExecutor.java
M      modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/Settings.java
M      modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/BlockTask.java

I attach these three files. Just look for the changes that propagate the PE setting.

Ignore the coresPerNode changes.

There were also changes to ensure that the provider starts the right number of workers per node, which should now always be one copy of worker.pl whose parallelism is controlled by workersPerNode.

The PPN setting should ensure that the job gets the expected number of cores allocated, for systems that do node sharing of jobs.

from svn diff: (lots of junk below from my experiments):

login3$ svn diff
Index: modules/provider-localscheduler/src/org/globus/cog/abstraction/impl/scheduler/sge/SGEExecutor.java
===================================================================
--- modules/provider-localscheduler/src/org/globus/cog/abstraction/impl/scheduler/sge/SGEExecutor.java	(revision 2734)
+++ modules/provider-localscheduler/src/org/globus/cog/abstraction/impl/scheduler/sge/SGEExecutor.java	(working copy)
@@ -52,6 +52,10 @@
         writeAttr(attrName, arg, wr, null);
     }
 
+    protected void writeAttrValue(String value, String arg, Writer wr ) throws IOException {
+        wr.write("#$ " + arg + value + '\n');
+    }
+
     protected void writeWallTime(Writer wr) throws IOException {
         Object walltime = getSpec().getAttribute("maxwalltime");
         if (walltime != null) {
@@ -77,10 +81,32 @@
         wr.write("#$ -V\n");
         writeAttr("project", "-A ", wr);
 
-        writeAttr("count", "-pe "
+// FIXME: testing this change: MW
+
+        Object countValue = getSpec().getAttribute("count");
+	int count;
+
+        if (countValue != null)
+            count = Integer.valueOf(String.valueOf(countValue)).intValue();
+	else 
+	    count = 1;
+
+        // FIXME: wpn is only meaningful for coasters; is 1 ok otherwise?
+	// should we flag wpn as error if not coasters?
+
+	Object wpnValue = getAttribute(spec, "workerspernode", "1");
+	int wpn = Integer.valueOf(String.valueOf(wpnValue)).intValue();
+	logger.info("FETCH OF WPN: " + wpn); // FIXME: DB
+
+	count *= wpn;
+	logger.info("FETCH OF PE: " + getAttribute(spec, "pe", "NO pe"));
+	logger.info("FETCH OF CPN: " + getAttribute(spec, "corespernode", "NO cpn"));
+        writeAttrValue(String.valueOf(count), "-pe "
                 + getAttribute(spec, "pe", getSGEProperties().getDefaultPE())
-                + " ", wr, "1");
+                + " ", wr);
 
+// FIXME: END OF MW CHANGE
+
         writeWallTime(wr);
         writeAttr("queue", "-q ", wr);
         if (spec.getStdInput() != null) {
@@ -157,7 +183,8 @@
     
     protected void writeMultiJobPreamble(Writer wr, String exitcodefile)
             throws IOException {
-        wr.write("NODES=`cat $PE_HOSTFILE | awk '{ for(i=0;i<$2;i++){print $1} }'`\n");
+// FIXME:MW        wr.write("NODES=`cat $PE_HOSTFILE | awk '{ for(i=0;i<$2;i++){print $1} }'`\n");
+        wr.write("NODES=`cat $PE_HOSTFILE | awk '{ print $1 }'`\n");
         wr.write("ECF=" + exitcodefile + "\n");
         wr.write("INDEX=0\n");
         wr.write("for NODE in $NODES; do\n");
@@ -188,13 +215,21 @@
         return (Properties) getProperties();
     }
     
-    public static final Pattern JOB_ID_LINE = Pattern.compile("[Yy]our job (\\d+) \\(.*\\) has been submitted");
+     public static final Pattern JOB_ID_LINE = Pattern.compile(".*[Yy]our job (\\d+) \\(.*\\) has been submitted");
+    // public static final Pattern JOB_ID_LINE = Pattern.compile("[Yy]our job (\\d+) \\(.*\\) has been submitted");
+    // public static final Pattern JOB_ID_LINE = Pattern.compile("[Yy]our job \\([0-9]\\+\\) .* has been submitted");
 
     protected String parseSubmitCommandOutput(String out) throws IOException {
         // > your job 2494189 ("t1.sub") has been submitted
         BufferedReader br = new BufferedReader(new CharArrayReader(out.toCharArray()));
         String line = br.readLine();
+	if (logger.isInfoEnabled()) {
+	    logger.info("parseSubmitCommandOutput: out=" + out);
+	}
         while (line != null) {
+	    if (logger.isInfoEnabled()) {
+		logger.info("parseSubmitCommandOutput: line=" + line);
+	    }
             Matcher m = JOB_ID_LINE.matcher(line);
             if (m.matches()) {
                 String id = m.group(1);
Index: modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/Settings.java
===================================================================
--- modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/Settings.java	(revision 2734)
+++ modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/Settings.java	(working copy)
@@ -22,16 +22,17 @@
     public static final Logger logger = Logger.getLogger(Settings.class);
 
     public static final String[] NAMES =
-            new String[] { "slots", "workersPerNode", "nodeGranularity", "allocationStepSize",
+            new String[] { "slots", "workersPerNode", "coresPerNode", "nodeGranularity", "allocationStepSize",
                     "maxNodes", "lowOverallocation", "highOverallocation",
                     "overallocationDecayFactor", "spread", "reserve", "maxtime", "project",
-                    "queue", "remoteMonitorEnabled", "kernelprofile", "alcfbgpnat", "internalHostname" };
+                    "queue", "remoteMonitorEnabled", "kernelprofile", "alcfbgpnat", "internalHostname", "pe" };
 
     /**
      * The maximum number of blocks that can be active at one time
      */
     private int slots = 20;
     private int workersPerNode = 1;
+    private int coresPerNode = 1;
     /**
      * How many nodes to allocate at once
      */
@@ -90,6 +91,8 @@
 
     private String queue;
 
+    private String pe;
+
     private String kernelprofile;
 
     private boolean alcfbgpnat;
@@ -116,6 +119,14 @@
         this.workersPerNode = workersPerNode;
     }
 
+    public int getCoresPerNode() {
+        return coresPerNode;
+    }
+
+    public void setCoresPerNode(int coresPerNode) {
+        this.coresPerNode = coresPerNode;
+    }
+
     public int getNodeGranularity() {
         return nodeGranularity;
     }
@@ -273,6 +284,14 @@
         this.queue = queue;
     }
 
+    public String getPe() {
+        return pe;
+    }
+
+    public void setPe(String pe) {
+        this.pe = pe;
+    }
+
     public boolean getRemoteMonitorEnabled() {
         return remoteMonitorEnabled;
     }
Index: modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/BlockTask.java
===================================================================
--- modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/BlockTask.java	(revision 2734)
+++ modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/BlockTask.java	(working copy)
@@ -40,6 +40,13 @@
         setAttribute(spec, "maxwalltime", WallTime.format((int) block.getWalltime().getSeconds()));
         setAttribute(spec, "queue", settings.getQueue());
         setAttribute(spec, "project", settings.getProject());
+
+	// added - mw:
+        setAttribute(spec, "coresPerNode", String.valueOf(settings.getCoresPerNode()));
+        setAttribute(spec, "workersPerNode", String.valueOf(settings.getWorkersPerNode()));
+        setAttribute(spec, "pe", settings.getPe());
+	// end additions - mw ^^^
+
         int count = block.getWorkerCount() / settings.getWorkersPerNode();
         if (count > 1) {
             setAttribute(spec, "jobType", "multiple");
login3$ 

--


- Mike


----- Original Message -----
> Hello Mike,
> 
> I am working on adding more tests to the automated test suite. I am
> running into some issues when trying to run swift with SGE on
> sisboombah. The tests I wrote are based on the example configurations
> you sent out to the list earlier. Here is what is happening.
> 
> I am running the SGE local test with the following config file:
> 
> <config>
> <pool handle="sge-local">
> <execution provider="sge" url="none" />
> <profile namespace="globus" key="pe">threaded</profile>
> <profile key="jobThrottle" namespace="karajan">.49</profile>
> <profile namespace="karajan" key="initialScore">10000</profile>
> <filesystem provider="local" url="none" />
> <workdirectory>/home/dk0966/swiftwork</workdirectory>
> </pool>
> </config>
> 
> The error I am seeing is:
> Caused by:
> java.io.IOException: Failed to parse qstat line: 623018 0.55500
> SteadyShea xinliang r 11/13/2010 14:09:05 all.q at node1
> 1
> 
> The next test I try on this machine is SGE-coasters with the following
> config:
> 
> <config>
> <pool handle="sge-coasters">
> <execution provider="coaster" url="none" jobmanager="local:sge"/>
> <profile namespace="globus" key="pe">threaded</profile>
> <profile namespace="globus" key="workersPerNode">4</profile>
> <profile namespace="globus" key="slots">128</profile>
> <profile namespace="globus" key="nodeGranularity">1</profile>
> <profile namespace="globus" key="maxnodes">1</profile>
> <profile namespace="karajan" key="jobThrottle">5.11</profile>
> <profile namespace="karajan" key="initialScore">10000</profile>
> <filesystem provider="local" url="none"/>
> <workdirectory>/home/dk0966/swiftwork</workdirectory>
> </pool>
> </config>
> 
> For which I get the following error:
> Worker task failed: Error submitting block task
> org.globus.cog.abstraction.impl.common.task.TaskSubmissionException:
> Cannot submit job: Could not submit job (qsub reported an exit code of
> 1).
> Unable to run job: job rejected: the requested parallel environment
> "16way" does not exist.Exiting.
> 
> I couldn't find much information about SGE setups either in the guide
> or the cookbook. Is there anything else I am missing to get this up
> and running?
> 
> Thanks,
> David

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

-------------- next part --------------
A non-text attachment was scrubbed...
Name: sgecoastermods.tar
Type: application/x-tar
Size: 30720 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20101114/6f7bb14c/attachment.tar>

From wilde at mcs.anl.gov  Sun Nov 14 23:03:26 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Sun, 14 Nov 2010 23:03:26 -0600 (CST)
Subject: [Swift-devel] Concurrent dostagein calls limited to 8 ?
In-Reply-To: <1289778336.11636.2.camel@blabla2.none>
Message-ID: <1711691138.59170.1289797406428.JavaMail.root@zimbra.anl.gov>

Mihael,

I attached my vdl-int.k.  The changes were based on Justin's initial version of the external policy CDM setting, but I added the ability to handle stageout as well as stagein, and to gather all the files for a stagein or stageout in the external script, and process them all at once.

In my external script, I now batch the files for multiple requests into one larger transfer command to globusonline, using time-based batching.  This adds latency to an individual request but saves greatly overall, as globusonline will only do 3 concurrent transfers for a given user, and has its own latency for checking its work queue.

My hooks are the calls to cdm:externalin, externalout, and externalgo.

I use a map element as a reference variable to determine when to call externalgo.

All this seems to work at the basic level, but I still see only a steady state of 8 external calls running at once.

Further, I think the latency involved is causing some strange interaction with coasters which I need to send you.  My scripts run fine on localhost but fail on PADS with coasters: after about 80 of 300+ jobs I get a caster failure that I need to log and post - looks like some kind of timeout in worker.pl waiting for a response.

- Mike


----- Original Message -----
> On Sun, 2010-11-14 at 17:23 -0600, Michael Wilde wrote:
> > Some answers from my handheld:
> > - foreach loop has 317 files so ample parallelism
> 
> I would have assumed it's > 8. But I suspect, given one of the answers
> below, that it does not matter.
> 
> > - throttle in sites entry set to .63 to run 64 jobs at once
> > - the "active" external.sh is called from end of dostagein and
> > dostageout in vdl-int.k (after all files for the job were put in a
> > list by prior calls to externa.sh from within those functions
> 
> How is this call actually implemented. I.e. can you post the
> respective
> snippet of vdl-int?
> 
> > - the actual staging op by globusonline take 30-60 seconds,
> > sometimes
> > more. I batch them up.

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

-------------- next part --------------
A non-text attachment was scrubbed...
Name: vdl-int.k
Type: application/octet-stream
Size: 20559 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20101114/d8017301/attachment.obj>

From hategan at mcs.anl.gov  Sun Nov 14 23:07:21 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sun, 14 Nov 2010 21:07:21 -0800
Subject: [Swift-devel] Concurrent dostagein calls limited to 8 ?
In-Reply-To: <AANLkTik5pzgVo7JsOYgYm0jp2OZSea==scy79_Kp+CPN@mail.gmail.com>
References: <2026767406.58546.1289772759569.JavaMail.root@zimbra.anl.gov>
	<1289773316.10896.10.camel@blabla2.none>
	<AANLkTimns4pt47rOSAQEvmb2iRqieynV0S1FLSZg23cF@mail.gmail.com>
	<1289778336.11636.2.camel@blabla2.none>
	<AANLkTik5pzgVo7JsOYgYm0jp2OZSea==scy79_Kp+CPN@mail.gmail.com>
Message-ID: <1289797641.13416.5.camel@blabla2.none>

The cdm functions (externalin, externalout, externalgo) are not
asynchronous. They block the karajan worker threads and therefore,
besides preventing anything else from running in the interpreter, are
also limited to concurrently running whatever the number of karajan
worker threads is (2*cores).

I would suggest changing those functions to use the local provider or
some other scheme that can free the workers while the sub-processes run.

Mihael

On Sun, 2010-11-14 at 20:56 -0600, Michael Wilde wrote:
> I'm in a cab - vdlint.k is in local fs on:
> 
> Login1.pads.ci
> /scratch/local/wilde/swift/src/trunk/...
> Running from dist/swft-svn in that tree
> 
> On 11/14/10, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> > On Sun, 2010-11-14 at 17:23 -0600, Michael Wilde wrote:
> >> Some answers from my handheld:
> >> - foreach loop has 317 files so ample parallelism
> >
> > I would have assumed it's > 8. But I suspect, given one of the answers
> > below, that it does not matter.
> >
> >> - throttle in sites entry set to .63 to run 64 jobs at once
> >> - the "active" external.sh is called from end of dostagein and
> >> dostageout in vdl-int.k (after all files for the job were put in a
> >> list by prior calls to externa.sh from within those functions
> >
> > How is this call actually implemented. I.e. can you post the respective
> > snippet of vdl-int?
> >
> >> - the actual staging op by globusonline take 30-60 seconds, sometimes
> >> more. I batch them up.
> >
> >
> >
> 


From wilde at mcs.anl.gov  Sun Nov 14 23:24:42 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Sun, 14 Nov 2010 23:24:42 -0600 (CST)
Subject: [Swift-devel] Concurrent dostagein calls limited to 8 ?
In-Reply-To: <1289797641.13416.5.camel@blabla2.none>
Message-ID: <1604855895.59214.1289798682652.JavaMail.root@zimbra.anl.gov>

That explains a lot - the limited number of Karajan threads probably explains why coasters goes haywire in the larger tests as well.

Clearly this should be done as full fledged provider.  But that will take a fair bit more work.

Would there be any ill effects from bumping up the number of karajan threads to see if I can make this demo work?  WHere is that set?

Also, when you say "use the local provider or
> some other scheme that can free the workers while the sub-processes
> run." - do you have anything "quick and easy" in mind there?

- Mike


----- Original Message -----
> The cdm functions (externalin, externalout, externalgo) are not
> asynchronous. They block the karajan worker threads and therefore,
> besides preventing anything else from running in the interpreter, are
> also limited to concurrently running whatever the number of karajan
> worker threads is (2*cores).
> 
> I would suggest changing those functions to use the local provider or
> some other scheme that can free the workers while the sub-processes
> run.
> 
> Mihael
> 
> On Sun, 2010-11-14 at 20:56 -0600, Michael Wilde wrote:
> > I'm in a cab - vdlint.k is in local fs on:
> >
> > Login1.pads.ci
> > /scratch/local/wilde/swift/src/trunk/...
> > Running from dist/swft-svn in that tree
> >
> > On 11/14/10, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> > > On Sun, 2010-11-14 at 17:23 -0600, Michael Wilde wrote:
> > >> Some answers from my handheld:
> > >> - foreach loop has 317 files so ample parallelism
> > >
> > > I would have assumed it's > 8. But I suspect, given one of the
> > > answers
> > > below, that it does not matter.
> > >
> > >> - throttle in sites entry set to .63 to run 64 jobs at once
> > >> - the "active" external.sh is called from end of dostagein and
> > >> dostageout in vdl-int.k (after all files for the job were put in
> > >> a
> > >> list by prior calls to externa.sh from within those functions
> > >
> > > How is this call actually implemented. I.e. can you post the
> > > respective
> > > snippet of vdl-int?
> > >
> > >> - the actual staging op by globusonline take 30-60 seconds,
> > >> sometimes
> > >> more. I batch them up.
> > >
> > >
> > >
> >

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory


From wilde at mcs.anl.gov  Mon Nov 15 00:02:06 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 15 Nov 2010 00:02:06 -0600 (CST)
Subject: [Swift-devel] Concurrent dostagein calls limited to 8 ?
In-Reply-To: <1604855895.59214.1289798682652.JavaMail.root@zimbra.anl.gov>
Message-ID: <1756882204.59280.1289800926179.JavaMail.root@zimbra.anl.gov>

I bumped up the thread count to 32*cores. It was 4 * # cores, so maybe there is some 50% allocation factor going on?

At any rate, if I reduce the number of files Im processing from 317 to 100, the entire script seems to work fairly reliably. But I can definitely see the ill effects of the large number of threads Im tying up waiting on IO.

(For one thing, I cant keep my coaster cores busy, and I get the "Canceling job" message from coaster workers shutting down for lack of work).

This will improve a bit when I enhance the interface to globusonline to wait on individual file transfers rather than on the whole allocation request.

- Mike


----- Original Message -----
> That explains a lot - the limited number of Karajan threads probably
> explains why coasters goes haywire in the larger tests as well.
> 
> Clearly this should be done as full fledged provider. But that will
> take a fair bit more work.
> 
> Would there be any ill effects from bumping up the number of karajan
> threads to see if I can make this demo work? WHere is that set?
> 
> Also, when you say "use the local provider or
> > some other scheme that can free the workers while the sub-processes
> > run." - do you have anything "quick and easy" in mind there?
> 
> - Mike
> 
> 
> ----- Original Message -----
> > The cdm functions (externalin, externalout, externalgo) are not
> > asynchronous. They block the karajan worker threads and therefore,
> > besides preventing anything else from running in the interpreter,
> > are
> > also limited to concurrently running whatever the number of karajan
> > worker threads is (2*cores).
> >
> > I would suggest changing those functions to use the local provider
> > or
> > some other scheme that can free the workers while the sub-processes
> > run.
> >
> > Mihael
> >
> > On Sun, 2010-11-14 at 20:56 -0600, Michael Wilde wrote:
> > > I'm in a cab - vdlint.k is in local fs on:
> > >
> > > Login1.pads.ci
> > > /scratch/local/wilde/swift/src/trunk/...
> > > Running from dist/swft-svn in that tree
> > >
> > > On 11/14/10, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> > > > On Sun, 2010-11-14 at 17:23 -0600, Michael Wilde wrote:
> > > >> Some answers from my handheld:
> > > >> - foreach loop has 317 files so ample parallelism
> > > >
> > > > I would have assumed it's > 8. But I suspect, given one of the
> > > > answers
> > > > below, that it does not matter.
> > > >
> > > >> - throttle in sites entry set to .63 to run 64 jobs at once
> > > >> - the "active" external.sh is called from end of dostagein and
> > > >> dostageout in vdl-int.k (after all files for the job were put
> > > >> in
> > > >> a
> > > >> list by prior calls to externa.sh from within those functions
> > > >
> > > > How is this call actually implemented. I.e. can you post the
> > > > respective
> > > > snippet of vdl-int?
> > > >
> > > >> - the actual staging op by globusonline take 30-60 seconds,
> > > >> sometimes
> > > >> more. I batch them up.
> > > >
> > > >
> > > >
> > >
> 
> --
> Michael Wilde
> Computation Institute, University of Chicago
> Mathematics and Computer Science Division
> Argonne National Laboratory
> 
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory


From hategan at mcs.anl.gov  Mon Nov 15 00:06:44 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sun, 14 Nov 2010 22:06:44 -0800
Subject: [Swift-devel] Concurrent dostagein calls limited to 8 ?
In-Reply-To: <1604855895.59214.1289798682652.JavaMail.root@zimbra.anl.gov>
References: <1604855895.59214.1289798682652.JavaMail.root@zimbra.anl.gov>
Message-ID: <1289801204.13873.2.camel@blabla2.none>

On Sun, 2010-11-14 at 23:24 -0600, Michael Wilde wrote:
> That explains a lot - the limited number of Karajan threads probably explains why coasters goes haywire in the larger tests as well.
> 
> Clearly this should be done as full fledged provider.  But that will take a fair bit more work.
> 
> Would there be any ill effects from bumping up the number of karajan threads to see if I can make this demo work?  WHere is that set?

There will be the ill effect of wasting memory to wait for stuff. It's
set in EventBus.java.

> 
> Also, when you say "use the local provider or
> > some other scheme that can free the workers while the sub-processes
> > run." - do you have anything "quick and easy" in mind there?

Yep. Say task:execute(script, args) in vdl-int instead.

Mihael


From hategan at mcs.anl.gov  Mon Nov 15 00:08:09 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sun, 14 Nov 2010 22:08:09 -0800
Subject: [Swift-devel] Concurrent dostagein calls limited to 8 ?
In-Reply-To: <1756882204.59280.1289800926179.JavaMail.root@zimbra.anl.gov>
References: <1756882204.59280.1289800926179.JavaMail.root@zimbra.anl.gov>
Message-ID: <1289801289.13873.3.camel@blabla2.none>

On Mon, 2010-11-15 at 00:02 -0600, Michael Wilde wrote:
> I bumped up the thread count to 32*cores. It was 4 * # cores, so maybe there is some 50% allocation factor going on?

There shouldn't be. I had 2*cores on my version though.

> 
> At any rate, if I reduce the number of files Im processing from 317 to 100, the entire script seems to work fairly reliably. But I can definitely see the ill effects of the large number of threads Im tying up waiting on IO.
> 
> (For one thing, I cant keep my coaster cores busy, and I get the "Canceling job" message from coaster workers shutting down for lack of work).
> 
> This will improve a bit when I enhance the interface to globusonline to wait on individual file transfers rather than on the whole allocation request.
> 
> - Mike
> 
> 
> ----- Original Message -----
> > That explains a lot - the limited number of Karajan threads probably
> > explains why coasters goes haywire in the larger tests as well.
> > 
> > Clearly this should be done as full fledged provider. But that will
> > take a fair bit more work.
> > 
> > Would there be any ill effects from bumping up the number of karajan
> > threads to see if I can make this demo work? WHere is that set?
> > 
> > Also, when you say "use the local provider or
> > > some other scheme that can free the workers while the sub-processes
> > > run." - do you have anything "quick and easy" in mind there?
> > 
> > - Mike
> > 
> > 
> > ----- Original Message -----
> > > The cdm functions (externalin, externalout, externalgo) are not
> > > asynchronous. They block the karajan worker threads and therefore,
> > > besides preventing anything else from running in the interpreter,
> > > are
> > > also limited to concurrently running whatever the number of karajan
> > > worker threads is (2*cores).
> > >
> > > I would suggest changing those functions to use the local provider
> > > or
> > > some other scheme that can free the workers while the sub-processes
> > > run.
> > >
> > > Mihael
> > >
> > > On Sun, 2010-11-14 at 20:56 -0600, Michael Wilde wrote:
> > > > I'm in a cab - vdlint.k is in local fs on:
> > > >
> > > > Login1.pads.ci
> > > > /scratch/local/wilde/swift/src/trunk/...
> > > > Running from dist/swft-svn in that tree
> > > >
> > > > On 11/14/10, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> > > > > On Sun, 2010-11-14 at 17:23 -0600, Michael Wilde wrote:
> > > > >> Some answers from my handheld:
> > > > >> - foreach loop has 317 files so ample parallelism
> > > > >
> > > > > I would have assumed it's > 8. But I suspect, given one of the
> > > > > answers
> > > > > below, that it does not matter.
> > > > >
> > > > >> - throttle in sites entry set to .63 to run 64 jobs at once
> > > > >> - the "active" external.sh is called from end of dostagein and
> > > > >> dostageout in vdl-int.k (after all files for the job were put
> > > > >> in
> > > > >> a
> > > > >> list by prior calls to externa.sh from within those functions
> > > > >
> > > > > How is this call actually implemented. I.e. can you post the
> > > > > respective
> > > > > snippet of vdl-int?
> > > > >
> > > > >> - the actual staging op by globusonline take 30-60 seconds,
> > > > >> sometimes
> > > > >> more. I batch them up.
> > > > >
> > > > >
> > > > >
> > > >
> > 
> > --
> > Michael Wilde
> > Computation Institute, University of Chicago
> > Mathematics and Computer Science Division
> > Argonne National Laboratory
> > 
> > _______________________________________________
> > Swift-devel mailing list
> > Swift-devel at ci.uchicago.edu
> > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> 


From bugzilla-daemon at mcs.anl.gov  Mon Nov 15 14:19:29 2010
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Mon, 15 Nov 2010 14:19:29 -0600 (CST)
Subject: [Swift-devel] [Bug 167] clustering time limit specification in
 seconds is awkward for large clustering times
In-Reply-To: <bug-167-21@http.bugzilla.mcs.anl.gov/swift/>
References: <bug-167-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20101115201929.D8C142BF1E@wind.mcs.anl.gov>

https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=167


skenny <skenny at uchicago.edu> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |skenny at uchicago.edu


--- Comment #2 from skenny <skenny at uchicago.edu>  2010-11-15 14:19:29 ---
are we deprecating clustering?

-- 
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.
You are watching the reporter.


From bugzilla-daemon at mcs.anl.gov  Mon Nov 15 14:35:41 2010
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Mon, 15 Nov 2010 14:35:41 -0600 (CST)
Subject: [Swift-devel] [Bug 178] strange unused string replacement in
 CSVMapper needs investigating
In-Reply-To: <bug-178-21@http.bugzilla.mcs.anl.gov/swift/>
References: <bug-178-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20101115203541.5CA482BF52@wind.mcs.anl.gov>

https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=178


skenny <skenny at uchicago.edu> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |skenny at uchicago.edu


--- Comment #2 from skenny <skenny at uchicago.edu>  2010-11-15 14:35:41 ---
(In reply to comment #1)
> I believe we should deprecate the CSV mapper entirely in favor of the ext
> mapper, which is both more powerful and easier to use.

agreed! myself and my users do not use the csv mapper. do others use this?

-- 
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.
You are watching the reporter.


From bugzilla-daemon at mcs.anl.gov  Mon Nov 15 14:58:51 2010
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Mon, 15 Nov 2010 14:58:51 -0600 (CST)
Subject: [Swift-devel] [Bug 182] Error messages summarized at end of Swift
 output should also be printed when they occur
In-Reply-To: <bug-182-21@http.bugzilla.mcs.anl.gov/swift/>
References: <bug-182-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20101115205851.2FBD12BF1E@wind.mcs.anl.gov>

https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=182


skenny <skenny at uchicago.edu> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED
                 CC|                            |skenny at uchicago.edu
         AssignedTo|benc at hawaga.org.uk          |skenny at uchicago.edu


-- 
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.
You are watching the reporter.


From bugzilla-daemon at mcs.anl.gov  Mon Nov 15 15:55:24 2010
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Mon, 15 Nov 2010 15:55:24 -0600 (CST)
Subject: [Swift-devel] [Bug 178] strange unused string replacement in
 CSVMapper needs investigating
In-Reply-To: <bug-178-21@http.bugzilla.mcs.anl.gov/swift/>
References: <bug-178-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20101115215524.90C7C2CBD5@wind.mcs.anl.gov>

https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=178


Justin Wozniak <wozniak at mcs.anl.gov> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |wozniak at mcs.anl.gov


--- Comment #3 from Justin Wozniak <wozniak at mcs.anl.gov>  2010-11-15 15:55:24 ---
It is used by the Montage application.  I fixed this up a bit over the summer. 
I vote for keeping it.

-- 
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.
You are watching the reporter.


From wilde at mcs.anl.gov  Mon Nov 15 16:35:30 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 15 Nov 2010 16:35:30 -0600 (CST)
Subject: [Swift-devel] Concurrent dostagein calls limited to 8 ?
In-Reply-To: <1289801204.13873.2.camel@blabla2.none>
Message-ID: <2108607140.63954.1289860530901.JavaMail.root@zimbra.anl.gov>

> >...do you have anything "quick and easy" in mind there?
> 
> Yep. Say task:execute(script, args) in vdl-int instead.

OK, that looks very promising and works for me in simple .k tests.

Can you tell me how to pluck the workdir out of the host description from within vdl-int.k?

(I will hunt for this but clues are welcome :)

Thanks,

- Mike


----- Original Message -----
> On Sun, 2010-11-14 at 23:24 -0600, Michael Wilde wrote:
> > That explains a lot - the limited number of Karajan threads probably
> > explains why coasters goes haywire in the larger tests as well.
> >
> > Clearly this should be done as full fledged provider. But that will
> > take a fair bit more work.
> >
> > Would there be any ill effects from bumping up the number of karajan
> > threads to see if I can make this demo work? WHere is that set?
> 
> There will be the ill effect of wasting memory to wait for stuff. It's
> set in EventBus.java.
> 
> >
> > Also, when you say "use the local provider or
> > > some other scheme that can free the workers while the
> > > sub-processes
> > > run." - do you have anything "quick and easy" in mind there?
> 
> Yep. Say task:execute(script, args) in vdl-int instead.
> 
> Mihael

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory


From bugzilla-daemon at mcs.anl.gov  Mon Nov 15 16:55:43 2010
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Mon, 15 Nov 2010 16:55:43 -0600 (CST)
Subject: [Swift-devel] [Bug 199] error in simple mapper when underscores are
 used in type declaration
In-Reply-To: <bug-199-21@http.bugzilla.mcs.anl.gov/swift/>
References: <bug-199-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20101115225543.5BC082CBE5@wind.mcs.anl.gov>

https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=199


skenny <skenny at uchicago.edu> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |WORKSFORME


-- 
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.


From bugzilla-daemon at mcs.anl.gov  Mon Nov 15 17:06:32 2010
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Mon, 15 Nov 2010 17:06:32 -0600 (CST)
Subject: [Swift-devel] [Bug 215] stdout and stderr redirect for SGE
 jobmanager causing failure on stageouts
In-Reply-To: <bug-215-21@http.bugzilla.mcs.anl.gov/swift/>
References: <bug-215-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20101115230632.8EDEE2CBF5@wind.mcs.anl.gov>

https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=215


skenny <skenny at uchicago.edu> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED


--- Comment #1 from skenny <skenny at uchicago.edu>  2010-11-15 17:06:32 ---
has been working as of Swift svn swift-r3497 cog-r2829

-- 
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.


From bugzilla-daemon at mcs.anl.gov  Mon Nov 15 17:11:21 2010
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Mon, 15 Nov 2010 17:11:21 -0600 (CST)
Subject: [Swift-devel] [Bug 220] id's for external data types not stored in
	rlog for resume
In-Reply-To: <bug-220-21@http.bugzilla.mcs.anl.gov/swift/>
References: <bug-220-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20101115231121.2D2B52B99D@wind.mcs.anl.gov>

https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=220


--- Comment #1 from skenny <skenny at uchicago.edu>  2010-11-15 17:11:20 ---
*** Bug 219 has been marked as a duplicate of this bug. ***

-- 
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.


From bugzilla-daemon at mcs.anl.gov  Mon Nov 15 17:11:21 2010
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Mon, 15 Nov 2010 17:11:21 -0600 (CST)
Subject: [Swift-devel] [Bug 219] variables of type external are not
	mapped/written to rlog
In-Reply-To: <bug-219-21@http.bugzilla.mcs.anl.gov/swift/>
References: <bug-219-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20101115231121.181F32B99D@wind.mcs.anl.gov>

https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=219


skenny <skenny at uchicago.edu> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |DUPLICATE


--- Comment #1 from skenny <skenny at uchicago.edu>  2010-11-15 17:11:20 ---


*** This bug has been marked as a duplicate of bug 220 ***

-- 
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.


From bugzilla-daemon at mcs.anl.gov  Mon Nov 15 17:19:19 2010
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Mon, 15 Nov 2010 17:19:19 -0600 (CST)
Subject: [Swift-devel] [Bug 227] Always keep submit and stdout/err files for
 failing jobs from localscheduler provider
In-Reply-To: <bug-227-21@http.bugzilla.mcs.anl.gov/swift/>
References: <bug-227-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20101115231919.23C022CC02@wind.mcs.anl.gov>

https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=227


skenny <skenny at uchicago.edu> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED


-- 
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the reporter.


From wilde at mcs.anl.gov  Mon Nov 15 20:10:41 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 15 Nov 2010 20:10:41 -0600 (CST)
Subject: [Swift-devel] Concurrent dostagein calls limited to 8 ?
In-Reply-To: <2108607140.63954.1289860530901.JavaMail.root@zimbra.anl.gov>
Message-ID: <2049120682.64729.1289873441190.JavaMail.root@zimbra.anl.gov>

Justin pointed me at the function vdl:getprofile, but this does not seem to contain the same attributes as the pool entry. WHen I call:

  vdl:siteprofile(desthost,"workdirectory")

I get:

Execution failed:
        Exception in cp:
Arguments: [etc/group, home/wilde/godata/gridoutput.txt]
Host: localhost
Directory: gcat-20101115-2005-kh054dx8/jobs/1/cp-1uu8rn1k
stderr.txt:

stdout.txt:

----

Caused by:
        Missing profile: workdirectory
login1$ cat sites.xml

<!-- Local execution in the given workspace -->

<config>

  <pool handle="localhost" sysinfo="INTEL32::LINUX">
    <gridftp url="local://localhost" />
    <execution provider="local" url="none" />
    <workdirectory>/home/wilde/swift/lab/go/work</workdirectory>
  </pool>

</config>

--- Mike


----- Original Message -----
> > >...do you have anything "quick and easy" in mind there?
> >
> > Yep. Say task:execute(script, args) in vdl-int instead.
> 
> OK, that looks very promising and works for me in simple .k tests.
> 
> Can you tell me how to pluck the workdir out of the host description
> from within vdl-int.k?
> 
> (I will hunt for this but clues are welcome :)
> 
> Thanks,
> 
> - Mike
> 
> 
> ----- Original Message -----
> > On Sun, 2010-11-14 at 23:24 -0600, Michael Wilde wrote:
> > > That explains a lot - the limited number of Karajan threads
> > > probably
> > > explains why coasters goes haywire in the larger tests as well.
> > >
> > > Clearly this should be done as full fledged provider. But that
> > > will
> > > take a fair bit more work.
> > >
> > > Would there be any ill effects from bumping up the number of
> > > karajan
> > > threads to see if I can make this demo work? WHere is that set?
> >
> > There will be the ill effect of wasting memory to wait for stuff.
> > It's
> > set in EventBus.java.
> >
> > >
> > > Also, when you say "use the local provider or
> > > > some other scheme that can free the workers while the
> > > > sub-processes
> > > > run." - do you have anything "quick and easy" in mind there?
> >
> > Yep. Say task:execute(script, args) in vdl-int instead.
> >
> > Mihael
> 
> --
> Michael Wilde
> Computation Institute, University of Chicago
> Mathematics and Computer Science Division
> Argonne National Laboratory
> 
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory


From hategan at mcs.anl.gov  Mon Nov 15 20:49:52 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 15 Nov 2010 18:49:52 -0800
Subject: [Swift-devel] Concurrent dostagein calls limited to 8 ?
In-Reply-To: <2108607140.63954.1289860530901.JavaMail.root@zimbra.anl.gov>
References: <2108607140.63954.1289860530901.JavaMail.root@zimbra.anl.gov>
Message-ID: <1289875792.17856.0.camel@blabla2.none>

On Mon, 2010-11-15 at 16:35 -0600, Michael Wilde wrote:
> > >...do you have anything "quick and easy" in mind there?
> > 
> > Yep. Say task:execute(script, args) in vdl-int instead.
> 
> OK, that looks very promising and works for me in simple .k tests.
> 
> Can you tell me how to pluck the workdir out of the host description from within vdl-int.k?

I don't know the details. Worst case scenario you write a fava karajan
function that does it.

Mihael


From bugzilla-daemon at mcs.anl.gov  Mon Nov 15 23:05:05 2010
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Mon, 15 Nov 2010 23:05:05 -0600 (CST)
Subject: [Swift-devel] [Bug 178] strange unused string replacement in
 CSVMapper needs investigating
In-Reply-To: <bug-178-21@http.bugzilla.mcs.anl.gov/swift/>
References: <bug-178-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20101116050505.E6B402CC33@wind.mcs.anl.gov>

https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=178


Mihael Hategan <hategan at mcs.anl.gov> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |hategan at mcs.anl.gov


--- Comment #4 from Mihael Hategan <hategan at mcs.anl.gov>  2010-11-15 23:05:05 ---
(In reply to comment #3)
> It is used by the Montage application.  I fixed this up a bit over the summer. 
> I vote for keeping it.

The question is whether we want to keep it in the long run. Orthogonality would
dictate that if there is a better way to do it, this should go.

-- 
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching someone on the CC list of the bug.
You are watching the assignee of the bug.
You are watching the reporter.


From wozniak at mcs.anl.gov  Tue Nov 16 09:46:52 2010
From: wozniak at mcs.anl.gov (Justin M Wozniak)
Date: Tue, 16 Nov 2010 09:46:52 -0600 (Central Standard Time)
Subject: [Swift-devel] Concurrent dostagein calls limited to 8 ?
In-Reply-To: <1289875792.17856.0.camel@blabla2.none>
References: <2108607140.63954.1289860530901.JavaMail.root@zimbra.anl.gov>
	<1289875792.17856.0.camel@blabla2.none>
Message-ID: <alpine.WNT.2.00.1011160940490.2908@justinwozniak>

On Mon, 15 Nov 2010, Mihael Hategan wrote:

> On Mon, 2010-11-15 at 16:35 -0600, Michael Wilde wrote:
>>>> ...do you have anything "quick and easy" in mind there?
>>>
>>> Yep. Say task:execute(script, args) in vdl-int instead.
>>
>> OK, that looks very promising and works for me in simple .k tests.
>>
>> Can you tell me how to pluck the workdir out of the host description from within vdl-int.k?
>
> I don't know the details. Worst case scenario you write a fava karajan
> function that does it.
>
> Mihael

I haven't tried this myself but vdl-sc.k makes it look like these are all 
stored in the HostNode properties and are therefore accessible by 
vdl:siteprofile() .  Is this correct?

-- 
Justin M Wozniak


From wozniak at mcs.anl.gov  Tue Nov 16 10:17:00 2010
From: wozniak at mcs.anl.gov (Justin M Wozniak)
Date: Tue, 16 Nov 2010 10:17:00 -0600 (Central Standard Time)
Subject: [Swift-devel] Concurrent dostagein calls limited to 8 ?
In-Reply-To: <alpine.WNT.2.00.1011160940490.2908@justinwozniak>
References: <2108607140.63954.1289860530901.JavaMail.root@zimbra.anl.gov>
	<1289875792.17856.0.camel@blabla2.none>
	<alpine.WNT.2.00.1011160940490.2908@justinwozniak>
Message-ID: <alpine.WNT.2.00.1011161014190.2908@justinwozniak>

On Tue, 16 Nov 2010, Justin M Wozniak wrote:

> On Mon, 15 Nov 2010, Mihael Hategan wrote:
>
>> On Mon, 2010-11-15 at 16:35 -0600, Michael Wilde wrote:

>>> Can you tell me how to pluck the workdir out of the host description from 
>>> within vdl-int.k?
>> 
>> I don't know the details. Worst case scenario you write a fava karajan
>> function that does it.
>> 
>> Mihael
>
> I haven't tried this myself but vdl-sc.k makes it look like these are all 
> stored in the HostNode properties and are therefore accessible by 
> vdl:siteprofile() .  Is this correct?

This does seem to work for me: for example, in vdl-int.k:initSharedDir() I 
can add the following and get the expected result from sites.xml:

 	workdir := vdl:siteprofile(rhost, "workdir")
 	print("workdir: {workdir}")

...
Progress:  Submitted:1
workdir: /home/wozniak/work
Final status:  Finished successfully:1
...

-- 
Justin M Wozniak


From bugzilla-daemon at mcs.anl.gov  Tue Nov 16 14:27:13 2010
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Tue, 16 Nov 2010 14:27:13 -0600 (CST)
Subject: [Swift-devel] [Bug 167] clustering time limit specification in
 seconds is awkward for large clustering times
In-Reply-To: <bug-167-21@http.bugzilla.mcs.anl.gov/swift/>
References: <bug-167-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20101116202713.545A12C9DA@wind.mcs.anl.gov>

https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=167


Justin Wozniak <wozniak at mcs.anl.gov> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |wozniak at mcs.anl.gov


--- Comment #3 from Justin Wozniak <wozniak at mcs.anl.gov>  2010-11-16 14:27:13 ---
I think so.

-- 
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.
You are watching the reporter.


From aespinosa at cs.uchicago.edu  Tue Nov 16 18:17:23 2010
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Tue, 16 Nov 2010 18:17:23 -0600
Subject: [Swift-devel] patch: dostageinfile.transitions
Message-ID: <AANLkTik-0Z6ttD0GprMMmrn3ZGAY8+BRx3zxJ_GENs=D@mail.gmail.com>

Support for dostageinfile.png and dostageinfile-total.png .  Hopefully this is a
more precise plot of actual file stageins.


diff --git a/libexec/log-processing/log-to-dostageinfile-transitions
b/libexec/log-processing/log-to-dostageinfile-transitions
new file mode 100644
index 0000000..48dd399
--- /dev/null
+++ b/libexec/log-processing/log-to-dostageinfile-transitions
@@ -0,0 +1,11 @@
+#!/usr/bin/env ruby
+
+require 'time'
+
+$stdin.grep(/vdl:dostageinfile/).each do |line|
+  x = line.match(/^(\S*\ \S*)\ \S*\ \S*\ (\S*)\ file=(\S*)\
srchost=(\S*) \S*\ \S* desthost=(\S*)/)
+  oras = Time.parse(x[1]).to_f
+  id = "#{x[3]}-#{x[4]}-#{x[5]}"
+  state = x[2].match(/(START|END)/)[1]
+  puts "#{oras} #{id} #{state}"
+end
diff --git a/libexec/log-processing/makefile b/libexec/log-processing/makefile
index 40bdb8d..9a52f5b 100644
--- a/libexec/log-processing/makefile
+++ b/libexec/log-processing/makefile
@@ -95,6 +95,9 @@ createdirset.transitions: $(LOG)
 dostagein.transitions: $(LOG)
 	log-to-dostagein-transitions < $(LOG) > dostagein.transitions

+dostageinfile.transitions: $(LOG)
+	log-to-dostageinfile-transitions < $(LOG) > dostageinfile.transitions
+
 dostageout.transitions: $(LOG)
 	log-to-dostageout-transitions < $(LOG) > dostageout.transitions


-- 
Allan M. Espinosa <http://amespinosa.wordpress.com>
PhD student, Computer Science
University of Chicago <http://people.cs.uchicago.edu/~aespinosa>


From aespinosa at cs.uchicago.edu  Wed Nov 17 15:32:32 2010
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Wed, 17 Nov 2010 15:32:32 -0600
Subject: [Swift-devel] Re: the persistence of the persistent coaster service.
In-Reply-To: <AANLkTi=fi8Urgx6itsg7dDJjS1JK+z9H2Z4NisKP9EG+@mail.gmail.com>
References: <AANLkTi=fi8Urgx6itsg7dDJjS1JK+z9H2Z4NisKP9EG+@mail.gmail.com>
Message-ID: <AANLkTi=Ffq5cA4KZLzTUNECi+YWFXDCzRsLD0jDE7xsN@mail.gmail.com>

Bumping the thread.  In an attempt to isolate the bug, I made this workflow:

app (external o) sleep(int time) {
  sleep time;
}


/* Main program */
external rups[];

int t = 300;
int a[];

iterate ix {
  a[ix] = ix;
} until (ix == 1300);

foreach ai,i in a {
  rups[i] = sleep(t);
}


<config>
  <pool handle="localhost">
    <execution provider="coaster-persistent"
url="https://communicado.ci.uchicago.edu:61999"
        jobmanager="local:local" />

    <profile namespace="globus" key="workerManager">passive</profile>

    <gridftp  url="local://localhost"/>
    <workdirectory>/gpfs/pads/swift/aespinosa/swift-runs</workdirectory>
  </pool>


</config>

localhost  sleep          /bin/sleep INSTALLED INTEL32::LINUX null

and still get the same type of error message:
RunID: 20101117-1527-ui6i2lra
Progress:
Find: https://communicado.ci.uchicago.edu:61999
Find:  keepalive(120), reconnect - https://communicado.ci.uchicago.edu:61999
Progress:  Selecting site:1  Submitting:294
Progress:  Selecting site:3  Submitting:367
Progress:  Selecting site:3  Submitting:367
Progress:  Selecting site:3  Submitting:367
Progress:  Selecting site:3  Submitting:367
Command(1, CHANNELCONFIG): handling reply timeout;
sendReqTime=101117-152717.209, sendTime=101117
-152717.211, now=101117-152917.232
Progress:  Selecting site:3  Submitting:366  Submitted:1
Command(1, CHANNELCONFIG)fault was: Reply timeout
org.globus.cog.karajan.workflow.service.ReplyTimeoutException
        at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.ja
va:280)
        at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:285)
        at java.util.TimerThread.mainLoop(Timer.java:512)
        at java.util.TimerThread.run(Timer.java:462)
Progress:  Selecting site:3  Submitting:366 Failed but can retry:1
Progress:  Selecting site:3  Submitting:366 Failed but can retry:1


2010/10/21 Allan Espinosa <aespinosa at cs.uchicago.edu>:
> Hi,
>
> When I'm reusing the coaster service onto the next swift session, i
> get reply timeouts in the CHANNELCONFIG command:
>
>
> swift-r3685 cog-r2913
>
> RunID: extract
> Progress:
> Progress: ?uninitialized:2 ?Finished in previous run:2
> Progress: ?uninitialized:2 ?Finished in previous run:2
> Progress: ?Stage in:99 ?Submitting:1 ?Finished in previous run:102
> Find: https://communicado.ci.uchicago.edu:61999
> Find: ?keepalive(120), reconnect - https://communicado.ci.uchicago.edu:61999
> Progress: ?Stage in:92 ?Submitting:8 ?Finished in previous run:102
> Passive queue processor initialized. Callback URI is http://128.135.125.17:60999
> Progress: ?Stage in:71 ?Submitting:2 ?Submitted:27 ?Finished in previous run:102
> Progress: ?Stage in:29 ?Submitting:1 ?Submitted:70 ?Finished in previous run:102
>
> **Abord** (Ctrl-C)
> ** rerun/ resume workflow **
> swift-r3685 cog-r2913
>
> RunID: extract
> Progress:
> Progress: ?uninitialized:3 ?Finished in previous run:2
> Progress: ?Stage in:99 ?Submitting:1 ?Finished in previous run:102
> Find: https://communicado.ci.uchicago.edu:61999
> Find: ?keepalive(120), reconnect - https://communicado.ci.uchicago.edu:61999
> Progress: ?Stage in:92 ?Submitting:8 ?Finished in previous run:102
> Progress: ?Stage in:92 ?Submitting:8 ?Finished in previous run:102
> Progress: ?Stage in:92 ?Submitting:8 ?Finished in previous run:102
> Progress: ?Stage in:92 ?Submitting:8 ?Finished in previous run:102
> Command(1, CHANNELCONFIG): handling reply timeout;
> sendReqTime=101021-174124.460, sendTime=101021-174124.471,
> now=101021-174324.492
> Command(1, CHANNELCONFIG)fault was: Reply timeout
> org.globus.cog.karajan.workflow.service.ReplyTimeoutException
> ? ? ? ?at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:280)
> ? ? ? ?at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:285)
> ? ? ? ?at java.util.TimerThread.mainLoop(Timer.java:512)
> ? ? ? ?at java.util.TimerThread.run(Timer.java:462)
> Progress: ?Stage in:92 ?Submitting:7 ?Submitted:1 ?Finished in previous run:102
>
> My sites.xml sets the persistent service to work in passive mode.
>
>
> thanks,
> -Allan
>
> --
> Allan M. Espinosa <http://amespinosa.wordpress.com>
> PhD student, Computer Science
> University of Chicago <http://people.cs.uchicago.edu/~aespinosa>
>


-- 
Allan M. Espinosa <http://amespinosa.wordpress.com>
PhD student, Computer Science
University of Chicago <http://people.cs.uchicago.edu/~aespinosa>


From aespinosa at cs.uchicago.edu  Wed Nov 17 15:35:40 2010
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Wed, 17 Nov 2010 15:35:40 -0600
Subject: [Swift-devel] Re: the persistence of the persistent coaster service.
In-Reply-To: <AANLkTi=Ffq5cA4KZLzTUNECi+YWFXDCzRsLD0jDE7xsN@mail.gmail.com>
References: <AANLkTi=fi8Urgx6itsg7dDJjS1JK+z9H2Z4NisKP9EG+@mail.gmail.com>
	<AANLkTi=Ffq5cA4KZLzTUNECi+YWFXDCzRsLD0jDE7xsN@mail.gmail.com>
Message-ID: <AANLkTi=8tiNMhkAiihDizj5+yGiPVj1Cnkd=HesLq1Dd@mail.gmail.com>

Upon the client's connection, this gets registered in the service log:

...
...
Plan time: 1
Plan time: 1
GSSSChannel-null(0)[1175215772: {}]: Disabling heartbeats (config is null)
(1) Scheduling GSSSChannel-null(12)[1175215772: {}] for addition
nullChannel started
Channel id: u-20ccd0f-12c5bc25c45--8000-u-28c73091-12c5b774ab1--7ff5
MetaChannel: 682820082[1175215772: {}] -> null: Disabling heartbeats
(disabled in config)
MetaChannel: 682820082[1175215772: {}] -> null.bind ->
GSSSChannel-null(12)[1175215772: {}]
Plan time: 1
Congestion queue size: 0
runTime: 0, sleepTime: 10049
Plan time: 1
...
...

2010/11/17 Allan Espinosa <aespinosa at cs.uchicago.edu>:
> Bumping the thread. ?In an attempt to isolate the bug, I made this workflow:
>
> app (external o) sleep(int time) {
> ?sleep time;
> }
>
>
> /* Main program */
> external rups[];
>
> int t = 300;
> int a[];
>
> iterate ix {
> ?a[ix] = ix;
> } until (ix == 1300);
>
> foreach ai,i in a {
> ?rups[i] = sleep(t);
> }
>
>
> <config>
> ?<pool handle="localhost">
> ? ?<execution provider="coaster-persistent"
> url="https://communicado.ci.uchicago.edu:61999"
> ? ? ? ?jobmanager="local:local" />
>
> ? ?<profile namespace="globus" key="workerManager">passive</profile>
>
> ? ?<gridftp ?url="local://localhost"/>
> ? ?<workdirectory>/gpfs/pads/swift/aespinosa/swift-runs</workdirectory>
> ?</pool>
>
>
> </config>
>
> localhost ?sleep ? ? ? ? ?/bin/sleep INSTALLED INTEL32::LINUX null
>
> and still get the same type of error message:
> RunID: 20101117-1527-ui6i2lra
> Progress:
> Find: https://communicado.ci.uchicago.edu:61999
> Find: ?keepalive(120), reconnect - https://communicado.ci.uchicago.edu:61999
> Progress: ?Selecting site:1 ?Submitting:294
> Progress: ?Selecting site:3 ?Submitting:367
> Progress: ?Selecting site:3 ?Submitting:367
> Progress: ?Selecting site:3 ?Submitting:367
> Progress: ?Selecting site:3 ?Submitting:367
> Command(1, CHANNELCONFIG): handling reply timeout;
> sendReqTime=101117-152717.209, sendTime=101117
> -152717.211, now=101117-152917.232
> Progress: ?Selecting site:3 ?Submitting:366 ?Submitted:1
> Command(1, CHANNELCONFIG)fault was: Reply timeout
> org.globus.cog.karajan.workflow.service.ReplyTimeoutException
> ? ? ? ?at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.ja
> va:280)
> ? ? ? ?at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:285)
> ? ? ? ?at java.util.TimerThread.mainLoop(Timer.java:512)
> ? ? ? ?at java.util.TimerThread.run(Timer.java:462)
> Progress: ?Selecting site:3 ?Submitting:366 Failed but can retry:1
> Progress: ?Selecting site:3 ?Submitting:366 Failed but can retry:1
>
>
> 2010/10/21 Allan Espinosa <aespinosa at cs.uchicago.edu>:
>> Hi,
>>
>> When I'm reusing the coaster service onto the next swift session, i
>> get reply timeouts in the CHANNELCONFIG command:
>>
>>
>> swift-r3685 cog-r2913
>>
>> RunID: extract
>> Progress:
>> Progress: ?uninitialized:2 ?Finished in previous run:2
>> Progress: ?uninitialized:2 ?Finished in previous run:2
>> Progress: ?Stage in:99 ?Submitting:1 ?Finished in previous run:102
>> Find: https://communicado.ci.uchicago.edu:61999
>> Find: ?keepalive(120), reconnect - https://communicado.ci.uchicago.edu:61999
>> Progress: ?Stage in:92 ?Submitting:8 ?Finished in previous run:102
>> Passive queue processor initialized. Callback URI is http://128.135.125.17:60999
>> Progress: ?Stage in:71 ?Submitting:2 ?Submitted:27 ?Finished in previous run:102
>> Progress: ?Stage in:29 ?Submitting:1 ?Submitted:70 ?Finished in previous run:102
>>
>> **Abord** (Ctrl-C)
>> ** rerun/ resume workflow **
>> swift-r3685 cog-r2913
>>
>> RunID: extract
>> Progress:
>> Progress: ?uninitialized:3 ?Finished in previous run:2
>> Progress: ?Stage in:99 ?Submitting:1 ?Finished in previous run:102
>> Find: https://communicado.ci.uchicago.edu:61999
>> Find: ?keepalive(120), reconnect - https://communicado.ci.uchicago.edu:61999
>> Progress: ?Stage in:92 ?Submitting:8 ?Finished in previous run:102
>> Progress: ?Stage in:92 ?Submitting:8 ?Finished in previous run:102
>> Progress: ?Stage in:92 ?Submitting:8 ?Finished in previous run:102
>> Progress: ?Stage in:92 ?Submitting:8 ?Finished in previous run:102
>> Command(1, CHANNELCONFIG): handling reply timeout;
>> sendReqTime=101021-174124.460, sendTime=101021-174124.471,
>> now=101021-174324.492
>> Command(1, CHANNELCONFIG)fault was: Reply timeout
>> org.globus.cog.karajan.workflow.service.ReplyTimeoutException
>> ? ? ? ?at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:280)
>> ? ? ? ?at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:285)
>> ? ? ? ?at java.util.TimerThread.mainLoop(Timer.java:512)
>> ? ? ? ?at java.util.TimerThread.run(Timer.java:462)
>> Progress: ?Stage in:92 ?Submitting:7 ?Submitted:1 ?Finished in previous run:102
>>
>> My sites.xml sets the persistent service to work in passive mode.
>>
>>
>> thanks,
>> -Allan
>>
>> --
>> Allan M. Espinosa <http://amespinosa.wordpress.com>
>> PhD student, Computer Science
>> University of Chicago <http://people.cs.uchicago.edu/~aespinosa>
>>
>
>
>
> --
> Allan M. Espinosa <http://amespinosa.wordpress.com>
> PhD student, Computer Science
> University of Chicago <http://people.cs.uchicago.edu/~aespinosa>
>


-- 
Allan M. Espinosa <http://amespinosa.wordpress.com>
PhD student, Computer Science
University of Chicago <http://people.cs.uchicago.edu/~aespinosa>


From wilde at mcs.anl.gov  Wed Nov 17 23:25:19 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 17 Nov 2010 23:25:19 -0600 (CST)
Subject: [Swift-devel] Re: the persistence of the persistent coaster
	service.
In-Reply-To: <AANLkTi=8tiNMhkAiihDizj5+yGiPVj1Cnkd=HesLq1Dd@mail.gmail.com>
Message-ID: <1516013329.76009.1290057919991.JavaMail.root@zimbra.anl.gov>

Allan, Ive had similar symptoms, but think that Im seeing different error messages.

When I start a persistent service, I can run repeated Swift scripts against it, but *only* if I do them in fairly quick succession.  If I let the service sit idle for more than about 5 minutes, it becomes unusable.

I need to carefully capture a test case, as well as testing on an unmodified trunk that would enable Mihael to reproduce and fix the problem.  I think thats the key: If you can give Mihael a way to easily reproduce the problem at will, then he'll likely be able to fix it quickly.

I also see a possible related problem: when I run coasters with a large number of slots (say 64) and my workload is unable to keep the workers busy due to staging delays, then after the workers start timing out (ie I get the message "Job Cancelled") then this causes an error somewhere on the client side and Swift quickly dies with a fatal error. I need to try to reproduce this as well and/or capture logs from it.

I hope to get to this next week after SC.

- Mike


----- Original Message -----
> Upon the client's connection, this gets registered in the service log:
> 
> ...
> ...
> Plan time: 1
> Plan time: 1
> GSSSChannel-null(0)[1175215772: {}]: Disabling heartbeats (config is
> null)
> (1) Scheduling GSSSChannel-null(12)[1175215772: {}] for addition
> nullChannel started
> Channel id: u-20ccd0f-12c5bc25c45--8000-u-28c73091-12c5b774ab1--7ff5
> MetaChannel: 682820082[1175215772: {}] -> null: Disabling heartbeats
> (disabled in config)
> MetaChannel: 682820082[1175215772: {}] -> null.bind ->
> GSSSChannel-null(12)[1175215772: {}]
> Plan time: 1
> Congestion queue size: 0
> runTime: 0, sleepTime: 10049
> Plan time: 1
> ...
> ...
> 
> 2010/11/17 Allan Espinosa <aespinosa at cs.uchicago.edu>:
> > Bumping the thread. In an attempt to isolate the bug, I made this
> > workflow:
> >
> > app (external o) sleep(int time) {
> > ?sleep time;
> > }
> >
> >
> > /* Main program */
> > external rups[];
> >
> > int t = 300;
> > int a[];
> >
> > iterate ix {
> > ?a[ix] = ix;
> > } until (ix == 1300);
> >
> > foreach ai,i in a {
> > ?rups[i] = sleep(t);
> > }
> >
> >
> > <config>
> > ?<pool handle="localhost">
> > ? ?<execution provider="coaster-persistent"
> > url="https://communicado.ci.uchicago.edu:61999"
> > ? ? ? ?jobmanager="local:local" />
> >
> > ? ?<profile namespace="globus" key="workerManager">passive</profile>
> >
> > ? ?<gridftp url="local://localhost"/>
> > ? ?<workdirectory>/gpfs/pads/swift/aespinosa/swift-runs</workdirectory>
> > ?</pool>
> >
> >
> > </config>
> >
> > localhost sleep /bin/sleep INSTALLED INTEL32::LINUX null
> >
> > and still get the same type of error message:
> > RunID: 20101117-1527-ui6i2lra
> > Progress:
> > Find: https://communicado.ci.uchicago.edu:61999
> > Find: keepalive(120), reconnect -
> > https://communicado.ci.uchicago.edu:61999
> > Progress: Selecting site:1 Submitting:294
> > Progress: Selecting site:3 Submitting:367
> > Progress: Selecting site:3 Submitting:367
> > Progress: Selecting site:3 Submitting:367
> > Progress: Selecting site:3 Submitting:367
> > Command(1, CHANNELCONFIG): handling reply timeout;
> > sendReqTime=101117-152717.209, sendTime=101117
> > -152717.211, now=101117-152917.232
> > Progress: Selecting site:3 Submitting:366 Submitted:1
> > Command(1, CHANNELCONFIG)fault was: Reply timeout
> > org.globus.cog.karajan.workflow.service.ReplyTimeoutException
> > ? ? ? ?at
> > ? ? ? ?org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.ja
> > va:280)
> > ? ? ? ?at
> > ? ? ? ?org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:285)
> > ? ? ? ?at java.util.TimerThread.mainLoop(Timer.java:512)
> > ? ? ? ?at java.util.TimerThread.run(Timer.java:462)
> > Progress: Selecting site:3 Submitting:366 Failed but can retry:1
> > Progress: Selecting site:3 Submitting:366 Failed but can retry:1
> >
> >
> > 2010/10/21 Allan Espinosa <aespinosa at cs.uchicago.edu>:
> >> Hi,
> >>
> >> When I'm reusing the coaster service onto the next swift session, i
> >> get reply timeouts in the CHANNELCONFIG command:
> >>
> >>
> >> swift-r3685 cog-r2913
> >>
> >> RunID: extract
> >> Progress:
> >> Progress: uninitialized:2 Finished in previous run:2
> >> Progress: uninitialized:2 Finished in previous run:2
> >> Progress: Stage in:99 Submitting:1 Finished in previous run:102
> >> Find: https://communicado.ci.uchicago.edu:61999
> >> Find: keepalive(120), reconnect -
> >> https://communicado.ci.uchicago.edu:61999
> >> Progress: Stage in:92 Submitting:8 Finished in previous run:102
> >> Passive queue processor initialized. Callback URI is
> >> http://128.135.125.17:60999
> >> Progress: Stage in:71 Submitting:2 Submitted:27 Finished in
> >> previous run:102
> >> Progress: Stage in:29 Submitting:1 Submitted:70 Finished in
> >> previous run:102
> >>
> >> **Abord** (Ctrl-C)
> >> ** rerun/ resume workflow **
> >> swift-r3685 cog-r2913
> >>
> >> RunID: extract
> >> Progress:
> >> Progress: uninitialized:3 Finished in previous run:2
> >> Progress: Stage in:99 Submitting:1 Finished in previous run:102
> >> Find: https://communicado.ci.uchicago.edu:61999
> >> Find: keepalive(120), reconnect -
> >> https://communicado.ci.uchicago.edu:61999
> >> Progress: Stage in:92 Submitting:8 Finished in previous run:102
> >> Progress: Stage in:92 Submitting:8 Finished in previous run:102
> >> Progress: Stage in:92 Submitting:8 Finished in previous run:102
> >> Progress: Stage in:92 Submitting:8 Finished in previous run:102
> >> Command(1, CHANNELCONFIG): handling reply timeout;
> >> sendReqTime=101021-174124.460, sendTime=101021-174124.471,
> >> now=101021-174324.492
> >> Command(1, CHANNELCONFIG)fault was: Reply timeout
> >> org.globus.cog.karajan.workflow.service.ReplyTimeoutException
> >> ? ? ? ?at
> >> ? ? ? ?org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:280)
> >> ? ? ? ?at
> >> ? ? ? ?org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:285)
> >> ? ? ? ?at java.util.TimerThread.mainLoop(Timer.java:512)
> >> ? ? ? ?at java.util.TimerThread.run(Timer.java:462)
> >> Progress: Stage in:92 Submitting:7 Submitted:1 Finished in previous
> >> run:102
> >>
> >> My sites.xml sets the persistent service to work in passive mode.
> >>
> >>
> >> thanks,
> >> -Allan
> >>
> >> --
> >> Allan M. Espinosa <http://amespinosa.wordpress.com>
> >> PhD student, Computer Science
> >> University of Chicago <http://people.cs.uchicago.edu/~aespinosa>
> >>
> >
> >
> >
> > --
> > Allan M. Espinosa <http://amespinosa.wordpress.com>
> > PhD student, Computer Science
> > University of Chicago <http://people.cs.uchicago.edu/~aespinosa>
> >
> 
> 
> 
> --
> Allan M. Espinosa <http://amespinosa.wordpress.com>
> PhD student, Computer Science
> University of Chicago <http://people.cs.uchicago.edu/~aespinosa>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory


From bugzilla-daemon at mcs.anl.gov  Thu Nov 18 08:46:51 2010
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Thu, 18 Nov 2010 08:46:51 -0600 (CST)
Subject: [Swift-devel] [Bug 229] New: Swift log should capture additional
 environmental information
Message-ID: <bug-229-21@http.bugzilla.mcs.anl.gov/swift/>

https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=229

           Summary: Swift log should capture additional environmental
                    information
           Product: Swift
           Version: unspecified
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P2
         Component: SwiftScript language
        AssignedTo: skenny at uchicago.edu
        ReportedBy: wilde at mcs.anl.gov


Add:

- java version
- full tc, sites, and property info
- printenv
- swift script

We should control this via a property for users that dont want all this info
captured. That can be a second step.

With this info in the log, we will more likely be able to diagnose a user's
problem with just the single log file.

We should consider removing the file "swift.log" as I have never seen any
useful info in it.

-- 
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the reporter.


From aespinosa at cs.uchicago.edu  Thu Nov 18 14:39:04 2010
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Thu, 18 Nov 2010 14:39:04 -0600
Subject: [Swift-devel] misassignment of jobs
Message-ID: <AANLkTimQobNKo+inP6Z_fSq8bEOUX2PoWcn5hq-FW=G7@mail.gmail.com>

tc.data for worker15:
SPRACE_osg-ce.sprace.org.br  worker15 /osg/app/engage/scec/worker.pl
   INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"

But it was assigned to another site instead:
$ grep 0erqqq1k worker-*.log
2010-11-17 15:38:58,804-0600 DEBUG vdl:execute2 THREAD_ASSOCIATION
jobid=worker15-0erqqq1k thread
 host=LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu replicationGroup=2pnqqq1k
2010-11-17 15:38:59,110-0600 INFO  vdl:createdirset START
jobid=worker15-0erqqq1k host=LIGO_UWM_N
ce.phys.uwm.edu - Initializing directory structure
2010-11-17 15:38:59,137-0600 INFO  vdl:createdirset END
jobid=worker15-0erqqq1k - Done initializi
structure
2010-11-17 15:38:59,172-0600 INFO  vdl:dostagein START
jobid=worker15-0erqqq1k - Staging in files
2010-11-17 15:38:59,257-0600 INFO  vdl:dostagein END
jobid=worker15-0erqqq1k - Staging in finishe
2010-11-17 15:38:59,323-0600 DEBUG vdl:execute2 JOB_START
jobid=worker15-0erqqq1k tr=worker15 arg
//128.135.125.17:61015, SPRACE_osg-ce.sprace.org.br, /tmp, 7200]
tmpdir=worker-20101117-1538-fe9a
orker15-0erqqq1k host=LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu
2010-11-17 15:39:01,394-0600 INFO  Execute Submit: in:
worker-20101117-1538-fe9aq209 command: /bi
/_swiftwrap worker15-0erqqq1k -jobdir 0 -scratch  -e worker15 -out
stdout.txt -err stderr.txt -i
 -k  -cdmfile  -status files -a http://128.135.125.17:61015
SPRACE_osg-ce.sprace.org.br /tmp 7200
2010-11-17 15:39:01,394-0600 INFO  GridExec TASK_DEFINITION:
Task(type=JOB_SUBMISSION, identity=u
-1-1290029938030) is /bin/bash shared/_swiftwrap worker15-0erqqq1k
-jobdir 0 -scratch  -e worker1
.txt -err stderr.txt -i -d  -if  -of  -k  -cdmfile  -status files -a
http://128.135.125.17:61015
.sprace.org.br /tmp 7200
2010-11-17 16:49:33,106-0600 DEBUG vdl:checkjobstatus START
jobid=worker15-0erqqq1k
2010-11-17 16:49:33,278-0600 INFO  vdl:checkjobstatus FAILURE
jobid=worker15-0erqqq1k - Failure f
2010-11-17 16:49:38,180-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION
jobid=worker15-0erqqq1k - A
ception: Cannot find executable worker15 on site system path

There is no entry for worker15 for the site LIGO_UWM_NEMO in my tc.data

-Allan

-- 
Allan M. Espinosa <http://amespinosa.wordpress.com>
PhD student, Computer Science
University of Chicago <http://people.cs.uchicago.edu/~aespinosa>


From hategan at mcs.anl.gov  Thu Nov 18 16:03:09 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 18 Nov 2010 14:03:09 -0800
Subject: [Swift-devel] misassignment of jobs
In-Reply-To: <AANLkTimQobNKo+inP6Z_fSq8bEOUX2PoWcn5hq-FW=G7@mail.gmail.com>
References: <AANLkTimQobNKo+inP6Z_fSq8bEOUX2PoWcn5hq-FW=G7@mail.gmail.com>
Message-ID: <1290117789.30414.1.camel@blabla2.none>

I'm sure there is a reasonable explanation for this.

Can you post your entire tc.data? And to make sure we're talking about
the right one, can you look at the swift log and use exactly the one
that swift claims is using?

Mihael

On Thu, 2010-11-18 at 14:39 -0600, Allan Espinosa wrote:
> tc.data for worker15:
> SPRACE_osg-ce.sprace.org.br  worker15 /osg/app/engage/scec/worker.pl
>    INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
> 
> But it was assigned to another site instead:
> $ grep 0erqqq1k worker-*.log
> 2010-11-17 15:38:58,804-0600 DEBUG vdl:execute2 THREAD_ASSOCIATION
> jobid=worker15-0erqqq1k thread
>  host=LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu replicationGroup=2pnqqq1k
> 2010-11-17 15:38:59,110-0600 INFO  vdl:createdirset START
> jobid=worker15-0erqqq1k host=LIGO_UWM_N
> ce.phys.uwm.edu - Initializing directory structure
> 2010-11-17 15:38:59,137-0600 INFO  vdl:createdirset END
> jobid=worker15-0erqqq1k - Done initializi
> structure
> 2010-11-17 15:38:59,172-0600 INFO  vdl:dostagein START
> jobid=worker15-0erqqq1k - Staging in files
> 2010-11-17 15:38:59,257-0600 INFO  vdl:dostagein END
> jobid=worker15-0erqqq1k - Staging in finishe
> 2010-11-17 15:38:59,323-0600 DEBUG vdl:execute2 JOB_START
> jobid=worker15-0erqqq1k tr=worker15 arg
> //128.135.125.17:61015, SPRACE_osg-ce.sprace.org.br, /tmp, 7200]
> tmpdir=worker-20101117-1538-fe9a
> orker15-0erqqq1k host=LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu
> 2010-11-17 15:39:01,394-0600 INFO  Execute Submit: in:
> worker-20101117-1538-fe9aq209 command: /bi
> /_swiftwrap worker15-0erqqq1k -jobdir 0 -scratch  -e worker15 -out
> stdout.txt -err stderr.txt -i
>  -k  -cdmfile  -status files -a http://128.135.125.17:61015
> SPRACE_osg-ce.sprace.org.br /tmp 7200
> 2010-11-17 15:39:01,394-0600 INFO  GridExec TASK_DEFINITION:
> Task(type=JOB_SUBMISSION, identity=u
> -1-1290029938030) is /bin/bash shared/_swiftwrap worker15-0erqqq1k
> -jobdir 0 -scratch  -e worker1
> .txt -err stderr.txt -i -d  -if  -of  -k  -cdmfile  -status files -a
> http://128.135.125.17:61015
> .sprace.org.br /tmp 7200
> 2010-11-17 16:49:33,106-0600 DEBUG vdl:checkjobstatus START
> jobid=worker15-0erqqq1k
> 2010-11-17 16:49:33,278-0600 INFO  vdl:checkjobstatus FAILURE
> jobid=worker15-0erqqq1k - Failure f
> 2010-11-17 16:49:38,180-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION
> jobid=worker15-0erqqq1k - A
> ception: Cannot find executable worker15 on site system path
> 
> There is no entry for worker15 for the site LIGO_UWM_NEMO in my tc.data
> 
> -Allan
> 


From aespinosa at cs.uchicago.edu  Thu Nov 18 16:08:56 2010
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Thu, 18 Nov 2010 16:08:56 -0600
Subject: [Swift-devel] misassignment of jobs
In-Reply-To: <1290117789.30414.1.camel@blabla2.none>
References: <AANLkTimQobNKo+inP6Z_fSq8bEOUX2PoWcn5hq-FW=G7@mail.gmail.com>
	<1290117789.30414.1.camel@blabla2.none>
Message-ID: <AANLkTinnz50nSrqCMKkOOKGxHRPDCjhypqg9kz0Zv22n@mail.gmail.com>

i'm using a file named tc.data

2010-11-17 15:38:50,115-0600 INFO  unknown Using tc.data: tc.data
$cat tc.data
PADS  sleep_pads     /bin/sleep      INSTALLED INTEL32::LINUX
GLOBUS::maxwalltime="00:05:00"

BNL-ATLAS_gridgk01.racf.bnl.gov  worker0
/usatlas/OSG/engage-scec/worker.pl      INSTALLED INTEL32::LINUX
GLOBUS::maxwalltime="02:00:00"
BNL-ATLAS_gridgk01.racf.bnl.gov  sleep0  /bin/sleep
INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
BNL-ATLAS_gridgk01.racf.bnl.gov  sleep            /bin/sleep
       INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"

BNL-ATLAS_gridgk02.racf.bnl.gov  worker1
/usatlas/OSG/engage-scec/worker.pl      INSTALLED INTEL32::LINUX
GLOBUS::maxwalltime="02:00:00"
BNL-ATLAS_gridgk02.racf.bnl.gov  sleep1  /bin/sleep
INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
BNL-ATLAS_gridgk02.racf.bnl.gov  sleep            /bin/sleep
       INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"

FNAL_FERMIGRID_fnpcosg1.fnal.gov  worker2
/grid/app/engage/scec/worker.pl      INSTALLED INTEL32::LINUX
GLOBUS::maxwalltime="02:00:00"
FNAL_FERMIGRID_fnpcosg1.fnal.gov  sleep2  /bin/sleep
INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
FNAL_FERMIGRID_fnpcosg1.fnal.gov  sleep            /bin/sleep
        INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"

Firefly_ff-grid3.unl.edu  worker3
/panfs/panasas/CMS/app/engage/scec/worker.pl      INSTALLED
INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
Firefly_ff-grid3.unl.edu  sleep3  /bin/sleep
INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
Firefly_ff-grid3.unl.edu  sleep            /bin/sleep
INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"

GridUNESP_CENTRAL_ce.grid.unesp.br  worker4 /osg/app/worker.pl
INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
GridUNESP_CENTRAL_ce.grid.unesp.br  sleep4  /bin/sleep
 INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
GridUNESP_CENTRAL_ce.grid.unesp.br  sleep            /bin/sleep
          INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"

LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu  worker5
/opt/osg/app/engage/scec/worker.pl      INSTALLED INTEL32::LINUX
GLOBUS::maxwalltime="02:00:00"
LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu  sleep5  /bin/sleep
     INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu  sleep            /bin/sleep
              INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"

MIT_CMS_ce01.cmsaf.mit.edu  worker6 /osg/app/engage/scec/worker.pl
 INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
MIT_CMS_ce01.cmsaf.mit.edu  sleep6  /bin/sleep
INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
MIT_CMS_ce01.cmsaf.mit.edu  sleep            /bin/sleep
  INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"

MIT_CMS_ce02.cmsaf.mit.edu  worker7 /osg/app/engage/scec/worker.pl
 INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
MIT_CMS_ce02.cmsaf.mit.edu  sleep7  /bin/sleep
INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
MIT_CMS_ce02.cmsaf.mit.edu  sleep            /bin/sleep
  INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"

NYSGRID_CORNELL_NYS1_nys1.cac.cornell.edu  worker8
/grid-tmp/grid-apps/engage/scec/worker.pl      INSTALLED
INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
NYSGRID_CORNELL_NYS1_nys1.cac.cornell.edu  sleep8  /bin/sleep
        INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
NYSGRID_CORNELL_NYS1_nys1.cac.cornell.edu  sleep            /bin/sleep
                 INSTALLED INTEL32::LINUX
GLOBUS::maxwalltime="00:05:00"

Nebraska_gpn-husker.unl.edu  worker9
/opt/osg/app/engage/scec/worker.pl      INSTALLED INTEL32::LINUX
GLOBUS::maxwalltime="02:00:00"
Nebraska_gpn-husker.unl.edu  sleep9  /bin/sleep
INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
Nebraska_gpn-husker.unl.edu  sleep            /bin/sleep
   INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"

Nebraska_red.unl.edu  worker10 /opt/osg/app/engage/scec/worker.pl
INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
Nebraska_red.unl.edu  sleep10  /bin/sleep                  INSTALLED
INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
Nebraska_red.unl.edu  sleep            /bin/sleep
INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"

Prairiefire_pf-grid.unl.edu  worker11
/opt/pfgridapp/engage/scec/worker.pl      INSTALLED INTEL32::LINUX
GLOBUS::maxwalltime="02:00:00"
Prairiefire_pf-grid.unl.edu  sleep11  /bin/sleep
INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
Prairiefire_pf-grid.unl.edu  sleep            /bin/sleep
   INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"

Purdue-RCAC_osg.rcac.purdue.edu  worker12
/apps/osg/engage/scec/worker.pl      INSTALLED INTEL32::LINUX
GLOBUS::maxwalltime="02:00:00"
Purdue-RCAC_osg.rcac.purdue.edu  sleep12  /bin/sleep
INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
Purdue-RCAC_osg.rcac.purdue.edu  sleep            /bin/sleep
       INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"

RENCI-Engagement_belhaven-1.renci.org  worker13
/nfs/osg-app/engage/scec/worker.pl      INSTALLED INTEL32::LINUX
GLOBUS::maxwalltime="02:00:00"
RENCI-Engagement_belhaven-1.renci.org  sleep13  /bin/sleep
     INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
RENCI-Engagement_belhaven-1.renci.org  sleep            /bin/sleep
             INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"

SBGrid-Harvard-East_osg-east.hms.harvard.edu  worker14
/osg/storage/app/engage/scec/worker.pl      INSTALLED INTEL32::LINUX
GLOBUS::maxwalltime="02:00:00"
SBGrid-Harvard-East_osg-east.hms.harvard.edu  sleep14  /bin/sleep
            INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
SBGrid-Harvard-East_osg-east.hms.harvard.edu  sleep
/bin/sleep                  INSTALLED INTEL32::LINUX
GLOBUS::maxwalltime="00:05:00"

SPRACE_osg-ce.sprace.org.br  worker15 /osg/app/engage/scec/worker.pl
   INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
SPRACE_osg-ce.sprace.org.br  sleep15  /bin/sleep
INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
SPRACE_osg-ce.sprace.org.br  sleep            /bin/sleep
   INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"

UCHC_CBG_vdgateway.vcell.uchc.edu  worker16
/osg/app/engage/scec/worker.pl      INSTALLED INTEL32::LINUX
GLOBUS::maxwalltime="02:00:00"
UCHC_CBG_vdgateway.vcell.uchc.edu  sleep16  /bin/sleep
 INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
UCHC_CBG_vdgateway.vcell.uchc.edu  sleep            /bin/sleep
         INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"

UCR-HEP_top.ucr.edu  worker17
/data/bottom/osg_app/engage/scec/worker.pl      INSTALLED
INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
UCR-HEP_top.ucr.edu  sleep17  /bin/sleep                  INSTALLED
INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
UCR-HEP_top.ucr.edu  sleep            /bin/sleep
INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"

UFlorida-HPC_osg.hpc.ufl.edu  worker18 /osg/app/engage/scec/worker.pl
    INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
UFlorida-HPC_osg.hpc.ufl.edu  sleep18  /bin/sleep
INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
UFlorida-HPC_osg.hpc.ufl.edu  sleep            /bin/sleep
    INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"

UFlorida-PG_pg.ihepa.ufl.edu  worker19
/raid/osgpg/pg/app/engage/scec/worker.pl      INSTALLED INTEL32::LINUX
GLOBUS::maxwalltime="02:00:00"
UFlorida-PG_pg.ihepa.ufl.edu  sleep19  /bin/sleep
INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
UFlorida-PG_pg.ihepa.ufl.edu  sleep            /bin/sleep
    INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"

UMissHEP_umiss001.hep.olemiss.edu  worker20
/osgremote/osg_app/engage/scec/worker.pl      INSTALLED INTEL32::LINUX
GLOBUS::maxwalltime="02:00:00"
UMissHEP_umiss001.hep.olemiss.edu  sleep20  /bin/sleep
 INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
UMissHEP_umiss001.hep.olemiss.edu  sleep            /bin/sleep
         INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"

UTA_SWT2_gk04.swt2.uta.edu  worker21
/cluster/grid/app/engage/scec/worker.pl      INSTALLED INTEL32::LINUX
GLOBUS::maxwalltime="02:00:00"
UTA_SWT2_gk04.swt2.uta.edu  sleep21  /bin/sleep
INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
UTA_SWT2_gk04.swt2.uta.edu  sleep            /bin/sleep
  INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"

WQCG-Harvard-OSG_tuscany.med.harvard.edu  worker22
/osg/storage/app/engage/scec/worker.pl      INSTALLED INTEL32::LINUX
GLOBUS::maxwalltime="02:00:00"
WQCG-Harvard-OSG_tuscany.med.harvard.edu  sleep22  /bin/sleep
        INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
WQCG-Harvard-OSG_tuscany.med.harvard.edu  sleep            /bin/sleep
                INSTALLED INTEL32::LINUX
GLOBUS::maxwalltime="00:05:00"

2010/11/18 Mihael Hategan <hategan at mcs.anl.gov>:
> I'm sure there is a reasonable explanation for this.
>
> Can you post your entire tc.data? And to make sure we're talking about
> the right one, can you look at the swift log and use exactly the one
> that swift claims is using?
>
> Mihael
>
> On Thu, 2010-11-18 at 14:39 -0600, Allan Espinosa wrote:
>> tc.data for worker15:
>> SPRACE_osg-ce.sprace.org.br ?worker15 /osg/app/engage/scec/worker.pl
>> ? ?INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
>>
>> But it was assigned to another site instead:
>> $ grep 0erqqq1k worker-*.log
>> 2010-11-17 15:38:58,804-0600 DEBUG vdl:execute2 THREAD_ASSOCIATION
>> jobid=worker15-0erqqq1k thread
>> ?host=LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu replicationGroup=2pnqqq1k
>> 2010-11-17 15:38:59,110-0600 INFO ?vdl:createdirset START
>> jobid=worker15-0erqqq1k host=LIGO_UWM_N
>> ce.phys.uwm.edu - Initializing directory structure
>> 2010-11-17 15:38:59,137-0600 INFO ?vdl:createdirset END
>> jobid=worker15-0erqqq1k - Done initializi
>> structure
>> 2010-11-17 15:38:59,172-0600 INFO ?vdl:dostagein START
>> jobid=worker15-0erqqq1k - Staging in files
>> 2010-11-17 15:38:59,257-0600 INFO ?vdl:dostagein END
>> jobid=worker15-0erqqq1k - Staging in finishe
>> 2010-11-17 15:38:59,323-0600 DEBUG vdl:execute2 JOB_START
>> jobid=worker15-0erqqq1k tr=worker15 arg
>> //128.135.125.17:61015, SPRACE_osg-ce.sprace.org.br, /tmp, 7200]
>> tmpdir=worker-20101117-1538-fe9a
>> orker15-0erqqq1k host=LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu
>> 2010-11-17 15:39:01,394-0600 INFO ?Execute Submit: in:
>> worker-20101117-1538-fe9aq209 command: /bi
>> /_swiftwrap worker15-0erqqq1k -jobdir 0 -scratch ?-e worker15 -out
>> stdout.txt -err stderr.txt -i
>> ?-k ?-cdmfile ?-status files -a http://128.135.125.17:61015
>> SPRACE_osg-ce.sprace.org.br /tmp 7200
>> 2010-11-17 15:39:01,394-0600 INFO ?GridExec TASK_DEFINITION:
>> Task(type=JOB_SUBMISSION, identity=u
>> -1-1290029938030) is /bin/bash shared/_swiftwrap worker15-0erqqq1k
>> -jobdir 0 -scratch ?-e worker1
>> .txt -err stderr.txt -i -d ?-if ?-of ?-k ?-cdmfile ?-status files -a
>> http://128.135.125.17:61015
>> .sprace.org.br /tmp 7200
>> 2010-11-17 16:49:33,106-0600 DEBUG vdl:checkjobstatus START
>> jobid=worker15-0erqqq1k
>> 2010-11-17 16:49:33,278-0600 INFO ?vdl:checkjobstatus FAILURE
>> jobid=worker15-0erqqq1k - Failure f
>> 2010-11-17 16:49:38,180-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION
>> jobid=worker15-0erqqq1k - A
>> ception: Cannot find executable worker15 on site system path
>>
>> There is no entry for worker15 for the site LIGO_UWM_NEMO in my tc.data
>>
>> -Allan
>>
>
>
>
>


-- 
Allan M. Espinosa <http://amespinosa.wordpress.com>
PhD student, Computer Science
University of Chicago <http://people.cs.uchicago.edu/~aespinosa>


From hategan at mcs.anl.gov  Thu Nov 18 17:39:30 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 18 Nov 2010 15:39:30 -0800
Subject: [Swift-devel] misassignment of jobs
In-Reply-To: <AANLkTinnz50nSrqCMKkOOKGxHRPDCjhypqg9kz0Zv22n@mail.gmail.com>
References: <AANLkTimQobNKo+inP6Z_fSq8bEOUX2PoWcn5hq-FW=G7@mail.gmail.com>
	<1290117789.30414.1.camel@blabla2.none>
	<AANLkTinnz50nSrqCMKkOOKGxHRPDCjhypqg9kz0Zv22n@mail.gmail.com>
Message-ID: <1290123570.30658.0.camel@blabla2.none>

Ok. I can see a couple of code paths that can lead to this, but I need
to constrain it some more.

Does this happen every time you run this?

Mihael

On Thu, 2010-11-18 at 16:08 -0600, Allan Espinosa wrote:
> i'm using a file named tc.data
> 
> 2010-11-17 15:38:50,115-0600 INFO  unknown Using tc.data: tc.data
> $cat tc.data
> PADS  sleep_pads     /bin/sleep      INSTALLED INTEL32::LINUX
> GLOBUS::maxwalltime="00:05:00"
> 
> BNL-ATLAS_gridgk01.racf.bnl.gov  worker0
> /usatlas/OSG/engage-scec/worker.pl      INSTALLED INTEL32::LINUX
> GLOBUS::maxwalltime="02:00:00"
> BNL-ATLAS_gridgk01.racf.bnl.gov  sleep0  /bin/sleep
> INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> BNL-ATLAS_gridgk01.racf.bnl.gov  sleep            /bin/sleep
>        INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> 
> BNL-ATLAS_gridgk02.racf.bnl.gov  worker1
> /usatlas/OSG/engage-scec/worker.pl      INSTALLED INTEL32::LINUX
> GLOBUS::maxwalltime="02:00:00"
> BNL-ATLAS_gridgk02.racf.bnl.gov  sleep1  /bin/sleep
> INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> BNL-ATLAS_gridgk02.racf.bnl.gov  sleep            /bin/sleep
>        INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> 
> FNAL_FERMIGRID_fnpcosg1.fnal.gov  worker2
> /grid/app/engage/scec/worker.pl      INSTALLED INTEL32::LINUX
> GLOBUS::maxwalltime="02:00:00"
> FNAL_FERMIGRID_fnpcosg1.fnal.gov  sleep2  /bin/sleep
> INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> FNAL_FERMIGRID_fnpcosg1.fnal.gov  sleep            /bin/sleep
>         INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> 
> Firefly_ff-grid3.unl.edu  worker3
> /panfs/panasas/CMS/app/engage/scec/worker.pl      INSTALLED
> INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
> Firefly_ff-grid3.unl.edu  sleep3  /bin/sleep
> INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> Firefly_ff-grid3.unl.edu  sleep            /bin/sleep
> INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> 
> GridUNESP_CENTRAL_ce.grid.unesp.br  worker4 /osg/app/worker.pl
> INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
> GridUNESP_CENTRAL_ce.grid.unesp.br  sleep4  /bin/sleep
>  INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> GridUNESP_CENTRAL_ce.grid.unesp.br  sleep            /bin/sleep
>           INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> 
> LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu  worker5
> /opt/osg/app/engage/scec/worker.pl      INSTALLED INTEL32::LINUX
> GLOBUS::maxwalltime="02:00:00"
> LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu  sleep5  /bin/sleep
>      INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu  sleep            /bin/sleep
>               INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> 
> MIT_CMS_ce01.cmsaf.mit.edu  worker6 /osg/app/engage/scec/worker.pl
>  INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
> MIT_CMS_ce01.cmsaf.mit.edu  sleep6  /bin/sleep
> INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> MIT_CMS_ce01.cmsaf.mit.edu  sleep            /bin/sleep
>   INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> 
> MIT_CMS_ce02.cmsaf.mit.edu  worker7 /osg/app/engage/scec/worker.pl
>  INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
> MIT_CMS_ce02.cmsaf.mit.edu  sleep7  /bin/sleep
> INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> MIT_CMS_ce02.cmsaf.mit.edu  sleep            /bin/sleep
>   INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> 
> NYSGRID_CORNELL_NYS1_nys1.cac.cornell.edu  worker8
> /grid-tmp/grid-apps/engage/scec/worker.pl      INSTALLED
> INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
> NYSGRID_CORNELL_NYS1_nys1.cac.cornell.edu  sleep8  /bin/sleep
>         INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> NYSGRID_CORNELL_NYS1_nys1.cac.cornell.edu  sleep            /bin/sleep
>                  INSTALLED INTEL32::LINUX
> GLOBUS::maxwalltime="00:05:00"
> 
> Nebraska_gpn-husker.unl.edu  worker9
> /opt/osg/app/engage/scec/worker.pl      INSTALLED INTEL32::LINUX
> GLOBUS::maxwalltime="02:00:00"
> Nebraska_gpn-husker.unl.edu  sleep9  /bin/sleep
> INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> Nebraska_gpn-husker.unl.edu  sleep            /bin/sleep
>    INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> 
> Nebraska_red.unl.edu  worker10 /opt/osg/app/engage/scec/worker.pl
> INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
> Nebraska_red.unl.edu  sleep10  /bin/sleep                  INSTALLED
> INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> Nebraska_red.unl.edu  sleep            /bin/sleep
> INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> 
> Prairiefire_pf-grid.unl.edu  worker11
> /opt/pfgridapp/engage/scec/worker.pl      INSTALLED INTEL32::LINUX
> GLOBUS::maxwalltime="02:00:00"
> Prairiefire_pf-grid.unl.edu  sleep11  /bin/sleep
> INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> Prairiefire_pf-grid.unl.edu  sleep            /bin/sleep
>    INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> 
> Purdue-RCAC_osg.rcac.purdue.edu  worker12
> /apps/osg/engage/scec/worker.pl      INSTALLED INTEL32::LINUX
> GLOBUS::maxwalltime="02:00:00"
> Purdue-RCAC_osg.rcac.purdue.edu  sleep12  /bin/sleep
> INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> Purdue-RCAC_osg.rcac.purdue.edu  sleep            /bin/sleep
>        INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> 
> RENCI-Engagement_belhaven-1.renci.org  worker13
> /nfs/osg-app/engage/scec/worker.pl      INSTALLED INTEL32::LINUX
> GLOBUS::maxwalltime="02:00:00"
> RENCI-Engagement_belhaven-1.renci.org  sleep13  /bin/sleep
>      INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> RENCI-Engagement_belhaven-1.renci.org  sleep            /bin/sleep
>              INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> 
> SBGrid-Harvard-East_osg-east.hms.harvard.edu  worker14
> /osg/storage/app/engage/scec/worker.pl      INSTALLED INTEL32::LINUX
> GLOBUS::maxwalltime="02:00:00"
> SBGrid-Harvard-East_osg-east.hms.harvard.edu  sleep14  /bin/sleep
>             INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> SBGrid-Harvard-East_osg-east.hms.harvard.edu  sleep
> /bin/sleep                  INSTALLED INTEL32::LINUX
> GLOBUS::maxwalltime="00:05:00"
> 
> SPRACE_osg-ce.sprace.org.br  worker15 /osg/app/engage/scec/worker.pl
>    INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
> SPRACE_osg-ce.sprace.org.br  sleep15  /bin/sleep
> INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> SPRACE_osg-ce.sprace.org.br  sleep            /bin/sleep
>    INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> 
> UCHC_CBG_vdgateway.vcell.uchc.edu  worker16
> /osg/app/engage/scec/worker.pl      INSTALLED INTEL32::LINUX
> GLOBUS::maxwalltime="02:00:00"
> UCHC_CBG_vdgateway.vcell.uchc.edu  sleep16  /bin/sleep
>  INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> UCHC_CBG_vdgateway.vcell.uchc.edu  sleep            /bin/sleep
>          INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> 
> UCR-HEP_top.ucr.edu  worker17
> /data/bottom/osg_app/engage/scec/worker.pl      INSTALLED
> INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
> UCR-HEP_top.ucr.edu  sleep17  /bin/sleep                  INSTALLED
> INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> UCR-HEP_top.ucr.edu  sleep            /bin/sleep
> INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> 
> UFlorida-HPC_osg.hpc.ufl.edu  worker18 /osg/app/engage/scec/worker.pl
>     INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
> UFlorida-HPC_osg.hpc.ufl.edu  sleep18  /bin/sleep
> INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> UFlorida-HPC_osg.hpc.ufl.edu  sleep            /bin/sleep
>     INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> 
> UFlorida-PG_pg.ihepa.ufl.edu  worker19
> /raid/osgpg/pg/app/engage/scec/worker.pl      INSTALLED INTEL32::LINUX
> GLOBUS::maxwalltime="02:00:00"
> UFlorida-PG_pg.ihepa.ufl.edu  sleep19  /bin/sleep
> INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> UFlorida-PG_pg.ihepa.ufl.edu  sleep            /bin/sleep
>     INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> 
> UMissHEP_umiss001.hep.olemiss.edu  worker20
> /osgremote/osg_app/engage/scec/worker.pl      INSTALLED INTEL32::LINUX
> GLOBUS::maxwalltime="02:00:00"
> UMissHEP_umiss001.hep.olemiss.edu  sleep20  /bin/sleep
>  INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> UMissHEP_umiss001.hep.olemiss.edu  sleep            /bin/sleep
>          INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> 
> UTA_SWT2_gk04.swt2.uta.edu  worker21
> /cluster/grid/app/engage/scec/worker.pl      INSTALLED INTEL32::LINUX
> GLOBUS::maxwalltime="02:00:00"
> UTA_SWT2_gk04.swt2.uta.edu  sleep21  /bin/sleep
> INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> UTA_SWT2_gk04.swt2.uta.edu  sleep            /bin/sleep
>   INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> 
> WQCG-Harvard-OSG_tuscany.med.harvard.edu  worker22
> /osg/storage/app/engage/scec/worker.pl      INSTALLED INTEL32::LINUX
> GLOBUS::maxwalltime="02:00:00"
> WQCG-Harvard-OSG_tuscany.med.harvard.edu  sleep22  /bin/sleep
>         INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> WQCG-Harvard-OSG_tuscany.med.harvard.edu  sleep            /bin/sleep
>                 INSTALLED INTEL32::LINUX
> GLOBUS::maxwalltime="00:05:00"
> 
> 2010/11/18 Mihael Hategan <hategan at mcs.anl.gov>:
> > I'm sure there is a reasonable explanation for this.
> >
> > Can you post your entire tc.data? And to make sure we're talking about
> > the right one, can you look at the swift log and use exactly the one
> > that swift claims is using?
> >
> > Mihael
> >
> > On Thu, 2010-11-18 at 14:39 -0600, Allan Espinosa wrote:
> >> tc.data for worker15:
> >> SPRACE_osg-ce.sprace.org.br  worker15 /osg/app/engage/scec/worker.pl
> >>    INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
> >>
> >> But it was assigned to another site instead:
> >> $ grep 0erqqq1k worker-*.log
> >> 2010-11-17 15:38:58,804-0600 DEBUG vdl:execute2 THREAD_ASSOCIATION
> >> jobid=worker15-0erqqq1k thread
> >>  host=LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu replicationGroup=2pnqqq1k
> >> 2010-11-17 15:38:59,110-0600 INFO  vdl:createdirset START
> >> jobid=worker15-0erqqq1k host=LIGO_UWM_N
> >> ce.phys.uwm.edu - Initializing directory structure
> >> 2010-11-17 15:38:59,137-0600 INFO  vdl:createdirset END
> >> jobid=worker15-0erqqq1k - Done initializi
> >> structure
> >> 2010-11-17 15:38:59,172-0600 INFO  vdl:dostagein START
> >> jobid=worker15-0erqqq1k - Staging in files
> >> 2010-11-17 15:38:59,257-0600 INFO  vdl:dostagein END
> >> jobid=worker15-0erqqq1k - Staging in finishe
> >> 2010-11-17 15:38:59,323-0600 DEBUG vdl:execute2 JOB_START
> >> jobid=worker15-0erqqq1k tr=worker15 arg
> >> //128.135.125.17:61015, SPRACE_osg-ce.sprace.org.br, /tmp, 7200]
> >> tmpdir=worker-20101117-1538-fe9a
> >> orker15-0erqqq1k host=LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu
> >> 2010-11-17 15:39:01,394-0600 INFO  Execute Submit: in:
> >> worker-20101117-1538-fe9aq209 command: /bi
> >> /_swiftwrap worker15-0erqqq1k -jobdir 0 -scratch  -e worker15 -out
> >> stdout.txt -err stderr.txt -i
> >>  -k  -cdmfile  -status files -a http://128.135.125.17:61015
> >> SPRACE_osg-ce.sprace.org.br /tmp 7200
> >> 2010-11-17 15:39:01,394-0600 INFO  GridExec TASK_DEFINITION:
> >> Task(type=JOB_SUBMISSION, identity=u
> >> -1-1290029938030) is /bin/bash shared/_swiftwrap worker15-0erqqq1k
> >> -jobdir 0 -scratch  -e worker1
> >> .txt -err stderr.txt -i -d  -if  -of  -k  -cdmfile  -status files -a
> >> http://128.135.125.17:61015
> >> .sprace.org.br /tmp 7200
> >> 2010-11-17 16:49:33,106-0600 DEBUG vdl:checkjobstatus START
> >> jobid=worker15-0erqqq1k
> >> 2010-11-17 16:49:33,278-0600 INFO  vdl:checkjobstatus FAILURE
> >> jobid=worker15-0erqqq1k - Failure f
> >> 2010-11-17 16:49:38,180-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION
> >> jobid=worker15-0erqqq1k - A
> >> ception: Cannot find executable worker15 on site system path
> >>
> >> There is no entry for worker15 for the site LIGO_UWM_NEMO in my tc.data
> >>
> >> -Allan
> >>
> >
> >
> >
> >
> 
> 
> 


From hategan at mcs.anl.gov  Thu Nov 18 17:45:29 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 18 Nov 2010 15:45:29 -0800
Subject: [Swift-devel] misassignment of jobs
In-Reply-To: <AANLkTinnz50nSrqCMKkOOKGxHRPDCjhypqg9kz0Zv22n@mail.gmail.com>
References: <AANLkTimQobNKo+inP6Z_fSq8bEOUX2PoWcn5hq-FW=G7@mail.gmail.com>
	<1290117789.30414.1.camel@blabla2.none>
	<AANLkTinnz50nSrqCMKkOOKGxHRPDCjhypqg9kz0Zv22n@mail.gmail.com>
Message-ID: <1290123929.30658.1.camel@blabla2.none>

Also, can you post sites.xml and the full log?

On Thu, 2010-11-18 at 16:08 -0600, Allan Espinosa wrote:
> i'm using a file named tc.data
> 
> 2010-11-17 15:38:50,115-0600 INFO  unknown Using tc.data: tc.data
> $cat tc.data
> PADS  sleep_pads     /bin/sleep      INSTALLED INTEL32::LINUX
> GLOBUS::maxwalltime="00:05:00"
> 
> BNL-ATLAS_gridgk01.racf.bnl.gov  worker0
> /usatlas/OSG/engage-scec/worker.pl      INSTALLED INTEL32::LINUX
> GLOBUS::maxwalltime="02:00:00"
> BNL-ATLAS_gridgk01.racf.bnl.gov  sleep0  /bin/sleep
> INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> BNL-ATLAS_gridgk01.racf.bnl.gov  sleep            /bin/sleep
>        INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> 
> BNL-ATLAS_gridgk02.racf.bnl.gov  worker1
> /usatlas/OSG/engage-scec/worker.pl      INSTALLED INTEL32::LINUX
> GLOBUS::maxwalltime="02:00:00"
> BNL-ATLAS_gridgk02.racf.bnl.gov  sleep1  /bin/sleep
> INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> BNL-ATLAS_gridgk02.racf.bnl.gov  sleep            /bin/sleep
>        INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> 
> FNAL_FERMIGRID_fnpcosg1.fnal.gov  worker2
> /grid/app/engage/scec/worker.pl      INSTALLED INTEL32::LINUX
> GLOBUS::maxwalltime="02:00:00"
> FNAL_FERMIGRID_fnpcosg1.fnal.gov  sleep2  /bin/sleep
> INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> FNAL_FERMIGRID_fnpcosg1.fnal.gov  sleep            /bin/sleep
>         INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> 
> Firefly_ff-grid3.unl.edu  worker3
> /panfs/panasas/CMS/app/engage/scec/worker.pl      INSTALLED
> INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
> Firefly_ff-grid3.unl.edu  sleep3  /bin/sleep
> INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> Firefly_ff-grid3.unl.edu  sleep            /bin/sleep
> INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> 
> GridUNESP_CENTRAL_ce.grid.unesp.br  worker4 /osg/app/worker.pl
> INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
> GridUNESP_CENTRAL_ce.grid.unesp.br  sleep4  /bin/sleep
>  INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> GridUNESP_CENTRAL_ce.grid.unesp.br  sleep            /bin/sleep
>           INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> 
> LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu  worker5
> /opt/osg/app/engage/scec/worker.pl      INSTALLED INTEL32::LINUX
> GLOBUS::maxwalltime="02:00:00"
> LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu  sleep5  /bin/sleep
>      INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu  sleep            /bin/sleep
>               INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> 
> MIT_CMS_ce01.cmsaf.mit.edu  worker6 /osg/app/engage/scec/worker.pl
>  INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
> MIT_CMS_ce01.cmsaf.mit.edu  sleep6  /bin/sleep
> INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> MIT_CMS_ce01.cmsaf.mit.edu  sleep            /bin/sleep
>   INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> 
> MIT_CMS_ce02.cmsaf.mit.edu  worker7 /osg/app/engage/scec/worker.pl
>  INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
> MIT_CMS_ce02.cmsaf.mit.edu  sleep7  /bin/sleep
> INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> MIT_CMS_ce02.cmsaf.mit.edu  sleep            /bin/sleep
>   INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> 
> NYSGRID_CORNELL_NYS1_nys1.cac.cornell.edu  worker8
> /grid-tmp/grid-apps/engage/scec/worker.pl      INSTALLED
> INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
> NYSGRID_CORNELL_NYS1_nys1.cac.cornell.edu  sleep8  /bin/sleep
>         INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> NYSGRID_CORNELL_NYS1_nys1.cac.cornell.edu  sleep            /bin/sleep
>                  INSTALLED INTEL32::LINUX
> GLOBUS::maxwalltime="00:05:00"
> 
> Nebraska_gpn-husker.unl.edu  worker9
> /opt/osg/app/engage/scec/worker.pl      INSTALLED INTEL32::LINUX
> GLOBUS::maxwalltime="02:00:00"
> Nebraska_gpn-husker.unl.edu  sleep9  /bin/sleep
> INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> Nebraska_gpn-husker.unl.edu  sleep            /bin/sleep
>    INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> 
> Nebraska_red.unl.edu  worker10 /opt/osg/app/engage/scec/worker.pl
> INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
> Nebraska_red.unl.edu  sleep10  /bin/sleep                  INSTALLED
> INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> Nebraska_red.unl.edu  sleep            /bin/sleep
> INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> 
> Prairiefire_pf-grid.unl.edu  worker11
> /opt/pfgridapp/engage/scec/worker.pl      INSTALLED INTEL32::LINUX
> GLOBUS::maxwalltime="02:00:00"
> Prairiefire_pf-grid.unl.edu  sleep11  /bin/sleep
> INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> Prairiefire_pf-grid.unl.edu  sleep            /bin/sleep
>    INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> 
> Purdue-RCAC_osg.rcac.purdue.edu  worker12
> /apps/osg/engage/scec/worker.pl      INSTALLED INTEL32::LINUX
> GLOBUS::maxwalltime="02:00:00"
> Purdue-RCAC_osg.rcac.purdue.edu  sleep12  /bin/sleep
> INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> Purdue-RCAC_osg.rcac.purdue.edu  sleep            /bin/sleep
>        INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> 
> RENCI-Engagement_belhaven-1.renci.org  worker13
> /nfs/osg-app/engage/scec/worker.pl      INSTALLED INTEL32::LINUX
> GLOBUS::maxwalltime="02:00:00"
> RENCI-Engagement_belhaven-1.renci.org  sleep13  /bin/sleep
>      INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> RENCI-Engagement_belhaven-1.renci.org  sleep            /bin/sleep
>              INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> 
> SBGrid-Harvard-East_osg-east.hms.harvard.edu  worker14
> /osg/storage/app/engage/scec/worker.pl      INSTALLED INTEL32::LINUX
> GLOBUS::maxwalltime="02:00:00"
> SBGrid-Harvard-East_osg-east.hms.harvard.edu  sleep14  /bin/sleep
>             INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> SBGrid-Harvard-East_osg-east.hms.harvard.edu  sleep
> /bin/sleep                  INSTALLED INTEL32::LINUX
> GLOBUS::maxwalltime="00:05:00"
> 
> SPRACE_osg-ce.sprace.org.br  worker15 /osg/app/engage/scec/worker.pl
>    INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
> SPRACE_osg-ce.sprace.org.br  sleep15  /bin/sleep
> INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> SPRACE_osg-ce.sprace.org.br  sleep            /bin/sleep
>    INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> 
> UCHC_CBG_vdgateway.vcell.uchc.edu  worker16
> /osg/app/engage/scec/worker.pl      INSTALLED INTEL32::LINUX
> GLOBUS::maxwalltime="02:00:00"
> UCHC_CBG_vdgateway.vcell.uchc.edu  sleep16  /bin/sleep
>  INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> UCHC_CBG_vdgateway.vcell.uchc.edu  sleep            /bin/sleep
>          INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> 
> UCR-HEP_top.ucr.edu  worker17
> /data/bottom/osg_app/engage/scec/worker.pl      INSTALLED
> INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
> UCR-HEP_top.ucr.edu  sleep17  /bin/sleep                  INSTALLED
> INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> UCR-HEP_top.ucr.edu  sleep            /bin/sleep
> INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> 
> UFlorida-HPC_osg.hpc.ufl.edu  worker18 /osg/app/engage/scec/worker.pl
>     INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
> UFlorida-HPC_osg.hpc.ufl.edu  sleep18  /bin/sleep
> INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> UFlorida-HPC_osg.hpc.ufl.edu  sleep            /bin/sleep
>     INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> 
> UFlorida-PG_pg.ihepa.ufl.edu  worker19
> /raid/osgpg/pg/app/engage/scec/worker.pl      INSTALLED INTEL32::LINUX
> GLOBUS::maxwalltime="02:00:00"
> UFlorida-PG_pg.ihepa.ufl.edu  sleep19  /bin/sleep
> INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> UFlorida-PG_pg.ihepa.ufl.edu  sleep            /bin/sleep
>     INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> 
> UMissHEP_umiss001.hep.olemiss.edu  worker20
> /osgremote/osg_app/engage/scec/worker.pl      INSTALLED INTEL32::LINUX
> GLOBUS::maxwalltime="02:00:00"
> UMissHEP_umiss001.hep.olemiss.edu  sleep20  /bin/sleep
>  INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> UMissHEP_umiss001.hep.olemiss.edu  sleep            /bin/sleep
>          INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> 
> UTA_SWT2_gk04.swt2.uta.edu  worker21
> /cluster/grid/app/engage/scec/worker.pl      INSTALLED INTEL32::LINUX
> GLOBUS::maxwalltime="02:00:00"
> UTA_SWT2_gk04.swt2.uta.edu  sleep21  /bin/sleep
> INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> UTA_SWT2_gk04.swt2.uta.edu  sleep            /bin/sleep
>   INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> 
> WQCG-Harvard-OSG_tuscany.med.harvard.edu  worker22
> /osg/storage/app/engage/scec/worker.pl      INSTALLED INTEL32::LINUX
> GLOBUS::maxwalltime="02:00:00"
> WQCG-Harvard-OSG_tuscany.med.harvard.edu  sleep22  /bin/sleep
>         INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> WQCG-Harvard-OSG_tuscany.med.harvard.edu  sleep            /bin/sleep
>                 INSTALLED INTEL32::LINUX
> GLOBUS::maxwalltime="00:05:00"
> 
> 2010/11/18 Mihael Hategan <hategan at mcs.anl.gov>:
> > I'm sure there is a reasonable explanation for this.
> >
> > Can you post your entire tc.data? And to make sure we're talking about
> > the right one, can you look at the swift log and use exactly the one
> > that swift claims is using?
> >
> > Mihael
> >
> > On Thu, 2010-11-18 at 14:39 -0600, Allan Espinosa wrote:
> >> tc.data for worker15:
> >> SPRACE_osg-ce.sprace.org.br  worker15 /osg/app/engage/scec/worker.pl
> >>    INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
> >>
> >> But it was assigned to another site instead:
> >> $ grep 0erqqq1k worker-*.log
> >> 2010-11-17 15:38:58,804-0600 DEBUG vdl:execute2 THREAD_ASSOCIATION
> >> jobid=worker15-0erqqq1k thread
> >>  host=LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu replicationGroup=2pnqqq1k
> >> 2010-11-17 15:38:59,110-0600 INFO  vdl:createdirset START
> >> jobid=worker15-0erqqq1k host=LIGO_UWM_N
> >> ce.phys.uwm.edu - Initializing directory structure
> >> 2010-11-17 15:38:59,137-0600 INFO  vdl:createdirset END
> >> jobid=worker15-0erqqq1k - Done initializi
> >> structure
> >> 2010-11-17 15:38:59,172-0600 INFO  vdl:dostagein START
> >> jobid=worker15-0erqqq1k - Staging in files
> >> 2010-11-17 15:38:59,257-0600 INFO  vdl:dostagein END
> >> jobid=worker15-0erqqq1k - Staging in finishe
> >> 2010-11-17 15:38:59,323-0600 DEBUG vdl:execute2 JOB_START
> >> jobid=worker15-0erqqq1k tr=worker15 arg
> >> //128.135.125.17:61015, SPRACE_osg-ce.sprace.org.br, /tmp, 7200]
> >> tmpdir=worker-20101117-1538-fe9a
> >> orker15-0erqqq1k host=LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu
> >> 2010-11-17 15:39:01,394-0600 INFO  Execute Submit: in:
> >> worker-20101117-1538-fe9aq209 command: /bi
> >> /_swiftwrap worker15-0erqqq1k -jobdir 0 -scratch  -e worker15 -out
> >> stdout.txt -err stderr.txt -i
> >>  -k  -cdmfile  -status files -a http://128.135.125.17:61015
> >> SPRACE_osg-ce.sprace.org.br /tmp 7200
> >> 2010-11-17 15:39:01,394-0600 INFO  GridExec TASK_DEFINITION:
> >> Task(type=JOB_SUBMISSION, identity=u
> >> -1-1290029938030) is /bin/bash shared/_swiftwrap worker15-0erqqq1k
> >> -jobdir 0 -scratch  -e worker1
> >> .txt -err stderr.txt -i -d  -if  -of  -k  -cdmfile  -status files -a
> >> http://128.135.125.17:61015
> >> .sprace.org.br /tmp 7200
> >> 2010-11-17 16:49:33,106-0600 DEBUG vdl:checkjobstatus START
> >> jobid=worker15-0erqqq1k
> >> 2010-11-17 16:49:33,278-0600 INFO  vdl:checkjobstatus FAILURE
> >> jobid=worker15-0erqqq1k - Failure f
> >> 2010-11-17 16:49:38,180-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION
> >> jobid=worker15-0erqqq1k - A
> >> ception: Cannot find executable worker15 on site system path
> >>
> >> There is no entry for worker15 for the site LIGO_UWM_NEMO in my tc.data
> >>
> >> -Allan
> >>
> >
> >
> >
> >
> 
> 
> 


From aespinosa at cs.uchicago.edu  Thu Nov 18 18:15:03 2010
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Thu, 18 Nov 2010 18:15:03 -0600
Subject: [Swift-devel] misassignment of jobs
In-Reply-To: <1290123929.30658.1.camel@blabla2.none>
References: <AANLkTimQobNKo+inP6Z_fSq8bEOUX2PoWcn5hq-FW=G7@mail.gmail.com>
	<1290117789.30414.1.camel@blabla2.none>
	<AANLkTinnz50nSrqCMKkOOKGxHRPDCjhypqg9kz0Zv22n@mail.gmail.com>
	<1290123929.30658.1.camel@blabla2.none>
Message-ID: <AANLkTi=BjNMm44zz_jkqRVZf-GTp5O34vZRj+5ZPsa_d@mail.gmail.com>

2010/11/18 Mihael Hategan <hategan at mcs.anl.gov>:
> Also, can you post sites.xml and the full log?
>
> On Thu, 2010-11-18 at 16:08 -0600, Allan Espinosa wrote:
>> i'm using a file named tc.data
>>
>> 2010-11-17 15:38:50,115-0600 INFO ?unknown Using tc.data: tc.data
>> $cat tc.data
>> PADS ?sleep_pads ? ? /bin/sleep ? ? ?INSTALLED INTEL32::LINUX
>> GLOBUS::maxwalltime="00:05:00"
>>
>> BNL-ATLAS_gridgk01.racf.bnl.gov ?worker0
>> /usatlas/OSG/engage-scec/worker.pl ? ? ?INSTALLED INTEL32::LINUX
>> GLOBUS::maxwalltime="02:00:00"
>> BNL-ATLAS_gridgk01.racf.bnl.gov ?sleep0 ?/bin/sleep
>> INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>> BNL-ATLAS_gridgk01.racf.bnl.gov ?sleep ? ? ? ? ? ?/bin/sleep
>> ? ? ? ?INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>>
>> BNL-ATLAS_gridgk02.racf.bnl.gov ?worker1
>> /usatlas/OSG/engage-scec/worker.pl ? ? ?INSTALLED INTEL32::LINUX
>> GLOBUS::maxwalltime="02:00:00"
>> BNL-ATLAS_gridgk02.racf.bnl.gov ?sleep1 ?/bin/sleep
>> INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>> BNL-ATLAS_gridgk02.racf.bnl.gov ?sleep ? ? ? ? ? ?/bin/sleep
>> ? ? ? ?INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>>
>> FNAL_FERMIGRID_fnpcosg1.fnal.gov ?worker2
>> /grid/app/engage/scec/worker.pl ? ? ?INSTALLED INTEL32::LINUX
>> GLOBUS::maxwalltime="02:00:00"
>> FNAL_FERMIGRID_fnpcosg1.fnal.gov ?sleep2 ?/bin/sleep
>> INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>> FNAL_FERMIGRID_fnpcosg1.fnal.gov ?sleep ? ? ? ? ? ?/bin/sleep
>> ? ? ? ? INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>>
>> Firefly_ff-grid3.unl.edu ?worker3
>> /panfs/panasas/CMS/app/engage/scec/worker.pl ? ? ?INSTALLED
>> INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
>> Firefly_ff-grid3.unl.edu ?sleep3 ?/bin/sleep
>> INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>> Firefly_ff-grid3.unl.edu ?sleep ? ? ? ? ? ?/bin/sleep
>> INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>>
>> GridUNESP_CENTRAL_ce.grid.unesp.br ?worker4 /osg/app/worker.pl
>> INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
>> GridUNESP_CENTRAL_ce.grid.unesp.br ?sleep4 ?/bin/sleep
>> ?INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>> GridUNESP_CENTRAL_ce.grid.unesp.br ?sleep ? ? ? ? ? ?/bin/sleep
>> ? ? ? ? ? INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>>
>> LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu ?worker5
>> /opt/osg/app/engage/scec/worker.pl ? ? ?INSTALLED INTEL32::LINUX
>> GLOBUS::maxwalltime="02:00:00"
>> LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu ?sleep5 ?/bin/sleep
>> ? ? ?INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>> LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu ?sleep ? ? ? ? ? ?/bin/sleep
>> ? ? ? ? ? ? ? INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>>
>> MIT_CMS_ce01.cmsaf.mit.edu ?worker6 /osg/app/engage/scec/worker.pl
>> ?INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
>> MIT_CMS_ce01.cmsaf.mit.edu ?sleep6 ?/bin/sleep
>> INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>> MIT_CMS_ce01.cmsaf.mit.edu ?sleep ? ? ? ? ? ?/bin/sleep
>> ? INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>>
>> MIT_CMS_ce02.cmsaf.mit.edu ?worker7 /osg/app/engage/scec/worker.pl
>> ?INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
>> MIT_CMS_ce02.cmsaf.mit.edu ?sleep7 ?/bin/sleep
>> INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>> MIT_CMS_ce02.cmsaf.mit.edu ?sleep ? ? ? ? ? ?/bin/sleep
>> ? INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>>
>> NYSGRID_CORNELL_NYS1_nys1.cac.cornell.edu ?worker8
>> /grid-tmp/grid-apps/engage/scec/worker.pl ? ? ?INSTALLED
>> INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
>> NYSGRID_CORNELL_NYS1_nys1.cac.cornell.edu ?sleep8 ?/bin/sleep
>> ? ? ? ? INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>> NYSGRID_CORNELL_NYS1_nys1.cac.cornell.edu ?sleep ? ? ? ? ? ?/bin/sleep
>> ? ? ? ? ? ? ? ? ?INSTALLED INTEL32::LINUX
>> GLOBUS::maxwalltime="00:05:00"
>>
>> Nebraska_gpn-husker.unl.edu ?worker9
>> /opt/osg/app/engage/scec/worker.pl ? ? ?INSTALLED INTEL32::LINUX
>> GLOBUS::maxwalltime="02:00:00"
>> Nebraska_gpn-husker.unl.edu ?sleep9 ?/bin/sleep
>> INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>> Nebraska_gpn-husker.unl.edu ?sleep ? ? ? ? ? ?/bin/sleep
>> ? ?INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>>
>> Nebraska_red.unl.edu ?worker10 /opt/osg/app/engage/scec/worker.pl
>> INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
>> Nebraska_red.unl.edu ?sleep10 ?/bin/sleep ? ? ? ? ? ? ? ? ?INSTALLED
>> INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>> Nebraska_red.unl.edu ?sleep ? ? ? ? ? ?/bin/sleep
>> INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>>
>> Prairiefire_pf-grid.unl.edu ?worker11
>> /opt/pfgridapp/engage/scec/worker.pl ? ? ?INSTALLED INTEL32::LINUX
>> GLOBUS::maxwalltime="02:00:00"
>> Prairiefire_pf-grid.unl.edu ?sleep11 ?/bin/sleep
>> INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>> Prairiefire_pf-grid.unl.edu ?sleep ? ? ? ? ? ?/bin/sleep
>> ? ?INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>>
>> Purdue-RCAC_osg.rcac.purdue.edu ?worker12
>> /apps/osg/engage/scec/worker.pl ? ? ?INSTALLED INTEL32::LINUX
>> GLOBUS::maxwalltime="02:00:00"
>> Purdue-RCAC_osg.rcac.purdue.edu ?sleep12 ?/bin/sleep
>> INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>> Purdue-RCAC_osg.rcac.purdue.edu ?sleep ? ? ? ? ? ?/bin/sleep
>> ? ? ? ?INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>>
>> RENCI-Engagement_belhaven-1.renci.org ?worker13
>> /nfs/osg-app/engage/scec/worker.pl ? ? ?INSTALLED INTEL32::LINUX
>> GLOBUS::maxwalltime="02:00:00"
>> RENCI-Engagement_belhaven-1.renci.org ?sleep13 ?/bin/sleep
>> ? ? ?INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>> RENCI-Engagement_belhaven-1.renci.org ?sleep ? ? ? ? ? ?/bin/sleep
>> ? ? ? ? ? ? ?INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>>
>> SBGrid-Harvard-East_osg-east.hms.harvard.edu ?worker14
>> /osg/storage/app/engage/scec/worker.pl ? ? ?INSTALLED INTEL32::LINUX
>> GLOBUS::maxwalltime="02:00:00"
>> SBGrid-Harvard-East_osg-east.hms.harvard.edu ?sleep14 ?/bin/sleep
>> ? ? ? ? ? ? INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>> SBGrid-Harvard-East_osg-east.hms.harvard.edu ?sleep
>> /bin/sleep ? ? ? ? ? ? ? ? ?INSTALLED INTEL32::LINUX
>> GLOBUS::maxwalltime="00:05:00"
>>
>> SPRACE_osg-ce.sprace.org.br ?worker15 /osg/app/engage/scec/worker.pl
>> ? ?INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
>> SPRACE_osg-ce.sprace.org.br ?sleep15 ?/bin/sleep
>> INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>> SPRACE_osg-ce.sprace.org.br ?sleep ? ? ? ? ? ?/bin/sleep
>> ? ?INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>>
>> UCHC_CBG_vdgateway.vcell.uchc.edu ?worker16
>> /osg/app/engage/scec/worker.pl ? ? ?INSTALLED INTEL32::LINUX
>> GLOBUS::maxwalltime="02:00:00"
>> UCHC_CBG_vdgateway.vcell.uchc.edu ?sleep16 ?/bin/sleep
>> ?INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>> UCHC_CBG_vdgateway.vcell.uchc.edu ?sleep ? ? ? ? ? ?/bin/sleep
>> ? ? ? ? ?INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>>
>> UCR-HEP_top.ucr.edu ?worker17
>> /data/bottom/osg_app/engage/scec/worker.pl ? ? ?INSTALLED
>> INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
>> UCR-HEP_top.ucr.edu ?sleep17 ?/bin/sleep ? ? ? ? ? ? ? ? ?INSTALLED
>> INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>> UCR-HEP_top.ucr.edu ?sleep ? ? ? ? ? ?/bin/sleep
>> INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>>
>> UFlorida-HPC_osg.hpc.ufl.edu ?worker18 /osg/app/engage/scec/worker.pl
>> ? ? INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
>> UFlorida-HPC_osg.hpc.ufl.edu ?sleep18 ?/bin/sleep
>> INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>> UFlorida-HPC_osg.hpc.ufl.edu ?sleep ? ? ? ? ? ?/bin/sleep
>> ? ? INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>>
>> UFlorida-PG_pg.ihepa.ufl.edu ?worker19
>> /raid/osgpg/pg/app/engage/scec/worker.pl ? ? ?INSTALLED INTEL32::LINUX
>> GLOBUS::maxwalltime="02:00:00"
>> UFlorida-PG_pg.ihepa.ufl.edu ?sleep19 ?/bin/sleep
>> INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>> UFlorida-PG_pg.ihepa.ufl.edu ?sleep ? ? ? ? ? ?/bin/sleep
>> ? ? INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>>
>> UMissHEP_umiss001.hep.olemiss.edu ?worker20
>> /osgremote/osg_app/engage/scec/worker.pl ? ? ?INSTALLED INTEL32::LINUX
>> GLOBUS::maxwalltime="02:00:00"
>> UMissHEP_umiss001.hep.olemiss.edu ?sleep20 ?/bin/sleep
>> ?INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>> UMissHEP_umiss001.hep.olemiss.edu ?sleep ? ? ? ? ? ?/bin/sleep
>> ? ? ? ? ?INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>>
>> UTA_SWT2_gk04.swt2.uta.edu ?worker21
>> /cluster/grid/app/engage/scec/worker.pl ? ? ?INSTALLED INTEL32::LINUX
>> GLOBUS::maxwalltime="02:00:00"
>> UTA_SWT2_gk04.swt2.uta.edu ?sleep21 ?/bin/sleep
>> INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>> UTA_SWT2_gk04.swt2.uta.edu ?sleep ? ? ? ? ? ?/bin/sleep
>> ? INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>>
>> WQCG-Harvard-OSG_tuscany.med.harvard.edu ?worker22
>> /osg/storage/app/engage/scec/worker.pl ? ? ?INSTALLED INTEL32::LINUX
>> GLOBUS::maxwalltime="02:00:00"
>> WQCG-Harvard-OSG_tuscany.med.harvard.edu ?sleep22 ?/bin/sleep
>> ? ? ? ? INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>> WQCG-Harvard-OSG_tuscany.med.harvard.edu ?sleep ? ? ? ? ? ?/bin/sleep
>> ? ? ? ? ? ? ? ? INSTALLED INTEL32::LINUX
>> GLOBUS::maxwalltime="00:05:00"
>>
>> 2010/11/18 Mihael Hategan <hategan at mcs.anl.gov>:
>> > I'm sure there is a reasonable explanation for this.
>> >
>> > Can you post your entire tc.data? And to make sure we're talking about
>> > the right one, can you look at the swift log and use exactly the one
>> > that swift claims is using?
>> >
>> > Mihael
>> >
>> > On Thu, 2010-11-18 at 14:39 -0600, Allan Espinosa wrote:
>> >> tc.data for worker15:
>> >> SPRACE_osg-ce.sprace.org.br ?worker15 /osg/app/engage/scec/worker.pl
>> >> ? ?INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
>> >>
>> >> But it was assigned to another site instead:
>> >> $ grep 0erqqq1k worker-*.log
>> >> 2010-11-17 15:38:58,804-0600 DEBUG vdl:execute2 THREAD_ASSOCIATION
>> >> jobid=worker15-0erqqq1k thread
>> >> ?host=LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu replicationGroup=2pnqqq1k
>> >> 2010-11-17 15:38:59,110-0600 INFO ?vdl:createdirset START
>> >> jobid=worker15-0erqqq1k host=LIGO_UWM_N
>> >> ce.phys.uwm.edu - Initializing directory structure
>> >> 2010-11-17 15:38:59,137-0600 INFO ?vdl:createdirset END
>> >> jobid=worker15-0erqqq1k - Done initializi
>> >> structure
>> >> 2010-11-17 15:38:59,172-0600 INFO ?vdl:dostagein START
>> >> jobid=worker15-0erqqq1k - Staging in files
>> >> 2010-11-17 15:38:59,257-0600 INFO ?vdl:dostagein END
>> >> jobid=worker15-0erqqq1k - Staging in finishe
>> >> 2010-11-17 15:38:59,323-0600 DEBUG vdl:execute2 JOB_START
>> >> jobid=worker15-0erqqq1k tr=worker15 arg
>> >> //128.135.125.17:61015, SPRACE_osg-ce.sprace.org.br, /tmp, 7200]
>> >> tmpdir=worker-20101117-1538-fe9a
>> >> orker15-0erqqq1k host=LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu
>> >> 2010-11-17 15:39:01,394-0600 INFO ?Execute Submit: in:
>> >> worker-20101117-1538-fe9aq209 command: /bi
>> >> /_swiftwrap worker15-0erqqq1k -jobdir 0 -scratch ?-e worker15 -out
>> >> stdout.txt -err stderr.txt -i
>> >> ?-k ?-cdmfile ?-status files -a http://128.135.125.17:61015
>> >> SPRACE_osg-ce.sprace.org.br /tmp 7200
>> >> 2010-11-17 15:39:01,394-0600 INFO ?GridExec TASK_DEFINITION:
>> >> Task(type=JOB_SUBMISSION, identity=u
>> >> -1-1290029938030) is /bin/bash shared/_swiftwrap worker15-0erqqq1k
>> >> -jobdir 0 -scratch ?-e worker1
>> >> .txt -err stderr.txt -i -d ?-if ?-of ?-k ?-cdmfile ?-status files -a
>> >> http://128.135.125.17:61015
>> >> .sprace.org.br /tmp 7200
>> >> 2010-11-17 16:49:33,106-0600 DEBUG vdl:checkjobstatus START
>> >> jobid=worker15-0erqqq1k
>> >> 2010-11-17 16:49:33,278-0600 INFO ?vdl:checkjobstatus FAILURE
>> >> jobid=worker15-0erqqq1k - Failure f
>> >> 2010-11-17 16:49:38,180-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION
>> >> jobid=worker15-0erqqq1k - A
>> >> ception: Cannot find executable worker15 on site system path
>> >>
>> >> There is no entry for worker15 for the site LIGO_UWM_NEMO in my tc.data
>> >>
>> >> -Allan
>> >>
>> >
>> >
>> >
>> >
>>
>>
>>
>
>
>
>


-- 
Allan M. Espinosa <http://amespinosa.wordpress.com>
PhD student, Computer Science
University of Chicago <http://people.cs.uchicago.edu/~aespinosa>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: condor_osg.xml
Type: text/xml
Size: 12555 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20101118/b205a759/attachment.xml>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: worker-20101117-1538-fe9aq209.log.bz2
Type: application/x-bzip2
Size: 1584471 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20101118/b205a759/attachment.bin>

From aespinosa at cs.uchicago.edu  Thu Nov 18 19:38:18 2010
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Thu, 18 Nov 2010 19:38:18 -0600
Subject: [Swift-devel] <profile> info to provider
Message-ID: <AANLkTikuPY-q7VX7eotm8vjhBuktuuTWhCVT-a_syuHC@mail.gmail.com>

Hi,

I'm poking around the provider-coaster tree to be able to manually
specify the ports of the local service while the persistent coaster
service bugs are not yet ironed out.  Somewhere along the "-localport"
patch I made before.

What's the reference again for extracting <profile> information from
the sites.xml file to the provider?

Thanks,
-Allan

-- 
Allan M. Espinosa <http://amespinosa.wordpress.com>
PhD student, Computer Science
University of Chicago <http://people.cs.uchicago.edu/~aespinosa>


From wilde at mcs.anl.gov  Thu Nov 18 19:46:12 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 18 Nov 2010 19:46:12 -0600 (CST)
Subject: [Swift-devel] <profile> info to provider
In-Reply-To: <AANLkTikuPY-q7VX7eotm8vjhBuktuuTWhCVT-a_syuHC@mail.gmail.com>
Message-ID: <243722423.89931.1290131172635.JavaMail.root@zimbra.anl.gov>

Allan, assuming you want to fetch this from Java (as opposed to Karajan in vdl-int.k) there are a few examples in the code snip I posted for David a few days ago:

+        Object countValue = getSpec().getAttribute("count");
+        int count;
+
+        if (countValue != null)
+            count = Integer.valueOf(String.valueOf(countValue)).intValue();
+        else 
+            count = 1;
+
+        // FIXME: wpn is only meaningful for coasters; is 1 ok otherwise?
+        // should we flag wpn as error if not coasters?
+
+        Object wpnValue = getAttribute(spec, "workerspernode", "1");
+        int wpn = Integer.valueOf(String.valueOf(wpnValue)).intValue();
+        logger.info("FETCH OF WPN: " + wpn); // FIXME: DB
+
+        count *= wpn;
+        logger.info("FETCH OF PE: " + getAttribute(spec, "pe", "NO pe"));
+        logger.info("FETCH OF CPN: " + getAttribute(spec, "corespernode", "NO cpn"));
+        writeAttrValue(String.valueOf(count), "-pe "
                 + getAttribute(spec, "pe", getSGEProperties().getDefaultPE())
-                + " ", wr, "1");
+                + " ", wr);


- Mike


----- Original Message -----
> Hi,
> 
> I'm poking around the provider-coaster tree to be able to manually
> specify the ports of the local service while the persistent coaster
> service bugs are not yet ironed out. Somewhere along the "-localport"
> patch I made before.
> 
> What's the reference again for extracting <profile> information from
> the sites.xml file to the provider?
> 
> Thanks,
> -Allan
> 
> --
> Allan M. Espinosa <http://amespinosa.wordpress.com>
> PhD student, Computer Science
> University of Chicago <http://people.cs.uchicago.edu/~aespinosa>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory


From hategan at mcs.anl.gov  Thu Nov 18 23:57:47 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 18 Nov 2010 21:57:47 -0800
Subject: [Swift-devel] misassignment of jobs
In-Reply-To: <AANLkTi=BjNMm44zz_jkqRVZf-GTp5O34vZRj+5ZPsa_d@mail.gmail.com>
References: <AANLkTimQobNKo+inP6Z_fSq8bEOUX2PoWcn5hq-FW=G7@mail.gmail.com>
	<1290117789.30414.1.camel@blabla2.none>
	<AANLkTinnz50nSrqCMKkOOKGxHRPDCjhypqg9kz0Zv22n@mail.gmail.com>
	<1290123929.30658.1.camel@blabla2.none>
	<AANLkTi=BjNMm44zz_jkqRVZf-GTp5O34vZRj+5ZPsa_d@mail.gmail.com>
Message-ID: <1290146267.2226.11.camel@blabla2.none>

I was ready to blame cosmic rays, but this seems to be a pretty common
occurrence in your log. So I'm on it.

mike at blabla2 tmp$ cat worker-20101117-1538-fe9aq209.log|grep JOB_START |
awk '{print $7 " " $13}'|sort|uniq
tr=worker0 host=BNL-ATLAS_gridgk01.racf.bnl.gov
tr=worker0 host=LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu
tr=worker1 host=BNL-ATLAS_gridgk02.racf.bnl.gov
tr=worker10 host=Nebraska_red.unl.edu
tr=worker11 host=Prairiefire_pf-grid.unl.edu
tr=worker12 host=Purdue-RCAC_osg.rcac.purdue.edu
tr=worker13 host=GridUNESP_CENTRAL_ce.grid.unesp.br
tr=worker13 host=RENCI-Engagement_belhaven-1.renci.org
tr=worker14 host=SBGrid-Harvard-East_osg-east.hms.harvard.edu
tr=worker15 host=LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu
tr=worker15 host=SPRACE_osg-ce.sprace.org.br
tr=worker16 host=UCHC_CBG_vdgateway.vcell.uchc.edu
tr=worker17 host=UCR-HEP_top.ucr.edu
tr=worker18 host=UFlorida-HPC_osg.hpc.ufl.edu
tr=worker19 host=UFlorida-PG_pg.ihepa.ufl.edu
tr=worker2 host=FNAL_FERMIGRID_fnpcosg1.fnal.gov
tr=worker20 host=MIT_CMS_ce01.cmsaf.mit.edu
tr=worker20 host=UMissHEP_umiss001.hep.olemiss.edu
tr=worker21 host=Firefly_ff-grid3.unl.edu
tr=worker21 host=Nebraska_red.unl.edu
tr=worker21 host=UTA_SWT2_gk04.swt2.uta.edu
tr=worker22 host=WQCG-Harvard-OSG_tuscany.med.harvard.edu
tr=worker3 host=Firefly_ff-grid3.unl.edu
tr=worker3 host=LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu
tr=worker4 host=GridUNESP_CENTRAL_ce.grid.unesp.br
tr=worker5 host=Firefly_ff-grid3.unl.edu
tr=worker5 host=LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu
tr=worker6 host=MIT_CMS_ce01.cmsaf.mit.edu
tr=worker7 host=LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu
tr=worker7 host=MIT_CMS_ce02.cmsaf.mit.edu
tr=worker8 host=NYSGRID_CORNELL_NYS1_nys1.cac.cornell.edu
tr=worker9 host=Nebraska_gpn-husker.unl.edu


From bugzilla-daemon at mcs.anl.gov  Sat Nov 20 17:17:26 2010
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Sat, 20 Nov 2010 17:17:26 -0600 (CST)
Subject: [Swift-devel] [Bug 231] New: ssh staging gives error if login
 scripts write to stdout
Message-ID: <bug-231-21@http.bugzilla.mcs.anl.gov/swift/>

https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=231

           Summary: ssh staging gives error if login scripts write to
                    stdout
           Product: Swift
           Version: unspecified
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P3
         Component: SwiftScript language
        AssignedTo: skenny at uchicago.edu
        ReportedBy: wilde at mcs.anl.gov


I found this in old notes.

Somewhat low prio as not many people use ssh staging. May be easy to re-create
it.

> > - if either .profile or .bashrc sends anything to stdout, you get
> this cryptic, mysterious message from swift:
> ...
> > Exception in thread "sftp subsystem 1" java.lang.OutOfMemoryError:
> Java heap space
> > 	at
> com.sshtools.j2ssh.subsystem.SubsystemClient.run(SubsystemClient.java:198)
> > 	at java.lang.Thread.run(Unknown Source)
> 
> That is funny. In other words a bug. Is there any easy way to
> reproduce
> that?

-- 
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the reporter.

From hategan at mcs.anl.gov  Sun Nov 21 16:56:26 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sun, 21 Nov 2010 14:56:26 -0800
Subject: [Swift-devel] misassignment of jobs
In-Reply-To: <1290146267.2226.11.camel@blabla2.none>
References: <AANLkTimQobNKo+inP6Z_fSq8bEOUX2PoWcn5hq-FW=G7@mail.gmail.com>
	<1290117789.30414.1.camel@blabla2.none>
	<AANLkTinnz50nSrqCMKkOOKGxHRPDCjhypqg9kz0Zv22n@mail.gmail.com>
	<1290123929.30658.1.camel@blabla2.none>
	<AANLkTi=BjNMm44zz_jkqRVZf-GTp5O34vZRj+5ZPsa_d@mail.gmail.com>
	<1290146267.2226.11.camel@blabla2.none>
Message-ID: <1290380186.26914.1.camel@blabla2.none>

Sadly though, I can't reproduce this.

Can you give me more details, such as the swift script, the version of
swift used, and anything that would be unusual compared to vanilla swift
use.

Mihael

On Thu, 2010-11-18 at 21:57 -0800, Mihael Hategan wrote:
> I was ready to blame cosmic rays, but this seems to be a pretty common
> occurrence in your log. So I'm on it.
> 
> mike at blabla2 tmp$ cat worker-20101117-1538-fe9aq209.log|grep JOB_START |
> awk '{print $7 " " $13}'|sort|uniq
> tr=worker0 host=BNL-ATLAS_gridgk01.racf.bnl.gov
> tr=worker0 host=LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu
> tr=worker1 host=BNL-ATLAS_gridgk02.racf.bnl.gov
> tr=worker10 host=Nebraska_red.unl.edu
> tr=worker11 host=Prairiefire_pf-grid.unl.edu
> tr=worker12 host=Purdue-RCAC_osg.rcac.purdue.edu
> tr=worker13 host=GridUNESP_CENTRAL_ce.grid.unesp.br
> tr=worker13 host=RENCI-Engagement_belhaven-1.renci.org
> tr=worker14 host=SBGrid-Harvard-East_osg-east.hms.harvard.edu
> tr=worker15 host=LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu
> tr=worker15 host=SPRACE_osg-ce.sprace.org.br
> tr=worker16 host=UCHC_CBG_vdgateway.vcell.uchc.edu
> tr=worker17 host=UCR-HEP_top.ucr.edu
> tr=worker18 host=UFlorida-HPC_osg.hpc.ufl.edu
> tr=worker19 host=UFlorida-PG_pg.ihepa.ufl.edu
> tr=worker2 host=FNAL_FERMIGRID_fnpcosg1.fnal.gov
> tr=worker20 host=MIT_CMS_ce01.cmsaf.mit.edu
> tr=worker20 host=UMissHEP_umiss001.hep.olemiss.edu
> tr=worker21 host=Firefly_ff-grid3.unl.edu
> tr=worker21 host=Nebraska_red.unl.edu
> tr=worker21 host=UTA_SWT2_gk04.swt2.uta.edu
> tr=worker22 host=WQCG-Harvard-OSG_tuscany.med.harvard.edu
> tr=worker3 host=Firefly_ff-grid3.unl.edu
> tr=worker3 host=LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu
> tr=worker4 host=GridUNESP_CENTRAL_ce.grid.unesp.br
> tr=worker5 host=Firefly_ff-grid3.unl.edu
> tr=worker5 host=LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu
> tr=worker6 host=MIT_CMS_ce01.cmsaf.mit.edu
> tr=worker7 host=LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu
> tr=worker7 host=MIT_CMS_ce02.cmsaf.mit.edu
> tr=worker8 host=NYSGRID_CORNELL_NYS1_nys1.cac.cornell.edu
> tr=worker9 host=Nebraska_gpn-husker.unl.edu
> 
> 
> 
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel


From wilde at mcs.anl.gov  Sun Nov 21 17:10:15 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Sun, 21 Nov 2010 17:10:15 -0600 (CST)
Subject: [Swift-devel] misassignment of jobs
In-Reply-To: <1290380186.26914.1.camel@blabla2.none>
Message-ID: <998123415.101613.1290381015065.JavaMail.root@zimbra.anl.gov>

Mihael,

If you're in fixin' mode, I'll spend some time now trying to reproduce the 3 coaster problems that are high on my "needed for users" list:

1. Swift hangs/fails talking to persistent server if it sites idle for a few minutes, even with large timeout values (which were possibly not set correctly or fully).

2. With normal coaster mode, if workers start toiming out for lack of work, the Swift run dies.

3. Errors in provider staging at high volume.

If you already have test cases for these issues, let me know, and I'll focus on the missing ones. But Im assuming for now you need all three.

- Mike


----- Original Message -----
> Sadly though, I can't reproduce this.
> 
> Can you give me more details, such as the swift script, the version of
> swift used, and anything that would be unusual compared to vanilla
> swift
> use.
> 
> Mihael
> 
> On Thu, 2010-11-18 at 21:57 -0800, Mihael Hategan wrote:
> > I was ready to blame cosmic rays, but this seems to be a pretty
> > common
> > occurrence in your log. So I'm on it.
> >
> > mike at blabla2 tmp$ cat worker-20101117-1538-fe9aq209.log|grep
> > JOB_START |
> > awk '{print $7 " " $13}'|sort|uniq
> > tr=worker0 host=BNL-ATLAS_gridgk01.racf.bnl.gov
> > tr=worker0 host=LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu
> > tr=worker1 host=BNL-ATLAS_gridgk02.racf.bnl.gov
> > tr=worker10 host=Nebraska_red.unl.edu
> > tr=worker11 host=Prairiefire_pf-grid.unl.edu
> > tr=worker12 host=Purdue-RCAC_osg.rcac.purdue.edu
> > tr=worker13 host=GridUNESP_CENTRAL_ce.grid.unesp.br
> > tr=worker13 host=RENCI-Engagement_belhaven-1.renci.org
> > tr=worker14 host=SBGrid-Harvard-East_osg-east.hms.harvard.edu
> > tr=worker15 host=LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu
> > tr=worker15 host=SPRACE_osg-ce.sprace.org.br
> > tr=worker16 host=UCHC_CBG_vdgateway.vcell.uchc.edu
> > tr=worker17 host=UCR-HEP_top.ucr.edu
> > tr=worker18 host=UFlorida-HPC_osg.hpc.ufl.edu
> > tr=worker19 host=UFlorida-PG_pg.ihepa.ufl.edu
> > tr=worker2 host=FNAL_FERMIGRID_fnpcosg1.fnal.gov
> > tr=worker20 host=MIT_CMS_ce01.cmsaf.mit.edu
> > tr=worker20 host=UMissHEP_umiss001.hep.olemiss.edu
> > tr=worker21 host=Firefly_ff-grid3.unl.edu
> > tr=worker21 host=Nebraska_red.unl.edu
> > tr=worker21 host=UTA_SWT2_gk04.swt2.uta.edu
> > tr=worker22 host=WQCG-Harvard-OSG_tuscany.med.harvard.edu
> > tr=worker3 host=Firefly_ff-grid3.unl.edu
> > tr=worker3 host=LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu
> > tr=worker4 host=GridUNESP_CENTRAL_ce.grid.unesp.br
> > tr=worker5 host=Firefly_ff-grid3.unl.edu
> > tr=worker5 host=LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu
> > tr=worker6 host=MIT_CMS_ce01.cmsaf.mit.edu
> > tr=worker7 host=LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu
> > tr=worker7 host=MIT_CMS_ce02.cmsaf.mit.edu
> > tr=worker8 host=NYSGRID_CORNELL_NYS1_nys1.cac.cornell.edu
> > tr=worker9 host=Nebraska_gpn-husker.unl.edu
> >
> >
> >
> > _______________________________________________
> > Swift-devel mailing list
> > Swift-devel at ci.uchicago.edu
> > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> 
> 
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory


From hategan at mcs.anl.gov  Sun Nov 21 19:31:18 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sun, 21 Nov 2010 17:31:18 -0800
Subject: [Swift-devel] misassignment of jobs
In-Reply-To: <998123415.101613.1290381015065.JavaMail.root@zimbra.anl.gov>
References: <998123415.101613.1290381015065.JavaMail.root@zimbra.anl.gov>
Message-ID: <1290389478.27403.1.camel@blabla2.none>

On Sun, 2010-11-21 at 17:10 -0600, Michael Wilde wrote:
> Mihael,
> 
> If you're in fixin' mode,

I've been in fixin' mode for the past two months :)

>  I'll spend some time now trying to reproduce the 3 coaster problems that are high on my "needed for users" list:
> 
> 1. Swift hangs/fails talking to persistent server if it sites idle for
> a few minutes, even with large timeout values (which were possibly not
> set correctly or fully).
> 
> 2. With normal coaster mode, if workers start toiming out for lack of work, the Swift run dies.

That one is addressed by removing the worker timeout. As I mentioned in
a previous email, that timeout is a artifact of an older worker
management scheme.

> 
> 3. Errors in provider staging at high volume.
> 
> If you already have test cases for these issues, let me know, and I'll
> focus on the missing ones. But Im assuming for now you need all three.

I have test cases for 1 and 3. I couldn't reproduce the problems so far.

Mihael


From wilde at mcs.anl.gov  Sun Nov 21 19:37:18 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Sun, 21 Nov 2010 19:37:18 -0600 (CST)
Subject: [Swift-devel] misassignment of jobs
In-Reply-To: <1290389478.27403.1.camel@blabla2.none>
Message-ID: <139095612.101784.1290389838078.JavaMail.root@zimbra.anl.gov>

OK, re bug 2: I didnt connect the symptoms of this issue with your earlier comments on timeouts, and just verified that you are correct: with the same extended timeouts I was using to try to keep a persistent coaster service up for an extended time, the failing case for bug 2 works.

I'll try to reproduce bug 1 now, then 3.

- Mike


----- Original Message -----
> On Sun, 2010-11-21 at 17:10 -0600, Michael Wilde wrote:
> > Mihael,
> >
> > If you're in fixin' mode,
> 
> I've been in fixin' mode for the past two months :)
> 
> >  I'll spend some time now trying to reproduce the 3 coaster problems
> >  that are high on my "needed for users" list:
> >
> > 1. Swift hangs/fails talking to persistent server if it sites idle
> > for
> > a few minutes, even with large timeout values (which were possibly
> > not
> > set correctly or fully).
> >
> > 2. With normal coaster mode, if workers start toiming out for lack
> > of work, the Swift run dies.
> 
> That one is addressed by removing the worker timeout. As I mentioned
> in
> a previous email, that timeout is a artifact of an older worker
> management scheme.
> 
> >
> > 3. Errors in provider staging at high volume.
> >
> > If you already have test cases for these issues, let me know, and
> > I'll
> > focus on the missing ones. But Im assuming for now you need all
> > three.
> 
> I have test cases for 1 and 3. I couldn't reproduce the problems so
> far.
> 
> Mihael

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory


From hategan at mcs.anl.gov  Sun Nov 21 20:37:48 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sun, 21 Nov 2010 18:37:48 -0800
Subject: [Swift-devel] misassignment of jobs
In-Reply-To: <139095612.101784.1290389838078.JavaMail.root@zimbra.anl.gov>
References: <139095612.101784.1290389838078.JavaMail.root@zimbra.anl.gov>
Message-ID: <1290393468.27675.0.camel@blabla2.none>

Ok. I will remove the idle timeouts from the worker. I do not expect any
negative consequences there given the reasoning I outlined before.

Mihael

On Sun, 2010-11-21 at 19:37 -0600, Michael Wilde wrote:
> OK, re bug 2: I didnt connect the symptoms of this issue with your earlier comments on timeouts, and just verified that you are correct: with the same extended timeouts I was using to try to keep a persistent coaster service up for an extended time, the failing case for bug 2 works.
> 
> I'll try to reproduce bug 1 now, then 3.
> 
> - Mike
> 
> 
> ----- Original Message -----
> > On Sun, 2010-11-21 at 17:10 -0600, Michael Wilde wrote:
> > > Mihael,
> > >
> > > If you're in fixin' mode,
> > 
> > I've been in fixin' mode for the past two months :)
> > 
> > >  I'll spend some time now trying to reproduce the 3 coaster problems
> > >  that are high on my "needed for users" list:
> > >
> > > 1. Swift hangs/fails talking to persistent server if it sites idle
> > > for
> > > a few minutes, even with large timeout values (which were possibly
> > > not
> > > set correctly or fully).
> > >
> > > 2. With normal coaster mode, if workers start toiming out for lack
> > > of work, the Swift run dies.
> > 
> > That one is addressed by removing the worker timeout. As I mentioned
> > in
> > a previous email, that timeout is a artifact of an older worker
> > management scheme.
> > 
> > >
> > > 3. Errors in provider staging at high volume.
> > >
> > > If you already have test cases for these issues, let me know, and
> > > I'll
> > > focus on the missing ones. But Im assuming for now you need all
> > > three.
> > 
> > I have test cases for 1 and 3. I couldn't reproduce the problems so
> > far.
> > 
> > Mihael
> 


From wilde at mcs.anl.gov  Sun Nov 21 20:45:38 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Sun, 21 Nov 2010 20:45:38 -0600 (CST)
Subject: [Swift-devel] misassignment of jobs
In-Reply-To: <1290393468.27675.0.camel@blabla2.none>
Message-ID: <180644543.101847.1290393938265.JavaMail.root@zimbra.anl.gov>

I was testing with the two mods below in place (long values in both worker timeout and service timeout).

- Mike

login1$ pwd
/scratch/local/wilde/swift/src/trunk.gomods/cog/modules/provider-coaster
login1$ 

login1$ svn diff
Index: src/org/globus/cog/abstraction/coaster/service/CoasterService.java
===================================================================
--- src/org/globus/cog/abstraction/coaster/service/CoasterService.java  (revision 2932)
+++ src/org/globus/cog/abstraction/coaster/service/CoasterService.java  (working copy)
@@ -41,7 +41,7 @@
     public static final Logger logger = Logger
             .getLogger(CoasterService.class);

-    public static final int IDLE_TIMEOUT = 120 * 1000;
+    public static final int IDLE_TIMEOUT = 120 * 1000 /* extend it: */ * 30 * 240;

     public static final int CONNECT_TIMEOUT = 2 * 60 * 1000;

Index: resources/worker.pl
===================================================================
--- resources/worker.pl (revision 2932)
+++ resources/worker.pl (working copy)
@@ -123,7 +123,7 @@
 my $URISTR=$ARGV[0];
 my $BLOCKID=$ARGV[1];
 my $LOGDIR=$ARGV[2];
-my $IDLETIMEOUT = ( $#ARGV <= 2 ) ? (4 * 60) : $ARGV[3];
+my $IDLETIMEOUT = ( $#ARGV <= 2 ) ? (4 * 60 * 60 * 24) : $ARGV[3];


 # REQUESTS holds a map of incoming requests
login1$ 


----- Original Message -----
> Ok. I will remove the idle timeouts from the worker. I do not expect
> any
> negative consequences there given the reasoning I outlined before.
> 
> Mihael
> 
> On Sun, 2010-11-21 at 19:37 -0600, Michael Wilde wrote:
> > OK, re bug 2: I didnt connect the symptoms of this issue with your
> > earlier comments on timeouts, and just verified that you are
> > correct: with the same extended timeouts I was using to try to keep
> > a persistent coaster service up for an extended time, the failing
> > case for bug 2 works.
> >
> > I'll try to reproduce bug 1 now, then 3.
> >
> > - Mike
> >
> >
> > ----- Original Message -----
> > > On Sun, 2010-11-21 at 17:10 -0600, Michael Wilde wrote:
> > > > Mihael,
> > > >
> > > > If you're in fixin' mode,
> > >
> > > I've been in fixin' mode for the past two months :)
> > >
> > > >  I'll spend some time now trying to reproduce the 3 coaster
> > > >  problems
> > > >  that are high on my "needed for users" list:
> > > >
> > > > 1. Swift hangs/fails talking to persistent server if it sites
> > > > idle
> > > > for
> > > > a few minutes, even with large timeout values (which were
> > > > possibly
> > > > not
> > > > set correctly or fully).
> > > >
> > > > 2. With normal coaster mode, if workers start toiming out for
> > > > lack
> > > > of work, the Swift run dies.
> > >
> > > That one is addressed by removing the worker timeout. As I
> > > mentioned
> > > in
> > > a previous email, that timeout is a artifact of an older worker
> > > management scheme.
> > >
> > > >
> > > > 3. Errors in provider staging at high volume.
> > > >
> > > > If you already have test cases for these issues, let me know,
> > > > and
> > > > I'll
> > > > focus on the missing ones. But Im assuming for now you need all
> > > > three.
> > >
> > > I have test cases for 1 and 3. I couldn't reproduce the problems
> > > so
> > > far.
> > >
> > > Mihael
> >

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory


From hategan at mcs.anl.gov  Sun Nov 21 20:49:13 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sun, 21 Nov 2010 18:49:13 -0800
Subject: [Swift-devel] misassignment of jobs
In-Reply-To: <180644543.101847.1290393938265.JavaMail.root@zimbra.anl.gov>
References: <180644543.101847.1290393938265.JavaMail.root@zimbra.anl.gov>
Message-ID: <1290394153.27777.1.camel@blabla2.none>

Right. I would hold off on the service timeout. My tests show that it
has no impact, and, in theory, that both shouldn't have an impact and it
should not be removed.

Mihael

On Sun, 2010-11-21 at 20:45 -0600, Michael Wilde wrote:
> I was testing with the two mods below in place (long values in both worker timeout and service timeout).
> 
> - Mike
> 
> login1$ pwd
> /scratch/local/wilde/swift/src/trunk.gomods/cog/modules/provider-coaster
> login1$ 
> 
> login1$ svn diff
> Index: src/org/globus/cog/abstraction/coaster/service/CoasterService.java
> ===================================================================
> --- src/org/globus/cog/abstraction/coaster/service/CoasterService.java  (revision 2932)
> +++ src/org/globus/cog/abstraction/coaster/service/CoasterService.java  (working copy)
> @@ -41,7 +41,7 @@
>      public static final Logger logger = Logger
>              .getLogger(CoasterService.class);
> 
> -    public static final int IDLE_TIMEOUT = 120 * 1000;
> +    public static final int IDLE_TIMEOUT = 120 * 1000 /* extend it: */ * 30 * 240;
> 
>      public static final int CONNECT_TIMEOUT = 2 * 60 * 1000;
> 
> Index: resources/worker.pl
> ===================================================================
> --- resources/worker.pl (revision 2932)
> +++ resources/worker.pl (working copy)
> @@ -123,7 +123,7 @@
>  my $URISTR=$ARGV[0];
>  my $BLOCKID=$ARGV[1];
>  my $LOGDIR=$ARGV[2];
> -my $IDLETIMEOUT = ( $#ARGV <= 2 ) ? (4 * 60) : $ARGV[3];
> +my $IDLETIMEOUT = ( $#ARGV <= 2 ) ? (4 * 60 * 60 * 24) : $ARGV[3];
> 
> 
>  # REQUESTS holds a map of incoming requests
> login1$ 
> 
> 
> ----- Original Message -----
> > Ok. I will remove the idle timeouts from the worker. I do not expect
> > any
> > negative consequences there given the reasoning I outlined before.
> > 
> > Mihael
> > 
> > On Sun, 2010-11-21 at 19:37 -0600, Michael Wilde wrote:
> > > OK, re bug 2: I didnt connect the symptoms of this issue with your
> > > earlier comments on timeouts, and just verified that you are
> > > correct: with the same extended timeouts I was using to try to keep
> > > a persistent coaster service up for an extended time, the failing
> > > case for bug 2 works.
> > >
> > > I'll try to reproduce bug 1 now, then 3.
> > >
> > > - Mike
> > >
> > >
> > > ----- Original Message -----
> > > > On Sun, 2010-11-21 at 17:10 -0600, Michael Wilde wrote:
> > > > > Mihael,
> > > > >
> > > > > If you're in fixin' mode,
> > > >
> > > > I've been in fixin' mode for the past two months :)
> > > >
> > > > >  I'll spend some time now trying to reproduce the 3 coaster
> > > > >  problems
> > > > >  that are high on my "needed for users" list:
> > > > >
> > > > > 1. Swift hangs/fails talking to persistent server if it sites
> > > > > idle
> > > > > for
> > > > > a few minutes, even with large timeout values (which were
> > > > > possibly
> > > > > not
> > > > > set correctly or fully).
> > > > >
> > > > > 2. With normal coaster mode, if workers start toiming out for
> > > > > lack
> > > > > of work, the Swift run dies.
> > > >
> > > > That one is addressed by removing the worker timeout. As I
> > > > mentioned
> > > > in
> > > > a previous email, that timeout is a artifact of an older worker
> > > > management scheme.
> > > >
> > > > >
> > > > > 3. Errors in provider staging at high volume.
> > > > >
> > > > > If you already have test cases for these issues, let me know,
> > > > > and
> > > > > I'll
> > > > > focus on the missing ones. But Im assuming for now you need all
> > > > > three.
> > > >
> > > > I have test cases for 1 and 3. I couldn't reproduce the problems
> > > > so
> > > > far.
> > > >
> > > > Mihael
> > >
> 


From wilde at mcs.anl.gov  Sun Nov 21 21:00:04 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Sun, 21 Nov 2010 21:00:04 -0600 (CST)
Subject: [Swift-devel] Persistent coaster service fails after several runs
In-Reply-To: <1290394153.27777.1.camel@blabla2.none>
Message-ID: <2030510416.101872.1290394804460.JavaMail.root@zimbra.anl.gov>

subject was: Re: [Swift-devel] misassignment of jobs

Re the service-side timeout, OK, will do.

Ive just re-created bug1, but its a little different than I thought.

Swift runs to the persistent coaster server lock up (ie fail to progress) and then get errors, not after a delay, but seemingly randomly. Thats likely why I was misled into thinking it was delay related.

I started a coaster server on localhost with one worker.pl.

I then run catsn.swift against it with various n (# of cat jobs) including 1, 10 , and 100.

The first several (5-10) swift runs work fine.  Then I let it sleep 5 mins and tried again. That too worked fine.  But then, after a few more runs, things hang.

Here's all the logs and details if you want to look into this particular run.

working in /home/wilde/swift/lab, on pads login1

The latest .log in this this is the failing case; the others worked (against the same persistent server):

login1$ ls -lt *.log | head -20 
-rw-r--r-- 1 wilde ci-users   95478 Nov 21 20:41 catsn-20101121-2039-1yfngygc.log
-rw-r--r-- 1 wilde ci-users   36085 Nov 21 20:39 swift.log
-rw-r--r-- 1 wilde ci-users  272734 Nov 21 20:37 catsn-20101121-2037-7uk5fj33.log
-rw-r--r-- 1 wilde ci-users  272644 Nov 21 20:37 catsn-20101121-2037-j8xq9aie.log
-rw-r--r-- 1 wilde ci-users  272468 Nov 21 20:36 catsn-20101121-2036-4y0tnimd.log
-rw-r--r-- 1 wilde ci-users   31317 Nov 21 20:36 catsn-20101121-2036-opcvomk4.log
-rw-r--r-- 1 wilde ci-users    7183 Nov 21 20:36 catsn-20101121-2036-u59brtm4.log
-rw-r--r-- 1 wilde ci-users    7183 Nov 21 20:35 catsn-20101121-2035-360kh03b.log
-rw-r--r-- 1 wilde ci-users    7351 Nov 21 20:35 catsn-20101121-2035-8lttnn88.log
-rw-r--r-- 1 wilde ci-users    7183 Nov 21 20:30 catsn-20101121-2030-ddmo6gt3.log
-rw-r--r-- 1 wilde ci-users    7267 Nov 21 20:29 catsn-20101121-2029-sq8y6cnb.log
-rw-r--r-- 1 wilde ci-users    7179 Nov 21 20:29 catsn-20101121-2029-3su2x8v9.log
-rw-r--r-- 1 wilde ci-users    7183 Nov 21 20:29 catsn-20101121-2029-z0g50i50.log
-rw-r--r-- 1 wilde ci-users    7267 Nov 21 20:29 catsn-20101121-2029-5x6pbkde.log

The worker and service logs are in: /tmp/wilde/Swift/{server,worker}

swift  is:

/scratch/local/wilde/swift/src/trunk.gomods/cog/modules/swift/dist/swift-svn/bin/swift

The test runs were all of this form, with various n as above:

login1$ swift -config cf -tc.file tc -sites.file pecos04.xml catsn.swift -n=100


I started the persistent coaster with the somewhat ugly script:

/home/wilde/swift/lab/pecos/start-mcs

(which runs a dummy job to force the server to passive mode, for the general case of workers joining and leaving the service)

I'll clean this up for re-creatability if you cant spot the issue form these logs.

Lastly, the last few runs, including the failing one, gave this on stdout/err:

login1$ swift -config cf -tc.file tc -sites.file pecos04.xml catsn.swift -n=100
Swift svn swift-r3707 (swift modified locally) cog-r2932 (cog modified locally)

RunID: 20101121-2037-j8xq9aie
Progress:
Find: http://localhost:1985
Find:  keepalive(120), reconnect - http://localhost:1985
Progress:  Selecting site:64  Submitting:3  Submitted:25  Active:4  Finished successfully:4
Progress:  Selecting site:52  Submitted:28  Active:3  Checking status:1  Finished successfully:16
Progress:  Selecting site:36  Submitting:3  Submitted:25  Active:4  Finished successfully:32
Progress:  Selecting site:23  Submitted:28  Active:3  Checking status:1  Finished successfully:45
Progress:  Selecting site:7  Submitted:27  Active:3  Checking status:1  Finished successfully:62
Progress:  Submitted:14  Active:2  Stage out:3  Finished successfully:81
Progress:  Submitted:3  Stage out:3  Finished successfully:94
Final status:  Finished successfully:100
login1$ swift -config cf -tc.file tc -sites.file pecos04.xml catsn.swift -n=100
Swift svn swift-r3707 (swift modified locally) cog-r2932 (cog modified locally)

RunID: 20101121-2037-7uk5fj33
Progress:
Find: http://localhost:1985
Find:  keepalive(120), reconnect - http://localhost:1985
Progress:  Selecting site:64  Submitted:28  Active:3  Checking status:1  Finished successfully:4
Progress:  Selecting site:48  Submitting:3  Submitted:25  Active:4  Finished successfully:20
Progress:  Selecting site:36  Submitted:28  Active:3  Checking status:1  Finished successfully:32
Progress:  Selecting site:23  Submitted:24  Active:4  Stage out:3  Finished successfully:46
Progress:  Selecting site:6  Submitted:28  Active:3  Checking status:1  Finished successfully:62
Progress:  Submitted:17  Active:3  Checking status:1  Finished successfully:79
Progress:  Submitted:3  Active:1  Stage out:3  Finished successfully:93
Final status:  Finished successfully:100
login1$ swift -config cf -tc.file tc -sites.file pecos04.xml catsn.swift -n=100
Swift svn swift-r3707 (swift modified locally) cog-r2932 (cog modified locally)

RunID: 20101121-2039-1yfngygc
Progress:
Find: http://localhost:1985
Find:  keepalive(120), reconnect - http://localhost:1985
Progress:  Selecting site:68  Submitting:32
Progress:  Selecting site:68  Submitting:32
Progress:  Selecting site:68  Submitting:32
Progress:  Selecting site:68  Submitting:32
Command(1, CHANNELCONFIG): handling reply timeout; sendReqTime=101121-203902.376, sendTime=101121-203902.377, now=101121-204102.399
Command(1, CHANNELCONFIG)fault was: Reply timeout
org.globus.cog.karajan.workflow.service.ReplyTimeoutException
        at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:280)
        at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:285)
        at java.util.TimerThread.mainLoop(Timer.java:512)
        at java.util.TimerThread.run(Timer.java:462)
Progress:  Selecting site:68  Submitting:31 Failed but can retry:1
login1$ 

-----


- Mike

----- Original Message -----
> Right. I would hold off on the service timeout. My tests show that it
> has no impact, and, in theory, that both shouldn't have an impact and
> it
> should not be removed.
> 
> Mihael
> 
> On Sun, 2010-11-21 at 20:45 -0600, Michael Wilde wrote:
> > I was testing with the two mods below in place (long values in both
> > worker timeout and service timeout).
> >
> > - Mike
> >
> > login1$ pwd
> > /scratch/local/wilde/swift/src/trunk.gomods/cog/modules/provider-coaster
> > login1$
> >
> > login1$ svn diff
> > Index:
> > src/org/globus/cog/abstraction/coaster/service/CoasterService.java
> > ===================================================================
> > ---
> > src/org/globus/cog/abstraction/coaster/service/CoasterService.java
> > (revision 2932)
> > +++
> > src/org/globus/cog/abstraction/coaster/service/CoasterService.java
> > (working copy)
> > @@ -41,7 +41,7 @@
> >      public static final Logger logger = Logger
> >              .getLogger(CoasterService.class);
> >
> > - public static final int IDLE_TIMEOUT = 120 * 1000;
> > + public static final int IDLE_TIMEOUT = 120 * 1000 /* extend it: */
> > * 30 * 240;
> >
> >      public static final int CONNECT_TIMEOUT = 2 * 60 * 1000;
> >
> > Index: resources/worker.pl
> > ===================================================================
> > --- resources/worker.pl (revision 2932)
> > +++ resources/worker.pl (working copy)
> > @@ -123,7 +123,7 @@
> >  my $URISTR=$ARGV[0];
> >  my $BLOCKID=$ARGV[1];
> >  my $LOGDIR=$ARGV[2];
> > -my $IDLETIMEOUT = ( $#ARGV <= 2 ) ? (4 * 60) : $ARGV[3];
> > +my $IDLETIMEOUT = ( $#ARGV <= 2 ) ? (4 * 60 * 60 * 24) : $ARGV[3];
> >
> >
> >  # REQUESTS holds a map of incoming requests
> > login1$
> >
> >
> > ----- Original Message -----
> > > Ok. I will remove the idle timeouts from the worker. I do not
> > > expect
> > > any
> > > negative consequences there given the reasoning I outlined before.
> > >
> > > Mihael
> > >
> > > On Sun, 2010-11-21 at 19:37 -0600, Michael Wilde wrote:
> > > > OK, re bug 2: I didnt connect the symptoms of this issue with
> > > > your
> > > > earlier comments on timeouts, and just verified that you are
> > > > correct: with the same extended timeouts I was using to try to
> > > > keep
> > > > a persistent coaster service up for an extended time, the
> > > > failing
> > > > case for bug 2 works.
> > > >
> > > > I'll try to reproduce bug 1 now, then 3.
> > > >
> > > > - Mike
> > > >
> > > >
> > > > ----- Original Message -----
> > > > > On Sun, 2010-11-21 at 17:10 -0600, Michael Wilde wrote:
> > > > > > Mihael,
> > > > > >
> > > > > > If you're in fixin' mode,
> > > > >
> > > > > I've been in fixin' mode for the past two months :)
> > > > >
> > > > > >  I'll spend some time now trying to reproduce the 3 coaster
> > > > > >  problems
> > > > > >  that are high on my "needed for users" list:
> > > > > >
> > > > > > 1. Swift hangs/fails talking to persistent server if it
> > > > > > sites
> > > > > > idle
> > > > > > for
> > > > > > a few minutes, even with large timeout values (which were
> > > > > > possibly
> > > > > > not
> > > > > > set correctly or fully).
> > > > > >
> > > > > > 2. With normal coaster mode, if workers start toiming out
> > > > > > for
> > > > > > lack
> > > > > > of work, the Swift run dies.
> > > > >
> > > > > That one is addressed by removing the worker timeout. As I
> > > > > mentioned
> > > > > in
> > > > > a previous email, that timeout is a artifact of an older
> > > > > worker
> > > > > management scheme.
> > > > >
> > > > > >
> > > > > > 3. Errors in provider staging at high volume.
> > > > > >
> > > > > > If you already have test cases for these issues, let me
> > > > > > know,
> > > > > > and
> > > > > > I'll
> > > > > > focus on the missing ones. But Im assuming for now you need
> > > > > > all
> > > > > > three.
> > > > >
> > > > > I have test cases for 1 and 3. I couldn't reproduce the
> > > > > problems
> > > > > so
> > > > > far.
> > > > >
> > > > > Mihael
> > > >
> >

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory


From hategan at mcs.anl.gov  Sun Nov 21 21:06:22 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sun, 21 Nov 2010 19:06:22 -0800
Subject: [Swift-devel] Re: Persistent coaster service fails after several
	runs
In-Reply-To: <2030510416.101872.1290394804460.JavaMail.root@zimbra.anl.gov>
References: <2030510416.101872.1290394804460.JavaMail.root@zimbra.anl.gov>
Message-ID: <1290395182.27886.0.camel@blabla2.none>

[hategan at login1 Swift]$ cd server
-bash: cd: server: Permission denied


On Sun, 2010-11-21 at 21:00 -0600, Michael Wilde wrote:

> The worker and service logs are in: /tmp/wilde/Swift/{server,worker}
> 


From wilde at mcs.anl.gov  Sun Nov 21 21:59:51 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Sun, 21 Nov 2010 21:59:51 -0600 (CST)
Subject: [Swift-devel] Re: Persistent coaster service fails after several
	runs
In-Reply-To: <1290395182.27886.0.camel@blabla2.none>
Message-ID: <1935641716.101928.1290398391418.JavaMail.root@zimbra.anl.gov>

sorry, fixed.

----- Original Message -----
> [hategan at login1 Swift]$ cd server
> -bash: cd: server: Permission denied
> 
> 
> On Sun, 2010-11-21 at 21:00 -0600, Michael Wilde wrote:
> 
> > The worker and service logs are in: /tmp/wilde/Swift/{server,worker}
> >

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory


From wilde at mcs.anl.gov  Sun Nov 21 23:10:47 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Sun, 21 Nov 2010 23:10:47 -0600 (CST)
Subject: [Swift-devel] Provider staging error in long-running test
In-Reply-To: <602179782.102075.1290402309752.JavaMail.root@zimbra.anl.gov>
Message-ID: <387720139.102083.1290402647892.JavaMail.root@zimbra.anl.gov>

Mihael, here is bug 3:

I was testing a foreach loop doing a cat of 10,000 input files of sizes up to about 300-400K each. ?The test hit an error after around 3,491 files:

Progress: ?Selecting site:1008 ?Submitted:12 ?Active:3 ?Finished successfully:3476
Progress: ?Selecting site:1008 ?Submitted:13 ?Active:3 ?Finished successfully:3491
Failed to shut down channel
java.lang.NullPointerException
?? ? ? ?at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.configureHeartBeat(AbstractKarajanChannel.java:57)
?? ? ? ?at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.<init>(AbstractKarajanChannel.java:52)

The test was executed on PADS login1 like this:

cd /home/wilde/swift/lab
./run.local.coast.ps.sh catsall

log file: catsall-20101121-2239-oc2flmn0.log

sites.xml:

<config>
??<pool handle="localhost">
?? ?<!-- <execution provider="coaster-persistent" url="http://login1.pads.ci.uchicago.edu:" jobmanager="local:local"/> -->
?? ?<execution provider="coaster" url="none" jobmanager="local:local"/>
?? ?<!-- <profile namespace="globus" key="workerManager">passive</profile> -->
?? ?<profile namespace="globus" key="workersPerNode">8</profile>
?? ?<profile namespace="globus" key="slots">1</profile>
?? ?<profile namespace="globus" key="maxnodes">1</profile>
?? ?<profile key="jobThrottle" namespace="karajan">.15</profile>
?? ?<profile namespace="karajan" key="initialScore">10000</profile>
?? ?<profile namespace="swift" key="stagingMethod">proxy</profile>
?? ?<workdirectory>/scratch/local/wilde/pstest/swiftwork</workdirectory>
??</pool>
</config>

login1$ cat cf
wrapperlog.always.transfer=true
sitedir.keep=true
execution.retries=0
lazy.errors=false
status.mode=provider
use.provider.staging=true
provider.staging.pin.swiftfiles=false
login1$ cat catsall.swift
type file;

app (file o) cat (file i)
{
  cat @i stdout=@o;
}

file infile[]  <simple_mapper; location="indir", prefix="f.", suffix=".in">;
file outfile[] <simple_mapper; location="outdir", prefix="f.",suffix=".out">;

foreach f, i in infile {
  outfile[i] = cat(f);
}
login1$ 

login1$ which swift
/scratch/local/wilde/swift/src/trunk.gomods/cog/modules/swift/dist/swift-svn/bin/swift
login1$ java -version
java version "1.6.0_22"
Java(TM) SE Runtime Environment (build 1.6.0_22-b04)
Java HotSpot(TM) 64-Bit Server VM (build 17.1-b03, mixed mode)
login1$ 


-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory


From wozniak at mcs.anl.gov  Mon Nov 22 16:31:38 2010
From: wozniak at mcs.anl.gov (Justin M Wozniak)
Date: Mon, 22 Nov 2010 16:31:38 -0600 (CST)
Subject: [Swift-devel] Re: Problems with
 provider.staging.pin.swiftfiles
In-Reply-To: <alpine.DEB.2.00.1011110940050.10341@wozniak-desktop.mcs.anl.gov>
References: <1669558788.45842.1289488275177.JavaMail.root@zimbra.anl.gov>
	<alpine.DEB.2.00.1011110940050.10341@wozniak-desktop.mcs.anl.gov>
Message-ID: <alpine.DEB.2.00.1011221631130.10957@wozniak-desktop.mcs.anl.gov>

Hello
 	This should be corrected in trunk now.
 	Justin

On Thu, 11 Nov 2010, Justin M Wozniak wrote:

> Hello
> 	Yes, this was broken a few weeks ago- I will try to restore it ASAP. 
> (Cf. swift-devel post from 9/27.)
> 	Justin
>
> On Thu, 11 Nov 2010, Michael Wilde wrote:
>
>> Justin,
>> 
>> When Tom Uram turns on this option and runs a simple test script (a foreach 
>> and an app that just collects node info), he gets an error "520" returned 
>> in the swift log, as if from the app. I am thinking that the 520 is somehow 
>> coming from worker.
>> 
>> This is going from bridled to Eureka worker nodes, with provider staging 
>> turned on in proxy mode.
>> 
>> When we set provider.staging.pin.swiftfiles to false, the script runs ok.
>> 
>> We'll need to collect and send logs and a test case, but I wanted to alert 
>> you to a potential problem with this feature.
>> 
>> - Mike

-- 
Justin M Wozniak


From hategan at mcs.anl.gov  Mon Nov 22 20:29:57 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 22 Nov 2010 18:29:57 -0800
Subject: [Swift-devel] Re: Provider staging error in long-running test
In-Reply-To: <387720139.102083.1290402647892.JavaMail.root@zimbra.anl.gov>
References: <387720139.102083.1290402647892.JavaMail.root@zimbra.anl.gov>
Message-ID: <1290479397.30590.1.camel@blabla2.none>

Ok. So that doesn't look like it's a staging problem specifically, but
more like something with the comm library. I'll have to look at the
logs. And I can foresee some free time coming in a couple of days just
for that!

Mihael

On Sun, 2010-11-21 at 23:10 -0600, Michael Wilde wrote:
> Mihael, here is bug 3:
> 
> I was testing a foreach loop doing a cat of 10,000 input files of sizes up to about 300-400K each.  The test hit an error after around 3,491 files:
> 
> Progress:  Selecting site:1008  Submitted:12  Active:3  Finished successfully:3476
> Progress:  Selecting site:1008  Submitted:13  Active:3  Finished successfully:3491
> Failed to shut down channel
> java.lang.NullPointerException
>         at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.configureHeartBeat(AbstractKarajanChannel.java:57)
>         at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.<init>(AbstractKarajanChannel.java:52)
> 
> The test was executed on PADS login1 like this:
> 
> cd /home/wilde/swift/lab
> ./run.local.coast.ps.sh catsall
> 
> log file: catsall-20101121-2239-oc2flmn0.log
> 
> sites.xml:
> 
> <config>
>   <pool handle="localhost">
>     <!-- <execution provider="coaster-persistent" url="http://login1.pads.ci.uchicago.edu:" jobmanager="local:local"/> -->
>     <execution provider="coaster" url="none" jobmanager="local:local"/>
>     <!-- <profile namespace="globus" key="workerManager">passive</profile> -->
>     <profile namespace="globus" key="workersPerNode">8</profile>
>     <profile namespace="globus" key="slots">1</profile>
>     <profile namespace="globus" key="maxnodes">1</profile>
>     <profile key="jobThrottle" namespace="karajan">.15</profile>
>     <profile namespace="karajan" key="initialScore">10000</profile>
>     <profile namespace="swift" key="stagingMethod">proxy</profile>
>     <workdirectory>/scratch/local/wilde/pstest/swiftwork</workdirectory>
>   </pool>
> </config>
> 
> login1$ cat cf
> wrapperlog.always.transfer=true
> sitedir.keep=true
> execution.retries=0
> lazy.errors=false
> status.mode=provider
> use.provider.staging=true
> provider.staging.pin.swiftfiles=false
> login1$ cat catsall.swift
> type file;
> 
> app (file o) cat (file i)
> {
>   cat @i stdout=@o;
> }
> 
> file infile[]  <simple_mapper; location="indir", prefix="f.", suffix=".in">;
> file outfile[] <simple_mapper; location="outdir", prefix="f.",suffix=".out">;
> 
> foreach f, i in infile {
>   outfile[i] = cat(f);
> }
> login1$ 
> 
> login1$ which swift
> /scratch/local/wilde/swift/src/trunk.gomods/cog/modules/swift/dist/swift-svn/bin/swift
> login1$ java -version
> java version "1.6.0_22"
> Java(TM) SE Runtime Environment (build 1.6.0_22-b04)
> Java HotSpot(TM) 64-Bit Server VM (build 17.1-b03, mixed mode)
> login1$ 
> 
> 


From bugzilla-daemon at mcs.anl.gov  Tue Nov 23 00:49:16 2010
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Tue, 23 Nov 2010 00:49:16 -0600 (CST)
Subject: [Swift-devel] [Bug 182] Error messages summarized at end of Swift
 output should also be printed when they occur
In-Reply-To: <bug-182-21@http.bugzilla.mcs.anl.gov/swift/>
References: <bug-182-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20101123064916.A613F2CD6F@wind.mcs.anl.gov>

https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=182


skenny <skenny at uchicago.edu> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|                            |FIXED


-- 
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the reporter.


From damitha119 at gmail.com  Tue Nov 23 12:17:34 2010
From: damitha119 at gmail.com (Damitha Wimalasooriya)
Date: Tue, 23 Nov 2010 23:47:34 +0530
Subject: [Swift-devel] adding a source file
Message-ID: <AANLkTimokUG1A=FTA3iqcLEjEevYOVM4jBK_YO+-U-+e@mail.gmail.com>

I have coded a source code for a new method. But still I don't know to add
that file to the prevailing libraries. Can somebody help me to add this file
and test it.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20101123/393c48b9/attachment.html>

From bugzilla-daemon at mcs.anl.gov  Tue Nov 23 12:51:42 2010
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Tue, 23 Nov 2010 12:51:42 -0600 (CST)
Subject: [Swift-devel] [Bug 31] error message should not refer to java
	exception classes
In-Reply-To: <bug-31-21@http.bugzilla.mcs.anl.gov/swift/>
References: <bug-31-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20101123185142.C70322CD0C@wind.mcs.anl.gov>

https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=31


skenny <skenny at uchicago.edu> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED
         AssignedTo|hategan at mcs.anl.gov         |skenny at uchicago.edu
            Summary|error message when mapper   |error message should not
                   |parameter is wrong          |refer to java exception
                   |                            |classes


-- 
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.
You are watching the reporter.


From wilde at mcs.anl.gov  Tue Nov 23 14:10:42 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 23 Nov 2010 14:10:42 -0600 (CST)
Subject: [Swift-devel] Next Swift release
In-Reply-To: <577622484.117049.1290542892759.JavaMail.root@zimbra.anl.gov>
Message-ID: <1278622160.117059.1290543042855.JavaMail.root@zimbra.anl.gov>

All,

Sarah is going to take the lead in producing the next Swift release, and will propose a release definition and plan. We want to have the release done by Dec 20.

- Mike


From bugzilla-daemon at mcs.anl.gov  Tue Nov 23 14:36:37 2010
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Tue, 23 Nov 2010 14:36:37 -0600 (CST)
Subject: [Swift-devel] [Bug 235] cryptic error message when app is not
	specified in tc.data
In-Reply-To: <bug-235-21@http.bugzilla.mcs.anl.gov/swift/>
References: <bug-235-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20101123203637.80D782CDCE@wind.mcs.anl.gov>

https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=235


skenny <skenny at uchicago.edu> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |swift-devel at ci.uchicago.edu


-- 
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.


From bugzilla-daemon at mcs.anl.gov  Tue Nov 23 14:39:55 2010
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Tue, 23 Nov 2010 14:39:55 -0600 (CST)
Subject: [Swift-devel] [Bug 235] cryptic error message when app is not
	specified in tc.data
In-Reply-To: <bug-235-21@http.bugzilla.mcs.anl.gov/swift/>
References: <bug-235-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20101123203955.638512CB38@wind.mcs.anl.gov>

https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=235


--- Comment #1 from skenny <skenny at uchicago.edu>  2010-11-23 14:39:55 ---
new error is:

RunID: 20101123-1225-2xvzsta7
Progress:
Execution failed:
        The application "RInvokee" is not available in your tc.data catalog

-- 
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.


From wilde at mcs.anl.gov  Wed Nov 24 08:35:16 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 24 Nov 2010 08:35:16 -0600 (CST)
Subject: [Swift-devel] Hi all,
In-Reply-To: <AANLkTi=gjDDsJa7=RmwgnV2J7+d7h_-W+zopphoctCuf@mail.gmail.com>
Message-ID: <1167986945.119613.1290609316657.JavaMail.root@zimbra.anl.gov>

Chaturanga, 


The only help we can offer to new or potential Swift users is help on specific problems or questions you have on using Swift. 


If you send to the swift-user list specific Swift code that you have tried, and any errors Swift is giving you, list members will try to help as time permits. As with any user list, we can not guarantee you an answer in the time you need it. 


Regarding Swift logging, you will need to understand how to use Java log4j to use Swift logging. Once you understand that, then logging can be controlled by settings in the property files in the Swift etc/ directory. 


I am sorry, but I will not be able to answer any further questions unless they are specific enough and show that you have read the Swift documentation carefully, completely executed the tutorial examples, and have made an effort to debug any problems that occur - just as you would have to to learn any other new computer language. That takes a lot of work, and we can not do this work for you. 


You need to work with your professor or teaching assistants to get guidance on how to complete your class project. 


- Mike 
----- Original Message -----


Sir, 

Thank you. 
I'm doing this for class credit. I asked it, they said OK. 

>You should try to use them in a few programs of your own, to verify your understanding of how they work. And build swift with extra >logging turned on (or added) to trace mapper activity. 

I tried with my own programs. 
What did you mean ' build swift with extra logging turned on (or added) to trace mapper activity' ? 
Is it the same thing that's mentioned in the tutorial under 'writing a mapper' ? (I did them and worked.) 

Do I have some specific things to do with mappers ? 

Chaturanga 


On Tue, Nov 23, 2010 at 8:59 AM, Michael Wilde < wilde at mcs.anl.gov > wrote: 


----- Original Message ----- 
> Michael, 
> 
> I was involving some exam stuffs in last few weeks at university. So, 
> was unable to spend more time with Swift. Hopefully now I can spend my 
> whole time with Swift. 
> I did the tutorial and read the user guide. It was interesting but had 
> some problems while doing. 


Can you report these to the swift-user list, please, so we can try to fix them? 

> 
> I'm interested to do the project 'Re-work the mapper naming 
> conventions to make code more readable and less verbose, and to fix 
> some broken mapper semantics' as you had mentioned earlier. 

OK. That may be a pretty difficult project, but it could perhaps be done incrementally, one small improvement at a time. 

What is your goal in this project? Are you doing it for class credit (in which case your professor should give you some guidance as to whether it is a good idea), or for learning more programming skills? 


> I looked at Mappers from the tutorial and user guide and was able to 
> understand them. 

You should try to use them in a few programs of your own, to verify your understanding of how they work. And build swift with extra logging turned on (or added) to trace mapper activity. 


> Can't we implement arrays in a similar manner described in tutorial 
> for the mappers or does it has another way? I mean by using an 
> Abstract class from java. 

Can you clarify what you mean? Java is not visible from Swift - its below the Karajan implementation. 

And its not clear what you mean by "cant we implement arrays..." 

The purpose of this particular project is to implement better mappers, not arrays, right? 

- Mike 


> 
> Sir, I'm pretty new to these things. 
> 
> If you can share some suggestions it will be very much helpful to me. 
> 
> Thank You. 
> 
> 
> 
> On Tue, Nov 23, 2010 at 5:11 AM, Chaturanga Wimalarathne < 
> chaturanganamal at gmail.com > wrote: 
> 
> 
> Michael, 
> 
> This is really helpful. I have already prepared much for this project 
> regarding SWIFT. I downloaded swift tutorial and user guide. I'll go 
> through those and will choose a suitable project. I also downloaded 
> the source code through Subversion. I was almost going to change the 
> project. Thank you for the suggestions. I will definitely choose one 
> project from this list and Let you know ASAP. And I am sure I can 
> count on your continued assistance. Thanks Again. 
> 
> Chaturanga 
> 
> 
> -- 
> Chaturanga Namal 
> Department of Computer Science and Engineering <Undergraduate> 
> University of Moratuwa 
> Sri Lanka 


-- 
Michael Wilde 
Computation Institute, University of Chicago 
Mathematics and Computer Science Division 
Argonne National Laboratory 


-- 
Chaturanga Namal 
Department of Computer Science and Engineering <Undergraduate> 
University of Moratuwa 
Sri Lanka 


-- 
Michael Wilde 
Computation Institute, University of Chicago 
Mathematics and Computer Science Division 
Argonne National Laboratory 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20101124/afa79a6a/attachment.html>

From dk0966 at cs.ship.edu  Fri Nov 26 21:25:00 2010
From: dk0966 at cs.ship.edu (David Kelly)
Date: Fri, 26 Nov 2010 22:25:00 -0500
Subject: [Swift-devel] SGE qstat and XML
Message-ID: <AANLkTi=ym3u5ZQAi=7bNPxtS2Jg9bTSx6=g3KOpt2S5L@mail.gmail.com>

Hello,

As I was testing Swift with SGE I noticed that the qstat included in
newer versions of the Grid Environment was not being correctly parsed
by Swift, causing it to fail and exit prematurely. This was caused by
a change in the formatting of qstat output. Starting with Grid
Environment 6.0, qstat can output data as XML. I believe this should
provide a more consistent way to parse the data. Attached are my
changes for using and parsing XML. So far I've tested this on the
ibicluster and on my Ubuntu laptop with grid environment installed.

David
-------------- next part --------------
A non-text attachment was scrubbed...
Name: sge-updates.diff
Type: text/x-patch
Size: 9015 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20101126/2adfc16c/attachment.bin>

From hategan at mcs.anl.gov  Sat Nov 27 19:41:46 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sat, 27 Nov 2010 17:41:46 -0800
Subject: [Swift-devel] Re: Provider staging error in long-running test
In-Reply-To: <1290479397.30590.1.camel@blabla2.none>
References: <387720139.102083.1290402647892.JavaMail.root@zimbra.anl.gov>
	<1290479397.30590.1.camel@blabla2.none>
Message-ID: <1290908506.32250.2.camel@blabla2.none>

So I think that was due to incorrect assumption that message headers
will never be broken up into pieces by the TCP layer. That caused the
worker to fail, presumably under high load, but I cannot be sure about
the exact conditions that led to the problem (and therefore I cannot be
sure of the solution).

I have added code to read things from a socket in a more resilient
fashion.

I have also removed the idle timeout from the worker. That should not
bother us any more.

Mihael

On Mon, 2010-11-22 at 18:29 -0800, Mihael Hategan wrote:
> Ok. So that doesn't look like it's a staging problem specifically, but
> more like something with the comm library. I'll have to look at the
> logs. And I can foresee some free time coming in a couple of days just
> for that!
> 
> Mihael
> 
> On Sun, 2010-11-21 at 23:10 -0600, Michael Wilde wrote:
> > Mihael, here is bug 3:
> > 
> > I was testing a foreach loop doing a cat of 10,000 input files of sizes up to about 300-400K each.  The test hit an error after around 3,491 files:
> > 
> > Progress:  Selecting site:1008  Submitted:12  Active:3  Finished successfully:3476
> > Progress:  Selecting site:1008  Submitted:13  Active:3  Finished successfully:3491
> > Failed to shut down channel
> > java.lang.NullPointerException
> >         at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.configureHeartBeat(AbstractKarajanChannel.java:57)
> >         at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.<init>(AbstractKarajanChannel.java:52)
> > 
> > The test was executed on PADS login1 like this:
> > 
> > cd /home/wilde/swift/lab
> > ./run.local.coast.ps.sh catsall
> > 
> > log file: catsall-20101121-2239-oc2flmn0.log
> > 
> > sites.xml:
> > 
> > <config>
> >   <pool handle="localhost">
> >     <!-- <execution provider="coaster-persistent" url="http://login1.pads.ci.uchicago.edu:" jobmanager="local:local"/> -->
> >     <execution provider="coaster" url="none" jobmanager="local:local"/>
> >     <!-- <profile namespace="globus" key="workerManager">passive</profile> -->
> >     <profile namespace="globus" key="workersPerNode">8</profile>
> >     <profile namespace="globus" key="slots">1</profile>
> >     <profile namespace="globus" key="maxnodes">1</profile>
> >     <profile key="jobThrottle" namespace="karajan">.15</profile>
> >     <profile namespace="karajan" key="initialScore">10000</profile>
> >     <profile namespace="swift" key="stagingMethod">proxy</profile>
> >     <workdirectory>/scratch/local/wilde/pstest/swiftwork</workdirectory>
> >   </pool>
> > </config>
> > 
> > login1$ cat cf
> > wrapperlog.always.transfer=true
> > sitedir.keep=true
> > execution.retries=0
> > lazy.errors=false
> > status.mode=provider
> > use.provider.staging=true
> > provider.staging.pin.swiftfiles=false
> > login1$ cat catsall.swift
> > type file;
> > 
> > app (file o) cat (file i)
> > {
> >   cat @i stdout=@o;
> > }
> > 
> > file infile[]  <simple_mapper; location="indir", prefix="f.", suffix=".in">;
> > file outfile[] <simple_mapper; location="outdir", prefix="f.",suffix=".out">;
> > 
> > foreach f, i in infile {
> >   outfile[i] = cat(f);
> > }
> > login1$ 
> > 
> > login1$ which swift
> > /scratch/local/wilde/swift/src/trunk.gomods/cog/modules/swift/dist/swift-svn/bin/swift
> > login1$ java -version
> > java version "1.6.0_22"
> > Java(TM) SE Runtime Environment (build 1.6.0_22-b04)
> > Java HotSpot(TM) 64-Bit Server VM (build 17.1-b03, mixed mode)
> > login1$ 
> > 
> > 
> 
> 
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel


From hategan at mcs.anl.gov  Sat Nov 27 23:58:24 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sat, 27 Nov 2010 21:58:24 -0800
Subject: [Swift-devel] Re: Persistent coaster service fails after several
	runs
In-Reply-To: <2030510416.101872.1290394804460.JavaMail.root@zimbra.anl.gov>
References: <2030510416.101872.1290394804460.JavaMail.root@zimbra.anl.gov>
Message-ID: <1290923904.9422.6.camel@blabla2.none>

I think some of the logs in /home/wilde/swift/lab are gone. Nonetheless,
I believe that the lockup was caused by the following issue:

- when something bad happened on a channel, some method would be called
to allow the channel implementation to handle that error.
- an existing problem (which I thought I fixed, but it turns out I had
not committed it) caused that method to throw an exception
- that would in turn (because it was not in a try/catch block) kill the
thread used to send messages on behalf of all channels of a given type.

This was fixed as follows:
1. I committed what I should have a while ago such that the triggering
problem is gone
2. The handling of channel exceptions is now properly isolated

Mihael

On Sun, 2010-11-21 at 21:00 -0600, Michael Wilde wrote:
> subject was: Re: [Swift-devel] misassignment of jobs
> 
> Re the service-side timeout, OK, will do.
> 
> Ive just re-created bug1, but its a little different than I thought.
> 
> Swift runs to the persistent coaster server lock up (ie fail to
> progress) and then get errors, not after a delay, but seemingly
> randomly. Thats likely why I was misled into thinking it was delay
> related.


From wilde at mcs.anl.gov  Sun Nov 28 00:46:26 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Sun, 28 Nov 2010 00:46:26 -0600 (CST)
Subject: [Swift-devel] Re: Provider staging error in long-running test
In-Reply-To: <1290908506.32250.2.camel@blabla2.none>
Message-ID: <1027257314.126477.1290926786640.JavaMail.root@zimbra.anl.gov>

Great! The test that failed after 3000+ transfers now ran all 10,000 OK.
Im putting that in a loop now to see if it runs all night. Looks promising!

- Mike


----- Original Message -----
> So I think that was due to incorrect assumption that message headers
> will never be broken up into pieces by the TCP layer. That caused the
> worker to fail, presumably under high load, but I cannot be sure about
> the exact conditions that led to the problem (and therefore I cannot
> be
> sure of the solution).
> 
> I have added code to read things from a socket in a more resilient
> fashion.
> 
> I have also removed the idle timeout from the worker. That should not
> bother us any more.
> 
> Mihael
> 
> On Mon, 2010-11-22 at 18:29 -0800, Mihael Hategan wrote:
> > Ok. So that doesn't look like it's a staging problem specifically,
> > but
> > more like something with the comm library. I'll have to look at the
> > logs. And I can foresee some free time coming in a couple of days
> > just
> > for that!
> >
> > Mihael
> >
> > On Sun, 2010-11-21 at 23:10 -0600, Michael Wilde wrote:
> > > Mihael, here is bug 3:
> > >
> > > I was testing a foreach loop doing a cat of 10,000 input files of
> > > sizes up to about 300-400K each. The test hit an error after
> > > around 3,491 files:
> > >
> > > Progress: Selecting site:1008 Submitted:12 Active:3 Finished
> > > successfully:3476
> > > Progress: Selecting site:1008 Submitted:13 Active:3 Finished
> > > successfully:3491
> > > Failed to shut down channel
> > > java.lang.NullPointerException
> > >         at
> > >         org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.configureHeartBeat(AbstractKarajanChannel.java:57)
> > >         at
> > >         org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.<init>(AbstractKarajanChannel.java:52)
> > >
> > > The test was executed on PADS login1 like this:
> > >
> > > cd /home/wilde/swift/lab
> > > ./run.local.coast.ps.sh catsall
> > >
> > > log file: catsall-20101121-2239-oc2flmn0.log
> > >
> > > sites.xml:
> > >
> > > <config>
> > >   <pool handle="localhost">
> > >     <!-- <execution provider="coaster-persistent"
> > >     url="http://login1.pads.ci.uchicago.edu:"
> > >     jobmanager="local:local"/> -->
> > >     <execution provider="coaster" url="none"
> > >     jobmanager="local:local"/>
> > >     <!-- <profile namespace="globus"
> > >     key="workerManager">passive</profile> -->
> > >     <profile namespace="globus" key="workersPerNode">8</profile>
> > >     <profile namespace="globus" key="slots">1</profile>
> > >     <profile namespace="globus" key="maxnodes">1</profile>
> > >     <profile key="jobThrottle" namespace="karajan">.15</profile>
> > >     <profile namespace="karajan"
> > >     key="initialScore">10000</profile>
> > >     <profile namespace="swift" key="stagingMethod">proxy</profile>
> > >     <workdirectory>/scratch/local/wilde/pstest/swiftwork</workdirectory>
> > >   </pool>
> > > </config>
> > >
> > > login1$ cat cf
> > > wrapperlog.always.transfer=true
> > > sitedir.keep=true
> > > execution.retries=0
> > > lazy.errors=false
> > > status.mode=provider
> > > use.provider.staging=true
> > > provider.staging.pin.swiftfiles=false
> > > login1$ cat catsall.swift
> > > type file;
> > >
> > > app (file o) cat (file i)
> > > {
> > >   cat @i stdout=@o;
> > > }
> > >
> > > file infile[] <simple_mapper; location="indir", prefix="f.",
> > > suffix=".in">;
> > > file outfile[] <simple_mapper; location="outdir",
> > > prefix="f.",suffix=".out">;
> > >
> > > foreach f, i in infile {
> > >   outfile[i] = cat(f);
> > > }
> > > login1$
> > >
> > > login1$ which swift
> > > /scratch/local/wilde/swift/src/trunk.gomods/cog/modules/swift/dist/swift-svn/bin/swift
> > > login1$ java -version
> > > java version "1.6.0_22"
> > > Java(TM) SE Runtime Environment (build 1.6.0_22-b04)
> > > Java HotSpot(TM) 64-Bit Server VM (build 17.1-b03, mixed mode)
> > > login1$
> > >
> > >
> >
> >
> > _______________________________________________
> > Swift-devel mailing list
> > Swift-devel at ci.uchicago.edu
> > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory


From wilde at mcs.anl.gov  Sun Nov 28 00:47:56 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Sun, 28 Nov 2010 00:47:56 -0600 (CST)
Subject: [Swift-devel] Re: Persistent coaster service fails after several
	runs
In-Reply-To: <1290923904.9422.6.camel@blabla2.none>
Message-ID: <810228394.126479.1290926876271.JavaMail.root@zimbra.anl.gov>

Will test this one tomorrow. I deleted logs and other junk as I was way over quota. Sorry, I forgot I had pointed you to these.

- Mike


----- Original Message -----
> I think some of the logs in /home/wilde/swift/lab are gone.
> Nonetheless,
> I believe that the lockup was caused by the following issue:
> 
> - when something bad happened on a channel, some method would be
> called
> to allow the channel implementation to handle that error.
> - an existing problem (which I thought I fixed, but it turns out I had
> not committed it) caused that method to throw an exception
> - that would in turn (because it was not in a try/catch block) kill
> the
> thread used to send messages on behalf of all channels of a given
> type.
> 
> This was fixed as follows:
> 1. I committed what I should have a while ago such that the triggering
> problem is gone
> 2. The handling of channel exceptions is now properly isolated
> 
> Mihael
> 
> On Sun, 2010-11-21 at 21:00 -0600, Michael Wilde wrote:
> > subject was: Re: [Swift-devel] misassignment of jobs
> >
> > Re the service-side timeout, OK, will do.
> >
> > Ive just re-created bug1, but its a little different than I thought.
> >
> > Swift runs to the persistent coaster server lock up (ie fail to
> > progress) and then get errors, not after a delay, but seemingly
> > randomly. Thats likely why I was misled into thinking it was delay
> > related.

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory


From hategan at mcs.anl.gov  Sun Nov 28 02:07:30 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sun, 28 Nov 2010 00:07:30 -0800
Subject: [Swift-devel] Re: Provider staging error in long-running test
In-Reply-To: <1027257314.126477.1290926786640.JavaMail.root@zimbra.anl.gov>
References: <1027257314.126477.1290926786640.JavaMail.root@zimbra.anl.gov>
Message-ID: <1290931650.9947.0.camel@blabla2.none>

On Sun, 2010-11-28 at 00:46 -0600, Michael Wilde wrote:
> Great! The test that failed after 3000+ transfers now ran all 10,000 OK.
> Im putting that in a loop now to see if it runs all night.

Right. One run is probably not sufficient to tell.

Mihael


From hategan at mcs.anl.gov  Sun Nov 28 02:09:13 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sun, 28 Nov 2010 00:09:13 -0800
Subject: [Swift-devel] Re: Persistent coaster service fails after several
	runs
In-Reply-To: <810228394.126479.1290926876271.JavaMail.root@zimbra.anl.gov>
References: <810228394.126479.1290926876271.JavaMail.root@zimbra.anl.gov>
Message-ID: <1290931753.9947.2.camel@blabla2.none>

On Sun, 2010-11-28 at 00:47 -0600, Michael Wilde wrote:
> Will test this one tomorrow. I deleted logs and other junk as I was way over quota. Sorry, I forgot I had pointed you to these.

It's ok. The problem (or what I think was the problem) was visible in
one of the other logs.

Mihael


From wilde at mcs.anl.gov  Sun Nov 28 09:53:53 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Sun, 28 Nov 2010 09:53:53 -0600 (CST)
Subject: [Swift-devel] Re: Provider staging error in long-running test
In-Reply-To: <1290931650.9947.0.camel@blabla2.none>
Message-ID: <382373340.126841.1290959633698.JavaMail.root@zimbra.anl.gov>

It ran all night and is still going - about 27 runs of 10,000 files each; all finished with no errors.

I'll start testing to multiple remote nodes now (the current test is to localhost).

Nice work!

- Mike


----- Original Message -----
> On Sun, 2010-11-28 at 00:46 -0600, Michael Wilde wrote:
> > Great! The test that failed after 3000+ transfers now ran all 10,000
> > OK.
> > Im putting that in a loop now to see if it runs all night.
> 
> Right. One run is probably not sufficient to tell.
> 
> Mihael

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory


From wilde at mcs.anl.gov  Sun Nov 28 23:23:42 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Sun, 28 Nov 2010 23:23:42 -0600 (CST)
Subject: [Swift-devel] Re: Persistent coaster service fails after several
	runs
In-Reply-To: <1290931753.9947.2.camel@blabla2.none>
Message-ID: <959208306.127425.1291008222274.JavaMail.root@zimbra.anl.gov>

This fix looks great so far - Ive tested with varying workflow sizes and delays, and have not seen any problems.

- Mike


----- Original Message -----
> On Sun, 2010-11-28 at 00:47 -0600, Michael Wilde wrote:
> > Will test this one tomorrow. I deleted logs and other junk as I was
> > way over quota. Sorry, I forgot I had pointed you to these.
> 
> It's ok. The problem (or what I think was the problem) was visible in
> one of the other logs.
> 
> Mihael

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory


From wozniak at mcs.anl.gov  Tue Nov 30 09:59:34 2010
From: wozniak at mcs.anl.gov (Justin M Wozniak)
Date: Tue, 30 Nov 2010 09:59:34 -0600 (CST)
Subject: [Swift-devel] Re: Persistent coaster service fails after several
	runs
In-Reply-To: <959208306.127425.1291008222274.JavaMail.root@zimbra.anl.gov>
References: <959208306.127425.1291008222274.JavaMail.root@zimbra.anl.gov>
Message-ID: <alpine.DEB.2.00.1011300957500.4461@wozniak-desktop.mcs.anl.gov>


Along these lines, I'm looking at memory usage in Coasters.  There's a 
plot attached below- usage spikes when the workers start running.

96% of the usage is byte[] which makes me think it could be KarajanChannel 
stuff...

http://www.ci.uchicago.edu/wiki/bin/view/SWFT/PerformanceNotes#Memory

On Sun, 28 Nov 2010, Michael Wilde wrote:

> This fix looks great so far - Ive tested with varying workflow sizes and delays, and have not seen any problems.
>
> - Mike

-- 
Justin M Wozniak


From hategan at mcs.anl.gov  Tue Nov 30 16:51:12 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 30 Nov 2010 14:51:12 -0800
Subject: [Swift-devel] Re: Persistent coaster service fails after
	several runs
In-Reply-To: <alpine.DEB.2.00.1011300957500.4461@wozniak-desktop.mcs.anl.gov>
References: <959208306.127425.1291008222274.JavaMail.root@zimbra.anl.gov>
	<alpine.DEB.2.00.1011300957500.4461@wozniak-desktop.mcs.anl.gov>
Message-ID: <1291157472.21980.0.camel@blabla2.none>

Can you make a heap dump of the relevant issue?

On Tue, 2010-11-30 at 09:59 -0600, Justin M Wozniak wrote:
> Along these lines, I'm looking at memory usage in Coasters.  There's a 
> plot attached below- usage spikes when the workers start running.
> 
> 96% of the usage is byte[] which makes me think it could be KarajanChannel 
> stuff...
> 
> http://www.ci.uchicago.edu/wiki/bin/view/SWFT/PerformanceNotes#Memory
> 
> On Sun, 28 Nov 2010, Michael Wilde wrote:
> 
> > This fix looks great so far - Ive tested with varying workflow sizes and delays, and have not seen any problems.
> >
> > - Mike
> 


From wozniak at mcs.anl.gov  Tue Nov 30 16:57:45 2010
From: wozniak at mcs.anl.gov (Justin M Wozniak)
Date: Tue, 30 Nov 2010 16:57:45 -0600 (CST)
Subject: [Swift-devel] Re: Persistent coaster service fails after several
	runs
In-Reply-To: <1291157472.21980.0.camel@blabla2.none>
References: <959208306.127425.1291008222274.JavaMail.root@zimbra.anl.gov>
	<alpine.DEB.2.00.1011300957500.4461@wozniak-desktop.mcs.anl.gov>
	<1291157472.21980.0.camel@blabla2.none>
Message-ID: <alpine.DEB.2.00.1011301654450.15756@wozniak-desktop.mcs.anl.gov>


I'm on Intrepid so it's an IBM heap dump.  There's a good one there in 
~wozniak/Public/heapdumps .

The byte[]s are definitely associated with TCPChannel but that's all I 
have been able to figure out so far- I don't see where they are retained.

It is possible that the reader is generating the bytes faster than the 
network can push them out, so we just need to tighten up the throttle?

On Tue, 30 Nov 2010, Mihael Hategan wrote:

> Can you make a heap dump of the relevant issue?
>
> On Tue, 2010-11-30 at 09:59 -0600, Justin M Wozniak wrote:
>> Along these lines, I'm looking at memory usage in Coasters.  There's a
>> plot attached below- usage spikes when the workers start running.
>>
>> 96% of the usage is byte[] which makes me think it could be KarajanChannel
>> stuff...
>>
>> http://www.ci.uchicago.edu/wiki/bin/view/SWFT/PerformanceNotes#Memory
>>
>> On Sun, 28 Nov 2010, Michael Wilde wrote:
>>
>>> This fix looks great so far - Ive tested with varying workflow sizes and delays, and have not seen any problems.
>>>
>>> - Mike
>>
>
>
>

-- 
Justin M Wozniak