From wilde at mcs.anl.gov  Fri Aug  1 14:49:03 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 01 Aug 2008 14:49:03 -0500
Subject: [Swift-devel] more compile time type checking
In-Reply-To: <Pine.LNX.4.64.0807290907400.22488@dildano.hawaga.org.uk>
References: <Pine.LNX.4.64.0807290907400.22488@dildano.hawaga.org.uk>
Message-ID: <489368AF.2000300@mcs.anl.gov>

The type checking is working nicely - it's a great improvement.

I just fixed several of my type errors in minutes, like:

Could not start execution.
         Compile error in foreach statement at line 26: Compile error in 
procedure invocation at line 28: Wrong type for parameter number 0, 
expected DockOut, got Dockout
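
As an illustration of the kind of check behind this message, here is a minimal Java sketch that compares declared parameter types with supplied argument types by exact, case-sensitive name. The class and method names are hypothetical, and this is not Swift's actual type checker; only the wording of the mismatch message is modelled on the error quoted above.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not Swift's actual compile-time checker.
final class ParamTypeCheck {
    /**
     * Compares declared parameter types with the types of the supplied
     * arguments, position by position, using case-sensitive name equality.
     * Returns null when everything matches, otherwise a description of the
     * first mismatch found.
     */
    static String firstMismatch(List<String> expected, List<String> actual) {
        if (expected.size() != actual.size()) {
            return "Wrong number of parameters, expected " + expected.size()
                    + ", got " + actual.size();
        }
        for (int i = 0; i < expected.size(); i++) {
            if (!expected.get(i).equals(actual.get(i))) {
                return "Wrong type for parameter number " + i
                        + ", expected " + expected.get(i)
                        + ", got " + actual.get(i);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // DockOut and Dockout differ only in case, so they are different types.
        System.out.println(firstMismatch(
                Arrays.asList("DockOut"), Arrays.asList("Dockout")));
    }
}
```

Because the comparison is exact, a one-letter capitalisation slip in a type name is reported before anything runs.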

Nice work, Milena and Ben!

- Mike


On 7/29/08 4:12 AM, Ben Clifford wrote:
> I just committed Milena's work on compile-time type checking.
> 
> Based on what happened last time I made changes to the compile-time sanity 
> checking, there will be some things you do or thought you could do in 
> your programs that will now not work.
> 
> When you discover such, file a bug or post to this list.
> 


From foster at mcs.anl.gov  Fri Aug  1 15:09:28 2008
From: foster at mcs.anl.gov (Ian Foster)
Date: Fri, 1 Aug 2008 15:09:28 -0500
Subject: [Swift-devel] more compile time type checking
In-Reply-To: <489368AF.2000300@mcs.anl.gov>
References: <Pine.LNX.4.64.0807290907400.22488@dildano.hawaga.org.uk>
	<489368AF.2000300@mcs.anl.gov>
Message-ID: <DC18EE44-690A-4EB2-A246-174AB0AD61C6@mcs.anl.gov>

lovely ...!

On Aug 1, 2008, at 2:49 PM, Michael Wilde wrote:

> The type checking is working nicely - it's a great improvement.
>
> I just fixed several of my type errors in minutes, like:
>
> Could not start execution.
>        Compile error in foreach statement at line 26: Compile error  
> in procedure invocation at line 28: Wrong type for parameter number  
> 0, expected DockOut, got Dockout
>
> Nice work, Milena and Ben!
>
> - Mike
>
>
> On 7/29/08 4:12 AM, Ben Clifford wrote:
>> I just committed Milena's work on compile-time type checking.
>> Based on what happened last time I made changes to the compile-time  
>> sanity checking, there will be some things you do or thought you  
>> could do in your programs that will now not work.
>> When you discover such, file a bug or post to this list.
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel


From bugzilla-daemon at mcs.anl.gov  Sat Aug  2 10:04:51 2008
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Sat,  2 Aug 2008 10:04:51 -0500 (CDT)
Subject: [Swift-devel] [Bug 152] New: filesys_mapper gives exception
Message-ID: <bug-152-21@http.bugzilla.mcs.anl.gov/swift/>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=152

           Summary: filesys_mapper gives exception
           Product: Swift
           Version: unspecified
          Platform: All
        OS/Version: Linux
            Status: NEW
          Severity: normal
          Priority: P2
         Component: SwiftScript language
        AssignedTo: benc at hawaga.org.uk
        ReportedBy: wilde at mcs.anl.gov


Running this script throws an exception in the filesys_mapper:

type File;

type Mol2;

(File out) rundock ( Mol2 ligand )
{
  app { echo "rundock debug:" @ligand @out stdout=@out; }
}

Mol2 ligand <filesys_mapper;
location="/disks/gpfs/ligandatlas/databases/KEGG_and_Drugs-test",
suffix="D01995.mol2">;

File out <"dockdb.out">;

out = rundock( ligand );

--

Gives:

Swift script dockdb1.swift starting at Sat Aug 2 09:54:06 CDT 2008
running on sites: localhost

Swift svn swift-r2159 cog-r2122 (CoG modified locally)

RunID: 20080802-0954-qmar4l7d
Progress: 
Execution failed:
        Index: 0

Swift Script dockdb1.swift ended at Sat Aug 2 09:54:09 CDT 2008 with exit code
0
--

Exception in log is:

2008-08-02 09:54:09,209-0500 INFO  New NEW
id=tag:benc at ci.uchicago.edu,2008:swift:dataset:20080802-0954-x2fxvmz5:720000000003
2008-08-02 09:54:09,248-0500 INFO  AbstractDataNode closed
tag:benc at ci.uchicago.edu,2008:swift:dataset:20080802-0954-x2fxvmz5:720000000004
2008-08-02 09:54:09,248-0500 INFO  AbstractDataNode ROOTPATH
dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20080802-0954-x2fxvmz5:720000000004
path=$
2008-08-02 09:54:09,249-0500 INFO  AbstractDataNode dataset
tag:benc at ci.uchicago.edu,2008:swift:dataset:20080802-0954-x2fxvmz5:720000000004
exception while mapping path from root
java.lang.IndexOutOfBoundsException: Index: 0
        at java.util.Collections$EmptyList.get(Collections.java:2968)
        at org.griphyn.vdl.mapping.Path.isArrayIndex(Path.java:271)
        at
org.griphyn.vdl.mapping.file.FileSystemArrayMapper.map(FileSystemArrayMapper.java:30)
        at
org.griphyn.vdl.mapping.AbstractDataNode.logContent(AbstractDataNode.java:376)
--

This occurred first when mapping a 4-member test dataset. I shrank the script
to a smaller example and show it here mapping a single-member dataset.

Script and all logs and output are in:
www.ci.uchicago.edu/~wilde/filesys_mapper_exception.2008.0802.tar.gz


-- 
Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You reported the bug, or are watching the reporter.


From bugzilla-daemon at mcs.anl.gov  Sat Aug  2 10:05:31 2008
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Sat,  2 Aug 2008 10:05:31 -0500 (CDT)
Subject: [Swift-devel] [Bug 152] filesys_mapper gives exception
In-Reply-To: <bug-152-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20080802150531.2EB4F164B1@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=152


wilde at mcs.anl.gov changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                URL|                            |http://www.ci.uchicago.edu/~
                   |                            |wilde/filesys_mapper_excepti
                   |                            |on.2008.0802.tar.gz




From wilde at mcs.anl.gov  Sat Aug  2 19:26:40 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Sat, 02 Aug 2008 19:26:40 -0500
Subject: [Swift-devel] coastersPerNode not recognized by GT2 GRAM
In-Reply-To: <Pine.LNX.4.64.0807301654170.5076@dildano.hawaga.org.uk>
References: <489053A9.4080906@mcs.anl.gov> <48907DF1.8020009@mcs.anl.gov>
	<48909148.1010504@mcs.anl.gov>
	<Pine.LNX.4.64.0807301654170.5076@dildano.hawaga.org.uk>
Message-ID: <4894FB40.6090008@mcs.anl.gov>

On 7/30/08 11:54 AM, Ben Clifford wrote:
> try cog r2123. i just tested that against ncsa teragrid. it now filters 
> out that attribute before sending on to gram2.

I just tried this, using cog r2125:
--
Swift script dock1.swift starting at Sat Aug 2 17:54:34 CDT 2008
running on sites: abe-coaster

Swift svn swift-r2171 cog-r2125 (CoG modified locally)
--

I still failed with the same error. The GRAM log showed the coasterspernode 
RSL variable still getting through to GRAM (below).  I added a second 
string, "coasterspernode", to your list of parameters to filter out, in 
all lower case, and this worked.

There's a small possibility that the first time I tried this, I lost the 
fix somewhere between the build and the install. I don't think that's the 
case, but I will check.

When you tested against a TG site, did you verify that the 
coasterspernode attribute wasn't getting into the RSL?

- Mike


<<<<<Job Request RSL
&("arguments" = "/u/ac/wilde/.globus/coasters/cscript49904.pl" 
"http://141.142.68.180:50091" "0" "1" "2" "3" "4" "5" "6" "7" )("coastersp
ernode" = "8" )("executable" = "/usr/bin/perl" )("maxwalltime" = "1:00" )
 >>>>>Job Request RSL
8/2 17:57:34
<<<<<Job Request RSL (canonical)
&("arguments" = "/u/ac/wilde/.globus/coasters/cscript49904.pl" 
"http://141.142.68.180:50091" "0" "1" "2" "3" "4" "5" "6" "7" )("coastersp
ernode" = "8" )("executable" = "/usr/bin/perl" )("maxwalltime" = "1:00" )
 >>>>>Job Request RSL (canonical)
8/2 17:57:34
<<<<<Job RSL
&("environment" = ("HOME" "/u/ac/wilde" ) ("LOGNAME" "wilde" ) 
)("arguments" = "/u/ac/wilde/.globus/coasters/cscript49904.pl" "http://141
.142.68.180:50091" "0" "1" "2" "3" "4" "5" "6" "7" )("coasterspernode" = 
"8" )("executable" = "/usr/bin/perl" )("maxwalltime" = "1:00" )
 >>>>>Job RSL
8/2 17:57:34
<<<<<Job RSL (post-eval)
&("environment" = ("HOME" "/u/ac/wilde" ) ("LOGNAME" "wilde" ) 
)("arguments" = "/u/ac/wilde/.globus/coasters/cscript49904.pl" "http://141
.142.68.180:50091" "0" "1" "2" "3" "4" "5" "6" "7" )("coasterspernode" = 
"8" )("executable" = "/usr/bin/perl" )("maxwalltime" = "1:00" )
 >>>>>Job RSL (post-eval)


From benc at hawaga.org.uk  Sun Aug  3 06:07:54 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sun, 3 Aug 2008 11:07:54 +0000 (GMT)
Subject: [Swift-devel] coastersPerNode not recognized by GT2 GRAM
In-Reply-To: <4894FB40.6090008@mcs.anl.gov>
References: <489053A9.4080906@mcs.anl.gov> <48907DF1.8020009@mcs.anl.gov>
	<48909148.1010504@mcs.anl.gov>
	<Pine.LNX.4.64.0807301654170.5076@dildano.hawaga.org.uk>
	<4894FB40.6090008@mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0808031104170.22488@dildano.hawaga.org.uk>


On Sat, 2 Aug 2008, Michael Wilde wrote:

> I still failed with the same error. The gram log showed the coasterspernode rsl
> variable still getting through to gram (below).  I added a second string,
> "coasterspernode" to your list of parameters to filter out, in all lower case,
> and this worked.

[..]

> When you tested against a TG site, did you verify that the coasterspernode
> attribute wasn't getting in the RSL?

No; I checked that jobs ran OK - likely I used the same capitalisation as 
in the source and you did not.

It's a case sensitivity bug which should be straightforward to fix.
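
One way to make such a filter insensitive to case is to normalise each attribute name before comparing it with the list of names to drop. The Java sketch below shows that idea only; the class name, method name and attribute map are hypothetical, and it is not the actual CoG provider code.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; not the actual CoG filtering code.
final class RslAttributeFilter {
    // Attributes meant for the coaster layer only, stored in lower case.
    private static final Set<String> INTERNAL =
            new HashSet<>(Arrays.asList("coasterspernode"));

    /** Returns a copy of attrs without internal attributes, ignoring key case. */
    static Map<String, String> forGram(Map<String, String> attrs) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : attrs.entrySet()) {
            // Normalise the key so coastersPerNode and coasterspernode both match.
            String key = e.getKey().toLowerCase(Locale.ROOT);
            if (!INTERNAL.contains(key)) {
                out.put(e.getKey(), e.getValue());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> attrs = new LinkedHashMap<>();
        attrs.put("coastersPerNode", "8");
        attrs.put("maxwalltime", "1:00");
        System.out.println(forGram(attrs).keySet());
    }
}
```

Storing the filter list in one canonical case avoids having to add a second spelling of each attribute by hand.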

-- 


From wilde at mcs.anl.gov  Sun Aug  3 08:47:39 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Sun, 03 Aug 2008 08:47:39 -0500
Subject: [Swift-devel] coastersPerNode not recognized by GT2 GRAM
In-Reply-To: <Pine.LNX.4.64.0808031104170.22488@dildano.hawaga.org.uk>
References: <489053A9.4080906@mcs.anl.gov> <48907DF1.8020009@mcs.anl.gov>
	<48909148.1010504@mcs.anl.gov>
	<Pine.LNX.4.64.0807301654170.5076@dildano.hawaga.org.uk>
	<4894FB40.6090008@mcs.anl.gov>
	<Pine.LNX.4.64.0808031104170.22488@dildano.hawaga.org.uk>
Message-ID: <4895B6FB.1030902@mcs.anl.gov>


On 8/3/08 6:07 AM, Ben Clifford wrote:
> On Sat, 2 Aug 2008, Michael Wilde wrote:
> 
>> I still failed with the same error. The gram log showed the coasterspernode rsl
>> variable still getting through to gram (below).  I added a second string,
>> "coasterspernode" to your list of parameters to filter out, in all lower case,
>> and this worked.
> 
> [..]
> 
>> When you tested against a TG site, did you verify that the coasterspernode
>> attribute wasn't getting in the RSL?
> 
> No; I checked that jobs ran OK - likely I used the same capitalisation as 
> in the source and you did not.
> 
> It's a case sensitivity bug which should be straightforward to fix.

That was the strange thing - I used the same capitalization in my 
<profile> tag as in your source rev, which didn't work. And the RSL in 
the GRAM log showed an all lower case attribute (which may have been 
GRAM's doing).

So one possibility is an error on my part in testing; a less likely one 
is that the system you tested against accepted the coasterspernode RSL 
attribute but the one I tested against (abe) did not.

I'll double-check on my side first.


From benc at hawaga.org.uk  Mon Aug  4 08:11:17 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 4 Aug 2008 13:11:17 +0000 (GMT)
Subject: [Swift-devel] coastersPerNode not recognized by GT2 GRAM
In-Reply-To: <Pine.LNX.4.64.0808031104170.22488@dildano.hawaga.org.uk>
References: <489053A9.4080906@mcs.anl.gov> <48907DF1.8020009@mcs.anl.gov>
	<48909148.1010504@mcs.anl.gov>
	<Pine.LNX.4.64.0807301654170.5076@dildano.hawaga.org.uk>
	<4894FB40.6090008@mcs.anl.gov>
	<Pine.LNX.4.64.0808031104170.22488@dildano.hawaga.org.uk>
Message-ID: <Pine.LNX.4.64.0808041309150.22488@dildano.hawaga.org.uk>


On Sun, 3 Aug 2008, Ben Clifford wrote:

>  I checked that jobs ran OK

Apparently I didn't check very well. I see that attribute being passed 
through. I made a modification to provider-wonky to help catch things like 
this in the future (It can now be made to get angry if there are spurious 
attributes supplied, which the local execution provider doesn't get upset 
about).
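
The general idea of a provider that rejects attributes it does not understand can be sketched as follows. This is a hedged Java illustration with hypothetical names (`StrictAttributeCheck`, `requireSupported`, and the sample attribute names); it does not reproduce what provider-wonky actually does.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of strict attribute validation; not provider-wonky itself.
final class StrictAttributeCheck {
    /**
     * Throws if any supplied attribute name is not in the supported set.
     * The supported names are expected in lower case; supplied names are
     * lower-cased before the lookup.
     */
    static void requireSupported(Set<String> supportedLowerCase, Set<String> supplied) {
        Set<String> spurious = new TreeSet<>();
        for (String name : supplied) {
            if (!supportedLowerCase.contains(name.toLowerCase(Locale.ROOT))) {
                spurious.add(name);
            }
        }
        if (!spurious.isEmpty()) {
            throw new IllegalArgumentException("Spurious job attributes: " + spurious);
        }
    }

    public static void main(String[] args) {
        Set<String> supported = new HashSet<>(Arrays.asList("maxwalltime", "count"));
        requireSupported(supported, new HashSet<>(Arrays.asList("maxwalltime")));
        try {
            requireSupported(supported, new HashSet<>(Arrays.asList("coastersPerNode")));
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Failing loudly in a test provider surfaces a leaked attribute locally, rather than only at a remote site that happens to be strict about RSL.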

-- 


From bugzilla-daemon at mcs.anl.gov  Mon Aug  4 09:58:14 2008
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Mon,  4 Aug 2008 09:58:14 -0500 (CDT)
Subject: [Swift-devel] [Bug 152] filesys_mapper gives exception
In-Reply-To: <bug-152-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20080804145814.66AC6164B1@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=152


benc at hawaga.org.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED


------- Comment #1 from benc at hawaga.org.uk  2008-08-04 09:58 -------
This should probably produce something like a type exception.

rundock takes the filename of a datatype which also represents a single file
(which is OK) but then the mapping expression is something which will map an
array.

Potentially a compile-time typecheck could happen there for some mappers; but
at the very least this is detectable at execution time.




From bugzilla-daemon at mcs.anl.gov  Mon Aug  4 10:55:49 2008
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Mon,  4 Aug 2008 10:55:49 -0500 (CDT)
Subject: [Swift-devel] [Bug 152] filesys_mapper gives exception
In-Reply-To: <bug-152-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20080804155549.65E1E16469@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=152


wilde at mcs.anl.gov changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |wilde at mcs.anl.gov


------- Comment #2 from wilde at mcs.anl.gov  2008-08-04 10:55 -------
I need to go back and check (won't have time today) but I think the problem
first occurred with no type conflict, using filesys_mapper to map an array, as
it was intended.  So I suspect the problem is in filesys_mapper itself or its
interface back to Swift.

The conflict you describe here occurred in my attempt to reproduce the problem
in a simple example.


(In reply to comment #1)
> This should probably produce something like a type exception.
> 
> rundock takes the filename of a datatype which also represents a single file
> (which is OK) but then the mapping expression is something which will map an
> array.
> 
> Potentially a compile-time typecheck could happen there for some mappers; but
> at the very least this is detectable at execution time.
> 




From bugzilla-daemon at mcs.anl.gov  Mon Aug  4 12:07:08 2008
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Mon,  4 Aug 2008 12:07:08 -0500 (CDT)
Subject: [Swift-devel] [Bug 152] filesys_mapper gives exception
In-Reply-To: <bug-152-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20080804170708.EF26D164B1@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=152


------- Comment #3 from benc at hawaga.org.uk  2008-08-04 12:07 -------
In a brief attempt to recreate this, I get this exception instead: (also
undesirable but not the same as what you reported). I will see if I can figure
out the difference in our setups.

$ swift bug152.swift
Swift svn swift-r2159 (Swift modified locally) cog-r2127 (CoG modified locally)

RunID: 20080804-1905-kdh8mzec
Execution failed:
        java.lang.IllegalStateException: mapper.existing() returned a path [0]
that it cannot subsequently map




From bugzilla-daemon at mcs.anl.gov  Mon Aug  4 12:25:59 2008
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Mon,  4 Aug 2008 12:25:59 -0500 (CDT)
Subject: [Swift-devel] [Bug 152] filesys_mapper gives exception
In-Reply-To: <bug-152-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20080804172559.34BAD16469@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=152


------- Comment #4 from benc at hawaga.org.uk  2008-08-04 12:25 -------
The difference between the run you give and the run I tried that appears to
cause the problem is that the single file mapped in your case has a name that
consists entirely of the suffix, with no base filename on it. That probably
should be made to work. However, it suggests that this is a different
exception from the one you got with more than one file, given that only one
file can exist where the name consists only of the suffix.




From bugzilla-daemon at mcs.anl.gov  Mon Aug  4 13:29:21 2008
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Mon,  4 Aug 2008 13:29:21 -0500 (CDT)
Subject: [Swift-devel] [Bug 152] filesys_mapper gives exception
In-Reply-To: <bug-152-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20080804182921.2FAB4164B1@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=152


------- Comment #5 from benc at hawaga.org.uk  2008-08-04 13:29 -------
The specific error message you report appears to happen when *no* files match
(a file named entirely with the suffix does not match because the mapper
assumes there will be a . between the main filename and the suffix) in the case
of the type violation that you have in the example code.




From benc at hawaga.org.uk  Mon Aug  4 13:48:06 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 4 Aug 2008 18:48:06 +0000 (GMT)
Subject: [Swift-devel] type-checking mappers
Message-ID: <Pine.LNX.4.64.0808041845320.22488@dildano.hawaga.org.uk>


In the context of bug 152, I have thought a little about type checking 
mappers. Some mappers are amenable to compile-time type checking - for 
example, the single-file mapper can only map a simple unstructured type; 
the filesys-mapper can only map a single dimensional array of unstructured 
types. Not all mappers seem to work with this - for example the external 
mapper can map any shape structure.
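
A rough Java sketch of that idea: keep a table of the shape each mapper can produce, and compare it with the shape of the declared variable. Mappers that can produce any shape are simply left out of the table, so no check is made for them. The class, the `Shape` enum, and the mapper name spellings `single_file_mapper` and `ext` are assumptions made for illustration; only `filesys_mapper` is spelled as in bug 152.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a compile-time mapper/type compatibility check.
final class MapperShapeCheck {
    enum Shape { SINGLE, ARRAY_1D }

    // Shape each checkable mapper produces. Mapper names other than
    // filesys_mapper are assumed spellings.
    private static final Map<String, Shape> PRODUCES = new HashMap<>();
    static {
        PRODUCES.put("single_file_mapper", Shape.SINGLE);
        PRODUCES.put("filesys_mapper", Shape.ARRAY_1D);
        // "ext" is deliberately absent: it can map any shape, so it
        // cannot be checked this way.
    }

    /** Returns null if the pairing is acceptable or uncheckable, else an error. */
    static String check(String mapper, Shape declared) {
        Shape produced = PRODUCES.get(mapper);
        if (produced == null || produced == declared) {
            return null;
        }
        return "Mapper " + mapper + " maps " + produced
                + " but the variable is declared as " + declared;
    }

    public static void main(String[] args) {
        // The bug 152 script declares a single Mol2 with filesys_mapper.
        System.out.println(check("filesys_mapper", Shape.SINGLE));
    }
}
```

Under these assumptions, the declaration in bug 152 would be rejected before execution, while anything using the external mapper would still need a run-time check.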

-- 


From bugzilla-daemon at mcs.anl.gov  Mon Aug  4 14:06:24 2008
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Mon,  4 Aug 2008 14:06:24 -0500 (CDT)
Subject: [Swift-devel] [Bug 152] Mappers used with incorrect types cause
	unintuitive error messages.
In-Reply-To: <bug-152-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20080804190624.D2F74164B1@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=152


benc at hawaga.org.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|normal                      |enhancement
            Summary|filesys_mapper gives        |Mappers used with incorrect
                   |exception                   |types cause unintuitive
                   |                            |error messages.


------- Comment #6 from benc at hawaga.org.uk  2008-08-04 14:06 -------
r2174 fixes the null pointer, and a (more sane?) "mapper failed to map..."
error now results in the situation where an unstructured type is used with no
files.

However this (and the error in comment #3) should probably still be caught with
better type checking. Changing this to an enhancement request.




From bugzilla-daemon at mcs.anl.gov  Mon Aug  4 14:13:26 2008
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Mon,  4 Aug 2008 14:13:26 -0500 (CDT)
Subject: [Swift-devel] [Bug 147] swift hangs at faulty mapping
In-Reply-To: <bug-147-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20080804191326.560D7164B2@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=147


benc at hawaga.org.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED


------- Comment #1 from benc at hawaga.org.uk  2008-08-04 14:13 -------
r2151 removes the spurious '-waitfor' parameter.

r2155 makes the external mapper labelled as static, which makes it work for the
sample code supplied out of band by skenny.




From bugzilla-daemon at mcs.anl.gov  Mon Aug  4 14:20:23 2008
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Mon,  4 Aug 2008 14:20:23 -0500 (CDT)
Subject: [Swift-devel] [Bug 150] multiple workers on one compute node
In-Reply-To: <bug-150-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20080804192023.C10FC16469@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=150


benc at hawaga.org.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED


------- Comment #2 from benc at hawaga.org.uk  2008-08-04 14:20 -------
CoG r2094 introduces a coastersPerNode parameter. This is documented in the
Swift users guide. This setting will cause a specified number of workers to be
started on each node that coasters run on. Support for multiple GRAM-level jobs
on each node is not needed in order to use this.




From bugzilla-daemon at mcs.anl.gov  Mon Aug  4 14:24:29 2008
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Mon,  4 Aug 2008 14:24:29 -0500 (CDT)
Subject: [Swift-devel] [Bug 107] restarts broken (by generalisation of data
	file handling)
In-Reply-To: <bug-107-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20080804192429.3ECDA164B2@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=107


benc at hawaga.org.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|                            |FIXED


------- Comment #10 from benc at hawaga.org.uk  2008-08-04 14:24 -------
No one has reported any further problems with restarts, so I'm happy that they
work enough for this bug to be closed.




From bugzilla-daemon at mcs.anl.gov  Mon Aug  4 14:27:16 2008
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Mon,  4 Aug 2008 14:27:16 -0500 (CDT)
Subject: [Swift-devel] [Bug 101] fast-failing sites will absorb large
	numbers of jobs causing runs to fail despite multiple
	attempts at retrying
In-Reply-To: <bug-101-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20080804192716.24B1216469@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=101


benc at hawaga.org.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED


------- Comment #5 from benc at hawaga.org.uk  2008-08-04 14:27 -------
CoG r2058 and numerous subsequent commits add delays for bad sites.




From bugzilla-daemon at mcs.anl.gov  Mon Aug  4 14:30:13 2008
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Mon,  4 Aug 2008 14:30:13 -0500 (CDT)
Subject: [Swift-devel] [Bug 26] implement 'swiftstat'
In-Reply-To: <bug-26-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20080804193013.CA243164B2@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=26


------- Comment #2 from benc at hawaga.org.uk  2008-08-04 14:30 -------
Over the past months, two similar but different pieces of code have been
implemented:

Firstly, Swift generates a periodic status line giving a count of how many jobs
are in each of a number of general states. Secondly for more detailed
information, copious graphical and textual analysis of a Swift run (either in
progress or ended) is available through the log-processing package.
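
The first of these, a per-state job count, can be pictured with the short Java sketch below. It is an illustration only: the class name, the state names and the exact line layout are assumptions, not Swift's real status output.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of a per-state job tally; not Swift's actual status code.
final class StatusLine {
    /** Builds one summary line with a count for each state that occurs. */
    static String summarize(List<String> jobStates) {
        // TreeMap keeps the states in alphabetical order for stable output.
        Map<String, Integer> counts = new TreeMap<>();
        for (String state : jobStates) {
            counts.merge(state, 1, Integer::sum);
        }
        StringBuilder line = new StringBuilder("Progress:");
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            line.append("  ").append(e.getKey()).append(':').append(e.getValue());
        }
        return line.toString();
    }

    public static void main(String[] args) {
        System.out.println(summarize(Arrays.asList("Active", "Submitted", "Active")));
    }
}
```

A periodic reporter would call something like `summarize` on a timer with the current job states.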




From benc at hawaga.org.uk  Tue Aug  5 07:04:20 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 5 Aug 2008 12:04:20 +0000 (GMT)
Subject: [Swift-devel] coastersPerNode not recognized by GT2 GRAM
In-Reply-To: <Pine.LNX.4.64.0808041309150.22488@dildano.hawaga.org.uk>
References: <489053A9.4080906@mcs.anl.gov> <48907DF1.8020009@mcs.anl.gov>
	<48909148.1010504@mcs.anl.gov>
	<Pine.LNX.4.64.0807301654170.5076@dildano.hawaga.org.uk>
	<4894FB40.6090008@mcs.anl.gov>
	<Pine.LNX.4.64.0808031104170.22488@dildano.hawaga.org.uk>
	<Pine.LNX.4.64.0808041309150.22488@dildano.hawaga.org.uk>
Message-ID: <Pine.LNX.4.64.0808051203550.22488@dildano.hawaga.org.uk>


On Mon, 4 Aug 2008, Ben Clifford wrote:

> Apparently I didn't check very well. I see that attribute being passed 

This should be fixed in cog r2127.

-- 


From benc at hawaga.org.uk  Tue Aug  5 08:21:28 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 5 Aug 2008 13:21:28 +0000 (GMT)
Subject: [Swift-devel] Re: NCSA-hg servers
In-Reply-To: <488E0562.4070702@mcs.anl.gov>
References: <488E0562.4070702@mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0808051317170.22488@dildano.hawaga.org.uk>


On Mon, 28 Jul 2008, Michael Wilde wrote:

> When I tried gridftp-hg.ncsa.teragrid.org, it worked the first time, although
> with an unexpected lengthy delay (seemed about 15-30 seconds) but when I
> retried the same command I got the cert error below.

There are four hosts behind tg-gridftp.ncsa.teragrid.org.

Three of them have certificates for which communicado's CRLs have expired 
(141.42.48.24[341]), whilst the fourth has a certificate that communicado 
regards as valid (141.42.48.242).

Using a custom ~/.globus/certificates directory with no CRLs, I can 
communicate with all four of the above servers.

I will poke the relevant authorities.

-- 


From benc at hawaga.org.uk  Tue Aug  5 09:16:04 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 5 Aug 2008 14:16:04 +0000 (GMT)
Subject: [Swift-devel] Some observations
In-Reply-To: <Pine.LNX.4.64.0807280415430.5076@dildano.hawaga.org.uk>
References: <fec1351f0807271340j5b6a1a9dj92cc12ae138b2206@mail.gmail.com>
	<Pine.LNX.4.64.0807280415430.5076@dildano.hawaga.org.uk>
Message-ID: <Pine.LNX.4.64.0808051413420.22488@dildano.hawaga.org.uk>


> On Sun, 27 Jul 2008, Tiberiu Stef-Praun wrote:
> 
> > I was trying to read into swift the contents of a file which contained
> > a float (e.g. 0.415599405693).
> > It has been suggested that I use readData.
> > It did not work (some error about unable to cast to java.lang.Integer)

I just tested this and it seems to work for me. I added the test to 
tests/language-behaviour/readData.swift in r2176.

Please try that test and check that it works for you. If you can come up 
with an example that does not work that would also be useful, as would the 
actual error message.

-- 


From bugzilla-daemon at mcs.anl.gov  Tue Aug  5 14:51:22 2008
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Tue,  5 Aug 2008 14:51:22 -0500 (CDT)
Subject: [Swift-devel] [Bug 149] Improve readdata() error message
In-Reply-To: <bug-149-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20080805195122.D29A816469@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=149


hategan at mcs.anl.gov changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|                            |FIXED


------- Comment #2 from hategan at mcs.anl.gov  2008-08-05 14:51 -------
No further complaints received. Closing...




From benc at hawaga.org.uk  Wed Aug  6 11:13:19 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 6 Aug 2008 16:13:19 +0000 (GMT)
Subject: [Swift-devel] swift 0.6 rc5
Message-ID: <Pine.LNX.4.64.0808061608240.22488@dildano.hawaga.org.uk>


Once again it's time to try a release candidate for 0.6.

http://www.ci.uchicago.edu/~benc/vdsk-0.6-rc5.tar.gz

Please test and report.

It is built with coasters and provider-wonky both enabled.

I ran the site tests and got these results:

These sites failed: fletch-condor-gram2.xml osg-edu.cs.wisc.edu-condor.xml 
tgncsa-hg-pbs-gram4.xml tgpurdue-condor-gram2.xml 
tgpurdue-condor-gram4.xml tgtacc-fork-gram2.xml tgtacc-lsf-gram2.xml 
UCLA_Saxon_Tier3-fork.xml

These sites worked: fletch-fork-gram2.xml osg-edu.cs.wisc.edu-fork.xml 
tgncsa-hg-fork-gram2.xml tgncsa-hg-fork-gram4.xml tgncsa-hg-pbs-gram2.xml 
tgpurdue-fork-gram2.xml tgpurdue-fork-gram4.xml tguc-fork-gram2.xml 
tguc-fork-gram4.xml tguc-pbs-gram2-syntax1.xml tguc-pbs-gram2.xml 
tguc-pbs-gram4.xml tp-fork-gram2.xml tp-fork-gram4.xml tp-pbs-gram2.xml

Nothing looks too tragic there; I'll investigate the failures later but 
they all look like site-specific problems, not swift problems.

During local testing on communicado, I once saw one of the tests hang in 
initialising site state, but was unable to get that to reappear. So I 
think there's something fishy still going on with load management/site 
selection but not excessively bad.

-- 


From bugzilla-daemon at mcs.anl.gov  Wed Aug  6 12:01:18 2008
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Wed,  6 Aug 2008 12:01:18 -0500 (CDT)
Subject: [Swift-devel] [Bug 153] New: SGE adapter for gram acting weird on
	TACC_Ranger
Message-ID: <bug-153-21@http.bugzilla.mcs.anl.gov/swift/>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=153

           Summary: SGE adapter for gram acting weird on TACC_Ranger
           Product: Swift
           Version: unspecified
          Platform: Macintosh
        OS/Version: Mac OS
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Specific site issues
        AssignedTo: benc at hawaga.org.uk
        ReportedBy: skenny at uchicago.edu


when the output is not redirected, the job is put in an
"Unscheduled" state in the queue, and GRAM never gets any kind of
further notification. 

because of this problem with sge, swift has to be hacked to redirect stdout in
order to run jobs on ranger.




From bugzilla-daemon at mcs.anl.gov  Wed Aug  6 12:42:27 2008
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Wed,  6 Aug 2008 12:42:27 -0500 (CDT)
Subject: [Swift-devel] [Bug 153] SGE adapter for gram acting weird on
	TACC_Ranger
In-Reply-To: <bug-153-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20080806174227.6EABF16469@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=153


------- Comment #1 from skenny at uchicago.edu  2008-08-06 12:42 -------
the file that needed to be altered for redirection of stdout was:

swift/libexec/vdl-int.k

175c175
<                               task:execute("/bin/rm", arguments="-rf {dir}", host=host, batch=true, stdout="/dev/null", stderr="/dev/null")
---
>                               task:execute("/bin/rm", arguments="-rf {dir}", host=host, batch=true)
403c403
<                                                       redirect=true
---
>                                                       redirect=false




From lixi at uchicago.edu  Wed Aug  6 23:18:56 2008
From: lixi at uchicago.edu (lixi at uchicago.edu)
Date: Wed,  6 Aug 2008 23:18:56 -0500 (CDT)
Subject: [Swift-devel] Swift run: java.io.IOException: Unknown error
 512
Message-ID: <20080806231856.BCW02035@m4500-03.uchicago.edu>

Hi,

I ran a workflow like this:
[lixi at communicado test]$ /home/lixi/performancetest/4/cog/modules/vdsk/dist/vdsk-svn/bin/swift \
    -sites.file ../sitesfile/SELECT1/sites2.0808062300.xml \
    -tc.file ../tc.data testworkflow.swift >0808062300.log 2>&1 &

During the execution, it stopped suddenly and the stdout and
stderr are included in
/home/lixi/performancetest/test/0808062300.log. It seems
that it stopped due to "java.io.IOException: Unknown error
512"

The log file is
/home/lixi/performancetest/test/testworkflow-20080806-2301-m1qbxjr3.log

[lixi at communicado test]$ tail -n 20 0808062300.log 
Sorted: [LIGO_UWM_NEMO:140.112(90.071):37/37 overload: 0]
node10 completed
Sorted: [FLTECH:144.563(90.361):37/37 overload: 0]
node10 completed
Sorted: [UTA_SWT2:147.336(90.533):37/37 overload: 0]
node10 completed
Sorted: [FLTECH:146.739(90.497):37/37 overload: 0]
node10 completed
Sorted: [TTU-ANTAEUS:21.888(51.767):21/21 overload: 0]
Sorted: [TTU-ANTAEUS:22.888(53.230):21/22 overload: 0]
Sorted: [TTU-ANTAEUS:22.888(53.230):22/22 overload: 0]
node10 completed
Progress:  Selecting site:1497 Stage in:19 Executing:170 Stage out:165 Finished successfully:106 Initializing site shared directory:2 Failed but can retry:41
java.io.IOException: Unknown error 512
        at java.io.FileInputStream.readBytes(Native Method)
        at java.io.FileInputStream.read(FileInputStream.java:194)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:235)
        at org.griphyn.vdl.karajan.InHook.run(InHook.java:39)
        at java.lang.Thread.run(Thread.java:595)

Would you please tell me why such an error happened and what 
to do with it?

Thanks,

Xi


From benc at hawaga.org.uk  Thu Aug  7 03:09:53 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 7 Aug 2008 08:09:53 +0000 (GMT)
Subject: [Swift-devel] Swift run: java.io.IOException: Unknown error 512
In-Reply-To: <20080806231856.BCW02035@m4500-03.uchicago.edu>
References: <20080806231856.BCW02035@m4500-03.uchicago.edu>
Message-ID: <Pine.LNX.4.64.0808070805470.29009@dildano.hawaga.org.uk>


Can you reproduce it?

Google shows occurrences of that exception (unknown err 512 in 
FileInputStream.readBytes) happening when the java process has been set to 
run in the background, when reading from the console.

Were you doing anything like that? (eg running with & after the command or 
pressing ctrl-z)
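
A rough way to check for that situation from a script, sketched in Python 
and assuming a POSIX system: a process is in the background of its 
terminal when the terminal's foreground process group differs from its 
own. The function name is made up for illustration and is not part of 
Swift; whether this is really what triggers error 512 here is still only 
a guess.

```python
import os


def is_background_of_tty(fd: int = 0) -> bool:
    """Return True if fd is a terminal and this process is not in its
    foreground process group (e.g. started with '&' or stopped with ctrl-z)."""
    if not os.isatty(fd):
        # Not a terminal at all (pipe, file, /dev/null): reads cannot hit
        # the background-read case.
        return False
    try:
        return os.tcgetpgrp(fd) != os.getpgrp()
    except OSError:
        # fd is a terminal, but not our controlling terminal.
        return False
```

If that check is true for stdin, redirecting stdin (for example from 
/dev/null) when starting the run in the background would be the thing to 
try.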

-- 


From mikekubal at yahoo.com  Thu Aug  7 20:42:30 2008
From: mikekubal at yahoo.com (Mike Kubal)
Date: Thu, 7 Aug 2008 18:42:30 -0700 (PDT)
Subject: [Swift-devel] connection error
Message-ID: <732920.74580.qm@web52308.mail.re2.yahoo.com>

Hi Ben,

I have a feeling this is a certificate or CRL issue on the host machine (terminable at the CI), but perhaps you can tell for sure by examining the log, Pipeline_BoNT-20080807-1449-zm9x88ad.log . Nothing in the swift code or sites file has changed.

Caused by:
        Cannot submit job
Caused by:
        The connection to the server failed (check host and port) [Caused by: Connection refused]
Progress:  Selecting site:1035 Executing:1 Failed:2 Failed but can retry:1

I rsync'd over many logs from various stages of processing 4000 ligands against a target to your CI swift-log dir, including the one above.

Thanks,

MikeK



From wilde at mcs.anl.gov  Thu Aug  7 21:53:27 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 07 Aug 2008 21:53:27 -0500
Subject: [Swift-devel] connection error
In-Reply-To: <732920.74580.qm@web52308.mail.re2.yahoo.com>
References: <732920.74580.qm@web52308.mail.re2.yahoo.com>
Message-ID: <489BB527.5080403@mcs.anl.gov>

Mike, I wonder if you can try communicado.ci.uchicago.edu?

It should require no change in any scripts, tools or procedures (I think).

I *think* that communicado's certs are up to date.

It's not clear yet that this is a host-cert problem, but that's worth a try.

What server were you trying to reach?  Can you test a simple 
globus-job-run to it?

- Mike


On 8/7/08 8:42 PM, Mike Kubal wrote:
> Hi Ben,
> 
> I have a feeling this is a certificate or CRL issue on the host machine 
> (terminable at the CI), but perhaps you can tell for sure by examining 
> the log, Pipeline_BoNT-20080807-1449-zm9x88ad.log . Nothing in the swift 
> code or sites file has changed.
> 
> Caused by:
>         Cannot submit job
> Caused by:
>         The connection to the server failed (check host and port) 
> [Caused by: Connection refused]
> Progress:  Selecting site:1035 Executing:1 Failed:2 Failed but can retry:1
> 
> I rsync'd over many logs from various stages of processing 4000 ligands 
> against a target to your CI swift-log dir, including the one above.
> 
> Thanks,
> 
> MikeK
> 
> 
> 
> 
> ------------------------------------------------------------------------
> 
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel


From benc at hawaga.org.uk  Fri Aug  8 01:16:33 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 8 Aug 2008 06:16:33 +0000 (GMT)
Subject: [Swift-devel] Re: connection error
In-Reply-To: <732920.74580.qm@web52308.mail.re2.yahoo.com>
References: <732920.74580.qm@web52308.mail.re2.yahoo.com>
Message-ID: <Pine.LNX.4.64.0808080612150.22488@dildano.hawaga.org.uk>


'Connection refused' is most likely a TCP-level connection error, so the 
problem is lower in the stack than security. 


And indeed:

$ telnet grid-abe.ncsa.teragrid.org 2119
Trying 141.142.68.180...
telnet: Unable to connect to remote host: Connection refused

That is probably something to report to help at teragrid.
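
The same check as the telnet above can be scripted. This is a minimal 
sketch in Python using only the standard socket module; probe_tcp is a 
hypothetical helper name, not something that ships with Swift or Globus.

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 5.0) -> str:
    """Classify a TCP port as 'open', 'refused', or 'unreachable: <reason>'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered but nothing is listening on that port; this is
        # the case telnet reports as "Connection refused".
        return "refused"
    except OSError as exc:
        # Timeouts, DNS failures, and routing errors all end up here.
        return f"unreachable: {exc}"


# Example call, matching the telnet test above:
#   probe_tcp("grid-abe.ncsa.teragrid.org", 2119)
```

A "refused" result means the TCP connection never got established, so 
certificates and CRLs have not been looked at yet.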


On Thu, 7 Aug 2008, Mike Kubal wrote:
> 
> Caused by:
>         Cannot submit job
> Caused by:
>         The connection to the server failed (check host and port) [Caused by: Connection refused]
> Progress:  Selecting site:1035 Executing:1 Failed:2 Failed but can retry:1
> 
> I rsync'd over many logs from various stages of processing 4000 ligands against a target to your CI swift-log dir, including the one above.
> 
> Thanks,
> 
> MikeK
> 
> 
> 

From benc at hawaga.org.uk  Fri Aug  8 07:29:35 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 8 Aug 2008 12:29:35 +0000 (GMT)
Subject: [Swift-devel] swift + pacman
Message-ID: <Pine.LNX.4.64.0808081224160.22488@dildano.hawaga.org.uk>


I made a pacman wrapper for swift 0.6 rc5.

If you are a pacman fanatic, for example if you like to install the OSG or 
VDT stacks often, you can add swift into an installation directory like 
this:

$ pacman -get http://www.ci.uchicago.edu/~benc/pacman:swift-0.6-rc5

In part, this is for experimenting with bugs 146 and 104 to bring in more 
dependencies into the release (mostly for credential management). A 
pacman-based approach could put swift in a custom cut-down VDT environment 
with only the requested dependencies.

-- 


From benc at hawaga.org.uk  Fri Aug  8 08:15:21 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 8 Aug 2008 13:15:21 +0000 (GMT)
Subject: [Swift-devel] swift+vdt bastard offspring
Message-ID: <Pine.LNX.4.64.0808081310510.22488@dildano.hawaga.org.uk>


I made a pacman package which will deploy both swift and the packages 
requested in bug 104 and 146 (the DOE CA cert-request tools and 
voms-proxy-init).

VDT/OSG installation instructions/rules apply as do the many foibles of 
pacman.

 $ pacman -get http://www.ci.uchicago.edu/~benc/pacman:swift-tools
[...]
 $ source setup.sh
 $ du -hsc .
163M    .

swift, voms-proxy-init and cert-request are all on the path.

-- 


From bugzilla-daemon at mcs.anl.gov  Fri Aug  8 09:29:56 2008
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Fri,  8 Aug 2008 09:29:56 -0500 (CDT)
Subject: [Swift-devel] [Bug 146] Add voms-proxy-init command to Swift release
In-Reply-To: <bug-146-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20080808142956.22460164B1@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=146


------- Comment #2 from benc at hawaga.org.uk  2008-08-08 09:29 -------
I combined swift with part of VDT. This gives voms-proxy-init.

See this message for more details:
http://mail.ci.uchicago.edu/pipermail/swift-devel/2008-August/003809.html




From bugzilla-daemon at mcs.anl.gov  Fri Aug  8 09:34:51 2008
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Fri,  8 Aug 2008 09:34:51 -0500 (CDT)
Subject: [Swift-devel] [Bug 104] Add cert request tools to swift/bin
In-Reply-To: <bug-104-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20080808143451.3C70B164B1@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=104


------- Comment #6 from benc at hawaga.org.uk  2008-08-08 09:34 -------
I made a combination of parts of VDT along with Swift, which provides (via VDT)
the cert-request tools mentioned here.

See this message for more details:
http://mail.ci.uchicago.edu/pipermail/swift-devel/2008-August/003809.html




From lixi at uchicago.edu  Sun Aug 10 15:43:17 2008
From: lixi at uchicago.edu (lixi at uchicago.edu)
Date: Sun, 10 Aug 2008 15:43:17 -0500 (CDT)
Subject: [Swift-devel] Swift run: hanging up when submitting a job
Message-ID: <20080810154317.BCY32452@m4500-03.uchicago.edu>

Hi,

Today I ran a workflow including 3000 jobs with replication 
enabled. 2999 jobs finished successfully and only one job is 
hanging up. When taking a close look at the log file, I 
found the hanging job id is 0-2800, so I execute the 
following command to check the job:

[lixi at communicado 3000]$ grep 0-2800 testworkflow-20080810-0953-mlj2nsc4.log
2008-08-10 09:53:53,032-0500 INFO  worknode PROCEDURE thread=0-2800 name=worknode
2008-08-10 09:53:54,200-0500 INFO  vdl:parameterlog PARAM thread=0-2800 direction=input variable=input provenanceid=tag:benc at ci.uchicago.edu,2008:swift:dataset:20080810-0953-d6p5ul9d:720000000006
2008-08-10 09:53:55,708-0500 INFO  vdl:parameterlog PARAM thread=0-2800 direction=output variable=output provenanceid=tag:benc at ci.uchicago.edu,2008:swift:dataset:20080810-0953-d6p5ul9d:720000005789
2008-08-10 09:54:05,612-0500 INFO  vdl:execute START thread=0-2800 tr=node10
2008-08-10 10:46:10,044-0500 DEBUG vdl:execute2 THREAD_ASSOCIATION jobid=node10-19x1krxi thread=0-2800-1 host=AGLT2 replicationGroup=fot1krxi
2008-08-10 10:46:15,494-0500 DEBUG TaskImpl Task(type=FILE_OPERATION, identity=urn:0-2800-1-1-1218380053396) setting status to Submitting
2008-08-10 10:46:15,494-0500 DEBUG TaskImpl Task(type=FILE_OPERATION, identity=urn:0-2800-1-1-1218380053396) setting status to Submitted
2008-08-10 10:46:15,494-0500 DEBUG TaskImpl Task(type=FILE_OPERATION, identity=urn:0-2800-1-1-1218380053396) setting status to Active
2008-08-10 10:46:15,494-0500 DEBUG TaskImpl Task(type=FILE_OPERATION, identity=urn:0-2800-1-1-1218380053396) setting status to Completed
2008-08-10 10:46:15,494-0500 INFO  LateBindingScheduler Task(type=FILE_OPERATION, identity=urn:0-2800-1-1-1218380053396) Completed. Waiting: 2472, Running: 66. Heap size: 355M, Heap free: 141M, Max heap: 986M
2008-08-10 10:46:17,377-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, identity=urn:0-2800-1-1218380053474) setting status to Submitting
2008-08-10 10:46:18,848-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, identity=urn:0-2800-1-1218380053474) setting status to Submitted
2008-08-10 10:46:18,848-0500 DEBUG WeightedHostScoreScheduler Submission time for Task(type=JOB_SUBMISSION, identity=urn:0-2800-1-1218380053474): 1471ms. Score delta: -0.024897435897435895
2008-08-10 10:46:30,063-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, identity=urn:0-2800-1-1218380053474) setting status to Active

Looking at the log file, we can see that the submission of 
this job wasn't finished. So I think that this is why no 
replication job was generated for this job after so long a 
time, even with replication enabled.

This is my understanding. Please tell me if I have 
misunderstood anything. If my understanding is right, is 
there any solution to this kind of situation? The log file is:
/home/lixi/performancetest/2/application/3000/testworkflow-20080810-0953-mlj2nsc4.log

Thanks,

Xi


From hategan at mcs.anl.gov  Sun Aug 10 15:58:33 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sun, 10 Aug 2008 15:58:33 -0500
Subject: [Swift-devel] Swift run: hanging up when submitting a job
In-Reply-To: <20080810154317.BCY32452@m4500-03.uchicago.edu>
References: <20080810154317.BCY32452@m4500-03.uchicago.edu>
Message-ID: <1218401913.9399.10.camel@localhost>

On Sun, 2008-08-10 at 15:43 -0500, lixi at uchicago.edu wrote:
> Hi,
> 
> Today I ran a workflow including 3000 jobs with replication 
> enabled. 2999 jobs finished successfully and only one job is 
> hanging up. When taking a close look at the log file, I 
> found the hanging job id is 0-2800, so I execute the 
> following command to check the job:
> 
> [...]
> 2008-08-10 10:46:17,377-0500 DEBUG TaskImpl Task
> (type=JOB_SUBMISSION, identity=urn:0-2800-1-1218380053474) 
> setting status to Submitting
> 2008-08-10 10:46:18,848-0500 DEBUG TaskImpl Task
> (type=JOB_SUBMISSION, identity=urn:0-2800-1-1218380053474) 
> setting status to Submitted
> 2008-08-10 10:46:18,848-0500 DEBUG 
> WeightedHostScoreScheduler Submission time for Task
> (type=JOB_SUBMISSION, identity=urn:0-2800-1-1218380053474): 
> 1471ms. Score delta: -0.024897435897435895
> 2008-08-10 10:46:30,063-0500 DEBUG TaskImpl Task
> (type=JOB_SUBMISSION, identity=urn:0-2800-1-1218380053474) 
> setting status to Active
> 
> >From the log file, we can see that the submission of this 
> job wasn't finished.

Actually the job was submitted and it appears to be running.

>  So I think that this is why no 
> replicaiton job was generated for this job after so long a 
> time even with replication enabled.

Replication only works if the job is queued. This job seems to be
running. Though we're probably talking about the site going bad after
the job started to run causing the notifications of the job
completing/failing to not be sent.

> 
> This is my understanding. I wonder if I made any 
> misunderstanding. If my understanding is right, is there any 
> solution to this kind of situation?

It's not simple. If notification is unreliable it's impossible to
distinguish between a really long process and the notification having
been lost. That is if there is no information about how long the process
is.

So one solution would be to make "notifications" more reliable by
polling for the job status. But GRAM makes it really hard to do this
efficiently (each poll for each job involves one full SSL session
establishment).

The other solution is to put a cap on the process duration. So if the
job has a walltime spec, consider notifications lost if the job doesn't
complete in walltime + some_margin_of_error.
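
As a rough sketch of that second option (Python, illustrative only; the
function name and the example walltime and margin are assumptions, not
anything Swift does today), the check boils down to comparing the time
since the job went Active against walltime plus the margin:

```python
from datetime import datetime, timedelta


def notification_presumed_lost(active_since: datetime,
                               walltime: timedelta,
                               margin: timedelta,
                               now: datetime) -> bool:
    """True once a job has been Active longer than walltime + margin."""
    return now > active_since + walltime + margin


# The job in the log above went Active at 10:46:30. The 30 minute
# walltime and 5 minute margin here are made-up example values.
active = datetime(2008, 8, 10, 10, 46, 30)
overdue = notification_presumed_lost(
    active, timedelta(minutes=30), timedelta(minutes=5),
    now=datetime(2008, 8, 10, 15, 43, 17))
```

With those example numbers the job would have been declared overdue long
before the report was sent. Without a walltime spec there is nothing to
compare against, so this does not help in that case.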

Mihael


From benc at hawaga.org.uk  Mon Aug 11 08:48:06 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 11 Aug 2008 13:48:06 +0000 (GMT)
Subject: [Swift-devel] hangs in nmi build and test at first site selecting
	stage
Message-ID: <Pine.LNX.4.64.0808111342420.22488@dildano.hawaga.org.uk>


Two times in the past few days (out of 30 or so build/tests x about 120 
runs per build/test) runs have hung at the initial site selection stage 
for the first job. I haven't investigated in greater depth than this. My 
gut feeling, though, is that it's probably still some funny behaviour 
related to rate limiting. Sometime in the next few days I'll see about 
running a few thousand tests with more debugging info to see if I can get 
more info...

-- 


From benc at hawaga.org.uk  Tue Aug 12 07:06:45 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 12 Aug 2008 12:06:45 +0000 (GMT)
Subject: [Swift-devel] Swift run: hanging up when submitting a job
In-Reply-To: <1218401913.9399.10.camel@localhost>
References: <20080810154317.BCY32452@m4500-03.uchicago.edu>
	<1218401913.9399.10.camel@localhost>
Message-ID: <Pine.LNX.4.64.0808121203310.29009@dildano.hawaga.org.uk>


On Sun, 10 Aug 2008, Mihael Hategan wrote:

> The other solution is to put a cap on the process duration. So if the
> job has a walltime spec, consider notifications lost if the job doesn't
> complete in walltime + some_margin_of_error.

I think that is probably a good thing to do. Either consider the job 
failed once walltime + margin has passed, or poll for its status a single 
time at that point.

Margin can be pretty big (on the order of minutes), I think.


From benc at hawaga.org.uk  Thu Aug 14 03:56:41 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 14 Aug 2008 08:56:41 +0000 (GMT)
Subject: [Swift-devel] swift 0.6 rc5
In-Reply-To: <Pine.LNX.4.64.0808061608240.22488@dildano.hawaga.org.uk>
References: <Pine.LNX.4.64.0808061608240.22488@dildano.hawaga.org.uk>
Message-ID: <Pine.LNX.4.64.0808140855530.29009@dildano.hawaga.org.uk>


On Wed, 6 Aug 2008, Ben Clifford wrote:

> http://www.ci.uchicago.edu/~benc/vdsk-0.6-rc5.tar.gz
> Please test and report.

No one has commented on this, either bad or good; so I'll put this out as 
0.6 later today.

-- 


From lixi at uchicago.edu  Fri Aug 15 12:03:31 2008
From: lixi at uchicago.edu (lixi at uchicago.edu)
Date: Fri, 15 Aug 2008 12:03:31 -0500 (CDT)
Subject: [Swift-devel] Swift run: hanging up when
 submitting a job
Message-ID: <20080815120331.BDG09904@m4500-03.uchicago.edu>

>The other solution is to put a cap on the process duration. 
So if the
>job has a walltime spec, consider notifications lost if the 
job doesn't
>complete in walltime + some_margin_of_error.

Because users might have some idea of the execution time 
of a single job, is it possible to add a parameter in 
swift.properties or tc.data specifying the maximum process 
duration of each job? If a job exceeded that limit, it 
would be resubmitted to another site. I know that Swift 
already has maxwalltime, which specifies a walltime limit 
for each job in minutes, but I'm not sure whether this 
parameter performs exactly that function. If not, would it 
be difficult to try this?

Thanks,

Xi


From skenny at uchicago.edu  Fri Aug 15 12:43:40 2008
From: skenny at uchicago.edu (skenny at uchicago.edu)
Date: Fri, 15 Aug 2008 12:43:40 -0500 (CDT)
Subject: [Swift-devel] not able to resume
Message-ID: <20080815124340.BJD56291@m4500-02.uchicago.edu>

hi all, 

we recently updated our swift (so, using Swift svn swift-r2185
cog-r2128) and it seems that -resume is no longer behaving as
expected...or is possibly being ignored. 

previously, on a resume, swift's stdout would show how many
jobs were already completed as well as those that were being
initialized. but now it seems to simply start from
scratch, from what we can tell...if we could get # of
completed jobs to print to stdout again that would help to
verify. 

i can send the log, or a link to it (it's quite large) but i
don't see any errors there that seem related to the resume.
but let me know if there's other info that might help.

thanks!
sarah


From benc at HAWAGA.ORG.UK  Fri Aug 15 16:02:31 2008
From: benc at HAWAGA.ORG.UK (Ben Clifford)
Date: Fri, 15 Aug 2008 21:02:31 +0000 (GMT)
Subject: [Swift-devel] not able to resume
In-Reply-To: <20080815124340.BJD56291@m4500-02.uchicago.edu>
References: <20080815124340.BJD56291@m4500-02.uchicago.edu>
Message-ID: <Pine.LNX.4.64.0808152058330.29009@dildano.hawaga.org.uk>


In tests/misc/ there are a number of tests for restarts - restart*.sh

Run those against your build by putting Swift in your path and typing, e.g.:

./restart.sh

./restart-iterate.sh

etc

and see what results you get - each test will output either a failure 
message or "success" as the last line.
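
If you want to run the whole set in one go, something like the following 
would do. It is a sketch in Python, not part of the test suite; the helper 
names are made up, and it only relies on the "success as the last line" 
convention described above.

```python
import subprocess


def last_line_is_success(output: str) -> bool:
    """True if the last non-blank line of a test's output is 'success'."""
    lines = [ln.strip() for ln in output.splitlines() if ln.strip()]
    return bool(lines) and lines[-1] == "success"


def run_restart_tests(scripts):
    """Run each script and map its path to True (passed) or False (failed)."""
    results = {}
    for script in scripts:
        proc = subprocess.run([script], capture_output=True, text=True)
        results[script] = last_line_is_success(proc.stdout)
    return results


# Example call from tests/misc/, with Swift on the path:
#   import glob
#   print(run_restart_tests(sorted(glob.glob("./restart*.sh"))))
```

Errors printed earlier in the output do not count as a failure here; only 
the last line is checked.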

-- 


From skenny at uchicago.edu  Mon Aug 18 16:49:40 2008
From: skenny at uchicago.edu (skenny at uchicago.edu)
Date: Mon, 18 Aug 2008 16:49:40 -0500 (CDT)
Subject: [Swift-devel] not able to resume
Message-ID: <20080818164940.BJF74421@m4500-02.uchicago.edu>

restart: success (w/errors during run)
restart2: success (w/errors)
restart3: 
[skenny at andrew misc]$ ./restart3.sh
Could not start execution.
        Error reading source:  : input contained no data
Could not start execution.
        Error reading source:  : input contained no data
Failed - second round did not exit with success

restart4: success (w/errors)
restart5: success (w/errors)
restart-extern: success (w/errors)
restart-iterate: success (w/errors)

for all of the ones with errors, it seems to be helperB:

helperB failed
Final status:  Failed:1 Finished successfully:1
The following errors have occurred:
1. Application "helperB" failed (Exit code 1)
        Arguments: "/disks/gpfs/fmri/cnari/swift/sbuilds/cog/modules/vdsk/tests/misc/restart-extern.2.out, /etc/group, baz"
        Host: localhost
        Directory: restart-extern-20080818-1643-bofrdg7f/jobs/c/helperB-c6x176yi
        STDERR:
        STDOUT:
Swift svn swift-r2185 cog-r2128

let me know if it helps to paste the entire output for any/all
of these. i'm not quite sure what 'success' means given there
are errors during the test (?)

thanks
sarah

---- Original message ----
>Date: Fri, 15 Aug 2008 21:02:31 +0000 (GMT)
>From: Ben Clifford <benc at HAWAGA.ORG.UK>  
>Subject: Re: [Swift-devel] not able to resume  
>To: skenny at uchicago.edu
>Cc: swift-devel at ci.uchicago.edu
>
>
>In tests/misc/ there are a number of tests for restarts -
restart*.sh
>
>Run those against your build (by putting Swift in your path
and typing eg:
>
>./restart.sh
>
>./restart-iterate.sh
>
>etc
>
>and see what results you get - each test will output either a
failure 
>message or "success" as the last line.
>
>-- 


From hategan at mcs.anl.gov  Mon Aug 18 16:59:36 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 18 Aug 2008 16:59:36 -0500
Subject: [Swift-devel] not able to resume
In-Reply-To: <20080818164940.BJF74421@m4500-02.uchicago.edu>
References: <20080818164940.BJF74421@m4500-02.uchicago.edu>
Message-ID: <1219096776.24889.2.camel@localhost>

On Mon, 2008-08-18 at 16:49 -0500, skenny at uchicago.edu wrote:
> restart: success (w/errors during run)
> restart2: success (w/errors)
> restart3: 
> [skenny at andrew misc]$ ./restart3.sh
> Could not start execution.
>         Error reading source:  : input contained no data
> Could not start execution.
>         Error reading source:  : input contained no data
> Failed - second round did not exit with success

Hmm. Can you delete restart3.xml and restart3.kml and try again?

> 
> restart4: success (w/errors)
> restart5: success (w/errors)
> restart-extern: success (w/errors)
> restart-iterate: success (w/errors)
> 
> for all of the ones with errors, it seems to be helperB:
> 
> helperB failed
> Final status:  Failed:1 Finished successfully:1
> The following errors have occurred:
> 1. Application "helperB" failed (Exit code 1)
...

Yes, helperB is the following script:
----------
#!/bin/bash

exit 1
----------

The first step needs to be "interrupted" in order to test the restarts.


From skenny at uchicago.edu  Mon Aug 18 17:04:54 2008
From: skenny at uchicago.edu (skenny at uchicago.edu)
Date: Mon, 18 Aug 2008 17:04:54 -0500 (CDT)
Subject: [Swift-devel] not able to resume
Message-ID: <20080818170454.BJF76049@m4500-02.uchicago.edu>

ok, now i get similar output to the others for restart3, not
sure what happened there; but here's the whole output:

[skenny at andrew misc]$ ./restart3.sh
Swift svn swift-r2185 cog-r2128

RunID: 20080818-1703-z57hx8j2
Progress:
helperA started
Sorted: [localhost:0.000(1.000):0/1 overload: 0]
helperA completed
helperB started
Sorted: [localhost:1.303(2.111):0/1 overload: 0]
Sorted: [localhost:1.595(2.473):0/1 overload: 0]
Sorted: [localhost:1.888(2.882):0/1 overload: 0]
helperB failed
Execution failed:
        Exception in helperB:
Arguments: [restart-2.out]
Host: localhost
Directory: restart3-20080818-1703-z57hx8j2/jobs/q/helperB-quat76yi
stderr.txt:
stdout.txt:
----

Caused by:
        Exit code 1
Swift svn swift-r2185 cog-r2128

RunID: 20080818-1703-bhfsb6be
Progress:
helperB started
Sorted: [localhost:0.000(1.000):0/1 overload: 0]
helperB completed
helperC started
Sorted: [localhost:1.303(2.111):0/1 overload: 0]
helperC completed
Final status:  Initializing:1 Finished successfully:2
success


---- Original message ----
>Date: Mon, 18 Aug 2008 16:59:36 -0500
>From: Mihael Hategan <hategan at mcs.anl.gov>  
>Subject: Re: [Swift-devel] not able to resume  
>To: skenny at uchicago.edu
>Cc: Ben Clifford <benc at HAWAGA.ORG.UK>,
swift-devel at ci.uchicago.edu
>
>On Mon, 2008-08-18 at 16:49 -0500, skenny at uchicago.edu wrote:
>> restart: success (w/errors during run)
>> restart2: success (w/errors)
>> restart3: 
>> [skenny at andrew misc]$ ./restart3.sh
>> Could not start execution.
>>         Error reading source:  : input contained no data
>> Could not start execution.
>>         Error reading source:  : input contained no data
>> Failed - second round did not exit with success
>
>Hmm. Can you delete restart3.xml and restart3.kml and try again?
>
>> 
>> restart4: success (w/errors)
>> restart5: success (w/errors)
>> restart-extern: success (w/errors)
>> restart-iterate: success (w/errors)
>> 
>> for all of the ones with errors, it seems to be helperB:
>> 
>> helperB failed
>> Final status:  Failed:1 Finished successfully:1
>> The following errors have occurred:
>> 1. Application "helperB" failed (Exit code 1)
>...
>
>Yes, helperB is the following script:
>----------
>#!/bin/bash
>
>exit 1
>----------
>
>The first step needs to be "interrupted" in order to test the
restarts.
>
>


From hategan at mcs.anl.gov  Mon Aug 18 17:09:15 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 18 Aug 2008 17:09:15 -0500
Subject: [Swift-devel] not able to resume
In-Reply-To: <20080818170454.BJF76049@m4500-02.uchicago.edu>
References: <20080818170454.BJF76049@m4500-02.uchicago.edu>
Message-ID: <1219097355.25406.1.camel@localhost>

Seems to be working fine.

Perhaps you are running, in your failed restarts, into the "staging out
happens late" issue. Can you send a sample rlog?

On Mon, 2008-08-18 at 17:04 -0500, skenny at uchicago.edu wrote:
> ok, now i get similar output to the others for restart3, not
> sure what happened there; but here's the whole output:
> 
> [skenny at andrew misc]$ ./restart3.sh
> Swift svn swift-r2185 cog-r2128
> 
> RunID: 20080818-1703-z57hx8j2
> Progress:
> helperA started
> Sorted: [localhost:0.000(1.000):0/1 overload: 0]
> helperA completed
> helperB started
> Sorted: [localhost:1.303(2.111):0/1 overload: 0]
> Sorted: [localhost:1.595(2.473):0/1 overload: 0]
> Sorted: [localhost:1.888(2.882):0/1 overload: 0]
> helperB failed
> Execution failed:
>         Exception in helperB:
> Arguments: [restart-2.out]
> Host: localhost
> Directory: restart3-20080818-1703-z57hx8j2/jobs/q/helperB-quat76yi
> stderr.txt:
> stdout.txt:
> ----
> 
> Caused by:
>         Exit code 1
> Swift svn swift-r2185 cog-r2128
> 
> RunID: 20080818-1703-bhfsb6be
> Progress:
> helperB started
> Sorted: [localhost:0.000(1.000):0/1 overload: 0]
> helperB completed
> helperC started
> Sorted: [localhost:1.303(2.111):0/1 overload: 0]
> helperC completed
> Final status:  Initializing:1 Finished successfully:2
> success
> 


From skenny at uchicago.edu  Mon Aug 18 17:47:23 2008
From: skenny at uchicago.edu (skenny at uchicago.edu)
Date: Mon, 18 Aug 2008 17:47:23 -0500 (CDT)
Subject: [Swift-devel] not able to resume
Message-ID: <20080818174723.BJF79002@m4500-02.uchicago.edu>

hmm, looks like the rlog got deleted...for future reference
though, do you know how i might be able to tell that from the
rlog?

---- Original message ----
>Date: Mon, 18 Aug 2008 17:09:15 -0500
>From: Mihael Hategan <hategan at mcs.anl.gov>  
>Subject: Re: [Swift-devel] not able to resume  
>To: skenny at uchicago.edu
>Cc: Ben Clifford <benc at HAWAGA.ORG.UK>,
swift-devel at ci.uchicago.edu
>
>Seems to be working fine.
>
>Perhaps you are running, in your failed restarts, into the
"staging out
>happens late" issue. Can you send a sample rlog?
>
>On Mon, 2008-08-18 at 17:04 -0500, skenny at uchicago.edu wrote:
>> ok, now i get similar output to the others for restart3, not
>> sure what happened there; but here's the whole output:
>> 
>> [skenny at andrew misc]$ ./restart3.sh
>> Swift svn swift-r2185 cog-r2128
>> 
>> RunID: 20080818-1703-z57hx8j2
>> Progress:
>> helperA started
>> Sorted: [localhost:0.000(1.000):0/1 overload: 0]
>> helperA completed
>> helperB started
>> Sorted: [localhost:1.303(2.111):0/1 overload: 0]
>> Sorted: [localhost:1.595(2.473):0/1 overload: 0]
>> Sorted: [localhost:1.888(2.882):0/1 overload: 0]
>> helperB failed
>> Execution failed:
>>         Exception in helperB:
>> Arguments: [restart-2.out]
>> Host: localhost
>> Directory:
restart3-20080818-1703-z57hx8j2/jobs/q/helperB-quat76yi
>> stderr.txt:
>> stdout.txt:
>> ----
>> 
>> Caused by:
>>         Exit code 1
>> Swift svn swift-r2185 cog-r2128
>> 
>> RunID: 20080818-1703-bhfsb6be
>> Progress:
>> helperB started
>> Sorted: [localhost:0.000(1.000):0/1 overload: 0]
>> helperB completed
>> helperC started
>> Sorted: [localhost:1.303(2.111):0/1 overload: 0]
>> helperC completed
>> Final status:  Initializing:1 Finished successfully:2
>> success
>> 
>
>


From hategan at mcs.anl.gov  Mon Aug 18 17:56:12 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 18 Aug 2008 17:56:12 -0500
Subject: [Swift-devel] not able to resume
In-Reply-To: <20080818174723.BJF79002@m4500-02.uchicago.edu>
References: <20080818174723.BJF79002@m4500-02.uchicago.edu>
Message-ID: <1219100172.26871.0.camel@localhost>

On Mon, 2008-08-18 at 17:47 -0500, skenny at uchicago.edu wrote:
> hmm, looks like the rlog got deleted...for future reference
> though, do you know how i might be able to tell that from the
> rlog?

If it's empty, it means it hasn't recorded anything.
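
A minimal sketch of that check, assuming the restart log is an ordinary file on local disk. The helper name rlog_is_empty and the example path are made up for illustration and are not part of Swift:

```python
import os

def rlog_is_empty(path):
    """Return True when the restart log exists but contains no bytes,
    i.e. nothing was recorded for restart."""
    return os.path.isfile(path) and os.path.getsize(path) == 0

if __name__ == "__main__":
    # Hypothetical file name; substitute the .rlog file from your own run.
    print(rlog_is_empty("restart3-20080818-1703-z57hx8j2.0.rlog"))
```

A missing file returns False here, so a deleted rlog and an empty rlog can be told apart by also checking os.path.exists.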

> 
> ---- Original message ----
> >Date: Mon, 18 Aug 2008 17:09:15 -0500
> >From: Mihael Hategan <hategan at mcs.anl.gov>  
> >Subject: Re: [Swift-devel] not able to resume  
> >To: skenny at uchicago.edu
> >Cc: Ben Clifford <benc at HAWAGA.ORG.UK>,
> swift-devel at ci.uchicago.edu
> >
> >Seems to be working fine.
> >
> >Perhaps you are running, in your failed restarts, into the
> "staging out
> >happens late" issue. Can you send a sample rlog?
> >
> >On Mon, 2008-08-18 at 17:04 -0500, skenny at uchicago.edu wrote:
> >> ok, now i get similar output to the others for restart3, not
> >> sure what happened there; but here's the whole output:
> >> 
> >> [skenny at andrew misc]$ ./restart3.sh
> >> Swift svn swift-r2185 cog-r2128
> >> 
> >> RunID: 20080818-1703-z57hx8j2
> >> Progress:
> >> helperA started
> >> Sorted: [localhost:0.000(1.000):0/1 overload: 0]
> >> helperA completed
> >> helperB started
> >> Sorted: [localhost:1.303(2.111):0/1 overload: 0]
> >> Sorted: [localhost:1.595(2.473):0/1 overload: 0]
> >> Sorted: [localhost:1.888(2.882):0/1 overload: 0]
> >> helperB failed
> >> Execution failed:
> >>         Exception in helperB:
> >> Arguments: [restart-2.out]
> >> Host: localhost
> >> Directory:
> restart3-20080818-1703-z57hx8j2/jobs/q/helperB-quat76yi
> >> stderr.txt:
> >> stdout.txt:
> >> ----
> >> 
> >> Caused by:
> >>         Exit code 1
> >> Swift svn swift-r2185 cog-r2128
> >> 
> >> RunID: 20080818-1703-bhfsb6be
> >> Progress:
> >> helperB started
> >> Sorted: [localhost:0.000(1.000):0/1 overload: 0]
> >> helperB completed
> >> helperC started
> >> Sorted: [localhost:1.303(2.111):0/1 overload: 0]
> >> helperC completed
> >> Final status:  Initializing:1 Finished successfully:2
> >> success
> >> 
> >
> >


From benc at hawaga.org.uk  Sun Aug 24 08:11:50 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sun, 24 Aug 2008 13:11:50 +0000 (GMT)
Subject: [Swift-devel] not able to resume
In-Reply-To: <1219097355.25406.1.camel@localhost>
References: <20080818170454.BJF76049@m4500-02.uchicago.edu>
	<1219097355.25406.1.camel@localhost>
Message-ID: <Pine.LNX.4.64.0808241304360.22488@dildano.hawaga.org.uk>

On Mon, 18 Aug 2008, Mihael Hategan wrote:

> Perhaps you are running, in your failed restarts, into the "staging out
> happens late" issue.

It should be the case, I think, that if the on-screen progress ticker line 
says a job is completed then it will be logged for restart; and if it is 
not reported as completed there then it won't be logged for restart (with a 
sub-second margin of error).

That change is likely to not correspond closely in time to jobs completing 
in the queue on your execution site.

-- 


From benc at hawaga.org.uk  Mon Aug 25 03:51:31 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 25 Aug 2008 08:51:31 +0000 (GMT)
Subject: [Swift-devel] Swift 0.6 released
Message-ID: <Pine.LNX.4.64.0808250831430.22488@dildano.hawaga.org.uk>


Swift 0.6 is online for download at 
http://www.ci.uchicago.edu/swift/downloads/

In addition to a bunch of bugfixes, the most interesting changes are:

 * much more rigorous compile time type checking - this catches many
   more errors at the start rather than hours into a run, and gives more
   useful error reports.

 * better multisite handling:
     +  job replication - when a job has been queued for much longer than 
        average, Swift can launch a replica of the job on another site. 
        This helps when making multisite runs where one site has a much
        longer queue time than another.
     +  rate limiting for bad sites - poorly scored sites are now rate
        limited much more than in previous versions of Swift, with very
        poorly scored sites being delayed between executions.

 * cog coasters - this is a new execution provider that allows a single
   'coaster' job to be submitted per worker node which pulls in Swift 
   jobs. This can greatly reduce the number of jobs submitted to the
   underlying job submission mechanism (such as GRAM2) allowing more jobs 
   to be submitted; it also can reduce the amount of time jobs spend in
   the LRM queue by sending them directly to an already-executing coaster.


-- 


From benc at hawaga.org.uk  Mon Aug 25 06:06:44 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 25 Aug 2008 11:06:44 +0000 (GMT)
Subject: [Swift-devel] coaster log location
Message-ID: <Pine.LNX.4.64.0808251105140.22488@dildano.hawaga.org.uk>


coaster log location of ~ is displeasing to me (especially on machines 
whose primary purpose isn't developing grid stuff).

Obvious other choices would be pwd or ~/.globus/coasters

Does anyone have a particular opinion?

-- 


From hategan at mcs.anl.gov  Mon Aug 25 10:21:28 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 25 Aug 2008 10:21:28 -0500
Subject: [Swift-devel] coaster log location
In-Reply-To: <Pine.LNX.4.64.0808251105140.22488@dildano.hawaga.org.uk>
References: <Pine.LNX.4.64.0808251105140.22488@dildano.hawaga.org.uk>
Message-ID: <1219677688.27441.4.camel@localhost>

On Mon, 2008-08-25 at 11:06 +0000, Ben Clifford wrote:
> coaster log location of ~ is displeasing to me (especially on machines 
> whose primary purpose isn't developing grid stuff).
> 
> Obvious other choices would be pwd or ~/.globus/coasters
> 
> Does anyone have a particular opinion?

The prototype has a few rough corners. On the other hand gram also puts
the log of funny jobs in ~/. But ~/.globus/coasters sounds more
reasonable.


From nikolicmilena at gmail.com  Tue Aug 26 07:23:51 2008
From: nikolicmilena at gmail.com (Milena Nikolic)
Date: Tue, 26 Aug 2008 14:23:51 +0200
Subject: [Swift-devel] GSoC: Type checking and Type inference in SwiftScript
Message-ID: <123bf0400808260523k49369428n8efa8e30278d29@mail.gmail.com>

Hi All,

The GSoC program is coming to an end, and I would like to share my uncommitted
work with you. For those who don't know, type checking was released with
Swift 0.6, and I am now waiting to hear about bugs (I'll be here to fix them,
of course).

The other part of my project was type inference, and that work isn't
committed. Progress is described in WhatToDoNext.txt (attached). The diff
file containing type inference work is also attached. Any comments about it
are welcome.

Cheers,
Milena

---------- Forwarded message ----------
From: Milena Nikolic <nikolicmilena at gmail.com>
Date: Sat, Aug 16, 2008 at 12:08 AM
Subject: Final work for GSoC
To: Ben Clifford <benc at hawaga.org.uk>


Hi Ben,

This is my uncommitted work. Unfortunately I didn't finish type checking
of mappers, and I am not sending you that work at all because it isn't
working at the moment. Actually there is not much work left on it,
but there is a lot to discuss. So when you get back from holiday, we can
discuss it and I might finish it outside the GSoC program.

All my work concerning type inference is in swiftInference.diff. It is based on
the latest svn version and it passes tests. Its status is briefly explained in
WhatToDoNext.txt, and I can write a more detailed document about what
I've done, if you think anyone will ever want to read something like that.

Should I send WhatToDoNext.txt file to the group? Or something similar?

Thanks,
Milena
-------------- next part --------------
A non-text attachment was scrubbed...
Name: swiftInference.diff
Type: text/x-patch
Size: 19230 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20080826/389597aa/attachment.bin>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: WhatToDoNext.txt
URL: <http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20080826/389597aa/attachment.txt>

From benc at hawaga.org.uk  Tue Aug 26 15:29:45 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 26 Aug 2008 20:29:45 +0000 (GMT)
Subject: [Swift-devel] Re: Fwd: Re: Is it easy to get average wait time for
 all jobs in a workflow?
In-Reply-To: <20080818220911.BDI04681@m4500-03.uchicago.edu>
References: <20080818220911.BDI04681@m4500-03.uchicago.edu>
Message-ID: <Pine.LNX.4.64.0808262025521.22488@dildano.hawaga.org.uk>

[added swift-devel]

On Mon, 18 Aug 2008, lixi at uchicago.edu wrote:

> >then the first execute2 task id for this job is 0-1-1-..., 
> >the second replication job id will be 0-1-2-..., these two 
> >tasks have the same replication group. But if this job 
> >failed, then the second task id for this failed job will 
> >also be 0-1-1-..., but it will have a different replication 
> >group. So it is hard to distinguish all of these clearly 
> >by writing a script. 
[...]
> it possible to add some information into log file? 

Swift r2199 adds a log line that looks like this:

2008-08-26 21:17:28,359+0100 INFO Execute jobid=echo-kqc79jyi 
task=Task(type=JOB_SUBMISSION, identity=urn:0-1-1219781848210)

This binds together execute2 IDs and karajan execution IDs, which is a 
binding that has been missing from the logs in the past.

That should allow binding of replication group IDs to several karajan 
level task IDs.
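
As a rough sketch of how a script might pick that binding out of a log, the following Python pulls the execute2 job ID and the karajan task identity from such a line. The pattern is modelled only on the one sample line quoted above and assumes the entry sits on a single line in the real log file; the function name is made up for illustration:

```python
import re

# Pattern derived from the single sample line in this message; real logs
# may carry extra fields, so treat this as a starting point only.
EXECUTE_LINE = re.compile(
    r"INFO\s+Execute\s+jobid=(?P<jobid>\S+)\s+"
    r"task=Task\(type=(?P<type>\w+),\s*identity=(?P<identity>[^)\s]+)\)"
)

def parse_execute_line(line):
    """Return a dict with jobid, type and identity, or None if no match."""
    match = EXECUTE_LINE.search(line)
    return match.groupdict() if match else None

sample = ("2008-08-26 21:17:28,359+0100 INFO Execute jobid=echo-kqc79jyi "
          "task=Task(type=JOB_SUBMISSION, identity=urn:0-1-1219781848210)")

if __name__ == "__main__":
    print(parse_execute_line(sample))
```

Collecting the resulting (jobid, identity) pairs into a dictionary keyed by identity would then let karajan-level task events be looked up by execute2 ID.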

-- 


From benc at hawaga.org.uk  Wed Aug 27 04:22:36 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 27 Aug 2008 09:22:36 +0000 (GMT)
Subject: [Swift-devel] Re: Fwd: Re: Is it easy to get average wait time for
 all jobs in a workflow?
In-Reply-To: <Pine.LNX.4.64.0808262025521.22488@dildano.hawaga.org.uk>
References: <20080818220911.BDI04681@m4500-03.uchicago.edu>
	<Pine.LNX.4.64.0808262025521.22488@dildano.hawaga.org.uk>
Message-ID: <Pine.LNX.4.64.0808270918320.22488@dildano.hawaga.org.uk>


On Tue, 26 Aug 2008, Ben Clifford wrote:

> Swift r2199 adds a log line that looks like this:

and r2204 makes the log-processing code have a make target called 
karatasks.JOB_SUBMISSION.annotated-execute2.transitions (mmm long file 
names).

This is karajan task status for JOB_SUBMISSION tasks, with columns 5 and 6 
being the execute2 and replication IDs.

>From that you should be able to work out for each replication ID when the 
first submission happens and when the first Active state happens.
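
A hedged sketch of that calculation in Python follows. Only the position of the replication ID (column 6) comes from this message; the timestamp column, the state column and the state name "Submitted" are placeholders that would need checking against a real transitions file, which is why they are parameters rather than constants:

```python
def first_times(rows, ts_col=1, state_col=4, repl_col=6,
                states=("Submitted", "Active")):
    """Map (replication_id, state) to the earliest timestamp seen.

    rows is an iterable of whitespace-separated lines; columns are 1-indexed.
    ts_col, state_col and the state names are ASSUMPTIONS for illustration;
    repl_col=6 follows the column numbering described in this message.
    """
    first = {}
    needed = max(ts_col, state_col, repl_col)
    for line in rows:
        fields = line.split()
        if len(fields) < needed:
            continue  # skip short or blank lines
        state = fields[state_col - 1]
        if state not in states:
            continue
        ts = float(fields[ts_col - 1])
        key = (fields[repl_col - 1], state)
        if key not in first or ts < first[key]:
            first[key] = ts
    return first

if __name__ == "__main__":
    # Made-up rows in the assumed layout, purely to exercise the function.
    demo = [
        "10.0 urn:0-1-1 x Submitted jobA grp1",
        "12.5 urn:0-1-2 x Submitted jobA grp1",
        "20.0 urn:0-1-2 x Active jobA grp1",
    ]
    times = first_times(demo)
    print(times[("grp1", "Active")] - times[("grp1", "Submitted")])
```

The difference between the two first-seen times per replication ID is the queue wait for that group, which could then be averaged over the run.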

-- 


From zhaozhang at uchicago.edu  Wed Aug 27 15:18:04 2008
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Wed, 27 Aug 2008 15:18:04 -0500
Subject: [Swift-devel] swift out of memory 
Message-ID: <48B5B67C.4030603@uchicago.edu>

Hi, I was trying to run 32768 tasks on BGP, swift failed to start and 
reported the following message.
Any ideas? Thanks

zhao

JVMDG217: Dump Handler is Processing OutOfMemory - Please Wait.
JVMDG315: JVM Requesting Heap dump file
.................................................JVMDG318: Heap dump 
file written to 
/gpfs/home/zzhang/swift/etc/heapdump.20080827.142903.2957.phd
JVMDG303: JVM Requesting Java core file
JVMDG304: Java core file written to 
/gpfs/home/zzhang/swift/etc/javacore.20080827.142924.2957.txt
JVMDG274: Dump Handler has Processed OutOfMemory.
JVMST109: Insufficient space in Javaheap to satisfy allocation request


From hategan at mcs.anl.gov  Wed Aug 27 15:33:25 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 27 Aug 2008 15:33:25 -0500
Subject: [Swift-devel] swift out of memory
In-Reply-To: <48B5B67C.4030603@uchicago.edu>
References: <48B5B67C.4030603@uchicago.edu>
Message-ID: <1219869205.13808.19.camel@localhost>

How much memory are you running the jvm with?

On Wed, 2008-08-27 at 15:18 -0500, Zhao Zhang wrote:
> Hi, I was trying to run 32768 tasks on BGP, swift failed to start and 
> reported the following message.
> Any ideas? Thanks
> 
> zhao
> 
> JVMDG217: Dump Handler is Processing OutOfMemory - Please Wait.
> JVMDG315: JVM Requesting Heap dump file
> .................................................JVMDG318: Heap dump 
> file written to 
> /gpfs/home/zzhang/swift/etc/heapdump.20080827.142903.2957.phd
> JVMDG303: JVM Requesting Java core file
> JVMDG304: Java core file written to 
> /gpfs/home/zzhang/swift/etc/javacore.20080827.142924.2957.txt
> JVMDG274: Dump Handler has Processed OutOfMemory.
> JVMST109: Insufficient space in Javaheap to satisfy allocation request
> 
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel


From iraicu at cs.uchicago.edu  Wed Aug 27 15:47:09 2008
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Wed, 27 Aug 2008 15:47:09 -0500
Subject: [Swift-devel] swift out of memory
In-Reply-To: <1219869205.13808.19.camel@localhost>
References: <48B5B67C.4030603@uchicago.edu>
	<1219869205.13808.19.camel@localhost>
Message-ID: <48B5BD4D.8030805@cs.uchicago.edu>

I don't think Zhao knows where this is set in Swift.  Where could he 
look this up, other than using "top" during a run?

Ioan

Mihael Hategan wrote:
> How much memory are you running the jvm with?
>
> On Wed, 2008-08-27 at 15:18 -0500, Zhao Zhang wrote:
>   
>> Hi, I was trying to run 32768 tasks on BGP, swift failed to start and 
>> reported the following message.
>> Any ideas? Thanks
>>
>> zhao
>>
>> JVMDG217: Dump Handler is Processing OutOfMemory - Please Wait.
>> JVMDG315: JVM Requesting Heap dump file
>> .................................................JVMDG318: Heap dump 
>> file written to 
>> /gpfs/home/zzhang/swift/etc/heapdump.20080827.142903.2957.phd
>> JVMDG303: JVM Requesting Java core file
>> JVMDG304: Java core file written to 
>> /gpfs/home/zzhang/swift/etc/javacore.20080827.142924.2957.txt
>> JVMDG274: Dump Handler has Processed OutOfMemory.
>> JVMST109: Insufficient space in Javaheap to satisfy allocation request
>>

-- 
===================================================
Ioan Raicu
Ph.D. Candidate
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================
===================================================



From zhaozhang at uchicago.edu  Wed Aug 27 15:48:16 2008
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Wed, 27 Aug 2008 15:48:16 -0500
Subject: [Swift-devel] swift out of memory
In-Reply-To: <48B5BD4D.8030805@cs.uchicago.edu>
References: <48B5B67C.4030603@uchicago.edu>
	<1219869205.13808.19.camel@localhost>
	<48B5BD4D.8030805@cs.uchicago.edu>
Message-ID: <48B5BD90.4090006@uchicago.edu>

Mihael told me to set options in the swift command with -Xmx1024m; I am 
testing it now.

zhao

Ioan Raicu wrote:
> I don't think Zhao knows where this is set in Swift.  Where could he 
> look this up, other than using "top" during a run?
>
> Ioan
>
> Mihael Hategan wrote:
>> How much memory are you running the jvm with?
>>
>> On Wed, 2008-08-27 at 15:18 -0500, Zhao Zhang wrote:
>>   
>>> Hi, I was trying to run 32768 tasks on BGP, swift failed to start and 
>>> reported the following message.
>>> Any ideas? Thanks
>>>
>>> zhao
>>>
>>> JVMDG217: Dump Handler is Processing OutOfMemory - Please Wait.
>>> JVMDG315: JVM Requesting Heap dump file
>>> .................................................JVMDG318: Heap dump 
>>> file written to 
>>> /gpfs/home/zzhang/swift/etc/heapdump.20080827.142903.2957.phd
>>> JVMDG303: JVM Requesting Java core file
>>> JVMDG304: Java core file written to 
>>> /gpfs/home/zzhang/swift/etc/javacore.20080827.142924.2957.txt
>>> JVMDG274: Dump Handler has Processed OutOfMemory.
>>> JVMST109: Insufficient space in Javaheap to satisfy allocation request
>>>
>
> -- 
> ===================================================
> Ioan Raicu
> Ph.D. Candidate
> ===================================================
> Distributed Systems Laboratory
> Computer Science Department
> University of Chicago
> 1100 E. 58th Street, Ryerson Hall
> Chicago, IL 60637
> ===================================================
> Email: iraicu at cs.uchicago.edu
> Web:   http://www.cs.uchicago.edu/~iraicu
> http://dev.globus.org/wiki/Incubator/Falkon
> http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
> ===================================================
> ===================================================
>
>   


From iraicu at cs.uchicago.edu  Wed Aug 27 15:53:54 2008
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Wed, 27 Aug 2008 15:53:54 -0500
Subject: [Swift-devel] swift out of memory
In-Reply-To: <48B5BD90.4090006@uchicago.edu>
References: <48B5B67C.4030603@uchicago.edu>
	<1219869205.13808.19.camel@localhost>
	<48B5BD4D.8030805@cs.uchicago.edu> <48B5BD90.4090006@uchicago.edu>
Message-ID: <48B5BEE2.4080404@cs.uchicago.edu>

I routinely use -Xms1536m -Xmx1536m for running Falkon (have the min and 
max heap size the same, to avoid having to resize the heap, which 
ultimately improves performance during these periods).  Those nodes on 
the BG/P have 4GB of memory, so once you find the largest workflows you 
can run with 1GB (or 1.5GB), it would be good to push the heap size up 
as close as you can to the 4GB limit (probably somewhere between 3GB and 
4GB). 

Ioan

Zhao Zhang wrote:
> Mihael told me to set options in the swift command with -Xmx1024m; I am 
> testing it now.
>
> zhao
>
> Ioan Raicu wrote:
>> I don't think Zhao knows where this is set in Swift.  Where could he 
>> look this up, other than using "top" during a run?
>>
>> Ioan
>>
>> Mihael Hategan wrote:
>>> How much memory are you running the jvm with?
>>>
>>> On Wed, 2008-08-27 at 15:18 -0500, Zhao Zhang wrote:
>>>  
>>>> Hi, I was trying to run 32768 tasks on BGP, swift failed to start 
>>>> and reported the following message.
>>>> Any ideas? Thanks
>>>>
>>>> zhao
>>>>
>>>> JVMDG217: Dump Handler is Processing OutOfMemory - Please Wait.
>>>> JVMDG315: JVM Requesting Heap dump file
>>>> .................................................JVMDG318: Heap 
>>>> dump file written to 
>>>> /gpfs/home/zzhang/swift/etc/heapdump.20080827.142903.2957.phd
>>>> JVMDG303: JVM Requesting Java core file
>>>> JVMDG304: Java core file written to 
>>>> /gpfs/home/zzhang/swift/etc/javacore.20080827.142924.2957.txt
>>>> JVMDG274: Dump Handler has Processed OutOfMemory.
>>>> JVMST109: Insufficient space in Javaheap to satisfy allocation request
>>>>
>>
>> -- 
>> ===================================================
>> Ioan Raicu
>> Ph.D. Candidate
>> ===================================================
>> Distributed Systems Laboratory
>> Computer Science Department
>> University of Chicago
>> 1100 E. 58th Street, Ryerson Hall
>> Chicago, IL 60637
>> ===================================================
>> Email: iraicu at cs.uchicago.edu
>> Web:   http://www.cs.uchicago.edu/~iraicu
>> http://dev.globus.org/wiki/Incubator/Falkon
>> http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
>> ===================================================
>> ===================================================
>>
>>   
>

-- 
===================================================
Ioan Raicu
Ph.D. Candidate
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================
===================================================


From mikekubal at yahoo.com  Thu Aug 28 15:29:16 2008
From: mikekubal at yahoo.com (Mike Kubal)
Date: Thu, 28 Aug 2008 13:29:16 -0700 (PDT)
Subject: [Swift-devel] error communicating with Abe's GridFTP server 
Message-ID: <951213.43274.qm@web52305.mail.re2.yahoo.com>

I get the following message when trying to run a swift job from the machines at the CI (communicado and bridled) on NCSA's Abe:

Execution failed:
        Could not initialize shared directory on abe
Caused by:
        org.globus.cog.abstraction.impl.file.FileResourceException: Error communicating with the GridFTP server
Caused by:
        Server refused performing the request. Custom message: Bad password. (error code 1) [Nested exception message:  Custom message: Unexpected reply: 530-Login incorrect. : IPC connection failed.

Are there any known webservice or other issues with Abe's GridFTP server? 

The gatekeeper appears to be up, when I run:
globus-job-run grid-abe.ncsa.teragrid.org /bin/hostname

It returns:
abe1196.ncsa.uiuc.edu

If I run:

globus-job-run gsiftp://grid-ftp.abe.ncsa.teragrid.org hostname

I get the following:

GRAM Job submission failed because the connection to the server failed (check host and port) (error code 12)

In the past I have received similar error messages when my certificate at the CI had not been updated, but the problem persists after an update.
I get the same message running from communicado and bridled at the CI.

Suggestions?

Mike Kubal


From help at teragrid.org  Thu Aug 28 17:03:43 2008
From: help at teragrid.org (help at teragrid.org)
Date: Thu, 28 Aug 2008 17:03:43 -0500
Subject: [Swift-devel] [Fwd: Re: error communicating with Abe's GridFTP
	server ]
Message-ID: <200808282203.m7SM3h5f005288@rimantadine.ncsa.uiuc.edu>

FROM: Jackson, Weddie
(Concerning ticket No. 161020)
==============================
Sorry, forgot to CC those on the cc list.

__________Original Message__________
Date: Aug 28 2008 5:01PM 
From: help at teragrid.org
  To: mikekubal at yahoo.com
Subj: Re: error communicating with Abe's GridFTP server


FROM: Jackson, Weddie
(Concerning ticket No. 161020)
==============================
Hello Mike,


We also got an error when just trying to open a simple connection with uberftp.
We will ask our Grid Services Admin to take a look and notify you when we 
have news.


Thanks,
-Weddie
------------------------
Weddie Jackson
NCSA Consulting Services
------------------------


Mike Kubal <mikekubal at yahoo.com> writes:
>I get the following message when trying to run a swift job from the machines 
at the CI (communicado and bridled) on NCSA's Abe:
>
>Execution failed:
>        Could not initialize shared directory on abe
>Caused by:
>        org.globus.cog.abstraction.impl.file.FileResourceException: Error 
communicating with the GridFTP server
>Caused by:
>        Server refused performing the request. Custom message: Bad password. 
(error code 1) [Nested exception message:  Custom message: Unexpected reply: 
530-Login incorrect. : IPC connection failed.
>
>Are there any known webservice or other issues with Abe's GridFTP server? 
>
>The gatekeeper appears to be up, when I run:
>globus-job-run grid-abe.ncsa.teragrid.org /bin/hostname
>
>It returns:
>abe1196.ncsa.uiuc.edu
>
>If I run:
>
>globus-job-run gsiftp://grid-ftp.abe.ncsa.teragrid.org hostname
>
>I get the following:
>
>GRAM Job submission failed because the connection to the server failed (check 
host and port) (error code 12)
>
>In the past I have received similar error messages when my certificate at the 
CI had not been updated, but the problem persists after an update.
>I get the same message running from communicado and bridled at the CI.
>
>Suggestions?
>
>Mike Kubal


From bugzilla-daemon at mcs.anl.gov  Fri Aug 29 02:11:49 2008
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Fri, 29 Aug 2008 02:11:49 -0500 (CDT)
Subject: [Swift-devel] [Bug 154] New: iterate construct causes
	overserialisation of execution
Message-ID: <bug-154-21@http.bugzilla.mcs.anl.gov/swift/>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=154

           Summary: iterate construct causes overserialisation of execution
           Product: Swift
           Version: unspecified
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P2
         Component: General
        AssignedTo: benc at hawaga.org.uk
        ReportedBy: benc at hawaga.org.uk
                CC: swift-devel at ci.uchicago.edu


Iterate loops will always run in strict sequence, even if there is no data
dependency between iterations.
This violates the general principle that execution should happen as much in
parallel as possible with data dependencies being the only deciding factor in
execution.

For example, by data dependencies, the sleep statements in the following
program should execute in parallel. They do not.


s(int delay) {
  app {
    sleep delay;
  }
}

iterate i {
trace(i);
s(5);
} until(i>5);
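
To make the expected difference concrete, here is a small Python analogy (not SwiftScript, and not how Swift schedules work internally). It times the same independent sleep calls run strictly in sequence and then run concurrently. The function names and the use of a thread pool are illustrative assumptions only:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def s(delay):
    """Stand-in for the sleep app call in the SwiftScript above."""
    time.sleep(delay)

def run_sequential(n, delay):
    """Run n independent sleeps one after another; returns elapsed seconds."""
    start = time.perf_counter()
    for _ in range(n):
        s(delay)
    return time.perf_counter() - start

def run_parallel(n, delay):
    """Run n independent sleeps concurrently; returns elapsed seconds."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n) as pool:
        list(pool.map(s, [delay] * n))
    return time.perf_counter() - start

if __name__ == "__main__":
    print("sequential:", run_sequential(4, 0.1))
    print("parallel:  ", run_parallel(4, 0.1))
```

With no data dependency between calls, the sequential time grows with the number of iterations while the concurrent time stays close to a single delay, which is the behaviour the iterate loop would be expected to show.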


-- 
Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.


From benc at hawaga.org.uk  Fri Aug 29 09:09:44 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 29 Aug 2008 14:09:44 +0000 (GMT)
Subject: [Swift-devel] coasters on nmi build test
Message-ID: <Pine.LNX.4.64.0808291408460.3457@dildano.hawaga.org.uk>


I just got the coaster tests running more on nmi build/test.

Most platforms fail; different platforms exhibit different errors.

http://nmi-s005.cs.wisc.edu:80/nmi/index.php?page=results%2FrunDetails&opt_user=benc&runid=102190&rows=100

-- 


From hategan at mcs.anl.gov  Fri Aug 29 09:27:14 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 29 Aug 2008 09:27:14 -0500
Subject: [Swift-devel] Re: coasters on nmi build test
In-Reply-To: <Pine.LNX.4.64.0808291408460.3457@dildano.hawaga.org.uk>
References: <Pine.LNX.4.64.0808291408460.3457@dildano.hawaga.org.uk>
Message-ID: <1220020034.9151.0.camel@localhost>

On Fri, 2008-08-29 at 14:09 +0000, Ben Clifford wrote:
> I just got the coaster tests running more on nmi build/test.
> 
> Most platforms fail; different platforms exhibit different errors.

Any chance we can look at the coaster service logs?


From benc at hawaga.org.uk  Fri Aug 29 09:26:24 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 29 Aug 2008 14:26:24 +0000 (GMT)
Subject: [Swift-devel] Re: coasters on nmi build test
In-Reply-To: <1220020034.9151.0.camel@localhost>
References: <Pine.LNX.4.64.0808291408460.3457@dildano.hawaga.org.uk>
	<1220020034.9151.0.camel@localhost>
Message-ID: <Pine.LNX.4.64.0808291426070.3457@dildano.hawaga.org.uk>


On Fri, 29 Aug 2008, Mihael Hategan wrote:

> On Fri, 2008-08-29 at 14:09 +0000, Ben Clifford wrote:
> > I just got the coaster tests running more on nmi build/test.
> > 
> > Most platforms fail; different platforms exhibit different errors.
> 
> Any chance we can look at the coaster service logs?

Yeah, it's possible somehow. I need to figure out how to do it though.

-- 


From help at teragrid.org  Fri Aug 29 11:23:16 2008
From: help at teragrid.org (help at teragrid.org)
Date: Fri, 29 Aug 2008 11:23:16 -0500
Subject: [Swift-devel] [Fwd: Re: error communicating with Abe's GridFTP
	server ]
Message-ID: <200808291623.m7TGNGA9020803@rimantadine.ncsa.uiuc.edu>

FROM: Jackson, Weddie
(Concerning ticket No. 161020)
==============================
Hello Mike,


Our Grid Services Admins have stated that they believe they have resolved the
issue.  

Can you try your job again (if you have not already done so) and let us know if
you are still seeing an issue?


(if you are still seeing an issue, it may be helpful to the Admins to know
exactly how your job is trying to communicate with gridftp-abe)

Thanks,
-Weddie
------------------------
Weddie Jackson
NCSA Consulting Services
------------------------


__________Original Message__________
Date: Aug 28 2008 5:01PM
From: help at teragrid.org
  To: mikekubal at yahoo.com
Subj: Re: error communicating with Abe's GridFTP server


FROM: Jackson, Weddie
(Concerning ticket No. 161020)
==============================
Hello Mike,


We also got an error when just trying to open a simple connection with uberftp.
We will ask our Grid Services Admin to take a look and notify you when we 
have news.


Thanks,
-Weddie
------------------------
Weddie Jackson
NCSA Consulting Services
------------------------


Mike Kubal <mikekubal at yahoo.com> writes:
>I get the following message when trying to run a swift job from the machines 
at the CI (communicado and bridled) on NCSA's Abe:
>
>Execution failed:
>        Could not initialize shared directory on abe
>Caused by:
>        org.globus.cog.abstraction.impl.file.FileResourceException: Error 
communicating with the GridFTP server
>Caused by:
>        Server refused performing the request. Custom message: Bad password. 
(error code 1) [Nested exception message:  Custom message: Unexpected reply: 
530-Login incorrect. : IPC connection failed.
>
>Are there any known webservice or other issues with Abe's GridFTP server? 
>
>The gatekeeper appears to be up, when I run:
>globus-job-run grid-abe.ncsa.teragrid.org /bin/hostname
>
>It returns:
>abe1196.ncsa.uiuc.edu
>
>If I run:
>
>globus-job-run gsiftp://grid-ftp.abe.ncsa.teragrid.org hostname
>
>I get the following:
>
>GRAM Job submission failed because the connection to the server failed (check host and port) (error code 12)
>
>In the past I have received similar error messages when my certificate at the
>CI had not been updated, but the problem persists after an update.
>I get the same message running from communicado and bridled at the CI.
>
>Suggestions?
>
>Mike Kubal


From hategan at mcs.anl.gov  Fri Aug 29 11:34:02 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 29 Aug 2008 11:34:02 -0500
Subject: [Swift-devel] Re: coasters on nmi build test
In-Reply-To: <Pine.LNX.4.64.0808291408460.3457@dildano.hawaga.org.uk>
References: <Pine.LNX.4.64.0808291408460.3457@dildano.hawaga.org.uk>
Message-ID: <1220027642.11456.0.camel@localhost>

On Fri, 2008-08-29 at 14:09 +0000, Ben Clifford wrote:
> I just got the coaster tests running more on nmi build/test.
> 
> Most platforms fail; different platforms exhibit different errors.
> 
> http://nmi-s005.cs.wisc.edu:80/nmi/index.php?page=results%2FrunDetails&opt_user=benc&runid=102190&rows=100
> 

Also, I'm having a bit of trouble making out what's what on that page.
Is there any way things can be labeled in a more descriptive fashion?


From hategan at mcs.anl.gov  Sun Aug 31 13:35:22 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sun, 31 Aug 2008 13:35:22 -0500
Subject: [Swift-devel] class (as in programming) upgrades
Message-ID: <1220207722.7918.4.camel@localhost>

Fresh from LtU: http://lambda-the-ultimate.org/node/2960

Seems very close to my understanding of the intent of versions in
VDL/Swift, but nicely formalized.


From benc at hawaga.org.uk  Sun Aug 31 14:00:30 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sun, 31 Aug 2008 19:00:30 +0000 (GMT)
Subject: [Swift-devel] class (as in programming) upgrades
In-Reply-To: <1220207722.7918.4.camel@localhost>
References: <1220207722.7918.4.camel@localhost>
Message-ID: <Pine.LNX.4.64.0808311855510.4687@dildano.hawaga.org.uk>


On Sun, 31 Aug 2008, Mihael Hategan wrote:

> Fresh from LtU: http://lambda-the-ultimate.org/node/2960
> 
> Seems very close to my understanding of the intent of versions in
> VDL/Swift, but nicely formalized.

I read that earlier today. They don't seem to have much in the way of 
implementation (and thus in experience with using in practice).

SwiftScript programs don't seem to be getting much in the way of 
complexity expressionwise which would be something that might cause 
versioning and namespaces to take a higher priority; and I don't think 
their linear version model really fits in with the actual diversity of 
applications that appear in real life (which is also what I think about 
using linear versions in VDL).

-- 


From hategan at mcs.anl.gov  Sun Aug 31 14:12:36 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sun, 31 Aug 2008 14:12:36 -0500
Subject: [Swift-devel] class (as in programming) upgrades
In-Reply-To: <Pine.LNX.4.64.0808311855510.4687@dildano.hawaga.org.uk>
References: <1220207722.7918.4.camel@localhost>
	<Pine.LNX.4.64.0808311855510.4687@dildano.hawaga.org.uk>
Message-ID: <1220209956.19236.6.camel@localhost>

On Sun, 2008-08-31 at 19:00 +0000, Ben Clifford wrote:
> On Sun, 31 Aug 2008, Mihael Hategan wrote:
> 
> > Fresh from LtU: http://lambda-the-ultimate.org/node/2960
> > 
> > Seems very close to my understanding of the intent of versions in
> > VDL/Swift, but nicely formalized.
> 
> I read that earlier today. They don't seem to have much in the way of 
> implementation (and thus in experience with using in practice).

Which in itself makes no statement about the solution being good or bad.
But to me it looks decent.

> 
> SwiftScript programs don't seem to be getting much in the way of 
> complexity expressionwise which would be something that might cause 
> versioning and namespaces to take a higher priority;

Except when somebody tries to use it in a production environment (i2u2),
with data and analyses that span a few years.

So I think this is one of those things where if some project using Swift
realizes it needs it, it's probably too late.

>  and I don't think 
> their linear version model really fits in with the actual diversity of 
> applications that appear in real life (which is also what I think about 
> using linear versions in VDL).
> 

It's a reasonable model that I think would work fairly well in many
cases.