From wilde at mcs.anl.gov Fri Aug 1 14:49:03 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 01 Aug 2008 14:49:03 -0500 Subject: [Swift-devel] more compile time type checking In-Reply-To: References: Message-ID: <489368AF.2000300@mcs.anl.gov> The type checking is working nicely - it's a great improvement. I just fixed several of my type errors in minutes, like: Could not start execution. Compile error in foreach statement at line 26: Compile error in procedure invocation at line 28: Wrong type for parameter number 0, expected DockOut, got Dockout Nice work, Milena and Ben! - Mike On 7/29/08 4:12 AM, Ben Clifford wrote: > I just committed Milena's work on compile-time type checking. > > Based on what happened last time I made changes to the compile-time sanity > checking, there will be some things you do or thought you could do in > your programs that will now not work. > > When you discover such, file a bug or post to this list. > From foster at mcs.anl.gov Fri Aug 1 15:09:28 2008 From: foster at mcs.anl.gov (Ian Foster) Date: Fri, 1 Aug 2008 15:09:28 -0500 Subject: [Swift-devel] more compile time type checking In-Reply-To: <489368AF.2000300@mcs.anl.gov> References: <489368AF.2000300@mcs.anl.gov> Message-ID: lovely ...! On Aug 1, 2008, at 2:49 PM, Michael Wilde wrote: > The type checking is working nicely - it's a great improvement. > > I just fixed several of my type errors in minutes, like: > > Could not start execution. > Compile error in foreach statement at line 26: Compile error > in procedure invocation at line 28: Wrong type for parameter number > 0, expected DockOut, got Dockout > > Nice work, Milena and Ben! > > - Mike > > > On 7/29/08 4:12 AM, Ben Clifford wrote: >> I just committed Milena's work on compile-time type checking. >> Based on what happened last time I made changes to the compile-time >> sanity checking, there will be some things you do or thought you >> could do in your programs that will now not work. 
>> When you discover such, file a bug or post to this list. > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From bugzilla-daemon at mcs.anl.gov Sat Aug 2 10:04:51 2008 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Sat, 2 Aug 2008 10:04:51 -0500 (CDT) Subject: [Swift-devel] [Bug 152] New: filesys_mapper gives exception Message-ID: http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=152 Summary: filesys_mapper gives exception Product: Swift Version: unspecified Platform: All OS/Version: Linux Status: NEW Severity: normal Priority: P2 Component: SwiftScript language AssignedTo: benc at hawaga.org.uk ReportedBy: wilde at mcs.anl.gov Running this script throws an exception in the filesys_mapper: type File; type Mol2; (File out) rundock ( Mol2 ligand ) { app { echo "rundock debug:" @ligand @out stdout=@out; } } Mol2 ligand ; File out <"dockdb.out">; out = rundock( ligand ); -- Gives: Swift script dockdb1.swift starting at Sat Aug 2 09:54:06 CDT 2008 running on sites: localhost Swift svn swift-r2159 cog-r2122 (CoG modified locally) RunID: 20080802-0954-qmar4l7d Progress: Execution failed: Index: 0 Swift Script dockdb1.swift ended at Sat Aug 2 09:54:09 CDT 2008 with exit code 0 -- Exception in log is: 2008-08-02 09:54:09,209-0500 INFO New NEW id=tag:benc at ci.uchicago.edu,2008:swift:dataset:20080802-0954-x2fxvmz5:720000000003 2008-08-02 09:54:09,248-0500 INFO AbstractDataNode closed tag:benc at ci.uchicago.edu,2008:swift:dataset:20080802-0954-x2fxvmz5:720000000004 2008-08-02 09:54:09,248-0500 INFO AbstractDataNode ROOTPATH dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20080802-0954-x2fxvmz5:720000000004 path=$ 2008-08-02 09:54:09,249-0500 INFO AbstractDataNode dataset tag:benc at ci.uchicago.edu,2008:swift:dataset:20080802-0954-x2fxvmz5:720000000004 exception while mapping path from root 
java.lang.IndexOutOfBoundsException: Index: 0 at java.util.Collections$EmptyList.get(Collections.java:2968) at org.griphyn.vdl.mapping.Path.isArrayIndex(Path.java:271) at org.griphyn.vdl.mapping.file.FileSystemArrayMapper.map(FileSystemArrayMapper.java:30) at org.griphyn.vdl.mapping.AbstractDataNode.logContent(AbstractDataNode.java:376) -- This occurred first when mapping a 4-member test dataset. I shrunk the script to a smaller example and show it here mapping a single member dataset. Script and all logs and output are in: www.ci.uchicago.edu/~wilde/filesys_mapper_exception.2008.0802.tar.gz -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You reported the bug, or are watching the reporter. From bugzilla-daemon at mcs.anl.gov Sat Aug 2 10:05:31 2008 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Sat, 2 Aug 2008 10:05:31 -0500 (CDT) Subject: [Swift-devel] [Bug 152] filesys_mapper gives exception In-Reply-To: Message-ID: <20080802150531.2EB4F164B1@foxtrot.mcs.anl.gov> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=152 wilde at mcs.anl.gov changed: What |Removed |Added ---------------------------------------------------------------------------- URL| |http://www.ci.uchicago.edu/~ | |wilde/filesys_mapper_excepti | |on.2008.0802.tar.gz -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You reported the bug, or are watching the reporter. From wilde at mcs.anl.gov Sat Aug 2 19:26:40 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sat, 02 Aug 2008 19:26:40 -0500 Subject: [Swift-devel] coastersPerNode not recognized by GT2 GRAM In-Reply-To: References: <489053A9.4080906@mcs.anl.gov> <48907DF1.8020009@mcs.anl.gov> <48909148.1010504@mcs.anl.gov> Message-ID: <4894FB40.6090008@mcs.anl.gov> On 7/30/08 11:54 AM, Ben Clifford wrote: > try cog r2123. 
i just tested that against ncsa teragrid. it now filters > out that attribute before sending on to gram2. I just tried this, using cog r2125: -- Swift script dock1.swift starting at Sat Aug 2 17:54:34 CDT 2008 running on sites: abe-coaster Swift svn swift-r2171 cog-r2125 (CoG modified locally) -- I still failed with the same error. The gram log showed the coasterspernode rsl variable still getting through to gram (below). I added a second string, "coasterspernode" to your list of parameters to filter out, in all lower case, and this worked. There's a small possibility that the first time I tried this, I lost the fix somewhere between the build and the install. I don't think that's the case, but I will check. When you tested against a TG site, did you verify that the coasterspernode attribute wasn't getting in the RSL? - Mike <<<<>>>>Job Request RSL 8/2 17:57:34 <<<<>>>>Job Request RSL (canonical) 8/2 17:57:34 <<<<>>>>Job RSL 8/2 17:57:34 <<<<>>>>Job RSL (post-eval) From benc at hawaga.org.uk Sun Aug 3 06:07:54 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Sun, 3 Aug 2008 11:07:54 +0000 (GMT) Subject: [Swift-devel] coastersPerNode not recognized by GT2 GRAM In-Reply-To: <4894FB40.6090008@mcs.anl.gov> References: <489053A9.4080906@mcs.anl.gov> <48907DF1.8020009@mcs.anl.gov> <48909148.1010504@mcs.anl.gov> <4894FB40.6090008@mcs.anl.gov> Message-ID: On Sat, 2 Aug 2008, Michael Wilde wrote: > I still failed with the same error. The gram log showed the coasterspernode rsl > variable still getting through to gram (below). I added a second string, > "coasterspernode" to your list of parameters to filter out, in all lower case, > and this worked. [..] > When you tested against a TG site, did you verify that the coasterspernode > attribute wasn't getting in the RSL? No; I checked that jobs ran OK - likely I used the same capitalisation as in the source and you did not. It's a case sensitivity bug which should be straightforward to fix. 
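
A minimal sketch of the kind of case-insensitive filtering discussed above. This is illustrative Python, not the CoG Java source; the names `INTERNAL_ATTRIBUTES` and `filter_attributes` are hypothetical, and only the attribute name coastersPerNode comes from the thread. It assumes the attribute may arrive in any capitalisation, so names are compared in lower case.

```python
# Hypothetical sketch: remove provider-internal attributes before the job
# description is turned into RSL, comparing attribute names without regard
# to case. Only "coastersPerNode" is taken from the thread.
INTERNAL_ATTRIBUTES = {"coasterspernode"}  # stored in lower case


def filter_attributes(attributes):
    """Return a copy of `attributes` without the internal ones."""
    return {
        name: value
        for name, value in attributes.items()
        if name.lower() not in INTERNAL_ATTRIBUTES
    }


if __name__ == "__main__":
    job = {"coastersPerNode": "4", "queue": "debug"}
    print(filter_attributes(job))
```

Normalising the name once at the comparison point avoids having to list every capitalisation in the filter.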
-- From wilde at mcs.anl.gov Sun Aug 3 08:47:39 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sun, 03 Aug 2008 08:47:39 -0500 Subject: [Swift-devel] coastersPerNode not recognized by GT2 GRAM In-Reply-To: References: <489053A9.4080906@mcs.anl.gov> <48907DF1.8020009@mcs.anl.gov> <48909148.1010504@mcs.anl.gov> <4894FB40.6090008@mcs.anl.gov> Message-ID: <4895B6FB.1030902@mcs.anl.gov> On 8/3/08 6:07 AM, Ben Clifford wrote: > On Sat, 2 Aug 2008, Michael Wilde wrote: > >> I still failed with the same error. The gram log showed the coasterspernode rsl >> variable still getting through to gram (below). I added a second string, >> "coasterspernode" to your list of parameters to filter out, in all lower case, >> and this worked. > > [..] > >> When you tested against a TG site, did you verify that the coasterspernode >> attribute wasn't getting in the RSL? > > No; I checked that jobs ran OK - likely I used the same capitalisation as > in the source and you did not. > > It's a case sensitivity bug which should be straightforward to fix. That was the strange thing - I used the same capitalization in my tag as in your source rev, which didn't work. And the RSL in the GRAM log showed an all lower case attribute (which may have been GRAM's doing). So one possibility is an error on my part in testing; a less likely one is that the system you tested against accepted the coasterspernode RSL attribute but the one I tested against (abe) did not. I'll double-check on my side first. From benc at hawaga.org.uk Mon Aug 4 08:11:17 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 4 Aug 2008 13:11:17 +0000 (GMT) Subject: [Swift-devel] coastersPerNode not recognized by GT2 GRAM In-Reply-To: References: <489053A9.4080906@mcs.anl.gov> <48907DF1.8020009@mcs.anl.gov> <48909148.1010504@mcs.anl.gov> <4894FB40.6090008@mcs.anl.gov> Message-ID: On Sun, 3 Aug 2008, Ben Clifford wrote: > I checked that jobs ran OK Apparently I didn't check very well. I see that attribute being passed through. 
I made a modification to provider-wonky to help catch things like this in the future (It can now be made to get angry if there are spurious attributes supplied, which the local execution provider doesn't get upset about). -- From bugzilla-daemon at mcs.anl.gov Mon Aug 4 09:58:14 2008 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Mon, 4 Aug 2008 09:58:14 -0500 (CDT) Subject: [Swift-devel] [Bug 152] filesys_mapper gives exception In-Reply-To: Message-ID: <20080804145814.66AC6164B1@foxtrot.mcs.anl.gov> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=152 benc at hawaga.org.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |ASSIGNED ------- Comment #1 from benc at hawaga.org.uk 2008-08-04 09:58 ------- This should probably produce something like a type exception. rundock takes the filename of a datatype which also represents a single file (which is OK) but then the mapping expression is something which will map an array. Potentially a compile-time typecheck could happen there for some mappers; but in the very least this is detectable at execution time. -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You reported the bug, or are watching the reporter. 
From bugzilla-daemon at mcs.anl.gov Mon Aug 4 10:55:49 2008 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Mon, 4 Aug 2008 10:55:49 -0500 (CDT) Subject: [Swift-devel] [Bug 152] filesys_mapper gives exception In-Reply-To: Message-ID: <20080804155549.65E1E16469@foxtrot.mcs.anl.gov> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=152 wilde at mcs.anl.gov changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |wilde at mcs.anl.gov ------- Comment #2 from wilde at mcs.anl.gov 2008-08-04 10:55 ------- I need to go back and check (won't have time today) but I think the problem first occurred with no type conflict, using filesys_mapper to map an array, as it was intended. So I suspect the problem is in filesys_mapper itself or its interface back to Swift. The conflict you describe here occurred in my attempt to reproduce the problem in a simple example. (In reply to comment #1) > This should probably produce something like a type exception. > > rundock takes the filename of a datatype which also represents a single file > (which is OK) but then the mapping expression is something which will map an > array. > > Potentially a compile-time typecheck could happen there for some mappers; but > at the very least this is detectable at execution time. > -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is. You reported the bug, or are watching the reporter. 
From bugzilla-daemon at mcs.anl.gov Mon Aug 4 12:07:08 2008 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Mon, 4 Aug 2008 12:07:08 -0500 (CDT) Subject: [Swift-devel] [Bug 152] filesys_mapper gives exception In-Reply-To: Message-ID: <20080804170708.EF26D164B1@foxtrot.mcs.anl.gov> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=152 ------- Comment #3 from benc at hawaga.org.uk 2008-08-04 12:07 ------- In a brief attempt to recreate this, I get this exception instead: (also undesirable but not the same as what you reported). I will see if I can figure out the difference in our setups. $ swift bug152.swift Swift svn swift-r2159 (Swift modified locally) cog-r2127 (CoG modified locally) RunID: 20080804-1905-kdh8mzec Execution failed: java.lang.IllegalStateException: mapper.existing() returned a path [0] that it cannot subsequently map -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is. You reported the bug, or are watching the reporter. From bugzilla-daemon at mcs.anl.gov Mon Aug 4 12:25:59 2008 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Mon, 4 Aug 2008 12:25:59 -0500 (CDT) Subject: [Swift-devel] [Bug 152] filesys_mapper gives exception In-Reply-To: Message-ID: <20080804172559.34BAD16469@foxtrot.mcs.anl.gov> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=152 ------- Comment #4 from benc at hawaga.org.uk 2008-08-04 12:25 ------- Difference in the run you give and the run I tried that appears to cause the problem is that the single file mapped in your case has a name that consists entirely of the suffix, with no base filename on it. That probably should be made to work. However it is suggestive that this is a different exception to what you got with more than one file, given that only one file can exist where the name consists only of the suffix. 
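
A small sketch of how a file whose name consists only of the suffix could fail to match. It is illustrative Python, not the FileSystemArrayMapper code; `matches_suffix` and `existing` are hypothetical helpers, and the rule that a name must have the form base.suffix with a non-empty base is an assumption made for the sketch.

```python
# Hypothetical sketch: a suffix match that expects "<base>.<suffix>".
# Under this assumed rule a file named exactly like the suffix is skipped.
def matches_suffix(filename, suffix):
    """True only for names of the form <base>.<suffix> with a non-empty base."""
    dotted = "." + suffix
    return filename.endswith(dotted) and len(filename) > len(dotted)


def existing(filenames, suffix):
    """List the names the (assumed) mapper would consider mapped."""
    return [name for name in filenames if matches_suffix(name, suffix)]


if __name__ == "__main__":
    # "mol2" has no base name and no dot, so nothing is mapped for it.
    print(existing(["a.mol2", "mol2", "notes.txt"], "mol2"))
```

With an empty result list, any later request for element 0 has nothing to return, which is consistent with an index-style failure rather than a clear "no files matched" message.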
-- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is. You reported the bug, or are watching the reporter. From bugzilla-daemon at mcs.anl.gov Mon Aug 4 13:29:21 2008 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Mon, 4 Aug 2008 13:29:21 -0500 (CDT) Subject: [Swift-devel] [Bug 152] filesys_mapper gives exception In-Reply-To: Message-ID: <20080804182921.2FAB4164B1@foxtrot.mcs.anl.gov> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=152 ------- Comment #5 from benc at hawaga.org.uk 2008-08-04 13:29 ------- The specific error message you report appears to happen when *no* files match (a file named entirely with the suffix does not match because the mapper assumes there will be a . between the main filename and the suffix) in the case of the type violation that you have in the example code. -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is. You reported the bug, or are watching the reporter. From benc at hawaga.org.uk Mon Aug 4 13:48:06 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 4 Aug 2008 18:48:06 +0000 (GMT) Subject: [Swift-devel] type-checking mappers Message-ID: In the context of bug 152, I have thought a little about type checking mappers. Some mappers are amenable to compile-time type checking - for example, the single-file mapper can only map a simple unstructured type; the filesys-mapper can only map a single dimensional array of unstructured types. Not all mappers seem to work with this - for example the external mapper can map any shape structure. 
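
The idea of type-checking mappers at compile time can be sketched as a lookup from mapper name to the shapes it is able to map. This is illustrative Python, not Swift compiler code; the `Shape` enum, the `MAPPER_SHAPES` table and `check_mapping` are hypothetical, and the mapper names other than filesys_mapper are assumed spellings of the mappers named in the message. A value of `None` marks a mapper that cannot be checked statically, such as the external mapper.

```python
# Hypothetical sketch of a compile-time check of mapper against declared type.
from enum import Enum


class Shape(Enum):
    SCALAR = "simple unstructured"
    ARRAY_1D = "one-dimensional array"
    STRUCT = "structured"


# Shapes each mapper can produce; None means "any shape, check at run time".
MAPPER_SHAPES = {
    "single_file_mapper": {Shape.SCALAR},
    "filesys_mapper": {Shape.ARRAY_1D},
    "ext": None,
}


def check_mapping(mapper, declared):
    """Return an error string, or None if accepted or not statically checkable."""
    if mapper not in MAPPER_SHAPES:
        return "unknown mapper " + mapper
    allowed = MAPPER_SHAPES[mapper]
    if allowed is None or declared in allowed:
        return None
    return "mapper %s cannot map a %s type" % (mapper, declared.value)


if __name__ == "__main__":
    # The situation from bug 152: an array mapper used with a single-file type.
    print(check_mapping("filesys_mapper", Shape.SCALAR))
```

Mappers marked `None` would still need the execution-time check mentioned in the bug comments.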
-- From bugzilla-daemon at mcs.anl.gov Mon Aug 4 14:06:24 2008 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Mon, 4 Aug 2008 14:06:24 -0500 (CDT) Subject: [Swift-devel] [Bug 152] Mappers used with incorrect types cause unintuitive error messages. In-Reply-To: Message-ID: <20080804190624.D2F74164B1@foxtrot.mcs.anl.gov> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=152 benc at hawaga.org.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Severity|normal |enhancement Summary|filesys_mapper gives |Mappers used with incorrect |exception |types cause unintuitive | |error messages. ------- Comment #6 from benc at hawaga.org.uk 2008-08-04 14:06 ------- r2174 fixes the null pointer, and a (more sane?) "mapper failed to map..." error now results in the situation where an unstructured type is used with no files. However this (and the error in comment #3) should probably still be caught with better type checking. Changing this to an enhancement request. -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is. You reported the bug, or are watching the reporter. From bugzilla-daemon at mcs.anl.gov Mon Aug 4 14:13:26 2008 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Mon, 4 Aug 2008 14:13:26 -0500 (CDT) Subject: [Swift-devel] [Bug 147] swift hangs at faulty mapping In-Reply-To: Message-ID: <20080804191326.560D7164B2@foxtrot.mcs.anl.gov> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=147 benc at hawaga.org.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #1 from benc at hawaga.org.uk 2008-08-04 14:13 ------- r2151 removes the spurious '-waitfor' parameter. 
r2155 makes the external mapper labelled as static, which makes it work for the sample code supplied out of band by skenny. -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at mcs.anl.gov Mon Aug 4 14:20:23 2008 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Mon, 4 Aug 2008 14:20:23 -0500 (CDT) Subject: [Swift-devel] [Bug 150] multiple workers on one compute node In-Reply-To: Message-ID: <20080804192023.C10FC16469@foxtrot.mcs.anl.gov> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=150 benc at hawaga.org.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #2 from benc at hawaga.org.uk 2008-08-04 14:20 ------- CoG r2094 introduces a coastersPerNode parameter. This is documented in the Swift users guide. This setting will cause a specified number of workers to be started on each node that coasters run on. Support for multiple GRAM-level jobs on each node is not needed in order to use this. -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is. 
From bugzilla-daemon at mcs.anl.gov Mon Aug 4 14:24:29 2008 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Mon, 4 Aug 2008 14:24:29 -0500 (CDT) Subject: [Swift-devel] [Bug 107] restarts broken (by generalisation of data file handling) In-Reply-To: Message-ID: <20080804192429.3ECDA164B2@foxtrot.mcs.anl.gov> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=107 benc at hawaga.org.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|ASSIGNED |RESOLVED Resolution| |FIXED ------- Comment #10 from benc at hawaga.org.uk 2008-08-04 14:24 ------- No one has reported any further problems with restarts, so I'm happy that they work enough for this bug to be closed. -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is. From bugzilla-daemon at mcs.anl.gov Mon Aug 4 14:27:16 2008 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Mon, 4 Aug 2008 14:27:16 -0500 (CDT) Subject: [Swift-devel] [Bug 101] fast-failing sites will absorb large numbers of jobs causing runs to fail despite multiple attempts at retrying In-Reply-To: Message-ID: <20080804192716.24B1216469@foxtrot.mcs.anl.gov> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=101 benc at hawaga.org.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #5 from benc at hawaga.org.uk 2008-08-04 14:27 ------- CoG r2058 and numerous subsequent commits add delays for bad sites. -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is. You reported the bug, or are watching the reporter. 
From bugzilla-daemon at mcs.anl.gov Mon Aug 4 14:30:13 2008 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Mon, 4 Aug 2008 14:30:13 -0500 (CDT) Subject: [Swift-devel] [Bug 26] implement 'swiftstat' In-Reply-To: Message-ID: <20080804193013.CA243164B2@foxtrot.mcs.anl.gov> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=26 ------- Comment #2 from benc at hawaga.org.uk 2008-08-04 14:30 ------- Over the past months, two similar but different pieces of code have been implemented: Firstly, Swift generates a periodic status line giving a count of how many jobs are in each of a number of general states. Secondly for more detailed information, copious graphical and textual analysis of a Swift run (either in progress or ended) is available through the log-processing package. -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You reported the bug, or are watching the reporter. From benc at hawaga.org.uk Tue Aug 5 07:04:20 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 5 Aug 2008 12:04:20 +0000 (GMT) Subject: [Swift-devel] coastersPerNode not recognized by GT2 GRAM In-Reply-To: References: <489053A9.4080906@mcs.anl.gov> <48907DF1.8020009@mcs.anl.gov> <48909148.1010504@mcs.anl.gov> <4894FB40.6090008@mcs.anl.gov> Message-ID: On Mon, 4 Aug 2008, Ben Clifford wrote: > Apparently I didn't check very well. I see that attribute being passed This should be fixed in cog r2127. 
-- From benc at hawaga.org.uk Tue Aug 5 08:21:28 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 5 Aug 2008 13:21:28 +0000 (GMT) Subject: [Swift-devel] Re: NCSA-hg servers In-Reply-To: <488E0562.4070702@mcs.anl.gov> References: <488E0562.4070702@mcs.anl.gov> Message-ID: On Mon, 28 Jul 2008, Michael Wilde wrote: > When I tried gridftp-hg.ncsa.teragrid.org, it worked the first time, although > with an unexpected lengthy delay (seemed about 15-30 seconds) but when I > retried the same command I got the cert error below. There are four hosts behind tg-gridftp.ncsa.teragrid.org. Three of them have certificates for which communicado's CRLs have expired (141.42.48.24[341]), whilst the fourth has a certificate that communicado regards as valid (141.42.48.242). Using a custom ~/.globus/certificates directory with no CRLs, I can communicate with all four of the above servers. I will poke the relevant authorities. -- From benc at hawaga.org.uk Tue Aug 5 09:16:04 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 5 Aug 2008 14:16:04 +0000 (GMT) Subject: [Swift-devel] Some observations In-Reply-To: References: Message-ID: > On Sun, 27 Jul 2008, Tiberiu Stef-Praun wrote: > > > I was trying to read into swift the contents of a file which contained > > a float (e.g. 0.415599405693). > > It has been suggested that I use readData. > > It did not work (some error about unable to cast to java.lang.Integer) I just tested this and it seems to work for me. I added the test to tests/languag-behaviour/readData.swift in r2176. Please try that test and check that it works for you. If you can come up with an example that does not work that would also be useful, as would the actual error message. 
-- From bugzilla-daemon at mcs.anl.gov Tue Aug 5 14:51:22 2008 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Tue, 5 Aug 2008 14:51:22 -0500 (CDT) Subject: [Swift-devel] [Bug 149] Improve readdata() error message In-Reply-To: Message-ID: <20080805195122.D29A816469@foxtrot.mcs.anl.gov> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=149 hategan at mcs.anl.gov changed: What |Removed |Added ---------------------------------------------------------------------------- Status|ASSIGNED |RESOLVED Resolution| |FIXED ------- Comment #2 from hategan at mcs.anl.gov 2008-08-05 14:51 ------- No further complaints received. Closing... -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You reported the bug, or are watching the reporter. From benc at hawaga.org.uk Wed Aug 6 11:13:19 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 6 Aug 2008 16:13:19 +0000 (GMT) Subject: [Swift-devel] swift 0.6 rc5 Message-ID: Once again it's time to try a release candidate for 0.6. http://www.ci.uchicago.edu/~benc/vdsk-0.6-rc5.tar.gz Please test and report. It is built with coasters and provider-wonky both enabled. I ran the site tests and got these results: These sites failed: fletch-condor-gram2.xml osg-edu.cs.wisc.edu-condor.xml tgncsa-hg-pbs-gram4.xml tgpurdue-condor-gram2.xml tgpurdue-condor-gram4.xml tgtacc-fork-gram2.xml tgtacc-lsf-gram2.xml UCLA_Saxon_Tier3-fork.xml These sites worked: fletch-fork-gram2.xml osg-edu.cs.wisc.edu-fork.xml tgncsa-hg-fork-gram2.xml tgncsa-hg-fork-gram4.xml tgncsa-hg-pbs-gram2.xml tgpurdue-fork-gram2.xml tgpurdue-fork-gram4.xml tguc-fork-gram2.xml tguc-fork-gram4.xml tguc-pbs-gram2-syntax1.xml tguc-pbs-gram2.xml tguc-pbs-gram4.xml tp-fork-gram2.xml tp-fork-gram4.xml tp-pbs-gram2.xml Nothing looks too tragic there; I'll investigate the failures later but they all look like site-specific problems, not swift problems. 
During local testing on communicado, I once saw one of the tests hang in initialising site state, but was unable to get that to reappear. So I think there's something fishy still going on with load management/site selection but not excessively bad. -- From bugzilla-daemon at mcs.anl.gov Wed Aug 6 12:01:18 2008 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Wed, 6 Aug 2008 12:01:18 -0500 (CDT) Subject: [Swift-devel] [Bug 153] New: SGE adapter for gram acting weird on TACC_Ranger Message-ID: http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=153 Summary: SGE adapter for gram acting weird on TACC_Ranger Product: Swift Version: unspecified Platform: Macintosh OS/Version: Mac OS Status: NEW Severity: normal Priority: P2 Component: Specific site issues AssignedTo: benc at hawaga.org.uk ReportedBy: skenny at uchicago.edu when the output is not redirected, the job is put in an "Unscheduled" state in the queue, and GRAM never gets any kind of further notification. because of this problem with sge, swift has to be hacked to redirect stdout in order to run jobs on ranger. -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at mcs.anl.gov Wed Aug 6 12:42:27 2008 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Wed, 6 Aug 2008 12:42:27 -0500 (CDT) Subject: [Swift-devel] [Bug 153] SGE adapter for gram acting weird on TACC_Ranger In-Reply-To: Message-ID: <20080806174227.6EABF16469@foxtrot.mcs.anl.gov> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=153 ------- Comment #1 from skenny at uchicago.edu 2008-08-06 12:42 ------- the file that needed to be altered for redirection of stdout was: swift/libexec/vdl-int.k 175c175 < task:execute("/bin/rm", arguments="-rf {dir}", host=host, batch=true, stdout="/dev/null", stderr="/dev/null") --- > task:execute("/bin/rm", arguments="-rf {dir}", host=host, batch=true) 403c403 < redirect=true --- > redirect=false -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From lixi at uchicago.edu Wed Aug 6 23:18:56 2008 From: lixi at uchicago.edu (lixi at uchicago.edu) Date: Wed, 6 Aug 2008 23:18:56 -0500 (CDT) Subject: [Swift-devel] Swift run: java.io.IOException: Unknown error 512 Message-ID: <20080806231856.BCW02035@m4500-03.uchicago.edu> Hi, I ran a workflow like this: [lixi at communicado test] $ /home/lixi/performancetest/4/cog/modules/vdsk/dist/vdsk-svn/bin/swift -sites.file ../sitesfile/SELECT1/sites2.0808062300.xml -tc.file ../tc.data testworkflow.swift >0808062300.log 2>&1 & During the execution, it stopped suddenly and the stdout and stderr are included in /home/lixi/performancetest/test/0808062300.log. 
It seems that it stopped due to "java.io.IOException: Unknown error 512" The log file is /home/lixi/performancetest/test/testworkflow-20080806-2301-m1qbxjr3.log [lixi at communicado test]$ tail -n 20 0808062300.log Sorted: [LIGO_UWM_NEMO:140.112(90.071):37/37 overload: 0] node10 completed Sorted: [FLTECH:144.563(90.361):37/37 overload: 0] node10 completed Sorted: [UTA_SWT2:147.336(90.533):37/37 overload: 0] node10 completed Sorted: [FLTECH:146.739(90.497):37/37 overload: 0] node10 completed Sorted: [TTU-ANTAEUS:21.888(51.767):21/21 overload: 0] Sorted: [TTU-ANTAEUS:22.888(53.230):21/22 overload: 0] Sorted: [TTU-ANTAEUS:22.888(53.230):22/22 overload: 0] node10 completed Progress: Selecting site:1497 Stage in:19 Executing:170 Stage out:165 Finished successfully:106 Initializing site shared directory:2 Failed but can retry:41 java.io.IOException: Unknown error 512 at java.io.FileInputStream.readBytes(Native Method) at java.io.FileInputStream.read(FileInputStream.java:194) at java.io.BufferedInputStream.fill(BufferedInputStream.java:218) at java.io.BufferedInputStream.read(BufferedInputStream.java:235) at org.griphyn.vdl.karajan.InHook.run(InHook.java:39) at java.lang.Thread.run(Thread.java:595) Would you please tell me why such an error happened and what to do with it? Thanks, Xi From benc at hawaga.org.uk Thu Aug 7 03:09:53 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 7 Aug 2008 08:09:53 +0000 (GMT) Subject: [Swift-devel] Swift run: java.io.IOException: Unknown error 512 In-Reply-To: <20080806231856.BCW02035@m4500-03.uchicago.edu> References: <20080806231856.BCW02035@m4500-03.uchicago.edu> Message-ID: Can you reproduce it? Google shows occurrences of that exception (unknown err 512 in FileInputStream.readBytes) happening when the java process has been set to run in the background, when reading from the console. Were you doing anything like that? 
(eg running with & after the command or pressing ctrl-z) -- From mikekubal at yahoo.com Thu Aug 7 20:42:30 2008 From: mikekubal at yahoo.com (Mike Kubal) Date: Thu, 7 Aug 2008 18:42:30 -0700 (PDT) Subject: [Swift-devel] connection error Message-ID: <732920.74580.qm@web52308.mail.re2.yahoo.com> Hi Ben, I have a feeling this is a certificate or CRL issue on the host machine (terminable at the CI), but perhaps you can tell for sure by examining the log, Pipeline_BoNT-20080807-1449-zm9x88ad.log . Nothing in the swift code or sites file has changed. Caused by: Cannot submit job Caused by: The connection to the server failed (check host and port) [Caused by: Connection refused] Progress: Selecting site:1035 Executing:1 Failed:2 Failed but can retry:1 I rsync'd over many logs from various stages of processing 4000 ligands against a target to your CI swift-log dir, including the one above. Thanks, MikeK From wilde at mcs.anl.gov Thu Aug 7 21:53:27 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 07 Aug 2008 21:53:27 -0500 Subject: [Swift-devel] connection error In-Reply-To: <732920.74580.qm@web52308.mail.re2.yahoo.com> References: <732920.74580.qm@web52308.mail.re2.yahoo.com> Message-ID: <489BB527.5080403@mcs.anl.gov> Mike, I wonder if you can try communicado.ci.uchicago.edu? It should require no change in any scripts, tools or procedures (I think). I *think* that communicado's certs are up to date. It's not clear yet that this is a host-cert problem, but that's worth a try. What server were you trying to reach? Can you test a simple globus-job-run to it? - Mike On 8/7/08 8:42 PM, Mike Kubal wrote: > Hi Ben, > > I have a feeling this is a certificate or CRL issue on the host machine > (terminable at the CI), but perhaps you can tell for sure by examining > the log, Pipeline_BoNT-20080807-1449-zm9x88ad.log . Nothing in the swift > code or sites file has changed. 
> > Caused by: > Cannot submit job > Caused by: > The connection to the server failed (check host and port) > [Caused by: Connection refused] > Progress: Selecting site:1035 Executing:1 Failed:2 Failed but can retry:1 > > I rsync'd over many logs from various stages of processing 4000 ligands > against a target to your CI swift-log dir, including the one above. > > Thanks, > > MikeK From benc at hawaga.org.uk Fri Aug 8 01:16:33 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 8 Aug 2008 06:16:33 +0000 (GMT) Subject: [Swift-devel] Re: connection error In-Reply-To: <732920.74580.qm@web52308.mail.re2.yahoo.com> References: <732920.74580.qm@web52308.mail.re2.yahoo.com> Message-ID: 'Connection refused' most likely is a TCP-level connection error, so not as high in the stack as security. And indeed: $ telnet grid-abe.ncsa.teragrid.org 2119 Trying 141.142.68.180... telnet: Unable to connect to remote host: Connection refused That is probably something to report to help at teragrid. On Thu, 7 Aug 2008, Mike Kubal wrote: > > Caused by: > Cannot submit job > Caused by: > The connection to the server failed (check host and port) [Caused by: Connection refused] > Progress: Selecting site:1035 Executing:1 Failed:2 Failed but can retry:1 > > I rsync'd over many logs from various stages of processing 4000 ligands against a target to your CI swift-log dir, including the one above. > > Thanks, > > MikeK > > > From benc at hawaga.org.uk Fri Aug 8 07:29:35 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 8 Aug 2008 12:29:35 +0000 (GMT) Subject: [Swift-devel] swift + pacman Message-ID: I made a pacman wrapper for swift 0.6 rc5.
If you are a pacman fanatic, for example if you like to install the OSG or VDT stacks often, you can add swift into an installation directory like this: $ pacman -get http://www.ci.uchicago.edu/~benc/pacman:swift-0.6-rc5 In part, this is for experimenting with bugs 146 and 104 to bring in more dependencies into the release (mostly for credential management). A pacman-based appraoch could put swift in a custom cut-down VDT environment with only the requested dependencies. -- From benc at hawaga.org.uk Fri Aug 8 08:15:21 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 8 Aug 2008 13:15:21 +0000 (GMT) Subject: [Swift-devel] swift+vdt bastard offspring Message-ID: I made a pacman package which will deploy both swift and the packages requested in bug 104 and 146 (the DOE CA cert-request tools and voms-proxy-init). VDT/OSG installation instructions/rules apply as do the many foibles of pacman. $ pacman -get http://www.ci.uchicago.edu/~benc/pacman:swift-tools [...] $ source setup.sh $ du -hsc . 163M . swift, voms-proxy-init and cert-request are all on the path. -- From bugzilla-daemon at mcs.anl.gov Fri Aug 8 09:29:56 2008 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Fri, 8 Aug 2008 09:29:56 -0500 (CDT) Subject: [Swift-devel] [Bug 146] Add voms-proxy-init command to Swift release In-Reply-To: Message-ID: <20080808142956.22460164B1@foxtrot.mcs.anl.gov> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=146 ------- Comment #2 from benc at hawaga.org.uk 2008-08-08 09:29 ------- I combined swift with part of VDT. This gives voms-proxy-init. See this message for more details: http://mail.ci.uchicago.edu/pipermail/swift-devel/2008-August/003809.html -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You reported the bug, or are watching the reporter. 
From bugzilla-daemon at mcs.anl.gov Fri Aug 8 09:34:51 2008 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Fri, 8 Aug 2008 09:34:51 -0500 (CDT) Subject: [Swift-devel] [Bug 104] Add cert request tools to swift/bin In-Reply-To: Message-ID: <20080808143451.3C70B164B1@foxtrot.mcs.anl.gov> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=104 ------- Comment #6 from benc at hawaga.org.uk 2008-08-08 09:34 ------- I made a combination of parts of VDT along with Swift, which provides (via VDT) the cert-request tools mentioned here. See this message for more details: I combined swift with part of VDT. This gives voms-proxy-init. See this message for more details: http://mail.ci.uchicago.edu/pipermail/swift-devel/2008-August/003809.html -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You reported the bug, or are watching the reporter. From lixi at uchicago.edu Sun Aug 10 15:43:17 2008 From: lixi at uchicago.edu (lixi at uchicago.edu) Date: Sun, 10 Aug 2008 15:43:17 -0500 (CDT) Subject: [Swift-devel] Swift run: hanging up when submitting a job Message-ID: <20080810154317.BCY32452@m4500-03.uchicago.edu> Hi, Today I ran a workflow including 3000 jobs with replication enabled. 2999 jobs finished successfully and only one job is hanging up. 
When taking a close look at the log file, I found the hanging job id is 0-2800, so I execute the following command to check the job: [lixi at communicado 3000]$ grep 0-2800 testworkflow-20080810- 0953-mlj2nsc4.log 2008-08-10 09:53:53,032-0500 INFO worknode PROCEDURE thread=0-2800 name=worknode 2008-08-10 09:53:54,200-0500 INFO vdl:parameterlog PARAM thread=0-2800 direction=input variable=input provenanceid=tag:benc at ci.uchicago.edu,2008:swift:dataset:2008 0810-0953-d6p5ul9d:720000000006 2008-08-10 09:53:55,708-0500 INFO vdl:parameterlog PARAM thread=0-2800 direction=output variable=output provenanceid=tag:benc at ci.uchicago.edu,2008:swift:dataset:2008 0810-0953-d6p5ul9d:720000005789 2008-08-10 09:54:05,612-0500 INFO vdl:execute START thread=0-2800 tr=node10 2008-08-10 10:46:10,044-0500 DEBUG vdl:execute2 THREAD_ASSOCIATION jobid=node10-19x1krxi thread=0-2800-1 host=AGLT2 replicationGroup=fot1krxi 2008-08-10 10:46:15,494-0500 DEBUG TaskImpl Task (type=FILE_OPERATION, identity=urn:0-2800-1-1-1218380053396) setting status to Submitting 2008-08-10 10:46:15,494-0500 DEBUG TaskImpl Task (type=FILE_OPERATION, identity=urn:0-2800-1-1-1218380053396) setting status to Submitted 2008-08-10 10:46:15,494-0500 DEBUG TaskImpl Task (type=FILE_OPERATION, identity=urn:0-2800-1-1-1218380053396) setting status to Active 2008-08-10 10:46:15,494-0500 DEBUG TaskImpl Task (type=FILE_OPERATION, identity=urn:0-2800-1-1-1218380053396) setting status to Completed 2008-08-10 10:46:15,494-0500 INFO LateBindingScheduler Task (type=FILE_OPERATION, identity=urn:0-2800-1-1-1218380053396) Completed. Waiting: 2472, Running: 66. 
Heap size: 355M, Heap free: 141M, Max heap: 986M 2008-08-10 10:46:17,377-0500 DEBUG TaskImpl Task (type=JOB_SUBMISSION, identity=urn:0-2800-1-1218380053474) setting status to Submitting 2008-08-10 10:46:18,848-0500 DEBUG TaskImpl Task (type=JOB_SUBMISSION, identity=urn:0-2800-1-1218380053474) setting status to Submitted 2008-08-10 10:46:18,848-0500 DEBUG WeightedHostScoreScheduler Submission time for Task (type=JOB_SUBMISSION, identity=urn:0-2800-1-1218380053474): 1471ms. Score delta: -0.024897435897435895 2008-08-10 10:46:30,063-0500 DEBUG TaskImpl Task (type=JOB_SUBMISSION, identity=urn:0-2800-1-1218380053474) setting status to Active From the log file, we can see that the submission of this job wasn't finished. So I think that this is why no replication job was generated for this job after so long a time even with replication enabled. This is my understanding. I wonder if I have misunderstood anything. If my understanding is right, is there any solution to this kind of situation? The log file is: /home/lixi/performancetest/2/application/3000/testworkflow-20080810-0953-mlj2nsc4.log Thanks, Xi From hategan at mcs.anl.gov Sun Aug 10 15:58:33 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 10 Aug 2008 15:58:33 -0500 Subject: [Swift-devel] Swift run: hanging up when submitting a job In-Reply-To: <20080810154317.BCY32452@m4500-03.uchicago.edu> References: <20080810154317.BCY32452@m4500-03.uchicago.edu> Message-ID: <1218401913.9399.10.camel@localhost> On Sun, 2008-08-10 at 15:43 -0500, lixi at uchicago.edu wrote: > Hi, > > Today I ran a workflow including 3000 jobs with replication > enabled. 2999 jobs finished successfully and only one job is > hanging up. When taking a close look at the log file, I > found the hanging job id is 0-2800, so I execute the > following command to check the job: > > [...]
> 2008-08-10 10:46:17,377-0500 DEBUG TaskImpl Task > (type=JOB_SUBMISSION, identity=urn:0-2800-1-1218380053474) > setting status to Submitting > 2008-08-10 10:46:18,848-0500 DEBUG TaskImpl Task > (type=JOB_SUBMISSION, identity=urn:0-2800-1-1218380053474) > setting status to Submitted > 2008-08-10 10:46:18,848-0500 DEBUG > WeightedHostScoreScheduler Submission time for Task > (type=JOB_SUBMISSION, identity=urn:0-2800-1-1218380053474): > 1471ms. Score delta: -0.024897435897435895 > 2008-08-10 10:46:30,063-0500 DEBUG TaskImpl Task > (type=JOB_SUBMISSION, identity=urn:0-2800-1-1218380053474) > setting status to Active > > >From the log file, we can see that the submission of this > job wasn't finished. Actually the job was submitted and it appears to be running. > So I think that this is why no > replicaiton job was generated for this job after so long a > time even with replication enabled. Replication only works if the job is queued. This job seems to be running. Though we're probably talking about the site going bad after the job started to run causing the notifications of the job completing/failing to not be sent. > > This is my understanding. I wonder if I made any > misunderstanding. If my understanding is right, is there any > solution to this kind of situation? It's not simple. If notification is unreliable it's impossible to distinguish between a really long process and the notification having been lost. That is if there is no information about how long the process is. So one solution would be to make "notifications" more reliable by polling for the job status. But GRAM makes it really hard to do this efficiently (each poll for each job involves one full SSL session establishment). The other solution is to put a cap on the process duration. So if the job has a walltime spec, consider notifications lost if the job doesn't complete in walltime + some_margin_of_error. 
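A minimal sketch of the walltime cap described above, in Python rather than in Swift's own code base. The `JobRecord` class, the `notification_presumed_lost` function and the 300-second default margin are all hypothetical names and values chosen for illustration; nothing here is an existing Swift or CoG API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class JobRecord:
    """Hypothetical bookkeeping for one submitted job (not a Swift class)."""
    job_id: str
    active_since: float          # epoch seconds when the job went Active
    walltime_s: Optional[float]  # declared walltime in seconds, if any


def notification_presumed_lost(job: JobRecord, now: float,
                               margin_s: float = 300.0) -> bool:
    """Return True once the job has been Active longer than walltime + margin.

    Without a walltime spec there is no bound on how long the process may
    legitimately run, so nothing can be concluded and False is returned.
    """
    if job.walltime_s is None:
        return False
    return (now - job.active_since) > (job.walltime_s + margin_s)
```

A caller could evaluate this periodically for each Active job and then either mark the job failed or poll the job status once, instead of polling every job continuously.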
Mihael From benc at hawaga.org.uk Mon Aug 11 08:48:06 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 11 Aug 2008 13:48:06 +0000 (GMT) Subject: [Swift-devel] hangs in nmi build and test at first site selecting stage Message-ID: Two times in the past few days (out of 30 or so build/tests x about 120 runs per build/test) runs have hung at the initial site selection stage for the first job. I haven't investigated in greater depth than this. My gut feeling, though, is that it's probably still some funny behaviour related to rate limiting. Sometime in the next few days I'll see about running a few thousand tests with more debugging info to see if I can get more info... -- From benc at hawaga.org.uk Tue Aug 12 07:06:45 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 12 Aug 2008 12:06:45 +0000 (GMT) Subject: [Swift-devel] Swift run: hanging up when submitting a job In-Reply-To: <1218401913.9399.10.camel@localhost> References: <20080810154317.BCY32452@m4500-03.uchicago.edu> <1218401913.9399.10.camel@localhost> Message-ID: On Sun, 10 Aug 2008, Mihael Hategan wrote: > The other solution is to put a cap on the process duration. So if the > job has a walltime spec, consider notifications lost if the job doesn't > complete in walltime + some_margin_of_error. I think that is probably a good thing to do. Either consider the job failed once walltime + margin has passed, or do some polling, such as polling a single time when walltime + margin has passed. Margin can be pretty big (on the order of minutes), I think. From benc at hawaga.org.uk Thu Aug 14 03:56:41 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 14 Aug 2008 08:56:41 +0000 (GMT) Subject: [Swift-devel] swift 0.6 rc5 In-Reply-To: References: Message-ID: On Wed, 6 Aug 2008, Ben Clifford wrote: > http://www.ci.uchicago.edu/~benc/vdsk-0.6-rc5.tar.gz > Please test and report. No one has commented on this, either bad or good; so I'll put this out as 0.6 later today.
-- From lixi at uchicago.edu Fri Aug 15 12:03:31 2008 From: lixi at uchicago.edu (lixi at uchicago.edu) Date: Fri, 15 Aug 2008 12:03:31 -0500 (CDT) Subject: [Swift-devel] Swift run: hanging up when submitting a job Message-ID: <20080815120331.BDG09904@m4500-03.uchicago.edu> >The other solution is to put a cap on the process duration. So if the >job has a walltime spec, consider notifications lost if the job doesn't >complete in walltime + some_margin_of_error. Because the user might have some idea of the execution time of a single job, is it possible to add a parameter in swift.properties or tc.data specifying the maximum process duration of each job? If a job exceeded that limit, it would be resubmitted to another site to be executed. I know that Swift already has a maxwalltime, which specifies a walltime limit for each job in minutes, but I'm not sure whether this parameter performs exactly that function. If not, would it be difficult to try this? Thanks, Xi From skenny at uchicago.edu Fri Aug 15 12:43:40 2008 From: skenny at uchicago.edu (skenny at uchicago.edu) Date: Fri, 15 Aug 2008 12:43:40 -0500 (CDT) Subject: [Swift-devel] not able to resume Message-ID: <20080815124340.BJD56291@m4500-02.uchicago.edu> hi all, we recently updated our swift (so, using Swift svn swift-r2185 cog-r2128) and it seems that -resume is no longer behaving as expected...or is possibly being ignored. previously, on a resume, swift's stdout would show how many jobs were already completed as well as those that were being initialized. but now it seems to simply start from scratch, from what we can tell...if we could get # of completed jobs to print to stdout again that would help to verify. i can send the log, or a link to it (it's quite large) but i don't see any errors there that seem related to the resume. but let me know if there's other info that might help. thanks!
sarah From benc at HAWAGA.ORG.UK Fri Aug 15 16:02:31 2008 From: benc at HAWAGA.ORG.UK (Ben Clifford) Date: Fri, 15 Aug 2008 21:02:31 +0000 (GMT) Subject: [Swift-devel] not able to resume In-Reply-To: <20080815124340.BJD56291@m4500-02.uchicago.edu> References: <20080815124340.BJD56291@m4500-02.uchicago.edu> Message-ID: In tests/misc/ there are a number of tests for restarts - restart*.sh Run those against your build (by putting Swift in your path and typing eg: ./restart.sh ./restart-iterate.sh etc and see what results you get - each test will output either a failure message or "success" as the last line. -- From skenny at uchicago.edu Mon Aug 18 16:49:40 2008 From: skenny at uchicago.edu (skenny at uchicago.edu) Date: Mon, 18 Aug 2008 16:49:40 -0500 (CDT) Subject: [Swift-devel] not able to resume Message-ID: <20080818164940.BJF74421@m4500-02.uchicago.edu> restart: success (w/errors during run) restart2: success (w/errors) restart3: [skenny at andrew misc]$ ./restart3.sh Could not start execution. Error reading source: : input contained no data Could not start execution. Error reading source: : input contained no data Failed - second round did not exit with success restart4: success (w/errors) restart5: success (w/errors) restart-extern: success (w/errors) restart-iterate: success (w/errors) for all of the ones with errors, it seems to be helperB: helperB failed Final status: Failed:1 Finished successfully:1 The following errors have occurred: 1. Application "helperB" failed (Exit code 1) Arguments: "/disks/gpfs/fmri/cnari/swift/sbuilds/cog/modules/vdsk/tests/misc/restart-extern.2.out, /etc/group, baz" Host: localhost Directory: restart-extern-20080818-1643-bofrdg7f/jobs/c/helperB-c6x176yi STDERR: STDOUT: Swift svn swift-r2185 cog-r2128 let me know if it helps to paste the entire output for any/all of these. i'm not quite sure what 'success' means given there are errors during the test (?) 
thanks sarah ---- Original message ---- >Date: Fri, 15 Aug 2008 21:02:31 +0000 (GMT) >From: Ben Clifford >Subject: Re: [Swift-devel] not able to resume >To: skenny at uchicago.edu >Cc: swift-devel at ci.uchicago.edu > > >In tests/misc/ there are a number of tests for restarts - restart*.sh > >Run those against your build (by putting Swift in your path and typing eg: > >./restart.sh > >./restart-iterate.sh > >etc > >and see what results you get - each test will output either a failure >message or "success" as the last line. > >-- From hategan at mcs.anl.gov Mon Aug 18 16:59:36 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 18 Aug 2008 16:59:36 -0500 Subject: [Swift-devel] not able to resume In-Reply-To: <20080818164940.BJF74421@m4500-02.uchicago.edu> References: <20080818164940.BJF74421@m4500-02.uchicago.edu> Message-ID: <1219096776.24889.2.camel@localhost> On Mon, 2008-08-18 at 16:49 -0500, skenny at uchicago.edu wrote: > restart: success (w/errors during run) > restart2: success (w/errors) > restart3: > [skenny at andrew misc]$ ./restart3.sh > Could not start execution. > Error reading source: : input contained no data > Could not start execution. > Error reading source: : input contained no data > Failed - second round did not exit with success Hmm. Can you delete restart3.xml and restart3.kml and try again? > > restart4: success (w/errors) > restart5: success (w/errors) > restart-extern: success (w/errors) > restart-iterate: success (w/errors) > > for all of the ones with errors, it seems to be helperB: > > helperB failed > Final status: Failed:1 Finished successfully:1 > The following errors have occurred: > 1. Application "helperB" failed (Exit code 1) ... Yes, helperB is the following script: ---------- #!/bin/bash exit 1 ---------- The first step needs to be "interrupted" in order to test the restarts. 
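The shape of these restart tests (a first round that is deliberately failed by helperB, then a second round that resumes) can be modelled with a toy sketch. This is an illustration in Python under stated assumptions, not Swift's restart-log implementation: `run_pipeline` and the in-memory `restart_log` set are hypothetical, and swapping in a succeeding helperB for the second round is an assumption about how the test scripts behave.

```python
def run_pipeline(steps, restart_log):
    """Run steps in order, skipping any whose name is already in restart_log.

    steps is a list of (name, callable) pairs; each callable returns an exit
    code. Returns the names executed in this round. Execution stops at the
    first non-zero exit code, as in a failed run.
    """
    executed = []
    for name, step in steps:
        if name in restart_log:
            continue              # completed in an earlier round, skip it
        executed.append(name)
        if step() != 0:
            break                 # restart_log now marks the resume point
        restart_log.add(name)
    return executed


def ok():
    return 0


def fail():
    return 1                      # stands in for a helperB that runs "exit 1"


log = set()
# First round: helperB fails on purpose, so only helperA is recorded.
first = run_pipeline([("helperA", ok), ("helperB", fail), ("helperC", ok)], log)
# Second round: helperA is skipped and the run picks up at helperB.
second = run_pipeline([("helperA", ok), ("helperB", ok), ("helperC", ok)], log)
```

In this model a "success" verdict means the second round skipped the recorded work and finished, so an error printed during the first round is expected rather than a sign of a broken test.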
From skenny at uchicago.edu Mon Aug 18 17:04:54 2008 From: skenny at uchicago.edu (skenny at uchicago.edu) Date: Mon, 18 Aug 2008 17:04:54 -0500 (CDT) Subject: [Swift-devel] not able to resume Message-ID: <20080818170454.BJF76049@m4500-02.uchicago.edu> ok, now i get similar output to the others for restart3, not sure what happened there; but here's the whole output: [skenny at andrew misc]$ ./restart3.sh Swift svn swift-r2185 cog-r2128 RunID: 20080818-1703-z57hx8j2 Progress: helperA started Sorted: [localhost:0.000(1.000):0/1 overload: 0] helperA completed helperB started Sorted: [localhost:1.303(2.111):0/1 overload: 0] Sorted: [localhost:1.595(2.473):0/1 overload: 0] Sorted: [localhost:1.888(2.882):0/1 overload: 0] helperB failed Execution failed: Exception in helperB: Arguments: [restart-2.out] Host: localhost Directory: restart3-20080818-1703-z57hx8j2/jobs/q/helperB-quat76yi stderr.txt: stdout.txt: ---- Caused by: Exit code 1 Swift svn swift-r2185 cog-r2128 RunID: 20080818-1703-bhfsb6be Progress: helperB started Sorted: [localhost:0.000(1.000):0/1 overload: 0] helperB completed helperC started Sorted: [localhost:1.303(2.111):0/1 overload: 0] helperC completed Final status: Initializing:1 Finished successfully:2 success ---- Original message ---- >Date: Mon, 18 Aug 2008 16:59:36 -0500 >From: Mihael Hategan >Subject: Re: [Swift-devel] not able to resume >To: skenny at uchicago.edu >Cc: Ben Clifford , swift-devel at ci.uchicago.edu > >On Mon, 2008-08-18 at 16:49 -0500, skenny at uchicago.edu wrote: >> restart: success (w/errors during run) >> restart2: success (w/errors) >> restart3: >> [skenny at andrew misc]$ ./restart3.sh >> Could not start execution. >> Error reading source: : input contained no data >> Could not start execution. >> Error reading source: : input contained no data >> Failed - second round did not exit with success > >Hmm. Can you delete restart3.xml and restart3.kml and try again? 
> >> >> restart4: success (w/errors) >> restart5: success (w/errors) >> restart-extern: success (w/errors) >> restart-iterate: success (w/errors) >> >> for all of the ones with errors, it seems to be helperB: >> >> helperB failed >> Final status: Failed:1 Finished successfully:1 >> The following errors have occurred: >> 1. Application "helperB" failed (Exit code 1) >... > >Yes, helperB is the following script: >---------- >#!/bin/bash > >exit 1 >---------- > >The first step needs to be "interrupted" in order to test the restarts. > > From hategan at mcs.anl.gov Mon Aug 18 17:09:15 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 18 Aug 2008 17:09:15 -0500 Subject: [Swift-devel] not able to resume In-Reply-To: <20080818170454.BJF76049@m4500-02.uchicago.edu> References: <20080818170454.BJF76049@m4500-02.uchicago.edu> Message-ID: <1219097355.25406.1.camel@localhost> Seems to be working fine. Perhaps you are running, in your failed restarts, into the "staging out happens late" issue. Can you send a sample rlog? 
On Mon, 2008-08-18 at 17:04 -0500, skenny at uchicago.edu wrote: > ok, now i get similar output to the others for restart3, not > sure what happened there; but here's the whole output: > > [skenny at andrew misc]$ ./restart3.sh > Swift svn swift-r2185 cog-r2128 > > RunID: 20080818-1703-z57hx8j2 > Progress: > helperA started > Sorted: [localhost:0.000(1.000):0/1 overload: 0] > helperA completed > helperB started > Sorted: [localhost:1.303(2.111):0/1 overload: 0] > Sorted: [localhost:1.595(2.473):0/1 overload: 0] > Sorted: [localhost:1.888(2.882):0/1 overload: 0] > helperB failed > Execution failed: > Exception in helperB: > Arguments: [restart-2.out] > Host: localhost > Directory: restart3-20080818-1703-z57hx8j2/jobs/q/helperB-quat76yi > stderr.txt: > stdout.txt: > ---- > > Caused by: > Exit code 1 > Swift svn swift-r2185 cog-r2128 > > RunID: 20080818-1703-bhfsb6be > Progress: > helperB started > Sorted: [localhost:0.000(1.000):0/1 overload: 0] > helperB completed > helperC started > Sorted: [localhost:1.303(2.111):0/1 overload: 0] > helperC completed > Final status: Initializing:1 Finished successfully:2 > success > From skenny at uchicago.edu Mon Aug 18 17:47:23 2008 From: skenny at uchicago.edu (skenny at uchicago.edu) Date: Mon, 18 Aug 2008 17:47:23 -0500 (CDT) Subject: [Swift-devel] not able to resume Message-ID: <20080818174723.BJF79002@m4500-02.uchicago.edu> hmm, looks like the rlog got deleted...for future reference though, do you know how i might be able to tell that from the rlog? ---- Original message ---- >Date: Mon, 18 Aug 2008 17:09:15 -0500 >From: Mihael Hategan >Subject: Re: [Swift-devel] not able to resume >To: skenny at uchicago.edu >Cc: Ben Clifford , swift-devel at ci.uchicago.edu > >Seems to be working fine. > >Perhaps you are running, in your failed restarts, into the "staging out >happens late" issue. Can you send a sample rlog? 
> >On Mon, 2008-08-18 at 17:04 -0500, skenny at uchicago.edu wrote: >> ok, now i get similar output to the others for restart3, not >> sure what happened there; but here's the whole output: >> >> [skenny at andrew misc]$ ./restart3.sh >> Swift svn swift-r2185 cog-r2128 >> >> RunID: 20080818-1703-z57hx8j2 >> Progress: >> helperA started >> Sorted: [localhost:0.000(1.000):0/1 overload: 0] >> helperA completed >> helperB started >> Sorted: [localhost:1.303(2.111):0/1 overload: 0] >> Sorted: [localhost:1.595(2.473):0/1 overload: 0] >> Sorted: [localhost:1.888(2.882):0/1 overload: 0] >> helperB failed >> Execution failed: >> Exception in helperB: >> Arguments: [restart-2.out] >> Host: localhost >> Directory: restart3-20080818-1703-z57hx8j2/jobs/q/helperB-quat76yi >> stderr.txt: >> stdout.txt: >> ---- >> >> Caused by: >> Exit code 1 >> Swift svn swift-r2185 cog-r2128 >> >> RunID: 20080818-1703-bhfsb6be >> Progress: >> helperB started >> Sorted: [localhost:0.000(1.000):0/1 overload: 0] >> helperB completed >> helperC started >> Sorted: [localhost:1.303(2.111):0/1 overload: 0] >> helperC completed >> Final status: Initializing:1 Finished successfully:2 >> success >> > > From hategan at mcs.anl.gov Mon Aug 18 17:56:12 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 18 Aug 2008 17:56:12 -0500 Subject: [Swift-devel] not able to resume In-Reply-To: <20080818174723.BJF79002@m4500-02.uchicago.edu> References: <20080818174723.BJF79002@m4500-02.uchicago.edu> Message-ID: <1219100172.26871.0.camel@localhost> On Mon, 2008-08-18 at 17:47 -0500, skenny at uchicago.edu wrote: > hmm, looks like the rlog got deleted...for future reference > though, do you know how i might be able to tell that from the > rlog? If it's empty, it means it hasn't recorded anything. 
> > ---- Original message ---- > >Date: Mon, 18 Aug 2008 17:09:15 -0500 > >From: Mihael Hategan > >Subject: Re: [Swift-devel] not able to resume > >To: skenny at uchicago.edu > >Cc: Ben Clifford , > swift-devel at ci.uchicago.edu > > > >Seems to be working fine. > > > >Perhaps you are running, in your failed restarts, into the > "staging out > >happens late" issue. Can you send a sample rlog? > > > >On Mon, 2008-08-18 at 17:04 -0500, skenny at uchicago.edu wrote: > >> ok, now i get similar output to the others for restart3, not > >> sure what happened there; but here's the whole output: > >> > >> [skenny at andrew misc]$ ./restart3.sh > >> Swift svn swift-r2185 cog-r2128 > >> > >> RunID: 20080818-1703-z57hx8j2 > >> Progress: > >> helperA started > >> Sorted: [localhost:0.000(1.000):0/1 overload: 0] > >> helperA completed > >> helperB started > >> Sorted: [localhost:1.303(2.111):0/1 overload: 0] > >> Sorted: [localhost:1.595(2.473):0/1 overload: 0] > >> Sorted: [localhost:1.888(2.882):0/1 overload: 0] > >> helperB failed > >> Execution failed: > >> Exception in helperB: > >> Arguments: [restart-2.out] > >> Host: localhost > >> Directory: > restart3-20080818-1703-z57hx8j2/jobs/q/helperB-quat76yi > >> stderr.txt: > >> stdout.txt: > >> ---- > >> > >> Caused by: > >> Exit code 1 > >> Swift svn swift-r2185 cog-r2128 > >> > >> RunID: 20080818-1703-bhfsb6be > >> Progress: > >> helperB started > >> Sorted: [localhost:0.000(1.000):0/1 overload: 0] > >> helperB completed > >> helperC started > >> Sorted: [localhost:1.303(2.111):0/1 overload: 0] > >> helperC completed > >> Final status: Initializing:1 Finished successfully:2 > >> success > >> > > > > From benc at hawaga.org.uk Sun Aug 24 08:11:50 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Sun, 24 Aug 2008 13:11:50 +0000 (GMT) Subject: [Swift-devel] not able to resume In-Reply-To: <1219097355.25406.1.camel@localhost> References: <20080818170454.BJF76049@m4500-02.uchicago.edu> <1219097355.25406.1.camel@localhost> 
Message-ID: On Mon, 18 Aug 2008, Mihael Hategan wrote: > Perhaps you are running, in your failed restarts, into the "staging out > happens late" issue. It should be the case, I think, that if the on-screen progress ticker line says a job is completed then it will be logged for restart; and if it is not reported as completed there then it won't be logged for restart (with a sub-second margin of error). That change is likely to not correspond closely in time to jobs completing in the queue on your execution site. -- From benc at hawaga.org.uk Mon Aug 25 03:51:31 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 25 Aug 2008 08:51:31 +0000 (GMT) Subject: [Swift-devel] Swift 0.6 released Message-ID: Swift 0.6 is online for download at http://www.ci.uchicago.edu/swift/downloads/ In addition to a bunch of bugfixes, the most interesting changes are: * much more rigorous compile time type checking - this catches many more errors at the start rather than hours into a run, and gives more useful error reports. * better multisite handling: + job replication - when a job has been queued for much longer than average, Swift can launch a replica of the job on another site. This helps when making multisite runs where one site has a much longer queue time than another. + rate limiting for bad sites - poorly scored sites are now rate limited much more than in previous versions of Swift, with very poorly scored sites being delayed between executions. * cog coasters - this is a new execution provider that allows a single 'coaster' job to be submitted per worker node which pulls in Swift jobs. This can greatly reduce the number of jobs submitted to the underlying job submission mechanism (such as GRAM2) allowing more jobs to be submitted; it also can reduce the amount of time jobs spend in the LRM queue by sending them directly to an already-executing coaster.
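The job replication item in the list above can be illustrated with a small sketch of the decision "queued for much longer than average". The announcement does not say how Swift defines "much longer", so the `factor` and `min_samples` parameters below, like the function name itself, are assumptions made only for this example.

```python
from statistics import mean


def should_replicate(queued_for_s, past_queue_times_s,
                     factor=3.0, min_samples=5):
    """Decide whether a still-queued job looks slow enough to replicate.

    queued_for_s: how long this job has been waiting in the queue so far.
    past_queue_times_s: queue times of jobs that have already started.
    factor, min_samples: hypothetical tuning knobs, not Swift settings.
    """
    if len(past_queue_times_s) < min_samples:
        return False              # too little history to define "average"
    return queued_for_s > factor * mean(past_queue_times_s)
```

A check like this only applies while a job is still queued; it says nothing about a job that has already gone Active.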
-- From benc at hawaga.org.uk Mon Aug 25 06:06:44 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 25 Aug 2008 11:06:44 +0000 (GMT) Subject: [Swift-devel] coaster log location Message-ID: coaster log location of ~ is displeasing to me (especially on machines whose primary purpose isn't developing grid stuff). Obvious other choices would be pwd or ~/.globus/coasters Does anyone have a particular opinion? -- From hategan at mcs.anl.gov Mon Aug 25 10:21:28 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 25 Aug 2008 10:21:28 -0500 Subject: [Swift-devel] coaster log location In-Reply-To: References: Message-ID: <1219677688.27441.4.camel@localhost> On Mon, 2008-08-25 at 11:06 +0000, Ben Clifford wrote: > coaster log location of ~ is displeasing to me (especially on machines > whose primary purpose isn't developing grid stuff). > > Obvious other choices would be pwd or ~/.globus/coasters > > Does anyone have a particular opinion? The prototype has a few rough corners. On the other hand gram also puts the log of funny jobs in ~/. But ~/.globus/coasters sounds more reasonable. From nikolicmilena at gmail.com Tue Aug 26 07:23:51 2008 From: nikolicmilena at gmail.com (Milena Nikolic) Date: Tue, 26 Aug 2008 14:23:51 +0200 Subject: [Swift-devel] GSoC: Type checking and Type inference in SwiftScript Message-ID: <123bf0400808260523k49369428n8efa8e30278d29@mail.gmail.com> Hi All, GSoC program is coming to an end, and I would like to share my non-committed work with you. For those who don't know, type checking is released with Swift 0.6, and I am waiting to hear about bugs now (I'll be here to fix them of course). The other part of my project was type inference, and that work isn't committed. Progress is described in WhatToDoNext.txt (attached). The diff file containing type inference work is also attached. Any comments about it are welcome. 
Cheers, Milena ---------- Forwarded message ---------- From: Milena Nikolic Date: Sat, Aug 16, 2008 at 12:08 AM Subject: Final work for GSoC To: Ben Clifford Hi Ben, This is my not-committed work. Unfortunately I didn't finish type checking of mappers, and I am not sending you that work at all because it isn't working at the moment. Actually there is not too much work left about it, but there is a lot of talking. So when you get back from holiday, we can discuss it and I might finish it apart from GSoC program. All my work considering type inference is in swiftInference.diff. It is upon latest svn version and it passes tests. Status of it is briefly explained in WhatToDoNext.txt, and I can write some more detailed document about what I've done, if you think anyone will ever want to read something like that. Should I send WhatToDoNext.txt file to the group? Or something similar? Thanks, Milena -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: swiftInference.diff Type: text/x-patch Size: 19230 bytes Desc: not available URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: WhatToDoNext.txt URL: From benc at hawaga.org.uk Tue Aug 26 15:29:45 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 26 Aug 2008 20:29:45 +0000 (GMT) Subject: [Swift-devel] Re: Fwd: Re: Is it easy to get average wait time for all jobs in a workflow? In-Reply-To: <20080818220911.BDI04681@m4500-03.uchicago.edu> References: <20080818220911.BDI04681@m4500-03.uchicago.edu> Message-ID: [added swift-devel] On Mon, 18 Aug 2008, lixi at uchicago.edu wrote: > >then the first execute2 task id for this job is 0-1-1-..., > >the second replication job id will be 0-1-2-..., these two > >tasks have the same replicaiton group. 
But if this job > >failed, then the second task id for this failed job will be > >also 0-1-1-..., but it will have a different replication > >group. So it makes hard to make all these stuff clearly > >distinguished by writing a script. [...] > it possible to add some information into log file? Swift r2199 adds a log line that looks like this: 2008-08-26 21:17:28,359+0100 INFO Execute jobid=echo-kqc79jyi task=Task(type=JOB_SUBMISSION, identity=urn:0-1-1219781848210) This binds together execute2 IDs and karajan execution IDs, which is a binding that has been missing from the logs in the past. That should allow binding of replication group IDs to several karajan level task IDs. -- From benc at hawaga.org.uk Wed Aug 27 04:22:36 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 27 Aug 2008 09:22:36 +0000 (GMT) Subject: [Swift-devel] Re: Fwd: Re: Is it easy to get average wait time for all jobs in a workflow? In-Reply-To: References: <20080818220911.BDI04681@m4500-03.uchicago.edu> Message-ID: On Tue, 26 Aug 2008, Ben Clifford wrote: > Swift r2199 adds a log line that looks like this: and r2204 makes the log-processing code have a make target called karatasks.JOB_SUBMISSION.annotated-execute2.transitions (mmm long file names). This is karajan task status for JOB_SUBMISSION tasks, with columns 5 and 6 being the execute2 and replication IDs. From that you should be able to work out for each replication ID when the first submission happens and when the first Active state happens. -- From zhaozhang at uchicago.edu Wed Aug 27 15:18:04 2008 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Wed, 27 Aug 2008 15:18:04 -0500 Subject: [Swift-devel] swift out of memory Message-ID: <48B5B67C.4030603@uchicago.edu> Hi, I was trying to run 32768 tasks on BGP, swift failed to start and reported the following message. Any ideas? Thanks zhao JVMDG217: Dump Handler is Processing OutOfMemory - Please Wait.
JVMDG315: JVM Requesting Heap dump file .................................................JVMDG318: Heap dump file written to /gpfs/home/zzhang/swift/etc/heapdump.20080827.142903.2957.phd JVMDG303: JVM Requesting Java core file JVMDG304: Java core file written to /gpfs/home/zzhang/swift/etc/javacore.20080827.142924.2957.txt JVMDG274: Dump Handler has Processed OutOfMemory. JVMST109: Insufficient space in Javaheap to satisfy allocation request From hategan at mcs.anl.gov Wed Aug 27 15:33:25 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 27 Aug 2008 15:33:25 -0500 Subject: [Swift-devel] swift out of memory In-Reply-To: <48B5B67C.4030603@uchicago.edu> References: <48B5B67C.4030603@uchicago.edu> Message-ID: <1219869205.13808.19.camel@localhost> How much memory are you running the jvm with? On Wed, 2008-08-27 at 15:18 -0500, Zhao Zhang wrote: > Hi, I was trying to run 32768 tasks on BGP, swift failed to start and > reported the following message. > Any ideas? Thanks > > zhao > > JVMDG217: Dump Handler is Processing OutOfMemory - Please Wait. > JVMDG315: JVM Requesting Heap dump file > .................................................JVMDG318: Heap dump > file written to > /gpfs/home/zzhang/swift/etc/heapdump.20080827.142903.2957.phd > JVMDG303: JVM Requesting Java core file > JVMDG304: Java core file written to > /gpfs/home/zzhang/swift/etc/javacore.20080827.142924.2957.txt > JVMDG274: Dump Handler has Processed OutOfMemory. 
> JVMST109: Insufficient space in Javaheap to satisfy allocation request > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From iraicu at cs.uchicago.edu Wed Aug 27 15:47:09 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Wed, 27 Aug 2008 15:47:09 -0500 Subject: [Swift-devel] swift out of memory In-Reply-To: <1219869205.13808.19.camel@localhost> References: <48B5B67C.4030603@uchicago.edu> <1219869205.13808.19.camel@localhost> Message-ID: <48B5BD4D.8030805@cs.uchicago.edu> I don't think Zhao knows where this is set in Swift. Where could he look this up, other than using "top" during a run? Ioan Mihael Hategan wrote: > How much memory are you running the jvm with? > > On Wed, 2008-08-27 at 15:18 -0500, Zhao Zhang wrote: > >> Hi, I was trying to run 32768 tasks on BGP, swift failed to start and >> reported the following message. >> Any ideas? Thanks >> >> zhao >> >> JVMDG217: Dump Handler is Processing OutOfMemory - Please Wait. >> JVMDG315: JVM Requesting Heap dump file >> .................................................JVMDG318: Heap dump >> file written to >> /gpfs/home/zzhang/swift/etc/heapdump.20080827.142903.2957.phd >> JVMDG303: JVM Requesting Java core file >> JVMDG304: Java core file written to >> /gpfs/home/zzhang/swift/etc/javacore.20080827.142924.2957.txt >> JVMDG274: Dump Handler has Processed OutOfMemory. >> JVMST109: Insufficient space in Javaheap to satisfy allocation request >> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > -- =================================================== Ioan Raicu Ph.D. 
Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== From zhaozhang at uchicago.edu Wed Aug 27 15:48:16 2008 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Wed, 27 Aug 2008 15:48:16 -0500 Subject: [Swift-devel] swift out of memory In-Reply-To: <48B5BD4D.8030805@cs.uchicago.edu> References: <48B5B67C.4030603@uchicago.edu> <1219869205.13808.19.camel@localhost> <48B5BD4D.8030805@cs.uchicago.edu> Message-ID: <48B5BD90.4090006@uchicago.edu> Mihael told me to set options in the swift command with -Xmx1024m; I am testing it now. zhao Ioan Raicu wrote: > I don't think Zhao knows where this is set in Swift. Where could he > look this up, other than using "top" during a run? > > Ioan > > Mihael Hategan wrote: >> How much memory are you running the jvm with? >> >> On Wed, 2008-08-27 at 15:18 -0500, Zhao Zhang wrote: >> >>> Hi, I was trying to run 32768 tasks on BGP, swift failed to start and >>> reported the following message. >>> Any ideas? Thanks >>> >>> zhao >>> >>> JVMDG217: Dump Handler is Processing OutOfMemory - Please Wait. >>> JVMDG315: JVM Requesting Heap dump file >>> .................................................JVMDG318: Heap dump >>> file written to >>> /gpfs/home/zzhang/swift/etc/heapdump.20080827.142903.2957.phd >>> JVMDG303: JVM Requesting Java core file >>> JVMDG304: Java core file written to >>> /gpfs/home/zzhang/swift/etc/javacore.20080827.142924.2957.txt >>> JVMDG274: Dump Handler has Processed OutOfMemory.
>>> JVMST109: Insufficient space in Javaheap to satisfy allocation request >>> >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>> >> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> >> > > -- > =================================================== > Ioan Raicu > Ph.D. Candidate > =================================================== > Distributed Systems Laboratory > Computer Science Department > University of Chicago > 1100 E. 58th Street, Ryerson Hall > Chicago, IL 60637 > =================================================== > Email: iraicu at cs.uchicago.edu > Web: http://www.cs.uchicago.edu/~iraicu > http://dev.globus.org/wiki/Incubator/Falkon > http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page > =================================================== > =================================================== > > From iraicu at cs.uchicago.edu Wed Aug 27 15:53:54 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Wed, 27 Aug 2008 15:53:54 -0500 Subject: [Swift-devel] swift out of memory In-Reply-To: <48B5BD90.4090006@uchicago.edu> References: <48B5B67C.4030603@uchicago.edu> <1219869205.13808.19.camel@localhost> <48B5BD4D.8030805@cs.uchicago.edu> <48B5BD90.4090006@uchicago.edu> Message-ID: <48B5BEE2.4080404@cs.uchicago.edu> I routinely use -Xms1536m -Xmx1536m for running Falkon (have the min and max heap size the same, to avoid having to resize the heap, which ultimately improves performance during these periods). Those nodes on the BG/P have 4GB of memory, so once you find the largest workflows you can run with 1GB (or 1.5GB), it would be good to push the heap size up as close as you can to the 4GB limit (probably somewhere between 3GB and 4GB). 
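The sizing advice above can be sketched as a small helper. This is only a minimal illustration and is not part of Swift or Falkon: the function name heap_flags and the headroom_mb parameter are hypothetical, and the 4096 MB figure is the BG/P node memory mentioned above. It builds the standard JVM -Xms/-Xmx flags with equal minimum and maximum sizes; how those flags reach the JVM depends on the launcher script and is not shown.

```python
def heap_flags(node_mem_mb, headroom_mb=512):
    """Return JVM heap flags with equal min and max size.

    node_mem_mb: physical memory on the node (assumed known by the caller).
    headroom_mb: hypothetical reserve left for the OS and JVM non-heap use.
    """
    heap_mb = node_mem_mb - headroom_mb
    if heap_mb <= 0:
        raise ValueError("headroom must be smaller than node memory")
    # Equal -Xms and -Xmx means the JVM never has to resize the heap.
    return "-Xms{0}m -Xmx{0}m".format(heap_mb)


if __name__ == "__main__":
    # A 4 GB node with 512 MB reserved leaves a 3584 MB heap.
    print(heap_flags(4096))
```

With node_mem_mb=2048 and the default headroom, the helper yields the same -Xms1536m -Xmx1536m pair quoted above; the right headroom for a given machine would need to be found by experiment.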
Ioan Zhao Zhang wrote: > Mihael told me set options in swift command with -Xmx1024m , I am > testing it now. > > zhao > > Ioan Raicu wrote: >> I don't think Zhao knows where this is set in Swift. Where could he >> look this up, other than using "top" during a run? >> >> Ioan >> >> Mihael Hategan wrote: >>> How much memory are you running the jvm with? >>> >>> On Wed, 2008-08-27 at 15:18 -0500, Zhao Zhang wrote: >>> >>>> Hi, I was trying to run 32768 tasks on BGP, swift failed to start >>>> and reported the following message. >>>> Any ideas? Thanks >>>> >>>> zhao >>>> >>>> JVMDG217: Dump Handler is Processing OutOfMemory - Please Wait. >>>> JVMDG315: JVM Requesting Heap dump file >>>> .................................................JVMDG318: Heap >>>> dump file written to >>>> /gpfs/home/zzhang/swift/etc/heapdump.20080827.142903.2957.phd >>>> JVMDG303: JVM Requesting Java core file >>>> JVMDG304: Java core file written to >>>> /gpfs/home/zzhang/swift/etc/javacore.20080827.142924.2957.txt >>>> JVMDG274: Dump Handler has Processed OutOfMemory. >>>> JVMST109: Insufficient space in Javaheap to satisfy allocation request >>>> >>>> _______________________________________________ >>>> Swift-devel mailing list >>>> Swift-devel at ci.uchicago.edu >>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>>> >>> >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>> >>> >> >> -- >> =================================================== >> Ioan Raicu >> Ph.D. Candidate >> =================================================== >> Distributed Systems Laboratory >> Computer Science Department >> University of Chicago >> 1100 E. 
58th Street, Ryerson Hall >> Chicago, IL 60637 >> =================================================== >> Email: iraicu at cs.uchicago.edu >> Web: http://www.cs.uchicago.edu/~iraicu >> http://dev.globus.org/wiki/Incubator/Falkon >> http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page >> =================================================== >> =================================================== >> >> > -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== From mikekubal at yahoo.com Thu Aug 28 15:29:16 2008 From: mikekubal at yahoo.com (Mike Kubal) Date: Thu, 28 Aug 2008 13:29:16 -0700 (PDT) Subject: [Swift-devel] error communicating with Abe's GridFTP server Message-ID: <951213.43274.qm@web52305.mail.re2.yahoo.com> I get the following message when trying to run a swift job from the machines at the CI (communicado and bridled) on NCSA's Abe: Execution failed: Could not initialize shared directory on abe Caused by: org.globus.cog.abstraction.impl.file.FileResourceException: Error communicating with the GridFTP server Caused by: Server refused performing the request. Custom message: Bad password. (error code 1) [Nested exception message: Custom message: Unexpected reply: 530-Login incorrect. : IPC connection failed. Are there any known webservice or other issues with Abe's GridFTP server? 
The gatekeeper appears to be up. When I run: globus-job-run grid-abe.ncsa.teragrid.org /bin/hostname It returns: abe1196.ncsa.uiuc.edu If I run: globus-job-run gsiftp://grid-ftp.abe.ncsa.teragrid.org hostname I get the following: GRAM Job submission failed because the connection to the server failed (check host and port) (error code 12) In the past I have received similar error messages when my certificate at the CI had not been updated, but the problem persists after an update. I get the same message running from communicado and bridled at the CI. Suggestions? Mike Kubal From help at teragrid.org Thu Aug 28 17:03:43 2008 From: help at teragrid.org (help at teragrid.org) Date: Thu, 28 Aug 2008 17:03:43 -0500 Subject: [Swift-devel] [Fwd: Re: error communicating with Abe's GridFTP server ] Message-ID: <200808282203.m7SM3h5f005288@rimantadine.ncsa.uiuc.edu> FROM: Jackson, Weddie (Concerning ticket No. 161020) ============================== Sorry, forgot to CC those on the cc list. __________Original Message__________ Date: Aug 28 2008 5:01PM From: help at teragrid.org To: mikekubal at yahoo.com Subj: Re: error communicating with Abe's GridFTP server FROM: Jackson, Weddie (Concerning ticket No. 161020) ============================== Hello Mike, We also got errors when just trying to open a simple connection with uberftp. We will ask our Grid Services Admin to take a look and notify you when we have news. Thanks, -Weddie ------------------------ Weddie Jackson NCSA Consulting Services ------------------------ Mike Kubal writes: >I get the following message when trying to run a swift job from the machines at the CI (communicado and bridled) on NCSA's Abe: > >Execution failed: > Could not initialize shared directory on abe >Caused by: > org.globus.cog.abstraction.impl.file.FileResourceException: Error communicating with the GridFTP server >Caused by: > Server refused performing the request. Custom message: Bad password.
(error code 1) [Nested exception message: Custom message: Unexpected reply: 530-Login incorrect. : IPC connection failed. > >Are there any known webservice or other issues with Abe's GridFTP server? > >The gatekeeper appears to be up, when I run: >globus-job-run grid-abe.ncsa.teragrid.org /bin/hostname > >It returns: >abe1196.ncsa.uiuc.edu > >If I run: > >globus-job-run gsiftp://grid-ftp.abe.ncsa.teragrid.org hostname > >I get the following: > >GRAM Job submission failed because the connection to the server failed (check host and port) (error code 12) > >In the past I have received similar error messages when my certificate at the CI had not been updated, but the problem persists after an update. >I get the same message running from communicado and bridled at the CI. > >Suggestions? > >Mike Kubal From bugzilla-daemon at mcs.anl.gov Fri Aug 29 02:11:49 2008 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Fri, 29 Aug 2008 02:11:49 -0500 (CDT) Subject: [Swift-devel] [Bug 154] New: iterate construct causes overserialisation of execution Message-ID: http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=154 Summary: iterate construct causes overserialisation of execution Product: Swift Version: unspecified Platform: All OS/Version: All Status: NEW Severity: normal Priority: P2 Component: General AssignedTo: benc at hawaga.org.uk ReportedBy: benc at hawaga.org.uk CC: swift-devel at ci.uchicago.edu Iterate loops will always run in strict sequence, even if there is no data dependency between iterations. This violates the general principle that execution should happen as much in parallel as possible, with data dependencies being the only deciding factor in execution. For example, judging by data dependencies alone, the sleep statements in the following program should execute in parallel. They do not.
s(int delay) { app { sleep delay; } } iterate i { trace(i); s(5); } until(i>5); -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is. From benc at hawaga.org.uk Fri Aug 29 09:09:44 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 29 Aug 2008 14:09:44 +0000 (GMT) Subject: [Swift-devel] coasters on nmi build test Message-ID: I just got the coaster tests running more on nmi build/test. Most platforms fail; different platforms exhibit different errors. http://nmi-s005.cs.wisc.edu:80/nmi/index.php?page=results%2FrunDetails&opt_user=benc&runid=102190&rows=100 -- From hategan at mcs.anl.gov Fri Aug 29 09:27:14 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 29 Aug 2008 09:27:14 -0500 Subject: [Swift-devel] Re: coasters on nmi build test In-Reply-To: References: Message-ID: <1220020034.9151.0.camel@localhost> On Fri, 2008-08-29 at 14:09 +0000, Ben Clifford wrote: > I just got the coaster tests running more on nmi build/test. > > Most platforms fail; different platforms exhibit different errors. Any chance we can look at the coaster service logs? From benc at hawaga.org.uk Fri Aug 29 09:26:24 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 29 Aug 2008 14:26:24 +0000 (GMT) Subject: [Swift-devel] Re: coasters on nmi build test In-Reply-To: <1220020034.9151.0.camel@localhost> References: <1220020034.9151.0.camel@localhost> Message-ID: On Fri, 29 Aug 2008, Mihael Hategan wrote: > On Fri, 2008-08-29 at 14:09 +0000, Ben Clifford wrote: > > I just got the coaster tests running more on nmi build/test. > > > > Most platforms fail; different platforms exhibit different errors. > > Any chance we can look at the coaster service logs? Yeah its possible somehow. I need to figure out how to do it though. 
-- From help at teragrid.org Fri Aug 29 11:23:16 2008 From: help at teragrid.org (help at teragrid.org) Date: Fri, 29 Aug 2008 11:23:16 -0500 Subject: [Swift-devel] [Fwd: Re: error communicating with Abe's GridFTP server ] Message-ID: <200808291623.m7TGNGA9020803@rimantadine.ncsa.uiuc.edu> FROM: Jackson, Weddie (Concerning ticket No. 161020) ============================== Hello Mike, Our Grid Services Admins have stated that they believe they have resolved the issue. Can you try your job again (if you have not already done so) and let us know if you are still seeing an issue? (If you are still seeing an issue, it may be helpful to the Admins to know exactly how your job is trying to communicate with gridftp-abe.) Thanks, -Weddie ------------------------ Weddie Jackson NCSA Consulting Services ------------------------ __________Original Message__________ Date: Aug 28 2008 5:01PM From: help at teragrid.org To: mikekubal at yahoo.com Subj: Re: error communicating with Abe's GridFTP server FROM: Jackson, Weddie (Concerning ticket No. 161020) ============================== Hello Mike, We also got errors when just trying to open a simple connection with uberftp. We will ask our Grid Services Admin to take a look and notify you when we have news. Thanks, -Weddie ------------------------ Weddie Jackson NCSA Consulting Services ------------------------ Mike Kubal writes: >I get the following message when trying to run a swift job from the machines at the CI (communicado and bridled) on NCSA's Abe: > >Execution failed: > Could not initialize shared directory on abe >Caused by: > org.globus.cog.abstraction.impl.file.FileResourceException: Error communicating with the GridFTP server >Caused by: > Server refused performing the request. Custom message: Bad password. (error code 1) [Nested exception message: Custom message: Unexpected reply: 530-Login incorrect. : IPC connection failed. > >Are there any known webservice or other issues with Abe's GridFTP server?
> >The gatekeeper appears to be up, when I run: >globus-job-run grid-abe.ncsa.teragrid.org /bin/hostname > >It returns: >abe1196.ncsa.uiuc.edu > >If I run: > >globus-job-run gsiftp://grid-ftp.abe.ncsa.teragrid.org hostname > >I get the following: > >GRAM Job submission failed because the connection to the server failed (check host and port) (error code 12) > >In the past I have received similar error messages when my certificate at the CI had not been updated, but the problem persists after an update. >I get the same message running from communicado and bridled at the CI. > >Suggestions? > >Mike Kubal From hategan at mcs.anl.gov Fri Aug 29 11:34:02 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 29 Aug 2008 11:34:02 -0500 Subject: [Swift-devel] Re: coasters on nmi build test In-Reply-To: References: Message-ID: <1220027642.11456.0.camel@localhost> On Fri, 2008-08-29 at 14:09 +0000, Ben Clifford wrote: > I just got the coaster tests running more on nmi build/test. > > Most platforms fail; different platforms exhibit different errors. > > http://nmi-s005.cs.wisc.edu:80/nmi/index.php?page=results%2FrunDetails&opt_user=benc&runid=102190&rows=100 > Also, I'm having a bit of trouble making out what's what on that page. Is there any way things can be labeled in a more descriptive fashion? From hategan at mcs.anl.gov Sun Aug 31 13:35:22 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 31 Aug 2008 13:35:22 -0500 Subject: [Swift-devel] class (as in programming) upgrades Message-ID: <1220207722.7918.4.camel@localhost> Fresh from LtU: http://lambda-the-ultimate.org/node/2960 Seems very close to my understanding of the intent of versions in VDL/Swift, but nicely formalized. 
From benc at hawaga.org.uk Sun Aug 31 14:00:30 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Sun, 31 Aug 2008 19:00:30 +0000 (GMT) Subject: [Swift-devel] class (as in programming) upgrades In-Reply-To: <1220207722.7918.4.camel@localhost> References: <1220207722.7918.4.camel@localhost> Message-ID: On Sun, 31 Aug 2008, Mihael Hategan wrote: > Fresh from LtU: http://lambda-the-ultimate.org/node/2960 > > Seems very close to my understanding of the intent of versions in > VDL/Swift, but nicely formalized. I read that earlier today. They don't seem to have much in the way of implementation (and thus in experience with using in practice). SwiftScript programs don't seem to be getting much in the way of complexity expressionwise which would be something that might cause versioning and namespaces to take a higher priority; and I don't think their linear version model really fits in with the actual diversity of applications that appear in real life (which is also what I think about using linear versions in VDL). -- From hategan at mcs.anl.gov Sun Aug 31 14:12:36 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 31 Aug 2008 14:12:36 -0500 Subject: [Swift-devel] class (as in programming) upgrades In-Reply-To: References: <1220207722.7918.4.camel@localhost> Message-ID: <1220209956.19236.6.camel@localhost> On Sun, 2008-08-31 at 19:00 +0000, Ben Clifford wrote: > On Sun, 31 Aug 2008, Mihael Hategan wrote: > > > Fresh from LtU: http://lambda-the-ultimate.org/node/2960 > > > > Seems very close to my understanding of the intent of versions in > > VDL/Swift, but nicely formalized. > > I read that earlier today. They don't seem to have much in the way of > implementation (and thus in experience with using in practice). Which in itself makes no statement about the solution being good or bad. But to me it looks decent. 
> > SwiftScript programs don't seem to be getting much in the way of > complexity expressionwise which would be something that might cause > versioning and namespaces to take a higher priority; Except when somebody tries to use it in a production environment (i2u2), with data and analyses that span a few years. So I think this is one of those things where if some project using Swift realizes it needs it, it's probably too late. > and I don't think > their linear version model really fits in with the actual diversity of > applications that appear in real life (which is also what I think about > using linear versions in VDL). > It's a reasonable model that I think would work fairly well in many cases.