From aespinosa at cs.uchicago.edu Wed Aug 5 17:36:11 2009
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Wed, 5 Aug 2009 17:36:11 -0500
Subject: [Swift-user] swiftscript tricks on bypassing stage-ins (was Re: Helping Yi with Swift)
Message-ID: <50b07b4b0908051536r3cf82090m36223d8bc3a3394@mail.gmail.com>

Instead of

type file;

file inputA <"061-cattwo.1.in">;
file inputB <"061-cattwo.2.in">;
file output <"061-cattwo.out">;

(file t) cat(file m, file n) {
    app {
        cat @filename(m) @filename(n) stdout=@filename(t);
    }
}

output = cat(inputA, inputB);

where the files inputA and inputB get staged in, you can change the app function to accept string parameters, and no data transfer will occur at all (except for "file t"):

type file;

file inputA <"/home/USER/workflows/nostaging/061-cattwo.1.in">;
file inputB <"/home/USER/workflows/nostaging/061-cattwo.2.in">;
file output <"061-cattwo.out">;

(file t) cat(string m, string n) {
    app {
        cat m n stdout=@filename(t);
    }
}

output = cat(@strcat("/", @filename(inputA)), @strcat("/", @filename(inputB)));

This quick hack requires absolute path names, though.

2009/8/5 Michael Wilde :
> Yi,
>
> You should always send your swift questions to swift-user.
>
> Allan would be your prime helper for your initial questions; I am sure
> Mihael will try to help when he can, but he's got other prime
> responsibilities at the moment.
>
> Allan is away at workshops this week and next but will be reading mail.
>
> I don't expect to be in mail much on vacation, but if I see a question I'll
> do my best.
>
> You're somewhat on your own for the next week, OK?
>
> - Mike
>
> On 8/5/09 3:34 PM, Yi Zhu wrote:
>>
>> Hi, Michael
>>
>> About section 2 in my last message, could you suggest someone who can
>> provide me support for Swift during your vacation? I may need to make some
>> modifications to the Swift source code, so I may need to understand the
>> structure of the Swift system and get technical support.
>>
>> Many Thanks.
>> >> -Yi >> >> Yi Zhu wrote: >>> >>> Hi Ian, >>> >>> I think I've found the performance issues in my last experiment, >>> generally, it because of Long Tail Effect, since the Total Execution Time is >>> calculated by the first job has been submitted until the last job finished, >>> when the execution approach to the end, there are some nodes is idling >>> because there are not enough job in the queue to run. When the rate of >>> (number of nodes/total number of jobs) is high, this problem effect more. >>> Therefore, In our first experiment, there are 100 jobs to run and the long >>> tail problem effect a lot at the "100 nodes" test, so that the performance >>> is not good as we expected. >>> >>> I've put the details on the wiki site: >>> http://dsl-wiki.cs.uchicago.edu/index.php/Performance_Comparison:Remote_Usage%2C_NFS%2C_S3-fuse%2C_EBS >>> (see the bottom) >>> >>> 2. >>> >>> In a traditional swift usage, data need to be transfer to remotely site >>> before run ( trade-in), and transfer back after finished (trade-out). the >>> remote side ?does not do a directly access to users's computer because they >>> may not have reliable network or ?there is ?potential delay&jitter during >>> network transmission. ?So, use the same traditional way when data is stored >>> on S3 may not be the optimum solution. Since there is a reliable connection >>> between S3 and EC2, we could let working node directly access the data on S3 >>> bucket rather than trade-in before execution and trade-out after >>> execution_done. >>> >>> Since this includes modify the source code of swift, Mike, can we arrange >>> a ?discuss about that on Tomorrow? >>> -- Allan M. Espinosa PhD student, Computer Science University of Chicago From yizhu at cs.uchicago.edu Thu Aug 6 15:15:23 2009 From: yizhu at cs.uchicago.edu (Yi Zhu) Date: Thu, 06 Aug 2009 15:15:23 -0500 Subject: [Swift-user] How to the maximum number of concurrent jobs allowed on a site to a fixed size? 
Message-ID: <4A7B39DB.3030602@cs.uchicago.edu>

Hi, all

As we already know, Swift dynamically changes the maximum number of concurrent jobs allowed on a site based on the performance history of that site. According to the Swift documentation: Each site is assigned a score (initially 1), which can increase or decrease based on whether the site yields successful or faulty job runs. The score for a site can take values in the (0.1, 100) interval. The number of allowed jobs is calculated using the following formula:

2 + score*throttle.score.job.factor

We can change throttle.score.job.factor in the sites.xml or swift.properties files, but since the "score" value can increase or decrease during execution, it seems that we cannot really set the maximum number of concurrent jobs allowed on a site to a fixed number. Does anyone have any idea about that?

Many Thanks.

-Yi Zhu

From aespinosa at cs.uchicago.edu Thu Aug 6 15:30:24 2009
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Thu, 6 Aug 2009 15:30:24 -0500
Subject: [Swift-user] Re: [Swift-devel] How to the maximum number of concurrent jobs allowed on a site to a fixed size?
In-Reply-To: <4A7B39DB.3030602@cs.uchicago.edu>
References: <4A7B39DB.3030602@cs.uchicago.edu>
Message-ID: <50b07b4b0908061330s32ccf081rf304fef411cecb4f@mail.gmail.com>

hi yi,

in the swift.properties file you change

foreach.max.threads=1024

to

foreach.max.threads=N

where N is the max number of concurrent jobs you want per Swift session. Also, when you set the score to be ridiculously high (i.e. 10000) you always get the maximum theoretical number of jobs you want based on the throttling parameters.

-Allan

2009/8/6 Yi Zhu :
> Hi, all
>
> As we already know, Swift dynamically changes the maximum number of
> concurrent jobs allowed on a site based on the performance history of that
> site. According to the Swift documentation: Each site is assigned a score (initially
> 1), which can increase or decrease based on whether the site yields
> successful or faulty job runs.
The score for a site can take values in the > (0.1, 100) interval. The number of allowed jobs is calculated using the > following formula: > > 2 + score*throttle.score.job.factor > > We can change the throttle.score.job.factor in sites.xml or swift.properties > files, but since the "score" value can be increased/decreased during the > execution, It seems that we can not really set the maximum ?number of > concurrent jobs allowed on a site to a fixed number. Anyone have any idea of > that? > > > Many Thanks. > From yizhu at cs.uchicago.edu Thu Aug 6 16:19:53 2009 From: yizhu at cs.uchicago.edu (Yi Zhu) Date: Thu, 06 Aug 2009 16:19:53 -0500 Subject: [Swift-user] Re: [Swift-devel] How to the maximum number of concurrent jobs allowed on a site to a fixed size? In-Reply-To: <50b07b4b0908061330s32ccf081rf304fef411cecb4f@mail.gmail.com> References: <4A7B39DB.3030602@cs.uchicago.edu> <50b07b4b0908061330s32ccf081rf304fef411cecb4f@mail.gmail.com> Message-ID: <4A7B48F9.9020604@cs.uchicago.edu> Hi, Allan Thanks for you reply. I've tried the way you suggested, but it doesn't work as I expected. Suppose I have a PBS cluster with 10 worker nodes,(btw. I don't have the privilege to offline/shutdown those worker nodes), Now I want to use some portion of workers nodes rather than all of them to do jobs that are submitted by swift. The way i think is setting the value of "maximum number of concurrent jobs allowed on a site" to number of workers nodes I wish to run. (e.g. If I only want 5 out 10 workers node keep busy, I can choose to set the value of maximum number of concurrent jobs to 5 ,so there are only 5 jobs can be run on pbs concurrently) since I can not just set that value (i.e. swift use a formula to dynamically calculate the value), I need find another way to sort it out. Change the value of "foreach.max.threads" doesn't work as we expected: when i set it to 1, swift just freeze when started. when i set it to 2, it seems submit job one by one. 
when i set it to 3, swift submit 4 jobs at once. when i set it to 4, swift submit 9 jobs at once. when i set it to 5, swift submit 16 jobs at once. when i set it to 10,swift submit 81 jobs at once. -Yi Allan Espinosa wrote: > hi yi, > > in swift.properites file you set > > foreach.max.threads=1024 > > to > > foreach.max.threads=N > > where N is the max number concurrent jobs you want per swift session. > Also when you set score to be ridicuosly high (ie 10000) you always > get the maximum theoretical number of jobs you want based on the > throttling parameters. > > -Allan > > 2009/8/6 Yi Zhu : >> Hi, all >> >> As we've already know, Swift dynamically change the maximum number of >> concurrent jobs allowed on a site based on the performance history of that >> site. According to swift Document: Each site is assigned a score (initially >> 1), which can increase or decrease based on whether the site yields >> successful or faulty job runs. The score for a site can take values in the >> (0.1, 100) interval. The number of allowed jobs is calculated using the >> following formula: >> >> 2 + score*throttle.score.job.factor >> >> We can change the throttle.score.job.factor in sites.xml or swift.properties >> files, but since the "score" value can be increased/decreased during the >> execution, It seems that we can not really set the maximum number of >> concurrent jobs allowed on a site to a fixed number. Anyone have any idea of >> that? >> >> >> Many Thanks. >> > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user > From aespinosa at cs.uchicago.edu Thu Aug 6 16:27:03 2009 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Thu, 6 Aug 2009 16:27:03 -0500 Subject: [Swift-user] Re: [Swift-devel] How to the maximum number of concurrent jobs allowed on a site to a fixed size? 
In-Reply-To: <4A7B48F9.9020604@cs.uchicago.edu>
References: <4A7B39DB.3030602@cs.uchicago.edu> <50b07b4b0908061330s32ccf081rf304fef411cecb4f@mail.gmail.com> <4A7B48F9.9020604@cs.uchicago.edu>
Message-ID: <50b07b4b0908061427p7569fdddxaa8d7818c26fcd6e@mail.gmail.com>

The formula is

5 = 2 + 100 * throttle

so here the throttle should be 0.03.

Another approach is to create a PBS queue / reservation with only 5 nodes available and then just set Swift to submit as many jobs as it can.

2009/8/6 Yi Zhu :
> Hi, Allan
>
> Thanks for your reply.
>
> I've tried the way you suggested, but it doesn't work as I expected.
>
> Suppose I have a PBS cluster with 10 worker nodes (btw, I don't have the
> privilege to offline/shut down those worker nodes). Now I want to use some
> portion of the worker nodes rather than all of them to run jobs that are
> submitted by Swift. The way I thought of is setting the value of
> "maximum number of concurrent jobs allowed on a site" to the number of worker
> nodes I wish to use. (E.g., if I only want to keep 5 out of 10 worker nodes busy, I
> can set the maximum number of concurrent jobs to 5, so
> only 5 jobs can run on PBS concurrently.)
>
> Since I cannot just set that value (i.e., Swift uses a formula to dynamically
> calculate it), I need to find another way to sort it out.
>
> Changing the value of "foreach.max.threads" doesn't work as we expected:
>
> when I set it to 1, Swift just freezes when started.
> when I set it to 2, it seems to submit jobs one by one.
> when I set it to 3, Swift submits 4 jobs at once.
> when I set it to 4, Swift submits 9 jobs at once.
> when I set it to 5, Swift submits 16 jobs at once.
> when I set it to 10, Swift submits 81 jobs at once.
>
> -Yi
>
> Allan Espinosa wrote:
>>
>> hi yi,
>>
>> in the swift.properties file you change
>>
>> foreach.max.threads=1024
>>
>> to
>>
>> foreach.max.threads=N
>>
>> where N is the max number of concurrent jobs you want per Swift session.
>> Also when you set score to be ridicuosly high (ie 10000) you always >> get the maximum theoretical number of jobs you want based on the >> throttling parameters. >> >> -Allan >> >> 2009/8/6 Yi Zhu : >>> >>> Hi, all >>> >>> As we've already know, Swift dynamically change the maximum number of >>> concurrent jobs allowed on a site based on the performance history of >>> that >>> site. According to swift Document: Each site is assigned a score >>> (initially >>> 1), which can increase or decrease based on whether the site yields >>> successful or faulty job runs. The score for a site can take values in >>> the >>> (0.1, 100) interval. The number of allowed jobs is calculated using the >>> following formula: >>> >>> 2 + score*throttle.score.job.factor >>> >>> We can change the throttle.score.job.factor in sites.xml or >>> swift.properties >>> files, but since the "score" value can be increased/decreased during the >>> execution, It seems that we can not really set the maximum ?number of >>> concurrent jobs allowed on a site to a fixed number. Anyone have any idea >>> of >>> that? From hategan at mcs.anl.gov Thu Aug 6 16:41:54 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 06 Aug 2009 16:41:54 -0500 Subject: [Swift-user] Re: [Swift-devel] How to the maximum number of concurrent jobs allowed on a site to a fixed size? In-Reply-To: <4A7B39DB.3030602@cs.uchicago.edu> References: <4A7B39DB.3030602@cs.uchicago.edu> Message-ID: <1249594914.28410.81.camel@blabla> On Thu, 2009-08-06 at 15:15 -0500, Yi Zhu wrote: > Hi, all > > As we've already know, Swift dynamically change the maximum number of > concurrent jobs allowed on a site based on the performance history of > that site. According to swift Document: Each site is assigned a score > (initially 1), which can increase or decrease based on whether the site > yields successful or faulty job runs. The score for a site can take > values in the (0.1, 100) interval. 
The number of allowed jobs is
> calculated using the following formula:
>
> 2 + score*throttle.score.job.factor
>
> We can change the throttle.score.job.factor in sites.xml or
> swift.properties files, but since the "score" value can be
> increased/decreased during the execution, it seems that we cannot
> really set the maximum number of concurrent jobs allowed on a site to a
> fixed number. Does anyone have any idea about that?

Can you rephrase the question?

The number of jobs running on a site is a function of the current demand for that site and some monotonically increasing function of the score:

nj = f(d, g(s)) = min(d, g(s))

The score is a function of time (roughly):

s = s(t)

Assuming demand is higher than the job limit (g) (which is the case when you're interested in limiting nj):

d > g(s) => min(d, g(s)) = g(s)

So

nj = g(s(t))

Now, you know that s(t) is bounded (by default (0.01, 100) - max is open so assume limits instead of equality), and since g is monotonically increasing and g(max_score) is finite, it follows that max(g(x)) is g(max_score). So there is a fixed number of concurrent jobs regardless of time/score (max(g(s(t)))) as well as a maximum number of concurrent jobs at each time point (i.e. for each score) (g(s(t))).

Mihael

From yizhu at cs.uchicago.edu Thu Aug 6 16:50:35 2009
From: yizhu at cs.uchicago.edu (Yi Zhu)
Date: Thu, 06 Aug 2009 16:50:35 -0500
Subject: [Swift-user] Re: [Swift-devel] How to the maximum number of concurrent jobs allowed on a site to a fixed size?
In-Reply-To: <1249594914.28410.81.camel@blabla>
References: <4A7B39DB.3030602@cs.uchicago.edu> <1249594914.28410.81.camel@blabla>
Message-ID: <4A7B502B.1080809@cs.uchicago.edu>

Hi Mihael

Now, I just set the initialScore to a ridiculously high value (e.g.
10000), and swift seems can automatically scale it down to the range, and then I set the throttle.factor according, therefore I could get a fixed maximum number according to the formula: 2+ score (range 0.1 -100)* throttle.factor -Yi Mihael Hategan wrote: > On Thu, 2009-08-06 at 15:15 -0500, Yi Zhu wrote: >> Hi, all >> >> As we've already know, Swift dynamically change the maximum number of >> concurrent jobs allowed on a site based on the performance history of >> that site. According to swift Document: Each site is assigned a score >> (initially 1), which can increase or decrease based on whether the site >> yields successful or faulty job runs. The score for a site can take >> values in the (0.1, 100) interval. The number of allowed jobs is >> calculated using the following formula: >> >> 2 + score*throttle.score.job.factor >> >> We can change the throttle.score.job.factor in sites.xml or >> swift.properties files, but since the "score" value can be >> increased/decreased during the execution, It seems that we can not >> really set the maximum number of concurrent jobs allowed on a site to a >> fixed number. Anyone have any idea of that? > > Can you rephrase the question? > > The number of jobs running on a site is a function of the current demand > for that site and some monotonically increasing function of the score: > > nj = f(d, g(s)) = min(d, g(s)) > > The score is a function of time (roughly): > > s = s(t) > > Assuming demand is higher than the job limit (g) (which is the case when > you're interested in limiting nj): > > d > g(s) => min(d, g(s)) = g(s) > > So > > nj = g(s(t)) > > Now, you know that s(t) is bounded (by default (0.01, 100) - max is open > so assume limits instead of equality), and since g is monotonically > increasing and g(max_score) is finite, it follows that max(g(x)) is > g(max_score). 
So there is a fixed number of concurrent jobs
> regardless of time/score (max(g(s(t)))) as well as a maximum number of
> concurrent jobs at each time point (i.e. for each score) (g(s(t))).
>
> Mihael
>

From hategan at mcs.anl.gov Thu Aug 6 16:56:48 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 06 Aug 2009 16:56:48 -0500
Subject: [Swift-user] Re: [Swift-devel] How to the maximum number of concurrent jobs allowed on a site to a fixed size?
In-Reply-To: <4A7B48F9.9020604@cs.uchicago.edu>
References: <4A7B39DB.3030602@cs.uchicago.edu> <50b07b4b0908061330s32ccf081rf304fef411cecb4f@mail.gmail.com> <4A7B48F9.9020604@cs.uchicago.edu>
Message-ID: <1249595808.28410.83.camel@blabla>

On Thu, 2009-08-06 at 16:19 -0500, Yi Zhu wrote:
> Hi, Allan
>
> Thanks for your reply.
>
> I've tried the way you suggested, but it doesn't work as I expected.
>
> Suppose I have a PBS cluster with 10 worker nodes (btw, I don't have the
> privilege to offline/shut down those worker nodes). Now I want to use
> some portion of the worker nodes rather than all of them to run jobs that
> are submitted by Swift. The way I thought of is setting the value of
> "maximum number of concurrent jobs allowed on a site" to the number of
> worker nodes I wish to use. (E.g., if I only want to keep 5 out of 10 worker nodes
> busy, I can set the maximum number of
> concurrent jobs to 5, so only 5 jobs can run on PBS
> concurrently.)

So you're not trying to set the maximum number of concurrent jobs, but the exact number of concurrent jobs.

What you can do is give the site a large initial unscaled score, which means that the scaled score will be at its maximum from the start, and then set the job throttle appropriately.

From hategan at mcs.anl.gov Thu Aug 6 16:58:21 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 06 Aug 2009 16:58:21 -0500
Subject: [Swift-user] Re: [Swift-devel] How to the maximum number of concurrent jobs allowed on a site to a fixed size?
In-Reply-To: <4A7B502B.1080809@cs.uchicago.edu> References: <4A7B39DB.3030602@cs.uchicago.edu> <1249594914.28410.81.camel@blabla> <4A7B502B.1080809@cs.uchicago.edu> Message-ID: <1249595901.28410.84.camel@blabla> On Thu, 2009-08-06 at 16:50 -0500, Yi Zhu wrote: > Hi Mihael > > Now, I just set the initialScorer to a ridiculously high value (e.g. > 10000), and swift seems can automatically scale it down to the range, > and then I set the throttle.factor according, therefore I could get a > fixed maximum number according to the formula: > > > 2+ score (range 0.1 -100)* throttle.factor > Exactly. From jamalphd at gmail.com Fri Aug 7 15:40:03 2009 From: jamalphd at gmail.com (J A) Date: Fri, 7 Aug 2009 16:40:03 -0400 Subject: [Swift-user] XDTM In-Reply-To: <4A6D08A8.9010400@mcs.anl.gov> References: <4A6CE247.4010105@mcs.anl.gov> <4A6D08A8.9010400@mcs.anl.gov> Message-ID: Hi Michael: 1. After running a .swift or .dtm code, two files gets created: .xml and .klm. What do they represent? 2. Correct me if I am wrong: - Datasets are mapped to physical presentation using mapping algorithms. Some mapping algorithms already created part of swift and the user can add/create others and use the existing once as the base. - Currently, the physical representation are files. 3. In the fMRI example, I see volume, Image, etc declared as a type? who defines them as a type? 4. In one of your emails, you stated that Swift functions can take accept files, int, string, float and boolean values as arguments. They return files, or scalar values inside files. My question is: if the output is a string that is inside a file, how can I use this output in another program that takes it as an input? doesn't call the file name and should have a code to read from the file? 5. I am still confused when talk about XML Data Type and Mapping. Where is the XML representation? Is it the .xml that gets generated when run the swift code? 6. 
Let's look at this example: type messagefile {} (messagefile t) greeting (string s[]) { app { echo s[0] s[1] s[2] stdout=@filename(t); } } messagefile outfile <"q5out.txt">; string words[] = ["how","are","you"]; outfile = greeting(words); === So we have messagefile as a data type. outfile and words are datasets. what will be the physical representation for these 2 datasets? is thee system parsing the swift code, identifying the data types and datasets and based on that choosea the proper mapping algorithm needed? Thanks, Jamal On Sun, Jul 26, 2009 at 9:53 PM, Michael Wilde wrote: > Hi Jamal, > > A lot of this is covered in the Swift user guide and tutorial. Have you > read through those yet? > > All the docs are at: http://www.ci.uchicago.edu/swift/docs/index.php > > If so, and the clarifications below don't help, please ask again on the > list, OK? > > - Mike > > > On 7/26/09 7:27 PM, J A wrote: > >> Hi Michael: >> First, thank you for your reply and information provided. >> I am trying to understand more how it handles the input/output parameters >> and make them available for other functions. >> > > All functions in Swift are either atomic interfaces to application programs > (ie, how o exec the program) or composite higher level functions. > >> To illustrate, I will give this example for the sake of discussion: >> I have a C program called test.c that contains 4 functions ( main(), F1, >> F2, and F3). each function takes some parameters such as int, string, name >> of a file that is in the same directory, and each one produced some output >> (string, int, and a file). Of course i am using global variables. Now, >> main calls F1, F1 passes its output to F2, and F2 passes its output to F3. >> > > Swift doesnt look at the functions inside an application. It invokes the > application as a program (think fork/exec) just like a shell would, but > distributed and in parallel if so specified. 
> >> Overall, the test.c takes an int, string, and file, and output several >> files. the output files contains output produced by the internal functions >> (tasks). >> > > Swift functions can take accept files, int, string, float and boolean > values as arguments. They return files, or scalar values inside files. > (Again, think shell scripts). Composite structures - structs and arrays - > of the above can be passed. > >> I would like to understand more when i transfer my code to Swift how it >> handles the input/output data, where it stores them, etc. I read couple of >> papers about XDTM and still have some confusion about the terms: dataset, >> typed, how/where its physical representation is located at, and how the >> input/output is used within the internal functions. >> > > Files are by default named ("mapped") relative to the directory in which > you run the Swift command. Many flexible extensions to that model are > provided for (eg, URIs). Swift sends the data to the site chosen for > execution (thats yet another topic) and returns results back to the same > submission host. > > Mapping declarations in the Swift script specify how files and directory > structures are mapped to Swift variables (scalars, arrays, structures). > These are used in the specification of the Swift code. When Swift runs > programs, it takes files that were mapped and knows how to send them to grid > sites or clusters and get data back. > >> I am new to this area and trying to understand how the DTM works. >> Any help from your side on this area is really appreciated. >> Thanks, >> Jamal >> >> On Sun, Jul 26, 2009 at 7:09 PM, Michael Wilde > wilde at mcs.anl.gov>> wrote: >> >> Jamal, >> >> As Swift evolved from its early prototypes to a more mature system, >> the notion of XDTM evolved to one of mapping between >> filesystem-based structures and Swift in-memory data structures (ie, >> scalars, arrays, and structures, which can be nested and typed). 
>> >> This is best seen by looking at the "external" mapper, which allows >> a user to map a dataset using any external program (typically a >> script) that returns the members of the dataset as a two-column >> list: the Swift variable reference, and the external file or URI. >> >> See the user guide section on the external mapper: >> >> >> http://www.ci.uchicago.edu/swift/guides/userguide.php#mapper.ext_mapper >> (but the example in the user guide doesn't show the power of mapping >> to nested structures). >> >> In other words, it still has the flavor of XDTM, but without any XML >> being visible to the user. It meets the same need but is easier to >> use and explain. >> >> - Mike >> >> >> On 7/26/09 2:50 PM, J A wrote: >> >> Hi All: >> Can any one direct me to a source with more >> examples/explanation on how XDTM is working/implemented? >> Thanks, >> Jamal >> >> >> ------------------------------------------------------------------------ >> >> _______________________________________________ >> Swift-user mailing list >> Swift-user at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user >> >> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From wilde at mcs.anl.gov Fri Aug 14 10:38:07 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 14 Aug 2009 10:38:07 -0500 Subject: [Swift-user] XDTM In-Reply-To: References: <4A6CE247.4010105@mcs.anl.gov> <4A6D08A8.9010400@mcs.anl.gov> Message-ID: <4A8584DF.3030103@mcs.anl.gov> On 8/7/09 3:40 PM, J A wrote: > Hi Michael: > > 1. After running a .swift or .dtm code, two files gets created: .xml > and .klm. What do they represent? .xml is an xml version of the parsed .swift file .kml (not klm) is the xml representation of the Karajan script that the Swift script is translated into for execution. Its actually the .kml file that is executed by Karajan which drives the execution of a Swift script. > 2. 
Correct me if I am wrong: > * Datasets are mapped to physical presentation using mapping > algorithms. Some mapping algorithms already created part of > swift and the user can add/create others and use the > existing once as the base. Yes, thats right. But, to clarify this part: "user can add/create others and use the existing ones as the base." The user can use existing mappers, and add new mappers, either in Java or as external executables or scripts. But each mapper is independent. When you say "can use existing ones as a base" I would say thats correct, in that a user could *copy* and modify the code of one mapper to create another mapper, or, in the case of an "ext" mapper, one ext mapper could conceivably execute another and modify/filter its output to create a new mapping. > * Currently, the physical representation are files. Yes, if you mean to say that mappers map files to Swift variables. > 3. In the fMRI example, I see volume, Image, etc declared as > a type? who defines them as a type? > 4. In one of your emails, you stated that Swift functions can take > accept files, int, string, float and boolean values as arguments. > They return files, or scalar values inside files. My question is: > if the output is a string that is inside a file, how can I use > this output in another program that takes it as an input? doesn't > call the file name and should have a code to read from the file? Yes, you can use readData() or readData2() to read the contents of a file back into Swift variables (including into arrays and structures, if the output has some structure). > 5. I am still confused when talk about XML Data Type and Mapping. > Where is the XML representation? Is it the .xml that gets > generated when run the swift code? No, the XML - if indeed it still exists - is only internal. 
I described it this way in an earlier post:

--
"As Swift evolved from its early prototypes to a more mature system, the notion of XDTM evolved to one of mapping between filesystem-based structures and Swift in-memory data structures (ie, scalars, arrays, and structures, which can be nested and typed).

This is best seen by looking at the "external" mapper, ...

In other words, it still has the flavor of XDTM, but without any XML being visible to the user. It meets the same need but is easier to use and explain."
--

When XDTM was first implemented by Yong Zhao, he used XML within Swift to represent the mapping. I am not even sure whether this XML representation is still used in the current implementation. I suspect *not*.

But the important concept here should really be called "DTM" - dataset typing and mapping - and it's embodied in the type model and mapping model of the language.

So you should stop thinking about data typing and mapping as being connected in any way to XML. What we described in earlier papers as XDTM is not something that you can experiment with in terms of XML: i.e., you cannot see the XML for a mapping because it's either deep inside the Swift implementation, or it no longer exists in the current Swift code.

> 6. Let's look at this example:
>
> type messagefile {}
>
> (messagefile t) greeting (string s[]) {
>     app {
>         echo s[0] s[1] s[2] stdout=@filename(t);
>     }
> }
>
> messagefile outfile <"q5out.txt">;
>
> string words[] = ["how","are","you"];
>
> outfile = greeting(words);
> ===
>
> So we have messagefile as a data type. outfile and words are
> datasets. what will be the physical representation for these 2
> datasets?

An object of type messagefile will be represented as a single physical file externally, and internally as a scalar variable. Words is an array of strings.
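To make that answer concrete, here is a rough Python stand-in (not Swift, and not how Swift is implemented) for what the greeting example amounts to at run time: the strings are plain in-memory values that become command-line arguments, while the messagefile variable is only a name for a file that the program's stdout is written to. The helper names `greeting` and `run_example` are made up for this sketch, and it assumes a POSIX `echo` is available on the PATH.

```python
import subprocess
import tempfile
from pathlib import Path


def greeting(words, outfile):
    """Rough stand-in for: app { echo s[0] s[1] s[2] stdout=@filename(t); }"""
    # 'words' plays the role of the unmapped string array: values only,
    # passed to the program as command-line arguments.
    # 'outfile' plays the role of the mapped messagefile variable: all that
    # is known about it up front is the file name it is mapped to.
    with open(outfile, "w") as out:
        subprocess.run(["echo", *words[:3]], stdout=out, check=True)
    return outfile


def run_example(workdir):
    outfile = Path(workdir) / "q5out.txt"  # the mapping: variable -> file name
    words = ["how", "are", "you"]          # values, with no file behind them
    greeting(words, outfile)
    return outfile.read_text()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(run_example(tmp), end="")
```

The only thing that ends up on disk is q5out.txt; the string array never gets a physical representation of its own.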
Each atomic Swift variable (ie, scalars, array members, and structure members) can be thought of as a triple:

(set-state, mapping, value)

All variables have a set-state: initially unset, then set when the variable is assigned a value.

File-valued variables have only a mapping, but no value.

Scalar-valued variables (ie, non-mapped variables like strings, as in your example) have a value (eg the string, integer, boolean or float value) but no mapping.

We're still looking for better terminology to describe this; the current user guide uses both the terms "mapped type" and "marker type" to denote a file-valued variable. Both terms refer to the same concept; I'm leaning toward the term "mapped type".

> is the system parsing the swift code, identifying the
> data types and datasets and based on that choosing the proper mapping
> algorithm needed?

After the Swift command parses the Swift code, execution begins - i.e. the .kml file is executed by Karajan. Mappers are called, as can be seen in the .kml file. (And you can see their actions in the swift .log file.)

The mapping for all mapped variables is either specified by the user (the most common case) or defaults to concurrent_mapper. The user guide describes this in pretty good detail.

I hope that gets you a bit further. I hope that looking at XML mappings is not critical to your research, as I don't think you'll be able to readily get an XML intermediate form out of Swift.

An interesting topic would be to implement mechanisms to handle data in XML representations, in particular to enable Swift to invoke SOAP services as well as file-based applications and to compose scripts that call both forms of application.

- Mike

>
> Thanks,
> Jamal
>
> On Sun, Jul 26, 2009 at 9:53 PM, Michael Wilde
> wrote:
>
> Hi Jamal,
>
> A lot of this is covered in the Swift user guide and tutorial. Have
> you read through those yet?
>
> All the docs are at: http://www.ci.uchicago.edu/swift/docs/index.php
>
> If so, and the clarifications below don't help, please ask again on
> the list, OK?
>
> - Mike
>
>
>
> On 7/26/09 7:27 PM, J A wrote:
>
> Hi Michael:
> First, thank you for your reply and information provided.
> I am trying to understand more how it handles the input/output
> parameters and make them available for other functions.
>
>
> All functions in Swift are either atomic interfaces to application
> programs (ie, how to exec the program) or composite higher level
> functions.
>
> To illustrate, I will give this example for the sake of discussion:
> I have a C program called test.c that contains 4 functions (
> main(), F1, F2, and F3). each function takes some parameters
> such as int, string, name of a file that is in the same
> directory, and each one produced some output (string, int, and a
> file). Of course i am using global variables. Now, main calls
> F1, F1 passes its output to F2, and F2 passes its output to F3.
>
>
> Swift doesn't look at the functions inside an application. It invokes
> the application as a program (think fork/exec) just like a shell
> would, but distributed and in parallel if so specified.
>
> Overall, the test.c takes an int, string, and file, and output
> several files. the output files contains output produced by the
> internal functions (tasks).
>
>
> Swift functions can accept files, int, string, float and
> boolean values as arguments. They return files, or scalar values
> inside files. (Again, think shell scripts). Composite structures -
> structs and arrays - of the above can be passed.
>
> I would like to understand more when i transfer my code to
> Swift how it handles the input/output data, where it stores
> them, etc. I read couple of papers about XDTM and still have
> some confusion about the terms: dataset, typed, how/where its
> physical representation is located at, and how the input/output
> is used within the internal functions.
>
>
> Files are by default named ("mapped") relative to the directory in
> which you run the Swift command. Many flexible extensions to that
> model are provided for (eg, URIs). Swift sends the data to the site
> chosen for execution (that's yet another topic) and returns results
> back to the same submission host.
>
> Mapping declarations in the Swift script specify how files and
> directory structures are mapped to Swift variables (scalars, arrays,
> structures). These are used in the specification of the Swift code.
> When Swift runs programs, it takes files that were mapped and knows
> how to send them to grid sites or clusters and get data back.
>
> I am new to this area and trying to understand how the DTM works.
> Any help from your side on this area is really appreciated.
> Thanks,
> Jamal
>
> On Sun, Jul 26, 2009 at 7:09 PM, Michael Wilde wrote:
>
> Jamal,
>
> As Swift evolved from its early prototypes to a more mature system,
> the notion of XDTM evolved to one of mapping between
> filesystem-based structures and Swift in-memory data structures (ie,
> scalars, arrays, and structures, which can be nested and typed).
>
> This is best seen by looking at the "external" mapper, which allows
> a user to map a dataset using any external program (typically a
> script) that returns the members of the dataset as a two-column
> list: the Swift variable reference, and the external file or URI.
>
> See the user guide section on the external mapper:
>
> http://www.ci.uchicago.edu/swift/guides/userguide.php#mapper.ext_mapper
> (but the example in the user guide doesn't show the power of mapping
> to nested structures).
>
> In other words, it still has the flavor of XDTM, but without any XML
> being visible to the user. It meets the same need but is easier to
> use and explain.
>
> - Mike
>
>
> On 7/26/09 2:50 PM, J A wrote:
>
> Hi All:
> Can any one direct me to a source with more
> examples/explanation on how XDTM is working/implemented?
> Thanks,
> Jamal
>
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> Swift-user mailing list
> Swift-user at ci.uchicago.edu
>

>
>
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user
>
>
>

From hategan at mcs.anl.gov Fri Aug 14 11:30:58 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 14 Aug 2009 11:30:58 -0500
Subject: [Swift-user] XDTM
In-Reply-To: <4A8584DF.3030103@mcs.anl.gov>
References: <4A6CE247.4010105@mcs.anl.gov> <4A6D08A8.9010400@mcs.anl.gov> <4A8584DF.3030103@mcs.anl.gov>
Message-ID: <1250267458.14428.21.camel@blabla>

On Fri, 2009-08-14 at 10:38 -0500, Michael Wilde wrote:
[...]
> > 5. I am still confused when talk about XML Data Type and Mapping.
> > Where is the XML representation? Is it the .xml that gets
> > generated when run the swift code?
>
> No, the XML - if indeed it still exists - is only internal. I described
> it this way in an earlier post:

The Swift type declarations are "compiled" into XML schema since XML
schema is sufficiently powerful to express the structure of Swift
user-declared types. That piece appears in the header of a .kml file.
However, it has nothing to do with mapping, so it's probably not to be
called XDTM.

>
> --
>
> "As Swift evolved from its early prototypes to a more mature system, the
> notion of XDTM evolved to one of mapping between filesystem-based
> structures and Swift in-memory data structures (ie, scalars, arrays, and
> structures, which can be nested and typed).
>
> This is best seen by looking at the "external" mapper, ...
>
> In other words, it still has the flavor of XDTM, but without any XML
> being visible to the user. It meets the same need but is easier to use
> and explain."
>
> --
>
> When XDTM was first implemented, by Yong Zhao, he used XML within Swift
> to represent the mapping. I am not even sure if this XML representation
> is still used in the current implementation, or not. I suspect *not*.

XML is used by virtue of the .kml files being XML (loosely).
But it isn't, and never was (in any version of Swift/VDL2 compiled to
Karajan I know of), a representation of the mapping, but rather a
declaration of the mapper:



That's the translation of
... <"0231-complex-type.out">;

When the first swift prototype was written we went directly to
representing swift data as a tree of in-memory objects, and mappers as
being attached to the root of each such tree. We figured that we could
achieve better scalability if we avoided storing actual mappings when we
could and used algorithmic ways to calculate the mappings on-the-fly.

In other words, it takes more (O(n)) space to store "(1, 1), (2, 4), (3,
9), (4, 16), ... (n, n^2)" than to store "f(k) = k^2, x = {1..n}" (O(1))
but they are the same function.

Mihael

From wilde at mcs.anl.gov Fri Aug 14 11:55:00 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 14 Aug 2009 11:55:00 -0500
Subject: [Swift-user] XDTM
In-Reply-To: <1250267458.14428.21.camel@blabla>
References: <4A6CE247.4010105@mcs.anl.gov> <4A6D08A8.9010400@mcs.anl.gov> <4A8584DF.3030103@mcs.anl.gov> <1250267458.14428.21.camel@blabla>
Message-ID: <4A8596E4.4050901@mcs.anl.gov>

On 8/14/09 11:30 AM, Mihael Hategan wrote:
> On Fri, 2009-08-14 at 10:38 -0500, Michael Wilde wrote:
> [...]
>>> 5. I am still confused when talk about XML Data Type and Mapping.
>>> Where is the XML representation? Is it the .xml that gets
>>> generated when run the swift code?
>> No, the XML - if indeed it still exists - is only internal. I described
>> it this way in an earlier post:
>
> The Swift type declarations are "compiled" into XML schema since XML
> schema is sufficiently powerful to express the structure of Swift
> user-declared types. That piece appears in the header of
> a .kml file. However, it has nothing to do with mapping, so it's
> probably not to be called XDTM.

Mihael, thanks for clarifying.
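
Mihael's space argument above (storing every (k, k^2) pair versus storing
only the rule f(k) = k^2) can be made concrete with a short sketch. This
is an illustration only, written in Python rather than in Swift's own
code; the names stored_mapping, ComputedMapping and same_function are
hypothetical and do not correspond to anything in Swift.

```python
# Illustrative sketch only: none of these names exist in Swift.


def stored_mapping(n):
    """Explicit table: one stored entry per element, so O(n) space."""
    return {k: k * k for k in range(1, n + 1)}


class ComputedMapping:
    """Rule-based mapping: keeps only the bound n, so O(1) space."""

    def __init__(self, n):
        self.n = n

    def __getitem__(self, k):
        # Reject keys outside the domain {1..n}, as a dict lookup would.
        if not 1 <= k <= self.n:
            raise KeyError(k)
        return k * k  # calculated on demand, never stored

    def __len__(self):
        return self.n


def same_function(n):
    """Return True if both representations agree on every k in 1..n."""
    table = stored_mapping(n)
    rule = ComputedMapping(n)
    return all(table[k] == rule[k] for k in range(1, n + 1))
```

Both objects answer the same lookups; only the amount of state differs.
Presumably the same reasoning carries over when the values are file names
derived from an index rather than squares, but that is an assumption, not
something stated in this thread.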
In the XDTM paper we described the use of XML schema this way:

"In XDTM, a dataset's logical structure is specified
via a subset of XML Schema, which defines primitive
scalar data types such as Boolean, Integer, String, Float,
and Date, and also allows for the definition of complex
types via the composition of simple and complex types."

In other words, what you say above: that XML Schema is used to define
Swift user-declared types.

- Mike

>> --
>>
>> "As Swift evolved from its early prototypes to a more mature system, the
>> notion of XDTM evolved to one of mapping between filesystem-based
>> structures and Swift in-memory data structures (ie, scalars, arrays, and
>> structures, which can be nested and typed).
>>
>> This is best seen by looking at the "external" mapper, ...
>>
>> In other words, it still has the flavor of XDTM, but without any XML
>> being visible to the user. It meets the same need but is easier to use
>> and explain."
>>
>> --
>>
>> When XDTM was first implemented, by Yong Zhao, he used XML within Swift
>> to represent the mapping. I am not even sure if this XML representation
>> is still used in the current implementation, or not. I suspect *not*.
>
> XML is used by virtue of the .kml files being XML (loosely). But it
> isn't and was never (in any version of Swift/VDL2 compiled to Karajan I
> know of) a representation of the mapping, but a declaration of the
> mapper:
>
>
>
>
>
> That's the translation of
> ... <"0231-complex-type.out">;
>
> When the first swift prototype was written we went directly to
> representing swift data as a tree of in-memory objects, and mappers as
> being attached to the root of each such tree. We figured that we could
> achieve better scalability if we avoided storing actual mappings when we
> could and used algorithmic ways to calculate the mappings on-the-fly.
>
> In other words, it takes more (O(n)) space to store "(1, 1), (2, 4), (3,
> 9), (4, 16), ...
(n, n^2)" than to store "f(k) = k^2, x = {1..n}" (O(1))
> but they are the same function.
>
> Mihael
>

From hategan at mcs.anl.gov Fri Aug 14 12:09:43 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 14 Aug 2009 12:09:43 -0500
Subject: [Swift-user] XDTM
In-Reply-To: <4A8596E4.4050901@mcs.anl.gov>
References: <4A6CE247.4010105@mcs.anl.gov> <4A6D08A8.9010400@mcs.anl.gov> <4A8584DF.3030103@mcs.anl.gov> <1250267458.14428.21.camel@blabla> <4A8596E4.4050901@mcs.anl.gov>
Message-ID: <1250269783.17365.7.camel@blabla>

On Fri, 2009-08-14 at 11:55 -0500, Michael Wilde wrote:
>
> On 8/14/09 11:30 AM, Mihael Hategan wrote:
> > On Fri, 2009-08-14 at 10:38 -0500, Michael Wilde wrote:
> > [...]
> >>> 5. I am still confused when talk about XML Data Type and Mapping.
> >>> Where is the XML representation? Is it the .xml that gets
> >>> generated when run the swift code?
> >> No, the XML - if indeed it still exists - is only internal. I described
> >> it this way in an earlier post:
> >
> > The Swift type declarations are "compiled" into XML schema since XML
> > schema is sufficiently powerful to express the structure of Swift
> > user-declared types. That piece appears in the header of
> > a .kml file. However, it has nothing to do with mapping, so it's
> > probably not to be called XDTM.
>
> Mihael, thanks for clarifying.
>
> In the XDTM paper we described the use of XML schema this way:
>
> "In XDTM, a dataset's logical structure is specified
> via a subset of XML Schema, which defines primitive
> scalar data types such as Boolean, Integer, String, Float,
> and Date, and also allows for the definition of complex
> types via the composition of simple and complex types."
>
> In other words, what you say above: that XML Schema is used to define
> Swift user-declared types.
>

Right. It's arguable whether that has much to do with XDTM. Using XML
schema to describe types/structure is the natural thing to do if you're
using XML and you need to describe types/structure.
In other words, I think XDTM is confusing, and the ambiguity above should
not prevent us from saying that it isn't used in Swift.

From jamalphd at gmail.com Fri Aug 14 14:02:24 2009
From: jamalphd at gmail.com (J A)
Date: Fri, 14 Aug 2009 15:02:24 -0400
Subject: [Swift-user] XDTM
In-Reply-To: <4A8584DF.3030103@mcs.anl.gov>
References: <4A6CE247.4010105@mcs.anl.gov> <4A6D08A8.9010400@mcs.anl.gov> <4A8584DF.3030103@mcs.anl.gov>
Message-ID:

Thanks for your help. I will analyze the data I have and let you know of
any updates.

On Fri, Aug 14, 2009 at 11:38 AM, Michael Wilde wrote:

> On 8/7/09 3:40 PM, J A wrote:
>
>> Hi Michael:
>>
>> 1. After running a .swift or .dtm code, two files gets created: .xml
>> and .klm. What do they represent?
>>
>
> .xml is an xml version of the parsed .swift file
> .kml (not klm) is the xml representation of the Karajan script that the
> Swift script is translated into for execution. It's actually the .kml file
> that is executed by Karajan which drives the execution of a Swift script.
>
> 2. Correct me if I am wrong:
>> * Datasets are mapped to physical presentation using mapping
>> algorithms. Some mapping algorithms already created part of
>> swift and the user can add/create others and use the
>> existing ones as the base.
>>
>
> Yes, that's right.
>
> But, to clarify this part:
> "user can add/create others and use the existing ones as the base."
>
> The user can use existing mappers, and add new mappers, either in Java or
> as external executables or scripts. But each mapper is independent. When you
> say "can use existing ones as a base" I would say that's correct, in that a
> user could *copy* and modify the code of one mapper to create another
> mapper, or, in the case of an "ext" mapper, one ext mapper could conceivably
> execute another and modify/filter its output to create a new mapping.
>
> * Currently, the physical representation are files.
>>
>
> Yes, if you mean to say that mappers map files to Swift variables.
>
> 3.
In the fMRI example, I see volume, Image, etc declared as >> a type? who defines them as a type? >> > > > 4. In one of your emails, you stated that Swift functions can take >> accept files, int, string, float and boolean values as arguments. >> They return files, or scalar values inside files. My question is: >> if the output is a string that is inside a file, how can I use >> this output in another program that takes it as an input? doesn't >> call the file name and should have a code to read from the file? >> > > Yes, you can use readData() or readData2() to read the contents of a file > back into Swift variables (including into arrays and structures, if the > output has some structure). > > 5. I am still confused when talk about XML Data Type and Mapping. >> Where is the XML representation? Is it the .xml that gets >> generated when run the swift code? >> > > No, the XML - if indeed it still exists - is only internal. I described it > this way in an earlier post: > > -- > > "As Swift evolved from its early prototypes to a more mature system, the > notion of XDTM evolved to one of mapping between filesystem-based structures > and Swift in-memory data structures (ie, scalars, arrays, and structures, > which can be nested and typed). > > This is best seen by looking at the "external" mapper, ... > > In other words, it still has the flavor of XDTM, but without any XML being > visible to the user. It meets the same need but is easier to use and > explain." > > -- > > When XDTM was first implemented, by Yong Zhao, he used XML within Swift to > represent the mapping. I am not even sure if this XML representation is > still used in the current implementation, or not. I suspect *not*. > > But the important concept here should really be called "DTM" - dataset > typing and mapping - and its embodied in the type model and mapping model of > the language. > > So you should stop thinking about data typing and mapping as being > connected in any way to XML. 
> > What we described in earlier papers as XDTM is not something that you can > experiment with in terms of XML: ie, you can not see the XML for a mapping > because its either deep inside the Swift implementation, or it no longer > exists in the current Swift code. > > 6. Let's look at this example: >> >> type messagefile {} >> (messagefile t) greeting (string s[]) { app { >> echo s[0] s[1] s[2] stdout=@filename(t >> ); >> } >> } >> messagefile outfile <"q5out.txt">; >> string words[] = ["how","are","you"]; >> outfile = greeting(words); >> === >> So we have messagefile as a data type. outfile and words are >> datasets. what will be the physical representation for these 2 >> datasets? >> > > An object of type messagefile will be represented as a single physical file > externally, and internally as a scalar variable. > > Words is a an array of strings. > > Each atomic Swift variable (ie, scalars, array members, and structure > members) can be thought of as a triple: > > (set-state, mapping, value) > > All variables have a set-state; initially unset, then set when the variable > is assigned a value. > > File-valued variables have only a mapping, but no value. > Scalar-values (ie, non-mapped variables like strings, as in your example) > have a value (eg the string, interger, boolean or float value) but no > mapping. > > We're still looking for better terminology to describe this; the current > user guide uses both the terms "mapped type" and "marker type" to denote a > file-valued variable. Both terms refer to the same concept; Im leaning to > the term "mapped type". > > is thee system parsing the swift code, identifying the > >> data types and datasets and based on that choosea the proper mapping >> algorithm needed? >> > > After the Swift command parses the Swift code, execution begins - i.e. the > .kml file is executed by Karajan. Mappers are called as can be seen the kml. > (And you can see their actions in the swift .log file). 
> > The mapping for all mapped variables is either specified by the user (the > most common case) or defaults to concurrent_mapper. > > The users guide describes this in pretty good detail. > > I hope that gets you a bit further. I hope that looking at XML mappings is > not critical to your research, as I don't think you'll be able to readily > get an XML intermediate form out of Swift. > > An interesting topic would be to implement mechanisms to handle data in XML > representations, in particular to enable Swift to invoke SOAP services as > well as file-based applications and to compose scripts that call both forms > of application. > > - Mike > > >> Thanks, >> Jamal >> >> >> >> >> >> On Sun, Jul 26, 2009 at 9:53 PM, Michael Wilde > wilde at mcs.anl.gov>> wrote: >> >> Hi Jamal, >> >> A lot of this is covered in the Swift user guide and tutorial. Have >> you read through those yet? >> >> All the docs are at: http://www.ci.uchicago.edu/swift/docs/index.php >> >> If so, and the clarifications below don't help, please ask again on >> the list, OK? >> >> - Mike >> >> >> >> On 7/26/09 7:27 PM, J A wrote: >> >> Hi Michael: >> First, thank you for your reply and information provided. >> I am trying to understand more how it handles the input/output >> parameters and make them available for other functions. >> >> >> All functions in Swift are either atomic interfaces to application >> programs (ie, how o exec the program) or composite higher level >> functions. >> >> To illustrate, I will give this example for the sake of >> discussion: >> I have a C program called test.c that contains 4 functions ( >> main(), F1, F2, and F3). each function takes some parameters >> such as int, string, name of a file that is in the same >> directory, and each one produced some output (string, int, and a >> file). Of course i am using global variables. Now, main calls >> F1, F1 passes its output to F2, and F2 passes its output to F3. 
>> >> >> Swift doesnt look at the functions inside an application. It invokes >> the application as a program (think fork/exec) just like a shell >> would, but distributed and in parallel if so specified. >> >> Overall, the test.c takes an int, string, and file, and output >> several files. the output files contains output produced by the >> internal functions (tasks). >> >> >> Swift functions can take accept files, int, string, float and >> boolean values as arguments. They return files, or scalar values >> inside files. (Again, think shell scripts). Composite structures - >> structs and arrays - of the above can be passed. >> >> I would like to understand more when i transfer my code to >> Swift how it handles the input/output data, where it stores >> them, etc. I read couple of papers about XDTM and still have >> some confusion about the terms: dataset, typed, how/where its >> physical representation is located at, and how the input/output >> is used within the internal functions. >> >> >> Files are by default named ("mapped") relative to the directory in >> which you run the Swift command. Many flexible extensions to that >> model are provided for (eg, URIs). Swift sends the data to the site >> chosen for execution (thats yet another topic) and returns results >> back to the same submission host. >> >> Mapping declarations in the Swift script specify how files and >> directory structures are mapped to Swift variables (scalars, arrays, >> structures). These are used in the specification of the Swift code. >> When Swift runs programs, it takes files that were mapped and knows >> how to send them to grid sites or clusters and get data back. >> >> I am new to this area and trying to understand how the DTM works. >> Any help from your side on this area is really appreciated. 
>> Thanks, >> Jamal >> On Sun, Jul 26, 2009 at 7:09 PM, Michael Wilde >> >> >> wrote: >> >> Jamal, >> >> As Swift evolved from its early prototypes to a more mature >> system, >> the notion of XDTM evolved to one of mapping between >> filesystem-based structures and Swift in-memory data >> structures (ie, >> scalars, arrays, and structures, which can be nested and typed). >> >> This is best seen by looking at the "external" mapper, which >> allows >> a user to map a dataset using any external program (typically a >> script) that returns the members of the dataset as a two-column >> list: the Swift variable reference, and the external file or >> URI. >> >> See the user guide section on the external mapper: >> >> >> http://www.ci.uchicago.edu/swift/guides/userguide.php#mapper.ext_mapper >> (but the example in the user guide doesn't show the power of >> mapping >> to nested structures). >> >> In other words, it still has the flavor of XDTM, but without >> any XML >> being visible to the user. It meets the same need but is >> easier to >> use and explain. >> >> - Mike >> >> >> On 7/26/09 2:50 PM, J A wrote: >> >> Hi All: >> Can any one direct me to a source with more >> examples/explanation on how XDTM is working/implemented? >> Thanks, >> Jamal >> >> ------------------------------------------------------------------------ >> >> _______________________________________________ >> Swift-user mailing list >> Swift-user at ci.uchicago.edu >> >> > > >> >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user >> >> >> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From yecartes at gmail.com Thu Aug 6 15:24:24 2009 From: yecartes at gmail.com (Allan Espinosa) Date: Thu, 06 Aug 2009 20:24:24 -0000 Subject: [Swift-user] Re: [Swift-devel] How to the maximum number of concurrent jobs allowed on a site to a fixed size? 
In-Reply-To: <4A7B39DB.3030602@cs.uchicago.edu>
References: <4A7B39DB.3030602@cs.uchicago.edu>
Message-ID: <50b07b4b0908061324i6682c441v7e288d3b43c149b1@mail.gmail.com>

hi yi,

in the swift.properties file, change foreach.max.threads=1024 to
foreach.max.threads=N, where N is the maximum number of concurrent jobs
you want per swift session.

Also, when you set the score to be ridiculously high (ie 10000), you
always get the maximum theoretical number of jobs you want based on the
throttling parameters.

-Allan

2009/8/6 Yi Zhu :
> Hi, all
>
> As we've already know, Swift dynamically change the maximum number of concurrent jobs allowed on a site based on the performance history of that site. According to swift Document: Each site is assigned a score (initially 1), which can increase or decrease based on whether the site yields successful or faulty job runs. The score for a site can take values in the (0.1, 100) interval. The number of allowed jobs is calculated using the following formula:
>
> 2 + score*throttle.score.job.factor
>
> We can change the throttle.score.job.factor in sites.xml or swift.properties files, but since the "score" value can be increased/decreased during the execution, It seems that we can not really set the maximum number of concurrent jobs allowed on a site to a fixed number. Anyone have any idea of that?
>
>
> Many Thanks.
>
> -Yi Zhu
> _______________________________________________

From andrey.fedorov at gmail.com Tue Aug 25 09:49:42 2009
From: andrey.fedorov at gmail.com (Andrey Fedorov)
Date: Tue, 25 Aug 2009 10:49:42 -0400
Subject: [Swift-user] Problems getting started with coasters
Message-ID: <82f536810908250749ve4c2c09xbfa5d6343ad5231c@mail.gmail.com>

Hi,

I have a processing step that takes roughly 2-5 min. It takes two ~5 MB
files as input and produces a small text file, which I need to store. I
need to compute a large number of such jobs, using different parameters.
It seems to me "coaster" is the best execution provider for my
application.
Trying to start simple, I am running the first.swift (echo) example that
comes with Swift using different providers: GT2, GT4, GT2/coaster, and
GT4/coaster. All of this is done on the Abe NCSA cluster.

Here's my sites.xml:

    /u/ac/fedorov/scratch-global/scratch

    /u/ac/fedorov/scratch-global/scratch

    /u/ac/fedorov/scratch-global/scratch

    /u/ac/fedorov/scratch-global/scratch

And tc.data is simply

Abe-GT4-coasters echo /bin/echo INSTALLED INTEL32::LINUX null

and I change the site to test different providers.

Now, results:

1) both GT2 and GT4 providers work fine, script completes

2) with GT2+coaster provider, I can see the job in the PBS queue
(requested time is 01:41; I guess this comes from the default coaster
parameters, which I didn't change). The job appears to finish
successfully, but then I get this error:

Final status: Finished successfully:1
START cleanups=[[first-20090825-0925-emkt2qt0, Abe-GT2-coasters]]
START dir=first-20090825-0925-emkt2qt0 host=Abe-GT2-coasters
Sending Command(21, SUBMITJOB) on GSSSChannel-null(1)
Command(21, SUBMITJOB) CMD: Command(21, SUBMITJOB)
GSSSChannel-null(1) REPL: Command(21, SUBMITJOB)
Submitted task Task(type=JOB_SUBMISSION, identity=urn:0-1-1251210343871). Job id: urn:1251210343871-1251210376098-1251210376099
Unregistering Command(21, SUBMITJOB)
GSSSChannel-null(1) REQ: Handler(JOBSTATUS)
GSSSChannel-null(1) REQ: Handler(JOBSTATUS)
Task(type=JOB_SUBMISSION, identity=urn:0-1-1251210343871) Completed. Waiting: 0, Running: 0. Heap size: 65M, Heap free: 42M, Max heap: 227M
END dir=first-20090825-0925-emkt2qt0 host=Abe-GT2-coasters
Cleaning up...
Shutting down service at https://141.142.68.180:45552
Got channel MetaChannel: 500265006 -> GSSSChannel-null(1)
Sending Command(22, SHUTDOWNSERVICE) on GSSSChannel-null(1)
Command(22, SHUTDOWNSERVICE) CMD: Command(22, SHUTDOWNSERVICE)
Command(22, SHUTDOWNSERVICE): handling reply timeout
Command(22, SHUTDOWNSERVICE): failed too many times
org.globus.cog.karajan.workflow.service.ReplyTimeoutException
        at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:241)
        at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:246)
        at java.util.TimerThread.mainLoop(Timer.java:512)
        at java.util.TimerThread.run(Timer.java:462)
- Done

3) with GT4-coaster provider, I don't get as far as with GT2-coaster.
Possibly I am not setting up the site entry properly. I was not able
to find any examples in the manual of how to set up coasters with GT4
(can anyone provide an example?). Here's the error:

Failed to transfer wrapper log from first-20090825-0929-39x94x09/info/t on Abe-GT4-coasters
END_FAILURE thread=0 tr=echo
Progress: Failed:1
Execution failed:
        Exception in echo:
Arguments: [Hello, world!]
Host: Abe-GT4-coasters
Directory: first-20090825-0929-39x94x09/jobs/t/echo-t5oymmfj
stderr.txt:

stdout.txt:

----

Caused by:
        Cannot submit job: Limited proxy is not accepted

Can anybody help me figure this out?

Thanks
--
Andriy Fedorov, Ph.D.
Research Fellow
Brigham and Women's Hospital
Harvard Medical School
75 Francis Street
Boston, MA 02115 USA

fedorov at bwh.harvard.edu

From fedorov at bwh.harvard.edu Tue Aug 25 09:58:04 2009
From: fedorov at bwh.harvard.edu (Andriy Fedorov)
Date: Tue, 25 Aug 2009 10:58:04 -0400
Subject: [Swift-user] Fwd: Problems getting started with coasters
In-Reply-To: <82f536810908250749ve4c2c09xbfa5d6343ad5231c@mail.gmail.com>
References: <82f536810908250749ve4c2c09xbfa5d6343ad5231c@mail.gmail.com>
Message-ID: <82f536810908250758m25254058g3cda58997eb9adc2@mail.gmail.com>

Hi,

I have a processing step that takes roughly 2-5 min. It takes two ~5 MB
files as input and produces a small text file, which I need to store. I
need to compute a large number of such jobs, using different parameters.
It seems to me "coaster" is the best execution provider for my
application.

Trying to start simple, I am running the first.swift (echo) example that
comes with Swift using different providers: GT2, GT4, GT2/coaster, and
GT4/coaster. All of this is done on the Abe NCSA cluster.

Here's my sites.xml:

    /u/ac/fedorov/scratch-global/scratch

    /u/ac/fedorov/scratch-global/scratch

    /u/ac/fedorov/scratch-global/scratch

    /u/ac/fedorov/scratch-global/scratch

And tc.data is simply

Abe-GT4-coasters echo /bin/echo INSTALLED INTEL32::LINUX null

and I change the site to test different providers.

Now, results:

1) both GT2 and GT4 providers work fine, script completes

2) with GT2+coaster provider, I can see the job in the PBS queue
(requested time is 01:41; I guess this comes from the default coaster
parameters, which I didn't change).
The job appears to finish successfully, and it seems like the output
file is fetched back, but then I get this error:

Final status:  Finished successfully:1
START cleanups=[[first-20090825-0925-emkt2qt0, Abe-GT2-coasters]]
START dir=first-20090825-0925-emkt2qt0 host=Abe-GT2-coasters
Sending Command(21, SUBMITJOB) on GSSSChannel-null(1)
Command(21, SUBMITJOB) CMD: Command(21, SUBMITJOB)
GSSSChannel-null(1) REPL: Command(21, SUBMITJOB)
Submitted task Task(type=JOB_SUBMISSION, identity=urn:0-1-1251210343871). Job id: urn:1251210343871-1251210376098-1251210376099
Unregistering Command(21, SUBMITJOB)
GSSSChannel-null(1) REQ: Handler(JOBSTATUS)
GSSSChannel-null(1) REQ: Handler(JOBSTATUS)
Task(type=JOB_SUBMISSION, identity=urn:0-1-1251210343871) Completed. Waiting: 0, Running: 0. Heap size: 65M, Heap free: 42M, Max heap: 227M
END dir=first-20090825-0925-emkt2qt0 host=Abe-GT2-coasters
Cleaning up...
Shutting down service at https://141.142.68.180:45552
Got channel MetaChannel: 500265006 -> GSSSChannel-null(1)
Sending Command(22, SHUTDOWNSERVICE) on GSSSChannel-null(1)
Command(22, SHUTDOWNSERVICE) CMD: Command(22, SHUTDOWNSERVICE)
Command(22, SHUTDOWNSERVICE): handling reply timeout
Command(22, SHUTDOWNSERVICE): failed too many times
org.globus.cog.karajan.workflow.service.ReplyTimeoutException
        at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:241)
        at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:246)
        at java.util.TimerThread.mainLoop(Timer.java:512)
        at java.util.TimerThread.run(Timer.java:462)
- Done

3) with GT4-coaster provider, I don't get as far as with GT2-coaster.
Possibly I am not setting up the site entry properly. I was not able
to find any examples in the manual of how to set up coasters with GT4
(can anyone provide an example?).
Here's the error:

Failed to transfer wrapper log from first-20090825-0929-39x94x09/info/t on Abe-GT4-coasters
END_FAILURE thread=0 tr=echo
Progress:  Failed:1
Execution failed:
        Exception in echo:
Arguments: [Hello, world!]
Host: Abe-GT4-coasters
Directory: first-20090825-0929-39x94x09/jobs/t/echo-t5oymmfj
stderr.txt:

stdout.txt:

----

Caused by:
        Cannot submit job: Limited proxy is not accepted

Can anybody help figuring this out?

Thanks
--
Andriy Fedorov, Ph.D.
Research Fellow
Brigham and Women's Hospital
Harvard Medical School
75 Francis Street
Boston, MA 02115 USA

fedorov at bwh.harvard.edu

From wilde at mcs.anl.gov Tue Aug 25 10:31:26 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 25 Aug 2009 10:31:26 -0500
Subject: [Swift-user] Problems getting started with coasters
In-Reply-To: <82f536810908250749ve4c2c09xbfa5d6343ad5231c@mail.gmail.com>
References: <82f536810908250749ve4c2c09xbfa5d6343ad5231c@mail.gmail.com>
Message-ID: <4A9403CE.4050303@mcs.anl.gov>

Andrey,

On 8/25/09 9:49 AM, Andrey Fedorov wrote:
> Hi,
>
> I have a processing step that takes somewhere ~2-5 min. It takes on
> input two ~5Mb files, and produces a small text file, which I need to
> store. I need to compute large number of such jobs, using different
> parameters. It seems to me "coaster" is the best execution provider
> for my application.
>
> Trying to start simple, I am running first.swift (echo) example that
> comes with Swift using different providers: GT2, GT4, GT2/coaster, and
> GT4/coaster. All of this is done on Abe NCSA cluster.
> > Here's my sites.xml: > > > > url="https://grid-abe.ncsa.teragrid.org:8443/wsrf/services/ManagedJobFactoryService"/> > /u/ac/fedorov/scratch-global/scratch > > > > > url="https://grid-abe.ncsa.teragrid.org:8443/wsrf/services/ManagedJobFactoryService"/> > /u/ac/fedorov/scratch-global/scratch > > > > > url="grid-abe.ncsa.teragrid.org:2119/jobmanager-pbs"/> > /u/ac/fedorov/scratch-global/scratch > > > > > url="grid-abe.ncsa.teragrid.org"/> > > /u/ac/fedorov/scratch-global/scratch > > > And tc.data is simply > > Abe-GT4-coasters echo /bin/echo INSTALLED INTEL32::LINUX null > > and I change the site to test different providers. > > Now, results: > > 1) both GT2 and GT4 providers work fine, script completes > > 2) with GT2+coaster provider, I can see the job in the PBS queue > (requested time is 01:41, I guess this comes with the default coaster > parameters, that I didn't change). The job appears to finish > successfully, but then I get this error: > > Final status: Finished successfully:1 > START cleanups=[[first-20090825-0925-emkt2qt0, Abe-GT2-coasters]] > START dir=first-20090825-0925-emkt2qt0 host=Abe-GT2-coasters > Sending Command(21, SUBMITJOB) on GSSSChannel-null(1) > Command(21, SUBMITJOB) CMD: Command(21, SUBMITJOB) > GSSSChannel-null(1) REPL: Command(21, SUBMITJOB) > Submitted task Task(type=JOB_SUBMISSION, > identity=urn:0-1-1251210343871). Job id: > urn:1251210343871-1251210376098-1251210376099 > Unregistering Command(21, SUBMITJOB) > GSSSChannel-null(1) REQ: Handler(JOBSTATUS) > GSSSChannel-null(1) REQ: Handler(JOBSTATUS) > Task(type=JOB_SUBMISSION, identity=urn:0-1-1251210343871) Completed. > Waiting: 0, Running: 0. Heap size: 65M, Heap free: 42M, Max heap: 227M > END dir=first-20090825-0925-emkt2qt0 host=Abe-GT2-coasters > Cleaning up... 
> Shutting down service at https://141.142.68.180:45552
> Got channel MetaChannel: 500265006 -> GSSSChannel-null(1)
> Sending Command(22, SHUTDOWNSERVICE) on GSSSChannel-null(1)
> Command(22, SHUTDOWNSERVICE) CMD: Command(22, SHUTDOWNSERVICE)
> Command(22, SHUTDOWNSERVICE): handling reply timeout
> Command(22, SHUTDOWNSERVICE): failed too many times
> org.globus.cog.karajan.workflow.service.ReplyTimeoutException
> at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:241)
> at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:246)
> at java.util.TimerThread.mainLoop(Timer.java:512)
> at java.util.TimerThread.run(Timer.java:462)
> - Done

This seems like a low-priority error. I'll file it in Bugzilla for now. Let's see how coasters work for you on Abe using your real app and a larger number of jobs, and come back to this shutdown problem if it proves to be a blocker to getting work done.

Coasters has a few other current issues, mainly not throttling work efficiently, that we have a fix for; we need to apply and test that fix first.

We've also been experimenting with a non-coaster way to use all 8 cores of machines like Abe, but let's try the coaster route first, if that's OK with you, and let's focus on GT2/Coasters, as that will be more common.

In addition, there is a test version of GT GRAM5 on QueenBee, Abe's sister system at LSU, which we can try, assuming your TG project lets you run there.

So please try to run the app, and we will try to get the latest coaster fixes committed. (I assume you are comfortable extracting Swift from svn and building it; if you have not done this before, can you try it, Andrey?)

Regards,

Mike

> 3) with GT4-coaster provider, I don't get as far as with GT2-coaster.
> Possibly I am not setting up properly the site entry. I was not able
> to find any examples in the manual how to set coasters with GT4 (can
> anyone provide an example?).
Here's the error: > > Failed to transfer wrapper log from > first-20090825-0929-39x94x09/info/t on Abe-GT4-coasters > END_FAILURE thread=0 tr=echo > Progress: Failed:1 > Execution failed: > Exception in echo: > Arguments: [Hello, world!] > Host: Abe-GT4-coasters > Directory: first-20090825-0929-39x94x09/jobs/t/echo-t5oymmfj > stderr.txt: > > stdout.txt: > > ---- > > Caused by: > Cannot submit job: Limited proxy is not accepted > > > Can anybody help figuring this out? > > Thanks > -- > Andriy Fedorov, Ph.D. > > Research Fellow > Brigham and Women's Hospital > Harvard Medical School > 75 Francis Street > Boston, MA 02115 USA > fedorov at bwh.harvard.edu > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user From fedorov at bwh.harvard.edu Tue Aug 25 10:44:38 2009 From: fedorov at bwh.harvard.edu (Andriy Fedorov) Date: Tue, 25 Aug 2009 11:44:38 -0400 Subject: [Swift-user] Problems getting started with coasters In-Reply-To: <4A9403CE.4050303@mcs.anl.gov> References: <82f536810908250749ve4c2c09xbfa5d6343ad5231c@mail.gmail.com> <4A9403CE.4050303@mcs.anl.gov> Message-ID: <82f536810908250844l69c2b2e8oe5caf3d73fcd46cb@mail.gmail.com> Michael, Thanks for the reply. So my understanding is, I should check out the trunk version and compile (yes, I've done this before), and try the real application with GT2+coasters. I do have an account on Queen Bee. You say, it has GT GRAM5, but I thought you also said I should target using GT2. What is GRAM5? At this point, my preference is the system with lowest load and confirmed functional coaster provider, to save time debugging and getting up to speed. Should I use Abe or Queen Bee? As soon as I compile the current swift trunk and try GT2+coaster @Abe for my application, I will report to the list my experience. -- Andriy Fedorov, Ph.D. 
Research Fellow
Brigham and Women's Hospital
Harvard Medical School
75 Francis Street
Boston, MA 02115 USA
fedorov at bwh.harvard.edu

From hategan at mcs.anl.gov Tue Aug 25 10:49:49 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 25 Aug 2009 10:49:49 -0500
Subject: [Swift-user] Problems getting started with coasters
In-Reply-To: <82f536810908250749ve4c2c09xbfa5d6343ad5231c@mail.gmail.com>
References: <82f536810908250749ve4c2c09xbfa5d6343ad5231c@mail.gmail.com>
Message-ID: <1251215389.29699.8.camel@blabla>

(2) This isn't strictly a bug. When shutting down a coaster service the client sends a shutdown command to the service, which it hopes will be acknowledged. The service acknowledges it and then terminates. However, there is no guarantee now that the termination will happen after the acknowledgement message is sent (which is something that could be corrected I guess)

However, the client only tries to shut down the service. It is not an error condition if it doesn't succeed, but a diagnostic message gets printed.

(3) I know what's happening. That is a bug. When using gt4:gt4:xxx, delegation needs to be enabled on the first step. Delegation is disabled (as much as possible) by default in all the providers. There should be a fix in SVN this week.

Mihael

On Tue, 2009-08-25 at 10:49 -0400, Andrey Fedorov wrote:
> Hi,
>
> I have a processing step that takes somewhere ~2-5 min. It takes on
> input two ~5Mb files, and produces a small text file, which I need to
> store. I need to compute large number of such jobs, using different
> parameters. It seems to me "coaster" is the best execution provider
> for my application.
> > Trying to start simple, I am running first.swift (echo) example that > comes with Swift using different providers: GT2, GT4, GT2/coaster, and > GT4/coaster. All of this is done on Abe NCSA cluster. > > Here's my sites.xml: > > > > url="https://grid-abe.ncsa.teragrid.org:8443/wsrf/services/ManagedJobFactoryService"/> > /u/ac/fedorov/scratch-global/scratch > > > > > url="https://grid-abe.ncsa.teragrid.org:8443/wsrf/services/ManagedJobFactoryService"/> > /u/ac/fedorov/scratch-global/scratch > > > > > url="grid-abe.ncsa.teragrid.org:2119/jobmanager-pbs"/> > /u/ac/fedorov/scratch-global/scratch > > > > > url="grid-abe.ncsa.teragrid.org"/> > > /u/ac/fedorov/scratch-global/scratch > > > And tc.data is simply > > Abe-GT4-coasters echo /bin/echo INSTALLED INTEL32::LINUX null > > and I change the site to test different providers. > > Now, results: > > 1) both GT2 and GT4 providers work fine, script completes > > 2) with GT2+coaster provider, I can see the job in the PBS queue > (requested time is 01:41, I guess this comes with the default coaster > parameters, that I didn't change). The job appears to finish > successfully, but then I get this error: > > Final status: Finished successfully:1 > START cleanups=[[first-20090825-0925-emkt2qt0, Abe-GT2-coasters]] > START dir=first-20090825-0925-emkt2qt0 host=Abe-GT2-coasters > Sending Command(21, SUBMITJOB) on GSSSChannel-null(1) > Command(21, SUBMITJOB) CMD: Command(21, SUBMITJOB) > GSSSChannel-null(1) REPL: Command(21, SUBMITJOB) > Submitted task Task(type=JOB_SUBMISSION, > identity=urn:0-1-1251210343871). Job id: > urn:1251210343871-1251210376098-1251210376099 > Unregistering Command(21, SUBMITJOB) > GSSSChannel-null(1) REQ: Handler(JOBSTATUS) > GSSSChannel-null(1) REQ: Handler(JOBSTATUS) > Task(type=JOB_SUBMISSION, identity=urn:0-1-1251210343871) Completed. > Waiting: 0, Running: 0. Heap size: 65M, Heap free: 42M, Max heap: 227M > END dir=first-20090825-0925-emkt2qt0 host=Abe-GT2-coasters > Cleaning up... 
> Shutting down service at https://141.142.68.180:45552 > Got channel MetaChannel: 500265006 -> GSSSChannel-null(1) > Sending Command(22, SHUTDOWNSERVICE) on GSSSChannel-null(1) > Command(22, SHUTDOWNSERVICE) CMD: Command(22, SHUTDOWNSERVICE) > Command(22, SHUTDOWNSERVICE): handling reply timeout > Command(22, SHUTDOWNSERVICE): failed too many times > org.globus.cog.karajan.workflow.service.ReplyTimeoutException > at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:241) > at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:246) > at java.util.TimerThread.mainLoop(Timer.java:512) > at java.util.TimerThread.run(Timer.java:462) > - Done > > 3) with GT4-coaster provider, I don't get as far as with GT2-coaster. > Possibly I am not setting up properly the site entry. I was not able > to find any examples in the manual how to set coasters with GT4 (can > anyone provide an example?). Here's the error: > > Failed to transfer wrapper log from > first-20090825-0929-39x94x09/info/t on Abe-GT4-coasters > END_FAILURE thread=0 tr=echo > Progress: Failed:1 > Execution failed: > Exception in echo: > Arguments: [Hello, world!] > Host: Abe-GT4-coasters > Directory: first-20090825-0929-39x94x09/jobs/t/echo-t5oymmfj > stderr.txt: > > stdout.txt: > > ---- > > Caused by: > Cannot submit job: Limited proxy is not accepted > > > Can anybody help figuring this out? > > Thanks > -- > Andriy Fedorov, Ph.D. 
>
> Research Fellow
> Brigham and Women's Hospital
> Harvard Medical School
> 75 Francis Street
> Boston, MA 02115 USA
> fedorov at bwh.harvard.edu
> _______________________________________________
> Swift-user mailing list
> Swift-user at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user

From wilde at mcs.anl.gov Tue Aug 25 11:54:37 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 25 Aug 2009 11:54:37 -0500
Subject: [Swift-user] Problems getting started with coasters
In-Reply-To: <82f536810908250844l69c2b2e8oe5caf3d73fcd46cb@mail.gmail.com>
References: <82f536810908250749ve4c2c09xbfa5d6343ad5231c@mail.gmail.com> <4A9403CE.4050303@mcs.anl.gov> <82f536810908250844l69c2b2e8oe5caf3d73fcd46cb@mail.gmail.com>
Message-ID: <4A94174D.3070901@mcs.anl.gov>

On 8/25/09 10:44 AM, Andriy Fedorov wrote:
> Michael,
>
> Thanks for the reply.
>
> So my understanding is, I should check out the trunk version and
> compile (yes, I've done this before), and try the real application
> with GT2+coasters.

Yes, that's a good step to re-master, in preparation for Mihael checking in Coaster fixes. He made significant enhancements to Coasters in the past 2 months, but has been working on a different project lately and thus these are not yet sufficiently tested. If you're willing to help in the testing, that would be great.

If not, I think the next best approach to try is this:

- We have a small experimental mod that enables Swift GRAM2 jobs to use all cores of multi-core hosts (such as the 8-core hosts on Abe and QueenBee). Basically it uses the Swift clustering facility but runs jobs in parallel instead of serially.

It works well if your jobs have a very uniform runtime. If they don't, then it wastes CPU. But it's a good interim solution for many apps until coasters is more stable.

This is described at:
http://www.ci.uchicago.edu/wiki/bin/view/SWFT/SwiftParallelClustering

This info is very preliminary and not end-user ready.
Tibi Stef-Praun, on this list, has tried it. Please start a new thread here if you want to discuss it or report experiences or problems with it.

- On QueenBee or other GRAM5-enabled systems (not many yet, as it's in test mode) you can use the GRAM2 provider if submitting remotely. On Abe and any other GRAM2 systems you should run this with the Condor-G provider if submitting remotely.

The rule of thumb here for submitting jobs to a site from Swift running remotely on a submit host is:

-- up to 20 jobs in parallel you can use plain GRAM2
-- above 20 jobs, use Condor-G or, where available, GRAM5

- On Abe, QueenBee, and other PBS systems with login hosts, you can run Swift locally on the login host, and use the PBS provider with the parallel clustering approach.

We have a few other solutions that I will save till we explore these two solutions.

To prepare for this, try running your app on Abe using the PBS provider, with just 1 or 2 jobs, then try the parallel clustering tip above.

> I do have an account on Queen Bee. You say, it has GT GRAM5, but I
> thought you also said I should target using GT2. What is GRAM5?

GRAM5 is a new, more efficient version of GRAM2. It's fully compatible, so you just set Swift sites.xml exactly as for GRAM2. The only thing that changes is that you use a different URL for the GRAM gatekeeper contact string (i.e. a different host and/or port, that's all).

I'll need to get you the contact string for GRAM5 on QueenBee if/when we both agree the time is right to try it.

> At
> this point, my preference is the system with lowest load and confirmed
> functional coaster provider, to save time debugging and getting up to
> speed. Should I use Abe or Queen Bee?

That's hard to answer, as the loads fluctuate. You can examine the TeraPort system load monitor in the TG portal, which gives some rough estimates of load and queue time. Then queue the jobs and wait.
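On the parallel clustering option above, a rough sketch of why uniform runtimes matter: if a cluster of jobs is started together on one 8-core node and the node is held until the slowest job ends, every core that finishes early sits idle. The Python below is only an illustration under those assumptions (the function name and the batching model are hypothetical, not how Swift implements clustering); runtimes are in seconds, picked from the 2-5 minute range mentioned earlier in the thread.

```python
def cluster_cpu_waste(runtimes, cores=8):
    """Estimate wall time and idle core-seconds for parallel clustering.

    Assumed model (hypothetical): jobs are grouped into clusters of
    `cores` jobs, each cluster runs on one node with all jobs started
    together, and the node is held until the slowest job finishes.
    Clusters run one after another.
    """
    wall = 0.0
    wasted = 0.0
    for start in range(0, len(runtimes), cores):
        batch = runtimes[start:start + cores]
        longest = max(batch)
        wall += longest
        # All cores are reserved for `longest` seconds; anything not
        # spent running a job (including empty slots) counts as idle.
        wasted += cores * longest - sum(batch)
    return wall, wasted


uniform = [180] * 8          # eight 3-minute jobs
skewed = [120] * 7 + [300]   # seven 2-minute jobs and one 5-minute job

print(cluster_cpu_waste(uniform))  # (180.0, 0.0)
print(cluster_cpu_waste(skewed))   # (300.0, 1260.0)
```

With uniform jobs nothing is idle, while a single 5-minute straggler among 2-minute jobs leaves seven cores idle for 3 minutes each (1260 core-seconds).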
Best to run Swift under screen, so you can easily wait for and monitor your script executions from anywhere, and not be interrupted if long delays are encountered.

- Mike

From wilde at mcs.anl.gov Tue Aug 25 12:04:12 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 25 Aug 2009 12:04:12 -0500
Subject: [Swift-user] Problems getting started with coasters
In-Reply-To: <4A94174D.3070901@mcs.anl.gov>
References: <82f536810908250749ve4c2c09xbfa5d6343ad5231c@mail.gmail.com> <4A9403CE.4050303@mcs.anl.gov> <82f536810908250844l69c2b2e8oe5caf3d73fcd46cb@mail.gmail.com> <4A94174D.3070901@mcs.anl.gov>
Message-ID: <4A94198C.8090109@mcs.anl.gov>

Andrey, good news: GRAM5 is now available on Abe as well.

Info and contact URLs, as well as some Swift usage experience reports, are at:
http://dev.globus.org/wiki/GRAM/GRAM5#Deployments

So with this in mind, a good approach is:

- sanity test your app using the PBS provider on Abe, with swift on the login host, just 1 or 2 jobs
- sanity test 16 to 64 or so jobs, adding parallel clustering to the above
- change from the PBS provider to the GRAM2 (pre-WS-GRAM) provider, but using the GRAM URLs at http://dev.globus.org/wiki/GRAM/GRAM5#Deployments (still submitting from the Abe login host to Abe. You can keep the local data provider for this case)
- Add Queenbee GRAM5 as a second site, using the gridftp data provider.

Mike

On 8/25/09 11:54 AM, Michael Wilde wrote:
> On 8/25/09 10:44 AM, Andriy Fedorov wrote:
>> Michael,
>>
>> Thanks for the reply.
>>
>> So my understanding is, I should check out the trunk version and
>> compile (yes, I've done this before), and try the real application
>> with GT2+coasters.
>
> Yes, thats a good step to re-master, in preparation for Mihael checking
> in Coaster fixes.
He made significant enhancements to Coasters in the > past 2 months, but has been working ona different project lately and > thus these are not yet sufficiently tested. If you're willing to help in > the testing that would be great. > > If not, I think the next best approach to try is this: > > - We have a small experimental mod that enables Swift GRAM2 jobs to use > all cores of multi-core hosts (such as the 8-core hosts on Abe and > QueenBee). Basically it uses the Swift clustering facility but runs jobs > in parallel instead of serially. > > It works well if your jobs have a very uniform runtime. If they dont, > then it wastes CPU. But its a good interim solution for many apps until > coasters is more stable. > > This is described at: > http://www.ci.uchicago.edu/wiki/bin/view/SWFT/SwiftParallelClustering > > This info is very preliminary and not end-user ready. Tibi Stef-Praun, > on this list has tried it. Please start a new thread here if you want to > discuss it or report experiences or problems with it. > > - On QueenBee or other GRAM5-enabled systems (not many test as its in > test mode) you can use the GRAM2 provider if submitting remotely. > On Abe and any other GRAM2 systems you should run this with the Condor-G > provider if submitting remotely. > > The rule of thumb here for submitting jobs to a site from Swift running > remotely on a submit host is: > > -- up to 20 jobs in parallel you can use plain GRAM2 > -- above 20 jobs, use Condor-G or, where available, GRAM2 > > - On Abe, QueenBee, and other PBS systems with login hosts, you can run > Swift locally on the login host, and use the PBS provider with the > parallel clustering approach. > > We have a few other solutions that I will save till we explore these two > solutions. > > To prepare for this, try running your app on Abe using the PBS provider, > with just 1 or 2 jobs, then try the parallel clustering tip above. > >> I do have an account on Queen Bee. 
You say, it has GT GRAM5, but I >> thought you also said I should target using GT2. What is GRAM5? > > GRAM5 is a new, more efficient version of GRAM2. Its fully compatible, > so you just set Swift sites.xml exactly as for GRAM2. The only thing > that changes is that you use a different URL for the GRAM gatekeeper > contact string (ie different host and/or port, thats all). > > I'll need to get you the contact string for GRAM5 on QueenBee if/when we > both agree the time is right to try it. > >> At >> this point, my preference is the system with lowest load and confirmed >> functional coaster provider, to save time debugging and getting up to >> speed. Should I use Abe or Queen Bee? > > Thats hard to answer, as the loads fluctuate. You can examine the > TeraPort system load monitor in the TG portal, which gives some rough > estimates of load and queue time. Then queue the jobs and wait. Best to > run Swift under screen, so you can easily wait for and monitor your > script executions from anywhere, and not be interrupted if long delays > are encountered. > > - Mike > >> As soon as I compile the current swift trunk and try GT2+coaster @Abe >> for my application, I will report to the list my experience. >> >> -- >> Andriy Fedorov, Ph.D. >> >> Research Fellow >> Brigham and Women's Hospital >> Harvard Medical School >> 75 Francis Street >> Boston, MA 02115 USA >> fedorov at bwh.harvard.edu >> >> >> >> On Tue, Aug 25, 2009 at 11:31, Michael Wilde wrote: >>> Andrey, >>> >>> On 8/25/09 9:49 AM, Andrey Fedorov wrote: >>>> Hi, >>>> >>>> I have a processing step that takes somewhere ~2-5 min. It takes on >>>> input two ~5Mb files, and produces a small text file, which I need to >>>> store. I need to compute large number of such jobs, using different >>>> parameters. It seems to me "coaster" is the best execution provider >>>> for my application. 
>>>> >>>> Trying to start simple, I am running first.swift (echo) example that >>>> comes with Swift using different providers: GT2, GT4, GT2/coaster, and >>>> GT4/coaster. All of this is done on Abe NCSA cluster. >>>> >>>> Here's my sites.xml: >>>> >>>> >>>> >>>> >>> >>>> url="https://grid-abe.ncsa.teragrid.org:8443/wsrf/services/ManagedJobFactoryService"/> >>>> /u/ac/fedorov/scratch-global/scratch >>>> >>>> >>>> >>>> >>>> >>> >>>> url="https://grid-abe.ncsa.teragrid.org:8443/wsrf/services/ManagedJobFactoryService"/> >>>> /u/ac/fedorov/scratch-global/scratch >>>> >>>> >>>> >>>> >>>> >>> url="grid-abe.ncsa.teragrid.org:2119/jobmanager-pbs"/> >>>> /u/ac/fedorov/scratch-global/scratch >>>> >>>> >>>> >>>> >>>> >>> url="grid-abe.ncsa.teragrid.org"/> >>>> >>>> /u/ac/fedorov/scratch-global/scratch >>>> >>>> >>>> And tc.data is simply >>>> >>>> Abe-GT4-coasters echo /bin/echo INSTALLED INTEL32::LINUX null >>>> >>>> and I change the site to test different providers. >>>> >>>> Now, results: >>>> >>>> 1) both GT2 and GT4 providers work fine, script completes >>>> >>>> 2) with GT2+coaster provider, I can see the job in the PBS queue >>>> (requested time is 01:41, I guess this comes with the default coaster >>>> parameters, that I didn't change). The job appears to finish >>>> successfully, but then I get this error: >>>> >>>> Final status: Finished successfully:1 >>>> START cleanups=[[first-20090825-0925-emkt2qt0, Abe-GT2-coasters]] >>>> START dir=first-20090825-0925-emkt2qt0 host=Abe-GT2-coasters >>>> Sending Command(21, SUBMITJOB) on GSSSChannel-null(1) >>>> Command(21, SUBMITJOB) CMD: Command(21, SUBMITJOB) >>>> GSSSChannel-null(1) REPL: Command(21, SUBMITJOB) >>>> Submitted task Task(type=JOB_SUBMISSION, >>>> identity=urn:0-1-1251210343871). 
Job id: >>>> urn:1251210343871-1251210376098-1251210376099 >>>> Unregistering Command(21, SUBMITJOB) >>>> GSSSChannel-null(1) REQ: Handler(JOBSTATUS) >>>> GSSSChannel-null(1) REQ: Handler(JOBSTATUS) >>>> Task(type=JOB_SUBMISSION, identity=urn:0-1-1251210343871) Completed. >>>> Waiting: 0, Running: 0. Heap size: 65M, Heap free: 42M, Max heap: 227M >>>> END dir=first-20090825-0925-emkt2qt0 host=Abe-GT2-coasters >>>> Cleaning up... >>>> Shutting down service at https://141.142.68.180:45552 >>>> Got channel MetaChannel: 500265006 -> GSSSChannel-null(1) >>>> Sending Command(22, SHUTDOWNSERVICE) on GSSSChannel-null(1) >>>> Command(22, SHUTDOWNSERVICE) CMD: Command(22, SHUTDOWNSERVICE) >>>> Command(22, SHUTDOWNSERVICE): handling reply timeout >>>> Command(22, SHUTDOWNSERVICE): failed too many times >>>> org.globus.cog.karajan.workflow.service.ReplyTimeoutException >>>> at >>>> org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:241) >>>> at >>>> org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:246) >>>> at java.util.TimerThread.mainLoop(Timer.java:512) >>>> at java.util.TimerThread.run(Timer.java:462) >>>> - Done >>> This seems like a low-prio error. I'll file it in bugzilla for now. Lets see >>> how coasters works for you on Abe using your real app and a larger number of >>> jobs, and come back to this shutdown problem if it proves to be a blocker to >>> getting work done. >>> >>> Coasters has a few other current issues - mainly not throttling work >>> efficiently - that we have a fix for, and need to apply and test that one >>> first. >>> >>> We've also been experimenting with a non-coaster way to use all 8 cores of >>> machines like Abe, but lets try the coaster route first, of thats OK with >>> you, and lets focus on GT2/Coasters, as that will be more common. 
>>> >>> In addition, there is a test version of GT GRAM5 on QueenBee, Abe's >>> sister-system at LSU, which we can try, assuming your TG project lets you >>> run there. >>> >>> So please try to run the app, and we will try to get the latest coaster >>> fixes committed. (I assume you are comfortable extracting Swift from svn and >>> building it; if you have not done this before, can you try it, Andrey?) >>> >>> Regards, >>> >>> Mike >>> >>> >>>> 3) with GT4-coaster provider, I don't get as far as with GT2-coaster. >>>> Possibly I am not setting up properly the site entry. I was not able >>>> to find any examples in the manual how to set coasters with GT4 (can >>>> anyone provide an example?). Here's the error: >>>> >>>> Failed to transfer wrapper log from >>>> first-20090825-0929-39x94x09/info/t on Abe-GT4-coasters >>>> END_FAILURE thread=0 tr=echo >>>> Progress: Failed:1 >>>> Execution failed: >>>> Exception in echo: >>>> Arguments: [Hello, world!] >>>> Host: Abe-GT4-coasters >>>> Directory: first-20090825-0929-39x94x09/jobs/t/echo-t5oymmfj >>>> stderr.txt: >>>> >>>> stdout.txt: >>>> >>>> ---- >>>> >>>> Caused by: >>>> Cannot submit job: Limited proxy is not accepted >>>> >>>> >>>> Can anybody help figuring this out? >>>> >>>> Thanks >>>> -- >>>> Andriy Fedorov, Ph.D. 
>>>>
>>>> Research Fellow
>>>> Brigham and Women's Hospital
>>>> Harvard Medical School
>>>> 75 Francis Street
>>>> Boston, MA 02115 USA
>>>> fedorov at bwh.harvard.edu
>>>> _______________________________________________
>>>> Swift-user mailing list
>>>> Swift-user at ci.uchicago.edu
>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user
>

From fedorov at bwh.harvard.edu Tue Aug 25 12:11:21 2009
From: fedorov at bwh.harvard.edu (Andriy Fedorov)
Date: Tue, 25 Aug 2009 13:11:21 -0400
Subject: [Swift-user] Problems getting started with coasters
In-Reply-To: <4A94198C.8090109@mcs.anl.gov>
References: <82f536810908250749ve4c2c09xbfa5d6343ad5231c@mail.gmail.com>
    <4A9403CE.4050303@mcs.anl.gov>
    <82f536810908250844l69c2b2e8oe5caf3d73fcd46cb@mail.gmail.com>
    <4A94174D.3070901@mcs.anl.gov>
    <4A94198C.8090109@mcs.anl.gov>
Message-ID: <82f536810908251011g1529e26fpc7613bc414d4b2a@mail.gmail.com>

Michael --

Sounds like a plan, thanks :)

Let me digest this, and give it a try. I should get back to you and
the list with the report on my experience later this week (or earlier,
if I come across a stopper...)

--
Andriy Fedorov, Ph.D.

Research Fellow
Brigham and Women's Hospital
Harvard Medical School
75 Francis Street
Boston, MA 02115 USA
fedorov at bwh.harvard.edu

On Tue, Aug 25, 2009 at 13:04, Michael Wilde wrote:
> Andrey, good news: GRAM5 is now available on Abe as well. Info and contact
> URLs, as well as some Swift usage experience reports, are at:
>
> http://dev.globus.org/wiki/GRAM/GRAM5#Deployments
>
> So with this in mind, a good approach is:
>
> - sanity test your app using the PBS provider on Abe, with swift on the
> login host, just 1 or 2 jobs
>
> - sanity test 16 to 64 or so jobs, adding parallel clustering to the above
>
> - change from the PBS provider to the GRAM2 (pre-WS-GRAM) provider, but
> using the GRAM URLs at http://dev.globus.org/wiki/GRAM/GRAM5#Deployments
> (still submitting from the Abe login host to Abe.
You can keep the local > data provider for this case) > > - Add Queenbee GRAM5 as a second site, using the gridftp data provider. > > Mike > > > On 8/25/09 11:54 AM, Michael Wilde wrote: >> >> On 8/25/09 10:44 AM, Andriy Fedorov wrote: >>> >>> Michael, >>> >>> Thanks for the reply. >>> >>> So my understanding is, I should check out the trunk version and >>> compile (yes, I've done this before), and try the real application >>> with GT2+coasters. >> >> Yes, thats a good step to re-master, in preparation for Mihael checking in >> Coaster fixes. He made significant enhancements to Coasters in the past 2 >> months, but has been working ona different project lately and thus these are >> not yet sufficiently tested. If you're willing to help in the testing that >> would be great. >> >> If not, I think the next best approach to try is this: >> >> - We have a small experimental mod that enables Swift GRAM2 jobs to use >> all cores of multi-core hosts (such as the 8-core hosts on Abe and >> QueenBee). Basically it uses the Swift clustering facility but runs jobs in >> parallel instead of serially. >> >> It works well if your jobs have a very uniform runtime. If they dont, then >> it wastes CPU. ?But its a good interim solution for many apps until coasters >> is more stable. >> >> This is described at: >> http://www.ci.uchicago.edu/wiki/bin/view/SWFT/SwiftParallelClustering >> >> This info is very preliminary and not end-user ready. Tibi Stef-Praun, on >> this list has tried it. Please start a new thread here if you want to >> ?discuss it or report experiences or problems with it. >> >> - On QueenBee or other GRAM5-enabled systems (not many test as its in test >> mode) you can use the GRAM2 provider if submitting remotely. >> On Abe and any other GRAM2 systems you should run this with the Condor-G >> provider if submitting remotely. >> >> The rule of thumb here for submitting jobs to a site from Swift running >> remotely on a submit host is: >> >> ? 
-- up to 20 jobs in parallel you can use plain GRAM2 >> ? -- above 20 jobs, use Condor-G or, where available, GRAM2 >> >> - On Abe, QueenBee, and other PBS systems with login hosts, you can run >> Swift locally on the login host, and use the PBS provider with the parallel >> clustering approach. >> >> We have a few other solutions that I will save till we explore these two >> solutions. >> >> To prepare for this, try running your app on Abe using the PBS provider, >> with just 1 or 2 jobs, then try the parallel clustering tip above. >> >>> I do have an account on Queen Bee. You say, it has GT GRAM5, but I >>> thought you also said I should target using GT2. What is GRAM5? >> >> GRAM5 is a new, more efficient version of GRAM2. Its fully compatible, so >> you just set Swift sites.xml exactly as for GRAM2. The only thing that >> changes is that you use a different URL for the GRAM gatekeeper contact >> string (ie different host and/or port, thats all). >> >> I'll need to get you the contact string for GRAM5 on QueenBee if/when we >> both agree the time is right to try it. >> >>> At >>> this point, my preference is the system with lowest load and confirmed >>> functional coaster provider, to save time debugging and getting up to >>> speed. Should I use Abe or Queen Bee? >> >> Thats hard to answer, as the loads fluctuate. ?You can examine the >> TeraPort system load monitor in the TG portal, which gives some rough >> estimates of load and queue time. ?Then queue the jobs and wait. Best to run >> Swift under screen, so you can easily wait for and monitor your script >> executions from anywhere, and not be interrupted if long delays are >> encountered. >> >> - Mike >> >>> As soon as I compile the current swift trunk and try GT2+coaster @Abe >>> for my application, I will report to the list my experience. >>> >>> -- >>> Andriy Fedorov, Ph.D. 
>>> >>> Research Fellow >>> Brigham and Women's Hospital >>> Harvard Medical School >>> 75 Francis Street >>> Boston, MA 02115 USA >>> fedorov at bwh.harvard.edu >>> >>> >>> >>> On Tue, Aug 25, 2009 at 11:31, Michael Wilde wrote: >>>> >>>> Andrey, >>>> >>>> On 8/25/09 9:49 AM, Andrey Fedorov wrote: >>>>> >>>>> Hi, >>>>> >>>>> I have a processing step that takes somewhere ~2-5 min. It takes on >>>>> input two ~5Mb files, and produces a small text file, which I need to >>>>> store. I need to compute large number of such jobs, using different >>>>> parameters. It seems to me "coaster" is the best execution provider >>>>> for my application. >>>>> >>>>> Trying to start simple, I am running first.swift (echo) example that >>>>> comes with Swift using different providers: GT2, GT4, GT2/coaster, and >>>>> GT4/coaster. All of this is done on Abe NCSA cluster. >>>>> >>>>> Here's my sites.xml: >>>>> >>>>> >>>>> ? >>>>> ?>>>> >>>>> >>>>> ?url="https://grid-abe.ncsa.teragrid.org:8443/wsrf/services/ManagedJobFactoryService"/> >>>>> ?/u/ac/fedorov/scratch-global/scratch >>>>> >>>>> >>>>> >>>>> ? >>>>> ?>>>> >>>>> >>>>> ?url="https://grid-abe.ncsa.teragrid.org:8443/wsrf/services/ManagedJobFactoryService"/> >>>>> ?/u/ac/fedorov/scratch-global/scratch >>>>> >>>>> >>>>> >>>>> ? >>>>> ?>>>> ?url="grid-abe.ncsa.teragrid.org:2119/jobmanager-pbs"/> >>>>> ?/u/ac/fedorov/scratch-global/scratch >>>>> >>>>> >>>>> >>>>> ? >>>>> ?>>>> ?url="grid-abe.ncsa.teragrid.org"/> >>>>> ?>>>> /> >>>>> ?/u/ac/fedorov/scratch-global/scratch >>>>> >>>>> >>>>> And tc.data is simply >>>>> >>>>> Abe-GT4-coasters echo /bin/echo INSTALLED INTEL32::LINUX null >>>>> >>>>> and I change the site to test different providers. >>>>> >>>>> Now, results: >>>>> >>>>> 1) both GT2 and GT4 providers work fine, script completes >>>>> >>>>> 2) with GT2+coaster provider, I can see the job in the PBS queue >>>>> (requested time is 01:41, I guess this comes with the default coaster >>>>> parameters, that I didn't change). 
The job appears to finish >>>>> successfully, but then I get this error: >>>>> >>>>> Final status: ?Finished successfully:1 >>>>> START cleanups=[[first-20090825-0925-emkt2qt0, Abe-GT2-coasters]] >>>>> START dir=first-20090825-0925-emkt2qt0 host=Abe-GT2-coasters >>>>> Sending Command(21, SUBMITJOB) on GSSSChannel-null(1) >>>>> Command(21, SUBMITJOB) CMD: Command(21, SUBMITJOB) >>>>> GSSSChannel-null(1) REPL: Command(21, SUBMITJOB) >>>>> Submitted task Task(type=JOB_SUBMISSION, >>>>> identity=urn:0-1-1251210343871). Job id: >>>>> urn:1251210343871-1251210376098-1251210376099 >>>>> Unregistering Command(21, SUBMITJOB) >>>>> GSSSChannel-null(1) REQ: Handler(JOBSTATUS) >>>>> GSSSChannel-null(1) REQ: Handler(JOBSTATUS) >>>>> Task(type=JOB_SUBMISSION, identity=urn:0-1-1251210343871) Completed. >>>>> Waiting: 0, Running: 0. Heap size: 65M, Heap free: 42M, Max heap: 227M >>>>> END dir=first-20090825-0925-emkt2qt0 host=Abe-GT2-coasters >>>>> Cleaning up... >>>>> Shutting down service at https://141.142.68.180:45552 >>>>> Got channel MetaChannel: 500265006 -> GSSSChannel-null(1) >>>>> Sending Command(22, SHUTDOWNSERVICE) on GSSSChannel-null(1) >>>>> Command(22, SHUTDOWNSERVICE) CMD: Command(22, SHUTDOWNSERVICE) >>>>> Command(22, SHUTDOWNSERVICE): handling reply timeout >>>>> Command(22, SHUTDOWNSERVICE): failed too many times >>>>> org.globus.cog.karajan.workflow.service.ReplyTimeoutException >>>>> ? ? ? at >>>>> >>>>> org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:241) >>>>> ? ? ? at >>>>> >>>>> org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:246) >>>>> ? ? ? at java.util.TimerThread.mainLoop(Timer.java:512) >>>>> ? ? ? at java.util.TimerThread.run(Timer.java:462) >>>>> - Done >>>> >>>> This seems like a low-prio error. I'll file it in bugzilla for now. 
Lets >>>> see >>>> how coasters works for you on Abe using your real app and a larger >>>> number of >>>> jobs, and come back to this shutdown problem if it proves to be a >>>> blocker to >>>> getting work done. >>>> >>>> Coasters has a few other current issues - mainly not throttling work >>>> efficiently - that we have a fix for, and need to apply and test that >>>> one >>>> first. >>>> >>>> We've also been experimenting with a non-coaster way to use all 8 cores >>>> of >>>> machines like Abe, but lets try the coaster route first, of thats OK >>>> with >>>> you, and lets focus on GT2/Coasters, as that will be more common. >>>> >>>> In addition, there is a test version of GT GRAM5 on QueenBee, Abe's >>>> sister-system at LSU, which we can try, assuming your TG project lets >>>> you >>>> run there. >>>> >>>> So please try to run the app, and we will try to get the latest coaster >>>> fixes committed. (I assume you are comfortable extracting Swift from svn >>>> and >>>> building it; if you have not done this before, can you try it, Andrey?) >>>> >>>> Regards, >>>> >>>> Mike >>>> >>>> >>>>> 3) with GT4-coaster provider, I don't get as far as with GT2-coaster. >>>>> Possibly I am not setting up properly the site entry. I was not able >>>>> to find any examples in the manual how to set coasters with GT4 (can >>>>> anyone provide an example?). Here's the error: >>>>> >>>>> Failed to transfer wrapper log from >>>>> first-20090825-0929-39x94x09/info/t on Abe-GT4-coasters >>>>> END_FAILURE thread=0 tr=echo >>>>> Progress: ?Failed:1 >>>>> Execution failed: >>>>> ? ? ? Exception in echo: >>>>> Arguments: [Hello, world!] >>>>> Host: Abe-GT4-coasters >>>>> Directory: first-20090825-0929-39x94x09/jobs/t/echo-t5oymmfj >>>>> stderr.txt: >>>>> >>>>> stdout.txt: >>>>> >>>>> ---- >>>>> >>>>> Caused by: >>>>> ? ? ? Cannot submit job: Limited proxy is not accepted >>>>> >>>>> >>>>> Can anybody help figuring this out? >>>>> >>>>> Thanks >>>>> -- >>>>> Andriy Fedorov, Ph.D. 
>>>>>
>>>>> Research Fellow
>>>>> Brigham and Women's Hospital
>>>>> Harvard Medical School
>>>>> 75 Francis Street
>>>>> Boston, MA 02115 USA
>>>>> fedorov at bwh.harvard.edu
>>>>> _______________________________________________
>>>>> Swift-user mailing list
>>>>> Swift-user at ci.uchicago.edu
>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user
>>
>

From hategan at mcs.anl.gov Wed Aug 26 09:44:41 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 26 Aug 2009 09:44:41 -0500
Subject: [Swift-user] Problems getting started with coasters
In-Reply-To: <1251215389.29699.8.camel@blabla>
References: <82f536810908250749ve4c2c09xbfa5d6343ad5231c@mail.gmail.com>
    <1251215389.29699.8.camel@blabla>
Message-ID: <1251297881.9929.1.camel@localhost>

On Tue, 2009-08-25 at 10:49 -0500, Mihael Hategan wrote:
> (3)
> I know what's happening. That is a bug. When using gt4:gt4:xxx,
> delegation needs to be enabled on the first step. Delegation is disabled
> (as much as possible) by default in all the providers. There should be a
> fix in SVN this week.
>

Except not. Full delegation is enabled where it should be.

Do you have a swift log from the run below?

Mihael

> On Tue, 2009-08-25 at 10:49 -0400, Andrey Fedorov wrote:
> >
> > 3) with GT4-coaster provider, I don't get as far as with GT2-coaster.
> > Possibly I am not setting up properly the site entry. I was not able
> > to find any examples in the manual how to set coasters with GT4 (can
> > anyone provide an example?). Here's the error:
> >
> [...]
> > Caused by:
> >       Cannot submit job: Limited proxy is not accepted
> >
> >

From wilde at mcs.anl.gov Wed Aug 26 17:11:34 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 26 Aug 2009 17:11:34 -0500
Subject: [Swift-user] Re: swift on jazz
In-Reply-To: <53595.207.181.247.22.1251318011.squirrel@galton.uchicago.edu>
References: <53595.207.181.247.22.1251318011.squirrel@galton.uchicago.edu>
Message-ID: <4A95B316.2090603@mcs.anl.gov>

Hi Marcin,

I took the liberty of moving this thread to swift-user for others to
help me answer you, and for other users to benefit.

On Jazz, are you observing that Swift puts at most 2 jobs in the
Jazz PBS queue (where you can see them with "qstat"), or that Swift puts
many jobs in the queue but only 2 run at a time?

Assuming it's the latter, you must be bumping into Jazz's scheduler policy,
which favors multi-CPU jobs. If that's the case, then let's try
running the "coaster" provider, which is specified in the sites.xml file
(tc.data doesn't change).

First, change your "jazz" entry in sites.xml from the PBS execution
provider:

to the Coaster provider:

This should work, although we may need to add additional XML
specifications for time limits, accounts, and maybe queues.

Then we expect to be applying a fix to the coaster provider tonight, so
we'll need to do a custom Swift build from the source repository after
that, and test the latest fix. The fix improves the throughput, but even
without it, you should see Swift requesting more CPUs from PBS in a
single job.

I suggest getting started with this simple change, and we'll enhance it
in stages to give you better performance and more parallelism.

- Mike

On 8/26/09 3:20 PM, Marcin Hitczenko wrote:
> ... I am running jobs on jazz and I noticed that jazz will only run
> at most two jobs at once for me (I have about 30), even though there are
> more nodes free and I am requiring only one node per job. Is there
> something I can do to change this? Would I have to change the tc.data or
> sites.xml file?
>
> Thanks,
>
> Marcin

From wilde at mcs.anl.gov Thu Aug 27 11:55:51 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 27 Aug 2009 11:55:51 -0500
Subject: [Swift-user] Re: swift on jazz
In-Reply-To: <59301.207.181.247.22.1251386649.squirrel@galton.uchicago.edu>
References: <53595.207.181.247.22.1251318011.squirrel@galton.uchicago.edu>
    <4A95B316.2090603@mcs.anl.gov>
    <59301.207.181.247.22.1251386649.squirrel@galton.uchicago.edu>
Message-ID: <4A96BA97.5010400@mcs.anl.gov>

Marcin,

If what you're seeing is that Swift is not sending enough jobs to PBS,
add the following to your sites.xml entry for Jazz/PBS:

  .24
  10000

This should cause Swift to queue up to 25 jobs at a time to PBS. The
formula is nJobs = (jobThrottle*100)+1. I.e., for 30 jobs at a time, use
.29; for 256 jobs at a time, use 2.55.

Swift tries to hide this from users and throttle automatically, but the
algorithm still causes "surprises" and starts up very (too?) slowly so
as to not overwhelm a cluster with jobs. So you should be able to use
the XML elements above to force Swift to go right to a specific level of
parallelism.

- Mike

ps. I'll contact you off-list to set up a meeting.

On 8/27/09 10:24 AM, Marcin Hitczenko wrote:
> Hi,
>
> I am actually observing the former, which is why I thought this might be
> controllable via swift.
>
> I also have a few other basic questions regarding following job status and
> organization of all the output files. I think the easiest thing to do
> would be to look at my account together. Is there any way that we could
> meet this week?
>
> Thanks,
>
> Marcin
>
>> Hi Marcin,
>>
>> I took the liberty of moving this thread to swift-user for others to
>> help me answer you, and for other users to benefit.
>>
>> On Jazz, are you observing that Swift only puts at most 2 jobs in the
>> Jazz PBS queue (where you can see them with "qstat") or that Swift puts
>> many jobs in the queue but only 2 run at a time?
>>
>> Assuming it's the latter, you must be bumping into Jazz's scheduler policy
>> which is favoring multi-CPU jobs. If that's the case, then let's try
>> running the "coaster" provider which is specified in the sites.xml file.
>> (tc.data doesn't change).
>>
>> First, change your "jazz" entry in sites.xml from the PBS execution
>> provider:
>>
>>
>>
>> to the Coaster provider:
>>
>>
>>
>> This should work, although we may need to add additional XML
>> specifications for time limits, accounts, and maybe queues.
>>
>> Then we expect to be applying a fix to the coaster provider tonight, so
>> we'll need to do a custom Swift build from the source repository after
>> that, and test the latest fix. The fix improves the throughput, but even
>> without it, you should see Swift requesting more CPUs from PBS in a
>> single job.
>>
>> I suggest getting started with this simple change, and we'll enhance it
>> in stages to give you better performance and more parallelism.
>>
>> - Mike
>>
>>
>> On 8/26/09 3:20 PM, Marcin Hitczenko wrote:
>>
>>> ... I am running jobs on jazz and I noticed that jazz will only run
>>> at most two jobs at once for me (I have about 30), even though there are
>>> more nodes free and I am requiring only one node per job. Is there
>>> something I can do to change this? Would I have to change the tc.data or
>>> sites.xml file?
>>>
>>> Thanks,
>>>
>>> Marcin
>

From fedorov at bwh.harvard.edu Thu Aug 27 14:37:51 2009
From: fedorov at bwh.harvard.edu (Andriy Fedorov)
Date: Thu, 27 Aug 2009 15:37:51 -0400
Subject: [Swift-user] Problems getting started with coasters
In-Reply-To: <4A94198C.8090109@mcs.anl.gov>
References: <82f536810908250749ve4c2c09xbfa5d6343ad5231c@mail.gmail.com>
    <4A9403CE.4050303@mcs.anl.gov>
    <82f536810908250844l69c2b2e8oe5caf3d73fcd46cb@mail.gmail.com>
    <4A94174D.3070901@mcs.anl.gov>
    <4A94198C.8090109@mcs.anl.gov>
Message-ID: <82f536810908271237l7fe2e15bg368221bcde3e44cf@mail.gmail.com>

On Tue, Aug 25, 2009 at 13:04, Michael Wilde wrote:
> Andrey, good news: GRAM5 is now available on Abe as well. Info and contact
> URLs, as well as some Swift usage experience reports, are at:
>
> http://dev.globus.org/wiki/GRAM/GRAM5#Deployments
>
> So with this in mind, a good approach is:
>
> - sanity test your app using the PBS provider on Abe, with swift on the
> login host, just 1 or 2 jobs
>

Michael,

I am actually having trouble with this sanity test.

I need to submit a file (about 5M) as an input to my application. What
seems to be happening is that the file gets corrupted in transmission! I
debugged this, and the corruption appears to be the reason my application
fails. The same application/swift script runs fine when I use plain gt2,
without coasters.

To debug, I echo the directory where my application is started by
Swift, so I get the exact location of the file:

[fedorov at TG/Abe:honest4 SlicerReg] cat fileInfo.txt
lrwxrwxrwx 1 fedorov dkk 109 Aug 27 14:15 Data/MRMeningioma0.nrrd -> /u/ac/fedorov/scratch-global/scratch/RigidRegistration1-20090827-1411-ttnpb5d3/shared/Data/MRMeningioma0.nrrd

The file has the same size, but the content is not identical! Here's
basically the story:

[fedorov at TG/Abe:honest4 SlicerReg] ls -la Data/
total 10960
drwxr-x---  2 fedorov dkk    4096 Aug 27 14:22 .
drwxr-x--- 25 fedorov dkk   12288 Aug 27 14:27 ..
-rw-r-----  1 fedorov dkk 5069225 Aug 25 15:49 MRMeningioma0.nrrd
-rw-r-----  1 fedorov dkk 6132840 Aug 25 15:49 MRMeningioma1.nrrd

[fedorov at TG/Abe:honest4 SlicerReg] ls -la /u/ac/fedorov/scratch-global/scratch/RigidRegistration1-20090827-1411-ttnpb5d3/shared/Data
total 10952
drwxr-xr-x 2 fedorov dkk    4096 Aug 27 14:22 .
drwxr-xr-x 3 fedorov dkk    4096 Aug 27 14:10 ..
-rw-r--r-- 1 fedorov dkk 5069225 Aug 27 14:11 MRMeningioma0.nrrd
-rw-r--r-- 1 fedorov dkk 6132840 Aug 27 14:16 MRMeningioma1.nrrd

[fedorov at TG/Abe:honest4 SlicerReg] diff Data/MRMeningioma0.nrrd /u/ac/fedorov/scratch-global/scratch/RigidRegistration1-20090827-1411-ttnpb5d3/shared/Data/MRMeningioma0.nrrd
Binary files Data/MRMeningioma0.nrrd and /u/ac/fedorov/scratch-global/scratch/RigidRegistration1-20090827-1411-ttnpb5d3/shared/Data/MRMeningioma0.nrrd differ

I can read my original file, but not the copied one:

[fedorov at TG/Abe:honest4 SlicerReg] ~/Slicer3-lib/teem-build/bin/unu minmax /u/ac/fedorov/scratch-global/scratch/RigidRegistration1-20090827-1411-ttnpb5d3/shared/Data/MRMeningioma0.nrrd
unu minmax: trouble with "/u/ac/fedorov/scratch-global/scratch/RigidRegistration1-20090827-1411-ttnpb5d3/shared/Data/MRMeningioma0.nrrd":
[unu minmax] unu minmax: trouble loading "/u/ac/fedorov/scratch-global/scratch/RigidRegistration1-20090827-1411-ttnpb5d3/shared/Data/MRMeningioma0.nrrd"
[unu minmax] [nrrd] nrrdLoad: trouble reading "/u/ac/fedorov/scratch-global/scratch/RigidRegistration1-20090827-1411-ttnpb5d3/shared/Data/MRMeningioma0.nrrd"
[unu minmax] [nrrd] nrrdRead: trouble
[unu minmax] [nrrd] _nrrdRead: trouble reading NRRD file
[unu minmax] [nrrd] _nrrdFormatNRRD_read:
[unu minmax] [nrrd] _nrrdEncodingGzip_read: error reading from gzFile
[unu minmax] [nrrd] _nrrdGzRead: data read error

[fedorov at TG/Abe:honest4 SlicerReg] ~/Slicer3-lib/teem-build/bin/unu minmax Data/MRMeningioma0.nrrd
min: 0
max: 695

Have you guys run any applications with non-trivial input file size,
and verified that file integrity is preserved?

> - sanity test 16 to 64 or so jobs, adding parallel clustering to the above
>
> - change from the PBS provider to the GRAM2 (pre-WS-GRAM) provider, but
> using the GRAM URLs at http://dev.globus.org/wiki/GRAM/GRAM5#Deployments
> (still submitting from the Abe login host to Abe. You can keep the local
> data provider for this case)
>
> - Add Queenbee GRAM5 as a second site, using the gridftp data provider.
>
> Mike
>
>
> On 8/25/09 11:54 AM, Michael Wilde wrote:
>>
>> On 8/25/09 10:44 AM, Andriy Fedorov wrote:
>>>
>>> Michael,
>>>
>>> Thanks for the reply.
>>>
>>> So my understanding is, I should check out the trunk version and
>>> compile (yes, I've done this before), and try the real application
>>> with GT2+coasters.
>>
>> Yes, that's a good step to re-master, in preparation for Mihael checking in
>> Coaster fixes. He made significant enhancements to Coasters in the past 2
>> months, but has been working on a different project lately and thus these are
>> not yet sufficiently tested. If you're willing to help in the testing that
>> would be great.
>>
>> If not, I think the next best approach to try is this:
>>
>> - We have a small experimental mod that enables Swift GRAM2 jobs to use
>> all cores of multi-core hosts (such as the 8-core hosts on Abe and
>> QueenBee). Basically it uses the Swift clustering facility but runs jobs in
>> parallel instead of serially.
>>
>> It works well if your jobs have a very uniform runtime. If they don't, then
>> it wastes CPU. But it's a good interim solution for many apps until coasters
>> is more stable.
>>
>> This is described at:
>> http://www.ci.uchicago.edu/wiki/bin/view/SWFT/SwiftParallelClustering
>>
>> This info is very preliminary and not end-user ready. Tibi Stef-Praun, on
>> this list has tried it. Please start a new thread here if you want to
>> discuss it or report experiences or problems with it.
>> >> - On QueenBee or other GRAM5-enabled systems (not many test as its in test >> mode) you can use the GRAM2 provider if submitting remotely. >> On Abe and any other GRAM2 systems you should run this with the Condor-G >> provider if submitting remotely. >> >> The rule of thumb here for submitting jobs to a site from Swift running >> remotely on a submit host is: >> >> ? -- up to 20 jobs in parallel you can use plain GRAM2 >> ? -- above 20 jobs, use Condor-G or, where available, GRAM2 >> >> - On Abe, QueenBee, and other PBS systems with login hosts, you can run >> Swift locally on the login host, and use the PBS provider with the parallel >> clustering approach. >> >> We have a few other solutions that I will save till we explore these two >> solutions. >> >> To prepare for this, try running your app on Abe using the PBS provider, >> with just 1 or 2 jobs, then try the parallel clustering tip above. >> >>> I do have an account on Queen Bee. You say, it has GT GRAM5, but I >>> thought you also said I should target using GT2. What is GRAM5? >> >> GRAM5 is a new, more efficient version of GRAM2. Its fully compatible, so >> you just set Swift sites.xml exactly as for GRAM2. The only thing that >> changes is that you use a different URL for the GRAM gatekeeper contact >> string (ie different host and/or port, thats all). >> >> I'll need to get you the contact string for GRAM5 on QueenBee if/when we >> both agree the time is right to try it. >> >>> At >>> this point, my preference is the system with lowest load and confirmed >>> functional coaster provider, to save time debugging and getting up to >>> speed. Should I use Abe or Queen Bee? >> >> Thats hard to answer, as the loads fluctuate. ?You can examine the >> TeraPort system load monitor in the TG portal, which gives some rough >> estimates of load and queue time. ?Then queue the jobs and wait. 
Best to run >> Swift under screen, so you can easily wait for and monitor your script >> executions from anywhere, and not be interrupted if long delays are >> encountered. >> >> - Mike >> >>> As soon as I compile the current swift trunk and try GT2+coaster @Abe >>> for my application, I will report to the list my experience. >>> >>> -- >>> Andriy Fedorov, Ph.D. >>> >>> Research Fellow >>> Brigham and Women's Hospital >>> Harvard Medical School >>> 75 Francis Street >>> Boston, MA 02115 USA >>> fedorov at bwh.harvard.edu >>> >>> >>> >>> On Tue, Aug 25, 2009 at 11:31, Michael Wilde wrote: >>>> >>>> Andrey, >>>> >>>> On 8/25/09 9:49 AM, Andrey Fedorov wrote: >>>>> >>>>> Hi, >>>>> >>>>> I have a processing step that takes somewhere ~2-5 min. It takes on >>>>> input two ~5Mb files, and produces a small text file, which I need to >>>>> store. I need to compute large number of such jobs, using different >>>>> parameters. It seems to me "coaster" is the best execution provider >>>>> for my application. >>>>> >>>>> Trying to start simple, I am running first.swift (echo) example that >>>>> comes with Swift using different providers: GT2, GT4, GT2/coaster, and >>>>> GT4/coaster. All of this is done on Abe NCSA cluster. >>>>> >>>>> Here's my sites.xml: >>>>> >>>>> >>>>> ? >>>>> ?>>>> >>>>> >>>>> ?url="https://grid-abe.ncsa.teragrid.org:8443/wsrf/services/ManagedJobFactoryService"/> >>>>> ?/u/ac/fedorov/scratch-global/scratch >>>>> >>>>> >>>>> >>>>> ? >>>>> ?>>>> >>>>> >>>>> ?url="https://grid-abe.ncsa.teragrid.org:8443/wsrf/services/ManagedJobFactoryService"/> >>>>> ?/u/ac/fedorov/scratch-global/scratch >>>>> >>>>> >>>>> >>>>> ? >>>>> ?>>>> ?url="grid-abe.ncsa.teragrid.org:2119/jobmanager-pbs"/> >>>>> ?/u/ac/fedorov/scratch-global/scratch >>>>> >>>>> >>>>> >>>>> ? 
>>>>> ?>>>> ?url="grid-abe.ncsa.teragrid.org"/> >>>>> ?>>>> /> >>>>> ?/u/ac/fedorov/scratch-global/scratch >>>>> >>>>> >>>>> And tc.data is simply >>>>> >>>>> Abe-GT4-coasters echo /bin/echo INSTALLED INTEL32::LINUX null >>>>> >>>>> and I change the site to test different providers. >>>>> >>>>> Now, results: >>>>> >>>>> 1) both GT2 and GT4 providers work fine, script completes >>>>> >>>>> 2) with GT2+coaster provider, I can see the job in the PBS queue >>>>> (requested time is 01:41, I guess this comes with the default coaster >>>>> parameters, that I didn't change). The job appears to finish >>>>> successfully, but then I get this error: >>>>> >>>>> Final status: ?Finished successfully:1 >>>>> START cleanups=[[first-20090825-0925-emkt2qt0, Abe-GT2-coasters]] >>>>> START dir=first-20090825-0925-emkt2qt0 host=Abe-GT2-coasters >>>>> Sending Command(21, SUBMITJOB) on GSSSChannel-null(1) >>>>> Command(21, SUBMITJOB) CMD: Command(21, SUBMITJOB) >>>>> GSSSChannel-null(1) REPL: Command(21, SUBMITJOB) >>>>> Submitted task Task(type=JOB_SUBMISSION, >>>>> identity=urn:0-1-1251210343871). Job id: >>>>> urn:1251210343871-1251210376098-1251210376099 >>>>> Unregistering Command(21, SUBMITJOB) >>>>> GSSSChannel-null(1) REQ: Handler(JOBSTATUS) >>>>> GSSSChannel-null(1) REQ: Handler(JOBSTATUS) >>>>> Task(type=JOB_SUBMISSION, identity=urn:0-1-1251210343871) Completed. >>>>> Waiting: 0, Running: 0. Heap size: 65M, Heap free: 42M, Max heap: 227M >>>>> END dir=first-20090825-0925-emkt2qt0 host=Abe-GT2-coasters >>>>> Cleaning up... >>>>> Shutting down service at https://141.142.68.180:45552 >>>>> Got channel MetaChannel: 500265006 -> GSSSChannel-null(1) >>>>> Sending Command(22, SHUTDOWNSERVICE) on GSSSChannel-null(1) >>>>> Command(22, SHUTDOWNSERVICE) CMD: Command(22, SHUTDOWNSERVICE) >>>>> Command(22, SHUTDOWNSERVICE): handling reply timeout >>>>> Command(22, SHUTDOWNSERVICE): failed too many times >>>>> org.globus.cog.karajan.workflow.service.ReplyTimeoutException >>>>> ? ? ? 
at >>>>> >>>>> org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:241) >>>>> ? ? ? at >>>>> >>>>> org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:246) >>>>> ? ? ? at java.util.TimerThread.mainLoop(Timer.java:512) >>>>> ? ? ? at java.util.TimerThread.run(Timer.java:462) >>>>> - Done >>>> >>>> This seems like a low-prio error. I'll file it in bugzilla for now. Lets >>>> see >>>> how coasters works for you on Abe using your real app and a larger >>>> number of >>>> jobs, and come back to this shutdown problem if it proves to be a >>>> blocker to >>>> getting work done. >>>> >>>> Coasters has a few other current issues - mainly not throttling work >>>> efficiently - that we have a fix for, and need to apply and test that >>>> one >>>> first. >>>> >>>> We've also been experimenting with a non-coaster way to use all 8 cores >>>> of >>>> machines like Abe, but lets try the coaster route first, of thats OK >>>> with >>>> you, and lets focus on GT2/Coasters, as that will be more common. >>>> >>>> In addition, there is a test version of GT GRAM5 on QueenBee, Abe's >>>> sister-system at LSU, which we can try, assuming your TG project lets >>>> you >>>> run there. >>>> >>>> So please try to run the app, and we will try to get the latest coaster >>>> fixes committed. (I assume you are comfortable extracting Swift from svn >>>> and >>>> building it; if you have not done this before, can you try it, Andrey?) >>>> >>>> Regards, >>>> >>>> Mike >>>> >>>> >>>>> 3) with GT4-coaster provider, I don't get as far as with GT2-coaster. >>>>> Possibly I am not setting up properly the site entry. I was not able >>>>> to find any examples in the manual how to set coasters with GT4 (can >>>>> anyone provide an example?). 
Here's the error: >>>>> >>>>> Failed to transfer wrapper log from >>>>> first-20090825-0929-39x94x09/info/t on Abe-GT4-coasters >>>>> END_FAILURE thread=0 tr=echo >>>>> Progress: ?Failed:1 >>>>> Execution failed: >>>>> ? ? ? Exception in echo: >>>>> Arguments: [Hello, world!] >>>>> Host: Abe-GT4-coasters >>>>> Directory: first-20090825-0929-39x94x09/jobs/t/echo-t5oymmfj >>>>> stderr.txt: >>>>> >>>>> stdout.txt: >>>>> >>>>> ---- >>>>> >>>>> Caused by: >>>>> ? ? ? Cannot submit job: Limited proxy is not accepted >>>>> >>>>> >>>>> Can anybody help figuring this out? >>>>> >>>>> Thanks >>>>> -- >>>>> Andriy Fedorov, Ph.D. >>>>> >>>>> Research Fellow >>>>> Brigham and Women's Hospital >>>>> Harvard Medical School >>>>> 75 Francis Street >>>>> Boston, MA 02115 USA >>>>> fedorov at bwh.harvard.edu >>>>> _______________________________________________ >>>>> Swift-user mailing list >>>>> Swift-user at ci.uchicago.edu >>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user >> > From hategan at mcs.anl.gov Thu Aug 27 15:30:12 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 27 Aug 2009 15:30:12 -0500 Subject: [Swift-user] Problems getting started with coasters In-Reply-To: <82f536810908271237l7fe2e15bg368221bcde3e44cf@mail.gmail.com> References: <82f536810908250749ve4c2c09xbfa5d6343ad5231c@mail.gmail.com> <4A9403CE.4050303@mcs.anl.gov> <82f536810908250844l69c2b2e8oe5caf3d73fcd46cb@mail.gmail.com> <4A94174D.3070901@mcs.anl.gov> <4A94198C.8090109@mcs.anl.gov> <82f536810908271237l7fe2e15bg368221bcde3e44cf@mail.gmail.com> Message-ID: <1251405012.20895.1.camel@localhost> On Thu, 2009-08-27 at 15:37 -0400, Andriy Fedorov wrote: > I need to submit a file (about 5M) as an input to my application. What > seems to be happening is that the file gets corrupted in transmission! What's the contents of your sites.xml? 
From fedorov at bwh.harvard.edu Thu Aug 27 15:36:54 2009 From: fedorov at bwh.harvard.edu (Andriy Fedorov) Date: Thu, 27 Aug 2009 16:36:54 -0400 Subject: [Swift-user] Problems getting started with coasters In-Reply-To: <1251405012.20895.1.camel@localhost> References: <82f536810908250749ve4c2c09xbfa5d6343ad5231c@mail.gmail.com> <4A9403CE.4050303@mcs.anl.gov> <82f536810908250844l69c2b2e8oe5caf3d73fcd46cb@mail.gmail.com> <4A94174D.3070901@mcs.anl.gov> <4A94198C.8090109@mcs.anl.gov> <82f536810908271237l7fe2e15bg368221bcde3e44cf@mail.gmail.com> <1251405012.20895.1.camel@localhost> Message-ID: <82f536810908271336y3c4c1d1ftfcab39cfb37b9a32@mail.gmail.com> Here's the section for the coasters site: /u/ac/fedorov/scratch-global/scratch On Thu, Aug 27, 2009 at 16:30, Mihael Hategan wrote: > On Thu, 2009-08-27 at 15:37 -0400, Andriy Fedorov wrote: > >> I need to submit a file (about 5M) as an input to my application. What >> seems to be happening is that the file gets corrupted in transmission! > > What's the contents of your sites.xml? > > From hategan at mcs.anl.gov Thu Aug 27 15:37:35 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 27 Aug 2009 15:37:35 -0500 Subject: [Swift-user] Problems getting started with coasters In-Reply-To: <1251297881.9929.1.camel@localhost> References: <82f536810908250749ve4c2c09xbfa5d6343ad5231c@mail.gmail.com> <1251215389.29699.8.camel@blabla> <1251297881.9929.1.camel@localhost> Message-ID: <1251405455.20895.4.camel@localhost> On Wed, 2009-08-26 at 09:44 -0500, Mihael Hategan wrote: > On Tue, 2009-08-25 at 10:49 -0500, Mihael Hategan wrote: > > > (3) > > I know what's happening. That is a bug. When using gt4:gt4:xxx, > > delegation needs to be enabled on the first step. Delegation is disabled > > (as much as possible) by default in all the providers. There should be a > > fix in SVN this week. > > > > Except not. Full delegation is enabled where it should be. 
Except the gt4 provider wrongly handled the issue, causing limited
delegation to be requested when full delegation was specified.

Fixed in cog r2454.

From hategan at mcs.anl.gov Thu Aug 27 15:40:09 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 27 Aug 2009 15:40:09 -0500
Subject: [Swift-user] Problems getting started with coasters
In-Reply-To: <82f536810908271336y3c4c1d1ftfcab39cfb37b9a32@mail.gmail.com>
References: <82f536810908250749ve4c2c09xbfa5d6343ad5231c@mail.gmail.com> <4A9403CE.4050303@mcs.anl.gov> <82f536810908250844l69c2b2e8oe5caf3d73fcd46cb@mail.gmail.com> <4A94174D.3070901@mcs.anl.gov> <4A94198C.8090109@mcs.anl.gov> <82f536810908271237l7fe2e15bg368221bcde3e44cf@mail.gmail.com> <1251405012.20895.1.camel@localhost> <82f536810908271336y3c4c1d1ftfcab39cfb37b9a32@mail.gmail.com>
Message-ID: <1251405609.20895.8.camel@localhost>

The data corruption with the experimental coaster filesystem provider is
a known issue that, unfortunately, we have not had the resources to deal
with.

Please use GridFTP, which is better tested and known to behave.

Mihael

On Thu, 2009-08-27 at 16:36 -0400, Andriy Fedorov wrote:
> Here's the section for the coasters site:
>
>
>
> url="grid-abe.ncsa.teragrid.org"/>
>
> /u/ac/fedorov/scratch-global/scratch
>
>
>
>
> On Thu, Aug 27, 2009 at 16:30, Mihael Hategan wrote:
> > On Thu, 2009-08-27 at 15:37 -0400, Andriy Fedorov wrote:
> >
> >> I need to submit a file (about 5M) as an input to my application. What
> >> seems to be happening is that the file gets corrupted in transmission!
> >
> > What's the contents of your sites.xml?
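
A corruption of the kind reported in this thread (same size, different
content) can be caught before the application runs by comparing checksums
of the original file and the staged copy. What follows is only a minimal
sketch in Python, not part of Swift; the function names md5_of and
staged_copy_matches are made up for illustration.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def staged_copy_matches(original, staged):
    """True only if both files exist, have equal size, and equal MD5."""
    original, staged = Path(original), Path(staged)
    if not (original.is_file() and staged.is_file()):
        return False
    # A size check alone is not enough: in the report above the sizes
    # were identical while the contents differed.
    if original.stat().st_size != staged.stat().st_size:
        return False
    return md5_of(original) == md5_of(staged)
```

Calling staged_copy_matches("Data/MRMeningioma0.nrrd", "<path to the
copy under shared/Data>") on the submit side and on the staged file would
give a quick yes/no answer without having to run the application.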
> >
> >

From wilde at mcs.anl.gov Thu Aug 27 15:42:33 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 27 Aug 2009 15:42:33 -0500
Subject: [Swift-user] Problems getting started with coasters
In-Reply-To: <82f536810908271237l7fe2e15bg368221bcde3e44cf@mail.gmail.com>
References: <82f536810908250749ve4c2c09xbfa5d6343ad5231c@mail.gmail.com> <4A9403CE.4050303@mcs.anl.gov> <82f536810908250844l69c2b2e8oe5caf3d73fcd46cb@mail.gmail.com> <4A94174D.3070901@mcs.anl.gov> <4A94198C.8090109@mcs.anl.gov> <82f536810908271237l7fe2e15bg368221bcde3e44cf@mail.gmail.com>
Message-ID: <4A96EFB9.2020704@mcs.anl.gov>

Andriy, can you post your sites.xml file?

I *suspect* that you may (inadvertently) be using the Coaster data
provider, using an XML tag like this in the element for the local site:

If you are, remove that (for now). There are suspected problems with
coaster data transfer for larger files. It has worked well for large sets
of very small ones. (We need to get such alerts posted somewhere clearly,
sorry.)

Use (only) this tag for the data provider for the local-PBS "sanity" test:

If you do *not* have a coaster filesystem tag in your element, then I
need to dig deeper, and may need some logs from you and/or access to
your directories on Abe.

Also note:

o you can and should use the gridftp/local tag above even when using
coasters as your execution provider, when you are running on a site that
has access to your local directories (e.g., when the worker nodes of your
target site can directly access the file names that your Swift script is
mapping).

o Mihael posted the promised fixes to Coasters last night, and once we
get past this sanity test, you should try those. You may be trying these
ahead of me, so my apologies if you find some problems for us first.

- Mike

On 8/27/09 2:37 PM, Andriy Fedorov wrote:
> On Tue, Aug 25, 2009 at 13:04, Michael Wilde wrote:
>> Andrey, good news: GRAM5 is now available on Abe as well.
Info and contact >> URLs, as well as some Swift usage experience reports, are at: >> >> http://dev.globus.org/wiki/GRAM/GRAM5#Deployments >> >> So with this in mind, a good approach is: >> >> - sanity test your app using the PBS provider on Abe, with swift on the >> login host, just 1 or 2 jobs >> > > Michael, > > I am actually having troubles with this sanity test. > > I need to submit a file (about 5M) as an input to my application. What > seems to be happening is that the file gets corrupted in transmission! > > I debugged this, and this appears to be the reason for my application to fail. > > The same application/swift script runs fine when I use plain gt2, > without coasters. > > What I did to debug, I echo the directory, where my applications is > started by Swift, so I get exact location of the file: > > [fedorov at TG/Abe:honest4 SlicerReg] cat fileInfo.txt > lrwxrwxrwx 1 fedorov dkk 109 Aug 27 14:15 Data/MRMeningioma0.nrrd -> > /u/ac/fedorov/scratch-global/scratch/RigidRegistration1-20090827-1411-ttnpb5d3/shared/Data/MRMeningioma0.nrrd > > The file has the same size, but the content is not identical! Here's > basically the story: > > [fedorov at TG/Abe:honest4 SlicerReg] ls -la Data/ > total 10960 > drwxr-x--- 2 fedorov dkk 4096 Aug 27 14:22 . > drwxr-x--- 25 fedorov dkk 12288 Aug 27 14:27 .. > -rw-r----- 1 fedorov dkk 5069225 Aug 25 15:49 MRMeningioma0.nrrd > -rw-r----- 1 fedorov dkk 6132840 Aug 25 15:49 MRMeningioma1.nrrd > [fedorov at TG/Abe:honest4 SlicerReg] ls -la > /u/ac/fedorov/scratch-global/scratch/RigidRegistration1-20090827-1411-ttnpb5d3/shared/Data > total 10952 > drwxr-xr-x 2 fedorov dkk 4096 Aug 27 14:22 . > drwxr-xr-x 3 fedorov dkk 4096 Aug 27 14:10 .. 
> -rw-r--r-- 1 fedorov dkk 5069225 Aug 27 14:11 MRMeningioma0.nrrd > -rw-r--r-- 1 fedorov dkk 6132840 Aug 27 14:16 MRMeningioma1.nrrd > [fedorov at TG/Abe:honest4 SlicerReg] diff Data/MRMeningioma0.nrrd > /u/ac/fedorov/scratch-global/scratch/RigidRegistration1-20090827-1411-ttnpb5d3/shared/Data/MRMeningioma0.nrrd > Binary files Data/MRMeningioma0.nrrd and > /u/ac/fedorov/scratch-global/scratch/RigidRegistration1-20090827-1411-ttnpb5d3/shared/Data/MRMeningioma0.nrrd > differ > > I can read my original file, but not the copied one: > > [fedorov at TG/Abe:honest4 SlicerReg] ~/Slicer3-lib/teem-build/bin/unu > minmax /u/ac/fedorov/scratch-global/scratch/RigidRegistration1-20090827-1411-ttnpb5d3/shared/Data/MRMeningioma0.nrrd > unu minmax: trouble with > "/u/ac/fedorov/scratch-global/scratch/RigidRegistration1-20090827-1411-ttnpb5d3/shared/Data/MRMeningioma0.nrrd": > [unu minmax] unu minmax: trouble loading > "/u/ac/fedorov/scratch-global/scratch/RigidRegistration1-20090827-1411-ttnpb5d3/shared/Data/MRMeningioma0.nrrd" > [unu minmax] [nrrd] nrrdLoad: trouble reading > "/u/ac/fedorov/scratch-global/scratch/RigidRegistration1-20090827-1411-ttnpb5d3/shared/Data/MRMeningioma0.nrrd" > [unu minmax] [nrrd] nrrdRead: trouble > [unu minmax] [nrrd] _nrrdRead: trouble reading NRRD file > [unu minmax] [nrrd] _nrrdFormatNRRD_read: > [unu minmax] [nrrd] _nrrdEncodingGzip_read: error reading from gzFile > [unu minmax] [nrrd] _nrrdGzRead: data read error > [fedorov at TG/Abe:honest4 SlicerReg] ~/Slicer3-lib/teem-build/bin/unu > minmax Data/MRMeningioma0.nrrd > min: 0 > max: 695 > > > > Have you guys run any applications with non-trivial input file size, > and verified that file integritiy is preserved? 
> > > >> - sanity test 16 to 64 or so jobs, adding parallel clustering to the above >> >> - change from the PBS provider to the GRAM2 (pre-WS-GRAM) provider, but >> using the GRAM URLs at http://dev.globus.org/wiki/GRAM/GRAM5#Deployments >> (still submitting from the Abe login host to Abe. You can keep the local >> data provider for this case) >> >> - Add Queenbee GRAM5 as a second site, using the gridftp data provider. >> >> Mike >> >> >> On 8/25/09 11:54 AM, Michael Wilde wrote: >>> On 8/25/09 10:44 AM, Andriy Fedorov wrote: >>>> Michael, >>>> >>>> Thanks for the reply. >>>> >>>> So my understanding is, I should check out the trunk version and >>>> compile (yes, I've done this before), and try the real application >>>> with GT2+coasters. >>> Yes, thats a good step to re-master, in preparation for Mihael checking in >>> Coaster fixes. He made significant enhancements to Coasters in the past 2 >>> months, but has been working ona different project lately and thus these are >>> not yet sufficiently tested. If you're willing to help in the testing that >>> would be great. >>> >>> If not, I think the next best approach to try is this: >>> >>> - We have a small experimental mod that enables Swift GRAM2 jobs to use >>> all cores of multi-core hosts (such as the 8-core hosts on Abe and >>> QueenBee). Basically it uses the Swift clustering facility but runs jobs in >>> parallel instead of serially. >>> >>> It works well if your jobs have a very uniform runtime. If they dont, then >>> it wastes CPU. But its a good interim solution for many apps until coasters >>> is more stable. >>> >>> This is described at: >>> http://www.ci.uchicago.edu/wiki/bin/view/SWFT/SwiftParallelClustering >>> >>> This info is very preliminary and not end-user ready. Tibi Stef-Praun, on >>> this list has tried it. Please start a new thread here if you want to >>> discuss it or report experiences or problems with it. 
>>> >>> - On QueenBee or other GRAM5-enabled systems (not many test as its in test >>> mode) you can use the GRAM2 provider if submitting remotely. >>> On Abe and any other GRAM2 systems you should run this with the Condor-G >>> provider if submitting remotely. >>> >>> The rule of thumb here for submitting jobs to a site from Swift running >>> remotely on a submit host is: >>> >>> -- up to 20 jobs in parallel you can use plain GRAM2 >>> -- above 20 jobs, use Condor-G or, where available, GRAM2 >>> >>> - On Abe, QueenBee, and other PBS systems with login hosts, you can run >>> Swift locally on the login host, and use the PBS provider with the parallel >>> clustering approach. >>> >>> We have a few other solutions that I will save till we explore these two >>> solutions. >>> >>> To prepare for this, try running your app on Abe using the PBS provider, >>> with just 1 or 2 jobs, then try the parallel clustering tip above. >>> >>>> I do have an account on Queen Bee. You say, it has GT GRAM5, but I >>>> thought you also said I should target using GT2. What is GRAM5? >>> GRAM5 is a new, more efficient version of GRAM2. Its fully compatible, so >>> you just set Swift sites.xml exactly as for GRAM2. The only thing that >>> changes is that you use a different URL for the GRAM gatekeeper contact >>> string (ie different host and/or port, thats all). >>> >>> I'll need to get you the contact string for GRAM5 on QueenBee if/when we >>> both agree the time is right to try it. >>> >>>> At >>>> this point, my preference is the system with lowest load and confirmed >>>> functional coaster provider, to save time debugging and getting up to >>>> speed. Should I use Abe or Queen Bee? >>> Thats hard to answer, as the loads fluctuate. You can examine the >>> TeraPort system load monitor in the TG portal, which gives some rough >>> estimates of load and queue time. Then queue the jobs and wait. 
Best to run >>> Swift under screen, so you can easily wait for and monitor your script >>> executions from anywhere, and not be interrupted if long delays are >>> encountered. >>> >>> - Mike >>> >>>> As soon as I compile the current swift trunk and try GT2+coaster @Abe >>>> for my application, I will report to the list my experience. >>>> >>>> -- >>>> Andriy Fedorov, Ph.D. >>>> >>>> Research Fellow >>>> Brigham and Women's Hospital >>>> Harvard Medical School >>>> 75 Francis Street >>>> Boston, MA 02115 USA >>>> fedorov at bwh.harvard.edu >>>> >>>> >>>> >>>> On Tue, Aug 25, 2009 at 11:31, Michael Wilde wrote: >>>>> Andrey, >>>>> >>>>> On 8/25/09 9:49 AM, Andrey Fedorov wrote: >>>>>> Hi, >>>>>> >>>>>> I have a processing step that takes somewhere ~2-5 min. It takes on >>>>>> input two ~5Mb files, and produces a small text file, which I need to >>>>>> store. I need to compute large number of such jobs, using different >>>>>> parameters. It seems to me "coaster" is the best execution provider >>>>>> for my application. >>>>>> >>>>>> Trying to start simple, I am running first.swift (echo) example that >>>>>> comes with Swift using different providers: GT2, GT4, GT2/coaster, and >>>>>> GT4/coaster. All of this is done on Abe NCSA cluster. 
>>>>>> >>>>>> Here's my sites.xml: >>>>>> >>>>>> >>>>>> >>>>>> >>>>> >>>>>> >>>>>> url="https://grid-abe.ncsa.teragrid.org:8443/wsrf/services/ManagedJobFactoryService"/> >>>>>> /u/ac/fedorov/scratch-global/scratch >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>> >>>>>> >>>>>> url="https://grid-abe.ncsa.teragrid.org:8443/wsrf/services/ManagedJobFactoryService"/> >>>>>> /u/ac/fedorov/scratch-global/scratch >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>> url="grid-abe.ncsa.teragrid.org:2119/jobmanager-pbs"/> >>>>>> /u/ac/fedorov/scratch-global/scratch >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>> url="grid-abe.ncsa.teragrid.org"/> >>>>>> >>>>> /> >>>>>> /u/ac/fedorov/scratch-global/scratch >>>>>> >>>>>> >>>>>> And tc.data is simply >>>>>> >>>>>> Abe-GT4-coasters echo /bin/echo INSTALLED INTEL32::LINUX null >>>>>> >>>>>> and I change the site to test different providers. >>>>>> >>>>>> Now, results: >>>>>> >>>>>> 1) both GT2 and GT4 providers work fine, script completes >>>>>> >>>>>> 2) with GT2+coaster provider, I can see the job in the PBS queue >>>>>> (requested time is 01:41, I guess this comes with the default coaster >>>>>> parameters, that I didn't change). The job appears to finish >>>>>> successfully, but then I get this error: >>>>>> >>>>>> Final status: Finished successfully:1 >>>>>> START cleanups=[[first-20090825-0925-emkt2qt0, Abe-GT2-coasters]] >>>>>> START dir=first-20090825-0925-emkt2qt0 host=Abe-GT2-coasters >>>>>> Sending Command(21, SUBMITJOB) on GSSSChannel-null(1) >>>>>> Command(21, SUBMITJOB) CMD: Command(21, SUBMITJOB) >>>>>> GSSSChannel-null(1) REPL: Command(21, SUBMITJOB) >>>>>> Submitted task Task(type=JOB_SUBMISSION, >>>>>> identity=urn:0-1-1251210343871). Job id: >>>>>> urn:1251210343871-1251210376098-1251210376099 >>>>>> Unregistering Command(21, SUBMITJOB) >>>>>> GSSSChannel-null(1) REQ: Handler(JOBSTATUS) >>>>>> GSSSChannel-null(1) REQ: Handler(JOBSTATUS) >>>>>> Task(type=JOB_SUBMISSION, identity=urn:0-1-1251210343871) Completed. 
>>>>>> Waiting: 0, Running: 0. Heap size: 65M, Heap free: 42M, Max heap: 227M >>>>>> END dir=first-20090825-0925-emkt2qt0 host=Abe-GT2-coasters >>>>>> Cleaning up... >>>>>> Shutting down service at https://141.142.68.180:45552 >>>>>> Got channel MetaChannel: 500265006 -> GSSSChannel-null(1) >>>>>> Sending Command(22, SHUTDOWNSERVICE) on GSSSChannel-null(1) >>>>>> Command(22, SHUTDOWNSERVICE) CMD: Command(22, SHUTDOWNSERVICE) >>>>>> Command(22, SHUTDOWNSERVICE): handling reply timeout >>>>>> Command(22, SHUTDOWNSERVICE): failed too many times >>>>>> org.globus.cog.karajan.workflow.service.ReplyTimeoutException >>>>>> at >>>>>> >>>>>> org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:241) >>>>>> at >>>>>> >>>>>> org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:246) >>>>>> at java.util.TimerThread.mainLoop(Timer.java:512) >>>>>> at java.util.TimerThread.run(Timer.java:462) >>>>>> - Done >>>>> This seems like a low-prio error. I'll file it in bugzilla for now. Lets >>>>> see >>>>> how coasters works for you on Abe using your real app and a larger >>>>> number of >>>>> jobs, and come back to this shutdown problem if it proves to be a >>>>> blocker to >>>>> getting work done. >>>>> >>>>> Coasters has a few other current issues - mainly not throttling work >>>>> efficiently - that we have a fix for, and need to apply and test that >>>>> one >>>>> first. >>>>> >>>>> We've also been experimenting with a non-coaster way to use all 8 cores >>>>> of >>>>> machines like Abe, but lets try the coaster route first, of thats OK >>>>> with >>>>> you, and lets focus on GT2/Coasters, as that will be more common. >>>>> >>>>> In addition, there is a test version of GT GRAM5 on QueenBee, Abe's >>>>> sister-system at LSU, which we can try, assuming your TG project lets >>>>> you >>>>> run there. >>>>> >>>>> So please try to run the app, and we will try to get the latest coaster >>>>> fixes committed. 
(I assume you are comfortable extracting Swift from svn >>>>> and >>>>> building it; if you have not done this before, can you try it, Andrey?) >>>>> >>>>> Regards, >>>>> >>>>> Mike >>>>> >>>>> >>>>>> 3) with GT4-coaster provider, I don't get as far as with GT2-coaster. >>>>>> Possibly I am not setting up properly the site entry. I was not able >>>>>> to find any examples in the manual how to set coasters with GT4 (can >>>>>> anyone provide an example?). Here's the error: >>>>>> >>>>>> Failed to transfer wrapper log from >>>>>> first-20090825-0929-39x94x09/info/t on Abe-GT4-coasters >>>>>> END_FAILURE thread=0 tr=echo >>>>>> Progress: Failed:1 >>>>>> Execution failed: >>>>>> Exception in echo: >>>>>> Arguments: [Hello, world!] >>>>>> Host: Abe-GT4-coasters >>>>>> Directory: first-20090825-0929-39x94x09/jobs/t/echo-t5oymmfj >>>>>> stderr.txt: >>>>>> >>>>>> stdout.txt: >>>>>> >>>>>> ---- >>>>>> >>>>>> Caused by: >>>>>> Cannot submit job: Limited proxy is not accepted >>>>>> >>>>>> >>>>>> Can anybody help figuring this out? >>>>>> >>>>>> Thanks >>>>>> -- >>>>>> Andriy Fedorov, Ph.D. 
>>>>>> >>>>>> Research Fellow >>>>>> Brigham and Women's Hospital >>>>>> Harvard Medical School >>>>>> 75 Francis Street >>>>>> Boston, MA 02115 USA >>>>>> fedorov at bwh.harvard.edu >>>>>> _______________________________________________ >>>>>> Swift-user mailing list >>>>>> Swift-user at ci.uchicago.edu >>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user From wilde at mcs.anl.gov Thu Aug 27 15:43:33 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 27 Aug 2009 15:43:33 -0500 Subject: [Swift-user] Problems getting started with coasters In-Reply-To: <1251405609.20895.8.camel@localhost> References: <82f536810908250749ve4c2c09xbfa5d6343ad5231c@mail.gmail.com> <4A9403CE.4050303@mcs.anl.gov> <82f536810908250844l69c2b2e8oe5caf3d73fcd46cb@mail.gmail.com> <4A94174D.3070901@mcs.anl.gov> <4A94198C.8090109@mcs.anl.gov> <82f536810908271237l7fe2e15bg368221bcde3e44cf@mail.gmail.com> <1251405012.20895.1.camel@localhost> <82f536810908271336y3c4c1d1ftfcab39cfb37b9a32@mail.gmail.com> <1251405609.20895.8.camel@localhost> Message-ID: <4A96EFF5.504@mcs.anl.gov> Thanks, Mihael - our posts crossed, and I didnt see your answer when I posted mine. - Mike On 8/27/09 3:40 PM, Mihael Hategan wrote: > The data corruption with the experimental coaster filesystem provider is > a known issue that we have not had the resources to deal with > unfortunately. > > Please use GridFTP which is better tested and known to behave. > > Mihael > > On Thu, 2009-08-27 at 16:36 -0400, Andriy Fedorov wrote: >> Here's the section for the coasters site: >> >> >> >> > url="grid-abe.ncsa.teragrid.org"/> >> >> /u/ac/fedorov/scratch-global/scratch >> >> >> >> >> On Thu, Aug 27, 2009 at 16:30, Mihael Hategan wrote: >>> On Thu, 2009-08-27 at 15:37 -0400, Andriy Fedorov wrote: >>> >>>> I need to submit a file (about 5M) as an input to my application. What >>>> seems to be happening is that the file gets corrupted in transmission! >>> What's the contents of your sites.xml? 
>>>
>>>
>

From fedorov at bwh.harvard.edu Fri Aug 28 13:56:27 2009
From: fedorov at bwh.harvard.edu (Andriy Fedorov)
Date: Fri, 28 Aug 2009 14:56:27 -0400
Subject: [Swift-user] Coasters with gt2 and localhost file provider
Message-ID: <82f536810908281156m4ee08d02s9d054b7e95be0689@mail.gmail.com>

Hi,

I have a gt2:gt2:pbs coaster provider on NCSA Abe with local
filesystem provider:

/u/ac/fedorov/scratch-global/scratch

I have been submitting jobs, which seemed to be stuck in the scheduler
queue for too long. Here's the output of swift -v:

Unregistering Command(6, SUBMITJOB)
Progress: Submitted:2 Finished successfully:3
Progress: Submitted:2 Finished successfully:3
Progress: Submitted:2 Finished successfully:3
Progress: Submitted:2 Finished successfully:3
Progress: Submitted:2 Finished successfully:3
Progress: Submitted:2 Finished successfully:3
Progress: Submitted:2 Finished successfully:3
Progress: Submitted:2 Finished successfully:3
GSSSChannel-null(1) REQ: Handler(HEARTBEAT)
Progress: Submitted:2 Finished successfully:3
Progress: Submitted:2 Finished successfully:3
Progress: Submitted:2 Finished successfully:3
Progress: Submitted:2 Finished successfully:3
Progress: Submitted:2 Finished successfully:3
Progress: Submitted:2 Finished successfully:3
Progress: Submitted:2 Finished successfully:3
Progress: Submitted:2 Finished successfully:3
Progress: Submitted:2 Finished successfully:3
Progress: Submitted:2 Finished successfully:3
GSSSChannel-null(1) REQ: Handler(HEARTBEAT)
..... many many times ......

Upon investigating this, it turns out that the scheduler delay is not
the source of the problem. By looking at the output of "qstat", I see
a job of 1 hr length scheduled, then it gets into the queue, waits,
runs, completes, and immediately a new job of 1 hr length is
scheduled. This repeats over and over.

Nothing in the output of "swift -v" explains what is going on.
Looking at the log, I see this: 2009-08-28 08:23:31,703-0500 INFO AbstractKarajanChannel GSSSChannel-null(1) REPL: Command(6, SUBMITJOB) 2009-08-28 08:23:31,704-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, identity=urn:0-4-1-1251465736537) setting status to Submitted 2009-08-28 08:23:31,704-0500 DEBUG WeightedHostScoreScheduler Submission time for Task(type=JOB_SUBMISSION, identity=urn:0-4-1-1251465736537): 56ms. Score delta: 0.002276923076923077 2009-08-28 08:23:31,704-0500 DEBUG WeightedHostScoreScheduler multiplyScore(Abe-GT2-coasters:1.606(2.487):2/1 overload: 1, 0.002276923076923077) 2009-08-28 08:23:31,704-0500 DEBUG WeightedHostScoreScheduler Old score: 1.606, new score: 1.608 2009-08-28 08:23:31,704-0500 INFO JobSubmissionTaskHandler Submitted task Task(type=JOB_SUBMISSION, identity=urn:0-4-1-1251465736537). Job id: urn:1251465736537-1251465756817-1251465756818 2009-08-28 08:23:31,704-0500 INFO AbstractKarajanChannel Unregistering Command(6, SUBMITJOB) 2009-08-28 08:27:34,210-0500 INFO AbstractKarajanChannel GSSSChannel-null(1) REQ: Handler(HEARTBEAT) 2009-08-28 08:32:34,232-0500 INFO AbstractKarajanChannel GSSSChannel-null(1) REQ: Handler(HEARTBEAT) .... many many times .... 2009-08-28 13:22:34,354-0500 INFO AbstractKarajanChannel GSSSChannel-null(1) REQ: Handler(HEARTBEAT) 2009-08-28 13:27:34,359-0500 INFO AbstractKarajanChannel GSSSChannel-null(1) REQ: Handler(HEARTBEAT) 2009-08-28 13:32:34,358-0500 INFO AbstractKarajanChannel GSSSChannel-null(1) REQ: Handler(HEARTBEAT) 2009-08-28 13:37:34,363-0500 INFO AbstractKarajanChannel GSSSChannel-null(1) REQ: Handler(HEARTBEAT) Look at the timestamps! Note, that I do see the jobs go from Q to R status, I have no idea which jobs they are, and what they are doing. The complete log (after interruption) is attached. I also attach my simple swift script -- there are no loops, this is single execution of a component of my application, before which I do "ls" and calculate md5 sum of the input images. 
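
The stall is easier to see when the gap between the last real event and
the latest heartbeat is computed rather than read off by eye. A minimal
sketch in Python, assuming one record per line in the timestamp format
shown in the excerpt above; parse_record and quiet_period are names
invented for this illustration and are not part of Swift.

```python
import re
from datetime import datetime

# Matches e.g. "2009-08-28 08:23:31,704-0500 INFO  AbstractKarajanChannel ..."
RECORD_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}[+-]\d{4})\s+(\w+)\s+(.*)$"
)


def parse_record(line):
    """Return (timestamp, message) for a log record, or None otherwise."""
    match = RECORD_RE.match(line)
    if match is None:
        return None  # continuation line or unrelated text
    stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f%z")
    return stamp, match.group(3)


def quiet_period(lines):
    """Time between the last non-heartbeat record and the last record."""
    last_activity = None
    last_seen = None
    for line in lines:
        parsed = parse_record(line)
        if parsed is None:
            continue
        stamp, message = parsed
        last_seen = stamp
        if "HEARTBEAT" not in message:
            last_activity = stamp
    if last_activity is None or last_seen is None:
        return None
    return last_seen - last_activity
```

Feeding it the lines of the run log (for example via open(logfile)) gives
a single timedelta, which is large when only heartbeats have been logged
for hours.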
I have

Swift svn swift-r3100 cog-r2446

What am I doing wrong?

--
Andriy Fedorov, Ph.D.

Research Fellow
Brigham and Women's Hospital
Harvard Medical School
75 Francis Street
Boston, MA 02115 USA
fedorov at bwh.harvard.edu
-------------- next part --------------
A non-text attachment was scrubbed...
Name: RigidRegistration1-20090828-0822-k6o8oqd9.log
Type: text/x-log
Size: 85709 bytes
Desc: not available
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: RigidRegistration1.swift
Type: application/octet-stream
Size: 1056 bytes
Desc: not available
URL:

From hategan at mcs.anl.gov Fri Aug 28 14:36:51 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 28 Aug 2009 14:36:51 -0500
Subject: [Swift-user] Re: Coasters with gt2 and localhost file provider
In-Reply-To: <82f536810908281156m4ee08d02s9d054b7e95be0689@mail.gmail.com>
References: <82f536810908281156m4ee08d02s9d054b7e95be0689@mail.gmail.com>
Message-ID: <1251488211.5142.3.camel@localhost>

Right. Workers shut down after 10 seconds of inactivity.

I added an option ("maxWorkerIdleTime", in seconds) and changed the
default to 2 minutes (cog r2455).
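
Why the idle limit matters for a script with a few spaced-out steps can
be seen in a toy model. This is only a sketch in Python under simplifying
assumptions (a single worker, no queue wait, tasks given as arrival time
and duration in seconds); it is not the coaster service's actual logic,
and count_worker_starts is a made-up name.

```python
def count_worker_starts(tasks, max_idle_seconds):
    """Count how often a worker must be (re)started.

    tasks: iterable of (arrival_time, duration) pairs in seconds,
    sorted by arrival time. A worker that stays idle for longer than
    max_idle_seconds is assumed to have shut down, so the next task
    needs a fresh worker (in practice, a fresh scheduler job).
    """
    starts = 0
    free_at = None  # when the live worker finished its last task
    for arrival, duration in tasks:
        if free_at is None or arrival - free_at > max_idle_seconds:
            starts += 1          # no live worker: start a new one
            begin = arrival
        else:
            begin = max(arrival, free_at)  # reuse the live worker
        free_at = begin + duration
    return starts


# Three steps with pauses of 25 s and 50 s between them.
steps = [(0, 5), (30, 120), (200, 120)]
print(count_worker_starts(steps, 10))    # idle limit of 10 seconds
print(count_worker_starts(steps, 120))   # idle limit of 2 minutes
```

With the short limit every step pays for a new worker, and on a batch
system each new worker means another wait in the queue; with the longer
limit the same worker is reused.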
Mihael On Fri, 2009-08-28 at 14:56 -0400, Andriy Fedorov wrote: > Hi, > > I have a gt2:gt2:pbs coaster provider on NCSA Abe with local > filesystem provider: > > > > url="grid-abe.ncsa.teragrid.org"/> > /u/ac/fedorov/scratch-global/scratch > > > I have been submitting jobs, which seemed to be stuck in the scheduler > queue for too long, here's the output of swift -v: > > Unregistering Command(6, SUBMITJOB) > Progress: Submitted:2 Finished successfully:3 > Progress: Submitted:2 Finished successfully:3 > Progress: Submitted:2 Finished successfully:3 > Progress: Submitted:2 Finished successfully:3 > Progress: Submitted:2 Finished successfully:3 > Progress: Submitted:2 Finished successfully:3 > Progress: Submitted:2 Finished successfully:3 > Progress: Submitted:2 Finished successfully:3 > GSSSChannel-null(1) REQ: Handler(HEARTBEAT) > Progress: Submitted:2 Finished successfully:3 > Progress: Submitted:2 Finished successfully:3 > Progress: Submitted:2 Finished successfully:3 > Progress: Submitted:2 Finished successfully:3 > Progress: Submitted:2 Finished successfully:3 > Progress: Submitted:2 Finished successfully:3 > Progress: Submitted:2 Finished successfully:3 > Progress: Submitted:2 Finished successfully:3 > Progress: Submitted:2 Finished successfully:3 > Progress: Submitted:2 Finished successfully:3 > GSSSChannel-null(1) REQ: Handler(HEARTBEAT) > ..... many many times ...... > > Upon investigating this, it turns out that the scheduler delay is not > the source of the problem. By looking at the output of "qstat", I see > a job of 1 hr lenght scheduled, then it gets into the queue, waits, > runs, completes, and immediately a new job of 1 hr lenght is > scheduled. This repeats over and over. > > No output of "swift -v" gives me explanation of what is going on. 
> > Looking at the log, I see this: > > 2009-08-28 08:23:31,703-0500 INFO AbstractKarajanChannel > GSSSChannel-null(1) REPL: Command(6, SUBMITJOB) > 2009-08-28 08:23:31,704-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, > identity=urn:0-4-1-1251465736537) setting status to Submitted > 2009-08-28 08:23:31,704-0500 DEBUG WeightedHostScoreScheduler > Submission time for Task(type=JOB_SUBMISSION, > identity=urn:0-4-1-1251465736537): 56ms. Score delta: > 0.002276923076923077 > 2009-08-28 08:23:31,704-0500 DEBUG WeightedHostScoreScheduler > multiplyScore(Abe-GT2-coasters:1.606(2.487):2/1 overload: 1, > 0.002276923076923077) > 2009-08-28 08:23:31,704-0500 DEBUG WeightedHostScoreScheduler Old > score: 1.606, new score: 1.608 > 2009-08-28 08:23:31,704-0500 INFO JobSubmissionTaskHandler Submitted > task Task(type=JOB_SUBMISSION, identity=urn:0-4-1-1251465736537). Job > id: urn:1251465736537-1251465756817-1251465756818 > 2009-08-28 08:23:31,704-0500 INFO AbstractKarajanChannel > Unregistering Command(6, SUBMITJOB) > 2009-08-28 08:27:34,210-0500 INFO AbstractKarajanChannel > GSSSChannel-null(1) REQ: Handler(HEARTBEAT) > 2009-08-28 08:32:34,232-0500 INFO AbstractKarajanChannel > GSSSChannel-null(1) REQ: Handler(HEARTBEAT) > .... many many times .... > 2009-08-28 13:22:34,354-0500 INFO AbstractKarajanChannel > GSSSChannel-null(1) REQ: Handler(HEARTBEAT) > 2009-08-28 13:27:34,359-0500 INFO AbstractKarajanChannel > GSSSChannel-null(1) REQ: Handler(HEARTBEAT) > 2009-08-28 13:32:34,358-0500 INFO AbstractKarajanChannel > GSSSChannel-null(1) REQ: Handler(HEARTBEAT) > 2009-08-28 13:37:34,363-0500 INFO AbstractKarajanChannel > GSSSChannel-null(1) REQ: Handler(HEARTBEAT) > > Look at the timestamps! > > Note, that I do see the jobs go from Q to R status, I have no idea > which jobs they are, and what they are doing. > > The complete log (after interruption) is attached. 
I also attach my > simple swift script -- there are no loops, this is single execution of > a component of my application, before which I do "ls" and calculate > md5 sum of the input images. > > I have > > Swift svn swift-r3100 cog-r2446 > > What am I doing wrong? > > -- > Andriy Fedorov, Ph.D. > > Research Fellow > Brigham and Women's Hospital > Harvard Medical School > 75 Francis Street > Boston, MA 02115 USA > fedorov at bwh.harvard.edu From fedorov at bwh.harvard.edu Fri Aug 28 14:57:40 2009 From: fedorov at bwh.harvard.edu (Andriy Fedorov) Date: Fri, 28 Aug 2009 15:57:40 -0400 Subject: [Swift-user] Re: Coasters with gt2 and localhost file provider In-Reply-To: <1251488211.5142.3.camel@localhost> References: <82f536810908281156m4ee08d02s9d054b7e95be0689@mail.gmail.com> <1251488211.5142.3.camel@localhost> Message-ID: <82f536810908281257n65043caerb96e4e16bf5ab096@mail.gmail.com> Hey, Mihael, we are running! Thanks for the fix! I assume we go through all these troubles, because earlier you have been working with the large number of jobs that have very small execution time. New applications bring new troubles :) Oh joy -- I got my first successful coasters run with a real application component! -- Andriy Fedorov, Ph.D. Research Fellow Brigham and Women's Hospital Harvard Medical School 75 Francis Street Boston, MA 02115 USA fedorov at bwh.harvard.edu On Fri, Aug 28, 2009 at 15:36, Mihael Hategan wrote: > Right. Workers shut down after 10 seconds of inactivity. > > I added an option ("maxWorkerIdleTime", in seconds) and changed the > default to 2 minutes (cog r2455). > > Mihael > > On Fri, 2009-08-28 at 14:56 -0400, Andriy Fedorov wrote: >> Hi, >> >> I have a gt2:gt2:pbs coaster provider on NCSA Abe with local >> filesystem provider: >> >> >> >> > url="grid-abe.ncsa.teragrid.org"/> >>
/u/ac/fedorov/scratch-global/scratch >> >> >> I have been submitting jobs, which seemed to be stuck in the scheduler >> queue for too long, here's the output of swift -v: >> >> Unregistering Command(6, SUBMITJOB) >> Progress: Submitted:2 Finished successfully:3 >> Progress: Submitted:2 Finished successfully:3 >> Progress: Submitted:2 Finished successfully:3 >> Progress: Submitted:2 Finished successfully:3 >> Progress: Submitted:2 Finished successfully:3 >> Progress: Submitted:2 Finished successfully:3 >> Progress: Submitted:2 Finished successfully:3 >> Progress: Submitted:2 Finished successfully:3 >> GSSSChannel-null(1) REQ: Handler(HEARTBEAT) >> Progress: Submitted:2 Finished successfully:3 >> Progress: Submitted:2 Finished successfully:3 >> Progress: Submitted:2 Finished successfully:3 >> Progress: Submitted:2 Finished successfully:3 >> Progress: Submitted:2 Finished successfully:3 >> Progress: Submitted:2 Finished successfully:3 >> Progress: Submitted:2 Finished successfully:3 >> Progress: Submitted:2 Finished successfully:3 >> Progress: Submitted:2 Finished successfully:3 >> Progress: Submitted:2 Finished successfully:3 >> GSSSChannel-null(1) REQ: Handler(HEARTBEAT) >> ..... many many times ...... >> >> Upon investigating this, it turns out that the scheduler delay is not >> the source of the problem. By looking at the output of "qstat", I see >> a job of 1 hr length scheduled, then it gets into the queue, waits, >> runs, completes, and immediately a new job of 1 hr length is >> scheduled. This repeats over and over. >> >> No output of "swift -v" gives me an explanation of what is going on.
>> >> Looking at the log, I see this: >> >> 2009-08-28 08:23:31,703-0500 INFO AbstractKarajanChannel >> GSSSChannel-null(1) REPL: Command(6, SUBMITJOB) >> 2009-08-28 08:23:31,704-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, >> identity=urn:0-4-1-1251465736537) setting status to Submitted >> 2009-08-28 08:23:31,704-0500 DEBUG WeightedHostScoreScheduler >> Submission time for Task(type=JOB_SUBMISSION, >> identity=urn:0-4-1-1251465736537): 56ms. Score delta: >> 0.002276923076923077 >> 2009-08-28 08:23:31,704-0500 DEBUG WeightedHostScoreScheduler >> multiplyScore(Abe-GT2-coasters:1.606(2.487):2/1 overload: 1, >> 0.002276923076923077) >> 2009-08-28 08:23:31,704-0500 DEBUG WeightedHostScoreScheduler Old >> score: 1.606, new score: 1.608 >> 2009-08-28 08:23:31,704-0500 INFO JobSubmissionTaskHandler Submitted >> task Task(type=JOB_SUBMISSION, identity=urn:0-4-1-1251465736537). Job >> id: urn:1251465736537-1251465756817-1251465756818 >> 2009-08-28 08:23:31,704-0500 INFO AbstractKarajanChannel >> Unregistering Command(6, SUBMITJOB) >> 2009-08-28 08:27:34,210-0500 INFO AbstractKarajanChannel >> GSSSChannel-null(1) REQ: Handler(HEARTBEAT) >> 2009-08-28 08:32:34,232-0500 INFO AbstractKarajanChannel >> GSSSChannel-null(1) REQ: Handler(HEARTBEAT) >> .... many many times .... >> 2009-08-28 13:22:34,354-0500 INFO AbstractKarajanChannel >> GSSSChannel-null(1) REQ: Handler(HEARTBEAT) >> 2009-08-28 13:27:34,359-0500 INFO AbstractKarajanChannel >> GSSSChannel-null(1) REQ: Handler(HEARTBEAT) >> 2009-08-28 13:32:34,358-0500 INFO AbstractKarajanChannel >> GSSSChannel-null(1) REQ: Handler(HEARTBEAT) >> 2009-08-28 13:37:34,363-0500 INFO AbstractKarajanChannel >> GSSSChannel-null(1) REQ: Handler(HEARTBEAT) >> >> Look at the timestamps! >> >> Note, that I do see the jobs go from Q to R status, I have no idea >> which jobs they are, and what they are doing. >> >> The complete log (after interruption) is attached.
I also attach my >> simple swift script -- there are no loops, this is single execution of >> a component of my application, before which I do "ls" and calculate >> md5 sum of the input images. >> >> I have >> >> Swift svn swift-r3100 cog-r2446 >> >> What am I doing wrong? >> >> -- >> Andriy Fedorov, Ph.D. >> >> Research Fellow >> Brigham and Women's Hospital >> Harvard Medical School >> 75 Francis Street >> Boston, MA 02115 USA >> fedorov at bwh.harvard.edu > > From hategan at mcs.anl.gov Fri Aug 28 15:11:31 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 28 Aug 2009 15:11:31 -0500 Subject: [Swift-user] Re: Coasters with gt2 and localhost file provider In-Reply-To: <82f536810908281257n65043caerb96e4e16bf5ab096@mail.gmail.com> References: <82f536810908281156m4ee08d02s9d054b7e95be0689@mail.gmail.com> <1251488211.5142.3.camel@localhost> <82f536810908281257n65043caerb96e4e16bf5ab096@mail.gmail.com> Message-ID: <1251490291.6180.6.camel@localhost> On Fri, 2009-08-28 at 15:57 -0400, Andriy Fedorov wrote: > Hey, Mihael, we are running! Thanks for the fix! I assume we go > through all these troubles, because earlier you have been working with > the large number of jobs that have very small execution time. New > applications bring new troubles :) I cannot confirm nor deny that :) But know that you are providing very useful feedback. 
Mihael From fedorov at bwh.harvard.edu Fri Aug 28 15:16:54 2009 From: fedorov at bwh.harvard.edu (Andriy Fedorov) Date: Fri, 28 Aug 2009 16:16:54 -0400 Subject: [Swift-user] Re: Coasters with gt2 and localhost file provider In-Reply-To: <1251490291.6180.6.camel@localhost> References: <82f536810908281156m4ee08d02s9d054b7e95be0689@mail.gmail.com> <1251488211.5142.3.camel@localhost> <82f536810908281257n65043caerb96e4e16bf5ab096@mail.gmail.com> <1251490291.6180.6.camel@localhost> Message-ID: <82f536810908281316j4ec85854vae8d70f1ee51925c@mail.gmail.com> On Fri, Aug 28, 2009 at 16:11, Mihael Hategan wrote: > On Fri, 2009-08-28 at 15:57 -0400, Andriy Fedorov wrote: >> Hey, Mihael, we are running! Thanks for the fix! I assume we go >> through all these troubles, because earlier you have been working with >> the large number of jobs that have very small execution time. New >> applications bring new troubles :) > > I cannot confirm nor deny that :) > > But know that you are providing very useful feedback. > Thanks, Mihael -- I am happy to know this :) And you are providing very quick and good support! AF > Mihael > > From marcin at galton.uchicago.edu Thu Aug 27 10:54:38 2009 From: marcin at galton.uchicago.edu (Marcin Hitczenko) Date: Thu, 27 Aug 2009 15:54:38 -0000 Subject: [Swift-user] Re: swift on jazz In-Reply-To: <4A95B316.2090603@mcs.anl.gov> References: <53595.207.181.247.22.1251318011.squirrel@galton.uchicago.edu> <4A95B316.2090603@mcs.anl.gov> Message-ID: <59301.207.181.247.22.1251386649.squirrel@galton.uchicago.edu> Hi, I am actually observing the former, which is why I thought this might be controllable via swift. I also have a few other basic questions regarding following job status and organization of all the output files. I think the easiest thing to do would be to look at my account together. Is there any way that we could meet this week? 
Thanks, Marcin > Hi Marcin, > > I took the liberty of moving this thread to swift-user for others to > help me answer you, and for other users to benefit. > > On Jazz, are you observing that Swift only puts at most 2 jobs in the > Jazz PBS queue (where you can see them with "qstat") or that Swift puts > many jobs in the queue but only 2 run at a time? > > Assuming it's the latter, you must be bumping into Jazz's scheduler policy, > which favors multi-CPU jobs. If that's the case, then let's try > running the "coaster" provider, which is specified in the sites.xml file. > (tc.data doesn't change.) > > First, change your "jazz" entry in sites.xml from the PBS execution > provider: > > > > to the Coaster provider: > > > > This should work, although we may need to add additional XML > specifications for time limits, accounts, and maybe queues. > > Then we expect to be applying a fix to the coaster provider tonight, so > we'll need to do a custom Swift build from the source repository after > that, and test the latest fix. The fix improves the throughput, but even > without it, you should see Swift requesting more CPUs from PBS in a > single job. > > I suggest getting started with this simple change, and we'll enhance it > in stages to give you better performance and more parallelism. > > - Mike > > > On 8/26/09 3:20 PM, Marcin Hitczenko wrote: > >> ... I am running jobs on jazz and I noticed that jazz will only run >> at most two jobs at once for me (I have about 30), even though there are >> more nodes free and I am requiring only one node per job. Is there >> something I can do to change this? Would I have to change the tc.data or >> sites.xml file? >> >> Thanks, >> >> Marcin >
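Mike's diagnostic question in the thread above (does Swift place only two jobs in the PBS queue, or are many queued while only two run?) can be answered by tallying job states in saved "qstat" output. What follows is only a minimal sketch and not part of Swift: the function name, the sample job ids, host name, and queue name are hypothetical, and it assumes the default six-column qstat layout (Job id, Name, User, Time Use, S, Queue) with no spaces in job names.

```python
from collections import Counter


def tally_job_states(qstat_text, user):
    """Count PBS job states (e.g. Q = queued, R = running) for one user.

    Assumes the default qstat layout with six whitespace-separated
    columns per job row: Job id, Name, User, Time Use, S, Queue.
    """
    counts = Counter()
    for line in qstat_text.splitlines():
        fields = line.split()
        # The header row splits into 8 tokens and is skipped; the dashed
        # separator row has 6 tokens but never matches a real user name.
        if len(fields) == 6 and fields[2] == user:
            counts[fields[4]] += 1
    return counts


# Hypothetical sample, made up for illustration (not taken from Jazz).
SAMPLE_QSTAT = """\
Job id              Name             User            Time Use S Queue
------------------- ---------------- --------------- -------- - -----
101.examplehost     swiftjob         marcin          00:01:10 R shared
102.examplehost     swiftjob         marcin          00:00:42 R shared
103.examplehost     swiftjob         marcin          0        Q shared
104.examplehost     swiftjob         marcin          0        Q shared
105.examplehost     swiftjob         marcin          0        Q shared
106.examplehost     otherjob         someone         00:10:00 R shared
"""

if __name__ == "__main__":
    states = tally_job_states(SAMPLE_QSTAT, "marcin")
    print("running:", states["R"], "queued:", states["Q"])
```

If the queued count stays at or near zero while Swift still has work pending, Swift itself is limiting submission (the "former" case); if many jobs sit in state Q while only two are in state R, the limit is coming from the scheduler policy (the "latter" case).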