From nefedova at mcs.anl.gov Fri Jun 1 08:19:58 2007
From: nefedova at mcs.anl.gov (Veronika Nefedova)
Date: Fri, 1 Jun 2007 08:19:58 -0500
Subject: [Swift-devel] disk space requirement
Message-ID:

Hi,

I know I have raised this question many times before, but I think I need a solution to it very soon (=now). I have the machinery in place to do really big runs for MolDyn. Currently, the workflow produces about 0.9GB of data for each molecule, but this is all intermediate data; all I need is *one* 300K file as a result (per molecule). The rest is intermediate data that I do not need. So my questions are:

1. How can I eliminate the staging of intermediate results back to my submit host? I do not need it in the case of one remote compute pool. VDS had this feature...

2. If implementing this feature is very hard and time-consuming -- what submit host would you recommend for a 244-molecule run? Roughly 244 GB of space is needed.

A 20-molecule run is all I can do on wiggum (where the application code is currently compiled).

Thanks!

Nika

From wilde at mcs.anl.gov Fri Jun 1 08:55:12 2007
From: wilde at mcs.anl.gov (Mike Wilde)
Date: Fri, 01 Jun 2007 08:55:12 -0500
Subject: [Swift-devel] disk space requirement
In-Reply-To:
References:
Message-ID: <46602540.8090007@mcs.anl.gov>

Nika, regarding where you can get 244GB, we should send in a request to CI Support to get 1TB or so from the CI SAN mounted on a Swift submit host.

In the interim:

- you can get 100GB or so on NFS on terminable

- we can try to quickly add a 300GB drive (or two) to terminable

- One of the gridlab hosts (4 I think) should be available with that storage now. I am not sure if we feel this lab host is stable yet but we should make sure.

We need to be gearing up quickly to this scale, so the lab environment that I've been pushing to get set up is going to be increasingly important.
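For a sense of scale, the figures quoted above can be turned into a rough estimate. This is only a minimal sketch: it assumes decimal units and reads "300K" as 300 KB, and the function names are illustrative rather than anything from Swift or VDS.

```python
# Back-of-the-envelope disk estimate using the figures quoted in this thread.
# Assumptions: decimal units (1 GB = 1,000,000 KB) and "300K" meaning 300 KB.

GB_PER_MOLECULE = 0.9          # all data produced per molecule (from the thread)
RESULT_KB_PER_MOLECULE = 300   # the one result file that is actually needed


def staged_gb(molecules: int) -> float:
    """Space needed on the submit host if every file is staged back."""
    return molecules * GB_PER_MOLECULE


def results_only_gb(molecules: int) -> float:
    """Space needed if only the final result file per molecule is staged back."""
    return molecules * RESULT_KB_PER_MOLECULE / 1_000_000  # KB -> GB


if __name__ == "__main__":
    for n in (20, 244):
        print(f"{n:>3} molecules: all data ~{staged_gb(n):.1f} GB, "
              f"results only ~{results_only_gb(n):.3f} GB")
```

At 0.9 GB per molecule the 244-molecule run comes to about 220 GB, which matches the "roughly 244 GB" above if one rounds up to 1 GB per molecule. Keeping only the result files would need well under 1 GB, which is why avoiding the stage-back of intermediate data matters more than finding a bigger disk.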
Regarding the core limitation that Nika points out - this seems to dictate that we take the step of allowing workflow inputs, outputs and intermediate results to be located on any gridftp server and tracked via a replica catalog, as VDS did.

Ben, Mihael, do you have a feel for where we can slot this into development? v3 - say circa September?

Mike

--
Mike Wilde
Computation Institute, University of Chicago
Math & Computer Science Division
Argonne National Laboratory
Argonne, IL 60439 USA
tel 630-252-7497 fax 630-252-1997

From hategan at mcs.anl.gov Fri Jun 1 09:03:01 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 01 Jun 2007 17:03:01 +0300
Subject: [Swift-devel] disk space requirement
In-Reply-To: <46602540.8090007@mcs.anl.gov>
References: <46602540.8090007@mcs.anl.gov>
Message-ID: <1180706581.32642.3.camel@blabla.mcs.anl.gov>

On Fri, 2007-06-01 at 08:55 -0500, Mike Wilde wrote:
> Regarding the core limitation that Nika points out - this seems to
> dictate that we take the step of allowing workflow inputs, outputs
> and intermediate results to be located on any gridftp server and
> tracked via a replica catalog, as VDS did.

I'd say tracking where intermediate files are for a run should not necessarily imply a "replica catalog" as implemented by VDS (in particular RLS).

> Ben, Mihael, do you have a feel for where we can slot this into
> development? v3 - say circa September?

Sounds reasonable with the limited knowledge I have.
Mihael

From tiberius at ci.uchicago.edu Fri Jun 1 14:25:24 2007
From: tiberius at ci.uchicago.edu (Tiberiu Stef-Praun)
Date: Fri, 1 Jun 2007 14:25:24 -0500
Subject: [Swift-devel] simple_mapper issue ( i might have reported it a while ago)
Message-ID:

happens in vdsk-april-29

This one is ok
file procOut;

This one is not
file procOut;

The error:
Execution failed:
java.lang.ClassCastException: org.griphyn.vdl.mapping.DataNode
Caused by:
java.lang.ClassCastException: org.griphyn.vdl.mapping.DataNode
	at org.griphyn.vdl.mapping.RootDataNode.init(RootDataNode.java:29)
	at org.griphyn.vdl.karajan.lib.New.function(New.java:103)
	at org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:60)
	at org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192)
	at org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33)
	at org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:334)

--
Tiberiu (Tibi) Stef-Praun, PhD
Research Staff, Computation Institute
5640 S. Ellis Ave, #405
University of Chicago
http://www-unix.mcs.anl.gov/~tiberius/

From tiberius at ci.uchicago.edu Fri Jun 1 14:27:35 2007
From: tiberius at ci.uchicago.edu (Tiberiu Stef-Praun)
Date: Fri, 1 Jun 2007 14:27:35 -0500
Subject: [Swift-devel] Re: simple_mapper issue ( i might have reported it a while ago)
In-Reply-To:
References:
Message-ID:

Disregard this. It was caused by user error.

--
Tiberiu (Tibi) Stef-Praun, PhD
Research Staff, Computation Institute
5640 S. Ellis Ave, #405
University of Chicago
http://www-unix.mcs.anl.gov/~tiberius/

From nefedova at mcs.anl.gov Fri Jun 1 16:59:08 2007
From: nefedova at mcs.anl.gov (Veronika Nefedova)
Date: Fri, 1 Jun 2007 16:59:08 -0500
Subject: [Swift-devel] LQCD meeting at Fermi
Message-ID: <4C018A99-0639-43AA-BA06-525B955A56A1@mcs.anl.gov>

Hi,

Yong and I met with Xian-He and his team today to talk over their current problems with the production swift code.

Some of the major issues we talked about:

- Separation of concerns: SwiftScript could be made to just describe the abstract interfaces and data flows, and the app blocks could be pushed into some separate specifications (in a repository or something), in which other scripting languages can be used (e.g. python) to specify how to invoke an actual application.

- Dealing with absolute path:
LQCD uses dcache, which requires copying to/from some absolute path.

- Run clean up jobs outside pbs (i.e. using the fork manager instead)

- parameter problem: need to override things in tc.data, sites.xml, like number of nodes for MPI jobs
possible solution: put profile specification back in. (but we do not have derivations, in which we were able to put some profiles).
template based sites.xml and tc.data (generate the actual config files using some templates and user supplied values at runtime)

- DB-mapper: users have elaborate input data structures and keep them in the DB, so it would be nice to have a mapper that would read the input from the DB. This feature is in the works (?)

- intermediate results problem -- the same as MolDyn: need to have an ability to specify which files to keep and which not.

- quoting problem:
MPIrun does not deal correctly with "" that are passed to wrapper.sh
I remember there was also a quoting issue with condor queues.

We also talked about using Falkon.
But since LQCD uses dedicated resources (600 or more nodes) and the pbs queue checking time is set to around 10s, it is not a big issue for them to run a large number of jobs.

None of these except for the absolute path problem is a show-stopper. Next we'll try to get their swiftscript running, and push some of the requests into 0.3 features.

Yong and Nika

From benc at hawaga.org.uk Fri Jun 1 17:20:42 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 1 Jun 2007 22:20:42 +0000 (GMT)
Subject: [Swift-devel] LQCD meeting at Fermi
In-Reply-To: <4C018A99-0639-43AA-BA06-525B955A56A1@mcs.anl.gov>
References: <4C018A99-0639-43AA-BA06-525B955A56A1@mcs.anl.gov>
Message-ID:

On Fri, 1 Jun 2007, Veronika Nefedova wrote:

> - DB-mapper: users have an elaborate input data structures, keep it in the DB,
> so it would be nice to have a mapper that would read the input from the DB.
> This feature is in the works (?)

(i) The *data* in a DB, or

(ii) paths to datafiles in the DB with actual data stored in disk files?

(i) is much harder than (ii)

> - Dealing with absolute path:
> LQCD uses dcache, which requires copying to/from some absolute path.

By this do you mean that their input/output files are stored in the unix filesystem on the submit node, but in some directory that is not the pwd, and that that directory causes files to be accessed from dcache?

dcache has other access methods, such as gridftp (I think). Do you know if they use that? In some cases, but maybe not this case, that would allow staging from the dcache ftp server to the site workspace without going via the submit node.

--

From foster at mcs.anl.gov Fri Jun 1 17:34:03 2007
From: foster at mcs.anl.gov (Ian Foster)
Date: Fri, 01 Jun 2007 17:34:03 -0500
Subject: [Swift-devel] LQCD meeting at Fermi
In-Reply-To: <4C018A99-0639-43AA-BA06-525B955A56A1@mcs.anl.gov>
References: <4C018A99-0639-43AA-BA06-525B955A56A1@mcs.anl.gov>
Message-ID: <46609EDB.5020009@mcs.anl.gov>

Nika:

Thanks for the summary.
I am very eager to see some results for executions of real application problems. Did they agree to a timeline for that?

Ian.

--
Ian Foster, Director, Computation Institute
Argonne National Laboratory & University of Chicago
Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439
Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637
Tel: +1 630 252 4619. Web: www.ci.uchicago.edu.
Globus Alliance: www.globus.org.

From benc at hawaga.org.uk Fri Jun 1 17:38:42 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 1 Jun 2007 22:38:42 +0000 (GMT)
Subject: [Swift-devel] LQCD meeting at Fermi
In-Reply-To: <46609EDB.5020009@mcs.anl.gov>
References: <4C018A99-0639-43AA-BA06-525B955A56A1@mcs.anl.gov> <46609EDB.5020009@mcs.anl.gov>
Message-ID:

On Fri, 1 Jun 2007, Ian Foster wrote:

> Thanks for the summary. I am very eager to see some results for executions of
> real application problems. Did they agree to a timeline for that?

Would also be good to have some definition of what is an acceptable 'real application problem', in terms of which programs are to be run on which datasets on what sized resource.

--

From itf at mcs.anl.gov Fri Jun 1 17:50:41 2007
From: itf at mcs.anl.gov (Ian Foster)
Date: Fri, 1 Jun 2007 22:50:41 +0000
Subject: [Swift-devel] LQCD meeting at Fermi
In-Reply-To:
References: <4C018A99-0639-43AA-BA06-525B955A56A1@mcs.anl.gov><46609EDB.5020009@mcs.anl.gov>
Message-ID: <1472851168-1180738251-cardhu_blackberry.rim.net-646943670-@bwe005-cell00.bisx.prod.on.blackberry>

Ben:

The question for me is whether they are invested, or just playing with it and finding things that are "wrong" not because they are serious problems but rather as excuses for them not to do more. (I've often experienced that.)
So I want to know if they are using it for real work or not. So far things don't look so good--as I understand things, they don't have a working program after what must be 4-5 months talking to us (?).

Ian

Sent via BlackBerry from T-Mobile

From nefedova at mcs.anl.gov Fri Jun 1 18:22:06 2007
From: nefedova at mcs.anl.gov (Veronika Nefedova)
Date: Fri, 1 Jun 2007 18:22:06 -0500
Subject: [Swift-devel] LQCD meeting at Fermi
In-Reply-To: <1472851168-1180738251-cardhu_blackberry.rim.net-646943670-@bwe005-cell00.bisx.prod.on.blackberry>
References: <4C018A99-0639-43AA-BA06-525B955A56A1@mcs.anl.gov><46609EDB.5020009@mcs.anl.gov> <1472851168-1180738251-cardhu_blackberry.rim.net-646943670-@bwe005-cell00.bisx.prod.on.blackberry>
Message-ID: <05C56433-BBB0-49FE-88F5-F602FCE8F696@mcs.anl.gov>

Ian,

they seem like quite a strange bunch to me. Here is my experience with them. When I first met them back in April, they gave us their code and explained what they wanted to achieve. Within a couple of weeks I sent them the working code for their workflow. It took them another couple of weeks to start testing it (and it worked for them as well). After that they said that they would modify that workflow to suit their production needs. And they disappeared for another 3 weeks.
During those 3 weeks I sent them numerous emails offering my help (basically saying - "give me your production code and data and I'll make it work for you") - but there was no response. That lasted until last week, when they sent us their code (not working) and questions. The code is very close to working condition; unfortunately they were "reinventing the wheel" instead of asking us for help (for example, they spent quite some time trying to do string concatenation without using the @strcat function). They also did a hack for a random number generator instead of using the function that I sent them (which is a much easier and cleaner way). So my impression was that they wanted to figure out swift on their own (which is good) but without our help (which is bad). But now I think they are ready for us to step in. They mentioned something about a 2-month timeframe in which they need to have things running...

Nika

From yongzh at cs.uchicago.edu Fri Jun 1 21:10:28 2007
From: yongzh at cs.uchicago.edu (Yong Zhao)
Date: Fri, 1 Jun 2007 21:10:28 -0500 (CDT)
Subject: [Swift-devel] LQCD meeting at Fermi
In-Reply-To:
References: <4C018A99-0639-43AA-BA06-525B955A56A1@mcs.anl.gov>
Message-ID:

It is case (i): they want to read some parameter settings from the db. But I do not think the two cases are very different; in both we are to map something from a database into in-memory data structures.

Yong.

On Fri, 1 Jun 2007, Ben Clifford wrote:

> (i) The *data* in a DB, or
>
> (ii) paths to datafiles in the DB with actual data
> stored in disk files?
>
> (i) is much harder than (ii)

From yongzh at cs.uchicago.edu Fri Jun 1 21:20:29 2007
From: yongzh at cs.uchicago.edu (Yong Zhao)
Date: Fri, 1 Jun 2007 21:20:29 -0500 (CDT)
Subject: [Swift-devel] LQCD meeting at Fermi
In-Reply-To: <1472851168-1180738251-cardhu_blackberry.rim.net-646943670-@bwe005-cell00.bisx.prod.on.blackberry>
References: <4C018A99-0639-43AA-BA06-525B955A56A1@mcs.anl.gov><46609EDB.5020009@mcs.anl.gov> <1472851168-1180738251-cardhu_blackberry.rim.net-646943670-@bwe005-cell00.bisx.prod.on.blackberry>
Message-ID:

They do mention in the discussion that the Swift approach is not much different from their ad hoc scripts (just because of the way we deal with file names and command line arguments in invoking applications). They said they wanted something purer (in the sense of having just procedure interfaces & data flows, and hiding invocation/mapping/wrapping details). But I guess neither side (us and them) has a clear idea about how to actually achieve that.

They also noticed the mixed flavor of scripting, imperative programming and functional programming in the language; they seem to favor the python style, a more functional style of programming language. Also they do not quite like the @ functions.

I think that due to the slow process, and because they have been trying to figure out the language features by themselves, they are a bit discouraged and disappointed at this point. So we need to get their semi-working scripts working very quickly in order not to alienate them.

Yong.

On Fri, 1 Jun 2007, Ian Foster wrote:

> Ben:
>
> The question for me is whether they are invested, or just playing with it and finding things that are "wrong" not because they are serious problems but rather as excuses for them not to do more. (I've often experienced that.)
>
> So I want to know if they are using it for real work or not.
> So far things don't look so good--as I understand things, they don't have a working program after what must be 4-5 months talking to us (?).
>
> Ian

From hategan at mcs.anl.gov Sat Jun 2 03:33:31 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sat, 02 Jun 2007 11:33:31 +0300
Subject: [Swift-devel] LQCD meeting at Fermi
In-Reply-To: <4C018A99-0639-43AA-BA06-525B955A56A1@mcs.anl.gov>
References: <4C018A99-0639-43AA-BA06-525B955A56A1@mcs.anl.gov>
Message-ID: <1180773211.6680.8.camel@blabla.mcs.anl.gov>

On Fri, 2007-06-01 at 16:59 -0500, Veronika Nefedova wrote:
> Some of the major issues we talked about:
>
> - Separation of concerns: SwiftScript could be made to just describe the
> abstract interfaces and data flows, and the app blocks could be pushed
> into some separate specifications ( in a repository or something ), in
> which other scripting languages can be used (e.g. python) to specify how
> to invoke an actual application.

How's that different from application wrappers?

> - Dealing with absolute path:
> LQCD uses dcache, which requires copying to/from some absolute path.

This, I think, is the same as the ability to have non-local input and output files.

> - Run clean up jobs outside pbs (i.e. using the fork manager instead)

We've discussed this before, and there are two choices:

1. Use the file provider. This may be inefficient because most of them, in particular GridFTP, don't have a recursive delete. The local one, which they are using, does. This may imply another configuration option.

2. Make sure there's always a fork job manager there and use that. This means that the local PBS provider needs to become a job manager to the local provider rather than a stand-alone provider.

> - parameter problem: need to override things in tc.data, sites.xml, like
> number of nodes for MPI jobs
> possible solution: put profile specification back in. (but we do not
> have derivations, in which we were able to put some profiles).

Can you explain that? VDS != Swift. And we shouldn't talk about Swift having some literal thing from VDS, but rather the bit that achieves similar functionality.

> template based sites.xml and tc.data (generate the actual config files
> using some templates and user supplied values at runtime)

About sites.xml, we discussed in an email exchange the possibility of doing that. Luckily, in Swift, sites.xml is a karajan script, so it can do things like import("anothersites.xml") and so on.

> - DB-mapper: users have an elaborate input data structures, keep it
> in the DB, so it would be nice to have a mapper that would read the
> input from the DB. This feature is in the works (?)
>
> -intermediate results problem -- the same as MolDyn: need to have an
> ability to specify which file to keep and which file not.
>
> - quoting problem:
> MPIrun does not deal correctly with "" that are passed to wrapper.sh
> I remember there was also quoting issue with condor queues.

This is a problem with their mpirun. However, I guess the PBS provider could have a flag to do extra quoting for certain job types.

> We also talked about using Falkon.
> But since LQCD uses dedicated resources
> (600 or more nodes) and pbs queue checking time is set to around 10s, it
> is not a big issue for them to run large number of jobs.

The last thing we want with them is to throw in another thing that might have problems in the stack.

> None of these except for the absolute path problem is a show-stoppers, next
> we'll try to get their swiftscript running, and push some of the requests
> into 0.3 features.
>
> Yong and Nika

From hategan at mcs.anl.gov Sat Jun 2 03:43:24 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sat, 02 Jun 2007 11:43:24 +0300
Subject: [Swift-devel] LQCD meeting at Fermi
In-Reply-To: <1472851168-1180738251-cardhu_blackberry.rim.net-646943670-@bwe005-cell00.bisx.prod.on.blackberry>
References: <4C018A99-0639-43AA-BA06-525B955A56A1@mcs.anl.gov> <46609EDB.5020009@mcs.anl.gov> <1472851168-1180738251-cardhu_blackberry.rim.net-646943670-@bwe005-cell00.bisx.prod.on.blackberry>
Message-ID: <1180773804.6680.19.camel@blabla.mcs.anl.gov>

On Fri, 2007-06-01 at 22:50 +0000, Ian Foster wrote:
> Ben:
>
> The question for me is whether they are invested, or just playing with it and finding things that are "wrong" not because they are serious problems but rather as excuses for them not to do more. (I've often experienced that.)

I'm not Ben, but:

So far, I think they've invested quite a bit into Swift, compared to other things they've invested in. But I would assume they are worried whether the problems they see will ever get solved, and without something concrete and somewhat immediate, they might lean towards believing they might not.

Also, in the same way they disappear for weeks, we also disappear for weeks on some of their requests. I'm guessing both sides are involved in more than one thing.
In concrete terms, at one point they had a choice between Karajan and Swift, and they seem to have chosen Swift. > > So I want to know if they are using it for real work or not. So far things don't look so good--as I understand things, they don't have a working program after what mus be 4-5 months talking to us (?). They do have "weird" requirements, at least if you start from Swift's assumptions (which still has issues with being generic). And there's quite a few of them. Mihael > > Ian > > Sent via BlackBerry from T-Mobile > > -----Original Message----- > From: Ben Clifford > Date: Fri, 1 Jun 2007 22:38:42 > To:Ian Foster > Cc:Veronika Nefedova , swift-devel at ci.uchicago.edu > Subject: Re: [Swift-devel] LQCD meeting at Fermi > > > > On Fri, 1 Jun 2007, Ian Foster wrote: > > > Thanks for the summary. I am very eager to see some results for executions of > > real application problems. Did they agree to a timeline for that? > > Would also be good to have some definition of what is an acceptable 'real > application problem', in terms of which programs are to be run on which > datasets on what sized resource. > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Sat Jun 2 03:50:29 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sat, 02 Jun 2007 11:50:29 +0300 Subject: [Swift-devel] LQCD meeting at Fermi In-Reply-To: <1180773211.6680.8.camel@blabla.mcs.anl.gov> References: <4C018A99-0639-43AA-BA06-525B955A56A1@mcs.anl.gov> <1180773211.6680.8.camel@blabla.mcs.anl.gov> Message-ID: <1180774229.6680.27.camel@blabla.mcs.anl.gov> On Sat, 2007-06-02 at 11:33 +0300, Mihael Hategan wrote: > > > > We also talked about using Falkon. But since LQCD uses dedicated > > resources > > (600 or more nodes) and pbs queue checking time is set to around 10s, it > > is not a big issue for them to run large number of jobs. 
> > The last thing we want with them is throw in another thing that might > have problems in the stack. Clarification: The last thing we want with them is throw another thing (that might have problems) in the stack. And I'm not talking specifically about Falkon, which is a fine piece of software and a wonderful concept. > > > > > None of these except for the absolute path problem is a show- > > stoppers, next > > we'll try to get their swiftscript running, and push some of the > > requests > > into 0.3 features. > > > > Yong and Nika > > > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From wilde at mcs.anl.gov Sat Jun 2 10:44:16 2007 From: wilde at mcs.anl.gov (Mike Wilde) Date: Sat, 02 Jun 2007 10:44:16 -0500 Subject: [Swift-devel] LQCD meeting at Fermi In-Reply-To: References: <4C018A99-0639-43AA-BA06-525B955A56A1@mcs.anl.gov> <46609EDB.5020009@mcs.anl.gov> Message-ID: <46619050.2080005@mcs.anl.gov> Nika, can you develop this and post on the wiki page for LQCD? Do you have the information you need for this already? If not, can you do this on Monday? Thanks, Mike Ben Clifford wrote, On 6/1/2007 5:38 PM: > > On Fri, 1 Jun 2007, Ian Foster wrote: > >> Thanks for the summary. I am very eager to see some results for executions of >> real application problems. Did they agree to a timeline for that? > > Would also be good to have some definition of what is an acceptable 'real > application problem', in terms of which programs are to be run on which > datasets on what sized resource. 
> -- Mike Wilde Computation Institute, University of Chicago Math & Computer Science Division Argonne National Laboratory Argonne, IL 60439 USA tel 630-252-7497 fax 630-252-1997 From benc at hawaga.org.uk Tue Jun 5 10:46:28 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 5 Jun 2007 15:46:28 +0000 (GMT) Subject: [Swift-devel] swift live tutorial Message-ID: Yesterday Mike, Tibi and I performed a Swift tutorial at TG07, in the 'OSG Grid School' style of a lecture and then a hands-on session. The exercises portion, which I led, is (for now) at: http://www.ci.uchicago.edu/swift/guides/tutorial-live.php I don't have the lecture portion of the slides - Mike did that bit. For the most part I think it went well for a first tutorial. I think people mostly understood the points that we were trying to make and were able to do the exercises ok. We discovered a few more bugs in Swift which are now in the bugzilla (63, 64, 65, 66, 67). There were also a bunch of problems with the tutorial notes that are for the most part minor and easily correctable. I'd like to use the exercise portion as the base for future swift hands-on tutorials. -- From foster at mcs.anl.gov Tue Jun 5 13:28:20 2007 From: foster at mcs.anl.gov (Ian Foster) Date: Tue, 05 Jun 2007 13:28:20 -0500 Subject: [Swift-devel] swift live tutorial In-Reply-To: References: Message-ID: <4665AB44.7030601@mcs.anl.gov> Ben, Mike, Tibi: Congratulations on a successful tutorial! Ian. Ben Clifford wrote: > Yesterday Mike, Tibi and I performed a Swift tutorial at TG07, in the 'OSG > Grid School' style of a lecture and then a hands-on session. > > The exercises portion, which I led, is (for now) at: > > http://www.ci.uchicago.edu/swift/guides/tutorial-live.php > > I don't have the lecture portion of the slides - Mike did that bit. > > For the most part I think it went well for a first tutorial. I think > people mostly understood the points that we were trying to make and were > able to do the exercises ok. 
> > We discovered a few more bugs in Swift which are now in the bugzilla (63, > 64, 65, 66, 67). > > There were also a bunch of problems with the tutorial notes that are for > the most part minor and easily correctable. > > I'd like to use the exercise portion as the base for future swift hands-on > tutorials. > > -- Ian Foster, Director, Computation Institute Argonne National Laboratory & University of Chicago Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439 Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637 Tel: +1 630 252 4619. Web: www.ci.uchicago.edu. Globus Alliance: www.globus.org. From wilde at mcs.anl.gov Tue Jun 12 12:32:22 2007 From: wilde at mcs.anl.gov (Mike Wilde) Date: Tue, 12 Jun 2007 12:32:22 -0500 Subject: [Swift-devel] Re: GRAM and Swift discussion this week? In-Reply-To: References: <685C1420-03DE-4F2E-BDC7-A8A2C5636154@mcs.anl.gov> <46532FB4.5070707@mcs.anl.gov> <465335D1.2040306@mcs.anl.gov> Message-ID: <466ED8A6.5010804@mcs.anl.gov> Hi all, Were we still planning to meet tomorrow at Argonne? Can we postpone this till Thu? And what time would be good? Ben, where will you be Thu? Thanks, Mike Stuart Martin wrote, On 5/22/2007 2:41 PM: > > On May 22, 2007, at May 22, 2:22 PM, Ben Clifford wrote: > >> >> >> On Tue, 22 May 2007, Stuart Martin wrote: >> >>> On May 22, 2007, at May 22, 1:43 PM, Ben Clifford wrote: >>>> >>>> On Tue, 22 May 2007, Ian Foster wrote: >>>> >>>>> Are there WS-GRAM issues that are causing problems for Swift? >>>> >>>> No one uses WS-GRAM with Swift, so we aren't really uncovering issus >>>> there. >>> >> >>> Why not? What are you using? GRAM2? local executions? Other >>> services? >> >> for the high end stuff, Swift submits jobs to Falkon. Falkon, I think, >> uses WS-GRAM to start up its own workers, but that startup part of Falkon >> not Swift. >> >> For low end stuff, the two providers that I think people use much are >> local exec and GRAM2. >> >> Local exec is not in the space that GRAM is addressing, so ignore. 
> > Agreed. Just trying to learn what people are doing. > >> >> The GRAM2 vs GRAM4 question pretty much comes down to the fact that >> people >> in production (at least as far as I encounter them) tend to use GRAM2 >> rather than GRAM4 and so Swift tends to get used that way too - >> there's no >> real motivation to push people to use a different submission system than >> what they're used to, and one thing we decided within our group is >> that we >> would concentrate on being very application focused (after we had spent >> rather a long time pontificating and debating). GRAM2 -> GRAM4 doesn't >> provide enough incentive (in the way that a GRAM2 -> Falkon change does) >> for our actual apps (for example that Tibi and Nika work on). > > Fair enough. GRAM4 is deployed on most of TG and OSG now. It would be > good to push jobs to GRAM4 when reasonable/possible. The apps folks > should not care which service is used. It should be hidden by Swift. > Or are the apps folks your working with also dictating what GRAM service > is deployed/used? > >> >> At some point, perhaps, GRAM2 will decay or GRAM4 will become >> tantalising, >> at which point it would be in the interests of being app-focused to >> shift. >> Or we might change our priorities to be less app focused. > > Some are quite happy with GRAM4 in 4.0.3. We're improving things right > now to make GRAM4 outperform GRAM2 in most all the important > benchmarks. This should be in 4.0.5. I think things at that point > become "tantalizing". > >> >> -- > > -- Mike Wilde Computation Institute, University of Chicago Math & Computer Science Division Argonne National Laboratory Argonne, IL 60439 USA tel 630-252-7497 fax 630-252-1997 From benc at hawaga.org.uk Tue Jun 12 12:39:22 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 12 Jun 2007 17:39:22 +0000 (GMT) Subject: [Swift-devel] Re: GRAM and Swift discussion this week? 
In-Reply-To: <466ED8A6.5010804@mcs.anl.gov> References: <685C1420-03DE-4F2E-BDC7-A8A2C5636154@mcs.anl.gov> <46532FB4.5070707@mcs.anl.gov> <465335D1.2040306@mcs.anl.gov> <466ED8A6.5010804@mcs.anl.gov> Message-ID: On Tue, 12 Jun 2007, Mike Wilde wrote: > Were we still planning to meet tomorrow at Argonne? was there ever such a plan? what needs discussing? like I said before: > > > > > > Are there WS-GRAM issues that are causing problems for Swift? > > > > > > > > > > No one uses WS-GRAM with Swift, so we aren't really uncovering issus > > > > > there. > Ben, where will you be Thu? UC campus is my present plan. -- From wilde at mcs.anl.gov Tue Jun 12 13:33:45 2007 From: wilde at mcs.anl.gov (Mike Wilde) Date: Tue, 12 Jun 2007 13:33:45 -0500 Subject: [Swift-devel] Re: GRAM and Swift discussion this week? In-Reply-To: References: <685C1420-03DE-4F2E-BDC7-A8A2C5636154@mcs.anl.gov> <46532FB4.5070707@mcs.anl.gov> <465335D1.2040306@mcs.anl.gov> <466ED8A6.5010804@mcs.anl.gov> Message-ID: <466EE709.50007@mcs.anl.gov> Lets defer this meeting then and continue such discussions on swift and/or GRAM lists. - Mike Ben Clifford wrote, On 6/12/2007 12:39 PM: > On Tue, 12 Jun 2007, Mike Wilde wrote: > >> Were we still planning to meet tomorrow at Argonne? > > was there ever such a plan? > > what needs discussing? like I said before: > >>>>>>> Are there WS-GRAM issues that are causing problems for Swift? >>>>>> No one uses WS-GRAM with Swift, so we aren't really uncovering issus >>>>>> there. > > >> Ben, where will you be Thu? > > UC campus is my present plan. > > -- Mike Wilde Computation Institute, University of Chicago Math & Computer Science Division Argonne National Laboratory Argonne, IL 60439 USA tel 630-252-7497 fax 630-252-1997 From benc at hawaga.org.uk Tue Jun 12 13:38:34 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 12 Jun 2007 18:38:34 +0000 (GMT) Subject: [Swift-devel] Re: GRAM and Swift discussion this week? 
In-Reply-To: <466EE709.50007@mcs.anl.gov> References: <685C1420-03DE-4F2E-BDC7-A8A2C5636154@mcs.anl.gov> <46532FB4.5070707@mcs.anl.gov> <465335D1.2040306@mcs.anl.gov> <466ED8A6.5010804@mcs.anl.gov> <466EE709.50007@mcs.anl.gov> Message-ID: I suspect the best way forward for this is for us to actually start using GRAM4 in our daily swift work - that will either work perfectly or generate things to talk about. On Tue, 12 Jun 2007, Mike Wilde wrote: > Lets defer this meeting then and continue such discussions on swift and/or > GRAM lists. > > - Mike > > Ben Clifford wrote, On 6/12/2007 12:39 PM: > > On Tue, 12 Jun 2007, Mike Wilde wrote: > > > > > Were we still planning to meet tomorrow at Argonne? > > > > was there ever such a plan? > > > > what needs discussing? like I said before: > > > > > > > > > > Are there WS-GRAM issues that are causing problems for Swift? > > > > > > > No one uses WS-GRAM with Swift, so we aren't really uncovering > > > > > > > issus > > > > > > > there. > > > > > > > Ben, where will you be Thu? > > > > UC campus is my present plan. > > > > > > From itf at mcs.anl.gov Tue Jun 12 13:41:24 2007 From: itf at mcs.anl.gov (=?utf-8?B?SWFuIEZvc3Rlcg==?=) Date: Tue, 12 Jun 2007 18:41:24 +0000 Subject: [Swift-devel] Re: GRAM and Swift discussion this week? In-Reply-To: References: <685C1420-03DE-4F2E-BDC7-A8A2C5636154@mcs.anl.gov><46532FB4.5070707@mcs.anl.gov> <465335D1.2040306@mcs.anl.gov> <466ED8A6.5010804@mcs.anl.gov> <466EE709.50007@mcs.anl.gov> Message-ID: <1565933129-1181673722-cardhu_decombobulator_blackberry.rim.net-1111105974-@bxe111.bisx.prod.on.blackberry> I think Ioan already uses it for Falkon Sent via BlackBerry from T-Mobile -----Original Message----- From: Ben Clifford Date: Tue, 12 Jun 2007 18:38:34 To:Mike Wilde Cc:Stuart Martin , Ian Foster , swift-devel at ci.uchicago.edu Subject: Re: [Swift-devel] Re: GRAM and Swift discussion this week? 
I suspect the best way forward for this is for us to actually start using GRAM4 in our daily swift work - that will either work perfectly or generate things to talk about. On Tue, 12 Jun 2007, Mike Wilde wrote: > Lets defer this meeting then and continue such discussions on swift and/or > GRAM lists. > > - Mike > > Ben Clifford wrote, On 6/12/2007 12:39 PM: > > On Tue, 12 Jun 2007, Mike Wilde wrote: > > > > > Were we still planning to meet tomorrow at Argonne? > > > > was there ever such a plan? > > > > what needs discussing? like I said before: > > > > > > > > > > Are there WS-GRAM issues that are causing problems for Swift? > > > > > > > No one uses WS-GRAM with Swift, so we aren't really uncovering > > > > > > > issus > > > > > > > there. > > > > > > > Ben, where will you be Thu? > > > > UC campus is my present plan. > > > > > > From iraicu at cs.uchicago.edu Tue Jun 12 13:45:04 2007 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Tue, 12 Jun 2007 13:45:04 -0500 Subject: [Swift-devel] Re: GRAM and Swift discussion this week? In-Reply-To: <1565933129-1181673722-cardhu_decombobulator_blackberry.rim.net-1111105974-@bxe111.bisx.prod.on.blackberry> References: <685C1420-03DE-4F2E-BDC7-A8A2C5636154@mcs.anl.gov><46532FB4.5070707@mcs.anl.gov> <465335D1.2040306@mcs.anl.gov> <466ED8A6.5010804@mcs.anl.gov> <466EE709.50007@mcs.anl.gov> <1565933129-1181673722-cardhu_decombobulator_blackberry.rim.net-1111105974-@bxe111.bisx.prod.on.blackberry> Message-ID: <466EE9B0.506@cs.uchicago.edu> Right, Falkon uses only GRAM4 to get to remote resources! It has worked very well for Falkon. Ioan Ian Foster wrote: > I think Ioan already uses it for Falkon > > > Sent via BlackBerry from T-Mobile > > -----Original Message----- > From: Ben Clifford > > Date: Tue, 12 Jun 2007 18:38:34 > To:Mike Wilde > Cc:Stuart Martin , Ian Foster , swift-devel at ci.uchicago.edu > Subject: Re: [Swift-devel] Re: GRAM and Swift discussion this week? 
> > > > I suspect the best way forward for this is for us to actually start using > GRAM4 in our daily swift work - that will either work perfectly or > generate things to talk about. > > On Tue, 12 Jun 2007, Mike Wilde wrote: > > >> Lets defer this meeting then and continue such discussions on swift and/or >> GRAM lists. >> >> - Mike >> >> Ben Clifford wrote, On 6/12/2007 12:39 PM: >> >>> On Tue, 12 Jun 2007, Mike Wilde wrote: >>> >>> >>>> Were we still planning to meet tomorrow at Argonne? >>>> >>> was there ever such a plan? >>> >>> what needs discussing? like I said before: >>> >>> >>>>>>>>> Are there WS-GRAM issues that are causing problems for Swift? >>>>>>>>> >>>>>>>> No one uses WS-GRAM with Swift, so we aren't really uncovering >>>>>>>> issus >>>>>>>> there. >>>>>>>> >>> >>>> Ben, where will you be Thu? >>>> >>> UC campus is my present plan. >>> >>> >>> >> > > > ------------------------------------------------------------------------ > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > -- ============================================ Ioan Raicu Ph.D. Student ============================================ Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 ============================================ Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dsl.cs.uchicago.edu/ ============================================ ============================================ From benc at hawaga.org.uk Tue Jun 12 14:03:46 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 12 Jun 2007 19:03:46 +0000 (GMT) Subject: [Swift-devel] Re: GRAM and Swift discussion this week? 
In-Reply-To: <1565933129-1181673722-cardhu_decombobulator_blackberry.rim.net-1111105974-@bxe111.bisx.prod.on.blackberry> References: <685C1420-03DE-4F2E-BDC7-A8A2C5636154@mcs.anl.gov><46532FB4.5070707@mcs.anl.gov> <465335D1.2040306@mcs.anl.gov> <466ED8A6.5010804@mcs.anl.gov> <466EE709.50007@mcs.anl.gov> <1565933129-1181673722-cardhu_decombobulator_blackberry.rim.net-1111105974-@bxe111.bisx.prod.on.blackberry> Message-ID: Yeah, the falkon worker stuff goes in through GRAM4. Where that differs from swift submitted directly is that swift would be submitting more (many more?) jobs and lots of different kinds of jobs. On Tue, 12 Jun 2007, Ian Foster wrote: > I think Ioan already uses it for Falkon > > > Sent via BlackBerry from T-Mobile > > -----Original Message----- > From: Ben Clifford > > Date: Tue, 12 Jun 2007 18:38:34 > To:Mike Wilde > Cc:Stuart Martin , Ian Foster , swift-devel at ci.uchicago.edu > Subject: Re: [Swift-devel] Re: GRAM and Swift discussion this week? > > > > I suspect the best way forward for this is for us to actually start using > GRAM4 in our daily swift work - that will either work perfectly or > generate things to talk about. > > On Tue, 12 Jun 2007, Mike Wilde wrote: > > > Lets defer this meeting then and continue such discussions on swift and/or > > GRAM lists. > > > > - Mike > > > > Ben Clifford wrote, On 6/12/2007 12:39 PM: > > > On Tue, 12 Jun 2007, Mike Wilde wrote: > > > > > > > Were we still planning to meet tomorrow at Argonne? > > > > > > was there ever such a plan? > > > > > > what needs discussing? like I said before: > > > > > > > > > > > > Are there WS-GRAM issues that are causing problems for Swift? > > > > > > > > No one uses WS-GRAM with Swift, so we aren't really uncovering > > > > > > > > issus > > > > > > > > there. > > > > > > > > > > Ben, where will you be Thu? > > > > > > UC campus is my present plan. 
> > > > > > > > > > > > > From wilde at mcs.anl.gov Tue Jun 12 23:08:22 2007 From: wilde at mcs.anl.gov (Mike Wilde) Date: Tue, 12 Jun 2007 23:08:22 -0500 Subject: [Swift-devel] [Fwd: [OUTREACH] poster or demo at Grid2007?] Message-ID: <466F6DB6.3020302@mcs.anl.gov> Anyone interested in Swift/Falkon demos at SC? A massive TG/OSG falkon cluster demo running workflows with nice viz fast would be quite cool. (I cant believe I'm saying this while arguing that falkon needs more work, but.... ;) While we're on the topic of demos: I think there is still a board of Governors demo here next week. Do we have any chance of putting some cool workflows on the tiled display? - sidgrid wavelet with the "lava lamp brain" viz? - cnari w/ suma? - Mike -------- Original Message -------- Subject: [OUTREACH] poster or demo at Grid2007? Date: Wed, 13 Jun 2007 04:35:35 +0100 From: Jennifer M. Schopf To: outreach at globus.org Hi Folks- right now, we don't have any general globus content at grid2007 to my knowledge (we didn't know about their deadlines in time to apply for a tutorial or BOF). However, the call for posters and demos is still open - maybe someone would like to apply for one of these for some of the newer work? Maybe RAVE or MOPS especially? http://www.grid2007.org/?m_b_c=call_for_posterdemo -jen ------------------------------------------------------------------------------- Dr. Jennifer M. Schopf Scientist, Distributed Systems Lab Argonne National Laboratory jms at mcs.anl.gov http://www.mcs.anl.gov/~jms -- Mike Wilde Computation Institute, University of Chicago Math & Computer Science Division Argonne National Laboratory Argonne, IL 60439 USA tel 630-252-7497 fax 630-252-1997 From iraicu at cs.uchicago.edu Tue Jun 12 23:21:07 2007 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Tue, 12 Jun 2007 23:21:07 -0500 Subject: [Swift-devel] [Fwd: [OUTREACH] poster or demo at Grid2007?] 
In-Reply-To: <466F6DB6.3020302@mcs.anl.gov> References: <466F6DB6.3020302@mcs.anl.gov> Message-ID: <466F70B3.6070407@cs.uchicago.edu> I would be up for a demo, if others are interested as well! The demo could either be cross site runs on TG, even more ambitious would be TG+OSG, and even more would be TG+OSG+EC2! June 29th is the deadline, so that gives us enough time to figure out exactly what application we want to demo, and write the 1 page proposal! Ioan Mike Wilde wrote: > Anyone interested in Swift/Falkon demos at SC? > > A massive TG/OSG falkon cluster demo running workflows with nice viz > fast would be quite cool. > > (I cant believe I'm saying this while arguing that falkon needs more > work, but.... ;) > > While we're on the topic of demos: I think there is still a board of > Governors demo here next week. > > Do we have any chance of putting some cool workflows on the tiled > display? > > - sidgrid wavelet with the "lava lamp brain" viz? > - cnari w/ suma? > > - Mike > > > -------- Original Message -------- > Subject: [OUTREACH] poster or demo at Grid2007? > Date: Wed, 13 Jun 2007 04:35:35 +0100 > From: Jennifer M. Schopf > To: outreach at globus.org > > Hi Folks- > > right now, we don't have any general globus content at grid2007 to my > knowledge (we didn't know about their deadlines in time to apply for a > tutorial or BOF). However, the call for posters and demos is still open - > maybe someone would like to apply for one of these for some of the newer > work? Maybe RAVE or MOPS especially? > > http://www.grid2007.org/?m_b_c=call_for_posterdemo > > -jen > > > > ------------------------------------------------------------------------------- > > Dr. Jennifer M. Schopf > Scientist, Distributed Systems Lab > Argonne National Laboratory > jms at mcs.anl.gov http://www.mcs.anl.gov/~jms > > > > -- ============================================ Ioan Raicu Ph.D. 
Student ============================================ Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 ============================================ Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dsl.cs.uchicago.edu/ ============================================ ============================================ From tiberius at ci.uchicago.edu Tue Jun 12 23:26:08 2007 From: tiberius at ci.uchicago.edu (Tiberiu Stef-Praun) Date: Tue, 12 Jun 2007 23:26:08 -0500 Subject: [Swift-devel] [Fwd: [OUTREACH] poster or demo at Grid2007?] In-Reply-To: <466F6DB6.3020302@mcs.anl.gov> References: <466F6DB6.3020302@mcs.anl.gov> Message-ID: On 6/12/07, Mike Wilde wrote: > Anyone interested in Swift/Falkon demos at SC? > > A massive TG/OSG falkon cluster demo running workflows with nice viz > fast would be quite cool. > > (I cant believe I'm saying this while arguing that falkon needs more > work, but.... ;) > > While we're on the topic of demos: I think there is still a board of > Governors demo here next week. > > Do we have any chance of putting some cool workflows on the tiled > display? I would say there are small chances. I am currently overloaded with getting other things done in I2U2, CNARI and preparing Falkon measurement for Econ, and in addition to this, I have never attempted visualizations of the aforementioned workflows. Tibi > > - sidgrid wavelet with the "lava lamp brain" viz? > - cnari w/ suma? > > - Mike > > > -------- Original Message -------- > Subject: [OUTREACH] poster or demo at Grid2007? > Date: Wed, 13 Jun 2007 04:35:35 +0100 > From: Jennifer M. Schopf > To: outreach at globus.org > > Hi Folks- > > right now, we don't have any general globus content at grid2007 > to my > knowledge (we didn't know about their deadlines in time to apply for a > tutorial or BOF). 
However, the call for posters and demos is still > open - > maybe someone would like to apply for one of these for some of the > newer > work? Maybe RAVE or MOPS especially? > > http://www.grid2007.org/?m_b_c=call_for_posterdemo > > -jen > > > > ------------------------------------------------------------------------------- > Dr. Jennifer M. Schopf > Scientist, Distributed Systems Lab > Argonne National Laboratory > jms at mcs.anl.gov http://www.mcs.anl.gov/~jms > > > > > -- > Mike Wilde > Computation Institute, University of Chicago > Math & Computer Science Division > Argonne National Laboratory > Argonne, IL 60439 USA > tel 630-252-7497 fax 630-252-1997 > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > -- Tiberiu (Tibi) Stef-Praun, PhD Research Staff, Computation Institute 5640 S. Ellis Ave, #405 University of Chicago http://www-unix.mcs.anl.gov/~tiberius/ From benc at hawaga.org.uk Thu Jun 14 22:16:21 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 15 Jun 2007 03:16:21 +0000 (GMT) Subject: [Swift-devel] provider-deef Message-ID: how to get source for provider-deef/ from version control? -- From iraicu at cs.uchicago.edu Thu Jun 14 22:18:23 2007 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Thu, 14 Jun 2007 22:18:23 -0500 Subject: [Swift-devel] provider-deef In-Reply-To: References: Message-ID: <467204FF.8040106@cs.uchicago.edu> You get it from my web site currently: http://people.cs.uchicago.edu/~iraicu/research/Falkon/Falkon_v0.8.tgz :) We need to talk about getting into CVS or SVN, and where.... Ioan Ben Clifford wrote: > how to get source for provider-deef/ from version control? > > -- ============================================ Ioan Raicu Ph.D. Student ============================================ Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 
58th Street, Ryerson Hall Chicago, IL 60637 ============================================ Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dsl.cs.uchicago.edu/ ============================================ ============================================ From benc at hawaga.org.uk Thu Jun 14 22:43:31 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 15 Jun 2007 03:43:31 +0000 (GMT) Subject: [Swift-devel] provider-deef In-Reply-To: <467204FF.8040106@cs.uchicago.edu> References: <467204FF.8040106@cs.uchicago.edu> Message-ID: That has the falkon code in it but I can't see the cog/swift job submission provider. On Thu, 14 Jun 2007, Ioan Raicu wrote: > You get it from my web site currently: > http://people.cs.uchicago.edu/~iraicu/research/Falkon/Falkon_v0.8.tgz :) > > We need to talk about getting into CVS or SVN, and where.... > > Ioan > > Ben Clifford wrote: > > how to get source for provider-deef/ from version control? > > > > > > From tiberius at ci.uchicago.edu Fri Jun 15 07:11:17 2007 From: tiberius at ci.uchicago.edu (Tiberiu Stef-Praun) Date: Fri, 15 Jun 2007 07:11:17 -0500 Subject: [Swift-devel] provider-deef In-Reply-To: References: <467204FF.8040106@cs.uchicago.edu> Message-ID: You get that from ~/tiberius/cogl (which I got from Yong's home) However, I have not tested that yet. On 6/14/07, Ben Clifford wrote: > > That has the falkon code in it but I can't see the cog/swift job > submission provider. > > On Thu, 14 Jun 2007, Ioan Raicu wrote: > > > You get it from my web site currently: > > http://people.cs.uchicago.edu/~iraicu/research/Falkon/Falkon_v0.8.tgz :) > > > > We need to talk about getting into CVS or SVN, and where.... > > > > Ioan > > > > Ben Clifford wrote: > > > how to get source for provider-deef/ from version control? 
> > > > > > > > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > -- Tiberiu (Tibi) Stef-Praun, PhD Research Staff, Computation Institute 5640 S. Ellis Ave, #405 University of Chicago http://www-unix.mcs.anl.gov/~tiberius/ From wilde at mcs.anl.gov Fri Jun 15 08:11:30 2007 From: wilde at mcs.anl.gov (Mike Wilde) Date: Fri, 15 Jun 2007 08:11:30 -0500 Subject: [Swift-devel] Re: Welcome Swift as an incubator project In-Reply-To: <6.2.1.2.2.20070615133455.02dd9298@imap.mcs.anl.gov> References: <6.2.1.2.2.20070615133455.02dd9298@imap.mcs.anl.gov> Message-ID: <46729002.9090004@mcs.anl.gov> Thanks, Jen, that's great news. We'll discuss how to take the steps below. - Mike Jennifer M. Schopf wrote, On 6/15/2007 7:45 AM: > Hi Mike- > > We are pleased to accept Swift as an Incubator Project in the > dev.globus incubation process! This mail contains information on your > mentor, getting set up in the dev.globus infrastructure, and next steps. > Current process guidelines can be found at > http://dev.globus.org/wiki/Incubator/Incubator_Process > . > > Your mentor is currently Jennifer Schopf (jms at mcs.anl.gov). Your mentor > will act as a bridge between your project and the Incubation Management > Project, and should be able to answer any basic questions you have. They > will also help you during the quarterly reviews and in understanding how > to escalate to a full Globus Project. Your mentor needs to have write > access for the wiki page, but does not need to have CVS commit access > unless you would like them to. If you would like to propose a different > mentor for any reason, please let me know, and we can discuss options. > > We have requested three mailing lists for you, according to the Globus > project guidelines ( > https://dev.globus.org/wiki/Guidelines#Communication > )- swift-dev, swift-user, and swift-commit, with you listed as the > owner. 
The initial password will be set to "incubator" for all of them, > and they are currently operational. It now falls to you to enroll the > members of these lists since they come completely empty - they're > standard majordomo lists, basic subscription is done by simply sending > (to majordomo at globus.org) > > approve subscribe > > approve subscribe > for your various lists and subscribers. All of your committers should be > subscribed to all 3 lists. We also strongly encourage them to subscribe > to announce at globus.org. > > Your CVS/SVN module will be set up for you. Please mail > infrastructure at globus.org (be sure to > have CVS in the subject line in addition to anything else you'd like) > with the complete list of committer names, desired names for accounts, > and ssh public keys, and they will email back access instructions when > these are set. Be sure to include your project name in the mail. > > Your wiki page has been set up with the common template and is located > at http://dev.globus.org/wiki/Incubator/Swift. If you need to add wiki > committers, please mail > infrastructure at globus.org (be sure to > have WIKI in the subject line in addition to anything else you'd like) > with your list of wiki committers' names and their desired account names > for the dev.globus.org wiki. Be sure to include your project name in the > mail. > > In order to set up bugzilla space on the Globus bugzilla (for example, > to keep track of your roadmap items, track bugs, follow enhancement > requests, etc), the person you would like to be responsible for your > bugzilla in Globus-space should get an account at bugzilla.globus.org, > and then send that account name and the name of your project to > infrastructure at globus.org (be sure to > have BUGZILLA in the subject line in addition to anything else you'd > like). 
They will then give that account authority to create subproducts > and such for your product and you will then be able to create the list > of components, cc: lists, and descriptions as you see fit. > > One of the hurdles you'll need to pass in order to become a Globus > project is the licensing. Licensing information, including both of the > licenses needed and a guideline document, can be found here > http://dev.globus.org/wiki/Guidelines#License_and_Contributor_Agreements > In a nutshell every committer for your project will need to sign an > individual license, and anyone who is doing this as part of their day > job will also need a corporate license signed. Those doing it on their > own free time should include a letter stating this fact. All licenses > when signed should be mailed to Jennifer Schopf, Argonne National Lab, > 9700 S. Cass Ave, bldg 221, Argonne, IL 60439 USA. > > The next step in the incubator process will be a check on where the > startup projects are. This is likely to be happening around mid July, > and your mentor will contact you prior. > > Thanks again for your participation, and bearing with us while we get > the process up and running. Please don't hesitate to contact me or your > mentor with any additional questions. > > > -Jennifer Schopf, on behalf of the Incubator Management Project > > > > > ------------------------------------------------------------------------------- > > Dr. Jennifer M. 
Schopf > Scientist, Distributed Systems Lab > Argonne National Laboratory > jms at mcs.anl.gov http://www.mcs.anl.gov/~jms > > > -- Mike Wilde Computation Institute, University of Chicago Math & Computer Science Division Argonne National Laboratory Argonne, IL 60439 USA tel 630-252-7497 fax 630-252-1997 From benc at hawaga.org.uk Fri Jun 15 08:27:57 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 15 Jun 2007 13:27:57 +0000 (GMT) Subject: [Swift-devel] provider-deef In-Reply-To: References: <467204FF.8040106@cs.uchicago.edu> Message-ID: Yong mentioned something about it being in the cog svn once, I thought. On Fri, 15 Jun 2007, Tiberiu Stef-Praun wrote: > You get that from ~/tiberius/cogl (which I got from Yong's home) > However, I have not tested that yet. > > On 6/14/07, Ben Clifford wrote: > > > > That has the falkon code in it but I can't see the cog/swift job > > submission provider. > > > > On Thu, 14 Jun 2007, Ioan Raicu wrote: > > > > > You get it from my web site currently: > > > http://people.cs.uchicago.edu/~iraicu/research/Falkon/Falkon_v0.8.tgz :) > > > > > > We need to talk about getting into CVS or SVN, and where.... > > > > > > Ioan > > > > > > Ben Clifford wrote: > > > > how to get source for provider-deef/ from version control? > > > > > > > > > > > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > From benc at hawaga.org.uk Fri Jun 15 08:39:04 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 15 Jun 2007 13:39:04 +0000 (GMT) Subject: [Swift-devel] swift infrastructure move to dev.globus Message-ID: dev.globus people have been talking some about how to deal with projects that already have their own infrastructure (swift being a prime example of that); I propose we don't mess round with our infrastructure (mailing lists and version control) until dev.globus has decided an approach there. 
-- From hategan at mcs.anl.gov Fri Jun 15 08:39:32 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 15 Jun 2007 16:39:32 +0300 Subject: [Swift-devel] swift infrastructure move to dev.globus In-Reply-To: References: Message-ID: <1181914772.9966.0.camel@blabla.mcs.anl.gov> On Fri, 2007-06-15 at 13:39 +0000, Ben Clifford wrote: > dev.globus people have been talking some about how to deal with projects > that already have their own infrastructure (swift being a prime example of > that); I propose we don't mess round with our infrastructure (mailing > lists and version control) until dev.globus has decided an approach there. I concur. > From wilde at mcs.anl.gov Fri Jun 15 08:43:12 2007 From: wilde at mcs.anl.gov (Mike Wilde) Date: Fri, 15 Jun 2007 08:43:12 -0500 Subject: [Swift-devel] swift infrastructure move to dev.globus In-Reply-To: References: Message-ID: <46729770.6090206@mcs.anl.gov> Certainly OK with me if its OK with Jen and the incubator group. - Mike Ben Clifford wrote, On 6/15/2007 8:39 AM: > dev.globus people have been talking some about how to deal with projects > that already have their own infrastructure (swift being a prime example of > that); I propose we don't mess round with our infrastructure (mailing > lists and version control) until dev.globus has decided an approach there. > -- Mike Wilde Computation Institute, University of Chicago Math & Computer Science Division Argonne National Laboratory Argonne, IL 60439 USA tel 630-252-7497 fax 630-252-1997 From yongzh at cs.uchicago.edu Fri Jun 15 09:04:24 2007 From: yongzh at cs.uchicago.edu (Yong Zhao) Date: Fri, 15 Jun 2007 09:04:24 -0500 (CDT) Subject: [Swift-devel] provider-deef In-Reply-To: References: <467204FF.8040106@cs.uchicago.edu> Message-ID: We discussed where to put it in svn, but it never got into svn. Currently it resides in Karajan branch in my source code, but Mihael says that is not a good place to put it in svn. Yong. 
On Fri, 15 Jun 2007, Ben Clifford wrote: > > YOng mentioned something about it being in the cog svn once i thought. > > On Fri, 15 Jun 2007, Tiberiu Stef-Praun wrote: > > > You get that from ~/tiberius/cogl (which I got from Yong's home) > > However, I have not teste that yet. > > > > On 6/14/07, Ben Clifford wrote: > > > > > > That has the falkon code in it but I can't see the cog/swift job > > > submission provider. > > > > > > On Thu, 14 Jun 2007, Ioan Raicu wrote: > > > > > > > You get it from my web site currently: > > > > http://people.cs.uchicago.edu/~iraicu/research/Falkon/Falkon_v0.8.tgz :) > > > > > > > > We need to talk about getting into CVS or SVN, and where.... > > > > > > > > Ioan > > > > > > > > Ben Clifford wrote: > > > > > how to get source for provider-deef/ from version control? > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From hategan at mcs.anl.gov Fri Jun 15 09:08:32 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 15 Jun 2007 17:08:32 +0300 Subject: [Swift-devel] provider-deef In-Reply-To: References: <467204FF.8040106@cs.uchicago.edu> Message-ID: <1181916512.10096.2.camel@blabla.mcs.anl.gov> On Fri, 2007-06-15 at 09:04 -0500, Yong Zhao wrote: > We discussed where to put it in svn, but it never got into svn. > > Currently it resides in Karajan branch in my source code, but Mihael says > that is not a good place to put it in svn. Clearly not. It could live in a provider-falkon module, like all the other providers though. Mihael > > Yong. > > On Fri, 15 Jun 2007, Ben Clifford wrote: > > > > > YOng mentioned something about it being in the cog svn once i thought. 
> > > > On Fri, 15 Jun 2007, Tiberiu Stef-Praun wrote: > > > > > You get that from ~/tiberius/cogl (which I got from Yong's home) > > > However, I have not teste that yet. > > > > > > On 6/14/07, Ben Clifford wrote: > > > > > > > > That has the falkon code in it but I can't see the cog/swift job > > > > submission provider. > > > > > > > > On Thu, 14 Jun 2007, Ioan Raicu wrote: > > > > > > > > > You get it from my web site currently: > > > > > http://people.cs.uchicago.edu/~iraicu/research/Falkon/Falkon_v0.8.tgz :) > > > > > > > > > > We need to talk about getting into CVS or SVN, and where.... > > > > > > > > > > Ioan > > > > > > > > > > Ben Clifford wrote: > > > > > > how to get source for provider-deef/ from version control? > > > > > > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > Swift-devel mailing list > > > > Swift-devel at ci.uchicago.edu > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > > > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From benc at hawaga.org.uk Fri Jun 15 09:13:04 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 15 Jun 2007 14:13:04 +0000 (GMT) Subject: [Swift-devel] provider-deef In-Reply-To: References: <467204FF.8040106@cs.uchicago.edu> Message-ID: Oh, I see. I misunderstood what you meant by 'branch' earlier on. The administratively easiest would be to put both falkon and provider-deef as top level directories in the SVN that we store swift in, I guess. On Fri, 15 Jun 2007, Yong Zhao wrote: > We discussed where to put it in svn, but it never got into svn. 
> > Currently it resides in Karajan branch in my source code, but Mihael says > that is not a good place to put it in svn. > > Yong. > > On Fri, 15 Jun 2007, Ben Clifford wrote: > > > > > YOng mentioned something about it being in the cog svn once i thought. > > > > On Fri, 15 Jun 2007, Tiberiu Stef-Praun wrote: > > > > > You get that from ~/tiberius/cogl (which I got from Yong's home) > > > However, I have not teste that yet. > > > > > > On 6/14/07, Ben Clifford wrote: > > > > > > > > That has the falkon code in it but I can't see the cog/swift job > > > > submission provider. > > > > > > > > On Thu, 14 Jun 2007, Ioan Raicu wrote: > > > > > > > > > You get it from my web site currently: > > > > > http://people.cs.uchicago.edu/~iraicu/research/Falkon/Falkon_v0.8.tgz :) > > > > > > > > > > We need to talk about getting into CVS or SVN, and where.... > > > > > > > > > > Ioan > > > > > > > > > > Ben Clifford wrote: > > > > > > how to get source for provider-deef/ from version control? > > > > > > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > Swift-devel mailing list > > > > Swift-devel at ci.uchicago.edu > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > > > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > From yongzh at cs.uchicago.edu Fri Jun 15 09:15:23 2007 From: yongzh at cs.uchicago.edu (Yong Zhao) Date: Fri, 15 Jun 2007 09:15:23 -0500 (CDT) Subject: [Swift-devel] provider-deef In-Reply-To: <1181916512.10096.2.camel@blabla.mcs.anl.gov> References: <467204FF.8040106@cs.uchicago.edu> <1181916512.10096.2.camel@blabla.mcs.anl.gov> Message-ID: sorry I did not mean it was in karajan, it was in cog alongside with other providers as Mihael indicated. Yong. 
On Fri, 15 Jun 2007, Mihael Hategan wrote: > On Fri, 2007-06-15 at 09:04 -0500, Yong Zhao wrote: > > We discussed where to put it in svn, but it never got into svn. > > > > Currently it resides in Karajan branch in my source code, but Mihael says > > that is not a good place to put it in svn. > > Clearly not. It could live in a provider-falkon module, like all the > other providers though. > > Mihael > > > > > Yong. > > > > On Fri, 15 Jun 2007, Ben Clifford wrote: > > > > > > > > YOng mentioned something about it being in the cog svn once i thought. > > > > > > On Fri, 15 Jun 2007, Tiberiu Stef-Praun wrote: > > > > > > > You get that from ~/tiberius/cogl (which I got from Yong's home) > > > > However, I have not teste that yet. > > > > > > > > On 6/14/07, Ben Clifford wrote: > > > > > > > > > > That has the falkon code in it but I can't see the cog/swift job > > > > > submission provider. > > > > > > > > > > On Thu, 14 Jun 2007, Ioan Raicu wrote: > > > > > > > > > > > You get it from my web site currently: > > > > > > http://people.cs.uchicago.edu/~iraicu/research/Falkon/Falkon_v0.8.tgz :) > > > > > > > > > > > > We need to talk about getting into CVS or SVN, and where.... > > > > > > > > > > > > Ioan > > > > > > > > > > > > Ben Clifford wrote: > > > > > > > how to get source for provider-deef/ from version control? 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > > Swift-devel mailing list > > > > > Swift-devel at ci.uchicago.edu > > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > From hategan at mcs.anl.gov Fri Jun 15 09:27:36 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 15 Jun 2007 17:27:36 +0300 Subject: [Swift-devel] provider-deef In-Reply-To: References: <467204FF.8040106@cs.uchicago.edu> Message-ID: <1181917656.10152.0.camel@blabla.mcs.anl.gov> On Fri, 2007-06-15 at 14:13 +0000, Ben Clifford wrote: > Oh, I see. I misunderstood what you meant by 'branch' earlier on. > > The administratively easiest would be to put both falkon and provider-deef > as top level directories in the SVN that we store swift in, I guess. That sounds like a better idea. Mihael > > On Fri, 15 Jun 2007, Yong Zhao wrote: > > > We discussed where to put it in svn, but it never got into svn. > > > > Currently it resides in Karajan branch in my source code, but Mihael says > > that is not a good place to put it in svn. > > > > Yong. > > > > On Fri, 15 Jun 2007, Ben Clifford wrote: > > > > > > > > YOng mentioned something about it being in the cog svn once i thought. > > > > > > On Fri, 15 Jun 2007, Tiberiu Stef-Praun wrote: > > > > > > > You get that from ~/tiberius/cogl (which I got from Yong's home) > > > > However, I have not teste that yet. 
> > > > > > > > On 6/14/07, Ben Clifford wrote: > > > > > > > > > > That has the falkon code in it but I can't see the cog/swift job > > > > > submission provider. > > > > > > > > > > On Thu, 14 Jun 2007, Ioan Raicu wrote: > > > > > > > > > > > You get it from my web site currently: > > > > > > http://people.cs.uchicago.edu/~iraicu/research/Falkon/Falkon_v0.8.tgz :) > > > > > > > > > > > > We need to talk about getting into CVS or SVN, and where.... > > > > > > > > > > > > Ioan > > > > > > > > > > > > Ben Clifford wrote: > > > > > > > how to get source for provider-deef/ from version control? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > > Swift-devel mailing list > > > > > Swift-devel at ci.uchicago.edu > > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From benc at hawaga.org.uk Fri Jun 15 11:48:01 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 15 Jun 2007 16:48:01 +0000 (GMT) Subject: [Swift-devel] CPU usage with provider-deef Message-ID: Yesterday, I was playing a bit with Ioan and Nika - they submitted a 68 node / 15 minute workflow through provider-deef & falkon and saw the swift JVM on the submit node using about 100% CPU; then the same workflow running through the GT2 GRAM provider rather than provider-deef and falkon appeared to use significantly less. I wandered off at that point so don't know if any interesting results came after. 
-- From benc at hawaga.org.uk Fri Jun 15 12:26:30 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 15 Jun 2007 17:26:30 +0000 (GMT) Subject: [Swift-devel] provider-deef In-Reply-To: <1181917656.10152.0.camel@blabla.mcs.anl.gov> References: <467204FF.8040106@cs.uchicago.edu> <1181917656.10152.0.camel@blabla.mcs.anl.gov> Message-ID: On Fri, 15 Jun 2007, Mihael Hategan wrote: > > The administratively easiest would be to put both falkon and provider-deef > > as top level directories in the SVN that we store swift in, I guess. > > That sounds like a better idea. ok. I will work with Ioan and Yong to get their respective modules into the swift SVN when they're ready (over the next week or so). -- From nefedova at mcs.anl.gov Fri Jun 15 13:02:58 2007 From: nefedova at mcs.anl.gov (Veronika Nefedova) Date: Fri, 15 Jun 2007 13:02:58 -0500 Subject: [Swift-devel] CPU usage with provider-deef In-Reply-To: References: Message-ID: Ben, we tested both my workflow and a simple "sleep" workflow. Both tests produced 100% CPU usage when run with Falkon. When submitted directly to GRAM from swift, only a fraction of 1% CPU was used. I know that Yong did some additional testing, but I do not know the results. Nika On Jun 15, 2007, at 11:48 AM, Ben Clifford wrote: > > Yesterday, I was playing a bit with Ioan and Nika - they submitted > a 68 > node / 15 minute workflow through provider-deef & falkon and saw > the swift > JVM on the submit node using about 100% CPU; then the same workflow > running through the GT2 GRAM provider rather than provider-deef and > falkon > appeared to use significantly less. > > I wandered off at that point so don't know if any interesting > results came > after. 
> > -- > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >
From benc at hawaga.org.uk Fri Jun 15 15:13:23 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 15 Jun 2007 20:13:23 +0000 (GMT) Subject: [Swift-devel] on the semantics of 'array closing' Message-ID:

There is a problem that has been called the 'array closing problem'.

It manifests itself in the tutorial in that certain bits of code that intuitively could go either in a procedure or at the top level can, in practice, only go into a procedure.

In that context, I tried to think about better ways to explain/document the behaviour than "mumble mumble move that code into a procedure".

In Swift we claim to have 'single assignment variables'.

From single assignment variables we get our grid job ordering:

  a = p()
  b = s(a)

causes first grid job p to run, and when that has completed, then grid job s will run.

This is the same as if we had written:

  b = s(a)
  a = p()

The ordering comes from the use of a as an 'output' for p and an 'input' for s, not from source text ordering.

In that model, it's meaningless to assign two different things to a, like this:

  a = p()
  b = s(a)
  a = t()

Note that I've omitted the data types from the above. This works in the implementation for simple types such as a datafile marker type.

What is important is that each variable is either unassigned or has its single value - whenever we refer to that variable, we can either use the value it has, or defer evaluation of that expression until the variable has its value.

Now consider arrays. In the present syntax, arrays can be passed as single (complex) values to/from procedures, like before:

  a = p()
  b = s(a)

Here a and b are array types.

That's fine. a is assigned to by the first statement, and b is assigned to by the second statement.
But we also support a different assignment syntax for arrays, that looks like this:

  a[0] = p()
  a[1] = q()
  b = s(a)

This fails at the moment (specifically, I think the execution engine will hang).

Why? Because there is no one point at which we assign a value to 'a' - the assignment is split over multiple statements, which can be in various places (and inside loops etc).

There is nothing in the implementation that detects that a has been assigned its value.

So there is this notion in the karajan intermediate code of 'closing an array'. This is an assertion made in the object code that all assignments to pieces of an array have been made - that, in effect, the array has its value.

The suggested hack/workaround for this is to move the array element assignments into a procedure:

  (file f[]) z() {
    f[0] = p();
    f[1] = q();
  }

  a = z()
  b = s(a)

This works. (which is sort-of a violation of referential transparency)

It works because Swift implicitly marks arrays returned from compound procedures as closed (which may or may not be correct).

So in most variable scopes, arrays behave like single-assignment variables, but each array can have one specific scope in which members can be assigned to. In that scope, the array cannot be treated as a whole variable.

In the z() example above, that special scope is the body of z(). In the previous example, that scope is the global scope, and the program is invalid by the rule above that the array cannot be referred to as a whole in the same place that its members are individually assigned to.

That's my explanation of what's going on now. I think it matches reality. I don't like that this is reality, but it is what we have.

Comments appreciated.
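[Editorial sketch] The behaviour described in the message above can be modelled in a few lines. This is a toy model in Python, not Swift or Karajan code: the names SingleAssign, ClosableArray and run are hypothetical, the scheduler is a simple sequential loop rather than the real parallel engine, and the strings "p-out" and "q-out" merely stand in for the outputs of p() and q(). It only illustrates why a whole-array consumer never runs when nothing closes the array, and why wrapping the element assignments in a procedure that closes the array on return lets it proceed.

```python
class SingleAssign:
    """A write-once variable: unassigned until set() is called exactly once."""

    def __init__(self, name):
        self.name = name
        self.is_set = False
        self.value = None

    def set(self, value):
        if self.is_set:
            raise ValueError(f"{self.name} is already assigned")
        self.value = value
        self.is_set = True


class ClosableArray:
    """Array with write-once elements; usable as a whole only once closed."""

    def __init__(self, name):
        self.name = name
        self.elements = {}
        self.closed = False

    def set_element(self, index, value):
        if self.closed:
            raise ValueError(f"{self.name} is closed")
        if index in self.elements:
            raise ValueError(f"{self.name}[{index}] is already assigned")
        self.elements[index] = value

    def close(self):
        # Assertion that no further element assignments will happen.
        self.closed = True

    def as_list(self):
        return [self.elements[i] for i in sorted(self.elements)]


def run(statements):
    """Run (label, ready, action) triples in dataflow order, ignoring source order.

    Returns the labels of statements that never became runnable,
    which is how this model represents a hung workflow.
    """
    pending = list(statements)
    progress = True
    while pending and progress:
        progress = False
        for stmt in list(pending):
            _label, ready, action = stmt
            if ready():
                action()
                pending.remove(stmt)
                progress = True
    return [label for label, _, _ in pending]


def demo_top_level():
    """Element assignments at top level: nothing ever closes 'a'."""
    a = ClosableArray("a")
    b = SingleAssign("b")
    stuck = run([
        ("b = s(a)", lambda: a.closed, lambda: b.set(a.as_list())),
        ("a[0] = p()", lambda: True, lambda: a.set_element(0, "p-out")),
        ("a[1] = q()", lambda: True, lambda: a.set_element(1, "q-out")),
    ])
    return stuck, b.is_set


def demo_procedure():
    """Element assignments inside z(): the array is closed when z() returns."""
    a = ClosableArray("a")
    b = SingleAssign("b")

    def z():
        a.set_element(0, "p-out")
        a.set_element(1, "q-out")
        a.close()  # models the implicit close on return from a compound procedure

    stuck = run([
        ("b = s(a)", lambda: a.closed, lambda: b.set(a.as_list())),
        ("a = z()", lambda: True, z),
    ])
    return stuck, b.value
```

In this model, demo_top_level() leaves "b = s(a)" permanently pending, while demo_procedure() runs it after z() has closed the array, even though "b = s(a)" is listed first in both cases.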
-- From foster at mcs.anl.gov Fri Jun 15 15:26:11 2007 From: foster at mcs.anl.gov (Ian Foster) Date: Fri, 15 Jun 2007 15:26:11 -0500 Subject: [Swift-devel] on the semantics of 'array closing' In-Reply-To: References: Message-ID: <4672F5E3.7060205@mcs.anl.gov> Hi, For: a[0] = p() a[1] = q() b = s(a) I think there are two distinct issues. a) Determining the size of the array. This could presumably be done by declaring it, e.g.: a[2] or some similar notion a[0] = p() a[1] = q() b = s(a) or by some "closing" concept. b) Whether or not each element of an array is a separate single-assignment variable. If they are, then the code above should work just fine. If they are not, then we have a couple of behaviors we could define. One would be that b=s(a) blocks until all elements in "a" are defined. The other is that we have a way of "closing" (once again). In that case, we have to define what happens if b=s(a) accesses an element that is not defined. Ian. Ben Clifford wrote: > There is a problem that has been called the 'array closing problem'. > > It manifests itself in the tutorial in that certain bits of code that > intuitively can either in a procedure or in the top level can, in > practice, only go in to a procedure. > > In that context, I tried to think about better ways to explain/document > the behaviour than "mumble mumble move that code into a procedure". > > In Swift we claim to have 'single assignment variables'. > > >From single assignment variables we get our grid job ordering: > > a = p() > b = s(a) > > causes first grid job p to run, and when that has completed, then grid job > s will run. > > This is the same as if we had written: > > b = s(a) > a = p() > > The ordering comes from the use of a as an 'output' for p and an 'input' > for s, not from source text ordering. > > In that model, its meaningless to assign two different things ta a, like > this: > > a = p() > b = s(a) > a = t() > > > Note that I've omitted the data types from the above. 
This works in the > implementation for simple types such as a datafile marker type. > > What is important is that each variable is either unassigned or has its > single value - whenever we refer to that variable, we can either use the > value it has, or defer evaluation of that expression until the variable > has its value. > > Now consider arrays. In the present syntax, arrays can be passed as > single (complex) values to/from procedures, like before: > > a = p() > b = s(a) > > Here a and b are array types. > > That's fine. a is assigned to by the first statement, and b is assigned to > by the second statement. > > But we also support a different assignment syntax for arrays, that looks > like this: > > a[0] = p() > a[1] = q() > b = s(a) > > This fails at the moment (specifically, I think the execution engine will > hang). > > Why? Because the is no one point at which we assign a value to 'a' - the > assignment is split over multiple statements, which can be in various > places (and inside loops etc). > > There is nothing in the implementation that detects that a has been > assigned its value. > > So there is this notion in the karajan intermediate code of 'closing an > array'. This is an assertion made in the object code that all assignments > to pieces of an array have been made - that, in affect, the array has its > value. > > The suggested hack/workaround for this is to move the array element > assignments into a procedure: > > (file f[]) z() { > f[0] = p(); > f[1] - q(); > } > > a = z() > b = s(a) > > This works. (which is sort-of a violation of referential transparency) > > It works because Swift implicitly marks arrays returned from compound > procedures as closed (which may or may not be correct). > > So in most variable scopes, arrays behave like single-assignment > variables, but each array can have one specific scope in which members can > be assigned to. In that scope, the array cannot be treated as a whole > variable. 
> > In the z() example above, that special scope is the body of z(). In the > previous example, that scope is the global scope, and the program is > invalid by the rule above that the array cannot be referred to as a whole > in the same place that its members are individually assigned to. > > That's my explanation of what's going on now. I think it matches reality. > I don't like that this is reality, but it is what we have. > > Comments appreciated. > > -- Ian Foster, Director, Computation Institute Argonne National Laboratory & University of Chicago Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439 Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637 Tel: +1 630 252 4619. Web: www.ci.uchicago.edu. Globus Alliance: www.globus.org. From yongzh at cs.uchicago.edu Fri Jun 15 15:40:29 2007 From: yongzh at cs.uchicago.edu (Yong Zhao) Date: Fri, 15 Jun 2007 15:40:29 -0500 (CDT) Subject: [Swift-devel] on the semantics of 'array closing' In-Reply-To: <4672F5E3.7060205@mcs.anl.gov> References: <4672F5E3.7060205@mcs.anl.gov> Message-ID: Yes, the case is exactly like you have described. Currently each a[i] is closed separately, but the whole array also needs to be closed. For instance, if in s, only a[0] and a[1] are accessed, it might go through correctly, but if s accesses all elements of a (where it has no idea how many there are), the workflow would hang to wait for the array to close. Mihael and I talked about closing statement, but it is unclear when it should be done since the order of each a[i] being closed is not deterministic in parallel execution. Yong. On Fri, 15 Jun 2007, Ian Foster wrote: > Hi, > > For: > > a[0] = p() > a[1] = q() > b = s(a) > > I think there are two distinct issues. > > a) Determining the size of the array. This could presumably be done by > declaring it, e.g.: > > a[2] or some similar notion > a[0] = p() > a[1] = q() > b = s(a) > > or by some "closing" concept. 
> > b) Whether or not each element of an array is a separate > single-assignment variable. If they are, then the code above should work > just fine. If they are not, then we have a couple of behaviors we could > define. One would be that b=s(a) blocks until all elements in "a" are > defined. The other is that we have a way of "closing" (once again). In > that case, we have to define what happens if b=s(a) accesses an element > that is not defined. > > Ian. > > Ben Clifford wrote: > > There is a problem that has been called the 'array closing problem'. > > > > It manifests itself in the tutorial in that certain bits of code that > > intuitively can either in a procedure or in the top level can, in > > practice, only go in to a procedure. > > > > In that context, I tried to think about better ways to explain/document > > the behaviour than "mumble mumble move that code into a procedure". > > > > In Swift we claim to have 'single assignment variables'. > > > > >From single assignment variables we get our grid job ordering: > > > > a = p() > > b = s(a) > > > > causes first grid job p to run, and when that has completed, then grid job > > s will run. > > > > This is the same as if we had written: > > > > b = s(a) > > a = p() > > > > The ordering comes from the use of a as an 'output' for p and an 'input' > > for s, not from source text ordering. > > > > In that model, its meaningless to assign two different things ta a, like > > this: > > > > a = p() > > b = s(a) > > a = t() > > > > > > Note that I've omitted the data types from the above. This works in the > > implementation for simple types such as a datafile marker type. > > > > What is important is that each variable is either unassigned or has its > > single value - whenever we refer to that variable, we can either use the > > value it has, or defer evaluation of that expression until the variable > > has its value. > > > > Now consider arrays. 
In the present syntax, arrays can be passed as > > single (complex) values to/from procedures, like before: > > > > a = p() > > b = s(a) > > > > Here a and b are array types. > > > > That's fine. a is assigned to by the first statement, and b is assigned to > > by the second statement. > > > > But we also support a different assignment syntax for arrays, that looks > > like this: > > > > a[0] = p() > > a[1] = q() > > b = s(a) > > > > This fails at the moment (specifically, I think the execution engine will > > hang). > > > > Why? Because the is no one point at which we assign a value to 'a' - the > > assignment is split over multiple statements, which can be in various > > places (and inside loops etc). > > > > There is nothing in the implementation that detects that a has been > > assigned its value. > > > > So there is this notion in the karajan intermediate code of 'closing an > > array'. This is an assertion made in the object code that all assignments > > to pieces of an array have been made - that, in affect, the array has its > > value. > > > > The suggested hack/workaround for this is to move the array element > > assignments into a procedure: > > > > (file f[]) z() { > > f[0] = p(); > > f[1] - q(); > > } > > > > a = z() > > b = s(a) > > > > This works. (which is sort-of a violation of referential transparency) > > > > It works because Swift implicitly marks arrays returned from compound > > procedures as closed (which may or may not be correct). > > > > So in most variable scopes, arrays behave like single-assignment > > variables, but each array can have one specific scope in which members can > > be assigned to. In that scope, the array cannot be treated as a whole > > variable. > > > > In the z() example above, that special scope is the body of z(). 
In the > > previous example, that scope is the global scope, and the program is > > invalid by the rule above that the array cannot be referred to as a whole > > in the same place that its members are individually assigned to. > > > > That's my explanation of what's going on now. I think it matches reality. > > I don't like that this is reality, but it is what we have. > > > > Comments appreciated. > > > > > > -- > > Ian Foster, Director, Computation Institute > Argonne National Laboratory & University of Chicago > Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439 > Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637 > Tel: +1 630 252 4619. Web: www.ci.uchicago.edu. > Globus Alliance: www.globus.org. > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From benc at hawaga.org.uk Fri Jun 15 15:45:18 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 15 Jun 2007 20:45:18 +0000 (GMT) Subject: [Swift-devel] CPU usage with provider-deef In-Reply-To: References: Message-ID: ok. So also Nika and Ioan had a problem where a workflow left running overnight ended up not completing - I think Falkon thinks it completed all of its jobs in a timely fashion, but it seemed to take a (linearly increasing) amount of time for each job notification to be sent, and on the swift/provider-deef side of things, a large amount of CPU was being used and not much else seemed to be happening. That may or may not be related to the high CPU load below, but it's probably worth investigating as it appears to break up large runs. On Fri, 15 Jun 2007, Veronika Nefedova wrote: > Ben, > > we tested both my workflow and a simple "sleep" workflow. Both tests produced > 100% CPU usage when run with Falkon. When submitted directly to GRAM from > swift, only a fraction of 1% CPU was used. I know that Yong did some > additional testing, but I do not know the results. 
>
> Nika
>
> On Jun 15, 2007, at 11:48 AM, Ben Clifford wrote:
>
> >
> > Yesterday, I was playing a bit with Ioan and Nika - they submitted a 68
> > node / 15 minute workflow through provider-deef & falkon and saw the swift
> > JVM on the submit node using about 100% CPU; then the same workflow
> > running through the GT2 GRAM provider rather than provider-deef and falkon
> > appeared to use significantly less.
> >
> > I wandered off at that point so don't know if any interesting results came
> > after.
> >
> > --
> > _______________________________________________
> > Swift-devel mailing list
> > Swift-devel at ci.uchicago.edu
> > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> >
>

From yongzh at cs.uchicago.edu Fri Jun 15 15:46:21 2007
From: yongzh at cs.uchicago.edu (Yong Zhao)
Date: Fri, 15 Jun 2007 15:46:21 -0500 (CDT)
Subject: [Swift-devel] on the semantics of 'array closing'
In-Reply-To:
References: <4672F5E3.7060205@mcs.anl.gov>
Message-ID:

P.S. Because all statements are evaluated in parallel, we cannot simply put
a closing statement for 'a' right before b = s(a) so that b waits for a to
close before continuing. If we did put it there, b would proceed without
waiting for a to be generated.

Yong.

On Fri, 15 Jun 2007, Yong Zhao wrote:

> Yes, the case is exactly like you have described. Currently each a[i] is
> closed separately, but the whole array also needs to be closed. For
> instance, if in s, only a[0] and a[1] are accessed, it might go through
> correctly, but if s accesses all elements of a (where it has no idea how
> many there are), the workflow would hang waiting for the array to close.
>
> Mihael and I talked about a closing statement, but it is unclear when it
> should be done since the order of each a[i] being closed is not
> deterministic in parallel execution.
>
> Yong.
>
> On Fri, 15 Jun 2007, Ian Foster wrote:
>
> > Hi,
> >
> > For:
> >
> > a[0] = p()
> > a[1] = q()
> > b = s(a)
> >
> > I think there are two distinct issues.
> >
> > a) Determining the size of the array. This could presumably be done by
> > declaring it, e.g.:
> >
> > a[2] or some similar notion
> > a[0] = p()
> > a[1] = q()
> > b = s(a)
> >
> > or by some "closing" concept.
> >
> > b) Whether or not each element of an array is a separate
> > single-assignment variable. If they are, then the code above should work
> > just fine. If they are not, then we have a couple of behaviors we could
> > define. One would be that b=s(a) blocks until all elements in "a" are
> > defined. The other is that we have a way of "closing" (once again). In
> > that case, we have to define what happens if b=s(a) accesses an element
> > that is not defined.
> >
> > Ian.
> >
> > Ben Clifford wrote:
> > > There is a problem that has been called the 'array closing problem'.
> > >
> > > It manifests itself in the tutorial in that certain bits of code that
> > > intuitively can go either in a procedure or at the top level can, in
> > > practice, only go into a procedure.
> > >
> > > In that context, I tried to think about better ways to explain/document
> > > the behaviour than "mumble mumble move that code into a procedure".
> > >
> > > In Swift we claim to have 'single assignment variables'.
> > >
> > > From single assignment variables we get our grid job ordering:
> > >
> > > a = p()
> > > b = s(a)
> > >
> > > causes first grid job p to run, and when that has completed, then grid job
> > > s will run.
> > >
> > > This is the same as if we had written:
> > >
> > > b = s(a)
> > > a = p()
> > >
> > > The ordering comes from the use of a as an 'output' for p and an 'input'
> > > for s, not from source text ordering.
> > >
> > > In that model, it's meaningless to assign two different things to a, like
> > > this:
> > >
> > > a = p()
> > > b = s(a)
> > > a = t()
> > >
> > >
> > > Note that I've omitted the data types from the above. This works in the
> > > implementation for simple types such as a datafile marker type.
> > >
> > > What is important is that each variable is either unassigned or has its
> > > single value - whenever we refer to that variable, we can either use the
> > > value it has, or defer evaluation of that expression until the variable
> > > has its value.
> > >
> > > Now consider arrays. In the present syntax, arrays can be passed as
> > > single (complex) values to/from procedures, like before:
> > >
> > > a = p()
> > > b = s(a)
> > >
> > > Here a and b are array types.
> > >
> > > That's fine. a is assigned to by the first statement, and b is assigned to
> > > by the second statement.
> > >
> > > But we also support a different assignment syntax for arrays, that looks
> > > like this:
> > >
> > > a[0] = p()
> > > a[1] = q()
> > > b = s(a)
> > >
> > > This fails at the moment (specifically, I think the execution engine will
> > > hang).
> > >
> > > Why? Because there is no one point at which we assign a value to 'a' - the
> > > assignment is split over multiple statements, which can be in various
> > > places (and inside loops etc).
> > >
> > > There is nothing in the implementation that detects that a has been
> > > assigned its value.
> > >
> > > So there is this notion in the karajan intermediate code of 'closing an
> > > array'. This is an assertion made in the object code that all assignments
> > > to pieces of an array have been made - that, in effect, the array has its
> > > value.
> > >
> > > The suggested hack/workaround for this is to move the array element
> > > assignments into a procedure:
> > >
> > > (file f[]) z() {
> > >   f[0] = p();
> > >   f[1] = q();
> > > }
> > >
> > > a = z()
> > > b = s(a)
> > >
> > > This works. (which is sort-of a violation of referential transparency)
> > >
> > > It works because Swift implicitly marks arrays returned from compound
> > > procedures as closed (which may or may not be correct).
> > >
> > > So in most variable scopes, arrays behave like single-assignment
> > > variables, but each array can have one specific scope in which members can
> > > be assigned to. In that scope, the array cannot be treated as a whole
> > > variable.
> > >
> > > In the z() example above, that special scope is the body of z(). In the
> > > previous example, that scope is the global scope, and the program is
> > > invalid by the rule above that the array cannot be referred to as a whole
> > > in the same place that its members are individually assigned to.
> > >
> > > That's my explanation of what's going on now. I think it matches reality.
> > > I don't like that this is reality, but it is what we have.
> > >
> > > Comments appreciated.
> > >
> >
> > --
> >
> > Ian Foster, Director, Computation Institute
> > Argonne National Laboratory & University of Chicago
> > Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439
> > Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637
> > Tel: +1 630 252 4619. Web: www.ci.uchicago.edu.
> > Globus Alliance: www.globus.org.
> >
>

From benc at hawaga.org.uk Fri Jun 15 15:55:54 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 15 Jun 2007 20:55:54 +0000 (GMT)
Subject: [Swift-devel] on the semantics of 'array closing'
In-Reply-To: <4672F5E3.7060205@mcs.anl.gov>
References: <4672F5E3.7060205@mcs.anl.gov>
Message-ID:

There's a different approach, which is to say that 'a' is a variable and
can be assigned to once. Thus assignment syntax like a[0]=something
becomes illegal and we need more functional language constructs.
So instead of writing:

for e,i in input_array {
  output_array[i] = p(e);
}

we would write:

output_array = foreach i in input_array {
  return p(i);
}

(it's a Haskell map in different syntax!)

That means that, at the language level, output_array is now properly
single assignment.

From hategan at mcs.anl.gov Sat Jun 16 03:58:26 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sat, 16 Jun 2007 11:58:26 +0300
Subject: [Swift-devel] CPU usage with provider-deef
In-Reply-To:
References:
Message-ID: <1181984306.10455.3.camel@blabla.mcs.anl.gov>

That can either be good or bad. If the CPU is used doing meaningful
stuff, then it's good. In other words, I'm guessing that the job
throughput is also higher with Falkon.

Mihael

On Fri, 2007-06-15 at 16:48 +0000, Ben Clifford wrote:
> Yesterday, I was playing a bit with Ioan and Nika - they submitted a 68
> node / 15 minute workflow through provider-deef & falkon and saw the swift
> JVM on the submit node using about 100% CPU; then the same workflow
> running through the GT2 GRAM provider rather than provider-deef and falkon
> appeared to use significantly less.
>
> I wandered off at that point so don't know if any interesting results came
> after.

From hategan at mcs.anl.gov Sat Jun 16 04:04:35 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sat, 16 Jun 2007 12:04:35 +0300
Subject: [Swift-devel] on the semantics of 'array closing'
In-Reply-To:
References:
Message-ID: <1181984676.10455.8.camel@blabla.mcs.anl.gov>

On Fri, 2007-06-15 at 20:13 +0000, Ben Clifford wrote:
> [...]
> But we also support a different assignment syntax for arrays, that looks
> like this:
>
> a[0] = p()
> a[1] = q()
> b = s(a)
>
> This fails at the moment (specifically, I think the execution engine will
> hang).

Somewhat. The invocation is ok. What happens is that if you iterate over
a with foreach, two iterations will be started, but foreach will keep
waiting to see whether more items appear in the array. Think of arrays as
streams of (k, v) pairs and a size. If the size is unknown, foreach
cannot stop.

>
> Why? Because there is no one point at which we assign a value to 'a' - the
> assignment is split over multiple statements, which can be in various
> places (and inside loops etc).
>
> There is nothing in the implementation that detects that a has been
> assigned its value.
>
> So there is this notion in the karajan intermediate code of 'closing an
> array'. This is an assertion made in the object code that all assignments
> to pieces of an array have been made - that, in effect, the array has its
> value.
>
> The suggested hack/workaround for this is to move the array element
> assignments into a procedure:
>
> (file f[]) z() {
>   f[0] = p();
>   f[1] = q();
> }
>
> a = z()
> b = s(a)
>
> This works. (which is sort-of a violation of referential transparency)
>
> It works because Swift implicitly marks arrays returned from compound
> procedures as closed (which may or may not be correct).

We defined it as correct. Something created in one scope cannot be
modified in a parent scope.
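
The "stream of (k, v) pairs and a size" picture can be sketched in a few
lines of Python. This is only a toy model under stated assumptions, not the
Swift or Karajan implementation: StreamArray, set, close and foreach are
hypothetical names, and the timeout exists purely so that the "foreach
cannot stop" case shows up as an exception instead of a real hang.

```python
import threading


class StreamArray:
    """Toy model of an array as a stream of (key, value) pairs.

    Hypothetical names throughout; this is not Swift/Karajan code.
    """

    def __init__(self):
        self._items = {}
        self._closed = False
        self._cond = threading.Condition()

    def set(self, key, value):
        # Each element is single-assignment, and nothing may be added
        # once the array has been closed.
        with self._cond:
            if self._closed:
                raise RuntimeError("array is closed")
            if key in self._items:
                raise RuntimeError("element already assigned")
            self._items[key] = value
            self._cond.notify_all()

    def close(self):
        # Assert that all element assignments have been made.
        with self._cond:
            self._closed = True
            self._cond.notify_all()

    def foreach(self, timeout=None):
        # Yield pairs as they appear; finish only once the array is closed.
        seen = set()
        while True:
            with self._cond:
                pending = [k for k in self._items if k not in seen]
                if not pending:
                    if self._closed:
                        return
                    if not self._cond.wait(timeout):
                        raise TimeoutError("array never closed")
                    continue
                key = pending[0]
                value = self._items[key]
            seen.add(key)
            yield key, value


a = StreamArray()
a.set(0, "p()")
a.set(1, "q()")

started = []
hung = False
try:
    # Two iterations start, then foreach waits for items that never come.
    for pair in a.foreach(timeout=0.1):
        started.append(pair)
except TimeoutError:
    hung = True

a.close()
after_close = list(a.foreach())  # terminates now that the array is closed
```

In this model the two iterations over a[0] and a[1] start right away in both
cases; only the closed array lets the loop finish.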
Mihael

From hategan at mcs.anl.gov Sat Jun 16 04:12:36 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sat, 16 Jun 2007 12:12:36 +0300
Subject: [Swift-devel] on the semantics of 'array closing'
In-Reply-To: <4672F5E3.7060205@mcs.anl.gov>
References: <4672F5E3.7060205@mcs.anl.gov>
Message-ID: <1181985156.10455.17.camel@blabla.mcs.anl.gov>

On Fri, 2007-06-15 at 15:26 -0500, Ian Foster wrote:
> Hi,
>
> For:
>
> a[0] = p()
> a[1] = q()
> b = s(a)
>
> I think there are two distinct issues.
>
> a) Determining the size of the array. This could presumably be done by
> declaring it, e.g.:
>
> a[2] or some similar notion
> a[0] = p()
> a[1] = q()
> b = s(a)
>
> or by some "closing" concept.

Right!

>
> b) Whether or not each element of an array is a separate
> single-assignment variable.

They are. And the code above should work, provided that the a[2]
declaration marks the array as "closed".

> If they are, then the code above should work
> just fine. If they are not, then we have a couple of behaviors we could
> define. One would be that b=s(a) blocks until all elements in "a" are
> defined. The other is that we have a way of "closing" (once again). In
> that case, we have to define what happens if b=s(a) accesses an element
> that is not defined.

IndexOutOfBoundsException.

Another thing we explored mentally was the possibility of doing a simple
analysis and grouping all assignments to an array. I'll use an example:

a[0] = 1;
b = c;
a[1] = 9;
d = f(5);
a[2] = 7;

This normally gets translated into (some initializations omitted and
function names changed for clarity):

parallel(
  setarray(a, 0, 1)
  alias(b, c)
  setarray(a, 1, 9)
  set(d, f(5))
  setarray(a, 2, 7)
)

The "proposed" solution would be to translate into:

parallel(
  alias(b, c)
  set(d, f(5))
  sequential(
    parallel(
      setarray(a, 0, 1)
      setarray(a, 1, 9)
      setarray(a, 2, 7)
    )
    closearray(a)
  )
)

Mihael

From hategan at mcs.anl.gov Sat Jun 16 04:21:14 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sat, 16 Jun 2007 12:21:14 +0300
Subject: [Swift-devel] on the semantics of 'array closing'
In-Reply-To:
References: <4672F5E3.7060205@mcs.anl.gov>
Message-ID: <1181985674.10455.23.camel@blabla.mcs.anl.gov>

On Fri, 2007-06-15 at 20:55 +0000, Ben Clifford wrote:
> There's a different approach, which is to say that 'a' is a variable and
> can be assigned to once. Thus assignment syntax like a[0]=something
> becomes illegal and we need more functional language constructs.

So is the sequence:

a = 1;
a = 2;

I think this cannot be completely avoided, functional language
constructs or not.

> So
> instead of writing:
>
> for e,i in input_array {
>   output_array[i] = p(e);
> }
>
> we would write:
>
> output_array = foreach i in input_array {
>   return p(i);
> }
>
> (it's a Haskell map in different syntax!)

However, even python features list comprehensions:

output_array = [p(i) for i in input_array]

so we could have both. Karajan already supports streams of this kind:

output_array = stream(parallelFor(i, input_array, p(i)))

(give or take some filters).

Mihael

From hategan at mcs.anl.gov Sat Jun 16 04:38:42 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sat, 16 Jun 2007 12:38:42 +0300
Subject: [Swift-devel] on the semantics of 'array closing'
In-Reply-To: <1181985674.10455.23.camel@blabla.mcs.anl.gov>
References: <4672F5E3.7060205@mcs.anl.gov> <1181985674.10455.23.camel@blabla.mcs.anl.gov>
Message-ID: <1181986722.10744.2.camel@blabla.mcs.anl.gov>

> However, even python features list comprehensions...

Python is a fine language. The above should read "However, even some
imperative languages, such as python, feature list comprehension...".

From benc at hawaga.org.uk Sat Jun 16 08:00:21 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sat, 16 Jun 2007 13:00:21 +0000 (GMT)
Subject: [Swift-devel] CPU usage with provider-deef
In-Reply-To: <1181984306.10455.3.camel@blabla.mcs.anl.gov>
References: <1181984306.10455.3.camel@blabla.mcs.anl.gov>
Message-ID:

It was running something like 68 jobs in 15 minutes. Kinda scary if each
of those jobs needs 15 cpu.seconds on the submit side.

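
The per-job figure can be checked with a quick calculation. This is only a
rough sketch: it assumes the submit-side JVM kept one CPU at about 100% for
the whole ~15 minutes and that exactly 68 jobs ran, neither of which was
measured precisely in this thread.

```python
# Back-of-the-envelope check of submit-side CPU cost per job.
# Assumptions (not measurements): one CPU busy at ~100% for the full run.
wall_minutes = 15
jobs = 68
cpu_utilisation = 1.0  # "about 100% CPU" on a single core

cpu_seconds_total = wall_minutes * 60 * cpu_utilisation
cpu_seconds_per_job = cpu_seconds_total / jobs

print(f"{cpu_seconds_per_job:.1f} CPU-seconds per job")  # prints "13.2 CPU-seconds per job"
```

That is a little over 13 CPU-seconds per job, in line with the roughly 15
quoted above.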
From benc at hawaga.org.uk Sat Jun 16 08:10:28 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sat, 16 Jun 2007 13:10:28 +0000 (GMT)
Subject: [Swift-devel] on the semantics of 'array closing'
In-Reply-To: <1181985674.10455.23.camel@blabla.mcs.anl.gov>
References: <4672F5E3.7060205@mcs.anl.gov> <1181985674.10455.23.camel@blabla.mcs.anl.gov>
Message-ID:

On Sat, 16 Jun 2007, Mihael Hategan wrote:

> > Thus assignment syntax like a[0]=something
> > becomes illegal and we need more functional language constructs.
>
> so is the sequence:
> a = 1;
> a = 2;

There's a bug about that open too (actually two, but I closed one of
them).

> > (it's a Haskell map in different syntax!)
>
> However, even python features list comprehensions:
> output_array = [p(i) for i in input_array]

Right. Any of the constructs that have the 'expression that returns a
whole array' property would be ok.

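
To make the 'expression that returns a whole array' property concrete, here
is a small sketch in Python rather than SwiftScript. The function p is a
hypothetical stand-in for a grid job, and the thread pool is only an
assumption used to show that the elements may be computed concurrently while
the array name is still bound exactly once.

```python
from concurrent.futures import ThreadPoolExecutor


def p(x):
    # Hypothetical stand-in for a grid job; here it just doubles its input.
    return 2 * x


input_array = [1, 2, 3]

# Element-wise style: the container exists before it is complete, so nothing
# in the variable itself says when the last element has been written.
partial = {}
for i, e in enumerate(input_array):
    partial[i] = p(e)

# Whole-array style: the name is bound once, to a complete value.
# Executor.map may run the calls concurrently but returns results in
# input order.
with ThreadPoolExecutor() as pool:
    output_array = list(pool.map(p, input_array))

print(output_array)  # prints [2, 4, 6]
```

In the second form there is no separate "closing" step to get wrong, because
the array only becomes visible once the whole expression has a value.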
-- From benc at hawaga.org.uk Sat Jun 16 08:34:10 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Sat, 16 Jun 2007 13:34:10 +0000 (GMT) Subject: [Swift-devel] on the semantics of 'array closing' In-Reply-To: <1181984676.10455.8.camel@blabla.mcs.anl.gov> References: <1181984676.10455.8.camel@blabla.mcs.anl.gov> Message-ID: On Sat, 16 Jun 2007, Mihael Hategan wrote: > > It works because Swift implicitly marks arrays returned from compound > > procedures as closed (which may or may not be correct). > > We defined it as correct. Something created in one scope cannot be > modified in a parent scope. That's fine - what was unintuitive to me was that something created in one scope cannot be referred to in that same scope. i.e. you can create a piecewise using a[...]=... but cannot then refer to a. -- From iraicu at cs.uchicago.edu Sat Jun 16 09:17:13 2007 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Sat, 16 Jun 2007 09:17:13 -0500 Subject: [Swift-devel] CPU usage with provider-deef In-Reply-To: References: <1181984306.10455.3.camel@blabla.mcs.anl.gov> Message-ID: <4673F0E9.3060505@cs.uchicago.edu> Actually, it was a bug the Falkon provider, there was a tight polling loop on a task queue even if it was empty... it got fixed with one line of code :) its now running the CPU relatively idle for Nika's workflow which doesn't require high throughputs. Thanks Yong for fixing it! Ioan Ben Clifford wrote: > It was running something like 68 jobs in 15 minutes. Kinda scary if each > of those jobs needs 15 cpu.seconds on the submit side. > > On Sat, 16 Jun 2007, Mihael Hategan wrote: > > >> That can either be good or bad. If the CPU is used doing meaningful >> stuff, then it's good. In other words, I'm guessing that the job >> throughput is also higher with Falkon. 
>> >> Mihael >> >> On Fri, 2007-06-15 at 16:48 +0000, Ben Clifford wrote: >> >>> Yesterday, I was playing a bit with Ioan and Nika - they submitted a 68 >>> node / 15 minute workflow through provider-deef & falkon and saw the swift >>> JVM on the submit node using about 100% CPU; then the same workflow >>> running through the GT2 GRAM provider rather than provider-deef and falkon >>> appeared to use significantly less. >>> >>> I wandered off at that point so don't know if any interesting results came >>> after. >>> >>> >> > > -- ============================================ Ioan Raicu Ph.D. Student ============================================ Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 ============================================ Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dsl.cs.uchicago.edu/ ============================================ ============================================ -------------- next part -------------- An HTML attachment was scrubbed... URL: From benc at hawaga.org.uk Sat Jun 16 09:22:26 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Sat, 16 Jun 2007 14:22:26 +0000 (GMT) Subject: [Swift-devel] CPU usage with provider-deef In-Reply-To: <4673F0E9.3060505@cs.uchicago.edu> References: <1181984306.10455.3.camel@blabla.mcs.anl.gov> <4673F0E9.3060505@cs.uchicago.edu> Message-ID: Cool. Did you try the long long run again with that change in place? Also what was the elapsed realtime for using GRAM2 vs Falkon for that 68 node / ~15 minute workflow? On Sat, 16 Jun 2007, Ioan Raicu wrote: > Actually, it was a bug the Falkon provider, there was a tight polling loop on > a task queue even if it was empty... it got fixed with one line of code :) > its now running the CPU relatively idle for Nika's workflow which doesn't > require high throughputs. > > Thanks Yong for fixing it! 
> > Ioan > > Ben Clifford wrote: > > It was running something like 68 jobs in 15 minutes. Kinda scary if each of > > those jobs needs 15 cpu.seconds on the submit side. > > > > On Sat, 16 Jun 2007, Mihael Hategan wrote: > > > > > > > That can either be good or bad. If the CPU is used doing meaningful > > > stuff, then it's good. In other words, I'm guessing that the job > > > throughput is also higher with Falkon. > > > > > > Mihael > > > > > > On Fri, 2007-06-15 at 16:48 +0000, Ben Clifford wrote: > > > > > > > Yesterday, I was playing a bit with Ioan and Nika - they submitted a 68 > > > > node / 15 minute workflow through provider-deef & falkon and saw the > > > > swift JVM on the submit node using about 100% CPU; then the same > > > > workflow running through the GT2 GRAM provider rather than provider-deef > > > > and falkon appeared to use significantly less. > > > > > > > > I wandered off at that point so don't know if any interesting results > > > > came after. > > > > > > > > > > > > > > > > > From benc at hawaga.org.uk Sat Jun 16 09:27:13 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Sat, 16 Jun 2007 14:27:13 +0000 (GMT) Subject: [Swift-devel] nightly tests never finishing Message-ID: The nightly test pages recently don't seem to be complete - 13th, 14th and 15th have all stopped around the array_iteration on grid section. That also means that the download link for nightly builds is never provided. -- From iraicu at cs.uchicago.edu Sat Jun 16 09:35:52 2007 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Sat, 16 Jun 2007 09:35:52 -0500 Subject: [Swift-devel] CPU usage with provider-deef In-Reply-To: References: <1181984306.10455.3.camel@blabla.mcs.anl.gov> <4673F0E9.3060505@cs.uchicago.edu> Message-ID: <4673F548.5070608@cs.uchicago.edu> With Falkon, we had 34 machines, with 68 processors, running a job on each processor. I think it took about 20 min. 
We then ran over GRAM, but there are only 60 IA64 nodes (120 processors) at ANL, so when the 68 jobs got submitted, only 60 of them went into the run queue and 8 went into the wait queue. There were enough processors to run all the jobs at the same time, but I don't know how we were supposed to tweak Swift to have it dispatch two tasks per GRAM job and run both in parallel, one on each processor. I believe the total time for the GRAM2 run was about 26 min. The extra round of 8 jobs (which Falkon didn't have) took about 200 sec (3.4 min), so the rough improvement would probably have been around 1~2.6 min (5~10%). That sounds about right with the 0.5~1 job/sec, so 68 jobs would have taken 68~136 or so seconds. The comparison wasn't done scientifically, so don't quote the numbers exactly, but Falkon was a bit faster.

In Nika's workflow case, where high throughput isn't essential, the big gain from using Falkon is the scalability of the Falkon wait queue and the resource provisioning: once you get some resources, you use them over and over and avoid the LRM queue wait time for each job.

BTW, we ran a 20 molecule short run yesterday successfully, but we are still having problems with the 100 molecule run in MolDyn. It's not clear where the problem is; on the surface Falkon looks fine... we are looking into where everything breaks to cause Swift to not continue with the workflow to completion!

Ioan

Ben Clifford wrote:
> Cool.
>
> Did you try the long long run again with that change in place?
>
> Also what was the elapsed realtime for using GRAM2 vs Falkon for that 68
> node / ~15 minute workflow?
>
> On Sat, 16 Jun 2007, Ioan Raicu wrote:
>
>
>> Actually, it was a bug the Falkon provider, there was a tight polling loop on
>> a task queue even if it was empty... it got fixed with one line of code :)
>> its now running the CPU relatively idle for Nika's workflow which doesn't
>> require high throughputs.
>>
>> Thanks Yong for fixing it!
From benc at hawaga.org.uk Sat Jun 16 09:41:38 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sat, 16 Jun 2007 14:41:38 +0000 (GMT)
Subject: [Swift-devel] CPU usage with provider-deef
In-Reply-To: <4673F548.5070608@cs.uchicago.edu>
References: <1181984306.10455.3.camel@blabla.mcs.anl.gov> <4673F0E9.3060505@cs.uchicago.edu> <4673F548.5070608@cs.uchicago.edu>
Message-ID:

On Sat, 16 Jun 2007, Ioan Raicu wrote:
> having problems with the 100 molecule run in MolDyn. Its not clear where the
> problem is, on the surface Falkon looks fine... we are looking into where
> everything breaks to cause Swift to not continue with the workflow to
> completion!

Is this the same problem that you showed me the other day, or a different one?

By 'the same problem' I mean that Falkon thinks all the jobs are done, but Falkon's measured response time for sending completion notifications gets approximately linearly longer over time, and the Swift JVM uses ~100% CPU and doesn't indicate job completion at all after a certain period.

Or are there different symptoms now?

--

From iraicu at cs.uchicago.edu Sat Jun 16 10:00:41 2007
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Sat, 16 Jun 2007 10:00:41 -0500
Subject: [Swift-devel] CPU usage with provider-deef
In-Reply-To:
References: <1181984306.10455.3.camel@blabla.mcs.anl.gov> <4673F0E9.3060505@cs.uchicago.edu> <4673F548.5070608@cs.uchicago.edu>
Message-ID: <4673FB19.6070305@cs.uchicago.edu>

Nope, I think this is a different problem, or at least a subset of the problems we were having before.

Since we fixed the CPU utilization, and we moved to a bigger box (4 CPUs with 2GB of memory), everything is happening in a timely fashion (a few ms per notification delivery throughout the experiment). Plus, I believe the view is consistent (the same tasks look complete on both ends) between Falkon and Swift, but we are still checking on this as the run was made just last night for the 100 mol run. We'll keep you posted with what we find.
Ioan

From hategan at mcs.anl.gov Sat Jun 16 10:02:46 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sat, 16 Jun 2007 18:02:46 +0300
Subject: [Swift-devel] CPU usage with provider-deef
In-Reply-To: <4673FB19.6070305@cs.uchicago.edu>
References: <1181984306.10455.3.camel@blabla.mcs.anl.gov> <4673F0E9.3060505@cs.uchicago.edu> <4673F548.5070608@cs.uchicago.edu> <4673FB19.6070305@cs.uchicago.edu>
Message-ID: <1182006166.11495.1.camel@blabla.mcs.anl.gov>

Yourkit (www.yourkit.com) has free licenses for open source projects for their profiler. Point them to a globus web page that has your name, and they'll send you the license.
Alternatively, there are other profilers out there, and I strongly recommend using them on such issues. Mihael On Sat, 2007-06-16 at 10:00 -0500, Ioan Raicu wrote: > Nope, I think this is a different problem, or at least a subset of the > problems we were having before. > > Since we fixed the CPU utilization, and we moved to a bigger box (4 > CPUs with 2GB of memory), everything is happening in a timely fashion > (a few ms per notification delivery throughout the experiment). Plus, > I believe the view is consistent (the same tasks look complete on both > ends) between Falkon and Swift, but we are still checking on this as > the run was made just last night for the 100 mol run. We'll keep you > posted with what we find. > > Ioan > > Ben Clifford wrote: > > On Sat, 16 Jun 2007, Ioan Raicu wrote: > > > > > > > having problems with the 100 molecule run in MolDyn. Its not clear where the > > > problem is, on the surface Falkon looks fine... we are looking into where > > > everything breaks to cause Swift to not continue with the workflow to > > > completion! > > > > > > > The same problem that you showed me the other day or different? > > > > with 'the same problem' being that falkon thinks all the jobs are done; > > but that falkon's measure response time for sending completion > > notifications gets approximately linearly longer over time and the swift > > JVM uses ~100% and doesn't inidicate job completion at all after a certain > > period. > > > > or different symptoms now? > > > > > > -- > ============================================ > Ioan Raicu > Ph.D. Student > ============================================ > Distributed Systems Laboratory > Computer Science Department > University of Chicago > 1100 E. 
58th Street, Ryerson Hall
> Chicago, IL 60637
> ============================================
> Email: iraicu at cs.uchicago.edu
> Web: http://www.cs.uchicago.edu/~iraicu
> http://dsl.cs.uchicago.edu/
> ============================================
> ============================================

From wilde at mcs.anl.gov Sat Jun 16 10:05:39 2007
From: wilde at mcs.anl.gov (Mike Wilde)
Date: Sat, 16 Jun 2007 10:05:39 -0500
Subject: [Swift-devel] on the semantics of 'array closing'
In-Reply-To:
References: <1181984676.10455.8.camel@blabla.mcs.anl.gov>
Message-ID: <4673FC43.8040204@mcs.anl.gov>

Hi all,

I'm jumping in late; I re-read the thread a few times but may have missed something, so correct me as needed. Also, rather than spending more time polishing the thoughts below, I am just putting them out here for discussion.

This discussion seems very important to me, as it can close down several of the major open issues that are critical to the language, both to give it complete and consistent semantics and to make it practical for the problems that we are applying it to.

Four important but missing aspects of this discussion are: pipelining, error handling, restart, and mapping.

I feel that Swift needs the following semantics:

1. Pipelining

The data dependency aspects of Swift are carried out at the atomic level in a pipelined manner.

-- elements of an array are written into the stream

-- readers of the array consume the stream

-- the entire program remains active in parallel, across function boundaries

Array elements [k,v] are identified by their index, k, which can be an int or string.

2. Error handling

In practice, many large-scale foreach() operations will never complete, yet they will deliver a lot of useful results that we want subsequent statements in a program to continue to operate on. Thus closing needs to permit criteria other than just "finishing".

An array is "closed" when its producer function/foreach "shuts down". Can we permit shutdown/closing to occur based on finishing, time, or quota/threshold? These would be parameters of the foreach statement that could be overridden.

(For some practical examples, see map-reduce; it has similar problems: a parallel computation reaches a level where there is lots of parallelism and, as it proceeds, gets to a point where only the "stragglers" are left: things waiting in slow queues or for hung data transfers, etc. I've read this in m/r papers, and found that our experiences match those reported by the Google m/r people.)

3. Restart

We want computations to be restartable. If 50% of a large array/dataset gets created in a 10-hour run, and then the run fails, we want it to be restartable and to continue where it left off with minimal loss of "completed" results.

4. Mapping

Lastly, Swift mapping should be connected to this whole process: the mapped contents of a dataset should be a stream of XML elements rather than a "completed" XML document, so that we can practically handle very large datasets. So when a foreach() statement processes an array, it's processing the mapped stream of the array. Mappers should be parallel processes that produce and consume these streams of XML elements.

- Mike

Ben Clifford wrote, On 6/16/2007 8:34 AM:
>
> On Sat, 16 Jun 2007, Mihael Hategan wrote:
>
>>> It works because Swift implicitly marks arrays returned from compound
>>> procedures as closed (which may or may not be correct).
>> We defined it as correct. Something created in one scope cannot be
>> modified in a parent scope.
>
> That's fine - what was unintuitive to me was that something created in one
> scope cannot be referred to in that same scope. i.e. you can create a
> piecewise using a[...]=... but cannot then refer to a.
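
To make the closing rule quoted above concrete, here is a minimal Python sketch. It is for illustration only: it is not Swift code and makes no claim about how the Karajan engine implements closing. The class and method names (ClosableArray, assign, close, as_whole) are hypothetical, and where a real engine would defer evaluation until the array is closed, this sketch simply raises an error.

```python
class ClosableArray:
    """Write-once array: each index is assigned at most once, and the
    array can be read as a whole only after it has been closed."""

    def __init__(self):
        self._items = {}
        self._closed = False

    def assign(self, index, value):
        # Piecewise assignment, like a[0] = p() in the thread.
        if self._closed:
            raise RuntimeError("array is closed; no further assignments")
        if index in self._items:
            raise ValueError(f"element {index!r} is already assigned")
        self._items[index] = value

    def close(self):
        # Assertion that all element assignments have been made.
        self._closed = True

    def as_whole(self):
        # Whole-array use, like b = s(a). A real engine would wait here;
        # the sketch raises instead (assumption made for simplicity).
        if not self._closed:
            raise RuntimeError("array is not closed; its value is incomplete")
        return dict(self._items)


a = ClosableArray()
a.assign(0, "p-output")
a.assign(1, "q-output")
a.close()
whole = a.as_whole()
```

The point of the sketch is that the producing scope is the only place that calls assign(), and close() marks the moment the array becomes an ordinary single-assignment value for everyone else.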
>
-- Mike Wilde Computation Institute, University of Chicago Math & Computer Science Division Argonne National Laboratory Argonne, IL 60439 USA tel 630-252-7497 fax 630-252-1997

From wilde at mcs.anl.gov Sat Jun 16 10:50:25 2007
From: wilde at mcs.anl.gov (Mike Wilde)
Date: Sat, 16 Jun 2007 10:50:25 -0500
Subject: [Swift-devel] on the semantics of 'array closing'
In-Reply-To: <4673FC43.8040204@mcs.anl.gov>
References: <1181984676.10455.8.camel@blabla.mcs.anl.gov> <4673FC43.8040204@mcs.anl.gov>
Message-ID: <467406C1.2070608@mcs.anl.gov>

Also to note: Ian has suggested several times that we explore map-reduce. I think this is worth doing: it's possible/likely that Swift is already pretty close to m-r in many ways, and could benefit from a more detailed comparison and an assessment of what we can borrow, adapt, and/or integrate.

We should use this as a chance to create a "swift library" page where we post good papers that we can cite in our discussions to get ourselves on a common page. Some of these might be good material for the Thu Grad seminar discussions as well.

- Mike

Mike Wilde wrote, On 6/16/2007 10:05 AM:
> Hi all,
>
> I'm jumping in late; I re-read the thread a few times but may have
> missed something. So correct me as needed. Also, rather than spending
> more time polishing the thoughts below I just put them out here for
> discussion.
>
> This discussion seems to me very important, as it can close down several
> of the major open issues that are very critical to the language, both to
> give it complete and consistent semantics and to make it practical for
> the problems that we are applying it to.
>
> Four important but missing aspects of this discussion are: pipelining,
> error handling, restart, and mapping.
>
> I feel that swift needs the following semantics:
>
> 1. Pipelining:
>
> The data dependency aspects of swift are carried out at the atomic level
> in a pipelined manner.
> > - Mike > > > > > Ben Clifford wrote, On 6/16/2007 8:34 AM: >> >> On Sat, 16 Jun 2007, Mihael Hategan wrote: >> >>>> It works because Swift implicitly marks arrays returned from >>>> compound procedures as closed (which may or may not be correct). >>> We defined it as correct. Something created in one scope cannot be >>> modified in a parent scope. >> >> That's fine - what was unintuitive to me was that something created in >> one scope cannot be referred to in that same scope. i.e. you can >> create a piecewise using a[...]=... but cannot then refer to a. >> > -- Mike Wilde Computation Institute, University of Chicago Math & Computer Science Division Argonne National Laboratory Argonne, IL 60439 USA tel 630-252-7497 fax 630-252-1997 From foster at mcs.anl.gov Sat Jun 16 10:58:13 2007 From: foster at mcs.anl.gov (Ian Foster) Date: Sat, 16 Jun 2007 10:58:13 -0500 Subject: [Swift-devel] on the semantics of 'array closing' In-Reply-To: <4673FC43.8040204@mcs.anl.gov> References: <1181984676.10455.8.camel@blabla.mcs.anl.gov> <4673FC43.8040204@mcs.anl.gov> Message-ID: <46740895.70708@mcs.anl.gov> Mike: That's a great summary of requirements. Ian. Mike Wilde wrote: > Hi all, > > I'm jumping in late; I re-read the thread a few times but may have > missed something. So correct me as needed. Also, rather than spending > more time polishing the thoughts below I just put them out here for > discussion. > > This discussion seems to me very important, as it can close down > several of the major open issues that are very critical to the > language, both to give it complete and consistent semantics and to > make it practical fr the problems that we are applying it to. > > Four important but missing aspects of this discussion are: pipelining, > error handing, restart, and mapping. > > I feel that swift needs the following semantics: > > 1. Pipelining: > > The data dependency aspects of swift are carried out at the atomic > level in a pipelined manner. 
> > - Mike > > > > > Ben Clifford wrote, On 6/16/2007 8:34 AM: >> >> On Sat, 16 Jun 2007, Mihael Hategan wrote: >> >>>> It works because Swift implicitly marks arrays returned from >>>> compound procedures as closed (which may or may not be correct). >>> We defined it as correct. Something created in one scope cannot be >>> modified in a parent scope. >> >> That's fine - what was unintuitive to me was that something created >> in one scope cannot be referred to in that same scope. i.e. you can >> create a piecewise using a[...]=... but cannot then refer to a. >> > -- Ian Foster, Director, Computation Institute Argonne National Laboratory & University of Chicago Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439 Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637 Tel: +1 630 252 4619. Web: www.ci.uchicago.edu. Globus Alliance: www.globus.org. From foster at mcs.anl.gov Sat Jun 16 10:59:17 2007 From: foster at mcs.anl.gov (Ian Foster) Date: Sat, 16 Jun 2007 10:59:17 -0500 Subject: [Swift-devel] CPU usage with provider-deef In-Reply-To: <1182006166.11495.1.camel@blabla.mcs.anl.gov> References: <1181984306.10455.3.camel@blabla.mcs.anl.gov> <4673F0E9.3060505@cs.uchicago.edu> <4673F548.5070608@cs.uchicago.edu> <4673FB19.6070305@cs.uchicago.edu> <1182006166.11495.1.camel@blabla.mcs.anl.gov> Message-ID: <467408D5.2020709@mcs.anl.gov> It seems important that Ioan sit down with Mihael and work through the Falkon code to see where it can be simplified, improved, etc. I am sure that this will result in problems being identified and fixed that will otherwise cost us time later. Mihael Hategan wrote: > Yourkit (www.yourkit.com) has free licenses for open source projects for > their profiler. Point them to a globus web page that has your name, and > they'll send you the license. Alternatively, there are other profilers > out there, and I strongly recommend using them on such issues. 
> > Mihael > > On Sat, 2007-06-16 at 10:00 -0500, Ioan Raicu wrote: > >> Nope, I think this is a different problem, or at least a subset of the >> problems we were having before. >> >> Since we fixed the CPU utilization, and we moved to a bigger box (4 >> CPUs with 2GB of memory), everything is happening in a timely fashion >> (a few ms per notification delivery throughout the experiment). Plus, >> I believe the view is consistent (the same tasks look complete on both >> ends) between Falkon and Swift, but we are still checking on this as >> the run was made just last night for the 100 mol run. We'll keep you >> posted with what we find. >> >> Ioan >> >> Ben Clifford wrote: >> >>> On Sat, 16 Jun 2007, Ioan Raicu wrote: >>> >>> >>> >>>> having problems with the 100 molecule run in MolDyn. Its not clear where the >>>> problem is, on the surface Falkon looks fine... we are looking into where >>>> everything breaks to cause Swift to not continue with the workflow to >>>> completion! >>>> >>>> >>> The same problem that you showed me the other day or different? >>> >>> with 'the same problem' being that falkon thinks all the jobs are done; >>> but that falkon's measure response time for sending completion >>> notifications gets approximately linearly longer over time and the swift >>> JVM uses ~100% and doesn't inidicate job completion at all after a certain >>> period. >>> >>> or different symptoms now? >>> >>> >>> >> -- >> ============================================ >> Ioan Raicu >> Ph.D. Student >> ============================================ >> Distributed Systems Laboratory >> Computer Science Department >> University of Chicago >> 1100 E. 
58th Street, Ryerson Hall
>> Chicago, IL 60637
>> ============================================
>> Email: iraicu at cs.uchicago.edu
>> Web: http://www.cs.uchicago.edu/~iraicu
>> http://dsl.cs.uchicago.edu/
>> ============================================
>> ============================================
>>

From foster at mcs.anl.gov Sat Jun 16 11:01:47 2007
From: foster at mcs.anl.gov (Ian Foster)
Date: Sat, 16 Jun 2007 11:01:47 -0500
Subject: [Swift-devel] on the semantics of 'array closing'
In-Reply-To:
References: <4672F5E3.7060205@mcs.anl.gov>
Message-ID: <4674096B.4020109@mcs.anl.gov>

I like the notion of having a "map" function. If that could entirely replace the current element assignments, that would be a wonderful simplification, it seems to me.

Ian.

Ben Clifford wrote:
> There's a different approach, which is to say that 'a' is a variable and
> can be assigned to once. Thus assignment syntax like a[0]=something
> becomes illegal and we need more functional language constructs. So
> instead of writing:
>
> for e,i in input_array {
>     output_array[i] = p(e);
> }
>
> we would write:
>
> output_array = foreach i in input_array {
>     return p(i);
> }
>
> (it's a Haskell map in different syntax!)
>
> That means that, at the language level, output_array is now properly
> single assignment.
>
>
> On Fri, 15 Jun 2007, Ian Foster wrote:
>
>
>> Hi,
>>
>> For:
>>
>> a[0] = p()
>> a[1] = q()
>> b = s(a)
>>
>> I think there are two distinct issues.
>>
>> a) Determining the size of the array.
This could presumably be done by
>> declaring it, e.g.:
>>
>> a[2] or some similar notion
>> a[0] = p()
>> a[1] = q()
>> b = s(a)
>>
>> or by some "closing" concept.
>>
>> b) Whether or not each element of an array is a separate single-assignment
>> variable. If they are, then the code above should work just fine. If they are
>> not, then we have a couple of behaviors we could define. One would be that
>> b=s(a) blocks until all elements in "a" are defined. The other is that we have
>> a way of "closing" (once again). In that case, we have to define what happens
>> if b=s(a) accesses an element that is not defined.
>>
>> Ian.
>>
>> Ben Clifford wrote:
>>
>>> There is a problem that has been called the 'array closing problem'.
>>>
>>> It manifests itself in the tutorial in that certain bits of code that
>>> intuitively can go either in a procedure or in the top level can, in
>>> practice, only go into a procedure.
>>>
>>> In that context, I tried to think about better ways to explain/document the
>>> behaviour than "mumble mumble move that code into a procedure".
>>>
>>> In Swift we claim to have 'single assignment variables'.
>>>
>>> From single assignment variables we get our grid job ordering:
>>>
>>> a = p()
>>> b = s(a)
>>>
>>> causes first grid job p to run, and when that has completed, then grid job s
>>> will run.
>>>
>>> This is the same as if we had written:
>>>
>>> b = s(a)
>>> a = p()
>>>
>>> The ordering comes from the use of a as an 'output' for p and an 'input' for
>>> s, not from source text ordering.
>>>
>>> In that model, it's meaningless to assign two different things to a, like
>>> this:
>>>
>>> a = p()
>>> b = s(a)
>>> a = t()
>>>
>>>
>>> Note that I've omitted the data types from the above. This works in the
>>> implementation for simple types such as a datafile marker type.
>>>
>>> What is important is that each variable is either unassigned or has its
>>> single value - whenever we refer to that variable, we can either use the
>>> value it has, or defer evaluation of that expression until the variable has
>>> its value.
>>>
>>> Now consider arrays. In the present syntax, arrays can be passed as single
>>> (complex) values to/from procedures, like before:
>>>
>>> a = p()
>>> b = s(a)
>>>
>>> Here a and b are array types.
>>>
>>> That's fine. a is assigned to by the first statement, and b is assigned to
>>> by the second statement.
>>>
>>> But we also support a different assignment syntax for arrays, that looks
>>> like this:
>>>
>>> a[0] = p()
>>> a[1] = q()
>>> b = s(a)
>>>
>>> This fails at the moment (specifically, I think the execution engine will
>>> hang).
>>>
>>> Why? Because there is no one point at which we assign a value to 'a' - the
>>> assignment is split over multiple statements, which can be in various places
>>> (and inside loops etc).
>>>
>>> There is nothing in the implementation that detects that a has been assigned
>>> its value.
>>>
>>> So there is this notion in the karajan intermediate code of 'closing an
>>> array'. This is an assertion made in the object code that all assignments
>>> to pieces of an array have been made - that, in effect, the array has its
>>> value.
>>>
>>> The suggested hack/workaround for this is to move the array element
>>> assignments into a procedure:
>>>
>>> (file f[]) z() {
>>>     f[0] = p();
>>>     f[1] = q();
>>> }
>>>
>>> a = z()
>>> b = s(a)
>>>
>>> This works. (which is sort-of a violation of referential transparency)
>>>
>>> It works because Swift implicitly marks arrays returned from compound
>>> procedures as closed (which may or may not be correct).
>>>
>>> So in most variable scopes, arrays behave like single-assignment variables,
>>> but each array can have one specific scope in which members can be assigned
>>> to. In that scope, the array cannot be treated as a whole variable.
>>>
>>> In the z() example above, that special scope is the body of z(). In the
>>> previous example, that scope is the global scope, and the program is invalid
>>> by the rule above that the array cannot be referred to as a whole in the
>>> same place that its members are individually assigned to.
>>>
>>> That's my explanation of what's going on now. I think it matches reality. I
>>> don't like that this is reality, but it is what we have.
>>>
>>> Comments appreciated.
>>>

From wilde at mcs.anl.gov Sat Jun 16 12:03:26 2007
From: wilde at mcs.anl.gov (Mike Wilde)
Date: Sat, 16 Jun 2007 12:03:26 -0500
Subject: [Swift-devel] CPU usage with provider-deef
In-Reply-To: <467408D5.2020709@mcs.anl.gov>
References: <1181984306.10455.3.camel@blabla.mcs.anl.gov> <4673F0E9.3060505@cs.uchicago.edu> <4673F548.5070608@cs.uchicago.edu> <4673FB19.6070305@cs.uchicago.edu> <1182006166.11495.1.camel@blabla.mcs.anl.gov> <467408D5.2020709@mcs.anl.gov>
Message-ID: <467417DE.50408@mcs.anl.gov>

This should be fun, and a nice break from the I2U2 work that you've been immersed in, Mihael. Want to do a read-through soon, and send out comments for discussion that can turn into a list of code improvements to bugzilize?

What I think is important about Falkon is that it's working, it's proving out the value of the provisioned direct-scheduling approach with numbers, and it's working for Ioan as a vehicle for his research.
What we want to get from the effort is a) Ioan progresses towards his PhD; b) the immediate needs of our app-users get met; and c) we learn whats needed in architecture, protocol and algorithm for a successful long-term approach to running swift programs efficiently. Point is that everyone is open to changes and towards an eventual re-design and re-write. This, Mihael, would be where you can propose, design and implement the ideas you've expressed about implementing provisioned direct-scheduling using Karajan's remote execution mechanisms. - Mike Ian Foster wrote, On 6/16/2007 10:59 AM: > It seems important that Ioan sit down with Mihael and work through the > Falkon code to see where it can be simplified, improved, etc. I am sure > that this will result in problems being identified and fixed that will > otherwise cost us time later. > > Mihael Hategan wrote: >> Yourkit (www.yourkit.com) has free licenses for open source projects for >> their profiler. Point them to a globus web page that has your name, and >> they'll send you the license. Alternatively, there are other profilers >> out there, and I strongly recommend using them on such issues. >> >> Mihael >> >> On Sat, 2007-06-16 at 10:00 -0500, Ioan Raicu wrote: >> >>> Nope, I think this is a different problem, or at least a subset of the >>> problems we were having before. >>> >>> Since we fixed the CPU utilization, and we moved to a bigger box (4 >>> CPUs with 2GB of memory), everything is happening in a timely fashion >>> (a few ms per notification delivery throughout the experiment). Plus, >>> I believe the view is consistent (the same tasks look complete on both >>> ends) between Falkon and Swift, but we are still checking on this as >>> the run was made just last night for the 100 mol run. We'll keep you >>> posted with what we find. >>> >>> Ioan >>> >>> Ben Clifford wrote: >>> >>>> On Sat, 16 Jun 2007, Ioan Raicu wrote: >>>> >>>> >>>> >>>>> having problems with the 100 molecule run in MolDyn. 
Its not clear where the >>>>> problem is, on the surface Falkon looks fine... we are looking into where >>>>> everything breaks to cause Swift to not continue with the workflow to >>>>> completion! >>>>> >>>>> >>>> The same problem that you showed me the other day or different? >>>> >>>> with 'the same problem' being that falkon thinks all the jobs are done; >>>> but that falkon's measure response time for sending completion >>>> notifications gets approximately linearly longer over time and the swift >>>> JVM uses ~100% and doesn't inidicate job completion at all after a certain >>>> period. >>>> >>>> or different symptoms now? >>>> >>>> >>>> >>> -- >>> ============================================ >>> Ioan Raicu >>> Ph.D. Student >>> ============================================ >>> Distributed Systems Laboratory >>> Computer Science Department >>> University of Chicago >>> 1100 E. 58th Street, Ryerson Hall >>> Chicago, IL 60637 >>> ============================================ >>> Email: iraicu at cs.uchicago.edu >>> Web: http://www.cs.uchicago.edu/~iraicu >>> http://dsl.cs.uchicago.edu/ >>> ============================================ >>> ============================================ >>> >> >> > > -- > > Ian Foster, Director, Computation Institute > Argonne National Laboratory & University of Chicago > Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439 > Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637 > Tel: +1 630 252 4619. Web: www.ci.uchicago.edu. > Globus Alliance: www.globus.org. 
> > > ------------------------------------------------------------------------ > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel -- Mike Wilde Computation Institute, University of Chicago Math & Computer Science Division Argonne National Laboratory Argonne, IL 60439 USA tel 630-252-7497 fax 630-252-1997 From wilde at mcs.anl.gov Sat Jun 16 12:14:20 2007 From: wilde at mcs.anl.gov (Mike Wilde) Date: Sat, 16 Jun 2007 12:14:20 -0500 Subject: [Swift-devel] on the semantics of 'array closing' In-Reply-To: <4674096B.4020109@mcs.anl.gov> References: <4672F5E3.7060205@mcs.anl.gov> <4674096B.4020109@mcs.anl.gov> Message-ID: <46741A6C.8080404@mcs.anl.gov> I have to say for the record that I'm ready to concede victory to the functional camp in this discussion. (This is like conceding defeat but with a positive spin. If you can't beat 'em, join 'em ;) I've previously felt that functional programming would be too hard to sell to our user base. But clearly Ben, Mihael and Ian are all in the f-camp. As long as we take it all the way, and work through all our existing docs, tutorials and application codes to make sure that the functional way of expressing things has clean, elegant semantics, is clean to write and efficient to reliably implement, I think we're on a good path. The first criterion of a successful programming tool is that its implementers love it and use it effectively. If they do, the user community is likely to follow along, grow and be successful. And as long as we meet that criterion I am happy that we are on the right track. So let's function-away, do it right and make it work. Let's try to keep the syntax close to its current C-like form to make the language "look" more palatable to the imperative hordes. I.e. where you *can* express things in a C-like form, do so, for comfort and readability. 
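For reference, the map-style form proposed in the quoted thread below can be sketched roughly as follows. This is only an illustration in Python, not SwiftScript and not how Swift or Karajan implement it; `p`, `s` and `foreach_map` are hypothetical stand-ins, and a thread pool plays the part of the grid. The point it tries to show is that when the whole output array is produced by one expression, the array is assigned exactly once and needs no separate 'closing' step.

```python
from concurrent.futures import ThreadPoolExecutor


def p(x):
    # Hypothetical stand-in for an application procedure run per element.
    return x * 2


def s(values):
    # Hypothetical stand-in for a procedure that consumes the array as a whole.
    return sum(values)


def foreach_map(fn, input_array, pool):
    """Build the whole output array in a single expression.

    Every element is computed independently (possibly in parallel), and the
    list is only returned once all elements have their values. The caller
    therefore sees one single-assignment value, never a half-filled array.
    """
    futures = [pool.submit(fn, element) for element in input_array]
    return [future.result() for future in futures]


with ThreadPoolExecutor(max_workers=4) as pool:
    input_array = [1, 2, 3]
    # Analogue of: output_array = foreach i in input_array { return p(i); }
    output_array = foreach_map(p, input_array, pool)
    # s() can safely take the array as a whole, because it is complete.
    b = s(output_array)
```

The element-wise form (a[0] = ..., a[1] = ...) has no such single point of assignment, which is why it needs either a declared size or an explicit close.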
Ben, Mihael: you have the green light to move the language in that direction. Any objections - speak now! - Mike Ian Foster wrote, On 6/16/2007 11:01 AM: > I like the notion of having a "map" function. If that could entirely > replace the current element assignments, that would be a wonderful > simplification, it seems to me. > > Ian. > > Ben Clifford wrote: >> There's a different approach, which is to asay that 'a' is a variable and >> can be assigned to once. Thus assignemnt syntax like a[0]=something >> becomes illegal and we need more functional language constructs. So >> instead of writing: >> >> for e,i in input_array { >> output_array[i] = p(e); >> } >> >> we would write: >> >> output_array = foreach i in input_array { >> return p(i); >> } >> >> (its a haskell map in different syntax!) >> >> That means that, at the language level, output_array is now properly >> single assignment. >> >> >> On Fri, 15 Jun 2007, Ian Foster wrote: >> >> >>> Hi, >>> >>> For: >>> >>> a[0] = p() >>> a[1] = q() >>> b = s(a) >>> >>> I think there are two distinct issues. >>> >>> a) Determining the size of the array. This could presumably be done by >>> declaring it, e.g.: >>> >>> a[2] or some similar notion >>> a[0] = p() >>> a[1] = q() >>> b = s(a) >>> >>> or by some "closing" concept. >>> >>> b) Whether or not each element of an array is a separate single-assignment >>> variable. If they are, then the code above should work just fine. If they are >>> not, then we have a couple of behaviors we could define. One would be that >>> b=s(a) blocks until all elements in "a" are defined. The other is that we have >>> a way of "closing" (once again). In that case, we have to define what happens >>> if b=s(a) accesses an element that is not defined. >>> >>> Ian. >>> >>> Ben Clifford wrote: >>> >>>> There is a problem that has been called the 'array closing problem'. 
>>>> >>>> It manifests itself in the tutorial in that certain bits of code that >>>> intuitively can either in a procedure or in the top level can, in practice, >>>> only go in to a procedure. >>>> >>>> In that context, I tried to think about better ways to explain/document the >>>> behaviour than "mumble mumble move that code into a procedure". >>>> >>>> In Swift we claim to have 'single assignment variables'. >>>> >>>> >From single assignment variables we get our grid job ordering: >>>> >>>> a = p() >>>> b = s(a) >>>> >>>> causes first grid job p to run, and when that has completed, then grid job s >>>> will run. >>>> >>>> This is the same as if we had written: >>>> >>>> b = s(a) >>>> a = p() >>>> >>>> The ordering comes from the use of a as an 'output' for p and an 'input' for >>>> s, not from source text ordering. >>>> >>>> In that model, its meaningless to assign two different things ta a, like >>>> this: >>>> >>>> a = p() >>>> b = s(a) >>>> a = t() >>>> >>>> >>>> Note that I've omitted the data types from the above. This works in the >>>> implementation for simple types such as a datafile marker type. >>>> >>>> What is important is that each variable is either unassigned or has its >>>> single value - whenever we refer to that variable, we can either use the >>>> value it has, or defer evaluation of that expression until the variable has >>>> its value. >>>> >>>> Now consider arrays. In the present syntax, arrays can be passed as single >>>> (complex) values to/from procedures, like before: >>>> >>>> a = p() >>>> b = s(a) >>>> >>>> Here a and b are array types. >>>> >>>> That's fine. a is assigned to by the first statement, and b is assigned to >>>> by the second statement. >>>> >>>> But we also support a different assignment syntax for arrays, that looks >>>> like this: >>>> >>>> a[0] = p() >>>> a[1] = q() >>>> b = s(a) >>>> >>>> This fails at the moment (specifically, I think the execution engine will >>>> hang). >>>> >>>> Why? 
Because the is no one point at which we assign a value to 'a' - the >>>> assignment is split over multiple statements, which can be in various places >>>> (and inside loops etc). >>>> >>>> There is nothing in the implementation that detects that a has been assigned >>>> its value. >>>> >>>> So there is this notion in the karajan intermediate code of 'closing an >>>> array'. This is an assertion made in the object code that all assignments >>>> to pieces of an array have been made - that, in affect, the array has its >>>> value. >>>> >>>> The suggested hack/workaround for this is to move the array element >>>> assignments into a procedure: >>>> >>>> (file f[]) z() { >>>> f[0] = p(); >>>> f[1] - q(); >>>> } >>>> >>>> a = z() >>>> b = s(a) >>>> >>>> This works. (which is sort-of a violation of referential transparency) >>>> >>>> It works because Swift implicitly marks arrays returned from compound >>>> procedures as closed (which may or may not be correct). >>>> >>>> So in most variable scopes, arrays behave like single-assignment variables, >>>> but each array can have one specific scope in which members can be assigned >>>> to. In that scope, the array cannot be treated as a whole variable. >>>> >>>> In the z() example above, that special scope is the body of z(). In the >>>> previous example, that scope is the global scope, and the program is invalid >>>> by the rule above that the array cannot be referred to as a whole in the >>>> same place that its members are individually assigned to. >>>> >>>> That's my explanation of what's going on now. I think it matches reality. I >>>> don't like that this is reality, but it is what we have. >>>> >>>> Comments appreciated. >>>> >>>> >>>> >>> >> >> > > -- > > Ian Foster, Director, Computation Institute > Argonne National Laboratory & University of Chicago > Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439 > Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637 > Tel: +1 630 252 4619. Web: www.ci.uchicago.edu. 
> Globus Alliance: www.globus.org. > > > ------------------------------------------------------------------------ > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel -- Mike Wilde Computation Institute, University of Chicago Math & Computer Science Division Argonne National Laboratory Argonne, IL 60439 USA tel 630-252-7497 fax 630-252-1997 From hategan at mcs.anl.gov Sat Jun 16 13:29:47 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sat, 16 Jun 2007 21:29:47 +0300 Subject: [Swift-devel] CPU usage with provider-deef In-Reply-To: <467417DE.50408@mcs.anl.gov> References: <1181984306.10455.3.camel@blabla.mcs.anl.gov> <4673F0E9.3060505@cs.uchicago.edu> <4673F548.5070608@cs.uchicago.edu> <4673FB19.6070305@cs.uchicago.edu> <1182006166.11495.1.camel@blabla.mcs.anl.gov> <467408D5.2020709@mcs.anl.gov> <467417DE.50408@mcs.anl.gov> Message-ID: <1182018587.12013.14.camel@blabla.mcs.anl.gov> On Sat, 2007-06-16 at 12:03 -0500, Mike Wilde wrote: > This should be fun, and a nice break from the I2U2 work that you've > been immersed in, Mihael. I've already looked at the Falkon code and it's... a lot of code that does stuff that I understand only in principle. What you want isn't easy, and I have my reservations towards the amount of fun it involves. That being said, Ioan, would it be possible to have a cleaned up version of the code where there are no duplicate classes? It's hard for me to figure what's relevant or not in that case. And perhaps dead code/comments removed? Mihael > > Want to do a read-through soon, and send out comments for discussion > that can turn into a list of code improvements to bugzilize? > > What I think is important about Falkon is that its working, its > proving out the value of the provisioned direct-scheduling approach > with numbers, and that its working for Ioan as a vehicle for his > research. 
> > What we want to get from the effort is a) Ioan progresses towards > his PhD; b) the immediate needs of our app-users get met; and c) we > learn whats needed in architecture, protocol and algorithm for a > successful long-term approach to running swift programs efficiently. > > Point is that everyone is open to changes and towards an eventual > re-design and re-write. This, Mihael, would be where you can > propose, design and implement the ideas you've expressed about > implementing provisioned direct-scheduling using Karajan's remote > execution mechanisms. > > - Mike > > > > > Ian Foster wrote, On 6/16/2007 10:59 AM: > > It seems important that Ioan sit down with Mihael and work through the > > Falkon code to see where it can be simplified, improved, etc. I am sure > > that this will result in problems being identified and fixed that will > > otherwise cost us time later. > > > > Mihael Hategan wrote: > >> Yourkit (www.yourkit.com) has free licenses for open source projects for > >> their profiler. Point them to a globus web page that has your name, and > >> they'll send you the license. Alternatively, there are other profilers > >> out there, and I strongly recommend using them on such issues. > >> > >> Mihael > >> > >> On Sat, 2007-06-16 at 10:00 -0500, Ioan Raicu wrote: > >> > >>> Nope, I think this is a different problem, or at least a subset of the > >>> problems we were having before. > >>> > >>> Since we fixed the CPU utilization, and we moved to a bigger box (4 > >>> CPUs with 2GB of memory), everything is happening in a timely fashion > >>> (a few ms per notification delivery throughout the experiment). Plus, > >>> I believe the view is consistent (the same tasks look complete on both > >>> ends) between Falkon and Swift, but we are still checking on this as > >>> the run was made just last night for the 100 mol run. We'll keep you > >>> posted with what we find. 
> >>> > >>> Ioan > >>> > >>> Ben Clifford wrote: > >>> > >>>> On Sat, 16 Jun 2007, Ioan Raicu wrote: > >>>> > >>>> > >>>> > >>>>> having problems with the 100 molecule run in MolDyn. Its not clear where the > >>>>> problem is, on the surface Falkon looks fine... we are looking into where > >>>>> everything breaks to cause Swift to not continue with the workflow to > >>>>> completion! > >>>>> > >>>>> > >>>> The same problem that you showed me the other day or different? > >>>> > >>>> with 'the same problem' being that falkon thinks all the jobs are done; > >>>> but that falkon's measure response time for sending completion > >>>> notifications gets approximately linearly longer over time and the swift > >>>> JVM uses ~100% and doesn't inidicate job completion at all after a certain > >>>> period. > >>>> > >>>> or different symptoms now? > >>>> > >>>> > >>>> > >>> -- > >>> ============================================ > >>> Ioan Raicu > >>> Ph.D. Student > >>> ============================================ > >>> Distributed Systems Laboratory > >>> Computer Science Department > >>> University of Chicago > >>> 1100 E. 58th Street, Ryerson Hall > >>> Chicago, IL 60637 > >>> ============================================ > >>> Email: iraicu at cs.uchicago.edu > >>> Web: http://www.cs.uchicago.edu/~iraicu > >>> http://dsl.cs.uchicago.edu/ > >>> ============================================ > >>> ============================================ > >>> > >> > >> > > > > -- > > > > Ian Foster, Director, Computation Institute > > Argonne National Laboratory & University of Chicago > > Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439 > > Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637 > > Tel: +1 630 252 4619. Web: www.ci.uchicago.edu. > > Globus Alliance: www.globus.org. 
> > > > > > ------------------------------------------------------------------------ > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From benc at hawaga.org.uk Sat Jun 16 14:36:17 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Sat, 16 Jun 2007 19:36:17 +0000 (GMT) Subject: [Swift-devel] CPU usage with provider-deef In-Reply-To: <1182018587.12013.14.camel@blabla.mcs.anl.gov> References: <1181984306.10455.3.camel@blabla.mcs.anl.gov> <4673F0E9.3060505@cs.uchicago.edu> <4673F548.5070608@cs.uchicago.edu> <4673FB19.6070305@cs.uchicago.edu> <1182006166.11495.1.camel@blabla.mcs.anl.gov> <467408D5.2020709@mcs.anl.gov> <467417DE.50408@mcs.anl.gov> <1182018587.12013.14.camel@blabla.mcs.anl.gov> Message-ID: On Sat, 16 Jun 2007, Mihael Hategan wrote: > On Sat, 2007-06-16 at 12:03 -0500, Mike Wilde wrote: > > This should be fun, and a nice break from the I2U2 work that you've > > been immersed in, Mihael. > I have my reservations towards the amount of fun it involves. Right, taking prototypes and turning them into production isn't necessarily fun - in fact, a lot of the fun already happened with the making of the prototype and the rest is somewhat drudgery. (To an extent that's the same situation I2U2 cosmic was/is in.) 
-- From iraicu at cs.uchicago.edu Sat Jun 16 14:47:02 2007 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Sat, 16 Jun 2007 14:47:02 -0500 Subject: [Swift-devel] CPU usage with provider-deef In-Reply-To: <1182018587.12013.14.camel@blabla.mcs.anl.gov> References: <1181984306.10455.3.camel@blabla.mcs.anl.gov> <4673F0E9.3060505@cs.uchicago.edu> <4673F548.5070608@cs.uchicago.edu> <4673FB19.6070305@cs.uchicago.edu> <1182006166.11495.1.camel@blabla.mcs.anl.gov> <467408D5.2020709@mcs.anl.gov> <467417DE.50408@mcs.anl.gov> <1182018587.12013.14.camel@blabla.mcs.anl.gov> Message-ID: <46743E36.9040901@cs.uchicago.edu> Yes, I know I need to clean up the code, and remove unused (dead) code. Can this wait for the next version I am working on, so I don't do this clean-up twice? The version that is out there in testing currently is v0.8. My development version is v0.9. I have been distracted lately from finishing up v0.9, but it's not far from being complete. Mihael, when do you get back in town? If this is something more urgent, then perhaps I can get you a cleaned-up version of v0.8 in the coming week. Ioan Mihael Hategan wrote: > On Sat, 2007-06-16 at 12:03 -0500, Mike Wilde wrote: > >> This should be fun, and a nice break from the I2U2 work that you've >> been immersed in, Mihael. >> > > I've already looked at the Falkon code and it's... a lot of code that > does stuff that I understand only in principle. What you want isn't > easy, and I have my reservations towards the amount of fun it involves. > > That being said, Ioan, would it be possible to have a cleaned up version > of the code where there are no duplicate classes? It's hard for me to > figure what's relevant or not in that case. And perhaps dead > code/comments removed? > > Mihael > > >> Want to do a read-through soon, and send out comments for discussion >> that can turn into a list of code improvements to bugzilize? 
>> >> What I think is important about Falkon is that its working, its >> proving out the value of the provisioned direct-scheduling approach >> with numbers, and that its working for Ioan as a vehicle for his >> research. >> >> What we want to get from the effort is a) Ioan progresses towards >> his PhD; b) the immediate needs of our app-users get met; and c) we >> learn whats needed in architecture, protocol and algorithm for a >> successful long-term approach to running swift programs efficiently. >> >> Point is that everyone is open to changes and towards an eventual >> re-design and re-write. This, Mihael, would be where you can >> propose, design and implement the ideas you've expressed about >> implementing provisioned direct-scheduling using Karajan's remote >> execution mechanisms. >> >> - Mike >> >> >> >> >> Ian Foster wrote, On 6/16/2007 10:59 AM: >> >>> It seems important that Ioan sit down with Mihael and work through the >>> Falkon code to see where it can be simplified, improved, etc. I am sure >>> that this will result in problems being identified and fixed that will >>> otherwise cost us time later. >>> >>> Mihael Hategan wrote: >>> >>>> Yourkit (www.yourkit.com) has free licenses for open source projects for >>>> their profiler. Point them to a globus web page that has your name, and >>>> they'll send you the license. Alternatively, there are other profilers >>>> out there, and I strongly recommend using them on such issues. >>>> >>>> Mihael >>>> >>>> On Sat, 2007-06-16 at 10:00 -0500, Ioan Raicu wrote: >>>> >>>> >>>>> Nope, I think this is a different problem, or at least a subset of the >>>>> problems we were having before. >>>>> >>>>> Since we fixed the CPU utilization, and we moved to a bigger box (4 >>>>> CPUs with 2GB of memory), everything is happening in a timely fashion >>>>> (a few ms per notification delivery throughout the experiment). 
Plus, >>>>> I believe the view is consistent (the same tasks look complete on both >>>>> ends) between Falkon and Swift, but we are still checking on this as >>>>> the run was made just last night for the 100 mol run. We'll keep you >>>>> posted with what we find. >>>>> >>>>> Ioan >>>>> >>>>> Ben Clifford wrote: >>>>> >>>>> >>>>>> On Sat, 16 Jun 2007, Ioan Raicu wrote: >>>>>> >>>>>> >>>>>> >>>>>> >>>>>>> having problems with the 100 molecule run in MolDyn. Its not clear where the >>>>>>> problem is, on the surface Falkon looks fine... we are looking into where >>>>>>> everything breaks to cause Swift to not continue with the workflow to >>>>>>> completion! >>>>>>> >>>>>>> >>>>>>> >>>>>> The same problem that you showed me the other day or different? >>>>>> >>>>>> with 'the same problem' being that falkon thinks all the jobs are done; >>>>>> but that falkon's measure response time for sending completion >>>>>> notifications gets approximately linearly longer over time and the swift >>>>>> JVM uses ~100% and doesn't inidicate job completion at all after a certain >>>>>> period. >>>>>> >>>>>> or different symptoms now? >>>>>> >>>>>> >>>>>> >>>>>> >>>>> -- >>>>> ============================================ >>>>> Ioan Raicu >>>>> Ph.D. Student >>>>> ============================================ >>>>> Distributed Systems Laboratory >>>>> Computer Science Department >>>>> University of Chicago >>>>> 1100 E. 58th Street, Ryerson Hall >>>>> Chicago, IL 60637 >>>>> ============================================ >>>>> Email: iraicu at cs.uchicago.edu >>>>> Web: http://www.cs.uchicago.edu/~iraicu >>>>> http://dsl.cs.uchicago.edu/ >>>>> ============================================ >>>>> ============================================ >>>>> >>>>> >>>> >>>> >>> -- >>> >>> Ian Foster, Director, Computation Institute >>> Argonne National Laboratory & University of Chicago >>> Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439 >>> Chicago: Rm 405, 5640 S. 
Ellis Ave, Chicago, IL 60637 >>> Tel: +1 630 252 4619. Web: www.ci.uchicago.edu. >>> Globus Alliance: www.globus.org. >>> >>> >>> ------------------------------------------------------------------------ >>> >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>> > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > -- ============================================ Ioan Raicu Ph.D. Student ============================================ Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 ============================================ Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dsl.cs.uchicago.edu/ ============================================ ============================================ -------------- next part -------------- An HTML attachment was scrubbed... URL: From iraicu at cs.uchicago.edu Sat Jun 16 14:49:00 2007 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Sat, 16 Jun 2007 14:49:00 -0500 Subject: [Swift-devel] CPU usage with provider-deef In-Reply-To: References: <1181984306.10455.3.camel@blabla.mcs.anl.gov> <4673F0E9.3060505@cs.uchicago.edu> <4673F548.5070608@cs.uchicago.edu> <4673FB19.6070305@cs.uchicago.edu> <1182006166.11495.1.camel@blabla.mcs.anl.gov> <467408D5.2020709@mcs.anl.gov> <467417DE.50408@mcs.anl.gov> <1182018587.12013.14.camel@blabla.mcs.anl.gov> Message-ID: <46743EAC.2080400@cs.uchicago.edu> Although, there should still be fun left to have :), as new features/protocols/extensions could be on the horizon. 
Ioan Ben Clifford wrote: > On Sat, 16 Jun 2007, Mihael Hategan wrote: > > >> On Sat, 2007-06-16 at 12:03 -0500, Mike Wilde wrote: >> >>> This should be fun, and a nice break from the I2U2 work that you've >>> been immersed in, Mihael. >>> > > >> I have my reservations towards the amount of fun it involves. >> > > Right, taking prototypes and turning them into production isn't > necessarily fun - in fact, a lot of the fun already happened with the > making of the prototype and the rest is some what drugery. (to an extent > that's the same situation i2u2 cosmic was/is in). > > -- ============================================ Ioan Raicu Ph.D. Student ============================================ Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 ============================================ Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dsl.cs.uchicago.edu/ ============================================ From itf at mcs.anl.gov Sat Jun 16 14:52:49 2007 From: itf at mcs.anl.gov (Ian Foster) Date: Sat, 16 Jun 2007 19:52:49 +0000 Subject: [Swift-devel] CPU usage with provider-deef In-Reply-To: References: <1181984306.10455.3.camel@blabla.mcs.anl.gov><4673F0E9.3060505@cs.uchicago.edu><4673F548.5070608@cs.uchicago.edu><4673FB19.6070305@cs.uchicago.edu><1182006166.11495.1.camel@blabla.mcs.anl.gov><467408D5.2020709@mcs.anl.gov> <467417DE.50408@mcs.anl.gov><1182018587.12013.14.camel@blabla.mcs.anl.gov> Message-ID: <537273469-1182023623-cardhu_decombobulator_blackberry.rim.net-728646234-@bxe006.bisx.prod.on.blackberry> I wasn't suggesting (at least in the first instance) that Mihael take the prototype and turn it into production, but that Mihael and Ioan sit down together and do a code walkthrough. 
I think that this would likely identify bugs and opportunities for simplification. Ian Sent via BlackBerry from T-Mobile -----Original Message----- From: Ben Clifford Date: Sat, 16 Jun 2007 19:36:17 To:Mihael Hategan Cc:swift-devel at ci.uchicago.edu Subject: Re: [Swift-devel] CPU usage with provider-deef On Sat, 16 Jun 2007, Mihael Hategan wrote: > On Sat, 2007-06-16 at 12:03 -0500, Mike Wilde wrote: > > This should be fun, and a nice break from the I2U2 work that you've > > been immersed in, Mihael. > I have my reservations towards the amount of fun it involves. Right, taking prototypes and turning them into production isn't necessarily fun - in fact, a lot of the fun already happened with the making of the prototype and the rest is some what drugery. (to an extent that's the same situation i2u2 cosmic was/is in). -- _______________________________________________ Swift-devel mailing list Swift-devel at ci.uchicago.edu http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From iraicu at cs.uchicago.edu Sat Jun 16 14:57:42 2007 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Sat, 16 Jun 2007 14:57:42 -0500 Subject: [Swift-devel] CPU usage with provider-deef In-Reply-To: <1182006166.11495.1.camel@blabla.mcs.anl.gov> References: <1181984306.10455.3.camel@blabla.mcs.anl.gov> <4673F0E9.3060505@cs.uchicago.edu> <4673F548.5070608@cs.uchicago.edu> <4673FB19.6070305@cs.uchicago.edu> <1182006166.11495.1.camel@blabla.mcs.anl.gov> Message-ID: <467440B6.1060706@cs.uchicago.edu> Hey, This looks really nice, I'll give it a try! Ioan Mihael Hategan wrote: > Yourkit (www.yourkit.com) has free licenses for open source projects for > their profiler. Point them to a globus web page that has your name, and > they'll send you the license. Alternatively, there are other profilers > out there, and I strongly recommend using them on such issues. 
> > Mihael > > On Sat, 2007-06-16 at 10:00 -0500, Ioan Raicu wrote: > >> Nope, I think this is a different problem, or at least a subset of the >> problems we were having before. >> >> Since we fixed the CPU utilization, and we moved to a bigger box (4 >> CPUs with 2GB of memory), everything is happening in a timely fashion >> (a few ms per notification delivery throughout the experiment). Plus, >> I believe the view is consistent (the same tasks look complete on both >> ends) between Falkon and Swift, but we are still checking on this as >> the run was made just last night for the 100 mol run. We'll keep you >> posted with what we find. >> >> Ioan >> >> Ben Clifford wrote: >> >>> On Sat, 16 Jun 2007, Ioan Raicu wrote: >>> >>> >>> >>>> having problems with the 100 molecule run in MolDyn. Its not clear where the >>>> problem is, on the surface Falkon looks fine... we are looking into where >>>> everything breaks to cause Swift to not continue with the workflow to >>>> completion! >>>> >>>> >>> The same problem that you showed me the other day or different? >>> >>> with 'the same problem' being that falkon thinks all the jobs are done; >>> but that falkon's measure response time for sending completion >>> notifications gets approximately linearly longer over time and the swift >>> JVM uses ~100% and doesn't inidicate job completion at all after a certain >>> period. >>> >>> or different symptoms now? >>> >>> >>> >> -- >> ============================================ >> Ioan Raicu >> Ph.D. Student >> ============================================ >> Distributed Systems Laboratory >> Computer Science Department >> University of Chicago >> 1100 E. 
58th Street, Ryerson Hall >> Chicago, IL 60637 >> ============================================ >> Email: iraicu at cs.uchicago.edu >> Web: http://www.cs.uchicago.edu/~iraicu >> http://dsl.cs.uchicago.edu/ >> ============================================ >> ============================================ >> > > > -- ============================================ Ioan Raicu Ph.D. Student ============================================ Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 ============================================ Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dsl.cs.uchicago.edu/ ============================================ ============================================ -------------- next part -------------- An HTML attachment was scrubbed... URL: From benc at hawaga.org.uk Sat Jun 16 15:01:48 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Sat, 16 Jun 2007 20:01:48 +0000 (GMT) Subject: [Swift-devel] CPU usage with provider-deef In-Reply-To: <46743E36.9040901@cs.uchicago.edu> References: <1181984306.10455.3.camel@blabla.mcs.anl.gov> <4673F0E9.3060505@cs.uchicago.edu> <4673F548.5070608@cs.uchicago.edu> <4673FB19.6070305@cs.uchicago.edu> <1182006166.11495.1.camel@blabla.mcs.anl.gov> <467408D5.2020709@mcs.anl.gov> <467417DE.50408@mcs.anl.gov> <1182018587.12013.14.camel@blabla.mcs.anl.gov> <46743E36.9040901@cs.uchicago.edu> Message-ID: At some point you need to switch to developing in SVN so that others can play along. Freezing your development, sending me a tarball, waiting for me to import to SVN, and then unfreezing your development and continuing from an SVN checkout is approximately the easiest I can make it for you. On Sat, 16 Jun 2007, Ioan Raicu wrote: > Yes, I know I need to clean up the code, and remove unused (dead) code. Can > this wait for the next version I am working on, so I don't do this clean-up > twice? 
The version that is out there in testing currently is v0.8. My > development version is v0.9. I have been distracted lately from finishing up > v0.9, but its not far from being complete. Mihael, when do you get back in > town? > If this is something more urgent, then perhaps I can get you a clean-up > version of v0.8 in the coming week. > > Ioan > > Mihael Hategan wrote: > > On Sat, 2007-06-16 at 12:03 -0500, Mike Wilde wrote: > > > > > This should be fun, and a nice break from the I2U2 work that you've been > > > immersed in, Mihael. > > > > > > > I've already looked at the Falkon code and it's... a lot of code that > > does stuff that I understand only in principle. What you want isn't > > easy, and I have my reservations towards the amount of fun it involves. > > > > That being said, Ioan, would it be possible to have a cleaned up version > > of the code where there are no duplicate classes? It's hard for me to > > figure what's relevant or not in that case. And perhaps dead > > code/comments removed? > > > > Mihael > > > > > > > Want to do a read-through soon, and send out comments for discussion that > > > can turn into a list of code improvements to bugzilize? > > > > > > What I think is important about Falkon is that its working, its proving > > > out the value of the provisioned direct-scheduling approach with numbers, > > > and that its working for Ioan as a vehicle for his research. > > > > > > What we want to get from the effort is a) Ioan progresses towards his PhD; > > > b) the immediate needs of our app-users get met; and c) we learn whats > > > needed in architecture, protocol and algorithm for a successful long-term > > > approach to running swift programs efficiently. > > > > > > Point is that everyone is open to changes and towards an eventual > > > re-design and re-write. 
This, Mihael, would be where you can propose, > > > design and implement the ideas you've expressed about implementing > > > provisioned direct-scheduling using Karajan's remote execution mechanisms. > > > > > > - Mike > > > > > > > > > > > > > > > Ian Foster wrote, On 6/16/2007 10:59 AM: > > > > > > > It seems important that Ioan sit down with Mihael and work through the > > > > Falkon code to see where it can be simplified, improved, etc. I am sure > > > > that this will result in problems being identified and fixed that will > > > > otherwise cost us time later. > > > > > > > > Mihael Hategan wrote: > > > > > > > > > Yourkit (www.yourkit.com) has free licenses for open source projects > > > > > for > > > > > their profiler. Point them to a globus web page that has your name, > > > > > and > > > > > they'll send you the license. Alternatively, there are other profilers > > > > > out there, and I strongly recommend using them on such issues. > > > > > > > > > > Mihael > > > > > > > > > > On Sat, 2007-06-16 at 10:00 -0500, Ioan Raicu wrote: > > > > > > > > > > > Nope, I think this is a different problem, or at least a subset of > > > > > > the > > > > > > problems we were having before. > > > > > > > > > > > > Since we fixed the CPU utilization, and we moved to a bigger box (4 > > > > > > CPUs with 2GB of memory), everything is happening in a timely > > > > > > fashion > > > > > > (a few ms per notification delivery throughout the experiment). > > > > > > Plus, > > > > > > I believe the view is consistent (the same tasks look complete on > > > > > > both > > > > > > ends) between Falkon and Swift, but we are still checking on this as > > > > > > the run was made just last night for the 100 mol run. We'll keep > > > > > > you > > > > > > posted with what we find. 
> > > > > > > > > > > > Ioan > > > > > > > > > > > > Ben Clifford wrote: > > > > > > > On Sat, 16 Jun 2007, Ioan Raicu wrote: > > > > > > > > > > > > > > > > > > > > > > having problems with the 100 molecule run in MolDyn. Its not > > > > > > > > clear where the > > > > > > > > problem is, on the surface Falkon looks fine... we are looking > > > > > > > > into where > > > > > > > > everything breaks to cause Swift to not continue with the > > > > > > > > workflow to > > > > > > > > completion! > > > > > > > > > > > > > > > The same problem that you showed me the other day or different? > > > > > > > > > > > > > > with 'the same problem' being that falkon thinks all the jobs are > > > > > > > done; but that falkon's measure response time for sending > > > > > > > completion notifications gets approximately linearly longer over > > > > > > > time and the swift JVM uses ~100% and doesn't inidicate job > > > > > > > completion at all after a certain period. > > > > > > > > > > > > > > or different symptoms now? > > > > > > > > > > > > > > > > > > > > -- > > > > > > ============================================ > > > > > > Ioan Raicu > > > > > > Ph.D. Student > > > > > > ============================================ > > > > > > Distributed Systems Laboratory > > > > > > Computer Science Department > > > > > > University of Chicago > > > > > > 1100 E. 58th Street, Ryerson Hall > > > > > > Chicago, IL 60637 > > > > > > ============================================ > > > > > > Email: iraicu at cs.uchicago.edu > > > > > > Web: http://www.cs.uchicago.edu/~iraicu > > > > > > http://dsl.cs.uchicago.edu/ > > > > > > ============================================ > > > > > > ============================================ > > > > > > > > > > > > > > > -- > > > > > > > > Ian Foster, Director, Computation Institute > > > > Argonne National Laboratory & University of Chicago > > > > Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439 > > > > Chicago: Rm 405, 5640 S. 
Ellis Ave, Chicago, IL 60637 > > > > Tel: +1 630 252 4619. Web: www.ci.uchicago.edu. > > > > Globus Alliance: www.globus.org. > > > > > > > > > > > > ------------------------------------------------------------------------ > > > > > > > > _______________________________________________ > > > > Swift-devel mailing list > > > > Swift-devel at ci.uchicago.edu > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > From iraicu at cs.uchicago.edu Sat Jun 16 15:06:49 2007 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Sat, 16 Jun 2007 15:06:49 -0500 Subject: [Swift-devel] CPU usage with provider-deef In-Reply-To: References: <1181984306.10455.3.camel@blabla.mcs.anl.gov> <4673F0E9.3060505@cs.uchicago.edu> <4673F548.5070608@cs.uchicago.edu> <4673FB19.6070305@cs.uchicago.edu> <1182006166.11495.1.camel@blabla.mcs.anl.gov> <467408D5.2020709@mcs.anl.gov> <467417DE.50408@mcs.anl.gov> <1182018587.12013.14.camel@blabla.mcs.anl.gov> <46743E36.9040901@cs.uchicago.edu> Message-ID: <467442D9.9030608@cs.uchicago.edu> Right, I know! It sounds trivial :), but I just haven't had the time to think about it... I think the latest Falkon code is pretty solid, but I would like to get to the bottom of why Nika's MolDyn 100 molecule run isn't working... Unless there is a pressing need to look at the very latest code, let's do this next week (hopefully after we have Nika's app running)! Ioan Ben Clifford wrote: > At some point you need to switch to developing in SVN so that others can > play along. Freezing your development, sending me a tarball, waiting for > me to import to SVN, and then unfreezing your development and continuing > from an SVN checkout is approximately the easiest I can make it for you. 
From hategan at mcs.anl.gov Sat Jun 16 15:05:46 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sat, 16 Jun 2007 23:05:46 +0300 Subject: [Swift-devel] CPU usage with provider-deef In-Reply-To: <46743E36.9040901@cs.uchicago.edu> References: <1181984306.10455.3.camel@blabla.mcs.anl.gov> <4673F0E9.3060505@cs.uchicago.edu> <4673F548.5070608@cs.uchicago.edu> <4673FB19.6070305@cs.uchicago.edu> <1182006166.11495.1.camel@blabla.mcs.anl.gov> <467408D5.2020709@mcs.anl.gov> <467417DE.50408@mcs.anl.gov> <1182018587.12013.14.camel@blabla.mcs.anl.gov> <46743E36.9040901@cs.uchicago.edu> Message-ID: <1182024346.12401.0.camel@blabla.mcs.anl.gov> On Sat, 2007-06-16 at 14:47 -0500, Ioan Raicu wrote: > Yes, I know I need to clean up the code, and remove unused (dead) > code. Can this wait for the next version I am working on, so I don't > do this clean-up twice? The version that is out there in testing > currently is v0.8.
My development version is v0.9. I have been > distracted lately from finishing up v0.9, but its not far from being > complete. Mihael, when do you get back in town? I should be in Chicago on 23rd. Mihael
From iraicu at cs.uchicago.edu Sat Jun 16 15:16:32 2007 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Sat, 16 Jun 2007 15:16:32 -0500 Subject: [Swift-devel] CPU usage with provider-deef In-Reply-To: <1182024346.12401.0.camel@blabla.mcs.anl.gov> References: <1181984306.10455.3.camel@blabla.mcs.anl.gov> <4673F0E9.3060505@cs.uchicago.edu> <4673F548.5070608@cs.uchicago.edu> <4673FB19.6070305@cs.uchicago.edu> <1182006166.11495.1.camel@blabla.mcs.anl.gov> <467408D5.2020709@mcs.anl.gov> <467417DE.50408@mcs.anl.gov> <1182018587.12013.14.camel@blabla.mcs.anl.gov> <46743E36.9040901@cs.uchicago.edu> <1182024346.12401.0.camel@blabla.mcs.anl.gov> Message-ID: <46744520.2010202@cs.uchicago.edu> Great!
Then I'll plan to have a clean-up version in SVN by then :) Ioan
From hategan at mcs.anl.gov Sat Jun 16 15:15:04 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sat, 16 Jun 2007 23:15:04 +0300 Subject: [Swift-devel] CPU usage with provider-deef In-Reply-To: <537273469-1182023623-cardhu_decombobulator_blackberry.rim.net-728646234-@bxe006.bisx.prod.on.blackberry> References: <1181984306.10455.3.camel@blabla.mcs.anl.gov> <4673F0E9.3060505@cs.uchicago.edu> <4673F548.5070608@cs.uchicago.edu> <4673FB19.6070305@cs.uchicago.edu> <1182006166.11495.1.camel@blabla.mcs.anl.gov><467408D5.2020709@mcs.anl.gov> <467417DE.50408@mcs.anl.gov><1182018587.12013.14.camel@blabla.mcs.anl.gov> <537273469-1182023623-cardhu_decombobulator_blackberry.rim.net-728646234-@bxe006.bisx.prod.on.blackberry> Message-ID: <1182024904.12401.11.camel@blabla.mcs.anl.gov> On Sat, 2007-06-16 at 19:52 +0000, Ian Foster wrote: > I wasn't suggesting (at least in the first instance) that Mihael take the prototype and turn it into production, but that Mihael and Ioan sit down together and do a code walkthrough. I think that this would likely identify bugs and opportunities for simplification. It's somewhat on the same level of fun :). From the experience I've accumulated so far, design is hard. Understanding prototype design is probably even harder (not only do you need to understand the problem, you also need to understand why many non-obvious things are done the way they are done).
Mihael > > Ian > > > > Sent via BlackBerry from T-Mobile > > -----Original Message----- > From: Ben Clifford > > Date: Sat, 16 Jun 2007 19:36:17 > To:Mihael Hategan > Cc:swift-devel at ci.uchicago.edu > Subject: Re: [Swift-devel] CPU usage with provider-deef > > > > On Sat, 16 Jun 2007, Mihael Hategan wrote: > > > On Sat, 2007-06-16 at 12:03 -0500, Mike Wilde wrote: > > > This should be fun, and a nice break from the I2U2 work that you've > > > been immersed in, Mihael. > > > I have my reservations towards the amount of fun it involves. > > Right, taking prototypes and turning them into production isn't > necessarily fun - in fact, a lot of the fun already happened with the > making of the prototype and the rest is some what drugery. (to an extent > that's the same situation i2u2 cosmic was/is in). >
From wilde at mcs.anl.gov Sat Jun 16 17:55:59 2007 From: wilde at mcs.anl.gov (Mike Wilde) Date: Sat, 16 Jun 2007 17:55:59 -0500 Subject: [Swift-devel] CPU usage with provider-deef In-Reply-To: <1182024904.12401.11.camel@blabla.mcs.anl.gov> References: <1181984306.10455.3.camel@blabla.mcs.anl.gov> <4673F0E9.3060505@cs.uchicago.edu> <4673F548.5070608@cs.uchicago.edu> <4673FB19.6070305@cs.uchicago.edu> <1182006166.11495.1.camel@blabla.mcs.anl.gov><467408D5.2020709@mcs.anl.gov> <467417DE.50408@mcs.anl.gov><1182018587.12013.14.camel@blabla.mcs.anl.gov> <537273469-1182023623-cardhu_decombobulator_blackberry.rim.net-728646234-@bxe006.bisx.prod.on.blackberry> <1182024904.12401.11.camel@blabla.mcs.anl.gov> Message-ID: <46746A7F.7040004@mcs.anl.gov> I think a nice clean message sequence chart describing Falkon's various activities would be very useful, as it's the backbone of its logic. I tried to create this for the SC paper by asking Ioan to describe the protocol to me, but I did not succeed. I think this would be a very useful description to maintain, as a UML sequence chart, and that, Ioan, this would be a very important part of your thesis or of future papers.
It's up to you and Ian to weigh whether this would be valuable to your research. I think it's invaluable for design, review and debugging. - Mike -- Mike Wilde Computation Institute, University of Chicago Math & Computer Science Division Argonne National Laboratory Argonne, IL 60439 USA tel 630-252-7497 fax 630-252-1997
From wilde at mcs.anl.gov Sat Jun 16 18:09:17 2007 From: wilde at mcs.anl.gov (Mike Wilde) Date: Sat, 16 Jun 2007 18:09:17 -0500 Subject: [Swift-devel] CPU usage with provider-deef In-Reply-To: References: <1181984306.10455.3.camel@blabla.mcs.anl.gov> <4673F0E9.3060505@cs.uchicago.edu> <4673F548.5070608@cs.uchicago.edu> <4673FB19.6070305@cs.uchicago.edu> <1182006166.11495.1.camel@blabla.mcs.anl.gov> <467408D5.2020709@mcs.anl.gov> <467417DE.50408@mcs.anl.gov> <1182018587.12013.14.camel@blabla.mcs.anl.gov> Message-ID: <46746D9D.8020206@mcs.anl.gov> Sorry, I still believe this can be fun. You can tell me afterwards if I was right or not. I think from Ioan's work we can gain some understanding of the problem, of architecture issues, performance potential, and of protocol and reliability issues. I feel it can still be fun to design a new production-quality system from scratch, and to prototype and implement that. Mihael, I think you had a vision of how this could be done elegantly using Karajan mechanisms as powerful building blocks to build on, to provide the communication and messaging / remote execution fabric. Ben Clifford wrote, On 6/16/2007 2:36 PM: > On Sat, 16 Jun 2007, Mihael Hategan wrote: > >> On Sat, 2007-06-16 at 12:03 -0500, Mike Wilde wrote: >>> This should be fun, and a nice break from the I2U2 work that you've >>> been immersed in, Mihael. > >> I have my reservations towards the amount of fun it involves. > > Right, taking prototypes and turning them into production isn't > necessarily fun - in fact, a lot of the fun already happened with the > making of the prototype and the rest is some what drugery.
(to an extent > that's the same situation i2u2 cosmic was/is in). I hope it's not the case that the only part of programming that's fun is prototyping. If you spend your day in PowerPoint, Word and email, anything related to code can look like fun... :) Mike > -- Mike Wilde Computation Institute, University of Chicago Math & Computer Science Division Argonne National Laboratory Argonne, IL 60439 USA tel 630-252-7497 fax 630-252-1997 From benc at hawaga.org.uk Sat Jun 16 18:15:02 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Sat, 16 Jun 2007 23:15:02 +0000 (GMT) Subject: [Swift-devel] CPU usage with provider-deef In-Reply-To: <46746A7F.7040004@mcs.anl.gov> References: <1181984306.10455.3.camel@blabla.mcs.anl.gov> <4673F0E9.3060505@cs.uchicago.edu> <4673F548.5070608@cs.uchicago.edu> <4673FB19.6070305@cs.uchicago.edu> <1182006166.11495.1.camel@blabla.mcs.anl.gov><467408D5.2020709@mcs.anl.gov> <467417DE.50408@mcs.anl.gov><1182018587.12013.14.camel@blabla.mcs.anl.gov> <537273469-1182023623-cardhu_decombobulator_blackberry.rim.net-728646234-@bxe006.bisx.prod.on.blackberry> <1182024904.12401.11.camel@blabla.mcs.anl.gov> <46746A7F.7040004@mcs.anl.gov> Message-ID: The WSDL should describe part of the web services bit of the protocol. That might be a good place to start. The WSDL should already describe the messages that go over the wire in something vaguely readable to a human. What would probably still be needed is the extra information saying in which order the messages are sent. On Sat, 16 Jun 2007, Mike Wilde wrote: > I think a nice clean message sequence chart describing Falkon's various > activities would be very useful, as its the backbone of its logic. > > I tried to create this for the SC paper by asking Ioan to describe the > protocol to me, but I dod not succeed. > > I think this would be a very useful description to maintain, as a UML sequence > chart, and that Ioan this would be a very important part of your thesis or of > future papers. 
> > Its up to you and Ian to weigh whether this would be valuable to your > research. I think its invaluable for design, review and debugging. > > - Mike > > > > > Mihael Hategan wrote, On 6/16/2007 3:15 PM: > > On Sat, 2007-06-16 at 19:52 +0000, Ian Foster wrote: > > > I wasn't suggesting (at least in the first instance) that Mihael take the > > > prototype and turn it into production, but that Mihael and Ioan sit down > > > together and do a code walkthrough. I think that this would likely > > > identify bugs and opportunities for simplification. > > > > It's somewhat on the same in level of fun :). From the experience I've > > accumulated so far, design is hard. Understanding prototype design is > > probably even harder (not only do you need to understand the problem, > > you also need to understand why many non-obvious things are done the way > > they are done). > > > > Mihael > > > > > Ian > > > > > > > > > > > > Sent via BlackBerry from T-Mobile > > > > > > -----Original Message----- > > > From: Ben Clifford > > > > > > Date: Sat, 16 Jun 2007 19:36:17 To:Mihael Hategan > > > Cc:swift-devel at ci.uchicago.edu > > > Subject: Re: [Swift-devel] CPU usage with provider-deef > > > > > > > > > > > > On Sat, 16 Jun 2007, Mihael Hategan wrote: > > > > > > > On Sat, 2007-06-16 at 12:03 -0500, Mike Wilde wrote: > > > > > This should be fun, and a nice break from the I2U2 work that you've > > > > > been immersed in, Mihael. > > > > I have my reservations towards the amount of fun it involves. > > > Right, taking prototypes and turning them into production isn't > > > necessarily fun - in fact, a lot of the fun already happened with the > > > making of the prototype and the rest is some what drugery. (to an extent > > > that's the same situation i2u2 cosmic was/is in). 
> > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > From hategan at mcs.anl.gov Sun Jun 17 03:47:38 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 17 Jun 2007 11:47:38 +0300 Subject: [Swift-devel] CPU usage with provider-deef In-Reply-To: <46746D9D.8020206@mcs.anl.gov> References: <1181984306.10455.3.camel@blabla.mcs.anl.gov> <4673F0E9.3060505@cs.uchicago.edu> <4673F548.5070608@cs.uchicago.edu> <4673FB19.6070305@cs.uchicago.edu> <1182006166.11495.1.camel@blabla.mcs.anl.gov> <467408D5.2020709@mcs.anl.gov> <467417DE.50408@mcs.anl.gov> <1182018587.12013.14.camel@blabla.mcs.anl.gov> <46746D9D.8020206@mcs.anl.gov> Message-ID: <1182070058.12861.4.camel@blabla.mcs.anl.gov> On Sat, 2007-06-16 at 18:09 -0500, Mike Wilde wrote: > sorry, I still believe this can be fun. You can tell me afterwards > if I was right or not. Fun it is then. Mihael > > I think from Ioan's work we can gain some understanding of the > problem, of architecture issues, performance potential, and of > protocol and reliability issues. > > I feel it can still be fun to design a new production-quality system > from scratch, and to prototype and implement that. > > Mihael, I think you had a vision of how this could be done elegantly > using Karajan mechanisms as powerful building blocks to build on, to > provide the communication and messaging / remote execution fabric. > > Ben Clifford wrote, On 6/16/2007 2:36 PM: > > On Sat, 16 Jun 2007, Mihael Hategan wrote: > > > >> On Sat, 2007-06-16 at 12:03 -0500, Mike Wilde wrote: > >>> This should be fun, and a nice break from the I2U2 work that you've > >>> been immersed in, Mihael. > > > >> I have my reservations towards the amount of fun it involves. 
> > > > Right, taking prototypes and turning them into production isn't > > necessarily fun - in fact, a lot of the fun already happened with the > > making of the prototype and the rest is some what drugery. (to an extent > > that's the same situation i2u2 cosmic was/is in). > > I hope its not the case that the only part of programming thats fun > is prototyping. > > If you spend your day in powerpoint, word and email, anything > related to code can look like fun... > > :) Mike > > > > > From benc at hawaga.org.uk Sun Jun 17 13:31:56 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Sun, 17 Jun 2007 18:31:56 +0000 (GMT) Subject: [Swift-devel] XML/infix hybrid Message-ID: While playing with the language parser/compiler today I got annoyed with the XML/infix hybrid expression syntax in the intermediate language, so I basically spent my day changing it to use entirely XML syntax - the intermediate form will thus look a little more Karajan-like and a little less C-like. -- From iraicu at cs.uchicago.edu Mon Jun 18 16:37:48 2007 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Mon, 18 Jun 2007 16:37:48 -0500 Subject: [Swift-devel] CPU usage with provider-deef In-Reply-To: References: <1181984306.10455.3.camel@blabla.mcs.anl.gov> <4673F0E9.3060505@cs.uchicago.edu> <4673F548.5070608@cs.uchicago.edu> <4673FB19.6070305@cs.uchicago.edu> <1182006166.11495.1.camel@blabla.mcs.anl.gov><467408D5.2020709@mcs.anl.gov> <467417DE.50408@mcs.anl.gov><1182018587.12013.14.camel@blabla.mcs.anl.gov> <537273469-1182023623-cardhu_decombobulator_blackberry.rim.net-728646234-@bxe006.bisx.prod.on.blackberry> <1182024904.12401.11.camel@blabla.mcs.anl.gov> <46746A7F.7040004@mcs.anl.gov> Message-ID: <4676FB2C.9010004@cs.uchicago.edu> There is a diagram in the SC paper that shows the flow of messages and what each one does; these messages also map to the WSDL of the Falkon service. 
Mihael, Ben, or whoever else wants to start digging through the Falkon code, maybe a meeting might be good to go over the organization of the code, the message flow diagram, configuration options, etc... I would prefer a meeting over drafting up more documents, I think it would be more time effective for me for now. Ioan Ben Clifford wrote: > The WSDL should describe part of the web services bit of the protocol. > That might be a good place to start. The WSDL should already describe the > messages that go over the wire in something vaguely readable to a human. > Probably what would be needed would be the extra info to say which order > messages are sent. > > On Sat, 16 Jun 2007, Mike Wilde wrote: > > >> I think a nice clean message sequence chart describing Falkon's various >> activities would be very useful, as its the backbone of its logic. >> >> I tried to create this for the SC paper by asking Ioan to describe the >> protocol to me, but I dod not succeed. >> >> I think this would be a very useful description to maintain, as a UML sequence >> chart, and that Ioan this would be a very important part of your thesis or of >> future papers. >> >> Its up to you and Ian to weigh whether this would be valuable to your >> research. I think its invaluable for design, review and debugging. >> >> - Mike >> >> >> >> >> Mihael Hategan wrote, On 6/16/2007 3:15 PM: >> >>> On Sat, 2007-06-16 at 19:52 +0000, Ian Foster wrote: >>> >>>> I wasn't suggesting (at least in the first instance) that Mihael take the >>>> prototype and turn it into production, but that Mihael and Ioan sit down >>>> together and do a code walkthrough. I think that this would likely >>>> identify bugs and opportunities for simplification. >>>> >>> It's somewhat on the same in level of fun :). From the experience I've >>> accumulated so far, design is hard. 
Understanding prototype design is >>> probably even harder (not only do you need to understand the problem, >>> you also need to understand why many non-obvious things are done the way >>> they are done). >>> >>> Mihael >>> >>> >>>> Ian >>>> >>>> >>>> >>>> Sent via BlackBerry from T-Mobile >>>> >>>> -----Original Message----- >>>> From: Ben Clifford >>>> >>>> Date: Sat, 16 Jun 2007 19:36:17 To:Mihael Hategan >>>> Cc:swift-devel at ci.uchicago.edu >>>> Subject: Re: [Swift-devel] CPU usage with provider-deef >>>> >>>> >>>> >>>> On Sat, 16 Jun 2007, Mihael Hategan wrote: >>>> >>>> >>>>> On Sat, 2007-06-16 at 12:03 -0500, Mike Wilde wrote: >>>>> >>>>>> This should be fun, and a nice break from the I2U2 work that you've >>>>>> been immersed in, Mihael. >>>>>> >>>>> I have my reservations towards the amount of fun it involves. >>>>> >>>> Right, taking prototypes and turning them into production isn't >>>> necessarily fun - in fact, a lot of the fun already happened with the >>>> making of the prototype and the rest is some what drugery. (to an extent >>>> that's the same situation i2u2 cosmic was/is in). >>>> >>>> >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>> >>> >>> >> > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > -- ============================================ Ioan Raicu Ph.D. Student ============================================ Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 
58th Street, Ryerson Hall Chicago, IL 60637 ============================================ Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dsl.cs.uchicago.edu/ ============================================ ============================================ -------------- next part -------------- An HTML attachment was scrubbed... URL: From foster at mcs.anl.gov Mon Jun 18 21:54:21 2007 From: foster at mcs.anl.gov (Ian Foster) Date: Mon, 18 Jun 2007 21:54:21 -0500 Subject: [Swift-devel] on the semantics of 'array closing' In-Reply-To: <4674096B.4020109@mcs.anl.gov> References: <4672F5E3.7060205@mcs.anl.gov> <4674096B.4020109@mcs.anl.gov> Message-ID: <4677455D.1070909@mcs.anl.gov> Interesting issue, as raised by Mike: it seems that we want to define a "map" function that can specify various "conditions" on the map, e.g., the length of time to wait for results, what to do if not all results are returned, etc. I wonder if others have done that before? Ian Foster wrote: > I like the notion of having a "map" function. If that could entirely > replace the current element assignments, that would be a wonderful > simplification, it seems to me. > > Ian. > > Ben Clifford wrote: >> There's a different approach, which is to asay that 'a' is a variable and >> can be assigned to once. Thus assignemnt syntax like a[0]=something >> becomes illegal and we need more functional language constructs. So >> instead of writing: >> >> for e,i in input_array { >> output_array[i] = p(e); >> } >> >> we would write: >> >> output_array = foreach i in input_array { >> return p(i); >> } >> >> (its a haskell map in different syntax!) >> >> That means that, at the language level, output_array is now properly >> single assignment. >> >> >> On Fri, 15 Jun 2007, Ian Foster wrote: >> >> >>> Hi, >>> >>> For: >>> >>> a[0] = p() >>> a[1] = q() >>> b = s(a) >>> >>> I think there are two distinct issues. >>> >>> a) Determining the size of the array. 
This could presumably be done by >>> declaring it, e.g.: >>> >>> a[2] or some similar notion >>> a[0] = p() >>> a[1] = q() >>> b = s(a) >>> >>> or by some "closing" concept. >>> >>> b) Whether or not each element of an array is a separate single-assignment >>> variable. If they are, then the code above should work just fine. If they are >>> not, then we have a couple of behaviors we could define. One would be that >>> b=s(a) blocks until all elements in "a" are defined. The other is that we have >>> a way of "closing" (once again). In that case, we have to define what happens >>> if b=s(a) accesses an element that is not defined. >>> >>> Ian. >>> >>> Ben Clifford wrote: >>> >>>> There is a problem that has been called the 'array closing problem'. >>>> >>>> It manifests itself in the tutorial in that certain bits of code that >>>> intuitively can either in a procedure or in the top level can, in practice, >>>> only go in to a procedure. >>>> >>>> In that context, I tried to think about better ways to explain/document the >>>> behaviour than "mumble mumble move that code into a procedure". >>>> >>>> In Swift we claim to have 'single assignment variables'. >>>> >>>> >From single assignment variables we get our grid job ordering: >>>> >>>> a = p() >>>> b = s(a) >>>> >>>> causes first grid job p to run, and when that has completed, then grid job s >>>> will run. >>>> >>>> This is the same as if we had written: >>>> >>>> b = s(a) >>>> a = p() >>>> >>>> The ordering comes from the use of a as an 'output' for p and an 'input' for >>>> s, not from source text ordering. >>>> >>>> In that model, its meaningless to assign two different things ta a, like >>>> this: >>>> >>>> a = p() >>>> b = s(a) >>>> a = t() >>>> >>>> >>>> Note that I've omitted the data types from the above. This works in the >>>> implementation for simple types such as a datafile marker type. 
>>>> >>>> What is important is that each variable is either unassigned or has its >>>> single value - whenever we refer to that variable, we can either use the >>>> value it has, or defer evaluation of that expression until the variable has >>>> its value. >>>> >>>> Now consider arrays. In the present syntax, arrays can be passed as single >>>> (complex) values to/from procedures, like before: >>>> >>>> a = p() >>>> b = s(a) >>>> >>>> Here a and b are array types. >>>> >>>> That's fine. a is assigned to by the first statement, and b is assigned to >>>> by the second statement. >>>> >>>> But we also support a different assignment syntax for arrays, that looks >>>> like this: >>>> >>>> a[0] = p() >>>> a[1] = q() >>>> b = s(a) >>>> >>>> This fails at the moment (specifically, I think the execution engine will >>>> hang). >>>> >>>> Why? Because the is no one point at which we assign a value to 'a' - the >>>> assignment is split over multiple statements, which can be in various places >>>> (and inside loops etc). >>>> >>>> There is nothing in the implementation that detects that a has been assigned >>>> its value. >>>> >>>> So there is this notion in the karajan intermediate code of 'closing an >>>> array'. This is an assertion made in the object code that all assignments >>>> to pieces of an array have been made - that, in affect, the array has its >>>> value. >>>> >>>> The suggested hack/workaround for this is to move the array element >>>> assignments into a procedure: >>>> >>>> (file f[]) z() { >>>> f[0] = p(); >>>> f[1] - q(); >>>> } >>>> >>>> a = z() >>>> b = s(a) >>>> >>>> This works. (which is sort-of a violation of referential transparency) >>>> >>>> It works because Swift implicitly marks arrays returned from compound >>>> procedures as closed (which may or may not be correct). >>>> >>>> So in most variable scopes, arrays behave like single-assignment variables, >>>> but each array can have one specific scope in which members can be assigned >>>> to. 
In that scope, the array cannot be treated as a whole variable. >>>> >>>> In the z() example above, that special scope is the body of z(). In the >>>> previous example, that scope is the global scope, and the program is invalid >>>> by the rule above that the array cannot be referred to as a whole in the >>>> same place that its members are individually assigned to. >>>> >>>> That's my explanation of what's going on now. I think it matches reality. I >>>> don't like that this is reality, but it is what we have. >>>> >>>> Comments appreciated. >>>> >>>> >>>> >>> >> >> > > -- > > Ian Foster, Director, Computation Institute > Argonne National Laboratory & University of Chicago > Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439 > Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637 > Tel: +1 630 252 4619. Web: www.ci.uchicago.edu. > Globus Alliance: www.globus.org. > > ------------------------------------------------------------------------ > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > -- Ian Foster, Director, Computation Institute Argonne National Laboratory & University of Chicago Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439 Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637 Tel: +1 630 252 4619. Web: www.ci.uchicago.edu. Globus Alliance: www.globus.org. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From hategan at mcs.anl.gov Tue Jun 19 03:22:51 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 19 Jun 2007 11:22:51 +0300 Subject: [Swift-devel] CPU usage with provider-deef In-Reply-To: <4676FB2C.9010004@cs.uchicago.edu> References: <1181984306.10455.3.camel@blabla.mcs.anl.gov> <4673F0E9.3060505@cs.uchicago.edu> <4673F548.5070608@cs.uchicago.edu> <4673FB19.6070305@cs.uchicago.edu> <1182006166.11495.1.camel@blabla.mcs.anl.gov><467408D5.2020709@mcs.anl.gov> <467417DE.50408@mcs.anl.gov><1182018587.12013.14.camel@blabla.mcs.anl.gov> <537273469-1182023623-cardhu_decombobulator_blackberry.rim.net-728646234-@bxe006.bisx.prod.on.blackberry> <1182024904.12401.11.camel@blabla.mcs.anl.gov> <46746A7F.7040004@mcs.anl.gov> <4676FB2C.9010004@cs.uchicago.edu> Message-ID: <1182241371.17515.2.camel@blabla.mcs.anl.gov> On Mon, 2007-06-18 at 16:37 -0500, Ioan Raicu wrote: > There is a diagram in the SC paper that has the flow of messages, and > what each does; these messages also map to the WSDL of the Falkon > service. Mihael, Ben, or whoever else wants to start digging through > the Falkon code, maybe a meeting might be good to go over the > organization of the code, the message flow diagram, configuration > options, etc... I would prefer a meeting over drafting up more > documents, I think it would be more time effective for me for now. It might be the reverse for us. It's unlikely that faced with a lot of new information, we will consistently retain all of it. Something of reference might be handy. > > Ioan > > Ben Clifford wrote: > > The WSDL should describe part of the web services bit of the protocol. > > That might be a good place to start. The WSDL should already describe the > > messages that go over the wire in something vaguely readable to a human. > > Probably what would be needed would be the extra info to say which order > > messages are sent. 
> > > > On Sat, 16 Jun 2007, Mike Wilde wrote: > > > > > > > I think a nice clean message sequence chart describing Falkon's various > > > activities would be very useful, as its the backbone of its logic. > > > > > > I tried to create this for the SC paper by asking Ioan to describe the > > > protocol to me, but I dod not succeed. > > > > > > I think this would be a very useful description to maintain, as a UML sequence > > > chart, and that Ioan this would be a very important part of your thesis or of > > > future papers. > > > > > > Its up to you and Ian to weigh whether this would be valuable to your > > > research. I think its invaluable for design, review and debugging. > > > > > > - Mike > > > > > > > > > > > > > > > Mihael Hategan wrote, On 6/16/2007 3:15 PM: > > > > > > > On Sat, 2007-06-16 at 19:52 +0000, Ian Foster wrote: > > > > > > > > > I wasn't suggesting (at least in the first instance) that Mihael take the > > > > > prototype and turn it into production, but that Mihael and Ioan sit down > > > > > together and do a code walkthrough. I think that this would likely > > > > > identify bugs and opportunities for simplification. > > > > > > > > > It's somewhat on the same in level of fun :). From the experience I've > > > > accumulated so far, design is hard. Understanding prototype design is > > > > probably even harder (not only do you need to understand the problem, > > > > you also need to understand why many non-obvious things are done the way > > > > they are done). 
> > > > > > > > Mihael > > > > > > > > > > > > > Ian > > > > > > > > > > > > > > > > > > > > Sent via BlackBerry from T-Mobile > > > > > > > > > > -----Original Message----- > > > > > From: Ben Clifford > > > > > > > > > > Date: Sat, 16 Jun 2007 19:36:17 To:Mihael Hategan > > > > > Cc:swift-devel at ci.uchicago.edu > > > > > Subject: Re: [Swift-devel] CPU usage with provider-deef > > > > > > > > > > > > > > > > > > > > On Sat, 16 Jun 2007, Mihael Hategan wrote: > > > > > > > > > > > > > > > > On Sat, 2007-06-16 at 12:03 -0500, Mike Wilde wrote: > > > > > > > > > > > > > This should be fun, and a nice break from the I2U2 work that you've > > > > > > > been immersed in, Mihael. > > > > > > > > > > > > > I have my reservations towards the amount of fun it involves. > > > > > > > > > > > Right, taking prototypes and turning them into production isn't > > > > > necessarily fun - in fact, a lot of the fun already happened with the > > > > > making of the prototype and the rest is some what drugery. (to an extent > > > > > that's the same situation i2u2 cosmic was/is in). > > > > > > > > > > > > > > _______________________________________________ > > > > Swift-devel mailing list > > > > Swift-devel at ci.uchicago.edu > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > -- > ============================================ > Ioan Raicu > Ph.D. Student > ============================================ > Distributed Systems Laboratory > Computer Science Department > University of Chicago > 1100 E. 
58th Street, Ryerson Hall > Chicago, IL 60637 > ============================================ > Email: iraicu at cs.uchicago.edu > Web: http://www.cs.uchicago.edu/~iraicu > http://dsl.cs.uchicago.edu/ > ============================================ > ============================================ > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Tue Jun 19 03:55:30 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 19 Jun 2007 11:55:30 +0300 Subject: [Swift-devel] on the semantics of 'array closing' In-Reply-To: <4677455D.1070909@mcs.anl.gov> References: <4672F5E3.7060205@mcs.anl.gov> <4674096B.4020109@mcs.anl.gov> <4677455D.1070909@mcs.anl.gov> Message-ID: <1182243330.17515.14.camel@blabla.mcs.anl.gov> On Mon, 2007-06-18 at 21:54 -0500, Ian Foster wrote: > Interesting issue, as raised by Mike: it seems that we want to define > a "map" function that can specify various "conditions" on the map, > e.g., the length of time to wait for results, what to do if not all > results are returned, etc. I wonder if others have done that before? There are two types of errors and time-outs: 1. The ones that occur as part of the workflow: - A computation on some data does not return the expected files - The user wants to run multiple algorithms on some data and only select the fastest one or the ones that finish within a specific time. - etc. 2. The ones that occur because of exceptional conditions in the system: - A badly configured site fails to run things - A job sits for a long time in a queue I think the first class can be handled in the language, but the second should not. Occam seems to support such things, by composing various keywords (e.g. PAR ... FOR and then timeouts or error handling). I personally favor composition of smaller dedicated functions/keywords over big, do-it-all functions. 
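To make the composition idea a bit more concrete, here is a rough sketch in Python. It is not Swift, Karajan or Occam syntax, and it does not describe how any of them are implemented. All names (par_map, within, or_default, require_all) are made up for the illustration, and Python threads stand in for grid jobs. The point is only that the "conditions" on a map (how long to wait, what to do about missing results) can be separate small pieces wrapped around a plain parallel map.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def par_map(func, items, max_workers=4):
    """Hypothetical parallel map: start func on every item, return futures."""
    pool = ThreadPoolExecutor(max_workers=max_workers)
    futures = [pool.submit(func, item) for item in items]
    pool.shutdown(wait=False)  # already-submitted work keeps running
    return futures


def within(seconds, futures):
    """Wait at most `seconds` in total; return one (ok, value) pair per future."""
    deadline = time.monotonic() + seconds
    results = []
    for future in futures:
        remaining = max(0.0, deadline - time.monotonic())
        try:
            results.append((True, future.result(timeout=remaining)))
        except FutureTimeout:
            results.append((False, None))  # result did not arrive in time
        except Exception as exc:
            results.append((False, exc))   # the computation itself failed
    return results


def or_default(default, results):
    """Policy 1: replace every missing or failed element with a default."""
    return [value if ok else default for ok, value in results]


def require_all(results):
    """Policy 2: refuse to continue unless every element arrived."""
    if not all(ok for ok, _ in results):
        raise RuntimeError("not all results were returned")
    return [value for _, value in results]


if __name__ == "__main__":
    def slow_square(x):
        # Element 3 is made deliberately slow so that it misses the deadline.
        time.sleep(1.0 if x == 3 else 0.05)
        return x * x

    print(or_default(None, within(0.5, par_map(slow_square, [1, 2, 3, 4]))))
```

Swapping or_default for require_all changes the "what to do if not all results are returned" policy without touching the map or the timeout, which is the kind of composition I have in mind. This only covers the first class of errors above; the second class would still be handled outside the language.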
Mihael 
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel

From wilde at mcs.anl.gov  Tue Jun 19 07:10:13 2007
From: wilde at mcs.anl.gov (Mike Wilde)
Date: Tue, 19 Jun 2007 07:10:13 -0500
Subject: [Swift-devel] CPU usage with provider-deef
In-Reply-To: <1182241371.17515.2.camel@blabla.mcs.anl.gov>
References: <1181984306.10455.3.camel@blabla.mcs.anl.gov> <4673F0E9.3060505@cs.uchicago.edu> <4673F548.5070608@cs.uchicago.edu> <4673FB19.6070305@cs.uchicago.edu> <1182006166.11495.1.camel@blabla.mcs.anl.gov><467408D5.2020709@mcs.anl.gov> <467417DE.50408@mcs.anl.gov><1182018587.12013.14.camel@blabla.mcs.anl.gov> <537273469-1182023623-cardhu_decombobulator_blackberry.rim.net-728646234-@bxe006.bisx.prod.on.blackberry> <1182024904.12401.11.camel@blabla.mcs.anl.gov> <46746A7F.7040004@mcs.anl.gov> <4676FB2C.9010004@cs.uchicago.edu> <1182241371.17515.2.camel@blabla.mcs.anl.gov>
Message-ID: <4677C7A5.6040004@mcs.anl.gov>

Ioan, I feel that documenting the message flow is key to understanding the
system, and that such documentation will be indispensable for you (and the
Swift team) to make progress with Falkon.

If it's the case, as you stated to me, that you don't have the time to fully
support Falkon for end-user use, which I accept, then the only way to enable
the Swift team to provide this support is for us to understand the tool.

The three things you propose to communicate to us in a face-to-face meeting
(organization of the code, the message flow diagram, configuration options)
are exactly the things that need to be documented.

But you and Ian together need to decide and propose how you want to see
Falkon used and how you intend to make it supportable for that purpose.

For Swift's goals, we *know* we need Falkon's capabilities to succeed, and
the question is how much and how long we can use Falkon, and when we would
need to start improving or rewriting it.

- Mike

Mihael Hategan wrote, On 6/19/2007 3:22 AM:
> On Mon, 2007-06-18 at 16:37 -0500, Ioan Raicu wrote:
>> There is a diagram in the SC paper that has the flow of messages, and
>> what each does; these messages also map to the WSDL of the Falkon
>> service. Mihael, Ben, or whoever else wants to start digging through
>> the Falkon code, maybe a meeting might be good to go over the
>> organization of the code, the message flow diagram, configuration
>> options, etc... I would prefer a meeting over drafting up more
>> documents, I think it would be more time effective for me for now.
>
> It might be the reverse for us. It's unlikely that faced with a lot of
> new information, we will consistently retain all of it. Something of
> reference might be handy.
>
>> Ioan
>>
>> Ben Clifford wrote:
>>> The WSDL should describe part of the web services bit of the protocol.
>>> That might be a good place to start. The WSDL should already describe the
>>> messages that go over the wire in something vaguely readable to a human.
>>> Probably what would be needed would be the extra info to say in which order
>>> messages are sent.
>>>
>>> On Sat, 16 Jun 2007, Mike Wilde wrote:
>>>
>>>> I think a nice clean message sequence chart describing Falkon's various
>>>> activities would be very useful, as it's the backbone of its logic.
>>>>
>>>> I tried to create this for the SC paper by asking Ioan to describe the
>>>> protocol to me, but I did not succeed.
>>>>
>>>> I think this would be a very useful description to maintain, as a UML sequence
>>>> chart, and that, Ioan, this would be a very important part of your thesis or of
>>>> future papers.
>>>>
>>>> It's up to you and Ian to weigh whether this would be valuable to your
>>>> research. I think it's invaluable for design, review and debugging.
>>>>
>>>> - Mike
>>>>
>>>> Mihael Hategan wrote, On 6/16/2007 3:15 PM:
>>>>
>>>>> On Sat, 2007-06-16 at 19:52 +0000, Ian Foster wrote:
>>>>>
>>>>>> I wasn't suggesting (at least in the first instance) that Mihael take the
>>>>>> prototype and turn it into production, but that Mihael and Ioan sit down
>>>>>> together and do a code walkthrough. I think that this would likely
>>>>>> identify bugs and opportunities for simplification.
>>>>>>
>>>>> It's somewhat on the same level of fun :). From the experience I've
>>>>> accumulated so far, design is hard. Understanding prototype design is
>>>>> probably even harder (not only do you need to understand the problem,
>>>>> you also need to understand why many non-obvious things are done the way
>>>>> they are done).
>>>>>
>>>>> Mihael
>>>>>
>>>>>> Ian
>>>>>>
>>>>>> Sent via BlackBerry from T-Mobile
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Ben Clifford
>>>>>>
>>>>>> Date: Sat, 16 Jun 2007 19:36:17 To:Mihael Hategan
>>>>>> Cc:swift-devel at ci.uchicago.edu
>>>>>> Subject: Re: [Swift-devel] CPU usage with provider-deef
>>>>>>
>>>>>> On Sat, 16 Jun 2007, Mihael Hategan wrote:
>>>>>>
>>>>>>> On Sat, 2007-06-16 at 12:03 -0500, Mike Wilde wrote:
>>>>>>>
>>>>>>>> This should be fun, and a nice break from the I2U2 work that you've
>>>>>>>> been immersed in, Mihael.
>>>>>>>>
>>>>>>> I have my reservations towards the amount of fun it involves.
>>>>>>>
>>>>>> Right, taking prototypes and turning them into production isn't
>>>>>> necessarily fun - in fact, a lot of the fun already happened with the
>>>>>> making of the prototype and the rest is somewhat drudgery. (to an extent
>>>>>> that's the same situation i2u2 cosmic was/is in).
>>>>>>
>> --
>> ============================================
>> Ioan Raicu
>> Ph.D. Student
>> ============================================
>> Distributed Systems Laboratory
>> Computer Science Department
>> University of Chicago
>> 1100 E. 58th Street, Ryerson Hall
>> Chicago, IL 60637
>> ============================================
>> Email: iraicu at cs.uchicago.edu
>> Web: http://www.cs.uchicago.edu/~iraicu
>> http://dsl.cs.uchicago.edu/
>> ============================================

--
Mike Wilde
Computation Institute, University of Chicago
Math & Computer Science Division
Argonne National Laboratory
Argonne, IL 60439 USA
tel 630-252-7497 fax 630-252-1997

From hategan at mcs.anl.gov  Tue Jun 19 07:55:10 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 19 Jun 2007 15:55:10 +0300
Subject: [Swift-devel] on the semantics of 'array closing'
In-Reply-To: <4677CEA1.8050806@mcs.anl.gov>
References: <4672F5E3.7060205@mcs.anl.gov> <4674096B.4020109@mcs.anl.gov> <4677455D.1070909@mcs.anl.gov> <1182243330.17515.14.camel@blabla.mcs.anl.gov> <4677CEA1.8050806@mcs.anl.gov>
Message-ID: <1182257710.18810.15.camel@blabla.mcs.anl.gov>

On Tue, 2007-06-19 at 07:40 -0500, Mike Wilde wrote:
> This is a good breakdown, but I don't yet see how to distinguish
> between the two situations you lay out here, Mihael.

The distinction is made by thinking about local vs. grid execution. If
something cannot occur during local execution but can occur during grid
execution, we probably want to hide that from the user as much as
possible. Programming against randomly unreliable systems is hard. We,
as long time Grid users and developers, have the unique position of
identifying such problems and dealing with them as we can.

>
> I think we need to turn these cases into slightly more detailed
> examples and then consider how we want the system to respond and
> what the user would need to do in each case to recover/continue.

I thought we've done this before, to a sizable extent, both on the
mailing lists, and face-to-face.

>
> Your cases separate out logical errors (1) from physical ones (2),
> which is, I agree, a useful distinction.
>
> But I wonder in case 2, when you have a program that's been
> productively proceeding, and then encounters a physical error, in
> some cases the program will have processed a given function "long
> enough" to end that function call and proceed ("correctly") with the
> data it has, while in other cases, the program cannot proceed until
> it has generated every single member that a foreach or map function
> was to iterate over.

Right. The former is the m out of n pattern with timeouts. While very
rarely occurring in nature, we may eventually want to support such a
thing. Also, given that reliable performance measurements are hard to
get in a massively multi-user heterogeneous environment, we would need
to find specific ways to express time.

Mihael

>
> In the latter branch of this case, we'd want to be able to restart
> the program, while in the former branch, we'd want to be able to
> ignore certain errors and continue.
>
> - Mike
>
>
> Mihael Hategan wrote, On 6/19/2007 3:55 AM:
> > On Mon, 2007-06-18 at 21:54 -0500, Ian Foster wrote:
> >> Interesting issue, as raised by Mike: it seems that we want to define
> >> a "map" function that can specify various "conditions" on the map,
> >> e.g., the length of time to wait for results, what to do if not all
> >> results are returned, etc. I wonder if others have done that before?
> >
> > There are two types of errors and time-outs:
> > 1. The ones that occur as part of the workflow:
> >    - Computations on some data do not return the expected files
> >    - The user wants to run multiple algorithms on some data and only
> >      select the fastest one or the ones that run in a specific time.
> >    - etc.
> > 2. The ones that occur because of exceptional conditions in the system:
> >    - A badly configured site fails to run things
> >    - A job sits for a long time in a queue
> >
> > I think the first class can be handled in the language, but the second
> > should not.
> >
> > Occam seems to support such things, by composing various keywords (e.g.
> > PAR ... FOR and then timeouts or error handling). I personally favor
> > composition of smaller dedicated functions/keywords to big, do-it-all
> > functions.
> >
> > Mihael
> >
> >> Ian Foster wrote:
> >>> I like the notion of having a "map" function. If that could entirely
> >>> replace the current element assignments, that would be a wonderful
> >>> simplification, it seems to me.
> >>>
> >>> Ian.
> >>>
> >>> Ben Clifford wrote:
> >>>> There's a different approach, which is to say that 'a' is a variable and
> >>>> can be assigned to once. Thus assignment syntax like a[0]=something
> >>>> becomes illegal and we need more functional language constructs. So
> >>>> instead of writing:
> >>>>
> >>>>   for e,i in input_array {
> >>>>     output_array[i] = p(e);
> >>>>   }
> >>>>
> >>>> we would write:
> >>>>
> >>>>   output_array = foreach i in input_array {
> >>>>     return p(i);
> >>>>   }
> >>>>
> >>>> (it's a Haskell map in different syntax!)
> >>>>
> >>>> That means that, at the language level, output_array is now properly
> >>>> single assignment.
> >>>>
> >>>>
> >>>> On Fri, 15 Jun 2007, Ian Foster wrote:
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> For:
> >>>>>
> >>>>>   a[0] = p()
> >>>>>   a[1] = q()
> >>>>>   b = s(a)
> >>>>>
> >>>>> I think there are two distinct issues.
> >>>>>
> >>>>> a) Determining the size of the array. This could presumably be done by
> >>>>> declaring it, e.g.:
> >>>>>
> >>>>>   a[2] or some similar notion
> >>>>>   a[0] = p()
> >>>>>   a[1] = q()
> >>>>>   b = s(a)
> >>>>>
> >>>>> or by some "closing" concept.
> >>>>>
> >>>>> b) Whether or not each element of an array is a separate single-assignment
> >>>>> variable. If they are, then the code above should work just fine. If they are
> >>>>> not, then we have a couple of behaviors we could define. One would be that
> >>>>> b=s(a) blocks until all elements in "a" are defined. The other is that we have
> >>>>> a way of "closing" (once again). In that case, we have to define what happens
> >>>>> if b=s(a) accesses an element that is not defined.
> >>>>>
> >>>>> Ian.
> >>>>>

From hategan at mcs.anl.gov  Tue Jun 19 09:48:31 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 19 Jun 2007 17:48:31 +0300
Subject: [Swift-devel] on the semantics of 'array closing'
In-Reply-To: <4677DD03.6010900@mcs.anl.gov>
References: <4672F5E3.7060205@mcs.anl.gov> <4674096B.4020109@mcs.anl.gov> <4677455D.1070909@mcs.anl.gov> <1182243330.17515.14.camel@blabla.mcs.anl.gov> <4677CEA1.8050806@mcs.anl.gov> <1182257710.18810.15.camel@blabla.mcs.anl.gov> <4677DD03.6010900@mcs.anl.gov>
Message-ID: <1182264511.19005.6.camel@blabla.mcs.anl.gov>

On Tue, 2007-06-19 at 08:41 -0500, Mike Wilde wrote:
> Mihael Hategan wrote, On 6/19/2007 7:55 AM:
> > On Tue, 2007-06-19 at 07:40 -0500, Mike Wilde wrote:
> >> This is a good breakdown, but I don't yet see how to distinguish
> >> between the two situations you lay out here, Mihael.
> >
> > The distinction is made by thinking about local vs. grid execution. If
> > something cannot occur during local execution but can occur during grid
> > execution, we probably want to hide that from the user as much as
> > possible. Programming against randomly unreliable systems is hard. We,
> > as long time Grid users and developers, have the unique position of
> > identifying such problems and dealing with them as we can.
> >
> >> I think we need to turn these cases into slightly more detailed
> >> examples and then consider how we want the system to respond and
> >> what the user would need to do in each case to recover/continue.
> >
> > I thought we've done this before, to a sizable extent, both on the
> > mailing lists, and face-to-face.
>
> You are probably right, with respect to the exception handling
> discussion. We need to move such discussions into proposed language
> spec doc changes so that we can turn them into decisions and
> campaigns to implement them.
>
> The mailing list is the right place to discuss, but then we need to
> summarize that discussion into a consensus.
>
> Also, I think there's two new aspects proposed here - of a foreach()
> or map() reaching a threshold of "enough results" that it can be
> called done,

With sparse arrays, it would be sufficient to not have the results from
the iterations that fail. If we had try/catch constructs, that would
simply translate into:

  foreach k,v in V {
    try {
      results[k] = process(V[k]);
    }
    catch (*) {}
  }

thus alleviating the need for a complex foreach construct.

> and of a streaming model of computation where a DAG
> acts like a pipeline even though it was specified as a set of
> function applications.

I'm not really sure what that refers to.

>
> Are you and Ben at a point where you could gather these issues
> (map(), error handling, thresholds and streaming) and turn them into
> proposed language improvements?

We would need to chat some, and, of course, have the necessary time.
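
As a rough illustration of the try/catch-per-iteration idea above, here is a minimal sketch in plain Python rather than SwiftScript or Karajan. Everything in it is an assumption made up for the example: `process` stands in for a grid job, `sparse_map` is a hypothetical helper, and a dict plays the role of a sparse array in which failed iterations simply leave no entry.

```python
# Hypothetical illustration only: plain Python, not SwiftScript or Karajan.
# `process`, `sparse_map` and `V` are made-up names for this sketch.

def process(value):
    """Pretend grid job that fails for negative inputs."""
    if value < 0:
        raise ValueError(f"job failed for input {value}")
    return value * 2


def sparse_map(func, inputs):
    """Apply func to each value of inputs (a dict), keeping only successes.

    Mirrors the shape of:
        foreach k,v in V { try { results[k] = process(V[k]); } catch (*) {} }
    A failed iteration leaves its key absent from the result.
    """
    results = {}
    for key, value in inputs.items():
        try:
            results[key] = func(value)
        except Exception:
            # Swallow the error; the key stays missing in the sparse result.
            pass
    return results


V = {0: 1, 1: -5, 2: 10}
results = sparse_map(process, V)
print(results)  # only keys 0 and 2 are present
```

A consumer of `results` then has to cope with missing keys, which is the trade-off the sparse-array approach makes in exchange for not needing a more complex foreach construct.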
Mihael > > If so, you should, and if not, we should decide if we need more > discussion on the list. > > Its likely that an attempt to do such a language update right now > would lead directly to more discussion, as writing language spec > tends to expose more unresolved issues. > > - Mike > > > > >> Your cases separate out logical errors (1) from physical ones (2) > >> which is I agree a useful distinction. > >> > >> But I wonder in case 2, when you have a program thats been > >> productively proceeding, and then encounters a physical error, in > >> some cases the program will have processed a give function "long > >> enough" to end that function call and proceed ("correctly") with the > >> data it has, while in other cases, the program can not proceed until > >> it has generated every single member that a foreach or map function > >> was to iterate over. > > > > Right. The former is the m out of n pattern with timeouts. While very > > rarely occurring in nature, we may eventually want to support such a > > thing. Also, given that reliable performance measurements are hard to > > get in a massively multi-user heterogeneous environment, we would need > > to find specific ways to express time. > > > > Mihael > > > >> In the latter branch of this case, we'd want to be able to restart > >> the program, while in the former branch, we'd want to be able to > >> ignore certain errors and continue. > >> > >> - Mike > >> > >> > >> Mihael Hategan wrote, On 6/19/2007 3:55 AM: > >>> On Mon, 2007-06-18 at 21:54 -0500, Ian Foster wrote: > >>>> Interesting issue, as raised by Mike: it seems that we want to define > >>>> a "map" function that can specify various "conditions" on the map, > >>>> e.g., the length of time to wait for results, what to do if not all > >>>> results are returned, etc. I wonder if others have done that before? > >>> There are two types of errors and time-outs: > >>> 1. 
The ones that occur as part of the workflow: > >>> - Computations on some data does not return the expected files > >>> - The user wants to run multiple algorithms on some data and only > >>> select the fastest one or the ones that run in a specific time. > >>> - etc. > >>> 2. The ones that occur because of exceptional conditions in the system: > >>> - A badly configured site fails to run things > >>> - A job sits for a long time in a queue > >>> > >>> I think the first class can be handled in the language, but the second > >>> should not. > >>> > >>> Occam seems to support such things, by composing various keywords (e.g. > >>> PAR ... FOR and then timeouts or error handling). I personally favor > >>> composition of smaller dedicated functions/keywords to big, do-it-all > >>> functions. > >>> > >>> Mihael > >>> > >>>> Ian Foster wrote: > >>>>> I like the notion of having a "map" function. If that could entirely > >>>>> replace the current element assignments, that would be a wonderful > >>>>> simplification, it seems to me. > >>>>> > >>>>> Ian. > >>>>> > >>>>> Ben Clifford wrote: > >>>>>> There's a different approach, which is to asay that 'a' is a variable and > >>>>>> can be assigned to once. Thus assignemnt syntax like a[0]=something > >>>>>> becomes illegal and we need more functional language constructs. So > >>>>>> instead of writing: > >>>>>> > >>>>>> for e,i in input_array { > >>>>>> output_array[i] = p(e); > >>>>>> } > >>>>>> > >>>>>> we would write: > >>>>>> > >>>>>> output_array = foreach i in input_array { > >>>>>> return p(i); > >>>>>> } > >>>>>> > >>>>>> (its a haskell map in different syntax!) > >>>>>> > >>>>>> That means that, at the language level, output_array is now properly > >>>>>> single assignment. 
> >>>>>> > >>>>>> > >>>>>> On Fri, 15 Jun 2007, Ian Foster wrote: > >>>>>> > >>>>>> > >>>>>>> Hi, > >>>>>>> > >>>>>>> For: > >>>>>>> > >>>>>>> a[0] = p() > >>>>>>> a[1] = q() > >>>>>>> b = s(a) > >>>>>>> > >>>>>>> I think there are two distinct issues. > >>>>>>> > >>>>>>> a) Determining the size of the array. This could presumably be done by > >>>>>>> declaring it, e.g.: > >>>>>>> > >>>>>>> a[2] or some similar notion > >>>>>>> a[0] = p() > >>>>>>> a[1] = q() > >>>>>>> b = s(a) > >>>>>>> > >>>>>>> or by some "closing" concept. > >>>>>>> > >>>>>>> b) Whether or not each element of an array is a separate single-assignment > >>>>>>> variable. If they are, then the code above should work just fine. If they are > >>>>>>> not, then we have a couple of behaviors we could define. One would be that > >>>>>>> b=s(a) blocks until all elements in "a" are defined. The other is that we have > >>>>>>> a way of "closing" (once again). In that case, we have to define what happens > >>>>>>> if b=s(a) accesses an element that is not defined. > >>>>>>> > >>>>>>> Ian. > >>>>>>> > >>>>>>> Ben Clifford wrote: > >>>>>>> > >>>>>>>> There is a problem that has been called the 'array closing problem'. > >>>>>>>> > >>>>>>>> It manifests itself in the tutorial in that certain bits of code that > >>>>>>>> intuitively can either in a procedure or in the top level can, in practice, > >>>>>>>> only go in to a procedure. > >>>>>>>> > >>>>>>>> In that context, I tried to think about better ways to explain/document the > >>>>>>>> behaviour than "mumble mumble move that code into a procedure". > >>>>>>>> > >>>>>>>> In Swift we claim to have 'single assignment variables'. > >>>>>>>> > >>>>>>>> >From single assignment variables we get our grid job ordering: > >>>>>>>> > >>>>>>>> a = p() > >>>>>>>> b = s(a) > >>>>>>>> > >>>>>>>> causes first grid job p to run, and when that has completed, then grid job s > >>>>>>>> will run. 
> >>>>>>>> > >>>>>>>> This is the same as if we had written: > >>>>>>>> > >>>>>>>> b = s(a) > >>>>>>>> a = p() > >>>>>>>> > >>>>>>>> The ordering comes from the use of a as an 'output' for p and an 'input' for > >>>>>>>> s, not from source text ordering. > >>>>>>>> > >>>>>>>> In that model, its meaningless to assign two different things ta a, like > >>>>>>>> this: > >>>>>>>> > >>>>>>>> a = p() > >>>>>>>> b = s(a) > >>>>>>>> a = t() > >>>>>>>> > >>>>>>>> > >>>>>>>> Note that I've omitted the data types from the above. This works in the > >>>>>>>> implementation for simple types such as a datafile marker type. > >>>>>>>> > >>>>>>>> What is important is that each variable is either unassigned or has its > >>>>>>>> single value - whenever we refer to that variable, we can either use the > >>>>>>>> value it has, or defer evaluation of that expression until the variable has > >>>>>>>> its value. > >>>>>>>> > >>>>>>>> Now consider arrays. In the present syntax, arrays can be passed as single > >>>>>>>> (complex) values to/from procedures, like before: > >>>>>>>> > >>>>>>>> a = p() > >>>>>>>> b = s(a) > >>>>>>>> > >>>>>>>> Here a and b are array types. > >>>>>>>> > >>>>>>>> That's fine. a is assigned to by the first statement, and b is assigned to > >>>>>>>> by the second statement. > >>>>>>>> > >>>>>>>> But we also support a different assignment syntax for arrays, that looks > >>>>>>>> like this: > >>>>>>>> > >>>>>>>> a[0] = p() > >>>>>>>> a[1] = q() > >>>>>>>> b = s(a) > >>>>>>>> > >>>>>>>> This fails at the moment (specifically, I think the execution engine will > >>>>>>>> hang). > >>>>>>>> > >>>>>>>> Why? Because the is no one point at which we assign a value to 'a' - the > >>>>>>>> assignment is split over multiple statements, which can be in various places > >>>>>>>> (and inside loops etc). > >>>>>>>> > >>>>>>>> There is nothing in the implementation that detects that a has been assigned > >>>>>>>> its value. 
> >>>>>>>> > >>>>>>>> So there is this notion in the karajan intermediate code of 'closing an > >>>>>>>> array'. This is an assertion made in the object code that all assignments > >>>>>>>> to pieces of an array have been made - that, in affect, the array has its > >>>>>>>> value. > >>>>>>>> > >>>>>>>> The suggested hack/workaround for this is to move the array element > >>>>>>>> assignments into a procedure: > >>>>>>>> > >>>>>>>> (file f[]) z() { > >>>>>>>> f[0] = p(); > >>>>>>>> f[1] - q(); > >>>>>>>> } > >>>>>>>> > >>>>>>>> a = z() > >>>>>>>> b = s(a) > >>>>>>>> > >>>>>>>> This works. (which is sort-of a violation of referential transparency) > >>>>>>>> > >>>>>>>> It works because Swift implicitly marks arrays returned from compound > >>>>>>>> procedures as closed (which may or may not be correct). > >>>>>>>> > >>>>>>>> So in most variable scopes, arrays behave like single-assignment variables, > >>>>>>>> but each array can have one specific scope in which members can be assigned > >>>>>>>> to. In that scope, the array cannot be treated as a whole variable. > >>>>>>>> > >>>>>>>> In the z() example above, that special scope is the body of z(). In the > >>>>>>>> previous example, that scope is the global scope, and the program is invalid > >>>>>>>> by the rule above that the array cannot be referred to as a whole in the > >>>>>>>> same place that its members are individually assigned to. > >>>>>>>> > >>>>>>>> That's my explanation of what's going on now. I think it matches reality. I > >>>>>>>> don't like that this is reality, but it is what we have. > >>>>>>>> > >>>>>>>> Comments appreciated. > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>> > >>>>> -- > >>>>> > >>>>> Ian Foster, Director, Computation Institute > >>>>> Argonne National Laboratory & University of Chicago > >>>>> Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439 > >>>>> Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637 > >>>>> Tel: +1 630 252 4619. Web: www.ci.uchicago.edu. 
> >>>>> Globus Alliance: www.globus.org. > >>>>> > >>>>> > >>>>> ____________________________________________________________________ > >>>>> > >>>>> _______________________________________________ > >>>>> Swift-devel mailing list > >>>>> Swift-devel at ci.uchicago.edu > >>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >>>>> > >>>> -- > >>>> > >>>> Ian Foster, Director, Computation Institute > >>>> Argonne National Laboratory & University of Chicago > >>>> Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439 > >>>> Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637 > >>>> Tel: +1 630 252 4619. Web: www.ci.uchicago.edu. > >>>> Globus Alliance: www.globus.org. > >>>> _______________________________________________ > >>>> Swift-devel mailing list > >>>> Swift-devel at ci.uchicago.edu > >>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >>> _______________________________________________ > >>> Swift-devel mailing list > >>> Swift-devel at ci.uchicago.edu > >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >>> > >>> > > > > > From benc at hawaga.org.uk Tue Jun 19 10:00:19 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 19 Jun 2007 15:00:19 +0000 (GMT) Subject: [Swift-devel] on the semantics of 'array closing' In-Reply-To: <1182264511.19005.6.camel@blabla.mcs.anl.gov> References: <4672F5E3.7060205@mcs.anl.gov> <4674096B.4020109@mcs.anl.gov> <4677455D.1070909@mcs.anl.gov> <1182243330.17515.14.camel@blabla.mcs.anl.gov> <4677CEA1.8050806@mcs.anl.gov> <1182257710.18810.15.camel@blabla.mcs.anl.gov> <4677DD03.6010900@mcs.anl.gov> <1182264511.19005.6.camel@blabla.mcs.anl.gov> Message-ID: On Tue, 19 Jun 2007, Mihael Hategan wrote: > > Also, I think there's two new aspects proposed here - of a foreach() > > or map() reaching a threshold of "enough results" that it can be > > called done, > > With sparse arrays, it would be sufficient to not have the results from > the iterations that fail. 
If we had try/catch constructs, that would > simply translate into: > foreach k,v in V { > try { > results[k] = process(V[k]); > } > catch (*) {} > } > thus alleviating the need for a complex foreach construct. That could fit in with the list comprehension / map syntax too, I guess, like so: a = [ ignore_errors(p(i)) for i in array]; with ignore_errors being an expression-level equivalent of a try { ...} catch{} block. Or it could go on the outside of the map-like structure, like this: a = [p(i) for i in array] b = n_is_enough(7, a); so that b gets assigned a value that is the first 7 values of a to be computed, and errors in computing the rest of 'a' don't propagate through to errors in computing b. That lets you parameterise the 'n is enough' or 'whatever comes first' bits separately from the for construct. -- From hategan at mcs.anl.gov Tue Jun 19 10:13:52 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 19 Jun 2007 18:13:52 +0300 Subject: [Swift-devel] on the semantics of 'array closing' In-Reply-To: References: <4672F5E3.7060205@mcs.anl.gov> <4674096B.4020109@mcs.anl.gov> <4677455D.1070909@mcs.anl.gov> <1182243330.17515.14.camel@blabla.mcs.anl.gov> <4677CEA1.8050806@mcs.anl.gov> <1182257710.18810.15.camel@blabla.mcs.anl.gov> <4677DD03.6010900@mcs.anl.gov> <1182264511.19005.6.camel@blabla.mcs.anl.gov> Message-ID: <1182266032.19234.13.camel@blabla.mcs.anl.gov> On Tue, 2007-06-19 at 15:00 +0000, Ben Clifford wrote: > > On Tue, 19 Jun 2007, Mihael Hategan wrote: > > > > Also, I think there's two new aspects proposed here - of a foreach() > > > or map() reaching a threshold of "enough results" that it can be > > > called done, > > > > With sparse arrays, it would be sufficient to not have the results from > > the iterations that fail. 
If we had try/catch constructs, that would > > simply translate into: > > foreach k,v in V { > > try { > > results[k] = process(V[k]); > > } > > catch (*) {} > > } > > thus alleviating the need for a complex foreach construct. > > That could fit in with the list comprehension / map syntax too, I guess, > like so: > > a = [ ignore_errors(p(i)) for i in array]; > > with ignore_errors being an expression-level equivalent of a try { ...} > catch{} block. Should easily translate to something like: a := swiftArray(for(i, array, ignoreErrors(p(i)))) > > or it could go on the outside of the map-like structure, like this: > > a = [p(i) for i in array] > b = n_is_enough(7, a); Doable. > > so that b gets assigned a value that is the first 7 values of a to be > computed, and errors in computing the rest of 'a' don't propagate through > to errors in computing b. > > That lets you parameterise the 'n is enough' or 'whatever comes first' > bits separately from the for construct. > From benc at hawaga.org.uk Tue Jun 19 13:41:08 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 19 Jun 2007 18:41:08 +0000 (GMT) Subject: [Swift-devel] serializable DSHandle Message-ID: Is there a reason for DSHandle to be serializable? 
> public interface DSHandle extends Serializable { -- From iraicu at cs.uchicago.edu Tue Jun 19 14:07:36 2007 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Tue, 19 Jun 2007 14:07:36 -0500 Subject: [Swift-devel] CPU usage with provider-deef In-Reply-To: <4677C7A5.6040004@mcs.anl.gov> References: <1181984306.10455.3.camel@blabla.mcs.anl.gov> <4673F0E9.3060505@cs.uchicago.edu> <4673F548.5070608@cs.uchicago.edu> <4673FB19.6070305@cs.uchicago.edu> <1182006166.11495.1.camel@blabla.mcs.anl.gov><467408D5.2020709@mcs.anl.gov> <467417DE.50408@mcs.anl.gov><1182018587.12013.14.camel@blabla.mcs.anl.gov> <537273469-1182023623-cardhu_decombobulator_blackberry.rim.net-728646234-@bxe006.bisx.prod.on.blackberry> <1182024904.12401.11.camel@blabla.mcs.anl.gov> <46746A7F.7040004@mcs.anl.gov> <4676FB2C.9010004@cs.uchicago.edu> <1182241371.17515.2.camel@blabla.mcs.anl.gov> <4677C7A5.6040004@mcs.anl.gov> Message-ID: <46782978.10800@cs.uchicago.edu> Hi, See below: Mike Wilde wrote: > Ioan, I feel that documenting the message flow is key to understanding > the system, and that such documentation will be indispensable for you > (and the Swift team) to make progress with Falkon. Yes, I agree! > > > If it's the case, as you stated to me, that you don't have the time to > fully support Falkon for end-user use, which I accept, then the only > way to enable the Swift team to provide this support is for us to > understand the tool. Agreed! > > > The three things you propose to communicate to us in a face-to-face > meeting (organization of the code, the message flow diagram, > configuration options) are exactly the things that need to be documented. The message flow diagram is already in the SC paper, and if not there, in the provisioning paper I wrote a few weeks later that was never submitted anywhere.... 
The Falkon paper is at http://people.cs.uchicago.edu/~iraicu/research/docs/Falkon/Falkon_SC07_v17-submitted.pdf (section 3.2, Figure 2), and the DRP paper is at http://people.cs.uchicago.edu/~iraicu/research/docs/DRP/DRP_v01.doc (section 3.2 and Figure 1). Note that the latest Falkon paper (v24) does not have this message flow diagram. As for the organization of the code and configuration options, I will certainly do them, but for now they are not high on my to-do list. I know Mihael, Ben and everyone else wants written docs, but I can't keep spending 90% of my time on non-research-related issues, as I have been doing recently. Getting Nika's MolDyn application running over Falkon has literally consumed all of my time recently. I don't mind doing it, especially when we have some results to show for it (last night we ran the 100 molecule run successfully, I'll send out a separate update on this later), but I need to get back to the data management (which is almost ready, but unless I get a few quiet days to finish it up, it will never get done), my proposal, etc. I'll get to these eventually, but I can't promise when; in the meantime, I offer my time to meet in person with whoever wants to dig into Falkon further! > > > But you and Ian together need to decide and propose how you want to > see Falkon used and how you intend to make it supportable for that > purpose. I am OK to support Falkon (as you have already seen with me helping Nika one on one, talking to Tibi about his app, getting Falkon in as an incubator project to set up CVS, mailing lists, etc....), but I hope to avoid making Falkon support be 90% of my time, which has been the case recently :(. > > For Swift's goals, we *know* we need Falkon's capabilities to succeed, > and the question is how much and how long can we use Falkon, and when > would we need to start improving or rewriting it. > These are all very good questions, and only time will answer these, IMO. 
Ioan > - Mike > > > > > Mihael Hategan wrote, On 6/19/2007 3:22 AM: >> On Mon, 2007-06-18 at 16:37 -0500, Ioan Raicu wrote: >>> There is a diagram in the SC paper that has the flow of messages, and >>> what each does; these messages also map to the WSDL of the Falkon >>> service. Mihael, Ben, or whoever else wants to start digging through >>> the Falkon code, maybe a meeting might be good to go over the >>> organization of the code, the message flow diagram, configuration >>> options, etc... I would prefer a meeting over drafting up more >>> documents, I think it would be more time effective for me for now. >> >> It might be the reverse for us. It's unlikely that faced with a lot of >> new information, we will consistently retain all of it. Something of >> reference might be handy. >> >>> Ioan >>> >>> Ben Clifford wrote: >>>> The WSDL should describe part of the web services bit of the >>>> protocol. That might be a good place to start. The WSDL should >>>> already describe the messages that go over the wire in something >>>> vaguely readable to a human. Probably what would be needed would be >>>> the extra info to say which order messages are sent. >>>> >>>> On Sat, 16 Jun 2007, Mike Wilde wrote: >>>> >>>> >>>>> I think a nice clean message sequence chart describing Falkon's >>>>> various >>>>> activities would be very useful, as it's the backbone of its logic. >>>>> >>>>> I tried to create this for the SC paper by asking Ioan to describe >>>>> the >>>>> protocol to me, but I did not succeed. >>>>> >>>>> I think this would be a very useful description to maintain, as a >>>>> UML sequence >>>>> chart, and, Ioan, that this would be a very important part of your >>>>> thesis or of >>>>> future papers. >>>>> >>>>> It's up to you and Ian to weigh whether this would be valuable to your >>>>> research. I think it's invaluable for design, review and debugging. 
>>>>> >>>>> - Mike >>>>> >>>>> >>>>> >>>>> >>>>> Mihael Hategan wrote, On 6/16/2007 3:15 PM: >>>>> >>>>>> On Sat, 2007-06-16 at 19:52 +0000, Ian Foster wrote: >>>>>> >>>>>>> I wasn't suggesting (at least in the first instance) that Mihael >>>>>>> take the >>>>>>> prototype and turn it into production, but that Mihael and Ioan >>>>>>> sit down >>>>>>> together and do a code walkthrough. I think that this would likely >>>>>>> identify bugs and opportunities for simplification. >>>>>>> >>>>>> It's somewhat on the same level of fun :). From the experience >>>>>> I've >>>>>> accumulated so far, design is hard. Understanding prototype >>>>>> design is >>>>>> probably even harder (not only do you need to understand the >>>>>> problem, >>>>>> you also need to understand why many non-obvious things are done >>>>>> the way >>>>>> they are done). >>>>>> >>>>>> Mihael >>>>>> >>>>>> >>>>>>> Ian >>>>>>> >>>>>>> >>>>>>> >>>>>>> Sent via BlackBerry from T-Mobile >>>>>>> >>>>>>> -----Original Message----- >>>>>>> From: Ben Clifford >>>>>>> >>>>>>> Date: Sat, 16 Jun 2007 19:36:17 To:Mihael Hategan >>>>>>> >>>>>>> Cc:swift-devel at ci.uchicago.edu >>>>>>> Subject: Re: [Swift-devel] CPU usage with provider-deef >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Sat, 16 Jun 2007, Mihael Hategan wrote: >>>>>>> >>>>>>> >>>>>>>> On Sat, 2007-06-16 at 12:03 -0500, Mike Wilde wrote: >>>>>>>> >>>>>>>>> This should be fun, and a nice break from the I2U2 work that >>>>>>>>> you've >>>>>>>>> been immersed in, Mihael. >>>>>>>>> >>>>>>>> I have my reservations towards the amount of fun it involves. >>>>>>>> >>>>>>> Right, taking prototypes and turning them into production isn't >>>>>>> necessarily fun - in fact, a lot of the fun already happened >>>>>>> with the >>>>>>> making of the prototype and the rest is somewhat drudgery. (to >>>>>>> an extent >>>>>>> that's the same situation i2u2 cosmic was/is in). 
-- ============================================ Ioan Raicu Ph.D. Student ============================================ Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 
58th Street, Ryerson Hall Chicago, IL 60637 ============================================ Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dsl.cs.uchicago.edu/ ============================================ ============================================ From yongzh at cs.uchicago.edu Tue Jun 19 15:40:54 2007 From: yongzh at cs.uchicago.edu (Yong Zhao) Date: Tue, 19 Jun 2007 15:40:54 -0500 (CDT) Subject: [Swift-devel] Re: 100 molecule In-Reply-To: <46782A14.2080308@cs.uchicago.edu> References: <7DAEB562-6FD5-4425-9BBB-7AA84EA26526@mcs.anl.gov> <46782A14.2080308@cs.uchicago.edu> Message-ID: Ioan, This sounds very good. I'm forwarding this to the swift list. Yong. On Tue, 19 Jun 2007, Ioan Raicu wrote: > Yes, rm -rf could take that long... Yong, why don't you try these two > commands, instead of "rm -rf".... I bet it will be much faster on the > GPFS at ANL! > > find ./ -exec rm {} \; > find ./ -exec rm -r {} \; > > The first one removes the files, and the second one removes the > directories... I found rm -rf to be very slow on the ANL GPFS.... it has > to do with the fact that rm -rf does an expansion of all the files it > needs to delete... and it ends up being very very long if you have many > files to delete.... doing the method above, it does 1 delete at a > time... so it doesn't suffer from the long list of files as rm -rf.... > > Ioan > > Veronika Nefedova wrote: > > I am wondering how the cleanup is done? It's hard to believe that "rm > > -rf" would work that long. At the end of the successful run it's just > > one directory with one nested subdir that had to be removed. 
> > > > Nika > > From yongzh at cs.uchicago.edu Tue Jun 19 15:43:59 2007 From: yongzh at cs.uchicago.edu (Yong Zhao) Date: Tue, 19 Jun 2007 15:43:59 -0500 (CDT) Subject: [Swift-devel] serializable DSHandle In-Reply-To: References: Message-ID: The original thought was that we can serialize the handle and pass it around distributed sites, so the mapping could happen at different places during the execution. On Tue, 19 Jun 2007, Ben Clifford wrote: > > Is there a reason for DSHandle to be serializable? > > > public interface DSHandle extends Serializable { > From benc at hawaga.org.uk Tue Jun 19 15:44:45 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 19 Jun 2007 20:44:45 +0000 (GMT) Subject: [Swift-devel] serializable DSHandle In-Reply-To: References: Message-ID: OK. So unused for now? On Tue, 19 Jun 2007, Yong Zhao wrote: > The original thought was that we can serialize the handle and pass it > around distributed sites, so the mapping could happen at different places > during the execution. > > On Tue, 19 Jun 2007, Ben Clifford wrote: > > > > > Is there a reason for DSHandle to be serializable? > > > > > public interface DSHandle extends Serializable { > > From yongzh at cs.uchicago.edu Tue Jun 19 15:45:43 2007 From: yongzh at cs.uchicago.edu (Yong Zhao) Date: Tue, 19 Jun 2007 15:45:43 -0500 (CDT) Subject: [Swift-devel] serializable DSHandle In-Reply-To: References: Message-ID: Right, the feature is not used right now. On Tue, 19 Jun 2007, Ben Clifford wrote: > > OK. So unused for now? 
> > On Tue, 19 Jun 2007, Yong Zhao wrote: > > > The original thought was that we can serialize the handle and pass it > > around distributed sites, so the mapping could happen at different places > > during the execution. > > > > On Tue, 19 Jun 2007, Ben Clifford wrote: > > > > > > > > Is there a reason for DSHandle to be serializable? > > > > > > > public interface DSHandle extends Serializable { > > > From wilde at mcs.anl.gov Tue Jun 19 16:05:12 2007 From: wilde at mcs.anl.gov (Mike Wilde) Date: Tue, 19 Jun 2007 16:05:12 -0500 Subject: [Swift-devel] Re: 100 molecule In-Reply-To: References: <7DAEB562-6FD5-4425-9BBB-7AA84EA26526@mcs.anl.gov> <46782A14.2080308@cs.uchicago.edu> Message-ID: <46784508.6020200@mcs.anl.gov> One technique that works nicely if you just want the old files out of the way is to do an mv of the top-level dir to a new name, and then you can background the rm's. - Mike Yong Zhao wrote, On 6/19/2007 3:40 PM: > Ioan, > > This sounds very good. I'm forwarding this to the swift list. > > Yong. > > On Tue, 19 Jun 2007, Ioan Raicu wrote: > >> Yes, rm -rf could take that long... Yong, why don't you try these two >> commands, instead of "rm -rf".... I bet it will be much faster on the >> GPFS at ANL! >> >> find ./ -exec rm {} \; >> find ./ -exec rm -r {} \; >> >> The first one removes the files, and the second one removes the >> directories... I found rm -rf to be very slow on the ANL GPFS.... it has >> to do with the fact that rm -rf does an expansion of all the files it >> needs to delete... and it ends up being very very long if you have many >> files to delete.... doing the method above, it does 1 delete at a >> time... so it doesn't suffer from the long list of files as rm -rf.... 
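[Editorial sketch, not part of the original messages.] The two cleanup ideas in this thread (rename the tree first, then delete one entry at a time in the background) can be combined roughly as follows. This is only a sketch under stated assumptions: the directory names are hypothetical stand-ins, the `-depth`/`-type` variant of the find commands is a substitution for the ones quoted above, and whether any of this beats a plain `rm -rf` on a particular GPFS installation is something to measure rather than assume.

```shell
#!/bin/sh
# Sketch only: all paths here are hypothetical stand-ins for a real run directory.
set -eu

workdir=$(mktemp -d)             # scratch area for the demonstration
run="$workdir/run-dir"           # hypothetical name of a finished run directory
mkdir -p "$run/nested"
touch "$run/a.out" "$run/nested/b.out"

# Step 1: rename the tree so the original path is free immediately.
trash="$workdir/run-dir.old.$$"
mv "$run" "$trash"

# Step 2: delete in the background, one entry per rm/rmdir call.
# -depth visits a directory's contents before the directory itself,
# so each directory is already empty when rmdir reaches it.
(
  find "$trash" -depth -type f -exec rm -f {} \;
  find "$trash" -depth -type d -exec rmdir {} \;
) &
cleanup_pid=$!

# The original path can be reused right away, before the cleanup finishes.
mkdir "$run"

# Only needed here so the demonstration can confirm the cleanup completed.
wait "$cleanup_pid"
```

In real use the `wait` would normally be dropped so the submitting shell is not held up; it is kept here only so the result can be checked.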
>> >> Ioan >> >> Veronika Nefedova wrote: >>> I am wondering how the cleanup is done? It's hard to believe that "rm >>> -rf" would work that long. At the end of the successful run it's just >>> one directory with one nested subdir that had to be removed. >>> >>> Nika >>> > -- Mike Wilde Computation Institute, University of Chicago Math & Computer Science Division Argonne National Laboratory Argonne, IL 60439 USA tel 630-252-7497 fax 630-252-1997 From benc at hawaga.org.uk Tue Jun 19 18:51:44 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 19 Jun 2007 23:51:44 +0000 (GMT) Subject: [Swift-devel] Re: email for Mike and Ian In-Reply-To: References: <46782EFC.6030601@cs.uchicago.edu> <46784274.8090202@cs.uchicago.edu> <051DE8C7-3598-4EDE-BCD7-870FE28B5CA6@mcs.anl.gov> <99348A95-85BD-4924-B708-481982C9F425@mcs.anl.gov> Message-ID: This looks like bug 49: http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=49 I just spent the evening tracking it down with Nika. As far as I can tell, that means she's been using a swift compiler that has been at least 2 months old right up until she just updated it this evening. *Please* try to report problems against something resembling a recent checkout-and-build. Furthermore, when I finally tracked it down, it turns out that it's because of a bug in the SwiftScript source. I fixed *exactly* this problem, here: Date: Sat, 28 Apr 2007 08:39:03 +0000 (GMT) From: Ben Clifford To: Veronika V. Nefedova Cc: swift-devel at ci.uchicago.edu Subject: Re: [Swift-devel] nightly built 070426 *Please* try to actually use bugfixes that people give you. Bad users! Go to your room! On Tue, 19 Jun 2007, Yong Zhao wrote: > I tried the restart feature yesterday and it seemed to work fine with the > MolDyn workflow. I am not sure what was the problem that you encountered. 
> > About the compile problem, maybe Ben can take a look since he made a few > changes to the translation. > > yong. > > On Tue, 19 Jun 2007, Veronika Nefedova wrote: > > > Yong, > > > > Ben asks me to test the restart feature that was failing before... I > > am wondering if its OK to do svn up and then rebuild vdsk? I do not > > want to break things... If its OK - should I do it in ~nefedova/vdsk > > (I assume)? > > > > Nika > > > > On Jun 19, 2007, at 4:31 PM, Yong Zhao wrote: > > > > > did you make sure that your path is set correctly? do a > > > > > > which swift > > > > > > On Tue, 19 Jun 2007, Veronika Nefedova wrote: > > > > > >> Yong, > > >> > > >> Any idea what could've caused it to fail: > > >> > > >> nefedova at viper:~/alamines> cat MolDyn-244-ctsmk1lnf2qa1.log > > >> 2007-06-19 16:11:19,256 INFO Loader MolDyn-244.dtm: source file is > > >> new. Recompiling. > > >> 2007-06-19 16:12:08,346 DEBUG Loader Detailed exception: > > >> java.lang.RuntimeException: Failed to convert .xml to .kml for > > >> MolDyn-244.dtm > > >> at org.griphyn.vdl.karajan.Loader.compile(Loader.java:209) > > >> at org.griphyn.vdl.karajan.Loader.main(Loader.java:108) > > >> Caused by: java.util.NoSuchElementException: no such attribute: nil > > >> in template context [call_arg] > > >> at org.antlr.stringtemplate.StringTemplate.rawSetAttribute > > >> (StringTemplate.java:643) > > >> at org.antlr.stringtemplate.StringTemplate.setAttribute > > >> (StringTemplate.java:539) > > >> at org.griphyn.vdl.engine.Karajan.setExprOrValue > > >> (Karajan.java:663) > > >> at org.griphyn.vdl.engine.Karajan.setExprOrValue > > >> (Karajan.java:638) > > >> at org.griphyn.vdl.engine.Karajan.actualParameter > > >> (Karajan.java:458) > > >> at org.griphyn.vdl.engine.Karajan.call(Karajan.java:351) > > >> at org.griphyn.vdl.engine.Karajan.statements(Karajan.java: > > >> 304) > > >> at org.griphyn.vdl.engine.Karajan.program(Karajan.java:117) > > >> at org.griphyn.vdl.engine.Karajan.main(Karajan.java:71) > > 
>> at org.griphyn.vdl.karajan.Loader.compile(Loader.java:199) > > >> ... 1 more > > >> nefedova at viper:~/alamines> > > >> > > >> > > >> > > >> The dtm file is generated by a script. The same script that generated > > >> the files for 1, 20 and 100 molecules. Not sure why 244 is different. > > >> Everything is in my alamines dir on viper in home dir... > > >> > > >> Nika > > >> > > >> On Jun 19, 2007, at 4:00 PM, Yong Zhao wrote: > > >> > > >>> Everything is configured in Nika's directory: > > >>> ~nefedova/vdsk > > >>> > > >>> Just point VDS_HOME or SWIFT_HOME to /home/nefedova/vdsk, and the > > >>> rest > > >>> should be correctly configured in the etc directory. > > >>> > > >>> Yong. > > >>> > > >>> On Tue, 19 Jun 2007, Ioan Raicu wrote: > > >>> > > >>>> Yong, you are the one who ran the Swift workflow... can you make > > >>>> sure > > >>>> Nika has everything updated, or can you invoke the command from > > >>>> your > > >>>> environment? > > >>>> > > >>>> I have restarted Falkon and set it to 18 hours for 100 nodes (200 > > >>>> workers).... it's all up and running... there is a 2 hour idle > > >>>> time, so > > >>>> make sure to start the workflow in the next 2 hours so we don't > > >>>> lose > > >>>> the allocation. > > >>>> > > >>>> Falkon is in the same place as last night, tg-viz-login1 on 50001! > > >>>> > > >>>> Ioan > > >>>> > > >>>> Veronika Nefedova wrote: > > >>>>> Ok, I have the file ready. What workdir should I specify for TG > > >>>>> UC ? > > >>>>> > > >>>>> Nika > > >>>>> > > >>>>> On Jun 19, 2007, at 2:31 PM, Ioan Raicu wrote: > > >>>>> > > >>>>>> Hi guys, > > >>>>>> I need to go eat some lunch.... I'll be back in 30 min... but > > >>>>>> then > > >>>>>> I'll only be online until 4PM... so can you please look over that > > >>>>>> email, and send it back to me soon? Also, let's decide what > > >>>>>> to do > > >>>>>> about the next run, is 244 short mol run OK? Nika, can you prep > > >>>>>> the > > >>>>>> input data for this? 
ANL seems almost idle, only 4 nodes are in > > >>>>>> use, > > >>>>>> so we could easily get another 200 processors like last night :) > > >>>>>> > > >>>>>> Ioan > > >>>>>> > > >>>>> > > >>>>> > > >>>> > > >>>> -- > > >>>> ============================================ > > >>>> Ioan Raicu > > >>>> Ph.D. Student > > >>>> ============================================ > > >>>> Distributed Systems Laboratory > > >>>> Computer Science Department > > >>>> University of Chicago > > >>>> 1100 E. 
58th Street, Ryerson Hall > > >>>> Chicago, IL 60637 > > >>>> ============================================ > > >>>> Email: iraicu at cs.uchicago.edu > > >>>> Web: http://www.cs.uchicago.edu/~iraicu > > >>>> http://dsl.cs.uchicago.edu/ > > >>>> ============================================ > > >>>> ============================================ > > >>>> > > >>>> > > >>> > > >> > > >> > > > > > > > > > From wilde at mcs.anl.gov Tue Jun 19 19:04:06 2007 From: wilde at mcs.anl.gov (Mike Wilde) Date: Tue, 19 Jun 2007 19:04:06 -0500 Subject: [Swift-devel] Re: email for Mike and Ian In-Reply-To: References: <46782EFC.6030601@cs.uchicago.edu> <46784274.8090202@cs.uchicago.edu> <051DE8C7-3598-4EDE-BCD7-870FE28B5CA6@mcs.anl.gov> <99348A95-85BD-4924-B708-481982C9F425@mcs.anl.gov> Message-ID: <46786EF6.507@mcs.anl.gov> So, at a practical level, what went wrong here and what do we do to correct it? The points below are perhaps a bit naive and reflect the sad fact that I'm not currently a user. But to set guidelines for ourselves and a growing community of users, should we: - Run Swift from well-defined submit hosts - Keep those hosts up to date with nightly builds - stay tuned to bugzilla traffic to know when to jump to a new build - are the run dir and/or logs clearly tagged with the build date? - use only official builds if at all possible (unless you need to include a fix that's not yet been included in a build?) - what else. Would it be useful to spell out good practices for Nika, Tibi, and CNARI, MolDyn, and LQCD people? Thanks, Mike Ben Clifford wrote, On 6/19/2007 6:51 PM: > This looks like bug 49: > > http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=49 > > I just spent the evening tracking it down with Nika. > > As far as I can tell, that means she's been using a swift compiler that > has been at least 2 months old right up until she just updated it this > evening. > > *please* try to report problems against something resembling a recent > checkout-and-build. 
-- Mike Wilde Computation Institute, University of Chicago Math & Computer Science Division Argonne National Laboratory Argonne, IL 60439 USA tel 630-252-7497 fax 630-252-1997 From benc at hawaga.org.uk Tue Jun 19 19:10:06 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 20 Jun 2007 00:10:06 +0000 (GMT) Subject: [Swift-devel] Re: email for Mike and Ian In-Reply-To: <46786EF6.507@mcs.anl.gov> References: <46782EFC.6030601@cs.uchicago.edu> <46784274.8090202@cs.uchicago.edu> <051DE8C7-3598-4EDE-BCD7-870FE28B5CA6@mcs.anl.gov> <99348A95-85BD-4924-B708-481982C9F425@mcs.anl.gov> <46786EF6.507@mcs.anl.gov> Message-ID: On Tue, 19 Jun 2007, Mike Wilde wrote: > So, at a practical level, what went wrong here and what do we do to correct > it? 
> - use only official builds if at all possible (unless you need to > include a fix thats not yet been included in a build?) there's a reluctance on Nika's part to upgrade I think because its a hassle for her to get the Falkon/cog provider into a new build. It should not be hard. provider-deef should be going into SVN 'real soon' (i.e. as soon as Yong provides a clean recent tree for me to import) and after that it should be a lot easier to deploy the most recent swift + most recent provider-deef. -- From wilde at mcs.anl.gov Tue Jun 19 19:17:21 2007 From: wilde at mcs.anl.gov (Mike Wilde) Date: Tue, 19 Jun 2007 19:17:21 -0500 Subject: [Swift-devel] Re: email for Mike and Ian In-Reply-To: References: <46782EFC.6030601@cs.uchicago.edu> <46784274.8090202@cs.uchicago.edu> <051DE8C7-3598-4EDE-BCD7-870FE28B5CA6@mcs.anl.gov> <99348A95-85BD-4924-B708-481982C9F425@mcs.anl.gov> <46786EF6.507@mcs.anl.gov> Message-ID: <46787211.5030802@mcs.anl.gov> k, that makes sense. I think as we work to grow a group culture that works like a smooth distributed open source project, though, that writing down our processes will help us grow our team in both size and productivity. Whatever Nika and Tibi are doing today, a growing group of users and collaborators will be doing tomorrow - and we need to steer us and them towards good practices. - Mike Ben Clifford wrote, On 6/19/2007 7:10 PM: > > On Tue, 19 Jun 2007, Mike Wilde wrote: > >> So, at a practical level, what went wrong here and what do we do to correct >> it? > >> - use only official builds if at all possible (unless you need to >> include a fix thats not yet been included in a build?) > > there's a reluctance on Nika's part to upgrade I think because its a > hassle for her to get the Falkon/cog provider into a new build. It should > not be hard. provider-deef should be going into SVN 'real soon' (i.e. 
as > soon as Yong provides a clean recent tree for me to import) and after that > it should be a lot easier to deploy the most recent swift + most recent > provider-deef. > -- Mike Wilde Computation Institute, University of Chicago Math & Computer Science Division Argonne National Laboratory Argonne, IL 60439 USA tel 630-252-7497 fax 630-252-1997 From nefedova at mcs.anl.gov Tue Jun 19 19:18:34 2007 From: nefedova at mcs.anl.gov (Veronika Nefedova) Date: Tue, 19 Jun 2007 19:18:34 -0500 Subject: [Swift-devel] Re: email for Mike and Ian In-Reply-To: References: <46782EFC.6030601@cs.uchicago.edu> <46784274.8090202@cs.uchicago.edu> <051DE8C7-3598-4EDE-BCD7-870FE28B5CA6@mcs.anl.gov> <99348A95-85BD-4924-B708-481982C9F425@mcs.anl.gov> <46786EF6.507@mcs.anl.gov> Message-ID: <78ED59EC-B377-43A8-909B-2013415C50D7@mcs.anl.gov> I do not think it's a correct assessment. I did update *many* times over the course of the last couple of weeks, I just didn't recompile my dtm files. And they worked just fine, btw. I do not know how it happened that I ended up with the old version of the swift script -- it might've happened during those submit host-to-submit host moves, and obviously it was not intentional to discard any new bug fixes. Nika On Jun 19, 2007, at 7:10 PM, Ben Clifford wrote: > > > On Tue, 19 Jun 2007, Mike Wilde wrote: > >> So, at a practical level, what went wrong here and what do we do >> to correct >> it? > >> - use only official builds if at all possible (unless you need to >> include a fix thats not yet been included in a build?) > > there's a reluctance on Nika's part to upgrade I think because its a > hassle for her to get the Falkon/cog provider into a new build. It > should > not be hard. provider-deef should be going into SVN 'real > soon' (i.e. as > soon as Yong provides a clean recent tree for me to import) and > after that > it should be a lot easier to deploy the most recent swift + most > recent > provider-deef.
> > -- > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From benc at hawaga.org.uk Tue Jun 19 19:27:30 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 20 Jun 2007 00:27:30 +0000 (GMT) Subject: [Swift-devel] Re: email for Mike and Ian In-Reply-To: <78ED59EC-B377-43A8-909B-2013415C50D7@mcs.anl.gov> References: <46782EFC.6030601@cs.uchicago.edu> <46784274.8090202@cs.uchicago.edu> <051DE8C7-3598-4EDE-BCD7-870FE28B5CA6@mcs.anl.gov> <99348A95-85BD-4924-B708-481982C9F425@mcs.anl.gov> <46786EF6.507@mcs.anl.gov> <78ED59EC-B377-43A8-909B-2013415C50D7@mcs.anl.gov> Message-ID: On Tue, 19 Jun 2007, Veronika Nefedova wrote: > I did update *many* times along the course of last couple of weeks, I > just didn't recompile my dtm files. And they worked just fine, btw. ok. Its probably a good idea from a swift testing perspective to be doing clean builds of the swiftscript programs (cleaning away the .xml and karajan files) regularly, even though it shouldn't affect the application side of things. > I do not know how it happened that I ended up with the old version of > swfit script -- it might've happened during those submit host-to-submit > host moves, and obviously it was not intentional to discard any new bug > fixes. You might stick stuff in version control and use that rather than copying files between machines - that can alleviate overwriting problems sometimes. Tibi's been keeping at least one of his apps in the SwiftApps/ directory in the swift SVN. 
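The clean-build step suggested above (cleaning away the generated .xml and karajan files before re-running) could be scripted roughly as follows. This is only a sketch: the helper name `clean_generated`, the source extensions (`.swift`, `.dtm`), the generated extensions (`.xml`, `.kml`), and the assumption that generated files sit next to their sources and share a base name are all hypothetical and should be adjusted to what a given Swift build actually produces.

```python
from pathlib import Path

# Assumed extensions (not confirmed by the thread): adjust to your setup.
SOURCE_SUFFIXES = (".swift", ".dtm")
GENERATED_SUFFIXES = (".xml", ".kml")


def clean_generated(directory):
    """Delete generated files that share a base name with a source file.

    Files with a generated-looking extension but no matching source are
    left alone, so unrelated .xml files in the directory survive.
    Returns the sorted list of removed file names.
    """
    entries = [p for p in Path(directory).iterdir() if p.is_file()]
    source_stems = {p.stem for p in entries if p.suffix in SOURCE_SUFFIXES}
    removed = []
    for path in entries:
        if path.suffix in GENERATED_SUFFIXES and path.stem in source_stems:
            path.unlink()
            removed.append(path.name)
    return sorted(removed)
```

Running it on the directory holding the workflow sources before each test run forces the next invocation to regenerate the intermediate files from scratch.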
-- From wilde at mcs.anl.gov Tue Jun 19 19:41:44 2007 From: wilde at mcs.anl.gov (Mike Wilde) Date: Tue, 19 Jun 2007 19:41:44 -0500 Subject: [Swift-devel] Swift application testing practices In-Reply-To: <78ED59EC-B377-43A8-909B-2013415C50D7@mcs.anl.gov> References: <46782EFC.6030601@cs.uchicago.edu> <46784274.8090202@cs.uchicago.edu> <051DE8C7-3598-4EDE-BCD7-870FE28B5CA6@mcs.anl.gov> <99348A95-85BD-4924-B708-481982C9F425@mcs.anl.gov> <46786EF6.507@mcs.anl.gov> <78ED59EC-B377-43A8-909B-2013415C50D7@mcs.anl.gov> Message-ID: <467877C8.3050104@mcs.anl.gov> First, let's change this subject line. Second, I'm learning that we really are an open source development team, in that we're very distributed. And on such teams, we need to learn how to dish out and receive scoldings from each other. Both with a smile. Like so: :) This discussion is good and the kind of pointers we need to exchange as part of continuous improvement. Nika, it sounds like it's a simple case of using a different procedure to update your Swift code - and that we should document the procedure for all to use, especially the growing user base. I'd like to see us starting and growing nice wiki pages for these techniques, and to periodically restructure the wiki as needed to keep it informative and useful to us. Please, do this as a matter of course, without being "assigned" to do it. It helps make our environment a great place to work, and certainly helps us grow. Btw - I found the talk by Brian Fitzpatrick at the Globus committers meeting to be excellent, and very inspiring. Don't be misled by the title: it applies as much to defining what "Good" people are as it does to dealing with "bad" ones.
Here's a nice summary: http://www.oreillynet.com/conferences/blog/2006/07/oscon_how_open_source_projects.html The video is at: http://video.google.com/videoplay?docid=-4216011961522818645 The slides are at: http://www.slideshare.net/vishnu/how-to-protect-yourhow-to-protect-your-open-source-project-from-poisonous-people/ And a great book recommended by both the speakers and our own Ben is: Producing Open Source Software How to Run a Successful Free Software Project by Karl Fogel Full text is at: http://producingoss.com/ This is a nice fast read that you can get a lot out of on each 10-minute break that you spend browsing it. I think there's loads of things we can pick up from here to help us grow as a project and as a team. - Mike Veronika Nefedova wrote, On 6/19/2007 7:18 PM: > I do not think its a correct asessment. > > I did update *many* times along the course of last couple of weeks, I > just didn't recompile my dtm files. And they worked just fine, btw. > I do not know how it happened that I ended up with the old version of > swfit script -- it might've happened during those submit host-to-submit > host moves, and obviously it was not intentional to discard any new bug > fixes. > > Nika > > On Jun 19, 2007, at 7:10 PM, Ben Clifford wrote: > >> >> >> On Tue, 19 Jun 2007, Mike Wilde wrote: >> >>> So, at a practical level, what went wrong here and what do we do to >>> correct >>> it? >> >>> - use only official builds if at all possible (unless you need to >>> include a fix thats not yet been included in a build?) >> >> there's a reluctance on Nika's part to upgrade I think because its a >> hassle for her to get the Falkon/cog provider into a new build. It should >> not be hard. provider-deef should be going into SVN 'real soon' (i.e. as >> soon as Yong provides a clean recent tree for me to import) and after >> that >> it should be a lot easier to deploy the most recent swift + most recent >> provider-deef. 
>> >> --_______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> > > -- Mike Wilde Computation Institute, University of Chicago Math & Computer Science Division Argonne National Laboratory Argonne, IL 60439 USA tel 630-252-7497 fax 630-252-1997 From tiberius at ci.uchicago.edu Tue Jun 19 20:33:29 2007 From: tiberius at ci.uchicago.edu (Tiberiu Stef-Praun) Date: Tue, 19 Jun 2007 20:33:29 -0500 Subject: [Swift-devel] Re: email for Mike and Ian In-Reply-To: References: <46782EFC.6030601@cs.uchicago.edu> <051DE8C7-3598-4EDE-BCD7-870FE28B5CA6@mcs.anl.gov> <99348A95-85BD-4924-B708-481982C9F425@mcs.anl.gov> <46786EF6.507@mcs.anl.gov> <78ED59EC-B377-43A8-909B-2013415C50D7@mcs.anl.gov> Message-ID: Tibi keeps all three of his apps in the SVN I cannot say it's a big win, but at least it helps me keep track of the latest version. On 6/19/07, Ben Clifford wrote: > > On Tue, 19 Jun 2007, Veronika Nefedova wrote: > > > I did update *many* times along the course of last couple of weeks, I > > just didn't recompile my dtm files. And they worked just fine, btw. > > ok. > > Its probably a good idea from a swift testing perspective to be doing > clean builds of the swiftscript programs (cleaning away the .xml and > karajan files) regularly, even though it shouldn't affect the application > side of things. > > > I do not know how it happened that I ended up with the old version of > > swfit script -- it might've happened during those submit host-to-submit > > host moves, and obviously it was not intentional to discard any new bug > > fixes. > > You might stick stuff in version control and use that rather than copying > files between machines - that can alleviate overwriting problems > sometimes. Tibi's been keeping at least one of his apps in the SwiftApps/ > directory in the swift SVN. 
> > -- > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > -- Tiberiu (Tibi) Stef-Praun, PhD Research Staff, Computation Institute 5640 S. Ellis Ave, #405 University of Chicago http://www-unix.mcs.anl.gov/~tiberius/ From hategan at mcs.anl.gov Wed Jun 20 05:19:59 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 20 Jun 2007 13:19:59 +0300 Subject: [Swift-devel] Re: 100 molecule In-Reply-To: <46784508.6020200@mcs.anl.gov> References: <7DAEB562-6FD5-4425-9BBB-7AA84EA26526@mcs.anl.gov> <46782A14.2080308@cs.uchicago.edu> <46784508.6020200@mcs.anl.gov> Message-ID: <1182334799.3206.0.camel@blabla.mcs.anl.gov> Hmm. It never occurred to me, but that rm job could be batch=true. On Tue, 2007-06-19 at 16:05 -0500, Mike Wilde wrote: > One technique that works nice if you just want the old files out of > the way is to do an mv of the top level dir to a new name, and then > you can background the rm's. > > - Mike > > Yong Zhao wrote, On 6/19/2007 3:40 PM: > > Ioan, > > > > This sounds very good. I'm forwarding this to the swift list. > > > > Yong. > > > > On Tue, 19 Jun 2007, Ioan Raicu wrote: > > > >> Yes, rm -rf could take that long... Yong, why don't you try a these two > >> commands, instead of "rm -rf".... I bet it will be much faster on the > >> GPFS at ANL! > >> > >> find ./ -exec rm {} \; > >> find ./ -exec rm -r {} \; > >> > >> The first one removes the files, and the second one removes the > >> directories... I found rm -rf to be very slow on the ANL GPFS.... it has > >> to do with the fact that rm -rf does an expansion of all the files it > >> needs to deletes... and it ends up being very very long if you hav many > >> files to delete.... doing the method above, it does 1 delete at a > >> time... so it doesn't suffer from the long list of files as rm -rf.... 
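The rename-then-delete technique quoted above from Mike (mv the top-level directory to a new name, then background the rm) can be sketched in Python as below. `retire_directory` is a hypothetical helper, not part of Swift, and the sketch assumes the renamed directory stays in the same parent, hence on the same filesystem. It says nothing about whether deletion on GPFS is faster this way than with `rm -rf`; it only gets the old directory out of the way immediately.

```python
import os
import shutil
import threading
import uuid


def retire_directory(path):
    """Rename `path` out of the way, then delete it in a background thread.

    Returns (renamed_path, thread); call thread.join() to wait for the
    deletion to finish.
    """
    path = os.path.abspath(path)
    # Unique sibling name, so the original name is free for the next run.
    trash = f"{path}.old-{uuid.uuid4().hex}"
    os.rename(path, trash)
    worker = threading.Thread(
        target=shutil.rmtree, args=(trash,), kwargs={"ignore_errors": True}
    )
    worker.start()
    return trash, worker
```

The caller can recreate the original directory name right away and only join the worker thread if it needs to know the space has been reclaimed.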
> >> > >> Ioan > >> > >> Veronika Nefedova wrote: > >>> I am wondering how the cleanup is done? Its hard to believe that "rm > >>> -rf" would work that long. At the end of the successful run its just > >>> one directory with one nested subdir had to be removed. > >>> > >>> NIka > >>> > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > From hategan at mcs.anl.gov Wed Jun 20 05:22:24 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 20 Jun 2007 13:22:24 +0300 Subject: [Swift-devel] Re: email for Mike and Ian In-Reply-To: References: <46782EFC.6030601@cs.uchicago.edu> <46784274.8090202@cs.uchicago.edu> <051DE8C7-3598-4EDE-BCD7-870FE28B5CA6@mcs.anl.gov> <99348A95-85BD-4924-B708-481982C9F425@mcs.anl.gov> <46786EF6.507@mcs.anl.gov> <78ED59EC-B377-43A8-909B-2013415C50D7@mcs.anl.gov> Message-ID: <1182334944.3206.3.camel@blabla.mcs.anl.gov> On Wed, 2007-06-20 at 00:27 +0000, Ben Clifford wrote: > On Tue, 19 Jun 2007, Veronika Nefedova wrote: > > > I did update *many* times along the course of last couple of weeks, I > > just didn't recompile my dtm files. And they worked just fine, btw. > > ok. > > Its probably a good idea from a swift testing perspective to be doing > clean builds of the swiftscript programs (cleaning away the .xml and > karajan files) regularly, even though it shouldn't affect the application > side of things. I think an even better idea would be for swift to recompile files if the swift version has changed. That we can achieve by having a timestamp on the swift build. If the kml files are older than that, a recompilation should be forced. Mihael > > > I do not know how it happened that I ended up with the old version of > > swfit script -- it might've happened during those submit host-to-submit > > host moves, and obviously it was not intentional to discard any new bug > > fixes. 
> > You might stick stuff in version control and use that rather than copying > files between machines - that can alleviate overwriting problems > sometimes. Tibi's been keeping at least one of his apps in the SwiftApps/ > directory in the swift SVN. > From wilde at mcs.anl.gov Wed Jun 20 06:35:30 2007 From: wilde at mcs.anl.gov (Mike Wilde) Date: Wed, 20 Jun 2007 06:35:30 -0500 Subject: [Swift-devel] Re: 100 molecule In-Reply-To: <1182334799.3206.0.camel@blabla.mcs.anl.gov> References: <7DAEB562-6FD5-4425-9BBB-7AA84EA26526@mcs.anl.gov> <46782A14.2080308@cs.uchicago.edu> <46784508.6020200@mcs.anl.gov> <1182334799.3206.0.camel@blabla.mcs.anl.gov> Message-ID: <46791102.3040201@mcs.anl.gov> Please file an enhancement bug on this if its not already filed. Thanks, Mike Mihael Hategan wrote, On 6/20/2007 5:19 AM: > Hmm. It never occurred to me, but that rm job could be batch=true. > > On Tue, 2007-06-19 at 16:05 -0500, Mike Wilde wrote: >> One technique that works nice if you just want the old files out of >> the way is to do an mv of the top level dir to a new name, and then >> you can background the rm's. >> >> - Mike >> >> Yong Zhao wrote, On 6/19/2007 3:40 PM: >>> Ioan, >>> >>> This sounds very good. I'm forwarding this to the swift list. >>> >>> Yong. >>> >>> On Tue, 19 Jun 2007, Ioan Raicu wrote: >>> >>>> Yes, rm -rf could take that long... Yong, why don't you try a these two >>>> commands, instead of "rm -rf".... I bet it will be much faster on the >>>> GPFS at ANL! >>>> >>>> find ./ -exec rm {} \; >>>> find ./ -exec rm -r {} \; >>>> >>>> The first one removes the files, and the second one removes the >>>> directories... I found rm -rf to be very slow on the ANL GPFS.... it has >>>> to do with the fact that rm -rf does an expansion of all the files it >>>> needs to deletes... and it ends up being very very long if you hav many >>>> files to delete.... doing the method above, it does 1 delete at a >>>> time... 
so it doesn't suffer from the long list of files as rm -rf.... >>>> >>>> Ioan >>>> >>>> Veronika Nefedova wrote: >>>>> I am wondering how the cleanup is done? Its hard to believe that "rm >>>>> -rf" would work that long. At the end of the successful run its just >>>>> one directory with one nested subdir had to be removed. >>>>> >>>>> NIka >>>>> >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>> >>> > > -- Mike Wilde Computation Institute, University of Chicago Math & Computer Science Division Argonne National Laboratory Argonne, IL 60439 USA tel 630-252-7497 fax 630-252-1997 From foster at mcs.anl.gov Thu Jun 21 08:33:11 2007 From: foster at mcs.anl.gov (Ian Foster) Date: Thu, 21 Jun 2007 08:33:11 -0500 Subject: [Swift-devel] Swift Performance Data In-Reply-To: <467A7AC6.7020400@mcs.anl.gov> References: <4679FBC8.1080606@mcs.anl.gov> <467A6D78.4020702@mcs.anl.gov> <467A7AC6.7020400@mcs.anl.gov> Message-ID: <467A7E17.5000207@mcs.anl.gov> Or maybe that is clear. I'd suggest that we want a tool that, after a run, one of us can run to: * Generate the three plots that Ioan has created * Generate a file containing as much information as we can about the run and its parameters--maybe a name=value format?--and some derived values such as those I mentioned in earlier email * Move these things to a known place * Create a Web page with pointers to these information and stick it somewhere [or add it to an existing web page?] Ian. Ian Foster wrote: > Mike: > > It seems important to define what the specific goals and milestones > are here, as it seems that simply asking for it doesn't get it done. > Perhaps we need a brief specification? > > Ian. > > Mike Wilde wrote: >> Yes, this is what Ganglia has been using. 
>> >> Regarding the auto-publishing - Jens has a machanism that regularly >> posted info in rrd format on the state of the VDS lab machines, using >> a perl mechanism like what Ian described. Perhaps we can find and >> adapt that for Ioan's numbers. >> It was running on gainly I think. But its not hard to develop from >> scratch. >> >> It would be good to see the same numbers for all the Swift apps being >> worked on, driven initially by kickstart summaries and digesting the >> swift logfile. >> We've long had this as a goal - now is a good time to push forward >> and do this. >> >> Nika and Tibi, could you work with Ioan on this? >> >> - Mike >> >> >> >> Ian Foster wrote, On 6/20/2007 11:17 PM: >>> Hi, >>> >>> I was pointed at http://oss.oetiker.ch/rrdtool/, has anyone seen >>> this? Seems nice to me. >>> >>> Ian. >>> >>> >> > -- Ian Foster, Director, Computation Institute Argonne National Laboratory & University of Chicago Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439 Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637 Tel: +1 630 252 4619. Web: www.ci.uchicago.edu. Globus Alliance: www.globus.org. From foster at mcs.anl.gov Thu Jun 21 10:14:59 2007 From: foster at mcs.anl.gov (Ian Foster) Date: Thu, 21 Jun 2007 10:14:59 -0500 Subject: [Swift-devel] Re: [Swft] Q about throttling In-Reply-To: References: <467A6610.1000103@mcs.anl.gov> <467A6E50.3060508@mcs.anl.gov> Message-ID: <467A95F3.6040603@mcs.anl.gov> My original question was whether we could turn throttling off altogether. I'm not sure if that was answered? 
Tiberiu Stef-Praun wrote: > I did not look very deep into the throttling, mainly because I have to > wait for my turn at using the Argonne cluster because of the large > reservations that Ioan does for MolDyn > > Anyway, here is my experience (which Ian asked me to write down, but > I'm still trying to improve on): > - whatever one asks from Falkon, one seems to get, with the caveat > that Falkon might release nodes when configured to look at an idle > timer. In the case of the Econ workflow, I had 26 long running jobs, > so I requested 30 nodes (which Falkon got for me) > - there is a swift config file, $DVS_HOME/libexec/scheduler.xml in > which I set > , but that seemed not to be > enough to get all my 26 jobs running at the same time (as illustrated > by the graphing of the Falkon log that Ioan showed me). > - there are some other throttling parameters in > $VDS_HOME/etc/swift.properties (which I also set to 30) > > The general observation is that I needed to modify the scheduler.xml > config file, and I need to set larger throttle values that the limit > of workers requested. > In the current scheme (simply add Falkon to Swift as a provider) the > Swift scheduler (the weighted site selection algorithm) adversely > influences the optimal execution of the workflow. > There might be other parameters to work with, but my opinion is that > we should use a different (non-throttling) scheduler in combination > with Falkon > > Tibi > > On 6/21/07, Mike Wilde wrote: >> Ive had the same question - it seems that throttling is also >> problematic for >> Tibi in the econ workflow. >> >> Tibi, since you have looked pretty deeply into it, could you write up a >> desription on how the algorithm works and how the parameters affect it. >> Mihael, when you are back on central time next week, could you work >> with TIbi on >> this? If its not already, this should be part of the Swift >> documentation. 
>> >> Then we should work on getting high-performance settings for the >> different >> runtime environments we use, in particular Falkon as Ian asks. >> >> - Mike >> >> >> Ian Foster wrote, On 6/21/2007 6:50 AM: >> > Hi, >> > >> > I don't fully understand how throttling works in Swift/Karajan. >> However, >> > I understand that even when using Falkon, we may be doing some >> > throttling. Is there a reason to do that in this case, given that >> Falkon >> > can maintain large numbers of tasks just fine? >> > >> > I ask this because in a recent MolDyn run, there seemed to be some >> > uncertainty as to whether throttling was slowing down job dispatch. If >> > we could turn it off altogether, that question would presumably go >> away. >> > >> > Ian. >> > >> > >> >> -- >> Mike Wilde >> Computation Institute, University of Chicago >> Math & Computer Science Division >> Argonne National Laboratory >> Argonne, IL 60439 USA >> tel 630-252-7497 fax 630-252-1997 >> > > -- Ian Foster, Director, Computation Institute Argonne National Laboratory & University of Chicago Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439 Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637 Tel: +1 630 252 4619. Web: www.ci.uchicago.edu. Globus Alliance: www.globus.org. From tiberius at ci.uchicago.edu Thu Jun 21 10:21:59 2007 From: tiberius at ci.uchicago.edu (Tiberiu Stef-Praun) Date: Thu, 21 Jun 2007 10:21:59 -0500 Subject: [Swift-devel] Re: [Swft] Q about throttling In-Reply-To: <467A95F3.6040603@mcs.anl.gov> References: <467A6610.1000103@mcs.anl.gov> <467A6E50.3060508@mcs.anl.gov> <467A95F3.6040603@mcs.anl.gov> Message-ID: On 6/21/07, Ian Foster wrote: > My original question was whether we could turn throttling off > altogether. I'm not sure if that was answered? I think the current answer is: just push way up the throttling limit. 
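To make the "push the limit way up" answer concrete, here is a toy model of a throttle as a counting semaphore. It is not Swift's or Karajan's scheduler (which, per the discussion, also does weighted site selection, and in Tibi's runs a limit of 30 was reportedly still not enough for 26 jobs); `peak_concurrency` and its parameters are made up for illustration. It only shows the basic idea that a limit below the job count caps concurrency, while a limit at or above the job count stops being the constraint in this simple model.

```python
import threading
import time


def peak_concurrency(num_jobs, throttle, job_seconds=0.05):
    """Run dummy jobs behind a semaphore; return the most seen running at once."""
    slots = threading.Semaphore(throttle)  # the "throttle": max jobs in flight
    lock = threading.Lock()
    running = 0
    peak = 0

    def job():
        nonlocal running, peak
        with slots:  # blocks while `throttle` jobs are already running
            with lock:
                running += 1
                peak = max(peak, running)
            time.sleep(job_seconds)  # stand-in for real work
            with lock:
                running -= 1

    threads = [threading.Thread(target=job) for _ in range(num_jobs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return peak
```

For example, `peak_concurrency(26, 10)` can never exceed 10, whereas with `throttle=30` the semaphore never blocks any of the 26 jobs, so only thread start-up timing determines how many overlap.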
> > Tiberiu Stef-Praun wrote: > > I did not look very deep into the throttling, mainly because I have to > > wait for my turn at using the Argonne cluster because of the large > > reservations that Ioan does for MolDyn > > > > Anyway, here is my experience (which Ian asked me to write down, but > > I'm still trying to improve on): > > - whatever one asks from Falkon, one seems to get, with the caveat > > that Falkon might release nodes when configured to look at an idle > > timer. In the case of the Econ workflow, I had 26 long running jobs, > > so I requested 30 nodes (which Falkon got for me) > > - there is a swift config file, $DVS_HOME/libexec/scheduler.xml in > > which I set > > , but that seemed not to be > > enough to get all my 26 jobs running at the same time (as illustrated > > by the graphing of the Falkon log that Ioan showed me). > > - there are some other throttling parameters in > > $VDS_HOME/etc/swift.properties (which I also set to 30) > > > > The general observation is that I needed to modify the scheduler.xml > > config file, and I need to set larger throttle values that the limit > > of workers requested. > > In the current scheme (simply add Falkon to Swift as a provider) the > > Swift scheduler (the weighted site selection algorithm) adversely > > influences the optimal execution of the workflow. > > There might be other parameters to work with, but my opinion is that > > we should use a different (non-throttling) scheduler in combination > > with Falkon > > > > Tibi > > > > On 6/21/07, Mike Wilde wrote: > >> Ive had the same question - it seems that throttling is also > >> problematic for > >> Tibi in the econ workflow. > >> > >> Tibi, since you have looked pretty deeply into it, could you write up a > >> desription on how the algorithm works and how the parameters affect it. > >> Mihael, when you are back on central time next week, could you work > >> with TIbi on > >> this? 
If its not already, this should be part of the Swift > >> documentation. > >> > >> Then we should work on getting high-performance settings for the > >> different > >> runtime environments we use, in particular Falkon as Ian asks. > >> > >> - Mike > >> > >> > >> Ian Foster wrote, On 6/21/2007 6:50 AM: > >> > Hi, > >> > > >> > I don't fully understand how throttling works in Swift/Karajan. > >> However, > >> > I understand that even when using Falkon, we may be doing some > >> > throttling. Is there a reason to do that in this case, given that > >> Falkon > >> > can maintain large numbers of tasks just fine? > >> > > >> > I ask this because in a recent MolDyn run, there seemed to be some > >> > uncertainty as to whether throttling was slowing down job dispatch. If > >> > we could turn it off altogether, that question would presumably go > >> away. > >> > > >> > Ian. > >> > > >> > > >> > >> -- > >> Mike Wilde > >> Computation Institute, University of Chicago > >> Math & Computer Science Division > >> Argonne National Laboratory > >> Argonne, IL 60439 USA > >> tel 630-252-7497 fax 630-252-1997 > >> > > > > > > -- > > Ian Foster, Director, Computation Institute > Argonne National Laboratory & University of Chicago > Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439 > Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637 > Tel: +1 630 252 4619. Web: www.ci.uchicago.edu. > Globus Alliance: www.globus.org. > > -- Tiberiu (Tibi) Stef-Praun, PhD Research Staff, Computation Institute 5640 S. Ellis Ave, #405 University of Chicago http://www-unix.mcs.anl.gov/~tiberius/ From benc at hawaga.org.uk Thu Jun 21 10:31:31 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 21 Jun 2007 15:31:31 +0000 (GMT) Subject: [Swift-devel] Re: [Swft] Q about throttling In-Reply-To: <467A95F3.6040603@mcs.anl.gov> References: <467A6610.1000103@mcs.anl.gov> <467A6E50.3060508@mcs.anl.gov> <467A95F3.6040603@mcs.anl.gov> Message-ID: But that isn't the base problem being investigated, right? 
On Thu, 21 Jun 2007, Ian Foster wrote: > My original question was whether we could turn throttling off altogether. I'm > not sure if that was answered? > > Tiberiu Stef-Praun wrote: > > I did not look very deep into the throttling, mainly because I have to > > wait for my turn at using the Argonne cluster because of the large > > reservations that Ioan does for MolDyn > > > > Anyway, here is my experience (which Ian asked me to write down, but > > I'm still trying to improve on): > > - whatever one asks from Falkon, one seems to get, with the caveat > > that Falkon might release nodes when configured to look at an idle > > timer. In the case of the Econ workflow, I had 26 long running jobs, > > so I requested 30 nodes (which Falkon got for me) > > - there is a swift config file, $DVS_HOME/libexec/scheduler.xml in which I > > set > > , but that seemed not to be > > enough to get all my 26 jobs running at the same time (as illustrated > > by the graphing of the Falkon log that Ioan showed me). > > - there are some other throttling parameters in > > $VDS_HOME/etc/swift.properties (which I also set to 30) > > > > The general observation is that I needed to modify the scheduler.xml > > config file, and I need to set larger throttle values that the limit > > of workers requested. > > In the current scheme (simply add Falkon to Swift as a provider) the > > Swift scheduler (the weighted site selection algorithm) adversely > > influences the optimal execution of the workflow. > > There might be other parameters to work with, but my opinion is that > > we should use a different (non-throttling) scheduler in combination > > with Falkon > > > > Tibi > > > > On 6/21/07, Mike Wilde wrote: > > > Ive had the same question - it seems that throttling is also problematic > > > for > > > Tibi in the econ workflow. > > > > > > Tibi, since you have looked pretty deeply into it, could you write up a > > > desription on how the algorithm works and how the parameters affect it. 
> > > Mihael, when you are back on central time next week, could you work with > > > TIbi on > > > this? If its not already, this should be part of the Swift documentation. > > > > > > Then we should work on getting high-performance settings for the different > > > runtime environments we use, in particular Falkon as Ian asks. > > > > > > - Mike > > > > > > > > > Ian Foster wrote, On 6/21/2007 6:50 AM: > > > > Hi, > > > > > > > > I don't fully understand how throttling works in Swift/Karajan. However, > > > > I understand that even when using Falkon, we may be doing some > > > > throttling. Is there a reason to do that in this case, given that Falkon > > > > can maintain large numbers of tasks just fine? > > > > > > > > I ask this because in a recent MolDyn run, there seemed to be some > > > > uncertainty as to whether throttling was slowing down job dispatch. If > > > > we could turn it off altogether, that question would presumably go away. > > > > > > > > Ian. > > > > > > > > > > > > > > -- > > > Mike Wilde > > > Computation Institute, University of Chicago > > > Math & Computer Science Division > > > Argonne National Laboratory > > > Argonne, IL 60439 USA > > > tel 630-252-7497 fax 630-252-1997 > > > > > > > > > From foster at mcs.anl.gov Thu Jun 21 10:37:22 2007 From: foster at mcs.anl.gov (Ian Foster) Date: Thu, 21 Jun 2007 10:37:22 -0500 Subject: [Swift-devel] Re: [Swft] Q about throttling In-Reply-To: References: <467A6610.1000103@mcs.anl.gov> <467A6E50.3060508@mcs.anl.gov> <467A95F3.6040603@mcs.anl.gov> Message-ID: <467A9B32.4030402@mcs.anl.gov> Well, if there is some concern that throttling might be a problem, then trying a run with it turned off seems good. I'm gathering from this exchange that this is not possible? Ben Clifford wrote: > But that isn't the base problem being investigated, right? > > On Thu, 21 Jun 2007, Ian Foster wrote: > > >> My original question was whether we could turn throttling off altogether. I'm >> not sure if that was answered? 
>> >> Tiberiu Stef-Praun wrote: >> >>> I did not look very deep into the throttling, mainly because I have to >>> wait for my turn at using the Argonne cluster because of the large >>> reservations that Ioan does for MolDyn >>> >>> Anyway, here is my experience (which Ian asked me to write down, but >>> I'm still trying to improve on): >>> - whatever one asks from Falkon, one seems to get, with the caveat >>> that Falkon might release nodes when configured to look at an idle >>> timer. In the case of the Econ workflow, I had 26 long running jobs, >>> so I requested 30 nodes (which Falkon got for me) >>> - there is a swift config file, $DVS_HOME/libexec/scheduler.xml in which I >>> set >>> , but that seemed not to be >>> enough to get all my 26 jobs running at the same time (as illustrated >>> by the graphing of the Falkon log that Ioan showed me). >>> - there are some other throttling parameters in >>> $VDS_HOME/etc/swift.properties (which I also set to 30) >>> >>> The general observation is that I needed to modify the scheduler.xml >>> config file, and I need to set larger throttle values that the limit >>> of workers requested. >>> In the current scheme (simply add Falkon to Swift as a provider) the >>> Swift scheduler (the weighted site selection algorithm) adversely >>> influences the optimal execution of the workflow. >>> There might be other parameters to work with, but my opinion is that >>> we should use a different (non-throttling) scheduler in combination >>> with Falkon >>> >>> Tibi >>> >>> On 6/21/07, Mike Wilde wrote: >>> >>>> Ive had the same question - it seems that throttling is also problematic >>>> for >>>> Tibi in the econ workflow. >>>> >>>> Tibi, since you have looked pretty deeply into it, could you write up a >>>> desription on how the algorithm works and how the parameters affect it. >>>> Mihael, when you are back on central time next week, could you work with >>>> TIbi on >>>> this? 
If its not already, this should be part of the Swift documentation. >>>> >>>> Then we should work on getting high-performance settings for the different >>>> runtime environments we use, in particular Falkon as Ian asks. >>>> >>>> - Mike >>>> >>>> >>>> Ian Foster wrote, On 6/21/2007 6:50 AM: >>>> >>>>> Hi, >>>>> >>>>> I don't fully understand how throttling works in Swift/Karajan. However, >>>>> I understand that even when using Falkon, we may be doing some >>>>> throttling. Is there a reason to do that in this case, given that Falkon >>>>> can maintain large numbers of tasks just fine? >>>>> >>>>> I ask this because in a recent MolDyn run, there seemed to be some >>>>> uncertainty as to whether throttling was slowing down job dispatch. If >>>>> we could turn it off altogether, that question would presumably go away. >>>>> >>>>> Ian. >>>>> >>>>> >>>>> >>>> -- >>>> Mike Wilde >>>> Computation Institute, University of Chicago >>>> Math & Computer Science Division >>>> Argonne National Laboratory >>>> Argonne, IL 60439 USA >>>> tel 630-252-7497 fax 630-252-1997 >>>> >>>> >>> >> > > -- Ian Foster, Director, Computation Institute Argonne National Laboratory & University of Chicago Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439 Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637 Tel: +1 630 252 4619. Web: www.ci.uchicago.edu. Globus Alliance: www.globus.org. -------------- next part -------------- An HTML attachment was scrubbed... URL: From benc at hawaga.org.uk Thu Jun 21 10:40:22 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 21 Jun 2007 15:40:22 +0000 (GMT) Subject: [Swift-devel] Re: [Swft] Q about throttling In-Reply-To: <467A9B32.4030402@mcs.anl.gov> References: <467A6610.1000103@mcs.anl.gov> <467A6E50.3060508@mcs.anl.gov> <467A95F3.6040603@mcs.anl.gov> <467A9B32.4030402@mcs.anl.gov> Message-ID: On Thu, 21 Jun 2007, Ian Foster wrote: > I'm gathering from this exchange that this is not possible? I have no idea. It doesn't seem to be documented. 
But the number one rule of tech support is don't take somebody else's partially solved problem. It would be good to see what is actually causing you to suspect that there's a throttling problem. -- From foster at mcs.anl.gov Thu Jun 21 10:47:57 2007 From: foster at mcs.anl.gov (Ian Foster) Date: Thu, 21 Jun 2007 10:47:57 -0500 Subject: [Swift-devel] Re: [Swft] Q about throttling In-Reply-To: References: <467A6610.1000103@mcs.anl.gov> <467A6E50.3060508@mcs.anl.gov> <467A95F3.6040603@mcs.anl.gov> <467A9B32.4030402@mcs.anl.gov> Message-ID: <467A9DAD.7060909@mcs.anl.gov> agreed Ben Clifford wrote: > On Thu, 21 Jun 2007, Ian Foster wrote: > > >> I'm gathering from this exchange that this is not possible? >> > > I have no idea. It doesn't seem to be documented. > > But the number one rule of tech support is don't take somebody else's > partially solved problem. It would be good to see what is actually causing > you to suspect that there's a throttling problem. > > -- Ian Foster, Director, Computation Institute Argonne National Laboratory & University of Chicago Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439 Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637 Tel: +1 630 252 4619. Web: www.ci.uchicago.edu. Globus Alliance: www.globus.org. -------------- next part -------------- An HTML attachment was scrubbed... URL: From benc at hawaga.org.uk Thu Jun 21 10:51:51 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 21 Jun 2007 15:51:51 +0000 (GMT) Subject: [Swift-devel] Re: [Swft] Q about throttling In-Reply-To: <467A9DAD.7060909@mcs.anl.gov> References: <467A6610.1000103@mcs.anl.gov> <467A6E50.3060508@mcs.anl.gov> <467A95F3.6040603@mcs.anl.gov> <467A9B32.4030402@mcs.anl.gov> <467A9DAD.7060909@mcs.anl.gov> Message-ID: actually, the graph that Tibi showed, which I think is pretty much the same as the graph from Ioan's gui visualizer thing, would be interesting to see for the present MolDyn runs. 
It was interesting to look at when wondering about the bug Yong fixed last week. On Thu, 21 Jun 2007, Ian Foster wrote: > agreed > > Ben Clifford wrote: > > On Thu, 21 Jun 2007, Ian Foster wrote: > > > > > > > I'm gathering from this exchange that this is not possible? > > > > > > > I have no idea. It doesn't seem to be documented. > > > > But the number one rule of tech support is don't take somebody else's > > partially solved problem. It would be good to see what is actually causing > > you to suspect that there's a throttling problem. > > > > > > From tiberius at ci.uchicago.edu Thu Jun 21 10:59:08 2007 From: tiberius at ci.uchicago.edu (Tiberiu Stef-Praun) Date: Thu, 21 Jun 2007 10:59:08 -0500 Subject: [Swift-devel] Re: [Swft] Q about throttling In-Reply-To: References: <467A6610.1000103@mcs.anl.gov> <467A6E50.3060508@mcs.anl.gov> <467A95F3.6040603@mcs.anl.gov> <467A9B32.4030402@mcs.anl.gov> Message-ID: Actually, Ioan pointed out to me that the last two jobs from the first batch are scheduled to start at time zero but have to wait for the first 24 to finish before getting a resource. (Red means queue time; green means execution time.) Ben, yes, the workflow consists of 4 logical stages, each of which has to complete before the next stage is executed. The graph was generated by Ioan, using Excel. He was showing me how to illustrate the information in the Falkon logs. On 6/21/07, Ben Clifford wrote: > > > On Thu, 21 Jun 2007, Ian Foster wrote: > > > I'm gathering from this exchange that this is not possible? > > I have no idea. It doesn't seem to be documented. > > But the number one rule of tech support is don't take somebody else's > partially solved problem. It would be good to see what is actually causing > you to suspect that there's a throttling problem. > > -- > -- Tiberiu (Tibi) Stef-Praun, PhD Research Staff, Computation Institute 5640 S.
Ellis Ave, #405 University of Chicago http://www-unix.mcs.anl.gov/~tiberius/ From foster at mcs.anl.gov Thu Jun 21 10:57:56 2007 From: foster at mcs.anl.gov (Ian Foster) Date: Thu, 21 Jun 2007 10:57:56 -0500 Subject: [Swift-devel] Re: [Swft] Q about throttling In-Reply-To: References: <467A6610.1000103@mcs.anl.gov> <467A6E50.3060508@mcs.anl.gov> <467A95F3.6040603@mcs.anl.gov> <467A9B32.4030402@mcs.anl.gov> <467A9DAD.7060909@mcs.anl.gov> Message-ID: <467AA004.4080601@mcs.anl.gov> See this document for a set of three graphs that Ioan produced: http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/100-Mol_MolDyn.pdf The first is the same as Tibi's, I think. The second and third are new. I want to have all three produced in a standard way for every application run. Ian. Ben Clifford wrote: > actually, the graph that Tibi showed, which I think is pretty much the > same as the graph from Ioan's gui visualizer thing, would be interesting > to see for the present MolDyn runs. > > It was interesting to look at when wondering about the bug Yong fixed last > week. > > On Thu, 21 Jun 2007, Ian Foster wrote: > > >> agreed >> >> Ben Clifford wrote: >> >>> On Thu, 21 Jun 2007, Ian Foster wrote: >>> >>> >>> >>>> I'm gathering from this exchange that this is not possible? >>>> >>>> >>> I have no idea. It doesn't seem to be documented. >>> >>> But the number one rule of tech support is don't take somebody else's >>> partially solved problem. It would be good to see what is actually causing >>> you to suspect that there's a throttling problem. >>> >>> >>> >> > > -- Ian Foster, Director, Computation Institute Argonne National Laboratory & University of Chicago Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439 Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637 Tel: +1 630 252 4619. Web: www.ci.uchicago.edu. Globus Alliance: www.globus.org. From benc at hawaga.org.uk Thu Jun 21 11:00:50 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 21 Jun 2007 16:00:50 +0000 (GMT) Subject: [Swift-devel] Re: [Swft] Q about throttling In-Reply-To: References: <467A6610.1000103@mcs.anl.gov> <467A6E50.3060508@mcs.anl.gov> <467A95F3.6040603@mcs.anl.gov> <467A9B32.4030402@mcs.anl.gov> Message-ID: On Thu, 21 Jun 2007, Tiberiu Stef-Praun wrote: > Actually Ioan pointed out to me that the last two jobs from the first > batch are scheduled to start at time zero but have to wait for the > first 24 to finish before getting on some resource. (the red color > means queue time. the green means execution time). > > Ben, yes, the worklow consists of 4 logical stages, each of which has > to complete before the next stage is being executed. So your chart indicates that everything is going 'just fine' rather than 'broken'? -- From wilde at mcs.anl.gov Thu Jun 21 11:03:34 2007 From: wilde at mcs.anl.gov (Mike Wilde) Date: Thu, 21 Jun 2007 11:03:34 -0500 Subject: [Swift-devel] Re: Swift Performance Data In-Reply-To: <467A7E17.5000207@mcs.anl.gov> References: <4679FBC8.1080606@mcs.anl.gov> <467A6D78.4020702@mcs.anl.gov> <467A7AC6.7020400@mcs.anl.gov> <467A7E17.5000207@mcs.anl.gov> Message-ID: <467AA156.6020802@mcs.anl.gov> OK, thanks for responding so quickly to your own request, Ian. :) I agree that we need this and that this is a good spec to start from. I've been pushing to do all runs from the Swift lab machines, where we can readily collect this info. We can create a set of scripts that collect the data in a uniform way and make it easy to send it to a central place. We'll start a campaign for this and move the discussion there. I think that the stats should get gathered by default as part of the swift execution command; we may need some provision for collecting the data if swift dies. Perhaps the swift shell wrapper can catch such errors and try to report in most cases.
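The wrapper idea above (gather stats by default, and still report if swift dies) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual swift shell wrapper: the function name `run_and_record`, the stats file layout, and the recorded fields are all hypothetical, and the name=value format simply follows the suggestion quoted later in this message.

```python
import subprocess
import sys
import time


def run_and_record(cmd, stats_path):
    """Run cmd and always write a name=value stats file, even if cmd fails.

    Hypothetical helper for illustration; not part of Swift.
    """
    stats = {"command": " ".join(cmd), "start_epoch": f"{time.time():.3f}"}
    t0 = time.monotonic()
    try:
        # A non-zero exit code is recorded rather than raised.
        stats["exit_code"] = str(subprocess.run(cmd).returncode)
    except OSError as exc:
        # The command could not be started at all (e.g. not found).
        stats["exit_code"] = "127"
        stats["error"] = str(exc)
    finally:
        # Written on success, failure, or interruption alike.
        stats["wall_seconds"] = f"{time.monotonic() - t0:.3f}"
        with open(stats_path, "w") as fh:
            for name, value in stats.items():
                fh.write(f"{name}={value}\n")
    return stats


if __name__ == "__main__":
    # Placeholder command: substitute the real workflow invocation.
    run_and_record([sys.executable, "-c", "print('workflow goes here')"],
                   "run-stats.txt")
```

Because the file is written in a `finally` block, a crashed or missing command still leaves a record behind that a collection script could pick up and send to a central place.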
I believe we also want to run everything under kickstart. It's been hard to get traction on that, but we should discuss it and keep pushing on this. We need to designate one person to lead on this and certainly others to contribute. I hesitate to say "jump" until we review current work in progress and everyone's todo list. Mike Ian Foster wrote, On 6/21/2007 8:33 AM: > Or maybe that is clear. I'd suggest that we want a tool that, after a > run, one of us can run to: > > * Generate the three plots that Ioan has created > * Generate a file containing as much information as we can about the run > and its parameters--maybe a name=value format?--and some derived values > such as those I mentioned in earlier email > * Move these things to a known place > * Create a Web page with pointers to these information and stick it > somewhere [or add it to an existing web page?] > > Ian. > > > > Ian Foster wrote: >> Mike: >> >> It seems important to define what the specific goals and milestones >> are here, as it seems that simply asking for it doesn't get it done. >> Perhaps we need a brief specification? >> >> Ian. >> >> Mike Wilde wrote: >>> Yes, this is what Ganglia has been using. >>> >>> Regarding the auto-publishing - Jens has a machanism that regularly >>> posted info in rrd format on the state of the VDS lab machines, using >>> a perl mechanism like what Ian described. Perhaps we can find and >>> adapt that for Ioan's numbers. >>> It was running on gainly I think. But its not hard to develop from >>> scratch. >>> >>> It would be good to see the same numbers for all the Swift apps being >>> worked on, driven initially by kickstart summaries and digesting the >>> swift logfile. >>> We've long had this as a goal - now is a good time to push forward >>> and do this. >>> >>> Nika and Tibi, could you work with Ioan on this?
>>> >>> - Mike >>> >>> >>> >>> Ian Foster wrote, On 6/20/2007 11:17 PM: >>>> Hi, >>>> >>>> I was pointed at http://oss.oetiker.ch/rrdtool/, has anyone seen >>>> this? Seems nice to me. >>>> >>>> Ian. >>>> >>>> >>> >> > -- Mike Wilde Computation Institute, University of Chicago Math & Computer Science Division Argonne National Laboratory Argonne, IL 60439 USA tel 630-252-7497 fax 630-252-1997 From tiberius at ci.uchicago.edu Thu Jun 21 11:04:35 2007 From: tiberius at ci.uchicago.edu (Tiberiu Stef-Praun) Date: Thu, 21 Jun 2007 11:04:35 -0500 Subject: [Swift-devel] Re: [Swft] Q about throttling In-Reply-To: References: <467A6610.1000103@mcs.anl.gov> <467A6E50.3060508@mcs.anl.gov> <467A95F3.6040603@mcs.anl.gov> <467A9B32.4030402@mcs.anl.gov> Message-ID: No. My chart shows that if I had two more machines during the first stage run (the first 26 jobs), I would have avoided a long wait (500000 ms, or about 9 minutes) for the last two jobs from the first batch to finish. This is why I need to redo the Econ run with a different throttle value for Swift. Tibi On 6/21/07, Ben Clifford wrote: > > > On Thu, 21 Jun 2007, Tiberiu Stef-Praun wrote: > > > Actually Ioan pointed out to me that the last two jobs from the first > > batch are scheduled to start at time zero but have to wait for the > > first 24 to finish before getting on some resource. (the red color > > means queue time. the green means execution time). > > > > Ben, yes, the worklow consists of 4 logical stages, each of which has > > to complete before the next stage is being executed. > > so your chart indicates that everything is going 'just fine' rather than > 'broken' ? > > -- > -- Tiberiu (Tibi) Stef-Praun, PhD Research Staff, Computation Institute 5640 S.
Ellis Ave, #405 University of Chicago http://www-unix.mcs.anl.gov/~tiberius/ From benc at hawaga.org.uk Thu Jun 21 11:07:32 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 21 Jun 2007 16:07:32 +0000 (GMT) Subject: [Swift-devel] Re: [Swft] Q about throttling In-Reply-To: <467AA004.4080601@mcs.anl.gov> References: <467A6610.1000103@mcs.anl.gov> <467A6E50.3060508@mcs.anl.gov> <467A95F3.6040603@mcs.anl.gov> <467A9B32.4030402@mcs.anl.gov> <467A9DAD.7060909@mcs.anl.gov> <467AA004.4080601@mcs.anl.gov> Message-ID: neat. when was the run made that generated those graphs? The submits seem to be going through at about 1/sec in the 1000..2000s time range. Is that the bit that is the problem? On Thu, 21 Jun 2007, Ian Foster wrote: > See this document for a set of three graphs that Ioan produced: > > http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/100-Mol_MolDyn.pdf > > The first is the same as Tibi's, I think. The second and third are new. I want > to have all three produced in a standard way for every application run. > > Ian. > > > Ben Clifford wrote: > > actually, the graph that Tibi showed, which I think is pretty much the same > > as the graph from Ioan's gui visualizer thing, would be interesting to see > > for the present MolDyn runs. > > > > It was interesting to look at when wondering about the bug Yong fixed last > > week. > > > > On Thu, 21 Jun 2007, Ian Foster wrote: > > > > > > > agreed > > > > > > Ben Clifford wrote: > > > > > > > On Thu, 21 Jun 2007, Ian Foster wrote: > > > > > > > > > > > > > I'm gathering from this exchange that this is not possible? > > > > > > > > > I have no idea. It doesn't seem to be documented. > > > > > > > > But the number one rule of tech support is don't take somebody else's > > > > partially solved problem. It would be good to see what is actually > > > > causing > > > > you to suspect that there's a throttling problem. 
> > > > > > > > > > > > > > > > > From benc at hawaga.org.uk Thu Jun 21 11:08:54 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 21 Jun 2007 16:08:54 +0000 (GMT) Subject: [Swift-devel] Re: [Swft] Q about throttling In-Reply-To: References: <467A6610.1000103@mcs.anl.gov> <467A6E50.3060508@mcs.anl.gov> <467A95F3.6040603@mcs.anl.gov> <467A9B32.4030402@mcs.anl.gov> Message-ID: On Thu, 21 Jun 2007, Tiberiu Stef-Praun wrote: > No > My chart shows that if I had two more machines during the first stage > run (the first 26 jobs), I would have avoided a long wait (50000 ms , > or about 9 minutes) for the last two jobs from the first batch to > finish. > This is why I need to redo the Econ run, with a different throttle > value for Swift. So you are saying that changing the 'throttle value for swift' will allocate more machines for you? -- From benc at hawaga.org.uk Thu Jun 21 11:12:35 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 21 Jun 2007 16:12:35 +0000 (GMT) Subject: [Swift-devel] Re: Swift Performance Data In-Reply-To: <467AA156.6020802@mcs.anl.gov> References: <4679FBC8.1080606@mcs.anl.gov> <467A6D78.4020702@mcs.anl.gov> <467A7AC6.7020400@mcs.anl.gov> <467A7E17.5000207@mcs.anl.gov> <467AA156.6020802@mcs.anl.gov> Message-ID: On Thu, 21 Jun 2007, Mike Wilde wrote: > I believe we also want to run everything under kickstart. Its been hard to get > traction on that but we should discuss and keep on pushing on this. There was once an idea to have kickstart installed by our group on every machine on which we commonly submit jobs to (maybe as part of the 'getting swift running on each site' campaign that resulted in the present site catalog). But that doesn't seem to be the way thing are now so maybe it never got written down. 
Putting installs in place and pointing the default as-distributed site catalog at those installs seems a relatively straightforward things to do (at least for most sites, and especially the OSG ones, which often have kickstart installed as part of their standard software stack). -- From tiberius at ci.uchicago.edu Thu Jun 21 11:25:24 2007 From: tiberius at ci.uchicago.edu (Tiberiu Stef-Praun) Date: Thu, 21 Jun 2007 11:25:24 -0500 Subject: [Swift-devel] Re: [Swft] Q about throttling In-Reply-To: References: <467A6610.1000103@mcs.anl.gov> <467A95F3.6040603@mcs.anl.gov> <467A9B32.4030402@mcs.anl.gov> Message-ID: No I'm saying that swift throttle value will allow me to make full use of all the nodes that Falkon makes available for me. I know that I had 26 jobs to be run, and I requested (and had) 30 nodes in the cluster. Somehow only 24 jobs run in the first time, so I'm going to push up the throttle value in Swift Tibi On 6/21/07, Ben Clifford wrote: > > > > On Thu, 21 Jun 2007, Tiberiu Stef-Praun wrote: > > > No > > My chart shows that if I had two more machines during the first stage > > run (the first 26 jobs), I would have avoided a long wait (50000 ms , > > or about 9 minutes) for the last two jobs from the first batch to > > finish. > > This is why I need to redo the Econ run, with a different throttle > > value for Swift. > > So you are saying that changing the 'throttle value for swift' will > allocate more machines for you? > > -- > -- Tiberiu (Tibi) Stef-Praun, PhD Research Staff, Computation Institute 5640 S. 
Ellis Ave, #405 University of Chicago http://www-unix.mcs.anl.gov/~tiberius/ From benc at hawaga.org.uk Thu Jun 21 11:30:09 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 21 Jun 2007 16:30:09 +0000 (GMT) Subject: [Swift-devel] Re: [Swft] Q about throttling In-Reply-To: References: <467A6610.1000103@mcs.anl.gov> <467A95F3.6040603@mcs.anl.gov> <467A9B32.4030402@mcs.anl.gov> Message-ID: My interpretation of the graph is: The two jobs that didn't get run till later (the 'spare' jobs) are submitted into Falkon at approx t=0, along with the 24 'run straight away' jobs. Swift isn't holding them back. Falkon indicates that it is aware of them from approx t=0 but doesn't run them until t=500000. That means, I think, that they're getting into Falkon's queue right at the start, and it's something in how Falkon places them onto worker nodes that isn't right here. At least that's my first impression. On Thu, 21 Jun 2007, Tiberiu Stef-Praun wrote: > No > I'm saying that swift throttle value will allow me to make full use of > all the nodes that Falkon makes available for me. I know that I had 26 > jobs to be run, and I requested (and had) 30 nodes in the cluster. > Somehow only 24 jobs run in the first time, so I'm going to push up > the throttle value in Swift > > Tibi > > > On 6/21/07, Ben Clifford wrote: > > > > > > > > On Thu, 21 Jun 2007, Tiberiu Stef-Praun wrote: > > > > > No > > > My chart shows that if I had two more machines during the first stage > > > run (the first 26 jobs), I would have avoided a long wait (50000 ms , > > > or about 9 minutes) for the last two jobs from the first batch to > > > finish. > > > This is why I need to redo the Econ run, with a different throttle > > > value for Swift. > > > > So you are saying that changing the 'throttle value for swift' will > > allocate more machines for you?
> > > > -- > > > > > From nefedova at mcs.anl.gov Thu Jun 21 11:32:55 2007 From: nefedova at mcs.anl.gov (Veronika Nefedova) Date: Thu, 21 Jun 2007 11:32:55 -0500 Subject: [Swift-devel] Re: [Swft] Q about throttling In-Reply-To: References: <467A6610.1000103@mcs.anl.gov> <467A95F3.6040603@mcs.anl.gov> <467A9B32.4030402@mcs.anl.gov> Message-ID: There are two throttle parameters you might want to check. One is in swift.properties called throttle.submit and one in scheduler.xml called jobThrottle. I am curious whats the difference between them? Nika On Jun 21, 2007, at 11:25 AM, Tiberiu Stef-Praun wrote: > No > I'm saying that swift throttle value will allow me to make full use of > all the nodes that Falkon makes available for me. I know that I had 26 > jobs to be run, and I requested (and had) 30 nodes in the cluster. > Somehow only 24 jobs run in the first time, so I'm going to push up > the throttle value in Swift > > Tibi > > > On 6/21/07, Ben Clifford wrote: >> >> >> >> On Thu, 21 Jun 2007, Tiberiu Stef-Praun wrote: >> >> > No >> > My chart shows that if I had two more machines during the first >> stage >> > run (the first 26 jobs), I would have avoided a long wait (50000 >> ms , >> > or about 9 minutes) for the last two jobs from the first batch to >> > finish. >> > This is why I need to redo the Econ run, with a different throttle >> > value for Swift. >> >> So you are saying that changing the 'throttle value for swift' will >> allocate more machines for you? >> >> -- >> > > > -- > Tiberiu (Tibi) Stef-Praun, PhD > Research Staff, Computation Institute > 5640 S. 
Ellis Ave, #405 > University of Chicago > http://www-unix.mcs.anl.gov/~tiberius/ > From tiberius at ci.uchicago.edu Thu Jun 21 11:34:42 2007 From: tiberius at ci.uchicago.edu (Tiberiu Stef-Praun) Date: Thu, 21 Jun 2007 11:34:42 -0500 Subject: [Swift-devel] Re: [Swft] Q about throttling In-Reply-To: References: <467A6610.1000103@mcs.anl.gov> <467A9B32.4030402@mcs.anl.gov> Message-ID: I did have them both with values of 30. On 6/21/07, Veronika Nefedova wrote: > There are two throttle parameters you might want to check. One is in > swift.properties called throttle.submit and one in scheduler.xml > called jobThrottle. > I am curious whats the difference between them? > > Nika > > On Jun 21, 2007, at 11:25 AM, Tiberiu Stef-Praun wrote: > > > No > > I'm saying that swift throttle value will allow me to make full use of > > all the nodes that Falkon makes available for me. I know that I had 26 > > jobs to be run, and I requested (and had) 30 nodes in the cluster. > > Somehow only 24 jobs run in the first time, so I'm going to push up > > the throttle value in Swift > > > > Tibi > > > > > > On 6/21/07, Ben Clifford wrote: > >> > >> > >> > >> On Thu, 21 Jun 2007, Tiberiu Stef-Praun wrote: > >> > >> > No > >> > My chart shows that if I had two more machines during the first > >> stage > >> > run (the first 26 jobs), I would have avoided a long wait (50000 > >> ms , > >> > or about 9 minutes) for the last two jobs from the first batch to > >> > finish. > >> > This is why I need to redo the Econ run, with a different throttle > >> > value for Swift. > >> > >> So you are saying that changing the 'throttle value for swift' will > >> allocate more machines for you? > >> > >> -- > >> > > > > > > -- > > Tiberiu (Tibi) Stef-Praun, PhD > > Research Staff, Computation Institute > > 5640 S. Ellis Ave, #405 > > University of Chicago > > http://www-unix.mcs.anl.gov/~tiberius/ > > > > -- Tiberiu (Tibi) Stef-Praun, PhD Research Staff, Computation Institute 5640 S. 
Ellis Ave, #405 University of Chicago http://www-unix.mcs.anl.gov/~tiberius/ From tiberius at ci.uchicago.edu Thu Jun 21 11:36:29 2007 From: tiberius at ci.uchicago.edu (Tiberiu Stef-Praun) Date: Thu, 21 Jun 2007 11:36:29 -0500 Subject: [Swift-devel] Re: [Swft] Q about throttling In-Reply-To: References: <467A6610.1000103@mcs.anl.gov> <467A9B32.4030402@mcs.anl.gov> Message-ID: So the other thing that might have happened is that Falkon quietly released some of the nodes (even though I requested a minimum of 30 nodes and a maximum of 50) On 6/21/07, Ben Clifford wrote: > > My interpretation of the graph is: > > The two jobs that didn't get run till later (the 'spare' jobs) are > submitted into falkon at approx t=0, along with the 24 'run straight away' > jobs. > > Swift isn't holding them back. > > Falkon indicates that it is aware of them from approx time = 0 but doesn't > run them until t=500000. > > That means, I think, that they're getting into Falkons queue right at the > start, and its something happening with how Falkon places them onto worker > nodes that isn't right here. > > At least that's my first impression. > > On Thu, 21 Jun 2007, Tiberiu Stef-Praun wrote: > > > No > > I'm saying that swift throttle value will allow me to make full use of > > all the nodes that Falkon makes available for me. I know that I had 26 > > jobs to be run, and I requested (and had) 30 nodes in the cluster. > > Somehow only 24 jobs run in the first time, so I'm going to push up > > the throttle value in Swift > > > > Tibi > > > > > > On 6/21/07, Ben Clifford wrote: > > > > > > > > > > > > On Thu, 21 Jun 2007, Tiberiu Stef-Praun wrote: > > > > > > > No > > > > My chart shows that if I had two more machines during the first stage > > > > run (the first 26 jobs), I would have avoided a long wait (50000 ms , > > > > or about 9 minutes) for the last two jobs from the first batch to > > > > finish. 
> > > > This is why I need to redo the Econ run, with a different throttle > > > > value for Swift. > > > > > > So you are saying that changing the 'throttle value for swift' will > > > allocate more machines for you? > > > > > > -- > > > > > > > > > > -- Tiberiu (Tibi) Stef-Praun, PhD Research Staff, Computation Institute 5640 S. Ellis Ave, #405 University of Chicago http://www-unix.mcs.anl.gov/~tiberius/ From benc at hawaga.org.uk Thu Jun 21 11:37:36 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 21 Jun 2007 16:37:36 +0000 (GMT) Subject: [Swift-devel] Re: [Swft] Q about throttling In-Reply-To: References: <467A6610.1000103@mcs.anl.gov> <467A95F3.6040603@mcs.anl.gov> <467A9B32.4030402@mcs.anl.gov> Message-ID: Have a look at this thread: Date: Fri, 04 May 2007 09:04:44 -0500 From: Mihael Hategan To: Veronika V. Nefedova Cc: Ben Clifford , swift-devel at ci.uchicago.edu Subject: Re: [Swift-devel] limiting simultaneous jobs using the local provider. There was a bit of discussion there. On Thu, 21 Jun 2007, Veronika Nefedova wrote: > There are two throttle parameters you might want to check. One is in > swift.properties called throttle.submit and one in scheduler.xml called > jobThrottle. > I am curious whats the difference between them? > > Nika > > On Jun 21, 2007, at 11:25 AM, Tiberiu Stef-Praun wrote: > > > No > > I'm saying that swift throttle value will allow me to make full use of > > all the nodes that Falkon makes available for me. I know that I had 26 > > jobs to be run, and I requested (and had) 30 nodes in the cluster. 
> > Somehow only 24 jobs run in the first time, so I'm going to push up > > the throttle value in Swift > > > > Tibi > > > > > > On 6/21/07, Ben Clifford wrote: > > > > > > > > > > > > On Thu, 21 Jun 2007, Tiberiu Stef-Praun wrote: > > > > > > > No > > > > My chart shows that if I had two more machines during the first stage > > > > run (the first 26 jobs), I would have avoided a long wait (50000 ms , > > > > or about 9 minutes) for the last two jobs from the first batch to > > > > finish. > > > > This is why I need to redo the Econ run, with a different throttle > > > > value for Swift. > > > > > > So you are saying that changing the 'throttle value for swift' will > > > allocate more machines for you? > > > > > > -- > > > > > > > > > -- > > Tiberiu (Tibi) Stef-Praun, PhD > > Research Staff, Computation Institute > > 5640 S. Ellis Ave, #405 > > University of Chicago > > http://www-unix.mcs.anl.gov/~tiberius/ > > > From benc at hawaga.org.uk Thu Jun 21 11:40:33 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 21 Jun 2007 16:40:33 +0000 (GMT) Subject: [Swift-devel] Re: [Swft] Q about throttling In-Reply-To: References: <467A6610.1000103@mcs.anl.gov> <467A9B32.4030402@mcs.anl.gov> Message-ID: Its basically the same thing that is happening with the 6 steps in the 1200000 .. 1500000 range. The graph is (slightly) too low resolution for me to count the number of jobs in each of those steps. (note to infographic producer - make chart use exactly two horizontal pixels per job for this scale of run) On Thu, 21 Jun 2007, Tiberiu Stef-Praun wrote: > So the other thing that might have happened is that Falkon quietly > released some of the nodes (even though I requested a minimum of 30 > nodes and a maximum of 50) > > > On 6/21/07, Ben Clifford wrote: > > > > My interpretation of the graph is: > > > > The two jobs that didn't get run till later (the 'spare' jobs) are > > submitted into falkon at approx t=0, along with the 24 'run straight away' > > jobs. 
> > > > Swift isn't holding them back. > > > > Falkon indicates that it is aware of them from approx time = 0 but doesn't > > run them until t=500000. > > > > That means, I think, that they're getting into Falkons queue right at the > > start, and its something happening with how Falkon places them onto worker > > nodes that isn't right here. > > > > At least that's my first impression. > > > > On Thu, 21 Jun 2007, Tiberiu Stef-Praun wrote: > > > > > No > > > I'm saying that swift throttle value will allow me to make full use of > > > all the nodes that Falkon makes available for me. I know that I had 26 > > > jobs to be run, and I requested (and had) 30 nodes in the cluster. > > > Somehow only 24 jobs run in the first time, so I'm going to push up > > > the throttle value in Swift > > > > > > Tibi > > > > > > > > > On 6/21/07, Ben Clifford wrote: > > > > > > > > > > > > > > > > On Thu, 21 Jun 2007, Tiberiu Stef-Praun wrote: > > > > > > > > > No > > > > > My chart shows that if I had two more machines during the first stage > > > > > run (the first 26 jobs), I would have avoided a long wait (50000 ms , > > > > > or about 9 minutes) for the last two jobs from the first batch to > > > > > finish. > > > > > This is why I need to redo the Econ run, with a different throttle > > > > > value for Swift. > > > > > > > > So you are saying that changing the 'throttle value for swift' will > > > > allocate more machines for you? > > > > > > > > -- > > > > > > > > > > > > > > > > > > From benc at hawaga.org.uk Thu Jun 21 11:43:52 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 21 Jun 2007 16:43:52 +0000 (GMT) Subject: [Swift-devel] Re: [Swft] Q about throttling In-Reply-To: References: <467A6610.1000103@mcs.anl.gov> <467A9B32.4030402@mcs.anl.gov> Message-ID: might be useful to have debug-level throttle messages when cog/swift decides each throttle limit has been reached. 
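As a rough illustration of the debug-message suggestion above, the sketch below shows a throttle that logs at debug level each time its limit holds a job back. It is an assumption-laden toy, not how cog/Swift actually implements `jobThrottle` or `throttle.submit`; the class name `LoggingThrottle` and its methods are hypothetical.

```python
import logging
import threading

log = logging.getLogger("throttle")


class LoggingThrottle:
    """Cap the number of active jobs; log at DEBUG each time the cap is hit.

    Hypothetical illustration only; not the cog/Swift throttle code.
    """

    def __init__(self, name, limit):
        self.name = name      # label only, e.g. "jobThrottle"
        self.limit = limit    # maximum number of concurrently active jobs
        self.active = 0
        self._lock = threading.Lock()

    def try_acquire(self, job_id):
        """Return True if job_id may start now; else log why and return False."""
        with self._lock:
            if self.active >= self.limit:
                log.debug("throttle %s reached (limit=%d, active=%d); holding job %s",
                          self.name, self.limit, self.active, job_id)
                return False
            self.active += 1
            return True

    def release(self):
        """Mark one job as finished, freeing a slot."""
        with self._lock:
            if self.active > 0:
                self.active -= 1
```

With `logging.basicConfig(level=logging.DEBUG)` enabled, a throttle built with a limit of 24 and offered 26 jobs would emit two "holding job" lines, which is the kind of evidence that would show from the log alone whether a throttle or something downstream is delaying jobs.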
-- From foster at mcs.anl.gov Thu Jun 21 13:43:04 2007 From: foster at mcs.anl.gov (Ian Foster) Date: Thu, 21 Jun 2007 13:43:04 -0500 Subject: [Swift-devel] Re: Swift Performance Data In-Reply-To: References: <4679FBC8.1080606@mcs.anl.gov> <467A6D78.4020702@mcs.anl.gov> <467A7AC6.7020400@mcs.anl.gov> <467A7E17.5000207@mcs.anl.gov> <467AA156.6020802@mcs.anl.gov> Message-ID: <467AC6B8.9060905@mcs.anl.gov> It is essential that we have kickstart (or the equivalent Falkon thing) running everywhere. Ian. Ben Clifford wrote: > On Thu, 21 Jun 2007, Mike Wilde wrote: > > >> I believe we also want to run everything under kickstart. Its been hard to get >> traction on that but we should discuss and keep on pushing on this. >> > > There was once an idea to have kickstart installed by our group on every > machine on which we commonly submit jobs to (maybe as part of the 'getting > swift running on each site' campaign that resulted in the present site > catalog). But that doesn't seem to be the way thing are now so maybe it > never got written down. > > Putting installs in place and pointing the default as-distributed site > catalog at those installs seems a relatively straightforward things to do > (at least for most sites, and especially the OSG ones, which often have > kickstart installed as part of their standard software stack). > > -- Ian Foster, Director, Computation Institute Argonne National Laboratory & University of Chicago Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439 Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637 Tel: +1 630 252 4619. Web: www.ci.uchicago.edu. Globus Alliance: www.globus.org. From iraicu at cs.uchicago.edu Thu Jun 21 21:53:58 2007 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Thu, 21 Jun 2007 21:53:58 -0500 Subject: [Swift-devel] Re: [Swft] Q about throttling In-Reply-To: References: <467A6610.1000103@mcs.anl.gov> <467A6E50.3060508@mcs.anl.gov> <467A95F3.6040603@mcs.anl.gov> <467A9B32.4030402@mcs.anl.gov> <467A9DAD.7060909@mcs.anl.gov> <467AA004.4080601@mcs.anl.gov> Message-ID: <467B39C6.8050104@cs.uchicago.edu> Sorry to jump in on the discussion late. Here are my thoughts on this issue: Actually, what you are seeing is a completion rate of about 1/sec while all 200 executors were busy. This follows from the length of each task: each took about 200 seconds on an executor, and we had 200 executors to process them, so we get 200 executors / 200 sec per task, or 1 task/sec. So this was perfectly normal. The part that is not normal is around time 5000+ sec (where the red disappears and only green is found): there, only about 90 executors were kept busy, and the Falkon queue length was close to 0, which means that Swift was not submitting fast enough to keep all the executors busy. If Swift's submission rate had been higher, I would have expected to see a little bit of red before each green bar throughout the graph. Perhaps the Swift rate of submission was lower due to the dependencies in the workflow, but as I stated in a previous email, the red queue time should have continued until about task # 9600 (301 [first 3 stages] + 6800 [4th stage] + 2500 [failed tasks])... Ioan Ben Clifford wrote: > neat. when was the run made that generated those graphs? > > The submits seem to be going through at about 1/sec in the 1000..2000s > time range. Is that the bit that is the problem? > > On Thu, 21 Jun 2007, Ian Foster wrote: > > >> See this document for a set of three graphs that Ioan produced: >> >> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/100-Mol_MolDyn.pdf >> >> The first is the same as Tibi's, I think.
The second and third are new. I want >> to have all three produced in a standard way for every application run. >> >> Ian. >> >> >> Ben Clifford wrote: >> >>> actually, the graph that Tibi showed, which I think is pretty much the same >>> as the graph from Ioan's gui visualizer thing, would be interesting to see >>> for the present MolDyn runs. >>> >>> It was interesting to look at when wondering about the bug Yong fixed last >>> week. >>> >>> On Thu, 21 Jun 2007, Ian Foster wrote: >>> >>> >>> >>>> agreed >>>> >>>> Ben Clifford wrote: >>>> >>>> >>>>> On Thu, 21 Jun 2007, Ian Foster wrote: >>>>> >>>>> >>>>> >>>>>> I'm gathering from this exchange that this is not possible? >>>>>> >>>>>> >>>>> I have no idea. It doesn't seem to be documented. >>>>> >>>>> But the number one rule of tech support is don't take somebody else's >>>>> partially solved problem. It would be good to see what is actually >>>>> causing >>>>> you to suspect that there's a throttling problem. >>>>> >>>>> >>>>> >>>> >>>> >>> >>> >> > > -- ============================================ Ioan Raicu Ph.D. Student ============================================ Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 ============================================ Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dsl.cs.uchicago.edu/ ============================================ ============================================ -------------- next part -------------- An HTML attachment was scrubbed... 
URL: 

From iraicu at cs.uchicago.edu Thu Jun 21 21:54:14 2007
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Thu, 21 Jun 2007 21:54:14 -0500
Subject: [Swift-devel] Re: [Swft] Q about throttling
In-Reply-To: 
References: <467A6610.1000103@mcs.anl.gov> <467A95F3.6040603@mcs.anl.gov> <467A9B32.4030402@mcs.anl.gov>
Message-ID: <467B39D6.4010801@cs.uchicago.edu>

I think Ben is right: in this particular instance, Swift submitted all 26 jobs, and Falkon dispatched 24 of them and held 2 of them in the wait queue. Throttling was not the issue here. At first glance, I would say that although you asked for 30 nodes at the beginning, you might have lost some because the idle time limit was reached, and hence when you started the 26 jobs, you only had 24 executors ready. Can you send me these two logs:

service/logs/GenericPortalWS_perf_per_sec.log
service/logs/GenericPortalWS_taskPerf.log

I will try to superimpose the number of busy and free executors on top of the graph you sent out showing the per-task information.

Ioan

Ben Clifford wrote:
> My interpretation of the graph is:
>
> The two jobs that didn't get run till later (the 'spare' jobs) are
> submitted into falkon at approx t=0, along with the 24 'run straight away'
> jobs.
>
> Swift isn't holding them back.
>
> Falkon indicates that it is aware of them from approx time = 0 but doesn't
> run them until t=500000.
>
> That means, I think, that they're getting into Falkons queue right at the
> start, and its something happening with how Falkon places them onto worker
> nodes that isn't right here.
>
> At least that's my first impression.
>
> On Thu, 21 Jun 2007, Tiberiu Stef-Praun wrote:
>
>
>> No
>> I'm saying that swift throttle value will allow me to make full use of
>> all the nodes that Falkon makes available for me. I know that I had 26
>> jobs to be run, and I requested (and had) 30 nodes in the cluster.
>> Somehow only 24 jobs run in the first time, so I'm going to push up >> the throttle value in Swift >> >> Tibi >> >> >> On 6/21/07, Ben Clifford wrote: >> >>> >>> On Thu, 21 Jun 2007, Tiberiu Stef-Praun wrote: >>> >>> >>>> No >>>> My chart shows that if I had two more machines during the first stage >>>> run (the first 26 jobs), I would have avoided a long wait (50000 ms , >>>> or about 9 minutes) for the last two jobs from the first batch to >>>> finish. >>>> This is why I need to redo the Econ run, with a different throttle >>>> value for Swift. >>>> >>> So you are saying that changing the 'throttle value for swift' will >>> allocate more machines for you? >>> >>> -- >>> >>> >> >> > > -- ============================================ Ioan Raicu Ph.D. Student ============================================ Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 ============================================ Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dsl.cs.uchicago.edu/ ============================================ ============================================ -------------- next part -------------- An HTML attachment was scrubbed... URL: From iraicu at cs.uchicago.edu Thu Jun 21 21:54:21 2007 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Thu, 21 Jun 2007 21:54:21 -0500 Subject: [Swift-devel] Re: [Swft] Q about throttling In-Reply-To: References: <467A6610.1000103@mcs.anl.gov> <467A9B32.4030402@mcs.anl.gov> Message-ID: <467B39DD.80003@cs.uchicago.edu> Right, I think that is what happened. Send me the logs I asked for in a previous email, and we can plot both files on the same graph, and we will have the answer! 
Ioan Tiberiu Stef-Praun wrote: > So the other thing that might have happened is that Falkon quietly > released some of the nodes (even though I requested a minimum of 30 > nodes and a maximum of 50) > > > On 6/21/07, Ben Clifford wrote: >> >> My interpretation of the graph is: >> >> The two jobs that didn't get run till later (the 'spare' jobs) are >> submitted into falkon at approx t=0, along with the 24 'run straight >> away' >> jobs. >> >> Swift isn't holding them back. >> >> Falkon indicates that it is aware of them from approx time = 0 but >> doesn't >> run them until t=500000. >> >> That means, I think, that they're getting into Falkons queue right at >> the >> start, and its something happening with how Falkon places them onto >> worker >> nodes that isn't right here. >> >> At least that's my first impression. >> >> On Thu, 21 Jun 2007, Tiberiu Stef-Praun wrote: >> >> > No >> > I'm saying that swift throttle value will allow me to make full use of >> > all the nodes that Falkon makes available for me. I know that I had 26 >> > jobs to be run, and I requested (and had) 30 nodes in the cluster. >> > Somehow only 24 jobs run in the first time, so I'm going to push up >> > the throttle value in Swift >> > >> > Tibi >> > >> > >> > On 6/21/07, Ben Clifford wrote: >> > > >> > > >> > > >> > > On Thu, 21 Jun 2007, Tiberiu Stef-Praun wrote: >> > > >> > > > No >> > > > My chart shows that if I had two more machines during the first >> stage >> > > > run (the first 26 jobs), I would have avoided a long wait >> (50000 ms , >> > > > or about 9 minutes) for the last two jobs from the first batch to >> > > > finish. >> > > > This is why I need to redo the Econ run, with a different throttle >> > > > value for Swift. >> > > >> > > So you are saying that changing the 'throttle value for swift' will >> > > allocate more machines for you? >> > > >> > > -- >> > > >> > >> > >> > >> > > -- ============================================ Ioan Raicu Ph.D. 
Student ============================================ Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 ============================================ Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dsl.cs.uchicago.edu/ ============================================ ============================================ From benc at hawaga.org.uk Fri Jun 22 03:10:39 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 22 Jun 2007 08:10:39 +0000 (GMT) Subject: [Swift-devel] Re: [Swft] Q about throttling In-Reply-To: <467B39C6.8050104@cs.uchicago.edu> References: <467A6610.1000103@mcs.anl.gov> <467A6E50.3060508@mcs.anl.gov> <467A95F3.6040603@mcs.anl.gov> <467A9B32.4030402@mcs.anl.gov> <467A9DAD.7060909@mcs.anl.gov> <467AA004.4080601@mcs.anl.gov> <467B39C6.8050104@cs.uchicago.edu> Message-ID: > kept busy, and the Falkon queue length was relatively at 0... so this means > that Swift was not submitting fast enough to keep all the executors busy. interesting. though around t=1000 there is a rapid burst of submission getting the queue length up to about 6000 in a few minutes. Do you know what the cpu time usage of the swift submitting JVM was over that time period? -- From iraicu at cs.uchicago.edu Fri Jun 22 09:06:32 2007 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Fri, 22 Jun 2007 09:06:32 -0500 Subject: [Swift-devel] Re: [Swft] Q about throttling In-Reply-To: References: <467A6610.1000103@mcs.anl.gov> <467A6E50.3060508@mcs.anl.gov> <467A95F3.6040603@mcs.anl.gov> <467A9B32.4030402@mcs.anl.gov> <467A9DAD.7060909@mcs.anl.gov> <467AA004.4080601@mcs.anl.gov> <467B39C6.8050104@cs.uchicago.edu> Message-ID: <467BD768.3020507@cs.uchicago.edu> No, I didn't keep track of this info, unless Swift does this through some of its logs. 
Over the last week, my observations have been the following: Swift is more than capable and willing to send out many tasks as long as they are independent (as can be seen in this graph, where probably 6800 tasks got submitted), but thereafter it had no other burst of task submission, although I believe it could have sent out more. For example, there were 2500+ tasks that failed in the middle of those 6800 tasks (which were all independent); why were those 2500 tasks not resubmitted all at once? They were each about 200 seconds long, so most of them should certainly have shown up in the wait queue.

Ioan

Ben Clifford wrote:
>> kept busy, and the Falkon queue length was relatively at 0... so this means
>> that Swift was not submitting fast enough to keep all the executors busy.
>>
>
> interesting. though around t=1000 there is a rapid burst of submission
> getting the queue length up to about 6000 in a few minutes.
>
> Do you know what the cpu time usage of the swift submitting JVM was over
> that time period?
>
>

-- 
============================================
Ioan Raicu
Ph.D. Student
============================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
============================================
Email: iraicu at cs.uchicago.edu
Web: http://www.cs.uchicago.edu/~iraicu
http://dsl.cs.uchicago.edu/
============================================
============================================
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From wilde at mcs.anl.gov Fri Jun 22 09:39:50 2007 From: wilde at mcs.anl.gov (Mike Wilde) Date: Fri, 22 Jun 2007 09:39:50 -0500 Subject: [Swift-devel] Re: [Swft] Q about throttling In-Reply-To: <467BD768.3020507@cs.uchicago.edu> References: <467A6610.1000103@mcs.anl.gov> <467A6E50.3060508@mcs.anl.gov> <467A95F3.6040603@mcs.anl.gov> <467A9B32.4030402@mcs.anl.gov> <467A9DAD.7060909@mcs.anl.gov> <467AA004.4080601@mcs.anl.gov> <467B39C6.8050104@cs.uchicago.edu> <467BD768.3020507@cs.uchicago.edu> Message-ID: <467BDF36.7080909@mcs.anl.gov> Is there a configurable retry delay after failure? I think you need to examine the overall workflow dependency structure. Also, I recall from older perf charts that there's an option to enable/disable pipelining. With pipelining disabled, it seems that Swift will wait for an entire dataset/foreach or procedure to finish before starting any tasks that depend on the foreach or procedure. Mihael, can you look at some of these issues when you are back online and rested? - Mike Ioan Raicu wrote, On 6/22/2007 9:06 AM: > No, I didn't keep track of this info, unless Swift does this through > some of its logs. > > Over the last week, my observations have been the following: Swift is > more than capable and willing to send out many tasks as long as they are > independent (as can be seen in this graph where probably 6800 tasks got > submitted), but thereafter, it had no other burst of task submission, > although I believe it could have send out more. For example, there were > 2500+ tasks that failed in the middle of those 6800 tasks (which were > all independent), why were 2500 tasks not resubmitted all at once... > they were each about 200 seconds long, so most of them should have > certainly showed up in the wait queue. > > Ioan > > Ben Clifford wrote: >>> kept busy, and the Falkon queue length was relatively at 0... so this means >>> that Swift was not submitting fast enough to keep all the executors busy. >>> >> >> interesting. 
though around t=1000 there is a rapid burst of submission >> getting the queue length up to about 6000 in a few minutes. >> >> Do you know what the cpu time usage of the swift submitting JVM was over >> that time period? >> >> > > -- > ============================================ > Ioan Raicu > Ph.D. Student > ============================================ > Distributed Systems Laboratory > Computer Science Department > University of Chicago > 1100 E. 58th Street, Ryerson Hall > Chicago, IL 60637 > ============================================ > Email: iraicu at cs.uchicago.edu > Web: http://www.cs.uchicago.edu/~iraicu > http://dsl.cs.uchicago.edu/ > ============================================ > ============================================ > -- Mike Wilde Computation Institute, University of Chicago Math & Computer Science Division Argonne National Laboratory Argonne, IL 60439 USA tel 630-252-7497 fax 630-252-1997 From yongzh at cs.uchicago.edu Fri Jun 22 09:45:42 2007 From: yongzh at cs.uchicago.edu (Yong Zhao) Date: Fri, 22 Jun 2007 09:45:42 -0500 (CDT) Subject: [Swift-devel] Re: [Swft] Q about throttling In-Reply-To: <467BDF36.7080909@mcs.anl.gov> References: <467A6610.1000103@mcs.anl.gov> <467A6E50.3060508@mcs.anl.gov> <467A95F3.6040603@mcs.anl.gov> <467A9B32.4030402@mcs.anl.gov> <467A9DAD.7060909@mcs.anl.gov> <467AA004.4080601@mcs.anl.gov> <467B39C6.8050104@cs.uchicago.edu> <467BD768.3020507@cs.uchicago.edu> <467BDF36.7080909@mcs.anl.gov> Message-ID: The retry mechanism is currently in some karajan script, and we can easily add some delay there. There is not a configuration option to disable pipeline. I did that manually (modified some code segment) to get a perf chart. Yong. On Fri, 22 Jun 2007, Mike Wilde wrote: > Is there a configurable retry delay after failure? > > I think you need to examine the overall workflow dependency structure. > > Also, I recall from older perf charts that there's an option to enable/disable > pipelining. 
With pipelining disabled, it seems that Swift will wait for an > entire dataset/foreach or procedure to finish before starting any tasks that > depend on the foreach or procedure. > > Mihael, can you look at some of these issues when you are back online and rested? > > - Mike > > Ioan Raicu wrote, On 6/22/2007 9:06 AM: > > No, I didn't keep track of this info, unless Swift does this through > > some of its logs. > > > > Over the last week, my observations have been the following: Swift is > > more than capable and willing to send out many tasks as long as they are > > independent (as can be seen in this graph where probably 6800 tasks got > > submitted), but thereafter, it had no other burst of task submission, > > although I believe it could have send out more. For example, there were > > 2500+ tasks that failed in the middle of those 6800 tasks (which were > > all independent), why were 2500 tasks not resubmitted all at once... > > they were each about 200 seconds long, so most of them should have > > certainly showed up in the wait queue. > > > > Ioan > > > > Ben Clifford wrote: > >>> kept busy, and the Falkon queue length was relatively at 0... so this means > >>> that Swift was not submitting fast enough to keep all the executors busy. > >>> > >> > >> interesting. though around t=1000 there is a rapid burst of submission > >> getting the queue length up to about 6000 in a few minutes. > >> > >> Do you know what the cpu time usage of the swift submitting JVM was over > >> that time period? > >> > >> > > > > -- > > ============================================ > > Ioan Raicu > > Ph.D. Student > > ============================================ > > Distributed Systems Laboratory > > Computer Science Department > > University of Chicago > > 1100 E. 
58th Street, Ryerson Hall > > Chicago, IL 60637 > > ============================================ > > Email: iraicu at cs.uchicago.edu > > Web: http://www.cs.uchicago.edu/~iraicu > > http://dsl.cs.uchicago.edu/ > > ============================================ > > ============================================ > > > > -- > Mike Wilde > Computation Institute, University of Chicago > Math & Computer Science Division > Argonne National Laboratory > Argonne, IL 60439 USA > tel 630-252-7497 fax 630-252-1997 > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From benc at hawaga.org.uk Fri Jun 22 12:11:38 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 22 Jun 2007 17:11:38 +0000 (GMT) Subject: [Swift-devel] Re: [Swft] Q about throttling In-Reply-To: <467BD768.3020507@cs.uchicago.edu> References: <467A6610.1000103@mcs.anl.gov> <467A6E50.3060508@mcs.anl.gov> <467A95F3.6040603@mcs.anl.gov> <467A9B32.4030402@mcs.anl.gov> <467A9DAD.7060909@mcs.anl.gov> <467AA004.4080601@mcs.anl.gov> <467B39C6.8050104@cs.uchicago.edu> <467BD768.3020507@cs.uchicago.edu> Message-ID: On Fri, 22 Jun 2007, Ioan Raicu wrote: > I believe it could have send out more. For example, there were 2500+ tasks > that failed in the middle of those 6800 tasks (which were all independent), > why were 2500 tasks not resubmitted all at once... they were each about 200 > seconds long, so most of them should have certainly showed up in the wait > queue. what kind of failure? 
-- From iraicu at cs.uchicago.edu Fri Jun 22 12:17:35 2007 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Fri, 22 Jun 2007 12:17:35 -0500 Subject: [Swift-devel] Re: [Swft] Q about throttling In-Reply-To: References: <467A6610.1000103@mcs.anl.gov> <467A6E50.3060508@mcs.anl.gov> <467A95F3.6040603@mcs.anl.gov> <467A9B32.4030402@mcs.anl.gov> <467A9DAD.7060909@mcs.anl.gov> <467AA004.4080601@mcs.anl.gov> <467B39C6.8050104@cs.uchicago.edu> <467BD768.3020507@cs.uchicago.edu> Message-ID: <467C042F.5040209@cs.uchicago.edu> Here is an excerpt from an email on 6/19. > > It completed 10998 > > tasks (8402 tasks with an exit code of 0, and 2596 tasks with an exit > > code of -1 -- aka failed) in 13399 seconds on 200 processors, this > > was for the 100 molecule run! The failed tasks were all on the same > > node over several short time intervals (~30 seconds), and were due to > > a "Stale NFS file handle", probably due to having 200 processes > > hitting the shared file system at the same time. Note that all these > > 2596 failed tasks were restarted by Swift and completed successfully > > on the resubmission. In the end, everything went through, and the run > > was successful! We noticed the same node in later runs act up, and take on the order of 100 times longer to complete some tasks than it was supposed to take. I bet this node is having some hardware issues, and we should write to help at tg to tell them. The failed tasks were eventually retried, and succeeded, and the whole run was successful, but the question is, why were the 2596 failed tasks (which were all independent of each other) not submitted faster after they failed... I would have expected them to fill up the wait queue with these 2596 retried tasks. Ioan Ben Clifford wrote: > > On Fri, 22 Jun 2007, Ioan Raicu wrote: > >> I believe it could have send out more. 
For example, there were 2500+ tasks
>> that failed in the middle of those 6800 tasks (which were all independent),
>> why were 2500 tasks not resubmitted all at once... they were each about 200
>> seconds long, so most of them should have certainly showed up in the wait
>> queue.
>>
>
> what kind of failure?
>
>

-- 
============================================
Ioan Raicu
Ph.D. Student
============================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
============================================
Email: iraicu at cs.uchicago.edu
Web: http://www.cs.uchicago.edu/~iraicu
http://dsl.cs.uchicago.edu/
============================================
============================================

From wilde at mcs.anl.gov Fri Jun 22 15:27:57 2007
From: wilde at mcs.anl.gov (Mike Wilde)
Date: Fri, 22 Jun 2007 15:27:57 -0500
Subject: [Swift-devel] Re: [Swft] Q about throttling
In-Reply-To: 
References: <467A6610.1000103@mcs.anl.gov> <467A6E50.3060508@mcs.anl.gov> <467A95F3.6040603@mcs.anl.gov> <467A9B32.4030402@mcs.anl.gov> <467A9DAD.7060909@mcs.anl.gov> <467AA004.4080601@mcs.anl.gov> <467B39C6.8050104@cs.uchicago.edu> <467BD768.3020507@cs.uchicago.edu> <467BDF36.7080909@mcs.anl.gov>
Message-ID: <467C30CD.8010703@mcs.anl.gov>

[forgot to hit send on this - my apologies if it's no longer relevant]

OK, thanks, Yong.

Regarding the retry delay, I phrased the question poorly. I meant:

Is it possible that the 2500 failing jobs are being retried too slowly? I.e., that Karajan delays each re-run after a failure, and thus can't keep Falkon fed with retried jobs at a high rate?

- Mike

Yong Zhao wrote, On 6/22/2007 9:45 AM:
> The retry mechanism is currently in some karajan script, and we can easily
> add some delay there.
>
> There is not a configuration option to disable pipeline.
I did that > manually (modified some code segment) to get a perf chart. > > Yong. > > On Fri, 22 Jun 2007, Mike Wilde wrote: > >> Is there a configurable retry delay after failure? >> >> I think you need to examine the overall workflow dependency structure. >> >> Also, I recall from older perf charts that there's an option to enable/disable >> pipelining. With pipelining disabled, it seems that Swift will wait for an >> entire dataset/foreach or procedure to finish before starting any tasks that >> depend on the foreach or procedure. >> >> Mihael, can you look at some of these issues when you are back online and rested? >> >> - Mike >> >> Ioan Raicu wrote, On 6/22/2007 9:06 AM: >>> No, I didn't keep track of this info, unless Swift does this through >>> some of its logs. >>> >>> Over the last week, my observations have been the following: Swift is >>> more than capable and willing to send out many tasks as long as they are >>> independent (as can be seen in this graph where probably 6800 tasks got >>> submitted), but thereafter, it had no other burst of task submission, >>> although I believe it could have send out more. For example, there were >>> 2500+ tasks that failed in the middle of those 6800 tasks (which were >>> all independent), why were 2500 tasks not resubmitted all at once... >>> they were each about 200 seconds long, so most of them should have >>> certainly showed up in the wait queue. >>> >>> Ioan >>> >>> Ben Clifford wrote: >>>>> kept busy, and the Falkon queue length was relatively at 0... so this means >>>>> that Swift was not submitting fast enough to keep all the executors busy. >>>>> >>>> interesting. though around t=1000 there is a rapid burst of submission >>>> getting the queue length up to about 6000 in a few minutes. >>>> >>>> Do you know what the cpu time usage of the swift submitting JVM was over >>>> that time period? >>>> >>>> >>> -- >>> ============================================ >>> Ioan Raicu >>> Ph.D. 
Student
>>> ============================================
>>> Distributed Systems Laboratory
>>> Computer Science Department
>>> University of Chicago
>>> 1100 E. 58th Street, Ryerson Hall
>>> Chicago, IL 60637
>>> ============================================
>>> Email: iraicu at cs.uchicago.edu
>>> Web: http://www.cs.uchicago.edu/~iraicu
>>> http://dsl.cs.uchicago.edu/
>>> ============================================
>>> ============================================
>>>
>> --
>> Mike Wilde
>> Computation Institute, University of Chicago
>> Math & Computer Science Division
>> Argonne National Laboratory
>> Argonne, IL 60439 USA
>> tel 630-252-7497 fax 630-252-1997
>> _______________________________________________
>> Swift-devel mailing list
>> Swift-devel at ci.uchicago.edu
>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>>
>
>

-- 
Mike Wilde
Computation Institute, University of Chicago
Math & Computer Science Division
Argonne National Laboratory
Argonne, IL 60439 USA
tel 630-252-7497 fax 630-252-1997

From yongzh at cs.uchicago.edu Fri Jun 22 15:32:46 2007
From: yongzh at cs.uchicago.edu (Yong Zhao)
Date: Fri, 22 Jun 2007 15:32:46 -0500 (CDT)
Subject: [Swift-devel] Re: [Swft] Q about throttling
In-Reply-To: <467C30CD.8010703@mcs.anl.gov>
References: <467A6610.1000103@mcs.anl.gov> <467A6E50.3060508@mcs.anl.gov> <467A95F3.6040603@mcs.anl.gov> <467A9B32.4030402@mcs.anl.gov> <467A9DAD.7060909@mcs.anl.gov> <467AA004.4080601@mcs.anl.gov> <467B39C6.8050104@cs.uchicago.edu> <467BD768.3020507@cs.uchicago.edu> <467BDF36.7080909@mcs.anl.gov> <467C30CD.8010703@mcs.anl.gov>
Message-ID: 

There is no delay in submitting retry jobs. However, these retry jobs may be queued after the 'ready' jobs that Swift has already processed, which could be held by Swift if there is job throttling.

Yong.

On Fri, 22 Jun 2007, Mike Wilde wrote:
> [forgot to hit send on this - my apology if its no longer relevant]
>
> OK, thanks, Yong.
> > Regarding the retry delay, I phrased the question poorly. I meant: > > Is it possible that the 2500 failing jobs are being retried too slowly? Ie that > Karajan delays each re-run after a failure, and thus cant keep Falkon fed with > retried jobs at a high rate? > > - Mike > > > Yong Zhao wrote, On 6/22/2007 9:45 AM: > > The retry mechanism is currently in some karajan script, and we can easily > > add some delay there. > > > > There is not a configuration option to disable pipeline. I did that > > manually (modified some code segment) to get a perf chart. > > > > Yong. > > > > On Fri, 22 Jun 2007, Mike Wilde wrote: > > > >> Is there a configurable retry delay after failure? > >> > >> I think you need to examine the overall workflow dependency structure. > >> > >> Also, I recall from older perf charts that there's an option to enable/disable > >> pipelining. With pipelining disabled, it seems that Swift will wait for an > >> entire dataset/foreach or procedure to finish before starting any tasks that > >> depend on the foreach or procedure. > >> > >> Mihael, can you look at some of these issues when you are back online and rested? > >> > >> - Mike > >> > >> Ioan Raicu wrote, On 6/22/2007 9:06 AM: > >>> No, I didn't keep track of this info, unless Swift does this through > >>> some of its logs. > >>> > >>> Over the last week, my observations have been the following: Swift is > >>> more than capable and willing to send out many tasks as long as they are > >>> independent (as can be seen in this graph where probably 6800 tasks got > >>> submitted), but thereafter, it had no other burst of task submission, > >>> although I believe it could have send out more. For example, there were > >>> 2500+ tasks that failed in the middle of those 6800 tasks (which were > >>> all independent), why were 2500 tasks not resubmitted all at once... > >>> they were each about 200 seconds long, so most of them should have > >>> certainly showed up in the wait queue. 
> >>> > >>> Ioan > >>> > >>> Ben Clifford wrote: > >>>>> kept busy, and the Falkon queue length was relatively at 0... so this means > >>>>> that Swift was not submitting fast enough to keep all the executors busy. > >>>>> > >>>> interesting. though around t=1000 there is a rapid burst of submission > >>>> getting the queue length up to about 6000 in a few minutes. > >>>> > >>>> Do you know what the cpu time usage of the swift submitting JVM was over > >>>> that time period? > >>>> > >>>> > >>> -- > >>> ============================================ > >>> Ioan Raicu > >>> Ph.D. Student > >>> ============================================ > >>> Distributed Systems Laboratory > >>> Computer Science Department > >>> University of Chicago > >>> 1100 E. 58th Street, Ryerson Hall > >>> Chicago, IL 60637 > >>> ============================================ > >>> Email: iraicu at cs.uchicago.edu > >>> Web: http://www.cs.uchicago.edu/~iraicu > >>> http://dsl.cs.uchicago.edu/ > >>> ============================================ > >>> ============================================ > >>> > >> -- > >> Mike Wilde > >> Computation Institute, University of Chicago > >> Math & Computer Science Division > >> Argonne National Laboratory > >> Argonne, IL 60439 USA > >> tel 630-252-7497 fax 630-252-1997 > >> _______________________________________________ > >> Swift-devel mailing list > >> Swift-devel at ci.uchicago.edu > >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >> > > > > > > -- > Mike Wilde > Computation Institute, University of Chicago > Math & Computer Science Division > Argonne National Laboratory > Argonne, IL 60439 USA > tel 630-252-7497 fax 630-252-1997 > From hategan at mcs.anl.gov Sat Jun 23 14:59:30 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sat, 23 Jun 2007 14:59:30 -0500 Subject: [Swift-devel] Re: [Swft] Q about throttling In-Reply-To: <467A9B32.4030402@mcs.anl.gov> References: <467A6610.1000103@mcs.anl.gov> <467A6E50.3060508@mcs.anl.gov> 
<467A95F3.6040603@mcs.anl.gov> <467A9B32.4030402@mcs.anl.gov> Message-ID: <1182628770.8366.3.camel@blabla.mcs.anl.gov> On Thu, 2007-06-21 at 10:37 -0500, Ian Foster wrote: > Well, if there is some concern that throttling might be a problem, > then trying a run with it turned off seems good. > > I'm gathering from this exchange that this is not possible? As Tibi guessed, it is done by using sufficiently large numbers for the throttles. There is no way to automatically turn off all throttles, but I guess it could be done. Mihael > > Ben Clifford wrote: > > But that isn't the base problem being investigated, right? > > > > On Thu, 21 Jun 2007, Ian Foster wrote: > > > > > > > My original question was whether we could turn throttling off altogether. I'm > > > not sure if that was answered? > > > > > > Tiberiu Stef-Praun wrote: > > > > > > > I did not look very deep into the throttling, mainly because I have to > > > > wait for my turn at using the Argonne cluster because of the large > > > > reservations that Ioan does for MolDyn > > > > > > > > Anyway, here is my experience (which Ian asked me to write down, but > > > > I'm still trying to improve on): > > > > - whatever one asks from Falkon, one seems to get, with the caveat > > > > that Falkon might release nodes when configured to look at an idle > > > > timer. In the case of the Econ workflow, I had 26 long running jobs, > > > > so I requested 30 nodes (which Falkon got for me) > > > > - there is a swift config file, $DVS_HOME/libexec/scheduler.xml in which I > > > > set > > > > , but that seemed not to be > > > > enough to get all my 26 jobs running at the same time (as illustrated > > > > by the graphing of the Falkon log that Ioan showed me). 
> > > > - there are some other throttling parameters in > > > > $VDS_HOME/etc/swift.properties (which I also set to 30) > > > > > > > > The general observation is that I needed to modify the scheduler.xml > > > > config file, and I need to set larger throttle values that the limit > > > > of workers requested. > > > > In the current scheme (simply add Falkon to Swift as a provider) the > > > > Swift scheduler (the weighted site selection algorithm) adversely > > > > influences the optimal execution of the workflow. > > > > There might be other parameters to work with, but my opinion is that > > > > we should use a different (non-throttling) scheduler in combination > > > > with Falkon > > > > > > > > Tibi > > > > > > > > On 6/21/07, Mike Wilde wrote: > > > > > > > > > Ive had the same question - it seems that throttling is also problematic > > > > > for > > > > > Tibi in the econ workflow. > > > > > > > > > > Tibi, since you have looked pretty deeply into it, could you write up a > > > > > desription on how the algorithm works and how the parameters affect it. > > > > > Mihael, when you are back on central time next week, could you work with > > > > > TIbi on > > > > > this? If its not already, this should be part of the Swift documentation. > > > > > > > > > > Then we should work on getting high-performance settings for the different > > > > > runtime environments we use, in particular Falkon as Ian asks. > > > > > > > > > > - Mike > > > > > > > > > > > > > > > Ian Foster wrote, On 6/21/2007 6:50 AM: > > > > > > > > > > > Hi, > > > > > > > > > > > > I don't fully understand how throttling works in Swift/Karajan. However, > > > > > > I understand that even when using Falkon, we may be doing some > > > > > > throttling. Is there a reason to do that in this case, given that Falkon > > > > > > can maintain large numbers of tasks just fine? 
> > > > > > > > > > > > I ask this because in a recent MolDyn run, there seemed to be some > > > > > > uncertainty as to whether throttling was slowing down job dispatch. If > > > > > > we could turn it off altogether, that question would presumably go away. > > > > > > > > > > > > Ian. > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > Mike Wilde > > > > > Computation Institute, University of Chicago > > > > > Math & Computer Science Division > > > > > Argonne National Laboratory > > > > > Argonne, IL 60439 USA > > > > > tel 630-252-7497 fax 630-252-1997 > > > > > > > > > > > > > > > > -- > > Ian Foster, Director, Computation Institute > Argonne National Laboratory & University of Chicago > Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439 > Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637 > Tel: +1 630 252 4619. Web: www.ci.uchicago.edu. > Globus Alliance: www.globus.org. From hategan at mcs.anl.gov Sat Jun 23 15:06:34 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sat, 23 Jun 2007 15:06:34 -0500 Subject: [Swift-devel] Re: [Swft] Q about throttling In-Reply-To: References: <467A6610.1000103@mcs.anl.gov> <467A95F3.6040603@mcs.anl.gov> <467A9B32.4030402@mcs.anl.gov> Message-ID: <1182629194.8366.7.camel@blabla.mcs.anl.gov> On Thu, 2007-06-21 at 11:32 -0500, Veronika Nefedova wrote: > There are two throttle parameters you might want to check. One is in > swift.properties called throttle.submit and one in scheduler.xml > called jobThrottle. > I am curious whats the difference between them? throttle.submit is documented in many places, including swift.properties and the user's guide. This should not affect things much since, as far as I can tell, submissions in Falkon are done pretty fast. JobThrottle is a site score scaling factor. It limits the initial set of jobs sent to sites in order to achieve better load balancing in the long run. This could be a cause for low number of concurrent jobs. 
Set it to large numbers if you want to get rid of it. Mihael > > Nika > > On Jun 21, 2007, at 11:25 AM, Tiberiu Stef-Praun wrote: > > > No > > I'm saying that swift throttle value will allow me to make full use of > > all the nodes that Falkon makes available for me. I know that I had 26 > > jobs to be run, and I requested (and had) 30 nodes in the cluster. > > Somehow only 24 jobs run in the first time, so I'm going to push up > > the throttle value in Swift > > > > Tibi > > > > > > On 6/21/07, Ben Clifford wrote: > >> > >> > >> > >> On Thu, 21 Jun 2007, Tiberiu Stef-Praun wrote: > >> > >> > No > >> > My chart shows that if I had two more machines during the first > >> stage > >> > run (the first 26 jobs), I would have avoided a long wait (50000 > >> ms , > >> > or about 9 minutes) for the last two jobs from the first batch to > >> > finish. > >> > This is why I need to redo the Econ run, with a different throttle > >> > value for Swift. > >> > >> So you are saying that changing the 'throttle value for swift' will > >> allocate more machines for you? > >> > >> -- > >> > > > > > > -- > > Tiberiu (Tibi) Stef-Praun, PhD > > Research Staff, Computation Institute > > 5640 S. Ellis Ave, #405 > > University of Chicago > > http://www-unix.mcs.anl.gov/~tiberius/ > > > From foster at mcs.anl.gov Sat Jun 23 15:29:54 2007 From: foster at mcs.anl.gov (Ian Foster) Date: Sat, 23 Jun 2007 15:29:54 -0500 Subject: [Swift-devel] Re: [Swft] Q about throttling In-Reply-To: <1182628770.8366.3.camel@blabla.mcs.anl.gov> References: <467A6610.1000103@mcs.anl.gov> <467A6E50.3060508@mcs.anl.gov> <467A95F3.6040603@mcs.anl.gov> <467A9B32.4030402@mcs.anl.gov> <1182628770.8366.3.camel@blabla.mcs.anl.gov> Message-ID: <467D82C2.8020906@mcs.anl.gov> thanks... Ian. Mihael Hategan wrote: > On Thu, 2007-06-21 at 10:37 -0500, Ian Foster wrote: > >> Well, if there is some concern that throttling might be a problem, >> then trying a run with it turned off seems good. 
>> >> I'm gathering from this exchange that this is not possible? >> > > As Tibi guessed, it is done by using sufficiently large numbers for the > throttles. There is no way to automatically turn off all throttles, but > I guess it could be done. > > Mihael > > >> Ben Clifford wrote: >> >>> But that isn't the base problem being investigated, right? >>> >>> On Thu, 21 Jun 2007, Ian Foster wrote: >>> >>> >>> >>>> My original question was whether we could turn throttling off altogether. I'm >>>> not sure if that was answered? >>>> >>>> Tiberiu Stef-Praun wrote: >>>> >>>> >>>>> I did not look very deep into the throttling, mainly because I have to >>>>> wait for my turn at using the Argonne cluster because of the large >>>>> reservations that Ioan does for MolDyn >>>>> >>>>> Anyway, here is my experience (which Ian asked me to write down, but >>>>> I'm still trying to improve on): >>>>> - whatever one asks from Falkon, one seems to get, with the caveat >>>>> that Falkon might release nodes when configured to look at an idle >>>>> timer. In the case of the Econ workflow, I had 26 long running jobs, >>>>> so I requested 30 nodes (which Falkon got for me) >>>>> - there is a swift config file, $DVS_HOME/libexec/scheduler.xml in which I >>>>> set >>>>> , but that seemed not to be >>>>> enough to get all my 26 jobs running at the same time (as illustrated >>>>> by the graphing of the Falkon log that Ioan showed me). >>>>> - there are some other throttling parameters in >>>>> $VDS_HOME/etc/swift.properties (which I also set to 30) >>>>> >>>>> The general observation is that I needed to modify the scheduler.xml >>>>> config file, and I need to set larger throttle values that the limit >>>>> of workers requested. >>>>> In the current scheme (simply add Falkon to Swift as a provider) the >>>>> Swift scheduler (the weighted site selection algorithm) adversely >>>>> influences the optimal execution of the workflow. 
>>>>> There might be other parameters to work with, but my opinion is that >>>>> we should use a different (non-throttling) scheduler in combination >>>>> with Falkon >>>>> >>>>> Tibi >>>>> >>>>> On 6/21/07, Mike Wilde wrote: >>>>> >>>>> >>>>>> Ive had the same question - it seems that throttling is also problematic >>>>>> for >>>>>> Tibi in the econ workflow. >>>>>> >>>>>> Tibi, since you have looked pretty deeply into it, could you write up a >>>>>> desription on how the algorithm works and how the parameters affect it. >>>>>> Mihael, when you are back on central time next week, could you work with >>>>>> TIbi on >>>>>> this? If its not already, this should be part of the Swift documentation. >>>>>> >>>>>> Then we should work on getting high-performance settings for the different >>>>>> runtime environments we use, in particular Falkon as Ian asks. >>>>>> >>>>>> - Mike >>>>>> >>>>>> >>>>>> Ian Foster wrote, On 6/21/2007 6:50 AM: >>>>>> >>>>>> >>>>>>> Hi, >>>>>>> >>>>>>> I don't fully understand how throttling works in Swift/Karajan. However, >>>>>>> I understand that even when using Falkon, we may be doing some >>>>>>> throttling. Is there a reason to do that in this case, given that Falkon >>>>>>> can maintain large numbers of tasks just fine? >>>>>>> >>>>>>> I ask this because in a recent MolDyn run, there seemed to be some >>>>>>> uncertainty as to whether throttling was slowing down job dispatch. If >>>>>>> we could turn it off altogether, that question would presumably go away. >>>>>>> >>>>>>> Ian. >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>> -- >>>>>> Mike Wilde >>>>>> Computation Institute, University of Chicago >>>>>> Math & Computer Science Division >>>>>> Argonne National Laboratory >>>>>> Argonne, IL 60439 USA >>>>>> tel 630-252-7497 fax 630-252-1997 >>>>>> >>>>>> >>>>>> >>> >>> >> -- >> >> Ian Foster, Director, Computation Institute >> Argonne National Laboratory & University of Chicago >> Argonne: MCS/221, 9700 S. 
Cass Ave, Argonne, IL 60439 >> Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637 >> Tel: +1 630 252 4619. Web: www.ci.uchicago.edu. >> Globus Alliance: www.globus.org. >> > > -- Ian Foster, Director, Computation Institute Argonne National Laboratory & University of Chicago Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439 Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637 Tel: +1 630 252 4619. Web: www.ci.uchicago.edu. Globus Alliance: www.globus.org. -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Sat Jun 23 15:42:52 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sat, 23 Jun 2007 15:42:52 -0500 Subject: [Swift-devel] Re: [Swft] Q about throttling In-Reply-To: <467BD768.3020507@cs.uchicago.edu> References: <467A6610.1000103@mcs.anl.gov> <467A6E50.3060508@mcs.anl.gov> <467A95F3.6040603@mcs.anl.gov> <467A9B32.4030402@mcs.anl.gov> <467A9DAD.7060909@mcs.anl.gov> <467AA004.4080601@mcs.anl.gov> <467B39C6.8050104@cs.uchicago.edu> <467BD768.3020507@cs.uchicago.edu> Message-ID: <1182631372.8366.9.camel@blabla.mcs.anl.gov> On Fri, 2007-06-22 at 09:06 -0500, Ioan Raicu wrote: > No, I didn't keep track of this info, unless Swift does this through > some of its logs. > > Over the last week, my observations have been the following: Swift is > more than capable and willing to send out many tasks as long as they > are independent (as can be seen in this graph where probably 6800 > tasks got submitted), but thereafter, it had no other burst of task > submission, although I believe it could have send out more. For > example, there were 2500+ tasks that failed in the middle of those > 6800 tasks (which were all independent), why were 2500 tasks not > resubmitted all at once... they were each about 200 seconds long, so > most of them should have certainly showed up in the wait queue. That's probably interpreter lag. It needs to do some work before resubmitting all those jobs. 
> > Ioan > > Ben Clifford wrote: > > > kept busy, and the Falkon queue length was relatively at 0... so this means > > > that Swift was not submitting fast enough to keep all the executors busy. > > > > > > > interesting. though around t=1000 there is a rapid burst of submission > > getting the queue length up to about 6000 in a few minutes. > > > > Do you know what the cpu time usage of the swift submitting JVM was over > > that time period? > > > > > > -- > ============================================ > Ioan Raicu > Ph.D. Student > ============================================ > Distributed Systems Laboratory > Computer Science Department > University of Chicago > 1100 E. 58th Street, Ryerson Hall > Chicago, IL 60637 > ============================================ > Email: iraicu at cs.uchicago.edu > Web: http://www.cs.uchicago.edu/~iraicu > http://dsl.cs.uchicago.edu/ > ============================================ > ============================================ From hategan at mcs.anl.gov Sat Jun 23 15:43:56 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sat, 23 Jun 2007 15:43:56 -0500 Subject: [Swift-devel] Re: [Swft] Q about throttling In-Reply-To: <467BDF36.7080909@mcs.anl.gov> References: <467A6610.1000103@mcs.anl.gov> <467A6E50.3060508@mcs.anl.gov> <467A95F3.6040603@mcs.anl.gov> <467A9B32.4030402@mcs.anl.gov> <467A9DAD.7060909@mcs.anl.gov> <467AA004.4080601@mcs.anl.gov> <467B39C6.8050104@cs.uchicago.edu> <467BD768.3020507@cs.uchicago.edu> <467BDF36.7080909@mcs.anl.gov> Message-ID: <1182631436.8366.11.camel@blabla.mcs.anl.gov> On Fri, 2007-06-22 at 09:39 -0500, Mike Wilde wrote: > Is there a configurable retry delay after failure? > > I think you need to examine the overall workflow dependency structure. > > Also, I recall from older perf charts that there's an option to enable/disable > pipelining. 
With pipelining disabled, it seems that Swift will wait for an > entire dataset/foreach or procedure to finish before starting any tasks that > depend on the foreach or procedure. I don't think that was in swift. > > Mihael, can you look at some of these issues when you are back online and rested? You say that as if I normally don't :) > > - Mike > > Ioan Raicu wrote, On 6/22/2007 9:06 AM: > > No, I didn't keep track of this info, unless Swift does this through > > some of its logs. > > > > Over the last week, my observations have been the following: Swift is > > more than capable and willing to send out many tasks as long as they are > > independent (as can be seen in this graph where probably 6800 tasks got > > submitted), but thereafter, it had no other burst of task submission, > > although I believe it could have send out more. For example, there were > > 2500+ tasks that failed in the middle of those 6800 tasks (which were > > all independent), why were 2500 tasks not resubmitted all at once... > > they were each about 200 seconds long, so most of them should have > > certainly showed up in the wait queue. > > > > Ioan > > > > Ben Clifford wrote: > >>> kept busy, and the Falkon queue length was relatively at 0... so this means > >>> that Swift was not submitting fast enough to keep all the executors busy. > >>> > >> > >> interesting. though around t=1000 there is a rapid burst of submission > >> getting the queue length up to about 6000 in a few minutes. > >> > >> Do you know what the cpu time usage of the swift submitting JVM was over > >> that time period? > >> > >> > > > > -- > > ============================================ > > Ioan Raicu > > Ph.D. Student > > ============================================ > > Distributed Systems Laboratory > > Computer Science Department > > University of Chicago > > 1100 E. 
58th Street, Ryerson Hall > > Chicago, IL 60637 > > ============================================ > > Email: iraicu at cs.uchicago.edu > > Web: http://www.cs.uchicago.edu/~iraicu > > http://dsl.cs.uchicago.edu/ > > ============================================ > > ============================================ > > > From hategan at mcs.anl.gov Sat Jun 23 15:47:01 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sat, 23 Jun 2007 15:47:01 -0500 Subject: [Swift-devel] Re: [Swft] Q about throttling In-Reply-To: <467C30CD.8010703@mcs.anl.gov> References: <467A6610.1000103@mcs.anl.gov> <467A6E50.3060508@mcs.anl.gov> <467A95F3.6040603@mcs.anl.gov> <467A9B32.4030402@mcs.anl.gov> <467A9DAD.7060909@mcs.anl.gov> <467AA004.4080601@mcs.anl.gov> <467B39C6.8050104@cs.uchicago.edu> <467BD768.3020507@cs.uchicago.edu> <467BDF36.7080909@mcs.anl.gov> <467C30CD.8010703@mcs.anl.gov> Message-ID: <1182631621.8366.14.camel@blabla.mcs.anl.gov> On Fri, 2007-06-22 at 15:27 -0500, Mike Wilde wrote: > [forgot to hit send on this - my apology if its no longer relevant] > > OK, thanks, Yong. > > Regarding the retry delay, I phrased the question poorly. I meant: > > Is it possible that the 2500 failing jobs are being retried too slowly? Ie that > Karajan delays each re-run after a failure, and thus cant keep Falkon fed with > retried jobs at a high rate? It does not explicitly delay anything. But 2500*[many things to do] becomes visible. > > - Mike > > > Yong Zhao wrote, On 6/22/2007 9:45 AM: > > The retry mechanism is currently in some karajan script, and we can easily > > add some delay there. > > > > There is not a configuration option to disable pipeline. I did that > > manually (modified some code segment) to get a perf chart. > > > > Yong. > > > > On Fri, 22 Jun 2007, Mike Wilde wrote: > > > >> Is there a configurable retry delay after failure? > >> > >> I think you need to examine the overall workflow dependency structure. 
> >> > >> Also, I recall from older perf charts that there's an option to enable/disable > >> pipelining. With pipelining disabled, it seems that Swift will wait for an > >> entire dataset/foreach or procedure to finish before starting any tasks that > >> depend on the foreach or procedure. > >> > >> Mihael, can you look at some of these issues when you are back online and rested? > >> > >> - Mike > >> > >> Ioan Raicu wrote, On 6/22/2007 9:06 AM: > >>> No, I didn't keep track of this info, unless Swift does this through > >>> some of its logs. > >>> > >>> Over the last week, my observations have been the following: Swift is > >>> more than capable and willing to send out many tasks as long as they are > >>> independent (as can be seen in this graph where probably 6800 tasks got > >>> submitted), but thereafter, it had no other burst of task submission, > >>> although I believe it could have send out more. For example, there were > >>> 2500+ tasks that failed in the middle of those 6800 tasks (which were > >>> all independent), why were 2500 tasks not resubmitted all at once... > >>> they were each about 200 seconds long, so most of them should have > >>> certainly showed up in the wait queue. > >>> > >>> Ioan > >>> > >>> Ben Clifford wrote: > >>>>> kept busy, and the Falkon queue length was relatively at 0... so this means > >>>>> that Swift was not submitting fast enough to keep all the executors busy. > >>>>> > >>>> interesting. though around t=1000 there is a rapid burst of submission > >>>> getting the queue length up to about 6000 in a few minutes. > >>>> > >>>> Do you know what the cpu time usage of the swift submitting JVM was over > >>>> that time period? > >>>> > >>>> > >>> -- > >>> ============================================ > >>> Ioan Raicu > >>> Ph.D. Student > >>> ============================================ > >>> Distributed Systems Laboratory > >>> Computer Science Department > >>> University of Chicago > >>> 1100 E. 
58th Street, Ryerson Hall > >>> Chicago, IL 60637 > >>> ============================================ > >>> Email: iraicu at cs.uchicago.edu > >>> Web: http://www.cs.uchicago.edu/~iraicu > >>> http://dsl.cs.uchicago.edu/ > >>> ============================================ > >>> ============================================ > >>> > >> -- > >> Mike Wilde > >> Computation Institute, University of Chicago > >> Math & Computer Science Division > >> Argonne National Laboratory > >> Argonne, IL 60439 USA > >> tel 630-252-7497 fax 630-252-1997 > >> _______________________________________________ > >> Swift-devel mailing list > >> Swift-devel at ci.uchicago.edu > >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >> > > > > > From hategan at mcs.anl.gov Sat Jun 23 15:48:45 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sat, 23 Jun 2007 15:48:45 -0500 Subject: [Swift-devel] Re: [Swft] Q about throttling In-Reply-To: References: <467A6610.1000103@mcs.anl.gov> <467A6E50.3060508@mcs.anl.gov> <467A95F3.6040603@mcs.anl.gov> <467A9B32.4030402@mcs.anl.gov> <467A9DAD.7060909@mcs.anl.gov> <467AA004.4080601@mcs.anl.gov> <467B39C6.8050104@cs.uchicago.edu> <467BD768.3020507@cs.uchicago.edu> <467BDF36.7080909@mcs.anl.gov> <467C30CD.8010703@mcs.anl.gov> Message-ID: <1182631725.8366.17.camel@blabla.mcs.anl.gov> On Fri, 2007-06-22 at 15:32 -0500, Yong Zhao wrote: > There is no delay for submitting retry jobs. However, these retry jobs may > be queued after the 'ready' jobs that swift already processed, which could > be be held by swift, if there is job throttling. Indeed. 2500 jobs failed may bring the score for the site down a bit. But then it doesn't look like there was much throttling, since 6800 tasks were submitted in bulk. > > Yong. > > On Fri, 22 Jun 2007, Mike Wilde wrote: > > > [forgot to hit send on this - my apology if its no longer relevant] > > > > OK, thanks, Yong. > > > > Regarding the retry delay, I phrased the question poorly. 
I meant: > > > > Is it possible that the 2500 failing jobs are being retried too slowly? Ie that > > Karajan delays each re-run after a failure, and thus cant keep Falkon fed with > > retried jobs at a high rate? > > > > - Mike > > > > > > Yong Zhao wrote, On 6/22/2007 9:45 AM: > > > The retry mechanism is currently in some karajan script, and we can easily > > > add some delay there. > > > > > > There is not a configuration option to disable pipeline. I did that > > > manually (modified some code segment) to get a perf chart. > > > > > > Yong. > > > > > > On Fri, 22 Jun 2007, Mike Wilde wrote: > > > > > >> Is there a configurable retry delay after failure? > > >> > > >> I think you need to examine the overall workflow dependency structure. > > >> > > >> Also, I recall from older perf charts that there's an option to enable/disable > > >> pipelining. With pipelining disabled, it seems that Swift will wait for an > > >> entire dataset/foreach or procedure to finish before starting any tasks that > > >> depend on the foreach or procedure. > > >> > > >> Mihael, can you look at some of these issues when you are back online and rested? > > >> > > >> - Mike > > >> > > >> Ioan Raicu wrote, On 6/22/2007 9:06 AM: > > >>> No, I didn't keep track of this info, unless Swift does this through > > >>> some of its logs. > > >>> > > >>> Over the last week, my observations have been the following: Swift is > > >>> more than capable and willing to send out many tasks as long as they are > > >>> independent (as can be seen in this graph where probably 6800 tasks got > > >>> submitted), but thereafter, it had no other burst of task submission, > > >>> although I believe it could have send out more. For example, there were > > >>> 2500+ tasks that failed in the middle of those 6800 tasks (which were > > >>> all independent), why were 2500 tasks not resubmitted all at once... 
> > >>> they were each about 200 seconds long, so most of them should have > > >>> certainly showed up in the wait queue. > > >>> > > >>> Ioan > > >>> > > >>> Ben Clifford wrote: > > >>>>> kept busy, and the Falkon queue length was relatively at 0... so this means > > >>>>> that Swift was not submitting fast enough to keep all the executors busy. > > >>>>> > > >>>> interesting. though around t=1000 there is a rapid burst of submission > > >>>> getting the queue length up to about 6000 in a few minutes. > > >>>> > > >>>> Do you know what the cpu time usage of the swift submitting JVM was over > > >>>> that time period? > > >>>> > > >>>> > > >>> -- > > >>> ============================================ > > >>> Ioan Raicu > > >>> Ph.D. Student > > >>> ============================================ > > >>> Distributed Systems Laboratory > > >>> Computer Science Department > > >>> University of Chicago > > >>> 1100 E. 58th Street, Ryerson Hall > > >>> Chicago, IL 60637 > > >>> ============================================ > > >>> Email: iraicu at cs.uchicago.edu > > >>> Web: http://www.cs.uchicago.edu/~iraicu > > >>> http://dsl.cs.uchicago.edu/ > > >>> ============================================ > > >>> ============================================ > > >>> > > >> -- > > >> Mike Wilde > > >> Computation Institute, University of Chicago > > >> Math & Computer Science Division > > >> Argonne National Laboratory > > >> Argonne, IL 60439 USA > > >> tel 630-252-7497 fax 630-252-1997 > > >> _______________________________________________ > > >> Swift-devel mailing list > > >> Swift-devel at ci.uchicago.edu > > >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > >> > > > > > > > > > > -- > > Mike Wilde > > Computation Institute, University of Chicago > > Math & Computer Science Division > > Argonne National Laboratory > > Argonne, IL 60439 USA > > tel 630-252-7497 fax 630-252-1997 > > > From benc at hawaga.org.uk Sat Jun 23 20:34:22 2007 From: benc at hawaga.org.uk (Ben 
Clifford)
Date: Sun, 24 Jun 2007 01:34:22 +0000 (GMT)
Subject: [Swift-devel] Re: [Swft] Q about throttling
In-Reply-To: <1182629194.8366.7.camel@blabla.mcs.anl.gov>
References: <467A6610.1000103@mcs.anl.gov> <467A95F3.6040603@mcs.anl.gov> <467A9B32.4030402@mcs.anl.gov> <1182629194.8366.7.camel@blabla.mcs.anl.gov>
Message-ID:

On Sat, 23 Jun 2007, Mihael Hategan wrote:

> JobThrottle is a site score scaling factor. It limits the initial set of
> jobs sent to sites in order to achieve better load balancing in the long
> run. This could be a cause for low number of concurrent jobs. Set it to
> large numbers if you want to get rid of it.

the scaling factor gives a restriction in job load that is not dependent
on the presence of other sites?

the use of the word 'factor' has a subtle implication that it is a
relative load and so in the case of a single site will have no effect

--

From hategan at mcs.anl.gov Sat Jun 23 20:35:34 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sat, 23 Jun 2007 20:35:34 -0500
Subject: [Swift-devel] Re: [Swft] Q about throttling
In-Reply-To:
References: <467A6610.1000103@mcs.anl.gov> <467A95F3.6040603@mcs.anl.gov> <467A9B32.4030402@mcs.anl.gov> <1182629194.8366.7.camel@blabla.mcs.anl.gov>
Message-ID: <1182648934.20734.2.camel@blabla.mcs.anl.gov>

On Sun, 2007-06-24 at 01:34 +0000, Ben Clifford wrote:
>
> On Sat, 23 Jun 2007, Mihael Hategan wrote:
>
> > JobThrottle is a site score scaling factor. It limits the initial set of
> > jobs sent to sites in order to achieve better load balancing in the long
> > run. This could be a cause for low number of concurrent jobs. Set it to
> > large numbers if you want to get rid of it.
>
> the scaling factor gives a restriction in job load that is not dependent
> on the presence of other sites?

Factor with respect to the score of a site, which has a pre-set value in
the beginning.
>
> the use of the word 'factor' has a subtle implication that it is a
> relative load and so in the case of a single site will have no effect
>

It was discussed before that the algorithm could (and imo should) be
changed such that in the case of only one site, it would not throttle,
or at least the throttle would be significantly bigger.

From benc at hawaga.org.uk Sat Jun 23 20:42:18 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sun, 24 Jun 2007 01:42:18 +0000 (GMT)
Subject: [Swift-devel] Re: [Swft] Q about throttling
In-Reply-To: <1182648934.20734.2.camel@blabla.mcs.anl.gov>
References: <467A6610.1000103@mcs.anl.gov> <467A95F3.6040603@mcs.anl.gov> <467A9B32.4030402@mcs.anl.gov> <1182629194.8366.7.camel@blabla.mcs.anl.gov> <1182648934.20734.2.camel@blabla.mcs.anl.gov>
Message-ID:

On Sat, 23 Jun 2007, Mihael Hategan wrote:

> > the use of the word 'factor' has a subtle implication that it is a
> > relative load and so in the case of a single site will have no effect
>
> It was discussed before that the algorithm could (and imo should) be
> changed such that in the case of only one site, it would not throttle,
> or at least the throttle would be significantly bigger.

that or the name should change.
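
The "scaling factor" reading discussed in this exchange can be pictured with a toy model. This is only a sketch under a stated assumption: that the per-site cap on concurrent jobs is the throttle multiplied by the current site score. It is not taken from the Swift or Karajan sources, and the names allowed_jobs and dispatchable are hypothetical.

```python
# Toy model only: NOT Swift's or Karajan's actual scheduler code.
# Assumption: the cap on concurrent jobs for a site is jobThrottle * score.


def allowed_jobs(job_throttle: float, site_score: float) -> int:
    """Hypothetical per-site cap on concurrently submitted jobs."""
    return max(1, int(job_throttle * site_score))


def dispatchable(pending: int, running: int,
                 job_throttle: float, site_score: float) -> int:
    """How many pending jobs could be submitted right now under the cap."""
    free_slots = allowed_jobs(job_throttle, site_score) - running
    return max(0, min(pending, free_slots))
```

Under this assumption the cap applies even when there is only one site, a very large throttle effectively removes it, and a lower score (for example after failures) tightens it. With made-up numbers, dispatchable(26, 0, 2.0, 10.0) gives 20, while dispatchable(26, 0, 1000.0, 10.0) gives 26.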
--

From hategan at mcs.anl.gov Sat Jun 23 20:45:53 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sat, 23 Jun 2007 20:45:53 -0500
Subject: [Swift-devel] Re: [Swft] Q about throttling
In-Reply-To:
References: <467A6610.1000103@mcs.anl.gov> <467A95F3.6040603@mcs.anl.gov> <467A9B32.4030402@mcs.anl.gov> <1182629194.8366.7.camel@blabla.mcs.anl.gov> <1182648934.20734.2.camel@blabla.mcs.anl.gov>
Message-ID: <1182649553.22337.0.camel@blabla.mcs.anl.gov>

On Sun, 2007-06-24 at 01:42 +0000, Ben Clifford wrote:
>
> On Sat, 23 Jun 2007, Mihael Hategan wrote:
>
> > > the use of the word 'factor' has a subtle implication that it is a
> > > relative load and so in the case of a single site will have no effect
> >
> > It was discussed before that the algorithm could (and imo should) be
> > changed such that in the case of only one site, it would not throttle,
> > or at least the throttle would be significantly bigger.
>
> that or the name should change.

It's called "jobThrottle" and that's a bad name, too. Suggestions?

From foster at mcs.anl.gov Sun Jun 24 15:52:14 2007
From: foster at mcs.anl.gov (Ian Foster)
Date: Sun, 24 Jun 2007 15:52:14 -0500
Subject: [Swift-devel] [Fwd: [Swft] Swift Performance Data]
Message-ID: <467ED97E.3060400@mcs.anl.gov>

Hi,

I'd like to ask: can we agree to put all other work on hold until we
have got in place tools to collect traces from every run, archive them,
and process them--as described below?

We've been talking about having these tools for months now, and each
time I ask, I am told that they are "sort of there." But we also keep
finding that people are creating custom plots, losing data, etc.

If we stop all other work until we have these tools, then we will have
them, and other problems will likely get easier to resolve.

Ian.
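
A minimal sketch of the name=value run summary asked for in the forwarded message below. Everything specific here is an assumption made for illustration: the key names, the derived values (task count, total and mean task time), the summary.txt file name, and the one-directory-per-run archive layout. None of it comes from Swift, Falkon, or kickstart.

```python
from pathlib import Path


def summarize_run(params: dict, task_seconds: list) -> dict:
    """Combine run parameters with a few derived values (hypothetical keys)."""
    summary = dict(params)
    total = sum(task_seconds)
    summary["tasks.count"] = len(task_seconds)
    summary["tasks.total_seconds"] = total
    summary["tasks.mean_seconds"] = total / len(task_seconds) if task_seconds else 0.0
    return summary


def format_summary(summary: dict) -> str:
    """Render the summary as sorted name=value lines, one per line."""
    return "".join(f"{name}={summary[name]}\n" for name in sorted(summary))


def archive_summary(summary: dict, archive_dir: str, run_id: str) -> Path:
    """Write the summary to <archive_dir>/<run_id>/summary.txt and return the path."""
    run_dir = Path(archive_dir) / run_id
    run_dir.mkdir(parents=True, exist_ok=True)
    out = run_dir / "summary.txt"
    out.write_text(format_summary(summary))
    return out
```

Sorting the keys keeps the files easy to diff between runs, and one directory per run gives the plots and the Web page a known place to live next to the summary.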
-------- Original Message --------
Subject: [Swft] Swift Performance Data
Date: Thu, 21 Jun 2007 08:33:11 -0500
From: Ian Foster
To: Mike Wilde
CC: swft , swift-devel at ci.uchicago.edu
References: <4679FBC8.1080606 at mcs.anl.gov> <467A6D78.4020702 at mcs.anl.gov> <467A7AC6.7020400 at mcs.anl.gov>

Or maybe that is clear.

I'd suggest that we want a tool that, after a run, one of us can run to:

* Generate the three plots that Ioan has created
* Generate a file containing as much information as we can about the run
  and its parameters--maybe a name=value format?--and some derived values
  such as those I mentioned in earlier email
* Move these things to a known place
* Create a Web page with pointers to this information and stick it
  somewhere [or add it to an existing web page?]

Ian.

Ian Foster wrote:
> Mike:
>
> It seems important to define what the specific goals and milestones
> are here, as it seems that simply asking for it doesn't get it done.
> Perhaps we need a brief specification?
>
> Ian.
>
> Mike Wilde wrote:
>> Yes, this is what Ganglia has been using.
>>
>> Regarding the auto-publishing - Jens has a mechanism that regularly
>> posted info in rrd format on the state of the VDS lab machines, using
>> a perl mechanism like what Ian described. Perhaps we can find and
>> adapt that for Ioan's numbers.
>> It was running on gainly I think. But it's not hard to develop from
>> scratch.
>>
>> It would be good to see the same numbers for all the Swift apps being
>> worked on, driven initially by kickstart summaries and digesting the
>> swift logfile.
>> We've long had this as a goal - now is a good time to push forward
>> and do this.
>>
>> Nika and Tibi, could you work with Ioan on this?
>>
>> - Mike
>>
>>
>>
>> Ian Foster wrote, On 6/20/2007 11:17 PM:
>>> Hi,
>>>
>>> I was pointed at http://oss.oetiker.ch/rrdtool/, has anyone seen
>>> this? Seems nice to me.
>>>
>>> Ian.
>>>
>>>
>>
>

--
Ian Foster, Director, Computation Institute
Argonne National Laboratory & University of Chicago
Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439
Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637
Tel: +1 630 252 4619. Web: www.ci.uchicago.edu.
Globus Alliance: www.globus.org.

From foster at mcs.anl.gov Mon Jun 25 04:26:07 2007
From: foster at mcs.anl.gov (Ian Foster)
Date: Mon, 25 Jun 2007 04:26:07 -0500
Subject: [Swift-devel] [Fwd: Re: [ogsa-wg] link to kepler]
Message-ID: <467F8A2F.6020505@mcs.anl.gov>

perhaps interesting slides ...

-------- Original Message --------
Subject: Re: [ogsa-wg] link to kepler
Date: Mon, 25 Jun 2007 10:26:53 +0200
From: adam belloum
To: Ian Foster
CC: O.F.Rana at cs.cardiff.ac.uk, ogsa-wg at ogf.org
References: <20070614142750.3ADFF1F5185 at fork10.mail.virginia.edu> <20070614155331.t01etgesooc8so08 at www.cs.cf.ac.uk> <4671666F.5020207 at mcs.anl.gov>

Hi,

we have just finished the home page of the WS-VLAM
(www.science.uva.nl/~gvlam/wsvlam); maybe you will also find some
interesting material there.

I know that OGSA-wg is interested in collecting requirements. There is a
presentation on the web site that contains a list of 32 wishes for
workflow, which we have collected from our users in the VL-e project.
We use this list as a driver for our developments (www.science.uva.nl/~gvlam/wsvlam/presentations/WS-VLAM-wishlist.ppt). In the future we will put up more info on the use cases defined around workflows and their requirements.

Regards
Adam

Ian Foster wrote:
>A couple more:
>
>* Karajan, and the Swift system
>* DAGman, and Pegasus
>
>O.F.Rana at cs.cardiff.ac.uk wrote:
>
>>Hi,
>>
>>Just to keep things in perspective -- there are also a number of other
>>workflow engines. A good portal is
>>
>>http://www.gridworkflow.org/
>>
>>We have Triana in Cardiff + Taverna from EBI/Manchester.
>>
>>regards
>>Omer

--
Ian Foster, Director, Computation Institute
Argonne National Laboratory & University of Chicago
Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439
Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637
Tel: +1 630 252 4619. Web: www.ci.uchicago.edu.
Globus Alliance: www.globus.org.

From benc at hawaga.org.uk Mon Jun 25 07:09:36 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 25 Jun 2007 12:09:36 +0000 (GMT)
Subject: [Swift-devel] Swift Performance Data
In-Reply-To: <467A7E17.5000207@mcs.anl.gov>
References: <4679FBC8.1080606@mcs.anl.gov> <467A6D78.4020702@mcs.anl.gov> <467A7AC6.7020400@mcs.anl.gov> <467A7E17.5000207@mcs.anl.gov>
Message-ID:

On Thu, 21 Jun 2007, Ian Foster wrote:

> * Generate a file containing as much information as we can about the run
> and its parameters--maybe a name=value format?--and some derived values
> such as those I mentioned in earlier email

The CEDPS project seems to have a somewhat reasonable document defining a logging format now - it didn't last time I looked ages ago, I think. Do you have tools for analysing those log files (or plan to have them)? If so, could be useful to put extra work in to match up with that.
(The text is linked from here: http://www.cedps.net/wiki/index.php/LoggingBestPractices)

(though its presentation as a Word document suggests a certain abstraction from the community who actually do logging and troubleshooting so perhaps caution is advised ;-)

--

From itf at mcs.anl.gov Mon Jun 25 07:13:22 2007
From: itf at mcs.anl.gov (Ian Foster)
Date: Mon, 25 Jun 2007 12:13:22 +0000
Subject: [Swift-devel] Swift Performance Data
In-Reply-To:
References: <4679FBC8.1080606@mcs.anl.gov> <467A6D78.4020702@mcs.anl.gov> <467A7AC6.7020400@mcs.anl.gov> <467A7E17.5000207@mcs.anl.gov>
Message-ID: <1975377455-1182773680-cardhu_decombobulator_blackberry.rim.net-423416081-@bxe006.bisx.prod.on.blackberry>

There are the netlogger tools

Sent via BlackBerry from T-Mobile

-----Original Message-----
From: Ben Clifford
Date: Mon, 25 Jun 2007 12:09:36
To:Ian Foster
Cc:swift-devel at ci.uchicago.edu
Subject: Re: [Swift-devel] Swift Performance Data

On Thu, 21 Jun 2007, Ian Foster wrote:

> * Generate a file containing as much information as we can about the run
> and its parameters--maybe a name=value format?--and some derived values
> such as those I mentioned in earlier email

The CEDPS project seems to have a somewhat reasonable document defining a logging format now - it didn't last time I looked ages ago, I think. Do you have tools for analysing those log files (or plan to have them)? If so, could be useful to put extra work in to match up with that.
(The text is linked from here: http://www.cedps.net/wiki/index.php/LoggingBestPractices)

(though its presentation as a Word document suggests a certain abstraction from the community who actually do logging and troubleshooting so perhaps caution is advised ;-)

--

From benc at hawaga.org.uk Mon Jun 25 07:16:31 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 25 Jun 2007 12:16:31 +0000 (GMT)
Subject: [Swift-devel] Swift Performance Data
In-Reply-To: <1975377455-1182773680-cardhu_decombobulator_blackberry.rim.net-423416081-@bxe006.bisx.prod.on.blackberry>
References: <4679FBC8.1080606@mcs.anl.gov> <467A6D78.4020702@mcs.anl.gov> <467A7AC6.7020400@mcs.anl.gov> <467A7E17.5000207@mcs.anl.gov> <1975377455-1182773680-cardhu_decombobulator_blackberry.rim.net-423416081-@bxe006.bisx.prod.on.blackberry>
Message-ID:

On Mon, 25 Jun 2007, Ian Foster wrote:

> There are the netlogger tools

do they use this format? If so that's fairly compelling (at least based on a powerpoint presentation I saw once, rather than actual experience ;-)

--

From itf at mcs.anl.gov Mon Jun 25 07:19:26 2007
From: itf at mcs.anl.gov (Ian Foster)
Date: Mon, 25 Jun 2007 12:19:26 +0000
Subject: [Swift-devel] Swift Performance Data
In-Reply-To:
References: <4679FBC8.1080606@mcs.anl.gov> <467A6D78.4020702@mcs.anl.gov> <467A7AC6.7020400@mcs.anl.gov> <467A7E17.5000207@mcs.anl.gov> <1975377455-1182773680-cardhu_decombobulator_blackberry.rim.net-423416081-@bxe006.bisx.prod.on.blackberry>
Message-ID: <525747895-1182774044-cardhu_decombobulator_blackberry.rim.net-1528886809-@bxe006.bisx.prod.on.blackberry>

I suggested them a while ago, there was some reason given for not using them, but I can't know what it was. It may just have been NIH.
Ian Sent via BlackBerry from T-Mobile -----Original Message----- From: Ben Clifford Date: Mon, 25 Jun 2007 12:16:31 To:Ian Foster Cc:Ian Foster , swift-devel at ci.uchicago.edu Subject: Re: [Swift-devel] Swift Performance Data On Mon, 25 Jun 2007, Ian Foster wrote: > There are the netlogger tools do they use this format? If so that's fairly compelling (at least based on a powerpoint presentation I saw once, rather than actual experience ;-) -- From benc at hawaga.org.uk Mon Jun 25 07:23:55 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 25 Jun 2007 12:23:55 +0000 (GMT) Subject: [Swift-devel] Swift Performance Data In-Reply-To: <525747895-1182774044-cardhu_decombobulator_blackberry.rim.net-1528886809-@bxe006.bisx.prod.on.blackberry> References: <4679FBC8.1080606@mcs.anl.gov> <467A6D78.4020702@mcs.anl.gov><467A7AC6.7020400@mcs.anl.gov><467A7E17.5000207@mcs.anl.gov><1975377455-1182773680-cardhu_decombobulator_blackberry.rim.net-423416081-@bxe006.bisx.prod.on.blackberry> <525747895-1182774044-cardhu_decombobulator_blackberry.rim.net-1528886809-@bxe006.bisx.prod.on.blackberry> Message-ID: I was resistant to CEDPS trouble shooting because they didn't seem to have anything on offer at the time. They do now. Though that's different from netlogger, at least on the marketing side (though clearly the staff list overlaps some). On Mon, 25 Jun 2007, Ian Foster wrote: > I suggested them a while ago, there was some reason given for not using them, but I can't know what it was. It may just have been NIH. > > Ian > > Sent via BlackBerry from T-Mobile > > -----Original Message----- > From: Ben Clifford > > Date: Mon, 25 Jun 2007 12:16:31 > To:Ian Foster > Cc:Ian Foster , swift-devel at ci.uchicago.edu > Subject: Re: [Swift-devel] Swift Performance Data > > > > On Mon, 25 Jun 2007, Ian Foster wrote: > > > There are the netlogger tools > > do they use this format? 
If so that's fairly compelling (at least based on
> a powerpoint presentation I saw once, rather than actual experience ;-)
>
>

From itf at mcs.anl.gov Mon Jun 25 07:23:08 2007
From: itf at mcs.anl.gov (Ian Foster)
Date: Mon, 25 Jun 2007 12:23:08 +0000
Subject: [Swift-devel] Swift Performance Data
In-Reply-To: <525747895-1182774044-cardhu_decombobulator_blackberry.rim.net-1528886809-@bxe006.bisx.prod.on.blackberry>
References: <4679FBC8.1080606@mcs.anl.gov> <467A6D78.4020702@mcs.anl.gov> <467A7AC6.7020400@mcs.anl.gov> <467A7E17.5000207@mcs.anl.gov> <1975377455-1182773680-cardhu_decombobulator_blackberry.rim.net-423416081-@bxe006.bisx.prod.on.blackberry> <525747895-1182774044-cardhu_decombobulator_blackberry.rim.net-1528886809-@bxe006.bisx.prod.on.blackberry>
Message-ID: <1560635678-1182774268-cardhu_decombobulator_blackberry.rim.net-1158930358-@bxe006.bisx.prod.on.blackberry>

However, I'd like to mention that my specific request is that we start collecting and storing logs from all runs. Using standard formats and netlogger may well be a good idea, but I'd feel concerned that putting that on the critical path would delay yet further the day when we achieve the primary goal.

Ian

Sent via BlackBerry from T-Mobile

-----Original Message-----
From: "Ian Foster"
Date: Mon, 25 Jun 2007 12:19:26
To:"Ben Clifford"
Cc:"Ian Foster" , swift-devel at ci.uchicago.edu
Subject: Re: [Swift-devel] Swift Performance Data

I suggested them a while ago, there was some reason given for not using them, but I can't know what it was. It may just have been NIH.

Ian

Sent via BlackBerry from T-Mobile

-----Original Message-----
From: Ben Clifford
Date: Mon, 25 Jun 2007 12:16:31
To:Ian Foster
Cc:Ian Foster , swift-devel at ci.uchicago.edu
Subject: Re: [Swift-devel] Swift Performance Data

On Mon, 25 Jun 2007, Ian Foster wrote:

> There are the netlogger tools

do they use this format?
If so that's fairly compelling (at least based on a powerpoint presentation I saw once, rather than actual experience ;-)

--

From itf at mcs.anl.gov Mon Jun 25 07:27:58 2007
From: itf at mcs.anl.gov (Ian Foster)
Date: Mon, 25 Jun 2007 12:27:58 +0000
Subject: [Swift-devel] Swift Performance Data
In-Reply-To:
References: <4679FBC8.1080606@mcs.anl.gov> <467A6D78.4020702@mcs.anl.gov> <467A7AC6.7020400@mcs.anl.gov> <467A7E17.5000207@mcs.anl.gov> <1975377455-1182773680-cardhu_decombobulator_blackberry.rim.net-423416081-@bxe006.bisx.prod.on.blackberry> <525747895-1182774044-cardhu_decombobulator_blackberry.rim.net-1528886809-@bxe006.bisx.prod.on.blackberry>
Message-ID: <63469356-1182774555-cardhu_decombobulator_blackberry.rim.net-1579903038-@bxe006.bisx.prod.on.blackberry>

Ah yes that's right. I've asked the NetLogger folks if they support it.

Ian

Sent via BlackBerry from T-Mobile

-----Original Message-----
From: Ben Clifford
Date: Mon, 25 Jun 2007 12:23:55
To:Ian Foster
Cc:Ian Foster , swift-devel at ci.uchicago.edu
Subject: Re: [Swift-devel] Swift Performance Data

I was resistant to CEDPS trouble shooting because they didn't seem to have anything on offer at the time. They do now. Though that's different from netlogger, at least on the marketing side (though clearly the staff list overlaps some).

On Mon, 25 Jun 2007, Ian Foster wrote:

> I suggested them a while ago, there was some reason given for not using them, but I can't know what it was. It may just have been NIH.
>
> Ian
>
> Sent via BlackBerry from T-Mobile
>
> -----Original Message-----
> From: Ben Clifford
>
> Date: Mon, 25 Jun 2007 12:16:31
> To:Ian Foster
> Cc:Ian Foster , swift-devel at ci.uchicago.edu
> Subject: Re: [Swift-devel] Swift Performance Data
>
>
>
> On Mon, 25 Jun 2007, Ian Foster wrote:
>
> > There are the netlogger tools
>
> do they use this format?
If so that's fairly compelling (at least based on
> a powerpoint presentation I saw once, rather than actual experience ;-)
>
>

From benc at hawaga.org.uk Mon Jun 25 07:29:49 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 25 Jun 2007 12:29:49 +0000 (GMT)
Subject: [Swift-devel] Swift Performance Data
In-Reply-To: <1560635678-1182774268-cardhu_decombobulator_blackberry.rim.net-1158930358-@bxe006.bisx.prod.on.blackberry>
References: <4679FBC8.1080606@mcs.anl.gov> <467A6D78.4020702@mcs.anl.gov> <467A7AC6.7020400@mcs.anl.gov> <467A7E17.5000207@mcs.anl.gov> <1975377455-1182773680-cardhu_decombobulator_blackberry.rim.net-423416081-@bxe006.bisx.prod.on.blackberry> <525747895-1182774044-cardhu_decombobulator_blackberry.rim.net-1528886809-@bxe006.bisx.prod.on.blackberry> <1560635678-1182774268-cardhu_decombobulator_blackberry.rim.net-1158930358-@bxe006.bisx.prod.on.blackberry>
Message-ID:

On Mon, 25 Jun 2007, Ian Foster wrote:

> However, I'd like to mention that my specific request is that we start
> collecting and storing logs from all runs. Using standard formats and
> netlogger may well be a good idea, but I'd feel concerned that putting
> that on the critical path would delay yet further the day when we
> achieve the primary goal.

Raw unfiltered log collection is something that the app people need to do - Tibi, Nika, Ioan. I guess the base set is:

  swift .log files

  kickstart dumps (for extra logging info and extra slowdown, turn on kickstart for all jobs instead of failed tasks by setting: kickstart.always.transfer=true )

  whatever falkon produces

This is a very different request from the 'sort out the analysis tooling' request, though.
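
[Editor's sketch, not part of the original message.] The "base set" above could be gathered after each run with something like the following. It is only a rough illustration: the glob patterns, the archive layout (archive_root/app/run_id) and the run.properties file name are assumptions made for this example, not names that Swift, kickstart or Falkon are known to use. The name=value summary follows the format suggested earlier in the thread.

```python
import shutil
import time
from pathlib import Path

# Hypothetical patterns for run artifacts; the real file names produced
# by Swift, kickstart and Falkon may well differ.
PATTERNS = ("*.log", "*.kickstart", "falkon*")


def collect_run(run_dir, archive_root, app, run_id=None):
    """Copy raw log files from run_dir into archive_root/app/run_id.

    Returns the destination directory. A name=value summary is written
    next to the copied files so later tools can find out what the run was.
    """
    run_dir = Path(run_dir)
    run_id = run_id or time.strftime("%Y%m%d-%H%M%S")
    dest = Path(archive_root) / app / run_id
    dest.mkdir(parents=True, exist_ok=True)

    copied = []
    for pattern in PATTERNS:
        for src in sorted(run_dir.glob(pattern)):
            # Skip directories and files already matched by an earlier pattern.
            if src.is_file() and src.name not in copied:
                shutil.copy2(src, dest / src.name)
                copied.append(src.name)

    meta = {
        "app": app,
        "run_id": run_id,
        "source_dir": str(run_dir.resolve()),
        "files": ",".join(copied),
    }
    summary = "".join(f"{key}={value}\n" for key, value in meta.items())
    (dest / "run.properties").write_text(summary)
    return dest
```

Called as collect_run("/path/to/workdir", "/path/to/archive", "Econ"), this leaves the working directory untouched and only copies, so it can be run by whoever did the run without any change to the workflow itself.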
-- From hategan at mcs.anl.gov Mon Jun 25 08:45:25 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 25 Jun 2007 08:45:25 -0500 Subject: [Swift-devel] Swift Performance Data In-Reply-To: <525747895-1182774044-cardhu_decombobulator_blackberry.rim.net-1528886809-@bxe006.bisx.prod.on.blackberry> References: <4679FBC8.1080606@mcs.anl.gov> <467A6D78.4020702@mcs.anl.gov> <467A7AC6.7020400@mcs.anl.gov><467A7E17.5000207@mcs.anl.gov> <1975377455-1182773680-cardhu_decombobulator_blackberry.rim.net-423416081-@bxe006.bisx.prod.on.blackberry> <525747895-1182774044-cardhu_decombobulator_blackberry.rim.net-1528886809-@bxe006.bisx.prod.on.blackberry> Message-ID: <1182779125.5910.6.camel@blabla.mcs.anl.gov> On Mon, 2007-06-25 at 12:19 +0000, Ian Foster wrote: > I suggested them a while ago, there was some reason given for not using them, but I can't know what it was. It may just have been NIH. The reason was that nothing was there. Another difficulty is that if we want meaningful things in that particular format, the whole software stack needs to be changed (including cog and jglobus and perhaps other things). This sounds a bit difficult, especially considering the fact that the information is there, but the format is not. I'd rather write a few simple parsers than try to change all logging messages everywhere. Mihael > > Ian > > Sent via BlackBerry from T-Mobile > > -----Original Message----- > From: Ben Clifford > > Date: Mon, 25 Jun 2007 12:16:31 > To:Ian Foster > Cc:Ian Foster , swift-devel at ci.uchicago.edu > Subject: Re: [Swift-devel] Swift Performance Data > > > > On Mon, 25 Jun 2007, Ian Foster wrote: > > > There are the netlogger tools > > do they use this format? 
If so that's fairly compelling (at least based on > a powerpoint presentation I saw once, rather than actual experience ;-) > From hategan at mcs.anl.gov Mon Jun 25 08:48:13 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 25 Jun 2007 08:48:13 -0500 Subject: [Swift-devel] Swift Performance Data In-Reply-To: <1560635678-1182774268-cardhu_decombobulator_blackberry.rim.net-1158930358-@bxe006.bisx.prod.on.blackberry> References: <4679FBC8.1080606@mcs.anl.gov> <467A6D78.4020702@mcs.anl.gov> <467A7AC6.7020400@mcs.anl.gov><467A7E17.5000207@mcs.anl.gov> <1975377455-1182773680-cardhu_decombobulator_blackberry.rim.net-423416081-@bxe006.bisx.prod.on.blackberry> <525747895-1182774044-cardhu_decombobulator_blackberry.rim.net-1528886809-@bxe006.bisx.prod.on.blackberry> <1560635678-1182774268-cardhu_decombobulator_blackberry.rim.net-1158930358-@bxe006.bisx.prod.on.blackberry> Message-ID: <1182779293.5910.10.camel@blabla.mcs.anl.gov> On Mon, 2007-06-25 at 12:23 +0000, Ian Foster wrote: > Howver, I'd like to mention that my specific request is that we start collecting and storing logs from all runs. Using standard formats and netlogger may well be a good idea, but I'd feel concerned that putting that on the critical path would delay yet further the day when we achieve the primary goal. Yes, I mentioned that to Tibi (I think) a while ago. The effort of collecting logs is minimal, and once the tools are there, we could easily analyze previous runs. Mihael > > Ian > > > > Sent via BlackBerry from T-Mobile > > -----Original Message----- > From: "Ian Foster" > > Date: Mon, 25 Jun 2007 12:19:26 > To:"Ben Clifford" > Cc:"Ian Foster" , swift-devel at ci.uchicago.edu > Subject: Re: [Swift-devel] Swift Performance Data > > > I suggested them a while ago, there was some reason given for not using them, but I can't know what it was. It may just have been NIH. 
> > Ian > > Sent via BlackBerry from T-Mobile > > -----Original Message----- > From: Ben Clifford > > Date: Mon, 25 Jun 2007 12:16:31 > To:Ian Foster > Cc:Ian Foster , swift-devel at ci.uchicago.edu > Subject: Re: [Swift-devel] Swift Performance Data > > > > On Mon, 25 Jun 2007, Ian Foster wrote: > > > There are the netlogger tools > > do they use this format? If so that's fairly compelling (at least based on > a powerpoint presentation I saw once, rather than actual experience ;-) > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From benc at hawaga.org.uk Mon Jun 25 08:54:17 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 25 Jun 2007 13:54:17 +0000 (GMT) Subject: [Swift-devel] Swift Performance Data In-Reply-To: <1182779125.5910.6.camel@blabla.mcs.anl.gov> References: <4679FBC8.1080606@mcs.anl.gov> <467A6D78.4020702@mcs.anl.gov> <467A7AC6.7020400@mcs.anl.gov><467A7E17.5000207@mcs.anl.gov> <1975377455-1182773680-cardhu_decombobulator_blackberry.rim.net-423416081-@bxe006.bisx.prod.on.blackberry> <525747895-1182774044-cardhu_decombobulator_blackberry.rim.net-1528886809-@bxe006.bisx.prod.on.blackberry> <1182779125.5910.6.camel@blabla.mcs.anl.gov> Message-ID: On Mon, 25 Jun 2007, Mihael Hategan wrote: > Another difficulty is that if we want meaningful things in that > particular format, the whole software stack needs to be changed > (including cog and jglobus and perhaps other things). This sounds a bit > difficult, especially considering the fact that the information is > there, but the format is not. I'd rather write a few simple parsers than > try to change all logging messages everywhere. 'a few simple parsers' is not necessarily 'simple'. changing logging messages in code everywhere definitely isn't, though. 
if there is going to be more than one analysis tool, converting log files to a common format somewhere in between generating application and analysing application might be a good idea, and not massively different from defining a language level API to abstract out log file format differences. -- From hategan at mcs.anl.gov Mon Jun 25 08:59:00 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 25 Jun 2007 08:59:00 -0500 Subject: [Swift-devel] Swift Performance Data In-Reply-To: References: <4679FBC8.1080606@mcs.anl.gov> <467A6D78.4020702@mcs.anl.gov> <467A7AC6.7020400@mcs.anl.gov><467A7E17.5000207@mcs.anl.gov> <1975377455-1182773680-cardhu_decombobulator_blackberry.rim.net-423416081-@bxe006.bisx.prod.on.blackberry> <525747895-1182774044-cardhu_decombobulator_blackberry.rim.net-1528886809-@bxe006.bisx.prod.on.blackberry> <1182779125.5910.6.camel@blabla.mcs.anl.gov> Message-ID: <1182779940.5910.16.camel@blabla.mcs.anl.gov> On Mon, 2007-06-25 at 13:54 +0000, Ben Clifford wrote: > On Mon, 25 Jun 2007, Mihael Hategan wrote: > > > Another difficulty is that if we want meaningful things in that > > particular format, the whole software stack needs to be changed > > (including cog and jglobus and perhaps other things). This sounds a bit > > difficult, especially considering the fact that the information is > > there, but the format is not. I'd rather write a few simple parsers than > > try to change all logging messages everywhere. > > 'a few simple parsers' is not necessarily 'simple'. It's relatively simple. The python tool that is an adaptation of Jens' shows that it can be done relatively easy. I would worry more about Swift producing sufficient information and about how that information is represented. That seems harder to me. > > changing logging messages in code everywhere definitely isn't, though. 
> > if there is going to be more than one analysis tool, converting log files > to a common format somewhere in between generating application and > analysing application might be a good idea, and not massively different > from defining a language level API to abstract out log file format > differences. This is similar to code generation vs. abstraction (or interpretation vs. compilation). There can be: 1. An api to access the logs in structured ways 2. A log translator 3. An adaptor plugged in at the log4j (or whatever logging library) level that does the translation dynamically (at the expense of performance). Mihael > From benc at hawaga.org.uk Mon Jun 25 09:07:22 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 25 Jun 2007 14:07:22 +0000 (GMT) Subject: [Swift-devel] Swift Performance Data In-Reply-To: <1182779940.5910.16.camel@blabla.mcs.anl.gov> References: <4679FBC8.1080606@mcs.anl.gov> <467A6D78.4020702@mcs.anl.gov> <467A7AC6.7020400@mcs.anl.gov><467A7E17.5000207@mcs.anl.gov> <1975377455-1182773680-cardhu_decombobulator_blackberry.rim.net-423416081-@bxe006.bisx.prod.on.blackberry> <525747895-1182774044-cardhu_decombobulator_blackberry.rim.net-1528886809-@bxe006.bisx.prod.on.blackberry> <1182779125.5910.6.camel@blabla.mcs.anl.gov> <1182779940.5910.16.camel@blabla.mcs.anl.gov> Message-ID: On Mon, 25 Jun 2007, Mihael Hategan wrote: > This is similar to code generation vs. abstraction (or interpretation > vs. compilation). There can be: its also an issue with where the abstractions happen: > 1. An api to access the logs in structured ways needs the API to exist in the language that you want to write analysers in. > 2. A log translator makes the API into a posix filesystem with text files. still needs a per-language parser to parse that format, but that is 'simple' and works in a variety of languages. so I'd favour this. > 3. 
An adaptor plugged in at the log4j (or whatever logging library)
> level that does the translation dynamically (at the expense of
> performance).

Possibly some components can output stuff into a common format using log4j - that wouldn't necessarily be any more dynamic than the existing log4j output.

--

From wilde at mcs.anl.gov Mon Jun 25 09:19:17 2007
From: wilde at mcs.anl.gov (Mike Wilde)
Date: Mon, 25 Jun 2007 09:19:17 -0500
Subject: [Swift-devel] Swift Performance Data
In-Reply-To:
References: <4679FBC8.1080606@mcs.anl.gov> <467A6D78.4020702@mcs.anl.gov> <467A7AC6.7020400@mcs.anl.gov> <467A7E17.5000207@mcs.anl.gov> <1975377455-1182773680-cardhu_decombobulator_blackberry.rim.net-423416081-@bxe006.bisx.prod.on.blackberry> <525747895-1182774044-cardhu_decombobulator_blackberry.rim.net-1528886809-@bxe006.bisx.prod.on.blackberry> <1182779125.5910.6.camel@blabla.mcs.anl.gov>
Message-ID: <467FCEE5.6000707@mcs.anl.gov>

I have not been reading email this weekend and need to catch up on this and related threads.

I want to ask that for the moment Ben and Mihael stay focused on what they are working on, and I will work with Nika and Tibi on application status and issues, and move the measurement issue along.

I agree fully with Mihael's point that we can and should start gathering all execution logs into a uniformly structured gathering place. Then we can organize the current log tools and determine what's needed next in that area.

For now:

Ben: Swift 0.2 and mapper/language improvements

Mihael: get to a closure point on I2U2 to get the last 4 months of work into production (or at a stable development lab for a next-generation production system). Determine when you can be back on Swift.

Nika: MolDyn-244 and defining the MolDyn parameter sweep workflow; Next steps (TBD) on LQCD progress;

Tibi: Econ - next workflow; set up environment for Econ to adopt tools. Work with new people in Econ to take over from Gabrielle. I2U2 load sharing into production, and assist in LIGO app. SIDGrid Wavelet: tbd.

Nika and Tibi: application writeups

Next apps: FLASH, RADCAD; possibly SCEC exploration.

- Mike

Ben Clifford wrote, On 6/25/2007 8:54 AM:
> On Mon, 25 Jun 2007, Mihael Hategan wrote:
>
>> Another difficulty is that if we want meaningful things in that
>> particular format, the whole software stack needs to be changed
>> (including cog and jglobus and perhaps other things). This sounds a bit
>> difficult, especially considering the fact that the information is
>> there, but the format is not. I'd rather write a few simple parsers than
>> try to change all logging messages everywhere.
>
> 'a few simple parsers' is not necessarily 'simple'.
>
> changing logging messages in code everywhere definitely isn't, though.
>
> if there is going to be more than one analysis tool, converting log files
> to a common format somewhere in between generating application and
> analysing application might be a good idea, and not massively different
> from defining a language level API to abstract out log file format
> differences.
> -- Mike Wilde Computation Institute, University of Chicago Math & Computer Science Division Argonne National Laboratory Argonne, IL 60439 USA tel 630-252-7497 fax 630-252-1997 From hategan at mcs.anl.gov Mon Jun 25 09:19:00 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 25 Jun 2007 09:19:00 -0500 Subject: [Swift-devel] Swift Performance Data In-Reply-To: References: <4679FBC8.1080606@mcs.anl.gov> <467A6D78.4020702@mcs.anl.gov> <467A7AC6.7020400@mcs.anl.gov><467A7E17.5000207@mcs.anl.gov> <1975377455-1182773680-cardhu_decombobulator_blackberry.rim.net-423416081-@bxe006.bisx.prod.on.blackberry> <525747895-1182774044-cardhu_decombobulator_blackberry.rim.net-1528886809-@bxe006.bisx.prod.on.blackberry> <1182779125.5910.6.camel@blabla.mcs.anl.gov> <1182779940.5910.16.camel@blabla.mcs.anl.gov> Message-ID: <1182781140.10791.3.camel@blabla.mcs.anl.gov> On Mon, 2007-06-25 at 14:07 +0000, Ben Clifford wrote: > > On Mon, 25 Jun 2007, Mihael Hategan wrote: > > > This is similar to code generation vs. abstraction (or interpretation > > vs. compilation). There can be: > > its also an issue with where the abstractions happen: > > > 1. An api to access the logs in structured ways > > needs the API to exist in the language that you want to write analysers > in. > > > 2. A log translator > > makes the API into a posix filesystem with text files. still needs a > per-language parser to parse that format, but that is 'simple' and works > in a variety of languages. so I'd favour this. So would I. And since logs are incremental, it could even be done live (i.e. tail -f log |translator >translated.log). It would also be "backwards compatible" with the logs that we've been gathering so far :) > > > 3. An adaptor plugged in at the log4j (or whatever logging library) > > level that does the translation dynamically (at the expense of > > performance). 
> > Possibly some components can output stuff into a common format using log4j > - that wouldn't necessarily be any more dynamic than the existing log4j > output. Right, but that would not be the general case. > From tiberius at ci.uchicago.edu Mon Jun 25 10:19:20 2007 From: tiberius at ci.uchicago.edu (Tiberiu Stef-Praun) Date: Mon, 25 Jun 2007 10:19:20 -0500 Subject: [Swift-devel] Swift Performance Data In-Reply-To: <1182779293.5910.10.camel@blabla.mcs.anl.gov> References: <4679FBC8.1080606@mcs.anl.gov> <467A6D78.4020702@mcs.anl.gov> <467A7AC6.7020400@mcs.anl.gov> <467A7E17.5000207@mcs.anl.gov> <1975377455-1182773680-cardhu_decombobulator_blackberry.rim.net-423416081-@bxe006.bisx.prod.on.blackberry> <525747895-1182774044-cardhu_decombobulator_blackberry.rim.net-1528886809-@bxe006.bisx.prod.on.blackberry> <1560635678-1182774268-cardhu_decombobulator_blackberry.rim.net-1158930358-@bxe006.bisx.prod.on.blackberry> <1182779293.5910.10.camel@blabla.mcs.anl.gov> Message-ID: All the logs that I consider relevant I save into the /results directory under each application, in the SwiftApps SVN. On 6/25/07, Mihael Hategan wrote: > On Mon, 2007-06-25 at 12:23 +0000, Ian Foster wrote: > > Howver, I'd like to mention that my specific request is that we start collecting and storing logs from all runs. Using standard formats and netlogger may well be a good idea, but I'd feel concerned that putting that on the critical path would delay yet further the day when we achieve the primary goal. > > Yes, I mentioned that to Tibi (I think) a while ago. The effort of > collecting logs is minimal, and once the tools are there, we could > easily analyze previous runs. 
> > Mihael > > > > > Ian > > > > > > > > Sent via BlackBerry from T-Mobile > > > > -----Original Message----- > > From: "Ian Foster" > > > > Date: Mon, 25 Jun 2007 12:19:26 > > To:"Ben Clifford" > > Cc:"Ian Foster" , swift-devel at ci.uchicago.edu > > Subject: Re: [Swift-devel] Swift Performance Data > > > > > > I suggested them a while ago, there was some reason given for not using them, but I can't know what it was. It may just have been NIH. > > > > Ian > > > > Sent via BlackBerry from T-Mobile > > > > -----Original Message----- > > From: Ben Clifford > > > > Date: Mon, 25 Jun 2007 12:16:31 > > To:Ian Foster > > Cc:Ian Foster , swift-devel at ci.uchicago.edu > > Subject: Re: [Swift-devel] Swift Performance Data > > > > > > > > On Mon, 25 Jun 2007, Ian Foster wrote: > > > > > There are the netlogger tools > > > > do they use this format? If so that's fairly compelling (at least based on > > a powerpoint presentation I saw once, rather than actual experience ;-) > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > -- Tiberiu (Tibi) Stef-Praun, PhD Research Staff, Computation Institute 5640 S. 
Ellis Ave, #405 University of Chicago http://www-unix.mcs.anl.gov/~tiberius/ From foster at mcs.anl.gov Mon Jun 25 10:34:56 2007 From: foster at mcs.anl.gov (Ian Foster) Date: Mon, 25 Jun 2007 10:34:56 -0500 Subject: [Swift-devel] [Fwd: Re: Q about netlogger] Message-ID: <467FE0A0.2010805@mcs.anl.gov> -------- Original Message -------- Subject: Re: Q about netlogger Date: Mon, 25 Jun 2007 09:20:15 -0400 From: Brian Tierney Organization: LBNL To: itf at mcs.anl.gov CC: Jenny Schopf References: <1794526737-1182774123-cardhu_decombobulator_blackberry.rim.net-1522300153- at bxe006.bisx.prod.on.blackberry> Ian Foster wrote: > Hi, > > > Can netlogger process logs in the new log format? yep. -- Ian Foster, Director, Computation Institute Argonne National Laboratory & University of Chicago Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439 Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637 Tel: +1 630 252 4619. Web: www.ci.uchicago.edu. Globus Alliance: www.globus.org. -------------- next part -------------- An HTML attachment was scrubbed... URL: From foster at mcs.anl.gov Mon Jun 25 11:03:37 2007 From: foster at mcs.anl.gov (Ian Foster) Date: Mon, 25 Jun 2007 11:03:37 -0500 Subject: [Swift-devel] Swift Performance Data In-Reply-To: <467FCEE5.6000707@mcs.anl.gov> References: <4679FBC8.1080606@mcs.anl.gov> <467A6D78.4020702@mcs.anl.gov> <467A7AC6.7020400@mcs.anl.gov><467A7E17.5000207@mcs.anl.gov> <1975377455-1182773680-cardhu_decombobulator_blackberry.rim.net-423416081-@bxe006.bisx.prod.on.blackberry> <525747895-1182774044-cardhu_decombobulator_blackberry.rim.net-1528886809-@bxe006.bisx.prod.on.blackberry> <1182779125.5910.6.camel@blabla.mcs.anl.gov> <467FCEE5.6000707@mcs.anl.gov> Message-ID: <467FE759.50904@mcs.anl.gov> So who is going to do this? I've been asking about this for some time, and nothing has happened. The result, I think, has been a lot of confusion and delay. 
> > I agree fully with Mihael's point that we can and should start > gathering all execution logs into a uniformly structured gathering > place. Then we can organize the current log tools and determine whats > needed next in that area. > From hategan at mcs.anl.gov Mon Jun 25 11:15:18 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 25 Jun 2007 11:15:18 -0500 Subject: [Swift-devel] Swift Performance Data In-Reply-To: <467FE759.50904@mcs.anl.gov> References: <4679FBC8.1080606@mcs.anl.gov> <467A6D78.4020702@mcs.anl.gov> <467A7AC6.7020400@mcs.anl.gov><467A7E17.5000207@mcs.anl.gov> <1975377455-1182773680-cardhu_decombobulator_blackberry.rim.net-423416081-@bxe006.bisx.prod.on.blackberry> <525747895-1182774044-cardhu_decombobulator_blackberry.rim.net-1528886809-@bxe006.bisx.prod.on.blackberry> <1182779125.5910.6.camel@blabla.mcs.anl.gov> <467FCEE5.6000707@mcs.anl.gov> <467FE759.50904@mcs.anl.gov> Message-ID: <1182788118.23226.3.camel@blabla.mcs.anl.gov> On Mon, 2007-06-25 at 11:03 -0500, Ian Foster wrote: > So who is going to do this? > > I've been asking about this for some time, and nothing has happened. The > result, I think, has been a lot of confusion and delay. Are we still talking about collecting logs? I'm a bit confused. > > > > I agree fully with Mihael's point that we can and should start > > gathering all execution logs into a uniformly structured gathering > > place. Then we can organize the current log tools and determine whats > > needed next in that area. 
> > > From benc at hawaga.org.uk Mon Jun 25 11:26:42 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 25 Jun 2007 16:26:42 +0000 (GMT) Subject: [Swift-devel] Swift Performance Data In-Reply-To: <1182788118.23226.3.camel@blabla.mcs.anl.gov> References: <4679FBC8.1080606@mcs.anl.gov> <467A6D78.4020702@mcs.anl.gov> <467A7AC6.7020400@mcs.anl.gov><467A7E17.5000207@mcs.anl.gov> <1975377455-1182773680-cardhu_decombobulator_blackberry.rim.net-423416081-@bxe006.bisx.prod.on.blackberry> <525747895-1182774044-cardhu_decombobulator_blackberry.rim.net-1528886809-@bxe006.bisx.prod.on.blackberry> <1182779125.5910.6.camel@blabla.mcs.anl.gov> <467FCEE5.6000707@mcs.anl.gov> <467FE759.50904@mcs.anl.gov> <1182788118.23226.3.camel@blabla.mcs.anl.gov> Message-ID: On Mon, 25 Jun 2007, Mihael Hategan wrote: > On Mon, 2007-06-25 at 11:03 -0500, Ian Foster wrote: > > So who is going to do this? > > > > I've been asking about this for some time, and nothing has happened. The > > result, I think, has been a lot of confusion and delay. > > Are we still talking about collecting logs? I'm a bit confused. I see a few of Tibi's run logs and derivative analyses in the SVN. Look at some of the files add between r844 and r861: http://www.ci.uchicago.edu/trac/swift/browser/SwiftApps/Econ/results/econ-ws-Falkon-ljmap4x6j42e0.log.tiff?rev=857 No kickstart though. Could do with more organising. However as to 'who is going to do this?' in response to collecting data - the app people have to - they do the runs and its their working directories that the stuff ends up in. 
-- From wilde at mcs.anl.gov Mon Jun 25 11:30:31 2007 From: wilde at mcs.anl.gov (Mike Wilde) Date: Mon, 25 Jun 2007 11:30:31 -0500 Subject: [Swift-devel] Swift Performance Data In-Reply-To: <467FE759.50904@mcs.anl.gov> References: <4679FBC8.1080606@mcs.anl.gov> <467A6D78.4020702@mcs.anl.gov> <467A7AC6.7020400@mcs.anl.gov><467A7E17.5000207@mcs.anl.gov> <1975377455-1182773680-cardhu_decombobulator_blackberry.rim.net-423416081-@bxe006.bisx.prod.on.blackberry> <525747895-1182774044-cardhu_decombobulator_blackberry.rim.net-1528886809-@bxe006.bisx.prod.on.blackberry> <1182779125.5910.6.camel@blabla.mcs.anl.gov> <467FCEE5.6000707@mcs.anl.gov> <467FE759.50904@mcs.anl.gov> Message-ID: <467FEDA7.8050000@mcs.anl.gov> We will decide, with more discussion, on this list. - Mike Ian Foster wrote, On 6/25/2007 11:03 AM: > So who is going to do this? > > I've been asking about this for some time, and nothing has happened. The > result, I think, has been a lot of confusion and delay. >> >> I agree fully with Mihael's point that we can and should start >> gathering all execution logs into a uniformly structured gathering >> place. Then we can organize the current log tools and determine whats >> needed next in that area. 
>> > > -- Mike Wilde Computation Institute, University of Chicago Math & Computer Science Division Argonne National Laboratory Argonne, IL 60439 USA tel 630-252-7497 fax 630-252-1997 From wilde at mcs.anl.gov Mon Jun 25 11:36:29 2007 From: wilde at mcs.anl.gov (Mike Wilde) Date: Mon, 25 Jun 2007 11:36:29 -0500 Subject: [Swift-devel] Swift Performance Data In-Reply-To: <1182788118.23226.3.camel@blabla.mcs.anl.gov> References: <4679FBC8.1080606@mcs.anl.gov> <467A6D78.4020702@mcs.anl.gov> <467A7AC6.7020400@mcs.anl.gov><467A7E17.5000207@mcs.anl.gov> <1975377455-1182773680-cardhu_decombobulator_blackberry.rim.net-423416081-@bxe006.bisx.prod.on.blackberry> <525747895-1182774044-cardhu_decombobulator_blackberry.rim.net-1528886809-@bxe006.bisx.prod.on.blackberry> <1182779125.5910.6.camel@blabla.mcs.anl.gov> <467FCEE5.6000707@mcs.anl.gov> <467FE759.50904@mcs.anl.gov> <1182788118.23226.3.camel@blabla.mcs.anl.gov> Message-ID: <467FEF0D.1050700@mcs.anl.gov> collecting logs and generating reports/plots easily and in a standard format, so we can set goals against specific metrics for each workflow/application and track how we are progressing against those goals, for each. - Mike Mihael Hategan wrote, On 6/25/2007 11:15 AM: > On Mon, 2007-06-25 at 11:03 -0500, Ian Foster wrote: >> So who is going to do this? >> >> I've been asking about this for some time, and nothing has happened. The >> result, I think, has been a lot of confusion and delay. > > Are we still talking about collecting logs? I'm a bit confused. > >>> I agree fully with Mihael's point that we can and should start >>> gathering all execution logs into a uniformly structured gathering >>> place. Then we can organize the current log tools and determine whats >>> needed next in that area. 
>>> > > -- Mike Wilde Computation Institute, University of Chicago Math & Computer Science Division Argonne National Laboratory Argonne, IL 60439 USA tel 630-252-7497 fax 630-252-1997 From benc at hawaga.org.uk Mon Jun 25 11:42:35 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 25 Jun 2007 16:42:35 +0000 (GMT) Subject: [Swift-devel] Swift Performance Data In-Reply-To: <467FEF0D.1050700@mcs.anl.gov> References: <4679FBC8.1080606@mcs.anl.gov> <467A6D78.4020702@mcs.anl.gov> <467A7AC6.7020400@mcs.anl.gov><467A7E17.5000207@mcs.anl.gov> <1975377455-1182773680-cardhu_decombobulator_blackberry.rim.net-423416081-@bxe006.bisx.prod.on.blackberry> <525747895-1182774044-cardhu_decombobulator_blackberry.rim.net-1528886809-@bxe006.bisx.prod.on.blackberry> <1182779125.5910.6.camel@blabla.mcs.anl.gov> <467FCEE5.6000707@mcs.anl.gov> <467FE759.50904@mcs.anl.gov> <1182788118.23226.3.camel@blabla.mcs.anl.gov> <467FEF0D.1050700@mcs.anl.gov> Message-ID: As far as I see, Ian asked for two separate things: i) most importantly (to him) the collection of raw data. that's a case of collecting files and not changing them. ii) the development of a log analysis stack. This very much overlaps with what CEDPS and is a much bigger job. There are two fairly contradictory views: that we shouldn't let CEDPS get in our way, and that we should have a pile of hacked up tools. i) is something for app people to do. ii) is a much bigger development effort which really should go in the bugzilla and wait its turn. On Mon, 25 Jun 2007, Mike Wilde wrote: > collecting logs and generating reports/plots easily and in a standard format, > so we can set goals against specific metrics for each workflow/application and > track how we are progressing against those goals, for each. > > - Mike > > Mihael Hategan wrote, On 6/25/2007 11:15 AM: > > On Mon, 2007-06-25 at 11:03 -0500, Ian Foster wrote: > > > So who is going to do this? > > > > > > I've been asking about this for some time, and nothing has happened. 
The > > > result, I think, has been a lot of confusion and delay. > > > > Are we still talking about collecting logs? I'm a bit confused. > > > > > > I agree fully with Mihael's point that we can and should start gathering > > > > all execution logs into a uniformly structured gathering place. Then we > > > > can organize the current log tools and determine whats needed next in > > > > that area. > > > > > > > > > > From benc at hawaga.org.uk Mon Jun 25 11:45:48 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 25 Jun 2007 16:45:48 +0000 (GMT) Subject: [Swift-devel] Swift Performance Data In-Reply-To: References: <4679FBC8.1080606@mcs.anl.gov> <467A6D78.4020702@mcs.anl.gov> <467A7AC6.7020400@mcs.anl.gov><467A7E17.5000207@mcs.anl.gov> <1975377455-1182773680-cardhu_decombobulator_blackberry.rim.net-423416081-@bxe006.bisx.prod.on.blackberry> <525747895-1182774044-cardhu_decombobulator_blackberry.rim.net-1528886809-@bxe006.bisx.prod.on.blackberry> <1182779125.5910.6.camel@blabla.mcs.anl.gov> <467FCEE5.6000707@mcs.anl.gov> <467FE759.50904@mcs.anl.gov> <1182788118.23226.3.camel@blabla.mcs.anl.gov> <467FEF0D.1050700@mcs.anl.gov> Message-ID: On Mon, 25 Jun 2007, Ben Clifford wrote: > what CEDPS and is a much bigger job. There are two fairly contradictory > views: that we shouldn't let CEDPS get in our way, and that we should have > a pile of hacked up tools. oops, that we *shouldn't* have a pile of hacked up tools, I meant. 
-- From foster at mcs.anl.gov Mon Jun 25 11:47:26 2007 From: foster at mcs.anl.gov (Ian Foster) Date: Mon, 25 Jun 2007 11:47:26 -0500 Subject: [Swift-devel] Swift Performance Data In-Reply-To: References: <4679FBC8.1080606@mcs.anl.gov> <467A6D78.4020702@mcs.anl.gov> <467A7AC6.7020400@mcs.anl.gov><467A7E17.5000207@mcs.anl.gov> <1975377455-1182773680-cardhu_decombobulator_blackberry.rim.net-423416081-@bxe006.bisx.prod.on.blackberry> <525747895-1182774044-cardhu_decombobulator_blackberry.rim.net-1528886809-@bxe006.bisx.prod.on.blackberry> <1182779125.5910.6.camel@blabla.mcs.anl.gov> <467FCEE5.6000707@mcs.anl.gov> <467FE759.50904@mcs.anl.gov> <1182788118.23226.3.camel@blabla.mcs.anl.gov> <467FEF0D.1050700@mcs.anl.gov> Message-ID: <467FF19E.8000000@mcs.anl.gov> I think I disagree on both points: i) I think that we could produce some tools that would grab the relevant files and stick them somewhere central. I've proposed some ideas in previous emails. ii) Of course we can imagine sophisticated tools. But there are also tools that various of you have already produced that generate graphs of various sorts. We should package these and (I think) integrate them with (i) so that when we grab the files we also run the programs to generate the graphs. Ian. Ben Clifford wrote: > As far as I see, Ian asked for two separate things: > > i) most importantly (to him) the collection of raw data. that's a case of > collecting files and not changing them. > > ii) the development of a log analysis stack. This very much overlaps with > what CEDPS and is a much bigger job. There are two fairly contradictory > views: that we shouldn't let CEDPS get in our way, and that we should have > a pile of hacked up tools. > > i) is something for app people to do. > > ii) is a much bigger development effort which really should go in the > bugzilla and wait its turn. 
> > On Mon, 25 Jun 2007, Mike Wilde wrote: > > >> collecting logs and generating reports/plots easily and in a standard format, >> so we can set goals against specific metrics for each workflow/application and >> track how we are progressing against those goals, for each. >> >> - Mike >> >> Mihael Hategan wrote, On 6/25/2007 11:15 AM: >> >>> On Mon, 2007-06-25 at 11:03 -0500, Ian Foster wrote: >>> >>>> So who is going to do this? >>>> >>>> I've been asking about this for some time, and nothing has happened. The >>>> result, I think, has been a lot of confusion and delay. >>>> >>> Are we still talking about collecting logs? I'm a bit confused. >>> >>> >>>>> I agree fully with Mihael's point that we can and should start gathering >>>>> all execution logs into a uniformly structured gathering place. Then we >>>>> can organize the current log tools and determine whats needed next in >>>>> that area. >>>>> >>>>> >>> >> > > -- Ian Foster, Director, Computation Institute Argonne National Laboratory & University of Chicago Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439 Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637 Tel: +1 630 252 4619. Web: www.ci.uchicago.edu. Globus Alliance: www.globus.org. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From benc at hawaga.org.uk Mon Jun 25 11:49:12 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 25 Jun 2007 16:49:12 +0000 (GMT) Subject: [Swift-devel] Swift Performance Data In-Reply-To: <467FF19E.8000000@mcs.anl.gov> References: <4679FBC8.1080606@mcs.anl.gov> <467A6D78.4020702@mcs.anl.gov> <467A7AC6.7020400@mcs.anl.gov><467A7E17.5000207@mcs.anl.gov> <1975377455-1182773680-cardhu_decombobulator_blackberry.rim.net-423416081-@bxe006.bisx.prod.on.blackberry> <525747895-1182774044-cardhu_decombobulator_blackberry.rim.net-1528886809-@bxe006.bisx.prod.on.blackberry> <1182779125.5910.6.camel@blabla.mcs.anl.gov> <467FCEE5.6000707@mcs.anl.gov> <467FE759.50904@mcs.anl.gov> <1182788118.23226.3.camel@blabla.mcs.anl.gov> <467FEF0D.1050700@mcs.anl.gov> <467FF19E.8000000@mcs.anl.gov> Message-ID: On Mon, 25 Jun 2007, Ian Foster wrote: > We should package these and (I think) integrate them with (i) so that > when we grab the files we also run the programs to generate the graphs. yes, like I said: > > ii) the development of a log analysis stack. This very much overlaps with > > what CEDPS and is a much bigger job. 
-- From hategan at mcs.anl.gov Mon Jun 25 11:54:38 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 25 Jun 2007 11:54:38 -0500 Subject: [Swift-devel] Swift Performance Data In-Reply-To: References: <4679FBC8.1080606@mcs.anl.gov> <467A6D78.4020702@mcs.anl.gov> <467A7AC6.7020400@mcs.anl.gov><467A7E17.5000207@mcs.anl.gov> <1975377455-1182773680-cardhu_decombobulator_blackberry.rim.net-423416081-@bxe006.bisx.prod.on.blackberry> <525747895-1182774044-cardhu_decombobulator_blackberry.rim.net-1528886809-@bxe006.bisx.prod.on.blackberry> <1182779125.5910.6.camel@blabla.mcs.anl.gov> <467FCEE5.6000707@mcs.anl.gov> <467FE759.50904@mcs.anl.gov> <1182788118.23226.3.camel@blabla.mcs.anl.gov> <467FEF0D.1050700@mcs.anl.gov> Message-ID: <1182790478.29751.7.camel@blabla.mcs.anl.gov> On Mon, 2007-06-25 at 16:45 +0000, Ben Clifford wrote: > > On Mon, 25 Jun 2007, Ben Clifford wrote: > > > what CEDPS and is a much bigger job. There are two fairly contradictory > > views: that we shouldn't let CEDPS get in our way, and that we should have > > a pile of hacked up tools. > > oops, that we *shouldn't* have a pile of hacked up tools, I meant. I think this view is a bit simplified. We should evaluate CEDPS based on the value it has to offer and the costs that adapting to it involves. Unfortunately it's hard to make a decision about the future value of CEDPS. What I've seen so far is a structured logging format, for which relevant analysis tools may or may not exist in the future. But it may very well be that we'll have to write these tools ourselves, in which case we're left with a format, and issue which we've discussed before. 
Mihael
>

From iraicu at cs.uchicago.edu Mon Jun 25 12:11:51 2007
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Mon, 25 Jun 2007 12:11:51 -0500
Subject: [Swift-devel] Swift Performance Data
In-Reply-To: <1182788118.23226.3.camel@blabla.mcs.anl.gov>
References: <4679FBC8.1080606@mcs.anl.gov> <467A6D78.4020702@mcs.anl.gov> <467A7AC6.7020400@mcs.anl.gov><467A7E17.5000207@mcs.anl.gov> <1975377455-1182773680-cardhu_decombobulator_blackberry.rim.net-423416081-@bxe006.bisx.prod.on.blackberry> <525747895-1182774044-cardhu_decombobulator_blackberry.rim.net-1528886809-@bxe006.bisx.prod.on.blackberry> <1182779125.5910.6.camel@blabla.mcs.anl.gov> <467FCEE5.6000707@mcs.anl.gov> <467FE759.50904@mcs.anl.gov> <1182788118.23226.3.camel@blabla.mcs.anl.gov>
Message-ID: <467FF757.1030708@cs.uchicago.edu>

Here is my 2c of experience in trying to draw up graphs of various experiments. I make a clear distinction between (1) logs that will be used for debugging/info and are in a relatively human readable format, and (2) logs that will be used for plotting graphs. The human readable logs (1) are almost always written in response to events in the system. On the other hand, the logs that are geared towards graphing (2) are mostly based on fixed time intervals, and a few are based on events.

For example, in Falkon, I have the following set of logs:

1) Falkon dispatcher log (1 for the entire Falkon system) with debug/info level human readable logs; it is typically written to on events related to the task dispatch and notifications that happen in the Falkon service; this log is currently only used for debugging purposes.

2) Falkon provisioner log (1 for the entire Falkon system) with debug/info level human readable logs; it is typically written to on events related to the allocation of resources; this log is currently only used for debugging purposes.

3) Executor logs (1 per executor, separated into different files); these are also for human consumption, and at the most detailed logging level they even print the STDOUT and STDERR of the task executions. These logs are not aggregated in any way currently, and are mostly used for debugging purposes.

4) Task description log (1 for the entire Falkon system), which stores the description of each task executed (i.e. TIMESTAMP, APPLICATION_ID, EXECUTABLE, ARGUEMENTS, ENVIRONMENT); I have not used this log yet for anything, but I envision we could use it for workload characterization, studies involving replaying an entire workload, etc...

5) Summary log (1 for the entire Falkon system) with an easy to parse format for automatic graph generation; this log is generated on fixed time intervals, in which some of the Falkon state is summarized for the duration of that period; the kind of state information that goes in this log is: TimeStamp_ms num_users num_resources num_threads num_all_workers num_free_workers num_pend_workers num_busy_workers waitQ_length waitNotQ_length activeQ_length doneQ_length delivered_tasks throughput_tasks/sec; this log can be used to plot the number of executors registered, active, idle, the queue length, the throughput of tasks delivered, etc... as the experiment progresses. In my latest development branch, I actually have a few more parameters that I am logging, such as CPU utilization, free memory, data caching hit rates, etc...

6) Per task log (1 for the entire Falkon system) that has information on each task executed in Falkon; this log is used to plot the per task info as the experiment progresses. The information that is kept on each task is: taskID workerID startTime endTime waitQueueTime execTime resultsQueueTime totalTime exitCode; this log can also be used to plot the per worker information, to see how the tasks were dispersed over the workers...

7) User information log (1 for the entire Falkon system) that stores information relevant for the end user, and is updated every time the state (wait, active, done) changes for any task; the information that this log contains is: Time_ms Users Resources JVM_Threads WaitingTasks ActiveTasks DoneTasks DeliveredTasks; I have not used this log for anything yet, but it has much finer-grained information than the summary log (5), so more detailed graphs/analysis could be generated from this log.

8) Worker information log (1 for the entire Falkon system) that stores information about the workers' state changes and is updated every time the state (free, pending, busy) changes for any worker; the information that this log contains is: Time_ms RegisteredWorkers FreeWorkers PendWorkers BusyWorkers; again, I have not used this log for anything yet, but it has much finer-grained information than the summary log (5), so more detailed graphs/analysis could be generated from this log.

Now, as a summary, I use (5) and (6) a lot to generate the graphs that I do for Falkon. I have not used (7) and (8) yet, but might in the future. It's also relatively easy to add new state information to these existing logs since the logging is all localized in a few places; with little effort, I can add new metrics to monitor, or create a completely new log that has other information that was not easy to integrate into existing logs. For simplicity, my perf logs (5-8) are all simple logs that are just space delimited...

> taskID workerID startTime endTime waitQueueTime execTime resultsQueueTime totalTime exitCode
> tg-viz-login1.uc.teragrid.org:50103:1_1326356873 tg-c058.uc.teragrid.org:50100 1182533457601 1182533985431 467599 60225 6 527830 0
> tg-viz-login1.uc.teragrid.org:50103:2_1124048393 tg-c052.uc.teragrid.org:50100 1182533457613 1182533985454 467735 60101 5 527841 0
> tg-viz-login1.uc.teragrid.org:50103:3_1648367237 tg-c053.uc.teragrid.org:50100 1182533457616 1182533985524 467760 60138 10 527908 0

They could be converted to XML or any other format you want, but this is a nice format for programs like ploticus or gnuplot to understand easily. On the other hand, my debug logs (1-4) are all handled via log4j, look more like the traditional logs that log4j generates and people are accustomed to, but from my point of view, these are tedious and error-prone to parse for graphing purposes.

Does this distinction (human readable vs. machine readable) between logs exist in Swift? If not, I would argue to not modify the debug/info logs, but to create new logs that are specifically targeted at automatic graph generation, such as my logs (5-8). If we are to use tools that others have built, then we just need to make sure these new logs conform to the appropriate format; if we are to write our own tools (or we already have them), then we have as much freedom as we want on what format these logs should be.

Ioan

Mihael Hategan wrote:
> On Mon, 2007-06-25 at 11:03 -0500, Ian Foster wrote:
>
>> So who is going to do this?
>>
>> I've been asking about this for some time, and nothing has happened. The
>> result, I think, has been a lot of confusion and delay.
>>
>
> Are we still talking about collecting logs? I'm a bit confused.
>
>
>>> I agree fully with Mihael's point that we can and should start
>>> gathering all execution logs into a uniformly structured gathering
>>> place. Then we can organize the current log tools and determine whats
>>> needed next in that area.
>>> >>> > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > -- ============================================ Ioan Raicu Ph.D. Student ============================================ Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 ============================================ Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dsl.cs.uchicago.edu/ ============================================ ============================================ From foster at mcs.anl.gov Mon Jun 25 12:12:49 2007 From: foster at mcs.anl.gov (Ian Foster) Date: Mon, 25 Jun 2007 12:12:49 -0500 Subject: [Swift-devel] Swift Performance Data In-Reply-To: References: <4679FBC8.1080606@mcs.anl.gov> <467A6D78.4020702@mcs.anl.gov> <467A7AC6.7020400@mcs.anl.gov><467A7E17.5000207@mcs.anl.gov> <1975377455-1182773680-cardhu_decombobulator_blackberry.rim.net-423416081-@bxe006.bisx.prod.on.blackberry> <525747895-1182774044-cardhu_decombobulator_blackberry.rim.net-1528886809-@bxe006.bisx.prod.on.blackberry> <1182779125.5910.6.camel@blabla.mcs.anl.gov> <467FCEE5.6000707@mcs.anl.gov> <467FE759.50904@mcs.anl.gov> <1182788118.23226.3.camel@blabla.mcs.anl.gov> <467FEF0D.1050700@mcs.anl.gov> <467FF19E.8000000@mcs.anl.gov> Message-ID: <467FF791.3090902@mcs.anl.gov> great ... then we agree ... Ben Clifford wrote: > On Mon, 25 Jun 2007, Ian Foster wrote: > > >> We should package these and (I think) integrate them with (i) so that >> when we grab the files we also run the programs to generate the graphs. >> > > yes, like I said: > > > >>> ii) the development of a log analysis stack. This very much overlaps with >>> what CEDPS and is a much bigger job.
>>> > > -- Ian Foster, Director, Computation Institute Argonne National Laboratory & University of Chicago Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439 Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637 Tel: +1 630 252 4619. Web: www.ci.uchicago.edu. Globus Alliance: www.globus.org. -------------- next part -------------- An HTML attachment was scrubbed... URL: From benc at hawaga.org.uk Mon Jun 25 12:42:23 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 25 Jun 2007 17:42:23 +0000 (GMT) Subject: [Swift-devel] Swift Performance Data In-Reply-To: <467FF757.1030708@cs.uchicago.edu> References: <4679FBC8.1080606@mcs.anl.gov> <467A6D78.4020702@mcs.anl.gov> <467A7AC6.7020400@mcs.anl.gov><467A7E17.5000207@mcs.anl.gov> <1975377455-1182773680-cardhu_decombobulator_blackberry.rim.net-423416081-@bxe006.bisx.prod.on.blackberry> <525747895-1182774044-cardhu_decombobulator_blackberry.rim.net-1528886809-@bxe006.bisx.prod.on.blackberry> <1182779125.5910.6.camel@blabla.mcs.anl.gov> <467FCEE5.6000707@mcs.anl.gov> <467FE759.50904@mcs.anl.gov> <1182788118.23226.3.camel@blabla.mcs.anl.gov> <467FF757.1030708@cs.uchicago.edu> Message-ID: On Mon, 25 Jun 2007, Ioan Raicu wrote: > On the other hand, the logs > that are geared towards graphing them (2) are mostly based on fixed time > intervals, and a few are based on events. right. I think there's a need for both (eg. compare task queue lengths or CPU load against job lifetime lines). > also relatively easy to add new state information to log to these existing > logs since they are all localized in a few places, with little effort, I can > add new metrics to monitor but only when those metrics are somehow associated with Falkon? 
One of the interesting things to do, I think, is to be able to get a job lifetime line that goes from when Swift decides the job exists all the way through to when Swift decides the job has finished, with the two/three colour job lines for jobs being inside Falkon as part of that lifetime line. > On the other hand, my debug logs (1-4) are all handled via log4j, look more > like the traditional logs that log4j generates and people are accustomed to, > but from my point of view, these are tedious and error-prone to parse for > graphing purposes. log4j can easily be configured to output different formats - so we could have human readable logs in one format and machine readable logs logging different information in a different format, I think. > Does this distinction (human readable vs. machine readable) between logs exist > in Swift? A little bit. The data in the swift/karajan logs is mostly intended for human consumption; the data in kickstart records is very much more structured and intended to be both human readable and machine readable. More machine readable edge and level based logging from inside swift and inside karajan could be useful, I think. 
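
As a rough illustration of the machine readable side of this, here is a minimal sketch that reads the space-delimited per-task log Ioan described earlier in this thread and splits each task into the segments of a job lifetime line. It is not code from Swift or Falkon: the names `TaskRecord`, `parse_task_log` and `lifetime_segments` are hypothetical, and the assumption that the wait, exec and results-queue times run back to back starting at startTime is inferred only from the three sample rows, where they sum to totalTime.

```python
from typing import List, NamedTuple, Tuple

# Column names as listed for the Falkon per-task log in this thread.
FIELDS = ("taskID workerID startTime endTime waitQueueTime "
          "execTime resultsQueueTime totalTime exitCode").split()

# The three sample rows quoted in the thread, preceded by the header row.
SAMPLE = """\
taskID workerID startTime endTime waitQueueTime execTime resultsQueueTime totalTime exitCode
tg-viz-login1.uc.teragrid.org:50103:1_1326356873 tg-c058.uc.teragrid.org:50100 1182533457601 1182533985431 467599 60225 6 527830 0
tg-viz-login1.uc.teragrid.org:50103:2_1124048393 tg-c052.uc.teragrid.org:50100 1182533457613 1182533985454 467735 60101 5 527841 0
tg-viz-login1.uc.teragrid.org:50103:3_1648367237 tg-c053.uc.teragrid.org:50100 1182533457616 1182533985524 467760 60138 10 527908 0
"""


class TaskRecord(NamedTuple):
    """One row of the per-task log (hypothetical type, times in ms)."""
    task_id: str
    worker_id: str
    start_ms: int
    end_ms: int
    wait_ms: int
    exec_ms: int
    results_ms: int
    total_ms: int
    exit_code: int


def parse_task_log(text: str) -> List[TaskRecord]:
    """Parse space-delimited rows, skipping the header and malformed lines."""
    records = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != len(FIELDS) or parts[0] == FIELDS[0]:
            continue
        records.append(TaskRecord(parts[0], parts[1], *map(int, parts[2:])))
    return records


def lifetime_segments(rec: TaskRecord, t0: int) -> List[Tuple[str, int, int]]:
    """Return (label, start, end) intervals in ms relative to t0.

    Assumes the three phases are consecutive and begin at startTime.
    """
    queued_start = rec.start_ms - t0
    exec_start = queued_start + rec.wait_ms
    results_start = exec_start + rec.exec_ms
    end = results_start + rec.results_ms
    return [("queued", queued_start, exec_start),
            ("executing", exec_start, results_start),
            ("results", results_start, end)]


if __name__ == "__main__":
    recs = parse_task_log(SAMPLE)
    t0 = min(r.start_ms for r in recs)
    for r in recs:
        print(r.task_id, lifetime_segments(r, t0))
```

Each segment could be drawn as one coloured piece of a job line by ploticus or gnuplot; a Swift-side log with the same shape would let the Swift pre- and post-processing time be added as further segments around the Falkon ones.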
-- From iraicu at cs.uchicago.edu Mon Jun 25 12:51:37 2007 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Mon, 25 Jun 2007 12:51:37 -0500 Subject: [Swift-devel] Swift Performance Data In-Reply-To: References: <4679FBC8.1080606@mcs.anl.gov> <467A6D78.4020702@mcs.anl.gov> <467A7AC6.7020400@mcs.anl.gov><467A7E17.5000207@mcs.anl.gov> <1975377455-1182773680-cardhu_decombobulator_blackberry.rim.net-423416081-@bxe006.bisx.prod.on.blackberry> <525747895-1182774044-cardhu_decombobulator_blackberry.rim.net-1528886809-@bxe006.bisx.prod.on.blackberry> <1182779125.5910.6.camel@blabla.mcs.anl.gov> <467FCEE5.6000707@mcs.anl.gov> <467FE759.50904@mcs.anl.gov> <1182788118.23226.3.camel@blabla.mcs.anl.gov> <467FF757.1030708@cs.uchicago.edu> Message-ID: <468000A9.3000908@cs.uchicago.edu> Ben Clifford wrote: > On Mon, 25 Jun 2007, Ioan Raicu wrote: > > >> On the other hand, the logs >> that are geared towards graphing them (2) are mostly based on fixed time >> intervals, and a few are based on events. >> > > right. I think there's a need for both (eg. compare task queue lengths or > CPU load against job lifetime lines). > > >> also relatively easy to add new state information to log to these existing >> logs since they are all localized in a few places, with little effort, I can >> add new metrics to monitor >> > > but only when those metrics are somehow associated with Falkon? One of the > interesting things to do, I think, is to be able to get a job lifetime > line that goes from when Swift decides the job exists all the way through > to when Swift decides the job has finished, with the two/three colour job > lines for jobs being inside Falkon as part of that lifetime line. > Right, and I think we can do this from the Swift logs, including the preprocessing time in Swift, the postprocessing time, plus the end-to-end time the task spent in Falkon, etc... 
The logs that I mentioned are Falkon specific; in Swift, I believe this kind of information is parsed out of the debug/info logs (human readable) to come up with the machine readable logs for graphing. We (Yong and I) had some trouble in the past generating these graphs from the Swift logs, as the logs did not always contain all the information we needed to draw the graph, or the parsing would fail, and we had to manually fix the problem in the logs and try the parsing again!
>
>> On the other hand, my debug logs (1-4) are all handled via log4j, look more
>> like the traditional logs that log4j generates and people are accustomed to,
>> but from my point of view, these are tedious and error-prone to parse for
>> graphing purposes.
>>
>
> log4j can easily be configured to output different formats - so we could
> have human readable logs in one format and machine readable logs logging
> different information in a different format, I think.
>
OK, that's good!
>
>> Does this distinction (human readable vs. machine readable) between logs exist
>> in Swift?
>>
>
> A little bit. The data in the swift/karajan logs is mostly intended for
> human consumption; the data in kickstart records is very much more
> structured and intended to be both human readable and machine readable.
>
> More machine readable edge and level based logging from inside swift and
> inside karajan could be useful, I think.
>
Right, but kickstart logs are all in separate files, so to really make sense of them in a programmatic way and to plot them on 1 graph, there needs to be an aggregating step that either just merges these files together in some orderly way, or might even summarize the data for easier graphing. From my understanding of the kickstart records, I think it's hard to generate overview graphs of an entire run because they are kept in many files.

Ioan

--
============================================
Ioan Raicu Ph.D.
Student ============================================ Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 ============================================ Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dsl.cs.uchicago.edu/ ============================================ ============================================ -------------- next part -------------- An HTML attachment was scrubbed... URL: From benc at hawaga.org.uk Mon Jun 25 12:55:04 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 25 Jun 2007 17:55:04 +0000 (GMT) Subject: [Swift-devel] Swift Performance Data In-Reply-To: <468000A9.3000908@cs.uchicago.edu> References: <4679FBC8.1080606@mcs.anl.gov> <467A6D78.4020702@mcs.anl.gov> <467A7AC6.7020400@mcs.anl.gov><467A7E17.5000207@mcs.anl.gov> <1975377455-1182773680-cardhu_decombobulator_blackberry.rim.net-423416081-@bxe006.bisx.prod.on.blackberry> <525747895-1182774044-cardhu_decombobulator_blackberry.rim.net-1528886809-@bxe006.bisx.prod.on.blackberry> <1182779125.5910.6.camel@blabla.mcs.anl.gov> <467FCEE5.6000707@mcs.anl.gov> <467FE759.50904@mcs.anl.gov> <1182788118.23226.3.camel@blabla.mcs.anl.gov> <467FF757.1030708@cs.uchicago.edu> <468000A9.3000908@cs.uchicago.edu> Message-ID: On Mon, 25 Jun 2007, Ioan Raicu wrote: > the machine readable logs for graphing. We (Yong and I) had some trouble in > the past generating these graphs from the Swift logs as the logs did not > always contain all the information we needed to draw the graph, or the parsing > would fail, and we had to manually fix the problem in the logs and try again > the parsing! The log messages aren't fixed in stone so if there are small changes that would be useful for this, make them or bring them up on this list. 
> Right, but kickstart logs are all in separate files, so to really make sense > of them in a programatic way and to plot them on 1 graph, there needs to be an > aggreting step that either just merges these files together in some orderly > way, or it mights even usmmarize the data for easier graphing. From my > understanding of the kickstart records, I think its hard to generate overview > graphs of an entire run due to the fact that they are kept in many files. A commandline XSLT processor and a for-loop in bash might do it. -- From wilde at mcs.anl.gov Wed Jun 27 07:48:58 2007 From: wilde at mcs.anl.gov (Mike Wilde) Date: Wed, 27 Jun 2007 07:48:58 -0500 Subject: [Swift-devel] Re: bugzilla change? In-Reply-To: References: <3E18205D-4534-4100-BFB1-1CAC389D5D76@mcs.anl.gov> Message-ID: <46825CBA.7010906@mcs.anl.gov> The thought here was to put swift-devel on all campaign bugs to keep everyone informed and to encourage discussion. Is this a good way to do things? Should we create a swift-devel bugzilla account? Or just multi-select all people on campaigns? - Mike Ben Clifford wrote, On 6/26/2007 9:52 PM: > The list of people you get on the 'cc' drop down list is everyone who has > a bugzilla account. If you want more addresses there, get the person in > question to get a bugzilla acount and it will magically appear on the > list. > > On Tue, 26 Jun 2007, Veronika Nefedova wrote: > >> PS. I am talking about this page specifically: >> http://bugzilla.mcs.anl.gov/swift/enter_bug.cgi?product=App-MolDyn >> >> On Jun 26, 2007, at 2:28 PM, Veronika Nefedova wrote: >> >>> Hi, Ben: >>> >>> do you know how I can modify the Bugzilla settings? For example, I'd like to >>> modify the Cc field (add or remove address there), but it seem there is no >>> such option.. >>> >>> Thanks! 
>>>
>>> Nika
>>>
>
>

--
Mike Wilde
Computation Institute, University of Chicago
Math & Computer Science Division
Argonne National Laboratory
Argonne, IL 60439 USA
tel 630-252-7497 fax 630-252-1997

From bugzilla-daemon at mcs.anl.gov Wed Jun 27 07:56:42 2007
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Wed, 27 Jun 2007 07:56:42 -0500 (CDT)
Subject: [Swift-devel] [Bug 72] New: Campaign for scaling wf up to 244 molecules
Message-ID:

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=72

           Summary: Campaign for scaling wf up to 244 molecules
           Product: App-MolDyn
           Version: unspecified
          Platform: Macintosh
        OS/Version: Mac OS
            Status: NEW
          Severity: normal
          Priority: P2
         Component: FreeEnergyForMolecules
        AssignedTo: nefedova at mcs.anl.gov
        ReportedBy: nefedova at mcs.anl.gov
                CC: swift-devel at ci.uchicago.edu

Campaign: scaling wf up to 244 molecules
Campaign Leader: Veronika Nefedova
Project: Swift
Technology: Molecular Dynamics Application

Objective: The Molecular Dynamics workflow at present can't be reliably
executed for a large number of molecules (100+). Execution fails for 244
molecules due to some problems with Falkon/Swift interactions.

Benefits: Executing the workflow for a large number of molecules would enable
the Molecular Dynamics group to run large simulations in one step, which would
increase productivity.

Implementation Details:
1. Analyze logs from the failed runs
2. If some information is missing from the logs -- add new debug printouts and
   repeat the run.
3. Act on the findings -- make corrections to either Swift or Falkon; repeat
   the run.
4. Repeat stages 1-3 until 244 molecules run successfully and reliably.

Deliverables:
1. Falkon code that could be installed on any system to handle 100+ molecule
   runs
2. Swift code that works correctly with Falkon (in svn)

--
Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.

From bugzilla-daemon at mcs.anl.gov Wed Jun 27 07:59:46 2007
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Wed, 27 Jun 2007 07:59:46 -0500 (CDT)
Subject: [Swift-devel] [Bug 72] Campaign for scaling wf up to 244 molecules
In-Reply-To:
Message-ID: <20070627125946.61B0216505@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=72

nefedova at mcs.anl.gov changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
OtherBugsDependingO|                            |73
              nThis|                            |

--
Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.

From bugzilla-daemon at mcs.anl.gov Wed Jun 27 08:01:31 2007
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Wed, 27 Jun 2007 08:01:31 -0500 (CDT)
Subject: [Swift-devel] [Bug 72] Campaign for scaling wf up to 244 molecules
In-Reply-To:
Message-ID: <20070627130131.740B616505@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=72

nefedova at mcs.anl.gov changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
OtherBugsDependingO|                            |74
              nThis|                            |

--
Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.

From bugzilla-daemon at mcs.anl.gov Wed Jun 27 08:03:04 2007 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Wed, 27 Jun 2007 08:03:04 -0500 (CDT) Subject: [Swift-devel] [Bug 73] Campaign: performance improvements for MolDyn workflow In-Reply-To: Message-ID: <20070627130304.E57F416505@foxtrot.mcs.anl.gov> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=73 nefedova at mcs.anl.gov changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |swift-devel at ci.uchicago.edu, | |nefedova at mcs.anl.gov -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is. From bugzilla-daemon at mcs.anl.gov Wed Jun 27 08:04:06 2007 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Wed, 27 Jun 2007 08:04:06 -0500 (CDT) Subject: [Swift-devel] [Bug 74] Campaign: Technology transfer to the user In-Reply-To: Message-ID: <20070627130406.5A09C16505@foxtrot.mcs.anl.gov> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=74 nefedova at mcs.anl.gov changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |swift-devel at ci.uchicago.edu, | |nefedova at mcs.anl.gov -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is. From benc at hawaga.org.uk Wed Jun 27 14:10:15 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 27 Jun 2007 19:10:15 +0000 (GMT) Subject: [Swift-devel] falkon bugzilla Message-ID: I've added a Falkon product to bugzilla with 'general' and 'provider-deef' components. As Ioan isn't signed up for the Swift bugzilla, I've made Yong the default owner for both. 
-- From benc at hawaga.org.uk Wed Jun 27 15:46:12 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 28 Jun 2007 02:16:12 +0530 (IST) Subject: [Swift-devel] a different way to do array/structure accesses Message-ID: I've been playing round with converting the .xml intermediate format to be more strictly XML and less a mixture of XML and various other syntaxes. One thing that comes out of this is that its simpler in the parser and compiler layer to generate array and structure accesses using a bunch of karajan level calls, like this: and 9091 instead of the way its done at the moment with a path syntax, like this: and This allows a bunch of simplification to happen with path handling in the swift code. However, it makes the Karajan intermediate code more complicated. From the language side of things, I'd like to make this change, but I don't know enough about how that effects the load on Karajan, especially with the insanely large source files that people are machine-generating. (here's the program I pulled these from: type mytype { int a; int b; } mytype foo; foo.a=9091; foo.b=818; print(foo.a); ) -- -- From hategan at mcs.anl.gov Wed Jun 27 15:58:03 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 27 Jun 2007 15:58:03 -0500 Subject: [Swift-devel] a different way to do array/structure accesses In-Reply-To: References: Message-ID: <1182977884.9558.2.camel@blabla.mcs.anl.gov> Probably not very efficient, for more than one reason. Also less clear. I'd vote against it. On Thu, 2007-06-28 at 02:16 +0530, Ben Clifford wrote: > I've been playing round with converting the .xml intermediate format to > be more strictly XML and less a mixture of XML and various other syntaxes. 
> > One thing that comes out of this is that its simpler in the parser and > compiler layer to generate array and structure accesses using a bunch of > karajan level calls, like this: > > > > > > > > and > > > > > > > 9091 > > > > > instead of the way its done at the moment with a path syntax, like this: > > > > and > > > > > > > > > This allows a bunch of simplification to happen with path handling in the > swift code. However, it makes the Karajan intermediate code more > complicated. From the language side of things, I'd like to make this > change, but I don't know enough about how that effects the load on > Karajan, especially with the insanely large source files that people are > machine-generating. > > > (here's the program I pulled these from: > > type mytype { int a; int b; } > > mytype foo; > > foo.a=9091; > > foo.b=818; > > print(foo.a); > > ) > > -- > > From hategan at mcs.anl.gov Wed Jun 27 15:59:05 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 27 Jun 2007 15:59:05 -0500 Subject: [Swift-devel] a different way to do array/structure accesses In-Reply-To: References: Message-ID: <1182977945.9558.4.camel@blabla.mcs.anl.gov> That reminds me. parallel(onlyOneThing()) should be reduced to onlyOneThing(). Either in the translation, or automatically in karajan. On Thu, 2007-06-28 at 02:16 +0530, Ben Clifford wrote: > I've been playing round with converting the .xml intermediate format to > be more strictly XML and less a mixture of XML and various other syntaxes. > > One thing that comes out of this is that its simpler in the parser and > compiler layer to generate array and structure accesses using a bunch of > karajan level calls, like this: > > > > > > > > and > > > > > > > 9091 > > > > > instead of the way its done at the moment with a path syntax, like this: > > > > and > > > > > > > > > This allows a bunch of simplification to happen with path handling in the > swift code. However, it makes the Karajan intermediate code more > complicated. 
From the language side of things, I'd like to make this > change, but I don't know enough about how that effects the load on > Karajan, especially with the insanely large source files that people are > machine-generating. > > > (here's the program I pulled these from: > > type mytype { int a; int b; } > > mytype foo; > > foo.a=9091; > > foo.b=818; > > print(foo.a); > > ) > > -- > > From benc at hawaga.org.uk Thu Jun 28 08:54:45 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 28 Jun 2007 13:54:45 +0000 (GMT) Subject: [Swift-devel] Re: Could not convert value to number: true In-Reply-To: <1182827359.17489.5.camel@blabla.mcs.anl.gov> References: <46807F9B.6050008@fnal.gov> <1182827359.17489.5.camel@blabla.mcs.anl.gov> Message-ID: I moved this to swift-devel from swift-user. On Mon, 25 Jun 2007, Mihael Hategan wrote: > > As for doing what you want, I have to think about it. We've talked in the past about more elaborate forms of mapping. Mapping data from other sources (rather than from disk data files, perhaps from databases); and mapping data (from whatever source) into the actual swift data space so that things like + and other operators can work on that data. Until/unless a strong application need arises for those, I don't think they're high enough priority for us to implement any time soon. Separately, I think its a bug that we allow the above code to compile and run with such a poor error message. Probably, attempts to get the value of a mapped piece of data should cause an error, rather than returning 'true' which is often not even of the right datatype, let alone a meaningful value. I'll put something in bugzilla for this. 
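
To make the behaviour proposed above concrete, here is a minimal Python sketch. It is not Swift's implementation; it only shows the idea that asking a file-mapped piece of data for an in-memory value should fail loudly instead of handing back a placeholder such as 'true'. The names DataHandle, MappedValueError, get_value and add are hypothetical and exist only for this illustration.

```python
class MappedValueError(TypeError):
    """Raised when code asks for the in-memory value of file-mapped data."""


class DataHandle:
    """Hypothetical stand-in for a variable that is either an in-memory
    primitive or mapped to an external file, never both."""

    def __init__(self, name, value=None, mapped_path=None):
        if value is not None and mapped_path is not None:
            raise ValueError("a handle is either a value or a mapping, not both")
        self.name = name
        self._value = value
        self.mapped_path = mapped_path

    def get_value(self):
        # Fail with a clear message rather than return a placeholder
        # (such as True) that may not even have the right type.
        if self.mapped_path is not None:
            raise MappedValueError(
                f"variable '{self.name}' is mapped to '{self.mapped_path}' "
                "and has no in-memory value"
            )
        return self._value


def add(left, right):
    """Stand-in for an operator like '+', which needs real values."""
    return left.get_value() + right.get_value()
```

With this shape, an expression that mixes a mapped variable into arithmetic stops at the first get_value() call with a message naming the variable, rather than failing later with something like "Could not convert value to number: true".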
-- From bugzilla-daemon at mcs.anl.gov Thu Jun 28 09:10:26 2007 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Thu, 28 Jun 2007 09:10:26 -0500 (CDT) Subject: [Swift-devel] [Bug 72] Campaign for scaling wf up to 244 molecules In-Reply-To: Message-ID: <20070628141026.6707716506@foxtrot.mcs.anl.gov> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=72 nefedova at mcs.anl.gov changed: What |Removed |Added ---------------------------------------------------------------------------- BugsThisDependsOn| |76 -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is. From bugzilla-daemon at mcs.anl.gov Thu Jun 28 09:11:07 2007 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Thu, 28 Jun 2007 09:11:07 -0500 (CDT) Subject: [Swift-devel] [Bug 76] disable intermediate stageout of data In-Reply-To: Message-ID: <20070628141107.E98D8164DB@foxtrot.mcs.anl.gov> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=76 nefedova at mcs.anl.gov changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |swift-devel at ci.uchicago.edu -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is. 
From bugzilla-daemon at mcs.anl.gov Thu Jun 28 16:12:48 2007
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Thu, 28 Jun 2007 16:12:48 -0500 (CDT)
Subject: [Swift-devel] [Bug 72] Campaign for scaling wf up to 244 molecules
In-Reply-To:
Message-ID: <20070628211248.35B6116505@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=72

------- Comment #1 from wilde at mcs.anl.gov 2007-06-28 16:12 -------
I've reviewed the email thread on this bug, and am moving this discussion to
bugzilla.

I am uncertain about the following - can people involved (Nika, Ioan,
Mihael) clarify:

- did Mihael discover an error in Falkon mutex code?

- if so was it fixed, and did it correct the problem of missed completion
notifications?

- what's the state of the "unable to write output file" problem?

- do we still have a bad node in UC-Teraport w/ NFS Stale file handles? If so,
was that reported? (This raises interesting issues in troubleshooting and
trouble workaround)

- do we have a plan for how to run this WF at scale? Meaning how to get 244
nodes for several days, whether we can scale up beyond
1-processor-per-molecule, what the expected runtime is, how to deal with
errors/restarts, etc? (Should detail this here in bugz).

--
Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.

From bugzilla-daemon at mcs.anl.gov Thu Jun 28 16:25:54 2007 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Thu, 28 Jun 2007 16:25:54 -0500 (CDT) Subject: [Swift-devel] [Bug 72] Campaign for scaling wf up to 244 molecules In-Reply-To: Message-ID: <20070628212554.0516B164DB@foxtrot.mcs.anl.gov> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=72 ------- Comment #2 from wilde at mcs.anl.gov 2007-06-28 16:25 ------- Ioan Raicu wrote, On 6/28/2007 4:11 PM: > Hi, > Yong and I are working at making some small changes (synchronizing some > lists, more logging, etc...) in the Falkon provider. We are also > working on the automatic graphing capability from Falkon's logs. We > should be ready to give the experiment another run later today. > > Ioan OK - thanks Ioan and Yong. By these "small changes" do you mean that the synchronization issue Mihael raised *was* or *was not* determined to be a cause of missing the 2000+ notifications out of 20K+ ? Are you now working on tightening up the syncro further? The graphing thing sounds good. When you get a moment, please respond on the last 3 items. Thanks. -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is. From iraicu at cs.uchicago.edu Thu Jun 28 16:27:04 2007 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Thu, 28 Jun 2007 16:27:04 -0500 Subject: [Swift-devel] [Bug 72] Campaign for scaling wf up to 244 molecules In-Reply-To: <20070628211248.35B6116505@foxtrot.mcs.anl.gov> References: <20070628211248.35B6116505@foxtrot.mcs.anl.gov> Message-ID: <468427A8.10104@cs.uchicago.edu> bugzilla-daemon at mcs.anl.gov wrote: > http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=72 > > > > > > ------- Comment #1 from wilde at mcs.anl.gov 2007-06-28 16:12 ------- > Ive reviewed this email thread on this bug, and am moving this discussion to > bugzilla. 
>
> I am uncertain about the following - can people involved (Nika, Ioan,
> Mihael) clarify:
>
> - did Mihael discover an error in Falkon mutex code?
>
>
We are not sure, but we are adding extra synchronization in several
parts of the Falkon provider. The reason we are saying that we are not
sure is that we stress tested (pushing 50~100 tasks/sec) both the Falkon
provider and Falkon itself over and over again, and we never encountered
this. Now that we have a workflow that averages 1 task/sec, I find
it hard to believe that a synchronization issue that never surfaced
before under stress testing is surfacing now under such a light load.
We are also verifying that we are handling all exceptions correctly
within the Falkon provider.
> - if so was it fixed, and did it correct the problem of missed completion
> notifications?
>
We don't know; the problems are not reproducible over short runs, and only
seem to pop up with longer runs. For example, we completed the 100 mol
run just fine, which had 10K jobs. We have to rerun the 244 mol run to
verify things.
> - what's the state of the "unable to write output file" problem?
>
> - do we still have a bad node in UC-Teraport w/ NFS Stale file handles? If so,
> was that reported? (This raises interesting issues in troubleshooting and
> trouble workaround)
>
I reported it, but help at tg claims the node is working fine. They claim
that once in a while, it is normal for this to happen, and my argument
that all other nodes behaved perfectly with the exception of this one
isn't enough for them. For now, if we get this node again, we can
manually kill the Falkon worker there so Falkon won't use it anymore.
> - do we have a plan for how to run this WF at scale? Meaning how to get 244
> nodes for several days, whether we can scale up beyond
> 1-processor-per-molecule, what the expected runtime is, how to deal with
> errors/restarts, etc? (Should detail this here in bugz).
> There is still work I need to do to ensure that a task that is running when the resource lease expires is correctly handled and Swift is notified that it failed. I have the code written and in Falkon already, but I have yet to test it. We need to make sure this works before we try to get say 24 hour resource allocations when we know the experiment will likely take several days. Also, I think the larger part of the workflow could benefit from more than 1 node per molecule, so if we could get more, it should improve the end-to-end time. Ioan > > -- ============================================ Ioan Raicu Ph.D. Student ============================================ Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 ============================================ Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dsl.cs.uchicago.edu/ ============================================ ============================================ From itf at mcs.anl.gov Thu Jun 28 16:26:25 2007 From: itf at mcs.anl.gov (=?utf-8?B?SWFuIEZvc3Rlcg==?=) Date: Thu, 28 Jun 2007 21:26:25 +0000 Subject: [Swift-devel] [Bug 72] Campaign for scaling wf up to 244 molecules In-Reply-To: <468427A8.10104@cs.uchicago.edu> References: <20070628211248.35B6116505@foxtrot.mcs.anl.gov><468427A8.10104@cs.uchicago.edu> Message-ID: <1895472852-1183066171-cardhu_decombobulator_blackberry.rim.net-826324017-@bxe006.bisx.prod.on.blackberry> Did we do a complete code review? 
Sent via BlackBerry from T-Mobile -----Original Message----- From: Ioan Raicu Date: Thu, 28 Jun 2007 16:27:04 To:bugzilla-daemon at mcs.anl.gov Cc:swift-devel at ci.uchicago.edu Subject: Re: [Swift-devel] [Bug 72] Campaign for scaling wf up to 244 molecules bugzilla-daemon at mcs.anl.gov wrote: > http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=72 > > > > > > ------- Comment #1 from wilde at mcs.anl.gov 2007-06-28 16:12 ------- > Ive reviewed this email thread on this bug, and am moving this discussion to > bugzilla. > > I and am uncertain about the following - can people involved (Nika, Ioan, > Mihael) clarify: > > - did Mihael discover an error in Falkon mutex code? > > We are not sure, but we are adding extra synchronization in several parts of the Falkon provider. The reason we are saying that we are not sure is that we stress tested (pushing 50~100 tasks/sec) both the Falkon provider and Falkon itself over and over again, and we never encountered this. Now, we have a workflow that has an average of 1 task/sec, I find it hard to beleive that a synchronization issue that never surfaced before under stress testing is surfacing now under such a light load. We are also verifying that we are handling all exceptions correctly within the Falkon provider. > - if so was it fixed, and did it correct the problem of missed completion > notifications? > We don't know, the problems are reproducible over short runs, and only seem to pop up with longer runs. For example, we completed the 100 mol run just fine, which had 10K jobs. We have to rerun the 244 mol run to verify things. > - whats the state of the "unable to write output file" problem? > > - do we still have a bad node in UC-Teraport w/ NSF Stale file handles? If so, > was that reported? (This raises interesting issues in troubleshooting and > trouble workaround) > I reported it, but help at tg claims the node is working fine. 
They claim that once in a while, it is normal for this to happen, and my argument that all other nodes behaved perfectly with the exception of this one isn't enough for them. For now, if we get this node again, we can manually kill the Falkon worker there so Falkon won't use it anymore. > - do we have a plan for how to run this WF at scale? Meaning how to get 244 > nodes for several days, whether we can scale up beyond > 1-processor-per-molecule, what the expected runtime is, how to deal with > errors/restarts, etc? (Should detail this here in bugz). > There is still work I need to do to ensure that a task that is running when the resource lease expires is correctly handled and Swift is notified that it failed. I have the code written and in Falkon already, but I have yet to test it. We need to make sure this works before we try to get say 24 hour resource allocations when we know the experiment will likely take several days. Also, I think the larger part of the workflow could benefit from more than 1 node per molecule, so if we could get more, it should improve the end-to-end time. Ioan > > -- ============================================ Ioan Raicu Ph.D. Student ============================================ Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 
58th Street, Ryerson Hall Chicago, IL 60637 ============================================ Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dsl.cs.uchicago.edu/ ============================================ ============================================ _______________________________________________ Swift-devel mailing list Swift-devel at ci.uchicago.edu http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From iraicu at cs.uchicago.edu Thu Jun 28 16:32:35 2007 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Thu, 28 Jun 2007 16:32:35 -0500 Subject: [Swift-devel] [Bug 72] Campaign for scaling wf up to 244 molecules In-Reply-To: <1895472852-1183066171-cardhu_decombobulator_blackberry.rim.net-826324017-@bxe006.bisx.prod.on.blackberry> References: <20070628211248.35B6116505@foxtrot.mcs.anl.gov><468427A8.10104@cs.uchicago.edu> <1895472852-1183066171-cardhu_decombobulator_blackberry.rim.net-826324017-@bxe006.bisx.prod.on.blackberry> Message-ID: <468428F3.1060508@cs.uchicago.edu> No, just the Falkon provider (~500 lines of code), as far as I know. The Falkon service is around 10K lines of code, and the Falkon executor is another 3K, so they will likely take longer than a few days for a code review of everything in Falkon. Ioan Ian Foster wrote: > Did we do a complete code review? > > Sent via BlackBerry from T-Mobile > > -----Original Message----- > From: Ioan Raicu > > Date: Thu, 28 Jun 2007 16:27:04 > To:bugzilla-daemon at mcs.anl.gov > Cc:swift-devel at ci.uchicago.edu > Subject: Re: [Swift-devel] [Bug 72] Campaign for scaling wf up to 244 molecules > > > > > bugzilla-daemon at mcs.anl.gov wrote: > >> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=72 >> >> >> >> >> >> ------- Comment #1 from wilde at mcs.anl.gov 2007-06-28 16:12 ------- >> Ive reviewed this email thread on this bug, and am moving this discussion to >> bugzilla. 
>> >> I and am uncertain about the following - can people involved (Nika, Ioan, >> Mihael) clarify: >> >> - did Mihael discover an error in Falkon mutex code? >> >> >> > We are not sure, but we are adding extra synchronization in several > parts of the Falkon provider. The reason we are saying that we are not > sure is that we stress tested (pushing 50~100 tasks/sec) both the Falkon > provider and Falkon itself over and over again, and we never encountered > this. Now, we have a workflow that has an average of 1 task/sec, I find > it hard to beleive that a synchronization issue that never surfaced > before under stress testing is surfacing now under such a light load. > We are also verifying that we are handling all exceptions correctly > within the Falkon provider. > >> - if so was it fixed, and did it correct the problem of missed completion >> notifications? >> >> > We don't know, the problems are reproducible over short runs, and only > seem to pop up with longer runs. For example, we completed the 100 mol > run just fine, which had 10K jobs. We have to rerun the 244 mol run to > verify things. > >> - whats the state of the "unable to write output file" problem? >> >> - do we still have a bad node in UC-Teraport w/ NSF Stale file handles? If so, >> was that reported? (This raises interesting issues in troubleshooting and >> trouble workaround) >> >> > I reported it, but help at tg claims the node is working fine. They claim > that once in a while, it is normal for this to happen, and my argument > that all other nodes behaved perfectly with the exception of this one > isn't enough for them. For now, if we get this node again, we can > manually kill the Falkon worker there so Falkon won't use it anymore. > >> - do we have a plan for how to run this WF at scale? Meaning how to get 244 >> nodes for several days, whether we can scale up beyond >> 1-processor-per-molecule, what the expected runtime is, how to deal with >> errors/restarts, etc? 
(Should detail this here in bugz). >> >> > There is still work I need to do to ensure that a task that is running > when the resource lease expires is correctly handled and Swift is > notified that it failed. I have the code written and in Falkon already, > but I have yet to test it. We need to make sure this works before we > try to get say 24 hour resource allocations when we know the experiment > will likely take several days. Also, I think the larger part of the > workflow could benefit from more than 1 node per molecule, so if we > could get more, it should improve the end-to-end time. > > Ioan > >> >> > > -- ============================================ Ioan Raicu Ph.D. Student ============================================ Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 ============================================ Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dsl.cs.uchicago.edu/ ============================================ ============================================ -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Thu Jun 28 16:32:52 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 28 Jun 2007 16:32:52 -0500 Subject: [Swift-devel] [Bug 72] Campaign for scaling wf up to 244 molecules In-Reply-To: <468427A8.10104@cs.uchicago.edu> References: <20070628211248.35B6116505@foxtrot.mcs.anl.gov> <468427A8.10104@cs.uchicago.edu> Message-ID: <1183066372.25279.5.camel@blabla.mcs.anl.gov> > > > > - did Mihael discover an error in Falkon mutex code? > > > > > We are not sure, but we are adding extra synchronization in several > parts of the Falkon provider. The reason we are saying that we are not > sure is that we stress tested (pushing 50~100 tasks/sec) both the Falkon > provider and Falkon itself over and over again, and we never encountered > this. 
Now, we have a workflow that has an average of 1 task/sec, I find > it hard to beleive that a synchronization issue that never surfaced > before under stress testing is surfacing now under such a light load. ?!? You are mutating maps and list from concurrent threads without synchronization. That is a problem regardless of any other considerations. Mihael From hategan at mcs.anl.gov Thu Jun 28 16:35:33 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 28 Jun 2007 16:35:33 -0500 Subject: [Swift-devel] [Bug 72] Campaign for scaling wf up to 244 molecules In-Reply-To: <468428F3.1060508@cs.uchicago.edu> References: <20070628211248.35B6116505@foxtrot.mcs.anl.gov> <468427A8.10104@cs.uchicago.edu> <1895472852-1183066171-cardhu_decombobulator_blackberry.rim.net-826324017-@bxe006.bisx.prod.on.blackberry> <468428F3.1060508@cs.uchicago.edu> Message-ID: <1183066533.25279.8.camel@blabla.mcs.anl.gov> I was waiting for it to be cleaned up and put into SVN, as we agreed. On Thu, 2007-06-28 at 16:32 -0500, Ioan Raicu wrote: > No, just the Falkon provider (~500 lines of code), as far as I know. > > The Falkon service is around 10K lines of code, and the Falkon > executor is another 3K, so they will likely take longer than a few > days for a code review of everything in Falkon. > > Ioan > > Ian Foster wrote: > > Did we do a complete code review? > > > > Sent via BlackBerry from T-Mobile > > > > -----Original Message----- > > From: Ioan Raicu > > > > Date: Thu, 28 Jun 2007 16:27:04 > > To:bugzilla-daemon at mcs.anl.gov > > Cc:swift-devel at ci.uchicago.edu > > Subject: Re: [Swift-devel] [Bug 72] Campaign for scaling wf up to 244 molecules > > > > > > > > > > bugzilla-daemon at mcs.anl.gov wrote: > > > > > http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=72 > > > > > > > > > > > > > > > > > > ------- Comment #1 from wilde at mcs.anl.gov 2007-06-28 16:12 ------- > > > Ive reviewed this email thread on this bug, and am moving this discussion to > > > bugzilla. 
> > > > > > I and am uncertain about the following - can people involved (Nika, Ioan, > > > Mihael) clarify: > > > > > > - did Mihael discover an error in Falkon mutex code? > > > > > > > > > > > We are not sure, but we are adding extra synchronization in several > > parts of the Falkon provider. The reason we are saying that we are not > > sure is that we stress tested (pushing 50~100 tasks/sec) both the Falkon > > provider and Falkon itself over and over again, and we never encountered > > this. Now, we have a workflow that has an average of 1 task/sec, I find > > it hard to beleive that a synchronization issue that never surfaced > > before under stress testing is surfacing now under such a light load. > > We are also verifying that we are handling all exceptions correctly > > within the Falkon provider. > > > > > - if so was it fixed, and did it correct the problem of missed completion > > > notifications? > > > > > > > > We don't know, the problems are reproducible over short runs, and only > > seem to pop up with longer runs. For example, we completed the 100 mol > > run just fine, which had 10K jobs. We have to rerun the 244 mol run to > > verify things. > > > > > - whats the state of the "unable to write output file" problem? > > > > > > - do we still have a bad node in UC-Teraport w/ NSF Stale file handles? If so, > > > was that reported? (This raises interesting issues in troubleshooting and > > > trouble workaround) > > > > > > > > I reported it, but help at tg claims the node is working fine. They claim > > that once in a while, it is normal for this to happen, and my argument > > that all other nodes behaved perfectly with the exception of this one > > isn't enough for them. For now, if we get this node again, we can > > manually kill the Falkon worker there so Falkon won't use it anymore. > > > > > - do we have a plan for how to run this WF at scale? 
Meaning how to get 244 > > > nodes for several days, whether we can scale up beyond > > > 1-processor-per-molecule, what the expected runtime is, how to deal with > > > errors/restarts, etc? (Should detail this here in bugz). > > > > > > > > There is still work I need to do to ensure that a task that is running > > when the resource lease expires is correctly handled and Swift is > > notified that it failed. I have the code written and in Falkon already, > > but I have yet to test it. We need to make sure this works before we > > try to get say 24 hour resource allocations when we know the experiment > > will likely take several days. Also, I think the larger part of the > > workflow could benefit from more than 1 node per molecule, so if we > > could get more, it should improve the end-to-end time. > > > > Ioan > > > > > > > > -- > ============================================ > Ioan Raicu > Ph.D. Student > ============================================ > Distributed Systems Laboratory > Computer Science Department > University of Chicago > 1100 E. 
58th Street, Ryerson Hall > Chicago, IL 60637 > ============================================ > Email: iraicu at cs.uchicago.edu > Web: http://www.cs.uchicago.edu/~iraicu > http://dsl.cs.uchicago.edu/ > ============================================ > ============================================ > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From iraicu at cs.uchicago.edu Thu Jun 28 16:36:21 2007 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Thu, 28 Jun 2007 16:36:21 -0500 Subject: [Swift-devel] [Bug 72] Campaign for scaling wf up to 244 molecules In-Reply-To: <1183066372.25279.5.camel@blabla.mcs.anl.gov> References: <20070628211248.35B6116505@foxtrot.mcs.anl.gov> <468427A8.10104@cs.uchicago.edu> <1183066372.25279.5.camel@blabla.mcs.anl.gov> Message-ID: <468429D5.7070200@cs.uchicago.edu> There is an option to have a pool of threads work on these data structures, but the pool size is set to 1. Point is well taken, we have fixed this, but I am not convinced this is where the problem was. We'll see after we do another run with all the extra logging. Ioan Mihael Hategan wrote: >>> - did Mihael discover an error in Falkon mutex code? >>> >>> >>> >> We are not sure, but we are adding extra synchronization in several >> parts of the Falkon provider. The reason we are saying that we are not >> sure is that we stress tested (pushing 50~100 tasks/sec) both the Falkon >> provider and Falkon itself over and over again, and we never encountered >> this. Now, we have a workflow that has an average of 1 task/sec, I find >> it hard to beleive that a synchronization issue that never surfaced >> before under stress testing is surfacing now under such a light load. >> > > ?!? > You are mutating maps and list from concurrent threads without > synchronization. That is a problem regardless of any other > considerations. 
> > Mihael > > > > > -- ============================================ Ioan Raicu Ph.D. Student ============================================ Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 ============================================ Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dsl.cs.uchicago.edu/ ============================================ ============================================ -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Thu Jun 28 16:41:45 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 28 Jun 2007 16:41:45 -0500 Subject: [Swift-devel] [Bug 72] Campaign for scaling wf up to 244 molecules In-Reply-To: <468429D5.7070200@cs.uchicago.edu> References: <20070628211248.35B6116505@foxtrot.mcs.anl.gov> <468427A8.10104@cs.uchicago.edu> <1183066372.25279.5.camel@blabla.mcs.anl.gov> <468429D5.7070200@cs.uchicago.edu> Message-ID: <1183066905.26775.2.camel@blabla.mcs.anl.gov> On Thu, 2007-06-28 at 16:36 -0500, Ioan Raicu wrote: > There is an option to have a pool of threads work on these data > structures, but the pool size is set to 1. Right, but the submit() method was called from different threads. Can we stop arguing about the obvious? > Point is well taken, we have fixed this, but I am not convinced this > is where the problem was. We'll see after we do another run with all > the extra logging. Can you commit the updates to svn? > > Ioan > > Mihael Hategan wrote: > > > > - did Mihael discover an error in Falkon mutex code? > > > > > > > > > > > > > > > We are not sure, but we are adding extra synchronization in several > > > parts of the Falkon provider. The reason we are saying that we are not > > > sure is that we stress tested (pushing 50~100 tasks/sec) both the Falkon > > > provider and Falkon itself over and over again, and we never encountered > > > this. 
Now, we have a workflow that has an average of 1 task/sec, I find > > > it hard to beleive that a synchronization issue that never surfaced > > > before under stress testing is surfacing now under such a light load. > > > > > > > ?!? > > You are mutating maps and list from concurrent threads without > > synchronization. That is a problem regardless of any other > > considerations. > > > > Mihael > > > > > > > > > > > > -- > ============================================ > Ioan Raicu > Ph.D. Student > ============================================ > Distributed Systems Laboratory > Computer Science Department > University of Chicago > 1100 E. 58th Street, Ryerson Hall > Chicago, IL 60637 > ============================================ > Email: iraicu at cs.uchicago.edu > Web: http://www.cs.uchicago.edu/~iraicu > http://dsl.cs.uchicago.edu/ > ============================================ > ============================================ From wilde at mcs.anl.gov Thu Jun 28 17:42:55 2007 From: wilde at mcs.anl.gov (Mike Wilde) Date: Thu, 28 Jun 2007 17:42:55 -0500 Subject: [Swift-devel] [Bug 72] Campaign for scaling wf up to 244 molecules In-Reply-To: <1183066905.26775.2.camel@blabla.mcs.anl.gov> References: <20070628211248.35B6116505@foxtrot.mcs.anl.gov> <468427A8.10104@cs.uchicago.edu> <1183066372.25279.5.camel@blabla.mcs.anl.gov> <468429D5.7070200@cs.uchicago.edu> <1183066905.26775.2.camel@blabla.mcs.anl.gov> Message-ID: <4684396F.30209@mcs.anl.gov> STOP. DO NOT reply to this email. reply instead via a comment in bugzilla. (do I sound like Ben yet? ;) Ioan, My understanding is that Mihael pointed out 2 clear unsynchronized race conditions from his review of the Falkon provider code. Do you agree or disagree? If you agree, have you fixed the race? If not, do we need to discuss it further among more experts to get to an decision we believe is correct? 
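For illustration only, here is a minimal sketch of the kind of fix being discussed, assuming a provider that records submitted tasks in a shared map and list while submit() is called from several threads. The names TaskTable, submit, complete and runConcurrently are hypothetical and are not taken from the Falkon provider code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a provider's bookkeeping; not the Falkon code.
final class TaskTable {
    private final Map<String, String> statusById = new HashMap<>();
    private final List<String> submissionOrder = new ArrayList<>();

    // submit() may be called from many threads, so every access to the
    // shared map and list goes through the same monitor (this).
    synchronized void submit(String taskId) {
        statusById.put(taskId, "SUBMITTED");
        submissionOrder.add(taskId);
    }

    synchronized void complete(String taskId) {
        statusById.put(taskId, "COMPLETED");
    }

    synchronized int size() {
        return statusById.size();
    }

    // Calls submit() from several threads and returns how many tasks were recorded.
    static int runConcurrently(int threads, int tasksPerThread) {
        TaskTable table = new TaskTable();
        List<Thread> workers = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            final int id = t;
            Thread worker = new Thread(() -> {
                for (int i = 0; i < tasksPerThread; i++) {
                    table.submit("task-" + id + "-" + i);
                }
            });
            workers.add(worker);
            worker.start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
        }
        return table.size();
    }

    public static void main(String[] args) {
        // Every task id is unique, so all of them should be recorded.
        System.out.println(runConcurrently(8, 1000));
    }
}
```

Removing the synchronized keywords from this sketch would not necessarily make it fail on every run: lost updates in an unsynchronized HashMap depend on thread timing, which is one reason a race can stay hidden under one load pattern and appear under another.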
I don't want to sermonize, but will do so anyway:

- mutex/synchronization problems are devilishly subtle

- to make mutex code work right, you need *both* code review and extensive
  testing, and ideally a lot of code asserts to make sure you are (locked)
  where you think you are.

- if we are arguing about the obvious, it's probably not obvious to everyone
  (so f2f tabletop code review is helpful here, for both education and
  verification)

- to get mutex code right you need to make sure you have the tasks and shared
  data structures (and hence access patterns) clearly identified

- then you need tons of testing: not just live tests, but carefully contrived
  artificial tests to stress test various mutex situations and potential race
  and deadlock conditions.

I don't think we should stop testing to do a code review, but we certainly
will need to do one before we can expect very high reliability.

I'd like to ask you, Ioan, since it's your code and project, to work out a
schedule that works for everyone, and organize a review. I understand that the
core Falkon code needs some simple cosmetic cleanup (mainly removing fossil
code) and then posting in SVN.

:)

Mike

Mihael Hategan wrote, On 6/28/2007 4:41 PM:
> On Thu, 2007-06-28 at 16:36 -0500, Ioan Raicu wrote:
>> There is an option to have a pool of threads work on these data
>> structures, but the pool size is set to 1.
>
> Right, but the submit() method was called from different threads. Can we
> stop arguing about the obvious?
>
>> Point is well taken, we have fixed this, but I am not convinced this
>> is where the problem was. We'll see after we do another run with all
>> the extra logging.
>
> Can you commit the updates to svn?
>
>> Ioan
>>
>> Mihael Hategan wrote:
>>>>> - did Mihael discover an error in Falkon mutex code?
>>>>>
>>>>>
>>>>>
>>>> We are not sure, but we are adding extra synchronization in several
>>>> parts of the Falkon provider.
The reason we are saying that we are not >>>> sure is that we stress tested (pushing 50~100 tasks/sec) both the Falkon >>>> provider and Falkon itself over and over again, and we never encountered >>>> this. Now, we have a workflow that has an average of 1 task/sec, I find >>>> it hard to beleive that a synchronization issue that never surfaced >>>> before under stress testing is surfacing now under such a light load. >>>> >>> ?!? >>> You are mutating maps and list from concurrent threads without >>> synchronization. That is a problem regardless of any other >>> considerations. >>> >>> Mihael >>> >>> >>> >>> >>> >> -- >> ============================================ >> Ioan Raicu >> Ph.D. Student >> ============================================ >> Distributed Systems Laboratory >> Computer Science Department >> University of Chicago >> 1100 E. 58th Street, Ryerson Hall >> Chicago, IL 60637 >> ============================================ >> Email: iraicu at cs.uchicago.edu >> Web: http://www.cs.uchicago.edu/~iraicu >> http://dsl.cs.uchicago.edu/ >> ============================================ >> ============================================ > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > -- Mike Wilde Computation Institute, University of Chicago Math & Computer Science Division Argonne National Laboratory Argonne, IL 60439 USA tel 630-252-7497 fax 630-252-1997 From wilde at mcs.anl.gov Thu Jun 28 18:11:43 2007 From: wilde at mcs.anl.gov (Mike Wilde) Date: Thu, 28 Jun 2007 18:11:43 -0500 Subject: [Swift-devel] [Bug 72] Campaign for scaling wf up to 244 molecules In-Reply-To: <468429D5.7070200@cs.uchicago.edu> References: <20070628211248.35B6116505@foxtrot.mcs.anl.gov> <468427A8.10104@cs.uchicago.edu> <1183066372.25279.5.camel@blabla.mcs.anl.gov> <468429D5.7070200@cs.uchicago.edu> Message-ID: <4684402F.5040503@mcs.anl.gov> Good discussion. 
I know this will take a bit of time to make a habit, but lets try to move discussion to the Campaign bug, when it applies. Click the link and enter a Comment as your reply in bugzilla. I think the Globus team is very used to doing this, and I believe thats a good practice and we should adopt it. - Mike Ioan Raicu wrote, On 6/28/2007 4:36 PM: > There is an option to have a pool of threads work on these data > structures, but the pool size is set to 1. Point is well taken, we have > fixed this, but I am not convinced this is where the problem was. We'll > see after we do another run with all the extra logging. > > Ioan > > Mihael Hategan wrote: >>>> - did Mihael discover an error in Falkon mutex code? >>>> >>>> >>>> >>> We are not sure, but we are adding extra synchronization in several >>> parts of the Falkon provider. The reason we are saying that we are not >>> sure is that we stress tested (pushing 50~100 tasks/sec) both the Falkon >>> provider and Falkon itself over and over again, and we never encountered >>> this. Now, we have a workflow that has an average of 1 task/sec, I find >>> it hard to beleive that a synchronization issue that never surfaced >>> before under stress testing is surfacing now under such a light load. >>> >> >> ?!? >> You are mutating maps and list from concurrent threads without >> synchronization. That is a problem regardless of any other >> considerations. >> >> Mihael >> >> >> >> >> > > -- > ============================================ > Ioan Raicu > Ph.D. Student > ============================================ > Distributed Systems Laboratory > Computer Science Department > University of Chicago > 1100 E. 
58th Street, Ryerson Hall
> Chicago, IL 60637
> ============================================
> Email: iraicu at cs.uchicago.edu
> Web: http://www.cs.uchicago.edu/~iraicu
> http://dsl.cs.uchicago.edu/
> ============================================
> ============================================
>
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel

--
Mike Wilde
Computation Institute, University of Chicago
Math & Computer Science Division
Argonne National Laboratory
Argonne, IL 60439 USA
tel 630-252-7497 fax 630-252-1997

From hategan at mcs.anl.gov Thu Jun 28 18:16:19 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 28 Jun 2007 18:16:19 -0500
Subject: [Swift-devel] [Bug 72] Campaign for scaling wf up to 244 molecules
In-Reply-To: <4684396F.30209@mcs.anl.gov>
References: <20070628211248.35B6116505@foxtrot.mcs.anl.gov> <468427A8.10104@cs.uchicago.edu> <1183066372.25279.5.camel@blabla.mcs.anl.gov> <468429D5.7070200@cs.uchicago.edu> <1183066905.26775.2.camel@blabla.mcs.anl.gov> <4684396F.30209@mcs.anl.gov>
Message-ID: <1183072579.20493.1.camel@blabla.mcs.anl.gov>

On Thu, 2007-06-28 at 17:42 -0500, Mike Wilde wrote:
> (do I sound like Ben yet? ;)

Nope. You're missing the accent :)

From benc at hawaga.org.uk Fri Jun 29 13:12:53 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 29 Jun 2007 23:42:53 +0530 (IST)
Subject: [Swift-devel] PA_VAR vs PA_VAR1
Message-ID:

GetFieldValue has:

public class GetFieldValue extends VDLFunction {
    [...]
    public static final Arg PA_VAR1 = new Arg.Positional("var");

and also inherits from VDLFunction:

    public static final Arg PA_VAR = new Arg.Positional("var");

From the name, it looks like PA_VAR1 was deliberately made to not be PA_VAR
but I don't really understand why.
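As a small illustration of the situation described above, here is a sketch with stand-in classes. SketchArg, BaseFunction and DerivedFunction are hypothetical names; the real Arg, VDLFunction and GetFieldValue classes are not reproduced, and whether the real code matches arguments by name or by object identity is not stated in the message.

```java
// Hypothetical miniature of the two declarations; not the Swift/Karajan classes.
final class SketchArg {
    final String name;

    SketchArg(String name) {
        this.name = name;
    }
}

class BaseFunction {
    public static final SketchArg PA_VAR = new SketchArg("var");
}

class DerivedFunction extends BaseFunction {
    // A differently named constant does not hide PA_VAR: the subclass
    // ends up with two separate objects that carry the same name.
    public static final SketchArg PA_VAR1 = new SketchArg("var");
}

final class StaticFieldDemo {
    // True: both constants describe an argument called "var".
    static boolean sameName() {
        return DerivedFunction.PA_VAR.name.equals(DerivedFunction.PA_VAR1.name);
    }

    // False: they are two distinct objects.
    static boolean sameObject() {
        return DerivedFunction.PA_VAR == DerivedFunction.PA_VAR1;
    }

    // True: the inherited constant is the very object declared in the base class.
    static boolean inheritedIsBase() {
        return DerivedFunction.PA_VAR == BaseFunction.PA_VAR;
    }

    public static void main(String[] args) {
        System.out.println(sameName() + " " + sameObject() + " " + inheritedIsBase());
    }
}
```

If arguments were looked up by name, the two constants would be interchangeable in a setup like this one; if they were compared by identity, they would not. The sketch does not say which applies to the real code.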
--

From hategan at mcs.anl.gov Fri Jun 29 13:22:48 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 29 Jun 2007 13:22:48 -0500
Subject: [Swift-devel] PA_VAR vs PA_VAR1
In-Reply-To:
References:
Message-ID: <1183141368.12220.2.camel@blabla.mcs.anl.gov>

Either one might have been an optional at first and then it was changed,
or it's an oversight. I can't see any reason why both would be needed.

Mihael

On Fri, 2007-06-29 at 23:42 +0530, Ben Clifford wrote:
> GetFieldValue has:
>
>
> public class GetFieldValue extends VDLFunction {
> [...]
> public static final Arg PA_VAR1 = new Arg.Positional("var");
>
> and also inherits from VDLFunction:
>
> public static final Arg PA_VAR = new Arg.Positional("var");
>
> From the name, it looks like PA_VAR1 was deliberately made to not be
> PA_VAR but I don't really understand why.
>

From benc at hawaga.org.uk Fri Jun 29 13:27:37 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 29 Jun 2007 18:27:37 +0000 (GMT)
Subject: [Swift-devel] PA_VAR vs PA_VAR1
In-Reply-To: <1183141368.12220.2.camel@blabla.mcs.anl.gov>
References: <1183141368.12220.2.camel@blabla.mcs.anl.gov>
Message-ID:

ok good.

On Fri, 29 Jun 2007, Mihael Hategan wrote:

> Either one might have been an optional at first and then it was changed,
> or it's an oversight. I can't see any reason why both would be needed.
>
> Mihael
>
> On Fri, 2007-06-29 at 23:42 +0530, Ben Clifford wrote:
> > GetFieldValue has:
> >
> >
> > public class GetFieldValue extends VDLFunction {
> > [...]
> > public static final Arg PA_VAR1 = new Arg.Positional("var");
> >
> > and also inherits from VDLFunction:
> >
> > public static final Arg PA_VAR = new Arg.Positional("var");
> >
> > From the name, it looks like PA_VAR1 was deliberately made to not be
> > PA_VAR but I don't really understand why.
> > > > From bugzilla-daemon at mcs.anl.gov Fri Jun 29 19:11:48 2007 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Fri, 29 Jun 2007 19:11:48 -0500 (CDT) Subject: [Swift-devel] [Bug 72] Campaign for scaling wf up to 244 molecules In-Reply-To: Message-ID: <20070630001148.19C15164DB@foxtrot.mcs.anl.gov> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=72 ------- Comment #3 from nefedova at mcs.anl.gov 2007-06-29 19:11 ------- You can watch the new 244-molecule run live here: http://tg-viz-login1.uc.teragrid.org:51000/index.htm Ioan would post here later the details of what changes Yong and he did to the provider code. Nika -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is. From benc at hawaga.org.uk Fri Jun 29 20:24:58 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Sat, 30 Jun 2007 01:24:58 +0000 (GMT) Subject: [Swift-devel] a different way to do array/structure accesses In-Reply-To: <1182977884.9558.2.camel@blabla.mcs.anl.gov> References: <1182977884.9558.2.camel@blabla.mcs.anl.gov> Message-ID: I rememebred why I was poking round this way in the first place. The present implementation takes expressions like a[2] and passes them all the way through to the karajan layer, where the vdl runtime library code interprets the 2: But this doesn't work in the case where 2 becomes a more complex expression, such as a[2+2]. Accesses like a[2+2] or a[i+1] don't seem to work at all at the moment. (that's bug 54). I don't want to put a complete SwiftScript parser/evaluator in the runtime library, so I think in the case where array accesses are not simple variable names or constants, the code should break up array accesses into separate getfield calls so that the same code can be generated for that indexing expression as when that expression is used elsewhere. 
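One way to picture the proposal above, as a rough sketch under stated assumptions: the segment model, the class name PathSplitter, and the emitted getfield(...) and eval(...) strings are all invented for illustration and do not correspond to the actual compiler output or runtime calls. Constant segments are kept together in a single path string, and each non-constant index expression becomes its own nested call.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; getfield/eval are placeholder names, not real Swift calls.
final class PathSplitter {

    // One step of an access path: a constant name or index, or an
    // expression that has to be evaluated before the lookup.
    static final class Segment {
        final String text;
        final boolean dynamic;

        private Segment(String text, boolean dynamic) {
            this.text = text;
            this.dynamic = dynamic;
        }

        static Segment constant(String text) {
            return new Segment(text, false);
        }

        static Segment expression(String text) {
            return new Segment(text, true);
        }
    }

    // Emits nested calls; runs of constant segments share one dotted path.
    static String compile(String root, List<Segment> segments) {
        String current = root;
        StringBuilder constantRun = new StringBuilder();
        for (Segment segment : segments) {
            if (segment.dynamic) {
                current = flush(current, constantRun);
                current = "getfield(" + current + ", eval(" + segment.text + "))";
            } else {
                if (constantRun.length() > 0) {
                    constantRun.append('.');
                }
                constantRun.append(segment.text);
            }
        }
        return flush(current, constantRun);
    }

    // Wraps the pending constant path, if any, in a single call.
    private static String flush(String current, StringBuilder constantRun) {
        if (constantRun.length() == 0) {
            return current;
        }
        String wrapped = "getfield(" + current + ", \"" + constantRun + "\")";
        constantRun.setLength(0);
        return wrapped;
    }

    public static void main(String[] args) {
        System.out.println(compile("a", Arrays.asList(Segment.expression("2+2"))));
        System.out.println(compile("a", Arrays.asList(Segment.constant("2"))));
    }
}
```

In this sketch the index expression is handed to whatever code already compiles expressions elsewhere, so the runtime library never has to parse "2+2" itself.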
However, it doesn't need to do this for simple accesses (simple variable names and or constants) if the present path handling is also retained - that's more complication at the compiler layer but it sounds like its needed. -- From benc at hawaga.org.uk Fri Jun 29 20:34:41 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Sat, 30 Jun 2007 01:34:41 +0000 (GMT) Subject: [Swift-devel] a different way to do array/structure accesses In-Reply-To: References: <1182977884.9558.2.camel@blabla.mcs.anl.gov> Message-ID: alternatively, I guess the path argument can be constructed on the fly (as it looks like it might have been intended to once): [ ... compiled expression code here ... ] .. That's less aesthetically pleasing to me than the multiple-getfield form, though. On the gripping hand, we could say that array subscripts can only be constants or variable names and disallow expressions there. That perhaps reflects the status quo more accurately, although I don't like it. -- From hategan at mcs.anl.gov Fri Jun 29 20:36:59 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 29 Jun 2007 20:36:59 -0500 Subject: [Swift-devel] a different way to do array/structure accesses In-Reply-To: References: <1182977884.9558.2.camel@blabla.mcs.anl.gov> Message-ID: <1183167419.28235.7.camel@blabla.mcs.anl.gov> VDLFunction.parsePath(Object, VariableStack) is probably somewhat related. I think the problem was that paths with indices were sometimes non-static, which was an exception to what the compiler was initially doing, so it was easier at the time to push things into the above function. It should be fixed, but sensible reduction should be applied. Mihael On Sat, 2007-06-30 at 01:24 +0000, Ben Clifford wrote: > I rememebred why I was poking round this way in the first place. 
> > The present implementation takes expressions like a[2] and passes them all > the way through to the karajan layer, where the vdl runtime library code > interprets the 2: > > > > > > > > But this doesn't work in the case where 2 becomes a more complex > expression, such as a[2+2]. > > Accesses like a[2+2] or a[i+1] don't seem to work at all at the moment. > (that's bug 54). > > I don't want to put a complete SwiftScript parser/evaluator in the runtime > library, so I think in the case where array accesses are not simple > variable names or constants, the code should break up array accesses into > separate getfield calls so that the same code can be generated for that > indexing expression as when that expression is used elsewhere. > > However, it doesn't need to do this for simple accesses (simple variable > names and or constants) if the present path handling is also retained - > that's more complication at the compiler layer but it sounds like its > needed. > From hategan at mcs.anl.gov Fri Jun 29 20:45:51 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 29 Jun 2007 20:45:51 -0500 Subject: [Swift-devel] a different way to do array/structure accesses In-Reply-To: References: <1182977884.9558.2.camel@blabla.mcs.anl.gov> Message-ID: <1183167951.28235.16.camel@blabla.mcs.anl.gov> On Sat, 2007-06-30 at 01:34 +0000, Ben Clifford wrote: > alternatively, I guess the path argument can be constructed on the fly (as > it looks like it might have been intended to once): > > > > > [ > ... compiled expression code here ... > ] > > > .. > > That's less aesthetically pleasing to me than the multiple-getfield form, > though. Again, if constant nested getfields are reduced to a single path, it's probably fine to keep dynamic things as nested getfields (it's the only case incurring performance penalties). e.g. 
p.x.z.v[i].a: gfv(gfv(gfv(p, "x.z.v"), gfv(i)), "a")

Also, we could probably make this nicer with varargs (which will solve the
problem anyway): gfv(p, "x", "z", "v", gfv(i), "a")

>
> On the gripping hand, we could say that array subscripts can only be
> constants or variable names and disallow expressions there. That perhaps
> reflects the status quo more accurately, although I don't like it.

Me neither. And it's counterintuitive. There will be hundreds trying to do
it anyway (assuming we will have those many users :)

>

From bugzilla-daemon at mcs.anl.gov Sat Jun 30 15:33:50 2007
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Sat, 30 Jun 2007 15:33:50 -0500 (CDT)
Subject: [Swift-devel] [Bug 72] Campaign for scaling wf up to 244 molecules
In-Reply-To:
Message-ID: <20070630203350.F02F816505@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=72

iraicu at cs.uchicago.edu changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |iraicu at cs.uchicago.edu

------- Comment #4 from iraicu at cs.uchicago.edu 2007-06-30 15:33 -------
Hi again,
Here is an update of yesterday's 244 molecule run. The experiment ran further
than before, but it still did not complete. There were 240 molecules that
completed successfully (in the previous run, no molecule finished), but 4
molecules still did not finish.
Here is the breakdown on the tasks:

Exit Code  0: 20695 tasks
Exit Code -3:     6 tasks
Exit Code -1:  3585 tasks
=====================
Total:        24286 tasks

The 3 usual Falkon graphs can be found here:
http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed/executor_graph.jpg
http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed/summary_graph.jpg
http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed/task_graph.jpg

The relevant Falkon logs are here (there are more if people are interested, in
total over 600MB of logs):
http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed/Falkon_logs/

The Swift log is here:
http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed/Swift_logs/MolDyn-244-63ar6atbg2ae1.log

From Falkon's point of view, things looked fine: tasks came in, they got
processed, they got returned. We haven't had a chance to analyze the Swift end
of the logs yet, so we don't know for sure what happened. We fixed the
potential synchronization issue Mihael pointed out. We also fixed a badly
handled exception we had in the Falkon provider, which would give up very
easily and exit the Falkon provider thread in case of an exception, even if it
wasn't a fatal one. This time around, we changed the logic to simply print the
exception, if there was one, and not exit the Falkon provider, just continue.
Personally, I think this logic on handling exceptions in the Falkon provider
was causing the Falkon provider to exit prematurely, and hence not send any
more tasks to Falkon... note that Swift was setting the status of submitted
tasks to the Falkon provider in a separate thread, which was not necessarily
exiting when the Falkon provider was, and hence we had the scenario in which
Swift thought it sent out more tasks than Falkon really saw.

Now, the issue that I think stopped this experiment.
On the console of Swift, the last thing that it printed was a "stack overflow error"; I don't think this printed in the logs, just on the console. I believe this is a JVM error when a thread recurses too deep and the thread stack size is not sufficiently large enough. We saw this same error on Thursday in some synthetic experiments with 20K sleep jobs, but it was not repeatable every time. Does anyone have any idea where this stack overflow could be coming from? Ioan -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is. From bugzilla-daemon at mcs.anl.gov Sat Jun 30 17:09:11 2007 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Sat, 30 Jun 2007 17:09:11 -0500 (CDT) Subject: [Swift-devel] [Bug 72] Campaign for scaling wf up to 244 molecules In-Reply-To: Message-ID: <20070630220911.9557A16506@foxtrot.mcs.anl.gov> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=72 ------- Comment #5 from hategan at mcs.anl.gov 2007-06-30 17:09 ------- First of all, can you commit the changes to SVN? (In reply to comment #4) > We fixed the potential synchronization issue > Mihael pointed out. There were two. > We also fixed a badly handled exception we had in the > Falkon provider, that would give up very easily and exit the Falkon provider > thread in case of an exception, even if it wasn't a fatal one. This time > around, we changed the logic to simply print the exception, if there were any, > and not exit the Falkon provider, just continue. Personally, I think this > logic on handling exceptions in the Falkon provider was causing the Falkon > provider to exit prematurely, and hence not send any more tasks to Falkon... I can't seem to find anything that would fit that profile in the provider code. Can you be more specific? If the provider was setting the status of the task to failed, then it doesn't matter. 
Swift retries failed things. > note that Swift was setting the set status of submitted tasks to the Falkon > provider in a separate thread, Swift does not set status of tasks. That's what the provider is supposed to do. > which was not necesarly exiting when the Falkon > provider was, and hence we had the scenario in which Swift thought it sent out > more tasks than Falkon really saw. Can you be more specific? If there is a problem in Swift, we need to fix it, but your comment is too vague. > > Now, the issue that I think stopped this experiment. On the console of Swift, > the last thing that it printed was a "stack overflow error"; I don't think this > printed in the logs, just on the console. Without the stack trace, the information is not very useful. > > Ioan > -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is. From hategan at mcs.anl.gov Sat Jun 30 17:39:01 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sat, 30 Jun 2007 17:39:01 -0500 Subject: [Swift-devel] 244 molecule workflow Message-ID: <1183243141.15631.3.camel@blabla.mcs.anl.gov> Looking at the swift logs, I stumbled across a few exceptions, with stack traces that contain things like: charmm3 @ MolDyn-244.kml, line: 1029201 That thing has over one million lines. Disturbing. For loops have been invented. Mihael From bugzilla-daemon at mcs.anl.gov Sat Jun 30 17:52:07 2007 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Sat, 30 Jun 2007 17:52:07 -0500 (CDT) Subject: [Swift-devel] [Bug 72] Campaign for scaling wf up to 244 molecules In-Reply-To: Message-ID: <20070630225207.B70D916506@foxtrot.mcs.anl.gov> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=72 ------- Comment #6 from hategan at mcs.anl.gov 2007-06-30 17:52 ------- (In reply to comment #4) > Hi again, > Here is an update of yesterday's 244 molecule run. 
The experiment ran further
> than before, but it still did not complete. There were 240 molecules that
> completed successfully (in the previous run, no molecule finished), but 4
> molecules still did not finish.
>

Actually it looks like tasks worked fine:

bash-3.1$ cat MolDyn-244-63ar6atbg2ae1.log |grep "type=1.*ubmitted"|wc
  24309  243090 2806214
bash-3.1$ cat MolDyn-244-63ar6atbg2ae1.log |grep "type=1.*ailed"|wc
   3614   36140  405816
bash-3.1$ cat MolDyn-244-63ar6atbg2ae1.log |grep "type=1.*ompleted"|wc
  20695  206950 2389556

All tasks are accounted for. It may be that some jobs failed 3 times in a row.

From the logs it looks like the workflow almost finished and it got to the
point where the error reporting was to be done. Perhaps the stack overflow
that you saw occurred there, and perhaps the impossible size of the workflow
might have something to do with it.

--
Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.

From itf at mcs.anl.gov Sat Jun 30 21:10:03 2007
From: itf at mcs.anl.gov (Ian Foster)
Date: Sun, 1 Jul 2007 02:10:03 +0000
Subject: [Swift-devel] [Bug 72] Campaign for scaling wf up to 244 molecules
In-Reply-To: <20070630225207.B70D916506@foxtrot.mcs.anl.gov>
References: <20070630225207.B70D916506@foxtrot.mcs.anl.gov>
Message-ID: <1702663950-1183255865-cardhu_decombobulator_blackberry.rim.net-1244943269-@bxe006.bisx.prod.on.blackberry>

Why do you say the workflow's size was "impossible"? It doesn't seem that
large to me. We'd like to run larger ones!
Sent via BlackBerry from T-Mobile

-----Original Message-----
From: bugzilla-daemon at mcs.anl.gov
Date: Sat, 30 Jun 2007 17:52:07
To:swift-devel at ci.uchicago.edu
Subject: [Swift-devel] [Bug 72] Campaign for scaling wf up to 244 molecules

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=72

------- Comment #6 from hategan at mcs.anl.gov 2007-06-30 17:52 -------
(In reply to comment #4)
> Hi again,
> Here is an update of yesterday's 244 molecule run. The experiment ran further
> than before, but it still did not complete. There were 240 molecules that
> completed successfully (in the previous run, no molecule finished), but 4
> molecules still did not finish.
>

Actually it looks like tasks worked fine:

bash-3.1$ cat MolDyn-244-63ar6atbg2ae1.log |grep "type=1.*ubmitted"|wc
  24309  243090 2806214
bash-3.1$ cat MolDyn-244-63ar6atbg2ae1.log |grep "type=1.*ailed"|wc
   3614   36140  405816
bash-3.1$ cat MolDyn-244-63ar6atbg2ae1.log |grep "type=1.*ompleted"|wc
  20695  206950 2389556

All tasks are accounted for. It may be that some jobs failed 3 times in a row.

From the logs it looks like the workflow almost finished and it got to the
point where the error reporting was to be done. Perhaps the stack overflow
that you saw occurred there, and perhaps the impossible size of the workflow
might have something to do with it.

--
Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.
_______________________________________________
Swift-devel mailing list
Swift-devel at ci.uchicago.edu
http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
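The "all tasks are accounted for" statement in comment #6 can be checked with simple arithmetic on the line counts printed by wc (the first column of each result), and the total in comment #4 can be checked the same way from its exit-code breakdown. The sketch below only re-adds numbers that appear in the thread; the class and method names are made up for illustration.

```java
// Re-adds the counts quoted in the thread; TaskAccounting is a hypothetical name.
final class TaskAccounting {

    // True when every submitted task ended up either completed or failed.
    static boolean balances(int submitted, int completed, int failed) {
        return submitted == completed + failed;
    }

    // Adds up a list of per-category counts.
    static int sum(int... counts) {
        int total = 0;
        for (int count : counts) {
            total += count;
        }
        return total;
    }

    public static void main(String[] args) {
        // Line counts from the Swift log greps in comment #6.
        System.out.println(balances(24309, 20695, 3614));
        // Exit-code breakdown from comment #4 (codes 0, -3 and -1).
        System.out.println(sum(20695, 6, 3585));
    }
}
```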