From benc at hawaga.org.uk Tue May 1 02:21:34 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 1 May 2007 07:21:34 +0000 (GMT)
Subject: [Swift-devel] LQCD mapping
Message-ID:

yesterday evening I played some with Nika trying to get her LQCD workflow
running some more.

It involved one code change to swift:

I put a 'create' option on the filesys_mapper so that one can do this:

  file lattice[] ;
  foreach i in range {
    int j=i-1;
    lattice[i] = lqcd_exec(test_in,lattice[j]);
  }

where lattice.* files don't exist, so that lattice[5] will map to
"lattice.5". With create=false (the default), the mapper behaves as
before, which seems to be essentially an input-only mode where it creates
an array based on existing files.

I think this is the mapping functionality that I want, but it's not clear
to me whether filesys_mapper is the place for it, whether one of the other
mappers already does this, or if it should go in a different place
(another mapper or a new mapper). comments?

--

From itf at mcs.anl.gov Tue May 1 02:47:03 2007
From: itf at mcs.anl.gov (itf at mcs.anl.gov)
Date: Tue, 1 May 2007 02:47:03 -0500 (CDT)
Subject: [Swift-devel] LQCD mapping
In-Reply-To:
References:
Message-ID: <3303.84.56.24.39.1178005623.squirrel@www-unix.mcs.anl.gov>

In this case, it seems that you know the number of files to be created
ahead of time. Should that information be specified in the definition
(the first line)?

> yesterday evening I played some with Nika trying to get her LQCD workflow
> running some more.
>
> It involved one code change to swift:
>
> I put a 'create' option on the filesys_mapper so that one can do this:
>
>   file lattice[] ;
>   foreach i in range {
>     int j=i-1;
>     lattice[i] = lqcd_exec(test_in,lattice[j]);
>   }
>
> where lattice.* files don't exist, so that lattice[5] will map to
> "lattice.5". With create=false (the default), the mapper behaves as
> before, which seems to be essentially an input-only mode where it creates
> an array based on existing files.
>
> I think this is the mapping functionality that I want, but it's not clear
> to me whether filesys_mapper is the place for it, whether one of the other
> mappers already does this, or if it should go in a different place
> (another mapper or a new mapper). comments?
>
> --
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>

From benc at hawaga.org.uk Tue May 1 04:00:38 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 1 May 2007 09:00:38 +0000 (GMT)
Subject: [Swift-devel] Re: LQCD mapping
In-Reply-To:
References:
Message-ID:

also, a couple of bugs I filed from this:

bug 54: expressions inside array indices do not work

and

bug 55: workflow hangs when accessing uninitialised array member

The problem with bug 54 has been discussed here already and I think is
relatively straightforward to fix (though that bit of the code is a bit
tangly); Bug 55 - perhaps some kind of deadlock detection is necessary. I
don't really know, but it's a bad user experience at the moment.

Neither of them is on the 0.2 feature list, though they should be fixed
sooner rather than later.

--

From benc at hawaga.org.uk Tue May 1 04:36:58 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 1 May 2007 09:36:58 +0000 (GMT)
Subject: [Swift-devel] remote file/directory stuff (bug 22) (fwd)
Message-ID:

there was a thread a couple of months ago (almost to the day) about this
0.2 feature. I've done some work towards implementing this already, and
I'm going to do some more now. I'm planning on implementing the syntax
Yong discusses below.

I hit the problem that motivated this again yesterday, working with Nika.
I'm not sure what remote mapping should really look like - pretty much
everyone seems in agreement about the general concept, with various
differences, so I think it will be useful to actually have an
implementation of *something* to get a practical feel, rather than our
previous interminable discussion threads.

---------- Forwarded message ----------
Date: Sat, 3 Mar 2007 01:37:42 +0000 (GMT)
From: Ben Clifford
To: swft at ci.uchicago.edu
Subject: [Swft] Re: [Swift-devel] remote file/directory stuff (bug 22)

So the message below can be the beginnings of campaign definition for bug
22 - 'execute-side protomappers'.

I think in terms of work, me (or Yong if he wants to) needs to implement
the swiftscript->kml bit to generate the kml syntax that Mihael suggested
in an earlier message; and then Mihael should go from there to make it
happen inside the execution engine.

On Fri, 2 Mar 2007, Ben Clifford wrote:

>
>
> On Fri, 2 Mar 2007, Yong Zhao wrote:
>
> > Can you elaborate on this issue a little bit so that we can make a
> > unanimous decision:
> >
> > 1. what was the problem exactly
>
> Some programs that we run in swift do not use the traditional VDS-like API
> of being told on the commandline the names of the files that they must
> input and output to. Instead, they make up some of the names themselves.
>
> For example, one of Nika's programs has the syntax:
>
>   ./program inputfilename
>
> and places its outputs in inputfile.stuff, inputfile.abc, inputfile.foo
>
> > 2. what are you proposing
>
> To extend the syntax of the app {} block to permit specification of the
> above interface, with a syntax something like:
>
> (stuffoutfile s, abcoutfile a, foooutfile f) myproc(inputfile i) {
>   app {
>     program @i;
>     s < @strcat(@inputfile,".stuff")
>     a < @strcat(@inputfile,".abc")
>     f < @strcat(@inputfile,".foo")
>   }
> }
>
> Meaning that rather than Swift specifying the remote name for s, a and f,
> instead the app block specifies where those three files are.
>
> These will be staged back into the submit-side location defined in the
> existing mappers.
>
> > 3. to what extent does the proposal solve the problem
>
> It should solve Nika's immediate problem, I think.
>
> > 4. what is the implication to the mapping interface
>
> A longer term perspective is that this is the beginning of longer work to
> implement fuller execute-side mappers (which have also been called
> application mappers in some threads).
>
> So it is mapping, but on the execute side. It fits in in a fairly
> straightforward way with mapping on the submit side, which is what we have
> now.
>
> Submit side mapping maps between submit-side data and SwiftScript
> variables/structures, so that the user can arrange his submit-side data in
> a way that he wants (rather than swift compelling it to be in a particular
> format)
>
> Execute side mapping maps between SwiftScript variables and execute side
> data, so that data can be laid out on the execute side in the way that the
> program wants it (rather than swift compelling it to be in a particular
> format)
>
> With the present implementation, this amounts to being able to specify
> different paths and filenames on the submit and execute side for each data
> file.
>
> In the longer term, it might also be useful in defining things like how to
> map data on a submit-side database to some format on the execute side for
> processing. If we have only submit side mappers, then we can map data
> between a submit side database and SwiftScript structures, but not map
> between those structures and the execute side...
>
> --

From nefedova at mcs.anl.gov Tue May 1 07:06:09 2007
From: nefedova at mcs.anl.gov (Veronika V.
 Nefedova)
Date: Tue, 01 May 2007 07:06:09 -0500
Subject: [Swift-devel] LQCD mapping
In-Reply-To:
References:
Message-ID: <6.2.1.2.2.20070501070147.020db328@pop.mcs.anl.gov>

I'd like to mention that as a result of our 'play' last evening we have a
working LQCD workflow (chained genU execution), so the workflow could be
given to Xian-He once the changes Ben has made to swift make it into trunk.

My special thanks to Ben who came up with the mapper construction below
that made the workflow work!

Nika

At 02:21 AM 5/1/2007, Ben Clifford wrote:
>I put a 'create' option on the filesys_mapper so that one can do this:
>
>  file lattice[] ;
>  foreach i in range {
>    int j=i-1;
>    lattice[i] = lqcd_exec(test_in,lattice[j]);
>  }

From hategan at mcs.anl.gov Tue May 1 08:21:33 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 01 May 2007 08:21:33 -0500
Subject: [Swift-devel] LQCD mapping
In-Reply-To:
References:
Message-ID: <1178025693.26042.10.camel@blabla.mcs.anl.gov>

On Tue, 2007-05-01 at 07:21 +0000, Ben Clifford wrote:
> I put a 'create' option on the filesys_mapper so that one can do this:
>
>   file lattice[] ;
>   foreach i in range {
>     int j=i-1;
>     lattice[i] = lqcd_exec(test_in,lattice[j]);
>   }
>
> where lattice.* files don't exist, so that lattice[5] will map to
> "lattice.5". With create=false (the default) then the mapper behaves as
> before, which seems to be essentially an input-only mode where it creates
> an array based on existing files.

The translator does static analysis to figure out what things are "read"
and what things are "write". In this case it looks like it's figuring the
wrong thing, and I think that should be fixed.

Mihael

From yongzh at cs.uchicago.edu Tue May 1 09:42:04 2007
From: yongzh at cs.uchicago.edu (Yong Zhao)
Date: Tue, 1 May 2007 09:42:04 -0500 (CDT)
Subject: [Swift-devel] LQCD mapping
In-Reply-To:
References:
Message-ID:

does create mean creating an empty file?

From benc at hawaga.org.uk Tue May 1 09:44:02 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 1 May 2007 14:44:02 +0000 (GMT)
Subject: [Swift-devel] LQCD mapping
In-Reply-To:
References:
Message-ID:

On Tue, 1 May 2007, Yong Zhao wrote:

> does create mean creating an empty file?

it doesn't create any file - it creates mappings that this mapper appears
to not make by default (though mihael suggests that might be a bug)

so that if I refer to an array element lattice[787979] it is mapped to a
(possibly non-existing) file called prefix+"787979"+suffix

if that is then used as the output variable for a procedure, then yes the
file gets created by the procedure. but not by the mapper.

From yongzh at cs.uchicago.edu Tue May 1 09:54:38 2007
From: yongzh at cs.uchicago.edu (Yong Zhao)
Date: Tue, 1 May 2007 09:54:38 -0500 (CDT)
Subject: [Swift-devel] LQCD mapping
In-Reply-To:
References:
Message-ID:

I see, then maybe you should not use filesys_mapper, can you try
simple_mapper instead?

Yong.

From hategan at mcs.anl.gov Tue May 1 09:58:17 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 01 May 2007 09:58:17 -0500
Subject: [Swift-devel] LQCD mapping
In-Reply-To:
References:
Message-ID: <1178031497.29391.7.camel@blabla.mcs.anl.gov>

On Tue, 2007-05-01 at 14:44 +0000, Ben Clifford wrote:
>
> On Tue, 1 May 2007, Yong Zhao wrote:
>
> > does create mean creating an empty file?
>
> it doesn't create any file - it creates mappings that this mapper appears
> to not make by default (though mihael suggests that might be a bug)
>
> so that if I refer to an array element lattice[787979] it is mapped to a
> (possibly non-existing) file called prefix+"787979"+suffix
>
> if that is then used as the output variable for a procedure, then yes the
> file gets created by the procedure. but not by the mapper.

Mappers are supposed to be lazy. They don't enforce sizes. They can be
used, if requested, to populate a data structure to reflect existing data.
That's what the existing() call does. The translator is supposed to figure
out if that call should be made on initialization, and signal that using
the "input" mapping parameter (which is used by the type system to
determine whether a call to existing() should be made).

In a sense, "input" is pretty much what your "create" does, but it's got a
large part of the complexities figured out. Should "input" not be passed,
existing() would not be called, and the mapper should act in a fully lazy
way, but that would also mean that no bits in the array will be marked as
available, unless assigned to separately.
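
The lazy behaviour described above can be modelled in a few lines. What
follows is a minimal Python sketch, not Swift's mapper code: the class
LazyFileMapper, its methods map and mark_assigned, and the input_mode flag
(standing in for the "input" mapping parameter) are hypothetical names
chosen for illustration. Only existing() is a name taken from the
discussion, and its body here is a guess at the behaviour described.

```python
import os
import tempfile


class LazyFileMapper:
    """Toy model of a lazy array-to-file mapper (hypothetical, not Swift's API)."""

    def __init__(self, directory, prefix, suffix="", input_mode=False):
        self.directory = directory
        self.prefix = prefix
        self.suffix = suffix
        # Indices whose files are known to hold data already.
        self.available = set()
        if input_mode:
            # Models the "input" parameter: populate from files on disk.
            self.available.update(self.existing())

    def existing(self):
        """Return the sorted indices of files that match prefix and suffix."""
        found = []
        for name in os.listdir(self.directory):
            if not (name.startswith(self.prefix) and name.endswith(self.suffix)):
                continue
            middle = name[len(self.prefix):len(name) - len(self.suffix)]
            if middle.isdigit():
                found.append(int(middle))
        return sorted(found)

    def map(self, index):
        """Map any index to a path; the file does not need to exist yet."""
        name = f"{self.prefix}{index}{self.suffix}"
        return os.path.join(self.directory, name)

    def mark_assigned(self, index):
        """Record that a procedure has produced the file for this index."""
        self.available.add(index)
```

With a directory holding only lattice.0, a mapper built with
input_mode=True reports index 0 as available, while one built without it
reports nothing as available yet still maps index 5 to lattice.5.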

From benc at hawaga.org.uk Tue May 1 10:01:25 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 1 May 2007 15:01:25 +0000 (GMT)
Subject: [Swift-devel] LQCD mapping
In-Reply-To:
References:
Message-ID:

On Tue, 1 May 2007, Yong Zhao wrote:

> I see, then maybe you should not use filesys_mapper, can you try
> simple_mapper instead?

The code in there looks like it does approximately the right stuff
filename-wise for arrays.

Maybe Nika can try with her workflow - if not, I'll try it later.

--

From nefedova at mcs.anl.gov Tue May 1 10:06:20 2007
From: nefedova at mcs.anl.gov (Veronika V. Nefedova)
Date: Tue, 01 May 2007 10:06:20 -0500
Subject: [Swift-devel] LQCD mapping
In-Reply-To:
References:
Message-ID: <6.2.1.2.2.20070501100604.020553e8@pop.mcs.anl.gov>

At 10:01 AM 5/1/2007, Ben Clifford wrote:
>Maybe Nika can try with her workflow - if not, I'll try it later.

Sure, I can try that

Nika

From tiberius at ci.uchicago.edu Tue May 1 10:09:04 2007
From: tiberius at ci.uchicago.edu (Tiberiu Stef-Praun)
Date: Tue, 1 May 2007 10:09:04 -0500
Subject: [Swift-devel] suggestion please on hanging/sleeping/slow wf
Message-ID:

I have a workflow that generates 5000 files.
The execution seems to have halted, for no obvious reason:
- there are no more jobs in the queue
- no errors are reported in the logfile
- NOTE: some of the input files have not been staged in yet, yet the
  workflow is hanging
- NOTE: the remote application temp directory is GONE, only the shared
  directory is still there
- apparently all the output files that are in /shared have been sent
  back (staged out)

What to do, what to do?

The workflow is sid-wf.dtm in ~tiberius/scratch on teraport
It uses the config files in ~tiberius/local/swift-conf

--
Tiberiu (Tibi) Stef-Praun, PhD
Research Staff, Computation Institute
5640 S. Ellis Ave, #405
University of Chicago
http://www-unix.mcs.anl.gov/~tiberius/

From tiberius at ci.uchicago.edu Tue May 1 10:15:37 2007
From: tiberius at ci.uchicago.edu (Tiberiu Stef-Praun)
Date: Tue, 1 May 2007 10:15:37 -0500
Subject: [Swift-devel] LQCD mapping
In-Reply-To: <6.2.1.2.2.20070501100604.020553e8@pop.mcs.anl.gov>
References: <6.2.1.2.2.20070501100604.020553e8@pop.mcs.anl.gov>
Message-ID:

Since the discussion seems to go in the direction I'm interested in, I
remember wishing for a filesys mapper extension where I can specify
some verification code (such as the number of inputs matched, or the
total file sizes) to run before the initialization is done.
This implies that the mapper retries initialization, and it only
succeeds when a previous step has produced the right number of
outputs. This enables sequential procedure invocation in the case when
I don't know the names or the number of output files from one stage to
the other.
Currently I work around this by archiving the output from stage one, and
passing it to stage two (which fixes the number of unknown outputs)

Tibi
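
The retry-until-verified idea above could look roughly like this. It is a
minimal Python sketch under stated assumptions: wait_for_outputs,
list_outputs and verify are hypothetical names, not part of Swift or of
any existing mapper, and polling with a fixed delay is just one possible
way to "retry initialization".

```python
import time


def wait_for_outputs(list_outputs, verify, attempts=5, delay=1.0,
                     sleep=time.sleep):
    """Poll until verify(outputs) holds, then return those outputs.

    list_outputs: callable returning the files currently matched.
    verify: callable deciding whether the previous stage is complete,
            e.g. by counting files or summing their sizes.
    Raises TimeoutError if the check never passes within `attempts` tries.
    """
    for attempt in range(attempts):
        outputs = list_outputs()
        if verify(outputs):
            return outputs
        # Do not sleep after the final failed attempt.
        if attempt < attempts - 1:
            sleep(delay)
    raise TimeoutError("outputs did not pass verification")
```

A second stage would call this with a verify function such as
lambda files: len(files) == expected_count, and only start once it
returns.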

From benc at hawaga.org.uk Tue May 1 10:18:10 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 1 May 2007 15:18:10 +0000 (GMT)
Subject: [Swift-devel] LQCD mapping
In-Reply-To:
References: <6.2.1.2.2.20070501100604.020553e8@pop.mcs.anl.gov>
Message-ID:

or you want to be able to return a variable length array from your app?
(which has been mumbled about but never talked about terribly concretely
on one of the lists before).

anyway, it's my birthday somewhere in the world for the next 44 hours or
so. woo woo party!

From nefedova at mcs.anl.gov Tue May 1 12:22:25 2007
From: nefedova at mcs.anl.gov (Veronika V. Nefedova)
Date: Tue, 01 May 2007 12:22:25 -0500
Subject: [Swift-devel] terminable down
Message-ID: <6.0.0.22.2.20070501121638.05c7bec0@mail.mcs.anl.gov>

Hi,

terminable has been down for at least an hour now, maybe longer. My email
to ci.support was unanswered... I am wondering if anybody on this list is
at UC now and knows why terminable is down (scheduled maintenance or
something)? Could anybody access terminable -- maybe I just can't do it
from ANL?

Thanks!

Nika

From iraicu at cs.uchicago.edu Tue May 1 12:27:43 2007
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Tue, 01 May 2007 12:27:43 -0500
Subject: [Swift-devel] terminable down
In-Reply-To: <6.0.0.22.2.20070501121638.05c7bec0@mail.mcs.anl.gov>
References: <6.0.0.22.2.20070501121638.05c7bec0@mail.mcs.anl.gov>
Message-ID: <4637788F.90707@cs.uchicago.edu>

Nika,
I can't access it either from the UChicago campus...
although I don't know why its down. Ioan Veronika V. Nefedova wrote: > Hi, > > terminable is down for at least an hour now, maybe longer. My email to > ci.support was unanswered... I am wondering if anybody on this list is > at UC now and know why terminable is down (scheduled maintenance or > something)? Could anybody access terminable -- maybe I just can't do > it from ANL ? > > Thanks! > > Nika > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > -- ============================================ Ioan Raicu Ph.D. Student ============================================ Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 ============================================ Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dsl.cs.uchicago.edu/ ============================================ ============================================ From nefedova at mcs.anl.gov Tue May 1 12:29:38 2007 From: nefedova at mcs.anl.gov (Veronika V. Nefedova) Date: Tue, 01 May 2007 12:29:38 -0500 Subject: [Swift-devel] terminable down In-Reply-To: <4637788F.90707@cs.uchicago.edu> References: <6.0.0.22.2.20070501121638.05c7bec0@mail.mcs.anl.gov> <4637788F.90707@cs.uchicago.edu> Message-ID: <6.0.0.22.2.20070501122822.03836ec0@mail.mcs.anl.gov> Hmmm. Do you know of any other UC machine that I could try to login into (that shares the same home dirw/terminable) ? I tried evitable and its also down... Thanks for the info! Nika At 12:27 PM 5/1/2007, Ioan Raicu wrote: >Nika, >I can't access it either from the UChicago campus... although I don't >know why its down. >Ioan > >Veronika V. Nefedova wrote: >>Hi, >> >>terminable is down for at least an hour now, maybe longer. My email to >>ci.support was unanswered... 
I am wondering if anybody on this list is at >>UC now and know why terminable is down (scheduled maintenance or >>something)? Could anybody access terminable -- maybe I just can't do it >>from ANL ? >> >>Thanks! >> >>Nika >> >>_______________________________________________ >>Swift-devel mailing list >>Swift-devel at ci.uchicago.edu >>http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >-- >============================================ >Ioan Raicu >Ph.D. Student >============================================ >Distributed Systems Laboratory >Computer Science Department >University of Chicago >1100 E. 58th Street, Ryerson Hall >Chicago, IL 60637 >============================================ >Email: iraicu at cs.uchicago.edu >Web: http://www.cs.uchicago.edu/~iraicu > http://dsl.cs.uchicago.edu/ >============================================ >============================================ From iraicu at cs.uchicago.edu Tue May 1 12:39:34 2007 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Tue, 01 May 2007 12:39:34 -0500 Subject: [Swift-devel] terminable down In-Reply-To: <6.0.0.22.2.20070501122822.03836ec0@mail.mcs.anl.gov> References: <6.0.0.22.2.20070501121638.05c7bec0@mail.mcs.anl.gov> <4637788F.90707@cs.uchicago.edu> <6.0.0.22.2.20070501122822.03836ec0@mail.mcs.anl.gov> Message-ID: <46377B56.8080807@cs.uchicago.edu> No, I don't, as I don't normally use the CI machines much. Ioan Veronika V. Nefedova wrote: > Hmmm. Do you know of any other UC machine that I could try to login > into (that shares the same home dirw/terminable) ? I tried evitable > and its also down... > > Thanks for the info! > > Nika > > At 12:27 PM 5/1/2007, Ioan Raicu wrote: >> Nika, >> I can't access it either from the UChicago campus... although I >> don't know why its down. >> Ioan >> >> Veronika V. Nefedova wrote: >>> Hi, >>> >>> terminable is down for at least an hour now, maybe longer. My email >>> to ci.support was unanswered... 
I am wondering if anybody on this >>> list is at UC now and know why terminable is down (scheduled >>> maintenance or something)? Could anybody access terminable -- maybe >>> I just can't do it from ANL ? >>> >>> Thanks! >>> >>> Nika >>> >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> >> -- >> ============================================ >> Ioan Raicu >> Ph.D. Student >> ============================================ >> Distributed Systems Laboratory >> Computer Science Department >> University of Chicago >> 1100 E. 58th Street, Ryerson Hall >> Chicago, IL 60637 >> ============================================ >> Email: iraicu at cs.uchicago.edu >> Web: http://www.cs.uchicago.edu/~iraicu >> http://dsl.cs.uchicago.edu/ >> ============================================ >> ============================================ > > > -- ============================================ Ioan Raicu Ph.D. Student ============================================ Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 ============================================ Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dsl.cs.uchicago.edu/ ============================================ ============================================ From nefedova at mcs.anl.gov Tue May 1 12:43:38 2007 From: nefedova at mcs.anl.gov (Veronika V. 
Nefedova)
Date: Tue, 01 May 2007 12:43:38 -0500
Subject: [Swift-devel] terminable down
In-Reply-To: <46377B56.8080807@cs.uchicago.edu>
References: <6.0.0.22.2.20070501121638.05c7bec0@mail.mcs.anl.gov> <4637788F.90707@cs.uchicago.edu> <6.0.0.22.2.20070501122822.03836ec0@mail.mcs.anl.gov> <46377B56.8080807@cs.uchicago.edu>
Message-ID: <6.0.0.22.2.20070501124319.0381c020@mail.mcs.anl.gov>

It looks like I should avoid those as well (;

At 12:39 PM 5/1/2007, Ioan Raicu wrote:
>No, I don't, as I don't normally use the CI machines much.
>Ioan

From hategan at mcs.anl.gov Tue May 1 12:46:08 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 01 May 2007 12:46:08 -0500
Subject: [Swift-devel] terminable down
In-Reply-To: <6.0.0.22.2.20070501124319.0381c020@mail.mcs.anl.gov>
References: <6.0.0.22.2.20070501121638.05c7bec0@mail.mcs.anl.gov> <4637788F.90707@cs.uchicago.edu> <6.0.0.22.2.20070501122822.03836ec0@mail.mcs.anl.gov> <46377B56.8080807@cs.uchicago.edu> <6.0.0.22.2.20070501124319.0381c020@mail.mcs.anl.gov>
Message-ID: <1178041568.3508.0.camel@blabla.mcs.anl.gov>

Somebody stepped on the power outlet that hosted the small switch that
terminable used. It should be back now.

On Tue, 2007-05-01 at 12:43 -0500, Veronika V. Nefedova wrote:
> It looks like I should avoid those as well (;

From nefedova at mcs.anl.gov Tue May 1 13:25:16 2007
From: nefedova at mcs.anl.gov (Veronika V. Nefedova)
Date: Tue, 01 May 2007 13:25:16 -0500
Subject: [Swift-devel] LQCD mapping
In-Reply-To: <6.2.1.2.2.20070501100604.020553e8@pop.mcs.anl.gov>
References: <6.2.1.2.2.20070501100604.020553e8@pop.mcs.anl.gov>
Message-ID: <6.0.0.22.2.20070501132329.05842c20@mail.mcs.anl.gov>

Ok, I tried using a different mapper. It seems that replacing this line:

file lattice[] ;

with this one:

file lattice[] ;

Works just fine. My workflow has finished without any errors.

Nika

At 10:06 AM 5/1/2007, Veronika V. Nefedova wrote:
>At 10:01 AM 5/1/2007, Ben Clifford wrote:
>
>>On Tue, 1 May 2007, Yong Zhao wrote:
>>
>> > I see, then maybe you should not use filesys_mapper, can you try
>> > simple_mapper instead?
>>
>>The code in there looks like it does approximately the right stuff
>>filename-wise for arrays.
>>
>>Maybe Nika can try with her workflow - if not, I'll try it later.
>
>Sure, I can try that
>
>Nika

From nefedova at mcs.anl.gov Tue May 1 14:53:40 2007
From: nefedova at mcs.anl.gov (Veronika V.
Nefedova)
Date: Tue, 01 May 2007 14:53:40 -0500
Subject: [Swift-devel] arguments to swift
Message-ID: <6.0.0.22.2.20070501144050.04c69c90@mail.mcs.anl.gov>

Hi,

I couldn't find info on how to pass arguments to a swift script. For
example, I need to pass an integer, say NUM=5, on the command line when I
am invoking swift. And inside the swift script I'd like to address that
variable (is it @arg(NUM) ?)... I tried several variations but none seems
to work. Could somebody please point me to the documentation or give me an
example of how to do that?

This is one of the few syntaxes that I tried (and it didn't work):

inside the swift script (how to address it):

type file{}
int N = @arg(NUM);
int range[] = [1:N];

foreach i in range {
BLA
}

and on the command line - how to specify an argument:
swift bla.swift NUM=2

Thanks!

Nika

From hategan at mcs.anl.gov Tue May 1 14:55:30 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 01 May 2007 14:55:30 -0500
Subject: [Swift-devel] arguments to swift
In-Reply-To: <6.0.0.22.2.20070501144050.04c69c90@mail.mcs.anl.gov>
References: <6.0.0.22.2.20070501144050.04c69c90@mail.mcs.anl.gov>
Message-ID: <1178049330.8303.2.camel@blabla.mcs.anl.gov>

This is undocumented and unsupported, but:
@arg("NUM")

You pass it after the .dtm|.swift name:

swift x.swift -NUM=5

From nefedova at mcs.anl.gov Tue May 1 15:07:33 2007
From: nefedova at mcs.anl.gov (Veronika V. Nefedova)
Date: Tue, 01 May 2007 15:07:33 -0500
Subject: [Swift-devel] arguments to swift
In-Reply-To: <1178049330.8303.2.camel@blabla.mcs.anl.gov>
References: <6.0.0.22.2.20070501144050.04c69c90@mail.mcs.anl.gov> <1178049330.8303.2.camel@blabla.mcs.anl.gov>
Message-ID: <6.0.0.22.2.20070501150722.04c8b710@mail.mcs.anl.gov>

Great, thanks! It worked.

Nika

At 02:55 PM 5/1/2007, Mihael Hategan wrote:
>This is undocumented and unsupported, but:
>@arg("NUM")
>
>You pass it after the .dtm|.swift name:
>
>swift x.swift -NUM=5

From nefedova at mcs.anl.gov Tue May 1 16:31:28 2007
From: nefedova at mcs.anl.gov (Veronika V. Nefedova)
Date: Tue, 01 May 2007 16:31:28 -0500
Subject: [Swift-devel] Fwd: Re: chained genU workflow
Message-ID: <6.0.0.22.2.20070501162729.04c5e910@mail.mcs.anl.gov>

Hi, everybody:

I got this email from Xian-He (after I sent him the lqcd workflow) and I do
not think I understand what exactly he is talking about.
Mihael and/or Yong -- you've worked with this group before I joined - maybe
you know what exactly their problems are? Please give me any background
information so I can help them proceed.

Thanks!

Nika

>Date: Tue, 01 May 2007 16:24:45 -0500
>From: Xian-He Sun
>Subject: Re: chained genU workflow
>To: "Veronika V. Nefedova"
>Cc: Don Holmgren , simone at fnal.gov,
> Nirmal Seenu , Mike Wilde ,
> Ian Foster
>
>
>Thank you, Nika. It is a good achievement. Currently, we are still
>facing two technical issues:
>
>1. The lqcd computing environment is not a true Grid environment. We
>still need to modify your code to make it work. We have had some success
>with the hello example and will work on this one too.
>
>2. We have made Swift talk to PBS directly, but some efficiency issues
>remain at this time. Some modification on the Swift side is needed.
>Nirmal is working with Mihael Hategan in this regard.
>
>Thank you,
>
>Xian-He

From benc at hawaga.org.uk Tue May 1 16:56:51 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 1 May 2007 21:56:51 +0000 (GMT)
Subject: [Swift-devel] arguments to swift
In-Reply-To: <1178049330.8303.2.camel@blabla.mcs.anl.gov>
References: <6.0.0.22.2.20070501144050.04c69c90@mail.mcs.anl.gov> <1178049330.8303.2.camel@blabla.mcs.anl.gov>
Message-ID:

On Tue, 1 May 2007, Mihael Hategan wrote:

> This is undocumented and unsupported, but:
> @arg("NUM")
>
> You pass it after the .dtm|.swift name:
>
> swift x.swift -NUM=5

That should probably be a supported feature. I'll note it in the user
guide.

--

From hategan at mcs.anl.gov Tue May 1 16:56:42 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 01 May 2007 16:56:42 -0500
Subject: [Swift-devel] Fwd: Re: chained genU workflow
In-Reply-To: <6.0.0.22.2.20070501162729.04c5e910@mail.mcs.anl.gov>
References: <6.0.0.22.2.20070501162729.04c5e910@mail.mcs.anl.gov>
Message-ID: <1178056602.13231.5.camel@blabla.mcs.anl.gov>

This is the list of things I got from them OOB:
- MPI jobs with the PBS provider. They need to be able to run with more
than one version of MPI.
- Easier configuration of tc.data/sites.xml. Basically they need the
ability to use a global sites.xml while changing only things like the
project profile entry.
- The cleanup didn't work as it was. It would submit a job on the
default execution provider (whatever that was) which needed a project
profile entry, but the swift library didn't provide one. This was solved
by hacking the vdl lib and adding /bin/rm in tc.data.
- They would like the cleanup to be done without pbs in the future
(possibly fork or directly with the fileop provider). There's some
thinking that needs to go here.

That's it I think.

From hategan at mcs.anl.gov Tue May 1 16:57:27 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 01 May 2007 16:57:27 -0500
Subject: [Swift-devel] arguments to swift
In-Reply-To:
References: <6.0.0.22.2.20070501144050.04c69c90@mail.mcs.anl.gov> <1178049330.8303.2.camel@blabla.mcs.anl.gov>
Message-ID: <1178056647.13231.7.camel@blabla.mcs.anl.gov>

On Tue, 2007-05-01 at 21:56 +0000, Ben Clifford wrote:
> That should probably be a supported feature. I'll note it in the user
> guide.

makes sense.
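
The convention described in this thread (named values given as -NAME=value
after the script name, then looked up by name with @arg("NAME")) can be
sketched roughly as below. This is a Python illustration only, not Swift's
own implementation: the helper names parse_named_args and arg are
hypothetical, and how Swift itself treats missing or malformed arguments is
not covered by the thread.

```python
def parse_named_args(argv):
    """Split argv into (script, named), where named maps NAME -> value.

    Assumption: named arguments follow the script name and look like
    -NAME=value, as in `swift x.swift -NUM=5` from the thread.
    """
    script = None
    named = {}
    for token in argv:
        if script is None:
            # Everything up to and including the script name is skipped.
            if token.endswith((".swift", ".dtm")):
                script = token
            continue
        if token.startswith("-") and "=" in token:
            name, value = token[1:].split("=", 1)
            named[name] = value
    return script, named


def arg(named, name):
    """Hypothetical stand-in for @arg("NAME"): look a value up by name."""
    if name not in named:
        raise KeyError(f"no -{name}=... argument was given")
    return named[name]


if __name__ == "__main__":
    script, named = parse_named_args(["x.swift", "-NUM=5"])
    print(script, int(arg(named, "NUM")))
```

In this sketch a bare NUM=2 (no leading dash) is simply not recognised as a
named argument, which mirrors the form that did not work in the thread.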
From benc at hawaga.org.uk Tue May 1 17:06:57 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 1 May 2007 22:06:57 +0000 (GMT)
Subject: [Swift-devel] Fwd: Re: chained genU workflow
In-Reply-To: <1178056602.13231.5.camel@blabla.mcs.anl.gov>
References: <6.0.0.22.2.20070501162729.04c5e910@mail.mcs.anl.gov> <1178056602.13231.5.camel@blabla.mcs.anl.gov>
Message-ID:

whats a project profile entry?

On Tue, 1 May 2007, Mihael Hategan wrote:

> This is the list of things I got from them OOB:
> - MPI jobs with the PBS provider. They need to be able to run with more
> than one version of MPI.
> - Easier configuration of tc.data/sites.xml. Basically they need the
> ability to use a global sites.xml while changing only things like the
> project profile entry.
> - The cleanup didn't work as it was. It would submit a job on the
> default execution provider (whatever that was) which needed a project
> profile entry, but the swift library didn't provide one. This was solved
> by hacking the vdl lib and adding /bin/rm in tc.data.
> - They would like the cleanup to be done without pbs in the future
> (possibly fork or directly with the fileop provider). There's some
> thinking that needs to go here.
>
> That's it I think.

From hategan at mcs.anl.gov Tue May 1 17:11:04 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 01 May 2007 17:11:04 -0500
Subject: [Swift-devel] Fwd: Re: chained genU workflow
In-Reply-To:
References: <6.0.0.22.2.20070501162729.04c5e910@mail.mcs.anl.gov> <1178056602.13231.5.camel@blabla.mcs.anl.gov>
Message-ID: <1178057464.13872.4.camel@blabla.mcs.anl.gov>

On Tue, 2007-05-01 at 22:06 +0000, Ben Clifford wrote:
> whats a project profile entry?

That thing that gets translated into a 'project' RSL attribute and
eventually to the PBS equivalent. In swift I suppose it's called a
profile entry, since it's specified with the element, and has both a
key and a value which kinda makes it an 'entry'. And the key in this
case is "project".

From benc at hawaga.org.uk Wed May 2 01:27:27 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 2 May 2007 06:27:27 +0000 (GMT)
Subject: [Swift-devel] LQCD mapping
In-Reply-To: <6.0.0.22.2.20070501132329.05842c20@mail.mcs.anl.gov>
References: <6.2.1.2.2.20070501100604.020553e8@pop.mcs.anl.gov> <6.0.0.22.2.20070501132329.05842c20@mail.mcs.anl.gov>
Message-ID:

On Tue, 1 May 2007, Veronika V. Nefedova wrote:

> Ok, I tried using a different mapper. It seems that replacing this line:
>
> file lattice[] ;
>
> with this one:
>
> file lattice[] ;
>
> Works just fine. My workflow has finished without any errors.

ok cool. I'll add more documentation to the userguide about the behaviour
of the simple mapper.

--

From benc at hawaga.org.uk Wed May 2 03:52:47 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 2 May 2007 08:52:47 +0000 (GMT)
Subject: [Swift-devel] Fwd: Re: chained genU workflow
In-Reply-To: <1178056602.13231.5.camel@blabla.mcs.anl.gov>
References: <6.0.0.22.2.20070501162729.04c5e910@mail.mcs.anl.gov> <1178056602.13231.5.camel@blabla.mcs.anl.gov>
Message-ID:

On Tue, 1 May 2007, Mihael Hategan wrote:

These should probably go into the bugzilla as they look like app
requirements that need tracking.

> - MPI jobs with the PBS provider. They need to be able to run with more
> than one version of MPI.

> - Easier configuration of tc.data/sites.xml. Basically they need the
> ability to use a global sites.xml while changing only things like the
> project profile entry.

It may make sense in general that a commandline-specified value overrides
a tc.data-specified value, which overrides a sites.xml-specified value.
Though in this 'project' situation, that might look wrong in the
multi-site case (though, given that they're using PBS on a single site,
that isn't so much a problem at the moment).

> - The cleanup didn't work as it was. It would submit a job on the
> default execution provider (whatever that was) which needed a project
> profile entry, but the swift library didn't provide one. This was solved
> by hacking the vdl lib and adding /bin/rm in tc.data.

mmm hacks. anything useful from a production codebase perspective?

> - They would like the cleanup to be done without pbs in the future
> (possibly fork or directly with the fileop provider). There's some
> thinking that needs to go here.

VDS1's sites descriptions allowed different job submission mechanisms to
be specified for different purposes - the 'vanilla' universe and the
'transfer' universe, with the intention that the vanilla universe is for
running the meat of the workflow and would point at a batch system of some
kind, whilst the transfer universe is intended for lighter weight jobs and
would point at GRAM2's jobmanager-fork. That's perhaps a starting point.

--

From benc at hawaga.org.uk Wed May 2 04:08:34 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 2 May 2007 09:08:34 +0000 (GMT)
Subject: [Swift-devel] multiple arguments
Message-ID:

I'm trying to run softmean, one of the tools in the fmri workflow that I
used before for tutorial purposes.

At present I run it like this:

softmean @atlas.img overwrite scalingsuffix @sliced[0].img @sliced[1].img
@sliced[2].img @sliced[3].img;

so that each of four image filenames is passed in as a separate parameter.

This isn't so nice when 'sliced' has != 4 elements.

I considered trying this:

softmean @atlas.img overwrite scalingsuffix @sliced[*].img;

But the @sliced[*].img appears to turn into a single string argument
listing all of the filenames, which softmean finds displeasing.

--

From benc at hawaga.org.uk Wed May 2 05:12:17 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 2 May 2007 10:12:17 +0000 (GMT)
Subject: [Swift-devel] version numbering and directory naming.
Message-ID:

I just changed the project version number to 0.1-dev (previously it was
1.0) as part of some version number stuff I did in r665.

A practical side effect of this is that when you build from source, this
will change the directory in which the distribution will appear. It will
now appear in dist/vdsk-0.1-dev instead of dist/vdsk-1.0.

Perhaps the nightly builds need tweaking to accommodate this too, but I
can't remember where they happen...

--

From yongzh at cs.uchicago.edu Wed May 2 09:27:19 2007
From: yongzh at cs.uchicago.edu (Yong Zhao)
Date: Wed, 2 May 2007 09:27:19 -0500 (CDT)
Subject: [Swift-devel] multiple arguments
In-Reply-To:
References:
Message-ID:

Ben,

use @filenames(sliced[*].img).

From benc at hawaga.org.uk Wed May 2 10:02:54 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 2 May 2007 15:02:54 +0000 (GMT)
Subject: [Swift-devel] multiple arguments
In-Reply-To:
References:
Message-ID:

On Wed, 2 May 2007, Yong Zhao wrote:

> use @filenames(sliced[*].img).

I get this:

Execution failed:
org.griphyn.vdl.mapping.InvalidPathException: Invalid path (*.img)
for type volume

I tried something a little simpler:

type file;

(file out) echo(file n[])
{
app {
echo @filenames(n) stdout=out;
}
}

file f[] ;

file out;

out=echo(f);

but that hangs...

oof.

--

From yongzh at cs.uchicago.edu Wed May 2 10:28:43 2007
From: yongzh at cs.uchicago.edu (Yong Zhao)
Date: Wed, 2 May 2007 10:28:43 -0500 (CDT)
Subject: [Swift-devel] multiple arguments
In-Reply-To:
References:
Message-ID:

That's strange. I used @filenames a lot a while ago and never had any
problems. Check the kml translation, maybe you added the getfieldvalue
stuff to getFilenames, which should not happen. i.e.

It needs to be

....

not

Yong.

From benc at hawaga.org.uk Wed May 2 10:37:30 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 2 May 2007 15:37:30 +0000 (GMT)
Subject: [Swift-devel] multiple arguments
In-Reply-To:
References:
Message-ID:

On Wed, 2 May 2007, Yong Zhao wrote:

> That's strange.
I used @filenames a lot a while ago and never had any
> problems. Check the kml translation, maybe you added the getfieldvalue
> stuff to getFilenames, which should not happen. i.e.

yeah, I was just checking that out as a probable cause. As of about r650,
getField is used on all function invocations when a variable/path name is
supplied as a parameter, no matter which function name is used.

The semantics of these 'multiple valued' language constructs ([*] and how
that passes through @filenames, for example) seem (still) quite poorly
defined...

>
> It needs to be
>
> ....
>
> not
>
> Yong.

From benc at hawaga.org.uk Wed May 2 11:03:10 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 2 May 2007 16:03:10 +0000 (GMT)
Subject: [Swift-devel] suggestion please on hanging/sleeping/slow wf
In-Reply-To:
References:
Message-ID:

On Tue, 1 May 2007, Tiberiu Stef-Praun wrote:

> I have a workflow that generates 5000 files.
> The execution seems to have halted, for no obvious reason:

In the past few days, I've hit hangs a bunch of times in various places -
more than I've ever seen before, but I am doing more complicated things
recently compared to before (which was running a few relatively trivial
jobs in a bunch of relatively trivial workflows).

It's an awkward user experience. In some cases, the code should perhaps
detect such hangs; and in other cases, perhaps different logging info in
the -debug output would be useful...

> - there are no more jobs in the queue
> - no errors are reported in the logfile
> - NOTE: some of the input files have not been staged in yet, yet the
> workflow is hanging
> - NOTE: the remote application temp directory is GONE, only the
> shared directory is still there
> - apparently all the output files that are in /shared have been sent
> back (staged out)
>
> What to do, what to do ?
>
> The workflow is sid-wf.dtm in ~tiberius/scratch on teraport
> It uses the config files in ~tiberius/local/swift-conf

From hategan at mcs.anl.gov Wed May 2 11:16:52 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 02 May 2007 11:16:52 -0500
Subject: [Swift-devel] suggestion please on hanging/sleeping/slow wf
In-Reply-To:
References:
Message-ID: <1178122612.31984.0.camel@blabla.mcs.anl.gov>

On Wed, 2007-05-02 at 16:03 +0000, Ben Clifford wrote:
> Its an awkward user experience. In some cases, the code should perhaps
> detect such hangs; and in other cases, perhaps different logging info in
> the -debug output would be useful...

Yep. The question is how.

From benc at hawaga.org.uk Wed May 2 18:28:20 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 2 May 2007 23:28:20 +0000 (GMT)
Subject: [Swift-devel] suggestion please on hanging/sleeping/slow wf
In-Reply-To:
References:
Message-ID:

so one of the things I suggested tibi do is change logging to trace level
for everything, which has resulted in a 400mb log file for his workflow.

Of course, I don't know what this should really look like if it was
healthy, but I notice a few hundred exceptions of the form:

Caused by: org.globus.ftp.exception.ServerException: Server refused
performing the request. Custom message: (error code 1) [Nested exception
message: Custom message: Unexpected reply: 425
globus_ftp_control_local_pasv(): Handle not in the proper state
CONNECT_WRITE.: Success.] [Nested exception is
org.globus.ftp.exception.UnexpectedReplyCodeException: Custom message:
Unexpected reply: 425 globus_ftp_control_local_pasv(): Handle not in the
proper state CONNECT_WRITE.: Success.]

towards the end of the log file, which may be wrong.
For anyone interested in the full 400mb, the log file is on teraport at: /home/tiberius/scratch/sid-wf-1yrnoadiq0940.log -- From benc at hawaga.org.uk Thu May 3 07:24:29 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 3 May 2007 12:24:29 +0000 (GMT) Subject: [Swift-devel] multiple arguments In-Reply-To: References: Message-ID: On Wed, 2 May 2007, Yong Zhao wrote: > That's strange. I used @filenames a lot a while ago and never had any > problems. Check the kml translation, maybe you added the getfieldvalue > stuff to getFilenames, which should not happen. i.e. > > It needs to be > > .... > > > not > > I noted this problem as bug 59 so it doesn't get forgotten. -- From nefedova at mcs.anl.gov Thu May 3 09:27:35 2007 From: nefedova at mcs.anl.gov (Veronika V. Nefedova) Date: Thu, 03 May 2007 09:27:35 -0500 Subject: [Swift-devel] Fwd: Re: Fwd: chained genU workflow Message-ID: <6.2.1.2.2.20070503092456.0214dc30@pop.mcs.anl.gov> Does anybody have any idea why swift is failing ? It seems pretty straightforward but I do not see whats wrong here... He is using 070429 nightly build (I am using the same build and it works for me). Nika >Date: Wed, 02 May 2007 20:44:32 -0500 >From: Luciano Piccoli >Subject: Re: Fwd: chained genU workflow >To: nefedova at mcs.anl.gov > >Hi Nika, > >I tried to run the genU workflow, but I believe I have some configuration >problem. I did download and install the same vdsk version that you used. > >My sites.xml has only the localhost provider: > > > xmlns="http://www.griphyn.org/chimera/GVDS-PoolConfig" > > xsi:schemaLocation="http://www.griphyn.org/chimera/GVDS > http://www.griphyn.org/chimera/gvds-poolcfg-1.5.xsd" > xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="1.5"> > > > > > minor="0" patch="0"/> > minor="0" patch="0" /> > {user.home} > > > >I reproduced the problem using the q1.swift example. 
My tc.data looks like >this: > > localhost echo /home/piccoli/bin/myecho > INSTALLED INTEL64::LINUX null > localhost echoecho /home/piccoli/bin/myecho > INSTALLED INTEL64::LINUX null > >In the q1.swift workflow, when I replace the echo command with the >echoecho command I get the following error message: > > bash-3.00$ swift q1.swift > Swift V 0.0405 > RunID: 59888ief8zsp1 > echoecho started > echoecho failed > The following errors have occurred: > 1. The requested application (echoecho) cannot be found installed on > any of the sites. > You should check your tc.data and sites.xml files, and make sure > that the name (echoecho) is not misspelled. > >This is the swift script: > > type messagefile {} > > (messagefile t) greeting() { > app { > echoecho "Hello, world!" stdout=@filename(t); > } > } > > messagefile outfile <"hello.txt">; > > outfile = greeting(); > >Do you have any idea why this happens? The same error message shows up >when I run the genU script. Swift complains that mode_test_in cannot be >found, even though tc.data is correct... > >Thanks! >Luciano -------------- next part -------------- An HTML attachment was scrubbed... URL: From benc at hawaga.org.uk Thu May 3 09:43:55 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 3 May 2007 14:43:55 +0000 (GMT) Subject: [Swift-devel] Fwd: Re: Fwd: chained genU workflow In-Reply-To: <6.2.1.2.2.20070503092456.0214dc30@pop.mcs.anl.gov> References: <6.2.1.2.2.20070503092456.0214dc30@pop.mcs.anl.gov> Message-ID: On Thu, 3 May 2007, Veronika V. Nefedova wrote: > Does anybody have any idea why swift is failing ? It seems pretty > straightforward but I do not see whats wrong here... > He is using 070429 nightly build (I am using the same build and it works for > me). try explicitly indicating which tc.data to use like this: swift -tc.file /path/to/tc.data myprogram.swift you can encourage people to engage directly on swift-user too! 
-- From hategan at mcs.anl.gov Thu May 3 10:35:13 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 03 May 2007 10:35:13 -0500 Subject: [Swift-devel] Fwd: Re: Fwd: chained genU workflow In-Reply-To: References: <6.2.1.2.2.20070503092456.0214dc30@pop.mcs.anl.gov> Message-ID: <1178206513.19768.0.camel@blabla.mcs.anl.gov> Or use -v to see what sites file is being used. On Thu, 2007-05-03 at 14:43 +0000, Ben Clifford wrote: > On Thu, 3 May 2007, Veronika V. Nefedova wrote: > > > Does anybody have any idea why swift is failing ? It seems pretty > > straightforward but I do not see whats wrong here... > > He is using 070429 nightly build (I am using the same build and it works for > > me). > > try explicitly indicating which tc.data to use like this: > > swift -tc.file /path/to/tc.data myprogram.swift > > you can encourage people to engage directly on swift-user too! > From benc at hawaga.org.uk Fri May 4 05:06:06 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 4 May 2007 10:06:06 +0000 (GMT) Subject: [Swift-devel] limiting simultaneous jobs using the local provider. Message-ID: Is there a way to limit the number of jobs that will be executing simultaneously with the local provider? (or perhaps with swift as a whole?) There are a few throttle parameters in the configuration file but I find them slightly confusing, and setting them all to 1 does not appear to have the effect I want - I think I understand them to limit the number of jobs that will be in a particular internal gram 'in process of being submitted' state, rather than the total number of actually executing jobs. My immediate motivation for this is that the fmri workflow (which runs up to the incredible number of four cpu intensive executables simultaneously) pretty much kills my laptop for other purposes whilst it's running. I'd much rather be able to limit it to one (or perhaps two).
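The distinction drawn in this message, between throttling how fast jobs are submitted and capping how many are actually executing, can be illustrated outside Swift. What follows is only a minimal Python sketch and not Swift's scheduler: run_capped and max_running are hypothetical names, and the jobs are stand-in functions. All jobs are started immediately, but a semaphore lets only max_running of them execute at a time, and the function reports the peak concurrency it observed.

```python
import threading
import time


def run_capped(jobs, max_running):
    """Run every callable in jobs on its own thread, but let at most
    max_running of them execute at once. Returns the peak concurrency seen."""
    gate = threading.Semaphore(max_running)   # hard cap on executing jobs
    lock = threading.Lock()
    state = {"running": 0, "peak": 0}

    def worker(job):
        with gate:                            # blocks while the cap is reached
            with lock:
                state["running"] += 1
                state["peak"] = max(state["peak"], state["running"])
            try:
                job()
            finally:
                with lock:
                    state["running"] -= 1

    threads = [threading.Thread(target=worker, args=(j,)) for j in jobs]
    for t in threads:
        t.start()                             # every job is "submitted" at once
    for t in threads:
        t.join()
    return state["peak"]


if __name__ == "__main__":
    # Four stand-in "cpu intensive" jobs, limited to one at a time.
    cpu_jobs = [lambda: time.sleep(0.05) for _ in range(4)]
    print(run_capped(cpu_jobs, max_running=1))
```

With max_running=1 the four jobs run strictly one after another, which is the behaviour asked for here; raising it to 2 would allow two at once.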
-- From benc at hawaga.org.uk Fri May 4 06:15:07 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 4 May 2007 11:15:07 +0000 (GMT) Subject: [Swift-devel] multiple arguments In-Reply-To: References: Message-ID: On Wed, 2 May 2007, Ben Clifford wrote: > The semantics of these 'multiple valued' language constructs ([*] and how > that passes through @filenames, for example) seems (still) quite poorly > defined... So I thought about this for a while. I believe that the problem is there is a tension between C/Java like structure/array access constructs (which is the syntax, but not necessarily the semantics, that SwiftScript uses) and XPath-like XML selection constructs (which is the semantics that let you write [*]). That tension for the most part has not been a problem in the way that we've written code so far, except in the presence of [*]. In the C/Java like model, an expression like v or v.image identifies exactly one entity - in the first case to a variable v, in the second case (assuming that v is a structure) to the unique element of the structure in variable v that is called image. In the XML/XPath model, we can make similar looking expressions, such as: v/image. However, XPath expressions do not identify exactly one entity. XPath expressions select nodes in an xml document; the identified entities are XML nodes. But they are not constrained to selecting exactly one node. They can select none, or they can select one, or they can select many. Consider the XPath query: v/image when applied to the XML document: theimage
theheader
The above query will select the node theimage However, consider the same query with the document: theimage
theheader
more bar
The query will select two nodes. One of the image nodes selected will contain theimage and the other will contain more. We have not uniquely identified an entity. We have selected several. Similar things happen if we use what Swift refers to as arrays. Consider an XML document like this: tree plant fish dog We can say lifeforms[1] and have the element plant uniquely identified, or we can say lifeforms[*] and have all four foo elements selected. But what is the 'value' and 'type' of lifeforms[*], for the purposes of feeding into other swift expressions? When we say lifeforms[1] we can say the 'value' is the uniquely selected node plant, and that lifeforms[1] evaluates to plant. But there is no definition of 'value' at the moment in SwiftScript for expressions like this that select multiple nodes. And without a definition of what the value of such an expression is, we can't use such an expression as a value to pass into some other bigger expression, for example @filenames(lifeforms[*]). One solution is to define a data type that can hold (as a single value) the complete set of results (for example an unbounded sequence of XML , or in something more like SwiftScript syntax any[] ). This would allow expressions such as lifeforms[*] to return a single value (an instance of the above type, containing all of the selected nodes) and would give a stronger formalisation of what expressions like @filenames(lifeforms[*]) actually mean. There may be other ways, which I'd be interested to hear about. -- From nefedova at mcs.anl.gov Fri May 4 07:58:51 2007 From: nefedova at mcs.anl.gov (Veronika V. Nefedova) Date: Fri, 04 May 2007 07:58:51 -0500 Subject: [Swift-devel] limiting simultaneous jobs using the local provider.
In-Reply-To: References: Message-ID: <6.2.1.2.2.20070504075722.0216f740@pop.mcs.anl.gov> You can limit it with: in scheduler.xml Nika At 05:06 AM 5/4/2007, Ben Clifford wrote: >Is there a way to limit the number of jobs that will be executing >simultaneously with the local provider? (or perhaps with swift as a >whole?) > >There are a few throttle parameters in the configuration file but I find >them slightly confusing, and setting them all to 1 appears to not have the >effect I want - I think I understand them to limit the number of jobs that >will be in a particular internal gram 'in process of being submitted' >state, rather than the total number of actually executing jobs. > >My immediate motivation for this is because the fmri workflow (which runs >up to the incredible number of four cpu intensive executables >simultaneously) pretty much kills my laptop for other purposes whilst its >running. I'd much rather be able to limit it to one (or perhaps two). > >-- >_______________________________________________ >Swift-devel mailing list >Swift-devel at ci.uchicago.edu >http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From benc at hawaga.org.uk Fri May 4 08:38:10 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 4 May 2007 13:38:10 +0000 (GMT) Subject: [Swift-devel] remote file/directory stuff (bug 22) In-Reply-To: <1172604377.25936.2.camel@blabla.mcs.anl.gov> References: <1172521676.27811.9.camel@blabla.mcs.anl.gov> <1172604377.25936.2.camel@blabla.mcs.anl.gov> Message-ID: On Tue, 27 Feb 2007, Mihael Hategan wrote: > If you can make this translate into something like vdl:(in| > out)appmapping(var, path, dest), preferably after the stagein/stageout > directives, I can probably make it work. 
I have a patch that (at least for input files) makes this: type file; (file o) cat(file f) { app { cat "in.txt" stdout=@o; f < "in.txt"; } } file a <"hello.txt">; file b <"output.txt">; b=cat(a); turn into this kml (fragment): cat in.txt o f "in.txt" However, I have no implementation of the element. I guess its time for me to poke round at the guts of vdl.k a bit more. -- From hategan at mcs.anl.gov Fri May 4 09:04:44 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 04 May 2007 09:04:44 -0500 Subject: [Swift-devel] limiting simultaneous jobs using the local provider. In-Reply-To: <6.2.1.2.2.20070504075722.0216f740@pop.mcs.anl.gov> References: <6.2.1.2.2.20070504075722.0216f740@pop.mcs.anl.gov> Message-ID: <1178287485.14998.1.camel@blabla.mcs.anl.gov> On Fri, 2007-05-04 at 07:58 -0500, Veronika V. Nefedova wrote: > You can limit it with: > Right. That one above limits the jobs for a site based on its score. It's supposed to provide load balancing with multiple sites, so it's likely not what you want. > That on the other hand enforces a hard limit on the number of total concurrent jobs. > > in scheduler.xml > > Nika > > At 05:06 AM 5/4/2007, Ben Clifford wrote: > > >Is there a way to limit the number of jobs that will be executing > >simultaneously with the local provider? (or perhaps with swift as a > >whole?) > > > >There are a few throttle parameters in the configuration file but I find > >them slightly confusing, and setting them all to 1 appears to not have the > >effect I want - I think I understand them to limit the number of jobs that > >will be in a particular internal gram 'in process of being submitted' > >state, rather than the total number of actually executing jobs. > > > >My immediate motivation for this is because the fmri workflow (which runs > >up to the incredible number of four cpu intensive executables > >simultaneously) pretty much kills my laptop for other purposes whilst its > >running. 
I'd much rather be able to limit it to one (or perhaps two). > > > >-- > >_______________________________________________ > >Swift-devel mailing list > >Swift-devel at ci.uchicago.edu > >http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From hategan at mcs.anl.gov Fri May 4 09:10:57 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 04 May 2007 09:10:57 -0500 Subject: [Swift-devel] multiple arguments In-Reply-To: References: Message-ID: <1178287857.14998.7.camel@blabla.mcs.anl.gov> I'd say it's somewhat simpler. Since data in Swift is recursive, non-leaf paths make sense by themselves. What may not make sense is translating them to arguments to an application. For example, passing a complex type as an argument to an application is not well defined. This can be restricted to, say, passing single values or arrays. In the case of arrays, a space separated list is the implicit conversion scheme, and functions could be provided to pass them as comma or something-else-separated-lists. Passing the whole fringe as a list is a possibility, too When it comes to files, the scheme was a little simpler. @filename would pass the file names of the fringe of a particular data tree. And @filenames would do the same, but each leaf is a single argument. On Fri, 2007-05-04 at 11:15 +0000, Ben Clifford wrote: > > On Wed, 2 May 2007, Ben Clifford wrote: > > > The semantics of these 'multiple valued' language constructs ([*] and how > > that passes through @filenames, for example) seems (still) quite poorly > > defined... > > So I thought about this for a while. 
> > I believe that the problem is there is a tension between C/Java like > structure/array access constructs (which is the syntax, but not > necessarily the semantics, that SwiftScript uses) and XPath-like XML > selection constructs (which is the semantics that let you write [*]). > > That tension for the most part has not been a problem in the way that > we've written code so far, except in the presence of [*]. > > In the C/Java like model, an expression like > > v > > or > > v.image > > identifies exactly one entity - in the first case to a variable v, in the > second case (assuming that v is a structure) to the unique element of the > structure in variable v that is called image. > > In the XML/XPath model, we can make similar looking expressions, such as: > > v/image. > > However, XPath expressions do not identify exactly one entity. XPath > expressions select nodes in an xml document; the identified entities are > XML nodes. But they are not constrained to selecting exactly one node. > They can select none, or they can select one, or they can select many. > > Consider the XPath query: v/image when applied to the XML document: > > > theimage >
theheader
>
> > The above query will select the node theimage > > However, consider the same query with the document: > > > theimage >
theheader
> more > bar >
> > The query will select two nodes. One of the nodes selected will be > theimage and the other will be more. > > We have not uniquely identified an entity. We have selected several. > > Similar things happen if we use what Swift refers to as arrays. Consider > an XML document like this: > > > tree > plant > fish > dog > > > We can say lifeforms[1] and have the element plant uniquely > identified. > > or we can say lifeforms[*] and have all four foo elements selected. > > But what is the 'value' and 'type' of lifeforms[*], for the purposes of > feeding into other swift expressions? > > When we say lifeforms[1] we can say the 'value' is the uniquely selected > node plant, and that lifeforms[1] evaluates to > plant. > > But there is no definition of 'value' at the moment in SwiftScript for > expressions like this that select multiple expressions. And without a > definition of what the value of such an expression is, then we can't use > such an expression as a value to pass into some other bigger expression, > for example @filenames(lifeforms[*]). > > One solution is to define a data type that can hold (as a single value) > the complete set of results (for example an unbounded sequence of XML > , or in something more like SwiftScript syntax any[] ). > > This would allows expressions such as lifeforms[*] to return a single > value (an instance of the above type, containing all of the selected > nodes) and would give a stronger formalisation of what expressions like > @filenames(lifeforms[*]) actually mean. > > There may be other ways, which I'd be interested to hear about. 
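The selection behaviour described in the quoted message can be reproduced with Python's standard xml.etree.ElementTree module. This is a sketch under stated assumptions: the archive has stripped the markup from the quoted documents, so the element names used here (v, image, header, foo) are guesses made for illustration, and select_images and single_value are hypothetical helper names. The first function returns whatever v/image selects, which may be zero, one or many nodes; the second shows what a C/Java-style "exactly one entity" reading would have to enforce.

```python
import xml.etree.ElementTree as ET

# Assumed reconstructions of the two documents discussed in the thread.
ONE_IMAGE = "<v><image>theimage</image><header>theheader</header></v>"
TWO_IMAGES = ("<v><image>theimage</image><header>theheader</header>"
              "<image>more</image><foo>bar</foo></v>")


def select_images(xml_text):
    """Return the text of every node matched by the query v/image."""
    v = ET.fromstring(xml_text)               # the root element is <v>
    return [node.text for node in v.findall("./image")]


def single_value(xml_text):
    """C/Java-style access: insist that the query identifies one entity."""
    matches = select_images(xml_text)
    if len(matches) != 1:
        raise ValueError(f"expected 1 node, selected {len(matches)}")
    return matches[0]


if __name__ == "__main__":
    print(select_images(ONE_IMAGE))
    print(select_images(TWO_IMAGES))
```

Returning a list from select_images corresponds to the proposal of a single value that holds the complete set of selected nodes.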
> From hategan at mcs.anl.gov Fri May 4 09:17:19 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 04 May 2007 09:17:19 -0500 Subject: [Swift-devel] remote file/directory stuff (bug 22) In-Reply-To: References: <1172521676.27811.9.camel@blabla.mcs.anl.gov> <1172604377.25936.2.camel@blabla.mcs.anl.gov> Message-ID: <1178288239.14998.13.camel@blabla.mcs.anl.gov> On Fri, 2007-05-04 at 13:38 +0000, Ben Clifford wrote: > > On Tue, 27 Feb 2007, Mihael Hategan wrote: > > > If you can make this translate into something like vdl:(in| > > out)appmapping(var, path, dest), preferably after the stagein/stageout > > directives, I can probably make it work. > > I have a patch that (at least for input files) makes this: > > type file; > > (file o) cat(file f) { > app { > cat "in.txt" stdout=@o; > f < "in.txt"; > } > } Shouldn't that be f > "in.txt" and perhaps before "cat"? In a strict language style, that would vaguely suggest "dump f into 'in.txt'" before running cat... > > file a <"hello.txt">; > file b <"output.txt">; > > b=cat(a); > > > turn into this kml (fragment): > > > cat > > > > in.txt > > > > o > > > > f > > "in.txt" > > > > However, I have no implementation of the element. I > guess its time for me to poke round at the guts of vdl.k a bit more. 1. Using attributes instead of sub-elements may be a little faster. 2. I think the best deal would be to perhaps extend stagein with having pairs of [localName, remoteName] and deal with that appropriately. 
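The idea in point 2, stage-in driven by pairs of [localName, remoteName], can be sketched in a few lines of Python. This is an illustration only and not the vdl.k implementation: stage_in is a hypothetical function, and a local temporary directory stands in for the remote run directory. Each file is copied into the run directory under the name the application expects rather than the name it has on the submit side.

```python
import os
import shutil
import tempfile


def stage_in(pairs, run_dir):
    """Copy each (local_name, remote_name) pair into run_dir, storing the
    file under remote_name. Returns the list of staged paths."""
    staged = []
    for local_name, remote_name in pairs:
        dest = os.path.join(run_dir, remote_name)
        shutil.copyfile(local_name, dest)
        staged.append(dest)
    return staged


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as submit_dir, \
         tempfile.TemporaryDirectory() as run_dir:
        src = os.path.join(submit_dir, "hello.txt")
        with open(src, "w") as fh:
            fh.write("Hello, world!\n")
        # Analogue of the proposed mapping: hello.txt is seen as in.txt
        # by the application in its run directory.
        print(stage_in([(src, "in.txt")], run_dir))
```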
> > From benc at hawaga.org.uk Fri May 4 09:26:32 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 4 May 2007 14:26:32 +0000 (GMT) Subject: [Swift-devel] remote file/directory stuff (bug 22) In-Reply-To: <1178288239.14998.13.camel@blabla.mcs.anl.gov> References: <1172521676.27811.9.camel@blabla.mcs.anl.gov> <1172604377.25936.2.camel@blabla.mcs.anl.gov> <1178288239.14998.13.camel@blabla.mcs.anl.gov> Message-ID: On Fri, 4 May 2007, Mihael Hategan wrote: > Shouldn't that be f > "in.txt" and perhaps before "cat"? In a strict > language style, that would vaguely suggest "dump f into 'in.txt'" before > running cat... I pondered a while over which way to put the arrow when I was writing the parser, and also where in the text it should go. Then decided that it was a waste of time to think about it too much when I could be playing with the code and picked a configuration at random... I'm not even sure the a>b syntax is the right way anyway, but I'm more interested in getting some implementation done for now than pondering syntax. -- From benc at hawaga.org.uk Fri May 4 09:33:04 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 4 May 2007 14:33:04 +0000 (GMT) Subject: [Swift-devel] multiple arguments In-Reply-To: <1178287857.14998.7.camel@blabla.mcs.anl.gov> References: <1178287857.14998.7.camel@blabla.mcs.anl.gov> Message-ID: On Fri, 4 May 2007, Mihael Hategan wrote: > When it comes to files, the scheme was a little simpler. @filename would > pass the file names of the fringe of a particular data tree. And > @filenames would do the same, but each leaf is a single argument. when used in an app block something like: app { myapp "-in" @filenames(myarray[*]) "-type" "fast"; } then @filenames needs to be able to return something that gets passed to myapp as multiple parameters, rather than a single parameter with spaces in it. I think (?)
that this is hard to do if @filenames returns a single value, from a SwiftScript-theory perspective (though I think in the karajan implementation, @filenames can return as many values as it wants?) -- From benc at hawaga.org.uk Fri May 4 09:48:29 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 4 May 2007 14:48:29 +0000 (GMT) Subject: [Swift-devel] multiple arguments In-Reply-To: References: <1178287857.14998.7.camel@blabla.mcs.anl.gov> Message-ID: On Fri, 4 May 2007, Ben Clifford wrote: > > When it comes to files, the scheme was a little simpler. @filename would > > pass the file names of the fringe of a particular data tree. And > > @filenames would do the same, but each leaf is a single argument. > > when used in an app block something like: > > app { > myapp "-in" @filenames(myarray[*]) "-type" "fast"; > } > > then @filenames needs to be able to return something that gets passed to > myapp as multiple parameters, rather than a single parameter with spaces > in it. > > I think (?) that this is hard to do if @filenames returns a single value, > from a SwiftScript-theory perspective (though I think in the karajan > implementation, @filenames can return as many values as it wants?) so perhaps what we should say is @filenames(myarray) returns an array of strings (so @filenames(myarray) has type string[]) and then say that the behaviour for string arrays being used in an application line is to make each element into its own argument. -- From benc at hawaga.org.uk Fri May 4 09:56:47 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 4 May 2007 14:56:47 +0000 (GMT) Subject: [Swift-devel] limiting simultaneous jobs using the local provider. In-Reply-To: <1178287485.14998.1.camel@blabla.mcs.anl.gov> References: <6.2.1.2.2.20070504075722.0216f740@pop.mcs.anl.gov> <1178287485.14998.1.camel@blabla.mcs.anl.gov> Message-ID: On Fri, 4 May 2007, Mihael Hategan wrote: > > > > That on the other hand enforces a hard limit on the number of total > concurrent jobs. 
setting this and leaving jobThrottle as it was appears to have caused the desired effect. -- From yongzh at cs.uchicago.edu Fri May 4 10:22:19 2007 From: yongzh at cs.uchicago.edu (Yong Zhao) Date: Fri, 4 May 2007 10:22:19 -0500 (CDT) Subject: [Swift-devel] remote file/directory stuff (bug 22) In-Reply-To: References: <1172521676.27811.9.camel@blabla.mcs.anl.gov> <1172604377.25936.2.camel@blabla.mcs.anl.gov> Message-ID: what does f < "in.txt" mean here? wouldn't it be placed before the call? Yong. On Fri, 4 May 2007, Ben Clifford wrote: > > > On Tue, 27 Feb 2007, Mihael Hategan wrote: > > > If you can make this translate into something like vdl:(in| > > out)appmapping(var, path, dest), preferably after the stagein/stageout > > directives, I can probably make it work. > > I have a patch that (at least for input files) makes this: > > type file; > > (file o) cat(file f) { > app { > cat "in.txt" stdout=@o; > f < "in.txt"; > } > } > > file a <"hello.txt">; > file b <"output.txt">; > > b=cat(a); > > > turn into this kml (fragment): > > > cat > > > > in.txt > > > > o > > > > f > > "in.txt" > > > > However, I have no implementation of the element. I > guess its time for me to poke round at the guts of vdl.k a bit more. > > > -- > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From benc at hawaga.org.uk Fri May 4 10:25:19 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 4 May 2007 15:25:19 +0000 (GMT) Subject: [Swift-devel] remote file/directory stuff (bug 22) In-Reply-To: References: <1172521676.27811.9.camel@blabla.mcs.anl.gov> <1172604377.25936.2.camel@blabla.mcs.anl.gov> Message-ID: On Fri, 4 May 2007, Yong Zhao wrote: > what does f < "in.txt" mean here? wouldn't it be placed before the call? 
It means the input file f goes into a file called "in.txt" in the remote run directory, rather than into a file with the same name as whatever it happens to have on the submit side. I can tweak the syntax easily enough by moving round production rules and templates with cut-n-paste - the semantics are more something I'm concerned about, in terms of actually being useful. -- From hategan at mcs.anl.gov Fri May 4 10:24:46 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 04 May 2007 10:24:46 -0500 Subject: [Swift-devel] multiple arguments In-Reply-To: References: <1178287857.14998.7.camel@blabla.mcs.anl.gov> Message-ID: <1178292286.17541.8.camel@blabla.mcs.anl.gov> On Fri, 2007-05-04 at 14:33 +0000, Ben Clifford wrote: > > On Fri, 4 May 2007, Mihael Hategan wrote: > > > When it comes to files, the scheme was a little simpler. @filename would > > pass the file names of the fringe of a particular data tree. And > > @filenames would do the same, but each leaf is a single argument. > > when used in an app block something like: > > app { > myapp "-in" @filenames(myarray[*]) "-type" "fast"; > } > > then @filenames needs to be able to return something that gets passed to > myapp as multiple parameters, rather than a single parameter with spaces > in it. > > I think (?) that this is hard to do if @filenames returns a single value, > from a SwiftScript-theory perspective (though I think in the karajan > implementation, @filenames can return as many values as it wants? Which is exactly what's happening. I'm not sure if we need to go into that much detail on that one. @filenames returns something that app{} knows how to interpret as meaning multiple arguments rather than one. 
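The behaviour agreed on in this exchange, where @filenames yields an array of strings and each element becomes its own argument on the application line, can be sketched in Python. This is a hedged illustration rather than the Karajan implementation: build_argv and filenames are hypothetical names, and a plain list of file names stands in for a mapped Swift array.

```python
def build_argv(executable, *params):
    """Flatten params into an argv list: a list or tuple contributes one
    argument per element, anything else contributes exactly one argument."""
    argv = [executable]
    for p in params:
        if isinstance(p, (list, tuple)):
            argv.extend(str(item) for item in p)
        else:
            argv.append(str(p))
    return argv


def filenames(mapped):
    """Stand-in for @filenames: one string per leaf of the array."""
    return [str(name) for name in mapped]


if __name__ == "__main__":
    myarray = ["a.img", "b.img", "c.img"]
    # Analogue of: myapp "-in" @filenames(myarray[*]) "-type" "fast";
    print(build_argv("myapp", "-in", filenames(myarray), "-type", "fast"))
```

Joining the names with spaces before the call would instead produce a single argument containing spaces, which is the case the thread wants to avoid.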
> ) > From hategan at mcs.anl.gov Fri May 4 10:25:54 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 04 May 2007 10:25:54 -0500 Subject: [Swift-devel] multiple arguments In-Reply-To: References: <1178287857.14998.7.camel@blabla.mcs.anl.gov> Message-ID: <1178292354.17541.10.camel@blabla.mcs.anl.gov> On Fri, 2007-05-04 at 14:48 +0000, Ben Clifford wrote: > > On Fri, 4 May 2007, Ben Clifford wrote: > > > > When it comes to files, the scheme was a little simpler. @filename would > > > pass the file names of the fringe of a particular data tree. And > > > @filenames would do the same, but each leaf is a single argument. > > > > when used in an app block something like: > > > > app { > > myapp "-in" @filenames(myarray[*]) "-type" "fast"; > > } > > > > then @filenames needs to be able to return something that gets passed to > > myapp as multiple parameters, rather than a single parameter with spaces > > in it. > > > > I think (?) that this is hard to do if @filenames returns a single value, > > from a SwiftScript-theory perspective (though I think in the karajan > > implementation, @filenames can return as many values as it wants?) > > so perhaps what we should say is > > @filenames(myarray) > > returns an array of strings (so @filenames(myarray) has type string[]) > > and then say that the behaviour for string arrays being used in an > application line is to make each element into its own argument. Exactly. > From benc at hawaga.org.uk Fri May 4 11:46:32 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 4 May 2007 16:46:32 +0000 (GMT) Subject: [Swift-devel] swift-on-windows Message-ID: out of interest, has anyone ever run swift on a Windows OS? -- From hategan at mcs.anl.gov Fri May 4 20:51:16 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 04 May 2007 20:51:16 -0500 Subject: [Swift-devel] swift-on-windows In-Reply-To: References: Message-ID: <1178329876.19509.4.camel@blabla.mcs.anl.gov> I think our own Yong has. 
The trick, if you run locally, is the wrapper which is a bash script. If not, I can see no obvious problems. On Fri, 2007-05-04 at 16:46 +0000, Ben Clifford wrote: > out of interest, has anyone ever run swift on a Windows OS? > From yongzh at cs.uchicago.edu Fri May 4 21:14:26 2007 From: yongzh at cs.uchicago.edu (Yong Zhao) Date: Fri, 4 May 2007 21:14:26 -0500 (CDT) Subject: [Swift-devel] swift-on-windows In-Reply-To: <1178329876.19509.4.camel@blabla.mcs.anl.gov> References: <1178329876.19509.4.camel@blabla.mcs.anl.gov> Message-ID: Yeah, I did run swift on my windows laptop a while ago before we introduced the shell wrapper. We can have a windows wrapper in place of that to run on windows. Yong. On Fri, 4 May 2007, Mihael Hategan wrote: > I think our own Yong has. > The trick, if you run locally, is the wrapper which is a bash script. If > not, I can see no obvious problems. > > On Fri, 2007-05-04 at 16:46 +0000, Ben Clifford wrote: > > out of interest, has anyone ever run swift on a Windows OS? > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From hategan at mcs.anl.gov Fri May 4 21:12:49 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 04 May 2007 21:12:49 -0500 Subject: [Swift-devel] swift-on-windows In-Reply-To: References: <1178329876.19509.4.camel@blabla.mcs.anl.gov> Message-ID: <1178331169.20702.0.camel@blabla.mcs.anl.gov> Again, if the jobs themselves are NOT run locally, then the wrapper problem does not apply. On Fri, 2007-05-04 at 21:14 -0500, Yong Zhao wrote: > Yeah, I did run swift on my windows laptop a while ago before we > introduced the shell wrapper. We can have a windows wrapper in place of > that to run on windows. > > Yong. > > On Fri, 4 May 2007, Mihael Hategan wrote: > > > I think our own Yong has. > > The trick, if you run locally, is the wrapper which is a bash script. 
If > > not, I can see no obvious problems. > > > > On Fri, 2007-05-04 at 16:46 +0000, Ben Clifford wrote: > > > out of interest, has anyone ever run swift on a Windows OS? > > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > From iraicu at cs.uchicago.edu Fri May 4 21:29:12 2007 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Fri, 04 May 2007 21:29:12 -0500 Subject: [Swift-devel] swift-on-windows In-Reply-To: <1178331169.20702.0.camel@blabla.mcs.anl.gov> References: <1178329876.19509.4.camel@blabla.mcs.anl.gov> <1178331169.20702.0.camel@blabla.mcs.anl.gov> Message-ID: <463BEBF8.10108@cs.uchicago.edu> What about cygwin? Linux scripts work unchanged in cygwin... for example, I can run my GT4 clients from windows under cygwin with no modifications to any of my scripts or code (non-swift related). Ioan Mihael Hategan wrote: > Again, if the jobs themselves are NOT run locally, then the wrapper > problem does not apply. > > On Fri, 2007-05-04 at 21:14 -0500, Yong Zhao wrote: > >> Yeah, I did run swift on my windows laptop a while ago before we >> introduced the shell wrapper. We can have a windows wrapper in place of >> that to run on windows. >> >> Yong. >> >> On Fri, 4 May 2007, Mihael Hategan wrote: >> >> >>> I think our own Yong has. >>> The trick, if you run locally, is the wrapper which is a bash script. If >>> not, I can see no obvious problems. >>> >>> On Fri, 2007-05-04 at 16:46 +0000, Ben Clifford wrote: >>> >>>> out of interest, has anyone ever run swift on a Windows OS? 
>>>> >>>> >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>> >>> > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > -- ============================================ Ioan Raicu Ph.D. Student ============================================ Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 ============================================ Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dsl.cs.uchicago.edu/ ============================================ ============================================ -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Fri May 4 22:12:25 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 04 May 2007 22:12:25 -0500 Subject: [Swift-devel] swift-on-windows In-Reply-To: <463BEBF8.10108@cs.uchicago.edu> References: <1178329876.19509.4.camel@blabla.mcs.anl.gov> <1178331169.20702.0.camel@blabla.mcs.anl.gov> <463BEBF8.10108@cs.uchicago.edu> Message-ID: <1178334745.22667.2.camel@blabla.mcs.anl.gov> On Fri, 2007-05-04 at 21:29 -0500, Ioan Raicu wrote: > What about cygwin? Linux scripts work unchanged in cygwin... for > example, I can run my GT4 clients from windows under cygwin with no > modifications to any of my scripts or code (non-swift related). Right. That should work. There's one other thing. I'm now a little more convinced that perl may be a better option for a wrapper. It's a little more strict than Bash, and Jens seems to think it's not as wasteful of resources (although that should not be that much of an issue if it's running on a worker node). 
Mihael > Ioan > > Mihael Hategan wrote: > > Again, if the jobs themselves are NOT run locally, then the wrapper > > problem does not apply. > > > > On Fri, 2007-05-04 at 21:14 -0500, Yong Zhao wrote: > > > > > Yeah, I did run swift on my windows laptop a while ago before we > > > introduced the shell wrapper. We can have a windows wrapper in place of > > > that to run on windows. > > > > > > Yong. > > > > > > On Fri, 4 May 2007, Mihael Hategan wrote: > > > > > > > > > > I think our own Yong has. > > > > The trick, if you run locally, is the wrapper which is a bash script. If > > > > not, I can see no obvious problems. > > > > > > > > On Fri, 2007-05-04 at 16:46 +0000, Ben Clifford wrote: > > > > > > > > > out of interest, has anyone ever run swift on a Windows OS? > > > > > > > > > > > > > > _______________________________________________ > > > > Swift-devel mailing list > > > > Swift-devel at ci.uchicago.edu > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > -- > ============================================ > Ioan Raicu > Ph.D. Student > ============================================ > Distributed Systems Laboratory > Computer Science Department > University of Chicago > 1100 E. 
58th Street, Ryerson Hall > Chicago, IL 60637 > ============================================ > Email: iraicu at cs.uchicago.edu > Web: http://www.cs.uchicago.edu/~iraicu > http://dsl.cs.uchicago.edu/ > ============================================ > ============================================ From nefedova at mcs.anl.gov Wed May 9 11:19:20 2007 From: nefedova at mcs.anl.gov (Veronika Nefedova) Date: Wed, 9 May 2007 11:19:20 -0500 Subject: [Swift-devel] MolDyn at Purdue References: Message-ID: Hi, I am wondering if somebody knows where in swift we could specify the project name (allocation) for PBS ? Without that we can't submit to PBS at Purdue... In VDL you'd specify that in 'properties' file. Thanks! Nika Begin forwarded message: > From: "Yuqing Deng" > Date: May 9, 2007 10:21:50 AM CDT > To: "Veronika Nefedova" > Subject: Re: you allocation at Purdue > > > Is there a way to specify command line argument to the scheduler > from swift > config files? I need to use -A account with qsub. Purdue site > does not support > default account with pbs. > > Yuqing > -------------- next part -------------- An HTML attachment was scrubbed... URL: From benc at hawaga.org.uk Wed May 9 11:23:26 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 9 May 2007 16:23:26 +0000 (GMT) Subject: [Swift-devel] MolDyn at Purdue In-Reply-To: References: Message-ID: are they submitting to GRAM (and thence to PBS) or to PBS via some cog pbs provider? On Wed, 9 May 2007, Veronika Nefedova wrote: > Hi, > > I am wondering if somebody knows where in swift we could specify the project > name (allocation) for PBS ? Without that we can't submit to PBS at Purdue... > In VDL you'd specify that in 'properties' file. > > Thanks! 
>
> Nika
>
> Begin forwarded message:
>
> > From: "Yuqing Deng"
> > Date: May 9, 2007 10:21:50 AM CDT
> > To: "Veronika Nefedova"
> > Subject: Re: you allocation at Purdue
> >
> >
> > Is there a way to specify command line argument to the scheduler from swift
> > config files? I need to use -A account with qsub. Purdue site does not
> > support
> > default account with pbs.
> >
> > Yuqing
> >
>
From itf at mcs.anl.gov Wed May 9 11:34:02 2007
From: itf at mcs.anl.gov (Ian Foster)
Date: Wed, 9 May 2007 16:34:02 +0000
Subject: [Swift-devel] MolDyn at Purdue
In-Reply-To:
References:
Message-ID: <233492970-1178728474-cardhu_blackberry.rim.net-3812750-@bwe026-cell00.bisx.prod.on.blackberry>

I'm surprised that the purdue gram interface is different to that at ncsa

Sent via BlackBerry from T-Mobile

-----Original Message-----
From: Ben Clifford
Date: Wed, 9 May 2007 16:23:26
To:Veronika Nefedova
Cc:swift-devel at ci.uchicago.edu
Subject: Re: [Swift-devel] MolDyn at Purdue

are they submitting to GRAM (and thence to PBS) or to PBS via some cog pbs provider?

On Wed, 9 May 2007, Veronika Nefedova wrote:

> Hi,
>
> I am wondering if somebody knows where in swift we could specify the project
> name (allocation) for PBS ? Without that we can't submit to PBS at Purdue...
> In VDL you'd specify that in 'properties' file.
>
> Thanks!
>
> Nika
>
> Begin forwarded message:
>
> > From: "Yuqing Deng"
> > Date: May 9, 2007 10:21:50 AM CDT
> > To: "Veronika Nefedova"
> > Subject: Re: you allocation at Purdue
> >
> >
> > Is there a way to specify command line argument to the scheduler from swift
> > config files? I need to use -A account with qsub. Purdue site does not
> > support
> > default account with pbs.
> >
> > Yuqing
> >
>
From nefedova at mcs.anl.gov Wed May 9 11:38:23 2007
From: nefedova at mcs.anl.gov (Veronika Nefedova)
Date: Wed, 9 May 2007 11:38:23 -0500
Subject: [Swift-devel] MolDyn at Purdue
In-Reply-To:
References:
Message-ID: <1E5A7292-AC04-49BE-B017-6A4D8C28F32F@mcs.anl.gov>

we are submitting with swift, to GRAM.

purdue ~$ globus-job-run tg-gatekeeper.purdue.teragrid.org/jobmanager-pbs /bin/hostname
Please specify a TG project number.
GRAM Job failed because the job failed when the job manager attempted to run it (error code 17)

While you can specify that for globusrun on a command line - there has to be a way to specify it somewhere inside swift?

Thanks!

Nika

On May 9, 2007, at 11:23 AM, Ben Clifford wrote:

>
> are they submitting to GRAM (and thence to PBS) or to PBS via some
> cog pbs
> provider?
>
> On Wed, 9 May 2007, Veronika Nefedova wrote:
>
>> Hi,
>>
>> I am wondering if somebody knows where in swift we could specify
>> the project
>> name (allocation) for PBS ? Without that we can't submit to PBS at
>> Purdue...
>> In VDL you'd specify that in 'properties' file.
>>
>> Thanks!
>>
>> Nika
>>
>> Begin forwarded message:
>>
>>> From: "Yuqing Deng"
>>> Date: May 9, 2007 10:21:50 AM CDT
>>> To: "Veronika Nefedova"
>>> Subject: Re: you allocation at Purdue
>>>
>>>
>>> Is there a way to specify command line argument to the scheduler
>>> from swift
>>> config files? I need to use -A account with qsub. Purdue site
>>> does not
>>> support
>>> default account with pbs.
>>>
>>> Yuqing
>>>
>>
>
From benc at hawaga.org.uk Wed May 9 11:37:16 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 9 May 2007 16:37:16 +0000 (GMT)
Subject: [Swift-devel] MolDyn at Purdue
In-Reply-To: <233492970-1178728474-cardhu_blackberry.rim.net-3812750-@bwe026-cell00.bisx.prod.on.blackberry>
References: <233492970-1178728474-cardhu_blackberry.rim.net-3812750-@bwe026-cell00.bisx.prod.on.blackberry>
Message-ID:

On Wed, 9 May 2007, Ian Foster wrote:

> I'm surprised that the purdue gram interface is different to that at ncsa

some differences are documented here:
http://www.teragrid.org/docs/jobs/index.php#NCSA

what tibi and nika are encountering appears to be documented.

--

From benc at hawaga.org.uk Wed May 9 11:39:37 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 9 May 2007 16:39:37 +0000 (GMT)
Subject: [Swift-devel] MolDyn at Purdue
In-Reply-To: <1E5A7292-AC04-49BE-B017-6A4D8C28F32F@mcs.anl.gov>
References: <1E5A7292-AC04-49BE-B017-6A4D8C28F32F@mcs.anl.gov>
Message-ID:

On Wed, 9 May 2007, Veronika Nefedova wrote:

> While you can specify that for globusrun on a command line - there has to
> be a way to specify it somewhere inside swift?

mihael talked about being able to specify it as a profile entry perhaps, in a thread the other day on this list. that might work - check out the VDS docs for how to specify globus RSL extension attributes in the site or transformation catalogs (or if you can't find it, I can have a look).
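
[Editor's sketch] The profile-entry idea above could look roughly like the following. This is only a minimal illustration, not a tested Swift or VDS configuration: the `profile` element name and the `namespace="globus"` attribute are assumptions based on the VDS-style site catalog convention, `project_profile` is a hypothetical helper, and `TG-EXAMPLE000` is a placeholder rather than a real allocation.

```python
import xml.etree.ElementTree as ET


def project_profile(project_id):
    """Return a site-catalog profile entry as an XML string.

    Assumed layout (not confirmed by this thread):
    <profile namespace="globus" key="project">PROJECT_ID</profile>
    """
    if not project_id:
        raise ValueError("project_id must be a non-empty string")
    # The "globus" namespace is assumed to be the one whose keys are
    # passed through to the job manager as RSL attributes.
    elem = ET.Element("profile", {"namespace": "globus", "key": "project"})
    elem.text = project_id
    return ET.tostring(elem, encoding="unicode")


if __name__ == "__main__":
    # "TG-EXAMPLE000" is a placeholder, not a real allocation.
    print(project_profile("TG-EXAMPLE000"))
```

The printed element would be pasted into the entry for the relevant site in the site catalog; whether the submission path actually honours it has to be checked against the site itself.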
-- From benc at hawaga.org.uk Wed May 9 12:16:42 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 9 May 2007 17:16:42 +0000 (GMT) Subject: [Swift-devel] MolDyn at Purdue In-Reply-To: References: <1E5A7292-AC04-49BE-B017-6A4D8C28F32F@mcs.anl.gov> Message-ID: Do this: edit your site catalog to add an entry TG-STA040020N for the purdue site, add an entry TG-WHATEVERYOURGRANTNUMBERIS -- From tiberius at ci.uchicago.edu Wed May 9 12:21:41 2007 From: tiberius at ci.uchicago.edu (Tiberiu Stef-Praun) Date: Wed, 9 May 2007 12:21:41 -0500 Subject: [Swift-devel] MolDyn at Purdue In-Reply-To: References: <1E5A7292-AC04-49BE-B017-6A4D8C28F32F@mcs.anl.gov> Message-ID: alternatively do this: in your tc.data, append GLOBUS::project=TGxxxxxxx to your application definition On 5/9/07, Ben Clifford wrote: > > Do this: > > edit your site catalog to add an entry key="project">TG-STA040020N for the purdue site, add an entry > key="project">TG-WHATEVERYOURGRANTNUMBERIS > > -- > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > -- Tiberiu (Tibi) Stef-Praun, PhD Research Staff, Computation Institute 5640 S. Ellis Ave, #405 University of Chicago http://www-unix.mcs.anl.gov/~tiberius/ From nefedova at mcs.anl.gov Wed May 9 12:30:42 2007 From: nefedova at mcs.anl.gov (Veronika Nefedova) Date: Wed, 9 May 2007 12:30:42 -0500 Subject: [Swift-devel] MolDyn at Purdue In-Reply-To: References: <1E5A7292-AC04-49BE-B017-6A4D8C28F32F@mcs.anl.gov> Message-ID: <4F78FC2F-11EC-49FE-B9A1-FAD0ABCDFDE6@mcs.anl.gov> So does any of it work? Have you tested it successfully? 
(-; Nika On May 9, 2007, at 12:21 PM, Tiberiu Stef-Praun wrote: > alternatively do this: > in your tc.data, append GLOBUS::project=TGxxxxxxx to your > application definition > > > > On 5/9/07, Ben Clifford wrote: >> >> Do this: >> >> edit your site catalog to add an entry > key="project">TG-STA040020N for the purdue site, add an >> entry >> > key="project">TG-WHATEVERYOURGRANTNUMBERIS >> >> -- >> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> > > > -- > Tiberiu (Tibi) Stef-Praun, PhD > Research Staff, Computation Institute > 5640 S. Ellis Ave, #405 > University of Chicago > http://www-unix.mcs.anl.gov/~tiberius/ > From benc at hawaga.org.uk Wed May 9 12:30:20 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 9 May 2007 17:30:20 +0000 (GMT) Subject: [Swift-devel] MolDyn at Purdue In-Reply-To: <4F78FC2F-11EC-49FE-B9A1-FAD0ABCDFDE6@mcs.anl.gov> References: <1E5A7292-AC04-49BE-B017-6A4D8C28F32F@mcs.anl.gov> <4F78FC2F-11EC-49FE-B9A1-FAD0ABCDFDE6@mcs.anl.gov> Message-ID: I haven't. On Wed, 9 May 2007, Veronika Nefedova wrote: > So does any of it work? Have you tested it successfully? > > (-; > > Nika > > On May 9, 2007, at 12:21 PM, Tiberiu Stef-Praun wrote: > > > alternatively do this: > > in your tc.data, append GLOBUS::project=TGxxxxxxx to your application > > definition > > > > > > > > On 5/9/07, Ben Clifford wrote: > > > > > > Do this: > > > > > > edit your site catalog to add an entry > > key="project">TG-STA040020N for the purdue site, add an entry > > > > > key="project">TG-WHATEVERYOURGRANTNUMBERIS > > > > > > -- > > > > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > > -- > > Tiberiu (Tibi) Stef-Praun, PhD > > Research Staff, Computation Institute > > 5640 S. 
Ellis Ave, #405 > > University of Chicago > > http://www-unix.mcs.anl.gov/~tiberius/ > > > From nefedova at mcs.anl.gov Wed May 9 15:16:46 2007 From: nefedova at mcs.anl.gov (Veronika Nefedova) Date: Wed, 9 May 2007 15:16:46 -0500 Subject: [Swift-devel] MolDyn at Purdue In-Reply-To: References: <1E5A7292-AC04-49BE-B017-6A4D8C28F32F@mcs.anl.gov> Message-ID: Hi, Ioan: How do I add my project info into Falcon? (I can't submit anything to PBS queue unless I specify the project) Nika On May 9, 2007, at 12:16 PM, Ben Clifford wrote: > > Do this: > > edit your site catalog to add an entry key="project">TG-STA040020N for the purdue site, add an > entry > key="project">TG-WHATEVERYOURGRANTNUMBERIS > > -- > From iraicu at cs.uchicago.edu Wed May 9 15:19:56 2007 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Wed, 09 May 2007 15:19:56 -0500 Subject: [Swift-devel] MolDyn at Purdue In-Reply-To: References: <1E5A7292-AC04-49BE-B017-6A4D8C28F32F@mcs.anl.gov> Message-ID: <46422CEC.5050807@cs.uchicago.edu> Hmmm... I don't know. If you know what modifications we need to make to the GRAM4 RSL to include the specific project, we can simply modify the RSL by changing the create function to add the extra info into the RSL... its a minor change, but we'd have to recompile the DRP stuff again. If we can't add it to the RSL, then I don't know any other place to put this. Anyone have any ideas? I am using the GRAM4 Java API directly in the DRP code. Ioan PS: Here is a sample RSL... iraicu at tg-viz-login1:~/java/Falkon_v0.8/worker> cat RSL.0.0.ia32-compute.1.120.14898358.xml /home/iraicu/java/Falkon_v0.8/worker/run.worker.sh /home/iraicu/java/Falkon_v0.8/worker 6900000 1500000 120 ia32-compute 1 1 1 My guess is that we could add something like ... but I am not sure... If no one knows how to do this off the top of their heads, I'll look it up! Veronika Nefedova wrote: > Hi, Ioan: > > How do I add my project info into Falcon? 
(I can't submit anything to > PBS queue unless I specify the project) > > Nika > > On May 9, 2007, at 12:16 PM, Ben Clifford wrote: > >> >> Do this: >> >> edit your site catalog to add an entry > key="project">TG-STA040020N for the purdue site, add an entry >> > key="project">TG-WHATEVERYOURGRANTNUMBERIS >> >> -- >> > > -- ============================================ Ioan Raicu Ph.D. Student ============================================ Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 ============================================ Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dsl.cs.uchicago.edu/ ============================================ ============================================ From nefedova at mcs.anl.gov Wed May 9 15:30:22 2007 From: nefedova at mcs.anl.gov (Veronika Nefedova) Date: Wed, 9 May 2007 15:30:22 -0500 Subject: [Swift-devel] MolDyn at Purdue In-Reply-To: <46422CEC.5050807@cs.uchicago.edu> References: <1E5A7292-AC04-49BE-B017-6A4D8C28F32F@mcs.anl.gov> <46422CEC.5050807@cs.uchicago.edu> Message-ID: Ioan, its at the very bottom of this thread (thats what Ben is suggesting for Swift) We need to include just one line, similar to that one. But I do not know where (-; Nika On May 9, 2007, at 3:19 PM, Ioan Raicu wrote: > Hmmm... I don't know. If you know what modifications we need to > make to the GRAM4 RSL to include the specific project, we can > simply modify the RSL by changing the create function to add the > extra info into the RSL... its a minor change, but we'd have to > recompile the DRP stuff again. If we can't add it to the RSL, then > I don't know any other place to put this. Anyone have any ideas? > I am using the GRAM4 Java API directly in the DRP code. > > Ioan > > PS: Here is a sample RSL... 
> > iraicu at tg-viz-login1:~/java/Falkon_v0.8/worker> cat RSL.0.0.ia32- > compute.1.120.14898358.xml > > /home/iraicu/java/Falkon_v0.8/worker/run.worker.sh executable> > /home/iraicu/java/Falkon_v0.8/worker > 6900000 > 1500000 > 120 > > > ia32-compute > 1 > 1 > 1 > > > > > My guess is that we could add something like > ... > but I am not sure... > > If no one knows how to do this off the top of their heads, I'll > look it up! > > Veronika Nefedova wrote: >> Hi, Ioan: >> >> How do I add my project info into Falcon? (I can't submit anything >> to PBS queue unless I specify the project) >> >> Nika >> >> On May 9, 2007, at 12:16 PM, Ben Clifford wrote: >> >>> >>> Do this: >>> >>> edit your site catalog to add an entry >> key="project">TG-STA040020N for the purdue site, add an >>> entry >>> >> key="project">TG-WHATEVERYOURGRANTNUMBERIS >>> >>> -- >>> >> >> > > -- > ============================================ > Ioan Raicu > Ph.D. Student > ============================================ > Distributed Systems Laboratory > Computer Science Department > University of Chicago > 1100 E. 58th Street, Ryerson Hall > Chicago, IL 60637 > ============================================ > Email: iraicu at cs.uchicago.edu > Web: http://www.cs.uchicago.edu/~iraicu > http://dsl.cs.uchicago.edu/ > ============================================ > ============================================ > From benc at hawaga.org.uk Thu May 10 01:47:55 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 10 May 2007 06:47:55 +0000 (GMT) Subject: [Swift-devel] MolDyn at Purdue In-Reply-To: <46422CEC.5050807@cs.uchicago.edu> References: <1E5A7292-AC04-49BE-B017-6A4D8C28F32F@mcs.anl.gov> <46422CEC.5050807@cs.uchicago.edu> Message-ID: As you suggested, the GRAM4 RSL extension to use is whatever according to http://teragrid.org/userinfo/jobs/ Probably needs to go under extensions, (in xpath, extensions/project) On Wed, 9 May 2007, Ioan Raicu wrote: > Hmmm... I don't know. 
If you know what modifications we need to make to the > GRAM4 RSL to include the specific project, we can simply modify the RSL by > changing the create function to add the extra info into the RSL... its a minor > change, but we'd have to recompile the DRP stuff again. If we can't add it to > the RSL, then I don't know any other place to put this. Anyone have any > ideas? I am using the GRAM4 Java API directly in the DRP code. > > Ioan > > PS: Here is a sample RSL... > > iraicu at tg-viz-login1:~/java/Falkon_v0.8/worker> cat > RSL.0.0.ia32-compute.1.120.14898358.xml > > /home/iraicu/java/Falkon_v0.8/worker/run.worker.sh > /home/iraicu/java/Falkon_v0.8/worker > 6900000 > 1500000 > 120 > > > ia32-compute > 1 > 1 > 1 > > > > > My guess is that we could add something like > ... > but I am not sure... > > If no one knows how to do this off the top of their heads, I'll look it up! > > Veronika Nefedova wrote: > > Hi, Ioan: > > > > How do I add my project info into Falcon? (I can't submit anything to PBS > > queue unless I specify the project) > > > > Nika > > > > On May 9, 2007, at 12:16 PM, Ben Clifford wrote: > > > > > > > > Do this: > > > > > > edit your site catalog to add an entry > > key="project">TG-STA040020N for the purdue site, add an entry > > > > > key="project">TG-WHATEVERYOURGRANTNUMBERIS > > > > > > -- > > > > > > > > > From benc at hawaga.org.uk Thu May 10 05:21:44 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 10 May 2007 10:21:44 +0000 (GMT) Subject: [Swift-devel] MolDyn at Purdue In-Reply-To: <46422CEC.5050807@cs.uchicago.edu> References: <1E5A7292-AC04-49BE-B017-6A4D8C28F32F@mcs.anl.gov> <46422CEC.5050807@cs.uchicago.edu> Message-ID: not sure whether you're looking to make this code more closely integrated with swift and/or a product rather than a research project, but you might make the below submission use profile information from the site catalog (and transformation catalog?) 
- it doesn't look like you're doing anything fancy in the submission.

On Wed, 9 May 2007, Ioan Raicu wrote:

> Hmmm... I don't know. If you know what modifications we need to make to the
> GRAM4 RSL to include the specific project, we can simply modify the RSL by
> changing the create function to add the extra info into the RSL... it's a minor
> change, but we'd have to recompile the DRP stuff again. If we can't add it to
> the RSL, then I don't know any other place to put this. Anyone have any
> ideas? I am using the GRAM4 Java API directly in the DRP code.
>
> Ioan
>
> PS: Here is a sample RSL...
>
> iraicu at tg-viz-login1:~/java/Falkon_v0.8/worker> cat
> RSL.0.0.ia32-compute.1.120.14898358.xml
>
> /home/iraicu/java/Falkon_v0.8/worker/run.worker.sh
> /home/iraicu/java/Falkon_v0.8/worker
> 6900000
> 1500000
> 120
>
>
> ia32-compute
> 1
> 1
> 1
>
>
>
>
> My guess is that we could add something like
> ...
> but I am not sure...
>
> If no one knows how to do this off the top of their heads, I'll look it up!
>
> Veronika Nefedova wrote:
> > Hi, Ioan:
> >
> > How do I add my project info into Falcon? (I can't submit anything to PBS
> > queue unless I specify the project)
> >
> > Nika
> >
> > On May 9, 2007, at 12:16 PM, Ben Clifford wrote:
> >
> > >
> > > Do this:
> > >
> > > edit your site catalog to add an entry
> > key="project">TG-STA040020N for the purdue site, add an entry
> > >
> > key="project">TG-WHATEVERYOURGRANTNUMBERIS
> > >
> > > --
> > >
> > >
> >
>
From iraicu at cs.uchicago.edu Thu May 10 12:39:28 2007
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Thu, 10 May 2007 12:39:28 -0500
Subject: [Swift-devel] MolDyn at Purdue
In-Reply-To:
References: <1E5A7292-AC04-49BE-B017-6A4D8C28F32F@mcs.anl.gov> <46422CEC.5050807@cs.uchicago.edu>
Message-ID: <464358D0.1060402@cs.uchicago.edu>

So I made the modifications to the code that generates the GRAM RSL to take a command line argument -project ..., which then is simply passed to the RSL file as ... .
At the moment, this is something that needs to be set in Falkon at startup, and all resources provisioned by Falkon will use the same project. For now I think we have a solution that works at the various TG sites, and it is not tightly integrated with Swift. The issue is much more complex if you want Swift to carry the project information on a per job basis, and charge it to a (potentially) different project each job. Ioan Ben Clifford wrote: > not sure whether you're looking to make this code more closely integrated > with swift and/or a product rather than a research project, but you might > make the below submission use profile information from the site catalog > (and transformation catalog?) - it doesn't look like you're doing anything > fancy in the submission. > > On Wed, 9 May 2007, Ioan Raicu wrote: > > >> Hmmm... I don't know. If you know what modifications we need to make to the >> GRAM4 RSL to include the specific project, we can simply modify the RSL by >> changing the create function to add the extra info into the RSL... its a minor >> change, but we'd have to recompile the DRP stuff again. If we can't add it to >> the RSL, then I don't know any other place to put this. Anyone have any >> ideas? I am using the GRAM4 Java API directly in the DRP code. >> >> Ioan >> >> PS: Here is a sample RSL... >> >> iraicu at tg-viz-login1:~/java/Falkon_v0.8/worker> cat >> RSL.0.0.ia32-compute.1.120.14898358.xml >> >> /home/iraicu/java/Falkon_v0.8/worker/run.worker.sh >> /home/iraicu/java/Falkon_v0.8/worker >> 6900000 >> 1500000 >> 120 >> >> >> ia32-compute >> 1 >> 1 >> 1 >> >> >> >> >> My guess is that we could add something like >> ... >> but I am not sure... >> >> If no one knows how to do this off the top of their heads, I'll look it up! >> >> Veronika Nefedova wrote: >> >>> Hi, Ioan: >>> >>> How do I add my project info into Falcon? 
(I can't submit anything to PBS >>> queue unless I specify the project) >>> >>> Nika >>> >>> On May 9, 2007, at 12:16 PM, Ben Clifford wrote: >>> >>> >>>> Do this: >>>> >>>> edit your site catalog to add an entry >>> key="project">TG-STA040020N for the purdue site, add an entry >>>> >>> key="project">TG-WHATEVERYOURGRANTNUMBERIS >>>> >>>> -- >>>> >>>> >>> >> > > -- ============================================ Ioan Raicu Ph.D. Student ============================================ Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 ============================================ Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dsl.cs.uchicago.edu/ ============================================ ============================================ -------------- next part -------------- An HTML attachment was scrubbed... URL: From foster at mcs.anl.gov Thu May 10 13:34:18 2007 From: foster at mcs.anl.gov (Ian Foster) Date: Thu, 10 May 2007 13:34:18 -0500 Subject: [Swift-devel] MolDyn at Purdue In-Reply-To: <464358D0.1060402@cs.uchicago.edu> References: <1E5A7292-AC04-49BE-B017-6A4D8C28F32F@mcs.anl.gov> <46422CEC.5050807@cs.uchicago.edu> <464358D0.1060402@cs.uchicago.edu> Message-ID: <464365AA.50407@mcs.anl.gov> we don't want to do that ("carry the project information on a per job basis, and charge it to a (potentially) different project each job") Ioan Raicu wrote: > So I made the modifications to the code that generates the GRAM RSL to > take a command line arguement -project ..., which then is simply > passed to the RSL file as ... . At the moment, > this is something that needs to be set in Falkon at startup, and all > resources provisioned by Falkon will use the same project. For now I > think we have a solution that works at the various TG sites, and it is > not tightly integrated with Swift. 
> > The issue is much more complex if you want Swift to carry the project > information on a per job basis, and charge it to a (potentially) > different project each job. > > Ioan > > Ben Clifford wrote: >> not sure whether you're looking to make this code more closely integrated >> with swift and/or a product rather than a research project, but you might >> make the below submission use profile information from the site catalog >> (and transformation catalog?) - it doesn't look like you're doing anything >> fancy in the submission. >> >> On Wed, 9 May 2007, Ioan Raicu wrote: >> >> >>> Hmmm... I don't know. If you know what modifications we need to make to the >>> GRAM4 RSL to include the specific project, we can simply modify the RSL by >>> changing the create function to add the extra info into the RSL... its a minor >>> change, but we'd have to recompile the DRP stuff again. If we can't add it to >>> the RSL, then I don't know any other place to put this. Anyone have any >>> ideas? I am using the GRAM4 Java API directly in the DRP code. >>> >>> Ioan >>> >>> PS: Here is a sample RSL... >>> >>> iraicu at tg-viz-login1:~/java/Falkon_v0.8/worker> cat >>> RSL.0.0.ia32-compute.1.120.14898358.xml >>> >>> /home/iraicu/java/Falkon_v0.8/worker/run.worker.sh >>> /home/iraicu/java/Falkon_v0.8/worker >>> 6900000 >>> 1500000 >>> 120 >>> >>> >>> ia32-compute >>> 1 >>> 1 >>> 1 >>> >>> >>> >>> >>> My guess is that we could add something like >>> ... >>> but I am not sure... >>> >>> If no one knows how to do this off the top of their heads, I'll look it up! >>> >>> Veronika Nefedova wrote: >>> >>>> Hi, Ioan: >>>> >>>> How do I add my project info into Falcon? 
(I can't submit anything to PBS >>>> queue unless I specify the project) >>>> >>>> Nika >>>> >>>> On May 9, 2007, at 12:16 PM, Ben Clifford wrote: >>>> >>>> >>>>> Do this: >>>>> >>>>> edit your site catalog to add an entry >>>> key="project">TG-STA040020N for the purdue site, add an entry >>>>> >>>> key="project">TG-WHATEVERYOURGRANTNUMBERIS >>>>> >>>>> -- >>>>> >>>>> >>>> >>> >> >> > > -- > ============================================ > Ioan Raicu > Ph.D. Student > ============================================ > Distributed Systems Laboratory > Computer Science Department > University of Chicago > 1100 E. 58th Street, Ryerson Hall > Chicago, IL 60637 > ============================================ > Email: iraicu at cs.uchicago.edu > Web: http://www.cs.uchicago.edu/~iraicu > http://dsl.cs.uchicago.edu/ > ============================================ > ============================================ > > ------------------------------------------------------------------------ > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > -- Ian Foster, Director, Computation Institute Argonne National Laboratory & University of Chicago Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439 Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637 Tel: +1 630 252 4619. Web: www.ci.uchicago.edu. Globus Alliance: www.globus.org. -------------- next part -------------- An HTML attachment was scrubbed... 
From iraicu at cs.uchicago.edu Thu May 10 13:39:57 2007
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Thu, 10 May 2007 13:39:57 -0500
Subject: [Swift-devel] MolDyn at Purdue
In-Reply-To: <464365AA.50407@mcs.anl.gov>
References: <1E5A7292-AC04-49BE-B017-6A4D8C28F32F@mcs.anl.gov> <46422CEC.5050807@cs.uchicago.edu> <464358D0.1060402@cs.uchicago.edu> <464365AA.50407@mcs.anl.gov>
Message-ID: <464366FD.8060008@cs.uchicago.edu>

Great, then we are set: the project is configurable at Falkon startup!

Ioan

Ian Foster wrote:
> we don't want to do that ("carry the project information on a per job
> basis, and charge it to a (potentially) different project each job")
>
> Ioan Raicu wrote:
>> So I made the modifications to the code that generates the GRAM RSL
>> to take a command line argument -project ..., which then is simply
>> passed to the RSL file as ... . At the moment,
>> this is something that needs to be set in Falkon at startup, and all
>> resources provisioned by Falkon will use the same project. For now
>> I think we have a solution that works at the various TG sites, and it
>> is not tightly integrated with Swift.
>>
>> The issue is much more complex if you want Swift to carry the project
>> information on a per job basis, and charge it to a (potentially)
>> different project each job.
>>
>> Ioan
>>
>> Ben Clifford wrote:
>>> not sure whether you're looking to make this code more closely integrated
>>> with swift and/or a product rather than a research project, but you might
>>> make the below submission use profile information from the site catalog
>>> (and transformation catalog?) - it doesn't look like you're doing anything
>>> fancy in the submission.
>>>
>>> On Wed, 9 May 2007, Ioan Raicu wrote:
>>>
>>>
>>>> Hmmm... I don't know. If you know what modifications we need to make to the
>>>> GRAM4 RSL to include the specific project, we can simply modify the RSL by
>>>> changing the create function to add the extra info into the RSL...
its a minor >>>> change, but we'd have to recompile the DRP stuff again. If we can't add it to >>>> the RSL, then I don't know any other place to put this. Anyone have any >>>> ideas? I am using the GRAM4 Java API directly in the DRP code. >>>> >>>> Ioan >>>> >>>> PS: Here is a sample RSL... >>>> >>>> iraicu at tg-viz-login1:~/java/Falkon_v0.8/worker> cat >>>> RSL.0.0.ia32-compute.1.120.14898358.xml >>>> >>>> /home/iraicu/java/Falkon_v0.8/worker/run.worker.sh >>>> /home/iraicu/java/Falkon_v0.8/worker >>>> 6900000 >>>> 1500000 >>>> 120 >>>> >>>> >>>> ia32-compute >>>> 1 >>>> 1 >>>> 1 >>>> >>>> >>>> >>>> >>>> My guess is that we could add something like >>>> ... >>>> but I am not sure... >>>> >>>> If no one knows how to do this off the top of their heads, I'll look it up! >>>> >>>> Veronika Nefedova wrote: >>>> >>>>> Hi, Ioan: >>>>> >>>>> How do I add my project info into Falcon? (I can't submit anything to PBS >>>>> queue unless I specify the project) >>>>> >>>>> Nika >>>>> >>>>> On May 9, 2007, at 12:16 PM, Ben Clifford wrote: >>>>> >>>>> >>>>>> Do this: >>>>>> >>>>>> edit your site catalog to add an entry >>>>> key="project">TG-STA040020N for the purdue site, add an entry >>>>>> >>>>> key="project">TG-WHATEVERYOURGRANTNUMBERIS >>>>>> >>>>>> -- >>>>>> >>>>>> >>>>> >>>> >>> >>> >> >> -- >> ============================================ >> Ioan Raicu >> Ph.D. Student >> ============================================ >> Distributed Systems Laboratory >> Computer Science Department >> University of Chicago >> 1100 E. 
58th Street, Ryerson Hall >> Chicago, IL 60637 >> ============================================ >> Email: iraicu at cs.uchicago.edu >> Web: http://www.cs.uchicago.edu/~iraicu >> http://dsl.cs.uchicago.edu/ >> ============================================ >> ============================================ >> >> ------------------------------------------------------------------------ >> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> > > -- > > Ian Foster, Director, Computation Institute > Argonne National Laboratory & University of Chicago > Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439 > Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637 > Tel: +1 630 252 4619. Web: www.ci.uchicago.edu. > Globus Alliance: www.globus.org. > -- ============================================ Ioan Raicu Ph.D. Student ============================================ Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 ============================================ Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dsl.cs.uchicago.edu/ ============================================ ============================================ -------------- next part -------------- An HTML attachment was scrubbed... URL: From foster at mcs.anl.gov Fri May 11 08:31:11 2007 From: foster at mcs.anl.gov (Ian Foster) Date: Fri, 11 May 2007 08:31:11 -0500 Subject: [Swift-devel] MolDyn at Purdue In-Reply-To: <464366FD.8060008@cs.uchicago.edu> References: <1E5A7292-AC04-49BE-B017-6A4D8C28F32F@mcs.anl.gov> <46422CEC.5050807@cs.uchicago.edu> <464358D0.1060402@cs.uchicago.edu> <464365AA.50407@mcs.anl.gov> <464366FD.8060008@cs.uchicago.edu> Message-ID: <4644701F.4040007@mcs.anl.gov> I note that we have stopped running at NCSA and switched to trying to run at Purdue. 
A good thing to try, certainly. However, could we not have had a big job in the queue at NCSA all this time, also, using Falkon, which would have run by now? Ian. Ioan Raicu wrote: > Great, than we are set, the project is configurable at the Falkon startup! > Ioan > > Ian Foster wrote: >> we don't want to do that ("carry the project information on a per job >> basis, and charge it to a (potentially) different project each job") >> >> Ioan Raicu wrote: >>> So I made the modifications to the code that generates the GRAM RSL >>> to take a command line arguement -project ..., which then is simply >>> passed to the RSL file as ... . At the moment, >>> this is something that needs to be set in Falkon at startup, and all >>> resources provisioned by Falkon will use the same project. For now >>> I think we have a solution that works at the various TG sites, and >>> it is not tightly integrated with Swift. >>> >>> The issue is much more complex if you want Swift to carry the >>> project information on a per job basis, and charge it to a >>> (potentially) different project each job. >>> >>> Ioan >>> >>> Ben Clifford wrote: >>>> not sure whether you're looking to make this code more closely integrated >>>> with swift and/or a product rather than a research project, but you might >>>> make the below submission use profile information from the site catalog >>>> (and transformation catalog?) - it doesn't look like you're doing anything >>>> fancy in the submission. >>>> >>>> On Wed, 9 May 2007, Ioan Raicu wrote: >>>> >>>> >>>>> Hmmm... I don't know. If you know what modifications we need to make to the >>>>> GRAM4 RSL to include the specific project, we can simply modify the RSL by >>>>> changing the create function to add the extra info into the RSL... its a minor >>>>> change, but we'd have to recompile the DRP stuff again. If we can't add it to >>>>> the RSL, then I don't know any other place to put this. Anyone have any >>>>> ideas? 
I am using the GRAM4 Java API directly in the DRP code. >>>>> >>>>> Ioan >>>>> >>>>> PS: Here is a sample RSL... >>>>> >>>>> iraicu at tg-viz-login1:~/java/Falkon_v0.8/worker> cat >>>>> RSL.0.0.ia32-compute.1.120.14898358.xml >>>>> >>>>> /home/iraicu/java/Falkon_v0.8/worker/run.worker.sh >>>>> /home/iraicu/java/Falkon_v0.8/worker >>>>> 6900000 >>>>> 1500000 >>>>> 120 >>>>> >>>>> >>>>> ia32-compute >>>>> 1 >>>>> 1 >>>>> 1 >>>>> >>>>> >>>>> >>>>> >>>>> My guess is that we could add something like >>>>> ... >>>>> but I am not sure... >>>>> >>>>> If no one knows how to do this off the top of their heads, I'll look it up! >>>>> >>>>> Veronika Nefedova wrote: >>>>> >>>>>> Hi, Ioan: >>>>>> >>>>>> How do I add my project info into Falcon? (I can't submit anything to PBS >>>>>> queue unless I specify the project) >>>>>> >>>>>> Nika >>>>>> >>>>>> On May 9, 2007, at 12:16 PM, Ben Clifford wrote: >>>>>> >>>>>> >>>>>>> Do this: >>>>>>> >>>>>>> edit your site catalog to add an entry >>>>>> key="project">TG-STA040020N for the purdue site, add an entry >>>>>>> >>>>>> key="project">TG-WHATEVERYOURGRANTNUMBERIS >>>>>>> >>>>>>> -- >>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>>> >>> >>> -- >>> ============================================ >>> Ioan Raicu >>> Ph.D. Student >>> ============================================ >>> Distributed Systems Laboratory >>> Computer Science Department >>> University of Chicago >>> 1100 E. 
58th Street, Ryerson Hall >>> Chicago, IL 60637 >>> ============================================ >>> Email: iraicu at cs.uchicago.edu >>> Web: http://www.cs.uchicago.edu/~iraicu >>> http://dsl.cs.uchicago.edu/ >>> ============================================ >>> ============================================ >>> >>> ------------------------------------------------------------------------ >>> >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>> >> >> -- >> >> Ian Foster, Director, Computation Institute >> Argonne National Laboratory & University of Chicago >> Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439 >> Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637 >> Tel: +1 630 252 4619. Web: www.ci.uchicago.edu. >> Globus Alliance: www.globus.org. >> > > -- > ============================================ > Ioan Raicu > Ph.D. Student > ============================================ > Distributed Systems Laboratory > Computer Science Department > University of Chicago > 1100 E. 58th Street, Ryerson Hall > Chicago, IL 60637 > ============================================ > Email: iraicu at cs.uchicago.edu > Web: http://www.cs.uchicago.edu/~iraicu > http://dsl.cs.uchicago.edu/ > ============================================ > ============================================ > -- Ian Foster, Director, Computation Institute Argonne National Laboratory & University of Chicago Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439 Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637 Tel: +1 630 252 4619. Web: www.ci.uchicago.edu. Globus Alliance: www.globus.org. -------------- next part -------------- An HTML attachment was scrubbed... 
URL:

From nefedova at mcs.anl.gov Fri May 11 08:58:12 2007
From: nefedova at mcs.anl.gov (Veronika Nefedova)
Date: Fri, 11 May 2007 08:58:12 -0500
Subject: [Swift-devel] MolDyn at Purdue
In-Reply-To: <4644701F.4040007@mcs.anl.gov>
References: <1E5A7292-AC04-49BE-B017-6A4D8C28F32F@mcs.anl.gov> <46422CEC.5050807@cs.uchicago.edu> <464358D0.1060402@cs.uchicago.edu> <464365AA.50407@mcs.anl.gov> <464366FD.8060008@cs.uchicago.edu> <4644701F.4040007@mcs.anl.gov>
Message-ID:

I think we had a problem submitting a big reservation to NCSA - even smaller ones were in the queue for more than a week at that time. When we ran a queue-time estimate, it said something like 'unable to predict' or 'unable to accept'...
Ioan - do you remember what the exact problem was?

Nika

On May 11, 2007, at 8:31 AM, Ian Foster wrote:

> I note that we have stopped running at NCSA and switched to trying
> to run at Purdue. A good thing to try, certainly.
>
> However, could we not have had a big job in the queue at NCSA all
> this time, also, using Falkon, which would have run by now?
>
> Ian.
>
> Ioan Raicu wrote:
>> Great, then we are set, the project is configurable at the Falkon
>> startup!
>> Ioan

From itf at mcs.anl.gov Fri May 11 09:01:27 2007
From: itf at mcs.anl.gov (Ian Foster)
Date: Fri, 11 May 2007 14:01:27 +0000
Subject: [Swift-devel] MolDyn at Purdue
In-Reply-To:
References: <1E5A7292-AC04-49BE-B017-6A4D8C28F32F@mcs.anl.gov> <46422CEC.5050807@cs.uchicago.edu> <464358D0.1060402@cs.uchicago.edu> <464365AA.50407@mcs.anl.gov> <464366FD.8060008@cs.uchicago.edu> <4644701F.4040007@mcs.anl.gov>
Message-ID: <1819896301-1178892121-cardhu_blackberry.rim.net-1642341055-@bwe017-cell00.bisx.prod.on.blackberry>

It seems unlikely to me that you can't even submit it?
Sent via BlackBerry from T-Mobile

-----Original Message-----
From: Veronika Nefedova
Date: Fri, 11 May 2007 08:58:12
To: Ian Foster
Cc: iraicu at cs.uchicago.edu, swift-devel at ci.uchicago.edu
Subject: Re: [Swift-devel] MolDyn at Purdue

I think we had a problem submitting a big reservation to NCSA - even smaller ones were in the queue for more than a week at that time. When we ran a queue-time estimate, it said something like 'unable to predict' or 'unable to accept'...
Ioan - do you remember what the exact problem was?

Nika

On May 11, 2007, at 8:31 AM, Ian Foster wrote:
I note that we have stopped running at NCSA and switched to trying to run at Purdue. A good thing to try, certainly.

However, could we not have had a big job in the queue at NCSA all this time, also, using Falkon, which would have run by now?

Ian.

Ioan Raicu wrote:
Great, then we are set, the project is configurable at the Falkon startup!
Ioan

From nefedova at mcs.anl.gov Fri May 11 09:18:35 2007
From: nefedova at mcs.anl.gov (Veronika Nefedova)
Date: Fri, 11 May 2007 09:18:35 -0500
Subject: [Swift-devel] MolDyn at Purdue
In-Reply-To: <1819896301-1178892121-cardhu_blackberry.rim.net-1642341055-@bwe017-cell00.bisx.prod.on.blackberry>
References: <1E5A7292-AC04-49BE-B017-6A4D8C28F32F@mcs.anl.gov> <46422CEC.5050807@cs.uchicago.edu> <464358D0.1060402@cs.uchicago.edu> <464365AA.50407@mcs.anl.gov> <464366FD.8060008@cs.uchicago.edu> <4644701F.4040007@mcs.anl.gov> <1819896301-1178892121-cardhu_blackberry.rim.net-1642341055-@bwe017-cell00.bisx.prod.on.blackberry>
Message-ID:

Nope, it's quite possible. Last week I couldn't submit a single job for almost a day -- their queue was completely full! The message was something like 'not accepting new jobs in a queue' - or something like that. The cluster is ridiculously busy. I could try to submit a reservation today for, say, 20 molecules...

Nika

On May 11, 2007, at 9:01 AM, Ian Foster wrote:

> It seems unlikely to me that you can't even submit it?
>
> Sent via BlackBerry from T-Mobile
>
> -----Original Message-----
> From: Veronika Nefedova
> Date: Fri, 11 May 2007 08:58:12
> To: Ian Foster
> Cc: iraicu at cs.uchicago.edu, swift-devel at ci.uchicago.edu
> Subject: Re: [Swift-devel] MolDyn at Purdue
>
> I think we had a problem submitting a big reservation to NCSA -
> even smaller ones were in the queue for more than a week at that
> time. When we ran a queue-time estimate, it said something
> like 'unable to predict' or 'unable to accept'...
> Ioan - do you remember what the exact problem was?
>
>
> Nika
>
>
> On May 11, 2007, at 8:31 AM, Ian Foster wrote:
> I note that we have stopped running at NCSA and switched to trying
> to run at Purdue. A good thing to try, certainly.
>
> However, could we not have had a big job in the queue at NCSA all
> this time, also, using Falkon, which would have run by now?
>
> Ian.
>
> Ioan Raicu wrote:
> Great, then we are set, the project is
> configurable at the Falkon startup!
> Ioan
>

From iraicu at cs.uchicago.edu Fri May 11 10:18:20 2007
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Fri, 11 May 2007 10:18:20 -0500
Subject: [Swift-devel] MolDyn at Purdue
In-Reply-To: <1819896301-1178892121-cardhu_blackberry.rim.net-1642341055-@bwe017-cell00.bisx.prod.on.blackberry>
References: <1E5A7292-AC04-49BE-B017-6A4D8C28F32F@mcs.anl.gov> <46422CEC.5050807@cs.uchicago.edu> <464358D0.1060402@cs.uchicago.edu> <464365AA.50407@mcs.anl.gov> <464366FD.8060008@cs.uchicago.edu> <4644701F.4040007@mcs.anl.gov> <1819896301-1178892121-cardhu_blackberry.rim.net-1642341055-@bwe017-cell00.bisx.prod.on.blackberry>
Message-ID: <4644893C.6050300@cs.uchicago.edu>

I remember using the batch queue prediction system to try to estimate how long the queues would be, and we were getting relatively long queues (on the order of days) for just a few dozen processors for a 24 hour period, and if we asked for anything significant (100+ processors), the prediction system was saying that it cannot give us a prediction... my guess is that the queue wait would have been longer than the maximum the prediction models were designed for. The site was really busy, and there were hundreds of large jobs involving 100~1000 processors, each running for days at a time. We were essentially discouraged by all this, and decided that it's not worth trying to do any large runs at NCSA (at that time), and that Nika would try to install the application at Purdue, and try some larger scale runs there, as the Purdue site seemed to be relatively idle. So, we never tried to submit a large allocation at NCSA... but maybe we should have, maybe we would have gotten it by now.

Ioan

Ian Foster wrote:
> It seems unlikely to me that you can't even submit it?
>
> Sent via BlackBerry from T-Mobile
>
> -----Original Message-----
> From: Veronika Nefedova
> Date: Fri, 11 May 2007 08:58:12
> To: Ian Foster
> Cc: iraicu at cs.uchicago.edu, swift-devel at ci.uchicago.edu
> Subject: Re: [Swift-devel] MolDyn at Purdue
>
> I think we had a problem submitting a big reservation to NCSA - even smaller ones were in the queue for more than a week at that time. When we ran a queue-time estimate, it said something like 'unable to predict' or 'unable to accept'...
> Ioan - do you remember what the exact problem was?
>
>
> Nika
>
>
> On May 11, 2007, at 8:31 AM, Ian Foster wrote:
> I note that we have stopped running at NCSA and switched to trying to run at Purdue. A good thing to try, certainly.
>
> However, could we not have had a big job in the queue at NCSA all this time, also, using Falkon, which would have run by now?
>
> Ian.
>
> Ioan Raicu wrote:
> Great, then we are set, the project is configurable at the Falkon startup!
> Ioan
>
>

--
============================================
Ioan Raicu
Ph.D. Student
============================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E.
58th Street, Ryerson Hall Chicago, IL 60637 ============================================ Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dsl.cs.uchicago.edu/ ============================================ ============================================ From iraicu at cs.uchicago.edu Fri May 11 10:24:16 2007 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Fri, 11 May 2007 10:24:16 -0500 Subject: [Swift-devel] MolDyn at Purdue In-Reply-To: References: <1E5A7292-AC04-49BE-B017-6A4D8C28F32F@mcs.anl.gov> <46422CEC.5050807@cs.uchicago.edu> <464358D0.1060402@cs.uchicago.edu> <464365AA.50407@mcs.anl.gov> <464366FD.8060008@cs.uchicago.edu> <4644701F.4040007@mcs.anl.gov> <1819896301-1178892121-cardhu_blackberry.rim.net-1642341055-@bwe017-cell00.bisx.prod.on.blackberry> Message-ID: <46448AA0.5030705@cs.uchicago.edu> Right, so if we want to get roughly the same execution time of 77 minutes, we would need 34*20 = 680 machines for 2 hours, right? If we halve the machine numbers, we can double the time reservation, right? Let me know if you need help with the Falkon settings! Ioan Veronika Nefedova wrote: > Nope, its quite possible. Last week I couldn't submit a single job for > almost a day -- their queue was completely full! The message was > something like 'not accepting new jobs in a queue' - or something like > that. The cluster is ridiculously busy. I could try to submit today a > reservation for , say, 20 molecules... > > Nika > > On May 11, 2007, at 9:01 AM, Ian Foster wrote: > >> It seems unlikely to me that you can't even submit it? >> >> Sent via BlackBerry from T-Mobile >> >> -----Original Message----- >> From: Veronika Nefedova >> Date: Fri, 11 May 2007 08:58:12 >> To:Ian Foster >> Cc:iraicu at cs.uchicago.edu, swift-devel at ci.uchicago.edu >> Subject: Re: [Swift-devel] MolDyn at Purdue >> >> I think we had a problem submitting a big reservation to NCSA - even >> a smaller ones were in the queue for more then a week at that time. 
>> When we did a time estimate on a queue time it said something like >> 'unable to predict' or 'unable to accept'... >> Ioan - do you remember what was the exact problem? >> >> >> Nika >> >> >> >> >> On May 11, 2007, at 8:31 AM, Ian Foster wrote: >> I note that we have stopped running at NCSA and switched to trying to >> run at Purdue. A good thing to try, certainly. >> >> However, could we not have had a big job in the queue at NCSA all >> this time, also, using Falkon, which would have run by now? >> >> Ian. >> >> Ioan Raicu wrote:Great, than we are set, the project is configurable >> at the Falkon startup! >> Ioan >> > > -- ============================================ Ioan Raicu Ph.D. Student ============================================ Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 ============================================ Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dsl.cs.uchicago.edu/ ============================================ ============================================ From nefedova at mcs.anl.gov Fri May 11 10:46:56 2007 From: nefedova at mcs.anl.gov (Veronika Nefedova) Date: Fri, 11 May 2007 10:46:56 -0500 Subject: [Swift-devel] MolDyn at Purdue In-Reply-To: <46448AA0.5030705@cs.uchicago.edu> References: <1E5A7292-AC04-49BE-B017-6A4D8C28F32F@mcs.anl.gov> <46422CEC.5050807@cs.uchicago.edu> <464358D0.1060402@cs.uchicago.edu> <464365AA.50407@mcs.anl.gov> <464366FD.8060008@cs.uchicago.edu> <4644701F.4040007@mcs.anl.gov> <1819896301-1178892121-cardhu_blackberry.rim.net-1642341055-@bwe017-cell00.bisx.prod.on.blackberry> <46448AA0.5030705@cs.uchicago.edu> Message-ID: Interesting... Apparently, I did submit the reservation for a big run back on Monday (I thought it didn't go through at that time). And it is still in the queue.. 
tg-login1 nefedova/Falkon_v0.8> showq | grep nefedova
995068   nefedova  Idle  286  2:00:00  Mon May  7 10:02:28
1000628  nefedova  Idle  340  4:00:00  Fri May 11 10:41:14
tg-login1 nefedova/Falkon_v0.8>

Nika

On May 11, 2007, at 10:24 AM, Ioan Raicu wrote:

> Right, so if we want to get roughly the same execution time of 77
> minutes, we would need 34*20 = 680 machines for 2 hours, right? If
> we halve the machine numbers, we can double the time reservation,
> right?
>
> Let me know if you need help with the Falkon settings!
>
> Ioan
>
>
> Veronika Nefedova wrote:
>> Nope, it's quite possible. Last week I couldn't submit a single job
>> for almost a day -- their queue was completely full! The message
>> was something like 'not accepting new jobs in a queue' - or
>> something like that. The cluster is ridiculously busy. I could try
>> to submit a reservation today for, say, 20 molecules...
>>
>> Nika
>>
>> On May 11, 2007, at 9:01 AM, Ian Foster wrote:
>>
>>> It seems unlikely to me that you can't even submit it?
>>>
>>> Sent via BlackBerry from T-Mobile
>>>
>>> -----Original Message-----
>>> From: Veronika Nefedova
>>> Date: Fri, 11 May 2007 08:58:12
>>> To: Ian Foster
>>> Cc: iraicu at cs.uchicago.edu, swift-devel at ci.uchicago.edu
>>> Subject: Re: [Swift-devel] MolDyn at Purdue
>>>
>>> I think we had a problem submitting a big reservation to NCSA -
>>> even smaller ones were in the queue for more than a week at
>>> that time. When we ran a queue-time estimate, it said
>>> something like 'unable to predict' or 'unable to accept'...
>>> Ioan - do you remember what the exact problem was?
>>>
>>>
>>> Nika
>>>
>>>
>>> On May 11, 2007, at 8:31 AM, Ian Foster wrote:
>>> I note that we have stopped running at NCSA and switched to
>>> trying to run at Purdue. A good thing to try, certainly.
>>>
>>> However, could we not have had a big job in the queue at NCSA all
>>> this time, also, using Falkon, which would have run by now?
>>>
>>> Ian.
>>> >>> Ioan Raicu wrote:Great, than we are set, the project is >>> configurable at the Falkon startup! >>> Ioan >>> >> >> > > -- > ============================================ > Ioan Raicu > Ph.D. Student > ============================================ > Distributed Systems Laboratory > Computer Science Department > University of Chicago > 1100 E. 58th Street, Ryerson Hall > Chicago, IL 60637 > ============================================ > Email: iraicu at cs.uchicago.edu > Web: http://www.cs.uchicago.edu/~iraicu > http://dsl.cs.uchicago.edu/ > ============================================ > ============================================ > From foster at mcs.anl.gov Fri May 11 14:04:52 2007 From: foster at mcs.anl.gov (Ian Foster) Date: Fri, 11 May 2007 14:04:52 -0500 Subject: [Swift-devel] MolDyn at Purdue In-Reply-To: References: <1E5A7292-AC04-49BE-B017-6A4D8C28F32F@mcs.anl.gov> <46422CEC.5050807@cs.uchicago.edu> <464358D0.1060402@cs.uchicago.edu> <464365AA.50407@mcs.anl.gov> <464366FD.8060008@cs.uchicago.edu> <4644701F.4040007@mcs.anl.gov> <1819896301-1178892121-cardhu_blackberry.rim.net-1642341055-@bwe017-cell00.bisx.prod.on.blackberry> <46448AA0.5030705@cs.uchicago.edu> Message-ID: <4644BE54.4040202@mcs.anl.gov> that's scary! Just out of interest, how big was it? (cpus, time?) Veronika Nefedova wrote: > Interesting... > Apparently, I did submit the reservation for a big run back on Monday > (I thought it didn't go through at that time). And it is still in the > queue.. > > tg-login1 nefedova/Falkon_v0.8> showq | grep nefedova > 995068 nefedova Idle 286 2:00:00 Mon May 7 > 10:02:28 > 1000628 nefedova Idle 340 4:00:00 Fri May 11 > 10:41:14 > tg-login1 nefedova/Falkon_v0.8> > > > Nika > > On May 11, 2007, at 10:24 AM, Ioan Raicu wrote: > >> Right, so if we want to get roughly the same execution time of 77 >> minutes, we would need 34*20 = 680 machines for 2 hours, right? If >> we halve the machine numbers, we can double the time reservation, right? 
>> >> Let me know if you need help with the Falkon settings! >> >> Ioan >> >> >> Veronika Nefedova wrote: >>> Nope, its quite possible. Last week I couldn't submit a single job >>> for almost a day -- their queue was completely full! The message was >>> something like 'not accepting new jobs in a queue' - or something >>> like that. The cluster is ridiculously busy. I could try to submit >>> today a reservation for , say, 20 molecules... >>> >>> Nika >>> >>> On May 11, 2007, at 9:01 AM, Ian Foster wrote: >>> >>>> It seems unlikely to me that you can't even submit it? >>>> >>>> Sent via BlackBerry from T-Mobile >>>> >>>> -----Original Message----- >>>> From: Veronika Nefedova >>>> Date: Fri, 11 May 2007 08:58:12 >>>> To:Ian Foster >>>> Cc:iraicu at cs.uchicago.edu, swift-devel at ci.uchicago.edu >>>> Subject: Re: [Swift-devel] MolDyn at Purdue >>>> >>>> I think we had a problem submitting a big reservation to NCSA - >>>> even a smaller ones were in the queue for more then a week at that >>>> time. When we did a time estimate on a queue time it said something >>>> like 'unable to predict' or 'unable to accept'... >>>> Ioan - do you remember what was the exact problem? >>>> >>>> >>>> Nika >>>> >>>> >>>> >>>> >>>> On May 11, 2007, at 8:31 AM, Ian Foster wrote: >>>> I note that we have stopped running at NCSA and switched to trying >>>> to run at Purdue. A good thing to try, certainly. >>>> >>>> However, could we not have had a big job in the queue at NCSA all >>>> this time, also, using Falkon, which would have run by now? >>>> >>>> Ian. >>>> >>>> Ioan Raicu wrote:Great, than we are set, the project is >>>> configurable at the Falkon startup! >>>> Ioan >>>> >>> >>> >> >> -- >> ============================================ >> Ioan Raicu >> Ph.D. Student >> ============================================ >> Distributed Systems Laboratory >> Computer Science Department >> University of Chicago >> 1100 E. 
58th Street, Ryerson Hall >> Chicago, IL 60637 >> ============================================ >> Email: iraicu at cs.uchicago.edu >> Web: http://www.cs.uchicago.edu/~iraicu >> http://dsl.cs.uchicago.edu/ >> ============================================ >> ============================================ >> > -- Ian Foster, Director, Computation Institute Argonne National Laboratory & University of Chicago Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439 Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637 Tel: +1 630 252 4619. Web: www.ci.uchicago.edu. Globus Alliance: www.globus.org. From nefedova at mcs.anl.gov Fri May 11 14:11:41 2007 From: nefedova at mcs.anl.gov (Veronika Nefedova) Date: Fri, 11 May 2007 14:11:41 -0500 Subject: [Swift-devel] MolDyn at Purdue In-Reply-To: <4644BE54.4040202@mcs.anl.gov> References: <1E5A7292-AC04-49BE-B017-6A4D8C28F32F@mcs.anl.gov> <46422CEC.5050807@cs.uchicago.edu> <464358D0.1060402@cs.uchicago.edu> <464365AA.50407@mcs.anl.gov> <464366FD.8060008@cs.uchicago.edu> <4644701F.4040007@mcs.anl.gov> <1819896301-1178892121-cardhu_blackberry.rim.net-1642341055-@bwe017-cell00.bisx.prod.on.blackberry> <46448AA0.5030705@cs.uchicago.edu> <4644BE54.4040202@mcs.anl.gov> Message-ID: <8C852C48-7251-4E13-A9A0-E2AFF4B4F8F0@mcs.anl.gov> The requested CPU was 286 and time 2 hours. Still in the queue! On May 11, 2007, at 2:04 PM, Ian Foster wrote: > that's scary! > > Just out of interest, how big was it? (cpus, time?) > > Veronika Nefedova wrote: >> Interesting... >> Apparently, I did submit the reservation for a big run back on >> Monday (I thought it didn't go through at that time). And it is >> still in the queue.. 
>> >> tg-login1 nefedova/Falkon_v0.8> showq | grep nefedova >> 995068 nefedova Idle 286 2:00:00 Mon May >> 7 10:02:28 >> 1000628 nefedova Idle 340 4:00:00 Fri May >> 11 10:41:14 >> tg-login1 nefedova/Falkon_v0.8> >> >> >> Nika >> >> On May 11, 2007, at 10:24 AM, Ioan Raicu wrote: >> >>> Right, so if we want to get roughly the same execution time of 77 >>> minutes, we would need 34*20 = 680 machines for 2 hours, right? >>> If we halve the machine numbers, we can double the time >>> reservation, right? >>> >>> Let me know if you need help with the Falkon settings! >>> >>> Ioan >>> >>> >>> Veronika Nefedova wrote: >>>> Nope, its quite possible. Last week I couldn't submit a single >>>> job for almost a day -- their queue was completely full! The >>>> message was something like 'not accepting new jobs in a queue' - >>>> or something like that. The cluster is ridiculously busy. I >>>> could try to submit today a reservation for , say, 20 molecules... >>>> >>>> Nika >>>> >>>> On May 11, 2007, at 9:01 AM, Ian Foster wrote: >>>> >>>>> It seems unlikely to me that you can't even submit it? >>>>> >>>>> Sent via BlackBerry from T-Mobile >>>>> >>>>> -----Original Message----- >>>>> From: Veronika Nefedova >>>>> Date: Fri, 11 May 2007 08:58:12 >>>>> To:Ian Foster >>>>> Cc:iraicu at cs.uchicago.edu, swift-devel at ci.uchicago.edu >>>>> Subject: Re: [Swift-devel] MolDyn at Purdue >>>>> >>>>> I think we had a problem submitting a big reservation to NCSA - >>>>> even a smaller ones were in the queue for more then a week at >>>>> that time. When we did a time estimate on a queue time it said >>>>> something like 'unable to predict' or 'unable to accept'... >>>>> Ioan - do you remember what was the exact problem? >>>>> >>>>> >>>>> Nika >>>>> >>>>> >>>>> >>>>> >>>>> On May 11, 2007, at 8:31 AM, Ian Foster wrote: >>>>> I note that we have stopped running at NCSA and switched to >>>>> trying to run at Purdue. A good thing to try, certainly. 
>>>>> >>>>> However, could we not have had a big job in the queue at NCSA >>>>> all this time, also, using Falkon, which would have run by now? >>>>> >>>>> Ian. >>>>> >>>>> Ioan Raicu wrote:Great, than we are set, the project is >>>>> configurable at the Falkon startup! >>>>> Ioan >>>>> >>>> >>>> >>> >>> -- >>> ============================================ >>> Ioan Raicu >>> Ph.D. Student >>> ============================================ >>> Distributed Systems Laboratory >>> Computer Science Department >>> University of Chicago >>> 1100 E. 58th Street, Ryerson Hall >>> Chicago, IL 60637 >>> ============================================ >>> Email: iraicu at cs.uchicago.edu >>> Web: http://www.cs.uchicago.edu/~iraicu >>> http://dsl.cs.uchicago.edu/ >>> ============================================ >>> ============================================ >>> >> > > -- > > Ian Foster, Director, Computation Institute > Argonne National Laboratory & University of Chicago > Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439 > Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637 > Tel: +1 630 252 4619. Web: www.ci.uchicago.edu. > Globus Alliance: www.globus.org. > From foster at mcs.anl.gov Fri May 11 14:11:51 2007 From: foster at mcs.anl.gov (Ian Foster) Date: Fri, 11 May 2007 14:11:51 -0500 Subject: [Swift-devel] MolDyn at Purdue In-Reply-To: References: <1E5A7292-AC04-49BE-B017-6A4D8C28F32F@mcs.anl.gov> <46422CEC.5050807@cs.uchicago.edu> <464358D0.1060402@cs.uchicago.edu> <464365AA.50407@mcs.anl.gov> <464366FD.8060008@cs.uchicago.edu> <4644701F.4040007@mcs.anl.gov> <1819896301-1178892121-cardhu_blackberry.rim.net-1642341055-@bwe017-cell00.bisx.prod.on.blackberry> <46448AA0.5030705@cs.uchicago.edu> Message-ID: <4644BFF7.5020106@mcs.anl.gov> One more question: should we be trying the TG-Argonne cluster? Apparently it is fairly idle? Veronika Nefedova wrote: > Interesting... 
> Apparently, I did submit the reservation for a big run back on Monday > (I thought it didn't go through at that time). And it is still in the > queue.. > > tg-login1 nefedova/Falkon_v0.8> showq | grep nefedova > 995068 nefedova Idle 286 2:00:00 Mon May 7 > 10:02:28 > 1000628 nefedova Idle 340 4:00:00 Fri May 11 > 10:41:14 > tg-login1 nefedova/Falkon_v0.8> > > > Nika > > On May 11, 2007, at 10:24 AM, Ioan Raicu wrote: > >> Right, so if we want to get roughly the same execution time of 77 >> minutes, we would need 34*20 = 680 machines for 2 hours, right? If >> we halve the machine numbers, we can double the time reservation, right? >> >> Let me know if you need help with the Falkon settings! >> >> Ioan >> >> >> Veronika Nefedova wrote: >>> Nope, its quite possible. Last week I couldn't submit a single job >>> for almost a day -- their queue was completely full! The message was >>> something like 'not accepting new jobs in a queue' - or something >>> like that. The cluster is ridiculously busy. I could try to submit >>> today a reservation for , say, 20 molecules... >>> >>> Nika >>> >>> On May 11, 2007, at 9:01 AM, Ian Foster wrote: >>> >>>> It seems unlikely to me that you can't even submit it? >>>> >>>> Sent via BlackBerry from T-Mobile >>>> >>>> -----Original Message----- >>>> From: Veronika Nefedova >>>> Date: Fri, 11 May 2007 08:58:12 >>>> To:Ian Foster >>>> Cc:iraicu at cs.uchicago.edu, swift-devel at ci.uchicago.edu >>>> Subject: Re: [Swift-devel] MolDyn at Purdue >>>> >>>> I think we had a problem submitting a big reservation to NCSA - >>>> even a smaller ones were in the queue for more then a week at that >>>> time. When we did a time estimate on a queue time it said something >>>> like 'unable to predict' or 'unable to accept'... >>>> Ioan - do you remember what was the exact problem? 
>>>> >>>> >>>> Nika >>>> >>>> >>>> >>>> >>>> On May 11, 2007, at 8:31 AM, Ian Foster wrote: >>>> I note that we have stopped running at NCSA and switched to trying >>>> to run at Purdue. A good thing to try, certainly. >>>> >>>> However, could we not have had a big job in the queue at NCSA all >>>> this time, also, using Falkon, which would have run by now? >>>> >>>> Ian. >>>> >>>> Ioan Raicu wrote:Great, than we are set, the project is >>>> configurable at the Falkon startup! >>>> Ioan >>>> >>> >>> >> >> -- >> ============================================ >> Ioan Raicu >> Ph.D. Student >> ============================================ >> Distributed Systems Laboratory >> Computer Science Department >> University of Chicago >> 1100 E. 58th Street, Ryerson Hall >> Chicago, IL 60637 >> ============================================ >> Email: iraicu at cs.uchicago.edu >> Web: http://www.cs.uchicago.edu/~iraicu >> http://dsl.cs.uchicago.edu/ >> ============================================ >> ============================================ >> > -- Ian Foster, Director, Computation Institute Argonne National Laboratory & University of Chicago Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439 Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637 Tel: +1 630 252 4619. Web: www.ci.uchicago.edu. Globus Alliance: www.globus.org. 
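The reservation arithmetic quoted in this thread (34*20 = 680 machines for 2 hours, or half the machines for twice the time) can be sketched in a few lines of Python. This is only an illustration: it assumes the work divides evenly across machines, so that total node-hours stay constant, which the thread does not confirm. The function names are hypothetical and are not part of Falkon or of any TeraGrid scheduler.

```python
# Sketch of the reservation trade-off discussed above.
# Assumption (not stated in the thread): the workload is perfectly divisible,
# so machines * hours is a constant budget of node-hours.

def node_hours(machines: int, hours: float) -> float:
    """Total size of a reservation in node-hours."""
    return machines * hours


def hours_needed(machines: int, hours: float, new_machines: int) -> float:
    """Wall-clock hours needed on new_machines to keep the same node-hours."""
    if new_machines <= 0:
        raise ValueError("new_machines must be positive")
    return node_hours(machines, hours) / new_machines


FULL_MACHINES = 34 * 20   # 680, the figure quoted in the thread
FULL_HOURS = 2            # reservation length quoted in the thread

half = FULL_MACHINES // 2
print(FULL_MACHINES, node_hours(FULL_MACHINES, FULL_HOURS))
print(half, hours_needed(FULL_MACHINES, FULL_HOURS, half))
```

Under that assumption, 340 machines for 4 hours gives the same 1360 node-hours as 680 machines for 2 hours, which is consistent with the 340 / 4:00:00 entry in the showq output above, if that column is the node count.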
From nefedova at mcs.anl.gov Fri May 11 14:18:10 2007 From: nefedova at mcs.anl.gov (Veronika Nefedova) Date: Fri, 11 May 2007 14:18:10 -0500 Subject: [Swift-devel] MolDyn at Purdue In-Reply-To: <4644BFF7.5020106@mcs.anl.gov> References: <1E5A7292-AC04-49BE-B017-6A4D8C28F32F@mcs.anl.gov> <46422CEC.5050807@cs.uchicago.edu> <464358D0.1060402@cs.uchicago.edu> <464365AA.50407@mcs.anl.gov> <464366FD.8060008@cs.uchicago.edu> <4644701F.4040007@mcs.anl.gov> <1819896301-1178892121-cardhu_blackberry.rim.net-1642341055-@bwe017-cell00.bisx.prod.on.blackberry> <46448AA0.5030705@cs.uchicago.edu> <4644BFF7.5020106@mcs.anl.gov> Message-ID: <2E404D2F-960D-45AC-BED0-31BBA4657C8C@mcs.anl.gov> I think Benoit's group doesn't have any allocation at TG-ANL (they have a good allocation at Purdue). It takes quite an effort to compile their tools, so I am not sure if Yuqing will be interested in trying TG-ANL... I could try to move apps to TG/ANL on Monday and see if it runs there. Hopefully the Purdue guys will be able to resolve their GT4 GRAM issues by then... Nika On May 11, 2007, at 2:11 PM, Ian Foster wrote: > One more question: should we be trying the TG-Argonne cluster? > Apparently it is fairly idle? > > Veronika Nefedova wrote: >> Interesting... >> Apparently, I did submit the reservation for a big run back on >> Monday (I thought it didn't go through at that time). And it is >> still in the queue.. >> >> tg-login1 nefedova/Falkon_v0.8> showq | grep nefedova >> 995068 nefedova Idle 286 2:00:00 Mon May >> 7 10:02:28 >> 1000628 nefedova Idle 340 4:00:00 Fri May >> 11 10:41:14 >> tg-login1 nefedova/Falkon_v0.8> >> >> >> Nika >> >> On May 11, 2007, at 10:24 AM, Ioan Raicu wrote: >> >>> Right, so if we want to get roughly the same execution time of 77 >>> minutes, we would need 34*20 = 680 machines for 2 hours, right? >>> If we halve the machine numbers, we can double the time >>> reservation, right? >>> >>> Let me know if you need help with the Falkon settings! 
>>> >>> Ioan >>> >>> >>> Veronika Nefedova wrote: >>>> Nope, its quite possible. Last week I couldn't submit a single >>>> job for almost a day -- their queue was completely full! The >>>> message was something like 'not accepting new jobs in a queue' - >>>> or something like that. The cluster is ridiculously busy. I >>>> could try to submit today a reservation for , say, 20 molecules... >>>> >>>> Nika >>>> >>>> On May 11, 2007, at 9:01 AM, Ian Foster wrote: >>>> >>>>> It seems unlikely to me that you can't even submit it? >>>>> >>>>> Sent via BlackBerry from T-Mobile >>>>> >>>>> -----Original Message----- >>>>> From: Veronika Nefedova >>>>> Date: Fri, 11 May 2007 08:58:12 >>>>> To:Ian Foster >>>>> Cc:iraicu at cs.uchicago.edu, swift-devel at ci.uchicago.edu >>>>> Subject: Re: [Swift-devel] MolDyn at Purdue >>>>> >>>>> I think we had a problem submitting a big reservation to NCSA - >>>>> even a smaller ones were in the queue for more then a week at >>>>> that time. When we did a time estimate on a queue time it said >>>>> something like 'unable to predict' or 'unable to accept'... >>>>> Ioan - do you remember what was the exact problem? >>>>> >>>>> >>>>> Nika >>>>> >>>>> >>>>> >>>>> >>>>> On May 11, 2007, at 8:31 AM, Ian Foster wrote: >>>>> I note that we have stopped running at NCSA and switched to >>>>> trying to run at Purdue. A good thing to try, certainly. >>>>> >>>>> However, could we not have had a big job in the queue at NCSA >>>>> all this time, also, using Falkon, which would have run by now? >>>>> >>>>> Ian. >>>>> >>>>> Ioan Raicu wrote:Great, than we are set, the project is >>>>> configurable at the Falkon startup! >>>>> Ioan >>>>> >>>> >>>> >>> >>> -- >>> ============================================ >>> Ioan Raicu >>> Ph.D. Student >>> ============================================ >>> Distributed Systems Laboratory >>> Computer Science Department >>> University of Chicago >>> 1100 E. 
58th Street, Ryerson Hall >>> Chicago, IL 60637 >>> ============================================ >>> Email: iraicu at cs.uchicago.edu >>> Web: http://www.cs.uchicago.edu/~iraicu >>> http://dsl.cs.uchicago.edu/ >>> ============================================ >>> ============================================ >>> >> > > -- > > Ian Foster, Director, Computation Institute > Argonne National Laboratory & University of Chicago > Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439 > Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637 > Tel: +1 630 252 4619. Web: www.ci.uchicago.edu. > Globus Alliance: www.globus.org. > From itf at mcs.anl.gov Fri May 11 14:55:40 2007 From: itf at mcs.anl.gov (Ian Foster) Date: Fri, 11 May 2007 19:55:40 +0000 Subject: [Swift-devel] MolDyn at Purdue In-Reply-To: <2E404D2F-960D-45AC-BED0-31BBA4657C8C@mcs.anl.gov> References: <1E5A7292-AC04-49BE-B017-6A4D8C28F32F@mcs.anl.gov> <46422CEC.5050807@cs.uchicago.edu> <464358D0.1060402@cs.uchicago.edu> <464365AA.50407@mcs.anl.gov> <464366FD.8060008@cs.uchicago.edu> <4644701F.4040007@mcs.anl.gov> <1819896301-1178892121-cardhu_blackberry.rim.net-1642341055-@bwe017-cell00.bisx.prod.on.blackberry> <46448AA0.5030705@cs.uchicago.edu> <4644BFF7.5020106@mcs.anl.gov> <2E404D2F-960D-45AC-BED0-31BBA4657C8C@mcs.anl.gov> Message-ID: <1987384684-1178913374-cardhu_blackberry.rim.net-877925243-@bwe005-cell00.bisx.prod.on.blackberry> Ok. I think I should stop asking questions (-: Sent via BlackBerry from T-Mobile -----Original Message----- From: Veronika Nefedova Date: Fri, 11 May 2007 14:18:10 To:Ian Foster Cc:iraicu at cs.uchicago.edu, itf at mcs.anl.gov, swift-devel at ci.uchicago.edu Subject: Re: [Swift-devel] MolDyn at Purdue I think Benoit's group doesn't have any allocation at TG-ANL (they have a good allocation at Purdue). It takes quite an effort to compile their tools, so I am not sure if Yuqing will be interested in trying TG-ANL... 
I could try to move apps to TG/ANL on Monday and it see if it runs there. Hopefully the Purdue guys will be able to resolve their GT4 GRAM issues by then... Nika On May 11, 2007, at 2:11 PM, Ian Foster wrote: > One more question: should we be trying the TG-Argonne cluster? > Apparently it is fairly idle? > > Veronika Nefedova wrote: >> Interesting... >> Apparently, I did submit the reservation for a big run back on >> Monday (I thought it didn't go through at that time). And it is >> still in the queue.. >> >> tg-login1 nefedova/Falkon_v0.8> showq | grep nefedova >> 995068 nefedova Idle 286 2:00:00 Mon May >> 7 10:02:28 >> 1000628 nefedova Idle 340 4:00:00 Fri May >> 11 10:41:14 >> tg-login1 nefedova/Falkon_v0.8> >> >> >> Nika >> >> On May 11, 2007, at 10:24 AM, Ioan Raicu wrote: >> >>> Right, so if we want to get roughly the same execution time of 77 >>> minutes, we would need 34*20 = 680 machines for 2 hours, right? >>> If we halve the machine numbers, we can double the time >>> reservation, right? >>> >>> Let me know if you need help with the Falkon settings! >>> >>> Ioan >>> >>> >>> Veronika Nefedova wrote: >>>> Nope, its quite possible. Last week I couldn't submit a single >>>> job for almost a day -- their queue was completely full! The >>>> message was something like 'not accepting new jobs in a queue' - >>>> or something like that. The cluster is ridiculously busy. I >>>> could try to submit today a reservation for , say, 20 molecules... >>>> >>>> Nika >>>> >>>> On May 11, 2007, at 9:01 AM, Ian Foster wrote: >>>> >>>>> It seems unlikely to me that you can't even submit it? 
>>>>> >>>>> Sent via BlackBerry from T-Mobile >>>>> >>>>> -----Original Message----- >>>>> From: Veronika Nefedova >>>>> Date: Fri, 11 May 2007 08:58:12 >>>>> To:Ian Foster >>>>> Cc:iraicu at cs.uchicago.edu, swift-devel at ci.uchicago.edu >>>>> Subject: Re: [Swift-devel] MolDyn at Purdue >>>>> >>>>> I think we had a problem submitting a big reservation to NCSA - >>>>> even a smaller ones were in the queue for more then a week at >>>>> that time. When we did a time estimate on a queue time it said >>>>> something like 'unable to predict' or 'unable to accept'... >>>>> Ioan - do you remember what was the exact problem? >>>>> >>>>> >>>>> Nika >>>>> >>>>> >>>>> >>>>> >>>>> On May 11, 2007, at 8:31 AM, Ian Foster wrote: >>>>> I note that we have stopped running at NCSA and switched to >>>>> trying to run at Purdue. A good thing to try, certainly. >>>>> >>>>> However, could we not have had a big job in the queue at NCSA >>>>> all this time, also, using Falkon, which would have run by now? >>>>> >>>>> Ian. >>>>> >>>>> Ioan Raicu wrote:Great, than we are set, the project is >>>>> configurable at the Falkon startup! >>>>> Ioan >>>>> >>>> >>>> >>> >>> -- >>> ============================================ >>> Ioan Raicu >>> Ph.D. Student >>> ============================================ >>> Distributed Systems Laboratory >>> Computer Science Department >>> University of Chicago >>> 1100 E. 58th Street, Ryerson Hall >>> Chicago, IL 60637 >>> ============================================ >>> Email: iraicu at cs.uchicago.edu >>> Web: http://www.cs.uchicago.edu/~iraicu >>> http://dsl.cs.uchicago.edu/ >>> ============================================ >>> ============================================ >>> >> > > -- > > Ian Foster, Director, Computation Institute > Argonne National Laboratory & University of Chicago > Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439 > Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637 > Tel: +1 630 252 4619. Web: www.ci.uchicago.edu. 
> Globus Alliance: www.globus.org. > From benc at hawaga.org.uk Tue May 15 11:20:03 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 15 May 2007 16:20:03 +0000 (GMT) Subject: [Swift-devel] swift-on-ec2 In-Reply-To: <4649D280.5080906@mcs.anl.gov> References: <4649D280.5080906@mcs.anl.gov> Message-ID: Ian asked about this elsewhere, but its perhaps interesting for swift-devel people to look at the questions too. On Tue, 15 May 2007, Ian Foster wrote: > Dear All: > I asked Kate if she and Tim could look into creating VM images that > would allow us to run Swift applications on Amazon EC2. I think Kate is > meeting with Ioan about this on Thursday (?). > One issue that I thought would be good to discuss is what we'd want in > that VM image. Perhaps this is obvious to the rest of you, but it isn't > to me. A few thoughts: > * I'm assuming that we want to run "workers" on EC2 nodes, and have the "task > dispatch" logic run on some external frontend system outside EC2. > * I would think that we want to use Falkon to do the task dispatch. If so, we > need a Falkon executor on each VM, configured to check in with the Falkon > dispatcher. (Alternatively, we could use, say, SGE: in that case, we would > want an SGE agent.) > * We need a way of getting data to and from the worker nodes. Do we want to > run a file system across the EC2 nodes and the external frontend node? That > seems rather inefficient. Other options? > * Should we preload the application code on each EC2 node? Here's a couple of approaches: 1) swift regards all the EC2 nodes that we are paying for as a single site. Something like falkon handles all the task dispatch and worker node management. I don't know what that looks like at the moment in Falkon, but the interface for Swift to send jobs into Falkon sounds pretty straightforward and shouldn't need changing. 
All the nodes in a site are required by our site model to have a shared filesystem - we've talked about removing it, but I think that is still the case and if so, isn't going to change soon. timf probably knows more than the people on this list about making shared filesystems. In this case, falkon would be doing the site selection. 2) swift regards each EC2 node as a separate site. So Swift would be doing site selection between each site (i.e. between each EC2 node), and then submitting to that site. I don't know if the interface between Swift and eg. Falkon allows swift to tell Falkon which remote node to run on. However, Swift would then be able to use something like gridftp to stage to each EC2 node (assuming that EC2 nodes can act as ftp servers - I don't know what their network connectivity is like) - a shared filesystem between all nodes in a site is pretty simple when there is only a single node in the site. Amazon also has a storage cloud, alongside its compute cloud. I know very little about that and have never thought about how it would fit into the above (if at all). Maybe someone else knows more. -- From tfreeman at mcs.anl.gov Tue May 15 15:45:00 2007 From: tfreeman at mcs.anl.gov (Tim Freeman) Date: Tue, 15 May 2007 15:45:00 -0500 Subject: [Swift-devel] swift-on-ec2 In-Reply-To: References: <4649D280.5080906@mcs.anl.gov> Message-ID: <20070515154500.ad1600bf.tfreeman@mcs.anl.gov> On Tue, 15 May 2007 16:20:03 +0000 (GMT) Ben Clifford wrote: > > Ian asked about this elsewhere, but its perhaps interesting for > swift-devel people to look at the questions too. > > On Tue, 15 May 2007, Ian Foster wrote: > > > Dear All: > > > I asked Kate if she and Tim could look into creating VM images that > > would allow us to run Swift applications on Amazon EC2. I think Kate is > > meeting with Ioan about this on Thursday (?). > > > One issue that I thought would be good to discuss is what we'd want in > > that VM image. 
Perhaps this is obvious to the rest of you, but it isn't > > to me. A few thoughts: > > > * I'm assuming that we want to run "workers" on EC2 nodes, and have the > > "task dispatch" logic run on some external frontend system outside EC2. > > > * I would think that we want to use Falkon to do the task dispatch. If so, > > we need a Falkon executor on each VM, configured to check in with the Falkon > > dispatcher. (Alternatively, we could use, say, SGE: in that case, we would > > want an SGE agent.) > > > * We need a way of getting data to and from the worker nodes. Do we want to > > run a file system across the EC2 nodes and the external frontend node? That > > seems rather inefficient. Other options? > > > * Should we preload the application code on each EC2 node? > > Here's a couple of approaches: > > 1) swift regards all the EC2 nodes that we are paying for as a single > site. > > Something like falkon handles all the task dispatch and worker node > management. I don't know what that looks like at the moment in Falkon, but > the interface for Swift to send jobs into Falkon sounds pretty > straightforward and shouldn't need changing. So if I understand, here there would be no gateway+LRM but each EC2 node + Falkon would need a port open to receive tasks? Or does each node pull down instructions OK from behind a firewall? Is there a latency problem with running each node as an independent task receiver with the dispatcher off-site from EC2? I would think it would be better to put the queues to fill with tasks on EC2 so it can more quickly get the task going when a node is done with a previous task (I may be missing some nuances here with respect to Falkon; I don't know much about this yet!). If a gateway node is desired, this option sounds a lot like the GRAM+LRM situation we use on VMs with the workspace service and will soon use on EC2 via the workspace EC2 gateway we're implementing. 
Start up one gateway node and then add compute nodes which dynamically join the pool, they are pointed to the GRAM node. > All the nodes in a site are required by our site model to have a shared > filesystem - we've talked about removing it, but I think that is still the > case and if so, isn't going to change soon. Setting up a shared filesystem in this environment is akin to setting up the compute nodes to join an LRM pool. The VMs can communicate over the private network at EC2, you can instruct EC2 to let all the nodes be open to each other (while simultaneously keeping a separate policy of blocking ports from being open from the internet and other people's EC2 nodes). The non-file-serving nodes would simply need to know the private address of the filesystem server (unless you are using a fancier network file system than NFS-style ones). For background: every VM on EC2 currently gets a public address -- NAT'd to a private address which is actually what the VM's one NIC is configured with. There is a facility to open/forward specific network ports on the public address to each VM either via a group policy or on a VM by VM basis. [...] > Amazon also has a storage cloud, alongside its compute cloud. I know very > little about that and have never thought about how it would fit into the > above (if at all). Maybe someone else knows more. A VM template on EC2 is called an AMI which stands for Amazon Machine Image. This is just a packaging thing but what it mostly means is that the VM is stored on S3 and also registered into the EC2 system. When starting an instance of an AMI, the file is copied from S3 to the hypervisor node (what we call propagation in the workspace service). After it is used, this file is deleted (an option in the workspace service but there is also an option to save it back with any changes). So the VMs are stored in S3 but anything that happens on them after being started is lost unless you manually do something about it. 
As for free scratch space, you get a good amount per node, 140 GB. But the node could go down at any moment just like a physical resource. To harness S3 for safely persisting any data (or if you need more space) you would need to actually run S3 clients on the VMs when they are run on EC2. You could alternatively mirror data between nodes assuming that all would not go down at once. The good thing is that you do not pay transfer costs between S3 and EC2 if you choose to use S3 for big storage; you would only pay the "housing fees" so to speak. Tim From iraicu at cs.uchicago.edu Tue May 15 16:16:14 2007 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Tue, 15 May 2007 16:16:14 -0500 Subject: [Swift-devel] swift-on-ec2 In-Reply-To: References: <4649D280.5080906@mcs.anl.gov> Message-ID: <464A231E.9040708@cs.uchicago.edu> Hi, See below: Ben Clifford wrote: > Ian asked about this elsewhere, but it's perhaps interesting for > swift-devel people to look at the questions too. > > On Tue, 15 May 2007, Ian Foster wrote: > > >> Dear All: >> > > >> I asked Kate if she and Tim could look into creating VM images that >> would allow us to run Swift applications on Amazon EC2. I think Kate is >> meeting with Ioan about this on Thursday (?). >> > > >> One issue that I thought would be good to discuss is what we'd want in >> that VM image. Perhaps this is obvious to the rest of you, but it isn't >> to me. A few thoughts: >> > > >> * I'm assuming that we want to run "workers" on EC2 nodes, and have the "task >> dispatch" logic run on some external frontend system outside EC2. >> > > >> * I would think that we want to use Falkon to do the task dispatch. If so, we >> need a Falkon executor on each VM, configured to check in with the Falkon >> dispatcher. (Alternatively, we could use, say, SGE: in that case, we would >> want an SGE agent.) >> > > >> * We need a way of getting data to and from the worker nodes. 
Do we want to >> run a file system across the EC2 nodes and the external frontend node? That >> seems rather inefficient. Other options? >> > > >> * Should we preload the application code on each EC2 node? >> > > Here's a couple of approaches: > > 1) swift regards all the EC2 nodes that we are paying for as a single > site. > > Something like falkon handles all the task dispatch and worker node > management. I don't know what that looks like at the moment in Falkon, but > the interface for Swift to send jobs into Falkon sounds pretty > straightforward and shouldn't need changing. > > All the nodes in a site are required by our site model to have a shared > filesystem - we've talked about removing it, but I think that is still the > case and if so, isn't going to change soon. timf probably knows more than > the people on this list about making shared filesystems. > If we can get the data caching working in Falkon, we might be able to run Swift over Falkon without a shared file system. This is still work in progress, but we might be closer to achieving this than not. BTW, the data caching would mean that Swift does not stage in any data anymore, but would essentially stand up a GridFTP server from where Falkon workers would get the needed data just when they need it. We are still ironing out all this stuff, but it could potentially do away with the shared file system assumption. > In this case, falkon would be doing the site selection. > > 2) swift regards each EC2 node as a separate site. > > So Swift would be doing site selection between each site (i.e. between > each EC2 node), and then submitting to that site. > > I don't know if the interface between Swift and eg. Falkon allows swift to > tell Falkon which remote node to run on. > No, it does not... but the data caching work has added a data-aware scheduler that allows jobs to be run on nodes that have the data, and if they don't have the data, allows the respective node to get the data. 
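The data-aware scheduling idea described just above (prefer a node that already has the data, otherwise let the chosen node fetch it) can be illustrated with a toy dispatcher. This is a minimal sketch under stated assumptions, not Falkon's actual scheduler or API: the class, the method names, the worker names and the example URLs are all hypothetical, and real cache eviction, failures and transfers are ignored.

```python
from typing import Dict, List, Optional, Set, Tuple


class DataAwareDispatcher:
    """Toy dispatcher: prefer an idle worker that already caches the input.

    Hypothetical illustration only; it does not reflect Falkon's real design.
    """

    def __init__(self, workers: List[str]) -> None:
        # Which input URLs each worker is assumed to hold in its local cache.
        self.cache: Dict[str, Set[str]] = {w: set() for w in workers}
        # Workers currently free to take a task, in arrival order.
        self.idle: List[str] = list(workers)

    def dispatch(self, input_url: str) -> Optional[Tuple[str, bool]]:
        """Return (worker, cache_hit), or None when no worker is idle."""
        if not self.idle:
            return None
        # First choice: an idle worker that already has the data.
        for worker in self.idle:
            if input_url in self.cache[worker]:
                self.idle.remove(worker)
                return worker, True
        # Otherwise take the first idle worker; it fetches and caches the data.
        worker = self.idle.pop(0)
        self.cache[worker].add(input_url)
        return worker, False

    def release(self, worker: str) -> None:
        """Mark a worker as idle again once its task has finished."""
        self.idle.append(worker)


dispatcher = DataAwareDispatcher(["w1", "w2"])
first = dispatcher.dispatch("gsiftp://example.org/data/a")   # w1 fetches a
dispatcher.release("w1")
second = dispatcher.dispatch("gsiftp://example.org/data/a")  # w1 reused, cached
third = dispatcher.dispatch("gsiftp://example.org/data/b")   # w2 fetches b
fourth = dispatcher.dispatch("gsiftp://example.org/data/b")  # nobody idle
print(first, second, third, fourth)
```

In this sketch the second task goes back to w1 even though w2 has been idle longer, because w1 already holds the input; that preference is the whole point of making the scheduler data-aware.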
> However, Swift would then be able to use something like gridftp to stage > to each EC2 node (assuming that EC2 nodes can act as ftp servers - I don't > know what their network connectivity is like) - a shared filesystem > between all nodes in a site is pretty simple when there is only a single > node in the site. > > > Amazon also has a storage cloud, alongside its compute cloud. I know very > little about that and have never thought about how it would fit into the > above (if at all). Maybe someone else knows more. > I think the idea would be to use the Amazon S3 storage service as a common medium from where to get data and where to put it back. Ioan -- ============================================ Ioan Raicu Ph.D. Student ============================================ Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 ============================================ Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dsl.cs.uchicago.edu/ ============================================ ============================================ From iraicu at cs.uchicago.edu Tue May 15 16:22:55 2007 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Tue, 15 May 2007 16:22:55 -0500 Subject: [Swift-devel] swift-on-ec2 In-Reply-To: <20070515154500.ad1600bf.tfreeman@mcs.anl.gov> References: <4649D280.5080906@mcs.anl.gov> <20070515154500.ad1600bf.tfreeman@mcs.anl.gov> Message-ID: <464A24AF.7080801@cs.uchicago.edu> Hi, See below: Tim Freeman wrote: > On Tue, 15 May 2007 16:20:03 +0000 (GMT) > Ben Clifford wrote: > > >> Ian asked about this elsewhere, but it's perhaps interesting for >> swift-devel people to look at the questions too. >> >> On Tue, 15 May 2007, Ian Foster wrote: >> >> >>> Dear All: >>> >> >> >>> I asked Kate if she and Tim could look into creating VM images that >>> would allow us to run Swift applications on Amazon EC2. 
I think Kate is >>> meeting with Ioan about this on Thursday (?). >>> >> >> >>> One issue that I thought would be good to discuss is what we'd want in >>> that VM image. Perhaps this is obvious to the rest of you, but it isn't >>> to me. A few thoughts: >>> >>> * I'm assuming that we want to run "workers" on EC2 nodes, and have the >>> "task dispatch" logic run on some external frontend system outside EC2. >>> >>> * I would think that we want to use Falkon to do the task dispatch. If so, >>> we need a Falkon executor on each VM, configured to check in with the Falkon >>> dispatcher. (Alternatively, we could use, say, SGE: in that case, we would >>> want an SGE agent.) >>> >>> * We need a way of getting data to and from the worker nodes. Do we want to >>> run a file system across the EC2 nodes and the external frontend node? That >>> seems rather inefficient. Other options? >>> >>> * Should we preload the application code on each EC2 node? >>> >> Here's a couple of approaches: >> >> 1) swift regards all the EC2 nodes that we are paying for as a single >> site. >> >> Something like falkon handles all the task dispatch and worker node >> management. I don't know what that looks like at the moment in Falkon, but >> the interface for Swift to send jobs into Falkon sounds pretty >> straightforward and shouldn't need changing. >> > > So if I understand, here there would be no gateway+LRM but each EC2 node + > Falkon would need a port open to receive tasks? Or does each node pull down > instructions OK from behind a firewall? > Falkon supports both polling and notifications. To use notifications, there needs to be an open port on the worker :( > Is there a latency problem with running each node as an indepdent task > receiver with the dispatcher off-site from EC2? 
I would think it would be > better to put the queues to fill with tasks on EC2 so it can more quickly get > the task going when a node is done with a previous task (I may be missing some > nuances here with respect to Falkon, don't know much about this yet!). > We have run the Falkon dispatcher at UChicago and workers at ANL without any issues, so it can easily tolerate a few ms of latency. We haven't tried it across 10s of ms of latency links, but my instinct says that if you have enough workers, you might be able to hide the latency. We'd have to experiment with it to see what happens. We could potentially do some experiments between SDSC and ANL over a 50+ ms link, and see what difference in throughputs we get. Ioan > If a gateway node is desired, this option sounds a lot like the GRAM+LRM > situation we use on VMs with the workspace service and will soon use on EC2 via > the workspace EC2 gateway we're implementing. Start up one gateway node and > then add compute nodes which dynamically join the pool, they are pointed to the > GRAM node. > > >> All the nodes in a site are required by our site model to have a shared >> filesystem - we've talked about removing it, but I think that is still the >> case and if so, isn't going to change soon. >> > > Setting up a shared filesystem in this environment is akin to setting up the > compute nodes to join an LRM pool. The VMs can communicate over the private > network at EC2, you can instruct EC2 to let all the nodes be open to each other > (while simultaneously keeping a separate policy of blocking ports from being > open from the internet and other people's EC2 nodes). The non-file-serving > nodes would simply need to know the private address of the filesystem server > (unless you are using a fancier network file system than NFS-style ones). > > For background: every VM on EC2 currently gets a public address -- NAT'd to a > private address which is actually what the VM's one NIC is configured with. 
> There is a facility to open/forward specific network ports on the public > address to each VM either via a group policy or on a VM by VM basis. > > [...] > >> Amazon also has a storage cloud, alongside its compute cloud. I know very >> little about that and have never thought about how it would fit into the >> above (if at all). Maybe someone else knows more. >> > > A VM template on EC2 is called an AMI which stands for Amazon Machine Image. > This is just a packaging thing but what it mostly means is that the VM is > stored on S3 and also registered into the EC2 system. > > When starting an instance of an AMI, the file is copied from S3 to the > hypervisor node (what we call propagation in the workspace service). After it > is used, this file is deleted (an option in the workspace service but there is > also an option to save it back with any changes). > > So the VMs are stored in S3 but anything that happens on them after being > started is lost unless you manually do something about it. > > As for free scratch space, you get a good amount per node, 140G. But the node > could go down at any moment just like a physical resource. > > To harness S3 for safely persisting any data (or if you need more space) you > would need to actually run S3 clients on the VMs when they are run on EC2. You > could alternatively mirror data between nodes assuming that all would not go > down at once. > > The good thing is that you do not pay transfer costs between S3 and EC2 if you > chose to use S3 for big storage, you would only pay the "housing fees" so to > speak. > > Tim > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > -- ============================================ Ioan Raicu Ph.D. Student ============================================ Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 
58th Street, Ryerson Hall Chicago, IL 60637 ============================================ Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dsl.cs.uchicago.edu/ ============================================ ============================================ From benc at hawaga.org.uk Tue May 15 18:24:03 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 15 May 2007 23:24:03 +0000 (GMT) Subject: [Swift-devel] swift-on-ec2 In-Reply-To: <464A231E.9040708@cs.uchicago.edu> References: <4649D280.5080906@mcs.anl.gov> <464A231E.9040708@cs.uchicago.edu> Message-ID: On Tue, 15 May 2007, Ioan Raicu wrote: > If we can get the data caching working in Falkon, we might be able to > run Swift over Falkon without a shared file system. This is still work > in progress, but we might be closer to achieving this than not. BTW, > the data caching would mean that Swift does not stage in any data > anymore, but would essentially stand up a GridFTP server from where > Falkon workers would get the needed data just when they need it. We are > still ironing out all this stuff, but it could potentially do away with > the shared file system assumption. In the longer term, Swift possibly won't have its input data on the submitting system - for example, if data is mapped from remote gridftp servers, then it should be transferred directly from those ftp servers to the execute side (perhaps to a shared filesystem, perhaps direct to a worker node), and output data should be transferred back fairly directly, rather than going via the submit system. If Falkon is doing its own 'interesting' data movement stuff, then it would probably be a good idea for it to mesh in with what Swift does (e.g. 
Swift provides a list of stage-these-in and stage-these-out URLs or something like that and has various ways of performing that, such as submitting a transfer job, or passing that information on to Falkon) -- From iraicu at cs.uchicago.edu Tue May 15 18:40:15 2007 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Tue, 15 May 2007 18:40:15 -0500 Subject: [Swift-devel] swift-on-ec2 In-Reply-To: References: <4649D280.5080906@mcs.anl.gov> <464A231E.9040708@cs.uchicago.edu> Message-ID: <464A44DF.5030600@cs.uchicago.edu> Ben Clifford wrote: > On Tue, 15 May 2007, Ioan Raicu wrote: > > >> If we can get the data caching working in Falkon, we might be able to >> run Swift over Falkon without a shared file system. This is still work >> in progress, but we might be closer to achieving this than not. BTW, >> the data caching would mean that Swift does not stage in any data >> anymore, but would essentially stand up a GridFTP server from where >> Falkon workers would get the needed data just when they need it. We are >> still ironing out all this stuff, but it could potentially do away with >> the shared file system assumption. >> > > In the longer term, Swift possibly won't have its input data on the > submitting system - for example, if data is mapped from remote gridftp > servers, then it should be transferred directly from those ftp servers to > the execute side (perhaps to a shared filesystem, perhaps direct to a > worker node), and output data should be transferred back fairly directly, > rather than going via the submit system. > Right, from Falkon's point of view, this would not be any different than having the GridFTP server at the submit host. > If Falkon is doing its own 'interesting' data movement stuff, then it > would probably be a good idea for it to mesh in with what Swift does (eg. 
swift > provides a list of stage-these-in and stage-these-out URLs or something > like that and has various ways of performing that, such as submitting a > transfer job, or passing that information onto falkon) > The idea is to do just this! Get Swift to pass in its normal URLs of input and output data, and then have Falkon do its own data management using those URLs! The idea is to not change anything fundamental in Swift, but ensure that enough information is passed to Falkon so it can operate properly, and do its own data management! Ioan -- ============================================ Ioan Raicu Ph.D. Student ============================================ Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 ============================================ Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dsl.cs.uchicago.edu/ ============================================ ============================================ From keahey at mcs.anl.gov Tue May 15 23:28:07 2007 From: keahey at mcs.anl.gov (Kate Keahey) Date: Tue, 15 May 2007 23:28:07 -0500 Subject: [Swift-devel] swift-on-ec2 In-Reply-To: <464A24AF.7080801@cs.uchicago.edu> References: <4649D280.5080906@mcs.anl.gov> <20070515154500.ad1600bf.tfreeman@mcs.anl.gov> <464A24AF.7080801@cs.uchicago.edu> Message-ID: <464A8857.90800@mcs.anl.gov> First -- this is a very useful discussion, would it be possible to see all of it. We need to understand the requirements and trade-offs in some detail to figure out the best way to make this work. I see a few different interaction threads somewhat mixed up here though so just to make sure we are all on the same wavelength, here is some context. Ian and I have been talking on and off about providing a workspace service implementation with EC2 backend.
The benefit for that would be that users could deploy the same VMs using the same interface to either TeraPort or EC2 or yet another resource provider. The workspace service would also provide some features on top of EC2 (translating between PKI credentials and Amazon's paying accounts, contextualization as needed to make deployment dynamic). One application of interest for this was Swift. Last time we chatted about this though was in the context of using EC2 to provide a production platform for STAR runs (since virtualizing enough TeraPort to provide a production platform is taking a long time). This is what Tim and I are trying to make happen now. Since there was also interest in running Swift in VMs, Mike, Tibi and I met around February/March and agreed that a reasonable way to proceed will be for us to stand up a base virtual cluster somewhere locally (e.g., a static deployment on TeraPort) so that they can finish the configuration according to their needs, look at performance, figure out the best way to interact with it, and make sure that there are no VM-induced gotchas. All of this will be much easier to assess locally and on a static deployment. Then we'd make sure the cluster is dynamically deployable using the workspace service (on EC2 or whatever other provider). During the meeting (and over following emails) we agreed that the required "base cluster" would be configured with GRAM/Torque on the headnode plus a number of worker nodes, plus root privileges. We configured this cluster and it is ready to deploy. Are you saying now that in fact something different is needed? As Ian says, Borja and I were planning to meet with Ioan on Thursday to discuss interaction between Falkon and the workspace service (not necessarily/exclusively in the EC2 context). I don't completely understand the relationship between swift and falkon -- are there specific applications or scenarios that you are trying to target in this exercise? 
Ioan Raicu wrote: > Hi, > See below: > > Tim Freeman wrote: >> On Tue, 15 May 2007 16:20:03 +0000 (GMT) >> Ben Clifford wrote: >> >> >>> Ian asked about this elsewhere, but its perhaps interesting for >>> swift-devel people to look at the questions too. >>> >>> On Tue, 15 May 2007, Ian Foster wrote: >>> >>> >>>> Dear All: >>>> >>> >>> >>>> I asked Kate if she and Tim could look into creating VM images that >>>> would allow us to run Swift applications on Amazon EC2. I think Kate >>>> is meeting with Ioan about this on Thursday (?). >>>> >>> >>> >>>> One issue that I thought would be good to discuss is what we'd want >>>> in that VM image. Perhaps this is obvious to the rest of you, but it >>>> isn't to me. A few thoughts: >>>> * I'm assuming that we want to run "workers" on EC2 nodes, and >>>> have the >>>> "task dispatch" logic run on some external frontend system outside EC2. >>>> * I would think that we want to use Falkon to do the task >>>> dispatch. If so, >>>> we need a Falkon executor on each VM, configured to check in with >>>> the Falkon >>>> dispatcher. (Alternatively, we could use, say, SGE: in that case, we >>>> would >>>> want an SGE agent.) >>>> * We need a way of getting data to and from the worker nodes. >>>> Do we want to >>>> run a file system across the EC2 nodes and the external frontend >>>> node? That >>>> seems rather inefficient. Other options? >>>> * Should we preload the application code on each EC2 node? >>>> >>> Here's a couple of approaches: >>> >>> 1) swift regards all the EC2 nodes that we are paying for as a >>> single site. >>> >>> Something like falkon handles all the task dispatch and worker node >>> management. I don't know what that looks like at the moment in >>> Falkon, but the interface for Swift to send jobs into Falkon sounds >>> pretty straightforward and shouldn't need changing. >>> >> >> So if I understand, here there would be no gateway+LRM but each EC2 >> node + >> Falkon would need a port open to receive tasks? 
Or does each node >> pull down >> instructions OK from behind a firewall? >> > Falkon supports both polling and notifications. To use notifications, > there needs to be an open port on the worker :( >> Is there a latency problem with running each node as an indepdent task >> receiver with the dispatcher off-site from EC2? I would think it >> would be >> better to put the queues to fill with tasks on EC2 so it can more >> quickly get >> the task going when a node is done with a previous task (I may be >> missing some >> nuances here with respect to Falkon, don't know much about this yet!). > We have run the Falkon dispatcher at UChicago and workers at ANL without > any issues, so it can easily tolerate a few ms of latency. We haven't > tried it across 10s of ms of latency links, but my instinct says that if > you have enough workers, you might be able to hide the latency. We'd > have to experiment with it to see what happens. We could potentially do > some experiments between SDSC and ANL over a 50+ ms link, and see what > difference in throughputs we get. > > Ioan >> If a gateway node is desired, this option sounds a lot like the GRAM+LRM >> situation we use on VMs with the workspace service and will soon use >> on EC2 via >> the workspace EC2 gateway we're implementing. Start up one gateway >> node and >> then add compute nodes which dynamically join the pool, they are >> pointed to the >> GRAM node. >> >> >>> All the nodes in a site are required by our site model to have a >>> shared filesystem - we've talked about removing it, but I think that >>> is still the case and if so, isn't going to change soon. >> >> Setting up a shared filesystem in this environment is akin to setting >> up the >> compute nodes to join an LRM pool. 
The VMs can communicate over the >> private >> network at EC2, you can instruct EC2 to let all the nodes be open to >> each other >> (while simultaneously keeping a separate policy of blocking ports from >> being >> open from the internet and other people's EC2 nodes). The >> non-file-serving >> nodes would simply need to know the private address of the filesystem >> server >> (unless you are using a fancier network file system than NFS-style ones). >> For background: every VM on EC2 currently gets a public address -- >> NAT'd to a >> private address which is actually what the VM's one NIC is configured >> with. >> There is a facility to open/forward specific network ports on the public >> address to each VM either via a group policy or on a VM by VM basis. >> >> [...] >>> Amazon also has a storage cloud, alongside its compute cloud. I know >>> very little about that and have never thought about how it would fit >>> into the above (if at all). Maybe someone else knows more. >>> >> >> A VM template on EC2 is called an AMI which stands for Amazon Machine >> Image. >> This is just a packaging thing but what it mostly means is that the VM is >> stored on S3 and also registered into the EC2 system. >> >> When starting an instance of an AMI, the file is copied from S3 to the >> hypervisor node (what we call propagation in the workspace service). >> After it >> is used, this file is deleted (an option in the workspace service but >> there is >> also an option to save it back with any changes). >> So the VMs are stored in S3 but anything that happens on them after being >> started is lost unless you manually do something about it. >> >> As for free scratch space, you get a good amount per node, 140G. But >> the node >> could go down at any moment just like a physical resource. >> >> To harness S3 for safely persisting any data (or if you need more >> space) you >> would need to actually run S3 clients on the VMs when they are run on >> EC2. 
You >> could alternatively mirror data between nodes assuming that all would >> not go >> down at once. >> The good thing is that you do not pay transfer costs between S3 and >> EC2 if you >> chose to use S3 for big storage, you would only pay the "housing fees" >> so to >> speak. >> Tim >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> >> > -- Kate Keahey, Mathematics & CS Division, Argonne National Laboratory Computation Institute, University of Chicago From itf at mcs.anl.gov Wed May 16 02:44:59 2007 From: itf at mcs.anl.gov (Ian Foster) Date: Wed, 16 May 2007 07:44:59 +0000 Subject: [Swift-devel] swift-on-ec2 In-Reply-To: <464A8857.90800@mcs.anl.gov> References: <4649D280.5080906@mcs.anl.gov> <20070515154500.ad1600bf.tfreeman@mcs.anl.gov><464A24AF.7080801@cs.uchicago.edu> <464A8857.90800@mcs.anl.gov> Message-ID: <356127187-1179301594-cardhu_blackberry.rim.net-179336256-@bwe047-cell00.bisx.prod.on.blackberry> Kate: If we configure the virtual cluster with a full LRM, as you propose (and it seems have already done--great work!), then we can use this to start Falkon executors--as we do today on regular clusters. So it seems to me that we have all we need. How about you and Ioan spend your time on Thursday running something on EC2, to make sure it sorks? Regarding choice of LRM: have you looked at SGE? That is what quite a few others seem to be using. Ian Sent via BlackBerry from T-Mobile
_______________________________________________ Swift-devel mailing list Swift-devel at ci.uchicago.edu http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From benc at hawaga.org.uk Wed May 16 03:52:02 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 16 May 2007 08:52:02 +0000 (GMT) Subject: [Swift-devel] swift-on-ec2 In-Reply-To: <464A8857.90800@mcs.anl.gov> References: <4649D280.5080906@mcs.anl.gov> <20070515154500.ad1600bf.tfreeman@mcs.anl.gov> <464A24AF.7080801@cs.uchicago.edu> <464A8857.90800@mcs.anl.gov> Message-ID: On Tue, 15 May 2007, Kate Keahey wrote: > As Ian says, Borja and I were planning to meet with Ioan on Thursday to > discuss interaction between Falkon and the workspace service (not > necessarily/exclusively in the EC2 context). I don't completely > understand the relationship between swift and falkon -- are there > specific applications or scenarios that you are trying to target in this > exercise?
By virtue of the fact that they come from pretty much the same group of people, they're somewhat fuzzily related. Swift generates (over the duration of its execution, rather than in one batch) a bunch of jobs that need executing (as well as various other things, like file transfers). As it generates them, it sends them off to be executed. The ways officially 'supported' by Swift are executing them on the local machine and sending them off through GRAM; however, people can plug in whatever they want to do submissions.

I know less about Falkon because it isn't Swift, but the Falkon side of things is pretty much about running a bunch of jobs. It plugs into the above-mentioned place in Swift, so that Swift gives Falkon jobs to run and Falkon runs them (with a goal of Falkon being, presumably, to run them much more efficiently than if they were submitted straight through GRAM; it seems to do pretty well).

There are two things going on with Swift. One is about making it straightforward to use at the low end of things, so that people can start using it easily; for the most part, that isn't interesting in itself. The other is about getting it to perform well at the high end of things, which is where the fun research is. Using Falkon and using EC2 are both on that side of things.
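The plug-in arrangement described above can be sketched roughly as follows. This is only an illustration in Python, not Swift's actual code: the names `Job`, `Provider`, `LocalProvider`, `RecordingProvider`, `generate_jobs` and `run_workflow` are all hypothetical, and the "remote" provider here merely records what it is handed instead of talking to GRAM or Falkon.

```python
import subprocess
import sys
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class Job:
    """One application invocation produced by the workflow engine (hypothetical)."""
    argv: List[str]


class Provider(ABC):
    """Hypothetical submission back end: local execution, GRAM, Falkon, ..."""

    @abstractmethod
    def submit(self, job: Job) -> int:
        """Execute the job somewhere and return its exit code."""


class LocalProvider(Provider):
    """Runs each job as a child process on the submitting machine."""

    def submit(self, job: Job) -> int:
        return subprocess.run(job.argv, check=False).returncode


class RecordingProvider(Provider):
    """Stand-in for a remote dispatcher; it only records the jobs it is given."""

    def __init__(self) -> None:
        self.received: List[Job] = []

    def submit(self, job: Job) -> int:
        self.received.append(job)
        return 0


def generate_jobs(count: int) -> Iterator[Job]:
    """Yield jobs one at a time, as a workflow would over its execution."""
    for i in range(count):
        yield Job([sys.executable, "-c", f"print({i})"])


def run_workflow(jobs: Iterable[Job], provider: Provider) -> List[int]:
    """Hand each job to the chosen provider as soon as it is generated."""
    return [provider.submit(job) for job in jobs]
```

The workflow side never needs to know which provider it was given, which is why swapping a GRAM-style back end for a Falkon-style one would not, in this simplified picture, require changing how jobs are generated.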
-- From benc at hawaga.org.uk Wed May 16 04:04:11 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 16 May 2007 09:04:11 +0000 (GMT) Subject: [Swift-devel] swift-on-ec2 In-Reply-To: <356127187-1179301594-cardhu_blackberry.rim.net-179336256-@bwe047-cell00.bisx.prod.on.blackberry> References: <4649D280.5080906@mcs.anl.gov> <20070515154500.ad1600bf.tfreeman@mcs.anl.gov><464A24AF.7080801@cs.uchicago.edu> <464A8857.90800@mcs.anl.gov> <356127187-1179301594-cardhu_blackberry.rim.net-179336256-@bwe047-cell00.bisx.prod.on.blackberry> Message-ID: On Wed, 16 May 2007, Ian Foster wrote: > If we configure the virtual cluster with a full LRM, as you propose (and > it seems have already done--great work!), then we can use this to start > Falkon executors--as we do today on regular clusters. So it seems to me > that we have all we need. How about you and Ioan spend your time on > Thursday running something on EC2, to make sure it sorks? > Regarding choice of LRM: have you looked at SGE? That is what quite a > few others seem to be using. That's probably a bunch of most unnecessary extra weight (== trouble) if the images are specifically intended for use as swift+falkon. But useful to have round if people want to do other things too. -- From hategan at mcs.anl.gov Wed May 16 04:07:01 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 16 May 2007 12:07:01 +0300 Subject: [Swift-devel] swift-on-ec2 In-Reply-To: References: <4649D280.5080906@mcs.anl.gov> <464A231E.9040708@cs.uchicago.edu> Message-ID: <1179306421.2402.12.camel@blabla.mcs.anl.gov> I think we're moving towards a scenario in which Falkon does increasingly more things that it wasn't supposed to do. That includes scheduling and data management (which is a tricky business if we look at the necessity for throttling, error handling and other management issues). I'm not sure if this is a good idea from an engineering standpoint.
Mihael On Tue, 2007-05-15 at 23:24 +0000, Ben Clifford wrote: > On Tue, 15 May 2007, Ioan Raicu wrote: > > > If we can get the data caching working in Falkon, we might be able to > > run Swift over Falkon without a shared file system. This is still work > > in progress, but we might be closer to achieving this that not. BTW, > > the data caching would mean that Swift does not stage in any data > > anymore, but wold essentially stand up a GridFTP server from where > > Falkon workers would get the needed data just when they need it. We are > > still ironing out all this stuff, but it could potentially do away with > > the shared file sytem assumption. > > In the longer term, Swift possibly won't have its input data on the > submitting system - for example, if data is mapped from remote gridftp > servers, then it should be transferred directly from those ftp servers to > the execute side (perhaps to a shared filesystem, perhaps direct to a > worker node), and output data should be transferred back fairly directly, > rather than going via the submit system. > > If Falkon is doing its own 'interesting' data movement stuff, then it > would probably be a good idea for it to mesh in with what Swift (eg. 
swift > provides a list of stage-these-in and stage-these-out URLs or something > like that and has various ways of performing that, such as submitting a > transfer job, or passing that information onto falkon) > From keahey at mcs.anl.gov Wed May 16 09:24:02 2007 From: keahey at mcs.anl.gov (Kate Keahey) Date: Wed, 16 May 2007 09:24:02 -0500 Subject: [Swift-devel] swift-on-ec2 In-Reply-To: <356127187-1179301594-cardhu_blackberry.rim.net-179336256-@bwe047-cell00.bisx.prod.on.blackberry> References: <4649D280.5080906@mcs.anl.gov> <20070515154500.ad1600bf.tfreeman@mcs.anl.gov><464A24AF.7080801@cs.uchicago.edu> <464A8857.90800@mcs.anl.gov> <356127187-1179301594-cardhu_blackberry.rim.net-179336256-@bwe047-cell00.bisx.prod.on.blackberry> Message-ID: <464B1402.9040405@mcs.anl.gov> Ian Foster wrote: > Kate: > > If we configure the virtual cluster with a full LRM, as you propose (and it seems have already done--great work!), then we can use this to start Falkon executors--as we do today on regular clusters. So it seems to me that we have all we need. How about you and Ioan spend your time on Thursday running something on EC2, to make sure it sorks? As I suggest below, I think it would be easiest if we could deploy and debug a small static cluster locally first, and we can probably give it a shot tomorrow. We still don't have access to the Xen nodes on TeraPort (although hopefully that might change by tomorrow) but I asked Rick to rebuild a couple of nodes at ANL and he did, I think for a test that should give us enough resources to play with. At the same time -- if there are multiple ways of doing this, and perhaps better ways than simply using a virtual cluster, we should discuss them now. It is not completely clear to me what the relationship between Falkon and Swift is, and what the specific objectives are (other than that dynamically provisioning resources is required). 
It looks at this point like the objectives probably overlap with what Ioan, Borja and I wanted to talk about (which I thought was a separate project, but am thrilled to find out is related) so how about we come up with a design tomorrow and post the notes on this list (is this a good venue for that?) and then others can shoot them down. > Regarding choice of LRM: have you looked at SGE? That is what quite a few others seem to be using. Yes, we have. We also collaborate with others who do, as well as with Sun... As you may remember, Borja did the scheduling work for his thesis in the context of SGE. Last time we talked though, Torque was the scheduler of choice for the virtual cluster LRM so we used that. The usage of SGE you are referring to above -- is this in the context of virtualization projects, or as LRM for various Falkon-related applications? > > Ian > > > > Sent via BlackBerry from T-Mobile > > -----Original Message----- > From: Kate Keahey > Date: Tue, 15 May 2007 23:28:07 > To:iraicu at cs.uchicago.edu > Cc:swift-devel at ci.uchicago.edu > Subject: Re: [Swift-devel] swift-on-ec2 > > First -- this is a very useful discussion, would it be possible to see > all of it. We need to understand the requirements and trade-offs in some > detail to figure out the best way to make this work. I see a few > different interaction threads somewhat mixed up here though so just to > make sure we are all on the same wavelength, here is some context. > > Ian and I have been talking on and off about providing a workspace > service implementation with EC2 backend. The benefit for that would be > that users could deploy the same VMs using the same interface to either > TeraPort or EC2 or yet another resource provider. The workspace service > would also provide some features on top of EC2 (translating between PKI > credentials and Amazon's paying accounts, contextualization as needed to > make deployment dynamic). One application of interest for this was > Swift. 
Last time we chatted about this though was in the context of > using EC2 to provide a production platform for STAR runs (since > virtualizing enough TeraPort to provide a production platform is taking > a long time). This is what Tim and I are trying to make happen now. > > Since there was also interest in running Swift in VMs, Mike, Tibi and I > met around February/March and agreed that a reasonable way to proceed > will be for us to stand up a base virtual cluster somewhere locally > (e.g., a static deployment on TeraPort) so that they can finish the > configuration according to their needs, look at performance, figure out > the best way to interact with it, and make sure that there are no > VM-induced gotchas. All of this will be much easier to assess locally > and on a static deployment. Then we'd make sure the cluster is > dynamically deployable using the workspace service (on EC2 or whatever > other provider). During the meeting (and over following emails) we > agreed that the required "base cluster" would be configured with > GRAM/Torque on the headnode plus a number of worker nodes, plus root > privileges. We configured this cluster and it is ready to deploy. Are > you saying now that in fact something different is needed? > > As Ian says, Borja and I were planning to meet with Ioan on Thursday to > discuss interaction between Falkon and the workspace service (not > necessarily/exclusively in the EC2 context). I don't completely > understand the relationship between swift and falkon -- are there > specific applications or scenarios that you are trying to target in this > exercise? > > Ioan Raicu wrote: >> Hi, >> See below: >> >> Tim Freeman wrote: >>> On Tue, 15 May 2007 16:20:03 +0000 (GMT) >>> Ben Clifford wrote: >>> >>> >>>> Ian asked about this elsewhere, but its perhaps interesting for >>>> swift-devel people to look at the questions too. 
>>>> >>>> On Tue, 15 May 2007, Ian Foster wrote: >>>> >>>> >>>>> Dear All: >>>>> >>>> >>>> >>>>> I asked Kate if she and Tim could look into creating VM images that >>>>> would allow us to run Swift applications on Amazon EC2. I think Kate >>>>> is meeting with Ioan about this on Thursday (?). >>>>> >>>> >>>> >>>>> One issue that I thought would be good to discuss is what we'd want >>>>> in that VM image. Perhaps this is obvious to the rest of you, but it >>>>> isn't to me. A few thoughts: >>>>> * I'm assuming that we want to run "workers" on EC2 nodes, and >>>>> have the >>>>> "task dispatch" logic run on some external frontend system outside EC2. >>>>> * I would think that we want to use Falkon to do the task >>>>> dispatch. If so, >>>>> we need a Falkon executor on each VM, configured to check in with >>>>> the Falkon >>>>> dispatcher. (Alternatively, we could use, say, SGE: in that case, we >>>>> would >>>>> want an SGE agent.) >>>>> * We need a way of getting data to and from the worker nodes. >>>>> Do we want to >>>>> run a file system across the EC2 nodes and the external frontend >>>>> node? That >>>>> seems rather inefficient. Other options? >>>>> * Should we preload the application code on each EC2 node? >>>>> >>>> Here's a couple of approaches: >>>> >>>> 1) swift regards all the EC2 nodes that we are paying for as a >>>> single site. >>>> >>>> Something like falkon handles all the task dispatch and worker node >>>> management. I don't know what that looks like at the moment in >>>> Falkon, but the interface for Swift to send jobs into Falkon sounds >>>> pretty straightforward and shouldn't need changing. >>>> >>> So if I understand, here there would be no gateway+LRM but each EC2 >>> node + >>> Falkon would need a port open to receive tasks? Or does each node >>> pull down >>> instructions OK from behind a firewall? >>> >> Falkon supports both polling and notifications. 
To use notifications, >> there needs to be an open port on the worker :( >>> Is there a latency problem with running each node as an indepdent task >>> receiver with the dispatcher off-site from EC2? I would think it >>> would be >>> better to put the queues to fill with tasks on EC2 so it can more >>> quickly get >>> the task going when a node is done with a previous task (I may be >>> missing some >>> nuances here with respect to Falkon, don't know much about this yet!). >> We have run the Falkon dispatcher at UChicago and workers at ANL without >> any issues, so it can easily tolerate a few ms of latency. We haven't >> tried it across 10s of ms of latency links, but my instinct says that if >> you have enough workers, you might be able to hide the latency. We'd >> have to experiment with it to see what happens. We could potentially do >> some experiments between SDSC and ANL over a 50+ ms link, and see what >> difference in throughputs we get. >> >> Ioan >>> If a gateway node is desired, this option sounds a lot like the GRAM+LRM >>> situation we use on VMs with the workspace service and will soon use >>> on EC2 via >>> the workspace EC2 gateway we're implementing. Start up one gateway >>> node and >>> then add compute nodes which dynamically join the pool, they are >>> pointed to the >>> GRAM node. >>> >>> >>>> All the nodes in a site are required by our site model to have a >>>> shared filesystem - we've talked about removing it, but I think that >>>> is still the case and if so, isn't going to change soon. >>> Setting up a shared filesystem in this environment is akin to setting >>> up the >>> compute nodes to join an LRM pool. The VMs can communicate over the >>> private >>> network at EC2, you can instruct EC2 to let all the nodes be open to >>> each other >>> (while simultaneously keeping a separate policy of blocking ports from >>> being >>> open from the internet and other people's EC2 nodes). 
The >>> non-file-serving >>> nodes would simply need to know the private address of the filesystem >>> server >>> (unless you are using a fancier network file system than NFS-style ones). >>> For background: every VM on EC2 currently gets a public address -- >>> NAT'd to a >>> private address which is actually what the VM's one NIC is configured >>> with. >>> There is a facility to open/forward specific network ports on the public >>> address to each VM either via a group policy or on a VM by VM basis. >>> >>> [...] >>>> Amazon also has a storage cloud, alongside its compute cloud. I know >>>> very little about that and have never thought about how it would fit >>>> into the above (if at all). Maybe someone else knows more. >>>> >>> A VM template on EC2 is called an AMI which stands for Amazon Machine >>> Image. >>> This is just a packaging thing but what it mostly means is that the VM is >>> stored on S3 and also registered into the EC2 system. >>> >>> When starting an instance of an AMI, the file is copied from S3 to the >>> hypervisor node (what we call propagation in the workspace service). >>> After it >>> is used, this file is deleted (an option in the workspace service but >>> there is >>> also an option to save it back with any changes). >>> So the VMs are stored in S3 but anything that happens on them after being >>> started is lost unless you manually do something about it. >>> >>> As for free scratch space, you get a good amount per node, 140G. But >>> the node >>> could go down at any moment just like a physical resource. >>> >>> To harness S3 for safely persisting any data (or if you need more >>> space) you >>> would need to actually run S3 clients on the VMs when they are run on >>> EC2. You >>> could alternatively mirror data between nodes assuming that all would >>> not go >>> down at once. 
>>> The good thing is that you do not pay transfer costs between S3 and >>> EC2 if you >>> chose to use S3 for big storage, you would only pay the "housing fees" >>> so to >>> speak. >>> Tim >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>> >>> > -- Kate Keahey, Mathematics & CS Division, Argonne National Laboratory Computation Institute, University of Chicago From keahey at mcs.anl.gov Wed May 16 09:37:52 2007 From: keahey at mcs.anl.gov (Kate Keahey) Date: Wed, 16 May 2007 09:37:52 -0500 Subject: [Swift-devel] swift-on-ec2 In-Reply-To: References: <4649D280.5080906@mcs.anl.gov> <20070515154500.ad1600bf.tfreeman@mcs.anl.gov> <464A24AF.7080801@cs.uchicago.edu> <464A8857.90800@mcs.anl.gov> Message-ID: <464B1740.3060808@mcs.anl.gov> Thanks Ben, this helps a lot! So it seems to me like we are talking about combining dynamic provisioning with lightweight job management which should be pluggable into swift. Ben Clifford wrote: > On Tue, 15 May 2007, Kate Keahey wrote: > >> As Ian says, Borja and I were planning to meet with Ioan on Thursday to >> discuss interaction between Falkon and the workspace service (not >> necessarily/exclusively in the EC2 context). I don't completely >> understand the relationship between swift and falkon -- are there >> specific applications or scenarios that you are trying to target in this >> exercise? > > By virtue of the fact that they come from pretty much the same group of > people, they're somewhat fuzzily related - but pretty much swift is > generating (over the duration of its execution, rather than in one batch) > a bunch of jobs that need executing (as well, as various things like file > transfers). As it generates them, it sends them off to be executed. 
The > official ways that are 'supported' by Swift are by executing them on the > local machine and by sending them off through GRAM; however, people can > plug in whatever they want to do submissions. > > I know less about Falkon because it isn't Swift, but the Falkon side of > things is pretty much about running a bunch of jobs - it plugs into the > abovementioned place in Swift so that Swift gives Falkon jobs to run, and > Falkon runs them (with a goal of Falkon being, presumably, to run it much > more efficiently than if they were submitted straight through GRAM - it > seems to do pretty well). > > There's two things going on with swift - one is about making it > straightforward to use at the low end of things, so that people can start > using it easily - for the most part, that isn't interesting in itself; the > other is about getting it to perform well at the high end of things, which > is where the fun research is. Using Falkon and using EC2 are both on that > side of things. > -- Kate Keahey, Mathematics & CS Division, Argonne National Laboratory Computation Institute, University of Chicago From benc at hawaga.org.uk Wed May 16 09:55:45 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 16 May 2007 14:55:45 +0000 (GMT) Subject: [Swift-devel] 0.2 release Message-ID: It's been a while since there was a non-SVN based release. Not all the features on the milestone list for 0.2 have been done; however, what is in SVN now has a bunch of extra stuff that wasn't in 0.1 and that would be good to make available to casual downloaders. So I'm planning on putting out whatever is at the head of SVN some time in the middle of next week as 0.2 (in the same fairly lightweight process as happened for 0.1) and moving the remaining 0.2 milestones to be 0.3 milestones. If you're using SVN, it would be good to start updating at least daily until that time.
Separately, I'll send a note about the remaining milestones so people can discuss how much they still want them in relation to other features.

--

From benc at hawaga.org.uk Wed May 16 10:26:10 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 16 May 2007 15:26:10 +0000 (GMT) Subject: [Swift-devel] mappers on files that are inputs and outputs Message-ID:

Here's a code fragment:

type volume {
    imagefile img;
    headerfile hdr;
};

volume atlas ;
atlas = softmean(slices);

string directions[] = [ "x", "y", "z"];

foreach direction in directions {
    giffile outputgif ;
    string option = @strcat("-",direction);
    outputgif = slice_to_gif(atlas, option, ".5");
}

When this is run as part of a workflow, there are no atlas.* files and the atlas = softmean(slices) line causes atlas.hdr and atlas.img files to be created and placed in my working directory, and also used in the subsequent slice_to_gif calls.

If I prune the program in a text editor so that the atlas = ... line is no longer executed, and leave the atlas.hdr and atlas.img files in place in my current directory (so that the files are now input files, rather than intermediate files), I get this error:

$ swift -debug -tc.file tc.data play.swift
WARN - Failed to configure log file name

Swift v0.1-dev

RunID: mx49u8a36d1m0
Execution failed:
java.lang.RuntimeException: Data set initialization failed for
true. Missing required field: img mapped to atlas

I think it's probably desirable for a mapping that works for intermediate files to work for input files too.

--

From hategan at mcs.anl.gov Wed May 16 10:29:02 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 16 May 2007 18:29:02 +0300 Subject: [Swift-devel] mappers on files that are inputs and outputs In-Reply-To: References: Message-ID: <1179329342.4368.0.camel@blabla.mcs.anl.gov>

You should probably also add the input=true mapping parameter?
Mihael

On Wed, 2007-05-16 at 15:26 +0000, Ben Clifford wrote:
> Here's a code fragment:
>
> type volume {
>     imagefile img;
>     headerfile hdr;
> };
>
> volume atlas ;
> atlas = softmean(slices);
>
> string directions[] = [ "x", "y", "z"];
>
> foreach direction in directions {
>     giffile outputgif
>     ;
>     string option = @strcat("-",direction);
>     outputgif = slice_to_gif(atlas, option, ".5");
> }
>
> When this is run as part of a workflow, there are no atlas.* files and the
> atlas = softmean(slices) line causes atlas.hdr and atlas.img files to be
> created and placed in my working directory, and also used in the
> subsequent slice_to_gif calls.
>
> If I prune the program in a text editor so that the atlas = ... line is
> not called, and leave the atlas.hdr and atlas.img files in place in my
> current directory (so that the files are now input files, rather than
> intermediate files), I get this error:
>
> $ swift -debug -tc.file tc.data play.swift
> WARN - Failed to configure log file name
>
> Swift v0.1-dev
>
> RunID: mx49u8a36d1m0
> Execution failed:
> java.lang.RuntimeException: Data set initialization failed for
> true. Missing required field: img mapped to atlas
>
>
> I think it's probably a desirable feature that the same mapping that maps
> ok for intermediate files to map for input files too.
>

From benc at hawaga.org.uk Wed May 16 10:37:36 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 16 May 2007 15:37:36 +0000 (GMT) Subject: [Swift-devel] mappers on files that are inputs and outputs In-Reply-To: <1179329342.4368.0.camel@blabla.mcs.anl.gov> References: <1179329342.4368.0.camel@blabla.mcs.anl.gov> Message-ID:

On Wed, 16 May 2007, Mihael Hategan wrote:

> You should probably also add the input=true mapping parameter?

shouldn't really need that in the language though.
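
For anyone following the mapping question, here is a rough sketch in Python (not SwiftScript, and not Swift's actual mapper code) of the behaviour being asked for: a struct-typed variable such as atlas maps to one file per field (atlas.img, atlas.hdr), and whether the variable is an input only changes whether those files must already exist. The name map_struct and its parameters are made up for illustration.

```python
import os
import tempfile


def map_struct(prefix, fields, directory, is_input):
    """Map each struct field to '<prefix>.<field>' inside directory (hypothetical helper)."""
    mapping = {field: os.path.join(directory, f"{prefix}.{field}") for field in fields}
    if is_input:
        # Inputs must already exist; outputs are created later by the workflow.
        for field, path in mapping.items():
            if not os.path.exists(path):
                raise RuntimeError(f"Missing required field: {field} mapped to {prefix}")
    return mapping


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        # First run: atlas is an intermediate, so no files need to exist yet.
        as_output = map_struct("atlas", ["img", "hdr"], workdir, is_input=False)
        for path in as_output.values():
            open(path, "w").close()  # stand-in for softmean writing the files
        # Pruned run: the same names now resolve as inputs.
        as_input = map_struct("atlas", ["img", "hdr"], workdir, is_input=True)
        print(as_input == as_output)
```

In this sketch the same prefix and field list give the same file names in both modes, which is the property wanted above; only the existence check differs.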
-- From hategan at mcs.anl.gov Wed May 16 10:43:15 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 16 May 2007 18:43:15 +0300 Subject: [Swift-devel] mappers on files that are inputs and outputs In-Reply-To: References: <1179329342.4368.0.camel@blabla.mcs.anl.gov> Message-ID: <1179330195.4473.0.camel@blabla.mcs.anl.gov> The translator does that bit. You hacked the translated file, but incompletely. Mihael On Wed, 2007-05-16 at 15:37 +0000, Ben Clifford wrote: > On Wed, 16 May 2007, Mihael Hategan wrote: > > > You should probably also add the input=true mapping parameter? > > shouldn't really need that in the language though. > From iraicu at cs.uchicago.edu Wed May 16 11:55:18 2007 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Wed, 16 May 2007 11:55:18 -0500 Subject: [Swift-devel] swift-on-ec2 In-Reply-To: References: <4649D280.5080906@mcs.anl.gov> <20070515154500.ad1600bf.tfreeman@mcs.anl.gov> <464A24AF.7080801@cs.uchicago.edu> <464A8857.90800@mcs.anl.gov> Message-ID: <464B3776.2010700@cs.uchicago.edu> Hi, I am just catching up with emails from last night... Ben Clifford wrote: > On Tue, 15 May 2007, Kate Keahey wrote: > > >> As Ian says, Borja and I were planning to meet with Ioan on Thursday to >> discuss interaction between Falkon and the workspace service (not >> necessarily/exclusively in the EC2 context). I don't completely >> understand the relationship between swift and falkon -- are there >> specific applications or scenarios that you are trying to target in this >> exercise? >> > > By virtue of the fact that they come from pretty much the same group of > people, they're somewhat fuzzily related - but pretty much swift is > generating (over the duration of its execution, rather than in one batch) > a bunch of jobs that need executing (as well, as various things like file > transfers). As it generates them, it sends them off to be executed. 
The > official ways that are 'supported' by Swift are by executing them on the > local machine and by sending them off through GRAM; however, people can > plug in whatever they want to do submissions. > > I know less about Falkon because it isn't Swift, but the Falkon side of > things is pretty much about running a bunch of jobs - it plugs into the > abovementioned place in Swift so that Swift gives Falkon jobs to run, and > Falkon runs them (with a goal of Falkon being, presumably, to run it much > more efficiently than if they were submitted straight through GRAM - it > seems to do pretty well). > We intentionally made Falkon's interface and semantics as similar as possible to those of GRAM, so applications that normally used GRAM could easily change to Falkon. > There's two things going on with swift - one is about making it > straightforward to use at the low end of things, so that people can start > using it easily - for the most part, that isn't interesting in itself; the > other is about getting it to perform well at the high end of things, which > is where the fun research is. Using Falkon and using EC2 are both on that > side of things. > Right! Falkon is certainly about getting more performance from the same hardware. EC2, on the other hand, is more about a new paradigm of how resources are acquired. In the batch-scheduled world, the demand for resources is usually higher than the supply. In EC2, it's likely that the supply of resources is higher than the demand. That means that with EC2 you could probably always get more resources right away if you were willing to pay for them... this could have implications for the resource allocation and management policies that govern when it makes sense to get more resources and when not to. Using EC2 might be about performance, but the really interesting part that I see emerging is a new model that deviates from the traditional batch-scheduled systems the Grid community has grown accustomed to.
Ioan -- ============================================ Ioan Raicu Ph.D. Student ============================================ Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 ============================================ Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dsl.cs.uchicago.edu/ ============================================ ============================================ -------------- next part -------------- An HTML attachment was scrubbed... URL: From tfreeman at mcs.anl.gov Wed May 16 12:03:04 2007 From: tfreeman at mcs.anl.gov (Tim Freeman) Date: Wed, 16 May 2007 12:03:04 -0500 Subject: [Swift-devel] swift-on-ec2 In-Reply-To: <464B3776.2010700@cs.uchicago.edu> References: <4649D280.5080906@mcs.anl.gov> <20070515154500.ad1600bf.tfreeman@mcs.anl.gov> <464A24AF.7080801@cs.uchicago.edu> <464A8857.90800@mcs.anl.gov> <464B3776.2010700@cs.uchicago.edu> Message-ID: <20070516120304.19151d46.tfreeman@mcs.anl.gov> On Wed, 16 May 2007 11:55:18 -0500 Ioan Raicu wrote: > Hi, > I am just catching up with emails from last night... > > Ben Clifford wrote: > > On Tue, 15 May 2007, Kate Keahey wrote: > > > > > >> As Ian says, Borja and I were planning to meet with Ioan on Thursday to > >> discuss interaction between Falkon and the workspace service (not > >> necessarily/exclusively in the EC2 context). I don't completely > >> understand the relationship between swift and falkon -- are there > >> specific applications or scenarios that you are trying to target in this > >> exercise? > >> > > > > By virtue of the fact that they come from pretty much the same group of > > people, they're somewhat fuzzily related - but pretty much swift is > > generating (over the duration of its execution, rather than in one batch) > > a bunch of jobs that need executing (as well, as various things like file > > transfers). As it generates them, it sends them off to be executed. 
The > > official ways that are 'supported' by Swift are by executing them on the > > local machine and by sending them off through GRAM; however, people can > > plug in whatever they want to do submissions. > > > > I know less about Falkon because it isn't Swift, but the Falkon side of > > things is pretty much about running a bunch of jobs - it plugs into the > > abovementioned place in Swift so that Swift gives Falkon jobs to run, and > > Falkon runs them (with a goal of Falkon being, presumably, to run it much > > more efficiently than if they were submitted straight through GRAM - it > > seems to do pretty well). > > > We intentionally made Falkon's interface and semantics as similar as > possible to that of GRAM, so applications that normally used GRAM could > easily change to Falkon. > > There's two things going on with swift - one is about making it > > straightforward to use at the low end of things, so that people can start > > using it easily - for the most part, that isn't interesting in itself; the > > other is about getting it to perform well at the high end of things, which > > is where the fun research is. Using Falkon and using EC2 are both on that > > side of things. > > > Right! > > Falkon is certainly about getting more performance from the same hardware. > > EC2 on the other hand is more about a new paradigm of how resources are > acquired. In the batch-scheduled world, the demand for resources is > usually higher than the supply. In EC2, its likely that the supply for > resources is higher than the demand. With that said, it means that with > EC2, it is likely that you could always get more resources now if you > were willing to pay for them That's not entirely true at this particular point in time: http://www.pcworld.com/article/id,130832-c,webservices/article.html "We hate being capacity-constrained," Bezos said. "It's not the right way to run a business. 
We are trying to get ourselves in a position with EC2 where we will be demand-constrained instead of capacity-constrained." > ... this could have implications on the > resource allocation and management policies that govern when it makes > sense to get more resources and when not to. Right now, for example, we're programming a little feature into the workspace-EC2 gateway that limits the amount of money an entity can spend :-) Tim > Using EC2 might be about > performance, but the really interesting part that I see emerging is a > new model that deviates from the traditional batch-scheduled systems the > Grid community has grown accustomed to. > > Ioan From iraicu at cs.uchicago.edu Wed May 16 12:04:22 2007 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Wed, 16 May 2007 12:04:22 -0500 Subject: [Swift-devel] swift-on-ec2 In-Reply-To: References: <4649D280.5080906@mcs.anl.gov> <20070515154500.ad1600bf.tfreeman@mcs.anl.gov><464A24AF.7080801@cs.uchicago.edu> <464A8857.90800@mcs.anl.gov> <356127187-1179301594-cardhu_blackberry.rim.net-179336256-@bwe047-cell00.bisx.prod.on.blackberry> Message-ID: <464B3996.9030305@cs.uchicago.edu> If an LRM can be configured by the virtual workspace service as part of the VMs, then it's even easier for Falkon to work! I don't see an LRM as strictly necessary, though: when the VM starts up, we could easily bootstrap the Falkon executors to start and keep running indefinitely (or at least while the VM is running). We could host the dispatcher off-site, or even on another EC2 VM... it all depends on how much the latency seems to affect Falkon performance. I assume that EC2 VMs have a public IP space that is not behind some site-wide firewall, right? If not, then this could be a problem.
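
How much the latency matters can be estimated with a back-of-envelope model before running any experiment. The Python sketch below assumes each worker sits idle for one dispatcher round trip between tasks and that the dispatcher itself is never the bottleneck; both are simplifying assumptions rather than measured Falkon behaviour, and the function names are hypothetical.

```python
import math


def throughput(workers, task_seconds, rtt_seconds):
    """Tasks per second if every worker pays one round trip per task (assumed model)."""
    return workers / (task_seconds + rtt_seconds)


def workers_needed(target_rate, task_seconds, rtt_seconds):
    """Smallest worker count that reaches target_rate under the same model."""
    return math.ceil(target_rate * (task_seconds + rtt_seconds))


if __name__ == "__main__":
    # Compare a few-ms link with a 50 ms link for 1-second tasks.
    for rtt in (0.002, 0.050):
        rate = throughput(workers=100, task_seconds=1.0, rtt_seconds=rtt)
        print(f"rtt={rtt * 1000:.0f} ms: {rate:.1f} tasks/s with 100 workers")
```

Under this model the loss from a slower link shrinks as tasks get longer, and it can be made up by adding workers, which is the "enough workers hide the latency" intuition; short tasks over a slow link are where it would hurt most.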
Ioan Ben Clifford wrote: > On Wed, 16 May 2007, Ian Foster wrote: > > >> If we configure the virtual cluster with a full LRM, as you propose (and >> it seems have already done--great work!), then we can use this to start >> Falkon executors--as we do today on regular clusters. So it seems to me >> that we have all we need. How about you and Ioan spend your time on >> Thursday running something on EC2, to make sure it works? >> > > >> Regarding choice of LRM: have you looked at SGE? That is what quite a >> few others seem to be using. >> > > That's probably a bunch of mostly unnecessary extra weight (== trouble) if > the images are specifically intended for use as swift+falkon. But useful > to have round if people want to do other things too. > > -- ============================================ Ioan Raicu Ph.D. Student ============================================ Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 ============================================ Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dsl.cs.uchicago.edu/ ============================================ From iraicu at cs.uchicago.edu Wed May 16 12:09:54 2007 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Wed, 16 May 2007 12:09:54 -0500 Subject: [Swift-devel] swift-on-ec2 In-Reply-To: <1179306421.2402.12.camel@blabla.mcs.anl.gov> References: <4649D280.5080906@mcs.anl.gov> <464A231E.9040708@cs.uchicago.edu> <1179306421.2402.12.camel@blabla.mcs.anl.gov> Message-ID: <464B3AE2.6020300@cs.uchicago.edu> One of the two main motivations for Falkon was data management. We saw early on that we needed to couple compute and data resource management, and that is what we are doing as we push forward with Falkon.
Falkon should be usable by other applications that don't have all the smarts of Swift and simply want to run jobs efficiently with the data management abstracted away. The main idea is that Swift's data management will likely still be needed (at a site level), but Falkon can push that further to the physical node level. Swift and Falkon will likely evolve independently, but if we work together, we can ensure that they can inter-operate, as they do today! Ioan Mihael Hategan wrote: > I think we're moving towards a scenario in which Falkon does > increasingly more things that it wasn't supposed to do. That includes > scheduling and data management (which is a tricky business if we look > at the necessity for throttling, error handling and other management > issues). > I'm not sure if this is a good idea from an engineering standpoint. > > Mihael > > On Tue, 2007-05-15 at 23:24 +0000, Ben Clifford wrote: > >> On Tue, 15 May 2007, Ioan Raicu wrote: >> >> >>> If we can get the data caching working in Falkon, we might be able to >>> run Swift over Falkon without a shared file system. This is still work >>> in progress, but we might be closer to achieving this than not. BTW, >>> the data caching would mean that Swift does not stage in any data >>> anymore, but would essentially stand up a GridFTP server from where >>> Falkon workers would get the needed data just when they need it. We are >>> still ironing out all this stuff, but it could potentially do away with >>> the shared file system assumption. >>> >> In the longer term, Swift possibly won't have its input data on the >> submitting system - for example, if data is mapped from remote gridftp >> servers, then it should be transferred directly from those ftp servers to >> the execute side (perhaps to a shared filesystem, perhaps direct to a >> worker node), and output data should be transferred back fairly directly, >> rather than going via the submit system.
>> >> If Falkon is doing its own 'interesting' data movement stuff, then it >> would probably be a good idea for it to mesh in with what Swift (eg. swift >> provides a list of stage-these-in and stage-these-out URLs or something >> like that and has various ways of performing that, such as submitting a >> transfer job, or passing that information onto falkon) >> >> > > > -- ============================================ Ioan Raicu Ph.D. Student ============================================ Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 ============================================ Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dsl.cs.uchicago.edu/ ============================================ ============================================ -------------- next part -------------- An HTML attachment was scrubbed... URL: From iraicu at cs.uchicago.edu Wed May 16 12:15:07 2007 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Wed, 16 May 2007 12:15:07 -0500 Subject: [Swift-devel] swift-on-ec2 In-Reply-To: <464B1402.9040405@mcs.anl.gov> References: <4649D280.5080906@mcs.anl.gov> <20070515154500.ad1600bf.tfreeman@mcs.anl.gov><464A24AF.7080801@cs.uchicago.edu> <464A8857.90800@mcs.anl.gov> <356127187-1179301594-cardhu_blackberry.rim.net-179336256-@bwe047-cell00.bisx.prod.on.blackberry> <464B1402.9040405@mcs.anl.gov> Message-ID: <464B3C1B.2040506@cs.uchicago.edu> Kate Keahey wrote: > > > Ian Foster wrote: >> Kate: >> >> If we configure the virtual cluster with a full LRM, as you propose >> (and it seems have already done--great work!), then we can use this >> to start Falkon executors--as we do today on regular clusters. So it >> seems to me that we have all we need. How about you and Ioan spend >> your time on Thursday running something on EC2, to make sure it sorks? 
> > As I suggest below, I think it would be easiest if we could deploy and > debug a small static cluster locally first, and we can probably give > it a shot tomorrow. We still don't have access to the Xen nodes on > TeraPort (although hopefully that might change by tomorrow) but I > asked Rick to rebuild a couple of nodes at ANL and he did, I think for > a test that should give us enough resources to play with. If someone (Kate, Borja, Ian, anyone) has an account on EC2 and S3 that we can use for a demo run tomorrow, I think it would be very beneficial! Do we have images created that would run on EC2? Can we easily modify them to include the necessary software or, failing that, upload the software needed by Falkon (JVM, Falkon executor, some GT4 libraries) once they have started up? > > At the same time -- if there are multiple ways of doing this, and > perhaps better ways than simply using a virtual cluster, we should > discuss them now. It is not completely clear to me what the > relationship between Falkon and Swift is, and what the specific > objectives are (other than that dynamically provisioning resources is > required). It looks at this point like the objectives probably overlap > with what Ioan, Borja and I wanted to talk about (which I thought was > a separate project, but am thrilled to find out is related) so how > about we come up with a design tomorrow and post the notes on this > list (is this a good venue for that?) and then others can shoot them > down. > >> Regarding choice of LRM: have you looked at SGE? That is what quite a >> few others seem to be using. > > Yes, we have. We also collaborate with others who do, as well as with > Sun... As you may remember, Borja did the scheduling work for his > thesis in the context of SGE. Last time we talked though, Torque was > the scheduler of choice for the virtual cluster LRM so we used that.
> > The usage of SGE you are referring to above -- is this in the context > of virtualization projects, or as LRM for various Falkon-related > applications? Falkon relies on LRMs to get resource allocations, and bootstrap. We have not interfaced with any specific LRMs, but use GRAM to abstract this away. > >> >> Ian >> >> >> >> Sent via BlackBerry from T-Mobile >> -----Original Message----- >> From: Kate Keahey >> Date: Tue, 15 May 2007 23:28:07 To:iraicu at cs.uchicago.edu >> Cc:swift-devel at ci.uchicago.edu >> Subject: Re: [Swift-devel] swift-on-ec2 >> >> First -- this is a very useful discussion, would it be possible to >> see all of it. We need to understand the requirements and trade-offs >> in some detail to figure out the best way to make this work. I see a >> few different interaction threads somewhat mixed up here though so >> just to make sure we are all on the same wavelength, here is some >> context. >> >> Ian and I have been talking on and off about providing a workspace >> service implementation with EC2 backend. The benefit for that would >> be that users could deploy the same VMs using the same interface to >> either TeraPort or EC2 or yet another resource provider. The >> workspace service would also provide some features on top of EC2 >> (translating between PKI credentials and Amazon's paying accounts, >> contextualization as needed to make deployment dynamic). One >> application of interest for this was Swift. Last time we chatted >> about this though was in the context of using EC2 to provide a >> production platform for STAR runs (since virtualizing enough TeraPort >> to provide a production platform is taking a long time). This is what >> Tim and I are trying to make happen now. 
>> >> Since there was also interest in running Swift in VMs, Mike, Tibi and >> I met around February/March and agreed that a reasonable way to >> proceed will be for us to stand up a base virtual cluster somewhere >> locally (e.g., a static deployment on TeraPort) so that they can >> finish the configuration according to their needs, look at >> performance, figure out the best way to interact with it, and make >> sure that there are no VM-induced gotchas. All of this will be much >> easier to assess locally and on a static deployment. Then we'd make >> sure the cluster is dynamically deployable using the workspace >> service (on EC2 or whatever other provider). During the meeting (and >> over following emails) we agreed that the required "base cluster" >> would be configured with GRAM/Torque on the headnode plus a number of >> worker nodes, plus root privileges. We configured this cluster and it >> is ready to deploy. Are you saying now that in fact something >> different is needed? >> >> As Ian says, Borja and I were planning to meet with Ioan on Thursday >> to discuss interaction between Falkon and the workspace service (not >> necessarily/exclusively in the EC2 context). I don't completely >> understand the relationship between swift and falkon -- are there >> specific applications or scenarios that you are trying to target in >> this exercise? >> >> Ioan Raicu wrote: >>> Hi, >>> See below: >>> >>> Tim Freeman wrote: >>>> On Tue, 15 May 2007 16:20:03 +0000 (GMT) >>>> Ben Clifford wrote: >>>> >>>> >>>>> Ian asked about this elsewhere, but its perhaps interesting for >>>>> swift-devel people to look at the questions too. >>>>> >>>>> On Tue, 15 May 2007, Ian Foster wrote: >>>>> >>>>> >>>>>> Dear All: >>>>>> >>>>> >>>>> >>>>>> I asked Kate if she and Tim could look into creating VM images >>>>>> that would allow us to run Swift applications on Amazon EC2. I >>>>>> think Kate is meeting with Ioan about this on Thursday (?). 
>>>>>> >>>>> >>>>> >>>>>> One issue that I thought would be good to discuss is what we'd >>>>>> want in that VM image. Perhaps this is obvious to the rest of >>>>>> you, but it isn't to me. A few thoughts: >>>>>> * I'm assuming that we want to run "workers" on EC2 nodes, >>>>>> and have the >>>>>> "task dispatch" logic run on some external frontend system >>>>>> outside EC2. >>>>>> * I would think that we want to use Falkon to do the task >>>>>> dispatch. If so, >>>>>> we need a Falkon executor on each VM, configured to check in with >>>>>> the Falkon >>>>>> dispatcher. (Alternatively, we could use, say, SGE: in that case, >>>>>> we would >>>>>> want an SGE agent.) >>>>>> * We need a way of getting data to and from the worker >>>>>> nodes. Do we want to >>>>>> run a file system across the EC2 nodes and the external frontend >>>>>> node? That >>>>>> seems rather inefficient. Other options? >>>>>> * Should we preload the application code on each EC2 node? >>>>>> >>>>> Here's a couple of approaches: >>>>> >>>>> 1) swift regards all the EC2 nodes that we are paying for as a >>>>> single site. >>>>> >>>>> Something like falkon handles all the task dispatch and worker >>>>> node management. I don't know what that looks like at the moment >>>>> in Falkon, but the interface for Swift to send jobs into Falkon >>>>> sounds pretty straightforward and shouldn't need changing. >>>>> >>>> So if I understand, here there would be no gateway+LRM but each EC2 >>>> node + >>>> Falkon would need a port open to receive tasks? Or does each node >>>> pull down >>>> instructions OK from behind a firewall? >>>> >>> Falkon supports both polling and notifications. To use >>> notifications, there needs to be an open port on the worker :( >>>> Is there a latency problem with running each node as an indepdent task >>>> receiver with the dispatcher off-site from EC2? 
I would think it >>>> would be >>>> better to put the queues to fill with tasks on EC2 so it can more >>>> quickly get >>>> the task going when a node is done with a previous task (I may be >>>> missing some >>>> nuances here with respect to Falkon, don't know much about this >>>> yet!). >>> We have run the Falkon dispatcher at UChicago and workers at ANL >>> without any issues, so it can easily tolerate a few ms of latency. >>> We haven't tried it across 10s of ms of latency links, but my >>> instinct says that if you have enough workers, you might be able to >>> hide the latency. We'd have to experiment with it to see what >>> happens. We could potentially do some experiments between SDSC and >>> ANL over a 50+ ms link, and see what difference in throughputs we get. >>> >>> Ioan >>>> If a gateway node is desired, this option sounds a lot like the >>>> GRAM+LRM >>>> situation we use on VMs with the workspace service and will soon >>>> use on EC2 via >>>> the workspace EC2 gateway we're implementing. Start up one gateway >>>> node and >>>> then add compute nodes which dynamically join the pool, they are >>>> pointed to the >>>> GRAM node. >>>> >>>> >>>>> All the nodes in a site are required by our site model to have a >>>>> shared filesystem - we've talked about removing it, but I think >>>>> that is still the case and if so, isn't going to change soon. >>>> Setting up a shared filesystem in this environment is akin to >>>> setting up the >>>> compute nodes to join an LRM pool. The VMs can communicate over >>>> the private >>>> network at EC2, you can instruct EC2 to let all the nodes be open >>>> to each other >>>> (while simultaneously keeping a separate policy of blocking ports >>>> from being >>>> open from the internet and other people's EC2 nodes). The >>>> non-file-serving >>>> nodes would simply need to know the private address of the >>>> filesystem server >>>> (unless you are using a fancier network file system than NFS-style >>>> ones). 
>>>> For background: every VM on EC2 currently gets a public address -- >>>> NAT'd to a >>>> private address which is actually what the VM's one NIC is >>>> configured with. >>>> There is a facility to open/forward specific network ports on the >>>> public >>>> address to each VM either via a group policy or on a VM by VM basis. >>>> >>>> [...] >>>>> Amazon also has a storage cloud, alongside its compute cloud. I >>>>> know very little about that and have never thought about how it >>>>> would fit into the above (if at all). Maybe someone else knows more. >>>>> >>>> A VM template on EC2 is called an AMI which stands for Amazon >>>> Machine Image. >>>> This is just a packaging thing but what it mostly means is that the >>>> VM is >>>> stored on S3 and also registered into the EC2 system. >>>> >>>> When starting an instance of an AMI, the file is copied from S3 to the >>>> hypervisor node (what we call propagation in the workspace >>>> service). After it >>>> is used, this file is deleted (an option in the workspace service >>>> but there is >>>> also an option to save it back with any changes). So the VMs are >>>> stored in S3 but anything that happens on them after being >>>> started is lost unless you manually do something about it. >>>> >>>> As for free scratch space, you get a good amount per node, 140G. >>>> But the node >>>> could go down at any moment just like a physical resource. >>>> >>>> To harness S3 for safely persisting any data (or if you need more >>>> space) you >>>> would need to actually run S3 clients on the VMs when they are run >>>> on EC2. You >>>> could alternatively mirror data between nodes assuming that all >>>> would not go >>>> down at once. >>>> The good thing is that you do not pay transfer costs between S3 and >>>> EC2 if you >>>> chose to use S3 for big storage, you would only pay the "housing >>>> fees" so to >>>> speak. 
>>>> Tim >>>> _______________________________________________ >>>> Swift-devel mailing list >>>> Swift-devel at ci.uchicago.edu >>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>>> >>>> >> > -- ============================================ Ioan Raicu Ph.D. Student ============================================ Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 ============================================ Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dsl.cs.uchicago.edu/ ============================================ ============================================ From keahey at mcs.anl.gov Wed May 16 12:19:23 2007 From: keahey at mcs.anl.gov (Kate Keahey) Date: Wed, 16 May 2007 12:19:23 -0500 Subject: [Swift-devel] swift-on-ec2 In-Reply-To: <464B3776.2010700@cs.uchicago.edu> References: <4649D280.5080906@mcs.anl.gov> <20070515154500.ad1600bf.tfreeman@mcs.anl.gov> <464A24AF.7080801@cs.uchicago.edu> <464A8857.90800@mcs.anl.gov> <464B3776.2010700@cs.uchicago.edu> Message-ID: <464B3D1B.7080908@mcs.anl.gov> I agree that we are talking of a new model that allows a better separation between provisioning resources and task management -- the interesting aspect of this is that what we are talking about is combining coarse-grained provisioning with very light-weight task management. In terms of "always" being able to get resources if you pay for them -- there really are no miracles though. EC2 will run out of resources eventually just like any other provider does; payment is just a different way of managing policies. There is interesting work out of the HP quartermaster project though that predicts resource demand and the tycoon work of course shows how if people need resources they could always just bid higher. And then -- although we deviate from the traditional batch-scheduling, I don't think it will go away anytime soon ;-).
The interesting challenge (what Borja is working on) is how to combine those two models for Grid communities. Ioan Raicu wrote: > > Falkon is certainly about getting more performance from the same hardware. > > EC2 on the other hand is more about a new paradigm of how resources are > acquired. In the batch-scheduled world, the demand for resources is > usually higher than the supply. In EC2, its likely that the supply for > resources is higher than the demand. With that said, it means that with > EC2, it is likely that you could always get more resources now if you > were willing to pay for them... this could have implications on the > resource allocation and management policies that govern when it makes > sense to get more resources and when not to. Using EC2 might be about > performance, but the really interesting part that I see emerging is a > new model that deviates from the traditional batch-scheduled systems the > Grid community has grown accustomed to. > > Ioan > > -- > ============================================ > Ioan Raicu > Ph.D. Student > ============================================ > Distributed Systems Laboratory > Computer Science Department > University of Chicago > 1100 E. 
58th Street, Ryerson Hall > Chicago, IL 60637 > ============================================ > Email: iraicu at cs.uchicago.edu > Web: http://www.cs.uchicago.edu/~iraicu > http://dsl.cs.uchicago.edu/ > ============================================ > ============================================ > -- Kate Keahey, Mathematics & CS Division, Argonne National Laboratory Computation Institute, University of Chicago From keahey at mcs.anl.gov Wed May 16 12:20:43 2007 From: keahey at mcs.anl.gov (Kate Keahey) Date: Wed, 16 May 2007 12:20:43 -0500 Subject: [Swift-devel] swift-on-ec2 In-Reply-To: <20070516120304.19151d46.tfreeman@mcs.anl.gov> References: <4649D280.5080906@mcs.anl.gov> <20070515154500.ad1600bf.tfreeman@mcs.anl.gov> <464A24AF.7080801@cs.uchicago.edu> <464A8857.90800@mcs.anl.gov> <464B3776.2010700@cs.uchicago.edu> <20070516120304.19151d46.tfreeman@mcs.anl.gov> Message-ID: <464B3D6B.5090801@mcs.anl.gov> Ah, yes, the next thing they will allow people to bid... ;-). Tim Freeman wrote: > > That's not entirely true at this particular point in time: > > http://www.pcworld.com/article/id,130832-c,webservices/article.html > > "We hate being capacity-constrained," Bezos said. "It's not the right way to > run a business. We are trying to get ourselves in a position with EC2 where we > will be demand-constrained instead of capacity-constrained." > > >> ... this could have implications on the >> resource allocation and management policies that govern when it makes >> sense to get more resources and when not to. 
> > Right now for example, we're programming a little feature into the workspace-EC2 > gateway that limits the amount of money an entity can spend :-) > > Tim -- Kate Keahey, Mathematics & CS Division, Argonne National Laboratory Computation Institute, University of Chicago From iraicu at cs.uchicago.edu Wed May 16 12:29:43 2007 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Wed, 16 May 2007 12:29:43 -0500 Subject: [Swift-devel] swift-on-ec2 In-Reply-To: <464B1740.3060808@mcs.anl.gov> References: <4649D280.5080906@mcs.anl.gov> <20070515154500.ad1600bf.tfreeman@mcs.anl.gov> <464A24AF.7080801@cs.uchicago.edu> <464A8857.90800@mcs.anl.gov> <464B1740.3060808@mcs.anl.gov> Message-ID: <464B3F87.9090708@cs.uchicago.edu> Well, the dynamic provisioning assumes that Falkon is acquiring resources when it needs them. This implies that it knows how to talk to the EC2 service, and it knows how to bootstrap a VM that has the necessary Falkon software stack. I was actually hoping (at least in the short term) that static resource provisioning could be handled by the workspace service, talking to the EC2 service and bootstrapping the VM (with the necessary Falkon stack), and then, once the Falkon executors register with the Falkon dispatcher, Falkon handles the lightweight job management (in place of a traditional LRM). The provisioning to EC2 could be pushed onto Falkon in the future, but it is not currently on my immediate to-do list. Ioan Kate Keahey wrote: > Thanks Ben, this helps a lot! So it seems to me like we are talking > about combining dynamic provisioning with lightweight job management > which should be pluggable into swift. > > Ben Clifford wrote: >> On Tue, 15 May 2007, Kate Keahey wrote: >> >>> As Ian says, Borja and I were planning to meet with Ioan on Thursday >>> to discuss interaction between Falkon and the workspace service (not >>> necessarily/exclusively in the EC2 context).
I don't completely >>> understand the relationship between swift and falkon -- are there >>> specific applications or scenarios that you are trying to target in >>> this exercise? >> >> By virtue of the fact that they come from pretty much the same group >> of people, they're somewhat fuzzily related - but pretty much swift >> is generating (over the duration of its execution, rather than in one >> batch) a bunch of jobs that need executing (as well, as various >> things like file transfers). As it generates them, it sends them off >> to be executed. The official ways that are 'supported' by Swift are >> by executing them on the local machine and by sending them off >> through GRAM; however, people can plug in whatever they want to do >> submissions. >> >> I know less about Falkon because it isn't Swift, but the Falkon side >> of things is pretty much about running a bunch of jobs - it plugs >> into the abovementioned place in Swift so that Swift gives Falkon >> jobs to run, and Falkon runs them (with a goal of Falkon being, >> presumably, to run it much more efficiently than if they were >> submitted straight through GRAM - it seems to do pretty well). >> >> There's two things going on with swift - one is about making it >> straightforward to use at the low end of things, so that people can >> start using it easily - for the most part, that isn't interesting in >> itself; the other is about getting it to perform well at the high end >> of things, which is where the fun research is. Using Falkon and using >> EC2 are both on that side of things. >> > -- ============================================ Ioan Raicu Ph.D. Student ============================================ Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 
58th Street, Ryerson Hall Chicago, IL 60637 ============================================ Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dsl.cs.uchicago.edu/ ============================================ ============================================ From yongzh at cs.uchicago.edu Wed May 16 12:31:05 2007 From: yongzh at cs.uchicago.edu (Yong Zhao) Date: Wed, 16 May 2007 12:31:05 -0500 (CDT) Subject: [Swift-devel] mappers on files that are inputs and outputs In-Reply-To: References: Message-ID: This sounds strange; it should be able to map the img and hdr files correctly to fields atlas.img and atlas.hdr. Can you enable detailed logging? Yong. On Wed, 16 May 2007, Ben Clifford wrote: > > Here's a code fragment: > > type volume { > imagefile img; > headerfile hdr; > }; > > volume atlas ; > atlas = softmean(slices); > > string directions[] = [ "x", "y", "z"]; > > foreach direction in directions { > giffile outputgif > ; > string option = @strcat("-",direction); > outputgif = slice_to_gif(atlas, option, ".5"); > } > > When this is run as part of a workflow, there are no atlas.* files and the > atlas = softmean(slices) line causes atlas.hdr and atlas.img files to be > created and placed in my working directory, and also used in the > subsequent slice_to_gif calls. > > If I prune the program in a text editor so that the atlas = ... line is > not called, and leave the atlas.hdr and atlas.img files in place in my > current directory (so that the files are now input files, rather than > intermediate files), I get this error: > > $ swift -debug -tc.file tc.data play.swift > WARN - Failed to configure log file name > > Swift v0.1-dev > > RunID: mx49u8a36d1m0 > Execution failed: > java.lang.RuntimeException: Data set initialization failed for > true. Missing required field: img mapped to atlas > > > I think it's probably a desirable feature for the same mapping that maps > ok for intermediate files to map for input files too.
> > -- > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From itf at mcs.anl.gov Wed May 16 12:20:57 2007 From: itf at mcs.anl.gov (Ian Foster) Date: Wed, 16 May 2007 17:20:57 +0000 Subject: [Swift-devel] swift-on-ec2 In-Reply-To: <464B1402.9040405@mcs.anl.gov> References: <4649D280.5080906@mcs.anl.gov> <20070515154500.ad1600bf.tfreeman@mcs.anl.gov> <464A24AF.7080801@cs.uchicago.edu> <464A8857.90800@mcs.anl.gov> <356127187-1179301594-cardhu_blackberry.rim.net-179336256-@bwe047-cell00.bisx.prod.on.blackberry> <464B1402.9040405@mcs.anl.gov> Message-ID: <1583680979-1179336692-cardhu_blackberry.rim.net-2135204572-@bwe032-cell00.bisx.prod.on.blackberry> Kate: I personally would be delighted if you could run the virtual cluster on EC2 tomorrow. I know that there are lots of ways that you could refine its config, local experiments that could be performed, etc., but perhaps we could try bypassing those things, which seem somewhat like distractions to me? Ian Sent via BlackBerry from T-Mobile -----Original Message----- From: Kate Keahey Date: Wed, 16 May 2007 09:24:02 To:itf at mcs.anl.gov Cc:swift-devel-bounces at ci.uchicago.edu, Ioan Raicu , swift-devel at ci.uchicago.edu, Borja Sotomayor Subject: Re: [Swift-devel] swift-on-ec2 Ian Foster wrote: > Kate: > > If we configure the virtual cluster with a full LRM, as you propose (and, it seems, have already done--great work!), then we can use this to start Falkon executors--as we do today on regular clusters. So it seems to me that we have all we need. How about you and Ioan spend your time on Thursday running something on EC2, to make sure it works? As I suggest below, I think it would be easiest if we could deploy and debug a small static cluster locally first, and we can probably give it a shot tomorrow.
We still don't have access to the Xen nodes on TeraPort (although hopefully that might change by tomorrow) but I asked Rick to rebuild a couple of nodes at ANL and he did, I think for a test that should give us enough resources to play with. At the same time -- if there are multiple ways of doing this, and perhaps better ways than simply using a virtual cluster, we should discuss them now. It is not completely clear to me what the relationship between Falkon and Swift is, and what the specific objectives are (other than that dynamically provisioning resources is required). It looks at this point like the objectives probably overlap with what Ioan, Borja and I wanted to talk about (which I thought was a separate project, but am thrilled to find out is related) so how about we come up with a design tomorrow and post the notes on this list (is this a good venue for that?) and then others can shoot them down. > Regarding choice of LRM: have you looked at SGE? That is what quite a few others seem to be using. Yes, we have. We also collaborate with others who do, as well as with Sun... As you may remember, Borja did the scheduling work for his thesis in the context of SGE. Last time we talked though, Torque was the scheduler of choice for the virtual cluster LRM so we used that. The usage of SGE you are referring to above -- is this in the context of virtualization projects, or as LRM for various Falkon-related applications? > > Ian > > > > Sent via BlackBerry from T-Mobile > > -----Original Message----- > From: Kate Keahey > Date: Tue, 15 May 2007 23:28:07 > To:iraicu at cs.uchicago.edu > Cc:swift-devel at ci.uchicago.edu > Subject: Re: [Swift-devel] swift-on-ec2 > > First -- this is a very useful discussion, would it be possible to see > all of it. We need to understand the requirements and trade-offs in some > detail to figure out the best way to make this work. 
I see a few > different interaction threads somewhat mixed up here though so just to > make sure we are all on the same wavelength, here is some context. > > Ian and I have been talking on and off about providing a workspace > service implementation with EC2 backend. The benefit for that would be > that users could deploy the same VMs using the same interface to either > TeraPort or EC2 or yet another resource provider. The workspace service > would also provide some features on top of EC2 (translating between PKI > credentials and Amazon's paying accounts, contextualization as needed to > make deployment dynamic). One application of interest for this was > Swift. Last time we chatted about this though was in the context of > using EC2 to provide a production platform for STAR runs (since > virtualizing enough TeraPort to provide a production platform is taking > a long time). This is what Tim and I are trying to make happen now. > > Since there was also interest in running Swift in VMs, Mike, Tibi and I > met around February/March and agreed that a reasonable way to proceed > will be for us to stand up a base virtual cluster somewhere locally > (e.g., a static deployment on TeraPort) so that they can finish the > configuration according to their needs, look at performance, figure out > the best way to interact with it, and make sure that there are no > VM-induced gotchas. All of this will be much easier to assess locally > and on a static deployment. Then we'd make sure the cluster is > dynamically deployable using the workspace service (on EC2 or whatever > other provider). During the meeting (and over following emails) we > agreed that the required "base cluster" would be configured with > GRAM/Torque on the headnode plus a number of worker nodes, plus root > privileges. We configured this cluster and it is ready to deploy. Are > you saying now that in fact something different is needed? 
> > As Ian says, Borja and I were planning to meet with Ioan on Thursday to > discuss interaction between Falkon and the workspace service (not > necessarily/exclusively in the EC2 context). I don't completely > understand the relationship between swift and falkon -- are there > specific applications or scenarios that you are trying to target in this > exercise? > > Ioan Raicu wrote: >> Hi, >> See below: >> >> Tim Freeman wrote: >>> On Tue, 15 May 2007 16:20:03 +0000 (GMT) >>> Ben Clifford wrote: >>> >>> >>>> Ian asked about this elsewhere, but its perhaps interesting for >>>> swift-devel people to look at the questions too. >>>> >>>> On Tue, 15 May 2007, Ian Foster wrote: >>>> >>>> >>>>> Dear All: >>>>> >>>> >>>> >>>>> I asked Kate if she and Tim could look into creating VM images that >>>>> would allow us to run Swift applications on Amazon EC2. I think Kate >>>>> is meeting with Ioan about this on Thursday (?). >>>>> >>>> >>>> >>>>> One issue that I thought would be good to discuss is what we'd want >>>>> in that VM image. Perhaps this is obvious to the rest of you, but it >>>>> isn't to me. A few thoughts: >>>>> * I'm assuming that we want to run "workers" on EC2 nodes, and >>>>> have the >>>>> "task dispatch" logic run on some external frontend system outside EC2. >>>>> * I would think that we want to use Falkon to do the task >>>>> dispatch. If so, >>>>> we need a Falkon executor on each VM, configured to check in with >>>>> the Falkon >>>>> dispatcher. (Alternatively, we could use, say, SGE: in that case, we >>>>> would >>>>> want an SGE agent.) >>>>> * We need a way of getting data to and from the worker nodes. >>>>> Do we want to >>>>> run a file system across the EC2 nodes and the external frontend >>>>> node? That >>>>> seems rather inefficient. Other options? >>>>> * Should we preload the application code on each EC2 node? >>>>> >>>> Here's a couple of approaches: >>>> >>>> 1) swift regards all the EC2 nodes that we are paying for as a >>>> single site. 
>>>> >>>> Something like falkon handles all the task dispatch and worker node >>>> management. I don't know what that looks like at the moment in >>>> Falkon, but the interface for Swift to send jobs into Falkon sounds >>>> pretty straightforward and shouldn't need changing. >>>> >>> So if I understand, here there would be no gateway+LRM but each EC2 >>> node + >>> Falkon would need a port open to receive tasks? Or does each node >>> pull down >>> instructions OK from behind a firewall? >>> >> Falkon supports both polling and notifications. To use notifications, >> there needs to be an open port on the worker :( >>> Is there a latency problem with running each node as an indepdent task >>> receiver with the dispatcher off-site from EC2? I would think it >>> would be >>> better to put the queues to fill with tasks on EC2 so it can more >>> quickly get >>> the task going when a node is done with a previous task (I may be >>> missing some >>> nuances here with respect to Falkon, don't know much about this yet!). >> We have run the Falkon dispatcher at UChicago and workers at ANL without >> any issues, so it can easily tolerate a few ms of latency. We haven't >> tried it across 10s of ms of latency links, but my instinct says that if >> you have enough workers, you might be able to hide the latency. We'd >> have to experiment with it to see what happens. We could potentially do >> some experiments between SDSC and ANL over a 50+ ms link, and see what >> difference in throughputs we get. >> >> Ioan >>> If a gateway node is desired, this option sounds a lot like the GRAM+LRM >>> situation we use on VMs with the workspace service and will soon use >>> on EC2 via >>> the workspace EC2 gateway we're implementing. Start up one gateway >>> node and >>> then add compute nodes which dynamically join the pool, they are >>> pointed to the >>> GRAM node. 
>>> >>> >>>> All the nodes in a site are required by our site model to have a >>>> shared filesystem - we've talked about removing it, but I think that >>>> is still the case and if so, isn't going to change soon. >>> Setting up a shared filesystem in this environment is akin to setting >>> up the >>> compute nodes to join an LRM pool. The VMs can communicate over the >>> private >>> network at EC2, you can instruct EC2 to let all the nodes be open to >>> each other >>> (while simultaneously keeping a separate policy of blocking ports from >>> being >>> open from the internet and other people's EC2 nodes). The >>> non-file-serving >>> nodes would simply need to know the private address of the filesystem >>> server >>> (unless you are using a fancier network file system than NFS-style ones). >>> For background: every VM on EC2 currently gets a public address -- >>> NAT'd to a >>> private address which is actually what the VM's one NIC is configured >>> with. >>> There is a facility to open/forward specific network ports on the public >>> address to each VM either via a group policy or on a VM by VM basis. >>> >>> [...] >>>> Amazon also has a storage cloud, alongside its compute cloud. I know >>>> very little about that and have never thought about how it would fit >>>> into the above (if at all). Maybe someone else knows more. >>>> >>> A VM template on EC2 is called an AMI which stands for Amazon Machine >>> Image. >>> This is just a packaging thing but what it mostly means is that the VM is >>> stored on S3 and also registered into the EC2 system. >>> >>> When starting an instance of an AMI, the file is copied from S3 to the >>> hypervisor node (what we call propagation in the workspace service). >>> After it >>> is used, this file is deleted (an option in the workspace service but >>> there is >>> also an option to save it back with any changes). 
>>> So the VMs are stored in S3 but anything that happens on them after being >>> started is lost unless you manually do something about it. >>> >>> As for free scratch space, you get a good amount per node, 140G. But >>> the node >>> could go down at any moment just like a physical resource. >>> >>> To harness S3 for safely persisting any data (or if you need more >>> space) you >>> would need to actually run S3 clients on the VMs when they are run on >>> EC2. You >>> could alternatively mirror data between nodes assuming that all would >>> not go >>> down at once. >>> The good thing is that you do not pay transfer costs between S3 and >>> EC2 if you >>> choose to use S3 for big storage, you would only pay the "housing fees" >>> so to >>> speak. >>> Tim >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>> >>> > -- Kate Keahey, Mathematics & CS Division, Argonne National Laboratory Computation Institute, University of Chicago From yongzh at cs.uchicago.edu Wed May 16 12:34:53 2007 From: yongzh at cs.uchicago.edu (Yong Zhao) Date: Wed, 16 May 2007 12:34:53 -0500 (CDT) Subject: [Swift-devel] swift-on-ec2 In-Reply-To: <464B3F87.9090708@cs.uchicago.edu> References: <4649D280.5080906@mcs.anl.gov> <20070515154500.ad1600bf.tfreeman@mcs.anl.gov> <464A24AF.7080801@cs.uchicago.edu> <464A8857.90800@mcs.anl.gov> <464B1740.3060808@mcs.anl.gov> <464B3F87.9090708@cs.uchicago.edu> Message-ID: I'd think the workspace manager should be able to do that, and not statically, but by allocating new virtual nodes as requested. Yong. On Wed, 16 May 2007, Ioan Raicu wrote: > Well, the dynamic provisioning assumes that Falkon is acquiring > resources when it needs them. This implies that it knows how to talk to > the EC2 service, and it knows how to bootstrap a VM that has the > necessary Falkon software stack.
> > I was actually hoping (at least in the short term) that static resource > provisioning could be handled by the workspace service, talking to the > EC2 service and bootstrapping the VM (with the necessary Falkon stack), > and then, once the Falkon executors register with the Falkon dispatcher, > Falkon handles the lightweight job management (in place of a > traditional LRM). > > The provisioning to EC2 could be pushed onto Falkon in the future, but > it is not currently on my immediate to-do list. > > Ioan > > Kate Keahey wrote: > > Thanks Ben, this helps a lot! So it seems to me like we are talking > > about combining dynamic provisioning with lightweight job management > > which should be pluggable into swift. > > > > Ben Clifford wrote: > >> On Tue, 15 May 2007, Kate Keahey wrote: > >> > >>> As Ian says, Borja and I were planning to meet with Ioan on Thursday > >>> to discuss interaction between Falkon and the workspace service (not > >>> necessarily/exclusively in the EC2 context). I don't completely > >>> understand the relationship between swift and falkon -- are there > >>> specific applications or scenarios that you are trying to target in > >>> this exercise? > >> > >> By virtue of the fact that they come from pretty much the same group > >> of people, they're somewhat fuzzily related - but pretty much swift > >> is generating (over the duration of its execution, rather than in one > >> batch) a bunch of jobs that need executing (as well as various > >> things like file transfers). As it generates them, it sends them off > >> to be executed. The official ways that are 'supported' by Swift are > >> by executing them on the local machine and by sending them off > >> through GRAM; however, people can plug in whatever they want to do > >> submissions.
> >> > >> I know less about Falkon because it isn't Swift, but the Falkon side > >> of things is pretty much about running a bunch of jobs - it plugs > >> into the abovementioned place in Swift so that Swift gives Falkon > >> jobs to run, and Falkon runs them (with a goal of Falkon being, > >> presumably, to run it much more efficiently than if they were > >> submitted straight through GRAM - it seems to do pretty well). > >> > >> There's two things going on with swift - one is about making it > >> straightforward to use at the low end of things, so that people can > >> start using it easily - for the most part, that isn't interesting in > >> itself; the other is about getting it to perform well at the high end > >> of things, which is where the fun research is. Using Falkon and using > >> EC2 are both on that side of things. > >> > > > > -- > ============================================ > Ioan Raicu > Ph.D. Student > ============================================ > Distributed Systems Laboratory > Computer Science Department > University of Chicago > 1100 E. 
58th Street, Ryerson Hall > Chicago, IL 60637 > ============================================ > Email: iraicu at cs.uchicago.edu > Web: http://www.cs.uchicago.edu/~iraicu > http://dsl.cs.uchicago.edu/ > ============================================ > ============================================ > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From iraicu at cs.uchicago.edu Wed May 16 12:35:32 2007 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Wed, 16 May 2007 12:35:32 -0500 Subject: [Swift-devel] swift-on-ec2 In-Reply-To: <20070516120304.19151d46.tfreeman@mcs.anl.gov> References: <4649D280.5080906@mcs.anl.gov> <20070515154500.ad1600bf.tfreeman@mcs.anl.gov> <464A24AF.7080801@cs.uchicago.edu> <464A8857.90800@mcs.anl.gov> <464B3776.2010700@cs.uchicago.edu> <20070516120304.19151d46.tfreeman@mcs.anl.gov> Message-ID: <464B40E4.3020706@cs.uchicago.edu> Tim Freeman wrote: > On Wed, 16 May 2007 11:55:18 -0500 > Ioan Raicu wrote: > > >> Hi, >> I am just catching up with emails from last night... >> >> Ben Clifford wrote: >> >>> On Tue, 15 May 2007, Kate Keahey wrote: >>> >>> >>> >>>> As Ian says, Borja and I were planning to meet with Ioan on Thursday to >>>> discuss interaction between Falkon and the workspace service (not >>>> necessarily/exclusively in the EC2 context). I don't completely >>>> understand the relationship between swift and falkon -- are there >>>> specific applications or scenarios that you are trying to target in this >>>> exercise? >>>> >>>> >>> By virtue of the fact that they come from pretty much the same group of >>> people, they're somewhat fuzzily related - but pretty much swift is >>> generating (over the duration of its execution, rather than in one batch) >>> a bunch of jobs that need executing (as well, as various things like file >>> transfers). As it generates them, it sends them off to be executed. 
The >>> official ways that are 'supported' by Swift are by executing them on the >>> local machine and by sending them off through GRAM; however, people can >>> plug in whatever they want to do submissions. >>> >>> I know less about Falkon because it isn't Swift, but the Falkon side of >>> things is pretty much about running a bunch of jobs - it plugs into the >>> abovementioned place in Swift so that Swift gives Falkon jobs to run, and >>> Falkon runs them (with a goal of Falkon being, presumably, to run it much >>> more efficiently than if they were submitted straight through GRAM - it >>> seems to do pretty well). >>> >>> >> We intentionally made Falkon's interface and semantics as similar as >> possible to that of GRAM, so applications that normally used GRAM could >> easily change to Falkon. >> >>> There's two things going on with swift - one is about making it >>> straightforward to use at the low end of things, so that people can start >>> using it easily - for the most part, that isn't interesting in itself; the >>> other is about getting it to perform well at the high end of things, which >>> is where the fun research is. Using Falkon and using EC2 are both on that >>> side of things. >>> >>> >> Right! >> >> Falkon is certainly about getting more performance from the same hardware. >> >> EC2 on the other hand is more about a new paradigm of how resources are >> acquired. In the batch-scheduled world, the demand for resources is >> usually higher than the supply. In EC2, its likely that the supply for >> resources is higher than the demand. With that said, it means that with >> EC2, it is likely that you could always get more resources now if you >> were willing to pay for them >> > > That's not entirely true at this particular point in time: > > http://www.pcworld.com/article/id,130832-c,webservices/article.html > > "We hate being capacity-constrained," Bezos said. "It's not the right way to > run a business. 
We are trying to get ourselves in a position with EC2 where we > will be demand-constrained instead of capacity-constrained." > >

But this doesn't make much sense. I think these guys get $700 or so a year for each VM they run, which means that they are charging more money over the lifetime of the machine than it costs to purchase and maintain the machine (assuming they are cheap computers). With this said, it seems that they should be adding more resources as the demand grows, so they always have resources available if someone asks for them... at least that is what I am expecting from such a service. If this is not the case now, I hope it will be in the future!

Ioan

> >> ... this could have implications on the >> resource allocation and management policies that govern when it makes >> sense to get more resources and when not to. >> > > Right now for example, we're programming a little feature into the workspace-EC2 > gateway that limits the amount of money an entity can spend :-) > > Tim > > > >> Using EC2 might be about >> performance, but the really interesting part that I see emerging is a >> new model that deviates from the traditional batch-scheduled systems the >> Grid community has grown accustomed to. >> >> Ioan >> > > >

-- ============================================ Ioan Raicu Ph.D. Student ============================================ Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 ============================================ Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dsl.cs.uchicago.edu/ ============================================ ============================================

From iraicu at cs.uchicago.edu Wed May 16 12:37:50 2007
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Wed, 16 May 2007 12:37:50 -0500
Subject: [Swift-devel] swift-on-ec2
In-Reply-To: <464B3D6B.5090801@mcs.anl.gov>
References: <4649D280.5080906@mcs.anl.gov> <20070515154500.ad1600bf.tfreeman@mcs.anl.gov> <464A24AF.7080801@cs.uchicago.edu> <464A8857.90800@mcs.anl.gov> <464B3776.2010700@cs.uchicago.edu> <20070516120304.19151d46.tfreeman@mcs.anl.gov> <464B3D6B.5090801@mcs.anl.gov>
Message-ID: <464B416E.30205@cs.uchicago.edu>

Yes, that could certainly make their resource capacity planning easier, since as their resource consumption reaches critical levels, they just charge more and more, making it unlikely that all resources will ever be consumed!

Ioan

Kate Keahey wrote: > Ah, yes, the next thing they will allow people to bid... ;-). > > Tim Freeman wrote: > >> >> That's not entirely true at this particular point in time: >> >> http://www.pcworld.com/article/id,130832-c,webservices/article.html >> >> "We hate being capacity-constrained," Bezos said. "It's not the right >> way to >> run a business. We are trying to get ourselves in a position with EC2 >> where we >> will be demand-constrained instead of capacity-constrained." >> >> >>> ... this could have implications on the resource allocation and >>> management policies that govern when it makes sense to get more >>> resources and when not to. >> >> Right now for example, we're programming a little feature into the >> workspace-EC2 >> gateway that limits the amount of money an entity can spend :-) >> Tim >

-- ============================================ Ioan Raicu Ph.D. Student ============================================ Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E.
58th Street, Ryerson Hall Chicago, IL 60637 ============================================ Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dsl.cs.uchicago.edu/ ============================================ ============================================

From itf at mcs.anl.gov Wed May 16 12:45:35 2007
From: itf at mcs.anl.gov (Ian Foster)
Date: Wed, 16 May 2007 17:45:35 +0000
Subject: [Swift-devel] swift-on-ec2
In-Reply-To: <464B3F87.9090708@cs.uchicago.edu>
References: <4649D280.5080906@mcs.anl.gov><20070515154500.ad1600bf.tfreeman@mcs.anl.gov><464A24AF.7080801@cs.uchicago.edu> <464A8857.90800@mcs.anl.gov><464B1740.3060808@mcs.anl.gov> <464B3F87.9090708@cs.uchicago.edu>
Message-ID: <1154116369-1179337647-cardhu_blackberry.rim.net-1593412775-@bwe059-cell00.bisx.prod.on.blackberry>

Yes, that is all true. But let's focus on getting a static virtual cluster on ec2, with swift apps running on it. I am sure this can be done tomorrow!

Sent via BlackBerry from T-Mobile

-----Original Message-----
From: Ioan Raicu
Date: Wed, 16 May 2007 12:29:43
To:Kate Keahey
Cc:swift-devel at ci.uchicago.edu
Subject: Re: [Swift-devel] swift-on-ec2

Well, the dynamic provisioning assumes that Falkon is acquiring resources when it needs them. This implies that it knows how to talk to the EC2 service, and it knows how to bootstrap a VM that has the necessary Falkon software stack.

I was actually hoping (at least in the short term) that static resource provisioning could be handled by the workspace service, talking to the EC2 service and bootstraping the VM (with the necesarry Falkon stack), and then once the Falkon executors register with the Falkon dispatcher, then Falkon handles the lightweight job management (in place of a traditional LRM).

The provisioning to EC2 could be pushed onto Falkon in the future, but it is not currently on my immediate list of things to-do list.

Ioan

Kate Keahey wrote: > Thanks Ben, this helps a lot!
So it seems to me like we are talking > about combining dynamic provisioning with lightweight job management > which should be pluggable into swift. > > Ben Clifford wrote: >> On Tue, 15 May 2007, Kate Keahey wrote: >> >>> As Ian says, Borja and I were planning to meet with Ioan on Thursday >>> to discuss interaction between Falkon and the workspace service (not >>> necessarily/exclusively in the EC2 context). I don't completely >>> understand the relationship between swift and falkon -- are there >>> specific applications or scenarios that you are trying to target in >>> this exercise? >> >> By virtue of the fact that they come from pretty much the same group >> of people, they're somewhat fuzzily related - but pretty much swift >> is generating (over the duration of its execution, rather than in one >> batch) a bunch of jobs that need executing (as well, as various >> things like file transfers). As it generates them, it sends them off >> to be executed. The official ways that are 'supported' by Swift are >> by executing them on the local machine and by sending them off >> through GRAM; however, people can plug in whatever they want to do >> submissions. >> >> I know less about Falkon because it isn't Swift, but the Falkon side >> of things is pretty much about running a bunch of jobs - it plugs >> into the abovementioned place in Swift so that Swift gives Falkon >> jobs to run, and Falkon runs them (with a goal of Falkon being, >> presumably, to run it much more efficiently than if they were >> submitted straight through GRAM - it seems to do pretty well). >> >> There's two things going on with swift - one is about making it >> straightforward to use at the low end of things, so that people can >> start using it easily - for the most part, that isn't interesting in >> itself; the other is about getting it to perform well at the high end >> of things, which is where the fun research is. Using Falkon and using >> EC2 are both on that side of things. 
>> > -- ============================================ Ioan Raicu Ph.D. Student ============================================ Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 ============================================ Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dsl.cs.uchicago.edu/ ============================================ ============================================ _______________________________________________ Swift-devel mailing list Swift-devel at ci.uchicago.edu http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel

From iraicu at cs.uchicago.edu Wed May 16 12:49:49 2007
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Wed, 16 May 2007 12:49:49 -0500
Subject: [Swift-devel] swift-on-ec2
In-Reply-To:
References: <4649D280.5080906@mcs.anl.gov> <20070515154500.ad1600bf.tfreeman@mcs.anl.gov> <464A24AF.7080801@cs.uchicago.edu> <464A8857.90800@mcs.anl.gov> <464B1740.3060808@mcs.anl.gov> <464B3F87.9090708@cs.uchicago.edu>
Message-ID: <464B443D.3050708@cs.uchicago.edu>

That would make things much simpler, from Falkon's perspective. Essentially, if the workspace service offered an interface that Falkon could use to allocate and de-allocate resources (VMs) on demand, then the Falkon dynamic resource provisioning could be used as long as Falkon implements this new workspace interface instead of the current GRAM interface it uses! Then, the whole EC2 deployment and bootstrapping would be offloaded to the workspace service, and only the resource provisioning and task dispatch would be done at Falkon, the same as it is today when we use GRAM!

Ioan

Yong Zhao wrote: > I'd think the the workspace manager should be able to do that, and not > statically, but allocate new virtual nodes as requested. > > Yong. > > On Wed, 16 May 2007, Ioan Raicu wrote: > > >> Well, the dynamic provisioning assumes that Falkon is acquiring >> resources when it needs them.
This implies that it knows how to talk to >> the EC2 service, and it knows how to bootstrap a VM that has the >> necessary Falkon software stack. >> >> I was actually hoping (at least in the short term) that static resource >> provisioning could be handled by the workspace service, talking to the >> EC2 service and bootstraping the VM (with the necesarry Falkon stack), >> and then once the Falkon executors register with the Falkon dispatcher, >> then Falkon handles the lightweight job management (in place of a >> traditional LRM). >> >> The provisioning to EC2 could be pushed onto Falkon in the future, but >> it is not currently on my immediate list of things to-do list. >> >> Ioan >> >> Kate Keahey wrote: >> >>> Thanks Ben, this helps a lot! So it seems to me like we are talking >>> about combining dynamic provisioning with lightweight job management >>> which should be pluggable into swift. >>> >>> Ben Clifford wrote: >>> >>>> On Tue, 15 May 2007, Kate Keahey wrote: >>>> >>>> >>>>> As Ian says, Borja and I were planning to meet with Ioan on Thursday >>>>> to discuss interaction between Falkon and the workspace service (not >>>>> necessarily/exclusively in the EC2 context). I don't completely >>>>> understand the relationship between swift and falkon -- are there >>>>> specific applications or scenarios that you are trying to target in >>>>> this exercise? >>>>> >>>> By virtue of the fact that they come from pretty much the same group >>>> of people, they're somewhat fuzzily related - but pretty much swift >>>> is generating (over the duration of its execution, rather than in one >>>> batch) a bunch of jobs that need executing (as well, as various >>>> things like file transfers). As it generates them, it sends them off >>>> to be executed. The official ways that are 'supported' by Swift are >>>> by executing them on the local machine and by sending them off >>>> through GRAM; however, people can plug in whatever they want to do >>>> submissions. 
>>>> >>>> I know less about Falkon because it isn't Swift, but the Falkon side >>>> of things is pretty much about running a bunch of jobs - it plugs >>>> into the abovementioned place in Swift so that Swift gives Falkon >>>> jobs to run, and Falkon runs them (with a goal of Falkon being, >>>> presumably, to run it much more efficiently than if they were >>>> submitted straight through GRAM - it seems to do pretty well). >>>> >>>> There's two things going on with swift - one is about making it >>>> straightforward to use at the low end of things, so that people can >>>> start using it easily - for the most part, that isn't interesting in >>>> itself; the other is about getting it to perform well at the high end >>>> of things, which is where the fun research is. Using Falkon and using >>>> EC2 are both on that side of things. >>>> >>>> >> -- >> ============================================ >> Ioan Raicu >> Ph.D. Student >> ============================================ >> Distributed Systems Laboratory >> Computer Science Department >> University of Chicago >> 1100 E. 58th Street, Ryerson Hall >> Chicago, IL 60637 >> ============================================ >> Email: iraicu at cs.uchicago.edu >> Web: http://www.cs.uchicago.edu/~iraicu >> http://dsl.cs.uchicago.edu/ >> ============================================ >> ============================================ >> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> >> > > -- ============================================ Ioan Raicu Ph.D. Student ============================================ Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 
58th Street, Ryerson Hall Chicago, IL 60637 ============================================ Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dsl.cs.uchicago.edu/ ============================================ ============================================

From keahey at mcs.anl.gov Wed May 16 12:52:01 2007
From: keahey at mcs.anl.gov (Kate Keahey)
Date: Wed, 16 May 2007 12:52:01 -0500
Subject: [Swift-devel] swift-on-ec2
In-Reply-To: <464B3F87.9090708@cs.uchicago.edu>
References: <4649D280.5080906@mcs.anl.gov> <20070515154500.ad1600bf.tfreeman@mcs.anl.gov> <464A24AF.7080801@cs.uchicago.edu> <464A8857.90800@mcs.anl.gov> <464B1740.3060808@mcs.anl.gov> <464B3F87.9090708@cs.uchicago.edu>
Message-ID: <464B44C1.302@mcs.anl.gov>

Ioan,

Ioan Raicu wrote: > Well, the dynamic provisioning assumes that Falkon is acquiring > resources when it needs them. This implies that it knows how to talk to > the EC2 service, and it knows how to bootstrap a VM that has the > necessary Falkon software stack. > > I was actually hoping (at least in the short term) that static resource > provisioning could be handled by the workspace service, talking to the > EC2 service and bootstraping the VM (with the necesarry Falkon stack), > and then once the Falkon executors register with the Falkon dispatcher, > then Falkon handles the lightweight job management (in place of a > traditional LRM).

Yes, this is exactly what I was also thinking. My point below is that the combined infrastructure would fit into Swift.

> The provisioning to EC2 could be pushed onto Falkon in the future, but > it is not currently on my immediate list of things to-do list. > > Ioan > > Kate Keahey wrote: >> Thanks Ben, this helps a lot! So it seems to me like we are talking >> about combining dynamic provisioning with lightweight job management >> which should be pluggable into swift.
>> >> Ben Clifford wrote: >>> On Tue, 15 May 2007, Kate Keahey wrote: >>> >>>> As Ian says, Borja and I were planning to meet with Ioan on Thursday >>>> to discuss interaction between Falkon and the workspace service (not >>>> necessarily/exclusively in the EC2 context). I don't completely >>>> understand the relationship between swift and falkon -- are there >>>> specific applications or scenarios that you are trying to target in >>>> this exercise? >>> >>> By virtue of the fact that they come from pretty much the same group >>> of people, they're somewhat fuzzily related - but pretty much swift >>> is generating (over the duration of its execution, rather than in one >>> batch) a bunch of jobs that need executing (as well, as various >>> things like file transfers). As it generates them, it sends them off >>> to be executed. The official ways that are 'supported' by Swift are >>> by executing them on the local machine and by sending them off >>> through GRAM; however, people can plug in whatever they want to do >>> submissions. >>> >>> I know less about Falkon because it isn't Swift, but the Falkon side >>> of things is pretty much about running a bunch of jobs - it plugs >>> into the abovementioned place in Swift so that Swift gives Falkon >>> jobs to run, and Falkon runs them (with a goal of Falkon being, >>> presumably, to run it much more efficiently than if they were >>> submitted straight through GRAM - it seems to do pretty well). >>> >>> There's two things going on with swift - one is about making it >>> straightforward to use at the low end of things, so that people can >>> start using it easily - for the most part, that isn't interesting in >>> itself; the other is about getting it to perform well at the high end >>> of things, which is where the fun research is. Using Falkon and using >>> EC2 are both on that side of things. 
>>> >> > -- Kate Keahey, Mathematics & CS Division, Argonne National Laboratory Computation Institute, University of Chicago From keahey at mcs.anl.gov Wed May 16 13:16:14 2007 From: keahey at mcs.anl.gov (Kate Keahey) Date: Wed, 16 May 2007 13:16:14 -0500 Subject: [Swift-devel] swift-on-ec2 In-Reply-To: <1583680979-1179336692-cardhu_blackberry.rim.net-2135204572-@bwe032-cell00.bisx.prod.on.blackberry> References: <4649D280.5080906@mcs.anl.gov> <20070515154500.ad1600bf.tfreeman@mcs.anl.gov><464A24AF.7080801@cs.uchicago.edu> <464A8857.90800@mcs.anl.gov> <356127187-1179301594-cardhu_blackberry.rim.net-179336256-@bwe047-cell00.bisx.prod.on.blackberry> <464B1402.9040405@mcs.anl.gov> <1583680979-1179336692-cardhu_blackberry.rim.net-2135204572-@bwe032-cell00.bisx.prod.on.blackberry> Message-ID: <464B4A6E.2040804@mcs.anl.gov> Ian, you seem to be referring to the necessary /etc/hosts configuration as well as workers registering with the torque headnode below as "distractions" -- I agree they can be very distracting, but in my experience without these distractions a cluster (virtual or physical) won't work in the way such clusters are typically expected to work. What I said in my mail is that we can set up a base cluster locally so that somebody like Ioan can finish the configuration (i.e., install Falkon on it). We will configure this cluster once and leave it deployed as long as needed. Once we have the front-end to EC2 working (which we don't have yet although we are close) we will deploy this cluster on EC2 and provide methods that will automate this last little bit of configuration that *always* has to be done on deployment. I also think it is quite important that we spend the time tomorrow discussing what exactly we are trying to do -- right now, it looks to me like it might make more sense to not use clusters (it will help with the "distractions" if we don't). 
I realize that you are eager for us to get things to run -- I am eager too, but I honestly think we will get there faster if we plan better. Ian Foster wrote: > Kate: > > I personally will be delighted if you could run the virtual cluster on ec2 tomorrow. I know that there are lots of ways that you could refine its config, local expts that could be performed, etc., but perhaps we could try bypassing those things, which seem somewhat like distractions to me? > > Ian > > > Sent via BlackBerry from T-Mobile > > -----Original Message----- > From: Kate Keahey > Date: Wed, 16 May 2007 09:24:02 > To:itf at mcs.anl.gov > Cc:swift-devel-bounces at ci.uchicago.edu, Ioan Raicu , swift-devel at ci.uchicago.edu, Borja Sotomayor > Subject: Re: [Swift-devel] swift-on-ec2 > > > > Ian Foster wrote: >> Kate: >> >> If we configure the virtual cluster with a full LRM, as you propose (and it seems have already done--great work!), then we can use this to start Falkon executors--as we do today on regular clusters. So it seems to me that we have all we need. How about you and Ioan spend your time on Thursday running something on EC2, to make sure it sorks? > > As I suggest below, I think it would be easiest if we could deploy and > debug a small static cluster locally first, and we can probably give it > a shot tomorrow. We still don't have access to the Xen nodes on TeraPort > (although hopefully that might change by tomorrow) but I asked Rick to > rebuild a couple of nodes at ANL and he did, I think for a test that > should give us enough resources to play with. > > At the same time -- if there are multiple ways of doing this, and > perhaps better ways than simply using a virtual cluster, we should > discuss them now. It is not completely clear to me what the relationship > between Falkon and Swift is, and what the specific objectives are (other > than that dynamically provisioning resources is required). 
It looks at > this point like the objectives probably overlap with what Ioan, Borja > and I wanted to talk about (which I thought was a separate project, but > am thrilled to find out is related) so how about we come up with a > design tomorrow and post the notes on this list (is this a good venue > for that?) and then others can shoot them down. > >> Regarding choice of LRM: have you looked at SGE? That is what quite a few others seem to be using. > > Yes, we have. We also collaborate with others who do, as well as with > Sun... As you may remember, Borja did the scheduling work for his thesis > in the context of SGE. Last time we talked though, Torque was the > scheduler of choice for the virtual cluster LRM so we used that. > > The usage of SGE you are referring to above -- is this in the context of > virtualization projects, or as LRM for various Falkon-related applications? > >> Ian >> >> >> >> Sent via BlackBerry from T-Mobile >> >> -----Original Message----- >> From: Kate Keahey >> Date: Tue, 15 May 2007 23:28:07 >> To:iraicu at cs.uchicago.edu >> Cc:swift-devel at ci.uchicago.edu >> Subject: Re: [Swift-devel] swift-on-ec2 >> >> First -- this is a very useful discussion, would it be possible to see >> all of it. We need to understand the requirements and trade-offs in some >> detail to figure out the best way to make this work. I see a few >> different interaction threads somewhat mixed up here though so just to >> make sure we are all on the same wavelength, here is some context. >> >> Ian and I have been talking on and off about providing a workspace >> service implementation with EC2 backend. The benefit for that would be >> that users could deploy the same VMs using the same interface to either >> TeraPort or EC2 or yet another resource provider. The workspace service >> would also provide some features on top of EC2 (translating between PKI >> credentials and Amazon's paying accounts, contextualization as needed to >> make deployment dynamic). 
One application of interest for this was >> Swift. Last time we chatted about this though was in the context of >> using EC2 to provide a production platform for STAR runs (since >> virtualizing enough TeraPort to provide a production platform is taking >> a long time). This is what Tim and I are trying to make happen now. >> >> Since there was also interest in running Swift in VMs, Mike, Tibi and I >> met around February/March and agreed that a reasonable way to proceed >> will be for us to stand up a base virtual cluster somewhere locally >> (e.g., a static deployment on TeraPort) so that they can finish the >> configuration according to their needs, look at performance, figure out >> the best way to interact with it, and make sure that there are no >> VM-induced gotchas. All of this will be much easier to assess locally >> and on a static deployment. Then we'd make sure the cluster is >> dynamically deployable using the workspace service (on EC2 or whatever >> other provider). During the meeting (and over following emails) we >> agreed that the required "base cluster" would be configured with >> GRAM/Torque on the headnode plus a number of worker nodes, plus root >> privileges. We configured this cluster and it is ready to deploy. Are >> you saying now that in fact something different is needed? >> >> As Ian says, Borja and I were planning to meet with Ioan on Thursday to >> discuss interaction between Falkon and the workspace service (not >> necessarily/exclusively in the EC2 context). I don't completely >> understand the relationship between swift and falkon -- are there >> specific applications or scenarios that you are trying to target in this >> exercise? >> >> Ioan Raicu wrote: >>> Hi, >>> See below: >>> >>> Tim Freeman wrote: >>>> On Tue, 15 May 2007 16:20:03 +0000 (GMT) >>>> Ben Clifford wrote: >>>> >>>> >>>>> Ian asked about this elsewhere, but its perhaps interesting for >>>>> swift-devel people to look at the questions too. 
>>>>> >>>>> On Tue, 15 May 2007, Ian Foster wrote: >>>>> >>>>> >>>>>> Dear All: >>>>>> >>>>> >>>>> >>>>>> I asked Kate if she and Tim could look into creating VM images that >>>>>> would allow us to run Swift applications on Amazon EC2. I think Kate >>>>>> is meeting with Ioan about this on Thursday (?). >>>>>> >>>>> >>>>> >>>>>> One issue that I thought would be good to discuss is what we'd want >>>>>> in that VM image. Perhaps this is obvious to the rest of you, but it >>>>>> isn't to me. A few thoughts: >>>>>> * I'm assuming that we want to run "workers" on EC2 nodes, and >>>>>> have the >>>>>> "task dispatch" logic run on some external frontend system outside EC2. >>>>>> * I would think that we want to use Falkon to do the task >>>>>> dispatch. If so, >>>>>> we need a Falkon executor on each VM, configured to check in with >>>>>> the Falkon >>>>>> dispatcher. (Alternatively, we could use, say, SGE: in that case, we >>>>>> would >>>>>> want an SGE agent.) >>>>>> * We need a way of getting data to and from the worker nodes. >>>>>> Do we want to >>>>>> run a file system across the EC2 nodes and the external frontend >>>>>> node? That >>>>>> seems rather inefficient. Other options? >>>>>> * Should we preload the application code on each EC2 node? >>>>>> >>>>> Here's a couple of approaches: >>>>> >>>>> 1) swift regards all the EC2 nodes that we are paying for as a >>>>> single site. >>>>> >>>>> Something like falkon handles all the task dispatch and worker node >>>>> management. I don't know what that looks like at the moment in >>>>> Falkon, but the interface for Swift to send jobs into Falkon sounds >>>>> pretty straightforward and shouldn't need changing. >>>>> >>>> So if I understand, here there would be no gateway+LRM but each EC2 >>>> node + >>>> Falkon would need a port open to receive tasks? Or does each node >>>> pull down >>>> instructions OK from behind a firewall? >>>> >>> Falkon supports both polling and notifications. 
To use notifications, >>> there needs to be an open port on the worker :( >>>> Is there a latency problem with running each node as an indepdent task >>>> receiver with the dispatcher off-site from EC2? I would think it >>>> would be >>>> better to put the queues to fill with tasks on EC2 so it can more >>>> quickly get >>>> the task going when a node is done with a previous task (I may be >>>> missing some >>>> nuances here with respect to Falkon, don't know much about this yet!). >>> We have run the Falkon dispatcher at UChicago and workers at ANL without >>> any issues, so it can easily tolerate a few ms of latency. We haven't >>> tried it across 10s of ms of latency links, but my instinct says that if >>> you have enough workers, you might be able to hide the latency. We'd >>> have to experiment with it to see what happens. We could potentially do >>> some experiments between SDSC and ANL over a 50+ ms link, and see what >>> difference in throughputs we get. >>> >>> Ioan >>>> If a gateway node is desired, this option sounds a lot like the GRAM+LRM >>>> situation we use on VMs with the workspace service and will soon use >>>> on EC2 via >>>> the workspace EC2 gateway we're implementing. Start up one gateway >>>> node and >>>> then add compute nodes which dynamically join the pool, they are >>>> pointed to the >>>> GRAM node. >>>> >>>> >>>>> All the nodes in a site are required by our site model to have a >>>>> shared filesystem - we've talked about removing it, but I think that >>>>> is still the case and if so, isn't going to change soon. >>>> Setting up a shared filesystem in this environment is akin to setting >>>> up the >>>> compute nodes to join an LRM pool. The VMs can communicate over the >>>> private >>>> network at EC2, you can instruct EC2 to let all the nodes be open to >>>> each other >>>> (while simultaneously keeping a separate policy of blocking ports from >>>> being >>>> open from the internet and other people's EC2 nodes). 
The >>>> non-file-serving >>>> nodes would simply need to know the private address of the filesystem >>>> server >>>> (unless you are using a fancier network file system than NFS-style ones). >>>> For background: every VM on EC2 currently gets a public address -- >>>> NAT'd to a >>>> private address which is actually what the VM's one NIC is configured >>>> with. >>>> There is a facility to open/forward specific network ports on the public >>>> address to each VM either via a group policy or on a VM by VM basis. >>>> >>>> [...] >>>>> Amazon also has a storage cloud, alongside its compute cloud. I know >>>>> very little about that and have never thought about how it would fit >>>>> into the above (if at all). Maybe someone else knows more. >>>>> >>>> A VM template on EC2 is called an AMI which stands for Amazon Machine >>>> Image. >>>> This is just a packaging thing but what it mostly means is that the VM is >>>> stored on S3 and also registered into the EC2 system. >>>> >>>> When starting an instance of an AMI, the file is copied from S3 to the >>>> hypervisor node (what we call propagation in the workspace service). >>>> After it >>>> is used, this file is deleted (an option in the workspace service but >>>> there is >>>> also an option to save it back with any changes). >>>> So the VMs are stored in S3 but anything that happens on them after being >>>> started is lost unless you manually do something about it. >>>> >>>> As for free scratch space, you get a good amount per node, 140G. But >>>> the node >>>> could go down at any moment just like a physical resource. >>>> >>>> To harness S3 for safely persisting any data (or if you need more >>>> space) you >>>> would need to actually run S3 clients on the VMs when they are run on >>>> EC2. You >>>> could alternatively mirror data between nodes assuming that all would >>>> not go >>>> down at once. 
>>>> The good thing is that you do not pay transfer costs between S3 and
>>>> EC2 if you
>>>> choose to use S3 for big storage, you would only pay the "housing fees"
>>>> so to
>>>> speak.
>>>> Tim
>>>> _______________________________________________
>>>> Swift-devel mailing list
>>>> Swift-devel at ci.uchicago.edu
>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>>>>
>>>>
>

-- 
Kate Keahey,
Mathematics & CS Division, Argonne National Laboratory
Computation Institute, University of Chicago

From nefedova at mcs.anl.gov Wed May 16 15:07:50 2007
From: nefedova at mcs.anl.gov (Veronika Nefedova)
Date: Wed, 16 May 2007 15:07:50 -0500
Subject: [Swift-devel] Teragrid usage
Message-ID:

Hi,

I checked my Teragrid accounts and it looks like the Swift's allocation is
almost completely used by now (or is it just for me ?):

Account: TG-CDA060004T
Title: TeraGrid: Development Account for Multiple Grid Science Projects
Resource: teragrid_roaming
Allocation Period: 2006-08-30 to 2007-08-31

Name (Last First) or Account      Total    Remaining      Usage
---------------------------- ---------- ------------ ----------
Nefedova Veronika              30000 SU         0 SU   27491 SU
----------------------------------------------------------------------

Fortunately, Benoit has added me to his group's allocation - so I can
continue testing on TG. But it looks like Swift's allocation is almost
gone... Should we renew it ?
Nika From foster at mcs.anl.gov Wed May 16 15:19:18 2007 From: foster at mcs.anl.gov (Ian Foster) Date: Wed, 16 May 2007 15:19:18 -0500 Subject: [Swift-devel] swift-on-ec2 In-Reply-To: <464B4A6E.2040804@mcs.anl.gov> References: <4649D280.5080906@mcs.anl.gov> <20070515154500.ad1600bf.tfreeman@mcs.anl.gov><464A24AF.7080801@cs.uchicago.edu> <464A8857.90800@mcs.anl.gov> <356127187-1179301594-cardhu_blackberry.rim.net-179336256-@bwe047-cell00.bisx.prod.on.blackberry> <464B1402.9040405@mcs.anl.gov> <1583680979-1179336692-cardhu_blackberry.rim.net-2135204572-@bwe032-cell00.bisx.prod.on.blackberry> <464B4A6E.2040804@mcs.anl.gov> Message-ID: <464B6746.7050907@mcs.anl.gov> Kate: I want to emphasize that I was *not* dismissing the issues below as distractions. What I meant was: given that you are working on developing a "virtual cluster", which I am pretty sure will be able to execute Swift apps, let's focus on getting that done, rather than worrying about "special casing" it for Falkon, adding dynamic node acquisition, or the other things that people started discussing as potential extensions. I understand from our IM conversation today that the "virtual cluster" is ready for us in a "static environment" such as some machines in our lab. In a "dynamic environment" such as EC2, it is not quite ready for use yet. Thus, you won't be able to get Swift running on EC2 tomorrow. Ian. Kate Keahey wrote: > Ian, > > you seem to be referring to the necessary /etc/hosts configuration as > well as workers registering with the torque headnode below as > "distractions" -- I agree they can be very distracting, but in my > experience without these distractions a cluster (virtual or physical) > won't work in the way such clusters are typically expected to work. > > What I said in my mail is that we can set up a base cluster locally so > that somebody like Ioan can finish the configuration (i.e., install > Falkon on it). 
We will configure this cluster once and leave it
> deployed as long as needed.
>
> Once we have the front-end to EC2 working (which we don't have yet
> although we are close) we will deploy this cluster on EC2 and provide
> methods that will automate this last little bit of configuration that
> *always* has to be done on deployment.
>
> I also think it is quite important that we spend the time tomorrow
> discussing what exactly we are trying to do -- right now, it looks to
> me like it might make more sense to not use clusters (it will help
> with the "distractions" if we don't).
>
> I realize that you are eager for us to get things to run -- I am eager
> too, but I honestly think we will get there faster if we plan better.
>
> Ian Foster wrote:
>> Kate:
>>
>> I personally will be delighted if you could run the virtual cluster
>> on ec2 tomorrow. I know that there are lots of ways that you could
>> refine its config, local expts that could be performed, etc., but
>> perhaps we could try bypassing those things, which seem somewhat like
>> distractions to me?
>>
>> Ian
>>
>>
>> Sent via BlackBerry from T-Mobile
>> -----Original Message-----
>> From: Kate Keahey
>> Date: Wed, 16 May 2007 09:24:02 To:itf at mcs.anl.gov
>> Cc:swift-devel-bounces at ci.uchicago.edu, Ioan Raicu
>> , swift-devel at ci.uchicago.edu, Borja
>> Sotomayor
>> Subject: Re: [Swift-devel] swift-on-ec2
>>
>>
>>
>> Ian Foster wrote:
>>> Kate:
>>>
>>> If we configure the virtual cluster with a full LRM, as you propose
>>> (and it seems have already done--great work!), then we can use this
>>> to start Falkon executors--as we do today on regular clusters. So it
>>> seems to me that we have all we need. How about you and Ioan spend
>>> your time on Thursday running something on EC2, to make sure it works?
>>
>> As I suggest below, I think it would be easiest if we could deploy
>> and debug a small static cluster locally first, and we can probably
>> give it a shot tomorrow.
We still don't have access to the Xen nodes >> on TeraPort (although hopefully that might change by tomorrow) but I >> asked Rick to rebuild a couple of nodes at ANL and he did, I think >> for a test that should give us enough resources to play with. >> >> At the same time -- if there are multiple ways of doing this, and >> perhaps better ways than simply using a virtual cluster, we should >> discuss them now. It is not completely clear to me what the >> relationship between Falkon and Swift is, and what the specific >> objectives are (other than that dynamically provisioning resources is >> required). It looks at this point like the objectives probably >> overlap with what Ioan, Borja and I wanted to talk about (which I >> thought was a separate project, but am thrilled to find out is >> related) so how about we come up with a design tomorrow and post the >> notes on this list (is this a good venue for that?) and then others >> can shoot them down. >> >>> Regarding choice of LRM: have you looked at SGE? That is what quite >>> a few others seem to be using. >> >> Yes, we have. We also collaborate with others who do, as well as with >> Sun... As you may remember, Borja did the scheduling work for his >> thesis in the context of SGE. Last time we talked though, Torque was >> the scheduler of choice for the virtual cluster LRM so we used that. >> >> The usage of SGE you are referring to above -- is this in the context >> of virtualization projects, or as LRM for various Falkon-related >> applications? >> >>> Ian >>> >>> >>> >>> Sent via BlackBerry from T-Mobile >>> -----Original Message----- >>> From: Kate Keahey >>> Date: Tue, 15 May 2007 23:28:07 To:iraicu at cs.uchicago.edu >>> Cc:swift-devel at ci.uchicago.edu >>> Subject: Re: [Swift-devel] swift-on-ec2 >>> >>> First -- this is a very useful discussion, would it be possible to >>> see all of it. We need to understand the requirements and trade-offs >>> in some detail to figure out the best way to make this work. 
I see a >>> few different interaction threads somewhat mixed up here though so >>> just to make sure we are all on the same wavelength, here is some >>> context. >>> >>> Ian and I have been talking on and off about providing a workspace >>> service implementation with EC2 backend. The benefit for that would >>> be that users could deploy the same VMs using the same interface to >>> either TeraPort or EC2 or yet another resource provider. The >>> workspace service would also provide some features on top of EC2 >>> (translating between PKI credentials and Amazon's paying accounts, >>> contextualization as needed to make deployment dynamic). One >>> application of interest for this was Swift. Last time we chatted >>> about this though was in the context of using EC2 to provide a >>> production platform for STAR runs (since virtualizing enough >>> TeraPort to provide a production platform is taking a long time). >>> This is what Tim and I are trying to make happen now. >>> >>> Since there was also interest in running Swift in VMs, Mike, Tibi >>> and I met around February/March and agreed that a reasonable way to >>> proceed will be for us to stand up a base virtual cluster somewhere >>> locally (e.g., a static deployment on TeraPort) so that they can >>> finish the configuration according to their needs, look at >>> performance, figure out the best way to interact with it, and make >>> sure that there are no VM-induced gotchas. All of this will be much >>> easier to assess locally and on a static deployment. Then we'd make >>> sure the cluster is dynamically deployable using the workspace >>> service (on EC2 or whatever other provider). During the meeting (and >>> over following emails) we agreed that the required "base cluster" >>> would be configured with GRAM/Torque on the headnode plus a number >>> of worker nodes, plus root privileges. We configured this cluster >>> and it is ready to deploy. Are you saying now that in fact something >>> different is needed? 
>>> >>> As Ian says, Borja and I were planning to meet with Ioan on Thursday >>> to discuss interaction between Falkon and the workspace service (not >>> necessarily/exclusively in the EC2 context). I don't completely >>> understand the relationship between swift and falkon -- are there >>> specific applications or scenarios that you are trying to target in >>> this exercise? >>> >>> Ioan Raicu wrote: >>>> Hi, >>>> See below: >>>> >>>> Tim Freeman wrote: >>>>> On Tue, 15 May 2007 16:20:03 +0000 (GMT) >>>>> Ben Clifford wrote: >>>>> >>>>> >>>>>> Ian asked about this elsewhere, but its perhaps interesting for >>>>>> swift-devel people to look at the questions too. >>>>>> >>>>>> On Tue, 15 May 2007, Ian Foster wrote: >>>>>> >>>>>> >>>>>>> Dear All: >>>>>>> >>>>>> >>>>>> >>>>>>> I asked Kate if she and Tim could look into creating VM images >>>>>>> that would allow us to run Swift applications on Amazon EC2. I >>>>>>> think Kate is meeting with Ioan about this on Thursday (?). >>>>>>> >>>>>> >>>>>> >>>>>>> One issue that I thought would be good to discuss is what we'd >>>>>>> want in that VM image. Perhaps this is obvious to the rest of >>>>>>> you, but it isn't to me. A few thoughts: >>>>>>> * I'm assuming that we want to run "workers" on EC2 nodes, >>>>>>> and have the >>>>>>> "task dispatch" logic run on some external frontend system >>>>>>> outside EC2. >>>>>>> * I would think that we want to use Falkon to do the task >>>>>>> dispatch. If so, >>>>>>> we need a Falkon executor on each VM, configured to check in >>>>>>> with the Falkon >>>>>>> dispatcher. (Alternatively, we could use, say, SGE: in that >>>>>>> case, we would >>>>>>> want an SGE agent.) >>>>>>> * We need a way of getting data to and from the worker >>>>>>> nodes. Do we want to >>>>>>> run a file system across the EC2 nodes and the external frontend >>>>>>> node? That >>>>>>> seems rather inefficient. Other options? >>>>>>> * Should we preload the application code on each EC2 node? 
>>>>>> [...]
>>
>
-- Ian Foster, Director, Computation Institute Argonne National Laboratory & University of Chicago Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439 Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637 Tel: +1 630 252 4619. Web: www.ci.uchicago.edu. Globus Alliance: www.globus.org.
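A back-of-envelope way to look at the latency question raised in the quoted thread (whether enough workers can hide a 50+ ms link between an off-site dispatcher and EC2 workers). This is only a sketch, not Falkon code: the model (each polling worker spends one round trip fetching a task and then runs it, and the dispatcher has a fixed maximum dispatch rate), the function names, and every number used with it are assumptions made for illustration.

```python
def throughput(workers, task_ms, rtt_ms, dispatcher_cap):
    """Estimated tasks per second for a pool of polling workers.

    Assumed model (not measured from Falkon): each worker repeatedly
    spends rtt_ms fetching a task and task_ms running it, and the
    dispatcher cannot hand out more than dispatcher_cap tasks/second.
    """
    latency_bound = workers * 1000.0 / (task_ms + rtt_ms)
    return min(latency_bound, dispatcher_cap)


def workers_to_saturate(task_ms, rtt_ms, dispatcher_cap):
    """Smallest worker count at which the dispatcher, not latency, limits throughput."""
    # Integer ceiling of dispatcher_cap * (task_ms + rtt_ms) / 1000.
    return -(-dispatcher_cap * (task_ms + rtt_ms) // 1000)


if __name__ == "__main__":
    # Hypothetical 100 ms tasks and a hypothetical 500 tasks/s dispatcher.
    for rtt_ms in (2, 50):
        needed = workers_to_saturate(100, rtt_ms, 500)
        print(f"rtt={rtt_ms} ms: {needed} workers needed to saturate the dispatcher")
```

Under this model a longer round trip lowers per-worker throughput but can be offset by adding workers, which matches the intuition in the thread; whether real Falkon behaves this way would still need the experiment that was proposed.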
From foster at mcs.anl.gov Wed May 16 15:22:09 2007
From: foster at mcs.anl.gov (Ian Foster)
Date: Wed, 16 May 2007 15:22:09 -0500
Subject: [Swift-devel] swift-on-ec2
In-Reply-To: <464B4A6E.2040804@mcs.anl.gov>
References: <4649D280.5080906@mcs.anl.gov> <20070515154500.ad1600bf.tfreeman@mcs.anl.gov><464A24AF.7080801@cs.uchicago.edu> <464A8857.90800@mcs.anl.gov> <356127187-1179301594-cardhu_blackberry.rim.net-179336256-@bwe047-cell00.bisx.prod.on.blackberry> <464B1402.9040405@mcs.anl.gov> <1583680979-1179336692-cardhu_blackberry.rim.net-2135204572-@bwe032-cell00.bisx.prod.on.blackberry> <464B4A6E.2040804@mcs.anl.gov>
Message-ID: <464B67F1.5060408@mcs.anl.gov>

The people using SGE were just using it as an LRM, I think.

Ian.

>>> Regarding choice of LRM: have you looked at SGE? That is what quite
>>> a few others seem to be using.
>>
>> Yes, we have. We also collaborate with others who do, as well as with
>> Sun... As you may remember, Borja did the scheduling work for his
>> thesis in the context of SGE. Last time we talked though, Torque was
>> the scheduler of choice for the virtual cluster LRM so we used that.
>>
>> The usage of SGE you are referring to above -- is this in the context
>> of virtualization projects, or as LRM for various Falkon-related
>> applications?
>>

From benc at hawaga.org.uk Wed May 16 15:44:48 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 16 May 2007 20:44:48 +0000 (GMT)
Subject: [Swift-devel] Teragrid usage
In-Reply-To:
References:
Message-ID:

On Wed, 16 May 2007, Veronika Nefedova wrote:

> I checked my Teragrid accounts and it looks like the Swift's allocation is
> almost completely used by now (or is it just for me ?):

I show different figures, that suggest that yes, that account is empty.
-- 

From benc at hawaga.org.uk Wed May 16 16:17:58 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 16 May 2007 21:17:58 +0000 (GMT)
Subject: [Swift-devel] mappers on files that are inputs and outputs
In-Reply-To: <1179330195.4473.0.camel@blabla.mcs.anl.gov>
References: <1179329342.4368.0.camel@blabla.mcs.anl.gov> <1179330195.4473.0.camel@blabla.mcs.anl.gov>
Message-ID:

On Wed, 16 May 2007, Mihael Hategan wrote:

> The translator does that bit. You hacked the translated file, but
> incompletely.

I used the translated file as it came out of Karajan.java - no manual
editing.

So, being worried that it had got broken, I made a test case that I think
demonstrates my problem, and tried it on r740, r625 and r101 (those being
an even spread over the evolution of Karajan.java), and got essentially the
same results with all three of those versions (modulo output format
changes).

I tried the following two programs on each of the above:

working:

  string m ;
  string f = @filename(m);
  print(f);

(it outputs map1)

not working:

  type foo { string txt; }
  foo m ;
  string f = @filename(m.txt);
  print(f);

(it gives the error I pasted before)

-- 

From wilde at mcs.anl.gov Wed May 16 18:05:35 2007
From: wilde at mcs.anl.gov (Mike Wilde)
Date: Wed, 16 May 2007 18:05:35 -0500
Subject: [Swift-devel] Teragrid usage
In-Reply-To:
References:
Message-ID: <464B8E3F.7020106@mcs.anl.gov>

Oi. I'll see what I can do.

- Mike

Ben Clifford wrote, On 5/16/2007 3:44 PM:
>
> On Wed, 16 May 2007, Veronika Nefedova wrote:
>
>> I checked my Teragrid accounts and it looks like the Swift's allocation is
>> almost completely used by now (or is it just for me ?):
>
>
> I show different figures, that suggest that yes, that account is empty.
>

-- 
Mike Wilde
Computation Institute, University of Chicago
Math & Computer Science Division
Argonne National Laboratory
Argonne, IL 60439 USA
tel 630-252-7497 fax 630-252-1997

From benc at hawaga.org.uk Thu May 17 05:08:19 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 17 May 2007 10:08:19 +0000 (GMT)
Subject: [Swift-devel] mappers on files that are inputs and outputs
In-Reply-To: <1179329342.4368.0.camel@blabla.mcs.anl.gov>
References: <1179329342.4368.0.camel@blabla.mcs.anl.gov>
Message-ID:

On Wed, 16 May 2007, Mihael Hategan wrote:

> You should probably also add the input=true mapping parameter?

If I *remove* the input=true mapping parameter that the translator puts
there, it works (which is consistent, I suppose, with this working when
used as an output).

This is in bug 60, http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=60
and I'll poke round at it more later - I can work round by using the CSV
mapper for now.

-- 

From benc at hawaga.org.uk Thu May 17 08:03:28 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 17 May 2007 13:03:28 +0000 (GMT)
Subject: [Swift-devel] tutorial code snippets look bad in Internet Explorer
Message-ID:

On at least one machine that I've seen, the code snippets at
http://www.ci.uchicago.edu/swift/guides/tutorial.php come out all on one
line. Does that happen for anyone here with that browser?

-- 

From hategan at mcs.anl.gov Thu May 17 08:11:55 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 17 May 2007 16:11:55 +0300
Subject: [Swift-devel] tutorial code snippets look bad in Internet Explorer
In-Reply-To:
References:
Message-ID: <1179407515.26179.0.camel@blabla.mcs.anl.gov>

Could be either the syntax highlighting or IE being what it is. Try
disabling javascript and see if it helps.

Mihael

On Thu, 2007-05-17 at 13:03 +0000, Ben Clifford wrote:
> On at least one machine that I've seen, the code snippets at
> http://www.ci.uchicago.edu/swift/guides/tutorial.php come out all on one
> line.
Does that happen for anyone here with that browser? > From keahey at mcs.anl.gov Thu May 17 09:42:30 2007 From: keahey at mcs.anl.gov (Kate Keahey) Date: Thu, 17 May 2007 09:42:30 -0500 Subject: [Swift-devel] swift-on-ec2 In-Reply-To: <464B6746.7050907@mcs.anl.gov> References: <4649D280.5080906@mcs.anl.gov> <20070515154500.ad1600bf.tfreeman@mcs.anl.gov><464A24AF.7080801@cs.uchicago.edu> <464A8857.90800@mcs.anl.gov> <356127187-1179301594-cardhu_blackberry.rim.net-179336256-@bwe047-cell00.bisx.prod.on.blackberry> <464B1402.9040405@mcs.anl.gov> <1583680979-1179336692-cardhu_blackberry.rim.net-2135204572-@bwe032-cell00.bisx.prod.on.blackberry> <464B4A6E.2040804@mcs.anl.gov> <464B6746.7050907@mcs.anl.gov> Message-ID: <464C69D6.70909@mcs.anl.gov> Ian Foster wrote: > Kate: > > I want to emphasize that I was *not* dismissing the issues below as > distractions. > > What I meant was: given that you are working on developing a "virtual > cluster", which I am pretty sure will be able to execute Swift apps, > let's focus on getting that done, rather than worrying about "special > casing" it for Falkon, adding dynamic node acquisition, or the other > things that people started discussing as potential extensions. We only now really began to discuss how to use VMs with Swift/Falkon -- the original set of issues you posted was just what was needed, it clearly inspired a very good discussion, and made me realize that I should have been talking to a wider set of people about this. Please, don't go back on us now... It also looks to me like there may be solutions that will make more sense both from the perspective of the architecture and will also be easier to implement with the current state of virtualization tools. For example, if we can set up Falkon to provision single nodes operating in pull mode (pulling work from a "master") various contextualization issues will have become much easier. 
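To make the "pull mode" idea above concrete, here is a small in-process sketch. It is not Falkon's interface: the function name `run_pull_workers`, the use of threads, and the shared queue standing in for the "master" are all assumptions for illustration. The point it shows is that each worker only needs a handle on the master, while the master never holds a list of worker addresses.

```python
import queue
import threading


def run_pull_workers(tasks, n_workers):
    """Run zero-argument callables using workers that pull from one master queue.

    Hypothetical stand-in for pull-mode dispatch: the master is just a
    queue of pending tasks and knows nothing about who the workers are.
    Returns a list of (worker_id, result) pairs in completion order.
    """
    master = queue.Queue()
    for task in tasks:
        master.put(task)

    results = []
    results_lock = threading.Lock()

    def worker(worker_id):
        while True:
            try:
                task = master.get_nowait()  # the worker asks; nothing is pushed to it
            except queue.Empty:
                return  # no work left, so the worker exits
            outcome = task()
            with results_lock:
                results.append((worker_id, outcome))

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


if __name__ == "__main__":
    done = run_pull_workers([lambda n=n: n * n for n in range(8)], n_workers=3)
    print(sorted(outcome for _, outcome in done))
```

In a real deployment the queue would sit behind a network endpoint on the master, so a freshly started VM would need only that one address, whatever IP it was assigned itself.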
> > I understand from our IM conversation today that the "virtual cluster" > is ready for us in a "static environment" such as some machines in our > lab. In a "dynamic environment" such as EC2, it is not quite ready for > use yet. Thus, you won't be able to get Swift running on EC2 tomorrow. This is not quite accurate; static refers to statically assigned IPs -- we have control over our IPs and can assign them to the cluster nodes in the same way each time we deploy it. Amazon will choose new IPs for the nodes each time the cluster is deployed, so each time the configuration of the cluster will have to be adjusted to reflect different IP assignment to the nodes (but if we were to change the IPs on the cluster nodes in a local environment we would be just as dynamic). But if you deploy just one node (e.g., a node operating in the pull mode as in the example above) the need for this configuration adjustment may go away (depending on what the node does) so everything may become much simpler. We can spend some time looking at deploying a VM on EC2 if it is of interest (as well as deploying a VM via the workspace service if that is of interest), we can run things on the deployed VM, etc. But I *strongly* argue that we spend at least some time defining what we want from this project, what is realistic to have in the short-term, what will be hard/impossible/inconvenient and try to build it systematically. Then we can figure out who does what and by when this is going to be done. > > Ian. > > > Kate Keahey wrote: >> Ian, >> >> you seem to be referring to the necessary /etc/hosts configuration as >> well as workers registering with the torque headnode below as >> "distractions" -- I agree they can be very distracting, but in my >> experience without these distractions a cluster (virtual or physical) >> won't work in the way such clusters are typically expected to work. 
>> >> What I said in my mail is that we can set up a base cluster locally so >> that somebody like Ioan can finish the configuration (i.e., install >> Falkon on it). We will configure this cluster once and leave it >> deployed as long as needed. >> >> Once we have the front-end to EC2 working (which we don't have yet >> although we are close) we will deploy this cluster on EC2 and provide >> methods that will automate this last little bit of configuration that >> *always* has to be done on deployment. >> >> I also think it is quite important that we spend the time tomorrow >> discussing what exactly we are trying to do -- right now, it looks to >> me like it might make more sense to not use clusters (it will help >> with the "distractions" if we don't). >> >> I realize that you are eager for us to get things to run -- I am eager >> too, but I honestly think we will get there faster if we plan better. >> >> Ian Foster wrote: >>> Kate: >>> >>> I personally will be delighted if you could run the virtual cluster >>> on ec2 tomorrow. I know that there are lots of ways that you could >>> refine its config, local expts that could be performed, etc., but >>> perhaps we could try bypassing those things, which seem somewhat like >>> distractions to me? >>> >>> Ian >>> >>> >>> Sent via BlackBerry from T-Mobile -----Original Message----- >>> From: Kate Keahey >>> Date: Wed, 16 May 2007 09:24:02 To:itf at mcs.anl.gov >>> Cc:swift-devel-bounces at ci.uchicago.edu, Ioan Raicu >>> , swift-devel at ci.uchicago.edu, Borja >>> Sotomayor >>> Subject: Re: [Swift-devel] swift-on-ec2 >>> >>> >>> >>> Ian Foster wrote: >>>> Kate: >>>> >>>> If we configure the virtual cluster with a full LRM, as you propose >>>> (and it seems have already done--great work!), then we can use this >>>> to start Falkon executors--as we do today on regular clusters. So it >>>> seems to me that we have all we need. 
How about you and Ioan spend >>>> your time on Thursday running something on EC2, to make sure it sorks? >>> >>> As I suggest below, I think it would be easiest if we could deploy >>> and debug a small static cluster locally first, and we can probably >>> give it a shot tomorrow. We still don't have access to the Xen nodes >>> on TeraPort (although hopefully that might change by tomorrow) but I >>> asked Rick to rebuild a couple of nodes at ANL and he did, I think >>> for a test that should give us enough resources to play with. >>> >>> At the same time -- if there are multiple ways of doing this, and >>> perhaps better ways than simply using a virtual cluster, we should >>> discuss them now. It is not completely clear to me what the >>> relationship between Falkon and Swift is, and what the specific >>> objectives are (other than that dynamically provisioning resources is >>> required). It looks at this point like the objectives probably >>> overlap with what Ioan, Borja and I wanted to talk about (which I >>> thought was a separate project, but am thrilled to find out is >>> related) so how about we come up with a design tomorrow and post the >>> notes on this list (is this a good venue for that?) and then others >>> can shoot them down. >>> >>>> Regarding choice of LRM: have you looked at SGE? That is what quite >>>> a few others seem to be using. >>> >>> Yes, we have. We also collaborate with others who do, as well as with >>> Sun... As you may remember, Borja did the scheduling work for his >>> thesis in the context of SGE. Last time we talked though, Torque was >>> the scheduler of choice for the virtual cluster LRM so we used that. >>> >>> The usage of SGE you are referring to above -- is this in the context >>> of virtualization projects, or as LRM for various Falkon-related >>> applications? 
>>> >>>> Ian >>>> >>>> >>>> >>>> Sent via BlackBerry from T-Mobile -----Original Message----- >>>> From: Kate Keahey >>>> Date: Tue, 15 May 2007 23:28:07 To:iraicu at cs.uchicago.edu >>>> Cc:swift-devel at ci.uchicago.edu >>>> Subject: Re: [Swift-devel] swift-on-ec2 >>>> >>>> First -- this is a very useful discussion, would it be possible to >>>> see all of it. We need to understand the requirements and trade-offs >>>> in some detail to figure out the best way to make this work. I see a >>>> few different interaction threads somewhat mixed up here though so >>>> just to make sure we are all on the same wavelength, here is some >>>> context. >>>> >>>> Ian and I have been talking on and off about providing a workspace >>>> service implementation with EC2 backend. The benefit for that would >>>> be that users could deploy the same VMs using the same interface to >>>> either TeraPort or EC2 or yet another resource provider. The >>>> workspace service would also provide some features on top of EC2 >>>> (translating between PKI credentials and Amazon's paying accounts, >>>> contextualization as needed to make deployment dynamic). One >>>> application of interest for this was Swift. Last time we chatted >>>> about this though was in the context of using EC2 to provide a >>>> production platform for STAR runs (since virtualizing enough >>>> TeraPort to provide a production platform is taking a long time). >>>> This is what Tim and I are trying to make happen now. >>>> >>>> Since there was also interest in running Swift in VMs, Mike, Tibi >>>> and I met around February/March and agreed that a reasonable way to >>>> proceed will be for us to stand up a base virtual cluster somewhere >>>> locally (e.g., a static deployment on TeraPort) so that they can >>>> finish the configuration according to their needs, look at >>>> performance, figure out the best way to interact with it, and make >>>> sure that there are no VM-induced gotchas. 
All of this will be much >>>> easier to assess locally and on a static deployment. Then we'd make >>>> sure the cluster is dynamically deployable using the workspace >>>> service (on EC2 or whatever other provider). During the meeting (and >>>> over following emails) we agreed that the required "base cluster" >>>> would be configured with GRAM/Torque on the headnode plus a number >>>> of worker nodes, plus root privileges. We configured this cluster >>>> and it is ready to deploy. Are you saying now that in fact something >>>> different is needed? >>>> >>>> As Ian says, Borja and I were planning to meet with Ioan on Thursday >>>> to discuss interaction between Falkon and the workspace service (not >>>> necessarily/exclusively in the EC2 context). I don't completely >>>> understand the relationship between swift and falkon -- are there >>>> specific applications or scenarios that you are trying to target in >>>> this exercise? >>>> >>>> Ioan Raicu wrote: >>>>> Hi, >>>>> See below: >>>>> >>>>> Tim Freeman wrote: >>>>>> On Tue, 15 May 2007 16:20:03 +0000 (GMT) >>>>>> Ben Clifford wrote: >>>>>> >>>>>> >>>>>>> Ian asked about this elsewhere, but it's perhaps interesting for >>>>>>> swift-devel people to look at the questions too. >>>>>>> >>>>>>> On Tue, 15 May 2007, Ian Foster wrote: >>>>>>> >>>>>>> >>>>>>>> Dear All: >>>>>>>> >>>>>>> >>>>>>> >>>>>>>> I asked Kate if she and Tim could look into creating VM images >>>>>>>> that would allow us to run Swift applications on Amazon EC2. I >>>>>>>> think Kate is meeting with Ioan about this on Thursday (?). >>>>>>>> >>>>>>> >>>>>>> >>>>>>>> One issue that I thought would be good to discuss is what we'd >>>>>>>> want in that VM image. Perhaps this is obvious to the rest of >>>>>>>> you, but it isn't to me. A few thoughts: >>>>>>>> * I'm assuming that we want to run "workers" on EC2 nodes, >>>>>>>> and have the >>>>>>>> "task dispatch" logic run on some external frontend system >>>>>>>> outside EC2.
>>>>>>>> * I would think that we want to use Falkon to do the task >>>>>>>> dispatch. If so, >>>>>>>> we need a Falkon executor on each VM, configured to check in >>>>>>>> with the Falkon >>>>>>>> dispatcher. (Alternatively, we could use, say, SGE: in that >>>>>>>> case, we would >>>>>>>> want an SGE agent.) >>>>>>>> * We need a way of getting data to and from the worker >>>>>>>> nodes. Do we want to >>>>>>>> run a file system across the EC2 nodes and the external frontend >>>>>>>> node? That >>>>>>>> seems rather inefficient. Other options? >>>>>>>> * Should we preload the application code on each EC2 node? >>>>>>>> >>>>>>> Here's a couple of approaches: >>>>>>> >>>>>>> 1) swift regards all the EC2 nodes that we are paying for as a >>>>>>> single site. >>>>>>> >>>>>>> Something like falkon handles all the task dispatch and worker >>>>>>> node management. I don't know what that looks like at the moment >>>>>>> in Falkon, but the interface for Swift to send jobs into Falkon >>>>>>> sounds pretty straightforward and shouldn't need changing. >>>>>>> >>>>>> So if I understand, here there would be no gateway+LRM but each >>>>>> EC2 node + >>>>>> Falkon would need a port open to receive tasks? Or does each node >>>>>> pull down >>>>>> instructions OK from behind a firewall? >>>>>> >>>>> Falkon supports both polling and notifications. To use >>>>> notifications, there needs to be an open port on the worker :( >>>>>> Is there a latency problem with running each node as an independent >>>>>> task >>>>>> receiver with the dispatcher off-site from EC2? I would think it >>>>>> would be >>>>>> better to put the queues to fill with tasks on EC2 so it can more >>>>>> quickly get >>>>>> the task going when a node is done with a previous task (I may be >>>>>> missing some >>>>>> nuances here with respect to Falkon, don't know much about this >>>>>> yet!).
>>>>> We have run the Falkon dispatcher at UChicago and workers at ANL >>>>> without any issues, so it can easily tolerate a few ms of latency. >>>>> We haven't tried it across 10s of ms of latency links, but my >>>>> instinct says that if you have enough workers, you might be able to >>>>> hide the latency. We'd have to experiment with it to see what >>>>> happens. We could potentially do some experiments between SDSC and >>>>> ANL over a 50+ ms link, and see what difference in throughputs we get. >>>>> >>>>> Ioan >>>>>> If a gateway node is desired, this option sounds a lot like the >>>>>> GRAM+LRM >>>>>> situation we use on VMs with the workspace service and will soon >>>>>> use on EC2 via >>>>>> the workspace EC2 gateway we're implementing. Start up one >>>>>> gateway node and >>>>>> then add compute nodes which dynamically join the pool, they are >>>>>> pointed to the >>>>>> GRAM node. >>>>>> >>>>>> >>>>>>> All the nodes in a site are required by our site model to have a >>>>>>> shared filesystem - we've talked about removing it, but I think >>>>>>> that is still the case and if so, isn't going to change soon. >>>>>> Setting up a shared filesystem in this environment is akin to >>>>>> setting up the >>>>>> compute nodes to join an LRM pool. The VMs can communicate over >>>>>> the private >>>>>> network at EC2, you can instruct EC2 to let all the nodes be open >>>>>> to each other >>>>>> (while simultaneously keeping a separate policy of blocking ports >>>>>> from being >>>>>> open from the internet and other people's EC2 nodes). The >>>>>> non-file-serving >>>>>> nodes would simply need to know the private address of the >>>>>> filesystem server >>>>>> (unless you are using a fancier network file system than NFS-style >>>>>> ones). >>>>>> For background: every VM on EC2 currently gets a public address -- >>>>>> NAT'd to a >>>>>> private address which is actually what the VM's one NIC is >>>>>> configured with. 
>>>>>> There is a facility to open/forward specific network ports on the >>>>>> public >>>>>> address to each VM either via a group policy or on a VM by VM basis. >>>>>> >>>>>> [...] >>>>>>> Amazon also has a storage cloud, alongside its compute cloud. I >>>>>>> know very little about that and have never thought about how it >>>>>>> would fit into the above (if at all). Maybe someone else knows more. >>>>>>> >>>>>> A VM template on EC2 is called an AMI, which stands for Amazon >>>>>> Machine Image. >>>>>> This is just a packaging thing but what it mostly means is that >>>>>> the VM is >>>>>> stored on S3 and also registered into the EC2 system. >>>>>> >>>>>> When starting an instance of an AMI, the file is copied from S3 to >>>>>> the >>>>>> hypervisor node (what we call propagation in the workspace >>>>>> service). After it >>>>>> is used, this file is deleted (an option in the workspace service >>>>>> but there is >>>>>> also an option to save it back with any changes). So the VMs are >>>>>> stored in S3 but anything that happens on them after being >>>>>> started is lost unless you manually do something about it. >>>>>> >>>>>> As for free scratch space, you get a good amount per node, 140G. >>>>>> But the node >>>>>> could go down at any moment just like a physical resource. >>>>>> >>>>>> To harness S3 for safely persisting any data (or if you need more >>>>>> space) you >>>>>> would need to actually run S3 clients on the VMs when they are run >>>>>> on EC2. You >>>>>> could alternatively mirror data between nodes assuming that all >>>>>> would not go >>>>>> down at once. >>>>>> The good thing is that you do not pay transfer costs between S3 >>>>>> and EC2 if you >>>>>> choose to use S3 for big storage; you would only pay the "housing >>>>>> fees" so to >>>>>> speak.
>>>>>> Tim >>>>>> _______________________________________________ >>>>>> Swift-devel mailing list >>>>>> Swift-devel at ci.uchicago.edu >>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>>>>> >>>>>> >>> >> > -- Kate Keahey, Mathematics & CS Division, Argonne National Laboratory Computation Institute, University of Chicago From itf at mcs.anl.gov Thu May 17 10:14:16 2007 From: itf at mcs.anl.gov (Ian Foster) Date: Thu, 17 May 2007 15:14:16 +0000 Subject: [Swift-devel] swift-on-ec2 In-Reply-To: <464C69D6.70909@mcs.anl.gov> References: <4649D280.5080906@mcs.anl.gov> <20070515154500.ad1600bf.tfreeman@mcs.anl.gov><464A24AF.7080801@cs.uchicago.edu> <464A8857.90800@mcs.anl.gov> <356127187-1179301594-cardhu_blackberry.rim.net-179336256-@bwe047-cell00.bisx.prod.on.blackberry> <464B1402.9040405@mcs.anl.gov> <1583680979-1179336692-cardhu_blackberry.rim.net-2135204572-@bwe032-cell00.bisx.prod.on.blackberry> <464B4A6E.2040804@mcs.anl.gov> <464B6746.7050907@mcs.anl.gov> <464C69D6.70909@mcs.anl.gov> Message-ID: <1350594402-1179414977-cardhu_blackberry.rim.net-526979985-@bwe005-cell00.bisx.prod.on.blackberry> If the discussion is useful then by all means continue it. I was concerned that if you were very close to having a virtual cluster that would work for us, then taking time to create a different virtual cluster design would slow us down. But maybe that won't happen. Ian
From iraicu at cs.uchicago.edu Thu May 17 11:10:16 2007 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Thu, 17 May 2007 11:10:16 -0500 Subject: [Swift-devel] swift-on-ec2 In-Reply-To: <464C69D6.70909@mcs.anl.gov> References: <4649D280.5080906@mcs.anl.gov> <20070515154500.ad1600bf.tfreeman@mcs.anl.gov><464A24AF.7080801@cs.uchicago.edu> <464A8857.90800@mcs.anl.gov> <356127187-1179301594-cardhu_blackberry.rim.net-179336256-@bwe047-cell00.bisx.prod.on.blackberry> <464B1402.9040405@mcs.anl.gov> <1583680979-1179336692-cardhu_blackberry.rim.net-2135204572-@bwe032-cell00.bisx.prod.on.blackberry> <464B4A6E.2040804@mcs.anl.gov> <464B6746.7050907@mcs.anl.gov> <464C69D6.70909@mcs.anl.gov> Message-ID: <464C7E68.1030400@cs.uchicago.edu> Kate Keahey wrote: > > > Ian Foster wrote: >> Kate: >> >> I want to emphasize that I was *not* dismissing the issues below as >> distractions. >> >> What I meant was: given that you are working on developing a "virtual >> cluster", which I am pretty sure will be able to execute Swift apps, >> let's focus on getting that done, rather than worrying about "special >> casing" it for Falkon, adding dynamic node acquisition, or the other >> things that people started discussing as potential extensions. > > We only now really began to discuss how to use VMs with Swift/Falkon > -- the original set of issues you posted was just what was needed, it > clearly inspired a very good discussion, and made me realize that I > should have been talking to a wider set of people about this. Please, > don't go back on us now...
It also looks to me like there may be > solutions that will make more sense both from the perspective of the > architecture and will also be easier to implement with the current > state of virtualization tools. For example, if we can set up Falkon to > provision single nodes operating in pull mode (pulling work from a > "master") various contextualization issues will have become much easier. > >> >> I understand from our IM conversation today that the "virtual >> cluster" is ready for us in a "static environment" such as some >> machines in our lab. In a "dynamic environment" such as EC2, it is >> not quite ready for use yet. Thus, you won't be able to get Swift >> running on EC2 tomorrow. > > This is not quite accurate; static refers to statically assigned IPs > -- we have control over our IPs and can assign them to the cluster > nodes in the same way each time we deploy it. Amazon will choose new > IPs for the nodes each time the cluster is deployed, so each time the > configuration of the cluster will have to be adjusted to reflect > different IP assignment to the nodes (but if we were to change the IPs > on the cluster nodes in a local environment we would be just as dynamic). > > But if you deploy just one node (e.g., a node operating in the pull > mode as in the example above) the need for this configuration > adjustment may go away (depending on what the node does) so everything > may become much simpler. Currently, a Falkon executor (the worker code) makes one WS call to the Falkon dispatcher (running in a GT4 container) when it bootstraps, to register its name and the port on which its notification engine is listening. Once this is done, the executor goes into a listen mode and only acts (sends WS calls out) when it receives a notification.
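A rough sketch of the registration step just described, in Python. This is only an illustration under stated assumptions, not Falkon's actual code: the Registration and Dispatcher classes, their field names, and bootstrap_executor are hypothetical, and a plain in-process object stands in for the WS call to the dispatcher in the GT4 container. The listener binds to the loopback address so that the sketch runs anywhere; a real executor would report the address the VM was actually given.

```python
import socket
from dataclasses import dataclass, field


@dataclass
class Registration:
    """What an executor reports once at startup (field names are hypothetical)."""
    name: str  # how the executor identifies itself
    host: str  # address the dispatcher should contact
    port: int  # port where the executor listens for notifications


@dataclass
class Dispatcher:
    """Stand-in for the dispatcher; the real one is a web service, not a local object."""
    executors: dict = field(default_factory=dict)

    def register(self, reg: Registration) -> None:
        # Remember where notifications for this executor must be sent.
        self.executors[reg.name] = (reg.host, reg.port)


def bootstrap_executor(name: str, dispatcher: Dispatcher) -> socket.socket:
    """Open a notification listener, then make the single registration call."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick any free port
    listener.listen()
    host, port = listener.getsockname()  # the address and port actually assigned
    dispatcher.register(Registration(name, host, port))
    # From here on the executor would only wait for notifications on `listener`.
    return listener
```

The executor reports whatever address and port it ends up with, so nothing about the worker has to be known to the dispatcher before the worker starts.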
So, the VMs that run the Falkon executors can get DHCP addresses, and the registration message will include all the information the Falkon dispatcher needs to contact the respective Falkon executor! Now, the one configuration parameter that we must have is the location of the Falkon dispatcher. If we have it running in a static location (a well-known machine and port), then this can be hard-coded into the bootstrapping scripts, and no further configuration is needed! If the dispatcher does not have a static resource to run on (i.e. it runs in another VM), then this information needs to be passed to the executor bootstrapping scripts! Ioan > > We can spend some time looking at deploying a VM on EC2 if it is of > interest (as well as deploying a VM via the workspace service if that > is of interest), we can run things on the deployed VM, etc. But I > *strongly* argue that we spend at least some time defining what we > want from this project, what is realistic to have in the short-term, > what will be hard/impossible/inconvenient and try to build it > systematically. Then we can figure out who does what and by when this > is going to be done. > > >> >> Ian. >> >> >> Kate Keahey wrote: >>> Ian, >>> >>> you seem to be referring to the necessary /etc/hosts configuration >>> as well as workers registering with the torque headnode below as >>> "distractions" -- I agree they can be very distracting, but in my >>> experience without these distractions a cluster (virtual or >>> physical) won't work in the way such clusters are typically expected >>> to work. >>> >>> What I said in my mail is that we can set up a base cluster locally >>> so that somebody like Ioan can finish the configuration (i.e., >>> install Falkon on it). We will configure this cluster once and leave >>> it deployed as long as needed.
>>> >>> Once we have the front-end to EC2 working (which we don't have yet >>> although we are close) we will deploy this cluster on EC2 and >>> provide methods that will automate this last little bit of >>> configuration that *always* has to be done on deployment. >>> >>> I also think it is quite important that we spend the time tomorrow >>> discussing what exactly we are trying to do -- right now, it looks >>> to me like it might make more sense to not use clusters (it will >>> help with the "distractions" if we don't). >>> >>> I realize that you are eager for us to get things to run -- I am >>> eager too, but I honestly think we will get there faster if we plan >>> better. >>> >>> Ian Foster wrote: >>>> Kate: >>>> >>>> I personally will be delighted if you could run the virtual cluster >>>> on EC2 tomorrow. I know that there are lots of ways that you could >>>> refine its config, local expts that could be performed, etc., but >>>> perhaps we could try bypassing those things, which seem somewhat >>>> like distractions to me? >>>> >>>> Ian >>>> >>>> >>>> Sent via BlackBerry from T-Mobile -----Original Message----- >>>> From: Kate Keahey >>>> Date: Wed, 16 May 2007 09:24:02 To:itf at mcs.anl.gov >>>> Cc:swift-devel-bounces at ci.uchicago.edu, Ioan Raicu >>>> , swift-devel at ci.uchicago.edu, Borja >>>> Sotomayor >>>> Subject: Re: [Swift-devel] swift-on-ec2 >>>> >>>> >>>> >>>> Ian Foster wrote: >>>>> Kate: >>>>> >>>>> If we configure the virtual cluster with a full LRM, as you >>>>> propose (and it seems you have already done--great work!), then we can >>>>> use this to start Falkon executors--as we do today on regular >>>>> clusters. So it seems to me that we have all we need. How about >>>>> you and Ioan spend your time on Thursday running something on EC2, >>>>> to make sure it works? >>>> >>>> As I suggest below, I think it would be easiest if we could deploy >>>> and debug a small static cluster locally first, and we can probably >>>> give it a shot tomorrow.
We still don't have access to the Xen >>>> nodes on TeraPort (although hopefully that might change by >>>> tomorrow) but I asked Rick to rebuild a couple of nodes at ANL and >>>> he did, I think for a test that should give us enough resources to >>>> play with. >>>> >>>> At the same time -- if there are multiple ways of doing this, and >>>> perhaps better ways than simply using a virtual cluster, we should >>>> discuss them now. It is not completely clear to me what the >>>> relationship between Falkon and Swift is, and what the specific >>>> objectives are (other than that dynamically provisioning resources >>>> is required). It looks at this point like the objectives probably >>>> overlap with what Ioan, Borja and I wanted to talk about (which I >>>> thought was a separate project, but am thrilled to find out is >>>> related) so how about we come up with a design tomorrow and post >>>> the notes on this list (is this a good venue for that?) and then >>>> others can shoot them down. >>>> >>>>> Regarding choice of LRM: have you looked at SGE? That is what >>>>> quite a few others seem to be using. >>>> >>>> Yes, we have. We also collaborate with others who do, as well as >>>> with Sun... As you may remember, Borja did the scheduling work for >>>> his thesis in the context of SGE. Last time we talked though, >>>> Torque was the scheduler of choice for the virtual cluster LRM so >>>> we used that. >>>> >>>> The usage of SGE you are referring to above -- is this in the >>>> context of virtualization projects, or as LRM for various >>>> Falkon-related applications? >>>> >>>>> Ian >>>>> >>>>> >>>>> >>>>> Sent via BlackBerry from T-Mobile -----Original Message----- >>>>> From: Kate Keahey >>>>> Date: Tue, 15 May 2007 23:28:07 To:iraicu at cs.uchicago.edu >>>>> Cc:swift-devel at ci.uchicago.edu >>>>> Subject: Re: [Swift-devel] swift-on-ec2 >>>>> >>>>> First -- this is a very useful discussion, would it be possible to >>>>> see all of it. 
We need to understand the requirements and >>>>> trade-offs in some detail to figure out the best way to make this >>>>> work. I see a few different interaction threads somewhat mixed up >>>>> here though so just to make sure we are all on the same >>>>> wavelength, here is some context. >>>>> >>>>> Ian and I have been talking on and off about providing a workspace >>>>> service implementation with EC2 backend. The benefit for that >>>>> would be that users could deploy the same VMs using the same >>>>> interface to either TeraPort or EC2 or yet another resource >>>>> provider. The workspace service would also provide some features >>>>> on top of EC2 (translating between PKI credentials and Amazon's >>>>> paying accounts, contextualization as needed to make deployment >>>>> dynamic). One application of interest for this was Swift. Last >>>>> time we chatted about this though was in the context of using EC2 >>>>> to provide a production platform for STAR runs (since virtualizing >>>>> enough TeraPort to provide a production platform is taking a long >>>>> time). This is what Tim and I are trying to make happen now. >>>>> >>>>> Since there was also interest in running Swift in VMs, Mike, Tibi >>>>> and I met around February/March and agreed that a reasonable way >>>>> to proceed will be for us to stand up a base virtual cluster >>>>> somewhere locally (e.g., a static deployment on TeraPort) so that >>>>> they can finish the configuration according to their needs, look >>>>> at performance, figure out the best way to interact with it, and >>>>> make sure that there are no VM-induced gotchas. All of this will >>>>> be much easier to assess locally and on a static deployment. Then >>>>> we'd make sure the cluster is dynamically deployable using the >>>>> workspace service (on EC2 or whatever other provider). 
During the >>>>> meeting (and over following emails) we agreed that the required >>>>> "base cluster" would be configured with GRAM/Torque on the >>>>> headnode plus a number of worker nodes, plus root privileges. We >>>>> configured this cluster and it is ready to deploy. Are you saying >>>>> now that in fact something different is needed? >>>>> >>>>> As Ian says, Borja and I were planning to meet with Ioan on >>>>> Thursday to discuss interaction between Falkon and the workspace >>>>> service (not necessarily/exclusively in the EC2 context). I don't >>>>> completely understand the relationship between swift and falkon -- >>>>> are there specific applications or scenarios that you are trying >>>>> to target in this exercise? >>>>> >>>>> Ioan Raicu wrote: >>>>>> Hi, >>>>>> See below: >>>>>> >>>>>> Tim Freeman wrote: >>>>>>> On Tue, 15 May 2007 16:20:03 +0000 (GMT) >>>>>>> Ben Clifford wrote: >>>>>>> >>>>>>> >>>>>>>> Ian asked about this elsewhere, but its perhaps interesting for >>>>>>>> swift-devel people to look at the questions too. >>>>>>>> >>>>>>>> On Tue, 15 May 2007, Ian Foster wrote: >>>>>>>> >>>>>>>> >>>>>>>>> Dear All: >>>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>> I asked Kate if she and Tim could look into creating VM images >>>>>>>>> that would allow us to run Swift applications on Amazon EC2. I >>>>>>>>> think Kate is meeting with Ioan about this on Thursday (?). >>>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>> One issue that I thought would be good to discuss is what we'd >>>>>>>>> want in that VM image. Perhaps this is obvious to the rest of >>>>>>>>> you, but it isn't to me. A few thoughts: >>>>>>>>> * I'm assuming that we want to run "workers" on EC2 >>>>>>>>> nodes, and have the >>>>>>>>> "task dispatch" logic run on some external frontend system >>>>>>>>> outside EC2. >>>>>>>>> * I would think that we want to use Falkon to do the >>>>>>>>> task dispatch. 
If so, >>>>>>>>> we need a Falkon executor on each VM, configured to check in >>>>>>>>> with the Falkon >>>>>>>>> dispatcher. (Alternatively, we could use, say, SGE: in that >>>>>>>>> case, we would >>>>>>>>> want an SGE agent.) >>>>>>>>> * We need a way of getting data to and from the worker >>>>>>>>> nodes. Do we want to >>>>>>>>> run a file system across the EC2 nodes and the external >>>>>>>>> frontend node? That >>>>>>>>> seems rather inefficient. Other options? >>>>>>>>> * Should we preload the application code on each EC2 node? >>>>>>>>> >>>>>>>> Here's a couple of approaches: >>>>>>>> >>>>>>>> 1) swift regards all the EC2 nodes that we are paying for as a >>>>>>>> single site. >>>>>>>> >>>>>>>> Something like falkon handles all the task dispatch and worker >>>>>>>> node management. I don't know what that looks like at the >>>>>>>> moment in Falkon, but the interface for Swift to send jobs into >>>>>>>> Falkon sounds pretty straightforward and shouldn't need changing. >>>>>>>> >>>>>>> So if I understand, here there would be no gateway+LRM but each >>>>>>> EC2 node + >>>>>>> Falkon would need a port open to receive tasks? Or does each >>>>>>> node pull down >>>>>>> instructions OK from behind a firewall? >>>>>>> >>>>>> Falkon supports both polling and notifications. To use >>>>>> notifications, there needs to be an open port on the worker :( >>>>>>> Is there a latency problem with running each node as an >>>>>>> independent task >>>>>>> receiver with the dispatcher off-site from EC2? I would think >>>>>>> it would be >>>>>>> better to put the queues to fill with tasks on EC2 so it can >>>>>>> more quickly get >>>>>>> the task going when a node is done with a previous task (I may >>>>>>> be missing some >>>>>>> nuances here with respect to Falkon, don't know much about this >>>>>>> yet!). >>>>>> We have run the Falkon dispatcher at UChicago and workers at ANL >>>>>> without any issues, so it can easily tolerate a few ms of >>>>>> latency.
We haven't tried it across 10s of ms of latency links, >>>>>> but my instinct says that if you have enough workers, you might >>>>>> be able to hide the latency. We'd have to experiment with it to >>>>>> see what happens. We could potentially do some experiments >>>>>> between SDSC and ANL over a 50+ ms link, and see what difference >>>>>> in throughputs we get. >>>>>> >>>>>> Ioan >>>>>>> If a gateway node is desired, this option sounds a lot like the >>>>>>> GRAM+LRM >>>>>>> situation we use on VMs with the workspace service and will soon >>>>>>> use on EC2 via >>>>>>> the workspace EC2 gateway we're implementing. Start up one >>>>>>> gateway node and >>>>>>> then add compute nodes which dynamically join the pool, they are >>>>>>> pointed to the >>>>>>> GRAM node. >>>>>>> >>>>>>> >>>>>>>> All the nodes in a site are required by our site model to have >>>>>>>> a shared filesystem - we've talked about removing it, but I >>>>>>>> think that is still the case and if so, isn't going to change >>>>>>>> soon. >>>>>>> Setting up a shared filesystem in this environment is akin to >>>>>>> setting up the >>>>>>> compute nodes to join an LRM pool. The VMs can communicate over >>>>>>> the private >>>>>>> network at EC2, you can instruct EC2 to let all the nodes be >>>>>>> open to each other >>>>>>> (while simultaneously keeping a separate policy of blocking >>>>>>> ports from being >>>>>>> open from the internet and other people's EC2 nodes). The >>>>>>> non-file-serving >>>>>>> nodes would simply need to know the private address of the >>>>>>> filesystem server >>>>>>> (unless you are using a fancier network file system than >>>>>>> NFS-style ones). >>>>>>> For background: every VM on EC2 currently gets a public address >>>>>>> -- NAT'd to a >>>>>>> private address which is actually what the VM's one NIC is >>>>>>> configured with. 
>>>>>>> There is a facility to open/forward specific network ports on >>>>>>> the public >>>>>>> address to each VM either via a group policy or on a VM by VM >>>>>>> basis. >>>>>>> >>>>>>> [...] >>>>>>>> Amazon also has a storage cloud, alongside its compute cloud. I >>>>>>>> know very little about that and have never thought about how it >>>>>>>> would fit into the above (if at all). Maybe someone else knows >>>>>>>> more. >>>>>>>> >>>>>>> A VM template on EC2 is called an AMI which stands for Amazon >>>>>>> Machine Image. >>>>>>> This is just a packaging thing but what it mostly means is that >>>>>>> the VM is >>>>>>> stored on S3 and also registered into the EC2 system. >>>>>>> >>>>>>> When starting an instance of an AMI, the file is copied from S3 >>>>>>> to the >>>>>>> hypervisor node (what we call propagation in the workspace >>>>>>> service). After it >>>>>>> is used, this file is deleted (an option in the workspace >>>>>>> service but there is >>>>>>> also an option to save it back with any changes). So the VMs are >>>>>>> stored in S3 but anything that happens on them after being >>>>>>> started is lost unless you manually do something about it. >>>>>>> >>>>>>> As for free scratch space, you get a good amount per node, >>>>>>> 140G. But the node >>>>>>> could go down at any moment just like a physical resource. >>>>>>> >>>>>>> To harness S3 for safely persisting any data (or if you need >>>>>>> more space) you >>>>>>> would need to actually run S3 clients on the VMs when they are >>>>>>> run on EC2. You >>>>>>> could alternatively mirror data between nodes assuming that all >>>>>>> would not go >>>>>>> down at once. >>>>>>> The good thing is that you do not pay transfer costs between S3 >>>>>>> and EC2 if you >>>>>>> chose to use S3 for big storage, you would only pay the "housing >>>>>>> fees" so to >>>>>>> speak. 
>>>>>>> Tim >>>>>>> _______________________________________________ >>>>>>> Swift-devel mailing list >>>>>>> Swift-devel at ci.uchicago.edu >>>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>>>>>> >>>>>>> >>>> >>> >> > -- ============================================ Ioan Raicu Ph.D. Student ============================================ Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 ============================================ Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dsl.cs.uchicago.edu/ ============================================ ============================================ From tfreeman at mcs.anl.gov Thu May 17 11:24:49 2007 From: tfreeman at mcs.anl.gov (Tim Freeman) Date: Thu, 17 May 2007 11:24:49 -0500 Subject: [Swift-devel] swift-on-ec2 In-Reply-To: <464C7E68.1030400@cs.uchicago.edu> References: <4649D280.5080906@mcs.anl.gov> <20070515154500.ad1600bf.tfreeman@mcs.anl.gov> <464A24AF.7080801@cs.uchicago.edu> <464A8857.90800@mcs.anl.gov> <356127187-1179301594-cardhu_blackberry.rim.net-179336256-@bwe047-cell00.bisx.prod.on.blackberry> <464B1402.9040405@mcs.anl.gov> <1583680979-1179336692-cardhu_blackberry.rim.net-2135204572-@bwe032-cell00.bisx.prod.on.blackberry> <464B4A6E.2040804@mcs.anl.gov> <464B6746.7050907@mcs.anl.gov> <464C69D6.70909@mcs.anl.gov> <464C7E68.1030400@cs.uchicago.edu> Message-ID: <20070517112449.3a856f70.tfreeman@mcs.anl.gov> On Thu, 17 May 2007 11:10:16 -0500 Ioan Raicu wrote: > > > Kate Keahey wrote: > > > > > > Ian Foster wrote: > >> Kate: > >> > >> I want to emphasize that I was *not* dismissing the issues below as > >> distractions. 
> >> > >> What I meant was: given that you are working on developing a "virtual > >> cluster", which I am pretty sure will be able to execute Swift apps, > >> let's focus on getting that done, rather than worrying about "special > >> casing" it for Falkon, adding dynamic node acquisition, or the other > >> things that people started discussing as potential extensions. > > > > We only now really began to discuss how to use VMs with Swift/Falkon > > -- the original set of issues you posted was just what was needed, it > > clearly inspired a very good discussion, and made me realize that I > > should have been talking to a wider set of people about this. Please, > > don't go back on us now... It also looks to me like there may be > > solutions that will make more sense both from the perspective of the > > architecture and will also be easier to implement with the current > > state of virtualization tools. For example, if we can set up Falkon to > > provision single nodes operating in pull mode (pulling work from a > > "master") various contextualization issues will have become much easier. > > > >> > >> I understand from our IM conversation today that the "virtual > >> cluster" is ready for us in a "static environment" such as some > >> machines in our lab. In a "dynamic environment" such as EC2, it is > >> not quite ready for use yet. Thus, you won't be able to get Swift > >> running on EC2 tomorrow. > > > > This is not quite accurate; static refers to statically assigned IPs > > -- we have control over our IPs and can assign them to the cluster > > nodes in the same way each time we deploy it. Amazon will choose new > > IPs for the nodes each time the cluster is deployed, so each time the > > configuration of the cluster will have to be adjusted to reflect > > different IP assignment to the nodes (but if we were to change the IPs > > on the cluster nodes in a local environment we would be just as dynamic). 
> > > > But if you deploy just one node (e.g., a node operating in the pull > > mode as in the example above) the need for this configuration > > adjustment may go away (depending on what the node does) so everything > > may become much simpler. > Currently, a Falkon executor (the worker code) upon bootstrapping, makes > 1 WS call to the Falkon dispatcher (running in a GT4 container) to > register its name and the port on which the notification engine is > listening on. Once this is done, the executors go into a listen mode > for notifications, and only acts (send WS calls out) upon the reception > of notifications. So, the VMs that run the Falkon executors can get > DHCP addresses, and the registration message will include all the > necessary information about where the Falkon dispatcher needs to contact > the respective Falkon executor On EC2 the VM has a private address with a corresponding public one that it can discover (through very EC2-specific mechanisms). We've been working on abstractions and software for doing this in a non ad-hoc way. I'll let Kate expound at your meeting. > Now, the one configuration parameter > that we must have is the location of the Falkon dispatcher. If we have > it running in a static location (a well known machine and port), then > this can be hard coded into the bootstrapping scripts, and there is no > configuration needed! If the dispatcher does not have a static resource > to run on (i.e. it runs in another VM), then this information needs to > be passed to the executor bootstrapping scripts Through those EC2-specific mechanisms you can push per VM instance deployment and the VM instance can be coded to discover this bit of information just like its public IP. Tying VMs + grid computing to EC2 specific mechanisms is the totally wrong way to go, but it may be necessary to case for it specifically in the VM's boot + contextualization process since we (the grid computing people) don't control the middleware there. 
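As a rough illustration of the executor bootstrap discussed in this thread, the sketch below shows the two pieces of information involved: where the dispatcher lives (either a hard-coded well-known location, or a value handed to the VM at deployment time), and the registration message carrying the executor's reachable address and notification port. This is only a sketch under stated assumptions, not Falkon's actual code or API: Falkon itself registers through a web-service call to a GT4 container, and the names `FALKON_DISPATCHER`, `DEFAULT_DISPATCHER`, `resolve_dispatcher` and `build_registration` are all hypothetical.

```python
import os
import socket
from dataclasses import dataclass

# Hypothetical well-known dispatcher location, used only when nothing is passed in.
DEFAULT_DISPATCHER = ("dispatcher.example.org", 50001)


@dataclass(frozen=True)
class Registration:
    executor_host: str      # address the dispatcher should use to reach this executor
    notification_port: int  # port the executor's notification listener is bound to


def resolve_dispatcher(env=None):
    """Return (host, port) of the dispatcher.

    A hypothetical FALKON_DISPATCHER=host:port setting (e.g. injected when the
    VM is deployed) wins; otherwise the static default is used.
    """
    env = os.environ if env is None else env
    value = env.get("FALKON_DISPATCHER", "").strip()
    if not value:
        return DEFAULT_DISPATCHER
    host, sep, port = value.rpartition(":")
    if not sep or not host or not port.isdigit():
        raise ValueError(f"expected host:port, got {value!r}")
    return host, int(port)


def build_registration(notification_port, public_host=None):
    """Build the one-off registration message sent at bootstrap.

    public_host is whatever address the VM discovered for itself (on EC2 this
    would be the public address rather than the NAT'd private one); if it is
    not given, the local hostname is used as a fallback.
    """
    host = public_host or socket.gethostname()
    return Registration(executor_host=host, notification_port=notification_port)
```

With a static dispatcher, `resolve_dispatcher({})` simply returns the hard-coded default and no per-deployment configuration is needed; when the dispatcher itself runs in a VM, the deployment step would have to supply the `host:port` value.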
Tim From tfreeman at mcs.anl.gov Thu May 17 11:26:37 2007 From: tfreeman at mcs.anl.gov (Tim Freeman) Date: Thu, 17 May 2007 11:26:37 -0500 Subject: [Swift-devel] swift-on-ec2 In-Reply-To: <20070517112449.3a856f70.tfreeman@mcs.anl.gov> References: <4649D280.5080906@mcs.anl.gov> <20070515154500.ad1600bf.tfreeman@mcs.anl.gov> <464A24AF.7080801@cs.uchicago.edu> <464A8857.90800@mcs.anl.gov> <356127187-1179301594-cardhu_blackberry.rim.net-179336256-@bwe047-cell00.bisx.prod.on.blackberry> <464B1402.9040405@mcs.anl.gov> <1583680979-1179336692-cardhu_blackberry.rim.net-2135204572-@bwe032-cell00.bisx.prod.on.blackberry> <464B4A6E.2040804@mcs.anl.gov> <464B6746.7050907@mcs.anl.gov> <464C69D6.70909@mcs.anl.gov> <464C7E68.1030400@cs.uchicago.edu> <20070517112449.3a856f70.tfreeman@mcs.anl.gov> Message-ID: <20070517112637.70ae6c9f.tfreeman@mcs.anl.gov> On Thu, 17 May 2007 11:24:49 -0500 Tim Freeman wrote: > > Through those EC2-specific mechanisms you can push per VM instance deployment s/per VM instance deployment/per-VM-instance deployment information/ Tim From tiberius at ci.uchicago.edu Thu May 17 11:29:55 2007 From: tiberius at ci.uchicago.edu (Tiberiu Stef-Praun) Date: Thu, 17 May 2007 11:29:55 -0500 Subject: [Swift-devel] swift-on-ec2 In-Reply-To: <20070517112637.70ae6c9f.tfreeman@mcs.anl.gov> References: <4649D280.5080906@mcs.anl.gov> <356127187-1179301594-cardhu_blackberry.rim.net-179336256-@bwe047-cell00.bisx.prod.on.blackberry> <464B1402.9040405@mcs.anl.gov> <1583680979-1179336692-cardhu_blackberry.rim.net-2135204572-@bwe032-cell00.bisx.prod.on.blackberry> <464B4A6E.2040804@mcs.anl.gov> <464B6746.7050907@mcs.anl.gov> <464C69D6.70909@mcs.anl.gov> <464C7E68.1030400@cs.uchicago.edu> <20070517112449.3a856f70.tfreeman@mcs.anl.gov> <20070517112637.70ae6c9f.tfreeman@mcs.anl.gov> Message-ID: Since there are dependencies in setting up the Falkon-enabled cluster (essentially passing the IP of the headnode to the workers), maybe we can have a Swift workflow that starts up the
whole EC2 grid shebang Tibi On 5/17/07, Tim Freeman wrote: > On Thu, 17 May 2007 11:24:49 -0500 > Tim Freeman wrote: > > > > > Through those EC2-specific mechanisms you can push per VM instance deployment > > s/per VM instance deployment/per-VM-instance deployment information/ > > Tim > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > -- Tiberiu (Tibi) Stef-Praun, PhD Research Staff, Computation Institute 5640 S. Ellis Ave, #405 University of Chicago http://www-unix.mcs.anl.gov/~tiberius/ From iraicu at cs.uchicago.edu Thu May 17 11:51:10 2007 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Thu, 17 May 2007 11:51:10 -0500 Subject: [Swift-devel] swift-on-ec2 In-Reply-To: References: <4649D280.5080906@mcs.anl.gov> <356127187-1179301594-cardhu_blackberry.rim.net-179336256-@bwe047-cell00.bisx.prod.on.blackberry> <464B1402.9040405@mcs.anl.gov> <1583680979-1179336692-cardhu_blackberry.rim.net-2135204572-@bwe032-cell00.bisx.prod.on.blackberry> <464B4A6E.2040804@mcs.anl.gov> <464B6746.7050907@mcs.anl.gov> <464C69D6.70909@mcs.anl.gov> <464C7E68.1030400@cs.uchicago.edu> <20070517112449.3a856f70.tfreeman@mcs.anl.gov> <20070517112637.70ae6c9f.tfreeman@mcs.anl.gov> Message-ID: <464C87FE.5050006@cs.uchicago.edu> If Swift can do it (through an LRM presumably), then Falkon could do it as well! This should either be done from Falkon, or from the workspace service itself, but not from Swift... 
Ioan Tiberiu Stef-Praun wrote: > Since there are dependencies in setting up the Falcon-enabled cluster > (essentially passing the IP of the headnode to the workers), maybe we > can have a Swift workflow that start up the whole EC2 grid shebang > > Tibi > > On 5/17/07, Tim Freeman wrote: >> On Thu, 17 May 2007 11:24:49 -0500 >> Tim Freeman wrote: >> >> > >> > Through those EC2-specific mechanisms you can push per VM instance >> deployment >> >> s/per VM instance deployment/per-VM-instance deployment information/ >> >> Tim >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> > > -- ============================================ Ioan Raicu Ph.D. Student ============================================ Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 ============================================ Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dsl.cs.uchicago.edu/ ============================================ ============================================ From benc at hawaga.org.uk Thu May 17 12:04:05 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 17 May 2007 17:04:05 +0000 (GMT) Subject: [Swift-devel] swift-on-ec2 In-Reply-To: <464C87FE.5050006@cs.uchicago.edu> References: <4649D280.5080906@mcs.anl.gov> <356127187-1179301594-cardhu_blackberry.rim.net-179336256-@bwe047-cell00.bisx.prod.on.blackberry> <464B1402.9040405@mcs.anl.gov> <1583680979-1179336692-cardhu_blackberry.rim.net-2135204572-@bwe032-cell00.bisx.prod.on.blackberry> <464B4A6E.2040804@mcs.anl.gov> <464B6746.7050907@mcs.anl.gov> <464C69D6.70909@mcs.anl.gov> <464C7E68.1030400@cs.uchicago.edu> <20070517112449.3a856f70.tfreeman@mcs.anl.gov> <20070517112637.70ae6c9f.tfreeman@mcs.anl.gov> <464C87FE.5050006@cs.uchicago.edu> Message-ID: Management of remote virtual machine 
start-and-config on EC2 strikes me as being almost entirely out of scope for both swift and falkon... -- From itf at mcs.anl.gov Thu May 17 12:11:21 2007 From: itf at mcs.anl.gov (=?UTF-8?B?SWFuIEZvc3Rlcg==?=) Date: Thu, 17 May 2007 17:11:21 +0000 Subject: [Swift-devel] swift-on-ec2 In-Reply-To: References: <4649D280.5080906@mcs.anl.gov> <356127187-1179301594-cardhu_blackberry.rim.net-179336256-@bwe047-cell00.bisx.prod.on.blackberry><464B1402.9040405@mcs.anl.gov> <1583680979-1179336692-cardhu_blackberry.rim.net-2135204572-@bwe032-cell00.bisx.prod.on.blackberry><464B4A6E.2040804@mcs.anl.gov> <464B6746.7050907@mcs.anl.gov> <464C69D6.70909@mcs.anl.gov> <464C7E68.1030400@cs.uchicago.edu> <20070517112449.3a856f70.tfreeman@mcs.anl.gov><20070517112637.70ae6c9f.tfreeman@mcs.anl.gov><464C87FE.5050006@cs.uchicago.edu> Message-ID: <1608928822-1179422001-cardhu_blackberry.rim.net-20619473-@bwe035-cell00.bisx.prod.on.blackberry> Indeed .... Sent via BlackBerry from T-Mobile -----Original Message----- From: Ben Clifford Date: Thu, 17 May 2007 17:04:05 To:Ioan Raicu Cc:swift-devel at ci.uchicago.edu, borja at borjanet.com, itf at mcs.anl.gov Subject: Re: [Swift-devel] swift-on-ec2 Management of remote virtual machine start-and-config on EC2 strikes me as being almost entirely out of scope for both swift and falkon... 
-- _______________________________________________ Swift-devel mailing list Swift-devel at ci.uchicago.edu http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Thu May 17 16:27:51 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 18 May 2007 00:27:51 +0300 Subject: [Swift-devel] swift-on-ec2 In-Reply-To: References: <4649D280.5080906@mcs.anl.gov> <356127187-1179301594-cardhu_blackberry.rim.net-179336256-@bwe047-cell00.bisx.prod.on.blackberry> <464B1402.9040405@mcs.anl.gov> <1583680979-1179336692-cardhu_blackberry.rim.net-2135204572-@bwe032-cell00.bisx.prod.on.blackberry> <464B4A6E.2040804@mcs.anl.gov> <464B6746.7050907@mcs.anl.gov> <464C69D6.70909@mcs.anl.gov> <464C7E68.1030400@cs.uchicago.edu> <20070517112449.3a856f70.tfreeman@mcs.anl.gov> <20070517112637.70ae6c9f.tfreeman@mcs.anl.gov> Message-ID: <1179437271.27959.14.camel@blabla.mcs.anl.gov> On Thu, 2007-05-17 at 11:29 -0500, Tiberiu Stef-Praun wrote: > Since there are dependencies in setting up the Falcon-enabled cluster > (essentially passing the IP of the headnode to the workers), maybe we > can have a Swift workflow that start up the whole EC2 grid shebang We're running into that "workflow" might be a "program" issue (or the other way around - i get confused). Yes. It would make sense to deal with parallelism/concurrency/RPC in a system suitable for those kinds of things. 
Mihael > > Tibi > > On 5/17/07, Tim Freeman wrote: > > On Thu, 17 May 2007 11:24:49 -0500 > > Tim Freeman wrote: > > > > > > > > Through those EC2-specific mechanisms you can push per VM instance deployment > > > > s/per VM instance deployment/per-VM-instance deployment information/ > > > > Tim > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > From hategan at mcs.anl.gov Thu May 17 16:31:03 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 18 May 2007 00:31:03 +0300 Subject: [Swift-devel] swift-on-ec2 In-Reply-To: <464C87FE.5050006@cs.uchicago.edu> References: <4649D280.5080906@mcs.anl.gov> <356127187-1179301594-cardhu_blackberry.rim.net-179336256-@bwe047-cell00.bisx.prod.on.blackberry> <464B1402.9040405@mcs.anl.gov> <1583680979-1179336692-cardhu_blackberry.rim.net-2135204572-@bwe032-cell00.bisx.prod.on.blackberry> <464B4A6E.2040804@mcs.anl.gov> <464B6746.7050907@mcs.anl.gov> <464C69D6.70909@mcs.anl.gov> <464C7E68.1030400@cs.uchicago.edu> <20070517112449.3a856f70.tfreeman@mcs.anl.gov> <20070517112637.70ae6c9f.tfreeman@mcs.anl.gov> <464C87FE.5050006@cs.uchicago.edu> Message-ID: <1179437463.27959.18.camel@blabla.mcs.anl.gov> On Thu, 2007-05-17 at 11:51 -0500, Ioan Raicu wrote: > If Swift can do it (through an LRM presumably), then Falkon could do it > as well! I think Falkon could do it by virtue of the fact that Java can eventually do it if somebody writes the right bits of code. Mihael > This should either be done from Falkon, or from the workspace service > itself, but not from Swift... 
> Ioan > > Tiberiu Stef-Praun wrote: > > Since there are dependencies in setting up the Falcon-enabled cluster > > (essentially passing the IP of the headnode to the workers), maybe we > > can have a Swift workflow that start up the whole EC2 grid shebang > > > > Tibi > > > > On 5/17/07, Tim Freeman wrote: > >> On Thu, 17 May 2007 11:24:49 -0500 > >> Tim Freeman wrote: > >> > >> > > >> > Through those EC2-specific mechanisms you can push per VM instance > >> deployment > >> > >> s/per VM instance deployment/per-VM-instance deployment information/ > >> > >> Tim > >> _______________________________________________ > >> Swift-devel mailing list > >> Swift-devel at ci.uchicago.edu > >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >> > > > > > From benc at hawaga.org.uk Mon May 21 06:55:00 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 21 May 2007 11:55:00 +0000 (GMT) Subject: [Swift-devel] multiple arguments In-Reply-To: References: Message-ID: r752 reintroduces (in a different, equally unspecified manner) support for [*], at least to the extent that I have seen it used. I opened bug 61 to track the fact that this is not properly specified in the language in terms of the data model / type system and (perhaps as a consequence) messily implemented. On Wed, 2 May 2007, Yong Zhao wrote: > That's strange. I used @filenames a lot a while ago and never had any > problems. Check the kml translation, maybe you added the getfieldvalue > stuff to getFilenames, which should not happen. i.e. > > It needs to be > > .... > > > not > > > > Yong. > > On Wed, 2 May 2007, Ben Clifford wrote: > > > > > > > On Wed, 2 May 2007, Yong Zhao wrote: > > > > > use @filenames(sliced[*].img). 
> > > > I get this: > > > > Execution failed: > > org.griphyn.vdl.mapping.InvalidPathException: Invalid path (*.img) > > for type volume > > > > > > I tried something a little simpler: > > > > > > type file; > > > > (file out) echo(file n[]) > > { > > app { > > echo @filenames(n) stdout=out; > > } > > } > > > > > > file f[] ; > > > > file out; > > > > out=echo(f); > > > > > > but that hangs... > > > > oof. > > > > -- > > > > From dvezendla.savithri at gmail.com Tue May 22 10:08:27 2007 From: dvezendla.savithri at gmail.com (DVezendla) Date: Tue, 22 May 2007 11:08:27 -0400 Subject: [Swift-devel] New to Swift Message-ID: <4262c2820705220808m26a5774cjd25f817907ea0e00@mail.gmail.com> Hi there, I am new to Swift Scripting language. Please help me how to start and proceed. Thanks & Regards, --DVezendla -------------- next part -------------- An HTML attachment was scrubbed... URL: From benc at hawaga.org.uk Tue May 22 10:38:10 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 22 May 2007 15:38:10 +0000 (GMT) Subject: [Swift-devel] swift + gram2 + condor Message-ID: I just started playing with swift submitting through gram2 to a condor installation, as that is my preferred queueing system for training systems. wrapper.sh seems to go awry, though, reporting strange errors where it seems to be interpreting parameters out of place. Perhaps a quoting problem. This seems vaguely familiar - I think maybe I tried it before and gave up. Has anyone else used swift->gram2->condor successfully? -- From benc at hawaga.org.uk Tue May 22 10:40:01 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 22 May 2007 15:40:01 +0000 (GMT) Subject: [Swift-devel] New to Swift In-Reply-To: <4262c2820705220808m26a5774cjd25f817907ea0e00@mail.gmail.com> References: <4262c2820705220808m26a5774cjd25f817907ea0e00@mail.gmail.com> Message-ID: On Tue, 22 May 2007, DVezendla wrote: > Hi there, > I am new to Swift Scripting language. > Please help me how to start and proceed. Hi. 
There is a quickstart guide at http://www.ci.uchicago.edu/swift/guides/quickstartguide.php which should talk you through getting a simple hello world program running, and then other documentation (the beginnings of a tutorial, and the user guide) at http://www.ci.uchicago.edu/swift/guides/ -- From yongzh at cs.uchicago.edu Tue May 22 10:50:27 2007 From: yongzh at cs.uchicago.edu (Yong Zhao) Date: Tue, 22 May 2007 10:50:27 -0500 (CDT) Subject: [Swift-devel] swift + gram2 + condor In-Reply-To: References: Message-ID: I've had some success using gram + condor, but I think that was before we introduced wrapper.sh. Condor does have a quoting problem; I do not remember exactly how we dealt with that in VDS1. Yong. On Tue, 22 May 2007, Ben Clifford wrote: > > I just started playing with swift submitting through gram2 to a condor > installation, as that is my preferred queueing system for training > systems. wrapper.sh seems to go awry, though, reporting strange errors > where it seems to be interpreting parameters out of place. Perhaps a > quoting problem. > > This seems vaguely familiar - I think maybe I tried it before and gave up. > > Has anyone else used swift->gram2->condor successfully? > > -- > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From wilde at mcs.anl.gov Tue May 22 13:00:20 2007 From: wilde at mcs.anl.gov (Mike Wilde) Date: Tue, 22 May 2007 13:00:20 -0500 Subject: [Swift-devel] Re: GRAM and Swift discussion this week? In-Reply-To: <685C1420-03DE-4F2E-BDC7-A8A2C5636154@mcs.anl.gov> References: <685C1420-03DE-4F2E-BDC7-A8A2C5636154@mcs.anl.gov> Message-ID: <46532FB4.5070707@mcs.anl.gov> Stu, sorry - I missed this message until you mentioned it to me just now. Thinking about it, I'd like to have Ben and Mihael involved as well as all the local Swift and Falkon people. Ben will be arriving this Thu I think but I'm not sure what time.
Mihael is working from Romania through late June, and can join I hope via skype or telecon. (Im looking for a good Skype speakerphone). Since the people you mentioned are mostly from within the DSL, eg Joe Bester, perhaps we can schedule this meeting by email for a date around June 12-14, the last few days before Ben heads back. June 13 I think is best for me. Does anyone see a pressing reason to do this meeting earlier? - Mike Stuart Martin wrote, On 5/20/2007 10:51 PM: > Hi Mike, > > Will you are swift folks be at the committers all hands meeting? Does > it make sense to sync up on plans for GRAM and Swift? We could have > this as a GRAM meeting on Thursday the 24th? We could also invite the > GridWay guys. What do you think? > > -Stu > > -- Mike Wilde Computation Institute, University of Chicago Math & Computer Science Division Argonne National Laboratory Argonne, IL 60439 USA tel 630-252-7497 fax 630-252-1997 From benc at hawaga.org.uk Tue May 22 13:10:03 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 22 May 2007 18:10:03 +0000 (GMT) Subject: [Swift-devel] swift + gram2 + condor In-Reply-To: References: Message-ID: On Tue, 22 May 2007, Yong Zhao wrote: > Condor does have quoting problem, I do not remember exactly how we dealt > with that in VDS1. I think really its GRAM that has the problem, not Condor - GRAM is meant to abstract away stuff like this. -- From benc at hawaga.org.uk Tue May 22 13:10:44 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 22 May 2007 18:10:44 +0000 (GMT) Subject: [Swift-devel] Re: GRAM and Swift discussion this week? In-Reply-To: <46532FB4.5070707@mcs.anl.gov> References: <685C1420-03DE-4F2E-BDC7-A8A2C5636154@mcs.anl.gov> <46532FB4.5070707@mcs.anl.gov> Message-ID: On Tue, 22 May 2007, Mike Wilde wrote: > Since the people you mentioned are mostly from within the DSL, eg Joe Bester, > perhaps we can schedule this meeting by email for a date around June 12-14, > the last few days before Ben heads back. 
June 13 I think is best for me. > > Does anyone see a pressing reason to do this meeting earlier? 12th-14th is best for me - I'll have my mind on many other things until then. -- From wilde at mcs.anl.gov Tue May 22 13:17:31 2007 From: wilde at mcs.anl.gov (Mike Wilde) Date: Tue, 22 May 2007 13:17:31 -0500 Subject: [Swift-devel] Re: GRAM and Swift discussion this week? In-Reply-To: References: <685C1420-03DE-4F2E-BDC7-A8A2C5636154@mcs.anl.gov> <46532FB4.5070707@mcs.anl.gov> Message-ID: <465333BB.2070600@mcs.anl.gov> So let's shoot for Jun 13 then and see if that works for everyone who's interested. - Mike Ben Clifford wrote, On 5/22/2007 1:10 PM: > > On Tue, 22 May 2007, Mike Wilde wrote: > >> Since the people you mentioned are mostly from within the DSL, eg Joe Bester, >> perhaps we can schedule this meeting by email for a date around June 12-14, >> the last few days before Ben heads back. June 13 I think is best for me. >> >> Does anyone see a pressing reason to do this meeting earlier? > > 12th-14th is best for me - I'll have my mind on many other things until > then. > -- Mike Wilde Computation Institute, University of Chicago Math & Computer Science Division Argonne National Laboratory Argonne, IL 60439 USA tel 630-252-7497 fax 630-252-1997 From tiberius at ci.uchicago.edu Tue May 22 13:26:07 2007 From: tiberius at ci.uchicago.edu (Tiberiu Stef-Praun) Date: Tue, 22 May 2007 13:26:07 -0500 Subject: [Swift-devel] Re: GRAM and Swift discussion this week? In-Reply-To: <465333BB.2070600@mcs.anl.gov> References: <685C1420-03DE-4F2E-BDC7-A8A2C5636154@mcs.anl.gov> <46532FB4.5070707@mcs.anl.gov> <465333BB.2070600@mcs.anl.gov> Message-ID: I seem to be available on that date. Can I attend as well? Tibi On 5/22/07, Mike Wilde wrote: > So let's shoot for Jun 13 then and see if that works for everyone > who's interested.
> > - Mike > > Ben Clifford wrote, On 5/22/2007 1:10 PM: > > > > On Tue, 22 May 2007, Mike Wilde wrote: > > > >> Since the people you mentioned are mostly from within the DSL, eg Joe Bester, > >> perhaps we can schedule this meeting by email for a date around June 12-14, > >> the last few days before Ben heads back. June 13 I think is best for me. > >> > >> Does anyone see a pressing reason to do this meeting earlier? > > > > 12th-14th is best for me - I'll have my mind on many other things until > > then. > > > > -- > Mike Wilde > Computation Institute, University of Chicago > Math & Computer Science Division > Argonne National Laboratory > Argonne, IL 60439 USA > tel 630-252-7497 fax 630-252-1997 > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > -- Tiberiu (Tibi) Stef-Praun, PhD Research Staff, Computation Institute 5640 S. Ellis Ave, #405 University of Chicago http://www-unix.mcs.anl.gov/~tiberius/ From foster at mcs.anl.gov Tue May 22 13:26:25 2007 From: foster at mcs.anl.gov (Ian Foster) Date: Tue, 22 May 2007 13:26:25 -0500 Subject: [Swift-devel] Re: GRAM and Swift discussion this week? In-Reply-To: <46532FB4.5070707@mcs.anl.gov> References: <685C1420-03DE-4F2E-BDC7-A8A2C5636154@mcs.anl.gov> <46532FB4.5070707@mcs.anl.gov> Message-ID: <465335D1.2040306@mcs.anl.gov> It would be interesting to hear what issues are of interest on each side. Are there WS-GRAM issues that are causing problems for Swift? Is advance reservation important for Swift? Swift is increasingly using Falkon to handle submissions, which reduces the number of GRAM operations performed significantly. Ian. Mike Wilde wrote: > Stu, sorry - I missed this message until you mentioned it to me just now. > > Thinking about it, I'd like to have Ben and Mihael involved as well as > all the local Swift and Falkon people. 
Ben will be arriving this Thu I > think but Im not sure what time. Mihael is working from Romania > through late June, and can join I hope via skype or telecon. (Im > looking for a good Skype speakerphone). > > Since the people you mentioned are mostly from within the DSL, eg Joe > Bester, perhaps we can schedule this meeting by email for a date > around June 12-14, the last few days before Ben heads back. June 13 I > think is best for me. > > Does anyone see a pressing reason to do this meeting earlier? > > - Mike > > > Stuart Martin wrote, On 5/20/2007 10:51 PM: >> Hi Mike, >> >> Will you are swift folks be at the committers all hands meeting? >> Does it make sense to sync up on plans for GRAM and Swift? We could >> have this as a GRAM meeting on Thursday the 24th? We could also >> invite the GridWay guys. What do you think? >> >> -Stu >> >> > -- Ian Foster, Director, Computation Institute Argonne National Laboratory & University of Chicago Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439 Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637 Tel: +1 630 252 4619. Web: www.ci.uchicago.edu. Globus Alliance: www.globus.org. From benc at hawaga.org.uk Tue May 22 13:43:15 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 22 May 2007 18:43:15 +0000 (GMT) Subject: [Swift-devel] Re: GRAM and Swift discussion this week? In-Reply-To: <465335D1.2040306@mcs.anl.gov> References: <685C1420-03DE-4F2E-BDC7-A8A2C5636154@mcs.anl.gov> <46532FB4.5070707@mcs.anl.gov> <465335D1.2040306@mcs.anl.gov> Message-ID: On Tue, 22 May 2007, Ian Foster wrote: > Are there WS-GRAM issues that are causing problems for Swift? No one uses WS-GRAM with Swift, so we aren't really uncovering issues there. > Is advance reservation important for Swift? We haven't really talked about that. I'm not sure how it would fit in, but if people want it, it would be nice to accommodate it somehow.
> Swift is increasingly using Falkon to handle submissions, which reduces > the number of GRAM operations performed significantly. At the high/experimental end, yes. However, if we have any expectation of people downloading and using it by themselves without us providing professional services-style consultancy, then those users won't be going anywhere near Falkon any time soon. -- From yongzh at cs.uchicago.edu Tue May 22 14:05:49 2007 From: yongzh at cs.uchicago.edu (Yong Zhao) Date: Tue, 22 May 2007 14:05:49 -0500 (CDT) Subject: [Swift-devel] swift + gram2 + condor In-Reply-To: References: Message-ID: yep, maybe we can talk with people that developed the interface between gram and condor? Yong. On Tue, 22 May 2007, Ben Clifford wrote: > > > On Tue, 22 May 2007, Yong Zhao wrote: > > > Condor does have quoting problem, I do not remember exactly how we dealt > > with that in VDS1. > > I think really its GRAM that has the problem, not Condor - GRAM is meant > to abstract away stuff like this. > > -- > From yongzh at cs.uchicago.edu Tue May 22 14:09:35 2007 From: yongzh at cs.uchicago.edu (Yong Zhao) Date: Tue, 22 May 2007 14:09:35 -0500 (CDT) Subject: [Swift-devel] Re: GRAM and Swift discussion this week? In-Reply-To: References: <685C1420-03DE-4F2E-BDC7-A8A2C5636154@mcs.anl.gov> <46532FB4.5070707@mcs.anl.gov> <465335D1.2040306@mcs.anl.gov> Message-ID: I used WS_GRAM a while ago with Swift, I did not encounter any specific WS-GRAM problem then. Yong. On Tue, 22 May 2007, Ben Clifford wrote: > > On Tue, 22 May 2007, Ian Foster wrote: > > > Are there WS-GRAM issues that are causing problems for Swift? > > No one uses WS-GRAM with Swift, so we aren't really uncovering issus > there. > > > > Is advance reservation important for Swift? > > We haven't really talked about that. I'm not sure how it would fit in, but > if people want it, it would be nice to accomodate it somehow. 
> > > > Swift is increasingly using Falkon to handle submissions, which reduces > > the number of GRAM operations performed significantly. > > At the high/experimental end, yes. However, if we have any expectation of > people downloading and using it by themselves without us providing > professional services-style consultancy, then those users won't be going > anywhere near Falkon any time soon. > > -- > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From smartin at mcs.anl.gov Tue May 22 14:10:48 2007 From: smartin at mcs.anl.gov (Stuart Martin) Date: Tue, 22 May 2007 14:10:48 -0500 Subject: [Swift-devel] Re: GRAM and Swift discussion this week? In-Reply-To: References: <685C1420-03DE-4F2E-BDC7-A8A2C5636154@mcs.anl.gov> <46532FB4.5070707@mcs.anl.gov> <465335D1.2040306@mcs.anl.gov> Message-ID: On May 22, 2007, at May 22, 1:43 PM, Ben Clifford wrote: > > On Tue, 22 May 2007, Ian Foster wrote: > >> Are there WS-GRAM issues that are causing problems for Swift? > > No one uses WS-GRAM with Swift, so we aren't really uncovering issus > there. Why not? What are you using? GRAM2? local executions? Other services? > > >> Is advance reservation important for Swift? > > We haven't really talked about that. I'm not sure how it would fit > in, but > if people want it, it would be nice to accomodate it somehow. > > >> Swift is increasingly using Falkon to handle submissions, which >> reduces >> the number of GRAM operations performed significantly. > > At the high/experimental end, yes. However, if we have any > expectation of > people downloading and using it by themselves without us providing > professional services-style consultancy, then those users won't be > going > anywhere near Falkon any time soon. 
> > -- > From tiberius at ci.uchicago.edu Tue May 22 14:17:18 2007 From: tiberius at ci.uchicago.edu (Tiberiu Stef-Praun) Date: Tue, 22 May 2007 14:17:18 -0500 Subject: [Swift-devel] Re: GRAM and Swift discussion this week? In-Reply-To: References: <685C1420-03DE-4F2E-BDC7-A8A2C5636154@mcs.anl.gov> <46532FB4.5070707@mcs.anl.gov> <465335D1.2040306@mcs.anl.gov> Message-ID: I might have used ws-gram at the TACC site. I think it was quite a while ago, so I am not 100% that I actually used ws-gram. Tibi On 5/22/07, Stuart Martin wrote: > On May 22, 2007, at May 22, 1:43 PM, Ben Clifford wrote: > > > > On Tue, 22 May 2007, Ian Foster wrote: > > > >> Are there WS-GRAM issues that are causing problems for Swift? > > > > No one uses WS-GRAM with Swift, so we aren't really uncovering issus > > there. > > Why not? What are you using? GRAM2? local executions? Other > services? > > > > > > >> Is advance reservation important for Swift? > > > > We haven't really talked about that. I'm not sure how it would fit > > in, but > > if people want it, it would be nice to accomodate it somehow. > > > > > >> Swift is increasingly using Falkon to handle submissions, which > >> reduces > >> the number of GRAM operations performed significantly. > > > > At the high/experimental end, yes. However, if we have any > > expectation of > > people downloading and using it by themselves without us providing > > professional services-style consultancy, then those users won't be > > going > > anywhere near Falkon any time soon. > > > > -- > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > -- Tiberiu (Tibi) Stef-Praun, PhD Research Staff, Computation Institute 5640 S. 
Ellis Ave, #405 University of Chicago http://www-unix.mcs.anl.gov/~tiberius/ From benc at hawaga.org.uk Tue May 22 14:22:32 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 22 May 2007 19:22:32 +0000 (GMT) Subject: [Swift-devel] Re: GRAM and Swift discussion this week? In-Reply-To: References: <685C1420-03DE-4F2E-BDC7-A8A2C5636154@mcs.anl.gov> <46532FB4.5070707@mcs.anl.gov> <465335D1.2040306@mcs.anl.gov> Message-ID: On Tue, 22 May 2007, Stuart Martin wrote: > On May 22, 2007, at May 22, 1:43 PM, Ben Clifford wrote: > > > > On Tue, 22 May 2007, Ian Foster wrote: > > > > > Are there WS-GRAM issues that are causing problems for Swift? > > > > No one uses WS-GRAM with Swift, so we aren't really uncovering issues > > there. > > Why not? What are you using? GRAM2? local executions? Other services? For the high end stuff, Swift submits jobs to Falkon. Falkon, I think, uses WS-GRAM to start up its own workers, but that startup is part of Falkon, not Swift. For low end stuff, the two providers that I think people use much are local exec and GRAM2. Local exec is not in the space that GRAM is addressing, so ignore it. The GRAM2 vs GRAM4 question pretty much comes down to the fact that people in production (at least as far as I encounter them) tend to use GRAM2 rather than GRAM4 and so Swift tends to get used that way too - there's no real motivation to push people to use a different submission system than what they're used to, and one thing we decided within our group is that we would concentrate on being very application focused (after we had spent rather a long time pontificating and debating). GRAM2 -> GRAM4 doesn't provide enough incentive (in the way that a GRAM2 -> Falkon change does) for our actual apps (for example, those that Tibi and Nika work on). At some point, perhaps, GRAM2 will decay or GRAM4 will become tantalising, at which point it would be in the interests of being app-focused to shift. Or we might change our priorities to be less app focused.
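As a rough illustration of the provider choice described above (local exec, GRAM2, Falkon), here is a minimal Python sketch of one interface with interchangeable back ends. It is not Swift's or CoG's actual API: the class names, the `submit` method, the `PROVIDERS` table and `run_app` are all hypothetical, and the stub providers only label a command rather than submit anything.

```python
from abc import ABC, abstractmethod


class ExecutionProvider(ABC):
    """Hypothetical common interface; not the real Swift/CoG provider API."""

    name = "abstract"

    @abstractmethod
    def submit(self, command):
        """Return a string describing where the command was sent."""


class LocalProvider(ExecutionProvider):
    name = "local"

    def submit(self, command):
        # Stub: a real provider would fork the process on this machine.
        return "local:" + " ".join(command)


class Gram2Provider(ExecutionProvider):
    name = "gt2"

    def submit(self, command):
        # Stub: a real provider would hand the job to a GRAM2 gatekeeper.
        return "gt2:" + " ".join(command)


class FalkonProvider(ExecutionProvider):
    name = "falkon"

    def submit(self, command):
        # Stub: a real provider would dispatch the task to Falkon workers.
        return "falkon:" + " ".join(command)


# Site configuration selects a provider by name; application code never
# refers to a specific submission system.
PROVIDERS = {p.name: p for p in (LocalProvider(), Gram2Provider(), FalkonProvider())}


def run_app(site_provider, command):
    """Submit a command through whichever provider the site is configured for."""
    return PROVIDERS[site_provider].submit(command)
```

With a layout like this, moving a site from `gt2` to another provider would be a one-word configuration change, with nothing in the application needing to be edited.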
-- From iraicu at cs.uchicago.edu Tue May 22 14:34:07 2007 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Tue, 22 May 2007 14:34:07 -0500 Subject: [Swift-devel] Re: GRAM and Swift discussion this week? In-Reply-To: References: <685C1420-03DE-4F2E-BDC7-A8A2C5636154@mcs.anl.gov> <46532FB4.5070707@mcs.anl.gov> <465335D1.2040306@mcs.anl.gov> Message-ID: <465345AF.3010201@cs.uchicago.edu> See below: Ben Clifford wrote: > On Tue, 22 May 2007, Ian Foster wrote: > > >> Are there WS-GRAM issues that are causing problems for Swift? >> > > No one uses WS-GRAM with Swift, so we aren't really uncovering issus > there. > > > >> Is advance reservation important for Swift? >> > > We haven't really talked about that. I'm not sure how it would fit in, but > if people want it, it would be nice to accomodate it somehow. > > > >> Swift is increasingly using Falkon to handle submissions, which reduces >> the number of GRAM operations performed significantly. >> > > At the high/experimental end, yes. However, if we have any expectation of > people downloading and using it by themselves without us providing > professional services-style consultancy, then those users won't be going > anywhere near Falkon any time soon. > We have learned quite a bit about setting up Falkon at different sites across the TG. The caveats that we have to watch out for are: 1. platform specific JVM location, this is not set correctly in the remote machine's environment, and is different from site to site; this remains as an issue that needs to be addressed per site 2. some sites require the project be explicitly specified; this has been fixed 3. expired credentials errors don't get propagated to the user's screen, they are simply written to logs... 4. some sites (ANL) support GRAM4 extensions, while other sites do not; we now support both RSL formats 5. 
the many logs that we generate are quite hard for people to follow, and keep track of what each one contains; we fixed this by developing a GUI that can connect to the GT4 container remotely and display relevant information! 6. TG machines have an old kernel that do not support changing the thread stack size * this has implications on the number of threads a JVM can create before running out of memory * we have observed that we can create about 100~200 threads per JVM on most TG nodes * the GT4 container operates on a pool of threads for everything it does, so the max number of threads it will create is bounded! * the provisioner currently creates a new thread for every job (resource allocation) it sends to GRAM4 o depending on which allocation strategy is used, this might/might not be a problem on TG nodes o in theory, we don't want more than 100 or so GRAM4 jobs in parallel running, but if we choose the policy in which each job allocates a single machine, then we can easily surpass 100 jobs in parallel... all the other policies, would be able to allocate 1K+, even 10K+ machines with less than 100 jobs in parallel, so it could work perfectly fine even with the current implementation; in the long run, this might be able to be changed to a pool of threads in the provisioner! The things that I believe are needed for it be more friendly to new/existing users outside of the core developers: 1. A suite of tests that will ensure everything is set correctly, before using Falkon * we could check against grid-proxy-info in a script * make sure GRAM4 works at the particular site by using globusrun-ws * check the JAVA_HOME and java commands from within a GRAM4 submitted job * check if ANT is installed; this is needed to recompile the Falkon service 2. get more of the Falkon configuration parameters into config files, rather than scripts or code! 3. clean up the scripts, and make them more robust and user friendly 4. 
make an interface into the provisioning component and Falkon to allow the live configuration of Falkon without requiring restarts 5. Documentation well beyond the current 1 page readme that is only sufficient if everything works! 6. There is no documentation on how to set up the needed security if a user wants to enable security in Falkon; the default is no security Maybe there are others that I missed, but I don't think we are that far from people being able to use it without us taking them by the hand the entire way. The things that would be good to do are not on the top of my things to do list, but in time, I'll get them done. If anyone wants to help with these, I would not refuse anyone's help. Ioan > -- > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > -- ============================================ Ioan Raicu Ph.D. Student ============================================ Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 ============================================ Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dsl.cs.uchicago.edu/ ============================================ ============================================ From smartin at mcs.anl.gov Tue May 22 14:41:05 2007 From: smartin at mcs.anl.gov (Stuart Martin) Date: Tue, 22 May 2007 14:41:05 -0500 Subject: [Swift-devel] Re: GRAM and Swift discussion this week? 
In-Reply-To: References: <685C1420-03DE-4F2E-BDC7-A8A2C5636154@mcs.anl.gov> <46532FB4.5070707@mcs.anl.gov> <465335D1.2040306@mcs.anl.gov> Message-ID: On May 22, 2007, at May 22, 2:22 PM, Ben Clifford wrote: > > > On Tue, 22 May 2007, Stuart Martin wrote: > >> On May 22, 2007, at May 22, 1:43 PM, Ben Clifford wrote: >>> >>> On Tue, 22 May 2007, Ian Foster wrote: >>> >>>> Are there WS-GRAM issues that are causing problems for Swift? >>> >>> No one uses WS-GRAM with Swift, so we aren't really uncovering issus >>> there. >> > >> Why not? What are you using? GRAM2? local executions? Other >> services? > > for the high end stuff, Swift submits jobs to Falkon. Falkon, I think, > uses WS-GRAM to start up its own workers, but that startup part of > Falkon > not Swift. > > For low end stuff, the two providers that I think people use much are > local exec and GRAM2. > > Local exec is not in the space that GRAM is addressing, so ignore. Agreed. Just trying to learn what people are doing. > > The GRAM2 vs GRAM4 question pretty much comes down to the fact that > people > in production (at least as far as I encounter them) tend to use GRAM2 > rather than GRAM4 and so Swift tends to get used that way too - > there's no > real motivation to push people to use a different submission system > than > what they're used to, and one thing we decided within our group is > that we > would concentrate on being very application focused (after we had > spent > rather a long time pontificating and debating). GRAM2 -> GRAM4 doesn't > provide enough incentive (in the way that a GRAM2 -> Falkon change > does) > for our actual apps (for example that Tibi and Nika work on). Fair enough. GRAM4 is deployed on most of TG and OSG now. It would be good to push jobs to GRAM4 when reasonable/possible. The apps folks should not care which service is used. It should be hidden by Swift. Or are the apps folks your working with also dictating what GRAM service is deployed/used? 
> > At some point, perhaps, GRAM2 will decay or GRAM4 will become > tantalising, > at which point it would be in the interests of being app-focused to > shift. > Or we might change our priorities to be less app focused. Some are quite happy with GRAM4 in 4.0.3. We're improving things right now to make GRAM4 outperform GRAM2 in most all the important benchmarks. This should be in 4.0.5. I think things at that point become "tantalizing". > > -- > From yongzh at cs.uchicago.edu Tue May 22 14:51:22 2007 From: yongzh at cs.uchicago.edu (Yong Zhao) Date: Tue, 22 May 2007 14:51:22 -0500 (CDT) Subject: [Swift-devel] Re: GRAM and Swift discussion this week? In-Reply-To: References: <685C1420-03DE-4F2E-BDC7-A8A2C5636154@mcs.anl.gov> <46532FB4.5070707@mcs.anl.gov> <465335D1.2040306@mcs.anl.gov> Message-ID: Swift does hide which provider the app uses, say, local, gt2, gt4, falkon. I think the major reasons they are not using WS-GRAM are: - WS_GRAM not configured - WS_GRAM slower than GT2 But as you've pointed out, as things improve, we should shift to WS_GRAM gradually. Yong. On Tue, 22 May 2007, Stuart Martin wrote: > > On May 22, 2007, at May 22, 2:22 PM, Ben Clifford wrote: > > > > > > > On Tue, 22 May 2007, Stuart Martin wrote: > > > >> On May 22, 2007, at May 22, 1:43 PM, Ben Clifford wrote: > >>> > >>> On Tue, 22 May 2007, Ian Foster wrote: > >>> > >>>> Are there WS-GRAM issues that are causing problems for Swift? > >>> > >>> No one uses WS-GRAM with Swift, so we aren't really uncovering issus > >>> there. > >> > > > >> Why not? What are you using? GRAM2? local executions? Other > >> services? > > > > for the high end stuff, Swift submits jobs to Falkon. Falkon, I think, > > uses WS-GRAM to start up its own workers, but that startup part of > > Falkon > > not Swift. > > > > For low end stuff, the two providers that I think people use much are > > local exec and GRAM2. > > > > Local exec is not in the space that GRAM is addressing, so ignore. > > Agreed. 
Just trying to learn what people are doing. > > > > > The GRAM2 vs GRAM4 question pretty much comes down to the fact that > > people > > in production (at least as far as I encounter them) tend to use GRAM2 > > rather than GRAM4 and so Swift tends to get used that way too - > > there's no > > real motivation to push people to use a different submission system > > than > > what they're used to, and one thing we decided within our group is > > that we > > would concentrate on being very application focused (after we had > > spent > > rather a long time pontificating and debating). GRAM2 -> GRAM4 doesn't > > provide enough incentive (in the way that a GRAM2 -> Falkon change > > does) > > for our actual apps (for example that Tibi and Nika work on). > > Fair enough. GRAM4 is deployed on most of TG and OSG now. It would > be good to push jobs to GRAM4 when reasonable/possible. The apps > folks should not care which service is used. It should be hidden by > Swift. Or are the apps folks your working with also dictating what > GRAM service is deployed/used? > > > > > At some point, perhaps, GRAM2 will decay or GRAM4 will become > > tantalising, > > at which point it would be in the interests of being app-focused to > > shift. > > Or we might change our priorities to be less app focused. > > Some are quite happy with GRAM4 in 4.0.3. We're improving things > right now to make GRAM4 outperform GRAM2 in most all the important > benchmarks. This should be in 4.0.5. I think things at that point > become "tantalizing". > > > > > -- > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From tfreeman at mcs.anl.gov Tue May 22 22:10:59 2007 From: tfreeman at mcs.anl.gov (Tim Freeman) Date: Tue, 22 May 2007 22:10:59 -0500 Subject: [Swift-devel] Re: GRAM and Swift discussion this week? 
In-Reply-To: <465345AF.3010201@cs.uchicago.edu> References: <685C1420-03DE-4F2E-BDC7-A8A2C5636154@mcs.anl.gov> <46532FB4.5070707@mcs.anl.gov> <465335D1.2040306@mcs.anl.gov> <465345AF.3010201@cs.uchicago.edu> Message-ID: <20070522221059.3e992405.tfreeman@mcs.anl.gov> On Tue, 22 May 2007 14:34:07 -0500 Ioan Raicu wrote: > the many logs that we generate are quite hard for people to > follow, and keep track of what each one contains; we fixed this by > developing a GUI that can connect to the GT4 container remotely > and display relevant information! Is that something that could be used for other GT services? > * the provisioner currently creates a new thread for every job > (resource allocation) it sends to GRAM4 > o depending on which allocation strategy is used, this > might/might not be a problem on TG nodes > o in theory, we don't want more than 100 or so GRAM4 > jobs in parallel running, but if we choose the policy > in which each job allocates a single machine, then we > can easily surpass 100 jobs in parallel... all the > other policies, would be able to allocate 1K+, even > 10K+ machines with less than 100 jobs in parallel, so > it could work perfectly fine even with the current > implementation; in the long run, this might be able to > be changed to a pool of threads in the provisioner! Are these threads just waiting on notifications? If so: you should be able to reduce this to one thread by subscribing with the same notification consumer EPR for each GRAM job and demuxing the result (demux based on the producer EPR that is passed to the class implementing NotifyCallback). That way the thread that creates the GRAM job can disappear once it is done with the create call. Tim From hategan at mcs.anl.gov Wed May 23 04:10:42 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 23 May 2007 12:10:42 +0300 Subject: [Swift-devel] Re: GRAM and Swift discussion this week? 
In-Reply-To: <20070522221059.3e992405.tfreeman@mcs.anl.gov> (from tfreeman@mcs.anl.gov on Wed May 23 06:10:59 2007) References: <685C1420-03DE-4F2E-BDC7-A8A2C5636154@mcs.anl.gov> <46532FB4.5070707@mcs.anl.gov> <465335D1.2040306@mcs.anl.gov> <465345AF.3010201@cs.uchicago.edu> <20070522221059.3e992405.tfreeman@mcs.anl.gov> Message-ID: <1179911442l.13147l.0l@blabla> On 05/23/2007 06:10:59 AM, Tim Freeman wrote: > On Tue, 22 May 2007 14:34:07 -0500 > Ioan Raicu wrote: > > > the many logs that we generate are quite hard for people to > > follow, and keep track of what each one contains; we fixed > this by > > developing a GUI that can connect to the GT4 container > remotely > > and display relevant information! > > Is that something that could be used for other GT services? > > > > * the provisioner currently creates a new thread for every > job > > (resource allocation) it sends to GRAM4 > > o depending on which allocation strategy is used, > this > > might/might not be a problem on TG nodes > > o in theory, we don't want more than 100 or so GRAM4 > > jobs in parallel running, but if we choose the > policy > > in which each job allocates a single machine, then > we > > can easily surpass 100 jobs in parallel... all the > > other policies, would be able to allocate 1K+, > even > > 10K+ machines with less than 100 jobs in parallel, > so > > it could work perfectly fine even with the current > > implementation; in the long run, this might be > able to > > be changed to a pool of threads in the > provisioner! > > Are these threads just waiting on notifications? > > If so: you should be able to reduce this to one thread by subscribing > with the > same notification consumer EPR for each GRAM job and demuxing the > result (demux > based on the producer EPR that is passed to the class implementing > NotifyCallback). That way the thread that creates the GRAM job can > disappear > once it is done with the create call. I'd recommend the CoG abstractions. 
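For illustration, the single-consumer demultiplexing Tim describes above might look roughly like the following. This is a plain-Python sketch and uses neither the GT4 nor the CoG API: `NotificationDemux`, `register`, `deliver` and the string job keys (standing in for producer EPRs) are all hypothetical names.

```python
import queue
import threading


class NotificationDemux:
    """One consumer thread routes state notifications to per-job handlers.

    Illustration only: job keys stand in for producer EPRs, and deliver()
    stands in for whatever receives notifications from the container.
    """

    def __init__(self):
        self._handlers = {}              # producer key -> callback
        self._lock = threading.Lock()
        self._inbox = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def register(self, producer_key, callback):
        # Called by the submitting thread, which can exit after the create call.
        with self._lock:
            self._handlers[producer_key] = callback

    def deliver(self, producer_key, state):
        # Queue a notification; routing happens on the single consumer thread.
        self._inbox.put((producer_key, state))

    def _run(self):
        while True:
            item = self._inbox.get()
            if item is None:             # sentinel posted by close()
                break
            producer_key, state = item
            with self._lock:
                callback = self._handlers.get(producer_key)
            if callback is not None:     # notifications for unknown jobs are dropped
                callback(producer_key, state)

    def close(self):
        # Earlier notifications are handled first because the queue is FIFO.
        self._inbox.put(None)
        self._thread.join()
```

A hypothetical provisioner would call `register` once per submitted job and keep a single `NotificationDemux` for all of them, so the number of waiting threads stays at one however many jobs are outstanding.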
They do exactly that, but hide all the details. Mihael > > Tim > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > From benc at hawaga.org.uk Wed May 23 09:15:33 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 23 May 2007 14:15:33 +0000 (GMT) Subject: [Swift-devel] swift after 2007-04-29 Message-ID: Has anyone on this list used a swift source base more recent than 29th of April? -- From benc at hawaga.org.uk Wed May 23 10:19:03 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 23 May 2007 15:19:03 +0000 (GMT) Subject: [Swift-devel] wiring swift and falkon together Message-ID: I hear rumour that it's sufficiently unclear how to wire swift and falkon together that people are avoiding testing swift code (more recent than the 8th of March build that Yong made). That is lame - it means large chunks of our app testing are being done with code that is 2.5 months old. I don't know how Falkon gets deployed alongside swift at the moment, so I don't know what to do to make this easier - is the procedure written down anywhere? -- From tiberius at ci.uchicago.edu Wed May 23 10:57:08 2007 From: tiberius at ci.uchicago.edu (Tiberiu Stef-Praun) Date: Wed, 23 May 2007 10:57:08 -0500 Subject: [Swift-devel] wiring swift and falkon together In-Reply-To: References: Message-ID: It seems that Yong's Falkon provider is working (according to Nika), so I was wondering when it will make it into Swift? At that point it's more convenient for me to test it (as I would only have to handle the Falkon backend configuration). Tibi On 5/23/07, Ben Clifford wrote: > > I hear rumour that it's sufficiently unclear how to wire swift and falkon > together that people are avoiding testing swift code (more recent than the > 8th of March build that Yong made). > > That is lame - it means large chunks of our app testing are being done > with code that is 2.5 months old.
> > I don't know how Falkon gets deployed alongside swift at the moment, so I > don't know what to do to make this easier - are they written down > anywhere? > > -- > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > -- Tiberiu (Tibi) Stef-Praun, PhD Research Staff, Computation Institute 5640 S. Ellis Ave, #405 University of Chicago http://www-unix.mcs.anl.gov/~tiberius/ From benc at hawaga.org.uk Wed May 23 11:01:47 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 23 May 2007 16:01:47 +0000 (GMT) Subject: [Swift-devel] wiring swift and falkon together In-Reply-To: References: Message-ID: On Wed, 23 May 2007, Tiberiu Stef-Praun wrote: > It seems that Yong's Falkon provider is working (according to Nika), > so I was wondering when will it make it into the Swift ? At that point > it's more convenient for me to test it (as I would only have to handle > the Falkon backend configuration). Does it have build dependencies on Falkon code? -- From wilde at mcs.anl.gov Wed May 23 11:03:29 2007 From: wilde at mcs.anl.gov (Mike Wilde) Date: Wed, 23 May 2007 11:03:29 -0500 Subject: [Swift-devel] wiring swift and falkon together In-Reply-To: References: Message-ID: <465465D1.4060002@mcs.anl.gov> Im in favor of asking Ioan and possibly Yong - to the extent he has time - to push forward on this, to specifications from Ben and Mihael, and based on usability feedback from Nika and Tibi who need to speak for users' needs. Ben's specs should also address code quality, testing/certification and maintainability. - Mike Tiberiu Stef-Praun wrote, On 5/23/2007 10:57 AM: > It seems that Yong's Falkon provider is working (according to Nika), > so I was wondering when will it make it into the Swift ? At that point > it's more convenient for me to test it (as I would only have to handle > the Falkon backend configuration). 
> > Tibi > > On 5/23/07, Ben Clifford wrote: >> >> i hear rumour that its sufficiently unclear how to wire swift and falkon >> together that people are avoiding testing swift code (more recent than >> the >> 8th of march build that Yong made) >> >> that is lame - it means large chunks of our app testing are being done >> with code that is 2.5 months old. >> >> I don't know how Falkon gets deployed alongside swift at the moment, so I >> don't know what to do to make this easier - are they written down >> anywhere? >> >> -- >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> > > -- Mike Wilde Computation Institute, University of Chicago Math & Computer Science Division Argonne National Laboratory Argonne, IL 60439 USA tel 630-252-7497 fax 630-252-1997 From benc at hawaga.org.uk Wed May 23 11:12:08 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 23 May 2007 16:12:08 +0000 (GMT) Subject: [Swift-devel] wiring swift and falkon together In-Reply-To: <465465D1.4060002@mcs.anl.gov> References: <465465D1.4060002@mcs.anl.gov> Message-ID: What does Falkon deployment look like at the moment? (in terms of procedures to deploy it from an empty computer, and in terms of how files are laid out, and in terms of how things get configured)? I think it doesn't make sense to look at the falkon/swift interface code without looking at the whole deployment process for both Swift and Falkon together. On Wed, 23 May 2007, Mike Wilde wrote: > Im in favor of asking Ioan and possibly Yong - to the extent he has time - to > push forward on this, to specifications from Ben and Mihael, and based on > usability feedback from Nika and Tibi who need to speak for users' needs. > Ben's specs should also address code quality, testing/certification and > maintainability. 
> > - Mike > > > Tiberiu Stef-Praun wrote, On 5/23/2007 10:57 AM: > > It seems that Yong's Falkon provider is working (according to Nika), > > so I was wondering when will it make it into the Swift ? At that point > > it's more convenient for me to test it (as I would only have to handle > > the Falkon backend configuration). > > > > Tibi > > > > On 5/23/07, Ben Clifford wrote: > > > > > > i hear rumour that its sufficiently unclear how to wire swift and falkon > > > together that people are avoiding testing swift code (more recent than the > > > 8th of march build that Yong made) > > > > > > that is lame - it means large chunks of our app testing are being done > > > with code that is 2.5 months old. > > > > > > I don't know how Falkon gets deployed alongside swift at the moment, so I > > > don't know what to do to make this easier - are they written down > > > anywhere? > > > > > > -- > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > > From iraicu at cs.uchicago.edu Wed May 23 11:28:51 2007 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Wed, 23 May 2007 11:28:51 -0500 Subject: [Swift-devel] Re: GRAM and Swift discussion this week? Message-ID: <46546BC3.4070600@cs.uchicago.edu> See below: Tim Freeman wrote: > On Tue, 22 May 2007 14:34:07 -0500 > Ioan Raicu wrote: > > >> the many logs that we generate are quite hard for people to >> follow, and keep track of what each one contains; we fixed this by >> developing a GUI that can connect to the GT4 container remotely >> and display relevant information! >> > > Is that something that could be used for other GT services? > > I guess so... 
Here is a screen shot: http://people.cs.uchicago.edu/~iraicu/research/Falkon/Falkon_GUI.gif Essentially, all it does is it uses Java swing to paint the GUI, which has a bunch of text fields that get populated from data from the results of web service calls which are being polled against the GT4 service in question (Falkon in our case). Its nothing fancy, but I bet something like this could be made for the GT4 container in general that would give basic container and host statistics! > >> * the provisioner currently creates a new thread for every job >> (resource allocation) it sends to GRAM4 >> o depending on which allocation strategy is used, this >> might/might not be a problem on TG nodes >> o in theory, we don't want more than 100 or so GRAM4 >> jobs in parallel running, but if we choose the policy >> in which each job allocates a single machine, then we >> can easily surpass 100 jobs in parallel... all the >> other policies, would be able to allocate 1K+, even >> 10K+ machines with less than 100 jobs in parallel, so >> it could work perfectly fine even with the current >> implementation; in the long run, this might be able to >> be changed to a pool of threads in the provisioner! >> > > Are these threads just waiting on notifications? > > Right... I took the easy way out and just created 1 thread per GRAM job, but it doesn't have to be this way, as you pointed out below. > If so: you should be able to reduce this to one thread by subscribing with the > same notification consumer EPR for each GRAM job and demuxing the result (demux > based on the producer EPR that is passed to the class implementing > NotifyCallback). That way the thread that creates the GRAM job can disappear > once it is done with the create call. > > This is on my list of things to do, but I just haven't gotten around to fixing this! 
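A minimal sketch of the single-consumer idea Tim describes above, assuming nothing about the real Falkon or GT4 code: it is plain Python rather than Java, and `NotificationDemux`, `subscribe`, `deliver` and `run_demo` are made-up names standing in for the shared notification consumer and the NotifyCallback implementation. One thread drains a queue and dispatches each notification by producer id, so the code that creates a job does not have to keep a thread alive for it.

```python
import queue
import threading


class NotificationDemux:
    """Hypothetical stand-in for a shared notification consumer.

    A single worker thread dispatches notifications for any number of jobs,
    keyed by a producer id (playing the role of the producer EPR).
    """

    def __init__(self):
        self._events = queue.Queue()
        self._handlers = {}  # producer id -> callback
        self._lock = threading.Lock()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def subscribe(self, producer_id, callback):
        # Register the callback; the caller returns immediately afterwards.
        with self._lock:
            self._handlers[producer_id] = callback

    def deliver(self, producer_id, state):
        # Called by whatever receives notifications; only enqueues.
        self._events.put((producer_id, state))

    def _run(self):
        while True:
            item = self._events.get()
            if item is None:  # sentinel pushed by close()
                break
            producer_id, state = item
            with self._lock:
                callback = self._handlers.get(producer_id)
            if callback is not None:
                callback(producer_id, state)

    def close(self):
        # Everything queued before the sentinel is processed first (FIFO).
        self._events.put(None)
        self._thread.join()


def run_demo(n_jobs=200):
    """Subscribe n_jobs fake jobs and deliver one state change to each."""
    demux = NotificationDemux()
    seen = {}
    threads_used = set()

    def on_state(job_id, state):
        seen[job_id] = state
        threads_used.add(threading.get_ident())

    for i in range(n_jobs):
        demux.subscribe(f"job-{i}", on_state)
    for i in range(n_jobs):
        demux.deliver(f"job-{i}", "Done")
    demux.close()
    return seen, threads_used
```

Calling `run_demo()` returns the final state per job and the set of thread ids that ran callbacks; in this sketch that set stays at one entry however many jobs are subscribed.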
It hasn't really been an issue with my current tests and usage scenarios, but it needs to be addressed for the general case, as people will likely hit this problem if we have enough users using Falkon.

Thanks,
Ioan
> Tim
>
>

-- 
============================================
Ioan Raicu
Ph.D. Student
============================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
============================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
       http://dsl.cs.uchicago.edu/
============================================
============================================

From benc at hawaga.org.uk Wed May 23 11:35:25 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 23 May 2007 16:35:25 +0000 (GMT)
Subject: [Swift-devel] Re: GRAM and Swift discussion this week?
In-Reply-To: <46546BC3.4070600@cs.uchicago.edu>
References: <46546BC3.4070600@cs.uchicago.edu>
Message-ID:

On Wed, 23 May 2007, Ioan Raicu wrote:

> I guess so...
> Here is a screen shot:
> http://people.cs.uchicago.edu/~iraicu/research/Falkon/Falkon_GUI.gif
>
> Essentially, all it does is it uses Java swing to paint the GUI, which has a
> bunch of text fields that get populated from data from the results of web
> service calls which are being polled against the GT4 service in question
> (Falkon in our case).
> It's nothing fancy, but I bet something like this could
> be made for the GT4 container in general that would give basic container and
> host statistics!

Does it use WS-Resource Properties?

If it doesn't, it probably should.

If it does, it overlaps strongly with the work of the Globus MDS group and it might be interesting to interact with them.

-- 

From iraicu at cs.uchicago.edu Wed May 23 11:36:06 2007
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Wed, 23 May 2007 11:36:06 -0500
Subject: [Swift-devel] wiring swift and falkon together
In-Reply-To:
References:
Message-ID: <46546D76.3020904@cs.uchicago.edu>

Hmmm... from my understanding, the Falkon provider is independent of whether Swift will actually use Falkon or not. There is no requirement that Falkon be used, even if you have the Falkon provider installed!

With that said, our statement about people avoiding the latest version of Swift due to the Falkon provider doesn't make any sense. Maybe Yong has more input on this...

About how Falkon gets deployed: it is simply uncompressed, you modify 1 or 2 config files, and use the included scripts to start everything! All this is in the included readme.txt in the Falkon archive, downloadable from my web site. Once again, if someone is not interested in using Falkon, then I see no reason why they would be doing anything different than before just because there is now a Falkon provider in Swift.

Ioan

Ben Clifford wrote:
> i hear rumour that it's sufficiently unclear how to wire swift and falkon
> together that people are avoiding testing swift code (more recent than the
> 8th of march build that Yong made)
>
> that is lame - it means large chunks of our app testing are being done
> with code that is 2.5 months old.
>
> I don't know how Falkon gets deployed alongside swift at the moment, so I
> don't know what to do to make this easier - are they written down
> anywhere?
>
>

-- 
============================================
Ioan Raicu
Ph.D.
Student ============================================ Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 ============================================ Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dsl.cs.uchicago.edu/ ============================================ ============================================ From nefedova at mcs.anl.gov Wed May 23 11:39:45 2007 From: nefedova at mcs.anl.gov (Veronika Nefedova) Date: Wed, 23 May 2007 11:39:45 -0500 Subject: [Swift-devel] wiring swift and falkon together In-Reply-To: <46546D76.3020904@cs.uchicago.edu> References: <46546D76.3020904@cs.uchicago.edu> Message-ID: <4955391B-5395-4F90-852C-BC06908FBD20@mcs.anl.gov> I think Yong told me to use his swift install from terminable when I started using Falcon. I am not sure why -- I presumed there were some specifics in that install. Nika On May 23, 2007, at 11:36 AM, Ioan Raicu wrote: > Hmmm... from my understanding, the Falkon provider is independent > of the fact that Swift will actually use Falkon or not. There is > no requirement that Falkon be used, even if you have the Falkon > provider installed! > > With that said, our statement about people avoiding the latest > version of Swift due to the Falkon provider doesn't make any sense. > Maybe Yong has more input on this... > > About how Falkon gets deplyed, it is simply uncompressed, you > modify 1 or 2 config files, and use the included scripts to start > everything! All this is in the included readme.txt in the Falkon > archive, downloadable online on my web site. Once again, if > someone is not intersted in using Falkon, then I see no reason why > they would be doing anything different than before just because > there is now a Falkon provider in Swift. 
> > Ioan > > Ben Clifford wrote: >> i hear rumour that its sufficiently unclear how to wire swift and >> falkon together that people are avoiding testing swift code (more >> recent than the 8th of march build that Yong made) >> >> that is lame - it means large chunks of our app testing are being >> done with code that is 2.5 months old. >> >> I don't know how Falkon gets deployed alongside swift at the >> moment, so I don't know what to do to make this easier - are they >> written down anywhere? >> >> > > -- > ============================================ > Ioan Raicu > Ph.D. Student > ============================================ > Distributed Systems Laboratory > Computer Science Department > University of Chicago > 1100 E. 58th Street, Ryerson Hall > Chicago, IL 60637 > ============================================ > Email: iraicu at cs.uchicago.edu > Web: http://www.cs.uchicago.edu/~iraicu > http://dsl.cs.uchicago.edu/ > ============================================ > ============================================ > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From iraicu at cs.uchicago.edu Wed May 23 11:42:31 2007 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Wed, 23 May 2007 11:42:31 -0500 Subject: [Swift-devel] wiring swift and falkon together In-Reply-To: References: Message-ID: <46546EF7.5050506@cs.uchicago.edu> Yes, it needs the stubs generated from the WSDL file which defines the interface into Falkon. These stubs can simply be generated on the fly from the WSDL file, or copied from the Falkon service after compilation of the service. So, there are dependencies, but nothing that requires the Falkon distribution :) Ioan Ben Clifford wrote: > On Wed, 23 May 2007, Tiberiu Stef-Praun wrote: > > >> It seems that Yong's Falkon provider is working (according to Nika), >> so I was wondering when will it make it into the Swift ? 
>> At that point
>> it's more convenient for me to test it (as I would only have to handle
>> the Falkon backend configuration).
>>
>
> Does it have build dependencies on Falkon code?
>
>

-- 
============================================
Ioan Raicu
Ph.D. Student
============================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
============================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
       http://dsl.cs.uchicago.edu/
============================================
============================================

From benc at hawaga.org.uk Wed May 23 11:44:42 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 23 May 2007 16:44:42 +0000 (GMT)
Subject: [Swift-devel] wiring swift and falkon together
In-Reply-To: <46546D76.3020904@cs.uchicago.edu>
References: <46546D76.3020904@cs.uchicago.edu>
Message-ID:

On Wed, 23 May 2007, Ioan Raicu wrote:

> Hmmm... from my understanding, the Falkon provider is independent of the fact
> that Swift will actually use Falkon or not. There is no requirement that
> Falkon be used, even if you have the Falkon provider installed!

Hopefully it's that way, configurable by e.g. a site catalog setting. I don't know if that is the case right now, though. If not, we should make it that way.

> About how Falkon gets deployed, it is simply uncompressed, you modify 1 or 2
> config files, and use the included scripts to start everything! All this is
> in the included readme.txt in the Falkon archive, downloadable online on my
> web site. Once again, if someone is not interested in using Falkon, then I see
> no reason why they would be doing anything different than before just because
> there is now a Falkon provider in Swift.

ok. Does the swift/falkon provider need to be told an EPR to the Falkon web service?
My concerns mostly are not so much about having a provider in the source tree when people aren't going to use it; that's fine. But the code needs to not be in the form of some random jar file without it being clear where it came from.

If the code can build without needing Falkon code around (which I suspect it can't), then it's simple to put it in the Swift codebase.

If it has Falkon build dependencies (e.g. for web service stubs) then that's more stuff accumulating in the codebase that needs long term management (and brings in incompatibilities if you want to modify the Falkon web services API)

-- 

From yongzh at cs.uchicago.edu Wed May 23 12:04:59 2007
From: yongzh at cs.uchicago.edu (Yong Zhao)
Date: Wed, 23 May 2007 12:04:59 -0500 (CDT)
Subject: [Swift-devel] wiring swift and falkon together
In-Reply-To:
References:
Message-ID:

I would say testing swift has nothing to do with the Falkon provider. The provider is just one of the many providers that you can choose to use or not, such as local, GT2, GT4, PBS etc. I would strongly encourage people to look at the CoG documentation about providers and others.

The provider interface is nothing specific to Falkon. I am frustrated that you guys mix the provider issue with Falkon and make claims without looking deep into related documents.

Yong.

On Wed, 23 May 2007, Ben Clifford wrote:

>
> i hear rumour that it's sufficiently unclear how to wire swift and falkon
> together that people are avoiding testing swift code (more recent than the
> 8th of march build that Yong made)
>
> that is lame - it means large chunks of our app testing are being done
> with code that is 2.5 months old.
>
> I don't know how Falkon gets deployed alongside swift at the moment, so I
> don't know what to do to make this easier - are they written down
> anywhere?
> > -- > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From yongzh at cs.uchicago.edu Wed May 23 12:07:48 2007 From: yongzh at cs.uchicago.edu (Yong Zhao) Date: Wed, 23 May 2007 12:07:48 -0500 (CDT) Subject: [Swift-devel] wiring swift and falkon together In-Reply-To: References: Message-ID: It of course needs something from Falkon, for instance, the service stubs for the Falkon service. But it does not require Falkon to be deployed or bundled together with Swift. This applies to other providers such as GT2 or GT4. You can use them, but you do not need to package them with Swift. Yong. On Wed, 23 May 2007, Ben Clifford wrote: > > > On Wed, 23 May 2007, Tiberiu Stef-Praun wrote: > > > It seems that Yong's Falkon provider is working (according to Nika), > > so I was wondering when will it make it into the Swift ? At that point > > it's more convenient for me to test it (as I would only have to handle > > the Falkon backend configuration). > > Does it have build dependencies on Falkon code? > > -- > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From iraicu at cs.uchicago.edu Wed May 23 12:08:57 2007 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Wed, 23 May 2007 12:08:57 -0500 Subject: [Swift-devel] wiring swift and falkon together In-Reply-To: <465465D1.4060002@mcs.anl.gov> References: <465465D1.4060002@mcs.anl.gov> Message-ID: <46547529.9010809@cs.uchicago.edu> Hi, I am certainly swamped right now for the next months or so at the very least (data caching support in Falkon, working with Nika for her apps, Kate+Borja for possibly using VMs and EC2, DSL Workshop, SC challenge brainstorming, HPDC hot topics paper, etc...). 
I could certainly use some help from developers who might be much more familiar with what it takes to get a prototype from research to production ready. I am willing to work with these developers, but if I have to do it myself (and with Yong's help), then I can't promise anything about the timeline on which I can have something more production ready. Can any resources (developer power) be devoted to getting Falkon production ready?

Ioan

Mike Wilde wrote:
> I'm in favor of asking Ioan and possibly Yong - to the extent he has
> time - to push forward on this, to specifications from Ben and Mihael,
> and based on usability feedback from Nika and Tibi who need to speak
> for users' needs. Ben's specs should also address code quality,
> testing/certification and maintainability.
>
> - Mike
>
>
> Tiberiu Stef-Praun wrote, On 5/23/2007 10:57 AM:
>> It seems that Yong's Falkon provider is working (according to Nika),
>> so I was wondering when will it make it into the Swift ? At that point
>> it's more convenient for me to test it (as I would only have to handle
>> the Falkon backend configuration).
>>
>> Tibi
>>
>> On 5/23/07, Ben Clifford wrote:
>>>
>>> i hear rumour that it's sufficiently unclear how to wire swift and falkon
>>> together that people are avoiding testing swift code (more recent than the
>>> 8th of march build that Yong made)
>>>
>>> that is lame - it means large chunks of our app testing are being done
>>> with code that is 2.5 months old.
>>>
>>> I don't know how Falkon gets deployed alongside swift at the moment, so I
>>> don't know what to do to make this easier - are they written down
>>> anywhere?
>>>
>>> --
>>> _______________________________________________
>>> Swift-devel mailing list
>>> Swift-devel at ci.uchicago.edu
>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>>>
>>
>>
>

-- 
============================================
Ioan Raicu
Ph.D.
Student
============================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
============================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
       http://dsl.cs.uchicago.edu/
============================================
============================================

From yongzh at cs.uchicago.edu Wed May 23 12:10:48 2007
From: yongzh at cs.uchicago.edu (Yong Zhao)
Date: Wed, 23 May 2007 12:10:48 -0500 (CDT)
Subject: [Swift-devel] wiring swift and falkon together
In-Reply-To: <4955391B-5395-4F90-852C-BC06908FBD20@mcs.anl.gov>
References: <46546D76.3020904@cs.uchicago.edu> <4955391B-5395-4F90-852C-BC06908FBD20@mcs.anl.gov>
Message-ID:

Because the Falkon provider code is not in SVN, and that install is where you can get the Falkon provider code. But you did update your Swift code to the latest source code in SVN at that time (as you needed some new features) as I told you to. So in that sense you were testing at least some relatively new Swift code.

Yong.

On Wed, 23 May 2007, Veronika Nefedova wrote:

> I think Yong told me to use his swift install from terminable when I
> started using Falkon. I am not sure why -- I presumed there were some
> specifics in that install.
>
> Nika
>
> On May 23, 2007, at 11:36 AM, Ioan Raicu wrote:
>
> > Hmmm... from my understanding, the Falkon provider is independent
> > of the fact that Swift will actually use Falkon or not. There is
> > no requirement that Falkon be used, even if you have the Falkon
> > provider installed!
> >
> > With that said, our statement about people avoiding the latest
> > version of Swift due to the Falkon provider doesn't make any sense.
> > Maybe Yong has more input on this...
> >
> > About how Falkon gets deployed, it is simply uncompressed, you
> > modify 1 or 2 config files, and use the included scripts to start
> > everything!
All this is in the included readme.txt in the Falkon > > archive, downloadable online on my web site. Once again, if > > someone is not intersted in using Falkon, then I see no reason why > > they would be doing anything different than before just because > > there is now a Falkon provider in Swift. > > > > Ioan > > > > Ben Clifford wrote: > >> i hear rumour that its sufficiently unclear how to wire swift and > >> falkon together that people are avoiding testing swift code (more > >> recent than the 8th of march build that Yong made) > >> > >> that is lame - it means large chunks of our app testing are being > >> done with code that is 2.5 months old. > >> > >> I don't know how Falkon gets deployed alongside swift at the > >> moment, so I don't know what to do to make this easier - are they > >> written down anywhere? > >> > >> > > > > -- > > ============================================ > > Ioan Raicu > > Ph.D. Student > > ============================================ > > Distributed Systems Laboratory > > Computer Science Department > > University of Chicago > > 1100 E. 
58th Street, Ryerson Hall > > Chicago, IL 60637 > > ============================================ > > Email: iraicu at cs.uchicago.edu > > Web: http://www.cs.uchicago.edu/~iraicu > > http://dsl.cs.uchicago.edu/ > > ============================================ > > ============================================ > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From tiberius at ci.uchicago.edu Wed May 23 12:40:24 2007 From: tiberius at ci.uchicago.edu (Tiberiu Stef-Praun) Date: Wed, 23 May 2007 12:40:24 -0500 Subject: [Swift-devel] wiring swift and falkon together In-Reply-To: References: <46546D76.3020904@cs.uchicago.edu> <4955391B-5395-4F90-852C-BC06908FBD20@mcs.anl.gov> Message-ID: So to summarize: - apparently there are small changes needed by the Swift's Falcon provider - will anyone get these in the SVN, and make Swift " Falcon-ready" ? Note that I did not say "Falcon enabled", because the latter means that a Falkon service is installed somewhere and ready to run Tibi On 5/23/07, Yong Zhao wrote: > Because the Falkon provider code is not in SVN, and that install is where > you can get the Falkon provider code. But you did update your Swift code > to the lastest source code in SVN at that time (as you needed some > new features) as I told you to. So in that sense you were testing at least > some relative new Swift code. > > Yong. > > On Wed, 23 May 2007, Veronika Nefedova wrote: > > > I think Yong told me to use his swift install from terminable when I > > started using Falcon. I am not sure why -- I presumed there were some > > specifics in that install. > > > > Nika > > > > On May 23, 2007, at 11:36 AM, Ioan Raicu wrote: > > > > > Hmmm... 
from my understanding, the Falkon provider is independent > > > of the fact that Swift will actually use Falkon or not. There is > > > no requirement that Falkon be used, even if you have the Falkon > > > provider installed! > > > > > > With that said, our statement about people avoiding the latest > > > version of Swift due to the Falkon provider doesn't make any sense. > > > Maybe Yong has more input on this... > > > > > > About how Falkon gets deplyed, it is simply uncompressed, you > > > modify 1 or 2 config files, and use the included scripts to start > > > everything! All this is in the included readme.txt in the Falkon > > > archive, downloadable online on my web site. Once again, if > > > someone is not intersted in using Falkon, then I see no reason why > > > they would be doing anything different than before just because > > > there is now a Falkon provider in Swift. > > > > > > Ioan > > > > > > Ben Clifford wrote: > > >> i hear rumour that its sufficiently unclear how to wire swift and > > >> falkon together that people are avoiding testing swift code (more > > >> recent than the 8th of march build that Yong made) > > >> > > >> that is lame - it means large chunks of our app testing are being > > >> done with code that is 2.5 months old. > > >> > > >> I don't know how Falkon gets deployed alongside swift at the > > >> moment, so I don't know what to do to make this easier - are they > > >> written down anywhere? > > >> > > >> > > > > > > -- > > > ============================================ > > > Ioan Raicu > > > Ph.D. Student > > > ============================================ > > > Distributed Systems Laboratory > > > Computer Science Department > > > University of Chicago > > > 1100 E. 
58th Street, Ryerson Hall > > > Chicago, IL 60637 > > > ============================================ > > > Email: iraicu at cs.uchicago.edu > > > Web: http://www.cs.uchicago.edu/~iraicu > > > http://dsl.cs.uchicago.edu/ > > > ============================================ > > > ============================================ > > > > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > -- Tiberiu (Tibi) Stef-Praun, PhD Research Staff, Computation Institute 5640 S. Ellis Ave, #405 University of Chicago http://www-unix.mcs.anl.gov/~tiberius/ From benc at hawaga.org.uk Wed May 23 12:42:26 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 23 May 2007 17:42:26 +0000 (GMT) Subject: [Swift-devel] wiring swift and falkon together In-Reply-To: References: <46546D76.3020904@cs.uchicago.edu> <4955391B-5395-4F90-852C-BC06908FBD20@mcs.anl.gov> Message-ID: so a relatively straightforward thing to do would be to put the source code into the swift SVN, put the stubs in jar form into the swift SVN, have the falkon provider built as part of the swift build and made available for use. another way would be for it to go into cog. but that's for cog to decide, not me. either way looks pretty much the same when swift is deployed. how does a user specify that jobs should go through falkon rather than the other mechanisms? 
-- 

From yongzh at cs.uchicago.edu Wed May 23 13:09:08 2007
From: yongzh at cs.uchicago.edu (Yong Zhao)
Date: Wed, 23 May 2007 13:09:08 -0500 (CDT)
Subject: [Swift-devel] wiring swift and falkon together
In-Reply-To:
References: <46546D76.3020904@cs.uchicago.edu> <4955391B-5395-4F90-852C-BC06908FBD20@mcs.anl.gov>
Message-ID:

Currently the provider resides in the cog branch. I'm not quite sure how to put it into another branch.

In the sites.xml, if there is a Falkon service URL, then the Falkon provider is selected.

Yong.

On Wed, 23 May 2007, Ben Clifford wrote:

>
> so a relatively straightforward thing to do would be to put the source
> code into the swift SVN, put the stubs in jar form into the swift SVN,
> have the falkon provider built as part of the swift build and made
> available for use.
>
> another way would be for it to go into cog. but that's for cog to decide,
> not me.
>
> either way looks pretty much the same when swift is deployed.
>
> how does a user specify that jobs should go through falkon rather than the
> other mechanisms?
>
> --
>

From benc at hawaga.org.uk Wed May 23 13:11:58 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 23 May 2007 18:11:58 +0000 (GMT)
Subject: [Swift-devel] wiring swift and falkon together
In-Reply-To: <46547529.9010809@cs.uchicago.edu>
References: <465465D1.4060002@mcs.anl.gov> <46547529.9010809@cs.uchicago.edu>
Message-ID:

On Wed, 23 May 2007, Ioan Raicu wrote:

> I could certainly use some help from developers which might be much more
> familiar with what it takes to get a prototype from research to production
> ready.

what it takes, perhaps more than anything, is a bunch of time, both as a one-off occurrence and as an on-going concern. something no-one has much of :-(

most often underestimated is the on-going time - I've seen plenty of stuff being made "production ready and released" and then left to rot, which it will do within months without constant care and attention.
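On the provider-selection point from Yong's message above, here is a rough sketch of one way a site entry could name its execution provider explicitly. This is an assumption for illustration only: the thread does not say how a Falkon URL is recognised, the `execution` element, its `provider` and `url` attributes and the URLs are invented here and are not the actual sites.xml schema, and `select_providers` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Hypothetical site catalog fragment. Element names, attribute names and
# URLs are invented for this sketch; they are not the real sites.xml schema.
SITES_XML = """
<config>
  <pool handle="local-falkon">
    <execution provider="falkon" url="http://localhost:50001/hypothetical/falkon/service"/>
  </pool>
  <pool handle="plain-gram">
    <execution provider="gt2" url="gatekeeper.example.org/jobmanager-pbs"/>
  </pool>
  <pool handle="no-execution-entry"/>
</config>
"""


def select_providers(xml_text, default="local"):
    """Map each site handle to the provider named in its execution entry.

    Sites without an execution entry (or without a provider attribute)
    fall back to the default provider.
    """
    root = ET.fromstring(xml_text)
    chosen = {}
    for pool in root.findall("pool"):
        execution = pool.find("execution")
        if execution is None:
            provider = default
        else:
            provider = execution.get("provider", default)
        chosen[pool.get("handle")] = provider
    return chosen
```

With an explicit attribute like this, a site that never mentions Falkon keeps behaving as before, which is the property Ioan and Yong both argue for in this thread.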
--

From benc at hawaga.org.uk Wed May 23 13:12:25 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 23 May 2007 18:12:25 +0000 (GMT)
Subject: [Swift-devel] wiring swift and falkon together
In-Reply-To:
References: <46546D76.3020904@cs.uchicago.edu> <4955391B-5395-4F90-852C-BC06908FBD20@mcs.anl.gov>
Message-ID:

On Wed, 23 May 2007, Yong Zhao wrote:

> In the sites.xml, if there is a Falkon service URL, then the Falkon
> provider is selected.

how is it determined that it's a falkon url?

--

From iraicu at cs.uchicago.edu Wed May 23 13:14:54 2007
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Wed, 23 May 2007 13:14:54 -0500
Subject: [Swift-devel] wiring swift and falkon together
In-Reply-To:
References: <465465D1.4060002@mcs.anl.gov>
Message-ID: <4654849E.9030902@cs.uchicago.edu>

Ben Clifford wrote:
> What does Falkon deployment look like at the moment? (in terms of
> procedures to deploy it from an empty computer, and in terms of how files
> are laid out, and in terms of how things get configured)?
>
There is an archive that has everything one might need: a GT4 container, the service code (which is already deployed in the GT4 container), the executor code, the client code, and the monitor GUI code. We also bundle a 1.4 32-bit JRE. There is a configuration file that points to where the logs are supposed to go, and another one with what security mechanisms you want to use. There are also a bunch of settable parameters in the startup scripts, and a few obscure settable parameters in the code, which require recompiling the service. To recompile the service, you need ANT installed and configured as well as a 1.4+ JDK; to recompile anything else, you just need a 1.4+ JDK. With a single script and a single argument (the port number), you can start the entire Falkon system!
> I think it doesn't make sense to look at the falkon/swift interface code
> without looking at the whole deployment process for both Swift and Falkon
> together.
>
My understanding is that the Falkon provider can be specified in the sites.xml, including where the Falkon dispatcher will be found. Other than that, everything else should be straightforward.

Ioan
> On Wed, 23 May 2007, Mike Wilde wrote:
>
>
>> I'm in favor of asking Ioan and possibly Yong - to the extent he has time - to
>> push forward on this, to specifications from Ben and Mihael, and based on
>> usability feedback from Nika and Tibi who need to speak for users' needs.
>> Ben's specs should also address code quality, testing/certification and
>> maintainability.
>>
>> - Mike
>>
>>
>> Tiberiu Stef-Praun wrote, On 5/23/2007 10:57 AM:
>>
>>> It seems that Yong's Falkon provider is working (according to Nika),
>>> so I was wondering when will it make it into the Swift ? At that point
>>> it's more convenient for me to test it (as I would only have to handle
>>> the Falkon backend configuration).
>>>
>>> Tibi
>>>
>>> On 5/23/07, Ben Clifford wrote:
>>>
>>>> i hear rumour that it's sufficiently unclear how to wire swift and falkon
>>>> together that people are avoiding testing swift code (more recent than the
>>>> 8th of march build that Yong made)
>>>>
>>>> that is lame - it means large chunks of our app testing are being done
>>>> with code that is 2.5 months old.
>>>>
>>>> I don't know how Falkon gets deployed alongside swift at the moment, so I
>>>> don't know what to do to make this easier - are they written down
>>>> anywhere?
>>>>
>>>> --

--
============================================
Ioan Raicu Ph.D.
Student
============================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
============================================
Email: iraicu at cs.uchicago.edu
Web: http://www.cs.uchicago.edu/~iraicu
http://dsl.cs.uchicago.edu/
============================================
============================================

From iraicu at cs.uchicago.edu Wed May 23 13:17:30 2007
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Wed, 23 May 2007 13:17:30 -0500
Subject: [Swift-devel] Re: GRAM and Swift discussion this week?
In-Reply-To:
References: <46546BC3.4070600@cs.uchicago.edu>
Message-ID: <4654853A.6000104@cs.uchicago.edu>

Ben Clifford wrote:
> On Wed, 23 May 2007, Ioan Raicu wrote:
>
>
>> I guess so...
>> Here is a screen shot:
>> http://people.cs.uchicago.edu/~iraicu/research/Falkon/Falkon_GUI.gif
>>
>>
>> Essentially, all it does is it uses Java Swing to paint the GUI, which has a
>> bunch of text fields that get populated from data from the results of web
>> service calls which are being polled against the GT4 service in question
>> (Falkon in our case). It's nothing fancy, but I bet something like this could
>> be made for the GT4 container in general that would give basic container and
>> host statistics!
>>
>
> Does it use WS-Resource Properties?
No, but it could... the GUI was a 1-day hack, and I found it simpler to simply add a monitorStatus function that returned a bunch of system metrics!
> If it doesn't, it probably should. If
> it does, it overlaps strongly with the work of the Globus MDS group and
> it might be interesting to interact with them.
>
I never meant for the monitor GUI to be anything fancy; it was simply to give me a more efficient way of looking at the log files. I intended it to be a poll-driven GUI, rather than notification-driven, for simplicity!
If anyone wants to extend this, feel free!

Ioan

From yongzh at cs.uchicago.edu Wed May 23 13:18:07 2007
From: yongzh at cs.uchicago.edu (Yong Zhao)
Date: Wed, 23 May 2007 13:18:07 -0500 (CDT)
Subject: [Swift-devel] wiring swift and falkon together
In-Reply-To:
References: <46546D76.3020904@cs.uchicago.edu> <4955391B-5395-4F90-852C-BC06908FBD20@mcs.anl.gov>
Message-ID:

It is a WSRF service EPR with something like this:

http://tg-login1.uc.teragrid.org:50001/wsrf/services/GenericPortal/core/WS/GPFactoryService

Although the GenericPortal stuff needs to be changed to Falkon soon.

Yong.

From benc at hawaga.org.uk Wed May 23 13:24:31 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 23 May 2007 18:24:31 +0000 (GMT)
Subject: [Swift-devel] wiring swift and falkon together
In-Reply-To:
References: <46546D76.3020904@cs.uchicago.edu> <4955391B-5395-4F90-852C-BC06908FBD20@mcs.anl.gov>
Message-ID:

On Wed, 23 May 2007, Yong Zhao wrote:

> It is a WSRF service EPR with something like this:
>
> http://tg-login1.uc.teragrid.org:50001/wsrf/services/GenericPortal/core/WS/GPFactoryService
>
> Although the GenericPortal stuff needs to be changed to Falkon soon.

An http URI doesn't really indicate that it's Falkon compared to some other system that also chooses to use web services to submit. Perhaps there should be a site catalog entry to pick providers - there already sort of is that in the legacy GRAM version parameter.

--

From benc at hawaga.org.uk Wed May 23 13:30:21 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 23 May 2007 18:30:21 +0000 (GMT)
Subject: [Swift-devel] Re: GRAM and Swift discussion this week?
In-Reply-To: <4654853A.6000104@cs.uchicago.edu>
References: <46546BC3.4070600@cs.uchicago.edu> <4654853A.6000104@cs.uchicago.edu>
Message-ID:

On Wed, 23 May 2007, Ioan Raicu wrote:

> I found it simpler to simply add a monitorStatus function that returned
> a bunch of system metrics!

A damnation of the GT WS Resource Properties implementation!

--

From benc at hawaga.org.uk Wed May 23 13:38:15 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 23 May 2007 18:38:15 +0000 (GMT)
Subject: [Swift-devel] wiring swift and falkon together
In-Reply-To:
References: <46546D76.3020904@cs.uchicago.edu> <4955391B-5395-4F90-852C-BC06908FBD20@mcs.anl.gov>
Message-ID:

On Wed, 23 May 2007, Yong Zhao wrote:

> Currently the provider resides in the cog branch. I'm not quite sure how
> to put it into another branch.

It is in the cog svn at the moment?

--

From foster at mcs.anl.gov Wed May 23 13:38:47 2007
From: foster at mcs.anl.gov (Ian Foster)
Date: Wed, 23 May 2007 13:38:47 -0500
Subject: [Swift-devel] Re: GRAM and Swift discussion this week?
In-Reply-To:
References: <46546BC3.4070600@cs.uchicago.edu> <4654853A.6000104@cs.uchicago.edu>
Message-ID: <46548A37.7070708@mcs.anl.gov>

maybe ... or maybe an indication that Ioan is an inveterate NIHer ...

--
Ian Foster, Director, Computation Institute
Argonne National Laboratory & University of Chicago
Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439
Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637
Tel: +1 630 252 4619. Web: www.ci.uchicago.edu.
Globus Alliance: www.globus.org.

From iraicu at cs.uchicago.edu Wed May 23 13:51:23 2007
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Wed, 23 May 2007 13:51:23 -0500
Subject: [Swift-devel] wiring swift and falkon together
In-Reply-To:
References: <46546D76.3020904@cs.uchicago.edu>
Message-ID: <46548D2B.1010404@cs.uchicago.edu>

Ben Clifford wrote:
> On Wed, 23 May 2007, Ioan Raicu wrote:
>
>
>> Hmmm... from my understanding, the Falkon provider is independent of the fact
>> that Swift will actually use Falkon or not. There is no requirement that
>> Falkon be used, even if you have the Falkon provider installed!
>>
>
> Hopefully it's that way, configurable by eg. a site catalog setting. I
> don't know if that is the case though right now. If not, we should make it
> that way.
>
>
>> About how Falkon gets deployed, it is simply uncompressed, you modify 1 or 2
>> config files, and use the included scripts to start everything! All this is
>> in the included readme.txt in the Falkon archive, downloadable online on my
>> web site. Once again, if someone is not interested in using Falkon, then I see
>> no reason why they would be doing anything different than before just because
>> there is now a Falkon provider in Swift.
>>
>
> ok. Does the swift/falkon provider need to be told an EPR to the Falkon
> web service?
>
No, it creates a new resource for which the EPR is returned, and that is used over and over again until Swift shuts down and the resource is destroyed. Basically, the service URL is all that is needed!
> My concerns mostly are not so much about having a provider in the source
> tree when people aren't going to use it; that's fine.
> But the code needs to
> not be in the form of some random jar file without it being clear where it
> came from. If the code can build without needing Falkon code around (which
> I suspect it can't), then it's simple to put it in the Swift codebase. If
> it has Falkon build dependencies (eg for web service stubs) then that's
> more stuff accumulating in the codebase that needs long term management
> (and brings in incompatibilities if you want to modify the Falkon web
> services API)
>
I was able to generate stubs from a command line tool bundled with GT4 a while back, so I don't see why you couldn't just have it all Falkon independent!

Ioan

From benc at hawaga.org.uk Wed May 23 13:55:28 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 23 May 2007 18:55:28 +0000 (GMT)
Subject: [Swift-devel] wiring swift and falkon together
In-Reply-To: <46548D2B.1010404@cs.uchicago.edu>
References: <46546D76.3020904@cs.uchicago.edu> <46548D2B.1010404@cs.uchicago.edu>
Message-ID:

On Wed, 23 May 2007, Ioan Raicu wrote:

> I was able to generate stubs from a command line tool bundled with GT4 a while
> back, so I don't see why you couldn't just have it all Falkon independent!

needs the wsdl though?
--

From iraicu at cs.uchicago.edu Wed May 23 13:58:21 2007
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Wed, 23 May 2007 13:58:21 -0500
Subject: [Swift-devel] wiring swift and falkon together
In-Reply-To:
References: <46546D76.3020904@cs.uchicago.edu> <4955391B-5395-4F90-852C-BC06908FBD20@mcs.anl.gov>
Message-ID: <46548ECD.8060103@cs.uchicago.edu>

I believe that it's searching not just for the http:// url, but a specific string (i.e. GenericPortal currently, Falkon soon)...

Ioan

From benc at hawaga.org.uk Wed May 23 14:00:38 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 23 May 2007 19:00:38 +0000 (GMT)
Subject: [Swift-devel] wiring swift and falkon together
In-Reply-To: <46548ECD.8060103@cs.uchicago.edu>
References: <46546D76.3020904@cs.uchicago.edu> <4955391B-5395-4F90-852C-BC06908FBD20@mcs.anl.gov> <46548ECD.8060103@cs.uchicago.edu>
Message-ID:

On Wed, 23 May 2007, Ioan Raicu wrote:

> I believe that it's searching not just for the http:// url, but a specific
> string (i.e. GenericPortal currently, Falkon soon)...

evil! service URLs are (should be) opaque.

--

From iraicu at cs.uchicago.edu Wed May 23 14:01:47 2007
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Wed, 23 May 2007 14:01:47 -0500
Subject: [Swift-devel] Re: GRAM and Swift discussion this week?
In-Reply-To:
References: <46546BC3.4070600@cs.uchicago.edu> <4654853A.6000104@cs.uchicago.edu>
Message-ID: <46548F9B.8040303@cs.uchicago.edu>

I did not give it much thought, and did not look into how it would look as resource properties. My service does expose some resource properties, but I found them to be harder to configure, and in my current way of handling them, I would have had to retrieve each resource property in a separate WS call, being very inefficient :( Maybe I could have made an encapsulating object that held all the system metrics, similar to what my function does, but ah well... in the next version...

Ioan

From iraicu at cs.uchicago.edu Wed May 23 14:04:24 2007
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Wed, 23 May 2007 14:04:24 -0500
Subject: [Swift-devel] wiring swift and falkon together
In-Reply-To:
References: <46546D76.3020904@cs.uchicago.edu> <46548D2B.1010404@cs.uchicago.edu>
Message-ID: <46549038.5020009@cs.uchicago.edu>

You can get the WSDL from a running service by querying the service in a standard WS way....

Ioan

From benc at hawaga.org.uk Wed May 23 14:05:12 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 23 May 2007 19:05:12 +0000 (GMT)
Subject: [Swift-devel] wiring swift and falkon together
In-Reply-To: <46549038.5020009@cs.uchicago.edu>
References: <46546D76.3020904@cs.uchicago.edu> <46548D2B.1010404@cs.uchicago.edu> <46549038.5020009@cs.uchicago.edu>
Message-ID:

On Wed, 23 May 2007, Ioan Raicu wrote:

> You can get the WSDL from a running service by querying the service in a
> standard WS way....

but not at compile time.

--

From iraicu at cs.uchicago.edu Wed May 23 15:23:55 2007
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Wed, 23 May 2007 15:23:55 -0500
Subject: [Swift-devel] wiring swift and falkon together
In-Reply-To:
References: <46546D76.3020904@cs.uchicago.edu> <46548D2B.1010404@cs.uchicago.edu> <46549038.5020009@cs.uchicago.edu>
Message-ID: <4654A2DB.7030904@cs.uchicago.edu>

It all depends on how complicated your compile scripts are, and if Falkon is operational anywhere... it could be done at compile time... if not, then you'd have to package it with Swift... these are probably small details, I bet we could work around them if we know exactly what end result we want. Also, what is so bad about including the WSDL definition of Falkon with Swift, so you can generate the stubs at compile time?

Ioan

From benc at hawaga.org.uk Wed May 23 18:36:09 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 23 May 2007 23:36:09 +0000 (GMT)
Subject: [Swift-devel] wiring swift and falkon together
In-Reply-To: <4654A2DB.7030904@cs.uchicago.edu>
References: <46546D76.3020904@cs.uchicago.edu> <46548D2B.1010404@cs.uchicago.edu> <46549038.5020009@cs.uchicago.edu> <4654A2DB.7030904@cs.uchicago.edu>
Message-ID:

On Wed, 23 May 2007, Ioan Raicu wrote:

> Also, what is so bad about including the WSDL definition of Falkon with
> Swift, so you can generate the stubs at compile time?

Not so much of an issue. Pretty much the main consideration is that you will have difficulty changing the interface once people start taking copies of the interface definition. But I think that is the way to go for now.

--

From hategan at mcs.anl.gov Thu May 24 03:49:06 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 24 May 2007 11:49:06 +0300
Subject: [Swift-devel] wiring swift and falkon together
In-Reply-To: (from benc@hawaga.org.uk on Wed May 23 20:42:26 2007)
Message-ID: <1179996546l.17759l.0l@blabla>

On 05/23/2007 08:42:26 PM, Ben Clifford wrote:
>
> so a relatively straightforward thing to do would be to put the source
> code into the swift SVN, put the stubs in jar form into the swift SVN,
> have the falkon provider built as part of the swift build and made
> available for use.

I'm not sure if that is wise. The falkon provider should be a separate module (build entity). In other words, it should be straightforward to either build it or not build it.

>
> another way would be for it to go into cog.
> but that's for cog to
> decide, not me.

It's somewhat unlikely.

>
> either way looks pretty much the same when swift is deployed.
>
> how does a user specify that jobs should go through falkon rather than
> the other mechanisms?
>
> --

From benc at hawaga.org.uk Thu May 24 16:31:58 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 24 May 2007 21:31:58 +0000 (GMT)
Subject: [Swift-devel] wiring swift and falkon together
In-Reply-To: <1179996546l.17759l.0l@blabla>
References: <1179996546l.17759l.0l@blabla>
Message-ID:

On Thu, 24 May 2007, Mihael Hategan wrote:

> I'm not sure if that is wise. The falkon provider should be a separate module
> (build entity). In other words, it should be straightforward to either build
> it or not build it.

needs a sensible deployment mechanism then, which pretty much means some nicer way of plugging in new providers to swift than manually editing config files / source code each time. there's a feature req for something like that for plugging in new mappers, but it should cover both.

--

From hategan at mcs.anl.gov Fri May 25 02:48:07 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 25 May 2007 10:48:07 +0300
Subject: [Swift-devel] wiring swift and falkon together
In-Reply-To: (from benc@hawaga.org.uk on Fri May 25 00:31:58 2007)
References: <1179996546l.17759l.0l@blabla>
Message-ID: <1180079287l.22334l.0l@blabla>

On 05/25/2007 12:31:58 AM, Ben Clifford wrote:
>
> On Thu, 24 May 2007, Mihael Hategan wrote:
>
> > I'm not sure if that is wise. The falkon provider should be a separate module
> > (build entity). In other words, it should be straightforward to either build
> > it or not build it.
>
> needs a sensible deployment mechanism then, which pretty much means some
> nicer way of plugging in new providers to swift than manually editing
> config files / source code each time. there's a feature req for something
> like that for plugging in new mappers, but it should cover both.
>

Providers in cog are dynamically loaded. Assuming that swift handles the sites.xml entries correctly, all it takes is to have the relevant jars and config files on the classpath (assuming that the provider itself does not have funny requirements).

Mihael

From benc at hawaga.org.uk Fri May 25 06:45:37 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 25 May 2007 11:45:37 +0000 (GMT)
Subject: [Swift-devel] wiring swift and falkon together
In-Reply-To: <1180079287l.22334l.0l@blabla>
References: <1179996546l.17759l.0l@blabla> <1180079287l.22334l.0l@blabla>
Message-ID:

On Fri, 25 May 2007, Mihael Hategan wrote:

> Providers in cog are dynamically loaded. Assuming that swift handles the
> sites.xml entries correctly

which I guess isn't the case at the moment because providers are explicitly named in libexec/scheduler.xml and in vdl-sc.k.

--

From hategan at mcs.anl.gov Wed May 30 06:03:52 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 30 May 2007 14:03:52 +0300
Subject: [Swift-devel] wiring swift and falkon together
In-Reply-To:
References: <1179996546l.17759l.0l@blabla> <1180079287l.22334l.0l@blabla>
Message-ID: <1180523032.2501.10.camel@blabla.mcs.anl.gov>

On Fri, 2007-05-25 at 11:45 +0000, Ben Clifford wrote:
>
> On Fri, 25 May 2007, Mihael Hategan wrote:
>
> > Providers in cog are dynamically loaded. Assuming that swift handles the
> > sites.xml entries correctly
>
> which I guess isn't the case at the moment because providers are
> explicitly named in libexec/scheduler.xml and in vdl-sc.k.

Right.
Let me rephrase: if vdl-sc.k has the right stuff, then deployment of the falkon provider should consist of sticking the jar & config files in lib and etc, respectively.

From benc at hawaga.org.uk Thu May 31 16:45:14 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 31 May 2007 21:45:14 +0000 (GMT)
Subject: [Swift-devel] Teragrid usage
In-Reply-To:
References:
Message-ID:

Anyone know if it's possible to see how those units were spent? (eg userid? job logs?)

On Wed, 16 May 2007, Veronika Nefedova wrote:

> Hi,
>
> I checked my Teragrid accounts and it looks like the Swift's allocation is
> almost completely used by now (or is it just for me ?):
>
> Account: TG-CDA060004T
> Title: TeraGrid: Development Account for Multiple Grid Science Projects
> Resource: teragrid_roaming
> Allocation Period: 2006-08-30 to 2007-08-31
>
> Name (Last First) or Account      Total    Remaining      Usage
> ---------------------------- ---------- ------------ ----------
> Nefedova Veronika              30000 SU         0 SU   27491 SU
> ----------------------------------------------------------------------
>
> Fortunately, Benoit has added me to his group's allocation - so I can continue
> testing on TG. But it looks like Swift's allocation is almost gone... Should
> we renew it ?
>
> Nika

From wilde at mcs.anl.gov Thu May 31 17:05:09 2007
From: wilde at mcs.anl.gov (Mike Wilde)
Date: Thu, 31 May 2007 17:05:09 -0500
Subject: [Swift-devel] Teragrid usage
In-Reply-To:
References:
Message-ID: <465F4695.7040505@mcs.anl.gov>

Here's a first approximation:

Account: TG-CDA060004T
Title: TeraGrid: Development Account for Multiple Grid Science Projects
Resource: teragrid_roaming
Local project name on dtf.ncsa.teragrid is kgx
Allocation Period: 2006-08-30 to 2007-08-31

Name (Last First) or Account      Total    Remaining      Usage
---------------------------- ---------- ------------ ----------
Clifford Ben                   30000 SU         0 SU       0 SU
Jamieson Andrew                30000 SU         0 SU       0 SU
Nefedova Veronika              30000 SU         0 SU   31147 SU
Stef-praun Tiberiu             30000 SU         0 SU     568 SU
PI-Wilde Michael               30000 SU         0 SU       0 SU
Zhao Yong                      30000 SU         0 SU   13664 SU

- Mike

--
Mike Wilde
Computation Institute, University of Chicago
Math & Computer Science Division
Argonne National Laboratory
Argonne, IL 60439 USA
tel 630-252-7497 fax 630-252-1997

From foster at mcs.anl.gov Thu May 31 20:16:20 2007
From: foster at mcs.anl.gov (Ian Foster)
Date: Thu, 31 May 2007 20:16:20 -0500
Subject: [Swift-devel] Teragrid usage
In-Reply-To:
References:
Message-ID: <465F7364.9000808@mcs.anl.gov>

We shouldn't be using the "Swift development" account for application work. We should have a CNARI allocation, an economics allocation, a MolDyn allocation, etc.