From nefedova at mcs.anl.gov  Fri Jun  1 08:19:58 2007
From: nefedova at mcs.anl.gov (Veronika Nefedova)
Date: Fri, 1 Jun 2007 08:19:58 -0500
Subject: [Swift-devel] disk space requirement
Message-ID: <C04744A1-D52B-4064-9A71-F60D52A8AFEE@mcs.anl.gov>

Hi,

I know I have raised this question many times before, but I need a
solution to it very soon (= now). I have the machinery in place to do
really big runs for MolDyn. Currently, the workflow produces about
0.9GB of data for each molecule, but nearly all of it is intermediate
data; all I need is *one* 300K file as a result (per molecule). The
rest I do not need. So my questions are:

1. How can I stop the intermediate results from being staged back to my
submit host? I do not need that when there is only one remote compute
pool. VDS had this feature...

2. If implementing this feature is very hard and time-consuming --  
what submit host would you recommend for a 244-molecule run ? Roughly  
244 GB of space is needed.

A 20-molecule run is all I can do on wiggum (where the application  
code is currently compiled).

Thanks!

Nika
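
For reference, the sizes quoted above can be checked with a few lines of
Python. This is only a back-of-the-envelope sketch: it assumes "300K"
means 300 KB and that 1 MB = 1024 KB, and the function and variable
names are made up for illustration.

```python
# Rough storage estimate for the MolDyn run, using the figures in the message.
GB_PER_MOLECULE = 0.9          # intermediate data per molecule
RESULT_KB_PER_MOLECULE = 300   # assumption: "300K" means 300 KB


def staging_estimate(molecules):
    """Return (intermediate data in GB, final results in MB) for a run."""
    total_gb = molecules * GB_PER_MOLECULE
    results_mb = molecules * RESULT_KB_PER_MOLECULE / 1024
    return total_gb, results_mb


total_gb, results_mb = staging_estimate(244)
print(f"244 molecules: {total_gb:.1f} GB staged back, {results_mb:.1f} MB wanted")

small_gb, _ = staging_estimate(20)
print(f"20 molecules: {small_gb:.1f} GB staged back")
```

So the 244-molecule run stages back a bit under the "roughly 244 GB"
asked for, while the files that are actually wanted add up to well
under 100 MB.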


From wilde at mcs.anl.gov  Fri Jun  1 08:55:12 2007
From: wilde at mcs.anl.gov (Mike Wilde)
Date: Fri, 01 Jun 2007 08:55:12 -0500
Subject: [Swift-devel] disk space requirement
In-Reply-To: <C04744A1-D52B-4064-9A71-F60D52A8AFEE@mcs.anl.gov>
References: <C04744A1-D52B-4064-9A71-F60D52A8AFEE@mcs.anl.gov>
Message-ID: <46602540.8090007@mcs.anl.gov>

Nika, regarding where you can get 244GB, we should send in a request 
to CI Support to get 1TB or so from the CI SAN mounted on a Swift 
submit host.

In the interim:

- you can get 100GB or so on NFS on terminable

- we can try to quickly add a 300GB drive (or two) to terminable

- One of the gridlab hosts (4 I think) should be available with that 
storage now. I am not sure if we feel this lab host is stable yet 
but we should make sure.

We need to be gearing up quickly to this scale, so the lab 
environment that I've been pushing to get set up is going to be 
increasingly important.

Regarding the core limitation that Nika points out - this seems to 
dictate that we take the step of allowing workflow inputs, outputs 
and intermediate results to be located on any gridftp server and 
tracked via a replica catalog, as VDS did.

Ben, Mihael, do you have a feel for where we can slot this into 
development? v3 - say circa September?

Mike



-- 
Mike Wilde
Computation Institute, University of Chicago
Math & Computer Science Division
Argonne National Laboratory
Argonne, IL   60439    USA
tel 630-252-7497 fax 630-252-1997


From hategan at mcs.anl.gov  Fri Jun  1 09:03:01 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 01 Jun 2007 17:03:01 +0300
Subject: [Swift-devel] disk space requirement
In-Reply-To: <46602540.8090007@mcs.anl.gov>
References: <C04744A1-D52B-4064-9A71-F60D52A8AFEE@mcs.anl.gov>
	<46602540.8090007@mcs.anl.gov>
Message-ID: <1180706581.32642.3.camel@blabla.mcs.anl.gov>

On Fri, 2007-06-01 at 08:55 -0500, Mike Wilde wrote:
> Nika, regarding where you can get 244GB, we should send in a request 
> to CI Support to get 1TB or so from the CI SAN mounted on a Swift 
> submit host.
> 
> In the interim:
> 
> - you can get 100GB or so on NFS on terminable
> 
> - we can try to quickly add a 300GB drive (or two) to terminable
> 
> - One of the gridlab hosts (4 I think) should be available with that 
> storage now. I am not sure if we feel this lab host is stable yet 
> but we should make sure.
> 
> We need to be gearing up quickly to this scale, so the lab 
> environment that I've been pushing to get set up is going to be 
> increasingly important.
> 
> Regarding the core limitation that Nika points out - this seems to 
> dictate that we take the step of allowing workflow inputs, outputs 
> and intermediate results to be located on any gridftp server and 
> tracked via a replica catalog, as VDS did.

I'd say tracking where intermediate files are for a run should not
necessarily imply a "replica catalog" as implemented by VDS (in
particular RLS).

> 
> Ben, Mihael, do you have a feel for where we can slot this into 
> development? v3 - say circa September?

Sounds reasonable with the limited knowledge I have.

Mihael



From tiberius at ci.uchicago.edu  Fri Jun  1 14:25:24 2007
From: tiberius at ci.uchicago.edu (Tiberiu Stef-Praun)
Date: Fri, 1 Jun 2007 14:25:24 -0500
Subject: [Swift-devel] simple_mapper issue ( i might have reported it a
	while ago)
Message-ID: <fec1351f0706011225w2fac05d5q1cdd7fe7f03bb874@mail.gmail.com>

happens in vdsk-april-29

This one is ok
file procOut<simple_mapper; prefix="ccf.spchhperm.output", suffix=perm>;

This one is not
file procOut<simple_mapper; prefix=perm, suffix="ccf.spchhperm.output">;

The error:
Execution failed:
        java.lang.ClassCastException: org.griphyn.vdl.mapping.DataNode
Caused by:
        java.lang.ClassCastException: org.griphyn.vdl.mapping.DataNode
        at org.griphyn.vdl.mapping.RootDataNode.init(RootDataNode.java:29)
        at org.griphyn.vdl.karajan.lib.New.function(New.java:103)
        at org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:60)
        at org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192)
        at org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33)
        at org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:334)

-- 
Tiberiu (Tibi) Stef-Praun, PhD
Research Staff, Computation Institute
5640 S. Ellis Ave, #405
University of Chicago
http://www-unix.mcs.anl.gov/~tiberius/


From tiberius at ci.uchicago.edu  Fri Jun  1 14:27:35 2007
From: tiberius at ci.uchicago.edu (Tiberiu Stef-Praun)
Date: Fri, 1 Jun 2007 14:27:35 -0500
Subject: [Swift-devel] Re: simple_mapper issue ( i might have reported it a
	while ago)
In-Reply-To: <fec1351f0706011225w2fac05d5q1cdd7fe7f03bb874@mail.gmail.com>
References: <fec1351f0706011225w2fac05d5q1cdd7fe7f03bb874@mail.gmail.com>
Message-ID: <fec1351f0706011227k2a11ebb9t1c5405c844b35932@mail.gmail.com>

Disregard this.
It was caused by user error.



-- 
Tiberiu (Tibi) Stef-Praun, PhD
Research Staff, Computation Institute
5640 S. Ellis Ave, #405
University of Chicago
http://www-unix.mcs.anl.gov/~tiberius/


From nefedova at mcs.anl.gov  Fri Jun  1 16:59:08 2007
From: nefedova at mcs.anl.gov (Veronika Nefedova)
Date: Fri, 1 Jun 2007 16:59:08 -0500
Subject: [Swift-devel] LQCD meeting at Fermi
Message-ID: <4C018A99-0639-43AA-BA06-525B955A56A1@mcs.anl.gov>

Hi,

Yong and I have met with Xian-He and his team today to talk over  
their current problems with the production swift code.

Some of the major issues we talked about:

- Separation of concerns: SwiftScript could be made to describe just the
abstract interfaces and data flows, and the app blocks could be pushed
into separate specifications (in a repository or something), in
which other scripting languages (e.g. python) can be used to specify how
to invoke an actual application.

- Dealing with absolute path:
   LQCD uses dcache, which requires copying to/from some absolute path.

- Run clean up jobs outside pbs (i.e. using the fork manager instead)

- parameter problem: need to override things in tc.data and sites.xml,
like the number of nodes for MPI jobs
   possible solution: put profile specification back in (but we do not
have derivations, which is where we were able to put some profiles).
   template-based sites.xml and tc.data (generate the actual config files
from templates and user-supplied values at runtime)

- DB-mapper: users have elaborate input data structures that they keep
in a DB, so it would be nice to have a mapper that reads the
input from the DB. This feature is in the works (?)

- intermediate results problem -- the same as MolDyn: need to have the
ability to specify which files to keep and which not.

- quoting problem:
   MPIrun does not deal correctly with "" that are passed to wrapper.sh
I remember there was also a quoting issue with condor queues.
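
The template idea from the parameter item above could look roughly like
this in Python. It is only a sketch under assumptions: the XML fragment
is illustrative and has not been checked against the real sites.xml
schema, and the names SITES_TEMPLATE and render_sites are hypothetical.

```python
from string import Template

# Illustrative fragment only; the element and attribute names are an
# assumption, not a verified copy of the sites.xml format.
SITES_TEMPLATE = Template(
    '<pool handle="$site">\n'
    '  <profile namespace="globus" key="host_count">$nodes</profile>\n'
    '</pool>\n'
)


def render_sites(site, nodes):
    """Fill the template with user-supplied values at run time."""
    # substitute() raises KeyError if a placeholder is left unfilled,
    # so a half-rendered config file is never written out.
    return SITES_TEMPLATE.substitute(site=site, nodes=nodes)


rendered = render_sites("lqcd_cluster", 16)
print(rendered)
```

The same approach would apply to tc.data: keep one template per file
and render both from a single set of user values before each run.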

We also talked about using Falkon. But since LQCD uses dedicated
resources (600 or more nodes) and the pbs queue checking time is set to
around 10s, it is not a big issue for them to run a large number of jobs.

None of these except for the absolute path problem is a show-stopper.
Next we'll try to get their swiftscript running, and push some of the
requests into 0.3 features.

Yong and Nika


From benc at hawaga.org.uk  Fri Jun  1 17:20:42 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 1 Jun 2007 22:20:42 +0000 (GMT)
Subject: [Swift-devel] LQCD meeting at Fermi
In-Reply-To: <4C018A99-0639-43AA-BA06-525B955A56A1@mcs.anl.gov>
References: <4C018A99-0639-43AA-BA06-525B955A56A1@mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0706012216230.20212@dildano.hawaga.org.uk>


On Fri, 1 Jun 2007, Veronika Nefedova wrote:

> - DB-mapper: users have an elaborate input data structures, keep it in the DB,
> so it would be nice to have a mapper that would read the input from the DB.
> This feature is in the works (?)

(i) The *data* in a DB, or 

(ii) paths to datafiles in the DB with actual data 
stored in disk files?

(i) is much harder than (ii)

> - Dealing with absolute path:                                                   
>  LQCD uses dcache, which requires copying to/from some absolute path.  

By this do you mean that their input/output files are stored in the unix 
filesystem on the submit node, but in some directory that is not the pwd, 
and that that directory causes files to be accessed from dcache?

dcache has other access methods, such as gridftp (I think). Do you know if 
they use that? In some cases, but maybe not this case, that would allow 
staging from the dcache ftp server to the site workspace without going via 
the submit node.

-- 


From foster at mcs.anl.gov  Fri Jun  1 17:34:03 2007
From: foster at mcs.anl.gov (Ian Foster)
Date: Fri, 01 Jun 2007 17:34:03 -0500
Subject: [Swift-devel] LQCD meeting at Fermi
In-Reply-To: <4C018A99-0639-43AA-BA06-525B955A56A1@mcs.anl.gov>
References: <4C018A99-0639-43AA-BA06-525B955A56A1@mcs.anl.gov>
Message-ID: <46609EDB.5020009@mcs.anl.gov>

Nika:

Thanks for the summary. I am very eager to see some results for 
executions of real application problems. Did they agree to a timeline 
for that?

Ian.


-- 

   Ian Foster, Director, Computation Institute
Argonne National Laboratory & University of Chicago
Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439
Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637
Tel: +1 630 252 4619.  Web: www.ci.uchicago.edu.
      Globus Alliance: www.globus.org.


From benc at hawaga.org.uk  Fri Jun  1 17:38:42 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 1 Jun 2007 22:38:42 +0000 (GMT)
Subject: [Swift-devel] LQCD meeting at Fermi
In-Reply-To: <46609EDB.5020009@mcs.anl.gov>
References: <4C018A99-0639-43AA-BA06-525B955A56A1@mcs.anl.gov>
	<46609EDB.5020009@mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0706012238040.20212@dildano.hawaga.org.uk>


On Fri, 1 Jun 2007, Ian Foster wrote:

> Thanks for the summary. I am very eager to see some results for executions of
> real application problems. Did they agree to a timeline for that?

Would also be good to have some definition of what is an acceptable 'real 
application problem', in terms of which programs are to be run on which 
datasets on what sized resource.

-- 


From itf at mcs.anl.gov  Fri Jun  1 17:50:41 2007
From: itf at mcs.anl.gov (Ian Foster)
Date: Fri, 1 Jun 2007 22:50:41 +0000
Subject: [Swift-devel] LQCD meeting at Fermi
In-Reply-To: <Pine.LNX.4.64.0706012238040.20212@dildano.hawaga.org.uk>
References: <4C018A99-0639-43AA-BA06-525B955A56A1@mcs.anl.gov><46609EDB.5020009@mcs.anl.gov>
	<Pine.LNX.4.64.0706012238040.20212@dildano.hawaga.org.uk>
Message-ID: <1472851168-1180738251-cardhu_blackberry.rim.net-646943670-@bwe005-cell00.bisx.prod.on.blackberry>

Ben:

The question for me is whether they are invested, or just playing with it and finding things that are "wrong" not because they are serious problems but rather as excuses for them not to do more. (I've often experienced that.)

So I want to know if they are using it for real work or not. So far things don't look so good--as I understand things, they don't have a working program after what must be 4-5 months talking to us (?).

Ian



From nefedova at mcs.anl.gov  Fri Jun  1 18:22:06 2007
From: nefedova at mcs.anl.gov (Veronika Nefedova)
Date: Fri, 1 Jun 2007 18:22:06 -0500
Subject: [Swift-devel] LQCD meeting at Fermi
In-Reply-To: <1472851168-1180738251-cardhu_blackberry.rim.net-646943670-@bwe005-cell00.bisx.prod.on.blackberry>
References: <4C018A99-0639-43AA-BA06-525B955A56A1@mcs.anl.gov><46609EDB.5020009@mcs.anl.gov>
	<Pine.LNX.4.64.0706012238040.20212@dildano.hawaga.org.uk>
	<1472851168-1180738251-cardhu_blackberry.rim.net-646943670-@bwe005-cell00.bisx.prod.on.blackberry>
Message-ID: <05C56433-BBB0-49FE-88F5-F602FCE8F696@mcs.anl.gov>

Ian,

They seem like quite a strange bunch to me. Here is my experience
with them.

When I first met them back in April, they gave us their code and
explained what they wanted to achieve. Within a couple of weeks I sent
them working code for their workflow. It took them another couple
of weeks to start testing it (and it worked for them as well). After
that they said they would modify that workflow to suit their
production needs, and they disappeared for another 3 weeks. During
those 3 weeks I sent them numerous emails offering my help
(basically saying "give me your production code and data and I'll
make it work for you"), but there was no response until last week, when
they sent us their code (not working) and questions.
The code is very close to working condition. Unfortunately, they
were reinventing the wheel instead of asking us for help (for example,
they spent quite some time trying to do string concatenation without
using the @strcat function). They also hacked together a random number
generator instead of using the function that I sent them (which is
a much easier and cleaner way).

So my impression was that they wanted to figure out Swift on their
own (which is good) but without our help (which is bad). But now I
think they are ready for us to step in. They mentioned something
about a 2-month timeframe in which they need to have things running...

Nika



From yongzh at cs.uchicago.edu  Fri Jun  1 21:10:28 2007
From: yongzh at cs.uchicago.edu (Yong Zhao)
Date: Fri, 1 Jun 2007 21:10:28 -0500 (CDT)
Subject: [Swift-devel] LQCD meeting at Fermi
In-Reply-To: <Pine.LNX.4.64.0706012216230.20212@dildano.hawaga.org.uk>
References: <4C018A99-0639-43AA-BA06-525B955A56A1@mcs.anl.gov>
	<Pine.LNX.4.64.0706012216230.20212@dildano.hawaga.org.uk>
Message-ID: <Pine.LNX.4.58.0706012109230.14577@classes.cs.uchicago.edu>

It is case (i): they want to read some parameter settings from the db. But I
do not think the two cases are very different; in both we are mapping
something from a database into in-memory data structures.

Yong.
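
To make the two cases concrete, here is a minimal Python sketch using an
in-memory SQLite database. The table names, column names and sample
values are all hypothetical; nothing here reflects the LQCD group's
actual schema or an existing Swift mapper.

```python
import sqlite3

# Hypothetical schema and sample rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE params (name TEXT PRIMARY KEY, value TEXT);
CREATE TABLE file_map (idx INTEGER PRIMARY KEY, path TEXT);
INSERT INTO params VALUES ('beta', '5.7'), ('nodes', '16');
INSERT INTO file_map VALUES (0, '/data/cfg_0000.dat'), (1, '/data/cfg_0001.dat');
""")


def load_params(db):
    """Case (i): the data itself (here, parameter settings) lives in the DB."""
    return dict(db.execute("SELECT name, value FROM params"))


def load_paths(db):
    """Case (ii): the DB holds only paths; the data stays in disk files."""
    rows = db.execute("SELECT path FROM file_map ORDER BY idx")
    return [path for (path,) in rows]


params = load_params(conn)
paths = load_paths(conn)
print(params)
print(paths)
```

In both cases the query result ends up in an ordinary in-memory
structure; what differs is whether later steps use the values directly
or treat them as file names to stage.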



From yongzh at cs.uchicago.edu  Fri Jun  1 21:20:29 2007
From: yongzh at cs.uchicago.edu (Yong Zhao)
Date: Fri, 1 Jun 2007 21:20:29 -0500 (CDT)
Subject: [Swift-devel] LQCD meeting at Fermi
In-Reply-To: <1472851168-1180738251-cardhu_blackberry.rim.net-646943670-@bwe005-cell00.bisx.prod.on.blackberry>
References: <4C018A99-0639-43AA-BA06-525B955A56A1@mcs.anl.gov><46609EDB.5020009@mcs.anl.gov>
	<Pine.LNX.4.64.0706012238040.20212@dildano.hawaga.org.uk>
	<1472851168-1180738251-cardhu_blackberry.rim.net-646943670-@bwe005-cell00.bisx.prod.on.blackberry>
Message-ID: <Pine.LNX.4.58.0706012110500.14577@classes.cs.uchicago.edu>

They did mention in the discussion that the Swift approach is not much
different from their ad hoc scripts (because of the way we deal with
file names and command line arguments when invoking applications). They
said they wanted something purer (in the sense of having just procedure
interfaces & data flows, and hiding invocation/mapping/wrapping details).
But I guess neither side (us or them) has a clear idea of how to
actually achieve that.

They also noticed the mixed flavor of scripting, imperative programming
and functional programming in the language; they seem to favor a
python-like, more functional style of programming language. Also, they
do not quite like the @ functions.

I think that, because of the slow process and because they have been
trying to figure out the language features by themselves, they are a bit
discouraged and disappointed at this point. So we need to get their
semi-working scripts working very quickly in order not to alienate them.

Yong.



From hategan at mcs.anl.gov  Sat Jun  2 03:33:31 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sat, 02 Jun 2007 11:33:31 +0300
Subject: [Swift-devel] LQCD meeting at Fermi
In-Reply-To: <4C018A99-0639-43AA-BA06-525B955A56A1@mcs.anl.gov>
References: <4C018A99-0639-43AA-BA06-525B955A56A1@mcs.anl.gov>
Message-ID: <1180773211.6680.8.camel@blabla.mcs.anl.gov>

On Fri, 2007-06-01 at 16:59 -0500, Veronika Nefedova wrote:
> Hi,
> 
> Yong and I have met with Xian-He and his team today to talk over  
> their current problems with the production swift code.
> 
> Some of the major issues we talked about:
> 
> - Separation of concerns: SwiftScript could be made to describe just the
> abstract interfaces and data flows, and the app blocks could be pushed
> into separate specifications (in a repository or something), in
> which other scripting languages (e.g. python) can be used to specify how
> to invoke an actual application.

How's that different from application wrappers?

> 
> - Dealing with absolute path:
>    LQCD uses dcache, which requires copying to/from some absolute path.

This, I think, is the same as the ability to have non-local input and
output files.

> 
> - Run clean up jobs outside pbs (i.e. using the fork manager instead)

We've discussed this before, and there are two choices:
1. Use the file provider. This may be inefficient because most of them,
in particular GridFTP, don't have a recursive delete. The local one,
which they are using does. This may imply another configuration option.
2. Make sure there's always a fork job manager there and use that. This
means that the local PBS provider needs to become a job manager to the
local provider rather than a stand-alone provider.

> 
> - parameter problem: need to override things in tc.data, sites.xml, like
> number of nodes for MPI jobs
>    possible solution: put profile specification back in. (but we do not
> have derivations, in which we were able to put some profiles).

Can you explain that? VDS != Swift. And we shouldn't talk about Swift
having some literal thing from VDS, but rather the bit that achieves
similar functionality.

>    template-based sites.xml and tc.data (generate the actual config files
> from templates and user-supplied values at runtime)

About sites.xml, we discussed in an email exchange the possibility of
doing that. Luckily, in Swift, sites.xml is a karajan script, so it can
do things like import("anothersites.xml") and so on.

> 
> - DB-mapper: users have an elaborate input data structures, keep it  
> in the DB, so it would be nice to have a mapper that would read the  
> input from the DB. This feature is in the works (?)
> 
> -intermediate results problem -- the same as MolDyn: need to have an  
> ability to specify which file to keep and which file not.
> 
> - quoting problem:
>    MPIrun does not deal correctly with "" that are passed to wrapper.sh
> I remember there was also quoting issue with condor queues.

This is a problem with their mpirun. However, I guess the PBS provider
could have a flag to do extra quoting for certain job types.
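
As a rough illustration of what "extra quoting" would have to do, the
Python sketch below shows an argument list losing an empty argument and
an embedded space when it is flattened into one command string, and
surviving when each argument is quoted first. Whether "" in the report
meant empty arguments or literal quote characters is not clear from the
notes; the argument values here are made up.

```python
import shlex

# Hypothetical argument list; the second and third entries are the
# troublesome ones (embedded space, empty string).
args = ["wrapper.sh", "a b", ""]

# Naive flattening: what reaches the next shell layer is ambiguous.
naive = " ".join(args)

# Extra quoting: each argument is quoted so one more round of shell
# parsing reproduces the original list.
quoted = " ".join(shlex.quote(a) for a in args)

print(shlex.split(naive))
print(shlex.split(quoted))
```

The catch is that the right amount of quoting depends on how many
layers re-parse the command line, which is presumably why it would have
to be a per-job-type flag.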

> 
> We also talked about using Falkon. But since LQCD uses dedicated
> resources (600 or more nodes) and the pbs queue checking time is set to
> around 10s, it is not a big issue for them to run a large number of jobs.

The last thing we want with them is throw in another thing that might
have problems in the stack.

> 
> None of these except for the absolute path problem is a show-stopper.
> Next we'll try to get their swiftscript running, and push some of the
> requests into 0.3 features.
> 
> Yong and Nika


From hategan at mcs.anl.gov  Sat Jun  2 03:43:24 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sat, 02 Jun 2007 11:43:24 +0300
Subject: [Swift-devel] LQCD meeting at Fermi
In-Reply-To: <1472851168-1180738251-cardhu_blackberry.rim.net-646943670-@bwe005-cell00.bisx.prod.on.blackberry>
References: <4C018A99-0639-43AA-BA06-525B955A56A1@mcs.anl.gov>
	<46609EDB.5020009@mcs.anl.gov>
	<Pine.LNX.4.64.0706012238040.20212@dildano.hawaga.org.uk>
	<1472851168-1180738251-cardhu_blackberry.rim.net-646943670-@bwe005-cell00.bisx.prod.on.blackberry>
Message-ID: <1180773804.6680.19.camel@blabla.mcs.anl.gov>

On Fri, 2007-06-01 at 22:50 +0000, Ian Foster wrote:
> Ben:
> 
> The question for me is whether they are invested, or just playing with it and finding things that are "wrong" not because they are serious problems but rather as excuses for them not to do more. (I've often experienced that.)

I'm not Ben, but:
So far, I think they've invested quite a bit into Swift, compared to
other things they've invested in. But I would assume they are worried
whether the problems they see will ever get solved, and without
something concrete and somewhat immediate, they might lean towards
believing they might not.

Also, in the same way they disappear for weeks, we also disappear for
weeks on some of their requests. I'm guessing both sides are involved in
more than one thing.

In concrete terms, at one point they had a choice between Karajan and
Swift, and they seem to have chosen Swift.

> 
> So I want to know if they are using it for real work or not. So far things don't look so good--as I understand things, they don't have a working program after what must be 4-5 months talking to us (?).

They do have "weird" requirements, at least if you start from Swift's
assumptions (and Swift still has issues with being generic). And there
are quite a few of them.

Mihael

> 
> Ian
> 
> Sent via BlackBerry from T-Mobile  
> 
> -----Original Message-----
> From: Ben Clifford <benc at hawaga.org.uk>
> Date: Fri, 1 Jun 2007 22:38:42 
> To:Ian Foster <foster at mcs.anl.gov>
> Cc:Veronika Nefedova <nefedova at mcs.anl.gov>, swift-devel at ci.uchicago.edu
> Subject: Re: [Swift-devel] LQCD meeting at Fermi
> 
> 
> 
> On Fri, 1 Jun 2007, Ian Foster wrote:
> 
> > Thanks for the summary. I am very eager to see some results for executions of
> > real application problems. Did they agree to a timeline for that?
> 
> Would also be good to have some definition of what is an acceptable 'real 
> application problem', in terms of which programs are to be run on which 
> datasets on what sized resource.
> 
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel


From hategan at mcs.anl.gov  Sat Jun  2 03:50:29 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sat, 02 Jun 2007 11:50:29 +0300
Subject: [Swift-devel] LQCD meeting at Fermi
In-Reply-To: <1180773211.6680.8.camel@blabla.mcs.anl.gov>
References: <4C018A99-0639-43AA-BA06-525B955A56A1@mcs.anl.gov>
	<1180773211.6680.8.camel@blabla.mcs.anl.gov>
Message-ID: <1180774229.6680.27.camel@blabla.mcs.anl.gov>

On Sat, 2007-06-02 at 11:33 +0300, Mihael Hategan wrote:
> > 
> > We also talked about using Falkon. But since LQCD uses dedicated  
> > resources
> > (600 or more nodes) and pbs queue checking time is set to around 10s, it
> > is not a big issue for them to run large number of jobs.
>  
> The last thing we want with them is throw in another thing that might
> have problems in the stack.

Clarification:

The last thing we want with them is to throw another thing (that might have
problems) in the stack. And I'm not talking specifically about Falkon,
which is a fine piece of software and a wonderful concept.

> 
> > 
> > None of these except for the absolute path problem is a show- 
> > stoppers, next
> > we'll try to get their swiftscript running, and push some of the  
> > requests
> > into 0.3 features.
> > 
> > Yong and Nika
> > 
> > 
> > 
> > _______________________________________________
> > Swift-devel mailing list
> > Swift-devel at ci.uchicago.edu
> > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> > 
> 
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> 


From wilde at mcs.anl.gov  Sat Jun  2 10:44:16 2007
From: wilde at mcs.anl.gov (Mike Wilde)
Date: Sat, 02 Jun 2007 10:44:16 -0500
Subject: [Swift-devel] LQCD meeting at Fermi
In-Reply-To: <Pine.LNX.4.64.0706012238040.20212@dildano.hawaga.org.uk>
References: <4C018A99-0639-43AA-BA06-525B955A56A1@mcs.anl.gov>	<46609EDB.5020009@mcs.anl.gov>
	<Pine.LNX.4.64.0706012238040.20212@dildano.hawaga.org.uk>
Message-ID: <46619050.2080005@mcs.anl.gov>

Nika, can you develop this and post on the wiki page for LQCD?

Do you have the information you need for this already? If not, can 
you do this on Monday?

Thanks,

Mike


Ben Clifford wrote, On 6/1/2007 5:38 PM:
> 
> On Fri, 1 Jun 2007, Ian Foster wrote:
> 
>> Thanks for the summary. I am very eager to see some results for executions of
>> real application problems. Did they agree to a timeline for that?
> 
> Would also be good to have some definition of what is an acceptable 'real 
> application problem', in terms of which programs are to be run on which 
> datasets on what sized resource.
> 

-- 
Mike Wilde
Computation Institute, University of Chicago
Math & Computer Science Division
Argonne National Laboratory
Argonne, IL   60439    USA
tel 630-252-7497 fax 630-252-1997


From benc at hawaga.org.uk  Tue Jun  5 10:46:28 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 5 Jun 2007 15:46:28 +0000 (GMT)
Subject: [Swift-devel] swift live tutorial
Message-ID: <Pine.LNX.4.64.0706051535370.27276@dildano.hawaga.org.uk>


Yesterday Mike, Tibi and I performed a Swift tutorial at TG07, in the 'OSG 
Grid School' style of a lecture and then a hands on session.

The exercises portion, which I led, is (for now) at:

http://www.ci.uchicago.edu/swift/guides/tutorial-live.php

I don't have the lecture portion of the slides - Mike did that bit.

For the most part I think it went well for a first tutorial. I think 
people mostly understood the points that we were trying to make and were 
able to do the exercises ok.

We discovered a few more bugs in Swift which are now in the bugzilla (63, 
64, 65, 66, 67).

There were also a bunch of problems with the tutorial notes that are for 
the most part minor and easily correctable.

I'd like to use the exercise portion as the base for future swift hands-on 
tutorials.

-- 


From foster at mcs.anl.gov  Tue Jun  5 13:28:20 2007
From: foster at mcs.anl.gov (Ian Foster)
Date: Tue, 05 Jun 2007 13:28:20 -0500
Subject: [Swift-devel] swift live tutorial
In-Reply-To: <Pine.LNX.4.64.0706051535370.27276@dildano.hawaga.org.uk>
References: <Pine.LNX.4.64.0706051535370.27276@dildano.hawaga.org.uk>
Message-ID: <4665AB44.7030601@mcs.anl.gov>

Ben, Mike, Tibi:

Congratulations on a successful tutorial!

Ian.

Ben Clifford wrote:
> Yesterday Mike, Tibi and I performed a Swift tutorial at TG07, in the 'OSG 
> Grid School' style of a lecture and then a hands on session.
>
> The exercises portion, which I led, is (for now) at:
>
> http://www.ci.uchicago.edu/swift/guides/tutorial-live.php
>
> I don't have the lecture portion of the slides - Mike did that bit.
>
> For the most part I think it went well for a first tutorial. I think 
> people mostly understood the points that we were trying to make and were 
> able to do the exercises ok.
>
> We discovered a few more bugs in Swift which are now in the bugzilla (63, 
> 64, 65, 66, 67).
>
> There were also a bunch of problems with the tutorial notes that are for 
> the most part minor and easily correctable.
>
> I'd like to use the exercise portion as the base for future swift hands-on 
> tutorials.
>
>   

-- 

   Ian Foster, Director, Computation Institute
Argonne National Laboratory & University of Chicago
Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439
Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637
Tel: +1 630 252 4619.  Web: www.ci.uchicago.edu.
      Globus Alliance: www.globus.org.


From wilde at mcs.anl.gov  Tue Jun 12 12:32:22 2007
From: wilde at mcs.anl.gov (Mike Wilde)
Date: Tue, 12 Jun 2007 12:32:22 -0500
Subject: [Swift-devel] Re: GRAM and Swift discussion this week?
In-Reply-To: <EBCA77FC-971B-4537-8FB8-EF17032D2DD3@mcs.anl.gov>
References: <685C1420-03DE-4F2E-BDC7-A8A2C5636154@mcs.anl.gov>
	<46532FB4.5070707@mcs.anl.gov> <465335D1.2040306@mcs.anl.gov>
	<Pine.LNX.4.64.0705221837010.20212@dildano.hawaga.org.uk>
	<DA40B2E2-8A00-43E0-B76F-0EB9A16EFAF6@mcs.anl.gov>
	<Pine.LNX.4.64.0705221910500.22628@dildano.hawaga.org.uk>
	<EBCA77FC-971B-4537-8FB8-EF17032D2DD3@mcs.anl.gov>
Message-ID: <466ED8A6.5010804@mcs.anl.gov>

Hi all,

Were we still planning to meet tomorrow at Argonne?

Can we postpone this till Thu?  And what time would be good?

Ben, where will you be Thu?

Thanks,

Mike


Stuart Martin wrote, On 5/22/2007 2:41 PM:
> 
> On May 22, 2007, at May 22, 2:22 PM, Ben Clifford wrote:
> 
>>
>>
>> On Tue, 22 May 2007, Stuart Martin wrote:
>>
>>> On May 22, 2007, at May 22, 1:43 PM, Ben Clifford wrote:
>>>>
>>>> On Tue, 22 May 2007, Ian Foster wrote:
>>>>
>>>>> Are there WS-GRAM issues that are causing problems for Swift?
>>>>
>>>> No one uses WS-GRAM with Swift, so we aren't really uncovering issues
>>>> there.
>>>
>>
>>> Why not?  What are you using?  GRAM2?  local executions?  Other 
>>> services?
>>
>> for the high end stuff, Swift submits jobs to Falkon. Falkon, I think,
>> uses WS-GRAM to start up its own workers, but that startup is part of Falkon,
>> not Swift.
>>
>> For low end stuff, the two providers that I think people use much are
>> local exec and GRAM2.
>>
>> Local exec is not in the space that GRAM is addressing, so ignore.
> 
> Agreed.  Just trying to learn what people are doing.
> 
>>
>> The GRAM2 vs GRAM4 question pretty much comes down to the fact that 
>> people
>> in production (at least as far as I encounter them) tend to use GRAM2
>> rather than GRAM4 and so Swift tends to get used that way too - 
>> there's no
>> real motivation to push people to use a different submission system than
>> what they're used to, and one thing we decided within our group is 
>> that we
>> would concentrate on being very application focused (after we had spent
>> rather a long time pontificating and debating). GRAM2 -> GRAM4 doesn't
>> provide enough incentive (in the way that a GRAM2 -> Falkon change does)
>> for our actual apps (for example that Tibi and Nika work on).
> 
> Fair enough.  GRAM4 is deployed on most of TG and OSG now.  It would be 
> good to push jobs to GRAM4 when reasonable/possible.  The apps folks 
> should not care which service is used.  It should be hidden by Swift.  
> Or are the apps folks you're working with also dictating what GRAM service 
> is deployed/used?
> 
>>
>> At some point, perhaps, GRAM2 will decay or GRAM4 will become 
>> tantalising,
>> at which point it would be in the interests of being app-focused to 
>> shift.
>> Or we might change our priorities to be less app focused.
> 
> Some are quite happy with GRAM4 in 4.0.3.  We're improving things right 
> now to make GRAM4 outperform GRAM2 in most all the important 
> benchmarks.  This should be in 4.0.5.  I think things at that point 
> become "tantalizing".
> 
>>
>> -- 
> 
> 

-- 
Mike Wilde
Computation Institute, University of Chicago
Math & Computer Science Division
Argonne National Laboratory
Argonne, IL   60439    USA
tel 630-252-7497 fax 630-252-1997


From benc at hawaga.org.uk  Tue Jun 12 12:39:22 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 12 Jun 2007 17:39:22 +0000 (GMT)
Subject: [Swift-devel] Re: GRAM and Swift discussion this week?
In-Reply-To: <466ED8A6.5010804@mcs.anl.gov>
References: <685C1420-03DE-4F2E-BDC7-A8A2C5636154@mcs.anl.gov>
	<46532FB4.5070707@mcs.anl.gov> <465335D1.2040306@mcs.anl.gov>
	<Pine.LNX.4.64.0705221837010.20212@dildano.hawaga.org.uk>
	<DA40B2E2-8A00-43E0-B76F-0EB9A16EFAF6@mcs.anl.gov>
	<Pine.LNX.4.64.0705221910500.22628@dildano.hawaga.org.uk>
	<EBCA77FC-971B-4537-8FB8-EF17032D2DD3@mcs.anl.gov>
	<466ED8A6.5010804@mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0706121737450.10634@dildano.hawaga.org.uk>


On Tue, 12 Jun 2007, Mike Wilde wrote:

> Were we still planning to meet tomorrow at Argonne?

was there ever such a plan?

what needs discussing? like I said before:

> > > > > > Are there WS-GRAM issues that are causing problems for Swift?
> > > > > 
> > > > > No one uses WS-GRAM with Swift, so we aren't really uncovering issues
> > > > > there.


> Ben, where will you be Thu?

UC campus is my present plan.


-- 


From wilde at mcs.anl.gov  Tue Jun 12 13:33:45 2007
From: wilde at mcs.anl.gov (Mike Wilde)
Date: Tue, 12 Jun 2007 13:33:45 -0500
Subject: [Swift-devel] Re: GRAM and Swift discussion this week?
In-Reply-To: <Pine.LNX.4.64.0706121737450.10634@dildano.hawaga.org.uk>
References: <685C1420-03DE-4F2E-BDC7-A8A2C5636154@mcs.anl.gov>
	<46532FB4.5070707@mcs.anl.gov> <465335D1.2040306@mcs.anl.gov>
	<Pine.LNX.4.64.0705221837010.20212@dildano.hawaga.org.uk>
	<DA40B2E2-8A00-43E0-B76F-0EB9A16EFAF6@mcs.anl.gov>
	<Pine.LNX.4.64.0705221910500.22628@dildano.hawaga.org.uk>
	<EBCA77FC-971B-4537-8FB8-EF17032D2DD3@mcs.anl.gov>
	<466ED8A6.5010804@mcs.anl.gov>
	<Pine.LNX.4.64.0706121737450.10634@dildano.hawaga.org.uk>
Message-ID: <466EE709.50007@mcs.anl.gov>

Let's defer this meeting then and continue such discussions on swift 
and/or GRAM lists.

- Mike

Ben Clifford wrote, On 6/12/2007 12:39 PM:
> On Tue, 12 Jun 2007, Mike Wilde wrote:
> 
>> Were we still planning to meet tomorrow at Argonne?
> 
> was there ever such a plan?
> 
> what needs discussing? like I said before:
> 
>>>>>>> Are there WS-GRAM issues that are causing problems for Swift?
>>>>>> No one uses WS-GRAM with Swift, so we aren't really uncovering issues
>>>>>> there.
> 
> 
>> Ben, where will you be Thu?
> 
> UC campus is my present plan.
> 
> 

-- 
Mike Wilde
Computation Institute, University of Chicago
Math & Computer Science Division
Argonne National Laboratory
Argonne, IL   60439    USA
tel 630-252-7497 fax 630-252-1997


From benc at hawaga.org.uk  Tue Jun 12 13:38:34 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 12 Jun 2007 18:38:34 +0000 (GMT)
Subject: [Swift-devel] Re: GRAM and Swift discussion this week?
In-Reply-To: <466EE709.50007@mcs.anl.gov>
References: <685C1420-03DE-4F2E-BDC7-A8A2C5636154@mcs.anl.gov>
	<46532FB4.5070707@mcs.anl.gov> <465335D1.2040306@mcs.anl.gov>
	<Pine.LNX.4.64.0705221837010.20212@dildano.hawaga.org.uk>
	<DA40B2E2-8A00-43E0-B76F-0EB9A16EFAF6@mcs.anl.gov>
	<Pine.LNX.4.64.0705221910500.22628@dildano.hawaga.org.uk>
	<EBCA77FC-971B-4537-8FB8-EF17032D2DD3@mcs.anl.gov>
	<466ED8A6.5010804@mcs.anl.gov>
	<Pine.LNX.4.64.0706121737450.10634@dildano.hawaga.org.uk>
	<466EE709.50007@mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0706121835430.10634@dildano.hawaga.org.uk>


I suspect the best way forward for this is for us to actually start using 
GRAM4 in our daily swift work - that will either work perfectly or 
generate things to talk about.

On Tue, 12 Jun 2007, Mike Wilde wrote:

> Lets defer this meeting then and continue such discussions on swift and/or
> GRAM lists.
> 
> - Mike
> 
> Ben Clifford wrote, On 6/12/2007 12:39 PM:
> > On Tue, 12 Jun 2007, Mike Wilde wrote:
> > 
> > > Were we still planning to meet tomorrow at Argonne?
> > 
> > was there ever such a plan?
> > 
> > what needs discussing? like I said before:
> > 
> > > > > > > > Are there WS-GRAM issues that are causing problems for Swift?
> > > > > > > No one uses WS-GRAM with Swift, so we aren't really uncovering
> > > > > > > issues
> > > > > > > there.
> > 
> > 
> > > Ben, where will you be Thu?
> > 
> > UC campus is my present plan.
> > 
> > 
> 
> 


From itf at mcs.anl.gov  Tue Jun 12 13:41:24 2007
From: itf at mcs.anl.gov (Ian Foster)
Date: Tue, 12 Jun 2007 18:41:24 +0000
Subject: [Swift-devel] Re: GRAM and Swift discussion this week?
In-Reply-To: <Pine.LNX.4.64.0706121835430.10634@dildano.hawaga.org.uk>
References: <685C1420-03DE-4F2E-BDC7-A8A2C5636154@mcs.anl.gov><46532FB4.5070707@mcs.anl.gov>
	<465335D1.2040306@mcs.anl.gov><Pine.LNX.4.64.0705221837010.20212@dildano.hawaga.org.uk><DA40B2E2-8A00-43E0-B76F-0EB9A16EFAF6@mcs.anl.gov><Pine.LNX.4.64.0705221910500.22628@dildano.hawaga.org.uk><EBCA77FC-971B-4537-8FB8-EF17032D2DD3@mcs.anl.gov>
	<466ED8A6.5010804@mcs.anl.gov><Pine.LNX.4.64.0706121737450.10634@dildano.hawaga.org.uk>
	<466EE709.50007@mcs.anl.gov><Pine.LNX.4.64.0706121835430.10634@dildano.hawaga.org.uk>
Message-ID: <1565933129-1181673722-cardhu_decombobulator_blackberry.rim.net-1111105974-@bxe111.bisx.prod.on.blackberry>

I think Ioan already uses it for Falkon


Sent via BlackBerry from T-Mobile

-----Original Message-----
From: Ben Clifford <benc at hawaga.org.uk>

Date: Tue, 12 Jun 2007 18:38:34 
To:Mike Wilde <wilde at mcs.anl.gov>
Cc:Stuart Martin <smartin at mcs.anl.gov>, Ian Foster <foster at mcs.anl.gov>,  swift-devel at ci.uchicago.edu
Subject: Re: [Swift-devel] Re: GRAM and Swift discussion this week?


I suspect the best way forward for this is for us to actually start using 
GRAM4 in our daily swift work - that will either work perfectly or 
generate things to talk about.

On Tue, 12 Jun 2007, Mike Wilde wrote:

> Lets defer this meeting then and continue such discussions on swift and/or
> GRAM lists.
> 
> - Mike
> 
> Ben Clifford wrote, On 6/12/2007 12:39 PM:
> > On Tue, 12 Jun 2007, Mike Wilde wrote:
> > 
> > > Were we still planning to meet tomorrow at Argonne?
> > 
> > was there ever such a plan?
> > 
> > what needs discussing? like I said before:
> > 
> > > > > > > > Are there WS-GRAM issues that are causing problems for Swift?
> > > > > > > No one uses WS-GRAM with Swift, so we aren't really uncovering
> > > > > > > issues
> > > > > > > there.
> > 
> > 
> > > Ben, where will you be Thu?
> > 
> > UC campus is my present plan.
> > 
> > 
> 
> 


From iraicu at cs.uchicago.edu  Tue Jun 12 13:45:04 2007
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Tue, 12 Jun 2007 13:45:04 -0500
Subject: [Swift-devel] Re: GRAM and Swift discussion this week?
In-Reply-To: <1565933129-1181673722-cardhu_decombobulator_blackberry.rim.net-1111105974-@bxe111.bisx.prod.on.blackberry>
References: <685C1420-03DE-4F2E-BDC7-A8A2C5636154@mcs.anl.gov><46532FB4.5070707@mcs.anl.gov>	<465335D1.2040306@mcs.anl.gov><Pine.LNX.4.64.0705221837010.20212@dildano.hawaga.org.uk><DA40B2E2-8A00-43E0-B76F-0EB9A16EFAF6@mcs.anl.gov><Pine.LNX.4.64.0705221910500.22628@dildano.hawaga.org.uk><EBCA77FC-971B-4537-8FB8-EF17032D2DD3@mcs.anl.gov>	<466ED8A6.5010804@mcs.anl.gov><Pine.LNX.4.64.0706121737450.10634@dildano.hawaga.org.uk>	<466EE709.50007@mcs.anl.gov><Pine.LNX.4.64.0706121835430.10634@dildano.hawaga.org.uk>
	<1565933129-1181673722-cardhu_decombobulator_blackberry.rim.net-1111105974-@bxe111.bisx.prod.on.blackberry>
Message-ID: <466EE9B0.506@cs.uchicago.edu>

Right, Falkon uses only GRAM4 to get to remote resources!  It has worked 
very well for Falkon.

Ioan

Ian Foster wrote:
> I think Ioan already uses it for Falkon
>
>
> Sent via BlackBerry from T-Mobile
>
> -----Original Message-----
> From: Ben Clifford <benc at hawaga.org.uk>
>
> Date: Tue, 12 Jun 2007 18:38:34 
> To:Mike Wilde <wilde at mcs.anl.gov>
> Cc:Stuart Martin <smartin at mcs.anl.gov>, Ian Foster <foster at mcs.anl.gov>,  swift-devel at ci.uchicago.edu
> Subject: Re: [Swift-devel] Re: GRAM and Swift discussion this week?
>
>
>
> I suspect the best way forward for this is for us to actually start using 
> GRAM4 in our daily swift work - that will either work perfectly or 
> generate things to talk about.
>
> On Tue, 12 Jun 2007, Mike Wilde wrote:
>
>   
>> Lets defer this meeting then and continue such discussions on swift and/or
>> GRAM lists.
>>
>> - Mike
>>
>> Ben Clifford wrote, On 6/12/2007 12:39 PM:
>>     
>>> On Tue, 12 Jun 2007, Mike Wilde wrote:
>>>
>>>       
>>>> Were we still planning to meet tomorrow at Argonne?
>>>>         
>>> was there ever such a plan?
>>>
>>> what needs discussing? like I said before:
>>>
>>>       
>>>>>>>>> Are there WS-GRAM issues that are causing problems for Swift?
>>>>>>>>>                   
>>>>>>>> No one uses WS-GRAM with Swift, so we aren't really uncovering
>>>>>>>> issues
>>>>>>>> there.
>>>>>>>>                 
>>>       
>>>> Ben, where will you be Thu?
>>>>         
>>> UC campus is my present plan.
>>>
>>>
>>>       
>>     
>
>   
> ------------------------------------------------------------------------
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>   

-- 
============================================
Ioan Raicu
Ph.D. Student
============================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
============================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
       http://dsl.cs.uchicago.edu/
============================================
============================================


From benc at hawaga.org.uk  Tue Jun 12 14:03:46 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 12 Jun 2007 19:03:46 +0000 (GMT)
Subject: [Swift-devel] Re: GRAM and Swift discussion this week?
In-Reply-To: <1565933129-1181673722-cardhu_decombobulator_blackberry.rim.net-1111105974-@bxe111.bisx.prod.on.blackberry>
References: <685C1420-03DE-4F2E-BDC7-A8A2C5636154@mcs.anl.gov><46532FB4.5070707@mcs.anl.gov>
	<465335D1.2040306@mcs.anl.gov><Pine.LNX.4.64.0705221837010.20212@dildano.hawaga.org.uk><DA40B2E2-8A00-43E0-B76F-0EB9A16EFAF6@mcs.anl.gov><Pine.LNX.4.64.0705221910500.22628@dildano.hawaga.org.uk><EBCA77FC-971B-4537-8FB8-EF17032D2DD3@mcs.anl.gov>
	<466ED8A6.5010804@mcs.anl.gov><Pine.LNX.4.64.0706121737450.10634@dildano.hawaga.org.uk>
	<466EE709.50007@mcs.anl.gov><Pine.LNX.4.64.0706121835430.10634@dildano.hawaga.org.uk>
	<1565933129-1181673722-cardhu_decombobulator_blackberry.rim.net-1111105974-@bxe111.bisx.prod.on.blackberry>
Message-ID: <Pine.LNX.4.64.0706121901490.10634@dildano.hawaga.org.uk>


Yeah, the falkon worker stuff goes in through GRAM4. Where that differs 
from swift submitted directly is that swift would be submitting more (many 
more?) jobs and lots of different kinds of jobs.

On Tue, 12 Jun 2007, Ian Foster wrote:

> I think Ioan already uses it for Falkon
> 
> 
> Sent via BlackBerry from T-Mobile
> 
> -----Original Message-----
> From: Ben Clifford <benc at hawaga.org.uk>
> 
> Date: Tue, 12 Jun 2007 18:38:34 
> To:Mike Wilde <wilde at mcs.anl.gov>
> Cc:Stuart Martin <smartin at mcs.anl.gov>, Ian Foster <foster at mcs.anl.gov>,  swift-devel at ci.uchicago.edu
> Subject: Re: [Swift-devel] Re: GRAM and Swift discussion this week?
> 
> 
> 
> I suspect the best way forward for this is for us to actually start using 
> GRAM4 in our daily swift work - that will either work perfectly or 
> generate things to talk about.
> 
> On Tue, 12 Jun 2007, Mike Wilde wrote:
> 
> > Lets defer this meeting then and continue such discussions on swift and/or
> > GRAM lists.
> > 
> > - Mike
> > 
> > Ben Clifford wrote, On 6/12/2007 12:39 PM:
> > > On Tue, 12 Jun 2007, Mike Wilde wrote:
> > > 
> > > > Were we still planning to meet tomorrow at Argonne?
> > > 
> > > was there ever such a plan?
> > > 
> > > what needs discussing? like I said before:
> > > 
> > > > > > > > > Are there WS-GRAM issues that are causing problems for Swift?
> > > > > > > > No one uses WS-GRAM with Swift, so we aren't really uncovering
> > > > > > > > > issues
> > > > > > > > there.
> > > 
> > > 
> > > > Ben, where will you be Thu?
> > > 
> > > UC campus is my present plan.
> > > 
> > > 
> > 
> > 
> 
> 
> 


From wilde at mcs.anl.gov  Tue Jun 12 23:08:22 2007
From: wilde at mcs.anl.gov (Mike Wilde)
Date: Tue, 12 Jun 2007 23:08:22 -0500
Subject: [Swift-devel] [Fwd: [OUTREACH] poster or demo at Grid2007?]
Message-ID: <466F6DB6.3020302@mcs.anl.gov>

Anyone interested in Swift/Falkon demos at SC?

A massive TG/OSG falkon cluster demo running workflows with nice viz 
fast would be quite cool.

(I can't believe I'm saying this while arguing that falkon needs more 
work, but.... ;)

While we're on the topic of demos: I think there is still a board of 
Governors demo here next week.

Do we have any chance of putting some cool workflows on the tiled 
display?

- sidgrid wavelet with the "lava lamp brain" viz?
- cnari w/ suma?

- Mike


-------- Original Message --------
Subject: [OUTREACH] poster or demo at Grid2007?
Date: Wed, 13 Jun 2007 04:35:35 +0100
From: Jennifer M. Schopf <jms at mcs.anl.gov>
To: outreach at globus.org

Hi Folks-

    right now, we don't have any general globus content at grid2007 
to my
knowledge (we didn't know about their deadlines in time to apply for a
tutorial or BOF). However, the call for posters and demos is still 
open -
maybe someone would like to apply for one of these for some of the 
newer
work?  Maybe RAVE or MOPS especially?

http://www.grid2007.org/?m_b_c=call_for_posterdemo

  -jen


-------------------------------------------------------------------------------
Dr. Jennifer M. Schopf
Scientist, Distributed Systems Lab
Argonne National Laboratory
jms at mcs.anl.gov  http://www.mcs.anl.gov/~jms


-- 
Mike Wilde
Computation Institute, University of Chicago
Math & Computer Science Division
Argonne National Laboratory
Argonne, IL   60439    USA
tel 630-252-7497 fax 630-252-1997


From iraicu at cs.uchicago.edu  Tue Jun 12 23:21:07 2007
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Tue, 12 Jun 2007 23:21:07 -0500
Subject: [Swift-devel] [Fwd: [OUTREACH] poster or demo at Grid2007?]
In-Reply-To: <466F6DB6.3020302@mcs.anl.gov>
References: <466F6DB6.3020302@mcs.anl.gov>
Message-ID: <466F70B3.6070407@cs.uchicago.edu>

I would be up for a demo, if others are interested as well!  The demo 
could either be cross site runs on TG, even more ambitious would be 
TG+OSG, and even more would be TG+OSG+EC2!

June 29th is the deadline, so that gives us enough time to figure out 
exactly what application we want to demo, and write the 1 page proposal!

Ioan

Mike Wilde wrote:
> Anyone interested in Swift/Falkon demos at SC?
>
> A massive TG/OSG falkon cluster demo running workflows with nice viz 
> fast would be quite cool.
>
> (I cant believe I'm saying this while arguing that falkon needs more 
> work, but.... ;)
>
> While we're on the topic of demos: I think there is still a board of 
> Governors demo here next week.
>
> Do we have any chance of putting some cool workflows on the tiled 
> display?
>
> - sidgrid wavelet with the "lava lamp brain" viz?
> - cnari w/ suma?
>
> - Mike
>
>
> -------- Original Message --------
> Subject: [OUTREACH] poster or demo at Grid2007?
> Date: Wed, 13 Jun 2007 04:35:35 +0100
> From: Jennifer M. Schopf <jms at mcs.anl.gov>
> To: outreach at globus.org
>
> Hi Folks-
>
>    right now, we don't have any general globus content at grid2007 to my
> knowledge (we didn't know about their deadlines in time to apply for a
> tutorial or BOF). However, the call for posters and demos is still open -
> maybe someone would like to apply for one of these for some of the newer
> work?  Maybe RAVE or MOPS especially?
>
> http://www.grid2007.org/?m_b_c=call_for_posterdemo
>
>  -jen
>
>
>
> ------------------------------------------------------------------------------- 
>
> Dr. Jennifer M. Schopf
> Scientist, Distributed Systems Lab
> Argonne National Laboratory
> jms at mcs.anl.gov  http://www.mcs.anl.gov/~jms
>
>
>
>

-- 
============================================
Ioan Raicu
Ph.D. Student
============================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
============================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
       http://dsl.cs.uchicago.edu/
============================================
============================================


From tiberius at ci.uchicago.edu  Tue Jun 12 23:26:08 2007
From: tiberius at ci.uchicago.edu (Tiberiu Stef-Praun)
Date: Tue, 12 Jun 2007 23:26:08 -0500
Subject: [Swift-devel] [Fwd: [OUTREACH] poster or demo at Grid2007?]
In-Reply-To: <466F6DB6.3020302@mcs.anl.gov>
References: <466F6DB6.3020302@mcs.anl.gov>
Message-ID: <fec1351f0706122126s4b60e5bbo9582e6ec038ba112@mail.gmail.com>

On 6/12/07, Mike Wilde <wilde at mcs.anl.gov> wrote:
> Anyone interested in Swift/Falkon demos at SC?
>
> A massive TG/OSG falkon cluster demo running workflows with nice viz
> fast would be quite cool.
>
> (I cant believe I'm saying this while arguing that falkon needs more
> work, but.... ;)
>
> While we're on the topic of demos: I think there is still a board of
> Governors demo here next week.
>
> Do we have any chance of putting some cool workflows on the tiled
> display?

I would say there are small chances.
I am currently overloaded with getting other things done in I2U2,
CNARI and preparing Falkon measurement for Econ, and in addition to
this, I have never attempted visualizations of the aforementioned
workflows.

Tibi
>
> - sidgrid wavelet with the "lava lamp brain" viz?
> - cnari w/ suma?
>
> - Mike
>
>
> -------- Original Message --------
> Subject: [OUTREACH] poster or demo at Grid2007?
> Date: Wed, 13 Jun 2007 04:35:35 +0100
> From: Jennifer M. Schopf <jms at mcs.anl.gov>
> To: outreach at globus.org
>
> Hi Folks-
>
>     right now, we don't have any general globus content at grid2007
> to my
> knowledge (we didn't know about their deadlines in time to apply for a
> tutorial or BOF). However, the call for posters and demos is still
> open -
> maybe someone would like to apply for one of these for some of the
> newer
> work?  Maybe RAVE or MOPS especially?
>
> http://www.grid2007.org/?m_b_c=call_for_posterdemo
>
>   -jen
>
>
>
> -------------------------------------------------------------------------------
> Dr. Jennifer M. Schopf
> Scientist, Distributed Systems Lab
> Argonne National Laboratory
> jms at mcs.anl.gov  http://www.mcs.anl.gov/~jms
>
>
>
>
> --
> Mike Wilde
> Computation Institute, University of Chicago
> Math & Computer Science Division
> Argonne National Laboratory
> Argonne, IL   60439    USA
> tel 630-252-7497 fax 630-252-1997
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>


-- 
Tiberiu (Tibi) Stef-Praun, PhD
Research Staff, Computation Institute
5640 S. Ellis Ave, #405
University of Chicago
http://www-unix.mcs.anl.gov/~tiberius/


From benc at hawaga.org.uk  Thu Jun 14 22:16:21 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 15 Jun 2007 03:16:21 +0000 (GMT)
Subject: [Swift-devel] provider-deef
Message-ID: <Pine.LNX.4.64.0706150315550.10634@dildano.hawaga.org.uk>


how to get source for provider-deef/ from version control?

-- 


From iraicu at cs.uchicago.edu  Thu Jun 14 22:18:23 2007
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Thu, 14 Jun 2007 22:18:23 -0500
Subject: [Swift-devel] provider-deef
In-Reply-To: <Pine.LNX.4.64.0706150315550.10634@dildano.hawaga.org.uk>
References: <Pine.LNX.4.64.0706150315550.10634@dildano.hawaga.org.uk>
Message-ID: <467204FF.8040106@cs.uchicago.edu>

You get it from my web site currently: 
http://people.cs.uchicago.edu/~iraicu/research/Falkon/Falkon_v0.8.tgz :)

We need to talk about getting into CVS or SVN, and where....

Ioan

Ben Clifford wrote:
> how to get source for provider-deef/ from version control?
>
>   

-- 
============================================
Ioan Raicu
Ph.D. Student
============================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
============================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
       http://dsl.cs.uchicago.edu/
============================================
============================================


From benc at hawaga.org.uk  Thu Jun 14 22:43:31 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 15 Jun 2007 03:43:31 +0000 (GMT)
Subject: [Swift-devel] provider-deef
In-Reply-To: <467204FF.8040106@cs.uchicago.edu>
References: <Pine.LNX.4.64.0706150315550.10634@dildano.hawaga.org.uk>
	<467204FF.8040106@cs.uchicago.edu>
Message-ID: <Pine.LNX.4.64.0706150343130.10634@dildano.hawaga.org.uk>


That has the falkon code in it but I can't see the cog/swift job 
submission provider.

On Thu, 14 Jun 2007, Ioan Raicu wrote:

> You get it from my web site currently:
> http://people.cs.uchicago.edu/~iraicu/research/Falkon/Falkon_v0.8.tgz :)
> 
> We need to talk about getting into CVS or SVN, and where....
> 
> Ioan
> 
> Ben Clifford wrote:
> > how to get source for provider-deef/ from version control?
> > 
> >   
> 
> 


From tiberius at ci.uchicago.edu  Fri Jun 15 07:11:17 2007
From: tiberius at ci.uchicago.edu (Tiberiu Stef-Praun)
Date: Fri, 15 Jun 2007 07:11:17 -0500
Subject: [Swift-devel] provider-deef
In-Reply-To: <Pine.LNX.4.64.0706150343130.10634@dildano.hawaga.org.uk>
References: <Pine.LNX.4.64.0706150315550.10634@dildano.hawaga.org.uk>
	<467204FF.8040106@cs.uchicago.edu>
	<Pine.LNX.4.64.0706150343130.10634@dildano.hawaga.org.uk>
Message-ID: <fec1351f0706150511o4999621fya36396fe5f36055@mail.gmail.com>

You get that from ~/tiberius/cogl (which I got from Yong's home)
However, I have not tested that yet.

On 6/14/07, Ben Clifford <benc at hawaga.org.uk> wrote:
>
> That has the falkon code in it but I can't see the cog/swift job
> submission provider.
>
> On Thu, 14 Jun 2007, Ioan Raicu wrote:
>
> > You get it from my web site currently:
> > http://people.cs.uchicago.edu/~iraicu/research/Falkon/Falkon_v0.8.tgz :)
> >
> > We need to talk about getting into CVS or SVN, and where....
> >
> > Ioan
> >
> > Ben Clifford wrote:
> > > how to get source for provider-deef/ from version control?
> > >
> > >
> >
> >
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>


-- 
Tiberiu (Tibi) Stef-Praun, PhD
Research Staff, Computation Institute
5640 S. Ellis Ave, #405
University of Chicago
http://www-unix.mcs.anl.gov/~tiberius/


From wilde at mcs.anl.gov  Fri Jun 15 08:11:30 2007
From: wilde at mcs.anl.gov (Mike Wilde)
Date: Fri, 15 Jun 2007 08:11:30 -0500
Subject: [Swift-devel] Re: Welcome Swift as an incubator project
In-Reply-To: <6.2.1.2.2.20070615133455.02dd9298@imap.mcs.anl.gov>
References: <6.2.1.2.2.20070615133455.02dd9298@imap.mcs.anl.gov>
Message-ID: <46729002.9090004@mcs.anl.gov>

Thanks, Jen, that's great news.

We'll discuss how to take the steps below.

- Mike


Jennifer M. Schopf wrote, On 6/15/2007 7:45 AM:
> Hi Mike-
> 
>    We are pleased to accept Swift as an Incubator Project in the 
> dev.globus incubation process!  This mail contains information on your 
> mentor, getting set up in the dev.globus infrastructure, and next steps. 
> Current process guidelines can be found at 
> http://dev.globus.org/wiki/Incubator/Incubator_Process .
> 
> Your mentor is currently Jennifer Schopf (jms at mcs.anl.gov). Your mentor 
> will act as a bridge between your project and the Incubation Management 
> Project, and should be able to answer any basic questions you have. They 
> will also help you during the quarterly reviews and in understanding how 
> to escalate to a full Globus Project.  Your mentor needs to have write 
> access for the wiki page, but does not need to have CVS commit access 
> unless you would like them to. If you would like to propose a different 
> mentor for any reason, please let me know, and we can discuss options.
> 
> We have requested three mailing lists for you, according to the Globus 
> project guidelines ( 
> https://dev.globus.org/wiki/Guidelines#Communication 
> ): swift-dev, swift-user, and swift-commit, with you listed as the 
> owner. The initial password will be set to "incubator" for all of them, 
> and they are currently operational.  It now falls to you to enroll the 
> members of these lists since they come completely empty. They're 
> standard majordomo lists; basic subscription is done by simply sending 
> (to majordomo at globus.org)
> 
> approve <passwd> subscribe <listname> <email_address>
> 
> approve <passwd> subscribe <listname> <next_email_address>
> for your various lists and subscribers. All of your committers should be 
> subscribed to all 3 lists. We also strongly encourage them to subscribe 
> to announce at globus.org.
> 
> Your CVS/SVN module will be set up for you. Please mail 
> infrastructure at globus.org (be sure to 
> have CVS in the subject line in addition to anything else you'd like) 
> with the complete list of committer names, desired names for accounts, 
> and ssh public keys, and they will email back access instructions when 
> these are set. Be sure to include your project name in the mail.
> 
> Your wiki page has been set up with the common template and is located 
> at http://dev.globus.org/wiki/Incubator/Swift. If you need to add wiki 
> committers, please mail 
> infrastructure at globus.org (be sure to 
> have WIKI in the subject line in addition to anything else you'd like)  
> with your list of wiki committers' names and their desired account names 
> for the dev.globus.org wiki. Be sure to include your project name in the 
> mail.
> 
> In order to set up bugzilla space on the Globus bugzilla (for example, 
> to keep track of your roadmap items, track bugs, follow enhancement 
> requests, etc), the person you would like to be responsible for your 
> bugzilla in Globus-space should get an account at bugzilla.globus.org, 
> and then send that account name and the name of your project to 
> infrastructure at globus.org (be sure to 
> have BUGZILLA in the subject line in addition to anything else you'd 
> like).  They will then give that account authority to create subproducts 
> and such for your product and you will then be able to create the list 
> of components, cc: lists, and descriptions as you see fit.
> 
> One of the hurdles you'll need to pass in order to become a Globus 
> project is the licensing. Licensing information, including both of the 
> licenses needed and a guideline document, can be found here: 
> http://dev.globus.org/wiki/Guidelines#License_and_Contributor_Agreements 
> In a nutshell, every committer for your project will need to sign an 
> individual license, and anyone who is doing this as part of their day 
> job will also need a corporate license signed. Those doing it on their 
> own free time should include a letter stating this fact. All licenses 
> when signed should be mailed to Jennifer Schopf, Argonne National Lab, 
> 9700 S. Cass Ave, bldg 221, Argonne, IL 60439 USA.
> 
> The next step in the incubator process will be a check on where the 
> startup projects are. This is likely to be happening around mid July, 
> and your mentor will contact you prior.
> 
> Thanks again for your participation, and bearing with us while we get 
> the process up and running. Please don't hesitate to contact me or your 
> mentor with any additional questions.
> 
> 
>    -Jennifer Schopf, on behalf of the Incubator Management Project
> 
> 
> 
> 
> ------------------------------------------------------------------------------- 
> 
> Dr. Jennifer M. Schopf
> Scientist, Distributed Systems Lab
> Argonne National Laboratory
> jms at mcs.anl.gov  http://www.mcs.anl.gov/~jms
> 
> 
> 

-- 
Mike Wilde
Computation Institute, University of Chicago
Math & Computer Science Division
Argonne National Laboratory
Argonne, IL   60439    USA
tel 630-252-7497 fax 630-252-1997


From benc at hawaga.org.uk  Fri Jun 15 08:27:57 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 15 Jun 2007 13:27:57 +0000 (GMT)
Subject: [Swift-devel] provider-deef
In-Reply-To: <fec1351f0706150511o4999621fya36396fe5f36055@mail.gmail.com>
References: <Pine.LNX.4.64.0706150315550.10634@dildano.hawaga.org.uk> 
	<467204FF.8040106@cs.uchicago.edu>
	<Pine.LNX.4.64.0706150343130.10634@dildano.hawaga.org.uk>
	<fec1351f0706150511o4999621fya36396fe5f36055@mail.gmail.com>
Message-ID: <Pine.LNX.4.64.0706151327300.10634@dildano.hawaga.org.uk>


Yong mentioned something about it being in the cog svn once, I thought.

On Fri, 15 Jun 2007, Tiberiu Stef-Praun wrote:

> You get that from ~/tiberius/cogl (which I got from Yong's home)
> However, I have not teste that yet.
> 
> On 6/14/07, Ben Clifford <benc at hawaga.org.uk> wrote:
> > 
> > That has the falkon code in it but I can't see the cog/swift job
> > submission provider.
> > 
> > On Thu, 14 Jun 2007, Ioan Raicu wrote:
> > 
> > > You get it from my web site currently:
> > > http://people.cs.uchicago.edu/~iraicu/research/Falkon/Falkon_v0.8.tgz :)
> > >
> > > We need to talk about getting into CVS or SVN, and where....
> > >
> > > Ioan
> > >
> > > Ben Clifford wrote:
> > > > how to get source for provider-deef/ from version control?
> > > >
> > > >
> > >
> > >
> > 
> 
> 
> 


From benc at hawaga.org.uk  Fri Jun 15 08:39:04 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 15 Jun 2007 13:39:04 +0000 (GMT)
Subject: [Swift-devel] swift infrastructure move to dev.globus
Message-ID: <Pine.LNX.4.64.0706151334590.3016@dildano.hawaga.org.uk>


dev.globus people have been talking some about how to deal with projects 
that already have their own infrastructure (swift being a prime example of 
that); I propose we don't mess round with our infrastructure (mailing 
lists and version control) until dev.globus has decided an approach there.

-- 


From hategan at mcs.anl.gov  Fri Jun 15 08:39:32 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 15 Jun 2007 16:39:32 +0300
Subject: [Swift-devel] swift infrastructure move to dev.globus
In-Reply-To: <Pine.LNX.4.64.0706151334590.3016@dildano.hawaga.org.uk>
References: <Pine.LNX.4.64.0706151334590.3016@dildano.hawaga.org.uk>
Message-ID: <1181914772.9966.0.camel@blabla.mcs.anl.gov>

On Fri, 2007-06-15 at 13:39 +0000, Ben Clifford wrote:
> dev.globus people have been talking some about how to deal with projects 
> that already have their own infrastructure (swift being a prime example of 
> that); I propose we don't mess round with our infrastructure (mailing 
> lists and version control) until dev.globus has decided an approach there.

I concur.

> 


From wilde at mcs.anl.gov  Fri Jun 15 08:43:12 2007
From: wilde at mcs.anl.gov (Mike Wilde)
Date: Fri, 15 Jun 2007 08:43:12 -0500
Subject: [Swift-devel] swift infrastructure move to dev.globus
In-Reply-To: <Pine.LNX.4.64.0706151334590.3016@dildano.hawaga.org.uk>
References: <Pine.LNX.4.64.0706151334590.3016@dildano.hawaga.org.uk>
Message-ID: <46729770.6090206@mcs.anl.gov>

Certainly OK with me if it's OK with Jen and the incubator group.

- Mike

Ben Clifford wrote, On 6/15/2007 8:39 AM:
> dev.globus people have been talking some about how to deal with projects 
> that already have their own infrastructure (swift being a prime example of 
> that); I propose we don't mess round with our infrastructure (mailing 
> lists and version control) until dev.globus has decided an approach there.
> 

-- 
Mike Wilde
Computation Institute, University of Chicago
Math & Computer Science Division
Argonne National Laboratory
Argonne, IL   60439    USA
tel 630-252-7497 fax 630-252-1997


From yongzh at cs.uchicago.edu  Fri Jun 15 09:04:24 2007
From: yongzh at cs.uchicago.edu (Yong Zhao)
Date: Fri, 15 Jun 2007 09:04:24 -0500 (CDT)
Subject: [Swift-devel] provider-deef
In-Reply-To: <Pine.LNX.4.64.0706151327300.10634@dildano.hawaga.org.uk>
References: <Pine.LNX.4.64.0706150315550.10634@dildano.hawaga.org.uk> 
	<467204FF.8040106@cs.uchicago.edu>
	<Pine.LNX.4.64.0706150343130.10634@dildano.hawaga.org.uk>
	<fec1351f0706150511o4999621fya36396fe5f36055@mail.gmail.com>
	<Pine.LNX.4.64.0706151327300.10634@dildano.hawaga.org.uk>
Message-ID: <Pine.LNX.4.58.0706150902550.14789@vagus.cs.uchicago.edu>

We discussed where to put it in svn, but it never got into svn.

Currently it resides in the Karajan branch in my source code, but Mihael says
that is not a good place to put it in svn.

Yong.

On Fri, 15 Jun 2007, Ben Clifford wrote:

>
> YOng mentioned something about it being in the cog svn once i thought.
>
> On Fri, 15 Jun 2007, Tiberiu Stef-Praun wrote:
>
> > You get that from ~/tiberius/cogl (which I got from Yong's home)
> > However, I have not teste that yet.
> >
> > On 6/14/07, Ben Clifford <benc at hawaga.org.uk> wrote:
> > >
> > > That has the falkon code in it but I can't see the cog/swift job
> > > submission provider.
> > >
> > > On Thu, 14 Jun 2007, Ioan Raicu wrote:
> > >
> > > > You get it from my web site currently:
> > > > http://people.cs.uchicago.edu/~iraicu/research/Falkon/Falkon_v0.8.tgz :)
> > > >
> > > > We need to talk about getting into CVS or SVN, and where....
> > > >
> > > > Ioan
> > > >
> > > > Ben Clifford wrote:
> > > > > how to get source for provider-deef/ from version control?
> > > > >
> > > > >
> > > >
> > > >
> > >
> >
> >
> >
>


From hategan at mcs.anl.gov  Fri Jun 15 09:08:32 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 15 Jun 2007 17:08:32 +0300
Subject: [Swift-devel] provider-deef
In-Reply-To: <Pine.LNX.4.58.0706150902550.14789@vagus.cs.uchicago.edu>
References: <Pine.LNX.4.64.0706150315550.10634@dildano.hawaga.org.uk>
	<467204FF.8040106@cs.uchicago.edu>
	<Pine.LNX.4.64.0706150343130.10634@dildano.hawaga.org.uk>
	<fec1351f0706150511o4999621fya36396fe5f36055@mail.gmail.com>
	<Pine.LNX.4.64.0706151327300.10634@dildano.hawaga.org.uk>
	<Pine.LNX.4.58.0706150902550.14789@vagus.cs.uchicago.edu>
Message-ID: <1181916512.10096.2.camel@blabla.mcs.anl.gov>

On Fri, 2007-06-15 at 09:04 -0500, Yong Zhao wrote:
> We discussed where to put it in svn, but it never got into svn.
> 
> Currently it resides in Karajan branch in my source code, but Mihael says
> that is not a good place to put it in svn.

Clearly not. It could live in a provider-falkon module, like all the
other providers though.

Mihael

> 
> Yong.
> 
> On Fri, 15 Jun 2007, Ben Clifford wrote:
> 
> >
> > YOng mentioned something about it being in the cog svn once i thought.
> >
> > On Fri, 15 Jun 2007, Tiberiu Stef-Praun wrote:
> >
> > > You get that from ~/tiberius/cogl (which I got from Yong's home)
> > > However, I have not teste that yet.
> > >
> > > On 6/14/07, Ben Clifford <benc at hawaga.org.uk> wrote:
> > > >
> > > > That has the falkon code in it but I can't see the cog/swift job
> > > > submission provider.
> > > >
> > > > On Thu, 14 Jun 2007, Ioan Raicu wrote:
> > > >
> > > > > You get it from my web site currently:
> > > > > http://people.cs.uchicago.edu/~iraicu/research/Falkon/Falkon_v0.8.tgz :)
> > > > >
> > > > > We need to talk about getting into CVS or SVN, and where....
> > > > >
> > > > > Ioan
> > > > >
> > > > > Ben Clifford wrote:
> > > > > > how to get source for provider-deef/ from version control?
> > > > > >
> > > > > >
> > > > >
> > > > >
> > > >
> > >
> > >
> > >
> >
> 


From benc at hawaga.org.uk  Fri Jun 15 09:13:04 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 15 Jun 2007 14:13:04 +0000 (GMT)
Subject: [Swift-devel] provider-deef
In-Reply-To: <Pine.LNX.4.58.0706150902550.14789@vagus.cs.uchicago.edu>
References: <Pine.LNX.4.64.0706150315550.10634@dildano.hawaga.org.uk> 
	<467204FF.8040106@cs.uchicago.edu>
	<Pine.LNX.4.64.0706150343130.10634@dildano.hawaga.org.uk>
	<fec1351f0706150511o4999621fya36396fe5f36055@mail.gmail.com>
	<Pine.LNX.4.64.0706151327300.10634@dildano.hawaga.org.uk>
	<Pine.LNX.4.58.0706150902550.14789@vagus.cs.uchicago.edu>
Message-ID: <Pine.LNX.4.64.0706151407330.10634@dildano.hawaga.org.uk>


Oh, I see. I misunderstood what you meant by 'branch' earlier on.

The administratively easiest option would be to put both falkon and 
provider-deef as top-level directories in the SVN that we store swift in, 
I guess.

On Fri, 15 Jun 2007, Yong Zhao wrote:

> We discussed where to put it in svn, but it never got into svn.
> 
> Currently it resides in Karajan branch in my source code, but Mihael says
> that is not a good place to put it in svn.
> 
> Yong.
> 
> On Fri, 15 Jun 2007, Ben Clifford wrote:
> 
> >
> > YOng mentioned something about it being in the cog svn once i thought.
> >
> > On Fri, 15 Jun 2007, Tiberiu Stef-Praun wrote:
> >
> > > You get that from ~/tiberius/cogl (which I got from Yong's home)
> > > However, I have not teste that yet.
> > >
> > > On 6/14/07, Ben Clifford <benc at hawaga.org.uk> wrote:
> > > >
> > > > That has the falkon code in it but I can't see the cog/swift job
> > > > submission provider.
> > > >
> > > > On Thu, 14 Jun 2007, Ioan Raicu wrote:
> > > >
> > > > > You get it from my web site currently:
> > > > > http://people.cs.uchicago.edu/~iraicu/research/Falkon/Falkon_v0.8.tgz :)
> > > > >
> > > > > We need to talk about getting into CVS or SVN, and where....
> > > > >
> > > > > Ioan
> > > > >
> > > > > Ben Clifford wrote:
> > > > > > how to get source for provider-deef/ from version control?
> > > > > >
> > > > > >
> > > > >
> > > > >
> > > >
> > >
> > >
> > >
> >
> 
> 


From yongzh at cs.uchicago.edu  Fri Jun 15 09:15:23 2007
From: yongzh at cs.uchicago.edu (Yong Zhao)
Date: Fri, 15 Jun 2007 09:15:23 -0500 (CDT)
Subject: [Swift-devel] provider-deef
In-Reply-To: <1181916512.10096.2.camel@blabla.mcs.anl.gov>
References: <Pine.LNX.4.64.0706150315550.10634@dildano.hawaga.org.uk> 
	<467204FF.8040106@cs.uchicago.edu>
	<Pine.LNX.4.64.0706150343130.10634@dildano.hawaga.org.uk>
	<fec1351f0706150511o4999621fya36396fe5f36055@mail.gmail.com> 
	<Pine.LNX.4.64.0706151327300.10634@dildano.hawaga.org.uk> 
	<Pine.LNX.4.58.0706150902550.14789@vagus.cs.uchicago.edu>
	<1181916512.10096.2.camel@blabla.mcs.anl.gov>
Message-ID: <Pine.LNX.4.58.0706150914330.14789@vagus.cs.uchicago.edu>

Sorry, I did not mean it was in karajan; it was in cog alongside the other
providers, as Mihael indicated.

Yong.

On Fri, 15 Jun 2007, Mihael Hategan wrote:

> On Fri, 2007-06-15 at 09:04 -0500, Yong Zhao wrote:
> > We discussed where to put it in svn, but it never got into svn.
> >
> > Currently it resides in Karajan branch in my source code, but Mihael says
> > that is not a good place to put it in svn.
>
> Clearly not. It could live in a provider-falkon module, like all the
> other providers though.
>
> Mihael
>
> >
> > Yong.
> >
> > On Fri, 15 Jun 2007, Ben Clifford wrote:
> >
> > >
> > > YOng mentioned something about it being in the cog svn once i thought.
> > >
> > > On Fri, 15 Jun 2007, Tiberiu Stef-Praun wrote:
> > >
> > > > You get that from ~/tiberius/cogl (which I got from Yong's home)
> > > > However, I have not teste that yet.
> > > >
> > > > On 6/14/07, Ben Clifford <benc at hawaga.org.uk> wrote:
> > > > >
> > > > > That has the falkon code in it but I can't see the cog/swift job
> > > > > submission provider.
> > > > >
> > > > > On Thu, 14 Jun 2007, Ioan Raicu wrote:
> > > > >
> > > > > > You get it from my web site currently:
> > > > > > http://people.cs.uchicago.edu/~iraicu/research/Falkon/Falkon_v0.8.tgz :)
> > > > > >
> > > > > > We need to talk about getting into CVS or SVN, and where....
> > > > > >
> > > > > > Ioan
> > > > > >
> > > > > > Ben Clifford wrote:
> > > > > > > how to get source for provider-deef/ from version control?
> > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > >
> >
>
>


From hategan at mcs.anl.gov  Fri Jun 15 09:27:36 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 15 Jun 2007 17:27:36 +0300
Subject: [Swift-devel] provider-deef
In-Reply-To: <Pine.LNX.4.64.0706151407330.10634@dildano.hawaga.org.uk>
References: <Pine.LNX.4.64.0706150315550.10634@dildano.hawaga.org.uk>
	<467204FF.8040106@cs.uchicago.edu>
	<Pine.LNX.4.64.0706150343130.10634@dildano.hawaga.org.uk>
	<fec1351f0706150511o4999621fya36396fe5f36055@mail.gmail.com>
	<Pine.LNX.4.64.0706151327300.10634@dildano.hawaga.org.uk>
	<Pine.LNX.4.58.0706150902550.14789@vagus.cs.uchicago.edu>
	<Pine.LNX.4.64.0706151407330.10634@dildano.hawaga.org.uk>
Message-ID: <1181917656.10152.0.camel@blabla.mcs.anl.gov>

On Fri, 2007-06-15 at 14:13 +0000, Ben Clifford wrote:
> Oh, I see. I misunderstood what you meant by 'branch' earlier on.
> 
> The administratively easiest would be to put both falkon and provider-deef 
> as top level directories in the SVN that we store swift in, I guess.

That sounds like a better idea.

Mihael

> 
> On Fri, 15 Jun 2007, Yong Zhao wrote:
> 
> > We discussed where to put it in svn, but it never got into svn.
> > 
> > Currently it resides in Karajan branch in my source code, but Mihael says
> > that is not a good place to put it in svn.
> > 
> > Yong.
> > 
> > On Fri, 15 Jun 2007, Ben Clifford wrote:
> > 
> > >
> > > YOng mentioned something about it being in the cog svn once i thought.
> > >
> > > On Fri, 15 Jun 2007, Tiberiu Stef-Praun wrote:
> > >
> > > > You get that from ~/tiberius/cogl (which I got from Yong's home)
> > > > However, I have not teste that yet.
> > > >
> > > > On 6/14/07, Ben Clifford <benc at hawaga.org.uk> wrote:
> > > > >
> > > > > That has the falkon code in it but I can't see the cog/swift job
> > > > > submission provider.
> > > > >
> > > > > On Thu, 14 Jun 2007, Ioan Raicu wrote:
> > > > >
> > > > > > You get it from my web site currently:
> > > > > > http://people.cs.uchicago.edu/~iraicu/research/Falkon/Falkon_v0.8.tgz :)
> > > > > >
> > > > > > We need to talk about getting into CVS or SVN, and where....
> > > > > >
> > > > > > Ioan
> > > > > >
> > > > > > Ben Clifford wrote:
> > > > > > > how to get source for provider-deef/ from version control?
> > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > >
> > 
> > 
> 


From benc at hawaga.org.uk  Fri Jun 15 11:48:01 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 15 Jun 2007 16:48:01 +0000 (GMT)
Subject: [Swift-devel] CPU usage with provider-deef
Message-ID: <Pine.LNX.4.64.0706151634060.10634@dildano.hawaga.org.uk>


Yesterday, I was playing a bit with Ioan and Nika - they submitted a 68 
node / 15 minute workflow through provider-deef & falkon and saw the swift 
JVM on the submit node using about 100% CPU; then the same workflow 
running through the GT2 GRAM provider rather than provider-deef and falkon 
appeared to use significantly less.

I wandered off at that point so don't know if any interesting results came 
after.

-- 


From benc at hawaga.org.uk  Fri Jun 15 12:26:30 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 15 Jun 2007 17:26:30 +0000 (GMT)
Subject: [Swift-devel] provider-deef
In-Reply-To: <1181917656.10152.0.camel@blabla.mcs.anl.gov>
References: <Pine.LNX.4.64.0706150315550.10634@dildano.hawaga.org.uk> 
	<467204FF.8040106@cs.uchicago.edu>
	<Pine.LNX.4.64.0706150343130.10634@dildano.hawaga.org.uk>
	<fec1351f0706150511o4999621fya36396fe5f36055@mail.gmail.com> 
	<Pine.LNX.4.64.0706151327300.10634@dildano.hawaga.org.uk> 
	<Pine.LNX.4.58.0706150902550.14789@vagus.cs.uchicago.edu> 
	<Pine.LNX.4.64.0706151407330.10634@dildano.hawaga.org.uk>
	<1181917656.10152.0.camel@blabla.mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0706151718380.10634@dildano.hawaga.org.uk>


On Fri, 15 Jun 2007, Mihael Hategan wrote:

> > The administratively easiest would be to put both falkon and provider-deef 
> > as top level directories in the SVN that we store swift in, I guess.
> 
> That sounds like a better idea.

ok. I will work with Ioan and Yong to get their respective modules into 
the swift SVN when they're ready (over the next week or so).

-- 


From nefedova at mcs.anl.gov  Fri Jun 15 13:02:58 2007
From: nefedova at mcs.anl.gov (Veronika Nefedova)
Date: Fri, 15 Jun 2007 13:02:58 -0500
Subject: [Swift-devel] CPU usage with provider-deef
In-Reply-To: <Pine.LNX.4.64.0706151634060.10634@dildano.hawaga.org.uk>
References: <Pine.LNX.4.64.0706151634060.10634@dildano.hawaga.org.uk>
Message-ID: <FF1DF2ED-B682-4BEC-98F0-B6A490D35FB3@mcs.anl.gov>

Ben,

We tested both my workflow and a simple "sleep" workflow. Both tests  
produced 100% CPU usage when run with Falkon. When submitted directly  
to GRAM from swift, only a fraction of 1% CPU was used. I know that  
Yong did some additional testing, but I do not know the results.

Nika

On Jun 15, 2007, at 11:48 AM, Ben Clifford wrote:

>
> Yesterday, I was playing a bit with Ioan and Nika - they submitted  
> a 68
> node / 15 minute workflow through provider-deef & falkon and saw  
> the swift
> JVM on the submit node using about 100% CPU; then the same workflow
> running through the GT2 GRAM provider rather than provider-deef and  
> falkon
> appeared to use significantly less.
>
> I wandered off at that point so don't know if any interesting  
> results came
> after.
>
> -- 
>


From benc at hawaga.org.uk  Fri Jun 15 15:13:23 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 15 Jun 2007 20:13:23 +0000 (GMT)
Subject: [Swift-devel] on the semantics of 'array closing'
Message-ID: <Pine.LNX.4.64.0706151924490.10634@dildano.hawaga.org.uk>


There is a problem that has been called the 'array closing problem'.

It manifests itself in the tutorial in that certain bits of code that 
intuitively could go either in a procedure or at the top level can, in 
practice, only go into a procedure.

In that context, I tried to think about better ways to explain/document 
the behaviour than "mumble mumble move that code into a procedure".

In Swift we claim to have 'single assignment variables'.

From single assignment variables we get our grid job ordering:

  a = p()
  b = s(a)

causes first grid job p to run, and when that has completed, then grid job 
s will run.

This is the same as if we had written:

  b = s(a)
  a = p()

The ordering comes from the use of a as an 'output' for p and an 'input' 
for s, not from source text ordering.
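The point about ordering can be sketched outside Swift. What follows is a 
minimal Python model, not Swift's implementation: p and s are stand-ins for 
the grid jobs above, and run() and log are hypothetical names introduced 
only for this illustration. Each variable is a write-once future, and the 
two statements are submitted in whatever textual order is given.

```python
from concurrent.futures import Future, ThreadPoolExecutor

log = []  # records the order in which the stand-in jobs actually run


def p():
    log.append("p")
    return "output-of-p"


def s(a_value):
    log.append("s")
    return "s(" + a_value + ")"


def run(statement_order):
    """Submit 'a = p()' and 'b = s(a)' in the given textual order."""
    log.clear()
    a, b = Future(), Future()  # one write-once slot per variable
    statements = {
        "a = p()": lambda: a.set_result(p()),
        # s cannot start until a has a value: a.result() blocks until then.
        "b = s(a)": lambda: b.set_result(s(a.result())),
    }
    with ThreadPoolExecutor(max_workers=2) as pool:
        for text in statement_order:
            pool.submit(statements[text])
    return list(log), b.result()


print(run(["a = p()", "b = s(a)"]))
print(run(["b = s(a)", "a = p()"]))
```

Both calls report p running before s, because s waits on the value of a 
rather than on its position in the source text.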

In that model, it's meaningless to assign two different things to a, like 
this:

  a = p()
  b = s(a)
  a = t()


Note that I've omitted the data types from the above. This works in the 
implementation for simple types such as a datafile marker type.

What is important is that each variable is either unassigned or has its 
single value - whenever we refer to that variable, we can either use the 
value it has, or defer evaluation of that expression until the variable 
has its value.
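As a rough illustration of that rule, here is a small Python sketch of a 
single-assignment variable. The class and method names (SingleAssignment, 
assign, read, is_assigned) are made up for this example and do not 
correspond to anything in the Swift or karajan code.

```python
import threading


class SingleAssignment:
    """A variable that is either unassigned or holds its one value."""

    def __init__(self, name):
        self.name = name
        self._assigned = threading.Event()
        self._lock = threading.Lock()
        self._value = None

    def assign(self, value):
        with self._lock:
            if self._assigned.is_set():
                # A second assignment has no meaning in this model.
                raise RuntimeError(self.name + " already has a value")
            self._value = value
            self._assigned.set()

    def is_assigned(self):
        return self._assigned.is_set()

    def read(self, timeout=None):
        # Defer until the variable has its value (or give up after timeout).
        if not self._assigned.wait(timeout):
            raise TimeoutError(self.name + " is still unassigned")
        return self._value


a = SingleAssignment("a")
a.assign("output-of-p")
print(a.read())
```

A reader either gets the value or waits for it; a second assign() on the 
same variable, as in the a = t() example above, is rejected.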

Now consider arrays. In the present syntax, arrays can be passed as 
single (complex) values to/from procedures, like before:

  a = p()
  b = s(a)

Here a and b are array types.

That's fine. a is assigned to by the first statement, and b is assigned to 
by the second statement.

But we also support a different assignment syntax for arrays, that looks 
like this:

  a[0] = p()
  a[1] = q()
  b = s(a)

This fails at the moment (specifically, I think the execution engine will 
hang).

Why? Because there is no one point at which we assign a value to 'a' - the 
assignment is split over multiple statements, which can be in various 
places (and inside loops etc).

There is nothing in the implementation that detects that a has been 
assigned its value.

So there is this notion in the karajan intermediate code of 'closing an 
array'.  This is an assertion made in the object code that all assignments 
to pieces of an array have been made - that, in effect, the array has its 
value.

The suggested hack/workaround for this is to move the array element 
assignments into a procedure:

 (file f[]) z() {
   f[0] = p();
   f[1] = q();
 }

 a = z()
 b = s(a)

This works. (which is sort-of a violation of referential transparency)

It works because Swift implicitly marks arrays returned from compound 
procedures as closed (which may or may not be correct).
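The difference between the two cases can be modelled in a few lines of 
Python. This is only a sketch under assumptions: ClosableArray, 
assign, close and read_whole are hypothetical names, and the timeout stands 
in for the hang described earlier. It is not the karajan implementation.

```python
import threading


class ClosableArray:
    """Elements are assigned one by one; the whole array is readable
    only after close() asserts that no more assignments will come."""

    def __init__(self):
        self._elements = {}
        self._closed = threading.Event()

    def assign(self, index, value):
        if self._closed.is_set():
            raise RuntimeError("array is closed")
        if index in self._elements:
            raise RuntimeError("element %d already assigned" % index)
        self._elements[index] = value

    def close(self):
        self._closed.set()

    def read_whole(self, timeout=None):
        if not self._closed.wait(timeout):
            raise TimeoutError("array was never closed")
        return [self._elements[i] for i in sorted(self._elements)]


def z():
    # Stand-in for the compound procedure z() above.
    f = ClosableArray()
    f.assign(0, "output-of-p")
    f.assign(1, "output-of-q")
    f.close()  # models the implicit close when the procedure returns
    return f


# Top-level element assignments: nothing ever closes the array.
a = ClosableArray()
a.assign(0, "output-of-p")
a.assign(1, "output-of-q")
try:
    a.read_whole(timeout=0.05)
except TimeoutError as err:
    print("top level:", err)

print("via z():", z().read_whole())
```

In this model the top-level version never becomes readable as a whole, 
while the array returned from z() is readable immediately.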

So in most variable scopes, arrays behave like single-assignment 
variables, but each array can have one specific scope in which members can 
be assigned to. In that scope, the array cannot be treated as a whole 
variable.

In the z() example above, that special scope is the body of z(). In the 
previous example, that scope is the global scope, and the program is 
invalid by the rule above that the array cannot be referred to as a whole 
in the same place that its members are individually assigned to.

That's my explanation of what's going on now. I think it matches reality. 
I don't like that this is reality, but it is what we have.

Comments appreciated.

-- 


From foster at mcs.anl.gov  Fri Jun 15 15:26:11 2007
From: foster at mcs.anl.gov (Ian Foster)
Date: Fri, 15 Jun 2007 15:26:11 -0500
Subject: [Swift-devel] on the semantics of 'array closing'
In-Reply-To: <Pine.LNX.4.64.0706151924490.10634@dildano.hawaga.org.uk>
References: <Pine.LNX.4.64.0706151924490.10634@dildano.hawaga.org.uk>
Message-ID: <4672F5E3.7060205@mcs.anl.gov>

Hi,

For:

  a[0] = p()
  a[1] = q()
  b = s(a)

I think there are two distinct issues.

a) Determining the size of the array. This could presumably be done by 
declaring it, e.g.:

  a[2] or some similar notion
  a[0] = p()
  a[1] = q()
  b = s(a)

or by some "closing" concept.

b) Whether or not each element of an array is a separate 
single-assignment variable. If they are, then the code above should work 
just fine. If they are not, then we have a couple of behaviors we could 
define. One would be that b=s(a) blocks until all elements in "a" are 
defined. The other is that we have a way of "closing" (once again). In 
that case, we have to define what happens if b=s(a) accesses an element 
that is not defined.
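One way to picture the declared-size option combined with blocking until 
all elements are defined is the Python sketch below. It is an illustration 
only: SizedArray and its methods are hypothetical names, and nothing here 
is taken from the Swift implementation.

```python
import threading


class SizedArray:
    """Array with a declared size; each element is single-assignment."""

    def __init__(self, size):
        self._size = size
        self._values = {}
        self._cond = threading.Condition()

    def assign(self, index, value):
        with self._cond:
            if not 0 <= index < self._size:
                raise IndexError(index)
            if index in self._values:
                raise RuntimeError("element %d already assigned" % index)
            self._values[index] = value
            self._cond.notify_all()

    def read_whole(self, timeout=None):
        # Blocks until every declared element has been assigned.
        with self._cond:
            complete = self._cond.wait_for(
                lambda: len(self._values) == self._size, timeout)
            if not complete:
                raise TimeoutError("not all elements are defined yet")
            return [self._values[i] for i in range(self._size)]


a = SizedArray(2)
a.assign(0, "output-of-p")
a.assign(1, "output-of-q")
print(a.read_whole())
```

With the size known up front, no separate close step is needed: the array 
is complete exactly when the last declared element is assigned.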

Ian.

Ben Clifford wrote:
> There is a problem that has been called the 'array closing problem'.
>
> It manifests itself in the tutorial in that certain bits of code that 
> intuitively could go either in a procedure or at the top level can, in 
> practice, only go into a procedure.
>
> In that context, I tried to think about better ways to explain/document 
> the behaviour than "mumble mumble move that code into a procedure".
>
> In Swift we claim to have 'single assignment variables'.
>
> From single assignment variables we get our grid job ordering:
>
>   a = p()
>   b = s(a)
>
> causes first grid job p to run, and when that has completed, then grid job 
> s will run.
>
> This is the same as if we had written:
>
>   b = s(a)
>   a = p()
>
> The ordering comes from the use of a as an 'output' for p and an 'input' 
> for s, not from source text ordering.
>
> In that model, it's meaningless to assign two different things to a, like 
> this:
>
>   a = p()
>   b = s(a)
>   a = t()
>
>
> Note that I've omitted the data types from the above. This works in the 
> implementation for simple types such as a datafile marker type.
>
> What is important is that each variable is either unassigned or has its 
> single value - whenever we refer to that variable, we can either use the 
> value it has, or defer evaluation of that expression until the variable 
> has its value.
>
> Now consider arrays. In the present syntax, arrays can be passed as 
> single (complex) values to/from procedures, like before:
>
>   a = p()
>   b = s(a)
>
> Here a and b are array types.
>
> That's fine. a is assigned to by the first statement, and b is assigned to 
> by the second statement.
>
> But we also support a different assignment syntax for arrays, that looks 
> like this:
>
>   a[0] = p()
>   a[1] = q()
>   b = s(a)
>
> This fails at the moment (specifically, I think the execution engine will 
> hang).
>
> Why? Because there is no one point at which we assign a value to 'a' - the 
> assignment is split over multiple statements, which can be in various 
> places (and inside loops etc).
>
> There is nothing in the implementation that detects that a has been 
> assigned its value.
>
> So there is this notion in the karajan intermediate code of 'closing an 
> array'.  This is an assertion made in the object code that all assignments 
> to pieces of an array have been made - that, in effect, the array has its 
> value.
>
> The suggested hack/workaround for this is to move the array element 
> assignments into a procedure:
>
>  (file f[]) z() {
>    f[0] = p();
>    f[1] = q();
>  }
>
>  a = z()
>  b = s(a)
>
> This works. (which is sort-of a violation of referential transparency)
>
> It works because Swift implicitly marks arrays returned from compound 
> procedures as closed (which may or may not be correct).
>
> So in most variable scopes, arrays behave like single-assignment 
> variables, but each array can have one specific scope in which members can 
> be assigned to. In that scope, the array cannot be treated as a whole 
> variable.
>
> In the z() example above, that special scope is the body of z(). In the 
> previous example, that scope is the global scope, and the program is 
> invalid by the rule above that the array cannot be referred to as a whole 
> in the same place that its members are individually assigned to.
>
> That's my explanation of what's going on now. I think it matches reality. 
> I don't like that this is reality, but it is what we have.
>
> Comments appreciated.
>
>   

-- 

   Ian Foster, Director, Computation Institute
Argonne National Laboratory & University of Chicago
Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439
Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637
Tel: +1 630 252 4619.  Web: www.ci.uchicago.edu.
      Globus Alliance: www.globus.org.


From yongzh at cs.uchicago.edu  Fri Jun 15 15:40:29 2007
From: yongzh at cs.uchicago.edu (Yong Zhao)
Date: Fri, 15 Jun 2007 15:40:29 -0500 (CDT)
Subject: [Swift-devel] on the semantics of 'array closing'
In-Reply-To: <4672F5E3.7060205@mcs.anl.gov>
References: <Pine.LNX.4.64.0706151924490.10634@dildano.hawaga.org.uk>
	<4672F5E3.7060205@mcs.anl.gov>
Message-ID: <Pine.LNX.4.58.0706151535310.8107@classes.cs.uchicago.edu>

Yes, the case is exactly as you have described. Currently each a[i] is
closed separately, but the whole array also needs to be closed. For
instance, if s accesses only a[0] and a[1], it might go through
correctly, but if s accesses all elements of a (without knowing how
many there are), the workflow would hang waiting for the array to close.

Mihael and I talked about a closing statement, but it is unclear when it
should be executed, since the order in which the a[i] are closed is not
deterministic in parallel execution.

Yong.



From benc at hawaga.org.uk  Fri Jun 15 15:45:18 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 15 Jun 2007 20:45:18 +0000 (GMT)
Subject: [Swift-devel] CPU usage with provider-deef
In-Reply-To: <FF1DF2ED-B682-4BEC-98F0-B6A490D35FB3@mcs.anl.gov>
References: <Pine.LNX.4.64.0706151634060.10634@dildano.hawaga.org.uk>
	<FF1DF2ED-B682-4BEC-98F0-B6A490D35FB3@mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0706152038370.10634@dildano.hawaga.org.uk>


ok.

So also Nika and Ioan had a problem where a workflow left running 
overnight ended up not completing - I think Falkon thinks it completed all 
of its jobs in a timely fashion, but it seemed to take a (linearly 
increasing) amount of time for each job notification to be sent, and on 
the swift/provider-deef side of things, a large amount of CPU use and not 
much else seemed to be happening.

That may or may not be related to the high CPU load below, but it's 
probably worth investigating as it appears to break up large runs.

On Fri, 15 Jun 2007, Veronika Nefedova wrote:

> Ben,
> 
> we tested both my workflow and a simple "sleep" workflow. Both tests produced
> 100% CPU usage when run with Falkon. When submitted directly to GRAM from
> swift - only a fraction of 1% CPU was used. I know that Yong did some
> additional testing, but I do not know the results.
> 
> Nika
> 
> On Jun 15, 2007, at 11:48 AM, Ben Clifford wrote:
> 
> > 
> > Yesterday, I was playing a bit with Ioan and Nika - they submitted a 68
> > node / 15 minute workflow through provider-deef & falkon and saw the swift
> > JVM on the submit node using about 100% CPU; then the same workflow
> > running through the GT2 GRAM provider rather than provider-deef and falkon
> > appeared to use significantly less.
> > 
> > I wandered off at that point so don't know if any interesting results came
> > after.
> > 


From yongzh at cs.uchicago.edu  Fri Jun 15 15:46:21 2007
From: yongzh at cs.uchicago.edu (Yong Zhao)
Date: Fri, 15 Jun 2007 15:46:21 -0500 (CDT)
Subject: [Swift-devel] on the semantics of 'array closing'
In-Reply-To: <Pine.LNX.4.58.0706151535310.8107@classes.cs.uchicago.edu>
References: <Pine.LNX.4.64.0706151924490.10634@dildano.hawaga.org.uk>
	<4672F5E3.7060205@mcs.anl.gov>
	<Pine.LNX.4.58.0706151535310.8107@classes.cs.uchicago.edu>
Message-ID: <Pine.LNX.4.58.0706151543270.8107@classes.cs.uchicago.edu>

P.S.

We cannot put a closing statement for 'a' right before
	b = s(a);
because all statements are evaluated in parallel, and b needs to wait for a
to close before it can continue. If we did put it there, b would proceed
without waiting for a to be generated.

Yong.



From benc at hawaga.org.uk  Fri Jun 15 15:55:54 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 15 Jun 2007 20:55:54 +0000 (GMT)
Subject: [Swift-devel] on the semantics of 'array closing'
In-Reply-To: <4672F5E3.7060205@mcs.anl.gov>
References: <Pine.LNX.4.64.0706151924490.10634@dildano.hawaga.org.uk>
	<4672F5E3.7060205@mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0706152050270.10634@dildano.hawaga.org.uk>


There's a different approach, which is to say that 'a' is a variable and 
can be assigned to once. Thus assignment syntax like a[0]=something 
becomes illegal and we need more functional language constructs. So 
instead of writing:

for e,i in input_array {
  output_array[i] = p(e);
}

we would write:

output_array = foreach i in input_array {
  return p(i);
}

(it's a Haskell map in different syntax!)

That means that, at the language level, output_array is now properly 
single assignment.
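
The same contrast can be sketched in Python (used here only for
illustration; p is a placeholder procedure and the variable names are made
up). In the first form the array exists before it is complete and nothing
marks the point at which it is finished; in the second the whole array is
the value of one expression and is bound exactly once.

```python
def p(e):
    # Placeholder for the procedure applied to each element.
    return e * 2


input_array = [1, 2, 3]

# Element-wise style: output_array is visible while still incomplete,
# and no statement says "all elements have now been assigned".
output_array = {}
for i, e in enumerate(input_array):
    output_array[i] = p(e)

# Map style: one expression yields the complete array, bound once.
mapped_array = tuple(map(p, input_array))

print(mapped_array)
```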




From hategan at mcs.anl.gov  Sat Jun 16 03:58:26 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sat, 16 Jun 2007 11:58:26 +0300
Subject: [Swift-devel] CPU usage with provider-deef
In-Reply-To: <Pine.LNX.4.64.0706151634060.10634@dildano.hawaga.org.uk>
References: <Pine.LNX.4.64.0706151634060.10634@dildano.hawaga.org.uk>
Message-ID: <1181984306.10455.3.camel@blabla.mcs.anl.gov>

That can either be good or bad. If the CPU is used doing meaningful
stuff, then it's good. In other words, I'm guessing that the job
throughput is also higher with Falkon.

Mihael

On Fri, 2007-06-15 at 16:48 +0000, Ben Clifford wrote:
> Yesterday, I was playing a bit with Ioan and Nika - they submitted a 68 
> node / 15 minute workflow through provider-deef & falkon and saw the swift 
> JVM on the submit node using about 100% CPU; then the same workflow 
> running through the GT2 GRAM provider rather than provider-deef and falkon 
> appeared to use significantly less.
> 
> I wandered off at that point so don't know if any interesting results came 
> after.
> 


From hategan at mcs.anl.gov  Sat Jun 16 04:04:35 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sat, 16 Jun 2007 12:04:35 +0300
Subject: [Swift-devel] on the semantics of 'array closing'
In-Reply-To: <Pine.LNX.4.64.0706151924490.10634@dildano.hawaga.org.uk>
References: <Pine.LNX.4.64.0706151924490.10634@dildano.hawaga.org.uk>
Message-ID: <1181984676.10455.8.camel@blabla.mcs.anl.gov>

On Fri, 2007-06-15 at 20:13 +0000, Ben Clifford wrote:

> [...]
> But we also support a different assignment syntax for arrays, that looks 
> like this:
> 
>   a[0] = p()
>   a[1] = q()
>   b = s(a)
> 
> This fails at the moment (specifically, I think the execution engine will 
> hang).

Somewhat. The invocation is ok. What happens is that if you iterate over
a with foreach, two iterations will be started, but foreach will keep
waiting to see whether more items appear in the array. Think of arrays as
streams of (k, v) pairs and a size. If the size is unknown, foreach
cannot stop.
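
A toy Python model of that picture, offered only as a sketch: the array is a
queue of (k, v) pairs, and a CLOSE marker stands in for "the size is now
known". The names CLOSE, foreach and run are invented for the example, and
the timeout replaces the real hang so that the script terminates.

```python
import queue

CLOSE = object()  # hypothetical marker: no more elements will arrive


def foreach(stream, body, timeout=0.5):
    """Apply body to each (key, value) pair until the stream is closed.

    Returns True if the close marker was seen, False if the wait timed out
    (the timeout stands in for a hang so that the example terminates).
    """
    while True:
        try:
            item = stream.get(timeout=timeout)
        except queue.Empty:
            return False
        if item is CLOSE:
            return True
        key, value = item
        body(key, value)


def run(close_array):
    stream = queue.Queue()
    seen = {}
    stream.put((0, "p-output"))   # a[0] = p()
    stream.put((1, "q-output"))   # a[1] = q()
    if close_array:
        stream.put(CLOSE)
    finished = foreach(stream, lambda k, v: seen.update({k: v}))
    return finished, seen


print(run(close_array=False))  # two iterations run, then foreach gives up waiting
print(run(close_array=True))   # two iterations run and foreach completes
```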

> 
> Why? Because there is no one point at which we assign a value to 'a' - the 
> assignment is split over multiple statements, which can be in various 
> places (and inside loops etc).
> 
> There is nothing in the implementation that detects that a has been 
> assigned its value.
> 
> So there is this notion in the karajan intermediate code of 'closing an 
> array'.  This is an assertion made in the object code that all assignments 
> to pieces of an array have been made - that, in effect, the array has its 
> value.
> 
> The suggested hack/workaround for this is to move the array element 
> assignments into a procedure:
> 
>  (file f[]) z() {
>    f[0] = p();
>    f[1] = q();
>  }
> 
>  a = z()
>  b = s(a)
> 
> This works. (which is sort-of a violation of referential transparency)
> 
> It works because Swift implicitly marks arrays returned from compound 
> procedures as closed (which may or may not be correct).

We defined it as correct. Something created in one scope cannot be
modified in a parent scope.

Mihael

> 
> So in most variable scopes, arrays behave like single-assignment 
> variables, but each array can have one specific scope in which members can 
> be assigned to. In that scope, the array cannot be treated as a whole 
> variable.
> 
> In the z() example above, that special scope is the body of z(). In the 
> previous example, that scope is the global scope, and the program is 
> invalid by the rule above that the array cannot be referred to as a whole 
> in the same place that its members are individually assigned to.
> 
> That's my explanation of what's going on now. I think it matches reality. 
> I don't like that this is reality, but it is what we have.
> 
> Comments appreciated.
> 


From hategan at mcs.anl.gov  Sat Jun 16 04:12:36 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sat, 16 Jun 2007 12:12:36 +0300
Subject: [Swift-devel] on the semantics of 'array closing'
In-Reply-To: <4672F5E3.7060205@mcs.anl.gov>
References: <Pine.LNX.4.64.0706151924490.10634@dildano.hawaga.org.uk>
	<4672F5E3.7060205@mcs.anl.gov>
Message-ID: <1181985156.10455.17.camel@blabla.mcs.anl.gov>

On Fri, 2007-06-15 at 15:26 -0500, Ian Foster wrote:
> Hi,
> 
> For:
> 
>   a[0] = p()
>   a[1] = q()
>   b = s(a)
> 
> I think there are two distinct issues.
> 
> a) Determining the size of the array. This could presumably be done by 
> declaring it, e.g.:
> 
>   a[2] or some similar notion
>   a[0] = p()
>   a[1] = q()
>   b = s(a)
> 
> or by some "closing" concept.

Right!

> 
> b) Whether or not each element of an array is a separate 
> single-assignment variable.

They are. And it should, provided that the a[2] declaration marks the
array as "closed".

>  If they are, then the code above should work 
> just fine. If they are not, then we have a couple of behaviors we could 
> define. One would be that b=s(a) blocks until all elements in "a" are 
> defined. The other is that we have a way of "closing" (once again). In 
> that case, we have to define what happens if b=s(a) accesses an element 
> that is not defined.

IndexOutOfBoundsException.

Another thing we explored mentally was the possibility of doing a simple
analysis and grouping all assignments to an array. I'll use an example:

a[0] = 1;
b = c;
a[1] = 9;
d = f(5);
a[2] = 7;

This normally gets translated into (some initializations omitted and
function names changed for clarity):
parallel(
  setarray(a, 0, 1)
  alias(b, c)
  setarray(a, 1, 9)
  set(d, f(5))
  setarray(a, 2, 7)
)

The "proposed" solution would be to translate into:
parallel(
  alias(b, c)
  set(d, f(5))
  sequential(
    parallel(
      setarray(a, 0, 1)
      setarray(a, 1, 9)
      setarray(a, 2, 7)
    )
    closearray(a)
  )
)
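
The regrouping itself could look roughly like the following Python sketch.
It is not the actual compiler code: statements are modelled as plain tuples,
the function name group_array_assignments is invented, and it assumes every
assignment to an array appears in the same flat block (loops and nested
scopes are not handled).

```python
from collections import OrderedDict


def group_array_assignments(statements):
    """Rewrite a flat list of statements as in the proposal above.

    Statements are tuples such as ("setarray", "a", 0, 1) or
    ("alias", "b", "c"). The result is a ("parallel", [...]) tree in which
    all setarray statements for one array sit in their own parallel block,
    followed sequentially by a closearray for that array.
    """
    others = []
    per_array = OrderedDict()
    for stmt in statements:
        if stmt[0] == "setarray":
            per_array.setdefault(stmt[1], []).append(stmt)
        else:
            others.append(stmt)
    grouped = [
        ("sequential", [("parallel", sets), ("closearray", name)])
        for name, sets in per_array.items()
    ]
    return ("parallel", others + grouped)


program = [
    ("setarray", "a", 0, 1),
    ("alias", "b", "c"),
    ("setarray", "a", 1, 9),
    ("set", "d", ("f", 5)),
    ("setarray", "a", 2, 7),
]

tree = group_array_assignments(program)
print(tree)
```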

Mihael

> 
> Ian.
> 
> Ben Clifford wrote:
> > There is a problem that has been called the 'array closing problem'.
> >
> > It manifests itself in the tutorial in that certain bits of code that 
> > intuitively can either in a procedure or in the top level can, in 
> > practice, only go in to a procedure.
> >
> > In that context, I tried to think about better ways to explain/document 
> > the behaviour than "mumble mumble move that code into a procedure".
> >
> > In Swift we claim to have 'single assignment variables'.
> >
> > >From single assignment variables we get our grid job ordering:
> >
> >   a = p()
> >   b = s(a)
> >
> > causes first grid job p to run, and when that has completed, then grid job 
> > s will run.
> >
> > This is the same as if we had written:
> >
> >   b = s(a)
> >   a = p()
> >
> > The ordering comes from the use of a as an 'output' for p and an 'input' 
> > for s, not from source text ordering.
> >
> > In that model, its meaningless to assign two different things ta a, like 
> > this:
> >
> >   a = p()
> >   b = s(a)
> >   a = t()
> >
> >
> > Note that I've omitted the data types from the above. This works in the 
> > implementation for simple types such as a datafile marker type.
> >
> > What is important is that each variable is either unassigned or has its 
> > single value - whenever we refer to that variable, we can either use the 
> > value it has, or defer evaluation of that expression until the variable 
> > has its value.
> >
> > Now consider arrays. In the present syntax, arrays can be passed as 
> > single (complex) values to/from procedures, like before:
> >
> >   a = p()
> >   b = s(a)
> >
> > Here a and b are array types.
> >
> > That's fine. a is assigned to by the first statement, and b is assigned to 
> > by the second statement.
> >
> > But we also support a different assignment syntax for arrays, that looks 
> > like this:
> >
> >   a[0] = p()
> >   a[1] = q()
> >   b = s(a)
> >
> > This fails at the moment (specifically, I think the execution engine will 
> > hang).
> >
> > Why? Because the is no one point at which we assign a value to 'a' - the 
> > assignment is split over multiple statements, which can be in various 
> > places (and inside loops etc).
> >
> > There is nothing in the implementation that detects that a has been 
> > assigned its value.
> >
> > So there is this notion in the karajan intermediate code of 'closing an 
> > array'.  This is an assertion made in the object code that all assignments 
> > to pieces of an array have been made - that, in affect, the array has its 
> > value.
> >
> > The suggested hack/workaround for this is to move the array element 
> > assignments into a procedure:
> >
> >  (file f[]) z() {
> >    f[0] = p();
> >    f[1] = q();
> >  }
> >
> >  a = z()
> >  b = s(a)
> >
> > This works. (which is sort-of a violation of referential transparency)
> >
> > It works because Swift implicitly marks arrays returned from compound 
> > procedures as closed (which may or may not be correct).
> >
> > So in most variable scopes, arrays behave like single-assignment 
> > variables, but each array can have one specific scope in which members can 
> > be assigned to. In that scope, the array cannot be treated as a whole 
> > variable.
> >
> > In the z() example above, that special scope is the body of z(). In the 
> > previous example, that scope is the global scope, and the program is 
> > invalid by the rule above that the array cannot be referred to as a whole 
> > in the same place that its members are individually assigned to.
> >
> > That's my explanation of what's going on now. I think it matches reality. 
> > I don't like that this is reality, but it is what we have.
> >
> > Comments appreciated.
> >
> >   
> 


From hategan at mcs.anl.gov  Sat Jun 16 04:21:14 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sat, 16 Jun 2007 12:21:14 +0300
Subject: [Swift-devel] on the semantics of 'array closing'
In-Reply-To: <Pine.LNX.4.64.0706152050270.10634@dildano.hawaga.org.uk>
References: <Pine.LNX.4.64.0706151924490.10634@dildano.hawaga.org.uk>
	<4672F5E3.7060205@mcs.anl.gov>
	<Pine.LNX.4.64.0706152050270.10634@dildano.hawaga.org.uk>
Message-ID: <1181985674.10455.23.camel@blabla.mcs.anl.gov>

On Fri, 2007-06-15 at 20:55 +0000, Ben Clifford wrote:
> There's a different approach, which is to say that 'a' is a variable and 
> can be assigned to once. Thus assignment syntax like a[0]=something 
> becomes illegal and we need more functional language constructs.

so is the sequence:
a = 1;
a = 2;

I think this cannot be completely avoided, functional language
constructs or not.

>  So 
> instead of writing:
> 
> for e,i in input_array {
>   output_array[i] = p(e);
> }
> 
> we would write:
> 
> output_array = foreach i in input_array {
>   return p(i);
> }
> 
> (it's a Haskell map in different syntax!)

However, even python features list comprehensions:
output_array = [p(i) for i in input_array]

so we could have both. Karajan already supports streams of this kind:
output_array = stream(parallelFor(i, input_array, p(i))) (give or take
some filters).
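
To make the contrast concrete, here is a minimal Python sketch of the two styles discussed above. It is only an illustration: p() is a hypothetical stand-in for the grid job of the same name, and plain Python lists do not enforce single assignment the way Swift variables are meant to.

```python
def p(x):
    # Hypothetical stand-in for the grid job p() from the thread.
    return x * 2

input_array = [1, 2, 3]

# Piecewise style: the array is filled one element at a time, so there is
# no single statement at which the array as a whole receives its value.
output_piecewise = [None] * len(input_array)
for i, e in enumerate(input_array):
    output_piecewise[i] = p(e)

# Whole-array style: one expression returns the complete array, so the
# variable is assigned exactly once.
output_whole = [p(e) for e in input_array]   # list comprehension
output_mapped = list(map(p, input_array))    # the "map" spelling
```

All three names end up holding the same list; the difference is only in whether the language can see one point at which the array gets its value.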

Mihael

> 
> That means that, at the language level, output_array is now properly 
> single assignment.
> 
> 
> On Fri, 15 Jun 2007, Ian Foster wrote:
> 
> > Hi,
> > 
> > For:
> > 
> >  a[0] = p()
> >  a[1] = q()
> >  b = s(a)
> > 
> > I think there are two distinct issues.
> > 
> > a) Determining the size of the array. This could presumably be done by
> > declaring it, e.g.:
> > 
> >  a[2] or some similar notion
> >  a[0] = p()
> >  a[1] = q()
> >  b = s(a)
> > 
> > or by some "closing" concept.
> > 
> > b) Whether or not each element of an array is a separate single-assignment
> > variable. If they are, then the code above should work just fine. If they are
> > not, then we have a couple of behaviors we could define. One would be that
> > b=s(a) blocks until all elements in "a" are defined. The other is that we have
> > a way of "closing" (once again). In that case, we have to define what happens
> > if b=s(a) accesses an element that is not defined.
> > 
> > Ian.
> > 
> > Ben Clifford wrote:
> > > There is a problem that has been called the 'array closing problem'.
> > > 
> > > It manifests itself in the tutorial in that certain bits of code that
> > > intuitively can either in a procedure or in the top level can, in practice,
> > > only go in to a procedure.
> > > 
> > > In that context, I tried to think about better ways to explain/document the
> > > behaviour than "mumble mumble move that code into a procedure".
> > > 
> > > In Swift we claim to have 'single assignment variables'.
> > > 
> > > >From single assignment variables we get our grid job ordering:
> > > 
> > >   a = p()
> > >   b = s(a)
> > > 
> > > causes first grid job p to run, and when that has completed, then grid job s
> > > will run.
> > > 
> > > This is the same as if we had written:
> > > 
> > >   b = s(a)
> > >   a = p()
> > > 
> > > The ordering comes from the use of a as an 'output' for p and an 'input' for
> > > s, not from source text ordering.
> > > 
> > > > In that model, it's meaningless to assign two different things to a, like
> > > this:
> > > 
> > >   a = p()
> > >   b = s(a)
> > >   a = t()
> > > 
> > > 
> > > Note that I've omitted the data types from the above. This works in the
> > > implementation for simple types such as a datafile marker type.
> > > 
> > > What is important is that each variable is either unassigned or has its
> > > single value - whenever we refer to that variable, we can either use the
> > > value it has, or defer evaluation of that expression until the variable has
> > > its value.
> > > 
> > > Now consider arrays. In the present syntax, arrays can be passed as single
> > > (complex) values to/from procedures, like before:
> > > 
> > >   a = p()
> > >   b = s(a)
> > > 
> > > Here a and b are array types.
> > > 
> > > That's fine. a is assigned to by the first statement, and b is assigned to
> > > by the second statement.
> > > 
> > > But we also support a different assignment syntax for arrays, that looks
> > > like this:
> > > 
> > >   a[0] = p()
> > >   a[1] = q()
> > >   b = s(a)
> > > 
> > > This fails at the moment (specifically, I think the execution engine will
> > > hang).
> > > 
> > > > Why? Because there is no one point at which we assign a value to 'a' - the
> > > assignment is split over multiple statements, which can be in various places
> > > (and inside loops etc).
> > > 
> > > There is nothing in the implementation that detects that a has been assigned
> > > its value.
> > > 
> > > So there is this notion in the karajan intermediate code of 'closing an
> > > array'.  This is an assertion made in the object code that all assignments
> > > > to pieces of an array have been made - that, in effect, the array has its
> > > value.
> > > 
> > > The suggested hack/workaround for this is to move the array element
> > > assignments into a procedure:
> > > 
> > >  (file f[]) z() {
> > >    f[0] = p();
> > > >    f[1] = q();
> > >  }
> > > 
> > >  a = z()
> > >  b = s(a)
> > > 
> > > This works. (which is sort-of a violation of referential transparency)
> > > 
> > > It works because Swift implicitly marks arrays returned from compound
> > > procedures as closed (which may or may not be correct).
> > > 
> > > So in most variable scopes, arrays behave like single-assignment variables,
> > > but each array can have one specific scope in which members can be assigned
> > > to. In that scope, the array cannot be treated as a whole variable.
> > > 
> > > In the z() example above, that special scope is the body of z(). In the
> > > previous example, that scope is the global scope, and the program is invalid
> > > by the rule above that the array cannot be referred to as a whole in the
> > > same place that its members are individually assigned to.
> > > 
> > > That's my explanation of what's going on now. I think it matches reality. I
> > > don't like that this is reality, but it is what we have.
> > > 
> > > Comments appreciated.
> > > 
> > >   
> > 
> > 
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> 


From hategan at mcs.anl.gov  Sat Jun 16 04:38:42 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sat, 16 Jun 2007 12:38:42 +0300
Subject: [Swift-devel] on the semantics of 'array closing'
In-Reply-To: <1181985674.10455.23.camel@blabla.mcs.anl.gov>
References: <Pine.LNX.4.64.0706151924490.10634@dildano.hawaga.org.uk>
	<4672F5E3.7060205@mcs.anl.gov>
	<Pine.LNX.4.64.0706152050270.10634@dildano.hawaga.org.uk>
	<1181985674.10455.23.camel@blabla.mcs.anl.gov>
Message-ID: <1181986722.10744.2.camel@blabla.mcs.anl.gov>


> However, even python features list comprehensions...

Python is a fine language. The above should read "However, even some
imperative languages, such as Python, feature list comprehensions...".


From benc at hawaga.org.uk  Sat Jun 16 08:00:21 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sat, 16 Jun 2007 13:00:21 +0000 (GMT)
Subject: [Swift-devel] CPU usage with provider-deef
In-Reply-To: <1181984306.10455.3.camel@blabla.mcs.anl.gov>
References: <Pine.LNX.4.64.0706151634060.10634@dildano.hawaga.org.uk>
	<1181984306.10455.3.camel@blabla.mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0706161258280.10634@dildano.hawaga.org.uk>


It was running something like 68 jobs in 15 minutes. Kinda scary if each 
of those jobs needs 15 cpu.seconds on the submit side.

On Sat, 16 Jun 2007, Mihael Hategan wrote:

> That can either be good or bad. If the CPU is used doing meaningful
> stuff, then it's good. In other words, I'm guessing that the job
> throughput is also higher with Falkon.
> 
> Mihael
> 
> On Fri, 2007-06-15 at 16:48 +0000, Ben Clifford wrote:
> > Yesterday, I was playing a bit with Ioan and Nika - they submitted a 68 
> > node / 15 minute workflow through provider-deef & falkon and saw the swift 
> > JVM on the submit node using about 100% CPU; then the same workflow 
> > running through the GT2 GRAM provider rather than provider-deef and falkon 
> > appeared to use significantly less.
> > 
> > I wandered off at that point so don't know if any interesting results came 
> > after.
> > 
> 
> 


From benc at hawaga.org.uk  Sat Jun 16 08:10:28 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sat, 16 Jun 2007 13:10:28 +0000 (GMT)
Subject: [Swift-devel] on the semantics of 'array closing'
In-Reply-To: <1181985674.10455.23.camel@blabla.mcs.anl.gov>
References: <Pine.LNX.4.64.0706151924490.10634@dildano.hawaga.org.uk> 
	<4672F5E3.7060205@mcs.anl.gov>
	<Pine.LNX.4.64.0706152050270.10634@dildano.hawaga.org.uk>
	<1181985674.10455.23.camel@blabla.mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0706161301400.10634@dildano.hawaga.org.uk>


On Sat, 16 Jun 2007, Mihael Hategan wrote:

> > Thus assignment syntax like a[0]=something 
> > becomes illegal and we need more functional language constructs.
> 
> so is the sequence:
> a = 1;
> a = 2;

There's a bug about that open too (actually two, but I closed one of 
them).

> > (it's a Haskell map in different syntax!)
> 
> However, even python features list comprehensions:
> output_array = [p(i) for i in input_array]

right. Any of the constructs that have the 'expression that returns a 
whole array' property would be ok.

-- 


From benc at hawaga.org.uk  Sat Jun 16 08:34:10 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sat, 16 Jun 2007 13:34:10 +0000 (GMT)
Subject: [Swift-devel] on the semantics of 'array closing'
In-Reply-To: <1181984676.10455.8.camel@blabla.mcs.anl.gov>
References: <Pine.LNX.4.64.0706151924490.10634@dildano.hawaga.org.uk>
	<1181984676.10455.8.camel@blabla.mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0706161331420.10634@dildano.hawaga.org.uk>


On Sat, 16 Jun 2007, Mihael Hategan wrote:

> > It works because Swift implicitly marks arrays returned from compound 
> > procedures as closed (which may or may not be correct).
> 
> We defined it as correct. Something created in one scope cannot be
> modified in a parent scope.

That's fine - what was unintuitive to me was that something created in one 
scope cannot be referred to in that same scope. i.e. you can create the 
array a piecewise using a[...]=... but cannot then refer to a as a whole.
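
A toy Python model may help show why the whole-array reference hangs. This is a sketch under assumptions, not the Swift/Karajan implementation: the ClosableArray class and its method names are invented here, and the "implicit close on return" is modelled as an explicit close() call at the end of z().

```python
import threading

class ClosableArray:
    """Toy model of an array that is assigned piecewise and then closed."""

    def __init__(self):
        self._items = {}
        self._closed = threading.Event()

    def assign(self, index, value):
        # Each element is single-assignment, and nothing may change after close.
        if self._closed.is_set():
            raise RuntimeError("array is closed")
        if index in self._items:
            raise RuntimeError(f"element {index} already assigned")
        self._items[index] = value

    def close(self):
        # Assertion that all element assignments have been made.
        self._closed.set()

    def value(self, timeout=None):
        # Whole-array read: waits until the array has been closed.
        if not self._closed.wait(timeout):
            raise TimeoutError("array was never closed")
        return [self._items[k] for k in sorted(self._items)]


def z():
    # Models the compound procedure: the array is closed on return.
    f = ClosableArray()
    f.assign(0, "p-output")
    f.assign(1, "q-output")
    f.close()
    return f

a = z()
closed_result = a.value(timeout=1.0)

# Models the top-level case: elements are assigned, but nothing closes the
# array, so a whole-array read would wait forever (here it times out).
g = ClosableArray()
g.assign(0, "p-output")
g.assign(1, "q-output")
try:
    g.value(timeout=0.1)
    hung = False
except TimeoutError:
    hung = True
```

In this model the array returned from z() can be read as a whole, while the one built piecewise in the same scope cannot, which is the asymmetry described above.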

-- 


From iraicu at cs.uchicago.edu  Sat Jun 16 09:17:13 2007
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Sat, 16 Jun 2007 09:17:13 -0500
Subject: [Swift-devel] CPU usage with provider-deef
In-Reply-To: <Pine.LNX.4.64.0706161258280.10634@dildano.hawaga.org.uk>
References: <Pine.LNX.4.64.0706151634060.10634@dildano.hawaga.org.uk>
	<1181984306.10455.3.camel@blabla.mcs.anl.gov>
	<Pine.LNX.4.64.0706161258280.10634@dildano.hawaga.org.uk>
Message-ID: <4673F0E9.3060505@cs.uchicago.edu>

Actually, it was a bug in the Falkon provider: there was a tight polling 
loop on a task queue even when it was empty... it got fixed with one line 
of code :)  The CPU is now relatively idle for Nika's workflow, 
which doesn't require high throughput.
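
For readers unfamiliar with this class of bug, the following Python sketch contrasts a tight polling loop with a blocking wait on an empty queue. It is a generic illustration only: the actual Falkon provider is Java code and its one-line fix is not shown in this thread, so the function names and the 0.05 s wait below are assumptions.

```python
import queue
import threading

def busy_poll(tasks, stop):
    """Tight polling: spins on an empty queue and keeps one CPU busy."""
    spins = 0
    while not stop.is_set():
        try:
            tasks.get_nowait()
        except queue.Empty:
            spins += 1          # nothing to do, but loop again immediately
    return spins

def blocking_poll(tasks, stop, wait=0.05):
    """Blocking wait: the thread sleeps inside get() until work arrives."""
    wakeups = 0
    while not stop.is_set():
        try:
            tasks.get(timeout=wait)
        except queue.Empty:
            wakeups += 1        # woke up only because the wait expired
    return wakeups

def run_for(worker, seconds=0.2):
    # Run a worker against an always-empty queue for a fixed interval.
    tasks, stop = queue.Queue(), threading.Event()
    timer = threading.Timer(seconds, stop.set)
    timer.start()
    count = worker(tasks, stop)
    timer.join()
    return count

busy_iterations = run_for(busy_poll)
blocking_iterations = run_for(blocking_poll)
```

Over the same interval the busy loop iterates far more often than the blocking one, which is what shows up as a pegged CPU on the submit host.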

Thanks Yong for fixing it!

Ioan


-- 
============================================
Ioan Raicu
Ph.D. Student
============================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
============================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
       http://dsl.cs.uchicago.edu/
============================================
============================================


From benc at hawaga.org.uk  Sat Jun 16 09:22:26 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sat, 16 Jun 2007 14:22:26 +0000 (GMT)
Subject: [Swift-devel] CPU usage with provider-deef
In-Reply-To: <4673F0E9.3060505@cs.uchicago.edu>
References: <Pine.LNX.4.64.0706151634060.10634@dildano.hawaga.org.uk>
	<1181984306.10455.3.camel@blabla.mcs.anl.gov>
	<Pine.LNX.4.64.0706161258280.10634@dildano.hawaga.org.uk>
	<4673F0E9.3060505@cs.uchicago.edu>
Message-ID: <Pine.LNX.4.64.0706161418470.10634@dildano.hawaga.org.uk>


Cool.

Did you try the long long run again with that change in place?

Also what was the elapsed realtime for using GRAM2 vs Falkon for that 68 
node / ~15 minute workflow?



From benc at hawaga.org.uk  Sat Jun 16 09:27:13 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sat, 16 Jun 2007 14:27:13 +0000 (GMT)
Subject: [Swift-devel] nightly tests never finishing
Message-ID: <Pine.LNX.4.64.0706161425300.10634@dildano.hawaga.org.uk>


The nightly test pages recently don't seem to be complete - 13th, 14th and 
15th have all stopped around the array_iteration on grid section.

That also means that the download link for nightly builds is never 
provided.

-- 


From iraicu at cs.uchicago.edu  Sat Jun 16 09:35:52 2007
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Sat, 16 Jun 2007 09:35:52 -0500
Subject: [Swift-devel] CPU usage with provider-deef
In-Reply-To: <Pine.LNX.4.64.0706161418470.10634@dildano.hawaga.org.uk>
References: <Pine.LNX.4.64.0706151634060.10634@dildano.hawaga.org.uk>
	<1181984306.10455.3.camel@blabla.mcs.anl.gov>
	<Pine.LNX.4.64.0706161258280.10634@dildano.hawaga.org.uk>
	<4673F0E9.3060505@cs.uchicago.edu>
	<Pine.LNX.4.64.0706161418470.10634@dildano.hawaga.org.uk>
Message-ID: <4673F548.5070608@cs.uchicago.edu>

With Falkon, we had 34 machines with 68 processors, running one job on 
each processor.  I think it took about 20 min.  We then ran over GRAM, 
but there are only 60 IA64 nodes (120 processors) at ANL, so when the 68 
jobs got submitted, only 60 of them went into the run queue and 8 of them 
went into the wait queue. There were enough processors to run all the 
jobs at the same time, but I don't know how we were supposed to tweak 
Swift to have it dispatch two tasks per GRAM job and run both tasks in 
parallel on the node's two processors.  I believe the total time for the 
GRAM2 run was about 26 min.  The extra round of 8 jobs (which Falkon 
didn't have) took about 200 sec (3.3 min), so the rough improvement would 
probably have been around 1~2.6 min (5~10%).  That sounds about right with 
the 0.5~1 job/sec rate, at which 68 jobs would have taken 68~136 or so seconds.
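
The back-of-the-envelope estimate above can be reproduced as follows. The inputs are the approximate figures quoted in this message; the 0.5 job/sec lower bound is an assumption, read back from the "68~136 seconds" range because the rate itself is garbled in the archive.

```python
# Approximate figures from the message (not scientific measurements).
falkon_min = 20.0          # total run time with Falkon
gram2_min = 26.0           # total run time with GRAM2
extra_round_sec = 200.0    # extra round of 8 queued jobs under GRAM2

# Remove the extra round from the GRAM2 time before comparing.
extra_round_min = extra_round_sec / 60.0
adjusted_gram2_min = gram2_min - extra_round_min
gain_min = adjusted_gram2_min - falkon_min
gain_pct = 100.0 * gain_min / gram2_min

# Time to push 68 jobs through at 1 job/sec and at the assumed 0.5 job/sec.
jobs = 68
submit_sec_fast = jobs / 1.0
submit_sec_slow = jobs / 0.5
```

With these inputs the gain comes out at roughly 2.7 min (about 10% of the GRAM2 run), the upper end of the range given above, and the submission time falls between 68 and 136 seconds.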

The comparison wasn't done scientifically, so don't quote the numbers 
exactly, but Falkon was a bit faster.  In Nika's workflow case, where 
high throughput isn't essential, the big gains from using Falkon are the 
scalability of the Falkon wait queue and the resource provisioning: 
once you get some resources, you use them over and over to avoid the LRM 
queue wait time for each job.

BTW, we ran a 20 molecule short run yesterday successfully, but we are 
still having problems with the 100 molecule run in MolDyn.  It's not 
clear where the problem is; on the surface Falkon looks fine... we are 
looking into where everything breaks and stops Swift from continuing 
the workflow to completion!

Ioan




From benc at hawaga.org.uk  Sat Jun 16 09:41:38 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sat, 16 Jun 2007 14:41:38 +0000 (GMT)
Subject: [Swift-devel] CPU usage with provider-deef
In-Reply-To: <4673F548.5070608@cs.uchicago.edu>
References: <Pine.LNX.4.64.0706151634060.10634@dildano.hawaga.org.uk>
	<1181984306.10455.3.camel@blabla.mcs.anl.gov>
	<Pine.LNX.4.64.0706161258280.10634@dildano.hawaga.org.uk>
	<4673F0E9.3060505@cs.uchicago.edu>
	<Pine.LNX.4.64.0706161418470.10634@dildano.hawaga.org.uk>
	<4673F548.5070608@cs.uchicago.edu>
Message-ID: <Pine.LNX.4.64.0706161438550.10634@dildano.hawaga.org.uk>


On Sat, 16 Jun 2007, Ioan Raicu wrote:

> having problems with the 100 molecule run in MolDyn.  Its not clear where the
> problem is, on the surface Falkon looks fine... we are looking into where
> everything breaks to cause Swift to not continue with the workflow to
> completion!

The same problem that you showed me the other day or different?

with 'the same problem' being that falkon thinks all the jobs are done, 
but that falkon's measured response time for sending completion 
notifications gets approximately linearly longer over time and the swift 
JVM uses ~100% CPU and doesn't indicate job completion at all after a 
certain period.

or different symptoms now?

-- 


From iraicu at cs.uchicago.edu  Sat Jun 16 10:00:41 2007
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Sat, 16 Jun 2007 10:00:41 -0500
Subject: [Swift-devel] CPU usage with provider-deef
In-Reply-To: <Pine.LNX.4.64.0706161438550.10634@dildano.hawaga.org.uk>
References: <Pine.LNX.4.64.0706151634060.10634@dildano.hawaga.org.uk>
	<1181984306.10455.3.camel@blabla.mcs.anl.gov>
	<Pine.LNX.4.64.0706161258280.10634@dildano.hawaga.org.uk>
	<4673F0E9.3060505@cs.uchicago.edu>
	<Pine.LNX.4.64.0706161418470.10634@dildano.hawaga.org.uk>
	<4673F548.5070608@cs.uchicago.edu>
	<Pine.LNX.4.64.0706161438550.10634@dildano.hawaga.org.uk>
Message-ID: <4673FB19.6070305@cs.uchicago.edu>

Nope, I think this is a different problem, or at least a subset of the 
problems we were having before.

Since we fixed the CPU utilization, and we moved to a bigger box (4 CPUs 
with 2GB of memory), everything is happening in a timely fashion (a few 
ms per notification delivery throughout the experiment).  Plus, I 
believe the view is consistent (the same tasks look complete on both 
ends) between Falkon and Swift, but we are still checking on this as the 
run was made just last night for the 100 mol run.  We'll keep you posted 
with what we find.

Ioan




From hategan at mcs.anl.gov  Sat Jun 16 10:02:46 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sat, 16 Jun 2007 18:02:46 +0300
Subject: [Swift-devel] CPU usage with provider-deef
In-Reply-To: <4673FB19.6070305@cs.uchicago.edu>
References: <Pine.LNX.4.64.0706151634060.10634@dildano.hawaga.org.uk>
	<1181984306.10455.3.camel@blabla.mcs.anl.gov>
	<Pine.LNX.4.64.0706161258280.10634@dildano.hawaga.org.uk>
	<4673F0E9.3060505@cs.uchicago.edu>
	<Pine.LNX.4.64.0706161418470.10634@dildano.hawaga.org.uk>
	<4673F548.5070608@cs.uchicago.edu>
	<Pine.LNX.4.64.0706161438550.10634@dildano.hawaga.org.uk>
	<4673FB19.6070305@cs.uchicago.edu>
Message-ID: <1182006166.11495.1.camel@blabla.mcs.anl.gov>

Yourkit (www.yourkit.com) has free licenses for open source projects for
their profiler. Point them to a globus web page that has your name, and
they'll send you the license. Alternatively, there are other profilers
out there, and I strongly recommend using them on such issues.

Mihael



From wilde at mcs.anl.gov  Sat Jun 16 10:05:39 2007
From: wilde at mcs.anl.gov (Mike Wilde)
Date: Sat, 16 Jun 2007 10:05:39 -0500
Subject: [Swift-devel] on the semantics of 'array closing'
In-Reply-To: <Pine.LNX.4.64.0706161331420.10634@dildano.hawaga.org.uk>
References: <Pine.LNX.4.64.0706151924490.10634@dildano.hawaga.org.uk>	<1181984676.10455.8.camel@blabla.mcs.anl.gov>
	<Pine.LNX.4.64.0706161331420.10634@dildano.hawaga.org.uk>
Message-ID: <4673FC43.8040204@mcs.anl.gov>

Hi all,

I'm jumping in late; I re-read the thread a few times but may have 
missed something. So correct me as needed. Also, rather than 
spending more time polishing the thoughts below I just put them out 
here for discussion.

This discussion seems to me very important, as it can close down 
several of the major open issues that are critical to the 
language, both to give it complete and consistent semantics and to 
make it practical for the problems that we are applying it to.

Four important but missing aspects of this discussion are: 
pipelining, error handling, restart, and mapping.

I feel that swift needs the following semantics:

1. Pipelining:

The data dependency aspects of swift are carried out at the atomic 
level in a pipelined manner.

  -- elements of an array are written into the stream

  -- readers of the array consume the stream

  -- the entire program remains active in parallel, across function 
boundaries

Array elements [k,v] are identified by their index, k, which can be 
an int or string.

2. Error handling

In practice, many large-scale foreach() operations will never 
complete, yet they will deliver a lot of useful results that we want 
subsequent statements in a program to continue to operate on. Thus 
closing needs to permit different criteria other than just "finishing".

An array is "closed" when its producer function/foreach "shuts 
down".  Can we permit shutdown/closing to occur based on finishing, 
time, or quota/threshold?  These would be parameters of the foreach 
statements that could be overridden.

(For some practical examples, see map-reduce; it has similar 
problems: parallel computations reach a level where there is lots of 
parallelism, and as the run proceeds, it gets to a point where only the 
"stragglers" are left - things waiting in slow queues or for hung 
data transfers, etc.  I've read this in m/r papers, and found that 
our experiences match those reported by the Google m/r people).

3. Restart

We want computations to be restartable.  If 50% of a large 
array/dataset gets created in a 10-hour run, and then the run fails, we 
want the run to be restartable and to continue where it left off with 
minimal loss of "completed" results.

4. Mapping

Lastly, swift mapping should be connected to this whole process: the 
mapped contents of a dataset should be a stream of xml elements 
rather than a "completed" xml document, so that we can practically 
handle very large datasets.  So when a foreach() statement processes 
an array, it's processing the mapped stream of the array. Mappers 
should be parallel processes that produce and consume these streams 
of xml elements.
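
As a rough illustration of points 1 and 2, here is a small Python sketch of a consumer that reads array elements as a stream and treats the array as closed when the producer finishes, when a quota of elements has arrived, or when a time limit passes. Everything in it is hypothetical: produce(), consume(), and the quota/timeout parameters are invented for this sketch and are not Swift or Karajan features.

```python
import time

def produce(n):
    """Hypothetical producer: yields (index, value) pairs as they complete."""
    for k in range(n):
        yield k, f"result-{k}"

def consume(elements, quota=None, timeout=None):
    """Read elements as they arrive; return them with the reason for closing."""
    start = time.monotonic()
    seen = {}
    for k, v in elements:
        seen[k] = v   # downstream work on this element would start here
        if quota is not None and len(seen) >= quota:
            return seen, "quota"
        if timeout is not None and time.monotonic() - start >= timeout:
            return seen, "timeout"
    return seen, "finished"

# Producer runs to completion: the array closes because it finished.
all_results, why_all = consume(produce(5))

# Close early once 80 of 100 elements are in, leaving the stragglers behind.
partial, why_partial = consume(produce(100), quota=80)
```

The design question raised above is which of these closing reasons the language should expose, and whether a foreach can override the default.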

- Mike


Ben Clifford wrote, On 6/16/2007 8:34 AM:
> 
> On Sat, 16 Jun 2007, Mihael Hategan wrote:
> 
>>> It works because Swift implicitly marks arrays returned from compound 
>>> procedures as closed (which may or may not be correct).
>> We defined it as correct. Something created in one scope cannot be
>> modified in a parent scope.
> 
> That's fine - what was unintuitive to me was that something created in one 
> scope cannot be referred to in that same scope. i.e. you can create a 
> piecewise using a[...]=... but cannot then refer to a.
> 

-- 
Mike Wilde
Computation Institute, University of Chicago
Math & Computer Science Division
Argonne National Laboratory
Argonne, IL   60439    USA
tel 630-252-7497 fax 630-252-1997


From wilde at mcs.anl.gov  Sat Jun 16 10:50:25 2007
From: wilde at mcs.anl.gov (Mike Wilde)
Date: Sat, 16 Jun 2007 10:50:25 -0500
Subject: [Swift-devel] on the semantics of 'array closing'
In-Reply-To: <4673FC43.8040204@mcs.anl.gov>
References: <Pine.LNX.4.64.0706151924490.10634@dildano.hawaga.org.uk>	<1181984676.10455.8.camel@blabla.mcs.anl.gov>	<Pine.LNX.4.64.0706161331420.10634@dildano.hawaga.org.uk>
	<4673FC43.8040204@mcs.anl.gov>
Message-ID: <467406C1.2070608@mcs.anl.gov>

Also to note: Ian has suggested several times that we explore 
map-reduce.  I think this is worth doing: it's possible/likely that 
swift is already pretty close to m-r in many ways, and could benefit 
from a more detailed comparison and assessment of what we can 
borrow, adapt, and/or integrate.

We should use this as a chance to create a "swift library" page 
where we post good papers that we can cite in our discussions to get 
ourselves on a common page.

Some of these might be good material for Thu Grad seminar discussions 
as well.

- Mike


Mike Wilde wrote, On 6/16/2007 10:05 AM:
> Hi all,
> 
> I'm jumping in late; I re-read the thread a few times but may have 
> missed something. So correct me as needed. Also, rather than spending 
> more time polishing the thoughts below I just put them out here for 
> discussion.
> 
> This discussion seems to me very important, as it can close down several 
> of the major open issues that are very critical to the language, both to 
> give it complete and consistent semantics and to make it practical for 
> the problems that we are applying it to.
> 
> Four important but missing aspects of this discussion are: pipelining, 
> error handling, restart, and mapping.
> 
> I feel that swift needs the following semantics:
> 
> 1. Pipelining:
> 
> The data dependency aspects of swift are carried out at the atomic level 
> in a pipelined manner.
> 
>  -- elements of an array are written into the stream
> 
>  -- readers of the array consume the stream
> 
>  -- the entire program remains active in parallel, across function 
> boundaries
> 
> Array elements [k,v] are identified by their index, k, which can be an 
> int or string.
> 
> 2. Error handling
> 
> In practice, many large-scale foreach() operations will never complete, 
> yet they will deliver a lot of useful results that we want subsequent 
> statements in a program to continue to operate on. Thus closing needs to 
> permit different criteria other than just "finishing".
> 
> An array is "closed" when its producer function/foreach "shuts down".  
> Can we permit shutdown/closing to occur based on finishing, time, or 
> quota/threshold?  These would be parameters of the foreach statements 
> that could be overridden.
> 
> (For some practical examples, see map-reduce; it has similar problems: 
> parallel computations reach a level where there is lots of parallelism, 
> and as it proceeds, gets to a point where only the "stragglers" are 
> left - things waiting in slow queues or for hung data transfers, etc.  
> I've read this in m/r papers, and found that our experiences match those 
> reported by the Google m/r people).
> 
> 3. Restart
> 
> We want computations to be restartable.  If 50% of a large array/dataset 
> gets created in a 10-hour run, and then fails, we want the run to be 
> restartable and continue where it left off with minimal loss of 
> "completed" results.
> 
> 4. Mapping
> 
> Lastly, swift mapping should be connected to this whole process: the 
> mapped contents of a dataset should be a stream of xml elements rather 
> than a "completed" xml document, so that we can practically handle very 
> large datasets.  So when a foreach() statement processes an array, it's 
> processing the mapped stream of the array. Mappers should be parallel 
> processes that produce and consume these streams of xml elements.
> 
> - Mike
> 
> 
> 
> 
> Ben Clifford wrote, On 6/16/2007 8:34 AM:
>>
>> On Sat, 16 Jun 2007, Mihael Hategan wrote:
>>
>>>> It works because Swift implicitly marks arrays returned from 
>>>> compound procedures as closed (which may or may not be correct).
>>> We defined it as correct. Something created in one scope cannot be
>>> modified in a parent scope.
>>
>> That's fine - what was unintuitive to me was that something created in 
>> one scope cannot be referred to in that same scope. i.e. you can 
>> create a piecewise using a[...]=... but cannot then refer to a.
>>
> 

-- 
Mike Wilde
Computation Institute, University of Chicago
Math & Computer Science Division
Argonne National Laboratory
Argonne, IL   60439    USA
tel 630-252-7497 fax 630-252-1997


From foster at mcs.anl.gov  Sat Jun 16 10:58:13 2007
From: foster at mcs.anl.gov (Ian Foster)
Date: Sat, 16 Jun 2007 10:58:13 -0500
Subject: [Swift-devel] on the semantics of 'array closing'
In-Reply-To: <4673FC43.8040204@mcs.anl.gov>
References: <Pine.LNX.4.64.0706151924490.10634@dildano.hawaga.org.uk>	<1181984676.10455.8.camel@blabla.mcs.anl.gov>	<Pine.LNX.4.64.0706161331420.10634@dildano.hawaga.org.uk>
	<4673FC43.8040204@mcs.anl.gov>
Message-ID: <46740895.70708@mcs.anl.gov>

Mike:

That's a great summary of requirements.

Ian.

Mike Wilde wrote:
> Hi all,
>
> I'm jumping in late; I re-read the thread a few times but may have 
> missed something. So correct me as needed. Also, rather than spending 
> more time polishing the thoughts below I just put them out here for 
> discussion.
>
> This discussion seems to me very important, as it can close down 
> several of the major open issues that are very critical to the 
> language, both to give it complete and consistent semantics and to 
> make it practical for the problems that we are applying it to.
>
> Four important but missing aspects of this discussion are: pipelining, 
> error handling, restart, and mapping.
>
> I feel that swift needs the following semantics:
>
> 1. Pipelining:
>
> The data dependency aspects of swift are carried out at the atomic 
> level in a pipelined manner.
>
>  -- elements of an array are written into the stream
>
>  -- readers of the array consume the stream
>
>  -- the entire program remains active in parallel, across function 
> boundaries
>
> Array elements [k,v] are identified by their index, k, which can be an 
> int or string.
>
> 2. Error handling
>
> In practice, many large-scale foreach() operations will never 
> complete, yet they will deliver a lot of useful results that we want 
> subsequent statements in a program to continue to operate on. Thus 
> closing needs to permit different criteria other than just "finishing".
>
> An array is "closed" when its producer function/foreach "shuts down".  
> Can we permit shutdown/closing to occur based on finishing, time, or 
> quota/threshold?  These would be parameters of the foreach statements 
> that could be overridden.
>
> (For some practical examples, see map-reduce; it has similar problems: 
> parallel computations reach a level where there is lots of 
> parallelism, and as it proceeds, gets to a point where only the 
> "stragglers" are left - things waiting in slow queues or for hung data 
> transfers, etc.  I've read this in m/r papers, and found that our 
> experiences match those reported by the Google m/r people).
>
> 3. Restart
>
> We want computations to be restartable.  If 50% of a large 
> array/dataset gets created in a 10-hour run, and then fails, we want 
> the run to be restartable and continue where it left off with minimal 
> loss of "completed" results.
>
> 4. Mapping
>
> Lastly, swift mapping should be connected to this whole process: the 
> mapped contents of a dataset should be a stream of xml elements rather 
> than a "completed" xml document, so that we can practically handle 
> very large datasets.  So when a foreach() statement processes an 
> array, it's processing the mapped stream of the array. Mappers should 
> be parallel processes that produce and consume these streams of xml 
> elements.
>
> - Mike
>
>
>
>
> Ben Clifford wrote, On 6/16/2007 8:34 AM:
>>
>> On Sat, 16 Jun 2007, Mihael Hategan wrote:
>>
>>>> It works because Swift implicitly marks arrays returned from 
>>>> compound procedures as closed (which may or may not be correct).
>>> We defined it as correct. Something created in one scope cannot be
>>> modified in a parent scope.
>>
>> That's fine - what was unintuitive to me was that something created 
>> in one scope cannot be referred to in that same scope. i.e. you can 
>> create a piecewise using a[...]=... but cannot then refer to a.
>>
>

-- 

   Ian Foster, Director, Computation Institute
Argonne National Laboratory & University of Chicago
Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439
Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637
Tel: +1 630 252 4619.  Web: www.ci.uchicago.edu.
      Globus Alliance: www.globus.org.


From foster at mcs.anl.gov  Sat Jun 16 10:59:17 2007
From: foster at mcs.anl.gov (Ian Foster)
Date: Sat, 16 Jun 2007 10:59:17 -0500
Subject: [Swift-devel] CPU usage with provider-deef
In-Reply-To: <1182006166.11495.1.camel@blabla.mcs.anl.gov>
References: <Pine.LNX.4.64.0706151634060.10634@dildano.hawaga.org.uk>	
	<1181984306.10455.3.camel@blabla.mcs.anl.gov>	
	<Pine.LNX.4.64.0706161258280.10634@dildano.hawaga.org.uk>	
	<4673F0E9.3060505@cs.uchicago.edu>	
	<Pine.LNX.4.64.0706161418470.10634@dildano.hawaga.org.uk>	
	<4673F548.5070608@cs.uchicago.edu>	
	<Pine.LNX.4.64.0706161438550.10634@dildano.hawaga.org.uk>	
	<4673FB19.6070305@cs.uchicago.edu>
	<1182006166.11495.1.camel@blabla.mcs.anl.gov>
Message-ID: <467408D5.2020709@mcs.anl.gov>

It seems important that Ioan sit down with Mihael and work through the 
Falkon code to see where it can be simplified, improved, etc. I am sure 
that this will result in problems being identified and fixed that will 
otherwise cost us time later.

Mihael Hategan wrote:
> Yourkit (www.yourkit.com) has free licenses for open source projects for
> their profiler. Point them to a globus web page that has your name, and
> they'll send you the license. Alternatively, there are other profilers
> out there, and I strongly recommend using them on such issues.
>
> Mihael
>
> On Sat, 2007-06-16 at 10:00 -0500, Ioan Raicu wrote:
>   
>> Nope, I think this is a different problem, or at least a subset of the
>> problems we were having before.
>>
>> Since we fixed the CPU utilization, and we moved to a bigger box (4
>> CPUs with 2GB of memory), everything is happening in a timely fashion
>> (a few ms per notification delivery throughout the experiment).  Plus,
>> I believe the view is consistent (the same tasks look complete on both
>> ends) between Falkon and Swift, but we are still checking on this as
>> the run was made just last night for the 100 mol run.  We'll keep you
>> posted with what we find.
>>
>> Ioan
>>
>> Ben Clifford wrote: 
>>     
>>> On Sat, 16 Jun 2007, Ioan Raicu wrote:
>>>
>>>   
>>>       
>>>> having problems with the 100 molecule run in MolDyn.  It's not clear where the
>>>> problem is, on the surface Falkon looks fine... we are looking into where
>>>> everything breaks to cause Swift to not continue with the workflow to
>>>> completion!
>>>>     
>>>>         
>>> The same problem that you showed me the other day or different?
>>>
>>> with 'the same problem' being that falkon thinks all the jobs are done; 
>>> but that falkon's measured response time for sending completion 
>>> notifications gets approximately linearly longer over time and the swift 
>>> JVM uses ~100% CPU and doesn't indicate job completion at all after a certain 
>>> period.
>>>
>>> or different symptoms now?
>>>
>>>   
>>>       
>> -- 
>> ============================================
>> Ioan Raicu
>> Ph.D. Student
>> ============================================
>> Distributed Systems Laboratory
>> Computer Science Department
>> University of Chicago
>> 1100 E. 58th Street, Ryerson Hall
>> Chicago, IL 60637
>> ============================================
>> Email: iraicu at cs.uchicago.edu
>> Web:   http://www.cs.uchicago.edu/~iraicu
>>        http://dsl.cs.uchicago.edu/
>> ============================================
>> ============================================
>>     
>
>   

-- 

   Ian Foster, Director, Computation Institute
Argonne National Laboratory & University of Chicago
Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439
Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637
Tel: +1 630 252 4619.  Web: www.ci.uchicago.edu.
      Globus Alliance: www.globus.org.


From foster at mcs.anl.gov  Sat Jun 16 11:01:47 2007
From: foster at mcs.anl.gov (Ian Foster)
Date: Sat, 16 Jun 2007 11:01:47 -0500
Subject: [Swift-devel] on the semantics of 'array closing'
In-Reply-To: <Pine.LNX.4.64.0706152050270.10634@dildano.hawaga.org.uk>
References: <Pine.LNX.4.64.0706151924490.10634@dildano.hawaga.org.uk>
	<4672F5E3.7060205@mcs.anl.gov>
	<Pine.LNX.4.64.0706152050270.10634@dildano.hawaga.org.uk>
Message-ID: <4674096B.4020109@mcs.anl.gov>

I like the notion of having a "map" function. If that could entirely 
replace the current element assignments, that would be a wonderful 
simplification, it seems to me.
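
To make the comparison concrete, here is a small Python sketch of the 
two styles from the quoted message below. It is not Swift syntax: p 
is a hypothetical stand-in procedure and the input values are made up.

```python
def p(e):
    # Hypothetical stand-in for the procedure p in the quoted example.
    return e * 2


input_array = [1, 2, 3]

# Element-assignment style: the output is built up piece by piece, so
# there is no single statement at which it is known to be complete.
output_piecewise = {}
for i, e in enumerate(input_array):
    output_piecewise[i] = p(e)

# Map style: the whole array is the value of one expression, so it is
# assigned exactly once and needs no separate "closing" step.
output_array = list(map(p, input_array))
```

Both forms compute the same elements; only the second gives the array 
as a whole a single point of assignment.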

Ian.

Ben Clifford wrote:
> There's a different approach, which is to say that 'a' is a variable and 
> can be assigned to once. Thus assignment syntax like a[0]=something 
> becomes illegal and we need more functional language constructs. So 
> instead of writing:
>
> for e,i in input_array {
>   output_array[i] = p(e);
> }
>
> we would write:
>
> output_array = foreach i in input_array {
>   return p(i);
> }
>
> (it's a Haskell map in different syntax!)
>
> That means that, at the language level, output_array is now properly 
> single assignment.
>
>
> On Fri, 15 Jun 2007, Ian Foster wrote:
>
>   
>> Hi,
>>
>> For:
>>
>>  a[0] = p()
>>  a[1] = q()
>>  b = s(a)
>>
>> I think there are two distinct issues.
>>
>> a) Determining the size of the array. This could presumably be done by
>> declaring it, e.g.:
>>
>>  a[2] or some similar notion
>>  a[0] = p()
>>  a[1] = q()
>>  b = s(a)
>>
>> or by some "closing" concept.
>>
>> b) Whether or not each element of an array is a separate single-assignment
>> variable. If they are, then the code above should work just fine. If they are
>> not, then we have a couple of behaviors we could define. One would be that
>> b=s(a) blocks until all elements in "a" are defined. The other is that we have
>> a way of "closing" (once again). In that case, we have to define what happens
>> if b=s(a) accesses an element that is not defined.
>>
>> Ian.
>>
>> Ben Clifford wrote:
>>     
>>> There is a problem that has been called the 'array closing problem'.
>>>
>>> It manifests itself in the tutorial in that certain bits of code that
>>> intuitively can go either in a procedure or at the top level can, in practice,
>>> only go into a procedure.
>>>
>>> In that context, I tried to think about better ways to explain/document the
>>> behaviour than "mumble mumble move that code into a procedure".
>>>
>>> In Swift we claim to have 'single assignment variables'.
>>>
>>> From single assignment variables we get our grid job ordering:
>>>
>>>   a = p()
>>>   b = s(a)
>>>
>>> causes first grid job p to run, and when that has completed, then grid job s
>>> will run.
>>>
>>> This is the same as if we had written:
>>>
>>>   b = s(a)
>>>   a = p()
>>>
>>> The ordering comes from the use of a as an 'output' for p and an 'input' for
>>> s, not from source text ordering.
>>>
>>> In that model, it's meaningless to assign two different things to a, like
>>> this:
>>>
>>>   a = p()
>>>   b = s(a)
>>>   a = t()
>>>
>>>
>>> Note that I've omitted the data types from the above. This works in the
>>> implementation for simple types such as a datafile marker type.
>>>
>>> What is important is that each variable is either unassigned or has its
>>> single value - whenever we refer to that variable, we can either use the
>>> value it has, or defer evaluation of that expression until the variable has
>>> its value.
>>>
>>> Now consider arrays. In the present syntax, arrays can be passed as single
>>> (complex) values to/from procedures, like before:
>>>
>>>   a = p()
>>>   b = s(a)
>>>
>>> Here a and b are array types.
>>>
>>> That's fine. a is assigned to by the first statement, and b is assigned to
>>> by the second statement.
>>>
>>> But we also support a different assignment syntax for arrays, that looks
>>> like this:
>>>
>>>   a[0] = p()
>>>   a[1] = q()
>>>   b = s(a)
>>>
>>> This fails at the moment (specifically, I think the execution engine will
>>> hang).
>>>
>>> Why? Because there is no one point at which we assign a value to 'a' - the
>>> assignment is split over multiple statements, which can be in various places
>>> (and inside loops etc).
>>>
>>> There is nothing in the implementation that detects that a has been assigned
>>> its value.
>>>
>>> So there is this notion in the karajan intermediate code of 'closing an
>>> array'.  This is an assertion made in the object code that all assignments
>>> to pieces of an array have been made - that, in effect, the array has its
>>> value.
>>>
>>> The suggested hack/workaround for this is to move the array element
>>> assignments into a procedure:
>>>
>>>  (file f[]) z() {
>>>    f[0] = p();
>>>    f[1] = q();
>>>  }
>>>
>>>  a = z()
>>>  b = s(a)
>>>
>>> This works. (which is sort-of a violation of referential transparency)
>>>
>>> It works because Swift implicitly marks arrays returned from compound
>>> procedures as closed (which may or may not be correct).
>>>
>>> So in most variable scopes, arrays behave like single-assignment variables,
>>> but each array can have one specific scope in which members can be assigned
>>> to. In that scope, the array cannot be treated as a whole variable.
>>>
>>> In the z() example above, that special scope is the body of z(). In the
>>> previous example, that scope is the global scope, and the program is invalid
>>> by the rule above that the array cannot be referred to as a whole in the
>>> same place that its members are individually assigned to.
>>>
>>> That's my explanation of what's going on now. I think it matches reality. I
>>> don't like that this is reality, but it is what we have.
>>>
>>> Comments appreciated.
>>>
>>>   
>>>       
>>     
>
>   

-- 

   Ian Foster, Director, Computation Institute
Argonne National Laboratory & University of Chicago
Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439
Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637
Tel: +1 630 252 4619.  Web: www.ci.uchicago.edu.
      Globus Alliance: www.globus.org.


From wilde at mcs.anl.gov  Sat Jun 16 12:03:26 2007
From: wilde at mcs.anl.gov (Mike Wilde)
Date: Sat, 16 Jun 2007 12:03:26 -0500
Subject: [Swift-devel] CPU usage with provider-deef
In-Reply-To: <467408D5.2020709@mcs.anl.gov>
References: <Pine.LNX.4.64.0706151634060.10634@dildano.hawaga.org.uk>		<1181984306.10455.3.camel@blabla.mcs.anl.gov>		<Pine.LNX.4.64.0706161258280.10634@dildano.hawaga.org.uk>		<4673F0E9.3060505@cs.uchicago.edu>		<Pine.LNX.4.64.0706161418470.10634@dildano.hawaga.org.uk>		<4673F548.5070608@cs.uchicago.edu>		<Pine.LNX.4.64.0706161438550.10634@dildano.hawaga.org.uk>		<4673FB19.6070305@cs.uchicago.edu>	<1182006166.11495.1.camel@blabla.mcs.anl.gov>
	<467408D5.2020709@mcs.anl.gov>
Message-ID: <467417DE.50408@mcs.anl.gov>

This should be fun, and a nice break from the I2U2 work that you've 
been immersed in, Mihael.

Want to do a read-through soon, and send out comments for discussion 
that can turn into a list of code improvements to bugzilize?

What I think is important about Falkon is that it's working, it's 
proving out the value of the provisioned direct-scheduling approach 
with numbers, and that it's working for Ioan as a vehicle for his 
research.

What we want to get from the effort is a) Ioan progresses towards 
his PhD; b) the immediate needs of our app-users get met; and c) we 
learn what's needed in architecture, protocol and algorithm for a 
successful long-term approach to running swift programs efficiently.

Point is that everyone is open to changes and towards an eventual 
re-design and re-write. This, Mihael, would be where you can 
propose, design and implement the ideas you've expressed about 
implementing provisioned direct-scheduling using Karajan's remote 
execution mechanisms.

- Mike


Ian Foster wrote, On 6/16/2007 10:59 AM:
> It seems important that Ioan sit down with Mihael and work through the 
> Falkon code to see where it can be simplified, improved, etc. I am sure 
> that this will result in problems being identified and fixed that will 
> otherwise cost us time later.
> 
> Mihael Hategan wrote:
>> Yourkit (www.yourkit.com) has free licenses for open source projects for
>> their profiler. Point them to a globus web page that has your name, and
>> they'll send you the license. Alternatively, there are other profilers
>> out there, and I strongly recommend using them on such issues.
>>
>> Mihael
>>
>> On Sat, 2007-06-16 at 10:00 -0500, Ioan Raicu wrote:
>>   
>>> Nope, I think this is a different problem, or at least a subset of the
>>> problems we were having before.
>>>
>>> Since we fixed the CPU utilization, and we moved to a bigger box (4
>>> CPUs with 2GB of memory), everything is happening in a timely fashion
>>> (a few ms per notification delivery throughout the experiment).  Plus,
>>> I believe the view is consistent (the same tasks look complete on both
>>> ends) between Falkon and Swift, but we are still checking on this as
>>> the run was made just last night for the 100 mol run.  We'll keep you
>>> posted with what we find.
>>>
>>> Ioan
>>>
>>> Ben Clifford wrote: 
>>>     
>>>> On Sat, 16 Jun 2007, Ioan Raicu wrote:
>>>>
>>>>   
>>>>       
>>>>> having problems with the 100 molecule run in MolDyn.  It's not clear where the
>>>>> problem is, on the surface Falkon looks fine... we are looking into where
>>>>> everything breaks to cause Swift to not continue with the workflow to
>>>>> completion!
>>>>>     
>>>>>         
>>>> The same problem that you showed me the other day or different?
>>>>
>>>> with 'the same problem' being that falkon thinks all the jobs are done; 
>>>> but that falkon's measured response time for sending completion 
>>>> notifications gets approximately linearly longer over time and the swift 
>>>> JVM uses ~100% CPU and doesn't indicate job completion at all after a certain 
>>>> period.
>>>>
>>>> or different symptoms now?
>>>>
>>>>   
>>>>       
>>> -- 
>>> ============================================
>>> Ioan Raicu
>>> Ph.D. Student
>>> ============================================
>>> Distributed Systems Laboratory
>>> Computer Science Department
>>> University of Chicago
>>> 1100 E. 58th Street, Ryerson Hall
>>> Chicago, IL 60637
>>> ============================================
>>> Email: iraicu at cs.uchicago.edu
>>> Web:   http://www.cs.uchicago.edu/~iraicu
>>>        http://dsl.cs.uchicago.edu/
>>> ============================================
>>> ============================================
>>>     
>>
>>   
> 
> -- 
> 
>    Ian Foster, Director, Computation Institute
> Argonne National Laboratory & University of Chicago
> Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439
> Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637
> Tel: +1 630 252 4619.  Web: www.ci.uchicago.edu.
>       Globus Alliance: www.globus.org.
> 
> 
> ------------------------------------------------------------------------
> 
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel

-- 
Mike Wilde
Computation Institute, University of Chicago
Math & Computer Science Division
Argonne National Laboratory
Argonne, IL   60439    USA
tel 630-252-7497 fax 630-252-1997


From wilde at mcs.anl.gov  Sat Jun 16 12:14:20 2007
From: wilde at mcs.anl.gov (Mike Wilde)
Date: Sat, 16 Jun 2007 12:14:20 -0500
Subject: [Swift-devel] on the semantics of 'array closing'
In-Reply-To: <4674096B.4020109@mcs.anl.gov>
References: <Pine.LNX.4.64.0706151924490.10634@dildano.hawaga.org.uk>	<4672F5E3.7060205@mcs.anl.gov>	<Pine.LNX.4.64.0706152050270.10634@dildano.hawaga.org.uk>
	<4674096B.4020109@mcs.anl.gov>
Message-ID: <46741A6C.8080404@mcs.anl.gov>

I have to say for the record that I'm ready to concede victory to 
the functional camp in this discussion. (This is like conceding 
defeat but with a positive spin. If you can't beat 'em, join 'em ;)

I've previously felt that functional programming would be too hard 
to sell to our user base.

But clearly Ben, Mihael and Ian are all in the f-camp.

As long as we take it all the way, and work through all our existing 
docs, tutorials and application codes to make sure that the 
functional way of expressing things has clean elegant semantics, is 
clean to write and efficient to reliably implement, I think we're on 
a good path.

The first criterion of a successful programming tool is that its 
implementers love it and use it effectively.  If they do, the user 
community is likely to follow along, grow and be successful. And as 
long as we meet that criterion I am happy that we are on the right track.

So let's function-away, do it right and make it work. Let's try to keep 
the syntax close to its current C-like form to make the language 
"look" more palatable to the imperative hordes. I.e.  where you 
*can* express things in a C-like form, do so, for comfort and 
readability.

Ben, Mihael: you have the green light to move the language in that 
direction.

Any objections - speak now!

- Mike


Ian Foster wrote, On 6/16/2007 11:01 AM:
> I like the notion of having a "map" function. If that could entirely 
> replace the current element assignments, that would be a wonderful 
> simplification, it seems to me.
> 
> Ian.
> 
> Ben Clifford wrote:
>> There's a different approach, which is to say that 'a' is a variable and 
>> can be assigned to once. Thus assignment syntax like a[0]=something 
>> becomes illegal and we need more functional language constructs. So 
>> instead of writing:
>>
>> for e,i in input_array {
>>   output_array[i] = p(e);
>> }
>>
>> we would write:
>>
>> output_array = foreach i in input_array {
>>   return p(i);
>> }
>>
>> (it's a Haskell map in different syntax!)
>>
>> That means that, at the language level, output_array is now properly 
>> single assignment.
>>
>>
>> On Fri, 15 Jun 2007, Ian Foster wrote:
>>
>>   
>>> Hi,
>>>
>>> For:
>>>
>>>  a[0] = p()
>>>  a[1] = q()
>>>  b = s(a)
>>>
>>> I think there are two distinct issues.
>>>
>>> a) Determining the size of the array. This could presumably be done by
>>> declaring it, e.g.:
>>>
>>>  a[2] or some similar notion
>>>  a[0] = p()
>>>  a[1] = q()
>>>  b = s(a)
>>>
>>> or by some "closing" concept.
>>>
>>> b) Whether or not each element of an array is a separate single-assignment
>>> variable. If they are, then the code above should work just fine. If they are
>>> not, then we have a couple of behaviors we could define. One would be that
>>> b=s(a) blocks until all elements in "a" are defined. The other is that we have
>>> a way of "closing" (once again). In that case, we have to define what happens
>>> if b=s(a) accesses an element that is not defined.
>>>
>>> Ian.
>>>
>>> Ben Clifford wrote:
>>>     
>>>> There is a problem that has been called the 'array closing problem'.
>>>>
>>>> It manifests itself in the tutorial in that certain bits of code that
>>>> intuitively can go either in a procedure or at the top level can, in practice,
>>>> only go into a procedure.
>>>>
>>>> In that context, I tried to think about better ways to explain/document the
>>>> behaviour than "mumble mumble move that code into a procedure".
>>>>
>>>> In Swift we claim to have 'single assignment variables'.
>>>>
>>>> From single assignment variables we get our grid job ordering:
>>>>
>>>>   a = p()
>>>>   b = s(a)
>>>>
>>>> causes first grid job p to run, and when that has completed, then grid job s
>>>> will run.
>>>>
>>>> This is the same as if we had written:
>>>>
>>>>   b = s(a)
>>>>   a = p()
>>>>
>>>> The ordering comes from the use of a as an 'output' for p and an 'input' for
>>>> s, not from source text ordering.
>>>>
>>>> In that model, it's meaningless to assign two different things to a, like
>>>> this:
>>>>
>>>>   a = p()
>>>>   b = s(a)
>>>>   a = t()
>>>>
>>>>
>>>> Note that I've omitted the data types from the above. This works in the
>>>> implementation for simple types such as a datafile marker type.
>>>>
>>>> What is important is that each variable is either unassigned or has its
>>>> single value - whenever we refer to that variable, we can either use the
>>>> value it has, or defer evaluation of that expression until the variable has
>>>> its value.
>>>>
>>>> Now consider arrays. In the present syntax, arrays can be passed as single
>>>> (complex) values to/from procedures, like before:
>>>>
>>>>   a = p()
>>>>   b = s(a)
>>>>
>>>> Here a and b are array types.
>>>>
>>>> That's fine. a is assigned to by the first statement, and b is assigned to
>>>> by the second statement.
>>>>
>>>> But we also support a different assignment syntax for arrays, that looks
>>>> like this:
>>>>
>>>>   a[0] = p()
>>>>   a[1] = q()
>>>>   b = s(a)
>>>>
>>>> This fails at the moment (specifically, I think the execution engine will
>>>> hang).
>>>>
>>>> Why? Because there is no one point at which we assign a value to 'a' - the
>>>> assignment is split over multiple statements, which can be in various places
>>>> (and inside loops etc).
>>>>
>>>> There is nothing in the implementation that detects that a has been assigned
>>>> its value.
>>>>
>>>> So there is this notion in the karajan intermediate code of 'closing an
>>>> array'.  This is an assertion made in the object code that all assignments
>>>> to pieces of an array have been made - that, in effect, the array has its
>>>> value.
>>>>
>>>> The suggested hack/workaround for this is to move the array element
>>>> assignments into a procedure:
>>>>
>>>>  (file f[]) z() {
>>>>    f[0] = p();
>>>>    f[1] = q();
>>>>  }
>>>>
>>>>  a = z()
>>>>  b = s(a)
>>>>
>>>> This works. (which is sort-of a violation of referential transparency)
>>>>
>>>> It works because Swift implicitly marks arrays returned from compound
>>>> procedures as closed (which may or may not be correct).
>>>>
>>>> So in most variable scopes, arrays behave like single-assignment variables,
>>>> but each array can have one specific scope in which members can be assigned
>>>> to. In that scope, the array cannot be treated as a whole variable.
>>>>
>>>> In the z() example above, that special scope is the body of z(). In the
>>>> previous example, that scope is the global scope, and the program is invalid
>>>> by the rule above that the array cannot be referred to as a whole in the
>>>> same place that its members are individually assigned to.
>>>>
>>>> That's my explanation of what's going on now. I think it matches reality. I
>>>> don't like that this is reality, but it is what we have.
>>>>
>>>> Comments appreciated.
>>>>
>>>>   
>>>>       
>>>     
>>
>>   
> 
> -- 
> 
>    Ian Foster, Director, Computation Institute
> Argonne National Laboratory & University of Chicago
> Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439
> Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637
> Tel: +1 630 252 4619.  Web: www.ci.uchicago.edu.
>       Globus Alliance: www.globus.org.
> 
> 

-- 
Mike Wilde
Computation Institute, University of Chicago
Math & Computer Science Division
Argonne National Laboratory
Argonne, IL   60439    USA
tel 630-252-7497 fax 630-252-1997


From hategan at mcs.anl.gov  Sat Jun 16 13:29:47 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sat, 16 Jun 2007 21:29:47 +0300
Subject: [Swift-devel] CPU usage with provider-deef
In-Reply-To: <467417DE.50408@mcs.anl.gov>
References: <Pine.LNX.4.64.0706151634060.10634@dildano.hawaga.org.uk>
	<1181984306.10455.3.camel@blabla.mcs.anl.gov>
	<Pine.LNX.4.64.0706161258280.10634@dildano.hawaga.org.uk>
	<4673F0E9.3060505@cs.uchicago.edu>
	<Pine.LNX.4.64.0706161418470.10634@dildano.hawaga.org.uk>
	<4673F548.5070608@cs.uchicago.edu>
	<Pine.LNX.4.64.0706161438550.10634@dildano.hawaga.org.uk>
	<4673FB19.6070305@cs.uchicago.edu>
	<1182006166.11495.1.camel@blabla.mcs.anl.gov>
	<467408D5.2020709@mcs.anl.gov>  <467417DE.50408@mcs.anl.gov>
Message-ID: <1182018587.12013.14.camel@blabla.mcs.anl.gov>

On Sat, 2007-06-16 at 12:03 -0500, Mike Wilde wrote:
> This should be fun, and a nice break from the I2U2 work that you've 
> been immersed in, Mihael.

I've already looked at the Falkon code and it's... a lot of code that
does stuff that I understand only in principle. What you want isn't
easy, and I have my reservations towards the amount of fun it involves.

That being said, Ioan, would it be possible to have a cleaned-up version
of the code where there are no duplicate classes? It's hard for me to
figure out what's relevant and what isn't in that case. And perhaps dead
code/comments removed?

Mihael

> 
> Want to do a read-through soon, and send out comments for discussion 
> that can turn into a list of code improvements to bugzilize?
> 
> What I think is important about Falkon is that its working, its 
> proving out the value of the provisioned direct-scheduling approach 
> with numbers, and that its working for Ioan as a vehicle for his 
> research.
> 
> What we want to get from the effort is a) Ioan progresses towards 
> his PhD; b) the immediate needs of our app-users get met; and c) we 
> learn whats needed in architecture, protocol and algorithm for a 
> successful long-term approach to running swift programs efficiently.
> 
> Point is that everyone is open to changes and towards an eventual 
> re-design and re-write. This, Mihael, would be where you can 
> propose, design and implement the ideas you've expressed about 
> implementing provisioned direct-scheduling using Karajan's remote 
> execution mechanisms.
> 
> - Mike
> 
> 
> 
> 
> Ian Foster wrote, On 6/16/2007 10:59 AM:
> > It seems important that Ioan sit down with Mihael and work through the 
> > Falkon code to see where it can be simplified, improved, etc. I am sure 
> > that this will result in problems being identified and fixed that will 
> > otherwise cost us time later.
> > 
> > Mihael Hategan wrote:
> >> Yourkit (www.yourkit.com) has free licenses for open source projects for
> >> their profiler. Point them to a globus web page that has your name, and
> >> they'll send you the license. Alternatively, there are other profilers
> >> out there, and I strongly recommend using them on such issues.
> >>
> >> Mihael
> >>
> >> On Sat, 2007-06-16 at 10:00 -0500, Ioan Raicu wrote:
> >>   
> >>> Nope, I think this is a different problem, or at least a subset of the
> >>> problems we were having before.
> >>>
> >>> Since we fixed the CPU utilization, and we moved to a bigger box (4
> >>> CPUs with 2GB of memory), everything is happening in a timely fashion
> >>> (a few ms per notification delivery throughout the experiment).  Plus,
> >>> I believe the view is consistent (the same tasks look complete on both
> >>> ends) between Falkon and Swift, but we are still checking on this as
> >>> the run was made just last night for the 100 mol run.  We'll keep you
> >>> posted with what we find.
> >>>
> >>> Ioan
> >>>
> >>> Ben Clifford wrote: 
> >>>     
> >>>> On Sat, 16 Jun 2007, Ioan Raicu wrote:
> >>>>
> >>>>   
> >>>>       
> >>>>> having problems with the 100 molecule run in MolDyn.  Its not clear where the
> >>>>> problem is, on the surface Falkon looks fine... we are looking into where
> >>>>> everything breaks to cause Swift to not continue with the workflow to
> >>>>> completion!
> >>>>>     
> >>>>>         
> >>>> The same problem that you showed me the other day or different?
> >>>>
> >>>> with 'the same problem' being that falkon thinks all the jobs are done; 
> >>>> but that falkon's measure response time for sending completion 
> >>>> notifications gets approximately linearly longer over time and the swift 
> >>>> JVM uses ~100% and doesn't inidicate job completion at all after a certain 
> >>>> period.
> >>>>
> >>>> or different symptoms now?
> >>>>
> >>>>   
> >>>>       
> >>> -- 
> >>> ============================================
> >>> Ioan Raicu
> >>> Ph.D. Student
> >>> ============================================
> >>> Distributed Systems Laboratory
> >>> Computer Science Department
> >>> University of Chicago
> >>> 1100 E. 58th Street, Ryerson Hall
> >>> Chicago, IL 60637
> >>> ============================================
> >>> Email: iraicu at cs.uchicago.edu
> >>> Web:   http://www.cs.uchicago.edu/~iraicu
> >>>        http://dsl.cs.uchicago.edu/
> >>> ============================================
> >>> ============================================
> >>>     
> >>
> >>   
> > 
> > -- 
> > 
> >    Ian Foster, Director, Computation Institute
> > Argonne National Laboratory & University of Chicago
> > Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439
> > Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637
> > Tel: +1 630 252 4619.  Web: www.ci.uchicago.edu.
> >       Globus Alliance: www.globus.org.
> > 
> > 
> > ------------------------------------------------------------------------
> > 
> > _______________________________________________
> > Swift-devel mailing list
> > Swift-devel at ci.uchicago.edu
> > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> 


From benc at hawaga.org.uk  Sat Jun 16 14:36:17 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sat, 16 Jun 2007 19:36:17 +0000 (GMT)
Subject: [Swift-devel] CPU usage with provider-deef
In-Reply-To: <1182018587.12013.14.camel@blabla.mcs.anl.gov>
References: <Pine.LNX.4.64.0706151634060.10634@dildano.hawaga.org.uk>
	<1181984306.10455.3.camel@blabla.mcs.anl.gov>
	<Pine.LNX.4.64.0706161258280.10634@dildano.hawaga.org.uk>
	<4673F0E9.3060505@cs.uchicago.edu>
	<Pine.LNX.4.64.0706161418470.10634@dildano.hawaga.org.uk>
	<4673F548.5070608@cs.uchicago.edu>
	<Pine.LNX.4.64.0706161438550.10634@dildano.hawaga.org.uk>
	<4673FB19.6070305@cs.uchicago.edu>
	<1182006166.11495.1.camel@blabla.mcs.anl.gov>
	<467408D5.2020709@mcs.anl.gov>  <467417DE.50408@mcs.anl.gov>
	<1182018587.12013.14.camel@blabla.mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0706161932560.15250@dildano.hawaga.org.uk>


On Sat, 16 Jun 2007, Mihael Hategan wrote:

> On Sat, 2007-06-16 at 12:03 -0500, Mike Wilde wrote:
> > This should be fun, and a nice break from the I2U2 work that you've 
> > been immersed in, Mihael.

> I have my reservations towards the amount of fun it involves.

Right, taking prototypes and turning them into production isn't 
necessarily fun - in fact, a lot of the fun already happened with the 
making of the prototype and the rest is somewhat drudgery (to an extent, 
that's the same situation i2u2 cosmic was/is in).

-- 


From iraicu at cs.uchicago.edu  Sat Jun 16 14:47:02 2007
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Sat, 16 Jun 2007 14:47:02 -0500
Subject: [Swift-devel] CPU usage with provider-deef
In-Reply-To: <1182018587.12013.14.camel@blabla.mcs.anl.gov>
References: <Pine.LNX.4.64.0706151634060.10634@dildano.hawaga.org.uk>	<1181984306.10455.3.camel@blabla.mcs.anl.gov>	<Pine.LNX.4.64.0706161258280.10634@dildano.hawaga.org.uk>	<4673F0E9.3060505@cs.uchicago.edu>	<Pine.LNX.4.64.0706161418470.10634@dildano.hawaga.org.uk>	<4673F548.5070608@cs.uchicago.edu>	<Pine.LNX.4.64.0706161438550.10634@dildano.hawaga.org.uk>	<4673FB19.6070305@cs.uchicago.edu>	<1182006166.11495.1.camel@blabla.mcs.anl.gov>	<467408D5.2020709@mcs.anl.gov>
	<467417DE.50408@mcs.anl.gov>
	<1182018587.12013.14.camel@blabla.mcs.anl.gov>
Message-ID: <46743E36.9040901@cs.uchicago.edu>

Yes, I know I need to clean up the code and remove unused (dead) code.  
Can this wait for the next version I am working on, so I don't do this 
clean-up twice?  The version that is out there in testing currently is 
v0.8.  My development version is v0.9.  I have been distracted lately 
from finishing up v0.9, but it's not far from being complete.  Mihael, 
when do you get back in town? 

If this is something more urgent, then perhaps I can get you a cleaned-up 
version of v0.8 in the coming week.

Ioan

Mihael Hategan wrote:
> On Sat, 2007-06-16 at 12:03 -0500, Mike Wilde wrote:
>   
>> This should be fun, and a nice break from the I2U2 work that you've 
>> been immersed in, Mihael.
>>     
>
> I've already looked at the Falkon code and it's... a lot of code that
> does stuff that I understand only in principle. What you want isn't
> easy, and I have my reservations towards the amount of fun it involves.
>
> That being said, Ioan, would it be possible to have a cleaned up version
> of the code where there are no duplicate classes? It's hard for me to
> figure what's relevant or not in that case. And perhaps dead
> code/comments removed?
>
> Mihael
>
>   
>> Want to do a read-through soon, and send out comments for discussion 
>> that can turn into a list of code improvements to bugzilize?
>>
>> What I think is important about Falkon is that its working, its 
>> proving out the value of the provisioned direct-scheduling approach 
>> with numbers, and that its working for Ioan as a vehicle for his 
>> research.
>>
>> What we want to get from the effort is a) Ioan progresses towards 
>> his PhD; b) the immediate needs of our app-users get met; and c) we 
>> learn whats needed in architecture, protocol and algorithm for a 
>> successful long-term approach to running swift programs efficiently.
>>
>> Point is that everyone is open to changes and towards an eventual 
>> re-design and re-write. This, Mihael, would be where you can 
>> propose, design and implement the ideas you've expressed about 
>> implementing provisioned direct-scheduling using Karajan's remote 
>> execution mechanisms.
>>
>> - Mike
>>
>>
>>
>>
>> Ian Foster wrote, On 6/16/2007 10:59 AM:
>>     
>>> It seems important that Ioan sit down with Mihael and work through the 
>>> Falkon code to see where it can be simplified, improved, etc. I am sure 
>>> that this will result in problems being identified and fixed that will 
>>> otherwise cost us time later.
>>>
>>> Mihael Hategan wrote:
>>>       
>>>> Yourkit (www.yourkit.com) has free licenses for open source projects for
>>>> their profiler. Point them to a globus web page that has your name, and
>>>> they'll send you the license. Alternatively, there are other profilers
>>>> out there, and I strongly recommend using them on such issues.
>>>>
>>>> Mihael
>>>>
>>>> On Sat, 2007-06-16 at 10:00 -0500, Ioan Raicu wrote:
>>>>   
>>>>         
>>>>> Nope, I think this is a different problem, or at least a subset of the
>>>>> problems we were having before.
>>>>>
>>>>> Since we fixed the CPU utilization, and we moved to a bigger box (4
>>>>> CPUs with 2GB of memory), everything is happening in a timely fashion
>>>>> (a few ms per notification delivery throughout the experiment).  Plus,
>>>>> I believe the view is consistent (the same tasks look complete on both
>>>>> ends) between Falkon and Swift, but we are still checking on this as
>>>>> the run was made just last night for the 100 mol run.  We'll keep you
>>>>> posted with what we find.
>>>>>
>>>>> Ioan
>>>>>
>>>>> Ben Clifford wrote: 
>>>>>     
>>>>>           
>>>>>> On Sat, 16 Jun 2007, Ioan Raicu wrote:
>>>>>>
>>>>>>   
>>>>>>       
>>>>>>             
>>>>>>> having problems with the 100 molecule run in MolDyn.  Its not clear where the
>>>>>>> problem is, on the surface Falkon looks fine... we are looking into where
>>>>>>> everything breaks to cause Swift to not continue with the workflow to
>>>>>>> completion!
>>>>>>>     
>>>>>>>         
>>>>>>>               
>>>>>> The same problem that you showed me the other day or different?
>>>>>>
>>>>>> with 'the same problem' being that falkon thinks all the jobs are done; 
>>>>>> but that falkon's measure response time for sending completion 
>>>>>> notifications gets approximately linearly longer over time and the swift 
>>>>>> JVM uses ~100% and doesn't inidicate job completion at all after a certain 
>>>>>> period.
>>>>>>
>>>>>> or different symptoms now?
>>>>>>
>>>>>>   
>>>>>>       
>>>>>>             
>>>>> -- 
>>>>> ============================================
>>>>> Ioan Raicu
>>>>> Ph.D. Student
>>>>> ============================================
>>>>> Distributed Systems Laboratory
>>>>> Computer Science Department
>>>>> University of Chicago
>>>>> 1100 E. 58th Street, Ryerson Hall
>>>>> Chicago, IL 60637
>>>>> ============================================
>>>>> Email: iraicu at cs.uchicago.edu
>>>>> Web:   http://www.cs.uchicago.edu/~iraicu
>>>>>        http://dsl.cs.uchicago.edu/
>>>>> ============================================
>>>>> ============================================
>>>>>     
>>>>>           
>>>>   
>>>>         
>>> -- 
>>>
>>>    Ian Foster, Director, Computation Institute
>>> Argonne National Laboratory & University of Chicago
>>> Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439
>>> Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637
>>> Tel: +1 630 252 4619.  Web: www.ci.uchicago.edu.
>>>       Globus Alliance: www.globus.org.
>>>
>>>
>>> ------------------------------------------------------------------------
>>>
>>> _______________________________________________
>>> Swift-devel mailing list
>>> Swift-devel at ci.uchicago.edu
>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>>>       
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>
>   

-- 
============================================
Ioan Raicu
Ph.D. Student
============================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
============================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
       http://dsl.cs.uchicago.edu/
============================================
============================================


From iraicu at cs.uchicago.edu  Sat Jun 16 14:49:00 2007
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Sat, 16 Jun 2007 14:49:00 -0500
Subject: [Swift-devel] CPU usage with provider-deef
In-Reply-To: <Pine.LNX.4.64.0706161932560.15250@dildano.hawaga.org.uk>
References: <Pine.LNX.4.64.0706151634060.10634@dildano.hawaga.org.uk>	<1181984306.10455.3.camel@blabla.mcs.anl.gov>	<Pine.LNX.4.64.0706161258280.10634@dildano.hawaga.org.uk>	<4673F0E9.3060505@cs.uchicago.edu>	<Pine.LNX.4.64.0706161418470.10634@dildano.hawaga.org.uk>	<4673F548.5070608@cs.uchicago.edu>	<Pine.LNX.4.64.0706161438550.10634@dildano.hawaga.org.uk>	<4673FB19.6070305@cs.uchicago.edu>	<1182006166.11495.1.camel@blabla.mcs.anl.gov>	<467408D5.2020709@mcs.anl.gov>
	<467417DE.50408@mcs.anl.gov>	<1182018587.12013.14.camel@blabla.mcs.anl.gov>
	<Pine.LNX.4.64.0706161932560.15250@dildano.hawaga.org.uk>
Message-ID: <46743EAC.2080400@cs.uchicago.edu>

Although, there should still be fun left to have :), as new 
features/protocols/extensions could be on the horizon.  

Ioan

Ben Clifford wrote:
> On Sat, 16 Jun 2007, Mihael Hategan wrote:
>
>   
>> On Sat, 2007-06-16 at 12:03 -0500, Mike Wilde wrote:
>>     
>>> This should be fun, and a nice break from the I2U2 work that you've 
>>> been immersed in, Mihael.
>>>       
>
>   
>> I have my reservations towards the amount of fun it involves.
>>     
>
> Right, taking prototypes and turning them into production isn't 
> necessarily fun - in fact, a lot of the fun already happened with the 
> making of the prototype and the rest is some what drugery. (to an extent 
> that's the same situation i2u2 cosmic was/is in).
>
>   

-- 
============================================
Ioan Raicu
Ph.D. Student
============================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
============================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
       http://dsl.cs.uchicago.edu/
============================================
============================================


From itf at mcs.anl.gov  Sat Jun 16 14:52:49 2007
From: itf at mcs.anl.gov (Ian Foster)
Date: Sat, 16 Jun 2007 19:52:49 +0000
Subject: [Swift-devel] CPU usage with provider-deef
In-Reply-To: <Pine.LNX.4.64.0706161932560.15250@dildano.hawaga.org.uk>
References: <Pine.LNX.4.64.0706151634060.10634@dildano.hawaga.org.uk><1181984306.10455.3.camel@blabla.mcs.anl.gov><Pine.LNX.4.64.0706161258280.10634@dildano.hawaga.org.uk><4673F0E9.3060505@cs.uchicago.edu><Pine.LNX.4.64.0706161418470.10634@dildano.hawaga.org.uk><4673F548.5070608@cs.uchicago.edu><Pine.LNX.4.64.0706161438550.10634@dildano.hawaga.org.uk><4673FB19.6070305@cs.uchicago.edu><1182006166.11495.1.camel@blabla.mcs.anl.gov><467408D5.2020709@mcs.anl.gov>
	<467417DE.50408@mcs.anl.gov><1182018587.12013.14.camel@blabla.mcs.anl.gov><Pine.LNX.4.64.0706161932560.15250@dildano.hawaga.org.uk>
Message-ID: <537273469-1182023623-cardhu_decombobulator_blackberry.rim.net-728646234-@bxe006.bisx.prod.on.blackberry>

I wasn't suggesting (at least in the first instance) that Mihael take the prototype and turn it into production, but that Mihael and Ioan sit down together and do a code walkthrough. I think that this would likely identify bugs and opportunities for simplification.

Ian



-----Original Message-----
From: Ben Clifford <benc at hawaga.org.uk>

Date: Sat, 16 Jun 2007 19:36:17 
To:Mihael Hategan <hategan at mcs.anl.gov>
Cc:swift-devel at ci.uchicago.edu
Subject: Re: [Swift-devel] CPU usage with provider-deef


On Sat, 16 Jun 2007, Mihael Hategan wrote:

> On Sat, 2007-06-16 at 12:03 -0500, Mike Wilde wrote:
> > This should be fun, and a nice break from the I2U2 work that you've 
> > been immersed in, Mihael.

> I have my reservations towards the amount of fun it involves.

Right, taking prototypes and turning them into production isn't 
necessarily fun - in fact, a lot of the fun already happened with the 
making of the prototype and the rest is some what drugery. (to an extent 
that's the same situation i2u2 cosmic was/is in).

-- 


From iraicu at cs.uchicago.edu  Sat Jun 16 14:57:42 2007
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Sat, 16 Jun 2007 14:57:42 -0500
Subject: [Swift-devel] CPU usage with provider-deef
In-Reply-To: <1182006166.11495.1.camel@blabla.mcs.anl.gov>
References: <Pine.LNX.4.64.0706151634060.10634@dildano.hawaga.org.uk>	
	<1181984306.10455.3.camel@blabla.mcs.anl.gov>	
	<Pine.LNX.4.64.0706161258280.10634@dildano.hawaga.org.uk>	
	<4673F0E9.3060505@cs.uchicago.edu>	
	<Pine.LNX.4.64.0706161418470.10634@dildano.hawaga.org.uk>	
	<4673F548.5070608@cs.uchicago.edu>	
	<Pine.LNX.4.64.0706161438550.10634@dildano.hawaga.org.uk>	
	<4673FB19.6070305@cs.uchicago.edu>
	<1182006166.11495.1.camel@blabla.mcs.anl.gov>
Message-ID: <467440B6.1060706@cs.uchicago.edu>

Hey,
This looks really nice, I'll give it a try!

Ioan

Mihael Hategan wrote:
> Yourkit (www.yourkit.com) has free licenses for open source projects for
> their profiler. Point them to a globus web page that has your name, and
> they'll send you the license. Alternatively, there are other profilers
> out there, and I strongly recommend using them on such issues.
>
> Mihael
>
> On Sat, 2007-06-16 at 10:00 -0500, Ioan Raicu wrote:
>   
>> Nope, I think this is a different problem, or at least a subset of the
>> problems we were having before.
>>
>> Since we fixed the CPU utilization, and we moved to a bigger box (4
>> CPUs with 2GB of memory), everything is happening in a timely fashion
>> (a few ms per notification delivery throughout the experiment).  Plus,
>> I believe the view is consistent (the same tasks look complete on both
>> ends) between Falkon and Swift, but we are still checking on this as
>> the run was made just last night for the 100 mol run.  We'll keep you
>> posted with what we find.
>>
>> Ioan
>>
>> Ben Clifford wrote: 
>>     
>>> On Sat, 16 Jun 2007, Ioan Raicu wrote:
>>>
>>>   
>>>       
>>>> having problems with the 100 molecule run in MolDyn.  Its not clear where the
>>>> problem is, on the surface Falkon looks fine... we are looking into where
>>>> everything breaks to cause Swift to not continue with the workflow to
>>>> completion!
>>>>     
>>>>         
>>> The same problem that you showed me the other day or different?
>>>
>>> with 'the same problem' being that falkon thinks all the jobs are done; 
>>> but that falkon's measure response time for sending completion 
>>> notifications gets approximately linearly longer over time and the swift 
>>> JVM uses ~100% and doesn't inidicate job completion at all after a certain 
>>> period.
>>>
>>> or different symptoms now?
>>>
>>>   
>>>       
>> -- 
>> ============================================
>> Ioan Raicu
>> Ph.D. Student
>> ============================================
>> Distributed Systems Laboratory
>> Computer Science Department
>> University of Chicago
>> 1100 E. 58th Street, Ryerson Hall
>> Chicago, IL 60637
>> ============================================
>> Email: iraicu at cs.uchicago.edu
>> Web:   http://www.cs.uchicago.edu/~iraicu
>>        http://dsl.cs.uchicago.edu/
>> ============================================
>> ============================================
>>     
>
>
>   

-- 
============================================
Ioan Raicu
Ph.D. Student
============================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
============================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
       http://dsl.cs.uchicago.edu/
============================================
============================================


From benc at hawaga.org.uk  Sat Jun 16 15:01:48 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sat, 16 Jun 2007 20:01:48 +0000 (GMT)
Subject: [Swift-devel] CPU usage with provider-deef
In-Reply-To: <46743E36.9040901@cs.uchicago.edu>
References: <Pine.LNX.4.64.0706151634060.10634@dildano.hawaga.org.uk>
	<1181984306.10455.3.camel@blabla.mcs.anl.gov>
	<Pine.LNX.4.64.0706161258280.10634@dildano.hawaga.org.uk>
	<4673F0E9.3060505@cs.uchicago.edu>
	<Pine.LNX.4.64.0706161418470.10634@dildano.hawaga.org.uk>
	<4673F548.5070608@cs.uchicago.edu>
	<Pine.LNX.4.64.0706161438550.10634@dildano.hawaga.org.uk>
	<4673FB19.6070305@cs.uchicago.edu>
	<1182006166.11495.1.camel@blabla.mcs.anl.gov>
	<467408D5.2020709@mcs.anl.gov> <467417DE.50408@mcs.anl.gov>
	<1182018587.12013.14.camel@blabla.mcs.anl.gov>
	<46743E36.9040901@cs.uchicago.edu>
Message-ID: <Pine.LNX.4.64.0706162000320.15250@dildano.hawaga.org.uk>


At some point you need to switch to developing in SVN so that others can 
play along. Freezing your development, sending me a tarball, waiting for 
me to import to SVN, and then unfreezing your development and continuing 
from an SVN checkout is approximately the easiest I can make it for you.

On Sat, 16 Jun 2007, Ioan Raicu wrote:

> Yes, I know I need to clean up the code, and remove unused (dead) code.  Can
> this wait for the next version I am working on, so I don't do this clean-up
> twice?  The version that is out there in testing currently is v0.8.  My
> development version is v0.9.  I have been distracted lately from finishing up
> v0.9, but its not far from being complete.  Mihael, when do you get back in
> town? 
> If this is something more urgent, then perhaps I can get you a clean-up
> version of v0.8 in the coming week.
> 
> Ioan
> 
> Mihael Hategan wrote:
> > On Sat, 2007-06-16 at 12:03 -0500, Mike Wilde wrote:
> >   
> > > This should be fun, and a nice break from the I2U2 work that you've been
> > > immersed in, Mihael.
> > >     
> > 
> > I've already looked at the Falkon code and it's... a lot of code that
> > does stuff that I understand only in principle. What you want isn't
> > easy, and I have my reservations towards the amount of fun it involves.
> > 
> > That being said, Ioan, would it be possible to have a cleaned up version
> > of the code where there are no duplicate classes? It's hard for me to
> > figure what's relevant or not in that case. And perhaps dead
> > code/comments removed?
> > 
> > Mihael
> > 
> >   
> > > Want to do a read-through soon, and send out comments for discussion that
> > > can turn into a list of code improvements to bugzilize?
> > > 
> > > What I think is important about Falkon is that its working, its proving
> > > out the value of the provisioned direct-scheduling approach with numbers,
> > > and that its working for Ioan as a vehicle for his research.
> > > 
> > > What we want to get from the effort is a) Ioan progresses towards his PhD;
> > > b) the immediate needs of our app-users get met; and c) we learn whats
> > > needed in architecture, protocol and algorithm for a successful long-term
> > > approach to running swift programs efficiently.
> > > 
> > > Point is that everyone is open to changes and towards an eventual
> > > re-design and re-write. This, Mihael, would be where you can propose,
> > > design and implement the ideas you've expressed about implementing
> > > provisioned direct-scheduling using Karajan's remote execution mechanisms.
> > > 
> > > - Mike
> > > 
> > > 
> > > 
> > > 
> > > Ian Foster wrote, On 6/16/2007 10:59 AM:
> > >     
> > > > It seems important that Ioan sit down with Mihael and work through the
> > > > Falkon code to see where it can be simplified, improved, etc. I am sure
> > > > that this will result in problems being identified and fixed that will
> > > > otherwise cost us time later.
> > > > 
> > > > Mihael Hategan wrote:
> > > >       
> > > > > Yourkit (www.yourkit.com) has free licenses for open source projects
> > > > > for
> > > > > their profiler. Point them to a globus web page that has your name,
> > > > > and
> > > > > they'll send you the license. Alternatively, there are other profilers
> > > > > out there, and I strongly recommend using them on such issues.
> > > > > 
> > > > > Mihael
> > > > > 
> > > > > On Sat, 2007-06-16 at 10:00 -0500, Ioan Raicu wrote:
> > > > >           
> > > > > > Nope, I think this is a different problem, or at least a subset of
> > > > > > the
> > > > > > problems we were having before.
> > > > > > 
> > > > > > Since we fixed the CPU utilization, and we moved to a bigger box (4
> > > > > > CPUs with 2GB of memory), everything is happening in a timely
> > > > > > fashion
> > > > > > (a few ms per notification delivery throughout the experiment).
> > > > > > Plus,
> > > > > > I believe the view is consistent (the same tasks look complete on
> > > > > > both
> > > > > > ends) between Falkon and Swift, but we are still checking on this as
> > > > > > the run was made just last night for the 100 mol run.  We'll keep
> > > > > > you
> > > > > > posted with what we find.
> > > > > > 
> > > > > > Ioan
> > > > > > 
> > > > > > Ben Clifford wrote:               
> > > > > > > On Sat, 16 Jun 2007, Ioan Raicu wrote:
> > > > > > > 
> > > > > > >                     
> > > > > > > > having problems with the 100 molecule run in MolDyn.  Its not
> > > > > > > > clear where the
> > > > > > > > problem is, on the surface Falkon looks fine... we are looking
> > > > > > > > into where
> > > > > > > > everything breaks to cause Swift to not continue with the
> > > > > > > > workflow to
> > > > > > > > completion!
> > > > > > > >                           
> > > > > > > The same problem that you showed me the other day or different?
> > > > > > > 
> > > > > > > with 'the same problem' being that falkon thinks all the jobs are
> > > > > > > done; but that falkon's measure response time for sending
> > > > > > > completion notifications gets approximately linearly longer over
> > > > > > > time and the swift JVM uses ~100% and doesn't inidicate job
> > > > > > > completion at all after a certain period.
> > > > > > > 
> > > > > > > or different symptoms now?
> > > > > > > 
> > > > > > >                     
> > > > > > -- 
> > > > > > ============================================
> > > > > > Ioan Raicu
> > > > > > Ph.D. Student
> > > > > > ============================================
> > > > > > Distributed Systems Laboratory
> > > > > > Computer Science Department
> > > > > > University of Chicago
> > > > > > 1100 E. 58th Street, Ryerson Hall
> > > > > > Chicago, IL 60637
> > > > > > ============================================
> > > > > > Email: iraicu at cs.uchicago.edu
> > > > > > Web:   http://www.cs.uchicago.edu/~iraicu
> > > > > >        http://dsl.cs.uchicago.edu/
> > > > > > ============================================
> > > > > > ============================================
> > > > > >               
> > > > >           
> > > > -- 
> > > > 
> > > >    Ian Foster, Director, Computation Institute
> > > > Argonne National Laboratory & University of Chicago
> > > > Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439
> > > > Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637
> > > > Tel: +1 630 252 4619.  Web: www.ci.uchicago.edu.
> > > >       Globus Alliance: www.globus.org.
> > > > 
> > > > 
> > > > ------------------------------------------------------------------------
> > > > 
> > > > _______________________________________________
> > > > Swift-devel mailing list
> > > > Swift-devel at ci.uchicago.edu
> > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> > > >       
> > 
> > _______________________________________________
> > Swift-devel mailing list
> > Swift-devel at ci.uchicago.edu
> > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> > 
> >   
> 
> 


From iraicu at cs.uchicago.edu  Sat Jun 16 15:06:49 2007
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Sat, 16 Jun 2007 15:06:49 -0500
Subject: [Swift-devel] CPU usage with provider-deef
In-Reply-To: <Pine.LNX.4.64.0706162000320.15250@dildano.hawaga.org.uk>
References: <Pine.LNX.4.64.0706151634060.10634@dildano.hawaga.org.uk>
	<1181984306.10455.3.camel@blabla.mcs.anl.gov>
	<Pine.LNX.4.64.0706161258280.10634@dildano.hawaga.org.uk>
	<4673F0E9.3060505@cs.uchicago.edu>
	<Pine.LNX.4.64.0706161418470.10634@dildano.hawaga.org.uk>
	<4673F548.5070608@cs.uchicago.edu>
	<Pine.LNX.4.64.0706161438550.10634@dildano.hawaga.org.uk>
	<4673FB19.6070305@cs.uchicago.edu>
	<1182006166.11495.1.camel@blabla.mcs.anl.gov>
	<467408D5.2020709@mcs.anl.gov> <467417DE.50408@mcs.anl.gov>
	<1182018587.12013.14.camel@blabla.mcs.anl.gov>
	<46743E36.9040901@cs.uchicago.edu>
	<Pine.LNX.4.64.0706162000320.15250@dildano.hawaga.org.uk>
Message-ID: <467442D9.9030608@cs.uchicago.edu>

Right, I know!  It sounds trivial :), but I just haven't had the time to 
think about it... 
I think the latest Falkon code is pretty solid, but I would like to get 
to the bottom of why Nika's 100-molecule MolDyn run isn't working...
Unless there is a pressing need to look at the very latest code, let's 
do this next week (hopefully after we have Nika's app running)! 

Ioan

Ben Clifford wrote:
> At some point you need to switch to developing in SVN so that others can 
> play along. Freezing your development, sending me a tarball, waiting for 
> me to import to SVN, and then unfreezing your development and continuing 
> from an SVN checkout is approximately the easiest I can make it for you.
>
> On Sat, 16 Jun 2007, Ioan Raicu wrote:
>
>   
>> Yes, I know I need to clean up the code, and remove unused (dead) code.  Can
>> this wait for the next version I am working on, so I don't do this clean-up
>> twice?  The version that is out there in testing currently is v0.8.  My
>> development version is v0.9.  I have been distracted lately from finishing up
>> v0.9, but its not far from being complete.  Mihael, when do you get back in
>> town? 
>> If this is something more urgent, then perhaps I can get you a clean-up
>> version of v0.8 in the coming week.
>>
>> Ioan
>>
>> Mihael Hategan wrote:
>>     
>>> On Sat, 2007-06-16 at 12:03 -0500, Mike Wilde wrote:
>>>   
>>>       
>>>> This should be fun, and a nice break from the I2U2 work that you've been
>>>> immersed in, Mihael.
>>>>     
>>>>         
>>> I've already looked at the Falkon code and it's... a lot of code that
>>> does stuff that I understand only in principle. What you want isn't
>>> easy, and I have my reservations towards the amount of fun it involves.
>>>
>>> That being said, Ioan, would it be possible to have a cleaned up version
>>> of the code where there are no duplicate classes? It's hard for me to
>>> figure what's relevant or not in that case. And perhaps dead
>>> code/comments removed?
>>>
>>> Mihael
>>>
>>>   
>>>       
>>>> Want to do a read-through soon, and send out comments for discussion that
>>>> can turn into a list of code improvements to bugzilize?
>>>>
>>>> What I think is important about Falkon is that its working, its proving
>>>> out the value of the provisioned direct-scheduling approach with numbers,
>>>> and that its working for Ioan as a vehicle for his research.
>>>>
>>>> What we want to get from the effort is a) Ioan progresses towards his PhD;
>>>> b) the immediate needs of our app-users get met; and c) we learn whats
>>>> needed in architecture, protocol and algorithm for a successful long-term
>>>> approach to running swift programs efficiently.
>>>>
>>>> Point is that everyone is open to changes and towards an eventual
>>>> re-design and re-write. This, Mihael, would be where you can propose,
>>>> design and implement the ideas you've expressed about implementing
>>>> provisioned direct-scheduling using Karajan's remote execution mechanisms.
>>>>
>>>> - Mike
>>>>
>>>>
>>>>
>>>>
>>>> Ian Foster wrote, On 6/16/2007 10:59 AM:
>>>>     
>>>>         
>>>>> It seems important that Ioan sit down with Mihael and work through the
>>>>> Falkon code to see where it can be simplified, improved, etc. I am sure
>>>>> that this will result in problems being identified and fixed that will
>>>>> otherwise cost us time later.
>>>>>
>>>>> Mihael Hategan wrote:
>>>>>       
>>>>>           
>>>>>> Yourkit (www.yourkit.com) has free licenses for open source projects
>>>>>> for
>>>>>> their profiler. Point them to a globus web page that has your name,
>>>>>> and
>>>>>> they'll send you the license. Alternatively, there are other profilers
>>>>>> out there, and I strongly recommend using them on such issues.
>>>>>>
>>>>>> Mihael
>>>>>>
>>>>>> On Sat, 2007-06-16 at 10:00 -0500, Ioan Raicu wrote:
>>>>>>           
>>>>>>             
>>>>>>> Nope, I think this is a different problem, or at least a subset of
>>>>>>> the
>>>>>>> problems we were having before.
>>>>>>>
>>>>>>> Since we fixed the CPU utilization, and we moved to a bigger box (4
>>>>>>> CPUs with 2GB of memory), everything is happening in a timely
>>>>>>> fashion
>>>>>>> (a few ms per notification delivery throughout the experiment).
>>>>>>> Plus,
>>>>>>> I believe the view is consistent (the same tasks look complete on
>>>>>>> both
>>>>>>> ends) between Falkon and Swift, but we are still checking on this as
>>>>>>> the run was made just last night for the 100 mol run.  We'll keep
>>>>>>> you
>>>>>>> posted with what we find.
>>>>>>>
>>>>>>> Ioan
>>>>>>>
>>>>>>> Ben Clifford wrote:               
>>>>>>>               
>>>>>>>> On Sat, 16 Jun 2007, Ioan Raicu wrote:
>>>>>>>>
>>>>>>>>                     
>>>>>>>>                 
>>>>>>>>> having problems with the 100 molecule run in MolDyn.  Its not
>>>>>>>>> clear where the
>>>>>>>>> problem is, on the surface Falkon looks fine... we are looking
>>>>>>>>> into where
>>>>>>>>> everything breaks to cause Swift to not continue with the
>>>>>>>>> workflow to
>>>>>>>>> completion!
>>>>>>>>>                           
>>>>>>>>>                   
>>>>>>>> The same problem that you showed me the other day or different?
>>>>>>>>
>>>>>>>> with 'the same problem' being that falkon thinks all the jobs are
>>>>>>>> done; but that falkon's measured response time for sending
>>>>>>>> completion notifications gets approximately linearly longer over
>>>>>>>> time and the swift JVM uses ~100% and doesn't indicate job
>>>>>>>> completion at all after a certain period.
>>>>>>>>
>>>>>>>> or different symptoms now?
>>>>>>>>
>>>>>>>>                     
>>>>>>>>                 
>>>>>>>               
>>>>>>>               
>>>>>>           
>>>>>>             
>>>>> -- 
>>>>>
>>>>>    Ian Foster, Director, Computation Institute
>>>>> Argonne National Laboratory & University of Chicago
>>>>> Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439
>>>>> Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637
>>>>> Tel: +1 630 252 4619.  Web: www.ci.uchicago.edu.
>>>>>       Globus Alliance: www.globus.org.
>>>>>
>>>>>
>>>>> ------------------------------------------------------------------------
>>>>>
>>>>>       
>>>>>           
>>>
>>>   
>>>       
>>     
>
>   

-- 
============================================
Ioan Raicu
Ph.D. Student
============================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
============================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
       http://dsl.cs.uchicago.edu/
============================================
============================================


From hategan at mcs.anl.gov  Sat Jun 16 15:05:46 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sat, 16 Jun 2007 23:05:46 +0300
Subject: [Swift-devel] CPU usage with provider-deef
In-Reply-To: <46743E36.9040901@cs.uchicago.edu>
References: <Pine.LNX.4.64.0706151634060.10634@dildano.hawaga.org.uk>
	<1181984306.10455.3.camel@blabla.mcs.anl.gov>
	<Pine.LNX.4.64.0706161258280.10634@dildano.hawaga.org.uk>
	<4673F0E9.3060505@cs.uchicago.edu>
	<Pine.LNX.4.64.0706161418470.10634@dildano.hawaga.org.uk>
	<4673F548.5070608@cs.uchicago.edu>
	<Pine.LNX.4.64.0706161438550.10634@dildano.hawaga.org.uk>
	<4673FB19.6070305@cs.uchicago.edu>
	<1182006166.11495.1.camel@blabla.mcs.anl.gov>
	<467408D5.2020709@mcs.anl.gov>  <467417DE.50408@mcs.anl.gov>
	<1182018587.12013.14.camel@blabla.mcs.anl.gov>
	<46743E36.9040901@cs.uchicago.edu>
Message-ID: <1182024346.12401.0.camel@blabla.mcs.anl.gov>

On Sat, 2007-06-16 at 14:47 -0500, Ioan Raicu wrote:
> Yes, I know I need to clean up the code, and remove unused (dead)
> code.  Can this wait for the next version I am working on, so I don't
> do this clean-up twice?  The version that is out there in testing
> currently is v0.8.  My development version is v0.9.  I have been
> distracted lately from finishing up v0.9, but its not far from being
> complete.  Mihael, when do you get back in town?

I should be in Chicago on the 23rd.

Mihael

> 
> If this is something more urgent, then perhaps I can get you a
> clean-up version of v0.8 in the coming week.
> 
> Ioan
> 
> Mihael Hategan wrote: 
> > On Sat, 2007-06-16 at 12:03 -0500, Mike Wilde wrote:
> >   
> > > This should be fun, and a nice break from the I2U2 work that you've 
> > > been immersed in, Mihael.
> > >     
> > 
> > I've already looked at the Falkon code and it's... a lot of code that
> > does stuff that I understand only in principle. What you want isn't
> > easy, and I have my reservations towards the amount of fun it involves.
> > 
> > That being said, Ioan, would it be possible to have a cleaned up version
> > of the code where there are no duplicate classes? It's hard for me to
> > figure what's relevant or not in that case. And perhaps dead
> > code/comments removed?
> > 
> > Mihael
> > 
> >   
> > > Want to do a read-through soon, and send out comments for discussion 
> > > that can turn into a list of code improvements to bugzilize?
> > > 
> > > What I think is important about Falkon is that its working, its 
> > > proving out the value of the provisioned direct-scheduling approach 
> > > with numbers, and that its working for Ioan as a vehicle for his 
> > > research.
> > > 
> > > What we want to get from the effort is a) Ioan progresses towards 
> > > his PhD; b) the immediate needs of our app-users get met; and c) we 
> > > learn whats needed in architecture, protocol and algorithm for a 
> > > successful long-term approach to running swift programs efficiently.
> > > 
> > > Point is that everyone is open to changes and towards an eventual 
> > > re-design and re-write. This, Mihael, would be where you can 
> > > propose, design and implement the ideas you've expressed about 
> > > implementing provisioned direct-scheduling using Karajan's remote 
> > > execution mechanisms.
> > > 
> > > - Mike
> > > 
> > > 
> > > 
> > > 
> > > Ian Foster wrote, On 6/16/2007 10:59 AM:
> > >     
> > > > It seems important that Ioan sit down with Mihael and work through the 
> > > > Falkon code to see where it can be simplified, improved, etc. I am sure 
> > > > that this will result in problems being identified and fixed that will 
> > > > otherwise cost us time later.
> > > > 
> > > > Mihael Hategan wrote:
> > > >       
> > > > > Yourkit (www.yourkit.com) has free licenses for open source projects for
> > > > > their profiler. Point them to a globus web page that has your name, and
> > > > > they'll send you the license. Alternatively, there are other profilers
> > > > > out there, and I strongly recommend using them on such issues.
> > > > > 
> > > > > Mihael
> > > > > 
> > > > > On Sat, 2007-06-16 at 10:00 -0500, Ioan Raicu wrote:
> > > > >   
> > > > >         
> > > > > > Nope, I think this is a different problem, or at least a subset of the
> > > > > > problems we were having before.
> > > > > > 
> > > > > > Since we fixed the CPU utilization, and we moved to a bigger box (4
> > > > > > CPUs with 2GB of memory), everything is happening in a timely fashion
> > > > > > (a few ms per notification delivery throughout the experiment).  Plus,
> > > > > > I believe the view is consistent (the same tasks look complete on both
> > > > > > ends) between Falkon and Swift, but we are still checking on this as
> > > > > > the run was made just last night for the 100 mol run.  We'll keep you
> > > > > > posted with what we find.
> > > > > > 
> > > > > > Ioan
> > > > > > 
> > > > > > Ben Clifford wrote: 
> > > > > >     
> > > > > >           
> > > > > > > On Sat, 16 Jun 2007, Ioan Raicu wrote:
> > > > > > > 
> > > > > > >   
> > > > > > >       
> > > > > > >             
> > > > > > > > having problems with the 100 molecule run in MolDyn.  Its not clear where the
> > > > > > > > problem is, on the surface Falkon looks fine... we are looking into where
> > > > > > > > everything breaks to cause Swift to not continue with the workflow to
> > > > > > > > completion!
> > > > > > > >     
> > > > > > > >         
> > > > > > > >               
> > > > > > > The same problem that you showed me the other day or different?
> > > > > > > 
> > > > > > > with 'the same problem' being that falkon thinks all the jobs are done; 
> > > > > > > > but that falkon's measured response time for sending completion 
> > > > > > > > notifications gets approximately linearly longer over time and the swift 
> > > > > > > > JVM uses ~100% and doesn't indicate job completion at all after a certain 
> > > > > > > period.
> > > > > > > 
> > > > > > > or different symptoms now?
> > > > > > > 
> > > > > > >   
> > > > > > >       
> > > > > > >             
> > > > > > -- 
> > > > > > ============================================
> > > > > > Ioan Raicu
> > > > > > Ph.D. Student
> > > > > > ============================================
> > > > > > Distributed Systems Laboratory
> > > > > > Computer Science Department
> > > > > > University of Chicago
> > > > > > 1100 E. 58th Street, Ryerson Hall
> > > > > > Chicago, IL 60637
> > > > > > ============================================
> > > > > > Email: iraicu at cs.uchicago.edu
> > > > > > Web:   http://www.cs.uchicago.edu/~iraicu
> > > > > >        http://dsl.cs.uchicago.edu/
> > > > > > ============================================
> > > > > > ============================================
> > > > > >     
> > > > > >           
> > > > -- 
> > > > 
> > > >    Ian Foster, Director, Computation Institute
> > > > Argonne National Laboratory & University of Chicago
> > > > Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439
> > > > Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637
> > > > Tel: +1 630 252 4619.  Web: www.ci.uchicago.edu.
> > > >       Globus Alliance: www.globus.org.
> > > > 
> > > > 
> > > > ------------------------------------------------------------------------
> > > > 
> > > >       
> > 
> > 
> >   
> 


From iraicu at cs.uchicago.edu  Sat Jun 16 15:16:32 2007
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Sat, 16 Jun 2007 15:16:32 -0500
Subject: [Swift-devel] CPU usage with provider-deef
In-Reply-To: <1182024346.12401.0.camel@blabla.mcs.anl.gov>
References: <Pine.LNX.4.64.0706151634060.10634@dildano.hawaga.org.uk>	
	<1181984306.10455.3.camel@blabla.mcs.anl.gov>	
	<Pine.LNX.4.64.0706161258280.10634@dildano.hawaga.org.uk>	
	<4673F0E9.3060505@cs.uchicago.edu>	
	<Pine.LNX.4.64.0706161418470.10634@dildano.hawaga.org.uk>	
	<4673F548.5070608@cs.uchicago.edu>	
	<Pine.LNX.4.64.0706161438550.10634@dildano.hawaga.org.uk>	
	<4673FB19.6070305@cs.uchicago.edu>	
	<1182006166.11495.1.camel@blabla.mcs.anl.gov>	
	<467408D5.2020709@mcs.anl.gov> <467417DE.50408@mcs.anl.gov>	
	<1182018587.12013.14.camel@blabla.mcs.anl.gov>	
	<46743E36.9040901@cs.uchicago.edu>
	<1182024346.12401.0.camel@blabla.mcs.anl.gov>
Message-ID: <46744520.2010202@cs.uchicago.edu>

Great!  Then I'll plan to have a cleaned-up version in SVN by then :)
Ioan

Mihael Hategan wrote:
> On Sat, 2007-06-16 at 14:47 -0500, Ioan Raicu wrote:
>   
>> Yes, I know I need to clean up the code, and remove unused (dead)
>> code.  Can this wait for the next version I am working on, so I don't
>> do this clean-up twice?  The version that is out there in testing
>> currently is v0.8.  My development version is v0.9.  I have been
>> distracted lately from finishing up v0.9, but its not far from being
>> complete.  Mihael, when do you get back in town?
>>     
>
> I should be in Chicago on 23rd.
>
> Mihael
>
>   
>> If this is something more urgent, then perhaps I can get you a
>> clean-up version of v0.8 in the coming week.
>>
>> Ioan
>>
>> Mihael Hategan wrote: 
>>     
>>> On Sat, 2007-06-16 at 12:03 -0500, Mike Wilde wrote:
>>>   
>>>       
>>>> This should be fun, and a nice break from the I2U2 work that you've 
>>>> been immersed in, Mihael.
>>>>     
>>>>         
>>> I've already looked at the Falkon code and it's... a lot of code that
>>> does stuff that I understand only in principle. What you want isn't
>>> easy, and I have my reservations towards the amount of fun it involves.
>>>
>>> That being said, Ioan, would it be possible to have a cleaned up version
>>> of the code where there are no duplicate classes? It's hard for me to
>>> figure what's relevant or not in that case. And perhaps dead
>>> code/comments removed?
>>>
>>> Mihael
>>>
>>>   
>>>       
>>>> Want to do a read-through soon, and send out comments for discussion 
>>>> that can turn into a list of code improvements to bugzilize?
>>>>
>>>> What I think is important about Falkon is that its working, its 
>>>> proving out the value of the provisioned direct-scheduling approach 
>>>> with numbers, and that its working for Ioan as a vehicle for his 
>>>> research.
>>>>
>>>> What we want to get from the effort is a) Ioan progresses towards 
>>>> his PhD; b) the immediate needs of our app-users get met; and c) we 
>>>> learn whats needed in architecture, protocol and algorithm for a 
>>>> successful long-term approach to running swift programs efficiently.
>>>>
>>>> Point is that everyone is open to changes and towards an eventual 
>>>> re-design and re-write. This, Mihael, would be where you can 
>>>> propose, design and implement the ideas you've expressed about 
>>>> implementing provisioned direct-scheduling using Karajan's remote 
>>>> execution mechanisms.
>>>>
>>>> - Mike
>>>>
>>>>
>>>>
>>>>
>>>> Ian Foster wrote, On 6/16/2007 10:59 AM:
>>>>     
>>>>         
>>>>> It seems important that Ioan sit down with Mihael and work through the 
>>>>> Falkon code to see where it can be simplified, improved, etc. I am sure 
>>>>> that this will result in problems being identified and fixed that will 
>>>>> otherwise cost us time later.
>>>>>
>>>>> Mihael Hategan wrote:
>>>>>       
>>>>>           
>>>>>> Yourkit (www.yourkit.com) has free licenses for open source projects for
>>>>>> their profiler. Point them to a globus web page that has your name, and
>>>>>> they'll send you the license. Alternatively, there are other profilers
>>>>>> out there, and I strongly recommend using them on such issues.
>>>>>>
>>>>>> Mihael
>>>>>>
>>>>>> On Sat, 2007-06-16 at 10:00 -0500, Ioan Raicu wrote:
>>>>>>   
>>>>>>         
>>>>>>             
>>>>>>> Nope, I think this is a different problem, or at least a subset of the
>>>>>>> problems we were having before.
>>>>>>>
>>>>>>> Since we fixed the CPU utilization, and we moved to a bigger box (4
>>>>>>> CPUs with 2GB of memory), everything is happening in a timely fashion
>>>>>>> (a few ms per notification delivery throughout the experiment).  Plus,
>>>>>>> I believe the view is consistent (the same tasks look complete on both
>>>>>>> ends) between Falkon and Swift, but we are still checking on this as
>>>>>>> the run was made just last night for the 100 mol run.  We'll keep you
>>>>>>> posted with what we find.
>>>>>>>
>>>>>>> Ioan
>>>>>>>
>>>>>>> Ben Clifford wrote: 
>>>>>>>     
>>>>>>>           
>>>>>>>               
>>>>>>>> On Sat, 16 Jun 2007, Ioan Raicu wrote:
>>>>>>>>
>>>>>>>>   
>>>>>>>>       
>>>>>>>>             
>>>>>>>>                 
>>>>>>>>> having problems with the 100 molecule run in MolDyn.  Its not clear where the
>>>>>>>>> problem is, on the surface Falkon looks fine... we are looking into where
>>>>>>>>> everything breaks to cause Swift to not continue with the workflow to
>>>>>>>>> completion!
>>>>>>>>>     
>>>>>>>>>         
>>>>>>>>>               
>>>>>>>>>                   
>>>>>>>> The same problem that you showed me the other day or different?
>>>>>>>>
>>>>>>>> with 'the same problem' being that falkon thinks all the jobs are done; 
>>>>>>>> but that falkon's measured response time for sending completion 
>>>>>>>> notifications gets approximately linearly longer over time and the swift 
>>>>>>>> JVM uses ~100% and doesn't indicate job completion at all after a certain 
>>>>>>>> period.
>>>>>>>>
>>>>>>>> or different symptoms now?
>>>>>>>>
>>>>>>>>   
>>>>>>>>       
>>>>>>>>             
>>>>>>>>                 
>>>>>>>     
>>>>>>>           
>>>>>>>               
>>>>> -- 
>>>>>
>>>>>    Ian Foster, Director, Computation Institute
>>>>> Argonne National Laboratory & University of Chicago
>>>>> Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439
>>>>> Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637
>>>>> Tel: +1 630 252 4619.  Web: www.ci.uchicago.edu.
>>>>>       Globus Alliance: www.globus.org.
>>>>>
>>>>>
>>>>> ------------------------------------------------------------------------
>>>>>
>>>>>       
>>>>>           
>>>
>>>   
>>>       
>>     
>
>
>   

-- 
============================================
Ioan Raicu
Ph.D. Student
============================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
============================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
       http://dsl.cs.uchicago.edu/
============================================
============================================


From hategan at mcs.anl.gov  Sat Jun 16 15:15:04 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sat, 16 Jun 2007 23:15:04 +0300
Subject: [Swift-devel] CPU usage with provider-deef
In-Reply-To: <537273469-1182023623-cardhu_decombobulator_blackberry.rim.net-728646234-@bxe006.bisx.prod.on.blackberry>
References: <Pine.LNX.4.64.0706151634060.10634@dildano.hawaga.org.uk>
	<1181984306.10455.3.camel@blabla.mcs.anl.gov>
	<Pine.LNX.4.64.0706161258280.10634@dildano.hawaga.org.uk>
	<4673F0E9.3060505@cs.uchicago.edu>
	<Pine.LNX.4.64.0706161418470.10634@dildano.hawaga.org.uk>
	<4673F548.5070608@cs.uchicago.edu>
	<Pine.LNX.4.64.0706161438550.10634@dildano.hawaga.org.uk>
	<4673FB19.6070305@cs.uchicago.edu>
	<1182006166.11495.1.camel@blabla.mcs.anl.gov><467408D5.2020709@mcs.anl.gov>
	<467417DE.50408@mcs.anl.gov><1182018587.12013.14.camel@blabla.mcs.anl.gov>
	<Pine.LNX.4.64.0706161932560.15250@dildano.hawaga.org.uk>
	<537273469-1182023623-cardhu_decombobulator_blackberry.rim.net-728646234-@bxe006.bisx.prod.on.blackberry>
Message-ID: <1182024904.12401.11.camel@blabla.mcs.anl.gov>

On Sat, 2007-06-16 at 19:52 +0000, Ian Foster wrote:
> I wasn't suggesting (at least in the first instance) that Mihael take the prototype and turn it into production, but that Mihael and Ioan sit down together and do a code walkthrough. I think that this would likely identify bugs and opportunities for simplification.

It's somewhat on the same level of fun :). From the experience I've
accumulated so far, design is hard. Understanding prototype design is
probably even harder (not only do you need to understand the problem,
you also need to understand why many non-obvious things are done the way
they are done).

Mihael

> 
> Ian
> 
> 
> 
> Sent via BlackBerry from T-Mobile
> 
> -----Original Message-----
> From: Ben Clifford <benc at hawaga.org.uk>
> 
> Date: Sat, 16 Jun 2007 19:36:17 
> To:Mihael Hategan <hategan at mcs.anl.gov>
> Cc:swift-devel at ci.uchicago.edu
> Subject: Re: [Swift-devel] CPU usage with provider-deef
> 
> 
> 
> On Sat, 16 Jun 2007, Mihael Hategan wrote:
> 
> > On Sat, 2007-06-16 at 12:03 -0500, Mike Wilde wrote:
> > > This should be fun, and a nice break from the I2U2 work that you've 
> > > been immersed in, Mihael.
> 
> > I have my reservations towards the amount of fun it involves.
> 
> Right, taking prototypes and turning them into production isn't 
> necessarily fun - in fact, a lot of the fun already happened with the 
> making of the prototype and the rest is somewhat drudgery. (to an extent 
> that's the same situation i2u2 cosmic was/is in).
> 


From wilde at mcs.anl.gov  Sat Jun 16 17:55:59 2007
From: wilde at mcs.anl.gov (Mike Wilde)
Date: Sat, 16 Jun 2007 17:55:59 -0500
Subject: [Swift-devel] CPU usage with provider-deef
In-Reply-To: <1182024904.12401.11.camel@blabla.mcs.anl.gov>
References: <Pine.LNX.4.64.0706151634060.10634@dildano.hawaga.org.uk>	<1181984306.10455.3.camel@blabla.mcs.anl.gov>	<Pine.LNX.4.64.0706161258280.10634@dildano.hawaga.org.uk>	<4673F0E9.3060505@cs.uchicago.edu>	<Pine.LNX.4.64.0706161418470.10634@dildano.hawaga.org.uk>	<4673F548.5070608@cs.uchicago.edu>	<Pine.LNX.4.64.0706161438550.10634@dildano.hawaga.org.uk>	<4673FB19.6070305@cs.uchicago.edu>	<1182006166.11495.1.camel@blabla.mcs.anl.gov><467408D5.2020709@mcs.anl.gov>	<467417DE.50408@mcs.anl.gov><1182018587.12013.14.camel@blabla.mcs.anl.gov>	<Pine.LNX.4.64.0706161932560.15250@dildano.hawaga.org.uk>	<537273469-1182023623-cardhu_decombobulator_blackberry.rim.net-728646234-@bxe006.bisx.prod.on.blackberry>
	<1182024904.12401.11.camel@blabla.mcs.anl.gov>
Message-ID: <46746A7F.7040004@mcs.anl.gov>

I think a nice clean message sequence chart describing Falkon's 
various activities would be very useful, as it's the backbone of its 
logic.

I tried to create this for the SC paper by asking Ioan to describe 
the protocol to me, but I did not succeed.

I think this would be a very useful description to maintain, as a 
UML sequence chart, and that, Ioan, this would be a very important 
part of your thesis or of future papers.

It's up to you and Ian to weigh whether this would be valuable to 
your research.  I think it's invaluable for design, review and debugging.

- Mike


Mihael Hategan wrote, On 6/16/2007 3:15 PM:
> On Sat, 2007-06-16 at 19:52 +0000, Ian Foster wrote:
>> I wasn't suggesting (at least in the first instance) that Mihael take the prototype and turn it into production, but that Mihael and Ioan sit down together and do a code walkthrough. I think that this would likely identify bugs and opportunities for simplification.
> 
> It's somewhat on the same in level of fun :). From the experience I've
> accumulated so far, design is hard. Understanding prototype design is
> probably even harder (not only do you need to understand the problem,
> you also need to understand why many non-obvious things are done the way
> they are done).
> 
> Mihael
> 
>> Ian
>>
>>
>>
>> Sent via BlackBerry from T-Mobile
>>
>> -----Original Message-----
>> From: Ben Clifford <benc at hawaga.org.uk>
>>
>> Date: Sat, 16 Jun 2007 19:36:17 
>> To:Mihael Hategan <hategan at mcs.anl.gov>
>> Cc:swift-devel at ci.uchicago.edu
>> Subject: Re: [Swift-devel] CPU usage with provider-deef
>>
>>
>>
>> On Sat, 16 Jun 2007, Mihael Hategan wrote:
>>
>>> On Sat, 2007-06-16 at 12:03 -0500, Mike Wilde wrote:
>>>> This should be fun, and a nice break from the I2U2 work that you've 
>>>> been immersed in, Mihael.
>>> I have my reservations towards the amount of fun it involves.
>> Right, taking prototypes and turning them into production isn't 
>> necessarily fun - in fact, a lot of the fun already happened with the 
>> making of the prototype and the rest is somewhat drudgery. (to an extent 
>> that's the same situation i2u2 cosmic was/is in).
>>
> 
> 
> 

-- 
Mike Wilde
Computation Institute, University of Chicago
Math & Computer Science Division
Argonne National Laboratory
Argonne, IL   60439    USA
tel 630-252-7497 fax 630-252-1997


From wilde at mcs.anl.gov  Sat Jun 16 18:09:17 2007
From: wilde at mcs.anl.gov (Mike Wilde)
Date: Sat, 16 Jun 2007 18:09:17 -0500
Subject: [Swift-devel] CPU usage with provider-deef
In-Reply-To: <Pine.LNX.4.64.0706161932560.15250@dildano.hawaga.org.uk>
References: <Pine.LNX.4.64.0706151634060.10634@dildano.hawaga.org.uk>
	<1181984306.10455.3.camel@blabla.mcs.anl.gov>
	<Pine.LNX.4.64.0706161258280.10634@dildano.hawaga.org.uk>
	<4673F0E9.3060505@cs.uchicago.edu>
	<Pine.LNX.4.64.0706161418470.10634@dildano.hawaga.org.uk>
	<4673F548.5070608@cs.uchicago.edu>
	<Pine.LNX.4.64.0706161438550.10634@dildano.hawaga.org.uk>
	<4673FB19.6070305@cs.uchicago.edu>
	<1182006166.11495.1.camel@blabla.mcs.anl.gov>
	<467408D5.2020709@mcs.anl.gov> <467417DE.50408@mcs.anl.gov>
	<1182018587.12013.14.camel@blabla.mcs.anl.gov>
	<Pine.LNX.4.64.0706161932560.15250@dildano.hawaga.org.uk>
Message-ID: <46746D9D.8020206@mcs.anl.gov>

Sorry, I still believe this can be fun. You can tell me afterwards 
if I was right or not.

I think from Ioan's work we can gain some understanding of the 
problem, of architecture issues, performance potential, and of 
protocol and reliability issues.

I feel it can still be fun to design a new production-quality system 
from scratch, and to prototype and implement that.

Mihael, I think you had a vision of how this could be done elegantly 
using Karajan mechanisms as powerful building blocks to build on, to 
provide the communication and messaging / remote execution fabric.

Ben Clifford wrote, On 6/16/2007 2:36 PM:
> On Sat, 16 Jun 2007, Mihael Hategan wrote:
> 
>> On Sat, 2007-06-16 at 12:03 -0500, Mike Wilde wrote:
>>> This should be fun, and a nice break from the I2U2 work that you've 
>>> been immersed in, Mihael.
> 
>> I have my reservations towards the amount of fun it involves.
> 
> Right, taking prototypes and turning them into production isn't 
> necessarily fun - in fact, a lot of the fun already happened with the 
> making of the prototype and the rest is somewhat drudgery. (to an extent 
> that's the same situation i2u2 cosmic was/is in).

I hope it's not the case that the only part of programming that's fun 
is prototyping.

If you spend your day in PowerPoint, Word and email, anything 
related to code can look like fun...

:) Mike


> 

-- 
Mike Wilde
Computation Institute, University of Chicago
Math & Computer Science Division
Argonne National Laboratory
Argonne, IL   60439    USA
tel 630-252-7497 fax 630-252-1997


From benc at hawaga.org.uk  Sat Jun 16 18:15:02 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sat, 16 Jun 2007 23:15:02 +0000 (GMT)
Subject: [Swift-devel] CPU usage with provider-deef
In-Reply-To: <46746A7F.7040004@mcs.anl.gov>
References: <Pine.LNX.4.64.0706151634060.10634@dildano.hawaga.org.uk>
	<1181984306.10455.3.camel@blabla.mcs.anl.gov>
	<Pine.LNX.4.64.0706161258280.10634@dildano.hawaga.org.uk>
	<4673F0E9.3060505@cs.uchicago.edu>
	<Pine.LNX.4.64.0706161418470.10634@dildano.hawaga.org.uk>
	<4673F548.5070608@cs.uchicago.edu>
	<Pine.LNX.4.64.0706161438550.10634@dildano.hawaga.org.uk>
	<4673FB19.6070305@cs.uchicago.edu>
	<1182006166.11495.1.camel@blabla.mcs.anl.gov><467408D5.2020709@mcs.anl.gov>
	<467417DE.50408@mcs.anl.gov><1182018587.12013.14.camel@blabla.mcs.anl.gov>
	<Pine.LNX.4.64.0706161932560.15250@dildano.hawaga.org.uk>
	<537273469-1182023623-cardhu_decombobulator_blackberry.rim.net-728646234-@bxe006.bisx.prod.on.blackberry>
	<1182024904.12401.11.camel@blabla.mcs.anl.gov>
	<46746A7F.7040004@mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0706162313300.15250@dildano.hawaga.org.uk>


The WSDL should describe the web services part of the protocol, so that 
might be a good place to start. It should already describe the messages 
that go over the wire in a form vaguely readable to a human. What is 
probably still needed is the extra information saying in which order the 
messages are sent.

On Sat, 16 Jun 2007, Mike Wilde wrote:

> I think a nice clean message sequence chart describing Falkon's various
> activities would be very useful, as it's the backbone of its logic.
> 
> I tried to create this for the SC paper by asking Ioan to describe the
> protocol to me, but I did not succeed.
> 
> I think this would be a very useful description to maintain, as a UML sequence
> chart, and, Ioan, that this would be a very important part of your thesis or of
> future papers.
> 
> It's up to you and Ian to weigh whether this would be valuable to your
> research.  I think it's invaluable for design, review and debugging.
> 
> - Mike
> 
> 
> 
> 
> Mihael Hategan wrote, On 6/16/2007 3:15 PM:
> > On Sat, 2007-06-16 at 19:52 +0000, Ian Foster wrote:
> > > I wasn't suggesting (at least in the first instance) that Mihael take the
> > > prototype and turn it into production, but that Mihael and Ioan sit down
> > > together and do a code walkthrough. I think that this would likely
> > > identify bugs and opportunities for simplification.
> > 
> > It's somewhat on the same level of fun :). From the experience I've
> > accumulated so far, design is hard. Understanding prototype design is
> > probably even harder (not only do you need to understand the problem,
> > you also need to understand why many non-obvious things are done the way
> > they are done).
> > 
> > Mihael
> > 
> > > Ian
> > > 
> > > 
> > > 
> > > 
> > > -----Original Message-----
> > > From: Ben Clifford <benc at hawaga.org.uk>
> > > 
> > > Date: Sat, 16 Jun 2007 19:36:17 To:Mihael Hategan <hategan at mcs.anl.gov>
> > > Cc:swift-devel at ci.uchicago.edu
> > > Subject: Re: [Swift-devel] CPU usage with provider-deef
> > > 
> > > 
> > > 
> > > On Sat, 16 Jun 2007, Mihael Hategan wrote:
> > > 
> > > > On Sat, 2007-06-16 at 12:03 -0500, Mike Wilde wrote:
> > > > > This should be fun, and a nice break from the I2U2 work that you've
> > > > > been immersed in, Mihael.
> > > > I have my reservations towards the amount of fun it involves.
> > > Right, taking prototypes and turning them into production isn't
> > > necessarily fun - in fact, a lot of the fun already happened with the
> > > making of the prototype and the rest is somewhat drudgery. (to an extent
> > > that's the same situation i2u2 cosmic was/is in).
> > > 
> > 
> > 
> > 
> 
> 


From hategan at mcs.anl.gov  Sun Jun 17 03:47:38 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sun, 17 Jun 2007 11:47:38 +0300
Subject: [Swift-devel] CPU usage with provider-deef
In-Reply-To: <46746D9D.8020206@mcs.anl.gov>
References: <Pine.LNX.4.64.0706151634060.10634@dildano.hawaga.org.uk>
	<1181984306.10455.3.camel@blabla.mcs.anl.gov>
	<Pine.LNX.4.64.0706161258280.10634@dildano.hawaga.org.uk>
	<4673F0E9.3060505@cs.uchicago.edu>
	<Pine.LNX.4.64.0706161418470.10634@dildano.hawaga.org.uk>
	<4673F548.5070608@cs.uchicago.edu>
	<Pine.LNX.4.64.0706161438550.10634@dildano.hawaga.org.uk>
	<4673FB19.6070305@cs.uchicago.edu>
	<1182006166.11495.1.camel@blabla.mcs.anl.gov>
	<467408D5.2020709@mcs.anl.gov>  <467417DE.50408@mcs.anl.gov>
	<1182018587.12013.14.camel@blabla.mcs.anl.gov>
	<Pine.LNX.4.64.0706161932560.15250@dildano.hawaga.org.uk>
	<46746D9D.8020206@mcs.anl.gov>
Message-ID: <1182070058.12861.4.camel@blabla.mcs.anl.gov>

On Sat, 2007-06-16 at 18:09 -0500, Mike Wilde wrote:
> sorry, I still believe this can be fun. You can tell me afterwards 
> if I was right or not.

Fun it is then.

Mihael

> 
> I think from Ioan's work we can gain some understanding of the 
> problem, of architecture issues, performance potential, and of 
> protocol and reliability issues.
> 
> I feel it can still be fun to design a new production-quality system 
> from scratch, and to prototype and implement that.
> 
> Mihael, I think you had a vision of how this could be done elegantly 
> using Karajan mechanisms as powerful building blocks to build on, to 
> provide the communication and messaging / remote execution fabric.
> 
> Ben Clifford wrote, On 6/16/2007 2:36 PM:
> > On Sat, 16 Jun 2007, Mihael Hategan wrote:
> > 
> >> On Sat, 2007-06-16 at 12:03 -0500, Mike Wilde wrote:
> >>> This should be fun, and a nice break from the I2U2 work that you've 
> >>> been immersed in, Mihael.
> > 
> >> I have my reservations towards the amount of fun it involves.
> > 
> > Right, taking prototypes and turning them into production isn't 
> > necessarily fun - in fact, a lot of the fun already happened with the 
> > making of the prototype and the rest is somewhat drudgery. (to an extent 
> > that's the same situation i2u2 cosmic was/is in).
> 
> I hope it's not the case that the only part of programming that's fun 
> is prototyping.
> 
> If you spend your day in PowerPoint, Word and email, anything 
> related to code can look like fun...
> 
> :) Mike
> 
> 
> > 
> 


From benc at hawaga.org.uk  Sun Jun 17 13:31:56 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sun, 17 Jun 2007 18:31:56 +0000 (GMT)
Subject: [Swift-devel] XML/infix hybrid
Message-ID: <Pine.LNX.4.64.0706171830090.10634@dildano.hawaga.org.uk>


I got annoyed today, while playing with the language parser/compiler stuff, 
with the XML/infix hybrid expression syntax in the intermediate language, so I 
basically spent my day changing it to use entirely XML syntax - the 
intermediate form will thus look a little more Karajan-like and a little 
less C-like.
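
To illustrate the idea only: here is a rough Python sketch of turning an 
infix expression into a purely XML form. It uses Python's own parser as a 
stand-in for the Swift one, and the element names (sum, product, variable, 
number) are made up for this example; they are not the names used in the 
actual intermediate language.

```python
import ast
import xml.etree.ElementTree as ET

# Hypothetical element names for the binary operators.
_OPERATOR_TAGS = {
    ast.Add: "sum",
    ast.Sub: "difference",
    ast.Mult: "product",
    ast.Div: "quotient",
}


def to_xml(node):
    """Recursively convert a parsed infix expression into an XML element."""
    if isinstance(node, ast.Expression):
        return to_xml(node.body)
    if isinstance(node, ast.BinOp) and type(node.op) in _OPERATOR_TAGS:
        elem = ET.Element(_OPERATOR_TAGS[type(node.op)])
        elem.append(to_xml(node.left))   # operands become child elements,
        elem.append(to_xml(node.right))  # so no infix text is left over
        return elem
    if isinstance(node, ast.Name):
        elem = ET.Element("variable")
        elem.text = node.id
        return elem
    if (isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)):
        elem = ET.Element("number")
        elem.text = repr(node.value)
        return elem
    raise ValueError("unsupported expression node: %s" % type(node).__name__)


def infix_to_xml(source):
    """Return the all-XML form of an infix expression as a string."""
    tree = ast.parse(source, mode="eval")
    return ET.tostring(to_xml(tree), encoding="unicode")
```

With these made-up names, an expression such as a + 2 * b becomes a sum 
element whose second child is a product element, so operator precedence is 
carried by the nesting instead of by the infix text.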

-- 


From iraicu at cs.uchicago.edu  Mon Jun 18 16:37:48 2007
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Mon, 18 Jun 2007 16:37:48 -0500
Subject: [Swift-devel] CPU usage with provider-deef
In-Reply-To: <Pine.LNX.4.64.0706162313300.15250@dildano.hawaga.org.uk>
References: <Pine.LNX.4.64.0706151634060.10634@dildano.hawaga.org.uk>	<1181984306.10455.3.camel@blabla.mcs.anl.gov>	<Pine.LNX.4.64.0706161258280.10634@dildano.hawaga.org.uk>	<4673F0E9.3060505@cs.uchicago.edu>	<Pine.LNX.4.64.0706161418470.10634@dildano.hawaga.org.uk>	<4673F548.5070608@cs.uchicago.edu>	<Pine.LNX.4.64.0706161438550.10634@dildano.hawaga.org.uk>	<4673FB19.6070305@cs.uchicago.edu>	<1182006166.11495.1.camel@blabla.mcs.anl.gov><467408D5.2020709@mcs.anl.gov>	<467417DE.50408@mcs.anl.gov><1182018587.12013.14.camel@blabla.mcs.anl.gov>	<Pine.LNX.4.64.0706161932560.15250@dildano.hawaga.org.uk>	<537273469-1182023623-cardhu_decombobulator_blackberry.rim.net-728646234-@bxe006.bisx.prod.on.blackberry>	<1182024904.12401.11.camel@blabla.mcs.anl.gov>	<46746A7F.7040004@mcs.anl.gov>
	<Pine.LNX.4.64.0706162313300.15250@dildano.hawaga.org.uk>
Message-ID: <4676FB2C.9010004@cs.uchicago.edu>

There is a diagram in the SC paper that shows the flow of messages and 
what each one does; these messages also map to the WSDL of the Falkon 
service.  Mihael, Ben, or whoever else wants to start digging through 
the Falkon code: a meeting might be good to go over the 
organization of the code, the message flow diagram, configuration 
options, etc.  I would prefer a meeting over drafting more 
documents; I think it would be a more effective use of my time for now.

Ioan

Ben Clifford wrote:
> The WSDL should describe part of the web services bit of the protocol. 
> That might be a good place to start. The WSDL should already describe the 
> messages that go over the wire in something vaguely readable to a human. 
> Probably what would be needed would be the extra info to say which order 
> messages are sent.
>
> On Sat, 16 Jun 2007, Mike Wilde wrote:
>
>   
>> I think a nice clean message sequence chart describing Falkon's various
>> activities would be very useful, as it's the backbone of its logic.
>>
>> I tried to create this for the SC paper by asking Ioan to describe the
>> protocol to me, but I did not succeed.
>>
>> I think this would be a very useful description to maintain, as a UML sequence
>> chart, and, Ioan, that this would be a very important part of your thesis or of
>> future papers.
>>
>> It's up to you and Ian to weigh whether this would be valuable to your
>> research.  I think it's invaluable for design, review and debugging.
>>
>> - Mike
>>
>>
>>
>>
>> Mihael Hategan wrote, On 6/16/2007 3:15 PM:
>>     
>>> On Sat, 2007-06-16 at 19:52 +0000, Ian Foster wrote:
>>>       
>>>> I wasn't suggesting (at least in the first instance) that Mihael take the
>>>> prototype and turn it into production, but that Mihael and Ioan sit down
>>>> together and do a code walkthrough. I think that this would likely
>>>> identify bugs and opportunities for simplification.
>>>>         
>>> It's somewhat on the same level of fun :). From the experience I've
>>> accumulated so far, design is hard. Understanding prototype design is
>>> probably even harder (not only do you need to understand the problem,
>>> you also need to understand why many non-obvious things are done the way
>>> they are done).
>>>
>>> Mihael
>>>
>>>       
>>>> Ian
>>>>
>>>>
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Ben Clifford <benc at hawaga.org.uk>
>>>>
>>>> Date: Sat, 16 Jun 2007 19:36:17 To:Mihael Hategan <hategan at mcs.anl.gov>
>>>> Cc:swift-devel at ci.uchicago.edu
>>>> Subject: Re: [Swift-devel] CPU usage with provider-deef
>>>>
>>>>
>>>>
>>>> On Sat, 16 Jun 2007, Mihael Hategan wrote:
>>>>
>>>>         
>>>>> On Sat, 2007-06-16 at 12:03 -0500, Mike Wilde wrote:
>>>>>           
>>>>>> This should be fun, and a nice break from the I2U2 work that you've
>>>>>> been immersed in, Mihael.
>>>>>>             
>>>>> I have my reservations towards the amount of fun it involves.
>>>>>           
>>>> Right, taking prototypes and turning them into production isn't
>>>> necessarily fun - in fact, a lot of the fun already happened with the
>>>> making of the prototype and the rest is somewhat drudgery. (to an extent
>>>> that's the same situation i2u2 cosmic was/is in).
>>>>
>>>>         
>>>
>>>
>>>       
>>     
>
>   

-- 
============================================
Ioan Raicu
Ph.D. Student
============================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
============================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
       http://dsl.cs.uchicago.edu/
============================================
============================================


From foster at mcs.anl.gov  Mon Jun 18 21:54:21 2007
From: foster at mcs.anl.gov (Ian Foster)
Date: Mon, 18 Jun 2007 21:54:21 -0500
Subject: [Swift-devel] on the semantics of 'array closing'
In-Reply-To: <4674096B.4020109@mcs.anl.gov>
References: <Pine.LNX.4.64.0706151924490.10634@dildano.hawaga.org.uk>	<4672F5E3.7060205@mcs.anl.gov>	<Pine.LNX.4.64.0706152050270.10634@dildano.hawaga.org.uk>
	<4674096B.4020109@mcs.anl.gov>
Message-ID: <4677455D.1070909@mcs.anl.gov>

Interesting issue, as raised by Mike: it seems that we want to define a 
"map" function that can specify various "conditions" on the map, e.g., 
the length of time to wait for results, what to do if not all results 
are returned, etc. I wonder if others have done that before?
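
To make the question concrete, here is a rough sketch in Python (not 
SwiftScript) of what such a "map with conditions" could look like. The name 
map_with_conditions and its timeout / on_missing / default parameters are 
invented for this illustration and do not correspond to anything in Swift.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def map_with_conditions(func, inputs, timeout=None, on_missing="error", default=None):
    """Apply func to every input in parallel, waiting at most `timeout` seconds.

    on_missing says what to do with elements that have no result in time:
      "error"   -> raise TimeoutError
      "default" -> put `default` in the place of each missing result
    An exception raised by func itself is re-raised to the caller.
    """
    if on_missing not in ("error", "default"):
        raise ValueError("unknown on_missing policy: %r" % (on_missing,))
    pool = ThreadPoolExecutor(max_workers=max(1, len(inputs)))
    try:
        futures = [pool.submit(func, x) for x in inputs]
        done, not_done = wait(futures, timeout=timeout)
        if not_done and on_missing == "error":
            raise TimeoutError(
                "%d of %d results missing" % (len(not_done), len(futures)))
        # Results keep the order of the inputs, like an ordinary map.
        return [f.result() if f in done else default for f in futures]
    finally:
        # Do not block on stragglers; they are simply abandoned.
        pool.shutdown(wait=False)


def slow_times_ten(x):
    """Stand-in for a grid job; the input 2 is deliberately slow."""
    time.sleep(1.0 if x == 2 else 0.0)
    return x * 10
```

Calling map_with_conditions(slow_times_ten, [1, 2, 3], timeout=0.3, 
on_missing="default") would then return a result for the fast elements and 
the default for the slow one, which is one possible answer to "what to do if 
not all results are returned".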

Ian Foster wrote:
> I like the notion of having a "map" function. If that could entirely 
> replace the current element assignments, that would be a wonderful 
> simplification, it seems to me.
>
> Ian.
>
> Ben Clifford wrote:
>> There's a different approach, which is to say that 'a' is a variable and 
>> can be assigned to once. Thus assignment syntax like a[0]=something 
>> becomes illegal and we need more functional language constructs. So 
>> instead of writing:
>>
>> for e,i in input_array {
>>   output_array[i] = p(e);
>> }
>>
>> we would write:
>>
>> output_array = foreach i in input_array {
>>   return p(i);
>> }
>>
>> (it's a Haskell map in different syntax!)
>>
>> That means that, at the language level, output_array is now properly 
>> single assignment.
>>
>>
>> On Fri, 15 Jun 2007, Ian Foster wrote:
>>
>>   
>>> Hi,
>>>
>>> For:
>>>
>>>  a[0] = p()
>>>  a[1] = q()
>>>  b = s(a)
>>>
>>> I think there are two distinct issues.
>>>
>>> a) Determining the size of the array. This could presumably be done by
>>> declaring it, e.g.:
>>>
>>>  a[2] or some similar notion
>>>  a[0] = p()
>>>  a[1] = q()
>>>  b = s(a)
>>>
>>> or by some "closing" concept.
>>>
>>> b) Whether or not each element of an array is a separate single-assignment
>>> variable. If they are, then the code above should work just fine. If they are
>>> not, then we have a couple of behaviors we could define. One would be that
>>> b=s(a) blocks until all elements in "a" are defined. The other is that we have
>>> a way of "closing" (once again). In that case, we have to define what happens
>>> if b=s(a) accesses an element that is not defined.
>>>
>>> Ian.
>>>
>>> Ben Clifford wrote:
>>>     
>>>> There is a problem that has been called the 'array closing problem'.
>>>>
>>>> It manifests itself in the tutorial in that certain bits of code that
>>>> intuitively can go either in a procedure or at the top level can, in practice,
>>>> only go into a procedure.
>>>>
>>>> In that context, I tried to think about better ways to explain/document the
>>>> behaviour than "mumble mumble move that code into a procedure".
>>>>
>>>> In Swift we claim to have 'single assignment variables'.
>>>>
>>>> From single assignment variables we get our grid job ordering:
>>>>
>>>>   a = p()
>>>>   b = s(a)
>>>>
>>>> causes first grid job p to run, and when that has completed, then grid job s
>>>> will run.
>>>>
>>>> This is the same as if we had written:
>>>>
>>>>   b = s(a)
>>>>   a = p()
>>>>
>>>> The ordering comes from the use of a as an 'output' for p and an 'input' for
>>>> s, not from source text ordering.
>>>>
>>>> In that model, it's meaningless to assign two different things to a, like
>>>> this:
>>>>
>>>>   a = p()
>>>>   b = s(a)
>>>>   a = t()
>>>>
>>>>
>>>> Note that I've omitted the data types from the above. This works in the
>>>> implementation for simple types such as a datafile marker type.
>>>>
>>>> What is important is that each variable is either unassigned or has its
>>>> single value - whenever we refer to that variable, we can either use the
>>>> value it has, or defer evaluation of that expression until the variable has
>>>> its value.
>>>>
>>>> Now consider arrays. In the present syntax, arrays can be passed as single
>>>> (complex) values to/from procedures, like before:
>>>>
>>>>   a = p()
>>>>   b = s(a)
>>>>
>>>> Here a and b are array types.
>>>>
>>>> That's fine. a is assigned to by the first statement, and b is assigned to
>>>> by the second statement.
>>>>
>>>> But we also support a different assignment syntax for arrays, that looks
>>>> like this:
>>>>
>>>>   a[0] = p()
>>>>   a[1] = q()
>>>>   b = s(a)
>>>>
>>>> This fails at the moment (specifically, I think the execution engine will
>>>> hang).
>>>>
>>>> Why? Because there is no one point at which we assign a value to 'a' - the
>>>> assignment is split over multiple statements, which can be in various places
>>>> (and inside loops etc).
>>>>
>>>> There is nothing in the implementation that detects that a has been assigned
>>>> its value.
>>>>
>>>> So there is this notion in the Karajan intermediate code of 'closing an
>>>> array'.  This is an assertion made in the object code that all assignments
>>>> to pieces of an array have been made - that, in effect, the array has its
>>>> value.
>>>>
>>>> The suggested hack/workaround for this is to move the array element
>>>> assignments into a procedure:
>>>>
>>>>  (file f[]) z() {
>>>>    f[0] = p();
>>>>    f[1] = q();
>>>>  }
>>>>
>>>>  a = z()
>>>>  b = s(a)
>>>>
>>>> This works. (which is sort-of a violation of referential transparency)
>>>>
>>>> It works because Swift implicitly marks arrays returned from compound
>>>> procedures as closed (which may or may not be correct).
>>>>
>>>> So in most variable scopes, arrays behave like single-assignment variables,
>>>> but each array can have one specific scope in which members can be assigned
>>>> to. In that scope, the array cannot be treated as a whole variable.
>>>>
>>>> In the z() example above, that special scope is the body of z(). In the
>>>> previous example, that scope is the global scope, and the program is invalid
>>>> by the rule above that the array cannot be referred to as a whole in the
>>>> same place that its members are individually assigned to.
>>>>
>>>> That's my explanation of what's going on now. I think it matches reality. I
>>>> don't like that this is reality, but it is what we have.
>>>>
>>>> Comments appreciated.
>>>>
>>>>   
>>>>       
>>>     
>>
>>   
>
> -- 
>
>    Ian Foster, Director, Computation Institute
> Argonne National Laboratory & University of Chicago
> Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439
> Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637
> Tel: +1 630 252 4619.  Web: www.ci.uchicago.edu.
>       Globus Alliance: www.globus.org.
>   
>   

-- 

   Ian Foster, Director, Computation Institute
Argonne National Laboratory & University of Chicago
Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439
Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637
Tel: +1 630 252 4619.  Web: www.ci.uchicago.edu.
      Globus Alliance: www.globus.org.


From hategan at mcs.anl.gov  Tue Jun 19 03:22:51 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 19 Jun 2007 11:22:51 +0300
Subject: [Swift-devel] CPU usage with provider-deef
In-Reply-To: <4676FB2C.9010004@cs.uchicago.edu>
References: <Pine.LNX.4.64.0706151634060.10634@dildano.hawaga.org.uk>
	<1181984306.10455.3.camel@blabla.mcs.anl.gov>
	<Pine.LNX.4.64.0706161258280.10634@dildano.hawaga.org.uk>
	<4673F0E9.3060505@cs.uchicago.edu>
	<Pine.LNX.4.64.0706161418470.10634@dildano.hawaga.org.uk>
	<4673F548.5070608@cs.uchicago.edu>
	<Pine.LNX.4.64.0706161438550.10634@dildano.hawaga.org.uk>
	<4673FB19.6070305@cs.uchicago.edu>
	<1182006166.11495.1.camel@blabla.mcs.anl.gov><467408D5.2020709@mcs.anl.gov>
	<467417DE.50408@mcs.anl.gov><1182018587.12013.14.camel@blabla.mcs.anl.gov>
	<Pine.LNX.4.64.0706161932560.15250@dildano.hawaga.org.uk>
	<537273469-1182023623-cardhu_decombobulator_blackberry.rim.net-728646234-@bxe006.bisx.prod.on.blackberry>
	<1182024904.12401.11.camel@blabla.mcs.anl.gov>
	<46746A7F.7040004@mcs.anl.gov>
	<Pine.LNX.4.64.0706162313300.15250@dildano.hawaga.org.uk>
	<4676FB2C.9010004@cs.uchicago.edu>
Message-ID: <1182241371.17515.2.camel@blabla.mcs.anl.gov>

On Mon, 2007-06-18 at 16:37 -0500, Ioan Raicu wrote:
> There is a diagram in the SC paper that has the flow of messages, and
> what each does; these messages also map to the WSDL of the Falkon
> service.  Mihael, Ben, or whoever else wants to start digging through
> the Falkon code, maybe a meeting might be good to go over the
> organization of the code, the message flow diagram, configuration
> options, etc...  I would prefer a meeting over drafting up more
> documents, I think it would be more time effective for me for now.

It might be the reverse for us. Faced with a lot of new information, we
are unlikely to retain all of it consistently. Some reference material
might be handy.

> 
> Ioan
> 
> Ben Clifford wrote: 
> > The WSDL should describe part of the web services bit of the protocol. 
> > That might be a good place to start. The WSDL should already describe the 
> > messages that go over the wire in something vaguely readable to a human. 
> > Probably what would be needed would be the extra info to say which order 
> > messages are sent.
> > 
> > On Sat, 16 Jun 2007, Mike Wilde wrote:
> > 
> >   
> > > I think a nice clean message sequence chart describing Falkon's various
> > > activities would be very useful, as it's the backbone of its logic.
> > > 
> > > I tried to create this for the SC paper by asking Ioan to describe the
> > > protocol to me, but I did not succeed.
> > > 
> > > I think this would be a very useful description to maintain, as a UML sequence
> > > chart, and, Ioan, that this would be a very important part of your thesis or of
> > > future papers.
> > > 
> > > It's up to you and Ian to weigh whether this would be valuable to your
> > > research.  I think it's invaluable for design, review and debugging.
> > > 
> > > - Mike
> > > 
> > > 
> > > 
> > > 
> > > Mihael Hategan wrote, On 6/16/2007 3:15 PM:
> > >     
> > > > On Sat, 2007-06-16 at 19:52 +0000, Ian Foster wrote:
> > > >       
> > > > > I wasn't suggesting (at least in the first instance) that Mihael take the
> > > > > prototype and turn it into production, but that Mihael and Ioan sit down
> > > > > together and do a code walkthrough. I think that this would likely
> > > > > identify bugs and opportunities for simplification.
> > > > >         
> > > > It's somewhat on the same level of fun :). From the experience I've
> > > > accumulated so far, design is hard. Understanding prototype design is
> > > > probably even harder (not only do you need to understand the problem,
> > > > you also need to understand why many non-obvious things are done the way
> > > > they are done).
> > > > 
> > > > Mihael
> > > > 
> > > >       
> > > > > Ian
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > -----Original Message-----
> > > > > From: Ben Clifford <benc at hawaga.org.uk>
> > > > > 
> > > > > Date: Sat, 16 Jun 2007 19:36:17 To:Mihael Hategan <hategan at mcs.anl.gov>
> > > > > Cc:swift-devel at ci.uchicago.edu
> > > > > Subject: Re: [Swift-devel] CPU usage with provider-deef
> > > > > 
> > > > > 
> > > > > 
> > > > > On Sat, 16 Jun 2007, Mihael Hategan wrote:
> > > > > 
> > > > >         
> > > > > > On Sat, 2007-06-16 at 12:03 -0500, Mike Wilde wrote:
> > > > > >           
> > > > > > > This should be fun, and a nice break from the I2U2 work that you've
> > > > > > > been immersed in, Mihael.
> > > > > > >             
> > > > > > I have my reservations towards the amount of fun it involves.
> > > > > >           
> > > > > Right, taking prototypes and turning them into production isn't
> > > > > necessarily fun - in fact, a lot of the fun already happened with the
> > > > > making of the prototype and the rest is somewhat drudgery. (to an extent
> > > > > that's the same situation i2u2 cosmic was/is in).
> > > > > 
> > > > >         
> > > > 
> > > > 
> > > >       
> > 
> >   
> 
> -- 
> ============================================
> Ioan Raicu
> Ph.D. Student
> ============================================
> Distributed Systems Laboratory
> Computer Science Department
> University of Chicago
> 1100 E. 58th Street, Ryerson Hall
> Chicago, IL 60637
> ============================================
> Email: iraicu at cs.uchicago.edu
> Web:   http://www.cs.uchicago.edu/~iraicu
>        http://dsl.cs.uchicago.edu/
> ============================================
> ============================================


From hategan at mcs.anl.gov  Tue Jun 19 03:55:30 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 19 Jun 2007 11:55:30 +0300
Subject: [Swift-devel] on the semantics of 'array closing'
In-Reply-To: <4677455D.1070909@mcs.anl.gov>
References: <Pine.LNX.4.64.0706151924490.10634@dildano.hawaga.org.uk>
	<4672F5E3.7060205@mcs.anl.gov>
	<Pine.LNX.4.64.0706152050270.10634@dildano.hawaga.org.uk>
	<4674096B.4020109@mcs.anl.gov>  <4677455D.1070909@mcs.anl.gov>
Message-ID: <1182243330.17515.14.camel@blabla.mcs.anl.gov>

On Mon, 2007-06-18 at 21:54 -0500, Ian Foster wrote:
> Interesting issue, as raised by Mike: it seems that we want to define
> a "map" function that can specify various "conditions" on the map,
> e.g., the length of time to wait for results, what to do if not all
> results are returned, etc. I wonder if others have done that before?

There are two types of errors and time-outs:
1. The ones that occur as part of the workflow:
 - A computation on some data does not return the expected files
 - The user wants to run multiple algorithms on some data and select
only the fastest one, or the ones that finish within a specific time.
 - etc.
2. The ones that occur because of exceptional conditions in the system:
 - A badly configured site fails to run things
 - A job sits for a long time in a queue

I think the first class can be handled in the language, but the second
should not be.

Occam seems to support such things by composing various keywords (e.g.
PAR ... FOR and then timeouts or error handling). I personally favor
composing smaller dedicated functions/keywords over big, do-it-all
functions.
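
As a rough illustration of the compositional style (in Python, not Occam or
SwiftScript), the sketch below builds a guarded parallel loop out of three
small pieces. The names par_for, with_timeout and on_error are hypothetical,
chosen only for this example, and it covers only the first class of errors,
the ones that are part of the workflow.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def with_timeout(func, seconds):
    """Wrap func so that a call taking longer than `seconds` raises a timeout."""
    def run(x):
        pool = ThreadPoolExecutor(max_workers=1)
        try:
            return pool.submit(func, x).result(timeout=seconds)
        finally:
            pool.shutdown(wait=False)  # abandon the straggler, do not wait for it
    return run


def on_error(func, fallback):
    """Wrap func so that any exception (including a timeout) yields `fallback`."""
    def run(x):
        try:
            return func(x)
        except Exception:
            return fallback
    return run


def par_for(func):
    """Lift func so it is applied to every element of a list in parallel."""
    def run(items):
        with ThreadPoolExecutor(max_workers=max(1, len(items))) as pool:
            return list(pool.map(func, items))  # keeps input order
    return run


def slow_job(x):
    """Stand-in for a workflow step; the input 2 is deliberately slow."""
    time.sleep(1.5 if x == 2 else 0.0)
    return x * 10


# Each concern is a separate small piece, and they nest freely.
guarded_loop = par_for(on_error(with_timeout(slow_job, 0.5), None))
```

Each wrapper does one thing, so dropping the timeout or changing the error
policy means removing or swapping one layer rather than adding another
option to a single large function.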

Mihael

> 
> Ian Foster wrote: 
> > I like the notion of having a "map" function. If that could entirely
> > replace the current element assignments, that would be a wonderful
> > simplification, it seems to me.
> > 
> > Ian.
> > 
> > Ben Clifford wrote: 
> > > There's a different approach, which is to say that 'a' is a variable and 
> > > can be assigned to once. Thus assignment syntax like a[0]=something 
> > > becomes illegal and we need more functional language constructs. So 
> > > instead of writing:
> > > 
> > > for e,i in input_array {
> > >   output_array[i] = p(e);
> > > }
> > > 
> > > we would write:
> > > 
> > > output_array = foreach i in input_array {
> > >   return p(i);
> > > }
> > > 
> > > (it's a Haskell map in different syntax!)
> > > 
> > > That means that, at the language level, output_array is now properly 
> > > single assignment.
> > > 
> > > 
> > > On Fri, 15 Jun 2007, Ian Foster wrote:
> > > 
> > >   
> > > > Hi,
> > > > 
> > > > For:
> > > > 
> > > >  a[0] = p()
> > > >  a[1] = q()
> > > >  b = s(a)
> > > > 
> > > > I think there are two distinct issues.
> > > > 
> > > > a) Determining the size of the array. This could presumably be done by
> > > > declaring it, e.g.:
> > > > 
> > > >  a[2] or some similar notion
> > > >  a[0] = p()
> > > >  a[1] = q()
> > > >  b = s(a)
> > > > 
> > > > or by some "closing" concept.
> > > > 
> > > > b) Whether or not each element of an array is a separate single-assignment
> > > > variable. If they are, then the code above should work just fine. If they are
> > > > not, then we have a couple of behaviors we could define. One would be that
> > > > b=s(a) blocks until all elements in "a" are defined. The other is that we have
> > > > a way of "closing" (once again). In that case, we have to define what happens
> > > > if b=s(a) accesses an element that is not defined.
> > > > 
> > > > Ian.
> > > > 
> > > > Ben Clifford wrote:
> > > >     
> > > > > There is a problem that has been called the 'array closing problem'.
> > > > > 
> > > > > It manifests itself in the tutorial in that certain bits of code that
> > > > > intuitively can go either in a procedure or at the top level can, in practice,
> > > > > only go into a procedure.
> > > > > 
> > > > > In that context, I tried to think about better ways to explain/document the
> > > > > behaviour than "mumble mumble move that code into a procedure".
> > > > > 
> > > > > In Swift we claim to have 'single assignment variables'.
> > > > > 
> > > > > From single assignment variables we get our grid job ordering:
> > > > > 
> > > > >   a = p()
> > > > >   b = s(a)
> > > > > 
> > > > > causes first grid job p to run, and when that has completed, then grid job s
> > > > > will run.
> > > > > 
> > > > > This is the same as if we had written:
> > > > > 
> > > > >   b = s(a)
> > > > >   a = p()
> > > > > 
> > > > > The ordering comes from the use of a as an 'output' for p and an 'input' for
> > > > > s, not from source text ordering.
> > > > > 
> > > > > In that model, it's meaningless to assign two different things to a, like
> > > > > this:
> > > > > 
> > > > >   a = p()
> > > > >   b = s(a)
> > > > >   a = t()
> > > > > 
> > > > > 
> > > > > Note that I've omitted the data types from the above. This works in the
> > > > > implementation for simple types such as a datafile marker type.
> > > > > 
> > > > > What is important is that each variable is either unassigned or has its
> > > > > single value - whenever we refer to that variable, we can either use the
> > > > > value it has, or defer evaluation of that expression until the variable has
> > > > > its value.
> > > > > 
> > > > > Now consider arrays. In the present syntax, arrays can be passed as single
> > > > > (complex) values to/from procedures, like before:
> > > > > 
> > > > >   a = p()
> > > > >   b = s(a)
> > > > > 
> > > > > Here a and b are array types.
> > > > > 
> > > > > That's fine. a is assigned to by the first statement, and b is assigned to
> > > > > by the second statement.
> > > > > 
> > > > > But we also support a different assignment syntax for arrays, that looks
> > > > > like this:
> > > > > 
> > > > >   a[0] = p()
> > > > >   a[1] = q()
> > > > >   b = s(a)
> > > > > 
> > > > > This fails at the moment (specifically, I think the execution engine will
> > > > > hang).
> > > > > 
> > > > > > Why? Because there is no one point at which we assign a value to 'a' - the
> > > > > assignment is split over multiple statements, which can be in various places
> > > > > (and inside loops etc).
> > > > > 
> > > > > There is nothing in the implementation that detects that a has been assigned
> > > > > its value.
> > > > > 
> > > > > So there is this notion in the karajan intermediate code of 'closing an
> > > > > array'.  This is an assertion made in the object code that all assignments
> > > > > > to pieces of an array have been made - that, in effect, the array has its
> > > > > value.
> > > > > 
> > > > > The suggested hack/workaround for this is to move the array element
> > > > > assignments into a procedure:
> > > > > 
> > > > >  (file f[]) z() {
> > > > >    f[0] = p();
> > > > > >    f[1] = q();
> > > > >  }
> > > > > 
> > > > >  a = z()
> > > > >  b = s(a)
> > > > > 
> > > > > This works. (which is sort-of a violation of referential transparency)
> > > > > 
> > > > > It works because Swift implicitly marks arrays returned from compound
> > > > > procedures as closed (which may or may not be correct).
> > > > > 
> > > > > So in most variable scopes, arrays behave like single-assignment variables,
> > > > > but each array can have one specific scope in which members can be assigned
> > > > > to. In that scope, the array cannot be treated as a whole variable.
> > > > > 
> > > > > In the z() example above, that special scope is the body of z(). In the
> > > > > previous example, that scope is the global scope, and the program is invalid
> > > > > by the rule above that the array cannot be referred to as a whole in the
> > > > > same place that its members are individually assigned to.
> > > > > 
> > > > > That's my explanation of what's going on now. I think it matches reality. I
> > > > > don't like that this is reality, but it is what we have.
> > > > > 
> > > > > Comments appreciated.
> > > > > 
> > > > >   
> > > > >       
> > > 
> > >   
> > 
> > -- 
> > 
> >    Ian Foster, Director, Computation Institute
> > Argonne National Laboratory & University of Chicago
> > Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439
> > Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637
> > Tel: +1 630 252 4619.  Web: www.ci.uchicago.edu.
> >       Globus Alliance: www.globus.org.
> >   
> > 
> > ____________________________________________________________________
> > 
> > _______________________________________________
> > Swift-devel mailing list
> > Swift-devel at ci.uchicago.edu
> > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> >   
> 
> -- 
> 
>    Ian Foster, Director, Computation Institute
> Argonne National Laboratory & University of Chicago
> Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439
> Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637
> Tel: +1 630 252 4619.  Web: www.ci.uchicago.edu.
>       Globus Alliance: www.globus.org.


From wilde at mcs.anl.gov  Tue Jun 19 07:10:13 2007
From: wilde at mcs.anl.gov (Mike Wilde)
Date: Tue, 19 Jun 2007 07:10:13 -0500
Subject: [Swift-devel] CPU usage with provider-deef
In-Reply-To: <1182241371.17515.2.camel@blabla.mcs.anl.gov>
References: <Pine.LNX.4.64.0706151634060.10634@dildano.hawaga.org.uk>	<1181984306.10455.3.camel@blabla.mcs.anl.gov>	<Pine.LNX.4.64.0706161258280.10634@dildano.hawaga.org.uk>	<4673F0E9.3060505@cs.uchicago.edu>	<Pine.LNX.4.64.0706161418470.10634@dildano.hawaga.org.uk>	<4673F548.5070608@cs.uchicago.edu>	<Pine.LNX.4.64.0706161438550.10634@dildano.hawaga.org.uk>	<4673FB19.6070305@cs.uchicago.edu>	<1182006166.11495.1.camel@blabla.mcs.anl.gov><467408D5.2020709@mcs.anl.gov>	<467417DE.50408@mcs.anl.gov><1182018587.12013.14.camel@blabla.mcs.anl.gov>	<Pine.LNX.4.64.0706161932560.15250@dildano.hawaga.org.uk>	<537273469-1182023623-cardhu_decombobulator_blackberry.rim.net-728646234-@bxe006.bisx.prod.on.blackberry>	<1182024904.12401.11.camel@blabla.mcs.anl.gov>	<46746A7F.7040004@mcs.anl.gov>	<Pine.LNX.4.64.0706162313300.15250@dildano.hawaga.org.uk>	<4676FB2C.9010004@cs.uchicago.edu>
	<1182241371.17515.2.camel@blabla.mcs.anl.gov>
Message-ID: <4677C7A5.6040004@mcs.anl.gov>

Ioan, I feel that documenting the message flow is key to 
understanding the system, and that such documentation will be 
indispensable for you (and the Swift team) to make progress with Falkon.

If it's the case, as you stated to me, that you don't have the time to 
fully support Falkon for end-user use, which I accept, then the only 
way to enable the Swift team to provide this support is for us to 
understand the tool.

The three things you propose to communicate to us in a face-to-face 
meeting (organization of the code, the message flow diagram, 
configuration options) are exactly the things that need to be 
documented.

But you and Ian together need to decide and propose how you want to 
see Falkon used and how you intend to make it supportable for that 
purpose.

For Swift's goals, we *know* we need Falkon's capabilities to 
succeed, and the question is how much and how long can we use 
Falkon, and when would we need to start improving or rewriting it.

- Mike


Mihael Hategan wrote, On 6/19/2007 3:22 AM:
> On Mon, 2007-06-18 at 16:37 -0500, Ioan Raicu wrote:
>> There is a diagram in the SC paper that has the flow of messages, and
>> what each does; these messages also map to the WSDL of the Falkon
>> service.  Mihael, Ben, or whoever else wants to start digging through
>> the Falkon code, maybe a meeting might be good to go over the
>> organization of the code, the message flow diagram, configuration
>> options, etc...  I would prefer a meeting over drafting up more
>> documents, I think it would be more time effective for me for now.
> 
> It might be the reverse for us. It's unlikely that faced with a lot of
> new information, we will consistently retain all of it. Something of
> reference might be handy.
> 
>> Ioan
>>
>> Ben Clifford wrote: 
>>> The WSDL should describe part of the web services bit of the protocol. 
>>> That might be a good place to start. The WSDL should already describe the 
>>> messages that go over the wire in something vaguely readable to a human. 
>>> Probably what would be needed would be the extra info to say which order 
>>> messages are sent.
>>>
>>> On Sat, 16 Jun 2007, Mike Wilde wrote:
>>>
>>>   
>>>> I think a nice clean message sequence chart describing Falkon's various
>>>> activities would be very useful, as it's the backbone of its logic.
>>>>
>>>> I tried to create this for the SC paper by asking Ioan to describe the
>>>> protocol to me, but I did not succeed.
>>>>
>>>> I think this would be a very useful description to maintain, as a UML sequence
>>>> chart, and that, Ioan, this would be a very important part of your thesis or of
>>>> future papers.
>>>>
>>>> It's up to you and Ian to weigh whether this would be valuable to your
>>>> research.  I think it's invaluable for design, review and debugging.
>>>>
>>>> - Mike
>>>>
>>>>
>>>>
>>>>
>>>> Mihael Hategan wrote, On 6/16/2007 3:15 PM:
>>>>     
>>>>> On Sat, 2007-06-16 at 19:52 +0000, Ian Foster wrote:
>>>>>       
>>>>>> I wasn't suggesting (at least in the first instance) that Mihael take the
>>>>>> prototype and turn it into production, but that Mihael and Ioan sit down
>>>>>> together and do a code walkthrough. I think that this would likely
>>>>>> identify bugs and opportunities for simplification.
>>>>>>         
>>>>> It's somewhat on the same level of fun :). From the experience I've
>>>>> accumulated so far, design is hard. Understanding prototype design is
>>>>> probably even harder (not only do you need to understand the problem,
>>>>> you also need to understand why many non-obvious things are done the way
>>>>> they are done).
>>>>>
>>>>> Mihael
>>>>>
>>>>>       
>>>>>> Ian
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Ben Clifford <benc at hawaga.org.uk>
>>>>>>
>>>>>> Date: Sat, 16 Jun 2007 19:36:17 To:Mihael Hategan <hategan at mcs.anl.gov>
>>>>>> Cc:swift-devel at ci.uchicago.edu
>>>>>> Subject: Re: [Swift-devel] CPU usage with provider-deef
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Sat, 16 Jun 2007, Mihael Hategan wrote:
>>>>>>
>>>>>>         
>>>>>>> On Sat, 2007-06-16 at 12:03 -0500, Mike Wilde wrote:
>>>>>>>           
>>>>>>>> This should be fun, and a nice break from the I2U2 work that you've
>>>>>>>> been immersed in, Mihael.
>>>>>>>>             
>>>>>>> I have my reservations towards the amount of fun it involves.
>>>>>>>           
>>>>>> Right, taking prototypes and turning them into production isn't
>>>>>> necessarily fun - in fact, a lot of the fun already happened with the
>>>>>> making of the prototype and the rest is somewhat drudgery. (to an extent
>>>>>> that's the same situation i2u2 cosmic was/is in).
>>>>>>
>>>>>>         
>>>>>
>>>>>
>>>>>       
>>>
>>>   
>> -- 
>> ============================================
>> Ioan Raicu
>> Ph.D. Student
>> ============================================
>> Distributed Systems Laboratory
>> Computer Science Department
>> University of Chicago
>> 1100 E. 58th Street, Ryerson Hall
>> Chicago, IL 60637
>> ============================================
>> Email: iraicu at cs.uchicago.edu
>> Web:   http://www.cs.uchicago.edu/~iraicu
>>        http://dsl.cs.uchicago.edu/
>> ============================================
>> ============================================
> 
> 
> 

-- 
Mike Wilde
Computation Institute, University of Chicago
Math & Computer Science Division
Argonne National Laboratory
Argonne, IL   60439    USA
tel 630-252-7497 fax 630-252-1997


From hategan at mcs.anl.gov  Tue Jun 19 07:55:10 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 19 Jun 2007 15:55:10 +0300
Subject: [Swift-devel] on the semantics of 'array closing'
In-Reply-To: <4677CEA1.8050806@mcs.anl.gov>
References: <Pine.LNX.4.64.0706151924490.10634@dildano.hawaga.org.uk>
	<4672F5E3.7060205@mcs.anl.gov>
	<Pine.LNX.4.64.0706152050270.10634@dildano.hawaga.org.uk>
	<4674096B.4020109@mcs.anl.gov>  <4677455D.1070909@mcs.anl.gov>
	<1182243330.17515.14.camel@blabla.mcs.anl.gov>
	<4677CEA1.8050806@mcs.anl.gov>
Message-ID: <1182257710.18810.15.camel@blabla.mcs.anl.gov>

On Tue, 2007-06-19 at 07:40 -0500, Mike Wilde wrote:
> This is a good breakdown, but I don't yet see how to distinguish 
> between the two situations you lay out here, Mihael.

The distinction is made by thinking about local vs. grid execution. If
something cannot occur during local execution but can occur during grid
execution, we probably want to hide that from the user as much as
possible. Programming against randomly unreliable systems is hard. We,
as long time Grid users and developers, have the unique position of
identifying such problems and dealing with them as we can.
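
To make the point concrete, here is a minimal sketch in Python (not Swift or
Karajan code) of hiding grid-only failures below the level the user programs
against. The names TransientGridError, run_with_retries and flaky_site_job
are hypothetical and exist only for this illustration; the retry count is an
arbitrary assumption.

```python
class TransientGridError(Exception):
    """Hypothetical marker for system-level failures (bad site, lost job)."""


def run_with_retries(job, max_attempts=3):
    """Call job(attempt) until it succeeds or max_attempts is exhausted.

    Transient grid errors are swallowed and retried, so the caller only
    sees an error if every attempt fails.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return job(attempt)
        except TransientGridError as err:
            last_error = err  # system-level problem: hide it and try again
    raise last_error


def flaky_site_job(attempt):
    # Simulated job: the first two attempts hit a misconfigured site.
    if attempt < 3:
        raise TransientGridError(f"site {attempt} failed to run the job")
    return "output.dat"


result = run_with_retries(flaky_site_job)
print(result)
```

The user-level call sees only the final result; an error that would also
occur locally (anything other than TransientGridError here) is not caught
and still reaches the user.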

> 
> I think we need to turn these cases into slightly more detailed 
> examples and then consider how we want the system to respond and 
> what the user would need to do in each case to recover/continue.

I thought we'd done this before, to a sizable extent, both on the
mailing lists, and face-to-face.

> 
> Your cases separate out logical errors (1) from physical ones (2) 
> which is I agree a useful distinction.
> 
> But I wonder in case 2, when you have a program that's been 
> productively proceeding, and then encounters a physical error, in 
> some cases the program will have processed a given function "long 
> enough" to end that function call and proceed ("correctly") with the 
> data it has, while in other cases, the program can not proceed until 
> it has generated every single member that a foreach or map function 
> was to iterate over.

Right. The former is the m out of n pattern with timeouts. While very
rarely occurring in nature, we may eventually want to support such a
thing. Also, given that reliable performance measurements are hard to
get in a massively multi-user heterogeneous environment, we would need
to find specific ways to express time.
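
As a rough illustration of the m out of n pattern with a timeout, here is a
sketch in Python using the standard concurrent.futures module. It is not
Swift syntax, and it sidesteps the question of how to express time by
assuming plain wall-clock seconds; first_m_of_n and make_task are
hypothetical names.

```python
import concurrent.futures as cf
import time


def first_m_of_n(tasks, m, timeout_s):
    """Run all tasks in parallel and return the first m successful results.

    If the deadline passes first, return whatever has arrived so far
    (possibly fewer than m results). Failed tasks are skipped.
    """
    pool = cf.ThreadPoolExecutor(max_workers=len(tasks))
    futures = [pool.submit(task) for task in tasks]
    results = []
    try:
        for fut in cf.as_completed(futures, timeout=timeout_s):
            if fut.exception() is None:
                results.append(fut.result())
            if len(results) >= m:
                break  # enough results: stop waiting for the rest
    except cf.TimeoutError:
        pass  # deadline passed: proceed with the results collected so far
    finally:
        pool.shutdown(wait=False)  # do not block on the stragglers
    return results


def make_task(name, delay_s):
    # Simulated job that takes delay_s seconds and returns its name.
    def task():
        time.sleep(delay_s)
        return name
    return task


tasks = [make_task("a", 0.01), make_task("b", 0.05), make_task("c", 0.5)]
got = first_m_of_n(tasks, m=2, timeout_s=5.0)
print(sorted(got))
```

Whether a short result list should count as success or as an error is left
to the caller here; that is exactly the kind of policy a language-level
construct would have to pin down.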

Mihael

> 
> In the latter branch of this case, we'd want to be able to restart 
> the program, while in the former branch, we'd want to be able to 
> ignore certain errors and continue.
> 
> - Mike
> 
> 
> Mihael Hategan wrote, On 6/19/2007 3:55 AM:
> > On Mon, 2007-06-18 at 21:54 -0500, Ian Foster wrote:
> >> Interesting issue, as raised by Mike: it seems that we want to define
> >> a "map" function that can specify various "conditions" on the map,
> >> e.g., the length of time to wait for results, what to do if not all
> >> results are returned, etc. I wonder if others have done that before?
> > 
> > There are two types of errors and time-outs:
> > 1. The ones that occur as part of the workflow:
> >  - Computations on some data do not return the expected files
> >  - The user wants to run multiple algorithms on some data and only
> > select the fastest one or the ones that run in a specific time.
> >  - etc.
> > 2. The ones that occur because of exceptional conditions in the system:
> >  - A badly configured site fails to run things
> >  - A job sits for a long time in a queue
> > 
> > I think the first class can be handled in the language, but the second
> > should not.
> > 
> > Occam seems to support such things, by composing various keywords (e.g.
> > PAR ... FOR and then timeouts or error handling). I personally favor
> > composition of smaller dedicated functions/keywords to big, do-it-all
> > functions.
> > 
> > Mihael
> > 
> >> Ian Foster wrote: 
> >>> I like the notion of having a "map" function. If that could entirely
> >>> replace the current element assignments, that would be a wonderful
> >>> simplification, it seems to me.
> >>>
> >>> Ian.
> >>>
> >>> Ben Clifford wrote: 
> >>>> There's a different approach, which is to say that 'a' is a variable and 
> >>>> can be assigned to once. Thus assignment syntax like a[0]=something 
> >>>> becomes illegal and we need more functional language constructs. So 
> >>>> instead of writing:
> >>>>
> >>>> for e,i in input_array {
> >>>>   output_array[i] = p(e);
> >>>> }
> >>>>
> >>>> we would write:
> >>>>
> >>>> output_array = foreach i in input_array {
> >>>>   return p(i);
> >>>> }
> >>>>
> >>>> (it's a Haskell map in different syntax!)
> >>>>
> >>>> That means that, at the language level, output_array is now properly 
> >>>> single assignment.
> >>>>
> >>>>
> >>>> On Fri, 15 Jun 2007, Ian Foster wrote:
> >>>>
> >>>>   
> >>>>> Hi,
> >>>>>
> >>>>> For:
> >>>>>
> >>>>>  a[0] = p()
> >>>>>  a[1] = q()
> >>>>>  b = s(a)
> >>>>>
> >>>>> I think there are two distinct issues.
> >>>>>
> >>>>> a) Determining the size of the array. This could presumably be done by
> >>>>> declaring it, e.g.:
> >>>>>
> >>>>>  a[2] or some similar notion
> >>>>>  a[0] = p()
> >>>>>  a[1] = q()
> >>>>>  b = s(a)
> >>>>>
> >>>>> or by some "closing" concept.
> >>>>>
> >>>>> b) Whether or not each element of an array is a separate single-assignment
> >>>>> variable. If they are, then the code above should work just fine. If they are
> >>>>> not, then we have a couple of behaviors we could define. One would be that
> >>>>> b=s(a) blocks until all elements in "a" are defined. The other is that we have
> >>>>> a way of "closing" (once again). In that case, we have to define what happens
> >>>>> if b=s(a) accesses an element that is not defined.
> >>>>>
> >>>>> Ian.
> >>>>>
> >>>>> Ben Clifford wrote:
> >>>>>     
> >>>>>> There is a problem that has been called the 'array closing problem'.
> >>>>>>
> >>>>>> It manifests itself in the tutorial in that certain bits of code that
> >>>>>> intuitively can go either in a procedure or at the top level can, in practice,
> >>>>>> only go into a procedure.
> >>>>>>
> >>>>>> In that context, I tried to think about better ways to explain/document the
> >>>>>> behaviour than "mumble mumble move that code into a procedure".
> >>>>>>
> >>>>>> In Swift we claim to have 'single assignment variables'.
> >>>>>>
> >>>>>> From single assignment variables we get our grid job ordering:
> >>>>>>
> >>>>>>   a = p()
> >>>>>>   b = s(a)
> >>>>>>
> >>>>>> causes first grid job p to run, and when that has completed, then grid job s
> >>>>>> will run.
> >>>>>>
> >>>>>> This is the same as if we had written:
> >>>>>>
> >>>>>>   b = s(a)
> >>>>>>   a = p()
> >>>>>>
> >>>>>> The ordering comes from the use of a as an 'output' for p and an 'input' for
> >>>>>> s, not from source text ordering.
> >>>>>>
> >>>>>> In that model, it's meaningless to assign two different things to a, like
> >>>>>> this:
> >>>>>>
> >>>>>>   a = p()
> >>>>>>   b = s(a)
> >>>>>>   a = t()
> >>>>>>
> >>>>>>
> >>>>>> Note that I've omitted the data types from the above. This works in the
> >>>>>> implementation for simple types such as a datafile marker type.
> >>>>>>
> >>>>>> What is important is that each variable is either unassigned or has its
> >>>>>> single value - whenever we refer to that variable, we can either use the
> >>>>>> value it has, or defer evaluation of that expression until the variable has
> >>>>>> its value.
> >>>>>>
> >>>>>> Now consider arrays. In the present syntax, arrays can be passed as single
> >>>>>> (complex) values to/from procedures, like before:
> >>>>>>
> >>>>>>   a = p()
> >>>>>>   b = s(a)
> >>>>>>
> >>>>>> Here a and b are array types.
> >>>>>>
> >>>>>> That's fine. a is assigned to by the first statement, and b is assigned to
> >>>>>> by the second statement.
> >>>>>>
> >>>>>> But we also support a different assignment syntax for arrays, that looks
> >>>>>> like this:
> >>>>>>
> >>>>>>   a[0] = p()
> >>>>>>   a[1] = q()
> >>>>>>   b = s(a)
> >>>>>>
> >>>>>> This fails at the moment (specifically, I think the execution engine will
> >>>>>> hang).
> >>>>>>
> >>>>>> Why? Because there is no one point at which we assign a value to 'a' - the
> >>>>>> assignment is split over multiple statements, which can be in various places
> >>>>>> (and inside loops etc).
> >>>>>>
> >>>>>> There is nothing in the implementation that detects that a has been assigned
> >>>>>> its value.
> >>>>>>
> >>>>>> So there is this notion in the karajan intermediate code of 'closing an
> >>>>>> array'.  This is an assertion made in the object code that all assignments
> >>>>>> to pieces of an array have been made - that, in effect, the array has its
> >>>>>> value.
> >>>>>>
> >>>>>> The suggested hack/workaround for this is to move the array element
> >>>>>> assignments into a procedure:
> >>>>>>
> >>>>>>  (file f[]) z() {
> >>>>>>    f[0] = p();
> >>>>>>    f[1] = q();
> >>>>>>  }
> >>>>>>
> >>>>>>  a = z()
> >>>>>>  b = s(a)
> >>>>>>
> >>>>>> This works. (which is sort-of a violation of referential transparency)
> >>>>>>
> >>>>>> It works because Swift implicitly marks arrays returned from compound
> >>>>>> procedures as closed (which may or may not be correct).
> >>>>>>
> >>>>>> So in most variable scopes, arrays behave like single-assignment variables,
> >>>>>> but each array can have one specific scope in which members can be assigned
> >>>>>> to. In that scope, the array cannot be treated as a whole variable.
> >>>>>>
> >>>>>> In the z() example above, that special scope is the body of z(). In the
> >>>>>> previous example, that scope is the global scope, and the program is invalid
> >>>>>> by the rule above that the array cannot be referred to as a whole in the
> >>>>>> same place that its members are individually assigned to.
> >>>>>>
> >>>>>> That's my explanation of what's going on now. I think it matches reality. I
> >>>>>> don't like that this is reality, but it is what we have.
> >>>>>>
> >>>>>> Comments appreciated.
> >>>>>>
> >>>>>>   
> >>>>>>       
> >>>>   
> >>> -- 
> >>>
> >>>    Ian Foster, Director, Computation Institute
> >>> Argonne National Laboratory & University of Chicago
> >>> Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439
> >>> Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637
> >>> Tel: +1 630 252 4619.  Web: www.ci.uchicago.edu.
> >>>       Globus Alliance: www.globus.org.
> >>>   
> >>>
> >>> ____________________________________________________________________
> >>>
> >>>   
> >> -- 
> >>
> >>    Ian Foster, Director, Computation Institute
> >> Argonne National Laboratory & University of Chicago
> >> Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439
> >> Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637
> >> Tel: +1 630 252 4619.  Web: www.ci.uchicago.edu.
> >>       Globus Alliance: www.globus.org.
> > 
> > 
> > 
> 


From hategan at mcs.anl.gov  Tue Jun 19 09:48:31 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 19 Jun 2007 17:48:31 +0300
Subject: [Swift-devel] on the semantics of 'array closing'
In-Reply-To: <4677DD03.6010900@mcs.anl.gov>
References: <Pine.LNX.4.64.0706151924490.10634@dildano.hawaga.org.uk>
	<4672F5E3.7060205@mcs.anl.gov>
	<Pine.LNX.4.64.0706152050270.10634@dildano.hawaga.org.uk>
	<4674096B.4020109@mcs.anl.gov>  <4677455D.1070909@mcs.anl.gov>
	<1182243330.17515.14.camel@blabla.mcs.anl.gov>
	<4677CEA1.8050806@mcs.anl.gov>
	<1182257710.18810.15.camel@blabla.mcs.anl.gov>
	<4677DD03.6010900@mcs.anl.gov>
Message-ID: <1182264511.19005.6.camel@blabla.mcs.anl.gov>

On Tue, 2007-06-19 at 08:41 -0500, Mike Wilde wrote:
> Mihael Hategan wrote, On 6/19/2007 7:55 AM:
> > On Tue, 2007-06-19 at 07:40 -0500, Mike Wilde wrote:
> >> This is a good breakdown, but I don't yet see how to distinguish 
> >> between the two situations you lay out here, Mihael.
> > 
> > The distinction is made by thinking about local vs. grid execution. If
> > something cannot occur during local execution but can occur during grid
> > execution, we probably want to hide that from the user as much as
> > possible. Programming against randomly unreliable systems is hard. We,
> > as long time Grid users and developers, have the unique position of
> > identifying such problems and dealing with them as we can.
> > 
> >> I think we need to turn these cases into slightly more detailed 
> >> examples and then consider how we want the system to respond and 
> >> what the user would need to do in each case to recover/continue.
> > 
> > I thought we'd done this before, to a sizable extent, both on the
> > mailing lists, and face-to-face.
> 
> You are probably right, with respect to the exception handling 
> discussion. We need to move such discussions into proposed language 
> spec doc changes so that we can turn them into decisions and 
> campaigns to implement them.
> 
> The mailing list is the right place to discuss, but then we need to 
> summarize that discussion into a consensus.
> 
> Also, I think there are two new aspects proposed here - of a foreach() 
> or map() reaching a threshold of "enough results" that it can be 
> called done,

With sparse arrays, it would be sufficient to not have the results from
the iterations that fail. If we had try/catch constructs, that would
simply translate into:
foreach k,v in V {
  try {
    results[k] = process(V[k]);
  }
  catch (*) {}
}
thus alleviating the need for a complex foreach construct.
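
Since the try/catch construct above is hypothetical Swift syntax, here is a
small Python model of the same idea, for illustration only. A dict stands in
for a sparse array, and process and the sample data in V are made up for the
example.

```python
def process(value):
    """Hypothetical stand-in for a grid job; fails on negative input."""
    if value < 0:
        raise RuntimeError(f"job failed for input {value}")
    return value * 2


# Input "array", keyed by index.
V = {0: 3, 1: -1, 2: 5}

# Sparse result: only the indices whose iteration succeeded get an entry.
results = {}
for k, v in V.items():
    try:
        results[k] = process(v)
    except Exception:
        pass  # mirrors `catch (*) {}`: the failed iteration leaves a hole

print(results)  # {0: 6, 2: 10}
```

Index 1 is simply absent from results, so a consumer that tolerates sparse
arrays can carry on with the elements that do exist.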

>  and of a streaming model of computation where a DAG 
> acts like a pipeline even though it was specified as a set of 
> function applications.

I'm not really sure what that refers to.

> 
> Are you and Ben at a point where you could gather these issues 
> (map(), error handling, thresholds and streaming) and turn them into 
> proposed language improvements?

We would need to chat some, and, of course, have the necessary time.

Mihael

> 
> If so, you should, and if not, we should decide if we need more 
> discussion on the list.
> 
> It's likely that an attempt to do such a language update right now 
> would lead directly to more discussion, as writing language spec 
> tends to expose more unresolved issues.
> 
> - Mike
> 
> > 
> >> Your cases separate out logical errors (1) from physical ones (2) 
> >> which is I agree a useful distinction.
> >>
> >> But I wonder in case 2, when you have a program that's been 
> >> productively proceeding, and then encounters a physical error, in 
> >> some cases the program will have processed a given function "long 
> >> enough" to end that function call and proceed ("correctly") with the 
> >> data it has, while in other cases, the program can not proceed until 
> >> it has generated every single member that a foreach or map function 
> >> was to iterate over.
> > 
> > Right. The former is the m out of n pattern with timeouts. While very
> > rarely occurring in nature, we may eventually want to support such a
> > thing. Also, given that reliable performance measurements are hard to
> > get in a massively multi-user heterogeneous environment, we would need
> > to find specific ways to express time.
> > 
> > Mihael
> > 
> >> In the latter branch of this case, we'd want to be able to restart 
> >> the program, while in the former branch, we'd want to be able to 
> >> ignore certain errors and continue.
> >>
> >> - Mike
> >>
> >>
> >> Mihael Hategan wrote, On 6/19/2007 3:55 AM:
> >>> On Mon, 2007-06-18 at 21:54 -0500, Ian Foster wrote:
> >>>> Interesting issue, as raised by Mike: it seems that we want to define
> >>>> a "map" function that can specify various "conditions" on the map,
> >>>> e.g., the length of time to wait for results, what to do if not all
> >>>> results are returned, etc. I wonder if others have done that before?
> >>> There are two types of errors and time-outs:
> >>> 1. The ones that occur as part of the workflow:
> >>>  - Computations on some data do not return the expected files
> >>>  - The user wants to run multiple algorithms on some data and only
> >>> select the fastest one or the ones that run in a specific time.
> >>>  - etc.
> >>> 2. The ones that occur because of exceptional conditions in the system:
> >>>  - A badly configured site fails to run things
> >>>  - A job sits for a long time in a queue
> >>>
> >>> I think the first class can be handled in the language, but the second
> >>> should not.
> >>>
> >>> Occam seems to support such things, by composing various keywords (e.g.
> >>> PAR ... FOR and then timeouts or error handling). I personally favor
> >>> composition of smaller dedicated functions/keywords to big, do-it-all
> >>> functions.
> >>>
> >>> Mihael
> >>>
> >>>> Ian Foster wrote: 
> >>>>> I like the notion of having a "map" function. If that could entirely
> >>>>> replace the current element assignments, that would be a wonderful
> >>>>> simplification, it seems to me.
> >>>>>
> >>>>> Ian.
> >>>>>
> >>>>> Ben Clifford wrote: 
> >>>>>> There's a different approach, which is to say that 'a' is a variable and 
> >>>>>> can be assigned to once. Thus assignment syntax like a[0]=something 
> >>>>>> becomes illegal and we need more functional language constructs. So 
> >>>>>> instead of writing:
> >>>>>>
> >>>>>> for e,i in input_array {
> >>>>>>   output_array[i] = p(e);
> >>>>>> }
> >>>>>>
> >>>>>> we would write:
> >>>>>>
> >>>>>> output_array = foreach i in input_array {
> >>>>>>   return p(i);
> >>>>>> }
> >>>>>>
> >>>>>> (it's a Haskell map in different syntax!)
> >>>>>>
> >>>>>> That means that, at the language level, output_array is now properly 
> >>>>>> single assignment.
> >>>>>>
> >>>>>>
> >>>>>> On Fri, 15 Jun 2007, Ian Foster wrote:
> >>>>>>
> >>>>>>   
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> For:
> >>>>>>>
> >>>>>>>  a[0] = p()
> >>>>>>>  a[1] = q()
> >>>>>>>  b = s(a)
> >>>>>>>
> >>>>>>> I think there are two distinct issues.
> >>>>>>>
> >>>>>>> a) Determining the size of the array. This could presumably be done by
> >>>>>>> declaring it, e.g.:
> >>>>>>>
> >>>>>>>  a[2] or some similar notion
> >>>>>>>  a[0] = p()
> >>>>>>>  a[1] = q()
> >>>>>>>  b = s(a)
> >>>>>>>
> >>>>>>> or by some "closing" concept.
> >>>>>>>
> >>>>>>> b) Whether or not each element of an array is a separate single-assignment
> >>>>>>> variable. If they are, then the code above should work just fine. If they are
> >>>>>>> not, then we have a couple of behaviors we could define. One would be that
> >>>>>>> b=s(a) blocks until all elements in "a" are defined. The other is that we have
> >>>>>>> a way of "closing" (once again). In that case, we have to define what happens
> >>>>>>> if b=s(a) accesses an element that is not defined.
> >>>>>>>
> >>>>>>> Ian.
> >>>>>>>
> >>>>>>> Ben Clifford wrote:
> >>>>>>>     
> >>>>>>>> There is a problem that has been called the 'array closing problem'.
> >>>>>>>>
> >>>>>>>> It manifests itself in the tutorial in that certain bits of code that
> >>>>>>>> intuitively can go either in a procedure or at the top level can, in practice,
> >>>>>>>> only go into a procedure.
> >>>>>>>>
> >>>>>>>> In that context, I tried to think about better ways to explain/document the
> >>>>>>>> behaviour than "mumble mumble move that code into a procedure".
> >>>>>>>>
> >>>>>>>> In Swift we claim to have 'single assignment variables'.
> >>>>>>>>
> >>>>>>>> From single assignment variables we get our grid job ordering:
> >>>>>>>>
> >>>>>>>>   a = p()
> >>>>>>>>   b = s(a)
> >>>>>>>>
> >>>>>>>> causes first grid job p to run, and when that has completed, then grid job s
> >>>>>>>> will run.
> >>>>>>>>
> >>>>>>>> This is the same as if we had written:
> >>>>>>>>
> >>>>>>>>   b = s(a)
> >>>>>>>>   a = p()
> >>>>>>>>
> >>>>>>>> The ordering comes from the use of a as an 'output' for p and an 'input' for
> >>>>>>>> s, not from source text ordering.
> >>>>>>>>
> >>>>>>>> In that model, it's meaningless to assign two different things to a, like
> >>>>>>>> this:
> >>>>>>>>
> >>>>>>>>   a = p()
> >>>>>>>>   b = s(a)
> >>>>>>>>   a = t()
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Note that I've omitted the data types from the above. This works in the
> >>>>>>>> implementation for simple types such as a datafile marker type.
> >>>>>>>>
> >>>>>>>> What is important is that each variable is either unassigned or has its
> >>>>>>>> single value - whenever we refer to that variable, we can either use the
> >>>>>>>> value it has, or defer evaluation of that expression until the variable has
> >>>>>>>> its value.
> >>>>>>>>
> >>>>>>>> Now consider arrays. In the present syntax, arrays can be passed as single
> >>>>>>>> (complex) values to/from procedures, like before:
> >>>>>>>>
> >>>>>>>>   a = p()
> >>>>>>>>   b = s(a)
> >>>>>>>>
> >>>>>>>> Here a and b are array types.
> >>>>>>>>
> >>>>>>>> That's fine. a is assigned to by the first statement, and b is assigned to
> >>>>>>>> by the second statement.
> >>>>>>>>
> >>>>>>>> But we also support a different assignment syntax for arrays, that looks
> >>>>>>>> like this:
> >>>>>>>>
> >>>>>>>>   a[0] = p()
> >>>>>>>>   a[1] = q()
> >>>>>>>>   b = s(a)
> >>>>>>>>
> >>>>>>>> This fails at the moment (specifically, I think the execution engine will
> >>>>>>>> hang).
> >>>>>>>>
> >>>>>>>> Why? Because there is no one point at which we assign a value to 'a' - the
> >>>>>>>> assignment is split over multiple statements, which can be in various places
> >>>>>>>> (and inside loops etc).
> >>>>>>>>
> >>>>>>>> There is nothing in the implementation that detects that a has been assigned
> >>>>>>>> its value.
> >>>>>>>>
> >>>>>>>> So there is this notion in the karajan intermediate code of 'closing an
> >>>>>>>> array'.  This is an assertion made in the object code that all assignments
> >>>>>>>> to pieces of an array have been made - that, in effect, the array has its
> >>>>>>>> value.
> >>>>>>>>
> >>>>>>>> The suggested hack/workaround for this is to move the array element
> >>>>>>>> assignments into a procedure:
> >>>>>>>>
> >>>>>>>>  (file f[]) z() {
> >>>>>>>>    f[0] = p();
> >>>>>>>>    f[1] = q();
> >>>>>>>>  }
> >>>>>>>>
> >>>>>>>>  a = z()
> >>>>>>>>  b = s(a)
> >>>>>>>>
> >>>>>>>> This works. (which is sort-of a violation of referential transparency)
> >>>>>>>>
> >>>>>>>> It works because Swift implicitly marks arrays returned from compound
> >>>>>>>> procedures as closed (which may or may not be correct).
> >>>>>>>>
> >>>>>>>> So in most variable scopes, arrays behave like single-assignment variables,
> >>>>>>>> but each array can have one specific scope in which members can be assigned
> >>>>>>>> to. In that scope, the array cannot be treated as a whole variable.
> >>>>>>>>
> >>>>>>>> In the z() example above, that special scope is the body of z(). In the
> >>>>>>>> previous example, that scope is the global scope, and the program is invalid
> >>>>>>>> by the rule above that the array cannot be referred to as a whole in the
> >>>>>>>> same place that its members are individually assigned to.
> >>>>>>>>
> >>>>>>>> That's my explanation of what's going on now. I think it matches reality. I
> >>>>>>>> don't like that this is reality, but it is what we have.
> >>>>>>>>
> >>>>>>>> Comments appreciated.
> >>>>>>>>
> >>>>>>>>   
> >>>>>>>>       
> >>>>>>   
> >>>>> -- 
> >>>>>
> >>>>>    Ian Foster, Director, Computation Institute
> >>>>> Argonne National Laboratory & University of Chicago
> >>>>> Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439
> >>>>> Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637
> >>>>> Tel: +1 630 252 4619.  Web: www.ci.uchicago.edu.
> >>>>>       Globus Alliance: www.globus.org.
> >>>>>   
> >>>>>
> >>>>> ____________________________________________________________________
> >>>>>
> >>>>> _______________________________________________
> >>>>> Swift-devel mailing list
> >>>>> Swift-devel at ci.uchicago.edu
> >>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> >>>>>   
> >>>
> >>>
> > 
> > 
> 
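
For reference, the 'array closing' behaviour described in the quoted
message can be modelled roughly as below. This is only a sketch in Python,
not Swift or Karajan code: the class SwiftArray, its methods, and the
stand-in procedures z() and s() are hypothetical names made up for
illustration, and where the real engine would wait (and apparently hang)
the model raises an exception so the case is visible.

```python
class AssignmentError(Exception):
    """Raised when a single-assignment rule is violated."""


class NotClosedError(Exception):
    """Raised when an array is read as a whole before it is closed."""


class SwiftArray:
    """Hypothetical model of an array whose elements are assigned one by one."""

    def __init__(self):
        self._elements = {}
        self._closed = False

    def assign(self, index, value):
        # Each element is treated as its own single-assignment slot.
        if self._closed:
            raise AssignmentError("array is closed; no more elements may be assigned")
        if index in self._elements:
            raise AssignmentError(f"element {index} already has its value")
        self._elements[index] = value

    def close(self):
        # The assertion that all element assignments have been made.
        self._closed = True

    def whole(self):
        # The real engine would wait here; this model raises instead.
        if not self._closed:
            raise NotClosedError("array has no whole value until it is closed")
        return [self._elements[i] for i in sorted(self._elements)]


def z():
    """Stand-in for a compound procedure: the array it returns is closed."""
    f = SwiftArray()
    f.assign(0, "p-output")
    f.assign(1, "q-output")
    f.close()  # models the implicit close when the procedure returns
    return f


def s(array):
    """Stand-in for a procedure that consumes the array as a single value."""
    return len(array.whole())


# Element assignment at the top level: nothing ever closes `a`.
a = SwiftArray()
a.assign(0, "p-output")
a.assign(1, "q-output")
try:
    s(a)
    top_level_ok = True
except NotClosedError:
    top_level_ok = False

# Element assignment inside a procedure: the returned array is closed.
b = s(z())
```

In this model the only difference between the two cases is who calls
close(), which is the point made above about one special scope per array.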


From benc at hawaga.org.uk  Tue Jun 19 10:00:19 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 19 Jun 2007 15:00:19 +0000 (GMT)
Subject: [Swift-devel] on the semantics of 'array closing'
In-Reply-To: <1182264511.19005.6.camel@blabla.mcs.anl.gov>
References: <Pine.LNX.4.64.0706151924490.10634@dildano.hawaga.org.uk>
	<4672F5E3.7060205@mcs.anl.gov>
	<Pine.LNX.4.64.0706152050270.10634@dildano.hawaga.org.uk>
	<4674096B.4020109@mcs.anl.gov>  <4677455D.1070909@mcs.anl.gov>
	<1182243330.17515.14.camel@blabla.mcs.anl.gov>
	<4677CEA1.8050806@mcs.anl.gov>
	<1182257710.18810.15.camel@blabla.mcs.anl.gov>
	<4677DD03.6010900@mcs.anl.gov>
	<1182264511.19005.6.camel@blabla.mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0706191454460.15250@dildano.hawaga.org.uk>


On Tue, 19 Jun 2007, Mihael Hategan wrote:

> > Also, I think there's two new aspects proposed here - of a foreach() 
> > or map() reaching a threshold of "enough results" that it can be 
> > called done,
> 
> With sparse arrays, it would be sufficient to not have the results from
> the iterations that fail. If we had try/catch constructs, that would
> simply translate into:
> foreach k,v in V {
>   try {
>     results[k] = process(V[k]);
>   }
>   catch (*) {}
> }
> thus alleviating the need for a complex foreach construct.

That could fit in with the list comprehension / map syntax too, I guess, 
like so:

 a = [ ignore_errors(p(i)) for i in array];

with ignore_errors being an expression-level equivalent of a try { ...} 
catch{} block.

or it could go on the outside of the map-like structure, like this:

  a = [p(i) for i in array]
  b = n_is_enough(7, a);

so that b gets assigned a value that is the first 7 values of a to be 
computed, and errors in computing the rest of 'a' don't propagate through 
to errors in computing b.

That lets you parameterise the 'n is enough' or 'whatever comes first' 
bits separately from the for construct.
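
Both ideas can be sketched outside Swift to show the intended semantics.
The following is a Python illustration only, under the assumption that
"first 7 values" means the first 7 successful results in completion order;
ignore_errors, n_is_enough and p are hypothetical names taken from (or
invented for) the examples above, not existing Swift functions.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def ignore_errors(func, items):
    """Sparse result: index -> value, with failed iterations simply absent."""
    results = {}
    for k, v in enumerate(items):
        try:
            results[k] = func(v)
        except Exception:
            pass  # equivalent of an empty catch block
    return results


def n_is_enough(n, futures):
    """Return the first n successful results to complete; failures are ignored."""
    done = []
    for fut in as_completed(futures):
        if fut.exception() is None:
            done.append(fut.result())
            if len(done) == n:
                break
    return done


def p(i):
    """Hypothetical job: fails for every fourth input."""
    if i % 4 == 0:
        raise ValueError(f"job {i} failed")
    return i * i


sparse = ignore_errors(p, range(6))

with ThreadPoolExecutor(max_workers=4) as pool:
    a = [pool.submit(p, i) for i in range(12)]
    b = n_is_enough(7, a)
```

Here b ends up with 7 values even though three of the twelve jobs fail.
Note that the sketch returns fewer than n values if fewer than n jobs
succeed; what Swift should do in that case is left open.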

-- 


From hategan at mcs.anl.gov  Tue Jun 19 10:13:52 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 19 Jun 2007 18:13:52 +0300
Subject: [Swift-devel] on the semantics of 'array closing'
In-Reply-To: <Pine.LNX.4.64.0706191454460.15250@dildano.hawaga.org.uk>
References: <Pine.LNX.4.64.0706151924490.10634@dildano.hawaga.org.uk>
	<4672F5E3.7060205@mcs.anl.gov>
	<Pine.LNX.4.64.0706152050270.10634@dildano.hawaga.org.uk>
	<4674096B.4020109@mcs.anl.gov>  <4677455D.1070909@mcs.anl.gov>
	<1182243330.17515.14.camel@blabla.mcs.anl.gov>
	<4677CEA1.8050806@mcs.anl.gov>
	<1182257710.18810.15.camel@blabla.mcs.anl.gov>
	<4677DD03.6010900@mcs.anl.gov>
	<1182264511.19005.6.camel@blabla.mcs.anl.gov>
	<Pine.LNX.4.64.0706191454460.15250@dildano.hawaga.org.uk>
Message-ID: <1182266032.19234.13.camel@blabla.mcs.anl.gov>

On Tue, 2007-06-19 at 15:00 +0000, Ben Clifford wrote:
> 
> On Tue, 19 Jun 2007, Mihael Hategan wrote:
> 
> > > Also, I think there's two new aspects proposed here - of a foreach() 
> > > or map() reaching a threshold of "enough results" that it can be 
> > > called done,
> > 
> > With sparse arrays, it would be sufficient to not have the results from
> > the iterations that fail. If we had try/catch constructs, that would
> > simply translate into:
> > foreach k,v in V {
> >   try {
> >     results[k] = process(V[k]);
> >   }
> >   catch (*) {}
> > }
> > thus alleviating the need for a complex foreach construct.
> 
> That could fit in with the list comprehension / map syntax too, I guess, 
> like so:
> 
>  a = [ ignore_errors(p(i)) for i in array];
> 
> with ignore_errors being an expression-level equivalent of a try { ...} 
> catch{} block.

Should easily translate to something like:

a := swiftArray(for(i, array, ignoreErrors(p(i))))

> 
> or it could go on the outside of the map-like structure, like this:
> 
>   a = [p(i) for i in array]
>   b = n_is_enough(7, a);

Doable.

> 
> so that b gets assigned a value that is the first 7 values of a to be 
> computed, and errors in computing the rest of 'a' don't propagate through 
> to errors in computing b.
> 
> That lets you parameterise the 'n is enough' or 'whatever comes first' 
> bits separately from the for construct.
> 


From benc at hawaga.org.uk  Tue Jun 19 13:41:08 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 19 Jun 2007 18:41:08 +0000 (GMT)
Subject: [Swift-devel] serializable DSHandle
Message-ID: <Pine.LNX.4.64.0706191838520.1452@dildano.hawaga.org.uk>


Is there a reason for DSHandle to be serializable?

> public interface DSHandle extends Serializable {


-- 


From iraicu at cs.uchicago.edu  Tue Jun 19 14:07:36 2007
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Tue, 19 Jun 2007 14:07:36 -0500
Subject: [Swift-devel] CPU usage with provider-deef
In-Reply-To: <4677C7A5.6040004@mcs.anl.gov>
References: <Pine.LNX.4.64.0706151634060.10634@dildano.hawaga.org.uk>	<1181984306.10455.3.camel@blabla.mcs.anl.gov>	<Pine.LNX.4.64.0706161258280.10634@dildano.hawaga.org.uk>	<4673F0E9.3060505@cs.uchicago.edu>	<Pine.LNX.4.64.0706161418470.10634@dildano.hawaga.org.uk>	<4673F548.5070608@cs.uchicago.edu>	<Pine.LNX.4.64.0706161438550.10634@dildano.hawaga.org.uk>	<4673FB19.6070305@cs.uchicago.edu>	<1182006166.11495.1.camel@blabla.mcs.anl.gov><467408D5.2020709@mcs.anl.gov>	<467417DE.50408@mcs.anl.gov><1182018587.12013.14.camel@blabla.mcs.anl.gov>	<Pine.LNX.4.64.0706161932560.15250@dildano.hawaga.org.uk>	<537273469-1182023623-cardhu_decombobulator_blackberry.rim.net-728646234-@bxe006.bisx.prod.on.blackberry>	<1182024904.12401.11.camel@blabla.mcs.anl.gov>	<46746A7F.7040004@mcs.anl.gov>	<Pine.LNX.4.64.0706162313300.15250@dildano.hawaga.org.uk>	<4676FB2C.9010004@cs.uchicago.edu>
	<1182241371.17515.2.camel@blabla.mcs.anl.gov>
	<4677C7A5.6040004@mcs.anl.gov>
Message-ID: <46782978.10800@cs.uchicago.edu>

Hi,
See below:

Mike Wilde wrote:
> Ioan, I feel that documenting the message flow is key to understanding 
> the system, and that such documentation will be indispensable for you 
> (and the Swift team) to make progress with Falkon.
Yes, I agree!
>
>
> If it's the case, as you stated to me, that you don't have the time to 
> fully support Falkon for end-user use, which I accept, then the only 
> way to enable the Swift team to provide this support is for us to 
> understand the tool.
Agreed!
>
>
> The three things you propose to communicate to us in a face-to-face 
> meeting (organization of the code, the message flow diagram, 
> configuration options) are exactly the things that need to be documented.
The message flow diagram is already in the SC paper, and if not there, 
in the provisioning paper I wrote a few weeks later that was never 
submitted anywhere. The Falkon paper is at 
http://people.cs.uchicago.edu/~iraicu/research/docs/Falkon/Falkon_SC07_v17-submitted.pdf 
(section 3.2, Figure 2), and the DRP paper is at 
http://people.cs.uchicago.edu/~iraicu/research/docs/DRP/DRP_v01.doc 
(section 3.2 and Figure 1).  Note that the latest Falkon paper (v24) 
does not have this message flow diagram.  As for the organization of the 
code and configuration options, I will certainly do them, but for now 
they are not high on my to-do list.  I know Mihael, Ben and everyone 
else wants written docs, but I can't keep spending 90% of my time on 
non-research related issues, as I have been doing recently.  Getting 
Nika's MolDyn application running over Falkon has literally consumed 
all of my time recently.  I don't mind doing it, especially when we have 
some results to show for it (last night we ran the 100 molecule run 
successfully, I'll send out a separate update on this later), but I need 
to get back to the data management (which is almost ready, but unless I 
get a few quiet days to finish it up, it will never get done), my 
proposal, etc.  I'll get to these eventually, but I can't promise when; 
in the meantime, I offer my time to meet in person with whoever wants to 
dig into Falkon further!
>
>
> But you and Ian together need to decide and propose how you want to 
> see Falkon used and how you intend to make it supportable for that 
> purpose.
I am OK to support Falkon (as you have already seen with me helping Nika 
one on one, talking to Tibi about his app, getting Falkon in as an 
incubator project to setup CVS, mailing lists, etc....), but I hope to 
avoid making Falkon support be 90% of my time, which has been the case 
recently :(.
>
> For Swift's goals, we *know* we need Falkon's capabilities to succeed, 
> and the question is how much and how long can we use Falkon, and when 
> would we need to start improving or rewriting it.
>
These are all very good questions, and only time will answer these, IMO. 

Ioan
> - Mike
>
>
>
>
> Mihael Hategan wrote, On 6/19/2007 3:22 AM:
>> On Mon, 2007-06-18 at 16:37 -0500, Ioan Raicu wrote:
>>> There is a diagram in the SC paper that has the flow of messages, and
>>> what each does; these messages also map to the WSDL of the Falkon
>>> service.  Mihael, Ben, or whoever else wants to start digging through
>>> the Falkon code, maybe a meeting might be good to go over the
>>> organization of the code, the message flow diagram, configuration
>>> options, etc...  I would prefer a meeting over drafting up more
>>> documents, I think it would be more time effective for me for now.
>>
>> It might be the reverse for us. It's unlikely that faced with a lot of
>> new information, we will consistently retain all of it. Something of
>> reference might be handy.
>>
>>> Ioan
>>>
>>> Ben Clifford wrote:
>>>> The WSDL should describe part of the web services bit of the 
>>>> protocol. That might be a good place to start. The WSDL should 
>>>> already describe the messages that go over the wire in something 
>>>> vaguely readable to a human. Probably what would be needed would be 
>>>> the extra info to say which order messages are sent.
>>>>
>>>> On Sat, 16 Jun 2007, Mike Wilde wrote:
>>>>
>>>>  
>>>>> I think a nice clean message sequence chart describing Falkon's various
>>>>> activities would be very useful, as it's the backbone of its logic.
>>>>>
>>>>> I tried to create this for the SC paper by asking Ioan to describe the
>>>>> protocol to me, but I did not succeed.
>>>>>
>>>>> I think this would be a very useful description to maintain, as a UML
>>>>> sequence chart, and, Ioan, that this would be a very important part of
>>>>> your thesis or of future papers.
>>>>>
>>>>> It's up to you and Ian to weigh whether this would be valuable to your
>>>>> research.  I think it's invaluable for design, review and debugging.
>>>>>
>>>>> - Mike
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Mihael Hategan wrote, On 6/16/2007 3:15 PM:
>>>>>    
>>>>>> On Sat, 2007-06-16 at 19:52 +0000, Ian Foster wrote:
>>>>>>      
>>>>>>> I wasn't suggesting (at least in the first instance) that Mihael take the
>>>>>>> prototype and turn it into production, but that Mihael and Ioan sit down
>>>>>>> together and do a code walkthrough. I think that this would likely
>>>>>>> identify bugs and opportunities for simplification.
>>>>>>>         
>>>>>> It's somewhat on the same level of fun :). From the experience I've
>>>>>> accumulated so far, design is hard. Understanding prototype design is
>>>>>> probably even harder (not only do you need to understand the problem,
>>>>>> you also need to understand why many non-obvious things are done the way
>>>>>> they are done).
>>>>>>
>>>>>> Mihael
>>>>>>
>>>>>>      
>>>>>>> Ian
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Sent via BlackBerry from T-Mobile
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Ben Clifford <benc at hawaga.org.uk>
>>>>>>>
>>>>>>> Date: Sat, 16 Jun 2007 19:36:17 To:Mihael Hategan 
>>>>>>> <hategan at mcs.anl.gov>
>>>>>>> Cc:swift-devel at ci.uchicago.edu
>>>>>>> Subject: Re: [Swift-devel] CPU usage with provider-deef
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Sat, 16 Jun 2007, Mihael Hategan wrote:
>>>>>>>
>>>>>>>        
>>>>>>>> On Sat, 2007-06-16 at 12:03 -0500, Mike Wilde wrote:
>>>>>>>>          
>>>>>>>>> This should be fun, and a nice break from the I2U2 work that 
>>>>>>>>> you've
>>>>>>>>> been immersed in, Mihael.
>>>>>>>>>             
>>>>>>>> I have my reservations towards the amount of fun it involves.
>>>>>>>>           
>>>>>>> Right, taking prototypes and turning them into production isn't
>>>>>>> necessarily fun - in fact, a lot of the fun already happened with the
>>>>>>> making of the prototype and the rest is somewhat drudgery. (to an extent
>>>>>>> that's the same situation i2u2 cosmic was/is in).
>>>>>>>
>>>>>>>         
>>>>>>
>>>>>>
>>>>>>       
>>>>
>>>>   
>>
>>
>

-- 
============================================
Ioan Raicu
Ph.D. Student
============================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
============================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
       http://dsl.cs.uchicago.edu/
============================================
============================================


From yongzh at cs.uchicago.edu  Tue Jun 19 15:40:54 2007
From: yongzh at cs.uchicago.edu (Yong Zhao)
Date: Tue, 19 Jun 2007 15:40:54 -0500 (CDT)
Subject: [Swift-devel] Re: 100 molecule
In-Reply-To: <46782A14.2080308@cs.uchicago.edu>
References: <Pine.LNX.4.58.0706190921550.18172@classes.cs.uchicago.edu>
	<7DAEB562-6FD5-4425-9BBB-7AA84EA26526@mcs.anl.gov>
	<46782A14.2080308@cs.uchicago.edu>
Message-ID: <Pine.LNX.4.58.0706191538510.21432@classes.cs.uchicago.edu>

Ioan,

This sounds very good. I'm forwarding this to the swift list.

Yong.

On Tue, 19 Jun 2007, Ioan Raicu wrote:

> Yes, rm -rf could take that long... Yong, why don't you try these two
> commands, instead of "rm -rf"... I bet it will be much faster on the
> GPFS at ANL!
>
> find ./ -exec rm {} \;
> find ./ -exec rm -r {} \;
>
> The first one removes the files, and the second one removes the
> directories... I found rm -rf to be very slow on the ANL GPFS... it has
> to do with the fact that rm -rf does an expansion of all the files it
> needs to delete... and it ends up being very, very long if you have many
> files to delete... doing the method above, it does 1 delete at a
> time... so it doesn't suffer from the long list of files as rm -rf does.
>
> Ioan
>
> Veronika Nefedova wrote:
> > I am wondering how the cleanup is done? It's hard to believe that "rm
> > -rf" would work that long. At the end of the successful run there's just
> > one directory with one nested subdir that has to be removed.
> >
> > NIka
> >
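
The "one delete at a time" approach in the quoted find commands can also be
written as a short script. The sketch below is Python rather than shell and
is only an illustration: delete_one_at_a_time is a hypothetical helper, the
temporary tree is made up for the demonstration, and whether this is
actually faster than rm -rf on GPFS is Ioan's observation, not something the
code shows.

```python
import os
import tempfile


def delete_one_at_a_time(root):
    """Remove everything under `root`, one entry per call, leaving `root` itself."""
    removed = 0
    # topdown=False visits the deepest directories first, so each directory
    # is already empty by the time os.rmdir is called on it.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in filenames:
            os.remove(os.path.join(dirpath, name))
            removed += 1
        for name in dirnames:
            full = os.path.join(dirpath, name)
            if os.path.islink(full):
                os.remove(full)  # a symlink to a directory is unlinked, not followed
            else:
                os.rmdir(full)
            removed += 1
    return removed


# Small demonstration on a throwaway tree (hypothetical layout).
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "run", "nested"))
for i in range(3):
    with open(os.path.join(root, "run", "nested", f"file{i}.out"), "w") as fh:
        fh.write("intermediate data\n")

count = delete_one_at_a_time(root)
leftover = os.listdir(root)
os.rmdir(root)
```

Files are removed before the directory that holds them, which matches the
two-pass order (files first, then directories) described in the quote.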


From yongzh at cs.uchicago.edu  Tue Jun 19 15:43:59 2007
From: yongzh at cs.uchicago.edu (Yong Zhao)
Date: Tue, 19 Jun 2007 15:43:59 -0500 (CDT)
Subject: [Swift-devel] serializable DSHandle
In-Reply-To: <Pine.LNX.4.64.0706191838520.1452@dildano.hawaga.org.uk>
References: <Pine.LNX.4.64.0706191838520.1452@dildano.hawaga.org.uk>
Message-ID: <Pine.LNX.4.58.0706191541180.21432@classes.cs.uchicago.edu>

The original thought was that we can serialize the handle and pass it
around distributed sites, so the mapping could happen at different places
during the execution.

On Tue, 19 Jun 2007, Ben Clifford wrote:

>
> Is there a reason for DSHandle to be serializable?
>
> > public interface DSHandle extends Serializable {
>
>
> --
>


From benc at hawaga.org.uk  Tue Jun 19 15:44:45 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 19 Jun 2007 20:44:45 +0000 (GMT)
Subject: [Swift-devel] serializable DSHandle
In-Reply-To: <Pine.LNX.4.58.0706191541180.21432@classes.cs.uchicago.edu>
References: <Pine.LNX.4.64.0706191838520.1452@dildano.hawaga.org.uk>
	<Pine.LNX.4.58.0706191541180.21432@classes.cs.uchicago.edu>
Message-ID: <Pine.LNX.4.64.0706192044410.1452@dildano.hawaga.org.uk>


ok. So unused for now?

On Tue, 19 Jun 2007, Yong Zhao wrote:

> The original thought was that we can serialize the handle and pass it
> around distributed sites, so the mapping could happen at different places
> during the execution.
> 
> On Tue, 19 Jun 2007, Ben Clifford wrote:
> 
> >
> > Is there a reason for DSHandle to be serializable?
> >
> > > public interface DSHandle extends Serializable {
> >
> >
> > --
> >
> 
> 


From yongzh at cs.uchicago.edu  Tue Jun 19 15:45:43 2007
From: yongzh at cs.uchicago.edu (Yong Zhao)
Date: Tue, 19 Jun 2007 15:45:43 -0500 (CDT)
Subject: [Swift-devel] serializable DSHandle
In-Reply-To: <Pine.LNX.4.64.0706192044410.1452@dildano.hawaga.org.uk>
References: <Pine.LNX.4.64.0706191838520.1452@dildano.hawaga.org.uk>
	<Pine.LNX.4.58.0706191541180.21432@classes.cs.uchicago.edu>
	<Pine.LNX.4.64.0706192044410.1452@dildano.hawaga.org.uk>
Message-ID: <Pine.LNX.4.58.0706191545290.21432@classes.cs.uchicago.edu>

right, the feature is not used right now.

On Tue, 19 Jun 2007, Ben Clifford wrote:

>
> ok. So unused for now?
>
> On Tue, 19 Jun 2007, Yong Zhao wrote:
>
> > The original thought was that we can serialize the handle and pass it
> > around distributed sites, so the mapping could happen at different places
> > during the execution.
> >
> > On Tue, 19 Jun 2007, Ben Clifford wrote:
> >
> > >
> > > Is there a reason for DSHandle to be serializable?
> > >
> > > > public interface DSHandle extends Serializable {
> > >
> > >
> > > --
> > >
> >
> >
>


From wilde at mcs.anl.gov  Tue Jun 19 16:05:12 2007
From: wilde at mcs.anl.gov (Mike Wilde)
Date: Tue, 19 Jun 2007 16:05:12 -0500
Subject: [Swift-devel] Re: 100 molecule
In-Reply-To: <Pine.LNX.4.58.0706191538510.21432@classes.cs.uchicago.edu>
References: <Pine.LNX.4.58.0706190921550.18172@classes.cs.uchicago.edu>	<7DAEB562-6FD5-4425-9BBB-7AA84EA26526@mcs.anl.gov>	<46782A14.2080308@cs.uchicago.edu>
	<Pine.LNX.4.58.0706191538510.21432@classes.cs.uchicago.edu>
Message-ID: <46784508.6020200@mcs.anl.gov>

One technique that works nicely if you just want the old files out of 
the way is to do an mv of the top level dir to a new name, and then 
you can background the rm's.
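
A rough sketch of that idea, in Python rather than shell: the helper name
move_aside_and_delete, the ".deleting-" suffix and the temporary directory
are all assumptions made for the example. It relies on the rename staying
on the same filesystem, so that it is a cheap metadata operation.

```python
import os
import shutil
import tempfile
import threading
import uuid


def move_aside_and_delete(path):
    """Rename `path` out of the way, then delete the renamed tree in the background."""
    path = os.path.abspath(path)
    doomed = f"{path}.deleting-{uuid.uuid4().hex}"
    os.rename(path, doomed)  # quick when source and target share a filesystem
    worker = threading.Thread(
        target=shutil.rmtree, args=(doomed,), kwargs={"ignore_errors": True}
    )
    worker.start()
    return doomed, worker


# Demonstration on a throwaway directory (hypothetical layout).
workdir = os.path.join(tempfile.mkdtemp(), "workdir")
os.makedirs(os.path.join(workdir, "nested"))
with open(os.path.join(workdir, "nested", "data.out"), "w") as fh:
    fh.write("old results\n")

doomed, worker = move_aside_and_delete(workdir)
name_is_free = not os.path.exists(workdir)
os.makedirs(workdir)  # the old name can be reused straight away
worker.join()         # only needed here so the demonstration can check the result
```

The caller gets the original name back immediately and the slow delete
proceeds on its own, which is the effect of backgrounding the rm's.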

- Mike

Yong Zhao wrote, On 6/19/2007 3:40 PM:
> Ioan,
> 
> This sounds very good. I'm forwarding this to the swift list.
> 
> Yong.
> 
> On Tue, 19 Jun 2007, Ioan Raicu wrote:
> 
>> Yes, rm -rf could take that long... Yong, why don't you try these two
>> commands, instead of "rm -rf"... I bet it will be much faster on the
>> GPFS at ANL!
>>
>> find ./ -exec rm {} \;
>> find ./ -exec rm -r {} \;
>>
>> The first one removes the files, and the second one removes the
>> directories... I found rm -rf to be very slow on the ANL GPFS... it has
>> to do with the fact that rm -rf does an expansion of all the files it
>> needs to delete... and it ends up being very, very long if you have many
>> files to delete... doing the method above, it does 1 delete at a
>> time... so it doesn't suffer from the long list of files as rm -rf does.
>>
>> Ioan
>>
>> Veronika Nefedova wrote:
>>> I am wondering how the cleanup is done? It's hard to believe that "rm
>>> -rf" would work that long. At the end of the successful run there's just
>>> one directory with one nested subdir that has to be removed.
>>>
>>> NIka
>>>
> 
> 

-- 
Mike Wilde
Computation Institute, University of Chicago
Math & Computer Science Division
Argonne National Laboratory
Argonne, IL   60439    USA
tel 630-252-7497 fax 630-252-1997


From benc at hawaga.org.uk  Tue Jun 19 18:51:44 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 19 Jun 2007 23:51:44 +0000 (GMT)
Subject: [Swift-devel] Re: email for Mike and Ian
In-Reply-To: <Pine.LNX.4.58.0706191653260.5413@classes.cs.uchicago.edu>
References: <46782EFC.6030601@cs.uchicago.edu>
	<FC1E3EC1-908E-415B-98A6-7A631CA90A27@mcs.anl.gov>
	<46784274.8090202@cs.uchicago.edu>
	<Pine.LNX.4.58.0706191558350.21432@classes.cs.uchicago.edu>
	<051DE8C7-3598-4EDE-BCD7-870FE28B5CA6@mcs.anl.gov>
	<Pine.LNX.4.58.0706191630480.5413@classes.cs.uchicago.edu>
	<99348A95-85BD-4924-B708-481982C9F425@mcs.anl.gov>
	<Pine.LNX.4.58.0706191653260.5413@classes.cs.uchicago.edu>
Message-ID: <Pine.LNX.4.64.0706192345380.1452@dildano.hawaga.org.uk>


This looks like bug 49:

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=49

I just spent the evening tracking it down with Nika.

As far as I can tell, that means she's been using a swift compiler that 
has been at least 2 months old right up until she just updated it this 
evening.

*please* try to report problems against something resembling a recent 
checkout-and-build.

Furthermore, when I finally tracked it down, it turns out that it's because of 
a bug in the SwiftScript source. I fixed *exactly* this problem, here:

 Date: Sat, 28 Apr 2007 08:39:03 +0000 (GMT)                                     
 From: Ben Clifford <benc at hawaga.org.uk>                                         
 To: Veronika  V. Nefedova <nefedova at mcs.anl.gov>                                
 Cc: swift-devel at ci.uchicago.edu                                                 
 Subject: Re: [Swift-devel] nightly built 070426      

*please* try to actually use bugfixes that people give you.

Bad users!

Go to your room!


On Tue, 19 Jun 2007, Yong Zhao wrote:

> I tried the restart feature yesterday and it seemed to work fine with the
> MolDyn workflow. I am not sure what was the problem that you encountered.
> 
> About the compile problem, maybe Ben can take a look since he made a few
> changes to the translation.
> 
> yong.
> 
> On Tue, 19 Jun 2007, Veronika Nefedova wrote:
> 
> > Yong,
> >
> > Ben asks me to test the restart feature that was failing before... I
> > am wondering if it's OK to do svn up and then rebuild vdsk? I do not
> > want to break things... If it's OK - should I do it in ~nefedova/vdsk
> > (I assume)?
> >
> > Nika
> >
> > On Jun 19, 2007, at 4:31 PM, Yong Zhao wrote:
> >
> > > did you make sure that your path is set correctly? do a
> > >
> > > which swift
> > >
> > > On Tue, 19 Jun 2007, Veronika Nefedova wrote:
> > >
> > >> Yong,
> > >>
> > >> Any idea what could've caused it to fail:
> > >>
> > >> nefedova at viper:~/alamines> cat MolDyn-244-ctsmk1lnf2qa1.log
> > >> 2007-06-19 16:11:19,256 INFO  Loader MolDyn-244.dtm: source file is
> > >> new. Recompiling.
> > >> 2007-06-19 16:12:08,346 DEBUG Loader Detailed exception:
> > >> java.lang.RuntimeException: Failed to convert .xml to .kml for
> > >> MolDyn-244.dtm
> > >>          at org.griphyn.vdl.karajan.Loader.compile(Loader.java:209)
> > >>          at org.griphyn.vdl.karajan.Loader.main(Loader.java:108)
> > >> Caused by: java.util.NoSuchElementException: no such attribute: nil
> > >> in template context [call_arg]
> > >>          at org.antlr.stringtemplate.StringTemplate.rawSetAttribute
> > >> (StringTemplate.java:643)
> > >>          at org.antlr.stringtemplate.StringTemplate.setAttribute
> > >> (StringTemplate.java:539)
> > >>          at org.griphyn.vdl.engine.Karajan.setExprOrValue
> > >> (Karajan.java:663)
> > >>          at org.griphyn.vdl.engine.Karajan.setExprOrValue
> > >> (Karajan.java:638)
> > >>          at org.griphyn.vdl.engine.Karajan.actualParameter
> > >> (Karajan.java:458)
> > >>          at org.griphyn.vdl.engine.Karajan.call(Karajan.java:351)
> > >>          at org.griphyn.vdl.engine.Karajan.statements(Karajan.java:
> > >> 304)
> > >>          at org.griphyn.vdl.engine.Karajan.program(Karajan.java:117)
> > >>          at org.griphyn.vdl.engine.Karajan.main(Karajan.java:71)
> > >>          at org.griphyn.vdl.karajan.Loader.compile(Loader.java:199)
> > >>          ... 1 more
> > >> nefedova at viper:~/alamines>
> > >>
> > >>
> > >>
> > >> The dtm file is generated by a script. The same script that generated
> > >> the files for 1,20 and 100 molecules. Not sure why 244 is different.
> > >> Everything is in my alamines dir on viper in home dir...
> > >>
> > >> Nika
> > >>
> > >> On Jun 19, 2007, at 4:00 PM, Yong Zhao wrote:
> > >>
> > >>> Everything is configured in Nika's directory:
> > >>> ~nefedova/vdsk
> > >>>
> > > >>> Just point VDS_HOME or SWIFT_HOME to /home/nefedova/vdsk, and the rest
> > > >>> should be correctly configured in the etc directory.
> > >>>
> > >>> Yong.
> > >>>
> > >>> On Tue, 19 Jun 2007, Ioan Raicu wrote:
> > >>>
> > > >>>> Yong, you are the one who ran the Swift workflow... can you make sure
> > > >>>> Nika has everything updated, or can you invoke the command from your
> > > >>>> environment?
> > > >>>>
> > > >>>> I have restarted Falkon and set it to 18 hours for 100 nodes (200
> > > >>>> workers)... it's all up and running... there is a 2 hour idle time, so
> > > >>>> make sure to start the workflow in the next 2 hours so we don't lose
> > > >>>> the allocation.
> > >>>>
> > >>>> Falkon is in the same place as last night, tg-viz-login1 on 50001!
> > >>>>
> > >>>> Ioan
> > >>>>
> > >>>> Veronika Nefedova wrote:
> > > >>>>> Ok, I have the file ready. What workdir should I specify for TG UC?
> > >>>>>
> > >>>>> Nika
> > >>>>>
> > >>>>> On Jun 19, 2007, at 2:31 PM, Ioan Raicu wrote:
> > >>>>>
> > > >>>>>> Hi guys,
> > > >>>>>> I need to go eat some lunch... I'll be back in 30 min... but then
> > > >>>>>> I'll only be online until 4PM... so can you please look over that
> > > >>>>>> email, and send it back to me soon?  Also, let's decide what to do
> > > >>>>>> about the next run, is a 244 short mol run OK?  Nika, can you prep the
> > > >>>>>> input data for this?  ANL seems almost idle, only 4 nodes are in use,
> > > >>>>>> so we could easily get another 200 processors like last night :)
> > >>>>>>
> > >>>>>> Ioan
> > >>>>>>
> > >>>>>> --
> > >>>>>> ============================================
> > >>>>>> Ioan Raicu
> > >>>>>> Ph.D. Student
> > >>>>>> ============================================
> > >>>>>> Distributed Systems Laboratory
> > >>>>>> Computer Science Department
> > >>>>>> University of Chicago
> > >>>>>> 1100 E. 58th Street, Ryerson Hall
> > >>>>>> Chicago, IL 60637
> > >>>>>> ============================================
> > >>>>>> Email: iraicu at cs.uchicago.edu
> > >>>>>> Web:   http://www.cs.uchicago.edu/~iraicu
> > >>>>>>       http://dsl.cs.uchicago.edu/
> > >>>>>> ============================================
> > >>>>>> ============================================
> > >>>>>>
> > >>>>>
> > >>>>>
> > >>>>
> > >>>> --
> > >>>> ============================================
> > >>>> Ioan Raicu
> > >>>> Ph.D. Student
> > >>>> ============================================
> > >>>> Distributed Systems Laboratory
> > >>>> Computer Science Department
> > >>>> University of Chicago
> > >>>> 1100 E. 58th Street, Ryerson Hall
> > >>>> Chicago, IL 60637
> > >>>> ============================================
> > >>>> Email: iraicu at cs.uchicago.edu
> > >>>> Web:   http://www.cs.uchicago.edu/~iraicu
> > >>>>        http://dsl.cs.uchicago.edu/
> > >>>> ============================================
> > >>>> ============================================
> > >>>>
> > >>>>
> > >>>
> > >>
> > >>
> > >
> >
> >
> 
> 


From wilde at mcs.anl.gov  Tue Jun 19 19:04:06 2007
From: wilde at mcs.anl.gov (Mike Wilde)
Date: Tue, 19 Jun 2007 19:04:06 -0500
Subject: [Swift-devel] Re: email for Mike and Ian
In-Reply-To: <Pine.LNX.4.64.0706192345380.1452@dildano.hawaga.org.uk>
References: <46782EFC.6030601@cs.uchicago.edu>	<FC1E3EC1-908E-415B-98A6-7A631CA90A27@mcs.anl.gov>	<46784274.8090202@cs.uchicago.edu>	<Pine.LNX.4.58.0706191558350.21432@classes.cs.uchicago.edu>	<051DE8C7-3598-4EDE-BCD7-870FE28B5CA6@mcs.anl.gov>	<Pine.LNX.4.58.0706191630480.5413@classes.cs.uchicago.edu>	<99348A95-85BD-4924-B708-481982C9F425@mcs.anl.gov>	<Pine.LNX.4.58.0706191653260.5413@classes.cs.uchicago.edu>
	<Pine.LNX.4.64.0706192345380.1452@dildano.hawaga.org.uk>
Message-ID: <46786EF6.507@mcs.anl.gov>

So, at a practical level, what went wrong here and what do we do to 
correct it?

The points below are perhaps a bit naive and reflect the sad fact 
that I'm not currently a user.  But to set guidelines for ourselves 
and a growing community of users, should we:

- Run Swift from well-defined submit hosts

- Keep those hosts up to date with nightly builds

- Stay attuned to Bugzilla traffic to know when to jump to a new build

- Is the run dir (and/or the logs) clearly tagged with the build date?

- Use only official builds if at all possible (unless you need to 
include a fix that's not yet been included in a build?)

- What else?

Would it be useful to spell out good practices for Nika, Tibi, and 
CNARI, MolDyn, and LQCD people?

Thanks,

Mike


Ben Clifford wrote, On 6/19/2007 6:51 PM:
> This looks like bug 49:
> 
> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=49
> 
> I just spent the evening tracking it down with Nika.
> 
> As far as I can tell, that means she's been using a swift compiler that 
> has been at least 2 months old right up until she just updated it this 
> evening.
> 
> *please* try to report problems against something resembling a recent 
> checkout-and-build.
> 
> Furthermore, when I finally tracked it down, turns out that its because of 
> a bug in the SwiftScript source. I fix *exactly* this problem, here:
> 
>  Date: Sat, 28 Apr 2007 08:39:03 +0000 (GMT)                                     
>  From: Ben Clifford <benc at hawaga.org.uk>                                         
>  To: Veronika  V. Nefedova <nefedova at mcs.anl.gov>                                
>  Cc: swift-devel at ci.uchicago.edu                                                 
>  Subject: Re: [Swift-devel] nightly built 070426      
> 
> *please* try to actually use bugfixes that people give you.
> 
> Bad users!
> 
> Go to your room!
> 
> 
> On Tue, 19 Jun 2007, Yong Zhao wrote:
> 
>> I tried the restart feature yesterday and it seemed to work fine with the
>> MolDyn workflow. I am not sure what was the problem that you encountered.
>>
>> About the compile problem, maybe Ben can take a look since he made a few
>> changes to the translation.
>>
>> yong.
>>
>> On Tue, 19 Jun 2007, Veronika Nefedova wrote:
>>
>>> Yong,
>>>
>>> Ben asks me to test the restart feature that was failing before... I
>>> am wondering if its OK to do svn up and then rebuild vdsk? I do not
>>> want to break things... If its OK - should I do it in ~nefedova/vdsk
>>> (I assume)?
>>>
>>> Nika
>>>
>>> On Jun 19, 2007, at 4:31 PM, Yong Zhao wrote:
>>>
>>>> did you make sure that your path is set correctly? do a
>>>>
>>>> which swift
>>>>
>>>> On Tue, 19 Jun 2007, Veronika Nefedova wrote:
>>>>
>>>>> Yong,
>>>>>
>>>>> Any idea what could've caused it to fail:
>>>>>
>>>>> nefedova at viper:~/alamines> cat MolDyn-244-ctsmk1lnf2qa1.log
>>>>> 2007-06-19 16:11:19,256 INFO  Loader MolDyn-244.dtm: source file is
>>>>> new. Recompiling.
>>>>> 2007-06-19 16:12:08,346 DEBUG Loader Detailed exception:
>>>>> java.lang.RuntimeException: Failed to convert .xml to .kml for
>>>>> MolDyn-244.dtm
>>>>>          at org.griphyn.vdl.karajan.Loader.compile(Loader.java:209)
>>>>>          at org.griphyn.vdl.karajan.Loader.main(Loader.java:108)
>>>>> Caused by: java.util.NoSuchElementException: no such attribute: nil
>>>>> in template context [call_arg]
>>>>>          at org.antlr.stringtemplate.StringTemplate.rawSetAttribute
>>>>> (StringTemplate.java:643)
>>>>>          at org.antlr.stringtemplate.StringTemplate.setAttribute
>>>>> (StringTemplate.java:539)
>>>>>          at org.griphyn.vdl.engine.Karajan.setExprOrValue
>>>>> (Karajan.java:663)
>>>>>          at org.griphyn.vdl.engine.Karajan.setExprOrValue
>>>>> (Karajan.java:638)
>>>>>          at org.griphyn.vdl.engine.Karajan.actualParameter
>>>>> (Karajan.java:458)
>>>>>          at org.griphyn.vdl.engine.Karajan.call(Karajan.java:351)
>>>>>          at org.griphyn.vdl.engine.Karajan.statements(Karajan.java:
>>>>> 304)
>>>>>          at org.griphyn.vdl.engine.Karajan.program(Karajan.java:117)
>>>>>          at org.griphyn.vdl.engine.Karajan.main(Karajan.java:71)
>>>>>          at org.griphyn.vdl.karajan.Loader.compile(Loader.java:199)
>>>>>          ... 1 more
>>>>> nefedova at viper:~/alamines>
>>>>>
>>>>>
>>>>>
>>>>> The dtm file is generated by a script. The same script that generated
>>>>> the files for 1,20 and 100 molecules. Not sure why 244 is different.
>>>>> Everything is in my alamines dir on viper in home dir...
>>>>>
>>>>> Nika
>>>>>
>>>>> On Jun 19, 2007, at 4:00 PM, Yong Zhao wrote:
>>>>>
>>>>>> Everything is configured in Nika's directory:
>>>>>> ~nefedova/vdsk
>>>>>>
>>>>>> Just point VDS_HOME or SWIFT_HOME to /home/nefedova/vdsk, and the
>>>>>> rest
>>>>>> should be correctly configured in the etc directory.
>>>>>>
>>>>>> Yong.
>>>>>>
>>>>>> On Tue, 19 Jun 2007, Ioan Raicu wrote:
>>>>>>
>>>>>>> Yong, you are the one who ran the Swift workflow... can you make
>>>>>>> sure
>>>>>>> Nika has everything updated, or can you invoke the command form
>>>>>>> your
>>>>>>> environment?
>>>>>>>
>>>>>>> I have restarted Falkon and set it to 18 hours for 100 nodes (200
>>>>>>> workers).... its all up and running... there is a 2 hour idle
>>>>>>> time, so
>>>>>>> make sure to start the workflow in the next 2 hours so we don't
>>>>>>> loose
>>>>>>> the allocation.
>>>>>>>
>>>>>>> Falkon is in the same place as last night, tg-viz-login1 on 50001!
>>>>>>>
>>>>>>> Ioan
>>>>>>>
>>>>>>> Veronika Nefedova wrote:
>>>>>>>> Ok, I have the file ready. What workdir should I specify for TG
>>>>>>>> UC ?
>>>>>>>>
>>>>>>>> Nika
>>>>>>>>
>>>>>>>> On Jun 19, 2007, at 2:31 PM, Ioan Raicu wrote:
>>>>>>>>
>>>>>>>>> Hi guys,
>>>>>>>>> I need to go eat some lunch.... I'll be back in 30 min... but
>>>>>>>>> then
>>>>>>>>> I'll only be online until 4PM... so can you please look over that
>>>>>>>>> email, and send it back to me soon?  Also, let's decide what
>>>>>>>>> to do
>>>>>>>>> about the next run, is 244 short mol run OK?  Nika, can you prep
>>>>>>>>> the
>>>>>>>>> input data for this?  ANL seems almost idle, only 4 nodes are in
>>>>>>>>> use,
>>>>>>>>> so we could easily et another 200 processors like last night :)
>>>>>>>>>
>>>>>>>>> Ioan
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> ============================================
>>>>>>>>> Ioan Raicu
>>>>>>>>> Ph.D. Student
>>>>>>>>> ============================================
>>>>>>>>> Distributed Systems Laboratory
>>>>>>>>> Computer Science Department
>>>>>>>>> University of Chicago
>>>>>>>>> 1100 E. 58th Street, Ryerson Hall
>>>>>>>>> Chicago, IL 60637
>>>>>>>>> ============================================
>>>>>>>>> Email: iraicu at cs.uchicago.edu
>>>>>>>>> Web:   http://www.cs.uchicago.edu/~iraicu
>>>>>>>>>       http://dsl.cs.uchicago.edu/
>>>>>>>>> ============================================
>>>>>>>>> ============================================
>>>>>>>>>
>>>>>>>>
>>>>>>> --
>>>>>>> ============================================
>>>>>>> Ioan Raicu
>>>>>>> Ph.D. Student
>>>>>>> ============================================
>>>>>>> Distributed Systems Laboratory
>>>>>>> Computer Science Department
>>>>>>> University of Chicago
>>>>>>> 1100 E. 58th Street, Ryerson Hall
>>>>>>> Chicago, IL 60637
>>>>>>> ============================================
>>>>>>> Email: iraicu at cs.uchicago.edu
>>>>>>> Web:   http://www.cs.uchicago.edu/~iraicu
>>>>>>>        http://dsl.cs.uchicago.edu/
>>>>>>> ============================================
>>>>>>> ============================================
>>>>>>>
>>>>>>>
>>>>>
>>>
>>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> 
> 

-- 
Mike Wilde
Computation Institute, University of Chicago
Math & Computer Science Division
Argonne National Laboratory
Argonne, IL   60439    USA
tel 630-252-7497 fax 630-252-1997


From benc at hawaga.org.uk  Tue Jun 19 19:10:06 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 20 Jun 2007 00:10:06 +0000 (GMT)
Subject: [Swift-devel] Re: email for Mike and Ian
In-Reply-To: <46786EF6.507@mcs.anl.gov>
References: <46782EFC.6030601@cs.uchicago.edu>
	<FC1E3EC1-908E-415B-98A6-7A631CA90A27@mcs.anl.gov>
	<46784274.8090202@cs.uchicago.edu>
	<Pine.LNX.4.58.0706191558350.21432@classes.cs.uchicago.edu>
	<051DE8C7-3598-4EDE-BCD7-870FE28B5CA6@mcs.anl.gov>
	<Pine.LNX.4.58.0706191630480.5413@classes.cs.uchicago.edu>
	<99348A95-85BD-4924-B708-481982C9F425@mcs.anl.gov>
	<Pine.LNX.4.58.0706191653260.5413@classes.cs.uchicago.edu>
	<Pine.LNX.4.64.0706192345380.1452@dildano.hawaga.org.uk>
	<46786EF6.507@mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0706200007150.1452@dildano.hawaga.org.uk>


On Tue, 19 Jun 2007, Mike Wilde wrote:

> So, at a practical level, what went wrong here and what do we do to correct
> it?

> - use only official builds if at all possible (unless you need to 
> include a fix thats not yet been included in a build?)

there's a reluctance on Nika's part to upgrade I think because it's a 
hassle for her to get the Falkon/cog provider into a new build. It should 
not be hard. provider-deef should be going into SVN 'real soon' (i.e. as 
soon as Yong provides a clean recent tree for me to import) and after that 
it should be a lot easier to deploy the most recent swift + most recent 
provider-deef.

-- 


From wilde at mcs.anl.gov  Tue Jun 19 19:17:21 2007
From: wilde at mcs.anl.gov (Mike Wilde)
Date: Tue, 19 Jun 2007 19:17:21 -0500
Subject: [Swift-devel] Re: email for Mike and Ian
In-Reply-To: <Pine.LNX.4.64.0706200007150.1452@dildano.hawaga.org.uk>
References: <46782EFC.6030601@cs.uchicago.edu>
	<FC1E3EC1-908E-415B-98A6-7A631CA90A27@mcs.anl.gov>
	<46784274.8090202@cs.uchicago.edu>
	<Pine.LNX.4.58.0706191558350.21432@classes.cs.uchicago.edu>
	<051DE8C7-3598-4EDE-BCD7-870FE28B5CA6@mcs.anl.gov>
	<Pine.LNX.4.58.0706191630480.5413@classes.cs.uchicago.edu>
	<99348A95-85BD-4924-B708-481982C9F425@mcs.anl.gov>
	<Pine.LNX.4.58.0706191653260.5413@classes.cs.uchicago.edu>
	<Pine.LNX.4.64.0706192345380.1452@dildano.hawaga.org.uk>
	<46786EF6.507@mcs.anl.gov>
	<Pine.LNX.4.64.0706200007150.1452@dildano.hawaga.org.uk>
Message-ID: <46787211.5030802@mcs.anl.gov>

k, that makes sense.

I think as we work to grow a group culture that works like a smooth 
distributed open source project, though, that writing down our 
processes will help us grow our team in both size and productivity.

Whatever Nika and Tibi are doing today, a growing group of users and 
collaborators will be doing tomorrow - and we need to steer us and 
them towards good practices.

- Mike


Ben Clifford wrote, On 6/19/2007 7:10 PM:
> 
> On Tue, 19 Jun 2007, Mike Wilde wrote:
> 
>> So, at a practical level, what went wrong here and what do we do to correct
>> it?
> 
>> - use only official builds if at all possible (unless you need to 
>> include a fix thats not yet been included in a build?)
> 
> there's a reluctance on Nika's part to upgrade I think because its a 
> hassle for her to get the Falkon/cog provider into a new build. It should 
> not be hard. provider-deef should be going into SVN 'real soon' (i.e. as 
> soon as Yong provides a clean recent tree for me to import) and after that 
> it should be a lot easier to deploy the most recent swift + most recent 
> provider-deef.
> 

-- 
Mike Wilde
Computation Institute, University of Chicago
Math & Computer Science Division
Argonne National Laboratory
Argonne, IL   60439    USA
tel 630-252-7497 fax 630-252-1997


From nefedova at mcs.anl.gov  Tue Jun 19 19:18:34 2007
From: nefedova at mcs.anl.gov (Veronika Nefedova)
Date: Tue, 19 Jun 2007 19:18:34 -0500
Subject: [Swift-devel] Re: email for Mike and Ian
In-Reply-To: <Pine.LNX.4.64.0706200007150.1452@dildano.hawaga.org.uk>
References: <46782EFC.6030601@cs.uchicago.edu>
	<FC1E3EC1-908E-415B-98A6-7A631CA90A27@mcs.anl.gov>
	<46784274.8090202@cs.uchicago.edu>
	<Pine.LNX.4.58.0706191558350.21432@classes.cs.uchicago.edu>
	<051DE8C7-3598-4EDE-BCD7-870FE28B5CA6@mcs.anl.gov>
	<Pine.LNX.4.58.0706191630480.5413@classes.cs.uchicago.edu>
	<99348A95-85BD-4924-B708-481982C9F425@mcs.anl.gov>
	<Pine.LNX.4.58.0706191653260.5413@classes.cs.uchicago.edu>
	<Pine.LNX.4.64.0706192345380.1452@dildano.hawaga.org.uk>
	<46786EF6.507@mcs.anl.gov>
	<Pine.LNX.4.64.0706200007150.1452@dildano.hawaga.org.uk>
Message-ID: <78ED59EC-B377-43A8-909B-2013415C50D7@mcs.anl.gov>

I do not think that's a correct assessment.

I did update *many* times over the course of the last couple of weeks, I  
just didn't recompile my dtm files. And they worked just fine, btw.
I do not know how it happened that I ended up with the old version of  
the Swift script -- it might've happened during those submit-host-to-  
submit-host moves, and obviously it was not intentional to discard  
any new bug fixes.

Nika

On Jun 19, 2007, at 7:10 PM, Ben Clifford wrote:

>
>
> On Tue, 19 Jun 2007, Mike Wilde wrote:
>
>> So, at a practical level, what went wrong here and what do we do  
>> to correct
>> it?
>
>> - use only official builds if at all possible (unless you need to
>> include a fix thats not yet been included in a build?)
>
> there's a reluctance on Nika's part to upgrade I think because its a
> hassle for her to get the Falkon/cog provider into a new build. It  
> should
> not be hard. provider-deef should be going into SVN 'real  
> soon' (i.e. as
> soon as Yong provides a clean recent tree for me to import) and  
> after that
> it should be a lot easier to deploy the most recent swift + most  
> recent
> provider-deef.
>
> -- 
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>


From benc at hawaga.org.uk  Tue Jun 19 19:27:30 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 20 Jun 2007 00:27:30 +0000 (GMT)
Subject: [Swift-devel] Re: email for Mike and Ian
In-Reply-To: <78ED59EC-B377-43A8-909B-2013415C50D7@mcs.anl.gov>
References: <46782EFC.6030601@cs.uchicago.edu>
	<FC1E3EC1-908E-415B-98A6-7A631CA90A27@mcs.anl.gov>
	<46784274.8090202@cs.uchicago.edu>
	<Pine.LNX.4.58.0706191558350.21432@classes.cs.uchicago.edu>
	<051DE8C7-3598-4EDE-BCD7-870FE28B5CA6@mcs.anl.gov>
	<Pine.LNX.4.58.0706191630480.5413@classes.cs.uchicago.edu>
	<99348A95-85BD-4924-B708-481982C9F425@mcs.anl.gov>
	<Pine.LNX.4.58.0706191653260.5413@classes.cs.uchicago.edu>
	<Pine.LNX.4.64.0706192345380.1452@dildano.hawaga.org.uk>
	<46786EF6.507@mcs.anl.gov>
	<Pine.LNX.4.64.0706200007150.1452@dildano.hawaga.org.uk>
	<78ED59EC-B377-43A8-909B-2013415C50D7@mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0706200024250.1452@dildano.hawaga.org.uk>


On Tue, 19 Jun 2007, Veronika Nefedova wrote:

> I did update *many* times along the course of last couple of weeks, I 
> just didn't recompile my dtm files. And they worked just fine, btw.

ok.

It's probably a good idea from a swift testing perspective to be doing 
clean builds of the swiftscript programs (cleaning away the .xml and 
karajan files) regularly, even though it shouldn't affect the application 
side of things.
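
A minimal sketch of such a clean step, in Python. It assumes the
intermediates sit next to the source file and share its stem with .xml
and .kml suffixes (as the "Failed to convert .xml to .kml" log message
suggests); the function name clean_compiled is made up for this
illustration and is not part of Swift.

```python
from pathlib import Path


def clean_compiled(source: Path, suffixes=(".xml", ".kml")) -> list[Path]:
    """Delete intermediates that share the source file's stem.

    Returns the list of files that were actually removed.
    The suffix list is an assumption, not something Swift defines.
    """
    removed = []
    for suffix in suffixes:
        candidate = source.with_suffix(suffix)
        if candidate.is_file():
            candidate.unlink()
            removed.append(candidate)
    return removed
```

Calling clean_compiled(Path("MolDyn-244.dtm")) before a run would leave
only the .dtm file, so the next run has to compile from scratch.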

> I do not know how it happened that I ended up with the old version of 
> swfit script -- it might've happened during those submit host-to-submit 
> host moves, and obviously it was not intentional to discard any new bug 
> fixes.

You might stick stuff in version control and use that rather than copying 
files between machines - that can alleviate overwriting problems 
sometimes. Tibi's been keeping at least one of his apps in the SwiftApps/ 
directory in the swift SVN.

-- 


From wilde at mcs.anl.gov  Tue Jun 19 19:41:44 2007
From: wilde at mcs.anl.gov (Mike Wilde)
Date: Tue, 19 Jun 2007 19:41:44 -0500
Subject: [Swift-devel] Swift application testing practices
In-Reply-To: <78ED59EC-B377-43A8-909B-2013415C50D7@mcs.anl.gov>
References: <46782EFC.6030601@cs.uchicago.edu>
	<FC1E3EC1-908E-415B-98A6-7A631CA90A27@mcs.anl.gov>
	<46784274.8090202@cs.uchicago.edu>
	<Pine.LNX.4.58.0706191558350.21432@classes.cs.uchicago.edu>
	<051DE8C7-3598-4EDE-BCD7-870FE28B5CA6@mcs.anl.gov>
	<Pine.LNX.4.58.0706191630480.5413@classes.cs.uchicago.edu>
	<99348A95-85BD-4924-B708-481982C9F425@mcs.anl.gov>
	<Pine.LNX.4.58.0706191653260.5413@classes.cs.uchicago.edu>
	<Pine.LNX.4.64.0706192345380.1452@dildano.hawaga.org.uk>
	<46786EF6.507@mcs.anl.gov>
	<Pine.LNX.4.64.0706200007150.1452@dildano.hawaga.org.uk>
	<78ED59EC-B377-43A8-909B-2013415C50D7@mcs.anl.gov>
Message-ID: <467877C8.3050104@mcs.anl.gov>

First, let's change this subject line.

Second, I'm learning that we really are an open source development 
team, in that we're very distributed.  And on such teams, we need to 
learn how to dish out and receive scoldings from each other. Both 
with a smile.  Like so:  :)

This discussion is good and the kind of pointers we need to exchange 
as part of continuous improvement.

Nika, it sounds like it's a simple case of using a different 
procedure to update your Swift code - and that we should document 
the procedure for all to use, especially the growing user base.

I'd like to see us starting and growing nice wiki pages for these 
techniques, and to periodically restructure the wiki as needed to 
keep it informative and useful to us.

Please, do this as a matter of course, without being "assigned" to 
do it. It helps make our environment a great place to work, and 
certainly helps us grow.

Btw - I found the talk by Brian Fitzpatrick at the Globus committers 
meeting to be excellent, and very inspiring.  Don't be misled by the 
title: it applies as much to defining what "Good" people are as it 
does to dealing with "bad" ones.

Here's a nice summary:

http://www.oreillynet.com/conferences/blog/2006/07/oscon_how_open_source_projects.html

The video is at:

http://video.google.com/videoplay?docid=-4216011961522818645

The slides are at:

http://www.slideshare.net/vishnu/how-to-protect-yourhow-to-protect-your-open-source-project-from-poisonous-people/

And a great book recommended by both the speakers and our own Ben is:

Producing Open Source Software
How to Run a Successful Free Software Project
by Karl Fogel

Full text is at:
http://producingoss.com/

This is a nice fast read that you can get a lot out of on each 
10-minute break that you spend browsing it.

I think there's loads of things we can pick up from here to help us 
grow as a project and as a team.

- Mike


Veronika Nefedova wrote, On 6/19/2007 7:18 PM:
> I do not think its a correct asessment.
> 
> I did update *many* times along the course of last couple of weeks, I 
> just didn't recompile my dtm files. And they worked just fine, btw.
> I do not know how it happened that I ended up with the old version of 
> swfit script -- it might've happened during those submit host-to-submit 
> host moves, and obviously it was not intentional to discard any new bug 
> fixes.
> 
> Nika
> 
> On Jun 19, 2007, at 7:10 PM, Ben Clifford wrote:
> 
>>
>>
>> On Tue, 19 Jun 2007, Mike Wilde wrote:
>>
>>> So, at a practical level, what went wrong here and what do we do to 
>>> correct
>>> it?
>>
>>> - use only official builds if at all possible (unless you need to
>>> include a fix thats not yet been included in a build?)
>>
>> there's a reluctance on Nika's part to upgrade I think because its a
>> hassle for her to get the Falkon/cog provider into a new build. It should
>> not be hard. provider-deef should be going into SVN 'real soon' (i.e. as
>> soon as Yong provides a clean recent tree for me to import) and after 
>> that
>> it should be a lot easier to deploy the most recent swift + most recent
>> provider-deef.
>>
>> --_______________________________________________
>> Swift-devel mailing list
>> Swift-devel at ci.uchicago.edu
>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>>
> 
> 

-- 
Mike Wilde
Computation Institute, University of Chicago
Math & Computer Science Division
Argonne National Laboratory
Argonne, IL   60439    USA
tel 630-252-7497 fax 630-252-1997


From tiberius at ci.uchicago.edu  Tue Jun 19 20:33:29 2007
From: tiberius at ci.uchicago.edu (Tiberiu Stef-Praun)
Date: Tue, 19 Jun 2007 20:33:29 -0500
Subject: [Swift-devel] Re: email for Mike and Ian
In-Reply-To: <Pine.LNX.4.64.0706200024250.1452@dildano.hawaga.org.uk>
References: <46782EFC.6030601@cs.uchicago.edu>
	<051DE8C7-3598-4EDE-BCD7-870FE28B5CA6@mcs.anl.gov>
	<Pine.LNX.4.58.0706191630480.5413@classes.cs.uchicago.edu>
	<99348A95-85BD-4924-B708-481982C9F425@mcs.anl.gov>
	<Pine.LNX.4.58.0706191653260.5413@classes.cs.uchicago.edu>
	<Pine.LNX.4.64.0706192345380.1452@dildano.hawaga.org.uk>
	<46786EF6.507@mcs.anl.gov>
	<Pine.LNX.4.64.0706200007150.1452@dildano.hawaga.org.uk>
	<78ED59EC-B377-43A8-909B-2013415C50D7@mcs.anl.gov>
	<Pine.LNX.4.64.0706200024250.1452@dildano.hawaga.org.uk>
Message-ID: <fec1351f0706191833x28f2e699oefdc960e616cc3fa@mail.gmail.com>

Tibi keeps all three of his apps in the SVN
I cannot say it's a big win, but at least it helps me keep track of
the latest version.


On 6/19/07, Ben Clifford <benc at hawaga.org.uk> wrote:
>
> On Tue, 19 Jun 2007, Veronika Nefedova wrote:
>
> > I did update *many* times along the course of last couple of weeks, I
> > just didn't recompile my dtm files. And they worked just fine, btw.
>
> ok.
>
> Its probably a good idea from a swift testing perspective to be doing
> clean builds of the swiftscript programs (cleaning away the .xml and
> karajan files) regularly, even though it shouldn't affect the application
> side of things.
>
> > I do not know how it happened that I ended up with the old version of
> > swfit script -- it might've happened during those submit host-to-submit
> > host moves, and obviously it was not intentional to discard any new bug
> > fixes.
>
> You might stick stuff in version control and use that rather than copying
> files between machines - that can alleviate overwriting problems
> sometimes. Tibi's been keeping at least one of his apps in the SwiftApps/
> directory in the swift SVN.
>
> --
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>


-- 
Tiberiu (Tibi) Stef-Praun, PhD
Research Staff, Computation Institute
5640 S. Ellis Ave, #405
University of Chicago
http://www-unix.mcs.anl.gov/~tiberius/


From hategan at mcs.anl.gov  Wed Jun 20 05:19:59 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 20 Jun 2007 13:19:59 +0300
Subject: [Swift-devel] Re: 100 molecule
In-Reply-To: <46784508.6020200@mcs.anl.gov>
References: <Pine.LNX.4.58.0706190921550.18172@classes.cs.uchicago.edu>
	<7DAEB562-6FD5-4425-9BBB-7AA84EA26526@mcs.anl.gov>
	<46782A14.2080308@cs.uchicago.edu>
	<Pine.LNX.4.58.0706191538510.21432@classes.cs.uchicago.edu>
	<46784508.6020200@mcs.anl.gov>
Message-ID: <1182334799.3206.0.camel@blabla.mcs.anl.gov>

Hmm. It never occurred to me, but that rm job could be batch=true.

On Tue, 2007-06-19 at 16:05 -0500, Mike Wilde wrote:
> One technique that works nice if you just want the old files out of 
> the way is to do an mv of the top level dir to a new name, and then 
> you can background the rm's.
> 
> - Mike
> 
> Yong Zhao wrote, On 6/19/2007 3:40 PM:
> > Ioan,
> > 
> > This sounds very good. I'm forwarding this to the swift list.
> > 
> > Yong.
> > 
> > On Tue, 19 Jun 2007, Ioan Raicu wrote:
> > 
> >> Yes, rm -rf could take that long... Yong, why don't you try a these two
> >> commands, instead of "rm -rf".... I bet it will be much faster on the
> >> GPFS at ANL!
> >>
> >> find ./ -exec rm {} \;
> >> find ./ -exec rm -r {} \;
> >>
> >> The first one removes the files, and the second one removes the
> >> directories... I found rm -rf to be very slow on the ANL GPFS.... it has
> >> to do with the fact that rm -rf does an expansion of all the files it
> >> needs to deletes... and it ends up being very very long if you hav many
> >> files to delete.... doing the method above, it does 1 delete at a
> >> time... so it doesn't suffer from the long list of files as rm -rf....
> >>
> >> Ioan
> >>
> >> Veronika Nefedova wrote:
> >>> I am wondering how the cleanup is done? Its hard to believe that "rm
> >>> -rf" would work that long. At the end of the successful run its just
> >>> one directory with one nested subdir had to be removed.
> >>>
> >>> NIka
> >>>
> > _______________________________________________
> > Swift-devel mailing list
> > Swift-devel at ci.uchicago.edu
> > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> > 
> > 
> 
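
The rename-then-delete idea quoted above could look roughly like this
in Python. It is only a sketch: the ".deleting.<pid>" suffix and the
function name retire_and_delete are invented here, and it assumes the
renamed directory stays on the same filesystem so the rename is cheap.

```python
import os
import shutil
import threading


def retire_and_delete(path: str) -> threading.Thread:
    """Rename `path` out of the way, then delete it in the background.

    The caller can reuse `path` immediately; join() the returned
    thread to wait for the actual deletion to finish.
    """
    doomed = f"{path}.deleting.{os.getpid()}"  # hypothetical naming scheme
    os.rename(path, doomed)  # same-filesystem rename, independent of tree size
    worker = threading.Thread(
        target=shutil.rmtree, args=(doomed,), kwargs={"ignore_errors": True}
    )
    worker.start()
    return worker
```

Note that the thread is not a daemon, so the Python process will not
exit until the delete has finished. That is a deliberate choice here,
so that a half-removed tree is not left behind on exit.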


From hategan at mcs.anl.gov  Wed Jun 20 05:22:24 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 20 Jun 2007 13:22:24 +0300
Subject: [Swift-devel] Re: email for Mike and Ian
In-Reply-To: <Pine.LNX.4.64.0706200024250.1452@dildano.hawaga.org.uk>
References: <46782EFC.6030601@cs.uchicago.edu>
	<FC1E3EC1-908E-415B-98A6-7A631CA90A27@mcs.anl.gov>
	<46784274.8090202@cs.uchicago.edu>
	<Pine.LNX.4.58.0706191558350.21432@classes.cs.uchicago.edu>
	<051DE8C7-3598-4EDE-BCD7-870FE28B5CA6@mcs.anl.gov>
	<Pine.LNX.4.58.0706191630480.5413@classes.cs.uchicago.edu>
	<99348A95-85BD-4924-B708-481982C9F425@mcs.anl.gov>
	<Pine.LNX.4.58.0706191653260.5413@classes.cs.uchicago.edu>
	<Pine.LNX.4.64.0706192345380.1452@dildano.hawaga.org.uk>
	<46786EF6.507@mcs.anl.gov>
	<Pine.LNX.4.64.0706200007150.1452@dildano.hawaga.org.uk>
	<78ED59EC-B377-43A8-909B-2013415C50D7@mcs.anl.gov>
	<Pine.LNX.4.64.0706200024250.1452@dildano.hawaga.org.uk>
Message-ID: <1182334944.3206.3.camel@blabla.mcs.anl.gov>

On Wed, 2007-06-20 at 00:27 +0000, Ben Clifford wrote:
> On Tue, 19 Jun 2007, Veronika Nefedova wrote:
> 
> > I did update *many* times along the course of last couple of weeks, I 
> > just didn't recompile my dtm files. And they worked just fine, btw.
> 
> ok.
> 
> Its probably a good idea from a swift testing perspective to be doing 
> clean builds of the swiftscript programs (cleaning away the .xml and 
> karajan files) regularly, even though it shouldn't affect the application 
> side of things.

I think an even better idea would be for swift to recompile files if the
swift version has changed. That we can achieve by having a timestamp on
the swift build. If the kml files are older than that, a recompilation
should be forced.
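
Roughly, the check could look like the sketch below (Python, just to
show the logic). The build stamp file is an assumption: it stands for
any file whose modification time records when swift was built, and
needs_recompile is a made-up name, not an existing Swift function.

```python
import os


def needs_recompile(source: str, compiled: str, build_stamp: str) -> bool:
    """True when `compiled` is missing, or older than the source file
    or the (hypothetical) build stamp file."""
    if not os.path.exists(compiled):
        return True
    compiled_mtime = os.path.getmtime(compiled)
    return (
        os.path.getmtime(source) > compiled_mtime
        or os.path.getmtime(build_stamp) > compiled_mtime
    )
```

With this rule, updating swift makes every existing kml file stale,
even when the dtm source itself has not been touched.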

Mihael

> 
> > I do not know how it happened that I ended up with the old version of 
> > swfit script -- it might've happened during those submit host-to-submit 
> > host moves, and obviously it was not intentional to discard any new bug 
> > fixes.
> 
> You might stick stuff in version control and use that rather than copying 
> files between machines - that can alleviate overwriting problems 
> sometimes. Tibi's been keeping at least one of his apps in the SwiftApps/ 
> directory in the swift SVN.
> 


From wilde at mcs.anl.gov  Wed Jun 20 06:35:30 2007
From: wilde at mcs.anl.gov (Mike Wilde)
Date: Wed, 20 Jun 2007 06:35:30 -0500
Subject: [Swift-devel] Re: 100 molecule
In-Reply-To: <1182334799.3206.0.camel@blabla.mcs.anl.gov>
References: <Pine.LNX.4.58.0706190921550.18172@classes.cs.uchicago.edu>	
	<7DAEB562-6FD5-4425-9BBB-7AA84EA26526@mcs.anl.gov>	
	<46782A14.2080308@cs.uchicago.edu>	
	<Pine.LNX.4.58.0706191538510.21432@classes.cs.uchicago.edu>	
	<46784508.6020200@mcs.anl.gov>
	<1182334799.3206.0.camel@blabla.mcs.anl.gov>
Message-ID: <46791102.3040201@mcs.anl.gov>

Please file an enhancement bug on this if it's not already filed.

Thanks,

Mike


Mihael Hategan wrote, On 6/20/2007 5:19 AM:
> Hmm. It never occurred to me, but that rm job could be batch=true.
> 
> On Tue, 2007-06-19 at 16:05 -0500, Mike Wilde wrote:
>> One technique that works nice if you just want the old files out of 
>> the way is to do an mv of the top level dir to a new name, and then 
>> you can background the rm's.
>>
>> - Mike
>>
>> Yong Zhao wrote, On 6/19/2007 3:40 PM:
>>> Ioan,
>>>
>>> This sounds very good. I'm forwarding this to the swift list.
>>>
>>> Yong.
>>>
>>> On Tue, 19 Jun 2007, Ioan Raicu wrote:
>>>
>>>> Yes, rm -rf could take that long... Yong, why don't you try a these two
>>>> commands, instead of "rm -rf".... I bet it will be much faster on the
>>>> GPFS at ANL!
>>>>
>>>> find ./ -exec rm {} \;
>>>> find ./ -exec rm -r {} \;
>>>>
>>>> The first one removes the files, and the second one removes the
>>>> directories... I found rm -rf to be very slow on the ANL GPFS.... it has
>>>> to do with the fact that rm -rf does an expansion of all the files it
>>>> needs to deletes... and it ends up being very very long if you hav many
>>>> files to delete.... doing the method above, it does 1 delete at a
>>>> time... so it doesn't suffer from the long list of files as rm -rf....
>>>>
>>>> Ioan
>>>>
>>>> Veronika Nefedova wrote:
>>>>> I am wondering how the cleanup is done? Its hard to believe that "rm
>>>>> -rf" would work that long. At the end of the successful run its just
>>>>> one directory with one nested subdir had to be removed.
>>>>>
>>>>> NIka
>>>>>
>>> _______________________________________________
>>> Swift-devel mailing list
>>> Swift-devel at ci.uchicago.edu
>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>>>
>>>
> 
> 

-- 
Mike Wilde
Computation Institute, University of Chicago
Math & Computer Science Division
Argonne National Laboratory
Argonne, IL   60439    USA
tel 630-252-7497 fax 630-252-1997


From foster at mcs.anl.gov  Thu Jun 21 08:33:11 2007
From: foster at mcs.anl.gov (Ian Foster)
Date: Thu, 21 Jun 2007 08:33:11 -0500
Subject: [Swift-devel] Swift Performance Data
In-Reply-To: <467A7AC6.7020400@mcs.anl.gov>
References: <4679FBC8.1080606@mcs.anl.gov> <467A6D78.4020702@mcs.anl.gov>
	<467A7AC6.7020400@mcs.anl.gov>
Message-ID: <467A7E17.5000207@mcs.anl.gov>

Or maybe that is clear. I'd suggest that we want a tool that, after a 
run, one of us can run to:

* Generate the three plots that Ioan has created
* Generate a file containing as much information as we can about the run 
and its parameters--maybe a name=value format?--and some derived values 
such as those I mentioned in earlier email
* Move these things to a known place
* Create a Web page with pointers to this information and stick it 
somewhere [or add it to an existing web page?]
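
A minimal sketch of the name=value file suggested above, in Python. The
derived values shown (jobs per second, utilization) and all function
names are assumptions chosen for illustration; the thread does not say
which values the earlier email listed.

```python
def derive_values(job_seconds, wall_seconds, workers):
    """Compute a few illustrative derived values for one run."""
    busy = sum(job_seconds)
    return {
        "jobs": len(job_seconds),
        "busy_seconds": busy,
        "jobs_per_second": len(job_seconds) / wall_seconds,
        # Fraction of the available worker time spent running jobs.
        "utilization": busy / (wall_seconds * workers),
    }


def format_summary(values):
    """Render a dict as sorted name=value lines, one per line."""
    return "".join(f"{name}={values[name]}\n" for name in sorted(values))


def parse_summary(text):
    """Read name=value lines back into a dict of strings."""
    pairs = (line.split("=", 1) for line in text.splitlines() if line.strip())
    return {name: value for name, value in pairs}
```

Sorting the names keeps summaries from different runs easy to diff, and
a flat text format can be read by a plotting script or a web page
generator without extra tooling.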

Ian.


Ian Foster wrote:
> Mike:
>
> It seems important to define what the specific goals and milestones 
> are here, as it seems that simply asking for it doesn't get it done. 
> Perhaps we need a brief specification?
>
> Ian.
>
> Mike Wilde wrote:
>> Yes, this is what Ganglia has been using.
>>
>> Regarding the auto-publishing - Jens has a mechanism that regularly 
>> posted info in rrd format on the state of the VDS lab machines, using 
>> a perl mechanism like what Ian described.  Perhaps we can find and 
>> adapt that for Ioan's numbers.
>> It was running on gainly, I think, but it's not hard to develop from 
>> scratch.
>>
>> It would be good to see the same numbers for all the Swift apps being 
>> worked on, driven initially by kickstart summaries and digesting the 
>> swift logfile.
>> We've long had this as a goal - now is a good time to push forward 
>> and do this.
>>
>> Nika and Tibi, could you work with Ioan on this?
>>
>> - Mike
>>
>>
>>
>> Ian Foster wrote, On 6/20/2007 11:17 PM:
>>> Hi,
>>>
>>> I was pointed at http://oss.oetiker.ch/rrdtool/, has anyone seen 
>>> this? Seems nice to me.
>>>
>>> Ian.
>>>
>>>
>>
>

-- 

   Ian Foster, Director, Computation Institute
Argonne National Laboratory & University of Chicago
Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439
Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637
Tel: +1 630 252 4619.  Web: www.ci.uchicago.edu.
      Globus Alliance: www.globus.org.


From foster at mcs.anl.gov  Thu Jun 21 10:14:59 2007
From: foster at mcs.anl.gov (Ian Foster)
Date: Thu, 21 Jun 2007 10:14:59 -0500
Subject: [Swift-devel] Re: [Swft] Q about throttling
In-Reply-To: <fec1351f0706210809k1035425cjffc18b442ab41dea@mail.gmail.com>
References: <467A6610.1000103@mcs.anl.gov> <467A6E50.3060508@mcs.anl.gov>
	<fec1351f0706210809k1035425cjffc18b442ab41dea@mail.gmail.com>
Message-ID: <467A95F3.6040603@mcs.anl.gov>

My original question was whether we could turn throttling off 
altogether. I'm not sure if that was answered?

Tiberiu Stef-Praun wrote:
> I did not look very deep into the throttling, mainly because I have to
> wait for my turn at using the Argonne cluster because of the large
> reservations that Ioan does for MolDyn
>
> Anyway, here is my experience (which Ian asked me to write down, but
> I'm still trying to improve on):
> - whatever one asks from Falkon, one seems to get, with the caveat
> that Falkon might release nodes when configured to look at an idle
> timer. In the case of the Econ workflow, I had 26 long running jobs,
> so I requested 30 nodes (which Falkon got for me)
> - there is a swift config file, $VDS_HOME/libexec/scheduler.xml in 
> which I set
> <property name="jobThrottle" value="30"/>, but that seemed not to be
> enough to get all my 26 jobs running at the same time (as illustrated
> by the graphing of the Falkon log that Ioan showed me).
> - there are some other throttling parameters in
> $VDS_HOME/etc/swift.properties (which I also set to 30)
>
> The general observation is that I needed to modify the scheduler.xml
> config file, and I need to set larger throttle values than the limit
> of workers requested.
> In the current scheme (simply add Falkon to Swift as a provider) the
> Swift scheduler (the weighted site selection algorithm) adversely
> influences the optimal execution of the workflow.
> There might be other parameters to work with, but my opinion is that
> we should use a different (non-throttling) scheduler in combination
> with Falkon
>
> Tibi
>
> On 6/21/07, Mike Wilde <wilde at mcs.anl.gov> wrote:
>> I've had the same question - it seems that throttling is also
>> problematic for Tibi in the econ workflow.
>>
>> Tibi, since you have looked pretty deeply into it, could you write up a
>> description of how the algorithm works and how the parameters affect it?
>> Mihael, when you are back on central time next week, could you work
>> with Tibi on this?  If it's not already, this should be part of the
>> Swift documentation.
>>
>> Then we should work on getting high-performance settings for the
>> different runtime environments we use, in particular Falkon as Ian asks.
>>
>> - Mike
>>
>>
>> Ian Foster wrote, On 6/21/2007 6:50 AM:
>> > Hi,
>> >
>> > I don't fully understand how throttling works in Swift/Karajan. However,
>> > I understand that even when using Falkon, we may be doing some
>> > throttling. Is there a reason to do that in this case, given that Falkon
>> > can maintain large numbers of tasks just fine?
>> >
>> > I ask this because in a recent MolDyn run, there seemed to be some
>> > uncertainty as to whether throttling was slowing down job dispatch. If
>> > we could turn it off altogether, that question would presumably go away.
>> >
>> > Ian.
>> >
>> >
>>
>> -- 
>> Mike Wilde
>> Computation Institute, University of Chicago
>> Math & Computer Science Division
>> Argonne National Laboratory
>> Argonne, IL   60439    USA
>> tel 630-252-7497 fax 630-252-1997
>>
>
>

-- 

   Ian Foster, Director, Computation Institute
Argonne National Laboratory & University of Chicago
Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439
Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637
Tel: +1 630 252 4619.  Web: www.ci.uchicago.edu.
      Globus Alliance: www.globus.org.


From tiberius at ci.uchicago.edu  Thu Jun 21 10:21:59 2007
From: tiberius at ci.uchicago.edu (Tiberiu Stef-Praun)
Date: Thu, 21 Jun 2007 10:21:59 -0500
Subject: [Swift-devel] Re: [Swft] Q about throttling
In-Reply-To: <467A95F3.6040603@mcs.anl.gov>
References: <467A6610.1000103@mcs.anl.gov> <467A6E50.3060508@mcs.anl.gov>
	<fec1351f0706210809k1035425cjffc18b442ab41dea@mail.gmail.com>
	<467A95F3.6040603@mcs.anl.gov>
Message-ID: <fec1351f0706210821n63f4169by5da7cd1848db080@mail.gmail.com>

On 6/21/07, Ian Foster <foster at mcs.anl.gov> wrote:
> My original question was whether we could turn throttling off
> altogether. I'm not sure if that was answered?

I think the current answer is: just push way up the throttling limit.
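
As a rough sketch of "pushing up the limit" in Python: the only detail
taken from this thread is the line <property name="jobThrottle"
value="30"/>. The enclosing element, the helper name, and the value
10000 are assumptions for illustration, and whether a very large value
behaves like having no throttle at all is untested here.

```python
import xml.etree.ElementTree as ET


def set_property(xml_text, name, value):
    """Return xml_text with every <property name=NAME> given a new value."""
    root = ET.fromstring(xml_text)
    changed = 0
    for prop in root.iter("property"):
        if prop.get("name") == name:
            prop.set("value", str(value))
            changed += 1
    if changed == 0:
        # Fail loudly rather than silently leaving the throttle unchanged.
        raise KeyError(name)
    return ET.tostring(root, encoding="unicode")


# Hypothetical stand-in for the real scheduler.xml, whose layout is not
# shown in this thread.
FRAGMENT = '<scheduler><property name="jobThrottle" value="30"/></scheduler>'
RAISED = set_property(FRAGMENT, "jobThrottle", 10000)
```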

>
> Tiberiu Stef-Praun wrote:
> > I did not look very deep into the throttling, mainly because I have to
> > wait for my turn at using the Argonne cluster because of the large
> > reservations that Ioan does for MolDyn
> >
> > Anyway, here is my experience (which Ian asked me to write down, but
> > I'm still trying to improve on):
> > - whatever one asks from Falkon, one seems to get, with the caveat
> > that Falkon might release nodes when configured to look at an idle
> > timer. In the case of the Econ workflow, I had 26 long running jobs,
> > so I requested 30 nodes (which Falkon got for me)
> > - there is a swift config file, $VDS_HOME/libexec/scheduler.xml in
> > which I set
> > <property name="jobThrottle" value="30"/>, but that seemed not to be
> > enough to get all my 26 jobs running at the same time (as illustrated
> > by the graphing of the Falkon log that Ioan showed me).
> > - there are some other throttling parameters in
> > $VDS_HOME/etc/swift.properties (which I also set to 30)
> >
> > The general observation is that I needed to modify the scheduler.xml
> > config file, and I need to set larger throttle values than the limit
> > of workers requested.
> > In the current scheme (simply add Falkon to Swift as a provider) the
> > Swift scheduler (the weighted site selection algorithm) adversely
> > influences the optimal execution of the workflow.
> > There might be other parameters to work with, but my opinion is that
> > we should use a different (non-throttling) scheduler in combination
> > with Falkon
> >
> > Tibi
> >
> > On 6/21/07, Mike Wilde <wilde at mcs.anl.gov> wrote:
> >> I've had the same question - it seems that throttling is also
> >> problematic for Tibi in the econ workflow.
> >>
> >> Tibi, since you have looked pretty deeply into it, could you write up a
> >> description of how the algorithm works and how the parameters affect it?
> >> Mihael, when you are back on central time next week, could you work
> >> with Tibi on this?  If it's not already, this should be part of the
> >> Swift documentation.
> >>
> >> Then we should work on getting high-performance settings for the
> >> different runtime environments we use, in particular Falkon as Ian asks.
> >>
> >> - Mike
> >>
> >>
> >> Ian Foster wrote, On 6/21/2007 6:50 AM:
> >> > Hi,
> >> >
> >> > I don't fully understand how throttling works in Swift/Karajan. However,
> >> > I understand that even when using Falkon, we may be doing some
> >> > throttling. Is there a reason to do that in this case, given that Falkon
> >> > can maintain large numbers of tasks just fine?
> >> >
> >> > I ask this because in a recent MolDyn run, there seemed to be some
> >> > uncertainty as to whether throttling was slowing down job dispatch. If
> >> > we could turn it off altogether, that question would presumably go away.
> >> >
> >> > Ian.
> >> >
> >> >
> >>
> >> --
> >> Mike Wilde
> >> Computation Institute, University of Chicago
> >> Math & Computer Science Division
> >> Argonne National Laboratory
> >> Argonne, IL   60439    USA
> >> tel 630-252-7497 fax 630-252-1997
> >>
> >
> >
>
> --
>
>    Ian Foster, Director, Computation Institute
> Argonne National Laboratory & University of Chicago
> Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439
> Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637
> Tel: +1 630 252 4619.  Web: www.ci.uchicago.edu.
>       Globus Alliance: www.globus.org.
>
>


-- 
Tiberiu (Tibi) Stef-Praun, PhD
Research Staff, Computation Institute
5640 S. Ellis Ave, #405
University of Chicago
http://www-unix.mcs.anl.gov/~tiberius/


From benc at hawaga.org.uk  Thu Jun 21 10:31:31 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 21 Jun 2007 15:31:31 +0000 (GMT)
Subject: [Swift-devel] Re: [Swft] Q about throttling
In-Reply-To: <467A95F3.6040603@mcs.anl.gov>
References: <467A6610.1000103@mcs.anl.gov> <467A6E50.3060508@mcs.anl.gov>
	<fec1351f0706210809k1035425cjffc18b442ab41dea@mail.gmail.com>
	<467A95F3.6040603@mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0706211531180.15250@dildano.hawaga.org.uk>


But that isn't the base problem being investigated, right?

On Thu, 21 Jun 2007, Ian Foster wrote:

> My original question was whether we could turn throttling off altogether. I'm
> not sure if that was answered?
> 
> Tiberiu Stef-Praun wrote:
> > I did not look very deep into the throttling, mainly because I have to
> > wait for my turn at using the Argonne cluster because of the large
> > reservations that Ioan does for MolDyn
> > 
> > Anyway, here is my experience (which Ian asked me to write down, but
> > I'm still trying to improve on):
> > - whatever one asks from Falkon, one seems to get, with the caveat
> > that Falkon might release nodes when configured to look at an idle
> > timer. In the case of the Econ workflow, I had 26 long running jobs,
> > so I requested 30 nodes (which Falkon got for me)
> > - there is a swift config file, $VDS_HOME/libexec/scheduler.xml in which I
> > set
> > <property name="jobThrottle" value="30"/>, but that seemed not to be
> > enough to get all my 26 jobs running at the same time (as illustrated
> > by the graphing of the Falkon log that Ioan showed me).
> > - there are some other throttling parameters in
> > $VDS_HOME/etc/swift.properties (which I also set to 30)
> > 
> > The general observation is that I needed to modify the scheduler.xml
> > config file, and I need to set larger throttle values than the limit
> > of workers requested.
> > In the current scheme (simply add Falkon to Swift as a provider) the
> > Swift scheduler (the weighted site selection algorithm) adversely
> > influences the optimal execution of the workflow.
> > There might be other parameters to work with, but my opinion is that
> > we should use a different (non-throttling) scheduler in combination
> > with Falkon
> > 
> > Tibi
> > 
> > On 6/21/07, Mike Wilde <wilde at mcs.anl.gov> wrote:
> > > I've had the same question - it seems that throttling is also problematic
> > > for Tibi in the econ workflow.
> > > 
> > > Tibi, since you have looked pretty deeply into it, could you write up a
> > > description of how the algorithm works and how the parameters affect it?
> > > Mihael, when you are back on central time next week, could you work with
> > > Tibi on this?  If it's not already, this should be part of the Swift
> > > documentation.
> > > 
> > > Then we should work on getting high-performance settings for the different
> > > runtime environments we use, in particular Falkon as Ian asks.
> > > 
> > > - Mike
> > > 
> > > 
> > > Ian Foster wrote, On 6/21/2007 6:50 AM:
> > > > Hi,
> > > >
> > > > I don't fully understand how throttling works in Swift/Karajan. However,
> > > > I understand that even when using Falkon, we may be doing some
> > > > throttling. Is there a reason to do that in this case, given that Falkon
> > > > can maintain large numbers of tasks just fine?
> > > >
> > > > I ask this because in a recent MolDyn run, there seemed to be some
> > > > uncertainty as to whether throttling was slowing down job dispatch. If
> > > > we could turn it off altogether, that question would presumably go away.
> > > >
> > > > Ian.
> > > >
> > > >
> > > 
> > > -- 
> > > Mike Wilde
> > > Computation Institute, University of Chicago
> > > Math & Computer Science Division
> > > Argonne National Laboratory
> > > Argonne, IL   60439    USA
> > > tel 630-252-7497 fax 630-252-1997
> > > 
> > 
> > 
> 
> 


From foster at mcs.anl.gov  Thu Jun 21 10:37:22 2007
From: foster at mcs.anl.gov (Ian Foster)
Date: Thu, 21 Jun 2007 10:37:22 -0500
Subject: [Swift-devel] Re: [Swft] Q about throttling
In-Reply-To: <Pine.LNX.4.64.0706211531180.15250@dildano.hawaga.org.uk>
References: <467A6610.1000103@mcs.anl.gov> <467A6E50.3060508@mcs.anl.gov>
	<fec1351f0706210809k1035425cjffc18b442ab41dea@mail.gmail.com>
	<467A95F3.6040603@mcs.anl.gov>
	<Pine.LNX.4.64.0706211531180.15250@dildano.hawaga.org.uk>
Message-ID: <467A9B32.4030402@mcs.anl.gov>

Well, if there is some concern that throttling might be a problem, then 
trying a run with it turned off seems good.

I'm gathering from this exchange that this is not possible?

Ben Clifford wrote:
> But that isn't the base problem being investigated, right?
>
> On Thu, 21 Jun 2007, Ian Foster wrote:
>
>   
>> My original question was whether we could turn throttling off altogether. I'm
>> not sure if that was answered?
>>
>> Tiberiu Stef-Praun wrote:
>>     
>>> I did not look very deep into the throttling, mainly because I have to
>>> wait for my turn at using the Argonne cluster because of the large
>>> reservations that Ioan does for MolDyn
>>>
>>> Anyway, here is my experience (which Ian asked me to write down, but
>>> I'm still trying to improve on):
>>> - whatever one asks from Falkon, one seems to get, with the caveat
>>> that Falkon might release nodes when configured to look at an idle
>>> timer. In the case of the Econ workflow, I had 26 long running jobs,
>>> so I requested 30 nodes (which Falkon got for me)
>>> - there is a swift config file, $VDS_HOME/libexec/scheduler.xml in which I
>>> set
>>> <property name="jobThrottle" value="30"/>, but that seemed not to be
>>> enough to get all my 26 jobs running at the same time (as illustrated
>>> by the graphing of the Falkon log that Ioan showed me).
>>> - there are some other throttling parameters in
>>> $VDS_HOME/etc/swift.properties (which I also set to 30)
>>>
>>> The general observation is that I needed to modify the scheduler.xml
>>> config file, and I need to set larger throttle values than the limit
>>> of workers requested.
>>> In the current scheme (simply add Falkon to Swift as a provider) the
>>> Swift scheduler (the weighted site selection algorithm) adversely
>>> influences the optimal execution of the workflow.
>>> There might be other parameters to work with, but my opinion is that
>>> we should use a different (non-throttling) scheduler in combination
>>> with Falkon
>>>
>>> Tibi
>>>
>>> On 6/21/07, Mike Wilde <wilde at mcs.anl.gov> wrote:
>>>       
>>>> I've had the same question - it seems that throttling is also problematic
>>>> for Tibi in the econ workflow.
>>>>
>>>> Tibi, since you have looked pretty deeply into it, could you write up a
>>>> description of how the algorithm works and how the parameters affect it?
>>>> Mihael, when you are back on central time next week, could you work with
>>>> Tibi on this?  If it's not already, this should be part of the Swift
>>>> documentation.
>>>>
>>>> Then we should work on getting high-performance settings for the different
>>>> runtime environments we use, in particular Falkon as Ian asks.
>>>>
>>>> - Mike
>>>>
>>>>
>>>> Ian Foster wrote, On 6/21/2007 6:50 AM:
>>>>         
>>>>> Hi,
>>>>>
>>>>> I don't fully understand how throttling works in Swift/Karajan. However,
>>>>> I understand that even when using Falkon, we may be doing some
>>>>> throttling. Is there a reason to do that in this case, given that Falkon
>>>>> can maintain large numbers of tasks just fine?
>>>>>
>>>>> I ask this because in a recent MolDyn run, there seemed to be some
>>>>> uncertainty as to whether throttling was slowing down job dispatch. If
>>>>> we could turn it off altogether, that question would presumably go away.
>>>>>
>>>>> Ian.
>>>>>
>>>>>
>>>>>           
>>>> -- 
>>>> Mike Wilde
>>>> Computation Institute, University of Chicago
>>>> Math & Computer Science Division
>>>> Argonne National Laboratory
>>>> Argonne, IL   60439    USA
>>>> tel 630-252-7497 fax 630-252-1997
>>>>
>>>>         
>>>       
>>     
>
>   

-- 

   Ian Foster, Director, Computation Institute
Argonne National Laboratory & University of Chicago
Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439
Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637
Tel: +1 630 252 4619.  Web: www.ci.uchicago.edu.
      Globus Alliance: www.globus.org.


From benc at hawaga.org.uk  Thu Jun 21 10:40:22 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 21 Jun 2007 15:40:22 +0000 (GMT)
Subject: [Swift-devel] Re: [Swft] Q about throttling
In-Reply-To: <467A9B32.4030402@mcs.anl.gov>
References: <467A6610.1000103@mcs.anl.gov> <467A6E50.3060508@mcs.anl.gov>
	<fec1351f0706210809k1035425cjffc18b442ab41dea@mail.gmail.com>
	<467A95F3.6040603@mcs.anl.gov>
	<Pine.LNX.4.64.0706211531180.15250@dildano.hawaga.org.uk>
	<467A9B32.4030402@mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0706211537490.1452@dildano.hawaga.org.uk>


On Thu, 21 Jun 2007, Ian Foster wrote:

> I'm gathering from this exchange that this is not possible?

I have no idea. It doesn't seem to be documented.

But the number one rule of tech support is don't take somebody else's 
partially solved problem. It would be good to see what is actually causing 
you to suspect that there's a throttling problem.

-- 


From foster at mcs.anl.gov  Thu Jun 21 10:47:57 2007
From: foster at mcs.anl.gov (Ian Foster)
Date: Thu, 21 Jun 2007 10:47:57 -0500
Subject: [Swift-devel] Re: [Swft] Q about throttling
In-Reply-To: <Pine.LNX.4.64.0706211537490.1452@dildano.hawaga.org.uk>
References: <467A6610.1000103@mcs.anl.gov> <467A6E50.3060508@mcs.anl.gov>
	<fec1351f0706210809k1035425cjffc18b442ab41dea@mail.gmail.com>
	<467A95F3.6040603@mcs.anl.gov>
	<Pine.LNX.4.64.0706211531180.15250@dildano.hawaga.org.uk>
	<467A9B32.4030402@mcs.anl.gov>
	<Pine.LNX.4.64.0706211537490.1452@dildano.hawaga.org.uk>
Message-ID: <467A9DAD.7060909@mcs.anl.gov>

agreed

Ben Clifford wrote:
> On Thu, 21 Jun 2007, Ian Foster wrote:
>
>   
>> I'm gathering from this exchange that this is not possible?
>>     
>
> I have no idea. It doesn't seem to be documented.
>
> But the number one rule of tech support is don't take somebody else's 
> partially solved problem. It would be good to see what is actually causing 
> you to suspect that there's a throttling problem.
>
>   

-- 

   Ian Foster, Director, Computation Institute
Argonne National Laboratory & University of Chicago
Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439
Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637
Tel: +1 630 252 4619.  Web: www.ci.uchicago.edu.
      Globus Alliance: www.globus.org.


From benc at hawaga.org.uk  Thu Jun 21 10:51:51 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 21 Jun 2007 15:51:51 +0000 (GMT)
Subject: [Swift-devel] Re: [Swft] Q about throttling
In-Reply-To: <467A9DAD.7060909@mcs.anl.gov>
References: <467A6610.1000103@mcs.anl.gov> <467A6E50.3060508@mcs.anl.gov>
	<fec1351f0706210809k1035425cjffc18b442ab41dea@mail.gmail.com>
	<467A95F3.6040603@mcs.anl.gov>
	<Pine.LNX.4.64.0706211531180.15250@dildano.hawaga.org.uk>
	<467A9B32.4030402@mcs.anl.gov>
	<Pine.LNX.4.64.0706211537490.1452@dildano.hawaga.org.uk>
	<467A9DAD.7060909@mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0706211549270.1452@dildano.hawaga.org.uk>


actually, the graph that Tibi showed, which I think is pretty much the 
same as the graph from Ioan's gui visualizer thing, would be interesting 
to see for the present MolDyn runs.

It was interesting to look at when wondering about the bug Yong fixed last 
week.

On Thu, 21 Jun 2007, Ian Foster wrote:

> agreed
> 
> Ben Clifford wrote:
> > On Thu, 21 Jun 2007, Ian Foster wrote:
> > 
> >   
> > > I'm gathering from this exchange that this is not possible?
> > >     
> > 
> > I have no idea. It doesn't seem to be documented.
> > 
> > But the number one rule of tech support is don't take somebody else's
> > partially solved problem. It would be good to see what is actually causing
> > you to suspect that there's a throttling problem.
> > 
> >   
> 
> 


From tiberius at ci.uchicago.edu  Thu Jun 21 10:59:08 2007
From: tiberius at ci.uchicago.edu (Tiberiu Stef-Praun)
Date: Thu, 21 Jun 2007 10:59:08 -0500
Subject: [Swift-devel] Re: [Swft] Q about throttling
In-Reply-To: <Pine.LNX.4.64.0706211537490.1452@dildano.hawaga.org.uk>
References: <467A6610.1000103@mcs.anl.gov> <467A6E50.3060508@mcs.anl.gov>
	<fec1351f0706210809k1035425cjffc18b442ab41dea@mail.gmail.com>
	<467A95F3.6040603@mcs.anl.gov>
	<Pine.LNX.4.64.0706211531180.15250@dildano.hawaga.org.uk>
	<467A9B32.4030402@mcs.anl.gov>
	<Pine.LNX.4.64.0706211537490.1452@dildano.hawaga.org.uk>
Message-ID: <fec1351f0706210859v417e6161i32b0569844949784@mail.gmail.com>

Actually, Ioan pointed out to me that the last two jobs from the first
batch are scheduled to start at time zero but have to wait for the
first 24 to finish before getting onto a resource. (Red means queue
time; green means execution time.)

Ben, yes, the workflow consists of 4 logical stages, each of which has
to complete before the next stage is executed.

The graph was generated by Ioan, using Excel. He was showing me how to
visualize the information in the Falkon logs.


On 6/21/07, Ben Clifford <benc at hawaga.org.uk> wrote:
>
>
> On Thu, 21 Jun 2007, Ian Foster wrote:
>
> > I'm gathering from this exchange that this is not possible?
>
> I have no idea. It doesn't seem to be documented.
>
> But the number one rule of tech support is don't take somebody else's
> partially solved problem. It would be good to see what is actually causing
> you to suspect that there's a throttling problem.
>
> --
>


-- 
Tiberiu (Tibi) Stef-Praun, PhD
Research Staff, Computation Institute
5640 S. Ellis Ave, #405
University of Chicago
http://www-unix.mcs.anl.gov/~tiberius/


From foster at mcs.anl.gov  Thu Jun 21 10:57:56 2007
From: foster at mcs.anl.gov (Ian Foster)
Date: Thu, 21 Jun 2007 10:57:56 -0500
Subject: [Swift-devel] Re: [Swft] Q about throttling
In-Reply-To: <Pine.LNX.4.64.0706211549270.1452@dildano.hawaga.org.uk>
References: <467A6610.1000103@mcs.anl.gov> <467A6E50.3060508@mcs.anl.gov>
	<fec1351f0706210809k1035425cjffc18b442ab41dea@mail.gmail.com>
	<467A95F3.6040603@mcs.anl.gov>
	<Pine.LNX.4.64.0706211531180.15250@dildano.hawaga.org.uk>
	<467A9B32.4030402@mcs.anl.gov>
	<Pine.LNX.4.64.0706211537490.1452@dildano.hawaga.org.uk>
	<467A9DAD.7060909@mcs.anl.gov>
	<Pine.LNX.4.64.0706211549270.1452@dildano.hawaga.org.uk>
Message-ID: <467AA004.4080601@mcs.anl.gov>

See this document for a set of three graphs that Ioan produced:

http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/100-Mol_MolDyn.pdf 


The first is the same as Tibi's, I think. The second and third are new. 
I want to have all three produced in a standard way for every 
application run.

Ian.


Ben Clifford wrote:
> actually, the graph that Tibi showed, which I think is pretty much the 
> same as the graph from Ioan's gui visualizer thing, would be interesting 
> to see for the present MolDyn runs.
>
> It was interesting to look at when wondering about the bug Yong fixed last 
> week.
>
> On Thu, 21 Jun 2007, Ian Foster wrote:
>
>   
>> agreed
>>
>> Ben Clifford wrote:
>>     
>>> On Thu, 21 Jun 2007, Ian Foster wrote:
>>>
>>>   
>>>       
>>>> I'm gathering from this exchange that this is not possible?
>>>>     
>>>>         
>>> I have no idea. It doesn't seem to be documented.
>>>
>>> But the number one rule of tech support is don't take somebody else's
>>> partially solved problem. It would be good to see what is actually causing
>>> you to suspect that there's a throttling problem.
>>>
>>>   
>>>       
>>     
>
>   

-- 

   Ian Foster, Director, Computation Institute
Argonne National Laboratory & University of Chicago
Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439
Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637
Tel: +1 630 252 4619.  Web: www.ci.uchicago.edu.
      Globus Alliance: www.globus.org.


From benc at hawaga.org.uk  Thu Jun 21 11:00:50 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 21 Jun 2007 16:00:50 +0000 (GMT)
Subject: [Swift-devel] Re: [Swft] Q about throttling
In-Reply-To: <fec1351f0706210859v417e6161i32b0569844949784@mail.gmail.com>
References: <467A6610.1000103@mcs.anl.gov> <467A6E50.3060508@mcs.anl.gov> 
	<fec1351f0706210809k1035425cjffc18b442ab41dea@mail.gmail.com> 
	<467A95F3.6040603@mcs.anl.gov>
	<Pine.LNX.4.64.0706211531180.15250@dildano.hawaga.org.uk>
	<467A9B32.4030402@mcs.anl.gov>
	<Pine.LNX.4.64.0706211537490.1452@dildano.hawaga.org.uk>
	<fec1351f0706210859v417e6161i32b0569844949784@mail.gmail.com>
Message-ID: <Pine.LNX.4.64.0706211600140.1452@dildano.hawaga.org.uk>


On Thu, 21 Jun 2007, Tiberiu Stef-Praun wrote:

> Actually, Ioan pointed out to me that the last two jobs from the first
> batch are scheduled to start at time zero but have to wait for the
> first 24 to finish before getting onto a resource. (Red means queue
> time; green means execution time.)
> 
> Ben, yes, the workflow consists of 4 logical stages, each of which has
> to complete before the next stage is executed.

so your chart indicates that everything is going 'just fine' rather than 
'broken' ?

-- 


From wilde at mcs.anl.gov  Thu Jun 21 11:03:34 2007
From: wilde at mcs.anl.gov (Mike Wilde)
Date: Thu, 21 Jun 2007 11:03:34 -0500
Subject: [Swift-devel] Re: Swift Performance Data
In-Reply-To: <467A7E17.5000207@mcs.anl.gov>
References: <4679FBC8.1080606@mcs.anl.gov> <467A6D78.4020702@mcs.anl.gov>
	<467A7AC6.7020400@mcs.anl.gov> <467A7E17.5000207@mcs.anl.gov>
Message-ID: <467AA156.6020802@mcs.anl.gov>

OK, thanks for responding so quickly to your own request, Ian. :)

I agree that we need this and that this is a good spec to start from.

I've been pushing to do all runs from the swift lab machines where we can readily 
collect this info.

We can create a set of scripts that collects the data in a uniform way, and 
makes it easy to send them to a central place.

We'll start a campaign for this and move the discussion there.

I think that the stats should get gathered by default as part of the swift 
execution command; we may need some provision for collecting the data if swift 
dies.  Perhaps the swift shell wrapper can catch such errors and try to report 
in most cases.
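
One way the wrapper idea could look, sketched in Python rather than
shell. The function name, the fields reported, and the report callback
are assumptions for illustration; the actual swift launcher is not
shown in this thread.

```python
import subprocess
import sys
import time


def run_and_report(command, report):
    """Run command and always call report(), even if the launch fails."""
    started = time.time()
    status = "not-started"
    try:
        completed = subprocess.run(command)
        status = f"exit-{completed.returncode}"
    except OSError as exc:
        # Covers a missing executable, permission problems, and similar.
        status = f"launch-failed: {exc.__class__.__name__}"
    finally:
        # The finally clause is what gets stats out when the run dies.
        report({
            "command": " ".join(command),
            "status": status,
            "wall_seconds": round(time.time() - started, 3),
        })
    return status
```

The report callback could append to a file or post to a central
collector. It cannot help if the wrapper process itself is killed, so a
periodic flush may still be needed for that case.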

I believe we also want to run everything under kickstart. It's been hard to get 
traction on that but we should discuss and keep on pushing on this.

We need to designate one person to lead on this and certainly others to contribute.

I hesitate to say "jump" until we review current work in progress and 
everyone's todo list.

Mike


Ian Foster wrote, On 6/21/2007 8:33 AM:
> Or maybe that is clear. I'd suggest that we want a tool that, after a 
> run, one of us can run to:
> 
> * Generate the three plots that Ioan has created
> * Generate a file containing as much information as we can about the run 
> and its parameters--maybe a name=value format?--and some derived values 
> such as those I mentioned in earlier email
> * Move these things to a known place
> * Create a Web page with pointers to this information and stick it 
> somewhere [or add it to an existing web page?]
> 
> Ian.
> 
> 
> 
> Ian Foster wrote:
>> Mike:
>>
>> It seems important to define what the specific goals and milestones 
>> are here, as it seems that simply asking for it doesn't get it done. 
>> Perhaps we need a brief specification?
>>
>> Ian.
>>
>> Mike Wilde wrote:
>>> Yes, this is what Ganglia has been using.
>>>
>>> Regarding the auto-publishing - Jens has a mechanism that regularly 
>>> posted info in rrd format on the state of the VDS lab machines, using 
>>> a perl mechanism like what Ian described.  Perhaps we can find and 
>>> adapt that for Ioan's numbers.
>>> It was running on gainly, I think, but it's not hard to develop from 
>>> scratch.
>>>
>>> It would be good to see the same numbers for all the Swift apps being 
>>> worked on, driven initially by kickstart summaries and digesting the 
>>> swift logfile.
>>> We've long had this as a goal - now is a good time to push forward 
>>> and do this.
>>>
>>> Nika and Tibi, could you work with Ioan on this?
>>>
>>> - Mike
>>>
>>>
>>>
>>> Ian Foster wrote, On 6/20/2007 11:17 PM:
>>>> Hi,
>>>>
>>>> I was pointed at http://oss.oetiker.ch/rrdtool/, has anyone seen 
>>>> this? Seems nice to me.
>>>>
>>>> Ian.
>>>>
>>>>
>>>
>>
> 

-- 
Mike Wilde
Computation Institute, University of Chicago
Math & Computer Science Division
Argonne National Laboratory
Argonne, IL   60439    USA
tel 630-252-7497 fax 630-252-1997


From tiberius at ci.uchicago.edu  Thu Jun 21 11:04:35 2007
From: tiberius at ci.uchicago.edu (Tiberiu Stef-Praun)
Date: Thu, 21 Jun 2007 11:04:35 -0500
Subject: [Swift-devel] Re: [Swft] Q about throttling
In-Reply-To: <Pine.LNX.4.64.0706211600140.1452@dildano.hawaga.org.uk>
References: <467A6610.1000103@mcs.anl.gov> <467A6E50.3060508@mcs.anl.gov>
	<fec1351f0706210809k1035425cjffc18b442ab41dea@mail.gmail.com>
	<467A95F3.6040603@mcs.anl.gov>
	<Pine.LNX.4.64.0706211531180.15250@dildano.hawaga.org.uk>
	<467A9B32.4030402@mcs.anl.gov>
	<Pine.LNX.4.64.0706211537490.1452@dildano.hawaga.org.uk>
	<fec1351f0706210859v417e6161i32b0569844949784@mail.gmail.com>
	<Pine.LNX.4.64.0706211600140.1452@dildano.hawaga.org.uk>
Message-ID: <fec1351f0706210904l6d74828bj6d5fe093a2d3d7@mail.gmail.com>

No.
My chart shows that if I had had two more machines during the first stage
run (the first 26 jobs), I would have avoided a long wait (50000 ms,
or about 9 minutes) for the last two jobs from the first batch to
finish.
This is why I need to redo the Econ run, with a different throttle
value for Swift.

Tibi

On 6/21/07, Ben Clifford <benc at hawaga.org.uk> wrote:
>
>
> On Thu, 21 Jun 2007, Tiberiu Stef-Praun wrote:
>
> > Actually Ioan pointed out to me  that the last two jobs from the first
> > batch are scheduled to start at time zero but have to wait for the
> > first 24 to finish before getting on some resource. (the red color
> > means queue time. the green means execution time).
> >
> > Ben, yes, the workflow consists of 4 logical stages, each of which has
> > to complete before the next stage is being executed.
>
> so your chart indicates that everything is going 'just fine' rather than
> 'broken' ?
>
> --
>


-- 
Tiberiu (Tibi) Stef-Praun, PhD
Research Staff, Computation Institute
5640 S. Ellis Ave, #405
University of Chicago
http://www-unix.mcs.anl.gov/~tiberius/


From benc at hawaga.org.uk  Thu Jun 21 11:07:32 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 21 Jun 2007 16:07:32 +0000 (GMT)
Subject: [Swift-devel] Re: [Swft] Q about throttling
In-Reply-To: <467AA004.4080601@mcs.anl.gov>
References: <467A6610.1000103@mcs.anl.gov> <467A6E50.3060508@mcs.anl.gov>
	<fec1351f0706210809k1035425cjffc18b442ab41dea@mail.gmail.com>
	<467A95F3.6040603@mcs.anl.gov>
	<Pine.LNX.4.64.0706211531180.15250@dildano.hawaga.org.uk>
	<467A9B32.4030402@mcs.anl.gov>
	<Pine.LNX.4.64.0706211537490.1452@dildano.hawaga.org.uk>
	<467A9DAD.7060909@mcs.anl.gov>
	<Pine.LNX.4.64.0706211549270.1452@dildano.hawaga.org.uk>
	<467AA004.4080601@mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0706211604050.1452@dildano.hawaga.org.uk>


neat. when was the run made that generated those graphs?

The submits seem to be going through at about 1/sec in the 1000..2000s 
time range. Is that the bit that is the problem?

On Thu, 21 Jun 2007, Ian Foster wrote:

> See this document for a set of three graphs that Ioan produced:
> 
> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/100-Mol_MolDyn.pdf 
> 
> The first is the same as Tibi's, I think. The second and third are new. I want
> to have all three produced in a standard way for every application run.
> 
> Ian.
> 
> 
> Ben Clifford wrote:
> > actually, the graph that Tibi showed, which I think is pretty much the same
> > as the graph from Ioan's gui visualizer thing, would be interesting to see
> > for the present MolDyn runs.
> > 
> > It was interesting to look at when wondering about the bug Yong fixed last
> > week.
> > 
> > On Thu, 21 Jun 2007, Ian Foster wrote:
> > 
> >   
> > > agreed
> > > 
> > > Ben Clifford wrote:
> > >     
> > > > On Thu, 21 Jun 2007, Ian Foster wrote:
> > > > 
> > > >         
> > > > > I'm gathering from this exchange that this is not possible?
> > > > >             
> > > > I have no idea. It doesn't seem to be documented.
> > > > 
> > > > But the number one rule of tech support is don't take somebody else's
> > > > partially solved problem. It would be good to see what is actually
> > > > causing
> > > > you to suspect that there's a throttling problem.
> > > > 
> > > >         
> > >     
> > 
> >   
> 
> 


From benc at hawaga.org.uk  Thu Jun 21 11:08:54 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 21 Jun 2007 16:08:54 +0000 (GMT)
Subject: [Swift-devel] Re: [Swft] Q about throttling
In-Reply-To: <fec1351f0706210904l6d74828bj6d5fe093a2d3d7@mail.gmail.com>
References: <467A6610.1000103@mcs.anl.gov> <467A6E50.3060508@mcs.anl.gov> 
	<fec1351f0706210809k1035425cjffc18b442ab41dea@mail.gmail.com> 
	<467A95F3.6040603@mcs.anl.gov>
	<Pine.LNX.4.64.0706211531180.15250@dildano.hawaga.org.uk>
	<467A9B32.4030402@mcs.anl.gov>
	<Pine.LNX.4.64.0706211537490.1452@dildano.hawaga.org.uk>
	<fec1351f0706210859v417e6161i32b0569844949784@mail.gmail.com> 
	<Pine.LNX.4.64.0706211600140.1452@dildano.hawaga.org.uk>
	<fec1351f0706210904l6d74828bj6d5fe093a2d3d7@mail.gmail.com>
Message-ID: <Pine.LNX.4.64.0706211608270.1452@dildano.hawaga.org.uk>


On Thu, 21 Jun 2007, Tiberiu Stef-Praun wrote:

> No
> My chart shows that if I had two more machines during the first stage
> run (the first 26 jobs), I would have avoided a long wait (500000 ms,
> or about 9 minutes) for the last two jobs from the first batch to
> finish.
> This is why I need to redo the Econ run, with a different throttle
> value for Swift.

So you are saying that changing the 'throttle value for swift' will 
allocate more machines for you?

-- 


From benc at hawaga.org.uk  Thu Jun 21 11:12:35 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 21 Jun 2007 16:12:35 +0000 (GMT)
Subject: [Swift-devel] Re: Swift Performance Data
In-Reply-To: <467AA156.6020802@mcs.anl.gov>
References: <4679FBC8.1080606@mcs.anl.gov> <467A6D78.4020702@mcs.anl.gov>
	<467A7AC6.7020400@mcs.anl.gov> <467A7E17.5000207@mcs.anl.gov>
	<467AA156.6020802@mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0706211609160.15250@dildano.hawaga.org.uk>


On Thu, 21 Jun 2007, Mike Wilde wrote:

> I believe we also want to run everything under kickstart. Its been hard to get
> traction on that but we should discuss and keep on pushing on this.

There was once an idea to have kickstart installed by our group on every 
machine to which we commonly submit jobs (maybe as part of the 'getting 
swift running on each site' campaign that resulted in the present site 
catalog). But that doesn't seem to be the way things are now so maybe it 
never got written down.

Putting installs in place and pointing the default as-distributed site 
catalog at those installs seems a relatively straightforward thing to do 
(at least for most sites, and especially the OSG ones, which often have 
kickstart installed as part of their standard software stack).

-- 


From tiberius at ci.uchicago.edu  Thu Jun 21 11:25:24 2007
From: tiberius at ci.uchicago.edu (Tiberiu Stef-Praun)
Date: Thu, 21 Jun 2007 11:25:24 -0500
Subject: [Swift-devel] Re: [Swft] Q about throttling
In-Reply-To: <Pine.LNX.4.64.0706211608270.1452@dildano.hawaga.org.uk>
References: <467A6610.1000103@mcs.anl.gov>
	<fec1351f0706210809k1035425cjffc18b442ab41dea@mail.gmail.com>
	<467A95F3.6040603@mcs.anl.gov>
	<Pine.LNX.4.64.0706211531180.15250@dildano.hawaga.org.uk>
	<467A9B32.4030402@mcs.anl.gov>
	<Pine.LNX.4.64.0706211537490.1452@dildano.hawaga.org.uk>
	<fec1351f0706210859v417e6161i32b0569844949784@mail.gmail.com>
	<Pine.LNX.4.64.0706211600140.1452@dildano.hawaga.org.uk>
	<fec1351f0706210904l6d74828bj6d5fe093a2d3d7@mail.gmail.com>
	<Pine.LNX.4.64.0706211608270.1452@dildano.hawaga.org.uk>
Message-ID: <fec1351f0706210925t2358bf90mec1507118903ef6e@mail.gmail.com>

No.
I'm saying that the Swift throttle value will allow me to make full use of
all the nodes that Falkon makes available to me. I know that I had 26
jobs to run, and I requested (and had) 30 nodes in the cluster.
Somehow only 24 jobs ran in the first pass, so I'm going to push up
the throttle value in Swift.

Tibi


On 6/21/07, Ben Clifford <benc at hawaga.org.uk> wrote:
>
>
>
> On Thu, 21 Jun 2007, Tiberiu Stef-Praun wrote:
>
> > No
> > My chart shows that if I had two more machines during the first stage
> > run (the first 26 jobs), I would have avoided a long wait (500000 ms,
> > or about 9 minutes) for the last two jobs from the first batch to
> > finish.
> > This is why I need to redo the Econ run, with a different throttle
> > value for Swift.
>
> So you are saying that changing the 'throttle value for swift' will
> allocate more machines for you?
>
> --
>


-- 
Tiberiu (Tibi) Stef-Praun, PhD
Research Staff, Computation Institute
5640 S. Ellis Ave, #405
University of Chicago
http://www-unix.mcs.anl.gov/~tiberius/


From benc at hawaga.org.uk  Thu Jun 21 11:30:09 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 21 Jun 2007 16:30:09 +0000 (GMT)
Subject: [Swift-devel] Re: [Swft] Q about throttling
In-Reply-To: <fec1351f0706210925t2358bf90mec1507118903ef6e@mail.gmail.com>
References: <467A6610.1000103@mcs.anl.gov>
	<fec1351f0706210809k1035425cjffc18b442ab41dea@mail.gmail.com>
	<467A95F3.6040603@mcs.anl.gov>
	<Pine.LNX.4.64.0706211531180.15250@dildano.hawaga.org.uk>
	<467A9B32.4030402@mcs.anl.gov>
	<Pine.LNX.4.64.0706211537490.1452@dildano.hawaga.org.uk>
	<fec1351f0706210859v417e6161i32b0569844949784@mail.gmail.com> 
	<Pine.LNX.4.64.0706211600140.1452@dildano.hawaga.org.uk> 
	<fec1351f0706210904l6d74828bj6d5fe093a2d3d7@mail.gmail.com> 
	<Pine.LNX.4.64.0706211608270.1452@dildano.hawaga.org.uk>
	<fec1351f0706210925t2358bf90mec1507118903ef6e@mail.gmail.com>
Message-ID: <Pine.LNX.4.64.0706211626190.1452@dildano.hawaga.org.uk>


My interpretation of the graph is:

The two jobs that didn't get run till later (the 'spare' jobs) are 
submitted into falkon at approx t=0, along with the 24 'run straight away' 
jobs.

Swift isn't holding them back.

Falkon indicates that it is aware of them from approx time = 0 but doesn't 
run them until t=500000.

That means, I think, that they're getting into Falkon's queue right at the 
start, and it's something about how Falkon places them onto worker 
nodes that isn't right here.

At least that's my first impression.
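
To make that reading of the graph concrete, here is a minimal Python
sketch of splitting each job's time into queue time (red) and execution
time (green). The field names and all timestamps are made up for
illustration; they are not taken from the Econ run logs.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    submit_ms: int  # hypothetical: when the job reached the Falkon queue
    start_ms: int   # hypothetical: when it began executing on a worker
    end_ms: int     # hypothetical: when it finished

    @property
    def queue_ms(self) -> int:
        # Time spent waiting in the queue (the red part of the bar).
        return self.start_ms - self.submit_ms

    @property
    def exec_ms(self) -> int:
        # Time spent executing (the green part of the bar).
        return self.end_ms - self.start_ms


def held_jobs(jobs, threshold_ms):
    """Return the names of jobs whose queue time is at least threshold_ms."""
    return [j.name for j in jobs if j.queue_ms >= threshold_ms]


# Illustrative numbers only: 24 jobs start almost at once,
# 2 jobs are submitted at t=0 but wait about 500000 ms.
jobs = [Job(f"job{i:02d}", 0, 1000, 500000) for i in range(24)]
jobs += [Job(f"job{i:02d}", 0, 500000, 900000) for i in range(24, 26)]

late = held_jobs(jobs, threshold_ms=60000)
print(late)                       # names of the jobs that sat in the queue
print(jobs[-1].queue_ms / 60000)  # queue time of a held job, in minutes
```

If the submit time is near zero for all 26 jobs and only the start time
differs, the delay is on the dispatch side rather than in the submitter.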

On Thu, 21 Jun 2007, Tiberiu Stef-Praun wrote:

> No
> I'm saying that swift throttle value will allow me to make full use of
> all the nodes that Falkon makes available for me. I know that I had 26
> jobs to be run, and I requested (and had) 30 nodes in the cluster.
> Somehow only 24 jobs run in the first time, so I'm going to push up
> the throttle value in Swift
> 
> Tibi
> 
> 
> On 6/21/07, Ben Clifford <benc at hawaga.org.uk> wrote:
> > 
> > 
> > 
> > On Thu, 21 Jun 2007, Tiberiu Stef-Praun wrote:
> > 
> > > No
> > > My chart shows that if I had two more machines during the first stage
> > > > run (the first 26 jobs), I would have avoided a long wait (500000 ms,
> > > or about 9 minutes) for the last two jobs from the first batch to
> > > finish.
> > > This is why I need to redo the Econ run, with a different throttle
> > > value for Swift.
> > 
> > So you are saying that changing the 'throttle value for swift' will
> > allocate more machines for you?
> > 
> > --
> > 
> 
> 
> 


From nefedova at mcs.anl.gov  Thu Jun 21 11:32:55 2007
From: nefedova at mcs.anl.gov (Veronika Nefedova)
Date: Thu, 21 Jun 2007 11:32:55 -0500
Subject: [Swift-devel] Re: [Swft] Q about throttling
In-Reply-To: <fec1351f0706210925t2358bf90mec1507118903ef6e@mail.gmail.com>
References: <467A6610.1000103@mcs.anl.gov>
	<fec1351f0706210809k1035425cjffc18b442ab41dea@mail.gmail.com>
	<467A95F3.6040603@mcs.anl.gov>
	<Pine.LNX.4.64.0706211531180.15250@dildano.hawaga.org.uk>
	<467A9B32.4030402@mcs.anl.gov>
	<Pine.LNX.4.64.0706211537490.1452@dildano.hawaga.org.uk>
	<fec1351f0706210859v417e6161i32b0569844949784@mail.gmail.com>
	<Pine.LNX.4.64.0706211600140.1452@dildano.hawaga.org.uk>
	<fec1351f0706210904l6d74828bj6d5fe093a2d3d7@mail.gmail.com>
	<Pine.LNX.4.64.0706211608270.1452@dildano.hawaga.org.uk>
	<fec1351f0706210925t2358bf90mec1507118903ef6e@mail.gmail.com>
Message-ID: <F7C67410-5C79-4964-B8EA-466BE465876B@mcs.anl.gov>

There are two throttle parameters you might want to check. One is
throttle.submit in swift.properties, and the other is jobThrottle in
scheduler.xml.
I am curious: what's the difference between them?

Nika

On Jun 21, 2007, at 11:25 AM, Tiberiu Stef-Praun wrote:

> No
> I'm saying that swift throttle value will allow me to make full use of
> all the nodes that Falkon makes available for me. I know that I had 26
> jobs to be run, and I requested (and had) 30 nodes in the cluster.
> Somehow only 24 jobs run in the first time, so I'm going to push up
> the throttle value in Swift
>
> Tibi
>
>
> On 6/21/07, Ben Clifford <benc at hawaga.org.uk> wrote:
>>
>>
>>
>> On Thu, 21 Jun 2007, Tiberiu Stef-Praun wrote:
>>
>> > No
>> > My chart shows that if I had two more machines during the first  
>> stage
>> > run (the first 26 jobs), I would have avoided a long wait (500000
>> ms ,
>> > or about 9 minutes) for the last two jobs from the first batch to
>> > finish.
>> > This is why I need to redo the Econ run, with a different throttle
>> > value for Swift.
>>
>> So you are saying that changing the 'throttle value for swift' will
>> allocate more machines for you?
>>
>> --
>>
>
>
> -- 
> Tiberiu (Tibi) Stef-Praun, PhD
> Research Staff, Computation Institute
> 5640 S. Ellis Ave, #405
> University of Chicago
> http://www-unix.mcs.anl.gov/~tiberius/
>


From tiberius at ci.uchicago.edu  Thu Jun 21 11:34:42 2007
From: tiberius at ci.uchicago.edu (Tiberiu Stef-Praun)
Date: Thu, 21 Jun 2007 11:34:42 -0500
Subject: [Swift-devel] Re: [Swft] Q about throttling
In-Reply-To: <F7C67410-5C79-4964-B8EA-466BE465876B@mcs.anl.gov>
References: <467A6610.1000103@mcs.anl.gov>
	<Pine.LNX.4.64.0706211531180.15250@dildano.hawaga.org.uk>
	<467A9B32.4030402@mcs.anl.gov>
	<Pine.LNX.4.64.0706211537490.1452@dildano.hawaga.org.uk>
	<fec1351f0706210859v417e6161i32b0569844949784@mail.gmail.com>
	<Pine.LNX.4.64.0706211600140.1452@dildano.hawaga.org.uk>
	<fec1351f0706210904l6d74828bj6d5fe093a2d3d7@mail.gmail.com>
	<Pine.LNX.4.64.0706211608270.1452@dildano.hawaga.org.uk>
	<fec1351f0706210925t2358bf90mec1507118903ef6e@mail.gmail.com>
	<F7C67410-5C79-4964-B8EA-466BE465876B@mcs.anl.gov>
Message-ID: <fec1351f0706210934y72e66d33p728611c6b0bca238@mail.gmail.com>

I did have them both with values of 30.

On 6/21/07, Veronika Nefedova <nefedova at mcs.anl.gov> wrote:
> There are two throttle parameters you might want to check. One is in
> swift.properties called throttle.submit and one in scheduler.xml
> called jobThrottle.
> I am curious whats the difference between them?
>
> Nika
>
> On Jun 21, 2007, at 11:25 AM, Tiberiu Stef-Praun wrote:
>
> > No
> > I'm saying that swift throttle value will allow me to make full use of
> > all the nodes that Falkon makes available for me. I know that I had 26
> > jobs to be run, and I requested (and had) 30 nodes in the cluster.
> > Somehow only 24 jobs run in the first time, so I'm going to push up
> > the throttle value in Swift
> >
> > Tibi
> >
> >
> > On 6/21/07, Ben Clifford <benc at hawaga.org.uk> wrote:
> >>
> >>
> >>
> >> On Thu, 21 Jun 2007, Tiberiu Stef-Praun wrote:
> >>
> >> > No
> >> > My chart shows that if I had two more machines during the first
> >> stage
> >> > run (the first 26 jobs), I would have avoided a long wait (500000
> >> ms ,
> >> > or about 9 minutes) for the last two jobs from the first batch to
> >> > finish.
> >> > This is why I need to redo the Econ run, with a different throttle
> >> > value for Swift.
> >>
> >> So you are saying that changing the 'throttle value for swift' will
> >> allocate more machines for you?
> >>
> >> --
> >>
> >
> >
> > --
> > Tiberiu (Tibi) Stef-Praun, PhD
> > Research Staff, Computation Institute
> > 5640 S. Ellis Ave, #405
> > University of Chicago
> > http://www-unix.mcs.anl.gov/~tiberius/
> >
>
>


-- 
Tiberiu (Tibi) Stef-Praun, PhD
Research Staff, Computation Institute
5640 S. Ellis Ave, #405
University of Chicago
http://www-unix.mcs.anl.gov/~tiberius/


From tiberius at ci.uchicago.edu  Thu Jun 21 11:36:29 2007
From: tiberius at ci.uchicago.edu (Tiberiu Stef-Praun)
Date: Thu, 21 Jun 2007 11:36:29 -0500
Subject: [Swift-devel] Re: [Swft] Q about throttling
In-Reply-To: <Pine.LNX.4.64.0706211626190.1452@dildano.hawaga.org.uk>
References: <467A6610.1000103@mcs.anl.gov>
	<Pine.LNX.4.64.0706211531180.15250@dildano.hawaga.org.uk>
	<467A9B32.4030402@mcs.anl.gov>
	<Pine.LNX.4.64.0706211537490.1452@dildano.hawaga.org.uk>
	<fec1351f0706210859v417e6161i32b0569844949784@mail.gmail.com>
	<Pine.LNX.4.64.0706211600140.1452@dildano.hawaga.org.uk>
	<fec1351f0706210904l6d74828bj6d5fe093a2d3d7@mail.gmail.com>
	<Pine.LNX.4.64.0706211608270.1452@dildano.hawaga.org.uk>
	<fec1351f0706210925t2358bf90mec1507118903ef6e@mail.gmail.com>
	<Pine.LNX.4.64.0706211626190.1452@dildano.hawaga.org.uk>
Message-ID: <fec1351f0706210936s50c10800he65ff978b1edbd8@mail.gmail.com>

So the other thing that might have happened is that Falkon quietly
released some of the nodes (even though I requested a minimum of 30
nodes and a maximum of 50).


On 6/21/07, Ben Clifford <benc at hawaga.org.uk> wrote:
>
> My interpretation of the graph is:
>
> The two jobs that didn't get run till later (the 'spare' jobs) are
> submitted into falkon at approx t=0, along with the 24 'run straight away'
> jobs.
>
> Swift isn't holding them back.
>
> Falkon indicates that it is aware of them from approx time = 0 but doesn't
> run them until t=500000.
>
> That means, I think, that they're getting into Falkons queue right at the
> start, and its something happening with how Falkon places them onto worker
> nodes that isn't right here.
>
> At least that's my first impression.
>
> On Thu, 21 Jun 2007, Tiberiu Stef-Praun wrote:
>
> > No
> > I'm saying that swift throttle value will allow me to make full use of
> > all the nodes that Falkon makes available for me. I know that I had 26
> > jobs to be run, and I requested (and had) 30 nodes in the cluster.
> > Somehow only 24 jobs run in the first time, so I'm going to push up
> > the throttle value in Swift
> >
> > Tibi
> >
> >
> > On 6/21/07, Ben Clifford <benc at hawaga.org.uk> wrote:
> > >
> > >
> > >
> > > On Thu, 21 Jun 2007, Tiberiu Stef-Praun wrote:
> > >
> > > > No
> > > > My chart shows that if I had two more machines during the first stage
> > > > > run (the first 26 jobs), I would have avoided a long wait (500000 ms,
> > > > or about 9 minutes) for the last two jobs from the first batch to
> > > > finish.
> > > > This is why I need to redo the Econ run, with a different throttle
> > > > value for Swift.
> > >
> > > So you are saying that changing the 'throttle value for swift' will
> > > allocate more machines for you?
> > >
> > > --
> > >
> >
> >
> >
>


-- 
Tiberiu (Tibi) Stef-Praun, PhD
Research Staff, Computation Institute
5640 S. Ellis Ave, #405
University of Chicago
http://www-unix.mcs.anl.gov/~tiberius/


From benc at hawaga.org.uk  Thu Jun 21 11:37:36 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 21 Jun 2007 16:37:36 +0000 (GMT)
Subject: [Swift-devel] Re: [Swft] Q about throttling
In-Reply-To: <F7C67410-5C79-4964-B8EA-466BE465876B@mcs.anl.gov>
References: <467A6610.1000103@mcs.anl.gov>
	<fec1351f0706210809k1035425cjffc18b442ab41dea@mail.gmail.com>
	<467A95F3.6040603@mcs.anl.gov>
	<Pine.LNX.4.64.0706211531180.15250@dildano.hawaga.org.uk>
	<467A9B32.4030402@mcs.anl.gov>
	<Pine.LNX.4.64.0706211537490.1452@dildano.hawaga.org.uk>
	<fec1351f0706210859v417e6161i32b0569844949784@mail.gmail.com>
	<Pine.LNX.4.64.0706211600140.1452@dildano.hawaga.org.uk>
	<fec1351f0706210904l6d74828bj6d5fe093a2d3d7@mail.gmail.com>
	<Pine.LNX.4.64.0706211608270.1452@dildano.hawaga.org.uk>
	<fec1351f0706210925t2358bf90mec1507118903ef6e@mail.gmail.com>
	<F7C67410-5C79-4964-B8EA-466BE465876B@mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0706211637200.1452@dildano.hawaga.org.uk>


Have a look at this thread:

 Date: Fri, 04 May 2007 09:04:44 -0500                                           
 From: Mihael Hategan <hategan at mcs.anl.gov>                                      
 To: Veronika V. Nefedova <nefedova at mcs.anl.gov>                                 
 Cc: Ben Clifford <benc at hawaga.org.uk>, swift-devel at ci.uchicago.edu              
 Subject: Re: [Swift-devel] limiting simultaneous jobs using the local           
     provider.       

There was a bit of discussion there.

On Thu, 21 Jun 2007, Veronika Nefedova wrote:

> There are two throttle parameters you might want to check. One is in
> swift.properties called throttle.submit and one in scheduler.xml called
> jobThrottle.
> I am curious whats the difference between them?
> 
> Nika
> 
> On Jun 21, 2007, at 11:25 AM, Tiberiu Stef-Praun wrote:
> 
> > No
> > I'm saying that swift throttle value will allow me to make full use of
> > all the nodes that Falkon makes available for me. I know that I had 26
> > jobs to be run, and I requested (and had) 30 nodes in the cluster.
> > Somehow only 24 jobs run in the first time, so I'm going to push up
> > the throttle value in Swift
> > 
> > Tibi
> > 
> > 
> > On 6/21/07, Ben Clifford <benc at hawaga.org.uk> wrote:
> > > 
> > > 
> > > 
> > > On Thu, 21 Jun 2007, Tiberiu Stef-Praun wrote:
> > > 
> > > > No
> > > > My chart shows that if I had two more machines during the first stage
> > > > > run (the first 26 jobs), I would have avoided a long wait (500000 ms,
> > > > or about 9 minutes) for the last two jobs from the first batch to
> > > > finish.
> > > > This is why I need to redo the Econ run, with a different throttle
> > > > value for Swift.
> > > 
> > > So you are saying that changing the 'throttle value for swift' will
> > > allocate more machines for you?
> > > 
> > > --
> > > 
> > 
> > 
> > -- 
> > Tiberiu (Tibi) Stef-Praun, PhD
> > Research Staff, Computation Institute
> > 5640 S. Ellis Ave, #405
> > University of Chicago
> > http://www-unix.mcs.anl.gov/~tiberius/
> > 
> 


From benc at hawaga.org.uk  Thu Jun 21 11:40:33 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 21 Jun 2007 16:40:33 +0000 (GMT)
Subject: [Swift-devel] Re: [Swft] Q about throttling
In-Reply-To: <fec1351f0706210936s50c10800he65ff978b1edbd8@mail.gmail.com>
References: <467A6610.1000103@mcs.anl.gov>
	<Pine.LNX.4.64.0706211531180.15250@dildano.hawaga.org.uk>
	<467A9B32.4030402@mcs.anl.gov>
	<Pine.LNX.4.64.0706211537490.1452@dildano.hawaga.org.uk>
	<fec1351f0706210859v417e6161i32b0569844949784@mail.gmail.com> 
	<Pine.LNX.4.64.0706211600140.1452@dildano.hawaga.org.uk> 
	<fec1351f0706210904l6d74828bj6d5fe093a2d3d7@mail.gmail.com> 
	<Pine.LNX.4.64.0706211608270.1452@dildano.hawaga.org.uk> 
	<fec1351f0706210925t2358bf90mec1507118903ef6e@mail.gmail.com> 
	<Pine.LNX.4.64.0706211626190.1452@dildano.hawaga.org.uk>
	<fec1351f0706210936s50c10800he65ff978b1edbd8@mail.gmail.com>
Message-ID: <Pine.LNX.4.64.0706211638450.1452@dildano.hawaga.org.uk>


It's basically the same thing that is happening with the 6 steps in the 
1200000 .. 1500000 range. The graph is (slightly) too low resolution for 
me to count the number of jobs in each of those steps. (Note to 
infographic producer: make the chart use exactly two horizontal pixels 
per job for this scale of run.)

On Thu, 21 Jun 2007, Tiberiu Stef-Praun wrote:

> So the other thing that might have happened is that Falkon quietly
> released some of the nodes (even though I requested a minimum of 30
> nodes and a maximum of 50)
> 
> 
> On 6/21/07, Ben Clifford <benc at hawaga.org.uk> wrote:
> > 
> > My interpretation of the graph is:
> > 
> > The two jobs that didn't get run till later (the 'spare' jobs) are
> > submitted into falkon at approx t=0, along with the 24 'run straight away'
> > jobs.
> > 
> > Swift isn't holding them back.
> > 
> > Falkon indicates that it is aware of them from approx time = 0 but doesn't
> > run them until t=500000.
> > 
> > That means, I think, that they're getting into Falkons queue right at the
> > start, and its something happening with how Falkon places them onto worker
> > nodes that isn't right here.
> > 
> > At least that's my first impression.
> > 
> > On Thu, 21 Jun 2007, Tiberiu Stef-Praun wrote:
> > 
> > > No
> > > I'm saying that swift throttle value will allow me to make full use of
> > > all the nodes that Falkon makes available for me. I know that I had 26
> > > jobs to be run, and I requested (and had) 30 nodes in the cluster.
> > > Somehow only 24 jobs run in the first time, so I'm going to push up
> > > the throttle value in Swift
> > >
> > > Tibi
> > >
> > >
> > > On 6/21/07, Ben Clifford <benc at hawaga.org.uk> wrote:
> > > >
> > > >
> > > >
> > > > On Thu, 21 Jun 2007, Tiberiu Stef-Praun wrote:
> > > >
> > > > > No
> > > > > My chart shows that if I had two more machines during the first stage
> > > > > > run (the first 26 jobs), I would have avoided a long wait (500000 ms,
> > > > > or about 9 minutes) for the last two jobs from the first batch to
> > > > > finish.
> > > > > This is why I need to redo the Econ run, with a different throttle
> > > > > value for Swift.
> > > >
> > > > So you are saying that changing the 'throttle value for swift' will
> > > > allocate more machines for you?
> > > >
> > > > --
> > > >
> > >
> > >
> > >
> > 
> 
> 
> 


From benc at hawaga.org.uk  Thu Jun 21 11:43:52 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 21 Jun 2007 16:43:52 +0000 (GMT)
Subject: [Swift-devel] Re: [Swft] Q about throttling
In-Reply-To: <Pine.LNX.4.64.0706211638450.1452@dildano.hawaga.org.uk>
References: <467A6610.1000103@mcs.anl.gov>
	<Pine.LNX.4.64.0706211531180.15250@dildano.hawaga.org.uk>
	<467A9B32.4030402@mcs.anl.gov>
	<Pine.LNX.4.64.0706211537490.1452@dildano.hawaga.org.uk>
	<fec1351f0706210859v417e6161i32b0569844949784@mail.gmail.com> 
	<Pine.LNX.4.64.0706211600140.1452@dildano.hawaga.org.uk> 
	<fec1351f0706210904l6d74828bj6d5fe093a2d3d7@mail.gmail.com> 
	<Pine.LNX.4.64.0706211608270.1452@dildano.hawaga.org.uk> 
	<fec1351f0706210925t2358bf90mec1507118903ef6e@mail.gmail.com> 
	<Pine.LNX.4.64.0706211626190.1452@dildano.hawaga.org.uk>
	<fec1351f0706210936s50c10800he65ff978b1edbd8@mail.gmail.com>
	<Pine.LNX.4.64.0706211638450.1452@dildano.hawaga.org.uk>
Message-ID: <Pine.LNX.4.64.0706211643020.15250@dildano.hawaga.org.uk>

It might be useful to have debug-level throttle messages logged whenever 
cog/swift decides that a throttle limit has been reached.
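
As a rough illustration of the kind of message I mean, here is a small
Python sketch of a counting throttle that logs at debug level each time
it refuses a job. This is not cog or Swift code; the class, the throttle
name "submit" and the limit of 24 are all hypothetical.

```python
import logging

log = logging.getLogger("throttle")


class Throttle:
    """Hypothetical limit on in-flight jobs that logs when the limit is hit."""

    def __init__(self, name, limit):
        self.name = name
        self.limit = limit
        self.active = 0

    def try_acquire(self):
        # Refuse the job and say why, so the log shows which throttle bit.
        if self.active >= self.limit:
            log.debug("throttle %s reached its limit of %d",
                      self.name, self.limit)
            return False
        self.active += 1
        return True

    def release(self):
        # Called when a job finishes; frees one slot.
        if self.active > 0:
            self.active -= 1


logging.basicConfig(level=logging.DEBUG)
submit = Throttle("submit", limit=24)
accepted = sum(submit.try_acquire() for _ in range(26))
print(accepted)  # number of jobs let through before the limit was hit
```

With messages like these in the log, a run where 24 of 26 jobs start
would show directly whether a throttle refused the other two.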

-- 


From foster at mcs.anl.gov  Thu Jun 21 13:43:04 2007
From: foster at mcs.anl.gov (Ian Foster)
Date: Thu, 21 Jun 2007 13:43:04 -0500
Subject: [Swift-devel] Re: Swift Performance Data
In-Reply-To: <Pine.LNX.4.64.0706211609160.15250@dildano.hawaga.org.uk>
References: <4679FBC8.1080606@mcs.anl.gov> <467A6D78.4020702@mcs.anl.gov>
	<467A7AC6.7020400@mcs.anl.gov> <467A7E17.5000207@mcs.anl.gov>
	<467AA156.6020802@mcs.anl.gov>
	<Pine.LNX.4.64.0706211609160.15250@dildano.hawaga.org.uk>
Message-ID: <467AC6B8.9060905@mcs.anl.gov>

It is essential that we have kickstart (or the equivalent Falkon thing) 
running everywhere.

Ian.

Ben Clifford wrote:
> On Thu, 21 Jun 2007, Mike Wilde wrote:
>
>   
>> I believe we also want to run everything under kickstart. Its been hard to get
>> traction on that but we should discuss and keep on pushing on this.
>>     
>
> There was once an idea to have kickstart installed by our group on every 
> machine to which we commonly submit jobs (maybe as part of the 'getting 
> swift running on each site' campaign that resulted in the present site 
> catalog). But that doesn't seem to be the way things are now so maybe it 
> never got written down.
>
> Putting installs in place and pointing the default as-distributed site 
> catalog at those installs seems a relatively straightforward thing to do 
> (at least for most sites, and especially the OSG ones, which often have 
> kickstart installed as part of their standard software stack).
>
>   

-- 

   Ian Foster, Director, Computation Institute
Argonne National Laboratory & University of Chicago
Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439
Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637
Tel: +1 630 252 4619.  Web: www.ci.uchicago.edu.
      Globus Alliance: www.globus.org.


From iraicu at cs.uchicago.edu  Thu Jun 21 21:53:58 2007
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Thu, 21 Jun 2007 21:53:58 -0500
Subject: [Swift-devel] Re: [Swft] Q about throttling
In-Reply-To: <Pine.LNX.4.64.0706211604050.1452@dildano.hawaga.org.uk>
References: <467A6610.1000103@mcs.anl.gov> <467A6E50.3060508@mcs.anl.gov>
	<fec1351f0706210809k1035425cjffc18b442ab41dea@mail.gmail.com>
	<467A95F3.6040603@mcs.anl.gov>
	<Pine.LNX.4.64.0706211531180.15250@dildano.hawaga.org.uk>
	<467A9B32.4030402@mcs.anl.gov>
	<Pine.LNX.4.64.0706211537490.1452@dildano.hawaga.org.uk>
	<467A9DAD.7060909@mcs.anl.gov>
	<Pine.LNX.4.64.0706211549270.1452@dildano.hawaga.org.uk>
	<467AA004.4080601@mcs.anl.gov>
	<Pine.LNX.4.64.0706211604050.1452@dildano.hawaga.org.uk>
Message-ID: <467B39C6.8050104@cs.uchicago.edu>

Sorry to jump in on the discussion late.  Here are my thoughts on this 
issue:

Actually, what you are seeing is a completion rate of about 1/sec while 
all 200 executors were busy. This follows from the length of time each 
task was taking, namely about 200 seconds on each executor: with 200 
executors processing them, we get 200/200, or 1/sec, so this was 
perfectly normal.  The part that is not normal is around time 5000+ sec 
(where the red disappears, and only green is found): only about 90 
executors were kept busy, and the Falkon queue length was essentially 
0, which means that Swift was not submitting fast enough to keep all the 
executors busy.  If the Swift submission rate had been higher, I would 
have expected to see a little bit of red before each green bar 
throughout the graph.  Perhaps the Swift rate of submission was lower 
due to the dependencies in the workflow, but as I stated in a previous 
email, the red queue time should have continued until about task # 9600 
(301 [first 3 stages] + 6800 [4th stage] + 2500 [failed tasks])...
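
The arithmetic above can be checked with a few lines of Python. The
numbers are the approximate ones quoted in this message, and the
variable names are only illustrative.

```python
executors = 200      # executors that were busy
task_seconds = 200   # approximate run time of one task on one executor

# Steady state: each executor finishes one task every task_seconds,
# so the pool completes executors / task_seconds tasks per second.
completion_rate = executors / task_seconds
print(completion_rate)

# Fraction of the pool kept busy around t = 5000+ s,
# when only about 90 executors had work.
utilisation = 90 / executors
print(utilisation)

# Task number up to which queue time should have stayed visible:
# first 3 stages + 4th stage + failed tasks.
expected_tasks = 301 + 6800 + 2500
print(expected_tasks)
```

So a completion rate of 1/sec is what a fully busy pool should deliver,
while the later phase used less than half of the executors.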

Ioan

Ben Clifford wrote:
> neat. when was the run made that generated those graphs?
>
> The submits seem to be going through at about 1/sec in the 1000..2000s 
> time range. Is that the bit that is the problem?
>
> On Thu, 21 Jun 2007, Ian Foster wrote:
>
>   
>> See this document for a set of three graphs that Ioan produced:
>>
>> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/100-Mol_MolDyn.pdf 
>>
>> The first is the same as Tibi's, I think. The second and third are new. I want
>> to have all three produced in a standard way for every application run.
>>
>> Ian.
>>
>>
>> Ben Clifford wrote:
>>     
>>> actually, the graph that Tibi showed, which I think is pretty much the same
>>> as the graph from Ioan's gui visualizer thing, would be interesting to see
>>> for the present MolDyn runs.
>>>
>>> It was interesting to look at when wondering about the bug Yong fixed last
>>> week.
>>>
>>> On Thu, 21 Jun 2007, Ian Foster wrote:
>>>
>>>   
>>>       
>>>> agreed
>>>>
>>>> Ben Clifford wrote:
>>>>     
>>>>         
>>>>> On Thu, 21 Jun 2007, Ian Foster wrote:
>>>>>
>>>>>         
>>>>>           
>>>>>> I'm gathering from this exchange that this is not possible?
>>>>>>             
>>>>>>             
>>>>> I have no idea. It doesn't seem to be documented.
>>>>>
>>>>> But the number one rule of tech support is don't take somebody else's
>>>>> partially solved problem. It would be good to see what is actually
>>>>> causing
>>>>> you to suspect that there's a throttling problem.
>>>>>
>>>>>         
>>>>>           
>>>>     
>>>>         
>>>   
>>>       
>>     
>
>   

-- 
============================================
Ioan Raicu
Ph.D. Student
============================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
============================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
       http://dsl.cs.uchicago.edu/
============================================
============================================


From iraicu at cs.uchicago.edu  Thu Jun 21 21:54:14 2007
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Thu, 21 Jun 2007 21:54:14 -0500
Subject: [Swift-devel] Re: [Swft] Q about throttling
In-Reply-To: <Pine.LNX.4.64.0706211626190.1452@dildano.hawaga.org.uk>
References: <467A6610.1000103@mcs.anl.gov>
	<fec1351f0706210809k1035425cjffc18b442ab41dea@mail.gmail.com>
	<467A95F3.6040603@mcs.anl.gov>
	<Pine.LNX.4.64.0706211531180.15250@dildano.hawaga.org.uk>
	<467A9B32.4030402@mcs.anl.gov>
	<Pine.LNX.4.64.0706211537490.1452@dildano.hawaga.org.uk>
	<fec1351f0706210859v417e6161i32b0569844949784@mail.gmail.com>
	<Pine.LNX.4.64.0706211600140.1452@dildano.hawaga.org.uk>
	<fec1351f0706210904l6d74828bj6d5fe093a2d3d7@mail.gmail.com>
	<Pine.LNX.4.64.0706211608270.1452@dildano.hawaga.org.uk>
	<fec1351f0706210925t2358bf90mec1507118903ef6e@mail.gmail.com>
	<Pine.LNX.4.64.0706211626190.1452@dildano.hawaga.org.uk>
Message-ID: <467B39D6.4010801@cs.uchicago.edu>

I think Ben is right. In this particular instance, Swift submitted all 
26 jobs, and Falkon dispatched 24 of them and held 2 in the wait queue.  
Throttling was not the issue here.  At first glance, I would say that 
although you asked for 30 nodes at the beginning, you might have lost 
some when the idle time limit was reached, so when you started the 26 
jobs you had only 24 executors ready.  Can you send me these two logs: 
service/logs/GenericPortalWS_perf_per_sec.log and 
service/logs/GenericPortalWS_taskPerf.log? I will try to superimpose 
the number of busy and free executors on top of the graph you sent out 
showing the per-task information.
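
As a rough illustration of this scenario (a sketch only, not Falkon's 
actual dispatcher): if only 24 executors are live when 26 equal-length 
jobs arrive, the last two wait in the queue until the first executors 
free up. The function name and the 500-second job length are made up 
for illustration.

```python
import heapq

def dispatch(num_jobs, num_executors, job_seconds):
    """Return (start, end) times for jobs handed FIFO to a fixed executor pool."""
    # Each heap entry is the time at which one executor becomes free.
    free_at = [0.0] * num_executors
    heapq.heapify(free_at)
    schedule = []
    for _ in range(num_jobs):
        start = heapq.heappop(free_at)   # earliest available executor
        end = start + job_seconds
        heapq.heappush(free_at, end)
        schedule.append((start, end))
    return schedule

# Hypothetical numbers: 26 jobs of 500 s each, but only 24 live executors.
schedule = dispatch(num_jobs=26, num_executors=24, job_seconds=500.0)
waiting = [s for s in schedule if s[0] > 0]
print(len(waiting), "jobs waited; makespan =", max(e for _, e in schedule))
```

With 30 live executors the same call would start every job at t=0, 
which is why the executor count is worth checking before changing any 
throttle.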

Ioan

Ben Clifford wrote:
> My interpretation of the graph is:
>
> The two jobs that didn't get run till later (the 'spare' jobs) are 
> submitted into falkon at approx t=0, along with the 24 'run straight away' 
> jobs.
>
> Swift isn't holding them back.
>
> Falkon indicates that it is aware of them from approx time = 0 but doesn't 
> run them until t=500000.
>
> That means, I think, that they're getting into Falkons queue right at the 
> start, and its something happening with how Falkon places them onto worker 
> nodes that isn't right here.
>
> At least that's my first impression.
>
> On Thu, 21 Jun 2007, Tiberiu Stef-Praun wrote:
>
>   
>> No
>> I'm saying that swift throttle value will allow me to make full use of
>> all the nodes that Falkon makes available for me. I know that I had 26
>> jobs to be run, and I requested (and had) 30 nodes in the cluster.
>> Somehow only 24 jobs run in the first time, so I'm going to push up
>> the throttle value in Swift
>>
>> Tibi
>>
>>
>> On 6/21/07, Ben Clifford <benc at hawaga.org.uk> wrote:
>>     
>>>
>>> On Thu, 21 Jun 2007, Tiberiu Stef-Praun wrote:
>>>
>>>       
>>>> No
>>>> My chart shows that if I had two more machines during the first stage
>>>> run (the first 26 jobs), I would have avoided a long wait (500000 ms ,
>>>> or about 9 minutes) for the last two jobs from the first batch to
>>>> finish.
>>>> This is why I need to redo the Econ run, with a different throttle
>>>> value for Swift.
>>>>         
>>> So you are saying that changing the 'throttle value for swift' will
>>> allocate more machines for you?
>>>
>>> --
>>>
>>>       
>>
>>     
>
>   



From iraicu at cs.uchicago.edu  Thu Jun 21 21:54:21 2007
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Thu, 21 Jun 2007 21:54:21 -0500
Subject: [Swift-devel] Re: [Swft] Q about throttling
In-Reply-To: <fec1351f0706210936s50c10800he65ff978b1edbd8@mail.gmail.com>
References: <467A6610.1000103@mcs.anl.gov>	
	<Pine.LNX.4.64.0706211531180.15250@dildano.hawaga.org.uk>	
	<467A9B32.4030402@mcs.anl.gov>	
	<Pine.LNX.4.64.0706211537490.1452@dildano.hawaga.org.uk>	
	<fec1351f0706210859v417e6161i32b0569844949784@mail.gmail.com>	
	<Pine.LNX.4.64.0706211600140.1452@dildano.hawaga.org.uk>	
	<fec1351f0706210904l6d74828bj6d5fe093a2d3d7@mail.gmail.com>	
	<Pine.LNX.4.64.0706211608270.1452@dildano.hawaga.org.uk>	
	<fec1351f0706210925t2358bf90mec1507118903ef6e@mail.gmail.com>	
	<Pine.LNX.4.64.0706211626190.1452@dildano.hawaga.org.uk>
	<fec1351f0706210936s50c10800he65ff978b1edbd8@mail.gmail.com>
Message-ID: <467B39DD.80003@cs.uchicago.edu>

Right, I think that is what happened.   Send me the logs I asked for in 
a previous email, and we can plot both files on the same graph, and we 
will have the answer!

Ioan

Tiberiu Stef-Praun wrote:
> So the other thing that might have happened is that Falkon quietly
> released some of the nodes (even though I requested a minimum of 30
> nodes and a maximum of 50)
>
>
> On 6/21/07, Ben Clifford <benc at hawaga.org.uk> wrote:
>>
>> My interpretation of the graph is:
>>
>> The two jobs that didn't get run till later (the 'spare' jobs) are
>> submitted into falkon at approx t=0, along with the 24 'run straight 
>> away'
>> jobs.
>>
>> Swift isn't holding them back.
>>
>> Falkon indicates that it is aware of them from approx time = 0 but 
>> doesn't
>> run them until t=500000.
>>
>> That means, I think, that they're getting into Falkons queue right at 
>> the
>> start, and its something happening with how Falkon places them onto 
>> worker
>> nodes that isn't right here.
>>
>> At least that's my first impression.
>>
>> On Thu, 21 Jun 2007, Tiberiu Stef-Praun wrote:
>>
>> > No
>> > I'm saying that swift throttle value will allow me to make full use of
>> > all the nodes that Falkon makes available for me. I know that I had 26
>> > jobs to be run, and I requested (and had) 30 nodes in the cluster.
>> > Somehow only 24 jobs run in the first time, so I'm going to push up
>> > the throttle value in Swift
>> >
>> > Tibi
>> >
>> >
>> > On 6/21/07, Ben Clifford <benc at hawaga.org.uk> wrote:
>> > >
>> > >
>> > >
>> > > On Thu, 21 Jun 2007, Tiberiu Stef-Praun wrote:
>> > >
>> > > > No
>> > > > My chart shows that if I had two more machines during the first 
>> stage
>> > > > run (the first 26 jobs), I would have avoided a long wait 
>> (500000 ms ,
>> > > > or about 9 minutes) for the last two jobs from the first batch to
>> > > > finish.
>> > > > This is why I need to redo the Econ run, with a different throttle
>> > > > value for Swift.
>> > >
>> > > So you are saying that changing the 'throttle value for swift' will
>> > > allocate more machines for you?
>> > >
>> > > --
>> > >
>> >
>> >
>> >
>>
>
>



From benc at hawaga.org.uk  Fri Jun 22 03:10:39 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 22 Jun 2007 08:10:39 +0000 (GMT)
Subject: [Swift-devel] Re: [Swft] Q about throttling
In-Reply-To: <467B39C6.8050104@cs.uchicago.edu>
References: <467A6610.1000103@mcs.anl.gov> <467A6E50.3060508@mcs.anl.gov>
	<fec1351f0706210809k1035425cjffc18b442ab41dea@mail.gmail.com>
	<467A95F3.6040603@mcs.anl.gov>
	<Pine.LNX.4.64.0706211531180.15250@dildano.hawaga.org.uk>
	<467A9B32.4030402@mcs.anl.gov>
	<Pine.LNX.4.64.0706211537490.1452@dildano.hawaga.org.uk>
	<467A9DAD.7060909@mcs.anl.gov>
	<Pine.LNX.4.64.0706211549270.1452@dildano.hawaga.org.uk>
	<467AA004.4080601@mcs.anl.gov>
	<Pine.LNX.4.64.0706211604050.1452@dildano.hawaga.org.uk>
	<467B39C6.8050104@cs.uchicago.edu>
Message-ID: <Pine.LNX.4.64.0706220805320.1452@dildano.hawaga.org.uk>


> kept busy, and the Falkon queue length was relatively at 0... so this means
> that Swift was not submitting fast enough to keep all the executors busy.

interesting. though around t=1000 there is a rapid burst of submission 
getting the queue length up to about 6000 in a few minutes.

Do you know what the cpu time usage of the swift submitting JVM was over 
that time period?

-- 


From iraicu at cs.uchicago.edu  Fri Jun 22 09:06:32 2007
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Fri, 22 Jun 2007 09:06:32 -0500
Subject: [Swift-devel] Re: [Swft] Q about throttling
In-Reply-To: <Pine.LNX.4.64.0706220805320.1452@dildano.hawaga.org.uk>
References: <467A6610.1000103@mcs.anl.gov> <467A6E50.3060508@mcs.anl.gov>
	<fec1351f0706210809k1035425cjffc18b442ab41dea@mail.gmail.com>
	<467A95F3.6040603@mcs.anl.gov>
	<Pine.LNX.4.64.0706211531180.15250@dildano.hawaga.org.uk>
	<467A9B32.4030402@mcs.anl.gov>
	<Pine.LNX.4.64.0706211537490.1452@dildano.hawaga.org.uk>
	<467A9DAD.7060909@mcs.anl.gov>
	<Pine.LNX.4.64.0706211549270.1452@dildano.hawaga.org.uk>
	<467AA004.4080601@mcs.anl.gov>
	<Pine.LNX.4.64.0706211604050.1452@dildano.hawaga.org.uk>
	<467B39C6.8050104@cs.uchicago.edu>
	<Pine.LNX.4.64.0706220805320.1452@dildano.hawaga.org.uk>
Message-ID: <467BD768.3020507@cs.uchicago.edu>

No, I didn't keep track of this info, unless Swift does this through 
some of its logs. 

Over the last week, my observations have been the following: Swift is 
more than capable and willing to send out many tasks as long as they are 
independent (as can be seen in this graph, where probably 6800 tasks got 
submitted), but thereafter it had no other burst of task submission, 
although I believe it could have sent out more.  For example, there were 
2500+ tasks that failed in the middle of those 6800 tasks (which were 
all independent). Why were those 2500 tasks not resubmitted all at once? 
They were each about 200 seconds long, so most of them should certainly 
have shown up in the wait queue.

Ioan

Ben Clifford wrote:
>> kept busy, and the Falkon queue length was relatively at 0... so this means
>> that Swift was not submitting fast enough to keep all the executors busy.
>>     
>
> interesting. though around t=1000 there is a rapid burst of submission 
> getting the queue length up to about 6000 in a few minutes.
>
> Do you know what the cpu time usage of the swift submitting JVM was over 
> that time period?
>
>   



From wilde at mcs.anl.gov  Fri Jun 22 09:39:50 2007
From: wilde at mcs.anl.gov (Mike Wilde)
Date: Fri, 22 Jun 2007 09:39:50 -0500
Subject: [Swift-devel] Re: [Swft] Q about throttling
In-Reply-To: <467BD768.3020507@cs.uchicago.edu>
References: <467A6610.1000103@mcs.anl.gov>
	<467A6E50.3060508@mcs.anl.gov>	<fec1351f0706210809k1035425cjffc18b442ab41dea@mail.gmail.com>	<467A95F3.6040603@mcs.anl.gov>	<Pine.LNX.4.64.0706211531180.15250@dildano.hawaga.org.uk>	<467A9B32.4030402@mcs.anl.gov>	<Pine.LNX.4.64.0706211537490.1452@dildano.hawaga.org.uk>	<467A9DAD.7060909@mcs.anl.gov>	<Pine.LNX.4.64.0706211549270.1452@dildano.hawaga.org.uk>	<467AA004.4080601@mcs.anl.gov>	<Pine.LNX.4.64.0706211604050.1452@dildano.hawaga.org.uk>	<467B39C6.8050104@cs.uchicago.edu>	<Pine.LNX.4.64.0706220805320.1452@dildano.hawaga.org.uk>
	<467BD768.3020507@cs.uchicago.edu>
Message-ID: <467BDF36.7080909@mcs.anl.gov>

Is there a configurable retry delay after failure?

I think you need to examine the overall workflow dependency structure.

Also, I recall from older perf charts that there's an option to enable/disable 
pipelining.  With pipelining disabled, it seems that Swift will wait for an 
entire dataset/foreach or procedure to finish before starting any tasks that 
depend on the foreach or procedure.
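
A small sketch of why this matters, under simplifying assumptions 
(item i of the second stage depends only on item i of the first, and 
there are enough workers for every item). The stage run times are 
hypothetical, and this models the scheduling idea only, not Swift's 
implementation.

```python
def makespan_barrier(stage1, stage2):
    """Stage 2 starts only after every stage-1 task has finished."""
    return max(stage1) + max(stage2)

def makespan_pipelined(stage1, stage2):
    """Each stage-2 task starts as soon as its own stage-1 input is ready."""
    return max(a + b for a, b in zip(stage1, stage2))

# Hypothetical per-item run times in seconds.
stage1 = [10, 200, 20]
stage2 = [150, 10, 30]

print(makespan_barrier(stage1, stage2))    # 350
print(makespan_pipelined(stage1, stage2))  # 210
```

With the barrier, the slow 150-second second-stage task cannot start 
until the slow 200-second first-stage task ends, even though its own 
input was ready after 10 seconds.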

Mihael, can you look at some of these issues when you are back online and rested?

- Mike

Ioan Raicu wrote, On 6/22/2007 9:06 AM:
> No, I didn't keep track of this info, unless Swift does this through 
> some of its logs. 
> 
> Over the last week, my observations have been the following: Swift is 
> more than capable and willing to send out many tasks as long as they are 
> independent (as can be seen in this graph where probably 6800 tasks got 
> submitted), but thereafter, it had no other burst of task submission, 
> although I believe it could have send out more.  For example, there were 
> 2500+ tasks that failed in the middle of those 6800 tasks (which were 
> all independent), why were 2500 tasks not resubmitted all at once... 
> they were each about 200 seconds long, so most of them should have 
> certainly showed up in the wait queue.
> 
> Ioan
> 
> Ben Clifford wrote:
>>> kept busy, and the Falkon queue length was relatively at 0... so this means
>>> that Swift was not submitting fast enough to keep all the executors busy.
>>>     
>>
>> interesting. though around t=1000 there is a rapid burst of submission 
>> getting the queue length up to about 6000 in a few minutes.
>>
>> Do you know what the cpu time usage of the swift submitting JVM was over 
>> that time period?
>>
>>   
> 

-- 
Mike Wilde
Computation Institute, University of Chicago
Math & Computer Science Division
Argonne National Laboratory
Argonne, IL   60439    USA
tel 630-252-7497 fax 630-252-1997


From yongzh at cs.uchicago.edu  Fri Jun 22 09:45:42 2007
From: yongzh at cs.uchicago.edu (Yong Zhao)
Date: Fri, 22 Jun 2007 09:45:42 -0500 (CDT)
Subject: [Swift-devel] Re: [Swft] Q about throttling
In-Reply-To: <467BDF36.7080909@mcs.anl.gov>
References: <467A6610.1000103@mcs.anl.gov> <467A6E50.3060508@mcs.anl.gov>
	<fec1351f0706210809k1035425cjffc18b442ab41dea@mail.gmail.com>
	<467A95F3.6040603@mcs.anl.gov>
	<Pine.LNX.4.64.0706211531180.15250@dildano.hawaga.org.uk>
	<467A9B32.4030402@mcs.anl.gov>
	<Pine.LNX.4.64.0706211537490.1452@dildano.hawaga.org.uk>
	<467A9DAD.7060909@mcs.anl.gov>
	<Pine.LNX.4.64.0706211549270.1452@dildano.hawaga.org.uk>
	<467AA004.4080601@mcs.anl.gov>
	<Pine.LNX.4.64.0706211604050.1452@dildano.hawaga.org.uk>
	<467B39C6.8050104@cs.uchicago.edu>
	<Pine.LNX.4.64.0706220805320.1452@dildano.hawaga.org.uk>
	<467BD768.3020507@cs.uchicago.edu> <467BDF36.7080909@mcs.anl.gov>
Message-ID: <Pine.LNX.4.58.0706220943350.15672@classes.cs.uchicago.edu>

The retry mechanism is currently in some karajan script, and we can easily
add some delay there.

There is no configuration option to disable pipelining. I did that
manually (by modifying a code segment) to get a perf chart.

Yong.

On Fri, 22 Jun 2007, Mike Wilde wrote:

> Is there a configurable retry delay after failure?
>
> I think you need to examine the overall workflow dependency structure.
>
> Also, I recall from older perf charts that there's an option to enable/disable
> pipelining.  With pipelining disabled, it seems that Swift will wait for an
> entire dataset/foreach or procedure to finish before starting any tasks that
> depend on the foreach or procedure.
>
> Mihael, can you look at some of these issues when you are back online and rested?
>
> - Mike
>
> Ioan Raicu wrote, On 6/22/2007 9:06 AM:
> > No, I didn't keep track of this info, unless Swift does this through
> > some of its logs.
> >
> > Over the last week, my observations have been the following: Swift is
> > more than capable and willing to send out many tasks as long as they are
> > independent (as can be seen in this graph where probably 6800 tasks got
> > submitted), but thereafter, it had no other burst of task submission,
> > although I believe it could have send out more.  For example, there were
> > 2500+ tasks that failed in the middle of those 6800 tasks (which were
> > all independent), why were 2500 tasks not resubmitted all at once...
> > they were each about 200 seconds long, so most of them should have
> > certainly showed up in the wait queue.
> >
> > Ioan
> >
> > Ben Clifford wrote:
> >>> kept busy, and the Falkon queue length was relatively at 0... so this means
> >>> that Swift was not submitting fast enough to keep all the executors busy.
> >>>
> >>
> >> interesting. though around t=1000 there is a rapid burst of submission
> >> getting the queue length up to about 6000 in a few minutes.
> >>
> >> Do you know what the cpu time usage of the swift submitting JVM was over
> >> that time period?
> >>
> >>
> >
>


From benc at hawaga.org.uk  Fri Jun 22 12:11:38 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 22 Jun 2007 17:11:38 +0000 (GMT)
Subject: [Swift-devel] Re: [Swft] Q about throttling
In-Reply-To: <467BD768.3020507@cs.uchicago.edu>
References: <467A6610.1000103@mcs.anl.gov> <467A6E50.3060508@mcs.anl.gov>
	<fec1351f0706210809k1035425cjffc18b442ab41dea@mail.gmail.com>
	<467A95F3.6040603@mcs.anl.gov>
	<Pine.LNX.4.64.0706211531180.15250@dildano.hawaga.org.uk>
	<467A9B32.4030402@mcs.anl.gov>
	<Pine.LNX.4.64.0706211537490.1452@dildano.hawaga.org.uk>
	<467A9DAD.7060909@mcs.anl.gov>
	<Pine.LNX.4.64.0706211549270.1452@dildano.hawaga.org.uk>
	<467AA004.4080601@mcs.anl.gov>
	<Pine.LNX.4.64.0706211604050.1452@dildano.hawaga.org.uk>
	<467B39C6.8050104@cs.uchicago.edu>
	<Pine.LNX.4.64.0706220805320.1452@dildano.hawaga.org.uk>
	<467BD768.3020507@cs.uchicago.edu>
Message-ID: <Pine.LNX.4.64.0706221711110.1452@dildano.hawaga.org.uk>


On Fri, 22 Jun 2007, Ioan Raicu wrote:
> I believe it could have send out more.  For example, there were 2500+ tasks
> that failed in the middle of those 6800 tasks (which were all independent),
> why were 2500 tasks not resubmitted all at once... they were each about 200
> seconds long, so most of them should have certainly showed up in the wait
> queue.

what kind of failure?

-- 


From iraicu at cs.uchicago.edu  Fri Jun 22 12:17:35 2007
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Fri, 22 Jun 2007 12:17:35 -0500
Subject: [Swift-devel] Re: [Swft] Q about throttling
In-Reply-To: <Pine.LNX.4.64.0706221711110.1452@dildano.hawaga.org.uk>
References: <467A6610.1000103@mcs.anl.gov> <467A6E50.3060508@mcs.anl.gov>
	<fec1351f0706210809k1035425cjffc18b442ab41dea@mail.gmail.com>
	<467A95F3.6040603@mcs.anl.gov>
	<Pine.LNX.4.64.0706211531180.15250@dildano.hawaga.org.uk>
	<467A9B32.4030402@mcs.anl.gov>
	<Pine.LNX.4.64.0706211537490.1452@dildano.hawaga.org.uk>
	<467A9DAD.7060909@mcs.anl.gov>
	<Pine.LNX.4.64.0706211549270.1452@dildano.hawaga.org.uk>
	<467AA004.4080601@mcs.anl.gov>
	<Pine.LNX.4.64.0706211604050.1452@dildano.hawaga.org.uk>
	<467B39C6.8050104@cs.uchicago.edu>
	<Pine.LNX.4.64.0706220805320.1452@dildano.hawaga.org.uk>
	<467BD768.3020507@cs.uchicago.edu>
	<Pine.LNX.4.64.0706221711110.1452@dildano.hawaga.org.uk>
Message-ID: <467C042F.5040209@cs.uchicago.edu>

Here is an excerpt from an email on 6/19. 
> > It completed 10998
> > tasks (8402 tasks with an exit code of 0, and 2596 tasks with an exit
> > code of -1 -- aka failed) in 13399 seconds on 200 processors, this
> > was for the 100 molecule run! The failed tasks were all on the same
> > node over several short time intervals (~30 seconds), and were due to
> > a "Stale NFS file handle", probably due to having 200 processes
> > hitting the shared file system at the same time. Note that all these
> > 2596 failed tasks were restarted by Swift and completed successfully
> > on the resubmission. In the end, everything went through, and the run
> > was successful!

In later runs we noticed the same node acting up, taking on the order of 
100 times longer than expected to complete some tasks.  I bet this node 
is having some hardware issues, and we should write to help at tg to 
tell them.

The failed tasks were eventually retried and succeeded, and the whole 
run was successful, but the question is: why were the 2596 failed tasks 
(which were all independent of each other) not resubmitted faster after 
they failed? I would have expected these 2596 retried tasks to fill up 
the wait queue.
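
For reference, the figures in the excerpt above can be cross-checked 
with a few lines of arithmetic. This is only a sanity check on the 
reported numbers; the per-task bound assumes all 200 processors were 
busy for the whole run, which the excerpt does not claim.

```python
completed = 10998          # total task executions reported
ok, failed = 8402, 2596    # exit code 0 vs. exit code -1
wall_seconds = 13399
processors = 200

# The two exit-code counts should add up to the reported total.
assert ok + failed == completed

throughput = completed / wall_seconds        # task executions per second
failure_fraction = failed / completed
# Upper bound on mean time per execution if every processor was always busy.
mean_task_bound = processors * wall_seconds / completed

print(f"{throughput:.2f} tasks/s, {failure_fraction:.1%} failed, "
      f"at most {mean_task_bound:.0f} s per task on average")
```

The bound comes out a little above the roughly 200 seconds per task 
mentioned earlier in the thread, so the reported numbers are mutually 
consistent.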

Ioan

Ben Clifford wrote:
>
> On Fri, 22 Jun 2007, Ioan Raicu wrote:
>   
>> I believe it could have send out more.  For example, there were 2500+ tasks
>> that failed in the middle of those 6800 tasks (which were all independent),
>> why were 2500 tasks not resubmitted all at once... they were each about 200
>> seconds long, so most of them should have certainly showed up in the wait
>> queue.
>>     
>
> what kind of failure?
>
>   



From wilde at mcs.anl.gov  Fri Jun 22 15:27:57 2007
From: wilde at mcs.anl.gov (Mike Wilde)
Date: Fri, 22 Jun 2007 15:27:57 -0500
Subject: [Swift-devel] Re: [Swft] Q about throttling
In-Reply-To: <Pine.LNX.4.58.0706220943350.15672@classes.cs.uchicago.edu>
References: <467A6610.1000103@mcs.anl.gov> <467A6E50.3060508@mcs.anl.gov>
	<fec1351f0706210809k1035425cjffc18b442ab41dea@mail.gmail.com>
	<467A95F3.6040603@mcs.anl.gov>
	<Pine.LNX.4.64.0706211531180.15250@dildano.hawaga.org.uk>
	<467A9B32.4030402@mcs.anl.gov>
	<Pine.LNX.4.64.0706211537490.1452@dildano.hawaga.org.uk>
	<467A9DAD.7060909@mcs.anl.gov>
	<Pine.LNX.4.64.0706211549270.1452@dildano.hawaga.org.uk>
	<467AA004.4080601@mcs.anl.gov>
	<Pine.LNX.4.64.0706211604050.1452@dildano.hawaga.org.uk>
	<467B39C6.8050104@cs.uchicago.edu>
	<Pine.LNX.4.64.0706220805320.1452@dildano.hawaga.org.uk>
	<467BD768.3020507@cs.uchicago.edu> <467BDF36.7080909@mcs.anl.gov>
	<Pine.LNX.4.58.0706220943350.15672@classes.cs.uchicago.edu>
Message-ID: <467C30CD.8010703@mcs.anl.gov>

[forgot to hit send on this - my apologies if it's no longer relevant]

OK, thanks, Yong.

Regarding the retry delay, I phrased the question poorly. I meant:

Is it possible that the 2500 failing jobs are being retried too slowly? I.e., 
that Karajan delays each re-run after a failure, and thus can't keep Falkon 
fed with retried jobs at a high rate?

- Mike


Yong Zhao wrote, On 6/22/2007 9:45 AM:
> The retry mechanism is currently in some karajan script, and we can easily
> add some delay there.
> 
> There is not a configuration option to disable pipeline. I did that
> manually (modified some code segment) to get a perf chart.
> 
> Yong.
> 
> On Fri, 22 Jun 2007, Mike Wilde wrote:
> 
>> Is there a configurable retry delay after failure?
>>
>> I think you need to examine the overall workflow dependency structure.
>>
>> Also, I recall from older perf charts that there's an option to enable/disable
>> pipelining.  With pipelining disabled, it seems that Swift will wait for an
>> entire dataset/foreach or procedure to finish before starting any tasks that
>> depend on the foreach or procedure.
>>
>> Mihael, can you look at some of these issues when you are back online and rested?
>>
>> - Mike
>>
>> Ioan Raicu wrote, On 6/22/2007 9:06 AM:
>>> No, I didn't keep track of this info, unless Swift does this through
>>> some of its logs.
>>>
>>> Over the last week, my observations have been the following: Swift is
>>> more than capable and willing to send out many tasks as long as they are
>>> independent (as can be seen in this graph where probably 6800 tasks got
>>> submitted), but thereafter, it had no other burst of task submission,
>>> although I believe it could have send out more.  For example, there were
>>> 2500+ tasks that failed in the middle of those 6800 tasks (which were
>>> all independent), why were 2500 tasks not resubmitted all at once...
>>> they were each about 200 seconds long, so most of them should have
>>> certainly showed up in the wait queue.
>>>
>>> Ioan
>>>
>>> Ben Clifford wrote:
>>>>> kept busy, and the Falkon queue length was relatively at 0... so this means
>>>>> that Swift was not submitting fast enough to keep all the executors busy.
>>>>>
>>>> interesting. though around t=1000 there is a rapid burst of submission
>>>> getting the queue length up to about 6000 in a few minutes.
>>>>
>>>> Do you know what the cpu time usage of the swift submitting JVM was over
>>>> that time period?
>>>>
>>>>
> 
> 



From yongzh at cs.uchicago.edu  Fri Jun 22 15:32:46 2007
From: yongzh at cs.uchicago.edu (Yong Zhao)
Date: Fri, 22 Jun 2007 15:32:46 -0500 (CDT)
Subject: [Swift-devel] Re: [Swft] Q about throttling
In-Reply-To: <467C30CD.8010703@mcs.anl.gov>
References: <467A6610.1000103@mcs.anl.gov> <467A6E50.3060508@mcs.anl.gov>
	<fec1351f0706210809k1035425cjffc18b442ab41dea@mail.gmail.com>
	<467A95F3.6040603@mcs.anl.gov>
	<Pine.LNX.4.64.0706211531180.15250@dildano.hawaga.org.uk>
	<467A9B32.4030402@mcs.anl.gov>
	<Pine.LNX.4.64.0706211537490.1452@dildano.hawaga.org.uk>
	<467A9DAD.7060909@mcs.anl.gov>
	<Pine.LNX.4.64.0706211549270.1452@dildano.hawaga.org.uk>
	<467AA004.4080601@mcs.anl.gov>
	<Pine.LNX.4.64.0706211604050.1452@dildano.hawaga.org.uk>
	<467B39C6.8050104@cs.uchicago.edu>
	<Pine.LNX.4.64.0706220805320.1452@dildano.hawaga.org.uk>
	<467BD768.3020507@cs.uchicago.edu> <467BDF36.7080909@mcs.anl.gov>
	<Pine.LNX.4.58.0706220943350.15672@classes.cs.uchicago.edu>
	<467C30CD.8010703@mcs.anl.gov>
Message-ID: <Pine.LNX.4.58.0706221530160.7834@classes.cs.uchicago.edu>

There is no delay in submitting retry jobs. However, these retry jobs may
be queued behind the 'ready' jobs that Swift has already processed, which
Swift may itself be holding back if job throttling is in effect.
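
A toy model of that effect, assuming a single FIFO queue and a throttle
that releases a fixed number of jobs per round. The real Karajan
scheduler is more involved, and the names and numbers here are
hypothetical.

```python
from collections import deque

def submission_order(ready_jobs, retry_jobs, throttle, rounds):
    """Release at most `throttle` jobs per round from a single FIFO queue."""
    queue = deque(ready_jobs)
    queue.extend(retry_jobs)          # retries join behind jobs already queued
    released = []
    for _ in range(rounds):
        batch = [queue.popleft() for _ in range(min(throttle, len(queue)))]
        released.append(batch)
    return released

# Hypothetical workload: 6 ready jobs, 3 retries, throttle of 4 per round.
ready = [f"ready-{i}" for i in range(6)]
retries = [f"retry-{i}" for i in range(3)]
for n, batch in enumerate(submission_order(ready, retries, throttle=4, rounds=3)):
    print(n, batch)
```

In this model no retry is released in the first round even though it
was resubmitted without delay; it simply waits its turn behind the
throttled ready jobs.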

Yong.

On Fri, 22 Jun 2007, Mike Wilde wrote:

> [forgot to hit send on this - my apology if its no longer relevant]
>
> OK, thanks, Yong.
>
> Regarding the retry delay, I phrased the question poorly. I meant:
>
> Is it possible that the 2500 failing jobs are being retried too slowly? Ie that
> Karajan delays each re-run after a failure, and thus cant keep Falkon fed with
> retried jobs at a high rate?
>
> - Mike
>
>
> Yong Zhao wrote, On 6/22/2007 9:45 AM:
> > The retry mechanism is currently in some karajan script, and we can easily
> > add some delay there.
> >
> > There is not a configuration option to disable pipeline. I did that
> > manually (modified some code segment) to get a perf chart.
> >
> > Yong.
> >
> > On Fri, 22 Jun 2007, Mike Wilde wrote:
> >
> >> Is there a configurable retry delay after failure?
> >>
> >> I think you need to examine the overall workflow dependency structure.
> >>
> >> Also, I recall from older perf charts that there's an option to enable/disable
> >> pipelining.  With pipelining disabled, it seems that Swift will wait for an
> >> entire dataset/foreach or procedure to finish before starting any tasks that
> >> depend on the foreach or procedure.
> >>
> >> Mihael, can you look at some of these issues when you are back online and rested?
> >>
> >> - Mike
> >>
> >> Ioan Raicu wrote, On 6/22/2007 9:06 AM:
> >>> No, I didn't keep track of this info, unless Swift does this through
> >>> some of its logs.
> >>>
> >>> Over the last week, my observations have been the following: Swift is
> >>> more than capable and willing to send out many tasks as long as they are
> >>> independent (as can be seen in this graph where probably 6800 tasks got
> >>> submitted), but thereafter, it had no other burst of task submission,
> >>> although I believe it could have send out more.  For example, there were
> >>> 2500+ tasks that failed in the middle of those 6800 tasks (which were
> >>> all independent), why were 2500 tasks not resubmitted all at once...
> >>> they were each about 200 seconds long, so most of them should have
> >>> certainly showed up in the wait queue.
> >>>
> >>> Ioan
> >>>
> >>> Ben Clifford wrote:
> >>>>> kept busy, and the Falkon queue length was relatively at 0... so this means
> >>>>> that Swift was not submitting fast enough to keep all the executors busy.
> >>>>>
> >>>> interesting. though around t=1000 there is a rapid burst of submission
> >>>> getting the queue length up to about 6000 in a few minutes.
> >>>>
> >>>> Do you know what the cpu time usage of the swift submitting JVM was over
> >>>> that time period?
> >>>>
> >>>>
> >
> >
>
> --
> Mike Wilde
> Computation Institute, University of Chicago
> Math & Computer Science Division
> Argonne National Laboratory
> Argonne, IL   60439    USA
> tel 630-252-7497 fax 630-252-1997
>


From hategan at mcs.anl.gov  Sat Jun 23 14:59:30 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sat, 23 Jun 2007 14:59:30 -0500
Subject: [Swift-devel] Re: [Swft] Q about throttling
In-Reply-To: <467A9B32.4030402@mcs.anl.gov>
References: <467A6610.1000103@mcs.anl.gov> <467A6E50.3060508@mcs.anl.gov>
	<fec1351f0706210809k1035425cjffc18b442ab41dea@mail.gmail.com>
	<467A95F3.6040603@mcs.anl.gov>
	<Pine.LNX.4.64.0706211531180.15250@dildano.hawaga.org.uk>
	<467A9B32.4030402@mcs.anl.gov>
Message-ID: <1182628770.8366.3.camel@blabla.mcs.anl.gov>

On Thu, 2007-06-21 at 10:37 -0500, Ian Foster wrote:
> Well, if there is some concern that throttling might be a problem,
> then trying a run with it turned off seems good.
> 
> I'm gathering from this exchange that this is not possible?

As Tibi guessed, it is done by using sufficiently large numbers for the
throttles. There is no way to automatically turn off all throttles, but
I guess it could be done.
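
As a rough illustration of "sufficiently large numbers", the sketch below
rewrites throttle settings in bulk. It assumes swift.properties is a plain
key=value file whose throttle keys start with "throttle."; only
throttle.submit and the jobThrottle property in scheduler.xml are named in
this thread. The second key in the sample, the helper names, and the value
1000000 are made up for the example.

```python
import re

# Assumption: "large enough" just means far above any realistic job count.
LARGE = 1_000_000


def raise_throttles(properties_text: str, value: int = LARGE) -> str:
    """Set every `throttle.*` key in a key=value properties text to `value`.

    Comment lines and unrelated keys are passed through unchanged.
    """
    out = []
    for line in properties_text.splitlines():
        m = re.match(r"^(\s*throttle\.[\w.]+\s*[=:]\s*)\S.*$", line)
        out.append(f"{m.group(1)}{value}" if m else line)
    return "\n".join(out)


def job_throttle_property(value: int = LARGE) -> str:
    """Build the scheduler.xml property line quoted earlier in the thread."""
    return f'<property name="jobThrottle" value="{value}"/>'


if __name__ == "__main__":
    # `throttle.example.other` is a hypothetical key, used only as sample input.
    sample = "# throttles\nthrottle.submit=4\nthrottle.example.other=8\nsome.other.key=1"
    print(raise_throttles(sample))
    print(job_throttle_property())
```

This only edits text; whether a given value really disables a particular
throttle depends on how Swift interprets it.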

Mihael

> 
> Ben Clifford wrote: 
> > But that isn't the base problem being investigated, right?
> > 
> > On Thu, 21 Jun 2007, Ian Foster wrote:
> > 
> >   
> > > My original question was whether we could turn throttling off altogether. I'm
> > > not sure if that was answered?
> > > 
> > > Tiberiu Stef-Praun wrote:
> > >     
> > > > I did not look very deep into the throttling, mainly because I have to
> > > > wait for my turn at using the Argonne cluster because of the large
> > > > reservations that Ioan does for MolDyn
> > > > 
> > > > Anyway, here is my experience (which Ian asked me to write down, but
> > > > I'm still trying to improve on):
> > > > - whatever one asks from Falkon, one seems to get, with the caveat
> > > > that Falkon might release nodes when configured to look at an idle
> > > > timer. In the case of the Econ workflow, I had 26 long running jobs,
> > > > so I requested 30 nodes (which Falkon got for me)
> > > > - there is a swift config file, $VDS_HOME/libexec/scheduler.xml in which I
> > > > set
> > > > <property name="jobThrottle" value="30"/>, but that seemed not to be
> > > > enough to get all my 26 jobs running at the same time (as illustrated
> > > > by the graphing of the Falkon log that Ioan showed me).
> > > > - there are some other throttling parameters in
> > > > $VDS_HOME/etc/swift.properties (which I also set to 30)
> > > > 
> > > > The general observation is that I needed to modify the scheduler.xml
> > > > config file, and I need to set larger throttle values than the limit
> > > > of workers requested.
> > > > In the current scheme (simply add Falkon to Swift as a provider) the
> > > > Swift scheduler (the weighted site selection algorithm) adversely
> > > > influences the optimal execution of the workflow.
> > > > There might be other parameters to work with, but my opinion is that
> > > > we should use a different (non-throttling) scheduler in combination
> > > > with Falkon
> > > > 
> > > > Tibi
> > > > 
> > > > On 6/21/07, Mike Wilde <wilde at mcs.anl.gov> wrote:
> > > >       
> > > > > I've had the same question - it seems that throttling is also
> > > > > problematic for Tibi in the econ workflow.
> > > > > 
> > > > > Tibi, since you have looked pretty deeply into it, could you write up a
> > > > > description of how the algorithm works and how the parameters affect it?
> > > > > Mihael, when you are back on central time next week, could you work with
> > > > > Tibi on this?  If it's not already, this should be part of the Swift
> > > > > documentation.
> > > > > 
> > > > > Then we should work on getting high-performance settings for the different
> > > > > runtime environments we use, in particular Falkon as Ian asks.
> > > > > 
> > > > > - Mike
> > > > > 
> > > > > 
> > > > > Ian Foster wrote, On 6/21/2007 6:50 AM:
> > > > >         
> > > > > > Hi,
> > > > > > 
> > > > > > I don't fully understand how throttling works in Swift/Karajan. However,
> > > > > > I understand that even when using Falkon, we may be doing some
> > > > > > throttling. Is there a reason to do that in this case, given that Falkon
> > > > > > can maintain large numbers of tasks just fine?
> > > > > > 
> > > > > > I ask this because in a recent MolDyn run, there seemed to be some
> > > > > > uncertainty as to whether throttling was slowing down job dispatch. If
> > > > > > we could turn it off altogether, that question would presumably go away.
> > > > > > 
> > > > > > Ian.
> > > > > > 
> > > > > > 
> > > > > >           
> > > > > -- 
> > > > > Mike Wilde
> > > > > Computation Institute, University of Chicago
> > > > > Math & Computer Science Division
> > > > > Argonne National Laboratory
> > > > > Argonne, IL   60439    USA
> > > > > tel 630-252-7497 fax 630-252-1997
> > > > > 
> > > > >         
> > 
> >   
> 
> -- 
> 
>    Ian Foster, Director, Computation Institute
> Argonne National Laboratory & University of Chicago
> Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439
> Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637
> Tel: +1 630 252 4619.  Web: www.ci.uchicago.edu.
>       Globus Alliance: www.globus.org.


From hategan at mcs.anl.gov  Sat Jun 23 15:06:34 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sat, 23 Jun 2007 15:06:34 -0500
Subject: [Swift-devel] Re: [Swft] Q about throttling
In-Reply-To: <F7C67410-5C79-4964-B8EA-466BE465876B@mcs.anl.gov>
References: <467A6610.1000103@mcs.anl.gov>
	<fec1351f0706210809k1035425cjffc18b442ab41dea@mail.gmail.com>
	<467A95F3.6040603@mcs.anl.gov>
	<Pine.LNX.4.64.0706211531180.15250@dildano.hawaga.org.uk>
	<467A9B32.4030402@mcs.anl.gov>
	<Pine.LNX.4.64.0706211537490.1452@dildano.hawaga.org.uk>
	<fec1351f0706210859v417e6161i32b0569844949784@mail.gmail.com>
	<Pine.LNX.4.64.0706211600140.1452@dildano.hawaga.org.uk>
	<fec1351f0706210904l6d74828bj6d5fe093a2d3d7@mail.gmail.com>
	<Pine.LNX.4.64.0706211608270.1452@dildano.hawaga.org.uk>
	<fec1351f0706210925t2358bf90mec1507118903ef6e@mail.gmail.com>
	<F7C67410-5C79-4964-B8EA-466BE465876B@mcs.anl.gov>
Message-ID: <1182629194.8366.7.camel@blabla.mcs.anl.gov>

On Thu, 2007-06-21 at 11:32 -0500, Veronika Nefedova wrote:
> There are two throttle parameters you might want to check. One is in  
> swift.properties called throttle.submit and one in scheduler.xml  
> called jobThrottle.
> I am curious whats the difference between them?

throttle.submit is documented in many places, including swift.properties
and the user's guide. This should not affect things much since, as far
as I can tell, submissions in Falkon are done pretty fast.

JobThrottle is a site score scaling factor. It limits the initial set of
jobs sent to sites in order to achieve better load balancing in the long
run. This could be a cause of the low number of concurrent jobs. Set it
to a large number if you want to get rid of it.
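
To make the "scaling factor" idea concrete, here is a toy model. It is not
Swift's actual scheduler code. It assumes the cap on concurrent jobs for a
site is simply the site score multiplied by jobThrottle, and that the score
moves up on success and down on failure. The formula, the initial score of
1.0, and the step size are all assumptions made for illustration.

```python
def allowed_jobs(score: float, job_throttle: float) -> int:
    """Hypothetical cap on concurrent jobs for one site: score * factor, at least 1."""
    return max(1, int(score * job_throttle))


def update_score(score: float, succeeded: bool,
                 step: float = 0.1, floor: float = 0.1) -> float:
    """Hypothetical score update: reward a success, penalise a failure."""
    return score + step if succeeded else max(floor, score - step)


if __name__ == "__main__":
    score = 1.0  # assumed pre-set initial score
    for throttle in (2, 30, 1_000_000):
        print("jobThrottle", throttle, "-> cap", allowed_jobs(score, throttle))

    # A run of failures lowers the score, and with it the cap.
    for _ in range(5):
        score = update_score(score, succeeded=False)
    print("after 5 failures, cap at jobThrottle=30:", allowed_jobs(score, 30))
```

In this model a very large factor makes the cap irrelevant even for a single
site, which matches the advice to set it to a large number.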

Mihael

> 
> Nika
> 
> On Jun 21, 2007, at 11:25 AM, Tiberiu Stef-Praun wrote:
> 
> > No
> > I'm saying that swift throttle value will allow me to make full use of
> > all the nodes that Falkon makes available for me. I know that I had 26
> > jobs to be run, and I requested (and had) 30 nodes in the cluster.
> > Somehow only 24 jobs ran the first time, so I'm going to push up
> > the throttle value in Swift
> >
> > Tibi
> >
> >
> > On 6/21/07, Ben Clifford <benc at hawaga.org.uk> wrote:
> >>
> >>
> >>
> >> On Thu, 21 Jun 2007, Tiberiu Stef-Praun wrote:
> >>
> >> > No
> >> > My chart shows that if I had two more machines during the first  
> >> stage
> >> > run (the first 26 jobs), I would have avoided a long wait (50000  
> >> ms ,
> >> > or about 9 minutes) for the last two jobs from the first batch to
> >> > finish.
> >> > This is why I need to redo the Econ run, with a different throttle
> >> > value for Swift.
> >>
> >> So you are saying that changing the 'throttle value for swift' will
> >> allocate more machines for you?
> >>
> >> --
> >>
> >
> >
> > -- 
> > Tiberiu (Tibi) Stef-Praun, PhD
> > Research Staff, Computation Institute
> > 5640 S. Ellis Ave, #405
> > University of Chicago
> > http://www-unix.mcs.anl.gov/~tiberius/
> >
> 


From foster at mcs.anl.gov  Sat Jun 23 15:29:54 2007
From: foster at mcs.anl.gov (Ian Foster)
Date: Sat, 23 Jun 2007 15:29:54 -0500
Subject: [Swift-devel] Re: [Swft] Q about throttling
In-Reply-To: <1182628770.8366.3.camel@blabla.mcs.anl.gov>
References: <467A6610.1000103@mcs.anl.gov> <467A6E50.3060508@mcs.anl.gov>	
	<fec1351f0706210809k1035425cjffc18b442ab41dea@mail.gmail.com>	
	<467A95F3.6040603@mcs.anl.gov>	
	<Pine.LNX.4.64.0706211531180.15250@dildano.hawaga.org.uk>	
	<467A9B32.4030402@mcs.anl.gov>
	<1182628770.8366.3.camel@blabla.mcs.anl.gov>
Message-ID: <467D82C2.8020906@mcs.anl.gov>

thanks... Ian.

Mihael Hategan wrote:
> On Thu, 2007-06-21 at 10:37 -0500, Ian Foster wrote:
>   
>> Well, if there is some concern that throttling might be a problem,
>> then trying a run with it turned off seems good.
>>
>> I'm gathering from this exchange that this is not possible?
>>     
>
> As Tibi guessed, it is done by using sufficiently large numbers for the
> throttles. There is no way to automatically turn off all throttles, but
> I guess it could be done.
>
> Mihael
>
>   
>> Ben Clifford wrote: 
>>     
>>> But that isn't the base problem being investigated, right?
>>>
>>> On Thu, 21 Jun 2007, Ian Foster wrote:
>>>
>>>   
>>>       
>>>> My original question was whether we could turn throttling off altogether. I'm
>>>> not sure if that was answered?
>>>>
>>>> Tiberiu Stef-Praun wrote:
>>>>     
>>>>         
>>>>> I did not look very deep into the throttling, mainly because I have to
>>>>> wait for my turn at using the Argonne cluster because of the large
>>>>> reservations that Ioan does for MolDyn
>>>>>
>>>>> Anyway, here is my experience (which Ian asked me to write down, but
>>>>> I'm still trying to improve on):
>>>>> - whatever one asks from Falkon, one seems to get, with the caveat
>>>>> that Falkon might release nodes when configured to look at an idle
>>>>> timer. In the case of the Econ workflow, I had 26 long running jobs,
>>>>> so I requested 30 nodes (which Falkon got for me)
>>>>> - there is a swift config file, $VDS_HOME/libexec/scheduler.xml in which I
>>>>> set
>>>>> <property name="jobThrottle" value="30"/>, but that seemed not to be
>>>>> enough to get all my 26 jobs running at the same time (as illustrated
>>>>> by the graphing of the Falkon log that Ioan showed me).
>>>>> - there are some other throttling parameters in
>>>>> $VDS_HOME/etc/swift.properties (which I also set to 30)
>>>>>
>>>>> The general observation is that I needed to modify the scheduler.xml
>>>>> config file, and I need to set larger throttle values than the limit
>>>>> of workers requested.
>>>>> In the current scheme (simply add Falkon to Swift as a provider) the
>>>>> Swift scheduler (the weighted site selection algorithm) adversely
>>>>> influences the optimal execution of the workflow.
>>>>> There might be other parameters to work with, but my opinion is that
>>>>> we should use a different (non-throttling) scheduler in combination
>>>>> with Falkon
>>>>>
>>>>> Tibi
>>>>>
>>>>> On 6/21/07, Mike Wilde <wilde at mcs.anl.gov> wrote:
>>>>>       
>>>>>           
>>>>>> I've had the same question - it seems that throttling is also
>>>>>> problematic for Tibi in the econ workflow.
>>>>>>
>>>>>> Tibi, since you have looked pretty deeply into it, could you write up a
>>>>>> description of how the algorithm works and how the parameters affect it?
>>>>>> Mihael, when you are back on central time next week, could you work with
>>>>>> Tibi on this?  If it's not already, this should be part of the Swift
>>>>>> documentation.
>>>>>>
>>>>>> Then we should work on getting high-performance settings for the different
>>>>>> runtime environments we use, in particular Falkon as Ian asks.
>>>>>>
>>>>>> - Mike
>>>>>>
>>>>>>
>>>>>> Ian Foster wrote, On 6/21/2007 6:50 AM:
>>>>>>         
>>>>>>             
>>>>>>> Hi,
>>>>>>>
>>>>>>> I don't fully understand how throttling works in Swift/Karajan. However,
>>>>>>> I understand that even when using Falkon, we may be doing some
>>>>>>> throttling. Is there a reason to do that in this case, given that Falkon
>>>>>>> can maintain large numbers of tasks just fine?
>>>>>>>
>>>>>>> I ask this because in a recent MolDyn run, there seemed to be some
>>>>>>> uncertainty as to whether throttling was slowing down job dispatch. If
>>>>>>> we could turn it off altogether, that question would presumably go away.
>>>>>>>
>>>>>>> Ian.
>>>>>>>
>>>>>>>
>>>>>>>           
>>>>>>>               
>>>>>> -- 
>>>>>> Mike Wilde
>>>>>> Computation Institute, University of Chicago
>>>>>> Math & Computer Science Division
>>>>>> Argonne National Laboratory
>>>>>> Argonne, IL   60439    USA
>>>>>> tel 630-252-7497 fax 630-252-1997
>>>>>>
>>>>>>         
>>>>>>             
>>>   
>>>       
>> -- 
>>
>>    Ian Foster, Director, Computation Institute
>> Argonne National Laboratory & University of Chicago
>> Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439
>> Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637
>> Tel: +1 630 252 4619.  Web: www.ci.uchicago.edu.
>>       Globus Alliance: www.globus.org.
>>     
>
>   

-- 

   Ian Foster, Director, Computation Institute
Argonne National Laboratory & University of Chicago
Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439
Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637
Tel: +1 630 252 4619.  Web: www.ci.uchicago.edu.
      Globus Alliance: www.globus.org.


From hategan at mcs.anl.gov  Sat Jun 23 15:42:52 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sat, 23 Jun 2007 15:42:52 -0500
Subject: [Swift-devel] Re: [Swft] Q about throttling
In-Reply-To: <467BD768.3020507@cs.uchicago.edu>
References: <467A6610.1000103@mcs.anl.gov> <467A6E50.3060508@mcs.anl.gov>
	<fec1351f0706210809k1035425cjffc18b442ab41dea@mail.gmail.com>
	<467A95F3.6040603@mcs.anl.gov>
	<Pine.LNX.4.64.0706211531180.15250@dildano.hawaga.org.uk>
	<467A9B32.4030402@mcs.anl.gov>
	<Pine.LNX.4.64.0706211537490.1452@dildano.hawaga.org.uk>
	<467A9DAD.7060909@mcs.anl.gov>
	<Pine.LNX.4.64.0706211549270.1452@dildano.hawaga.org.uk>
	<467AA004.4080601@mcs.anl.gov>
	<Pine.LNX.4.64.0706211604050.1452@dildano.hawaga.org.uk>
	<467B39C6.8050104@cs.uchicago.edu>
	<Pine.LNX.4.64.0706220805320.1452@dildano.hawaga.org.uk>
	<467BD768.3020507@cs.uchicago.edu>
Message-ID: <1182631372.8366.9.camel@blabla.mcs.anl.gov>

On Fri, 2007-06-22 at 09:06 -0500, Ioan Raicu wrote:
> No, I didn't keep track of this info, unless Swift does this through
> some of its logs.  
> 
> Over the last week, my observations have been the following: Swift is
> more than capable and willing to send out many tasks as long as they
> are independent (as can be seen in this graph where probably 6800
> tasks got submitted), but thereafter, it had no other burst of task
> submission, although I believe it could have sent out more.  For
> example, there were 2500+ tasks that failed in the middle of those
> 6800 tasks (which were all independent). Why were those 2500 tasks not
> resubmitted all at once? They were each about 200 seconds long, so
> most of them should certainly have shown up in the wait queue.

That's probably interpreter lag. It needs to do some work before
resubmitting all those jobs.

> 
> Ioan
> 
> Ben Clifford wrote: 
> > > kept busy, and the Falkon queue length was relatively at 0... so this means
> > > that Swift was not submitting fast enough to keep all the executors busy.
> > >     
> > 
> > interesting. though around t=1000 there is a rapid burst of submission 
> > getting the queue length up to about 6000 in a few minutes.
> > 
> > Do you know what the cpu time usage of the swift submitting JVM was over 
> > that time period?
> > 
> >   
> 
> -- 
> ============================================
> Ioan Raicu
> Ph.D. Student
> ============================================
> Distributed Systems Laboratory
> Computer Science Department
> University of Chicago
> 1100 E. 58th Street, Ryerson Hall
> Chicago, IL 60637
> ============================================
> Email: iraicu at cs.uchicago.edu
> Web:   http://www.cs.uchicago.edu/~iraicu
>        http://dsl.cs.uchicago.edu/
> ============================================
> ============================================


From hategan at mcs.anl.gov  Sat Jun 23 15:43:56 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sat, 23 Jun 2007 15:43:56 -0500
Subject: [Swift-devel] Re: [Swft] Q about throttling
In-Reply-To: <467BDF36.7080909@mcs.anl.gov>
References: <467A6610.1000103@mcs.anl.gov> <467A6E50.3060508@mcs.anl.gov>
	<fec1351f0706210809k1035425cjffc18b442ab41dea@mail.gmail.com>
	<467A95F3.6040603@mcs.anl.gov>
	<Pine.LNX.4.64.0706211531180.15250@dildano.hawaga.org.uk>
	<467A9B32.4030402@mcs.anl.gov>
	<Pine.LNX.4.64.0706211537490.1452@dildano.hawaga.org.uk>
	<467A9DAD.7060909@mcs.anl.gov>
	<Pine.LNX.4.64.0706211549270.1452@dildano.hawaga.org.uk>
	<467AA004.4080601@mcs.anl.gov>
	<Pine.LNX.4.64.0706211604050.1452@dildano.hawaga.org.uk>
	<467B39C6.8050104@cs.uchicago.edu>
	<Pine.LNX.4.64.0706220805320.1452@dildano.hawaga.org.uk>
	<467BD768.3020507@cs.uchicago.edu>  <467BDF36.7080909@mcs.anl.gov>
Message-ID: <1182631436.8366.11.camel@blabla.mcs.anl.gov>

On Fri, 2007-06-22 at 09:39 -0500, Mike Wilde wrote:
> Is there a configurable retry delay after failure?
> 
> I think you need to examine the overall workflow dependency structure.
> 
> Also, I recall from older perf charts that there's an option to enable/disable 
> pipelining.  With pipelining disabled, it seems that Swift will wait for an 
> entire dataset/foreach or procedure to finish before starting any tasks that 
> depend on the foreach or procedure.

I don't think that was in swift.

> 
> Mihael, can you look at some of these issues when you are back online and rested?

You say that as if I normally don't :)

> 
> - Mike
> 
> Ioan Raicu wrote, On 6/22/2007 9:06 AM:
> > No, I didn't keep track of this info, unless Swift does this through 
> > some of its logs. 
> > 
> > Over the last week, my observations have been the following: Swift is 
> > more than capable and willing to send out many tasks as long as they are 
> > independent (as can be seen in this graph where probably 6800 tasks got 
> > submitted), but thereafter, it had no other burst of task submission, 
> > although I believe it could have sent out more.  For example, there were 
> > 2500+ tasks that failed in the middle of those 6800 tasks (which were 
> > all independent). Why were those 2500 tasks not resubmitted all at once? 
> > They were each about 200 seconds long, so most of them should 
> > certainly have shown up in the wait queue.
> > 
> > Ioan
> > 
> > Ben Clifford wrote:
> >>> kept busy, and the Falkon queue length was relatively at 0... so this means
> >>> that Swift was not submitting fast enough to keep all the executors busy.
> >>>     
> >>
> >> interesting. though around t=1000 there is a rapid burst of submission 
> >> getting the queue length up to about 6000 in a few minutes.
> >>
> >> Do you know what the cpu time usage of the swift submitting JVM was over 
> >> that time period?
> >>
> >>   
> > 
> > -- 
> > ============================================
> > Ioan Raicu
> > Ph.D. Student
> > ============================================
> > Distributed Systems Laboratory
> > Computer Science Department
> > University of Chicago
> > 1100 E. 58th Street, Ryerson Hall
> > Chicago, IL 60637
> > ============================================
> > Email: iraicu at cs.uchicago.edu
> > Web:   http://www.cs.uchicago.edu/~iraicu
> >        http://dsl.cs.uchicago.edu/
> > ============================================
> > ============================================
> > 
> 


From hategan at mcs.anl.gov  Sat Jun 23 15:47:01 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sat, 23 Jun 2007 15:47:01 -0500
Subject: [Swift-devel] Re: [Swft] Q about throttling
In-Reply-To: <467C30CD.8010703@mcs.anl.gov>
References: <467A6610.1000103@mcs.anl.gov> <467A6E50.3060508@mcs.anl.gov>
	<fec1351f0706210809k1035425cjffc18b442ab41dea@mail.gmail.com>
	<467A95F3.6040603@mcs.anl.gov>
	<Pine.LNX.4.64.0706211531180.15250@dildano.hawaga.org.uk>
	<467A9B32.4030402@mcs.anl.gov>
	<Pine.LNX.4.64.0706211537490.1452@dildano.hawaga.org.uk>
	<467A9DAD.7060909@mcs.anl.gov>
	<Pine.LNX.4.64.0706211549270.1452@dildano.hawaga.org.uk>
	<467AA004.4080601@mcs.anl.gov>
	<Pine.LNX.4.64.0706211604050.1452@dildano.hawaga.org.uk>
	<467B39C6.8050104@cs.uchicago.edu>
	<Pine.LNX.4.64.0706220805320.1452@dildano.hawaga.org.uk>
	<467BD768.3020507@cs.uchicago.edu> <467BDF36.7080909@mcs.anl.gov>
	<Pine.LNX.4.58.0706220943350.15672@classes.cs.uchicago.edu>
	<467C30CD.8010703@mcs.anl.gov>
Message-ID: <1182631621.8366.14.camel@blabla.mcs.anl.gov>

On Fri, 2007-06-22 at 15:27 -0500, Mike Wilde wrote:
> [forgot to hit send on this - my apologies if it's no longer relevant]
> 
> OK, thanks, Yong.
> 
> Regarding the retry delay, I phrased the question poorly. I meant:
> 
> Is it possible that the 2500 failing jobs are being retried too slowly? I.e., that 
> Karajan delays each re-run after a failure, and thus can't keep Falkon fed with 
> retried jobs at a high rate?

It does not explicitly delay anything. But 2500*[many things to do]
becomes visible.
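
A back-of-the-envelope version of that multiplication, using made-up
per-job costs (the thread gives no measured figure for the interpreter work
per resubmission, so the millisecond values below are assumptions):

```python
def resubmit_seconds(n_jobs: int, per_job_ms: float) -> float:
    """Total serial time to re-process n_jobs at a fixed cost per job."""
    return n_jobs * per_job_ms / 1000.0


if __name__ == "__main__":
    # 2500 is the failed-job count from the thread; the costs are hypothetical.
    for per_job_ms in (10, 50, 200):
        total = resubmit_seconds(2500, per_job_ms)
        print(f"{per_job_ms} ms per job -> {total} s for 2500 jobs")
```

Even without any explicit delay, a modest fixed cost per job adds up to
minutes once it is paid 2500 times in sequence.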

> 
> - Mike
> 
> 
> Yong Zhao wrote, On 6/22/2007 9:45 AM:
> > The retry mechanism is currently in some karajan script, and we can easily
> > add some delay there.
> > 
> > There is not a configuration option to disable pipeline. I did that
> > manually (modified some code segment) to get a perf chart.
> > 
> > Yong.
> > 
> > On Fri, 22 Jun 2007, Mike Wilde wrote:
> > 
> >> Is there a configurable retry delay after failure?
> >>
> >> I think you need to examine the overall workflow dependency structure.
> >>
> >> Also, I recall from older perf charts that there's an option to enable/disable
> >> pipelining.  With pipelining disabled, it seems that Swift will wait for an
> >> entire dataset/foreach or procedure to finish before starting any tasks that
> >> depend on the foreach or procedure.
> >>
> >> Mihael, can you look at some of these issues when you are back online and rested?
> >>
> >> - Mike
> >>
> >> Ioan Raicu wrote, On 6/22/2007 9:06 AM:
> >>> No, I didn't keep track of this info, unless Swift does this through
> >>> some of its logs.
> >>>
> >>> Over the last week, my observations have been the following: Swift is
> >>> more than capable and willing to send out many tasks as long as they are
> >>> independent (as can be seen in this graph where probably 6800 tasks got
> >>> submitted), but thereafter, it had no other burst of task submission,
> >>> although I believe it could have sent out more.  For example, there were
> >>> 2500+ tasks that failed in the middle of those 6800 tasks (which were
> >>> all independent). Why were those 2500 tasks not resubmitted all at once?
> >>> They were each about 200 seconds long, so most of them should
> >>> certainly have shown up in the wait queue.
> >>>
> >>> Ioan
> >>>
> >>> Ben Clifford wrote:
> >>>>> kept busy, and the Falkon queue length was relatively at 0... so this means
> >>>>> that Swift was not submitting fast enough to keep all the executors busy.
> >>>>>
> >>>> interesting. though around t=1000 there is a rapid burst of submission
> >>>> getting the queue length up to about 6000 in a few minutes.
> >>>>
> >>>> Do you know what the cpu time usage of the swift submitting JVM was over
> >>>> that time period?
> >>>>
> >>>>
> >>> --
> >>> ============================================
> >>> Ioan Raicu
> >>> Ph.D. Student
> >>> ============================================
> >>> Distributed Systems Laboratory
> >>> Computer Science Department
> >>> University of Chicago
> >>> 1100 E. 58th Street, Ryerson Hall
> >>> Chicago, IL 60637
> >>> ============================================
> >>> Email: iraicu at cs.uchicago.edu
> >>> Web:   http://www.cs.uchicago.edu/~iraicu
> >>>        http://dsl.cs.uchicago.edu/
> >>> ============================================
> >>> ============================================
> >>>
> >> --
> >> Mike Wilde
> >> Computation Institute, University of Chicago
> >> Math & Computer Science Division
> >> Argonne National Laboratory
> >> Argonne, IL   60439    USA
> >> tel 630-252-7497 fax 630-252-1997
> >>
> > 
> > 
> 


From hategan at mcs.anl.gov  Sat Jun 23 15:48:45 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sat, 23 Jun 2007 15:48:45 -0500
Subject: [Swift-devel] Re: [Swft] Q about throttling
In-Reply-To: <Pine.LNX.4.58.0706221530160.7834@classes.cs.uchicago.edu>
References: <467A6610.1000103@mcs.anl.gov> <467A6E50.3060508@mcs.anl.gov>
	<fec1351f0706210809k1035425cjffc18b442ab41dea@mail.gmail.com>
	<467A95F3.6040603@mcs.anl.gov>
	<Pine.LNX.4.64.0706211531180.15250@dildano.hawaga.org.uk>
	<467A9B32.4030402@mcs.anl.gov>
	<Pine.LNX.4.64.0706211537490.1452@dildano.hawaga.org.uk>
	<467A9DAD.7060909@mcs.anl.gov>
	<Pine.LNX.4.64.0706211549270.1452@dildano.hawaga.org.uk>
	<467AA004.4080601@mcs.anl.gov>
	<Pine.LNX.4.64.0706211604050.1452@dildano.hawaga.org.uk>
	<467B39C6.8050104@cs.uchicago.edu>
	<Pine.LNX.4.64.0706220805320.1452@dildano.hawaga.org.uk>
	<467BD768.3020507@cs.uchicago.edu> <467BDF36.7080909@mcs.anl.gov>
	<Pine.LNX.4.58.0706220943350.15672@classes.cs.uchicago.edu>
	<467C30CD.8010703@mcs.anl.gov>
	<Pine.LNX.4.58.0706221530160.7834@classes.cs.uchicago.edu>
Message-ID: <1182631725.8366.17.camel@blabla.mcs.anl.gov>

On Fri, 2007-06-22 at 15:32 -0500, Yong Zhao wrote:
> There is no delay for submitting retry jobs. However, these retry jobs may
> be queued after the 'ready' jobs that swift already processed, which could
> be held by swift, if there is job throttling.

Indeed. 2500 failed jobs may bring the score for the site down a bit.
But then it doesn't look like there was much throttling, since 6800
tasks were submitted in bulk.
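
The queueing effect Yong describes can be pictured with a toy FIFO. This is
an illustration only: it assumes a single first-in-first-out queue and a
fixed number of free slots, neither of which is confirmed in this thread as
how Swift or Karajan actually order jobs.

```python
from collections import deque


def dispatch_order(ready, retries, slots):
    """Return (jobs dispatched now, jobs still held) for a FIFO with `slots` free.

    Retries are appended behind the jobs that were already ready.
    """
    queue = deque(ready)
    queue.extend(retries)
    first_wave = [queue.popleft() for _ in range(min(slots, len(queue)))]
    return first_wave, list(queue)


if __name__ == "__main__":
    ready = ["ready-1", "ready-2", "ready-3"]
    retries = ["retry-1", "retry-2"]
    # Tight throttle: retries stay held behind the remaining ready job.
    print(dispatch_order(ready, retries, slots=2))
    # Effectively no throttle: everything goes out at once.
    print(dispatch_order(ready, retries, slots=1_000_000))
```

With few slots the retries wait even though nothing delays them explicitly;
with a very large slot count the ordering stops mattering.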

> 
> Yong.
> 
> On Fri, 22 Jun 2007, Mike Wilde wrote:
> 
> > [forgot to hit send on this - my apologies if it's no longer relevant]
> >
> > OK, thanks, Yong.
> >
> > Regarding the retry delay, I phrased the question poorly. I meant:
> >
> > Is it possible that the 2500 failing jobs are being retried too slowly? I.e., that
> > Karajan delays each re-run after a failure, and thus can't keep Falkon fed with
> > retried jobs at a high rate?
> >
> > - Mike
> >
> >
> > Yong Zhao wrote, On 6/22/2007 9:45 AM:
> > > The retry mechanism is currently in some karajan script, and we can easily
> > > add some delay there.
> > >
> > > There is not a configuration option to disable pipeline. I did that
> > > manually (modified some code segment) to get a perf chart.
> > >
> > > Yong.
> > >
> > > On Fri, 22 Jun 2007, Mike Wilde wrote:
> > >
> > >> Is there a configurable retry delay after failure?
> > >>
> > >> I think you need to examine the overall workflow dependency structure.
> > >>
> > >> Also, I recall from older perf charts that there's an option to enable/disable
> > >> pipelining.  With pipelining disabled, it seems that Swift will wait for an
> > >> entire dataset/foreach or procedure to finish before starting any tasks that
> > >> depend on the foreach or procedure.
> > >>
> > >> Mihael, can you look at some of these issues when you are back online and rested?
> > >>
> > >> - Mike
> > >>
> > >> Ioan Raicu wrote, On 6/22/2007 9:06 AM:
> > >>> No, I didn't keep track of this info, unless Swift does this through
> > >>> some of its logs.
> > >>>
> > >>> Over the last week, my observations have been the following: Swift is
> > >>> more than capable and willing to send out many tasks as long as they are
> > >>> independent (as can be seen in this graph where probably 6800 tasks got
> > >>> submitted), but thereafter, it had no other burst of task submission,
> > >>> although I believe it could have sent out more.  For example, there were
> > >>> 2500+ tasks that failed in the middle of those 6800 tasks (which were
> > >>> all independent). Why were those 2500 tasks not resubmitted all at once?
> > >>> They were each about 200 seconds long, so most of them should
> > >>> certainly have shown up in the wait queue.
> > >>>
> > >>> Ioan
> > >>>
> > >>> Ben Clifford wrote:
> > >>>>> kept busy, and the Falkon queue length was relatively at 0... so this means
> > >>>>> that Swift was not submitting fast enough to keep all the executors busy.
> > >>>>>
> > >>>> interesting. though around t=1000 there is a rapid burst of submission
> > >>>> getting the queue length up to about 6000 in a few minutes.
> > >>>>
> > >>>> Do you know what the cpu time usage of the swift submitting JVM was over
> > >>>> that time period?
> > >>>>
> > >>>>
> > >>> --
> > >>> ============================================
> > >>> Ioan Raicu
> > >>> Ph.D. Student
> > >>> ============================================
> > >>> Distributed Systems Laboratory
> > >>> Computer Science Department
> > >>> University of Chicago
> > >>> 1100 E. 58th Street, Ryerson Hall
> > >>> Chicago, IL 60637
> > >>> ============================================
> > >>> Email: iraicu at cs.uchicago.edu
> > >>> Web:   http://www.cs.uchicago.edu/~iraicu
> > >>>        http://dsl.cs.uchicago.edu/
> > >>> ============================================
> > >>> ============================================
> > >>>
> > >> --
> > >> Mike Wilde
> > >> Computation Institute, University of Chicago
> > >> Math & Computer Science Division
> > >> Argonne National Laboratory
> > >> Argonne, IL   60439    USA
> > >> tel 630-252-7497 fax 630-252-1997
> > >>
> > >
> > >
> >
> > --
> > Mike Wilde
> > Computation Institute, University of Chicago
> > Math & Computer Science Division
> > Argonne National Laboratory
> > Argonne, IL   60439    USA
> > tel 630-252-7497 fax 630-252-1997
> >
> 


From benc at hawaga.org.uk  Sat Jun 23 20:34:22 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sun, 24 Jun 2007 01:34:22 +0000 (GMT)
Subject: [Swift-devel] Re: [Swft] Q about throttling
In-Reply-To: <1182629194.8366.7.camel@blabla.mcs.anl.gov>
References: <467A6610.1000103@mcs.anl.gov>
	<fec1351f0706210809k1035425cjffc18b442ab41dea@mail.gmail.com>
	<467A95F3.6040603@mcs.anl.gov>
	<Pine.LNX.4.64.0706211531180.15250@dildano.hawaga.org.uk>
	<467A9B32.4030402@mcs.anl.gov>
	<Pine.LNX.4.64.0706211537490.1452@dildano.hawaga.org.uk>
	<fec1351f0706210859v417e6161i32b0569844949784@mail.gmail.com>
	<Pine.LNX.4.64.0706211600140.1452@dildano.hawaga.org.uk>
	<fec1351f0706210904l6d74828bj6d5fe093a2d3d7@mail.gmail.com>
	<Pine.LNX.4.64.0706211608270.1452@dildano.hawaga.org.uk>
	<fec1351f0706210925t2358bf90mec1507118903ef6e@mail.gmail.com>
	<F7C67410-5C79-4964-B8EA-466BE465876B@mcs.anl.gov>
	<1182629194.8366.7.camel@blabla.mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0706240133100.15250@dildano.hawaga.org.uk>


On Sat, 23 Jun 2007, Mihael Hategan wrote:

> JobThrottle is a site score scaling factor. It limits the initial set of
> jobs sent to sites in order to achieve better load balancing in the long
> run. This could be a cause for low number of concurrent jobs. Set it to
> large numbers if you want to get rid of it.

the scaling factor gives a restriction in job load that is not dependent 
on the presence of other sites?

the use of the word 'factor' has a subtle implication that it is a 
relative load and so in the case of a single site will have no effect

-- 


From hategan at mcs.anl.gov  Sat Jun 23 20:35:34 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sat, 23 Jun 2007 20:35:34 -0500
Subject: [Swift-devel] Re: [Swft] Q about throttling
In-Reply-To: <Pine.LNX.4.64.0706240133100.15250@dildano.hawaga.org.uk>
References: <467A6610.1000103@mcs.anl.gov>
	<fec1351f0706210809k1035425cjffc18b442ab41dea@mail.gmail.com>
	<467A95F3.6040603@mcs.anl.gov>
	<Pine.LNX.4.64.0706211531180.15250@dildano.hawaga.org.uk>
	<467A9B32.4030402@mcs.anl.gov>
	<Pine.LNX.4.64.0706211537490.1452@dildano.hawaga.org.uk>
	<fec1351f0706210859v417e6161i32b0569844949784@mail.gmail.com>
	<Pine.LNX.4.64.0706211600140.1452@dildano.hawaga.org.uk>
	<fec1351f0706210904l6d74828bj6d5fe093a2d3d7@mail.gmail.com>
	<Pine.LNX.4.64.0706211608270.1452@dildano.hawaga.org.uk>
	<fec1351f0706210925t2358bf90mec1507118903ef6e@mail.gmail.com>
	<F7C67410-5C79-4964-B8EA-466BE465876B@mcs.anl.gov>
	<1182629194.8366.7.camel@blabla.mcs.anl.gov>
	<Pine.LNX.4.64.0706240133100.15250@dildano.hawaga.org.uk>
Message-ID: <1182648934.20734.2.camel@blabla.mcs.anl.gov>

On Sun, 2007-06-24 at 01:34 +0000, Ben Clifford wrote:
> 
> On Sat, 23 Jun 2007, Mihael Hategan wrote:
> 
> > JobThrottle is a site score scaling factor. It limits the initial set of
> > jobs sent to sites in order to achieve better load balancing in the long
> > run. This could be a cause of a low number of concurrent jobs. Set it to
> > large numbers if you want to get rid of it.
> 
> the scaling factor gives a restriction in job load that is not dependent 
> on the presence of other sites?

Factor with respect to the score of a site, which has a pre-set value in
the beginning.

> 
> the use of the word 'factor' has a subtle implication that it is a 
> relative load and so in the case of a single site will have no effect
> 

It was discussed before that the algorithm could (and imo should) be
changed such that in the case of only one site, it would not throttle,
or at least the throttle would be significantly bigger.
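
The behaviour discussed above can be illustrated with a small sketch. This is not the Swift scheduler's actual formula: the function name, the multiplication of throttle by score, and the floor of one job are all assumptions made for illustration. It only shows why a "factor" applied to a pre-set site score still caps concurrency when there is a single site, and what the proposed single-site exemption would look like.

```python
import math


def initial_job_limit(site_score, job_throttle, single_site=False):
    """Hypothetical model of a score-scaled throttle (not Swift's real code).

    site_score   -- assumed pre-set score a site starts with
    job_throttle -- the scaling factor applied to that score
    single_site  -- if True, apply the proposed "do not throttle" rule
    """
    if single_site:
        # Proposed change from the thread: no throttling with only one site.
        return math.inf
    # Assumed relationship: the limit grows linearly with throttle * score,
    # and at least one job is always allowed.
    return max(1, math.floor(job_throttle * site_score))
```

Under this model a large throttle value effectively removes the limit, which matches the advice to "set it to large numbers if you want to get rid of it".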


From benc at hawaga.org.uk  Sat Jun 23 20:42:18 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sun, 24 Jun 2007 01:42:18 +0000 (GMT)
Subject: [Swift-devel] Re: [Swft] Q about throttling
In-Reply-To: <1182648934.20734.2.camel@blabla.mcs.anl.gov>
References: <467A6610.1000103@mcs.anl.gov>
	<fec1351f0706210809k1035425cjffc18b442ab41dea@mail.gmail.com>
	<467A95F3.6040603@mcs.anl.gov>
	<Pine.LNX.4.64.0706211531180.15250@dildano.hawaga.org.uk>
	<467A9B32.4030402@mcs.anl.gov>
	<Pine.LNX.4.64.0706211537490.1452@dildano.hawaga.org.uk>
	<fec1351f0706210859v417e6161i32b0569844949784@mail.gmail.com> 
	<Pine.LNX.4.64.0706211600140.1452@dildano.hawaga.org.uk> 
	<fec1351f0706210904l6d74828bj6d5fe093a2d3d7@mail.gmail.com> 
	<Pine.LNX.4.64.0706211608270.1452@dildano.hawaga.org.uk> 
	<fec1351f0706210925t2358bf90mec1507118903ef6e@mail.gmail.com> 
	<F7C67410-5C79-4964-B8EA-466BE465876B@mcs.anl.gov>
	<1182629194.8366.7.camel@blabla.mcs.anl.gov>
	<Pine.LNX.4.64.0706240133100.15250@dildano.hawaga.org.uk>
	<1182648934.20734.2.camel@blabla.mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0706240141510.1452@dildano.hawaga.org.uk>


On Sat, 23 Jun 2007, Mihael Hategan wrote:

> > the use of the word 'factor' has a subtle implication that it is a 
> > relative load and so in the case of a single site will have no effect

> It was discussed before that the algorithm could (and imo should) be
> changed such that in the case of only one site, it would not throttle,
> or at least the throttle would be significantly bigger.

that or the name should change.
-- 


From hategan at mcs.anl.gov  Sat Jun 23 20:45:53 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sat, 23 Jun 2007 20:45:53 -0500
Subject: [Swift-devel] Re: [Swft] Q about throttling
In-Reply-To: <Pine.LNX.4.64.0706240141510.1452@dildano.hawaga.org.uk>
References: <467A6610.1000103@mcs.anl.gov>
	<fec1351f0706210809k1035425cjffc18b442ab41dea@mail.gmail.com>
	<467A95F3.6040603@mcs.anl.gov>
	<Pine.LNX.4.64.0706211531180.15250@dildano.hawaga.org.uk>
	<467A9B32.4030402@mcs.anl.gov>
	<Pine.LNX.4.64.0706211537490.1452@dildano.hawaga.org.uk>
	<fec1351f0706210859v417e6161i32b0569844949784@mail.gmail.com>
	<Pine.LNX.4.64.0706211600140.1452@dildano.hawaga.org.uk>
	<fec1351f0706210904l6d74828bj6d5fe093a2d3d7@mail.gmail.com>
	<Pine.LNX.4.64.0706211608270.1452@dildano.hawaga.org.uk>
	<fec1351f0706210925t2358bf90mec1507118903ef6e@mail.gmail.com>
	<F7C67410-5C79-4964-B8EA-466BE465876B@mcs.anl.gov>
	<1182629194.8366.7.camel@blabla.mcs.anl.gov>
	<Pine.LNX.4.64.0706240133100.15250@dildano.hawaga.org.uk>
	<1182648934.20734.2.camel@blabla.mcs.anl.gov>
	<Pine.LNX.4.64.0706240141510.1452@dildano.hawaga.org.uk>
Message-ID: <1182649553.22337.0.camel@blabla.mcs.anl.gov>

On Sun, 2007-06-24 at 01:42 +0000, Ben Clifford wrote:
> 
> On Sat, 23 Jun 2007, Mihael Hategan wrote:
> 
> > > the use of the word 'factor' has a subtle implication that it is a 
> > > relative load and so in the case of a single site will have no effect
> 
> > It was discussed before that the algorithm could (and imo should) be
> > changed such that in the case of only one site, it would not throttle,
> > or at least the throttle would be significantly bigger.
> 
> that or the name should change.

It's called "jobThrottle" and that's a bad name, too. Suggestions?


From foster at mcs.anl.gov  Sun Jun 24 15:52:14 2007
From: foster at mcs.anl.gov (Ian Foster)
Date: Sun, 24 Jun 2007 15:52:14 -0500
Subject: [Swift-devel] [Fwd: [Swft] Swift Performance Data]
Message-ID: <467ED97E.3060400@mcs.anl.gov>

Hi,

I'd like to ask: can we agree to put all other work on hold until we 
have got in place tools to collect traces from every run, archive them, 
and process them--as described below?

We've been talking about having these tools for months now, and each 
time I ask, I am told that they are "sort of there." But we also keep 
finding that people are creating custom plots, losing data, etc. If we 
stop all other work until we have these tools, then we will have them, 
and other problems will likely get easier to resolve.

Ian.

-------- Original Message --------
Subject: 	[Swft] Swift Performance Data
Date: 	Thu, 21 Jun 2007 08:33:11 -0500
From: 	Ian Foster <foster at mcs.anl.gov>
To: 	Mike Wilde <wilde at mcs.anl.gov>
CC: 	swft <swft at ci.uchicago.edu>, swift-devel at ci.uchicago.edu
References: 	<4679FBC8.1080606 at mcs.anl.gov> 
<467A6D78.4020702 at mcs.anl.gov> <467A7AC6.7020400 at mcs.anl.gov>


Or maybe that is clear. I'd suggest that we want a tool that, after a 
run, one of us can run to:

* Generate the three plots that Ioan has created
* Generate a file containing as much information as we can about the run 
and its parameters--maybe a name=value format?--and some derived values 
such as those I mentioned in earlier email
* Move these things to a known place
* Create a Web page with pointers to this information and stick it 
somewhere [or add it to an existing web page?]
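
The "name=value" file suggested in the list above could be sketched roughly as follows. The function names, the `derived.` prefix, and the example keys are assumptions for illustration only; nothing here reflects an existing Swift tool.

```python
from pathlib import Path


def write_run_summary(path, params, derived):
    """Write run parameters and derived values as sorted name=value lines."""
    lines = [f"{key}={value}" for key, value in sorted(params.items())]
    # Derived values get a prefix (an assumption) so they stay distinguishable.
    lines += [f"derived.{key}={value}" for key, value in sorted(derived.items())]
    Path(path).write_text("\n".join(lines) + "\n")
    return lines


def read_run_summary(path):
    """Read a name=value file back into a dict of strings."""
    pairs = {}
    for line in Path(path).read_text().splitlines():
        if line and not line.startswith("#"):
            key, _, value = line.partition("=")
            pairs[key] = value
    return pairs
```

A flat text file like this is easy to grep, diff between runs, and link from a web page, which is the main reason to prefer it over a binary format for this purpose.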

Ian.


Ian Foster wrote:
> Mike:
>
> It seems important to define what the specific goals and milestones 
> are here, as it seems that simply asking for it doesn't get it done. 
> Perhaps we need a brief specification?
>
> Ian.
>
> Mike Wilde wrote:
>> Yes, this is what Ganglia has been using.
>>
>> Regarding the auto-publishing - Jens has a mechanism that regularly 
>> posted info in rrd format on the state of the VDS lab machines, using 
>> a perl mechanism like what Ian described.  Perhaps we can find and 
>> adapt that for Ioan's numbers.
>> It was running on gainly I think. But it's not hard to develop from 
>> scratch.
>>
>> It would be good to see the same numbers for all the Swift apps being 
>> worked on, driven initially by kickstart summaries and digesting the 
>> swift logfile.
>> We've long had this as a goal - now is a good time to push forward 
>> and do this.
>>
>> Nika and Tibi, could you work with Ioan on this?
>>
>> - Mike
>>
>>
>>
>> Ian Foster wrote, On 6/20/2007 11:17 PM:
>>> Hi,
>>>
>>> I was pointed at http://oss.oetiker.ch/rrdtool/, has anyone seen 
>>> this? Seems nice to me.
>>>
>>> Ian.
>>>
>>>
>>
>

-- 

  Ian Foster, Director, Computation Institute
Argonne National Laboratory & University of Chicago
Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439
Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637
Tel: +1 630 252 4619.  Web: www.ci.uchicago.edu.
     Globus Alliance: www.globus.org.



From foster at mcs.anl.gov  Mon Jun 25 04:26:07 2007
From: foster at mcs.anl.gov (Ian Foster)
Date: Mon, 25 Jun 2007 04:26:07 -0500
Subject: [Swift-devel] [Fwd: Re: [ogsa-wg] link to kepler]
Message-ID: <467F8A2F.6020505@mcs.anl.gov>

perhaps interesting slides ...

-------- Original Message --------
Subject: 	Re: [ogsa-wg] link to kepler
Date: 	Mon, 25 Jun 2007 10:26:53 +0200
From: 	adam belloum <adam at science.uva.nl>
To: 	Ian Foster <foster at mcs.anl.gov>
CC: 	O.F.Rana at cs.cardiff.ac.uk, ogsa-wg at ogf.org
References: 	<20070614142750.3ADFF1F5185 at fork10.mail.virginia.edu> 
<20070614155331.t01etgesooc8so08 at www.cs.cf.ac.uk> 
<4671666F.5020207 at mcs.anl.gov>


Hi,

we have just finished the home page of WS-VLAM 
(www.science.uva.nl/~gvlam/wsvlam);
maybe you will also find interesting material there.

I know that OGSA-wg is interested in collecting requirements. There is a 
presentation on the web site
that contains a list of 32 wishes for workflow, which we have collected from 
our users in the VL-e project.
We use this list as a driver for our development
     (www.science.uva.nl/~gvlam/wsvlam/presentations/WS-VLAM-wishlist.ppt)

In the future we will put up more information on the use cases defined around 
workflows and their requirements.

Regards

Adam


Ian Foster wrote:

>A couple more:
>
>* Karajan, and the Swift system
>* DAGman, and Pegasus
>
>O.F.Rana at cs.cardiff.ac.uk wrote:
>  
>
>>Hi,
>>
>>Just to keep things in perspective -- there are also a number of other
>>workflow engines. A good portal is
>>
>>http://www.gridworkflow.org/
>>
>>We have Triana in Cardiff + Taverna from EBI/Manchester.
>>
>>regards
>>Omer
>>
>>  
>>    
>>
>
>  
>


-- 

   Ian Foster, Director, Computation Institute
Argonne National Laboratory & University of Chicago
Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439
Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637
Tel: +1 630 252 4619.  Web: www.ci.uchicago.edu.
      Globus Alliance: www.globus.org.


From benc at hawaga.org.uk  Mon Jun 25 07:09:36 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 25 Jun 2007 12:09:36 +0000 (GMT)
Subject: [Swift-devel] Swift Performance Data
In-Reply-To: <467A7E17.5000207@mcs.anl.gov>
References: <4679FBC8.1080606@mcs.anl.gov> <467A6D78.4020702@mcs.anl.gov>
	<467A7AC6.7020400@mcs.anl.gov> <467A7E17.5000207@mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0706251157360.15250@dildano.hawaga.org.uk>


On Thu, 21 Jun 2007, Ian Foster wrote:

> * Generate a file containing as much information as we can about the run 
> and its parameters--maybe a name=value format?--and some derived values 
> such as those I mentioned in earlier email

The CEDPS project seems to have a somewhat reasonable document defining a 
logging format now - it didn't last time I looked ages ago, I think.

Do you have tools for analysing those log files (or plan to have them)?

If so, could be useful to put extra work in to match up with that.

(The text is linked from here: 
http://www.cedps.net/wiki/index.php/LoggingBestPractices)

(though its presentation as a Word document suggests a certain abstraction from the 
community who actually do logging and troubleshooting so perhaps caution 
is advised ;-)

-- 


From itf at mcs.anl.gov  Mon Jun 25 07:13:22 2007
From: itf at mcs.anl.gov (=?utf-8?B?SWFuIEZvc3Rlcg==?=)
Date: Mon, 25 Jun 2007 12:13:22 +0000
Subject: [Swift-devel] Swift Performance Data
In-Reply-To: <Pine.LNX.4.64.0706251157360.15250@dildano.hawaga.org.uk>
References: <4679FBC8.1080606@mcs.anl.gov>
	<467A6D78.4020702@mcs.anl.gov><467A7AC6.7020400@mcs.anl.gov>
	<467A7E17.5000207@mcs.anl.gov><Pine.LNX.4.64.0706251157360.15250@dildano.hawaga.org.uk>
Message-ID: <1975377455-1182773680-cardhu_decombobulator_blackberry.rim.net-423416081-@bxe006.bisx.prod.on.blackberry>

There are the netlogger tools


Sent via BlackBerry from T-Mobile

-----Original Message-----
From: Ben Clifford <benc at hawaga.org.uk>

Date: Mon, 25 Jun 2007 12:09:36 
To:Ian Foster <foster at mcs.anl.gov>
Cc:swift-devel at ci.uchicago.edu
Subject: Re: [Swift-devel] Swift Performance Data


On Thu, 21 Jun 2007, Ian Foster wrote:

> * Generate a file containing as much information as we can about the run 
> and its parameters--maybe a name=value format?--and some derived values 
> such as those I mentioned in earlier email

The CEDPS project seems to have a somewhat reasonable document defining a 
logging format now - it didn't last time I looked ages ago, I think.

Do you have tools for analysing those log files (or plan to have them)?

If so, could be useful to put extra work in to match up with that.

(The text is linked from here: 
http://www.cedps.net/wiki/index.php/LoggingBestPractices)

(though its presentation as a Word document suggests a certain abstraction from the 
community who actually do logging and troubleshooting so perhaps caution 
is advised ;-)

-- 


From benc at hawaga.org.uk  Mon Jun 25 07:16:31 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 25 Jun 2007 12:16:31 +0000 (GMT)
Subject: [Swift-devel] Swift Performance Data
In-Reply-To: <1975377455-1182773680-cardhu_decombobulator_blackberry.rim.net-423416081-@bxe006.bisx.prod.on.blackberry>
References: <4679FBC8.1080606@mcs.anl.gov>
	<467A6D78.4020702@mcs.anl.gov><467A7AC6.7020400@mcs.anl.gov>
	<467A7E17.5000207@mcs.anl.gov><Pine.LNX.4.64.0706251157360.15250@dildano.hawaga.org.uk>
	<1975377455-1182773680-cardhu_decombobulator_blackberry.rim.net-423416081-@bxe006.bisx.prod.on.blackberry>
Message-ID: <Pine.LNX.4.64.0706251215390.18874@dildano.hawaga.org.uk>


On Mon, 25 Jun 2007, Ian Foster wrote:

> There are the netlogger tools

do they use this format? If so that's fairly compelling (at least based on 
a powerpoint presentation I saw once, rather than actual experience ;-)

-- 


From itf at mcs.anl.gov  Mon Jun 25 07:19:26 2007
From: itf at mcs.anl.gov (=?utf-8?B?SWFuIEZvc3Rlcg==?=)
Date: Mon, 25 Jun 2007 12:19:26 +0000
Subject: [Swift-devel] Swift Performance Data
In-Reply-To: <Pine.LNX.4.64.0706251215390.18874@dildano.hawaga.org.uk>
References: <4679FBC8.1080606@mcs.anl.gov>
	<467A6D78.4020702@mcs.anl.gov><467A7AC6.7020400@mcs.anl.gov><467A7E17.5000207@mcs.anl.gov><Pine.LNX.4.64.0706251157360.15250@dildano.hawaga.org.uk><1975377455-1182773680-cardhu_decombobulator_blackberry.rim.net-423416081-@bxe006.bisx.prod.on.blackberry><Pine.LNX.4.64.0706251215390.18874@dildano.hawaga.org.uk>
Message-ID: <525747895-1182774044-cardhu_decombobulator_blackberry.rim.net-1528886809-@bxe006.bisx.prod.on.blackberry>

I suggested them a while ago. There was some reason given for not using them, but I can't recall what it was. It may just have been NIH.

Ian

Sent via BlackBerry from T-Mobile

-----Original Message-----
From: Ben Clifford <benc at hawaga.org.uk>

Date: Mon, 25 Jun 2007 12:16:31 
To:Ian Foster <itf at mcs.anl.gov>
Cc:Ian Foster <foster at mcs.anl.gov>, swift-devel at ci.uchicago.edu
Subject: Re: [Swift-devel] Swift Performance Data


On Mon, 25 Jun 2007, Ian Foster wrote:

> There are the netlogger tools

do they use this format? If so that's fairly compelling (at least based on 
a powerpoint presentation I saw once, rather than actual experience ;-)

-- 


From benc at hawaga.org.uk  Mon Jun 25 07:23:55 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 25 Jun 2007 12:23:55 +0000 (GMT)
Subject: [Swift-devel] Swift Performance Data
In-Reply-To: <525747895-1182774044-cardhu_decombobulator_blackberry.rim.net-1528886809-@bxe006.bisx.prod.on.blackberry>
References: <4679FBC8.1080606@mcs.anl.gov>
	<467A6D78.4020702@mcs.anl.gov><467A7AC6.7020400@mcs.anl.gov><467A7E17.5000207@mcs.anl.gov><Pine.LNX.4.64.0706251157360.15250@dildano.hawaga.org.uk><1975377455-1182773680-cardhu_decombobulator_blackberry.rim.net-423416081-@bxe006.bisx.prod.on.blackberry><Pine.LNX.4.64.0706251215390.18874@dildano.hawaga.org.uk>
	<525747895-1182774044-cardhu_decombobulator_blackberry.rim.net-1528886809-@bxe006.bisx.prod.on.blackberry>
Message-ID: <Pine.LNX.4.64.0706251222420.18874@dildano.hawaga.org.uk>


I was resistant to CEDPS troubleshooting because they didn't seem to have 
anything on offer at the time. They do now.

Though that's different from netlogger, at least on the marketing side 
(though clearly the staff list overlaps some).

On Mon, 25 Jun 2007, Ian Foster wrote:

> I suggested them a while ago. There was some reason given for not using them, but I can't recall what it was. It may just have been NIH.
> 
> Ian
> 
> Sent via BlackBerry from T-Mobile
> 
> -----Original Message-----
> From: Ben Clifford <benc at hawaga.org.uk>
> 
> Date: Mon, 25 Jun 2007 12:16:31 
> To:Ian Foster <itf at mcs.anl.gov>
> Cc:Ian Foster <foster at mcs.anl.gov>, swift-devel at ci.uchicago.edu
> Subject: Re: [Swift-devel] Swift Performance Data
> 
> 
> 
> On Mon, 25 Jun 2007, Ian Foster wrote:
> 
> > There are the netlogger tools
> 
> do they use this format? If so that's fairly compelling (at least based on 
> a powerpoint presentation I saw once, rather than actual experience ;-)
> 
> 


From itf at mcs.anl.gov  Mon Jun 25 07:23:08 2007
From: itf at mcs.anl.gov (=?utf-8?B?SWFuIEZvc3Rlcg==?=)
Date: Mon, 25 Jun 2007 12:23:08 +0000
Subject: [Swift-devel] Swift Performance Data
In-Reply-To: <525747895-1182774044-cardhu_decombobulator_blackberry.rim.net-1528886809-@bxe006.bisx.prod.on.blackberry>
References: <4679FBC8.1080606@mcs.anl.gov>
	<467A6D78.4020702@mcs.anl.gov><467A7AC6.7020400@mcs.anl.gov><467A7E17.5000207@mcs.anl.gov><Pine.LNX.4.64.0706251157360.15250@dildano.hawaga.org.uk><1975377455-1182773680-cardhu_decombobulator_blackberry.rim.net-423416081-@bxe006.bisx.prod.on.blackberry><Pine.LNX.4.64.0706251215390.18874@dildano.hawaga.org.uk><525747895-1182774044-cardhu_decombobulator_blackberry.rim.net-1528886809-@bxe006.bisx.prod.on.blackberry>
Message-ID: <1560635678-1182774268-cardhu_decombobulator_blackberry.rim.net-1158930358-@bxe006.bisx.prod.on.blackberry>

However, I'd like to mention that my specific request is that we start collecting and storing logs from all runs. Using standard formats and netlogger may well be a good idea, but I'd feel concerned that putting that on the critical path would delay yet further the day when we achieve the primary goal.

Ian 


Sent via BlackBerry from T-Mobile

-----Original Message-----
From: "Ian Foster" <itf at mcs.anl.gov>

Date: Mon, 25 Jun 2007 12:19:26 
To:"Ben Clifford" <benc at hawaga.org.uk>
Cc:"Ian Foster" <foster at mcs.anl.gov>, swift-devel at ci.uchicago.edu
Subject: Re: [Swift-devel] Swift Performance Data


I suggested them a while ago. There was some reason given for not using them, but I can't recall what it was. It may just have been NIH.

Ian

Sent via BlackBerry from T-Mobile

-----Original Message-----
From: Ben Clifford <benc at hawaga.org.uk>

Date: Mon, 25 Jun 2007 12:16:31 
To:Ian Foster <itf at mcs.anl.gov>
Cc:Ian Foster <foster at mcs.anl.gov>, swift-devel at ci.uchicago.edu
Subject: Re: [Swift-devel] Swift Performance Data


On Mon, 25 Jun 2007, Ian Foster wrote:

> There are the netlogger tools

do they use this format? If so that's fairly compelling (at least based on 
a powerpoint presentation I saw once, rather than actual experience ;-)

-- 


From itf at mcs.anl.gov  Mon Jun 25 07:27:58 2007
From: itf at mcs.anl.gov (=?utf-8?B?SWFuIEZvc3Rlcg==?=)
Date: Mon, 25 Jun 2007 12:27:58 +0000
Subject: [Swift-devel] Swift Performance Data
In-Reply-To: <Pine.LNX.4.64.0706251222420.18874@dildano.hawaga.org.uk>
References: <4679FBC8.1080606@mcs.anl.gov><467A6D78.4020702@mcs.anl.gov><467A7AC6.7020400@mcs.anl.gov><467A7E17.5000207@mcs.anl.gov><Pine.LNX.4.64.0706251157360.15250@dildano.hawaga.org.uk><1975377455-1182773680-cardhu_decombobulator_blackberry.rim.net-423416081-@bxe006.bisx.prod.on.blackberry><Pine.LNX.4.64.0706251215390.18874@dildano.hawaga.org.uk><525747895-1182774044-cardhu_decombobulator_blackberry.rim.net-1528886809-@bxe006.bisx.prod.on.blackberry><Pine.LNX.4.64.0706251222420.18874@dildano.hawaga.org.uk>
Message-ID: <63469356-1182774555-cardhu_decombobulator_blackberry.rim.net-1579903038-@bxe006.bisx.prod.on.blackberry>

Ah yes that's right.

I've asked the NetLogger folks if they support it.

Ian

Sent via BlackBerry from T-Mobile

-----Original Message-----
From: Ben Clifford <benc at hawaga.org.uk>

Date: Mon, 25 Jun 2007 12:23:55 
To:Ian Foster <itf at mcs.anl.gov>
Cc:Ian Foster <foster at mcs.anl.gov>, swift-devel at ci.uchicago.edu
Subject: Re: [Swift-devel] Swift Performance Data


I was resistant to CEDPS troubleshooting because they didn't seem to have 
anything on offer at the time. They do now.

Though that's different from netlogger, at least on the marketing side 
(though clearly the staff list overlaps some).

On Mon, 25 Jun 2007, Ian Foster wrote:

> I suggested them a while ago. There was some reason given for not using them, but I can't recall what it was. It may just have been NIH.
> 
> Ian
> 
> Sent via BlackBerry from T-Mobile
> 
> -----Original Message-----
> From: Ben Clifford <benc at hawaga.org.uk>
> 
> Date: Mon, 25 Jun 2007 12:16:31 
> To:Ian Foster <itf at mcs.anl.gov>
> Cc:Ian Foster <foster at mcs.anl.gov>, swift-devel at ci.uchicago.edu
> Subject: Re: [Swift-devel] Swift Performance Data
> 
> 
> 
> On Mon, 25 Jun 2007, Ian Foster wrote:
> 
> > There are the netlogger tools
> 
> do they use this format? If so that's fairly compelling (at least based on 
> a powerpoint presentation I saw once, rather than actual experience ;-)
> 
> 


From benc at hawaga.org.uk  Mon Jun 25 07:29:49 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 25 Jun 2007 12:29:49 +0000 (GMT)
Subject: [Swift-devel] Swift Performance Data
In-Reply-To: <1560635678-1182774268-cardhu_decombobulator_blackberry.rim.net-1158930358-@bxe006.bisx.prod.on.blackberry>
References: <4679FBC8.1080606@mcs.anl.gov>
	<467A6D78.4020702@mcs.anl.gov><467A7AC6.7020400@mcs.anl.gov><467A7E17.5000207@mcs.anl.gov><Pine.LNX.4.64.0706251157360.15250@dildano.hawaga.org.uk><1975377455-1182773680-cardhu_decombobulator_blackberry.rim.net-423416081-@bxe006.bisx.prod.on.blackberry><Pine.LNX.4.64.0706251215390.18874@dildano.hawaga.org.uk><525747895-1182774044-cardhu_decombobulator_blackberry.rim.net-1528886809-@bxe006.bisx.prod.on.blackberry>
	<1560635678-1182774268-cardhu_decombobulator_blackberry.rim.net-1158930358-@bxe006.bisx.prod.on.blackberry>
Message-ID: <Pine.LNX.4.64.0706251225390.18874@dildano.hawaga.org.uk>


On Mon, 25 Jun 2007, Ian Foster wrote:

> However, I'd like to mention that my specific request is that we start 
> collecting and storing logs from all runs. Using standard formats and 
> netlogger may well be a good idea, but I'd feel concerned that putting 
> that on the critical path would delay yet further the day when we 
> achieve the primary goal.

Raw unfiltered log collection is something that the app people need to do 
- Tibi, Nika, Ioan.

I guess the base set is:

  swift .log files

  kickstart dumps (for extra logging info and extra slowdown, turn 
  on kickstart for all jobs instead of failed tasks by setting: 
     kickstart.always.transfer=true
  )

  whatever falkon produces


This is a very different request from the 'sort out the analysis tooling' 
request, though.
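
Raw collection of the base set listed above could be as small as the following sketch. The glob patterns, the directory layout, and the timestamp-based run id are assumptions; in particular, what Falkon writes and how kickstart records are named would need to be checked against a real run directory.

```python
import shutil
import time
from pathlib import Path


def archive_run_logs(run_dir, archive_root,
                     patterns=("*.log", "*kickstart*"), run_id=None):
    """Copy log-like files from one run into a per-run archive directory.

    patterns -- assumed file-name patterns; adjust to the real file names
    run_id   -- archive subdirectory name; defaults to a local timestamp
    """
    run_dir = Path(run_dir)
    run_id = run_id or time.strftime("%Y%m%d-%H%M%S")
    dest = Path(archive_root) / run_id
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for pattern in patterns:
        for src in sorted(run_dir.glob(pattern)):
            # Skip directories and files already matched by an earlier pattern.
            if src.is_file() and src.name not in copied:
                shutil.copy2(src, dest / src.name)  # copy2 keeps timestamps
                copied.append(src.name)
    return dest, copied
```

Nothing is parsed or filtered here, so runs archived this way can still be analysed later, once the analysis tooling exists.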

-- 


From hategan at mcs.anl.gov  Mon Jun 25 08:45:25 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 25 Jun 2007 08:45:25 -0500
Subject: [Swift-devel] Swift Performance Data
In-Reply-To: <525747895-1182774044-cardhu_decombobulator_blackberry.rim.net-1528886809-@bxe006.bisx.prod.on.blackberry>
References: <4679FBC8.1080606@mcs.anl.gov> <467A6D78.4020702@mcs.anl.gov>
	<467A7AC6.7020400@mcs.anl.gov><467A7E17.5000207@mcs.anl.gov>
	<Pine.LNX.4.64.0706251157360.15250@dildano.hawaga.org.uk>
	<1975377455-1182773680-cardhu_decombobulator_blackberry.rim.net-423416081-@bxe006.bisx.prod.on.blackberry>
	<Pine.LNX.4.64.0706251215390.18874@dildano.hawaga.org.uk>
	<525747895-1182774044-cardhu_decombobulator_blackberry.rim.net-1528886809-@bxe006.bisx.prod.on.blackberry>
Message-ID: <1182779125.5910.6.camel@blabla.mcs.anl.gov>

On Mon, 2007-06-25 at 12:19 +0000, Ian Foster wrote:
> I suggested them a while ago. There was some reason given for not using them, but I can't recall what it was. It may just have been NIH.

The reason was that nothing was there.

Another difficulty is that if we want meaningful things in that
particular format, the whole software stack needs to be changed
(including cog and jglobus and perhaps other things). This sounds a bit
difficult, especially considering the fact that the information is
there, but the format is not. I'd rather write a few simple parsers than
try to change all logging messages everywhere.

Mihael

> 
> Ian
> 
> Sent via BlackBerry from T-Mobile
> 
> -----Original Message-----
> From: Ben Clifford <benc at hawaga.org.uk>
> 
> Date: Mon, 25 Jun 2007 12:16:31 
> To:Ian Foster <itf at mcs.anl.gov>
> Cc:Ian Foster <foster at mcs.anl.gov>, swift-devel at ci.uchicago.edu
> Subject: Re: [Swift-devel] Swift Performance Data
> 
> 
> 
> On Mon, 25 Jun 2007, Ian Foster wrote:
> 
> > There are the netlogger tools
> 
> do they use this format? If so that's fairly compelling (at least based on 
> a powerpoint presentation I saw once, rather than actual experience ;-)
> 


From hategan at mcs.anl.gov  Mon Jun 25 08:48:13 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 25 Jun 2007 08:48:13 -0500
Subject: [Swift-devel] Swift Performance Data
In-Reply-To: <1560635678-1182774268-cardhu_decombobulator_blackberry.rim.net-1158930358-@bxe006.bisx.prod.on.blackberry>
References: <4679FBC8.1080606@mcs.anl.gov> <467A6D78.4020702@mcs.anl.gov>
	<467A7AC6.7020400@mcs.anl.gov><467A7E17.5000207@mcs.anl.gov>
	<Pine.LNX.4.64.0706251157360.15250@dildano.hawaga.org.uk>
	<1975377455-1182773680-cardhu_decombobulator_blackberry.rim.net-423416081-@bxe006.bisx.prod.on.blackberry>
	<Pine.LNX.4.64.0706251215390.18874@dildano.hawaga.org.uk>
	<525747895-1182774044-cardhu_decombobulator_blackberry.rim.net-1528886809-@bxe006.bisx.prod.on.blackberry>
	<1560635678-1182774268-cardhu_decombobulator_blackberry.rim.net-1158930358-@bxe006.bisx.prod.on.blackberry>
Message-ID: <1182779293.5910.10.camel@blabla.mcs.anl.gov>

On Mon, 2007-06-25 at 12:23 +0000, Ian Foster wrote:
> However, I'd like to mention that my specific request is that we start collecting and storing logs from all runs. Using standard formats and netlogger may well be a good idea, but I'd feel concerned that putting that on the critical path would delay yet further the day when we achieve the primary goal.

Yes, I mentioned that to Tibi (I think) a while ago. The effort of
collecting logs is minimal, and once the tools are there, we could
easily analyze previous runs.

Mihael

> 
> Ian 
> 
> 
> 
> Sent via BlackBerry from T-Mobile
> 
> -----Original Message-----
> From: "Ian Foster" <itf at mcs.anl.gov>
> 
> Date: Mon, 25 Jun 2007 12:19:26 
> To:"Ben Clifford" <benc at hawaga.org.uk>
> Cc:"Ian Foster" <foster at mcs.anl.gov>, swift-devel at ci.uchicago.edu
> Subject: Re: [Swift-devel] Swift Performance Data
> 
> 
> I suggested them a while ago. There was some reason given for not using them, but I can't recall what it was. It may just have been NIH.
> 
> Ian
> 
> Sent via BlackBerry from T-Mobile
> 
> -----Original Message-----
> From: Ben Clifford <benc at hawaga.org.uk>
> 
> Date: Mon, 25 Jun 2007 12:16:31 
> To:Ian Foster <itf at mcs.anl.gov>
> Cc:Ian Foster <foster at mcs.anl.gov>, swift-devel at ci.uchicago.edu
> Subject: Re: [Swift-devel] Swift Performance Data
> 
> 
> 
> On Mon, 25 Jun 2007, Ian Foster wrote:
> 
> > There are the netlogger tools
> 
> do they use this format? If so that's fairly compelling (at least based on 
> a powerpoint presentation I saw once, rather than actual experience ;-)
> 


From benc at hawaga.org.uk  Mon Jun 25 08:54:17 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 25 Jun 2007 13:54:17 +0000 (GMT)
Subject: [Swift-devel] Swift Performance Data
In-Reply-To: <1182779125.5910.6.camel@blabla.mcs.anl.gov>
References: <4679FBC8.1080606@mcs.anl.gov> <467A6D78.4020702@mcs.anl.gov> 
	<467A7AC6.7020400@mcs.anl.gov><467A7E17.5000207@mcs.anl.gov> 
	<Pine.LNX.4.64.0706251157360.15250@dildano.hawaga.org.uk> 
	<1975377455-1182773680-cardhu_decombobulator_blackberry.rim.net-423416081-@bxe006.bisx.prod.on.blackberry>
	<Pine.LNX.4.64.0706251215390.18874@dildano.hawaga.org.uk> 
	<525747895-1182774044-cardhu_decombobulator_blackberry.rim.net-1528886809-@bxe006.bisx.prod.on.blackberry>
	<1182779125.5910.6.camel@blabla.mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0706251349370.18874@dildano.hawaga.org.uk>


On Mon, 25 Jun 2007, Mihael Hategan wrote:

> Another difficulty is that if we want meaningful things in that 
> particular format, the whole software stack needs to be changed 
> (including cog and jglobus and perhaps other things). This sounds a bit 
> difficult, especially considering the fact that the information is 
> there, but the format is not. I'd rather write a few simple parsers than 
> try to change all logging messages everywhere.

'a few simple parsers' is not necessarily 'simple'.

changing logging messages in code everywhere definitely isn't, though.

if there is going to be more than one analysis tool, converting log files 
to a common format somewhere in between generating application and 
analysing application might be a good idea, and not massively different 
from defining a language level API to abstract out log file format 
differences.

-- 


From hategan at mcs.anl.gov  Mon Jun 25 08:59:00 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 25 Jun 2007 08:59:00 -0500
Subject: [Swift-devel] Swift Performance Data
In-Reply-To: <Pine.LNX.4.64.0706251349370.18874@dildano.hawaga.org.uk>
References: <4679FBC8.1080606@mcs.anl.gov> <467A6D78.4020702@mcs.anl.gov>
	<467A7AC6.7020400@mcs.anl.gov><467A7E17.5000207@mcs.anl.gov>
	<Pine.LNX.4.64.0706251157360.15250@dildano.hawaga.org.uk>
	<1975377455-1182773680-cardhu_decombobulator_blackberry.rim.net-423416081-@bxe006.bisx.prod.on.blackberry>
	<Pine.LNX.4.64.0706251215390.18874@dildano.hawaga.org.uk>
	<525747895-1182774044-cardhu_decombobulator_blackberry.rim.net-1528886809-@bxe006.bisx.prod.on.blackberry>
	<1182779125.5910.6.camel@blabla.mcs.anl.gov>
	<Pine.LNX.4.64.0706251349370.18874@dildano.hawaga.org.uk>
Message-ID: <1182779940.5910.16.camel@blabla.mcs.anl.gov>

On Mon, 2007-06-25 at 13:54 +0000, Ben Clifford wrote:
> On Mon, 25 Jun 2007, Mihael Hategan wrote:
> 
> > Another difficulty is that if we want meaningful things in that 
> > particular format, the whole software stack needs to be changed 
> > (including cog and jglobus and perhaps other things). This sounds a bit 
> > difficult, especially considering the fact that the information is 
> > there, but the format is not. I'd rather write a few simple parsers than 
> > try to change all logging messages everywhere.
> 
> 'a few simple parsers' is not necessarily 'simple'.

It's relatively simple. The python tool that is an adaptation of Jens'
shows that it can be done relatively easily. I would worry more about
Swift producing sufficient information and about how that information is
represented. That seems harder to me.

> 
> changing logging messages in code everywhere definitely isn't, though.
> 
> if there is going to be more than one analysis tool, converting log files 
> to a common format somewhere in between generating application and 
> analysing application might be a good idea, and not massively different 
> from defining a language level API to abstract out log file format 
> differences.

This is similar to code generation vs. abstraction (or interpretation
vs. compilation). There can be:
1. An api to access the logs in structured ways
2. A log translator
3. An adaptor plugged in at the log4j (or whatever logging library)
level that does the translation dynamically (at the expense of
performance).

Mihael

> 


From benc at hawaga.org.uk  Mon Jun 25 09:07:22 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 25 Jun 2007 14:07:22 +0000 (GMT)
Subject: [Swift-devel] Swift Performance Data
In-Reply-To: <1182779940.5910.16.camel@blabla.mcs.anl.gov>
References: <4679FBC8.1080606@mcs.anl.gov> <467A6D78.4020702@mcs.anl.gov> 
	<467A7AC6.7020400@mcs.anl.gov><467A7E17.5000207@mcs.anl.gov> 
	<Pine.LNX.4.64.0706251157360.15250@dildano.hawaga.org.uk> 
	<1975377455-1182773680-cardhu_decombobulator_blackberry.rim.net-423416081-@bxe006.bisx.prod.on.blackberry>
	<Pine.LNX.4.64.0706251215390.18874@dildano.hawaga.org.uk> 
	<525747895-1182774044-cardhu_decombobulator_blackberry.rim.net-1528886809-@bxe006.bisx.prod.on.blackberry>
	<1182779125.5910.6.camel@blabla.mcs.anl.gov> 
	<Pine.LNX.4.64.0706251349370.18874@dildano.hawaga.org.uk>
	<1182779940.5910.16.camel@blabla.mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0706251404540.18874@dildano.hawaga.org.uk>


On Mon, 25 Jun 2007, Mihael Hategan wrote:

> This is similar to code generation vs. abstraction (or interpretation
> vs. compilation). There can be:

it's also an issue with where the abstractions happen:

> 1. An api to access the logs in structured ways

needs the API to exist in the language that you want to write analysers 
in.

> 2. A log translator

makes the API into a posix filesystem with text files. still needs a 
per-language parser to parse that format, but that is 'simple' and works 
in a variety of languages. so I'd favour this.
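
To give a feel for how small such a per-language parser can be, here is a
minimal Python sketch. It assumes a hypothetical common format of one event
per line, written as whitespace-separated key=value pairs with double quotes
around values that contain spaces. Nothing in this thread fixes that format;
the field names (ts, event, id, msg) and the function names are illustrative
only.

```python
import shlex


def parse_line(line):
    """Parse one 'key=value key2="quoted value"' line into a dict.

    Assumed format (hypothetical, not defined in this thread): fields are
    separated by whitespace; values containing spaces are double-quoted.
    """
    record = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:  # skip tokens that are not key=value pairs
            record[key] = value
    return record


def parse_log(lines):
    """Yield one dict per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield parse_line(line)


if __name__ == "__main__":
    # Made-up sample lines in the assumed format.
    sample = [
        'ts=2007-06-25T14:07:22Z event=job.start id=42',
        'ts=2007-06-25T14:07:30Z event=job.end id=42 msg="exit code 0"',
    ]
    for rec in parse_log(sample):
        print(rec)
```

An equivalent parser in another language would be of similar size, which is
the argument for putting the common format in text files rather than behind a
language-level API.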

> 3. An adaptor plugged in at the log4j (or whatever logging library)
> level that does the translation dynamically (at the expense of
> performance).

Possibly some components can output stuff into a common format using log4j 
- that wouldn't necessarily be any more dynamic than the existing log4j 
output.

-- 


From wilde at mcs.anl.gov  Mon Jun 25 09:19:17 2007
From: wilde at mcs.anl.gov (Mike Wilde)
Date: Mon, 25 Jun 2007 09:19:17 -0500
Subject: [Swift-devel] Swift Performance Data
In-Reply-To: <Pine.LNX.4.64.0706251349370.18874@dildano.hawaga.org.uk>
References: <4679FBC8.1080606@mcs.anl.gov> <467A6D78.4020702@mcs.anl.gov>
	<467A7AC6.7020400@mcs.anl.gov><467A7E17.5000207@mcs.anl.gov>
	<Pine.LNX.4.64.0706251157360.15250@dildano.hawaga.org.uk>
	<1975377455-1182773680-cardhu_decombobulator_blackberry.rim.net-423416081-@bxe006.bisx.prod.on.blackberry>	<Pine.LNX.4.64.0706251215390.18874@dildano.hawaga.org.uk>
	<525747895-1182774044-cardhu_decombobulator_blackberry.rim.net-1528886809-@bxe006.bisx.prod.on.blackberry>	<1182779125.5910.6.camel@blabla.mcs.anl.gov>
	<Pine.LNX.4.64.0706251349370.18874@dildano.hawaga.org.uk>
Message-ID: <467FCEE5.6000707@mcs.anl.gov>

I have not been reading email this weekend and need to catch up on this and 
related threads.

I want to ask that for the moment Ben and Mihael stay focused on what they are 
working on, and I will work with Nika and Tibi on application status and issues, 
and move the measurement issue along.

I agree fully with Mihael's point that we can and should start gathering all 
execution logs into a uniformly structured gathering place. Then we can organize 
the current log tools and determine what's needed next in that area.

For now:

Ben: Swift 0.2 and mapper/language improvements

Mihael: get to a closure point on I2U2 to get the last 4 months of work into 
production (or at a stable development lab for a next-generation production 
system). Determine when you can be back on Swift.

Nika: MolDyn-244 and defining the MolDyn parameter sweep workflow;
  Next steps (TBD) on LQCD progress;

Tibi: Econ - next workflow; set up environment for Econ to adopt tools. Work 
with new people in Econ to take over from Gabrielle. I2U2 load sharing into 
production, and assist in LIGO app.  SIDGrid Wavelet: tbd.

Nika and Tibi: application writeups

Next apps: FLASH, RADCAD; possibly SCEC exploration.

- Mike


Ben Clifford wrote, On 6/25/2007 8:54 AM:
> On Mon, 25 Jun 2007, Mihael Hategan wrote:
> 
>> Another difficulty is that if we want meaningful things in that 
>> particular format, the whole software stack needs to be changed 
>> (including cog and jglobus and perhaps other things). This sounds a bit 
>> difficult, especially considering the fact that the information is 
>> there, but the format is not. I'd rather write a few simple parsers than 
>> try to change all logging messages everywhere.
> 
> 'a few simple parsers' is not necessarily 'simple'.
> 
> changing logging messages in code everywhere definitely isn't, though.
> 
> if there is going to be more than one analysis tool, converting log files 
> to a common format somewhere in between generating application and 
> analysing application might be a good idea, and not massively different 
> from defining a language level API to abstract out log file format 
> differences.
> 

-- 
Mike Wilde
Computation Institute, University of Chicago
Math & Computer Science Division
Argonne National Laboratory
Argonne, IL   60439    USA
tel 630-252-7497 fax 630-252-1997


From hategan at mcs.anl.gov  Mon Jun 25 09:19:00 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 25 Jun 2007 09:19:00 -0500
Subject: [Swift-devel] Swift Performance Data
In-Reply-To: <Pine.LNX.4.64.0706251404540.18874@dildano.hawaga.org.uk>
References: <4679FBC8.1080606@mcs.anl.gov> <467A6D78.4020702@mcs.anl.gov>
	<467A7AC6.7020400@mcs.anl.gov><467A7E17.5000207@mcs.anl.gov>
	<Pine.LNX.4.64.0706251157360.15250@dildano.hawaga.org.uk>
	<1975377455-1182773680-cardhu_decombobulator_blackberry.rim.net-423416081-@bxe006.bisx.prod.on.blackberry>
	<Pine.LNX.4.64.0706251215390.18874@dildano.hawaga.org.uk>
	<525747895-1182774044-cardhu_decombobulator_blackberry.rim.net-1528886809-@bxe006.bisx.prod.on.blackberry>
	<1182779125.5910.6.camel@blabla.mcs.anl.gov>
	<Pine.LNX.4.64.0706251349370.18874@dildano.hawaga.org.uk>
	<1182779940.5910.16.camel@blabla.mcs.anl.gov>
	<Pine.LNX.4.64.0706251404540.18874@dildano.hawaga.org.uk>
Message-ID: <1182781140.10791.3.camel@blabla.mcs.anl.gov>

On Mon, 2007-06-25 at 14:07 +0000, Ben Clifford wrote:
> 
> On Mon, 25 Jun 2007, Mihael Hategan wrote:
> 
> > This is similar to code generation vs. abstraction (or interpretation
> > vs. compilation). There can be:
> 
> its also an issue with where the abstractions happen:
> 
> > 1. An api to access the logs in structured ways
> 
> needs the API to exist in the language that you want to write analysers 
> in.
> 
> > 2. A log translator
> 
> makes the API into a posix filesystem with text files. still needs a 
> per-language parser to parse that format, but that is 'simple' and works 
> in a variety of languages. so I'd favour this.

So would I. And since logs are incremental, it could even be done live
(i.e. tail -f log |translator >translated.log). It would also be
"backwards compatible" with the logs that we've been gathering so far :)
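
The translator stage of that pipeline could be sketched roughly as below, in
Python. The input layout (date, time with milliseconds, level, logger name,
message) is only an assumption modelled on a typical log4j pattern, and the
key=value output is likewise hypothetical; the real Swift log layout may
differ. The names translate_line and translate_lines are made up for this
sketch.

```python
import re

# Assumed input layout (hypothetical):
#   "2007-06-25 09:19:00,123 INFO  LoggerName message text"
LINE_RE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}),(?P<ms>\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+(?P<logger>\S+)\s+(?P<msg>.*)$"
)


def translate_line(line):
    """Return the line in key=value form, or None if it does not match."""
    m = LINE_RE.match(line.rstrip("\n"))
    if m is None:
        return None
    fields = m.groupdict()
    # Replace double quotes so the quoted msg value stays unambiguous.
    fields["msg"] = fields["msg"].replace('"', "'")
    return 'ts={date}T{time}.{ms} level={level} logger={logger} msg="{msg}"'.format(**fields)


def translate_lines(lines):
    """Translate an iterable of lines lazily, skipping unrecognised ones.

    Because it works line by line, it can consume a growing log.
    """
    for line in lines:
        out = translate_line(line)
        if out is not None:
            yield out


if __name__ == "__main__":
    # Made-up sample input; the logger name is not a real Swift logger.
    sample = [
        "2007-06-25 09:19:00,123 INFO  ExampleLogger Job started\n",
        "a continuation line that does not match\n",
    ]
    for out in translate_lines(sample):
        print(out)
```

For live use behind tail -f, the demo at the bottom would be replaced by a
loop over sys.stdin that prints each translated line with flush=True. The same
function works unchanged on logs that have already been collected.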

> 
> > 3. An adaptor plugged in at the log4j (or whatever logging library)
> > level that does the translation dynamically (at the expense of
> > performance).
> 
> Possibly some components can output stuff into a common format using log4j 
> - that wouldn't necessarily be any more dynamic than the existing log4j 
> output.

Right, but that would not be the general case.

> 


From tiberius at ci.uchicago.edu  Mon Jun 25 10:19:20 2007
From: tiberius at ci.uchicago.edu (Tiberiu Stef-Praun)
Date: Mon, 25 Jun 2007 10:19:20 -0500
Subject: [Swift-devel] Swift Performance Data
In-Reply-To: <1182779293.5910.10.camel@blabla.mcs.anl.gov>
References: <4679FBC8.1080606@mcs.anl.gov> <467A6D78.4020702@mcs.anl.gov>
	<467A7AC6.7020400@mcs.anl.gov> <467A7E17.5000207@mcs.anl.gov>
	<Pine.LNX.4.64.0706251157360.15250@dildano.hawaga.org.uk>
	<1975377455-1182773680-cardhu_decombobulator_blackberry.rim.net-423416081-@bxe006.bisx.prod.on.blackberry>
	<Pine.LNX.4.64.0706251215390.18874@dildano.hawaga.org.uk>
	<525747895-1182774044-cardhu_decombobulator_blackberry.rim.net-1528886809-@bxe006.bisx.prod.on.blackberry>
	<1560635678-1182774268-cardhu_decombobulator_blackberry.rim.net-1158930358-@bxe006.bisx.prod.on.blackberry>
	<1182779293.5910.10.camel@blabla.mcs.anl.gov>
Message-ID: <fec1351f0706250819u5fd3a8y7cee15d4c38d9dc2@mail.gmail.com>

All the logs that I consider relevant I save into the /results
directory under each application, in the SwiftApps SVN.


On 6/25/07, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> On Mon, 2007-06-25 at 12:23 +0000, Ian Foster wrote:
> > However, I'd like to mention that my specific request is that we start collecting and storing logs from all runs. Using standard formats and netlogger may well be a good idea, but I'd feel concerned that putting that on the critical path would delay yet further the day when we achieve the primary goal.
>
> Yes, I mentioned that to Tibi (I think) a while ago. The effort of
> collecting logs is minimal, and once the tools are there, we could
> easily analyze previous runs.
>
> Mihael
>
> >
> > Ian
> >
> >
> >
> > Sent via BlackBerry from T-Mobile
> >
> > -----Original Message-----
> > From: "Ian Foster" <itf at mcs.anl.gov>
> >
> > Date: Mon, 25 Jun 2007 12:19:26
> > To:"Ben Clifford" <benc at hawaga.org.uk>
> > Cc:"Ian Foster" <foster at mcs.anl.gov>, swift-devel at ci.uchicago.edu
> > Subject: Re: [Swift-devel] Swift Performance Data
> >
> >
> > I suggested them a while ago, there was some reason given for not using them, but I can't recall what it was. It may just have been NIH.
> >
> > Ian
> >
> > Sent via BlackBerry from T-Mobile
> >
> > -----Original Message-----
> > From: Ben Clifford <benc at hawaga.org.uk>
> >
> > Date: Mon, 25 Jun 2007 12:16:31
> > To:Ian Foster <itf at mcs.anl.gov>
> > Cc:Ian Foster <foster at mcs.anl.gov>, swift-devel at ci.uchicago.edu
> > Subject: Re: [Swift-devel] Swift Performance Data
> >
> >
> >
> > On Mon, 25 Jun 2007, Ian Foster wrote:
> >
> > > There are the netlogger tools
> >
> > do they use this format? If so that's fairly compelling (at least based on
> > a powerpoint presentation I saw once, rather than actual experience ;-)
> >
> > _______________________________________________
> > Swift-devel mailing list
> > Swift-devel at ci.uchicago.edu
> > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>
>


-- 
Tiberiu (Tibi) Stef-Praun, PhD
Research Staff, Computation Institute
5640 S. Ellis Ave, #405
University of Chicago
http://www-unix.mcs.anl.gov/~tiberius/


From foster at mcs.anl.gov  Mon Jun 25 10:34:56 2007
From: foster at mcs.anl.gov (Ian Foster)
Date: Mon, 25 Jun 2007 10:34:56 -0500
Subject: [Swift-devel] [Fwd: Re: Q about netlogger]
Message-ID: <467FE0A0.2010805@mcs.anl.gov>


-------- Original Message --------
Subject: 	Re: Q about netlogger
Date: 	Mon, 25 Jun 2007 09:20:15 -0400
From: 	Brian Tierney <bltierney at lbl.gov>
Organization: 	LBNL
To: 	itf at mcs.anl.gov
CC: 	Jenny Schopf <jms at mcs.anl.gov>
References: 
<1794526737-1182774123-cardhu_decombobulator_blackberry.rim.net-1522300153- at bxe006.bisx.prod.on.blackberry> 


Ian Foster wrote:
> Hi,
> 
> 
> Can netlogger process logs in the new log format?

yep.


-- 

   Ian Foster, Director, Computation Institute
Argonne National Laboratory & University of Chicago
Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439
Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637
Tel: +1 630 252 4619.  Web: www.ci.uchicago.edu.
      Globus Alliance: www.globus.org.


From foster at mcs.anl.gov  Mon Jun 25 11:03:37 2007
From: foster at mcs.anl.gov (Ian Foster)
Date: Mon, 25 Jun 2007 11:03:37 -0500
Subject: [Swift-devel] Swift Performance Data
In-Reply-To: <467FCEE5.6000707@mcs.anl.gov>
References: <4679FBC8.1080606@mcs.anl.gov> <467A6D78.4020702@mcs.anl.gov>
	<467A7AC6.7020400@mcs.anl.gov><467A7E17.5000207@mcs.anl.gov>
	<Pine.LNX.4.64.0706251157360.15250@dildano.hawaga.org.uk>
	<1975377455-1182773680-cardhu_decombobulator_blackberry.rim.net-423416081-@bxe006.bisx.prod.on.blackberry>	<Pine.LNX.4.64.0706251215390.18874@dildano.hawaga.org.uk>
	<525747895-1182774044-cardhu_decombobulator_blackberry.rim.net-1528886809-@bxe006.bisx.prod.on.blackberry>	<1182779125.5910.6.camel@blabla.mcs.anl.gov>
	<Pine.LNX.4.64.0706251349370.18874@dildano.hawaga.org.uk>
	<467FCEE5.6000707@mcs.anl.gov>
Message-ID: <467FE759.50904@mcs.anl.gov>

So who is going to do this?

I've been asking about this for some time, and nothing has happened. The 
result, I think, has been a lot of confusion and delay.
>
> I agree fully with Mihael's point that we can and should start 
> gathering all execution logs into a uniformly structured gathering 
> place. Then we can organize the current log tools and determine whats 
> needed next in that area.
>


From hategan at mcs.anl.gov  Mon Jun 25 11:15:18 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 25 Jun 2007 11:15:18 -0500
Subject: [Swift-devel] Swift Performance Data
In-Reply-To: <467FE759.50904@mcs.anl.gov>
References: <4679FBC8.1080606@mcs.anl.gov> <467A6D78.4020702@mcs.anl.gov>
	<467A7AC6.7020400@mcs.anl.gov><467A7E17.5000207@mcs.anl.gov>
	<Pine.LNX.4.64.0706251157360.15250@dildano.hawaga.org.uk>
	<1975377455-1182773680-cardhu_decombobulator_blackberry.rim.net-423416081-@bxe006.bisx.prod.on.blackberry>
	<Pine.LNX.4.64.0706251215390.18874@dildano.hawaga.org.uk>
	<525747895-1182774044-cardhu_decombobulator_blackberry.rim.net-1528886809-@bxe006.bisx.prod.on.blackberry>
	<1182779125.5910.6.camel@blabla.mcs.anl.gov>
	<Pine.LNX.4.64.0706251349370.18874@dildano.hawaga.org.uk>
	<467FCEE5.6000707@mcs.anl.gov>  <467FE759.50904@mcs.anl.gov>
Message-ID: <1182788118.23226.3.camel@blabla.mcs.anl.gov>

On Mon, 2007-06-25 at 11:03 -0500, Ian Foster wrote:
> So who is going to do this?
> 
> I've been asking about this for some time, and nothing has happened. The 
> result, I think, has been a lot of confusion and delay.

Are we still talking about collecting logs? I'm a bit confused.

> >
> > I agree fully with Mihael's point that we can and should start 
> > gathering all execution logs into a uniformly structured gathering 
> > place. Then we can organize the current log tools and determine whats 
> > needed next in that area.
> >
> 


From benc at hawaga.org.uk  Mon Jun 25 11:26:42 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 25 Jun 2007 16:26:42 +0000 (GMT)
Subject: [Swift-devel] Swift Performance Data
In-Reply-To: <1182788118.23226.3.camel@blabla.mcs.anl.gov>
References: <4679FBC8.1080606@mcs.anl.gov> <467A6D78.4020702@mcs.anl.gov> 
	<467A7AC6.7020400@mcs.anl.gov><467A7E17.5000207@mcs.anl.gov> 
	<Pine.LNX.4.64.0706251157360.15250@dildano.hawaga.org.uk> 
	<1975377455-1182773680-cardhu_decombobulator_blackberry.rim.net-423416081-@bxe006.bisx.prod.on.blackberry>
	<Pine.LNX.4.64.0706251215390.18874@dildano.hawaga.org.uk> 
	<525747895-1182774044-cardhu_decombobulator_blackberry.rim.net-1528886809-@bxe006.bisx.prod.on.blackberry>
	<1182779125.5910.6.camel@blabla.mcs.anl.gov> 
	<Pine.LNX.4.64.0706251349370.18874@dildano.hawaga.org.uk>
	<467FCEE5.6000707@mcs.anl.gov> <467FE759.50904@mcs.anl.gov>
	<1182788118.23226.3.camel@blabla.mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0706251623000.18874@dildano.hawaga.org.uk>


On Mon, 25 Jun 2007, Mihael Hategan wrote:

> On Mon, 2007-06-25 at 11:03 -0500, Ian Foster wrote:
> > So who is going to do this?
> > 
> > I've been asking about this for some time, and nothing has happened. The 
> > result, I think, has been a lot of confusion and delay.
> 
> Are we still talking about collecting logs? I'm a bit confused.

I see a few of Tibi's run logs and derivative analyses in the SVN. Look 
at some of the files added between r844 and r861:

http://www.ci.uchicago.edu/trac/swift/browser/SwiftApps/Econ/results/econ-ws-Falkon-ljmap4x6j42e0.log.tiff?rev=857

No kickstart though.

Could do with more organising.

However as to 'who is going to do this?' in response to collecting data - 
the app people have to - they do the runs and it's their working 
directories that the stuff ends up in.

-- 


From wilde at mcs.anl.gov  Mon Jun 25 11:30:31 2007
From: wilde at mcs.anl.gov (Mike Wilde)
Date: Mon, 25 Jun 2007 11:30:31 -0500
Subject: [Swift-devel] Swift Performance Data
In-Reply-To: <467FE759.50904@mcs.anl.gov>
References: <4679FBC8.1080606@mcs.anl.gov> <467A6D78.4020702@mcs.anl.gov>
	<467A7AC6.7020400@mcs.anl.gov><467A7E17.5000207@mcs.anl.gov>
	<Pine.LNX.4.64.0706251157360.15250@dildano.hawaga.org.uk>
	<1975377455-1182773680-cardhu_decombobulator_blackberry.rim.net-423416081-@bxe006.bisx.prod.on.blackberry>	<Pine.LNX.4.64.0706251215390.18874@dildano.hawaga.org.uk>
	<525747895-1182774044-cardhu_decombobulator_blackberry.rim.net-1528886809-@bxe006.bisx.prod.on.blackberry>	<1182779125.5910.6.camel@blabla.mcs.anl.gov>
	<Pine.LNX.4.64.0706251349370.18874@dildano.hawaga.org.uk>
	<467FCEE5.6000707@mcs.anl.gov> <467FE759.50904@mcs.anl.gov>
Message-ID: <467FEDA7.8050000@mcs.anl.gov>

We will decide, with more discussion, on this list.

- Mike

Ian Foster wrote, On 6/25/2007 11:03 AM:
> So who is going to do this?
> 
> I've been asking about this for some time, and nothing has happened. The 
> result, I think, has been a lot of confusion and delay.
>>
>> I agree fully with Mihael's point that we can and should start 
>> gathering all execution logs into a uniformly structured gathering 
>> place. Then we can organize the current log tools and determine whats 
>> needed next in that area.
>>
> 
> 

-- 
Mike Wilde
Computation Institute, University of Chicago
Math & Computer Science Division
Argonne National Laboratory
Argonne, IL   60439    USA
tel 630-252-7497 fax 630-252-1997


From wilde at mcs.anl.gov  Mon Jun 25 11:36:29 2007
From: wilde at mcs.anl.gov (Mike Wilde)
Date: Mon, 25 Jun 2007 11:36:29 -0500
Subject: [Swift-devel] Swift Performance Data
In-Reply-To: <1182788118.23226.3.camel@blabla.mcs.anl.gov>
References: <4679FBC8.1080606@mcs.anl.gov> <467A6D78.4020702@mcs.anl.gov>	
	<467A7AC6.7020400@mcs.anl.gov><467A7E17.5000207@mcs.anl.gov>	
	<Pine.LNX.4.64.0706251157360.15250@dildano.hawaga.org.uk>	
	<1975377455-1182773680-cardhu_decombobulator_blackberry.rim.net-423416081-@bxe006.bisx.prod.on.blackberry>	
	<Pine.LNX.4.64.0706251215390.18874@dildano.hawaga.org.uk>	
	<525747895-1182774044-cardhu_decombobulator_blackberry.rim.net-1528886809-@bxe006.bisx.prod.on.blackberry>	
	<1182779125.5910.6.camel@blabla.mcs.anl.gov>	
	<Pine.LNX.4.64.0706251349370.18874@dildano.hawaga.org.uk>	
	<467FCEE5.6000707@mcs.anl.gov> <467FE759.50904@mcs.anl.gov>
	<1182788118.23226.3.camel@blabla.mcs.anl.gov>
Message-ID: <467FEF0D.1050700@mcs.anl.gov>

collecting logs and generating reports/plots easily and in a standard format, so 
we can set goals against specific metrics for each workflow/application and 
track how we are progressing against those goals, for each.

- Mike

Mihael Hategan wrote, On 6/25/2007 11:15 AM:
> On Mon, 2007-06-25 at 11:03 -0500, Ian Foster wrote:
>> So who is going to do this?
>>
>> I've been asking about this for some time, and nothing has happened. The 
>> result, I think, has been a lot of confusion and delay.
> 
> Are we still talking about collecting logs? I'm a bit confused.
> 
>>> I agree fully with Mihael's point that we can and should start 
>>> gathering all execution logs into a uniformly structured gathering 
>>> place. Then we can organize the current log tools and determine whats 
>>> needed next in that area.
>>>
> 
> 

-- 
Mike Wilde
Computation Institute, University of Chicago
Math & Computer Science Division
Argonne National Laboratory
Argonne, IL   60439    USA
tel 630-252-7497 fax 630-252-1997


From benc at hawaga.org.uk  Mon Jun 25 11:42:35 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 25 Jun 2007 16:42:35 +0000 (GMT)
Subject: [Swift-devel] Swift Performance Data
In-Reply-To: <467FEF0D.1050700@mcs.anl.gov>
References: <4679FBC8.1080606@mcs.anl.gov> <467A6D78.4020702@mcs.anl.gov> 
	<467A7AC6.7020400@mcs.anl.gov><467A7E17.5000207@mcs.anl.gov> 
	<Pine.LNX.4.64.0706251157360.15250@dildano.hawaga.org.uk> 
	<1975377455-1182773680-cardhu_decombobulator_blackberry.rim.net-423416081-@bxe006.bisx.prod.on.blackberry>
	<Pine.LNX.4.64.0706251215390.18874@dildano.hawaga.org.uk> 
	<525747895-1182774044-cardhu_decombobulator_blackberry.rim.net-1528886809-@bxe006.bisx.prod.on.blackberry>
	<1182779125.5910.6.camel@blabla.mcs.anl.gov> 
	<Pine.LNX.4.64.0706251349370.18874@dildano.hawaga.org.uk>
	<467FCEE5.6000707@mcs.anl.gov> <467FE759.50904@mcs.anl.gov>
	<1182788118.23226.3.camel@blabla.mcs.anl.gov>
	<467FEF0D.1050700@mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0706251638550.18874@dildano.hawaga.org.uk>


As far as I see, Ian asked for two separate things:

 i) most importantly (to him) the collection of raw data. that's a case of 
collecting files and not changing them.

 ii) the development of a log analysis stack. This very much overlaps with 
what CEDPS is doing and is a much bigger job. There are two fairly contradictory 
views: that we shouldn't let CEDPS get in our way, and that we should have 
a pile of hacked up tools.

i) is something for app people to do.

ii) is a much bigger development effort which really should go in the 
bugzilla and wait its turn.

On Mon, 25 Jun 2007, Mike Wilde wrote:

> collecting logs and generating reports/plots easily and in a standard format,
> so we can set goals against specific metrics for each workflow/application and
> track how we are progressing against those goals, for each.
> 
> - Mike
> 
> Mihael Hategan wrote, On 6/25/2007 11:15 AM:
> > On Mon, 2007-06-25 at 11:03 -0500, Ian Foster wrote:
> > > So who is going to do this?
> > > 
> > > I've been asking about this for some time, and nothing has happened. The
> > > result, I think, has been a lot of confusion and delay.
> > 
> > Are we still talking about collecting logs? I'm a bit confused.
> > 
> > > > I agree fully with Mihael's point that we can and should start gathering
> > > > all execution logs into a uniformly structured gathering place. Then we
> > > > can organize the current log tools and determine whats needed next in
> > > > that area.
> > > > 
> > 
> > 
> 
> 


From benc at hawaga.org.uk  Mon Jun 25 11:45:48 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 25 Jun 2007 16:45:48 +0000 (GMT)
Subject: [Swift-devel] Swift Performance Data
In-Reply-To: <Pine.LNX.4.64.0706251638550.18874@dildano.hawaga.org.uk>
References: <4679FBC8.1080606@mcs.anl.gov> <467A6D78.4020702@mcs.anl.gov> 
	<467A7AC6.7020400@mcs.anl.gov><467A7E17.5000207@mcs.anl.gov> 
	<Pine.LNX.4.64.0706251157360.15250@dildano.hawaga.org.uk> 
	<1975377455-1182773680-cardhu_decombobulator_blackberry.rim.net-423416081-@bxe006.bisx.prod.on.blackberry>
	<Pine.LNX.4.64.0706251215390.18874@dildano.hawaga.org.uk> 
	<525747895-1182774044-cardhu_decombobulator_blackberry.rim.net-1528886809-@bxe006.bisx.prod.on.blackberry>
	<1182779125.5910.6.camel@blabla.mcs.anl.gov> 
	<Pine.LNX.4.64.0706251349370.18874@dildano.hawaga.org.uk>
	<467FCEE5.6000707@mcs.anl.gov> <467FE759.50904@mcs.anl.gov>
	<1182788118.23226.3.camel@blabla.mcs.anl.gov>
	<467FEF0D.1050700@mcs.anl.gov>
	<Pine.LNX.4.64.0706251638550.18874@dildano.hawaga.org.uk>
Message-ID: <Pine.LNX.4.64.0706251645260.18874@dildano.hawaga.org.uk>


On Mon, 25 Jun 2007, Ben Clifford wrote:

> what CEDPS and is a much bigger job. There are two fairly contradictory 
> views: that we shouldn't let CEDPS get in our way, and that we should have 
> a pile of hacked up tools.

oops, that we *shouldn't* have a pile of hacked up tools, I meant.

-- 


From foster at mcs.anl.gov  Mon Jun 25 11:47:26 2007
From: foster at mcs.anl.gov (Ian Foster)
Date: Mon, 25 Jun 2007 11:47:26 -0500
Subject: [Swift-devel] Swift Performance Data
In-Reply-To: <Pine.LNX.4.64.0706251638550.18874@dildano.hawaga.org.uk>
References: <4679FBC8.1080606@mcs.anl.gov> <467A6D78.4020702@mcs.anl.gov>
	<467A7AC6.7020400@mcs.anl.gov><467A7E17.5000207@mcs.anl.gov>
	<Pine.LNX.4.64.0706251157360.15250@dildano.hawaga.org.uk>
	<1975377455-1182773680-cardhu_decombobulator_blackberry.rim.net-423416081-@bxe006.bisx.prod.on.blackberry>
	<Pine.LNX.4.64.0706251215390.18874@dildano.hawaga.org.uk>
	<525747895-1182774044-cardhu_decombobulator_blackberry.rim.net-1528886809-@bxe006.bisx.prod.on.blackberry>
	<1182779125.5910.6.camel@blabla.mcs.anl.gov>
	<Pine.LNX.4.64.0706251349370.18874@dildano.hawaga.org.uk>
	<467FCEE5.6000707@mcs.anl.gov> <467FE759.50904@mcs.anl.gov>
	<1182788118.23226.3.camel@blabla.mcs.anl.gov>
	<467FEF0D.1050700@mcs.anl.gov>
	<Pine.LNX.4.64.0706251638550.18874@dildano.hawaga.org.uk>
Message-ID: <467FF19E.8000000@mcs.anl.gov>

I think I disagree on both points:

i) I think that we could produce some tools that would grab the relevant 
files and stick them somewhere central. I've proposed some ideas in 
previous emails.

ii) Of course we can imagine sophisticated tools. But there are also 
tools that various of you have already produced that generate graphs of 
various sorts. We should package these and (I think) integrate them with 
(i) so that when we grab the files we also run the programs to generate 
the graphs.
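
A rough Python sketch of such a grab-and-analyse step follows. Everything
specific in it is an assumption for illustration: the function name
collect_logs, the archive layout (archive_root/app_name/timestamp), the
default "*.log" pattern, and the idea of passing the graphing programs in as
callables. It is not an existing Swift tool.

```python
import shutil
import tempfile
import time
from pathlib import Path


def collect_logs(run_dir, archive_root, app_name, patterns=("*.log",), analyzers=()):
    """Copy matching files from run_dir into archive_root/app_name/<timestamp>/.

    Files are copied unchanged. Each callable in analyzers is then invoked
    with the destination directory (e.g. to generate graphs there).
    Returns the list of destination paths.
    """
    run_dir = Path(run_dir)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = Path(archive_root) / app_name / stamp
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for pattern in patterns:
        for src in sorted(run_dir.glob(pattern)):
            if src.is_file():
                target = dest / src.name
                shutil.copy2(src, target)  # copy2 also preserves timestamps
                copied.append(target)
    for analyze in analyzers:
        analyze(dest)
    return copied


if __name__ == "__main__":
    # Self-contained demo in a temporary directory.
    with tempfile.TemporaryDirectory() as tmp:
        run = Path(tmp) / "run"
        run.mkdir()
        (run / "example.log").write_text("sample log line\n")
        files = collect_logs(run, Path(tmp) / "archive", "demo-app",
                             analyzers=[lambda d: print("would plot logs in", d)])
        print([f.name for f in files])
```

Keeping the copy step free of any parsing means collection can start right
away, and the analyzers can be added or rerun on old runs later.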

Ian.


Ben Clifford wrote:
> As far as I see, Ian asked for two separate things:
>
>  i) most importantly (to him) the collection of raw data. that's a case of 
> collecting files and not changing them.
>
>  ii) the development of a log analysis stack. This very much overlaps with 
> what CEDPS and is a much bigger job. There are two fairly contradictory 
> views: that we shouldn't let CEDPS get in our way, and that we should have 
> a pile of hacked up tools.
>
> i) is something for app people to do.
>
> ii) is a much bigger development effort which really should go in the 
> bugzilla and wait its turn.
>
> On Mon, 25 Jun 2007, Mike Wilde wrote:
>
>   
>> collecting logs and generating reports/plots easily and in a standard format,
>> so we can set goals against specific metrics for each workflow/application and
>> track how we are progressing against those goals, for each.
>>
>> - Mike
>>
>> Mihael Hategan wrote, On 6/25/2007 11:15 AM:
>>> On Mon, 2007-06-25 at 11:03 -0500, Ian Foster wrote:
>>>> So who is going to do this?
>>>>
>>>> I've been asking about this for some time, and nothing has happened. The
>>>> result, I think, has been a lot of confusion and delay.
>>> Are we still talking about collecting logs? I'm a bit confused.
>>>
>>>>> I agree fully with Mihael's point that we can and should start gathering
>>>>> all execution logs into a uniformly structured gathering place. Then we
>>>>> can organize the current log tools and determine whats needed next in
>>>>> that area.
>

-- 

   Ian Foster, Director, Computation Institute
Argonne National Laboratory & University of Chicago
Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439
Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637
Tel: +1 630 252 4619.  Web: www.ci.uchicago.edu.
      Globus Alliance: www.globus.org.


From benc at hawaga.org.uk  Mon Jun 25 11:49:12 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 25 Jun 2007 16:49:12 +0000 (GMT)
Subject: [Swift-devel] Swift Performance Data
In-Reply-To: <467FF19E.8000000@mcs.anl.gov>
References: <4679FBC8.1080606@mcs.anl.gov> <467A6D78.4020702@mcs.anl.gov> 
	<467A7AC6.7020400@mcs.anl.gov><467A7E17.5000207@mcs.anl.gov> 
	<Pine.LNX.4.64.0706251157360.15250@dildano.hawaga.org.uk> 
	<1975377455-1182773680-cardhu_decombobulator_blackberry.rim.net-423416081-@bxe006.bisx.prod.on.blackberry>
	<Pine.LNX.4.64.0706251215390.18874@dildano.hawaga.org.uk> 
	<525747895-1182774044-cardhu_decombobulator_blackberry.rim.net-1528886809-@bxe006.bisx.prod.on.blackberry>
	<1182779125.5910.6.camel@blabla.mcs.anl.gov> 
	<Pine.LNX.4.64.0706251349370.18874@dildano.hawaga.org.uk>
	<467FCEE5.6000707@mcs.anl.gov> <467FE759.50904@mcs.anl.gov>
	<1182788118.23226.3.camel@blabla.mcs.anl.gov>
	<467FEF0D.1050700@mcs.anl.gov>
	<Pine.LNX.4.64.0706251638550.18874@dildano.hawaga.org.uk>
	<467FF19E.8000000@mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0706251648450.18874@dildano.hawaga.org.uk>


On Mon, 25 Jun 2007, Ian Foster wrote:

>  We should package these and (I think) integrate them with (i) so that
> when we grab the files we also run the programs to generate the graphs.

yes, like I said:


> >  ii) the development of a log analysis stack. This very much overlaps with
> > what CEDPS and is a much bigger job.

-- 


From hategan at mcs.anl.gov  Mon Jun 25 11:54:38 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 25 Jun 2007 11:54:38 -0500
Subject: [Swift-devel] Swift Performance Data
In-Reply-To: <Pine.LNX.4.64.0706251645260.18874@dildano.hawaga.org.uk>
References: <4679FBC8.1080606@mcs.anl.gov> <467A6D78.4020702@mcs.anl.gov>
	<467A7AC6.7020400@mcs.anl.gov><467A7E17.5000207@mcs.anl.gov>
	<Pine.LNX.4.64.0706251157360.15250@dildano.hawaga.org.uk>
	<1975377455-1182773680-cardhu_decombobulator_blackberry.rim.net-423416081-@bxe006.bisx.prod.on.blackberry>
	<Pine.LNX.4.64.0706251215390.18874@dildano.hawaga.org.uk>
	<525747895-1182774044-cardhu_decombobulator_blackberry.rim.net-1528886809-@bxe006.bisx.prod.on.blackberry>
	<1182779125.5910.6.camel@blabla.mcs.anl.gov>
	<Pine.LNX.4.64.0706251349370.18874@dildano.hawaga.org.uk>
	<467FCEE5.6000707@mcs.anl.gov> <467FE759.50904@mcs.anl.gov>
	<1182788118.23226.3.camel@blabla.mcs.anl.gov>
	<467FEF0D.1050700@mcs.anl.gov>
	<Pine.LNX.4.64.0706251638550.18874@dildano.hawaga.org.uk>
	<Pine.LNX.4.64.0706251645260.18874@dildano.hawaga.org.uk>
Message-ID: <1182790478.29751.7.camel@blabla.mcs.anl.gov>

On Mon, 2007-06-25 at 16:45 +0000, Ben Clifford wrote:
> 
> On Mon, 25 Jun 2007, Ben Clifford wrote:
> 
> > what CEDPS and is a much bigger job. There are two fairly contradictory 
> > views: that we shouldn't let CEDPS get in our way, and that we should have 
> > a pile of hacked up tools.
> 
> oops, that we *shouldn't* have a pile of hacked up tools, I meant.

I think this view is a bit simplified. We should evaluate CEDPS based on
the value it has to offer and the costs that adapting to it involves.
Unfortunately it's hard to make a decision about the future value of
CEDPS. What I've seen so far is a structured logging format, for which
relevant analysis tools may or may not exist in the future. But it may
very well be that we'll have to write these tools ourselves, in which
case we're left with a format, an issue which we've discussed before.

Mihael

> 


From iraicu at cs.uchicago.edu  Mon Jun 25 12:11:51 2007
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Mon, 25 Jun 2007 12:11:51 -0500
Subject: [Swift-devel] Swift Performance Data
In-Reply-To: <1182788118.23226.3.camel@blabla.mcs.anl.gov>
References: <4679FBC8.1080606@mcs.anl.gov>
	<467A6D78.4020702@mcs.anl.gov>	<467A7AC6.7020400@mcs.anl.gov><467A7E17.5000207@mcs.anl.gov>	<Pine.LNX.4.64.0706251157360.15250@dildano.hawaga.org.uk>	<1975377455-1182773680-cardhu_decombobulator_blackberry.rim.net-423416081-@bxe006.bisx.prod.on.blackberry>	<Pine.LNX.4.64.0706251215390.18874@dildano.hawaga.org.uk>	<525747895-1182774044-cardhu_decombobulator_blackberry.rim.net-1528886809-@bxe006.bisx.prod.on.blackberry>	<1182779125.5910.6.camel@blabla.mcs.anl.gov>	<Pine.LNX.4.64.0706251349370.18874@dildano.hawaga.org.uk>	<467FCEE5.6000707@mcs.anl.gov>
	<467FE759.50904@mcs.anl.gov>
	<1182788118.23226.3.camel@blabla.mcs.anl.gov>
Message-ID: <467FF757.1030708@cs.uchicago.edu>

Here is my 2c of experience in trying to draw up graphs of various 
experiments.  I make a clear distinction between (1) logs that will be 
used for debugging/info, which are in a relatively human readable format, 
and (2) logs that will be used for plotting graphs.  The human 
readable logs (1) are almost always written in response to events in the 
system.  On the other hand, the logs that are geared towards graphing 
(2) are mostly based on fixed time intervals, and a few are based 
on events. 

For example, in Falkon, I have the following set of logs:
1) Falkon dispatcher log (1 for the entire Falkon system) with 
debug/info level human readable logs, and it typically writes to this 
log for events related to the task dispatch and notifications that 
happen in the Falkon service; this log is currently only used for 
debugging purposes.

2) Falkon provisioner log (1 for the entire Falkon system) with 
debug/info level human readable logs, and it typically writes to this 
log for events related to the allocation of resources; this log is 
currently only used for debugging purposes.

3) Executor logs (1 per executor, separated into different files); these 
are also for human consumption, and at the most detailed logging level 
they print out even the STDOUT and STDERR of the task executions!  These 
logs are not aggregated in any way currently, and are mostly used for 
debugging purposes.

4) Task description log (1 for the entire Falkon system), which stores 
the description of each task executed (i.e. TIMESTAMP, APPLICATION_ID, 
EXECUTABLE, ARGUEMENTS, ENVIRONMENT); I have not used this log yet for 
anything, but I envision we could use it for workload characterization, 
studies involving replaying an entire workload, etc...

5) Summary log (1 for the entire Falkon system) with an easy to parse 
format for automatic graph generation; this log is generated on fixed 
time intervals, in which some of the Falkon state is summarized for the 
duration of that period; the kind of state information that goes in this 
log is: TimeStamp_ms num_users num_resources num_threads num_all_workers 
num_free_workers num_pend_workers num_busy_workers waitQ_length 
waitNotQ_length activeQ_length doneQ_length delivered_tasks 
throughput_tasks/sec; this log can be used to plot the number of 
executors registered, active, idle, the queue length, the throughput of 
task delivered, etc... as the experiment progresses.  In my latest 
development branch, I actually have a few more parameters that I am 
logging, such as CPU utilization, free memory, data caching hit rates, 
etc...

6) Per task log (1 for the entire Falkon system) that has information on 
each task executed in Falkon; this log is used to plot the per task info 
as the experiment progresses.  The information that is kept on each task 
is: taskID workerID startTime endTime waitQueueTime execTime 
resultsQueueTime totalTime exitCode; this log can also be used to plot 
the per worker information, to see how the tasks were dispersed over the 
workers...

7) User information log (1 for the entire Falkon system) that stores 
information relevant for the end user, and is updated every time the 
state (wait, active, done) changes for any task; the information that 
this log contains is: Time_ms Users Resources JVM_Threads WaitingTasks 
ActiveTasks DoneTasks DeliveredTasks; I have not used this log for 
anything yet, but it has much finer-grained information than the 
summary log (5), so more detailed graphs/analysis could be generated from 
this log.

8) Worker information log (1 for the entire Falkon system) that stores 
information about the workers' state changes and is updated every time 
the state (free, pending, busy) changes for any worker; the information 
that this log contains is: Time_ms RegisteredWorkers FreeWorkers 
PendWorkers BusyWorkers; again, I have not used this log for anything 
yet, but it has much finer-grained information than the summary log 
(5), so more detailed graphs/analysis could be generated from this log.
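
To illustrate the difference between the event-driven logs (7, 8) and the 
fixed-interval summary log (5), here is a minimal Python sketch that 
resamples an event-driven worker log onto fixed time intervals.  The column 
layout follows the fields listed for log (8); the sample values and the 
function name are made up for illustration, and this is not Falkon code.

```python
from bisect import bisect_right


def resample(events, interval_ms):
    """Turn event-driven rows (time_ms, *state) into fixed-interval rows.

    Each output row carries the most recent state at or before the tick.
    `events` must already be sorted by time_ms.
    """
    if not events:
        return []
    times = [e[0] for e in events]
    rows = []
    tick = times[0]
    while tick <= times[-1]:
        idx = bisect_right(times, tick) - 1   # last event at or before tick
        rows.append((tick,) + tuple(events[idx][1:]))
        tick += interval_ms
    return rows


# Hypothetical data: Time_ms RegisteredWorkers FreeWorkers PendWorkers BusyWorkers
events = [
    (0,    2, 2, 0, 0),
    (150,  2, 1, 1, 0),
    (400,  2, 1, 0, 1),
    (2100, 2, 2, 0, 0),
]
samples = resample(events, 1000)
for row in samples:
    print(" ".join(str(v) for v in row))
```

Note that short-lived states between two ticks (the pending worker at 150 ms 
in this made-up data) disappear from the resampled output, which is one 
reason to keep both kinds of log.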


Now, as a summary, I use (5) and (6) a lot to generate the graphs that I 
do for Falkon.  I have not used (7) and (8) yet, but might in the 
future.  It's also relatively easy to add new state information to 
these existing logs, since they are all localized in a few places: with 
little effort, I can add new metrics to monitor, or create a completely 
new log for information that is not easy to integrate into the 
existing logs.  For simplicity, my perf logs (5-8) are all simple 
space-delimited logs; for example, the per task log (6) looks like this:

> taskID workerID startTime endTime waitQueueTime execTime resultsQueueTime totalTime exitCode
> tg-viz-login1.uc.teragrid.org:50103:1_1326356873 tg-c058.uc.teragrid.org:50100 1182533457601 1182533985431 467599 60225 6 527830 0
> tg-viz-login1.uc.teragrid.org:50103:2_1124048393 tg-c052.uc.teragrid.org:50100 1182533457613 1182533985454 467735 60101 5 527841 0
> tg-viz-login1.uc.teragrid.org:50103:3_1648367237 tg-c053.uc.teragrid.org:50100 1182533457616 1182533985524 467760 60138 10 527908 0

They could be converted to XML or any other format you want, but this is 
a nice format for programs like ploticus or gnuplot to understand easily. 
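
As a minimal sketch of how little parsing this format needs, the Python 
below loads the per-task sample shown above into named records.  The field 
names come from the header line of the sample; the function and variable 
names are illustrative only, and the reading of the values as milliseconds 
is an assumption based on the sample, where startTime and endTime differ by 
exactly totalTime.

```python
from collections import namedtuple

FIELDS = ("taskID workerID startTime endTime waitQueueTime execTime "
          "resultsQueueTime totalTime exitCode").split()
TaskRecord = namedtuple("TaskRecord", FIELDS)


def parse_task_log(text):
    """Parse space-delimited per-task records.

    Works on a token stream, so it tolerates records wrapped across
    lines and mail-style '>' quote markers.
    """
    tokens = [t for t in text.split() if t != ">"]
    if tokens[:len(FIELDS)] == FIELDS:        # skip the header row if present
        tokens = tokens[len(FIELDS):]
    records = []
    for i in range(0, len(tokens) - len(FIELDS) + 1, len(FIELDS)):
        chunk = tokens[i:i + len(FIELDS)]
        # First two fields are identifiers, the remaining seven are integers.
        records.append(TaskRecord(chunk[0], chunk[1], *map(int, chunk[2:])))
    return records


SAMPLE = """\
taskID workerID startTime endTime waitQueueTime execTime resultsQueueTime totalTime exitCode
tg-viz-login1.uc.teragrid.org:50103:1_1326356873 tg-c058.uc.teragrid.org:50100 1182533457601 1182533985431 467599 60225 6 527830 0
tg-viz-login1.uc.teragrid.org:50103:2_1124048393 tg-c052.uc.teragrid.org:50100 1182533457613 1182533985454 467735 60101 5 527841 0
tg-viz-login1.uc.teragrid.org:50103:3_1648367237 tg-c053.uc.teragrid.org:50100 1182533457616 1182533985524 467760 60138 10 527908 0
"""

records = parse_task_log(SAMPLE)
for r in records:
    # Print a short task label and the execution time in seconds.
    print(r.taskID.rsplit(":", 1)[-1], r.execTime / 1000.0)
```

In the three sample records, waitQueueTime + execTime + resultsQueueTime 
adds up to totalTime, which makes a cheap sanity check when loading a log.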

On the other hand, my debug logs (1-4) are all handled via log4j, look 
more like the traditional logs that log4j generates and people are 
accustomed to, but from my point of view, these are tedious and 
error-prone to parse for graphing purposes.

Does this distinction (human readable vs. machine readable) between logs 
exist in Swift?  If not, I would argue against modifying the debug/info 
logs, and for creating new logs that are specifically targeted at automatic 
graph generation, such as my logs (5-8).  If we are to use tools that 
others have built, then we just need to make sure these new logs conform 
to the appropriate format; if we are to write our own tools (or we 
already have them), then we have as much freedom as we want in choosing 
the format of these logs.

Ioan


Mihael Hategan wrote:
> On Mon, 2007-06-25 at 11:03 -0500, Ian Foster wrote:
>   
>> So who is going to do this?
>>
>> I've been asking about this for some time, and nothing has happened. The 
>> result, I think, has been a lot of confusion and delay.
>>     
>
> Are we still talking about collecting logs? I'm a bit confused.
>
>   
>>> I agree fully with Mihael's point that we can and should start 
>>> gathering all execution logs into a uniformly structured gathering 
>>> place. Then we can organize the current log tools and determine whats 
>>> needed next in that area.
>>>
>>>       
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>
>   

-- 
============================================
Ioan Raicu
Ph.D. Student
============================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
============================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
       http://dsl.cs.uchicago.edu/
============================================
============================================

From foster at mcs.anl.gov  Mon Jun 25 12:12:49 2007
From: foster at mcs.anl.gov (Ian Foster)
Date: Mon, 25 Jun 2007 12:12:49 -0500
Subject: [Swift-devel] Swift Performance Data
In-Reply-To: <Pine.LNX.4.64.0706251648450.18874@dildano.hawaga.org.uk>
References: <4679FBC8.1080606@mcs.anl.gov> <467A6D78.4020702@mcs.anl.gov>
	<467A7AC6.7020400@mcs.anl.gov><467A7E17.5000207@mcs.anl.gov>
	<Pine.LNX.4.64.0706251157360.15250@dildano.hawaga.org.uk>
	<1975377455-1182773680-cardhu_decombobulator_blackberry.rim.net-423416081-@bxe006.bisx.prod.on.blackberry>
	<Pine.LNX.4.64.0706251215390.18874@dildano.hawaga.org.uk>
	<525747895-1182774044-cardhu_decombobulator_blackberry.rim.net-1528886809-@bxe006.bisx.prod.on.blackberry>
	<1182779125.5910.6.camel@blabla.mcs.anl.gov>
	<Pine.LNX.4.64.0706251349370.18874@dildano.hawaga.org.uk>
	<467FCEE5.6000707@mcs.anl.gov> <467FE759.50904@mcs.anl.gov>
	<1182788118.23226.3.camel@blabla.mcs.anl.gov>
	<467FEF0D.1050700@mcs.anl.gov>
	<Pine.LNX.4.64.0706251638550.18874@dildano.hawaga.org.uk>
	<467FF19E.8000000@mcs.anl.gov>
	<Pine.LNX.4.64.0706251648450.18874@dildano.hawaga.org.uk>
Message-ID: <467FF791.3090902@mcs.anl.gov>

great ... then we agree ...

Ben Clifford wrote:
> On Mon, 25 Jun 2007, Ian Foster wrote:
>
>   
>>  We should package these and (I think) integrate them with (i) so that
>> when we grab the files we also run the programs to generate the graphs.
>>     
>
> yes, like I said:
>
>
>   
>>>  ii) the development of a log analysis stack. This very much overlaps with
>>> what CEDPS and is a much bigger job.
>>>       
>
>   

-- 

   Ian Foster, Director, Computation Institute
Argonne National Laboratory & University of Chicago
Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439
Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637
Tel: +1 630 252 4619.  Web: www.ci.uchicago.edu.
      Globus Alliance: www.globus.org.

From benc at hawaga.org.uk  Mon Jun 25 12:42:23 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 25 Jun 2007 17:42:23 +0000 (GMT)
Subject: [Swift-devel] Swift Performance Data
In-Reply-To: <467FF757.1030708@cs.uchicago.edu>
References: <4679FBC8.1080606@mcs.anl.gov> <467A6D78.4020702@mcs.anl.gov>
	<467A7AC6.7020400@mcs.anl.gov><467A7E17.5000207@mcs.anl.gov>
	<Pine.LNX.4.64.0706251157360.15250@dildano.hawaga.org.uk>
	<1975377455-1182773680-cardhu_decombobulator_blackberry.rim.net-423416081-@bxe006.bisx.prod.on.blackberry>
	<Pine.LNX.4.64.0706251215390.18874@dildano.hawaga.org.uk>
	<525747895-1182774044-cardhu_decombobulator_blackberry.rim.net-1528886809-@bxe006.bisx.prod.on.blackberry>
	<1182779125.5910.6.camel@blabla.mcs.anl.gov>
	<Pine.LNX.4.64.0706251349370.18874@dildano.hawaga.org.uk>
	<467FCEE5.6000707@mcs.anl.gov> <467FE759.50904@mcs.anl.gov>
	<1182788118.23226.3.camel@blabla.mcs.anl.gov>
	<467FF757.1030708@cs.uchicago.edu>
Message-ID: <Pine.LNX.4.64.0706251737250.15250@dildano.hawaga.org.uk>


On Mon, 25 Jun 2007, Ioan Raicu wrote:

>  On the other hand, the logs
> that are geared towards graphing them (2) are mostly based on fixed time
> intervals, and a few are based on events. 

right. I think there's a need for both (eg. compare task queue lengths or 
CPU load against job lifetime lines).

> also relatively easy to add new state information to log to these existing
> logs since they are all localized in a few places, with little effort, I can
> add new metrics to monitor

but only when those metrics are somehow associated with Falkon? One of the 
interesting things to do, I think, is to be able to get a job lifetime 
line that goes from when Swift decides the job exists all the way through 
to when Swift decides the job has finished, with the two/three colour job 
lines for jobs being inside Falkon as part of that lifetime line.
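
A minimal Python sketch of how such a combined lifetime line could be 
assembled for one job.  It assumes that the Swift-side start and end times 
and the Falkon per-task timings for the same job can be matched up and share 
a clock; the segment names, the function, and all numbers are hypothetical.

```python
def lifetime_segments(swift_start, swift_end, falkon_start,
                      wait_ms, exec_ms, results_ms):
    """Split one job's Swift-level lifetime into (name, begin, end) segments.

    All times are in milliseconds on a common clock (an assumption:
    in practice Swift and Falkon may log on different hosts).
    """
    falkon_exec = falkon_start + wait_ms
    falkon_results = falkon_exec + exec_ms
    falkon_end = falkon_results + results_ms
    if not (swift_start <= falkon_start and falkon_end <= swift_end):
        raise ValueError("Falkon interval is not inside the Swift interval")
    return [
        ("swift-pre", swift_start, falkon_start),
        ("falkon-wait", falkon_start, falkon_exec),
        ("falkon-exec", falkon_exec, falkon_results),
        ("falkon-results", falkon_results, falkon_end),
        ("swift-post", falkon_end, swift_end),
    ]


segs = lifetime_segments(swift_start=1000, swift_end=9000, falkon_start=1500,
                         wait_ms=4000, exec_ms=3000, results_ms=10)
for name, begin, end in segs:
    print(name, begin, end, end - begin)
```

Each segment can then be drawn in its own colour along the job's line; the 
explicit containment check guards against clock skew between the two logs.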

> On the other hand, my debug logs (1-4) are all handled via log4j, look more
> like the traditional logs that log4j generates and people are accustomed to,
> but from my point of view, these are tedious and error-prone to parse for
> graphing purposes.

log4j can easily be configured to output different formats - so we could 
have human readable logs in one format and machine readable logs logging 
different information in a different format, I think.

> Does this distinction (human readable vs. machine readable) between logs exist
> in Swift? 

A little bit. The data in the swift/karajan logs is mostly intended for 
human consumption; the data in kickstart records is very much more 
structured and intended to be both human readable and machine readable.

More machine readable edge and level based logging from inside swift and 
inside karajan could be useful, I think.

-- 


From iraicu at cs.uchicago.edu  Mon Jun 25 12:51:37 2007
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Mon, 25 Jun 2007 12:51:37 -0500
Subject: [Swift-devel] Swift Performance Data
In-Reply-To: <Pine.LNX.4.64.0706251737250.15250@dildano.hawaga.org.uk>
References: <4679FBC8.1080606@mcs.anl.gov> <467A6D78.4020702@mcs.anl.gov>
	<467A7AC6.7020400@mcs.anl.gov><467A7E17.5000207@mcs.anl.gov>
	<Pine.LNX.4.64.0706251157360.15250@dildano.hawaga.org.uk>
	<1975377455-1182773680-cardhu_decombobulator_blackberry.rim.net-423416081-@bxe006.bisx.prod.on.blackberry>
	<Pine.LNX.4.64.0706251215390.18874@dildano.hawaga.org.uk>
	<525747895-1182774044-cardhu_decombobulator_blackberry.rim.net-1528886809-@bxe006.bisx.prod.on.blackberry>
	<1182779125.5910.6.camel@blabla.mcs.anl.gov>
	<Pine.LNX.4.64.0706251349370.18874@dildano.hawaga.org.uk>
	<467FCEE5.6000707@mcs.anl.gov> <467FE759.50904@mcs.anl.gov>
	<1182788118.23226.3.camel@blabla.mcs.anl.gov>
	<467FF757.1030708@cs.uchicago.edu>
	<Pine.LNX.4.64.0706251737250.15250@dildano.hawaga.org.uk>
Message-ID: <468000A9.3000908@cs.uchicago.edu>


Ben Clifford wrote:
> On Mon, 25 Jun 2007, Ioan Raicu wrote:
>
>   
>>  On the other hand, the logs
>> that are geared towards graphing them (2) are mostly based on fixed time
>> intervals, and a few are based on events. 
>>     
>
> right. I think there's a need for both (eg. compare task queue lengths or 
> CPU load against job lifetime lines).
>
>   
>> also relatively easy to add new state information to log to these existing
>> logs since they are all localized in a few places, with little effort, I can
>> add new metrics to monitor
>>     
>
> but only when those metrics are somehow associated with Falkon? One of the 
> interesting things to do, I think, is to be able to get a job lifetime 
> line that goes from when Swift decides the job exists all the way through 
> to when Swift decides the job has finished, with the two/three colour job 
> lines for jobs being inside Falkon as part of that lifetime line.
>   
Right, and I think we can do this from the Swift logs, including the 
preprocessing time in Swift, the postprocessing time, plus the 
end-to-end time the task spent in Falkon, etc...  The logs that I 
mentioned are Falkon specific; in Swift, I believe this kind of 
information is parsed out of the debug/info logs (human readable) to 
come up with the machine readable logs for graphing.  We 
(Yong and I) had some trouble in the past generating these graphs from 
the Swift logs, as the logs did not always contain all the information we 
needed to draw the graph, or the parsing would fail, and we had to 
manually fix the problem in the logs and try the parsing again!
>   
>> On the other hand, my debug logs (1-4) are all handled via log4j, look more
>> like the traditional logs that log4j generates and people are accustomed to,
>> but from my point of view, these are tedious and error-prone to parse for
>> graphing purposes.
>>     
>
> log4j can easily be configured to output different formats - so we could 
> have human readable logs in one format and machine readable logs logging 
> different information in a different format, I think.
>   
OK, that's good!
>   
>> Does this distinction (human readable vs. machine readable) between logs exist
>> in Swift? 
>>     
>
> A little bit. The data in the swift/karajan logs is mostly intended for 
> human consumption; the data in kickstart records is very much more 
> structured and intended to be both human readable and machine readable.
>
> More machine readable edge and level based logging from inside swift and 
> inside karajan could be useful, I think.
>   
Right, but kickstart logs are all in separate files, so to really make 
sense of them in a programmatic way and to plot them on 1 graph, there 
needs to be an aggregating step that either just merges these files 
together in some orderly way, or might even summarize the data for 
easier graphing.  From my understanding of the kickstart records, I 
think it's hard to generate overview graphs of an entire run due to the 
fact that they are kept in many files.

Ioan


From benc at hawaga.org.uk  Mon Jun 25 12:55:04 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 25 Jun 2007 17:55:04 +0000 (GMT)
Subject: [Swift-devel] Swift Performance Data
In-Reply-To: <468000A9.3000908@cs.uchicago.edu>
References: <4679FBC8.1080606@mcs.anl.gov> <467A6D78.4020702@mcs.anl.gov>
	<467A7AC6.7020400@mcs.anl.gov><467A7E17.5000207@mcs.anl.gov>
	<Pine.LNX.4.64.0706251157360.15250@dildano.hawaga.org.uk>
	<1975377455-1182773680-cardhu_decombobulator_blackberry.rim.net-423416081-@bxe006.bisx.prod.on.blackberry>
	<Pine.LNX.4.64.0706251215390.18874@dildano.hawaga.org.uk>
	<525747895-1182774044-cardhu_decombobulator_blackberry.rim.net-1528886809-@bxe006.bisx.prod.on.blackberry>
	<1182779125.5910.6.camel@blabla.mcs.anl.gov>
	<Pine.LNX.4.64.0706251349370.18874@dildano.hawaga.org.uk>
	<467FCEE5.6000707@mcs.anl.gov> <467FE759.50904@mcs.anl.gov>
	<1182788118.23226.3.camel@blabla.mcs.anl.gov>
	<467FF757.1030708@cs.uchicago.edu>
	<Pine.LNX.4.64.0706251737250.15250@dildano.hawaga.org.uk>
	<468000A9.3000908@cs.uchicago.edu>
Message-ID: <Pine.LNX.4.64.0706251753220.18874@dildano.hawaga.org.uk>


On Mon, 25 Jun 2007, Ioan Raicu wrote:

> the machine readable logs for graphing.  We (Yong and I) had some trouble in
> the past generating these graphs from the Swift logs as the logs did not
> always contain all the information we needed to draw the graph, or the parsing
> would fail, and we had to manually fix the problem in the logs and try again
> the parsing!

The log messages aren't fixed in stone so if there are small changes that 
would be useful for this, make them or bring them up on this list.


> Right, but kickstart logs are all in separate files, so to really make sense
> of them in a programatic way and to plot them on 1 graph, there needs to be an
> aggreting step that either just merges these files together in some orderly
> way, or it mights even usmmarize the data for easier graphing.  From my
> understanding of the kickstart records, I think its hard to generate overview
> graphs of an entire run due to the fact that they are kept in many files.

A commandline XSLT processor and a for-loop in bash might do it.
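
As an alternative to XSLT plus a shell loop, the same aggregation step can 
be sketched in a few lines of Python.  This is only a sketch: it assumes 
each record is a small XML file whose root element carries the values of 
interest as attributes, and the attribute names "start" and "duration", the 
file names, and the demo values are assumptions made for illustration.

```python
import glob
import os
import tempfile
import xml.etree.ElementTree as ET


def aggregate(paths, attrs=("start", "duration")):
    """Collect selected root-element attributes from many XML files.

    Returns rows of (filename, attr1, attr2, ...) sorted by the first
    attribute.  Files that fail to parse are skipped, not fatal.
    """
    rows = []
    for path in paths:
        try:
            root = ET.parse(path).getroot()
        except ET.ParseError:
            continue
        rows.append((os.path.basename(path),) +
                    tuple(root.get(a, "") for a in attrs))
    rows.sort(key=lambda r: r[1])
    return rows


# Demo with two made-up records written to a temporary directory.
demo_dir = tempfile.mkdtemp()
for name, start, duration in [("job-b.xml", "2007-06-25T12:00:05", "60.1"),
                              ("job-a.xml", "2007-06-25T12:00:01", "60.2")]:
    with open(os.path.join(demo_dir, name), "w") as f:
        f.write('<invocation start="%s" duration="%s"/>' % (start, duration))

rows = aggregate(glob.glob(os.path.join(demo_dir, "*.xml")))
for row in rows:
    print(" ".join(row))   # one space-delimited line per record
```

Sorting the start times as strings only gives chronological order when all 
timestamps use the same format and time zone.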

-- 


From wilde at mcs.anl.gov  Wed Jun 27 07:48:58 2007
From: wilde at mcs.anl.gov (Mike Wilde)
Date: Wed, 27 Jun 2007 07:48:58 -0500
Subject: [Swift-devel] Re: bugzilla change?
In-Reply-To: <Pine.LNX.4.64.0706270250440.18874@dildano.hawaga.org.uk>
References: <3E18205D-4534-4100-BFB1-1CAC389D5D76@mcs.anl.gov>
	<AB35B22D-642A-4C44-B284-C2027D427022@mcs.anl.gov>
	<Pine.LNX.4.64.0706270250440.18874@dildano.hawaga.org.uk>
Message-ID: <46825CBA.7010906@mcs.anl.gov>

The thought here was to put swift-devel on all campaign bugs to keep everyone 
informed and to encourage discussion.

Is this a good way to do things?

Should we create a swift-devel bugzilla account?

Or just multi-select all people on campaigns?

- Mike

Ben Clifford wrote, On 6/26/2007 9:52 PM:
> The list of people you get on the 'cc' drop down list is everyone who has 
> a bugzilla account. If you want more addresses there, get the person in 
> question to get a bugzilla acount and it will magically appear on the 
> list.
> 
> On Tue, 26 Jun 2007, Veronika Nefedova wrote:
> 
>> PS. I am talking about this page specifically:
>> http://bugzilla.mcs.anl.gov/swift/enter_bug.cgi?product=App-MolDyn
>>
>> On Jun 26, 2007, at 2:28 PM, Veronika Nefedova wrote:
>>
>>> Hi, Ben:
>>>
>>> do you know how I can modify the Bugzilla settings? For example, I'd like to
>>> modify the Cc field (add or remove address there), but it seem there is no
>>> such option..
>>>
>>> Thanks!
>>>
>>> Nika
>>>
> 
> 

-- 
Mike Wilde
Computation Institute, University of Chicago
Math & Computer Science Division
Argonne National Laboratory
Argonne, IL   60439    USA
tel 630-252-7497 fax 630-252-1997


From bugzilla-daemon at mcs.anl.gov  Wed Jun 27 07:56:42 2007
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Wed, 27 Jun 2007 07:56:42 -0500 (CDT)
Subject: [Swift-devel] [Bug 72] New: Campaign for scaling wf up to 244
	molecules
Message-ID: <bug-72-21@http.bugzilla.mcs.anl.gov/swift/>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=72

           Summary: Campaign for scaling wf up to 244 molecules
           Product: App-MolDyn
           Version: unspecified
          Platform: Macintosh
        OS/Version: Mac OS
            Status: NEW
          Severity: normal
          Priority: P2
         Component: FreeEnergyForMolecules
        AssignedTo: nefedova at mcs.anl.gov
        ReportedBy: nefedova at mcs.anl.gov
                CC: swift-devel at ci.uchicago.edu


Campaign: scaling wf up to 244 molecules

Campaign Leader: Veronika Nefedova

Project: Swift

Technology: Molecular Dynamics Application

Objective:

The Molecular Dynamics workflow at present can't be reliably executed for a large
number of molecules (100+). Execution fails for 244 molecules due to some
problems with Falkon/Swift interactions. 

Benefits:

Executing the workflow for a large number of molecules would enable the Molecular
Dynamics group to run large simulations in one step, which would increase
productivity.

Implementation Details:

1. Analyze logs from the failed runs
2. If some information is missing from the logs -- add new debug printouts and
repeat the run.
3. Act on the findings -- make corrections to either Swift or Falkon; repeat
the run.
4. Repeat steps 1-3 until 244-molecule runs complete successfully and reliably.

Deliverables:

1. Falkon code that could be installed on any system to handle 100+ molecule
runs
2. Swift code that works correctly with Falkon (in svn)


-- 
Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.


From bugzilla-daemon at mcs.anl.gov  Wed Jun 27 07:59:46 2007
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Wed, 27 Jun 2007 07:59:46 -0500 (CDT)
Subject: [Swift-devel] [Bug 72] Campaign for scaling wf up to 244 molecules
In-Reply-To: <bug-72-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20070627125946.61B0216505@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=72


nefedova at mcs.anl.gov changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
OtherBugsDependingO|                            |73
              nThis|                            |




From bugzilla-daemon at mcs.anl.gov  Wed Jun 27 08:01:31 2007
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Wed, 27 Jun 2007 08:01:31 -0500 (CDT)
Subject: [Swift-devel] [Bug 72] Campaign for scaling wf up to 244 molecules
In-Reply-To: <bug-72-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20070627130131.740B616505@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=72


nefedova at mcs.anl.gov changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
OtherBugsDependingO|                            |74
              nThis|                            |




From bugzilla-daemon at mcs.anl.gov  Wed Jun 27 08:03:04 2007
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Wed, 27 Jun 2007 08:03:04 -0500 (CDT)
Subject: [Swift-devel] [Bug 73] Campaign: performance improvements for
	MolDyn workflow
In-Reply-To: <bug-73-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20070627130304.E57F416505@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=73


nefedova at mcs.anl.gov changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |swift-devel at ci.uchicago.edu,
                   |                            |nefedova at mcs.anl.gov




From bugzilla-daemon at mcs.anl.gov  Wed Jun 27 08:04:06 2007
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Wed, 27 Jun 2007 08:04:06 -0500 (CDT)
Subject: [Swift-devel] [Bug 74] Campaign: Technology transfer to the user
In-Reply-To: <bug-74-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20070627130406.5A09C16505@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=74


nefedova at mcs.anl.gov changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |swift-devel at ci.uchicago.edu,
                   |                            |nefedova at mcs.anl.gov




From benc at hawaga.org.uk  Wed Jun 27 14:10:15 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 27 Jun 2007 19:10:15 +0000 (GMT)
Subject: [Swift-devel] falkon bugzilla
Message-ID: <Pine.LNX.4.64.0706271909530.18874@dildano.hawaga.org.uk>


I've added a Falkon product to bugzilla with 'general' and 'provider-deef' 
components.

As Ioan isn't signed up for the Swift bugzilla, I've made Yong the default 
owner for both.
-- 


From benc at hawaga.org.uk  Wed Jun 27 15:46:12 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 28 Jun 2007 02:16:12 +0530 (IST)
Subject: [Swift-devel] a different way to do array/structure accesses
Message-ID: <Pine.OSX.4.64.0706280138380.18443@soju.hawaga.org.uk>

I've been playing round with converting the .xml intermediate format to 
be more strictly XML and less a mixture of XML and various other syntaxes.

One thing that comes out of this is that it's simpler in the parser and 
compiler layer to generate array and structure accesses using a bunch of 
karajan level calls, like this:

  <print>
    <parallel>
      <getfield path="a"><vdl:getfield var="{foo}" /></getfield>
    </parallel>
  </print>

and

  <vdl:setfieldvalue>
    <argument name="var">
      <getfield path="a"><vdl:getfield var="{foo}" /></getfield>
    </argument>
    <argument name="value">
      <number>9091</number>
    </argument>
  </vdl:setfieldvalue>


instead of the way it's done at the moment with a path syntax, like this:

  <vdl:setfieldvalue path="a" var="{foo}" value="9091"/>

and

  <print>
    <parallel>
      <vdl:getfield var="{foo}" path="a"/>
    </parallel>
  </print>


This allows a bunch of simplification to happen with path handling in the 
swift code. However, it makes the Karajan intermediate code more 
complicated. From the language side of things, I'd like to make this 
change, but I don't know enough about how that affects the load on 
Karajan, especially with the insanely large source files that people are 
machine-generating.


(here's the program I pulled these from:

type mytype { int a; int b; }

mytype foo;

foo.a=9091;

foo.b=818;

print(foo.a);

)
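
To make the proposed nesting concrete, here is a minimal Python sketch that 
builds the nested form from a variable and a path.  It is an illustration, 
not the Swift compiler: the function name is made up, and the handling of 
multi-component paths such as "a.b" (one wrapper per component, innermost 
first) is an assumption, since the examples above only show a single field.

```python
import xml.etree.ElementTree as ET


def nest_getfield(var, path):
    """Build nested <getfield> elements for a dotted path such as "a" or "a.b"."""
    node = ET.Element("vdl:getfield", {"var": var})
    for component in path.split("."):
        # Wrap what we have so far in one more field access.
        wrapper = ET.Element("getfield", {"path": component})
        wrapper.append(node)
        node = wrapper
    return node


xml_text = ET.tostring(nest_getfield("{foo}", "a"), encoding="unicode")
print(xml_text)
```

Each path component adds one more element, which is where the extra 
intermediate code in the nested form comes from.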

-- 


-- 


From hategan at mcs.anl.gov  Wed Jun 27 15:58:03 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 27 Jun 2007 15:58:03 -0500
Subject: [Swift-devel] a different way to do array/structure accesses
In-Reply-To: <Pine.OSX.4.64.0706280138380.18443@soju.hawaga.org.uk>
References: <Pine.OSX.4.64.0706280138380.18443@soju.hawaga.org.uk>
Message-ID: <1182977884.9558.2.camel@blabla.mcs.anl.gov>

Probably not very efficient, for more than one reason. Also less clear.
I'd vote against it.

On Thu, 2007-06-28 at 02:16 +0530, Ben Clifford wrote:
> I've been playing round with converting the .xml intermediate format to 
> be more strictly XML and less a mixture of XML and various other syntaxes.
> 
> One thing that comes out of this is that its simpler in the parser and 
> compiler layer to generate array and structure accesses using a bunch of 
> karajan level calls, like this:
> 
>   <print>
>     <parallel>
>       <getfield path="a"><vdl:getfield var="{foo}" /></getfield>
>     </parallel>
>   </print>
> 
> and
> 
>   <vdl:setfieldvalue>
>     <argument name="var">
>       <getfield path="a"><vdl:getfield var="{foo}" /></getfield>
>     </argument>
>     <argument name="value">
>       <number>9091</number>
>     </argument>
>   </vdl:setfieldvalue>
> 
> 
> instead of the way its done at the moment with a path syntax, like this:
> 
>   <vdl:setfieldvalue path="a" var="{foo}" value="9091"/>
> 
> and
> 
>   <print>
>     <parallel>
>       <vdl:getfield var="{foo}" path="a"/>
>     </parallel>
>   </print>
> 
> 
> This allows a bunch of simplification to happen with path handling in the 
> swift code. However, it makes the Karajan intermediate code more 
> complicated. From the language side of things, I'd like to make this 
> change, but I don't know enough about how that effects the load on 
> Karajan, especially with the insanely large source files that people are 
> machine-generating.
> 
> 
> (here's the program I pulled these from:
> 
> type mytype { int a; int b; }
> 
> mytype foo;
> 
> foo.a=9091;
> 
> foo.b=818;
> 
> print(foo.a);
> 
> )
> 
> -- 
> 
> 


From hategan at mcs.anl.gov  Wed Jun 27 15:59:05 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 27 Jun 2007 15:59:05 -0500
Subject: [Swift-devel] a different way to do array/structure accesses
In-Reply-To: <Pine.OSX.4.64.0706280138380.18443@soju.hawaga.org.uk>
References: <Pine.OSX.4.64.0706280138380.18443@soju.hawaga.org.uk>
Message-ID: <1182977945.9558.4.camel@blabla.mcs.anl.gov>

That reminds me. parallel(onlyOneThing()) should be reduced to
onlyOneThing(). Either in the translation, or automatically in karajan.
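
A minimal sketch of that reduction done at translation time, assuming the intermediate XML is available as a tree. The function name is hypothetical, and the vdl: prefix is dropped from the sample so the fragment parses without a namespace declaration.

```python
import xml.etree.ElementTree as ET

def collapse_single_parallel(node):
    """Replace every <parallel> that wraps exactly one child with that child."""
    for index, child in enumerate(list(node)):
        collapse_single_parallel(child)      # simplify bottom-up
        if child.tag == "parallel" and len(child) == 1:
            only = child[0]
            only.tail = child.tail           # keep surrounding whitespace
            node[index] = only
    return node

source = '<print><parallel><getfield var="{foo}" path="a"/></parallel></print>'
tree = collapse_single_parallel(ET.fromstring(source))
print(ET.tostring(tree, encoding="unicode"))
```

A <parallel> with two or more children is left alone, and a <parallel> at the root is not touched by this sketch.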

On Thu, 2007-06-28 at 02:16 +0530, Ben Clifford wrote:
> I've been playing round with converting the .xml intermediate format to 
> be more strictly XML and less a mixture of XML and various other syntaxes.
> 
> One thing that comes out of this is that it's simpler in the parser and
> compiler layer to generate array and structure accesses using a bunch of
> Karajan-level calls, like this:
> 
>   <print>
>     <parallel>
>       <getfield path="a"><vdl:getfield var="{foo}" /></getfield>
>     </parallel>
>   </print>
> 
> and
> 
>   <vdl:setfieldvalue>
>     <argument name="var">
>       <getfield path="a"><vdl:getfield var="{foo}" /></getfield>
>     </argument>
>     <argument name="value">
>       <number>9091</number>
>     </argument>
>   </vdl:setfieldvalue>
> 
> 
> instead of the way it's done at the moment with a path syntax, like this:
> 
>   <vdl:setfieldvalue path="a" var="{foo}" value="9091"/>
> 
> and
> 
>   <print>
>     <parallel>
>       <vdl:getfield var="{foo}" path="a"/>
>     </parallel>
>   </print>
> 
> 
> This allows a bunch of simplification to happen with path handling in the 
> Swift code. However, it makes the Karajan intermediate code more
> complicated. From the language side of things, I'd like to make this
> change, but I don't know enough about how that affects the load on
> Karajan, especially with the insanely large source files that people are 
> machine-generating.
> 
> 
> (here's the program I pulled these from:
> 
> type mytype { int a; int b; }
> 
> mytype foo;
> 
> foo.a=9091;
> 
> foo.b=818;
> 
> print(foo.a);
> 
> )
> 
> -- 
> 
> 


From benc at hawaga.org.uk  Thu Jun 28 08:54:45 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 28 Jun 2007 13:54:45 +0000 (GMT)
Subject: [Swift-devel] Re: Could not convert value to number: true
In-Reply-To: <1182827359.17489.5.camel@blabla.mcs.anl.gov>
References: <46807F9B.6050008@fnal.gov>
	<1182827359.17489.5.camel@blabla.mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0706281345430.15250@dildano.hawaga.org.uk>


I moved this to swift-devel from swift-user.

On Mon, 25 Jun 2007, Mihael Hategan wrote:

> 
> As for doing what you want, I have to think about it.

We've talked in the past about more elaborate forms of mapping. Mapping 
data from other sources (rather than from disk data files, perhaps from 
databases); and mapping data (from whatever source) into the actual swift 
data space so that things like + and other operators can work on that 
data.

Until/unless a strong application need arises for those, I don't think 
they're high enough priority for us to implement any time soon.

Separately, I think it's a bug that we allow the above code to compile and
run with such a poor error message.  Probably, attempts to get the value 
of a mapped piece of data should cause an error, rather than returning 
'true' which is often not even of the right datatype, let alone a 
meaningful value. I'll put something in bugzilla for this.
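
The behaviour I have in mind could look roughly like the sketch below. This is not Swift's actual data model: the class, method and exception names are all hypothetical, and it only shows "raise on value access to mapped data" in place of returning a placeholder.

```python
class MappedValueError(TypeError):
    """Raised when code asks for the in-memory value of file-mapped data."""

class DataNode:
    """Hypothetical stand-in for a piece of Swift data."""

    def __init__(self, name, value=None, mapped_to=None):
        self.name = name
        self.value = value          # in-memory value, if any
        self.mapped_to = mapped_to  # file name, if the data is mapped

    def get_value(self):
        if self.mapped_to is not None:
            # Fail loudly rather than hand back a placeholder such as True.
            raise MappedValueError(
                f"{self.name} is mapped to {self.mapped_to!r} and has no "
                "in-memory value; operators like + cannot be applied to it"
            )
        return self.value

in_memory = DataNode("foo.a", value=9091)
mapped = DataNode("datafile", mapped_to="input.txt")
print(in_memory.get_value() + 1)
```

With this shape, the arithmetic on mapped data fails at the point of access with a message naming the variable, instead of surfacing later as a number-conversion error.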

-- 


From bugzilla-daemon at mcs.anl.gov  Thu Jun 28 09:10:26 2007
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Thu, 28 Jun 2007 09:10:26 -0500 (CDT)
Subject: [Swift-devel] [Bug 72] Campaign for scaling wf up to 244 molecules
In-Reply-To: <bug-72-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20070628141026.6707716506@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=72


nefedova at mcs.anl.gov changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  BugsThisDependsOn|                            |76


-- 
Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.


From bugzilla-daemon at mcs.anl.gov  Thu Jun 28 09:11:07 2007
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Thu, 28 Jun 2007 09:11:07 -0500 (CDT)
Subject: [Swift-devel] [Bug 76] disable intermediate stageout of data
In-Reply-To: <bug-76-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20070628141107.E98D8164DB@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=76


nefedova at mcs.anl.gov changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |swift-devel at ci.uchicago.edu




From bugzilla-daemon at mcs.anl.gov  Thu Jun 28 16:12:48 2007
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Thu, 28 Jun 2007 16:12:48 -0500 (CDT)
Subject: [Swift-devel] [Bug 72] Campaign for scaling wf up to 244 molecules
In-Reply-To: <bug-72-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20070628211248.35B6116505@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=72


------- Comment #1 from wilde at mcs.anl.gov  2007-06-28 16:12 -------
I've reviewed this email thread on this bug, and am moving this discussion to
bugzilla.

I am uncertain about the following - can people involved (Nika, Ioan,
Mihael) clarify:

- did Mihael discover an error in Falkon mutex code?

- if so was it fixed, and did it correct the problem of missed completion
notifications?

- what's the state of the "unable to write output file" problem?

- do we still have a bad node in UC-Teraport w/ NFS stale file handles? If so,
was that reported? (This raises interesting issues in troubleshooting and
trouble workaround)

- do we have a plan for how to run this WF at scale? Meaning how to get 244
nodes for several days, whether we can scale up beyond
1-processor-per-molecule, what the expected runtime is, how to deal with
errors/restarts, etc? (Should detail this here in bugz).




From bugzilla-daemon at mcs.anl.gov  Thu Jun 28 16:25:54 2007
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Thu, 28 Jun 2007 16:25:54 -0500 (CDT)
Subject: [Swift-devel] [Bug 72] Campaign for scaling wf up to 244 molecules
In-Reply-To: <bug-72-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20070628212554.0516B164DB@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=72


------- Comment #2 from wilde at mcs.anl.gov  2007-06-28 16:25 -------
Ioan Raicu wrote, On 6/28/2007 4:11 PM:
> Hi,
> Yong and I are working at making some small changes (synchronizing some 
> lists, more logging, etc...) in the Falkon provider.  We are also 
> working on the automatic graphing capability from Falkon's logs.  We 
> should be ready to give the experiment another run later today.
> 
> Ioan

OK - thanks Ioan and Yong. By these "small changes" do you mean that the
synchronization issue Mihael raised *was* or *was not* determined to be a cause
of missing the 2000+ notifications out of 20K+?  Are you now working on
tightening up the synchronization further?

The graphing thing sounds good.  When you get a moment, please respond on the
last 3 items.  Thanks.




From iraicu at cs.uchicago.edu  Thu Jun 28 16:27:04 2007
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Thu, 28 Jun 2007 16:27:04 -0500
Subject: [Swift-devel] [Bug 72] Campaign for scaling wf up to 244 molecules
In-Reply-To: <20070628211248.35B6116505@foxtrot.mcs.anl.gov>
References: <20070628211248.35B6116505@foxtrot.mcs.anl.gov>
Message-ID: <468427A8.10104@cs.uchicago.edu>


bugzilla-daemon at mcs.anl.gov wrote:
> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=72
>
>
>
>
>
> ------- Comment #1 from wilde at mcs.anl.gov  2007-06-28 16:12 -------
> I've reviewed this email thread on this bug, and am moving this discussion to
> bugzilla.
>
> I am uncertain about the following - can people involved (Nika, Ioan,
> Mihael) clarify:
>
> - did Mihael discover an error in Falkon mutex code?
>
>   
We are not sure, but we are adding extra synchronization in several 
parts of the Falkon provider.  The reason we are saying that we are not 
sure is that we stress tested (pushing 50~100 tasks/sec) both the Falkon 
provider and Falkon itself over and over again, and we never encountered 
this.  Now we have a workflow that has an average of 1 task/sec, and I find
it hard to believe that a synchronization issue that never surfaced
before under stress testing is surfacing now under such a light load.  
We are also verifying that we are handling all exceptions correctly 
within the Falkon provider.
> - if so was it fixed, and did it correct the problem of missed completion
> notifications?
>   
We don't know; the problems are not reproducible over short runs, and only
seem to pop up with longer runs.  For example, we completed the 100 mol 
run just fine, which had 10K jobs.  We have to rerun the 244 mol run to 
verify things.
> - what's the state of the "unable to write output file" problem?
>
> - do we still have a bad node in UC-Teraport w/ NFS stale file handles? If so,
> was that reported? (This raises interesting issues in troubleshooting and
> trouble workaround)
>   
I reported it, but help at tg claims the node is working fine.  They claim 
that once in a while, it is normal for this to happen, and my argument 
that all other nodes behaved perfectly with the exception of this one 
isn't enough for them.  For now, if we get this node again, we can 
manually kill the Falkon worker there so Falkon won't use it anymore.
> - do we have a plan for how to run this WF at scale? Meaning how to get 244
> nodes for several days, whether we can scale up beyond
> 1-processor-per-molecule, what the expected runtime is, how to deal with
> errors/restarts, etc? (Should detail this here in bugz).
>   
There is still work I need to do to ensure that a task that is running 
when the resource lease expires is correctly handled and Swift is 
notified that it failed.  I have the code written and in Falkon already, 
but I have yet to test it.  We need to make sure this works before we 
try to get say 24 hour resource allocations when we know the experiment 
will likely take several days.  Also, I think the larger part of the 
workflow could benefit from more than 1 node per molecule, so if we 
could get more, it should improve the end-to-end time. 

Ioan
>
>   

-- 
============================================
Ioan Raicu
Ph.D. Student
============================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
============================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
       http://dsl.cs.uchicago.edu/
============================================
============================================


From itf at mcs.anl.gov  Thu Jun 28 16:26:25 2007
From: itf at mcs.anl.gov (Ian Foster)
Date: Thu, 28 Jun 2007 21:26:25 +0000
Subject: [Swift-devel] [Bug 72] Campaign for scaling wf up to 244 molecules
In-Reply-To: <468427A8.10104@cs.uchicago.edu>
References: <20070628211248.35B6116505@foxtrot.mcs.anl.gov><468427A8.10104@cs.uchicago.edu>
Message-ID: <1895472852-1183066171-cardhu_decombobulator_blackberry.rim.net-826324017-@bxe006.bisx.prod.on.blackberry>

Did we do a complete code review?
 
Sent via BlackBerry from T-Mobile

-----Original Message-----
From: Ioan Raicu <iraicu at cs.uchicago.edu>

Date: Thu, 28 Jun 2007 16:27:04 
To:bugzilla-daemon at mcs.anl.gov
Cc:swift-devel at ci.uchicago.edu
Subject: Re: [Swift-devel] [Bug 72] Campaign for scaling wf up to 244 molecules


bugzilla-daemon at mcs.anl.gov wrote:
> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=72
>
>
>
>
>
> ------- Comment #1 from wilde at mcs.anl.gov  2007-06-28 16:12 -------
> I've reviewed this email thread on this bug, and am moving this discussion to
> bugzilla.
>
> I am uncertain about the following - can people involved (Nika, Ioan,
> Mihael) clarify:
>
> - did Mihael discover an error in Falkon mutex code?
>
>   
We are not sure, but we are adding extra synchronization in several 
parts of the Falkon provider.  The reason we are saying that we are not 
sure is that we stress tested (pushing 50~100 tasks/sec) both the Falkon 
provider and Falkon itself over and over again, and we never encountered 
this.  Now we have a workflow that has an average of 1 task/sec, and I find
it hard to believe that a synchronization issue that never surfaced
before under stress testing is surfacing now under such a light load.  
We are also verifying that we are handling all exceptions correctly 
within the Falkon provider.
> - if so was it fixed, and did it correct the problem of missed completion
> notifications?
>   
We don't know; the problems are not reproducible over short runs, and only
seem to pop up with longer runs.  For example, we completed the 100 mol 
run just fine, which had 10K jobs.  We have to rerun the 244 mol run to 
verify things.
> - what's the state of the "unable to write output file" problem?
>
> - do we still have a bad node in UC-Teraport w/ NFS stale file handles? If so,
> was that reported? (This raises interesting issues in troubleshooting and
> trouble workaround)
>   
I reported it, but help at tg claims the node is working fine.  They claim 
that once in a while, it is normal for this to happen, and my argument 
that all other nodes behaved perfectly with the exception of this one 
isn't enough for them.  For now, if we get this node again, we can 
manually kill the Falkon worker there so Falkon won't use it anymore.
> - do we have a plan for how to run this WF at scale? Meaning how to get 244
> nodes for several days, whether we can scale up beyond
> 1-processor-per-molecule, what the expected runtime is, how to deal with
> errors/restarts, etc? (Should detail this here in bugz).
>   
There is still work I need to do to ensure that a task that is running 
when the resource lease expires is correctly handled and Swift is 
notified that it failed.  I have the code written and in Falkon already, 
but I have yet to test it.  We need to make sure this works before we 
try to get say 24 hour resource allocations when we know the experiment 
will likely take several days.  Also, I think the larger part of the 
workflow could benefit from more than 1 node per molecule, so if we 
could get more, it should improve the end-to-end time. 

Ioan
>
>   

-- 
============================================
Ioan Raicu
Ph.D. Student
============================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
============================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
       http://dsl.cs.uchicago.edu/
============================================
============================================

_______________________________________________
Swift-devel mailing list
Swift-devel at ci.uchicago.edu
http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel


From iraicu at cs.uchicago.edu  Thu Jun 28 16:32:35 2007
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Thu, 28 Jun 2007 16:32:35 -0500
Subject: [Swift-devel] [Bug 72] Campaign for scaling wf up to 244 molecules
In-Reply-To: <1895472852-1183066171-cardhu_decombobulator_blackberry.rim.net-826324017-@bxe006.bisx.prod.on.blackberry>
References: <20070628211248.35B6116505@foxtrot.mcs.anl.gov><468427A8.10104@cs.uchicago.edu>
	<1895472852-1183066171-cardhu_decombobulator_blackberry.rim.net-826324017-@bxe006.bisx.prod.on.blackberry>
Message-ID: <468428F3.1060508@cs.uchicago.edu>

No, just the Falkon provider (~500 lines of code), as far as I know. 

The Falkon service is around 10K lines of code, and the Falkon executor
is another 3K, so a code review of everything in Falkon will likely take
longer than a few days.

Ioan

Ian Foster wrote:
> Did we do a complete code review?
>  
> Sent via BlackBerry from T-Mobile
>
> -----Original Message-----
> From: Ioan Raicu <iraicu at cs.uchicago.edu>
>
> Date: Thu, 28 Jun 2007 16:27:04 
> To:bugzilla-daemon at mcs.anl.gov
> Cc:swift-devel at ci.uchicago.edu
> Subject: Re: [Swift-devel] [Bug 72] Campaign for scaling wf up to 244 molecules
>
>
>
>
> bugzilla-daemon at mcs.anl.gov wrote:
>   
>> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=72
>>
>>
>>
>>
>>
>> ------- Comment #1 from wilde at mcs.anl.gov  2007-06-28 16:12 -------
>> I've reviewed this email thread on this bug, and am moving this discussion to
>> bugzilla.
>>
>> I am uncertain about the following - can people involved (Nika, Ioan,
>> Mihael) clarify:
>>
>> - did Mihael discover an error in Falkon mutex code?
>>
>>   
>>     
> We are not sure, but we are adding extra synchronization in several 
> parts of the Falkon provider.  The reason we are saying that we are not 
> sure is that we stress tested (pushing 50~100 tasks/sec) both the Falkon 
> provider and Falkon itself over and over again, and we never encountered 
> this.  Now we have a workflow that has an average of 1 task/sec, and I find
> it hard to believe that a synchronization issue that never surfaced
> before under stress testing is surfacing now under such a light load.  
> We are also verifying that we are handling all exceptions correctly 
> within the Falkon provider.
>   
>> - if so was it fixed, and did it correct the problem of missed completion
>> notifications?
>>   
>>     
> We don't know; the problems are not reproducible over short runs, and only
> seem to pop up with longer runs.  For example, we completed the 100 mol 
> run just fine, which had 10K jobs.  We have to rerun the 244 mol run to 
> verify things.
>   
>> - what's the state of the "unable to write output file" problem?
>>
>> - do we still have a bad node in UC-Teraport w/ NFS stale file handles? If so,
>> was that reported? (This raises interesting issues in troubleshooting and
>> trouble workaround)
>>   
>>     
> I reported it, but help at tg claims the node is working fine.  They claim 
> that once in a while, it is normal for this to happen, and my argument 
> that all other nodes behaved perfectly with the exception of this one 
> isn't enough for them.  For now, if we get this node again, we can 
> manually kill the Falkon worker there so Falkon won't use it anymore.
>   
>> - do we have a plan for how to run this WF at scale? Meaning how to get 244
>> nodes for several days, whether we can scale up beyond
>> 1-processor-per-molecule, what the expected runtime is, how to deal with
>> errors/restarts, etc? (Should detail this here in bugz).
>>   
>>     
> There is still work I need to do to ensure that a task that is running 
> when the resource lease expires is correctly handled and Swift is 
> notified that it failed.  I have the code written and in Falkon already, 
> but I have yet to test it.  We need to make sure this works before we 
> try to get say 24 hour resource allocations when we know the experiment 
> will likely take several days.  Also, I think the larger part of the 
> workflow could benefit from more than 1 node per molecule, so if we 
> could get more, it should improve the end-to-end time. 
>
> Ioan
>   
>>   
>>     
>
>   

-- 
============================================
Ioan Raicu
Ph.D. Student
============================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
============================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
       http://dsl.cs.uchicago.edu/
============================================
============================================


From hategan at mcs.anl.gov  Thu Jun 28 16:32:52 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 28 Jun 2007 16:32:52 -0500
Subject: [Swift-devel] [Bug 72] Campaign for scaling wf up to 244 molecules
In-Reply-To: <468427A8.10104@cs.uchicago.edu>
References: <20070628211248.35B6116505@foxtrot.mcs.anl.gov>
	<468427A8.10104@cs.uchicago.edu>
Message-ID: <1183066372.25279.5.camel@blabla.mcs.anl.gov>

> >
> > - did Mihael discover an error in Falkon mutex code?
> >
> >   
> We are not sure, but we are adding extra synchronization in several 
> parts of the Falkon provider.  The reason we are saying that we are not 
> sure is that we stress tested (pushing 50~100 tasks/sec) both the Falkon 
> provider and Falkon itself over and over again, and we never encountered 
> this.  Now we have a workflow that has an average of 1 task/sec, and I find
> it hard to believe that a synchronization issue that never surfaced
> before under stress testing is surfacing now under such a light load.

?!?
You are mutating maps and lists from concurrent threads without
synchronization. That is a problem regardless of any other
considerations.
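
A minimal sketch of the pattern in question, written in Python rather than the provider's Java and using hypothetical names (TaskRegistry, submit, complete). Every access to the shared map and list goes through one lock, so submit() can be called from any number of threads. In Java the comparable tools would be synchronized blocks or the java.util.concurrent collections.

```python
import threading

class TaskRegistry:
    """Hypothetical stand-in for a provider's shared task map and list."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pending = {}   # task id -> task description
        self._order = []     # submission order

    def submit(self, task_id, task):
        with self._lock:                 # both structures change together
            self._pending[task_id] = task
            self._order.append(task_id)

    def complete(self, task_id):
        with self._lock:
            task = self._pending.pop(task_id, None)
            if task is not None:
                self._order.remove(task_id)
            return task

    def pending_count(self):
        with self._lock:
            return len(self._pending)

def run_submitters(registry, threads=8, per_thread=1000):
    """Call submit() concurrently from several threads."""
    def work(offset):
        for i in range(per_thread):
            registry.submit((offset, i), "task")
    workers = [threading.Thread(target=work, args=(n,)) for n in range(threads)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()

registry = TaskRegistry()
run_submitters(registry)
print(registry.pending_count())
```

The point of the single lock is that the map and the list are never observed out of step with each other, whatever the submission rate.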

Mihael


From hategan at mcs.anl.gov  Thu Jun 28 16:35:33 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 28 Jun 2007 16:35:33 -0500
Subject: [Swift-devel] [Bug 72] Campaign for scaling wf up to 244 molecules
In-Reply-To: <468428F3.1060508@cs.uchicago.edu>
References: <20070628211248.35B6116505@foxtrot.mcs.anl.gov>
	<468427A8.10104@cs.uchicago.edu>
	<1895472852-1183066171-cardhu_decombobulator_blackberry.rim.net-826324017-@bxe006.bisx.prod.on.blackberry>
	<468428F3.1060508@cs.uchicago.edu>
Message-ID: <1183066533.25279.8.camel@blabla.mcs.anl.gov>

I was waiting for it to be cleaned up and put into SVN, as we agreed.

On Thu, 2007-06-28 at 16:32 -0500, Ioan Raicu wrote:
> No, just the Falkon provider (~500 lines of code), as far as I know.  
> 
> The Falkon service is around 10K lines of code, and the Falkon
> executor is another 3K, so a code review of everything in Falkon
> will likely take longer than a few days.
> 
> Ioan
> 
> Ian Foster wrote: 
> > Did we do a complete code review?
> >  
> > Sent via BlackBerry from T-Mobile
> > 
> > -----Original Message-----
> > From: Ioan Raicu <iraicu at cs.uchicago.edu>
> > 
> > Date: Thu, 28 Jun 2007 16:27:04 
> > To:bugzilla-daemon at mcs.anl.gov
> > Cc:swift-devel at ci.uchicago.edu
> > Subject: Re: [Swift-devel] [Bug 72] Campaign for scaling wf up to 244 molecules
> > 
> > 
> > 
> > 
> > bugzilla-daemon at mcs.anl.gov wrote:
> >   
> > > http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=72
> > > 
> > > 
> > > 
> > > 
> > > 
> > > ------- Comment #1 from wilde at mcs.anl.gov  2007-06-28 16:12 -------
> > > I've reviewed this email thread on this bug, and am moving this discussion to
> > > bugzilla.
> > >
> > > I am uncertain about the following - can people involved (Nika, Ioan,
> > > Mihael) clarify:
> > > 
> > > - did Mihael discover an error in Falkon mutex code?
> > > 
> > >   
> > >     
> > We are not sure, but we are adding extra synchronization in several 
> > parts of the Falkon provider.  The reason we are saying that we are not 
> > sure is that we stress tested (pushing 50~100 tasks/sec) both the Falkon 
> > provider and Falkon itself over and over again, and we never encountered 
> > this.  Now we have a workflow that has an average of 1 task/sec, and I find
> > it hard to believe that a synchronization issue that never surfaced
> > before under stress testing is surfacing now under such a light load.  
> > We are also verifying that we are handling all exceptions correctly 
> > within the Falkon provider.
> >   
> > > - if so was it fixed, and did it correct the problem of missed completion
> > > notifications?
> > >   
> > >     
> > We don't know; the problems are not reproducible over short runs, and only
> > seem to pop up with longer runs.  For example, we completed the 100 mol 
> > run just fine, which had 10K jobs.  We have to rerun the 244 mol run to 
> > verify things.
> >   
> > > - what's the state of the "unable to write output file" problem?
> > >
> > > - do we still have a bad node in UC-Teraport w/ NFS stale file handles? If so,
> > > was that reported? (This raises interesting issues in troubleshooting and
> > > trouble workaround)
> > >   
> > >     
> > I reported it, but help at tg claims the node is working fine.  They claim 
> > that once in a while, it is normal for this to happen, and my argument 
> > that all other nodes behaved perfectly with the exception of this one 
> > isn't enough for them.  For now, if we get this node again, we can 
> > manually kill the Falkon worker there so Falkon won't use it anymore.
> >   
> > > - do we have a plan for how to run this WF at scale? Meaning how to get 244
> > > nodes for several days, whether we can scale up beyond
> > > 1-processor-per-molecule, what the expected runtime is, how to deal with
> > > errors/restarts, etc? (Should detail this here in bugz).
> > >   
> > >     
> > There is still work I need to do to ensure that a task that is running 
> > when the resource lease expires is correctly handled and Swift is 
> > notified that it failed.  I have the code written and in Falkon already, 
> > but I have yet to test it.  We need to make sure this works before we 
> > try to get say 24 hour resource allocations when we know the experiment 
> > will likely take several days.  Also, I think the larger part of the 
> > workflow could benefit from more than 1 node per molecule, so if we 
> > could get more, it should improve the end-to-end time. 
> > 
> > Ioan
> >   
> > 
> >   
> 
> -- 
> ============================================
> Ioan Raicu
> Ph.D. Student
> ============================================
> Distributed Systems Laboratory
> Computer Science Department
> University of Chicago
> 1100 E. 58th Street, Ryerson Hall
> Chicago, IL 60637
> ============================================
> Email: iraicu at cs.uchicago.edu
> Web:   http://www.cs.uchicago.edu/~iraicu
>        http://dsl.cs.uchicago.edu/
> ============================================
> ============================================


From iraicu at cs.uchicago.edu  Thu Jun 28 16:36:21 2007
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Thu, 28 Jun 2007 16:36:21 -0500
Subject: [Swift-devel] [Bug 72] Campaign for scaling wf up to 244 molecules
In-Reply-To: <1183066372.25279.5.camel@blabla.mcs.anl.gov>
References: <20070628211248.35B6116505@foxtrot.mcs.anl.gov>	
	<468427A8.10104@cs.uchicago.edu>
	<1183066372.25279.5.camel@blabla.mcs.anl.gov>
Message-ID: <468429D5.7070200@cs.uchicago.edu>

There is an option to have a pool of threads work on these data 
structures, but the pool size is set to 1.  Point is well taken, we have 
fixed this, but I am not convinced this is where the problem was.  We'll 
see after we do another run with all the extra logging.

Ioan

Mihael Hategan wrote:
>>> - did Mihael discover an error in Falkon mutex code?
>>>
>>>   
>>>       
>> We are not sure, but we are adding extra synchronization in several 
>> parts of the Falkon provider.  The reason we are saying that we are not 
>> sure is that we stress tested (pushing 50~100 tasks/sec) both the Falkon 
>> provider and Falkon itself over and over again, and we never encountered 
>> this.  Now we have a workflow that has an average of 1 task/sec, and I find
>> it hard to believe that a synchronization issue that never surfaced
>> before under stress testing is surfacing now under such a light load.
>>     
>
> ?!?
> You are mutating maps and lists from concurrent threads without
> synchronization. That is a problem regardless of any other
> considerations.
>
> Mihael
>
>
>
>
>   

-- 
============================================
Ioan Raicu
Ph.D. Student
============================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
============================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
       http://dsl.cs.uchicago.edu/
============================================
============================================


From hategan at mcs.anl.gov  Thu Jun 28 16:41:45 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 28 Jun 2007 16:41:45 -0500
Subject: [Swift-devel] [Bug 72] Campaign for scaling wf up to 244 molecules
In-Reply-To: <468429D5.7070200@cs.uchicago.edu>
References: <20070628211248.35B6116505@foxtrot.mcs.anl.gov>
	<468427A8.10104@cs.uchicago.edu>
	<1183066372.25279.5.camel@blabla.mcs.anl.gov>
	<468429D5.7070200@cs.uchicago.edu>
Message-ID: <1183066905.26775.2.camel@blabla.mcs.anl.gov>

On Thu, 2007-06-28 at 16:36 -0500, Ioan Raicu wrote:
> There is an option to have a pool of threads work on these data
> structures, but the pool size is set to 1.

Right, but the submit() method was called from different threads. Can we
stop arguing about the obvious?

>   Point is well taken, we have fixed this, but I am not convinced this
> is where the problem was.  We'll see after we do another run with all
> the extra logging.

Can you commit the updates to svn?

> 
> Ioan
> 
> Mihael Hategan wrote: 
> > > > - did Mihael discover an error in Falkon mutex code?
> > > > 
> > > >   
> > > >       
> > > We are not sure, but we are adding extra synchronization in several 
> > > parts of the Falkon provider.  The reason we are saying that we are not 
> > > sure is that we stress tested (pushing 50~100 tasks/sec) both the Falkon 
> > > provider and Falkon itself over and over again, and we never encountered 
> > > this.  Now we have a workflow that has an average of 1 task/sec, and I find
> > > it hard to believe that a synchronization issue that never surfaced
> > > before under stress testing is surfacing now under such a light load.
> > >     
> > 
> > ?!?
> > You are mutating maps and lists from concurrent threads without
> > synchronization. That is a problem regardless of any other
> > considerations.
> > 
> > Mihael
> > 
> > 
> > 
> > 
> >   
> 
> -- 
> ============================================
> Ioan Raicu
> Ph.D. Student
> ============================================
> Distributed Systems Laboratory
> Computer Science Department
> University of Chicago
> 1100 E. 58th Street, Ryerson Hall
> Chicago, IL 60637
> ============================================
> Email: iraicu at cs.uchicago.edu
> Web:   http://www.cs.uchicago.edu/~iraicu
>        http://dsl.cs.uchicago.edu/
> ============================================
> ============================================


From wilde at mcs.anl.gov  Thu Jun 28 17:42:55 2007
From: wilde at mcs.anl.gov (Mike Wilde)
Date: Thu, 28 Jun 2007 17:42:55 -0500
Subject: [Swift-devel] [Bug 72] Campaign for scaling wf up to 244	molecules
In-Reply-To: <1183066905.26775.2.camel@blabla.mcs.anl.gov>
References: <20070628211248.35B6116505@foxtrot.mcs.anl.gov>	<468427A8.10104@cs.uchicago.edu>	<1183066372.25279.5.camel@blabla.mcs.anl.gov>	<468429D5.7070200@cs.uchicago.edu>
	<1183066905.26775.2.camel@blabla.mcs.anl.gov>
Message-ID: <4684396F.30209@mcs.anl.gov>

STOP.  DO NOT reply to this email.

reply instead via a comment in bugzilla.

(do I sound like Ben yet? ;)


Ioan,

My understanding is that Mihael pointed out 2 clear unsynchronized race 
conditions from his review of the Falkon provider code.

Do you agree or disagree?  If you agree, have you fixed both races?  If not, do we 
need to discuss it further among more experts to get to a decision we believe 
is correct?

I don't want to sermonize, but will do so anyway:

<soapbox>

- mutex/synchronization problems are devilishly subtle

- to make mutex code work right, you need code review *and* extensive testing, 
and ideally a lot of code asserts to make sure you hold the lock where you think 
you do.

- if we are arguing about the obvious, it's probably not obvious to everyone
(so f2f tabletop code review is helpful here, for both education and verification)

- to get mutex code right you need to make sure you have the tasks and shared 
data structures (and hence access patterns) clearly identified

- then you need tons of testing. not just live tests, but carefully contrived 
artificial tests to stress test various mutex situations and potential race and 
deadlock conditions.

</soapbox>
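
A minimal sketch of the "asserts to make sure you are locked" point, assuming a
single shared map guarded by one lock. TaskTable and its methods are
hypothetical names for illustration; this is not the Falkon provider code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not Falkon or Swift code.
class TaskTable {
    private final Object lock = new Object();
    private final Map<String, String> statusById = new HashMap<>();

    // May be called from any thread; takes the lock before touching the map.
    void submit(String taskId) {
        synchronized (lock) {
            putLocked(taskId, "SUBMITTED");
        }
    }

    int size() {
        synchronized (lock) {
            return statusById.size();
        }
    }

    // Internal helper: callers must already hold the lock, and we check that.
    private void putLocked(String taskId, String status) {
        if (!Thread.holdsLock(lock)) {
            throw new IllegalStateException("putLocked called without the lock");
        }
        statusById.put(taskId, status);
    }

    // Contrived stress test: several threads call submit() concurrently,
    // each with its own set of task ids.
    static int stress(int threads, int perThread) {
        TaskTable table = new TaskTable();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int id = t;
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    table.submit(id + "-" + i);
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return table.size();
    }

    public static void main(String[] args) {
        System.out.println(stress(8, 1000));
    }
}
```

With the synchronized blocks in place every submitted id ends up in the map.
Removing them makes the lock check fail at once, which is the purpose of the
assert: the mistake shows up in every run, not only under an unlucky
interleaving.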

I don't think we should stop testing to do a code review, but we certainly will 
need to do one before we can expect very high reliability.

I'd like to ask you, Ioan, since it is your code and project, to work out a 
schedule that works for everyone and organize a review.  I understand 
that the core Falkon code needs some simple cosmetic cleanup (mainly removing 
fossil code) and then posting to SVN.


:) Mike


Mihael Hategan wrote, On 6/28/2007 4:41 PM:
> On Thu, 2007-06-28 at 16:36 -0500, Ioan Raicu wrote:
>> There is an option to have a pool of threads work on these data
>> structures, but the pool size is set to 1.
> 
> Right, but the submit() method was called from different threads. Can we
> stop arguing about the obvious?
> 
>>   Point is well taken, we have fixed this, but I am not convinced this
>> is where the problem was.  We'll see after we do another run with all
>> the extra logging.
> 
> Can you commit the updates to svn?
> 
>> Ioan
>>
>> Mihael Hategan wrote: 
>>>>> - did Mihael discover an error in Falkon mutex code?
>>>>>
>>>>>   
>>>>>       
>>>> We are not sure, but we are adding extra synchronization in several 
>>>> parts of the Falkon provider.  The reason we are saying that we are not 
>>>> sure is that we stress tested (pushing 50~100 tasks/sec) both the Falkon 
>>>> provider and Falkon itself over and over again, and we never encountered 
>>>> this.  Now, we have a workflow that has an average of 1 task/sec, I find 
>>>> it hard to believe that a synchronization issue that never surfaced 
>>>> before under stress testing is surfacing now under such a light load.
>>>>     
>>> ?!?
>>> You are mutating maps and lists from concurrent threads without
>>> synchronization. That is a problem regardless of any other
>>> considerations.
>>>
>>> Mihael
>>>
>>>
>>>
>>>
>>>   
>> -- 
>> ============================================
>> Ioan Raicu
>> Ph.D. Student
>> ============================================
>> Distributed Systems Laboratory
>> Computer Science Department
>> University of Chicago
>> 1100 E. 58th Street, Ryerson Hall
>> Chicago, IL 60637
>> ============================================
>> Email: iraicu at cs.uchicago.edu
>> Web:   http://www.cs.uchicago.edu/~iraicu
>>        http://dsl.cs.uchicago.edu/
>> ============================================
>> ============================================
> 
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> 
> 

-- 
Mike Wilde
Computation Institute, University of Chicago
Math & Computer Science Division
Argonne National Laboratory
Argonne, IL   60439    USA
tel 630-252-7497 fax 630-252-1997


From wilde at mcs.anl.gov  Thu Jun 28 18:11:43 2007
From: wilde at mcs.anl.gov (Mike Wilde)
Date: Thu, 28 Jun 2007 18:11:43 -0500
Subject: [Swift-devel] [Bug 72] Campaign for scaling wf up to 244	molecules
In-Reply-To: <468429D5.7070200@cs.uchicago.edu>
References: <20070628211248.35B6116505@foxtrot.mcs.anl.gov>		<468427A8.10104@cs.uchicago.edu>	<1183066372.25279.5.camel@blabla.mcs.anl.gov>
	<468429D5.7070200@cs.uchicago.edu>
Message-ID: <4684402F.5040503@mcs.anl.gov>

Good discussion.

I know this will take a bit of time to make a habit, but let's try to move 
discussion to the Campaign bug, when it applies. Click the link and enter a 
Comment as your reply in bugzilla. I think the Globus team is very used to doing 
this, and I believe that's a good practice and we should adopt it.

- Mike


Ioan Raicu wrote, On 6/28/2007 4:36 PM:
> There is an option to have a pool of threads work on these data 
> structures, but the pool size is set to 1.  Point is well taken, we have 
> fixed this, but I am not convinced this is where the problem was.  We'll 
> see after we do another run with all the extra logging.
> 
> Ioan
> 
> Mihael Hategan wrote:
>>>> - did Mihael discover an error in Falkon mutex code?
>>>>
>>>>   
>>>>       
>>> We are not sure, but we are adding extra synchronization in several 
>>> parts of the Falkon provider.  The reason we are saying that we are not 
>>> sure is that we stress tested (pushing 50~100 tasks/sec) both the Falkon 
>>> provider and Falkon itself over and over again, and we never encountered 
>>> this.  Now, we have a workflow that has an average of 1 task/sec, I find 
>>> it hard to believe that a synchronization issue that never surfaced 
>>> before under stress testing is surfacing now under such a light load.
>>>     
>>
>> ?!?
>> You are mutating maps and lists from concurrent threads without
>> synchronization. That is a problem regardless of any other
>> considerations.
>>
>> Mihael
>>
>>
>>
>>
>>   
> 
> -- 
> ============================================
> Ioan Raicu
> Ph.D. Student
> ============================================
> Distributed Systems Laboratory
> Computer Science Department
> University of Chicago
> 1100 E. 58th Street, Ryerson Hall
> Chicago, IL 60637
> ============================================
> Email: iraicu at cs.uchicago.edu
> Web:   http://www.cs.uchicago.edu/~iraicu
>        http://dsl.cs.uchicago.edu/
> ============================================
> ============================================
> 
> 
> ------------------------------------------------------------------------
> 
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel

-- 
Mike Wilde
Computation Institute, University of Chicago
Math & Computer Science Division
Argonne National Laboratory
Argonne, IL   60439    USA
tel 630-252-7497 fax 630-252-1997


From hategan at mcs.anl.gov  Thu Jun 28 18:16:19 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 28 Jun 2007 18:16:19 -0500
Subject: [Swift-devel] [Bug 72] Campaign for scaling wf up to 244	molecules
In-Reply-To: <4684396F.30209@mcs.anl.gov>
References: <20070628211248.35B6116505@foxtrot.mcs.anl.gov>
	<468427A8.10104@cs.uchicago.edu>
	<1183066372.25279.5.camel@blabla.mcs.anl.gov>
	<468429D5.7070200@cs.uchicago.edu>
	<1183066905.26775.2.camel@blabla.mcs.anl.gov>
	<4684396F.30209@mcs.anl.gov>
Message-ID: <1183072579.20493.1.camel@blabla.mcs.anl.gov>

On Thu, 2007-06-28 at 17:42 -0500, Mike Wilde wrote:

> (do I sound like Ben yet? ;)

Nope. You're missing the accent :)


From benc at hawaga.org.uk  Fri Jun 29 13:12:53 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 29 Jun 2007 23:42:53 +0530 (IST)
Subject: [Swift-devel] PA_VAR vs PA_VAR1
Message-ID: <Pine.OSX.4.64.0706292341130.7417@soju.hawaga.org.uk>


GetFieldValue has:


 public class GetFieldValue extends VDLFunction {
 [...]
   public static final Arg PA_VAR1 = new Arg.Positional("var");

and also inherits from VDLFunction:

   public static final Arg PA_VAR = new Arg.Positional("var");

From the name, it looks like PA_VAR1 was deliberately made to not be 
PA_VAR but I don't really understand why.

-- 


From hategan at mcs.anl.gov  Fri Jun 29 13:22:48 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 29 Jun 2007 13:22:48 -0500
Subject: [Swift-devel] PA_VAR vs PA_VAR1
In-Reply-To: <Pine.OSX.4.64.0706292341130.7417@soju.hawaga.org.uk>
References: <Pine.OSX.4.64.0706292341130.7417@soju.hawaga.org.uk>
Message-ID: <1183141368.12220.2.camel@blabla.mcs.anl.gov>

Either one might have been an optional at first and then it was changed,
or it's an oversight. I can't see any reason why both would be needed.

Mihael

On Fri, 2007-06-29 at 23:42 +0530, Ben Clifford wrote:
> GetFieldValue has:
> 
> 
>  public class GetFieldValue extends VDLFunction {
>  [...]
>    public static final Arg PA_VAR1 = new Arg.Positional("var");
> 
> and also inherits from VDLFunction:
> 
>    public static final Arg PA_VAR = new Arg.Positional("var");
> 
> From the name, it looks like PA_VAR1 was deliberately made to not be 
> PA_VAR but I don't really understand why.
> 


From benc at hawaga.org.uk  Fri Jun 29 13:27:37 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 29 Jun 2007 18:27:37 +0000 (GMT)
Subject: [Swift-devel] PA_VAR vs PA_VAR1
In-Reply-To: <1183141368.12220.2.camel@blabla.mcs.anl.gov>
References: <Pine.OSX.4.64.0706292341130.7417@soju.hawaga.org.uk>
	<1183141368.12220.2.camel@blabla.mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0706291827320.7513@dildano.hawaga.org.uk>


ok good.

On Fri, 29 Jun 2007, Mihael Hategan wrote:

> Either one might have been an optional at first and then it was changed,
> or it's an oversight. I can't see any reason why both would be needed.
> 
> Mihael
> 
> On Fri, 2007-06-29 at 23:42 +0530, Ben Clifford wrote:
> > GetFieldValue has:
> > 
> > 
> >  public class GetFieldValue extends VDLFunction {
> >  [...]
> >    public static final Arg PA_VAR1 = new Arg.Positional("var");
> > 
> > and also inherits from VDLFunction:
> > 
> >    public static final Arg PA_VAR = new Arg.Positional("var");
> > 
> > From the name, it looks like PA_VAR1 was deliberately made to not be 
> > PA_VAR but I don't really understand why.
> > 
> 
> 


From bugzilla-daemon at mcs.anl.gov  Fri Jun 29 19:11:48 2007
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Fri, 29 Jun 2007 19:11:48 -0500 (CDT)
Subject: [Swift-devel] [Bug 72] Campaign for scaling wf up to 244 molecules
In-Reply-To: <bug-72-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20070630001148.19C15164DB@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=72


------- Comment #3 from nefedova at mcs.anl.gov  2007-06-29 19:11 -------
You can watch the new 244-molecule run live here:
http://tg-viz-login1.uc.teragrid.org:51000/index.htm

Ioan will post the details of the changes he and Yong made to the provider
code here later.

Nika


-- 
Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.


From benc at hawaga.org.uk  Fri Jun 29 20:24:58 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sat, 30 Jun 2007 01:24:58 +0000 (GMT)
Subject: [Swift-devel] a different way to do array/structure accesses
In-Reply-To: <1182977884.9558.2.camel@blabla.mcs.anl.gov>
References: <Pine.OSX.4.64.0706280138380.18443@soju.hawaga.org.uk>
	<1182977884.9558.2.camel@blabla.mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0706300114080.7513@dildano.hawaga.org.uk>


I remembered why I was poking round this way in the first place.

The present implementation takes expressions like a[2] and passes them all 
the way through to the karajan layer, where the vdl runtime library code 
interprets the 2:

   <print>
     <parallel>
       <vdl:getfield var="{a}" path="[2]"/>
     </parallel>
   </print>

But this doesn't work in the case where 2 becomes a more complex 
expression, such as a[2+2].

Accesses like a[2+2] or a[i+1] don't seem to work at all at the moment. 
(that's bug 54).

I don't want to put a complete SwiftScript parser/evaluator in the runtime 
library, so I think in the case where array accesses are not simple 
variable names or constants, the code should break up array accesses into 
separate getfield calls so that the same code can be generated for that 
indexing expression as when that expression is used elsewhere.

However, it doesn't need to do this for simple accesses (simple variable 
names and/or constants) if the present path handling is also retained - 
that's more complication at the compiler layer but it sounds like it's 
needed.

-- 


From benc at hawaga.org.uk  Fri Jun 29 20:34:41 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sat, 30 Jun 2007 01:34:41 +0000 (GMT)
Subject: [Swift-devel] a different way to do array/structure accesses
In-Reply-To: <Pine.LNX.4.64.0706300114080.7513@dildano.hawaga.org.uk>
References: <Pine.OSX.4.64.0706280138380.18443@soju.hawaga.org.uk>
	<1182977884.9558.2.camel@blabla.mcs.anl.gov>
	<Pine.LNX.4.64.0706300114080.7513@dildano.hawaga.org.uk>
Message-ID: <Pine.LNX.4.64.0706300130530.7513@dildano.hawaga.org.uk>

alternatively, I guess the path argument can be constructed on the fly (as 
it looks like it might have been intended to once):

         <vdl:getfield var="{a}">
            <argument name="path">
              <concat>
                <string>[</string>
                ... compiled expression code here ...
                <string>]</string>
              </concat>
           </argument>
          </vdl:getfield>..

That's less aesthetically pleasing to me than the multiple-getfield form, 
though.

On the gripping hand, we could say that array subscripts can only be 
constants or variable names and disallow expressions there. That perhaps 
reflects the status quo more accurately, although I don't like it.

-- 


From hategan at mcs.anl.gov  Fri Jun 29 20:36:59 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 29 Jun 2007 20:36:59 -0500
Subject: [Swift-devel] a different way to do array/structure accesses
In-Reply-To: <Pine.LNX.4.64.0706300114080.7513@dildano.hawaga.org.uk>
References: <Pine.OSX.4.64.0706280138380.18443@soju.hawaga.org.uk>
	<1182977884.9558.2.camel@blabla.mcs.anl.gov>
	<Pine.LNX.4.64.0706300114080.7513@dildano.hawaga.org.uk>
Message-ID: <1183167419.28235.7.camel@blabla.mcs.anl.gov>

VDLFunction.parsePath(Object, VariableStack) is probably somewhat
related.

I think the problem was that paths with indices were sometimes
non-static, which was an exception to what the compiler was initially
doing, so it was easier at the time to push things into the above
function.

It should be fixed, but sensible reduction should be applied.

Mihael


On Sat, 2007-06-30 at 01:24 +0000, Ben Clifford wrote:
> I remembered why I was poking round this way in the first place.
> 
> The present implementation takes expressions like a[2] and passes them all 
> the way through to the karajan layer, where the vdl runtime library code 
> interprets the 2:
> 
>    <print>
>      <parallel>
>        <vdl:getfield var="{a}" path="[2]"/>
>      </parallel>
>    </print>
> 
> But this doesn't work in the case where 2 becomes a more complex 
> expression, such as a[2+2].
> 
> Accesses like a[2+2] or a[i+1] don't seem to work at all at the moment. 
> (that's bug 54).
> 
> I don't want to put a complete SwiftScript parser/evaluator in the runtime 
> library, so I think in the case where array accesses are not simple 
> variable names or constants, the code should break up array accesses into 
> separate getfield calls so that the same code can be generated for that 
> indexing expression as when that expression is used elsewhere.
> 
> However, it doesn't need to do this for simple accesses (simple variable 
> names and/or constants) if the present path handling is also retained - 
> that's more complication at the compiler layer but it sounds like it's 
> needed.
> 


From hategan at mcs.anl.gov  Fri Jun 29 20:45:51 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 29 Jun 2007 20:45:51 -0500
Subject: [Swift-devel] a different way to do array/structure accesses
In-Reply-To: <Pine.LNX.4.64.0706300130530.7513@dildano.hawaga.org.uk>
References: <Pine.OSX.4.64.0706280138380.18443@soju.hawaga.org.uk>
	<1182977884.9558.2.camel@blabla.mcs.anl.gov>
	<Pine.LNX.4.64.0706300114080.7513@dildano.hawaga.org.uk>
	<Pine.LNX.4.64.0706300130530.7513@dildano.hawaga.org.uk>
Message-ID: <1183167951.28235.16.camel@blabla.mcs.anl.gov>

On Sat, 2007-06-30 at 01:34 +0000, Ben Clifford wrote:
> alternatively, I guess the path argument can be constructed on the fly (as 
> it looks like it might have been intended to once):
> 
>          <vdl:getfield var="{a}">
>             <argument name="path">
>               <concat>
>                 <string>[</string>
>                 ... compiled expression code here ...
>                 <string>]</string>
>               </concat>
>            </argument>
>           </vdl:getfield>..
> 
> That's less aesthetically pleasing to me than the multiple-getfield form, 
> though.

Again, if constant nested getfields are reduced to a single path, it's
probably fine to keep dynamic things as nested getfields (it's the only
case incurring performance penalties).

e.g. p.x.z.v[i].a:
gfv(gfv(gfv(p, "x.z.v"), gfv(i)), "a")

Also, we could probably make this nicer with vargs (which will solve the
problem anyway):
gfv(p, "x", "z", "v", gfv(i), "a")
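
The reduction described above can be sketched as follows. This is only an
illustration under assumptions: PathReducer, Seg, field() and index() are
hypothetical names, the output is a string in the informal gfv(...) notation
used here, and none of it is the actual compiler code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of collapsing constant path steps; not Swift code.
class PathReducer {
    // One step of an access path: a constant field name or a dynamic index.
    static final class Seg {
        final String text;
        final boolean dynamic;
        Seg(String text, boolean dynamic) {
            this.text = text;
            this.dynamic = dynamic;
        }
    }

    static Seg field(String name) { return new Seg(name, false); }
    static Seg index(String variable) { return new Seg(variable, true); }

    // Consecutive constant steps are merged into one dotted path; each
    // dynamic index becomes its own nested call.
    static String reduce(String root, List<Seg> path) {
        String expr = root;
        List<String> constants = new ArrayList<>();
        for (Seg s : path) {
            if (s.dynamic) {
                expr = flush(expr, constants);
                expr = "gfv(" + expr + ", gfv(" + s.text + "))";
            } else {
                constants.add(s.text);
            }
        }
        return flush(expr, constants);
    }

    // Emits one gfv call for the pending constant steps, if there are any.
    private static String flush(String expr, List<String> constants) {
        if (constants.isEmpty()) {
            return expr;
        }
        String joined = String.join(".", constants);
        constants.clear();
        return "gfv(" + expr + ", \"" + joined + "\")";
    }

    // p.x.z.v[i].a from the example above.
    static String example() {
        return reduce("p", Arrays.asList(
                field("x"), field("z"), field("v"), index("i"), field("a")));
    }

    public static void main(String[] args) {
        System.out.println(example());
    }
}
```

A path with no dynamic index collapses to a single call, so only accesses
with a computed subscript pay for the extra nesting.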

> 
> On the gripping hand, we could say that array subscripts can only be 
> constants or variable names and disallow expressions there. That perhaps 
> reflects the status quo more accurately, although I don't like it.

Me neither. And it's counterintuitive. There will be hundreds trying to
do it anyway (assuming we will have that many users :)

> 


From bugzilla-daemon at mcs.anl.gov  Sat Jun 30 15:33:50 2007
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Sat, 30 Jun 2007 15:33:50 -0500 (CDT)
Subject: [Swift-devel] [Bug 72] Campaign for scaling wf up to 244 molecules
In-Reply-To: <bug-72-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20070630203350.F02F816505@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=72


iraicu at cs.uchicago.edu changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |iraicu at cs.uchicago.edu


------- Comment #4 from iraicu at cs.uchicago.edu  2007-06-30 15:33 -------
Hi again,
Here is an update of yesterday's 244 molecule run.  The experiment ran further
than before, but it still did not complete.  There were 240 molecules that
completed successfully (in the previous run, no molecule finished), but 4
molecules still did not finish. 

Here is the breakdown on the tasks:
Exit Code 0: 20695 tasks
Exit Code -3: 6 tasks
Exit Code -1: 3585 tasks
=====================
Total: 24286 tasks

The 3 usual Falkon graphs can be found here:
http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed/executor_graph.jpg
http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed/summary_graph.jpg
http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed/task_graph.jpg

The relevant Falkon logs are here (there are more if people are interested, in
total over 600MB of logs):
http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed/Falkon_logs/
The Swift log is here:
http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed/Swift_logs/MolDyn-244-63ar6atbg2ae1.log

From Falkon's point of view, things looked fine: tasks came in, they got
processed, they got returned.  

We haven't had a chance to analyze the Swift end of the logs yet, so we don't
know for sure what happened.  We fixed the potential synchronization issue
Mihael pointed out.  We also fixed a badly handled exception we had in the
Falkon provider, that would give up very easily and exit the Falkon provider
thread in case of an exception, even if it wasn't a fatal one.  This time
around, we changed the logic to simply print the exception, if there were any,
and not exit the Falkon provider, just continue.  Personally, I think this
logic on handling exceptions in the Falkon provider was causing the Falkon
provider to exit prematurely, and hence not send any more tasks to Falkon...
note that Swift was setting the status of submitted tasks to the Falkon
provider in a separate thread, which was not necessarily exiting when the Falkon
provider was, and hence we had the scenario in which Swift thought it sent out
more tasks than Falkon really saw. 
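
As a generic illustration of the log-and-continue change described above, a
minimal sketch follows. ResilientWorker and its methods are hypothetical
names, and this is not the Falkon provider code; it only shows a loop that
records a non-fatal exception and moves on to the next task.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the Falkon provider.
class ResilientWorker {
    // Runs every task in order. A RuntimeException from one task is recorded
    // and the loop continues, so one bad task does not end the worker.
    static List<String> runAll(List<Runnable> tasks) {
        List<String> failures = new ArrayList<>();
        for (int i = 0; i < tasks.size(); i++) {
            try {
                tasks.get(i).run();
            } catch (RuntimeException e) {
                failures.add("task " + i + ": " + e.getMessage());
            }
        }
        return failures;
    }

    // Three tasks, the middle one fails; returns {completed, failed}.
    static int[] demo() {
        int[] completed = {0};
        List<Runnable> tasks = new ArrayList<>();
        tasks.add(() -> completed[0]++);
        tasks.add(() -> { throw new IllegalStateException("transient failure"); });
        tasks.add(() -> completed[0]++);
        List<String> failures = runAll(tasks);
        return new int[] { completed[0], failures.size() };
    }

    public static void main(String[] args) {
        int[] result = demo();
        System.out.println(result[0] + " completed, " + result[1] + " failed");
    }
}
```

The failures are collected so that the caller can still report them; printing
an exception and dropping it would hide the failed task from whatever is
tracking it.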

Now, the issue that I think stopped this experiment.  On the console of Swift,
the last thing that it printed was a "stack overflow error"; I don't think this
printed in the logs, just on the console.  I believe this is a JVM error when a
thread recurses too deep and the thread stack size is not large
enough.  We saw this same error on Thursday in some synthetic experiments with
20K sleep jobs, but it was not repeatable every time.  Does anyone have any
idea where this stack overflow could be coming from? 

Ioan


-- 
Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.


From bugzilla-daemon at mcs.anl.gov  Sat Jun 30 17:09:11 2007
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Sat, 30 Jun 2007 17:09:11 -0500 (CDT)
Subject: [Swift-devel] [Bug 72] Campaign for scaling wf up to 244 molecules
In-Reply-To: <bug-72-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20070630220911.9557A16506@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=72


------- Comment #5 from hategan at mcs.anl.gov  2007-06-30 17:09 -------
First of all, can you commit the changes to SVN?

(In reply to comment #4)
> We fixed the potential synchronization issue
> Mihael pointed out.

There were two.

> We also fixed a badly handled exception we had in the
> Falkon provider, that would give up very easily and exit the Falkon provider
> thread in case of an exception, even if it wasn't a fatal one.  This time
> around, we changed the logic to simply print the exception, if there were any,
> and not exit the Falkon provider, just continue.  Personally, I think this
> logic on handling exceptions in the Falkon provider was causing the Falkon
> provider to exit prematurely, and hence not send any more tasks to Falkon...

I can't seem to find anything that would fit that profile in the provider code.
Can you be more specific? If the provider was setting the status of the task to
failed, then it doesn't matter. Swift retries failed things.

> note that Swift was setting the status of submitted tasks to the Falkon
> provider in a separate thread,

Swift does not set status of tasks. That's what the provider is supposed to do.

> which was not necessarily exiting when the Falkon
> provider was, and hence we had the scenario in which Swift thought it sent out
> more tasks than Falkon really saw. 

Can you be more specific? If there is a problem in Swift, we need to fix it,
but your comment is too vague.

> 
> Now, the issue that I think stopped this experiment.  On the console of Swift,
> the last thing that it printed was a "stack overflow error"; I don't think this
> printed in the logs, just on the console.

Without the stack trace, the information is not very useful.
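
One way to get that trace, as a sketch under assumptions: OverflowTrace,
recurse() and capture() are hypothetical names, and the runaway recursion
only stands in for whatever actually overflowed. It uses the standard
Thread.setUncaughtExceptionHandler hook to keep the error instead of letting
it go to the console only.

```java
// Hypothetical illustration only; not Swift or Karajan code.
class OverflowTrace {
    // Never terminates on its own; it ends with a StackOverflowError.
    static int recurse(int depth) {
        return recurse(depth + 1) + 1;
    }

    // Runs the recursion in its own thread and keeps the uncaught error so
    // its stack trace can be written to a log file.
    static String capture() {
        final Throwable[] seen = new Throwable[1];
        Thread t = new Thread(() -> recurse(0), "deep-recursion");
        t.setUncaughtExceptionHandler((thread, error) -> seen[0] = error);
        t.start();
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        if (seen[0] == null) {
            return "no error";
        }
        StackTraceElement[] frames = seen[0].getStackTrace();
        String top = frames.length > 0 ? frames[0].toString() : "(no frames)";
        return seen[0].getClass().getName() + " at " + top;
    }

    public static void main(String[] args) {
        System.out.println(capture());
    }
}
```

The recorded frames name the method that was recursing, which is the piece of
information missing from a bare "stack overflow error" line on the console.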

> 
> Ioan
> 


-- 
Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.


From hategan at mcs.anl.gov  Sat Jun 30 17:39:01 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sat, 30 Jun 2007 17:39:01 -0500
Subject: [Swift-devel] 244 molecule workflow
Message-ID: <1183243141.15631.3.camel@blabla.mcs.anl.gov>

Looking at the swift logs, I stumbled across a few exceptions, with
stack traces that contain things like:
charmm3 @ MolDyn-244.kml, line: 1029201

That thing has over one million lines. Disturbing. For loops have been
invented.

Mihael


From bugzilla-daemon at mcs.anl.gov  Sat Jun 30 17:52:07 2007
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Sat, 30 Jun 2007 17:52:07 -0500 (CDT)
Subject: [Swift-devel] [Bug 72] Campaign for scaling wf up to 244 molecules
In-Reply-To: <bug-72-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20070630225207.B70D916506@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=72


------- Comment #6 from hategan at mcs.anl.gov  2007-06-30 17:52 -------
(In reply to comment #4)
> Hi again,
> Here is an update of yesterday's 244 molecule run.  The experiment ran further
> than before, but it still did not complete.  There were 240 molecules that
> completed successfully (in the previous run, no molecule finished), but 4
> molecules still did not finish. 
> 

Actually it looks like the tasks worked fine:
bash-3.1$ cat MolDyn-244-63ar6atbg2ae1.log |grep "type=1.*ubmitted"|wc
  24309  243090 2806214
bash-3.1$ cat MolDyn-244-63ar6atbg2ae1.log |grep "type=1.*ailed"|wc
   3614   36140  405816
bash-3.1$ cat MolDyn-244-63ar6atbg2ae1.log |grep "type=1.*ompleted"|wc
  20695  206950 2389556

All tasks are accounted for. It may be that some jobs failed 3 times in a row.
From the logs it looks like the workflow almost finished and it got to the
point where the error reporting was to be done. Perhaps the stack overflow that
you saw occurred there, and perhaps the impossible size of the workflow might
have something to do with it.


-- 
Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.


From itf at mcs.anl.gov  Sat Jun 30 21:10:03 2007
From: itf at mcs.anl.gov (Ian Foster)
Date: Sun, 1 Jul 2007 02:10:03 +0000
Subject: [Swift-devel] [Bug 72] Campaign for scaling wf up to 244 molecules
In-Reply-To: <20070630225207.B70D916506@foxtrot.mcs.anl.gov>
References: <bug-72-21@http.bugzilla.mcs.anl.gov/swift/><20070630225207.B70D916506@foxtrot.mcs.anl.gov>
Message-ID: <1702663950-1183255865-cardhu_decombobulator_blackberry.rim.net-1244943269-@bxe006.bisx.prod.on.blackberry>

Why do you say the workflow's size was "impossible"? It doesn't seem that large to me. We'd like to run larger ones!

