From benc at hawaga.org.uk  Sun Feb  1 11:36:14 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sun, 1 Feb 2009 17:36:14 +0000 (GMT)
Subject: [Swift-devel] [VOTE] Expanding arrays in app function command lines
Message-ID: <Pine.LNX.4.64.0902011729410.14259@dildano.hawaga.org.uk>


it being slightly unclear in my mind whether the below discussed change 
was generally approved of, here is a more formal request for 
clarification.

the change that we talked about is in this thread:

  Subject: Re: [Swift-user] Expanding arrays in app function command lines

the proposal (which I sent a patch for) is to change the handling of app 
parameters to expand string arrays into multiple command line arguments.
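As a rough illustration of the difference, here is a plain POSIX shell sketch. It is not SwiftScript and not the patch itself. A program receives one argument when an array is joined into a single string, and one argument per element when the array is expanded. The count_args helper is hypothetical:

```sh
#!/bin/sh
# Hypothetical helper: report how many argv entries it was given.
count_args() {
  echo "$#"
}

# A three-element "string array", held in the positional parameters.
set -- "-v" "--level" "3"

# Joined into a single string: the callee sees one argument.
count_args "$*"    # prints 1

# Expanded element by element: the callee sees three arguments.
count_args "$@"    # prints 3
```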

Vote as in http://dev.globus.org/wiki/Guidelines#Action_Item_Votes

-- 


From hategan at mcs.anl.gov  Sun Feb  1 11:39:17 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sun, 01 Feb 2009 11:39:17 -0600
Subject: [Swift-devel] [VOTE] Expanding arrays in app function command
	lines
In-Reply-To: <Pine.LNX.4.64.0902011729410.14259@dildano.hawaga.org.uk>
References: <Pine.LNX.4.64.0902011729410.14259@dildano.hawaga.org.uk>
Message-ID: <1233509957.12878.0.camel@localhost>

+1

On Sun, 2009-02-01 at 17:36 +0000, Ben Clifford wrote:
> it being slightly unclear in my mind whether the below discussed change 
> was generally approved of, here is a more formal request for 
> clarification.
> 
> the change that we talked about is in this thread:
> 
>   Subject: Re: [Swift-user] Expanding arrays in app function command lines
> 
> the proposal (which I sent a patch for) is to change the handling of app 
> parameters to expand string arrays into multiple command line arguments.
> 
> Vote as in http://dev.globus.org/wiki/Guidelines#Action_Item_Votes
> 


From benc at hawaga.org.uk  Sun Feb  1 18:58:11 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 2 Feb 2009 00:58:11 +0000 (GMT)
Subject: [Swift-devel] swift changing walltime of prews-gram jobs
In-Reply-To: <1233334434.14201.3.camel@localhost>
References: <50b07b4b0901241503r72f28b96rec19583bb8044ea1@mail.gmail.com> 
	<Pine.LNX.4.64.0901251342150.14996@dildano.hawaga.org.uk> 
	<Pine.LNX.4.64.0901301638480.8995@dildano.hawaga.org.uk>
	<1233334434.14201.3.camel@localhost>
Message-ID: <Pine.LNX.4.64.0902020056530.14259@dildano.hawaga.org.uk>


On Fri, 30 Jan 2009, Mihael Hategan wrote:

> I suppose the PBS provider could adopt the same scheme.

The stuff you put in r2266 almost worked for me. r2270 adds a missing 
newline, and now it seems to run correctly and with the right 
walltime.

-- 


From benc at hawaga.org.uk  Mon Feb  2 11:38:56 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 2 Feb 2009 17:38:56 +0000 (GMT)
Subject: [Swift-devel] Coasters failing on Teraport - cant find Java?
In-Reply-To: <497FE73F.7000307@mcs.anl.gov>
References: <50b07b4b0901241503r72f28b96rec19583bb8044ea1@mail.gmail.com> 
	<497F637D.5080707@mcs.anl.gov> <1233113611.2159.25.camel@localhost>
	<497FE73F.7000307@mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0902021737090.8995@dildano.hawaga.org.uk>


On Tue, 27 Jan 2009, Michael Wilde wrote:

> 1) On OSG sites, the jobmanager(s) are modified to insert OSG env vars and set
> the PATH to contain OSG stuff. So if you do a globus-job-run of

This isn't universal OSG behaviour. Some sites give you 
PATH=/bin:/usr/bin

-- 


From hategan at mcs.anl.gov  Mon Feb  2 11:58:11 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 02 Feb 2009 11:58:11 -0600
Subject: [Swift-devel] Coasters failing on Teraport - cant find Java?
In-Reply-To: <Pine.LNX.4.64.0902021737090.8995@dildano.hawaga.org.uk>
References: <50b07b4b0901241503r72f28b96rec19583bb8044ea1@mail.gmail.com>
	<497F637D.5080707@mcs.anl.gov> <1233113611.2159.25.camel@localhost>
	<497FE73F.7000307@mcs.anl.gov>
	<Pine.LNX.4.64.0902021737090.8995@dildano.hawaga.org.uk>
Message-ID: <1233597491.22200.3.camel@localhost>

On Mon, 2009-02-02 at 17:38 +0000, Ben Clifford wrote:
> On Tue, 27 Jan 2009, Michael Wilde wrote:
> 
> > 1) On OSG sites, the jobmanager(s) are modified to insert OSG env vars and set
> > the PATH to contain OSG stuff. So if you do a globus-job-run of
> 
> This isn't universal OSG behaviour. Some sites give you 
> PATH=/bin:/usr/bin
> 

Which happens to be useless.

I suppose, for those sites, we need to have an option to explicitly set
where Java is, if that doesn't already work somehow.


From bugzilla-daemon at mcs.anl.gov  Tue Feb  3 05:24:30 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Tue,  3 Feb 2009 05:24:30 -0600 (CST)
Subject: [Swift-devel] [Bug 173] New: poor syntax error missing close
	parentheses at end of procedure invocation
Message-ID: <bug-173-21@http.bugzilla.mcs.anl.gov/swift/>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=173

           Summary: poor syntax error missing close parentheses at end of
                    procedure invocation
           Product: Swift
           Version: unspecified
          Platform: Macintosh
        OS/Version: Mac OS
            Status: NEW
          Severity: normal
          Priority: P2
         Component: SwiftScript language
        AssignedTo: benc at hawaga.org.uk
        ReportedBy: benc at hawaga.org.uk


The fragment below is missing a ) at the end of the trace invocation. The
ANTLR-generated parser fails to parse it, giving the confusing message below.
It would be better if the error message pointed at the missing ).

$ cat rw.swift

type file;

file s <"muppet.gif">;

trace(@regexp(@filename(s),"gif","jpg");

$ swift rw.swift 
Could not compile SwiftScript source: line 6:1: unexpected token: trace


-- 
Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You reported the bug, or are watching the reporter.
You are the assignee for the bug, or are watching the assignee.


From benc at hawaga.org.uk  Tue Feb  3 05:26:36 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 3 Feb 2009 11:26:36 +0000 (GMT)
Subject: [Swift-devel] regexp_mapper redundant?
Message-ID: <Pine.LNX.4.64.0902031108120.14259@dildano.hawaga.org.uk>


It might be the case now that many of the mappings provided by 
regexp_mapper can be implemented using single_file_mapper and @regexp 
and @filename.

Previously this did not work, I think because of lack of dataflow 
dependency handling in mapper parameters. However, that handling is in 
place now.

So perhaps it is the case that:

file f <regexp_mapper;source=s,match="(.*)gif",transform="\1jpg">;

is the same as:

file f <single_file_mapper;file=@regexp(@filename(s),"gif","jpg")>;

This isn't quite a complete replacement, because the @regexp function 
doesn't seem to support substitution groups like \1.

It perhaps could be made to, though.
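To make the \1 point concrete, here is a small shell sketch. It uses sed as a stand-in regexp engine and says nothing about how @regexp itself is implemented. A plain pattern rewrites the first "gif" it finds. A capture group, as in regexp_mapper's match="(.*)gif" with transform="\1jpg", keeps the prefix and rewrites only the last "gif":

```sh
#!/bin/sh
# sed stands in for a regexp engine here; this is not Swift's @regexp.

# Plain substitution of the first "gif" found.
echo "muppet.gif" | sed 's/gif/jpg/'             # muppet.jpg
echo "gifts.gif"  | sed 's/gif/jpg/'             # jpgts.gif

# Capture group, mirroring match="(.*)gif", transform="\1jpg".
# The greedy (.*) swallows everything up to the last "gif".
echo "muppet.gif" | sed 's/\(.*\)gif/\1jpg/'     # muppet.jpg
echo "gifts.gif"  | sed 's/\(.*\)gif/\1jpg/'     # gifts.jpg
```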

-- 


From benc at hawaga.org.uk  Tue Feb  3 05:45:06 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 3 Feb 2009 11:45:06 +0000 (GMT)
Subject: [Swift-devel] "type file;" by default
Message-ID: <Pine.LNX.4.64.0902031129570.14259@dildano.hawaga.org.uk>


Pretty much every simple SwiftScript program that I write, I find myself 
putting in "type file;" at the start, and avoiding "marker types" of the 
form:

   type picturefile;

and thus ignoring application-level type checking (checking that I'm not 
feeding a picture into a text processing app, and the like) whilst still 
taking advantage of other swift type checking.

To simplify low-end uses of the language, it might be useful to have the 
above "type file;" defined as a built-in type.

This has been discussed before, but I'd like to know what people's opinions 
are.

-- 


From wilde at mcs.anl.gov  Tue Feb  3 08:36:30 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 03 Feb 2009 08:36:30 -0600
Subject: [Swift-devel] Coasters failing on Teraport - cant find Java?
In-Reply-To: <1233597491.22200.3.camel@localhost>
References: <50b07b4b0901241503r72f28b96rec19583bb8044ea1@mail.gmail.com>	
	<497F637D.5080707@mcs.anl.gov>
	<1233113611.2159.25.camel@localhost>	
	<497FE73F.7000307@mcs.anl.gov>	
	<Pine.LNX.4.64.0902021737090.8995@dildano.hawaga.org.uk>
	<1233597491.22200.3.camel@localhost>
Message-ID: <4988566E.2070700@mcs.anl.gov>

The approach I'm testing is this:

if user has a .coasterinit file
   source it to put java in PATH
else if java is in PATH
   use it
else source /etc/profile

(executed under a non-login shell, i.e., never use /bin/sh -l)

Right now I have the above in a different order (.coasterinit last) and 
it works on ranger, mercury and teraport.

.coasterinit is a more flexible alternative to a per-site option that 
points to java. I'm not sure which is better.
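A minimal POSIX sh sketch of the lookup order listed above follows. It is not the actual bootstrap.sh. It assumes the user file is named ~/.coasterinit as proposed, and that /etc/profile is only sourced as a last resort. find_java is a hypothetical name:

```sh
#!/bin/sh
# Sketch of the proposed Java lookup for the coaster bootstrap.
# Meant to run under a non-login shell (never "/bin/sh -l").
# ~/.coasterinit is the user file name proposed in this thread.

find_java() {
  if [ -f "$HOME/.coasterinit" ]; then
    # User-supplied setup, expected to put java on PATH.
    . "$HOME/.coasterinit"
  elif ! command -v java >/dev/null 2>&1; then
    # java is not already on PATH: fall back to the site profile.
    . /etc/profile
  fi
  # Print the java that will be used (non-zero status if none found).
  command -v java
}
```

Both files are sourced into the current shell, so any PATH change they make stays in effect for the java launch that follows.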

On 2/2/09 11:58 AM, Mihael Hategan wrote:
> On Mon, 2009-02-02 at 17:38 +0000, Ben Clifford wrote:
>> On Tue, 27 Jan 2009, Michael Wilde wrote:
>>
>>> 1) On OSG sites, the jobmanager(s) are modified to insert OSG env vars and set
>>> the PATH to contain OSG stuff. So if you do a globus-job-run of
>> This isn't universal OSG behaviour. Some sites give you 
>> PATH=/bin:/usr/bin
>>
> 
> Which happens to be useless.
> 
> I suppose, for those sites, we need to have an option to explicitly set
> where Java is, if that doesn't already work somehow.
> 


From wilde at mcs.anl.gov  Tue Feb  3 08:44:47 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 03 Feb 2009 08:44:47 -0600
Subject: [Swift-devel] Coasters failing on Teraport - cant find Java?
In-Reply-To: <1233340920.18750.1.camel@localhost>
References: <50b07b4b0901241503r72f28b96rec19583bb8044ea1@mail.gmail.com>	
	<497F637D.5080707@mcs.anl.gov>
	<1233113611.2159.25.camel@localhost>	
	<497FE73F.7000307@mcs.anl.gov> <1233340920.18750.1.camel@localhost>
Message-ID: <4988585F.9020408@mcs.anl.gov>

I didn't see this message till now. I'll compare this to the approach I 
was testing (see previous message) and see what works where.

- Mike

On 1/30/09 12:42 PM, Mihael Hategan wrote:
> Cog r2267 contains a tentative fix for this. The bootstrap script is
> started without -l, and if java cannot be found, it attempts to get that
> information using bash -l.
> 
> I haven't tested it.
> 
> On Tue, 2009-01-27 at 23:03 -0600, Michael Wilde wrote:
>> I dug a bit deeper. As far as I can tell, this is what's happening:
>>
>> 1) On OSG sites, the jobmanager(s) are modified to insert OSG env vars 
>> and set the PATH to contain OSG stuff. So if you do a globus-job-run of 
>> /usr/bin/printenv (i.e. with no shell) you see all this, including java 
>> in the path (from an osg dir).
>>
>> 2) when you globus-job-run /bin/sh, all this stays around, but
>>
>> 3) when you globus-job-run /bin/sh with -l, it runs /etc/profile, which 
>> un-does the path and LD_LIBRARY_PATH, setting PATH to some default and 
>> LD_LIBRARY_PATH to null.  I *think* this is being done by softenv which 
>> runs from /etc/profile.d, called at the end of /etc/profile.
>>
>> You can simulate this with:
>>
>> globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -c "which java; source /etc/profile; which java"
>> (or try printenv instead of which java to see the details)
>>
>> So bottom line: there are at least two cases where -l hurts: this one, and 
>> abe, where attempts to run login shells from globus are thwarted.
>>
>> If the purpose of -l was just to get java in the path, then
>> for OSG sites that behave like teraport, just omitting -l should work, 
>> because the OSG jobmanager mods put it in the path.
>>
>> For sites like abe, bypassing -l, and forcing the user to put Java in 
>> the path with a .bashrc or equivalent, may work. (The hack I used on abe 
>> was to remove the -l arg, and insert this in bootstrap.sh:
>>
>> +if [ -f ~/.myetcprofile ]; then
>> +  source ~/.myetcprofile
>> +else
>> +  source /etc/profile
>> +fi
>>
>> One option is to accept a per-site option from sites.xml to bypass "-l" 
>>   on the startup shell, and insert the logic above for something like 
>> .coasterinit, sourcing that if the user provides it.
>>
>> Another option is to put a +java line in the OSG .soft file on TeraPort.
>>
>> It's possible this problem only exists on the few sites like teraport that 
>> run both OSG mods and softenv???
>>
>> I think we need to test coasters broadly across OSG to be sure (Ben's IP 
>> problem is a case in point).  But a simple shell test across all the OSG 
>> VO sites could detect whether Java will be there or not, with and 
>> without -l.
>>
>> - Mike
>>
>>
>> On 1/27/09 9:33 PM, Mihael Hategan wrote:
>>> Hmm. Looks like -l has the opposite effect of what I thought it should
>>> do (end up with an environment equivalent to the one you get when you
>>> log in as an interactive session). Is it my misunderstanding or
>>> something else?
>>>
>>> On Tue, 2009-01-27 at 13:41 -0600, Michael Wilde wrote:
>>>> Related to: Re: [Swift-devel] swift changing walltime of prews-gram jobs
>>>>
>>>> I can't get a Swift script to run on coasters on TeraPort in gt2:gt2:pbs 
>>>> mode.
>>>>
>>>> I'm using 0.8rc1 and submitting from tp-login.
>>>>
>>>> I am running with a DOEgrids cert in the OSG VO.
>>>>
>>>> I *think* the issue is that when a gt2 job on this VO runs with a login 
>>>> shell, it doesn't get java in its path.
>>>>
>>>> When I run /bin/sh *without* the "-l" option, under globus, I do get a 
>>>> java in my path.
>>>>
>>>> Allan: what VO did you run on when you got a successful gt2:gt2:pbs 
>>>> coaster run on teraport, after you fixed the walltime issue?
>>>>
>>>> It seems to me that this is a rough edge with coaster startup. Recall 
>>>> that I had a similar problem running on abe last year: I had to edit out 
>>>> the "-l" and create a custom .profile to get coasters to work.
>>>>
>>>> It would be great if we can iron this out in 0.8 or soon after. I'm 
>>>> willing to do some testing and enlist help from Allan and Zhengxiong for 
>>>> wider testing.
>>>>
>>>> Do we need special site attributes for specific sites to override 
>>>> default behaviors when they don't work?
>>>>
>>>>
>>>> My sites.xml is:
>>>>
>>>> <config>
>>>> <pool handle="teraport" >
>>>>    <profile namespace="globus" key="queue">fast</profile>
>>>>    <profile namespace="globus" key="maxwalltime">00:05:00</profile>
>>>>    <gridftp url="gsiftp://tp-grid1.ci.uchicago.edu" />
>>>>    <execution provider="coaster"
>>>>       url="tp-grid1.ci.uchicago.edu"
>>>>       jobmanager="gt2:gt2:pbs" />
>>>>    <workdirectory>/gpfs1/osg/data/oops/swiftwork</workdirectory>
>>>> </pool>
>>>> </config>
>>>>
>>>> I get this on stdout/err:
>>>>
>>>> ---------------------------------------------
>>>> Swift 0.8rc1 swift-r2448 cog-r2261
>>>>
>>>> RunID: 20090127-1305-hcxdpor3
>>>> Progress:
>>>> Progress:  Selecting site:2 Stage in:1 Initializing site shared directory:1
>>>> Progress:  Selecting site:2 Stage in:1 Submitting:1
>>>> Progress:  Selecting site:2 Submitting:1 Submitted:1
>>>> Failed to transfer wrapper log from oops5-20090127-1305-hcxdpor3/info/a 
>>>> on teraport
>>>> Execution failed:
>>>>          Exception in runoops:
>>>> Arguments: [input/fasta/T1af7.fasta, input/secseq/T1af7.secseq, 
>>>> input/native/T1af7.pdb, output/T1af7.1.pdt, output/T1af7.1.rmsd, 1, 
>>>> [TEMP UPDATE INTERVAL = 10, SMOOTH DEVIATION COEFFICIENT = 0.80001]]
>>>> Host: teraport
>>>> Directory: oops5-20090127-1305-hcxdpor3/jobs/a/runoops-asq0ir5j
>>>> stderr.txt:
>>>>
>>>> stdout.txt:
>>>>
>>>> ----
>>>>
>>>> Caused by:
>>>>          Could not submit job
>>>> Caused by:
>>>>          Could not start coaster service
>>>> Caused by:
>>>>          Task ended before registration was received.
>>>> STDOUT: which: no java in 
>>>> (/usr/kerberos/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-r1/bin:/home/osgvo/osg/bin/linux-rhel4-x86_64:/home/osgvo/osg/bin:/soft/xcat-1.2.0-r1/bin:/soft/xcat-1.2.0-r1/sbin:/soft/xcat-1.2.0-r1/x86_64/bin:/soft/xcat-1.2.0-r1/x86_64/sbin:/soft/xcat-1.2.0-r1/contrib/bin:/soft/xcat-1.2.0-r1/contrib/sbin:/soft/xcat-1.2.0-r1/contrib/x86_64/bin:/soft/xcat-1.2.0-r1/contrib/x86_64/sbin)
>>>> dirname: too few arguments
>>>> Try `dirname --help' for more information.
>>>> http://tp-login2.ci.uchicago.edu:50001: line 55: -Djava.home=/..: No 
>>>> such file or directory
>>>>
>>>> STDERR: null
>>>> Cleaning up...
>>>>   Done
>>>>
>>>> ------------------------------------
>>>>
>>>> Checking out the environment with this cert I see:
>>>>
>>>> tp$ globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -l -c 'java -version'
>>>> /bin/sh: java: command not found
>>>>
>>>>
>>>> tp$ globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -c 'java -version'
>>>> java version "1.5.0_14"
>>>> Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_14-b03)
>>>> Java HotSpot(TM) 64-Bit Server VM (build 1.5.0_14-b03, mixed mode)
>>>>
>>>>
>>>> tp$ globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -l -c 'which java; echo JAVA_HOME IS: $JAVA_HOME;echo PATH IS: $PATH'
>>>> JAVA_HOME IS:
>>>> PATH IS: /usr/kerberos/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-r1/bin:/home/osgvo/osg/bin/linux-rhel4-x86_64:/home/osgvo/osg/bin:/soft/xcat-1.2.0-r1/bin:/soft/xcat-1.2.0-r1/sbin:/soft/xcat-1.2.0-r1/x86_64/bin:/soft/xcat-1.2.0-r1/x86_64/sbin:/soft/xcat-1.2.0-r1/contrib/bin:/soft/xcat-1.2.0-r1/contrib/sbin:/soft/xcat-1.2.0-r1/contrib/x86_64/bin:/soft/xcat-1.2.0-r1/contrib/x86_64/sbin
>>>> /usr/bin/which: no java in 
>>>> (/usr/kerberos/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-r1/bin:/home/osgvo/osg/bin/linux-rhel4-x86_64:/home/osgvo/osg/bin:/soft/xcat-1.2.0-r1/bin:/soft/xcat-1.2.0-r1/sbin:/soft/xcat-1.2.0-r1/x86_64/bin:/soft/xcat-1.2.0-r1/x86_64/sbin:/soft/xcat-1.2.0-r1/contrib/bin:/soft/xcat-1.2.0-r1/contrib/sbin:/soft/xcat-1.2.0-r1/contrib/x86_64/bin:/soft/xcat-1.2.0-r1/contrib/x86_64/sbin)
>>>> tp$
>>>>
>>>>
>>>> tp$ globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -c 'which java; echo JAVA_HOME IS: $JAVA_HOME;echo PATH IS: $PATH'
>>>>
>>>> /opt/osg-ce-0.8.0-r1/jdk1.5/bin/java
>>>> JAVA_HOME IS:
>>>> PATH IS: /opt/osg-ce-0.8.0-r1/condor/sbin:/opt/osg-ce-0.8.0-r1/condor/bin:/opt/osg-ce-0.8.0-r1/apache/bin:/opt/osg-ce-0.8.0-r1/srm-v2-client/bin:/opt/osg-ce-0.8.0-r1/srm-v1-client/sbin:/opt/osg-ce-0.8.0-r1/srm-v1-client/bin:/opt/osg-ce-0.8.0-r1/wget/bin:/opt/osg-ce-0.8.0-r1/gums/scripts:/opt/osg-ce-0.8.0-r1/cert-scripts/bin:/opt/osg-ce-0.8.0-r1/glite/sbin:/opt/osg-ce-0.8.0-r1/glite/bin:/opt/osg-ce-0.8.0-r1/edg/sbin:/opt/osg-ce-0.8.0-r1/prima/bin:/opt/osg-ce-0.8.0-r1/mysql/bin:/opt/osg-ce-0.8.0-r1/logrotate/sbin:/opt/osg-ce-0.8.0-r1/ant/bin:/opt/osg-ce-0.8.0-r1/jdk1.5/bin:/opt/osg-ce-0.8.0-r1/gpt/sbin:/opt/osg-ce-0.8.0-r1/globus/bin:/opt/osg-ce-0.8.0-r1/globus/sbin:/software/linux-rhel4-x86_64/pacman-3.21-r1/bin:/opt/osg-ce-0.8.0-r1/vdt/sbin:/opt/osg-ce-0.8.0-r1/vdt/bin:/opt/osg-ce-0.8.0-r1/condor/sbin:/opt/osg-ce-0.8.0-r1/condor/bin:/opt/osg-ce-0.8.0-r1/apache/bin:/opt/osg-ce-0.8.0-r1/srm-v2-client/bin:/opt/osg-ce-0.8.0-r1/srm-v1-client/sbin:/opt/osg-ce-0.8.0-r1/srm-v1-client/bin:/opt/osg-ce-0.8.0-r1/wget/bin:/opt/osg-ce-0.8.0-r1/gums/scripts:/opt/osg-ce-0.8.0-r1/cert-scripts/bin:/opt/osg-ce-0.8.0-r1/glite/sbin:/opt/osg-ce-0.8.0-r1/glite/bin:/opt/osg-ce-0.8.0-r1/edg/sbin:/opt/osg-ce-0.8.0-r1/prima/bin:/opt/osg-ce-0.8.0-r1/mysql/bin:/opt/osg-ce-0.8.0-r1/logrotate/sbin:/opt/osg-ce-0.8.0-r1/ant/bin:/opt/osg-ce-0.8.0-r1/jdk1.5/bin:/opt/osg-ce-0.8.0-r1/gpt/sbin:/software/linux-rhel4-x86_64/pacman-3.21-r1/bin:/opt/osg-ce-0.8.0-r1/vdt/sbin:/opt/osg-ce-0.8.0-r1/vdt/bin:/sbin:/usr/sbin:/bin:/usr/bin:/usr/X11R6/bin
>>>> tp$ globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -c 'java -version'
>>>> java version "1.5.0_14"
>>>> Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_14-b03)
>>>> Java HotSpot(TM) 64-Bit Server VM (build 1.5.0_14-b03, mixed mode)
>>>>
>>>>
>>>> - Mike
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On 1/24/09 5:03 PM, Allan Espinosa wrote:
>>>>> Hi,
>>>>>
>>>>> I am using swift 0.8rc1. The same also happens with v0.7.
>>>>>
>>>>> I tried submitting a job from communicado to tp-grid1 (teraport) using
>>>>> coasters. The swift runtime does not give any error, but it does not
>>>>> finish either. Looking through the files received by the teraport
>>>>> head node, I observed that swift keeps submitting GRAM jobs. It looks
>>>>> like the submitted PBS scripts kept finishing/failing.
>>>>>
>>>>> Digging through ~/.globus/jobs/tp-grid1.uchicago.edu/*/scheduler*, we
>>>>> see that maxwalltime became 101:00 instead of 00:10:00 (from sites.xml).
>>>>>
>>>>> /usr/bin/perl "/home/aespinosa/.globus/coasters/cscript63266.pl"
>>>>> "http://128.135.125.118:50001" "1728236079"
>>>>> #! /bin/sh
>>>>> # PBS batch job script built by Globus job manager
>>>>> #
>>>>> #PBS -S /bin/sh
>>>>> #PBS -m n
>>>>> #PBS -q fast
>>>>> #PBS -l walltime=101:00
>>>>> #PBS -o /dev/null
>>>>> #PBS -e /dev/null
>>>>> #PBS -l nodes=1
>>>>> HOME="/home/aespinosa";
>>>>> export HOME;
>>>>> OSG_DATA="/gpfs1/osg/data";
>>>>> ...
>>>>> ...
>>>>> counter=0
>>>>> exit_code=0
>>>>> while test $counter -lt 1; do
>>>>>     /bin/touch /home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/7432.1232837576/exit.$counter;
>>>>>
>>>>>     read tmp_exit_code <
>>>>> /home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/7432.1232837576/exit.$counter
>>>>>     if [ $exit_code = 0 -a $tmp_exit_code != 0 ]; then
>>>>>         exit_code=$tmp_exit_code
>>>>>     fi
>>>>>     counter=`expr $counter + 1`
>>>>> done
>>>>>
>>>>> exit $exit_code
>>>>> qsub: Job exceeds queue resource limits MSG=cannot satisfy queue max
>>>>> walltime requirement
>>>>>
>>>>>
>>>>>
>>>>> Below is my sites.xml:
>>>>>
>>>>> <config>
>>>>>
>>>>>   <pool handle="Teraport" sysinfo="INTEL32::LINUX">
>>>>>     <profile namespace="globus" key="queue">fast</profile>
>>>>>     <profile namespace="globus" key="maxwalltime">00:10:00</profile>
>>>>>     <gridftp  url="gsiftp://tp-grid1.ci.uchicago.edu/disks/tp-gpfs/scratch/aespinosa"
>>>>> storage="/opt/osg/data/aespinosa" major="2" minor="2" patch="4">
>>>>>     </gridftp>
>>>>>     <execution provider="coaster" url="tp-grid1.uchicago.edu"
>>>>> jobmanager="gt2:gt2:pbs" />
>>>>>     <filesystem provider="coaster" url="gt2://tp-grid1.uchicago.edu" />
>>>>>     <workdirectory >/disks/tp-gpfs/scratch/aespinosa</workdirectory>
>>>>>   </pool>
>>>>>
>>>>> </config>
>>>>>
>>>>> This does not happen if I use "local:pbs" as the jobmanager for the
>>>>> coaster; with that, jobs ran successfully.
>>>>> -Allan
>>>>>
>>>>>
>>>> _______________________________________________
>>>> Swift-devel mailing list
>>>> Swift-devel at ci.uchicago.edu
>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> 


From hategan at mcs.anl.gov  Tue Feb  3 09:16:59 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 03 Feb 2009 09:16:59 -0600
Subject: [Swift-devel] Coasters failing on Teraport - cant find Java?
In-Reply-To: <4988566E.2070700@mcs.anl.gov>
References: <50b07b4b0901241503r72f28b96rec19583bb8044ea1@mail.gmail.com>
	<497F637D.5080707@mcs.anl.gov> <1233113611.2159.25.camel@localhost>
	<497FE73F.7000307@mcs.anl.gov>
	<Pine.LNX.4.64.0902021737090.8995@dildano.hawaga.org.uk>
	<1233597491.22200.3.camel@localhost>  <4988566E.2070700@mcs.anl.gov>
Message-ID: <1233674219.24924.1.camel@localhost>

On Tue, 2009-02-03 at 08:36 -0600, Michael Wilde wrote:
> The approach I'm testing is this:
> 
> if user has a .coasterinit file
>    source it to put java in PATH
> else if java is in PATH
>    use it
> else source /etc/profile
> 
> (executed under a non-login shell, i.e., never use /bin/sh -l)
> 
> Right now I have the above in a different order (.coasterinit last) and 
> it works on ranger, mercury and teraport.
> 
> .coasterinit is a more flexible alternative to a per-site option that 
> points to java. I'm not sure which is better.

We already have a mechanism for specifying site properties (sites.xml).
I don't think we should invent a different one.


From benc at hawaga.org.uk  Tue Feb  3 11:00:38 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 3 Feb 2009 17:00:38 +0000 (GMT)
Subject: [Swift-devel] Coasters failing on Teraport - cant find Java?
In-Reply-To: <4988566E.2070700@mcs.anl.gov>
References: <50b07b4b0901241503r72f28b96rec19583bb8044ea1@mail.gmail.com> 
	<497F637D.5080707@mcs.anl.gov> <1233113611.2159.25.camel@localhost> 
	<497FE73F.7000307@mcs.anl.gov>
	<Pine.LNX.4.64.0902021737090.8995@dildano.hawaga.org.uk>
	<1233597491.22200.3.camel@localhost> <4988566E.2070700@mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0902031659220.14259@dildano.hawaga.org.uk>


On Tue, 3 Feb 2009, Michael Wilde wrote:

> .coasterinit is a more flexible alternative to a per-site option that points
> to java. I'm not sure which is better.

Flexible in that you can run arbitrary commands; however, less flexible in 
that it is per-remote-uid, not per-run. Per-site options are implicitly 
also settable per-run (and hence per-submit-side-user, 
per-installed-swift-version, and the like).

-- 


From benc at hawaga.org.uk  Tue Feb  3 16:26:38 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 3 Feb 2009 22:26:38 +0000 (GMT)
Subject: [Swift-devel] throttle flow diagram
Message-ID: <Pine.LNX.4.64.0902032225170.14259@dildano.hawaga.org.uk>


I just drew this attempt at showing where the various throttles in Swift 
are and which parameters control them.

http://www.ci.uchicago.edu/~benc/tmp/throttle-flow.jpeg

Comments both on the technical content (which throttles are where) and on 
the best layout to draw this diagram are welcome.

Eventually I'll draw it using some kind of computer program and put it in 
the user guide.

-- 


From aespinosa at cs.uchicago.edu  Tue Feb  3 17:02:22 2009
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Tue, 3 Feb 2009 17:02:22 -0600
Subject: [Swift-devel] coasters paths (was Re: Coasters failing on Teraport
	- cant find Java?)
Message-ID: <50b07b4b0902031502v5074b655wc20c1b15dfa1daaa@mail.gmail.com>

I also am having path problems in running coasters remotely. This
time it's looking for "curl" and "wget" (both are in /usr/bin).
logfile snippet:
Caused by: org.globus.cog.abstraction.impl.common.task.TaskSubmissionException:
Could not submit job
Caused by: org.globus.cog.abstraction.impl.common.task.TaskSubmissionException:
Could not start coaster service
Caused by: org.globus.cog.abstraction.impl.common.task.TaskSubmissionException:
Task ended before registration was received.
STDOUT: No wget or curl available

Using globus-job-run, it also does not find these binaries by default.
Using "sh -l" gives me tty permission errors on Ranger:
[aespinosa at communicado ~]$ globus-job-run gatekeeper.ranger.tacc.teragrid.org /usr/bin/which wget
/usr/bin/which: no wget in ((null))
[aespinosa at communicado ~]$ globus-job-run gatekeeper.ranger.tacc.teragrid.org /bin/sh -l -c "which curl"
stty: standard input: Inappropriate ioctl for device
stty: standard input: Inappropriate ioctl for device
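For illustration only, here is a POSIX sh sketch of a downloader check that also tolerates an empty PATH like the ((null)) one above. It appends /usr/bin and /bin before searching. This is an assumption about one possible fix, not what cog does. pick_downloader is a hypothetical name:

```sh
#!/bin/sh
# Hypothetical sketch: pick wget or curl even if the job starts with an
# empty PATH. The appended directories are an assumption, not cog's logic.

pick_downloader() {
  # Keep any existing PATH, then add the standard system directories.
  PATH="${PATH:+$PATH:}/usr/bin:/bin"
  export PATH
  for tool in wget curl; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
      return 0
    fi
  done
  echo "No wget or curl available" >&2
  return 1
}
```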


On Tue, Jan 27, 2009 at 1:41 PM, Michael Wilde <wilde at mcs.anl.gov> wrote:
> Related to: Re: [Swift-devel] swift changing walltime of prews-gram jobs
>
> I can't get a Swift script to run on coasters on TeraPort in gt2:gt2:pbs
> mode.
>
> I'm using 0.8rc1 and submitting from tp-login.
>
> I am running with a DOEgrids cert in the OSG VO.
>
> I *think* the issue is that when a gt2 job on this VO runs with a login
> shell, it doesn't get java in its path.
>
> When I run /bin/sh *without* the "-l" option, under globus, I do get a java
> in my path.
>


From benc at hawaga.org.uk  Tue Feb  3 17:37:21 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 3 Feb 2009 23:37:21 +0000 (GMT)
Subject: [Swift-devel] Re: [Swift-user] gsiftp filesystem mapper removes
	leading slash
In-Reply-To: <50b07b4b0902031412t5f05bda2t4ff88326df678fb7@mail.gmail.com>
References: <50b07b4b0902031343g1c800451j78b0b914c944a416@mail.gmail.com> 
	<Pine.LNX.4.64.0902032147310.8995@dildano.hawaga.org.uk>
	<50b07b4b0902031412t5f05bda2t4ff88326df678fb7@mail.gmail.com>
Message-ID: <Pine.LNX.4.64.0902032334030.14259@dildano.hawaga.org.uk>


Moved from swift-user

On Tue, 3 Feb 2009, Allan Espinosa wrote:

> Oh, I see. Now I'm getting NullPointerExceptions:
> database pir[] <filesys_mapper;location="gsiftp://gridftp.ranger.tacc.teragrid.org//work/01035/tg802895/pir",
> pattern="UNIPROT_for_blast_14.0.seq*">;

I can recreate the same stacktrace you see, against my directory on 
teraport. The below change makes it go away for me.

Get a clean fresh source tree, then:

 $ cd cog
 $ wget http://www.ci.uchicago.edu/~benc/tmp/ftpbug-1.patch
 $ patch -p1 < ftpbug-1.patch
 $ ant redist

And try that.

Probably you should keep a copy of your source tree before applying the 
patch so that you can easily get rid of it.

--


From hategan at mcs.anl.gov  Tue Feb  3 18:37:16 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 03 Feb 2009 18:37:16 -0600
Subject: [Swift-devel] coasters paths (was Re: Coasters failing on
	Teraport - cant find Java?)
In-Reply-To: <50b07b4b0902031502v5074b655wc20c1b15dfa1daaa@mail.gmail.com>
References: <50b07b4b0902031502v5074b655wc20c1b15dfa1daaa@mail.gmail.com>
Message-ID: <1233707836.12879.3.camel@localhost>

On Tue, 2009-02-03 at 17:02 -0600, Allan Espinosa wrote:
> I also am having path problems in running coasters remotely. This
> time it's looking for "curl" and "wget" (both are in /usr/bin).
> logfile snippet:
> Caused by: org.globus.cog.abstraction.impl.common.task.TaskSubmissionException:
> Could not submit job
> Caused by: org.globus.cog.abstraction.impl.common.task.TaskSubmissionException:
> Could not start coaster service
> Caused by: org.globus.cog.abstraction.impl.common.task.TaskSubmissionException:
> Task ended before registration was received.
> STDOUT: No wget or curl available

What version of cog is this?

It occurred to me that the change I made a few days ago might solve the
java problem on some sites, but also mess up wget/curl lookup.


From aespinosa at cs.uchicago.edu  Tue Feb  3 18:39:43 2009
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Tue, 3 Feb 2009 18:39:43 -0600
Subject: [Swift-devel] coasters paths (was Re: Coasters failing on
	Teraport - cant find Java?)
In-Reply-To: <1233707836.12879.3.camel@localhost>
References: <50b07b4b0902031502v5074b655wc20c1b15dfa1daaa@mail.gmail.com>
	<1233707836.12879.3.camel@localhost>
Message-ID: <20090204003943.GA5628@quadrant>

I think I am using the latest revision (2271) for cog.
For swift, my build is using 2490.

On Tue, Feb 03, 2009 at 06:37:16PM -0600, Mihael Hategan wrote:
> 
> What version of cog is this?
> 
> It occurred to me that the change I made a few days ago might solve the
> java problem on some sites, but also mess up wget/curl lookup.
> 


From hategan at mcs.anl.gov  Tue Feb  3 19:59:03 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 03 Feb 2009 19:59:03 -0600
Subject: [Swift-devel] coasters paths (was Re: Coasters failing on
	Teraport - cant find Java?)
In-Reply-To: <20090204003943.GA5628@quadrant>
References: <50b07b4b0902031502v5074b655wc20c1b15dfa1daaa@mail.gmail.com>
	<1233707836.12879.3.camel@localhost> <20090204003943.GA5628@quadrant>
Message-ID: <1233712743.20123.0.camel@localhost>

cog r2272 contains a tentative fix for the issue. I tested it locally so
far.

On Tue, 2009-02-03 at 18:39 -0600, Allan Espinosa wrote:
> I think I am using the latest revision (2271) for cog.
> For swift, my build is using 2490.
> 
> On Tue, Feb 03, 2009 at 06:37:16PM -0600, Mihael Hategan wrote:
> > 
> > What version of cog is this?
> > 
> > It occurred to me that the change I made a few days ago might solve the
> > java problem on some sites, but also mess up wget/curl lookup.
> > 


From wilde at mcs.anl.gov  Wed Feb  4 07:49:12 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 04 Feb 2009 07:49:12 -0600
Subject: [Swift-devel] "type file;" by default
In-Reply-To: <Pine.LNX.4.64.0902031129570.14259@dildano.hawaga.org.uk>
References: <Pine.LNX.4.64.0902031129570.14259@dildano.hawaga.org.uk>
Message-ID: <49899CD8.4090004@mcs.anl.gov>

I'm in favor of making "file" a built-in type.

I further propose we then use the term "file type" instead of "marker" 
or "mapped" type.

It seems there are at least 2 ways to do so, subtly different:

(a) as if the statement "type file;" has been implicitly executed, or
(b) as if there is a new simple, built-in, mapped type called "file".

(b) seems a bit better, because then the currently unnamed, built-in 
"marker" type gets a name.

If we use alternative (a) above, you would still say:

type Image;
file f;
Image i;

Here, nothing changes except for the built-in definition for "file" - 
not how you *use* the word "file".

With alternative (b), though, you would say:

type Image file;
file f;
Image i;

We could still allow the old style declarations to remain valid (but 
deprecated) for now (or forever), to avoid impact to existing code.

I favor approach (b), because it gives the unnamed "marker" type, and 
hence all primitive built-in types, an explicit name. It also looks less 
quirky. But it's a minor point.

- Mike


On 2/3/09 5:45 AM, Ben Clifford wrote:
> Pretty much every simple SwiftScript program that I write, I find myself 
> putting in "type file;" at the start, and avoiding "marker types" of the 
> form:
> 
>    type picturefile;
> 
> and thus ignoring application-level type checking (checking that I'm not 
> feeding a picture into a text processing app, and the like) whilst still 
> taking advantage of other swift type checking.
> 
> To simplify low-end uses of the language, it might be useful to have the 
> above "type file;" defined as a built-in type.
> 
> This has been discussed before, but I'd like to know what peoples opinions 
> are.
> 


From wilde at mcs.anl.gov  Wed Feb  4 07:55:03 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 04 Feb 2009 07:55:03 -0600
Subject: [Swift-devel] Making "type" more uniform
Message-ID: <49899E37.8070501@mcs.anl.gov>

In thinking through the question on "type file", I found that the 
current "type" statement has several irregularities.

These are not of major consequence, but I want to verify my 
understanding and propose that we document this in the user guide in the 
short term, and in the longer term, consider changes to make "type" more 
regular and useful.

Currently (as far as I can tell), the "type" statement can be used to 
declare new data types of exactly 2 kinds:
  1) named, mapped types representing a single file, and
  2) named struct types.

You can not declare a new type that represents a (usable) simple type, 
and you can not (as far as I can tell) declare a new type that 
represents an array. While declarations of new simple scalar types are 
accepted, they are of no practical use: you can not assign values to 
variables of such types.

While you can declare a type whose representation is an int:

type Flag int;
Flag f;

You can not then say:

f = 99;

because f is a Flag and 99 is an int.

This is because of the following, as far as I can tell:

Creating new types whose representation is a "simple type" (int, string, 
boolean, or float), while potentially useful, does not work, because you 
can not create or assign any values of such types: you can not "cast" 
constants or variables of the simple types to any named type declared 
with the same representation, and there is no way to return such values 
from any function, atomic or compound.

In contrast, creating multiple declared types whose representation is a 
"marker" is useful, and works, because app() functions essentially 
"cast" physical files as return values of any marker type. Thus, a 
variable of type "image" can be assigned a value because, in the 
userguide example, the convert command called by the rotate() 
app-function "casts" its returned marker file type as an "image".

To enable the use of types that are represented as simple type values, 
we would need to add a cast expression to the language.

Further, types are the only way to declare a struct: you can not declare 
a struct in a variable declaration; you need a type to define the 
struct. Conversely, you can only declare an array in a variable 
declaration; it seems you can not declare one as a type (although you 
can declare array variables within struct types).

These are all restrictions which, while seemingly irregular, are not in 
practice very limiting.

It's not clear how important this is, and I'm not proposing changes at 
the moment; rather, I suggest we clarify some of these corners of the 
language in the user guide, so users don't get tripped up trying to do 
things that seem natural in, e.g., C.


From wilde at mcs.anl.gov  Wed Feb  4 12:41:56 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 04 Feb 2009 12:41:56 -0600
Subject: [Swift-devel] strange behavior evaluating function call as trace arg
Message-ID: <4989E174.8060500@mcs.anl.gov>

In this program:

--
trace(@toint("123"));

(int k) add (int i, int j)
{
   k = i + j;
}

int m = add(123,456);

trace(m);

trace(add(123,456));
--

...the first and second trace() calls work OK.
When I add the third, I get:

Could not compile SwiftScript source: line 13:1: unexpected token: trace

It seems as if trace is picking up the @toint function call OK, but not 
the call to "add".

(This case is condensed from a more complex program where I first saw 
this.)

Is this my error, or Swift's?


From hategan at mcs.anl.gov  Wed Feb  4 14:05:37 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 4 Feb 2009 14:05:37 -0600 (CST)
Subject: [Swift-devel] strange behavior evaluating function call as
	trace arg
In-Reply-To: <3374084.1784281233777774647.JavaMail.root@zimbra>
Message-ID: <19137214.1784411233777937698.JavaMail.root@zimbra>

Looks like most nested invocations are broken, not specifically trace.

The following fails, too:

(int r) f(int i) {
        r = i;
}
int x;
x = f(f(2));


----- "Michael Wilde" <wilde at mcs.anl.gov> wrote:

> In this program:
> 
> --
> trace(@toint("123"));
> 
> (int k) add (int i, int j)
> {
>    k = i + j;
> }
> 
> int m = add(123,456);
> 
> trace(m);
> 
> trace(add(123,456));
> --
> 
> ...the first and second trace() calls work OK.
> When I add the third, I get:
> 
> Could not compile SwiftScript source: line 13:1: unexpected token:
> trace
> 
> It seems as if trace is picking up the @toint function call OK, but
> not 
> the call to "add".
> 
> (This case is condensed is from a more complex program where I first
> saw 
> this)
> 
> Is this my error, or swifts?
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel


From wilde at mcs.anl.gov  Wed Feb  4 16:27:56 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 04 Feb 2009 16:27:56 -0600
Subject: [Swift-devel] [VOTE] Expanding arrays in app function command
	lines
In-Reply-To: <Pine.LNX.4.64.0902011729410.14259@dildano.hawaga.org.uk>
References: <Pine.LNX.4.64.0902011729410.14259@dildano.hawaga.org.uk>
Message-ID: <498A166C.5090706@mcs.anl.gov>

+1

On 2/1/09 11:36 AM, Ben Clifford wrote:
> it being slightly unclear in my mind whether the below discussed change 
> was generally approved of, here is a more formal request for 
> clarification.
> 
> the change that we talked about is in this thread:
> 
>   Subject: Re: [Swift-user] Expanding arrays in app function command lines
> 
> the proposal (which I sent a patch for) is to change the handling of app 
> paramters to expand string arrays into multiple command line arguments.
> 
> Vote as in http://dev.globus.org/wiki/Guidelines#Action_Item_Votes
> 


From benc at hawaga.org.uk  Wed Feb  4 19:29:06 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 5 Feb 2009 01:29:06 +0000 (GMT)
Subject: [Swift-devel] strange behavior evaluating function call as trace
	arg
In-Reply-To: <19137214.1784411233777937698.JavaMail.root@zimbra>
References: <19137214.1784411233777937698.JavaMail.root@zimbra>
Message-ID: <Pine.LNX.4.64.0902050125350.8995@dildano.hawaga.org.uk>


> Looks like most nested invocations are broken, not specifically trace.

They're not 'broken'. If you think they should work, you're thinking far 
too much along the lines of procedure calls evaluating to a value like 
some kind of referentially transparent thing. Procedure calls *must* have 
an lvalue to their l to give them somewhere to put their l. That's been 
the case forever in Swift and in VDL2. Making that not happen is a long 
term goal of mine, but it's not in the language now.

-- 


From hategan at mcs.anl.gov  Wed Feb  4 19:43:17 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 4 Feb 2009 19:43:17 -0600 (CST)
Subject: [Swift-devel] strange behavior evaluating function call as
	trace arg
In-Reply-To: <21004599.1808421233797708367.JavaMail.root@zimbra>
Message-ID: <2625227.1808521233798197984.JavaMail.root@zimbra>

----- "Ben Clifford" <benc at hawaga.org.uk> wrote:

> > Looks like most nested invocations are broken, not specifically
> trace.
> 
> They're not 'broken'. If you think they should work, you're thinking
> far 
> too much along the lines of procedure calls evaluating to a value like
> 
> some kind of referentially transparent thing.

Yes, I am. I think it's reasonable to be able to use procedures with a return arity of one like that.

> Procedure calls *must*
> have 
> an lvalue to their l to give them somewhere to put their l. That's
> been 
> the case forever in Swift and in VDL2.

Right. I'm aware of the nasty issue with this, but I think it's doable.

> Making that not happen is a
> long 
> term goal of mine, but its not in the language now.
> 

I've started looking into it. If I don't get something in the next 8 hours of Swift work, I'll drop it. I want it there because not having it is a bit unintuitive.
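A minimal C sketch of one way nested invocations such as x = f(f(2)) could be supported, assuming the compiler introduces an anonymous temporary for each inner procedure call so that every call still has an existing slot to write its result into. The names f, tmp and nested_call_example are illustrative only; this is not how Swift actually compiles procedure calls.

```c
/* (int r) f(int i) { r = i; } in pass-by-reference form:
   the "return value" is written through the caller-supplied pointer r. */
void f(int *r, const int *i) {
    *r = *i;
}

/* Hypothetical desugaring of: int x; x = f(f(2)); */
int nested_call_example(void) {
    int two = 2;
    int tmp;          /* anonymous temporary a compiler could introduce */
    int x;
    f(&tmp, &two);    /* inner call f(2) writes into tmp */
    f(&x, &tmp);      /* outer call f(tmp) writes into x */
    return x;
}
```

The temporary plays the role of the lvalue that the inner call would otherwise lack.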


From wilde at mcs.anl.gov  Wed Feb  4 22:39:17 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 04 Feb 2009 22:39:17 -0600
Subject: [Swift-devel] Can't initialize and map in same var declaration?
Message-ID: <498A6D75.7060403@mcs.anl.gov>

I'd like to say:

   file f <"t.out"> = t(a);

but instead need to say:

   file f <"t.out">; f = t(a);

Should the first form work, or should we document that it's not permitted?

A very minor issue.


From benc at hawaga.org.uk  Thu Feb  5 02:36:00 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 5 Feb 2009 08:36:00 +0000 (GMT)
Subject: [Swift-devel] Can't initialize and map in same var declaration?
In-Reply-To: <498A6D75.7060403@mcs.anl.gov>
References: <498A6D75.7060403@mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0902050835180.8995@dildano.hawaga.org.uk>


On Wed, 4 Feb 2009, Michael Wilde wrote:

> I'd like to say:
> 
>   file f <"t.out"> = t(a);
> 
> but instead need to say:
> 
>   file f <"t.out">; f = t(a);
> 
> Should the first form work, or should we document that its not permitted?

It doesn't work, but the way things are, it's a pretty minor change to 
make it work, I think.

-- 


From benc at hawaga.org.uk  Thu Feb  5 02:40:28 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 5 Feb 2009 08:40:28 +0000 (GMT)
Subject: [Swift-devel] strange behavior evaluating function call as trace
	arg
In-Reply-To: <2625227.1808521233798197984.JavaMail.root@zimbra>
References: <2625227.1808521233798197984.JavaMail.root@zimbra>
Message-ID: <Pine.LNX.4.64.0902050838160.14259@dildano.hawaga.org.uk>

On Wed, 4 Feb 2009, Mihael Hategan wrote:

> Yes, I am. I think it's reasonable to be able to use procedures with a 
> return arity of one like that.

yes. I think beyond that it would be nice to lose the procedure/@function 
distinction entirely. They used to be very different but as time passes 
they get closer and closer to the same thing, so that pretty much the only 
distinction at the moment is how they return their return value(s) and 
that @functions can only return one value.

> Right. I'm aware of the nasty issue with this, but I think it's doable.

yes.

> I've started looking into it. If I don't get something in the next 8 
> hours of Swift work, I'll drop it. I want it there because not having it 
> is a bit unintuitive.

ok. let me know if you abandon it and I'll put it on my todo.

-- 


From wilde at mcs.anl.gov  Thu Feb  5 08:05:21 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 05 Feb 2009 08:05:21 -0600
Subject: [Swift-devel] Extending the set of builtin functions with external
	scripts?
In-Reply-To: <Pine.LNX.4.64.0902050838160.14259@dildano.hawaga.org.uk>
References: <2625227.1808521233798197984.JavaMail.root@zimbra>
	<Pine.LNX.4.64.0902050838160.14259@dildano.hawaga.org.uk>
Message-ID: <498AF221.4050802@mcs.anl.gov>

was: Re: [Swift-devel] strange behavior evaluating function call as 
trace arg

On 2/5/09 2:40 AM, Ben Clifford wrote:
> On Wed, 4 Feb 2009, Mihael Hategan wrote:
> 
>> Yes, I am. I think it's reasonable to be able to use procedures with a 
>> return arity of one like that.
> 
> yes. I think beyond that it would be nice to lose the procedure/@function 
> distinction entirely.

Yes, I agree.

On a related topic: can we make it easier for users to define @-like 
functions externally, as scripts, so that we can readily grow the 
function library, and experiment with such functions? (much like the ext 
mapper).

Users can currently define such functions in pairs: an app, searched for 
in PATH, which typically takes simple-type data and returns its value(s) 
in a file, which is then read in a (compound) wrapper function that uses 
readData() or readdata2(). This works well, except for a few issues:
- it's not easy to handle a varying number of args (must use arrays)
- can't handle varying types of args (e.g., hard to do a sprintf)
- need to put each one in tc.data
- it could be more elegant (e.g., the code for each function today touches 
4 places: app, wrapper, tc, and the actual external code). We could make 
it 2: an extern() func, syntactically almost identical to app(), and the 
external code (i.e., it bypasses tc.data)

This relates to the procedure/@function difference in the following way:

If we extend the language slightly with an "any" type declaration for 
parameter types that accepts any value, and a form of var-args, for 
example permitting @argc, @arglist[], and possibly @argtypes[], to be 
placed on the app() command line, then external functions could reflect 
as needed on their args and do pretty much anything that an 
internally-implemented function could do.  In this way the library could 
grow more readily, with simple perl, python, shell, awk, etc. scripts to 
implement them, searched for in a path like SWIFTLIB which would include 
SWIFT_HOME/swiftlib as well as user directories.

(By the way, I played with cpp as a way to #include swift library code. 
It worked well for a simple test; needs much more experimentation and 
testing. That or a similar approach looks promising).

With such extensions, we could use the app() declaration for such 
externs, and they would work exactly like any other app(), but serve the 
same purpose as built-in functions.

The builtins are faster (which is seldom needed), unthrottled (which is 
sometimes needed) and more robust (i.e., don't depend on external 
interpreters) which is handy for the core of the language, I guess.

Thoughts? Any interest in such a direction?

A related issue is how we want to control and shape the growth of such a 
library so that it gains the necessary power without getting unruly.

- Mike


> They used to be very different but as time passes 
> they get closer and closer to the same thing, so that pretty much the only 
> distinction at the moment is how they return their return value(s) and 
> that @functions can only return one value.
> 
>> Right. I'm aware of the nasty issue with this, but I think it's doable.
> 
> yes.
> 
>> I've started looking into it. If I don't get something in the next 8 
>> hours of Swift work, I'll drop it. I want it there because not having it 
>> is a bit unintuitive.
> 
> ok. let me know if you abandon it and I'll put it on my todo.
> 


From benc at hawaga.org.uk  Thu Feb  5 08:38:06 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 5 Feb 2009 14:38:06 +0000 (GMT)
Subject: [Swift-devel] cpp
In-Reply-To: <498AF221.4050802@mcs.anl.gov>
References: <2625227.1808521233798197984.JavaMail.root@zimbra>
	<Pine.LNX.4.64.0902050838160.14259@dildano.hawaga.org.uk>
	<498AF221.4050802@mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0902051432510.14259@dildano.hawaga.org.uk>


On Thu, 5 Feb 2009, Michael Wilde wrote:

> (By the way, I played with cpp as a way to #include swift library code. 
> It worked well for a simple test; needs much more experimentation and 
> testing. That or a similar approach looks promising).

GNU cpp is fairly explicit in its man page about not using it for 
non-C-like source files. That's fine for hacking round with, but so is 
/bin/cat.

Implementing some library system for Swift probably needs substantially 
more consideration - there are issues like modular compilation of code 
that are similar to other languages, as well as other more swift-specific 
issues (for example, should opting to use a library bring along relevant 
tc.data entries that are not usually defined?)

-- 


From wilde at mcs.anl.gov  Thu Feb  5 09:06:32 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 05 Feb 2009 09:06:32 -0600
Subject: [Swift-devel] Re: cpp
In-Reply-To: <Pine.LNX.4.64.0902051432510.14259@dildano.hawaga.org.uk>
References: <2625227.1808521233798197984.JavaMail.root@zimbra>
	<Pine.LNX.4.64.0902050838160.14259@dildano.hawaga.org.uk>
	<498AF221.4050802@mcs.anl.gov>
	<Pine.LNX.4.64.0902051432510.14259@dildano.hawaga.org.uk>
Message-ID: <498B0078.3080606@mcs.anl.gov>


On 2/5/09 8:38 AM, Ben Clifford wrote:
> On Thu, 5 Feb 2009, Michael Wilde wrote:
> 
>> (By the way, I played with cpp as a way to #include swift library code. 
>> It worked well for a simple test; needs much more experimentation and 
>> testing. That or a similar approach looks promising).
> 
> GNU cpp is fairly explicit in its man page about not using it for 
> non-C-like source files. That's fine for hacking round with, but so is 
> /bin/cat.

agreed.

> Implementing some library system for Swift probably needs substantially 
> more consideration - there are issues like modular compilation of code 
> that are similar to other languages, as well as other more swift-specific 
> issues (for example, should opting to use a library bring along relevant 
> tc.data entries that are not usually defined?)

yes. Just want to start somewhere to get a feeling for what is useful.


From benc at hawaga.org.uk  Thu Feb  5 09:18:03 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 5 Feb 2009 15:18:03 +0000 (GMT)
Subject: [Swift-devel] Re: [VOTE] Expanding arrays in app function command
	lines 
In-Reply-To: <Pine.LNX.4.64.0902011729410.14259@dildano.hawaga.org.uk>
References: <Pine.LNX.4.64.0902011729410.14259@dildano.hawaga.org.uk>
Message-ID: <Pine.LNX.4.64.0902051511060.14259@dildano.hawaga.org.uk>


This change is committed in r2498.
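For illustration, a hypothetical C sketch of the behaviour the vote describes: when an app argument is a string array, each element becomes a separate command-line argument rather than one combined value. AppArg and build_argv are invented names; this does not reflect how r2498 is actually implemented.

```c
#include <stddef.h>

/* One app argument: either a single string or an array of strings. */
typedef struct {
    int is_array;
    const char *scalar;          /* used when is_array == 0 */
    const char *const *items;    /* used when is_array != 0 */
    size_t count;
} AppArg;

/* Flatten app arguments into argv, expanding arrays element by element.
   Returns the number of entries written, or -1 if argv_cap is too small. */
long build_argv(const AppArg *args, size_t nargs,
                const char **argv, size_t argv_cap) {
    size_t n = 0;
    for (size_t k = 0; k < nargs; k++) {
        if (!args[k].is_array) {
            if (n == argv_cap) return -1;
            argv[n++] = args[k].scalar;
        } else {
            for (size_t j = 0; j < args[k].count; j++) {
                if (n == argv_cap) return -1;
                argv[n++] = args[k].items[j];
            }
        }
    }
    return (long)n;
}
```

With a scalar "-v" followed by a three-element array, the command line gets four entries.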

-- 


From benc at hawaga.org.uk  Thu Feb  5 09:31:09 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 5 Feb 2009 15:31:09 +0000 (GMT)
Subject: [Swift-devel] strange behavior evaluating function call as trace
	arg
In-Reply-To: <4989E174.8060500@mcs.anl.gov>
References: <4989E174.8060500@mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0902051524100.8995@dildano.hawaga.org.uk>


On Wed, 4 Feb 2009, Michael Wilde wrote:

> trace(add(123,456));

> Could not compile SwiftScript source: line 13:1: unexpected token: trace

Related to this, there's an open bug about syntax error reporting, bug 173:

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=173

The syntax error should be reporting that add is not valid there, not that 
trace is an unexpected token.

This comes, I think, from deciding whether a statement is a procedure 
declaration or a procedure call by attempting to parse the entire first 
bit up to this:

  foo (syntactically valid argument declarations) {

or

  foo (syntactically valid argument expressions) ;

to distinguish between declarations or invocations.

In the case above, your statement matches neither of the above, so the 
parser treats it as neither a declaration nor an invocation, giving an 
overly general error message saying, essentially, "I don't know what 
this whole statement is".
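A much-simplified, hypothetical C model of that lookahead: scan the name and a balanced parenthesised list, then look at the next token ({ for a declaration, ; for an invocation). It assumes, as this thread suggests, that a bare name(...) inside the argument list (a procedure call, as opposed to an @function call) is not a valid argument expression, so trace(add(123,456)); falls through to the "don't know what this is" case. It ignores string literals, whitespace between a nested call's name and its (, and everything else a real grammar handles.

```c
#include <ctype.h>

typedef enum { STMT_UNKNOWN, STMT_DECLARATION, STMT_CALL } StmtKind;

static const char *skip_space(const char *s) {
    while (isspace((unsigned char)*s)) s++;
    return s;
}

static int is_ident_char(char c) {
    return isalnum((unsigned char)c) || c == '_';
}

/* 1 if [start, end) contains "name(" not preceded by '@' (a procedure call). */
static int has_bare_call(const char *start, const char *end) {
    for (const char *c = start; c < end; c++) {
        if (*c != '(') continue;
        const char *id = c;
        while (id > start && is_ident_char(id[-1])) id--;
        if (id < c && (id == start || id[-1] != '@')) return 1;
    }
    return 0;
}

/* Classify "name ( ... ) {" as a declaration and "name ( ... ) ;" as a call. */
StmtKind classify(const char *stmt) {
    const char *s = skip_space(stmt);
    if (!is_ident_char(*s)) return STMT_UNKNOWN;
    while (is_ident_char(*s)) s++;               /* procedure name */
    s = skip_space(s);
    if (*s != '(') return STMT_UNKNOWN;
    const char *open = s;
    int depth = 0;
    for (; *s; s++) {                            /* find the matching ')' */
        if (*s == '(') depth++;
        else if (*s == ')' && --depth == 0) break;
    }
    if (*s != ')') return STMT_UNKNOWN;          /* unbalanced parentheses */
    const char *close = s;
    s = skip_space(s + 1);
    if (*s == '{') return STMT_DECLARATION;
    if (*s == ';' && !has_bare_call(open + 1, close)) return STMT_CALL;
    return STMT_UNKNOWN;                         /* neither form matched */
}
```

Reporting the position where has_bare_call fires, rather than just STMT_UNKNOWN, is the kind of tighter error location bug 173 asks for.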

It may be possible to make better predictors or otherwise tighten up the 
location, but I had trouble last time. Having had a year to think about 
it, I may be able to come up with something better.

-- 


From wilde at mcs.anl.gov  Thu Feb  5 09:49:31 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 05 Feb 2009 09:49:31 -0600
Subject: [Swift-devel] strange behavior evaluating function call as trace
	arg
In-Reply-To: <Pine.LNX.4.64.0902050125350.8995@dildano.hawaga.org.uk>
References: <19137214.1784411233777937698.JavaMail.root@zimbra>
	<Pine.LNX.4.64.0902050125350.8995@dildano.hawaga.org.uk>
Message-ID: <498B0A8B.7000601@mcs.anl.gov>

I don't understand this; can you clarify?

I understand an "lvalue" in swift to be one of:
- a var (eg var=value)
- an array element (eg a[i]=value)
- a struct element (eg s.a=value)

But swift procedures do indeed return a list of values, right?

Does the problem stem from the list-nature of the swift procedure 
return? (I.e., when a proc returns multiple values, it "needs" a set of 
lvalues on the lhs of an assignment statement to put them in, and this 
is currently enforced even in the case of a single value? While an 
@function() returns a single value, and hence works?)

So below when you say "Procedure calls *must* have an lvalue to their l 
to give them somewhere to put their l" do you mean "Procedure calls can 
only be invoked from assignment statements, and *must* have the same 
number of lvalues on the lhs of the assignment to give them somewhere to 
put all of their return values"???

On 2/4/09 7:29 PM, Ben Clifford wrote:
>> Looks like most nested invocations are broken, not specifically trace.
> 
> They're not 'broken'. If you think they should work, you're thinking far 
> too much along the lines of procedure calls evaluating to a value like 
> some kind of referentially transparent thing. Procedure calls *must* have 
> an lvalue to their l to give them somewhere to put their l. That's been 
> the case forever in Swift and in VDL2. Making that not happen is a long 
> term goal of mine, but its not in the language now.
> 


From benc at hawaga.org.uk  Thu Feb  5 10:06:39 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 5 Feb 2009 16:06:39 +0000 (GMT)
Subject: [Swift-devel] strange behavior evaluating function call as trace
	arg
In-Reply-To: <498B0A8B.7000601@mcs.anl.gov>
References: <19137214.1784411233777937698.JavaMail.root@zimbra>
	<Pine.LNX.4.64.0902050125350.8995@dildano.hawaga.org.uk>
	<498B0A8B.7000601@mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0902051600430.14259@dildano.hawaga.org.uk>


On Thu, 5 Feb 2009, Michael Wilde wrote:

> I dont understand this - can you clarify?
> 
> I understand an "lvalue" in swift to be one of:
> - a var (eg var=value)
> - an array element (eg a[i]=value)
> - a struct element (eg s.a=value)

yes.

> But swift procedures do indeed return a list of values, right?

no.

> So below when you say "Procedure calls *must* have an lvalue to their l to
> give them somewhere to put their l" do you mean "Procedure calls can only be
> invoked form assignment statements, and *must* have the same number of lvalues
> on the lhs of the assignment to give them somewhere to put all of their return
> values" ???

yes. 

Ignore the cardinality of return values - that's irrelevant to this.

When evaluating a @function call what comes out is a new DSHandle object 
that has been created by something inside that @function.

When evaluating a procedure call, the procedure expects to be given 
existing DSHandle object(s) to place its 'return values' into. Return 
values in the procedure case encompass files as well as primitive values. 

That mechanism is how, when you say:

type file;

app (file f) p() {
  touch @f;
}

file myfile <"foo">;
myfile = p();

the procedure is able to figure out the filename "foo" to touch, even 
though it's a return parameter. The mapping is attached to the DSHandle 
object, which comes into existence due to the 'file myfile' declaration.
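A rough C model of the two conventions, assuming a heavily simplified handle that records only an optional mapped filename and whether it has been written; the real DSHandle class is far richer. Handle, declare_mapped, p and make_unmapped are hypothetical names, and p only builds the command line instead of running it. What it shows is that the procedure receives the caller's handle, so the mapping "foo" is already visible to it, whereas a function that creates its own handle has no caller mapping to consult.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Much-simplified stand-in for a DSHandle. */
typedef struct {
    const char *mapped_path;   /* NULL if the handle is unmapped */
    int closed;                /* 0 until a value has been stored */
} Handle;

/* 'file myfile <"foo">;' creates the handle, with its mapping, up front. */
Handle *declare_mapped(const char *path) {
    Handle *h = malloc(sizeof *h);
    if (h == NULL) {
        exit(EXIT_FAILURE);
    }
    h->mapped_path = path;
    h->closed = 0;
    return h;
}

/* Procedure style: the callee is handed the caller's existing handle, so it
   can read the filename from it (what "touch @f" relies on). It builds the
   command line and marks the handle as written. */
void p(Handle *f, char *cmd, size_t cmdlen) {
    snprintf(cmd, cmdlen, "touch %s", f->mapped_path);
    f->closed = 1;
}

/* @function style: the callee creates and returns a brand-new handle,
   so there is no caller-side mapping for it to use. */
Handle *make_unmapped(void) {
    return declare_mapped(NULL);
}
```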

-- 


From hategan at mcs.anl.gov  Thu Feb  5 10:22:04 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 05 Feb 2009 10:22:04 -0600
Subject: [Swift-devel] strange behavior evaluating function call as
	trace arg
In-Reply-To: <Pine.LNX.4.64.0902051600430.14259@dildano.hawaga.org.uk>
References: <19137214.1784411233777937698.JavaMail.root@zimbra>
	<Pine.LNX.4.64.0902050125350.8995@dildano.hawaga.org.uk>
	<498B0A8B.7000601@mcs.anl.gov>
	<Pine.LNX.4.64.0902051600430.14259@dildano.hawaga.org.uk>
Message-ID: <1233850924.18738.15.camel@localhost>

On Thu, 2009-02-05 at 16:06 +0000, Ben Clifford wrote:
> On Thu, 5 Feb 2009, Michael Wilde wrote:
> 
> > I dont understand this - can you clarify?
> > 
> > I understand an "lvalue" in swift to be one of:
> > - a var (eg var=value)
> > - an array element (eg a[i]=value)
> > - a struct element (eg s.a=value)
> 
> yes.
> 
> > But swift procedures do indeed return a list of values, right?
> 
> no.

Another way of viewing this would be the following:

Returns from swift procedures are not actually returns but arguments
passed by reference. This is there in order to support the automatic
parallelization scheme.

So assuming 

(int s) add(int x, int y) { s = x + y; }
int s;
s = add(1, 2);

this translates to (in C-like pointerish pseudocode):

void add(int* s, int* x, int* y) {
  *s = *x + *y;
}
int *s = malloc(sizeof(int));
add(s, newInt(1), newInt(2));

where newInt(int) allocates an int pointer, puts some value into it and
returns the address. Pointers here are DSHandles (Swift's way of dealing
with data).
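A compilable C version of the pseudocode above, as a sketch: newInt is implemented here with malloc, and the wrapper run_add_example (a hypothetical name) stands in for the top-level Swift statements. As in the pseudocode, the caller allocates the result slot and the procedure only writes through it.

```c
#include <stdio.h>
#include <stdlib.h>

/* (int s) add(int x, int y) { s = x + y; }: the "return" is a write
   through a pointer the caller already owns. */
void add(int *s, const int *x, const int *y) {
    *s = *x + *y;
}

/* newInt: allocate an int slot and store a value in it. */
int *newInt(int value) {
    int *p = malloc(sizeof *p);
    if (p == NULL) {
        perror("malloc");
        exit(EXIT_FAILURE);
    }
    *p = value;
    return p;
}

/* Equivalent of: int s; s = add(1, 2); */
int run_add_example(void) {
    int *s = newInt(0);        /* stands in for int *s = malloc(sizeof(int)); */
    int *x = newInt(1);
    int *y = newInt(2);
    add(s, x, y);              /* the callee fills in the caller's slot */
    int result = *s;
    free(s);
    free(x);
    free(y);
    return result;
}
```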


From benc at hawaga.org.uk  Thu Feb  5 10:59:21 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 5 Feb 2009 16:59:21 +0000 (GMT)
Subject: [Swift-devel] Re: [Swift-user] gsiftp filesystem mapper removes
	leading slash
In-Reply-To: <Pine.LNX.4.64.0902032334030.14259@dildano.hawaga.org.uk>
References: <50b07b4b0902031343g1c800451j78b0b914c944a416@mail.gmail.com> 
	<Pine.LNX.4.64.0902032147310.8995@dildano.hawaga.org.uk>
	<50b07b4b0902031412t5f05bda2t4ff88326df678fb7@mail.gmail.com>
	<Pine.LNX.4.64.0902032334030.14259@dildano.hawaga.org.uk>
Message-ID: <Pine.LNX.4.64.0902051658530.14259@dildano.hawaga.org.uk>


The fix in the below mentioned patch is in CoG r2273.

On Tue, 3 Feb 2009, Ben Clifford wrote:

> 
> Moved from swift-user
> 
> On Tue, 3 Feb 2009, Allan Espinosa wrote:
> 
> > Oh i see.  now I'm getting NullPointExceptions:
> > database pir[] <filesys_mapper;location="gsiftp://gridftp.ranger.tacc.teragrid.org//work/01035/tg802895/pir",
> > pattern="UNIPROT_for_blast_14.0.seq*">;
> 
> I can recreate the same stacktrace you see, against my directory on 
> teraport. The below change makes it go away for me.
> 
> Get a clean fresh source tree, then:
> 
>  $ cd cog
>  $ wget http://www.ci.uchicago.edu/~benc/tmp/ftpbug-1.patch
>  $ patch -p1 < ftpbug-1.patch
>  $ ant redist
> 
> And try that.
> 
> Probably you should keep a copy of your source tree before applying the 
> patch so that you can easily get rid of it.
> 
> --
> 
> 


From bugzilla-daemon at mcs.anl.gov  Thu Feb  5 11:05:29 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Thu,  5 Feb 2009 11:05:29 -0600 (CST)
Subject: [Swift-devel] [Bug 174] New: Type string is not defined
Message-ID: <bug-174-21@http.bugzilla.mcs.anl.gov/swift/>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=174

           Summary: Type string is not defined
           Product: Swift
           Version: unspecified
          Platform: PC
        OS/Version: Windows
            Status: NEW
          Severity: normal
          Priority: P2
         Component: SwiftScript language
        AssignedTo: benc at hawaga.org.uk
        ReportedBy: skenny at uchicago.edu


This is the error thrown when I have an incorrect reference to a member of a
string array:

string labels[]

referenced here as:

labels[0].label 

should've been:

labels[0]

Swift produces an error saying that 'type string is not defined', which is not
the appropriate error.


-- 
Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From bugzilla-daemon at mcs.anl.gov  Thu Feb  5 11:09:36 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Thu,  5 Feb 2009 11:09:36 -0600 (CST)
Subject: [Swift-devel] [Bug 175] New: ambiguous error when input file not
	found
Message-ID: <bug-175-21@http.bugzilla.mcs.anl.gov/swift/>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=175

           Summary: ambiguous error when input file not found
           Product: Swift
           Version: unspecified
          Platform: PC
        OS/Version: Windows
            Status: NEW
          Severity: normal
          Priority: P2
         Component: SwiftScript language
        AssignedTo: benc at hawaga.org.uk
        ReportedBy: skenny at uchicago.edu


If a mapped input file does not exist, Swift throws the following
error:

 Swift svn swift-r2494 cog-r2271 

 RunID: 20090205-1030-r6q4lgm7 
 Progress: 
 Unexpected VDL2FutureException checking inputs 
 Execution failed: 
         java.lang.RuntimeException: Got a VDL2FutureException but all parameters should have their values


-- 
Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From wilde at mcs.anl.gov  Thu Feb  5 11:34:48 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 05 Feb 2009 11:34:48 -0600
Subject: [Swift-devel] strange behavior evaluating function call as	trace
	arg
In-Reply-To: <1233850924.18738.15.camel@localhost>
References: <19137214.1784411233777937698.JavaMail.root@zimbra>	
	<Pine.LNX.4.64.0902050125350.8995@dildano.hawaga.org.uk>	
	<498B0A8B.7000601@mcs.anl.gov>	
	<Pine.LNX.4.64.0902051600430.14259@dildano.hawaga.org.uk>
	<1233850924.18738.15.camel@localhost>
Message-ID: <498B2338.8000201@mcs.anl.gov>

OK, thanks, that helps me get closer, but I'm not quite there yet.

The "automatic parallelization scheme" (i.e. the swift dataflow model) 
works by chaining together lvalues, correct? In other words, it is the 
setting of lvalues that enables a statement waiting on a variable to get 
a value to execute. Is that correct?

Conceptually, one could view the dataflow model as existing solely in 
the memory of the swift interpreter. The creation of files is in some 
sense a side effect of executing app procedures; the creation of 
files within an app() and the consequent execution of assignment 
operations then result in lvalues being set, which enables execution of 
any statement waiting on an lvalue to proceed. The assignment operations 
can be explicit (lvalue=value) or implicit (procedure return).
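A minimal sketch of the model described in this paragraph, in C and under heavy assumptions: each variable is a single-assignment slot holding an int, and "statements waiting on a variable" are callbacks registered on the slot that run when it is set. Slot, slot_wait, slot_set and copy_value are invented names; Swift's real runtime is considerably more involved.

```c
#include <stddef.h>

#define MAX_WAITERS 8

typedef struct Slot Slot;
typedef void (*Waiter)(Slot *slot, void *ctx);

/* A single-assignment variable plus the statements waiting on it. */
struct Slot {
    int is_set;
    int value;
    Waiter waiters[MAX_WAITERS];
    void *contexts[MAX_WAITERS];
    size_t n_waiters;
};

/* Register a waiting statement; run it at once if the value is already set.
   Returns -1 if the waiter table is full. */
int slot_wait(Slot *s, Waiter w, void *ctx) {
    if (s->is_set) {
        w(s, ctx);
        return 0;
    }
    if (s->n_waiters == MAX_WAITERS) return -1;
    s->waiters[s->n_waiters] = w;
    s->contexts[s->n_waiters] = ctx;
    s->n_waiters++;
    return 0;
}

/* Set the slot (explicit assignment or procedure return) and wake waiters.
   A second assignment is rejected, mirroring single assignment. */
int slot_set(Slot *s, int value) {
    if (s->is_set) return -1;
    s->is_set = 1;
    s->value = value;
    for (size_t k = 0; k < s->n_waiters; k++) {
        s->waiters[k](s, s->contexts[k]);
    }
    s->n_waiters = 0;
    return 0;
}

/* Example waiting statement: copy the slot's value into *ctx. */
void copy_value(Slot *s, void *ctx) {
    *(int *)ctx = s->value;
}
```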

To fully understand this, you need to go further into the details of 
mapping, scoping, procedure activation/completion, array closing, and 
single assignment.

So to take a few steps in that direction:

Question: are the following tuples the correct abstract representations 
of lvalues and dshandles?

lvalue: type, *handle, state (set/unset) - same as handle==null?
handle: type, value(if simple type), mapping

Question: how are arrays and structures represented in this model?

Getting this into writing in a concise and correct form would be useful 
for gaining a full understanding of Swift and also help in the language 
paper.

Is it reasonable to put this into LaTeX form and jointly edit until 
we're satisfied with it?

If so, I will do that.

It could go into a version of the language paper that we post on the 
site, while we submit a condensed version to a conference.

- Mike


On 2/5/09 10:22 AM, Mihael Hategan wrote:
> On Thu, 2009-02-05 at 16:06 +0000, Ben Clifford wrote:
>> On Thu, 5 Feb 2009, Michael Wilde wrote:
>>
>>> I dont understand this - can you clarify?
>>>
>>> I understand an "lvalue" in swift to be one of:
>>> - a var (eg var=value)
>>> - an array element (eg a[i]=value)
>>> - a struct element (eg s.a=value)
>> yes.
>>
>>> But swift procedures do indeed return a list of values, right?
>> no.
> 
> Another way of viewing this would be the following:
> 
> Returns from swift procedures are not actually returns but arguments
> passed by reference. This is there in order to support the automatic
> parallelization scheme.
> 
> So assuming 
> 
> (int s) add(int x, int y) { s = x + y; }
> int s;
> s = add(1, 2);
> 
> this translates to (in C-like pointerish pseudocode):
> 
> add(int* s, int* x, int* y) {
>   *s = *x + *y;
> }
> int *s = malloc(sizeof(int));
> add(s, newInt(1), newInt(2));
> 
> where newInt(int) allocates an int pointer, puts some value into it and
> returns the address. Pointers here are DSHandles (Swift's way of dealing
> with data).
> 
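To make the translation above concrete, here is a minimal C sketch that compiles as written. It is an illustration under assumptions, not Swift's implementation: `IntHandle`, `newHandle` and `assign` are hypothetical names (only `add` and `newInt` come from the pseudocode), and a handle is reduced to an int plus a set/unset flag, whereas a real DSHandle also has a type and a mapping and tracks who is waiting on it.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a DSHandle holding an int:
   a heap cell plus a flag recording whether it has been assigned. */
typedef struct {
    int value;
    int set; /* 0 = unset, 1 = set */
} IntHandle;

/* Allocate an unset handle; plays the role of "int s;". */
IntHandle *newHandle(void) {
    IntHandle *h = malloc(sizeof *h);
    if (h == NULL)
        abort();
    h->value = 0;
    h->set = 0;
    return h;
}

/* Single assignment: a handle may be written at most once. */
void assign(IntHandle *h, int v) {
    assert(!h->set);
    h->value = v;
    h->set = 1;
}

/* Allocate a handle that already holds v (a literal argument). */
IntHandle *newInt(int v) {
    IntHandle *h = newHandle();
    assign(h, v);
    return h;
}

/* (int s) add(int x, int y) { s = x + y; }
   The "return value" s is just one more argument passed by reference. */
void add(IntHandle *s, IntHandle *x, IntHandle *y) {
    assert(x->set && y->set); /* inputs must already be available here */
    assign(s, x->value + y->value);
}
```

Calling `add(s, newInt(1), newInt(2))` on `s = newHandle()` leaves `s->set == 1` and `s->value == 3`. Because this sketch is sequential, `add` simply asserts that its inputs are already set; in Swift the call would instead be deferred until they are.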


From hategan at mcs.anl.gov  Thu Feb  5 11:46:34 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 05 Feb 2009 11:46:34 -0600
Subject: [Swift-devel] strange behavior evaluating function call
	as	trace arg
In-Reply-To: <498B2338.8000201@mcs.anl.gov>
References: <19137214.1784411233777937698.JavaMail.root@zimbra>
	<Pine.LNX.4.64.0902050125350.8995@dildano.hawaga.org.uk>
	<498B0A8B.7000601@mcs.anl.gov>
	<Pine.LNX.4.64.0902051600430.14259@dildano.hawaga.org.uk>
	<1233850924.18738.15.camel@localhost>  <498B2338.8000201@mcs.anl.gov>
Message-ID: <1233855994.21035.8.camel@localhost>

On Thu, 2009-02-05 at 11:34 -0600, Michael Wilde wrote:
> OK, thanks, that helps me get closer, but I'm not quite there yet.
> 
> The "automatic parallelization scheme" (i.e. the swift dataflow model) 
> works by chaining together lvalues, correct? In other words, it is the 
> setting of lvalues that enables a statement waiting on a variable to get 
> a value to execute. Is that correct?

Somewhat. It is the existence of the lvalue that allows a consumer and
a producer to synchronize on the same thing.

> 
> Conceptually, one could view the dataflow model as existing solely in 
> the memory of the swift interpreter.  The creation of files is in some 
> sense a side effect of executing app procedures, and the creation of 
> files within an app() and the consequent execution of assignment 
> operations then result in lvalues being set, which allows any statement
> waiting on an lvalue to proceed. The assignment operations can be
> explicit (lvalue=value) or implicit (procedure return).

Yes.

> 
> To fully understand this, you need to go further into the details of 
> mapping, scoping, procedure activation/completion, array closing, and 
> single assignment.
> 
> So to take a few steps in that direction:
> 
> Question: are the following tuples the correct abstract representations 
> of lvalues and dshandles?

There is no "lvalue" distinct from "handle". The term was used by Ben to
refer to what looks like lvalues in the Swift scripts.

> 
> lvalue: type, *handle, state (set/unset) - same as handle==null?
> handle: type, value(if simple type), mapping

In light of what I said above:

handle: type, state, who_is_waiting_on_this, value?, mapping?
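
The tuple above can be pictured with a small C sketch of a single-assignment handle on which a producer and a consumer synchronize. Everything here is an assumption for illustration: the names (`Handle`, `Waiter`, `handle_wait`, `handle_set`, `store_into`) are hypothetical, the value is a plain int, type and mapping are left out, and who_is_waiting_on_this is modelled as a list of callbacks that each run once, when the handle is set.

```c
#include <assert.h>
#include <stdlib.h>

/* Callback run when the handle it waits on becomes set. */
typedef void (*WaitFn)(int value, void *ctx);

/* One entry in who_is_waiting_on_this (modelled as callbacks: an assumption). */
typedef struct Waiter {
    WaitFn fn;
    void *ctx;
    struct Waiter *next;
} Waiter;

/* Hypothetical handle: state, waiters and value; type and mapping omitted. */
typedef struct {
    int set;         /* state: 0 = unset, 1 = set */
    Waiter *waiters; /* consumers waiting on this handle */
    int value;       /* meaningful only once set */
} Handle;

void handle_init(Handle *h) {
    h->set = 0;
    h->waiters = NULL;
    h->value = 0;
}

/* Consumer side: run fn now if the value exists, otherwise queue it. */
void handle_wait(Handle *h, WaitFn fn, void *ctx) {
    if (h->set) {
        fn(h->value, ctx);
        return;
    }
    Waiter *w = malloc(sizeof *w);
    if (w == NULL)
        abort();
    w->fn = fn;
    w->ctx = ctx;
    w->next = h->waiters; /* push front: waiters later fire newest first */
    h->waiters = w;
}

/* Producer side: assign exactly once, then run and free every queued waiter. */
void handle_set(Handle *h, int value) {
    assert(!h->set); /* single assignment */
    h->value = value;
    h->set = 1;
    Waiter *w = h->waiters;
    h->waiters = NULL;
    while (w != NULL) {
        Waiter *next = w->next;
        w->fn(value, w->ctx);
        free(w);
        w = next;
    }
}

/* Example consumer: copy the value into the int that ctx points to. */
void store_into(int value, void *ctx) {
    *(int *)ctx = value;
}
```

A consumer that registers with `handle_wait` before the producer calls `handle_set` is queued and runs at assignment time; one that registers afterwards runs immediately. Callbacks were chosen over blocking threads to keep the sketch single-threaded; queued waiters fire in reverse registration order here, which is an artifact of the linked list, not a property of Swift.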

> 
> Question: how are arrays and structures represented in this model?

Structs are handles with fields, which are also handles. Arrays are
structs with dynamic fields (i.e. you can add fields/elements at
run-time).
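
Likewise, a rough C sketch of structs and arrays as handles whose fields are handles. The names (`Node`, `node_new`, `node_field`, `node_add_field`, `node_close`) are hypothetical, fields are looked up by a string key for simplicity, and the `closed` flag is included only because the thread mentions array closing; it is not meant to show how Swift actually closes arrays. Under this model a struct and an array share one representation: a struct gets its fixed fields when it is built and is closed at once, while an array keeps accepting new elements at run time until it is closed.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical composite handle: every field is itself a handle.
   Structs and arrays share this layout; only how they are filled differs. */
typedef struct Node {
    const char *name;     /* field name, or element index as text */
    int closed;           /* once closed, no more fields may be added */
    size_t nfields;
    size_t cap;
    struct Node **fields; /* child handles */
} Node;

Node *node_new(const char *name) {
    Node *n = calloc(1, sizeof *n); /* zeroed: open, no fields */
    if (n == NULL)
        abort();
    n->name = name;
    return n;
}

/* Look up a field by name; NULL if absent. */
Node *node_field(const Node *n, const char *name) {
    for (size_t i = 0; i < n->nfields; i++)
        if (strcmp(n->fields[i]->name, name) == 0)
            return n->fields[i];
    return NULL;
}

/* Attach a child handle. A struct does this only while being built;
   an array may do it at run time, until it is closed. */
Node *node_add_field(Node *n, Node *child) {
    assert(!n->closed);
    assert(node_field(n, child->name) == NULL); /* no duplicate keys */
    if (n->nfields == n->cap) {
        n->cap = n->cap ? 2 * n->cap : 4;
        Node **grown = realloc(n->fields, n->cap * sizeof *grown);
        if (grown == NULL)
            abort();
        n->fields = grown;
    }
    n->fields[n->nfields++] = child;
    return child;
}

/* Closing: the set of fields/elements is now final. */
void node_close(Node *n) {
    n->closed = 1;
}
```

For example, a struct `s` with fields `a` and `b` is two `node_add_field` calls followed by `node_close(s)`; an array receives elements keyed "0", "1", ... as they are produced, and consumers that iterate over it can treat `closed` as the signal that no more elements will appear.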

> 
> Getting this into writing in a concise and correct form would be useful 
> for gaining a full understanding of Swift and also help in the language 
> paper.

It's useful to the extent that it is useful for people besides us to
understand how Swift works in detail, at the expense of the time we
spend writing the document.

> 
> Is it reasonable to put this into latex form and jointly edit until 
> we're satisfied with it?
> 
> If so, I will do that.
> 
> It could go into a version of the language paper that we post on the 
> site, while we submit a condensed version to a conference.
> 


From wilde at mcs.anl.gov  Thu Feb  5 11:57:13 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 05 Feb 2009 11:57:13 -0600
Subject: [Swift-devel] strange behavior evaluating function call	as	trace
	arg
In-Reply-To: <1233855994.21035.8.camel@localhost>
References: <19137214.1784411233777937698.JavaMail.root@zimbra>	
	<Pine.LNX.4.64.0902050125350.8995@dildano.hawaga.org.uk>	
	<498B0A8B.7000601@mcs.anl.gov>	
	<Pine.LNX.4.64.0902051600430.14259@dildano.hawaga.org.uk>	
	<1233850924.18738.15.camel@localhost>
	<498B2338.8000201@mcs.anl.gov> <1233855994.21035.8.camel@localhost>
Message-ID: <498B2879.9010403@mcs.anl.gov>

Thanks.

But regarding:

 >> Getting this into writing in a concise and correct form would be useful
 >> for gaining a full understanding of Swift and also help in the language
 >> paper.
 >
 > It's useful to the extent that it is useful for people besides us to
 > understand how Swift works in detail, at the expense of the time we
 > spend writing the document.

...my intent is not to probe the internals but rather to understand how 
to *use* the language. It's not how swift works, but how to work with 
swift. I use it quite a bit and yet I'm continually discovering new 
things about it, and finding that some of my understandings were incorrect.

Thus I find it increasingly important to pin down the language 
specification in a form that lets users understand it thoroughly. It's 
not that complex, yet it has many subtleties and surprises, due both to 
parallelism and to the handling of external data.

Such a spec is also helpful in discussing changes and enhancements.

I think that the user guide, or some doc hanging off of it, is the place 
to capture this.


On 2/5/09 11:46 AM, Mihael Hategan wrote:
> On Thu, 2009-02-05 at 11:34 -0600, Michael Wilde wrote:
>> OK, thanks, that helps me get closer, but I'm not quite there yet.
>>
>> The "automatic parallelization scheme" (i.e. the swift dataflow model) 
>> works by chaining together lvalues, correct? In other words, it is the 
>> setting of lvalues that enables a statement waiting on a variable to get 
>> a value to execute. Is that correct?
> 
> Somewhat. It is the existence of the lvalue that allows a consumer and
> a producer to synchronize on the same thing.
> 
>> Conceptually, one could view the dataflow model as existing solely in 
>> the memory of the swift interpreter.  The creation of files is in some 
>> sense a side effect of executing app procedures, and the creation of 
>> files within an app() and the consequent execution of assignment 
>> operations then result in lvalues being set, which allows any statement
>> waiting on an lvalue to proceed. The assignment operations can be
>> explicit (lvalue=value) or implicit (procedure return).
> 
> Yes.
> 
>> To fully understand this, you need to go further into the details of 
>> mapping, scoping, procedure activation/completion, array closing, and 
>> single assignment.
>>
>> So to take a few steps in that direction:
>>
>> Question: are the following tuples the correct abstract representations 
>> of lvalues and dshandles?
> 
> There is no "lvalue" distinct from "handle". The term was used by Ben to
> refer to what looks like lvalues in the Swift scripts.
> 
>> lvalue: type, *handle, state (set/unset) - same as handle==null?
>> handle: type, value(if simple type), mapping
> 
> In light of what I said above:
> 
> handle: type, state, who_is_waiting_on_this, value?, mapping?
> 
>> Question: how are arrays and structures represented in this model?
> 
> Structs are handles with fields, which are also handles. Arrays are
> structs with dynamic fields (i.e. you can add fields/elements at
> run-time).
> 
>> Getting this into writing in a concise and correct form would be useful 
>> for gaining a full understanding of Swift and also help in the language 
>> paper.
> 
> It's useful to the extent that it is useful for people besides us to
> understand how Swift works in detail, at the expense of the time we
> spend writing the document.
> 
>> Is it reasonable to put this into latex form and jointly edit until 
>> we're satisfied with it?
>>
>> If so, I will do that.
>>
>> It could go into a version of the language paper that we post on the 
>> site, while we submit a condensed version to a conference.
>>
> 
> 


From aespinosa at cs.uchicago.edu  Thu Feb  5 12:53:28 2009
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Thu, 5 Feb 2009 12:53:28 -0600
Subject: [Swift-devel] Re: [Swift-user] gsiftp filesystem mapper removes
	leading slash
In-Reply-To: <Pine.LNX.4.64.0902051658530.14259@dildano.hawaga.org.uk>
References: <50b07b4b0902031343g1c800451j78b0b914c944a416@mail.gmail.com>
	<Pine.LNX.4.64.0902032147310.8995@dildano.hawaga.org.uk>
	<50b07b4b0902031412t5f05bda2t4ff88326df678fb7@mail.gmail.com>
	<Pine.LNX.4.64.0902032334030.14259@dildano.hawaga.org.uk>
	<Pine.LNX.4.64.0902051658530.14259@dildano.hawaga.org.uk>
Message-ID: <50b07b4b0902051053r67ef2070k996236fa00f510f@mail.gmail.com>

Ok will try it out.

Thanks Ben!
-Allan

On Thu, Feb 5, 2009 at 10:59 AM, Ben Clifford <benc at hawaga.org.uk> wrote:
>
> The fix in the below mentioned patch is in CoG r2273.
>
> On Tue, 3 Feb 2009, Ben Clifford wrote:
>
>>
>> Moved from swift-user
>>
>> On Tue, 3 Feb 2009, Allan Espinosa wrote:
>>
>> > Oh, I see. Now I'm getting NullPointerExceptions:
>> > database pir[] <filesys_mapper;location="gsiftp://gridftp.ranger.tacc.teragrid.org//work/01035/tg802895/pir",
>> > pattern="UNIPROT_for_blast_14.0.seq*">;
>>
>> I can recreate the same stacktrace you see, against my directory on
>> teraport. The below change makes it go away for me.
>>
>> Get a clean fresh source tree, then:
>>
>>  $ cd cog
>>  $ wget http://www.ci.uchicago.edu/~benc/tmp/ftpbug-1.patch
>>  $ patch -p1 < ftpbug-1.patch
>>  $ ant redist
>>
>> And try that.
>>
>> Probably you should keep a copy of your source tree before applying the
>> patch so that you can easily get rid of it.
>>
>> --
>>
>>
>
>


-- 
Allan M. Espinosa <http://allan.88-mph.net/blog>
PhD student, Computer Science
University of Chicago <http://people.cs.uchicago.edu/~aespinosa>


From hategan at mcs.anl.gov  Thu Feb  5 13:00:41 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 05 Feb 2009 13:00:41 -0600
Subject: [Swift-devel] strange behavior evaluating function
	call	as	trace arg
In-Reply-To: <498B2879.9010403@mcs.anl.gov>
References: <19137214.1784411233777937698.JavaMail.root@zimbra>
	<Pine.LNX.4.64.0902050125350.8995@dildano.hawaga.org.uk>
	<498B0A8B.7000601@mcs.anl.gov>
	<Pine.LNX.4.64.0902051600430.14259@dildano.hawaga.org.uk>
	<1233850924.18738.15.camel@localhost>  <498B2338.8000201@mcs.anl.gov>
	<1233855994.21035.8.camel@localhost>  <498B2879.9010403@mcs.anl.gov>
Message-ID: <1233860441.24135.10.camel@localhost>

On Thu, 2009-02-05 at 11:57 -0600, Michael Wilde wrote:
> Thanks.
> 
> But regarding:
> 
>  >> Getting this into writing in a concise and correct form would be useful
>  >> for gaining a full understanding of Swift and also help in the language
>  >> paper.
>  >
>  > It's useful to the extent that it is useful for people besides us to
>  > understand how Swift works in detail, at the expense of the time we
>  > spend writing the document.
> 
> ...my intent is not to probe the internals but rather to understand how 
> to *use* the language. It's not how swift works, but how to work with 
> swift. I use it quite a bit and yet I'm continually discovering new 
> things about it, and finding that some of my understandings were incorrect.

In my experience, in particular with Tibi's workflows, understanding how
Swift works leads in most cases to bad results.

It turns out that the best way to use swift is to express the problem
formally and use the publicly defined interface to implement it. It goes
against the C/imperative school of thought, but allowing an automated
system to optimize a problem requires specifying the problem, not "the
way one thinks it works under the hood".

> 
> Thus I find it increasingly important to pin down the language 
> specification in a form that lets users understand it thoroughly. It's 
> not that complex, yet it has many subtleties and surprises, due both to 
> parallelism and to the handling of external data.

Right. Parallelism is one of the issues that should be completely
ignored by a swift programmer. Writing swift code in such a way as to
achieve parallelization in a certain way is, as mentioned above, mostly
going to yield bad results.

This is mostly because we can achieve proper parallelization only if
a certain level of abstraction is used. That level of abstraction is
the level at which a swift code writer should work. Perhaps the
documentation on that needs improvement, but the abstraction should
not be circumvented.

> 
> Such a spec is also helpful in discussing changes and enhancements.

That it is.


From aespinosa at cs.uchicago.edu  Thu Feb  5 14:18:39 2009
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Thu, 5 Feb 2009 14:18:39 -0600
Subject: [Swift-devel] coasters paths (was Re: Coasters failing on 
	Teraport - cant find Java?)
In-Reply-To: <1233712743.20123.0.camel@localhost>
References: <50b07b4b0902031502v5074b655wc20c1b15dfa1daaa@mail.gmail.com>
	<1233707836.12879.3.camel@localhost> <20090204003943.GA5628@quadrant>
	<1233712743.20123.0.camel@localhost>
Message-ID: <50b07b4b0902051218l1d259177y42e57d1273aa2997@mail.gmail.com>

Hi Mihael,

Tried out cog r2273 against swift r2486

The GRAM environment on Ranger produces some stty errors, so the
coasters can't initialize the paths when they attempt to create a
"login" session.

from the remote site:
/share/home/01035/tg802895/.globus/job/gatekeeper.ranger.tacc.teragrid.org
login4$ ls
login4$ ls
8851.1233864730
login4$ cd 8851.1233864730/
login4$ ls
stderr  stdout  x509_up
login4$ cat *
stty: standard input: Inappropriate ioctl for device
stty: standard input: Inappropriate ioctl for device
http://communicado.ci.uchicago.edu:50002: line 36: eval: --: invalid option
eval: usage: eval [arg ...]
Failed to download bootstrap jar from http://communicado.ci.uchicago.edu:50002
-----BEGIN CERTIFICATE-----


from submission site:
Progress:  Failed:1
Execution failed:
        Could not initialize shared directory on RANGER
Caused by:
        org.globus.cog.abstraction.impl.file.FileResourceException:
Failed to start coaster resource on
gatekeeper.ranger.tacc.teragrid.org
Caused by:
        Could not start coaster service
Caused by:
        Task ended before registration was received.
STDOUT: Failed to download bootstrap jar from
http://communicado.ci.uchicago.edu:50002
stty: standard input: Inappropriate ioctl for device
stty: standard input: Inappropriate ioctl for device
http://communicado.ci.uchicago.edu:50002: line 36: eval: --: invalid option
eval: usage: eval [arg ...]

STDERR: null

I guess the best approach is to create an environment-agnostic HTTP GET
request. From the standard Java network packages, perhaps?

-Allan

On Tue, Feb 3, 2009 at 7:59 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> cog r2272 contains a tentative fix for the issue. I tested it locally so
> far.
>
> On Tue, 2009-02-03 at 18:39 -0600, Allan Espinosa wrote:
>> I think I am using the latest revision (2271) for cog.
>> for swift my build is using 2490
>>
>> On Tue, Feb 03, 2009 at 06:37:16PM -0600, Mihael Hategan wrote:
>> >
>> > What version of cog is this?
>> >
>> > It occurred to me that the change I made a few days ago might solve the
>> > java problem on some sites, but also mess up wget/curl lookup.


From hategan at mcs.anl.gov  Thu Feb  5 15:52:23 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 5 Feb 2009 15:52:23 -0600 (CST)
Subject: [Swift-devel] coasters paths (was Re: Coasters failing on
	Teraport - cant find Java?)
In-Reply-To: <50b07b4b0902051218l1d259177y42e57d1273aa2997@mail.gmail.com>
Message-ID: <15055702.1859001233870743461.JavaMail.root@zimbra>


----- Allan Espinosa <aespinosa at cs.uchicago.edu> wrote:
> 
> I guess the best approach is to create an environment-agnostic HTTP GET
> request. From the standard Java network packages, perhaps?

Except you need a small script to download the jar file that
implements that, finds java, and starts the downloader. Which 
is pretty much what the existing script does.

So I think the solution is to fix the existing script.


From wilde at mcs.anl.gov  Thu Feb  5 16:26:52 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 05 Feb 2009 16:26:52 -0600
Subject: [Swift-devel] coasters paths (was Re: Coasters failing on Teraport
	- cant find Java?)
In-Reply-To: <15055702.1859001233870743461.JavaMail.root@zimbra>
References: <15055702.1859001233870743461.JavaMail.root@zimbra>
Message-ID: <498B67AC.8060502@mcs.anl.gov>

Yeah, I agree. It just takes time and iterative testing.

I wonder if it would be useful to be able to run just bootstrap.sh in 
some kind of test mode (ie give it args that just verify that it can 
start a java service), and then run this from a script that does a 
globus-job-run of the script to a growing set of sites.  And add that to 
the build+test battery.

On 2/5/09 3:52 PM, Mihael Hategan wrote:
> ----- Allan Espinosa <aespinosa at cs.uchicago.edu> wrote:
>> I guess the best approach is to create an environment-agnostic HTTP GET
>> request. From the standard Java network packages, perhaps?
> 
> Except you need a small script to download the jar file that
> implements that, finds java, and starts the downloader. Which 
> is pretty much what the existing script does.
> 
> So I think the solution is to fix the existing script.


From hategan at mcs.anl.gov  Thu Feb  5 17:07:48 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 5 Feb 2009 17:07:48 -0600 (CST)
Subject: [Swift-devel] coasters paths (was Re: Coasters failing on
	Teraport - cant find Java?)
In-Reply-To: <498B67AC.8060502@mcs.anl.gov>
Message-ID: <20254178.1868091233875268293.JavaMail.root@zimbra>


----- Michael Wilde <wilde at mcs.anl.gov> wrote:
> Yeah, I agree. It just takes time and iterative testing.

Right.

> 
> I wonder if it would be useful to be able to run just bootstrap.sh in 
> some kind of test mode (ie give it args that just verify that it can 
> start a java service), and then run this from a script that does a 
> globus-job-run of the script to a growing set of sites.  And add that to 
> the build+test battery.

It can be done. But I think it's one of those things where if you^1 can't 
figure out how to do it^2, it's likely you won't contribute much to the 
effort of fixing it.

1. "You" as in X.
2. Find the script, figure out the parameters, fake the environment, and
globusrun it.


From hategan at mcs.anl.gov  Thu Feb  5 17:13:37 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 5 Feb 2009 17:13:37 -0600 (CST)
Subject: [Swift-devel] coasters paths (was Re: Coasters failing on
	Teraport - cant find Java?)
In-Reply-To: <20254178.1868091233875268293.JavaMail.root@zimbra>
Message-ID: <32262676.1868521233875617337.JavaMail.root@zimbra>


----- Mihael Hategan <hategan at mcs.anl.gov> wrote:
> 
> ----- Michael Wilde <wilde at mcs.anl.gov> wrote:
> > Yeah, I agree. It just takes time and iterative testing.
> 
> Right.
> 
> > 
> > I wonder if it would be useful to be able to run just bootstrap.sh in 
> > some kind of test mode (ie give it args that just verify that it can 
> > start a java service), and then run this from a script that does a 
> > globus-job-run of the script to a growing set of sites.  And add that to 
> > the build+test battery.
> 
> It can be done. But I think it's one of those things where if you^1 can't 
> figure out how to do it^2, it's likely you won't contribute much to the 
> effort of fixing it.
> 
> 1. "You" as in X.
> 2. Find the script, figure out the parameters, fake the environment, and
> globusrun it.

... and there already exist coaster tests, which would reveal issues with
bootstrap.sh (modulo actually having an extensive set of sites files), and
which I would personally be happy with for debugging.

Therefore I don't see a reason for doing it.


From wilde at mcs.anl.gov  Thu Feb  5 23:22:11 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 05 Feb 2009 23:22:11 -0600
Subject: [Swift-devel] Problems with coasters and managedfork jobmanager
Message-ID: <498BC903.7010008@mcs.anl.gov>

I was able to run with coasters on teraport last week, using 
gt2:gt2:pbs, but not today.

I see the error "Failed to parse command file (line 21)" in my swift 
output and in the gram log (excerpt of the latter, below).

This line # was originally 17. I added some comment lines to 
bootstrap.sh to see if the line number would move, and indeed it did. So 
it suggests something in the jobmanager that's unable to handle the text
of the bootstrap script embedded in its RSL. But I don't think the line
in this error is the line in the bootstrap script.

Does anyone know how to find the script text that the jobmanager is 
complaining about?

As far as I can tell, something changed on teraport (or my config?) as 
my gram logs from last week indicate that the plain fork jobmanager was 
being used. (I've got an email in to teraport support to probe this).

I see Mats's note in a prior mail about concern that the managed-fork
mechanism may kill the coaster service, but no comments about script 
parsing errors.

I'll send more logs on this tomorrow if I haven't found it yet.

Thanks,

Mike


Thu Feb  5 21:17:51 2009 JM_SCRIPT: Error file is not empty, and submission failed

Thu Feb  5 21:17:51 2009 JM_SCRIPT: Error text is
ERROR: Failed to parse command file (line 21).

Thu Feb  5 21:17:51 2009 JM_SCRIPT: Writing extended error information to stderr
2/5 21:17:51 JM: GT3 extended error message: GRAM_SCRIPT_GT3_FAILURE_MESSAGE: ERROR: Failed to parse command file (line 21).
2/5 21:17:51 JMI: while return_buf = GRAM_SCRIPT_GT3_FAILURE_MESSAGE = ERROR: Failed to parse command file (line 21).
2/5 21:17:51 JMI: while return_buf = GRAM_SCRIPT_ERROR = 17
2/5 21:17:51 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_SUBMIT
2/5 21:17:51 JM: in globus_gram_job_manager_reporting_file_create()
2/5 21:17:51 JM: not reporting job information
2/5 21:17:51 JM: in globus_gram_job_manager_history_file_create()
2/5 21:17:51 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED
2/5 21:17:51 closing destination https://128.135.125.17:50002/dev/stdout-urn:cog-1233890265644
2/5 21:17:51 JM: exiting globus_l_gram_job_manager_output_destination_close()
2/5 21:18:00 closing destination https://128.135.125.17:50002/dev/stderr-urn:cog-1233890265644
2/5 21:18:00 JM: exiting globus_l_gram_job_manager_output_destination_close()
2/5 21:18:00 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_CLOSE_OUTPUT
2/5 21:18:00 JM: NOT empty client callback list.
2/5 21:18:00 JM: sending callback of status 4 (failure code 155) to https://128.135.125.17:50003/1233890268457.
2/5 21:18:00 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_TWO_PHASE
2/5 21:18:00 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_TWO_PHASE_COMMITTED
2/5 21:18:00 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_FILE_CLEAN_UP
2/5 21:18:00 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_SCRATCH_CLEAN_UP
2/5 21:18:00 JMI: testing job manager scripts for type managedfork exist and permissions are ok.
2/5 21:18:00 JMI: completed script validation: job manager type is managedfork.


From hategan at mcs.anl.gov  Thu Feb  5 23:50:51 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 05 Feb 2009 23:50:51 -0600
Subject: [Swift-devel] Problems with coasters and managedfork jobmanager
In-Reply-To: <498BC903.7010008@mcs.anl.gov>
References: <498BC903.7010008@mcs.anl.gov>
Message-ID: <1233899451.2869.8.camel@localhost>

This particular line seems troubling to me:

2/5 21:18:00 JMI: testing job manager scripts for type managedfork
exist and permissions are ok.

Does this mean that managed fork is now in use on TP? Is there any way
to still use plain fork?

On Thu, 2009-02-05 at 23:22 -0600, Michael Wilde wrote:
> I was able to run with coasters on teraport last week, using 
> gt2:gt2:pbs, but not today.
> 
> I see the error "Failed to parse command file (line 21)" in my swift 
> output and in the gram log (excerpt of the latter, below).
> 
> This line # was originally 17. I added some comment lines to 
> bootstrap.sh to see if the line number would move, and indeed it did. So 
> it suggests something in the jobmanager that's unable to handle the text
> of the bootstrap script embedded in its RSL. But I don't think the line
> in this error is the line in the bootstrap script.
> 
> Does anyone know how to find the script text that the jobmanager is 
> complaining about?
> 
> As far as I can tell, something changed on teraport (or my config?) as 
> my gram logs from last week indicate that the plain fork jobmanager was 
> being used. (I've got an email in to teraport support to probe this).
> 
> I see Mats's note in a prior mail about concern that the managed-fork
> mechanism may kill the coaster service, but no comments about script 
> parsing errors.
> 
> I'll send more logs on this tomorrow if I haven't found it yet.
> 
> Thanks,
> 
> Mike
> 
> 
> 
> Thu Feb  5 21:17:51 2009 JM_SCRIPT: Error file is not empty, and 
> submission failed
> 
> Thu Feb  5 21:17:51 2009 JM_SCRIPT: Error text is
> ERROR: Failed to parse command file (line 21).
> 
> Thu Feb  5 21:17:51 2009 JM_SCRIPT: Writing extended error information 
> to stderr
> 2/5 21:17:51 JM: GT3 extended error message: 
> GRAM_SCRIPT_GT3_FAILURE_MESSAGE: ERROR: Failed to parse command file 
> (line 21).
> 2/5 21:17:51 JMI: while return_buf = GRAM_SCRIPT_GT3_FAILURE_MESSAGE = 
> ERROR: Failed to parse command file (line 21).
> 2/5 21:17:51 JMI: while return_buf = GRAM_SCRIPT_ERROR = 17
> 2/5 21:17:51 Job Manager State Machine (entering): 
> GLOBUS_GRAM_JOB_MANAGER_STATE_SUBMIT
> 2/5 21:17:51 JM: in globus_gram_job_manager_reporting_file_create()
> 2/5 21:17:51 JM: not reporting job information
> 2/5 21:17:51 JM: in globus_gram_job_manager_history_file_create()
> 2/5 21:17:51 Job Manager State Machine (entering): 
> GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED
> 2/5 21:17:51 closing destination 
> https://128.135.125.17:50002/dev/stdout-urn:cog-1233890265644
> 2/5 21:17:51 JM: exiting 
> globus_l_gram_job_manager_output_destination_close()
> 2/5 21:18:00 closing destination 
> https://128.135.125.17:50002/dev/stderr-urn:cog-1233890265644
> 2/5 21:18:00 JM: exiting 
> globus_l_gram_job_manager_output_destination_close()
> 2/5 21:18:00 Job Manager State Machine (entering): 
> GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_CLOSE_OUTPUT
> 2/5 21:18:00 JM: NOT empty client callback list.
> 2/5 21:18:00 JM: sending callback of status 4 (failure code 155) to 
> https://128.135.125.17:50003/1233890268457.
> 2/5 21:18:00 Job Manager State Machine (entering): 
> GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_TWO_PHASE
> 2/5 21:18:00 Job Manager State Machine (entering): 
> GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_TWO_PHASE_COMMITTED
> 2/5 21:18:00 Job Manager State Machine (entering): 
> GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_FILE_CLEAN_UP
> 2/5 21:18:00 Job Manager State Machine (entering): 
> GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_SCRATCH_CLEAN_UP
> 2/5 21:18:00 JMI: testing job manager scripts for type managedfork exist 
> and permissions are ok.
> 2/5 21:18:00 JMI: completed script validation: job manager type is 
> managedfork.
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel


From hategan at mcs.anl.gov  Thu Feb  5 23:52:19 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 05 Feb 2009 23:52:19 -0600
Subject: [Swift-devel] coasters paths (was Re: Coasters failing on
	Teraport - cant find Java?)
In-Reply-To: <32262676.1868521233875617337.JavaMail.root@zimbra>
References: <32262676.1868521233875617337.JavaMail.root@zimbra>
Message-ID: <1233899539.2869.11.camel@localhost>

On Thu, 2009-02-05 at 17:13 -0600, Mihael Hategan wrote:
> ----- Mihael Hategan <hategan at mcs.anl.gov> wrote:
> > 
> > ----- Michael Wilde <wilde at mcs.anl.gov> wrote:
> > > Yeah, I agree. It just takes time and iterative testing.
> > 
> > Right.
> > 
> > > 
> > > I wonder if it would be useful to be able to run just bootstrap.sh in 
> > > some kind of test mode (ie give it args that just verify that it can 
> > > start a java service), and then run this from a script that does a 
> > > globus-job-run of the script to a growing set of sites.  And add that to 
> > > the build+test battery.
> > 
> > It can be done. But I think it's one of those things where if you^1 can't 
> > figure out how to do it^2, it's likely you won't contribute much to the 
> > effort of fixing it.
> > 
> > 1. "You" as in X.
> > 2. Find the script, figure out the parameters, fake the environment, and
> > globusrun it.
> 
> ... and there already exist coaster tests, which would reveal issues with
> bootstrap.sh (modulo actually having an extensive set of sites files), and
> which I would personally be happy with for debugging.
> 
> Therefore I don't see a reason for doing it.

Ok, this whole thing makes no sense to me. Please forget the things I
said above.


From aespinosa at cs.uchicago.edu  Fri Feb  6 02:57:24 2009
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Fri, 6 Feb 2009 02:57:24 -0600
Subject: [Swift-devel] Problems with coasters and managedfork jobmanager
In-Reply-To: <498BC903.7010008@mcs.anl.gov>
References: <498BC903.7010008@mcs.anl.gov>
Message-ID: <50b07b4b0902060057n59d56823s8b6aaf9f80b68357@mail.gmail.com>

Hi Mike,

I think Greg posted about an OSG stack upgrade this week, so GRAM won't
be available. That's why I just used local:pbs for my runs today.

-Allan

On Thu, Feb 5, 2009 at 11:22 PM, Michael Wilde <wilde at mcs.anl.gov> wrote:
> I was able to run with coasters on teraport last week, using gt2:gt2:pbs,
> but not today.
>
> I see the error "Failed to parse command file (line 21)" in my swift output
> and in the gram log (excerpt of the latter, below).
>
> This line # was originally 17. I added some comment lines to bootstrap.sh to
> see if the line number would move, and indeed it did. So it suggests
> something in the jobmanager that's unable to handle the text of the bootstrap
> script embedded in its RSL. But I don't think the line in this error is the
> line in the bootstrap script.
>
> Does anyone know how to find the script text that the jobmanager is
> complaining about?
>
> As far as I can tell, something changed on teraport (or my config?) as my
> gram logs from last week indicate that the plain fork jobmanager was being
> used. (I've got an email in to teraport support to probe this).
>
> I see Mats's note in a prior mail about concern that the managed-fork
> mechanism may kill the coaster service, but no comments about script parsing
> errors.
>
> I'll send more logs on this tomorrow if I haven't found it yet.
>
> Thanks,
>
> Mike
>
>
>
> Thu Feb  5 21:17:51 2009 JM_SCRIPT: Error file is not empty, and submission
> failed
>
> Thu Feb  5 21:17:51 2009 JM_SCRIPT: Error text is
> ERROR: Failed to parse command file (line 21).
>
> Thu Feb  5 21:17:51 2009 JM_SCRIPT: Writing extended error information to
> stderr
> 2/5 21:17:51 JM: GT3 extended error message:
> GRAM_SCRIPT_GT3_FAILURE_MESSAGE: ERROR: Failed to parse command file (line
> 21).
> 2/5 21:17:51 JMI: while return_buf = GRAM_SCRIPT_GT3_FAILURE_MESSAGE =
> ERROR: Failed to parse command file (line 21).
> 2/5 21:17:51 JMI: while return_buf = GRAM_SCRIPT_ERROR = 17
> 2/5 21:17:51 Job Manager State Machine (entering):
> GLOBUS_GRAM_JOB_MANAGER_STATE_SUBMIT
> 2/5 21:17:51 JM: in globus_gram_job_manager_reporting_file_create()
> 2/5 21:17:51 JM: not reporting job information
> 2/5 21:17:51 JM: in globus_gram_job_manager_history_file_create()
> 2/5 21:17:51 Job Manager State Machine (entering):
> GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED
> 2/5 21:17:51 closing destination
> https://128.135.125.17:50002/dev/stdout-urn:cog-1233890265644
> 2/5 21:17:51 JM: exiting
> globus_l_gram_job_manager_output_destination_close()
> 2/5 21:18:00 closing destination
> https://128.135.125.17:50002/dev/stderr-urn:cog-1233890265644
> 2/5 21:18:00 JM: exiting
> globus_l_gram_job_manager_output_destination_close()
> 2/5 21:18:00 Job Manager State Machine (entering):
> GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_CLOSE_OUTPUT
> 2/5 21:18:00 JM: NOT empty client callback list.
> 2/5 21:18:00 JM: sending callback of status 4 (failure code 155) to
> https://128.135.125.17:50003/1233890268457.
> 2/5 21:18:00 Job Manager State Machine (entering):
> GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_TWO_PHASE
> 2/5 21:18:00 Job Manager State Machine (entering):
> GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_TWO_PHASE_COMMITTED
> 2/5 21:18:00 Job Manager State Machine (entering):
> GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_FILE_CLEAN_UP
> 2/5 21:18:00 Job Manager State Machine (entering):
> GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_SCRATCH_CLEAN_UP
> 2/5 21:18:00 JMI: testing job manager scripts for type managedfork exist and
> permissions are ok.
> 2/5 21:18:00 JMI: completed script validation: job manager type is
> managedfork.
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>
>


-- 
Allan M. Espinosa <http://allan.88-mph.net/blog>
PhD student, Computer Science
University of Chicago <http://people.cs.uchicago.edu/~aespinosa>


From benc at hawaga.org.uk  Fri Feb  6 02:56:54 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 6 Feb 2009 08:56:54 +0000 (GMT)
Subject: [Swift-devel] strange behavior evaluating function call as trace
	arg
In-Reply-To: <498B2338.8000201@mcs.anl.gov>
References: <19137214.1784411233777937698.JavaMail.root@zimbra> 
	<Pine.LNX.4.64.0902050125350.8995@dildano.hawaga.org.uk>
	<498B0A8B.7000601@mcs.anl.gov>
	<Pine.LNX.4.64.0902051600430.14259@dildano.hawaga.org.uk>
	<1233850924.18738.15.camel@localhost> <498B2338.8000201@mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0902060849170.14259@dildano.hawaga.org.uk>


On Thu, 5 Feb 2009, Michael Wilde wrote:

> lvalue: type, *handle, state (set/unset) - same as handle==null?

Like Hategan says, don't use the word 'lvalue' - I only used it whimsically.

> same as handle==null

no. If a DSHandle doesn't have its value yet you cannot observe its value 
- you'll find yourself shunted into the future where the DSHandle does 
have a value.
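A minimal Java sketch of the behaviour described above, assuming a hypothetical `Handle<T>` class built on `java.util.concurrent.CompletableFuture`; it is not Swift's actual DSHandle API. The point it illustrates: an unset handle is not the same as a null value, because a reader simply waits until some writer supplies the value.

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical single-assignment handle; illustrates the semantics only,
// it is not Swift's DSHandle class.
final class Handle<T> {
    private final CompletableFuture<T> value = new CompletableFuture<>();

    // A handle may be assigned once; a second assignment is an error.
    void set(T v) {
        if (!value.complete(v)) {
            throw new IllegalStateException("handle already has a value");
        }
    }

    // Whether a value has been assigned yet (for diagnostics only).
    boolean isSet() {
        return value.isDone();
    }

    // Reading an unset handle does not return null: the caller blocks
    // until some writer calls set(), then sees that value.
    T get() {
        return value.join();
    }
}

class HandleDemo {
    public static void main(String[] args) throws InterruptedException {
        Handle<String> h = new Handle<>();
        Thread writer = new Thread(() -> h.set("hello"));
        writer.start();
        System.out.println(h.get()); // waits for the writer if necessary
        writer.join();
    }
}
```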

-- 


From benc at hawaga.org.uk  Fri Feb  6 03:17:50 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 6 Feb 2009 09:17:50 +0000 (GMT)
Subject: [Swift-devel] strange behavior evaluating function call as trace
	arg
In-Reply-To: <498B2338.8000201@mcs.anl.gov>
References: <19137214.1784411233777937698.JavaMail.root@zimbra> 
	<Pine.LNX.4.64.0902050125350.8995@dildano.hawaga.org.uk>
	<498B0A8B.7000601@mcs.anl.gov>
	<Pine.LNX.4.64.0902051600430.14259@dildano.hawaga.org.uk>
	<1233850924.18738.15.camel@localhost> <498B2338.8000201@mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0902060857070.14259@dildano.hawaga.org.uk>


On Thu, 5 Feb 2009, Michael Wilde wrote:

> Is it reasonable to put this into latex form and jointly edit until 
> we're satisfied with it?
> 
> If so, I will do that.

The text from the language section of that paper is in the user guide now. 
That probably should be the most definitive location for information.

-- 


From benc at hawaga.org.uk  Fri Feb  6 08:04:22 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 6 Feb 2009 14:04:22 +0000 (GMT)
Subject: [Swift-devel] Problems with coasters and managedfork jobmanager
In-Reply-To: <498BC903.7010008@mcs.anl.gov>
References: <498BC903.7010008@mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0902061351270.8995@dildano.hawaga.org.uk>


On Thu, 5 Feb 2009, Michael Wilde wrote:

> I see Mats's note in a prior mail about concern that the managed-fork mechanism
> may kill the coaster service, but no comments about script parsing errors.

The condor jobmanager deals quite poorly with whitespace in arguments, in 
a way that I cannot see how to work around. (I've run into a very similar 
problem when looking at making Swift run without any shared filesystem).

This bit almost definitely doesn't work with existing jobmanager-condor.

>            js.setExecutable("/bin/bash");
>            js.addArgument("-c");
>            js.addArgument(loadBootstrapScript());


GRAM provided an update package to VDT/OSG the other day that changes 
condor jobmanager whitespace handling so that it may be possible to make 
it work. See this thread: 
http://lists.globus.org/pipermail/gram-user/2009-January/000790.html

With the present deployed infrastructure, one approach might be to have 
the bootstrap script staged in as a file using file transfer mechanisms 
(in the quickest hack case, staged in at the same time as wrapper.sh and 
seq.sh by swift, though this will not work if you are trying to use the 
coaster filesystem provider), allowing the shell command to have spaces 
removed.
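The difference between the current launch and the staged approach described above can be sketched in Java. `SimpleJobSpec` is a hypothetical stand-in that only mimics the `setExecutable`/`addArgument` calls quoted above (it is not the CoG job specification class), and `bootstrap.sh` as the staged file name is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a job specification. It only mimics the
// setExecutable/addArgument calls quoted above; it is not the CoG class.
final class SimpleJobSpec {
    private String executable;
    private final List<String> arguments = new ArrayList<>();

    void setExecutable(String executable) { this.executable = executable; }
    void addArgument(String argument) { arguments.add(argument); }
    String getExecutable() { return executable; }
    List<String> getArguments() { return arguments; }
}

final class BootstrapLaunch {
    // Current approach: the entire script text travels as one argument,
    // so it contains spaces and newlines.
    static SimpleJobSpec inline(String scriptText) {
        SimpleJobSpec js = new SimpleJobSpec();
        js.setExecutable("/bin/bash");
        js.addArgument("-c");
        js.addArgument(scriptText);
        return js;
    }

    // Staged approach: the script has already been transferred to the remote
    // working directory (file name is a placeholder), so only its name is passed.
    static SimpleJobSpec staged(String stagedFileName) {
        SimpleJobSpec js = new SimpleJobSpec();
        js.setExecutable("/bin/bash");
        js.addArgument(stagedFileName);
        return js;
    }

    // True if any argument contains whitespace (spaces, tabs or newlines),
    // which the deployed condor/managed-fork jobmanagers may mishandle.
    static boolean hasWhitespaceArgument(SimpleJobSpec js) {
        for (String argument : js.getArguments()) {
            for (int i = 0; i < argument.length(); i++) {
                if (Character.isWhitespace(argument.charAt(i))) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String script = "echo starting\necho done";
        System.out.println(hasWhitespaceArgument(inline(script)));         // true
        System.out.println(hasWhitespaceArgument(staged("bootstrap.sh"))); // false
    }
}
```

The staged variant trades the quoting problem for an extra file transfer, which, as noted above, would not work with the coaster filesystem provider.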

-- 


From wilde at mcs.anl.gov  Fri Feb  6 09:19:39 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 06 Feb 2009 09:19:39 -0600
Subject: [Swift-devel] Problems with coasters and managedfork jobmanager
In-Reply-To: <Pine.LNX.4.64.0902061351270.8995@dildano.hawaga.org.uk>
References: <498BC903.7010008@mcs.anl.gov>
	<Pine.LNX.4.64.0902061351270.8995@dildano.hawaga.org.uk>
Message-ID: <498C550B.4040303@mcs.anl.gov>


On 2/6/09 8:04 AM, Ben Clifford wrote:
> On Thu, 5 Feb 2009, Michael Wilde wrote:
> 
>> I see Mats's note in a prior mail about concern that the managed-fork mechanism
>> may kill the coaster service, but no comments about script parsing errors.
> 
> The condor jobmanager deals quite poorly with whitespace in arguments, in 
> a way that I cannot see how to work around. (I've run into a very similar 
> problem when looking at making Swift run without any shared filesystem).
> 
> This bit almost definitely doesn't work with existing jobmanager-condor.
> 
>>            js.setExecutable("/bin/bash");
>>            js.addArgument("-c");
>>            js.addArgument(loadBootstrapScript());
> 

I see. The problem turns out to be the newlines in the command script. 
It can be reproduced with globusrun:

com$ globusrun -o -r 
tp-grid1.ci.uchicago.edu:2119/jobmanager-managedfork 
'&(executable="/bin/echo") (arguments= "hello world")'
hello world


com$ globusrun -o -r 
tp-grid1.ci.uchicago.edu:2119/jobmanager-managedfork 
'&(executable="/bin/echo") (arguments= "hello
 >  world")'

ERROR: Failed to parse command file (line 10).
GRAM Job failed because the job failed when the job manager attempted to 
run it (error code 17)
com$
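The failure can also be caught on the client side before submission. A minimal Java sketch, assuming a hypothetical `RslSketch` helper and assuming the GT2 RSL convention that a double quote inside a quoted value is written twice. It builds the same kind of RSL string as the `globusrun` test above, but rejects newline-containing arguments up front instead of letting the jobmanager fail with error code 17.

```java
final class RslSketch {
    // Quote an RSL value, doubling any embedded double quote
    // (assumed GT2 RSL quoting convention).
    static String quote(String value) {
        return "\"" + value.replace("\"", "\"\"") + "\"";
    }

    // Build "&(executable=...)(arguments= ...)" like the globusrun test above.
    // Newlines are rejected because the managed-fork jobmanager failed with
    // "Failed to parse command file" when given them.
    static String build(String executable, String... arguments) {
        StringBuilder sb = new StringBuilder();
        sb.append("&(executable=").append(quote(executable)).append(')');
        if (arguments.length == 0) {
            return sb.toString();
        }
        sb.append("(arguments=");
        for (int i = 0; i < arguments.length; i++) {
            String a = arguments[i];
            if (a.indexOf('\n') >= 0 || a.indexOf('\r') >= 0) {
                throw new IllegalArgumentException("argument " + i + " contains a newline");
            }
            sb.append(' ').append(quote(a));
        }
        return sb.append(')').toString();
    }

    public static void main(String[] args) {
        // prints &(executable="/bin/echo")(arguments= "hello world")
        System.out.println(build("/bin/echo", "hello world"));
    }
}
```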

--

I'll make a brief attempt to work around this, but most likely won't be
able to, as you say.

- Mike


> GRAM provided an update package to VDT/OSG the other day that changes 
> condor jobmanager whitespace handling so that it may be possible to make 
> it work. See this thread: 
> http://lists.globus.org/pipermail/gram-user/2009-January/000790.html
> 
> With the present deployed infrastructure, one approach might be to have 
> the bootstrap script staged in as a file using file transfer mechanisms 
> (in the quickest hack case, staged in at the same time as wrapper.sh and 
> seq.sh by swift, though this will not work if you are trying to use the 
> coaster filesystem provider), allowing the shell command to have spaces 
> removed.
> 


From hategan at mcs.anl.gov  Fri Feb  6 10:12:39 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 06 Feb 2009 10:12:39 -0600
Subject: [Swift-devel] Problems with coasters and managedfork jobmanager
In-Reply-To: <498C550B.4040303@mcs.anl.gov>
References: <498BC903.7010008@mcs.anl.gov>
	<Pine.LNX.4.64.0902061351270.8995@dildano.hawaga.org.uk>
	<498C550B.4040303@mcs.anl.gov>
Message-ID: <1233936759.5005.2.camel@localhost>

I guess we'll have to stage in the bootstrap script using the stage-in
directive if we are to support managed fork, since I don't see OSG
fixing the issue.

Unfortunately, this doesn't work very well with ws-gram.

On Fri, 2009-02-06 at 09:19 -0600, Michael Wilde wrote:
> 
> On 2/6/09 8:04 AM, Ben Clifford wrote:
> > On Thu, 5 Feb 2009, Michael Wilde wrote:
> > 
> >> I see Mats's note in a prior mail about concern that the managed-fork mechanism
> >> may kill the coaster service, but no comments about script parsing errors.
> > 
> > The condor jobmanager deals quite poorly with whitespace in arguments, in 
> > a way that I cannot see how to work around. (I've run into a very similar 
> > problem when looking at making Swift run without any shared filesystem).
> > 
> > This bit almost definitely doesn't work with existing jobmanager-condor.
> > 
> >>            js.setExecutable("/bin/bash");
> >>            js.addArgument("-c");
> >>            js.addArgument(loadBootstrapScript());
> > 
> 
> I see. The problem turns out to be the newlines in the command script. 
> It can be reproduced with globusrun:
> 
> com$ globusrun -o -r 
> tp-grid1.ci.uchicago.edu:2119/jobmanager-managedfork 
> '&(executable="/bin/echo") (arguments= "hello world")'
> hello world
> 
> 
> com$ globusrun -o -r 
> tp-grid1.ci.uchicago.edu:2119/jobmanager-managedfork 
> '&(executable="/bin/echo") (arguments= "hello
>  >  world")'
> 
> ERROR: Failed to parse command file (line 10).
> GRAM Job failed because the job failed when the job manager attempted to 
> run it (error code 17)
> com$
> 
> --
> 
> I'll make a brief attempt to work around this, but most likely won't be 
> able to, as you say.
> 
> - Mike
> 
> 
> > GRAM provided an update package to VDT/OSG the other day that changes 
> > condor jobmanager whitespace handling so that it may be possible to make 
> > it work. See this thread: 
> > http://lists.globus.org/pipermail/gram-user/2009-January/000790.html
> > 
> > With the present deployed infrastructure, one approach might be to have 
> > the bootstrap script staged in as a file using file transfer mechanisms 
> > (in the quickest hack case, staged in at the same time as wrapper.sh and 
> > seq.sh by swift, though this will not work if you are trying to use the 
> > coaster filesystem provider), allowing the shell command to have spaces 
> > removed.
> > 


From benc at hawaga.org.uk  Fri Feb  6 10:17:05 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 6 Feb 2009 16:17:05 +0000 (GMT)
Subject: [Swift-devel] Problems with coasters and managedfork jobmanager
In-Reply-To: <1233936759.5005.2.camel@localhost>
References: <498BC903.7010008@mcs.anl.gov>
	<Pine.LNX.4.64.0902061351270.8995@dildano.hawaga.org.uk>
	<498C550B.4040303@mcs.anl.gov> <1233936759.5005.2.camel@localhost>
Message-ID: <Pine.LNX.4.64.0902061616080.14259@dildano.hawaga.org.uk>


On Fri, 6 Feb 2009, Mihael Hategan wrote:

> I guess we'll have to stage in the bootstrap script using the stage-in
> directive if we are to support managed fork, since I don't see OSG
> fixing the issue.

They are fixing the whitespace in parameters - see the gram-user thread I 
sent in a different message.

-- 


From hategan at mcs.anl.gov  Fri Feb  6 10:24:07 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 06 Feb 2009 10:24:07 -0600
Subject: [Swift-devel] Problems with coasters and managedfork jobmanager
In-Reply-To: <Pine.LNX.4.64.0902061616080.14259@dildano.hawaga.org.uk>
References: <498BC903.7010008@mcs.anl.gov>
	<Pine.LNX.4.64.0902061351270.8995@dildano.hawaga.org.uk>
	<498C550B.4040303@mcs.anl.gov> <1233936759.5005.2.camel@localhost>
	<Pine.LNX.4.64.0902061616080.14259@dildano.hawaga.org.uk>
Message-ID: <1233937447.5206.0.camel@localhost>

On Fri, 2009-02-06 at 16:17 +0000, Ben Clifford wrote:
> On Fri, 6 Feb 2009, Mihael Hategan wrote:
> 
> > I guess we'll have to stage in the bootstrap script using the stage-in
> > directive if we are to support managed fork, since I don't see OSG
> > fixing the issue.
> 
> They are fixing the whitespace in parameters - see the gram-user thread I 
> sent in a different message.

Does this include the newlines?


From wilde at mcs.anl.gov  Fri Feb  6 10:33:25 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 06 Feb 2009 10:33:25 -0600
Subject: [Swift-devel] Problems with coasters and managedfork jobmanager
In-Reply-To: <1233937447.5206.0.camel@localhost>
References: <498BC903.7010008@mcs.anl.gov>	
	<Pine.LNX.4.64.0902061351270.8995@dildano.hawaga.org.uk>	
	<498C550B.4040303@mcs.anl.gov> <1233936759.5005.2.camel@localhost>	
	<Pine.LNX.4.64.0902061616080.14259@dildano.hawaga.org.uk>
	<1233937447.5206.0.camel@localhost>
Message-ID: <498C6655.3010702@mcs.anl.gov>

I don't know, but I am testing a version where I removed the newlines 
from bootstrap.pl (and adjusted a few bits manually), and I *think* it's 
moving on to the next stage and trying to start the workers.

Ben, it seems that *some* whitespace is passed on OK, in that I can run 
a job that does echo "hello world" and that blank after hello is 
preserved, and the job runs. I assume the whitespace problem is more 
subtle than that?

On 2/6/09 10:24 AM, Mihael Hategan wrote:
> On Fri, 2009-02-06 at 16:17 +0000, Ben Clifford wrote:
>> On Fri, 6 Feb 2009, Mihael Hategan wrote:
>>
>>> I guess we'll have to stage in the bootstrap script using the stage-in
>>> directive if we are to support managed fork, since I don't see OSG
>>> fixing the issue.
>> They are fixing the whitespace in parameters - see the gram-user thread I 
>> sent in a different message.
> 
> Does this include the newlines?
> 


From hategan at mcs.anl.gov  Fri Feb  6 10:38:03 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 06 Feb 2009 10:38:03 -0600
Subject: [Swift-devel] Problems with coasters and managedfork jobmanager
In-Reply-To: <498C6655.3010702@mcs.anl.gov>
References: <498BC903.7010008@mcs.anl.gov>
	<Pine.LNX.4.64.0902061351270.8995@dildano.hawaga.org.uk>
	<498C550B.4040303@mcs.anl.gov> <1233936759.5005.2.camel@localhost>
	<Pine.LNX.4.64.0902061616080.14259@dildano.hawaga.org.uk>
	<1233937447.5206.0.camel@localhost>  <498C6655.3010702@mcs.anl.gov>
Message-ID: <1233938283.5483.0.camel@localhost>

I suppose that script could be made newline-less.
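Making the script newline-less by hand works, but doing it mechanically is only safe for simple scripts: joining every line with `; ` breaks constructs such as `then` (bash rejects `then;`). A heuristic Java sketch, assuming a hypothetical `OneLiner.flatten` helper and a script with no here-documents, trailing comments, backslash continuations, multi-line strings, or multi-line case arms.

```java
import java.util.ArrayList;
import java.util.List;

final class OneLiner {
    // Lines ending in these keywords open a construct, so the next line is
    // joined with a plain space ("then; echo" is a bash syntax error).
    private static final String[] WORD_OPENERS = {"then", "do", "else", "in"};
    // Lines ending in these symbols must not be followed by ";" either
    // (e.g. "cmd &;" and "cmd |;" are syntax errors, ";;" ends a case arm).
    private static final String[] SYMBOL_ENDINGS = {"|", "&", ";", "{", "("};

    // Heuristically flatten a simple shell script to a single line.
    static String flatten(String script) {
        if (script.contains("<<")) {
            throw new IllegalArgumentException("here-documents are not supported");
        }
        List<String> lines = new ArrayList<>();
        for (String raw : script.split("\n")) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // drop blank lines, full-line comments and the shebang
            }
            lines.add(line);
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i);
            sb.append(line);
            if (i < lines.size() - 1) {
                sb.append(needsSpaceJoin(line) ? " " : "; ");
            }
        }
        return sb.toString();
    }

    private static boolean needsSpaceJoin(String line) {
        for (String s : SYMBOL_ENDINGS) {
            if (line.endsWith(s)) {
                return true;
            }
        }
        for (String w : WORD_OPENERS) {
            if (line.equals(w) || line.endsWith(" " + w) || line.endsWith(";" + w)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String script = "#!/bin/bash\nif true\nthen\n  echo a\nfi\necho done\n";
        // prints: if true; then echo a; fi; echo done
        System.out.println(flatten(script));
    }
}
```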

On Fri, 2009-02-06 at 10:33 -0600, Michael Wilde wrote:
> I don't know, but I am testing a version where I removed the newlines 
> from bootstrap.pl (and adjusted a few bits manually), and I *think* it's 
> moving on to the next stage and trying to start the workers.
> 
> Ben, it seems that *some* whitespace is passed on OK, in that I can run 
> a job that does echo "hello world" and that blank after hello is 
> preserved, and the job runs. I assume the whitespace problem is more 
> subtle than that?
> 
> On 2/6/09 10:24 AM, Mihael Hategan wrote:
> > On Fri, 2009-02-06 at 16:17 +0000, Ben Clifford wrote:
> >> On Fri, 6 Feb 2009, Mihael Hategan wrote:
> >>
> >>> I guess we'll have to stage in the bootstrap script using the stage-in
> >>> directive if we are to support managed fork, since I don't see OSG
> >>> fixing the issue.
> >> They are fixing the whitespace in parameters - see the gram-user thread I 
> >> sent in a different message.
> > 
> > Does this include the newlines?
> > 


From benc at hawaga.org.uk  Fri Feb  6 10:40:45 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 6 Feb 2009 16:40:45 +0000 (GMT)
Subject: [Swift-devel] Problems with coasters and managedfork jobmanager
In-Reply-To: <498C6655.3010702@mcs.anl.gov>
References: <498BC903.7010008@mcs.anl.gov>
	<Pine.LNX.4.64.0902061351270.8995@dildano.hawaga.org.uk>
	<498C550B.4040303@mcs.anl.gov> <1233936759.5005.2.camel@localhost> 
	<Pine.LNX.4.64.0902061616080.14259@dildano.hawaga.org.uk>
	<1233937447.5206.0.camel@localhost> <498C6655.3010702@mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0902061639550.14259@dildano.hawaga.org.uk>


On Fri, 6 Feb 2009, Michael Wilde wrote:

> Ben, it seems that *some* whitespace is passed on OK, in that I can run a job
> that does echo "hello world" and that blank after hello is preserved, and the
> job runs. I assume the whitespace problem is more subtle than that?

yes. read the thread that I sent earlier on gram-user:

http://lists.globus.org/pipermail/gram-user/2009-January/000790.html

-- 


From benc at hawaga.org.uk  Fri Feb  6 10:50:00 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 6 Feb 2009 16:50:00 +0000 (GMT)
Subject: [Swift-devel] Problems with coasters and managedfork jobmanager
In-Reply-To: <498C6655.3010702@mcs.anl.gov>
References: <498BC903.7010008@mcs.anl.gov>
	<Pine.LNX.4.64.0902061351270.8995@dildano.hawaga.org.uk>
	<498C550B.4040303@mcs.anl.gov> <1233936759.5005.2.camel@localhost> 
	<Pine.LNX.4.64.0902061616080.14259@dildano.hawaga.org.uk>
	<1233937447.5206.0.camel@localhost> <498C6655.3010702@mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0902061647090.14259@dildano.hawaga.org.uk>


On Fri, 6 Feb 2009, Michael Wilde wrote:

> I don't know, but I am testing a version where I removed the newlines from
> bootstrap.pl (and adjusted a few bits manually), and I *think* it's moving on to
> the next stage and trying to start the workers.

If the remote coaster head node job is running, then you should see 
some activity in the remote side ~/.globus/coasters/coaster.log

Check that.

-- 


From wilde at mcs.anl.gov  Fri Feb  6 11:03:43 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 06 Feb 2009 11:03:43 -0600
Subject: [Swift-devel] Problems with coasters and managedfork jobmanager
In-Reply-To: <Pine.LNX.4.64.0902061647090.14259@dildano.hawaga.org.uk>
References: <498BC903.7010008@mcs.anl.gov>
	<Pine.LNX.4.64.0902061351270.8995@dildano.hawaga.org.uk>
	<498C550B.4040303@mcs.anl.gov> <1233936759.5005.2.camel@localhost>
	<Pine.LNX.4.64.0902061616080.14259@dildano.hawaga.org.uk>
	<1233937447.5206.0.camel@localhost> <498C6655.3010702@mcs.anl.gov>
	<Pine.LNX.4.64.0902061647090.14259@dildano.hawaga.org.uk>
Message-ID: <498C6D6F.9030804@mcs.anl.gov>

No, mine didn't get that far. All the logs I see (under ~osgedu) are from 
you.

I'm dropping out of this and will wait for something from Mihael and/or 
you to test.

On 2/6/09 10:50 AM, Ben Clifford wrote:
> On Fri, 6 Feb 2009, Michael Wilde wrote:
> 
>> I don't know, but I am testing a version where I removed the newlines from
>> bootstrap.pl (and adjusted a few bits manually), and I *think* it's moving on to
>> the next stage and trying to start the workers.
> 
> If the remote coaster head node job is running, then you should see 
> some activity in the remote side ~/.globus/coasters/coaster.log
> 
> Check that.
> 


From benc at hawaga.org.uk  Fri Feb  6 11:50:50 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 6 Feb 2009 17:50:50 +0000 (GMT)
Subject: [Swift-devel] coasters on UC teraport with head job running on a
	worker node
Message-ID: <Pine.LNX.4.64.0902061743010.14259@dildano.hawaga.org.uk>


I hacked around with coasters here to see about getting the head job 
running on a cluster worker node rather than on the cluster head node.

This works on teraport through PBS. The below patch contains the changes I 
made to make that happen.

http://www.ci.uchicago.edu/~benc/tmp/coaster-head-elsewhere

There are three changes I made:

    i) submit to pbs jobmanager instead of to fork jobmanager
   ii) start coaster workers with IP address of the head-worker node 
       instead of the address of the cluster head node
  iii) hack the environment to point to teraport's CA directory (in the 
       environment that I get there, there is no automatically findable 
       CA directory, and an ENV profile appeared to not work).

In situations where the cluster nodes have outbound network connectivity, 
this seems like a nice thing to do, and I want to make this a configurable 
option to go into the SVN.

I think:

i) above should probably be an extension to the existing three-field 
coaster jobmanager string, ii) should be a configuration option to go 
alongside the coasterInternalIP setting, and iii) should be a more general 
ability to set the environment for a coaster worker.
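Changes ii) and iii) can be sketched as two small helpers. This is a hypothetical illustration, not the patch linked above: the class name, the resolution order (an explicit coasterInternalIP-style setting first, then the address of the node actually running the service) and the `/path/to/certificates` value are all assumptions. `X509_CERT_DIR` is the standard Globus environment variable naming the trusted CA certificate directory.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.HashMap;
import java.util.Map;

final class HeadJobOnWorker {
    // Change ii): the address coaster workers should connect back to.
    // Hypothetical order: an explicit setting wins; otherwise use the address
    // of the node the service runs on (a worker node when submitted via PBS).
    static String workerCallbackAddress(String configuredInternalIp, String serviceNodeAddress) {
        if (configuredInternalIp != null && !configuredInternalIp.isEmpty()) {
            return configuredInternalIp;
        }
        return serviceNodeAddress;
    }

    // Change iii): copy a base environment and point X509_CERT_DIR at a
    // site-specific CA directory.
    static Map<String, String> withCaDirectory(Map<String, String> base, String caDir) {
        Map<String, String> env = new HashMap<>(base);
        env.put("X509_CERT_DIR", caDir);
        return env;
    }

    public static void main(String[] args) {
        String here;
        try {
            here = InetAddress.getLocalHost().getHostAddress();
        } catch (UnknownHostException e) {
            here = "127.0.0.1"; // fallback if the local hostname does not resolve
        }
        System.out.println("workers connect to " + workerCallbackAddress(null, here));
        Map<String, String> env = withCaDirectory(System.getenv(), "/path/to/certificates");
        System.out.println("X509_CERT_DIR=" + env.get("X509_CERT_DIR"));
    }
}
```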

Here is the site.xml that I used with this patch:

<config>

  <pool handle="teraport" >
    <gridftp url="gsiftp://tp-osg.ci.uchicago.edu" />
    <execution provider="coaster" url="tp-osg.ci.uchicago.edu"
               jobManager="gt2:gt2:pbs" />
    <workdirectory>/home/benc/swifttest</workdirectory>
  </pool>

</config>

-- 


From wilde at mcs.anl.gov  Fri Feb  6 12:27:21 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 06 Feb 2009 12:27:21 -0600
Subject: [Swift-devel] coasters on UC teraport with head job running on
	a	worker node
In-Reply-To: <Pine.LNX.4.64.0902061743010.14259@dildano.hawaga.org.uk>
References: <Pine.LNX.4.64.0902061743010.14259@dildano.hawaga.org.uk>
Message-ID: <498C8109.603@mcs.anl.gov>

I tested it, and it worked - very nice.

I like the idea of moving the service load to a worker when possible.

So this patch gets around the problem of managed-fork/condor jobmanager 
by submitting to the pbs jobmanager instead of fork.

But that means that to generalize this, we still need to solve the 
problem of running the service bootstrap.sh if the cluster is a condor 
pool, right?

- Mike


On 2/6/09 11:50 AM, Ben Clifford wrote:
> I hacked around with coasters here to see about getting the head job 
> running on a cluster worker node rather than on the cluster head node.
> 
> This works on teraport through PBS. The below patch contains the changes I 
> made to make that happen.
> 
> http://www.ci.uchicago.edu/~benc/tmp/coaster-head-elsewhere
> 
> There are three changes I made:
> 
>     i) submit to pbs jobmanager instead of to fork jobmanager
>    ii) start coaster workers with IP address of the head-worker node 
>        instead of the address of the cluster head node
>   iii) hack the environment to point to teraport's CA directory (in the 
>        environment that I get there, there is no automatically findable 
>        CA directory, and an ENV profile appeared to not work).
> 
> In situations where the cluster nodes have outbound network connectivity, 
> this seems like a nice thing to do, and I want to make this a configurable 
> option to go into the SVN.
> 
> I think:
> 
> i) above should probably be an extension to the existing three-field 
> coaster jobmanager string, ii) should be a configuration option to go 
> alongside the coasterInternalIP setting, and iii) should be a more general 
> ability to set the environment for a coaster worker.
> 
> Here is the site.xml that I used with this patch:
> 
> <config>
> 
>   <pool handle="teraport" >
>     <gridftp  url="gsiftp://tp-osg.ci.uchicago.edu" /> 
>     <execution provider="coaster" url="tp-osg.ci.uchicago.edu" 
> jobManager="gt2:gt2:pbs" />
>     <workdirectory >/home/benc/swifttest</workdirectory>
>   </pool>
> 
> </config>
> 


From benc at hawaga.org.uk  Fri Feb  6 12:34:11 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 6 Feb 2009 18:34:11 +0000 (GMT)
Subject: [Swift-devel] coasters on UC teraport with head job running on
	a worker node
In-Reply-To: <498C8109.603@mcs.anl.gov>
References: <Pine.LNX.4.64.0902061743010.14259@dildano.hawaga.org.uk>
	<498C8109.603@mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0902061832020.14259@dildano.hawaga.org.uk>


On Fri, 6 Feb 2009, Michael Wilde wrote:

> But that means that to generalize this, we still need to solve the problem of
> running the service bootstrap.sh if the cluster is a condor pool, right?

yes. I'd like to see how this behaves against the latest condor 
jobmanager, though, as that is going into VDT.

-- 


From wilde at mcs.anl.gov  Fri Feb  6 12:46:57 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 06 Feb 2009 12:46:57 -0600
Subject: [Swift-devel] coasters on UC teraport with head job running on
	a worker node
In-Reply-To: <Pine.LNX.4.64.0902061832020.14259@dildano.hawaga.org.uk>
References: <Pine.LNX.4.64.0902061743010.14259@dildano.hawaga.org.uk>
	<498C8109.603@mcs.anl.gov>
	<Pine.LNX.4.64.0902061832020.14259@dildano.hawaga.org.uk>
Message-ID: <498C85A1.2070001@mcs.anl.gov>

I missed the message that Greg Cross sent regarding TeraPort updates 
(which Allan pointed out to me).  Perhaps he can test it there?

Or, perhaps he can install it on the HNL cluster for testing?

I'll send a message to support and cc you, unless you have another test 
environment.  Anything in the NMI build&test that can support such a test?

On 2/6/09 12:34 PM, Ben Clifford wrote:
> On Fri, 6 Feb 2009, Michael Wilde wrote:
> 
>> But that means that to generalize this, we still need to solve the problem of
>> running the service bootstrap.sh if the cluster is a condor pool, right?
> 
> yes. I'd like to see how this behaves against the latest condor 
> jobmanager, though, as that is going into VDT.
> 


From hategan at mcs.anl.gov  Fri Feb  6 12:48:24 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 06 Feb 2009 12:48:24 -0600
Subject: [Swift-devel] coasters on UC teraport with head job running on
	a worker node
In-Reply-To: <Pine.LNX.4.64.0902061832020.14259@dildano.hawaga.org.uk>
References: <Pine.LNX.4.64.0902061743010.14259@dildano.hawaga.org.uk>
	<498C8109.603@mcs.anl.gov>
	<Pine.LNX.4.64.0902061832020.14259@dildano.hawaga.org.uk>
Message-ID: <1233946104.9571.1.camel@localhost>

On Fri, 2009-02-06 at 18:34 +0000, Ben Clifford wrote:
> On Fri, 6 Feb 2009, Michael Wilde wrote:
> 
> > But that means that to generalize this, we still need to solve the problem of
> > running the service bootstrap.sh if the cluster is a condor pool, right?
> 
> yes. I'd like to see how this behaves against the latest condor 
> jobmanager, though, as that is going into VDT.
> 

Ok, so it would be sufficient but not necessarily necessary to make
bootstrap.sh a one-liner with the new VDT?


From wilde at mcs.anl.gov  Fri Feb  6 12:48:05 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 06 Feb 2009 12:48:05 -0600
Subject: [Swift-devel] coasters on UC teraport with head job running on
	a worker node
In-Reply-To: <498C85A1.2070001@mcs.anl.gov>
References: <Pine.LNX.4.64.0902061743010.14259@dildano.hawaga.org.uk>	<498C8109.603@mcs.anl.gov>	<Pine.LNX.4.64.0902061832020.14259@dildano.hawaga.org.uk>
	<498C85A1.2070001@mcs.anl.gov>
Message-ID: <498C85E5.6000707@mcs.anl.gov>

or perhaps the ITB testbed that Rob and Suchandra maintain?

On 2/6/09 12:46 PM, Michael Wilde wrote:
> I missed the message that Greg Cross sent regarding TeraPort updates 
> (which Allan pointed out to me).  Perhaps he can test it there?
> 
> Or, perhaps he can install it on the HNL cluster for testing?
> 
> I'll send a message to support and cc you, unless you have another test 
> environment.  Anything in the NMI build&test that can support such a test?
> 
> On 2/6/09 12:34 PM, Ben Clifford wrote:
>> On Fri, 6 Feb 2009, Michael Wilde wrote:
>>
>>> But that means that to generalize this, we still need to solve the 
>>> problem of
>>> running the service bootstrap.sh if the cluster is a condor pool, right?
>>
>> yes. I'd like to see how this behaves against the latest condor 
>> jobmanager, though, as that is going into VDT.
>>


From benc at hawaga.org.uk  Fri Feb  6 12:50:59 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 6 Feb 2009 18:50:59 +0000 (GMT)
Subject: [Swift-devel] coasters on UC teraport with head job running on
	a worker node
In-Reply-To: <498C85A1.2070001@mcs.anl.gov>
References: <Pine.LNX.4.64.0902061743010.14259@dildano.hawaga.org.uk>
	<498C8109.603@mcs.anl.gov>
	<Pine.LNX.4.64.0902061832020.14259@dildano.hawaga.org.uk>
	<498C85A1.2070001@mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0902061848520.14259@dildano.hawaga.org.uk>


On Fri, 6 Feb 2009, Michael Wilde wrote:

> I missed the message that Greg Cross sent regarding TeraPort updates 
> (which Allan pointed out to me).  Perhaps he can test it there?
> 
> Or, perhaps he can install it on the HNL cluster for testing?

It will appear on the ITB in due course, I think. I'm happy to wait for it 
to do so.

-- 


From benc at hawaga.org.uk  Fri Feb  6 12:52:44 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 6 Feb 2009 18:52:44 +0000 (GMT)
Subject: [Swift-devel] coasters on UC teraport with head job running on
	a worker node
In-Reply-To: <1233946104.9571.1.camel@localhost>
References: <Pine.LNX.4.64.0902061743010.14259@dildano.hawaga.org.uk> 
	<498C8109.603@mcs.anl.gov>
	<Pine.LNX.4.64.0902061832020.14259@dildano.hawaga.org.uk>
	<1233946104.9571.1.camel@localhost>
Message-ID: <Pine.LNX.4.64.0902061851250.14259@dildano.hawaga.org.uk>


On Fri, 6 Feb 2009, Mihael Hategan wrote:

> Ok, so it would be sufficient but not necessarily necessary to make
> bootstrap.sh a one-liner with the new VDT?

Should be.

-- 


From wilde at mcs.anl.gov  Fri Feb  6 12:58:40 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 06 Feb 2009 12:58:40 -0600
Subject: [Swift-devel] coasters on UC teraport with head job running on
	a worker node
In-Reply-To: <1233946104.9571.1.camel@localhost>
References: <Pine.LNX.4.64.0902061743010.14259@dildano.hawaga.org.uk>	
	<498C8109.603@mcs.anl.gov>	
	<Pine.LNX.4.64.0902061832020.14259@dildano.hawaga.org.uk>
	<1233946104.9571.1.camel@localhost>
Message-ID: <498C8860.6070105@mcs.anl.gov>

It seems the one-liner alone might get past the managed-fork rejection 
of newlines, but might not be sufficient to deal with the separate-arg 
issue in the gram thread that Ben indicated.

And it's not clear yet whether the fix indicated in that thread also fixes the 
newline issue.

Also, the fix may take a while to propagate across OSG, so a fix that's 
independent of the jobmanager would still be nice if that's possible, but if 
it's more than a few more hours of fiddling, perhaps it's not worth it.

If I can run a swift script with coasters nicely on a set of OSG PBS 
sites and on a set of TG sites, that would be sufficient, I *think*, to 
wait and see how long it will take for the Condor jobmanager fix to 
propagate.

So one approach is:
- polish up and integrate the coaster-service-on-workernode patch
- test the same code on a condor system patched with the whitespace fix;
  if this combination works on all non-condor JMs and on the fixed Condor
  JM, call it done
- (we) test the app-finding fixes (java, wget, etc.) on more systems in
  the meantime

Is that a reasonable route?

- Mike


On 2/6/09 12:48 PM, Mihael Hategan wrote:
> On Fri, 2009-02-06 at 18:34 +0000, Ben Clifford wrote:
>> On Fri, 6 Feb 2009, Michael Wilde wrote:
>>
>>> But that means that to generalize this, we still need to solve the problem of
>>> running the service bootstrap.sh if the cluster is a condor pool, right?
>> yes. I'd like to see how this behaves against the latest condor 
>> jobmanager, though, as that is going into VDT.
>>
> 
> Ok, so it would be sufficient but not necessarily necessary to make
> bootstrap.sh a one-liner with the new VDT?
> 


From hategan at mcs.anl.gov  Fri Feb  6 13:04:49 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 06 Feb 2009 13:04:49 -0600
Subject: [Swift-devel] coasters on UC teraport with head job running on
	a worker node
In-Reply-To: <498C8860.6070105@mcs.anl.gov>
References: <Pine.LNX.4.64.0902061743010.14259@dildano.hawaga.org.uk>
	<498C8109.603@mcs.anl.gov>
	<Pine.LNX.4.64.0902061832020.14259@dildano.hawaga.org.uk>
	<1233946104.9571.1.camel@localhost>  <498C8860.6070105@mcs.anl.gov>
Message-ID: <1233947089.10253.4.camel@localhost>

On Fri, 2009-02-06 at 12:58 -0600, Michael Wilde wrote:
> It seems the one-liner alone might get past the managed-fork rejection 
> of newlines, but perhaps not sufficient to deal with the separate-arg 
> issue in the gram thread that Ben indicated.
> 
> and its not clear yet if the fix indicated in that thread also fixes the 
> newline issue.
> 
> also the fix may take a while to propagate across OSG, so a fix thats 
> independent of jobmanager would still be nice if thats possible, but if 
> its more than a few more hours of fiddling, perhaps not worth it.

I'm open to suggestions, but it seems like the only reasonable choice is
to wait for that fix.

> 
> If I can run a swift script with coasters nicely, on a set of OSG PBS 
> sites, and on a set of TG sites, that would be sufficient, I *think*,

I like the idea of running the service on a worker node. However, I like
the idea of zero configuration even more. So if there is a conflict, I
will favor the latter (unless, of course, the service-on-worker-node
thing can be done seamlessly).


From hategan at mcs.anl.gov  Fri Feb  6 21:27:06 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 06 Feb 2009 21:27:06 -0600
Subject: [Swift-devel] runaway jobs
Message-ID: <1233977226.16878.6.camel@localhost>

I committed a bunch of stuff to deal with that. The idea is to kill a
job if it's over 2*walltime and allow swift to re-schedule it.

It required some changes to the cog abstraction interfaces, and I used
the opportunity to do some clean ups. This means odds of something
breaking somewhat higher than normal. I also updated the versions of
most of the providers, so jar file names will change.


From benc at hawaga.org.uk  Sat Feb  7 05:22:54 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sat, 7 Feb 2009 11:22:54 +0000 (GMT)
Subject: [Swift-devel] runaway jobs
In-Reply-To: <1233977226.16878.6.camel@localhost>
References: <1233977226.16878.6.camel@localhost>
Message-ID: <Pine.LNX.4.64.0902071107210.8995@dildano.hawaga.org.uk>


On Fri, 6 Feb 2009, Mihael Hategan wrote:

> I committed a bunch of stuff to deal with that. The idea is to kill a
> job if it's over 2*walltime and allow swift to re-schedule it.

I think this will interact poorly with clustering, due to the very 
inaccurate times at which clustered jobs go into Active and Completed 
states. Many clustered jobs will exceed their wall time in large clusters 
(for example, clusters that contain more than 2 jobs and where the 
maxwalltime is a tight bound).

A job with walltime w and actual runtime (w-e) is clustered with 3 similar 
tasks, giving a cluster that will run with actual time 4(w-e) = 4w-4e ~= 4w; 
so then all four of the clustered jobs will be presented to the replication 
manager layer as running for 4w (> 2w).
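Spelled out, the arithmetic for this example: the four tasks run back to back inside one cluster job, and, as described above, each task is reported as running for roughly the cluster's full duration.

```latex
\[
t_{\mathrm{cluster}} = 4\,(w - e) = 4w - 4e \approx 4w > 2w \qquad (e \ll w)
\]
```

For instance, with w = 10 min and e = 1 min, each task appears to have run for 4 x 9 = 36 min against a 20 min kill threshold, even though every task individually finished within its walltime.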

As for what actually happens at the moment when you try to cancel a 
clustered task, I'm unsure - perhaps it does nothing, in which case the 
runaway-job mechanism happens to have no adverse effects.

It should be relatively straightforward to disable this mechanism when 
clustering is enabled; so that you can use either this or clusters but not 
both.

But this and replication would be nice to use with clustering.

For that to happen, perhaps there needs to be some better communication 
between the clustering code and the replication code. For example, it 
could be that clusters are subject to walltime control, with walltime 
control on clustered jobs suppressed; and likewise for replication.

The replication stuff works mostly at the karajan Task level so that might 
not be an excessively arduous task.

-- 


From hategan at mcs.anl.gov  Sat Feb  7 08:36:53 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sat, 07 Feb 2009 08:36:53 -0600
Subject: [Swift-devel] runaway jobs
In-Reply-To: <Pine.LNX.4.64.0902071107210.8995@dildano.hawaga.org.uk>
References: <1233977226.16878.6.camel@localhost>
	<Pine.LNX.4.64.0902071107210.8995@dildano.hawaga.org.uk>
Message-ID: <1234017413.20118.0.camel@localhost>

On Sat, 2009-02-07 at 11:22 +0000, Ben Clifford wrote:
> On Fri, 6 Feb 2009, Mihael Hategan wrote:
> 
> > I committed a bunch of stuff to deal with that. The idea is to kill a
> > job if it's over 2*walltime and allow swift to re-schedule it.
> 
> I think this will interact poorly with clustering, due to the very 
> inaccurate times at which clustered jobs go into Active and Completed 
> states. Many clustered jobs will exceed their wall time in large clusters 
> (for example, clusters that contain more than 2 jobs and where the 
> maxwalltime is a tight bound).
> 
> A job with walltime w and actual runtime (w-e) is clustered with 3 similar 
> tasks, giving a cluster that will run with actual time 4(w-e) = 4w-4e ~= 4w; 
> so then all four of the clustered jobs will be presented to the replication 
> manager layer as running for 4w (> 2w).
> 
> As to actually what happens when you try to cancel a clustered task at the 
> moment, I'm unsure - perhaps it does nothing causing the runaway job to 
> happen to have no adverse effects.
> 
> It should be relatively straightforward to disable this mechanism when 
> clustering is enabled; so that you can use either this or clusters but not 
> both.
> 
> But this and replication would be nice to use with clustering.

It can be updated so that it is enabled only when clustering is off, or
when clustering is on and this job is a cluster. That should fix it.
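A minimal shell sketch of that enabling rule, combined with the 2*walltime kill threshold from the original commit. The function name, its arguments, and the use of seconds are assumptions for illustration; the real check lives in the Swift/CoG Java code, not in a shell script.

```shell
#!/bin/sh
# Hypothetical sketch: should the runaway-job check kill this job?
# Arguments (all assumptions for illustration):
#   $1 clustering - "yes" if clustering is enabled
#   $2 is_cluster - "yes" if this job is itself a cluster of tasks
#   $3 elapsed    - seconds the job has been active
#   $4 walltime   - requested walltime in seconds
should_kill_runaway() {
    clustering=$1
    is_cluster=$2
    elapsed=$3
    walltime=$4

    # Enforce only when clustering is off, or when this job is the cluster
    # itself; individual tasks inside a cluster are never checked.
    if [ "$clustering" = "yes" ] && [ "$is_cluster" != "yes" ]; then
        return 1
    fi

    # Kill only once the job has been active for more than twice its walltime.
    [ "$elapsed" -gt $((2 * walltime)) ]
}
```

With a 600 s walltime, a task inside a cluster is never killed by this check, while the cluster job itself (or any unclustered job) becomes eligible after 1200 s.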


From hategan at mcs.anl.gov  Sat Feb  7 14:04:25 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sat, 7 Feb 2009 14:04:25 -0600 (CST)
Subject: [Swift-devel] strange behavior evaluating function call as trace arg
Message-ID: <23806863.1952211234037065554.JavaMail.root@zimbra>

I think that the particular way in which the implementation
manages to do the dataflow (i.e. returns as ref args) should
not be in the intermediate .xml file.

In other words, y = f(x) should be:

<assign>
  <var>y</var>
  <call proc="f">
    <var>x</var>
  </call>
</assign>

not

<call proc="f">
  <output>
    <var>y</var>
  </output>
  <input>
    <var>x</var>
  </input>
</call>

That scheme can be applied later when generating the karajan
code. The current stuff complicates reasoning about the 
intermediate code (e.g. typechecking).


From hategan at mcs.anl.gov  Sat Feb  7 19:37:16 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sat, 7 Feb 2009 19:37:16 -0600 (CST)
Subject: [Swift-devel] strange behavior evaluating function call as
	trace arg
In-Reply-To: <23806863.1952211234037065554.JavaMail.root@zimbra>
Message-ID: <31868237.1960501234057036064.JavaMail.root@zimbra>

Here's a patch. It allows procedure invocations in expressions.

http://www.ci.uchicago.edu/~hategan/invoke-proc.patch


From benc at hawaga.org.uk  Sun Feb  8 03:54:23 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sun, 8 Feb 2009 09:54:23 +0000 (GMT)
Subject: [Swift-devel] strange behavior evaluating function call as trace
	arg
In-Reply-To: <23806863.1952211234037065554.JavaMail.root@zimbra>
References: <23806863.1952211234037065554.JavaMail.root@zimbra>
Message-ID: <Pine.LNX.4.64.0902080949420.8995@dildano.hawaga.org.uk>


On Sat, 7 Feb 2009, Mihael Hategan wrote:

> In other words, y = f(x) should be:
> 
> <assign>
>   <var>y</var>
>   <call proc="f">
>     <var>x</var>
>   </call>
> </assign>

That would fit in with the general trend of not syntactically 
distinguishing between what used to be datasets and what used to not be 
datasets.

Need more markup than the above, though, to accept multiple lvalues in the 
assignment.

-- 


From benc at hawaga.org.uk  Sun Feb  8 05:28:32 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sun, 8 Feb 2009 11:28:32 +0000 (GMT)
Subject: [Swift-devel] strange behavior evaluating function call as trace
	arg
In-Reply-To: <31868237.1960501234057036064.JavaMail.root@zimbra>
References: <31868237.1960501234057036064.JavaMail.root@zimbra>
Message-ID: <Pine.LNX.4.64.0902081123060.8995@dildano.hawaga.org.uk>


On Sat, 7 Feb 2009, Mihael Hategan wrote:

> Here's a patch. It allows procedure invocations in expressions.
> 
> http://www.ci.uchicago.edu/~hategan/invoke-proc.patch

In the test in the below patch, I get a conflict with the use of $ for 
more than one purpose. The use in the nested procedure call upsets 
getThreadPrefix which is expecting it to contain something else (if it 
exists). Renaming that variable (as the below patch does) makes the test 
run ok for me.

http://www.ci.uchicago.edu/~benc/invoke-proc-test-fix-1

(it's not a very good test as it doesn't check that the output is correct, 
but it suffices for the purposes of this specific bug)

-- 


From benc at hawaga.org.uk  Sun Feb  8 08:05:33 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sun, 8 Feb 2009 14:05:33 +0000 (GMT)
Subject: [Swift-devel] strange behavior evaluating function call as trace
	arg
In-Reply-To: <Pine.LNX.4.64.0902080949420.8995@dildano.hawaga.org.uk>
References: <23806863.1952211234037065554.JavaMail.root@zimbra>
	<Pine.LNX.4.64.0902080949420.8995@dildano.hawaga.org.uk>
Message-ID: <Pine.LNX.4.64.0902081405001.13917@dildano.hawaga.org.uk>


I'm working in this area, making declarations accept both <"mapping 
expressions"> and =assignments. It looks like I'll end up making a 
change like the one below as part of that.

On Sun, 8 Feb 2009, Ben Clifford wrote:

> 
> On Sat, 7 Feb 2009, Mihael Hategan wrote:
> 
> > In other words, y = f(x) should be:
> > 
> > <assign>
> >   <var>y</var>
> >   <call proc="f">
> >     <var>x</var>
> >   </call>
> > </assign>
> 
> That would fit in with the general trend to not syntactically 
> distinguishing between what used to be datasets and what used to not be 
> datasets.
> 
> Need more markup than the above, though, to accept multiple lvalues in the 
> assignment.
> 
> 


From hategan at mcs.anl.gov  Sun Feb  8 09:17:40 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sun, 08 Feb 2009 09:17:40 -0600
Subject: [Swift-devel] strange behavior evaluating function call as
	trace arg
In-Reply-To: <Pine.LNX.4.64.0902080949420.8995@dildano.hawaga.org.uk>
References: <23806863.1952211234037065554.JavaMail.root@zimbra>
	<Pine.LNX.4.64.0902080949420.8995@dildano.hawaga.org.uk>
Message-ID: <1234106260.11012.0.camel@localhost>

On Sun, 2009-02-08 at 09:54 +0000, Ben Clifford wrote:
> On Sat, 7 Feb 2009, Mihael Hategan wrote:
> 
> > In other words, y = f(x) should be:
> > 
> > <assign>
> >   <var>y</var>
> >   <call proc="f">
> >     <var>x</var>
> >   </call>
> > </assign>
> 
> That would fit in with the general trend to not syntactically 
> distinguishing between what used to be datasets and what used to not be 
> datasets.
> 
> Need more markup than the above, though, to accept multiple lvalues in the 
> assignment.
> 

<from> and <to> or maybe <tuple>.


From wilde at mcs.anl.gov  Sun Feb  8 23:36:31 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Sun, 08 Feb 2009 23:36:31 -0600
Subject: [Swift-devel] runaway workers on teraport coaster test of 
Message-ID: <498FC0DF.8060607@mcs.anl.gov>

I'm testing coasters with 
http://www.ci.uchicago.edu/~benc/tmp/coaster-head-elsewhere

This worked for me once Fri at noon but not since.

I put a .soft entry for java into the ~osg .soft file, to deal with 
issues discussed off-list.

I made a few small changes to bootstrap.sh from that patched version - 
some for logging, and one to make setting the X509_CERT_DIR variable 
conditional on whether that directory exists.

The coaster service now starts, but it went into a loop spawning 
short-lived workers, and not getting anything done.

I saw dozens of workers start, with about 10-20 or so running at a time.

These logs and all files related to the run are in 
~wilde/oops/oopstest/runaway-workers.

In coasters.log I see 50+ messages "WorkerManager No suitable worker 
found. Attempting to start a new one."

Any thoughts on this? TeraPort has been too saturated this evening to 
test any further, but it would be good to have some sense of what's 
causing this.


From benc at hawaga.org.uk  Mon Feb  9 02:40:52 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 9 Feb 2009 08:40:52 +0000 (GMT)
Subject: [Swift-devel] runaway workers on teraport coaster test of 
In-Reply-To: <498FC0DF.8060607@mcs.anl.gov>
References: <498FC0DF.8060607@mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0902090838270.13917@dildano.hawaga.org.uk>


On Sun, 8 Feb 2009, Michael Wilde wrote:

> The coaster service now starts, but it went into a loop spawning short-lived
> workers, and not getting anything done.
> 
> I saw dozens of workers start, with about 10-20 or so running at a time.

Through what mechanism were you seeing that? The coasters.log file (for 
the run around 1724) only shows workers getting as far as being submitted.

-- 


From benc at hawaga.org.uk  Mon Feb  9 03:07:08 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 9 Feb 2009 09:07:08 +0000 (GMT)
Subject: [Swift-devel] r2509 breaks GRAM2 job submission by adding an
	unknown attribute
Message-ID: <Pine.LNX.4.64.0902090905400.14188@dildano.hawaga.org.uk>


r2509 adds a tr attribute to all jobs, which causes gram2 job submissions 
to fail like this:

stdout.txt: 

----

Caused by:
        Cannot submit job
Caused by:
        Parameter not supported

(on tg-uc and tp)

r2515 reverts r2509.

-- 


From benc at hawaga.org.uk  Mon Feb  9 05:02:45 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 9 Feb 2009 11:02:45 +0000 (GMT)
Subject: [Swift-devel] runaway workers on teraport coaster test of 
In-Reply-To: <Pine.LNX.4.64.0902090838270.13917@dildano.hawaga.org.uk>
References: <498FC0DF.8060607@mcs.anl.gov>
	<Pine.LNX.4.64.0902090838270.13917@dildano.hawaga.org.uk>
Message-ID: <Pine.LNX.4.64.0902091101260.13917@dildano.hawaga.org.uk>


With more hacking round to get jobs to run in teraport's test queue rather 
than in the default extended queue, I see a similar problem - I see lots 
of workers being launched, getting as far as exchanging a heartbeat with 
the head job, and then not being issued with jobs, with new workers being 
launched every few seconds. On the swift side, no jobs ever go beyond 
Submitted state.

-- 


From benc at hawaga.org.uk  Mon Feb  9 05:12:22 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 9 Feb 2009 11:12:22 +0000 (GMT)
Subject: [Swift-devel] bash profile emitting text causes md5sum to be not
	located
Message-ID: <Pine.LNX.4.64.0902091105310.14188@dildano.hawaga.org.uk>


On teraport, my .bash_profile emits some information to stdout at login.

That causes bootstrap.sh to get confused (it takes the .bash_profile 
output to be the name of the md5sum executable, and then is unable to 
execute it).

This change stops that happening for me:

-MD5SUM=`find 'which gmd5sum'`
+MD5SUM=`which gmd5sum`
 if [ "X$MD5SUM" == "X" ]; then
-       MD5SUM=`find 'which md5sum'`
+       MD5SUM=`which md5sum`

This is because the find command (which is a shell procedure in this case, 
not unix find) first invokes the command, and if it gives no output, 
invokes it again with a bash login shell, which then does give output in 
the gmd5sum case when it should not.
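A sketch of the simpler lookup from the patch above, with an extra guard that the result must name an executable file, so stray text on stdout (from a login profile, or from a `which` that prints a message instead of a path) is never mistaken for the tool's location. It uses the POSIX shell builtin `command -v` in place of `which`, since `which` output and exit status vary between systems. The `find_tool` name is hypothetical and is not the `find` procedure in bootstrap.sh.

```shell
#!/bin/sh
# Hypothetical helper: print the path of the first named tool that exists
# and is executable; return non-zero if none is found.
find_tool() {
    for name in "$@"; do
        # Plain lookup in the current environment; no login shell is started,
        # so profile output cannot leak into the result.
        path=$(command -v "$name" 2>/dev/null)
        # Accept the result only if it names an executable regular file.
        if [ -n "$path" ] && [ -f "$path" ] && [ -x "$path" ]; then
            echo "$path"
            return 0
        fi
    done
    return 1
}

# Prefer gmd5sum (GNU md5sum under another name), then fall back to md5sum.
MD5SUM=$(find_tool gmd5sum md5sum)
```

If the lookup fails, MD5SUM is left empty, so the script's existing `"X$MD5SUM" == "X"` style check can still report the problem.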

I think maybe that this should not happen in the bowels of the bootstrap 
script - either the environment is correctly initialised for the whole 
bootstrap script, or it is not. (this ties into the question of whether 
bash should be invoked with -l or not for the whole bootstrap script and 
other environmental configuration)

-- 


From benc at hawaga.org.uk  Mon Feb  9 08:06:55 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 9 Feb 2009 14:06:55 +0000 (GMT)
Subject: [Swift-devel] double negatives
Message-ID: <Pine.LNX.4.64.0902091405280.14188@dildano.hawaga.org.uk>


For UI purposes, I'd like to flip the truthiness of ticker.disable, so 
that it's ticker.enable with a default value of true. Double negatives are 
harder to understand.

-- 


From benc at hawaga.org.uk  Mon Feb  9 08:59:14 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 9 Feb 2009 14:59:14 +0000 (GMT)
Subject: [Swift-devel] strange behavior evaluating function call as trace
	arg
In-Reply-To: <Pine.LNX.4.64.0902081123060.8995@dildano.hawaga.org.uk>
References: <31868237.1960501234057036064.JavaMail.root@zimbra>
	<Pine.LNX.4.64.0902081123060.8995@dildano.hawaga.org.uk>
Message-ID: <Pine.LNX.4.64.0902091458380.13917@dildano.hawaga.org.uk>


I committed your original patch and my fix as r2520.

On Sun, 8 Feb 2009, Ben Clifford wrote:

> 
> On Sat, 7 Feb 2009, Mihael Hategan wrote:
> 
> > Here's a patch. It allows procedure invocations in expressions.
> > 
> > http://www.ci.uchicago.edu/~hategan/invoke-proc.patch
> 
> In the test in the below patch, I get a conflict with the use of $ for 
> more than one purpose. The use in the nested procedure call upsets 
> getThreadPrefix which is expecting it to contain something else (if it 
> exists). Renaming that variable (as the below patch does) makes the test 
> run ok for me.
> 
> http://www.ci.uchicago.edu/~benc/invoke-proc-test-fix-1
> 
> (its not a very good test as it doesn't check the output is correct, but 
> it suffices for the purposes of this precise bug)
> 
> 


From hategan at mcs.anl.gov  Mon Feb  9 10:02:00 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 09 Feb 2009 10:02:00 -0600
Subject: [Swift-devel] r2509 breaks GRAM2 job submission by adding an
	unknown attribute
In-Reply-To: <Pine.LNX.4.64.0902090905400.14188@dildano.hawaga.org.uk>
References: <Pine.LNX.4.64.0902090905400.14188@dildano.hawaga.org.uk>
Message-ID: <1234195320.23799.0.camel@localhost>

On Mon, 2009-02-09 at 09:07 +0000, Ben Clifford wrote:
> r2509 adds a tr attribute to all jobs, which causes gram2 job submissions 
> to fail like this:
> 
> stdout.txt: 
> 
> ----
> 
> Caused by:
>         Cannot submit job
> Caused by:
>         Parameter not supported

That would be me. I have to think about it. In the mean time, please use
an older version.


From benc at hawaga.org.uk  Mon Feb  9 10:02:55 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 9 Feb 2009 16:02:55 +0000 (GMT)
Subject: [Swift-devel] r2509 breaks GRAM2 job submission by adding an
	unknown attribute
In-Reply-To: <1234195320.23799.0.camel@localhost>
References: <Pine.LNX.4.64.0902090905400.14188@dildano.hawaga.org.uk>
	<1234195320.23799.0.camel@localhost>
Message-ID: <Pine.LNX.4.64.0902091602410.14188@dildano.hawaga.org.uk>


> That would be me. I have to think about it.

What do you want it for, out of interest?

-- 


From hategan at mcs.anl.gov  Mon Feb  9 10:04:25 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 09 Feb 2009 10:04:25 -0600
Subject: [Swift-devel] bash profile emitting text causes md5sum to be
	not located
In-Reply-To: <Pine.LNX.4.64.0902091105310.14188@dildano.hawaga.org.uk>
References: <Pine.LNX.4.64.0902091105310.14188@dildano.hawaga.org.uk>
Message-ID: <1234195465.23799.3.camel@localhost>

On Mon, 2009-02-09 at 11:12 +0000, Ben Clifford wrote:
> On teraport, my .bash_profile emits some information to stdout at login.
> 
> That causes bootstrap.sh to be unable to get confused (it takes the 
> bash_profile output to be the name of the md5sum executable, and then is 
> unable to execute).
> 
> This change stops that happening for me:
> 
> -MD5SUM=`find 'which gmd5sum'`
> +MD5SUM=`which gmd5sum`
>  if [ "X$MD5SUM" == "X" ]; then
> -       MD5SUM=`find 'which md5sum'`
> +       MD5SUM=`which md5sum`
> 
> This is because the find command (which is a shell procedure in this case, 
> not unix find) first invokes the command, and if it gives no output, 
> invokes it again with a bash login shell, which then does give output in 
> the gmd5sum case when it should not.
> 
> I think maybe that this should not happen in the bowels of the bootstrap 
> script - either the environment is correctly initialised for the whole 
> bootstrap script, or it is not.

Well, obviously you can't have it all. This is why I put the change in
in the first place.

So hang on. I'm working on it.


From benc at hawaga.org.uk  Mon Feb  9 10:06:01 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 9 Feb 2009 16:06:01 +0000 (GMT)
Subject: [Swift-devel] which foo
Message-ID: <Pine.LNX.4.64.0902091605020.14188@dildano.hawaga.org.uk>


Here's a particularly annoyingly behaved 'which', on my os x 10.4 box:

$ which foo 2>/dev/null ; echo $?
no foo in /Users/benc/work/cog/modules/swift/dist/swift-svn/bin 
/Users/benc/bin /opt/local/bin /usr/X11R6/bin 
/Users/benc/work/globus/4.2.0-rc1/bin 
/Users/benc/work/globus/4.2.0-rc1/sbin /bin /sbin /usr/bin /usr/sbin
0

I came across this when trying to find out why bootstrap wasn't switching 
over to using curl when it can't find wget, testing coasters locally.

-- 


From hategan at mcs.anl.gov  Mon Feb  9 10:21:37 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 09 Feb 2009 10:21:37 -0600
Subject: [Swift-devel] r2509 breaks GRAM2 job submission by adding an
	unknown attribute
In-Reply-To: <Pine.LNX.4.64.0902091602410.14188@dildano.hawaga.org.uk>
References: <Pine.LNX.4.64.0902090905400.14188@dildano.hawaga.org.uk>
	<1234195320.23799.0.camel@localhost>
	<Pine.LNX.4.64.0902091602410.14188@dildano.hawaga.org.uk>
Message-ID: <1234196497.24358.0.camel@localhost>

On Mon, 2009-02-09 at 16:02 +0000, Ben Clifford wrote:
> 
> > That would be me. I have to think about it.
> 
> What do you want it for, out of interest?
> 

Display a message if the walltime is missing. But I suppose that can be
done earlier.


From benc at hawaga.org.uk  Mon Feb  9 10:25:07 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 9 Feb 2009 16:25:07 +0000 (GMT)
Subject: [Swift-devel] Re: which foo
In-Reply-To: <Pine.LNX.4.64.0902091605020.14188@dildano.hawaga.org.uk>
References: <Pine.LNX.4.64.0902091605020.14188@dildano.hawaga.org.uk>
Message-ID: <Pine.LNX.4.64.0902091621030.14188@dildano.hawaga.org.uk>

... though that's easy to get around with a test for executableness of the 
output, which I have done in my local checkout.

The curl support is broken elsewhere, in its use of -O instead of -o, 
which I am also fixing.
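A sketch of a download step that prefers wget and falls back to curl. The relevant detail is that curl needs lowercase `-o FILE` to write to a chosen file name; uppercase `-O` saves under the remote file's own name instead. The `fetch` name and its argument order are assumptions for illustration; bootstrap.sh's actual code may differ.

```shell
#!/bin/sh
# Hypothetical helper: download URL ($1) to local file DEST ($2).
fetch() {
    url=$1
    dest=$2
    if command -v wget >/dev/null 2>&1; then
        # wget: -O names the output file
        wget -q -O "$dest" "$url"
    elif command -v curl >/dev/null 2>&1; then
        # curl: lowercase -o names the output file; uppercase -O would
        # save under the remote file name and ignore $dest.
        # -f makes curl fail on HTTP errors instead of saving an error page.
        curl -s -f -o "$dest" "$url"
    else
        echo "fetch: neither wget nor curl found" >&2
        return 1
    fi
}
```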

-- 


From hategan at mcs.anl.gov  Mon Feb  9 10:29:17 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 09 Feb 2009 10:29:17 -0600
Subject: [Swift-devel] Re: which foo
In-Reply-To: <Pine.LNX.4.64.0902091621030.14188@dildano.hawaga.org.uk>
References: <Pine.LNX.4.64.0902091605020.14188@dildano.hawaga.org.uk>
	<Pine.LNX.4.64.0902091621030.14188@dildano.hawaga.org.uk>
Message-ID: <1234196957.24501.1.camel@localhost>

On Mon, 2009-02-09 at 16:25 +0000, Ben Clifford wrote:
> ... though thats easy to get around with a test for executableness of the 
> output, which I have done in my local checkout.

Right. That's what I was going to put in.

I'm still appalled by $? being 0 though.

> 
> The curl support is broken elsewhere, in its use of -O instead of -o, 
> which I am also fixing.
> 


From wilde at mcs.anl.gov  Mon Feb  9 10:52:03 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 09 Feb 2009 10:52:03 -0600
Subject: [Swift-devel] Re: which foo
In-Reply-To: <Pine.LNX.4.64.0902091621030.14188@dildano.hawaga.org.uk>
References: <Pine.LNX.4.64.0902091605020.14188@dildano.hawaga.org.uk>
	<Pine.LNX.4.64.0902091621030.14188@dildano.hawaga.org.uk>
Message-ID: <49905F33.3090404@mcs.anl.gov>

Also, the case where java is not found either initially or after testing 
with "bash -l" yields a dirname error, I think. I didn't have time to 
gather the output of that, but it's easy to see where it happens.

That was the case on teraport to the OSG VO using the service-on-worker 
patch.

On 2/9/09 10:25 AM, Ben Clifford wrote:
> ... though thats easy to get around with a test for executableness of the 
> output, which I have done in my local checkout.
> 
> The curl support is broken elsewhere, in its use of -O instead of -o, 
> which I am also fixing.
> 


From benc at hawaga.org.uk  Mon Feb  9 11:01:29 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 9 Feb 2009 17:01:29 +0000 (GMT)
Subject: [Swift-devel] JAVA_HOME misdetection
Message-ID: <Pine.LNX.4.64.0902091659590.14188@dildano.hawaga.org.uk>


Another thing I've seen is when java is on the path, but JAVA_HOME is not 
set, and the java that is found is not inside a JAVA_HOME, like this:

$ which java
/usr/bin/java
$ ls -l /usr/bin/java
lrwxr-xr-x   1 root  wheel  77 Feb  3 11:11 /usr/bin/java -> 
/System/Library/Frameworks/JavaVM.framework/Versions/CurrentJDK/Commands/java

Then setting JAVA_HOME incorrectly to /usr/ in this case breaks things, 
whereas leaving it entirely unset does not break things.

This is on my OS X box.
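A sketch of the conservative behaviour described here: use $JAVA_HOME/bin/java only when JAVA_HOME is already set and points at a real java; otherwise take java from PATH and leave JAVA_HOME unset rather than deriving it from the binary's directory. It also stops early when no java is found, which should avoid handing an empty path to dirname (the failure Michael reports in the "which foo" thread). The `find_java` name is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the java executable to use, or fail.
# JAVA_HOME is trusted only if already set and valid; it is never derived.
find_java() {
    if [ -n "$JAVA_HOME" ] && [ -x "$JAVA_HOME/bin/java" ]; then
        echo "$JAVA_HOME/bin/java"
        return 0
    fi
    java=$(command -v java 2>/dev/null)
    if [ -z "$java" ] || [ ! -x "$java" ]; then
        # Bail out here instead of calling dirname on an empty string.
        echo "find_java: no java found on PATH" >&2
        return 1
    fi
    # Deliberately do not set JAVA_HOME from "$(dirname "$java")/..": for a
    # symlink such as /usr/bin/java that would give /usr, not a Java home.
    echo "$java"
}
```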
 
-- 


From benc at hawaga.org.uk  Mon Feb  9 11:10:32 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 9 Feb 2009 17:10:32 +0000 (GMT)
Subject: [Swift-devel] runaway workers on teraport coaster test of 
In-Reply-To: <Pine.LNX.4.64.0902091101260.13917@dildano.hawaga.org.uk>
References: <498FC0DF.8060607@mcs.anl.gov>
	<Pine.LNX.4.64.0902090838270.13917@dildano.hawaga.org.uk>
	<Pine.LNX.4.64.0902091101260.13917@dildano.hawaga.org.uk>
Message-ID: <Pine.LNX.4.64.0902091709130.14188@dildano.hawaga.org.uk>

On Mon, 9 Feb 2009, Ben Clifford wrote:

> With more hacking round to get jobs to run in teraport's test queue rather 
> than in the default extended queue, I see a similar problem - I see lots 
> of workers being launched, getting as far as exchanging a heartbeat with 
> the head job, and then not being issued with jobs, with new workers being 
> launched every few seconds. On the swift side, no jobs ever go beyond 
> Submitted state.

However coasters locally on my laptop (modulo various environmental fun 
discussed in other messages) runs ok and does not show this problem - I 
can run tests/sites/ run-site coasters/coaster-local.xml to completion.

-- 


From hategan at mcs.anl.gov  Mon Feb  9 12:09:22 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 09 Feb 2009 12:09:22 -0600
Subject: [Swift-devel] Re: which foo
In-Reply-To: <49905F33.3090404@mcs.anl.gov>
References: <Pine.LNX.4.64.0902091605020.14188@dildano.hawaga.org.uk>
	<Pine.LNX.4.64.0902091621030.14188@dildano.hawaga.org.uk>
	<49905F33.3090404@mcs.anl.gov>
Message-ID: <1234202962.29439.1.camel@localhost>

This is getting too annoying. Too much variation. So I'm exploring
embedding the bootstrap jar into the script so that the only thing the
script needs to do is find java (no md5 check and no wget).

On Mon, 2009-02-09 at 10:52 -0600, Michael Wilde wrote:
> also, the case where java is not found either initially or after testing 
> with "bash -l" yields a dirname error, I think. I didnt have time to 
> gather the output of that, but its easy to see where it happens.
> 
> that was the case on teraport to the osg vo using the service-on-worker 
> patch.
> 
> On 2/9/09 10:25 AM, Ben Clifford wrote:
> > ... though thats easy to get around with a test for executableness of the 
> > output, which I have done in my local checkout.
> > 
> > The curl support is broken elsewhere, in its use of -O instead of -o, 
> > which I am also fixing.
> > 
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel


From hategan at mcs.anl.gov  Mon Feb  9 13:03:39 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 09 Feb 2009 13:03:39 -0600
Subject: [Swift-devel] Re: which foo
In-Reply-To: <1234202962.29439.1.camel@localhost>
References: <Pine.LNX.4.64.0902091605020.14188@dildano.hawaga.org.uk>
	<Pine.LNX.4.64.0902091621030.14188@dildano.hawaga.org.uk>
	<49905F33.3090404@mcs.anl.gov>  <1234202962.29439.1.camel@localhost>
Message-ID: <1234206219.29700.0.camel@localhost>

Not a very good idea. Back to where I was.

On Mon, 2009-02-09 at 12:09 -0600, Mihael Hategan wrote:
> This is getting too annoying. Too much variation. So I'm exploring
> embedding the bootstrap jar into the script so that the only thing the
> script needs to do is find java (no md5 check and no wget).
> 
> On Mon, 2009-02-09 at 10:52 -0600, Michael Wilde wrote:
> > also, the case where java is not found either initially or after testing 
> > with "bash -l" yields a dirname error, I think. I didnt have time to 
> > gather the output of that, but its easy to see where it happens.
> > 
> > that was the case on teraport to the osg vo using the service-on-worker 
> > patch.
> > 
> > On 2/9/09 10:25 AM, Ben Clifford wrote:
> > > ... though thats easy to get around with a test for executableness of the 
> > > output, which I have done in my local checkout.
> > > 
> > > The curl support is broken elsewhere, in its use of -O instead of -o, 
> > > which I am also fixing.
> > > 
> 


From hategan at mcs.anl.gov  Mon Feb  9 13:06:39 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 09 Feb 2009 13:06:39 -0600
Subject: [Swift-devel] JAVA_HOME misdetection
In-Reply-To: <Pine.LNX.4.64.0902091659590.14188@dildano.hawaga.org.uk>
References: <Pine.LNX.4.64.0902091659590.14188@dildano.hawaga.org.uk>
Message-ID: <1234206399.30971.0.camel@localhost>

On Mon, 2009-02-09 at 17:01 +0000, Ben Clifford wrote:
> Another thing i've seen is when java is on the path, but JAVA_HOME is not 
> set, and java is found not-in-JAVA_HOME, like this:
> 
> $ which java
> /usr/bin/java
> $ ls -l /usr/bin/java
> lrwxr-xr-x   1 root  wheel  77 Feb  3 11:11 /usr/bin/java -> 
> /System/Library/Frameworks/JavaVM.framework/Versions/CurrentJDK/Commands/java
> 
> then setting JAVA_HOME incorrectly to /usr/ in this case breaks things, 

How does it break things?

> when leaving it entirely unset does not break things.
> 
> This is on my os x box.
>  


From hategan at mcs.anl.gov  Mon Feb  9 13:46:53 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 09 Feb 2009 13:46:53 -0600
Subject: [Swift-devel] Problems with coasters and managedfork jobmanager
In-Reply-To: <498C6655.3010702@mcs.anl.gov>
References: <498BC903.7010008@mcs.anl.gov>
	<Pine.LNX.4.64.0902061351270.8995@dildano.hawaga.org.uk>
	<498C550B.4040303@mcs.anl.gov> <1233936759.5005.2.camel@localhost>
	<Pine.LNX.4.64.0902061616080.14259@dildano.hawaga.org.uk>
	<1233937447.5206.0.camel@localhost>  <498C6655.3010702@mcs.anl.gov>
Message-ID: <1234208813.32649.0.camel@localhost>

On Fri, 2009-02-06 at 10:33 -0600, Michael Wilde wrote:
> I dont know, but I am testing a version where I removed the newlines 
> from bootstrap.pl (and adjusted a few bits manually)

May we see that?

>  and I *think* its 
> moving on to the next stage and trying to start the workers.
> 
> Ben, it seems that *some* whitespace is passed on OK, in that I can run 
> a job that does echo "hello world" and that blank after hello is 
> preserved, and the job runs. I assume the whitespace problem is more 
> subtle than that?
> 
> On 2/6/09 10:24 AM, Mihael Hategan wrote:
> > On Fri, 2009-02-06 at 16:17 +0000, Ben Clifford wrote:
> >> On Fri, 6 Feb 2009, Mihael Hategan wrote:
> >>
> >>> I guess we'll have to stage in the bootstrap script using the stage-in
> >>> directive if we are to support managed fork, since I don't see OSG
> >>> fixing the issue.
> >> They are fixing the whitespace in parameters - see the gram-user thread I 
> >> sent in a different message.
> > 
> > Does this include the new lines?
> > 


From wilde at mcs.anl.gov  Mon Feb  9 14:23:03 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 09 Feb 2009 14:23:03 -0600
Subject: [Swift-devel] Problems with coasters and managedfork jobmanager
In-Reply-To: <1234208813.32649.0.camel@localhost>
References: <498BC903.7010008@mcs.anl.gov>	
	<Pine.LNX.4.64.0902061351270.8995@dildano.hawaga.org.uk>	
	<498C550B.4040303@mcs.anl.gov> <1233936759.5005.2.camel@localhost>	
	<Pine.LNX.4.64.0902061616080.14259@dildano.hawaga.org.uk>	
	<1233937447.5206.0.camel@localhost> <498C6655.3010702@mcs.anl.gov>
	<1234208813.32649.0.camel@localhost>
Message-ID: <499090A7.2020100@mcs.anl.gov>

it's www.ci.uchicago.edu/~wilde/bootstrap.nonl.sh, plus I removed the 
code in ServiceManager that inserted the extra newlines when reading it 
into a string buffer. I checked as far as verifying in the gram log that 
it was seen by gram as a single line in the rsl.  I never got a 
successful run from it, though - it ran into other problems later.

On 2/9/09 1:46 PM, Mihael Hategan wrote:
> On Fri, 2009-02-06 at 10:33 -0600, Michael Wilde wrote:
>> I don't know, but I am testing a version where I removed the newlines 
>> from bootstrap.pl (and adjusted a few bits manually)
> 
> May we see that?
> 
>>  and I *think* it's 
>> moving on to the next stage and trying to start the workers.
>>
>> Ben, it seems that *some* whitespace is passed on OK, in that I can run 
>> a job that does echo "hello world" and that blank after hello is 
>> preserved, and the job runs. I assume the whitespace problem is more 
>> subtle than that?
>>
>> On 2/6/09 10:24 AM, Mihael Hategan wrote:
>>> On Fri, 2009-02-06 at 16:17 +0000, Ben Clifford wrote:
>>>> On Fri, 6 Feb 2009, Mihael Hategan wrote:
>>>>
>>>>> I guess we'll have to stage in the bootstrap script using the stage-in
>>>>> directive if we are to support managed fork, since I don't see OSG
>>>>> fixing the issue.
>>>> They are fixing the whitespace in parameters - see the gram-user thread I 
>>>> sent in a different message.
>>> Does this include the new lines?
>>>
> 


From hategan at mcs.anl.gov  Mon Feb  9 14:34:04 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 09 Feb 2009 14:34:04 -0600
Subject: [Swift-devel] Problems with coasters and managedfork jobmanager
In-Reply-To: <499090A7.2020100@mcs.anl.gov>
References: <498BC903.7010008@mcs.anl.gov>
	<Pine.LNX.4.64.0902061351270.8995@dildano.hawaga.org.uk>
	<498C550B.4040303@mcs.anl.gov> <1233936759.5005.2.camel@localhost>
	<Pine.LNX.4.64.0902061616080.14259@dildano.hawaga.org.uk>
	<1233937447.5206.0.camel@localhost>  <498C6655.3010702@mcs.anl.gov>
	<1234208813.32649.0.camel@localhost>  <499090A7.2020100@mcs.anl.gov>
Message-ID: <1234211644.6450.0.camel@localhost>

I asked because I couldn't figure out how to get it to work.

But it seems like yours has problems, too:
mike at blabla tmp$ sh bootstrap.nonl.sh
bootstrap.nonl.sh: line 1: syntax error near unexpected token `;'
...

On Mon, 2009-02-09 at 14:23 -0600, Michael Wilde wrote:
> it's www.ci.uchicago.edu/~wilde/bootstrap.nonl.sh, plus I removed the 
> code in ServiceManager that inserted the extra newlines when reading it 
> into a string buffer. I checked as far as verifying in the gram log that 
> it was seen by gram as a single line in the rsl.  I never got a 
> successful run from it, though - it ran into other problems later.
> 
> On 2/9/09 1:46 PM, Mihael Hategan wrote:
> > On Fri, 2009-02-06 at 10:33 -0600, Michael Wilde wrote:
> >> I don't know, but I am testing a version where I removed the newlines 
> >> from bootstrap.pl (and adjusted a few bits manually)
> > 
> > May we see that?
> > 
> >>  and I *think* it's 
> >> moving on to the next stage and trying to start the workers.
> >>
> >> Ben, it seems that *some* whitespace is passed on OK, in that I can run 
> >> a job that does echo "hello world" and that blank after hello is 
> >> preserved, and the job runs. I assume the whitespace problem is more 
> >> subtle than that?
> >>
> >> On 2/6/09 10:24 AM, Mihael Hategan wrote:
> >>> On Fri, 2009-02-06 at 16:17 +0000, Ben Clifford wrote:
> >>>> On Fri, 6 Feb 2009, Mihael Hategan wrote:
> >>>>
> >>>>> I guess we'll have to stage in the bootstrap script using the stage-in
> >>>>> directive if we are to support managed fork, since I don't see OSG
> >>>>> fixing the issue.
> >>>> They are fixing the whitespace in parameters - see the gram-user thread I 
> >>>> sent in a different message.
> >>> Does this include the new lines?
> >>>
> > 


From wilde at mcs.anl.gov  Mon Feb  9 16:06:19 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 09 Feb 2009 16:06:19 -0600
Subject: [Swift-devel] Problems with coasters and managedfork jobmanager
In-Reply-To: <1234211644.6450.0.camel@localhost>
References: <498BC903.7010008@mcs.anl.gov>	
	<Pine.LNX.4.64.0902061351270.8995@dildano.hawaga.org.uk>	
	<498C550B.4040303@mcs.anl.gov> <1233936759.5005.2.camel@localhost>	
	<Pine.LNX.4.64.0902061616080.14259@dildano.hawaga.org.uk>	
	<1233937447.5206.0.camel@localhost> <498C6655.3010702@mcs.anl.gov>	
	<1234208813.32649.0.camel@localhost> <499090A7.2020100@mcs.anl.gov>
	<1234211644.6450.0.camel@localhost>
Message-ID: <4990A8DB.1010606@mcs.anl.gov>

Sorry, I think I put the wrong version there; I dealt with that specific 
problem and hit a subtler one, deeper in the bootstrap process.

On 2/9/09 2:34 PM, Mihael Hategan wrote:
> I asked because I couldn't figure out how to get it to work.
> 
> But it seems like yours has problems, too:
> mike at blabla tmp$ sh bootstrap.nonl.sh
> bootstrap.nonl.sh: line 1: syntax error near unexpected token `;'
> ...
> 
> On Mon, 2009-02-09 at 14:23 -0600, Michael Wilde wrote:
>> it's www.ci.uchicago.edu/~wilde/bootstrap.nonl.sh, plus I removed the 
>> code in ServiceManager that inserted the extra newlines when reading it 
>> into a string buffer. I checked as far as verifying in the gram log that 
>> it was seen by gram as a single line in the rsl.  I never got a 
>> successful run from it, though - it ran into other problems later.
>>
>> On 2/9/09 1:46 PM, Mihael Hategan wrote:
>>> On Fri, 2009-02-06 at 10:33 -0600, Michael Wilde wrote:
>>>> I don't know, but I am testing a version where I removed the newlines 
>>>> from bootstrap.pl (and adjusted a few bits manually)
>>> May we see that?
>>>
>>>>  and I *think* it's 
>>>> moving on to the next stage and trying to start the workers.
>>>>
>>>> Ben, it seems that *some* whitespace is passed on OK, in that I can run 
>>>> a job that does echo "hello world" and that blank after hello is 
>>>> preserved, and the job runs. I assume the whitespace problem is more 
>>>> subtle than that?
>>>>
>>>> On 2/6/09 10:24 AM, Mihael Hategan wrote:
>>>>> On Fri, 2009-02-06 at 16:17 +0000, Ben Clifford wrote:
>>>>>> On Fri, 6 Feb 2009, Mihael Hategan wrote:
>>>>>>
>>>>>>> I guess we'll have to stage in the bootstrap script using the stage-in
>>>>>>> directive if we are to support managed fork, since I don't see OSG
>>>>>>> fixing the issue.
>>>>>> They are fixing the whitespace in parameters - see the gram-user thread I 
>>>>>> sent in a different message.
>>>>> Does this include the new lines?
>>>>>
> 


From benc at hawaga.org.uk  Mon Feb  9 15:33:57 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 9 Feb 2009 21:33:57 +0000 (GMT)
Subject: [Swift-devel] JAVA_HOME misdetection
In-Reply-To: <1234206399.30971.0.camel@localhost>
References: <Pine.LNX.4.64.0902091659590.14188@dildano.hawaga.org.uk>
	<1234206399.30971.0.camel@localhost>
Message-ID: <Pine.LNX.4.64.0902092131380.23512@dildano.hawaga.org.uk>


On Mon, 9 Feb 2009, Mihael Hategan wrote:

> How does it break things?

screws up timezone checking and then gives an ExceptionPreparationError or 
something like that (an Error that I'd never heard of before).

-- 


From hategan at mcs.anl.gov  Mon Feb  9 16:23:16 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 09 Feb 2009 16:23:16 -0600
Subject: [Swift-devel] JAVA_HOME misdetection
In-Reply-To: <Pine.LNX.4.64.0902092131380.23512@dildano.hawaga.org.uk>
References: <Pine.LNX.4.64.0902091659590.14188@dildano.hawaga.org.uk>
	<1234206399.30971.0.camel@localhost>
	<Pine.LNX.4.64.0902092131380.23512@dildano.hawaga.org.uk>
Message-ID: <1234218196.8851.0.camel@localhost>

On Mon, 2009-02-09 at 21:33 +0000, Ben Clifford wrote:
> On Mon, 9 Feb 2009, Mihael Hategan wrote:
> 
> > How does it break things?
> 
> screws up timezone checking and then gives an ExceptionPreparationError or 
> something like that (an Error that I'd never heard of before).
> 

Can you dig a bit more?


From wilde at mcs.anl.gov  Mon Feb  9 17:05:08 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 09 Feb 2009 17:05:08 -0600
Subject: [Swift-devel] Problems with coasters and managedfork jobmanager
In-Reply-To: <4990A8DB.1010606@mcs.anl.gov>
References: <498BC903.7010008@mcs.anl.gov>		<Pine.LNX.4.64.0902061351270.8995@dildano.hawaga.org.uk>		<498C550B.4040303@mcs.anl.gov>
	<1233936759.5005.2.camel@localhost>		<Pine.LNX.4.64.0902061616080.14259@dildano.hawaga.org.uk>		<1233937447.5206.0.camel@localhost>
	<498C6655.3010702@mcs.anl.gov>		<1234208813.32649.0.camel@localhost>
	<499090A7.2020100@mcs.anl.gov>	<1234211644.6450.0.camel@localhost>
	<4990A8DB.1010606@mcs.anl.gov>
Message-ID: <4990B6A4.3070806@mcs.anl.gov>

No, correction, I misspoke.  When I try this direct to the shell (as 
opposed to via swift and coaster bootstrap) I get the same error as you 
show. I can't get sh to accept function defs on 1 line.

So I must have misinterpreted my result from last Friday.

- Mike
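A hedged note on the one-line failure, since the script itself is not shown: sh does accept function definitions on one line (for example `f() { echo hi; }; f`), but turning every newline into `;` produces sequences such as `{;` or `then;`, which the shell rejects as a syntax error at the `;`. The Python sketch below joins lines while inserting `;` only where the grammar allows one. `join_shell_lines` and the example script are hypothetical, and the helper assumes a simple script: no heredocs, backslash continuations, trailing comments or `case` statements.

```python
import subprocess

# Last tokens after which sh still expects the rest of the same command,
# so no ';' may follow them ("then;" or "{;" is a syntax error).
NO_SEPARATOR_AFTER = {"then", "do", "else", "{", "(", "|", "&&", "||"}


def join_shell_lines(script: str) -> str:
    """Flatten a simple multi-line sh script onto a single line.

    Hypothetical helper. Blank lines and whole-line comments are dropped,
    because once joined a '#' would comment out everything after it.
    Heredocs, backslash continuations, trailing comments and 'case'
    are not handled.
    """
    parts = []
    for raw in script.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.split()[-1] in NO_SEPARATOR_AFTER or line.endswith(";"):
            parts.append(line + " ")   # grammar continues, or already terminated
        else:
            parts.append(line + "; ")  # end of a command: add a separator
    return "".join(parts).strip()


EXAMPLE_SCRIPT = """
# define a function, then call it
greet() {
    echo "hello $1"
}
if true; then
    greet world
fi
"""

one_line = join_shell_lines(EXAMPLE_SCRIPT)
result = subprocess.run(["sh", "-c", one_line], capture_output=True, text=True)
print(one_line)               # greet() { echo "hello $1"; }; if true; then greet world; fi;
print(result.stdout, end="")  # hello world
```

Joining the same lines naively with `; ` instead does not parse, which is consistent with the kind of error reported in this thread.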


On 2/9/09 4:06 PM, Michael Wilde wrote:
> Sorry, I think I put the wrong version there; I dealt with that specific 
> problem and hit a subtler one, deeper in the bootstrap process.
> 
> On 2/9/09 2:34 PM, Mihael Hategan wrote:
>> I asked because I couldn't figure out how to get it to work.
>>
>> But it seems like yours has problems, too:
>> mike at blabla tmp$ sh bootstrap.nonl.sh
>> bootstrap.nonl.sh: line 1: syntax error near unexpected token `;'
>> ...
>>
>> On Mon, 2009-02-09 at 14:23 -0600, Michael Wilde wrote:
>>> it's www.ci.uchicago.edu/~wilde/bootstrap.nonl.sh, plus I removed the 
>>> code in ServiceManager that inserted the extra newlines when reading 
>>> it into a string buffer. I checked as far as verifying in the gram 
>>> log that it was seen by gram as a single line in the rsl.  I never 
>>> got a successful run from it, though - it ran into other problems later.
>>>
>>> On 2/9/09 1:46 PM, Mihael Hategan wrote:
>>>> On Fri, 2009-02-06 at 10:33 -0600, Michael Wilde wrote:
>>>>> I don't know, but I am testing a version where I removed the 
>>>>> newlines from bootstrap.pl (and adjusted a few bits manually)
>>>> May we see that?
>>>>
>>>>>  and I *think* it's moving on to the next stage and trying to start 
>>>>> the workers.
>>>>>
>>>>> Ben, it seems that *some* whitespace is passed on OK, in that I can 
>>>>> run a job that does echo "hello world" and that blank after hello 
>>>>> is preserved, and the job runs. I assume the whitespace problem is 
>>>>> more subtle than that?
>>>>>
>>>>> On 2/6/09 10:24 AM, Mihael Hategan wrote:
>>>>>> On Fri, 2009-02-06 at 16:17 +0000, Ben Clifford wrote:
>>>>>>> On Fri, 6 Feb 2009, Mihael Hategan wrote:
>>>>>>>
>>>>>>>> I guess we'll have to stage in the bootstrap script using the 
>>>>>>>> stage-in
>>>>>>>> directive if we are to support managed fork, since I don't see OSG
>>>>>>>> fixing the issue.
>>>>>>> They are fixing the whitespace in parameters - see the gram-user 
>>>>>>> thread I sent in a different message.
>>>>>> Does this include the new lines?
>>>>>>
>>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel


From benc at hawaga.org.uk  Mon Feb  9 16:32:55 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 9 Feb 2009 22:32:55 +0000 (GMT)
Subject: [Swift-devel] JAVA_HOME misdetection
In-Reply-To: <1234218196.8851.0.camel@localhost>
References: <Pine.LNX.4.64.0902091659590.14188@dildano.hawaga.org.uk> 
	<1234206399.30971.0.camel@localhost>
	<Pine.LNX.4.64.0902092131380.23512@dildano.hawaga.org.uk>
	<1234218196.8851.0.camel@localhost>
Message-ID: <Pine.LNX.4.64.0902092230200.23512@dildano.hawaga.org.uk>


On Mon, 9 Feb 2009, Mihael Hategan wrote:

> > > How does it break things?
> > 
> > screws up timezone checking and then gives an ExceptionPreparationError or 
> > something like that (an Error that I'd never heard of before).
> 
> Can you dig a bit more?

I can give you a whole paste of the error next time I go that way. It 
strikes me that the answer, however is, "don't set JAVA_HOME to some 
directory that isn't."

-- 


From hategan at mcs.anl.gov  Mon Feb  9 18:16:12 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 09 Feb 2009 18:16:12 -0600
Subject: [Swift-devel] Problems with coasters and managedfork jobmanager
In-Reply-To: <4990B6A4.3070806@mcs.anl.gov>
References: <498BC903.7010008@mcs.anl.gov>
	<Pine.LNX.4.64.0902061351270.8995@dildano.hawaga.org.uk>
	<498C550B.4040303@mcs.anl.gov> <1233936759.5005.2.camel@localhost>
	<Pine.LNX.4.64.0902061616080.14259@dildano.hawaga.org.uk>
	<1233937447.5206.0.camel@localhost> <498C6655.3010702@mcs.anl.gov>
	<1234208813.32649.0.camel@localhost> <499090A7.2020100@mcs.anl.gov>
	<1234211644.6450.0.camel@localhost> <4990A8DB.1010606@mcs.anl.gov>
	<4990B6A4.3070806@mcs.anl.gov>
Message-ID: <1234224972.11972.2.camel@localhost>

I did get a one-liner perl which contains an encoded bootstrap jar file
(only printable characters). Perl cares less about whitespace, which
makes me think it's a better candidate if we want to go that way.

Thoughts?
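The thread does not say how the jar is encoded. Assuming plain base64 (an assumption, not necessarily what the patch uses), this Python sketch shows why such a payload can travel as one line: the encoding uses only letters, digits, `+`, `/` and `=`, never spaces or newlines. The 1024-byte payload is a stand-in for the real jar.

```python
import base64
import string

ALLOWED = set(string.ascii_letters + string.digits + "+/=")


def encode_payload(data: bytes) -> str:
    """Encode arbitrary bytes as one line of printable ASCII.

    b64encode never inserts line breaks, so the result contains no
    whitespace at all: only letters, digits, '+', '/' and '=' padding.
    """
    return base64.b64encode(data).decode("ascii")


def decode_payload(text: str) -> bytes:
    return base64.b64decode(text)


payload = bytes(range(256)) * 4       # 1024-byte stand-in for a jar file
line = encode_payload(payload)

assert set(line) <= ALLOWED           # printable, no spaces or newlines
assert decode_payload(line) == payload
print(len(payload), len(line))        # 1024 1368
```

The cost is about 4/3 in size, so roughly 12 KB of binary already becomes a 16 KB line, which matters if a scheduler limits argument length.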

On Mon, 2009-02-09 at 17:05 -0600, Michael Wilde wrote:
> No, correction, I misspoke.  When I try this direct to the shell (as 
> opposed to via swift and coaster bootstrap) I get the same error as you 
> show. I can't get sh to accept function defs on 1 line.
> 
> So I must have misinterpreted my result from last Friday.
> 
> - Mike
> 
> 
> On 2/9/09 4:06 PM, Michael Wilde wrote:
> > Sorry, I think I put the wrong version there; I dealt with that specific 
> > problem and hit a subtler one, deeper in the bootstrap process.
> > 
> > On 2/9/09 2:34 PM, Mihael Hategan wrote:
> >> I asked because I couldn't figure out how to get it to work.
> >>
> >> But it seems like yours has problems, too:
> >> mike at blabla tmp$ sh bootstrap.nonl.sh
> >> bootstrap.nonl.sh: line 1: syntax error near unexpected token `;'
> >> ...
> >>
> >> On Mon, 2009-02-09 at 14:23 -0600, Michael Wilde wrote:
> >>> it's www.ci.uchicago.edu/~wilde/bootstrap.nonl.sh, plus I removed the 
> >>> code in ServiceManager that inserted the extra newlines when reading 
> >>> it into a string buffer. I checked as far as verifying in the gram 
> >>> log that it was seen by gram as a single line in the rsl.  I never 
> >>> got a successful run from it, though - it ran into other problems later.
> >>>
> >>> On 2/9/09 1:46 PM, Mihael Hategan wrote:
> >>>> On Fri, 2009-02-06 at 10:33 -0600, Michael Wilde wrote:
> >>>>> I don't know, but I am testing a version where I removed the 
> >>>>> newlines from bootstrap.pl (and adjusted a few bits manually)
> >>>> May we see that?
> >>>>
> >>>>>  and I *think* it's moving on to the next stage and trying to start 
> >>>>> the workers.
> >>>>>
> >>>>> Ben, it seems that *some* whitespace is passed on OK, in that I can 
> >>>>> run a job that does echo "hello world" and that blank after hello 
> >>>>> is preserved, and the job runs. I assume the whitespace problem is 
> >>>>> more subtle than that?
> >>>>>
> >>>>> On 2/6/09 10:24 AM, Mihael Hategan wrote:
> >>>>>> On Fri, 2009-02-06 at 16:17 +0000, Ben Clifford wrote:
> >>>>>>> On Fri, 6 Feb 2009, Mihael Hategan wrote:
> >>>>>>>
> >>>>>>>> I guess we'll have to stage in the bootstrap script using the 
> >>>>>>>> stage-in
> >>>>>>>> directive if we are to support managed fork, since I don't see OSG
> >>>>>>>> fixing the issue.
> >>>>>>> They are fixing the whitespace in parameters - see the gram-user 
> >>>>>>> thread I sent in a different message.
> >>>>>> Does this include the new lines?
> >>>>>>
> >>


From wilde at mcs.anl.gov  Mon Feb  9 18:23:39 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 09 Feb 2009 18:23:39 -0600
Subject: [Swift-devel] Problems with coasters and managedfork jobmanager
In-Reply-To: <1234224972.11972.2.camel@localhost>
References: <498BC903.7010008@mcs.anl.gov>	
	<Pine.LNX.4.64.0902061351270.8995@dildano.hawaga.org.uk>	
	<498C550B.4040303@mcs.anl.gov> <1233936759.5005.2.camel@localhost>	
	<Pine.LNX.4.64.0902061616080.14259@dildano.hawaga.org.uk>	
	<1233937447.5206.0.camel@localhost> <498C6655.3010702@mcs.anl.gov>	
	<1234208813.32649.0.camel@localhost>
	<499090A7.2020100@mcs.anl.gov>	 <1234211644.6450.0.camel@localhost>
	<4990A8DB.1010606@mcs.anl.gov>	 <4990B6A4.3070806@mcs.anl.gov>
	<1234224972.11972.2.camel@localhost>
Message-ID: <4990C90B.40700@mcs.anl.gov>

It sounds reasonable, but let's try it and see how well it works.

I'd like to create a test that runs a trivial swift script on a set of 
osg and tg sites with coasters.

If you create a patch or check it in I'll try it too.

- Mike


On 2/9/09 6:16 PM, Mihael Hategan wrote:
> I did get a one-liner perl which contains an encoded bootstrap jar file
> (only printable characters). Perl cares less about whitespace, which
> makes me think it's a better candidate if we want to go that way.
> 
> Thoughts?
> 
> On Mon, 2009-02-09 at 17:05 -0600, Michael Wilde wrote:
>> No, correction, I misspoke.  When I try this direct to the shell (as 
>> opposed to via swift and coaster bootstrap) I get the same error as you 
>> show. I can't get sh to accept function defs on 1 line.
>>
>> So I must have misinterpreted my result from last Friday.
>>
>> - Mike
>>
>>
>> On 2/9/09 4:06 PM, Michael Wilde wrote:
>>> Sorry, I think I put the wrong version there; I dealt with that specific 
>>> problem and hit a subtler one, deeper in the bootstrap process.
>>>
>>> On 2/9/09 2:34 PM, Mihael Hategan wrote:
>>>> I asked because I couldn't figure out how to get it to work.
>>>>
>>>> But it seems like yours has problems, too:
>>>> mike at blabla tmp$ sh bootstrap.nonl.sh
>>>> bootstrap.nonl.sh: line 1: syntax error near unexpected token `;'
>>>> ...
>>>>
>>>> On Mon, 2009-02-09 at 14:23 -0600, Michael Wilde wrote:
>>>>> it's www.ci.uchicago.edu/~wilde/bootstrap.nonl.sh, plus I removed the 
>>>>> code in ServiceManager that inserted the extra newlines when reading 
>>>>> it into a string buffer. I checked as far as verifying in the gram 
>>>>> log that it was seen by gram as a single line in the rsl.  I never 
>>>>> got a successful run from it, though - it ran into other problems later.
>>>>>
>>>>> On 2/9/09 1:46 PM, Mihael Hategan wrote:
>>>>>> On Fri, 2009-02-06 at 10:33 -0600, Michael Wilde wrote:
>>>>>>> I don't know, but I am testing a version where I removed the 
>>>>>>> newlines from bootstrap.pl (and adjusted a few bits manually)
>>>>>> May we see that?
>>>>>>
>>>>>>>  and I *think* it's moving on to the next stage and trying to start 
>>>>>>> the workers.
>>>>>>>
>>>>>>> Ben, it seems that *some* whitespace is passed on OK, in that I can 
>>>>>>> run a job that does echo "hello world" and that blank after hello 
>>>>>>> is preserved, and the job runs. I assume the whitespace problem is 
>>>>>>> more subtle than that?
>>>>>>>
>>>>>>> On 2/6/09 10:24 AM, Mihael Hategan wrote:
>>>>>>>> On Fri, 2009-02-06 at 16:17 +0000, Ben Clifford wrote:
>>>>>>>>> On Fri, 6 Feb 2009, Mihael Hategan wrote:
>>>>>>>>>
>>>>>>>>>> I guess we'll have to stage in the bootstrap script using the 
>>>>>>>>>> stage-in
>>>>>>>>>> directive if we are to support managed fork, since I don't see OSG
>>>>>>>>>> fixing the issue.
>>>>>>>>> They are fixing the whitespace in parameters - see the gram-user 
>>>>>>>>> thread I sent in a different message.
>>>>>>>> Does this include the new lines?
>>>>>>>>
> 


From hategan at mcs.anl.gov  Mon Feb  9 18:29:05 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 09 Feb 2009 18:29:05 -0600
Subject: [Swift-devel] Problems with coasters and managedfork jobmanager
In-Reply-To: <4990C90B.40700@mcs.anl.gov>
References: <498BC903.7010008@mcs.anl.gov>
	<Pine.LNX.4.64.0902061351270.8995@dildano.hawaga.org.uk>
	<498C550B.4040303@mcs.anl.gov> <1233936759.5005.2.camel@localhost>
	<Pine.LNX.4.64.0902061616080.14259@dildano.hawaga.org.uk>
	<1233937447.5206.0.camel@localhost> <498C6655.3010702@mcs.anl.gov>
	<1234208813.32649.0.camel@localhost> <499090A7.2020100@mcs.anl.gov>
	<1234211644.6450.0.camel@localhost> <4990A8DB.1010606@mcs.anl.gov>
	<4990B6A4.3070806@mcs.anl.gov> <1234224972.11972.2.camel@localhost>
	<4990C90B.40700@mcs.anl.gov>
Message-ID: <1234225745.12056.11.camel@localhost>

On Mon, 2009-02-09 at 18:23 -0600, Michael Wilde wrote:
> It sounds reasonable, but let's try it and see how well it works.

Right. It might not work with condor, given that the line is 16k bytes
long.

> 
> I'd like to create a test that runs a trivial swift script on a set of 
> osg and tg sites with coasters.

I suggest looking at the existing tests (swift/tests/(sites)?) first.


From benc at hawaga.org.uk  Mon Feb  9 18:29:25 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 10 Feb 2009 00:29:25 +0000 (GMT)
Subject: [Swift-devel] Problems with coasters and managedfork jobmanager
In-Reply-To: <4990C90B.40700@mcs.anl.gov>
References: <498BC903.7010008@mcs.anl.gov>
	<Pine.LNX.4.64.0902061351270.8995@dildano.hawaga.org.uk>
	<498C550B.4040303@mcs.anl.gov> <1233936759.5005.2.camel@localhost> 
	<Pine.LNX.4.64.0902061616080.14259@dildano.hawaga.org.uk> 
	<1233937447.5206.0.camel@localhost> <498C6655.3010702@mcs.anl.gov> 
	<1234208813.32649.0.camel@localhost> <499090A7.2020100@mcs.anl.gov> 
	<1234211644.6450.0.camel@localhost> <4990A8DB.1010606@mcs.anl.gov> 
	<4990B6A4.3070806@mcs.anl.gov> <1234224972.11972.2.camel@localhost>
	<4990C90B.40700@mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0902100025530.23368@dildano.hawaga.org.uk>


On Mon, 9 Feb 2009, Michael Wilde wrote:

> I'd like to create a test that runs a trivial swift script on a set of 
> osg and tg sites with coasters.

There's a multi-site test setup in tests/sites/

cd tests/sites/
./run-all coaster/

will run some tests (the list is defined in tests/sites/run-site) against 
each site in tests/sites/coaster/ and report success or failure.

To add sites, create a sites.xml file in the tests/sites/coaster/ 
directory, and then add appropriate lines for the site name to 
tests/sites/tc.data (if the site isn't already in there).

-- 


From benc at hawaga.org.uk  Tue Feb 10 07:12:40 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 10 Feb 2009 13:12:40 +0000 (GMT)
Subject: [Swift-devel] typecheck foo[*].bar
Message-ID: <Pine.LNX.4.64.0902101305000.23512@dildano.hawaga.org.uk>


I noticed today that expressions like this don't get typechecked properly, 
so in 0.8, you can't use [*].member expressions. Bleugh.

As I want to use such expressions (or equivalent), I guess I have to fix 
that soonish.

I think the approach I am favouring language-wise is that [*] becomes a 
no-op/identity operator, and . with an array of structs on the left 
returns an array of the appropriate member fields.

Thus   a[*] == a   for all arrays a

       a[*].foo == a.foo == (in haskelly pseudocode) (map (\x -> x.foo) a)

I think from an implementation point of view, that can cause some trouble 
though. DSHandles expect to have only one parent. Writing an expression 
a[*].foo causes each element to then have two potential parents:

     i.        array of structs  -> contained structure -> foo
     ii.       array of foos -> foo

Although something like this must be happening at the moment anyway with 
the [*].foo syntax, so it might not turn out to be a big deal.
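A small Python model of the proposed semantics, not of Swift's implementation: `star` plays the role of `[*]` and is the identity, while `member` applies a field access elementwise when its left operand is an array of structs. `Rec` and its fields are illustrative.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class Rec:
    foo: int
    bar: str


def star(array: List[Any]) -> List[Any]:
    """Model of a[*] under the proposal: the identity on arrays."""
    return array


def member(value: Any, field: str) -> Any:
    """Model of value.field: on an array of structs, map the field access
    over the elements; on a single struct, plain field access."""
    if isinstance(value, list):
        return [getattr(x, field) for x in value]
    return getattr(value, field)


a = [Rec(1, "x"), Rec(2, "y"), Rec(3, "z")]
assert star(a) == a                                              # a[*] == a
assert member(star(a), "foo") == member(a, "foo") == [1, 2, 3]   # a[*].foo == a.foo
```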

-- 


From wilde at mcs.anl.gov  Tue Feb 10 11:20:19 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 10 Feb 2009 11:20:19 -0600
Subject: [Swift-devel] Problems with coasters and managedfork jobmanager
In-Reply-To: <1234224972.11972.2.camel@localhost>
References: <498BC903.7010008@mcs.anl.gov>	
	<Pine.LNX.4.64.0902061351270.8995@dildano.hawaga.org.uk>	
	<498C550B.4040303@mcs.anl.gov> <1233936759.5005.2.camel@localhost>	
	<Pine.LNX.4.64.0902061616080.14259@dildano.hawaga.org.uk>	
	<1233937447.5206.0.camel@localhost> <498C6655.3010702@mcs.anl.gov>	
	<1234208813.32649.0.camel@localhost>
	<499090A7.2020100@mcs.anl.gov>	 <1234211644.6450.0.camel@localhost>
	<4990A8DB.1010606@mcs.anl.gov>	 <4990B6A4.3070806@mcs.anl.gov>
	<1234224972.11972.2.camel@localhost>
Message-ID: <4991B753.1000801@mcs.anl.gov>

Was it not possible and/or easy to let GRAM stage in bootstrap.sh as a 
stdin file to /bin/bash, the equivalent of:

com$ globus-job-run tp-osg.uchicago.edu -stdin -s longscript.sh /bin/sh
hello
world
com$

- Mike
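Locally, the stdin approach can be sketched in Python with `subprocess`: the script reaches `sh` on standard input, so its newlines never have to survive being packed into a single argument. This only illustrates the shell side; it does not model GRAM staging, and the script is an illustrative example.

```python
import subprocess

# Multi-line script with a function definition: fragile as a one-line
# argument, but unproblematic when the shell reads it from stdin.
SCRIPT = """greet() {
    echo "hello $1"
}
greet world
"""


def run_script_via_stdin(text: str) -> str:
    """Feed a script to 'sh' on standard input and return its stdout.

    Rough local analogue of running /bin/sh with the script as stdin:
    newlines reach the shell unchanged because they never pass through argv.
    """
    result = subprocess.run(["sh"], input=text, capture_output=True,
                            text=True, check=True)
    return result.stdout


print(run_script_via_stdin(SCRIPT), end="")  # hello world
```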

On 2/9/09 6:16 PM, Mihael Hategan wrote:
> I did get a one-liner perl which contains an encoded bootstrap jar file
> (only printable characters). Perl cares less about whitespace, which
> makes me think it's a better candidate if we want to go that way.
> 
> Thoughts?
> 
> On Mon, 2009-02-09 at 17:05 -0600, Michael Wilde wrote:
>> No, correction, I misspoke.  When I try this direct to the shell (as 
>> opposed to via swift and coaster bootstrap) I get the same error as you 
>> show. I can't get sh to accept function defs on 1 line.
>>
>> So I must have misinterpreted my result from last Friday.
>>
>> - Mike
>>
>>
>> On 2/9/09 4:06 PM, Michael Wilde wrote:
>>> Sorry, I think I put the wrong version there; I dealt with that specific 
>>> problem and hit a subtler one, deeper in the bootstrap process.
>>>
>>> On 2/9/09 2:34 PM, Mihael Hategan wrote:
>>>> I asked because I couldn't figure out how to get it to work.
>>>>
>>>> But it seems like yours has problems, too:
>>>> mike at blabla tmp$ sh bootstrap.nonl.sh
>>>> bootstrap.nonl.sh: line 1: syntax error near unexpected token `;'
>>>> ...
>>>>
>>>> On Mon, 2009-02-09 at 14:23 -0600, Michael Wilde wrote:
>>>>> it's www.ci.uchicago.edu/~wilde/bootstrap.nonl.sh, plus I removed the 
>>>>> code in ServiceManager that inserted the extra newlines when reading 
>>>>> it into a string buffer. I checked as far as verifying in the gram 
>>>>> log that it was seen by gram as a single line in the rsl.  I never 
>>>>> got a successful run from it, though - it ran into other problems later.
>>>>>
>>>>> On 2/9/09 1:46 PM, Mihael Hategan wrote:
>>>>>> On Fri, 2009-02-06 at 10:33 -0600, Michael Wilde wrote:
>>>>>>> I don't know, but I am testing a version where I removed the 
>>>>>>> newlines from bootstrap.pl (and adjusted a few bits manually)
>>>>>> May we see that?
>>>>>>
>>>>>>>  and I *think* it's moving on to the next stage and trying to start 
>>>>>>> the workers.
>>>>>>>
>>>>>>> Ben, it seems that *some* whitespace is passed on OK, in that I can 
>>>>>>> run a job that does echo "hello world" and that blank after hello 
>>>>>>> is preserved, and the job runs. I assume the whitespace problem is 
>>>>>>> more subtle than that?
>>>>>>>
>>>>>>> On 2/6/09 10:24 AM, Mihael Hategan wrote:
>>>>>>>> On Fri, 2009-02-06 at 16:17 +0000, Ben Clifford wrote:
>>>>>>>>> On Fri, 6 Feb 2009, Mihael Hategan wrote:
>>>>>>>>>
>>>>>>>>>> I guess we'll have to stage in the bootstrap script using the 
>>>>>>>>>> stage-in
>>>>>>>>>> directive if we are to support managed fork, since I don't see OSG
>>>>>>>>>> fixing the issue.
>>>>>>>>> They are fixing the whitespace in parameters - see the gram-user 
>>>>>>>>> thread I sent in a different message.
>>>>>>>> Does this include the new lines?
>>>>>>>>
> 


From hategan at mcs.anl.gov  Tue Feb 10 13:22:27 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 10 Feb 2009 13:22:27 -0600 (CST)
Subject: [Swift-devel] Problems with coasters and managedfork jobmanager
In-Reply-To: <4991B753.1000801@mcs.anl.gov>
Message-ID: <23141741.2071951234293747843.JavaMail.root@zimbra>


----- Michael Wilde <wilde at mcs.anl.gov> wrote:
> Was it not possible and/or easy to let GRAM stage in bootstrap.sh as a 
> stdin file to /bin/bash,

With GT4 that would have required users to have a gridftp server on the
Swift side. Which seems to come in conflict with ease of use, so I think
it's a bad idea.

From a theoretical perspective, it also annoys me that with staging in,
there is no way to guarantee lack of contention in file naming without
globally unique identifiers.
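The usual way around the naming problem is a random globally unique identifier per staged file. A minimal Python sketch; `staged_name` and the `bootstrap`/`.sh` parts are hypothetical.

```python
import uuid


def staged_name(prefix: str, suffix: str = ".sh") -> str:
    """Hypothetical naming scheme for a staged-in file.

    uuid4() carries 122 random bits, so independent runs writing into the
    same remote directory will, in practice, never pick the same name.
    """
    return f"{prefix}-{uuid.uuid4().hex}{suffix}"


names = [staged_name("bootstrap") for _ in range(1000)]
assert len(set(names)) == len(names)
print(names[0])  # a name of the form bootstrap-<32 hex digits>.sh
```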

> the equivalent of:
> 
> com$ globus-job-run tp-osg.uchicago.edu -stdin -s longscript.sh /bin/sh
> hello
> world
> com$
> 
> - Mike
> 
> On 2/9/09 6:16 PM, Mihael Hategan wrote:
> > I did get a one-liner perl which contains an encoded bootstrap jar file
> > (only printable characters). Perl cares less about whitespace, which
> > makes me think it's a better candidate if we want to go that way.
> > 
> > Thoughts?
> > 
> > On Mon, 2009-02-09 at 17:05 -0600, Michael Wilde wrote:
> >> No, correction, I misspoke.  When I try this direct to the shell (as 
> >> opposed to via swift and coaster bootstrap) I get the same error as you 
> >> show. I can't get sh to accept function defs on 1 line.
> >>
> >> So I must have misinterpreted my result from last Friday.
> >>
> >> - Mike
> >>
> >>
> >> On 2/9/09 4:06 PM, Michael Wilde wrote:
> >>> Sorry, I think I put the wrong version there; I dealt with that specific 
> >>> problem and hit a subtler one, deeper in the bootstrap process.
> >>>
> >>> On 2/9/09 2:34 PM, Mihael Hategan wrote:
> >>>> I asked because I couldn't figure out how to get it to work.
> >>>>
> >>>> But it seems like yours has problems, too:
> >>>> mike at blabla tmp$ sh bootstrap.nonl.sh
> >>>> bootstrap.nonl.sh: line 1: syntax error near unexpected token `;'
> >>>> ...
> >>>>
> >>>> On Mon, 2009-02-09 at 14:23 -0600, Michael Wilde wrote:
> >>>>> it's www.ci.uchicago.edu/~wilde/bootstrap.nonl.sh, plus I removed the 
> >>>>> code in ServiceManager that inserted the extra newlines when reading 
> >>>>> it into a string buffer. I checked as far as verifying in the gram 
> >>>>> log that it was seen by gram as a single line in the rsl.  I never 
> >>>>> got a successful run from it, though - it ran into other problems later.
> >>>>>
> >>>>> On 2/9/09 1:46 PM, Mihael Hategan wrote:
> >>>>>> On Fri, 2009-02-06 at 10:33 -0600, Michael Wilde wrote:
> >>>>>>> I don't know, but I am testing a version where I removed the 
> >>>>>>> newlines from bootstrap.pl (and adjusted a few bits manually)
> >>>>>> May we see that?
> >>>>>>
> >>>>>>>  and I *think* it's moving on to the next stage and trying to start 
> >>>>>>> the workers.
> >>>>>>>
> >>>>>>> Ben, it seems that *some* whitespace is passed on OK, in that I can 
> >>>>>>> run a job that does echo "hello world" and that blank after hello 
> >>>>>>> is preserved, and the job runs. I assume the whitespace problem is 
> >>>>>>> more subtle than that?
> >>>>>>>
> >>>>>>> On 2/6/09 10:24 AM, Mihael Hategan wrote:
> >>>>>>>> On Fri, 2009-02-06 at 16:17 +0000, Ben Clifford wrote:
> >>>>>>>>> On Fri, 6 Feb 2009, Mihael Hategan wrote:
> >>>>>>>>>
> >>>>>>>>>> I guess we'll have to stage in the bootstrap script using the 
> >>>>>>>>>> stage-in
> >>>>>>>>>> directive if we are to support managed fork, since I don't see OSG
> >>>>>>>>>> fixing the issue.
> >>>>>>>>> They are fixing the whitespace in parameters - see the gram-user 
> >>>>>>>>> thread I sent in a different message.
> >>>>>>>> Does this include the new lines?
> >>>>>>>>
> >>> _______________________________________________
> >>> Swift-devel mailing list
> >>> Swift-devel at ci.uchicago.edu
> >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> > 


From hategan at mcs.anl.gov  Tue Feb 10 13:31:14 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 10 Feb 2009 13:31:14 -0600 (CST)
Subject: [Swift-devel] JAVA_HOME misdetection
Message-ID: <10496779.2072501234294274599.JavaMail.root@zimbra>

The bootstrapping process does not require, AFAIK, JAVA_HOME.
It only needs a java executable.

I have a local patch for the issue, but I'm trying to explore
the perl route, so I'll integrate the concept there.
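
A minimal sketch of the idea in Python (the real bootstrap is a shell/Perl script, so this is only illustrative): look for `java` on the PATH first, and consult `JAVA_HOME` only as a fallback. The helper name `find_java` and this fallback order are assumptions, not the contents of the local patch mentioned above.

```python
import os
import shutil
from typing import Mapping, Optional


def find_java(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return a path to a java executable, or None if none can be found.

    Hypothetical helper: JAVA_HOME is only a fallback, because the
    bootstrap needs a working java binary, not the variable itself.
    """
    env = os.environ if env is None else env
    # Prefer whatever 'java' resolves to on the PATH (an unset PATH is treated as empty).
    on_path = shutil.which("java", path=env.get("PATH", ""))
    if on_path:
        return on_path
    # Fall back to $JAVA_HOME/bin/java if JAVA_HOME is set and points at an executable.
    java_home = env.get("JAVA_HOME")
    if java_home:
        candidate = os.path.join(java_home, "bin", "java")
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None
```

With this ordering, a stale or wrong `JAVA_HOME` cannot cause a failure as long as some `java` is on the PATH.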


From zhaozhang at uchicago.edu  Tue Feb 10 13:46:05 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Tue, 10 Feb 2009 13:46:05 -0600
Subject: [Swift-devel] GPFS issue of SWIFT on BGP
Message-ID: <4991D97D.7010508@uchicago.edu>

Hi, All

I am working with Allan on applying CIO to SWIFT on BGP, and we are now 
blocked by an ssh-provider issue.
Here is the description: we got the ssh-provider working as the data 
provider, and I tested it with multiple psets; it works fine.
Login Node ----- submit host
IO Node -------- remote site
Compute Node -- workers

Now we start Swift on the Login Node, and the working directory is 
created on the IO Node, so all intermediate files and final
result files are copied back to the Login Node (GPFS) as soon as they are 
generated. Here we hit an old problem: all IO nodes try
to write files into the same directory, which is what we have been trying 
to avoid all along.

My solution would be to modify the ssh-provider source code and implement 
asynchronous collector logic there.

Do you have any other ideas about this issue, or an alternative 
design? Thanks so much.

best wishes
zhangzhao


From benc at hawaga.org.uk  Tue Feb 10 14:03:17 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 10 Feb 2009 20:03:17 +0000 (GMT)
Subject: [Swift-devel] GPFS issue of SWIFT on BGP
In-Reply-To: <4991D97D.7010508@uchicago.edu>
References: <4991D97D.7010508@uchicago.edu>
Message-ID: <Pine.LNX.4.64.0902101959260.23368@dildano.hawaga.org.uk>


On Tue, 10 Feb 2009, Zhao Zhang wrote:

> Now, we start swift on Login Node, and the working directory will be created
> on IO Node, so that all intermediate files and final
> result files will be copied back to Login Node(GPFS) once they are generated.
> Here we got an old problem, all IO nodes are trying
> to write files in the same directory, which we are trying to avoid all the
> way.
> My solution would be modify the ssh-provider source code, implement an
> asynchronous collector logic there.

Can you describe what is going on here more explicitly?

How do multiple IO nodes end up writing to the same GPFS directory?

It is unclear to me from what you write how that comes about - as I 
understand it:

 . submit side data files are posix-accessed only by the swift submit-side 
client

 . files on the I/O nodes (the remote sites) use pset-local storage

 . any communication between the I/O nodes and submit-side client happens 
over ssh.

Where does an I/O node access machine-wide GPFS?

-- 


From zhaozhang at uchicago.edu  Tue Feb 10 14:13:49 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Tue, 10 Feb 2009 14:13:49 -0600
Subject: [Swift-devel] GPFS issue of SWIFT on BGP
In-Reply-To: <Pine.LNX.4.64.0902101959260.23368@dildano.hawaga.org.uk>
References: <4991D97D.7010508@uchicago.edu>
	<Pine.LNX.4.64.0902101959260.23368@dildano.hawaga.org.uk>
Message-ID: <4991DFFD.2070501@uchicago.edu>

Hi,

Ben Clifford wrote:
> On Tue, 10 Feb 2009, Zhao Zhang wrote:
>
>   
>> Now, we start swift on Login Node, and the working directory will be created
>> on IO Node, so that all intermediate files and final
>> result files will be copied back to Login Node(GPFS) once they are generated.
>> Here we got an old problem, all IO nodes are trying
>> to write files in the same directory, which we are trying to avoid all the
>> way.
>> My solution would be modify the ssh-provider source code, implement an
>> asynchronous collector logic there.
>>     
>
> Can you describe what is going on here more explicitly.
>
> How do multiple IO nodes end up writing to the same GPFS directory?
>   
In a previous case, we had 512 IO nodes each creating 1 file in the same 
directory, and that took 30 minutes to finish.
Besides, sometimes only 510 files could be created.
> It is unclear to me from what you write how that comes about - as I 
> understand it:
>
>  . submit side data files are posix-accessed only by the swift submit-side 
> client
>   
yes
>  . files on the I/O nodes (the remote sites) use pset-local storage
>   
yes
>  . any communication between the I/O nodes and submit-side client happens 
> over ssh.
>   
yes
> Where does an I/O node access machine-wide GPFS?
>   
Data transfer from the I/O nodes to the submit-side client writes to GPFS 
through the ssh-provider.

zhao


From benc at hawaga.org.uk  Tue Feb 10 14:20:33 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 10 Feb 2009 20:20:33 +0000 (GMT)
Subject: [Swift-devel] GPFS issue of SWIFT on BGP
In-Reply-To: <4991DFFD.2070501@uchicago.edu>
References: <4991D97D.7010508@uchicago.edu>
	<Pine.LNX.4.64.0902101959260.23368@dildano.hawaga.org.uk>
	<4991DFFD.2070501@uchicago.edu>
Message-ID: <Pine.LNX.4.64.0902102015010.23512@dildano.hawaga.org.uk>


On Tue, 10 Feb 2009, Zhao Zhang wrote:

> > > Here we got an old problem, all IO nodes are trying

> data transfer from  I/O nodes to submit-side client is  writing to GPFS
> through ssh-provider.

By 'old problem' I assumed you meant the GPFS locking problems previously 
experienced, where GPFS locks for particular filesystem objects would 
need to be expensively moved between nodes.

However, that should not be a problem here - from the GPFS perspective, 
the submit (login) node is the only node that is interacting with GPFS, 
and the only node that needs a lock on that directory.

If you're experiencing slowness, then I would be inclined to investigate 
elsewhere. It may be that the ssh provider is not fast (ssh is not 
renowned for being a fast protocol; Mihael might have some commentary 
either way based on his experiences with the cog ssh provider); it may be 
that something else is causing a bottleneck.

Do you have any detailed timing information? (from my perspective, the 
wrapper logs for every job, and the submit side log, would be interesting 
to look at - send those to me)

-- 


From zhaozhang at uchicago.edu  Tue Feb 10 14:29:53 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Tue, 10 Feb 2009 14:29:53 -0600
Subject: [Swift-devel] GPFS issue of SWIFT on BGP
In-Reply-To: <Pine.LNX.4.64.0902102015010.23512@dildano.hawaga.org.uk>
References: <4991D97D.7010508@uchicago.edu>
	<Pine.LNX.4.64.0902101959260.23368@dildano.hawaga.org.uk>
	<4991DFFD.2070501@uchicago.edu>
	<Pine.LNX.4.64.0902102015010.23512@dildano.hawaga.org.uk>
Message-ID: <4991E3C1.2060503@uchicago.edu>

Hi

Ben Clifford wrote:
> On Tue, 10 Feb 2009, Zhao Zhang wrote:
>
>   
>>>> Here we got an old problem, all IO nodes are trying
>>>>         
>
>   
>> data transfer from  I/O nodes to submit-side client is  writing to GPFS
>> through ssh-provider.
>>     
>
> By 'old problem' I assumed you meant the GPFS locking problems previously 
> experienced, where GPFS locks for particular filesystem objects would 
> need to be expensively moved between nodes.
>
> However, that should not be a problem here - from the GPFS perspective, 
> the submit (login) node is the only node that is interacting with GPFS, 
> and the only node that needs a lock on that directory.
>   
ok, this sounds reasonable. Thanks
> If you're experiencing slowness, then I would be inclined to investigate 
> elsewhere. It may be that the ssh provider is not fast (ssh is not 
> renowned for being a fast protocol; Mihael might have some commentary 
> either way based on his experiences with the cog ssh provider); it may be 
> that something else is causing a bottleneck.
>
> Do you have any detailed timing information? (from my perspective, the 
> wrapper logs for every job, and the submit side log, would be interesting 
> to look at - send those to me)
>   
Nope, I am just making a work plan; those are potential issues. I will 
send you the data once I have them.

zhao


From zhaozhang at uchicago.edu  Tue Feb 10 14:33:43 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Tue, 10 Feb 2009 14:33:43 -0600
Subject: [Swift-devel] GPFS issue of SWIFT on BGP
In-Reply-To: <Pine.LNX.4.64.0902102015010.23512@dildano.hawaga.org.uk>
References: <4991D97D.7010508@uchicago.edu>
	<Pine.LNX.4.64.0902101959260.23368@dildano.hawaga.org.uk>
	<4991DFFD.2070501@uchicago.edu>
	<Pine.LNX.4.64.0902102015010.23512@dildano.hawaga.org.uk>
Message-ID: <4991E4A7.3040504@uchicago.edu>

Hi, Ben

What if there are 640 ssh-providers sending result files at the same 
time? Do you know of any successful test case
with hundreds of ssh-providers working together with one submit host?

zhao

Ben Clifford wrote:
> On Tue, 10 Feb 2009, Zhao Zhang wrote:
>
>   
>>>> Here we got an old problem, all IO nodes are trying
>>>>         
>
>   
>> data transfer from  I/O nodes to submit-side client is  writing to GPFS
>> through ssh-provider.
>>     
>
> By 'old problem' I assumed you meant the GPFS locking problems previously 
> experienced, where GPFS locks for particular filesystem objects would 
> need to be expensively moved between nodes.
>
> However, that should not be a problem here - from the GPFS perspective, 
> the submit (login) node is the only node that is interacting with GPFS, 
> and the only node that needs a lock on that directory.
>
> If you're experiencing slowness, then I would be inclined to investigate 
> elsewhere. It may be that the ssh provider is not fast (ssh is not 
> renowned for being a fast protocol; Mihael might have some commentary 
> either way based on his experiences with the cog ssh provider); it may be 
> that something else is causing a bottleneck.
>
> Do you have any detailed timing information? (from my perspective, the 
> wrapper logs for every job, and the submit side log, would be interesting 
> to look at - send those to me)
>
>   


From hategan at mcs.anl.gov  Tue Feb 10 14:38:20 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 10 Feb 2009 14:38:20 -0600 (CST)
Subject: [Swift-devel] GPFS issue of SWIFT on BGP
In-Reply-To: <4991D97D.7010508@uchicago.edu>
Message-ID: <9278820.2078991234298300290.JavaMail.root@zimbra>


----- Zhao Zhang <zhaozhang at uchicago.edu> wrote:
> Hi, All
> 
> I am working with Allan on applying CIO to SWIFT on BGP, now we are 
> blocked by a ssh-provider issue.
> Here is the description: we made ssh-provider working as the data 
> provider, and I tested it with multiple psets, it is working fine.
> Login Node ----- submit host
> IO Node -------- remote site
> Compute Node -- workers
> 
> Now, we start swift on Login Node, and the working directory will be 
> created on IO Node, so that all intermediate files and final
> result files will be copied back to Login Node(GPFS) once they are 
> generated. Here we got an old problem, all IO nodes are trying
> to write files in the same directory, which we are trying to avoid all 
> the way.
> 
> My solution would be modify the ssh-provider source code, implement an 
> asynchronous collector logic there.

I'm not really sure what the protocol used to move data with has to do with
organizing files to make things work faster.


From benc at hawaga.org.uk  Tue Feb 10 14:43:05 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 10 Feb 2009 20:43:05 +0000 (GMT)
Subject: [Swift-devel] GPFS issue of SWIFT on BGP
In-Reply-To: <4991E4A7.3040504@uchicago.edu>
References: <4991D97D.7010508@uchicago.edu>
	<Pine.LNX.4.64.0902101959260.23368@dildano.hawaga.org.uk>
	<4991DFFD.2070501@uchicago.edu>
	<Pine.LNX.4.64.0902102015010.23512@dildano.hawaga.org.uk>
	<4991E4A7.3040504@uchicago.edu>
Message-ID: <Pine.LNX.4.64.0902102035100.23512@dildano.hawaga.org.uk>


On Tue, 10 Feb 2009, Zhao Zhang wrote:

> What if there are 640 ssh-providers sending result files at the same time? Do
> you know any successful test case
> with hundreds of ssh-providers working together with one submit host?

Control is the other way round.

The Swift client will pull files down from the I/O nodes when jobs are 
finished. (that is done by the dostageout call in execute2 in 
libexec/vdl-int.k)

Swift has rate limiting on the number of file transfers and file 
operations that can be in progress at any one time. By default, the limit 
is 4 (for file transfers) and 8 (for file operations). This is controlled 
by the throttle.transfers and throttle.file.operations settings in 
swift.properties.

I think (but I am not sure) that this is a limit for the whole of Swift, 
rather than per site.

If jobs are finishing faster than Swift can stage out the data (which is 
likely to happen) then a queue of transfer requests will build up inside 
Swift.
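
The effect of a global limit can be sketched in Python. This is not Swift's implementation (that lives in the Karajan/CoG layer), and the class and function names are hypothetical: a bounded semaphore admits four stage-outs at a time (the throttle.transfers default quoted above), and every other finished job waits in line.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

# Default quoted above for throttle.transfers. The file-operations throttle
# (default 8) would be a second, independent Throttle instance.
THROTTLE_TRANSFERS = 4


class Throttle:
    """Let at most `limit` operations of one kind run at the same time."""

    def __init__(self, limit: int) -> None:
        self._slots = threading.BoundedSemaphore(limit)
        self._lock = threading.Lock()
        self.active = 0  # operations currently in progress
        self.peak = 0    # highest value `active` has reached

    def run(self, fn, *args):
        # Callers beyond the limit block here; that waiting line is the
        # queue that builds up when jobs finish faster than stage-out.
        with self._slots:
            with self._lock:
                self.active += 1
                self.peak = max(self.peak, self.active)
            try:
                return fn(*args)
            finally:
                with self._lock:
                    self.active -= 1


def fake_stageout(job_id: int) -> int:
    """Stand-in for copying one finished job's outputs back to the submit side."""
    time.sleep(0.01)
    return job_id


def stage_out_all(n_jobs: int, limit: int = THROTTLE_TRANSFERS):
    """Submit `n_jobs` (>= 1) stage-outs at once; return (results, peak concurrency)."""
    throttle = Throttle(limit)
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        results = list(pool.map(lambda j: throttle.run(fake_stageout, j), range(n_jobs)))
    return results, throttle.peak
```

However many jobs finish at once, the peak number of concurrent transfers never exceeds the limit; the rest simply wait.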

I think it is quite likely (though I have no numerical evidence) that you 
will find provider-ssh copies files too slowly for your liking; in which 
case you would need to come up with a faster way of moving files between 
the IO nodes and the submitting node. But you should see what happens with 
provider-ssh first. You should easily be able to compute throughput rates 
when you have log files for this.
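
Once the logs are available, the arithmetic is simple. This Python sketch assumes each transfer has already been parsed into a start time, end time and byte count; the real Swift and wrapper log formats are not shown here, so `Transfer` is a hypothetical record. Dividing by the overall wall-clock span, rather than summing per-file durations, accounts correctly for transfers that overlap.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Transfer:
    """One stage-out, as it might be parsed from a log (hypothetical format)."""
    start: float  # wall-clock start time, seconds
    end: float    # wall-clock end time, seconds
    nbytes: int   # bytes copied


def _span(transfers: List[Transfer]) -> float:
    """Seconds from the earliest transfer start to the latest transfer end."""
    if not transfers:
        raise ValueError("no transfers")
    span = max(t.end for t in transfers) - min(t.start for t in transfers)
    if span <= 0:
        raise ValueError("transfers cover no measurable time")
    return span


def bytes_per_second(transfers: List[Transfer]) -> float:
    """Aggregate throughput: all bytes moved over the whole wall-clock span."""
    return sum(t.nbytes for t in transfers) / _span(transfers)


def files_per_second(transfers: List[Transfer]) -> float:
    """Completed transfers per second over the same span."""
    return len(transfers) / _span(transfers)
```

For example, two overlapping transfers covering 0 s to 4 s and moving 400 bytes in total give 100 bytes/s and 0.5 files/s.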

-- 


From hategan at mcs.anl.gov  Tue Feb 10 18:47:50 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 10 Feb 2009 18:47:50 -0600 (CST)
Subject: [Swift-devel] Problems with coasters and managedfork jobmanager
Message-ID: <32608764.2094661234313270386.JavaMail.root@zimbra>

On Mon, 2009-02-09 at 18:23 -0600, Michael Wilde wrote:
> It sounds reasonable, but lets try it and see how well it works.

http://www.ci.uchicago.edu/~hategan/coaster-bootstrap.jar.pl

I suggest trying 

globusrun perl -e "`cat coaster-bootstrap.jar.pl`"

What you should see if everything works ok is the following 
complaint:

"Wrong number of arguments. Expected <serviceURL>, <package list checksum>, <registration service URL>, and <id>"


From wilde at mcs.anl.gov  Tue Feb 10 20:04:29 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 10 Feb 2009 20:04:29 -0600
Subject: [Swift-devel] Problems with coasters and managedfork jobmanager
In-Reply-To: <32608764.2094661234313270386.JavaMail.root@zimbra>
References: <32608764.2094661234313270386.JavaMail.root@zimbra>
Message-ID: <4992322D.4050204@mcs.anl.gov>

I don't know what to do, re:

On 2/10/09 6:47 PM, Mihael Hategan wrote:
> On Mon, 2009-02-09 at 18:23 -0600, Michael Wilde wrote:
>> It sounds reasonable, but lets try it and see how well it works.
> 
> http://www.ci.uchicago.edu/~hategan/coaster-bootstrap.jar.pl

OK, fetched that.

Is there a patch for this, i.e. a modified ServiceManager that uses this 
for bootstrapping?

> I suggest trying 
> 
> globusrun perl -e "`cat coaster-bootstrap.jar.pl`"

Do you mean put perl, -e, and the contents of the .pl file into an RSL 
string and run through globusrun?

perl -e "`cat coaster-bootstrap.jar.pl`"
on the command line gives what you show below.

How do I integrate and test this?

> 
> What you should see if everything works ok is the following 
> complaint:
> 
> "Wrong number of arguments. Expected <serviceURL>, 
> <package list checksum>, <registration service URL>, and <id>"


From hategan at mcs.anl.gov  Tue Feb 10 20:22:54 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 10 Feb 2009 20:22:54 -0600 (CST)
Subject: [Swift-devel] Problems with coasters and managedfork jobmanager
In-Reply-To: <4992322D.4050204@mcs.anl.gov>
Message-ID: <25935177.2095661234318974195.JavaMail.root@zimbra>


----- Michael Wilde <wilde at mcs.anl.gov> wrote:
> I dont know what to do, re:

I mean literally do the globusrun thing.

> 
> On 2/10/09 6:47 PM, Mihael Hategan wrote:
> > On Mon, 2009-02-09 at 18:23 -0600, Michael Wilde wrote:
> >> It sounds reasonable, but lets try it and see how well it works.
> > 
> > http://www.ci.uchicago.edu/~hategan/coaster-bootstrap.jar.pl
> 
> OK, fetched that.
> 
> Is there a patch for this? Modifed ServiceManager to use this for 
> bootstrapping?

There is no patch. Just that script.

> 
> > I suggest trying 
> > 
> > globusrun perl -e "`cat coaster-bootstrap.jar.pl`"
> 
> Do you mean put perl, -e, and the contents of the .pl file into an RSL 
> string and run through globusrun?

That would do. But I suspect the above command will do that.

> 
> perl -e "`cat coaster-bootstrap.jar.pl`"
> on the command line gives what you show below.

Yes, that I checked.

The point is to run the same through a few job managers and see 
whether they choke on it or not. If it goes through as many sites 
as we can send to, then we declare this a winner.
If not, we go back to the drawing board.


From wilde at mcs.anl.gov  Tue Feb 10 22:54:08 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 10 Feb 2009 22:54:08 -0600
Subject: [Swift-devel] Problems with coasters and managedfork jobmanager
In-Reply-To: <25935177.2095661234318974195.JavaMail.root@zimbra>
References: <25935177.2095661234318974195.JavaMail.root@zimbra>
Message-ID: <499259F0.9060005@mcs.anl.gov>

Sorry, I still don't get it. Did the globusrun command below work for 
you? It doesn't work for me, and doesn't match the syntax of what I 
understand globusrun to expect.

I was not able to get coaster-bootstrap.jar.pl into an RSL string for 
globusrun. I don't think that's possible, because globusrun gets confused 
by the single and double quotes in that file.

Unless I'm missing something, as far as I can tell you can only test this 
from the API level. Has it worked for you from inside Swift, say from 
ServiceManager?

On 2/10/09 8:22 PM, Mihael Hategan wrote:
> ----- Michael Wilde <wilde at mcs.anl.gov> wrote:
>> I dont know what to do, re:
> 
> I mean literally do the globusrun thing.
> 
>> On 2/10/09 6:47 PM, Mihael Hategan wrote:
>>> On Mon, 2009-02-09 at 18:23 -0600, Michael Wilde wrote:
>>>> It sounds reasonable, but lets try it and see how well it works.
>>> http://www.ci.uchicago.edu/~hategan/coaster-bootstrap.jar.pl
>> OK, fetched that.
>>
>> Is there a patch for this? Modifed ServiceManager to use this for 
>> bootstrapping?
> 
> There is no patch. Just that script.
> 
>>> I suggest trying 
>>>
>>> globusrun perl -e "`cat coaster-bootstrap.jar.pl`"
>> Do you mean put perl, -e, and the contents of the .pl file into an RSL 
>> string and run through globusrun?
> 
> That would do. But I suspect the above command will do that.
> 
>> perl -e "`cat coaster-bootstrap.jar.pl`"
>> on the command line gives what you show below.
> 
> Yes, that I checked.
> 
> The point is to run the same through a few job managers and see 
> whether they choke on it or not. If it goes through as many sites 
> as we can send to, then we declare this a winner.
> If not, we go back to the drawing board.

I think that's a good idea, but I have not yet found a way to do this 
from a shell. I'm skeptical that there's a GRAM client command that will 
take the required perl command from the command line.

Can you encode the script without embedded quotes?

- Mike

---

The output I get is:

com$ globusrun -r tp-grid1.ci.uchicago.edu perl -e "`cat coaster-bootstrap.jar.pl`"

ERROR: too many request strings specified

Syntax: globusrun [-help] [-f RSL file] [-s][-b][-d][...] [-r RM] [RSL]

Use -help to display full usage
com$


From hategan at mcs.anl.gov  Tue Feb 10 23:43:25 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 10 Feb 2009 23:43:25 -0600
Subject: [Swift-devel] Problems with coasters and managedfork jobmanager
In-Reply-To: <499259F0.9060005@mcs.anl.gov>
References: <25935177.2095661234318974195.JavaMail.root@zimbra>
	<499259F0.9060005@mcs.anl.gov>
Message-ID: <1234331005.2557.5.camel@localhost>

On Tue, 2009-02-10 at 22:54 -0600, Michael Wilde wrote:
> Sorry, I still dont get it. Did the globusrun command below work for 
> you? It doesnt work for me, and doesnt match the syntax of what I 
> understand globusrun to expect.

Allow me to rephrase it. Try the globusrun command that does "perl -e
<contents of that file>". I haven't tried it, so I don't know the exact
incantation.

> 
> I was not able to get coaster-bootstrap.jar.pl into an RSL string for 
> globusrun. I don't think thats possible, because globusrun gets confused 
> by the single and double quotes in that file.

Globusrun should properly escape those as long as it gets the correct
ARGV (which '`cat file`' should do).

> 
> Unless Im missing something, as far as I can tell you can only test this 
> from API level. Has it worked for you, from inside swift, say from 
> ServiceManager?

I haven't tried it from inside swift. I'll poke around and send a
command that can be used literally.
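
The ARGV point can be checked without GRAM. In this Python sketch the current interpreter stands in for `perl -e`: a script containing spaces, both kinds of quotes and a newline survives intact when passed as one element of an argument vector, and fails once it is re-split on whitespace, which is roughly what a job manager that re-splits arguments does. The function names are illustrative, and nothing here exercises globusrun itself.

```python
import subprocess
import sys
from typing import List

# Stand-in for the bootstrap script: spaces, single and double quotes, a newline.
SCRIPT = "a = 'single quoted'\nb = \"double quoted\"\nprint(a + ' / ' + b)"


def run(argv_tail: List[str]) -> subprocess.CompletedProcess:
    """Run `python -c ...` with an explicit argv; no shell re-parses anything."""
    return subprocess.run(
        [sys.executable, "-c", *argv_tail],
        capture_output=True,
        text=True,
    )


def run_intact(script: str) -> subprocess.CompletedProcess:
    # The whole script is one argument, like args=["-e", "{src}"] in t.k.
    return run([script])


def run_resplit(script: str) -> subprocess.CompletedProcess:
    # Simulates a job manager splitting that argument on whitespace.
    return run(script.split())
```

`run_intact(SCRIPT)` prints `single quoted / double quoted`; `run_resplit(SCRIPT)` leaves the interpreter with just `a` as its program and exits with an error.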


From hategan at mcs.anl.gov  Wed Feb 11 00:15:02 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 11 Feb 2009 00:15:02 -0600
Subject: [Swift-devel] Problems with coasters and managedfork jobmanager
In-Reply-To: <1234331005.2557.5.camel@localhost>
References: <25935177.2095661234318974195.JavaMail.root@zimbra>
	<499259F0.9060005@mcs.anl.gov>  <1234331005.2557.5.camel@localhost>
Message-ID: <1234332902.4348.3.camel@localhost>

On Tue, 2009-02-10 at 23:43 -0600, Mihael Hategan wrote:

> I haven't tried it from inside swift. I'll poke around and send a
> command that can be used literally.

Grr.

Put the following in a file named t.k:

import("sys.k")
import("task.k")

[bs,url,provider,jm] := each(...)

h := host(url
        service("execution", provider=provider,
                url=url,jobManager=jm)
)

src := strip(file:read(bs))

execute("/usr/bin/perl",
        args=["-e", "{src}"],
        host=h, provider=provider,
        redirect=true)


Then (with your swift bin dir in your path) run it like this:

cog-workflow t.k <the-bootstrap.pl-file-name> <url> <provider> <jobmanager>

For example:
cog-workflow t.k coaster-bootstrap.jar.pl localhost local none

You should get:
Submitting task Task(type=JOB_SUBMISSION, identity=urn:0-1234332844313)
Wrong number of arguments. Expected <serviceURL>, <package list checksum>, <registration service URL>, and <id>

Execution failed:
Job failed with an exit code of 1
...


From benc at hawaga.org.uk  Wed Feb 11 03:42:36 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 11 Feb 2009 09:42:36 +0000 (GMT)
Subject: [Swift-devel] Problems with coasters and managedfork jobmanager
In-Reply-To: <1234332902.4348.3.camel@localhost>
References: <25935177.2095661234318974195.JavaMail.root@zimbra>
	<499259F0.9060005@mcs.anl.gov>  <1234331005.2557.5.camel@localhost>
	<1234332902.4348.3.camel@localhost>
Message-ID: <Pine.LNX.4.64.0902110911360.1293@dildano.hawaga.org.uk>


I've tried (with my OSGEDU credential), and the following site gives the 
expected 'Wrong number of arguments...':

  tp-osg.ci.uchicago.edu pbs  (t.k modified to use queue="test")

Trying with osgce.cs.clemson.edu condor gives:

[bcliff at osgedu coaster-perl-test]$ cog-workflow t.k coaster-bootstrap.jar.pl osgce.cs.clemson.edu gt2 condor
String found where operator expected at -e line 1, at end of line
        (Missing semicolon on previous line?)
Can't find string terminator "'" anywhere before EOF at -e line 1.

which doesn't particularly surprise me as the perl file has many spaces, 
so the same problem as with sh -c appears to arise there.

-- 


From benc at hawaga.org.uk  Wed Feb 11 05:39:40 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 11 Feb 2009 11:39:40 +0000 (GMT)
Subject: [Swift-devel] scheduler stuff for Google Summer of Code 2009
Message-ID: <Pine.LNX.4.64.0902111119000.23512@dildano.hawaga.org.uk>


Google Summer of Code 2009 mentor applications open in a couple of weeks. 
dev.globus likely will be applying again. I have a few ideas for Swift 
projects within dev.globus that I'd like to mentor.

One idea that I'm kinda fuzzy on, but where there might be interesting 
work to do, is implementing more interesting scheduler behaviour.

Various people have talked in the past about the following ideas, which I 
think have some decent level of merit:

 a) changing ordering of execution of swift-level jobs based on how many
     other swift-level jobs depend on that first job

 b) reordering stageins and stageouts so as to allow (in addition to the 
     present as-they-come (I think) policy) "prefer stageins" (which would 
     get more jobs going sooner, at the expense of slower stageouts, which 
     in our present restart model would also reduce the rate at which jobs 
     complete far enough to count for restart), and
     "prefer stageouts", which would get completed results out to the submit 
     side faster, at the expense of job execution speed.

 c) data affinity - there was some messing round with this, but it
     resulted in code that did not work (which is fine for that project, 
     as it was not production code oriented, but not for committing to
     the codebase). So potentially this could be reimplemented or the 
     existing code tidied up as part of this.
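
Idea (a) could look something like this Python sketch. It is hypothetical, not Swift's scheduler, and assumes that "depend on" counts every job downstream (directly or transitively) and that the job graph is acyclic. Among jobs whose inputs are ready, it runs the one with the most dependents first.

```python
import heapq
from typing import Dict, List, Set


def _children(deps: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Invert `deps` (job -> jobs whose outputs it needs) into job -> jobs that need it."""
    kids: Dict[str, Set[str]] = {job: set() for job in deps}
    for job, parents in deps.items():
        for parent in parents:
            kids.setdefault(parent, set()).add(job)
    return kids


def dependent_counts(deps: Dict[str, List[str]]) -> Dict[str, int]:
    """For each job, how many jobs depend on it directly or transitively."""
    kids = _children(deps)
    memo: Dict[str, Set[str]] = {}

    def downstream(job: str) -> Set[str]:
        if job not in memo:
            found: Set[str] = set()
            for kid in kids[job]:
                found.add(kid)
                found |= downstream(kid)
            memo[job] = found
        return memo[job]

    return {job: len(downstream(job)) for job in kids}


def priority_order(deps: Dict[str, List[str]]) -> List[str]:
    """A valid execution order that, among ready jobs, prefers the most-depended-on."""
    kids = _children(deps)
    weight = dependent_counts(deps)
    waiting = {job: len(set(deps.get(job, ()))) for job in kids}
    # heapq is a min-heap, so the weight is negated; the job name breaks ties.
    ready = [(-weight[job], job) for job, n in waiting.items() if n == 0]
    heapq.heapify(ready)
    order: List[str] = []
    while ready:
        _, job = heapq.heappop(ready)
        order.append(job)
        for kid in kids[job]:
            waiting[kid] -= 1
            if waiting[kid] == 0:
                heapq.heappush(ready, (-weight[kid], kid))
    return order
```

With `a` feeding `c` and `d`, and `c` feeding `e`, job `a` runs before the independent job `b`, and `c` (one dependent) runs before `b`, `d` and `e` (none).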

Comments and additional ideas...

-- 


From hategan at mcs.anl.gov  Wed Feb 11 08:17:53 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 11 Feb 2009 08:17:53 -0600
Subject: [Swift-devel] Problems with coasters and managedfork jobmanager
In-Reply-To: <Pine.LNX.4.64.0902110911360.1293@dildano.hawaga.org.uk>
References: <25935177.2095661234318974195.JavaMail.root@zimbra>
	<499259F0.9060005@mcs.anl.gov>  <1234331005.2557.5.camel@localhost>
	<1234332902.4348.3.camel@localhost>
	<Pine.LNX.4.64.0902110911360.1293@dildano.hawaga.org.uk>
Message-ID: <1234361873.5085.0.camel@localhost>

This was an attempt at dealing with the newlines. There is not much I
can do about the spaces.

On Wed, 2009-02-11 at 09:42 +0000, Ben Clifford wrote:
> I've tried (with my OSGEDU credential) and the following site give the 
> expected 'Wrong number of arguments...':
> 
>   tp-osg.ci.uchicago.edu pbs  (t.k modified to use queue="test")
> 
> Trying with osgce.cs.clemson.edu condor gives:
> 
> [bcliff at osgedu coaster-perl-test]$ cog-workflow t.k 
> coaster-bootstrap.jar.pl osgce.cs.clemson.edu gt2 condor
> String found where operator expected at -e line 1, at end of line
>         (Missing semicolon on previous line?)
> Can't find string terminator "'" anywhere before EOF at -e line 1.
> 
> which doesn't particularly surprise me as the perl file has many spaces, 
> so the same problem as with sh -c appears to arise there.
> 


From foster at anl.gov  Wed Feb 11 08:23:48 2009
From: foster at anl.gov (foster at anl.gov)
Date: Wed, 11 Feb 2009 08:23:48 -0600 (CST)
Subject: [Swift-devel] Problems with coasters and managedfork jobmanager
In-Reply-To: <1234361873.5085.0.camel@localhost>
References: <25935177.2095661234318974195.JavaMail.root@zimbra>
	<499259F0.9060005@mcs.anl.gov> <1234331005.2557.5.camel@localhost>
	<1234332902.4348.3.camel@localhost>
	<Pine.LNX.4.64.0902110911360.1293@dildano.hawaga.org.uk>
	<1234361873.5085.0.camel@localhost>
Message-ID: <1DB48832-4CF3-4EDD-A26D-6A38D42D51D3@anl.gov>

Are you not using gram?

Ian -- from mobile

On Feb 11, 2009, at 8:19 AM, Mihael Hategan <hategan at mcs.anl.gov> wrote:

> This was an attempt at dealing with the newlines. There is not much I
> can do about the spaces.
>
> On Wed, 2009-02-11 at 09:42 +0000, Ben Clifford wrote:
>> I've tried (with my OSGEDU credential) and the following site give  
>> the
>> expected 'Wrong number of arguments...':
>>
>>  tp-osg.ci.uchicago.edu pbs  (t.k modified to use queue="test")
>>
>> Trying with osgce.cs.clemson.edu condor gives:
>>
>> [bcliff at osgedu coaster-perl-test]$ cog-workflow t.k
>> coaster-bootstrap.jar.pl osgce.cs.clemson.edu gt2 condor
>> String found where operator expected at -e line 1, at end of line
>>        (Missing semicolon on previous line?)
>> Can't find string terminator "'" anywhere before EOF at -e line 1.
>>
>> which doesn't particularly surprise me as the perl file has many  
>> spaces,
>> so the same problem as with sh -c appears to arise there.
>>
>


From benc at hawaga.org.uk  Wed Feb 11 08:40:58 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 11 Feb 2009 14:40:58 +0000 (GMT)
Subject: [Swift-devel] Problems with coasters and managedfork jobmanager
In-Reply-To: <1DB48832-4CF3-4EDD-A26D-6A38D42D51D3@anl.gov>
References: <25935177.2095661234318974195.JavaMail.root@zimbra>
	<499259F0.9060005@mcs.anl.gov>  <1234331005.2557.5.camel@localhost>
	<1234332902.4348.3.camel@localhost>
	<Pine.LNX.4.64.0902110911360.1293@dildano.hawaga.org.uk>
	<1234361873.5085.0.camel@localhost>
	<1DB48832-4CF3-4EDD-A26D-6A38D42D51D3@anl.gov>
Message-ID: <Pine.LNX.4.64.0902111439480.23512@dildano.hawaga.org.uk>


On Wed, 11 Feb 2009, foster at anl.gov wrote:

> Are you not using gram?

yes, which feeds jobs into condor.

-- 


From foster at anl.gov  Wed Feb 11 08:49:56 2009
From: foster at anl.gov (Ian Foster)
Date: Wed, 11 Feb 2009 08:49:56 -0600
Subject: [Swift-devel] Problems with coasters and managedfork jobmanager
In-Reply-To: <Pine.LNX.4.64.0902111439480.23512@dildano.hawaga.org.uk>
References: <25935177.2095661234318974195.JavaMail.root@zimbra>
	<499259F0.9060005@mcs.anl.gov> <1234331005.2557.5.camel@localhost>
	<1234332902.4348.3.camel@localhost>
	<Pine.LNX.4.64.0902110911360.1293@dildano.hawaga.org.uk>
	<1234361873.5085.0.camel@localhost>
	<1DB48832-4CF3-4EDD-A26D-6A38D42D51D3@anl.gov>
	<Pine.LNX.4.64.0902111439480.23512@dildano.hawaga.org.uk>
Message-ID: <86375BBE-29A8-4171-BC89-0B5495180E9A@anl.gov>

Ben:

The answer "yes" to a negative question has different  
meanings depending on one's cultural origins :)

What did you mean in this case?

Ian.


On Feb 11, 2009, at 8:40 AM, Ben Clifford wrote:

>
> On Wed, 11 Feb 2009, foster at anl.gov wrote:
>
>> Are you not using gram?
>
> yes, which feeds jobs into condor.
>
> -- 
>


From benc at hawaga.org.uk  Wed Feb 11 09:00:29 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 11 Feb 2009 15:00:29 +0000 (GMT)
Subject: [Swift-devel] Problems with coasters and managedfork jobmanager
In-Reply-To: <86375BBE-29A8-4171-BC89-0B5495180E9A@anl.gov>
References: <25935177.2095661234318974195.JavaMail.root@zimbra>
	<499259F0.9060005@mcs.anl.gov>  <1234331005.2557.5.camel@localhost>
	<1234332902.4348.3.camel@localhost>
	<Pine.LNX.4.64.0902110911360.1293@dildano.hawaga.org.uk>
	<1234361873.5085.0.camel@localhost>
	<1DB48832-4CF3-4EDD-A26D-6A38D42D51D3@anl.gov>
	<Pine.LNX.4.64.0902111439480.23512@dildano.hawaga.org.uk>
	<86375BBE-29A8-4171-BC89-0B5495180E9A@anl.gov>
Message-ID: <Pine.LNX.4.64.0902111450550.23512@dildano.hawaga.org.uk>


On Wed, 11 Feb 2009, Ian Foster wrote:

> The meaning of the answer "yes" to a negative question has different meanings
> depending on one's cultural origins :)

ha. indeed.

> What did you mean in this case?

job flow is cog commandline -> cog gt2 provider -> gram2 -> condor -> go!

This problem occurs with any use of Condor up until fairly recent 
versions. The Condor job submission file format (which in this case is 
being generated by gram2, but that is mostly irrelevant) doesn't cope with 
spaces in arguments:

Normally:
  echo "hi there"
has $1 equal to the entire string:  hi there
and no $2

If you submit the same through condor, you get $1=hi and $2=there

This then causes problems when passing command sequences on the command 
line to sh or perl:

Something like this:

  sh -c 'echo foo'

which passes the entire command "echo foo" as the second parameter to sh 
gets turned into something like:
  $1=-c   (which is ok)
  $2=echo (or maybe 'echo)
  $3=foo (or maybe foo')

so sh runs the command "echo" with no parameters (as $2 instructs it to).

So it's hard to run any non-trivial command through this mechanism.
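A minimal Python sketch of the difference. For illustration it assumes that the old Condor argument handling behaves like a plain whitespace split that leaves quote characters in place (the "'echo" variant above). `naive_condor_split` and `shell_split` are hypothetical names, and `shlex.split` from the standard library stands in for POSIX shell tokenization.

```python
import shlex


def naive_condor_split(arguments):
    """Assumed model of the old behaviour: split on whitespace only,
    so quote characters are kept and quoted phrases are broken apart."""
    return arguments.split()


def shell_split(arguments):
    """POSIX-shell-style tokenization: quotes group words and are stripped."""
    return shlex.split(arguments)


for args in ['"hi there"', "-c 'echo foo'"]:
    print(f"{args!r}: naive={naive_condor_split(args)} shell={shell_split(args)}")
```

Under the naive split, sh -c receives `'echo` rather than `echo foo` as its command string, so foo never reaches echo, which is the breakage described above.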

There's future hope, though, as a hopefully fixed GRAM update is 
available, working with a recently fixed Condor, and that is likely to be 
deployed on OSG in due course (which is where most Condor woes occur).

-- 


From hategan at mcs.anl.gov  Wed Feb 11 09:04:13 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 11 Feb 2009 09:04:13 -0600
Subject: [Swift-devel] Problems with coasters and managedfork jobmanager
In-Reply-To: <1DB48832-4CF3-4EDD-A26D-6A38D42D51D3@anl.gov>
References: <25935177.2095661234318974195.JavaMail.root@zimbra>
	<499259F0.9060005@mcs.anl.gov>  <1234331005.2557.5.camel@localhost>
	<1234332902.4348.3.camel@localhost>
	<Pine.LNX.4.64.0902110911360.1293@dildano.hawaga.org.uk>
	<1234361873.5085.0.camel@localhost>
	<1DB48832-4CF3-4EDD-A26D-6A38D42D51D3@anl.gov>
Message-ID: <1234364653.5907.2.camel@localhost>

On Wed, 2009-02-11 at 08:23 -0600, foster at anl.gov wrote:
> Are you not using gram?

Hmm, you're still in "you hate gram" mode :)

Yes, we are using gram and the condor job manager behaves badly.


From foster at anl.gov  Wed Feb 11 09:43:15 2009
From: foster at anl.gov (Ian Foster)
Date: Wed, 11 Feb 2009 09:43:15 -0600
Subject: [Swift-devel] Problems with coasters and managedfork jobmanager
In-Reply-To: <1234364653.5907.2.camel@localhost>
References: <25935177.2095661234318974195.JavaMail.root@zimbra>
	<499259F0.9060005@mcs.anl.gov> <1234331005.2557.5.camel@localhost>
	<1234332902.4348.3.camel@localhost>
	<Pine.LNX.4.64.0902110911360.1293@dildano.hawaga.org.uk>
	<1234361873.5085.0.camel@localhost>
	<1DB48832-4CF3-4EDD-A26D-6A38D42D51D3@anl.gov>
	<1234364653.5907.2.camel@localhost>
Message-ID: <35E4FAE1-4044-4A9C-8242-AF427EBACCEF@anl.gov>

Ben, Mihael:

My email did have that flavor, didn't it. My apologies. :)

Ian.


On Feb 11, 2009, at 9:04 AM, Mihael Hategan wrote:

> On Wed, 2009-02-11 at 08:23 -0600, foster at anl.gov wrote:
>> Are you not using gram?
>
> Hmm, you're still in "you hate gram" mode :)
>
> Yes, we are using gram and the condor job manager behaves badly.
>


From wilde at mcs.anl.gov  Wed Feb 11 13:24:01 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 11 Feb 2009 13:24:01 -0600
Subject: [Swift-devel] scheduler stuff for Google Summer of Code 2009
In-Reply-To: <Pine.LNX.4.64.0902111119000.23512@dildano.hawaga.org.uk>
References: <Pine.LNX.4.64.0902111119000.23512@dildano.hawaga.org.uk>
Message-ID: <499325D1.9040601@mcs.anl.gov>

These all sound good.

Another scheduler-related set of projects relates to algorithms around 
coasters:

- how to manage (grow and shrink) the size of coaster pools
- how to size the time requests for the workers (perhaps dynamically)
- how the current dynamic throttle works for coasters
- and probably many more.

Some more areas:

- a detailed evaluation and tuning of the throttle heuristics for 
many-site workflows

- automatically clustering more diverse DAGs, and pipelining

- running Swift on clouds like EC2 and Azure

- scaling swift to 1M+ task workflows, efficiently (streaming the mappers)

- service-oriented workflows

- extending CIO techniques back to grid environments

- creating a lightweight "embedded swift" VM for running workflows from 
*within* perl, python, R, etc.

These cover a wide space of useful-to-crazy, easy-to-hard, etc.  Just 
tossing them out.

- Mike


On 2/11/09 5:39 AM, Ben Clifford wrote:
> Google Summer of Code 2009 mentor applications open in a couple of weeks. 
> dev.globus likely will be applying again. I have a few ideas for Swift 
> projects within dev.globus that I'd like to mentor.
> 
> One idea that I'm kinda fuzzy on but there might be interesting work to do 
> is implementing more interesting scheduler behaviour.
> 
> Various people have talked in the past about these, that I think have some 
> decent level of merit:
> 
>  a) changing ordering of execution of swift-level jobs based on how many
>      other swift-level jobs depend on that first job
> 
>  b) reordering stageins and stageouts so as to allow (in addition to the 
>      present as-they-come (I think) policy) "prefer stageins" (which would 
>      get more jobs going sooner, at the expense of slower stageouts, which 
>      in our present restart model would reduce the speed at which jobs 
>      complete enough for restart), and
>      "prefer stageouts", which would get completed results out to submit 
>      side faster, at the expense of job execution speed.
> 
>  c) data affinity - there was some messing round with this, but it
>      resulted in code that did not work (which is fine for that project, 
>      as it was not production code oriented, but not for committing to
>      the codebase). So potentially this could be reimplemented or the 
>      existing code tidied up as part of this.
> 
> Comments and additional ideas...
> 
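A rough Python sketch of ideas (a) and (b), as illustration only; nothing here is Swift's scheduler code. Assumptions: jobs are plain names, "depend on" is counted as direct dependents only (transitive counting would be an alternative), and each transfer carries an arrival sequence number so the as-they-come policy can be modelled. All function names are hypothetical.

```python
from collections import defaultdict, deque


def order_by_dependents(ready_jobs, edges):
    """Idea (a): run ready jobs with more direct dependents first.

    edges is an iterable of (job, dependent_job) pairs. Ties are broken by
    job name so the ordering is deterministic.
    """
    dependents = defaultdict(int)
    for job, _dependent in edges:
        dependents[job] += 1
    return sorted(ready_jobs, key=lambda job: (-dependents[job], job))


def next_transfer(stageins, stageouts, policy="fifo"):
    """Idea (b): pick the next file transfer under a staging policy.

    Each queue is a deque of (arrival_seq, name) in arrival order.
    Returns None when both queues are empty.
    """
    if policy == "prefer_stageins":
        queue = stageins or stageouts
    elif policy == "prefer_stageouts":
        queue = stageouts or stageins
    elif policy == "fifo":
        # as-they-come: take whichever queue head arrived first
        nonempty = [q for q in (stageins, stageouts) if q]
        queue = min(nonempty, key=lambda q: q[0][0]) if nonempty else None
    else:
        raise ValueError(f"unknown policy: {policy}")
    return queue.popleft() if queue else None


def drain(stageins, stageouts, policy):
    """Return transfer names in the order the policy would run them."""
    order = []
    while (item := next_transfer(stageins, stageouts, policy)) is not None:
        order.append(item[1])
    return order


print(order_by_dependents(["a", "b", "c"], [("b", "x"), ("b", "y"), ("c", "z")]))
for policy in ("fifo", "prefer_stageins", "prefer_stageouts"):
    ins = deque([(1, "in-a"), (3, "in-b")])
    outs = deque([(2, "out-x")])
    print(policy, drain(ins, outs, policy))
```

"prefer stageins" pushes out-x to the end, while "prefer stageouts" runs it first, which mirrors the trade-off between starting jobs sooner and getting results back to the submit side sooner.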


From benc at hawaga.org.uk  Wed Feb 11 14:33:16 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 11 Feb 2009 20:33:16 +0000 (GMT)
Subject: [Swift-devel] scheduler stuff for Google Summer of Code 2009
In-Reply-To: <499325D1.9040601@mcs.anl.gov>
References: <Pine.LNX.4.64.0902111119000.23512@dildano.hawaga.org.uk>
	<499325D1.9040601@mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0902112029440.23512@dildano.hawaga.org.uk>


From the stuff below, some are quite researchy and so probably better for 
an academic student project rather than a google project. The two that 
seem fairly clearly defined are:

For the first of these (coasters, quoted below), there may be some grunt-work 
implementation of tweakable parameters and things like that. I don't know 
if it would take up a whole summer, though.

> Another scheduler-related set of projects relates to algorithms around 
> coasters:
> - how to manage (grow and shrink) the size of coaster pools
> - how to size the time requests for the workers (perhaps dynamically)
> - how the current dynamic throttle works for coasters
> - and probably many more.


> - running Swift on clouds like E2C and Azure

Tim Freeman has done some work standing up clusters on multiple EC2 nodes, 
so that the multiple nodes are exposed through a single (perhaps GRAM?) 
interface. So putting Swift on the front of that seems straightforward to 
define.

-- 


From iraicu at cs.uchicago.edu  Wed Feb 11 14:41:17 2009
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Wed, 11 Feb 2009 14:41:17 -0600
Subject: [Swift-devel] scheduler stuff for Google Summer of Code 2009
In-Reply-To: <Pine.LNX.4.64.0902112029440.23512@dildano.hawaga.org.uk>
References: <Pine.LNX.4.64.0902111119000.23512@dildano.hawaga.org.uk>	<499325D1.9040601@mcs.anl.gov>
	<Pine.LNX.4.64.0902112029440.23512@dildano.hawaga.org.uk>
Message-ID: <499337ED.9010009@cs.uchicago.edu>


Ben Clifford wrote:
>
>> - running Swift on clouds like E2C and Azure
>>     
>
> Tim Freeman has done some work standing up clusters on multiple EC2 nodes, 
> so that the multiple nodes are exposed through a single (perhaps GRAM?) 
> interface. So putting Swift on the front of that seems straightforward to 
> define.
>
>   
Back in 2007 (yes, it's been almost 2 years since we tried this), 
Catalin, Tim Freeman, and I got MolDyn (Nika's molecular dynamics app) 
running through Swift + Falkon + Workspace Service + EC2 + NFS. At the 
time, they were just rolling out support for PBS/GRAM, which means that 
for a simpler deployment scenario, you might be able to use GRAM/PBS 
instead of Falkon.

We had a wiki setup to keep track of our progress:
http://dev.globus.org/wiki/Incubator/Falkon/EC2
http://www.ci.uchicago.edu/wiki/bin/view/VDS/DslCS/Falkon_EC2

but then Catalin found a job, and moved on to other things, and I never 
had time to carry the work forward. It sounds like an interesting 
scenario to support, with minimal end-user intervention.

Ioan

-- 
===================================================
Ioan Raicu
Ph.D. Candidate
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================


From benc at hawaga.org.uk  Wed Feb 11 14:41:41 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 11 Feb 2009 20:41:41 +0000 (GMT)
Subject: [Swift-devel] scheduler stuff for Google Summer of Code 2009
In-Reply-To: <499325D1.9040601@mcs.anl.gov>
References: <Pine.LNX.4.64.0902111119000.23512@dildano.hawaga.org.uk>
	<499325D1.9040601@mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0902112033410.23512@dildano.hawaga.org.uk>


On Wed, 11 Feb 2009, Michael Wilde wrote:

> - scaling swift to 1M+ task workflows, efficiently (streaming the 
> mappers)

There's more to this than simple streaming mappers.

At the moment, everything is built around having a Java object in memory 
for every piece of data that can be referenced, and that object tends to 
stick around for a long time (at least as long as that data can be 
referenced). For example, if you have an array which has a large number of 
elements, then each of those elements has at least one object in memory 
representing it, because as long as you have the array in scope, you can 
say a[1] or a[anything] and thus get to every element.

The in-memory implementation of the data model and anything that touches 
it would need some fairly serious work to cope with having stuff kept out 
of core; and I think keeping stuff out of core is something that would 
need to happen.

(that is, 'streaming mappers' as a phrase seems to deal with "not getting 
knowledge about data too fast" but does not deal with "forgetting 
knowledge about data fast enough")
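The distinction between "not getting knowledge about data too fast" and "forgetting knowledge about data fast enough" can be illustrated with a small Python analogy (not Swift's Java data model). `DataItem` is a hypothetical stand-in for the per-element object; the counts rely on CPython's reference counting freeing objects as soon as they become unreachable.

```python
class DataItem:
    """Stand-in for the per-element object; counts how many are alive."""
    live = 0
    peak = 0

    def __init__(self, index):
        self.index = index
        DataItem.live += 1
        DataItem.peak = max(DataItem.peak, DataItem.live)

    def __del__(self):
        DataItem.live -= 1


def materialized(n):
    # every element exists up front, like an array kept in scope
    return [DataItem(i) for i in range(n)]


def streamed(n):
    # elements are created on demand and can be forgotten after use
    return (DataItem(i) for i in range(n))


def peak_live_objects(make_items, n):
    """Iterate over n items and report the most that were alive at once."""
    DataItem.live = 0
    DataItem.peak = 0
    total = 0
    for item in make_items(n):
        total += item.index
    return DataItem.peak


print(peak_live_objects(materialized, 100_000))
print(peak_live_objects(streamed, 100_000))
```

The list version keeps all 100,000 objects alive at once because the whole collection stays reachable; the generator version peaks at no more than two, since each element is dropped once the loop moves on.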

-- 


From wilde at mcs.anl.gov  Wed Feb 11 14:53:01 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 11 Feb 2009 14:53:01 -0600
Subject: [Swift-devel] scheduler stuff for Google Summer of Code 2009
In-Reply-To: <Pine.LNX.4.64.0902112033410.23512@dildano.hawaga.org.uk>
References: <Pine.LNX.4.64.0902111119000.23512@dildano.hawaga.org.uk>
	<499325D1.9040601@mcs.anl.gov>
	<Pine.LNX.4.64.0902112033410.23512@dildano.hawaga.org.uk>
Message-ID: <49933AAD.2020303@mcs.anl.gov>

So in a sense, rather than saying "streaming mappers" can you call this 
"streaming foreach() statements" so that as each "iteration" (or 
"instantiation") of the foreach completes, the objects it used are freed 
and removed/removable from memory? (i.e., does this address the "scope"
problem?)

Too big for an SOC student?

Interesting enough for one?

(It's a nice scalability challenge... and could be demonstrated first on 
localhost to make good progress w/o getting tangled in distributed 
computing initially)

If too big, can we make it manageable?

If too small, can we bundle with related tasks?

On 2/11/09 2:41 PM, Ben Clifford wrote:
> On Wed, 11 Feb 2009, Michael Wilde wrote:
> 
>> - scaling swift to 1M+ task workflows, efficiently (streaming the 
>> mappers)
> 
> There's more to this than simple streaming mappers.
> 
> At the moment, everything is built around having a Java object in memory 
> for every piece of data that can be referenced, and that object tends to 
> stick around for a long time (at least as long as that data can be 
> referenced). For example, if you have an array which has a large number of 
> elements, then each of those elements has at least one object in memory 
> representing it, because as long as you have the array in scope, you can 
> say a[1] or a[anything] and thus get to every element.
> 
> The in-memory implementation of the data model and anything that touches 
> it would need some fairly serious work to cope with having stuff kept out 
> of core; and I think keeping stuff out of core is something that would 
> need to happen.
> 
> (that is, 'streaming mappers' as a phrase seems to deal with "not getting 
> knowledge about data too fast" but does not deal with "forgetting 
> knowledge about data fast enough")
> 


From tfreeman at mcs.anl.gov  Wed Feb 11 14:53:43 2009
From: tfreeman at mcs.anl.gov (Tim Freeman)
Date: Wed, 11 Feb 2009 14:53:43 -0600
Subject: [Swift-devel] scheduler stuff for Google Summer of Code 2009
In-Reply-To: <Pine.LNX.4.64.0902112029440.23512@dildano.hawaga.org.uk>
References: <Pine.LNX.4.64.0902111119000.23512@dildano.hawaga.org.uk>
	<499325D1.9040601@mcs.anl.gov>
	<Pine.LNX.4.64.0902112029440.23512@dildano.hawaga.org.uk>
Message-ID: <20090211145343.3df6ab8e@prnb>

On Wed, 11 Feb 2009 20:33:16 +0000 (GMT)
Ben Clifford <benc at hawaga.org.uk> wrote:

[...]
> 
> > - running Swift on clouds like E2C and Azure
> 
> Tim Freeman has done some work standing up clusters on multiple EC2 nodes, 
> so that the multiple nodes are exposed through a single (perhaps GRAM?) 
> interface. So putting Swift on the front of that seems straightforward to 
> define.

Yep, was just starting some 100+ node GRAM/PBS clusters there last week.

Tim


From hategan at mcs.anl.gov  Wed Feb 11 15:33:48 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 11 Feb 2009 15:33:48 -0600 (CST)
Subject: [Swift-devel] scheduler stuff for Google Summer of Code 2009
In-Reply-To: <Pine.LNX.4.64.0902112033410.23512@dildano.hawaga.org.uk>
Message-ID: <22220735.2155401234388028575.JavaMail.root@zimbra>


----- Ben Clifford <benc at hawaga.org.uk> wrote:
> 
> On Wed, 11 Feb 2009, Michael Wilde wrote:
> 
> > - scaling swift to 1M+ task workflows, efficiently (streaming the 
> > mappers)
> 
> There's more to this than simple streaming mappers.
> 
> At the moment, everything is built around having a Java object in memory 
> for every piece of data that can be referenced, and that object tends to 
> stick around for a long time (at least as long as that data can be 
> referenced). For example, if you have an array which has a large number of 
> elements, then each of those elements has at least one object in memory 
> representing it, because as long as you have the array in scope, you can 
> say a[1] or a[anything] and thus get to every element.

I do not think that this issue is the bottleneck here. For every application
invocation there is a karajan thread. The fact that one such thread eats
around 10-20k seems to be the problem. By contrast, a piece of Swift data
probably takes less than 1k.

So I think that one order of magnitude improvement could be achieved by 
addressing that 10-20k problem (or by somehow having fewer karajan threads).
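A back-of-envelope check of the numbers above, reading "10-20k" as 10-20 KB per Karajan thread and "1k" as at most 1 KB per piece of Swift data (an interpretation, using decimal units), and assuming every one of 1M invocations holds a live thread at once. This is arithmetic only, not a measurement.

```python
def footprint_gb(invocations, bytes_each):
    """Total memory in decimal gigabytes for one object per invocation."""
    return invocations * bytes_each / 1e9


TASKS = 1_000_000
for label, size in [("karajan thread, low", 10_000),
                    ("karajan thread, high", 20_000),
                    ("Swift datum (upper bound)", 1_000)]:
    print(f"{label}: {footprint_gb(TASKS, size):.0f} GB")
```

That gives 10-20 GB for threads versus at most 1 GB for data, which is consistent with the suggestion that the thread overhead, not the data objects, dominates.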

> 
> The in-memory implementation of the data model and anything that touches 
> it would need some fairly serious work to cope with having stuff kept out 
> of core; and I think keeping stuff out of core is something that would 
> need to happen.
> 
> (that is, 'streaming mappers' as a phrase seems to deal with "not getting 
> knowledge about data too fast" but does not deal with "forgetting 
> knowledge about data fast enough")
> 
> -- 
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel


From foster at anl.gov  Wed Feb 11 15:34:33 2009
From: foster at anl.gov (Ian Foster)
Date: Wed, 11 Feb 2009 15:34:33 -0600
Subject: [Swift-devel] scheduler stuff for Google Summer of Code 2009
In-Reply-To: <20090211145343.3df6ab8e@prnb>
References: <Pine.LNX.4.64.0902111119000.23512@dildano.hawaga.org.uk>
	<499325D1.9040601@mcs.anl.gov>
	<Pine.LNX.4.64.0902112029440.23512@dildano.hawaga.org.uk>
	<20090211145343.3df6ab8e@prnb>
Message-ID: <529D3FFB-BA26-4736-A160-3AD51C823CA4@anl.gov>

UniCloud provides that functionality, too. I imagine there are others.


On Feb 11, 2009, at 2:53 PM, Tim Freeman wrote:

> On Wed, 11 Feb 2009 20:33:16 +0000 (GMT)
> Ben Clifford <benc at hawaga.org.uk> wrote:
>
> [...]
>>
>>> - running Swift on clouds like E2C and Azure
>>
>> Tim Freeman has done some work standing up clusters on multiple EC2 nodes,
>> so that the multiple nodes are exposed through a single (perhaps GRAM?)
>> interface. So putting Swift on the front of that seems straightforward to
>> define.
>
> Yep, was just starting some 100+ node GRAM/PBS clusters there last week.
>
> Tim


From foster at anl.gov  Wed Feb 11 15:35:56 2009
From: foster at anl.gov (Ian Foster)
Date: Wed, 11 Feb 2009 15:35:56 -0600
Subject: [Swift-devel] scheduler stuff for Google Summer of Code 2009
In-Reply-To: <22220735.2155401234388028575.JavaMail.root@zimbra>
References: <22220735.2155401234388028575.JavaMail.root@zimbra>
Message-ID: <D2D39F78-257C-4CA6-B215-FB7C4F93E8D0@anl.gov>

I would argue that Swift support for collective operations also helps  
with scaling. (We can run a computation with 1M tasks, if not all 1M  
tasks are active at once.)


On Feb 11, 2009, at 3:33 PM, Mihael Hategan wrote:

>
> ----- Ben Clifford <benc at hawaga.org.uk> wrote:
>>
>> On Wed, 11 Feb 2009, Michael Wilde wrote:
>>
>>> - scaling swift to 1M+ task workflows, efficiently (streaming the
>>> mappers)
>>
>> There's more to this than simple streaming mappers.
>>
>> At the moment, everything is built around having a Java object in memory
>> for every piece of data that can be referenced, and that object tends to
>> stick around for a long time (at least as long as that data can be
>> referenced). For example, if you have an array which has a large number of
>> elements, then each of those elements has at least one object in memory
>> representing it, because as long as you have the array in scope, you can
>> say a[1] or a[anything] and thus get to every element.
>
> I do not think that this issue is the bottleneck here. For every application
> invocation there is a karajan thread. The fact that one such thread eats
> around 10-20k seems to be the problem. By contrast, a piece of Swift data
> probably takes less than 1k.
>
> So I think that one order of magnitude improvement could be achieved by
> addressing that 10-20k problem (or by somehow having fewer karajan threads).
>
>>
>> The in-memory implementation of the data model and anything that touches
>> it would need some fairly serious work to cope with having stuff kept out
>> of core; and I think keeping stuff out of core is something that would
>> need to happen.
>>
>> (that is, 'streaming mappers' as a phrase seems to deal with "not getting
>> knowledge about data too fast" but does not deal with "forgetting
>> knowledge about data fast enough")
>>
>


From benc at hawaga.org.uk  Wed Feb 11 15:52:33 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 11 Feb 2009 21:52:33 +0000 (GMT)
Subject: [Swift-devel] scheduler stuff for Google Summer of Code 2009
In-Reply-To: <22220735.2155401234388028575.JavaMail.root@zimbra>
References: <22220735.2155401234388028575.JavaMail.root@zimbra>
Message-ID: <Pine.LNX.4.64.0902112146110.23512@dildano.hawaga.org.uk>


On Wed, 11 Feb 2009, Mihael Hategan wrote:

> I do not think that this issue is the bottleneck here. For every application

I wasn't attempting to provide a comprehensive summary of stuff that won't 
scale... mostly I wanted one example of another issue.


> So I think that one order of magnitude improvement could be achieved by 
> addressing that 10-20k problem (or by somehow having fewer karajan threads).

20k per app invocation is pretty heavyweight when invocations are 
parallelised.

I suspect that fewer karajan threads at any point in time can be brought 
about by some streaming-like approach: rather than forking n karajan 
threads for n elements by saying parallelFor, with a simultaneous thread 
for each (most of which, for large enough n, will block), that construct 
could perhaps be made to act over time. So, stream-like behaviour for 
foreach (lists spread over time as well as space).
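The "act over time" idea amounts to a bounded window: keep at most k iterations in flight and pull the next element only when one finishes. A minimal Python sketch, with ordinary threads standing in for Karajan threads; `windowed_foreach` and `window` are hypothetical names, and this is not how Karajan's parallelFor is actually implemented.

```python
import itertools
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def windowed_foreach(items, body, window):
    """Apply body to every item with at most `window` calls pending at once.

    items may be a lazy iterator; it is consumed only as slots free up, so
    the number of live tasks stays bounded however long the input is.
    Results are returned in completion order, not input order.
    """
    it = iter(items)
    results = []
    max_pending = 0
    with ThreadPoolExecutor(max_workers=window) as pool:
        pending = {pool.submit(body, x) for x in itertools.islice(it, window)}
        while pending:
            max_pending = max(max_pending, len(pending))
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            results.extend(f.result() for f in done)
            # refill one slot for each task that finished
            for x in itertools.islice(it, len(done)):
                pending.add(pool.submit(body, x))
    return results, max_pending


results, max_pending = windowed_foreach(range(10_000), lambda x: x * x, window=8)
print(len(results), max_pending)
```

With a lazy input such as range, the pending work stays proportional to the window rather than to n. The trade-off is that results arrive in completion order, so a real implementation would need to track indices if order matters.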

-- 


From hategan at mcs.anl.gov  Wed Feb 11 15:52:46 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 11 Feb 2009 15:52:46 -0600 (CST)
Subject: [Swift-devel] scheduler stuff for Google Summer of Code 2009
In-Reply-To: <D2D39F78-257C-4CA6-B215-FB7C4F93E8D0@anl.gov>
Message-ID: <31731535.2162031234389166830.JavaMail.root@zimbra>


----- Ian Foster <foster at anl.gov> wrote:
> I would argue that Swift support for collective operations also helps  
> with scaling. (We can run a computation with 1M tasks, if not all 1M  
> tasks are active at once.)

I don't think we can talk about an improvement if we're moving from 
"no can do" to "no can do in a different way".


From tfreeman at mcs.anl.gov  Wed Feb 11 16:12:34 2009
From: tfreeman at mcs.anl.gov (Tim Freeman)
Date: Wed, 11 Feb 2009 16:12:34 -0600
Subject: [Swift-devel] scheduler stuff for Google Summer of Code 2009
In-Reply-To: <529D3FFB-BA26-4736-A160-3AD51C823CA4@anl.gov>
References: <Pine.LNX.4.64.0902111119000.23512@dildano.hawaga.org.uk>
	<499325D1.9040601@mcs.anl.gov>
	<Pine.LNX.4.64.0902112029440.23512@dildano.hawaga.org.uk>
	<20090211145343.3df6ab8e@prnb>
	<529D3FFB-BA26-4736-A160-3AD51C823CA4@anl.gov>
Message-ID: <20090211161234.3de09250@prnb>

On Wed, 11 Feb 2009 15:34:33 -0600
Ian Foster <foster at anl.gov> wrote:

> UniCloud provides that functionality, too. I imagine there are others.

What Nimbus actually provides is quite different: a generic virtual cluster
configuration engine that runs on VMs from either EC2 or Nimbus clouds, or even
on clusters that span more than one.

There happens to be one instantiation of a cluster that has gram2 + Torque (it's
been around in some form for almost two years), but you can do anything (and
people have).

And it doesn't cost any extra money on top of what you pay EC2 (like UniCloud
would).

Tim

> On Feb 11, 2009, at 2:53 PM, Tim Freeman wrote:
> 
> > On Wed, 11 Feb 2009 20:33:16 +0000 (GMT)
> > Ben Clifford <benc at hawaga.org.uk> wrote:
> >
> > [...]
> >>
> >>> - running Swift on clouds like E2C and Azure
> >>
> >> Tim Freeman has done some work standing up clusters on multiple EC2 nodes,
> >> so that the multiple nodes are exposed through a single (perhaps GRAM?)
> >> interface. So putting Swift on the front of that seems straightforward to
> >> define.
> >
> > Yep, was just starting some 100+ node GRAM/PBS clusters there last week.
> >
> > Tim
> 


From foster at anl.gov  Wed Feb 11 16:15:45 2009
From: foster at anl.gov (Ian Foster)
Date: Wed, 11 Feb 2009 16:15:45 -0600
Subject: [Swift-devel] scheduler stuff for Google Summer of Code 2009
In-Reply-To: <31731535.2162031234389166830.JavaMail.root@zimbra>
References: <31731535.2162031234389166830.JavaMail.root@zimbra>
Message-ID: <1002CF15-4A70-4EA9-8373-10F1E795CAEB@anl.gov>

Mihael:

I don't know how to parse your comment.

If I write a program that performs a series of operations on many  
files, involving 1M tasks during its execution, but with only 10,000  
active at each step, why is that not interesting? Or are you
redefining the problem to "have 1M tasks active at once"? That is a  
useful thing to be able to do, I am sure, but that does not mean that  
the former is not useful also.

Ian.


On Feb 11, 2009, at 3:52 PM, Mihael Hategan wrote:

>
> ----- Ian Foster <foster at anl.gov> wrote:
>> I would argue that Swift support for collective operations also helps
>> with scaling. (We can run a computation with 1M tasks, if not all 1M
>> tasks are active at once.)
>
> I don't think we can talk about an improvement if we're moving from
> "no can do" to "no can do in a different way".


From wilde at mcs.anl.gov  Wed Feb 11 16:28:39 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 11 Feb 2009 16:28:39 -0600
Subject: [Swift-devel] swift tools directory
Message-ID: <49935117.3040500@mcs.anl.gov>

As was briefly discussed long ago, I'm going to make a tools/ directory 
in svn under https://svn.ci.uchicago.edu/svn/vdl2 (eg alongside things 
like provenancedb, www, etc)

This is to hold things like the following, some of which may migrate to 
dist/bin as their distribution location, when they are ready and accepted:
- enhanced swift run command (select multiple sites, etc)
- sites command to compose sites.xml more dynamically
- swift #include processor
- osg/tg site listing/status/checking tools
- bgp execution scripts

For now, collaborators using this stuff would check it out separately from 
the core of swift.

If anyone prefers a different name, place, or approach, let me know. 
I'll do this tonight, but can move/remove it as desired.


From benc at hawaga.org.uk  Wed Feb 11 16:30:57 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 11 Feb 2009 22:30:57 +0000 (GMT)
Subject: [Swift-devel] swift tools directory
In-Reply-To: <49935117.3040500@mcs.anl.gov>
References: <49935117.3040500@mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0902112230160.1293@dildano.hawaga.org.uk>


On Wed, 11 Feb 2009, Michael Wilde wrote:

> As was briefly discussed long ago, Im going to make a tools/ directory in svn
> under https://svn.ci.uchicago.edu/svn/vdl2 (eg alongside things like
> provenancedb, www, etc)

That seems fine.

-- 


From hategan at mcs.anl.gov  Wed Feb 11 16:39:22 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 11 Feb 2009 16:39:22 -0600 (CST)
Subject: [Swift-devel] scheduler stuff for Google Summer of Code 2009
In-Reply-To: <Pine.LNX.4.64.0902112146110.23512@dildano.hawaga.org.uk>
Message-ID: <8223033.2165841234391962754.JavaMail.root@zimbra>


----- Ben Clifford <benc at hawaga.org.uk> wrote:
> 
> On Wed, 11 Feb 2009, Mihael Hategan wrote:
> 
> > I do not think that this issue is the bottleneck here. For every application
> 
> I wasn't attempting to provide a comprehensive summary of stuff that won't 
> scale... mostly I wanted one example of another issue.
> 
> 
> 
> > So I think that one order of magnitude improvement could be achieved by 
> > addressing that 10-20k problem (or by somehow having fewer karajan threads).
> 
> 20k per app invocation is pretty heavyweight when invocations are 
> parallelised.
> 
> I suspect that fewer karajan threads at any point in time can be brought 
> about by some streaming-like approach - rather than n elements to iterate 
> over forking n karajan threads at by saying parallelFor and having a 
> simultaneous thread for each (of which most threads, for large enough n, 
> will block), that construct could perhaps be made to act over time. So 
> stream-like (lists spread over time as well as space) behaviour for 
> foreach.
> 

This may already happen in certain cases (e.g. a foreach acting on the
product of another foreach).


From hategan at mcs.anl.gov  Wed Feb 11 16:49:28 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 11 Feb 2009 16:49:28 -0600 (CST)
Subject: [Swift-devel] scheduler stuff for Google Summer of Code 2009
In-Reply-To: <1002CF15-4A70-4EA9-8373-10F1E795CAEB@anl.gov>
Message-ID: <973187.2171551234392568270.JavaMail.root@zimbra>

----- Ian Foster <foster at anl.gov> wrote:
> Mihael:
> 
> I don't know how to parse your comment.
> 
> If I write a program that performs a series of operations on many  
> files, involving 1M tasks during its execution, but with only 10,000  
> active at each step, why is that not interesting? Or are you
> redefining the problem to "have 1M tasks active at once"? That is a  
> useful thing to be able to do, I am sure, but that does not mean that  
> the former is not useful also.

In the context in which the engine cannot reasonably support 1M tasks, 
it seems fairly pointless to say that, e.g. a faster GRAM is better for
running 1M tasks with Swift. It makes no difference.


From aespinosa at cs.uchicago.edu  Wed Feb 11 19:07:10 2009
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Wed, 11 Feb 2009 19:07:10 -0600
Subject: [Swift-devel] data staging process/ documents?
Message-ID: <50b07b4b0902111707q7aad742fh89c0dad88f744ecf@mail.gmail.com>

Hi,

I am trying to work out how collective operations in (loosely coupled)
workflows work in general. My initial idea is that this happens during
the staging of data before a task in a workflow is executed.

Do we have documents describing these? I have a small idea on how it
works by monitoring my swift job's <workdir/> as a workflow executes.

My initial ideas are posted in
http://www.ci.uchicago.edu/wiki/bin/view/VDS/DslCS/CollectiveIO

-Allan


From hategan at mcs.anl.gov  Wed Feb 11 20:02:34 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 11 Feb 2009 20:02:34 -0600 (CST)
Subject: [Swift-devel] coaster one-liner bootstrap script
Message-ID: <12554808.2179801234404154196.JavaMail.root@zimbra>

cog r2297 contains a patch to transform the bootstrap script 
to a one-liner (thanks to Mike for the tips).

I did a sanity test on localhost.


From hategan at mcs.anl.gov  Wed Feb 11 20:04:06 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 11 Feb 2009 20:04:06 -0600 (CST)
Subject: [Swift-devel] java home misdetection
Message-ID: <15461697.2179831234404246287.JavaMail.root@zimbra>

Cog r2297 has a patch for the java_home issue. Since
JAVA_HOME doesn't appear to be needed, the script only
attempts to find a java executable.


From hategan at mcs.anl.gov  Wed Feb 11 20:06:19 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 11 Feb 2009 20:06:19 -0600 (CST)
Subject: [Swift-devel] gt2 and runaway jobs
Message-ID: <18026343.2179861234404379439.JavaMail.root@zimbra>

This is probably getting annoying.

swift r2525 has a fix for the issue introduced by the 
runaway jobs patch, namely if gt2 was used, jobs would
fail complaining about a bogus "tr" attribute. Or
something like that.


From wilde at mcs.anl.gov  Wed Feb 11 22:29:17 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 11 Feb 2009 22:29:17 -0600
Subject: [Swift-devel] Rename versions of Globus commands in swift/bin?
In-Reply-To: <4993A311.3060103@mcs.anl.gov>
References: <4988A6A9.6030302@renci.org> <4993561E.1060105@mcs.anl.gov>
	<Pine.LNX.4.64.0902112252200.23512@dildano.hawaga.org.uk>
	<49936388.80207@mcs.anl.gov> <49939761.9020002@renci.org>
	<4993A311.3060103@mcs.anl.gov>
Message-ID: <4993A59D.4010900@mcs.anl.gov>

Matt raises a good point below. Can we rename globus-url-copy, 
grid-proxy-init, and other commands that have the same name as Globus 
commands, to swift-url-copy, swift-proxy-init, etc?

Especially for those that are not identical enough to the Globus 
versions (where that is tbd).

It's extremely handy to have these commands there, but perhaps confusing 
for some users that get these in their paths ahead of the Globus 
versions. I realize calling them swift-* causes its own kind of 
confusion, but I'm excited that folks like Mats are installing Swift for 
users, and I'd like to remove any barriers, even the small ones.

This is likely to be a bigger issue for users with the OSG client stack 
installed.

On 2/11/09 10:18 PM, Michael Wilde wrote:
> 
> 
> On 2/11/09 9:28 PM, Mats Rynge wrote:

...
>>
>> Regarding bin/, it is pretty evil to have a globus-url-copy under the
>> swift bin/ which has a different set of command line options as the
>> Globus one. I had a plan on adding swift to the default path on our
>> submit node so all our users could use swift without doing anything
>> special. But having a different globus-url-copy means it would break
>> things for other users.
> 
> I know what you mean. I actually got bit the other day by our copy of 
> grid-proxy-init - I was seeing the globus one, and one of my users was 
> seeing the swift one, with a slightly different output that broke a script.
> 
> So I agree - if the swift version is not pretty near identical, it's
> better to give it another name. I'll paste your comment to the devel list
> with a suggestion that we rename it. Perhaps swift-proxy-init, 
> swift-url-copy? (not sure how that will go over... ;)
> 
> - Mike

Mats, I would just go ahead and remove or rename those, in the meantime.

I don't think anything points to them.

- Mike


From foster at anl.gov  Wed Feb 11 22:31:02 2009
From: foster at anl.gov (Ian Foster)
Date: Wed, 11 Feb 2009 22:31:02 -0600
Subject: [Swift-devel] Rename versions of Globus commands in swift/bin?
In-Reply-To: <4993A59D.4010900@mcs.anl.gov>
References: <4988A6A9.6030302@renci.org> <4993561E.1060105@mcs.anl.gov>
	<Pine.LNX.4.64.0902112252200.23512@dildano.hawaga.org.uk>
	<49936388.80207@mcs.anl.gov> <49939761.9020002@renci.org>
	<4993A311.3060103@mcs.anl.gov> <4993A59D.4010900@mcs.anl.gov>
Message-ID: <255CA353-FCE7-4A1E-B937-71EEF7E1E690@anl.gov>

Mike:

This begs the question for me as to why they are different. Are Swift  
proxies different from Globus proxies, for example? And if so, why?

Ian.


On Feb 11, 2009, at 10:29 PM, Michael Wilde wrote:

> Mats raises a good point below. Can we rename globus-url-copy,
> grid-proxy-init, and other commands that have the same name as Globus
> commands, to swift-url-copy, swift-proxy-init, etc?
>
> Especially for those that are not identical enough to the Globus  
> versions (where that is tbd).
>
> It's extremely handy to have these commands there, but perhaps
> confusing for some users that get these in their paths ahead of the  
> Globus versions. I realize calling them swift-* causes its own kind  
> of confusion, but I'm excited that folks like Mats are installing  
> Swift for users, and I'd like to remove any barriers, even the small  
> ones.
>
> This is likely to be a bigger issue for users with the OSG client  
> stack installed.
>
> On 2/11/09 10:18 PM, Michael Wilde wrote:
>> On 2/11/09 9:28 PM, Mats Rynge wrote:
>
> ...
>>>
>>> Regarding bin/, it is pretty evil to have a globus-url-copy under  
>>> the
>>> swift bin/ which has a different set of command line options as the
>>> Globus one. I had a plan on adding swift to the default path on our
>>> submit node so all our users could use swift without doing anything
>>> special. But having a different globus-url-copy means it would break
>>> things for other users.
>> I know what you mean. I actually got bit the other day by our copy  
>> of grid-proxy-init - I was seeing the globus one, and one of my  
>> users was seeing the swift one, with a slightly different output  
>> that broke a script.
>> So I agree - if the swift version is not pretty near identical, it's
>> better to give it another name. I'll paste your comment to the devel
>> list with a suggestion that we rename it. Perhaps swift-proxy-init,  
>> swift-url-copy? (not sure how that will go over... ;)
>> - Mike
>
> Mats, I would just go ahead and remove or rename those, in the  
> meantime.
>
> I don't think anything points to them.
>
> - Mike
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel


From foster at anl.gov  Wed Feb 11 22:36:45 2009
From: foster at anl.gov (Ian Foster)
Date: Wed, 11 Feb 2009 22:36:45 -0600
Subject: [Swift-devel] scheduler stuff for Google Summer of Code 2009
In-Reply-To: <973187.2171551234392568270.JavaMail.root@zimbra>
References: <973187.2171551234392568270.JavaMail.root@zimbra>
Message-ID: <14F191F5-CB8C-403B-A032-053334D9169D@anl.gov>

Mihael:

I think we are exploring the limits of email as a communication  
vehicle :-)

I wasn't talking about GRAM at all.

I had understood someone to say that we can't run 1M tasks because  
each task needs 10KB (or similar), and 1M*10KB is a lot.

I was observing that a workflow of 1M tasks can still be run if a  
smaller number are active at a time. That may seem like splitting  
hairs, but in fact that was a big reason for Swift's design. We would  
have applications that had multiple phases, each with thousands of  
tasks. As a DAG, this would not fit in memory (and was a pain to write  
of course). As a Swift program, we might have:

foreach i in [1:10000] { f(); }
foreach i in [1:10000] { g(); }
etc.
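To make the arithmetic concrete: at roughly 10 KB of state per task, 1,000,000 tasks held at once need about 10,000,000 KB (about 10 GB), while 10,000 active tasks need about 100,000 KB (about 100 MB). The Python sketch below illustrates the bounded-activity pattern described here; it is not how the Swift/Karajan engine actually schedules work, which is the very point questioned later in the thread. Each phase runs `n_tasks` calls, but only `max_active` worker coroutines exist, and task indices are drawn lazily from a shared iterator, so no state is created for tasks that have not started. The names `run_phase`, `f`, `g` and `workflow`, and the assumption that phases run strictly one after another, are illustrative.

```python
import asyncio
from typing import Awaitable, Callable, Dict

KB = 1000  # decimal units, matching the rough "10KB per task" estimate


def state_bytes(active_tasks: int, per_task_bytes: int = 10 * KB) -> int:
    """Memory needed if `active_tasks` tasks each hold `per_task_bytes`."""
    return active_tasks * per_task_bytes


async def run_phase(n_tasks: int, max_active: int,
                    work: Callable[[int], Awaitable[None]]) -> Dict[str, int]:
    """Run work(0) .. work(n_tasks - 1) with at most max_active in flight."""
    indices = iter(range(n_tasks))  # lazy: no per-task objects up front
    stats = {"active": 0, "peak": 0, "done": 0}

    async def worker() -> None:
        # All workers share one iterator, so each index is taken exactly once.
        for i in indices:
            stats["active"] += 1
            stats["peak"] = max(stats["peak"], stats["active"])
            await work(i)
            stats["active"] -= 1
            stats["done"] += 1

    await asyncio.gather(*(worker() for _ in range(max_active)))
    return stats


async def f(i: int) -> None:
    await asyncio.sleep(0)  # stand-in for a real task in phase one


async def g(i: int) -> None:
    await asyncio.sleep(0)  # stand-in for a real task in phase two


async def workflow(n_tasks: int, max_active: int) -> Dict[str, int]:
    # Two phases, one after the other, like the two foreach loops above.
    first = await run_phase(n_tasks, max_active, f)
    second = await run_phase(n_tasks, max_active, g)
    return {"done": first["done"] + second["done"],
            "peak": max(first["peak"], second["peak"])}


if __name__ == "__main__":
    print(state_bytes(1_000_000))  # 10_000_000_000 bytes, about 10 GB
    print(state_bytes(10_000))     # 100_000_000 bytes, about 100 MB
    print(asyncio.run(workflow(10_000, 100)))
```

The key design choice is that the number of live worker objects is fixed at `max_active`; creating one coroutine per task up front (for example, passing all 1M calls to `asyncio.gather`) would reintroduce the memory problem.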

Ian.


On Feb 11, 2009, at 4:49 PM, Mihael Hategan wrote:

> ----- Ian Foster <foster at anl.gov> wrote:
>> Mihael:
>>
>> I don't know how to parse your comment.
>>
>> If I write a program that performs a series of operations on many
>> files, involving 1M tasks during its execution, but with only 10,000
>> active at each step, why is that not interesting? Or are you re-
>> defining the problem to "have 1M tasks active at once"? That is a
>> useful thing to be able to do, I am sure, but that does not mean that
>> the former is not useful also.
>
> In the context in which the engine cannot reasonably support 1M tasks,
> it seems fairly pointless to say that, e.g. a faster GRAM is better  
> for
> running 1M tasks with Swift. It makes no difference.


From wilde at mcs.anl.gov  Wed Feb 11 22:37:35 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 11 Feb 2009 22:37:35 -0600
Subject: [Swift-devel] Rename versions of Globus commands in swift/bin?
In-Reply-To: <255CA353-FCE7-4A1E-B937-71EEF7E1E690@anl.gov>
References: <4988A6A9.6030302@renci.org> <4993561E.1060105@mcs.anl.gov>
	<Pine.LNX.4.64.0902112252200.23512@dildano.hawaga.org.uk>
	<49936388.80207@mcs.anl.gov> <49939761.9020002@renci.org>
	<4993A311.3060103@mcs.anl.gov> <4993A59D.4010900@mcs.anl.gov>
	<255CA353-FCE7-4A1E-B937-71EEF7E1E690@anl.gov>
Message-ID: <4993A78F.4050604@mcs.anl.gov>

These are cog-based versions of the commands.

The cool thing is that users get this core Globus functionality with no 
compilation needed: just untar Swift, and poof, you can make proxies, 
run jobs, move files (e.g. for setting this up on remote grid sites).

The issue I bumped into with proxies was cosmetic. The proxies are 
totally compatible as far as I know.  I just happened to have a 
front-end script for running swift code that checked to make sure the 
user has a valid proxy with some time left. And the time format returned 
by the swift grid-proxy-info was slightly different than the Globus one.

That broke my brittle little script - it's not a criticism of the cog
code version.
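One way to make such a check less brittle is to normalize whatever the tool prints to a number of seconds before comparing. The sketch below is a hypothetical helper, not part of Swift or CoG. It assumes the value is either a bare count of seconds (the Globus C `grid-proxy-info -timeleft` prints seconds) or an `H:MM:SS` value, optionally prefixed by a `timeleft :` label. The exact format printed by the CoG `grid-proxy-info` is not given in this thread, so the accepted formats are assumptions.

```python
import re

# Assumed formats: "43198", "11:59:58", or "timeleft : 11:59:58".
_TIMELEFT = re.compile(r"(?:timeleft\s*:\s*)?(\d+)(?::(\d{1,2}):(\d{1,2}))?",
                       re.IGNORECASE)


def timeleft_seconds(text: str) -> int:
    """Convert a proxy time-left value to seconds, or raise ValueError."""
    m = _TIMELEFT.fullmatch(text.strip())
    if m is None:
        raise ValueError(f"unrecognized time-left value: {text!r}")
    first, minutes, seconds = m.groups()
    if minutes is None:  # bare number of seconds
        return int(first)
    return int(first) * 3600 + int(minutes) * 60 + int(seconds)


def proxy_has_time(text: str, min_seconds: int) -> bool:
    """True if the reported remaining lifetime is at least min_seconds."""
    return timeleft_seconds(text) >= min_seconds
```

Both `timeleft_seconds("43198")` and `timeleft_seconds("timeleft : 11:59:58")` return 43198, so a front-end script can call `proxy_has_time(output, 3600)` regardless of which spelling it received.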

But it's easy enough for us to refer, in swift docs, to swift-proxy-init,
swift-proxy-info with a * explaining that the Globus versions of these 
are fine if you happen to have them.

- Mike


On 2/11/09 10:31 PM, Ian Foster wrote:
> Mike:
> 
> This begs the question for me as to why they are different. Are Swift 
> proxies different from Globus proxies, for example? And if so, why?
> 
> Ian.
> 
> 
> On Feb 11, 2009, at 10:29 PM, Michael Wilde wrote:
> 
>> Mats raises a good point below. Can we rename globus-url-copy,
>> grid-proxy-init, and other commands that have the same name as Globus 
>> commands, to swift-url-copy, swift-proxy-init, etc?
>>
>> Especially for those that are not identical enough to the Globus 
>> versions (where that is tbd).
>>
>> It's extremely handy to have these commands there, but perhaps
>> confusing for some users that get these in their paths ahead of the 
>> Globus versions. I realize calling them swift-* causes its own kind of 
>> confusion, but I'm excited that folks like Mats are installing Swift 
>> for users, and I'd like to remove any barriers, even the small ones.
>>
>> This is likely to be a bigger issue for users with the OSG client 
>> stack installed.
>>
>> On 2/11/09 10:18 PM, Michael Wilde wrote:
>>> On 2/11/09 9:28 PM, Mats Rynge wrote:
>>
>> ...
>>>>
>>>> Regarding bin/, it is pretty evil to have a globus-url-copy under the
>>>> swift bin/ which has a different set of command line options as the
>>>> Globus one. I had a plan on adding swift to the default path on our
>>>> submit node so all our users could use swift without doing anything
>>>> special. But having a different globus-url-copy means it would break
>>>> things for other users.
>>> I know what you mean. I actually got bit the other day by our copy of 
>>> grid-proxy-init - I was seeing the globus one, and one of my users 
>>> was seeing the swift one, with a slightly different output that broke 
>>> a script.
>>> So I agree - if the swift version is not pretty near identical, it's
>>> better to give it another name. I'll paste your comment to the devel
>>> list with a suggestion that we rename it. Perhaps swift-proxy-init, 
>>> swift-url-copy? (not sure how that will go over... ;)
>>> - Mike
>>
>> Mats, I would just go ahead and remove or rename those, in the meantime.
>>
>> I don't think anything points to them.
>>
>> - Mike
>>
> 


From hategan at mcs.anl.gov  Wed Feb 11 23:06:13 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 11 Feb 2009 23:06:13 -0600
Subject: [Swift-devel] Rename versions of Globus commands in swift/bin?
In-Reply-To: <4993A59D.4010900@mcs.anl.gov>
References: <4988A6A9.6030302@renci.org> <4993561E.1060105@mcs.anl.gov>
	<Pine.LNX.4.64.0902112252200.23512@dildano.hawaga.org.uk>
	<49936388.80207@mcs.anl.gov> <49939761.9020002@renci.org>
	<4993A311.3060103@mcs.anl.gov>  <4993A59D.4010900@mcs.anl.gov>
Message-ID: <1234415173.1513.1.camel@localhost>

On Wed, 2009-02-11 at 22:29 -0600, Michael Wilde wrote:
> Mats raises a good point below. Can we rename globus-url-copy,
> grid-proxy-init, and other commands that have the same name as Globus 
> commands, to swift-url-copy, swift-proxy-init, etc?

I think less confusing and more accurate names would be cog-*.

> 
> Especially for those that are not identical enough to the Globus 
> versions (where that is tbd).
> 
> It's extremely handy to have these commands there, but perhaps confusing
> for some users that get these in their paths ahead of the Globus 
> versions. I realize calling them swift-* causes its own kind of 
> confusion, but I'm excited that folks like Mats are installing Swift for 
> users, and I'd like to remove any barriers, even the small ones.
> 
> This is likely to be a bigger issue for users with the OSG client stack 
> installed.
> 
> On 2/11/09 10:18 PM, Michael Wilde wrote:
> > 
> > 
> > On 2/11/09 9:28 PM, Mats Rynge wrote:
> 
> ...
> >>
> >> Regarding bin/, it is pretty evil to have a globus-url-copy under the
> >> swift bin/ which has a different set of command line options as the
> >> Globus one. I had a plan on adding swift to the default path on our
> >> submit node so all our users could use swift without doing anything
> >> special. But having a different globus-url-copy means it would break
> >> things for other users.
> > 
> > I know what you mean. I actually got bit the other day by our copy of 
> > grid-proxy-init - I was seeing the globus one, and one of my users was 
> > seeing the swift one, with a slightly different output that broke a script.
> > 
> > So I agree - if the swift version is not pretty near identical, it's
> > better to give it another name. I'll paste your comment to the devel list
> > with a suggestion that we rename it. Perhaps swift-proxy-init, 
> > swift-url-copy? (not sure how that will go over... ;)
> > 
> > - Mike
> 
> Mats, I would just go ahead and remove or rename those, in the meantime.
> 
> I don't think anything points to them.
> 
> - Mike
> 


From hategan at mcs.anl.gov  Wed Feb 11 23:31:14 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 11 Feb 2009 23:31:14 -0600
Subject: [Swift-devel] scheduler stuff for Google Summer of Code 2009
In-Reply-To: <14F191F5-CB8C-403B-A032-053334D9169D@anl.gov>
References: <973187.2171551234392568270.JavaMail.root@zimbra>
	<14F191F5-CB8C-403B-A032-053334D9169D@anl.gov>
Message-ID: <1234416674.1513.26.camel@localhost>

On Wed, 2009-02-11 at 22:36 -0600, Ian Foster wrote:
> Mihael:
> 
> I think we are exploring the limits of email as a communication  
> vehicle :-)

I just think some subjects require more emails than others :)

> 
> I wasn't talking about GRAM at all.

I know. I gave it as an example, because we both know clearly what it
is.

Let me back up a bit:

>> If I write a program that performs a series of operations on many
>> files, involving 1M tasks during its execution, but with only 10,000
>> active at each step, why is that not interesting?

It is interesting, but the part where you can have a 1M workflow in
Swift with only 10,000 Karajan threads active at a time is what I'm
questioning.

So before we talk about better I/O or job execution providers for the 1M
workflow, we need to make sure that the engine can run the 1M workflow
in the first place. Otherwise the I/O can be as fast as you want and
swift still won't run the workflow.


From hategan at mcs.anl.gov  Wed Feb 11 23:33:40 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 11 Feb 2009 23:33:40 -0600
Subject: [Swift-devel] Rename versions of Globus commands in swift/bin?
In-Reply-To: <4993A78F.4050604@mcs.anl.gov>
References: <4988A6A9.6030302@renci.org> <4993561E.1060105@mcs.anl.gov>
	<Pine.LNX.4.64.0902112252200.23512@dildano.hawaga.org.uk>
	<49936388.80207@mcs.anl.gov> <49939761.9020002@renci.org>
	<4993A311.3060103@mcs.anl.gov> <4993A59D.4010900@mcs.anl.gov>
	<255CA353-FCE7-4A1E-B937-71EEF7E1E690@anl.gov>
	<4993A78F.4050604@mcs.anl.gov>
Message-ID: <1234416820.1513.30.camel@localhost>

On Wed, 2009-02-11 at 22:37 -0600, Michael Wilde wrote:
> These are cog-based versions of the commands.
> 
> The cool thing is that users get this core Globus functionality with no 
> compilation needed: just untar Swift, and poof, you can make proxies, 
> run jobs, move files (e.g. for setting this up on remote grid sites).

We rename them to cog-* then?

> 
> The issue I bumped into with proxies was cosmetic. The proxies are 
> totally compatible as far as I know.

Yes.

>   I just happened to have a 
> front-end script for running swift code that checked to make sure the 
> user has a valid proxy with some time left. And the time format returned 
> by the swift grid-proxy-info was slightly different than the Globus one.
> 
> That broke my brittle little script - it's not a criticism of the cog
> code version.

Though that may be possible to fix. url-copy not so much.

> 
> But it's easy enough for us to refer, in swift docs, to swift-proxy-init,
> swift-proxy-info with a * explaining that the Globus versions of these 
> are fine if you happen to have them.

Or the other way around. We use globus-* by default and then say that
the user could use the other ones.

> 
> - Mike
> 
> 
> On 2/11/09 10:31 PM, Ian Foster wrote:
> > Mike:
> > 
> > This begs the question for me as to why they are different. Are Swift 
> > proxies different from Globus proxies, for example? And if so, why?
> > 
> > Ian.
> > 
> > 
> > On Feb 11, 2009, at 10:29 PM, Michael Wilde wrote:
> > 
> >> Mats raises a good point below. Can we rename globus-url-copy,
> >> grid-proxy-init, and other commands that have the same name as Globus 
> >> commands, to swift-url-copy, swift-proxy-init, etc?
> >>
> >> Especially for those that are not identical enough to the Globus 
> >> versions (where that is tbd).
> >>
> >> It's extremely handy to have these commands there, but perhaps
> >> confusing for some users that get these in their paths ahead of the 
> >> Globus versions. I realize calling them swift-* causes its own kind of 
> >> confusion, but I'm excited that folks like Mats are installing Swift 
> >> for users, and I'd like to remove any barriers, even the small ones.
> >>
> >> This is likely to be a bigger issue for users with the OSG client 
> >> stack installed.
> >>
> >> On 2/11/09 10:18 PM, Michael Wilde wrote:
> >>> On 2/11/09 9:28 PM, Mats Rynge wrote:
> >>
> >> ...
> >>>>
> >>>> Regarding bin/, it is pretty evil to have a globus-url-copy under the
> >>>> swift bin/ which has a different set of command line options as the
> >>>> Globus one. I had a plan on adding swift to the default path on our
> >>>> submit node so all our users could use swift without doing anything
> >>>> special. But having a different globus-url-copy means it would break
> >>>> things for other users.
> >>> I know what you mean. I actually got bit the other day by our copy of 
> >>> grid-proxy-init - I was seeing the globus one, and one of my users 
> >>> was seeing the swift one, with a slightly different output that broke 
> >>> a script.
> >>> So I agree - if the swift version is not pretty near identical, it's
> >>> better to give it another name. I'll paste your comment to the devel
> >>> list with a suggestion that we rename it. Perhaps swift-proxy-init, 
> >>> swift-url-copy? (not sure how that will go over... ;)
> >>> - Mike
> >>
> >> Mats, I would just go ahead and remove or rename those, in the meantime.
> >>
> >> I don't think anything points to them.
> >>
> >> - Mike
> >>
> > 


From foster at anl.gov  Wed Feb 11 23:38:30 2009
From: foster at anl.gov (Ian Foster)
Date: Wed, 11 Feb 2009 23:38:30 -0600
Subject: [Swift-devel] Rename versions of Globus commands in swift/bin?
In-Reply-To: <1234416820.1513.30.camel@localhost>
References: <4988A6A9.6030302@renci.org> <4993561E.1060105@mcs.anl.gov>
	<Pine.LNX.4.64.0902112252200.23512@dildano.hawaga.org.uk>
	<49936388.80207@mcs.anl.gov> <49939761.9020002@renci.org>
	<4993A311.3060103@mcs.anl.gov> <4993A59D.4010900@mcs.anl.gov>
	<255CA353-FCE7-4A1E-B937-71EEF7E1E690@anl.gov>
	<4993A78F.4050604@mcs.anl.gov> <1234416820.1513.30.camel@localhost>
Message-ID: <5CA10434-F090-41B6-A1C4-9AB67B0A88EC@anl.gov>

My view is that the CoG and C versions of basic Globus commands should  
have the same behavior. If they do not, that is a bug. It should be  
reported and fixed, not worked around. I appreciate that others may  
not share that perspective.


On Feb 11, 2009, at 11:33 PM, Mihael Hategan wrote:

> On Wed, 2009-02-11 at 22:37 -0600, Michael Wilde wrote:
>> These are cog-based versions of the commands.
>>
>> The cool thing is that users get this core Globus functionality  
>> with no
>> compilation needed: just untar Swift, and poof, you can make proxies,
>> run jobs, move files (e.g. for setting this up on remote grid sites).
>
> We rename them to cog-* then?
>
>>
>> The issue I bumped into with proxies was cosmetic. The proxies are
>> totally compatible as far as I know.
>
> Yes.
>
>>  I just happened to have a
>> front-end script for running swift code that checked to make sure the
>> user has a valid proxy with some time left. And the time format  
>> returned
>> by the swift grid-proxy-info was slightly different than the Globus  
>> one.
>>
>> That broke my brittle little script - it's not a criticism of the cog
>> code version.
>
> Though that may be possible to fix. url-copy not so much.
>
>>
>> But it's easy enough for us to refer, in swift docs, to
>> swift-proxy-init,
>> swift-proxy-info with a * explaining that the Globus versions of  
>> these
>> are fine if you happen to have them.
>
> Or the other way around. We use globus-* by default and then say that
> the user could use the other ones.
>
>>
>> - Mike
>>
>>
>> On 2/11/09 10:31 PM, Ian Foster wrote:
>>> Mike:
>>>
>>> This begs the question for me as to why they are different. Are  
>>> Swift
>>> proxies different from Globus proxies, for example? And if so, why?
>>>
>>> Ian.
>>>
>>>
>>> On Feb 11, 2009, at 10:29 PM, Michael Wilde wrote:
>>>
>>>> Mats raises a good point below. Can we rename globus-url-copy,
>>>> grid-proxy-init, and other commands that have the same name as  
>>>> Globus
>>>> commands, to swift-url-copy, swift-proxy-init, etc?
>>>>
>>>> Especially for those that are not identical enough to the Globus
>>>> versions (where that is tbd).
>>>>
>>>> It's extremely handy to have these commands there, but perhaps
>>>> confusing for some users that get these in their paths ahead of the
>>>> Globus versions. I realize calling them swift-* causes its own  
>>>> kind of
>>>> confusion, but I'm excited that folks like Mats are installing  
>>>> Swift
>>>> for users, and I'd like to remove any barriers, even the small  
>>>> ones.
>>>>
>>>> This is likely to be a bigger issue for users with the OSG client
>>>> stack installed.
>>>>
>>>> On 2/11/09 10:18 PM, Michael Wilde wrote:
>>>>> On 2/11/09 9:28 PM, Mats Rynge wrote:
>>>>
>>>> ...
>>>>>>
>>>>>> Regarding bin/, it is pretty evil to have a globus-url-copy  
>>>>>> under the
>>>>>> swift bin/ which has a different set of command line options as  
>>>>>> the
>>>>>> Globus one. I had a plan on adding swift to the default path on  
>>>>>> our
>>>>>> submit node so all our users could use swift without doing  
>>>>>> anything
>>>>>> special. But having a different globus-url-copy means it would  
>>>>>> break
>>>>>> things for other users.
>>>>> I know what you mean. I actually got bit the other day by our  
>>>>> copy of
>>>>> grid-proxy-init - I was seeing the globus one, and one of my users
>>>>> was seeing the swift one, with a slightly different output that  
>>>>> broke
>>>>> a script.
>>>>> So I agree - if the swift version is not pretty near identical,  
>>>>> it's
>>>>> better to give it another name. I'll paste your comment to the
>>>>> devel
>>>>> list with a suggestion that we rename it. Perhaps swift-proxy- 
>>>>> init,
>>>>> swift-url-copy? (not sure how that will go over... ;)
>>>>> - Mike
>>>>
>>>> Mats, I would just go ahead and remove or rename those, in the  
>>>> meantime.
>>>>
>>>> I don't think anything points to them.
>>>>
>>>> - Mike
>>>>
>>>
>


From wilde at mcs.anl.gov  Wed Feb 11 23:46:20 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 11 Feb 2009 23:46:20 -0600
Subject: [Swift-devel] Rename versions of Globus commands in swift/bin?
In-Reply-To: <1234416820.1513.30.camel@localhost>
References: <4988A6A9.6030302@renci.org> <4993561E.1060105@mcs.anl.gov>	
	<Pine.LNX.4.64.0902112252200.23512@dildano.hawaga.org.uk>	
	<49936388.80207@mcs.anl.gov> <49939761.9020002@renci.org>	
	<4993A311.3060103@mcs.anl.gov> <4993A59D.4010900@mcs.anl.gov>	
	<255CA353-FCE7-4A1E-B937-71EEF7E1E690@anl.gov>	
	<4993A78F.4050604@mcs.anl.gov> <1234416820.1513.30.camel@localhost>
Message-ID: <4993B7AC.3020408@mcs.anl.gov>


On 2/11/09 11:33 PM, Mihael Hategan wrote:
> On Wed, 2009-02-11 at 22:37 -0600, Michael Wilde wrote:
>> These are cog-based versions of the commands.
>>
>> The cool thing is that users get this core Globus functionality with no 
>> compilation needed: just untar Swift, and poof, you can make proxies, 
>> run jobs, move files (e.g. for setting this up on remote grid sites).
> 
> We rename them to cog-* then?

That would be fine by me. Unless Ben weighs in with a different view, 
please do.

>> The issue I bumped into with proxies was cosmetic. The proxies are 
>> totally compatible as far as I know.
> 
> Yes.
> 
>>   I just happened to have a 
>> front-end script for running swift code that checked to make sure the 
>> user has a valid proxy with some time left. And the time format returned 
>> by the swift grid-proxy-info was slightly different than the Globus one.
>>
>> That broke my brittle little script - it's not a criticism of the cog
>> code version.
> 
> Though that may be possible to fix. url-copy not so much.

I think this is pretty low on the prio list. If it's easy, yes, please
do. Else file it as low prio in bugzilla, imo.

>> But it's easy enough for us to refer, in swift docs, to swift-proxy-init,
>> swift-proxy-info with a * explaining that the Globus versions of these 
>> are fine if you happen to have them.
> 
> Or the other way around. We use globus-* by default and then say that
> the user could use the other ones.

Yes, that would be better. OSG and TG users are likely to have the 
globus- versions in their paths. New users running swift on their own 
hosts are likely not to.
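If wrapper scripts ever need to follow that convention automatically, a minimal sketch is to prefer the Globus command when it is on the PATH and fall back to a renamed CoG copy otherwise. The `cog-*` names below are hypothetical: they are only the renaming proposed in this thread, not commands that ship with Swift or CoG. `pick_command` and `COG_FALLBACK` are illustrative names.

```python
import shutil
from typing import Dict, Optional

# Hypothetical fallback names, following the cog-* renaming proposed above.
COG_FALLBACK: Dict[str, str] = {
    "grid-proxy-init": "cog-proxy-init",
    "grid-proxy-info": "cog-proxy-info",
    "globus-url-copy": "cog-url-copy",
}


def pick_command(globus_name: str,
                 search_path: Optional[str] = None) -> Optional[str]:
    """Return the Globus command if installed, else its CoG fallback, else None."""
    found = shutil.which(globus_name, path=search_path)
    if found is not None:
        return found  # OSG/TG users: the Globus version wins
    fallback = COG_FALLBACK.get(globus_name)
    if fallback is None:
        return None
    return shutil.which(fallback, path=search_path)
```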

At the moment, the user guide doesn't cover this at all; it's a future
issue for when it does.

- Mike

>> - Mike
>>
>>
>> On 2/11/09 10:31 PM, Ian Foster wrote:
>>> Mike:
>>>
>>> This begs the question for me as to why they are different. Are Swift 
>>> proxies different from Globus proxies, for example? And if so, why?
>>>
>>> Ian.
>>>
>>>
>>> On Feb 11, 2009, at 10:29 PM, Michael Wilde wrote:
>>>
>>>> Mats raises a good point below. Can we rename globus-url-copy,
>>>> grid-proxy-init, and other commands that have the same name as Globus 
>>>> commands, to swift-url-copy, swift-proxy-init, etc?
>>>>
>>>> Especially for those that are not identical enough to the Globus 
>>>> versions (where that is tbd).
>>>>
>>>> It's extremely handy to have these commands there, but perhaps
>>>> confusing for some users that get these in their paths ahead of the 
>>>> Globus versions. I realize calling them swift-* causes its own kind of 
>>>> confusion, but I'm excited that folks like Mats are installing Swift 
>>>> for users, and I'd like to remove any barriers, even the small ones.
>>>>
>>>> This is likely to be a bigger issue for users with the OSG client 
>>>> stack installed.
>>>>
>>>> On 2/11/09 10:18 PM, Michael Wilde wrote:
>>>>> On 2/11/09 9:28 PM, Mats Rynge wrote:
>>>> ...
>>>>>> Regarding bin/, it is pretty evil to have a globus-url-copy under the
>>>>>> swift bin/ which has a different set of command line options as the
>>>>>> Globus one. I had a plan on adding swift to the default path on our
>>>>>> submit node so all our users could use swift without doing anything
>>>>>> special. But having a different globus-url-copy means it would break
>>>>>> things for other users.
>>>>> I know what you mean. I actually got bit the other day by our copy of 
>>>>> grid-proxy-init - I was seeing the globus one, and one of my users 
>>>>> was seeing the swift one, with a slightly different output that broke 
>>>>> a script.
>>>>> So I agree - if the swift version is not pretty near identical, it's
>>>>> better to give it another name. I'll paste your comment to the devel
>>>>> list with a suggestion that we rename it. Perhaps swift-proxy-init, 
>>>>> swift-url-copy? (not sure how that will go over... ;)
>>>>> - Mike
>>>> Mats, I would just go ahead and remove or rename those, in the meantime.
>>>>
>>>> I don't think anything points to them.
>>>>
>>>> - Mike
>>>>
> 


From hategan at mcs.anl.gov  Thu Feb 12 00:01:35 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 12 Feb 2009 00:01:35 -0600
Subject: [Swift-devel] Rename versions of Globus commands in swift/bin?
In-Reply-To: <5CA10434-F090-41B6-A1C4-9AB67B0A88EC@anl.gov>
References: <4988A6A9.6030302@renci.org> <4993561E.1060105@mcs.anl.gov>
	<Pine.LNX.4.64.0902112252200.23512@dildano.hawaga.org.uk>
	<49936388.80207@mcs.anl.gov> <49939761.9020002@renci.org>
	<4993A311.3060103@mcs.anl.gov> <4993A59D.4010900@mcs.anl.gov>
	<255CA353-FCE7-4A1E-B937-71EEF7E1E690@anl.gov>
	<4993A78F.4050604@mcs.anl.gov> <1234416820.1513.30.camel@localhost>
	<5CA10434-F090-41B6-A1C4-9AB67B0A88EC@anl.gov>
Message-ID: <1234418495.1513.35.camel@localhost>

On Wed, 2009-02-11 at 23:38 -0600, Ian Foster wrote:
> My view is that the CoG and C versions of basic Globus commands should  
> have the same behavior. If they do not, that is a bug. It should be  
> reported and fixed, not worked around. I appreciate that others may  
> not share that perspective.

I very much agree. But it seems there aren't many resources to fix those
bugs, so we end up working around them.


From hategan at mcs.anl.gov  Thu Feb 12 00:03:40 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 12 Feb 2009 00:03:40 -0600
Subject: [Swift-devel] Rename versions of Globus commands in swift/bin?
In-Reply-To: <4993B7AC.3020408@mcs.anl.gov>
References: <4988A6A9.6030302@renci.org> <4993561E.1060105@mcs.anl.gov>
	<Pine.LNX.4.64.0902112252200.23512@dildano.hawaga.org.uk>
	<49936388.80207@mcs.anl.gov> <49939761.9020002@renci.org>
	<4993A311.3060103@mcs.anl.gov> <4993A59D.4010900@mcs.anl.gov>
	<255CA353-FCE7-4A1E-B937-71EEF7E1E690@anl.gov>
	<4993A78F.4050604@mcs.anl.gov> <1234416820.1513.30.camel@localhost>
	<4993B7AC.3020408@mcs.anl.gov>
Message-ID: <1234418620.1513.36.camel@localhost>

On Wed, 2009-02-11 at 23:46 -0600, Michael Wilde wrote:
> >> That broke my brittle little script - it's not a criticism of the cog
> >> code version.
> > 
> > Though that may be possible to fix. url-copy not so much.
> 
> I think this is pretty low on the prio list. If it's easy, yes, please
> do. Else file it as low prio in bugzilla, imo.

I can ping Rachana.


From wilde at mcs.anl.gov  Thu Feb 12 01:05:51 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 12 Feb 2009 01:05:51 -0600
Subject: [Swift-devel] coaster one-liner bootstrap script
In-Reply-To: <12554808.2179801234404154196.JavaMail.root@zimbra>
References: <12554808.2179801234404154196.JavaMail.root@zimbra>
Message-ID: <4993CA4F.3030609@mcs.anl.gov>

I updated cog and swift to the latest, and tried on teraport.

sites.xml was:

<config>
<pool handle="teraport" >
   <profile namespace="globus" key="queue">fast</profile>
   <profile namespace="globus" key="coasterWorkerMaxwalltime">00:05:00</profile>
   <gridftp url="gsiftp://tp-grid1.ci.uchicago.edu" />
   <execution provider="coaster" url="tp-grid1.ci.uchicago.edu" jobmanager="gt2:gt2:pbs" />
   <workdirectory>/gpfs1/osg/data/oops/swiftwork</workdirectory>
</pool>

</config>
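When a run fails early like this, it can help to confirm what the pool entry actually declares. A small sketch using Python's standard `xml.etree.ElementTree` to pull the profiles and execution attributes out of a sites.xml file follows; `pool_summary` is a hypothetical helper, and it only reads the elements that appear in the file above.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def pool_summary(xml_text: str) -> Dict[str, dict]:
    """Map each pool handle to its profiles, execution attributes and workdir."""
    root = ET.fromstring(xml_text)
    summary: Dict[str, dict] = {}
    for pool in root.findall("pool"):
        # Profiles are keyed by (namespace, key), e.g. ("globus", "queue").
        profiles = {
            (p.get("namespace"), p.get("key")): (p.text or "").strip()
            for p in pool.findall("profile")
        }
        execution = pool.find("execution")
        summary[pool.get("handle")] = {
            "profiles": profiles,
            "execution": dict(execution.attrib) if execution is not None else {},
            "workdirectory": (pool.findtext("workdirectory") or "").strip(),
        }
    return summary
```

Passing in the config above yields, for the `teraport` handle, `provider="coaster"`, `jobmanager="gt2:gt2:pbs"` and the two `globus` profiles (`queue` and `coasterWorkerMaxwalltime`).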


I got: coaster-bootstrap.list not found in classpath

Output is below. I lost the log, but it didn't say much more than what's
below.

/home/wilde/swift/tools/swiftrun: Swift script oops7.swift starting at Wed Feb 11 22:50:33 CST 2009
running on sites: teraport.coaster.gt2.osg

Swift svn swift-r2527 cog-r2297

RunID: 20090211-2250-gkzsaa90
Progress:
Progress:  Stage in:1 Initializing site shared directory:1
Progress:  Stage in:1 Submitting:1
Warning: missing walltime specification for "runoops". Assuming 10 minutes.
Failed to transfer wrapper log from oops7-20090211-2250-gkzsaa90/info/7 
on teraport
Execution failed:
         Failed to transfer wrapper log from 
oops7-20090211-2250-gkzsaa90/info/8 on teraport
Exception in runoops:
Arguments: [input/fasta/T1af7.fasta, input/secseq/T1af7.secseq, 
input/native/T1af7.pdb, output/T1af7.0.pdt, output/T1af7.0.rmsd, 0, TEMP 
UPDATE INTERVAL = 10, SMOOTH DEVIATION COEFFICIENT = 0.80001]
Host: teraport
Directory: oops7-20090211-2250-gkzsaa90/jobs/7/runoops-7aj2uh6j
stderr.txt:

stdout.txt:

----

Caused by:
         Could not submit job
Caused by:
         Could not start coaster service
Caused by:
         java.lang.RuntimeException: coaster-bootstrap.list not found in 
classpath
Cleaning up...
  Done

/home/wilde/swift/tools/swiftrun: Swift Script oops7.swift ended at Wed 
Feb 11 22:50:40 CST 2009 with exit code 0


On 2/11/09 8:02 PM, Mihael Hategan wrote:
> cog r2297 contains a patch to transform the bootstrap script 
> to a one-liner (thanks to Mike for the tips).
> 
> I did a sanity test on localhost.
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel


From benc at hawaga.org.uk  Thu Feb 12 03:23:13 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 12 Feb 2009 09:23:13 +0000 (GMT)
Subject: [Swift-devel] Rename versions of Globus commands in swift/bin?
In-Reply-To: <5CA10434-F090-41B6-A1C4-9AB67B0A88EC@anl.gov>
References: <4988A6A9.6030302@renci.org> <4993561E.1060105@mcs.anl.gov>
	<Pine.LNX.4.64.0902112252200.23512@dildano.hawaga.org.uk>
	<49936388.80207@mcs.anl.gov>
	<49939761.9020002@renci.org> <4993A311.3060103@mcs.anl.gov>
	<4993A59D.4010900@mcs.anl.gov>
	<255CA353-FCE7-4A1E-B937-71EEF7E1E690@anl.gov>
	<4993A78F.4050604@mcs.anl.gov> <1234416820.1513.30.camel@localhost>
	<5CA10434-F090-41B6-A1C4-9AB67B0A88EC@anl.gov>
Message-ID: <Pine.LNX.4.64.0902120919460.1293@dildano.hawaga.org.uk>


On Wed, 11 Feb 2009, Ian Foster wrote:

> My view is that the CoG and C versions of basic Globus commands should have
> the same behavior. If they do not, that is a bug. It should be reported and
> fixed, not worked around. I appreciate that others may not share that
> perspective.

I tend to agree (especially when they share the same filename).

But these are cog vs GT user interface bugs, not Swift bugs, and I don't 
think Swift developers should expend any non-trivial amount of work fixing 
them (although reporting them to the appropriate developers is always nice).

-- 


From benc at hawaga.org.uk  Thu Feb 12 03:31:09 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 12 Feb 2009 09:31:09 +0000 (GMT)
Subject: [Swift-devel] Rename versions of Globus commands in swift/bin?
In-Reply-To: <4993A59D.4010900@mcs.anl.gov>
References: <4988A6A9.6030302@renci.org> <4993561E.1060105@mcs.anl.gov>
	<Pine.LNX.4.64.0902112252200.23512@dildano.hawaga.org.uk>
	<49936388.80207@mcs.anl.gov>
	<49939761.9020002@renci.org> <4993A311.3060103@mcs.anl.gov>
	<4993A59D.4010900@mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0902120924050.1293@dildano.hawaga.org.uk>


Relevant to this thread, there is a build option:

  ant -Dno-supporting=true

(hurrah for more double negatives - should fix that too)

which produces a swift build without the potentially conflicting commands.

I did this specifically to ease installation onto systems where a decent 
grid stack is already deployed.

That goes a long way toward solving the problem that motivated Mike's initial 
mail (which was someone trying to install swift onto a system where a grid 
stack is already deployed).

At the moment, using that build option requires a source build.

Release-mechanics-wise, it would be possible to put up a version with, and 
a version without, the supporting material. I'm a little wary of making 
more release combinations (I like the single one that we have), but 
perhaps this is the correct thing to do. It also fits in nicely with the 
pacman packaging that I experimented with in an earlier release and have 
had no feedback for.

-- 


From benc at hawaga.org.uk  Thu Feb 12 03:58:05 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 12 Feb 2009 09:58:05 +0000 (GMT)
Subject: [Swift-devel] Rename versions of Globus commands in swift/bin?
In-Reply-To: <Pine.LNX.4.64.0902120924050.1293@dildano.hawaga.org.uk>
References: <4988A6A9.6030302@renci.org> <4993561E.1060105@mcs.anl.gov>
	<Pine.LNX.4.64.0902112252200.23512@dildano.hawaga.org.uk>
	<49936388.80207@mcs.anl.gov>
	<49939761.9020002@renci.org> <4993A311.3060103@mcs.anl.gov>
	<4993A59D.4010900@mcs.anl.gov>
	<Pine.LNX.4.64.0902120924050.1293@dildano.hawaga.org.uk>
Message-ID: <Pine.LNX.4.64.0902120952480.1293@dildano.hawaga.org.uk>

On Thu, 12 Feb 2009, Ben Clifford wrote:

> Release-mechanics-wise, it would be possible to put up a version with, and 
> a version without, the supporting material. I'm a little wary of making 
> more release combinations (I like the single one that we have), but 
> perhaps this is the correct thing to do. It also fits in nicely with the 
> pacman packaging that I experimented with in an earlier release and have 
> had no feedback for.

I just made:

 http://www.ci.uchicago.edu/swift/packages/swift-0.8-stripped.tar.gz

built with that option, and will link to it from the release page alongside 
the existing full 0.8 release.

-- 


From benc at hawaga.org.uk  Thu Feb 12 06:54:31 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 12 Feb 2009 12:54:31 +0000 (GMT)
Subject: [Swift-devel] walltime compulsion
Message-ID: <Pine.LNX.4.64.0902121250260.23512@dildano.hawaga.org.uk>


Recent walltime violation commits seem to change the way in which 
applications with unspecified walltimes behave.

They now get this treatment:

> Warning: missing walltime specification for "echo". Assuming 10 minutes.

I don't think it is appropriate to assume that any arbitrary program in 
Swift has a 10 minute maxwalltime, if none is specified.

None should be assumed, and if that means some functionality based on 
having a walltime doesn't do anything for those tasks, then so be it.

-- 


From wilde at mcs.anl.gov  Thu Feb 12 08:25:05 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 12 Feb 2009 08:25:05 -0600
Subject: [Swift-devel] coaster one-liner bootstrap script
In-Reply-To: <4993CA4F.3030609@mcs.anl.gov>
References: <12554808.2179801234404154196.JavaMail.root@zimbra>
	<4993CA4F.3030609@mcs.anl.gov>
Message-ID: <49943141.9020701@mcs.anl.gov>

I forgot to add: there was no gram or coaster log on the target site, 
teraport, under ~osg, which is what my cert is mapped to. As far as I 
could tell, the job never made it to the site, or even to gram.

Is this message the result of a client-side check, before the bootstrap 
job is launched?

- Mike




From hategan at mcs.anl.gov  Thu Feb 12 09:33:26 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 12 Feb 2009 09:33:26 -0600
Subject: [Swift-devel] walltime compulsion
In-Reply-To: <Pine.LNX.4.64.0902121250260.23512@dildano.hawaga.org.uk>
References: <Pine.LNX.4.64.0902121250260.23512@dildano.hawaga.org.uk>
Message-ID: <1234452806.3694.1.camel@localhost>

I have yet to see a queuing system that works that way (not that
I've seen many).

On Thu, 2009-02-12 at 12:54 +0000, Ben Clifford wrote:
> Recent walltime violation commits seem to change the way in which 
> applications with unspecified walltimes behave.
> 
> They now get this treatment:
> 
> > Warning: missing walltime specification for "echo". Assuming 10 minutes.
> 
> I don't think it is appropriate to assume that any arbitrary program in 
> Swift has a 10 minute maxwalltime, if none is specified.
> 
> None should be assumed, and if that means some functionality based on 
> having a walltime doesn't do anything for those tasks, then so be it.
> 


From benc at hawaga.org.uk  Thu Feb 12 09:36:43 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 12 Feb 2009 15:36:43 +0000 (GMT)
Subject: [Swift-devel] walltime compulsion
In-Reply-To: <1234452806.3694.1.camel@localhost>
References: <Pine.LNX.4.64.0902121250260.23512@dildano.hawaga.org.uk>
	<1234452806.3694.1.camel@localhost>
Message-ID: <Pine.LNX.4.64.0902121534280.23512@dildano.hawaga.org.uk>


On Thu, 12 Feb 2009, Mihael Hategan wrote:

> I have yet to see a queuing system that works that way (not that 
> I've seen many).

Plenty of queueing systems give you a default walltime on jobs that you 
submit. I don't see that it's Swift's business to be interfering with that 
default.

-- 


From hategan at mcs.anl.gov  Thu Feb 12 09:44:18 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 12 Feb 2009 09:44:18 -0600
Subject: [Swift-devel] walltime compulsion
In-Reply-To: <Pine.LNX.4.64.0902121534280.23512@dildano.hawaga.org.uk>
References: <Pine.LNX.4.64.0902121250260.23512@dildano.hawaga.org.uk>
	<1234452806.3694.1.camel@localhost>
	<Pine.LNX.4.64.0902121534280.23512@dildano.hawaga.org.uk>
Message-ID: <1234453458.4032.0.camel@localhost>

On Thu, 2009-02-12 at 15:36 +0000, Ben Clifford wrote:
> On Thu, 12 Feb 2009, Mihael Hategan wrote:
> 
> > I have yet to see a queuing system that works that way (not that 
> > I've seen many).
> 
> Plenty of queueing systems give you a default walltime on jobs that you 
> submit. I don't see that it's Swift's business to be interfering with that 
> default.
> 

I suppose there's no clear thing here. Anybody else?


From rynge at renci.org  Thu Feb 12 09:55:42 2009
From: rynge at renci.org (Mats Rynge)
Date: Thu, 12 Feb 2009 10:55:42 -0500
Subject: [Swift-devel] walltime compulsion
In-Reply-To: <1234453458.4032.0.camel@localhost>
References: <Pine.LNX.4.64.0902121250260.23512@dildano.hawaga.org.uk>	<1234452806.3694.1.camel@localhost>	<Pine.LNX.4.64.0902121534280.23512@dildano.hawaga.org.uk>
	<1234453458.4032.0.camel@localhost>
Message-ID: <4994467E.1020209@renci.org>

Mihael Hategan wrote:
> On Thu, 2009-02-12 at 15:36 +0000, Ben Clifford wrote:
>> On Thu, 12 Feb 2009, Mihael Hategan wrote:
>>
>>> I have yet to see a queuing system that works that way (not that 
>>> I've seen many).
>> Plenty of queueing systems give you a default walltime on jobs that you 
>> submit. I don't see that it's Swift's business to be interfering with that 
>> default.
>>
> 
> I suppose there's no clear thing here. Anybody else?

Ignoring the queuing system for a moment, it is still a good idea to
know what the expected runtime is. Ben and I had some of this
conversation when we tried Swift on OSG, and we had a couple of
instances where job and/or file transfer status changes were "lost",
and Swift got stuck. I strongly believe that you need internal
timeouts for nearly every state in your state machine, and that
the timeouts for the job states should be based on the "walltime".

We are using state timeouts for a lot of our OSG jobs based on Condor
and OSGMM. This ensures that it is the workflow engine, and not the
user, that picks up weird states and handles them accordingly (resubmit
to another site for example).
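
A minimal Python sketch of the per-state timeout idea, assuming a job's
walltime is known. The state names, the multipliers, and the StateWatchdog
class are hypothetical illustrations, not code from Swift, Condor, or OSGMM.

```python
import time
from typing import Dict, Optional

# Hypothetical multipliers: how long a job may stay in each state, as a
# multiple of its walltime. Queue waits can legitimately exceed run time,
# so "queued" gets the most slack.
STATE_TIMEOUT_FACTORS: Dict[str, float] = {
    "submitted": 2.0,
    "queued": 10.0,
    "active": 1.5,
}


class StateWatchdog:
    """Tracks one job's current state and reports when it has lasted too long."""

    def __init__(self, walltime_seconds: float) -> None:
        self.walltime = walltime_seconds
        self.state: Optional[str] = None
        self.entered_at = 0.0

    def enter(self, state: str, now: Optional[float] = None) -> None:
        # Record the transition time; callers may pass `now` for testing.
        self.state = state
        self.entered_at = time.monotonic() if now is None else now

    def timed_out(self, now: Optional[float] = None) -> bool:
        if self.state is None:
            return False
        now = time.monotonic() if now is None else now
        # Unknown states fall back to a timeout of one walltime.
        limit = self.walltime * STATE_TIMEOUT_FACTORS.get(self.state, 1.0)
        return now - self.entered_at > limit
```

An engine would poll timed_out() for each job and, when it returns True,
cancel the job and resubmit it (perhaps to another site) rather than wait
indefinitely for a status change that may have been lost.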

-- 
Mats Rynge
Renaissance Computing Institute <http://www.renci.org>


From benc at hawaga.org.uk  Thu Feb 12 10:09:06 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 12 Feb 2009 16:09:06 +0000 (GMT)
Subject: [Swift-devel] walltime compulsion
In-Reply-To: <4994467E.1020209@renci.org>
References: <Pine.LNX.4.64.0902121250260.23512@dildano.hawaga.org.uk>
	<1234452806.3694.1.camel@localhost>
	<Pine.LNX.4.64.0902121534280.23512@dildano.hawaga.org.uk>
	<1234453458.4032.0.camel@localhost> <4994467E.1020209@renci.org>
Message-ID: <Pine.LNX.4.64.0902121601030.23512@dildano.hawaga.org.uk>


On Thu, 12 Feb 2009, Mats Rynge wrote:

> Ignoring the queuing system for a moment, it is still a good idea to
> know what the expected runtime is.

Right.

And what was implemented is better handling, basically as you describe, for 
when the expected maximum runtime (expressed through maxwalltime) is known.

The issue I brought up is in cases where the expected maximum runtime is 
not specified by a user (in tc.data).

Previously (for example, in Swift 0.8), tc.data entries that had no 
maxwalltime specification carried on having no maxwalltime all the way 
through the job submission process.

Some functionality (in 0.8, I think only clustering) cannot be used with 
tc.data entries that have no maxwalltime: entries that have no maxwalltime 
will never be considered for being clustered. Instead, they will be 
submitted as if clustering was not enabled.
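
A rough Python sketch of the 0.8 clustering rule just described; the Task
type and split_for_clustering function are hypothetical and are not Swift's
actual clustering code.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Task:
    app: str
    maxwalltime_minutes: Optional[int]  # None when tc.data gives no maxwalltime


def split_for_clustering(tasks: List[Task]) -> Tuple[List[Task], List[Task]]:
    """Return (clusterable, unclustered).

    Tasks without a maxwalltime are never considered for clustering; they
    are submitted individually, as if clustering were disabled.
    """
    clusterable = [t for t in tasks if t.maxwalltime_minutes is not None]
    unclustered = [t for t in tasks if t.maxwalltime_minutes is None]
    return clusterable, unclustered
```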

What exists in SVN HEAD now is that if you do not specify a tc.data 
maxwalltime, then that tc.data entry is given a 10 minute maxwalltime 
entry by default (for everywhere - clustering, submit-side walltime 
violation, remote queue entry)

This is a change in maxwalltime handling that I dislike.
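
To make the two behaviours being compared concrete, here is a hedged Python
sketch. resolve_walltime is hypothetical: assume_default=False models 0.8
(a missing maxwalltime stays missing), and assume_default=True models SVN
HEAD (a 10 minute default plus a warning). Warning only once per
application name is an assumption made for this sketch.

```python
from typing import Optional, Set

DEFAULT_WALLTIME_MINUTES = 10  # the default described above for SVN HEAD


def resolve_walltime(app: str, maxwalltime: Optional[int], assume_default: bool,
                     warned: Set[str],
                     default_minutes: int = DEFAULT_WALLTIME_MINUTES) -> Optional[int]:
    """Return the walltime in minutes to use for `app`, or None for no walltime."""
    if maxwalltime is not None:
        return maxwalltime  # an explicit tc.data value always wins
    if not assume_default:
        return None  # 0.8 behaviour: no walltime all the way through submission
    if app not in warned:  # HEAD behaviour: warn (once per app), then default
        warned.add(app)
        print(f'Warning: missing walltime specification for "{app}". '
              f'Assuming {default_minutes} minutes.')
    return default_minutes
```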

-- 


From wilde at mcs.anl.gov  Thu Feb 12 10:21:26 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 12 Feb 2009 10:21:26 -0600
Subject: [Swift-devel] data staging process/ documents?
In-Reply-To: <50b07b4b0902111707q7aad742fh89c0dad88f744ecf@mail.gmail.com>
References: <50b07b4b0902111707q7aad742fh89c0dad88f744ecf@mail.gmail.com>
Message-ID: <49944C86.9060206@mcs.anl.gov>


On 2/11/09 7:07 PM, Allan Espinosa wrote:
> Hi,
> 
> I am attempting to actualize how collective operations on workflows
> (loosely-coupled) work in general.  My initial idea is that this goes
> in the staging of data before executing a task in a workflow.
> 
> Do we have documents describing these?

I think the email below from Ben is relevant; it refers to a prior 
post on swift-devel. I'm not sure if that text has made it to the 
userguide yet.

- Mike


> I have a small idea on how it
> works by monitoring my swift job's <workdir/> as a workflow executes.
> 
> My initial ideas are posted in
> http://www.ci.uchicago.edu/wiki/bin/view/VDS/DslCS/CollectiveIO
> 
> -Allan


-------- Original Message --------
Subject: [Swift-devel] notes on how swift implements file input and output
Date: Mon, 1 Dec 2008 22:00:16 +0000 (GMT)
From: Ben Clifford <benc at hawaga.org.uk>
To: swift-devel at ci.uchicago.edu
References: <Pine.LNX.4.64.0812011854470.2448 at dildano.hawaga.org.uk>


read this in conjunction with previous note, "Subject: User perspective on
how an app procedure call maps into an application executable call"


This note details the implementation of Swift file input and output in
application blocks; it is intended to be read in conjunction with a
previous note 'How an app procedure call maps into an application call,
from a Swift user perspective, attempting to avoid the mechanics inside
Swift.'


Swift executes application procedures on one or more //sites//.

Each site consists of:

* worker nodes. There is some //execution mechanism// through which the
Swift client side executable can execute its //wrapper script// on those
worker nodes. This is commonly GRAM or Falkon or coasters.

* a site-shared file system. This site shared filesystem is accessible
through some //file transfer mechanism// from the Swift client side
executable. This is commonly GridFTP or coasters. This site shared
filesystem is also accessible through the POSIX file system on all worker
nodes, mounted at the same location as seen through the file transfer
mechanism. Swift is configured with the location of some //site working
directory// on that site-shared file system.

There is no assumption that the site shared file system for one site is
accessible from another site.

For each workflow run, on each site that is used by that run, a //run
directory// is created in the site working directory, by the Swift client
side.

In that run directory are placed several subdirectories:

* shared/ - site shared files cache

* kickstart/ - when kickstart is used, kickstart record files
for each job that has generated a kickstart

* info/ - wrapper script log files

* status/ - job status files

* jobs/ - //application workspace directories// (optionally placed here;
see below)
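
A minimal Python sketch of the run-directory layout above, assuming a local
path stands in for the site working directory (Swift itself creates these
remotely through the file transfer mechanism). create_run_directory is a
hypothetical helper.

```python
from pathlib import Path

# Subdirectories created in each run directory, as listed above.
RUN_SUBDIRS = ("shared", "kickstart", "info", "status", "jobs")


def create_run_directory(site_workdir: str, run_id: str) -> Path:
    """Create <site_workdir>/<run_id> and its standard subdirectories."""
    run_dir = Path(site_workdir) / run_id
    for sub in RUN_SUBDIRS:
        # parents=True also creates run_dir itself on the first iteration.
        (run_dir / sub).mkdir(parents=True, exist_ok=True)
    return run_dir
```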

Application execution looks like this:

For each application procedure call:

The Swift client side selects a site; copies the input files for that
procedure call to the site shared file cache if they are not already in
the cache, using the file transfer mechanism; and then invokes the wrapper
script on that site using the execution mechanism.

The wrapper script creates the application workspace directory; places the
input files for that job into the application workspace directory using
either cp or ln -s (depending on a configuration option); executes the
application unix executable; copies output files from the application
workspace directory to the site shared directory using cp; creates a
status file under the status/ directory; and exits, returning control to
the Swift client side. Logs created during the execution of the wrapper
script are stored under the info/ directory.
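
The wrapper-script steps above can be sketched in Python as follows,
assuming a local directory stands in for the site shared file system and
the run directory already contains shared/, info/, status/ and jobs/. The
real wrapper is a shell script; run_wrapped_job, the log-file name, and the
<job_id>-success / <job_id>-error status names are hypothetical.

```python
import os
import shutil
import subprocess
from pathlib import Path
from typing import List, Optional


def run_wrapped_job(run_dir: Path, job_id: str, command: List[str],
                    inputs: List[str], outputs: List[str],
                    use_symlinks: bool = False,
                    workspace_base: Optional[Path] = None) -> int:
    """Run one job following the wrapper-script sequence.

    `inputs` and `outputs` are paths relative to run_dir/shared, the site
    shared cache. `workspace_base` defaults to run_dir/jobs but may point at
    another file system, such as a worker-node-local disk.
    """
    shared = run_dir / "shared"
    base = workspace_base if workspace_base is not None else run_dir / "jobs"
    workspace = base / job_id
    workspace.mkdir(parents=True)  # fails if the workspace already exists

    # Stage in: place each input in the workspace with cp or ln -s.
    for rel in inputs:
        src, dst = shared / rel, workspace / rel
        dst.parent.mkdir(parents=True, exist_ok=True)
        if use_symlinks:
            os.symlink(src.resolve(), dst)
        else:
            shutil.copy2(src, dst)

    # Run the application in the workspace; its output goes to an info/ log.
    with open(run_dir / "info" / f"{job_id}.log", "w") as log:
        rc = subprocess.run(command, cwd=workspace, stdout=log,
                            stderr=subprocess.STDOUT).returncode

    if rc == 0:
        # Stage out: copy outputs back to the site shared directory with cp.
        for rel in outputs:
            dst = shared / rel
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(workspace / rel, dst)

    # Leave a status file for the client side (hypothetical naming scheme).
    outcome = "success" if rc == 0 else "error"
    (run_dir / "status" / f"{job_id}-{outcome}").touch()
    return rc
```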

The Swift client side then checks for the presence of a status file
indicating success and deletes it, and copies files from the site shared
directory to the appropriate client side location.
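
A matching client-side sketch, again assuming local paths rather than the
file transfer mechanism. collect_job_results and the <job_id>-success
status-file name are hypothetical.

```python
import shutil
from pathlib import Path
from typing import List


def collect_job_results(run_dir: Path, job_id: str, outputs: List[str],
                        client_dir: Path) -> bool:
    """Client-side completion step for one job.

    Checks for (and deletes) the job's success status file, then copies
    outputs from the site shared directory to the client-side location.
    Returns False if no success status file is found.
    """
    status_file = run_dir / "status" / f"{job_id}-success"
    if not status_file.exists():
        return False
    status_file.unlink()
    for rel in outputs:
        dst = client_dir / rel
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(run_dir / "shared" / rel, dst)
    return True
```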

The job directory is created (in the default mode) under the jobs/
directory. However, it can be created under an arbitrary other path, which
allows it to be created on a different file system (such as a worker node
local file system in the case that the worker node has a local file
system).

-- 



From rynge at renci.org  Thu Feb 12 10:24:50 2009
From: rynge at renci.org (Mats Rynge)
Date: Thu, 12 Feb 2009 11:24:50 -0500
Subject: [Swift-devel] walltime compulsion
In-Reply-To: <Pine.LNX.4.64.0902121601030.23512@dildano.hawaga.org.uk>
References: <Pine.LNX.4.64.0902121250260.23512@dildano.hawaga.org.uk>
	<1234452806.3694.1.camel@localhost>
	<Pine.LNX.4.64.0902121534280.23512@dildano.hawaga.org.uk>
	<1234453458.4032.0.camel@localhost> <4994467E.1020209@renci.org>
	<Pine.LNX.4.64.0902121601030.23512@dildano.hawaga.org.uk>
Message-ID: <49944D52.6060804@renci.org>

Ben Clifford wrote:
> What exists in SVN HEAD now is that if you do not specify a tc.data 
> maxwalltime, then that tc.data entry is given a 10 minute maxwalltime 
> entry by default (for everywhere - clustering, submit-side walltime 
> violation, remote queue entry)
> 
> This is a change in maxwalltime handling that I dislike.

So, why not just make the maxwalltime in tc.data a required field?

-- 
Mats Rynge
Renaissance Computing Institute <http://www.renci.org>


From wilde at mcs.anl.gov  Thu Feb 12 10:25:15 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 12 Feb 2009 10:25:15 -0600
Subject: [Swift-devel] data staging process/ documents?
In-Reply-To: <49944C86.9060206@mcs.anl.gov>
References: <50b07b4b0902111707q7aad742fh89c0dad88f744ecf@mail.gmail.com>
	<49944C86.9060206@mcs.anl.gov>
Message-ID: <49944D6B.9090200@mcs.anl.gov>


On 2/12/09 10:21 AM, Michael Wilde wrote:
> 
> 
> On 2/11/09 7:07 PM, Allan Espinosa wrote:
>> Hi,
>>
>> I am attempting to actualize how collective operations on workflows
>> (loosely-coupled) work in general.  My initial idea is that this goes
>> in the staging of data before executing a task in a workflow.
>>
>> Do we have documents describing these?
> 
> I think the email below from Ben is relevant; it refers to a prior 
> post on swift-devel. I'm not sure if that text has made it to the 
> userguide yet.

Prior post was: 
http://mail.ci.uchicago.edu/pipermail/swift-devel/2008-December/004070.html

I think that info has been added to the userguide.

- Mike



From benc at hawaga.org.uk  Thu Feb 12 10:28:16 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 12 Feb 2009 16:28:16 +0000 (GMT)
Subject: [Swift-devel] data staging process/ documents?
In-Reply-To: <49944D6B.9090200@mcs.anl.gov>
References: <50b07b4b0902111707q7aad742fh89c0dad88f744ecf@mail.gmail.com>
	<49944C86.9060206@mcs.anl.gov> <49944D6B.9090200@mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0902121627270.1293@dildano.hawaga.org.uk>


On Thu, 12 Feb 2009, Michael Wilde wrote:

> I think that info has been added to the userguide.

It has.

Those two emails ended up being this:

http://www.ci.uchicago.edu/swift/guides/userguide/appmodel.php

-- 


From hategan at mcs.anl.gov  Thu Feb 12 10:31:46 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 12 Feb 2009 10:31:46 -0600
Subject: [Swift-devel] walltime compulsion
In-Reply-To: <49944D52.6060804@renci.org>
References: <Pine.LNX.4.64.0902121250260.23512@dildano.hawaga.org.uk>
	<1234452806.3694.1.camel@localhost>
	<Pine.LNX.4.64.0902121534280.23512@dildano.hawaga.org.uk>
	<1234453458.4032.0.camel@localhost> <4994467E.1020209@renci.org>
	<Pine.LNX.4.64.0902121601030.23512@dildano.hawaga.org.uk>
	<49944D52.6060804@renci.org>
Message-ID: <1234456306.4850.3.camel@localhost>

On Thu, 2009-02-12 at 11:24 -0500, Mats Rynge wrote:
> Ben Clifford wrote:
> > What exists in SVN HEAD now is that if you do not specify a tc.data 
> > maxwalltime, then that tc.data entry is given a 10 minute maxwalltime 
> > entry by default (for everywhere - clustering, submit-side walltime 
> > violation, remote queue entry)
> > 
> > This is a change in maxwalltime handling that I dislike.
> 
> So, why not just make the maxwalltime in tc.data a required field?
> 

What I put in there is the middle ground between no walltime and
mandatory walltime.

For many small things 10 minutes is fine. But I also wouldn't want the
user to be surprised that their 30 minute job for which they didn't put
a walltime in never completes. So there's a once-per-job warning which
is supposed to persuade the user to specify a walltime.


From benc at hawaga.org.uk  Thu Feb 12 10:34:08 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 12 Feb 2009 16:34:08 +0000 (GMT)
Subject: [Swift-devel] walltime compulsion
In-Reply-To: <49944D52.6060804@renci.org>
References: <Pine.LNX.4.64.0902121250260.23512@dildano.hawaga.org.uk>
	<1234452806.3694.1.camel@localhost>
	<Pine.LNX.4.64.0902121534280.23512@dildano.hawaga.org.uk>
	<1234453458.4032.0.camel@localhost> <4994467E.1020209@renci.org>
	<Pine.LNX.4.64.0902121601030.23512@dildano.hawaga.org.uk>
	<49944D52.6060804@renci.org>
Message-ID: <Pine.LNX.4.64.0902121628500.23512@dildano.hawaga.org.uk>


On Thu, 12 Feb 2009, Mats Rynge wrote:

> So, why not just make the maxwalltime in tc.data a required field?

I'd rather have it compulsory than an arbitrary default value; but I'd 
rather have neither.

I see no reason to compel a walltime if you don't want walltime-based 
handling.

In the situations that you are running in, it seems fairly vital, because 
you want to use walltime-based features.

However, Swift also gets used in situations where it isn't necessary - for 
example, when running on single-site clusters where stuff tends to either 
work or not work (rather than having distributed-system style partial 
failures), and hung jobs don't cause a problem (which is why it's taken 
this many years for the submit side walltime stuff to get implemented).

Nothing has changed to suddenly make it necessary to compel that user 
community to think about walltimes, and in those cases, it adds 
unnecessary configuration load onto users; and that is something that I 
feel fairly strongly about in Swift.

-- 


From wilde at mcs.anl.gov  Thu Feb 12 10:35:57 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 12 Feb 2009 10:35:57 -0600
Subject: [Swift-devel] walltime compulsion
In-Reply-To: <49944D52.6060804@renci.org>
References: <Pine.LNX.4.64.0902121250260.23512@dildano.hawaga.org.uk>	<1234452806.3694.1.camel@localhost>	<Pine.LNX.4.64.0902121534280.23512@dildano.hawaga.org.uk>	<1234453458.4032.0.camel@localhost>
	<4994467E.1020209@renci.org>	<Pine.LNX.4.64.0902121601030.23512@dildano.hawaga.org.uk>
	<49944D52.6060804@renci.org>
Message-ID: <49944FED.7080206@mcs.anl.gov>

Would that address Ben's concern? Is the dislike the 10 minute default, 
or the fact that the same value is used for all 3 calculations mentioned?

My slight preference is to keep required fields to a minimum, and to make 
the default time somewhat higher (to reduce surprise job terminations at 
the expense of surprise at winding up in a slow queue).

If it's easy, would a global property for the default time be reasonable, 
so that a user could tweak one value for all their walltimes?

Also: I didn't raise this because there was so much churn last week on 
the coaster code, but when testing Ben's coaster-service-on-headnode 
patch, I was unable to find walltime settings that would get me into the 
fast queue on teraport. I did not have time to track down what was 
happening, i.e., what I requested where versus what was sent to gram. 
Just a heads-up that the time calculation code could use some testing.

- Mike


On 2/12/09 10:24 AM, Mats Rynge wrote:
> Ben Clifford wrote:
>> What exists in SVN HEAD now is that if you do not specify a tc.data 
>> maxwalltime, then that tc.data entry is given a 10 minute maxwalltime 
>> entry by default (for everywhere - clustering, submit-side walltime 
>> violation, remote queue entry)
>>
>> This is a change in maxwalltime handling that I dislike.
> 
> So, why not just make the maxwalltime in tc.data a required field?
> 


From benc at hawaga.org.uk  Thu Feb 12 10:36:50 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 12 Feb 2009 16:36:50 +0000 (GMT)
Subject: [Swift-devel] walltime compulsion
In-Reply-To: <1234456306.4850.3.camel@localhost>
References: <Pine.LNX.4.64.0902121250260.23512@dildano.hawaga.org.uk> 
	<1234452806.3694.1.camel@localhost>
	<Pine.LNX.4.64.0902121534280.23512@dildano.hawaga.org.uk>
	<1234453458.4032.0.camel@localhost> <4994467E.1020209@renci.org> 
	<Pine.LNX.4.64.0902121601030.23512@dildano.hawaga.org.uk>
	<49944D52.6060804@renci.org> <1234456306.4850.3.camel@localhost>
Message-ID: <Pine.LNX.4.64.0902121634450.23512@dildano.hawaga.org.uk>


On Thu, 12 Feb 2009, Mihael Hategan wrote:

> What I put in there is the middle ground between no walltime and
> mandatory walltime.

I think the middle ground is less desirable than either extreme.

> For many small things 10 minutes is fine. But I also wouldn't want the
> user to be surprised that their 30 minute job for which they didn't put
> a walltime in never completes. So there's a once-per-job warning which
> is supposed to persuade the user to specify a walltime.

Having a warning/recommendation is fine.

-- 


From wilde at mcs.anl.gov  Thu Feb 12 10:38:50 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 12 Feb 2009 10:38:50 -0600
Subject: [Swift-devel] walltime compulsion
In-Reply-To: <Pine.LNX.4.64.0902121628500.23512@dildano.hawaga.org.uk>
References: <Pine.LNX.4.64.0902121250260.23512@dildano.hawaga.org.uk>	<1234452806.3694.1.camel@localhost>	<Pine.LNX.4.64.0902121534280.23512@dildano.hawaga.org.uk>	<1234453458.4032.0.camel@localhost>
	<4994467E.1020209@renci.org>	<Pine.LNX.4.64.0902121601030.23512@dildano.hawaga.org.uk>	<49944D52.6060804@renci.org>
	<Pine.LNX.4.64.0902121628500.23512@dildano.hawaga.org.uk>
Message-ID: <4994509A.8080401@mcs.anl.gov>

This argument is more clear and makes sense to me. I agree with it, if 
it causes no complications.

On 2/12/09 10:34 AM, Ben Clifford wrote:
> On Thu, 12 Feb 2009, Mats Rynge wrote:
> 
>> So, why not just make the maxwalltime in tc.data a required field?
> 
> I'd rather have it compulsory than an arbitrary default value; but I'd 
> rather have neither.
> 
> I see no reason to compel a walltime if you don't want walltime-based 
> handling.
> 
> In the situations that you are running in, it seems fairly vital, because 
> you want to use walltime based features.
> 
> However, Swift also gets used in situations where it isn't necessary - for 
> example, when running on single-site clusters where stuff tends to either 
> work or not work (rather than having distributed-system style partial 
> failures), and hung jobs don't cause a problem (which is why it's taken 
> this many years for the submit side walltime stuff to get implemented).
> 
> Nothing has changed to suddenly make it necessary to compel that user 
> community to think about walltimes, and in those cases, it adds 
> unnecessary configuration load onto users; and that is something that I 
> feel fairly strongly about in Swift.
> 


From benc at hawaga.org.uk  Thu Feb 12 10:41:35 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 12 Feb 2009 16:41:35 +0000 (GMT)
Subject: [Swift-devel] walltime compulsion
In-Reply-To: <49944FED.7080206@mcs.anl.gov>
References: <Pine.LNX.4.64.0902121250260.23512@dildano.hawaga.org.uk>
	<1234452806.3694.1.camel@localhost>
	<Pine.LNX.4.64.0902121534280.23512@dildano.hawaga.org.uk>
	<1234453458.4032.0.camel@localhost> <4994467E.1020209@renci.org>
	<Pine.LNX.4.64.0902121601030.23512@dildano.hawaga.org.uk>
	<49944D52.6060804@renci.org> <49944FED.7080206@mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0902121637060.23512@dildano.hawaga.org.uk>


On Thu, 12 Feb 2009, Michael Wilde wrote:

> My slight preference is to keep required fields to a minimum, to make the time
> somewhat higher (to reduce surprise job terminations at the expense of
> surprise at winding up in slow queue).

In (most?) PBS deployments, you don't end up in a slow queue - you get 
your job rejected. Queues are selected by specifying a queue name, 
separately from the walltime.

This requires low-end users to have a better understanding of the submit 
stack than even someone like you who has worked on grid stuff for many 
years has; I think that's a nice argument for not compelling this.
 
> If it's easy, is a global property for the default time reasonable?  So a 
> user could tweak one value for all their wall times?

That could be implemented but feels a bit messy to me - if you know what 
the time to specify is for your apps, and can take a max() of it to set 
the default, then you also have enough understanding and information to 
specify it in tc.data.
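To make Ben's point concrete: a single global default that is safe for every app has to be at least the longest per-app walltime, so computing it already requires knowing each app's time. A minimal Python sketch; the dictionary standing in for tc.data, the app names and the minute values are all hypothetical:

```python
# Hypothetical per-app walltime estimates in minutes (illustrative only,
# not real tc.data entries).
app_walltimes = {"runoops": 45, "convert": 5, "summarize": 12}


def global_default_walltime(walltimes):
    """Return one default that is long enough for every listed app.

    Taking max() means no app is killed early, at the cost of
    over-requesting walltime for the short apps.
    """
    if not walltimes:
        raise ValueError("need at least one per-app walltime")
    return max(walltimes.values())


print(global_default_walltime(app_walltimes))  # prints 45
```

Once `app_walltimes` exists, each value could just as well be written into the matching tc.data entry, which gives tighter per-app walltimes than the single max() value.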


-- 


From hategan at mcs.anl.gov  Thu Feb 12 12:48:03 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 12 Feb 2009 12:48:03 -0600 (CST)
Subject: [Swift-devel] walltime compulsion
Message-ID: <12353309.2198941234464483254.JavaMail.root@zimbra>

r2531 fixes the issue: the walltime is now used only if it is specified. 
We will encourage our users to specify a walltime through
other means.


From zhaozhang at uchicago.edu  Thu Feb 12 13:23:31 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Thu, 12 Feb 2009 13:23:31 -0600
Subject: [Swift-devel] Swift data distribution strategy
Message-ID: <49947733.7000300@uchicago.edu>

Hi,

I ran into a problem during a 512-compute-node scale test.
There are 15351 input files for this computation, and there are 8 sites.
The question is: when I start Swift, will all 15351 input files be copied 
to each of the 8 sites?

By my test, it is yes. Does Swift have an option so that input files 
are only copied on demand?

best wishes
zhangzhao


From aespinosa at cs.uchicago.edu  Thu Feb 12 13:29:37 2009
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Thu, 12 Feb 2009 13:29:37 -0600
Subject: [Swift-devel] Swift data distribution strategy
In-Reply-To: <49947733.7000300@uchicago.edu>
References: <49947733.7000300@uchicago.edu>
Message-ID: <50b07b4b0902121129m4c25484fm5dcb88f47d1f9558@mail.gmail.com>

Hi Zhao,

the input files are copied on demand as jobs get dispatched (I think).

-Allan

On Thu, Feb 12, 2009 at 1:23 PM, Zhao Zhang <zhaozhang at uchicago.edu> wrote:
> Hi,
>
> I ran into a problem during a 512-compute-node scale test.
> There are 15351 input files for this computation, and there are 8 sites.
> The question is: when I start Swift, will all 15351 input files be copied
> to each of the 8 sites?
>
> By my test, it is yes. Does Swift have an option so that input files
> are only copied on demand?
>
> best wishes
> zhangzhao


From hategan at mcs.anl.gov  Thu Feb 12 13:43:48 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 12 Feb 2009 13:43:48 -0600 (CST)
Subject: [Swift-devel] Swift data distribution strategy
In-Reply-To: <49947733.7000300@uchicago.edu>
Message-ID: <24164550.2203051234467828083.JavaMail.root@zimbra>

----- Zhao Zhang <zhaozhang at uchicago.edu> wrote:
> Hi,
> 
> I ran into a problem during a 512-compute-node scale test.
> There are 15351 input files for this computation, and there are 8 sites.
> The question is: when I start Swift, will all 15351 input files be copied 
> to each of the 8 sites?
> 
> By my test, it is yes.

It's more like "no" actually. Swift first selects a site for each job 
that can be run. After that, it stages each job's files to the site 
that was selected for that job.
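The two steps Mihael describes can be sketched in Python as below; the job and site data and the round-robin selector are hypothetical stand-ins for Swift's real scheduler. Under the simplifying assumption that each input file is read by exactly one job, staging follows the jobs, so the 15351 files in Zhao's test would mean 15351 transfers rather than the 8 × 15351 = 122,808 needed to copy everything to every site. A file read by jobs on several sites would be copied to each of those sites.

```python
from collections import defaultdict


def plan_stage_ins(jobs, sites, choose_site):
    """Return {site: set of input files to stage there}.

    jobs maps a job id to the list of input files that job reads.
    choose_site stands in for Swift's site selection (hypothetical here).
    """
    plan = defaultdict(set)
    for job_id, inputs in jobs.items():
        site = choose_site(job_id, sites)  # 1. select a site for this job
        plan[site].update(inputs)          # 2. stage only this job's inputs there
    return plan


def round_robin(job_id, sites):
    # Toy selector: "job7" goes to sites[7 % len(sites)].
    return sites[int(job_id[len("job"):]) % len(sites)]


jobs = {f"job{i}": [f"in{i}.dat"] for i in range(16)}
sites = [f"site{s}" for s in range(8)]
plan = plan_stage_ins(jobs, sites, round_robin)
print(sum(len(files) for files in plan.values()))  # prints 16, not 16 * 8
```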

> Does Swift have an option so that input files 
> are only copied on demand?
> 
> best wishes
> zhangzhao
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel


From zhaozhang at uchicago.edu  Thu Feb 12 13:46:15 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Thu, 12 Feb 2009 13:46:15 -0600
Subject: [Swift-devel] Swift data distribution strategy
In-Reply-To: <24164550.2203051234467828083.JavaMail.root@zimbra>
References: <24164550.2203051234467828083.JavaMail.root@zimbra>
Message-ID: <49947C87.1050107@uchicago.edu>

ok, got it.

zhao

Mihael Hategan wrote:
> ----- Zhao Zhang <zhaozhang at uchicago.edu> wrote:
>   
>> Hi,
>>
>> I ran into a problem during a 512-compute-node scale test.
>> There are 15351 input files for this computation, and there are 8 sites.
>> The question is: when I start Swift, will all 15351 input files be copied 
>> to each of the 8 sites?
>>
>> By my test, it is yes.
>>     
>
> It's more like "no" actually. Swift first selects a site for each job 
> that can be run. After that, it stages each job's files to the site 
> that was selected for that job.
>
>   
>> Does Swift have an option so that input files 
>> are only copied on demand?
>>
>> best wishes
>> zhangzhao
>>     
>
>
>   


From hategan at mcs.anl.gov  Thu Feb 12 14:06:32 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 12 Feb 2009 14:06:32 -0600 (CST)
Subject: [Swift-devel] coaster one-liner bootstrap script
Message-ID: <20791025.2204621234469192280.JavaMail.root@zimbra>

On Thu, 2009-02-12 at 01:05 -0600, Michael Wilde wrote:
> I got: coaster-bootstrap.list not found in classpath

Should be fixed in swift r2300.


From wilde at mcs.anl.gov  Thu Feb 12 18:36:13 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 12 Feb 2009 18:36:13 -0600
Subject: [Swift-devel] coaster one-liner bootstrap script
In-Reply-To: <20791025.2204621234469192280.JavaMail.root@zimbra>
References: <20791025.2204621234469192280.JavaMail.root@zimbra>
Message-ID: <4994C07D.8070809@mcs.anl.gov>

I updated to 2300. Now I get the error below 
(java.lang.RuntimeException: Failed to register service)

I'm also a bit confused why I see "which: no gmd5sum in 
(/soft/java-1.5.0_06-sun-r1/bin: etc etc" on stdout - that should be 
going to /dev/null, but it's reproducible in a normal interactive shell. 
Something subtle in eval?
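For comparison, the probe-then-verify step that the bootstrap appears to perform (the log later in this message prints an expected and a computed checksum, and the stray line shows it looking for `gmd5sum`) can be done without any external tool. A minimal Python sketch; the candidate tool list and both function names are assumptions, not the actual bootstrap script:

```python
import hashlib
import shutil


def find_md5_tool(candidates=("gmd5sum", "md5sum", "md5")):
    """Return the path of the first MD5 tool found on PATH, or None.

    shutil.which() just returns None for a missing tool, so nothing is
    printed for candidates that do not exist.
    """
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


def checksum_matches(data, expected_hex):
    """True if the MD5 of the downloaded bytes equals the expected digest."""
    return hashlib.md5(data).hexdigest() == expected_hex.strip().lower()


print(checksum_matches(b"", "d41d8cd98f00b204e9800998ecf8427e"))  # prints True
```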

gram log is in ~osg/gram_job_mgr_17585.log

swift log is in ~wilde/oops7-20090212-1510-kk6i43og.log

(on ci network)

- Mike

On 2/12/09 2:06 PM, Mihael Hategan wrote:
> On Thu, 2009-02-12 at 01:05 -0600, Michael Wilde wrote:
>> I got: coaster-bootstrap.list not found in classpath
> 
> Should be fixed in swift r2300.

com$ ls -l ~osg/coaster-bootstrap-167563000.log
-rw-r--r--  1 osg osgvo 2744 Feb 12 15:11 /home/osgvo/osg/coaster-bootstrap-167563000.log
com$ cat ~osg/coaster-bootstrap-167563000.log
BS: http://communicado.ci.uchicago.edu:50001
Expected checksum: c6dbde30e69462446c06e15a46fba6eb
Computed checksum: c6dbde30e69462446c06e15a46fba6eb
JAVA=/soft/java-1.5.0_06-sun-r1/bin/java
/soft/java-1.5.0_06-sun-r1/bin/java 
-Djava=/soft/java-1.5.0_06-sun-r1/bin/java -DGLOBUS_TCP_PORT_RANGE= 
-DX509_USER_PROXY=/home/osgvo/osg/.globus/job/tp-grid1.ci.uchicago.edu/16700.1234473069/x509_up 
-DX509_CERT_DIR= -DGLOBUS_HOSTNAME=tp-grid1.ci.uchicago.edu -jar 
/tmp/bootstrap.N16834 http://communicado.ci.uchicago.edu:50001 
b3d581fddd49e3d1166f52f6077ddcc5 https://128.135.125.17:50000 167563000
java.lang.RuntimeException: Failed to register service
         at org.globus.cog.abstraction.coaster.service.CoasterService.start(CoasterService.java:111)
         at org.globus.cog.abstraction.coaster.service.CoasterService.main(CoasterService.java:226)
Caused by: org.globus.cog.karajan.workflow.service.channels.ChannelException: Failed to start channel GSSCChannel-https://b3d581fddd49e3d1166f52f6077ddcc5:1984(1)
         at org.globus.cog.karajan.workflow.service.channels.GSSChannel.reconnect(GSSChannel.java:104)
         at org.globus.cog.karajan.workflow.service.channels.GSSChannel.start(GSSChannel.java:63)
         at org.globus.cog.karajan.workflow.service.ChannelFactory.newChannel(ChannelFactory.java:43)
         at org.globus.cog.karajan.workflow.service.Client.connect(Client.java:115)
         at org.globus.cog.karajan.workflow.service.Client.newClient(Client.java:72)
         at org.globus.cog.karajan.workflow.service.channels.ChannelManager.connect(ChannelManager.java:211)
         at org.globus.cog.karajan.workflow.service.channels.ChannelManager.reserveChannel(ChannelManager.java:230)
         at org.globus.cog.karajan.workflow.service.channels.ChannelManager.reserveChannel(ChannelManager.java:186)
         at org.globus.cog.abstraction.coaster.service.CoasterService.start(CoasterService.java:100)
         ... 1 more
Caused by: java.net.UnknownHostException: b3d581fddd49e3d1166f52f6077ddcc5: b3d581fddd49e3d1166f52f6077ddcc5
         at java.net.InetAddress.getAllByName0(InetAddress.java:1128)
         at java.net.InetAddress.getAllByName0(InetAddress.java:1098)
         at java.net.InetAddress.getAllByName(InetAddress.java:1061)
         at java.net.InetAddress.getByName(InetAddress.java:958)
         at org.globus.net.SocketFactory.createSocket(SocketFactory.java:53)
         at org.globus.gsi.gssapi.net.GssSocket.<init>(GssSocket.java:56)
         at org.globus.gsi.gssapi.net.impl.GSIGssSocket.<init>(GSIGssSocket.java:29)
         at org.globus.gsi.gssapi.net.impl.GSIGssSocketFactory.createSocket(GSIGssSocketFactory.java:38)
         at org.globus.cog.karajan.workflow.service.channels.GSSChannel.reconnect(GSSChannel.java:90)
         ... 9 more

EC: 1
BS: http://communicado.ci.uchicago.edu:50001
Failed to download bootstrap jar from 
http://communicado.ci.uchicago.edu:50001
com$

---- and on stdout/stderr:

com$ cat swift.out
/home/wilde/swift/tools/swiftrun: Swift script oops7.swift starting at 
Thu Feb 12 15:10:57 CST 2009
running on sites: teraport.coaster.gt2.osg

Swift svn swift-r2532 cog-r2300

RunID: 20090212-1510-kk6i43og
Progress:
Progress:  Stage in:1 Initializing site shared directory:1
Progress:  Stage in:1 Submitting:1
Progress:  Submitting:1 Submitted:1
Failed to transfer wrapper log from oops7-20090212-1510-kk6i43og/info/j 
on teraport
Execution failed:
         Exception in runoops:
Arguments: [input/fasta/T1af7.fasta, input/secseq/T1af7.secseq, 
input/native/T1af7.pdb, output/T1af7.0.pdt, output/T1af7.0.rmsd, 0, TEMP 
UPDATE INTERVAL = 10, SMOOTH DEVIATION COEFFICIENT = 0.80001]
Host: teraport
Directory: oops7-20090212-1510-kk6i43og/jobs/j/runoops-j1j9zi6j
stderr.txt:

stdout.txt:

----

Caused by:
         Could not submit job
Caused by:
         Could not start coaster service
Caused by:
         Task ended before registration was received.
STDOUT: which: no gmd5sum in 
(/soft/java-1.5.0_06-sun-r1/bin:/soft/java-1.5.0_06-sun-r1/jre/bin:/usr/kerberos/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-r1/bin:/home/osgvo/osg/bin/linux-rhel4-x86_64:/home/osgvo/osg/bin:/soft/xcat-1.2.0-r1/bin:/soft/xcat-1.2.0-r1/sbin:/soft/xcat-1.2.0-r1/x86_64/bin:/soft/xcat-1.2.0-r1/x86_64/sbin:/soft/xcat-1.2.0-r1/contrib/bin:/soft/xcat-1.2.0-r1/contrib/sbin:/soft/xcat-1.2.0-r1/contrib/x86_64/bin:/soft/xcat-1.2.0-r1/contrib/x86_64/sbin)


STDERR: null
Cleaning up...
  Done

/home/wilde/swift/tools/swiftrun: Swift Script oops7.swift ended at Thu 
Feb 12 15:11:24 CST 2009 with exit code 0
com$


From hategan at mcs.anl.gov  Thu Feb 12 19:28:23 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 12 Feb 2009 19:28:23 -0600 (CST)
Subject: [Swift-devel] coaster one-liner bootstrap script
In-Reply-To: <4994C07D.8070809@mcs.anl.gov>
Message-ID: <11175850.2224431234488503374.JavaMail.root@zimbra>


----- Michael Wilde <wilde at mcs.anl.gov> wrote:
> I updated to 2300. Now I get the error below 
> (java.lang.RuntimeException: Failed to register service)

Excellent! It works.

The exception is due to me messing with the parameters, but 
Java starts properly.


From wilde at mcs.anl.gov  Fri Feb 13 08:59:55 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 13 Feb 2009 08:59:55 -0600
Subject: [Swift-devel] Status of coasters
Message-ID: <49958AEB.3090002@mcs.anl.gov>

Here's my understanding of status, issues and needs on coasters.

Some side discussion with Mihael on various coaster issues is summarized 
here as well; clarifications welcome.

Work in progress:

- Mihael has a good handle on the bootstrap issues and is working on 
improvements. This is not working in trunk at the moment but will likely be 
fixed soon. We think this will fix known issues with: command line length 
for Condor; spaces, quotes, newlines and other problematic arguments; 
and locating Java and tools (wget/curl and md5sum).

- still to do on above: sites.xml attribute to explicitly specify 
location of tools, or at least of Java.

- Ben has a patch to integrate to run the coaster service on a worker 
node. Question: this is only usable when workers have sufficient IP 
access, correct?

- The scalability problem submitting to GT2 GRAM sites still exists. 
Potential solutions are:

-- Service submits workers via PBS (using jobmanager=gt2:pbs). Valid only 
on PBS sites. Not yet tested.

-- Service submits workers via Condor-G (using jobmanager=gt2:condor). 
Mihael feels this requires a new Condor provider (the one in the current 
code base is insufficient and untested, really more of a prototype 
developed by a student).

-- Service submits via WS-GRAM. This should be tested on sites where 
WS-GRAM is working.
This would use jobmanager=gt2:gt4:{pbs/condor/sge}, and needs to be tested.
For sites where WS-GRAM is not functional, I suggested we consider 
configuring our own non-root WS-GRAM, ideally using already-installed 
GT4 software, e.g., from the OSG package on OSG and TG sites where it's 
installed. Mihael thought this would be considerable work. I agree, but 
it might be a stable solution with fewer unknowns and support from the 
GRAM group. We can bring in the latest GT4 as needed if that provides a 
better solution than some older installed GT4 which we have no control 
over and which won't change until upcoming releases of, say, OSG or TG packages.

Doing the above should then enable large-scale testing of user workflows 
across many OSG and TG sites, without need to throttle back the *number* 
of jobs waiting or running.

Lastly: it seems that a Condor-G provider might be a powerful capability 
(as one configuration option) to be able to submit all swift jobs via 
Condor-G (e.g, for non-coaster runs as well).  Please comment on the 
value of such a capability.

- Mike


From hategan at mcs.anl.gov  Fri Feb 13 09:17:39 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 13 Feb 2009 09:17:39 -0600
Subject: [Swift-devel] Status of coasters
In-Reply-To: <49958AEB.3090002@mcs.anl.gov>
References: <49958AEB.3090002@mcs.anl.gov>
Message-ID: <1234538259.25737.1.camel@localhost>

On Fri, 2009-02-13 at 08:59 -0600, Michael Wilde wrote:
> -- Service submits via WS-GRAM. This should be tested on sites where 
> WS-GRAM is working.
> This would use jobmanager=gt2:gt4:{pbs/condor/sge}, and needs to be tested.
> For sites where WS-GRAM is not functional, I suggested we consider 
> configuring our own non-root WS-GRAM, ideally using already-installed 
> GT4 software, e.g., from the OSG package on OSG and TG sites where it's 
> installed. Mihael thought this would be considerable work. 

Not as much the amount of work, but:
1. getting root on sites (if installed for multiple users)

OR

2. telling our users that they need to install a GT4 server in order to
submit many jobs at once using swift.


From foster at anl.gov  Fri Feb 13 09:17:38 2009
From: foster at anl.gov (Ian Foster)
Date: Fri, 13 Feb 2009 09:17:38 -0600
Subject: [Swift-devel] Status of coasters
In-Reply-To: <49958AEB.3090002@mcs.anl.gov>
References: <49958AEB.3090002@mcs.anl.gov>
Message-ID: <E9B17877-30FE-4206-BEEA-C032D1806D6F@anl.gov>

Mike:

What is the scalability problem WRT GT2 GRAM sites?

Ian.


On Feb 13, 2009, at 8:59 AM, Michael Wilde wrote:

> Here's my understanding of status, issues and needs on coasters.
>
> Some side discussion with Mihael on various coaster issues is  
> summarized here as well; clarifications welcome.
>
> Work in progress:
>
> - Mihael has a good handle on the bootstrap issues and is working on  
> improvements. This is not working in trunk at the moment but will  
> likely be fixed soon. We think this will fix known issues with:  
> command line length for Condor; spaces, quotes, newlines and other  
> problematic arguments; and locating Java and tools (wget/curl and  
> md5sum).
>
> - still to do on above: sites.xml attribute to explicitly specify  
> location of tools, or at least of Java.
>
> - Ben has a patch to integrate to run the coaster service on a  
> worker node. Question: this is only usable when workers have  
> sufficient IP access, correct?
>
> - The scalability problem submitting to GT2 GRAM sites still exists.  
> Potential solutions are:
>
> -- Service submits workers via PBS (using jobmanager=gt2:pbs). Valid  
> only on PBS sites. Not yet tested.
>
> -- Service submits workers via Condor-G (using  
> jobmanager=gt2:condor). Mihael feels this requires a new Condor  
> provider (the one in the current code base is insufficient and  
> untested, really more of a prototype developed by a student).
>
> -- Service submits via WS-GRAM. This should be tested on sites  
> where WS-GRAM is working.
> This would use jobmanager=gt2:gt4:{pbs/condor/sge}, and needs to be  
> tested.
> For sites where WS-GRAM is not functional, I suggested we consider  
> configuring our own non-root WS-GRAM, ideally using  
> already-installed GT4 software, e.g., from the OSG package on OSG and TG sites  
> where it's installed. Mihael thought this would be considerable work.  
> I agree, but it might be a stable solution with fewer unknowns and  
> support from the GRAM group. We can bring in the latest GT4 as needed  
> if that provides a better solution than some older installed GT4  
> which we have no control over and which won't change until upcoming  
> releases of, say, OSG or TG packages.
>
> Doing the above should then enable large-scale testing of user  
> workflows across many OSG and TG sites, without need to throttle  
> back the *number* of jobs waiting or running.
>
> Lastly: it seems that a Condor-G provider might be a powerful  
> capability (as one configuration option) to be able to submit all  
> swift jobs via Condor-G (e.g, for non-coaster runs as well).  Please  
> comment on the value of such a capability.
>
> - Mike


From wilde at mcs.anl.gov  Fri Feb 13 09:20:45 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 13 Feb 2009 09:20:45 -0600
Subject: [Swift-devel] Status of coasters
In-Reply-To: <E9B17877-30FE-4206-BEEA-C032D1806D6F@anl.gov>
References: <49958AEB.3090002@mcs.anl.gov>
	<E9B17877-30FE-4206-BEEA-C032D1806D6F@anl.gov>
Message-ID: <49958FCD.3060106@mcs.anl.gov>

It's the problem of resource consumption by the jobmanager: the 
longstanding problem that the Condor-G GRID_MONITOR addresses; the 
problem that requires us to scale back to sending fewer than 20-40 jobs to 
any OSG site when we use pre-WS-GRAM.


On 2/13/09 9:17 AM, Ian Foster wrote:
> Mike:
> 
> What is the scalability problem WRT GT2 GRAM sites?
> 
> Ian.
> 
> 
> On Feb 13, 2009, at 8:59 AM, Michael Wilde wrote:
> 
>> Here's my understanding of status, issues and needs on coasters.
>>
>> Some side discussion with Mihael on various coaster issues is 
>> summarized here as well; clarifications welcome.
>>
>> Work in progress:
>>
>> - Mihael has a good handle on the bootstrap issues and is working on 
>> improvements. This is not working in trunk at the moment but will likely 
>> be fixed soon. We think this will fix known issues with: command line 
>> length for Condor; spaces, quotes, newlines and other problematic 
>> arguments; and locating Java and tools (wget/curl and md5sum).
>>
>> - still to do on above: sites.xml attribute to explicitly specify 
>> location of tools, or at least of Java.
>>
>> - Ben has a patch to integrate to run the coaster service on a worker 
>> node. Question: this is only usable when workers have sufficient IP 
>> access, correct?
>>
>> - The scalability problem submitting to GT2 GRAM sites still exists. 
>> Potential solutions are:
>>
>> -- Service submits workers via PBS (using jobmanager=gt2:pbs). Valid 
>> only on PBS sites. Not yet tested.
>>
>> -- Service submits workers via Condor-G (using jobmanager=gt2:condor). 
>> Mihael feels this requires a new Condor provider (the one in the 
>> current code base is insufficient and untested, really more of a 
>> prototype developed by a student).
>>
>> -- Service submits via WS-GRAM. This should be tested on sites where 
>> WS-GRAM is working.
>> This would use jobmanager=gt2:gt4:{pbs/condor/sge}, and needs to be 
>> tested.
>> For sites where WS-GRAM is not functional, I suggested we consider 
>> configuring our own non-root WS-GRAM, ideally using already-installed 
>> GT4 software, e.g., from the OSG package on OSG and TG sites where it's 
>> installed. Mihael thought this would be considerable work. I agree, but 
>> it might be a stable solution with fewer unknowns and support from the 
>> GRAM group. We can bring in the latest GT4 as needed if that provides 
>> a better solution than some older installed GT4 which we have no 
>> control over and which won't change until upcoming releases of, say, OSG 
>> or TG packages.
>>
>> Doing the above should then enable large-scale testing of user 
>> workflows across many OSG and TG sites, without need to throttle back 
>> the *number* of jobs waiting or running.
>>
>> Lastly: it seems that a Condor-G provider might be a powerful 
>> capability (as one configuration option) to be able to submit all 
>> swift jobs via Condor-G (e.g, for non-coaster runs as well).  Please 
>> comment on the value of such a capability.
>>
>> - Mike
> 


From wilde at mcs.anl.gov  Fri Feb 13 09:23:21 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 13 Feb 2009 09:23:21 -0600
Subject: [Swift-devel] Status of coasters
In-Reply-To: <1234538259.25737.1.camel@localhost>
References: <49958AEB.3090002@mcs.anl.gov> <1234538259.25737.1.camel@localhost>
Message-ID: <49959069.7020602@mcs.anl.gov>


On 2/13/09 9:17 AM, Mihael Hategan wrote:
> On Fri, 2009-02-13 at 08:59 -0600, Michael Wilde wrote:
>> -- Service submits via WS-GRAM. This should be tested on sites where 
>> WS-GRAM is working.
>> This would use jobmanager=gt2:gt4:{pbs/condor/sge}, and needs to be tested.
>> For sites where WS-GRAM is not functional, I suggested we consider 
>> configuring our own non-root WS-GRAM, ideally using already-installed 
>> GT4 software, e.g., from the OSG package on OSG and TG sites where it's 
>> installed. Mihael thought this would be considerable work. 
> 
> Not as much the amount of work, but:
> 1. getting root on sites (if installed for multiple users)

I was thinking/hoping we could have a single setup script, which we'd 
pre-install where required, that would configure and start a personal 
WSRF container for the user. Which would be work for us to create, 
install and maintain, but, if successful, would be transparent to the user.

> OR
> 
> 2. telling our users that they need to install a GT4 server in order to
> submit many jobs at once using swift.

No, that would not be a good route. If that were required, I'd call this 
alternative undesirable.


From benc at hawaga.org.uk  Fri Feb 13 09:27:37 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 13 Feb 2009 15:27:37 +0000 (GMT)
Subject: [Swift-devel] Status of coasters
In-Reply-To: <49958AEB.3090002@mcs.anl.gov>
References: <49958AEB.3090002@mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0902131504520.1293@dildano.hawaga.org.uk>


On Fri, 13 Feb 2009, Michael Wilde wrote:

> - Ben has a patch to integrate to run the coaster service on a worker node.
> Question: this is only usable when workers have sufficient IP access, correct?

Yes. I plan on making this presentable and then committing it. As part of 
that, probably I should document who connects where in coasters with a 
pretty diagram, to aid in understanding of what 'sufficient' is.

> - The scalability problem submitting to GT2 GRAM sites still exists. Potential
> solutions are:
> 
> -- Service submits workers via PBS (using jobmanager=gt2:pbs). Valid only on
> PBS sites. Not yet tested.
> 
> -- Service submits workers via Condor-G (using jobmanager=gt2:condor). Mihael
> feels this requires a new Condor provider (the one in the current code base
> is insufficient and untested, really more of a prototype developed by a
> student).

That would be regular Condor, not Condor-G, I think.

The two above could be summarised as "submit service workers through the 
local LRM using CoG specific providers for that LRM".

The PBS provider seems to be getting a reasonable amount of use recently, 
and I think is also useful in the single-site case where it allows GRAM to 
be avoided entirely.

A decent Condor provider would probably allow something similar for Condor 
based clusters.

> -- Service submits via WS-GRAM. This should be tested on sites where 
> WS-GRAM is working. This would use jobmanager=gt2:gt4:{pbs/condor/sge}, 
> and needs to be tested.

If gram4.0 is working on a site, is there any reason to use gt2 for the 
head job submission? It seems to add a dependency on one more service 
(depending on both gram2 and gram4.0) rather than substituting one 
dependency for another (gram2 for gram4.0)
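Ben's dependency argument can be made mechanical. The Python sketch below assumes, from the wording in this thread rather than from the CoG source, that the first component of a coaster jobmanager string is the head-job submission provider and the remaining components describe how the service submits workers; `gt4:pbs` is a hypothetical variant shown only for contrast, and the function names are illustrative:

```python
# Assumed reading of the jobmanager strings in this thread: first component
# submits the coaster service (head job); the rest describe worker submission.
GRAM_SERVICES = {"gt2": "GRAM2 (pre-WS)", "gt4": "GRAM4 (WS-GRAM)"}


def parse_jobmanager(spec):
    """Split e.g. "gt2:gt4:pbs" into ("gt2", ["gt4", "pbs"])."""
    parts = spec.split(":")
    if len(parts) < 2:
        raise ValueError(f"expected head:worker components in {spec!r}")
    return parts[0], parts[1:]


def gram_dependencies(spec):
    """Set of GRAM services a run with this jobmanager string relies on."""
    head, workers = parse_jobmanager(spec)
    return {GRAM_SERVICES[p] for p in [head, *workers] if p in GRAM_SERVICES}


for spec in ("gt2:pbs", "gt2:gt4:pbs", "gt4:pbs"):
    print(spec, sorted(gram_dependencies(spec)))
```

Under that reading, `gt2:gt4:pbs` is the only one of the three that depends on both GRAM2 and GRAM4.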

> For sites where WS-GRAM is not functional, I suggested we consider configuring
> our own non-root WS-GRAM, ideally using already-installed GT4 software, e.g.,
> from the OSG package on OSG and TG sites where it's installed. Mihael thought
> this would be considerable work. I agree, but it might be a stable solution
> with fewer unknowns and support from the GRAM group. We can bring in the latest
> GT4 as needed if that provides a better solution than some older installed GT4
> which we have no control over and which won't change until upcoming releases of,
> say, OSG or TG packages.

I agree that this is considerable work. I think it is not something we 
should pursue.

> Lastly: it seems that a Condor-G provider might be a powerful capability (as
> one configuration option) to be able to submit all swift jobs via Condor-G
> (e.g, for non-coaster runs as well).  Please comment on the value of such a
> capability.

I've pondered that before.

Using Condor-G appears to be the officially supported mechanism for 
submitting to OSG in some people's minds; and similarly, using plain GRAM2 
is Prohibited in those people's minds.

Using Condor-G would be more in line with some people's views of how jobs 
should properly be submitted to OSG.

Such functionality could fit in as a CoG execution provider (similar to, 
or part of a plain Condor execution provider), and would not perturb the 
architecture of Swift. Swift runs in such a situation would look a little 
like DAGman runs, with a management process handling some rate limiting 
and deciding which jobs to run and where, but then the mechanics of 
submission being handled by a local Condor.

This approach would necessitate a local Condor installation, but only in 
situations where this approach was used; so this would not perturb 
usability too much, and many places where this would be used already have 
a Condor installation.

So I'm cautiously supportive of this approach.

Specifically given the two different uses for condor interfacing discussed 
above, I think that it would be useful to investigate making the Condor 
provider decent.

-- 


From benc at hawaga.org.uk  Fri Feb 13 09:29:53 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 13 Feb 2009 15:29:53 +0000 (GMT)
Subject: [Swift-devel] Status of coasters
In-Reply-To: <E9B17877-30FE-4206-BEEA-C032D1806D6F@anl.gov>
References: <49958AEB.3090002@mcs.anl.gov>
	<E9B17877-30FE-4206-BEEA-C032D1806D6F@anl.gov>
Message-ID: <Pine.LNX.4.64.0902131528180.1293@dildano.hawaga.org.uk>


On Fri, 13 Feb 2009, Ian Foster wrote:

> What is the scalability problem WRT GT2 GRAM sites?

loadavg on the submit site = k * the number of GRAM2 jobs running on that 
site or queued on that site.

where k is in the range 0.1 .. 1 in my informal testing.
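As a rough worked example of this rule of thumb: if the head-node load 
average grows as k times the number of GRAM2 jobs running or queued there, 
with k somewhere between 0.1 and 1, the expected load range for a given job 
count follows directly. This is a minimal sketch; the function name and the 
job counts are illustrative assumptions, and the linear model is only the 
informal estimate stated above.

```python
def gram2_load_range(num_jobs, k_low=0.1, k_high=1.0):
    """Estimate head-node load average for num_jobs GRAM2 jobs (running or queued).

    Assumes the informal linear model loadavg = k * num_jobs,
    with k somewhere between k_low and k_high.
    """
    if num_jobs < 0:
        raise ValueError("num_jobs must be non-negative")
    return (k_low * num_jobs, k_high * num_jobs)


if __name__ == "__main__":
    # Print the estimated load band for a few illustrative job counts.
    for jobs in (10, 100, 500):
        low, high = gram2_load_range(jobs)
        print(f"{jobs:4d} jobs -> loadavg roughly {low:.0f} to {high:.0f}")
```

Under this model, a few hundred queued jobs could put a shared head node's 
load average anywhere from the tens into the hundreds.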

-- 


From hategan at mcs.anl.gov  Fri Feb 13 09:32:43 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 13 Feb 2009 09:32:43 -0600
Subject: [Swift-devel] Status of coasters
In-Reply-To: <49959069.7020602@mcs.anl.gov>
References: <49958AEB.3090002@mcs.anl.gov>
	<1234538259.25737.1.camel@localhost>  <49959069.7020602@mcs.anl.gov>
Message-ID: <1234539163.26116.3.camel@localhost>

On Fri, 2009-02-13 at 09:23 -0600, Michael Wilde wrote:
> 
> On 2/13/09 9:17 AM, Mihael Hategan wrote:
> > On Fri, 2009-02-13 at 08:59 -0600, Michael Wilde wrote:
> >> -- Service submits via WS-GRAM. This should be tested, on sites where 
> >> WS-GRAM is working.
> >> This would use jobmanager=gt2:gt4:{pbs/condor/sge}, and needs to be tested.
> >> For sites where WS-GRAM is not functional, I suggested we consider 
> >> configuring our own non-root WS-GRAM, ideally using already-installed 
> >> GT4 software, eg, from the OSG package on OSG and TG sites where it's 
> >> installed. Mihael thought this would be considerable work. 
> > 
> > Not as much the amount of work, but:
> > 1. getting root on sites (if installed for multiple users)
> 
> I was thinking/hoping we could have a single setup script, which we'd 
> pre-install where required, that would configure and start a personal 
> WSRF container for the user. Which would be work for us to create, 
> install and maintain, but, if successful, would be transparent to the user.

You need root to configure sudo.

I also don't think you can easily automate the gt4 installation process.
There's manual configuration that needs to be done. I suggest installing
your own gt4 once to which I can submit jobs to get an idea of what's
involved.


From wilde at mcs.anl.gov  Fri Feb 13 09:40:08 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 13 Feb 2009 09:40:08 -0600
Subject: [Swift-devel] Status of coasters
In-Reply-To: <Pine.LNX.4.64.0902131504520.1293@dildano.hawaga.org.uk>
References: <49958AEB.3090002@mcs.anl.gov>
	<Pine.LNX.4.64.0902131504520.1293@dildano.hawaga.org.uk>
Message-ID: <49959458.3070704@mcs.anl.gov>


On 2/13/09 9:27 AM, Ben Clifford wrote:
> On Fri, 13 Feb 2009, Michael Wilde wrote:
> 
>> - Ben has a patch to integrate to run the coaster service on a worker node.
>> Question: this is only usable when workers have sufficient IP access, correct?
> 
> Yes. I plan on making this presentable and then committing it. As part of 
> that, probably I should document who connects where in coasters with a 
> pretty diagram, to aid in understanding of what 'sufficient' is.

Very good; I was just thinking of the same diagram, even as design 
documentation to help us grok the setup and communication paths for 
coasters.

Also: coaster-server-on-workernode has the nice advantage that we don't 
run any swift software on infrastructure nodes like headnodes: less 
chance to cause damage; more power for our workflow. Gets round 
potential problem that managed-fork JM will kill our process for 
exceeding a walltime limit. Nice philosophy overall.

>> - The scalability problem submitting to GT2 GRAM sites still exists. Potential
>> solutions are:
>>
>> -- Service submits workers via PBS (using jobmanager=gt2:pbs). Valid only on
>> PBS sites. Not yet tested.
>>
>> -- Service submits workers via Condor-G (using jobmanager=gt2:condor). Mihael
>> feels this requires a new Condor provider (the one in the current code base
>> being insufficient and untested - really more of a prototype developed by a
>> student).
> 
> That would be regular Condor, not Condor-G, I think.

Seems could be either:
- regular Condor to submit to the local condor pool
- Condor-G to submit back through GT2 but with aid of its GRID_MONITOR 
for scalability, and would be LRM-independent.

> 
> The two above could be summarised as "submit service workers through the 
> local LRM using CoG specific providers for that LRM".
> 
> The PBS provider seems to be getting a reasonable amount of use recently, 
> and I think is also useful in the single-site case where it allows GRAM to 
> be avoided entirely.
> 
> A decent Condor provider would probably allow something similar for Condor 
> based clusters.
> 
>> -- Service submits via WS-GRAM. This should be tested, on sites where 
>> WS-GRAM is working. This would use jobmanager=gt2:gt4:{pbs/condor/sge}, 
>> and needs to be tested.
> 
> If gram4.0 is working on a site, is there any reason to use gt2 for the 
> head job submission?

No, not at all: we should indeed use WSGRAM in those cases. In fact, we 
should use it wherever possible - i.e., wherever it provides the best 
available job exec service.

> It seems to add a dependency on one more service 
> (depending on both gram2 and gram4.0) rather than substituting one 
> dependency for another (gram2 for gram4.0)
> 
>> For sites where WS-GRAM is not functional, I suggested we consider configuring
>> our own non-root WS-GRAM, ideally using already-installed GT4 software, eg,
>> from the OSG package on OSG and TG sites where it's installed. Mihael thought
>> this would be considerable work. I agree but it might be a stable solution
>> with fewer unknowns and support from the GRAM group. We can bring in the latest
>> GT4 as needed if that provides a better solution than some older installed GT4
>> which we have no control over and which won't change till upcoming releases of
>> say OSG or TG packages.
> 
> I agree that this is considerable work. I think it is not something we 
> should pursue.
> 
>> Lastly: it seems that a Condor-G provider might be a powerful capability (as
>> one configuration option) to be able to submit all swift jobs via Condor-G
>> (e.g, for non-coaster runs as well).  Please comment on the value of such a
>> capability.
> 
> I've pondered that before.
> 
> Using Condor-G appears to be the officially supported mechanism for 
> submitting to OSG in some people's minds; and similarly, using plain GRAM2 
> is Prohibited in those people's minds.
> 
> Using Condor-G would be more in line with some people's views of how jobs 
> should properly be submitted to OSG.
> 
> Such functionality could fit in as a CoG execution provider (similar to, 
> or part of a plain Condor execution provider), and would not perturb the 
> architecture of Swift. Swift runs in such a situation would look a little 
> like DAGman runs, with a management process handling some rate limiting 
> and deciding which jobs to run and where, but then the mechanics of 
> submission being handled by a local Condor.
> 
> This approach would necessitate a local Condor installation, but only in 
> situations where this approach was used; so this would not perturb 
> usability too much, and many places where this would be used already have 
> a Condor installation.
> 
> So I'm cautiously supportive of this approach.

Excellent, and I agree with your analysis.

I'll draft a priority list for such efforts and then circulate to the group.

> 
> Specifically given the two different uses for condor interfacing discussed 
> above, I think that it would be useful to investigate making the Condor 
> provider decent.
> 

Agreed.


From tfreeman at mcs.anl.gov  Fri Feb 13 09:40:32 2009
From: tfreeman at mcs.anl.gov (Tim Freeman)
Date: Fri, 13 Feb 2009 09:40:32 -0600
Subject: [Swift-devel] Status of coasters
In-Reply-To: <1234539163.26116.3.camel@localhost>
References: <49958AEB.3090002@mcs.anl.gov> <1234538259.25737.1.camel@localhost>
	<49959069.7020602@mcs.anl.gov> <1234539163.26116.3.camel@localhost>
Message-ID: <20090213094032.2b041afc@prnb>

On Fri, 13 Feb 2009 09:32:43 -0600
Mihael Hategan <hategan at mcs.anl.gov> wrote:

> On Fri, 2009-02-13 at 09:23 -0600, Michael Wilde wrote:
> > 
> > On 2/13/09 9:17 AM, Mihael Hategan wrote:
> > > On Fri, 2009-02-13 at 08:59 -0600, Michael Wilde wrote:
> > >> -- Service submits via WS-GRAM. This should be tested, on sites where 
> > >> WS-GRAM is working.
> > >> This would use jobmanager=gt2:gt4:{pbs/condor/sge}, and needs to be
> > >> tested. For sites where WS-GRAM is not functional, I suggested we
> > >> consider configuring our own non-root WS-GRAM, ideally using
> > >> already-installed GT4 software, eg, from the OSG package on OSG and TG
> > >> sites where it's installed. Mihael thought this would be considerable
> > >> work. 
> > > 
> > > Not as much the amount of work, but:
> > > 1. getting root on sites (if installed for multiple users)
> > 
> > I was thinking/hoping we could have a single setup script, which we'd 
> > pre-install where required, that would configure and start a personal 
> > WSRF container for the user. Which would be work for us to create, 
> > install and maintain, but, if successful, would be transparent to the user.
> 
> You need root to configure sudo.

Why would you need sudo for gram if it mapped to the same account?

> 
> I also don't think you can easily automate the gt4 installation process.
> There's manual configuration that needs to be done. I suggest installing
> your own gt4 once to which I can submit jobs to get an idea of what's
> involved.

For non-gram, there's an auto script here:

    http://workspace.globus.org/vm/TP2.2/admin/quickstart.html#auto-container

The questions it asks users (is this the right hostname? etc) could also be
scripted.

Getting from there to GRAM4 auto-configuration should not be too much of a step
(the nimbus contextualization scripts have done it for gram4).

Tim


From benc at hawaga.org.uk  Fri Feb 13 09:44:14 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 13 Feb 2009 15:44:14 +0000 (GMT)
Subject: [Swift-devel] Status of coasters
In-Reply-To: <49958AEB.3090002@mcs.anl.gov>
References: <49958AEB.3090002@mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0902131538350.1293@dildano.hawaga.org.uk>


On Fri, 13 Feb 2009, Michael Wilde wrote:

> -- Service submits via WS-GRAM. This should be tested, on sites where 
> WS-GRAM is working. This would use jobmanager=gt2:gt4:{pbs/condor/sge}, 
> and needs to be tested. For sites where WS-GRAM is not functional, I 
> suggested we consider configuring our own non-root WS-GRAM, ideally 
> using already-installed GT4 software, eg, from the OSG package on OSG 
> and TG sites where it's installed. Mihael thought this would be 
> considerable work. I agree but it might be a stable solution with fewer 
> unknowns and support from the GRAM group. We can bring in the latest GT4 
> as needed if that provides a better solution than some older installed 
> GT4 which we have no control over and which won't change till upcoming 
> releases of say OSG or TG packages.

We already deploy an execution system on the remote head node. It's called 
coasters.

To deploy another execution service on a site through which our existing 
execution service on that site can execute things seems perverse.

Putting aside "we must use GRAM" dogma, the key benefit of using GRAM 
would be (I think) to get access to the wider range of LRM adapters than 
is provided by CoG. If that actually is the benefit we're after by this, 
then we should consider other ways in which we might more profitably 
interface to those LRM adapters.

There might be other reasons to use GRAM locally that I have not thought 
of, though.

-- 


From wilde at mcs.anl.gov  Fri Feb 13 09:52:51 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 13 Feb 2009 09:52:51 -0600
Subject: [Swift-devel] Status of coasters
In-Reply-To: <20090213094032.2b041afc@prnb>
References: <49958AEB.3090002@mcs.anl.gov>	<1234538259.25737.1.camel@localhost>	<49959069.7020602@mcs.anl.gov>	<1234539163.26116.3.camel@localhost>
	<20090213094032.2b041afc@prnb>
Message-ID: <49959753.9050001@mcs.anl.gov>

Tim, we should take your advice under advisement.

I can only leave the merits and cost/benefit assessment of this approach 
to you GT4 experts (Ben, Mihael, Tim, ...)

I'm open to the idea, and it sounds like it's not totally off the table, 
but it has some costs and some unknowns.

The diagram Ben suggests can also be expressed as a list of config 
alternatives, essentially an embellishment of the 
jobmanager=service-submitter:worker-submitter:worker-submitter-jobmanager 
string.

The first message on this thread started to enumerate coaster config 
alternatives.

I suggest we (or I) clarify those into a clean list.

Then we can comment on the cost/benefit tradeoffs of the different 
alternatives, and denote which ones have been tested, which ones should 
be tested, and which ones need how much development work. I think 
there are some obvious "low hanging fruit"-ful ones that float to the top, 
which we can test, debug, and harden now, and some that require more 
development, some of which have greater additional benefits (like a good 
Condor provider).

If a user-config for a WS-GRAM container proved easier than expected, 
possibly with help from you, Tim, or from other GRAM experts, then 
perhaps it can stay on the table.

- Mike


On 2/13/09 9:40 AM, Tim Freeman wrote:
> On Fri, 13 Feb 2009 09:32:43 -0600
> Mihael Hategan <hategan at mcs.anl.gov> wrote:
> 
>> On Fri, 2009-02-13 at 09:23 -0600, Michael Wilde wrote:
>>> On 2/13/09 9:17 AM, Mihael Hategan wrote:
>>>> On Fri, 2009-02-13 at 08:59 -0600, Michael Wilde wrote:
>>>>> -- Service submits via WS-GRAM. This should be tested, on sites where 
>>>>> WS-GRAM is working.
>>>>> This would use jobmanager=gt2:gt4:{pbs/condor/sge}, and needs to be
>>>>> tested. For sites where WS-GRAM is not functional, I suggested we
>>>>> consider configuring our own non-root WS-GRAM, ideally using
>>>>> already-installed GT4 software, eg, from the OSG package on OSG and TG
>>>>> sites where it's installed. Mihael thought this would be considerable
>>>>> work. 
>>>> Not as much the amount of work, but:
>>>> 1. getting root on sites (if installed for multiple users)
>>> I was thinking/hoping we could have a single setup script, which we'd 
>>> pre-install where required, that would configure and start a personal 
>>> WSRF container for the user. Which would be work for us to create, 
>>> install and maintain, but, if successful, would be transparent to the user.
>> You need root to configure sudo.
> 
> Why would you need sudo for gram if it mapped to the same account?
> 
>> I also don't think you can easily automate the gt4 installation process.
>> There's manual configuration that needs to be done. I suggest installing
>> your own gt4 once to which I can submit jobs to get an idea of what's
>> involved.
> 
> For non-gram, there's an auto script here:
> 
>     http://workspace.globus.org/vm/TP2.2/admin/quickstart.html#auto-container
> 
> The questions it asks users (is this the right hostname? etc) could also be
> scripted.
> 
> Getting from there to GRAM4 auto-configuration should not be too much of a step
> (the nimbus contextualization scripts have done it for gram4).
> 
> Tim


From smartin at mcs.anl.gov  Fri Feb 13 09:58:38 2009
From: smartin at mcs.anl.gov (Stuart Martin)
Date: Fri, 13 Feb 2009 09:58:38 -0600
Subject: [Swift-devel] Status of coasters
In-Reply-To: <20090213094032.2b041afc@prnb>
References: <49958AEB.3090002@mcs.anl.gov> <1234538259.25737.1.camel@localhost>
	<49959069.7020602@mcs.anl.gov> <1234539163.26116.3.camel@localhost>
	<20090213094032.2b041afc@prnb>
Message-ID: <FA43DB88-FB0C-407E-8401-27DD2CBB7B63@mcs.anl.gov>

On Feb 13, 2009, at Feb 13, 9:40 AM, Tim Freeman wrote:

> On Fri, 13 Feb 2009 09:32:43 -0600
> Mihael Hategan <hategan at mcs.anl.gov> wrote:
>
>> On Fri, 2009-02-13 at 09:23 -0600, Michael Wilde wrote:
>>>
>>> On 2/13/09 9:17 AM, Mihael Hategan wrote:
>>>> On Fri, 2009-02-13 at 08:59 -0600, Michael Wilde wrote:
>>>>> -- Service submits via WS-GRAM. This should be tested, on sites where
>>>>> WS-GRAM is working.
>>>>> This would use jobmanager=gt2:gt4:{pbs/condor/sge}, and needs to be
>>>>> tested. For sites where WS-GRAM is not functional, I suggested we
>>>>> consider configuring our own non-root WS-GRAM, ideally using
>>>>> already-installed GT4 software, eg, from the OSG package on OSG and TG
>>>>> sites where it's installed. Mihael thought this would be considerable
>>>>> work.
>>>>
>>>> Not as much the amount of work, but:
>>>> 1. getting root on sites (if installed for multiple users)
>>>
>>> I was thinking/hoping we could have a single setup script, which we'd
>>> pre-install where required, that would configure and start a personal
>>> WSRF container for the user. Which would be work for us to create,
>>> install and maintain, but, if successful, would be transparent to the user.
>>
>> You need root to configure sudo.
>
> Why would you need sudo for gram if it mapped to the same account?

You wouldn't.  sudo is not needed if the DN of the requester is the  
same as the DN used to start the container.

>>
>> I also don't think you can easily automate the gt4 installation process.
>> There's manual configuration that needs to be done. I suggest installing
>> your own gt4 once to which I can submit jobs to get an idea of what's
>> involved.
>
> For non-gram, there's an auto script here:
>
>    http://workspace.globus.org/vm/TP2.2/admin/quickstart.html#auto-container
>
> The questions it asks users (is this the right hostname? etc) could also be
> scripted.
>
> Getting from there to GRAM4 auto-configuration should not be too much of a step
> (the nimbus contextualization scripts have done it for gram4).

You could do this and if you avoid staging, then this may work fine  
for GRAM4.

But also, we want to have GRAM2 sites to start to use the SEG to  
remove a significant portion of the GRAM2 scalability problem.  I  
think that would be best and simplest solution to focus on.  Maybe we  
can start with a site where you need more scalability and the admin  
would want to work with us on that?

>
> Tim
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel


From tfreeman at mcs.anl.gov  Fri Feb 13 09:58:43 2009
From: tfreeman at mcs.anl.gov (Tim Freeman)
Date: Fri, 13 Feb 2009 09:58:43 -0600
Subject: [Swift-devel] Status of coasters
In-Reply-To: <49959753.9050001@mcs.anl.gov>
References: <49958AEB.3090002@mcs.anl.gov> <1234538259.25737.1.camel@localhost>
	<49959069.7020602@mcs.anl.gov> <1234539163.26116.3.camel@localhost>
	<20090213094032.2b041afc@prnb> <49959753.9050001@mcs.anl.gov>
Message-ID: <20090213095843.67e0496b@prnb>

On Fri, 13 Feb 2009 09:52:51 -0600
Michael Wilde <wilde at mcs.anl.gov> wrote:

> Tim, we should take your advice under advisement.
> 
> I can only leave the merits and cost/benefit assessment of this approach 
> to you GT4 experts (Ben, Mihael, Tim, ...)

I'm only weighing in on the cost parts. :-)

Tim


From rynge at renci.org  Fri Feb 13 09:56:31 2009
From: rynge at renci.org (Mats Rynge)
Date: Fri, 13 Feb 2009 10:56:31 -0500
Subject: [Swift-devel] Status of coasters
In-Reply-To: <49958AEB.3090002@mcs.anl.gov>
References: <49958AEB.3090002@mcs.anl.gov>
Message-ID: <4995982F.9010605@renci.org>

Michael Wilde wrote:
> For sites where WS-GRAM is not functional, I suggested we consider
> configuring our own non-root WS-GRAM, ideally using already-installed
> GT4 software, eg, from the OSG package on OSG and TG sites where it's
> installed. Mihael thought this would be considerable work. I agree but
> it might be a stable solution with fewer unknowns and support from the
> GRAM group. We can bring in the latest GT4 as needed if that provides a
> better solution than some older installed GT4 which we have no control
> over and which won't change till upcoming releases of say OSG or TG
> packages.

Please don't do this on OSG. I'm fairly sure that working around the
existing interfaces to a resource would just tick off the resource owners.

-- 
Mats Rynge
Renaissance Computing Institute <http://www.renci.org>


From wilde at mcs.anl.gov  Fri Feb 13 10:05:18 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 13 Feb 2009 10:05:18 -0600
Subject: [Swift-devel] Status of coasters
In-Reply-To: <FA43DB88-FB0C-407E-8401-27DD2CBB7B63@mcs.anl.gov>
References: <49958AEB.3090002@mcs.anl.gov>
	<1234538259.25737.1.camel@localhost>	<49959069.7020602@mcs.anl.gov>
	<1234539163.26116.3.camel@localhost>	<20090213094032.2b041afc@prnb>
	<FA43DB88-FB0C-407E-8401-27DD2CBB7B63@mcs.anl.gov>
Message-ID: <49959A3E.9010305@mcs.anl.gov>

Stu, this would be a good thing to discuss through the UChicago OSG 
group. I can start that, and am cc'ing Rob Gardner to get it on the list 
of Globus-OSG things to track.

- Mike


> But also, we want to have GRAM2 sites to start to use the SEG to remove
> a significant portion of the GRAM2 scalability problem.  I think that
> would be best and simplest solution to focus on.  Maybe we can start
> with a site where you need more scalability and the admin would want to
> work with us on that?

- Mike


On 2/13/09 9:58 AM, Stuart Martin wrote:
> On Feb 13, 2009, at Feb 13, 9:40 AM, Tim Freeman wrote:
> 
>> On Fri, 13 Feb 2009 09:32:43 -0600
>> Mihael Hategan <hategan at mcs.anl.gov> wrote:
>>
>>> On Fri, 2009-02-13 at 09:23 -0600, Michael Wilde wrote:
>>>>
>>>> On 2/13/09 9:17 AM, Mihael Hategan wrote:
>>>>> On Fri, 2009-02-13 at 08:59 -0600, Michael Wilde wrote:
>>>>>> -- Service submits via WS-GRAM. This should be tested, on sites
>>>>>> where WS-GRAM is working.
>>>>>> This would use jobmanager=gt2:gt4:{pbs/condor/sge}, and needs to be
>>>>>> tested. For sites where WS-GRAM is not functional, I suggested we
>>>>>> consider configuring our own non-root WS-GRAM, ideally using
>>>>>> already-installed GT4 software, eg, from the OSG package on OSG and TG
>>>>>> sites where it's installed. Mihael thought this would be considerable
>>>>>> work.
>>>>>
>>>>> Not as much the amount of work, but:
>>>>> 1. getting root on sites (if installed for multiple users)
>>>>
>>>> I was thinking/hoping we could have a single setup script, which we'd
>>>> pre-install where required, that would configure and start a personal
>>>> WSRF container for the user. Which would be work for us to create,
>>>> install and maintain, but, if successful, would be transparent to the user.
>>>
>>> You need root to configure sudo.
>>
>> Why would you need sudo for gram if it mapped to the same account?
> 
> You wouldn't.  sudo is not needed if the DN of the requester is the same 
> as the DN used to start the container.
> 
>>>
>>> I also don't think you can easily automate the gt4 installation process.
>>> There's manual configuration that needs to be done. I suggest installing
>>> your own gt4 once to which I can submit jobs to get an idea of what's
>>> involved.
>>
>> For non-gram, there's an auto script here:
>>
>>    http://workspace.globus.org/vm/TP2.2/admin/quickstart.html#auto-container
>>
>> The questions it asks users (is this the right hostname? etc) could also be
>> scripted.
>>
>> Getting from there to GRAM4 auto-configuration should not be too much of a step
>> (the nimbus contextualization scripts have done it for gram4).
> 
> You could do this and if you avoid staging, then this may work fine for 
> GRAM4.
> 
> But also, we want to have GRAM2 sites to start to use the SEG to remove 
> a significant portion of the GRAM2 scalability problem.  I think that 
> would be best and simplest solution to focus on.  Maybe we can start 
> with a site where you need more scalability and the admin would want to 
> work with us on that?
> 
>>
>> Tim


From wilde at mcs.anl.gov  Fri Feb 13 10:06:13 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 13 Feb 2009 10:06:13 -0600
Subject: [Swift-devel] Status of coasters
In-Reply-To: <4995982F.9010605@renci.org>
References: <49958AEB.3090002@mcs.anl.gov> <4995982F.9010605@renci.org>
Message-ID: <49959A75.3050500@mcs.anl.gov>

So instead work with OSG to get WS-GRAM working?

On 2/13/09 9:56 AM, Mats Rynge wrote:
> Michael Wilde wrote:
>> For sites where WS-GRAM is not functional, I suggested we consider
>> configuring our own non-root WS-GRAM, ideally using already-installed
>> GT4 software, eg, from the OSG package on OSG and TG sites where it's
>> installed. Mihael thought this would be considerable work. I agree but
>> it might be a stable solution with fewer unknowns and support from the
>> GRAM group. We can bring in the latest GT4 as needed if that provides a
>> better solution than some older installed GT4 which we have no control
>> over and which won't change till upcoming releases of say OSG or TG
>> packages.
> 
> Please don't do this on OSG. I'm fairly sure that working around the
> existing interfaces to a resource would just tick off the resource owners.
> 


From hategan at mcs.anl.gov  Fri Feb 13 10:19:49 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 13 Feb 2009 10:19:49 -0600
Subject: [Swift-devel] Status of coasters
In-Reply-To: <20090213094032.2b041afc@prnb>
References: <49958AEB.3090002@mcs.anl.gov>
	<1234538259.25737.1.camel@localhost> <49959069.7020602@mcs.anl.gov>
	<1234539163.26116.3.camel@localhost>  <20090213094032.2b041afc@prnb>
Message-ID: <1234541989.26956.1.camel@localhost>

On Fri, 2009-02-13 at 09:40 -0600, Tim Freeman wrote:
> > 
> > You need root to configure sudo.
> 
> Why would you need sudo for gram if it mapped to the same account?

I assumed that if we support multiple users, we do so properly.

> 
> > 
> > I also don't think you can easily automate the gt4 installation process.
> > There's manual configuration that needs to be done. I suggest installing
> > your own gt4 once to which I can submit jobs to get an idea of what's
> > involved.
> 
> For non-gram, there's an auto script here:
> 
>     http://workspace.globus.org/vm/TP2.2/admin/quickstart.html#auto-container
> 
> The questions it asks users (is this the right hostname? etc) could also be
> scripted.
> 
> Getting from there to GRAM4 auto-configuration should not be too much of a step
> (the nimbus contextualization scripts have done it for gram4).

Thanks. We should then try this.


From rynge at renci.org  Fri Feb 13 10:17:46 2009
From: rynge at renci.org (Mats Rynge)
Date: Fri, 13 Feb 2009 11:17:46 -0500
Subject: [Swift-devel] Status of coasters
In-Reply-To: <49959A75.3050500@mcs.anl.gov>
References: <49958AEB.3090002@mcs.anl.gov> <4995982F.9010605@renci.org>
	<49959A75.3050500@mcs.anl.gov>
Message-ID: <49959D2A.7090006@renci.org>

Michael Wilde wrote:
> So instead work with OSG to get WS-GRAM working?

There is a slow movement to make this happen. A couple of smaller VOs
(Engagement, which I'm representing, included) are asking for
WS-GRAM to become a suggested/required service, but I don't think that
will happen for the next release.

I have heard that the new SEG for pre-WS GRAM will be included in the
next release of the OSG software stack.


> On 2/13/09 9:56 AM, Mats Rynge wrote:
>> Michael Wilde wrote:
>>> For sites where WS-GRAM is not functional, I suggested we consider
>>> configuring our own non-root WS-GRAM, ideally using already-installed
>>> GT4 software, eg, from the OSG package on OSG and TG sites where its
>>> installed. Mihael thought this would be considerable work. I agree but
>>> it might be a stable solution with fewer unknowns and suppot from the
>>> GRAM group. We can bring in the latest GT4 as needed if that provides a
>>> better solution than some older installed GT4 which we have no control
>>> over and which wont change till upcoming releases of say OSG or TG
>>> packages.
>>
>> Please don't do this on OSG. I'm fairly sure that working around the
>> existing interfaces to a resource would just tick off the resource
>> owners.
>>
> 


-- 
Mats Rynge
Renaissance Computing Institute <http://www.renci.org>


From hategan at mcs.anl.gov  Fri Feb 13 10:22:43 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 13 Feb 2009 10:22:43 -0600
Subject: [Swift-devel] Status of coasters
In-Reply-To: <FA43DB88-FB0C-407E-8401-27DD2CBB7B63@mcs.anl.gov>
References: <49958AEB.3090002@mcs.anl.gov>
	<1234538259.25737.1.camel@localhost> <49959069.7020602@mcs.anl.gov>
	<1234539163.26116.3.camel@localhost> <20090213094032.2b041afc@prnb>
	<FA43DB88-FB0C-407E-8401-27DD2CBB7B63@mcs.anl.gov>
Message-ID: <1234542163.26956.3.camel@localhost>

On Fri, 2009-02-13 at 09:58 -0600, Stuart Martin wrote:

> But also, we want to have GRAM2 sites to start to use the SEG to  
> remove a significant portion of the GRAM2 scalability problem.  I  
> think that would be best and simplest solution to focus on.  Maybe we  
> can start with a site where you need more scalability and the admin  
> would want to work with us on that?

Can this be pushed into VDT/OSG and SDgrr$% (that thing that TG uses)?


From benc at hawaga.org.uk  Fri Feb 13 10:25:15 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 13 Feb 2009 16:25:15 +0000 (GMT)
Subject: [Swift-devel] Status of coasters
In-Reply-To: <1234542163.26956.3.camel@localhost>
References: <49958AEB.3090002@mcs.anl.gov> <1234538259.25737.1.camel@localhost>
	<49959069.7020602@mcs.anl.gov> <1234539163.26116.3.camel@localhost>
	<20090213094032.2b041afc@prnb>
	<FA43DB88-FB0C-407E-8401-27DD2CBB7B63@mcs.anl.gov>
	<1234542163.26956.3.camel@localhost>
Message-ID: <Pine.LNX.4.64.0902131624280.1293@dildano.hawaga.org.uk>


On Fri, 13 Feb 2009, Mihael Hategan wrote:

> > But also, we want to have GRAM2 sites to start to use the SEG to 
> > remove a significant portion of the GRAM2 scalability problem.  I 
> > think that would be best and simplest solution to focus on.  Maybe we 
> > can start with a site where you need more scalability and the admin 
> > would want to work with us on that?
> 
> Can this be pushed into VDT/OSG and SDgrr$% (that thing that TG uses)?

CTSS. I think that is based on VDT now too.

-- 


From iraicu at cs.uchicago.edu  Fri Feb 13 12:13:12 2009
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Fri, 13 Feb 2009 12:13:12 -0600
Subject: [Swift-devel] Status of coasters
In-Reply-To: <49959458.3070704@mcs.anl.gov>
References: <49958AEB.3090002@mcs.anl.gov>	<Pine.LNX.4.64.0902131504520.1293@dildano.hawaga.org.uk>
	<49959458.3070704@mcs.anl.gov>
Message-ID: <4995B838.9060904@cs.uchicago.edu>


Michael Wilde wrote:
>
> ...Gets round potential problem that managed-fork JM will kill our 
> process for exceeding a walltime limit.
>
>
Managed-fork kills your process on the head node. The LRM (PBS, Condor, 
etc) kills your process on the compute node. Either way, if you exceed 
the walltime limit, your process gets killed.

Ioan

-- 
===================================================
Ioan Raicu
Ph.D. Candidate
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================
===================================================


From iraicu at cs.uchicago.edu  Fri Feb 13 12:15:38 2009
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Fri, 13 Feb 2009 12:15:38 -0600
Subject: [Swift-devel] Status of coasters
In-Reply-To: <4995982F.9010605@renci.org>
References: <49958AEB.3090002@mcs.anl.gov> <4995982F.9010605@renci.org>
Message-ID: <4995B8CA.8030509@cs.uchicago.edu>

Although we and others have been doing this (multi-level scheduling)
for a while now, using Coasters, Falkon, Condor glide-ins, etc., I
don't see what would be different about setting up a separate WS-GRAM
interface on a site that doesn't support it. It's just another method of
doing multi-level scheduling.

Ioan

Mats Rynge wrote:
> Michael Wilde wrote:
>   
>> For sites where WS-GRAM is not functional, I suggested we consider
>> configuring our own non-root WS-GRAM, ideally using already-installed
>> GT4 software, eg, from the OSG package on OSG and TG sites where its
>> installed. Mihael thought this would be considerable work. I agree but
>> it might be a stable solution with fewer unknowns and suppot from the
>> GRAM group. We can bring in the latest GT4 as needed if that provides a
>> better solution than some older installed GT4 which we have no control
>> over and which wont change till upcoming releases of say OSG or TG
>> packages.
>>     
>
> Please don't do this on OSG. I'm fairly sure that working around the
> existing interfaces to a resource would just tick off the resource owners.
>
>   


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20090213/cc594231/attachment.html>

From hategan at mcs.anl.gov  Fri Feb 13 12:32:01 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 13 Feb 2009 12:32:01 -0600 (CST)
Subject: [Swift-devel] Status of coasters
In-Reply-To: <4995B8CA.8030509@cs.uchicago.edu>
Message-ID: <29514779.2253011234549921308.JavaMail.root@zimbra>


----- Ioan Raicu  wrote:
> Although, we and others have been doing this (multi-level scheduling)
> for a while now, using Coasters, Falkon, Condor glide-ins, etc... I
> don't see what would be different, to set up a separate WS-GRAM
> interface on a site that doesn't support it. Its just another method,
> to do multi-level scheduling. 

I suppose too much multi unnecessarily complicates things.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20090213/87e499df/attachment.html>

From hategan at mcs.anl.gov  Fri Feb 13 12:33:41 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 13 Feb 2009 12:33:41 -0600 (CST)
Subject: [Swift-devel] Status of coasters
In-Reply-To: <29514779.2253011234549921308.JavaMail.root@zimbra>
Message-ID: <29887029.2253071234550021030.JavaMail.root@zimbra>


----- Mihael Hategan  wrote:
> ----- Ioan Raicu  wrote:
> > Although, we and others have been doing this (multi-level scheduling)
> > for a while now, using Coasters, Falkon, Condor glide-ins, etc... I
> > don't see what would be different, to set up a separate WS-GRAM
> > interface on a site that doesn't support it. Its just another method,
> > to do multi-level scheduling. 
> 
> I suppose too much multi unnecessarily complicates things.

Additionally if you're setting up WS-GRAM in order to be able to start 
coasters, you might as well start the coasters manually.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20090213/af7a4926/attachment.html>

From rynge at renci.org  Fri Feb 13 12:34:55 2009
From: rynge at renci.org (Mats Rynge)
Date: Fri, 13 Feb 2009 13:34:55 -0500
Subject: [Swift-devel] Status of coasters
In-Reply-To: <4995B8CA.8030509@cs.uchicago.edu>
References: <49958AEB.3090002@mcs.anl.gov> <4995982F.9010605@renci.org>
	<4995B8CA.8030509@cs.uchicago.edu>
Message-ID: <4995BD4F.8080204@renci.org>

Ioan Raicu wrote:
> Although, we and others have been doing this (multi-level scheduling)
> for a while now, using Coasters, Falkon, Condor glide-ins, etc... I
> don't see what would be different, to set up a separate WS-GRAM
> interface on a site that doesn't support it. Its just another method, to
> do multi-level scheduling.

No. The problem here is that you want to interact directly with the LRM.
This will break several things, such as accounting and LRM policies. It
is similar to using jobmanager-fork to run condor_submit. Even though it is
technically possible to do so, most OSG sites would find such behavior
unacceptable and would probably ban the user/VO doing it.

This is different from, for example, glideins, where you use the existing
interface and LRM to get slots allocated to you, and then your job is
actually the glidein. The WS-GRAM on the side would be equivalent only if you
submitted it to the LRM, started it on compute nodes and only used
jobmanager-fork (but only one real job at once). This would obviously
not be very helpful, as many compute nodes are behind NAT.

You really should not use jobmanager-fork for anything except simple
setup/cleanup jobs.

-- 
Mats Rynge
Renaissance Computing Institute <http://www.renci.org>


From iraicu at cs.uchicago.edu  Fri Feb 13 12:38:36 2009
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Fri, 13 Feb 2009 12:38:36 -0600
Subject: [Swift-devel] Status of coasters
In-Reply-To: <29514779.2253011234549921308.JavaMail.root@zimbra>
References: <29514779.2253011234549921308.JavaMail.root@zimbra>
Message-ID: <4995BE2C.4010200@cs.uchicago.edu>

I am not arguing for adding more levels to the scheduling decisions, but
was merely pointing out to Mats that we have been bypassing the main
schedulers of various systems for a while now, and that this should not be
the reason for not following through with installing WS-GRAM. In fact, I
don't support the idea of installing a user-level GRAM per site. It's not
even clear to me how you would make that happen in a generic way, as GRAM
needs to be configured to work with an LRM (SGE, PBS, Condor, etc.), so
you would not only need to install GRAM, but also configure it against an
LRM you might know little about: where it's installed, the paths to its
various logs (which GRAM needs), etc. If Coasters can be made to run
transparently and work well enough for production, then Coasters can run
on top of GRAM2 just fine, as the load they impose on GRAM is lower than
what the same app would impose if it went directly to GRAM.

Ioan

Mihael Hategan wrote:
>
> ----- Ioan Raicu wrote:
> > Although, we and others have been doing this (multi-level scheduling)
> > for a while now, using Coasters, Falkon, Condor glide-ins, etc... I
> > don't see what would be different, to set up a separate WS-GRAM
> > interface on a site that doesn't support it. Its just another method,
> > to do multi-level scheduling.
>
> I suppose too much multi unnecessarily complicates things.

-- 
===================================================
Ioan Raicu, Ph.D.
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================
===================================================

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20090213/061f5f91/attachment.html>

From iraicu at cs.uchicago.edu  Fri Feb 13 12:51:24 2009
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Fri, 13 Feb 2009 12:51:24 -0600
Subject: [Swift-devel] Status of coasters
In-Reply-To: <4995BD4F.8080204@renci.org>
References: <49958AEB.3090002@mcs.anl.gov> <4995982F.9010605@renci.org>
	<4995B8CA.8030509@cs.uchicago.edu> <4995BD4F.8080204@renci.org>
Message-ID: <4995C12C.1050107@cs.uchicago.edu>

I think we are on the same page; I was not suggesting using methods that
bypass accounting or LRM policies.

Cheers,
Ioan

Mats Rynge wrote:
> Ioan Raicu wrote:
>   
>> Although, we and others have been doing this (multi-level scheduling)
>> for a while now, using Coasters, Falkon, Condor glide-ins, etc... I
>> don't see what would be different, to set up a separate WS-GRAM
>> interface on a site that doesn't support it. Its just another method, to
>> do multi-level scheduling.
>>     
>
> No. The problem here is that you want to directly interact with the lrm.
> This will break several things such as accounting, and lrm policies. It
> is similar to use jobmanger-fork to run condor_submit. Even though it is
> technically possibly to do so, most OSG sites would find such a behavior
> unacceptable and probably ban the user/VO doing it.
>
> This is different from for example glideins, where you use the existing
> interface and lrm to get slots allocated to you, and then your job is
> actually the glidein. The WS-GRAM on the side would be the same iif you
> submitted it to the lrm, started it on compute nodes and only used
> jobmanager-fork (but only one real job at once). This would obviously
> not be very helpful as many compute nodes are behind NAT.
>
> You really should not user jobmanager-fork for anything except simple
> setup/cleanup jobs.
>
>   


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20090213/11f5532d/attachment.html>

From hategan at mcs.anl.gov  Fri Feb 13 13:19:19 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 13 Feb 2009 13:19:19 -0600 (CST)
Subject: [Swift-devel] Status of coasters
In-Reply-To: <4995BD4F.8080204@renci.org>
Message-ID: <31956487.2256981234552759142.JavaMail.root@zimbra>


----- Mats Rynge <rynge at renci.org> wrote:
> Ioan Raicu wrote:
> > Although, we and others have been doing this (multi-level scheduling)
> > for a while now, using Coasters, Falkon, Condor glide-ins, etc... I
> > don't see what would be different, to set up a separate WS-GRAM
> > interface on a site that doesn't support it. Its just another method, to
> > do multi-level scheduling.
> 
> No. The problem here is that you want to directly interact with the lrm.
> This will break several things such as accounting, and lrm policies. It
> is similar to use jobmanger-fork to run condor_submit. Even though it is
> technically possibly to do so, most OSG sites would find such a behavior
> unacceptable and probably ban the user/VO doing it.

I assumed the LRM does the accounting.

> 
> This is different from for example glideins, where you use the existing
> interface and lrm to get slots allocated to you, and then your job is
> actually the glidein. The WS-GRAM on the side would be the same iif you
> submitted it to the lrm, started it on compute nodes and only used
> jobmanager-fork (but only one real job at once). This would obviously
> not be very helpful as many compute nodes are behind NAT.
> 
> You really should not user jobmanager-fork for anything except simple
> setup/cleanup jobs.

That doesn't leave many options. If we have a more efficient way of
submitting to the LRM than GRAM2, we can't use it. If we have a more
user-friendly way of doing glide-ins, we can't use that either. We're pretty
much at the mercy of the VDT, which, after many years, still doesn't properly
escape whitespace. I find that to be an affront to the users and
ultimately to the taxpayer.


From rynge at renci.org  Fri Feb 13 13:28:57 2009
From: rynge at renci.org (Mats Rynge)
Date: Fri, 13 Feb 2009 14:28:57 -0500
Subject: [Swift-devel] Status of coasters
In-Reply-To: <31956487.2256981234552759142.JavaMail.root@zimbra>
References: <31956487.2256981234552759142.JavaMail.root@zimbra>
Message-ID: <4995C9F9.8010705@renci.org>

Mihael Hategan wrote:
> ----- Mats Rynge <rynge at renci.org> wrote:
>> Ioan Raicu wrote:
>>> Although, we and others have been doing this (multi-level scheduling)
>>> for a while now, using Coasters, Falkon, Condor glide-ins, etc... I
>>> don't see what would be different, to set up a separate WS-GRAM
>>> interface on a site that doesn't support it. Its just another method, to
>>> do multi-level scheduling.
>> No. The problem here is that you want to directly interact with the lrm.
>> This will break several things such as accounting, and lrm policies. It
>> is similar to use jobmanger-fork to run condor_submit. Even though it is
>> technically possibly to do so, most OSG sites would find such a behavior
>> unacceptable and probably ban the user/VO doing it.
> 
> I assumed the LRM does the accounting.

I don't know the details of how this works, but I think there are some OSG
and/or VDT patches to the jobmanager Perl modules. Think of it this way: the
LRM does the accounting, but you have to pass the LRM the correct
information. If you interacted directly with the LRM, the jobs would
look like local jobs, not OSG-originated jobs.


>> This is different from for example glideins, where you use the existing
>> interface and lrm to get slots allocated to you, and then your job is
>> actually the glidein. The WS-GRAM on the side would be the same iif you
>> submitted it to the lrm, started it on compute nodes and only used
>> jobmanager-fork (but only one real job at once). This would obviously
>> not be very helpful as many compute nodes are behind NAT.
>>
>> You really should not user jobmanager-fork for anything except simple
>> setup/cleanup jobs.
> 
> That doesn't leave many options. If we have a more efficient way of 
> submitting to the LRM than GRAM2, we can't use it. If we have more-user
> friendly way of doing glide-ins, we can't use that either. We're pretty
> much at the mercy of the VDT, which doesn't, after many years, properly
> escape whitespace. I find that to be an affront to the users and 
> ultimately to the taxpayer.

I agree with you, but standing up your own WS-GRAM is not the solution.

-- 
Mats Rynge
Renaissance Computing Institute <http://www.renci.org>


From hategan at mcs.anl.gov  Fri Feb 13 13:39:50 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 13 Feb 2009 13:39:50 -0600 (CST)
Subject: [Swift-devel] Status of coasters
In-Reply-To: <4995C9F9.8010705@renci.org>
Message-ID: <3035487.2258841234553990865.JavaMail.root@zimbra>

> >> You really should not user jobmanager-fork for anything except simple
> >> setup/cleanup jobs.
> > 
> > That doesn't leave many options. If we have a more efficient way of 
> > submitting to the LRM than GRAM2, we can't use it. If we have more-user
> > friendly way of doing glide-ins, we can't use that either. We're pretty
> > much at the mercy of the VDT, which doesn't, after many years, properly
> > escape whitespace. I find that to be an affront to the users and 
> > ultimately to the taxpayer.
> 
> I agree with you, but standing up your own WS-GRAM is not the solution.

I don't think standing up our own WS-GRAM is the solution either. However, I
must also consider that standing up coasters (whether manually or 
automatically) or anything non-trivial is similar to standing
up WS-GRAM in that there is a not-so-transient process that plays a part
in managing jobs and other things.


From hategan at mcs.anl.gov  Fri Feb 13 23:48:11 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 13 Feb 2009 23:48:11 -0600
Subject: [Swift-devel] Status of coasters
In-Reply-To: <49958AEB.3090002@mcs.anl.gov>
References: <49958AEB.3090002@mcs.anl.gov>
Message-ID: <1234590491.9413.0.camel@localhost>

On Fri, 2009-02-13 at 08:59 -0600, Michael Wilde wrote:

> - Mihael has a good handle on the bootstrap issues and is working on 
> improvements. This is not working in trunk at the moment, will likely be 
> fixed soon. We think this will fix known issues in: command line lenth 
> for condor, spaces, quotes, newlines and other offending argument 
> issues; location of Java and tools (wget/curl and mdsum).

I think I nailed it this time.
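The argument problems listed above (spaces, quotes, newlines) all come from passing user arguments through a shell without quoting each one. As a minimal sketch of the general technique, assuming a POSIX shell at the remote end, Python's standard `shlex.quote` can build a command line whose arguments survive word-splitting intact; `build_command_line` is a hypothetical helper, not the actual coaster bootstrap code.

```python
import shlex
from typing import Sequence


def build_command_line(executable: str, args: Sequence[str]) -> str:
    """Join an executable and its arguments into one POSIX shell command
    string, quoting every piece so spaces, quotes and newlines stay inside
    the argument they belong to."""
    return " ".join(shlex.quote(part) for part in [executable, *args])


if __name__ == "__main__":
    # arguments of the kinds that caused trouble, plus an empty one
    args = ["plain", "with space", "it's quoted", "line1\nline2", ""]
    cmd = build_command_line("/bin/echo", args)
    print(cmd)
    # shlex.split applies POSIX word-splitting, so it recovers the
    # original argument vector from the quoted string
    assert shlex.split(cmd) == ["/bin/echo", *args]
```

Quoting addresses the escaping issues only, not the Condor command-line length limit.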


From benc at hawaga.org.uk  Sat Feb 14 08:16:11 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sat, 14 Feb 2009 14:16:11 +0000 (GMT)
Subject: [Swift-devel] typecheck foo[*].bar
In-Reply-To: <Pine.LNX.4.64.0902101305000.23512@dildano.hawaga.org.uk>
References: <Pine.LNX.4.64.0902101305000.23512@dildano.hawaga.org.uk>
Message-ID: <Pine.LNX.4.64.0902141415220.1293@dildano.hawaga.org.uk>


I've implemented the below, as r2538.

On Tue, 10 Feb 2009, Ben Clifford wrote:

> I noticed today that expressions like this don't get typechecked properly, 
> so in 0.8, you can't use [*].member expressions. Bleugh.
> 
> As I want to use such expressions (or equivalent), I guess I have to fix 
> that soonish.
> 
> I think the approach I am favouring language-wise is that [*] becomes a 
> no-op/identity operator, and . with an array of structs on the left 
> returns an array of the appropriate member fields.
> 
> Thus   a[*] == a   for all arrays a
> 
>        a[*].foo == a.foo == (in haskelly pseudocode) (map (\x -> x.foo) a)
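The proposed semantics can be modelled in a few lines of Python, assuming Swift arrays behave like lists and structs like records with named fields. This is only an illustration of the rule; `star`, `member` and the `Rec` type are hypothetical names, not part of the Swift compiler or runtime.

```python
from dataclasses import dataclass
from typing import Any, List


def star(a: List[Any]) -> List[Any]:
    """a[*] as a no-op/identity operator: a[*] == a for every array a."""
    return a


def member(a: List[Any], name: str) -> List[Any]:
    r"""'.' with an array of structs on the left: the array of that member,
    i.e. map (\x -> x.name) a."""
    return [getattr(x, name) for x in a]


@dataclass
class Rec:
    """Hypothetical struct type with two members."""
    foo: int
    bar: str


if __name__ == "__main__":
    a = [Rec(1, "x"), Rec(2, "y"), Rec(3, "z")]
    assert star(a) == a
    # a[*].foo == a.foo == map (\x -> x.foo) a
    assert member(star(a), "foo") == member(a, "foo") == [1, 2, 3]
    print(member(a, "bar"))
```

With `[*]` as the identity, `a[*].foo` and `a.foo` reduce to the same mapping, so both forms denote the same array of member values.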

-- 


From benc at hawaga.org.uk  Sun Feb 15 13:29:42 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sun, 15 Feb 2009 19:29:42 +0000 (GMT)
Subject: [Swift-devel] code coverage of tests
Message-ID: <Pine.LNX.4.64.0902151920000.23512@dildano.hawaga.org.uk>


I just used EMMA, a code coverage tool, on the test suite. The results (as an
HTML report) are here:

http://www.ci.uchicago.edu/~benc/tmp/coverage/index.html

I ran this to try to see what areas of the code aren't being tested at all 
at the moment.

You'll need to have an understanding of the Swift source code in order to 
understand this report.

This covers only the classes in 'cog-swift-svn.jar', where the code from
the Swift repository lives. It doesn't cover any of the other classes (for
example, the CoG providers or Karajan).

The report covers only the base, fully automated tests, not the per-site or
wonky tests.

For future reference, this is how I ran the tests:

 cd cog/modules/swift
 export CLASSPATH=/Users/benc/work/emma-2.0.5312/lib/emma.jar
 export COG_OPTS="-Demma.verbosity.level=quiet"
 java emma instr -cp dist/swift-svn/lib/cog-swift-svn.jar -m overwrite
 cd tests
 ./run
 cd ..
 java emma report -r html -sp src -in coverage.em -in coverage.ec
 open coverage/index.html

-- 


From benc at hawaga.org.uk  Sun Feb 15 16:04:45 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sun, 15 Feb 2009 22:04:45 +0000 (GMT)
Subject: [Swift-devel] 0.8 clustering broken
Message-ID: <Pine.LNX.4.64.0902152203260.23512@dildano.hawaga.org.uk>


doh, I broke clustering in 0.8: it turns out my test for clusters is lame
and neither enables clustering nor checks that clustering was actually
used.

Fixes will be in SVN shortly, so it should be working in 0.9, planned
for mid-March.

-- 


From wilde at mcs.anl.gov  Sun Feb 15 17:50:18 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Sun, 15 Feb 2009 17:50:18 -0600
Subject: [Swift-devel] Returning GRAM errors to swift user
Message-ID: <4998AA3A.4050307@mcs.anl.gov>

I'm assuming that Swift and the CoG provider return as much about GRAM
errors back to the user as they know. But, for jobs that fail to start, 
e.g., due to an invalid project code, that error never makes it back to 
the user (but *is* present in the gram log).

In this case, is the message below from the GRAM log,
"GRAM_SCRIPT_GT3_FAILURE_MESSAGE:qsub: Invalid Account MSG=invalid
account\n", available in the GRAM API so that it can be sent to the user?

I'm assuming this particular issue is well known to users experienced 
with TeraGrid sites, like Sarah, but is perhaps worth pointing out in a 
troubleshooting section. If there's a chance that some of this GRAM 
error info can be returned but is not currently, I can file this in 
bugzilla.

It seems like a few errors, such as account/project errors, or other 
invalid job specs (like time/queue mismatches?) are similarly not passed 
back. Is that the case?

Relevant snips from the logs are below.

Also interesting to note: on the UC TeraGrid site, a project specified
in sites.xml via the globus profile does *not* override a default
project set by the tgprojects command. In my case, I had an invalid
(old) project set via tgprojects, which took precedence over the one in
my sites.xml. When I set the default project to "None" in tgprojects,
the sites.xml project was accepted and the job ran.

- Mike

In swift .log file:

2009-02-15 16:59:27,408-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION 
jobid=uname-0gpb1o6j - Application exception: The job failed when the 
job manager attempted to run it
Caused by: org.globus.gram.GramException: The job failed when the job 
manager attempted to run it

Messages on swift stdout/err:

===============================
Swift svn swift-r2532 cog-r2300

RunID: 20090215-1733-0x4ksmd8
Progress:
Progress:  Stage in:1
Progress:  Active:1
Failed to transfer wrapper log from un-20090215-1733-0x4ksmd8/info/k on uc32
Progress:  Failed:1
Execution failed:
         Exception in uname:
Arguments: [-a]
Host: uc32
Directory: un-20090215-1733-0x4ksmd8/jobs/k/uname-kl1p2o6j
stderr.txt:

stdout.txt:

----

Caused by:
         The job failed when the job manager attempted to run it

===============================


But the following useful info is in the gram log (on the server side), 
which did not make it to the swift logs above:


Sun Feb 15 17:33:58 2009 JM_SCRIPT: submitting job -- /soft/torque/bin/qsub < /home/wilde/.globus/job/tg-grid1.uc.teragrid.org/14326.1234740838/scheduler_pbs_job_script 2>/home/wilde/.globus/job/tg-grid1.uc.teragrid.org/14326.1234740838/scheduler_pbs_submit_stderr
Sun Feb 15 17:33:58 2009 JM_SCRIPT: qsub returned
Sun Feb 15 17:33:58 2009 JM_SCRIPT: qsub stderr qsub: Invalid Account MSG=invalid account

2/15 17:33:58 JM: GT3 extended error message: GRAM_SCRIPT_GT3_FAILURE_MESSAGE:qsub: Invalid Account MSG=invalid account\n
2/15 17:33:58 JMI: while return_buf = GRAM_SCRIPT_GT3_FAILURE_MESSAGE = qsub: Invalid Account MSG=invalid account\n
2/15 17:33:58 JMI: while return_buf = GRAM_SCRIPT_ERROR = 17
2/15 17:33:58 JM: in globus_gram_job_manager_reporting_file_create()
2/15 17:33:58 JM: not reporting job information
2/15 17:33:58 JM: in globus_gram_job_manager_history_file_create()
2/15 17:33:58 JM: NOT empty client callback list.
2/15 17:33:58 JM: sending callback of status 4 (failure code 17) to https://128.135.125.17:50000/1234740837636.
2/15 17:33:58 JMI: testing job manager scripts for type pbs exist and permissions are ok.
2/15 17:33:58 JMI: completed script validation: job manager type is pbs.
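Until that message is propagated, it can be recovered from the job manager log by hand. A minimal Python sketch, assuming the log format shown above with each log entry on a single line; `extract_failure_messages` is a hypothetical helper, not part of the GRAM or CoG APIs.

```python
import re
from typing import Iterable, List

# Matches the extended-error lines shown above, e.g.
#   ... GRAM_SCRIPT_GT3_FAILURE_MESSAGE:qsub: Invalid Account MSG=invalid account\n
#   ... GRAM_SCRIPT_GT3_FAILURE_MESSAGE = qsub: Invalid Account MSG=invalid account\n
_FAILURE = re.compile(r"GRAM_SCRIPT_GT3_FAILURE_MESSAGE\s*[:=]\s*(.*)$")


def extract_failure_messages(lines: Iterable[str]) -> List[str]:
    """Return the distinct failure messages found in a GRAM job manager log,
    in order of first appearance, with the literal trailing '\\n' removed."""
    seen: List[str] = []
    for line in lines:
        match = _FAILURE.search(line.rstrip("\r\n"))
        if match:
            message = match.group(1).strip()
            if message.endswith("\\n"):  # the log writes a literal backslash-n
                message = message[:-2].rstrip()
            if message and message not in seen:
                seen.append(message)
    return seen


if __name__ == "__main__":
    import sys
    # usage: python this_script.py <job manager log file>
    if len(sys.argv) > 1:
        with open(sys.argv[1]) as log:
            for message in extract_failure_messages(log):
                print(message)
```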


From benc at hawaga.org.uk  Sun Feb 15 18:43:54 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 16 Feb 2009 00:43:54 +0000 (GMT)
Subject: [Swift-devel] wonky-runawayjob missing its sites file
Message-ID: <Pine.LNX.4.64.0902160011390.23512@dildano.hawaga.org.uk>


r2513 introduces a test for runaway jobs, but doesn't include the site 
file.

In r2550 I've added a script to run all the wonky tests (I thought I had that
in already, but apparently not; it seems it was uncommitted in my local
working directory). If you fix the runaway test, add it to that script:
tests/misc/run-wonky

-- 


From hategan at mcs.anl.gov  Sun Feb 15 23:14:57 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sun, 15 Feb 2009 23:14:57 -0600
Subject: [Swift-devel] Re: wonky-runawayjob missing its sites file
In-Reply-To: <Pine.LNX.4.64.0902160011390.23512@dildano.hawaga.org.uk>
References: <Pine.LNX.4.64.0902160011390.23512@dildano.hawaga.org.uk>
Message-ID: <1234761297.12882.0.camel@localhost>

On Mon, 2009-02-16 at 00:43 +0000, Ben Clifford wrote:
> r2513 introduces a test for runaway jobs, but doesn't include the site 
> file.

r2551 fixes that.

> 
> In r2550 I add a script to run all the wonky tests (I thought I had that 
> in already, but apparently not - seeming it was uncommitted in my local 
> working directory). If you fix the runaway test, add it to that script - 
> tests/misc/run-wonky
> 


From hategan at mcs.anl.gov  Sun Feb 15 23:21:58 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sun, 15 Feb 2009 23:21:58 -0600
Subject: [Swift-devel] Returning GRAM errors to swift user
In-Reply-To: <4998AA3A.4050307@mcs.anl.gov>
References: <4998AA3A.4050307@mcs.anl.gov>
Message-ID: <1234761718.12882.7.camel@localhost>

On Sun, 2009-02-15 at 17:50 -0600, Michael Wilde wrote:
> Im assuming that Swift and the CoG provider return as much about GRAM 
> errors back to the user as they know. But, for jobs that fail to start, 
> e.g., due to an invalid project code, that error never makes it back to 
> the user (but *is* present in the gram log).
> 
> In this case, can the message below, from the GRAM log, 
> "GRAM_SCRIPT_GT3_FAILURE_MESSAGE:qsub: Invalid Account MSG=invalid 
> account\n" available in the GRAM API so it can be sent to the user?

There are two possibilities:
1. The message does not make it to the ws-gram client. This needs to be
fixed in ws-gram.
2. (1) is false, and the ws-gram cog provider does not propagate that
message in the failure event. This I should fix.

There's also a third, unlikely possibility: that the Karajan or Swift
portion is broken.

> 
> I'm assuming this particular issue is well known to users experienced 
> with TeraGrid sites, like Sarah, but is perhaps worth pointing out in a 
> troubleshooting section. If there's a chance that some of this GRAM 
> error info can be returned but is not currently, I can file this in 
> bugzilla.
> 
> It seems like a few errors, such as account/project errors, or other 
> invalid job specs (like time/queue mismatches?) are similarly not passed 
> back. Is that the case?

In my experience, yes.

> 
> Relevant snips from the logs are below.
> 
> Also interesting to note: On the UC teragrid site, a project specified 
> in sites.xml via the globus profile does *not* override a default 
> project set by the tgprojects command.

If this is correct (i.e., we're not talking about some obscure issue
where having a bogus default project causes all your jobs to fail), I
would think of it as a bug that should be reported to TeraGrid.


From rynge at renci.org  Mon Feb 16 17:01:50 2009
From: rynge at renci.org (Mats Rynge)
Date: Mon, 16 Feb 2009 18:01:50 -0500
Subject: [Swift-devel] Contribution: swift-osg-ress-site-catalog
Message-ID: <20090216230150.GA9956@rynge.europa.renci.org>

Attached is a contribution for Swift. It is a script which queries the
OSG Resource Selection System (ReSS) for site information and builds a
Swift site catalog. I think the script should be located in bin/, but
feel free to include it anywhere in the distribution.

I have already submitted a signed contribution license agreement, and
this script is contributed under that agreement.

-- 
Mats Rynge
Renaissance Computing Institute <http://www.renci.org>
-------------- next part --------------
#!/usr/bin/perl

use strict;

use Pod::Usage;
use Getopt::Long;
use File::Temp qw/ tempfile tempdir mktemp /;

my $opt_help = 0;
my $opt_vo = 'engage';
my $opt_engage_verified = 0;
my $opt_gt4 = 0;
my $opt_out = '&STDOUT';

Getopt::Long::Configure('bundling');
GetOptions(
    "help"                   => \$opt_help,
    "vo=s"                   => \$opt_vo,
    "engage-verified"        => \$opt_engage_verified,
    "gt4"                    => \$opt_gt4,
    "out=s"                  => \$opt_out,
) or pod2usage(1);

if ($opt_help) {
    pod2usage(1);
}

if ($opt_engage_verified && $opt_vo ne "engage") {
    die("You cannot specify a VO when using --engage-verified\n");
}

# make sure condor_status is in the path
my $out = `which condor_status 2>/dev/null`;
if ($out eq "") {
    die("This tool depends on condor_status.\n" .
        "Please make sure condor_status is in your path.\n");
}

my %ads;
my %tmp;
my $cmd = "condor_status -any -long -constraint" .
          " 'StringlistIMember(\"VO:$opt_vo\";GlueCEAccessControlBaseRule)'" .
          " -pool osg-ress-1.fnal.gov";
# if we want the engage verified sites, ignore opt_vo and query against 
# engage central collector
if ($opt_engage_verified) {
    $cmd = "condor_status -any -long -constraint" .
           " 'SiteVerified==TRUE'" .
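           " -pool engage-central.renci.org";
}
open(STATUS, "$cmd|") or die("Unable to run condor_status: $!\n");
while(<STATUS>) {
    chomp;
    if ($_ eq "") {
        # a blank line ends an ad; keep it only if it names a site, but
        # always reset so the next ad does not inherit stale keys
        if ($tmp{'GlueSiteName'} ne "") {
            my %copy = %tmp;
            $ads{$tmp{'GlueSiteName'}} = \%copy;
        }
        undef %tmp;
    }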
    else {
        my ($key, $value) = split(/ = /, $_, 2);
        $value =~ s/^"|"$//g; # remove quotes from Condor strings
        $tmp{$key} = $value;
    }
}
close(STATUS);

# lowercase vo
my $lc_vo = lc($opt_vo);

open(FH, ">$opt_out") or die("Unable to open $opt_out");
print FH "<config>\n";
foreach my $sitename (keys %ads) {
    my $contact = $ads{$sitename}->{'GlueCEInfoContactString'};
    my $host = $contact;
    $host =~ s/:.*//;
    my $jm = $contact;
    $jm =~ s/.*jobmanager-//;
    if ($jm eq "pbs") {
        $jm = "PBS";
    }
    elsif ($jm eq "lsf") {
        $jm = "LSF";
    }
    elsif ($jm eq "sge") {
        $jm = "SGE";
    }
    elsif ($jm eq "condor") {
        $jm = "Condor";
    }
    my $workdir = $ads{$sitename}->{'GlueCEInfoDataDir'};
    print FH "\n";
    print FH "  <!-- $sitename -->\n";
    print FH "  <pool handle=\"$sitename\" >\n";
    print FH "    <gridftp  url=\"gsiftp://$host/\" />\n";
    if ($opt_gt4) {
        print FH "    <execution provider=\"gt4\" jobmanager=\"$jm\" url=\"$host:9443\" />\n";
    }
    else {
        print FH "    <jobmanager universe=\"vanilla\" url=\"$contact\" major=\"2\" />\n";
    }
    print FH "    <workdirectory >$workdir/$lc_vo/tmp</workdirectory>\n";
    print FH "  </pool>\n";
}
print FH "\n</config>\n";
close(FH);

exit(0);

__END__

=head1 NAME

swift-osg-ress-site-catalog - converts ReSS data to Swift site catalog

=head1 SYNOPSIS

swift-osg-ress-site-catalog [options]

=head1 OPTIONS

=over 8

=item B<--help>

Show this help message

=item B<--vo=[name]>

Set what VO to query ReSS for

=item B<--engage-verified>

Only retrieve sites verified by the Engagement VO site verification tests.
This cannot be used together with --vo, as the query will only work for
sites advertising support for the Engagement VO.

This option means information will be retrieved from the Engagement collector
instead of the top-level ReSS collector.

=item B<--out=[filename]>

Write to [filename] instead of stdout

=back

=head1 DESCRIPTION

B<swift-osg-ress-site-catalog> converts ReSS data to a Swift site catalog.

=cut
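The script derives the GridFTP host and the jobmanager name from GlueCEInfoContactString with two compact substitutions. As a cross-check, here is a hypothetical shell re-implementation of the same logic (not part of the contributed tool); the contact string in the example, including the host ce.example.edu and port 2119, is made up for illustration.

```shell
#!/bin/bash
# Illustration only: mirrors the Perl script's handling of a contact string.
parse_contact() {
    local contact="$1"
    # Like s/:.*// : drop everything from the first colon onward.
    local host="${contact%%:*}"
    # Like s/.*jobmanager-// : drop everything through the last "jobmanager-".
    local jm="${contact##*jobmanager-}"
    # Same name mapping as the script; other names pass through unchanged.
    case "$jm" in
        pbs)    jm="PBS" ;;
        lsf)    jm="LSF" ;;
        sge)    jm="SGE" ;;
        condor) jm="Condor" ;;
    esac
    printf '%s %s\n' "$host" "$jm"
}

parse_contact "ce.example.edu:2119/jobmanager-pbs"
```

As in the Perl code, a jobmanager that is not in the mapping (for example fork) is emitted as-is.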


From benc at hawaga.org.uk  Tue Feb 17 02:26:49 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 17 Feb 2009 08:26:49 +0000 (GMT)
Subject: [Swift-devel] Contribution: swift-osg-ress-site-catalog
In-Reply-To: <20090216230150.GA9956@rynge.europa.renci.org>
References: <20090216230150.GA9956@rynge.europa.renci.org>
Message-ID: <Pine.LNX.4.64.0902170826380.1293@dildano.hawaga.org.uk>

cool. I'll put this in for 0.9.
-- 


From iraicu at cs.uchicago.edu  Wed Feb 18 11:43:04 2009
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Wed, 18 Feb 2009 11:43:04 -0600
Subject: [Swift-devel] [Fwd: [Dadc09] Deadlines for DADC'09 extended]
Message-ID: <499C48A8.5080808@cs.uchicago.edu>

Hi all,
The DADC workshop has extended its deadline. I attended last year, and it
was a great venue focusing on data resource management in distributed
systems. If you have any work that is close to being ready to publish
and is relevant to the workshop, it's a good venue to try!

Cheers,
Ioan

-------- Original Message --------
Subject: 	[Dadc09] Deadlines for DADC'09 extended
Date: 	Wed, 18 Feb 2009 11:34:16 -0600
From: 	Tevfik Kosar <kosar at cct.lsu.edu>
To: 	dadc09 at cct.lsu.edu


The abstract and paper submission deadlines for DADC'09 have been 
extended. The new deadlines are:

Abstract submission: February 25, 2009 (extended)
Paper submission: March 1, 2009 (extended)

Thank you.
Tevfik

-----------------------------------------------------------------------------------
*** Call for Papers ***
WORKSHOP ON DATA-AWARE DISTRIBUTED COMPUTING (DADC'09)
In conjunction with HPDC 2009, June 9-13, Munich, Germany
http://www.cct.lsu.edu/~kosar/dadc09 
------------------------------------------------------------------------------------

The Second International Workshop on Data-Aware Distributed Computing
(DADC'09) will be held in conjunction with the 18th International Symposium
on High Performance Distributed Computing (HPDC-18), in Munich, Germany.

The data requirements of scientific as well as commercial applications from
a diverse range of fields have been increasing exponentially in recent
years. This increase in the demand for large-scale data processing has
necessitated collaboration and sharing among the world's leading education,
research, and industrial institutions, and the use of distributed resources
owned by collaborating parties. In a widely distributed environment, data is
no longer locally accessible and thus has to be remotely retrieved and
stored. While traditional distributed systems work well for computation that
requires limited data handling, they fail in unexpected ways when the
computation accesses, creates, and moves large amounts of data, especially
over wide-area networks. Scientists, researchers, and application developers
are often forced to spend a great deal of time and energy on solving basic
data-handling issues, such as the physical location of data, how to access
it, and/or how to move it to visualization and/or compute resources for
further analysis.

This workshop will focus on the challenges imposed on distributed systems by
data-intensive applications, and on the different state-of-the-art solutions
proposed to overcome these challenges. A new paradigm called "data-aware
distributed computing" and its application to different research realms such
as scheduling, resource allocation, workflow management, and visualization
will be discussed. With the knowledge of the correct data management
techniques, domain scientists will be able to focus on their primary goal,
assured that their data management needs are handled reliably and
efficiently. We believe this workshop will make a unique contribution to the
collaborative and distributed computing community by focusing on the
planning, management, and scheduling of data handling tasks and data storage
resources.

Topics of interest include, but are not limited to:
- Data-intensive applications and their challenges
- Data-aware toolkits, middleware, storage and file systems
- Data-oriented batch schedulers and workflow managers
- Data staging, replication, and remote access to data
- Data placement, management, and scheduling techniques
- Co-scheduling of computation, data storage, and network resources
- Network support for data-intensive computing
- High speed wide area data transfers and bulk data movement
- Remote and distributed visualization of large scale data
- Data-aware workflow and data-flow management
- Cross-domain metadata and ontologies
- Distributed and hierarchical storage management
- Storage resource managers and brokers
- Data archives, digital libraries, and preservations
- Service oriented architectures for data-intensive computing
- Protection of sensitive data in a collaborative environment
- Peer-to-peer data movement and data streaming
- Future research challenges in data-intensive computing

IMPORTANT DATES:
Abstract submission: February 25, 2009 (extended)
Paper submission: March 1, 2009 (extended)
Acceptance notification: March 15, 2009
Final papers due: April 1, 2009

PAPER SUBMISSIONS:
DADC'09 invites authors to submit original and unpublished technical papers
of at most 10 pages. All submissions will be peer-reviewed and judged on
correctness, originality, technical strength, significance, quality of
presentation, and relevance to the workshop topics of interest. Submitted
papers may not have appeared in or be under consideration for another
workshop, conference, or journal, nor may they be under review or submitted
to another forum during the DADC'09 review process. Papers should be prepared
in ACM SIG Proceedings format and submitted electronically (in PDF format)
via the HPDC 2009 conference web site.


WORKSHOP and PROGRAM CHAIR:
Tevfik Kosar, Louisiana State University


PROGRAM COMMITTEE:
Micah Beck, University of Tennessee
John Bent, Los Alamos National Laboratory
Ann Chervenak, USC Information Sciences Institute
Alok Choudhary, Northwestern University
Ewa Deelman, USC Information Sciences Institute
Renato Figueiredo, University of Florida
Geoffrey Fox, Indiana University
Peter Kacsuk, Hungarian Academy of Sciences
Dan Katz, Louisiana State University
Peter Kunszt, Swiss National Computing Center
Erwin Laure, CERN
Reagan Moore, San Diego Supercomputing Center
Don Petravick, Fermi National Accelerator Laboratory
Ioan Raicu, University of Chicago
Sanjay Ranka, University of Florida
Doron Rotem, Lawrence Berkeley National Laboratory
Jennifer Schopf, National Science Foundation
Florian Schintke, Zuse Institute Berlin
Alex Sim, Lawrence Berkeley National Laboratory
Ian Taylor, Cardiff University
Douglas Thain, University of Notre Dame
Brian Tierney, Lawrence Berkeley National Laboratory
Bernard Traversat, Sun Microsystems
Sudharshan Vazhkudai, Oak Ridge National Laboratory
Andrew Wendelborn, University of Adelaide
Mike Wilde, Argonne National Laboratory


-- 
===================================================
Ioan Raicu, Ph.D.
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================


From zhaozhang at uchicago.edu  Thu Feb 19 11:51:59 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Thu, 19 Feb 2009 11:51:59 -0600
Subject: [Swift-devel] new swift 
Message-ID: <499D9C3F.9060904@uchicago.edu>

Hi,

I tried to update my Swift on the BG/P to the latest version, but I found
that the tree structure has changed: provider-deef is no longer in the
"modules" directory.
Should I copy the old provider-deef to the new directory, or do you have any
suggestions for installing provider-deef? Thanks.

best wishes
zhangzhao


From benc at hawaga.org.uk  Thu Feb 19 15:40:22 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 19 Feb 2009 21:40:22 +0000 (GMT)
Subject: [Swift-devel] new swift 
In-Reply-To: <499D9C3F.9060904@uchicago.edu>
References: <499D9C3F.9060904@uchicago.edu>
Message-ID: <Pine.LNX.4.64.0902192136150.1293@dildano.hawaga.org.uk>


On Thu, 19 Feb 2009, Zhao Zhang wrote:

> I tried to update my swift on bgp up to date, but I found that the tree
> structure is changed, the provider-deef is no longer in the "module"
> directory.
> Shall I copy the old provider-deef to the new directory? or any  suggestions
> to  install provider-deef? thanks.

What do you mean by that?

What did you do to update? Did you go into an existing checkout, or start
fresh?

If you start fresh, then you need to make three separate checkouts:

cog/

cog/modules/swift

cog/modules/provider-deef

from their respective repository locations.

-- 


From zhaozhang at uchicago.edu  Fri Feb 20 14:37:32 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Fri, 20 Feb 2009 14:37:32 -0600
Subject: [Swift-devel] filesystem mapper
Message-ID: <499F148C.7050906@uchicago.edu>

Hi,

I have a problem with file system mapper in the latest version of swift.
The code looks like:
Mol texts[] <filesys_mapper;location="input/", suffix=".mol2">;

It is trying to map all .mol2 files in the input directory, and it works 
fine with an older version of swift which is
Swift svn swift-r2334 (Swift modified locally) cog-r2216

But failed with the following information
zzhang at login6.surveyor:~/new_dock6> ./run_swift_ssh.sh 1010 64 test.swift
waiting for at least 64 nodes to register before submitting workload...
waiting to find at least 1 services in file 
/home/falkon/users/zzhang/1010/config/Client-service-URIs.config...
all done, file has found at least 1 services
found at least 64 registered, submitting workload...
Swift svn swift-r2578 cog-r2305

RunID: 20090220-1428-ugfvnoya
Progress:
Execution failed:
        Getting value for array texts.$[]/1 which is not permitted.

The log file is at 
http://www.ci.uchicago.edu/~zzhang/test-20090220-1428-ugfvnoya.log

zhao


From zhaozhang at uchicago.edu  Fri Feb 20 16:48:49 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Fri, 20 Feb 2009 16:48:49 -0600
Subject: [Swift-devel] absolute path
Message-ID: <499F3351.2060000@uchicago.edu>

Hi, Guys

I found that in the latest Swift code, the task description uses an
absolute path, like this:
/bin/bash /tmp/sleep-20090220-1646-7vdlcg0a/shared/wrapper.sh sleep-hoyn8x6j -jobdir h/o -e /bin/sleep -out stdout.txt -err stderr.txt -i -d  -if  -of  -k  -a 30

I mean the "/tmp/sleep-20090220-1646-7vdlcg0a/shared/wrapper.sh" part.
Is there an option to use a relative path instead? Thanks.

best wishes
zhangzhao


From hategan at mcs.anl.gov  Fri Feb 20 16:57:12 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 20 Feb 2009 16:57:12 -0600 (CST)
Subject: [Swift-devel] absolute path
In-Reply-To: <499F3351.2060000@uchicago.edu>
Message-ID: <9459070.396241235170632289.JavaMail.root@zimbra>


----- Zhao Zhang <zhaozhang at uchicago.edu> wrote:
> Hi, Guys
> 
> I found that in the latest swift code, the task description is using 
> absolute path like this:
> /bin/bash /tmp/sleep-20090220-1646-7vdlcg0a/shared/wrapper.sh 
> sleep-hoyn8x6j -jobdir h/o -e /bin/sleep -out stdout.txt -err stderr.txt 
> -i -d  -if  -of  -k  -a 30
> 
> I mean the "/tmp/sleep-20090220-1646-7vdlcg0a/shared/wrapper.sh" part.
> Is there an option that we use relative path? thanks

Relative to what?

You could try changing the <workdirectory> in sites.xml.


From zhaozhang at uchicago.edu  Fri Feb 20 17:01:13 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Fri, 20 Feb 2009 17:01:13 -0600
Subject: [Swift-devel] absolute path
In-Reply-To: <9459070.396241235170632289.JavaMail.root@zimbra>
References: <9459070.396241235170632289.JavaMail.root@zimbra>
Message-ID: <499F3639.8010709@uchicago.edu>

So, in the old version, instead of
"/tmp/sleep-20090220-1646-7vdlcg0a/shared/wrapper.sh"
it was just:
"shared/wrapper.sh"

That is what I mean by a relative path.

I have another question, about using signal notification instead of status
files. As I remember, there is an option in one of the property files for
that, but I couldn't find it.

zhao



From hategan at mcs.anl.gov  Fri Feb 20 17:08:11 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 20 Feb 2009 17:08:11 -0600 (CST)
Subject: [Swift-devel] absolute path
In-Reply-To: <499F3639.8010709@uchicago.edu>
Message-ID: <7469230.397401235171291666.JavaMail.root@zimbra>


----- Zhao Zhang <zhaozhang at uchicago.edu> wrote:
> so, in the old version instead of
> "/tmp/sleep-20090220-1646-7vdlcg0a/shared/wrapper.sh"
> it is like this:
> "shared/wrapper.sh"
> 
> by this I mean it is a relative path.

Yes, of course, but relative paths are relative to something. In
other words, where do you want it to end up on the remote site?

> 
> I got another question about using signal notification instead of status 
> files,
> as I remembered, there is an option in one property file for that,

Have you tried swift.properties?

> but I 
> couldn't
> find it.
> 


From zhaozhang at uchicago.edu  Fri Feb 20 17:11:50 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Fri, 20 Feb 2009 17:11:50 -0600
Subject: [Swift-devel] absolute path
In-Reply-To: <7469230.397401235171291666.JavaMail.root@zimbra>
References: <7469230.397401235171291666.JavaMail.root@zimbra>
Message-ID: <499F38B6.5070403@uchicago.edu>


Mihael Hategan wrote:
> ----- Zhao Zhang <zhaozhang at uchicago.edu> wrote:
>   
>> so, in the old version instead of
>> "/tmp/sleep-20090220-1646-7vdlcg0a/shared/wrapper.sh"
>> it is like this:
>> "shared/wrapper.sh"
>>
>> by this I mean it is a relative path.
>>     
>
> Yes, of course, but relative paths are in respect to something. In
> other words, where do you want it to end on the remote site?
>   
This is fine. I just made a change in the worker code so that it
handles both cases dynamically.
>   
>> I got another question about using signal notification instead of status 
>> files,
>> as I remembered, there is an option in one property file for that,
>>     
>
> Have you tried swift.properties?
>   
Yes, I tried, but I didn't find it. Ben knows where it is, but he is
probably asleep right now.

zhao


From hategan at mcs.anl.gov  Fri Feb 20 17:15:40 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 20 Feb 2009 17:15:40 -0600 (CST)
Subject: [Swift-devel] absolute path
In-Reply-To: <499F38B6.5070403@uchicago.edu>
Message-ID: <416126.398011235171740732.JavaMail.root@zimbra>


----- Zhao Zhang <zhaozhang at uchicago.edu> wrote:
> 
> 
> Mihael Hategan wrote:
> > ----- Zhao Zhang <zhaozhang at uchicago.edu> wrote:
> >   
> >> so, in the old version instead of
> >> "/tmp/sleep-20090220-1646-7vdlcg0a/shared/wrapper.sh"
> >> it is like this:
> >> "shared/wrapper.sh"
> >>
> >> by this I mean it is a relative path.
> >>     
> >
> > Yes, of course, but relative paths are in respect to something. In
> > other words, where do you want it to end on the remote site?
> >   
> This is fine. I just made a change in the worker code, and it could 
> dynamically work with both cases.
> >   
> >> I got another question about using signal notification instead of status 
> >> files,
> >> as I remembered, there is an option in one property file for that,
> >>     
> >
> > Have you tried swift.properties?
> >   
> yes I tried, but I didn't find it.

You should probably do an SVN update and look at the end of that file:

# Controls how Swift will communicate the result code of running user programs
# from workers to the submit side. In files mode, a file
# indicating success or failure will be created on the site shared filesystem.
# In provider mode, the execution provider job status will
# be used. Notably, GRAM2 does not return job statuses correctly, and so
# provider mode will not work with GRAM2. With other
# providers, it can be used to reduce the amount of filesystem access compared
# to files mode.
#
# status.mode=files
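Assuming swift.properties is read as ordinary key=value lines, as the commented default above suggests, switching to provider mode amounts to one uncommented status.mode=provider line. The helper below is a hypothetical sketch, not part of Swift: it keeps all comments, drops any active status.mode setting, and appends the requested one.

```shell
#!/bin/bash
# Hypothetical helper: set status.mode in a swift.properties file.
# Per the comment above, the modes are "files" (default) and "provider";
# provider mode does not work with GRAM2.
set_status_mode() {
    local props="$1" mode="$2" tmp
    case "$mode" in
        files|provider) ;;
        *) echo "unknown status.mode: $mode" >&2; return 1 ;;
    esac
    tmp=$(mktemp) || return 1
    # Keep everything except active (uncommented) status.mode lines;
    # assumes the key=value form used in the commented default.
    grep -v '^[[:space:]]*status\.mode[[:space:]]*=' "$props" > "$tmp"
    echo "status.mode=$mode" >> "$tmp"
    # Copy back with cat rather than mv so the file keeps its permissions.
    cat "$tmp" > "$props"
    rm -f "$tmp"
}
```

For example, from the etc directory: set_status_mode swift.properties provider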


From zhaozhang at uchicago.edu  Fri Feb 20 17:20:03 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Fri, 20 Feb 2009 17:20:03 -0600
Subject: [Swift-devel] absolute path
In-Reply-To: <416126.398011235171740732.JavaMail.root@zimbra>
References: <416126.398011235171740732.JavaMail.root@zimbra>
Message-ID: <499F3AA3.1050204@uchicago.edu>

I tried this:
zzhang at login6.surveyor:/home/falkon/swift_scratch/cog/modules/swift> svn update
U    src/org/griphyn/vdl/mapping/RootDataNode.java
Updated to revision 2579.

but there is no such text in swift.properties.

zhao



From hategan at mcs.anl.gov  Fri Feb 20 17:29:35 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 20 Feb 2009 17:29:35 -0600 (CST)
Subject: [Swift-devel] absolute path
In-Reply-To: <499F3AA3.1050204@uchicago.edu>
Message-ID: <2461287.398561235172575261.JavaMail.root@zimbra>

----- Zhao Zhang <zhaozhang at uchicago.edu> wrote:
> I tried this
> zzhang at login6.surveyor:/home/falkon/swift_scratch/cog/modules/swift> svn 
> update
> U    src/org/griphyn/vdl/mapping/RootDataNode.java
> Updated to revision 2579.
> 
> but there is no such texts in the swift.properties.

That's probably because you have a locally modified swift.properties
that got messed up.

Try this:
swift> cd etc
mv swift.properties swift.properties.mine
svn up
tail -n 16 swift.properties

Then manually merge your customization into swift.properties and
re-compile.
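For the manual merge step, one way to list your customizations (a sketch, not a Swift tool; show_local_changes is a hypothetical name) is to print the lines of swift.properties.mine that do not appear verbatim anywhere in the freshly updated swift.properties.

```shell
#!/bin/bash
# Hypothetical helper: print lines present in your saved copy but absent
# from the updated file, i.e. the local changes you may want to merge back.
show_local_changes() {
    local fresh="$1" mine="$2"
    # -v: invert match, -x: whole-line match, -F: fixed strings,
    # -f: read the patterns (one per line) from the updated file.
    # With -x, blank lines in $fresh only match blank lines in $mine.
    grep -vxF -f "$fresh" "$mine"
}
```

Run it from the etc directory as: show_local_changes swift.properties swift.properties.mine. Note that grep exits non-zero when there is nothing to print, i.e. when you have no local changes.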



From benc at hawaga.org.uk  Fri Feb 20 17:30:25 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 20 Feb 2009 23:30:25 +0000 (GMT)
Subject: [Swift-devel] absolute path
In-Reply-To: <499F3351.2060000@uchicago.edu>
References: <499F3351.2060000@uchicago.edu>
Message-ID: <Pine.LNX.4.64.0902202328590.1293@dildano.hawaga.org.uk>


On Fri, 20 Feb 2009, Zhao Zhang wrote:

> I found that in the latest swift code, the task description is using absolute
> path like this:
> /bin/bash /tmp/sleep-20090220-1646-7vdlcg0a/shared/wrapper.sh sleep-hoyn8x6j
> -jobdir h/o -e /bin/sleep -out stdout.txt -err stderr.txt -i -d  -if  -of  -k
> -a 30
> 
> I mean the "/tmp/sleep-20090220-1646-7vdlcg0a/shared/wrapper.sh" part.
> Is there an option that we use relative path? thanks

yes, that's a change made recently because I discovered that some sites do 
not respect the initial working directory specified in job submissions.

In the rest of this thread, you don't clearly describe *why* you want a 
relative path - you're clearly trying to achieve some higher goal but it 
is not clear what.

-- 


From zhaozhang at uchicago.edu  Fri Feb 20 17:32:15 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Fri, 20 Feb 2009 17:32:15 -0600
Subject: [Swift-devel] absolute path
In-Reply-To: <Pine.LNX.4.64.0902202328590.1293@dildano.hawaga.org.uk>
References: <499F3351.2060000@uchicago.edu>
	<Pine.LNX.4.64.0902202328590.1293@dildano.hawaga.org.uk>
Message-ID: <499F3D7F.5010708@uchicago.edu>

It doesn't matter now. I was just trying out the latest version of Swift
and got stuck, and then I solved the problem.

zhao



From benc at hawaga.org.uk  Fri Feb 20 17:32:34 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 20 Feb 2009 23:32:34 +0000 (GMT)
Subject: [Swift-devel] absolute path
In-Reply-To: <2461287.398561235172575261.JavaMail.root@zimbra>
References: <2461287.398561235172575261.JavaMail.root@zimbra>
Message-ID: <Pine.LNX.4.64.0902202331560.1293@dildano.hawaga.org.uk>


On Fri, 20 Feb 2009, Mihael Hategan wrote:

> Try this:
> swift> cd etc
> mv swift.properties swift.properties.mine
> svn up
> tail -n 16 swift.properties

also paste the output of:

 svn info swift.properties

-- 


From zhaozhang at uchicago.edu  Fri Feb 20 17:35:09 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Fri, 20 Feb 2009 17:35:09 -0600
Subject: [Swift-devel] absolute path
In-Reply-To: <Pine.LNX.4.64.0902202331560.1293@dildano.hawaga.org.uk>
References: <2461287.398561235172575261.JavaMail.root@zimbra>
	<Pine.LNX.4.64.0902202331560.1293@dildano.hawaga.org.uk>
Message-ID: <499F3E2D.5030706@uchicago.edu>

Yep, got it. I will try this out soon.

zzhang at login6.surveyor:/home/falkon/swift_scratch/cog/modules/swift/etc> svn info swift.properties
Path: swift.properties
Name: swift.properties
URL: https://svn.ci.uchicago.edu/svn/vdl2/trunk/etc/swift.properties
Repository Root: https://svn.ci.uchicago.edu/svn/vdl2
Repository UUID: e2bb083e-7f23-0410-b3a8-8253ac9ef6d8
Revision: 2579
Node Kind: file
Schedule: normal
Last Changed Author: benc
Last Changed Rev: 2533
Last Changed Date: 2009-02-13 07:54:41 -0600 (Fri, 13 Feb 2009)
Text Last Updated: 2009-02-20 17:32:31 -0600 (Fri, 20 Feb 2009)
Properties Last Updated: 2009-02-20 11:03:38 -0600 (Fri, 20 Feb 2009)
Checksum: c7124b6c27e8bc56b68f2d197d31c96d

zhao



From benc at hawaga.org.uk  Fri Feb 20 17:40:08 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 20 Feb 2009 23:40:08 +0000 (GMT)
Subject: [Swift-devel] absolute path
In-Reply-To: <499F3D7F.5010708@uchicago.edu>
References: <499F3351.2060000@uchicago.edu>
	<Pine.LNX.4.64.0902202328590.1293@dildano.hawaga.org.uk>
	<499F3D7F.5010708@uchicago.edu>
Message-ID: <Pine.LNX.4.64.0902202334260.23512@dildano.hawaga.org.uk>


On Fri, 20 Feb 2009, Zhao Zhang wrote:

> it doesn't matter right now, I was just trying out the latest version of swift
> and stuck, then I solved the problem.

It does matter, in that I made a change that was, as far as I could see, 
either identical or beneficial in all circumstances.

But apparently this change caused trouble for you.

If you write about what you were trying to do and how you solved your 
problem, you will help the Swift developers understand what you are doing 
and you are likely to encounter fewer problems in the future.

If you keep secrets, then we cannot help you.

-- 


From zhaozhang at uchicago.edu  Fri Feb 20 18:01:33 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Fri, 20 Feb 2009 18:01:33 -0600
Subject: [Swift-devel] absolute path
In-Reply-To: <Pine.LNX.4.64.0902202334260.23512@dildano.hawaga.org.uk>
References: <499F3351.2060000@uchicago.edu>
	<Pine.LNX.4.64.0902202328590.1293@dildano.hawaga.org.uk>
	<499F3D7F.5010708@uchicago.edu>
	<Pine.LNX.4.64.0902202334260.23512@dildano.hawaga.org.uk>
Message-ID: <499F445D.7070205@uchicago.edu>

Sure, I am not keeping secrets : )

The context on the BG/P is this: the working directory
"sleep-20090220-1646-7vdlcg0a" is created on the I/O nodes at
"/tmp/sleep-20090220-1646-7vdlcg0a/" and mounted on the compute nodes
through FUSE, which means the compute nodes need to enter this directory
through /fuse/tmp/sleep-20090220-1646-7vdlcg0a. With the old-style relative
path "shared/wrapper.sh", everything is fine, and since wrapper.sh knows
where the job was started, all output data is copied to the job directory
on the I/O nodes.

In the new case, an absolute path is used for wrapper.sh. After the worker
enters the working directory on the I/O nodes, it tries to invoke
"/tmp/sleep-20090220-1646-7vdlcg0a/shared/wrapper.sh", which it cannot find.
So what I did is this: if the path of wrapper.sh is absolute, put "/fuse" in
front of it (giving "/fuse/tmp/sleep-20090220-1646-7vdlcg0a/shared/wrapper.sh");
otherwise, work as before.
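The path rewrite described above can be sketched as a small shell function. This is a hypothetical reconstruction, not the actual worker code (which is not shown in this thread); it assumes, as described, that the I/O nodes' /tmp is reachable from the compute nodes under the /fuse prefix.

```shell
#!/bin/bash
# Hypothetical sketch: map the wrapper.sh path from the task description to
# a path a compute node can open. FUSE_PREFIX is an assumption based on the
# /fuse mount described above.
FUSE_PREFIX="/fuse"

resolve_wrapper_path() {
    local path="$1"
    case "$path" in
        /*)
            # New style, absolute (e.g. /tmp/<run-dir>/shared/wrapper.sh):
            # route it through the FUSE mount.
            printf '%s\n' "${FUSE_PREFIX}${path}"
            ;;
        *)
            # Old style, relative (e.g. shared/wrapper.sh): resolved against
            # the job's working directory, so leave it unchanged.
            printf '%s\n' "$path"
            ;;
    esac
}

resolve_wrapper_path /tmp/sleep-20090220-1646-7vdlcg0a/shared/wrapper.sh
resolve_wrapper_path shared/wrapper.sh
```

Checking only for a leading "/" keeps the worker compatible with both the old relative and the new absolute form without needing to know which Swift version submitted the task.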

If any point is unclear, let me know ASAP.

best wishes
zhangzhao



From benc at hawaga.org.uk  Sat Feb 21 03:10:14 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sat, 21 Feb 2009 09:10:14 +0000 (GMT)
Subject: [Swift-devel] filesystem mapper
In-Reply-To: <499F148C.7050906@uchicago.edu>
References: <499F148C.7050906@uchicago.edu>
Message-ID: <Pine.LNX.4.64.0902210908430.1293@dildano.hawaga.org.uk>


can you send me the .swift and .kml files for this?

Did you recompile your kml file after upgrading? If not, you may find a 
kml file from an older Swift does not work with the present version.


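The kind of mapping Zhao describes (every .mol2 file in input/ becomes one element of texts[]) can be pictured as: list the directory, keep the names ending in the suffix, and assign consecutive array indices. The sketch below is a hypothetical Python illustration of that idea, not Swift's filesys_mapper implementation; `map_files` is an invented name, and lexicographic ordering of the files is an assumption.

```python
import os
import tempfile


def map_files(location, suffix):
    """Hypothetical sketch: map each file in `location` whose name ends in
    `suffix` to a consecutive array index. Lexicographic order is an
    assumption; the real filesys_mapper may order files differently."""
    names = sorted(n for n in os.listdir(location) if n.endswith(suffix))
    return {i: os.path.join(location, n) for i, n in enumerate(names)}


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        # Two matching files and one that the suffix filter should skip.
        for name in ["b.mol2", "a.mol2", "notes.txt"]:
            open(os.path.join(d, name), "w").close()
        for index, path in map_files(d, ".mol2").items():
            print(index, os.path.basename(path))  # prints "0 a.mol2" then "1 b.mol2"
```

Under these assumptions, texts[0] would refer to a.mol2 and texts[1] to b.mol2, and notes.txt would not be mapped at all.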

From iraicu at cs.uchicago.edu  Sat Feb 21 06:37:59 2009
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Sat, 21 Feb 2009 06:37:59 -0600
Subject: [Swift-devel] [Fwd: [Dbworld] Extended Deadline:CFP: IEEE
 International Workshop on	Scientific Workflows (SWF 2009)]
Message-ID: <499FF5A7.5060308@cs.uchicago.edu>

It seems that the SWF09 deadline has been extended to March 16th.

Cheers,
Ioan

-------- Original Message --------
Subject: 	[Dbworld] Extended Deadline:CFP: IEEE International Workshop 
on Scientific Workflows (SWF 2009)
Date: 	Fri, 20 Feb 2009 16:35:21 -0600
From: 	Shiyong Lu <shiyong at wayne.edu>
Reply-To: 	dbworld_owner at yahoo.com
To: 	dbworld at cs.wisc.edu


EXTENDED DEADLINE: Due to numerous requests and a discussion with the
ICWS organizing committee, the SWF submission deadline is extended to March 16, 2009;
existing submissions can be updated before the deadline.


Call for Papers  
IEEE 2009 Third International Workshop on Scientific Workflows (SWF 2009)
http://www.servicescongress.org/2009/1/swf-2009.html
Los Angeles, USA, July 10, 2009
In conjunction with IEEE International Conference on Web Services (ICWS 2009)

Description
Today, many scientific discoveries are achieved through complex and distributed scientific
computations that are represented and structured as scientific workflows. User-friendly
scientific workflow systems are increasingly being developed to enable e-scientists to
integrate, structure, and orchestrate various local or remote data and service resources
to perform various in silico experiments that produce interesting scientific discoveries.
The critical role of scientific workflows in cyberinfrastructure has been recognized
by a recent NSF workshop on the challenges of scientific workflows in May 2006,
which concluded that "workflows should become first-class entities in cyberinfrastructure
architecture. For domain scientists, they are important because workflows document and
manage the increasingly complex processes involved in exploration and discovery through
computations. For computer scientists, workflows provide a formal and declarative
representation of complex distributed computations that must be managed efficiently
through their lifecycle from assembly, to execution, to sharing."

Authors are invited to submit regular papers (8 pages), short papers (4 pages), and 
demo papers (2 pages) that show original unpublished research results in all areas 
of scientific workflows. Topics of interest are listed below; however, submissions 
on all aspects of scientific workflows are welcome. For demo papers, at least one
author is expected to present a demo during the workshop's demo session;
special arrangements will be made to meet the needs of the authors.
Accepted SWF 2009 papers will be included in the proceedings of SERVICES 2009 (Part I), 
which will be published by IEEE Computer Society Press.


Topics
o Architecture, model, and language 
o Provenance management 
o Task management 
o Workflow scheduling 
o Data product management 
o Monitoring and failure handling 
o Service, Grid, and Cloud workflows
o Scientific workflow composition 
o Scientific workflow security 
o Modeling, simulation, analysis 
o Scalability, reliability, extensibility 
o Scientific workflow applications 
o Service-oriented scientific workflows and workflow-based Web services 
o Security of Web services and scientific workflows 
o Data integration and service integration in scientific workflows
o Application service management in scientific workflows
o Data service management in scientific workflows 
o Scientific workflow architectures, models, and languages
o Grid workflow management
o Scientific workflow mapping, optimization, and scheduling
o Scientific workflow modeling, verification, and validation 
o Scientific workflow provenance management
o Workflow and provenance mining and analysis
o Scalability, reliability, extensibility, agility, and interoperability    
o Scientific workflow real-life applications

 
Important dates
Paper Submission     March 16, 2009
Decision Notification (Electronic)   April 2, 2009
Camera-Ready Submission & Pre-registration    April 17, 2009


Workshop organizers
Workshop chairs: Shiyong Lu, Wayne State University, shiyong at wayne.edu; Calton Pu, Georgia Tech
Publicity chairs: Yong Zhao, Microsoft Corporation; Ilkay Altintas, San Diego Supercomputer Center
Publication chair: Cui Lin, Wayne State University

For any questions, please send e-mails to Shiyong Lu at shiyong at wayne.edu.
 
Previous SWF workshops
http://www.cs.wayne.edu/~shiyong/swf

IEEE 2009 Third International Workshop on Scientific Workflows 

_______________________________________________
Please do not post msgs that are not relevant to the database community at large.  Go to www.cs.wisc.edu/dbworld for guidelines and posting forms.
To unsubscribe, go to https://lists.cs.wisc.edu/mailman/listinfo/dbworld


-- 
===================================================
Ioan Raicu, Ph.D.
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================
===================================================

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20090221/f2e3240e/attachment.html>

From iraicu at cs.uchicago.edu  Sat Feb 21 06:39:30 2009
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Sat, 21 Feb 2009 06:39:30 -0600
Subject: [Swift-devel] [Fwd: [Dbworld] 2nd CFP: Special issue on Scientific
 Workflows in IJBPIM]
Message-ID: <499FF602.3080102@cs.uchicago.edu>

This might be a good venue for a future paper on Swift/Falkon, with a
May 1st deadline.

Cheers,
Ioan

-------- Original Message --------
Subject: 	[Dbworld] 2nd CFP: Special issue on Scientific Workflows in 
IJBPIM
Date: 	Fri, 20 Feb 2009 16:46:49 -0600
From: 	Shiyong Lu <shiyong at wayne.edu>
Reply-To: 	dbworld_owner at yahoo.com
To: 	dbworld at cs.wisc.edu


Call for Papers  
Special Issue on Scientific Workflows
International Journal of Business Process Integration and Management (IJBPIM)
http://www.cs.wayne.edu/~shiyong/swf/ijbpim09.html


Description
Scientific workflows have recently emerged as a new paradigm for scientists to formalize and 
structure complex scientific processes to enable and accelerate many significant scientific 
discoveries. A scientific workflow is a formal specification of a scientific process, which 
represents, streamlines, and automates the analytical and computational steps that a scientist 
needs to go through from dataset selection and integration, computation and analysis, to final 
data product presentation and visualization. A scientific workflow management system (SWFMS) 
is a system that supports the specification, modification, execution, failure recovery, 
and monitoring of a scientific workflow using the workflow logic to control the order of 
executing workflow tasks. 

The goal of this special issue is to present critical challenges, requirements, and issues 
related to scientific workflows. This collection of manuscripts will discuss key aspects 
in the development of a broad range of novel and innovative scientific workflow technologies. 
The emphasis of the special issue is on critical challenges in the development of various 
scientific workflows specifically as they relate to business workflow and service technologies. 
Particular emphasis will be placed on examples where innovative solutions to these challenges 
have resulted in scientific workflows which impact the scientific discovery process. Topics 
include but are not limited to:

List of topics
o Scientific workflow provenance management
o Scientific workflow provenance analytics
o Scientific workflow data, metadata, service, and task management
o Scientific workflow architectures, models, and languages
o Scientific workflow monitoring and failure handling
o Streaming data processing in scientific workflows
o Pipelined, data, workflow, and task parallelism in scientific workflows
o Service, Grid, or Cloud-based scientific workflows
o Data, metadata, compute, user-interaction, or visualization-intensive scientific workflows
o Scientific workflow composition
o Security issues in scientific workflows
o Data integration and service integration in scientific workflows
o Scientific workflow mapping, optimization, and scheduling
o Scientific workflow modeling, verification, and validation
o Scalability, reliability, extensibility, agility, and interoperability
o Scientific workflow real-life applications
 

Important dates
o May 1, 2009: paper submission
o August 1, 2009: notification
o November 1, 2009: camera-ready version
o Planned publication: end of 2009/early 2010

Guest editors
o Shiyong Lu, Wayne State University, U.S.A., Email: shiyong at wayne.edu
o Ewa Deelman, USC Information Sciences Institute, U.S.A., Email: deelman at isi.edu
o Zhiming Zhao, University of Amsterdam, the Netherlands, Email: z.zhao at uva.nl

Submission details
Submitted papers should not have been previously published nor be currently under 
consideration for publication elsewhere. All papers are refereed through a peer review process. 
Please submit your paper at http://www.servicescomputing.org/ijbpim. 

Contact information
All enquiries about the special issue should be sent to Shiyong Lu at shiyong at wayne.edu.


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20090221/dc0bc760/attachment.html>

From benc at hawaga.org.uk  Sat Feb 21 08:11:03 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sat, 21 Feb 2009 14:11:03 +0000 (GMT)
Subject: [Swift-devel] absolute path
In-Reply-To: <499F445D.7070205@uchicago.edu>
References: <499F3351.2060000@uchicago.edu>
	<Pine.LNX.4.64.0902202328590.1293@dildano.hawaga.org.uk>
	<499F3D7F.5010708@uchicago.edu>
	<Pine.LNX.4.64.0902202334260.23512@dildano.hawaga.org.uk>
	<499F445D.7070205@uchicago.edu>
Message-ID: <Pine.LNX.4.64.0902211407040.23512@dildano.hawaga.org.uk>


On Fri, 20 Feb 2009, Zhao Zhang wrote:

> The context on BG/P is as follows: the working directory
> "sleep-20090220-1646-7vdlcg0a" is created on the I/O nodes at
> "/tmp/sleep-20090220-1646-7vdlcg0a/" and mounted on the compute nodes
> through FUSE, which means the compute nodes need to enter this directory
> through /fuse/tmp/sleep-20090220-1646-7vdlcg0a. With the old style of
> relative path, "shared/wrapper.sh", everything is fine, and wrapper.sh
> knows where the job was started, so all output data is copied to the job
> directory on the I/O nodes.

ok, I'll make a config option to allow you to choose whether wrapper.sh is 
invoked with an absolute path in the command line or not.
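Why the relative path works on BG/P can be shown by resolving both forms of the wrapper path from the directory the job actually starts in. This is a hypothetical Python sketch: the /tmp and /fuse prefixes and the run name come from Zhao's message, the helper name is invented, and it assumes /tmp on a compute node is not the I/O node's /tmp.

```python
import posixpath

# Paths from the message above; the helper below is hypothetical.
RUN = "sleep-20090220-1646-7vdlcg0a"
IO_NODE_JOBDIR = "/tmp/" + RUN                 # where the I/O node sees it
COMPUTE_NODE_JOBDIR = "/fuse" + IO_NODE_JOBDIR  # where compute nodes see it


def wrapper_path_seen_on_compute_node(invocation, cwd):
    """Resolve a command path the way a shell started in `cwd` would:
    an absolute path ignores `cwd`, a relative one is joined to it."""
    return posixpath.normpath(posixpath.join(cwd, invocation))


absolute = posixpath.join(IO_NODE_JOBDIR, "shared/wrapper.sh")
relative = "shared/wrapper.sh"

# The job starts in the FUSE-mounted directory on the compute node.
print(wrapper_path_seen_on_compute_node(absolute, COMPUTE_NODE_JOBDIR))
# -> /tmp/sleep-20090220-1646-7vdlcg0a/shared/wrapper.sh (an I/O-node path)
print(wrapper_path_seen_on_compute_node(relative, COMPUTE_NODE_JOBDIR))
# -> /fuse/tmp/sleep-20090220-1646-7vdlcg0a/shared/wrapper.sh
```

The absolute form names the directory as the submit side sees it, while the relative form follows whatever directory Falkon actually started the job in.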

-- 


From bugzilla-daemon at mcs.anl.gov  Sat Feb 21 09:03:59 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Sat, 21 Feb 2009 09:03:59 -0600 (CST)
Subject: [Swift-devel] [Bug 169] submit-side timeouts (or other fault
	detection) to accommodate some byzantine site failures
In-Reply-To: <bug-169-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20090221150359.1E715164CE@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=169


hategan at mcs.anl.gov changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED


------- Comment #1 from hategan at mcs.anl.gov  2008-12-18 13:46 -------
A 2*walltime timeout is what we discussed before and agreed upon. This needs to
be implemented.


------- Comment #2 from benc at hawaga.org.uk  2009-02-21 09:03 -------
Mihael implemented this for job submission. It isn't implemented for file
operations or transfers, and it likely doesn't behave well with coasters
enabled, so I'm leaving this open as a to-do for those.

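A submit-side 2*walltime check of the kind agreed on in comment #1 might look like the minimal Python sketch below. It is an illustration under stated assumptions, not the implementation Mihael committed: `SubmittedJob` and `timed_out` are hypothetical names, and measuring the timeout from submission time (rather than from when the job started running) is an assumption.

```python
from dataclasses import dataclass


@dataclass
class SubmittedJob:
    job_id: str
    submitted_at: float  # seconds, on whatever clock the submit side uses
    walltime: float      # requested walltime, in seconds


def timed_out(job, now, factor=2.0):
    """Return True once the job has been outstanding for longer than
    factor * walltime (factor=2 as discussed in this bug)."""
    return now - job.submitted_at > factor * job.walltime


job = SubmittedJob("job-1", submitted_at=0.0, walltime=600.0)
print(timed_out(job, now=1000.0))  # False: 1000 s <= 1200 s
print(timed_out(job, now=1300.0))  # True: 1300 s > 1200 s
```

A job flagged this way could then be treated as failed and retried elsewhere, which is the byzantine-site case the bug summary describes.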

-- 
Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You reported the bug, or are watching the reporter.
You are the assignee for the bug, or are watching the assignee.


From bugzilla-daemon at mcs.anl.gov  Sat Feb 21 09:09:06 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Sat, 21 Feb 2009 09:09:06 -0600 (CST)
Subject: [Swift-devel] [Bug 176] New: config option to invoke wrapper script
	with relative path
Message-ID: <bug-176-21@http.bugzilla.mcs.anl.gov/swift/>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=176

           Summary: config option to invoke wrapper script with relative
                    path
           Product: Swift
           Version: unspecified
          Platform: Macintosh
        OS/Version: Mac OS
            Status: NEW
          Severity: normal
          Priority: P2
         Component: General
        AssignedTo: benc at hawaga.org.uk
        ReportedBy: benc at hawaga.org.uk


Versions of Swift before 0.8 invoked the wrapper script using a relative path
and relied on the job submission system to start the job in the directory
requested in the job submission. Some sites do not do that, instead starting
the job in a clean job-specific working directory. Swift 0.8 changed this
behaviour and specifies the path to wrapper.sh explicitly.

However, this fails on sites where the site-shared filesystem is mapped into
local filesystems differently on the worker nodes and through the submission
system, such as the present experimental use of Falkon on BG/P. In such cases,
Falkon starts the job in the correct directory, which Swift then ignores.

A configuration option to switch between absolute and relative behaviour should
be provided.
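The requested switch amounts to choosing between two forms of the wrapper invocation. The Python sketch below shows one way such an option could drive command-line construction; the option name `wrapper.invocation.mode`, the function name, and the argument layout are all hypothetical and are not Swift's actual configuration or command line.

```python
# Hypothetical option values; not an actual Swift property name or values.
ABSOLUTE = "absolute"
RELATIVE = "relative"


def build_wrapper_command(workdir, args, config):
    """Build the command that launches wrapper.sh for one job.

    `config` is a plain dict standing in for Swift's configuration; the key
    "wrapper.invocation.mode" is invented for this sketch."""
    mode = config.get("wrapper.invocation.mode", ABSOLUTE)
    if mode == ABSOLUTE:
        # Swift 0.8 style: name the wrapper by the submit-side path.
        wrapper = workdir.rstrip("/") + "/shared/wrapper.sh"
    elif mode == RELATIVE:
        # Pre-0.8 style: rely on the job starting in the run directory.
        wrapper = "shared/wrapper.sh"
    else:
        raise ValueError("unknown wrapper invocation mode: " + repr(mode))
    return ["/bin/bash", wrapper] + list(args)


run_dir = "/tmp/sleep-20090220-1646-7vdlcg0a"
print(build_wrapper_command(run_dir, ["job-1"], {}))
print(build_wrapper_command(run_dir, ["job-1"], {"wrapper.invocation.mode": RELATIVE}))
```

Keeping the absolute form as the default in this sketch preserves the 0.8 behaviour for sites that start jobs in a clean directory, while letting a site such as Falkon on BG/P opt back into the relative form.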




From benc at hawaga.org.uk  Sat Feb 21 09:09:34 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sat, 21 Feb 2009 15:09:34 +0000 (GMT)
Subject: [Swift-devel] absolute path
In-Reply-To: <Pine.LNX.4.64.0902211407040.23512@dildano.hawaga.org.uk>
References: <499F3351.2060000@uchicago.edu>
	<Pine.LNX.4.64.0902202328590.1293@dildano.hawaga.org.uk>
	<499F3D7F.5010708@uchicago.edu>
	<Pine.LNX.4.64.0902202334260.23512@dildano.hawaga.org.uk>
	<499F445D.7070205@uchicago.edu>
	<Pine.LNX.4.64.0902211407040.23512@dildano.hawaga.org.uk>
Message-ID: <Pine.LNX.4.64.0902211509160.1293@dildano.hawaga.org.uk>


On Sat, 21 Feb 2009, Ben Clifford wrote:

> ok, I'll make a config option to allow you to choose whether wrapper.sh is 
> invoked with an absolute path in the command line or not.

bug 176 if you want to keep an eye on this.

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=176

-- 


From bugzilla-daemon at mcs.anl.gov  Sat Feb 21 09:12:42 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Sat, 21 Feb 2009 09:12:42 -0600 (CST)
Subject: [Swift-devel] [Bug 177] New: variables declared inside an iterate
	body should be available to the termination condition
Message-ID: <bug-177-21@http.bugzilla.mcs.anl.gov/swift/>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=177

           Summary: variables declared inside an iterate body should be
                    available to the termination condition
           Product: Swift
           Version: unspecified
          Platform: Macintosh
        OS/Version: Mac OS
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: SwiftScript language
        AssignedTo: benc at hawaga.org.uk
        ReportedBy: benc at hawaga.org.uk


Variables declared inside an iterate body are not available to the termination
expression. Those variables should be made available to the termination
expression.

As far as I can tell, the lack of this ability does not restrict what can be
expressed, as a variable v used inside the loop can always be transformed into
an array element v[ix] with ix the iteration index, and v declared outside of
the iteration loop. However, it does force a certain coding style which can be
unintuitive.

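The workaround described above, replacing a body-local variable v with an array element v[ix] declared outside the loop, can be illustrated with a small Python analogy. This is not SwiftScript: `iterate_with_outer_array` is a hypothetical helper, and it assumes the termination condition is checked after each pass of the body.

```python
def iterate_with_outer_array(step, done, max_iterations=1000):
    """Emulate the workaround: the value produced in iteration ix is stored
    in v[ix], a collection that lives outside the loop body, so the
    termination condition can read it."""
    v = {}  # declared outside the loop, like v[] in the SwiftScript workaround
    ix = 0
    while True:
        v[ix] = step(v, ix)  # the body writes this iteration's element
        if done(v[ix]):      # the condition reads v[ix], not a body-local variable
            return v
        ix += 1
        if ix >= max_iterations:
            raise RuntimeError("no termination within max_iterations")


# Example: start at 1 and keep doubling until the value exceeds 100.
result = iterate_with_outer_array(
    step=lambda v, ix: 1 if ix == 0 else v[ix - 1] * 2,
    done=lambda value: value > 100,
)
print(result)
```

Each iteration can also read the previous element v[ix - 1], which is how state is carried forward when the loop body cannot declare variables visible to the condition.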



From bugzilla-daemon at mcs.anl.gov  Sat Feb 21 09:32:56 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Sat, 21 Feb 2009 09:32:56 -0600 (CST)
Subject: [Swift-devel] [Bug 169] submit-side timeouts (or other fault
	detection) to accommodate some byzantine site failures
In-Reply-To: <bug-169-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20090221153256.1631B164B3@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=169


------- Comment #3 from hategan at mcs.anl.gov  2009-02-21 09:32 -------
(In reply to comment #2)
> Mihael implemented this for job submission. It isn't implemented for file
> operations or transfers, and it likely doesn't behave well with coasters
> enabled, so I'm leaving this open as a to-do for those.
> 

I will likely not implement this for file ops/transfers, at least not for now.
That's because most of the implementations for those have their own timeouts.

Why would this not behave well with coasters?




From bugzilla-daemon at mcs.anl.gov  Sat Feb 21 15:18:23 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Sat, 21 Feb 2009 15:18:23 -0600 (CST)
Subject: [Swift-devel] [Bug 169] submit-side timeouts (or other fault
	detection) to accommodate some byzantine site failures
In-Reply-To: <bug-169-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20090221211823.6D538164B3@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=169


------- Comment #4 from benc at hawaga.org.uk  2009-02-21 15:18 -------
http://mail.ci.uchicago.edu/pipermail/swift-devel/2009-February/004382.html




From bugzilla-daemon at mcs.anl.gov  Sat Feb 21 15:18:46 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Sat, 21 Feb 2009 15:18:46 -0600 (CST)
Subject: [Swift-devel] [Bug 169] submit-side timeouts (or other fault
	detection) to accommodate some byzantine site failures
In-Reply-To: <bug-169-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20090221211846.F186E164B3@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=169


------- Comment #5 from benc at hawaga.org.uk  2009-02-21 15:18 -------
although I did mean clusters, not coasters...




From bugzilla-daemon at mcs.anl.gov  Sat Feb 21 15:31:04 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Sat, 21 Feb 2009 15:31:04 -0600 (CST)
Subject: [Swift-devel] [Bug 169] submit-side timeouts (or other fault
	detection) to accommodate some byzantine site failures
In-Reply-To: <bug-169-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20090221213104.5F644164B3@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=169


------- Comment #6 from hategan at mcs.anl.gov  2009-02-21 15:31 -------
(In reply to comment #5)
> although I did mean clusters, not coasters...
> 

That's what confused me. We're clear now.




From aespinosa at cs.uchicago.edu  Sun Feb 22 18:21:25 2009
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Sun, 22 Feb 2009 18:21:25 -0600
Subject: [Swift-devel] different host CN expectations in gram and gridftp
	server
Message-ID: <50b07b4b0902221621s52239835xf920e665e8cfce5f@mail.gmail.com>

Swift expects a different CN from the gridftp server and the gram server,
which causes the authentication problems below.  Using a gridftp url
also gives the same message, but in addition displays the authorization
error as a runtime exception.

Swift version: swift-r2580 cog-r2305

From a Ranger login host:

RunID: 20090222-1815-9ly285cb
        Progress:
Progress:  Initializing:25 Selecting site:6
Progress:  Selecting site:25 Stage in:4 Submitting:2
Progress:  Selecting site:25 Submitting:5 Submitted:1
Failed to transfer wrapper log from test-20090222-1815-9ly285cb/info/m
on RANGERFailed to transfer wrapper log from
test-20090222-1815-9ly285cb/info/i on RANGERFailed to transfer wrapper
log from test-20090222-1815-9ly285cb/info/l on RANGERFailed to
transfer wrapper log from test-20090222-1815-9ly285cb/info/k on
RANGERFailed to transfer wrapper log from
test-20090222-1815-9ly285cb/info/j on RANGERFailed to transfer wrapper
log from test-20090222-1815-9ly285cb/info/n on RANGER

logfile:
Could not start coaster service
Caused by: org.globus.cog.abstraction.impl.common.task.TaskSubmissionException:
Cannot submit job
Caused by: org.globus.gram.GramException: Data transfer to the server
failed [Caused by: Authentication failed [Caused by: Operation
unauthorized (Mechanism level: [JGLOBUS-56] Authorization failed.
Expected "/CN=host/129.114.50.163" target but received
"/C=US/O=UTAustin/OU=TACC/CN=login3.ranger.tacc.utexas.edu")]]

<config>
  <pool handle="RANGER">
    <profile namespace="karajan" key="initialScore">2</profile>
    <profile namespace="karajan" key="jobThrottle">1</profile>
    <profile namespace="globus" key="coastersPerNode">16</profile>
    <gridftp url="local://localhost"/>

    <execution provider="coaster" url="gatekeeper.ranger.tacc.teragrid.org" jobManager="gt2:gt2:SGE"/>
    <workdirectory>/work/01035/tg802895/swift-runs</workdirectory>
  </pool>
</config>


-Allan


From aespinosa at cs.uchicago.edu  Sun Feb 22 18:25:38 2009
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Sun, 22 Feb 2009 18:25:38 -0600
Subject: [Swift-devel] Re: different host CN expectations in gram and
	gridftp server
In-Reply-To: <50b07b4b0902221621s52239835xf920e665e8cfce5f@mail.gmail.com>
References: <50b07b4b0902221621s52239835xf920e665e8cfce5f@mail.gmail.com>
Message-ID: <50b07b4b0902221625g26cdad89h188d22358e1a16d9@mail.gmail.com>

I rsync'ed the logfile to ~benc/swift-logs: test-20090222-1815-9ly285cb

On Sun, Feb 22, 2009 at 6:21 PM, Allan Espinosa
<aespinosa at cs.uchicago.edu> wrote:
> Swift expects a different CN from the gridftp server and the gram server,
> which causes the authentication problems below.  Using a gridftp url
> also gives the same message, but in addition displays the authorization
> error as a runtime exception.
>


-- 
Allan M. Espinosa <http://allan.88-mph.net/blog>
PhD student, Computer Science
University of Chicago <http://people.cs.uchicago.edu/~aespinosa>


From hategan at mcs.anl.gov  Sun Feb 22 21:55:43 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sun, 22 Feb 2009 21:55:43 -0600
Subject: [Swift-devel] different host CN expectations in gram and
	gridftp server
In-Reply-To: <50b07b4b0902221621s52239835xf920e665e8cfce5f@mail.gmail.com>
References: <50b07b4b0902221621s52239835xf920e665e8cfce5f@mail.gmail.com>
Message-ID: <1235361343.1273.6.camel@localhost>

Can you try a globusrun from the same host to gatekeeper.ranger?

Mihael

On Sun, 2009-02-22 at 18:21 -0600, Allan Espinosa wrote:
> Swift expects a different CN from the gridftp server and the gram server,
> which causes the authentication problems below.  Using a gridftp url
> also gives the same message, but in addition displays the authorization
> error as a runtime exception.
> 
> Swift version: swift-r2580 cog-r2305
> 
> From a Ranger login host:
> 
> RunID: 20090222-1815-9ly285cb
>         Progress:
> Progress:  Initializing:25 Selecting site:6
> Progress:  Selecting site:25 Stage in:4 Submitting:2
> Progress:  Selecting site:25 Submitting:5 Submitted:1
> Failed to transfer wrapper log from test-20090222-1815-9ly285cb/info/m
> on RANGERFailed to transfer wrapper log from
> test-20090222-1815-9ly285cb/info/i on RANGERFailed to transfer wrapper
> log from test-20090222-1815-9ly285cb/info/l on RANGERFailed to
> transfer wrapper log from test-20090222-1815-9ly285cb/info/k on
> RANGERFailed to transfer wrapper log from
> test-20090222-1815-9ly285cb/info/j on RANGERFailed to transfer wrapper
> log from test-20090222-1815-9ly285cb/info/n on RANGER
> 
> logfile:
> Could not start coaster service
> Caused by: org.globus.cog.abstraction.impl.common.task.TaskSubmissionException:
> Cannot submit job
> Caused by: org.globus.gram.GramException: Data transfer to the server
> failed [Caused by: Authentication failed [Caused by: Operation
> unauthorized (Mechanism level: [JGLOBUS-56] Authorization failed.
> Expected "/CN=host/129.114.50.163" target but received
> "/C=US/O=UTAustin/OU=TACC/CN=login3.ranger.tacc.utexas.edu")]]
> 
> <config>
>   <pool handle="RANGER">
>     <profile namespace="karajan" key="initialScore">2</profile>
>     <profile namespace="karajan" key="jobThrottle">1</profile>
>     <profile namespace="globus" key="coastersPerNode">16</profile>
>     <gridftp url="local://localhost"/>
> 
>     <execution provider="coaster"
> url="gatekeeper.ranger.tacc.teragrid.org" jobManager="gt2:gt2:SGE"/>
>     <workdirectory>/work/01035/tg802895/swift-runs</workdirectory>
>   </pool>
> </config>
> 
> 
> -Allan


From bugzilla-daemon at mcs.anl.gov  Mon Feb 23 09:44:05 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Mon, 23 Feb 2009 09:44:05 -0600 (CST)
Subject: [Swift-devel] [Bug 177] variables declared inside an iterate body
	should be available to the termination condition
In-Reply-To: <bug-177-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20090223154405.E6F8D164B3@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=177


benc at hawaga.org.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED


------- Comment #1 from benc at hawaga.org.uk  2009-02-23 09:44 -------
this should be fixed in r2593




From bugzilla-daemon at mcs.anl.gov  Mon Feb 23 09:50:44 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Mon, 23 Feb 2009 09:50:44 -0600 (CST)
Subject: [Swift-devel] [Bug 61] semantics of [*] and multi-return-values
	need clarifying
In-Reply-To: <bug-61-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20090223155044.B2E82164B3@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=61


benc at hawaga.org.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED


------- Comment #2 from benc at hawaga.org.uk  2009-02-23 09:50 -------
This has mostly been done as of r2538.

However, the parser still appears to take .* for structure access, which needs
some more consideration and tidying.




From bugzilla-daemon at mcs.anl.gov  Mon Feb 23 09:53:33 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Mon, 23 Feb 2009 09:53:33 -0600 (CST)
Subject: [Swift-devel] [Bug 172] filesystem and gridftp element in the same
	pool.
In-Reply-To: <bug-172-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20090223155333.20F53164CF@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=172


------- Comment #1 from benc at hawaga.org.uk  2009-02-23 09:53 -------
This should be a general test for duplicates: not only for a gridftp and a
filesystem element specified in the same entry, but also for multiple
filesystem entries and other illegal combinations.

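A general duplicate test of the kind described in this comment could count the child elements of each pool and check them against a small rule table. The Python sketch below is hypothetical: which elements may appear only once, and which pairs are mutually exclusive, are assumptions for illustration, and the sample XML is illustrative rather than a verified sites file.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Assumed rules, for illustration only.
SINGLETON_ELEMENTS = {"gridftp", "filesystem", "workdirectory"}
EXCLUSIVE_PAIRS = [("gridftp", "filesystem")]


def find_pool_problems(config_xml):
    """Return a list of human-readable problems found in each <pool>."""
    problems = []
    root = ET.fromstring(config_xml)
    for pool in root.iter("pool"):
        handle = pool.get("handle", "?")
        counts = Counter(child.tag for child in pool)  # direct children only
        for tag in sorted(SINGLETON_ELEMENTS):
            if counts[tag] > 1:
                problems.append(f"{handle}: {counts[tag]} <{tag}> elements")
        for a, b in EXCLUSIVE_PAIRS:
            if counts[a] and counts[b]:
                problems.append(f"{handle}: both <{a}> and <{b}>")
    return problems


sample = """<config>
  <pool handle="A">
    <gridftp url="local://localhost"/>
    <filesystem provider="local"/>
  </pool>
  <pool handle="B">
    <filesystem provider="local"/>
    <filesystem provider="ssh"/>
  </pool>
</config>"""

for problem in find_pool_problems(sample):
    print(problem)
```

Repeated <profile> elements, as in the Ranger configuration earlier in this archive, are not flagged, because profile is not in the assumed singleton set.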



From aespinosa at cs.uchicago.edu  Mon Feb 23 10:48:10 2009
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Mon, 23 Feb 2009 10:48:10 -0600
Subject: [Swift-devel] different host CN expectations in gram and gridftp 
	server
In-Reply-To: <1235361343.1273.6.camel@localhost>
References: <50b07b4b0902221621s52239835xf920e665e8cfce5f@mail.gmail.com>
	<1235361343.1273.6.camel@localhost>
Message-ID: <50b07b4b0902230848v15e1394dh829fcb2bbf94a578@mail.gmail.com>

These were all run from a Ranger login node.
Here's the output:

login4$ globusrun -b -r gatekeeper.ranger.tacc.teragrid.org '&(executable=/bin/hostname)'
globus_gram_client_callback_allow successful
GRAM Job submission successful
https://gatekeeper.ranger.tacc.teragrid.org:50004/24184/1235407542/
GLOBUS_GRAM_PROTOCOL_JOB_STATE_ACTIVE
login4$


globus-job-run:
login4$ globus-job-run gatekeeper.ranger.tacc.teragrid.org /bin/hostname
login3.ranger.tacc.utexas.edu


-Allan


On Sun, Feb 22, 2009 at 9:55 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> Can you try a globusrun from the same host to gatekeeper.ranger?
>
> Mihael
>
> On Sun, 2009-02-22 at 18:21 -0600, Allan Espinosa wrote:
>> Swift expects and different CN from the gridftp server and gram server
>> and creates the authentication problems below.  Doing a gridftp url
>> also gives the same message but in addition displays the authorization
>> error as a runtime exception
>>
>> Swift version: swift-r2580 cog-r2305
>>
>> From a ranger login host:
>>
>> RunID: 20090222-1815-9ly285cb
>>         Progress:
>> Progress:  Initializing:25 Selecting site:6
>> Progress:  Selecting site:25 Stage in:4 Submitting:2
>> Progress:  Selecting site:25 Submitting:5 Submitted:1
>> Failed to transfer wrapper log from test-20090222-1815-9ly285cb/info/m
>> on RANGERFailed to transfer wrapper log from
>> test-20090222-1815-9ly285cb/info/i on RANGERFailed to transfer wrapper
>> log from test-20090222-1815-9ly285cb/info/l on RANGERFailed to
>> transfer wrapper log from test-20090222-1815-9ly285cb/info/k on
>> RANGERFailed to transfer wrapper log from
>> test-20090222-1815-9ly285cb/info/j on RANGERFailed to transfer wrapper
>> log from test-20090222-1815-9ly285cb/info/n on RANGER
>>


From hategan at mcs.anl.gov  Mon Feb 23 10:54:32 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 23 Feb 2009 10:54:32 -0600
Subject: [Swift-devel] different host CN expectations in gram and
	gridftp  server
In-Reply-To: <50b07b4b0902230848v15e1394dh829fcb2bbf94a578@mail.gmail.com>
References: <50b07b4b0902221621s52239835xf920e665e8cfce5f@mail.gmail.com>
	<1235361343.1273.6.camel@localhost>
	<50b07b4b0902230848v15e1394dh829fcb2bbf94a578@mail.gmail.com>
Message-ID: <1235408072.10242.0.camel@localhost>

Can you now run the same from login3 rather than login4?

On Mon, 2009-02-23 at 10:48 -0600, Allan Espinosa wrote:
> These are all run from a Ranger login node
> Here's the output:
> 
> login4$ globusrun -b -r gatekeeper.ranger.tacc.teragrid.org
> '&(executable=/bin/hostname)'
> globus_gram_client_callback_allow successful
> GRAM Job submission successful
> https://gatekeeper.ranger.tacc.teragrid.org:50004/24184/1235407542/
> GLOBUS_GRAM_PROTOCOL_JOB_STATE_ACTIVE
> login4$
> 
> 
> globus-job-run:
> login4$ globus-job-run gatekeeper.ranger.tacc.teragrid.org /bin/hostname
> login3.ranger.tacc.utexas.edu
> 
> 
> -Allan
> 
> 
> On Sun, Feb 22, 2009 at 9:55 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> > Can you try a globusrun from the same host to gatekeeper.ranger?
> >
> > Mihael
> >
> > On Sun, 2009-02-22 at 18:21 -0600, Allan Espinosa wrote:
> >> Swift expects and different CN from the gridftp server and gram server
> >> and creates the authentication problems below.  Doing a gridftp url
> >> also gives the same message but in addition displays the authorization
> >> error as a runtime exception
> >>
> >> Swift version: swift-r2580 cog-r2305
> >>
> >> From a ranger login host:
> >>
> >> RunID: 20090222-1815-9ly285cb
> >>         Progress:
> >> Progress:  Initializing:25 Selecting site:6
> >> Progress:  Selecting site:25 Stage in:4 Submitting:2
> >> Progress:  Selecting site:25 Submitting:5 Submitted:1
> >> Failed to transfer wrapper log from test-20090222-1815-9ly285cb/info/m
> >> on RANGERFailed to transfer wrapper log from
> >> test-20090222-1815-9ly285cb/info/i on RANGERFailed to transfer wrapper
> >> log from test-20090222-1815-9ly285cb/info/l on RANGERFailed to
> >> transfer wrapper log from test-20090222-1815-9ly285cb/info/k on
> >> RANGERFailed to transfer wrapper log from
> >> test-20090222-1815-9ly285cb/info/j on RANGERFailed to transfer wrapper
> >> log from test-20090222-1815-9ly285cb/info/n on RANGER
> >>


From aespinosa at cs.uchicago.edu  Mon Feb 23 10:56:32 2009
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Mon, 23 Feb 2009 10:56:32 -0600
Subject: [Swift-devel] different host CN expectations in gram and gridftp 
	server
In-Reply-To: <1235408072.10242.0.camel@localhost>
References: <50b07b4b0902221621s52239835xf920e665e8cfce5f@mail.gmail.com>
	<1235361343.1273.6.camel@localhost>
	<50b07b4b0902230848v15e1394dh829fcb2bbf94a578@mail.gmail.com>
	<1235408072.10242.0.camel@localhost>
Message-ID: <50b07b4b0902230856g18e11118v5f5a27d2d5eb7afc@mail.gmail.com>

Below is the output. But I did my Swift submit run from login4 too.

-Allan

login3$ globusrun -b -r gatekeeper.ranger.tacc.teragrid.org
'&(executable=/bin/hostname)'
globus_gram_client_callback_allow successful
GRAM Job submission successful
https://gatekeeper.ranger.tacc.teragrid.org:50004/7306/1235408110/
GLOBUS_GRAM_PROTOCOL_JOB_STATE_ACTIVE
login3$ globus-job-run gatekeeper.ranger.tacc.teragrid.org /bin/hostname
login3.ranger.tacc.utexas.edu
login3$


On Mon, Feb 23, 2009 at 10:54 AM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> Can you now run the same from login3 rather than login4?
>
> On Mon, 2009-02-23 at 10:48 -0600, Allan Espinosa wrote:
>> These are all run from a Ranger login node
>> Here's the output:
>>
>> login4$ globusrun -b -r gatekeeper.ranger.tacc.teragrid.org
>> '&(executable=/bin/hostname)'
>> globus_gram_client_callback_allow successful
>> GRAM Job submission successful
>> https://gatekeeper.ranger.tacc.teragrid.org:50004/24184/1235407542/
>> GLOBUS_GRAM_PROTOCOL_JOB_STATE_ACTIVE
>> login4$
>>
>>
>> globus-job-run:
>> login4$ globus-job-run gatekeeper.ranger.tacc.teragrid.org /bin/hostname
>> login3.ranger.tacc.utexas.edu
>>
>>
>> -Allan
>>
>>
>> On Sun, Feb 22, 2009 at 9:55 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
>> > Can you try a globusrun from the same host to gatekeeper.ranger?
>> >
>> > Mihael
>> >
>> > On Sun, 2009-02-22 at 18:21 -0600, Allan Espinosa wrote:
>> >> Swift expects and different CN from the gridftp server and gram server
>> >> and creates the authentication problems below.  Doing a gridftp url
>> >> also gives the same message but in addition displays the authorization
>> >> error as a runtime exception
>> >>
>> >> Swift version: swift-r2580 cog-r2305
>> >>
>> >> From a ranger login host:
>> >>
>> >> RunID: 20090222-1815-9ly285cb
>> >>         Progress:
>> >> Progress:  Initializing:25 Selecting site:6
>> >> Progress:  Selecting site:25 Stage in:4 Submitting:2
>> >> Progress:  Selecting site:25 Submitting:5 Submitted:1
>> >> Failed to transfer wrapper log from test-20090222-1815-9ly285cb/info/m
>> >> on RANGERFailed to transfer wrapper log from
>> >> test-20090222-1815-9ly285cb/info/i on RANGERFailed to transfer wrapper
>> >> log from test-20090222-1815-9ly285cb/info/l on RANGERFailed to
>> >> transfer wrapper log from test-20090222-1815-9ly285cb/info/k on
>> >> RANGERFailed to transfer wrapper log from
>> >> test-20090222-1815-9ly285cb/info/j on RANGERFailed to transfer wrapper
>> >> log from test-20090222-1815-9ly285cb/info/n on RANGER
>> >>
>
>
>


-- 
Allan M. Espinosa <http://allan.88-mph.net/blog>
PhD student, Computer Science
University of Chicago <http://people.cs.uchicago.edu/~aespinosa>


From benc at hawaga.org.uk  Mon Feb 23 10:58:43 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 23 Feb 2009 16:58:43 +0000 (GMT)
Subject: [Swift-devel] different host CN expectations in gram and gridftp
	server
In-Reply-To: <50b07b4b0902230856g18e11118v5f5a27d2d5eb7afc@mail.gmail.com>
References: <50b07b4b0902221621s52239835xf920e665e8cfce5f@mail.gmail.com>
	<1235361343.1273.6.camel@localhost>
	<50b07b4b0902230848v15e1394dh829fcb2bbf94a578@mail.gmail.com>
	<1235408072.10242.0.camel@localhost>
	<50b07b4b0902230856g18e11118v5f5a27d2d5eb7afc@mail.gmail.com>
Message-ID: <Pine.LNX.4.64.0902231658180.1293@dildano.hawaga.org.uk>


If you use gram2 instead of coasters+gram2, what happens?

-- 


From bugzilla-daemon at mcs.anl.gov  Tue Feb 24 06:54:26 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Tue, 24 Feb 2009 06:54:26 -0600 (CST)
Subject: [Swift-devel] [Bug 176] config option to invoke wrapper script with
	relative path
In-Reply-To: <bug-176-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20090224125426.A77E0164CE@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=176


benc at hawaga.org.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED


------- Comment #1 from benc at hawaga.org.uk  2009-02-24 06:54 -------
r2597 implements this, wrapper.invocation.mode, documented in the user guide.




From benc at hawaga.org.uk  Tue Feb 24 06:56:27 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 24 Feb 2009 12:56:27 +0000 (GMT)
Subject: [Swift-devel] absolute path
In-Reply-To: <Pine.LNX.4.64.0902211509160.1293@dildano.hawaga.org.uk>
References: <499F3351.2060000@uchicago.edu>
	<Pine.LNX.4.64.0902202328590.1293@dildano.hawaga.org.uk>
	<499F3D7F.5010708@uchicago.edu>
	<Pine.LNX.4.64.0902202334260.23512@dildano.hawaga.org.uk>
	<499F445D.7070205@uchicago.edu>
	<Pine.LNX.4.64.0902211407040.23512@dildano.hawaga.org.uk>
	<Pine.LNX.4.64.0902211509160.1293@dildano.hawaga.org.uk>
Message-ID: <Pine.LNX.4.64.0902241255430.23512@dildano.hawaga.org.uk>


> > ok, I'll make a config option to allow you to choose whether wrapper.sh is 
> > invoked with an absolute path in the command line or not.
> 
> bug 176 if you want to keep an eye on this.
> 
> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=176

r2597 provides such an option:  wrapper.invocation.mode

It is documented in the user guide and in swift.properties.
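To make the option concrete, here is a hypothetical Python sketch of what such a switch changes. It controls whether the wrapper is named by an absolute path built from the job's working directory, or by a path relative to that directory. The mode names "absolute" and "relative", the helper name, and the shared/wrapper.sh layout are all assumptions. The user guide and swift.properties are the authority on the real values.

```python
import posixpath


def wrapper_path(mode, job_dir, wrapper_rel="shared/wrapper.sh"):
    """Return the path used to invoke the wrapper script on the remote site.

    mode    -- assumed to be "absolute" or "relative"
    job_dir -- the job's initial working directory on the remote site
    """
    if mode == "absolute":
        # Absolute path: works regardless of the job's current directory.
        return posixpath.join(job_dir, wrapper_rel)
    if mode == "relative":
        # Relative path: resolved against the job's initial working directory.
        return wrapper_rel
    raise ValueError(f"unknown wrapper.invocation.mode: {mode!r}")
```

The relative form is useful where the remote side sees the working directory under a different absolute path than the one Swift computed.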

Please let me know if this does what you want.

-- 


From bugzilla-daemon at mcs.anl.gov  Tue Feb 24 07:30:28 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Tue, 24 Feb 2009 07:30:28 -0600 (CST)
Subject: [Swift-devel] [Bug 123] Array mappers should accept
	programmatically-built string[]s
In-Reply-To: <bug-123-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20090224133028.CA887164CF@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=123


benc at hawaga.org.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED


------- Comment #1 from benc at hawaga.org.uk  2009-02-24 07:30 -------
This works now - it should have been fixed somewhere around Swift 0.8.




From wilde at mcs.anl.gov  Tue Feb 24 15:07:24 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 24 Feb 2009 15:07:24 -0600
Subject: [Swift-devel] Problem using @extractint() on derived file
Message-ID: <49A4618C.6050108@mcs.anl.gov>

This script:

---

type file;

app (file out) echo (string s) { echo s stdout=@out; }

file f<"a">;
int i;

f = echo("123");
i = @extractint(@f);
trace("i=", i);

---

produces:

Swift svn swift-r2552 cog-r2303 

 
RunID: 20090224-1455-k1vj4uy7 

Progress: 

Execution failed: 

         Reading integer content of file 

Caused by: 

         a (No such file or directory) 


---

I seem to get the same behavior from readData, and the same if I 
explicitly specify "a" as the argument to @extractint();

Is this because @extractint() is not waiting for "f" to get produced?

I've extracted this example while debugging a script that uses an app to
test the termination condition of an iterate loop.

- Mike


From hategan at mcs.anl.gov  Tue Feb 24 15:12:32 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 24 Feb 2009 15:12:32 -0600
Subject: [Swift-devel] Problem using @extractint() on derived file
In-Reply-To: <49A4618C.6050108@mcs.anl.gov>
References: <49A4618C.6050108@mcs.anl.gov>
Message-ID: <1235509952.7505.2.camel@localhost>

Where is that file with respect to:
- the script
- the place you are running this from

On Tue, 2009-02-24 at 15:07 -0600, Michael Wilde wrote:
> This script:
> 
> ---
> 
> type file;
> 
> app (file out) echo (string s) { echo s stdout=@out; }
> 
> file f<"a">;
> int i;
> 
> f = echo("123");
> i = @extractint(@f);
> trace("i=", i);
> 
> ---
> 
> produces:
> 
> Swift svn swift-r2552 cog-r2303 
> 
>  
> 
> RunID: 20090224-1455-k1vj4uy7 
> 
> Progress: 
> 
> Execution failed: 
> 
>          Reading integer content of file 
> 
> Caused by: 
> 
>          a (No such file or directory) 
> 
> 
> 
> ---
> 
> I seem to get the same behavior from readData, and the same if I 
> explicitly specify "a" as the argument to @extractint();
> 
> Is this because @extractint() is not waiting for "f" to get produced?
> 
> Ive extracted this example while debugging a script that uses an app to 
> test the termination condition of an iterate loop.
> 
> - Mike
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel


From aespinosa at cs.uchicago.edu  Tue Feb 24 15:14:23 2009
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Tue, 24 Feb 2009 15:14:23 -0600
Subject: [Swift-devel] different host CN expectations in gram and gridftp 
	server
In-Reply-To: <Pine.LNX.4.64.0902231658180.1293@dildano.hawaga.org.uk>
References: <50b07b4b0902221621s52239835xf920e665e8cfce5f@mail.gmail.com>
	<1235361343.1273.6.camel@localhost>
	<50b07b4b0902230848v15e1394dh829fcb2bbf94a578@mail.gmail.com>
	<1235408072.10242.0.camel@localhost>
	<50b07b4b0902230856g18e11118v5f5a27d2d5eb7afc@mail.gmail.com>
	<Pine.LNX.4.64.0902231658180.1293@dildano.hawaga.org.uk>
Message-ID: <50b07b4b0902241314t7ea23b28g832c70e26877c5f6@mail.gmail.com>

I still get the same gram authentication error message:

Caused by: org.globus.cog.abstraction.impl.common.task.TaskSubmissionException:
Cannot submit job
Caused by: org.globus.gram.GramException: Data transfer to the server
failed [Caused by: Authentication failed [Caused by: Operation
unauthorized (Mechanism level: [JGLOBUS-56] Authorization failed.
Expected "/CN=host/129.114.50.163" target but received
"/C=US/O=UTAustin/OU=TACC/CN=login3.ranger.tacc.utexas.edu")]]
2009-02-24 15:12:07,215-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION
jobid=hostname-8tx7p37j - Application exception: Cannot submit job

This happens with both the fork and SGE job managers, using gram2 only (no coasters).
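The mismatch in this trace can be reproduced with a simplified model of the identity comparison. The sketch below is not JGlobus's authorization code. It only does three things:

1. Extracts the last CN from each DN in the log.
2. Strips a "host/" prefix.
3. Compares what is left.

In this model, an expected identity built from the IP address 129.114.50.163 can never equal a certificate CN of login3.ranger.tacc.utexas.edu. The two would only match if the address were first turned into that host name.

```python
import re

# An RDN starts at "/KEY=" and runs until the next "/KEY=", so a value
# such as "host/129.114.50.163" keeps its inner slash.
_RDN = re.compile(r"/([A-Za-z]+)=((?:[^/]|/(?![A-Za-z]+=))*)")


def last_cn(dn):
    """Return the last CN value in a slash-separated DN."""
    cns = [value for key, value in _RDN.findall(dn) if key == "CN"]
    if not cns:
        raise ValueError(f"no CN in {dn!r}")
    return cns[-1]


def host_part(cn):
    """Strip the "host/" service prefix used in host certificates."""
    return cn[len("host/"):] if cn.startswith("host/") else cn


def identities_match(expected_dn, received_dn):
    """Compare the host parts of the two DNs' last CNs."""
    return host_part(last_cn(expected_dn)) == host_part(last_cn(received_dn))


# The two identities from the error message above.
expected = "/CN=host/129.114.50.163"
received = "/C=US/O=UTAustin/OU=TACC/CN=login3.ranger.tacc.utexas.edu"
```

Here identities_match(expected, received) is False. An expectation of "/CN=host/login3.ranger.tacc.utexas.edu" would match.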

-Allan


On Mon, Feb 23, 2009 at 10:58 AM, Ben Clifford <benc at hawaga.org.uk> wrote:
>
> If you use gram2 instead of coasters+gram2, what happens?
>


From hategan at mcs.anl.gov  Tue Feb 24 15:17:59 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 24 Feb 2009 15:17:59 -0600
Subject: [Swift-devel] different host CN expectations in gram and
	gridftp  server
In-Reply-To: <50b07b4b0902241314t7ea23b28g832c70e26877c5f6@mail.gmail.com>
References: <50b07b4b0902221621s52239835xf920e665e8cfce5f@mail.gmail.com>
	<1235361343.1273.6.camel@localhost>
	<50b07b4b0902230848v15e1394dh829fcb2bbf94a578@mail.gmail.com>
	<1235408072.10242.0.camel@localhost>
	<50b07b4b0902230856g18e11118v5f5a27d2d5eb7afc@mail.gmail.com>
	<Pine.LNX.4.64.0902231658180.1293@dildano.hawaga.org.uk>
	<50b07b4b0902241314t7ea23b28g832c70e26877c5f6@mail.gmail.com>
Message-ID: <1235510279.7676.0.camel@localhost>

Ok. I'll look into this.

On Tue, 2009-02-24 at 15:14 -0600, Allan Espinosa wrote:
> I still get the same gram authentication error message:
> 
> Caused by: org.globus.cog.abstraction.impl.common.task.TaskSubmissionException:
> Cannot submit job
> Caused by: org.globus.gram.GramException: Data transfer to the server
> failed [Caused by: Authentication failed [Caused by: Operation
> unauthorized (Mechanism level: [JGLOBUS-56] Authorization failed.
> Expected "/CN=host/129.114.50.163" target but received
> "/C=US/O=UTAustin/OU=TACC/CN=login3.ranger.tacc.utexas.edu")]]
> 2009-02-24 15:12:07,215-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION
> jobid=hostname-8tx7p37j - Application exception: Cannot submit job
> 
> This is using both the fork and sge job manager via gram2-only
> 
> -aallan
> 
> 
> On Mon, Feb 23, 2009 at 10:58 AM, Ben Clifford <benc at hawaga.org.uk> wrote:
> >
> > If you use gram2 instead of coasters+gram2, what happens?
> >


From aespinosa at cs.uchicago.edu  Tue Feb 24 15:19:30 2009
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Tue, 24 Feb 2009 15:19:30 -0600
Subject: [Swift-devel] different host CN expectations in gram and gridftp 
	server
In-Reply-To: <1235510279.7676.0.camel@localhost>
References: <50b07b4b0902221621s52239835xf920e665e8cfce5f@mail.gmail.com>
	<1235361343.1273.6.camel@localhost>
	<50b07b4b0902230848v15e1394dh829fcb2bbf94a578@mail.gmail.com>
	<1235408072.10242.0.camel@localhost>
	<50b07b4b0902230856g18e11118v5f5a27d2d5eb7afc@mail.gmail.com>
	<Pine.LNX.4.64.0902231658180.1293@dildano.hawaga.org.uk>
	<50b07b4b0902241314t7ea23b28g832c70e26877c5f6@mail.gmail.com>
	<1235510279.7676.0.camel@localhost>
Message-ID: <50b07b4b0902241319w5d1ffeb9ua65918428fcae9f7@mail.gmail.com>

Thanks Mihael!

On Tue, Feb 24, 2009 at 3:17 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> Ok. I'll look into this.
>
> On Tue, 2009-02-24 at 15:14 -0600, Allan Espinosa wrote:
>> I still get the same gram authentication error message:
>>
>> Caused by: org.globus.cog.abstraction.impl.common.task.TaskSubmissionException:
>> Cannot submit job
>> Caused by: org.globus.gram.GramException: Data transfer to the server
>> failed [Caused by: Authentication failed [Caused by: Operation
>> unauthorized (Mechanism level: [JGLOBUS-56] Authorization failed.
>> Expected "/CN=host/129.114.50.163" target but received
>> "/C=US/O=UTAustin/OU=TACC/CN=login3.ranger.tacc.utexas.edu")]]
>> 2009-02-24 15:12:07,215-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION
>> jobid=hostname-8tx7p37j - Application exception: Cannot submit job


From wilde at mcs.anl.gov  Tue Feb 24 15:24:21 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 24 Feb 2009 15:24:21 -0600
Subject: [Swift-devel] Problem using @extractint() on derived file
In-Reply-To: <1235509952.7505.2.camel@localhost>
References: <49A4618C.6050108@mcs.anl.gov> <1235509952.7505.2.camel@localhost>
Message-ID: <49A46585.6090003@mcs.anl.gov>

The file doesn't exist beforehand; it's created by the script, so I would
expect it to be placed back in the directory I ran Swift from.

(But I've been testing further in my original scripts and am seeing
confusing results, so I need to sort them out and isolate the problem.)

I do this "y=f(x); rc=extractint(y)" pattern in a loop, thus I needed to 
make the file name unique (else I get the "file already in cache" 
error). So I switched to an anonymous file, and then started getting the 
"no such file" error.

What's confusing is that I may have seen a case similar to the one below
that did work, so I need to do more testing to isolate when it works and
when it fails.

Can you duplicate the failure with the simple script below?

- Mike


On 2/24/09 3:12 PM, Mihael Hategan wrote:
> Where is that file with respect to:
> - the script
> - the place you are running this from
> 
> On Tue, 2009-02-24 at 15:07 -0600, Michael Wilde wrote:
>> This script:
>>
>> ---
>>
>> type file;
>>
>> app (file out) echo (string s) { echo s stdout=@out; }
>>
>> file f<"a">;
>> int i;
>>
>> f = echo("123");
>> i = @extractint(@f);
>> trace("i=", i);
>>
>> ---
>>
>> produces:
>>
>> Swift svn swift-r2552 cog-r2303 
>>
>>  
>>
>> RunID: 20090224-1455-k1vj4uy7 
>>
>> Progress: 
>>
>> Execution failed: 
>>
>>          Reading integer content of file 
>>
>> Caused by: 
>>
>>          a (No such file or directory) 
>>
>>
>>
>> ---
>>
>> I seem to get the same behavior from readData, and the same if I 
>> explicitly specify "a" as the argument to @extractint();
>>
>> Is this because @extractint() is not waiting for "f" to get produced?
>>
>> Ive extracted this example while debugging a script that uses an app to 
>> test the termination condition of an iterate loop.
>>
>> - Mike
> 


From hategan at mcs.anl.gov  Tue Feb 24 15:33:28 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 24 Feb 2009 15:33:28 -0600
Subject: [Swift-devel] Problem using @extractint() on derived file
In-Reply-To: <49A46585.6090003@mcs.anl.gov>
References: <49A4618C.6050108@mcs.anl.gov>
	<1235509952.7505.2.camel@localhost>  <49A46585.6090003@mcs.anl.gov>
Message-ID: <1235511208.7899.3.camel@localhost>

Sorry. Didn't look properly.

Yes, this happens because @f can be resolved before f can be, so swift
will happily do the extractint before echo finishes.

I don't have a solution yet.
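The ordering problem described here can be modeled with Python futures standing in for Swift's dataflow variables. The mapped name "a" is known as soon as the script starts, but the file's contents exist only once the echo job finishes.

In this hypothetical sketch, reading through the name alone races the producer and fails the same way as the trace above. The threading.Event is only there to make the losing order deterministic for the demonstration.

```python
import os
import tempfile
import threading
from concurrent.futures import ThreadPoolExecutor

release = threading.Event()  # holds the "job" back so the race is repeatable


def echo_job(path, text):
    """Stand-in for the echo app: writes text to path once released."""
    release.wait()
    with open(path, "w") as fh:
        fh.write(text + "\n")
    return path


def extractint_by_name(path):
    """Like @extractint(@f): it has only the file name, so it cannot wait."""
    with open(path) as fh:
        return int(fh.read().strip())


workdir = tempfile.mkdtemp()
name_a = os.path.join(workdir, "a")  # the mapping <"a">, known up front

with ThreadPoolExecutor(max_workers=1) as pool:
    f = pool.submit(echo_job, name_a, "123")  # f = echo("123")
    try:
        early = extractint_by_name(name_a)  # runs before echo has finished
    except FileNotFoundError:
        early = None  # the "a (No such file or directory)" failure
    release.set()  # let the job complete
    late = extractint_by_name(f.result())  # waiting on f first gives 123
```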

On Tue, 2009-02-24 at 15:24 -0600, Michael Wilde wrote:
> The file doesnt exist - its created in the script, and hence I would 
> expect it to be placed back in the dir that I ran swift from.
> 
> (But Ive been testing further, in my original scripts, and am seeing 
> confusing results, so I need to sort out and isolate.
> 
> I do this "y=f(x); rc=extractint(y)" pattern in a loop, thus I needed to 
> make the file name unique (else I get the "file already in cache" 
> error). So I switched to an anonymous file, and then started getting the 
> "no such file" error.
> 
> Whats confusing is I may have seen a csae similar to whats below that 
> did work, so I need to do more testing to isolate when it works and when 
> it fails.
> 
> Can you duplicate the failure with the simple script below?
> 
> - Mike
> 
> 
> On 2/24/09 3:12 PM, Mihael Hategan wrote:
> > Where is that file with respect to:
> > - the script
> > - the place you are running this from
> > 
> > On Tue, 2009-02-24 at 15:07 -0600, Michael Wilde wrote:
> >> This script:
> >>
> >> ---
> >>
> >> type file;
> >>
> >> app (file out) echo (string s) { echo s stdout=@out; }
> >>
> >> file f<"a">;
> >> int i;
> >>
> >> f = echo("123");
> >> i = @extractint(@f);
> >> trace("i=", i);
> >>
> >> ---
> >>
> >> produces:
> >>
> >> Swift svn swift-r2552 cog-r2303 
> >>
> >>  
> >>
> >> RunID: 20090224-1455-k1vj4uy7 
> >>
> >> Progress: 
> >>
> >> Execution failed: 
> >>
> >>          Reading integer content of file 
> >>
> >> Caused by: 
> >>
> >>          a (No such file or directory) 
> >>
> >>
> >>
> >> ---
> >>
> >> I seem to get the same behavior from readData, and the same if I 
> >> explicitly specify "a" as the argument to @extractint();
> >>
> >> Is this because @extractint() is not waiting for "f" to get produced?
> >>
> >> Ive extracted this example while debugging a script that uses an app to 
> >> test the termination condition of an iterate loop.
> >>
> >> - Mike
> > 


From wilde at mcs.anl.gov  Tue Feb 24 16:21:47 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 24 Feb 2009 16:21:47 -0600
Subject: [Swift-devel] Iterate example broken - semantics changed?
Message-ID: <49A472FB.2040000@mcs.anl.gov>

The iterate example in the Swift tutorial no longer works.
It's at:
http://www.ci.uchicago.edu/swift/guides/tutorial.php#tutorial.iterate

The problem seems to be the same as in the example below: Swift won't
let you set the members of an array both in the declaring scope and in
an inner nested scope.

This example:

---

int a[];

a[0] = 0;

iterate v {
   a[v+1] = v+1;
   trace("v=",v); 

} until ();

---

gives:

Could not start execution.
         variable a has multiple writers.
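Following Mike's guess, one way to read this error is as a check that rejects a single-assignment variable written from more than one scope. The Python sketch below models only that hypothesis. The scope labels and the check itself are illustrative, not the Swift compiler's actual rule.

```python
from collections import defaultdict


def multiple_writer_errors(assignments):
    """assignments: (variable, scope) pairs, one per assignment statement.

    Hypothetical rule: a variable may be written from one scope only.
    """
    writer_scopes = defaultdict(set)
    for variable, scope in assignments:
        writer_scopes[variable].add(scope)
    return [
        f"variable {variable} has multiple writers"
        for variable, scopes in sorted(writer_scopes.items())
        if len(scopes) > 1
    ]


# a[0] = 0 at top level, a[v+1] = v+1 inside the iterate body.
example = [("a", "top-level"), ("a", "iterate body")]
```

Under this model, the script above is rejected. Moving every write to a into the iterate body would pass.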


From wilde at mcs.anl.gov  Tue Feb 24 16:29:19 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 24 Feb 2009 16:29:19 -0600
Subject: [Swift-devel] Problem using @extractint() on derived file
In-Reply-To: <1235511208.7899.3.camel@localhost>
References: <49A4618C.6050108@mcs.anl.gov>	 <1235509952.7505.2.camel@localhost>
	<49A46585.6090003@mcs.anl.gov> <1235511208.7899.3.camel@localhost>
Message-ID: <49A474BF.8030708@mcs.anl.gov>

Sorry, I think I see the problem. @extractint() wants a file-mapped
object (i.e., a marker type), not a filename string. Now (I think) it
works.

Are readData and readData2 the same?

On 2/24/09 3:33 PM, Mihael Hategan wrote:
> Sorry. Didn't look properly.
> 
> Yes, this happens because @f can be resolved before f can be, so swift
> will happily do the extractint before echo finishes.
> 
> I don't have a solution yet.
> 
> On Tue, 2009-02-24 at 15:24 -0600, Michael Wilde wrote:
>> The file doesnt exist - its created in the script, and hence I would 
>> expect it to be placed back in the dir that I ran swift from.
>>
>> (But Ive been testing further, in my original scripts, and am seeing 
>> confusing results, so I need to sort out and isolate.
>>
>> I do this "y=f(x); rc=extractint(y)" pattern in a loop, thus I needed to 
>> make the file name unique (else I get the "file already in cache" 
>> error). So I switched to an anonymous file, and then started getting the 
>> "no such file" error.
>>
>> Whats confusing is I may have seen a csae similar to whats below that 
>> did work, so I need to do more testing to isolate when it works and when 
>> it fails.
>>
>> Can you duplicate the failure with the simple script below?
>>
>> - Mike
>>
>>
>> On 2/24/09 3:12 PM, Mihael Hategan wrote:
>>> Where is that file with respect to:
>>> - the script
>>> - the place you are running this from
>>>
>>> On Tue, 2009-02-24 at 15:07 -0600, Michael Wilde wrote:
>>>> This script:
>>>>
>>>> ---
>>>>
>>>> type file;
>>>>
>>>> app (file out) echo (string s) { echo s stdout=@out; }
>>>>
>>>> file f<"a">;
>>>> int i;
>>>>
>>>> f = echo("123");
>>>> i = @extractint(@f);
>>>> trace("i=", i);
>>>>
>>>> ---
>>>>
>>>> produces:
>>>>
>>>> Swift svn swift-r2552 cog-r2303 
>>>>
>>>>  
>>>>
>>>> RunID: 20090224-1455-k1vj4uy7 
>>>>
>>>> Progress: 
>>>>
>>>> Execution failed: 
>>>>
>>>>          Reading integer content of file 
>>>>
>>>> Caused by: 
>>>>
>>>>          a (No such file or directory) 
>>>>
>>>>
>>>>
>>>> ---
>>>>
>>>> I seem to get the same behavior from readData, and the same if I 
>>>> explicitly specify "a" as the argument to @extractint();
>>>>
>>>> Is this because @extractint() is not waiting for "f" to get produced?
>>>>
>>>> Ive extracted this example while debugging a script that uses an app to 
>>>> test the termination condition of an iterate loop.
>>>>
>>>> - Mike
> 


From wilde at mcs.anl.gov  Tue Feb 24 18:32:59 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 24 Feb 2009 18:32:59 -0600
Subject: [Swift-devel] truncated name in typecheck error message
Message-ID: <49A491BB.4060007@mcs.anl.gov>

This script:

int out[]; 

out[0][1]=123; 

 
produces:

Could not start execution. 

         Compile error in assigment at line 4: You cannot assign value 
of type int to a variable of type i

--

The type name is truncated at the end of the message; e.g., I think "file"
prints as "fi".
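Both reported outputs are exactly two characters short: "int" becomes "i" and "file" becomes "fi". As a guess, that would fit an element-type computation that drops a trailing "[]" without checking that it is there. Since out is int[], out[0] is int, and indexing again as out[0][1] would chop "int" down to "i".

A hypothetical Python model of that guess, alongside a checked version:

```python
def index_type_unchecked(type_name):
    """Guessed behavior: the type of x[i] is x's type minus two characters."""
    return type_name[:-2]


def index_type(type_name):
    """Checked version: only array types ending in "[]" may be indexed."""
    if not type_name.endswith("[]"):
        raise TypeError(f"cannot index a value of type {type_name}")
    return type_name[:-2]


# out is int[]; out[0][1] indexes it twice.
unchecked = index_type_unchecked(index_type_unchecked("int[]"))
```

With the checked version, the second index fails with a clear message instead of producing a mangled type name.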


From aespinosa at cs.uchicago.edu  Tue Feb 24 18:52:29 2009
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Tue, 24 Feb 2009 18:52:29 -0600
Subject: [Swift-devel] throttling parameters with coasters
Message-ID: <50b07b4b0902241652l2552a38eh8954155112b70d09@mail.gmail.com>

In coasters, are the throttle.submit, throttle.host.submit, and
throttle.job.factor parameters ignored?

Looking at how Swift submits initial requests, it seems that
throttle.job.factor affects the number of coaster nodes it will submit
to the LRM. If I have
  <profile namespace="karajan" key="initialScore">1</profile>
  <profile namespace="karajan" key="jobThrottle">1</profile>
Swift spawns 4 coasters (4*16=64 processors available to me), and I
observe that this number did not increase throughout the job.

Next, in my swift.properties, I have
throttle.submit=4
throttle.host.submit=2

But in the runtime,

rogress:  Selecting site:2809 Submitting:17 Active:40 Stage out:31
Finished successfully:103
Progress:  Selecting site:2809 Submitting:17 Active:40 Stage out:30 Finished su

So do these two parameters not apply to coaster submissions?


From hategan at mcs.anl.gov  Tue Feb 24 19:15:50 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 24 Feb 2009 19:15:50 -0600
Subject: [Swift-devel] throttling parameters with coasters
In-Reply-To: <50b07b4b0902241652l2552a38eh8954155112b70d09@mail.gmail.com>
References: <50b07b4b0902241652l2552a38eh8954155112b70d09@mail.gmail.com>
Message-ID: <1235524550.11984.4.camel@localhost>

On Tue, 2009-02-24 at 18:52 -0600, Allan Espinosa wrote:
> In coasters, are throtte.submit, throttle.host.submit,
> throttle.job.factor parameters ignored ?
> 
> Looking on how swift submits initial requests, it seems that
> throttle.job.factor affects the number of coaster nodes it will submit
> to the LRM. If I have
>   <profile namespace="karajan" key="initialScore">1</profile>
>     <profile namespace="karajan" key="jobThrottle">1</profile>
> Swift spawns 4 coasters.  4*16=64 processors available to me.  I
> observe that throughout the job this number did not increase

A job throttle of 1 pretty much caps the total number of concurrent jobs
at 100.

> 
> Next, in my swift.properties, I have
> throttle.submit=4
> throttle.host.submit=2
> 
> But in the runtime,
> 
> rogress:  Selecting site:2809 Submitting:17 Active:40 Stage out:31
> Finished successfully:103
> Progress:  Selecting site:2809 Submitting:17 Active:40 Stage out:30 Finished su
> 
> so the 2 parameters does not apply to coaster submissions?

The "submitting" printed by the progress ticker is not the same as the
"submit" in swift.properties.

From a cog abstractions perspective, 4 concurrent submissions means that
only 4 calls to TaskHandler.submit(Task) can be active at one time.

From swift's perspective it means that the job was queued to the
scheduler and awaits its turn to be one of the 4 to go through
TaskHandler.submit().
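A hypothetical Python model of the distinction drawn here. With throttle.submit=4, at most four submit calls are in progress at once, while further jobs wait for a slot. From Swift's side, those waiting jobs may already count as "submitting". The class and method names are illustrative stand-ins, not CoG's TaskHandler.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

THROTTLE_SUBMIT = 4  # throttle.submit=4 in swift.properties


class ThrottledSubmitter:
    """Allows at most `limit` submit() calls to be in progress at once."""

    def __init__(self, limit):
        self._slots = threading.Semaphore(limit)
        self._lock = threading.Lock()
        self.in_progress = 0
        self.peak = 0

    def submit(self, job):
        with self._slots:  # queued jobs block here until a slot frees up
            with self._lock:
                self.in_progress += 1
                self.peak = max(self.peak, self.in_progress)
            try:
                time.sleep(0.01)  # stand-in for the real submission call
            finally:
                with self._lock:
                    self.in_progress -= 1
        return job


submitter = ThrottledSubmitter(THROTTLE_SUBMIT)
# 32 workers means up to 32 jobs "submitting" from the caller's side.
with ThreadPoolExecutor(max_workers=32) as pool:
    done = list(pool.map(submitter.submit, range(40)))
```

Up to 32 jobs can be waiting at once, but submitter.peak never exceeds 4. A caller-side "Submitting" count can therefore be far larger than throttle.submit.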


From benc at hawaga.org.uk  Tue Feb 24 20:23:12 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 25 Feb 2009 02:23:12 +0000 (GMT)
Subject: [Swift-devel] Problem using @extractint() on derived file
In-Reply-To: <1235511208.7899.3.camel@localhost>
References: <49A4618C.6050108@mcs.anl.gov> <1235509952.7505.2.camel@localhost>
	<49A46585.6090003@mcs.anl.gov> <1235511208.7899.3.camel@localhost>
Message-ID: <Pine.LNX.4.64.0902250221080.1293@dildano.hawaga.org.uk>


On Tue, 24 Feb 2009, Mihael Hategan wrote:

> Yes, this happens because @f can be resolved before f can be, so swift
> will happily do the extractint before echo finishes.
> 
> I don't have a solution yet.

extractint probably should be able to take a file parameter, rather than a 
string, and order evaluation properly.

which I think is what readData does (though it's not in the test suite, I'm
told)

so it may be that replacing extractint(@f) with readData(f) (note the
lack of @) does what is desired

in which case, extractInt can be removed from the language.
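This suggestion amounts to passing the file itself rather than its name, so that the read is ordered after the producer. Here Python futures stand in for Swift's dataflow variables; this is a hypothetical model, not Swift's implementation. A reader that receives the pending file, rather than its name, must wait for it before opening it:

```python
import os
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor


def echo_job(path, text):
    """Stand-in for the echo app: writes text to path after a short delay."""
    time.sleep(0.1)
    with open(path, "w") as fh:
        fh.write(text + "\n")
    return path


def read_data(file_future):
    """Like readData(f): takes the file itself and blocks until it exists."""
    path = file_future.result()  # wait for the producing job
    with open(path) as fh:
        return int(fh.read().strip())


workdir = tempfile.mkdtemp()
with ThreadPoolExecutor(max_workers=1) as pool:
    f = pool.submit(echo_job, os.path.join(workdir, "a"), "123")
    i = read_data(f)  # i = readData(f), with no "@"
```

However long the job takes, the read cannot start before the file is written.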

-- 


From wilde at mcs.anl.gov  Tue Feb 24 22:55:49 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 24 Feb 2009 22:55:49 -0600
Subject: [Swift-devel] Problem using @extractint() on derived file
In-Reply-To: <Pine.LNX.4.64.0902250221080.1293@dildano.hawaga.org.uk>
References: <49A4618C.6050108@mcs.anl.gov> <1235509952.7505.2.camel@localhost>
	<49A46585.6090003@mcs.anl.gov> <1235511208.7899.3.camel@localhost>
	<Pine.LNX.4.64.0902250221080.1293@dildano.hawaga.org.uk>
Message-ID: <49A4CF55.3060105@mcs.anl.gov>

On 2/24/09 8:23 PM, Ben Clifford wrote:
> On Tue, 24 Feb 2009, Mihael Hategan wrote:
> 
>> Yes, this happens because @f can be resolved before f can be, so swift
>> will happily do the extractint before echo finishes.
>>
>> I don't have a solution yet.
> 
> extractint probably should be able to take a file parameter, rather than a 
> string, and order evaluation properly.

extractint(f) seems to work:

---
type file;

app (file out) echo (string s) { echo s stdout=@out; }

file f = echo("123");

int i = @extractint(f);

trace (i);
---

prints 123

substituting readData for @extractint in the above works as well.

I was confused about what each expected, so perhaps just clarifying the 
user guide is what's needed.



From benc at hawaga.org.uk  Wed Feb 25 03:45:32 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 25 Feb 2009 09:45:32 +0000 (GMT)
Subject: [Swift-devel] Problem using @extractint() on derived file
In-Reply-To: <49A4CF55.3060105@mcs.anl.gov>
References: <49A4618C.6050108@mcs.anl.gov> <1235509952.7505.2.camel@localhost>
	<49A46585.6090003@mcs.anl.gov> <1235511208.7899.3.camel@localhost>
	<Pine.LNX.4.64.0902250221080.1293@dildano.hawaga.org.uk>
	<49A4CF55.3060105@mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0902250942020.23512@dildano.hawaga.org.uk>


On Tue, 24 Feb 2009, Michael Wilde wrote:

> substituting readData for @extractint in the above works as well.

ok good.

rambling slightly:

It would be nice to get rid of @extractint and have only readData, but I 
think that this doesn't work in all cases: readData's behaviour is 
controlled by the type that it is expected to return (that is, if you 
assign a readData expression to an int, it tries to read an int; if you 
assign it to an array, it tries to read an array). In some situations, 
that return type isn't well defined, because it could be a context where 
anything is accepted - for example, what is the type of 
readData in: trace(readData(f))?
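A rough Java analogue of type-directed reading, assuming a simple text format (one integer, a string, or whitespace-separated integers). TypedRead and its formats are hypothetical; the point is only that the reader needs an expected type, which Swift takes from the assignment target.

```java
import java.util.Arrays;

// Hypothetical sketch of type-directed reading: the expected type has to come
// from somewhere. In Swift it comes from the assignment target
// (int i = readData(f);); here the caller passes it explicitly.
class TypedRead {
    static <T> T readData(String contents, Class<T> expected) {
        String trimmed = contents.trim();
        if (expected == Integer.class) {
            return expected.cast(Integer.valueOf(trimmed));
        }
        if (expected == String.class) {
            return expected.cast(trimmed);
        }
        if (expected == int[].class) {
            // One integer per whitespace-separated token, like reading an array.
            int[] values = Arrays.stream(trimmed.split("\\s+"))
                    .mapToInt(Integer::parseInt)
                    .toArray();
            return expected.cast(values);
        }
        throw new IllegalArgumentException("no reader for " + expected.getName());
    }

    public static void main(String[] args) {
        Integer i = readData("123\n", Integer.class);
        int[] a = readData("1 2 3\n", int[].class);
        System.out.println(i);                  // 123
        System.out.println(Arrays.toString(a)); // [1, 2, 3]
        // No counterpart exists for trace(readData(f)): a context that accepts
        // any type gives no Class to pass, so the call cannot pick a reader.
    }
}
```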

-- 


From benc at hawaga.org.uk  Wed Feb 25 03:59:45 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 25 Feb 2009 09:59:45 +0000 (GMT)
Subject: [Swift-devel] Problem using @extractint() on derived file
In-Reply-To: <49A4CF55.3060105@mcs.anl.gov>
References: <49A4618C.6050108@mcs.anl.gov> <1235509952.7505.2.camel@localhost>
	<49A46585.6090003@mcs.anl.gov> <1235511208.7899.3.camel@localhost>
	<Pine.LNX.4.64.0902250221080.1293@dildano.hawaga.org.uk>
	<49A4CF55.3060105@mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0902250945420.23512@dildano.hawaga.org.uk>


On Tue, 24 Feb 2009, Michael Wilde wrote:

> > extractint probably should be able to take a file parameter, rather than a
> > string, and order evaluation properly.
> 
> extractint(f) seems to work:

[...]

> I was confused about what each expected, so perhaps just clarifying the users
> guide is whats needed.

It would be nice if you got some kind of warning here.

The approach that I think is most in-sync with other file handling in 
Swift would be to say that you could not pass a filename into readData; 
instead you would be compelled to pass a mapped file (as you ended up 
doing in this case).

That would increase the volume of text needed when using the present 'pass 
in a filename' behaviour; but in some ways, it's a simplification because 
it strengthens the rule "if you want to deal with a file, you must do it 
by mapping to a variable, not by passing its filename around".

I'm unsure. Right this second I think I'd prefer that change to be made.

-- 


From wilde at mcs.anl.gov  Wed Feb 25 08:07:42 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 25 Feb 2009 08:07:42 -0600
Subject: [Swift-devel] Re: [Swift-user] Questions on use of iterate statement
In-Reply-To: <Pine.LNX.4.64.0902231545520.23512@dildano.hawaga.org.uk>
References: <499EAB13.5050401@mcs.anl.gov>
	<Pine.LNX.4.64.0902201334000.1293@dildano.hawaga.org.uk>
	<499EC8DA.70304@mcs.anl.gov>
	<Pine.LNX.4.64.0902231545520.23512@dildano.hawaga.org.uk>
Message-ID: <49A550AE.1020007@mcs.anl.gov>

On 2/23/09 9:46 AM, Ben Clifford wrote:
> As of r2593 you should be able to use the style of iteration that you 
> originally used.

Thanks, Ben.

When I tried this, it turned out my data flow required a whole array of 
results to be passed out of the iterate anyway. So what was initially a 
workaround turned out to be the way it needed to be coded after all. But 
I'll try to test your change on other variants of this.


From benc at hawaga.org.uk  Wed Feb 25 11:31:53 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 25 Feb 2009 17:31:53 +0000 (GMT)
Subject: [Swift-devel] pmd
Message-ID: <Pine.LNX.4.64.0902251659420.23512@dildano.hawaga.org.uk>


In my ongoing adventure with build/test/analysis tools, I ran PMD on the 
Swift source code. I used the unused code and unused import report to 
remove a bunch of unused code from the source. I just ran a test with 
almost all rulesets enabled, which gives 8000 comments on the Swift source 
code. A bunch are stylistic comments that I don't particularly agree with 
(such as on the use of single-character variable names), but if anyone is 
interested in having a browse, here's the report:

http://www.ci.uchicago.edu/~benc/tmp/pmd.html
The source code lines in this report are against r2606


From hategan at mcs.anl.gov  Wed Feb 25 11:42:45 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 25 Feb 2009 11:42:45 -0600
Subject: [Swift-devel] pmd
In-Reply-To: <Pine.LNX.4.64.0902251659420.23512@dildano.hawaga.org.uk>
References: <Pine.LNX.4.64.0902251659420.23512@dildano.hawaga.org.uk>
Message-ID: <1235583765.20020.1.camel@localhost>

On Wed, 2009-02-25 at 17:31 +0000, Ben Clifford wrote:
> In my ongoing adventure with build/test/analysis tools, I ran PMD on the 
> swift source code. I used the unused code and unused import report

Eclipse has a sub-set of what PMD does, including looking for unused
imports (for which there is also a shortcut that automatically
re-organizes them).


From bugzilla-daemon at mcs.anl.gov  Wed Feb 25 18:27:57 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Wed, 25 Feb 2009 18:27:57 -0600 (CST)
Subject: [Swift-devel] [Bug 178] New: strange unused string replacement in
	CSVMapper needs investigating
Message-ID: <bug-178-21@http.bugzilla.mcs.anl.gov/swift/>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=178

           Summary: strange unused string replacement in CSVMapper needs
                    investigating
           Product: Swift
           Version: unspecified
          Platform: Macintosh
        OS/Version: Mac OS
            Status: NEW
          Severity: normal
          Priority: P2
         Component: General
        AssignedTo: benc at hawaga.org.uk
        ReportedBy: benc at hawaga.org.uk


CSVMapper contains (as of r2606) the following TODO to investigate:

                                        // TODO PMD reports this for the
                                        // following line:
                                        // An operation on an Immutable object
                                        // (String, BigDecimal or BigInteger) won't change the object itself
                                        // This is likely a bug
                                        column.replaceAll("\\s", "_");

That's meant to do something, presumably, but it doesn't...
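The PMD warning is about String immutability: replaceAll returns a new string and leaves the receiver unchanged, so the statement as written has no effect. Assuming the intent was to replace whitespace in column names with underscores, the fix is to keep the result. ColumnNames is a hypothetical wrapper for illustration, not the CSVMapper code.

```java
// Hypothetical wrapper showing the flagged pattern and the presumable fix.
class ColumnNames {
    // As in the TODO: the result of replaceAll is discarded, so nothing changes.
    static String buggy(String column) {
        column.replaceAll("\\s", "_");
        return column;
    }

    // Presumably intended: keep the string that replaceAll returns.
    static String fixed(String column) {
        column = column.replaceAll("\\s", "_");
        return column;
    }

    public static void main(String[] args) {
        System.out.println(buggy("run id")); // run id
        System.out.println(fixed("run id")); // run_id
    }
}
```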


-- 
Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You reported the bug, or are watching the reporter.
You are the assignee for the bug, or are watching the assignee.


From bugzilla-daemon at mcs.anl.gov  Wed Feb 25 18:37:25 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Wed, 25 Feb 2009 18:37:25 -0600 (CST)
Subject: [Swift-devel] [Bug 178] strange unused string replacement in
	CSVMapper needs investigating
In-Reply-To: <bug-178-21@http.bugzilla.mcs.anl.gov/swift/>
Message-ID: <20090226003725.B181E164CE@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=178


------- Comment #1 from hategan at mcs.anl.gov  2009-02-25 18:37 -------
I believe we should deprecate the CSV mapper entirely in favor of the ext
mapper, which is both more powerful and easier to use.




From aespinosa at cs.uchicago.edu  Wed Feb 25 18:46:38 2009
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Wed, 25 Feb 2009 18:46:38 -0600
Subject: [Swift-devel] current workers < 0 ?
Message-ID: <50b07b4b0902251646r2cc11935w34337187c6c26b4a@mail.gmail.com>

I am trying to generate a plot of number of coaster workers vs time
superimposed with number of tasks vs time plot. Upon poking through
~/.globus/coasters/coasters.log, I notice that it drops to negative
values.

2009-02-25 15:30:34,447-0600 INFO  WorkerManager Current workers: 81
2009-02-25 15:30:34,447-0600 INFO  WorkerManager Ready: {}
2009-02-25 15:30:34,447-0600 INFO  WorkerManager Busy:
[Worker[608604359], Worker[671140203], Worker[-475116310],
Worker[-1187087425], Worker[1021599238], Work2009-02-25
15:30:34,448-0600 INFO  WorkerManager Requested:
{-906148816=Worker[-906148816], 40

I think this is related to the currentWorkers++; line in WorkerManager.java

-Allan


From hategan at mcs.anl.gov  Wed Feb 25 20:21:40 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 25 Feb 2009 20:21:40 -0600 (CST)
Subject: [Swift-devel] current workers < 0 ?
In-Reply-To: <50b07b4b0902251646r2cc11935w34337187c6c26b4a@mail.gmail.com>
Message-ID: <11549721.289771235614900491.JavaMail.root@zimbra>


----- Allan Espinosa <aespinosa at cs.uchicago.edu> wrote:
> I am trying to generate a plot of number of coaster workers vs time
> superimposed with number of tasks vs time plot. Upon poking through
> ~/.globus/coasters/coasters.log, I notice that it drops to negative
> values.

Maybe I'm missing something, but where do you see the number of 
workers being negative in the text below?

> 
> 2009-02-25 15:30:34,447-0600 INFO  WorkerManager Current workers: 81
> 2009-02-25 15:30:34,447-0600 INFO  WorkerManager Ready: {}
> 2009-02-25 15:30:34,447-0600 INFO  WorkerManager Busy:
> [Worker[608604359], Worker[671140203], Worker[-475116310],
> Worker[-1187087425], Worker[1021599238], Work2009-02-25
> 15:30:34,448-0600 INFO  WorkerManager Requested:
> {-906148816=Worker[-906148816], 40
> 
> I think this deals with the  currentWorkers++ ; line in WorkerManager.java
> 
> -Allan
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel


From aespinosa at cs.uchicago.edu  Wed Feb 25 20:30:16 2009
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Wed, 25 Feb 2009 20:30:16 -0600
Subject: [Swift-devel] current workers < 0 ?
In-Reply-To: <11549721.289771235614900491.JavaMail.root@zimbra>
References: <50b07b4b0902251646r2cc11935w34337187c6c26b4a@mail.gmail.com>
	<11549721.289771235614900491.JavaMail.root@zimbra>
Message-ID: <50b07b4b0902251830k9f5f04bma9987bce21083e9c@mail.gmail.com>

Oops. I copy-pasted the wrong line. It should be:

2009-02-25 15:33:14,665-0600 INFO  WorkerManager Current workers: -110
2009-02-25 15:33:14,669-0600 INFO  AbstractKarajanChannel SC-null
REPL: Command(2009-02-25 15:33:14,670-0600 INFO
AbstractKarajanChannel Unregistering Command(2009-02-25
15:33:14,673-0600 INFO  WorkerManager Ready:
{-1644269098/1235598824s2009-02-25 15:33:14,674-0600 INFO
WorkerManager Busy: [Worker[-1955187037], Wor2009-02-25
15:33:14,674-0600 INFO  WorkerManager Requested:
{-1105264759=Worker[2009-02-25 15:33:14,674-0600 INFO  WorkerManager
Starting: []
2009-02-25 15:33:14,676-0600 INFO  WorkerManager Ids:
{-1734485274=Worker[-173442009-02-25 15:33:14,676-0600 INFO
WorkerManager AllocationR: []


Sorry about that.

-Allan



From hategan at mcs.anl.gov  Wed Feb 25 20:36:40 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 25 Feb 2009 20:36:40 -0600 (CST)
Subject: [Swift-devel] current workers < 0 ?
In-Reply-To: <50b07b4b0902251830k9f5f04bma9987bce21083e9c@mail.gmail.com>
Message-ID: <8091050.289951235615800053.JavaMail.root@zimbra>


----- Allan Espinosa <aespinosa at cs.uchicago.edu> wrote:
> Ooops. I copy pasted the wrong line.  It should be:
> 
> 2009-02-25 15:33:14,665-0600 INFO  WorkerManager Current workers: -110

Heh. Yes. That increment should be synchronized. I guess I didn't bother
because it was only there for informal reasons.
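A bare currentWorkers++ (or --) on a shared int is a read-modify-write that can lose updates when several threads run it at once. A minimal sketch of the difference using AtomicInteger; WorkerCounter is hypothetical, not WorkerManager's code. It covers only the synchronization point; Allan later reports the count still going negative for another reason.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch, not the actual WorkerManager code.
class WorkerCounter {
    private int racyCount = 0;                           // like a bare currentWorkers++
    private final AtomicInteger safeCount = new AtomicInteger();

    void workerStartedRacy() { racyCount++; }            // read-modify-write: updates can be lost
    void workerStoppedRacy() { racyCount--; }

    void workerStarted() { safeCount.incrementAndGet(); } // atomic, no lost updates
    void workerStopped() { safeCount.decrementAndGet(); }

    int racy() { return racyCount; }
    int safe() { return safeCount.get(); }

    public static void main(String[] args) throws InterruptedException {
        WorkerCounter c = new WorkerCounter();
        ExecutorService pool = Executors.newFixedThreadPool(8);
        for (int i = 0; i < 10_000; i++) {
            pool.execute(() -> {
                c.workerStartedRacy();
                c.workerStarted();
                c.workerStoppedRacy();
                c.workerStopped();
            });
        }
        pool.shutdown();
        pool.awaitTermination(30, TimeUnit.SECONDS);
        // safe() is always 0 here; racy() may drift from 0 depending on timing.
        System.out.println("racy=" + c.racy() + " safe=" + c.safe());
    }
}
```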



From aespinosa at cs.uchicago.edu  Wed Feb 25 20:40:35 2009
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Wed, 25 Feb 2009 20:40:35 -0600
Subject: [Swift-devel] current workers < 0 ?
In-Reply-To: <8091050.289951235615800053.JavaMail.root@zimbra>
References: <50b07b4b0902251830k9f5f04bma9987bce21083e9c@mail.gmail.com>
	<8091050.289951235615800053.JavaMail.root@zimbra>
Message-ID: <50b07b4b0902251840u71deaefdk7fcebd04acdc0ec3@mail.gmail.com>

I see. Formally it should be

currentWorkers = ready.size() + busy.size()

right?



From hategan at mcs.anl.gov  Wed Feb 25 21:32:34 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 25 Feb 2009 21:32:34 -0600 (CST)
Subject: [Swift-devel] current workers < 0 ?
In-Reply-To: <50b07b4b0902251840u71deaefdk7fcebd04acdc0ec3@mail.gmail.com>
Message-ID: <31786787.290651235619154333.JavaMail.root@zimbra>

----- Allan Espinosa <aespinosa at cs.uchicago.edu> wrote:
> I see. Formally it should be
> 
> currentWorkers = ready.size() + busy.size()

Also + starting.size() (I suppose in order to avoid starting more workers
than the total number allowed).
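Deriving the count from the collections, rather than keeping a separate counter, can be sketched as follows. WorkerPool and its methods are hypothetical; a real manager would also have to account for several workers arriving per request, which this ignores.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch (not the actual WorkerManager): the worker count is
// computed from the sets that already track each worker's state.
class WorkerPool {
    private final Set<String> starting = new LinkedHashSet<>();
    private final Set<String> ready = new LinkedHashSet<>();
    private final Set<String> busy = new LinkedHashSet<>();

    synchronized void requested(String id) { starting.add(id); }

    synchronized void registered(String id) {
        if (starting.remove(id)) ready.add(id);
    }

    synchronized void assigned(String id) {
        if (ready.remove(id)) busy.add(id);
    }

    synchronized void released(String id) {
        if (busy.remove(id)) ready.add(id);
    }

    // Removing a worker that is already gone is a no-op, so a duplicate
    // termination notice cannot push the count below zero.
    synchronized void terminated(String id) {
        starting.remove(id);
        ready.remove(id);
        busy.remove(id);
    }

    // ready + busy + starting; counting starting workers keeps the manager
    // from requesting more than the total allowed.
    synchronized int currentWorkers() {
        return ready.size() + busy.size() + starting.size();
    }

    public static void main(String[] args) {
        WorkerPool pool = new WorkerPool();
        pool.requested("w1");
        pool.requested("w2");
        pool.registered("w1");
        pool.assigned("w1");
        pool.terminated("w1");
        pool.terminated("w1"); // duplicate notice
        System.out.println(pool.currentWorkers()); // 1 (w2 is still starting)
    }
}
```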

> 
> right?
> 
> On Wed, Feb 25, 2009 at 8:36 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> >
> > ----- Allan Espinosa <aespinosa at cs.uchicago.edu> wrote:
> >> Ooops. I copy pasted the wrong line.  It should be:
> >>
> >> 2009-02-25 15:33:14,665-0600 INFO  WorkerManager Current workers: -110
> >
> > Heh. Yes. That increment should be synchronized. I guess I didn't bother
> > because it was only there for informal reasons.

I take that back. It is actually used for things.


From hategan at mcs.anl.gov  Wed Feb 25 21:39:55 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 25 Feb 2009 21:39:55 -0600 (CST)
Subject: [Swift-devel] current workers < 0 ?
In-Reply-To: <8091050.289951235615800053.JavaMail.root@zimbra>
Message-ID: <17584853.290771235619595728.JavaMail.root@zimbra>


----- Mihael Hategan <hategan at mcs.anl.gov> wrote:
> 
> ----- Allan Espinosa <aespinosa at cs.uchicago.edu> wrote:
> > Ooops. I copy pasted the wrong line.  It should be:
> > 
> > 2009-02-25 15:33:14,665-0600 INFO  WorkerManager Current workers: -110
> 
> Heh. Yes. That increment should be synchronized. I guess I didn't bother
> because it was only there for informal reasons.
> 

cog r2306 should fix this. Let me know if it works or if I screwed up 
something else.


From aespinosa at cs.uchicago.edu  Wed Feb 25 22:29:14 2009
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Wed, 25 Feb 2009 22:29:14 -0600
Subject: [Swift-devel] current workers < 0 ?
In-Reply-To: <17584853.290771235619595728.JavaMail.root@zimbra>
References: <8091050.289951235615800053.JavaMail.root@zimbra>
	<17584853.290771235619595728.JavaMail.root@zimbra>
Message-ID: <50b07b4b0902252029u2dd82147x4accab87ac85ecfd@mail.gmail.com>

It still has the same issues. It subtracts too much when a task is finished.

Also, observing the LRM queue, I see Swift creating 18-20 "make
coaster" requests (4 at the start, then 16-18 after 5 minutes). With
coastersPerNode set to 16, that gives a 320-processor allocation. This is
more than MAX_WORKERS (~256) and the max score possible from my sites.xml
(102 max)

    <profile namespace="karajan" key="initialScore">1</profile>
    <profile namespace="karajan" key="jobThrottle">1</profile>


2009-02-25 20:31:15,590-0600 INFO  Worker Worker stderr: null
2009-02-25 20:31:15,590-0600 WARN  WorkerManager Worker terminated:
Worker[-1909333457]
2009-02-25 20:31:15,590-0600 WARN  Worker Worker 335457820 status
change: Completed
2009-02-25 20:31:15,590-0600 INFO  Worker Worker stdout: Job You has completed.
Writing job STDOUT and STDERR to cache files.
Returning job success.

2009-02-25 20:31:15,590-0600 INFO  Worker Worker stderr: null
2009-02-25 20:31:15,590-0600 WARN  WorkerManager Worker terminated:
Worker[335457820]
******2009-02-25 20:31:15,742-0600 INFO  WorkerManager Current workers: -32****
2009-02-25 20:31:15,745-0600 INFO  WorkerManager Ready: {}
2009-02-25 20:31:15,745-0600 INFO  WorkerManager Busy:
[Worker[-1260987422], Worker[2142641145], Worker[2053757208
2009-02-25 20:31:15,751-0600 INFO  WorkerManager Requested:
{640597733=Worker[640597733], -692025578=Worker[-69202
2009-02-25 20:31:15,751-0600 INFO  WorkerManager Starting:
[Task(type=JOB_SUBMISSION, identity=urn:1235615211813-1
2009-02-25 20:31:15,752-0600 INFO  WorkerManager Ids:
{1078934147=Worker[1078934147], 264613139=Worker[264613139],
2009-02-25 20:31:15,753-0600 INFO  WorkerManager AllocationR:
[org.globus.cog.abstraction.coaster.service.job.mana
2009-02-25 20:31:15,873-0600 INFO  AbstractKarajanChannel SC-null REQ:
Handler(JOBSTATUS)




-- 
Allan M. Espinosa <http://allan.88-mph.net/blog>
PhD student, Computer Science
University of Chicago <http://people.cs.uchicago.edu/~aespinosa>


From hategan at mcs.anl.gov  Wed Feb 25 23:27:37 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 25 Feb 2009 23:27:37 -0600
Subject: [Swift-devel] current workers < 0 ?
In-Reply-To: <50b07b4b0902252029u2dd82147x4accab87ac85ecfd@mail.gmail.com>
References: <8091050.289951235615800053.JavaMail.root@zimbra>
	<17584853.290771235619595728.JavaMail.root@zimbra>
	<50b07b4b0902252029u2dd82147x4accab87ac85ecfd@mail.gmail.com>
Message-ID: <1235626057.5218.6.camel@localhost>

I suspect the issue was introduced by the addition of multiple coasters
per node. The manager expects one worker, but gets 16 instead. 

On Wed, 2009-02-25 at 22:29 -0600, Allan Espinosa wrote:
> It still has the same issues.  It subtracts too much when a task if finished.
> 
> Also, observing the LRM queue, i see swift  creating 18-20 "make
> coaster" requests (4 at start then 16-18 after 5 mins).  with a 16
> coastersPerNode you get a 320 processor allocation.  this more than
> MAX_WORKERS~256 and the max score possible from my sites.xml (102 max)

Regarding MAX_WORKERS, that probably suffers from the same problem, in
that it may request fewer than 256 workers, but given that each request
means 16 workers, the end result may differ from what's expected.

However, MAX_WORKERS was introduced merely to limit damage in case the
code is bad and doesn't otherwise put an upper bound on the number of
worker requests (/jobs in the queue).
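The arithmetic behind the over-allocation: with coastersPerNode = 16, each block request yields 16 workers, so 20 requests give 20 * 16 = 320 workers, above a MAX_WORKERS of about 256. A minimal sketch of a limit that counts workers rather than requests; RequestPlanner and its methods are hypothetical.

```java
// Hypothetical sketch: the limit is expressed in workers, but each block
// request (one LRM job) yields coastersPerNode workers.
class RequestPlanner {
    // How many more block requests fit without exceeding maxWorkers.
    static int allowedRequests(int currentWorkers, int maxWorkers, int coastersPerNode) {
        if (coastersPerNode <= 0) {
            throw new IllegalArgumentException("coastersPerNode must be positive");
        }
        int headroom = Math.max(0, maxWorkers - currentWorkers);
        return headroom / coastersPerNode; // round down: a partial request would overshoot
    }

    static int workersFromRequests(int requests, int coastersPerNode) {
        return requests * coastersPerNode;
    }

    public static void main(String[] args) {
        // Numbers from the thread: 20 requests at 16 coasters per node.
        System.out.println(workersFromRequests(20, 16)); // 320
        System.out.println(allowedRequests(0, 256, 16)); // 16
    }
}
```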


From wilde at mcs.anl.gov  Thu Feb 26 01:05:35 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 26 Feb 2009 01:05:35 -0600
Subject: [Swift-devel] Re: [Swift-user] assigning file variables
In-Reply-To: <49A632FE.8070906@mcs.anl.gov>
References: <49A55309.4050100@mcs.anl.gov>	<Pine.LNX.4.64.0902251521230.1293@dildano.hawaga.org.uk>	<1235576422.17806.4.camel@localhost>	<Pine.LNX.4.64.0902251545300.23512@dildano.hawaga.org.uk>
	<49A632FE.8070906@mcs.anl.gov>
Message-ID: <49A63F3F.10504@mcs.anl.gov>

(I'm moving this back from swift-user to swift-devel)

--

Another aspect of where I'm stuck at the moment is this:

Recall (from previous posts ;) that I have 3 nested loops:

foreach protein in plist {
   iterate {
     foreach i in [1:N] {
       randomlyFoldProtein()
     }
   } until convergence or limit reached
}

In testing the "simulated" version of this (my oops8e.swift example) I 
had to put the inner folding "round" into a function, in order to force 
the closing of the array of result files returned by the inner foreach.

That was fine, with simple_mapper, because I had pre-mapped the entire 
2D result[][] array with simple_mapper, and Swift still let me return an 
inner array and assign it to a member of the outer array:

foreach p, pn in protein {
   file result[][]
     <simple_mapper; prefix=@strcat("output/",p,"/"),suffix=".pdt">;
   iterate i {
     result[i] = doRound(p,i);
   } until (roundDone(result[i],pn) == 1);
}

But, that test was over-simplified, because it didn't handle the fact 
that these returns are really 6-file structs, which motivated me to try 
ext mapper.

But that decision led me back in circles, bouncing between Swift 
limitations:

- ext-mapper can't pre-map a dynamic output structure with any dimensions 
whose size can't be passed to the mapper (I think?)

- arrays can only be closed via return from functions

- files and structs with files have limitations on assignments

- I can't set a mapping any time I want on any member (field or element) 
of any structure.

Here's a related question: Is it the case that if a function returns an 
array, that array *must* be declared and mapped in the calling function, 
*not* in the called function? E.g., I can't dynamically declare and map an 
array *within* a function and return that array out? (I'll try this in 
the morning).

I think I can solve my problems by retreating from ext mapper and 
accepting the naming conventions of simple_mapper, but the set of 
restrictions was interesting.

This makes me more determined to re-open the discussion on the nature of 
objects, variables, handles, scope, and lifetime, as it seems to me that 
part of the problem comes from an object model that's almost, but not 
quite, as regular as it should be.

- Mike


On 2/26/09 12:13 AM, Michael Wilde wrote:
> Can you clarify how the ext mapper behaves differently from say the 
> simple_mapper for output files, and if the following is correct?
> 
> It seems that for the simple_mapper, the mapper parameters define a 
> prefix/suffix, and these strings are used wherever necessary at runtime 
> to form a mapping for any object composed of (possibly nested) structs 
> and arrays, by bracketing the dynamically-constructed object path.
> 
> But when the ext mapper is used for output, it is expected, in a single 
> call, to map the entire structure (and hence can only do static mappings)?
> 
> I thought I had my problem solved using the ext mapper, but the 
> combination of restrictions on assigning file variables and getting the 
> right info to the ext mapper seems to be forcing me back to simple_mapper.
> 
> (I'll try to assemble examples when I have more time)
> 
> On 2/25/09 9:50 AM, Ben Clifford wrote:
>> On Wed, 25 Feb 2009, Mihael Hategan wrote:
>>
>>> it would be preferable to map t to what m is mapped to
>>> from the start
>>
>> right. often (always?) the desire to do this kind of assignment comes 
>> from insufficient expressiveness in our mapping semantics. in the 
>> foreach case, I think my email suggests a reasonable alternative to 
>> assignments that allows mapping to be generated inside of Swift. In 
>> the iterate{} case, that in-swift expression is not possible at the 
>> moment, but could be. For example, soemthing like the ext mapper that 
>> only maps output files, not inputs, and calls a specified swift 
>> procedure to do that mapping. (thates something that has been 
>> discussed before, I think)
>>
> _______________________________________________
> Swift-user mailing list
> Swift-user at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user


From benc at hawaga.org.uk  Thu Feb 26 05:34:55 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 26 Feb 2009 11:34:55 +0000 (GMT)
Subject: [Swift-devel] Re: [Swift-user] assigning file variables
In-Reply-To: <49A63F3F.10504@mcs.anl.gov>
References: <49A55309.4050100@mcs.anl.gov>
	<Pine.LNX.4.64.0902251521230.1293@dildano.hawaga.org.uk>
	<1235576422.17806.4.camel@localhost>
	<Pine.LNX.4.64.0902251545300.23512@dildano.hawaga.org.uk>
	<49A632FE.8070906@mcs.anl.gov> <49A63F3F.10504@mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0902261051540.23512@dildano.hawaga.org.uk>


On Thu, 26 Feb 2009, Michael Wilde wrote:

> foreach p, pn in protein {
>  file result[][]
>    <simple_mapper; prefix=@strcat("output/",p,"/"),suffix=".pdt">;
>  iterate i {
>    result[i] = doRound(p,i);
>  } until (roundDone(result[i],pn) == 1);
> }

> But, that test was over-simplified, because it didnt handle the fact 
> that these returns are really 6-file structs, which motivated me to try 
> ext mapper.

Assuming the above is working, what breaks when you change file into a 
6-member struct?

> - ext-mapper cant pre-map a dynamic output structure with any dimensions whose
> size cant be passed to the mapper (I think?)

yes.

> - arrays can only be closed via return from functions

no. Not since r1536 | benc at CI.UCHICAGO.EDU | 2008-01-03

Since that commit, there is static analysis of source code, and when no 
more assignments are left to make to an array, it's regarded as closed.

However, in the case of multidimensional arrays, this only happens when 
the entire top level array has no more assignments at all, not as each 
subarray happens to become finished.
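A toy version of the closing rule: if the number of statements that can assign to an array is known ahead of time, the array can be marked closed once each of them has finished. ClosingArray is hypothetical and deliberately ignores the hard cases (computed indices, nested sub-arrays); like the behaviour described for multidimensional arrays, it closes only the whole array.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: an array "closes" once every statement that may
// assign to it (counted ahead of time, as a static analysis would) is done.
class ClosingArray<T> {
    private final List<T> elements = new ArrayList<>();
    private int pendingAssigners;
    private boolean closed;

    ClosingArray(int assignmentSites) {
        this.pendingAssigners = assignmentSites;
        this.closed = assignmentSites == 0;
    }

    synchronized void add(T value) {
        if (closed) {
            throw new IllegalStateException("array already closed");
        }
        elements.add(value);
    }

    // Called when one assigning statement can make no further assignments.
    synchronized void assignerDone() {
        if (pendingAssigners > 0 && --pendingAssigners == 0) {
            closed = true;
        }
    }

    synchronized boolean isClosed() { return closed; }
    synchronized int size() { return elements.size(); }

    public static void main(String[] args) {
        // Two statements may assign to this array, e.g. two foreach bodies.
        ClosingArray<String> results = new ClosingArray<>(2);
        results.add("a");
        results.assignerDone();
        System.out.println(results.isClosed()); // false
        results.add("b");
        results.assignerDone();
        System.out.println(results.isClosed()); // true
    }
}
```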

Static analysis of arrays (and even runtime analysis to discover when no 
more assignments may happen to a particular piece) is extremely hard 
because you're allowed to construct your own indices, and you're allowed 
to use them in a way that isn't single assignment; I think they're a 
fairly poor structure to have in SwiftScript the way it's going.

For example, in the code fragment:

>  file result[][]
>    <simple_mapper; prefix=@strcat("output/",p,"/"),suffix=".pdt">;
>  iterate i {
>    result[i] = doRound(p,i);
>  } until (roundDone(result[i],pn) == 1);

You can look at that and reason that result[i] won't get assigned any more 
after the iterate statement for that i, but in general that i can be any 
expression. In the general case, how do you know that result[2] will never 
get any more assignments?

There are other ways of doing things, for example Haskell's map, fold and 
unfold, that I think would be much easier to analyse in this case.

(hey I get to mention map/reduce here!)

foreach in that case could look like this map (making up ugly syntax)
with syntax:  output =  map (range) (code)

   file results[] = map proteins (p -> { analyse(p); return p})

This means the same as:

file results[];
foreach p,i in proteins {
  results[i] = analyse(p);
}

What is different is there is now only a single assignment to results. The 
idea of "array closing" collapses down to "has a single assignment been 
made?"
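A rough Java rendering of the map idea, for illustration only: `MapSketch`, `SingleAssignment`, and the `p + ".out"` stand-in for `analyse(p)` are all hypothetical. The whole array is produced by one `set()` call, so "is it closed?" reduces to "has it been assigned?", and element i can only have come from applying the function to input i.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

class MapSketch {
    // Hypothetical single-assignment cell: closing == having been set once.
    static final class SingleAssignment<T> {
        private T value;
        private boolean assigned;

        void set(T v) {
            if (assigned) throw new IllegalStateException("already assigned");
            value = v;
            assigned = true;
        }

        boolean isAssigned() { return assigned; }

        T get() {
            if (!assigned) throw new IllegalStateException("not yet assigned");
            return value;
        }
    }

    // Output element i is produced by exactly one application of f, to input i,
    // so its producer is known without analysing user-built indices.
    static <A, B> List<B> map(List<A> inputs, Function<A, B> f) {
        List<B> out = new ArrayList<>(inputs.size());
        for (A a : inputs) {
            out.add(f.apply(a));
        }
        return Collections.unmodifiableList(out);
    }

    public static void main(String[] args) {
        List<String> proteins = List.of("T1di2", "T3cpo");
        SingleAssignment<List<String>> results = new SingleAssignment<>();
        results.set(map(proteins, p -> p + ".out")); // the only assignment to results
        System.out.println(results.isAssigned() + " " + results.get());
    }
}
```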

Iterate would look more like an unfoldr:

output = unfold seed step terminateCondition

file results[] = unfold initialStep (\prev -> { evaluate(prev); return prev})

Again, you know when results is fully assigned, because there is now only 
a single statement assigning to it.
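Similarly, here is a Java sketch of `unfold seed step terminateCondition`. It is loosely modeled on Haskell's `unfoldr`, but takes an explicit stop predicate so that it behaves like `iterate { ... } until (...)`, where the body runs at least once. `UnfoldSketch` and the halving example are made up.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

class UnfoldSketch {
    // Applies step repeatedly, collecting each result, and stops after the
    // first value for which done is true. The whole list comes out of this
    // single call, and element i is produced by iteration i only.
    // Note: loops forever if done never becomes true.
    static <S> List<S> unfold(S seed, UnaryOperator<S> step, Predicate<S> done) {
        List<S> out = new ArrayList<>();
        S current = seed;
        do {
            current = step.apply(current);
            out.add(current);
        } while (!done.test(current));
        return Collections.unmodifiableList(out);
    }

    public static void main(String[] args) {
        // Hypothetical rounds: a score halves each round until it drops below 10.
        List<Integer> rounds = unfold(100, s -> s / 2, s -> s < 10);
        System.out.println(rounds); // [50, 25, 12, 6]
    }
}
```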

In addition, in both of these, you know exactly when a member of the array 
has been assigned - for any element of results, in both the map and unfold 
case, there is exactly one 'iteration' of the map or unfold which can 
assign to that element, and that is easily known to Swift because it knows 
how map/unfold work.

These should be nestable, and in the case of a multidimensional array, you 
know when any particular sub-array has been assigned, because you know 
which iteration of the outer map/unfold generates that value.

> - files and structs with files have limitations on assignments

yes.

It's easy to implement struct assignment, for structs where the members 
have defined assignment semantics already.

for files, see other thread.

> - I cant set a mapping any time I want on any member (field or element) of any
> structure.

Yes.

> Here's a related question: Is it the case that if a function returns an array,
> that array *must* be declared and mapped in the calling function, *not* in the
> called function?  Eg, I cant dynamically declare and map an array *within* a
> function and return that array out? (I'll try this in the morning).

By function, you mean procedure, I think (code referenced without a @ 
prefix). In that case, yes - procedure call semantics are that you pass in 
where the output belongs.

> This makes me more determined to re-open the discussion on the nature of 
> object, variables, handles, scope, and lifetime, as it seems to me that 
> part of the problem comes from an object model thats almost, but not 
> quite, as regular as it should be.

yes, it's riddled with prototypiness from before; mostly from 
imperativeness conflicting with data flow dependencies. It's substantially 
more consistent than it was a few years ago, though.

-- 


From wilde at mcs.anl.gov  Thu Feb 26 09:31:27 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 26 Feb 2009 09:31:27 -0600
Subject: [Swift-devel] Re: [Swift-user] assigning file variables
In-Reply-To: <Pine.LNX.4.64.0902261051540.23512@dildano.hawaga.org.uk>
References: <49A55309.4050100@mcs.anl.gov>
	<Pine.LNX.4.64.0902251521230.1293@dildano.hawaga.org.uk>
	<1235576422.17806.4.camel@localhost>
	<Pine.LNX.4.64.0902251545300.23512@dildano.hawaga.org.uk>
	<49A632FE.8070906@mcs.anl.gov> <49A63F3F.10504@mcs.anl.gov>
	<Pine.LNX.4.64.0902261051540.23512@dildano.hawaga.org.uk>
Message-ID: <49A6B5CF.8030505@mcs.anl.gov>


On 2/26/09 5:34 AM, Ben Clifford wrote:
> On Thu, 26 Feb 2009, Michael Wilde wrote:
> 
>> foreach p, pn in protein {                                                      
>>  file result[][]                                                               
>>    <simple_mapper; prefix=@strcat("output/",p,"/"),suffix=".pdt">;             
>>  iterate i {                                                                   
>>    result[i] = doRound(p,i);                                                   
>>  } until (roundDone(result[i],pn) == 1);                                       
>> }                                                                               
>                                                                                 
>> But, that test was over-simplified, because it didn't handle the fact 
>> that these returns are really 6-file structs, which motivated me to try 
>> ext mapper.
> 
> Assuming the above is working, what breaks when you change file into a 
> 6-member struct?

If I just move to the 6-file struct and leave all else the same, I think 
I can get that to work (I'll be trying this next). But I was trying to 
preserve the current output structure as well, which is not what I'll 
get with the code above.

If you call the loops:

foreach $protein
   iterate each $round
     foreach $simulation

and the array indices result[$round][$simulation]

I wanted:

output/r$round/$protein.{pdt,energy,rmsd,...}

and what the working code I think will give me is:
output/$protein/$round.$simulation.{pdt,energy,rmsd,...}

That's not bad, but I didn't expect it to be so hard to get a specific 
output structure. Trying to do so was an interesting learning experience 
about the nature of the language.

My conclusion is that the simplest thing that would let me do what I 
want is to stay with the 2-d array structure, and extend the ext mapper 
to be dynamically called once for each output mapping desired, passing 
the ext script the path of the element being mapped.

Another seemingly-simple solution is a generalization of simple_mapper 
that allows a more powerful sprintf-like expression to form the file name.
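A sketch of what a printf-style generalization of simple_mapper might look like, assuming the mapper is handed each element's coordinates (protein, round, simulation, field) in a fixed order. `FormatMapperSketch` is hypothetical and uses Java's `String.format`; it does not correspond to any existing Swift mapper parameter.

```java
import java.util.Locale;

// Hypothetical "format mapper": the file name for each element comes from a
// printf-style pattern applied to that element's coordinates, instead of
// simple_mapper's fixed prefix/index/suffix scheme.
class FormatMapperSketch {
    private final String pattern;

    FormatMapperSketch(String pattern) {
        this.pattern = pattern;
    }

    String map(Object... coordinates) {
        // Locale.ROOT keeps digits ASCII regardless of the JVM's default locale.
        return String.format(Locale.ROOT, pattern, coordinates);
    }

    public static void main(String[] args) {
        // Roughly the layout simple_mapper gives today...
        FormatMapperSketch current = new FormatMapperSketch("output/%s/%04d.%04d.%s");
        System.out.println(current.map("T1di2", 4, 1, "pdt"));
        // ...and the per-round layout described above.
        FormatMapperSketch wanted = new FormatMapperSketch("output/r%d/%s.%s");
        System.out.println(wanted.map(4, "T1di2", "pdt"));
    }
}
```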

I wonder if we could actually move *all* our mappers to "ext" 
implementations, and implement them with shell, perl, awk, etc scripts?
This would seem to make testing new ideas and enhancements pretty easy 
(and in fact more user extensible), and would have virtually no 
performance impact on most workflows.

(But don't implement anything yet; I think all this needs more thought 
and discussion before we bounce around on solutions; I just want to 
gather and organize the issues, then have a language review and see 
what's most important based on real app needs).

>> - ext-mapper can't pre-map a dynamic output structure with any dimensions whose
>> size can't be passed to the mapper (I think?)
> 
> yes.

Can this be lifted, as above?

>> - arrays can only be closed via return from functions
> 
> no. Not since r1536 | benc at CI.UCHICAGO.EDU | 2008-01-03
> 
> Since that commit, there is static analysis of source code, and when no 
> more assignments are left to make to an array, it's regarded as closed.
> 
> However, in the case of multidimensional arrays, this only happens when 
> the entire top level array has no more assignments at all, not as each 
> subarray happens to become finished.

OK, so in my case, effectively that restriction remains (although I 
appreciate the explanation below). Note that I'm not complaining about 
that restriction in this example. In my case, moving the inner loop into 
a separate procedure made the code read a bit nicer, in fact. But it led 
to bumping into the other restrictions mentioned.

> Static analysis of arrays (and even runtime analysis to discover when no 
> more assignments may happen to a particular piece) is extremely hard 
> because you're allowed to construct your own indices, and you're allowed 
> to use them in a way that isn't single assignment; I think they're a 
> fairly poor structure to have in SwiftScript the way it's going.

By "they're a fairly poor structure" do you mean user-specified array 
indices? I fear that removing them will take us too deep into the 
imperative/functional debate, but perhaps we need to keep that 
discussion going.

> For example, in the code fragment:
> 
>>  file result[][]                                                             
>>    <simple_mapper; prefix=@strcat("output/",p,"/"),suffix=".pdt">;           
>>  iterate i {                                                                 
>>    result[i] = doRound(p,i);                                                 
>>  } until (roundDone(result[i],pn) == 1);
> 
> You can look at that and reason that result[i] won't get assigned any more 
> after the iterate statement for that i, but in general that i can be any 
> expression. In the general case, how do you know that result[2] will never 
> get any more assignments?
> 
> There are other ways of doing things, for example Haskell's map, fold and 
> unfold, that I think would be much easier to analyse in this case.
> 
> (hey I get to mention map/reduce here!)
> 
> foreach in that case could look like this map (making up ugly syntax)
> with syntax:  output =  map (range) (code)
> 
>    file results[] = map proteins (p -> { analyse(p); return p})
> 
> This means the same as:
> 
> file results[];
> foreach p,i in proteins {
>   results[i] = analyse(p);
> }
> 
> What is different is there is now only a single assignment to results. The 
> idea of "array closing" collapses down to "has a single assignment been 
> made?"
> 
> Iterate would look more like an unfoldr:
> 
> output = unfold seed step terminateCondition
> 
> file results[] = unfold initialStep (\prev -> { evaluate(prev); return prev})
> 
> Again, you know when results is fully assigned, because there is now only 
> a single statement assigning to it.

We could discuss if such things could be added as experiments without 
(yet) removing their imperative equivalents. I think that the question 
of the attractiveness of the functional model to distributed and 
parallel programming is a promising research topic. But it's not at the 
top of my priority list for the group, which is usability/productivity, 
platform support, performance, and provenance. I do agree that it could 
lead to these, but it's uncertain if we can get as many people to use it, 
and that's where we need to make progress right now.

If you think that going in the direction above could take us to the goal 
quicker than improving the language in its current flavor, I'll listen 
to a plan. My view right now is that swift is on the right track as-is 
and is *very close* to becoming *very* usable/productive. If we can 
identify and make the fewest tweaks we need to iron out current 
difficulties, we'll be on the right track.  And some of those tweaks 
might be to documentation and examples, not even code changes. I do 
realize that some of the *tweaks* might be hard.

> In addition, in both of these, you know exactly when a member of the array 
> has been assigned - for any element of results, in both the map and unfold 
> case, there is exactly one 'iteration' of the map or unfold which can 
> assign to that element, and that is easily known to Swift because it knows 
> how map/unfold work.
> 
> These should be nestable, and in the case of a multidimensional array, you 
> know when any particular sub-array has been assigned, because you know 
> which iteration of the outer map/unfold generates that value.
> 
>> - files and structs with files have limitations on assignments
> 
> yes.
> 
> It's easy to implement struct assignment, for structs where the members 
> have defined assignment semantics already.
> 
> for files, see other thread.

The conclusion of that thread (in my opinion) is that case (ii), what I 
would call "value assignment of file handles", is what we want. (Where 
"file handle" is that "marker type" term that I think the debate is 
still open on).

>> - I cant set a mapping any time I want on any member (field or element) of any
>> structure.
> 
> Yes.

But that's one of the critical things here. I seem to bump into this 
limitation frequently. Does language consistency require these 
limitations on setting mappings, or is it an implementation issue that 
can be lifted? Is it the case that mapping does not affect data flow 
semantics?

>> Here's a related question: Is it the case that if a function returns an array,
>> that array *must* be declared and mapped in the calling function, *not* in the
>> called function?  Eg, I cant dynamically declare and map an array *within* a
>> function and return that array out? (I'll try this in the morning).
> 
> By function, you mean procedure, I think (code referenced without a @ 
> prefix).

I was wondering about that difference - I thought it was inconsistent 
usage in various documents/tutorials. So we should clarify that 
terminology in the user guide. But better to erase the difference - all 
callable things, I feel, should have the same name - function or 
procedure, and they are either built-in, or user (or eventually library) 
defined.

What's the semantic difference between the two today?  One distinction I 
see is that built-in things like trace() can take varying arg types, but 
trace has no @ and thus looks more like a user-defined procedure 
syntactically.

> In that case, yes - procedure call semantics are that you pass in
> where the output belongs.

Then this dictates that the caller also do the mapping - hence the names 
of the members of an array can not depend on values that will only be 
known in the called function, which actually creates the array members.
(in my case, doRound())

>> This makes me more determined to re-open the discussion on the nature of 
>> object, variables, handles, scope, and lifetime, as it seems to me that 
>> part of the problem comes from an object model thats almost, but not 
>> quite, as regular as it should be.
> 
> yes, it's riddled with prototypiness from before; mostly from 
> imperativeness conflicting with data flow dependencies. It's substantially 
> more consistent than it was a few years ago, though.

I agree, it's greatly improved and can do some amazing things.

My guts tell me that if we can address some of the issues I mentioned on 
the nature of vars, handles, and mappings, we're in the home stretch.
I don't think that a more regular approach to object structure and 
lifetime would conflict with the dataflow semantics.

Maybe we should start a new thread on that specific topic, or resume the 
old thread.

For starters (and feel free to move this to a new thread), do you feel 
comfortable with the current model of var, dsHandle, and by-value-like 
assignment?

I would like to see a more Java-like model with a var being a typed 
pointer or scalar value holder, and structs and arrays being dynamic 
objects, and files being special vars with mapping and state.

scalar-var:
   value (int/string/boolean/float)
   state (set/unset)

object-var
   pointer to array or struct
   state (set/unset)

file-var
   mapping
   state (set/unset)

I have to confess that the above is pretty much the way I *thought* 
Swift worked until we tried to write the latest paper, and had the 
ensuing email discussions. Then I realized that (even after the 
discussions) I still don't understand the model.

I don't feel that we have yet adequately described the model, neither for 
a CS paper *nor* for the programmer.  I think that a good start is to 
write a data model description (in the user guide, in a detailed "skip 
this on first reading" section, that specifies the data model in 
language-reference-specification fashion).

 From there we can discuss any proposed changes to either terminology 
and/or implementation.

I *think* that with the model above, one should be able to more flexibly 
set mappings - in fact, set them from swift code, with some kind of 
assignment (like f=<> expression; or f<expression>).


From benc at hawaga.org.uk  Thu Feb 26 10:39:01 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 26 Feb 2009 16:39:01 +0000 (GMT)
Subject: [Swift-devel] Re: [Swift-user] assigning file variables
In-Reply-To: <49A6B5CF.8030505@mcs.anl.gov>
References: <49A55309.4050100@mcs.anl.gov>
	<Pine.LNX.4.64.0902251521230.1293@dildano.hawaga.org.uk>
	<1235576422.17806.4.camel@localhost>
	<Pine.LNX.4.64.0902251545300.23512@dildano.hawaga.org.uk>
	<49A632FE.8070906@mcs.anl.gov> <49A63F3F.10504@mcs.anl.gov>
	<Pine.LNX.4.64.0902261051540.23512@dildano.hawaga.org.uk>
	<49A6B5CF.8030505@mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0902261545170.23512@dildano.hawaga.org.uk>


On Thu, 26 Feb 2009, Michael Wilde wrote:

> Another seemingly-simple solution is a generalization of simple_mapper that
> allows a more powerful sprintf-like expression to form the file name.

I think an interesting approach is to look at having a mapper call an 
arbitrary Swift procedure that returns a string.
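One way to picture a mapper that calls an arbitrary procedure returning a string: the mapper hands over the path of the element being mapped and uses the returned string as that element's file name. Because the callback is consulted per element, it would not need the array dimensions in advance, which is the ext-mapper limitation raised earlier in the thread. All names here are hypothetical.

```java
import java.util.List;
import java.util.function.Function;

class CallbackMapper {
    private final Function<List<String>, String> namer;

    CallbackMapper(Function<List<String>, String> namer) {
        this.namer = namer;
    }

    // Called once per element, with that element's path, when its file name is needed.
    String fileFor(List<String> elementPath) {
        String name = namer.apply(elementPath);
        if (name == null || name.isEmpty()) {
            throw new IllegalArgumentException("no file name for " + elementPath);
        }
        return name;
    }

    public static void main(String[] args) {
        String protein = "T1di2";
        // Hypothetical naming procedure for result[round][simulation].field
        CallbackMapper m = new CallbackMapper(path ->
            "output/r" + path.get(0) + "/" + protein + "." + path.get(1) + "." + path.get(2));
        System.out.println(m.fileFor(List.of("4", "1", "pdt")));
    }
}
```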

> I wonder if we could actually move *all* our mappers to "ext" implementations,
> and implement them with shell, perl, awk, etc scripts?
> This would seem to make testing new ideas and enhancements pretty easy (and in
> fact more user extensible), and would have virtually no performance impact on
> most workflows.

I think the ext interface isn't sufficiently expressive for that at the 
moment.

The whole mapper API feels rather messy to me at the moment, and if we're 
doing development there, putting more serious consideration into what it 
should look like seems worthwhile.

> and would have virtually no performance impact on 
> most workflows.                                                               

Do you have numbers to back that up?
                                
> > > - ext-mapper can't pre-map a dynamic output structure with any dimensions
> > > whose size can't be passed to the mapper (I think?)
> > 
> > yes.
> 
> Can this be lifted, as above?

I think not easily. But see above paragraph about API design.

> > However, in the case of multidimensional arrays, this only happens when the
> > entire top level array has no more assignments at all, not as each subarray
> > happens to become finished.
> 
> OK, so in my case, effectively that restriction remains (although I appreciate
> the explanation below). Note that I'm not complaining about that restriction
> in this example. In my case, moving the inner loop into a separate procedure
> made the code read a bit nicer, in fact. But it led to bumping into the other
> restrictions mentioned.

I think it's an undesirable restriction. However...

> > Static analysis of arrays (and even runtime analysis to discover when no
> > more assignments may happen to a particular piece) is extremely hard because
> > you're allowed to construct your own indices, and you're allowed to use
> > them in a way that isn't single assignment; I think they're a fairly poor
> > structure to have in SwiftScript the way it's going.
> 
> By "they're a fairly poor structure" do you mean user-specified array indices?
> I fear that removing them will take us too deep into the
> imperative/functional debate, but perhaps we need to keep that discussion
> going.

Yes, I mean user-specified array indices.

> We could discuss if such things could be added as experiments without (yet)
> removing their imperative equivalents. I think that the question of the
> attractiveness of the functional model to distributed and parallel programming
> is a promising research topic. But it's not at the top of my priority list for
> the group, which is usability/productivity, platform support, performance,

I think that it's important from a user-interface perspective, not 
particularly from a research perspective.

This style of piecewise assignment to arrays plays merry hell with trying 
to do data-dependent ordering in a way that I think is not easily 
resolvable; and anyone trying to do anything at all interesting with 
arrays gets hit by strange things happening - "I know I've assigned 
everything but somehow the next stage isn't running".

Syntactically this stuff doesn't have to look too different from what it 
looks like now, and we don't have to use particularly scary words like map 
or Haskell (although I will point out the doublethink inherent in 'we 
don't want functional' vs. 'Google map/reduce is god')

> I was wondering about that difference - I thought it was inconsistent usage in
> various documents/tutorials. So we should clarify that terminology in the user
> guide. But better to erase the difference - all callable things, I feel, should
> have the same name - function or procedure, and they are either built-in, or
> user (or eventually library) defined.

> What's the semantic difference between the two today?  One distinction I see is
> that built-in things like trace() can take varying arg types, but trace has no
> @ and thus looks more like a user-defined procedure syntactically.

@strcat takes varargs too.

Those differences are ever more insignificant and with time will disappear 
entirely, I think. At the moment, it's the return semantics that make them 
different.

Historically, @functions returned in-memory values, and procedures 
operated on files; with @functions being intended for constructing 
parameters in mapping parameters, and procedures being the equivalent of a 
VDL1 procedure invocation.

That distinction has blurred greatly over time.
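The two return conventions can be contrasted in a few lines of Java. Everything here is an illustrative sketch: `strcat` stands in for `@strcat`, `doRound` for a procedure, and `OutputSlot` for an output the caller has already declared and mapped. It also shows the consequence Michael raised: the file name is fixed by the caller before `doRound` runs, so it cannot depend on anything `doRound` computes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class CallConventions {
    // Caller-owned output: its file name is fixed before the call is made.
    static final class OutputSlot {
        final String mappedPath;
        String contents;

        OutputSlot(String mappedPath) { this.mappedPath = mappedPath; }
    }

    // @function style: the result value is created inside and returned.
    static String strcat(String... parts) {
        return String.join("", parts);
    }

    // Procedure style: fills in the caller's slot; it cannot choose the name.
    static void doRound(String protein, int round, OutputSlot out) {
        out.contents = protein + " round " + round;
    }

    public static void main(String[] args) {
        String prefix = strcat("output/", "T1di2", "/");
        List<OutputSlot> result = new ArrayList<>();
        for (int i = 0; i < 2; i++) {
            OutputSlot slot = new OutputSlot(prefix + String.format(Locale.ROOT, "%04d.pdt", i));
            doRound("T1di2", i, slot);
            result.add(slot);
        }
        for (OutputSlot s : result) {
            System.out.println(s.mappedPath + " <- " + s.contents);
        }
    }
}
```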

> I don't feel that we have yet adequately described the model, neither for a CS
> paper *nor* for the programmer.  I think that a good start is to write a data
> model description (in the user guide, in a detailed "skip this on first
> reading" section, that specifies the data model in
> language-reference-specification fashion).

right. I'll see about writing more, as I'm in a writing mood this month ;)

-- 


From aespinosa at cs.uchicago.edu  Thu Feb 26 12:44:34 2009
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Thu, 26 Feb 2009 12:44:34 -0600
Subject: on coaster accounting (was Re: [Swift-devel] current workers < 0 ?)
Message-ID: <50b07b4b0902261044kcc21925tbbdb0c51d21a48b4@mail.gmail.com>

Here I reverted to the 1 coaster per node configuration. Here is the
content of the LRM:

JOBID     JOBNAME    USERNAME      STATE   CORE  REMAINING  STARTTIME
================================================================================
561497    data       tg802895      Running 16     00:21:26  Thu Feb 26 12:36:45
561498    data       tg802895      Running 16     00:21:26  Thu Feb 26 12:36:45
561499    data       tg802895      Running 16     00:21:26  Thu Feb 26 12:36:45
....
....
...
561547    data       tg802895      Running 16     00:23:42  Thu Feb 26 12:39:01

    50 active jobs :   50 of 3896 hosts (  1.28 %)


Total jobs: 50    Active Jobs: 50    Waiting Jobs: 0     Dep/Unsched Jobs: 0

Here is the current workers:

2009-02-26 12:38:50,412-0600 INFO  WorkerManager Current workers: 111
2009-02-26 12:38:50,412-0600 INFO  CoasterQueueProcessor Coaster queue: [org.glo
2009-02-26 12:38:50,413-0600 INFO  WorkerManager Ready: 0 {}
2009-02-26 12:38:50,413-0600 INFO  WorkerManager Busy: 0 [Worker[-1480006551], W
2009-02-26 12:38:50,413-0600 INFO  WorkerManager Requested: 61 {2109491608=Worke
2009-02-26 12:38:50,414-0600 INFO  WorkerManager Starting: 32 [Task(type=JOB_SUB
2009-02-26 12:38:50,414-0600 INFO  WorkerManager Ids: 13 {1104104218=Worker[1104
2009-02-26 12:38:50,414-0600 INFO  WorkerManager AllocationR: [org.globus.cog.ab


On Wed, Feb 25, 2009 at 11:27 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> I suspect the issue was introduced by the addition of multiple coasters
> per node. The manager expects one worker, but gets 16 instead.
>
> On Wed, 2009-02-25 at 22:29 -0600, Allan Espinosa wrote:
>> It still has the same issues.  It subtracts too much when a task is finished.
>>
>> Also, observing the LRM queue, I see Swift creating 18-20 "make
>> coaster" requests (4 at start, then 16-18 after 5 mins). With 16
>> coastersPerNode you get a 320-processor allocation. This is more than
>> MAX_WORKERS~256 and the max score possible from my sites.xml (102 max).
>
> Regarding MAX_WORKERS, that probably suffers from the same problem, in
> that it may request less than 256 workers, but given that each request
> means 16 workers, the end result may be different than what's expected.
>
> However, MAX_WORKERS was introduced merely to limit damage in case the
> code is bad and it doesn't otherwise put an upper bound on the limit of
> worker requests (/jobs in the queue).
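The arithmetic behind the overshoot: if the cap is checked against the number of requests while each request brings up 16 workers, 20 requests already mean 320 workers. The `mayRequest` check below is a hypothetical simplification, not the actual coaster code.

```java
// Sketch of the accounting mismatch: a cap on requests does not bound the
// real number of workers once each request starts coastersPerNode workers.
class OverAllocation {
    static int workersStarted(int requests, int coastersPerNode) {
        return requests * coastersPerNode;
    }

    // Hypothetical request-counting cap: allows a new request while
    // requestsSoFar < maxWorkers, treating one request as one worker.
    static boolean mayRequest(int requestsSoFar, int maxWorkers) {
        return requestsSoFar < maxWorkers;
    }

    public static void main(String[] args) {
        int maxWorkers = 256;
        int requests = 20; // upper end of the 18-20 requests observed
        System.out.println(mayRequest(requests, maxWorkers)); // true
        System.out.println(workersStarted(requests, 16));      // 320
        // Counting workers instead would have stopped after 256 / 16 requests.
        System.out.println(maxWorkers / 16);                   // 16
    }
}
```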


From hategan at mcs.anl.gov  Thu Feb 26 13:03:18 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 26 Feb 2009 13:03:18 -0600
Subject: on coaster accounting (was Re: [Swift-devel] current workers <
	0 ?)
In-Reply-To: <50b07b4b0902261044kcc21925tbbdb0c51d21a48b4@mail.gmail.com>
References: <50b07b4b0902261044kcc21925tbbdb0c51d21a48b4@mail.gmail.com>
Message-ID: <1235674998.15221.2.camel@localhost>

There are 50 running workers and 61 somewhere between being submitted
and contacting the service. What's the question?

On Thu, 2009-02-26 at 12:44 -0600, Allan Espinosa wrote:
> Here I reverted to the 1 coaster per node configuration. Here is the
> content of the LRM:
> 
> JOBID     JOBNAME    USERNAME      STATE   CORE  REMAINING  STARTTIME
> ================================================================================
> 561497    data       tg802895      Running 16     00:21:26  Thu Feb 26 12:36:45
> 561498    data       tg802895      Running 16     00:21:26  Thu Feb 26 12:36:45
> 561499    data       tg802895      Running 16     00:21:26  Thu Feb 26 12:36:45
> ....
> ....
> ...
> 561547    data       tg802895      Running 16     00:23:42  Thu Feb 26 12:39:01
> 
>     50 active jobs :   50 of 3896 hosts (  1.28 %)
> 
> 
> Total jobs: 50    Active Jobs: 50    Waiting Jobs: 0     Dep/Unsched Jobs: 0
> 
> Here is the current workers:
> 
> 2009-02-26 12:38:50,412-0600 INFO  WorkerManager Current workers: 111
> 2009-02-26 12:38:50,412-0600 INFO  CoasterQueueProcessor Coaster queue: [org.glo
> 2009-02-26 12:38:50,413-0600 INFO  WorkerManager Ready: 0 {}
> 2009-02-26 12:38:50,413-0600 INFO  WorkerManager Busy: 0 [Worker[-1480006551], W
> 2009-02-26 12:38:50,413-0600 INFO  WorkerManager Requested: 61 {2109491608=Worke
> 2009-02-26 12:38:50,414-0600 INFO  WorkerManager Starting: 32 [Task(type=JOB_SUB
> 2009-02-26 12:38:50,414-0600 INFO  WorkerManager Ids: 13 {1104104218=Worker[1104
> 2009-02-26 12:38:50,414-0600 INFO  WorkerManager AllocationR: [org.globus.cog.ab
> 
> 
> 
> On Wed, Feb 25, 2009 at 11:27 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> > I suspect the issue was introduced by the addition of multiple coasters
> > per node. The manager expects one worker, but gets 16 instead.
> >
> > On Wed, 2009-02-25 at 22:29 -0600, Allan Espinosa wrote:
> >> It still has the same issues.  It subtracts too much when a task is finished.
> >>
> >> Also, observing the LRM queue, I see Swift creating 18-20 "make
> >> coaster" requests (4 at start, then 16-18 after 5 mins). With 16
> >> coastersPerNode you get a 320-processor allocation. This is more than
> >> MAX_WORKERS~256 and the max score possible from my sites.xml (102 max).
> >
> > Regarding MAX_WORKERS, that probably suffers from the same problem, in
> > that it may request less than 256 workers, but given that each request
> > means 16 workers, the end result may be different than what's expected.
> >
> > However, MAX_WORKERS was introduced merely to limit damage in case the
> > code is bad and it doesn't otherwise put an upper bound on the limit of
> > worker requests (/jobs in the queue).


From aespinosa at cs.uchicago.edu  Thu Feb 26 13:29:44 2009
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Thu, 26 Feb 2009 13:29:44 -0600
Subject: on coaster accounting (was Re: [Swift-devel] current workers < 0 
	?)
In-Reply-To: <1235674998.15221.2.camel@localhost>
References: <50b07b4b0902261044kcc21925tbbdb0c51d21a48b4@mail.gmail.com>
	<1235674998.15221.2.camel@localhost>
Message-ID: <50b07b4b0902261129o6a67b39fy204c8850305f5384@mail.gmail.com>

So the Requested list is for the tasks being received and not the
"make coaster" request to the LRM?

also, currentWorkers is the "demand" for coasters and not the number
of coasters that are available (busy or ready)

thus the best way to graph a "number of avail processors" & "current
usage" vs time is using the size of Ids and Busy right?

On Thu, Feb 26, 2009 at 1:03 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> There are 50 running workers and 61 somewhere between being submitted
> and contacting the service. What's the question?
>


From hategan at mcs.anl.gov  Thu Feb 26 14:55:33 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 26 Feb 2009 14:55:33 -0600 (CST)
Subject: on coaster accounting (was Re: [Swift-devel] current workers <
	0  ?)
In-Reply-To: <50b07b4b0902261129o6a67b39fy204c8850305f5384@mail.gmail.com>
Message-ID: <20544682.352711235681733961.JavaMail.root@zimbra>


----- Allan Espinosa <aespinosa at cs.uchicago.edu> wrote:
> So the Requested list is for the tasks being received and not the
> "make coaster" request to the LRM?

The requested list is to track all the workers that the manager plans
to have started and has put a request for to the underlying provider
(LRM, but see below) but haven't yet started or failed.

The manager attempting to start a job is not the same as that job being
in the LRM queue. Between delays, asynchronicity, and just weird job 
managers/LRMs, stuff happens.

> 
> also, currentWorkers is the "demand" for coasters and not the number
> of coasters that are available (busy or ready)

Right. It's supposed to track the total number of workers: busy, ready,
and starting.
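A simplified model of the bookkeeping described above, with one counter per worker state and `currentWorkers()` as their sum. The state names are borrowed loosely from the WorkerManager log lines and are assumptions, not its actual data structures. The guard in `move()` is a design choice that would surface an over-subtraction, like the "current workers < 0" symptom in this thread's subject, as an exception instead of a negative count.

```java
import java.util.EnumMap;
import java.util.Map;

class WorkerAccounting {
    enum State { REQUESTED, STARTING, READY, BUSY }

    private final Map<State, Integer> counts = new EnumMap<>(State.class);

    WorkerAccounting() {
        for (State s : State.values()) counts.put(s, 0);
    }

    void add(State s, int n) { counts.merge(s, n, Integer::sum); }

    // Moves n workers between states; refuses to drive any count below zero.
    // Transitions themselves are not validated in this sketch.
    void move(State from, State to, int n) {
        if (counts.get(from) < n) throw new IllegalStateException("negative count for " + from);
        add(from, -n);
        add(to, n);
    }

    int count(State s) { return counts.get(s); }

    // Everything the manager believes exists or is on its way.
    int currentWorkers() {
        int total = 0;
        for (int c : counts.values()) total += c;
        return total;
    }

    public static void main(String[] args) {
        WorkerAccounting w = new WorkerAccounting();
        w.add(State.REQUESTED, 4);
        w.move(State.REQUESTED, State.STARTING, 3);
        w.move(State.STARTING, State.READY, 2);
        w.move(State.READY, State.BUSY, 1);
        System.out.println(w.currentWorkers() + " busy=" + w.count(State.BUSY)); // 4 busy=1
    }
}
```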

> 
> thus the best way to graph a "number of avail processors" & "current
> usage" vs time is using the size of Ids and Busy right?

Somewhat. Busy will tell you the workers the manager thinks are 
running jobs.

Ids is there to allow quick lookup of a worker based on its id. I'm not
sure what stages of a worker's life (busy, ready, starting) it includes.

> 
> On Thu, Feb 26, 2009 at 1:03 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> > There are 50 running workers and 61 somewhere between being submitted
> > and contacting the service. What's the question?
> >


From wilde at mcs.anl.gov  Thu Feb 26 18:08:40 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 26 Feb 2009 18:08:40 -0600
Subject: [Swift-devel] output format of simple_mapper
Message-ID: <49A72F08.4050608@mcs.anl.gov>

When I apply this mapping to a 2D array of files:
   file result[][] <simple_mapper;
                      prefix=@strcat("output/",p,"/"),suffix=".pdt">;

then I get files like:

   output/T1di2/0004.0001.pdt

but when I apply this mapping to a 2D array of structs of files:

   OOPSOut result[][] <simple_mapper; prefix=@strcat("output/",p,"/")>;

then I get files like:

   output/T3cpo/0000_0000.pdt

Not a problem, just curious what motivated the difference (of sub1.sub2 
vs sub1_sub2)?


From hategan at mcs.anl.gov  Thu Feb 26 18:24:29 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 26 Feb 2009 18:24:29 -0600 (CST)
Subject: [Swift-devel] output format of simple_mapper
In-Reply-To: <49A72F08.4050608@mcs.anl.gov>
Message-ID: <13895265.374241235694269652.JavaMail.root@zimbra>

I think the logic was that if you have a type path 
(say a.b.c), it would be mapped to a_b.c, where the
last element gives the extension. This was inspired by the analyze
format, where we would usually have a struct "image {hdr, img}", so
that mapper would magically end up naming files with the proper
extension for that case.

In your first case, given that the second index is the last
element in the path, it will be separated by ".", and then you
add ".pdt" to that.

In the second case, I assume in OOPSOut your field is named "pdt"
and that ends up being the last element in the path.

If you were to try file result[][][] <...>, you would get names
like: 0004_0005.0001
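The naming rule described above can be reproduced in a short Java sketch: between path tokens the mapper's separator is used, except before the last token, which gets '.' so that it acts as an extension. The '_' separator and four-digit zero-padded indices are assumptions read off the file names in this thread; this is not the actual AbstractMapper code. It yields the three names mentioned: `0004.0001.pdt`, `0000_0000.pdt`, and `0004_0005.0001`.

```java
import java.util.List;
import java.util.Locale;

class SimpleMapperNaming {
    static String name(String prefix, List<String> tokens, String suffix) {
        StringBuilder sb = new StringBuilder(prefix);
        int tokenCount = tokens.size();
        for (int level = 0; level < tokenCount; level++) {
            sb.append(tokens.get(level));
            if (level < tokenCount - 2) {
                sb.append('_');   // mapper-specified separator (assumed "_")
            } else if (level == tokenCount - 2) {
                sb.append('.');   // last token becomes the "extension"
            }
        }
        return sb.append(suffix).toString();
    }

    static String index(int i) {
        return String.format(Locale.ROOT, "%04d", i);
    }

    public static void main(String[] args) {
        // file result[][] with suffix ".pdt"
        System.out.println(name("output/T1di2/", List.of(index(4), index(1)), ".pdt"));
        // OOPSOut result[][] whose struct field is named pdt
        System.out.println(name("output/T3cpo/", List.of(index(0), index(0), "pdt"), ""));
        // file result[][][]
        System.out.println(name("", List.of(index(4), index(5), index(1)), ""));
    }
}
```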

Mihael

----- Michael Wilde <wilde at mcs.anl.gov> wrote:
> When I apply this mapping to a 2D array of files:
>    file result[][] <simple_mapper;
>                       prefix=@strcat("output/",p,"/"),suffix=".pdt">;
> 
> then I get files like:
> 
>    output/T1di2/0004.0001.pdt
> 
> but when I apply this mapping to a 2D array of structs of files:
> 
>    OOPSOut result[][] <simple_mapper; prefix=@strcat("output/",p,"/")>;
> 
> then I get files like:
> 
>    output/T3cpo/0000_0000.pdt
> 
> Not a problem, just curious what motivated the difference (of sub1.sub2 
> vs sub1_sub2)?
> 
> 
> 
> 
> 
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel


From benc at hawaga.org.uk  Thu Feb 26 18:42:01 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 27 Feb 2009 00:42:01 +0000 (GMT)
Subject: [Swift-devel] output format of simple_mapper
In-Reply-To: <13895265.374241235694269652.JavaMail.root@zimbra>
References: <13895265.374241235694269652.JavaMail.root@zimbra>
Message-ID: <Pine.LNX.4.64.0902270030240.1293@dildano.hawaga.org.uk>


AbstractMapper has this rather case:

  if (level < tokenCount - 2) {
          logger.debug("Adding mapper-specified separator");
          sb.append(getElementMapper().getSeparator(level));
  }
  else {
          logger.debug("Adding '.' instead of mapper-specified separator");
          sb.append('.');
  }


that implements the behaviour Mihael describes.
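To make the rule concrete, here is a minimal Java sketch that reconstructs only the joining behaviour from the excerpt above. It is not the actual AbstractMapper code. The loop structure, the method names join and index, and the four-digit zero padding of indices (inferred from names like 0004.0001) are all assumptions. The prefix is omitted.

```java
import java.util.List;

public class SimpleMapperNaming {
    // Hypothetical reconstruction: joins the tokens of a data path
    // (array indices, field names). Every separator is the mapper's own,
    // except the one before the last token, which becomes '.', so the
    // last token reads as an extension (the level < tokenCount - 2 test).
    static String join(List<String> tokens, String separator) {
        StringBuilder sb = new StringBuilder();
        int tokenCount = tokens.size();
        for (int level = 0; level < tokenCount; level++) {
            sb.append(tokens.get(level));
            if (level < tokenCount - 1) {          // another token follows
                if (level < tokenCount - 2) {
                    sb.append(separator);          // mapper-specified separator
                } else {
                    sb.append('.');                // '.' before the last token
                }
            }
        }
        return sb.toString();
    }

    // Assumption: array indices are zero-padded to four digits.
    static String index(int i) {
        return String.format("%04d", i);
    }

    public static void main(String[] args) {
        // file result[][] with suffix=".pdt", element [4][1]
        System.out.println(join(List.of(index(4), index(1)), "_") + ".pdt"); // 0004.0001.pdt
        // OOPSOut result[][] whose file field is named "pdt", element [0][0]
        System.out.println(join(List.of(index(0), index(0), "pdt"), "_"));   // 0000_0000.pdt
        // file result[][][], element [4][5][1]
        System.out.println(join(List.of(index(4), index(5), index(1)), "_")); // 0004_0005.0001
    }
}
```

The three outputs match the names reported in this thread. They show that the "." is positional (before the last path element) and not tied to the suffix.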

Don't let the name fool you - simple_mapper is not the simplest thing in 
the world...

-- 


From hategan at mcs.anl.gov  Thu Feb 26 18:58:19 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 26 Feb 2009 18:58:19 -0600 (CST)
Subject: [Swift-devel] output format of simple_mapper
In-Reply-To: <Pine.LNX.4.64.0902270030240.1293@dildano.hawaga.org.uk>
Message-ID: <14520899.374411235696299617.JavaMail.root@zimbra>


----- Ben Clifford <benc at hawaga.org.uk> wrote:
> 
> AbstractMapper has this rather case:
> 

Monologue smelling of dialogue:
"Rather what?"
"Case!"


From benc at hawaga.org.uk  Thu Feb 26 19:24:34 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 27 Feb 2009 01:24:34 +0000 (GMT)
Subject: [Swift-devel] output format of simple_mapper
In-Reply-To: <14520899.374411235696299617.JavaMail.root@zimbra>
References: <14520899.374411235696299617.JavaMail.root@zimbra>
Message-ID: <Pine.LNX.4.64.0902270124110.23512@dildano.hawaga.org.uk>

> > 
> > AbstractMapper has this rather case:
> > 
> 
> Monologue smelling of dialogue:
> "Rather what?"
> "Case!"

well, there was an adjective there originally, but I removed it.

<polite cough>

-- 


From wilde at mcs.anl.gov  Thu Feb 26 20:51:24 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 26 Feb 2009 20:51:24 -0600
Subject: [Swift-devel] output format of simple_mapper
In-Reply-To: <13895265.374241235694269652.JavaMail.root@zimbra>
References: <13895265.374241235694269652.JavaMail.root@zimbra>
Message-ID: <49A7552C.202@mcs.anl.gov>


On 2/26/09 6:24 PM, Mihael Hategan wrote:
> I think the logic was that if you have a type path 
> (say a.b.c), it would be mapped to a_b.c, where the
> last element gives the extension. This was inspired by the analyze
> format, where we would usually have a struct "image {hdr, img}", so
> that mapper would magically end up naming files with the proper
> extension for that case.

I see an argument for a sprintf mapper here. But, as Ben suggested 
earlier, the whole mapper thing needs assessment and redesign.

The trick there will be keeping some amount of deprecatable backwards 
compatibility.
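As an illustration only, a "sprintf mapper" could let the user spell out the whole file-name pattern, so no implicit "_" versus "." rule applies. The sketch below is hypothetical: no such mapper exists in this thread, and the class and method names are invented. It uses Java's String.format to reproduce both names from the original question.

```java
public class SprintfNaming {
    // Hypothetical "sprintf mapper": the caller supplies the complete
    // pattern, so every separator and the extension are explicit.
    static String name(String pattern, Object... parts) {
        return String.format(pattern, parts);
    }

    public static void main(String[] args) {
        // Same name simple_mapper gives file result[][] with suffix ".pdt"
        System.out.println(name("output/%s/%04d.%04d.pdt", "T1di2", 4, 1));
        // Same name simple_mapper gives OOPSOut result[][] with field "pdt"
        System.out.println(name("output/%s/%04d_%04d.%s", "T3cpo", 0, 0, "pdt"));
    }
}
```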

> In your first case, given that the second index is the last
> element in the path, it will be separated by ".", and then you
> add ".pdt" to that.
> 
> In the second case, I assume in OOPSOut your field is named "pdt"
> and that ends up being the last element in the path.
> 
> If you were to try file result[][][] <...>, you would get names
> like: 0004_0005.0001

That would violate the principle of least astonishment ;)

> 
> Mihael
> 
> ----- Michael Wilde <wilde at mcs.anl.gov> wrote:
>> When I apply this mapping to a 2D array of files:
>>    file result[][] <simple_mapper;
>>                       prefix=@strcat("output/",p,"/"),suffix=".pdt">;
>>
>> then I get files like:
>>
>>    output/T1di2/0004.0001.pdt
>>
>> but when I apply this mapping to a 2D array of structs of files:
>>
>>    OOPSOut result[][] <simple_mapper; prefix=@strcat("output/",p,"/")>;
>>
>> then I get files like:
>>
>>    output/T3cpo/0000_0000.pdt
>>
>> Not a problem, just curious what motivated the difference (of sub1.sub2 
>> vs sub1_sub2)?
> 


From hategan at mcs.anl.gov  Thu Feb 26 21:04:48 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 26 Feb 2009 21:04:48 -0600 (CST)
Subject: [Swift-devel] output format of simple_mapper
In-Reply-To: <49A7552C.202@mcs.anl.gov>
Message-ID: <15448138.376321235703888961.JavaMail.root@zimbra>


----- Michael Wilde <wilde at mcs.anl.gov> wrote:
> 
> 
> On 2/26/09 6:24 PM, Mihael Hategan wrote:
> > I think the logic was that if you have a type path 
> > (say a.b.c), it would be mapped to a_b.c, where the
> > last element gives the extension. This was inspired by the analyze
> > format, where we would usually have a struct "image {hdr, img}", so
> > that mapper would magically end up naming files with the proper
> > extension for that case.
> 
> I see an argument for a sprintf mapper here. But like Ben suggested 
> earlier, the whole mapper thing needs assessment and redesign.
> 
> Trick there will be some amount of deprecatable backwards compat.
> 
> > In your first case, given that the second index is the last
> > element in the path, it will be separated by ".", and then you
> > add ".pdt" to that.
> > 
> > In the second case, I assume in OOPSOut your field is named "pdt"
> > and that ends up being the last element in the path.
> > 
> > If you were to try file result[][][] <...>, you would get names
> > like: 0004_0005.0001
> 
> That would violate the principle of least astonishment ;)

Except you did find it convenient when you used the struct, and the
extension just happened to be right :)

> 
> > 
> > Mihael
> > 
> > ----- Michael Wilde <wilde at mcs.anl.gov> wrote:
> >> When I apply this mapping to a 2D array of files:
> >>    file result[][] <simple_mapper;
> >>                       prefix=@strcat("output/",p,"/"),suffix=".pdt">;
> >>
> >> then I get files like:
> >>
> >>    output/T1di2/0004.0001.pdt
> >>
> >> but when I apply this mapping to a 2D array of structs of files:
> >>
> >>    OOPSOut result[][] <simple_mapper; prefix=@strcat("output/",p,"/")>;
> >>
> >> then I get files like:
> >>
> >>    output/T3cpo/0000_0000.pdt
> >>
> >> Not a problem, just curious what motivated the difference (of sub1.sub2 
> >> vs sub1_sub2)?
> > 


From benc at hawaga.org.uk  Fri Feb 27 04:02:59 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 27 Feb 2009 10:02:59 +0000 (GMT)
Subject: [Swift-devel] Re: [Swift-user] assigning file variables
In-Reply-To: <Pine.LNX.4.64.0902261545170.23512@dildano.hawaga.org.uk>
References: <49A55309.4050100@mcs.anl.gov>
	<Pine.LNX.4.64.0902251521230.1293@dildano.hawaga.org.uk>
	<1235576422.17806.4.camel@localhost>
	<Pine.LNX.4.64.0902251545300.23512@dildano.hawaga.org.uk>
	<49A632FE.8070906@mcs.anl.gov> <49A63F3F.10504@mcs.anl.gov>
	<Pine.LNX.4.64.0902261051540.23512@dildano.hawaga.org.uk>
	<49A6B5CF.8030505@mcs.anl.gov>
	<Pine.LNX.4.64.0902261545170.23512@dildano.hawaga.org.uk>
Message-ID: <Pine.LNX.4.64.0902270952560.23512@dildano.hawaga.org.uk>


On Thu, 26 Feb 2009, Ben Clifford wrote:

> This style of piecewise assignment to arrays plays merry hell with 
> trying to do data-dependent ordering in a way that I think is not easily 
> resolvable; and anyone trying to do anything at all interesting with 
> arrays gets hit by strange things happening - "I know i've assigned 
> everything but somehow the next stage isn't running".

A different way of looking at this:

Why is it that Swift can have the 'close array returned from a procedure 
call' behaviour which made you move code out of the loop body and into a 
procedure?

It's because from the calling code, the procedure call looks like a single 
assignment:

  file a[] = foo();

or when accessing sub-arrays:

  file a[][];
  a[7] = foo();

We know a[7], which is an entire array, has its entire value because that 
assignment is the only place that a[7] can have its elements assigned - 
there is a single statement which assigns its entire value.
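A rough model of that argument in Java follows, as a sketch under stated assumptions and not Swift's implementation. An array variable is treated as a write-once future (the class and method names are hypothetical). Because the only way to give it a value is one whole assignment, completing that assignment is also what closes it, and a waiting downstream stage can proceed. Piecewise a[i] = ... assignments have no such single completing event.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class SingleAssignmentArray {
    // Hypothetical model of an array variable as a write-once future:
    // it is "closed" exactly when its single assigning statement completes.
    static final class ArrayVar {
        private final CompletableFuture<List<String>> value = new CompletableFuture<>();

        // The only way to give the array a value: assign all of it, once.
        void assign(List<String> elements) {
            if (!value.complete(List.copyOf(elements))) {
                throw new IllegalStateException("array already assigned");
            }
        }

        boolean isClosed() {
            return value.isDone();
        }

        // A downstream stage blocks here until the array is closed.
        List<String> read() {
            return value.join();
        }
    }

    // Stands in for a procedure like foo(): to the caller it is one assignment.
    static List<String> foo() {
        return List.of("f0.out", "f1.out");
    }

    public static void main(String[] args) {
        // file a[] = foo();
        ArrayVar a = new ArrayVar();
        CompletableFuture<Integer> nextStage =
                CompletableFuture.supplyAsync(() -> a.read().size());
        a.assign(foo());
        System.out.println(nextStage.join()); // 2

        // file a[][]; a[7] = foo();  -- a[7] closes on its own
        ArrayVar[] outer = new ArrayVar[8];
        for (int i = 0; i < outer.length; i++) {
            outer[i] = new ArrayVar();
        }
        outer[7].assign(foo());
        System.out.println(outer[7].isClosed() + " " + outer[0].isClosed()); // true false
    }
}
```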

-- 


From benc at hawaga.org.uk  Fri Feb 27 06:13:56 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 27 Feb 2009 12:13:56 +0000 (GMT)
Subject: [Swift-devel] Re: [Swift-user] assigning file variables
In-Reply-To: <49A6B5CF.8030505@mcs.anl.gov>
References: <49A55309.4050100@mcs.anl.gov>
	<Pine.LNX.4.64.0902251521230.1293@dildano.hawaga.org.uk>
	<1235576422.17806.4.camel@localhost>
	<Pine.LNX.4.64.0902251545300.23512@dildano.hawaga.org.uk>
	<49A632FE.8070906@mcs.anl.gov> <49A63F3F.10504@mcs.anl.gov>
	<Pine.LNX.4.64.0902261051540.23512@dildano.hawaga.org.uk>
	<49A6B5CF.8030505@mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0902271016220.23512@dildano.hawaga.org.uk>


On Thu, 26 Feb 2009, Michael Wilde wrote:

> But thats one of the critical things here. I seem to bump into this 
> limitation frequently. Does language consistency require these 
> limitations on setting mappings, or is it an implementation issue that 
> can be lifted? Is it the case that mapping does not affect data flow 
> semantics?

From a high-level perspective, I don't think the language requires much 
about what is mapped where and based on what.

The present implementations of the mappers and the framework surrounding 
those mappers compel stricter requirements.

For example, at present, an entire data structure rooted at some variable 
declaration is regarded as either "not mapped" or "mapped" in its entirety 
- that is either the mapper is not yet initialized, and so any attempts to 
ask it about the data structure it maps must be deferred; or it is 
initialized and therefore can answer authoritatively about any part of 
that data structure.

This is pretty much what is meant by "Swift does not have streaming 
mappers".
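The "mapped or not mapped in its entirety" behaviour can be sketched as follows. This is a hypothetical Java model, not the Swift mapper framework: one mapper covers a whole structure, queries made before initialization are deferred, and after initialization the mapper answers for any path. The names AllOrNothingMapper, lookup and initialize, and the path syntax, are assumptions.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

public class AllOrNothingMapper {
    // Hypothetical model: one mapper covers a whole data structure rooted at
    // a variable declaration. Until it is initialized, every query is deferred;
    // once initialized, it answers authoritatively for any path.
    private final CompletableFuture<Function<String, String>> mapping = new CompletableFuture<>();

    // Ask where a path such as "[4][1]" maps; completes only after initialize().
    CompletableFuture<String> lookup(String path) {
        return mapping.thenApply(m -> m.apply(path));
    }

    // Initialization happens once, for the entire structure.
    void initialize(Function<String, String> pathToFile) {
        mapping.complete(pathToFile);
    }

    boolean isInitialized() {
        return mapping.isDone();
    }

    public static void main(String[] args) {
        AllOrNothingMapper mapper = new AllOrNothingMapper();
        CompletableFuture<String> pending = mapper.lookup("[4][1]");
        System.out.println(pending.isDone());                // false: deferred
        mapper.initialize(path -> "output" + path + ".pdt");
        System.out.println(pending.join());                  // output[4][1].pdt
        System.out.println(mapper.lookup("[0][0]").join());  // output[0][0].pdt
    }
}
```

A streaming mapper, by contrast, would need an intermediate state in which some paths can already be answered while others are still unknown. This model deliberately has no such state.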

What you propose, being able to map some subpiece of a data structure 
programmatically, ties in closely with the 'streaming mapper' concept, I 
think. The 'stream of new things' comes perhaps from some ongoing 
external process, some deliberate rate limiting, or from ongoing 
evaluation of other pieces of SwiftScript that come up with new mappings.

> For starters (and feel free to move this to a new thread), do you feel
> comfortable with the current model of var, dsHandle, and by-value-like
> assignment?
> 
> I would like to see a more Java-like model with a var being a typed pointer or
> scalar value holder, and structs and arrays being dynamic objects, and files
> being special vars with mapping and state.

That's very much what SwiftScript has now. Can you describe what you 
perceive to be the salient differences?

> I dont feel that we have yet adequately described the model, neither for a CS
> paper *nor* for the programmer.  I think that a good start is to write a data
> model description (in the user guide, in a detailed "skip this on first
> reading" section, that specifies the data model in
> language-reference-specification fashion).

Yes, I think getting a more rigorous description of what actually happens, 
warts and all, would be useful.

I think aiming such a description at a CS paper is the wrong way to go - 
we need to be adding copious warts and ugliness to the description, making 
it mind-numbingly tedious to read, not coughing politely and deleting such 
paragraphs to make a paper that a wider audience is interested in.

-- 


From wilde at mcs.anl.gov  Fri Feb 27 08:55:53 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 27 Feb 2009 08:55:53 -0600
Subject: [Swift-devel] Re: [Swift-user] assigning file variables
In-Reply-To: <Pine.LNX.4.64.0902270952560.23512@dildano.hawaga.org.uk>
References: <49A55309.4050100@mcs.anl.gov>
	<Pine.LNX.4.64.0902251521230.1293@dildano.hawaga.org.uk>
	<1235576422.17806.4.camel@localhost>
	<Pine.LNX.4.64.0902251545300.23512@dildano.hawaga.org.uk>
	<49A632FE.8070906@mcs.anl.gov> <49A63F3F.10504@mcs.anl.gov>
	<Pine.LNX.4.64.0902261051540.23512@dildano.hawaga.org.uk>
	<49A6B5CF.8030505@mcs.anl.gov>
	<Pine.LNX.4.64.0902261545170.23512@dildano.hawaga.org.uk>
	<Pine.LNX.4.64.0902270952560.23512@dildano.hawaga.org.uk>
Message-ID: <49A7FEF9.9020508@mcs.anl.gov>

That's a good explanation, worth adding to the user guide text on this topic.

I think on first read one (e.g., me ;) misses the subtlety that the 
foreach is special in how the array is treated within it, and that 
outside foreach statements, arrays need to be closed.

The challenge will be to see in general how feasible it is to code in 
such a way that you always do a close via a procedure return. My guess 
is it will be feasible, but may lead to more procedures than a user 
might use in an imperative style.

That's not bad, as long as coding is easy and the resulting code is clear.

We're getting some good experience now as we build libraries for CNARI, 
OOPS, DOCK and more.

Related: what's a suggested debugging technique when "the next stage isn't 
running"? That's exactly what happened to me, and it's one of the harder 
things in Swift to debug.

I noticed by chance the other day that Swift seems to read debugging 
commands of some sort from stdin? I may have missed it, but what are 
these, and can users use them to find the state of a stuck script?

On 2/27/09 4:02 AM, Ben Clifford wrote:
> On Thu, 26 Feb 2009, Ben Clifford wrote:
> 
>> This style of piecewise assignment to arrays plays merry hell with 
>> trying to do data-dependent ordering in a way that I think is not easily 
>> resolvable; and anyone trying to do anything at all interesting with 
>> arrays gets hit by strange things happening - "I know i've assigned 
>> everything but somehow the next stage isn't running".
> 
> A different way of looking at this:
> 
> Why is it that Swift can have the 'close array returned from a procedure 
> call' behaviour which made you move code out of the loop body and into a 
> procedure?
> 
> Its because from the calling code, the procedure call looks like a single 
> assignment:
> 
>   file a[] = foo();
> 
> or when accessing sub-arrays:
> 
>   file a[][];
>   a[7] = foo();
> 
> We know a[7], which is an entire array, has its entire value because that 
> assignment is the only place that a[7] can have its elements assigned - 
> there is a single statement which assigns its entire value.
> 


From benc at hawaga.org.uk  Fri Feb 27 08:57:22 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 27 Feb 2009 14:57:22 +0000 (GMT)
Subject: [Swift-devel] log processing in main tree
Message-ID: <Pine.LNX.4.64.0902271455120.23512@dildano.hawaga.org.uk>


I've rearranged the log-processing/ SVN directory so that its contents are 
in two directories:

 bin/  (2 commands)

 libexec/log-processing/

with the notes in the previous README file moved over to the existing 
docs/log-processing/ module.

Placing the log-processing code under a subdirectory of libexec keeps the 
many files there nicely separated from other libexec stuff.

What I'd like to do for 0.9 is move that tree as-is into the trunk/ module 
so that all Swift builds have this stuff, rather than it being a 
separate SVN checkout.

-- 


From benc at hawaga.org.uk  Fri Feb 27 09:06:25 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 27 Feb 2009 15:06:25 +0000 (GMT)
Subject: [Swift-devel] Re: [Swift-user] assigning file variables
In-Reply-To: <49A7FEF9.9020508@mcs.anl.gov>
References: <49A55309.4050100@mcs.anl.gov>
	<Pine.LNX.4.64.0902251521230.1293@dildano.hawaga.org.uk>
	<1235576422.17806.4.camel@localhost>
	<Pine.LNX.4.64.0902251545300.23512@dildano.hawaga.org.uk>
	<49A632FE.8070906@mcs.anl.gov> <49A63F3F.10504@mcs.anl.gov>
	<Pine.LNX.4.64.0902261051540.23512@dildano.hawaga.org.uk>
	<49A6B5CF.8030505@mcs.anl.gov>
	<Pine.LNX.4.64.0902261545170.23512@dildano.hawaga.org.uk>
	<Pine.LNX.4.64.0902270952560.23512@dildano.hawaga.org.uk>
	<49A7FEF9.9020508@mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0902271459150.23512@dildano.hawaga.org.uk>


On Fri, 27 Feb 2009, Michael Wilde wrote:

> The challenge will be to see in general how feasible it is to code in such a
> way that you always to a close via a procedure return. My guess is it will be
> feasible, but may lead to more procedures than a user might use in an
> imperative style.

I'd rather not force a separate procedure style; that's part of my 
argument for having iteration constructs that are sympathetic to single 
assignment analysis rather than being almost perfectly anti-sympathetic.

> related: whats a suggested debugging technique when "the next stage isnt
> running"? Thats exactly what happened to me, and one of the harder things in
> swift to debug.
> 
> I noticed by chance the other day that swift seems to read debugging commands
> of some sort from stdin? I may have missed it, but what are these, and can
> users use them to find the state of a stuck script?

There are two commands: v and t, to show waiting variables and threads. I'm 
not sure how useful their output is in your case. It's not really a 
public interface, but you might be able to make sense of it.

There is much in place for easy debugging of dataflow-based hangs like 
this - previously, I've put effort into tightening the analysis so hangs 
don't happen, rather than into detecting and reporting hangs; and I'd like 
to continue in that trend (although at the moment, the next useful step 
there is to remove [index] based assignment entirely...)

-- 


From wilde at mcs.anl.gov  Fri Feb 27 09:21:33 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 27 Feb 2009 09:21:33 -0600
Subject: [Swift-devel] Re: [Swift-user] assigning file variables
In-Reply-To: <Pine.LNX.4.64.0902271459150.23512@dildano.hawaga.org.uk>
References: <49A55309.4050100@mcs.anl.gov>
	<Pine.LNX.4.64.0902251521230.1293@dildano.hawaga.org.uk>
	<1235576422.17806.4.camel@localhost>
	<Pine.LNX.4.64.0902251545300.23512@dildano.hawaga.org.uk>
	<49A632FE.8070906@mcs.anl.gov> <49A63F3F.10504@mcs.anl.gov>
	<Pine.LNX.4.64.0902261051540.23512@dildano.hawaga.org.uk>
	<49A6B5CF.8030505@mcs.anl.gov>
	<Pine.LNX.4.64.0902261545170.23512@dildano.hawaga.org.uk>
	<Pine.LNX.4.64.0902270952560.23512@dildano.hawaga.org.uk>
	<49A7FEF9.9020508@mcs.anl.gov>
	<Pine.LNX.4.64.0902271459150.23512@dildano.hawaga.org.uk>
Message-ID: <49A804FD.9050101@mcs.anl.gov>

On 2/27/09 9:06 AM, Ben Clifford wrote:

> There is much in place for easy debugging of dataflow-based hangs like 
> this - previously, I've put effort into tightening the analysis so hangs 
> don't happen, rather than into detecting and reporting hangs; and I'd like 
> to continue in that trend (although at the moment, the next useful step 
> there is to remove [index] based assignment entirely...)

I assume you meant "there is not much in place", and that's fine. Your 
approach sounds good; let's see where it leads.

The question above leads to the interesting research topic of "how to 
show the state of, and debug, concurrent functional programs". I suspect 
there's some (much?) work on that out there.


From benc at hawaga.org.uk  Fri Feb 27 09:22:58 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 27 Feb 2009 15:22:58 +0000 (GMT)
Subject: [Swift-devel] Re: [Swift-user] assigning file variables
In-Reply-To: <49A804FD.9050101@mcs.anl.gov>
References: <49A55309.4050100@mcs.anl.gov>
	<Pine.LNX.4.64.0902251521230.1293@dildano.hawaga.org.uk>
	<1235576422.17806.4.camel@localhost>
	<Pine.LNX.4.64.0902251545300.23512@dildano.hawaga.org.uk>
	<49A632FE.8070906@mcs.anl.gov> <49A63F3F.10504@mcs.anl.gov>
	<Pine.LNX.4.64.0902261051540.23512@dildano.hawaga.org.uk>
	<49A6B5CF.8030505@mcs.anl.gov>
	<Pine.LNX.4.64.0902261545170.23512@dildano.hawaga.org.uk>
	<Pine.LNX.4.64.0902270952560.23512@dildano.hawaga.org.uk>
	<49A7FEF9.9020508@mcs.anl.gov>
	<Pine.LNX.4.64.0902271459150.23512@dildano.hawaga.org.uk>
	<49A804FD.9050101@mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0902271522310.23512@dildano.hawaga.org.uk>


On Fri, 27 Feb 2009, Michael Wilde wrote:

> I assume you meant "there is not much in place", and thats fine. Your approach
> sounds good, lets see where it leads.

yes: 'not much'.

-- 


From hategan at mcs.anl.gov  Fri Feb 27 09:58:12 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 27 Feb 2009 09:58:12 -0600
Subject: [Swift-devel] Re: [Swift-user] assigning file variables
In-Reply-To: <49A804FD.9050101@mcs.anl.gov>
References: <49A55309.4050100@mcs.anl.gov>
	<Pine.LNX.4.64.0902251521230.1293@dildano.hawaga.org.uk>
	<1235576422.17806.4.camel@localhost>
	<Pine.LNX.4.64.0902251545300.23512@dildano.hawaga.org.uk>
	<49A632FE.8070906@mcs.anl.gov> <49A63F3F.10504@mcs.anl.gov>
	<Pine.LNX.4.64.0902261051540.23512@dildano.hawaga.org.uk>
	<49A6B5CF.8030505@mcs.anl.gov>
	<Pine.LNX.4.64.0902261545170.23512@dildano.hawaga.org.uk>
	<Pine.LNX.4.64.0902270952560.23512@dildano.hawaga.org.uk>
	<49A7FEF9.9020508@mcs.anl.gov>
	<Pine.LNX.4.64.0902271459150.23512@dildano.hawaga.org.uk>
	<49A804FD.9050101@mcs.anl.gov>
Message-ID: <1235750292.1344.6.camel@localhost>

On Fri, 2009-02-27 at 09:21 -0600, Michael Wilde wrote:
> The question above leads to the interesting research topic of "how to 
> show the state of, and debug, concurrent functional programs". I suspect 
> theres some (much?) work on that out there.
> 

Not that much. Debugging lazy languages is a known difficulty. Debugging
future-based languages, like Swift, I don't know.

However, I think that the "who's waiting on what" information can be
presented to the user in such a way as to make it more clear where
problems are.
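One possible shape for such a "who's waiting on what" summary is sketched below. It is hypothetical: it is not the output of the v or t commands, and the input format (a map from variable name to the stages blocked on it) is an assumption. The idea is to list only variables that are still open but have waiters, since those are the likely causes of a hang.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class WaitReport {
    // Hypothetical input: for each variable, the stages blocked waiting on it.
    // Reports only variables that are still not closed and have waiters.
    static String report(Map<String, List<String>> waiters, Set<String> closed) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, List<String>> e : waiters.entrySet()) {
            if (!closed.contains(e.getKey()) && !e.getValue().isEmpty()) {
                sb.append(e.getKey())
                  .append(" is not closed; waiting: ")
                  .append(String.join(", ", e.getValue()))
                  .append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, List<String>> waiters = new LinkedHashMap<>();
        waiters.put("a", List.of("stage2"));
        waiters.put("b", List.of("stage3", "stage4"));
        // prints: b is not closed; waiting: stage3, stage4
        System.out.print(report(waiters, Set.of("a")));
    }
}
```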


From bugzilla-daemon at mcs.anl.gov  Fri Feb 27 20:06:32 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Fri, 27 Feb 2009 20:06:32 -0600 (CST)
Subject: [Swift-devel] [Bug 179] New: coaster request throttling and
	(currentWorkers <0)
Message-ID: <bug-179-21@http.bugzilla.mcs.anl.gov/swift/>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=179

           Summary: coaster request throttling and (currentWorkers <0)
           Product: Swift
           Version: unspecified
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Log processing and plotting
        AssignedTo: hategan at mcs.anl.gov
        ReportedBy: aespinosa at cs.uchicago.edu


The number of currentWorkers becomes < 0. This has an impact on how coasters
get throttled.

In an example session, the LRM shows 18-20 "make coaster" requests being
created (4 at the start, then 16-18 after 5 minutes). With coastersPerNode set
to 16, that gives a 320-processor allocation. This is more than
MAX_WORKERS (~256) and more than the maximum score possible from my sites.xml
(102 max):

   <profile namespace="karajan" key="initialScore">1</profile>
   <profile namespace="karajan" key="jobThrottle">1</profile>


2009-02-25 20:31:15,590-0600 INFO  Worker Worker stderr: null
2009-02-25 20:31:15,590-0600 WARN  WorkerManager Worker terminated: Worker[-1909333457]
2009-02-25 20:31:15,590-0600 WARN  Worker Worker 335457820 status change: Completed
2009-02-25 20:31:15,590-0600 INFO  Worker Worker stdout: Job You has completed.
Writing job STDOUT and STDERR to cache files.
Returning job success.

2009-02-25 20:31:15,590-0600 INFO  Worker Worker stderr: null
2009-02-25 20:31:15,590-0600 WARN  WorkerManager Worker terminated: Worker[335457820]
******2009-02-25 20:31:15,742-0600 INFO  WorkerManager Current workers: -32****
2009-02-25 20:31:15,745-0600 INFO  WorkerManager Ready: {}
2009-02-25 20:31:15,745-0600 INFO  WorkerManager Busy: [Worker[-1260987422], Worker[2142641145], Worker[2053757208
2009-02-25 20:31:15,751-0600 INFO  WorkerManager Requested: {640597733=Worker[640597733], -692025578=Worker[-69202
2009-02-25 20:31:15,751-0600 INFO  WorkerManager Starting: [Task(type=JOB_SUBMISSION, identity=urn:1235615211813-1
2009-02-25 20:31:15,752-0600 INFO  WorkerManager Ids: {1078934147=Worker[1078934147], 264613139=Worker[264613139],
2009-02-25 20:31:15,753-0600 INFO  WorkerManager AllocationR: [org.globus.cog.abstraction.coaster.service.job.mana
2009-02-25 20:31:15,873-0600 INFO  AbstractKarajanChannel SC-null REQ:
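The report does not say why the count goes negative. One plausible cause, which is an assumption and not a diagnosis, is a worker termination being counted more than once, or counted for a worker never counted as started. A hypothetical way to make the count unable to go below zero is to derive it from the set of live worker ids instead of a bare counter, as in this Java sketch (the class and method names are invented, not the WorkerManager API).

```java
import java.util.HashSet;
import java.util.Set;

public class WorkerCount {
    // Hypothetical: "current workers" is the size of a set of live worker
    // ids, so duplicate or unmatched termination events cannot drive the
    // count below zero.
    private final Set<Integer> live = new HashSet<>();

    synchronized void started(int workerId) {
        live.add(workerId);
    }

    synchronized void terminated(int workerId) {
        live.remove(workerId); // no-op if already removed or never seen
    }

    synchronized int current() {
        return live.size();
    }

    public static void main(String[] args) {
        WorkerCount c = new WorkerCount();
        c.started(335457820);
        c.terminated(335457820);
        c.terminated(335457820);   // duplicate event: ignored
        c.terminated(-1909333457); // never recorded as started: ignored
        System.out.println(c.current()); // 0
    }
}
```

Such a guard would hide the symptom. It would still be worth finding where the extra decrements come from.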


-- 
Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From bugzilla-daemon at mcs.anl.gov  Fri Feb 27 20:09:45 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Fri, 27 Feb 2009 20:09:45 -0600 (CST)
Subject: [Swift-devel] [Bug 180] New: multi-node coasters?
Message-ID: <bug-180-21@http.bugzilla.mcs.anl.gov/swift/>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=180

           Summary: multi-node coasters?
           Product: Swift
           Version: unspecified
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: Specific site issues
        AssignedTo: benc at hawaga.org.uk
        ReportedBy: aespinosa at cs.uchicago.edu


In a 1-coaster-per-node configuration, site policies sometimes allow you to
submit only a limited number of jobs to the queue (e.g. 50). Thus, even though
the score can reach up to 102, the maximum number of active jobs at a site is
only 50.

This can be worked around by requesting 2 or more nodes in a single job
submission. We can use pbsdsh or an equivalent in the LRMs. Using mpirun could
also be explored.

