From benc at hawaga.org.uk Sun Feb 1 11:36:14 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sun, 1 Feb 2009 17:36:14 +0000 (GMT)
Subject: [Swift-devel] [VOTE] Expanding arrays in app function command lines
Message-ID:

it being slightly unclear in my mind whether the below discussed change
was generally approved of, here is a more formal request for
clarification.

the change that we talked about is in this thread:

Subject: Re: [Swift-user] Expanding arrays in app function command lines

the proposal (which I sent a patch for) is to change the handling of app
parameters to expand string arrays into multiple command line arguments.

Vote as in http://dev.globus.org/wiki/Guidelines#Action_Item_Votes

--

From hategan at mcs.anl.gov Sun Feb 1 11:39:17 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sun, 01 Feb 2009 11:39:17 -0600
Subject: [Swift-devel] [VOTE] Expanding arrays in app function command lines
In-Reply-To:
References:
Message-ID: <1233509957.12878.0.camel@localhost>

+1

On Sun, 2009-02-01 at 17:36 +0000, Ben Clifford wrote:
> it being slightly unclear in my mind whether the below discussed change
> was generally approved of, here is a more formal request for
> clarification.
>
> the change that we talked about is in this thread:
>
> Subject: Re: [Swift-user] Expanding arrays in app function command lines
>
> the proposal (which I sent a patch for) is to change the handling of app
> parameters to expand string arrays into multiple command line arguments.
>
> Vote as in http://dev.globus.org/wiki/Guidelines#Action_Item_Votes
>

From benc at hawaga.org.uk Sun Feb 1 18:58:11 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 2 Feb 2009 00:58:11 +0000 (GMT)
Subject: [Swift-devel] swift changing walltime of prews-gram jobs
In-Reply-To: <1233334434.14201.3.camel@localhost>
References: <50b07b4b0901241503r72f28b96rec19583bb8044ea1@mail.gmail.com> <1233334434.14201.3.camel@localhost>
Message-ID:

On Fri, 30 Jan 2009, Mihael Hategan wrote:

> I suppose the PBS provider could adopt the same scheme.

The stuff you put in r2266 almost worked for me. r2270 adds a missing newline, and it now seems to run correctly and with the right walltime.

--

From benc at hawaga.org.uk Mon Feb 2 11:38:56 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 2 Feb 2009 17:38:56 +0000 (GMT)
Subject: [Swift-devel] Coasters failing on Teraport - cant find Java?
In-Reply-To: <497FE73F.7000307@mcs.anl.gov>
References: <50b07b4b0901241503r72f28b96rec19583bb8044ea1@mail.gmail.com> <497F637D.5080707@mcs.anl.gov> <1233113611.2159.25.camel@localhost> <497FE73F.7000307@mcs.anl.gov>
Message-ID:

On Tue, 27 Jan 2009, Michael Wilde wrote:

> 1) On OSG sites, the jobmanager(s) are modified to insert OSG env vars and set
> the PATH to contain OSG stuff. So if you do a globus-job-run of

This isn't universal OSG behaviour. Some sites give you
PATH=/bin:/usr/bin

--

From hategan at mcs.anl.gov Mon Feb 2 11:58:11 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 02 Feb 2009 11:58:11 -0600
Subject: [Swift-devel] Coasters failing on Teraport - cant find Java?
In-Reply-To:
References: <50b07b4b0901241503r72f28b96rec19583bb8044ea1@mail.gmail.com> <497F637D.5080707@mcs.anl.gov> <1233113611.2159.25.camel@localhost> <497FE73F.7000307@mcs.anl.gov>
Message-ID: <1233597491.22200.3.camel@localhost>

On Mon, 2009-02-02 at 17:38 +0000, Ben Clifford wrote:
> On Tue, 27 Jan 2009, Michael Wilde wrote:
>
> > 1) On OSG sites, the jobmanager(s) are modified to insert OSG env vars and set
> > the PATH to contain OSG stuff. So if you do a globus-job-run of
>
> This isn't universal OSG behaviour. Some sites give you
> PATH=/bin:/usr/bin
>

Which happens to be useless.

I suppose, for those sites, we need to have an option to explicitly set where Java is, if that doesn't already work somehow.

From bugzilla-daemon at mcs.anl.gov Tue Feb 3 05:24:30 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Tue, 3 Feb 2009 05:24:30 -0600 (CST)
Subject: [Swift-devel] [Bug 173] New: poor syntax error missing close parentheses at end of procedure invocation
Message-ID:

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=173

Summary: poor syntax error missing close parentheses at end of procedure invocation
Product: Swift
Version: unspecified
Platform: Macintosh
OS/Version: Mac OS
Status: NEW
Severity: normal
Priority: P2
Component: SwiftScript language
AssignedTo: benc at hawaga.org.uk
ReportedBy: benc at hawaga.org.uk

The below fragment is missing ) at the end of the trace invocation. The antlr generated parser fails to parse this, giving the confusing message below. It would be better if the error was more related to the missing )

$ cat rw.swift
type file;
file s <"muppet.gif">;
trace(@regexp(@filename(s),"gif","jpg");
$ swift rw.swift
Could not compile SwiftScript source: line 6:1: unexpected token: trace

--
Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You reported the bug, or are watching the reporter.
You are the assignee for the bug, or are watching the assignee.

From benc at hawaga.org.uk Tue Feb 3 05:26:36 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 3 Feb 2009 11:26:36 +0000 (GMT)
Subject: [Swift-devel] regexp_mapper redundant?
Message-ID:

It might be the case now that many of the mappings provided by regexp_mapper can be implemented using single_filename_mapper and @regexp and @filename.

Previously this did not work, I think because of lack of dataflow dependency handling in mapper parameters. However, that handling is in place now.

So perhaps it is the case that:

file f ;

is the same as:

file f ;

This isn't quite a complete replacement, because the @regexp function doesn't seem to support substitution groups like \1
It perhaps could be made to, though.

--

From benc at hawaga.org.uk Tue Feb 3 05:45:06 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 3 Feb 2009 11:45:06 +0000 (GMT)
Subject: [Swift-devel] "type file;" by default
Message-ID:

Pretty much every simple SwiftScript program that I write, I find myself putting in "type file;" at the start, and avoiding "marker types" of the form:

type picturefile;

and thus ignoring application-level type checking (checking that I'm not feeding a picture into a text processing app, and the like) whilst still taking advantage of other swift type checking.

To simplify low-end uses of the language, it might be useful to have the above "type file;" defined as a built-in type.

This has been discussed before, but I'd like to know what people's opinions are.

--

From wilde at mcs.anl.gov Tue Feb 3 08:36:30 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 03 Feb 2009 08:36:30 -0600
Subject: [Swift-devel] Coasters failing on Teraport - cant find Java?
In-Reply-To: <1233597491.22200.3.camel@localhost>
References: <50b07b4b0901241503r72f28b96rec19583bb8044ea1@mail.gmail.com> <497F637D.5080707@mcs.anl.gov> <1233113611.2159.25.camel@localhost> <497FE73F.7000307@mcs.anl.gov> <1233597491.22200.3.camel@localhost>
Message-ID: <4988566E.2070700@mcs.anl.gov>

The approach I'm testing is this:

if user has a .coasterinit file
  source it to put java in PATH
else if java is in PATH
  use it
else source /etc/profile

(executed under a non-login shell, i.e. never use /bin/sh -l)

Right now I have the above in a different order (.coasterinit last) and it works on ranger, mercury and teraport.

.coasterinit is a more flexible alternative to a per-site option that points to java. I'm not sure which is better.

On 2/2/09 11:58 AM, Mihael Hategan wrote:
> On Mon, 2009-02-02 at 17:38 +0000, Ben Clifford wrote:
>> On Tue, 27 Jan 2009, Michael Wilde wrote:
>>
>>> 1) On OSG sites, the jobmanager(s) are modified to insert OSG env vars and set
>>> the PATH to contain OSG stuff. So if you do a globus-job-run of
>> This isn't universal OSG behaviour. Some sites give you
>> PATH=/bin:/usr/bin
>>
>
> Which happens to be useless.
>
> I suppose, for those sites, we need to have an option to explicitly set
> where Java is, if that doesn't already work somehow.
>

From wilde at mcs.anl.gov Tue Feb 3 08:44:47 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 03 Feb 2009 08:44:47 -0600
Subject: [Swift-devel] Coasters failing on Teraport - cant find Java?
In-Reply-To: <1233340920.18750.1.camel@localhost>
References: <50b07b4b0901241503r72f28b96rec19583bb8044ea1@mail.gmail.com> <497F637D.5080707@mcs.anl.gov> <1233113611.2159.25.camel@localhost> <497FE73F.7000307@mcs.anl.gov> <1233340920.18750.1.camel@localhost>
Message-ID: <4988585F.9020408@mcs.anl.gov>

I didn't see this message till now. I'll compare this to the approach I was testing (see previous message) and see what works where.
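
For reference, the lookup order from the previous message (.coasterinit first, then PATH, then /etc/profile) might be sketched as a shell function along these lines. This is only a sketch under stated assumptions, not the code in bootstrap.sh: the function name find_java and the COASTER_INIT and PROFILE override variables are made up for illustration.

```shell
#!/bin/sh
# Sketch only: find_java, COASTER_INIT and PROFILE are illustrative names,
# not part of the real coaster bootstrap script.

find_java() {
    init_file=${COASTER_INIT:-$HOME/.coasterinit}
    profile=${PROFILE:-/etc/profile}

    if [ -f "$init_file" ]; then
        # 1. A user-provided init file is expected to put java in PATH.
        . "$init_file"
    elif command -v java >/dev/null 2>&1; then
        # 2. java is already in PATH (e.g. set up by the jobmanager).
        :
    elif [ -f "$profile" ]; then
        # 3. Last resort: source the system profile from this non-login shell.
        . "$profile"
    fi

    # Print where java resolves to; returns non-zero if it is still missing.
    command -v java
}
```

Calling find_java directly (rather than in a subshell) leaves any PATH changes from the sourced file in effect for the rest of the script, which is what a bootstrap script would presumably want before launching Java.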
- Mike

On 1/30/09 12:42 PM, Mihael Hategan wrote:
> Cog r2267 contains a tentative fix for this. The bootstrap script is
> started without -l, and if java cannot be found, it attempts to get that
> information using bash -l.
>
> I haven't tested it.
>
> On Tue, 2009-01-27 at 23:03 -0600, Michael Wilde wrote:
>> I dug a bit deeper. As far as I can tell, this is what's happening:
>>
>> 1) On OSG sites, the jobmanager(s) are modified to insert OSG env vars
>> and set the PATH to contain OSG stuff. So if you do a globus-job-run of
>> /usr/bin/printenv (i.e. with no shell) you see all this, including java
>> in the path (from an osg dir).
>>
>> 2) when you globus-job-run /bin/sh, all this stays around, but
>>
>> 3) when you globus-job-run /bin/sh with -l, it runs /etc/profile, which
>> un-does the path and LD_LIBRARY_PATH, setting PATH to some default and
>> LD_LIBRARY_PATH to null. I *think* this is being done by softenv which
>> runs from /etc/profile.d, called at the end of /etc/profile.
>>
>> You can simulate this with:
>>
>> globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -c "which java; source
>> /etc/profile; which java" (or try printenv instead of which java to see
>> the details)
>>
>> So bottom line: there's at least two cases where -l hurts, this one, and
>> abe, where attempts to run login shells from globus are thwarted.
>>
>> If the purpose of -l was just to get java in the path, then
>> for OSG sites that behave like teraport, just omitting -l should work,
>> because the OSG jobmanager mods put it in the path.
>>
>> For sites like abe, bypassing -l, and forcing the user to put Java in
>> the path with a .bashrc or equivalent, may work. (The hack I used on abe
>> was to remove the -l arg, and insert this in bootstrap.sh:
>>
>> +if [ -f ~/.myetcprofile ]; then
>> + source ~/.myetcprofile
>> +else
>> + source /etc/profile
>> +fi
>>
>> One option is to accept a per-site option from sites.xml to bypass "-l"
>> on the startup shell, and insert the logic above for something like
>> .coasterinit, sourcing that if the user provides it.
>>
>> Another option is to put a +java line in the OSG .soft file on TeraPort.
>>
>> It's possible this problem only exists on the few sites like teraport that
>> run both OSG mods and softenv???
>>
>> I think we need to test coasters broadly across OSG to be sure (Ben's IP
>> problem is a case in point). But a simple shell test across all the OSG
>> VO sites could detect whether Java will be there or not, with and
>> without -l.
>>
>> - Mike
>>
>>
>> On 1/27/09 9:33 PM, Mihael Hategan wrote:
>>> Hmm. Looks like -l has the opposite effect of what I thought it should
>>> do (end up with an environment equivalent to the one you get in when you
>>> log in as an interactive session). Is it my misunderstanding or
>>> something else?
>>>
>>> On Tue, 2009-01-27 at 13:41 -0600, Michael Wilde wrote:
>>>> Related to: Re: [Swift-devel] swift changing walltime of prews-gram jobs
>>>>
>>>> I can't get a Swift script to run on coasters on TeraPort in gt2:gt2:pbs
>>>> mode.
>>>>
>>>> I'm using 0.8rc1 and submitting from tp-login.
>>>>
>>>> I am running with a DOEgrids cert in the OSG VO.
>>>>
>>>> I *think* the issue is that when a gt2 job on this VO runs with a login
>>>> shell, it doesn't get java in its path.
>>>>
>>>> When I run /bin/sh *without* the "-l" option, under globus, I do get a
>>>> java in my path.
>>>>
>>>> Allan: what VO did you run on when you got a successful gt2:gt2:pbs
>>>> coaster run on teraport, after you fixed the walltime issue?
>>>>
>>>> It seems to me that this is a rough edge with coaster startup. Recall
>>>> that I had a similar problem running on abe last year: I had to edit out
>>>> the "-l" and create a custom .profile to get coasters to work.
>>>>
>>>> It would be great if we can iron this out in 0.8 or soon after. I'm
>>>> willing to do some testing and enlist help from Allan and Zhengxiong for
>>>> wider testing.
>>>>
>>>> Do we need special site attributes for specific sites to override
>>>> default behaviors when they don't work?
>>>>
>>>>
>>>> My sites.xml is:
>>>>
>>>> fast
>>>> 00:05:00
>>>>
>>> url="tp-grid1.ci.uchicago.edu"
>>>> jobmanager="gt2:gt2:pbs" />
>>>> /gpfs1/osg/data/oops/swiftwork
>>>>
>>>> I get this on stdout/err:
>>>>
>>>> ---------------------------------------------
>>>> Swift 0.8rc1 swift-r2448 cog-r2261
>>>>
>>>> RunID: 20090127-1305-hcxdpor3
>>>> Progress:
>>>> Progress: Selecting site:2 Stage in:1 Initializing site shared directory:1
>>>> Progress: Selecting site:2 Stage in:1 Submitting:1
>>>> Progress: Selecting site:2 Submitting:1 Submitted:1
>>>> Failed to transfer wrapper log from oops5-20090127-1305-hcxdpor3/info/a
>>>> on teraport
>>>> Execution failed:
>>>> Exception in runoops:
>>>> Arguments: [input/fasta/T1af7.fasta, input/secseq/T1af7.secseq,
>>>> input/native/T1af7.pdb, output/T1af7.1.pdt, output/T1af7.1.rmsd, 1,
>>>> [TEMP UPDATE INTERVAL = 10, SMOOTH DEVIATION COEFFICIENT = 0.80001]]
>>>> Host: teraport
>>>> Directory: oops5-20090127-1305-hcxdpor3/jobs/a/runoops-asq0ir5j
>>>> stderr.txt:
>>>>
>>>> stdout.txt:
>>>>
>>>> ----
>>>>
>>>> Caused by:
>>>> Could not submit job
>>>> Caused by:
>>>> Could not start coaster service
>>>> Caused by:
>>>> Task ended before registration was received.
>>>> STDOUT: which: no java in
>>>> (/usr/kerberos/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-r1/bin:/home/osgvo/osg/bin/linux-rhel4-x86_64:/home/osgvo/osg/bin:/soft/xcat-1.2.0-r1/bin:/soft/xcat-1.2.0-r1/sbin:/soft/xcat-1.2.0-r1/x86_64/bin:/soft/xcat-1.2.0-r1/x86_64/sbin:/soft/xcat-1.2.0-r1/contrib/bin:/soft/xcat-1.2.0-r1/contrib/sbin:/soft/xcat-1.2.0-r1/contrib/x86_64/bin:/soft/xcat-1.2.0-r1/contrib/x86_64/sbin)
>>>> dirname: too few arguments
>>>> Try `dirname --help' for more information.
>>>> http://tp-login2.ci.uchicago.edu:50001: line 55: -Djava.home=/..: No
>>>> such file or directory
>>>>
>>>> STDERR: null
>>>> Cleaning up...
>>>> Done
>>>>
>>>> ------------------------------------
>>>>
>>>> Checking out the environment with this cert I see:
>>>>
>>>> tp$ globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -l -c 'java -version'
>>>> /bin/sh: java: command not found
>>>>
>>>>
>>>> tp$ globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -c 'java -version'
>>>> java version "1.5.0_14"
>>>> Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_14-b03)
>>>> Java HotSpot(TM) 64-Bit Server VM (build 1.5.0_14-b03, mixed mode)
>>>>
>>>>
>>>> tp$ globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -l -c 'which java;
>>>> echo JAVA_HOME IS: $JAVA_HOME;echo PATH IS: $PATH'
>>>> JAVA_HOME IS:
>>>> PATH IS:
>>>> /usr/kerberos/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-r1/bin:/home/osgvo/osg/bin/linux-rhel4-x86_64:/home/osgvo/osg/bin:/soft/xcat-1.2.0-r1/bin:/soft/xcat-1.2.0-r1/sbin:/soft/xcat-1.2.0-r1/x86_64/bin:/soft/xcat-1.2.0-r1/x86_64/sbin:/soft/xcat-1.2.0-r1/contrib/bin:/soft/xcat-1.2.0-r1/contrib/sbin:/soft/xcat-1.2.0-r1/contrib/x86_64/bin:/soft/xcat-1.2.0-r1/contrib/x86_64/sbin
>>>> /usr/bin/which: no java in
>>>> (/usr/kerberos/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-r1/bin:/home/osgvo/osg/bin/linux-rhel4-x86_64:/home/osgvo/osg/bin:/soft/xcat-1.2.0-r1/bin:/soft/xcat-1.2.0-r1/sbin:/soft/xcat-1.2.0-r1/x86_64/bin:/soft/xcat-1.2.0-r1/x86_64/sbin:/soft/xcat-1.2.0-r1/contrib/bin:/soft/xcat-1.2.0-r1/contrib/sbin:/soft/xcat-1.2.0-r1/contrib/x86_64/bin:/soft/xcat-1.2.0-r1/contrib/x86_64/sbin)
>>>> tp$
>>>>
>>>>
>>>> tp$ globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -c 'which java; echo
>>>> JAVA_HOME IS: $JAVA_HOME;echo PATH IS: $PATH'
>>>>
>>>> /opt/osg-ce-0.8.0-r1/jdk1.5/bin/java
>>>> JAVA_HOME IS:
>>>> PATH IS:
>>>> /opt/osg-ce-0.8.0-r1/condor/sbin:/opt/osg-ce-0.8.0-r1/condor/bin:/opt/osg-ce-0.8.0-r1/apache/bin:/opt/osg-ce-0.8.0-r1/srm-v2-client/bin:/opt/osg-ce-0.8.0-r1/srm-v1-client/sbin:/opt/osg-ce-0.8.0-r1/srm-v1-client/bin:/opt/osg-ce-0.8.0-r1/wget/bin:/opt/osg-ce-0.8.0-r1/gums/scripts:/opt/osg-ce-0.8.0-r1/cert-scripts/bin:/opt/osg-ce-0.8.0-r1/glite/sbin:/opt/osg-ce-0.8.0-r1/glite/bin:/opt/osg-ce-0.8.0-r1/edg/sbin:/opt/osg-ce-0.8.0-r1/prima/bin:/opt/osg-ce-0.8.0-r1/mysql/bin:/opt/osg-ce-0.8.0-r1/logrotate/sbin:/opt/osg-ce-0.8.0-r1/ant/bin:/opt/osg-ce-0.8.0-r1/jdk1.5/bin:/opt/osg-ce-0.8.0-r1/gpt/sbin:/opt/osg-ce-0.8.0-r1/globus/bin:/opt/osg-ce-0.8.0-r1/globus/sbin:/software/linux-rhel4-x86_64/pacman-3.21-r1/bin:/opt/osg-ce-0.8.0-r1/vdt/sbin:/opt/osg-ce-0.8.0-r1/vdt/bin:/opt/osg-ce-0.8.0-r1/condor/sbin:/opt/osg-ce-0.8.0-r1/condor/bin:/opt/osg-ce-0.8.0-r1/apache/bin:/opt/osg-ce-0.8.0-r1/srm-v2-client/bin:/opt/osg-ce-0.8.0-r1/srm-v1-client/sbin:/opt/osg-ce-0.8.0-r1/srm-v1-client/bin:/opt/osg-ce-0.8.0-r1/wget/bin:/opt/osg-ce-0.8.0-r1/gums/scripts:/opt/osg-ce-0.8.0-r1/cert-scripts/bin:/opt/osg-ce-0.8.0-r1/glite/sbin:/opt/osg-ce-0.8.0-r1/glite/bin:/opt/osg-ce-0.8.0-r1/edg/sbin:/opt/osg-ce-0.8.0-r1/prima/bin:/opt/osg-ce-0.8.0-r1/mysql/bin:/opt/osg-ce-0.8.0-r1/logrotate/sbin:/opt/osg-ce-0.8.0-r1/ant/bin:/opt/osg-ce-0.8.0-r1/jdk1.5/bin:/opt/osg-ce-0.8.0-r1/gpt/sbin:/software/linux-rhel4-x86_64/pacman-3.21-r1/bin:/opt/osg-ce-0.8.0-r1/vdt/sbin:/opt/osg-ce-0.8.0-r1/vdt/bin:/sbin:/usr/sbin:/bin:/usr/bin:/usr/X11R6/bin
>>>> tp$ globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -c 'java
>>>> -version'
>>>> java version "1.5.0_14"
>>>> Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_14-b03)
>>>> Java HotSpot(TM) 64-Bit Server VM (build 1.5.0_14-b03, mixed mode)
>>>>
>>>>
>>>> - Mike
>>>>
>>>>
>>>> On 1/24/09 5:03 PM, Allan Espinosa wrote:
>>>>> Hi,
>>>>>
>>>>> I am using swift0.8rc1. the same also happens to v0.7
>>>>>
>>>>> I tried submitting a job from communicado to tp-grid1 (teraport) using
>>>>> coasters. The swift runtime does not give any error but it does not
>>>>> finish as well. Looking through the files received by the teraport
>>>>> head node, i observed that swift keeps submitting gram jobs. It looks
>>>>> like that the submitted pbs scripts kept finishing / failing.
>>>>>
>>>>> digging through ~/.globus/jobs/tp-grid1.uchicago.edu/*/scheduler* we
>>>>> see that maxwalltime became 101:00 from 00:10:00 (in sites.xml)
>>>>>
>>>>> /usr/bin/perl "/home/aespinosa/.globus/coasters/cscript63266.pl"
>>>>> "http://128.135.125.118:50001" "1728236079"
>>>>> #! /bin/sh
>>>>> # PBS batch job script built by Globus job manager
>>>>> #
>>>>> #PBS -S /bin/sh
>>>>> #PBS -m n
>>>>> #PBS -q fast
>>>>> #PBS -l walltime=101:00
>>>>> #PBS -o /dev/null
>>>>> #PBS -e /dev/null
>>>>> #PBS -l nodes=1
>>>>> HOME="/home/aespinosa";
>>>>> export HOME;
>>>>> OSG_DATA="/gpfs1/osg/data";
>>>>> ...
>>>>> ...
>>>>> counter=0
>>>>> exit_code=0
>>>>> while test $counter -lt 1; do
>>>>> /bin/touch /home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/7432.1232837576/exit.$counter;
>>>>>
>>>>> read tmp_exit_code <
>>>>> /home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/7432.1232837576/exit.$counter
>>>>> if [ $exit_code = 0 -a $tmp_exit_code != 0 ]; then
>>>>> exit_code=$tmp_exit_code
>>>>> fi
>>>>> counter=`expr $counter + 1`
>>>>> done
>>>>>
>>>>> exit $exit_code
>>>>> qsub: Job exceeds queue resource limits MSG=cannot satisfy queue max
>>>>> walltime requirement
>>>>>
>>>>>
>>>>> Below is my sites.xml:
>>>>>
>>>>> fast
>>>>> 00:10:00
>>>>>
>>>> storage="/opt/osg/data/aespinosa" major="2" minor="2" patch="4">
>>>>>
>>>> jobmanager="gt2:gt2:pbs" />
>>>>>
>>>>> /disks/tp-gpfs/scratch/aespinosa
>>>>>
>>>>> This does not happen if i use "local:pbs" as the jobmanager for the
>>>>> coaster and was successful in running jobs
>>>>> -Allan
>>>>>
>>>>>
>>>> _______________________________________________
>>>> Swift-devel mailing list
>>>> Swift-devel at ci.uchicago.edu
>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>

From hategan at mcs.anl.gov Tue Feb 3 09:16:59 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 03 Feb 2009 09:16:59 -0600
Subject: [Swift-devel] Coasters failing on Teraport - cant find Java?
In-Reply-To: <4988566E.2070700@mcs.anl.gov>
References: <50b07b4b0901241503r72f28b96rec19583bb8044ea1@mail.gmail.com> <497F637D.5080707@mcs.anl.gov> <1233113611.2159.25.camel@localhost> <497FE73F.7000307@mcs.anl.gov> <1233597491.22200.3.camel@localhost> <4988566E.2070700@mcs.anl.gov>
Message-ID: <1233674219.24924.1.camel@localhost>

On Tue, 2009-02-03 at 08:36 -0600, Michael Wilde wrote:
> The approach I'm testing is this:
>
> if user has a .coasterinit file
> source it to put java in PATH
> else if java is in PATH
> use it
> else source /etc/profile
>
> (executed under a non-login shell, i.e. never use /bin/sh -l)
>
> Right now I have the above in a different order (.coasterinit last) and
> it works on ranger, mercury and teraport.
>
> .coasterinit is a more flexible alternative to a per-site option that
> points to java. I'm not sure which is better.

We already have a mechanism for specifying site properties (sites.xml). I don't think we should invent a different one.

From benc at hawaga.org.uk Tue Feb 3 11:00:38 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 3 Feb 2009 17:00:38 +0000 (GMT)
Subject: [Swift-devel] Coasters failing on Teraport - cant find Java?
In-Reply-To: <4988566E.2070700@mcs.anl.gov>
References: <50b07b4b0901241503r72f28b96rec19583bb8044ea1@mail.gmail.com> <497F637D.5080707@mcs.anl.gov> <1233113611.2159.25.camel@localhost> <497FE73F.7000307@mcs.anl.gov> <1233597491.22200.3.camel@localhost> <4988566E.2070700@mcs.anl.gov>
Message-ID:

On Tue, 3 Feb 2009, Michael Wilde wrote:

> .coasterinit is a more flexible alternative to a per-site option that points
> to java. I'm not sure which is better.

Flexible in that you can run arbitrary commands; however, less flexible in that it is per-remote-uid, not per-run.

per-site options are implicitly also settable per-run (and hence per-submit-side-user, per-installed-swift-version, and the like)

--

From benc at hawaga.org.uk Tue Feb 3 16:26:38 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 3 Feb 2009 22:26:38 +0000 (GMT)
Subject: [Swift-devel] throttle flow diagram
Message-ID:

I just drew this attempt at showing where the various throttles in Swift are and which parameters control them.

http://www.ci.uchicago.edu/~benc/tmp/throttle-flow.jpeg

Comments both on the technical content (which throttles are where) and on the best layout to draw this diagram are welcome. Eventually I'll draw it using some kind of computer program and put it in the user guide.

--

From aespinosa at cs.uchicago.edu Tue Feb 3 17:02:22 2009
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Tue, 3 Feb 2009 17:02:22 -0600
Subject: [Swift-devel] coasters paths (was Re: Coasters failing on Teraport - cant find Java?)
Message-ID: <50b07b4b0902031502v5074b655wc20c1b15dfa1daaa@mail.gmail.com>

I also am having path problems in running coasters remotely. This time it's looking for "curl" and "wget" (all are in /usr/bin)

logfile snippet:
Caused by: org.globus.cog.abstraction.impl.common.task.TaskSubmissionException:
Could not submit job
Caused by: org.globus.cog.abstraction.impl.common.task.TaskSubmissionException:
Could not start coaster service
Caused by: org.globus.cog.abstraction.impl.common.task.TaskSubmissionException:
Task ended before registration was received.
STDOUT: No wget or curl available

using globus-job-run, it also does not find these binaries by default.
using an "sh -l" gives me tty permission errors in ranger

[aespinosa at communicado ~]$ globus-job-run gatekeeper.ranger.tacc.teragrid.org /usr/bin/which wget
/usr/bin/which: no wget in ((null))
[aespinosa at communicado ~]$ globus-job-run gatekeeper.ranger.tacc.teragrid.org /bin/sh -l -c "which curl"
stty: standard input: Inappropriate ioctl for device
stty: standard input: Inappropriate ioctl for device

On Tue, Jan 27, 2009 at 1:41 PM, Michael Wilde wrote:
> Related to: Re: [Swift-devel] swift changing walltime of prews-gram jobs
>
> I can't get a Swift script to run on coasters on TeraPort in gt2:gt2:pbs
> mode.
>
> I'm using 0.8rc1 and submitting from tp-login.
>
> I am running with a DOEgrids cert in the OSG VO.
>
> I *think* the issue is that when a gt2 job on this VO runs with a login
> shell, it doesn't get java in its path.
>
> When I run /bin/sh *without* the "-l" option, under globus, I do get a java
> in my path.
>

From benc at hawaga.org.uk Tue Feb 3 17:37:21 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 3 Feb 2009 23:37:21 +0000 (GMT)
Subject: [Swift-devel] Re: [Swift-user] gsiftp filesystem mapper removes leading slash
In-Reply-To: <50b07b4b0902031412t5f05bda2t4ff88326df678fb7@mail.gmail.com>
References: <50b07b4b0902031343g1c800451j78b0b914c944a416@mail.gmail.com> <50b07b4b0902031412t5f05bda2t4ff88326df678fb7@mail.gmail.com>
Message-ID:

Moved from swift-user

On Tue, 3 Feb 2009, Allan Espinosa wrote:

> Oh i see. now I'm getting NullPointExceptions:
> database pir[] pattern="UNIPROT_for_blast_14.0.seq*">;

I can recreate the same stacktrace you see, against my directory on teraport.

The below change makes it go away for me. Get a clean fresh source tree, then:

$ cd cog
$ wget http://www.ci.uchicago.edu/~benc/tmp/ftpbug-1.patch
$ patch -p1 < ftpbug-1.patch
$ ant redist

And try that. Probably you should keep a copy of your source tree before applying the patch so that you can easily get rid of it.
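
One way to follow the "keep a copy" advice is sketched below as three small shell helpers. This is only an illustration: the function names and the ".orig" suffix are made up here, and the patch file has to be given as an absolute path because the helpers change directory before running patch.

```shell
#!/bin/sh
# Sketch only: the helper names and the ".orig" suffix are illustrative.

# Copy the tree aside so the patch can be discarded by restoring the copy.
backup_tree() {
    cp -R "$1" "$1.orig"
}

# Apply a patch from inside the tree, stripping one leading path component.
# $1 is the tree, $2 is the absolute path of the patch file.
apply_patch() {
    (cd "$1" && patch -p1 < "$2")
}

# Back the same patch out again without touching the saved copy.
revert_patch() {
    (cd "$1" && patch -R -p1 < "$2")
}
```

With these, something like backup_tree cog followed by apply_patch "$PWD/cog" "$PWD/ftpbug-1.patch" would leave an untouched cog.orig next to the patched tree; revert_patch undoes the change in place.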

--

From hategan at mcs.anl.gov Tue Feb 3 18:37:16 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 03 Feb 2009 18:37:16 -0600
Subject: [Swift-devel] coasters paths (was Re: Coasters failing on Teraport - cant find Java?)
In-Reply-To: <50b07b4b0902031502v5074b655wc20c1b15dfa1daaa@mail.gmail.com>
References: <50b07b4b0902031502v5074b655wc20c1b15dfa1daaa@mail.gmail.com>
Message-ID: <1233707836.12879.3.camel@localhost>

On Tue, 2009-02-03 at 17:02 -0600, Allan Espinosa wrote:
> I also am having path problems in running coasters remotely. This
> time it's looking for "curl" and "wget" (all are in /usr/bin)
> logfile snippet:
> Caused by: org.globus.cog.abstraction.impl.common.task.TaskSubmissionException:
> Could not submit job
> Caused by: org.globus.cog.abstraction.impl.common.task.TaskSubmissionException:
> Could not start coaster service
> Caused by: org.globus.cog.abstraction.impl.common.task.TaskSubmissionException:
> Task ended before registration was received.
> STDOUT: No wget or curl available

What version of cog is this?

It occurred to me that the change I made a few days ago might solve the java problem on some sites, but also mess up wget/curl lookup.

From aespinosa at cs.uchicago.edu Tue Feb 3 18:39:43 2009
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Tue, 3 Feb 2009 18:39:43 -0600
Subject: [Swift-devel] coasters paths (was Re: Coasters failing on Teraport - cant find Java?)
In-Reply-To: <1233707836.12879.3.camel@localhost>
References: <50b07b4b0902031502v5074b655wc20c1b15dfa1daaa@mail.gmail.com> <1233707836.12879.3.camel@localhost>
Message-ID: <20090204003943.GA5628@quadrant>

I think i am using the latest revision (2271) for cog.
for swift my build is using 2490

On Tue, Feb 03, 2009 at 06:37:16PM -0600, Mihael Hategan wrote:
>
> What version of cog is this?
>
> It occurred to me that the change I made a few days ago might solve the
> java problem on some sites, but also mess up wget/curl lookup.
>

From hategan at mcs.anl.gov Tue Feb 3 19:59:03 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 03 Feb 2009 19:59:03 -0600
Subject: [Swift-devel] coasters paths (was Re: Coasters failing on Teraport - cant find Java?)
In-Reply-To: <20090204003943.GA5628@quadrant>
References: <50b07b4b0902031502v5074b655wc20c1b15dfa1daaa@mail.gmail.com> <1233707836.12879.3.camel@localhost> <20090204003943.GA5628@quadrant>
Message-ID: <1233712743.20123.0.camel@localhost>

cog r2272 contains a tentative fix for the issue. I have only tested it locally so far.

On Tue, 2009-02-03 at 18:39 -0600, Allan Espinosa wrote:
> I think i am using the latest revision (2271) for cog.
> for swift my build is using 2490
>
> On Tue, Feb 03, 2009 at 06:37:16PM -0600, Mihael Hategan wrote:
> >
> > What version of cog is this?
> >
> > It occurred to me that the change I made a few days ago might solve the
> > java problem on some sites, but also mess up wget/curl lookup.
> >

From wilde at mcs.anl.gov Wed Feb 4 07:49:12 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 04 Feb 2009 07:49:12 -0600
Subject: [Swift-devel] "type file;" by default
In-Reply-To:
References:
Message-ID: <49899CD8.4090004@mcs.anl.gov>

I'm in favor of making "file" a built-in type.

I further propose we then use the term "file type" instead of "marker" or "mapped" type.

It seems there are at least 2 ways to do so, subtly different:

(a) as if the statement "type file;" has been implicitly executed, or

(b) as if there is a new simple, built-in, mapped type called "file".

(b) seems a bit better, because then the currently unnamed, built-in "marker" type gets a name.

If we use alternative (a) above, you would still say:

type Image;
file f;
Image i;

Here, nothing changes except for the built-in definition for "file" - not how you *use* the word "file".

With alternative (b), though, you would say:

type Image file;
file f;
Image i;

We could still allow the old style declarations to remain valid (but deprecated) for now (or forever), to avoid impact to existing code.

I favor approach (b), because it gives the un-named "marker" type, and hence all primitive built-in types, an explicit name. And it looks less quirky. But it's a minor point.

- Mike

On 2/3/09 5:45 AM, Ben Clifford wrote:
> Pretty much every simple SwiftScript program that I write, I find myself
> putting in "type file;" at the start, and avoiding "marker types" of the
> form:
>
> type picturefile;
>
> and thus ignoring application-level type checking (checking that I'm not
> feeding a picture into a text processing app, and the like) whilst still
> taking advantage of other swift type checking.
>
> To simplify low-end uses of the language, it might be useful to have the
> above "type file;" defined as a built-in type.
>
> This has been discussed before, but I'd like to know what people's opinions
> are.
>

From wilde at mcs.anl.gov Wed Feb 4 07:55:03 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 04 Feb 2009 07:55:03 -0600
Subject: [Swift-devel] Making "type" more uniform
Message-ID: <49899E37.8070501@mcs.anl.gov>

In thinking through the question on "type file", I found that the current "type" statement has several irregularities. These are not of major consequence, but I want to verify my understanding and propose that we document this in the user guide in the short term, and in the longer term, consider changes to make "type" more regular and useful.

Currently (as far as I can tell), the "type" statement can be used to declare new data types of exactly 2 kinds:

1) named, mapped types representing a single file, and
2) named struct types.

You can not declare a new type that represents a (usable) simple type, and you can not (as far as I can tell) declare a new type that represents an array.

While a few examples of declaring new simple scalar types are accepted, they are of no practical use - you can not assign values to such variables.

While you can declare a type whose representation is an int:

type Flag int;
Flag f;

You can not then say:

f = 99;

because f is a Flag and 99 is an int.

This is because of the following, as far as I can tell:

Creating new types whose representation is a "simple type" (int, string, boolean, or float), while potentially useful, does not work, because you can not create or assign any values of such types: you can not "cast" constants or variables of the simple types to any named type declared with the same representation, and there is no way to return such values from any function, atomic or compound.

In contrast, creating multiple declared types whose representation is a "marker" is useful, and works, because app() functions essentially "cast" physical files as return values of any marker type. Thus, a variable of type "image" can be assigned a value because, in the userguide example, the convert command called by the rotate() app-function "casts" its returned marker file type as an "image".

To enable the use of types that are represented as simple type values, we would need to add a cast expression to the language.

Further, types are the only way to declare a struct: you can not declare a struct in a variable declaration; you need a type to define the struct.

Conversely, you can only declare an array in a variable declaration, and you can not, it seems, declare one as a type. (Although you can declare array variables within struct types.)

These are all restrictions which, while seemingly irregular, are not in practice very limiting. It's not clear how important this is, and I'm not proposing it at the moment, but rather suggest we clarify some of these corners of the language in the user guide, so users don't get tripped up trying to do things that seem natural in, e.g., C.
From wilde at mcs.anl.gov Wed Feb 4 12:41:56 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 04 Feb 2009 12:41:56 -0600 Subject: [Swift-devel] strange behavior evaluating function call as trace arg Message-ID: <4989E174.8060500@mcs.anl.gov> In this program: -- trace(@toint("123")); (int k) add (int i, int j) { k = i + j; } int m = add(123,456); trace(m); trace(add(123,456)); -- ...the first and second trace() calls work OK. When I add the third, I get: Could not compile SwiftScript source: line 13:1: unexpected token: trace It seems as if trace is picking up the @toint function call OK, but not the call to "add". (This case is condensed from a more complex program where I first saw this) Is this my error, or Swift's? From hategan at mcs.anl.gov Wed Feb 4 14:05:37 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 4 Feb 2009 14:05:37 -0600 (CST) Subject: [Swift-devel] strange behavior evaluating function call as trace arg In-Reply-To: <3374084.1784281233777774647.JavaMail.root@zimbra> Message-ID: <19137214.1784411233777937698.JavaMail.root@zimbra> Looks like most nested invocations are broken, not specifically trace. The following fails, too: (int r) f(int i) { r = i; } int x; x = f(f(2)); ----- "Michael Wilde" wrote: > In this program: > > -- > trace(@toint("123")); > > (int k) add (int i, int j) > { > k = i + j; > } > > int m = add(123,456); > > trace(m); > > trace(add(123,456)); > -- > > ...the first and second trace() calls work OK. > When I add the third, I get: > > Could not compile SwiftScript source: line 13:1: unexpected token: > trace > > It seems as if trace is picking up the @toint function call OK, but > not > the call to "add". > > (This case is condensed from a more complex program where I first > saw > this) > > Is this my error, or Swift's?
From wilde at mcs.anl.gov Wed Feb 4 16:27:56 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 04 Feb 2009 16:27:56 -0600 Subject: [Swift-devel] [VOTE] Expanding arrays in app function command lines In-Reply-To: References: Message-ID: <498A166C.5090706@mcs.anl.gov> +1 On 2/1/09 11:36 AM, Ben Clifford wrote: > it being slightly unclear in my mind whether the below discussed change > was generally approved of, here is a more formal request for > clarification. > > the change that we talked about is in this thread: > > Subject: Re: [Swift-user] Expanding arrays in app function command lines > > the proposal (which I sent a patch for) is to change the handling of app > parameters to expand string arrays into multiple command line arguments. > > Vote as in http://dev.globus.org/wiki/Guidelines#Action_Item_Votes > From benc at hawaga.org.uk Wed Feb 4 19:29:06 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 5 Feb 2009 01:29:06 +0000 (GMT) Subject: [Swift-devel] strange behavior evaluating function call as trace arg In-Reply-To: <19137214.1784411233777937698.JavaMail.root@zimbra> References: <19137214.1784411233777937698.JavaMail.root@zimbra> Message-ID: > Looks like most nested invocations are broken, not specifically trace. They're not 'broken'. If you think they should work, you're thinking far too much along the lines of procedure calls evaluating to a value like some kind of referentially transparent thing. Procedure calls *must* have an lvalue to their l to give them somewhere to put their l. That's been the case forever in Swift and in VDL2. Making that not happen is a long term goal of mine, but it's not in the language now.
-- From hategan at mcs.anl.gov Wed Feb 4 19:43:17 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 4 Feb 2009 19:43:17 -0600 (CST) Subject: [Swift-devel] strange behavior evaluating function call as trace arg In-Reply-To: <21004599.1808421233797708367.JavaMail.root@zimbra> Message-ID: <2625227.1808521233798197984.JavaMail.root@zimbra> ----- "Ben Clifford" wrote: > > Looks like most nested invocations are broken, not specifically > trace. > > They're not 'broken'. If you think they should work, you're thinking > far > too much along the lines of procedure calls evaluating to a value like > > some kind of referentially transparent thing. Yes, I am. I think it's reasonable to be able to use procedures with a return arity of one like that. > Procedure calls *must* > have > an lvalue to their l to give them somewhere to put their l. That's > been > the case forever in Swift and in VDL2. Right. I'm aware of the nasty issue with this, but I think it's doable. > Making that not happen is a > long > term goal of mine, but its not in the language now. > I've started looking into it. If I don't get something in the next 8 hours of Swift work, I'll drop it. I want it there because not having it is a bit unintuitive. From wilde at mcs.anl.gov Wed Feb 4 22:39:17 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 04 Feb 2009 22:39:17 -0600 Subject: [Swift-devel] Can't initialize and map in same var declaration? Message-ID: <498A6D75.7060403@mcs.anl.gov> I'd like to say: file f <"t.out"> = t(a); but instead need to say: file f <"t.out">; f = t(a); Should the first form work, or should we document that its not permitted? A very minor issue. From benc at hawaga.org.uk Thu Feb 5 02:36:00 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 5 Feb 2009 08:36:00 +0000 (GMT) Subject: [Swift-devel] Can't initialize and map in same var declaration? 
In-Reply-To: <498A6D75.7060403@mcs.anl.gov> References: <498A6D75.7060403@mcs.anl.gov> Message-ID: On Wed, 4 Feb 2009, Michael Wilde wrote: > I'd like to say: > > file f <"t.out"> = t(a); > > but instead need to say: > > file f <"t.out">; f = t(a); > > Should the first form work, or should we document that its not permitted? It doesn't work; but the ways things are, its a pretty minor change to make it work, I think. -- From benc at hawaga.org.uk Thu Feb 5 02:40:28 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 5 Feb 2009 08:40:28 +0000 (GMT) Subject: [Swift-devel] strange behavior evaluating function call as trace arg In-Reply-To: <2625227.1808521233798197984.JavaMail.root@zimbra> References: <2625227.1808521233798197984.JavaMail.root@zimbra> Message-ID: On Wed, 4 Feb 2009, Mihael Hategan wrote: > Yes, I am. I think it's reasonable to be able to use procedures with a > return arity of one like that. yes. I think beyond that it would be nice to lose the procedure/@function distinction entirely. They used to be very different but as time passes they get closer and closer to the same thing, so that pretty much the only distinction at the moment is how they return their return value(s) and that @functions can only return one value. > Right. I'm aware of the nasty issue with this, but I think it's doable. yes. > I've started looking into it. If I don't get something in the next 8 > hours of Swift work, I'll drop it. I want it there because not having it > is a bit unintuitive. ok. let me know if you abandon it and I'll put it on my todo. -- From wilde at mcs.anl.gov Thu Feb 5 08:05:21 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 05 Feb 2009 08:05:21 -0600 Subject: [Swift-devel] Extending the set of builtin functions with external scripts? 
In-Reply-To: References: <2625227.1808521233798197984.JavaMail.root@zimbra> Message-ID: <498AF221.4050802@mcs.anl.gov> was: Re: [Swift-devel] strange behavior evaluating function call as trace arg On 2/5/09 2:40 AM, Ben Clifford wrote: > On Wed, 4 Feb 2009, Mihael Hategan wrote: > >> Yes, I am. I think it's reasonable to be able to use procedures with a >> return arity of one like that. > > yes. I think beyond that it would be nice to lose the procedure/@function > distinction entirely. Yes, I agree. On a related topic: can we make it easier for users to define @-like functions externally, as scripts, so that we can readily grow the function library, and experiment with such functions? (much like the ext mapper). Users can currently define such functions in pairs: an app, searched for in PATH, which takes typically simple type data and returns its value(s) in a file, which is then read in a (compound) wrapper function that uses readData() or readdata2(). This works well, except for a few issues: - not so easy to do varying number of args (must use arrays) - cant do varying types of args (eg, hard to do a sprintf) - need to put each one in tc.data - it could be more elegant (eg, the code for each function today touches 4 places: app, wrapper, tc, and the actual external code). We could make it 2: an extern() func, syntactically almost identical to app(), and the external code (ie, it bypasses tc.data) This relates to the procedure/@function difference in the following way: If we extend the language slightly with an "any" type declaration for parameter types that accepts any value, and a form of var-args, for example permitting @arc, @arglist[], and possibly @argtypes[], to be placed on the app() command line, then external functions could reflect as needed on their args and do pretty much anything that an internally-implemented function could do. In this way the library could grow more readily, with simple perl, python, shell, awk, etc. 
scripts to implement them, searched for in a path like SWIFTLIB which would include SWIFT_HOME/swiftlib as well as user directories. (By the way, I played with cpp as a way to #include swift library code. It worked well for a simple test; needs much more experimentation and testing. That or a similar approach looks promising). With such extensions, we could use the app() declaration for such externs, and they would work exactly like any other app(), but serve the same purpose as built-in functions. The builtins are faster (which is seldom needed), unthrottled (which is sometimes needed) and more robust (ie dont depend on external interpreters) which is handy for the core of the language, I guess. Thoughts? Any interest in such a direction? A related issue is how we want to control and shape the growth of such a library so that it gains the necessary power without getting unruly. - Mike > They used to be very different but as time passes > they get closer and closer to the same thing, so that pretty much the only > distinction at the moment is how they return their return value(s) and > that @functions can only return one value. > >> Right. I'm aware of the nasty issue with this, but I think it's doable. > > yes. > >> I've started looking into it. If I don't get something in the next 8 >> hours of Swift work, I'll drop it. I want it there because not having it >> is a bit unintuitive. > > ok. let me know if you abandon it and I'll put it on my todo. > From benc at hawaga.org.uk Thu Feb 5 08:38:06 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 5 Feb 2009 14:38:06 +0000 (GMT) Subject: [Swift-devel] cpp In-Reply-To: <498AF221.4050802@mcs.anl.gov> References: <2625227.1808521233798197984.JavaMail.root@zimbra> <498AF221.4050802@mcs.anl.gov> Message-ID: On Thu, 5 Feb 2009, Michael Wilde wrote: > (By the way, I played with cpp as a way to #include swift library code. > It worked well for a simple test; needs much more experimentation and > testing. 
That or a similar approach looks promising). GNU cpp is fairly explicit in its man page about not using it for non-C-like source files. That's fine for hacking round with, but so is /bin/cat. Implementing some library system for Swift probably needs substantially more consideration - there are issues like modular compilation of code that are similar to other languages, as well as other more swift-specific issues (for example, should opting to use a library bring along relevant tc.data entries that are not usually defined?) -- From wilde at mcs.anl.gov Thu Feb 5 09:06:32 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 05 Feb 2009 09:06:32 -0600 Subject: [Swift-devel] Re: cpp In-Reply-To: References: <2625227.1808521233798197984.JavaMail.root@zimbra> <498AF221.4050802@mcs.anl.gov> Message-ID: <498B0078.3080606@mcs.anl.gov> On 2/5/09 8:38 AM, Ben Clifford wrote: > On Thu, 5 Feb 2009, Michael Wilde wrote: > >> (By the way, I played with cpp as a way to #include swift library code. >> It worked well for a simple test; needs much more experimentation and >> testing. That or a similar approach looks promising). > > GNU cpp is fairly explicit in its man page about not using it for > non-C-like source files. That's fine for hacking round with, but so is > /bin/cat. agreed. > Implementing some library system for Swift probably needs substantially > more consideration - there are issues like modular compilation of code > that are similar to other languages, as well as other more swift-specific > issues (for example, should opting to use a library bring along relevant > tc.data entries that are not usually defined?) yes. Just want to start somewhere to get a feeling for what is useful. 
From benc at hawaga.org.uk Thu Feb 5 09:18:03 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 5 Feb 2009 15:18:03 +0000 (GMT) Subject: [Swift-devel] Re: [VOTE] Expanding arrays in app function command lines In-Reply-To: References: Message-ID: This change is committed in r2498. -- From benc at hawaga.org.uk Thu Feb 5 09:31:09 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 5 Feb 2009 15:31:09 +0000 (GMT) Subject: [Swift-devel] strange behavior evaluating function call as trace arg In-Reply-To: <4989E174.8060500@mcs.anl.gov> References: <4989E174.8060500@mcs.anl.gov> Message-ID: On Wed, 4 Feb 2009, Michael Wilde wrote: > trace(add(123,456)); > Could not compile SwiftScript source: line 13:1: unexpected token: trace related to this, there's a bug open related to syntax error - bug 173: http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=173 The syntax error should be reporting that add is not valid there, not that trace is an unexpected token. This comes, I think, from deciding whether a statement is a procedure declaration or a procedure call by attempting to parse the entire first bit up to this: foo (syntactically valid argument declarations) { or foo (syntactically valid argument expressions) ; to distinguish between declarations or invocations. In the case above, your statement matches neither of the above and so it tries to parse neither as a declaration or an invocation, giving an overly general error message saying, essentially, "i don't know what this whole statement is". It may be possible to make better predictors or otherwise tighten up the location, but I had trouble last time. Having had a year to think about it, I may be able to come up with something better. 
-- From wilde at mcs.anl.gov Thu Feb 5 09:49:31 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 05 Feb 2009 09:49:31 -0600 Subject: [Swift-devel] strange behavior evaluating function call as trace arg In-Reply-To: References: <19137214.1784411233777937698.JavaMail.root@zimbra> Message-ID: <498B0A8B.7000601@mcs.anl.gov> I don't understand this - can you clarify? I understand an "lvalue" in swift to be one of: - a var (eg var=value) - an array element (eg a[i]=value) - a struct element (eg s.a=value) But swift procedures do indeed return a list of values, right? Does the problem stem from the list-nature of the swift procedure return? (Ie, when a proc returns multiple values, it "needs" a set of lvalues on the lhs of an assignment statement to put them in, and this is currently enforced even in the case of a single value? While an @function() returns a single value, and hence works?) So below when you say "Procedure calls *must* have an lvalue to their l to give them somewhere to put their l" do you mean "Procedure calls can only be invoked from assignment statements, and *must* have the same number of lvalues on the lhs of the assignment to give them somewhere to put all of their return values" ??? On 2/4/09 7:29 PM, Ben Clifford wrote: >> Looks like most nested invocations are broken, not specifically trace. > > They're not 'broken'. If you think they should work, you're thinking far > too much along the lines of procedure calls evaluating to a value like > some kind of referentially transparent thing. Procedure calls *must* have > an lvalue to their l to give them somewhere to put their l. That's been > the case forever in Swift and in VDL2. Making that not happen is a long > term goal of mine, but it's not in the language now.
> From benc at hawaga.org.uk Thu Feb 5 10:06:39 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 5 Feb 2009 16:06:39 +0000 (GMT) Subject: [Swift-devel] strange behavior evaluating function call as trace arg In-Reply-To: <498B0A8B.7000601@mcs.anl.gov> References: <19137214.1784411233777937698.JavaMail.root@zimbra> <498B0A8B.7000601@mcs.anl.gov> Message-ID: On Thu, 5 Feb 2009, Michael Wilde wrote: > I don't understand this - can you clarify? > > I understand an "lvalue" in swift to be one of: > - a var (eg var=value) > - an array element (eg a[i]=value) > - a struct element (eg s.a=value) yes. > But swift procedures do indeed return a list of values, right? no. > So below when you say "Procedure calls *must* have an lvalue to their l to > give them somewhere to put their l" do you mean "Procedure calls can only be > invoked from assignment statements, and *must* have the same number of lvalues > on the lhs of the assignment to give them somewhere to put all of their return > values" ??? yes. Ignore the cardinality of return values - that's irrelevant to this. When evaluating a @function call what comes out is a new DSHandle object that has been created by something inside that @function. When evaluating a procedure call, the procedure expects to be given existing DSHandle object(s) to place its 'return values' into. Return values in the procedure case encompass files as well as primitive values. That mechanism is how, when you say: (file f) p() { app { touch @f; } } file myfile <"foo">; myfile = p(); the procedure is able to figure out the filename "foo" to touch, even though it's a return parameter. The mapping is attached to the DSHandle object, which comes into existence due to the 'file myfile' declaration.
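As a rough illustration of the distinction described above, here is a toy Python model. It is only a sketch and not Swift's actual implementation: the Handle class, its fields, and the names at_function_toint and procedure_p are all made up for this example. It shows an @function-style call creating and returning a new handle, and a procedure-style call receiving an existing handle whose mapping tells it which file to produce.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Handle:
    """Hypothetical stand-in for a DSHandle: a typed slot with an optional file mapping."""
    type_name: str
    mapping: Optional[str] = None   # mapped filename, if any
    value: object = None


# Record of files the toy "touch" app was asked to create (illustration only).
touched = []


def at_function_toint(text: str) -> Handle:
    # @function style: the callee creates a brand new handle and returns it.
    return Handle("int", value=int(text))


def procedure_p(f: Handle) -> None:
    # Procedure style: the caller supplies the output handle, so the body
    # can read the mapping attached to it to learn which file to touch.
    touched.append(f.mapping)
    f.value = f.mapping


myfile = Handle("file", mapping="foo")   # models: file myfile <"foo">;
procedure_p(myfile)                      # models: myfile = p();
k = at_function_toint("123")             # models: @toint("123")
```

In this model a nested call such as procedure_p(procedure_p(...)) has no handle to pass for the inner result, which is the situation the thread describes for nested procedure invocations.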
-- From hategan at mcs.anl.gov Thu Feb 5 10:22:04 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 05 Feb 2009 10:22:04 -0600 Subject: [Swift-devel] strange behavior evaluating function call as trace arg In-Reply-To: References: <19137214.1784411233777937698.JavaMail.root@zimbra> <498B0A8B.7000601@mcs.anl.gov> Message-ID: <1233850924.18738.15.camel@localhost> On Thu, 2009-02-05 at 16:06 +0000, Ben Clifford wrote: > On Thu, 5 Feb 2009, Michael Wilde wrote: > > > I dont understand this - can you clarify? > > > > I understand an "lvalue" in swift to be one of: > > - a var (eg var=value) > > - an array element (eg a[i]=value) > > - a struct element (eg s.a=value) > > yes. > > > But swift procedures do indeed return a list of values, right? > > no. Another way of viewing this would be the following: Returns from swift procedures are not actually returns but arguments passed by reference. This is there in order to support the automatic parallelization scheme. So assuming (int s) add(int x, int y) { s = x + y; } int s; s = add(1, 2);, this translates to (in C-like pointerish pseudocode): add(int* s, int* x, int* y) { *s = *x + *y; } int *s = malloc(sizeof(int)); add(s, newInt(1), newInt(2)); where newInt(int) allocates an int pointer, puts some value into it and returns the address. Pointers here are DSHandles (Swift's way of dealing with data). From benc at hawaga.org.uk Thu Feb 5 10:59:21 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 5 Feb 2009 16:59:21 +0000 (GMT) Subject: [Swift-devel] Re: [Swift-user] gsiftp filesystem mapper removes leading slash In-Reply-To: References: <50b07b4b0902031343g1c800451j78b0b914c944a416@mail.gmail.com> <50b07b4b0902031412t5f05bda2t4ff88326df678fb7@mail.gmail.com> Message-ID: The fix in the below mentioned patch is in CoG r2273. On Tue, 3 Feb 2009, Ben Clifford wrote: > > Moved from swift-user > > On Tue, 3 Feb 2009, Allan Espinosa wrote: > > > Oh i see. 
now I'm getting NullPointExceptions: > > database pir[] > pattern="UNIPROT_for_blast_14.0.seq*">; > > I can recreate the same stacktrace you see, against my directory on > teraport. The below change makes it go away for me. > > Get a clean fresh source tree, then: > > $ cd cog > $ wget http://www.ci.uchicago.edu/~benc/tmp/ftpbug-1.patch > $ patch -p1 < ftpbug-1.patch > $ ant redist > > And try that. > > Probably you should keep a copy of your source tree before applying the > patch so that you can easily get rid of it. > > -- > > From bugzilla-daemon at mcs.anl.gov Thu Feb 5 11:05:29 2009 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Thu, 5 Feb 2009 11:05:29 -0600 (CST) Subject: [Swift-devel] [Bug 174] New: Type string is not defined Message-ID: http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=174 Summary: Type string is not defined Product: Swift Version: unspecified Platform: PC OS/Version: Windows Status: NEW Severity: normal Priority: P2 Component: SwiftScript language AssignedTo: benc at hawaga.org.uk ReportedBy: skenny at uchicago.edu this is the error thrown when i have an incorrect reference to a member of a string array: string labels[] referenced here as: labels[0].label should've been: labels[0] swift produces an error saying that 'type string is not defined' which is not the appropriate error -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at mcs.anl.gov Thu Feb 5 11:09:36 2009 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Thu, 5 Feb 2009 11:09:36 -0600 (CST) Subject: [Swift-devel] [Bug 175] New: ambiguous error when input file not found Message-ID: http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=175 Summary: ambiguous error when input file not found Product: Swift Version: unspecified Platform: PC OS/Version: Windows Status: NEW Severity: normal Priority: P2 Component: SwiftScript language AssignedTo: benc at hawaga.org.uk ReportedBy: skenny at uchicago.edu if an input file is mapped that does not exist, swift throws the following error: Swift svn swift-r2494 cog-r2271 RunID: 20090205-1030-r6q4lgm7 Progress: Unexpected VDL2FutureException checking inputs Execution failed: java.lang.RuntimeException: Got a VDL2FutureException but all parameters should have their values -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From wilde at mcs.anl.gov Thu Feb 5 11:34:48 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 05 Feb 2009 11:34:48 -0600 Subject: [Swift-devel] strange behavior evaluating function call as trace arg In-Reply-To: <1233850924.18738.15.camel@localhost> References: <19137214.1784411233777937698.JavaMail.root@zimbra> <498B0A8B.7000601@mcs.anl.gov> <1233850924.18738.15.camel@localhost> Message-ID: <498B2338.8000201@mcs.anl.gov> OK, thanks, that helps me get closer, but Im not quite there yet. The "automatic parallelization scheme" (i.e. the swift dataflow model) works by chaining together lvalues, correct? In other words, it is the setting of lvalues that enables a statement waiting on a variable to get a value to execute. Is that correct? Conceptually, one could view the dataflow model as existing solely in the memory of the swift interpreter. 
The creation of files is in some sense a side effect of executing app procedures, and the creation of files within an app() and the consequent execution of assignment operations then results in lvalues being set, which enables execution of any statement waiting on an lvalue to proceed. The assignment operations can be explicit (lvalue=value) or implicit (procedure return). To fully understand this, you need to go further into the details of mapping, scoping, procedure activation/completion, array closing, and single assignment. So to take a few steps in that direction: Question: are the following tuples the correct abstract representations of lvalues and dshandles? lvalue: type, *handle, state (set/unset) - same as handle==null? handle: type, value(if simple type), mapping Question: how are arrays and structures represented in this model? Getting this into writing in a concise and correct form would be useful for gaining a full understanding of Swift and also help in the language paper. Is it reasonable to put this into latex form and jointly edit until we're satisfied with it? If so, I will do that. It could go into a version of the language paper that we post on the site, while we submit a condensed version to a conference. - Mike On 2/5/09 10:22 AM, Mihael Hategan wrote: > On Thu, 2009-02-05 at 16:06 +0000, Ben Clifford wrote: >> On Thu, 5 Feb 2009, Michael Wilde wrote: >> >>> I dont understand this - can you clarify? >>> >>> I understand an "lvalue" in swift to be one of: >>> - a var (eg var=value) >>> - an array element (eg a[i]=value) >>> - a struct element (eg s.a=value) >> yes. >> >>> But swift procedures do indeed return a list of values, right? >> no. > > Another way of viewing this would be the following: > > Returns from swift procedures are not actually returns but arguments > passed by reference. This is there in order to support the automatic > parallelization scheme. 
> > So assuming > > (int s) add(int x, int y) { s = x + y; } > int s; > s = add(1, 2);, this translates to (in C-like pointerish pseudocode): > > add(int* s, int* x, int* y) { > *s = *x + *y; > } > int *s = malloc(sizeof(int)); > add(s, newInt(1), newInt(2)); > > where newInt(int) allocates an int pointer, puts some value into it and > returns the address. Pointers here are DSHandles (Swift's way of dealing > with data). > From hategan at mcs.anl.gov Thu Feb 5 11:46:34 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 05 Feb 2009 11:46:34 -0600 Subject: [Swift-devel] strange behavior evaluating function call as trace arg In-Reply-To: <498B2338.8000201@mcs.anl.gov> References: <19137214.1784411233777937698.JavaMail.root@zimbra> <498B0A8B.7000601@mcs.anl.gov> <1233850924.18738.15.camel@localhost> <498B2338.8000201@mcs.anl.gov> Message-ID: <1233855994.21035.8.camel@localhost> On Thu, 2009-02-05 at 11:34 -0600, Michael Wilde wrote: > OK, thanks, that helps me get closer, but I'm not quite there yet. > > The "automatic parallelization scheme" (i.e. the swift dataflow model) > works by chaining together lvalues, correct? In other words, it is the > setting of lvalues that enables a statement waiting on a variable to get > a value to execute. Is that correct? Somewhat. It is the existence of the lvalue that allows a consumer and a producer to synchronize on the same thing. > > Conceptually, one could view the dataflow model as existing solely in > the memory of the swift interpreter. The creation of files is in some > sense a side effect of executing app procedures, and the creation of > files within an app() and the consequent execution of assignment > operations then results in lvalues being set, which enables execution of > any statement waiting on an lvalue to proceed. The assignment operations > can be explicit (lvalue=value) or implicit (procedure return). Yes.
> > To fully understand this, you need to go further into the details of > mapping, scoping, procedure activation/completion, array closing, and > single assignment. > > So to take a few steps in that direction: > > Question: are the following tuples the correct abstract representations > of lvalues and dshandles? There is no "lvalue" distinct from "handle". The term was used by Ben to refer to what looks like lvalues in the Swift scripts. > > lvalue: type, *handle, state (set/unset) - same as handle==null? > handle: type, value(if simple type), mapping In the light of my sayings above: handle: type, state, who_is_waiting_on_this, value?, mapping? > > Question: how are arrays and structures represented in this model? Structs are handles with fields, which are also handles. Arrays are structs with dynamic fields (i.e. you can add fields/elements at run-time). > > Getting this into writing in a concise and correct form would be useful > for gaining a full understanding of Swift and also help in the language > paper. It's useful as far as there is usefulness in people besides us understanding how Swift works in detail at the expense of our time spent writing the document. > > Is it reasonable to put this into latex form and jointly edit until > we're satisfied with it? > > If so, I will do that. > > It could go into a version of the language paper that we post on the > site, while we submit a condensed version to a conference. > From wilde at mcs.anl.gov Thu Feb 5 11:57:13 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 05 Feb 2009 11:57:13 -0600 Subject: [Swift-devel] strange behavior evaluating function call as trace arg In-Reply-To: <1233855994.21035.8.camel@localhost> References: <19137214.1784411233777937698.JavaMail.root@zimbra> <498B0A8B.7000601@mcs.anl.gov> <1233850924.18738.15.camel@localhost> <498B2338.8000201@mcs.anl.gov> <1233855994.21035.8.camel@localhost> Message-ID: <498B2879.9010403@mcs.anl.gov> Thanks. 
But regarding: >> Getting this into writing in a concise and correct form would be useful >> for gaining a full understanding of Swift and also help in the language >> paper. > > It's useful as far as there is usefulness in people besides us > understanding how Swift works in detail at the expense of our time spent > writing the document. ...my intent is not to probe the internals but rather to understand how to *use* the language. Its not how swift works, but how to work with swift. I use it quite a bit and yet I'm continually discovering new things about it, and finding that some of my understandings were incorrect. Thus I find it increasingly important to pin down the language specification in a form that lets users understand it thoroughly. Its not that complex, yet it has many subtleties and surprises, due both to parallelism and to the handling of external data. Such a spec is also helpful in discussing changes and enhancements. I think that the user guide, or some doc hanging off of it, is the place to capture this. On 2/5/09 11:46 AM, Mihael Hategan wrote: > On Thu, 2009-02-05 at 11:34 -0600, Michael Wilde wrote: >> OK, thanks, that helps me get closer, but Im not quite there yet. >> >> The "automatic parallelization scheme" (i.e. the swift dataflow model) >> works by chaining together lvalues, correct? In other words, it is the >> setting of lvalues that enables a statement waiting on a variable to get >> a value to execute. Is that correct? > > Somewhat. It's is the existence of the lvalue that allows a consumer and > a producer to synchronize on the same thing. > >> Conceptually, one could view the dataflow model as existing solely in >> the memory of the swift interpreter. 
The creation of files is in some >> sense a side effect of executing app procedures, and the creation of >> files within an app() and the consequent execution of assignment >> operations then results in lvalues being set, which enables execution of >> any statement waiting on an lvalue to proceed. The assignment operations >> can be explicit (lvalue=value) or implicit (procedure return). > > Yes. > >> To fully understand this, you need to go further into the details of >> mapping, scoping, procedure activation/completion, array closing, and >> single assignment. >> >> So to take a few steps in that direction: >> >> Question: are the following tuples the correct abstract representations >> of lvalues and dshandles? > > There is no "lvalue" distinct from "handle". The term was used by Ben to > refer to what looks like lvalues in the Swift scripts. > >> lvalue: type, *handle, state (set/unset) - same as handle==null? >> handle: type, value(if simple type), mapping > > In the light of my sayings above: > > handle: type, state, who_is_waiting_on_this, value?, mapping? > >> Question: how are arrays and structures represented in this model? > > Structs are handles with fields, which are also handles. Arrays are > structs with dynamic fields (i.e. you can add fields/elements at > run-time). > >> Getting this into writing in a concise and correct form would be useful >> for gaining a full understanding of Swift and also help in the language >> paper. > > It's useful as far as there is usefulness in people besides us > understanding how Swift works in detail at the expense of our time spent > writing the document. > >> Is it reasonable to put this into latex form and jointly edit until >> we're satisfied with it? >> >> If so, I will do that. >> >> It could go into a version of the language paper that we post on the >> site, while we submit a condensed version to a conference. 
>> > > From aespinosa at cs.uchicago.edu Thu Feb 5 12:53:28 2009 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Thu, 5 Feb 2009 12:53:28 -0600 Subject: [Swift-devel] Re: [Swift-user] gsiftp filesystem mapper removes leading slash In-Reply-To: References: <50b07b4b0902031343g1c800451j78b0b914c944a416@mail.gmail.com> <50b07b4b0902031412t5f05bda2t4ff88326df678fb7@mail.gmail.com> Message-ID: <50b07b4b0902051053r67ef2070k996236fa00f510f@mail.gmail.com> Ok will try it out. Thanks Ben! -Allan On Thu, Feb 5, 2009 at 10:59 AM, Ben Clifford wrote: > > The fix in the below mentioned patch is in CoG r2273. > > On Tue, 3 Feb 2009, Ben Clifford wrote: > >> >> Moved from swift-user >> >> On Tue, 3 Feb 2009, Allan Espinosa wrote: >> >> > Oh i see. now I'm getting NullPointExceptions: >> > database pir[] > > pattern="UNIPROT_for_blast_14.0.seq*">; >> >> I can recreate the same stacktrace you see, against my directory on >> teraport. The below change makes it go away for me. >> >> Get a clean fresh source tree, then: >> >> $ cd cog >> $ wget http://www.ci.uchicago.edu/~benc/tmp/ftpbug-1.patch >> $ patch -p1 < ftpbug-1.patch >> $ ant redist >> >> And try that. >> >> Probably you should keep a copy of your source tree before applying the >> patch so that you can easily get rid of it. >> >> -- >> >> > > -- Allan M. Espinosa PhD student, Computer Science University of Chicago From hategan at mcs.anl.gov Thu Feb 5 13:00:41 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 05 Feb 2009 13:00:41 -0600 Subject: [Swift-devel] strange behavior evaluating function call as trace arg In-Reply-To: <498B2879.9010403@mcs.anl.gov> References: <19137214.1784411233777937698.JavaMail.root@zimbra> <498B0A8B.7000601@mcs.anl.gov> <1233850924.18738.15.camel@localhost> <498B2338.8000201@mcs.anl.gov> <1233855994.21035.8.camel@localhost> <498B2879.9010403@mcs.anl.gov> Message-ID: <1233860441.24135.10.camel@localhost> On Thu, 2009-02-05 at 11:57 -0600, Michael Wilde wrote: > Thanks. 
>
> But regarding:
>
> >> Getting this into writing in a concise and correct form would be useful
> >> for gaining a full understanding of Swift and also help in the language
> >> paper.
> >
> > It's useful as far as there is usefulness in people besides us
> > understanding how Swift works in detail at the expense of our time spent
> > writing the document.
>
> ...my intent is not to probe the internals but rather to understand how
> to *use* the language. It's not how swift works, but how to work with
> swift. I use it quite a bit and yet I'm continually discovering new
> things about it, and finding that some of my understandings were incorrect.

In my experience, in particular with Tibi's workflows, understanding how
Swift works leads in most cases to bad results. It turns out that the
best way to use swift is to express the problem formally and use the
publicly defined interface to implement it. It goes against the
C/imperative school of thought, but allowing an automated system to
optimize a problem requires specifying the problem, not "the way one
thinks it works under the hood".

>
> Thus I find it increasingly important to pin down the language
> specification in a form that lets users understand it thoroughly. It's
> not that complex, yet it has many subtleties and surprises, due both to
> parallelism and to the handling of external data.

Right. Parallelism is one of the issues that should be completely
ignored by a swift programmer. Writing swift code in such a way as to
achieve parallelization in a certain way is, as mentioned above, mostly
going to yield bad results. This is mostly because we can achieve proper
parallelization only if a certain level of abstraction is used. That
level of abstraction is the level at which a swift code writer should
work. Perhaps documentation on that needs improvement, but not
circumventing.

>
> Such a spec is also helpful in discussing changes and enhancements.

That it is.
From aespinosa at cs.uchicago.edu Thu Feb 5 14:18:39 2009 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Thu, 5 Feb 2009 14:18:39 -0600 Subject: [Swift-devel] coasters paths (was Re: Coasters failing on Teraport - cant find Java?) In-Reply-To: <1233712743.20123.0.camel@localhost> References: <50b07b4b0902031502v5074b655wc20c1b15dfa1daaa@mail.gmail.com> <1233707836.12879.3.camel@localhost> <20090204003943.GA5628@quadrant> <1233712743.20123.0.camel@localhost> Message-ID: <50b07b4b0902051218l1d259177y42e57d1273aa2997@mail.gmail.com> Hi Mihael, Tried out cog r2273 against swift r2486 The GRAM environment in Ranger has some stty permission errors. So the coasters can't initialize the paths when it attempts to create a "login" session from the remote site: /share/home/01035/tg802895/.globus/job/gatekeeper.ranger.tacc.teragrid.org login4$ ls login4$ ls 8851.1233864730 login4$ cd 8851.1233864730/ login4$ ls stderr stdout x509_up login4$ cat * stty: standard input: Inappropriate ioctl for device stty: standard input: Inappropriate ioctl for device http://communicado.ci.uchicago.edu:50002: line 36: eval: --: invalid option eval: usage: eval [arg ...] Failed to download bootstrap jar from http://communicado.ci.uchicago.edu:50002 -----BEGIN CERTIFICATE----- from submission site: Progress: Failed:1 Execution failed: Could not initialize shared directory on RANGER Caused by: org.globus.cog.abstraction.impl.file.FileResourceException: Failed to start coaster resource on gatekeeper.ranger.tacc.teragrid.org Caused by: Could not start coaster service Caused by: Task ended before registration was received. STDOUT: Failed to download bootstrap jar from http://communicado.ci.uchicago.edu:50002 stty: standard input: Inappropriate ioctl for device stty: standard input: Inappropriate ioctl for device http://communicado.ci.uchicago.edu:50002: line 36: eval: --: invalid option eval: usage: eval [arg ...] 
STDERR: null

I guess the best approach is to create an environment-agnostic http get
request. From the standard Java network packages perhaps?

-Allan

On Tue, Feb 3, 2009 at 7:59 PM, Mihael Hategan wrote:
> cog r2272 contains a tentative fix for the issue. I tested it locally so
> far.
>
> On Tue, 2009-02-03 at 18:39 -0600, Allan Espinosa wrote:
>> I think I am using the latest revision (2271) for cog.
>> for swift my build is using 2490
>>
>> On Tue, Feb 03, 2009 at 06:37:16PM -0600, Mihael Hategan wrote:
>> >
>> > What version of cog is this?
>> >
>> > It occurred to me that the change I made a few days ago might solve the
>> > java problem on some sites, but also mess up wget/curl lookup.

From hategan at mcs.anl.gov Thu Feb 5 15:52:23 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 5 Feb 2009 15:52:23 -0600 (CST)
Subject: [Swift-devel] coasters paths (was Re: Coasters failing on Teraport - cant find Java?)
In-Reply-To: <50b07b4b0902051218l1d259177y42e57d1273aa2997@mail.gmail.com>
Message-ID: <15055702.1859001233870743461.JavaMail.root@zimbra>

----- Allan Espinosa wrote:
>
> I guess the best approach is to create an environment-agnostic http get
> request. From the standard Java network packages perhaps?

Except you need a small script to download the jar file that
implements that, finds java, and starts the downloader. Which
is pretty much what the existing script does.

So I think the solution is to fix the existing script.

From wilde at mcs.anl.gov Thu Feb 5 16:26:52 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 05 Feb 2009 16:26:52 -0600
Subject: [Swift-devel] coasters paths (was Re: Coasters failing on Teraport - cant find Java?)
In-Reply-To: <15055702.1859001233870743461.JavaMail.root@zimbra>
References: <15055702.1859001233870743461.JavaMail.root@zimbra>
Message-ID: <498B67AC.8060502@mcs.anl.gov>

Yeah, I agree. It just takes time and iterative testing.
I wonder if it would be useful to be able to run just bootstrap.sh in some kind of test mode (ie give it args that just verify that it can start a java service), and then run this from a script that does a globus-job-run of the script to a growing set of sites. And add that to the build+test battery. On 2/5/09 3:52 PM, Mihael Hategan wrote: > ----- Allan Espinosa wrote: >> I guess the best approach is create an environment agnostic http get >> request. From the standard Java network packages perhaps? > > Except you need a small script to download the jar file that > implements that, finds java, and starts the downloader. Which > is pretty much what the existing script does. > > So I think the solution is to fix the existing script. From hategan at mcs.anl.gov Thu Feb 5 17:07:48 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 5 Feb 2009 17:07:48 -0600 (CST) Subject: [Swift-devel] coasters paths (was Re: Coasters failing on Teraport - cant find Java?) In-Reply-To: <498B67AC.8060502@mcs.anl.gov> Message-ID: <20254178.1868091233875268293.JavaMail.root@zimbra> ----- Michael Wilde wrote: > Yeah, I agree. I just takes time and iterative testing. Right. > > I wonder if it would be useful to be able to run just bootstrap.sh in > some kind of test mode (ie give it args that just verify that it can > start a java service), and then run this from a script that does a > globus-job-run of the script to a growing set of sites. And add that to > the build+test battery. It can be done. But I think it's one of those things where if you^1 can't figure out how to do it^2, it's likely you won't contribute much to the effort of fixing it. 1. "You" as in X. 2. Find the script, figure out the parameters, fake the environment, and globusrun it. From hategan at mcs.anl.gov Thu Feb 5 17:13:37 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 5 Feb 2009 17:13:37 -0600 (CST) Subject: [Swift-devel] coasters paths (was Re: Coasters failing on Teraport - cant find Java?) 
In-Reply-To: <20254178.1868091233875268293.JavaMail.root@zimbra> Message-ID: <32262676.1868521233875617337.JavaMail.root@zimbra> ----- Mihael Hategan wrote: > > ----- Michael Wilde wrote: > > Yeah, I agree. I just takes time and iterative testing. > > Right. > > > > > I wonder if it would be useful to be able to run just bootstrap.sh in > > some kind of test mode (ie give it args that just verify that it can > > start a java service), and then run this from a script that does a > > globus-job-run of the script to a growing set of sites. And add that to > > the build+test battery. > > It can be done. But I think it's one of those things where if you^1 can't > figure out how to do it^2, it's likely you won't contribute much to the > effort of fixing it. > > 1. "You" as in X. > 2. Find the script, figure out the parameters, fake the environment, and > globusrun it. ... and there already exist coaster tests, which would reveal issues with bootstrap.sh (modulo actually having an extensive set of sites files), and which I would personally be happy with for debugging. Therefore I don't see a reason for doing it. From wilde at mcs.anl.gov Thu Feb 5 23:22:11 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 05 Feb 2009 23:22:11 -0600 Subject: [Swift-devel] Problems with coasters and managedfork jobmanager Message-ID: <498BC903.7010008@mcs.anl.gov> I was able to run with coasters on teraport last week, using gt2:gt2:pbs, but not today. I see the error "Failed to parse command file (line 21)" in my swift output and in the gram log (excerpt of the latter, below). This line # was originally 17. I added some comment lines to bootstrap.sh to see if the line number would move, and indeed it did. So it suggests something in the jobmanager thats unable to handle the text of the bootstrap script embedded in its RSL. But I dont think the line in this error is the line in the bootstrap script. 
Does anyone know how to find the script text that the jobmanager is complaining about? As far as I can tell, something changed on teraport (or my config?) as my gram logs from last week indicate that the plain fork jobmanager was being used. (Ive got an email in to teraport support to probe this). I see Mats's note in a prio mail about concern that the managed-fork mechanism may kill the coaster service, but no comments about script parsing errors. I'll send more logs in this tomorrow if I havent found it yet. Thanks, Mike Thu Feb 5 21:17:51 2009 JM_SCRIPT: Error file is not empty, and submission failed Thu Feb 5 21:17:51 2009 JM_SCRIPT: Error text is ERROR: Failed to parse command file (line 21). Thu Feb 5 21:17:51 2009 JM_SCRIPT: Writing extended error information to stderr 2/5 21:17:51 JM: GT3 extended error message: GRAM_SCRIPT_GT3_FAILURE_MESSAGE: ERROR: Failed to parse command file (line 21). 2/5 21:17:51 JMI: while return_buf = GRAM_SCRIPT_GT3_FAILURE_MESSAGE = ERROR: Failed to parse command file (line 21). 2/5 21:17:51 JMI: while return_buf = GRAM_SCRIPT_ERROR = 17 2/5 21:17:51 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_SUBMIT 2/5 21:17:51 JM: in globus_gram_job_manager_reporting_file_create() 2/5 21:17:51 JM: not reporting job information 2/5 21:17:51 JM: in globus_gram_job_manager_history_file_create() 2/5 21:17:51 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED 2/5 21:17:51 closing destination https://128.135.125.17:50002/dev/stdout-urn:cog-1233890265644 2/5 21:17:51 JM: exiting globus_l_gram_job_manager_output_destination_close() 2/5 21:18:00 closing destination https://128.135.125.17:50002/dev/stderr-urn:cog-1233890265644 2/5 21:18:00 JM: exiting globus_l_gram_job_manager_output_destination_close() 2/5 21:18:00 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_CLOSE_OUTPUT 2/5 21:18:00 JM: NOT empty client callback list. 
2/5 21:18:00 JM: sending callback of status 4 (failure code 155) to https://128.135.125.17:50003/1233890268457.
2/5 21:18:00 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_TWO_PHASE
2/5 21:18:00 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_TWO_PHASE_COMMITTED
2/5 21:18:00 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_FILE_CLEAN_UP
2/5 21:18:00 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_SCRATCH_CLEAN_UP
2/5 21:18:00 JMI: testing job manager scripts for type managedfork exist and permissions are ok.
2/5 21:18:00 JMI: completed script validation: job manager type is managedfork.

From hategan at mcs.anl.gov Thu Feb 5 23:50:51 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 05 Feb 2009 23:50:51 -0600
Subject: [Swift-devel] Problems with coasters and managedfork jobmanager
In-Reply-To: <498BC903.7010008@mcs.anl.gov>
References: <498BC903.7010008@mcs.anl.gov>
Message-ID: <1233899451.2869.8.camel@localhost>

This particular line seems troubling to me:

2/5 21:18:00 JMI: testing job manager scripts for type managedfork exist and permissions are ok.

Does this mean that managed fork is now in use on TP? Is there any way to
still use plain fork?

On Thu, 2009-02-05 at 23:22 -0600, Michael Wilde wrote:
> I was able to run with coasters on teraport last week, using
> gt2:gt2:pbs, but not today.
>
> I see the error "Failed to parse command file (line 21)" in my swift
> output and in the gram log (excerpt of the latter, below).
>
> This line # was originally 17. I added some comment lines to
> bootstrap.sh to see if the line number would move, and indeed it did. So
> it suggests something in the jobmanager that's unable to handle the text
> of the bootstrap script embedded in its RSL. But I don't think the line
> in this error is the line in the bootstrap script.
>
> Does anyone know how to find the script text that the jobmanager is
> complaining about?
> > As far as I can tell, something changed on teraport (or my config?) as > my gram logs from last week indicate that the plain fork jobmanager was > being used. (Ive got an email in to teraport support to probe this). > > I see Mats's note in a prio mail about concern that the managed-fork > mechanism may kill the coaster service, but no comments about script > parsing errors. > > I'll send more logs in this tomorrow if I havent found it yet. > > Thanks, > > Mike > > > > Thu Feb 5 21:17:51 2009 JM_SCRIPT: Error file is not empty, and > submission failed > > Thu Feb 5 21:17:51 2009 JM_SCRIPT: Error text is > ERROR: Failed to parse command file (line 21). > > Thu Feb 5 21:17:51 2009 JM_SCRIPT: Writing extended error information > to stderr > 2/5 21:17:51 JM: GT3 extended error message: > GRAM_SCRIPT_GT3_FAILURE_MESSAGE: ERROR: Failed to parse command file > (line 21). > 2/5 21:17:51 JMI: while return_buf = GRAM_SCRIPT_GT3_FAILURE_MESSAGE = > ERROR: Failed to parse command file (line 21). > 2/5 21:17:51 JMI: while return_buf = GRAM_SCRIPT_ERROR = 17 > 2/5 21:17:51 Job Manager State Machine (entering): > GLOBUS_GRAM_JOB_MANAGER_STATE_SUBMIT > 2/5 21:17:51 JM: in globus_gram_job_manager_reporting_file_create() > 2/5 21:17:51 JM: not reporting job information > 2/5 21:17:51 JM: in globus_gram_job_manager_history_file_create() > 2/5 21:17:51 Job Manager State Machine (entering): > GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED > 2/5 21:17:51 closing destination > https://128.135.125.17:50002/dev/stdout-urn:cog-1233890265644 > 2/5 21:17:51 JM: exiting > globus_l_gram_job_manager_output_destination_close() > 2/5 21:18:00 closing destination > https://128.135.125.17:50002/dev/stderr-urn:cog-1233890265644 > 2/5 21:18:00 JM: exiting > globus_l_gram_job_manager_output_destination_close() > 2/5 21:18:00 Job Manager State Machine (entering): > GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_CLOSE_OUTPUT > 2/5 21:18:00 JM: NOT empty client callback list. 
> 2/5 21:18:00 JM: sending callback of status 4 (failure code 155) to > https://128.135.125.17:50003/1233890268457. > 2/5 21:18:00 Job Manager State Machine (entering): > GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_TWO_PHASE > 2/5 21:18:00 Job Manager State Machine (entering): > GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_TWO_PHASE_COMMITTED > 2/5 21:18:00 Job Manager State Machine (entering): > GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_FILE_CLEAN_UP > 2/5 21:18:00 Job Manager State Machine (entering): > GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_SCRATCH_CLEAN_UP > 2/5 21:18:00 JMI: testing job manager scripts for type managedfork exist > and permissions are ok. > 2/5 21:18:00 JMI: completed script validation: job manager type is > managedfork. > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Thu Feb 5 23:52:19 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 05 Feb 2009 23:52:19 -0600 Subject: [Swift-devel] coasters paths (was Re: Coasters failing on Teraport - cant find Java?) In-Reply-To: <32262676.1868521233875617337.JavaMail.root@zimbra> References: <32262676.1868521233875617337.JavaMail.root@zimbra> Message-ID: <1233899539.2869.11.camel@localhost> On Thu, 2009-02-05 at 17:13 -0600, Mihael Hategan wrote: > ----- Mihael Hategan wrote: > > > > ----- Michael Wilde wrote: > > > Yeah, I agree. I just takes time and iterative testing. > > > > Right. > > > > > > > > I wonder if it would be useful to be able to run just bootstrap.sh in > > > some kind of test mode (ie give it args that just verify that it can > > > start a java service), and then run this from a script that does a > > > globus-job-run of the script to a growing set of sites. And add that to > > > the build+test battery. > > > > It can be done. 
But I think it's one of those things where if you^1 can't > > figure out how to do it^2, it's likely you won't contribute much to the > > effort of fixing it. > > > > 1. "You" as in X. > > 2. Find the script, figure out the parameters, fake the environment, and > > globusrun it. > > ... and there already exist coaster tests, which would reveal issues with > bootstrap.sh (modulo actually having an extensive set of sites files), and > which I would personally be happy with for debugging. > > Therefore I don't see a reason for doing it. Ok, this whole thing makes no sense to me. Please forget the things I said above. From aespinosa at cs.uchicago.edu Fri Feb 6 02:57:24 2009 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Fri, 6 Feb 2009 02:57:24 -0600 Subject: [Swift-devel] Problems with coasters and managedfork jobmanager In-Reply-To: <498BC903.7010008@mcs.anl.gov> References: <498BC903.7010008@mcs.anl.gov> Message-ID: <50b07b4b0902060057n59d56823s8b6aaf9f80b68357@mail.gmail.com> Hi Mike, I think Greg posted about an OSG stack upgrade this week so gram won't be available. That's why i just used local:pbs for my runs today. -Allan On Thu, Feb 5, 2009 at 11:22 PM, Michael Wilde wrote: > I was able to run with coasters on teraport last week, using gt2:gt2:pbs, > but not today. > > I see the error "Failed to parse command file (line 21)" in my swift output > and in the gram log (excerpt of the latter, below). > > This line # was originally 17. I added some comment lines to bootstrap.sh to > see if the line number would move, and indeed it did. So it suggests > something in the jobmanager thats unable to handle the text of the bootstrap > script embedded in its RSL. But I dont think the line in this error is the > line in the bootstrap script. > > Does anyone know how to find the script text that the jobmanager is > complaining about? > > As far as I can tell, something changed on teraport (or my config?) 
as my > gram logs from last week indicate that the plain fork jobmanager was being > used. (Ive got an email in to teraport support to probe this). > > I see Mats's note in a prio mail about concern that the managed-fork > mechanism may kill the coaster service, but no comments about script parsing > errors. > > I'll send more logs in this tomorrow if I havent found it yet. > > Thanks, > > Mike > > > > Thu Feb 5 21:17:51 2009 JM_SCRIPT: Error file is not empty, and submission > failed > > Thu Feb 5 21:17:51 2009 JM_SCRIPT: Error text is > ERROR: Failed to parse command file (line 21). > > Thu Feb 5 21:17:51 2009 JM_SCRIPT: Writing extended error information to > stderr > 2/5 21:17:51 JM: GT3 extended error message: > GRAM_SCRIPT_GT3_FAILURE_MESSAGE: ERROR: Failed to parse command file (line > 21). > 2/5 21:17:51 JMI: while return_buf = GRAM_SCRIPT_GT3_FAILURE_MESSAGE = > ERROR: Failed to parse command file (line 21). > 2/5 21:17:51 JMI: while return_buf = GRAM_SCRIPT_ERROR = 17 > 2/5 21:17:51 Job Manager State Machine (entering): > GLOBUS_GRAM_JOB_MANAGER_STATE_SUBMIT > 2/5 21:17:51 JM: in globus_gram_job_manager_reporting_file_create() > 2/5 21:17:51 JM: not reporting job information > 2/5 21:17:51 JM: in globus_gram_job_manager_history_file_create() > 2/5 21:17:51 Job Manager State Machine (entering): > GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED > 2/5 21:17:51 closing destination > https://128.135.125.17:50002/dev/stdout-urn:cog-1233890265644 > 2/5 21:17:51 JM: exiting > globus_l_gram_job_manager_output_destination_close() > 2/5 21:18:00 closing destination > https://128.135.125.17:50002/dev/stderr-urn:cog-1233890265644 > 2/5 21:18:00 JM: exiting > globus_l_gram_job_manager_output_destination_close() > 2/5 21:18:00 Job Manager State Machine (entering): > GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_CLOSE_OUTPUT > 2/5 21:18:00 JM: NOT empty client callback list. > 2/5 21:18:00 JM: sending callback of status 4 (failure code 155) to > https://128.135.125.17:50003/1233890268457. 
> 2/5 21:18:00 Job Manager State Machine (entering):
> GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_TWO_PHASE
> 2/5 21:18:00 Job Manager State Machine (entering):
> GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_TWO_PHASE_COMMITTED
> 2/5 21:18:00 Job Manager State Machine (entering):
> GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_FILE_CLEAN_UP
> 2/5 21:18:00 Job Manager State Machine (entering):
> GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_SCRATCH_CLEAN_UP
> 2/5 21:18:00 JMI: testing job manager scripts for type managedfork exist and
> permissions are ok.
> 2/5 21:18:00 JMI: completed script validation: job manager type is
> managedfork.
>

--
Allan M. Espinosa
PhD student, Computer Science
University of Chicago

From benc at hawaga.org.uk Fri Feb 6 02:56:54 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 6 Feb 2009 08:56:54 +0000 (GMT)
Subject: [Swift-devel] strange behavior evaluating function call as trace arg
In-Reply-To: <498B2338.8000201@mcs.anl.gov>
References: <19137214.1784411233777937698.JavaMail.root@zimbra> <498B0A8B.7000601@mcs.anl.gov> <1233850924.18738.15.camel@localhost> <498B2338.8000201@mcs.anl.gov>
Message-ID:

On Thu, 5 Feb 2009, Michael Wilde wrote:

> lvalue: type, *handle, state (set/unset) - same as handle==null?

Like hategan says, don't use the word 'lvalue' - I only used it whimsically.

> same as handle==null

No. If a DSHandle doesn't have its value yet you cannot observe its value -
you'll find yourself shunted into the future where the DSHandle does have a
value.
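The behaviour described here (a reader of an unset handle simply resumes once a value exists) resembles a single-assignment future. The following is only a rough sketch of that idea in Java, under stated assumptions: SketchHandle and HandleDemo are hypothetical names invented for illustration, this is not Swift's actual DSHandle code, and the "mapping" part of a handle is left out entirely.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.function.Consumer;

/** Hypothetical stand-in for a handle: type, state, waiters, value. */
final class SketchHandle<T> {
    private final String type;  // declared type name, e.g. "int"
    // The future carries both the state (done or not) and the list of waiters.
    private final CompletableFuture<T> value = new CompletableFuture<>();

    SketchHandle(String type) {
        this.type = type;
    }

    String type() {
        return type;
    }

    /** State: has a producer assigned the value yet? */
    boolean isSet() {
        return value.isDone();
    }

    /** Single assignment: the first set succeeds, any later set is an error. */
    void set(T v) {
        if (!value.complete(v)) {
            throw new IllegalStateException("handle already assigned");
        }
    }

    /** Registers a waiter; it runs once the value exists (at once if already set). */
    void whenSet(Consumer<T> consumer) {
        value.thenAccept(consumer);
    }

    /** Blocking read: the caller waits until some producer calls set(). */
    T get() throws InterruptedException, ExecutionException {
        return value.get();
    }
}

class HandleDemo {
    public static void main(String[] args) throws Exception {
        SketchHandle<Integer> x = new SketchHandle<>("int");
        StringBuilder log = new StringBuilder();
        // The consumer is registered before any value exists.
        x.whenSet(v -> log.append("consumer saw ").append(v));
        System.out.println(x.isSet());  // false
        x.set(42);                      // the producer assigns; the waiter runs now
        System.out.println(log);        // consumer saw 42
        System.out.println(x.get());    // 42
    }
}
```

In this sketch there is no way to observe "unset" as a value: a consumer either waits in get() or is called back later, which is why comparing a handle to null is not the same thing as checking its state.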
--

From benc at hawaga.org.uk Fri Feb 6 03:17:50 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 6 Feb 2009 09:17:50 +0000 (GMT)
Subject: [Swift-devel] strange behavior evaluating function call as trace arg
In-Reply-To: <498B2338.8000201@mcs.anl.gov>
References: <19137214.1784411233777937698.JavaMail.root@zimbra> <498B0A8B.7000601@mcs.anl.gov> <1233850924.18738.15.camel@localhost> <498B2338.8000201@mcs.anl.gov>
Message-ID:

On Thu, 5 Feb 2009, Michael Wilde wrote:

> Is it reasonable to put this into latex form and jointly edit until
> we're satisfied with it?
>
> If so, I will do that.

The text from the language section of that paper is in the user guide now.
That probably should be the most definitive location for information.

--

From benc at hawaga.org.uk Fri Feb 6 08:04:22 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 6 Feb 2009 14:04:22 +0000 (GMT)
Subject: [Swift-devel] Problems with coasters and managedfork jobmanager
In-Reply-To: <498BC903.7010008@mcs.anl.gov>
References: <498BC903.7010008@mcs.anl.gov>
Message-ID:

On Thu, 5 Feb 2009, Michael Wilde wrote:

> I see Mats's note in a prior mail about concern that the managed-fork mechanism
> may kill the coaster service, but no comments about script parsing errors.

The condor jobmanager deals quite poorly with whitespace in arguments, in
a way that I cannot see how to work around. (I've run into a very similar
problem when looking at making Swift run without any shared filesystem).

This bit almost definitely doesn't work with existing jobmanager-condor.

> js.setExecutable("/bin/bash");
> js.addArgument("-c");
> js.addArgument(loadBootstrapScript());

GRAM provided an update package to VDT/OSG the other day that changes
condor jobmanager whitespace handling so that it may be possible to make
it work.
See this thread:
http://lists.globus.org/pipermail/gram-user/2009-January/000790.html

With the present deployed infrastructure, one approach might be to have
the bootstrap script staged in as a file using file transfer mechanisms
(in the quickest hack case, staged in at the same time as wrapper.sh and
seq.sh by swift, though this will not work if you are trying to use the
coaster filesystem provider), allowing the shell command to have spaces
removed.

--

From wilde at mcs.anl.gov Fri Feb 6 09:19:39 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 06 Feb 2009 09:19:39 -0600
Subject: [Swift-devel] Problems with coasters and managedfork jobmanager
In-Reply-To:
References: <498BC903.7010008@mcs.anl.gov>
Message-ID: <498C550B.4040303@mcs.anl.gov>

On 2/6/09 8:04 AM, Ben Clifford wrote:
> On Thu, 5 Feb 2009, Michael Wilde wrote:
>
>> I see Mats's note in a prior mail about concern that the managed-fork mechanism
>> may kill the coaster service, but no comments about script parsing errors.
>
> The condor jobmanager deals quite poorly with whitespace in arguments, in
> a way that I cannot see how to work around. (I've run into a very similar
> problem when looking at making Swift run without any shared filesystem).
>
> This bit almost definitely doesn't work with existing jobmanager-condor.
>
>> js.setExecutable("/bin/bash");
>> js.addArgument("-c");
>> js.addArgument(loadBootstrapScript());
>

I see. The problem turns out to be the newlines in the command script.
It can be reproduced with globusrun:

com$ globusrun -o -r tp-grid1.ci.uchicago.edu:2119/jobmanager-managedfork '&(executable="/bin/echo") (arguments= "hello world")'
hello world

com$ globusrun -o -r tp-grid1.ci.uchicago.edu:2119/jobmanager-managedfork '&(executable="/bin/echo") (arguments= "hello
> world")'

ERROR: Failed to parse command file (line 10).
GRAM Job failed because the job failed when the job manager attempted to
run it (error code 17)
com$

--

I'll make a brief attempt to work around this, but most likely won't be
able to, as you say.

- Mike

> GRAM provided an update package to VDT/OSG the other day that changes
> condor jobmanager whitespace handling so that it may be possible to make
> it work. See this thread:
> http://lists.globus.org/pipermail/gram-user/2009-January/000790.html
>
> With the present deployed infrastructure, one approach might be to have
> the bootstrap script staged in as a file using file transfer mechanisms
> (in the quickest hack case, staged in at the same time as wrapper.sh and
> seq.sh by swift, though this will not work if you are trying to use the
> coaster filesystem provider), allowing the shell command to have spaces
> removed.
>

From hategan at mcs.anl.gov Fri Feb 6 10:12:39 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 06 Feb 2009 10:12:39 -0600
Subject: [Swift-devel] Problems with coasters and managedfork jobmanager
In-Reply-To: <498C550B.4040303@mcs.anl.gov>
References: <498BC903.7010008@mcs.anl.gov> <498C550B.4040303@mcs.anl.gov>
Message-ID: <1233936759.5005.2.camel@localhost>

I guess we'll have to stage in the bootstrap script using the stage-in
directive if we are to support managed fork, since I don't see OSG
fixing the issue. Unfortunately, this doesn't work very well with
ws-gram.

On Fri, 2009-02-06 at 09:19 -0600, Michael Wilde wrote:
>
> On 2/6/09 8:04 AM, Ben Clifford wrote:
> > On Thu, 5 Feb 2009, Michael Wilde wrote:
> >
> >> I see Mats's note in a prior mail about concern that the managed-fork mechanism
> >> may kill the coaster service, but no comments about script parsing errors.
> >
> > The condor jobmanager deals quite poorly with whitespace in arguments, in
> > a way that I cannot see how to work around. (I've run into a very similar
> > problem when looking at making Swift run without any shared filesystem).
> > > > This bit almost definitely doesn't work with existing jobmanager-condor. > > > >> js.setExecutable("/bin/bash"); > >> js.addArgument("-c"); > >> js.addArgument(loadBootstrapScript()); > > > > I see. The problem turns out to be the newlines in the command script. > It can be reproduced with globusrun: > > com$ globusrun -o -r > tp-grid1.ci.uchicago.edu:2119/jobmanager-managedfork > '&(executable="/bin/echo") (arguments= "hello world")' > hello world > > > com$ globusrun -o -r > tp-grid1.ci.uchicago.edu:2119/jobmanager-managedfork > '&(executable="/bin/echo") (arguments= "hello > > world")' > > ERROR: Failed to parse command file (line 10). > GRAM Job failed because the job failed when the job manager attempted to > run it (error code 17) > com$ > > -- > > I'll make a brief attempt to work around this, but most likely wont be > able to, as you say. > > - Mike > > > > GRAM provided an update package to VDT/OSG the other day that changes > > condor jobmanager whitespace handling so that it may be possible to make > > it work. See this thread: > > http://lists.globus.org/pipermail/gram-user/2009-January/000790.html > > > > With the present deployed infrastructure, one approach might be to have > > the bootstrap script staged in as a file using file transfer mechanisms > > (in the quickest hack case, staged in at the same time as wrapper.sh and > > seq.sh by swift, though this will not work if you are trying to use the > > coaster filesystem provider), allowing the shell command to have spaces > > removed. 
> > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From benc at hawaga.org.uk Fri Feb 6 10:17:05 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 6 Feb 2009 16:17:05 +0000 (GMT) Subject: [Swift-devel] Problems with coasters and managedfork jobmanager In-Reply-To: <1233936759.5005.2.camel@localhost> References: <498BC903.7010008@mcs.anl.gov> <498C550B.4040303@mcs.anl.gov> <1233936759.5005.2.camel@localhost> Message-ID: On Fri, 6 Feb 2009, Mihael Hategan wrote: > I guess we'll have to stage in the bootstrap script using the stage-in > directive if we are to support managed fork, since I don't see OSG > fixing the issue. They are fixing the whitespace in parameters - see the gram-user thread I sent in a different message. -- From hategan at mcs.anl.gov Fri Feb 6 10:24:07 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 06 Feb 2009 10:24:07 -0600 Subject: [Swift-devel] Problems with coasters and managedfork jobmanager In-Reply-To: References: <498BC903.7010008@mcs.anl.gov> <498C550B.4040303@mcs.anl.gov> <1233936759.5005.2.camel@localhost> Message-ID: <1233937447.5206.0.camel@localhost> On Fri, 2009-02-06 at 16:17 +0000, Ben Clifford wrote: > On Fri, 6 Feb 2009, Mihael Hategan wrote: > > > I guess we'll have to stage in the bootstrap script using the stage-in > > directive if we are to support managed fork, since I don't see OSG > > fixing the issue. > > They are fixing the whitespace in parameters - see the gram-user thread I > sent in a different message. Does this include the new lines? 
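A note on the newline failure reported above: whether a command string contains an embedded newline can be checked locally before it is handed to a jobmanager. The following is only a sketch, assuming a POSIX shell. The helper `has_newline` and the two sample scripts are hypothetical illustrations and are not taken from bootstrap.sh or from Swift.

```shell
#!/bin/sh
# Hypothetical helper (not part of Swift or GRAM): prints "yes" if the
# argument contains a newline character, "no" otherwise.
has_newline() {
    case "$1" in
        *"
"*) echo yes ;;
        *) echo no ;;
    esac
}

# Two illustrative scripts that do the same thing: one spans two lines,
# the other separates its commands with ';' instead of a newline.
MULTI_LINE='echo one
echo two'
ONE_LINE='echo one; echo two'

has_newline "$MULTI_LINE"
has_newline "$ONE_LINE"

# Both forms produce the same output when run through sh -c.
sh -c "$ONE_LINE"
```

Replacing newlines with `;` is only safe for simple command sequences like this one. Here-documents, comments and multi-line quoted strings would need more careful handling.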
From wilde at mcs.anl.gov Fri Feb 6 10:33:25 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 06 Feb 2009 10:33:25 -0600 Subject: [Swift-devel] Problems with coasters and managedfork jobmanager In-Reply-To: <1233937447.5206.0.camel@localhost> References: <498BC903.7010008@mcs.anl.gov> <498C550B.4040303@mcs.anl.gov> <1233936759.5005.2.camel@localhost> <1233937447.5206.0.camel@localhost> Message-ID: <498C6655.3010702@mcs.anl.gov> I dont know, but I am testing a version where I removed the newlines from bootstrap.pl (and adjusted a few bits manually) and I *think* its moving on to the next stage and trying to start the workers. Ben, it seems that *some* whitespace is passed on OK, in that I can run a job that does echo "hello world" and that blank after hello is preserved, and the job runs. I assume the whitespace problem is more subtle than that? On 2/6/09 10:24 AM, Mihael Hategan wrote: > On Fri, 2009-02-06 at 16:17 +0000, Ben Clifford wrote: >> On Fri, 6 Feb 2009, Mihael Hategan wrote: >> >>> I guess we'll have to stage in the bootstrap script using the stage-in >>> directive if we are to support managed fork, since I don't see OSG >>> fixing the issue. >> They are fixing the whitespace in parameters - see the gram-user thread I >> sent in a different message. > > Does this include the new lines? > From hategan at mcs.anl.gov Fri Feb 6 10:38:03 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 06 Feb 2009 10:38:03 -0600 Subject: [Swift-devel] Problems with coasters and managedfork jobmanager In-Reply-To: <498C6655.3010702@mcs.anl.gov> References: <498BC903.7010008@mcs.anl.gov> <498C550B.4040303@mcs.anl.gov> <1233936759.5005.2.camel@localhost> <1233937447.5206.0.camel@localhost> <498C6655.3010702@mcs.anl.gov> Message-ID: <1233938283.5483.0.camel@localhost> I suppose that script could be made newline-less. 
On Fri, 2009-02-06 at 10:33 -0600, Michael Wilde wrote: > I dont know, but I am testing a version where I removed the newlines > from bootstrap.pl (and adjusted a few bits manually) and I *think* its > moving on to the next stage and trying to start the workers. > > Ben, it seems that *some* whitespace is passed on OK, in that I can run > a job that does echo "hello world" and that blank after hello is > preserved, and the job runs. I assume the whitespace problem is more > subtle than that? > > On 2/6/09 10:24 AM, Mihael Hategan wrote: > > On Fri, 2009-02-06 at 16:17 +0000, Ben Clifford wrote: > >> On Fri, 6 Feb 2009, Mihael Hategan wrote: > >> > >>> I guess we'll have to stage in the bootstrap script using the stage-in > >>> directive if we are to support managed fork, since I don't see OSG > >>> fixing the issue. > >> They are fixing the whitespace in parameters - see the gram-user thread I > >> sent in a different message. > > > > Does this include the new lines? > > From benc at hawaga.org.uk Fri Feb 6 10:40:45 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 6 Feb 2009 16:40:45 +0000 (GMT) Subject: [Swift-devel] Problems with coasters and managedfork jobmanager In-Reply-To: <498C6655.3010702@mcs.anl.gov> References: <498BC903.7010008@mcs.anl.gov> <498C550B.4040303@mcs.anl.gov> <1233936759.5005.2.camel@localhost> <1233937447.5206.0.camel@localhost> <498C6655.3010702@mcs.anl.gov> Message-ID: On Fri, 6 Feb 2009, Michael Wilde wrote: > Ben, it seems that *some* whitespace is passed on OK, in that I can run a job > that does echo "hello world" and that blank after hello is preserved, and the > job runs. I assume the whitespace problem is more subtle than that? yes. 
read the thread that I sent earlier on gram-user: http://lists.globus.org/pipermail/gram-user/2009-January/000790.html -- From benc at hawaga.org.uk Fri Feb 6 10:50:00 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 6 Feb 2009 16:50:00 +0000 (GMT) Subject: [Swift-devel] Problems with coasters and managedfork jobmanager In-Reply-To: <498C6655.3010702@mcs.anl.gov> References: <498BC903.7010008@mcs.anl.gov> <498C550B.4040303@mcs.anl.gov> <1233936759.5005.2.camel@localhost> <1233937447.5206.0.camel@localhost> <498C6655.3010702@mcs.anl.gov> Message-ID: On Fri, 6 Feb 2009, Michael Wilde wrote: > I dont know, but I am testing a version where I removed the newlines from > bootstrap.pl (and adjusted a few bits manually) and I *think* its moving on to > the next stage and trying to start the workers. If the remote coaster head node job is running, then you should see some activity in the remote side ~/.globus/coasters/coaster.log Check that. -- From wilde at mcs.anl.gov Fri Feb 6 11:03:43 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 06 Feb 2009 11:03:43 -0600 Subject: [Swift-devel] Problems with coasters and managedfork jobmanager In-Reply-To: References: <498BC903.7010008@mcs.anl.gov> <498C550B.4040303@mcs.anl.gov> <1233936759.5005.2.camel@localhost> <1233937447.5206.0.camel@localhost> <498C6655.3010702@mcs.anl.gov> Message-ID: <498C6D6F.9030804@mcs.anl.gov> No, mine didnt get that far. All the logs I see (under ~osgedu) are from you. I'm dropping out of this and will wait for something from Mihael and/or you to test. On 2/6/09 10:50 AM, Ben Clifford wrote: > On Fri, 6 Feb 2009, Michael Wilde wrote: > >> I dont know, but I am testing a version where I removed the newlines from >> bootstrap.pl (and adjusted a few bits manually) and I *think* its moving on to >> the next stage and trying to start the workers. 
> > If the remote coaster head node job is running, then you should see > some activity in the remote side ~/.globus/coasters/coaster.log > > Check that. > From benc at hawaga.org.uk Fri Feb 6 11:50:50 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 6 Feb 2009 17:50:50 +0000 (GMT) Subject: [Swift-devel] coasters on UC teraport with head job running on a worker node Message-ID: I hacked around with coasters here to see about getting the head job running on a cluster worker node rather than on the cluster head node. This works on teraport through PBS. The below patch contains the changes I made to make that happen. http://www.ci.uchicago.edu/~benc/tmp/coaster-head-elsewhere There are three changes I made: i) submit to pbs jobmanager instead of to fork jobmanager ii) start coaster workers with IP address of the head-worker node instead of the address of the cluster head node iii) hack the environment to point to teraport's CA directory (in the environment that I get there, there is no automatically findable CA directory, and an ENV profile appeared to not work). In situations where the cluster nodes have outbound network connectivity, this seems like a nice thing to do, and I want to make this a configurable option to go into the SVN. I think: i) above should probably be an extension to the existing three-field coaster jobmanager string, ii) should be a configuration option to go along-side the coasterInternalIP setting and iii) should be a more general ability to set the environment for a coaster worker. Here is the site.xml that I used with this patch: /home/benc/swifttest -- From wilde at mcs.anl.gov Fri Feb 6 12:27:21 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 06 Feb 2009 12:27:21 -0600 Subject: [Swift-devel] coasters on UC teraport with head job running on a worker node In-Reply-To: References: Message-ID: <498C8109.603@mcs.anl.gov> I tested it, and it worked - very nice. 
I like the idea of moving the service load to a worker when possible. So this patch gets around the problem of managed-fork/condor jobmanager by submitting to the pbs jobmanager instead of fork. But that means that to generalize this, we still need to solve the problem of running the service bootstrap.sh if the cluster is a condor pool, right? - Mike On 2/6/09 11:50 AM, Ben Clifford wrote: > I hacked around with coasters here to see about getting the head job > running on a cluster worker node rather than on the cluster head node. > > This works on teraport through PBS. The below patch contains the changes I > made to make that happen. > > http://www.ci.uchicago.edu/~benc/tmp/coaster-head-elsewhere > > There are three changes I made: > > i) submit to pbs jobmanager instead of to fork jobmanager > ii) start coaster workers with IP address of the head-worker node > instead of the address of the cluster head node > iii) hack the environment to point to teraport's CA directory (in the > environment that I get there, there is no automatically findable > CA directory, and an ENV profile appeared to not work). > > In situations where the cluster nodes have outbound network connectivity, > this seems like a nice thing to do, and I want to make this a configurable > option to go into the SVN. > > I think: > > i) above should probably be an extension to the existing three-field > coaster jobmanager string, ii) should be a configuration option to go > along-side the coasterInternalIP setting and iii) should be a more general > ability to set the environment for a coaster worker. 
> > Here is the site.xml that I used with this patch: > > > > > > jobManager="gt2:gt2:pbs" /> > /home/benc/swifttest > > > > From benc at hawaga.org.uk Fri Feb 6 12:34:11 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 6 Feb 2009 18:34:11 +0000 (GMT) Subject: [Swift-devel] coasters on UC teraport with head job running on a worker node In-Reply-To: <498C8109.603@mcs.anl.gov> References: <498C8109.603@mcs.anl.gov> Message-ID: On Fri, 6 Feb 2009, Michael Wilde wrote: > But that means that to generalize this, we still need to solve the problem of > running the service bootstrap.sh if the cluster is a condor pool, right? yes. I'd like to see how this behaves against the latest condor jobmanager, though, as that is going into VDT. -- From wilde at mcs.anl.gov Fri Feb 6 12:46:57 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 06 Feb 2009 12:46:57 -0600 Subject: [Swift-devel] coasters on UC teraport with head job running on a worker node In-Reply-To: References: <498C8109.603@mcs.anl.gov> Message-ID: <498C85A1.2070001@mcs.anl.gov> I missed the message that Greg Cross sent regarding TeraPort updates (which Allan pointed out to me). Perhaps he can test it there? Or, perhaps he can install it on the HNL cluster for testing? I'll send a message to support and cc you, unless you have another test environment. Anything in the NMI build&test that can support such a test? On 2/6/09 12:34 PM, Ben Clifford wrote: > On Fri, 6 Feb 2009, Michael Wilde wrote: > >> But that means that to generalize this, we still need to solve the problem of >> running the service bootstrap.sh if the cluster is a condor pool, right? > > yes. I'd like to see how this behaves against the latest condor > jobmanager, though, as that is going into VDT. 
> From hategan at mcs.anl.gov Fri Feb 6 12:48:24 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 06 Feb 2009 12:48:24 -0600 Subject: [Swift-devel] coasters on UC teraport with head job running on a worker node In-Reply-To: References: <498C8109.603@mcs.anl.gov> Message-ID: <1233946104.9571.1.camel@localhost> On Fri, 2009-02-06 at 18:34 +0000, Ben Clifford wrote: > On Fri, 6 Feb 2009, Michael Wilde wrote: > > > But that means that to generalize this, we still need to solve the problem of > > running the service bootstrap.sh if the cluster is a condor pool, right? > > yes. I'd like to see how this behaves against the latest condor > jobmanager, though, as that is going into VDT. > Ok, so it would be sufficient but not necessarily necessary to make bootstrap.sh a one-liner with the new VDT? From wilde at mcs.anl.gov Fri Feb 6 12:48:05 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 06 Feb 2009 12:48:05 -0600 Subject: [Swift-devel] coasters on UC teraport with head job running on a worker node In-Reply-To: <498C85A1.2070001@mcs.anl.gov> References: <498C8109.603@mcs.anl.gov> <498C85A1.2070001@mcs.anl.gov> Message-ID: <498C85E5.6000707@mcs.anl.gov> or perhaps the ITB testbed that Rob and Suchandra maintain? On 2/6/09 12:46 PM, Michael Wilde wrote: > I missed the message that Greg Cross sent regarding TeraPort updates > (which Allan pointed out to me). Perhaps he can test it there? > > Or, perhaps he can install it on the HNL cluster for testing? > > I'll send a message to support and cc you, unless you have another test > environment. Anything in the NMI build&test that can support such a test? > > On 2/6/09 12:34 PM, Ben Clifford wrote: >> On Fri, 6 Feb 2009, Michael Wilde wrote: >> >>> But that means that to generalize this, we still need to solve the >>> problem of >>> running the service bootstrap.sh if the cluster is a condor pool, right? >> >> yes. 
I'd like to see how this behaves against the latest condor >> jobmanager, though, as that is going into VDT. >> From benc at hawaga.org.uk Fri Feb 6 12:50:59 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 6 Feb 2009 18:50:59 +0000 (GMT) Subject: [Swift-devel] coasters on UC teraport with head job running on a worker node In-Reply-To: <498C85A1.2070001@mcs.anl.gov> References: <498C8109.603@mcs.anl.gov> <498C85A1.2070001@mcs.anl.gov> Message-ID: On Fri, 6 Feb 2009, Michael Wilde wrote: > I missed the message that Greg Cross sent regarding TeraPort updates > (which Allan pointed out to me). Perhaps he can test it there? > > Or, perhaps he can install it on the HNL cluster for testing? It will appear on the ITB in due course, I think. I'm happy to wait for it to do so. -- From benc at hawaga.org.uk Fri Feb 6 12:52:44 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 6 Feb 2009 18:52:44 +0000 (GMT) Subject: [Swift-devel] coasters on UC teraport with head job running on a worker node In-Reply-To: <1233946104.9571.1.camel@localhost> References: <498C8109.603@mcs.anl.gov> <1233946104.9571.1.camel@localhost> Message-ID: On Fri, 6 Feb 2009, Mihael Hategan wrote: > Ok, so it would be sufficient but not necessarily necessary to make > bootstrap.sh a one-liner with the new VDT? Should be.
-- From wilde at mcs.anl.gov Fri Feb 6 12:58:40 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 06 Feb 2009 12:58:40 -0600 Subject: [Swift-devel] coasters on UC teraport with head job running on a worker node In-Reply-To: <1233946104.9571.1.camel@localhost> References: <498C8109.603@mcs.anl.gov> <1233946104.9571.1.camel@localhost> Message-ID: <498C8860.6070105@mcs.anl.gov> It seems the one-liner alone might get past the managed-fork rejection of newlines, but perhaps not sufficient to deal with the separate-arg issue in the gram thread that Ben indicated. and its not clear yet if the fix indicated in that thread also fixes the newline issue. also the fix may take a while to propagate across OSG, so a fix thats independent of jobmanager would still be nice if thats possible, but if its more than a few more hours of fiddling, perhaps not worth it. If I can run a swift script with coasters nicely, on a set of OSG PBS sites, and on a set of TG sites, that would be sufficient, I *think*, to wait and see how long it will take for the Condor jobmanger fix to propagate. So one approach is: - polish up and integrate the coaster-service-on-workernode patch - test same code on a whitespace-fix-patched condor system If this combination works on all non-condor JM's and the fixed Condor JM, call it done. - (we) test the app-finding fixes (java, wget etc) on more systems in meantime Is that a reasonable route? - Mike On 2/6/09 12:48 PM, Mihael Hategan wrote: > On Fri, 2009-02-06 at 18:34 +0000, Ben Clifford wrote: >> On Fri, 6 Feb 2009, Michael Wilde wrote: >> >>> But that means that to generalize this, we still need to solve the problem of >>> running the service bootstrap.sh if the cluster is a condor pool, right? >> yes. I'd like to see how this behaves against the latest condor >> jobmanager, though, as that is going into VDT. >> > > Ok, so it would be sufficient but not necessarily necessary to make > bootstrap.sh a one-liner with the new VDT? 
> From hategan at mcs.anl.gov Fri Feb 6 13:04:49 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 06 Feb 2009 13:04:49 -0600 Subject: [Swift-devel] coasters on UC teraport with head job running on a worker node In-Reply-To: <498C8860.6070105@mcs.anl.gov> References: <498C8109.603@mcs.anl.gov> <1233946104.9571.1.camel@localhost> <498C8860.6070105@mcs.anl.gov> Message-ID: <1233947089.10253.4.camel@localhost> On Fri, 2009-02-06 at 12:58 -0600, Michael Wilde wrote: > It seems the one-liner alone might get past the managed-fork rejection > of newlines, but perhaps not sufficient to deal with the separate-arg > issue in the gram thread that Ben indicated. > > and its not clear yet if the fix indicated in that thread also fixes the > newline issue. > > also the fix may take a while to propagate across OSG, so a fix thats > independent of jobmanager would still be nice if thats possible, but if > its more than a few more hours of fiddling, perhaps not worth it. I'm open to suggestions, but it seems like the only reasonable choice is to wait for that fix. > > If I can run a swift script with coasters nicely, on a set of OSG PBS > sites, and on a set of TG sites, that would be sufficient, I *think*, I like the idea of running the service on a worker node. However, I like the idea of zero configuration even more. So if there is a conflict, I will favor the latter (unless, of course, the service-on-worker-node thing can be done seamlessly). From hategan at mcs.anl.gov Fri Feb 6 21:27:06 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 06 Feb 2009 21:27:06 -0600 Subject: [Swift-devel] runaway jobs Message-ID: <1233977226.16878.6.camel@localhost> I committed a bunch of stuff to deal with that. The idea is to kill a job if it's over 2*walltime and allow swift to re-schedule it. It required some changes to the cog abstraction interfaces, and I used the opportunity to do some clean ups. This means odds of something breaking somewhat higher than normal. 
I also updated the versions of most of the providers, so jar file names will change. From benc at hawaga.org.uk Sat Feb 7 05:22:54 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Sat, 7 Feb 2009 11:22:54 +0000 (GMT) Subject: [Swift-devel] runaway jobs In-Reply-To: <1233977226.16878.6.camel@localhost> References: <1233977226.16878.6.camel@localhost> Message-ID: On Fri, 6 Feb 2009, Mihael Hategan wrote: > I committed a bunch of stuff to deal with that. The idea is to kill a > job if it's over 2*walltime and allow swift to re-schedule it. I think this will interact poorly with clustering, due to the very inaccurate times at which clustered jobs go into Active and Completed states. Many clustered jobs will exceed their wall time in large clusters (for example, clusters that contain more than 2 jobs and where the maxwalltime is a tight bound). A job with walltime w and actual runtime (w-e) is clustered with 3 similar tasks, giving a cluster that will run with actual time 4w-e ~= 4w; so then all four of the clustered jobs will be presented to the replication manager layer as running for walltime 4w (> 2w). As to actually what happens when you try to cancel a clustered task at the moment, I'm unsure - perhaps it does nothing causing the runaway job to happen to have no adverse effects. It should be relatively straightforward to disable this mechanism when clustering is enabled; so that you can use either this or clusters but not both. But this and replication would be nice to use with clustering. For that to happen, perhaps there needs to be some better communication between the clustering code and the replication code. For example, it could be that clusters are subject to walltime control, with walltime control on clustered jobs suppressed; and likewise for replication. The replication stuff works mostly at the karajan Task level so that might not be an excessively arduous task. 
-- From hategan at mcs.anl.gov Sat Feb 7 08:36:53 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sat, 07 Feb 2009 08:36:53 -0600 Subject: [Swift-devel] runaway jobs In-Reply-To: References: <1233977226.16878.6.camel@localhost> Message-ID: <1234017413.20118.0.camel@localhost> On Sat, 2009-02-07 at 11:22 +0000, Ben Clifford wrote: > On Fri, 6 Feb 2009, Mihael Hategan wrote: > > > I committed a bunch of stuff to deal with that. The idea is to kill a > > job if it's over 2*walltime and allow swift to re-schedule it. > > I think this will interact poorly with clustering, due to the very > inaccurate times at which clustered jobs go into Active and Completed > states. Many clustered jobs will exceed their wall time in large clusters > (for example, clusters that contain more than 2 jobs and where the > maxwalltime is a tight bound). > > A job with walltime w and actual runtime (w-e) is clustered with 3 similar > tasks, giving a cluster that will run with actual time 4w-e ~= 4w; so then > all four of the clustered jobs will be presented to the replication > manager layer as running for walltime 4w (> 2w). > > As to actually what happens when you try to cancel a clustered task at the > moment, I'm unsure - perhaps it does nothing causing the runaway job to > happen to have no adverse effects. > > It should be relatively straightforward to disable this mechanism when > clustering is enabled; so that you can use either this or clusters but not > both. > > But this and replication would be nice to use with clustering. It can be updated to be only enabled when no clustering or clustering and this job is a cluster. That should fix it. 
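The interaction between the 2*walltime rule and clustering described in this thread can be worked through with small numbers. This is a sketch only, assuming a POSIX shell. The function `over_limit`, the values of W, E and N, and the assumption that a cluster's walltime is the sum of its members' walltimes are all illustrative and are not taken from the committed code.

```shell
#!/bin/sh
# Hypothetical check (not the Swift implementation): succeeds when the
# elapsed time is more than twice the declared walltime.
over_limit() {
    [ "$1" -gt $(( 2 * $2 )) ]
}

W=60    # declared walltime of one job, in seconds (illustrative)
E=5     # how much earlier than W each job actually finishes (illustrative)
N=4     # number of jobs placed in one cluster

# A cluster runs its members one after another, so each member appears
# to have been active for roughly the whole cluster runtime.
CLUSTER_ELAPSED=$(( N * (W - E) ))

# Judged against the single-job walltime, every member looks runaway.
if over_limit "$CLUSTER_ELAPSED" "$W"; then
    echo "member check: $CLUSTER_ELAPSED > $(( 2 * W )), would be killed"
fi

# Judged against an assumed cluster walltime of N*W, the cluster is fine.
if over_limit "$CLUSTER_ELAPSED" $(( N * W )); then
    echo "cluster check: would be killed"
else
    echo "cluster check: $CLUSTER_ELAPSED <= $(( 2 * N * W )), left alone"
fi
```

With these numbers the first check fires and the second does not, which is the case for applying the rule to the cluster as a whole rather than to its member jobs.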
From hategan at mcs.anl.gov Sat Feb 7 14:04:25 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sat, 7 Feb 2009 14:04:25 -0600 (CST) Subject: [Swift-devel] strange behavior evaluating function call as trace arg Message-ID: <23806863.1952211234037065554.JavaMail.root@zimbra> I think that the particular way in which the implementation manages to do the dataflow (i.e. returns as ref args) should not be in the intermediate .xml file. In other words, y = f(x) should be: y x not y x That scheme can be applied later when generating the karajan code. The current stuff complicates reasoning about the intermediate code (e.g. typechecking). From hategan at mcs.anl.gov Sat Feb 7 19:37:16 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sat, 7 Feb 2009 19:37:16 -0600 (CST) Subject: [Swift-devel] strange behavior evaluating function call as trace arg In-Reply-To: <23806863.1952211234037065554.JavaMail.root@zimbra> Message-ID: <31868237.1960501234057036064.JavaMail.root@zimbra> Here's a patch. It allows procedure invocations in expressions. http://www.ci.uchicago.edu/~hategan/invoke-proc.patch From benc at hawaga.org.uk Sun Feb 8 03:54:23 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Sun, 8 Feb 2009 09:54:23 +0000 (GMT) Subject: [Swift-devel] strange behavior evaluating function call as trace arg In-Reply-To: <23806863.1952211234037065554.JavaMail.root@zimbra> References: <23806863.1952211234037065554.JavaMail.root@zimbra> Message-ID: On Sat, 7 Feb 2009, Mihael Hategan wrote: > In other words, y = f(x) should be: > > > y > > x > > That would fit in with the general trend to not syntactically distinguishing between what used to be datasets and what used to not be datasets. Need more markup than the above, though, to accept multiple lvalues in the assignment. 
-- From benc at hawaga.org.uk Sun Feb 8 05:28:32 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Sun, 8 Feb 2009 11:28:32 +0000 (GMT) Subject: [Swift-devel] strange behavior evaluating function call as trace arg In-Reply-To: <31868237.1960501234057036064.JavaMail.root@zimbra> References: <31868237.1960501234057036064.JavaMail.root@zimbra> Message-ID: On Sat, 7 Feb 2009, Mihael Hategan wrote: > Here's a patch. It allows procedure invocations in expressions. > > http://www.ci.uchicago.edu/~hategan/invoke-proc.patch In the test in the below patch, I get a conflict with the use of $ for more than one purpose. The use in the nested procedure call upsets getThreadPrefix which is expecting it to contain something else (if it exists). Renaming that variable (as the below patch does) makes the test run ok for me. http://www.ci.uchicago.edu/~benc/invoke-proc-test-fix-1 (its not a very good test as it doesn't check the output is correct, but it suffices for the purposes of this precise bug) -- From benc at hawaga.org.uk Sun Feb 8 08:05:33 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Sun, 8 Feb 2009 14:05:33 +0000 (GMT) Subject: [Swift-devel] strange behavior evaluating function call as trace arg In-Reply-To: References: <23806863.1952211234037065554.JavaMail.root@zimbra> Message-ID: I'm working in this area, making declarations have both <"mapping expressions"> and =assignments. It looks like I'll end up making some change like below as part of that. On Sun, 8 Feb 2009, Ben Clifford wrote: > > On Sat, 7 Feb 2009, Mihael Hategan wrote: > > > In other words, y = f(x) should be: > > > > > > y > > > > x > > > > > > That would fit in with the general trend to not syntactically > distinguishing between what used to be datasets and what used to not be > datasets. > > Need more markup than the above, though, to accept multiple lvalues in the > assignment. 
> > From hategan at mcs.anl.gov Sun Feb 8 09:17:40 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 08 Feb 2009 09:17:40 -0600 Subject: [Swift-devel] strange behavior evaluating function call as trace arg In-Reply-To: References: <23806863.1952211234037065554.JavaMail.root@zimbra> Message-ID: <1234106260.11012.0.camel@localhost> On Sun, 2009-02-08 at 09:54 +0000, Ben Clifford wrote: > On Sat, 7 Feb 2009, Mihael Hategan wrote: > > > In other words, y = f(x) should be: > > > > > > y > > > > x > > > > > > That would fit in with the general trend to not syntactically > distinguishing between what used to be datasets and what used to not be > datasets. > > Need more markup than the above, though, to accept multiple lvalues in the > assignment. > and or maybe . From wilde at mcs.anl.gov Sun Feb 8 23:36:31 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sun, 08 Feb 2009 23:36:31 -0600 Subject: [Swift-devel] runaway workers on teraport coaster test of Message-ID: <498FC0DF.8060607@mcs.anl.gov> I'm testing coasters with http://www.ci.uchicago.edu/~benc/tmp/coaster-head-elsewhere This worked for me once Fri at noon but not since. I put a .soft entry for java into the ~osg .soft file, to deal with issues discussed off-list. I made a few small changes in bootstrap.sh from that patched version - some for logging, and one to make the X509_CERT_DIR variable conditional on whether that directory exists. The coaster service now starts, but it went into a loop spawning short-lived workers, and not getting anything done. I saw dozens of workers start, with about 10-20 or so running at a time. These logs and all files related to the run are in ~wilde/oops/oopstest/runaway-workers. In coasters.log I see 50+ messages "WorkerManager No suitable worker found. Attempting to start a new one." Any thoughts on this? TeraPort has been too saturated this evening to test any further, but it would be good to have some sense of what's causing this.
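The bootstrap.sh change mentioned above, setting X509_CERT_DIR only when the directory exists, might look roughly like the following. This is a sketch, assuming a POSIX shell, and is not the actual edit made to bootstrap.sh. The function name `set_cert_dir`, the variable `CA_DIR` and the placeholder path are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: export X509_CERT_DIR only if the given directory
# exists, so that a missing path does not override any default lookup.
set_cert_dir() {
    if [ -d "$1" ]; then
        X509_CERT_DIR="$1"
        export X509_CERT_DIR
    fi
}

# Placeholder location; the real CA directory is site-specific and is
# not given in this thread.
CA_DIR="${CA_DIR:-/path/to/certificates}"

set_cert_dir "$CA_DIR"
echo "X509_CERT_DIR=${X509_CERT_DIR:-<unset>}"
```

Leaving the variable unset when the directory is absent keeps the same script usable on sites where the CA directory is found automatically.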
From benc at hawaga.org.uk Mon Feb 9 02:40:52 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 9 Feb 2009 08:40:52 +0000 (GMT) Subject: [Swift-devel] runaway workers on teraport coaster test of In-Reply-To: <498FC0DF.8060607@mcs.anl.gov> References: <498FC0DF.8060607@mcs.anl.gov> Message-ID: On Sun, 8 Feb 2009, Michael Wilde wrote: > The coaster service now starts, but it went into a loop spawning short-lived > workers, and not getting anything done. > > I saw dozens of workers start, with about 10-20 or so running at a time. Through what mechanism were you seeing that? The coasters.log file (for the run around 1724) only shows workers getting as far as being submitted. -- From benc at hawaga.org.uk Mon Feb 9 03:07:08 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 9 Feb 2009 09:07:08 +0000 (GMT) Subject: [Swift-devel] r2509 breaks GRAM2 job submission by adding an unknown attribute Message-ID: r2509 adds a tr attribute to all jobs, which causes gram2 job submissions to fail like this: stdout.txt: ---- Caused by: Cannot submit job Caused by: Parameter not supported (on tg-uc and tp) r2515 reverts r2509. -- From benc at hawaga.org.uk Mon Feb 9 05:02:45 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 9 Feb 2009 11:02:45 +0000 (GMT) Subject: [Swift-devel] runaway workers on teraport coaster test of In-Reply-To: References: <498FC0DF.8060607@mcs.anl.gov> Message-ID: With more hacking round to get jobs to run in teraport's test queue rather than in the default extended queue, I see a similar problem - I see lots of workers being launched, getting as far as exchanging a heartbeat with the head job, and then not being issued with jobs, with new workers being launched every few seconds. On the swift side, no jobs ever go beyond Submitted state.
-- From benc at hawaga.org.uk Mon Feb 9 05:12:22 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 9 Feb 2009 11:12:22 +0000 (GMT) Subject: [Swift-devel] bash profile emitting text causes md5sum to be not located Message-ID: On teraport, my .bash_profile emits some information to stdout at login. That causes bootstrap.sh to get confused (it takes the bash_profile output to be the name of the md5sum executable, and then is unable to execute). This change stops that happening for me: -MD5SUM=`find 'which gmd5sum'` +MD5SUM=`which gmd5sum` if [ "X$MD5SUM" == "X" ]; then - MD5SUM=`find 'which md5sum'` + MD5SUM=`which md5sum` This is because the find command (which is a shell procedure in this case, not unix find) first invokes the command, and if it gives no output, invokes it again with a bash login shell, which then does give output in the gmd5sum case when it should not. I think maybe that this should not happen in the bowels of the bootstrap script - either the environment is correctly initialised for the whole bootstrap script, or it is not. (this ties into the question of whether bash should be invoked with -l or not for the whole bootstrap script and other environmental configuration) -- From benc at hawaga.org.uk Mon Feb 9 08:06:55 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 9 Feb 2009 14:06:55 +0000 (GMT) Subject: [Swift-devel] double negatives Message-ID: For UI purposes, I'd like to flip the truthiness of ticker.disable, so that it's ticker.enable with a default value of true. Double negatives are harder to understand. -- From benc at hawaga.org.uk Mon Feb 9 08:59:14 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 9 Feb 2009 14:59:14 +0000 (GMT) Subject: [Swift-devel] strange behavior evaluating function call as trace arg In-Reply-To: References: <31868237.1960501234057036064.JavaMail.root@zimbra> Message-ID: I committed your original patch and my fix as r2520.
On Sun, 8 Feb 2009, Ben Clifford wrote: > > On Sat, 7 Feb 2009, Mihael Hategan wrote: > > > Here's a patch. It allows procedure invocations in expressions. > > > > http://www.ci.uchicago.edu/~hategan/invoke-proc.patch > > In the test in the below patch, I get a conflict with the use of $ for > more than one purpose. The use in the nested procedure call upsets > getThreadPrefix which is expecting it to contain something else (if it > exists). Renaming that variable (as the below patch does) makes the test > run ok for me. > > http://www.ci.uchicago.edu/~benc/invoke-proc-test-fix-1 > > (its not a very good test as it doesn't check the output is correct, but > it suffices for the purposes of this precise bug) > > From hategan at mcs.anl.gov Mon Feb 9 10:02:00 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 09 Feb 2009 10:02:00 -0600 Subject: [Swift-devel] r2509 breaks GRAM2 job submission by adding an unknown attribute In-Reply-To: References: Message-ID: <1234195320.23799.0.camel@localhost> On Mon, 2009-02-09 at 09:07 +0000, Ben Clifford wrote: > r2509 adds a tr attribute to all jobs, which causes gram2 job submissions > to fail like this: > > stdout.txt: > > ---- > > Caused by: > Cannot submit job > Caused by: > Parameter not supported That would be me. I have to think about it. In the mean time, please use an older version. From benc at hawaga.org.uk Mon Feb 9 10:02:55 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 9 Feb 2009 16:02:55 +0000 (GMT) Subject: [Swift-devel] r2509 breaks GRAM2 job submission by adding an unknown attribute In-Reply-To: <1234195320.23799.0.camel@localhost> References: <1234195320.23799.0.camel@localhost> Message-ID: > That would be me. I have to think about it. What do you want it for, out of interest? 
-- From hategan at mcs.anl.gov Mon Feb 9 10:04:25 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 09 Feb 2009 10:04:25 -0600 Subject: [Swift-devel] bash profile emitting text causes md5sum to be not located In-Reply-To: References: Message-ID: <1234195465.23799.3.camel@localhost> On Mon, 2009-02-09 at 11:12 +0000, Ben Clifford wrote: > On teraport, my .bash_profile emits some information to stdout at login. > > That causes bootstrap.sh to be unable to get confused (it takes the > bash_profile output to be the name of the md5sum executable, and then is > unable to execute). > > This change stops that happening for me: > > -MD5SUM=`find 'which gmd5sum'` > +MD5SUM=`which gmd5sum` > if [ "X$MD5SUM" == "X" ]; then > - MD5SUM=`find 'which md5sum'` > + MD5SUM=`which md5sum` > > This is because the find command (which is a shell procedure in this case, > not unix find) first invokes the command, and if it gives no output, > invokes it again with a bash login shell, which then does give output in > the gmd5sum case when it should not. > > I think maybe that this should not happen in the bowels of the bootstrap > script - either the environment is correctly initialised for the whole > bootstrap script, or it is not. Well, obviously you can't have it all. This is why I put the change in in the first place. So hang on. I'm working on it. From benc at hawaga.org.uk Mon Feb 9 10:06:01 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 9 Feb 2009 16:06:01 +0000 (GMT) Subject: [Swift-devel] which foo Message-ID: Here's a particularly annoyingly behaved 'which', on my os x 10.4 box: $ which foo 2>/dev/null ; echo $? 
no foo in /Users/benc/work/cog/modules/swift/dist/swift-svn/bin /Users/benc/bin /opt/local/bin /usr/X11R6/bin /Users/benc/work/globus/4.2.0-rc1/bin /Users/benc/work/globus/4.2.0-rc1/sbin /bin /sbin /usr/bin /usr/sbin
0
I came across this when trying to find out why bootstrap wasn't switching over to using curl when it can't find wget, testing coasters locally. -- From hategan at mcs.anl.gov Mon Feb 9 10:21:37 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 09 Feb 2009 10:21:37 -0600 Subject: [Swift-devel] r2509 breaks GRAM2 job submission by adding an unknown attribute In-Reply-To: References: <1234195320.23799.0.camel@localhost> Message-ID: <1234196497.24358.0.camel@localhost> On Mon, 2009-02-09 at 16:02 +0000, Ben Clifford wrote: > > > That would be me. I have to think about it. > > What do you want it for, out of interest? > Display a message if the walltime is missing. But I suppose that can be done earlier. From benc at hawaga.org.uk Mon Feb 9 10:25:07 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 9 Feb 2009 16:25:07 +0000 (GMT) Subject: [Swift-devel] Re: which foo In-Reply-To: References: Message-ID: ... though that's easy to get around with a test for executableness of the output, which I have done in my local checkout. The curl support is broken elsewhere, in its use of -O instead of -o, which I am also fixing. -- From hategan at mcs.anl.gov Mon Feb 9 10:29:17 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 09 Feb 2009 10:29:17 -0600 Subject: [Swift-devel] Re: which foo In-Reply-To: References: Message-ID: <1234196957.24501.1.camel@localhost> On Mon, 2009-02-09 at 16:25 +0000, Ben Clifford wrote: > ... though thats easy to get around with a test for executableness of the > output, which I have done in my local checkout. Right. That's what I was going to put in. I'm still appalled by $? being 0 though. > > The curl support is broken elsewhere, in its use of -O instead of -o, > which I am also fixing.
> From wilde at mcs.anl.gov Mon Feb 9 10:52:03 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 09 Feb 2009 10:52:03 -0600 Subject: [Swift-devel] Re: which foo In-Reply-To: References: Message-ID: <49905F33.3090404@mcs.anl.gov> also, the case where java is not found either initially or after testing with "bash -l" yields a dirname error, I think. I didn't have time to gather the output of that, but it's easy to see where it happens. that was the case on teraport to the osg vo using the service-on-worker patch. On 2/9/09 10:25 AM, Ben Clifford wrote: > ... though thats easy to get around with a test for executableness of the > output, which I have done in my local checkout. > > The curl support is broken elsewhere, in its use of -O instead of -o, > which I am also fixing. > From benc at hawaga.org.uk Mon Feb 9 11:01:29 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 9 Feb 2009 17:01:29 +0000 (GMT) Subject: [Swift-devel] JAVA_HOME misdetection Message-ID: Another thing I've seen is when java is on the path, but JAVA_HOME is not set, and java is found not-in-JAVA_HOME, like this:
$ which java
/usr/bin/java
$ ls -l /usr/bin/java
lrwxr-xr-x 1 root wheel 77 Feb 3 11:11 /usr/bin/java -> /System/Library/Frameworks/JavaVM.framework/Versions/CurrentJDK/Commands/java
then setting JAVA_HOME incorrectly to /usr/ in this case breaks things, when leaving it entirely unset does not break things. This is on my os x box.
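A minimal shell sketch of one way to sidestep both this and the misleading 'which' output from the "which foo" thread. This is not the actual bootstrap.sh code: the helper name find_exe and the variable JAVA are made up for illustration. The idea is to trust 'which' only when its output names an executable file, and to use JAVA_HOME only when the caller already set it, rather than guessing it from where java was found.

```shell
# Hypothetical helper (not from bootstrap.sh): print the full path of the
# executable named $1, or print nothing if it cannot be found.
find_exe() {
    candidate=`which "$1" 2>/dev/null || true`
    # Some 'which' implementations print "no foo in ..." on stdout and still
    # exit with status 0, so only accept output that names an executable file.
    if [ -n "$candidate" ] && [ -f "$candidate" ] && [ -x "$candidate" ]; then
        echo "$candidate"
    fi
}

# Honour JAVA_HOME only if it was already set to something usable; otherwise
# take java from the PATH and leave JAVA_HOME unset, instead of deriving it
# from the location of a symlink such as /usr/bin/java.
if [ -n "$JAVA_HOME" ] && [ -x "$JAVA_HOME/bin/java" ]; then
    JAVA="$JAVA_HOME/bin/java"
else
    JAVA=`find_exe java`
fi
```

If JAVA ends up empty, the script can report that java was not found instead of passing an empty string on to dirname.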
-- From benc at hawaga.org.uk Mon Feb 9 11:10:32 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 9 Feb 2009 17:10:32 +0000 (GMT) Subject: [Swift-devel] runaway workers on teraport coaster test of In-Reply-To: References: <498FC0DF.8060607@mcs.anl.gov> Message-ID: On Mon, 9 Feb 2009, Ben Clifford wrote: > With more hacking round to get jobs to run in teraport's test queue rather > than in the default extended queue, I see a similar problem - I see lots > of workers being launched, getting as far as exchanging a heartbeat with > the head job, and then not being issued with jobs, with new workers being > launched every few seconds. On the swift side, no jobs ever go beyond > Submitted state. However coasters locally on my laptop (modulo various environmental fun discussed in other messages) runs ok and does not show this problem - I can run tests/sites/ run-site coasters/coaster-local.xml to completion. -- From hategan at mcs.anl.gov Mon Feb 9 12:09:22 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 09 Feb 2009 12:09:22 -0600 Subject: [Swift-devel] Re: which foo In-Reply-To: <49905F33.3090404@mcs.anl.gov> References: <49905F33.3090404@mcs.anl.gov> Message-ID: <1234202962.29439.1.camel@localhost> This is getting too annoying. Too much variation. So I'm exploring embedding the bootstrap jar into the script so that the only thing the script needs to do is find java (no md5 check and no wget). On Mon, 2009-02-09 at 10:52 -0600, Michael Wilde wrote: > also, the case where java is not found either initially or after testing > with "bash -l" yields a dirname error, I think. I didnt have time to > gather the output of that, but its easy to see where it happens. > > that was the case on teraport to the osg vo using the service-on-worker > patch. > > On 2/9/09 10:25 AM, Ben Clifford wrote: > > ... though thats easy to get around with a test for executableness of the > > output, which I have done in my local checkout. 
> > > > The curl support is broken elsewhere, in its use of -O instead of -o, > > which I am also fixing. > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Mon Feb 9 13:03:39 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 09 Feb 2009 13:03:39 -0600 Subject: [Swift-devel] Re: which foo In-Reply-To: <1234202962.29439.1.camel@localhost> References: <49905F33.3090404@mcs.anl.gov> <1234202962.29439.1.camel@localhost> Message-ID: <1234206219.29700.0.camel@localhost> Not a very good idea. Back to where I was. On Mon, 2009-02-09 at 12:09 -0600, Mihael Hategan wrote: > This is getting too annoying. Too much variation. So I'm exploring > embedding the bootstrap jar into the script so that the only thing the > script needs to do is find java (no md5 check and no wget). > > On Mon, 2009-02-09 at 10:52 -0600, Michael Wilde wrote: > > also, the case where java is not found either initially or after testing > > with "bash -l" yields a dirname error, I think. I didnt have time to > > gather the output of that, but its easy to see where it happens. > > > > that was the case on teraport to the osg vo using the service-on-worker > > patch. > > > > On 2/9/09 10:25 AM, Ben Clifford wrote: > > > ... though thats easy to get around with a test for executableness of the > > > output, which I have done in my local checkout. > > > > > > The curl support is broken elsewhere, in its use of -O instead of -o, > > > which I am also fixing. 
> > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Mon Feb 9 13:06:39 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 09 Feb 2009 13:06:39 -0600 Subject: [Swift-devel] JAVA_HOME misdetection In-Reply-To: References: Message-ID: <1234206399.30971.0.camel@localhost> On Mon, 2009-02-09 at 17:01 +0000, Ben Clifford wrote: > Another thing i've seen is when java is on the path, but JAVA_HOME is not > set, and java is found not-in-JAVA_HOME, like this: > > $ which java > /usr/bin/java > $ ls -l /usr/bin/java > lrwxr-xr-x 1 root wheel 77 Feb 3 11:11 /usr/bin/java -> > /System/Library/Frameworks/JavaVM.framework/Versions/CurrentJDK/Commands/java > > then setting JAVA_HOME incorrectly to /usr/ in this case breaks things, How does it break things? > when leaving it entirely unset does not break things. > > This is on my os x box. > From hategan at mcs.anl.gov Mon Feb 9 13:46:53 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 09 Feb 2009 13:46:53 -0600 Subject: [Swift-devel] Problems with coasters and managedfork jobmanager In-Reply-To: <498C6655.3010702@mcs.anl.gov> References: <498BC903.7010008@mcs.anl.gov> <498C550B.4040303@mcs.anl.gov> <1233936759.5005.2.camel@localhost> <1233937447.5206.0.camel@localhost> <498C6655.3010702@mcs.anl.gov> Message-ID: <1234208813.32649.0.camel@localhost> On Fri, 2009-02-06 at 10:33 -0600, Michael Wilde wrote: > I dont know, but I am testing a version where I removed the newlines > from bootstrap.pl (and adjusted a few bits manually) May we see that? > and I *think* its > moving on to the next stage and trying to start the workers. 
> > Ben, it seems that *some* whitespace is passed on OK, in that I can run > a job that does echo "hello world" and that blank after hello is > preserved, and the job runs. I assume the whitespace problem is more > subtle than that? > > On 2/6/09 10:24 AM, Mihael Hategan wrote: > > On Fri, 2009-02-06 at 16:17 +0000, Ben Clifford wrote: > >> On Fri, 6 Feb 2009, Mihael Hategan wrote: > >> > >>> I guess we'll have to stage in the bootstrap script using the stage-in > >>> directive if we are to support managed fork, since I don't see OSG > >>> fixing the issue. > >> They are fixing the whitespace in parameters - see the gram-user thread I > >> sent in a different message. > > > > Does this include the new lines? > > From wilde at mcs.anl.gov Mon Feb 9 14:23:03 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 09 Feb 2009 14:23:03 -0600 Subject: [Swift-devel] Problems with coasters and managedfork jobmanager In-Reply-To: <1234208813.32649.0.camel@localhost> References: <498BC903.7010008@mcs.anl.gov> <498C550B.4040303@mcs.anl.gov> <1233936759.5005.2.camel@localhost> <1233937447.5206.0.camel@localhost> <498C6655.3010702@mcs.anl.gov> <1234208813.32649.0.camel@localhost> Message-ID: <499090A7.2020100@mcs.anl.gov> its www.ci.uchicago.edu/~wilde/bootstrap.nonl.sh, plus I removed the code in ServiceManager that inserted the extra newlines when reading it into a string buffer. I checked as far as verifying in the gram log that it was seen by gram as a single line in the rsl. I never got a successful run from it, though - it ran into other problems later. On 2/9/09 1:46 PM, Mihael Hategan wrote: > On Fri, 2009-02-06 at 10:33 -0600, Michael Wilde wrote: >> I dont know, but I am testing a version where I removed the newlines >> from bootstrap.pl (and adjusted a few bits manually) > > May we see that? > >> and I *think* its >> moving on to the next stage and trying to start the workers. 
>> >> Ben, it seems that *some* whitespace is passed on OK, in that I can run >> a job that does echo "hello world" and that blank after hello is >> preserved, and the job runs. I assume the whitespace problem is more >> subtle than that? >> >> On 2/6/09 10:24 AM, Mihael Hategan wrote: >>> On Fri, 2009-02-06 at 16:17 +0000, Ben Clifford wrote: >>>> On Fri, 6 Feb 2009, Mihael Hategan wrote: >>>> >>>>> I guess we'll have to stage in the bootstrap script using the stage-in >>>>> directive if we are to support managed fork, since I don't see OSG >>>>> fixing the issue. >>>> They are fixing the whitespace in parameters - see the gram-user thread I >>>> sent in a different message. >>> Does this include the new lines? >>> > From hategan at mcs.anl.gov Mon Feb 9 14:34:04 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 09 Feb 2009 14:34:04 -0600 Subject: [Swift-devel] Problems with coasters and managedfork jobmanager In-Reply-To: <499090A7.2020100@mcs.anl.gov> References: <498BC903.7010008@mcs.anl.gov> <498C550B.4040303@mcs.anl.gov> <1233936759.5005.2.camel@localhost> <1233937447.5206.0.camel@localhost> <498C6655.3010702@mcs.anl.gov> <1234208813.32649.0.camel@localhost> <499090A7.2020100@mcs.anl.gov> Message-ID: <1234211644.6450.0.camel@localhost> I asked because I couldn't figure out how to get it to work. But it seems like yours has problems, too: mike at blabla tmp$ sh bootstrap.nonl.sh bootstrap.nonl.sh: line 1: syntax error near unexpected token `;' ... On Mon, 2009-02-09 at 14:23 -0600, Michael Wilde wrote: > its www.ci.uchicago.edu/~wilde/bootstrap.nonl.sh, plus I removed the > code in ServiceManager that inserted the extra newlines when reading it > into a string buffer. I checked as far as verifying in the gram log that > it was seen by gram as a single line in the rsl. I never got a > successful run from it, though - it ran into other problems later. 
> > On 2/9/09 1:46 PM, Mihael Hategan wrote: > > On Fri, 2009-02-06 at 10:33 -0600, Michael Wilde wrote: > >> I dont know, but I am testing a version where I removed the newlines > >> from bootstrap.pl (and adjusted a few bits manually) > > > > May we see that? > > > >> and I *think* its > >> moving on to the next stage and trying to start the workers. > >> > >> Ben, it seems that *some* whitespace is passed on OK, in that I can run > >> a job that does echo "hello world" and that blank after hello is > >> preserved, and the job runs. I assume the whitespace problem is more > >> subtle than that? > >> > >> On 2/6/09 10:24 AM, Mihael Hategan wrote: > >>> On Fri, 2009-02-06 at 16:17 +0000, Ben Clifford wrote: > >>>> On Fri, 6 Feb 2009, Mihael Hategan wrote: > >>>> > >>>>> I guess we'll have to stage in the bootstrap script using the stage-in > >>>>> directive if we are to support managed fork, since I don't see OSG > >>>>> fixing the issue. > >>>> They are fixing the whitespace in parameters - see the gram-user thread I > >>>> sent in a different message. > >>> Does this include the new lines? > >>> > > From wilde at mcs.anl.gov Mon Feb 9 16:06:19 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 09 Feb 2009 16:06:19 -0600 Subject: [Swift-devel] Problems with coasters and managedfork jobmanager In-Reply-To: <1234211644.6450.0.camel@localhost> References: <498BC903.7010008@mcs.anl.gov> <498C550B.4040303@mcs.anl.gov> <1233936759.5005.2.camel@localhost> <1233937447.5206.0.camel@localhost> <498C6655.3010702@mcs.anl.gov> <1234208813.32649.0.camel@localhost> <499090A7.2020100@mcs.anl.gov> <1234211644.6450.0.camel@localhost> Message-ID: <4990A8DB.1010606@mcs.anl.gov> sorry, i think i put the wrong version there; i dealt with that specific problem and hit a subtler one, deeper in the bootstrap process. On 2/9/09 2:34 PM, Mihael Hategan wrote: > I asked because I couldn't figure out how to get it to work. 
> > But it seems like yours has problems, too: > mike at blabla tmp$ sh bootstrap.nonl.sh > bootstrap.nonl.sh: line 1: syntax error near unexpected token `;' > ... > > On Mon, 2009-02-09 at 14:23 -0600, Michael Wilde wrote: >> its www.ci.uchicago.edu/~wilde/bootstrap.nonl.sh, plus I removed the >> code in ServiceManager that inserted the extra newlines when reading it >> into a string buffer. I checked as far as verifying in the gram log that >> it was seen by gram as a single line in the rsl. I never got a >> successful run from it, though - it ran into other problems later. >> >> On 2/9/09 1:46 PM, Mihael Hategan wrote: >>> On Fri, 2009-02-06 at 10:33 -0600, Michael Wilde wrote: >>>> I dont know, but I am testing a version where I removed the newlines >>>> from bootstrap.pl (and adjusted a few bits manually) >>> May we see that? >>> >>>> and I *think* its >>>> moving on to the next stage and trying to start the workers. >>>> >>>> Ben, it seems that *some* whitespace is passed on OK, in that I can run >>>> a job that does echo "hello world" and that blank after hello is >>>> preserved, and the job runs. I assume the whitespace problem is more >>>> subtle than that? >>>> >>>> On 2/6/09 10:24 AM, Mihael Hategan wrote: >>>>> On Fri, 2009-02-06 at 16:17 +0000, Ben Clifford wrote: >>>>>> On Fri, 6 Feb 2009, Mihael Hategan wrote: >>>>>> >>>>>>> I guess we'll have to stage in the bootstrap script using the stage-in >>>>>>> directive if we are to support managed fork, since I don't see OSG >>>>>>> fixing the issue. >>>>>> They are fixing the whitespace in parameters - see the gram-user thread I >>>>>> sent in a different message. >>>>> Does this include the new lines? 
>>>>> > From benc at hawaga.org.uk Mon Feb 9 15:33:57 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 9 Feb 2009 21:33:57 +0000 (GMT) Subject: [Swift-devel] JAVA_HOME misdetection In-Reply-To: <1234206399.30971.0.camel@localhost> References: <1234206399.30971.0.camel@localhost> Message-ID: On Mon, 9 Feb 2009, Mihael Hategan wrote: > How does it break things? screws up timezone checking and then gives an ExceptionPreparationError or something like that (an Error that I'd never heard of before). -- From hategan at mcs.anl.gov Mon Feb 9 16:23:16 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 09 Feb 2009 16:23:16 -0600 Subject: [Swift-devel] JAVA_HOME misdetection In-Reply-To: References: <1234206399.30971.0.camel@localhost> Message-ID: <1234218196.8851.0.camel@localhost> On Mon, 2009-02-09 at 21:33 +0000, Ben Clifford wrote: > On Mon, 9 Feb 2009, Mihael Hategan wrote: > > > How does it break things? > > screws up timezone checking and then gives an ExceptionPreparationError or > something like that (an Error that I'd never heard of before). > Can you dig a bit more? From wilde at mcs.anl.gov Mon Feb 9 17:05:08 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 09 Feb 2009 17:05:08 -0600 Subject: [Swift-devel] Problems with coasters and managedfork jobmanager In-Reply-To: <4990A8DB.1010606@mcs.anl.gov> References: <498BC903.7010008@mcs.anl.gov> <498C550B.4040303@mcs.anl.gov> <1233936759.5005.2.camel@localhost> <1233937447.5206.0.camel@localhost> <498C6655.3010702@mcs.anl.gov> <1234208813.32649.0.camel@localhost> <499090A7.2020100@mcs.anl.gov> <1234211644.6450.0.camel@localhost> <4990A8DB.1010606@mcs.anl.gov> Message-ID: <4990B6A4.3070806@mcs.anl.gov> No, correction, I mis-spoke. When I try this direct to the shell (as opposed to via swift and coaster bootstrap) I get the same error as you show. I cant get sh to accept function defs on 1 line. So I must have mis-interpreted my result from last Friday. 
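For reference, sh does accept a function definition on a single line, provided newlines are turned into ';' only where a command ends. A ';' placed directly after '{', 'then', 'do' or 'else' is a syntax error, which is one plausible (unconfirmed) cause of the "unexpected token `;'" message quoted above. A minimal sketch, with made-up function names:

```shell
# Multi-line form, as it might appear in a script:
#   greet() {
#       echo "hello $1"
#   }
# Same function on one line: a space after '{', and a ';' before '}'.
greet() { echo "hello $1"; }

# Compound commands collapse the same way: ';' goes before then/else/fi,
# never directly after 'then' or 'else'.
check() { if [ -n "$1" ]; then echo "set"; else echo "empty"; fi; }

greet world
check ""
```

Whether a script flattened this way survives being passed through GRAM as a single RSL argument is a separate question that this sketch does not address.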
- Mike On 2/9/09 4:06 PM, Michael Wilde wrote: > sorry, i think i put the wrong version there; i dealt with that specific > problem and hit a subtler one, deeper in the bootstrap process. > > On 2/9/09 2:34 PM, Mihael Hategan wrote: >> I asked because I couldn't figure out how to get it to work. >> >> But it seems like yours has problems, too: >> mike at blabla tmp$ sh bootstrap.nonl.sh >> bootstrap.nonl.sh: line 1: syntax error near unexpected token `;' >> ... >> >> On Mon, 2009-02-09 at 14:23 -0600, Michael Wilde wrote: >>> its www.ci.uchicago.edu/~wilde/bootstrap.nonl.sh, plus I removed the >>> code in ServiceManager that inserted the extra newlines when reading >>> it into a string buffer. I checked as far as verifying in the gram >>> log that it was seen by gram as a single line in the rsl. I never >>> got a successful run from it, though - it ran into other problems later. >>> >>> On 2/9/09 1:46 PM, Mihael Hategan wrote: >>>> On Fri, 2009-02-06 at 10:33 -0600, Michael Wilde wrote: >>>>> I dont know, but I am testing a version where I removed the >>>>> newlines from bootstrap.pl (and adjusted a few bits manually) >>>> May we see that? >>>> >>>>> and I *think* its moving on to the next stage and trying to start >>>>> the workers. >>>>> >>>>> Ben, it seems that *some* whitespace is passed on OK, in that I can >>>>> run a job that does echo "hello world" and that blank after hello >>>>> is preserved, and the job runs. I assume the whitespace problem is >>>>> more subtle than that? >>>>> >>>>> On 2/6/09 10:24 AM, Mihael Hategan wrote: >>>>>> On Fri, 2009-02-06 at 16:17 +0000, Ben Clifford wrote: >>>>>>> On Fri, 6 Feb 2009, Mihael Hategan wrote: >>>>>>> >>>>>>>> I guess we'll have to stage in the bootstrap script using the >>>>>>>> stage-in >>>>>>>> directive if we are to support managed fork, since I don't see OSG >>>>>>>> fixing the issue. >>>>>>> They are fixing the whitespace in parameters - see the gram-user >>>>>>> thread I sent in a different message. 
>>>>>> Does this include the new lines? >>>>>> >> > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From benc at hawaga.org.uk Mon Feb 9 16:32:55 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 9 Feb 2009 22:32:55 +0000 (GMT) Subject: [Swift-devel] JAVA_HOME misdetection In-Reply-To: <1234218196.8851.0.camel@localhost> References: <1234206399.30971.0.camel@localhost> <1234218196.8851.0.camel@localhost> Message-ID: On Mon, 9 Feb 2009, Mihael Hategan wrote: > > > How does it break things? > > > > screws up timezone checking and then gives an ExceptionPreparationError or > > something like that (an Error that I'd never heard of before). > > Can you dig a bit more? I can give you a whole paste of the error next time I go that way. It strikes me that the answer, however is, "don't set JAVA_HOME to some directory that isn't." -- From hategan at mcs.anl.gov Mon Feb 9 18:16:12 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 09 Feb 2009 18:16:12 -0600 Subject: [Swift-devel] Problems with coasters and managedfork jobmanager In-Reply-To: <4990B6A4.3070806@mcs.anl.gov> References: <498BC903.7010008@mcs.anl.gov> <498C550B.4040303@mcs.anl.gov> <1233936759.5005.2.camel@localhost> <1233937447.5206.0.camel@localhost> <498C6655.3010702@mcs.anl.gov> <1234208813.32649.0.camel@localhost> <499090A7.2020100@mcs.anl.gov> <1234211644.6450.0.camel@localhost> <4990A8DB.1010606@mcs.anl.gov> <4990B6A4.3070806@mcs.anl.gov> Message-ID: <1234224972.11972.2.camel@localhost> I did get a one-liner perl which contains an encoded bootstrap jar file (only printable characters). Perl cares less about whitespace, which makes me think it's a better candidate if we want to go that way. Thoughts? On Mon, 2009-02-09 at 17:05 -0600, Michael Wilde wrote: > No, correction, I mis-spoke. 
When I try this direct to the shell (as > opposed to via swift and coaster bootstrap) I get the same error as you > show. I cant get sh to accept function defs on 1 line. > > So I must have mis-interpreted my result from last Friday. > > - Mike > > > On 2/9/09 4:06 PM, Michael Wilde wrote: > > sorry, i think i put the wrong version there; i dealt with that specific > > problem and hit a subtler one, deeper in the bootstrap process. > > > > On 2/9/09 2:34 PM, Mihael Hategan wrote: > >> I asked because I couldn't figure out how to get it to work. > >> > >> But it seems like yours has problems, too: > >> mike at blabla tmp$ sh bootstrap.nonl.sh > >> bootstrap.nonl.sh: line 1: syntax error near unexpected token `;' > >> ... > >> > >> On Mon, 2009-02-09 at 14:23 -0600, Michael Wilde wrote: > >>> its www.ci.uchicago.edu/~wilde/bootstrap.nonl.sh, plus I removed the > >>> code in ServiceManager that inserted the extra newlines when reading > >>> it into a string buffer. I checked as far as verifying in the gram > >>> log that it was seen by gram as a single line in the rsl. I never > >>> got a successful run from it, though - it ran into other problems later. > >>> > >>> On 2/9/09 1:46 PM, Mihael Hategan wrote: > >>>> On Fri, 2009-02-06 at 10:33 -0600, Michael Wilde wrote: > >>>>> I dont know, but I am testing a version where I removed the > >>>>> newlines from bootstrap.pl (and adjusted a few bits manually) > >>>> May we see that? > >>>> > >>>>> and I *think* its moving on to the next stage and trying to start > >>>>> the workers. > >>>>> > >>>>> Ben, it seems that *some* whitespace is passed on OK, in that I can > >>>>> run a job that does echo "hello world" and that blank after hello > >>>>> is preserved, and the job runs. I assume the whitespace problem is > >>>>> more subtle than that? 
> >>>>> > >>>>> On 2/6/09 10:24 AM, Mihael Hategan wrote: > >>>>>> On Fri, 2009-02-06 at 16:17 +0000, Ben Clifford wrote: > >>>>>>> On Fri, 6 Feb 2009, Mihael Hategan wrote: > >>>>>>> > >>>>>>>> I guess we'll have to stage in the bootstrap script using the > >>>>>>>> stage-in > >>>>>>>> directive if we are to support managed fork, since I don't see OSG > >>>>>>>> fixing the issue. > >>>>>>> They are fixing the whitespace in parameters - see the gram-user > >>>>>>> thread I sent in a different message. > >>>>>> Does this include the new lines? > >>>>>> > >> > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From wilde at mcs.anl.gov Mon Feb 9 18:23:39 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 09 Feb 2009 18:23:39 -0600 Subject: [Swift-devel] Problems with coasters and managedfork jobmanager In-Reply-To: <1234224972.11972.2.camel@localhost> References: <498BC903.7010008@mcs.anl.gov> <498C550B.4040303@mcs.anl.gov> <1233936759.5005.2.camel@localhost> <1233937447.5206.0.camel@localhost> <498C6655.3010702@mcs.anl.gov> <1234208813.32649.0.camel@localhost> <499090A7.2020100@mcs.anl.gov> <1234211644.6450.0.camel@localhost> <4990A8DB.1010606@mcs.anl.gov> <4990B6A4.3070806@mcs.anl.gov> <1234224972.11972.2.camel@localhost> Message-ID: <4990C90B.40700@mcs.anl.gov> It sounds reasonable, but lets try it and see how well it works. I'd like to create a test that runs a trivial swift script on a set of osg and tg sites with coasters. If you create a patch or check it in I'll try it too. - Mike On 2/9/09 6:16 PM, Mihael Hategan wrote: > I did get a one-liner perl which contains an encoded bootstrap jar file > (only printable characters). Perl cares less about whitespace, which > makes me think it's a better candidate if we want to go that way. > > Thoughts? 
> > On Mon, 2009-02-09 at 17:05 -0600, Michael Wilde wrote: >> No, correction, I mis-spoke. When I try this direct to the shell (as >> opposed to via swift and coaster bootstrap) I get the same error as you >> show. I cant get sh to accept function defs on 1 line. >> >> So I must have mis-interpreted my result from last Friday. >> >> - Mike >> >> >> On 2/9/09 4:06 PM, Michael Wilde wrote: >>> sorry, i think i put the wrong version there; i dealt with that specific >>> problem and hit a subtler one, deeper in the bootstrap process. >>> >>> On 2/9/09 2:34 PM, Mihael Hategan wrote: >>>> I asked because I couldn't figure out how to get it to work. >>>> >>>> But it seems like yours has problems, too: >>>> mike at blabla tmp$ sh bootstrap.nonl.sh >>>> bootstrap.nonl.sh: line 1: syntax error near unexpected token `;' >>>> ... >>>> >>>> On Mon, 2009-02-09 at 14:23 -0600, Michael Wilde wrote: >>>>> its www.ci.uchicago.edu/~wilde/bootstrap.nonl.sh, plus I removed the >>>>> code in ServiceManager that inserted the extra newlines when reading >>>>> it into a string buffer. I checked as far as verifying in the gram >>>>> log that it was seen by gram as a single line in the rsl. I never >>>>> got a successful run from it, though - it ran into other problems later. >>>>> >>>>> On 2/9/09 1:46 PM, Mihael Hategan wrote: >>>>>> On Fri, 2009-02-06 at 10:33 -0600, Michael Wilde wrote: >>>>>>> I dont know, but I am testing a version where I removed the >>>>>>> newlines from bootstrap.pl (and adjusted a few bits manually) >>>>>> May we see that? >>>>>> >>>>>>> and I *think* its moving on to the next stage and trying to start >>>>>>> the workers. >>>>>>> >>>>>>> Ben, it seems that *some* whitespace is passed on OK, in that I can >>>>>>> run a job that does echo "hello world" and that blank after hello >>>>>>> is preserved, and the job runs. I assume the whitespace problem is >>>>>>> more subtle than that? 
>>>>>>> >>>>>>> On 2/6/09 10:24 AM, Mihael Hategan wrote: >>>>>>>> On Fri, 2009-02-06 at 16:17 +0000, Ben Clifford wrote: >>>>>>>>> On Fri, 6 Feb 2009, Mihael Hategan wrote: >>>>>>>>> >>>>>>>>>> I guess we'll have to stage in the bootstrap script using the >>>>>>>>>> stage-in >>>>>>>>>> directive if we are to support managed fork, since I don't see OSG >>>>>>>>>> fixing the issue. >>>>>>>>> They are fixing the whitespace in parameters - see the gram-user >>>>>>>>> thread I sent in a different message. >>>>>>>> Does this include the new lines? >>>>>>>> >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From hategan at mcs.anl.gov Mon Feb 9 18:29:05 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 09 Feb 2009 18:29:05 -0600 Subject: [Swift-devel] Problems with coasters and managedfork jobmanager In-Reply-To: <4990C90B.40700@mcs.anl.gov> References: <498BC903.7010008@mcs.anl.gov> <498C550B.4040303@mcs.anl.gov> <1233936759.5005.2.camel@localhost> <1233937447.5206.0.camel@localhost> <498C6655.3010702@mcs.anl.gov> <1234208813.32649.0.camel@localhost> <499090A7.2020100@mcs.anl.gov> <1234211644.6450.0.camel@localhost> <4990A8DB.1010606@mcs.anl.gov> <4990B6A4.3070806@mcs.anl.gov> <1234224972.11972.2.camel@localhost> <4990C90B.40700@mcs.anl.gov> Message-ID: <1234225745.12056.11.camel@localhost> On Mon, 2009-02-09 at 18:23 -0600, Michael Wilde wrote: > It sounds reasonable, but lets try it and see how well it works. Right. It might not work with condor, given that the line is 16k bytes long. > > I'd like to create a test that runs a trivial swift script on a set of > osg and tg sites with coasters. I suggest looking at the existing tests (swift/tests/(sites)?) first. 
From benc at hawaga.org.uk Mon Feb 9 18:29:25 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 10 Feb 2009 00:29:25 +0000 (GMT)
Subject: [Swift-devel] Problems with coasters and managedfork jobmanager
In-Reply-To: <4990C90B.40700@mcs.anl.gov>
References: <498BC903.7010008@mcs.anl.gov> <498C550B.4040303@mcs.anl.gov> <1233936759.5005.2.camel@localhost> <1233937447.5206.0.camel@localhost> <498C6655.3010702@mcs.anl.gov> <1234208813.32649.0.camel@localhost> <499090A7.2020100@mcs.anl.gov> <1234211644.6450.0.camel@localhost> <4990A8DB.1010606@mcs.anl.gov> <4990B6A4.3070806@mcs.anl.gov> <1234224972.11972.2.camel@localhost> <4990C90B.40700@mcs.anl.gov>
Message-ID:

On Mon, 9 Feb 2009, Michael Wilde wrote:

> I'd like to create a test that runs a trivial swift script on a set of
> osg and tg sites with coasters.

There's a multi-site test setup in tests/sites/

cd tests/sites/
./run-all coaster/

will run some tests (the list is defined in tests/sites/run-site) with
each site in tests/sites/coaster/ and report on success or failure

To add sites, create a sites.xml file in tests/sites/coaster/ directory,
and then add appropriate lines for the site name into
tests/sites/tc.data (if the site isn't already in there).

--

From benc at hawaga.org.uk Tue Feb 10 07:12:40 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 10 Feb 2009 13:12:40 +0000 (GMT)
Subject: [Swift-devel] typecheck foo[*].bar
Message-ID:

I noticed today that expressions like this don't get typechecked
properly, so in 0.8, you can't use [*].member expressions. Bleugh.

As I want to use such expressions (or equivalent), I guess I have to fix
that soonish.

I think the approach I am favouring language-wise is that [*] becomes a
no-op/identity operator, and . with an array of structs on the left
returns an array of the appropriate member fields.
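
A minimal sketch of the proposed semantics, written in Python rather than SwiftScript purely for illustration. The names Rec, star and member are hypothetical and do not come from the Swift code base; the sketch only models "[*] is identity" and ". on an array maps over its elements", not the typechecker or DSHandle parentage.

```python
from dataclasses import dataclass


@dataclass
class Rec:
    # Hypothetical struct with two member fields.
    foo: int
    bar: str


def star(a):
    # Models [*] as a no-op/identity operator on arrays.
    return a


def member(value, name):
    # Models '.': on an array it maps over the elements,
    # on a single struct it selects the named field.
    if isinstance(value, list):
        return [member(x, name) for x in value]
    return getattr(value, name)


a = [Rec(1, "x"), Rec(2, "y")]
# a[*].foo and a.foo give the same array of member fields.
same = member(star(a), "foo") == member(a, "foo")
```

Under this model both spellings yield the list of foo values, which is the equivalence the proposal relies on.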
Thus a[*] == a for all arrays a a[*].foo == a.foo == (in haskelly pseudocode) (map \(x->x.foo) a) I think from an implementation point of view, that can cause some trouble though. DSHandles expect to have only one parent. Writing an expression a[*].foo causes each element to then have two potential parents: i. array of structs -> contained structure -> foo ii. array of foos -> foo Although something like this must be happening at the moment anyway with the [*].foo syntax, so it might not turn out to be a big deal. -- From wilde at mcs.anl.gov Tue Feb 10 11:20:19 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 10 Feb 2009 11:20:19 -0600 Subject: [Swift-devel] Problems with coasters and managedfork jobmanager In-Reply-To: <1234224972.11972.2.camel@localhost> References: <498BC903.7010008@mcs.anl.gov> <498C550B.4040303@mcs.anl.gov> <1233936759.5005.2.camel@localhost> <1233937447.5206.0.camel@localhost> <498C6655.3010702@mcs.anl.gov> <1234208813.32649.0.camel@localhost> <499090A7.2020100@mcs.anl.gov> <1234211644.6450.0.camel@localhost> <4990A8DB.1010606@mcs.anl.gov> <4990B6A4.3070806@mcs.anl.gov> <1234224972.11972.2.camel@localhost> Message-ID: <4991B753.1000801@mcs.anl.gov> Was it not possible and/or easy to let GRAM stage in bootstrap.sh as a stdin file to /bin/bash, the equivalent of: com$ globus-job-run tp-osg.uchicago.edu -stdin -s longscript.sh /bin/sh hello world com$ - Mike On 2/9/09 6:16 PM, Mihael Hategan wrote: > I did get a one-liner perl which contains an encoded bootstrap jar file > (only printable characters). Perl cares less about whitespace, which > makes me think it's a better candidate if we want to go that way. > > Thoughts? > > On Mon, 2009-02-09 at 17:05 -0600, Michael Wilde wrote: >> No, correction, I mis-spoke. When I try this direct to the shell (as >> opposed to via swift and coaster bootstrap) I get the same error as you >> show. I cant get sh to accept function defs on 1 line. 
>> >> So I must have mis-interpreted my result from last Friday. >> >> - Mike >> >> >> On 2/9/09 4:06 PM, Michael Wilde wrote: >>> sorry, i think i put the wrong version there; i dealt with that specific >>> problem and hit a subtler one, deeper in the bootstrap process. >>> >>> On 2/9/09 2:34 PM, Mihael Hategan wrote: >>>> I asked because I couldn't figure out how to get it to work. >>>> >>>> But it seems like yours has problems, too: >>>> mike at blabla tmp$ sh bootstrap.nonl.sh >>>> bootstrap.nonl.sh: line 1: syntax error near unexpected token `;' >>>> ... >>>> >>>> On Mon, 2009-02-09 at 14:23 -0600, Michael Wilde wrote: >>>>> its www.ci.uchicago.edu/~wilde/bootstrap.nonl.sh, plus I removed the >>>>> code in ServiceManager that inserted the extra newlines when reading >>>>> it into a string buffer. I checked as far as verifying in the gram >>>>> log that it was seen by gram as a single line in the rsl. I never >>>>> got a successful run from it, though - it ran into other problems later. >>>>> >>>>> On 2/9/09 1:46 PM, Mihael Hategan wrote: >>>>>> On Fri, 2009-02-06 at 10:33 -0600, Michael Wilde wrote: >>>>>>> I dont know, but I am testing a version where I removed the >>>>>>> newlines from bootstrap.pl (and adjusted a few bits manually) >>>>>> May we see that? >>>>>> >>>>>>> and I *think* its moving on to the next stage and trying to start >>>>>>> the workers. >>>>>>> >>>>>>> Ben, it seems that *some* whitespace is passed on OK, in that I can >>>>>>> run a job that does echo "hello world" and that blank after hello >>>>>>> is preserved, and the job runs. I assume the whitespace problem is >>>>>>> more subtle than that? 
>>>>>>> >>>>>>> On 2/6/09 10:24 AM, Mihael Hategan wrote: >>>>>>>> On Fri, 2009-02-06 at 16:17 +0000, Ben Clifford wrote: >>>>>>>>> On Fri, 6 Feb 2009, Mihael Hategan wrote: >>>>>>>>> >>>>>>>>>> I guess we'll have to stage in the bootstrap script using the >>>>>>>>>> stage-in >>>>>>>>>> directive if we are to support managed fork, since I don't see OSG >>>>>>>>>> fixing the issue. >>>>>>>>> They are fixing the whitespace in parameters - see the gram-user >>>>>>>>> thread I sent in a different message. >>>>>>>> Does this include the new lines? >>>>>>>> >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From hategan at mcs.anl.gov Tue Feb 10 13:22:27 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 10 Feb 2009 13:22:27 -0600 (CST) Subject: [Swift-devel] Problems with coasters and managedfork jobmanager In-Reply-To: <4991B753.1000801@mcs.anl.gov> Message-ID: <23141741.2071951234293747843.JavaMail.root@zimbra> ----- Michael Wilde wrote: > Was it not possible and/or easy to let GRAM stage in bootstrap.sh as a > stdin file to /bin/bash, With GT4 that would have required users to have a gridftp server on the Swift side. Which seems to come in conflict with ease of use, so I think it's a bad idea. >From a theoretical perspective, it also annoys me that with staging in, there is no way to guarantee lack of contention in file naming without globally unique identifiers. > the equivalent of: > > com$ globus-job-run tp-osg.uchicago.edu -stdin -s longscript.sh /bin/sh > hello > world > com$ > > - Mike > > On 2/9/09 6:16 PM, Mihael Hategan wrote: > > I did get a one-liner perl which contains an encoded bootstrap jar file > > (only printable characters). Perl cares less about whitespace, which > > makes me think it's a better candidate if we want to go that way. > > > > Thoughts? 
> > > > On Mon, 2009-02-09 at 17:05 -0600, Michael Wilde wrote: > >> No, correction, I mis-spoke. When I try this direct to the shell (as > >> opposed to via swift and coaster bootstrap) I get the same error as you > >> show. I cant get sh to accept function defs on 1 line. > >> > >> So I must have mis-interpreted my result from last Friday. > >> > >> - Mike > >> > >> > >> On 2/9/09 4:06 PM, Michael Wilde wrote: > >>> sorry, i think i put the wrong version there; i dealt with that specific > >>> problem and hit a subtler one, deeper in the bootstrap process. > >>> > >>> On 2/9/09 2:34 PM, Mihael Hategan wrote: > >>>> I asked because I couldn't figure out how to get it to work. > >>>> > >>>> But it seems like yours has problems, too: > >>>> mike at blabla tmp$ sh bootstrap.nonl.sh > >>>> bootstrap.nonl.sh: line 1: syntax error near unexpected token `;' > >>>> ... > >>>> > >>>> On Mon, 2009-02-09 at 14:23 -0600, Michael Wilde wrote: > >>>>> its www.ci.uchicago.edu/~wilde/bootstrap.nonl.sh, plus I removed the > >>>>> code in ServiceManager that inserted the extra newlines when reading > >>>>> it into a string buffer. I checked as far as verifying in the gram > >>>>> log that it was seen by gram as a single line in the rsl. I never > >>>>> got a successful run from it, though - it ran into other problems later. > >>>>> > >>>>> On 2/9/09 1:46 PM, Mihael Hategan wrote: > >>>>>> On Fri, 2009-02-06 at 10:33 -0600, Michael Wilde wrote: > >>>>>>> I dont know, but I am testing a version where I removed the > >>>>>>> newlines from bootstrap.pl (and adjusted a few bits manually) > >>>>>> May we see that? > >>>>>> > >>>>>>> and I *think* its moving on to the next stage and trying to start > >>>>>>> the workers. > >>>>>>> > >>>>>>> Ben, it seems that *some* whitespace is passed on OK, in that I can > >>>>>>> run a job that does echo "hello world" and that blank after hello > >>>>>>> is preserved, and the job runs. 
I assume the whitespace problem is > >>>>>>> more subtle than that? > >>>>>>> > >>>>>>> On 2/6/09 10:24 AM, Mihael Hategan wrote: > >>>>>>>> On Fri, 2009-02-06 at 16:17 +0000, Ben Clifford wrote: > >>>>>>>>> On Fri, 6 Feb 2009, Mihael Hategan wrote: > >>>>>>>>> > >>>>>>>>>> I guess we'll have to stage in the bootstrap script using the > >>>>>>>>>> stage-in > >>>>>>>>>> directive if we are to support managed fork, since I don't see OSG > >>>>>>>>>> fixing the issue. > >>>>>>>>> They are fixing the whitespace in parameters - see the gram-user > >>>>>>>>> thread I sent in a different message. > >>>>>>>> Does this include the new lines? > >>>>>>>> > >>> _______________________________________________ > >>> Swift-devel mailing list > >>> Swift-devel at ci.uchicago.edu > >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > From hategan at mcs.anl.gov Tue Feb 10 13:31:14 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 10 Feb 2009 13:31:14 -0600 (CST) Subject: [Swift-devel] JAVA_HOME misdetection Message-ID: <10496779.2072501234294274599.JavaMail.root@zimbra> The bootstrapping process does not require, AFAIK, JAVA_HOME. It only needs a java executable. I have a local patch for the issue, but I'm trying to explore the perl route, so I'll integrate the concept there. From zhaozhang at uchicago.edu Tue Feb 10 13:46:05 2009 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Tue, 10 Feb 2009 13:46:05 -0600 Subject: [Swift-devel] GPFS issue of SWIFT on BGP Message-ID: <4991D97D.7010508@uchicago.edu> Hi, All I am working with Allan on applying CIO to SWIFT on BGP, now we are blocked by a ssh-provider issue. Here is the description: we made ssh-provider working as the data provider, and I tested it with multiple psets, it is working fine. 
Login Node ----- submit host
IO Node -------- remote site
Compute Node -- workers

Now, we start swift on Login Node, and the working directory will be
created on IO Node, so that all intermediate files and final
result files will be copied back to Login Node(GPFS) once they are
generated. Here we got an old problem, all IO nodes are trying
to write files in the same directory, which we are trying to avoid all
the way.

My solution would be modify the ssh-provider source code, implement an
asynchronous collector logic there.
Do you have any other ideas about this issue? Or other alternative design?
Thanks so much.

best wishes
zhangzhao

From benc at hawaga.org.uk Tue Feb 10 14:03:17 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 10 Feb 2009 20:03:17 +0000 (GMT)
Subject: [Swift-devel] GPFS issue of SWIFT on BGP
In-Reply-To: <4991D97D.7010508@uchicago.edu>
References: <4991D97D.7010508@uchicago.edu>
Message-ID:

On Tue, 10 Feb 2009, Zhao Zhang wrote:

> Now, we start swift on Login Node, and the working directory will be created
> on IO Node, so that all intermediate files and final
> result files will be copied back to Login Node(GPFS) once they are generated.
> Here we got an old problem, all IO nodes are trying
> to write files in the same directory, which we are trying to avoid all the
> way.
> My solution would be modify the ssh-provider source code, implement an
> asynchronous collector logic there.

Can you describe what is going on here more explicitly.

How do multiple IO nodes end up writing to the same GPFS directory?

It is unclear to me from what you write how that comes about - as I
understand it:

 . submit side data files are posix-accessed only by the swift
   submit-side client

 . files on the I/O nodes (the remote sites) use pset-local storage

 . any communication between the I/O nodes and submit-side client
   happens over ssh.

Where does an I/O node access machine-wide GPFS?
-- From zhaozhang at uchicago.edu Tue Feb 10 14:13:49 2009 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Tue, 10 Feb 2009 14:13:49 -0600 Subject: [Swift-devel] GPFS issue of SWIFT on BGP In-Reply-To: References: <4991D97D.7010508@uchicago.edu> Message-ID: <4991DFFD.2070501@uchicago.edu> Hi, Ben Clifford wrote: > On Tue, 10 Feb 2009, Zhao Zhang wrote: > > >> Now, we start swift on Login Node, and the working directory will be created >> on IO Node, so that all intermediate files and final >> result files will be copied back to Login Node(GPFS) once they are generated. >> Here we got an old problem, all IO nodes are trying >> to write files in the same directory, which we are trying to avoid all the >> way. >> My solution would be modify the ssh-provider source code, implement an >> asynchronous collector logic there. >> > > Can you describe what is going on here more explicitly. > > How do multiple IO nodes end up writing to the same GPFS directory? > In previous case, we have 512 IO nodes each create 1 file in the same directory, that would take 30 minutes to finish. Besides, some time only 510 files could be created. > It is unclear to me from what you write how that comes about - as I > understand it: > > . submit side data files are posix-accessed only by the swift submit-side > client > yes > . files on the I/O nodes (the remote sites) use pset-local storage > yes > . any communication between the I/O nodes and submit-side client happens > over ssh. > yes > Where does an I/O node access machine-wide GPFS? > data transfer from I/O nodes to submit-side client is writing to GPFS through ssh-provider. 
zhao From benc at hawaga.org.uk Tue Feb 10 14:20:33 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 10 Feb 2009 20:20:33 +0000 (GMT) Subject: [Swift-devel] GPFS issue of SWIFT on BGP In-Reply-To: <4991DFFD.2070501@uchicago.edu> References: <4991D97D.7010508@uchicago.edu> <4991DFFD.2070501@uchicago.edu> Message-ID: On Tue, 10 Feb 2009, Zhao Zhang wrote: > > > Here we got an old problem, all IO nodes are trying > data transfer from I/O nodes to submit-side client is writing to GPFS > through ssh-provider. By 'old problem' I assumed you meant the GPFS locking problems previously experienced, where GPFS locks for particular filesystem objects would need to be expensively moved between nodes. However, that should not be a problem here - from the GPFS perspective, the submit (login) node is the only node that is interacting with GPFS, and the only node that needs a lock on that directory. If you're experiencing slowness, then I would be inclined to investigate elsewhere. It may be that the ssh provider is not fast (ssh is not renowned for being a fast protocol; Mihael might have some commentary either way based on his experiences with the cog ssh provider); it may be that something else is causing a bottleneck. Do you have any detailed timing information? (from my perspective, the wrapper logs for every job, and the submit side log, would be interesting to look at - send those to me) -- From zhaozhang at uchicago.edu Tue Feb 10 14:29:53 2009 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Tue, 10 Feb 2009 14:29:53 -0600 Subject: [Swift-devel] GPFS issue of SWIFT on BGP In-Reply-To: References: <4991D97D.7010508@uchicago.edu> <4991DFFD.2070501@uchicago.edu> Message-ID: <4991E3C1.2060503@uchicago.edu> Hi Ben Clifford wrote: > On Tue, 10 Feb 2009, Zhao Zhang wrote: > > >>>> Here we got an old problem, all IO nodes are trying >>>> > > >> data transfer from I/O nodes to submit-side client is writing to GPFS >> through ssh-provider. 
>> > > By 'old problem' I assumed you meant the GPFS locking problems previously > experienced, where GPFS locks for particular filesystem objects would > need to be expensively moved between nodes. > > However, that should not be a problem here - from the GPFS perspective, > the submit (login) node is the only node that is interacting with GPFS, > and the only node that needs a lock on that directory. > ok, this sounds reasonable. Thanks > If you're experiencing slowness, then I would be inclined to investigate > elsewhere. It may be that the ssh provider is not fast (ssh is not > renowned for being a fast protocol; Mihael might have some commentary > either way based on his experiences with the cog ssh provider); it may be > that something else is causing a bottleneck. > > Do you have any detailed timing information? (from my perspective, the > wrapper logs for every job, and the submit side log, would be interesting > to look at - send those to me) > Nope, I am just making a work plan, those are potential issues, I will send you the data once I got them. zhao From zhaozhang at uchicago.edu Tue Feb 10 14:33:43 2009 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Tue, 10 Feb 2009 14:33:43 -0600 Subject: [Swift-devel] GPFS issue of SWIFT on BGP In-Reply-To: References: <4991D97D.7010508@uchicago.edu> <4991DFFD.2070501@uchicago.edu> Message-ID: <4991E4A7.3040504@uchicago.edu> Hi, Ben What if there are 640 ssh-providers sending result files at the same time? Do you know any successful test case with hundreds of ssh-providers working together with one submit host? zhao Ben Clifford wrote: > On Tue, 10 Feb 2009, Zhao Zhang wrote: > > >>>> Here we got an old problem, all IO nodes are trying >>>> > > >> data transfer from I/O nodes to submit-side client is writing to GPFS >> through ssh-provider. 
>> > > By 'old problem' I assumed you meant the GPFS locking problems previously > experienced, where GPFS locks for particular filesystem objects would > need to be expensively moved between nodes. > > However, that should not be a problem here - from the GPFS perspective, > the submit (login) node is the only node that is interacting with GPFS, > and the only node that needs a lock on that directory. > > If you're experiencing slowness, then I would be inclined to investigate > elsewhere. It may be that the ssh provider is not fast (ssh is not > renowned for being a fast protocol; Mihael might have some commentary > either way based on his experiences with the cog ssh provider); it may be > that something else is causing a bottleneck. > > Do you have any detailed timing information? (from my perspective, the > wrapper logs for every job, and the submit side log, would be interesting > to look at - send those to me) > > From hategan at mcs.anl.gov Tue Feb 10 14:38:20 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 10 Feb 2009 14:38:20 -0600 (CST) Subject: [Swift-devel] GPFS issue of SWIFT on BGP In-Reply-To: <4991D97D.7010508@uchicago.edu> Message-ID: <9278820.2078991234298300290.JavaMail.root@zimbra> ----- Zhao Zhang wrote: > Hi, All > > I am working with Allan on applying CIO to SWIFT on BGP, now we are > blocked by a ssh-provider issue. > Here is the description: we made ssh-provider working as the data > provider, and I tested it with multiple psets, it is working fine. > Login Node ----- submit host > IO Node -------- remote site > Compute Node -- workers > > Now, we start swift on Login Node, and the working directory will be > created on IO Node, so that all intermediate files and final > result files will be copied back to Login Node(GPFS) once they are > generated. Here we got an old problem, all IO nodes are trying > to write files in the same directory, which we are trying to avoid all > the way. 
>
> My solution would be modify the ssh-provider source code, implement an
> asynchronous collector logic there.

I'm not really sure what the protocol used to move data with has to do
with organizing files to make things work faster.

From benc at hawaga.org.uk Tue Feb 10 14:43:05 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 10 Feb 2009 20:43:05 +0000 (GMT)
Subject: [Swift-devel] GPFS issue of SWIFT on BGP
In-Reply-To: <4991E4A7.3040504@uchicago.edu>
References: <4991D97D.7010508@uchicago.edu> <4991DFFD.2070501@uchicago.edu> <4991E4A7.3040504@uchicago.edu>
Message-ID:

On Tue, 10 Feb 2009, Zhao Zhang wrote:

> What if there are 640 ssh-providers sending result files at the same time? Do
> you know any successful test case
> with hundreds of ssh-providers working together with one submit host?

Control is the other way round. The Swift client will pull files down
from the I/O nodes when jobs are finished. (that is done by the
dostageout call in execute2 in libexec/vdl-int.k)

Swift has rate limiting on the number of file transfers and file
operations that can be in progress at any one time. By default, the
limit is 4 (for file transfers) and 8 (for file operations). This is
controlled by the throttle.transfers and throttle.file.operations
settings in swift.properties. I think (but I am not sure) that this is a
limit for the whole of Swift, rather than per site.

If jobs are finishing faster than Swift can stage out the data (which is
likely to happen) then a queue of transfer requests will build up inside
Swift.

I think it is quite likely (though I have no numerical evidence) that
you will find provider-ssh copies files too slowly for your liking; in
which case you would need to come up with a faster way of moving files
between the IO nodes and the submitting node. But you should see what
happens with provider-ssh first.

You should easily be able to compute throughput rates when you have log
files for this.
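
To make the queueing argument above concrete, here is a rough sketch in Python of stage-outs competing for a fixed number of transfer slots. It is a toy model, not Swift's scheduler: the function name simulate_stageout is hypothetical, every transfer is assumed to take the same fixed time, and max_transfers merely plays the role of the throttle.transfers limit mentioned above.

```python
import heapq


def simulate_stageout(finish_times, transfer_seconds, max_transfers=4):
    """Return (time the last transfer ends, longest wait in the queue).

    finish_times: times (seconds) at which jobs finish and request stage-out.
    transfer_seconds: assumed fixed duration of one stage-out.
    max_transfers: concurrent transfer slots (stand-in for throttle.transfers).
    """
    slots = []          # min-heap of end times, one entry per busy slot
    last_end = 0.0
    longest_wait = 0.0
    for t in sorted(finish_times):
        if len(slots) == max_transfers:
            # All slots used: wait for the one that frees up first.
            start = max(t, heapq.heappop(slots))
        else:
            start = t
        end = start + transfer_seconds
        heapq.heappush(slots, end)
        longest_wait = max(longest_wait, start - t)
        last_end = max(last_end, end)
    return last_end, longest_wait


# 16 jobs finishing at once, 2 s per transfer, 4 slots.
makespan, wait = simulate_stageout([0] * 16, 2.0, max_transfers=4)
```

With real numbers, dividing the total bytes staged out by the measured makespan from the logs would give the throughput rate referred to above.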
-- From hategan at mcs.anl.gov Tue Feb 10 18:47:50 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 10 Feb 2009 18:47:50 -0600 (CST) Subject: [Swift-devel] Problems with coasters and managedfork jobmanager Message-ID: <32608764.2094661234313270386.JavaMail.root@zimbra> On Mon, 2009-02-09 at 18:23 -0600, Michael Wilde wrote: > It sounds reasonable, but lets try it and see how well it works. http://www.ci.uchicago.edu/~hategan/coaster-bootstrap.jar.pl I suggest trying globusrun perl -e "`cat coaster-bootstrap.jar.pl`" What you should see if everything works ok is the following complaint: "Wrong number of arguments. Expected , , , and " From wilde at mcs.anl.gov Tue Feb 10 20:04:29 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 10 Feb 2009 20:04:29 -0600 Subject: [Swift-devel] Problems with coasters and managedfork jobmanager In-Reply-To: <32608764.2094661234313270386.JavaMail.root@zimbra> References: <32608764.2094661234313270386.JavaMail.root@zimbra> Message-ID: <4992322D.4050204@mcs.anl.gov> I dont know what to do, re: On 2/10/09 6:47 PM, Mihael Hategan wrote: > On Mon, 2009-02-09 at 18:23 -0600, Michael Wilde wrote: >> It sounds reasonable, but lets try it and see how well it works. > > http://www.ci.uchicago.edu/~hategan/coaster-bootstrap.jar.pl OK, fetched that. Is there a patch for this? Modifed ServiceManager to use this for bootstrapping? > I suggest trying > > globusrun perl -e "`cat coaster-bootstrap.jar.pl`" Do you mean put perl, -e, and the contents of the .pl file into an RSL string and run through globusrun? perl -e "`cat coaster-bootstrap.jar.pl`" on the command line gives what you show below. How do I integrate and test this? > > What you should see if everything works ok is the following > complaint: > > "Wrong number of arguments. 
Expected , > , , and " > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Tue Feb 10 20:22:54 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 10 Feb 2009 20:22:54 -0600 (CST) Subject: [Swift-devel] Problems with coasters and managedfork jobmanager In-Reply-To: <4992322D.4050204@mcs.anl.gov> Message-ID: <25935177.2095661234318974195.JavaMail.root@zimbra> ----- Michael Wilde wrote: > I dont know what to do, re: I mean literally do the globusrun thing. > > On 2/10/09 6:47 PM, Mihael Hategan wrote: > > On Mon, 2009-02-09 at 18:23 -0600, Michael Wilde wrote: > >> It sounds reasonable, but lets try it and see how well it works. > > > > http://www.ci.uchicago.edu/~hategan/coaster-bootstrap.jar.pl > > OK, fetched that. > > Is there a patch for this? Modifed ServiceManager to use this for > bootstrapping? There is no patch. Just that script. > > > I suggest trying > > > > globusrun perl -e "`cat coaster-bootstrap.jar.pl`" > > Do you mean put perl, -e, and the contents of the .pl file into an RSL > string and run through globusrun? That would do. But I suspect the above command will do that. > > perl -e "`cat coaster-bootstrap.jar.pl`" > on the command line gives what you show below. Yes, that I checked. The point is to run the same through a few job managers and see whether they choke on it or not. If it goes through as many sites as we can send to, then we declare this a winner. If not, we go back to the drawing board. From wilde at mcs.anl.gov Tue Feb 10 22:54:08 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 10 Feb 2009 22:54:08 -0600 Subject: [Swift-devel] Problems with coasters and managedfork jobmanager In-Reply-To: <25935177.2095661234318974195.JavaMail.root@zimbra> References: <25935177.2095661234318974195.JavaMail.root@zimbra> Message-ID: <499259F0.9060005@mcs.anl.gov> Sorry, I still dont get it. 
Did the globusrun command below work for you? It doesnt work for me, and doesnt match the syntax of what I understand globusrun to expect. I was not able to get coaster-bootstrap.jar.pl into an RSL string for globusrun. I don't think thats possible, because globusrun gets confused by the single and double quotes in that file. Unless Im missing something, as far as I can tell you can only test this from API level. Has it worked for you, from inside swift, say from ServiceManager? On 2/10/09 8:22 PM, Mihael Hategan wrote: > ----- Michael Wilde wrote: >> I dont know what to do, re: > > I mean literally do the globusrun thing. > >> On 2/10/09 6:47 PM, Mihael Hategan wrote: >>> On Mon, 2009-02-09 at 18:23 -0600, Michael Wilde wrote: >>>> It sounds reasonable, but lets try it and see how well it works. >>> http://www.ci.uchicago.edu/~hategan/coaster-bootstrap.jar.pl >> OK, fetched that. >> >> Is there a patch for this? Modifed ServiceManager to use this for >> bootstrapping? > > There is no patch. Just that script. > >>> I suggest trying >>> >>> globusrun perl -e "`cat coaster-bootstrap.jar.pl`" >> Do you mean put perl, -e, and the contents of the .pl file into an RSL >> string and run through globusrun? > > That would do. But I suspect the above command will do that. > >> perl -e "`cat coaster-bootstrap.jar.pl`" >> on the command line gives what you show below. > > Yes, that I checked. > > The point is to run the same through a few job managers and see > whether they choke on it or not. If it goes through as many sites > as we can send to, then we declare this a winner. > If not, we go back to the drawing board. I think thats a good idea, but I have not yet found a way to do this from a shell. I'm skeptical that there's a gram client command that will take the required perl command from the command line Can you encode the script without embedded quotes? 
- Mike --- The output I get is: com$ globusrun -r tp-grid1.ci.uchicago.edu perl -e "`cat coaster-bootstrap.jar.pl`" ERROR: too many request strings specified Syntax: globusrun [-help] [-f RSL file] [-s][-b][-d][...] [-r RM] [RSL] Use -help to display full usage com$ From hategan at mcs.anl.gov Tue Feb 10 23:43:25 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 10 Feb 2009 23:43:25 -0600 Subject: [Swift-devel] Problems with coasters and managedfork jobmanager In-Reply-To: <499259F0.9060005@mcs.anl.gov> References: <25935177.2095661234318974195.JavaMail.root@zimbra> <499259F0.9060005@mcs.anl.gov> Message-ID: <1234331005.2557.5.camel@localhost> On Tue, 2009-02-10 at 22:54 -0600, Michael Wilde wrote: > Sorry, I still dont get it. Did the globusrun command below work for > you? It doesnt work for me, and doesnt match the syntax of what I > understand globusrun to expect. Allow me to rephrase it. Try the globusrun command that does "perl -e ". I haven't tried it, so I don't know the exact incantation. > > I was not able to get coaster-bootstrap.jar.pl into an RSL string for > globusrun. I don't think thats possible, because globusrun gets confused > by the single and double quotes in that file. Globusrun should properly escape those as long as it is gets the correct ARGV (which '`cat file`' should do). > > Unless Im missing something, as far as I can tell you can only test this > from API level. Has it worked for you, from inside swift, say from > ServiceManager? I haven't tried it from inside swift. I'll poke around and send a command that can be used literally. 
From hategan at mcs.anl.gov Wed Feb 11 00:15:02 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 11 Feb 2009 00:15:02 -0600
Subject: [Swift-devel] Problems with coasters and managedfork jobmanager
In-Reply-To: <1234331005.2557.5.camel@localhost>
References: <25935177.2095661234318974195.JavaMail.root@zimbra> <499259F0.9060005@mcs.anl.gov> <1234331005.2557.5.camel@localhost>
Message-ID: <1234332902.4348.3.camel@localhost>

On Tue, 2009-02-10 at 23:43 -0600, Mihael Hategan wrote:

> I haven't tried it from inside swift. I'll poke around and send a
> command that can be used literally.

Grr.

Put the following in a file named t.k:

import("sys.k")
import("task.k")

[bs,url,provider,jm] := each(...)

h := host(url
    service("execution", provider=provider, url=url,jobManager=jm)
)

src := strip(file:read(bs))

execute("/usr/bin/perl", args=["-e", "{src}"], host=h,
    provider=provider, redirect=true)

Then (with your swift bin dir in your path) run it like this:

cog-workflow t.k

For example:

cog-workflow t.k coaster-bootstrap.jar.pl localhost local none

You should get:

Submitting task Task(type=JOB_SUBMISSION, identity=urn:0-1234332844313)
Wrong number of arguments. Expected , , , and
Execution failed: Job failed with an exit code of 1
...
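
For checking the local half of this independently of Karajan, a small Python sketch of the same idea follows: read the script, strip it, and hand it to the interpreter as a single argv element so that no local shell re-parses its quotes, spaces or newlines. The helper name run_inline is hypothetical, and the sketch says nothing about what a remote job manager does with the argument afterwards, which is the part actually under test in this thread.

```python
import subprocess


def run_inline(interpreter_argv, source):
    """Run `source` as one argument, e.g. interpreter_argv=["perl", "-e"].

    Returns (exit code, stdout, stderr).
    """
    # The argument list goes to the OS directly (no shell), so embedded
    # single quotes, double quotes and whitespace arrive unchanged.
    result = subprocess.run(
        interpreter_argv + [source.strip()],
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stdout, result.stderr
```

A local run against the bootstrap script would then look like run_inline(["perl", "-e"], open("coaster-bootstrap.jar.pl").read()), assuming perl is on the path and the file is in the current directory.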
From benc at hawaga.org.uk Wed Feb 11 03:42:36 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 11 Feb 2009 09:42:36 +0000 (GMT)
Subject: [Swift-devel] Problems with coasters and managedfork jobmanager
In-Reply-To: <1234332902.4348.3.camel@localhost>
References: <25935177.2095661234318974195.JavaMail.root@zimbra> <499259F0.9060005@mcs.anl.gov> <1234331005.2557.5.camel@localhost> <1234332902.4348.3.camel@localhost>
Message-ID:

I've tried (with my OSGEDU credential) and the following site give the
expected 'Wrong number of arguments...':

tp-osg.ci.uchicago.edu pbs (t.k modified to use queue="test")

Trying with osgce.cs.clemson.edu condor gives:

[bcliff at osgedu coaster-perl-test]$ cog-workflow t.k
coaster-bootstrap.jar.pl osgce.cs.clemson.edu gt2 condor
String found where operator expected at -e line 1, at end of line
(Missing semicolon on previous line?)
Can't find string terminator "'" anywhere before EOF at -e line 1.

which doesn't particularly surprise me as the perl file has many spaces,
so the same problem as with sh -c appears to arise there.

--

From benc at hawaga.org.uk Wed Feb 11 05:39:40 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 11 Feb 2009 11:39:40 +0000 (GMT)
Subject: [Swift-devel] scheduler stuff for Google Summer of Code 2009
Message-ID:

Google Summer of Code 2009 mentor applications open in a couple of
weeks. dev.globus likely will be applying again. I have a few ideas for
Swift projects within dev.globus that I'd like to mentor.

One idea that I'm kinda fuzzy on but there might be interesting work to
do is implementing more interesting scheduler behaviour.
Various people have talked in the past about these, that I think have some decent level of merit: a) changing ordering of execution of swift-level jobs based on how many other swift-level jobs depend on that first job b) reordering stageins and stageouts so to allow (in addition to the present as-they-come (I think) policy) "prefer stageins" (which would get more jobs going sooner, but incurring an expense in that stageouts would happen slower, and in our present restart model reduce the speed as which jobs complete enough for restart), and "prefer stageouts", which would get completed results out to submit side faster, at the expense of job execution speed. c) data affinity - there was some messing round with this, but it resulted in code that did not work (which is fine for that project, as it was not production code oriented, but not for committing to the codebase). So potentially this could be reimplemented or the existing code tidied up as part of this. Comments and additional ideas... -- From hategan at mcs.anl.gov Wed Feb 11 08:17:53 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 11 Feb 2009 08:17:53 -0600 Subject: [Swift-devel] Problems with coasters and managedfork jobmanager In-Reply-To: References: <25935177.2095661234318974195.JavaMail.root@zimbra> <499259F0.9060005@mcs.anl.gov> <1234331005.2557.5.camel@localhost> <1234332902.4348.3.camel@localhost> Message-ID: <1234361873.5085.0.camel@localhost> This was an attempt at dealing with the newlines. There is not much I can do about the spaces. 
On Wed, 2009-02-11 at 09:42 +0000, Ben Clifford wrote: > I've tried (with my OSGEDU credential) and the following site give the > expected 'Wrong number of arguments...': > > tp-osg.ci.uchicago.edu pbs (t.k modified to use queue="test") > > Trying with osgce.cs.clemson.edu condor gives: > > [bcliff at osgedu coaster-perl-test]$ cog-workflow t.k > coaster-bootstrap.jar.pl osgce.cs.clemson.edu gt2 condor > String found where operator expected at -e line 1, at end of line > (Missing semicolon on previous line?) > Can't find string terminator "'" anywhere before EOF at -e line 1. > > which doesn't particularly surprise me as the perl file has many spaces, > so the same problem as with sh -c appears to arise there. > From foster at anl.gov Wed Feb 11 08:23:48 2009 From: foster at anl.gov (foster at anl.gov) Date: Wed, 11 Feb 2009 08:23:48 -0600 (CST) Subject: [Swift-devel] Problems with coasters and managedfork jobmanager In-Reply-To: <1234361873.5085.0.camel@localhost> References: <25935177.2095661234318974195.JavaMail.root@zimbra> <499259F0.9060005@mcs.anl.gov> <1234331005.2557.5.camel@localhost> <1234332902.4348.3.camel@localhost> <1234361873.5085.0.camel@localhost> Message-ID: <1DB48832-4CF3-4EDD-A26D-6A38D42D51D3@anl.gov> Are you not using gram? Ian -- from mobile On Feb 11, 2009, at 8:19 AM, Mihael Hategan wrote: > This was an attempt at dealing with the newlines. There is not much I > can do about the spaces. > > On Wed, 2009-02-11 at 09:42 +0000, Ben Clifford wrote: >> I've tried (with my OSGEDU credential) and the following site give >> the >> expected 'Wrong number of arguments...': >> >> tp-osg.ci.uchicago.edu pbs (t.k modified to use queue="test") >> >> Trying with osgce.cs.clemson.edu condor gives: >> >> [bcliff at osgedu coaster-perl-test]$ cog-workflow t.k >> coaster-bootstrap.jar.pl osgce.cs.clemson.edu gt2 condor >> String found where operator expected at -e line 1, at end of line >> (Missing semicolon on previous line?) 
>> Can't find string terminator "'" anywhere before EOF at -e line 1. >> >> which doesn't particularly surprise me as the perl file has many >> spaces, >> so the same problem as with sh -c appears to arise there. >> > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From benc at hawaga.org.uk Wed Feb 11 08:40:58 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 11 Feb 2009 14:40:58 +0000 (GMT) Subject: [Swift-devel] Problems with coasters and managedfork jobmanager In-Reply-To: <1DB48832-4CF3-4EDD-A26D-6A38D42D51D3@anl.gov> References: <25935177.2095661234318974195.JavaMail.root@zimbra> <499259F0.9060005@mcs.anl.gov> <1234331005.2557.5.camel@localhost> <1234332902.4348.3.camel@localhost> <1234361873.5085.0.camel@localhost> <1DB48832-4CF3-4EDD-A26D-6A38D42D51D3@anl.gov> Message-ID: On Wed, 11 Feb 2009, foster at anl.gov wrote: > Are you not using gram? yes, which feeds jobs into condor. -- From foster at anl.gov Wed Feb 11 08:49:56 2009 From: foster at anl.gov (Ian Foster) Date: Wed, 11 Feb 2009 08:49:56 -0600 Subject: [Swift-devel] Problems with coasters and managedfork jobmanager In-Reply-To: References: <25935177.2095661234318974195.JavaMail.root@zimbra> <499259F0.9060005@mcs.anl.gov> <1234331005.2557.5.camel@localhost> <1234332902.4348.3.camel@localhost> <1234361873.5085.0.camel@localhost> <1DB48832-4CF3-4EDD-A26D-6A38D42D51D3@anl.gov> Message-ID: <86375BBE-29A8-4171-BC89-0B5495180E9A@anl.gov> Ben: The meaning of the answer "yes" to a negative question has different meanings depending on one's cultural origins :) What did you mean in this case? Ian. On Feb 11, 2009, at 8:40 AM, Ben Clifford wrote: > > On Wed, 11 Feb 2009, foster at anl.gov wrote: > >> Are you not using gram? > > yes, which feeds jobs into condor. 
>
> --
>

From benc at hawaga.org.uk Wed Feb 11 09:00:29 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 11 Feb 2009 15:00:29 +0000 (GMT)
Subject: [Swift-devel] Problems with coasters and managedfork jobmanager
In-Reply-To: <86375BBE-29A8-4171-BC89-0B5495180E9A@anl.gov>
References: <25935177.2095661234318974195.JavaMail.root@zimbra> <499259F0.9060005@mcs.anl.gov> <1234331005.2557.5.camel@localhost> <1234332902.4348.3.camel@localhost> <1234361873.5085.0.camel@localhost> <1DB48832-4CF3-4EDD-A26D-6A38D42D51D3@anl.gov> <86375BBE-29A8-4171-BC89-0B5495180E9A@anl.gov>
Message-ID:

On Wed, 11 Feb 2009, Ian Foster wrote:

> The meaning of the answer "yes" to a negative question has different meanings
> depending on one's cultural origins :)

ha. indeed.

> What did you mean in this case?

job flow is cog commandline -> cog gt2 provider -> gram2 -> condor -> go!

This problem occurs with any use of Condor up until fairly recent versions.

The Condor job submission file format (which in this case is being generated by gram2, but that is mostly irrelevant) doesn't cope with spaces in arguments:

Normally:

  echo "hi there"

has $1 equal to the entire string: hi there
and no $2

If you submit the same through condor, you get $1=hi and $2=there

This then causes problems using on-the-commandline command sequences in sh or perl:

Something like this:

  sh -c 'echo foo'

which passes the entire command "echo foo" as the second parameter to sh gets turned into something like:

  $1=-c (which is ok)
  $2=echo (or maybe 'echo)
  $3=foo (or maybe foo')

so sh runs the command "echo" with no parameters (as $2 instructs it to).

So it's hard to run any non-trivial command through this mechanism.

There's future hope, though, as a hopefully fixed GRAM update is available, working with a recently fixed Condor, and that is likely to be deployed on OSG in due course (which is where most Condor woes occur).
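
The argument splitting described above can be illustrated with a short Python sketch. It is only a model under stated assumptions, not Condor's or GRAM's actual code: `naive_submit` is a hypothetical stand-in for a submission layer that joins arguments with spaces and later re-splits on whitespace, and `quoted_submit` is a hypothetical layer that quotes each argument first.

```python
import shlex


def naive_submit(argv):
    """Hypothetical model of a submit layer that loses argument boundaries.

    The arguments are joined with single spaces and later split on
    whitespace, so any argument containing a space is broken apart.
    """
    submit_line = " ".join(argv)
    return submit_line.split()


def quoted_submit(argv):
    """Hypothetical model of a layer that quotes each argument before joining."""
    submit_line = " ".join(shlex.quote(arg) for arg in argv)
    return shlex.split(submit_line)


if __name__ == "__main__":
    command = ["sh", "-c", "echo foo"]
    # The space inside "echo foo" is treated as a separator.
    print(naive_submit(command))   # ['sh', '-c', 'echo', 'foo']
    # Quoting keeps "echo foo" together as one argument.
    print(quoted_submit(command))  # ['sh', '-c', 'echo foo']
```

In the naive model, sh receives "echo" as its command string and "foo" as a separate positional argument, which matches the behaviour reported in the message.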
-- From hategan at mcs.anl.gov Wed Feb 11 09:04:13 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 11 Feb 2009 09:04:13 -0600 Subject: [Swift-devel] Problems with coasters and managedfork jobmanager In-Reply-To: <1DB48832-4CF3-4EDD-A26D-6A38D42D51D3@anl.gov> References: <25935177.2095661234318974195.JavaMail.root@zimbra> <499259F0.9060005@mcs.anl.gov> <1234331005.2557.5.camel@localhost> <1234332902.4348.3.camel@localhost> <1234361873.5085.0.camel@localhost> <1DB48832-4CF3-4EDD-A26D-6A38D42D51D3@anl.gov> Message-ID: <1234364653.5907.2.camel@localhost> On Wed, 2009-02-11 at 08:23 -0600, foster at anl.gov wrote: > Are you not using gram? Hmm, you're still in "you hate gram" mode :) Yes, we are using gram and the condor job manager behaves badly. From foster at anl.gov Wed Feb 11 09:43:15 2009 From: foster at anl.gov (Ian Foster) Date: Wed, 11 Feb 2009 09:43:15 -0600 Subject: [Swift-devel] Problems with coasters and managedfork jobmanager In-Reply-To: <1234364653.5907.2.camel@localhost> References: <25935177.2095661234318974195.JavaMail.root@zimbra> <499259F0.9060005@mcs.anl.gov> <1234331005.2557.5.camel@localhost> <1234332902.4348.3.camel@localhost> <1234361873.5085.0.camel@localhost> <1DB48832-4CF3-4EDD-A26D-6A38D42D51D3@anl.gov> <1234364653.5907.2.camel@localhost> Message-ID: <35E4FAE1-4044-4A9C-8242-AF427EBACCEF@anl.gov> Ben, Mihael: My email did have that flavor, didn't it. My apologies. :) Ian. On Feb 11, 2009, at 9:04 AM, Mihael Hategan wrote: > On Wed, 2009-02-11 at 08:23 -0600, foster at anl.gov wrote: >> Are you not using gram? > > Hmm, you're still in "you hate gram" mode :) > > Yes, we are using gram and the condor job manager behaves badly. > From wilde at mcs.anl.gov Wed Feb 11 13:24:01 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 11 Feb 2009 13:24:01 -0600 Subject: [Swift-devel] scheduler stuff for Google Summer of Code 2009 In-Reply-To: References: Message-ID: <499325D1.9040601@mcs.anl.gov> These all sound good. 
Another scheduler-related set of projects relates to algorithms around coasters: - how to manage (grow and shrink) the size of coaster pools - how to size the time requests for the workers (perhaps dynamically) - how the current dynamic throttle works for coasters - and probably many more. Some more areas: - a detailed evaluation and tuning of the throttle heuristics for many-site workflows - automatically clustering more diverse dags, and pipelining - running Swift on clouds like E2C and Azure - scaling swift to 1M+ task workflows, efficiently (streaming the mappers) - service oriented workflows - extending CIO techniques back to grid environments - creating a lightweight "embedded swift" VM for running workflows from *within* perl, python, R, etc. These cover a wide space of useful-to-crazy, easy-to-hard, etc. Just tossing them out. - Mike On 2/11/09 5:39 AM, Ben Clifford wrote: > Google Summer of Code 2009 mentor applications open in a couple of weeks. > dev.globus likely will be applying again. I have a few ideas for Swift > projects within dev.globus that I'd like to mentor. > > One idea that I'm kinda fuzzy on but there might be interesting work to do > is implementing more interesting scheduler behaviour. > > Various people have talked in the past about these, that I think have some > decent level of merit: > > a) changing ordering of execution of swift-level jobs based on how many > other swift-level jobs depend on that first job > > b) reordering stageins and stageouts so to allow (in addition to the > present as-they-come (I think) policy) "prefer stageins" (which would > get more jobs going sooner, but incurring an expense in that > stageouts would happen slower, and in our present restart model > reduce the speed as which jobs complete enough for restart), and > "prefer stageouts", which would get completed results out to submit > side faster, at the expense of job execution speed. 
> > c) data affinity - there was some messing round with this, but it > resulted in code that did not work (which is fine for that project, > as it was not production code oriented, but not for committing to > the codebase). So potentially this could be reimplemented or the > existing code tidied up as part of this. > > Comments and additional ideas... > From benc at hawaga.org.uk Wed Feb 11 14:33:16 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 11 Feb 2009 20:33:16 +0000 (GMT) Subject: [Swift-devel] scheduler stuff for Google Summer of Code 2009 In-Reply-To: <499325D1.9040601@mcs.anl.gov> References: <499325D1.9040601@mcs.anl.gov> Message-ID: From the stuff below, some are quite researchy and so probably better for an academic student project rather than a google project. The two that seem fairly clearly defined are: For this, there may be some work here doing grunt work implementation of tweakable parameters and things like that. I don't know if it would take up a whole summer, though. > Another scheduler-related set of projects relates to algorithms around > coasters: - how to manage (grow and shrink) the size of coaster pools - > how to size the time requests for the workers (perhaps dynamically) - > how the current dynamic throttle works for coasters - and probably many > more. > - running Swift on clouds like E2C and Azure Tim Freeman has done some work standing up clusters on multiple EC2 nodes, so that the multiple nodes are exposed through a single (perhaps GRAM?) interface. So putting Swift on the front of that seems straightforward to define. 
-- From iraicu at cs.uchicago.edu Wed Feb 11 14:41:17 2009 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Wed, 11 Feb 2009 14:41:17 -0600 Subject: [Swift-devel] scheduler stuff for Google Summer of Code 2009 In-Reply-To: References: <499325D1.9040601@mcs.anl.gov> Message-ID: <499337ED.9010009@cs.uchicago.edu> Ben Clifford wrote: > >> - running Swift on clouds like E2C and Azure >> > > Tim Freeman has done some work standing up clusters on multiple EC2 nodes, > so that the multiple nodes are exposed through a single (perhaps GRAM?) > interface. So putting Swift on the front of that seems straightforward to > define. > > Back in 2007 (yes, its been almost 2 years since we tried this), Catalin, Tim Freeman, and I got MolDyn (Nika's molecular dynamics app) running through Swift + Falkon + Workspace Service + EC2 + NFS. At the time, they were just rolling out support for PBS/GRAM, which means that for a simpler deployment scenario, you might be able to use GRAM/PBS instead of Falkon. We had a wiki setup to keep track of our progress: http://dev.globus.org/wiki/Incubator/Falkon/EC2 http://www.ci.uchicago.edu/wiki/bin/view/VDS/DslCS/Falkon_EC2 but then Catalin found a job, and moved on to other things, and I never had time to carry the work forward. It sounds like an interesting scenario to support, with minimal end-user intervention. Ioan -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 
58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== From benc at hawaga.org.uk Wed Feb 11 14:41:41 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 11 Feb 2009 20:41:41 +0000 (GMT) Subject: [Swift-devel] scheduler stuff for Google Summer of Code 2009 In-Reply-To: <499325D1.9040601@mcs.anl.gov> References: <499325D1.9040601@mcs.anl.gov> Message-ID: On Wed, 11 Feb 2009, Michael Wilde wrote: > - scaling swift to 1M+ task workflows, efficiently (streaming the > mappers) There's more to this than simple streaming mappers. At the moment, everything is built around having a Java object in memory for every piece of data that can be referenced, and that object tends to stick around for a long time (at least as long as that data can be referenced). For example, if you have an array which has a large number of elements, then each of those elements has at least one object in memory representing it, because as long as you have the array in scope, you can say a[1] or a[anything] and thus get to every element. The in-memory implementation of the data model and anything that touches it would need some fairly serious work to cope with having stuff kept out of core; and I think keeping stuff out of core is something that would need to happen. 
(that is, 'streaming mappers' as a phrase seems to deal with "not getting knowledge about data too fast" but does not deal with "forgetting knowledge about data fast enough") -- From wilde at mcs.anl.gov Wed Feb 11 14:53:01 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 11 Feb 2009 14:53:01 -0600 Subject: [Swift-devel] scheduler stuff for Google Summer of Code 2009 In-Reply-To: References: <499325D1.9040601@mcs.anl.gov> Message-ID: <49933AAD.2020303@mcs.anl.gov> So in a sense, rather than saying "streaming mappers" can you call this "streaming foreach() statements" so that as each "iteration" (or "instantiation") of the foreach completes, the objects it used are freed and removed/removable from memory? (ie, does this address the 'scope" problem?) Too big for an SOC student? Interesting enough for one? (Its a nice scalability challenge... and could be demonstrated first on localhost to make good progress w/o getting tangled in distributed computing initially) If too big, can we make it manageble? If too small, can we bundle with related tasks? On 2/11/09 2:41 PM, Ben Clifford wrote: > On Wed, 11 Feb 2009, Michael Wilde wrote: > >> - scaling swift to 1M+ task workflows, efficiently (streaming the >> mappers) > > There's more to this than simple streaming mappers. > > At the moment, everything is built around having a Java object in memory > for every piece of data that can be referenced, and that object tends to > stick around for a long time (at least as long as that data can be > referenced). For example, if you have an array which has a large number of > elements, then each of those elements has at least one object in memory > representing it, because as long as you have the array in scope, you can > say a[1] or a[anything] and thus get to every element. 
> > The in-memory implementation of the data model and anything that touches > it would need some fairly serious work to cope with having stuff kept out > of core; and I think keeping stuff out of core is something that would > need to happen. > > (that is, 'streaming mappers' as a phrase seems to deal with "not getting > knowledge about data too fast" but does not deal with "forgetting > knowledge about data fast enough") > From tfreeman at mcs.anl.gov Wed Feb 11 14:53:43 2009 From: tfreeman at mcs.anl.gov (Tim Freeman) Date: Wed, 11 Feb 2009 14:53:43 -0600 Subject: [Swift-devel] scheduler stuff for Google Summer of Code 2009 In-Reply-To: References: <499325D1.9040601@mcs.anl.gov> Message-ID: <20090211145343.3df6ab8e@prnb> On Wed, 11 Feb 2009 20:33:16 +0000 (GMT) Ben Clifford wrote: [...] > > > - running Swift on clouds like E2C and Azure > > Tim Freeman has done some work standing up clusters on multiple EC2 nodes, > so that the multiple nodes are exposed through a single (perhaps GRAM?) > interface. So putting Swift on the front of that seems straightforward to > define. Yep, was just starting some 100+ node GRAM/PBS clusters there last week. Tim From hategan at mcs.anl.gov Wed Feb 11 15:33:48 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 11 Feb 2009 15:33:48 -0600 (CST) Subject: [Swift-devel] scheduler stuff for Google Summer of Code 2009 In-Reply-To: Message-ID: <22220735.2155401234388028575.JavaMail.root@zimbra> ----- Ben Clifford wrote: > > On Wed, 11 Feb 2009, Michael Wilde wrote: > > > - scaling swift to 1M+ task workflows, efficiently (streaming the > > mappers) > > There's more to this than simple streaming mappers. > > At the moment, everything is built around having a Java object in memory > for every piece of data that can be referenced, and that object tends to > stick around for a long time (at least as long as that data can be > referenced). 
For example, if you have an array which has a large number of > elements, then each of those elements has at least one object in memory > representing it, because as long as you have the array in scope, you can > say a[1] or a[anything] and thus get to every element. I do not think that this issue is the bottleneck here. For every application invocation there is a karajan thread. The fact that one such thread eats around 10-20k seems to be the problem. By contrast, a piece of Swift data probably takes less than 1k. So I think that one order of magnitude improvement could be achieved by addressing that 10-20k problem (or by somehow having fewer karajan threads). > > The in-memory implementation of the data model and anything that touches > it would need some fairly serious work to cope with having stuff kept out > of core; and I think keeping stuff out of core is something that would > need to happen. > > (that is, 'streaming mappers' as a phrase seems to deal with "not getting > knowledge about data too fast" but does not deal with "forgetting > knowledge about data fast enough") > > -- > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From foster at anl.gov Wed Feb 11 15:34:33 2009 From: foster at anl.gov (Ian Foster) Date: Wed, 11 Feb 2009 15:34:33 -0600 Subject: [Swift-devel] scheduler stuff for Google Summer of Code 2009 In-Reply-To: <20090211145343.3df6ab8e@prnb> References: <499325D1.9040601@mcs.anl.gov> <20090211145343.3df6ab8e@prnb> Message-ID: <529D3FFB-BA26-4736-A160-3AD51C823CA4@anl.gov> UniCloud provides that functionality, too. I imagine there are others. On Feb 11, 2009, at 2:53 PM, Tim Freeman wrote: > On Wed, 11 Feb 2009 20:33:16 +0000 (GMT) > Ben Clifford wrote: > > [...] 
>> >>> - running Swift on clouds like E2C and Azure >> >> Tim Freeman has done some work standing up clusters on multiple EC2 >> nodes, >> so that the multiple nodes are exposed through a single (perhaps >> GRAM?) >> interface. So putting Swift on the front of that seems >> straightforward to >> define. > > Yep, was just starting some 100+ node GRAM/PBS clusters there last > week. > > Tim > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From foster at anl.gov Wed Feb 11 15:35:56 2009 From: foster at anl.gov (Ian Foster) Date: Wed, 11 Feb 2009 15:35:56 -0600 Subject: [Swift-devel] scheduler stuff for Google Summer of Code 2009 In-Reply-To: <22220735.2155401234388028575.JavaMail.root@zimbra> References: <22220735.2155401234388028575.JavaMail.root@zimbra> Message-ID: I would argue that Swift support for collective operations also helps with scaling. (We can run a computation with 1M tasks, if not all 1M tasks are active at once.) On Feb 11, 2009, at 3:33 PM, Mihael Hategan wrote: > > ----- Ben Clifford wrote: >> >> On Wed, 11 Feb 2009, Michael Wilde wrote: >> >>> - scaling swift to 1M+ task workflows, efficiently (streaming the >>> mappers) >> >> There's more to this than simple streaming mappers. >> >> At the moment, everything is built around having a Java object in >> memory >> for every piece of data that can be referenced, and that object >> tends to >> stick around for a long time (at least as long as that data can be >> referenced). For example, if you have an array which has a large >> number of >> elements, then each of those elements has at least one object in >> memory >> representing it, because as long as you have the array in scope, >> you can >> say a[1] or a[anything] and thus get to every element. > > I do not think that this issue is the bottleneck here. 
For every > application > invocation there is a karajan thread. The fact that one such thread > eats > around 10-20k seems to be the problem. By contrast, a piece of Swift > data > probably takes less than 1k. > > So I think that one order of magnitude improvement could be achieved > by > addressing that 10-20k problem (or by somehow having fewer karajan > threads). > >> >> The in-memory implementation of the data model and anything that >> touches >> it would need some fairly serious work to cope with having stuff >> kept out >> of core; and I think keeping stuff out of core is something that >> would >> need to happen. >> >> (that is, 'streaming mappers' as a phrase seems to deal with "not >> getting >> knowledge about data too fast" but does not deal with "forgetting >> knowledge about data fast enough") >> >> -- >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From benc at hawaga.org.uk Wed Feb 11 15:52:33 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 11 Feb 2009 21:52:33 +0000 (GMT) Subject: [Swift-devel] scheduler stuff for Google Summer of Code 2009 In-Reply-To: <22220735.2155401234388028575.JavaMail.root@zimbra> References: <22220735.2155401234388028575.JavaMail.root@zimbra> Message-ID: On Wed, 11 Feb 2009, Mihael Hategan wrote: > I do not think that this issue is the bottleneck here. For every application I wasn't attempting to provide a comprehensive summary of stuff that won't scale... mostly I wanted one example of another issue. > So I think that one order of magnitude improvement could be achieved by > addressing that 10-20k problem (or by somehow having fewer karajan threads). 
20k per app invocation is pretty heavyweight when invocations are parallelised. I suspect that fewer karajan threads at any point in time can be brought about by some streaming-like approach - rather than n elements to iterate over forking n karajan threads by saying parallelFor and having a simultaneous thread for each (of which most threads, for large enough n, will block), that construct could perhaps be made to act over time. So stream-like (lists spread over time as well as space) behaviour for foreach. -- From hategan at mcs.anl.gov Wed Feb 11 15:52:46 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 11 Feb 2009 15:52:46 -0600 (CST) Subject: [Swift-devel] scheduler stuff for Google Summer of Code 2009 In-Reply-To: Message-ID: <31731535.2162031234389166830.JavaMail.root@zimbra> ----- Ian Foster wrote: > I would argue that Swift support for collective operations also helps > with scaling. (We can run a computation with 1M tasks, if not all 1M > tasks are active at once.) I don't think we can talk about an improvement if we're moving from "no can do" to "no can do in a different way". From tfreeman at mcs.anl.gov Wed Feb 11 16:12:34 2009 From: tfreeman at mcs.anl.gov (Tim Freeman) Date: Wed, 11 Feb 2009 16:12:34 -0600 Subject: [Swift-devel] scheduler stuff for Google Summer of Code 2009 In-Reply-To: <529D3FFB-BA26-4736-A160-3AD51C823CA4@anl.gov> References: <499325D1.9040601@mcs.anl.gov> <20090211145343.3df6ab8e@prnb> <529D3FFB-BA26-4736-A160-3AD51C823CA4@anl.gov> Message-ID: <20090211161234.3de09250@prnb> On Wed, 11 Feb 2009 15:34:33 -0600 Ian Foster wrote: > UniCloud provides that functionality, too. I imagine there are others. What Nimbus actually provides is much different, a generic virtual cluster configuration engine that runs on VMs from either EC2 or Nimbus clouds or even for clusters that span more than one. 
There happens to be one instantiation of a cluster that has gram2 + Torque (it's been around in some form for almost two years), but you can do anything (and people have). And it doesn't cost any extra money on top of what you pay EC2 (like UniCloud would). Tim > On Feb 11, 2009, at 2:53 PM, Tim Freeman wrote: > > > On Wed, 11 Feb 2009 20:33:16 +0000 (GMT) > > Ben Clifford wrote: > > > > [...] > >> > >>> - running Swift on clouds like E2C and Azure > >> > >> Tim Freeman has done some work standing up clusters on multiple EC2 > >> nodes, > >> so that the multiple nodes are exposed through a single (perhaps > >> GRAM?) > >> interface. So putting Swift on the front of that seems > >> straightforward to > >> define. > > > > Yep, was just starting some 100+ node GRAM/PBS clusters there last > > week. > > > > Tim > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From foster at anl.gov Wed Feb 11 16:15:45 2009 From: foster at anl.gov (Ian Foster) Date: Wed, 11 Feb 2009 16:15:45 -0600 Subject: [Swift-devel] scheduler stuff for Google Summer of Code 2009 In-Reply-To: <31731535.2162031234389166830.JavaMail.root@zimbra> References: <31731535.2162031234389166830.JavaMail.root@zimbra> Message-ID: <1002CF15-4A70-4EA9-8373-10F1E795CAEB@anl.gov> Mihael: I don't know how to parse your comment. If I write a program that performs a series of operations on many files, involving 1M tasks during its execution, but with only 10,000 active at each step, why is that not interesting? Or are you re- defining the problem to "have 1M tasks active at once"? That is a useful thing to be able to do, I am sure, but that does not mean that the former is not useful also. Ian. On Feb 11, 2009, at 3:52 PM, Mihael Hategan wrote: > > ----- Ian Foster wrote: >> I would argue that Swift support for collective operations also helps >> with scaling. 
(We can run a computation with 1M tasks, if not all 1M >> tasks are active at once.) > > I don't think we can talk about an improvement if we're moving from > "no can do" to "no can do in a different way". From wilde at mcs.anl.gov Wed Feb 11 16:28:39 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 11 Feb 2009 16:28:39 -0600 Subject: [Swift-devel] swift tools directory Message-ID: <49935117.3040500@mcs.anl.gov> As was briefly discussed long ago, Im going to make a tools/ directory in svn under https://svn.ci.uchicago.edu/svn/vdl2 (eg alongside things like provenancedb, www, etc) This is to hold things like the following, some of which may migrate to dist/bin as their distribution location, when they are ready and accepted: - enhanced swift run command (select multiple sites, etc) - sites command to compose sites.xml more dynamically - swift #include processor - osg/tg site listing/status/checking tools - bgp execution scripts For now, collaborators using this stuff would check it out separate from the core of swift. If anyone prefers a different name, place, or approach, let me know. I'll do this tonight, but can move/remove it as desired. From benc at hawaga.org.uk Wed Feb 11 16:30:57 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 11 Feb 2009 22:30:57 +0000 (GMT) Subject: [Swift-devel] swift tools directory In-Reply-To: <49935117.3040500@mcs.anl.gov> References: <49935117.3040500@mcs.anl.gov> Message-ID: On Wed, 11 Feb 2009, Michael Wilde wrote: > As was briefly discussed long ago, Im going to make a tools/ directory in svn > under https://svn.ci.uchicago.edu/svn/vdl2 (eg alongside things like > provenancedb, www, etc) That seems fine. 
-- From hategan at mcs.anl.gov Wed Feb 11 16:39:22 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 11 Feb 2009 16:39:22 -0600 (CST) Subject: [Swift-devel] scheduler stuff for Google Summer of Code 2009 In-Reply-To: Message-ID: <8223033.2165841234391962754.JavaMail.root@zimbra> ----- Ben Clifford wrote: > > On Wed, 11 Feb 2009, Mihael Hategan wrote: > > > I do not think that this issue is the bottleneck here. For every application > > I wasn't attempting to provide a comprehensive summary of stuff that won't > scale... mostly I wanted one example of another issue. > > > > > So I think that one order of magnitude improvement could be achieved by > > addressing that 10-20k problem (or by somehow having fewer karajan threads). > > 20k per app invocation is pretty heavyweight when invocations are > parallelised. > > I suspect that fewer karajan threads at any point in time can be brought > about by some streaming-like approach - rather than n elements to iterate > over forking n karajan threads at by saying parallelFor and having a > simultaneous thread for each (of which most threads, for large enough n, > will block), that construct could perhaps be made to act over time. So > stream-like (lists spread over time as well as space) behaviour for > foreach. > This may already happen in certain cases (e.g. a foreach acting on the product of another foreach). From hategan at mcs.anl.gov Wed Feb 11 16:49:28 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 11 Feb 2009 16:49:28 -0600 (CST) Subject: [Swift-devel] scheduler stuff for Google Summer of Code 2009 In-Reply-To: <1002CF15-4A70-4EA9-8373-10F1E795CAEB@anl.gov> Message-ID: <973187.2171551234392568270.JavaMail.root@zimbra> ----- Ian Foster wrote: > Mihael: > > I don't know how to parse your comment. > > If I write a program that performs a series of operations on many > files, involving 1M tasks during its execution, but with only 10,000 > active at each step, why is that not interesting? 
Or are you > redefining the problem to "have 1M tasks active at once"? That is a > useful thing to be able to do, I am sure, but that does not mean that > the former is not useful also. In the context in which the engine cannot reasonably support 1M tasks, it seems fairly pointless to say that, e.g., a faster GRAM is better for running 1M tasks with Swift. It makes no difference. From aespinosa at cs.uchicago.edu Wed Feb 11 19:07:10 2009 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Wed, 11 Feb 2009 19:07:10 -0600 Subject: [Swift-devel] data staging process/ documents? Message-ID: <50b07b4b0902111707q7aad742fh89c0dad88f744ecf@mail.gmail.com> Hi, I am attempting to actualize how collective operations on workflows (loosely-coupled) work in general. My initial idea is that this goes in the staging of data before executing a task in a workflow. Do we have documents describing these? I have a small idea of how it works from monitoring my Swift jobs as a workflow executes. My initial ideas are posted in http://www.ci.uchicago.edu/wiki/bin/view/VDS/DslCS/CollectiveIO -Allan From hategan at mcs.anl.gov Wed Feb 11 20:02:34 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 11 Feb 2009 20:02:34 -0600 (CST) Subject: [Swift-devel] coaster one-liner bootstrap script Message-ID: <12554808.2179801234404154196.JavaMail.root@zimbra> cog r2297 contains a patch to transform the bootstrap script to a one-liner (thanks to Mike for the tips). I did a sanity test on localhost. From hategan at mcs.anl.gov Wed Feb 11 20:04:06 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 11 Feb 2009 20:04:06 -0600 (CST) Subject: [Swift-devel] java home misdetection Message-ID: <15461697.2179831234404246287.JavaMail.root@zimbra> Cog r2297 has a patch for the java_home issue. Since JAVA_HOME doesn't appear to be needed, the script only attempts to find a java executable.
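A minimal sketch of that kind of lookup, not the actual cog r2297 patch (which is not shown in this thread): it assumes the bootstrap only needs a runnable `java`, so it searches PATH first and treats `JAVA_HOME` as an optional fallback. The name `find_java` and the fallback order are illustrative assumptions.

```python
import os
import shutil


def find_java(env=None):
    """Return the path of a usable java executable, or None.

    Hypothetical helper: JAVA_HOME is only a hint, never a requirement.
    """
    env = os.environ if env is None else env

    # Search the directories listed in PATH for an executable named "java".
    found = shutil.which("java", path=env.get("PATH", ""))
    if found:
        return found

    # Fall back to JAVA_HOME/bin/java only if it exists and is executable.
    java_home = env.get("JAVA_HOME")
    if java_home:
        candidate = os.path.join(java_home, "bin", "java")
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate

    return None
```

Calling `find_java()` with no argument uses the current process environment; passing a dict makes the lookup easy to exercise against a scratch directory.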
From hategan at mcs.anl.gov Wed Feb 11 20:06:19 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 11 Feb 2009 20:06:19 -0600 (CST) Subject: [Swift-devel] gt2 and runaway jobs Message-ID: <18026343.2179861234404379439.JavaMail.root@zimbra> This is probably getting annoying. Swift r2525 has a fix for the issue introduced by the runaway jobs patch: when gt2 was used, jobs would fail complaining about a bogus "tr" attribute. Or something like that. From wilde at mcs.anl.gov Wed Feb 11 22:29:17 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 11 Feb 2009 22:29:17 -0600 Subject: [Swift-devel] Rename versions of Globus commands in swift/bin? In-Reply-To: <4993A311.3060103@mcs.anl.gov> References: <4988A6A9.6030302@renci.org> <4993561E.1060105@mcs.anl.gov> <49936388.80207@mcs.anl.gov> <49939761.9020002@renci.org> <4993A311.3060103@mcs.anl.gov> Message-ID: <4993A59D.4010900@mcs.anl.gov> Mats raises a good point below. Can we rename globus-url-copy, grid-proxy-init, and other commands that have the same name as Globus commands, to swift-url-copy, swift-proxy-init, etc? Especially for those that are not identical enough to the Globus versions (where that is TBD). It's extremely handy to have these commands there, but perhaps confusing for some users who get these in their paths ahead of the Globus versions. I realize calling them swift-* causes its own kind of confusion, but I'm excited that folks like Mats are installing Swift for users, and I'd like to remove any barriers, even the small ones. This is likely to be a bigger issue for users with the OSG client stack installed. On 2/11/09 10:18 PM, Michael Wilde wrote: > > > On 2/11/09 9:28 PM, Mats Rynge wrote: ... >> >> Regarding bin/, it is pretty evil to have a globus-url-copy under the >> swift bin/ which has a different set of command line options than the >> Globus one.
I had a plan on adding swift to the default path on our >> submit node so all our users could use swift without doing anything >> special. But having a different globus-url-copy means it would break >> things for other users. > > I know what you mean. I actually got bit the other day by our copy of > grid-proxy-init - I was seeing the globus one, and one of my users was > seeing the swift one, with a slightly different output that broke a script. > > So I agree - if the swift version is not pretty near identical, its > better to give it another name. I'll past your comment to the devel list > with a suggestion that we rename it. Perhaps swift-proxy-init, > swift-url-copy? (not sure how that will go over... ;) > > - Mike Mats, I would just go ahead and remove or rename those, in the meantime. I dont think anything points to them. - Mike From foster at anl.gov Wed Feb 11 22:31:02 2009 From: foster at anl.gov (Ian Foster) Date: Wed, 11 Feb 2009 22:31:02 -0600 Subject: [Swift-devel] Rename versions of Globus commands in swift/bin? In-Reply-To: <4993A59D.4010900@mcs.anl.gov> References: <4988A6A9.6030302@renci.org> <4993561E.1060105@mcs.anl.gov> <49936388.80207@mcs.anl.gov> <49939761.9020002@renci.org> <4993A311.3060103@mcs.anl.gov> <4993A59D.4010900@mcs.anl.gov> Message-ID: <255CA353-FCE7-4A1E-B937-71EEF7E1E690@anl.gov> Mike: This begs the question for me as to why they are different. Are Swift proxies different from Globus proxies, for example? And if so, why? Ian. On Feb 11, 2009, at 10:29 PM, Michael Wilde wrote: > Matt raises a good point below. Can we rename globus-url-copy, grid- > proxy-init, and other commands that have the same name as Globus > commands, to swift-url-copy, swift-proxy-init, etc? > > Especially for those that are not identical enough to the Globus > versions (where that is tbd). > > Its extremely handy to have these commands there, but perhaps > confusing for some users that get these in their paths ahead of the > Globus versions. 
I realize calling them swift-* causes its own kind > of confusion, but I'm excited that folks like Mats are installing > Swift for users, and I'd like to remove any barriers, even the small > ones. > > This is likely to be a bigger issue for users with the OSG client > stack installed. > > On 2/11/09 10:18 PM, Michael Wilde wrote: >> On 2/11/09 9:28 PM, Mats Rynge wrote: > > ... >>> >>> Regarding bin/, it is pretty evil to have a globus-url-copy under >>> the >>> swift bin/ which has a different set of command line options as the >>> Globus one. I had a plan on adding swift to the default path on our >>> submit node so all our users could use swift without doing anything >>> special. But having a different globus-url-copy means it would break >>> things for other users. >> I know what you mean. I actually got bit the other day by our copy >> of grid-proxy-init - I was seeing the globus one, and one of my >> users was seeing the swift one, with a slightly different output >> that broke a script. >> So I agree - if the swift version is not pretty near identical, its >> better to give it another name. I'll past your comment to the devel >> list with a suggestion that we rename it. Perhaps swift-proxy-init, >> swift-url-copy? (not sure how that will go over... ;) >> - Mike > > Mats, I would just go ahead and remove or rename those, in the > meantime. > > I dont think anything points to them. 
> > - Mike > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From foster at anl.gov Wed Feb 11 22:36:45 2009 From: foster at anl.gov (Ian Foster) Date: Wed, 11 Feb 2009 22:36:45 -0600 Subject: [Swift-devel] scheduler stuff for Google Summer of Code 2009 In-Reply-To: <973187.2171551234392568270.JavaMail.root@zimbra> References: <973187.2171551234392568270.JavaMail.root@zimbra> Message-ID: <14F191F5-CB8C-403B-A032-053334D9169D@anl.gov> Mihael: I think we are exploring the limits of email as a communication vehicle :-) I wasn't talking about GRAM at all. I had understood someone to say that we can't run 1M tasks because each task needs 10KB (or similar), and 1M*10KB is a lot. I was observing that a workflow of 1M tasks can still be run if a smaller number are active at a time. That may seem like splitting hairs, but in fact that was a big reason for Swift's design. We would have applications that had multiple phases, each with thousands of tasks. As a DAG, this would not fit in memory (and was a pain to write of course). As a Swift program, we might have: foreach(i in 1:10,000) f() foreach(i in 1:10,000) g() etc. Ian. On Feb 11, 2009, at 4:49 PM, Mihael Hategan wrote: > ----- Ian Foster wrote: >> Mihael: >> >> I don't know how to parse your comment. >> >> If I write a program that performs a series of operations on many >> files, involving 1M tasks during its execution, but with only 10,000 >> active at each step, why is that not interesting? Or are you re- >> defining the problem to "have 1M tasks active at once"? That is a >> useful thing to be able to do, I am sure, but that does not mean that >> the former is not useful also. > > In the context in which the engine cannot reasonably support 1M tasks, > it seems fairly pointless to say that, e.g. a faster GRAM is better > for > running 1M tasks with Swift. It makes no difference. 
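A back-of-the-envelope check of the memory argument in this exchange, assuming the 10 KB to 20 KB per-task figure quoted in the thread (an estimate, not a measurement). It compares holding all 1M tasks in memory at once with holding only 10,000 active tasks; the helper name `footprint_bytes` is hypothetical.

```python
KB = 1000  # decimal kilobyte, to keep the arithmetic easy to follow


def footprint_bytes(tasks, per_task_kb):
    """Memory needed if `tasks` tasks each hold `per_task_kb` KB at once."""
    return tasks * per_task_kb * KB


for per_task_kb in (10, 20):
    all_at_once = footprint_bytes(1_000_000, per_task_kb)
    one_phase = footprint_bytes(10_000, per_task_kb)
    print(f"{per_task_kb} KB/task: "
          f"1M active = {all_at_once / 1e9:.0f} GB, "
          f"10,000 active = {one_phase / 1e6:.0f} MB")
```

Under these assumptions the gap is two orders of magnitude (10 to 20 GB versus 100 to 200 MB), which is why the number of simultaneously active tasks matters more than the total task count.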
From wilde at mcs.anl.gov Wed Feb 11 22:37:35 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 11 Feb 2009 22:37:35 -0600 Subject: [Swift-devel] Rename versions of Globus commands in swift/bin? In-Reply-To: <255CA353-FCE7-4A1E-B937-71EEF7E1E690@anl.gov> References: <4988A6A9.6030302@renci.org> <4993561E.1060105@mcs.anl.gov> <49936388.80207@mcs.anl.gov> <49939761.9020002@renci.org> <4993A311.3060103@mcs.anl.gov> <4993A59D.4010900@mcs.anl.gov> <255CA353-FCE7-4A1E-B937-71EEF7E1E690@anl.gov> Message-ID: <4993A78F.4050604@mcs.anl.gov> These are cog-based versions of the commands. The cool thing is that users get this core Globus functionality with no compilation needed: just untar Swift, and poof, you can make proxies, run jobs, move files (eg for setting this up on remote grid sites). The issue I bumped into with proxies was cosmetic. The proxies are totally compatible as far as I know. I just happened to have a front-end script for running swift code that checked to make sure the user has a valid proxy with some time left. And the time format returned by the swift grid-proxy-info was slightly different than the Globus one. That broke my brittle little script - its not a criticism of the cog code version. But its easy enough for us to refer, in swift docs, to swift-proxy-init, swift-proxy-info with a * explaining that the Globus versions of these are fine if you happen to have them. - Mike On 2/11/09 10:31 PM, Ian Foster wrote: > Mike: > > This begs the question for me as to why they are different. Are Swift > proxies different from Globus proxies, for example? And if so, why? > > Ian. > > > On Feb 11, 2009, at 10:29 PM, Michael Wilde wrote: > >> Matt raises a good point below. Can we rename globus-url-copy, >> grid-proxy-init, and other commands that have the same name as Globus >> commands, to swift-url-copy, swift-proxy-init, etc? >> >> Especially for those that are not identical enough to the Globus >> versions (where that is tbd). 
>> >> Its extremely handy to have these commands there, but perhaps >> confusing for some users that get these in their paths ahead of the >> Globus versions. I realize calling them swift-* causes its own kind of >> confusion, but I'm excited that folks like Mats are installing Swift >> for users, and I'd like to remove any barriers, even the small ones. >> >> This is likely to be a bigger issue for users with the OSG client >> stack installed. >> >> On 2/11/09 10:18 PM, Michael Wilde wrote: >>> On 2/11/09 9:28 PM, Mats Rynge wrote: >> >> ... >>>> >>>> Regarding bin/, it is pretty evil to have a globus-url-copy under the >>>> swift bin/ which has a different set of command line options as the >>>> Globus one. I had a plan on adding swift to the default path on our >>>> submit node so all our users could use swift without doing anything >>>> special. But having a different globus-url-copy means it would break >>>> things for other users. >>> I know what you mean. I actually got bit the other day by our copy of >>> grid-proxy-init - I was seeing the globus one, and one of my users >>> was seeing the swift one, with a slightly different output that broke >>> a script. >>> So I agree - if the swift version is not pretty near identical, its >>> better to give it another name. I'll past your comment to the devel >>> list with a suggestion that we rename it. Perhaps swift-proxy-init, >>> swift-url-copy? (not sure how that will go over... ;) >>> - Mike >> >> Mats, I would just go ahead and remove or rename those, in the meantime. >> >> I dont think anything points to them. >> >> - Mike >> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From hategan at mcs.anl.gov Wed Feb 11 23:06:13 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 11 Feb 2009 23:06:13 -0600 Subject: [Swift-devel] Rename versions of Globus commands in swift/bin? 
In-Reply-To: <4993A59D.4010900@mcs.anl.gov> References: <4988A6A9.6030302@renci.org> <4993561E.1060105@mcs.anl.gov> <49936388.80207@mcs.anl.gov> <49939761.9020002@renci.org> <4993A311.3060103@mcs.anl.gov> <4993A59D.4010900@mcs.anl.gov> Message-ID: <1234415173.1513.1.camel@localhost> On Wed, 2009-02-11 at 22:29 -0600, Michael Wilde wrote: > Matt raises a good point below. Can we rename globus-url-copy, > grid-proxy-init, and other commands that have the same name as Globus > commands, to swift-url-copy, swift-proxy-init, etc? I think the less confusing and accurate names would be cog-*. > > Especially for those that are not identical enough to the Globus > versions (where that is tbd). > > Its extremely handy to have these commands there, but perhaps confusing > for some users that get these in their paths ahead of the Globus > versions. I realize calling them swift-* causes its own kind of > confusion, but I'm excited that folks like Mats are installing Swift for > users, and I'd like to remove any barriers, even the small ones. > > This is likely to be a bigger issue for users with the OSG client stack > installed. > > On 2/11/09 10:18 PM, Michael Wilde wrote: > > > > > > On 2/11/09 9:28 PM, Mats Rynge wrote: > > ... > >> > >> Regarding bin/, it is pretty evil to have a globus-url-copy under the > >> swift bin/ which has a different set of command line options as the > >> Globus one. I had a plan on adding swift to the default path on our > >> submit node so all our users could use swift without doing anything > >> special. But having a different globus-url-copy means it would break > >> things for other users. > > > > I know what you mean. I actually got bit the other day by our copy of > > grid-proxy-init - I was seeing the globus one, and one of my users was > > seeing the swift one, with a slightly different output that broke a script. > > > > So I agree - if the swift version is not pretty near identical, its > > better to give it another name. 
I'll pass your comment to the devel list > > with a suggestion that we rename it. Perhaps swift-proxy-init, > > swift-url-copy? (not sure how that will go over... ;) > > > > - Mike > > Mats, I would just go ahead and remove or rename those, in the meantime. > > I don't think anything points to them. > > - Mike From hategan at mcs.anl.gov Wed Feb 11 23:31:14 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 11 Feb 2009 23:31:14 -0600 Subject: [Swift-devel] scheduler stuff for Google Summer of Code 2009 In-Reply-To: <14F191F5-CB8C-403B-A032-053334D9169D@anl.gov> References: <973187.2171551234392568270.JavaMail.root@zimbra> <14F191F5-CB8C-403B-A032-053334D9169D@anl.gov> Message-ID: <1234416674.1513.26.camel@localhost> On Wed, 2009-02-11 at 22:36 -0600, Ian Foster wrote: > Mihael: > > I think we are exploring the limits of email as a communication > vehicle :-) I just think some subjects require more emails than others :) > > I wasn't talking about GRAM at all. I know. I gave it as an example, because we both know clearly what it is. Let me back up a bit: >> If I write a program that performs a series of operations on many >> files, involving 1M tasks during its execution, but with only 10,000 >> active at each step, why is that not interesting? It is interesting, but the part where you can have a 1M-task workflow in Swift with only 10,000 karajan threads active at a time is what I'm questioning. So before we talk about better I/O or job execution providers for the 1M-task workflow, we need to make sure that the engine can run the 1M-task workflow in the first place. Otherwise the I/O can be as fast as you want and Swift still won't run the workflow.
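One way to picture the bounded-activity idea discussed in this thread is a foreach that admits new iterations only as earlier ones finish. The sketch below is plain Python, not Karajan or Swift code, and says nothing about how the engine actually behaves; `bounded_foreach` and its `limit` parameter are hypothetical names used for illustration.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def bounded_foreach(items, body, limit):
    """Apply body to every item with at most `limit` iterations in flight.

    Items are pulled from the iterable lazily, one per free slot. The
    futures are kept only so that results can be returned in order.
    """
    slots = threading.Semaphore(limit)
    futures = []

    def run(item):
        try:
            return body(item)
        finally:
            slots.release()  # free a slot so the next item can start

    with ThreadPoolExecutor(max_workers=limit) as pool:
        for item in items:
            slots.acquire()  # block until fewer than `limit` are active
            futures.append(pool.submit(run, item))
    return [f.result() for f in futures]
```

The design choice here is that the loop itself blocks, so the number of running iterations depends on `limit` rather than on the length of the input.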
From hategan at mcs.anl.gov Wed Feb 11 23:33:40 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 11 Feb 2009 23:33:40 -0600 Subject: [Swift-devel] Rename versions of Globus commands in swift/bin? In-Reply-To: <4993A78F.4050604@mcs.anl.gov> References: <4988A6A9.6030302@renci.org> <4993561E.1060105@mcs.anl.gov> <49936388.80207@mcs.anl.gov> <49939761.9020002@renci.org> <4993A311.3060103@mcs.anl.gov> <4993A59D.4010900@mcs.anl.gov> <255CA353-FCE7-4A1E-B937-71EEF7E1E690@anl.gov> <4993A78F.4050604@mcs.anl.gov> Message-ID: <1234416820.1513.30.camel@localhost> On Wed, 2009-02-11 at 22:37 -0600, Michael Wilde wrote: > These are cog-based versions of the commands. > > The cool thing is that users get this core Globus functionality with no > compilation needed: just untar Swift, and poof, you can make proxies, > run jobs, move files (eg for setting this up on remote grid sites). We rename them to cog-* then? > > The issue I bumped into with proxies was cosmetic. The proxies are > totally compatible as far as I know. Yes. > I just happened to have a > front-end script for running swift code that checked to make sure the > user has a valid proxy with some time left. And the time format returned > by the swift grid-proxy-info was slightly different than the Globus one. > > That broke my brittle little script - its not a criticism of the cog > code version. Though that may be possible to fix. url-copy not so much. > > But its easy enough for us to refer, in swift docs, to swift-proxy-init, > swift-proxy-info with a * explaining that the Globus versions of these > are fine if you happen to have them. Or the other way around. We use globus-* by default and then say that the user could use the other ones. > > - Mike > > > On 2/11/09 10:31 PM, Ian Foster wrote: > > Mike: > > > > This begs the question for me as to why they are different. Are Swift > > proxies different from Globus proxies, for example? And if so, why? > > > > Ian. 
> > > > > > On Feb 11, 2009, at 10:29 PM, Michael Wilde wrote: > > > >> Matt raises a good point below. Can we rename globus-url-copy, > >> grid-proxy-init, and other commands that have the same name as Globus > >> commands, to swift-url-copy, swift-proxy-init, etc? > >> > >> Especially for those that are not identical enough to the Globus > >> versions (where that is tbd). > >> > >> Its extremely handy to have these commands there, but perhaps > >> confusing for some users that get these in their paths ahead of the > >> Globus versions. I realize calling them swift-* causes its own kind of > >> confusion, but I'm excited that folks like Mats are installing Swift > >> for users, and I'd like to remove any barriers, even the small ones. > >> > >> This is likely to be a bigger issue for users with the OSG client > >> stack installed. > >> > >> On 2/11/09 10:18 PM, Michael Wilde wrote: > >>> On 2/11/09 9:28 PM, Mats Rynge wrote: > >> > >> ... > >>>> > >>>> Regarding bin/, it is pretty evil to have a globus-url-copy under the > >>>> swift bin/ which has a different set of command line options as the > >>>> Globus one. I had a plan on adding swift to the default path on our > >>>> submit node so all our users could use swift without doing anything > >>>> special. But having a different globus-url-copy means it would break > >>>> things for other users. > >>> I know what you mean. I actually got bit the other day by our copy of > >>> grid-proxy-init - I was seeing the globus one, and one of my users > >>> was seeing the swift one, with a slightly different output that broke > >>> a script. > >>> So I agree - if the swift version is not pretty near identical, its > >>> better to give it another name. I'll past your comment to the devel > >>> list with a suggestion that we rename it. Perhaps swift-proxy-init, > >>> swift-url-copy? (not sure how that will go over... ;) > >>> - Mike > >> > >> Mats, I would just go ahead and remove or rename those, in the meantime. 
> >> > >> I dont think anything points to them. > >> > >> - Mike > >> > >> _______________________________________________ > >> Swift-devel mailing list > >> Swift-devel at ci.uchicago.edu > >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From foster at anl.gov Wed Feb 11 23:38:30 2009 From: foster at anl.gov (Ian Foster) Date: Wed, 11 Feb 2009 23:38:30 -0600 Subject: [Swift-devel] Rename versions of Globus commands in swift/bin? In-Reply-To: <1234416820.1513.30.camel@localhost> References: <4988A6A9.6030302@renci.org> <4993561E.1060105@mcs.anl.gov> <49936388.80207@mcs.anl.gov> <49939761.9020002@renci.org> <4993A311.3060103@mcs.anl.gov> <4993A59D.4010900@mcs.anl.gov> <255CA353-FCE7-4A1E-B937-71EEF7E1E690@anl.gov> <4993A78F.4050604@mcs.anl.gov> <1234416820.1513.30.camel@localhost> Message-ID: <5CA10434-F090-41B6-A1C4-9AB67B0A88EC@anl.gov> My view is that the CoG and C versions of basic Globus commands should have the same behavior. If they do not, that is a bug. It should be reported and fixed, not worked around. I appreciate that others may not share that perspective. On Feb 11, 2009, at 11:33 PM, Mihael Hategan wrote: > On Wed, 2009-02-11 at 22:37 -0600, Michael Wilde wrote: >> These are cog-based versions of the commands. >> >> The cool thing is that users get this core Globus functionality >> with no >> compilation needed: just untar Swift, and poof, you can make proxies, >> run jobs, move files (eg for setting this up on remote grid sites). > > We rename them to cog-* then? > >> >> The issue I bumped into with proxies was cosmetic. The proxies are >> totally compatible as far as I know. > > Yes. > >> I just happened to have a >> front-end script for running swift code that checked to make sure the >> user has a valid proxy with some time left. 
And the time format >> returned >> by the swift grid-proxy-info was slightly different than the Globus >> one. >> >> That broke my brittle little script - its not a criticism of the cog >> code version. > > Though that may be possible to fix. url-copy not so much. > >> >> But its easy enough for us to refer, in swift docs, to swift-proxy- >> init, >> swift-proxy-info with a * explaining that the Globus versions of >> these >> are fine if you happen to have them. > > Or the other way around. We use globus-* by default and then say that > the user could use the other ones. > >> >> - Mike >> >> >> On 2/11/09 10:31 PM, Ian Foster wrote: >>> Mike: >>> >>> This begs the question for me as to why they are different. Are >>> Swift >>> proxies different from Globus proxies, for example? And if so, why? >>> >>> Ian. >>> >>> >>> On Feb 11, 2009, at 10:29 PM, Michael Wilde wrote: >>> >>>> Matt raises a good point below. Can we rename globus-url-copy, >>>> grid-proxy-init, and other commands that have the same name as >>>> Globus >>>> commands, to swift-url-copy, swift-proxy-init, etc? >>>> >>>> Especially for those that are not identical enough to the Globus >>>> versions (where that is tbd). >>>> >>>> Its extremely handy to have these commands there, but perhaps >>>> confusing for some users that get these in their paths ahead of the >>>> Globus versions. I realize calling them swift-* causes its own >>>> kind of >>>> confusion, but I'm excited that folks like Mats are installing >>>> Swift >>>> for users, and I'd like to remove any barriers, even the small >>>> ones. >>>> >>>> This is likely to be a bigger issue for users with the OSG client >>>> stack installed. >>>> >>>> On 2/11/09 10:18 PM, Michael Wilde wrote: >>>>> On 2/11/09 9:28 PM, Mats Rynge wrote: >>>> >>>> ... >>>>>> >>>>>> Regarding bin/, it is pretty evil to have a globus-url-copy >>>>>> under the >>>>>> swift bin/ which has a different set of command line options as >>>>>> the >>>>>> Globus one. 
I had a plan on adding swift to the default path on >>>>>> our >>>>>> submit node so all our users could use swift without doing >>>>>> anything >>>>>> special. But having a different globus-url-copy means it would >>>>>> break >>>>>> things for other users. >>>>> I know what you mean. I actually got bit the other day by our >>>>> copy of >>>>> grid-proxy-init - I was seeing the globus one, and one of my users >>>>> was seeing the swift one, with a slightly different output that >>>>> broke >>>>> a script. >>>>> So I agree - if the swift version is not pretty near identical, >>>>> its >>>>> better to give it another name. I'll past your comment to the >>>>> devel >>>>> list with a suggestion that we rename it. Perhaps swift-proxy- >>>>> init, >>>>> swift-url-copy? (not sure how that will go over... ;) >>>>> - Mike >>>> >>>> Mats, I would just go ahead and remove or rename those, in the >>>> meantime. >>>> >>>> I dont think anything points to them. >>>> >>>> - Mike >>>> >>>> _______________________________________________ >>>> Swift-devel mailing list >>>> Swift-devel at ci.uchicago.edu >>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From wilde at mcs.anl.gov Wed Feb 11 23:46:20 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 11 Feb 2009 23:46:20 -0600 Subject: [Swift-devel] Rename versions of Globus commands in swift/bin? 
In-Reply-To: <1234416820.1513.30.camel@localhost> References: <4988A6A9.6030302@renci.org> <4993561E.1060105@mcs.anl.gov> <49936388.80207@mcs.anl.gov> <49939761.9020002@renci.org> <4993A311.3060103@mcs.anl.gov> <4993A59D.4010900@mcs.anl.gov> <255CA353-FCE7-4A1E-B937-71EEF7E1E690@anl.gov> <4993A78F.4050604@mcs.anl.gov> <1234416820.1513.30.camel@localhost> Message-ID: <4993B7AC.3020408@mcs.anl.gov> On 2/11/09 11:33 PM, Mihael Hategan wrote: > On Wed, 2009-02-11 at 22:37 -0600, Michael Wilde wrote: >> These are cog-based versions of the commands. >> >> The cool thing is that users get this core Globus functionality with no >> compilation needed: just untar Swift, and poof, you can make proxies, >> run jobs, move files (eg for setting this up on remote grid sites). > > We rename them to cog-* then? That would be fine by me. Unless Ben weighs in with a different view, please do. >> The issue I bumped into with proxies was cosmetic. The proxies are >> totally compatible as far as I know. > > Yes. > >> I just happened to have a >> front-end script for running swift code that checked to make sure the >> user has a valid proxy with some time left. And the time format returned >> by the swift grid-proxy-info was slightly different than the Globus one. >> >> That broke my brittle little script - its not a criticism of the cog >> code version. > > Though that may be possible to fix. url-copy not so much. I think this is pretty low on the prio list. If its easy, yes, please do. Else file it as low prio in bugzilla, imo. >> But its easy enough for us to refer, in swift docs, to swift-proxy-init, >> swift-proxy-info with a * explaining that the Globus versions of these >> are fine if you happen to have them. > > Or the other way around. We use globus-* by default and then say that > the user could use the other ones. Yes, that would be better. OSG and TG users are likely to have the globus- versions in their paths. New users running swift on their own hosts are likely not to. 
At the moment, the users guide doesnt cover this at all; its a future issue for when it does. - Mike >> - Mike >> >> >> On 2/11/09 10:31 PM, Ian Foster wrote: >>> Mike: >>> >>> This begs the question for me as to why they are different. Are Swift >>> proxies different from Globus proxies, for example? And if so, why? >>> >>> Ian. >>> >>> >>> On Feb 11, 2009, at 10:29 PM, Michael Wilde wrote: >>> >>>> Matt raises a good point below. Can we rename globus-url-copy, >>>> grid-proxy-init, and other commands that have the same name as Globus >>>> commands, to swift-url-copy, swift-proxy-init, etc? >>>> >>>> Especially for those that are not identical enough to the Globus >>>> versions (where that is tbd). >>>> >>>> Its extremely handy to have these commands there, but perhaps >>>> confusing for some users that get these in their paths ahead of the >>>> Globus versions. I realize calling them swift-* causes its own kind of >>>> confusion, but I'm excited that folks like Mats are installing Swift >>>> for users, and I'd like to remove any barriers, even the small ones. >>>> >>>> This is likely to be a bigger issue for users with the OSG client >>>> stack installed. >>>> >>>> On 2/11/09 10:18 PM, Michael Wilde wrote: >>>>> On 2/11/09 9:28 PM, Mats Rynge wrote: >>>> ... >>>>>> Regarding bin/, it is pretty evil to have a globus-url-copy under the >>>>>> swift bin/ which has a different set of command line options as the >>>>>> Globus one. I had a plan on adding swift to the default path on our >>>>>> submit node so all our users could use swift without doing anything >>>>>> special. But having a different globus-url-copy means it would break >>>>>> things for other users. >>>>> I know what you mean. I actually got bit the other day by our copy of >>>>> grid-proxy-init - I was seeing the globus one, and one of my users >>>>> was seeing the swift one, with a slightly different output that broke >>>>> a script. 
>>>>> So I agree - if the swift version is not pretty near identical, its >>>>> better to give it another name. I'll past your comment to the devel >>>>> list with a suggestion that we rename it. Perhaps swift-proxy-init, >>>>> swift-url-copy? (not sure how that will go over... ;) >>>>> - Mike >>>> Mats, I would just go ahead and remove or rename those, in the meantime. >>>> >>>> I dont think anything points to them. >>>> >>>> - Mike >>>> >>>> _______________________________________________ >>>> Swift-devel mailing list >>>> Swift-devel at ci.uchicago.edu >>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From hategan at mcs.anl.gov Thu Feb 12 00:01:35 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 12 Feb 2009 00:01:35 -0600 Subject: [Swift-devel] Rename versions of Globus commands in swift/bin? In-Reply-To: <5CA10434-F090-41B6-A1C4-9AB67B0A88EC@anl.gov> References: <4988A6A9.6030302@renci.org> <4993561E.1060105@mcs.anl.gov> <49936388.80207@mcs.anl.gov> <49939761.9020002@renci.org> <4993A311.3060103@mcs.anl.gov> <4993A59D.4010900@mcs.anl.gov> <255CA353-FCE7-4A1E-B937-71EEF7E1E690@anl.gov> <4993A78F.4050604@mcs.anl.gov> <1234416820.1513.30.camel@localhost> <5CA10434-F090-41B6-A1C4-9AB67B0A88EC@anl.gov> Message-ID: <1234418495.1513.35.camel@localhost> On Wed, 2009-02-11 at 23:38 -0600, Ian Foster wrote: > My view is that the CoG and C versions of basic Globus commands should > have the same behavior. If they do not, that is a bug. It should be > reported and fixed, not worked around. I appreciate that others may > not share that perspective. I very much agree. But it seems there aren't many resources to fix those bugs, so we end up working around them. 

From hategan at mcs.anl.gov Thu Feb 12 00:03:40 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 12 Feb 2009 00:03:40 -0600
Subject: [Swift-devel] Rename versions of Globus commands in swift/bin?
In-Reply-To: <4993B7AC.3020408@mcs.anl.gov>
References: <4988A6A9.6030302@renci.org> <4993561E.1060105@mcs.anl.gov> <49936388.80207@mcs.anl.gov> <49939761.9020002@renci.org> <4993A311.3060103@mcs.anl.gov> <4993A59D.4010900@mcs.anl.gov> <255CA353-FCE7-4A1E-B937-71EEF7E1E690@anl.gov> <4993A78F.4050604@mcs.anl.gov> <1234416820.1513.30.camel@localhost> <4993B7AC.3020408@mcs.anl.gov>
Message-ID: <1234418620.1513.36.camel@localhost>

On Wed, 2009-02-11 at 23:46 -0600, Michael Wilde wrote:
> >> That broke my brittle little script - it's not a criticism of the cog
> >> code version.
> >
> > Though that may be possible to fix. url-copy not so much.
>
> I think this is pretty low on the prio list. If it's easy, yes, please
> do. Else file it as low prio in bugzilla, imo.

I can ping Rachana.

From wilde at mcs.anl.gov Thu Feb 12 01:05:51 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 12 Feb 2009 01:05:51 -0600
Subject: [Swift-devel] coaster one-liner bootstrap script
In-Reply-To: <12554808.2179801234404154196.JavaMail.root@zimbra>
References: <12554808.2179801234404154196.JavaMail.root@zimbra>
Message-ID: <4993CA4F.3030609@mcs.anl.gov>

I updated cog and swift to the latest, and tried on teraport.

sites.xml was:

fast
00:05:00
/gpfs1/osg/data/oops/swiftwork

I got: coaster-bootstrap.list not found in classpath

Output is below. I lost the log, but it didn't say much more than what's
below.

/home/wilde/swift/tools/swiftrun: Swift script oops7.swift starting at Wed Feb 11 22:50:33 CST 2009
running on sites: teraport.coaster.gt2.osg

Swift svn swift-r2527 cog-r2297

RunID: 20090211-2250-gkzsaa90
Progress:
Progress: Stage in:1 Initializing site shared directory:1
Progress: Stage in:1 Submitting:1
Warning: missing walltime specification for "runoops". Assuming 10 minutes.
Failed to transfer wrapper log from oops7-20090211-2250-gkzsaa90/info/7 on teraport
Execution failed:
Failed to transfer wrapper log from oops7-20090211-2250-gkzsaa90/info/8 on teraport
Exception in runoops:
Arguments: [input/fasta/T1af7.fasta, input/secseq/T1af7.secseq, input/native/T1af7.pdb, output/T1af7.0.pdt, output/T1af7.0.rmsd, 0, TEMP UPDATE INTERVAL = 10, SMOOTH DEVIATION COEFFICIENT = 0.80001]
Host: teraport
Directory: oops7-20090211-2250-gkzsaa90/jobs/7/runoops-7aj2uh6j
stderr.txt:

stdout.txt:

----

Caused by:
Could not submit job
Caused by:
Could not start coaster service
Caused by:
java.lang.RuntimeException: coaster-bootstrap.list not found in classpath
Cleaning up...
Done

/home/wilde/swift/tools/swiftrun: Swift Script oops7.swift ended at Wed Feb 11 22:50:40 CST 2009 with exit code 0
com$

On 2/11/09 8:02 PM, Mihael Hategan wrote:
> cog r2297 contains a patch to transform the bootstrap script
> to a one-liner (thanks to Mike for the tips).
>
> I did a sanity test on localhost.

From benc at hawaga.org.uk Thu Feb 12 03:23:13 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 12 Feb 2009 09:23:13 +0000 (GMT)
Subject: [Swift-devel] Rename versions of Globus commands in swift/bin?
In-Reply-To: <5CA10434-F090-41B6-A1C4-9AB67B0A88EC@anl.gov>
References: <4988A6A9.6030302@renci.org> <4993561E.1060105@mcs.anl.gov> <49936388.80207@mcs.anl.gov> <49939761.9020002@renci.org> <4993A311.3060103@mcs.anl.gov> <4993A59D.4010900@mcs.anl.gov> <255CA353-FCE7-4A1E-B937-71EEF7E1E690@anl.gov> <4993A78F.4050604@mcs.anl.gov> <1234416820.1513.30.camel@localhost> <5CA10434-F090-41B6-A1C4-9AB67B0A88EC@anl.gov>
Message-ID:

On Wed, 11 Feb 2009, Ian Foster wrote:

> My view is that the CoG and C versions of basic Globus commands should have
> the same behavior. If they do not, that is a bug. It should be reported and
> fixed, not worked around. I appreciate that others may not share that
> perspective.

I tend to agree (especially when they share the same filename).

But, these are cog vs GT user interface bugs, not Swift bugs, and I don't
think Swift developers should expend any non-trivial amount of work fixing
such (although reporting them to the appropriate place is always nice)

--

From benc at hawaga.org.uk Thu Feb 12 03:31:09 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 12 Feb 2009 09:31:09 +0000 (GMT)
Subject: [Swift-devel] Rename versions of Globus commands in swift/bin?
In-Reply-To: <4993A59D.4010900@mcs.anl.gov>
References: <4988A6A9.6030302@renci.org> <4993561E.1060105@mcs.anl.gov> <49936388.80207@mcs.anl.gov> <49939761.9020002@renci.org> <4993A311.3060103@mcs.anl.gov> <4993A59D.4010900@mcs.anl.gov>
Message-ID:

Relevant to this thread, there is a build option:

ant -Dno-supporting=true

(hurrah for more double negatives - should fix that too)

which makes a swift build without potentially conflicting commands.

I did this specifically to ease installation onto systems where a decent
grid stack is already deployed.

That goes a long way to solve the problem that motivated Mike's initial
mail (which was someone trying to install swift onto a system where a grid
stack is already deployed).

At the moment, that needs a source build, in order to get the build
option.

Release-mechanics-wise, it would be possible to put up a version with, and
a version without, the supporting material. I'm a little wary of making
more release combinations (I like the single one that we have), but
perhaps this is the correct thing to do. It also fits in nicely with the
pacman packaging that I experimented with in an earlier release and have
had no feedback for.

--

From benc at hawaga.org.uk Thu Feb 12 03:58:05 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 12 Feb 2009 09:58:05 +0000 (GMT)
Subject: [Swift-devel] Rename versions of Globus commands in swift/bin?
In-Reply-To:
References: <4988A6A9.6030302@renci.org> <4993561E.1060105@mcs.anl.gov> <49936388.80207@mcs.anl.gov> <49939761.9020002@renci.org> <4993A311.3060103@mcs.anl.gov> <4993A59D.4010900@mcs.anl.gov>
Message-ID:

On Thu, 12 Feb 2009, Ben Clifford wrote:

> Release-mechanics-wise, it would be possible to put up a version with, and
> a version without, the supporting material. I'm a little wary of making
> more release combinations (I like the single one that we have), but
> perhaps this is the correct thing to do. It also fits in nicely with the
> pacman packaging that I experimented with in an earlier release and have
> had no feedback for.

I just made:

http://www.ci.uchicago.edu/swift/packages/swift-0.8-stripped.tar.gz

so built, and will link to it from the release page alongside the existing
full 0.8 release.

--

From benc at hawaga.org.uk Thu Feb 12 06:54:31 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 12 Feb 2009 12:54:31 +0000 (GMT)
Subject: [Swift-devel] walltime compulsion
Message-ID:

Recent walltime violation commits seem to change the way in which
applications with unspecified walltimes behave.

They now get this treatment:

> Warning: missing walltime specification for "echo". Assuming 10 minutes.

I don't think it is appropriate to assume that any arbitrary program in
Swift has a 10 minute maxwalltime, if none is specified.

None should be assumed, and if that means some functionality based on
having a walltime doesn't do anything for those tasks, then so be it.
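
The behaviour argued for here can be sketched roughly as follows. This is only an illustration: `TcEntry`, `walltime_features_apply` and `queue_request` are hypothetical names standing in for a tc.data entry and its consumers, not Swift's actual code. An unspecified maxwalltime stays unspecified, and anything that depends on a walltime simply skips such entries.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TcEntry:
    """Hypothetical stand-in for one tc.data entry."""
    app: str
    maxwalltime_min: Optional[int] = None  # None means "not specified"


def walltime_features_apply(entry: TcEntry) -> bool:
    # Walltime-based functionality is only used when the user gave a
    # walltime; no value is invented on the user's behalf.
    return entry.maxwalltime_min is not None


def queue_request(entry: TcEntry) -> dict:
    # Pass a walltime to the remote queue only if one was specified, so
    # that the queue's own default applies otherwise.
    request = {"executable": entry.app}
    if entry.maxwalltime_min is not None:
        request["maxwalltime"] = entry.maxwalltime_min
    return request
```

With this shape, an entry such as `TcEntry("echo")` is submitted without any walltime attribute at all, while `TcEntry("runoops", 30)` carries its 30 minutes through.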

--

From wilde at mcs.anl.gov Thu Feb 12 08:25:05 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 12 Feb 2009 08:25:05 -0600
Subject: [Swift-devel] coaster one-liner bootstrap script
In-Reply-To: <4993CA4F.3030609@mcs.anl.gov>
References: <12554808.2179801234404154196.JavaMail.root@zimbra> <4993CA4F.3030609@mcs.anl.gov>
Message-ID: <49943141.9020701@mcs.anl.gov>

I forgot to add: there was no gram or coaster log on the target site,
teraport, under ~osg, which is what my cert is mapped to.

As far as I could tell, the job never made it to the site, or even to gram.

Is this message the result of a client-side check, before the bootstrap
job is launched?

- Mike

On 2/12/09 1:05 AM, Michael Wilde wrote:
> I updated cog and swift to the latest, and tried on teraport.
>
> sites.xml was:
>
> fast
> key="coasterWorkerMaxwalltime">00:05:00
>
> jobmanager="gt2:gt2:pbs" />
> /gpfs1/osg/data/oops/swiftwork
>
> I got: coaster-bootstrap.list not found in classpath
>
> Output is below. I lost the log, but it didn't say much more than what's
> below.
>
> /home/wilde/swift/tools/swiftrun: Swift script oops7.swift starting at
> Wed Feb 11 22:50:33 CST 2009
> running on sites: teraport.coaster.gt2.osg
>
> Swift svn swift-r2527 cog-r2297
>
> RunID: 20090211-2250-gkzsaa90
> Progress:
> Progress: Stage in:1 Initializing site shared directory:1
> Progress: Stage in:1 Submitting:1
> Warning: missing walltime specification for "runoops". Assuming 10 minutes.
> Failed to transfer wrapper log from oops7-20090211-2250-gkzsaa90/info/7
> on teraport
> Execution failed:
> Failed to transfer wrapper log from
> oops7-20090211-2250-gkzsaa90/info/8 on teraport
> Exception in runoops:
> Arguments: [input/fasta/T1af7.fasta, input/secseq/T1af7.secseq,
> input/native/T1af7.pdb, output/T1af7.0.pdt, output/T1af7.0.rmsd, 0, TEMP
> UPDATE INTERVAL = 10, SMOOTH DEVIATION COEFFICIENT = 0.80001]
> Host: teraport
> Directory: oops7-20090211-2250-gkzsaa90/jobs/7/runoops-7aj2uh6j
> stderr.txt:
>
> stdout.txt:
>
> ----
>
> Caused by:
> Could not submit job
> Caused by:
> Could not start coaster service
> Caused by:
> java.lang.RuntimeException: coaster-bootstrap.list not found in
> classpath
> Cleaning up...
> Done
>
> /home/wilde/swift/tools/swiftrun: Swift Script oops7.swift ended at Wed
> Feb 11 22:50:40 CST 2009 with exit code 0
> com$
>
>
> On 2/11/09 8:02 PM, Mihael Hategan wrote:
>> cog r2297 contains a patch to transform the bootstrap script to a
>> one-liner (thanks to Mike for the tips).
>>
>> I did a sanity test on localhost.

From hategan at mcs.anl.gov Thu Feb 12 09:33:26 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 12 Feb 2009 09:33:26 -0600
Subject: [Swift-devel] walltime compulsion
In-Reply-To:
References:
Message-ID: <1234452806.3694.1.camel@localhost>

I have yet to see a queuing system that works that same way (not that
I've seen many).

On Thu, 2009-02-12 at 12:54 +0000, Ben Clifford wrote:
> Recent walltime violation commits seem to change the way in which
> applications with unspecified walltimes behave.
>
> They now get this treatment:
>
> > Warning: missing walltime specification for "echo". Assuming 10 minutes.
>
> I don't think it is appropriate to assume that any arbitrary program in
> Swift has a 10 minute maxwalltime, if none is specified.
>
> None should be assumed, and if that means some functionality based on
> having a walltime doesn't do anything for those tasks, then so be it.
>

From benc at hawaga.org.uk Thu Feb 12 09:36:43 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 12 Feb 2009 15:36:43 +0000 (GMT)
Subject: [Swift-devel] walltime compulsion
In-Reply-To: <1234452806.3694.1.camel@localhost>
References: <1234452806.3694.1.camel@localhost>
Message-ID:

On Thu, 12 Feb 2009, Mihael Hategan wrote:

> I have yet to see a queuing system that works that same way (not that
> I've seen many).

Plenty of queueing systems give you a default walltime on jobs that you
submit. I don't see that it's Swift's business to be interfering with that
default.

--

From hategan at mcs.anl.gov Thu Feb 12 09:44:18 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 12 Feb 2009 09:44:18 -0600
Subject: [Swift-devel] walltime compulsion
In-Reply-To:
References: <1234452806.3694.1.camel@localhost>
Message-ID: <1234453458.4032.0.camel@localhost>

On Thu, 2009-02-12 at 15:36 +0000, Ben Clifford wrote:
> On Thu, 12 Feb 2009, Mihael Hategan wrote:
>
> > I have yet to see a queuing system that works that same way (not that
> > I've seen many).
>
> Plenty of queueing systems give you a default walltime on jobs that you
> submit. I don't see that its Swift's business to be interfering with that
> default.
>

I suppose there's no clear thing here. Anybody else?

From rynge at renci.org Thu Feb 12 09:55:42 2009
From: rynge at renci.org (Mats Rynge)
Date: Thu, 12 Feb 2009 10:55:42 -0500
Subject: [Swift-devel] walltime compulsion
In-Reply-To: <1234453458.4032.0.camel@localhost>
References: <1234452806.3694.1.camel@localhost> <1234453458.4032.0.camel@localhost>
Message-ID: <4994467E.1020209@renci.org>

Mihael Hategan wrote:
> On Thu, 2009-02-12 at 15:36 +0000, Ben Clifford wrote:
>> On Thu, 12 Feb 2009, Mihael Hategan wrote:
>>
>>> I have yet to see a queuing system that works that same way (not that
>>> I've seen many).
>> Plenty of queueing systems give you a default walltime on jobs that you
>> submit. I don't see that its Swift's business to be interfering with that
>> default.
>>
>
> I suppose there's no clear thing here. Anybody else?

Ignoring the queuing system for a moment, it is still a good idea to
know what the expected runtime is. Ben and I had some of this
conversation when we tried Swift on OSG, and we had a couple of
instances where job and/or file transfer status changes were "lost",
and Swift got stuck. I strongly believe that you need to have internal
timeouts for pretty much all your states in your state machine, and
that the timeouts for the job states should be based on the "walltime".

We are using state timeouts for a lot of our OSG jobs based on Condor
and OSGMM. This ensures that it is the workflow engine, and not the
user, that picks up weird states and handles them accordingly (resubmit
to another site for example).
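
A rough sketch of the state-timeout idea described above, under loudly stated assumptions: the state names and the multipliers in `STATE_FACTORS` are made up for illustration, and this is not how Swift, Condor or OSGMM implement it. Each job state gets a limit derived from the job's walltime, and a job that sits in a state past that limit is flagged.

```python
import time
from typing import Optional

# Hypothetical per-state limits, expressed as multiples of the job walltime.
STATE_FACTORS = {"submitted": 2.0, "active": 1.5, "staging_out": 0.5}


def is_stuck(state: str, entered_at: float, walltime_s: float,
             now: Optional[float] = None) -> bool:
    """Return True if the job has sat in `state` longer than its limit."""
    now = time.time() if now is None else now
    limit = STATE_FACTORS[state] * walltime_s
    return (now - entered_at) > limit
```

A job flagged this way is what the workflow engine would then act on, for example by resubmitting it to another site.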

--
Mats Rynge
Renaissance Computing Institute

From benc at hawaga.org.uk Thu Feb 12 10:09:06 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 12 Feb 2009 16:09:06 +0000 (GMT)
Subject: [Swift-devel] walltime compulsion
In-Reply-To: <4994467E.1020209@renci.org>
References: <1234452806.3694.1.camel@localhost> <1234453458.4032.0.camel@localhost> <4994467E.1020209@renci.org>
Message-ID:

On Thu, 12 Feb 2009, Mats Rynge wrote:

> Ignoring the queuing system for a moment, it is still a good idea to
> know what the expected runtime is.

Right. And better handling basically as you describe when the expected
maximum runtime (expressed through maxwalltime) is known is what was
implemented.

The issue I brought up is in cases where the expected maximum runtime is
not specified by a user (in tc.data).

Previously (for example, in Swift 0.8), tc.data entries that had no
maxwalltime specification carried on having no maxwalltime all the way
through the job submission process.

Some functionality (in 0.8, I think only clustering) cannot be used with
tc.data entries that have no maxwalltime: entries that have no maxwalltime
will never be considered for being clustered. Instead, they will be
submitted as if clustering was not enabled.

What exists in SVN HEAD now is that if you do not specify a tc.data
maxwalltime, then that tc.data entry is given a 10 minute maxwalltime
entry by default (for everywhere - clustering, submit-side walltime
violation, remote queue entry)

This is a change in maxwalltime handling that I dislike.

--

From wilde at mcs.anl.gov Thu Feb 12 10:21:26 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 12 Feb 2009 10:21:26 -0600
Subject: [Swift-devel] data staging process/ documents?
In-Reply-To: <50b07b4b0902111707q7aad742fh89c0dad88f744ecf@mail.gmail.com>
References: <50b07b4b0902111707q7aad742fh89c0dad88f744ecf@mail.gmail.com>
Message-ID: <49944C86.9060206@mcs.anl.gov>

On 2/11/09 7:07 PM, Allan Espinosa wrote:
> Hi,
>
> I am attempting to actualize how collective operations on workflows
> (loosely-coupled) work in general. My initial idea is that this goes
> in the staging of data before executing a task in a workflow.
>
> Do we have documents describing these?

I think the email below from Ben is relevant, and refers to a prior post
on swift-devel. I'm not sure if that text has made it to the userguide yet.

- Mike

> I have a small idea on how it
> works by monitoring my swift jobs as a workflow executes.
>
> My initial ideas are posted in
> http://www.ci.uchicago.edu/wiki/bin/view/VDS/DslCS/CollectiveIO
>
> -Allan

-------- Original Message --------
Subject: [Swift-devel] notes on how swift implements file input and output
Date: Mon, 1 Dec 2008 22:00:16 +0000 (GMT)
From: Ben Clifford
To: swift-devel at ci.uchicago.edu
References:

read this in conjunction with previous note, "Subject: User perspective on
how an app procedure call maps into an application executable call"


This note details the implementation of Swift file input and output in
application blocks; it is intended to be read in conjunction with a
previous note 'How an app procedure call maps into an application call,
from a Swift user perspective, attempting to avoid the mechanics inside
Swift.'


Swift executes application procedures on one or more //sites//.

Each site consists of:

* worker nodes. There is some //execution mechanism// through which the
Swift client side executable can execute its //wrapper script// on those
worker nodes. This is commonly GRAM or Falkon or coasters.

* a site-shared file system. This site shared filesystem is accessible
through some //file transfer mechanism// from the Swift client side
executable. This is commonly GridFTP or coasters. This site shared
filesystem is also accessible through the posix file system on all worker
nodes, mounted at the same location as seen through the file transfer
mechanism. Swift is configured with the location of some //site working
directory// on that site-shared file system.

There is no assumption that the site shared file system for one site is
accessible from another site.

For each workflow run, on each site that is used by that run, a //run
directory// is created in the site working directory, by the Swift client
side.

In that run directory are placed several subdirectories:

* shared/ - site shared files cache

* kickstart/ - when kickstart is used, kickstart record files
for each job that has generated a kickstart

* info/ - wrapper script log files

* status/ - job status files

* jobs/ //application workspace directories// (optionally placed here -
see below)

Application execution looks like this:

For each application procedure call:

The Swift client side selects a site; copies the input files for that
procedure call to the site shared file cache if they are not already in
the cache, using the file transfer mechanism; and then invokes the wrapper
script on that site using the execution mechanism.

The wrapper script creates the application workspace directory; places the
input files for that job into the application workspace directory using
either cp or ln -s (depending on a configuration option); executes the
application unix executable; copies output files from the application
workspace directory to the site shared directory using cp; creates a
status file under the status/ directory; and exits, returning control to
the Swift client side. Logs created during the execution of the wrapper
script are stored under the info/ directory.

The Swift client side then checks for the presence of and deletes a status
file indicating success; copies files from the site shared directory to
the appropriate client side location.

The job directory is created (in the default mode) under the jobs/
directory. However, it can be created under an arbitrary other path, which
allows it to be created on a different file system (such as a worker node
local file system in the case that the worker node has a local file
system).

--

From rynge at renci.org Thu Feb 12 10:24:50 2009
From: rynge at renci.org (Mats Rynge)
Date: Thu, 12 Feb 2009 11:24:50 -0500
Subject: [Swift-devel] walltime compulsion
In-Reply-To:
References: <1234452806.3694.1.camel@localhost> <1234453458.4032.0.camel@localhost> <4994467E.1020209@renci.org>
Message-ID: <49944D52.6060804@renci.org>

Ben Clifford wrote:
> What exists in SVN HEAD now is that if you do not specify a tc.data
> maxwalltime, then that tc.data entry is given a 10 minute maxwalltime
> entry by default (for everywhere - clustering, submit-side walltime
> violation, remote queue entry)
>
> This is a change in maxwalltime handling that I dislike.

So, why not just make the maxwalltime in tc.data a required field?

--
Mats Rynge
Renaissance Computing Institute

From wilde at mcs.anl.gov Thu Feb 12 10:25:15 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 12 Feb 2009 10:25:15 -0600
Subject: [Swift-devel] data staging process/ documents?
In-Reply-To: <49944C86.9060206@mcs.anl.gov>
References: <50b07b4b0902111707q7aad742fh89c0dad88f744ecf@mail.gmail.com> <49944C86.9060206@mcs.anl.gov>
Message-ID: <49944D6B.9090200@mcs.anl.gov>

On 2/12/09 10:21 AM, Michael Wilde wrote:
>
>
> On 2/11/09 7:07 PM, Allan Espinosa wrote:
>> Hi,
>>
>> I am attempting to actualize how collective operations on workflows
>> (loosely-coupled) work in general. My initial idea is that this goes
>> in the staging of data before executing a task in a workflow.
>>
>> Do we have documents describing these?
>
> I think the email below from Ben is relevant, and refers to a prior
> post on swift-devel. I'm not sure if that text has made it to the
> userguide yet.

Prior post was:
http://mail.ci.uchicago.edu/pipermail/swift-devel/2008-December/004070.html

I think that info has been added to the userguide.

- Mike

From benc at hawaga.org.uk Thu Feb 12 10:28:16 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 12 Feb 2009 16:28:16 +0000 (GMT)
Subject: [Swift-devel] data staging process/ documents?
In-Reply-To: <49944D6B.9090200@mcs.anl.gov>
References: <50b07b4b0902111707q7aad742fh89c0dad88f744ecf@mail.gmail.com> <49944C86.9060206@mcs.anl.gov> <49944D6B.9090200@mcs.anl.gov>
Message-ID:

On Thu, 12 Feb 2009, Michael Wilde wrote:

> I think that info has been added to the userguide.

It has.
Those two emails ended up being this:

http://www.ci.uchicago.edu/swift/guides/userguide/appmodel.php

--

From hategan at mcs.anl.gov Thu Feb 12 10:31:46 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 12 Feb 2009 10:31:46 -0600
Subject: [Swift-devel] walltime compulsion
In-Reply-To: <49944D52.6060804@renci.org>
References: <1234452806.3694.1.camel@localhost> <1234453458.4032.0.camel@localhost> <4994467E.1020209@renci.org> <49944D52.6060804@renci.org>
Message-ID: <1234456306.4850.3.camel@localhost>

On Thu, 2009-02-12 at 11:24 -0500, Mats Rynge wrote:
> Ben Clifford wrote:
> > What exists in SVN HEAD now is that if you do not specify a tc.data
> > maxwalltime, then that tc.data entry is given a 10 minute maxwalltime
> > entry by default (for everywhere - clustering, submit-side walltime
> > violation, remote queue entry)
> >
> > This is a change in maxwalltime handling that I dislike.
>
> So, why not just make the maxwalltime in tc.data a required field?
>

What I put in there is the middle ground between no walltime and
mandatory walltime.

For many small things 10 minutes is fine. But I also wouldn't want the
user to be surprised that their 30 minute job for which they didn't put
a walltime in never completes. So there's a once-per-job warning which
is supposed to persuade the user to specify a walltime.

From benc at hawaga.org.uk Thu Feb 12 10:34:08 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 12 Feb 2009 16:34:08 +0000 (GMT)
Subject: [Swift-devel] walltime compulsion
In-Reply-To: <49944D52.6060804@renci.org>
References: <1234452806.3694.1.camel@localhost> <1234453458.4032.0.camel@localhost> <4994467E.1020209@renci.org> <49944D52.6060804@renci.org>
Message-ID:

On Thu, 12 Feb 2009, Mats Rynge wrote:

> So, why not just make the maxwalltime in tc.data a required field?

I'd rather have it compulsory than an arbitrary default value; but I'd
rather have neither.

I see no reason to compel a walltime if you don't want walltime-based
handling.

In the situations that you are running in, it seems fairly vital, because
you want to use walltime based features.

However, Swift also gets used in situations where it isn't necessary - for
example, when running on single-site clusters where stuff tends to either
work or not work (rather than having distributed-system style partial
failures), and hung jobs don't cause a problem (which is why it's taken
this many years for the submit side walltime stuff to get implemented).

Nothing has changed to suddenly make it necessary to compel that user
community to think about walltimes, and in those cases, it adds
unnecessary configuration load onto users; and that is something that I
feel fairly strongly about in Swift.

--

From wilde at mcs.anl.gov Thu Feb 12 10:35:57 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 12 Feb 2009 10:35:57 -0600
Subject: [Swift-devel] walltime compulsion
In-Reply-To: <49944D52.6060804@renci.org>
References: <1234452806.3694.1.camel@localhost> <1234453458.4032.0.camel@localhost> <4994467E.1020209@renci.org> <49944D52.6060804@renci.org>
Message-ID: <49944FED.7080206@mcs.anl.gov>

Would that address Ben's concern? Is the dislike the 10 minute default,
or the fact that the same value is used for all 3 calculations mentioned?

My slight preference is to keep required fields to a minimum, to make the
time somewhat higher (to reduce surprise job terminations at the expense
of surprise at winding up in slow queue).

If it's easy, is a global property for the default time reasonable? So a
user could tweak one value for all their wall times?

Also: I didn't raise this because there was so much churn last week on the
coaster code, but when testing Ben's coaster-service-on-headnode patch, I
was unable to find walltime settings that would get me into the fast
queue on teraport. I did not have time to track down what was happening,
in terms of what I requested where vs what was sent to gram.

Just a heads-up that the time calculation code could use some testing.

- Mike

On 2/12/09 10:24 AM, Mats Rynge wrote:
> Ben Clifford wrote:
>> What exists in SVN HEAD now is that if you do not specify a tc.data
>> maxwalltime, then that tc.data entry is given a 10 minute maxwalltime
>> entry by default (for everywhere - clustering, submit-side walltime
>> violation, remote queue entry)
>>
>> This is a change in maxwalltime handling that I dislike.
>
> So, why not just make the maxwalltime in tc.data a required field?
>

From benc at hawaga.org.uk Thu Feb 12 10:36:50 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 12 Feb 2009 16:36:50 +0000 (GMT)
Subject: [Swift-devel] walltime compulsion
In-Reply-To: <1234456306.4850.3.camel@localhost>
References: <1234452806.3694.1.camel@localhost> <1234453458.4032.0.camel@localhost> <4994467E.1020209@renci.org> <49944D52.6060804@renci.org> <1234456306.4850.3.camel@localhost>
Message-ID:

On Thu, 12 Feb 2009, Mihael Hategan wrote:

> What I put in there is the middle ground between no walltime and
> mandatory walltime.

I think the middle ground is less desirable than either extreme.

> For many small things 10 minutes is fine. But I also wouldn't want the
> user to be surprised that their 30 minute job for which they didn't put
> a walltime in never completes. So there's a once-per-job warning which
> is supposed to persuade the user to specify a walltime.

Having a warning/recommendation is fine.

--

From wilde at mcs.anl.gov Thu Feb 12 10:38:50 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 12 Feb 2009 10:38:50 -0600
Subject: [Swift-devel] walltime compulsion
In-Reply-To:
References: <1234452806.3694.1.camel@localhost> <1234453458.4032.0.camel@localhost> <4994467E.1020209@renci.org> <49944D52.6060804@renci.org>
Message-ID: <4994509A.8080401@mcs.anl.gov>

This argument is more clear and makes sense to me. I agree with it, if it
causes no complications.

On 2/12/09 10:34 AM, Ben Clifford wrote: > On Thu, 12 Feb 2009, Mats Rynge wrote: > >> So, why not just make the maxwalltime in tc.data a required field? > > I'd rather have it compulsory than an arbitrary default value; but I'd > rather have neither. > > I see no reason to compel a walltime if you don't want walltime-based > handling. > > In the situations that you are running in, it seems fairly vital, because > you want to use walltime based features. > > However, Swift also gets used in situations where it isn't necessary - for > example, when running on single-site clusters where stuff tends to either > work or not work (rather than having distributed-system style partial > failures), and hung jobs don't cause a problem (which is why its taken > this many years for the submit side walltime stuff to get implemented). > > Nothing has changed to suddenly make it necessary to compel that user > community to think about walltimes, and in those cases, it adds > unnecessary configuration load onto users; and that is something that I > feel fairly strongly about in Swift. > From benc at hawaga.org.uk Thu Feb 12 10:41:35 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 12 Feb 2009 16:41:35 +0000 (GMT) Subject: [Swift-devel] walltime compulsion In-Reply-To: <49944FED.7080206@mcs.anl.gov> References: <1234452806.3694.1.camel@localhost> <1234453458.4032.0.camel@localhost> <4994467E.1020209@renci.org> <49944D52.6060804@renci.org> <49944FED.7080206@mcs.anl.gov> Message-ID: On Thu, 12 Feb 2009, Michael Wilde wrote: > My slight preference is to keep required fields to a minimum, to make the time > somewhat higher (to to reduce surprise job terminations at the expense of > surprise at winding up in slow queue). In (most?) PBS deployments, you don't end up in a slow queue - you get your job rejected. Queues are selected by specifying a queue name, separately from the walltime. 
This requires low-end users to have a better understanding of the submit stack than even someone like you who has worked on grid stuff for many years has; I think that's a nice argument for not compelling this.

> If its easy, is a global property for the default time reasonable? So a
> user could tweak one value for all their wall times?

That could be implemented but feels a bit messy to me - if you know what the time to specify is for your apps, and can take a max() of it to set the default, then you also have enough understanding and information to specify it in tc.data.

--

From hategan at mcs.anl.gov Thu Feb 12 12:48:03 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 12 Feb 2009 12:48:03 -0600 (CST)
Subject: [Swift-devel] walltime compulsion
Message-ID: <12353309.2198941234464483254.JavaMail.root@zimbra>

r2531 fixes the issue so that the walltime is only used if specified. We will encourage our users to specify a walltime through other means.

From zhaozhang at uchicago.edu Thu Feb 12 13:23:31 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Thu, 12 Feb 2009 13:23:31 -0600
Subject: [Swift-devel] Swift data distribution strategy
Message-ID: <49947733.7000300@uchicago.edu>

Hi,

I got a problem when I do a 512-compute node scale test.
There are 15351 input files for this computation, and there are 8 sites.
The question is: when I start swift, are all 15351 input files copied to each of the 8 sites?

By my test, it is yes. Does swift have an option so that only on demand input files are copied?

best wishes
zhangzhao

From aespinosa at cs.uchicago.edu Thu Feb 12 13:29:37 2009
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Thu, 12 Feb 2009 13:29:37 -0600
Subject: [Swift-devel] Swift data distribution strategy
In-Reply-To: <49947733.7000300@uchicago.edu>
References: <49947733.7000300@uchicago.edu>
Message-ID: <50b07b4b0902121129m4c25484fm5dcb88f47d1f9558@mail.gmail.com>

Hi Zhao,

the input files are copied on demand as jobs get dispatched.
(i think) -Allan On Thu, Feb 12, 2009 at 1:23 PM, Zhao Zhang wrote: > Hi, > > I got a problem when I do a 512-compute node scale test. > There are 15351 input files for this computation, and there are 8 sites. > The question is: when I start swift, are all 15351 input files be copied to > each of the 8 sites? > > By by test, it is yes. Does swift has an option so that only on demand input > files are copied? > > best wishes > zhangzhao From hategan at mcs.anl.gov Thu Feb 12 13:43:48 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 12 Feb 2009 13:43:48 -0600 (CST) Subject: [Swift-devel] Swift data distribution strategy In-Reply-To: <49947733.7000300@uchicago.edu> Message-ID: <24164550.2203051234467828083.JavaMail.root@zimbra> ----- Zhao Zhang wrote: > Hi, > > I got a problem when I do a 512-compute node scale test. > There are 15351 input files for this computation, and there are 8 sites. > The question is: when I start swift, are all 15351 input files be copied > to each of the 8 sites? > > By by test, it is yes. It's more like "no" actually. Swift first selects a site for each job that can be run. After that, it stages each job's files to the site that was selected for that job. > Does swift has an option so that only on demand > input files are copied? > > best wishes > zhangzhao > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From zhaozhang at uchicago.edu Thu Feb 12 13:46:15 2009 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Thu, 12 Feb 2009 13:46:15 -0600 Subject: [Swift-devel] Swift data distribution strategy In-Reply-To: <24164550.2203051234467828083.JavaMail.root@zimbra> References: <24164550.2203051234467828083.JavaMail.root@zimbra> Message-ID: <49947C87.1050107@uchicago.edu> ok, got it. zhao Mihael Hategan wrote: > ----- Zhao Zhang wrote: > >> Hi, >> >> I got a problem when I do a 512-compute node scale test. 
>> There are 15351 input files for this computation, and there are 8 sites. >> The question is: when I start swift, are all 15351 input files be copied >> to each of the 8 sites? >> >> By by test, it is yes. >> > > It's more like "no" actually. Swift first selects a site for each job > that can be run. After that, it stages each job's files to the site > that was selected for that job. > > >> Does swift has an option so that only on demand >> input files are copied? >> >> best wishes >> zhangzhao >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> > > > From hategan at mcs.anl.gov Thu Feb 12 14:06:32 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 12 Feb 2009 14:06:32 -0600 (CST) Subject: [Swift-devel] coaster one-liner bootstrap script Message-ID: <20791025.2204621234469192280.JavaMail.root@zimbra> On Thu, 2009-02-12 at 01:05 -0600, Michael Wilde wrote: > I got: coaster-bootstrap.list not found in classpath Should be fixed in swift r2300. From zhangzhao0718 at gmail.com Thu Feb 12 13:14:55 2009 From: zhangzhao0718 at gmail.com (Zhao Zhang) Date: Thu, 12 Feb 2009 13:14:55 -0600 Subject: [Swift-devel] Swift data distribution strategy Message-ID: <4994752F.5010705@gmail.com> Hi, I got a problem when I do a 512-compute node scale test. There are 15351 input files for this computation, and there are 8 sites. The question is: when I start swift, are all 15351 input files be copied to each of the 8 sites? By by test, it is yes. Does swift has an option so that only on demand input files are copied? 
best wishes
zhangzhao

From wilde at mcs.anl.gov Thu Feb 12 18:36:13 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 12 Feb 2009 18:36:13 -0600
Subject: [Swift-devel] coaster one-liner bootstrap script
In-Reply-To: <20791025.2204621234469192280.JavaMail.root@zimbra>
References: <20791025.2204621234469192280.JavaMail.root@zimbra>
Message-ID: <4994C07D.8070809@mcs.anl.gov>

I updated to 2300. Now I get the error below (java.lang.RuntimeException: Failed to register service)

I'm also a bit confused why I see "which: no gmd5sum in (/soft/java-1.5.0_06-sun-r1/bin: etc etc" on stdout - that should be going to /dev/null, but it's reproducible in a normal interactive shell. Something subtle in eval?

gram log is in ~osg/gram_job_mgr_17585.log
swift log is in ~wilde/oops7-20090212-1510-kk6i43og.log
(on ci network)

- Mike

On 2/12/09 2:06 PM, Mihael Hategan wrote:
> On Thu, 2009-02-12 at 01:05 -0600, Michael Wilde wrote:
>> I got: coaster-bootstrap.list not found in classpath
>
> Should be fixed in swift r2300.
> _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel com$ ls -l ~osg/coaster-bootstrap-167563000.log -rw-r--r-- 1 osg osgvo 2744 Feb 12 15:11 /home/osgvo/osg/coaster-bootstrap-167563000.log com$ cat ~osg/coaster-bootstrap-167563000.log BS: http://communicado.ci.uchicago.edu:50001 Expected checksum: c6dbde30e69462446c06e15a46fba6eb Computed checksum: c6dbde30e69462446c06e15a46fba6eb JAVA=/soft/java-1.5.0_06-sun-r1/bin/java /soft/java-1.5.0_06-sun-r1/bin/java -Djava=/soft/java-1.5.0_06-sun-r1/bin/java -DGLOBUS_TCP_PORT_RANGE= -DX509_USER_PROXY=/home/osgvo/osg/.globus/job/tp-grid1.ci.uchicago.edu/16700.1234473069/x509_up -DX509_CERT_DIR= -DGLOBUS_HOSTNAME=tp-grid1.ci.uchicago.edu -jar /tmp/bootstrap.N16834 http://communicado.ci.uchicago.edu:50001 b3d581fddd49e3d1166f52f6077ddcc5 https://128.135.125.17:50000 167563000 java.lang.RuntimeException: Failed to register service at org.globus.cog.abstraction.coaster.service.CoasterService.start(CoasterService.java:111) at org.globus.cog.abstraction.coaster.service.CoasterService.main(CoasterService.java:226) Caused by: org.globus.cog.karajan.workflow.service.channels.ChannelException: Failed to start channel GSSCChannel-https://b3d581fddd49e3d1166f52f6077ddcc5:1984(1) at org.globus.cog.karajan.workflow.service.channels.GSSChannel.reconnect(GSSChannel.java:104) at org.globus.cog.karajan.workflow.service.channels.GSSChannel.start(GSSChannel.java:63) at org.globus.cog.karajan.workflow.service.ChannelFactory.newChannel(ChannelFactory.java:43) at org.globus.cog.karajan.workflow.service.Client.connect(Client.java:115) at org.globus.cog.karajan.workflow.service.Client.newClient(Client.java:72) at org.globus.cog.karajan.workflow.service.channels.ChannelManager.connect(ChannelManager.java:211) at org.globus.cog.karajan.workflow.service.channels.ChannelManager.reserveChannel(ChannelManager.java:230) at 
org.globus.cog.karajan.workflow.service.channels.ChannelManager.reserveChannel(ChannelManager.java:186) at org.globus.cog.abstraction.coaster.service.CoasterService.start(CoasterService.java:100) ... 1 more Caused by: java.net.UnknownHostException: b3d581fddd49e3d1166f52f6077ddcc5: b3d581fddd49e3d1166f52f6077ddcc5 at java.net.InetAddress.getAllByName0(InetAddress.java:1128) at java.net.InetAddress.getAllByName0(InetAddress.java:1098) at java.net.InetAddress.getAllByName(InetAddress.java:1061) at java.net.InetAddress.getByName(InetAddress.java:958) at org.globus.net.SocketFactory.createSocket(SocketFactory.java:53) at org.globus.gsi.gssapi.net.GssSocket.(GssSocket.java:56) at org.globus.gsi.gssapi.net.impl.GSIGssSocket.(GSIGssSocket.java:29) at org.globus.gsi.gssapi.net.impl.GSIGssSocketFactory.createSocket(GSIGssSocketFactory.java:38) at org.globus.cog.karajan.workflow.service.channels.GSSChannel.reconnect(GSSChannel.java:90) ... 9 more EC: 1 BS: http://communicado.ci.uchicago.edu:50001 Failed to download bootstrap jar from http://communicado.ci.uchicago.edu:50001 com$ ---- and on stdout/stderr: com$ cat swift.out /home/wilde/swift/tools/swiftrun: Swift script oops7.swift starting at Thu Feb 12 15:10:57 CST 2009 running on sites: teraport.coaster.gt2.osg Swift svn swift-r2532 cog-r2300 RunID: 20090212-1510-kk6i43og Progress: Progress: Stage in:1 Initializing site shared directory:1 Progress: Stage in:1 Submitting:1 Progress: Submitting:1 Submitted:1 Failed to transfer wrapper log from oops7-20090212-1510-kk6i43og/info/j on teraport Execution failed: Exception in runoops: Arguments: [input/fasta/T1af7.fasta, input/secseq/T1af7.secseq, input/native/T1af7.pdb, output/T1af7.0.pdt, output/T1af7.0.rmsd, 0, TEMP UPDATE INTERVAL = 10, SMOOTH DEVIATION COEFFICIENT = 0.80001] Host: teraport Directory: oops7-20090212-1510-kk6i43og/jobs/j/runoops-j1j9zi6j stderr.txt: stdout.txt: ---- Caused by: Could not submit job Caused by: Could not start coaster service Caused by: Task 
ended before registration was received. STDOUT: which: no gmd5sum in (/soft/java-1.5.0_06-sun-r1/bin:/soft/java-1.5.0_06-sun-r1/jre/bin:/usr/kerberos/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-r1/bin:/home/osgvo/osg/bin/linux-rhel4-x86_64:/home/osgvo/osg/bin:/soft/xcat-1.2.0-r1/bin:/soft/xcat-1.2.0-r1/sbin:/soft/xcat-1.2.0-r1/x86_64/bin:/soft/xcat-1.2.0-r1/x86_64/sbin:/soft/xcat-1.2.0-r1/contrib/bin:/soft/xcat-1.2.0-r1/contrib/sbin:/soft/xcat-1.2.0-r1/contrib/x86_64/bin:/soft/xcat-1.2.0-r1/contrib/x86_64/sbin) STDERR: null Cleaning up... Done /home/wilde/swift/tools/swiftrun: Swift Script oops7.swift ended at Thu Feb 12 15:11:24 CST 2009 with exit code 0 com$ From hategan at mcs.anl.gov Thu Feb 12 19:28:23 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 12 Feb 2009 19:28:23 -0600 (CST) Subject: [Swift-devel] coaster one-liner bootstrap script In-Reply-To: <4994C07D.8070809@mcs.anl.gov> Message-ID: <11175850.2224431234488503374.JavaMail.root@zimbra> ----- Michael Wilde wrote: > I updated to 2300. Now I get the error below > (java.lang.RuntimeException: Failed to register service) Excellent! It works. The exception is due to me messing with the parameters, but Java starts properly. From wilde at mcs.anl.gov Fri Feb 13 08:59:55 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 13 Feb 2009 08:59:55 -0600 Subject: [Swift-devel] Status of coasters Message-ID: <49958AEB.3090002@mcs.anl.gov> Here's my understanding of status, issues and needs on coasters. Some side discussion with Mihael on various coaster issues is summarized here as well; clarifications welcome. Work in progress: - Mihael has a good handle on the bootstrap issues and is working on improvements. This is not working in trunk at the moment, will likely be fixed soon. We think this will fix known issues in: command line lenth for condor, spaces, quotes, newlines and other offending argument issues; location of Java and tools (wget/curl and mdsum). 
- still to do on above: sites.xml attribute to explicitly specify location of tools, or at least of Java.

- Ben has a patch to integrate to run the coaster service on a worker node. Question: this is only usable when workers have sufficient IP access, correct?

- The scalability problem submitting to GT2 GRAM sites still exists. Potential solutions are:

-- Service submits workers via PBS (using jobmanager=gt2:pbs). Valid only on PBS sites. Not yet tested.

-- Service submits workers via Condor-G (using jobmanager=gt2:condor). (Mihael feels this requires a new Condor provider, the one in the current code base being insufficient and untested - really more of a prototype developed by a student.)

-- Service submits via WS-GRAM. This should be tested, on sites where WS-GRAM is working.
This would use jobmanager=gt2:gt4:{pbs/condor/sge}, and needs to be tested.
For sites where WS-GRAM is not functional, I suggested we consider configuring our own non-root WS-GRAM, ideally using already-installed GT4 software, e.g., from the OSG package on OSG and TG sites where it's installed. Mihael thought this would be considerable work. I agree but it might be a stable solution with fewer unknowns and support from the GRAM group. We can bring in the latest GT4 as needed if that provides a better solution than some older installed GT4 which we have no control over and which won't change till upcoming releases of say OSG or TG packages.

Doing the above should then enable large-scale testing of user workflows across many OSG and TG sites, without need to throttle back the *number* of jobs waiting or running.

Lastly: it seems that a Condor-G provider might be a powerful capability (as one configuration option) to be able to submit all swift jobs via Condor-G (e.g., for non-coaster runs as well). Please comment on the value of such a capability.
- Mike From hategan at mcs.anl.gov Fri Feb 13 09:17:39 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 13 Feb 2009 09:17:39 -0600 Subject: [Swift-devel] Status of coasters In-Reply-To: <49958AEB.3090002@mcs.anl.gov> References: <49958AEB.3090002@mcs.anl.gov> Message-ID: <1234538259.25737.1.camel@localhost> On Fri, 2009-02-13 at 08:59 -0600, Michael Wilde wrote: > -- Service submits via WS-GRAM. This should be tested, on sites where > WS-GRAM is working. > This woild use jobmanager=gt2:gt4:{pbs/condor/sge}, and needs to be tested. > For sites where WS-GRAM is not functional, I suggested we consider > configuring our own non-root WS-GRAM, ideally using already-installed > GT4 software, eg, from the OSG package on OSG and TG sites where its > installed. Mihael thought this would be considerable work. Not as much the amount of work, but: 1. getting root on sites (if installed for multiple users) OR 2. telling our users that they need to install a GT4 server in order to submit many jobs at once using swift. From foster at anl.gov Fri Feb 13 09:17:38 2009 From: foster at anl.gov (Ian Foster) Date: Fri, 13 Feb 2009 09:17:38 -0600 Subject: [Swift-devel] Status of coasters In-Reply-To: <49958AEB.3090002@mcs.anl.gov> References: <49958AEB.3090002@mcs.anl.gov> Message-ID: Mike: What is the scalability problem WRT GT2 GRAM sites? Ian. On Feb 13, 2009, at 8:59 AM, Michael Wilde wrote: > Here's my understanding of status, issues and needs on coasters. > > Some side discussion with Mihael on various coaster issues is > summarized here as well; clarifications welcome. > > Work in progress: > > - Mihael has a good handle on the bootstrap issues and is working on > improvements. This is not working in trunk at the moment, will > likely be fixed soon. We think this will fix known issues in: > command line lenth for condor, spaces, quotes, newlines and other > offending argument issues; location of Java and tools (wget/curl and > mdsum). 
> > - still to do on above: sites.xml attribute to explicitly specify > location of tools, or at least of Java. > > - Ben has a patch to integrate to run the coaster service on a > worker node. Question: this is only usable when workers have > sufficient IP access, correct? > > - The scalability problem submitting to GT2 GRAM sites still exists. > Potential solutions are: > > -- Service submits workers via PBS (using jobmanger=gt2:pbs). Valid > only on PBS sites. Not yet tested. > > -- Service submits workers via Condor-G (using > jobmanager=gt2:condor). Mihael feels this requires a new Condor > provider, the one in the current code base being insufficient and > untested - really more of a prototype developed by a student). > > -- Service submits via WS-GRAM. This should be tested, on sites > where WS-GRAM is working. > This woild use jobmanager=gt2:gt4:{pbs/condor/sge}, and needs to be > tested. > For sites where WS-GRAM is not functional, I suggested we consider > configuring our own non-root WS-GRAM, ideally using already- > installed GT4 software, eg, from the OSG package on OSG and TG sites > where its installed. Mihael thought this would be considerable work. > I agree but it might be a stable solution with fewer unknowns and > suppot from the GRAM group. We can bring in the latest GT4 as needed > if that provides a better solution than some older installed GT4 > which we have no control over and which wont change till upcoming > releases of say OSG or TG packages. > > Doing the above should then enable large-scale testing of user > workflows across many OSG and TG sites, without need to throttle > back the *number* of jobs waiting or running. > > Lastly: it seems that a Condor-G provide might be a powerful > capability (as one configuration option) to be able to submit all > swift jobs via Condor-G (e.g, for non-coaster runs as well). Please > comment on the value of such a capability. 
> > - Mike > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From wilde at mcs.anl.gov Fri Feb 13 09:20:45 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 13 Feb 2009 09:20:45 -0600 Subject: [Swift-devel] Status of coasters In-Reply-To: References: <49958AEB.3090002@mcs.anl.gov> Message-ID: <49958FCD.3060106@mcs.anl.gov> Its the problem of resource consumption by the jobmanager: the longstanding problem that the Condor-G GRID_MONITOR addresses; the problem that requires that we scale back to send fewer thn 20-40 jobs to any OSG site when we use pre-WS-GRAM. On 2/13/09 9:17 AM, Ian Foster wrote: > Mike: > > What is the scalability problem WRT GT2 GRAM sites? > > Ian. > > > On Feb 13, 2009, at 8:59 AM, Michael Wilde wrote: > >> Here's my understanding of status, issues and needs on coasters. >> >> Some side discussion with Mihael on various coaster issues is >> summarized here as well; clarifications welcome. >> >> Work in progress: >> >> - Mihael has a good handle on the bootstrap issues and is working on >> improvements. This is not working in trunk at the moment, will likely >> be fixed soon. We think this will fix known issues in: command line >> lenth for condor, spaces, quotes, newlines and other offending >> argument issues; location of Java and tools (wget/curl and mdsum). >> >> - still to do on above: sites.xml attribute to explicitly specify >> location of tools, or at least of Java. >> >> - Ben has a patch to integrate to run the coaster service on a worker >> node. Question: this is only usable when workers have sufficient IP >> access, correct? >> >> - The scalability problem submitting to GT2 GRAM sites still exists. >> Potential solutions are: >> >> -- Service submits workers via PBS (using jobmanger=gt2:pbs). Valid >> only on PBS sites. Not yet tested. 
>> >> -- Service submits workers via Condor-G (using jobmanager=gt2:condor). >> Mihael feels this requires a new Condor provider, the one in the >> current code base being insufficient and untested - really more of a >> prototype developed by a student). >> >> -- Service submits via WS-GRAM. This should be tested, on sites where >> WS-GRAM is working. >> This woild use jobmanager=gt2:gt4:{pbs/condor/sge}, and needs to be >> tested. >> For sites where WS-GRAM is not functional, I suggested we consider >> configuring our own non-root WS-GRAM, ideally using already-installed >> GT4 software, eg, from the OSG package on OSG and TG sites where its >> installed. Mihael thought this would be considerable work. I agree but >> it might be a stable solution with fewer unknowns and suppot from the >> GRAM group. We can bring in the latest GT4 as needed if that provides >> a better solution than some older installed GT4 which we have no >> control over and which wont change till upcoming releases of say OSG >> or TG packages. >> >> Doing the above should then enable large-scale testing of user >> workflows across many OSG and TG sites, without need to throttle back >> the *number* of jobs waiting or running. >> >> Lastly: it seems that a Condor-G provide might be a powerful >> capability (as one configuration option) to be able to submit all >> swift jobs via Condor-G (e.g, for non-coaster runs as well). Please >> comment on the value of such a capability. 
>> >> - Mike >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From wilde at mcs.anl.gov Fri Feb 13 09:23:21 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 13 Feb 2009 09:23:21 -0600 Subject: [Swift-devel] Status of coasters In-Reply-To: <1234538259.25737.1.camel@localhost> References: <49958AEB.3090002@mcs.anl.gov> <1234538259.25737.1.camel@localhost> Message-ID: <49959069.7020602@mcs.anl.gov> On 2/13/09 9:17 AM, Mihael Hategan wrote: > On Fri, 2009-02-13 at 08:59 -0600, Michael Wilde wrote: >> -- Service submits via WS-GRAM. This should be tested, on sites where >> WS-GRAM is working. >> This woild use jobmanager=gt2:gt4:{pbs/condor/sge}, and needs to be tested. >> For sites where WS-GRAM is not functional, I suggested we consider >> configuring our own non-root WS-GRAM, ideally using already-installed >> GT4 software, eg, from the OSG package on OSG and TG sites where its >> installed. Mihael thought this would be considerable work. > > Not as much the amount of work, but: > 1. getting root on sites (if installed for multiple users) I was thinking/hoping we could have a single setup script, which we'd pre-install where required, that would configure and start a personal WSRF container for the user. Which would be work for us to create, install and maintain, but, if successful, would be transparent to the user. > OR > > 2. telling our users that they need to install a GT4 server in order to > submit many jobs at once using swift. No, that would not be a good route. If that was required, I'll call this alternative undesirable. 
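The GT2 GRAM limit described in this thread (scaling back to fewer than 20-40 jobs per site) amounts to a per-site cap on jobs in flight. A minimal sketch of such a cap follows, for illustration only. The names `SiteThrottle`, `max_jobs`, `try_submit` and `job_done` are hypothetical and are not taken from Swift or CoG, and this is not a description of how Swift's own throttling is implemented.

```python
import threading


class SiteThrottle:
    """Caps the number of jobs in flight (queued or running) per site.

    Hypothetical illustration; not Swift or CoG code.
    """

    def __init__(self, max_jobs):
        self.max_jobs = max_jobs      # assumed per-site limit, e.g. 20
        self._in_flight = {}          # site name -> current job count
        self._lock = threading.Lock()

    def try_submit(self, site):
        """Reserve a slot on `site`; return False if the site is at its cap."""
        with self._lock:
            count = self._in_flight.get(site, 0)
            if count >= self.max_jobs:
                return False
            self._in_flight[site] = count + 1
            return True

    def job_done(self, site):
        """Release the slot held by a job that finished or failed."""
        with self._lock:
            count = self._in_flight.get(site, 0)
            if count > 0:
                self._in_flight[site] = count - 1

    def in_flight(self, site):
        """Return the number of jobs currently counted against `site`."""
        with self._lock:
            return self._in_flight.get(site, 0)
```

A scheduler would call `try_submit(site)` before handing a job to GRAM and `job_done(site)` when the job leaves the queue; jobs refused by the throttle stay queued on the submit side or go to another site.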
From benc at hawaga.org.uk Fri Feb 13 09:27:37 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 13 Feb 2009 15:27:37 +0000 (GMT) Subject: [Swift-devel] Status of coasters In-Reply-To: <49958AEB.3090002@mcs.anl.gov> References: <49958AEB.3090002@mcs.anl.gov> Message-ID: On Fri, 13 Feb 2009, Michael Wilde wrote: > - Ben has a patch to integrate to run the coaster service on a worker node. > Question: this is only usable when workers have sufficient IP access, correct? Yes. I plan on making this presentable and then committing it. As part of that, probably I should document who connects where in coasters with a pretty diagram, to aid in understanding of what 'sufficient' is. > - The scalability problem submitting to GT2 GRAM sites still exists. Potential > solutions are: > > -- Service submits workers via PBS (using jobmanger=gt2:pbs). Valid only on > PBS sites. Not yet tested. > > -- Service submits workers via Condor-G (using jobmanager=gt2:condor). Mihael > feels this requires a new Condor provider, the one in the current code base > being insufficient and untested - really more of a prototype developed by a > student). That would be regular Condor, not Condor-G, I think. The two above could be summarised as "submit service workers through the local LRM using CoG specific providers for that LRM". The PBS provider seems to be getting a reasonable amount of use recently, and I think is also useful in the single-site case where it allows GRAM to be avoided entirely. A decent Condor provider would probably allow something similar for Condor based clusters. > -- Service submits via WS-GRAM. This should be tested, on sites where > WS-GRAM is working. This woild use jobmanager=gt2:gt4:{pbs/condor/sge}, > and needs to be tested. If gram4.0 is working on a site, is there any reason to use gt2 for the head job submission? 
It seems to add a dependency on one more service (depending on both gram2 and gram4.0) rather than substituting one dependency for another (gram2 for gram4.0).

> For sites where WS-GRAM is not functional, I suggested we consider configuring
> our own non-root WS-GRAM, ideally using already-installed GT4 software, eg,
> from the OSG package on OSG and TG sites where its installed. Mihael thought
> this would be considerable work. I agree but it might be a stable solution
> with fewer unknowns and suppot from the GRAM group. We can bring in the latest
> GT4 as needed if that provides a better solution than some older installed GT4
> which we have no control over and which wont change till upcoming releases of
> say OSG or TG packages.

I agree that this is considerable work. I think it is not something we should pursue.

> Lastly: it seems that a Condor-G provide might be a powerful capability (as
> one configuration option) to be able to submit all swift jobs via Condor-G
> (e.g, for non-coaster runs as well). Please comment on the value of such a
> capability.

I've pondered that before.

Using Condor-G appears to be the officially supported mechanism for submitting to OSG in some people's minds; and similarly, using plain GRAM2 is Prohibited in those people's minds.

Using Condor-G would be more in line with some people's views of how jobs should properly be submitted to OSG.

Such functionality could fit in as a CoG execution provider (similar to, or part of, a plain Condor execution provider), and would not perturb the architecture of Swift. Swift runs in such a situation would look a little like DAGman runs, with a management process handling some rate limiting and deciding which jobs to run and where, but then the mechanics of submission being handled by a local Condor.
This approach would necessitate a local Condor installation, but only in situations where this approach was used; so this would not perturb usability too much, and many places where this would be used already have a Condor installation.

So I'm cautiously supportive of this approach.

Specifically given the two different uses for condor interfacing discussed above, I think that it would be useful to investigate making the Condor provider decent.

--

From benc at hawaga.org.uk Fri Feb 13 09:29:53 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 13 Feb 2009 15:29:53 +0000 (GMT)
Subject: [Swift-devel] Status of coasters
In-Reply-To:
References: <49958AEB.3090002@mcs.anl.gov>
Message-ID:

On Fri, 13 Feb 2009, Ian Foster wrote:

> What is the scalability problem WRT GT2 GRAM sites?

loadavg on the submit site = k * the number of GRAM2 jobs running on that site or queued on that site.

where k is in the range 0.1 .. 1 in my informal testing.

--

From hategan at mcs.anl.gov Fri Feb 13 09:32:43 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 13 Feb 2009 09:32:43 -0600
Subject: [Swift-devel] Status of coasters
In-Reply-To: <49959069.7020602@mcs.anl.gov>
References: <49958AEB.3090002@mcs.anl.gov> <1234538259.25737.1.camel@localhost> <49959069.7020602@mcs.anl.gov>
Message-ID: <1234539163.26116.3.camel@localhost>

On Fri, 2009-02-13 at 09:23 -0600, Michael Wilde wrote:
>
> On 2/13/09 9:17 AM, Mihael Hategan wrote:
> > On Fri, 2009-02-13 at 08:59 -0600, Michael Wilde wrote:
> >> -- Service submits via WS-GRAM. This should be tested, on sites where
> >> WS-GRAM is working.
> >> This woild use jobmanager=gt2:gt4:{pbs/condor/sge}, and needs to be tested.
> >> For sites where WS-GRAM is not functional, I suggested we consider
> >> configuring our own non-root WS-GRAM, ideally using already-installed
> >> GT4 software, eg, from the OSG package on OSG and TG sites where its
> >> installed. Mihael thought this would be considerable work.
> > > > Not as much the amount of work, but: > > 1. getting root on sites (if installed for multiple users) > > I was thinking/hoping we could have a single setup script, which we'd > pre-install where required, that would configure and start a personal > WSRF container for the user. Which would be work for us to create, > install and maintain, but, if successful, would be transparent to the user. You need root to configure sudo. I also don't think you can easily automate the gt4 installation process. There's manual configuration that needs to be done. I suggest installing your own gt4 once to which I can submit jobs to get an idea of what's involved. From wilde at mcs.anl.gov Fri Feb 13 09:40:08 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 13 Feb 2009 09:40:08 -0600 Subject: [Swift-devel] Status of coasters In-Reply-To: References: <49958AEB.3090002@mcs.anl.gov> Message-ID: <49959458.3070704@mcs.anl.gov> On 2/13/09 9:27 AM, Ben Clifford wrote: > On Fri, 13 Feb 2009, Michael Wilde wrote: > >> - Ben has a patch to integrate to run the coaster service on a worker node. >> Question: this is only usable when workers have sufficient IP access, correct? > > Yes. I plan on making this presentable and then committing it. As part of > that, probably I should document who connects where in coasters with a > pretty diagram, to aid in understanding of what 'sufficient' is. Very good; I was just thinking of the same diagram, even as design documentation to help us grok the setup and communication paths for coasters. Also: coaster-server-on-workernode has the nice advantage that we dont run any swift software on infrastructure nodes like headnodes: less chance to cause damage; more power for our workflow. Gets round potential problem that managed-fork JM will kill our process for exceeding a walltime limit. Nice philosophy overall. >> - The scalability problem submitting to GT2 GRAM sites still exists. 
Potential >> solutions are: >> >> -- Service submits workers via PBS (using jobmanger=gt2:pbs). Valid only on >> PBS sites. Not yet tested. >> >> -- Service submits workers via Condor-G (using jobmanager=gt2:condor). Mihael >> feels this requires a new Condor provider, the one in the current code base >> being insufficient and untested - really more of a prototype developed by a >> student). > > That would be regular Condor, not Condor-G, I think. Seems could be either: - regular Condor to submit to the local condor pool - Condor-G to submit back through GT2 but with aid of its GRID_MONITOR for scalability, and would be LRM-independent. > > The two above could be summarised as "submit service workers through the > local LRM using CoG specific providers for that LRM". > > The PBS provider seems to be getting a reasonable amount of use recently, > and I think is also useful in the single-site case where it allows GRAM to > be avoided entirely. > > A decent Condor provider would probably allow something similar for Condor > based clusters. > >> -- Service submits via WS-GRAM. This should be tested, on sites where >> WS-GRAM is working. This woild use jobmanager=gt2:gt4:{pbs/condor/sge}, >> and needs to be tested. > > If gram4.0 is working on a site, is there any reason to use gt2 for the > head job submission? No, not at all: we should indeed use WSGRAM in those cases. In fact, we should use it wherever possible - i.e., wherever it provides the best available job exec service. > It seems to add a dependency on one more service > (depending on both gram2 and gram4.0) rather than substituting one > dependency for another (gram2 for gram4.0) > >> For sites where WS-GRAM is not functional, I suggested we consider configuring >> our own non-root WS-GRAM, ideally using already-installed GT4 software, eg, >> from the OSG package on OSG and TG sites where its installed. Mihael thought >> this would be considerable work. 
I agree but it might be a stable solution >> with fewer unknowns and suppot from the GRAM group. We can bring in the latest >> GT4 as needed if that provides a better solution than some older installed GT4 >> which we have no control over and which wont change till upcoming releases of >> say OSG or TG packages. > > I agree that this is considerable work. I think it is not something we > should pursue. > >> Lastly: it seems that a Condor-G provide might be a powerful capability (as >> one configuration option) to be able to submit all swift jobs via Condor-G >> (e.g, for non-coaster runs as well). Please comment on the value of such a >> capability. > > I've pondered that before. > > Using Condor-G appears to be the officially supported mechanism for > submitting to OSG in some peoples minds; and similarly, using plain GRAM2 > is Prohibited in those peoples minds. > > Using Condor-G would be more in line with some peoples views of how jobs > should properly be submitted to OSG. > > Such functionality could fit in as a CoG execution provider (similar to, > or part of a plain Condor execution provider), and would not peturb the > architecture of Swift. Swift runs in such a situation would look a little > like DAGman runs, with a management process handling some rate limiting > and deciding which jobs to run and where, but then the mechanics of > submission being handled by a local Condor. > > This approach would necessitate a local Condor installation, but only in > situations where this approach was used; so this would not peturb > usability too much, and many places where this would be used already have > a Condor installation. > > So I'm cautiously supportive of this approach. Excellent, and I agree with your analysis. I'll draft a priority list for such efforts and then circulate to the group. > > Specifically given the two different uses for condor interfacing discussed > above, I think that it would be useful to investigate making the Condor > provider decent. 
> Agreed. From tfreeman at mcs.anl.gov Fri Feb 13 09:40:32 2009 From: tfreeman at mcs.anl.gov (Tim Freeman) Date: Fri, 13 Feb 2009 09:40:32 -0600 Subject: [Swift-devel] Status of coasters In-Reply-To: <1234539163.26116.3.camel@localhost> References: <49958AEB.3090002@mcs.anl.gov> <1234538259.25737.1.camel@localhost> <49959069.7020602@mcs.anl.gov> <1234539163.26116.3.camel@localhost> Message-ID: <20090213094032.2b041afc@prnb> On Fri, 13 Feb 2009 09:32:43 -0600 Mihael Hategan wrote: > On Fri, 2009-02-13 at 09:23 -0600, Michael Wilde wrote: > > > > On 2/13/09 9:17 AM, Mihael Hategan wrote: > > > On Fri, 2009-02-13 at 08:59 -0600, Michael Wilde wrote: > > >> -- Service submits via WS-GRAM. This should be tested, on sites where > > >> WS-GRAM is working. > > >> This woild use jobmanager=gt2:gt4:{pbs/condor/sge}, and needs to be > > >> tested. For sites where WS-GRAM is not functional, I suggested we > > >> consider configuring our own non-root WS-GRAM, ideally using > > >> already-installed GT4 software, eg, from the OSG package on OSG and TG > > >> sites where its installed. Mihael thought this would be considerable > > >> work. > > > > > > Not as much the amount of work, but: > > > 1. getting root on sites (if installed for multiple users) > > > > I was thinking/hoping we could have a single setup script, which we'd > > pre-install where required, that would configure and start a personal > > WSRF container for the user. Which would be work for us to create, > > install and maintain, but, if successful, would be transparent to the user. > > You need root to configure sudo. Why would you need sudo for gram if it mapped to the same account? > > I also don't think you can easily automate the gt4 installation process. > There's manual configuration that needs to be done. I suggest installing > your own gt4 once to which I can submit jobs to get an idea of what's > involved. 
For non-gram, there's an auto script here: http://workspace.globus.org/vm/TP2.2/admin/quickstart.html#auto-container The questions it asks users (is this the right hostname? etc) could also be scripted. Getting from there to GRAM4 auto-configuration should not be too much of a step (the nimbus contextualization scripts have done it for gram4). Tim From benc at hawaga.org.uk Fri Feb 13 09:44:14 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 13 Feb 2009 15:44:14 +0000 (GMT) Subject: [Swift-devel] Status of coasters In-Reply-To: <49958AEB.3090002@mcs.anl.gov> References: <49958AEB.3090002@mcs.anl.gov> Message-ID: On Fri, 13 Feb 2009, Michael Wilde wrote: > -- Service submits via WS-GRAM. This should be tested, on sites where > WS-GRAM is working. This woild use jobmanager=gt2:gt4:{pbs/condor/sge}, > and needs to be tested. For sites where WS-GRAM is not functional, I > suggested we consider configuring our own non-root WS-GRAM, ideally > using already-installed GT4 software, eg, from the OSG package on OSG > and TG sites where its installed. Mihael thought this would be > considerable work. I agree but it might be a stable solution with fewer > unknowns and suppot from the GRAM group. We can bring in the latest GT4 > as needed if that provides a better solution than some older installed > GT4 which we have no control over and which wont change till upcoming > releases of say OSG or TG packages. We already deploy an execution system on the remote head node. Its called coasters. To deploy another execution service on a site through which our existing execution service on that site can execute things seems perverse. Putting aside "we must use GRAM" dogma, the key benefit of using GRAM would be (I think) to get access to the wider range of LRM adapters than is provided by CoG. If that actually is the benefit we're after by this, then we should consider other ways in which we might more profitably interface to those LRM adapters. 
There might be other reasons to use GRAM locally that I have not thought of, though. -- From wilde at mcs.anl.gov Fri Feb 13 09:52:51 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 13 Feb 2009 09:52:51 -0600 Subject: [Swift-devel] Status of coasters In-Reply-To: <20090213094032.2b041afc@prnb> References: <49958AEB.3090002@mcs.anl.gov> <1234538259.25737.1.camel@localhost> <49959069.7020602@mcs.anl.gov> <1234539163.26116.3.camel@localhost> <20090213094032.2b041afc@prnb> Message-ID: <49959753.9050001@mcs.anl.gov> Tim, we should take your advice under advisement. I can only leave the merits and cost/benefit assessment of this approach to you GT4 experts (Ben, Mihael, Tim, ...). I'm open to the idea, and it sounds like it's not totally off the table, but it has some costs and some unknowns. The diagram Ben suggests can also be expressed as a list of config alternatives, essentially an embellishment of the jobmanager=service-submitter:worker-submitter:worker-submitter-jobmanager string. The first message on this thread started to enumerate coaster config alternatives. I suggest we (or I) clarify those into a clean list. Then we can comment on the cost/benefit tradeoffs of the different alternatives, and denote which ones have been tested, which ones should be tested, and which ones need how much development work. I think there are some obvious "low hanging fruit"-ful ones that float to the top, which we can test, debug, and harden now, and some that require more development, some of which have greater additional benefits (like a good Condor provider). If a user-config for a WS-GRAM container proved easier than expected, possibly with help from you, Tim, or from other GRAM experts, then perhaps it can stay on the table.
- Mike On 2/13/09 9:40 AM, Tim Freeman wrote: > On Fri, 13 Feb 2009 09:32:43 -0600 > Mihael Hategan wrote: > >> On Fri, 2009-02-13 at 09:23 -0600, Michael Wilde wrote: >>> On 2/13/09 9:17 AM, Mihael Hategan wrote: >>>> On Fri, 2009-02-13 at 08:59 -0600, Michael Wilde wrote: >>>>> -- Service submits via WS-GRAM. This should be tested, on sites where >>>>> WS-GRAM is working. >>>>> This woild use jobmanager=gt2:gt4:{pbs/condor/sge}, and needs to be >>>>> tested. For sites where WS-GRAM is not functional, I suggested we >>>>> consider configuring our own non-root WS-GRAM, ideally using >>>>> already-installed GT4 software, eg, from the OSG package on OSG and TG >>>>> sites where its installed. Mihael thought this would be considerable >>>>> work. >>>> Not as much the amount of work, but: >>>> 1. getting root on sites (if installed for multiple users) >>> I was thinking/hoping we could have a single setup script, which we'd >>> pre-install where required, that would configure and start a personal >>> WSRF container for the user. Which would be work for us to create, >>> install and maintain, but, if successful, would be transparent to the user. >> You need root to configure sudo. > > Why would you need sudo for gram if it mapped to the same account? > >> I also don't think you can easily automate the gt4 installation process. >> There's manual configuration that needs to be done. I suggest installing >> your own gt4 once to which I can submit jobs to get an idea of what's >> involved. > > For non-gram, there's an auto script here: > > http://workspace.globus.org/vm/TP2.2/admin/quickstart.html#auto-container > > The questions it asks users (is this the right hostname? etc) could also be > scripted. > > Getting from there to GRAM4 auto-configuration should not be too much of a step > (the nimbus contextualization scripts have done it for gram4). 
> > Tim From smartin at mcs.anl.gov Fri Feb 13 09:58:38 2009 From: smartin at mcs.anl.gov (Stuart Martin) Date: Fri, 13 Feb 2009 09:58:38 -0600 Subject: [Swift-devel] Status of coasters In-Reply-To: <20090213094032.2b041afc@prnb> References: <49958AEB.3090002@mcs.anl.gov> <1234538259.25737.1.camel@localhost> <49959069.7020602@mcs.anl.gov> <1234539163.26116.3.camel@localhost> <20090213094032.2b041afc@prnb> Message-ID: On Feb 13, 2009, at Feb 13, 9:40 AM, Tim Freeman wrote: > On Fri, 13 Feb 2009 09:32:43 -0600 > Mihael Hategan wrote: > >> On Fri, 2009-02-13 at 09:23 -0600, Michael Wilde wrote: >>> >>> On 2/13/09 9:17 AM, Mihael Hategan wrote: >>>> On Fri, 2009-02-13 at 08:59 -0600, Michael Wilde wrote: >>>>> -- Service submits via WS-GRAM. This should be tested, on sites >>>>> where >>>>> WS-GRAM is working. >>>>> This woild use jobmanager=gt2:gt4:{pbs/condor/sge}, and needs to >>>>> be >>>>> tested. For sites where WS-GRAM is not functional, I suggested we >>>>> consider configuring our own non-root WS-GRAM, ideally using >>>>> already-installed GT4 software, eg, from the OSG package on OSG >>>>> and TG >>>>> sites where its installed. Mihael thought this would be >>>>> considerable >>>>> work. >>>> >>>> Not as much the amount of work, but: >>>> 1. getting root on sites (if installed for multiple users) >>> >>> I was thinking/hoping we could have a single setup script, which >>> we'd >>> pre-install where required, that would configure and start a >>> personal >>> WSRF container for the user. Which would be work for us to create, >>> install and maintain, but, if successful, would be transparent to >>> the user. >> >> You need root to configure sudo. > > Why would you need sudo for gram if it mapped to the same account? You wouldn't. sudo is not needed if the DN of the requester is the same as the DN used to start the container. >> >> I also don't think you can easily automate the gt4 installation >> process. >> There's manual configuration that needs to be done. 
I suggest >> installing >> your own gt4 once to which I can submit jobs to get an idea of what's >> involved. > > For non-gram, there's an auto script here: > > http://workspace.globus.org/vm/TP2.2/admin/quickstart.html#auto-container > > The questions it asks users (is this the right hostname? etc) could > also be > scripted. > > Getting from there to GRAM4 auto-configuration should not be too > much of a step > (the nimbus contextualization scripts have done it for gram4). You could do this, and if you avoid staging, it may work fine for GRAM4. But also, we want GRAM2 sites to start using the SEG to remove a significant portion of the GRAM2 scalability problem. I think that would be the best and simplest solution to focus on. Maybe we can start with a site where you need more scalability and the admin would want to work with us on that? > > Tim From tfreeman at mcs.anl.gov Fri Feb 13 09:58:43 2009 From: tfreeman at mcs.anl.gov (Tim Freeman) Date: Fri, 13 Feb 2009 09:58:43 -0600 Subject: [Swift-devel] Status of coasters In-Reply-To: <49959753.9050001@mcs.anl.gov> References: <49958AEB.3090002@mcs.anl.gov> <1234538259.25737.1.camel@localhost> <49959069.7020602@mcs.anl.gov> <1234539163.26116.3.camel@localhost> <20090213094032.2b041afc@prnb> <49959753.9050001@mcs.anl.gov> Message-ID: <20090213095843.67e0496b@prnb> On Fri, 13 Feb 2009 09:52:51 -0600 Michael Wilde wrote: > Tim, we should take your advice under advisement. > > I can only leave the merits and cost/benefit assessment of this approach > to you GT4 experts (Ben, Mihael, Tim, ...) I'm only weighing in on the cost parts.
:-) Tim From rynge at renci.org Fri Feb 13 09:56:31 2009 From: rynge at renci.org (Mats Rynge) Date: Fri, 13 Feb 2009 10:56:31 -0500 Subject: [Swift-devel] Status of coasters In-Reply-To: <49958AEB.3090002@mcs.anl.gov> References: <49958AEB.3090002@mcs.anl.gov> Message-ID: <4995982F.9010605@renci.org> Michael Wilde wrote: > For sites where WS-GRAM is not functional, I suggested we consider > configuring our own non-root WS-GRAM, ideally using already-installed > GT4 software, eg, from the OSG package on OSG and TG sites where its > installed. Mihael thought this would be considerable work. I agree but > it might be a stable solution with fewer unknowns and suppot from the > GRAM group. We can bring in the latest GT4 as needed if that provides a > better solution than some older installed GT4 which we have no control > over and which wont change till upcoming releases of say OSG or TG > packages. Please don't do this on OSG. I'm fairly sure that working around the existing interfaces to a resource would just tick off the resource owners. -- Mats Rynge Renaissance Computing Institute From wilde at mcs.anl.gov Fri Feb 13 10:05:18 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 13 Feb 2009 10:05:18 -0600 Subject: [Swift-devel] Status of coasters In-Reply-To: References: <49958AEB.3090002@mcs.anl.gov> <1234538259.25737.1.camel@localhost> <49959069.7020602@mcs.anl.gov> <1234539163.26116.3.camel@localhost> <20090213094032.2b041afc@prnb> Message-ID: <49959A3E.9010305@mcs.anl.gov> Stu, this would be a good thing to discuss through the UChicago OSG group. I can start that, and am cc'ing Rob Gardner to get it on the list of Globus-OSG things to track. - Mike > But also, we want to have GRAM2 sites to start to use the SEG to remove > a significant portion of the GRAM2 scalability problem. I think that > would be best and simplest solution to focus on. Maybe we can start > with a site where you need more scalability and the admin would want to > work with us on that? 
- Mike On 2/13/09 9:58 AM, Stuart Martin wrote: > On Feb 13, 2009, at Feb 13, 9:40 AM, Tim Freeman wrote: > >> On Fri, 13 Feb 2009 09:32:43 -0600 >> Mihael Hategan wrote: >> >>> On Fri, 2009-02-13 at 09:23 -0600, Michael Wilde wrote: >>>> >>>> On 2/13/09 9:17 AM, Mihael Hategan wrote: >>>>> On Fri, 2009-02-13 at 08:59 -0600, Michael Wilde wrote: >>>>>> -- Service submits via WS-GRAM. This should be tested, on sites >>>>>> whereWS-GRAM is working. >>>>>> This woild use jobmanager=gt2:gt4:{pbs/condor/sge}, and needs to be >>>>>> tested. For sites where WS-GRAM is not functional, I suggested we >>>>>> consider configuring our own non-root WS-GRAM, ideally using >>>>>> already-installed GT4 software, eg, from the OSG package on OSG >>>>>> and TG >>>>>> sites where its installed. Mihael thought this would be considerable >>>>>> work. >>>>> >>>>> Not as much the amount of work, but: >>>>> 1. getting root on sites (if installed for multiple users) >>>> >>>> I was thinking/hoping we could have a single setup script, which we'd >>>> pre-install where required, that would configure and start a personal >>>> WSRF container for the user. Which would be work for us to create, >>>> install and maintain, but, if successful, would be transparent to >>>> the user. >>> >>> You need root to configure sudo. >> >> Why would you need sudo for gram if it mapped to the same account? > > You wouldn't. sudo is not needed if the DN of the requester is the same > as the DN used to start the container. > >>> >>> I also don't think you can easily automate the gt4 installation process. >>> There's manual configuration that needs to be done. I suggest installing >>> your own gt4 once to which I can submit jobs to get an idea of what's >>> involved. >> >> For non-gram, there's an auto script here: >> >> >> http://workspace.globus.org/vm/TP2.2/admin/quickstart.html#auto-container >> >> The questions it asks users (is this the right hostname? etc) could >> also be >> scripted. 
>> >> Getting from there to GRAM4 auto-configuration should not be too much >> of a step >> (the nimbus contextualization scripts have done it for gram4). > > You could do this and if you avoid staging, then this may work fine for > GRAM4. > > But also, we want to have GRAM2 sites to start to use the SEG to remove > a significant portion of the GRAM2 scalability problem. I think that > would be best and simplest solution to focus on. Maybe we can start > with a site where you need more scalability and the admin would want to > work with us on that? > >> >> Tim >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From wilde at mcs.anl.gov Fri Feb 13 10:06:13 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 13 Feb 2009 10:06:13 -0600 Subject: [Swift-devel] Status of coasters In-Reply-To: <4995982F.9010605@renci.org> References: <49958AEB.3090002@mcs.anl.gov> <4995982F.9010605@renci.org> Message-ID: <49959A75.3050500@mcs.anl.gov> So instead work with OSG to get WS-GRAM working? On 2/13/09 9:56 AM, Mats Rynge wrote: > Michael Wilde wrote: >> For sites where WS-GRAM is not functional, I suggested we consider >> configuring our own non-root WS-GRAM, ideally using already-installed >> GT4 software, eg, from the OSG package on OSG and TG sites where its >> installed. Mihael thought this would be considerable work. I agree but >> it might be a stable solution with fewer unknowns and suppot from the >> GRAM group. We can bring in the latest GT4 as needed if that provides a >> better solution than some older installed GT4 which we have no control >> over and which wont change till upcoming releases of say OSG or TG >> packages. > > Please don't do this on OSG. 
I'm fairly sure that working around the > existing interfaces to a resource would just tick off the resource owners. > From hategan at mcs.anl.gov Fri Feb 13 10:19:49 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 13 Feb 2009 10:19:49 -0600 Subject: [Swift-devel] Status of coasters In-Reply-To: <20090213094032.2b041afc@prnb> References: <49958AEB.3090002@mcs.anl.gov> <1234538259.25737.1.camel@localhost> <49959069.7020602@mcs.anl.gov> <1234539163.26116.3.camel@localhost> <20090213094032.2b041afc@prnb> Message-ID: <1234541989.26956.1.camel@localhost> On Fri, 2009-02-13 at 09:40 -0600, Tim Freeman wrote: > > > > You need root to configure sudo. > > Why would you need sudo for gram if it mapped to the same account? I assumed that if we support multiple users, we do so properly. > > > > > I also don't think you can easily automate the gt4 installation process. > > There's manual configuration that needs to be done. I suggest installing > > your own gt4 once to which I can submit jobs to get an idea of what's > > involved. > > For non-gram, there's an auto script here: > > http://workspace.globus.org/vm/TP2.2/admin/quickstart.html#auto-container > > The questions it asks users (is this the right hostname? etc) could also be > scripted. > > Getting from there to GRAM4 auto-configuration should not be too much of a step > (the nimbus contextualization scripts have done it for gram4). Thanks. We should then try this. From rynge at renci.org Fri Feb 13 10:17:46 2009 From: rynge at renci.org (Mats Rynge) Date: Fri, 13 Feb 2009 11:17:46 -0500 Subject: [Swift-devel] Status of coasters In-Reply-To: <49959A75.3050500@mcs.anl.gov> References: <49958AEB.3090002@mcs.anl.gov> <4995982F.9010605@renci.org> <49959A75.3050500@mcs.anl.gov> Message-ID: <49959D2A.7090006@renci.org> Michael Wilde wrote: > So instead work with OSG to get WS-GRAM working? There is a slow movement to make this happen. 
A couple of smaller VOs (Engagement, which I'm representing, included) which are asking for WS-GRAM to become a suggested/required service, but I don't think that will happen for the next release. I have heard that the new SEG for pre-WS GRAM will be included in the next release of the OSG software stack. > On 2/13/09 9:56 AM, Mats Rynge wrote: >> Michael Wilde wrote: >>> For sites where WS-GRAM is not functional, I suggested we consider >>> configuring our own non-root WS-GRAM, ideally using already-installed >>> GT4 software, eg, from the OSG package on OSG and TG sites where its >>> installed. Mihael thought this would be considerable work. I agree but >>> it might be a stable solution with fewer unknowns and suppot from the >>> GRAM group. We can bring in the latest GT4 as needed if that provides a >>> better solution than some older installed GT4 which we have no control >>> over and which wont change till upcoming releases of say OSG or TG >>> packages. >> >> Please don't do this on OSG. I'm fairly sure that working around the >> existing interfaces to a resource would just tick off the resource >> owners. >> > -- Mats Rynge Renaissance Computing Institute From hategan at mcs.anl.gov Fri Feb 13 10:22:43 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 13 Feb 2009 10:22:43 -0600 Subject: [Swift-devel] Status of coasters In-Reply-To: References: <49958AEB.3090002@mcs.anl.gov> <1234538259.25737.1.camel@localhost> <49959069.7020602@mcs.anl.gov> <1234539163.26116.3.camel@localhost> <20090213094032.2b041afc@prnb> Message-ID: <1234542163.26956.3.camel@localhost> On Fri, 2009-02-13 at 09:58 -0600, Stuart Martin wrote: > But also, we want to have GRAM2 sites to start to use the SEG to > remove a significant portion of the GRAM2 scalability problem. I > think that would be best and simplest solution to focus on. Maybe we > can start with a site where you need more scalability and the admin > would want to work with us on that? 
Can this be pushed into VDT/OSG and SDgrr$% (that thing that TG uses)? From benc at hawaga.org.uk Fri Feb 13 10:25:15 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 13 Feb 2009 16:25:15 +0000 (GMT) Subject: [Swift-devel] Status of coasters In-Reply-To: <1234542163.26956.3.camel@localhost> References: <49958AEB.3090002@mcs.anl.gov> <1234538259.25737.1.camel@localhost> <49959069.7020602@mcs.anl.gov> <1234539163.26116.3.camel@localhost> <20090213094032.2b041afc@prnb> <1234542163.26956.3.camel@localhost> Message-ID: On Fri, 13 Feb 2009, Mihael Hategan wrote: > > But also, we want to have GRAM2 sites to start to use the SEG to > > remove a significant portion of the GRAM2 scalability problem. I > > think that would be best and simplest solution to focus on. Maybe we > > can start with a site where you need more scalability and the admin > > would want to work with us on that? > > Can this be pushed into VDT/OSG and SDgrr$% (that thing that TG uses)? CTSS. I think that is based on VDT now too. -- From iraicu at cs.uchicago.edu Fri Feb 13 12:13:12 2009 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Fri, 13 Feb 2009 12:13:12 -0600 Subject: [Swift-devel] Status of coasters In-Reply-To: <49959458.3070704@mcs.anl.gov> References: <49958AEB.3090002@mcs.anl.gov> <49959458.3070704@mcs.anl.gov> Message-ID: <4995B838.9060904@cs.uchicago.edu> Michael Wilde wrote: > > ...Gets round potential problem that managed-fork JM will kill our > process for exceeding a walltime limit. > > Managed-fork kills your process on the head node. The LRM (PBS, Condor, etc) kills your process on the compute node. Either way, if you exceed the walltime limit, your process gets killed. Ioan -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 
58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== From iraicu at cs.uchicago.edu Fri Feb 13 12:15:38 2009 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Fri, 13 Feb 2009 12:15:38 -0600 Subject: [Swift-devel] Status of coasters In-Reply-To: <4995982F.9010605@renci.org> References: <49958AEB.3090002@mcs.anl.gov> <4995982F.9010605@renci.org> Message-ID: <4995B8CA.8030509@cs.uchicago.edu> Although, we and others have been doing this (multi-level scheduling) for a while now, using Coasters, Falkon, Condor glide-ins, etc... I don't see what would be different, to set up a separate WS-GRAM interface on a site that doesn't support it. Its just another method, to do multi-level scheduling. Ioan Mats Rynge wrote: > Michael Wilde wrote: > >> For sites where WS-GRAM is not functional, I suggested we consider >> configuring our own non-root WS-GRAM, ideally using already-installed >> GT4 software, eg, from the OSG package on OSG and TG sites where its >> installed. Mihael thought this would be considerable work. I agree but >> it might be a stable solution with fewer unknowns and suppot from the >> GRAM group. We can bring in the latest GT4 as needed if that provides a >> better solution than some older installed GT4 which we have no control >> over and which wont change till upcoming releases of say OSG or TG >> packages. >> > > Please don't do this on OSG. I'm fairly sure that working around the > existing interfaces to a resource would just tick off the resource owners. > > -- =================================================== Ioan Raicu Ph.D. 
Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== From hategan at mcs.anl.gov Fri Feb 13 12:32:01 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 13 Feb 2009 12:32:01 -0600 (CST) Subject: [Swift-devel] Status of coasters In-Reply-To: <4995B8CA.8030509@cs.uchicago.edu> Message-ID: <29514779.2253011234549921308.JavaMail.root@zimbra> ----- Ioan Raicu wrote: > Although, we and others have been doing this (multi-level scheduling) > for a while now, using Coasters, Falkon, Condor glide-ins, etc... I > don't see what would be different, to set up a separate WS-GRAM > interface on a site that doesn't support it. Its just another method, > to do multi-level scheduling. I suppose too much multi unnecessarily complicates things. From hategan at mcs.anl.gov Fri Feb 13 12:33:41 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 13 Feb 2009 12:33:41 -0600 (CST) Subject: [Swift-devel] Status of coasters In-Reply-To: <29514779.2253011234549921308.JavaMail.root@zimbra> Message-ID: <29887029.2253071234550021030.JavaMail.root@zimbra> ----- Mihael Hategan wrote: > ----- Ioan Raicu wrote: > Although, we and others have been doing this (multi-level scheduling) > for a while now, using Coasters, Falkon, Condor glide-ins, etc... I > don't see what would be different, to set up a separate WS-GRAM > interface on a site that doesn't support it.
It's just another method, > to do multi-level scheduling. I suppose too much multi unnecessarily complicates things. Additionally, if you're setting up WS-GRAM in order to be able to start coasters, you might as well start the coasters manually. From rynge at renci.org Fri Feb 13 12:34:55 2009 From: rynge at renci.org (Mats Rynge) Date: Fri, 13 Feb 2009 13:34:55 -0500 Subject: [Swift-devel] Status of coasters In-Reply-To: <4995B8CA.8030509@cs.uchicago.edu> References: <49958AEB.3090002@mcs.anl.gov> <4995982F.9010605@renci.org> <4995B8CA.8030509@cs.uchicago.edu> Message-ID: <4995BD4F.8080204@renci.org> Ioan Raicu wrote: > Although, we and others have been doing this (multi-level scheduling) > for a while now, using Coasters, Falkon, Condor glide-ins, etc... I > don't see what would be different, to set up a separate WS-GRAM > interface on a site that doesn't support it. It's just another method, to > do multi-level scheduling. No. The problem here is that you want to directly interact with the LRM. This will break several things, such as accounting and LRM policies. It is similar to using jobmanager-fork to run condor_submit. Even though it is technically possible to do so, most OSG sites would find such behavior unacceptable and probably ban the user/VO doing it. This is different from, for example, glideins, where you use the existing interface and LRM to get slots allocated to you, and then your job is actually the glidein. The WS-GRAM on the side would be the same iff you submitted it to the LRM, started it on compute nodes and only used jobmanager-fork (but only one real job at once). This would obviously not be very helpful, as many compute nodes are behind NAT. You really should not use jobmanager-fork for anything except simple setup/cleanup jobs.
-- Mats Rynge Renaissance Computing Institute From iraicu at cs.uchicago.edu Fri Feb 13 12:38:36 2009 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Fri, 13 Feb 2009 12:38:36 -0600 Subject: [Swift-devel] Status of coasters In-Reply-To: <29514779.2253011234549921308.JavaMail.root@zimbra> References: <29514779.2253011234549921308.JavaMail.root@zimbra> Message-ID: <4995BE2C.4010200@cs.uchicago.edu> I am not arguing for adding more levels to the scheduling decisions, but was merely pointing out to Mats that we have been bypassing the main schedulers of various systems for a while now, and that the reason for not following through with installing WS-GRAM should not be because of this. In fact, I don't support the idea of installing a user level GRAM per site. Its not even clear to me, how you would make that happen in a generic way, as GRAM needs to be configured to be used in conjunction with other LRMs, SGE, PBS, Condor, etc... so you not only need to install GRAM, but also configure it against another LRM, that you might know little about where its installed, the paths to the various logs (which GRAM needs), etc. If Coaster can be made to run transparently and works well enough for production, then Coaster can run on top of GRAM2 just fine, as the load it will impose on GRAM is lower than what the same app would impose if it were to go directly to GRAM. Ioan Mihael Hategan wrote: > > ----- Ioan Raicu wrote: > > Although, we and others have been doing this (multi-level scheduling) > > for a while now, using Coasters, Falkon, Condor glide-ins, etc... I > > don't see what would be different, to set up a separate WS-GRAM > > interface on a site that doesn't support it. Its just another method, > > to do multi-level scheduling. > > I suppose too much multi unnecessarily complicates things. -- =================================================== Ioan Raicu, Ph.D. 
=================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== -------------- next part -------------- An HTML attachment was scrubbed... URL: From iraicu at cs.uchicago.edu Fri Feb 13 12:51:24 2009 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Fri, 13 Feb 2009 12:51:24 -0600 Subject: [Swift-devel] Status of coasters In-Reply-To: <4995BD4F.8080204@renci.org> References: <49958AEB.3090002@mcs.anl.gov> <4995982F.9010605@renci.org> <4995B8CA.8030509@cs.uchicago.edu> <4995BD4F.8080204@renci.org> Message-ID: <4995C12C.1050107@cs.uchicago.edu> I think we are on the same page, as I was not implying to use methods to bypass accounting or lrm policies. Cheers, Ioan Mats Rynge wrote: > Ioan Raicu wrote: > >> Although, we and others have been doing this (multi-level scheduling) >> for a while now, using Coasters, Falkon, Condor glide-ins, etc... I >> don't see what would be different, to set up a separate WS-GRAM >> interface on a site that doesn't support it. Its just another method, to >> do multi-level scheduling. >> > > No. The problem here is that you want to directly interact with the lrm. > This will break several things such as accounting, and lrm policies. It > is similar to use jobmanger-fork to run condor_submit. Even though it is > technically possibly to do so, most OSG sites would find such a behavior > unacceptable and probably ban the user/VO doing it. 
> > This is different from for example glideins, where you use the existing > interface and lrm to get slots allocated to you, and then your job is > actually the glidein. The WS-GRAM on the side would be the same iif you > submitted it to the lrm, started it on compute nodes and only used > jobmanager-fork (but only one real job at once). This would obviously > not be very helpful as many compute nodes are behind NAT. > > You really should not user jobmanager-fork for anything except simple > setup/cleanup jobs. > > -- =================================================== Ioan Raicu, Ph.D. =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Fri Feb 13 13:19:19 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 13 Feb 2009 13:19:19 -0600 (CST) Subject: [Swift-devel] Status of coasters In-Reply-To: <4995BD4F.8080204@renci.org> Message-ID: <31956487.2256981234552759142.JavaMail.root@zimbra> ----- Mats Rynge wrote: > Ioan Raicu wrote: > > Although, we and others have been doing this (multi-level scheduling) > > for a while now, using Coasters, Falkon, Condor glide-ins, etc... I > > don't see what would be different, to set up a separate WS-GRAM > > interface on a site that doesn't support it. Its just another method, to > > do multi-level scheduling. > > No. The problem here is that you want to directly interact with the lrm. > This will break several things such as accounting, and lrm policies. 
It > is similar to use jobmanger-fork to run condor_submit. Even though it is > technically possibly to do so, most OSG sites would find such a behavior > unacceptable and probably ban the user/VO doing it. I assumed the LRM does the accounting. > > This is different from for example glideins, where you use the existing > interface and lrm to get slots allocated to you, and then your job is > actually the glidein. The WS-GRAM on the side would be the same iif you > submitted it to the lrm, started it on compute nodes and only used > jobmanager-fork (but only one real job at once). This would obviously > not be very helpful as many compute nodes are behind NAT. > > You really should not user jobmanager-fork for anything except simple > setup/cleanup jobs. That doesn't leave many options. If we have a more efficient way of submitting to the LRM than GRAM2, we can't use it. If we have more-user friendly way of doing glide-ins, we can't use that either. We're pretty much at the mercy of the VDT, which doesn't, after many years, properly escape whitespace. I find that to be an affront to the users and ultimately to the taxpayer. From rynge at renci.org Fri Feb 13 13:28:57 2009 From: rynge at renci.org (Mats Rynge) Date: Fri, 13 Feb 2009 14:28:57 -0500 Subject: [Swift-devel] Status of coasters In-Reply-To: <31956487.2256981234552759142.JavaMail.root@zimbra> References: <31956487.2256981234552759142.JavaMail.root@zimbra> Message-ID: <4995C9F9.8010705@renci.org> Mihael Hategan wrote: > ----- Mats Rynge wrote: >> Ioan Raicu wrote: >>> Although, we and others have been doing this (multi-level scheduling) >>> for a while now, using Coasters, Falkon, Condor glide-ins, etc... I >>> don't see what would be different, to set up a separate WS-GRAM >>> interface on a site that doesn't support it. Its just another method, to >>> do multi-level scheduling. >> No. The problem here is that you want to directly interact with the lrm. 
>> This will break several things such as accounting, and lrm policies. It >> is similar to use jobmanger-fork to run condor_submit. Even though it is >> technically possibly to do so, most OSG sites would find such a behavior >> unacceptable and probably ban the user/VO doing it. > > I assumed the LRM does the accounting. I don't know details on how this works, but I think there are some OSG and/or VDT patches to the jobmanager perl modules. Think about it as the LRM does the accounting, but you have to pass the LRM the correct information. If you interacted directly with the LRM, the jobs would look like local jobs, not OSG originated jobs. >> This is different from for example glideins, where you use the existing >> interface and lrm to get slots allocated to you, and then your job is >> actually the glidein. The WS-GRAM on the side would be the same iif you >> submitted it to the lrm, started it on compute nodes and only used >> jobmanager-fork (but only one real job at once). This would obviously >> not be very helpful as many compute nodes are behind NAT. >> >> You really should not user jobmanager-fork for anything except simple >> setup/cleanup jobs. > > That doesn't leave many options. If we have a more efficient way of > submitting to the LRM than GRAM2, we can't use it. If we have more-user > friendly way of doing glide-ins, we can't use that either. We're pretty > much at the mercy of the VDT, which doesn't, after many years, properly > escape whitespace. I find that to be an affront to the users and > ultimately to the taxpayer. I agree with you, but standing up your own WS-GRAM is not the solution. 
-- Mats Rynge Renaissance Computing Institute From hategan at mcs.anl.gov Fri Feb 13 13:39:50 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 13 Feb 2009 13:39:50 -0600 (CST) Subject: [Swift-devel] Status of coasters In-Reply-To: <4995C9F9.8010705@renci.org> Message-ID: <3035487.2258841234553990865.JavaMail.root@zimbra> > >> You really should not user jobmanager-fork for anything except simple > >> setup/cleanup jobs. > > > > That doesn't leave many options. If we have a more efficient way of > > submitting to the LRM than GRAM2, we can't use it. If we have more-user > > friendly way of doing glide-ins, we can't use that either. We're pretty > > much at the mercy of the VDT, which doesn't, after many years, properly > > escape whitespace. I find that to be an affront to the users and > > ultimately to the taxpayer. > > I agree with you, but standing up your own WS-GRAM is not the solution. I don't think standing up our own WS-GRAM is the solution either. However, I must also consider that standing up coasters (whether manually or automatically) or anything non-trivial is similar to standing up WS-GRAM in that there is a not-so-transient process that plays a part in managing jobs and other things. From hategan at mcs.anl.gov Fri Feb 13 23:48:11 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 13 Feb 2009 23:48:11 -0600 Subject: [Swift-devel] Status of coasters In-Reply-To: <49958AEB.3090002@mcs.anl.gov> References: <49958AEB.3090002@mcs.anl.gov> Message-ID: <1234590491.9413.0.camel@localhost> On Fri, 2009-02-13 at 08:59 -0600, Michael Wilde wrote: > - Mihael has a good handle on the bootstrap issues and is working on > improvements. This is not working in trunk at the moment, will likely be > fixed soon. We think this will fix known issues in: command line lenth > for condor, spaces, quotes, newlines and other offending argument > issues; location of Java and tools (wget/curl and mdsum). I think I nailed it this time. 
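The tool-location part of the bootstrap problem mentioned above (finding Java and a downloader such as wget or curl on a node whose PATH may be as small as /bin:/usr/bin) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual coaster bootstrap code: the function names first_available and find_java are hypothetical, and preferring $JAVA_HOME over PATH is an assumption rather than something stated in this thread.

```shell
#!/bin/sh
# Illustrative sketch only; not the actual coaster bootstrap script.

# first_available (hypothetical name): print the first of the named tools
# that can be found on PATH; return non-zero if none of them is found.
first_available() {
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
            return 0
        fi
    done
    return 1
}

# find_java (hypothetical name): prefer $JAVA_HOME/bin/java when JAVA_HOME is
# set and points at an executable, otherwise fall back to searching PATH.
find_java() {
    if [ -n "${JAVA_HOME:-}" ] && [ -x "$JAVA_HOME/bin/java" ]; then
        printf '%s\n' "$JAVA_HOME/bin/java"
    else
        command -v java
    fi
}

# Report what is missing instead of failing silently later in the bootstrap.
downloader=$(first_available wget curl) || echo "neither wget nor curl found" >&2
java_bin=$(find_java) || echo "java not found" >&2
```

Checking for the tools up front means a missing Java or downloader shows up as an explicit message on stderr rather than as an opaque job failure.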
From benc at hawaga.org.uk  Sat Feb 14 08:16:11 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sat, 14 Feb 2009 14:16:11 +0000 (GMT)
Subject: [Swift-devel] typecheck foo[*].bar
In-Reply-To:
References:
Message-ID:

I've implemented the below, as r2538.

On Tue, 10 Feb 2009, Ben Clifford wrote:

> I noticed today that expressions like this don't get typechecked properly,
> so in 0.8, you can't use [*].member expressions. Bleugh.
>
> As I want to use such expressions (or equivalent), I guess I have to fix
> that soonish.
>
> I think the approach I am favouring language-wise is that [*] becomes a
> no-op/identity operator, and . with an array of structs on the left
> returns an array of the appropriate member fields.
>
> Thus a[*] == a for all arrays a
>
> a[*].foo == a.foo == (in haskelly pseudocode) (map \(x->x.foo) a)

--

From benc at hawaga.org.uk  Sun Feb 15 13:29:42 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sun, 15 Feb 2009 19:29:42 +0000 (GMT)
Subject: [Swift-devel] code coverage of tests
Message-ID:

I just used EMMA, a code coverage tool, on the test suite. The results (in
html report) are here:

http://www.ci.uchicago.edu/~benc/tmp/coverage/index.html

I ran this to try to see what areas of the code aren't being tested at all
at the moment.

You'll need to have an understanding of the Swift source code in order to
understand this report.

This covers only the classes in 'cog-swift-svn.jar', where the code from
the Swift repository lives. It doesn't cover any of the other classes (for
example the CoG providers or karajan)

The report is for only the base fully automated tests, not per-site or
wonky tests.

For future reference, this is how I ran the tests:

cd cog/modules/swift
export CLASSPATH=/Users/benc/work/emma-2.0.5312/lib/emma.jar
export COG_OPTS="-Demma.verbosity.level=quiet"
java emma instr -cp dist/swift-svn/lib/cog-swift-svn.jar -m overwrite
cd tests
./run
cd ..
java emma report -r html -sp src -in coverage.em -in coverage.ec
open coverage/index.html

--

From benc at hawaga.org.uk  Sun Feb 15 16:04:45 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sun, 15 Feb 2009 22:04:45 +0000 (GMT)
Subject: [Swift-devel] 0.8 clustering broken
Message-ID:

doh, I broke clustering in 0.8 - turns out my test for clusters is lame
and neither enables clustering nor checks that clustering was actually
used.

Upcoming fixes in the SVN in a few, so it should be working in 0.9,
planned for mid-March.

--

From wilde at mcs.anl.gov  Sun Feb 15 17:50:18 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Sun, 15 Feb 2009 17:50:18 -0600
Subject: [Swift-devel] Returning GRAM errors to swift user
Message-ID: <4998AA3A.4050307@mcs.anl.gov>

I'm assuming that Swift and the CoG provider return as much about GRAM
errors back to the user as they know. But, for jobs that fail to start,
e.g., due to an invalid project code, that error never makes it back to
the user (but *is* present in the gram log).

In this case, can the message below, from the GRAM log,
"GRAM_SCRIPT_GT3_FAILURE_MESSAGE:qsub: Invalid Account MSG=invalid
account\n" be made available in the GRAM API so it can be sent to the user?

I'm assuming this particular issue is well known to users experienced
with TeraGrid sites, like Sarah, but is perhaps worth pointing out in a
troubleshooting section. If there's a chance that some of this GRAM
error info can be returned but is not currently, I can file this in
bugzilla.

It seems like a few errors, such as account/project errors, or other
invalid job specs (like time/queue mismatches?) are similarly not passed
back. Is that the case?

Relevant snips from the logs are below.

Also interesting to note: On the UC teragrid site, a project specified
in sites.xml via the globus profile does *not* override a default
project set by the tgprojects command. In my case, I had an invalid
(old) project set via tgprojects, which took precedence over the one in
my sites.xml.
When I set the default project to "None" in tgprojects, then the
sites.xml project was accepted and the job ran.

- Mike

In swift .log file:

2009-02-15 16:59:27,408-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION jobid=uname-0gpb1o6j - Application exception: The job failed when the job manager attempted to run it
Caused by: org.globus.gram.GramException: The job failed when the job manager attempted to run it

Messages on swift stdout/err:

===============================
Swift svn swift-r2532 cog-r2300

RunID: 20090215-1733-0x4ksmd8
Progress:
Progress: Stage in:1
Progress: Active:1
Failed to transfer wrapper log from un-20090215-1733-0x4ksmd8/info/k on uc32
Progress: Failed:1
Execution failed:
Exception in uname:
Arguments: [-a]
Host: uc32
Directory: un-20090215-1733-0x4ksmd8/jobs/k/uname-kl1p2o6j
stderr.txt:
stdout.txt:
----
Caused by:
The job failed when the job manager attempted to run it
===============================

But the following useful info is in the gram log (on the server side),
which did not make it to the swift logs above:

Sun Feb 15 17:33:58 2009 JM_SCRIPT: submitting job -- /soft/torque/bin/qsub < /home/wilde/.globus/job/tg-grid1.uc.teragrid.org/14326.1234740838/scheduler_pbs_job_script 2>/home/wilde/.globus/job/tg-grid1.uc.teragrid.org/14326.1234740838/scheduler_pbs_submit_stderr
Sun Feb 15 17:33:58 2009 JM_SCRIPT: qsub returned
Sun Feb 15 17:33:58 2009 JM_SCRIPT: qsub stderr qsub: Invalid Account MSG=invalid account
2/15 17:33:58 JM: GT3 extended error message: GRAM_SCRIPT_GT3_FAILURE_MESSAGE:qsub: Invalid Account MSG=invalid account\n
2/15 17:33:58 JMI: while return_buf = GRAM_SCRIPT_GT3_FAILURE_MESSAGE = qsub: Invalid Account MSG=invalid account\n
2/15 17:33:58 JMI: while return_buf = GRAM_SCRIPT_ERROR = 17
2/15 17:33:58 JM: in globus_gram_job_manager_reporting_file_create()
2/15 17:33:58 JM: not reporting job information
2/15 17:33:58 JM: in globus_gram_job_manager_history_file_create()
2/15 17:33:58 JM: NOT empty client callback list.
2/15 17:33:58 JM: sending callback of status 4 (failure code 17) to https://128.135.125.17:50000/1234740837636. 2/15 17:33:58 JMI: testing job manager scripts for type pbs exist and permissions are ok. 2/15 17:33:58 JMI: completed script validation: job manager type is pbs. From benc at hawaga.org.uk Sun Feb 15 18:43:54 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 16 Feb 2009 00:43:54 +0000 (GMT) Subject: [Swift-devel] wonky-runawayjob missing its sites file Message-ID: r2513 introduces a test for runaway jobs, but doesn't include the site file. In r2550 I add a script to run all the wonky tests (I thought I had that in already, but apparently not - seeming it was uncommitted in my local working directory). If you fix the runaway test, add it to that script - tests/misc/run-wonky -- From hategan at mcs.anl.gov Sun Feb 15 23:14:57 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 15 Feb 2009 23:14:57 -0600 Subject: [Swift-devel] Re: wonky-runawayjob missing its sites file In-Reply-To: References: Message-ID: <1234761297.12882.0.camel@localhost> On Mon, 2009-02-16 at 00:43 +0000, Ben Clifford wrote: > r2513 introduces a test for runaway jobs, but doesn't include the site > file. r2551 fixes that. > > In r2550 I add a script to run all the wonky tests (I thought I had that > in already, but apparently not - seeming it was uncommitted in my local > working directory). If you fix the runaway test, add it to that script - > tests/misc/run-wonky > From hategan at mcs.anl.gov Sun Feb 15 23:21:58 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 15 Feb 2009 23:21:58 -0600 Subject: [Swift-devel] Returning GRAM errors to swift user In-Reply-To: <4998AA3A.4050307@mcs.anl.gov> References: <4998AA3A.4050307@mcs.anl.gov> Message-ID: <1234761718.12882.7.camel@localhost> On Sun, 2009-02-15 at 17:50 -0600, Michael Wilde wrote: > Im assuming that Swift and the CoG provider return as much about GRAM > errors back to the user as they know. 
But, for jobs that fail to start, > e.g., due to an invalid project code, that error never makes it back to > the user (but *is* present in the gram log). > > In this case, can the message below, from the GRAM log, > "GRAM_SCRIPT_GT3_FAILURE_MESSAGE:qsub: Invalid Account MSG=invalid > account\n" available in the GRAM API so it can be sent to the user? There are two possibilities: 1. The message does not make it to the ws-gram client. This needs to be fixed in ws-gram. 2. (1) is false, and the ws-gram cog provider does not propagate that message in the failure event. This I should fix. There's a third, but unlikely, that the karajan or swift portion is broken. > > I'm assuming this particular issue is well known to users experienced > with TeraGrid sites, like Sarah, but is perhaps worth pointing out in a > troubleshooting section. If there's a chance that some of this GRAM > error info can be returned but is not currently, I can file this in > bugzilla. > > It seems like a few errors, such as account/project errors, or other > invalid job specs (like time/queue mismatches?) are similarly not passed > back. Is that the case? In my experience, yes. > > Relevant snips from the logs are below. > > Also interesting to note: On the UC teragrid site, a project specified > in sites.xml via the globus profile does *not* override a default > project set by the tgprojects command. If this is correct (i.e. we're not talking about some obscure issue where having a bogus default project causes all your jobs to fail), I would think of it as a bug that should be submitted to teragrid. From rynge at renci.org Mon Feb 16 17:01:50 2009 From: rynge at renci.org (Mats Rynge) Date: Mon, 16 Feb 2009 18:01:50 -0500 Subject: [Swift-devel] Contribution: swift-osg-ress-site-catalog Message-ID: <20090216230150.GA9956@rynge.europa.renci.org> Attached is a contribution for Swift. 
It is a script which queries the OSG Resource Selection System (ReSS) for
site information and builds a Swift site catalog.

I think the script should be located in bin/, but feel free to include it
anywhere in the distribution.

I have already submitted a signed contribution license agreement, and
this script is contributed under that agreement.

--
Mats Rynge
Renaissance Computing Institute
-------------- next part --------------
#!/usr/bin/perl

use strict;
use Pod::Usage;
use Getopt::Long;
use File::Temp qw/ tempfile tempdir mktemp /;

my $opt_help = 0;
my $opt_vo = 'engage';
my $opt_engage_verified = 0;
my $opt_gt4 = 0;
my $opt_out = '&STDOUT';

Getopt::Long::Configure('bundling');
GetOptions(
    "help" => \$opt_help,
    "vo=s" => \$opt_vo,
    "engage-verified" => \$opt_engage_verified,
    "gt4" => \$opt_gt4,
    "out=s" => \$opt_out,
) or pod2usage(1);

if ($opt_help) {
    pod2usage(1);
}

if ($opt_engage_verified && $opt_vo ne "engage") {
    die("You can not specify a vo when using --engage-verified\n");
}

# make sure condor_status is in the path
my $out = `which condor_status 2>/dev/null`;
if ($out eq "") {
    die("This tool depends on condor_status.\n" .
        "Please make sure condor_status in your path.\n");
}

my %ads;
my %tmp;
my $cmd = "condor_status -any -long -constraint" .
          " 'StringlistIMember(\"VO:$opt_vo\";GlueCEAccessControlBaseRule)'" .
          " -pool osg-ress-1.fnal.gov";

# if we want the engage verified sites, ignore opt_vo and query against
# engage central collector
if ($opt_engage_verified) {
    $cmd = "condor_status -any -long -constraint" .
           " 'SiteVerified==TRUE'" .
           " -pool engage-central.renci.org";
}

# read the ClassAds one attribute per line; a blank line ends an ad
open(STATUS, "$cmd|");
while(<STATUS>) {
    chomp;
    if ($_ eq "") {
        if ($tmp{'GlueSiteName'} ne "") {
            my %copy = %tmp;
            $ads{$tmp{'GlueSiteName'}} = \%copy;
            undef %tmp;
        }
    }
    else {
        my ($key, $value) = split(/ = /, $_, 2);
        $value =~ s/^"|"$//g; # remove quotes from Condor strings
        $tmp{$key} = $value;
    }
}
close(STATUS);

# lowercase vo
my $lc_vo = lc($opt_vo);

open(FH, ">$opt_out") or die("Unable to open $opt_out");
print FH "\n";
foreach my $sitename (keys %ads) {
    my $contact = $ads{$sitename}->{'GlueCEInfoContactString'};
    my $host = $contact;
    $host =~ s/:.*//;
    my $jm = $contact;
    $jm =~ s/.*jobmanager-//;
    if ($jm eq "pbs") {
        $jm = "PBS";
    }
    elsif ($jm eq "lsf") {
        $jm = "LSF";
    }
    elsif ($jm eq "sge") {
        $jm = "SGE";
    }
    elsif ($jm eq "condor") {
        $jm = "Condor";
    }
    my $workdir = $ads{$sitename}->{'GlueCEInfoDataDir'};

    print FH "\n";
    print FH " \n";
    print FH " \n";
    print FH " \n";
    if ($opt_gt4) {
        print FH " \n";
    }
    else {
        print FH " \n";
    }
    print FH " $workdir/$lc_vo/tmp\n";
    print FH " \n";
}
print FH "\n\n";
close(FH);

exit(0);

__END__

=head1 NAME

swift-osg-ress-site-catalog - converts ReSS data to Swift site catalog

=head1 SYNOPSIS

swift-osg-ress-site-catalog [options]

=head1 OPTIONS

=over 8

=item B<--help>

Show this help message

=item B<--vo=[name]>

Set what VO to query ReSS for

=item B<--engage-verified>

Only retrieve sites verified by the Engagement VO site verification tests
This can not be used together with --vo, as the query will only work for
sites advertising support for the Engagement VO.

This option means information will be retrieved from the Engagement
collector instead of the top-level ReSS collector.

=item B<--out=[filename]>

Write to [filename] instead of stdout

=back

=head1 DESCRIPTION

B<swift-osg-ress-site-catalog> converts ReSS data to Swift site catalog

=cut

From benc at hawaga.org.uk  Tue Feb 17 02:26:49 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 17 Feb 2009 08:26:49 +0000 (GMT)
Subject: [Swift-devel] Contribution: swift-osg-ress-site-catalog
In-Reply-To: <20090216230150.GA9956@rynge.europa.renci.org>
References: <20090216230150.GA9956@rynge.europa.renci.org>
Message-ID:

cool. I'll put this in for 0.9.

--

From iraicu at cs.uchicago.edu  Wed Feb 18 11:43:04 2009
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Wed, 18 Feb 2009 11:43:04 -0600
Subject: [Swift-devel] [Fwd: [Dadc09] Deadlines for DADC'09 extended]
Message-ID: <499C48A8.5080808@cs.uchicago.edu>

Hi all,
The DADC workshop extended their deadline. I attended last year, and it
was a great venue focusing on data resource management in distributed
systems. If you have any work that is close to being ready to publish
and it is relevant to the workshop, it's a good venue to try!

Cheers,
Ioan

-------- Original Message --------
Subject: [Dadc09] Deadlines for DADC'09 extended
Date: Wed, 18 Feb 2009 11:34:16 -0600
From: Tevfik Kosar
To: dadc09 at cct.lsu.edu

The abstract and paper submission deadlines for DADC'09 have been extended.
The new deadlines are:

Abstract submission: February 25, 2009 (extended)
Paper submission: March 1, 2009 (extended)

Thank you.
Tevfik

-----------------------------------------------------------------------------------
*** Call for Papers ***
WORKSHOP ON DATA-AWARE DISTRIBUTED COMPUTING (DADC'09)
In conjunction with HPDC 2009, June 9-13, Munich, Germany
http://www.cct.lsu.edu/~kosar/dadc09
------------------------------------------------------------------------------------

The Second International Workshop on Data-Aware Distributed Computing
(DADC'09) will be held in conjunction with the 18th International
Symposium on High Performance Distributed Computing (HPDC-18), in
Munich, Germany.
The data requirements of scientific as well as commercial applications from a diverse range of fields have been increasing exponentially over the recent years. This increase in the demand for large-scale data processing has necessitated collaboration and sharing among the world's leading education, research, and industrial institutions and use of distributed resources owned by collaborating parties. In a widely distributed environment, data is no more locally accessible and has thus to be remotely retrieved and stored. While traditional distributed systems work well for computation that requires limited data handling, they fail in unexpected ways when the computation accesses, creates, and moves large amounts of data especially over wide-area networks. Scientists, researchers, and application developers are often forced to spend a great deal of time and energy on solving basic data-handling issues, such as the physical location of data, how to access it, and/or how to move it to visualization and/or compute resources for further analysis. This workshop will focus on the challenges of distributed systems imposed by the data intensive applications, and on the different state-of-the-art solutions proposed to overcome these challenges. A new paradigm called "data-aware distributed computing" and its application to different research realms such as scheduling, resource allocation, workflow management, and visualization will be discussed. With the knowledge of the correct data management techniques, the domain scientists will be able to focus on their primary goal, assured that their data management needs are handled reliably and efficiently. We believe this workshop will make a unique contribution to collaborative and distributed computing community by focusing on the planning, management, and scheduling of data handling tasks and data storage resources. 
Topics of interest include, but are not limited to: - Data-intensive applications and their challenges - Data-aware toolkits, middleware, storage and file systems - Data-oriented batch schedulers and workflow managers - Data staging, replication, and remote access to data - Data placement, management, and scheduling techniques - Co-scheduling of computation, data storage, and network resources - Network support for data-intensive computing - High speed wide area data transfers and bulk data movement - Remote and distributed visualization of large scale data - Data-aware workflow and data-flow management - Cross-domain metadata and ontologies - Distributed and hierarchical storage management - Storage resource managers and brokers - Data archives, digital libraries, and preservations - Service oriented architectures for data-intensive computing - Protection of sensitive data in a collaborative environment - Peer-to-peer data movement and data streaming - Future research challenges in data-intensive computing IMPORTANT DATES: Abstract submission: February 25, 2009 (extended) Paper submission: March 1, 2009 (extended) Acceptance notification: March 15, 2009 Final papers due: April 1, 2009 PAPER SUBMISSIONS: DADC'09 invites authors to submit original and unpublished technical papers of at most 10 pages. All submissions will be peer-reviewed and judged on correctness, originality, technical strength, significance, quality of presentation, and relevance to the workshop topics of interest. Submitted papers may not have appeared in or be under consideration for another workshop, conference or a journal, nor may they be under review or submitted to another forum during the DADC'09 review process. Papers should be prepared in ACM SIG Proceedings format and submitted electronically (in PDF format) via the HPDC 2009 conference web site. 
WORKSHOP and PROGRAM CHAIR: Tevfik Kosar, Louisiana State University PROGRAM COMMITTEE: Micah Beck, University of Tennessee John Bent, Los Alamos National Laboratory Ann Chervenak, USC Information Sciences Institute Alok Choudhary, Northwestern University Ewa Deelman, USC Information Sciences Institute Renato Figueiredo, University of Florida Geoffrey Fox, Indiana University Peter Kacsuk, Hungarian Academy of Sciences Dan Katz, Louisiana State University Peter Kunszt, Swiss National Computing Center Erwin Laure, CERN Reagan Moore, San Diego Supercomputing Center Don Petravick, Fermi National Accelarator Laboratory Ioan Raicu, University of Chicago Sanjay Ranka, University of Florida Doron Rotem, Lawrence Berkeley National Laboratory Jennifer Schopf, National Science Foundation Florian Schintke, Zuse Institute Berlin Alex Sim, Lawrence Berkeley National Laboratory Ian Taylor, Cardiff University Douglas Thain, University of Notre Dame Brian Tierney, Lawrence Berkeley National Laboratory Bernard Traversat, Sun Microsystems Sudharshan Vazhkudai, Oak Ridge National Laboratory Andrew Wendelborn, University of Adelaide Mike Wilde, Argonne National Laboratory _______________________________________________ CS mailing list CS at cct.lsu.edu https://mail.cct.lsu.edu/mailman/listinfo/cs -- =================================================== Ioan Raicu, Ph.D. =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== -------------- next part -------------- An HTML attachment was scrubbed... 
URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: Attached Message Part URL: From zhaozhang at uchicago.edu Thu Feb 19 11:51:59 2009 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Thu, 19 Feb 2009 11:51:59 -0600 Subject: [Swift-devel] new swift Message-ID: <499D9C3F.9060904@uchicago.edu> Hi, I tried to update my swift on bgp up to date, but I found that the tree structure is changed, the provider-deef is no longer in the "module" directory. Shall I copy the old provider-deef to the new directory? or any suggestions to install provider-deef? thanks. best wishes zhangzhao From benc at hawaga.org.uk Thu Feb 19 15:40:22 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 19 Feb 2009 21:40:22 +0000 (GMT) Subject: [Swift-devel] new swift In-Reply-To: <499D9C3F.9060904@uchicago.edu> References: <499D9C3F.9060904@uchicago.edu> Message-ID: On Thu, 19 Feb 2009, Zhao Zhang wrote: > I tried to update my swift on bgp up to date, but I found that the tree > structure is changed, the provider-deef is no longer in the "module" > directory. > Shall I copy the old provider-deef to the new directory? or any suggestions > to install provider-deef? thanks. What do you mean by that? What did you do to update? Go into an existing checkout or start from fresh? If you start from fresh, then you need to separately make three checkouts: cog/ cog/modules/swift cog/modules/provider-deef from their various different locations. -- From zhaozhang at uchicago.edu Fri Feb 20 14:37:32 2009 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Fri, 20 Feb 2009 14:37:32 -0600 Subject: [Swift-devel] filesystem mapper Message-ID: <499F148C.7050906@uchicago.edu> Hi, I have a problem with file system mapper in the latest version of swift. 
The code looks like: Mol texts[] ; It is trying to map all .mol2 files in the input directory, and it works fine with an older version of swift which is Swift svn swift-r2334 (Swift modified locally) cog-r2216 But failed with the following information zzhang at login6.surveyor:~/new_dock6> ./run_swift_ssh.sh 1010 64 test.swift waiting for at least 64 nodes to register before submitting workload... waiting to find at least 1 services in file /home/falkon/users/zzhang/1010/config/Client-service-URIs.config... all done, file has found at least 1 services found at least 64 registered, submitting workload... Swift svn swift-r2578 cog-r2305 RunID: 20090220-1428-ugfvnoya Progress: Execution failed: Getting value for array texts.$[]/1 which is not permitted. The log file is at http://www.ci.uchicago.edu/~zzhang/test-20090220-1428-ugfvnoya.log zhao From zhaozhang at uchicago.edu Fri Feb 20 16:48:49 2009 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Fri, 20 Feb 2009 16:48:49 -0600 Subject: [Swift-devel] absolute path Message-ID: <499F3351.2060000@uchicago.edu> Hi, Guys I found that in the latest swift code, the task description is using absolute path like this: /bin/bash /tmp/sleep-20090220-1646-7vdlcg0a/shared/wrapper.sh sleep-hoyn8x6j -jobdir h/o -e /bin/sleep -out stdout.txt -err stderr.txt -i -d -if -of -k -a 30 I mean the "/tmp/sleep-20090220-1646-7vdlcg0a/shared/wrapper.sh" part. Is there an option that we use relative path? 
thanks best wishes zhangzhao From hategan at mcs.anl.gov Fri Feb 20 16:57:12 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 20 Feb 2009 16:57:12 -0600 (CST) Subject: [Swift-devel] absolute path In-Reply-To: <499F3351.2060000@uchicago.edu> Message-ID: <9459070.396241235170632289.JavaMail.root@zimbra> ----- Zhao Zhang wrote: > Hi, Guys > > I found that in the latest swift code, the task description is using > absolute path like this: > /bin/bash /tmp/sleep-20090220-1646-7vdlcg0a/shared/wrapper.sh > sleep-hoyn8x6j -jobdir h/o -e /bin/sleep -out stdout.txt -err stderr.txt > -i -d -if -of -k -a 30 > > I mean the "/tmp/sleep-20090220-1646-7vdlcg0a/shared/wrapper.sh" part. > Is there an option that we use relative path? thanks Relative to what? You could try changing the in sites.xml. From zhaozhang at uchicago.edu Fri Feb 20 17:01:13 2009 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Fri, 20 Feb 2009 17:01:13 -0600 Subject: [Swift-devel] absolute path In-Reply-To: <9459070.396241235170632289.JavaMail.root@zimbra> References: <9459070.396241235170632289.JavaMail.root@zimbra> Message-ID: <499F3639.8010709@uchicago.edu> so, in the old version instead of "/tmp/sleep-20090220-1646-7vdlcg0a/shared/wrapper.sh" it is like this: "shared/wrapper.sh" by this I mean it is a relative path. I got another question about using signal notification instead of status files, as I remembered, there is an option in one property file for that, but I couldn't find it. zhao Mihael Hategan wrote: > ----- Zhao Zhang wrote: > >> Hi, Guys >> >> I found that in the latest swift code, the task description is using >> absolute path like this: >> /bin/bash /tmp/sleep-20090220-1646-7vdlcg0a/shared/wrapper.sh >> sleep-hoyn8x6j -jobdir h/o -e /bin/sleep -out stdout.txt -err stderr.txt >> -i -d -if -of -k -a 30 >> >> I mean the "/tmp/sleep-20090220-1646-7vdlcg0a/shared/wrapper.sh" part. >> Is there an option that we use relative path? thanks >> > > Relative to what? 
> > You could try changing the in sites.xml. > > From hategan at mcs.anl.gov Fri Feb 20 17:08:11 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 20 Feb 2009 17:08:11 -0600 (CST) Subject: [Swift-devel] absolute path In-Reply-To: <499F3639.8010709@uchicago.edu> Message-ID: <7469230.397401235171291666.JavaMail.root@zimbra> ----- Zhao Zhang wrote: > so, in the old version instead of > "/tmp/sleep-20090220-1646-7vdlcg0a/shared/wrapper.sh" > it is like this: > "shared/wrapper.sh" > > by this I mean it is a relative path. Yes, of course, but relative paths are in respect to something. In other words, where do you want it to end on the remote site? > > I got another question about using signal notification instead of status > files, > as I remembered, there is an option in one property file for that, Have you tried swift.properties? > but I > couldn't > find it. > From zhaozhang at uchicago.edu Fri Feb 20 17:11:50 2009 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Fri, 20 Feb 2009 17:11:50 -0600 Subject: [Swift-devel] absolute path In-Reply-To: <7469230.397401235171291666.JavaMail.root@zimbra> References: <7469230.397401235171291666.JavaMail.root@zimbra> Message-ID: <499F38B6.5070403@uchicago.edu> Mihael Hategan wrote: > ----- Zhao Zhang wrote: > >> so, in the old version instead of >> "/tmp/sleep-20090220-1646-7vdlcg0a/shared/wrapper.sh" >> it is like this: >> "shared/wrapper.sh" >> >> by this I mean it is a relative path. >> > > Yes, of course, but relative paths are in respect to something. In > other words, where do you want it to end on the remote site? > This is fine. I just made a change in the worker code, and it could dynamically work with both cases. > >> I got another question about using signal notification instead of status >> files, >> as I remembered, there is an option in one property file for that, >> > > Have you tried swift.properties? > yes I tried, but I didn't find it. Ben knows it, but probably, he is sleeping right now. 
zhao > >> but I >> couldn't >> find it. >> >> > > > From hategan at mcs.anl.gov Fri Feb 20 17:15:40 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 20 Feb 2009 17:15:40 -0600 (CST) Subject: [Swift-devel] absolute path In-Reply-To: <499F38B6.5070403@uchicago.edu> Message-ID: <416126.398011235171740732.JavaMail.root@zimbra> ----- Zhao Zhang wrote: > > > Mihael Hategan wrote: > > ----- Zhao Zhang wrote: > > > >> so, in the old version instead of > >> "/tmp/sleep-20090220-1646-7vdlcg0a/shared/wrapper.sh" > >> it is like this: > >> "shared/wrapper.sh" > >> > >> by this I mean it is a relative path. > >> > > > > Yes, of course, but relative paths are in respect to something. In > > other words, where do you want it to end on the remote site? > > > This is fine. I just made a change in the worker code, and it could > dynamically work with both cases. > > > >> I got another question about using signal notification instead of status > >> files, > >> as I remembered, there is an option in one property file for that, > >> > > > > Have you tried swift.properties? > > > yes I tried, but I didn't find it. You should probably do an SVN update and look at the end of that file: # Controls how Swift will communicate the result code of running user programs # from workers to the submit side. In files mode, a file # indicating success or failure will be created on the site shared filesystem. # In provider mode, the execution provider job status will # be used. Notably, GRAM2 does not return job statuses correctly, and so # provider mode will not work with GRAM2. With other # providers, it can be used to reduce the amount of filesystem access compared # to files mode. 
# # status.mode=files From zhaozhang at uchicago.edu Fri Feb 20 17:20:03 2009 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Fri, 20 Feb 2009 17:20:03 -0600 Subject: [Swift-devel] absolute path In-Reply-To: <416126.398011235171740732.JavaMail.root@zimbra> References: <416126.398011235171740732.JavaMail.root@zimbra> Message-ID: <499F3AA3.1050204@uchicago.edu> I tried this zzhang at login6.surveyor:/home/falkon/swift_scratch/cog/modules/swift> svn update U src/org/griphyn/vdl/mapping/RootDataNode.java Updated to revision 2579. but there is no such texts in the swift.properties. zhao Mihael Hategan wrote: > ----- Zhao Zhang wrote: > >> Mihael Hategan wrote: >> >>> ----- Zhao Zhang wrote: >>> >>> >>>> so, in the old version instead of >>>> "/tmp/sleep-20090220-1646-7vdlcg0a/shared/wrapper.sh" >>>> it is like this: >>>> "shared/wrapper.sh" >>>> >>>> by this I mean it is a relative path. >>>> >>>> >>> Yes, of course, but relative paths are in respect to something. In >>> other words, where do you want it to end on the remote site? >>> >>> >> This is fine. I just made a change in the worker code, and it could >> dynamically work with both cases. >> >>> >>> >>>> I got another question about using signal notification instead of status >>>> files, >>>> as I remembered, there is an option in one property file for that, >>>> >>>> >>> Have you tried swift.properties? >>> >>> >> yes I tried, but I didn't find it. >> > > You should probably do an SVN update and look at the end of that file: > > # Controls how Swift will communicate the result code of running user programs > # from workers to the submit side. In files mode, a file > # indicating success or failure will be created on the site shared filesystem. > # In provider mode, the execution provider job status will > # be used. Notably, GRAM2 does not return job statuses correctly, and so > # provider mode will not work with GRAM2. 
With other > # providers, it can be used to reduce the amount of filesystem access compared > # to files mode. > # > # status.mode=files > > From hategan at mcs.anl.gov Fri Feb 20 17:29:35 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 20 Feb 2009 17:29:35 -0600 (CST) Subject: [Swift-devel] absolute path In-Reply-To: <499F3AA3.1050204@uchicago.edu> Message-ID: <2461287.398561235172575261.JavaMail.root@zimbra> ----- Zhao Zhang wrote: > I tried this > zzhang at login6.surveyor:/home/falkon/swift_scratch/cog/modules/swift> svn > update > U src/org/griphyn/vdl/mapping/RootDataNode.java > Updated to revision 2579. > > but there is no such texts in the swift.properties. That's probably because you have a locally modified swift.properties that got messed up. Try this: swift> cd etc mv swift.properties swift.properties.mine svn up tail -n 16 swift.properties Then manually merge your customization into swift.properties and re-compile. > > zhao > > Mihael Hategan wrote: > > ----- Zhao Zhang wrote: > > > >> Mihael Hategan wrote: > >> > >>> ----- Zhao Zhang wrote: > >>> > >>> > >>>> so, in the old version instead of > >>>> "/tmp/sleep-20090220-1646-7vdlcg0a/shared/wrapper.sh" > >>>> it is like this: > >>>> "shared/wrapper.sh" > >>>> > >>>> by this I mean it is a relative path. > >>>> > >>>> > >>> Yes, of course, but relative paths are in respect to something. In > >>> other words, where do you want it to end on the remote site? > >>> > >>> > >> This is fine. I just made a change in the worker code, and it could > >> dynamically work with both cases. > >> > >>> > >>> > >>>> I got another question about using signal notification instead of status > >>>> files, > >>>> as I remembered, there is an option in one property file for that, > >>>> > >>>> > >>> Have you tried swift.properties? > >>> > >>> > >> yes I tried, but I didn't find it. 
> >> > > > > You should probably do an SVN update and look at the end of that file: > > > > # Controls how Swift will communicate the result code of running user programs > > # from workers to the submit side. In files mode, a file > > # indicating success or failure will be created on the site shared filesystem. > > # In provider mode, the execution provider job status will > > # be used. Notably, GRAM2 does not return job statuses correctly, and so > > # provider mode will not work with GRAM2. With other > > # providers, it can be used to reduce the amount of filesystem access compared > > # to files mode. > > # > > # status.mode=files > > > > From benc at hawaga.org.uk Fri Feb 20 17:30:25 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 20 Feb 2009 23:30:25 +0000 (GMT) Subject: [Swift-devel] absolute path In-Reply-To: <499F3351.2060000@uchicago.edu> References: <499F3351.2060000@uchicago.edu> Message-ID: On Fri, 20 Feb 2009, Zhao Zhang wrote: > I found that in the latest swift code, the task description is using absolute > path like this: > /bin/bash /tmp/sleep-20090220-1646-7vdlcg0a/shared/wrapper.sh sleep-hoyn8x6j > -jobdir h/o -e /bin/sleep -out stdout.txt -err stderr.txt -i -d -if -of -k > -a 30 > > I mean the "/tmp/sleep-20090220-1646-7vdlcg0a/shared/wrapper.sh" part. > Is there an option that we use relative path? thanks yes, that's a change made recently because I discovered that some sites do not respect the initial working directory specified in job submissions. In the rest of this thread, you don't clearly describe *why* you want a relative path - you're clearly trying to achieve some higher goal but it is not clear what. 
-- From zhaozhang at uchicago.edu Fri Feb 20 17:32:15 2009 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Fri, 20 Feb 2009 17:32:15 -0600 Subject: [Swift-devel] absolute path In-Reply-To: References: <499F3351.2060000@uchicago.edu> Message-ID: <499F3D7F.5010708@uchicago.edu> it doesn't matter right now, I was just trying out the latest version of swift and stuck, then I solved the problem. zhao Ben Clifford wrote: > On Fri, 20 Feb 2009, Zhao Zhang wrote: > > >> I found that in the latest swift code, the task description is using absolute >> path like this: >> /bin/bash /tmp/sleep-20090220-1646-7vdlcg0a/shared/wrapper.sh sleep-hoyn8x6j >> -jobdir h/o -e /bin/sleep -out stdout.txt -err stderr.txt -i -d -if -of -k >> -a 30 >> >> I mean the "/tmp/sleep-20090220-1646-7vdlcg0a/shared/wrapper.sh" part. >> Is there an option that we use relative path? thanks >> > > yes, that's a change made recently because I discovered that some sites do > not respect the initial working directory specified in job submissions. > > In the rest of this thread, you don't clearly describe *why* you want a > relative path - you're clearly trying to achieve some higher goal but it > is not clear what. 
> > From benc at hawaga.org.uk Fri Feb 20 17:32:34 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 20 Feb 2009 23:32:34 +0000 (GMT) Subject: [Swift-devel] absolute path In-Reply-To: <2461287.398561235172575261.JavaMail.root@zimbra> References: <2461287.398561235172575261.JavaMail.root@zimbra> Message-ID: On Fri, 20 Feb 2009, Mihael Hategan wrote: > Try this: > swift> cd etc > mv swift.properties swift.properties.mine > svn up > tail -n 16 swift.properties also paste the output of: svn info swift.properties -- From zhaozhang at uchicago.edu Fri Feb 20 17:35:09 2009 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Fri, 20 Feb 2009 17:35:09 -0600 Subject: [Swift-devel] absolute path In-Reply-To: References: <2461287.398561235172575261.JavaMail.root@zimbra> Message-ID: <499F3E2D.5030706@uchicago.edu> yep, I got it. will try this out soon. zzhang at login6.surveyor:/home/falkon/swift_scratch/cog/modules/swift/etc> svn info swift.properties Path: swift.properties Name: swift.properties URL: https://svn.ci.uchicago.edu/svn/vdl2/trunk/etc/swift.properties Repository Root: https://svn.ci.uchicago.edu/svn/vdl2 Repository UUID: e2bb083e-7f23-0410-b3a8-8253ac9ef6d8 Revision: 2579 Node Kind: file Schedule: normal Last Changed Author: benc Last Changed Rev: 2533 Last Changed Date: 2009-02-13 07:54:41 -0600 (Fri, 13 Feb 2009) Text Last Updated: 2009-02-20 17:32:31 -0600 (Fri, 20 Feb 2009) Properties Last Updated: 2009-02-20 11:03:38 -0600 (Fri, 20 Feb 2009) Checksum: c7124b6c27e8bc56b68f2d197d31c96d zhao Ben Clifford wrote: > On Fri, 20 Feb 2009, Mihael Hategan wrote: > > >> Try this: >> swift> cd etc >> mv swift.properties swift.properties.mine >> svn up >> tail -n 16 swift.properties >> > > also paste the output of: > > svn info swift.properties > > From benc at hawaga.org.uk Fri Feb 20 17:40:08 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 20 Feb 2009 23:40:08 +0000 (GMT) Subject: [Swift-devel] absolute path In-Reply-To: 
<499F3D7F.5010708@uchicago.edu> References: <499F3351.2060000@uchicago.edu> <499F3D7F.5010708@uchicago.edu> Message-ID: On Fri, 20 Feb 2009, Zhao Zhang wrote: > it doesn't matter right now, I was just trying out the latest version of swift > and stuck, then I solved the problem. It does matter, in that I made a change that was, as far as I could see, either identical or beneficial in all circumstances. But apparently this change caused trouble for you. If you write about what you were trying to do and how you solved your problem, you will help the Swift developers understand what you are doing and you are likely to encounter fewer problems in the future. If you keep secrets, then we cannot help you. -- From zhaozhang at uchicago.edu Fri Feb 20 18:01:33 2009 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Fri, 20 Feb 2009 18:01:33 -0600 Subject: [Swift-devel] absolute path In-Reply-To: References: <499F3351.2060000@uchicago.edu> <499F3D7F.5010708@uchicago.edu> Message-ID: <499F445D.7070205@uchicago.edu> sure i am not keeping my secrets, : ) The context on BGP is, the working "sleep-20090220-1646-7vdlcg0a" directory is created on IO nodes at "/tmp/sleep-20090220-1646-7vdlcg0a/", and mounted on Compute Node through fuse. which means Compute Nodes need to enter this directory through /fuse/tmp/sleep-20090220-1646-7vdlcg0a, then with the old style of relative path "shared/wrapper.sh", everything is fine. And the wrapper.sh knows where the job is started, so all output data will be copied to the job dir on IO nodes. So in the new case, we are using the absolute path for wrapper.sh, after worker enters the working directory on IO nodes, it tried to invoke "/tmp/sleep-20090220-1646-7vdlcg0a/shared/wrapper.sh", which it has no idea where it is. So what I did is that if the path of wrapper.sh starts with a absolute path, then put a "/fuse" in front of "/tmp/sleep-20090220-1646-7vdlcg0a/shared/wrapper.sh", or else just work as it was. 
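[Editorial sketch] The workaround described above can be illustrated roughly as follows. This is only a minimal sketch, assuming the worker is a shell script and that the IO-node filesystem is visible under /fuse on the compute nodes, as stated in the message. The function name resolve_wrapper_path and the FUSE_PREFIX variable are hypothetical; the actual Falkon worker code is not shown in this thread and may look quite different.

```shell
#!/bin/bash
# Hypothetical helper, not the real Falkon worker code.
# Maps the wrapper path found in a Swift task description to a path
# that a compute node can execute.
# FUSE_PREFIX is an assumption based on the /fuse mount described above.
FUSE_PREFIX="${FUSE_PREFIX:-/fuse}"

resolve_wrapper_path() {
    local path="$1"
    case "$path" in
        # Absolute path (newer Swift behaviour): prepend the fuse mount point.
        /*) printf '%s%s\n' "$FUSE_PREFIX" "$path" ;;
        # Relative path (older Swift behaviour): use it unchanged.
        *)  printf '%s\n' "$path" ;;
    esac
}

resolve_wrapper_path "/tmp/sleep-20090220-1646-7vdlcg0a/shared/wrapper.sh"
resolve_wrapper_path "shared/wrapper.sh"
```

With the default prefix, the first call prints the path with /fuse in front and the second prints the relative path as given, which matches the "work with both cases" behaviour described in the message.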
Any point if it is not clear, let me know asap. best wishes zhangzhao Ben Clifford wrote: > On Fri, 20 Feb 2009, Zhao Zhang wrote: > > >> it doesn't matter right now, I was just trying out the latest version of swift >> and stuck, then I solved the problem. >> > > It does matter, in that I made a change that was, as far as I could see, > either identical or beneficial in all circumstances. > > But apparently this change caused trouble for you. > > If you write about what you were trying to do and how you solved your > problem, you will help the Swift developers understand what you are doing > and you are likely to encounter fewer problems in the future. > > If you keep secrets, then we cannot help you. > > From benc at hawaga.org.uk Sat Feb 21 03:10:14 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Sat, 21 Feb 2009 09:10:14 +0000 (GMT) Subject: [Swift-devel] filesystem mapper In-Reply-To: <499F148C.7050906@uchicago.edu> References: <499F148C.7050906@uchicago.edu> Message-ID: can you send me the .swift and .kml files for this? Did you recompile your kml file after upgrading? If not, you may find a kml file from an older Swift does not work with the present version. On Fri, 20 Feb 2009, Zhao Zhang wrote: > Hi, > > I have a problem with file system mapper in the latest version of swift. > The code looks like: > Mol texts[] ; > > It is trying to map all .mol2 files in the input directory, and it works fine > with an older version of swift which is > Swift svn swift-r2334 (Swift modified locally) cog-r2216 > > But failed with the following information > zzhang at login6.surveyor:~/new_dock6> ./run_swift_ssh.sh 1010 64 test.swift > waiting for at least 64 nodes to register before submitting workload... > waiting to find at least 1 services in file > /home/falkon/users/zzhang/1010/config/Client-service-URIs.config... > all done, file has found at least 1 services > found at least 64 registered, submitting workload... 
> Swift svn swift-r2578 cog-r2305 > > RunID: 20090220-1428-ugfvnoya > Progress: > Execution failed: > Getting value for array texts.$[]/1 which is not permitted. > > The log file is at > http://www.ci.uchicago.edu/~zzhang/test-20090220-1428-ugfvnoya.log > > zhao > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > From iraicu at cs.uchicago.edu Sat Feb 21 06:37:59 2009 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Sat, 21 Feb 2009 06:37:59 -0600 Subject: [Swift-devel] [Fwd: [Dbworld] Extended Deadline:CFP: IEEE International Workshop on Scientific Workflows (SWF 2009)] Message-ID: <499FF5A7.5060308@cs.uchicago.edu> It seems that the SWF09 deadline has been extended to March 16th. Cheers, Ioan -------- Original Message -------- Subject: [Dbworld] Extended Deadline:CFP: IEEE International Workshop on Scientific Workflows (SWF 2009) Date: Fri, 20 Feb 2009 16:35:21 -0600 From: Shiyong Lu Reply-To: dbworld_owner at yahoo.com To: dbworld at cs.wisc.edu EXTENDED DEADLINE: Due to numerous requests and a discussion with the ICWS organizing committee, the SWF submission deadline is extended to March 16, 2009, existing submissions can be updated before the deadline. Call for Papers IEEE 2009 Third International Workshop on Scientific Workflows (SWF 2009) http://www.servicescongress.org/2009/1/swf-2009.html Los Angeles, USA, July 10, 2009 In conjunction with IEEE International Conference on Web Services (ICWS 2009) Description Today, many scientific discoveries are achieved through complex and distributed scientific computations that are represented and structured as scientific workflows. 
User friendly scientific workflow systems are increasingly being developed to enable e-scientists to integrate, structure, and orchestrate various local or remote data and service resources to perform various in silico experiments to produce interesting scientific discovery. The critical role of scientific workflows in cyberinfrastructure has been recognized by a recent NSF workshop on the challenges of scientific workflows in May 2006, which concluded that "workflows should become first-class entities in cyberinfrastructure architecture. For domain scientists, they are important because workflows document and manage the increasingly complex processes involved in exploration and discovery through computations. For computer scientists, workflows provide a formal and declarative representation of complex distributed computations that must be managed efficiently through their lifecycle from assembly, to execution, to sharing."

Authors are invited to submit regular papers (8 pages), short papers (4 pages), and demo papers (2 pages) that show original unpublished research results in all areas of scientific workflows. Topics of interest are listed below; however, submissions on all aspects of scientific workflows are welcome. For demo papers, at least one author is expected to present a demo in the workshop during the demo session, special arrangement will be made to meet the need of the authors. Accepted SWF 2009 papers will be included in the proceedings of SERVICES 2009 (Part I), which will be published by IEEE Computer Society Press.
Topics o Architecture, model, and language o Provenance management o Task management o Workflow scheduling o Data product management o Monitoring and failure handling o Service, Grid, and Cloud workflows o Scientific workflow composition o Scientific workflow security o Modeling, simulation, analysis o Scalability, reliability, extensibility o Scientific workflow applications o Service-oriented scientific workflows and workflow-based Web services o Security of Web services and scientific workflows o Data integration and service integration in scientific workflows o Application service management in scientific workflows o Data service management in scientific workflows o Scientific workflow architectures, models, and languages o Grid workflow management o Scientific workflow mapping, optimization, and scheduling o Scientific workflow modeling, verification, and validation o Scientific workflow provenance management o Workflow and provenance mining and analysis o Scalability, reliability, extensibility, agility, and interoperability o Scientific workflow real-life applications Important dates Paper Submission March 16, 2009 Decision Notification (Electronic) April 2, 2009 Camera-Ready Submission & Pre-registration April 17, 2009 Workshop organizers Workshop chairs: Shiyong Lu, Wayne State University, shiyong at wayne.edu; Calton Pu, Georgia Tech Publicity chairs: Yong Zhao, Microsoft Corporation; Ilkay Altintas, San Diego Supercomputer Center Publication chair: Cui Lin, Wayne State University For any questions, please send e-mails to Shiyong Lu at shiyong at wayne.edu. Previous SWF workshops http://www.cs.wayne.edu/~shiyong/swf IEEE 2009 Third International Workshop on Scientific Workflows _______________________________________________ Please do not post msgs that are not relevant to the database community at large. Go to www.cs.wisc.edu/dbworld for guidelines and posting forms. 
To unsubscribe, go to https://lists.cs.wisc.edu/mailman/listinfo/dbworld -- =================================================== Ioan Raicu, Ph.D. =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== -------------- next part -------------- An HTML attachment was scrubbed... URL: From iraicu at cs.uchicago.edu Sat Feb 21 06:39:30 2009 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Sat, 21 Feb 2009 06:39:30 -0600 Subject: [Swift-devel] [Fwd: [Dbworld] 2nd CFP: Special issue on Scientific Workflows in IJBPIM] Message-ID: <499FF602.3080102@cs.uchicago.edu> Here might be a good venue for a future paper on Swift/Falkon, with a May 1st deadline. Cheers, Ioan -------- Original Message -------- Subject: [Dbworld] 2nd CFP: Special issue on Scientific Workflows in IJBPIM Date: Fri, 20 Feb 2009 16:46:49 -0600 From: Shiyong Lu Reply-To: dbworld_owner at yahoo.com To: dbworld at cs.wisc.edu Call for Papers Special Issue on Scientific Workflows International Journal of Business Process Integration and Management (IJBPIM) http://www.cs.wayne.edu/~shiyong/swf/ijbpim09.html Description Scientific workflows have recently emerged as a new paradigm for scientists to formalize and structure complex scientific processes to enable and accelerate many significant scientific discoveries. 
A scientific workflow is a formal specification of a scientific process, which represents, streamlines, and automates the analytical and computational steps that a scientist needs to go through from dataset selection and integration, computation and analysis, to final data product presentation and visualization. A scientific workflow management system (SWFMS) is a system that supports the specification, modification, execution, failure recovery, and monitoring of a scientific workflow using the workflow logic to control the order of executing workflow tasks.

The goal of this special issue is to present critical challenges, requirements, and issues related to scientific workflows. This collection of manuscripts will discuss key aspects in the development of a broad range of novel and innovative scientific workflow technologies. The emphasis of the special issue is on critical challenges in the development of various scientific workflows specifically as they relate to business workflow and service technologies. Particular emphasis will be placed on examples where innovative solutions to these challenges have resulted in scientific workflows which impact the scientific discovery process. Topics include but are not limited to:

List of topics
o Scientific workflow provenance management
o Scientific workflow provenance analytics
o Scientific workflow data, metadata, service, and task management
o Scientific workflow architectures, models, and languages
o Scientific workflow monitoring and failure handling
o Streaming data processing in scientific workflows
o Pipelined, data, workflow, and task parallelism in scientific workflows
o Service, Grid, or Cloud-based scientific workflows
o Data, metadata, compute, user-interaction, or visualization-intensive scientific workflows
o Scientific workflow composition
o Security issues in scientific workflows
o Data integration and service integration in scientific workflows
o Scientific workflow mapping, optimization, and scheduling
o Scientific workflow modeling, verification, and validation
o Scalability, reliability, extensibility, agility, and interoperability
o Scientific workflow real-life applications

Important dates
o May 1, 2009, paper submission
o August 1, 2009, notification
o November 1, 2009, camera-ready version
o Planned publication, end of 2009/early 2010

Guest editors
o Shiyong Lu, Wayne State University, U.S.A., Email: shiyong at wayne.edu
o Ewa Deelman, USC Information Sciences Institute, U.S.A., Email: deelman at isi.edu
o Zhiming Zhao, University of Amsterdam, the Netherlands, Email: z.zhao at uva.nl

Submission details
Submitted papers should not have been previously published nor be currently under consideration for publication elsewhere. All papers are refereed through a peer review process. Please submit your paper at http://www.servicescomputing.org/ijbpim.

Contact information
All enquiries about the special issue should be sent to Shiyong Lu at shiyong at wayne.edu.
_______________________________________________
Please do not post msgs that are not relevant to the database community at large. Go to www.cs.wisc.edu/dbworld for guidelines and posting forms. To unsubscribe, go to https://lists.cs.wisc.edu/mailman/listinfo/dbworld

--
===================================================
Ioan Raicu, Ph.D.
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web: http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================
===================================================
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From benc at hawaga.org.uk Sat Feb 21 08:11:03 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Sat, 21 Feb 2009 14:11:03 +0000 (GMT) Subject: [Swift-devel] absolute path In-Reply-To: <499F445D.7070205@uchicago.edu> References: <499F3351.2060000@uchicago.edu> <499F3D7F.5010708@uchicago.edu> <499F445D.7070205@uchicago.edu> Message-ID: On Fri, 20 Feb 2009, Zhao Zhang wrote: > The context on BGP is, the working "sleep-20090220-1646-7vdlcg0a" directory is > created on IO nodes at > "/tmp/sleep-20090220-1646-7vdlcg0a/", and mounted on Compute Node through > fuse. which means Compute Nodes > need to enter this directory through /fuse/tmp/sleep-20090220-1646-7vdlcg0a, > then with the old style of relative path > "shared/wrapper.sh", everything is fine. And the wrapper.sh knows where the > job is started, so all output data will be > copied to the job dir on IO nodes. ok, I'll make a config option to allow you to choose whether wrapper.sh is invoked with an absolute path in the command line or not. -- From bugzilla-daemon at mcs.anl.gov Sat Feb 21 09:03:59 2009 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Sat, 21 Feb 2009 09:03:59 -0600 (CST) Subject: [Swift-devel] [Bug 169] submit-side timeouts (or other fault detection) to accommodate some byzantine site failures In-Reply-To: Message-ID: <20090221150359.1E715164CE@foxtrot.mcs.anl.gov> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=169 hategan at mcs.anl.gov changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |ASSIGNED ------- Comment #1 from hategan at mcs.anl.gov 2008-12-18 13:46 ------- A 2*walltime timeout is what we discussed before and agreed upon. This needs to be implemented. ------- Comment #2 from benc at hawaga.org.uk 2009-02-21 09:03 ------- Mihael implemented this for job submission. 
It isn't for file operations or transfers; and likely doesn't behave well with coasters enabled, so leaving this open as a to-do for those. -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You reported the bug, or are watching the reporter. You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at mcs.anl.gov Sat Feb 21 09:09:06 2009 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Sat, 21 Feb 2009 09:09:06 -0600 (CST) Subject: [Swift-devel] [Bug 176] New: config option to invoke wrapper script with relative path Message-ID: http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=176 Summary: config option to invoke wrapper script with relative path Product: Swift Version: unspecified Platform: Macintosh OS/Version: Mac OS Status: NEW Severity: normal Priority: P2 Component: General AssignedTo: benc at hawaga.org.uk ReportedBy: benc at hawaga.org.uk Versions of Swift before 0.8 invoked the wrapper script using a relative path, and relied on the job submission system to start in the directory requested in the job submission. Some sites do not do that, instead starting in a clean job-specific working directory. Swift 0.8 had different behaviour, with explicit specification of the path to wrapper.sh. However, this fails to work on sites where the site-shared filesystem is mapped into local filesystems differently on the worker nodes and through the submission system, such as present experimental use of Falkon on BG/P. In such cases, Falkon starts the job in the correct directory, which Swift then ignores. A configuration option to switch between absolute and relative behaviour should be provided. -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You reported the bug, or are watching the reporter. 
You are the assignee for the bug, or are watching the assignee. From benc at hawaga.org.uk Sat Feb 21 09:09:34 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Sat, 21 Feb 2009 15:09:34 +0000 (GMT) Subject: [Swift-devel] absolute path In-Reply-To: References: <499F3351.2060000@uchicago.edu> <499F3D7F.5010708@uchicago.edu> <499F445D.7070205@uchicago.edu> Message-ID: On Sat, 21 Feb 2009, Ben Clifford wrote: > ok, I'll make a config option to allow you to choose whether wrapper.sh is > invoked with an absolute path in the command line or not. bug 176 if you want to keep an eye on this. http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=176 -- From bugzilla-daemon at mcs.anl.gov Sat Feb 21 09:12:42 2009 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Sat, 21 Feb 2009 09:12:42 -0600 (CST) Subject: [Swift-devel] [Bug 177] New: variables declared inside an iterate body should be available to the termination condition Message-ID: http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=177 Summary: variables declared inside an iterate body should be available to the termination condition Product: Swift Version: unspecified Platform: Macintosh OS/Version: Mac OS Status: NEW Severity: enhancement Priority: P2 Component: SwiftScript language AssignedTo: benc at hawaga.org.uk ReportedBy: benc at hawaga.org.uk Variables declared inside an iterate body are not available to the termination expression. Those variables should be made available to the termination expression. As far as I can tell, the lack of this ability does not restrict what can be expressed, as a variable v used inside the loop can always be transformed into an array element v[ix] with ix the iteration index, and v declared outside of the iteration loop. However, it does force a certain coding style which can be unintuitive. 
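The transformation described above can be sketched outside SwiftScript. The Python below is only an illustration under stated assumptions: `iterate`, `body` and `done` are hypothetical names, not Swift APIs, and a dict stands in for a SwiftScript array declared outside the loop. It emulates a loop whose termination test cannot see variables declared in the body, so the per-iteration value is stored as v[ix] instead.

```python
# Illustrative emulation only; none of these names come from Swift.

def iterate(body, until):
    """Run body(ix) for ix = 0, 1, 2, ... and stop once until(ix) is true.

    The termination test receives only the iteration index, mirroring a
    loop whose condition cannot see variables declared inside the body.
    """
    ix = 0
    while True:
        body(ix)
        if until(ix):
            return ix
        ix += 1


# Declared outside the loop so the termination test can read it.
# A dict stands in for an array whose elements are each assigned once.
v = {}


def body(ix):
    # What would have been a loop-local variable v becomes element v[ix].
    v[ix] = (ix + 1) ** 2


def done(ix):
    # The condition reads the value produced by the current iteration.
    return v[ix] > 20


last_ix = iterate(body, done)
print(last_ix, v[last_ix])
```

The cost of the workaround is visible here: the value has to live in an outer, index-addressed structure purely so that the condition can reach it, which is the unintuitive coding style the report refers to.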
--
Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You reported the bug, or are watching the reporter.
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at mcs.anl.gov Sat Feb 21 09:32:56 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Sat, 21 Feb 2009 09:32:56 -0600 (CST)
Subject: [Swift-devel] [Bug 169] submit-side timeouts (or other fault detection) to accommodate some byzantine site failures
In-Reply-To:
Message-ID: <20090221153256.1631B164B3@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=169

------- Comment #3 from hategan at mcs.anl.gov 2009-02-21 09:32 -------
(In reply to comment #2)
> Mihael implemented this for job submission. It isn't for file operations or
> transfers; and likely doesn't behave well with coasters enabled, so leaving
> this open as a to-do for those.
>

I will likely not implement this for file ops/transfers. At least not for now.
That's because most of the implementations for those have their own timeouts.

Why would this not behave well with coasters?

--
Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You reported the bug, or are watching the reporter.
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at mcs.anl.gov Sat Feb 21 15:18:23 2009 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Sat, 21 Feb 2009 15:18:23 -0600 (CST) Subject: [Swift-devel] [Bug 169] submit-side timeouts (or other fault detection) to accommodate some byzantine site failures In-Reply-To: Message-ID: <20090221211823.6D538164B3@foxtrot.mcs.anl.gov> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=169 ------- Comment #4 from benc at hawaga.org.uk 2009-02-21 15:18 ------- http://mail.ci.uchicago.edu/pipermail/swift-devel/2009-February/004382.html -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You reported the bug, or are watching the reporter. You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at mcs.anl.gov Sat Feb 21 15:18:46 2009 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Sat, 21 Feb 2009 15:18:46 -0600 (CST) Subject: [Swift-devel] [Bug 169] submit-side timeouts (or other fault detection) to accommodate some byzantine site failures In-Reply-To: Message-ID: <20090221211846.F186E164B3@foxtrot.mcs.anl.gov> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=169 ------- Comment #5 from benc at hawaga.org.uk 2009-02-21 15:18 ------- although I did mean clusters, not coasters... -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You reported the bug, or are watching the reporter. You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at mcs.anl.gov Sat Feb 21 15:31:04 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Sat, 21 Feb 2009 15:31:04 -0600 (CST)
Subject: [Swift-devel] [Bug 169] submit-side timeouts (or other fault detection) to accommodate some byzantine site failures
In-Reply-To:
Message-ID: <20090221213104.5F644164B3@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=169

------- Comment #6 from hategan at mcs.anl.gov 2009-02-21 15:31 -------
(In reply to comment #5)
> although I did mean clusters, not coasters...
>

That's what confused me. We're clear now.

--
Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You reported the bug, or are watching the reporter.
You are the assignee for the bug, or are watching the assignee.

From aespinosa at cs.uchicago.edu Sun Feb 22 18:21:25 2009
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Sun, 22 Feb 2009 18:21:25 -0600
Subject: [Swift-devel] different host CN expectations in gram and gridftp server
Message-ID: <50b07b4b0902221621s52239835xf920e665e8cfce5f@mail.gmail.com>

Swift expects a different CN from the gridftp server and the gram server, which creates the authentication problems below.
Doing a gridftp url also gives the same message but in addition displays the authorization error as a runtime exception

Swift version: swift-r2580 cog-r2305

From a ranger login host:

RunID: 20090222-1815-9ly285cb
Progress:
Progress: Initializing:25 Selecting site:6
Progress: Selecting site:25 Stage in:4 Submitting:2
Progress: Selecting site:25 Submitting:5 Submitted:1
Failed to transfer wrapper log from test-20090222-1815-9ly285cb/info/m on RANGER
Failed to transfer wrapper log from test-20090222-1815-9ly285cb/info/i on RANGER
Failed to transfer wrapper log from test-20090222-1815-9ly285cb/info/l on RANGER
Failed to transfer wrapper log from test-20090222-1815-9ly285cb/info/k on RANGER
Failed to transfer wrapper log from test-20090222-1815-9ly285cb/info/j on RANGER
Failed to transfer wrapper log from test-20090222-1815-9ly285cb/info/n on RANGER

logfile:
Could not start coaster service
Caused by: org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Cannot submit job
Caused by: org.globus.gram.GramException: Data transfer to the server failed [Caused by: Authentication failed [Caused by: Operation unauthorized (Mechanism level: [JGLOBUS-56] Authorization failed.
Expected "/CN=host/129.114.50.163" target but received "/C=US/O=UTAustin/OU=TACC/CN=login3.ranger.tacc.utexas.edu")]] 2 1 16 /work/01035/tg802895/swift-runs -Allan From aespinosa at cs.uchicago.edu Sun Feb 22 18:25:38 2009 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Sun, 22 Feb 2009 18:25:38 -0600 Subject: [Swift-devel] Re: different host CN expectations in gram and gridftp server In-Reply-To: <50b07b4b0902221621s52239835xf920e665e8cfce5f@mail.gmail.com> References: <50b07b4b0902221621s52239835xf920e665e8cfce5f@mail.gmail.com> Message-ID: <50b07b4b0902221625g26cdad89h188d22358e1a16d9@mail.gmail.com> I rsync'ed the logfile to ~benc/swift-logs: test-20090222-1815-9ly285cb On Sun, Feb 22, 2009 at 6:21 PM, Allan Espinosa wrote: > Swift expects and different CN from the gridftp server and gram server > and creates the authentication problems below. Doing a gridftp url > also gives the same message but in addition displays the authorization > error as a runtime exception > -- Allan M. Espinosa PhD student, Computer Science University of Chicago From hategan at mcs.anl.gov Sun Feb 22 21:55:43 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 22 Feb 2009 21:55:43 -0600 Subject: [Swift-devel] different host CN expectations in gram and gridftp server In-Reply-To: <50b07b4b0902221621s52239835xf920e665e8cfce5f@mail.gmail.com> References: <50b07b4b0902221621s52239835xf920e665e8cfce5f@mail.gmail.com> Message-ID: <1235361343.1273.6.camel@localhost> Can you try a globusrun from the same host to gatekeeper.ranger? Mihael On Sun, 2009-02-22 at 18:21 -0600, Allan Espinosa wrote: > Swift expects and different CN from the gridftp server and gram server > and creates the authentication problems below. 
Doing a gridftp url > also gives the same message but in addition displays the authorization > error as a runtime exception > > Swift version: swift-r2580 cog-r2305 > > >From a ranger login host: > > RunID: 20090222-1815-9ly285cb > Progress: > Progress: Initializing:25 Selecting site:6 > Progress: Selecting site:25 Stage in:4 Submitting:2 > Progress: Selecting site:25 Submitting:5 Submitted:1 > Failed to transfer wrapper log from test-20090222-1815-9ly285cb/info/m > on RANGERFailed to transfer wrapper log from > test-20090222-1815-9ly285cb/info/i on RANGERFailed to transfer wrapper > log from test-20090222-1815-9ly285cb/info/l on RANGERFailed to > transfer wrapper log from test-20090222-1815-9ly285cb/info/k on > RANGERFailed to transfer wrapper log from > test-20090222-1815-9ly285cb/info/j on RANGERFailed to transfer wrapper > log from test-20090222-1815-9ly285cb/info/n on RANGER > > logfile: > Could not start coaster service > Caused by: org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > Cannot submit job > Caused by: org.globus.gram.GramException: Data transfer to the server > failed [Caused by: Authentication failed [Caused by: Operation > unauthorized (Mechanism level: [JGLOBUS-56] Authorization failed. 
> Expected "/CN=host/129.114.50.163" target but received > "/C=US/O=UTAustin/OU=TACC/CN=login3.ranger.tacc.utexas.edu")]] > > > > 2 > 1 > 16 > > > url="gatekeeper.ranger.tacc.teragrid.org" jobManager="gt2:gt2:SGE"/> > /work/01035/tg802895/swift-runs > > > > > -Allan > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From bugzilla-daemon at mcs.anl.gov Mon Feb 23 09:44:05 2009 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Mon, 23 Feb 2009 09:44:05 -0600 (CST) Subject: [Swift-devel] [Bug 177] variables declared inside an iterate body should be available to the termination condition In-Reply-To: Message-ID: <20090223154405.E6F8D164B3@foxtrot.mcs.anl.gov> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=177 benc at hawaga.org.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #1 from benc at hawaga.org.uk 2009-02-23 09:44 ------- this should be fixed in r2593 -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You reported the bug, or are watching the reporter. You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at mcs.anl.gov Mon Feb 23 09:50:44 2009 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Mon, 23 Feb 2009 09:50:44 -0600 (CST) Subject: [Swift-devel] [Bug 61] semantics of [*] and multi-return-values need clarifying In-Reply-To: Message-ID: <20090223155044.B2E82164B3@foxtrot.mcs.anl.gov> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=61 benc at hawaga.org.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |ASSIGNED ------- Comment #2 from benc at hawaga.org.uk 2009-02-23 09:50 ------- This has mostly been done as of r2538. However, the parser still appears to take .* for structure access, which needs some more consideration and tidying. -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You reported the bug, or are watching the reporter. You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at mcs.anl.gov Mon Feb 23 09:53:33 2009 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Mon, 23 Feb 2009 09:53:33 -0600 (CST) Subject: [Swift-devel] [Bug 172] filesystem and gridftp element in the same pool. In-Reply-To: Message-ID: <20090223155333.20F53164CF@foxtrot.mcs.anl.gov> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=172 ------- Comment #1 from benc at hawaga.org.uk 2009-02-23 09:53 ------- This should be a general test for duplicates - not only for a gridftp and a filesystem specified in the same entry, but also multiple filesystem entries, and other combinations that are illegal. -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From aespinosa at cs.uchicago.edu Mon Feb 23 10:48:10 2009 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Mon, 23 Feb 2009 10:48:10 -0600 Subject: [Swift-devel] different host CN expectations in gram and gridftp server In-Reply-To: <1235361343.1273.6.camel@localhost> References: <50b07b4b0902221621s52239835xf920e665e8cfce5f@mail.gmail.com> <1235361343.1273.6.camel@localhost> Message-ID: <50b07b4b0902230848v15e1394dh829fcb2bbf94a578@mail.gmail.com> These are all run from a Ranger login node Here's the output: login4$ globusrun -b -r gatekeeper.ranger.tacc.teragrid.org '&(executable=/bin/hostname)' globus_gram_client_callback_allow successful GRAM Job submission successful https://gatekeeper.ranger.tacc.teragrid.org:50004/24184/1235407542/ GLOBUS_GRAM_PROTOCOL_JOB_STATE_ACTIVE login4$ globus-job-run: login4$ globus-job-run gatekeeper.ranger.tacc.teragrid.org /bin/hostname login3.ranger.tacc.utexas.edu -Allan On Sun, Feb 22, 2009 at 9:55 PM, Mihael Hategan wrote: > Can you try a globusrun from the same host to gatekeeper.ranger? > > Mihael > > On Sun, 2009-02-22 at 18:21 -0600, Allan Espinosa wrote: >> Swift expects and different CN from the gridftp server and gram server >> and creates the authentication problems below. 
Doing a gridftp url >> also gives the same message but in addition displays the authorization >> error as a runtime exception >> >> Swift version: swift-r2580 cog-r2305 >> >> >From a ranger login host: >> >> RunID: 20090222-1815-9ly285cb >> Progress: >> Progress: Initializing:25 Selecting site:6 >> Progress: Selecting site:25 Stage in:4 Submitting:2 >> Progress: Selecting site:25 Submitting:5 Submitted:1 >> Failed to transfer wrapper log from test-20090222-1815-9ly285cb/info/m >> on RANGERFailed to transfer wrapper log from >> test-20090222-1815-9ly285cb/info/i on RANGERFailed to transfer wrapper >> log from test-20090222-1815-9ly285cb/info/l on RANGERFailed to >> transfer wrapper log from test-20090222-1815-9ly285cb/info/k on >> RANGERFailed to transfer wrapper log from >> test-20090222-1815-9ly285cb/info/j on RANGERFailed to transfer wrapper >> log from test-20090222-1815-9ly285cb/info/n on RANGER >> From hategan at mcs.anl.gov Mon Feb 23 10:54:32 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 23 Feb 2009 10:54:32 -0600 Subject: [Swift-devel] different host CN expectations in gram and gridftp server In-Reply-To: <50b07b4b0902230848v15e1394dh829fcb2bbf94a578@mail.gmail.com> References: <50b07b4b0902221621s52239835xf920e665e8cfce5f@mail.gmail.com> <1235361343.1273.6.camel@localhost> <50b07b4b0902230848v15e1394dh829fcb2bbf94a578@mail.gmail.com> Message-ID: <1235408072.10242.0.camel@localhost> Can you now run the same from login3 rather than login4? 
On Mon, 2009-02-23 at 10:48 -0600, Allan Espinosa wrote: > These are all run from a Ranger login node > Here's the output: > > login4$ globusrun -b -r gatekeeper.ranger.tacc.teragrid.org > '&(executable=/bin/hostname)' > globus_gram_client_callback_allow successful > GRAM Job submission successful > https://gatekeeper.ranger.tacc.teragrid.org:50004/24184/1235407542/ > GLOBUS_GRAM_PROTOCOL_JOB_STATE_ACTIVE > login4$ > > > globus-job-run: > login4$ globus-job-run gatekeeper.ranger.tacc.teragrid.org /bin/hostname > login3.ranger.tacc.utexas.edu > > > -Allan > > > On Sun, Feb 22, 2009 at 9:55 PM, Mihael Hategan wrote: > > Can you try a globusrun from the same host to gatekeeper.ranger? > > > > Mihael > > > > On Sun, 2009-02-22 at 18:21 -0600, Allan Espinosa wrote: > >> Swift expects and different CN from the gridftp server and gram server > >> and creates the authentication problems below. Doing a gridftp url > >> also gives the same message but in addition displays the authorization > >> error as a runtime exception > >> > >> Swift version: swift-r2580 cog-r2305 > >> > >> >From a ranger login host: > >> > >> RunID: 20090222-1815-9ly285cb > >> Progress: > >> Progress: Initializing:25 Selecting site:6 > >> Progress: Selecting site:25 Stage in:4 Submitting:2 > >> Progress: Selecting site:25 Submitting:5 Submitted:1 > >> Failed to transfer wrapper log from test-20090222-1815-9ly285cb/info/m > >> on RANGERFailed to transfer wrapper log from > >> test-20090222-1815-9ly285cb/info/i on RANGERFailed to transfer wrapper > >> log from test-20090222-1815-9ly285cb/info/l on RANGERFailed to > >> transfer wrapper log from test-20090222-1815-9ly285cb/info/k on > >> RANGERFailed to transfer wrapper log from > >> test-20090222-1815-9ly285cb/info/j on RANGERFailed to transfer wrapper > >> log from test-20090222-1815-9ly285cb/info/n on RANGER > >> From aespinosa at cs.uchicago.edu Mon Feb 23 10:56:32 2009 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Mon, 23 Feb 2009 10:56:32 
-0600
Subject: [Swift-devel] different host CN expectations in gram and gridftp server
In-Reply-To: <1235408072.10242.0.camel@localhost>
References: <50b07b4b0902221621s52239835xf920e665e8cfce5f@mail.gmail.com> <1235361343.1273.6.camel@localhost> <50b07b4b0902230848v15e1394dh829fcb2bbf94a578@mail.gmail.com> <1235408072.10242.0.camel@localhost>
Message-ID: <50b07b4b0902230856g18e11118v5f5a27d2d5eb7afc@mail.gmail.com>

The output is below. But I did my swift submit run from login4 too.

-Allan

login3$ globusrun -b -r gatekeeper.ranger.tacc.teragrid.org '&(executable=/bin/hostname)'
globus_gram_client_callback_allow successful
GRAM Job submission successful
https://gatekeeper.ranger.tacc.teragrid.org:50004/7306/1235408110/
GLOBUS_GRAM_PROTOCOL_JOB_STATE_ACTIVE
login3$ globus-job-run gatekeeper.ranger.tacc.teragrid.org /bin/hostname
login3.ranger.tacc.utexas.edu
login3$

On Mon, Feb 23, 2009 at 10:54 AM, Mihael Hategan wrote:
> Can you now run the same from login3 rather than login4?
>
> On Mon, 2009-02-23 at 10:48 -0600, Allan Espinosa wrote:
>> These are all run from a Ranger login node
>> Here's the output:
>>
>> login4$ globusrun -b -r gatekeeper.ranger.tacc.teragrid.org
>> '&(executable=/bin/hostname)'
>> globus_gram_client_callback_allow successful
>> GRAM Job submission successful
>> https://gatekeeper.ranger.tacc.teragrid.org:50004/24184/1235407542/
>> GLOBUS_GRAM_PROTOCOL_JOB_STATE_ACTIVE
>> login4$
>>
>>
>> globus-job-run:
>> login4$ globus-job-run gatekeeper.ranger.tacc.teragrid.org /bin/hostname
>> login3.ranger.tacc.utexas.edu
>>
>>
>> -Allan
>>
>>
>> On Sun, Feb 22, 2009 at 9:55 PM, Mihael Hategan wrote:
>> > Can you try a globusrun from the same host to gatekeeper.ranger?
>> >
>> > Mihael
>> >
>> > On Sun, 2009-02-22 at 18:21 -0600, Allan Espinosa wrote:
>> >> Swift expects and different CN from the gridftp server and gram server
>> >> and creates the authentication problems below.
Doing a gridftp url >> >> also gives the same message but in addition displays the authorization >> >> error as a runtime exception >> >> >> >> Swift version: swift-r2580 cog-r2305 >> >> >> >> >From a ranger login host: >> >> >> >> RunID: 20090222-1815-9ly285cb >> >> Progress: >> >> Progress: Initializing:25 Selecting site:6 >> >> Progress: Selecting site:25 Stage in:4 Submitting:2 >> >> Progress: Selecting site:25 Submitting:5 Submitted:1 >> >> Failed to transfer wrapper log from test-20090222-1815-9ly285cb/info/m >> >> on RANGERFailed to transfer wrapper log from >> >> test-20090222-1815-9ly285cb/info/i on RANGERFailed to transfer wrapper >> >> log from test-20090222-1815-9ly285cb/info/l on RANGERFailed to >> >> transfer wrapper log from test-20090222-1815-9ly285cb/info/k on >> >> RANGERFailed to transfer wrapper log from >> >> test-20090222-1815-9ly285cb/info/j on RANGERFailed to transfer wrapper >> >> log from test-20090222-1815-9ly285cb/info/n on RANGER >> >> > > > -- Allan M. Espinosa PhD student, Computer Science University of Chicago From benc at hawaga.org.uk Mon Feb 23 10:58:43 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 23 Feb 2009 16:58:43 +0000 (GMT) Subject: [Swift-devel] different host CN expectations in gram and gridftp server In-Reply-To: <50b07b4b0902230856g18e11118v5f5a27d2d5eb7afc@mail.gmail.com> References: <50b07b4b0902221621s52239835xf920e665e8cfce5f@mail.gmail.com> <1235361343.1273.6.camel@localhost> <50b07b4b0902230848v15e1394dh829fcb2bbf94a578@mail.gmail.com> <1235408072.10242.0.camel@localhost> <50b07b4b0902230856g18e11118v5f5a27d2d5eb7afc@mail.gmail.com> Message-ID: If you use gram2 instead of coasters+gram2, what happens? 
-- From bugzilla-daemon at mcs.anl.gov Tue Feb 24 06:54:26 2009 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Tue, 24 Feb 2009 06:54:26 -0600 (CST) Subject: [Swift-devel] [Bug 176] config option to invoke wrapper script with relative path In-Reply-To: Message-ID: <20090224125426.A77E0164CE@foxtrot.mcs.anl.gov> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=176 benc at hawaga.org.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #1 from benc at hawaga.org.uk 2009-02-24 06:54 ------- r2597 implements this, wrapper.invocation.mode, documented in the user guide. -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You reported the bug, or are watching the reporter. You are the assignee for the bug, or are watching the assignee. From benc at hawaga.org.uk Tue Feb 24 06:56:27 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 24 Feb 2009 12:56:27 +0000 (GMT) Subject: [Swift-devel] absolute path In-Reply-To: References: <499F3351.2060000@uchicago.edu> <499F3D7F.5010708@uchicago.edu> <499F445D.7070205@uchicago.edu> Message-ID: > > ok, I'll make a config option to allow you to choose whether wrapper.sh is > > invoked with an absolute path in the command line or not. > > bug 176 if you want to keep an eye on this. > > http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=176 r2597 provides such an option: wrapper.invocation.mode It is documented in the user guide and swift.properties Please let me know if this does what you want. 
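To make the two behaviours concrete, here is a minimal Python sketch of the choice the option controls. It is not Swift's implementation: the function name `wrapper_command`, its parameters, and the mode strings "absolute" and "relative" are assumptions for illustration only, and the user guide remains the authority on the accepted values. The run directory name is the one from the BG/P report in this thread.

```python
import posixpath


def wrapper_command(mode, site_workdir, run_dir):
    """Return the path used to invoke wrapper.sh (hypothetical helper)."""
    relative = posixpath.join("shared", "wrapper.sh")
    if mode == "relative":
        # Relies on the job being started inside the run directory.
        return relative
    if mode == "absolute":
        # Spells out the submit-side view of the site-shared filesystem.
        return posixpath.join(site_workdir, run_dir, relative)
    raise ValueError("unknown wrapper invocation mode: %r" % mode)


run_dir = "sleep-20090220-1646-7vdlcg0a"
print(wrapper_command("relative", "/tmp", run_dir))
print(wrapper_command("absolute", "/tmp", run_dir))
```

In the BG/P case the compute nodes see the run directory under /fuse/tmp rather than /tmp, so a path built from the submit-side view would not resolve there, while the relative form works as long as the job is started in the right directory.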
--

From bugzilla-daemon at mcs.anl.gov Tue Feb 24 07:30:28 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Tue, 24 Feb 2009 07:30:28 -0600 (CST)
Subject: [Swift-devel] [Bug 123] Array mappers should accept programmatically-built string[]s
In-Reply-To:
Message-ID: <20090224133028.CA887164CF@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=123

benc at hawaga.org.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED

------- Comment #1 from benc at hawaga.org.uk 2009-02-24 07:30 -------
This works now - it should have been fixed somewhere around Swift 0.8.

--
Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You reported the bug, or are watching the reporter.

From wilde at mcs.anl.gov Tue Feb 24 15:07:24 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 24 Feb 2009 15:07:24 -0600
Subject: [Swift-devel] Problem using @extractint() on derived file
Message-ID: <49A4618C.6050108@mcs.anl.gov>

This script:

---

type file;

app (file out) echo (string s) { echo s stdout=@out; }

file f<"a">;
int i;

f = echo("123");
i = @extractint(@f);
trace("i=", i);

---

produces:

Swift svn swift-r2552 cog-r2303

RunID: 20090224-1455-k1vj4uy7

Progress:

Execution failed:

  Reading integer content of file

Caused by:

  a (No such file or directory)

---

I seem to get the same behavior from readData, and the same if I explicitly specify "a" as the argument to @extractint();

Is this because @extractint() is not waiting for "f" to get produced?

I've extracted this example while debugging a script that uses an app to test the termination condition of an iterate loop.
- Mike From hategan at mcs.anl.gov Tue Feb 24 15:12:32 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 24 Feb 2009 15:12:32 -0600 Subject: [Swift-devel] Problem using @extractint() on derived file In-Reply-To: <49A4618C.6050108@mcs.anl.gov> References: <49A4618C.6050108@mcs.anl.gov> Message-ID: <1235509952.7505.2.camel@localhost> Where is that file with respect to: - the script - the place you are running this from On Tue, 2009-02-24 at 15:07 -0600, Michael Wilde wrote: > This script: > > --- > > type file; > > app (file out) echo (string s) { echo s stdout=@out; } > > file f<"a">; > int i; > > f = echo("123"); > i = @extractint(@f); > trace("i=", i); > > --- > > produces: > > Swift svn swift-r2552 cog-r2303 > > > > RunID: 20090224-1455-k1vj4uy7 > > Progress: > > Execution failed: > > Reading integer content of file > > Caused by: > > a (No such file or directory) > > > > --- > > I seem to get the same behavior from readData, and the same if I > explicitly specify "a" as the argument to @extractint(); > > Is this because @extractint() is not waiting for "f" to get produced? > > Ive extracted this example while debugging a script that uses an app to > test the termination condition of an iterate loop. 
> > - Mike > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From aespinosa at cs.uchicago.edu Tue Feb 24 15:14:23 2009 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Tue, 24 Feb 2009 15:14:23 -0600 Subject: [Swift-devel] different host CN expectations in gram and gridftp server In-Reply-To: References: <50b07b4b0902221621s52239835xf920e665e8cfce5f@mail.gmail.com> <1235361343.1273.6.camel@localhost> <50b07b4b0902230848v15e1394dh829fcb2bbf94a578@mail.gmail.com> <1235408072.10242.0.camel@localhost> <50b07b4b0902230856g18e11118v5f5a27d2d5eb7afc@mail.gmail.com> Message-ID: <50b07b4b0902241314t7ea23b28g832c70e26877c5f6@mail.gmail.com> I still get the same gram authentication error message: Caused by: org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Cannot submit job Caused by: org.globus.gram.GramException: Data transfer to the server failed [Caused by: Authentication failed [Caused by: Operation unauthorized (Mechanism level: [JGLOBUS-56] Authorization failed. Expected "/CN=host/129.114.50.163" target but received "/C=US/O=UTAustin/OU=TACC/CN=login3.ranger.tacc.utexas.edu")]] 2009-02-24 15:12:07,215-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION jobid=hostname-8tx7p37j - Application exception: Cannot submit job This is using both the fork and sge job manager via gram2-only -aallan On Mon, Feb 23, 2009 at 10:58 AM, Ben Clifford wrote: > > If you use gram2 instead of coasters+gram2, what happens? 
> From hategan at mcs.anl.gov Tue Feb 24 15:17:59 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 24 Feb 2009 15:17:59 -0600 Subject: [Swift-devel] different host CN expectations in gram and gridftp server In-Reply-To: <50b07b4b0902241314t7ea23b28g832c70e26877c5f6@mail.gmail.com> References: <50b07b4b0902221621s52239835xf920e665e8cfce5f@mail.gmail.com> <1235361343.1273.6.camel@localhost> <50b07b4b0902230848v15e1394dh829fcb2bbf94a578@mail.gmail.com> <1235408072.10242.0.camel@localhost> <50b07b4b0902230856g18e11118v5f5a27d2d5eb7afc@mail.gmail.com> <50b07b4b0902241314t7ea23b28g832c70e26877c5f6@mail.gmail.com> Message-ID: <1235510279.7676.0.camel@localhost> Ok. I'll look into this. On Tue, 2009-02-24 at 15:14 -0600, Allan Espinosa wrote: > I still get the same gram authentication error message: > > Caused by: org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > Cannot submit job > Caused by: org.globus.gram.GramException: Data transfer to the server > failed [Caused by: Authentication failed [Caused by: Operation > unauthorized (Mechanism level: [JGLOBUS-56] Authorization failed. > Expected "/CN=host/129.114.50.163" target but received > "/C=US/O=UTAustin/OU=TACC/CN=login3.ranger.tacc.utexas.edu")]] > 2009-02-24 15:12:07,215-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION > jobid=hostname-8tx7p37j - Application exception: Cannot submit job > > This is using both the fork and sge job manager via gram2-only > > -aallan > > > On Mon, Feb 23, 2009 at 10:58 AM, Ben Clifford wrote: > > > > If you use gram2 instead of coasters+gram2, what happens? 
> > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From aespinosa at cs.uchicago.edu Tue Feb 24 15:19:30 2009 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Tue, 24 Feb 2009 15:19:30 -0600 Subject: [Swift-devel] different host CN expectations in gram and gridftp server In-Reply-To: <1235510279.7676.0.camel@localhost> References: <50b07b4b0902221621s52239835xf920e665e8cfce5f@mail.gmail.com> <1235361343.1273.6.camel@localhost> <50b07b4b0902230848v15e1394dh829fcb2bbf94a578@mail.gmail.com> <1235408072.10242.0.camel@localhost> <50b07b4b0902230856g18e11118v5f5a27d2d5eb7afc@mail.gmail.com> <50b07b4b0902241314t7ea23b28g832c70e26877c5f6@mail.gmail.com> <1235510279.7676.0.camel@localhost> Message-ID: <50b07b4b0902241319w5d1ffeb9ua65918428fcae9f7@mail.gmail.com> Thanks Mihael! On Tue, Feb 24, 2009 at 3:17 PM, Mihael Hategan wrote: > Ok. I'll look into this. > > On Tue, 2009-02-24 at 15:14 -0600, Allan Espinosa wrote: >> I still get the same gram authentication error message: >> >> Caused by: org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: >> Cannot submit job >> Caused by: org.globus.gram.GramException: Data transfer to the server >> failed [Caused by: Authentication failed [Caused by: Operation >> unauthorized (Mechanism level: [JGLOBUS-56] Authorization failed. 
>> Expected "/CN=host/129.114.50.163" target but received >> "/C=US/O=UTAustin/OU=TACC/CN=login3.ranger.tacc.utexas.edu")]] >> 2009-02-24 15:12:07,215-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION >> jobid=hostname-8tx7p37j - Application exception: Cannot submit job From wilde at mcs.anl.gov Tue Feb 24 15:24:21 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 24 Feb 2009 15:24:21 -0600 Subject: [Swift-devel] Problem using @extractint() on derived file In-Reply-To: <1235509952.7505.2.camel@localhost> References: <49A4618C.6050108@mcs.anl.gov> <1235509952.7505.2.camel@localhost> Message-ID: <49A46585.6090003@mcs.anl.gov> The file doesnt exist - its created in the script, and hence I would expect it to be placed back in the dir that I ran swift from. (But Ive been testing further, in my original scripts, and am seeing confusing results, so I need to sort out and isolate. I do this "y=f(x); rc=extractint(y)" pattern in a loop, thus I needed to make the file name unique (else I get the "file already in cache" error). So I switched to an anonymous file, and then started getting the "no such file" error. Whats confusing is I may have seen a csae similar to whats below that did work, so I need to do more testing to isolate when it works and when it fails. Can you duplicate the failure with the simple script below? 
- Mike On 2/24/09 3:12 PM, Mihael Hategan wrote: > Where is that file with respect to: > - the script > - the place you are running this from > > On Tue, 2009-02-24 at 15:07 -0600, Michael Wilde wrote: >> This script: >> >> --- >> >> type file; >> >> app (file out) echo (string s) { echo s stdout=@out; } >> >> file f<"a">; >> int i; >> >> f = echo("123"); >> i = @extractint(@f); >> trace("i=", i); >> >> --- >> >> produces: >> >> Swift svn swift-r2552 cog-r2303 >> >> >> >> RunID: 20090224-1455-k1vj4uy7 >> >> Progress: >> >> Execution failed: >> >> Reading integer content of file >> >> Caused by: >> >> a (No such file or directory) >> >> >> >> --- >> >> I seem to get the same behavior from readData, and the same if I >> explicitly specify "a" as the argument to @extractint(); >> >> Is this because @extractint() is not waiting for "f" to get produced? >> >> Ive extracted this example while debugging a script that uses an app to >> test the termination condition of an iterate loop. >> >> - Mike >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From hategan at mcs.anl.gov Tue Feb 24 15:33:28 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 24 Feb 2009 15:33:28 -0600 Subject: [Swift-devel] Problem using @extractint() on derived file In-Reply-To: <49A46585.6090003@mcs.anl.gov> References: <49A4618C.6050108@mcs.anl.gov> <1235509952.7505.2.camel@localhost> <49A46585.6090003@mcs.anl.gov> Message-ID: <1235511208.7899.3.camel@localhost> Sorry. Didn't look properly. Yes, this happens because @f can be resolved before f can be, so swift will happily do the extractint before echo finishes. I don't have a solution yet. On Tue, 2009-02-24 at 15:24 -0600, Michael Wilde wrote: > The file doesnt exist - its created in the script, and hence I would > expect it to be placed back in the dir that I ran swift from. 
> > (But Ive been testing further, in my original scripts, and am seeing > confusing results, so I need to sort out and isolate. > > I do this "y=f(x); rc=extractint(y)" pattern in a loop, thus I needed to > make the file name unique (else I get the "file already in cache" > error). So I switched to an anonymous file, and then started getting the > "no such file" error. > > Whats confusing is I may have seen a csae similar to whats below that > did work, so I need to do more testing to isolate when it works and when > it fails. > > Can you duplicate the failure with the simple script below? > > - Mike > > > On 2/24/09 3:12 PM, Mihael Hategan wrote: > > Where is that file with respect to: > > - the script > > - the place you are running this from > > > > On Tue, 2009-02-24 at 15:07 -0600, Michael Wilde wrote: > >> This script: > >> > >> --- > >> > >> type file; > >> > >> app (file out) echo (string s) { echo s stdout=@out; } > >> > >> file f<"a">; > >> int i; > >> > >> f = echo("123"); > >> i = @extractint(@f); > >> trace("i=", i); > >> > >> --- > >> > >> produces: > >> > >> Swift svn swift-r2552 cog-r2303 > >> > >> > >> > >> RunID: 20090224-1455-k1vj4uy7 > >> > >> Progress: > >> > >> Execution failed: > >> > >> Reading integer content of file > >> > >> Caused by: > >> > >> a (No such file or directory) > >> > >> > >> > >> --- > >> > >> I seem to get the same behavior from readData, and the same if I > >> explicitly specify "a" as the argument to @extractint(); > >> > >> Is this because @extractint() is not waiting for "f" to get produced? > >> > >> Ive extracted this example while debugging a script that uses an app to > >> test the termination condition of an iterate loop. 
> >> > >> - Mike > >> _______________________________________________ > >> Swift-devel mailing list > >> Swift-devel at ci.uchicago.edu > >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > From wilde at mcs.anl.gov Tue Feb 24 16:21:47 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 24 Feb 2009 16:21:47 -0600 Subject: [Swift-devel] Iterate example broken - semantics changed? Message-ID: <49A472FB.2040000@mcs.anl.gov> The iterate example in the swift tutorial no longer works. Its at: http://www.ci.uchicago.edu/swift/guides/tutorial.php#tutorial.iterate The problem seems to be the same as the example below: swift wont let you set the members of an array both in the declaring scope and in an inner nested scope, it seems. This example: --- int a[]; a[0] = 0; iterate v { a[v+1] = v+1; trace("v=",v); } until (); --- gives: Could not start execution. variable a has multiple writers. From wilde at mcs.anl.gov Tue Feb 24 16:29:19 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 24 Feb 2009 16:29:19 -0600 Subject: [Swift-devel] Problem using @extractint() on derived file In-Reply-To: <1235511208.7899.3.camel@localhost> References: <49A4618C.6050108@mcs.anl.gov> <1235509952.7505.2.camel@localhost> <49A46585.6090003@mcs.anl.gov> <1235511208.7899.3.camel@localhost> Message-ID: <49A474BF.8030708@mcs.anl.gov> Sorry, I think I see the problem. @extractint() wants a file-mapped object (ie marker type), not a filename string. Now (I think) it seems to work. Is readData and readData2 the same? On 2/24/09 3:33 PM, Mihael Hategan wrote: > Sorry. Didn't look properly. > > Yes, this happens because @f can be resolved before f can be, so swift > will happily do the extractint before echo finishes. > > I don't have a solution yet. > > On Tue, 2009-02-24 at 15:24 -0600, Michael Wilde wrote: >> The file doesnt exist - its created in the script, and hence I would >> expect it to be placed back in the dir that I ran swift from. 
>> >> (But Ive been testing further, in my original scripts, and am seeing >> confusing results, so I need to sort out and isolate. >> >> I do this "y=f(x); rc=extractint(y)" pattern in a loop, thus I needed to >> make the file name unique (else I get the "file already in cache" >> error). So I switched to an anonymous file, and then started getting the >> "no such file" error. >> >> Whats confusing is I may have seen a csae similar to whats below that >> did work, so I need to do more testing to isolate when it works and when >> it fails. >> >> Can you duplicate the failure with the simple script below? >> >> - Mike >> >> >> On 2/24/09 3:12 PM, Mihael Hategan wrote: >>> Where is that file with respect to: >>> - the script >>> - the place you are running this from >>> >>> On Tue, 2009-02-24 at 15:07 -0600, Michael Wilde wrote: >>>> This script: >>>> >>>> --- >>>> >>>> type file; >>>> >>>> app (file out) echo (string s) { echo s stdout=@out; } >>>> >>>> file f<"a">; >>>> int i; >>>> >>>> f = echo("123"); >>>> i = @extractint(@f); >>>> trace("i=", i); >>>> >>>> --- >>>> >>>> produces: >>>> >>>> Swift svn swift-r2552 cog-r2303 >>>> >>>> >>>> >>>> RunID: 20090224-1455-k1vj4uy7 >>>> >>>> Progress: >>>> >>>> Execution failed: >>>> >>>> Reading integer content of file >>>> >>>> Caused by: >>>> >>>> a (No such file or directory) >>>> >>>> >>>> >>>> --- >>>> >>>> I seem to get the same behavior from readData, and the same if I >>>> explicitly specify "a" as the argument to @extractint(); >>>> >>>> Is this because @extractint() is not waiting for "f" to get produced? >>>> >>>> Ive extracted this example while debugging a script that uses an app to >>>> test the termination condition of an iterate loop. 
>>>> >>>> - Mike >>>> _______________________________________________ >>>> Swift-devel mailing list >>>> Swift-devel at ci.uchicago.edu >>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From wilde at mcs.anl.gov Tue Feb 24 18:32:59 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 24 Feb 2009 18:32:59 -0600 Subject: [Swift-devel] truncated name in typecheck error message Message-ID: <49A491BB.4060007@mcs.anl.gov> This script: int out[]; out[0][1]=123; produces: Could not start execution. Compile error in assigment at line 4: You cannot assign value of type int to a variable of type i -- The typename is truncated at the end of the message. Eg, I think "file" prints as "fi". From aespinosa at cs.uchicago.edu Tue Feb 24 18:52:29 2009 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Tue, 24 Feb 2009 18:52:29 -0600 Subject: [Swift-devel] throttling parameters with coasters Message-ID: <50b07b4b0902241652l2552a38eh8954155112b70d09@mail.gmail.com> In coasters, are throtte.submit, throttle.host.submit, throttle.job.factor parameters ignored ? Looking on how swift submits initial requests, it seems that throttle.job.factor affects the number of coaster nodes it will submit to the LRM. If I have 1 1 Swift spawns 4 coasters. 4*16=64 processors available to me. I observe that throughout the job this number did not increase Next, in my swift.properties, I have throttle.submit=4 throttle.host.submit=2 But in the runtime, rogress: Selecting site:2809 Submitting:17 Active:40 Stage out:31 Finished successfully:103 Progress: Selecting site:2809 Submitting:17 Active:40 Stage out:30 Finished su so the 2 parameters does not apply to coaster submissions? 
From hategan at mcs.anl.gov Tue Feb 24 19:15:50 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 24 Feb 2009 19:15:50 -0600 Subject: [Swift-devel] throttling parameters with coasters In-Reply-To: <50b07b4b0902241652l2552a38eh8954155112b70d09@mail.gmail.com> References: <50b07b4b0902241652l2552a38eh8954155112b70d09@mail.gmail.com> Message-ID: <1235524550.11984.4.camel@localhost> On Tue, 2009-02-24 at 18:52 -0600, Allan Espinosa wrote: > In coasters, are throtte.submit, throttle.host.submit, > throttle.job.factor parameters ignored ? > > Looking on how swift submits initial requests, it seems that > throttle.job.factor affects the number of coaster nodes it will submit > to the LRM. If I have > 1 > 1 > Swift spawns 4 coasters. 4*16=64 processors available to me. I > observe that throughout the job this number did not increase A job throttle of 1 pretty much caps the total number of concurrent jobs at 100. > > Next, in my swift.properties, I have > throttle.submit=4 > throttle.host.submit=2 > > But in the runtime, > > rogress: Selecting site:2809 Submitting:17 Active:40 Stage out:31 > Finished successfully:103 > Progress: Selecting site:2809 Submitting:17 Active:40 Stage out:30 Finished su > > so the 2 parameters does not apply to coaster submissions? The "submitting" printed by the progress ticker is not the same as the "submit" in swift.properties. From a cog abstractions perspective, 4 concurrent submissions means that only 4 calls to TaskHandler.submit(Task) can be active at one time. From swift's perspective it means that the job was queued to the scheduler and awaits its turn to be one of the 4 to go through TaskHandler.submit().
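[Editorial sketch] The distinction above (17 jobs shown as "Submitting" while throttle.submit=4) can be illustrated with a counting semaphore. This is only a toy model in Python, not the Swift or CoG code: run_throttled, submit and the timings are hypothetical names chosen for the illustration, and the real scheduler may well be implemented differently.

```python
import threading
import time


def run_throttled(num_jobs, limit):
    """Run num_jobs fake submissions with at most `limit` active at once.

    Returns (jobs_completed, peak_concurrent_submits).
    """
    gate = threading.Semaphore(limit)   # plays the role of throttle.submit
    lock = threading.Lock()
    state = {"active": 0, "peak": 0, "done": 0}

    def submit(job_id):
        # Hypothetical stand-in for a TaskHandler.submit(Task) call.
        with lock:
            state["active"] += 1
            state["peak"] = max(state["peak"], state["active"])
        time.sleep(0.01)                # pretend the submission takes a while
        with lock:
            state["active"] -= 1
            state["done"] += 1

    def job(job_id):
        # Until the semaphore is acquired the job is merely queued; in this
        # toy model that is what the ticker would count as "Submitting".
        with gate:
            submit(job_id)

    threads = [threading.Thread(target=job, args=(i,)) for i in range(num_jobs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["done"], state["peak"]


if __name__ == "__main__":
    done, peak = run_throttled(num_jobs=17, limit=4)
    print("completed:", done, "peak concurrent submits:", peak)
```

In this model all 17 jobs are "submitting" from the start, yet no more than 4 are ever inside submit() at the same moment, which is consistent with the explanation given in the message above.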
From benc at hawaga.org.uk Tue Feb 24 20:23:12 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 25 Feb 2009 02:23:12 +0000 (GMT) Subject: [Swift-devel] Problem using @extractint() on derived file In-Reply-To: <1235511208.7899.3.camel@localhost> References: <49A4618C.6050108@mcs.anl.gov> <1235509952.7505.2.camel@localhost> <49A46585.6090003@mcs.anl.gov> <1235511208.7899.3.camel@localhost> Message-ID: On Tue, 24 Feb 2009, Mihael Hategan wrote: > Yes, this happens because @f can be resolved before f can be, so swift > will happily do the extractint before echo finishes. > > I don't have a solution yet. extractint probably should be able to take a file parameter, rather than a string, and order evaluation properly. which I think is what readData does (though its not in the test suite, I'm told) so it may be that replacing extractint(@f) with readData(f) (note the lack of @) does what is desired in which case, extractInt can be removed from the language. -- From wilde at mcs.anl.gov Tue Feb 24 22:55:49 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 24 Feb 2009 22:55:49 -0600 Subject: [Swift-devel] Problem using @extractint() on derived file In-Reply-To: References: <49A4618C.6050108@mcs.anl.gov> <1235509952.7505.2.camel@localhost> <49A46585.6090003@mcs.anl.gov> <1235511208.7899.3.camel@localhost> Message-ID: <49A4CF55.3060105@mcs.anl.gov> On 2/24/09 8:23 PM, Ben Clifford wrote: > On Tue, 24 Feb 2009, Mihael Hategan wrote: > >> Yes, this happens because @f can be resolved before f can be, so swift >> will happily do the extractint before echo finishes. >> >> I don't have a solution yet. > > extractint probably should be able to take a file parameter, rather than a > string, and order evaluation properly. extractint(f) seems to work: --- type file; app (file out) echo (string s) { echo s stdout=@out; } file f = echo("123"); int i = @extractint(f); trace (i); --- prints 123 substituting readData for @extractint in the above works as well. 
I was confused about what each expected, so perhaps just clarifying the users guide is whats needed. > which I think is what readData does (though its not in the test suite, I'm > told) > > so it may be that replacing extractint(@f) with readData(f) (note the > lack of @) does what is desired > > in which case, extractInt can be removed from the language. > From benc at hawaga.org.uk Wed Feb 25 03:45:32 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 25 Feb 2009 09:45:32 +0000 (GMT) Subject: [Swift-devel] Problem using @extractint() on derived file In-Reply-To: <49A4CF55.3060105@mcs.anl.gov> References: <49A4618C.6050108@mcs.anl.gov> <1235509952.7505.2.camel@localhost> <49A46585.6090003@mcs.anl.gov> <1235511208.7899.3.camel@localhost> <49A4CF55.3060105@mcs.anl.gov> Message-ID: On Tue, 24 Feb 2009, Michael Wilde wrote: > substituting readData for @extractint in the above works as well. ok good. rambling slightly: It would be nice to get rid of @extractint and have only readData, but I think that this doesn't work in all cases: readData's behaviour is controlled by the type that it is expected to return (that is, if you assign a readData expression to an int, it tries to read an int; if you assign it to an array, it tries to read an array). 
In some situations, that return type isn't well defined, because it could be a context where anything is accepted - for example what is the type of readData in: trace(readData(f)) -- From benc at hawaga.org.uk Wed Feb 25 03:59:45 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 25 Feb 2009 09:59:45 +0000 (GMT) Subject: [Swift-devel] Problem using @extractint() on derived file In-Reply-To: <49A4CF55.3060105@mcs.anl.gov> References: <49A4618C.6050108@mcs.anl.gov> <1235509952.7505.2.camel@localhost> <49A46585.6090003@mcs.anl.gov> <1235511208.7899.3.camel@localhost> <49A4CF55.3060105@mcs.anl.gov> Message-ID: On Tue, 24 Feb 2009, Michael Wilde wrote: > > extractint probably should be able to take a file parameter, rather than a > > string, and order evaluation properly. > > extractint(f) seems to work: [...] > I was confused about what each expected, so perhaps just clarifying the users > guide is whats needed. It would be nice if you got some kind of warning here. The approach that I think is most in-sync with other file handling in Swift would be to say that you could not pass a filename into readData; instead you would be compelled to pass a mapped file (as you ended up doing in this case). That would increase the volume of text needed when using the present 'pass in a filename' behaviour; but in some ways, its a simplification because it strengthens the rule "if you want to deal with a file, you must do it by mapping to a variable, not by passing its filename around". I'm unsure. Right this second I think I'd prefer that change to be made. 
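[Editorial sketch] The rule argued for above ("deal with a file by mapping it to a variable, not by passing its filename around") is essentially a dataflow-ordering argument. The Python toy below is an analogy only, not Swift: a Future stands in for a mapped file variable, and echo_to_file, read_int_by_name and read_int_by_handle are hypothetical names. A consumer that only has the file name cannot know it must wait; a consumer that holds the handle can.

```python
import os
import tempfile
import threading
from concurrent.futures import ThreadPoolExecutor


def read_int_by_name(path):
    # Only a name is known, so nothing says the producer has finished.
    with open(path) as f:
        return int(f.read())


def read_int_by_handle(produced):
    # The handle carries the dependency: result() blocks until the file exists.
    path = produced.result()
    with open(path) as f:
        return int(f.read())


def demo():
    release = threading.Event()

    def echo_to_file(path, text):
        release.wait()                  # producer is still "running"
        with open(path, "w") as f:
            f.write(text)
        return path

    with tempfile.TemporaryDirectory() as d, ThreadPoolExecutor(max_workers=1) as pool:
        path = os.path.join(d, "a")
        produced = pool.submit(echo_to_file, path, "123")
        try:
            early = read_int_by_name(path)      # like extractint(@f)
        except FileNotFoundError:
            early = None                        # "a (No such file or directory)"
        release.set()                           # let the producer finish
        late = read_int_by_handle(produced)     # like readData(f)
        return early, late


if __name__ == "__main__":
    print(demo())
```

The mapping of read_int_by_name to extractint(@f) and of read_int_by_handle to readData(f) is an assumption made for the analogy; it mirrors the behaviour reported in this thread but is not taken from the Swift sources.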
-- From wilde at mcs.anl.gov Wed Feb 25 08:07:42 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 25 Feb 2009 08:07:42 -0600 Subject: [Swift-devel] Re: [Swift-user] Questions on use of iterate statement In-Reply-To: References: <499EAB13.5050401@mcs.anl.gov> <499EC8DA.70304@mcs.anl.gov> Message-ID: <49A550AE.1020007@mcs.anl.gov> On 2/23/09 9:46 AM, Ben Clifford wrote: > As of r2593 you should be able to use the style of iteration that you > originally used. Thanks, Ben. When I tried this, it turned out my data flow required a whole array of results to be passed out of the iterate anyways. So what was initially a workaround turned out to be the way it needed to be coded anyways. But I'll try to test your change on other variants of this. From benc at hawaga.org.uk Wed Feb 25 11:31:53 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 25 Feb 2009 17:31:53 +0000 (GMT) Subject: [Swift-devel] pmd Message-ID: In my ongoing adventure with build/test/analysis tools, I ran PMD on the swift source code. I used the unused code and unused import report to remove a bunch of unused code from the source. I just ran a test with almost all rulesets enabled, which gives 8000 comments on the swift source code. A bunch are stylistic comments that I don't particularly agree with (such as on the use of single-character variable names), but if anyone is interested in having a browse, here's the report: http://www.ci.uchicago.edu/~benc/tmp/pmd.html The source code lines in this report are against r2606 From hategan at mcs.anl.gov Wed Feb 25 11:42:45 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 25 Feb 2009 11:42:45 -0600 Subject: [Swift-devel] pmd In-Reply-To: References: Message-ID: <1235583765.20020.1.camel@localhost> On Wed, 2009-02-25 at 17:31 +0000, Ben Clifford wrote: > In my ongoing adventure with build/test/analysis tools, I ran PMD on the > swift source code.
I used the unused code and unused import report Eclipse has a sub-set of what PMD does, including looking for unused imports (for which there is also a shortcut that automatically re-organizes them). From bugzilla-daemon at mcs.anl.gov Wed Feb 25 18:27:57 2009 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Wed, 25 Feb 2009 18:27:57 -0600 (CST) Subject: [Swift-devel] [Bug 178] New: strange unused string replacement in CSVMapper needs investigating Message-ID: http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=178 Summary: strange unused string replacement in CSVMapper needs investigating Product: Swift Version: unspecified Platform: Macintosh OS/Version: Mac OS Status: NEW Severity: normal Priority: P2 Component: General AssignedTo: benc at hawaga.org.uk ReportedBy: benc at hawaga.org.uk CSVMapper contains (as of r2606) the following TODO to investigate: // TODO PMD reports this for the // following line: // An operation on an Immutable object ( String, BigDecimal or BigInteger) won't change the object itself // This is likely a bug column.replaceAll("\\s", "_"); That's meant to do something, presumably, but it doesn't... -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You reported the bug, or are watching the reporter. You are the assignee for the bug, or are watching the assignee. 
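[Editorial sketch] The PMD warning quoted in Bug 178 is about a string operation whose result is thrown away. Strings are immutable in Java, so column.replaceAll("\\s", "_") returns a new string and leaves column untouched unless the result is assigned back. The same pitfall exists in Python, which is used here purely for illustration; normalize_column_buggy and normalize_column_fixed are hypothetical names, and whether CSVMapper was really meant to replace whitespace with underscores is an assumption based on the bug report.

```python
import re


def normalize_column_buggy(column):
    # Result discarded: the same mistake as the flagged line in the report.
    re.sub(r"\s", "_", column)
    return column


def normalize_column_fixed(column):
    # Keep the returned string; the original object is never modified.
    column = re.sub(r"\s", "_", column)
    return column


if __name__ == "__main__":
    print(repr(normalize_column_buggy("first name")))
    print(repr(normalize_column_fixed("first name")))
```

If the replacement was intended, the corresponding Java change would presumably be to assign the result, as in column = column.replaceAll("\\s", "_"); if it was not intended, the line can simply be removed.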
From bugzilla-daemon at mcs.anl.gov Wed Feb 25 18:37:25 2009 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Wed, 25 Feb 2009 18:37:25 -0600 (CST) Subject: [Swift-devel] [Bug 178] strange unused string replacement in CSVMapper needs investigating In-Reply-To: Message-ID: <20090226003725.B181E164CE@foxtrot.mcs.anl.gov> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=178 ------- Comment #1 from hategan at mcs.anl.gov 2009-02-25 18:37 ------- I believe we should deprecate the CSV mapper entirely in favor of the ext mapper, which is both more powerful and easier to use. -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You reported the bug, or are watching the reporter. You are the assignee for the bug, or are watching the assignee. From aespinosa at cs.uchicago.edu Wed Feb 25 18:46:38 2009 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Wed, 25 Feb 2009 18:46:38 -0600 Subject: [Swift-devel] current workers < 0 ? Message-ID: <50b07b4b0902251646r2cc11935w34337187c6c26b4a@mail.gmail.com> I am trying to generate a plot of number of coaster workers vs time superimposed with number of tasks vs time plot. Upon poking through ~/.globus/coasters/coasters.log, I notice that it drops to negative values. 2009-02-25 15:30:34,447-0600 INFO WorkerManager Current workers: 81 2009-02-25 15:30:34,447-0600 INFO WorkerManager Ready: {} 2009-02-25 15:30:34,447-0600 INFO WorkerManager Busy: [Worker[608604359], Worker[671140203], Worker[-475116310], Worker[-1187087425], Worker[1021599238], Work2009-02-25 15:30:34,448-0600 INFO WorkerManager Requested: {-906148816=Worker[-906148816], 40 I think this deals with the currentWorkers++ ; line in WorkerManager.java -Allan From hategan at mcs.anl.gov Wed Feb 25 20:21:40 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 25 Feb 2009 20:21:40 -0600 (CST) Subject: [Swift-devel] current workers < 0 ? 
In-Reply-To: <50b07b4b0902251646r2cc11935w34337187c6c26b4a@mail.gmail.com> Message-ID: <11549721.289771235614900491.JavaMail.root@zimbra> ----- Allan Espinosa wrote: > I am trying to generate a plot of number of coaster workers vs time > superimposed with number of tasks vs time plot. Upon poking through > ~/.globus/coasters/coasters.log, I notice that it drops to negative > values. Maybe I'm missing something, but where do you see the number of workers being negative in the text below? > > 2009-02-25 15:30:34,447-0600 INFO WorkerManager Current workers: 81 > 2009-02-25 15:30:34,447-0600 INFO WorkerManager Ready: {} > 2009-02-25 15:30:34,447-0600 INFO WorkerManager Busy: > [Worker[608604359], Worker[671140203], Worker[-475116310], > Worker[-1187087425], Worker[1021599238], Work2009-02-25 > 15:30:34,448-0600 INFO WorkerManager Requested: > {-906148816=Worker[-906148816], 40 > > I think this deals with the currentWorkers++ ; line in WorkerManager.java > > -Allan > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From aespinosa at cs.uchicago.edu Wed Feb 25 20:30:16 2009 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Wed, 25 Feb 2009 20:30:16 -0600 Subject: [Swift-devel] current workers < 0 ? In-Reply-To: <11549721.289771235614900491.JavaMail.root@zimbra> References: <50b07b4b0902251646r2cc11935w34337187c6c26b4a@mail.gmail.com> <11549721.289771235614900491.JavaMail.root@zimbra> Message-ID: <50b07b4b0902251830k9f5f04bma9987bce21083e9c@mail.gmail.com> Ooops. I copy pasted the wrong line. 
It should be: 2009-02-25 15:33:14,665-0600 INFO WorkerManager Current workers: -110 2009-02-25 15:33:14,669-0600 INFO AbstractKarajanChannel SC-null REPL: Command(2009-02-25 15:33:14,670-0600 INFO AbstractKarajanChannel Unregistering Command(2009-02-25 15:33:14,673-0600 INFO WorkerManager Ready: {-1644269098/1235598824s2009-02-25 15:33:14,674-0600 INFO WorkerManager Busy: [Worker[-1955187037], Wor2009-02-25 15:33:14,674-0600 INFO WorkerManager Requested: {-1105264759=Worker[2009-02-25 15:33:14,674-0600 INFO WorkerManager Starting: [] 2009-02-25 15:33:14,676-0600 INFO WorkerManager Ids: {-1734485274=Worker[-173442009-02-25 15:33:14,676-0600 INFO WorkerManager AllocationR: [] Sorry about that. -Allan On Wed, Feb 25, 2009 at 8:21 PM, Mihael Hategan wrote: > > ----- Allan Espinosa wrote: >> I am trying to generate a plot of number of coaster workers vs time >> superimposed with number of tasks vs time plot. Upon poking through >> ~/.globus/coasters/coasters.log, I notice that it drops to negative >> values. > > Maybe I'm missing something, but where do you see the number of > workers being negative in the text below? > >> >> 2009-02-25 15:30:34,447-0600 INFO WorkerManager Current workers: 81 >> 2009-02-25 15:30:34,447-0600 INFO WorkerManager Ready: {} >> 2009-02-25 15:30:34,447-0600 INFO WorkerManager Busy: >> [Worker[608604359], Worker[671140203], Worker[-475116310], >> Worker[-1187087425], Worker[1021599238], Work2009-02-25 >> 15:30:34,448-0600 INFO WorkerManager Requested: >> {-906148816=Worker[-906148816], 40 >> From hategan at mcs.anl.gov Wed Feb 25 20:36:40 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 25 Feb 2009 20:36:40 -0600 (CST) Subject: [Swift-devel] current workers < 0 ? In-Reply-To: <50b07b4b0902251830k9f5f04bma9987bce21083e9c@mail.gmail.com> Message-ID: <8091050.289951235615800053.JavaMail.root@zimbra> ----- Allan Espinosa wrote: > Ooops. I copy pasted the wrong line. 
It should be: > > 2009-02-25 15:33:14,665-0600 INFO WorkerManager Current workers: -110 Heh. Yes. That increment should be synchronized. I guess I didn't bother because it was only there for informal reasons. > 2009-02-25 15:33:14,669-0600 INFO AbstractKarajanChannel SC-null > REPL: Command(2009-02-25 15:33:14,670-0600 INFO > AbstractKarajanChannel Unregistering Command(2009-02-25 > 15:33:14,673-0600 INFO WorkerManager Ready: > {-1644269098/1235598824s2009-02-25 15:33:14,674-0600 INFO > WorkerManager Busy: [Worker[-1955187037], Wor2009-02-25 > 15:33:14,674-0600 INFO WorkerManager Requested: > {-1105264759=Worker[2009-02-25 15:33:14,674-0600 INFO WorkerManager > Starting: [] > 2009-02-25 15:33:14,676-0600 INFO WorkerManager Ids: > {-1734485274=Worker[-173442009-02-25 15:33:14,676-0600 INFO > WorkerManager AllocationR: [] > > > Sorry about that. > > -Allan > > On Wed, Feb 25, 2009 at 8:21 PM, Mihael Hategan wrote: > > > > ----- Allan Espinosa wrote: > >> I am trying to generate a plot of number of coaster workers vs time > >> superimposed with number of tasks vs time plot. Upon poking through > >> ~/.globus/coasters/coasters.log, I notice that it drops to negative > >> values. > > > > Maybe I'm missing something, but where do you see the number of > > workers being negative in the text below? > > > >> > >> 2009-02-25 15:30:34,447-0600 INFO WorkerManager Current workers: 81 > >> 2009-02-25 15:30:34,447-0600 INFO WorkerManager Ready: {} > >> 2009-02-25 15:30:34,447-0600 INFO WorkerManager Busy: > >> [Worker[608604359], Worker[671140203], Worker[-475116310], > >> Worker[-1187087425], Worker[1021599238], Work2009-02-25 > >> 15:30:34,448-0600 INFO WorkerManager Requested: > >> {-906148816=Worker[-906148816], 40 > >> From aespinosa at cs.uchicago.edu Wed Feb 25 20:40:35 2009 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Wed, 25 Feb 2009 20:40:35 -0600 Subject: [Swift-devel] current workers < 0 ? 
In-Reply-To: <8091050.289951235615800053.JavaMail.root@zimbra> References: <50b07b4b0902251830k9f5f04bma9987bce21083e9c@mail.gmail.com> <8091050.289951235615800053.JavaMail.root@zimbra> Message-ID: <50b07b4b0902251840u71deaefdk7fcebd04acdc0ec3@mail.gmail.com> I see. Formally it should be currentWorkers = ready.size() + busy.size() right? On Wed, Feb 25, 2009 at 8:36 PM, Mihael Hategan wrote: > > ----- Allan Espinosa wrote: >> Ooops. I copy pasted the wrong line. It should be: >> >> 2009-02-25 15:33:14,665-0600 INFO WorkerManager Current workers: -110 > > Heh. Yes. That increment should be synchronized. I guess I didn't bother > because it was only there for informal reasons. > >> 2009-02-25 15:33:14,669-0600 INFO AbstractKarajanChannel SC-null >> REPL: Command(2009-02-25 15:33:14,670-0600 INFO >> AbstractKarajanChannel Unregistering Command(2009-02-25 >> 15:33:14,673-0600 INFO WorkerManager Ready: >> {-1644269098/1235598824s2009-02-25 15:33:14,674-0600 INFO >> WorkerManager Busy: [Worker[-1955187037], Wor2009-02-25 >> 15:33:14,674-0600 INFO WorkerManager Requested: >> {-1105264759=Worker[2009-02-25 15:33:14,674-0600 INFO WorkerManager >> Starting: [] >> 2009-02-25 15:33:14,676-0600 INFO WorkerManager Ids: >> {-1734485274=Worker[-173442009-02-25 15:33:14,676-0600 INFO >> WorkerManager AllocationR: [] >> >> >> Sorry about that. >> >> -Allan >> >> On Wed, Feb 25, 2009 at 8:21 PM, Mihael Hategan wrote: >> > >> > ----- Allan Espinosa wrote: >> >> I am trying to generate a plot of number of coaster workers vs time >> >> superimposed with number of tasks vs time plot. Upon poking through >> >> ~/.globus/coasters/coasters.log, I notice that it drops to negative >> >> values. >> > >> > Maybe I'm missing something, but where do you see the number of >> > workers being negative in the text below? 
From hategan at mcs.anl.gov Wed Feb 25 21:32:34 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 25 Feb 2009 21:32:34 -0600 (CST) Subject: [Swift-devel] current workers < 0 ? In-Reply-To: <50b07b4b0902251840u71deaefdk7fcebd04acdc0ec3@mail.gmail.com> Message-ID: <31786787.290651235619154333.JavaMail.root@zimbra> ----- Allan Espinosa wrote: > I see. Formally it should be > > currentWorkers = ready.size() + busy.size() Also + starting.size() (I suppose in order to avoid starting more workers than the total number allowed). > > right? > > On Wed, Feb 25, 2009 at 8:36 PM, Mihael Hategan wrote: > > > > ----- Allan Espinosa wrote: > >> Ooops. I copy pasted the wrong line. It should be: > >> > >> 2009-02-25 15:33:14,665-0600 INFO WorkerManager Current workers: -110 > > > > Heh. Yes. That increment should be synchronized. I guess I didn't bother > > because it was only there for informal reasons. I take that back. It is actually used for things. From hategan at mcs.anl.gov Wed Feb 25 21:39:55 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 25 Feb 2009 21:39:55 -0600 (CST) Subject: [Swift-devel] current workers < 0 ? In-Reply-To: <8091050.289951235615800053.JavaMail.root@zimbra> Message-ID: <17584853.290771235619595728.JavaMail.root@zimbra> ----- Mihael Hategan wrote: > > ----- Allan Espinosa wrote: > > Ooops. I copy pasted the wrong line. It should be: > > > > 2009-02-25 15:33:14,665-0600 INFO WorkerManager Current workers: -110 > > Heh. Yes. That increment should be synchronized. I guess I didn't bother > because it was only there for informal reasons. > cog r2306 should fix this. Let me know if it works or if I screwed up something else. From aespinosa at cs.uchicago.edu Wed Feb 25 22:29:14 2009 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Wed, 25 Feb 2009 22:29:14 -0600 Subject: [Swift-devel] current workers < 0 ? 
In-Reply-To: <17584853.290771235619595728.JavaMail.root@zimbra> References: <8091050.289951235615800053.JavaMail.root@zimbra> <17584853.290771235619595728.JavaMail.root@zimbra> Message-ID: <50b07b4b0902252029u2dd82147x4accab87ac85ecfd@mail.gmail.com> It still has the same issues. It subtracts too much when a task if finished. Also, observing the LRM queue, i see swift creating 18-20 "make coaster" requests (4 at start then 16-18 after 5 mins). with a 16 coastersPerNode you get a 320 processor allocation. this more than MAX_WORKERS~256 and the max score possible from my sites.xml (102 max) 1 1 2009-02-25 20:31:15,590-0600 INFO Worker Worker stderr: null 2009-02-25 20:31:15,590-0600 WARN WorkerManager Worker terminated: Worker[-1909333457] 2009-02-25 20:31:15,590-0600 WARN Worker Worker 335457820 status change: Completed 2009-02-25 20:31:15,590-0600 INFO Worker Worker stdout: Job You has completed. Writing job STDOUT and STDERR to cache files. Returning job success. 2009-02-25 20:31:15,590-0600 INFO Worker Worker stderr: null 2009-02-25 20:31:15,590-0600 WARN WorkerManager Worker terminated: Worker[335457820] ******2009-02-25 20:31:15,742-0600 INFO WorkerManager Current workers: -32**** 2009-02-25 20:31:15,745-0600 INFO WorkerManager Ready: {} 2009-02-25 20:31:15,745-0600 INFO WorkerManager Busy: [Worker[-1260987422], Worker[2142641145], Worker[2053757208 2009-02-25 20:31:15,751-0600 INFO WorkerManager Requested: {640597733=Worker[640597733], -692025578=Worker[-69202 2009-02-25 20:31:15,751-0600 INFO WorkerManager Starting: [Task(type=JOB_SUBMISSION, identity=urn:1235615211813-1 2009-02-25 20:31:15,752-0600 INFO WorkerManager Ids: {1078934147=Worker[1078934147], 264613139=Worker[264613139], 2009-02-25 20:31:15,753-0600 INFO WorkerManager AllocationR: [org.globus.cog.abstraction.coaster.service.job.mana 2009-02-25 20:31:15,873-0600 INFO AbstractKarajanChannel SC-null REQ: Handler(JOBSTATUS) On Wed, Feb 25, 2009 at 9:39 PM, Mihael Hategan wrote: > > ----- Mihael 
Hategan wrote: >> >> ----- Allan Espinosa wrote: >> > Ooops. I copy pasted the wrong line. It should be: >> > >> > 2009-02-25 15:33:14,665-0600 INFO WorkerManager Current workers: -110 >> >> Heh. Yes. That increment should be synchronized. I guess I didn't bother >> because it was only there for informal reasons. >> > > cog r2306 should fix this. Let me know if it works or if I screwed up > something else. > > -- Allan M. Espinosa PhD student, Computer Science University of Chicago From hategan at mcs.anl.gov Wed Feb 25 23:27:37 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 25 Feb 2009 23:27:37 -0600 Subject: [Swift-devel] current workers < 0 ? In-Reply-To: <50b07b4b0902252029u2dd82147x4accab87ac85ecfd@mail.gmail.com> References: <8091050.289951235615800053.JavaMail.root@zimbra> <17584853.290771235619595728.JavaMail.root@zimbra> <50b07b4b0902252029u2dd82147x4accab87ac85ecfd@mail.gmail.com> Message-ID: <1235626057.5218.6.camel@localhost> I suspect the issue was introduced by the addition of multiple coasters per node. The manager expects one worker, but gets 16 instead. On Wed, 2009-02-25 at 22:29 -0600, Allan Espinosa wrote: > It still has the same issues. It subtracts too much when a task if finished. > > Also, observing the LRM queue, i see swift creating 18-20 "make > coaster" requests (4 at start then 16-18 after 5 mins). with a 16 > coastersPerNode you get a 320 processor allocation. this more than > MAX_WORKERS~256 and the max score possible from my sites.xml (102 max) Regarding MAX_WORKERS, that probably suffers from the same problem, in that it may request less than 256 workers, but given that each request means 16 workers, the end result may be different than what's expected. However, MAX_WORKERS was introduced merely to limit damage in case the code is bad and it doesn't otherwise put an upper bound on the limit of worker requests (/jobs in the queue).
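The suspected accounting mismatch can be illustrated with a small sketch. This is not the CoG WorkerManager code; `WorkerCounter`, `simulate`, and their parameters are hypothetical names invented for illustration. It assumes the counter is incremented once per submitted request and decremented once per terminated worker, which is one plausible reading of "the manager expects one worker, but gets 16 instead". The lock mirrors the remark that the increment should be synchronized.

```python
import threading


class WorkerCounter:
    """Hypothetical stand-in for the manager's 'current workers' bookkeeping."""

    def __init__(self, assumed_workers_per_request):
        self._lock = threading.Lock()
        # How many workers the manager *believes* each request produces.
        self._per_request = assumed_workers_per_request
        self.current = 0

    def request_started(self):
        # Guard the update so concurrent callers cannot lose increments.
        with self._lock:
            self.current += self._per_request

    def worker_terminated(self):
        with self._lock:
            self.current -= 1


def simulate(requests, actual_workers_per_request, assumed_workers_per_request):
    """Submit `requests` requests, then let every actual worker terminate."""
    counter = WorkerCounter(assumed_workers_per_request)
    for _ in range(requests):
        counter.request_started()
    threads = [
        threading.Thread(target=counter.worker_terminated)
        for _ in range(requests * actual_workers_per_request)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.current


# Two requests counted as one worker each, but 16 workers per request
# terminate: the count goes negative even though updates are locked.
print(simulate(2, 16, 1))    # -30
# Counting 16 workers per request keeps the books balanced.
print(simulate(2, 16, 16))   # 0
# Twenty requests at 16 workers each exceed a cap of 256 workers.
print(20 * 16 > 256)         # True
```

Under these assumptions the lock alone does not prevent a negative count; the per-request multiplier also has to match what each request actually starts.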
From wilde at mcs.anl.gov Thu Feb 26 01:05:35 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 26 Feb 2009 01:05:35 -0600 Subject: [Swift-devel] Re: [Swift-user] assigning file variables In-Reply-To: <49A632FE.8070906@mcs.anl.gov> References: <49A55309.4050100@mcs.anl.gov> <1235576422.17806.4.camel@localhost> <49A632FE.8070906@mcs.anl.gov> Message-ID: <49A63F3F.10504@mcs.anl.gov> (Im moving this back from swift-user to swift-devel) -- Another aspect of where I'm stuck at the moment is this: Recall (from previous posts ;) that I have 3 nested loops: foreach protein in plist { iterate { foreach i in [1:N] randomlyFoldProtein() } } until convergence or limit reached } In testing the "simulated" version of this (my oops8e.swift example) I had to put the inner folding "round" into a function, in order to force the closing of the array of result files returned by the inner foreach. That was fine, with simple_mapper, because I had pre-mapped the entire 2D result[][] array with simple_mapper, and Swift still let me return an inner array and assign it to a member of the outer array: foreach p, pn in protein { file result[][] ; iterate i { result[i] = doRound(p,i); } until (roundDone(result[i],pn) == 1); } But, that test was over-simplified, because it didnt handle the fact that these returns are really 6-file structs, which motivated me to try ext mapper. But that decision led me back in circles, bouncing between Swift limitations: - ext-mapper cant pre-map a dynamic output structure with any dimensions whose size cant be passed to the mapper (I think?) - arrays can only be closed via return from functions - files and structs with files have limitations on assignments - I cant set a mapping any time I want on any member (field or element) of any structure. Here's a related question: Is it the case that if a function returns an array, that array *must* be declared and mapped in the calling function, *not* in the called function? 
Eg, I cant dynamically declare and map an array *within* a function and return that array out? (I'll try this in the morning). I think I can solve my problems by retreating from ext mapper and accepting the naming conventions of simple_mapper, but the set of restrictions was interesting. This makes me more determined to re-open the discussion on the nature of object, variables, handles, scope, and lifetime, as it seems to me that part of the problem comes from an object model thats almost, but not quite, as regular as it should be. - Mike On 2/26/09 12:13 AM, Michael Wilde wrote: > Can you clarify how the ext mapper behaves differently from say the > simple_mapper for output files, and if the following is correct? > > It seems that for the simple_mapper, the mapper parameters define a > prefix/suffix, and these strings are used wherever necessary at runtime > to form a mapping for any object composed of (possibly nested) structs > and arrays, by bracketing the dynamically-constructed object path. > > But when the ext mapper is used for output, it is expected, in a single > call, to map the entire structure (and hence can only do static mappings)? > > I thought I had my problem solved using the ext mapper, but the > combination of restrictions on assigning file variables and getting the > right info to the ext mapper seems to be forcing me back to simple_mapper. > > (I'll try to assemble examples when I have more time) > > On 2/25/09 9:50 AM, Ben Clifford wrote: >> On Wed, 25 Feb 2009, Mihael Hategan wrote: >> >>> it would be preferable to map t to what m is mapped to >>> from the start >> >> right. often (always?) the desire to do this kind of assignment comes >> from insufficient expressiveness in our mapping semantics. in the >> foreach case, I think my email suggests a reasonable alternative to >> assignments that allows mapping to be generated inside of Swift. In >> the iterate{} case, that in-swift expression is not possible at the >> moment, but could be. 
For example, something like the ext mapper that >> only maps output files, not inputs, and calls a specified swift >> procedure to do that mapping. (that's something that has been >> discussed before, I think) >> From benc at hawaga.org.uk Thu Feb 26 05:34:55 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 26 Feb 2009 11:34:55 +0000 (GMT) Subject: [Swift-devel] Re: [Swift-user] assigning file variables In-Reply-To: <49A63F3F.10504@mcs.anl.gov> References: <49A55309.4050100@mcs.anl.gov> <1235576422.17806.4.camel@localhost> <49A632FE.8070906@mcs.anl.gov> <49A63F3F.10504@mcs.anl.gov> Message-ID: On Thu, 26 Feb 2009, Michael Wilde wrote: > foreach p, pn in protein { > file result[][] > ; > iterate i { > result[i] = doRound(p,i); > } until (roundDone(result[i],pn) == 1); > } > But, that test was over-simplified, because it didnt handle the fact > that these returns are really 6-file structs, which motivated me to try > ext mapper. Assuming the above is working, what breaks when you change file into a 6-member struct? > - ext-mapper cant pre-map a dynamic output structure with any dimensions whose > size cant be passed to the mapper (I think?) yes. > - arrays can only be closed via return from functions no. Not since r1536 | benc at CI.UCHICAGO.EDU | 2008-01-03 Since that commit, there is static analysis of source code, and when no more assignments are left to make to an array, it's regarded as closed. However, in the case of multidimensional arrays, this only happens when the entire top-level array has no more assignments at all, not as each subarray happens to become finished.
Static analysis of arrays (and even runtime analysis to discover when no more assignments may happen to a particular piece) is extremely hard because you're allowed to construct your own indices, and you're allowed to use them in a way that isn't single assignment; I think they're a fairly poor structure to have in SwiftScript the way it's going. For example, in the code fragment: > file result[][] > ; > iterate i { > result[i] = doRound(p,i); > } until (roundDone(result[i],pn) == 1); You can look at that and reason that result[i] won't get assigned any more after the iterate statement for that i, but in general that i can be any expression. In the general case, how do you know that result[2] will never get any more assignments? There are other ways of doing things, for example Haskell's map, fold and unfold, that I think would be much easier to analyse in this case. (hey I get to mention map/reduce here!) foreach in that case could look like this map (making up ugly syntax) with syntax: output = map (range) (code) file results[] = map proteins (p -> { analyse(p); return p}) This means the same as: file results[]; foreach p,i in proteins { results[i] = analyse(p); } What is different is there is now only a single assignment to results. The idea of "array closing" collapses down to "has a single assignment been made?" Iterate would look more like an unfoldr: output = unfold seed step terminateCondition file results[] = unfold initialStep (\prev -> { evaluate(prev); return prev}) Again, you know when results is fully assigned, because there is now only a single statement assigning to it. In addition, in both of these, you know exactly when a member of the array has been assigned - for any element of results, in both the map and unfold case, there is exactly one 'iteration' of the map or unfold which can assign to that element, and that is easily known to Swift because it knows how map/unfold work.
These should be nestable, and in the case of a multidimensional array, you know when any particular sub-array has been assigned, because you know which iteration of the outer map/unfold generates that value. > - files and structs with files have limitations on assignments yes. It's easy to implement struct assignment, for structs where the members have defined assignment semantics already. for files, see other thread. > - I cant set a mapping any time I want on any member (field or element) of any > structure. Yes. > Here's a related question: Is it the case that if a function returns an array, > that array *must* be declared and mapped in the calling function, *not* in the > called function? Eg, I cant dynamically declare and map an array *within* a > function and return that array out? (I'll try this in the morning). By function, you mean procedure, I think (code referenced without a @ prefix). In that case, yes - procedure call semantics are that you pass in where the output belongs. > This makes me more determined to re-open the discussion on the nature of > object, variables, handles, scope, and lifetime, as it seems to me that > part of the problem comes from an object model thats almost, but not > quite, as regular as it should be. yes, it's riddled with prototypiness from before; mostly from imperativeness conflicting with data flow dependencies. It's substantially more consistent than it was a few years ago, though.
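The map/unfold idea in this message can be sketched outside SwiftScript. The Python below is only an analogy, not proposed Swift syntax: `unfold`, `analyse`, `do_round`, and `round_done` are hypothetical stand-ins, and the stopping rule is assumed to behave like iterate ... until (the step runs at least once and stops once the condition holds for the newest value). The point is that each result is bound by one expression, so "closed" reduces to "that one assignment has completed".

```python
def unfold(seed, step, done):
    """Build a list by repeatedly applying `step`, starting from `seed`.

    Each new value is appended, then `done` is checked on it, which
    mirrors an iterate ... until loop: at least one step always runs.
    """
    results = []
    value = seed
    while True:
        value = step(value)
        results.append(value)
        if done(value):
            return results


def analyse(protein):
    # Hypothetical placeholder for an app invocation.
    return protein.upper()


def do_round(previous_round):
    # Hypothetical placeholder: a "round" is just a counter here.
    return previous_round + 1


def round_done(round_number):
    # Hypothetical convergence test: stop after three rounds.
    return round_number >= 3


proteins = ["t1di2", "t3cpo"]

# "map": a single assignment produces the whole array.
results = [analyse(p) for p in proteins]

# "unfold" nested inside a "map": each sub-array is produced by exactly
# one iteration of the outer comprehension, so it is known to be
# complete as soon as that iteration finishes.
rounds = {p: unfold(0, do_round, round_done) for p in proteins}

print(results)  # ['T1DI2', 'T3CPO']
print(rounds)   # {'t1di2': [1, 2, 3], 't3cpo': [1, 2, 3]}
```

No index expression appears on the left-hand side of any assignment here, which is what removes the need to reason about whether some arbitrary index might still be written later.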
-- From wilde at mcs.anl.gov Thu Feb 26 09:31:27 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 26 Feb 2009 09:31:27 -0600 Subject: [Swift-devel] Re: [Swift-user] assigning file variables In-Reply-To: References: <49A55309.4050100@mcs.anl.gov> <1235576422.17806.4.camel@localhost> <49A632FE.8070906@mcs.anl.gov> <49A63F3F.10504@mcs.anl.gov> Message-ID: <49A6B5CF.8030505@mcs.anl.gov> On 2/26/09 5:34 AM, Ben Clifford wrote: > On Thu, 26 Feb 2009, Michael Wilde wrote: > >> foreach p, pn in protein { >> file result[][] >> ; >> iterate i { >> result[i] = doRound(p,i); >> } until (roundDone(result[i],pn) == 1); >> } > >> But, that test was over-simplified, because it didnt handle the fact >> that these returns are really 6-file structs, which motivated me to try >> ext mapper. > > Assuming the above is working, what breaks when you change file into a > 6-member struct? If I just move to the 6-file struct and leave all else the same, I think I can get that to work (I'll be trying this next). But I was trying to preserve the current output structure as well, which is not what I'll get with the code above. If you call the loops: foreach $protein iterate each $round foreach $simulation and the array indices result[$round][$simulation] I wanted: output/r$round/$protein.{pdt,energy,rmsd,...} and what the working code I think will give me is: output/$protein/$round.$simulation.{pdt,energy,rmsd,...} Thats not bad, but I didn't expect it to be so hard to get a specific output structure. Trying to do so was an interesting learning experience about the nature of the language. My conclusion is that the simplest thing that would let me do what I want is to stay with the 2-d array structure, and extend the ext mapper to be dynamically called once for each output mapping desired, passing the ext script the path of the element being mapped. 
Another seemingly-simple solution is a generalization of simple_mapper that allows a more powerful sprintf-like expression to form the file name. I wonder if we could actually move *all* our mappers to "ext" implementations, and implement them with shell, perl, awk, etc scripts? This would seem to make testing new ideas and enhancements pretty easy (and in fact more user extensible), and would have virtually no performance impact on most workflows. (But dont implement anything yet; I think all this needs more thought and discussion before we bounce around on solutions; I just want to gather and organize the issues, then have a language review and see whats most important based on real app needs). >> - ext-mapper cant pre-map a dynamic output structure with any dimensions whose >> size cant be passed to the mapper (I think?) > > yes. Can this be lifted, as above? >> - arrays can only be closed via return from functions > > no. Not since r1536 | benc at CI.UCHICAGO.EDU | 2008-01-03 > > Since that commit, there is static analysis of source code, and when no > more assignments are left to make to an array, its regarded as closed. > > However, in the case of multidimensional arrays, this only happens when > the entire top level array has no more assignments at all, not as each > subarray happens to become finished. OK, so in my case, effectively that restriction remains (although I appreciate the explanation below). Note that I'm not complaining about that restriction in this example. In my case, moving the inner loop into a separate procedure made the code read a bit nicer, in fact. But it led to bumping into the other restrictions mentioned. 
> Static analysis of arrays (and even runtime analysis to discover when no > more assignments may happen to a particular piece) is extremely hard > because you're allowed to construct your own indicies, and you're allowed > to use them in a way that isn't single assignment; I think they're a > fairly poor structure to have in SwiftScript the way its going. By "theyre a fairly poor structure" do you mean user-specified array indices? I fear that removing them will take us too deep into the imperative/functional debate, but perhaps we need to keep that discussion going. > For example, in the code fragment: > >> file result[][] >> ; >> iterate i { >> result[i] = doRound(p,i); >> } until (roundDone(result[i],pn) == 1); > > You can look at that and reason that result[i] won't get assigned any more > after the iterate statement for that i, but in general that i can be any > expression. In the general case, how do you know that result[2] will never > get any more assignments? > > There are other ways of doing things, for example Haskell's map, fold and > unfold, that I think would be much easier to analyse in this case. > > (hey I get to mention map/reduce here!) > > foreach in that case could look like this map (making up ugly syntax) > with syntax: output = map (range) (code) > > file results[] = map proteins (p -> { analyse(p); return p}) > > This means the same as: > > file results[]; > foreach p,i in results { > results[i] = analyse(p); > } > > What is different is there is now only a single assignment to results. The > idea of "array closing" collapses down to "has a single assignment been > made?" > > Iterate would look more like an unfoldr: > > output = unfold seed step terminateCondition > > file results[] = unfold initalStep (\prev -> { evaluate(prev); return prev} > > Again, you know when results is fully assigned, because there is now only > a single statement assigning to it. 
We could discuss if such things could be added as experiments without (yet) removing their imperative equivalents. I think that the question of the attractiveness of the functional model to distributed and parallel programming is a promising research topic. But its not at the top of my priority list for the group, which is usability/productivity, platform support, performance, and provenance. I do agree that it could lead to these, but its uncertain if we can get as many people to use it, and thats where we need to make progress right now. If you think that going in the direction above could take us to the goal quicker than improving the language in its current flavor, I'll listen to a plan. My view right now is that swift is on the right track as-is and is *very close* to becoming *very* usable/productive. If we can identify and make the fewest tweaks we need to iron out current difficulties, we'll be on the right track. And some of those tweaks might be to documentation and examples, not even code changes. I do realize that some of the *tweaks* might be hard. > In addition, in both of these, you know exactly when a member of the array > has been assigned - for any element of results, in both the map and unfold > case, there is exactly one 'iteration' of the map or unfold which can > assign to that element, and that is easily known to Swift because it knows > how map/unfold work. > > These should be nestable, and in the case of a multidimensional array, you > known when any particular sub-array has been assigned, because you know > which iteration of the outer map/unfold generates that value. > >> - files and structs with files have limitations on assignments > > yes. > > Its easy to implement struct assignment, for structs where the members > have defined assignment semantics already. > > for files, see other thread. The conclusion of that thread (in my opinion) is that case (ii), what I would call "value assignment of file handles", is what we want. 
(Where "file handle" is that "marker type" term that I think the debate is still open on). >> - I cant set a mapping any time I want on any member (field or element) of any >> structure. > > Yes. But that's one of the critical things here. I seem to bump into this limitation frequently. Does language consistency require these limitations on setting mappings, or is it an implementation issue that can be lifted? Is it the case that mapping does not affect data flow semantics? >> Here's a related question: Is it the case that if a function returns an array, >> that array *must* be declared and mapped in the calling function, *not* in the >> called function? Eg, I cant dynamically declare and map an array *within* a >> function and return that array out? (I'll try this in the morning). > > By function, you mean procedure, I think (code referenced without a @ > prefix). I was wondering about that difference - I thought it was inconsistent usage in various documents/tutorials. So we should clarify that terminology in the user guide. But better to erase the difference - all callable things, I feel, should have the same name - function or procedure, and they are either built-in, or user (or eventually library) defined. What's the semantic difference between the two today? One distinction I see is that built-in things like trace() can take varying arg types, but trace has no @ and thus looks more like a user-defined procedure syntactically. > In that case, yes - procedure call semantics are that you pass in > where the output belongs. Then this dictates that the caller also do the mapping - hence the names of the members of an array cannot depend on values that will only be known in the called function, which actually creates the array members.
(in my case, doRound()) >> This makes me more determined to re-open the discussion on the nature of >> object, variables, handles, scope, and lifetime, as it seems to me that >> part of the problem comes from an object model thats almost, but not >> quite, as regular as it should be. > > yes, its riddled with prototypiness from before; mostly from > imperativeness conflicting with data flow dependencies. Its substantially > more consistent than it was a few years ago, though. I agree, it's greatly improved and can do some amazing things. My guts tell me that if we can address some of the issues I mentioned on the nature of vars, handles, and mappings, we're in the home stretch. I dont think that a more regular approach to object structure and lifetime would conflict with the dataflow semantics. Maybe we should start a new thread on that specific topic, or resume the old thread. For starters (and feel free to move this to a new thread), do you feel comfortable with the current model of var, dsHandle, and by-value-like assignment? I would like to see a more Java-like model with a var being a typed pointer or scalar value holder, and structs and arrays being dynamic objects, and files being special vars with mapping and state. scalar-var: value (int/string/boolean/float) state (set/unset) object-var pointer to array or struct state (set/unset) file-var mapping state (set/unset) I have to confess that the above is pretty much the way I *thought* Swift worked until we tried to write the latest paper, and had the ensuing email discussions. Then I realized that (even after the discussions) I still dont understand the model. I dont feel that we have yet adequately described the model, neither for a CS paper *nor* for the programmer. I think that a good start is to write a data model description (in the user guide, in a detailed "skip this on first reading" section, that specifies the data model in language-reference-specification fashion). 
From there we can discuss any proposed changes to either terminology and/or implementation. I *think* that with the model above, one should be able to more flexibly set mappings - in fact, set them from swift code, with some kind of assignment (like f=<> expression; or f). From benc at hawaga.org.uk Thu Feb 26 10:39:01 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 26 Feb 2009 16:39:01 +0000 (GMT) Subject: [Swift-devel] Re: [Swift-user] assigning file variables In-Reply-To: <49A6B5CF.8030505@mcs.anl.gov> References: <49A55309.4050100@mcs.anl.gov> <1235576422.17806.4.camel@localhost> <49A632FE.8070906@mcs.anl.gov> <49A63F3F.10504@mcs.anl.gov> <49A6B5CF.8030505@mcs.anl.gov> Message-ID: On Thu, 26 Feb 2009, Michael Wilde wrote: > Another seemingly-simple solution is a generalization of simple_mapper that > allows a more powerful sprintf-like expression to form the file name. I think an interesting approach is to look at having a mapper call an arbitrary Swift procedure that returns a string. > I wonder if we could actually move *all* our mappers to "ext" implementations, > and implement them with shell, perl, awk, etc scripts? > This would seem to make testing new ideas and enhancements pretty easy (and in > fact more user extensible), and would have virtually no performance impact on > most workflows. I think the ext interface isn't sufficiently expressive for that at the moment. The whole mapper API feels rather messy to me at the moment, and if we're doing development there, putting more serious consideration into what it should look like seems worthwhile. > and would have virtually no performance impact on > most workflows. Do you have numbers to back that up? > > > - ext-mapper cant pre-map a dynamic output structure with any dimensions > > > whose > > > size cant be passed to the mapper (I think?) > > > > yes. > > Can this be lifted, as above? I think not easily. But see above paragraph about API design.
> > However, in the case of multidimensional arrays, this only happens when the > > entire top level array has no more assignments at all, not as each subarray > > happens to become finished. > > OK, so in my case, effectively that restriction remains (although I appreciate > the explanation below). Note that I'm not complaining about that restriction > in this example. In my case, moving the inner loop into a separate procedure > made the code read a bit nicer, in fact. But it led to bumping into the other > restrictions mentioned. I think its an undesirable restriction. However... > > Static analysis of arrays (and even runtime analysis to discover when no > > more assignments may happen to a particular piece) is extremely hard because > > you're allowed to construct your own indicies, and you're allowed to use > > them in a way that isn't single assignment; I think they're a fairly poor > > structure to have in SwiftScript the way its going. > > By "theyre a fairly poor structure" do you mean user-specified array indices? > I fear that removing them will take us too deep into the > imperative/functional debate, but perhaps we need to keep that discussion > going. Yes, I mean user-specified array indices. > We could discuss if such things could be added as experiments without (yet) > removing their imperative equivalents. I think that the question of the > attractiveness of the functional model to distributed and parallel programming > is a promising research topic. But its not at the top of my priority list for > the group, which is usability/productivity, platform support, performance, I think that its important from a user-interface perspective, not particularly from a research perspective. 
This style of piecewise assignment to arrays plays merry hell with trying to do data-dependent ordering in a way that I think is not easily resolvable; and anyone trying to do anything at all interesting with arrays gets hit by strange things happening - "I know i've assigned everything but somehow the next stage isn't running". Syntactically this stuff doesn't have to look too different from what it looks like now, and we don't have to use particularly scary words like map or haskell (although I will point out the doublethink inherent in "we don't want functional" vs. "google map/reduce is god") > I was wondering about that difference - I thought it was inconsistent usage in > various documents/tutorials. So we should clarify that terminology in the user > guide. But better to erase the differnce - all callable things, I feel, should > have the same name - function or procedure, and they are either built-in, or > user (or eventually library) defined. > Whats the semantic difference between the two today? One distinction I see is > that built-in things like trace() can take varying arg types, but trace has no > @ and thus looks more like a user-defined procedure syntactically. @strcat takes varargs too. Those differences are ever more insignificant and with time will disappear entirely, I think. At the moment, it's the return semantics that make them different. Historically, @functions returned in-memory values, and procedures operated on files; with @functions being intended for constructing parameters in mapping parameters, and procedures being the equivalent of a VDL1 procedure invocation. That distinction has blurred greatly over time. > I dont feel that we have yet adequately described the model, neither for a CS > paper *nor* for the programmer. I think that a good start is to write a data > model description (in the user guide, in a detailed "skip this on first > reading" section, that specifies the data model in > language-reference-specification fashion).
right. I'll see about writing more, as I'm in a writing mood this month ;) -- From aespinosa at cs.uchicago.edu Thu Feb 26 12:44:34 2009 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Thu, 26 Feb 2009 12:44:34 -0600 Subject: on coaster accounting (was Re: [Swift-devel] current workers < 0 ?) Message-ID: <50b07b4b0902261044kcc21925tbbdb0c51d21a48b4@mail.gmail.com> Here i reverted to the 1 coaster per node configuration: Here is the content of the LRM : JOBID JOBNAME USERNAME STATE CORE REMAINING STARTTIME ================================================================================ 561497 data tg802895 Running 16 00:21:26 Thu Feb 26 12:36:45 561498 data tg802895 Running 16 00:21:26 Thu Feb 26 12:36:45 561499 data tg802895 Running 16 00:21:26 Thu Feb 26 12:36:45 .... .... ... 561547 data tg802895 Running 16 00:23:42 Thu Feb 26 12:39:01 50 active jobs : 50 of 3896 hosts ( 1.28 %) Total jobs: 50 Active Jobs: 50 Waiting Jobs: 0 Dep/Unsched Jobs: 0 Here is the current workers: 2009-02-26 12:38:50,412-0600 INFO WorkerManager Current workers: 111 2009-02-26 12:38:50,412-0600 INFO CoasterQueueProcessor Coaster queue: [org.glo 2009-02-26 12:38:50,413-0600 INFO WorkerManager Ready: 0 {} 2009-02-26 12:38:50,413-0600 INFO WorkerManager Busy: 0 [Worker[-1480006551], W 2009-02-26 12:38:50,413-0600 INFO WorkerManager Requested: 61 {2109491608=Worke 2009-02-26 12:38:50,414-0600 INFO WorkerManager Starting: 32 [Task(type=JOB_SUB 2009-02-26 12:38:50,414-0600 INFO WorkerManager Ids: 13 {1104104218=Worker[1104 2009-02-26 12:38:50,414-0600 INFO WorkerManager AllocationR: [org.globus.cog.ab On Wed, Feb 25, 2009 at 11:27 PM, Mihael Hategan wrote: > I suspect the issue was introduced by the addition of multiple coasters > per node. The manager expects one worker, but gets 16 instead. > > On Wed, 2009-02-25 at 22:29 -0600, Allan Espinosa wrote: >> It still has the same issues. It subtracts too much when a task if finished.
>> >> Also, observing the LRM queue, i see swift creating 18-20 "make >> coaster" requests (4 at start then 16-18 after 5 mins). with a 16 >> coastersPerNode you get a 320 processor allocation. this more than >> MAX_WORKERS~256 and the max score possible from my sites.xml (102 max) > > Regarding MAX_WORKERS, that probably suffers from the same problem, in > that it may request less than 256 workers, but given that each request > means 16 workers, the end result may be different than what's expected. > > However, MAX_WORKERS was introduced merely to limit damage in case the > code is bad and it doesn't otherwise put an upper bound on the limit of > worker requests (/jobs in the queue). From hategan at mcs.anl.gov Thu Feb 26 13:03:18 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 26 Feb 2009 13:03:18 -0600 Subject: on coaster accounting (was Re: [Swift-devel] current workers < 0 ?) In-Reply-To: <50b07b4b0902261044kcc21925tbbdb0c51d21a48b4@mail.gmail.com> References: <50b07b4b0902261044kcc21925tbbdb0c51d21a48b4@mail.gmail.com> Message-ID: <1235674998.15221.2.camel@localhost> There are 50 running workers and 61 somewhere between being submitted and contacting the service. What's the question? On Thu, 2009-02-26 at 12:44 -0600, Allan Espinosa wrote: > Here i reverted to the 1 coaster per node configuration: Here is the > content of the LRM : > > JOBID JOBNAME USERNAME STATE CORE REMAINING STARTTIME > ================================================================================561497 > data tg802895 Running 16 00:21:26 Thu Feb 26 > 12:36:45 > 561498 data tg802895 Running 16 00:21:26 Thu Feb 26 12:36:45 > 561499 data tg802895 Running 16 00:21:26 Thu Feb 26 12:36:45 > .... > .... > ... 
> 561547 data tg802895 Running 16 00:23:42 Thu Feb 26 12:39:01 > > 50 active jobs : 50 of 3896 hosts ( 1.28 %) > > > Total jobs: 50 Active Jobs: 50 Waiting Jobs: 0 Dep/Unsched Jobs: 0 > > Here is the current workers: > > 2009-02-26 12:38:50,412-0600 INFO WorkerManager Current workers: 111 > 2009-02-26 12:38:50,412-0600 INFO CoasterQueueProcessor Coaster > queue: [org.glo2009-02-26 12:38:50,413-0600 INFO WorkerManager Ready: > 0 {} > 2009-02-26 12:38:50,413-0600 INFO WorkerManager Busy: 0 > [Worker[-1480006551], W2009-02-26 12:38:50,413-0600 INFO > WorkerManager Requested: 61 {2109491608=Worke2009-02-26 > 12:38:50,414-0600 INFO WorkerManager Starting: 32 > [Task(type=JOB_SUB2009-02-26 12:38:50,414-0600 INFO WorkerManager > Ids: 13 {1104104218=Worker[11042009-02-26 12:38:50,414-0600 INFO > WorkerManager AllocationR: [org.globus.cog.ab > > > > On Wed, Feb 25, 2009 at 11:27 PM, Mihael Hategan wrote: > > I suspect the issue was introduced by the addition of multiple coasters > > per node. The manager expects one worker, but gets 16 instead. > > > > On Wed, 2009-02-25 at 22:29 -0600, Allan Espinosa wrote: > >> It still has the same issues. It subtracts too much when a task if finished. > >> > >> Also, observing the LRM queue, i see swift creating 18-20 "make > >> coaster" requests (4 at start then 16-18 after 5 mins). with a 16 > >> coastersPerNode you get a 320 processor allocation. this more than > >> MAX_WORKERS~256 and the max score possible from my sites.xml (102 max) > > > > Regarding MAX_WORKERS, that probably suffers from the same problem, in > > that it may request less than 256 workers, but given that each request > > means 16 workers, the end result may be different than what's expected. > > > > However, MAX_WORKERS was introduced merely to limit damage in case the > > code is bad and it doesn't otherwise put an upper bound on the limit of > > worker requests (/jobs in the queue). 
From aespinosa at cs.uchicago.edu Thu Feb 26 13:29:44 2009 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Thu, 26 Feb 2009 13:29:44 -0600 Subject: on coaster accounting (was Re: [Swift-devel] current workers < 0 ?) In-Reply-To: <1235674998.15221.2.camel@localhost> References: <50b07b4b0902261044kcc21925tbbdb0c51d21a48b4@mail.gmail.com> <1235674998.15221.2.camel@localhost> Message-ID: <50b07b4b0902261129o6a67b39fy204c8850305f5384@mail.gmail.com> So the Requested list is for the tasks being received and not the "make coaster" request to the LRM? also, currentWorkers is the "demand" for coasters and not the number of coasters that are available (busy or ready) thus the best way to graph a "number of avail processors" & "current usage" vs time is using the size of Ids and Busy right? On Thu, Feb 26, 2009 at 1:03 PM, Mihael Hategan wrote: > There are 50 running workers and 61 somewhere between being submitted > and contacting the service. What's the question? > From hategan at mcs.anl.gov Thu Feb 26 14:55:33 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 26 Feb 2009 14:55:33 -0600 (CST) Subject: on coaster accounting (was Re: [Swift-devel] current workers < 0 ?) In-Reply-To: <50b07b4b0902261129o6a67b39fy204c8850305f5384@mail.gmail.com> Message-ID: <20544682.352711235681733961.JavaMail.root@zimbra> ----- Allan Espinosa wrote: > So the Requested list is for the tasks being received and not the > "make coaster" request to the LRM? The requested list is to track all the workers that the manager plans to have started and has put a request for to the underlying provider (LRM, but see below) but haven't yet started or failed. The manager attempting to start a job is not the same as that job being in the LRM queue. Between delays, asynchronicity, and just weird job managers/LRMs, stuff happens. > > also, currentWorkers is the "demand" for coasters and not the number > of coasters that are available (busy or ready) Right. 
It's supposed to track the total amount of workers: busy, ready, and starting. > > thus the best way to graph a "number of avail processors" & "current > usage" vs time is using the size of Ids and Busy right? Somewhat. Busy will tell you the workers the manager thinks are running jobs. Ids is there to allow quick lookup of a worker based on its id. I'm not sure what stages of a worker's life (busy, ready, starting) it includes. > > On Thu, Feb 26, 2009 at 1:03 PM, Mihael Hategan wrote: > > There are 50 running workers and 61 somewhere between being submitted > > and contacting the service. What's the question? > > From wilde at mcs.anl.gov Thu Feb 26 18:08:40 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 26 Feb 2009 18:08:40 -0600 Subject: [Swift-devel] output format of simple_mapper Message-ID: <49A72F08.4050608@mcs.anl.gov> When I apply this mapping to a 2D array of files: file result[][] ; then I get files like: output/T1di2/0004.0001.pdt but when I apply this mapping to a 2D array of structs of files: OOPSOut result[][] ; then I get files like: output/T3cpo/0000_0000.pdt Not a problem, just curious what motivated the difference (of sub1.sub2 vs sub1_sub2)? From hategan at mcs.anl.gov Thu Feb 26 18:24:29 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 26 Feb 2009 18:24:29 -0600 (CST) Subject: [Swift-devel] output format of simple_mapper In-Reply-To: <49A72F08.4050608@mcs.anl.gov> Message-ID: <13895265.374241235694269652.JavaMail.root@zimbra> I think the logic was that if you have a type path (say a.b.c), it would be mapped to a_b.c, where the last element gives the extension. This was inspired by the analyze format, where we would usually have a struct "image {hdr, img}", so that mapper would magically end up naming files with the proper extension for that case. In your first case, given that the second index is the last element in the path, it will be separated by ".", and then you add ".pdt" to that. 
In the second case, I assume in OOPSOut your field is named "pdt" and that ends up being the last element in the path.

If you were to try file result[][][] <...>, you would get names like: 0004_0005.0001

Mihael

----- Michael Wilde wrote:
> When I apply this mapping to a 2D array of files:
> file result[][] prefix=@strcat("output/",p,"/"),suffix=".pdt">;
>
> then I get files like:
>
> output/T1di2/0004.0001.pdt
>
> but when I apply this mapping to a 2D array of structs of files:
>
> OOPSOut result[][] ;
>
> then I get files like:
>
> output/T3cpo/0000_0000.pdt
>
> Not a problem, just curious what motivated the difference (of sub1.sub2
> vs sub1_sub2)?
>
>
>
>
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel

From benc at hawaga.org.uk Thu Feb 26 18:42:01 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 27 Feb 2009 00:42:01 +0000 (GMT)
Subject: [Swift-devel] output format of simple_mapper
In-Reply-To: <13895265.374241235694269652.JavaMail.root@zimbra>
References: <13895265.374241235694269652.JavaMail.root@zimbra>
Message-ID:

AbstractMapper has this rather case:

    if (level < tokenCount - 2) {
        logger.debug("Adding mapper-specified separator");
        sb.append(getElementMapper().getSeparator(level));
    } else {
        logger.debug("Adding '.' instead of mapper-specified separator");
        sb.append('.');
    }

that implements the behaviour Mihael describes.

Don't let the name fool you - simple_mapper is not the simplest thing in the world...

--

From hategan at mcs.anl.gov Thu Feb 26 18:58:19 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 26 Feb 2009 18:58:19 -0600 (CST)
Subject: [Swift-devel] output format of simple_mapper
In-Reply-To:
Message-ID: <14520899.374411235696299617.JavaMail.root@zimbra>

----- Ben Clifford wrote:
>
> AbstractMapper has this rather case:
>

Monologue smelling of dialogue:
"Rather what?"
"Case!"
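
The separator rule discussed in this thread can be sketched as a small standalone program. This is an illustration under assumptions, not the real AbstractMapper: `joinPath` is a hypothetical helper, the loop around the quoted `if` is guessed, and "_" stands in for whatever the element mapper's separator is. Only the `level < tokenCount - 2` test comes from the snippet quoted above.

```java
import java.util.Arrays;
import java.util.List;

// Sketch only: joinPath is a hypothetical helper, not the real AbstractMapper API.
class SeparatorSketch {
    // Joins path tokens, using '.' before the last token and the
    // mapper-specified separator everywhere else.
    public static String joinPath(List<String> tokens, String separator) {
        StringBuilder sb = new StringBuilder();
        int tokenCount = tokens.size();
        for (int level = 0; level < tokenCount; level++) {
            sb.append(tokens.get(level));
            if (level == tokenCount - 1) {
                break; // nothing follows the last token
            }
            if (level < tokenCount - 2) {
                sb.append(separator); // mapper-specified separator
            } else {
                sb.append('.');       // last element reads as an extension
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // file result[][] with suffix ".pdt": two indices, then the suffix.
        System.out.println(joinPath(Arrays.asList("0004", "0001"), "_") + ".pdt");
        // OOPSOut result[][] with a field named "pdt": the field is the last token.
        System.out.println(joinPath(Arrays.asList("0000", "0000", "pdt"), "_"));
        // file result[][][]: three indices.
        System.out.println(joinPath(Arrays.asList("0004", "0005", "0001"), "_"));
    }
}
```

Under these assumptions the three calls print 0004.0001.pdt, 0000_0000.pdt and 0004_0005.0001, matching the file names reported in the thread.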
From benc at hawaga.org.uk Thu Feb 26 19:24:34 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 27 Feb 2009 01:24:34 +0000 (GMT) Subject: [Swift-devel] output format of simple_mapper In-Reply-To: <14520899.374411235696299617.JavaMail.root@zimbra> References: <14520899.374411235696299617.JavaMail.root@zimbra> Message-ID: > > > > AbstractMapper has this rather case: > > > > Monologue smelling of dialogue: > "Rather what?" > "Case!" well, there was an adjective there originally, but I removed it. -- From wilde at mcs.anl.gov Thu Feb 26 20:51:24 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 26 Feb 2009 20:51:24 -0600 Subject: [Swift-devel] output format of simple_mapper In-Reply-To: <13895265.374241235694269652.JavaMail.root@zimbra> References: <13895265.374241235694269652.JavaMail.root@zimbra> Message-ID: <49A7552C.202@mcs.anl.gov> On 2/26/09 6:24 PM, Mihael Hategan wrote: > I think the logic was that if you have a type path > (say a.b.c), it would be mapped to a_b.c, where the > last element gives the extension. This was inspired by the analyze > format, where we would usually have a struct "image {hdr, img}", so > that mapper would magically end up naming files with the proper > extension for that case. I see an argument for a sprintf mapper here. But like Ben suggested earlier, the whole mapper thing needs assessment and redesign. Trick there will be some amount of deprecatable backwards compat. > In your first case, given that the second index is the last > element in the path, it will be separated by ".", and then you > add ".pdt" to that. > > In the second case, I assume in OOPSOut your field is named "pdt" > and that ends up being the last element in the path. 
> > If you were to try file result[][][] <...>, you would get names > like: 0004_0005.0001 That would violate the principle of least astonishment ;) > > Mihael > > ----- Michael Wilde wrote: >> When I apply this mapping to a 2D array of files: >> file result[][] > prefix=@strcat("output/",p,"/"),suffix=".pdt">; >> >> then I get files like: >> >> output/T1di2/0004.0001.pdt >> >> but when I apply this mapping to a 2D array of structs of files: >> >> OOPSOut result[][] ; >> >> then I get files like: >> >> output/T3cpo/0000_0000.pdt >> >> Not a problem, just curious what motivated the difference (of sub1.sub2 >> vs sub1_sub2)? >> >> >> >> >> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From hategan at mcs.anl.gov Thu Feb 26 21:04:48 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 26 Feb 2009 21:04:48 -0600 (CST) Subject: [Swift-devel] output format of simple_mapper In-Reply-To: <49A7552C.202@mcs.anl.gov> Message-ID: <15448138.376321235703888961.JavaMail.root@zimbra> ----- Michael Wilde wrote: > > > On 2/26/09 6:24 PM, Mihael Hategan wrote: > > I think the logic was that if you have a type path > > (say a.b.c), it would be mapped to a_b.c, where the > > last element gives the extension. This was inspired by the analyze > > format, where we would usually have a struct "image {hdr, img}", so > > that mapper would magically end up naming files with the proper > > extension for that case. > > I see an argument for a sprintf mapper here. But like Ben suggested > earlier, the whole mapper thing needs assessment and redesign. > > Trick there will be some amount of deprecatable backwards compat. > > > In your first case, given that the second index is the last > > element in the path, it will be separated by ".", and then you > > add ".pdt" to that. 
> > > > In the second case, I assume in OOPSOut your field is named "pdt" > > and that ends up being the last element in the path. > > > > If you were to try file result[][][] <...>, you would get names > > like: 0004_0005.0001 > > That would violate the principle of least astonishment ;) Except you did find it convenient when you used the struct, and the extension just happened to be right :) > > > > > Mihael > > > > ----- Michael Wilde wrote: > >> When I apply this mapping to a 2D array of files: > >> file result[][] >> prefix=@strcat("output/",p,"/"),suffix=".pdt">; > >> > >> then I get files like: > >> > >> output/T1di2/0004.0001.pdt > >> > >> but when I apply this mapping to a 2D array of structs of files: > >> > >> OOPSOut result[][] ; > >> > >> then I get files like: > >> > >> output/T3cpo/0000_0000.pdt > >> > >> Not a problem, just curious what motivated the difference (of sub1.sub2 > >> vs sub1_sub2)? > >> > >> > >> > >> > >> > >> _______________________________________________ > >> Swift-devel mailing list > >> Swift-devel at ci.uchicago.edu > >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > From benc at hawaga.org.uk Fri Feb 27 04:02:59 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 27 Feb 2009 10:02:59 +0000 (GMT) Subject: [Swift-devel] Re: [Swift-user] assigning file variables In-Reply-To: References: <49A55309.4050100@mcs.anl.gov> <1235576422.17806.4.camel@localhost> <49A632FE.8070906@mcs.anl.gov> <49A63F3F.10504@mcs.anl.gov> <49A6B5CF.8030505@mcs.anl.gov> Message-ID: On Thu, 26 Feb 2009, Ben Clifford wrote: > This style of piecewise assignment to arrays plays merry hell with > trying to do data-dependent ordering in a way that I think is not easily > resolvable; and anyone trying to do anything at all interesting with > arrays gets hit by strange things happening - "I know i've assigned > everything but somehow the next stage isn't running". 
A different way of looking at this:

Why is it that Swift can have the 'close array returned from a procedure call' behaviour which made you move code out of the loop body and into a procedure?

It's because, from the calling code, the procedure call looks like a single assignment:

file a[] = foo();

or when accessing sub-arrays:

file a[][];
a[7] = foo();

We know a[7], which is an entire array, has its entire value because that assignment is the only place that a[7] can have its elements assigned - there is a single statement which assigns its entire value.

--

From benc at hawaga.org.uk Fri Feb 27 06:13:56 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 27 Feb 2009 12:13:56 +0000 (GMT)
Subject: [Swift-devel] Re: [Swift-user] assigning file variables
In-Reply-To: <49A6B5CF.8030505@mcs.anl.gov>
References: <49A55309.4050100@mcs.anl.gov> <1235576422.17806.4.camel@localhost> <49A632FE.8070906@mcs.anl.gov> <49A63F3F.10504@mcs.anl.gov> <49A6B5CF.8030505@mcs.anl.gov>
Message-ID:

On Thu, 26 Feb 2009, Michael Wilde wrote:

> But thats one of the critical things here. I seem to bump into this
> limitation frequently. Does language consistency require these
> limitations on setting mappings, or is it an implementation issue that
> can be lifted? Is it the case that mapping does not affect data flow
> semantics?

From a high-level perspective, I don't think the language requires much about what is mapped where and based on what.

The present implementations of the mappers and the framework surrounding those mappers compel stricter requirements.

For example, at present, an entire data structure rooted at some variable declaration is regarded as either "not mapped" or "mapped" in its entirety - that is, either the mapper is not yet initialized, and so any attempts to ask it about the data structure it maps must be deferred; or it is initialized and therefore can answer authoritatively about any part of that data structure.
This is pretty much what is meant by "Swift does not have streaming mappers".

What you propose, being able to map some subpiece of a data structure programmatically, ties in closely with the 'streaming mapper' concept, I think. The 'stream of new things' comes perhaps from some ongoing external process, some deliberate rate limiting, or from ongoing evaluation of other pieces of SwiftScript that come up with new mappings.

> For starters (and feel free to move this to a new thread), do you feel
> comfortable with the current model of var, dsHandle, and by-value-like
> assignment?
>
> I would like to see a more Java-like model with a var being a typed pointer or
> scalar value holder, and structs and arrays being dynamic objects, and files
> being special vars with mapping and state.

That's very much what SwiftScript has now. Can you describe what you perceive to be the salient differences?

> I dont feel that we have yet adequately described the model, neither for a CS
> paper *nor* for the programmer. I think that a good start is to write a data
> model description (in the user guide, in a detailed "skip this on first
> reading" section, that specifies the data model in
> language-reference-specification fashion).

Yes, I think getting a more rigorous description of what actually happens, warts and all, would be useful.

I think targeting such a description at a CS paper is the wrong way to go - we need to be adding copious warts and ugliness to the description, making it mind-numbingly tedious to read, not coughing politely and deleting such paragraphs to make a paper that a wider audience is interested in.
-- From wilde at mcs.anl.gov Fri Feb 27 08:55:53 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 27 Feb 2009 08:55:53 -0600 Subject: [Swift-devel] Re: [Swift-user] assigning file variables In-Reply-To: References: <49A55309.4050100@mcs.anl.gov> <1235576422.17806.4.camel@localhost> <49A632FE.8070906@mcs.anl.gov> <49A63F3F.10504@mcs.anl.gov> <49A6B5CF.8030505@mcs.anl.gov> Message-ID: <49A7FEF9.9020508@mcs.anl.gov> Thats a good explanation, worth adding to the user guide text on this topic. I think on first read one (eg, me ;) misses the subtlety that the foreach is special in how the array is treated within it, and that outside foreach statements, arrays need to be closed. The challenge will be to see in general how feasible it is to code in such a way that you always to a close via a procedure return. My guess is it will be feasible, but may lead to more procedures than a user might use in an imperative style. Thats not bad, as long as coding is easy and the resulting code is clear. We're getting some good experience now as we build libraries for CNARI, OOPS, DOCK and more. related: whats a suggested debugging technique when "the next stage isnt running"? Thats exactly what happened to me, and one of the harder things in swift to debug. I noticed by chance the other day that swift seems to read debugging commands of some sort from stdin? I may have missed it, but what are these, and can users use them to find the state of a stuck script? On 2/27/09 4:02 AM, Ben Clifford wrote: > On Thu, 26 Feb 2009, Ben Clifford wrote: > >> This style of piecewise assignment to arrays plays merry hell with >> trying to do data-dependent ordering in a way that I think is not easily >> resolvable; and anyone trying to do anything at all interesting with >> arrays gets hit by strange things happening - "I know i've assigned >> everything but somehow the next stage isn't running". 
> > A different way of looking at this: > > Why is it that Swift can have the 'close array returned from a procedure > call' behaviour which made you move code out of the loop body and into a > procedure? > > Its because from the calling code, the procedure call looks like a single > assignment: > > file a[] = foo(); > > or when accessing sub-arrays: > > file a[][]; > a[7] = foo(); > > We know a[7], which is an entire array, has its entire value because that > assignment is the only place that a[7] can have its elements assigned - > there is a single statement which assigns its entire value. > From benc at hawaga.org.uk Fri Feb 27 08:57:22 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 27 Feb 2009 14:57:22 +0000 (GMT) Subject: [Swift-devel] log processing in main tree Message-ID: I've rearranged the log-processing/ SVN directory so that its contents are in two directories: bin/ (2 commands) libexec/log-processing/ with the notes in the previous README file moved over to the existing docs/log-processing/ module. Placing the log-processing code under a subdirectory of libexec keeps the many files there nicely separated from other libexec stuff. What I'd like to do for 0.9 is move that tree as-in into the trunk/ module so that all Swift builds have this stuff, rather than this being a separate SVN checkout. -- From benc at hawaga.org.uk Fri Feb 27 09:06:25 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 27 Feb 2009 15:06:25 +0000 (GMT) Subject: [Swift-devel] Re: [Swift-user] assigning file variables In-Reply-To: <49A7FEF9.9020508@mcs.anl.gov> References: <49A55309.4050100@mcs.anl.gov> <1235576422.17806.4.camel@localhost> <49A632FE.8070906@mcs.anl.gov> <49A63F3F.10504@mcs.anl.gov> <49A6B5CF.8030505@mcs.anl.gov> <49A7FEF9.9020508@mcs.anl.gov> Message-ID: On Fri, 27 Feb 2009, Michael Wilde wrote: > The challenge will be to see in general how feasible it is to code in such a > way that you always to a close via a procedure return. 
My guess is it will be > feasible, but may lead to more procedures than a user might use in an > imperative style. I'd rather not force a separate procedure style; that's part of my argument for having iteration constructs that are sympathetic to single assignment analysis rather than being almost perfectly anti-sympathetic. > related: whats a suggested debugging technique when "the next stage isnt > running"? Thats exactly what happened to me, and one of the harder things in > swift to debug. > > I noticed by chance the other day that swift seems to read debugging commands > of some sort from stdin? I may have missed it, but what are these, and can > users use them to find the state of a stuck script? There are two commands: v and t to show waiting variables and threads. I'm not sure how useful the output of that is in your case. Its not really a public interface, but you might be able to make sense of it. There is much in place for easy debugging of dataflow-based hangs like this - previously, I've put effort into tightening the analysis so hangs don't happen, rather than into detecting and reporting hangs; and I'd like to continue in that trend (although at the moment, the next useful step there is to remove [index] based assignment entirely...) 
-- From wilde at mcs.anl.gov Fri Feb 27 09:21:33 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 27 Feb 2009 09:21:33 -0600 Subject: [Swift-devel] Re: [Swift-user] assigning file variables In-Reply-To: References: <49A55309.4050100@mcs.anl.gov> <1235576422.17806.4.camel@localhost> <49A632FE.8070906@mcs.anl.gov> <49A63F3F.10504@mcs.anl.gov> <49A6B5CF.8030505@mcs.anl.gov> <49A7FEF9.9020508@mcs.anl.gov> Message-ID: <49A804FD.9050101@mcs.anl.gov> On 2/27/09 9:06 AM, Ben Clifford wrote: > There is much in place for easy debugging of dataflow-based hangs like > this - previously, I've put effort into tightening the analysis so hangs > don't happen, rather than into detecting and reporting hangs; and I'd like > to continue in that trend (although at the moment, the next useful step > there is to remove [index] based assignment entirely...) I assume you meant "there is not much in place", and thats fine. Your approach sounds good, lets see where it leads. The question above leads to the interesting research topic of "how to show the state of, and debug, concurrent functional programs". I suspect theres some (much?) work on that out there. From benc at hawaga.org.uk Fri Feb 27 09:22:58 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 27 Feb 2009 15:22:58 +0000 (GMT) Subject: [Swift-devel] Re: [Swift-user] assigning file variables In-Reply-To: <49A804FD.9050101@mcs.anl.gov> References: <49A55309.4050100@mcs.anl.gov> <1235576422.17806.4.camel@localhost> <49A632FE.8070906@mcs.anl.gov> <49A63F3F.10504@mcs.anl.gov> <49A6B5CF.8030505@mcs.anl.gov> <49A7FEF9.9020508@mcs.anl.gov> <49A804FD.9050101@mcs.anl.gov> Message-ID: On Fri, 27 Feb 2009, Michael Wilde wrote: > I assume you meant "there is not much in place", and thats fine. Your approach > sounds good, lets see where it leads. yes: 'not much'. 
-- From hategan at mcs.anl.gov Fri Feb 27 09:58:12 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 27 Feb 2009 09:58:12 -0600 Subject: [Swift-devel] Re: [Swift-user] assigning file variables In-Reply-To: <49A804FD.9050101@mcs.anl.gov> References: <49A55309.4050100@mcs.anl.gov> <1235576422.17806.4.camel@localhost> <49A632FE.8070906@mcs.anl.gov> <49A63F3F.10504@mcs.anl.gov> <49A6B5CF.8030505@mcs.anl.gov> <49A7FEF9.9020508@mcs.anl.gov> <49A804FD.9050101@mcs.anl.gov> Message-ID: <1235750292.1344.6.camel@localhost> On Fri, 2009-02-27 at 09:21 -0600, Michael Wilde wrote: > The question above leads to the interesting research topic of "how to > show the state of, and debug, concurrent functional programs". I suspect > theres some (much?) work on that out there. > Not that much. Debugging lazy languages is a known difficulty. Debugging future-based languages, like swift, I don't know. However, I think that the "who's waiting on what" information can be presented to the user in such a way as to make it more clear where problems are. From bugzilla-daemon at mcs.anl.gov Fri Feb 27 20:06:32 2009 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Fri, 27 Feb 2009 20:06:32 -0600 (CST) Subject: [Swift-devel] [Bug 179] New: coaster request throttling and (currentWorkers <0) Message-ID: http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=179 Summary: coaster request throttling and (currentWorkers <0) Product: Swift Version: unspecified Platform: PC OS/Version: Linux Status: NEW Severity: normal Priority: P2 Component: Log processing and plotting AssignedTo: hategan at mcs.anl.gov ReportedBy: aespinosa at cs.uchicago.edu The number of currentWorkers becomes < 0. this has impact on how coasters get throttled. In an example session, it can be observed in the LRM creating 18-20 "make coaster" requests (4 at start then 16-18 after 5 mins). with a 16 coastersPerNode you get a 320 processor allocation. 
this is more than MAX_WORKERS~256 and the max score possible from my sites.xml (102 max)

1 1

2009-02-25 20:31:15,590-0600 INFO Worker Worker stderr: null
2009-02-25 20:31:15,590-0600 WARN WorkerManager Worker terminated: Worker[-1909333457]
2009-02-25 20:31:15,590-0600 WARN Worker Worker 335457820 status change: Completed
2009-02-25 20:31:15,590-0600 INFO Worker Worker stdout: Job You has completed. Writing job STDOUT and STDERR to cache files. Returning job success.
2009-02-25 20:31:15,590-0600 INFO Worker Worker stderr: null
2009-02-25 20:31:15,590-0600 WARN WorkerManager Worker terminated: Worker[335457820]
******2009-02-25 20:31:15,742-0600 INFO WorkerManager Current workers: -32****
2009-02-25 20:31:15,745-0600 INFO WorkerManager Ready: {}
2009-02-25 20:31:15,745-0600 INFO WorkerManager Busy: [Worker[-1260987422], Worker[2142641145], Worker[2053757208
2009-02-25 20:31:15,751-0600 INFO WorkerManager Requested: {640597733=Worker[640597733], -692025578=Worker[-69202
2009-02-25 20:31:15,751-0600 INFO WorkerManager Starting: [Task(type=JOB_SUBMISSION, identity=urn:1235615211813-1
2009-02-25 20:31:15,752-0600 INFO WorkerManager Ids: {1078934147=Worker[1078934147], 264613139=Worker[264613139],
2009-02-25 20:31:15,753-0600 INFO WorkerManager AllocationR: [org.globus.cog.abstraction.coaster.service.job.mana
2009-02-25 20:31:15,873-0600 INFO AbstractKarajanChannel SC-null REQ:

--
Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at mcs.anl.gov Fri Feb 27 20:09:45 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Fri, 27 Feb 2009 20:09:45 -0600 (CST)
Subject: [Swift-devel] [Bug 180] New: multi-node coasters?
Message-ID:

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=180

Summary: multi-node coasters?
Product: Swift
Version: unspecified
Platform: PC
OS/Version: Linux
Status: NEW
Severity: enhancement
Priority: P2
Component: Specific site issues
AssignedTo: benc at hawaga.org.uk
ReportedBy: aespinosa at cs.uchicago.edu

In a 1-coaster-per-node configuration, site policies sometimes limit the number of jobs you can have in the queue (e.g. 50). Thus, even though the score can reach up to 102, the maximum number of active jobs at a site is only 50. This can be worked around by requesting 2 or more nodes in a single job submission. We can use pbsdsh or an equivalent in the LRMs. Using mpirun can also be explored.