From foster at mcs.anl.gov Sun Jun 1 09:46:55 2008
From: foster at mcs.anl.gov (Ian Foster)
Date: Sun, 1 Jun 2008 09:46:55 -0500
Subject: [Swift-devel] might be of interest to someone
Message-ID: <16BA86D9-0920-4B85-8731-669286399B9C@mcs.anl.gov>

http://sc08.sc-education.org/media/workshopflyers/IntroModSimIUNW.pdf

Introduction to Modeling, Simulation, and Computational Methods
July 28 - July 30, 2008
Indiana University Northwest - Gary, Indiana

From bugzilla-daemon at mcs.anl.gov Sun Jun 1 19:23:57 2008
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Sun, 1 Jun 2008 19:23:57 -0500 (CDT)
Subject: [Swift-devel] [Bug 144] New: earlier detection of earlier duplicate output file mapping.
Message-ID:

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=144

Summary: earlier detection of earlier duplicate output file mapping.
Product: Swift
Version: unspecified
Platform: Macintosh
OS/Version: Mac OS
Status: NEW
Severity: normal
Priority: P2
Component: General
AssignedTo: benc at hawaga.org.uk
ReportedBy: benc at hawaga.org.uk
CC: swift-devel at ci.uchicago.edu

When multiple variables are mapped to the same filename, an error is thrown indicating this; however, the error is not thrown until an application procedure attempts to stage out the file for the second time. In many cases, this could be detected much earlier (often before any application execution has occurred at all).

--
Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.
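
The early check described in bug 144 could look roughly like the sketch below. This is not Swift's mapper code: the `mappings` dictionary (variable name to output filename) and the function name are hypothetical stand-ins, used only to show that duplicates can be found from the mapping table alone, before any application runs.

```python
from collections import defaultdict


def find_duplicate_mappings(mappings):
    """Return {filename: [variables]} for filenames mapped by more than one variable.

    `mappings` is a hypothetical dict of variable name -> output filename;
    it stands in for whatever mapper state Swift actually keeps.
    """
    by_file = defaultdict(list)
    for variable, filename in mappings.items():
        by_file[filename].append(variable)
    # Keep only filenames that two or more variables point at.
    return {f: sorted(vs) for f, vs in by_file.items() if len(vs) > 1}


if __name__ == "__main__":
    # Illustrative data only: "a" and "c" collide on the same output file.
    demo = {"a": "out/result.txt", "b": "out/other.txt", "c": "out/result.txt"}
    for filename, variables in find_duplicate_mappings(demo).items():
        print(f"{filename} is mapped by: {', '.join(variables)}")
```

A real check would probably also need to normalise paths before comparing them, and could only cover mappings that are known before execution starts.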
From hategan at mcs.anl.gov Tue Jun 3 15:30:18 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 03 Jun 2008 15:30:18 -0500 Subject: [Swift-devel] Re: [Swift-user] Failed to link input file In-Reply-To: <20080603151729.BAX04132@m4500-03.uchicago.edu> References: <20080603151729.BAX04132@m4500-03.uchicago.edu> Message-ID: <1212525018.17315.10.camel@localhost> On Tue, 2008-06-03 at 15:17 -0500, lixi at uchicago.edu wrote: > >In this case, you already did. The fact that you get any > message > >whatsoever from wrapper.sh ("cannot link input file" is one > of them) > >means that it did work at the point the job was started. > > > > I don't mean checking it during the execution of Swift > workflow. I mean that can I run some extra pieces of scripts > to make sure the availability of the shared file systems on > remote sites. It's not as much "on remote sites" as it is "on each node of the remote site". What I was, however, trying to point out was that in the likely short period of time between the wrapper being started and it trying to link the input files the node went from good to bad. So I'm not sure how useful it would be to increase that period of time even further. But no, I don't have such a script. I wouldn't even know how to do it, since I'm not aware of how I could force queuing systems to send my jobs to specific nodes. Well, besides sending a large number of jobs and hoping that eventually each node will get at least one. Which is silly. > > You know, currently, I'm running some calibration scripts > which include globus-job-run and globus-url-copy tasks to > learn the performance of remote sites. Based on such > results, I could give the initial scores and filter some > sites before running Swift workflow. So since the shared > file system also leads to the failure of the workflow, I > think that it is necessary to add the evaluation into my > current scoring methods. I have a suspicion Swift itself causes the problem. 
It's like looking at things with a hammer. Perhaps Ben can look at the performance data and get some insight.

>
> Thanks,
>
> Xi

From benc at hawaga.org.uk Tue Jun 3 17:17:41 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 3 Jun 2008 22:17:41 +0000 (GMT)
Subject: [Swift-devel] Re: [Swift-user] Failed to link input file
In-Reply-To: <1212525018.17315.10.camel@localhost>
References: <20080603151729.BAX04132@m4500-03.uchicago.edu> <1212525018.17315.10.camel@localhost>
Message-ID:

I was able to recreate this problem on UCSDT2 with a simple hello world, I think, a few days ago. I haven't got round to actually investigating what is happening on that site, though.

Something weird is happening on that site when running through condor with all swift jobs, I think. However, I haven't got round to looking at it yet, and am in emergency non-swift mode at the moment until next Tuesday.

--

From benc at hawaga.org.uk Tue Jun 3 17:51:33 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 3 Jun 2008 22:51:33 +0000 (GMT)
Subject: [Swift-devel] Re: [Swift-user] Failed to link input file
In-Reply-To: <1212525018.17315.10.camel@localhost>
References: <20080603151729.BAX04132@m4500-03.uchicago.edu> <1212525018.17315.10.camel@localhost>
Message-ID:

> But no, I don't have such a script. I wouldn't even know how to do it,
> since I'm not aware of how I could force queuing systems to send my jobs
> to specific nodes. Well, besides sending a large number of jobs and
> hoping that eventually each node will get at least one. Which is silly.

In condor (which is what is running on that site) you can put a requirement to match hostname. If I ever get round to committing that patch, you'll be able to do that in swift too!

I have this vague idea that you can do something similar with PBS with host_types, but I'm not really sure and I'm not going to look it up now.
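
As a rough illustration of the Condor hostname requirement mentioned above, the sketch below builds a submit description that pins a probe job to one node and lists a directory on the shared file system. It assumes direct submission with condor_submit rather than going through GRAM or Swift, and the host name, directory and output file names are all made up for the example.

```python
def probe_submit_description(hostname, shared_dir, executable="/bin/ls"):
    """Build an HTCondor submit description that runs only on `hostname`.

    The job lists `shared_dir`; a non-zero exit code would suggest the
    shared file system is not visible from that node. The host name,
    directory and file names are illustrative assumptions.
    """
    lines = [
        "universe = vanilla",
        f"executable = {executable}",
        # Use the copy of the executable already on the execute node.
        "transfer_executable = false",
        f"arguments = {shared_dir}",
        # Machine is the ClassAd attribute holding the execute node's host name.
        f'requirements = (Machine == "{hostname}")',
        f"output = probe.{hostname}.out",
        f"error = probe.{hostname}.err",
        f"log = probe.{hostname}.log",
        "queue",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical node and shared directory.
    print(probe_submit_description("node01.example.org", "/shared/swift"))
```

Generating one such description per known node would avoid the "send many jobs and hope" approach, but only where the node names are known in advance and the site allows matching on them.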
--

From benc at hawaga.org.uk Tue Jun 3 17:59:23 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 3 Jun 2008 22:59:23 +0000 (GMT)
Subject: [Swift-devel] Re: [Swift-user] Failed to link input file
In-Reply-To:
References: <20080603151729.BAX04132@m4500-03.uchicago.edu> <1212525018.17315.10.camel@localhost>
Message-ID:

Actually, Xi, if you want me to not forget this, open a bug on it and assign it to me in the swift bugzilla.

From bugzilla-daemon at mcs.anl.gov Wed Jun 4 11:57:17 2008
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Wed, 4 Jun 2008 11:57:17 -0500 (CDT)
Subject: [Swift-devel] [Bug 145] New: Failed to link input file
Message-ID:

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=145

Summary: Failed to link input file
Product: Swift
Version: unspecified
Platform: All
OS/Version: Linux
Status: NEW
Severity: normal
Priority: P2
Component: General
AssignedTo: benc at hawaga.org.uk
ReportedBy: lixi at uchicago.edu
CC: lixi at uchicago.edu

During the execution of a Swift workflow, the following error occurs many times:

...
Host: UCSDT2
Directory: workflowtest-20080603-0934-cctuq211/jobs/2/node-2sl4okti
stderr.txt:
stdout.txt:
----
Caused by: UCSDT2 Failed to link input file _concurrent/intermediatefile-272352a4-9803-4509-8f19-fcddb7de230b-
...

In fact, it has also happened before on OSG sites other than UCSDT2.

--
Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From hategan at mcs.anl.gov Wed Jun 4 11:59:47 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 04 Jun 2008 11:59:47 -0500
Subject: [Swift-devel] committers
Message-ID: <1212598787.8492.8.camel@localhost>

So we need, I think, to agree on the initial list of committers for the dev.globus side of things.

If you think you should be a swift committer, please respond to this email.
From hategan at mcs.anl.gov Wed Jun 4 12:02:10 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 04 Jun 2008 12:02:10 -0500 Subject: [Swift-devel] committers In-Reply-To: <1212598787.8492.8.camel@localhost> References: <1212598787.8492.8.camel@localhost> Message-ID: <1212598930.8492.10.camel@localhost> On Wed, 2008-06-04 at 11:59 -0500, Mihael Hategan wrote: > So we need, I think, to agree on the initial list of committers for the > dev.globus side of things. > > If you think you should be a swift committer, please respond to this > email. Me! me! > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From iraicu at cs.uchicago.edu Wed Jun 4 12:05:36 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Wed, 04 Jun 2008 12:05:36 -0500 Subject: [Swift-devel] committers In-Reply-To: <1212598930.8492.10.camel@localhost> References: <1212598787.8492.8.camel@localhost> <1212598930.8492.10.camel@localhost> Message-ID: <4846CB60.7000509@cs.uchicago.edu> I have committed code to the Falkon provider, so feel free to add me (although not essential, if you are trying to keep the list short). Ioan Mihael Hategan wrote: > On Wed, 2008-06-04 at 11:59 -0500, Mihael Hategan wrote: > >> So we need, I think, to agree on the initial list of committers for the >> dev.globus side of things. >> >> If you think you should be a swift committer, please respond to this >> email. >> > > Me! me! > > >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > -- =================================================== Ioan Raicu Ph.D. 
Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== From hategan at mcs.anl.gov Wed Jun 4 12:16:36 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 04 Jun 2008 12:16:36 -0500 Subject: [Swift-devel] committers In-Reply-To: <4846CB60.7000509@cs.uchicago.edu> References: <1212598787.8492.8.camel@localhost> <1212598930.8492.10.camel@localhost> <4846CB60.7000509@cs.uchicago.edu> Message-ID: <1212599796.8492.22.camel@localhost> Do you want to be a Swift committer? Yes or No. On Wed, 2008-06-04 at 12:05 -0500, Ioan Raicu wrote: > I have committed code to the Falkon provider, so feel free to add me > (although not essential, if you are trying to keep the list short). > > Ioan > > Mihael Hategan wrote: > > On Wed, 2008-06-04 at 11:59 -0500, Mihael Hategan wrote: > > > >> So we need, I think, to agree on the initial list of committers for the > >> dev.globus side of things. > >> > >> If you think you should be a swift committer, please respond to this > >> email. > >> > > > > Me! me! 
> > > > > >> _______________________________________________ > >> Swift-devel mailing list > >> Swift-devel at ci.uchicago.edu > >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >> > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > From iraicu at cs.uchicago.edu Wed Jun 4 12:25:03 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Wed, 04 Jun 2008 12:25:03 -0500 Subject: [Swift-devel] committers In-Reply-To: <1212599796.8492.22.camel@localhost> References: <1212598787.8492.8.camel@localhost> <1212598930.8492.10.camel@localhost> <4846CB60.7000509@cs.uchicago.edu> <1212599796.8492.22.camel@localhost> Message-ID: <4846CFEF.9000102@cs.uchicago.edu> I was trying to leave it up to you. But since you asked, yes. Mihael Hategan wrote: > Do you want to be a Swift committer? Yes or No. > > On Wed, 2008-06-04 at 12:05 -0500, Ioan Raicu wrote: > >> I have committed code to the Falkon provider, so feel free to add me >> (although not essential, if you are trying to keep the list short). >> >> Ioan >> >> Mihael Hategan wrote: >> >>> On Wed, 2008-06-04 at 11:59 -0500, Mihael Hategan wrote: >>> >>> >>>> So we need, I think, to agree on the initial list of committers for the >>>> dev.globus side of things. >>>> >>>> If you think you should be a swift committer, please respond to this >>>> email. >>>> >>>> >>> Me! me! >>> >>> >>> >>>> _______________________________________________ >>>> Swift-devel mailing list >>>> Swift-devel at ci.uchicago.edu >>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>>> >>>> >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>> >>> >>> > > > -- =================================================== Ioan Raicu Ph.D. 
Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== From hategan at mcs.anl.gov Wed Jun 4 13:18:12 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 04 Jun 2008 13:18:12 -0500 Subject: [Swift-devel] committers In-Reply-To: <4846D471.6070102@mcs.anl.gov> References: <1212598787.8492.8.camel@localhost> <4846D471.6070102@mcs.anl.gov> Message-ID: <1212603492.10245.0.camel@localhost> Sending this to the list. On Wed, 2008-06-04 at 12:44 -0500, Michael Wilde wrote: > please make me a committer > > On 6/4/08 11:59 AM, Mihael Hategan wrote: > > So we need, I think, to agree on the initial list of committers for the > > dev.globus side of things. > > > > If you think you should be a swift committer, please respond to this > > email. > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Wed Jun 4 16:07:55 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 04 Jun 2008 16:07:55 -0500 Subject: [Swift-devel] plans in trac Message-ID: <1212613675.14199.4.camel@localhost> So I played a bit with having our plans moved from a closed wiki page to a more open thing, and I tried out trac in the process. 
You can see the results at: http://www.ci.uchicago.edu/trac/swift/roadmap and http://www.ci.uchicago.edu/trac/swift/query?status=new&status=assigned&status=reopened&milestone=1.0 Alternatively we can do this in bugzilla, but for some reason trac looks nicer (though not as beefy in terms of functionality, yet better than a wiki page). Let me know if you have something against this. From benc at hawaga.org.uk Wed Jun 4 16:20:55 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 4 Jun 2008 21:20:55 +0000 (GMT) Subject: [Swift-devel] committers In-Reply-To: <1212598787.8492.8.camel@localhost> References: <1212598787.8492.8.camel@localhost> Message-ID: On Wed, 4 Jun 2008, Mihael Hategan wrote: > So we need, I think, to agree on the initial list of committers for the > dev.globus side of things. The initial proposal, listed in globus bug 5300, already contains an initial list of committers: Benjamin Clifford Mihael Hategan Tiberius Stef-Praun Beth Yong Zhao Veronika Nefedova Ian Foster Michael Wilde Ioan Raicu Beth Cerny Patino Any amendments to that should probably be made through the dev.globus voting process. -- From hategan at mcs.anl.gov Wed Jun 4 16:36:35 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 04 Jun 2008 16:36:35 -0500 Subject: [Swift-devel] committers In-Reply-To: References: <1212598787.8492.8.camel@localhost> Message-ID: <1212615395.14746.0.camel@localhost> Ah, that bug I wasn't CC-ed on. Well, I know about it now. On Wed, 2008-06-04 at 21:20 +0000, Ben Clifford wrote: > > On Wed, 4 Jun 2008, Mihael Hategan wrote: > > > So we need, I think, to agree on the initial list of committers for the > > dev.globus side of things. 
> > The initial proposal, listed in globus bug 5300, already contains an > initial list of committers: > > Benjamin Clifford > Mihael Hategan > Tiberius Stef-Praun Beth > Yong Zhao > Veronika Nefedova > Ian Foster > Michael Wilde > Ioan Raicu > Beth Cerny Patino > > Any amendments to that should probably be made through the dev.globus > voting process. > From hategan at mcs.anl.gov Wed Jun 4 16:57:53 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 04 Jun 2008 16:57:53 -0500 Subject: [Swift-devel] committers In-Reply-To: <1212615395.14746.0.camel@localhost> References: <1212598787.8492.8.camel@localhost> <1212615395.14746.0.camel@localhost> Message-ID: <1212616673.14746.4.camel@localhost> All committers have been added to the swift-commit mailing list. If some are wondering what the list admin password is, well, you either know it already or you don't. On Wed, 2008-06-04 at 16:36 -0500, Mihael Hategan wrote: > Ah, that bug I wasn't CC-ed on. Well, I know about it now. > > On Wed, 2008-06-04 at 21:20 +0000, Ben Clifford wrote: > > > > On Wed, 4 Jun 2008, Mihael Hategan wrote: > > > > > So we need, I think, to agree on the initial list of committers for the > > > dev.globus side of things. > > > > The initial proposal, listed in globus bug 5300, already contains an > > initial list of committers: > > > > Benjamin Clifford > > Mihael Hategan > > Tiberius Stef-Praun Beth > > Yong Zhao > > Veronika Nefedova > > Ian Foster > > Michael Wilde > > Ioan Raicu > > Beth Cerny Patino > > > > Any amendments to that should probably be made through the dev.globus > > voting process. 
> >
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel

From wilde at mcs.anl.gov Wed Jun 4 18:52:37 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 04 Jun 2008 18:52:37 -0500
Subject: [Swift-devel] voms-proxy-init support
Message-ID: <48472AC5.7010409@mcs.anl.gov>

How much effort would it take to add voms-proxy-init support to the CoG grid-proxy-init that's packaged with Swift?

From wilde at mcs.anl.gov Wed Jun 4 19:06:31 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 04 Jun 2008 19:06:31 -0500
Subject: [Swift-devel] plans in trac
In-Reply-To: <1212613675.14199.4.camel@localhost>
References: <1212613675.14199.4.camel@localhost>
Message-ID: <48472E07.4080908@mcs.anl.gov>

I like what's at the second link (list of campaigns).

I didn't see much of interest at the first link - not sure what that's supposed to be.

- Mike

On 6/4/08 4:07 PM, Mihael Hategan wrote:
> So I played a bit with having our plans moved from a closed wiki page to
> a more open thing, and I tried out trac in the process. You can see the
> results at:
> http://www.ci.uchicago.edu/trac/swift/roadmap
>
> and
>
> http://www.ci.uchicago.edu/trac/swift/query?status=new&status=assigned&status=reopened&milestone=1.0
>
> Alternatively we can do this in bugzilla, but for some reason trac looks
> nicer (though not as beefy in terms of functionality, yet better than a
> wiki page).
>
> Let me know if you have something against this.
> > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Wed Jun 4 19:10:29 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 04 Jun 2008 19:10:29 -0500 Subject: [Swift-devel] plans in trac In-Reply-To: <48472E07.4080908@mcs.anl.gov> References: <1212613675.14199.4.camel@localhost> <48472E07.4080908@mcs.anl.gov> Message-ID: <1212624629.18197.0.camel@localhost> On Wed, 2008-06-04 at 19:06 -0500, Michael Wilde wrote: > I like whats at the second link (list of campaigns). > > I didnt see much interesting at the first link - not sure what thats > supposed to be. That's a summary of all the milestones of which there is only one. As you close tickets, there is a progress bar that reflects it. > > - Mike > > On 6/4/08 4:07 PM, Mihael Hategan wrote: > > So I played a bit with having our plans moved from a closed wiki page to > > a more open thing, and I tried out trac in the process. You can see the > > results at: > > http://www.ci.uchicago.edu/trac/swift/roadmap > > > > and > > > > http://www.ci.uchicago.edu/trac/swift/query?status=new&status=assigned&status=reopened&milestone=1.0 > > > > Alternatively we can do this in bugzilla, but for some reason trac looks > > nicer (though not as beefy in terms of functionality, yet better than a > > wiki page). > > > > Let me know if you have something against this. 
> > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Wed Jun 4 19:11:28 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 04 Jun 2008 19:11:28 -0500 Subject: [Swift-devel] voms-proxy-init support In-Reply-To: <48472AC5.7010409@mcs.anl.gov> References: <48472AC5.7010409@mcs.anl.gov> Message-ID: <1212624688.18197.2.camel@localhost> On Wed, 2008-06-04 at 18:52 -0500, Michael Wilde wrote: > How much effort would it take to add voms-proxy-init support to the CoG > grid-proxy-init thats packaged with Swift? I'm a bit unsure. I know that a java client to do that already exists within egee, but I have failed to get my hands on the sources so far. I can probably spend time on investigating further. > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From wilde at mcs.anl.gov Wed Jun 4 19:53:23 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 04 Jun 2008 19:53:23 -0500 Subject: [Swift-devel] voms-proxy-init support In-Reply-To: <1212624688.18197.2.camel@localhost> References: <48472AC5.7010409@mcs.anl.gov> <1212624688.18197.2.camel@localhost> Message-ID: <48473903.1000805@mcs.anl.gov> Its very low priority. I was just looking for an alternative to having people get multiple certs to work with multiple VOs. So lets let it rest until it looks trivial or bubbles higher in priority. I'll add to bugzilla. On 6/4/08 7:11 PM, Mihael Hategan wrote: > On Wed, 2008-06-04 at 18:52 -0500, Michael Wilde wrote: >> How much effort would it take to add voms-proxy-init support to the CoG >> grid-proxy-init thats packaged with Swift? > > I'm a bit unsure. I know that a java client to do that already exists > within egee, but I have failed to get my hands on the sources so far. 
> > I can probably spend time on investigating further. > >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From bugzilla-daemon at mcs.anl.gov Wed Jun 4 19:56:07 2008 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Wed, 4 Jun 2008 19:56:07 -0500 (CDT) Subject: [Swift-devel] [Bug 146] New: Add voms-proxy-init command to Swift release Message-ID: http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=146 Summary: Add voms-proxy-init command to Swift release Product: Swift Version: unspecified Platform: All OS/Version: Mac OS Status: NEW Severity: minor Priority: P5 Component: General AssignedTo: hategan at mcs.anl.gov ReportedBy: wilde at mcs.anl.gov CC: benc at hawaga.org.uk This would be handy as an alternative to having people get multiple certs to work with multiple VOs. On 6/4/08 7:11 PM, Mihael Hategan wrote: > On Wed, 2008-06-04 at 18:52 -0500, Michael Wilde wrote: >> How much effort would it take to add voms-proxy-init support to the CoG >> grid-proxy-init thats packaged with Swift? > > I'm a bit unsure. I know that a java client to do that already exists > within egee, but I have failed to get my hands on the sources so far. > > I can probably spend time on investigating further. -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You reported the bug, or are watching the reporter. 
From bugzilla-daemon at mcs.anl.gov Wed Jun 4 20:24:07 2008 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Wed, 4 Jun 2008 20:24:07 -0500 (CDT) Subject: [Swift-devel] [Bug 146] Add voms-proxy-init command to Swift release In-Reply-To: Message-ID: <20080605012407.D961F164BB@foxtrot.mcs.anl.gov> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=146 ------- Comment #1 from benc at hawaga.org.uk 2008-06-04 20:24 ------- See the commentary associated with bug 104: http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=104 -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You reported the bug, or are watching the reporter. From wilde at mcs.anl.gov Thu Jun 5 10:35:29 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 05 Jun 2008 10:35:29 -0500 Subject: [Swift-devel] [Fwd: Re: [incubator-committers] Re: swift project in hibernation] Message-ID: <484807C1.2060902@mcs.anl.gov> If "chair" is someone who maintains the infrastructure it should be a developer. If its someone that makes management decisions and speaks for the project, it should be me. I should be the one who requests the project to be un-hibernated, but at least designated developer should be able to interact with dev.globus regarding infrastructure setup, imho. Mihael, Ben - do you have a recommendation on this? Seems like once infrastructure setup is done, I should be able to be chair. - Mike -------- Original Message -------- Subject: Re: [incubator-committers] Re: swift project in hibernation Date: Thu, 05 Jun 2008 09:14:17 -0500 From: Mihael Hategan To: Jennifer M. Schopf CC: Jennifer M. 
Schopf , incubator-committers at globus.org, wilde at mcs.anl.gov, Ben Clifford
References: <20080604223759.9D756B00006C at sumac.eol.org>
 <1212621397.17056.11.camel at localhost>
 <20080605020236.B489EB00006C at sumac.eol.org>
 <1212640861.22960.25.camel at localhost>
 <20080605130731.515E4ABC001 at zimbra.anl.gov>

Short ones inline.

On Thu, 2008-06-05 at 09:05 -0400, Jennifer M. Schopf wrote:
>
> ...
> >I will also reformulate the question:
> >If I (or any other committer for that matter) ask for the project to be
> >un-hibernated, while meeting all the requirements, will the project be
> >un-hibernated or not?
>
> Mike's gotten in touch with us - so if one of you
> gets a document or shows effort towards this end
> that should suffice, but you've got to include
> him as chair in my opinion. If he's not being the
> chair that's something for you as a project to
> sort out - we really only interface to the chair
> of a project; that's what the chair is in our
> view, the person making this project happen from
> an incubator point of view. If your chair isn't
> doing that then likely something needs to change
> in the project. It is not uncommon for a project
> to have a PI that isn't the chair, for example.
> But generally, yes, we work with the chair, not
> other group members, or else this process simply
> wouldn't scale. I can't give you an authoritative
> answer without a meeting of the IMP, which takes
> place in roughly 2 weeks; all I can give you is
> my best interpretation of the process as it exists.

Ok. Please let me know what the answer is when you can.

...
From hategan at mcs.anl.gov Thu Jun 5 11:16:31 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 05 Jun 2008 11:16:31 -0500 Subject: [Swift-devel] [Fwd: Re: [incubator-committers] Re: swift project in hibernation] In-Reply-To: <484807C1.2060902@mcs.anl.gov> References: <484807C1.2060902@mcs.anl.gov> Message-ID: <1212682591.27047.31.camel@localhost> Actually, from a dev.globus perspective, there is no "manager" in the sense of somebody having more decision power than other committers. Such management decisions are made through voting. So you should probably think a bit about this. The control structure imposed by our job hierarchy or by the flow of money have no say in the decision making process of the project. That doesn't mean that you can't exercise authority as long as you have the means to, but globdev isn't going to give you such means. In regard to the task at hand, there are no problems really. You will probably have the honors of making the official request. That wasn't my point though. I was simply asking whether there was a rule in place, and all I got was an opinion. I didn't want an opinion. It was a simple question and I wanted a "yes" or a "no". On Thu, 2008-06-05 at 10:35 -0500, Michael Wilde wrote: > If "chair" is someone who maintains the infrastructure it should be a > developer. > > If its someone that makes management decisions and speaks for the > project, it should be me. > > I should be the one who requests the project to be un-hibernated, but at > least designated developer should be able to interact with dev.globus > regarding infrastructure setup, imho. > > Mihael, Ben - do you have a recommendation on this? Seems like once > infrastructure setup is done, I should be able to be chair. > > - Mike > > > -------- Original Message -------- > Subject: Re: [incubator-committers] Re: swift project in hibernation > Date: Thu, 05 Jun 2008 09:14:17 -0500 > From: Mihael Hategan > To: Jennifer M. Schopf > CC: Jennifer M. 
Schopf , incubator-committers at globus.org, > wilde at mcs.anl.gov, Ben Clifford > References: <20080604223759.9D756B00006C at sumac.eol.org> > <1212621397.17056.11.camel at localhost> > <20080605020236.B489EB00006C at sumac.eol.org> > <1212640861.22960.25.camel at localhost> > <20080605130731.515E4ABC001 at zimbra.anl.gov> > > Short ones inline. > > On Thu, 2008-06-05 at 09:05 -0400, Jennifer M. Schopf wrote: > > > > ... > > >I will also reformulate the question: > > >If I (or any other committer for that matter) ask for the project to be > > >un-hibernated, while meeting all the requirements, will the project be > > >un-hibernated or not? > > > > Mike's gotten in touch with us - so if one of you > > gets a document or shows effort towards this end > > that should suffice, but you've got to include > > him as chair in my opinion. If he's not being the > > chair that's something for you as a project to > > sort out - we really only interface to the chair > > of a project , that's what the chair is in our > > view, the person making this project happen from > > an incubator point of view. If your chair isn't > > doing that then likely something need to change > > inthe project. It is not uncommon for a project > > to have a PI that isn't the chair, for example. > > But generally, yes, we work with the chair not > > other group members or else this process simply > > wouldn't scale. I can't give you an authoritative > > answer without a meeting of the IMP which takes > > place in roughly 2 weeks, all i can give you is > > my best interpretation of the process as it exists. > > Ok. Please let me know what the answer is when you can. > > ... 
> _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From iraicu at cs.uchicago.edu Thu Jun 5 11:21:56 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Thu, 05 Jun 2008 11:21:56 -0500 Subject: [Swift-devel] [Fwd: Re: [incubator-committers] Re: swift project in hibernation] In-Reply-To: <1212682591.27047.31.camel@localhost> References: <484807C1.2060902@mcs.anl.gov> <1212682591.27047.31.camel@localhost> Message-ID: <484812A4.8010801@cs.uchicago.edu> From my experience with Falkon and ServMark as an incubator project, there was one person, picked out from the team of committers, who was responsible to communicate with dev.globus, for review purposes, infrastructure setup, etc... That doesn't mean that this single person has to do everything, but dev.globus wants to talk to a single person per project, not a random committer from the pool. So, someone should step up to be that contact person, and then individual issues can be sub-tasked to other committers. This might not be explicit in writing anywhere, but this is what I saw on two projects so far, and is what I understand from Jenny email below. Ioan Mihael Hategan wrote: > Actually, from a dev.globus perspective, there is no "manager" in the > sense of somebody having more decision power than other committers. > > Such management decisions are made through voting. > > So you should probably think a bit about this. The control structure > imposed by our job hierarchy or by the flow of money have no say in the > decision making process of the project. That doesn't mean that you can't > exercise authority as long as you have the means to, but globdev isn't > going to give you such means. > > In regard to the task at hand, there are no problems really. You will > probably have the honors of making the official request. That wasn't my > point though. 
I was simply asking whether there was a rule in place, and > all I got was an opinion. I didn't want an opinion. It was a simple > question and I wanted a "yes" or a "no". > > On Thu, 2008-06-05 at 10:35 -0500, Michael Wilde wrote: > >> If "chair" is someone who maintains the infrastructure it should be a >> developer. >> >> If its someone that makes management decisions and speaks for the >> project, it should be me. >> >> I should be the one who requests the project to be un-hibernated, but at >> least designated developer should be able to interact with dev.globus >> regarding infrastructure setup, imho. >> >> Mihael, Ben - do you have a recommendation on this? Seems like once >> infrastructure setup is done, I should be able to be chair. >> >> - Mike >> >> >> -------- Original Message -------- >> Subject: Re: [incubator-committers] Re: swift project in hibernation >> Date: Thu, 05 Jun 2008 09:14:17 -0500 >> From: Mihael Hategan >> To: Jennifer M. Schopf >> CC: Jennifer M. Schopf , incubator-committers at globus.org, >> wilde at mcs.anl.gov, Ben Clifford >> References: <20080604223759.9D756B00006C at sumac.eol.org> >> <1212621397.17056.11.camel at localhost> >> <20080605020236.B489EB00006C at sumac.eol.org> >> <1212640861.22960.25.camel at localhost> >> <20080605130731.515E4ABC001 at zimbra.anl.gov> >> >> Short ones inline. >> >> On Thu, 2008-06-05 at 09:05 -0400, Jennifer M. Schopf wrote: >> >> ... >> >>>> I will also reformulate the question: >>>> If I (or any other committer for that matter) ask for the project to be >>>> un-hibernated, while meeting all the requirements, will the project be >>>> un-hibernated or not? >>>> >>> Mike's gotten in touch with us - so if one of you >>> gets a document or shows effort towards this end >>> that should suffice, but you've got to include >>> him as chair in my opinion. 
If he's not being the >>> chair that's something for you as a project to >>> sort out - we really only interface to the chair >>> of a project, that's what the chair is in our >>> view, the person making this project happen from >>> an incubator point of view. If your chair isn't >>> doing that then likely something needs to change >>> in the project. It is not uncommon for a project >>> to have a PI that isn't the chair, for example. >>> But generally, yes, we work with the chair not >>> other group members or else this process simply >>> wouldn't scale. I can't give you an authoritative >>> answer without a meeting of the IMP which takes >>> place in roughly 2 weeks, all I can give you is >>> my best interpretation of the process as it exists. >>> >> Ok. Please let me know what the answer is when you can. >> >> ... >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E.
58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== From wilde at mcs.anl.gov Thu Jun 5 12:03:21 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 05 Jun 2008 12:03:21 -0500 Subject: [Swift-devel] Swift dev.globus chair In-Reply-To: <1212682591.27047.31.camel@localhost> References: <484807C1.2060902@mcs.anl.gov> <1212682591.27047.31.camel@localhost> Message-ID: <48481C59.9070002@mcs.anl.gov> was: Re: [Swift-devel] [Fwd: Re: [incubator-committers] Re: swift project in hibernation] -- The Globus process defines Chair as: "Each Globus project is required to name a project Chair via some process defined by the project's Committers. A project Chair has no enhanced authority, but has certain responsibilities relative to the function of the Globus Alliance. Specifically: * The Chair should generate, on or before March 31st, June 30th, September 30th, and December 31st of each year, reports concerning the activities of the project during the past quarter, its current status, and future plans. * The Chair is responsible for forwarding to the Globus Infrastructure group requests to add or delete Committers for the project." (from http://dev.globus.org/wiki/Guidelines) I.e., the chair needs to report on the activity, status, and plans quarterly. Gigi suggested to me that Ben as Project Coordinator should serve this role. That would be OK with me. This is also a good time to step back and look at our process, and see what, if anything, we want to change. But if all else is ready to roll on the incubator environment, let's just agree on a chair while we sort it out. For now, I suggest this: Ben, if you want to be chair, I support that.
If you dont, I feel the same about you, Mihael. If neither of you want to be chair, I propose to remain chair. All: After we hear from Ben (or Mihael if Ben declines) please indicate your support with a response to this thread. - Mike On 6/5/08 11:16 AM, Mihael Hategan wrote: > Actually, from a dev.globus perspective, there is no "manager" in the > sense of somebody having more decision power than other committers. > > Such management decisions are made through voting. > > So you should probably think a bit about this. The control structure > imposed by our job hierarchy or by the flow of money have no say in the > decision making process of the project. That doesn't mean that you can't > exercise authority as long as you have the means to, but globdev isn't > going to give you such means. > > In regard to the task at hand, there are no problems really. You will > probably have the honors of making the official request. That wasn't my > point though. I was simply asking whether there was a rule in place, and > all I got was an opinion. I didn't want an opinion. It was a simple > question and I wanted a "yes" or a "no". > > On Thu, 2008-06-05 at 10:35 -0500, Michael Wilde wrote: >> If "chair" is someone who maintains the infrastructure it should be a >> developer. >> >> If its someone that makes management decisions and speaks for the >> project, it should be me. >> >> I should be the one who requests the project to be un-hibernated, but at >> least designated developer should be able to interact with dev.globus >> regarding infrastructure setup, imho. >> >> Mihael, Ben - do you have a recommendation on this? Seems like once >> infrastructure setup is done, I should be able to be chair. >> >> - Mike >> >> >> -------- Original Message -------- >> Subject: Re: [incubator-committers] Re: swift project in hibernation >> Date: Thu, 05 Jun 2008 09:14:17 -0500 >> From: Mihael Hategan >> To: Jennifer M. Schopf >> CC: Jennifer M. 
Schopf , incubator-committers at globus.org, >> wilde at mcs.anl.gov, Ben Clifford >> References: <20080604223759.9D756B00006C at sumac.eol.org> >> <1212621397.17056.11.camel at localhost> >> <20080605020236.B489EB00006C at sumac.eol.org> >> <1212640861.22960.25.camel at localhost> >> <20080605130731.515E4ABC001 at zimbra.anl.gov> >> >> Short ones inline. >> >> On Thu, 2008-06-05 at 09:05 -0400, Jennifer M. Schopf wrote: >> ... >>>> I will also reformulate the question: >>>> If I (or any other committer for that matter) ask for the project to be >>>> un-hibernated, while meeting all the requirements, will the project be >>>> un-hibernated or not? >>> Mike's gotten in touch with us - so if one of you >>> gets a document or shows effort towards this end >>> that should suffice, but you've got to include >>> him as chair in my opinion. If he's not being the >>> chair that's something for you as a project to >>> sort out - we really only interface to the chair >>> of a project, that's what the chair is in our >>> view, the person making this project happen from >>> an incubator point of view. If your chair isn't >>> doing that then likely something needs to change >>> in the project. It is not uncommon for a project >>> to have a PI that isn't the chair, for example. >>> But generally, yes, we work with the chair not >>> other group members or else this process simply >>> wouldn't scale. I can't give you an authoritative >>> answer without a meeting of the IMP which takes >>> place in roughly 2 weeks, all I can give you is >>> my best interpretation of the process as it exists. >> Ok. Please let me know what the answer is when you can. >> >> ...
>> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From hategan at mcs.anl.gov Thu Jun 5 12:18:01 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 05 Jun 2008 12:18:01 -0500 Subject: [Swift-devel] Swift dev.globus chair In-Reply-To: <48481C59.9070002@mcs.anl.gov> References: <484807C1.2060902@mcs.anl.gov> <1212682591.27047.31.camel@localhost> <48481C59.9070002@mcs.anl.gov> Message-ID: <1212686281.29017.7.camel@localhost> I'm fine with you as the chair. Again, who is the chair has nothing to do with what I was interested in. On Thu, 2008-06-05 at 12:03 -0500, Michael Wilde wrote: > was: Re: [Swift-devel] [Fwd: Re: [incubator-committers] Re: swift > project in hibernation] > > -- > > The Globus process defines Chair as: > > "Each Globus project is required to name a project Chair via some > process defined by the project?s Committers. A project Chair has no > enhanced authority, but has certain responsibilities relative to the > function of the Globus Alliance. Specifically: > > * The Chair should generate, on or before March 31st, June 30th, > September 30th, and December 31st of each year, reports concerning the > activities of the project during the past quarter, its current status, > and future plans. > * The Chair is responsible for forwarding to the Globus > Infrastructure group requests to add or delete Committers for the project." > > (from http://dev.globus.org/wiki/Guidelines) > > I.e the chair needs to report on the activity status and plans quarterly. > > Gigi suggested to me that Ben as Project Coordinator should serve this > role. That would be OK with me. > > This is also a good time to step back and look at our process, and see > what of anything we want to change. But if all else is ready to roll on > incubator environment, lets just agree on a chair while we sort it out. 
> > For now, I suggest this: Ben, if you want to be chair, I support that. > If you dont, I feel the same about you, Mihael. If neither of you want > to be chair, I propose to remain chair. > > All: After we hear from Ben (or Mihael if Ben declines) please indicate > your support with a response to this thread. > > - Mike > > > > On 6/5/08 11:16 AM, Mihael Hategan wrote: > > Actually, from a dev.globus perspective, there is no "manager" in the > > sense of somebody having more decision power than other committers. > > > > Such management decisions are made through voting. > > > > So you should probably think a bit about this. The control structure > > imposed by our job hierarchy or by the flow of money have no say in the > > decision making process of the project. That doesn't mean that you can't > > exercise authority as long as you have the means to, but globdev isn't > > going to give you such means. > > > > In regard to the task at hand, there are no problems really. You will > > probably have the honors of making the official request. That wasn't my > > point though. I was simply asking whether there was a rule in place, and > > all I got was an opinion. I didn't want an opinion. It was a simple > > question and I wanted a "yes" or a "no". > > > > On Thu, 2008-06-05 at 10:35 -0500, Michael Wilde wrote: > >> If "chair" is someone who maintains the infrastructure it should be a > >> developer. > >> > >> If its someone that makes management decisions and speaks for the > >> project, it should be me. > >> > >> I should be the one who requests the project to be un-hibernated, but at > >> least designated developer should be able to interact with dev.globus > >> regarding infrastructure setup, imho. > >> > >> Mihael, Ben - do you have a recommendation on this? Seems like once > >> infrastructure setup is done, I should be able to be chair. 
> >> > >> - Mike > >> > >> > >> -------- Original Message -------- > >> Subject: Re: [incubator-committers] Re: swift project in hibernation > >> Date: Thu, 05 Jun 2008 09:14:17 -0500 > >> From: Mihael Hategan > >> To: Jennifer M. Schopf > >> CC: Jennifer M. Schopf , incubator-committers at globus.org, > >> wilde at mcs.anl.gov, Ben Clifford > >> References: <20080604223759.9D756B00006C at sumac.eol.org> > >> <1212621397.17056.11.camel at localhost> > >> <20080605020236.B489EB00006C at sumac.eol.org> > >> <1212640861.22960.25.camel at localhost> > >> <20080605130731.515E4ABC001 at zimbra.anl.gov> > >> > >> Short ones inline. > >> > >> On Thu, 2008-06-05 at 09:05 -0400, Jennifer M. Schopf wrote: > >> ... > >>>> I will also reformulate the question: > >>>> If I (or any other committer for that matter) ask for the project to be > >>>> un-hibernated, while meeting all the requirements, will the project be > >>>> un-hibernated or not? > >>> Mike's gotten in touch with us - so if one of you > >>> gets a document or shows effort towards this end > >>> that should suffice, but you've got to include > >>> him as chair in my opinion. If he's not being the > >>> chair that's something for you as a project to > >>> sort out - we really only interface to the chair > >>> of a project , that's what the chair is in our > >>> view, the person making this project happen from > >>> an incubator point of view. If your chair isn't > >>> doing that then likely something need to change > >>> inthe project. It is not uncommon for a project > >>> to have a PI that isn't the chair, for example. > >>> But generally, yes, we work with the chair not > >>> other group members or else this process simply > >>> wouldn't scale. I can't give you an authoritative > >>> answer without a meeting of the IMP which takes > >>> place in roughly 2 weeks, all i can give you is > >>> my best interpretation of the process as it exists. > >> Ok. Please let me know what the answer is when you can. > >> > >> ... 
> >> _______________________________________________ > >> Swift-devel mailing list > >> Swift-devel at ci.uchicago.edu > >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From benc at hawaga.org.uk Thu Jun 5 12:18:53 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 5 Jun 2008 17:18:53 +0000 (GMT) Subject: [Swift-devel] [Fwd: Re: [incubator-committers] Re: swift project in hibernation] In-Reply-To: <484807C1.2060902@mcs.anl.gov> References: <484807C1.2060902@mcs.anl.gov> Message-ID: On Thu, 5 Jun 2008, Michael Wilde wrote: > If "chair" is someone who maintains the infrastructure it should be a > developer. > > If its someone that makes management decisions and speaks for the project, it > should be me. It seems to me to be neither of those. Specifically there are no requirements that the chair maintain infrastructure and an explicit prohibition on the chair making enhanced-authority decisions (wrt any other committer). However, Jen's correspondence seems to suggest that IMP regards the chair as having some other rights and obligations. I don't believe these are publicly documented though. --- Each Globus project is required to name a project Chair via some process defined by the project's Committers. A project Chair has no enhanced authority, but has certain responsibilities relative to the function of the Globus Alliance. Specifically: * The Chair should generate, on or before March 31st, June 30th, September 30th, and December 31st of each year, reports concerning the activities of the project during the past quarter, its current status, and future plans. *The Chair is responsible for forwarding to the Globus infrastructure group requests to add or delete Committers for the project. 
--- -- From hategan at mcs.anl.gov Thu Jun 5 12:22:45 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 05 Jun 2008 12:22:45 -0500 Subject: [Swift-devel] maling lists Message-ID: <1212686565.29017.12.camel@localhost> It seems like one requirement is to have all committers subscribed to the @globus.org mailing lists. While @ci will continue to be our primary mailing lists, in order to meet the requirements, I'll do the following: - move everybody from swift-commit at ci to swift-commit at globus and make swift-commit at ci forward to swift-commit at globus. This is so that we don't get double posts, but still have the infrastructure primarily based @ci. - subscribe all committers to the other @globus mailing lists. We'll encourage users to use the @ci mailing lists, and discussions initiated on @globus should be manually moved to @ci. Objections? From wilde at mcs.anl.gov Thu Jun 5 12:32:14 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 05 Jun 2008 12:32:14 -0500 Subject: [Swift-devel] maling lists In-Reply-To: <1212686565.29017.12.camel@localhost> References: <1212686565.29017.12.camel@localhost> Message-ID: <4848231E.8090709@mcs.anl.gov> It sounds good to me. Would it be better and feasible though to maintain just one set of list memberships, and have the master lists echo to the other set for archival purposes? It seems like we have much vested in maintaining svn and bugzilla using the current infrastructure, but the email lists seem a bit easier to change. And if the CI lists can remain the master, is it OK just to forward traffic to the dev.globus lists? If the dev.globus lists need to be fully populated with members to meet dev.globus requirements, then can we transition to using those as the sole lists, with minimal impact on current list members? - Mike On 6/5/08 12:22 PM, Mihael Hategan wrote: > It seems like one requirement is to have all committers subscribed to > the @globus.org mailing lists. 
> > While @ci will continue to be our primary mailing lists, in order to > meet the requirements, I'll do the following: > > - move everybody from swift-commit at ci to swift-commit at globus and make > swift-commit at ci forward to swift-commit at globus. This is so that we don't > get double posts, but still have the infrastructure primarily based @ci. > > - subscribe all committers to the other @globus mailing lists. We'll > encourage users to use the @ci mailing lists, and discussions initiated > on @globus should be manually moved to @ci. > > Objections? > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From foster at mcs.anl.gov Thu Jun 5 12:37:23 2008 From: foster at mcs.anl.gov (Ian Foster) Date: Thu, 05 Jun 2008 12:37:23 -0500 Subject: [Swift-devel] maling lists In-Reply-To: <1212686565.29017.12.camel@localhost> References: <1212686565.29017.12.camel@localhost> Message-ID: <48482453.9070809@mcs.anl.gov> Any reason not to move to the globus.org lists? It has the advantage of having a "home" not affiliated with a particular institution--this may encourage further contributions. Mihael Hategan wrote: > It seems like one requirement is to have all committers subscribed to > the @globus.org mailing lists. > > While @ci will continue to be our primary mailing lists, in order to > meet the requirements, I'll do the following: > > - move everybody from swift-commit at ci to swift-commit at globus and make > swift-commit at ci forward to swift-commit at globus. This is so that we don't > get double posts, but still have the infrastructure primarily based @ci. > > - subscribe all committers to the other @globus mailing lists. We'll > encourage users to use the @ci mailing lists, and discussions initiated > on @globus should be manually moved to @ci. > > Objections? 
> > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From foster at mcs.anl.gov Thu Jun 5 12:43:22 2008 From: foster at mcs.anl.gov (Ian Foster) Date: Thu, 05 Jun 2008 12:43:22 -0500 Subject: [Swift-devel] [Fwd: Re: [incubator-committers] Re: swift project in hibernation] In-Reply-To: References: <484807C1.2060902@mcs.anl.gov> Message-ID: <484825BA.7020509@mcs.anl.gov> I want to point out (probably obvious, but worth mentioning) that the whole dev.globus "democratic" process is meant to enable negotiation of different perspectives and needs from different contributing groups. Clearly in the case of U.Chicago people, the ultimate decision on what is done is not determined democratically--it is determined via a consensus-based process, to a large extent, but ultimately decided by Mike based on project funding obligations. Ian. Ben Clifford wrote: > On Thu, 5 Jun 2008, Michael Wilde wrote: > > >> If "chair" is someone who maintains the infrastructure it should be a >> developer. >> >> If its someone that makes management decisions and speaks for the project, it >> should be me. >> > > It seems to me to be neither of those. Specifically there are no > requirements that the chair maintain infrastructure and an explicit > prohibition on the chair making enhanced-authority decisions (wrt any > other committer). > > However, Jen's correspondence seems to suggest that IMP regards the chair > as having some other rights and obligations. I don't believe these are > publicly documented though. > > --- > Each Globus project is required to name a project Chair via some > process defined by the project's Committers. A project Chair has no > enhanced authority, but has certain responsibilities relative to the > function of the Globus Alliance. 
Specifically: > * The Chair should generate, on or before March 31st, June 30th, September > 30th, and December 31st of each year, reports concerning the activities of > the project during the past quarter, its current status, and future plans. > * The Chair is responsible for forwarding to the Globus infrastructure > group requests to add or delete Committers for the project. > --- > > From benc at hawaga.org.uk Thu Jun 5 12:30:20 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 5 Jun 2008 17:30:20 +0000 (GMT) Subject: [Swift-devel] maling lists In-Reply-To: <1212686565.29017.12.camel@localhost> References: <1212686565.29017.12.camel@localhost> Message-ID: > - subscribe all committers to the other @globus mailing lists. We'll > encourage users to use the @ci mailing lists, and discussions initiated > on @globus should be manually moved to @ci. I think this is fairly ridiculous. We should not have multiple mailing lists for the same purpose. If we have globus lists, the CI lists should be shut down and the community and archives there abandoned or compelled to move. -- From benc at hawaga.org.uk Thu Jun 5 12:34:45 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 5 Jun 2008 17:34:45 +0000 (GMT) Subject: [Swift-devel] Swift dev.globus chair In-Reply-To: <48481C59.9070002@mcs.anl.gov> References: <484807C1.2060902@mcs.anl.gov> <1212682591.27047.31.camel@localhost> <48481C59.9070002@mcs.anl.gov> Message-ID: On Thu, 5 Jun 2008, Michael Wilde wrote: > Gigi suggested to me that Ben as Project Coordinator should serve this role. > That would be OK with me. I find any continued reference to me as 'project coordinator' quite laughable given that this entire dev.globus incubation submission was done without my consent and without my knowledge by a secret cabal, a clear indication of the level of respect afforded that role.
> For now, I suggest this: Ben, if you want to be chair, I support that. I don't. -- From hategan at mcs.anl.gov Thu Jun 5 13:02:32 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 05 Jun 2008 13:02:32 -0500 Subject: [Swift-devel] maling lists In-Reply-To: <4848231E.8090709@mcs.anl.gov> References: <1212686565.29017.12.camel@localhost> <4848231E.8090709@mcs.anl.gov> Message-ID: <1212688952.29852.4.camel@localhost> On Thu, 2008-06-05 at 12:32 -0500, Michael Wilde wrote: > It sounds good to me. > > Would it be better and feasible though to maintain just one set of list > memberships, and have the master lists echo to the other set for > archival purposes? If you want @ci to be primary, then no. It is required for all committers to be subscribed to the @globus mailing lists. So we either keep both in these conditions or move entirely to @globus. > > It seems like we have much vested in maintaining svn and bugzilla using > the current infrastructure, but the email lists seem a bit easier to > change. And if the CI lists can remain the master, is it OK just to > forward traffic to the dev.globus lists? > > If the dev.globus lists need to be fully populated with members to meet > dev.globus requirements, then can we transition to using those as the > sole lists, with minimal impact on current list members? > > - Mike > > > > On 6/5/08 12:22 PM, Mihael Hategan wrote: > > It seems like one requirement is to have all committers subscribed to > > the @globus.org mailing lists. > > > > While @ci will continue to be our primary mailing lists, in order to > > meet the requirements, I'll do the following: > > > > - move everybody from swift-commit at ci to swift-commit at globus and make > > swift-commit at ci forward to swift-commit at globus. This is so that we don't > > get double posts, but still have the infrastructure primarily based @ci. > > > > - subscribe all committers to the other @globus mailing lists. 
We'll > > encourage users to use the @ci mailing lists, and discussions initiated > > on @globus should be manually moved to @ci. > > > > Objections? > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Thu Jun 5 13:06:48 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 05 Jun 2008 13:06:48 -0500 Subject: [Swift-devel] maling lists In-Reply-To: <48482453.9070809@mcs.anl.gov> References: <1212686565.29017.12.camel@localhost> <48482453.9070809@mcs.anl.gov> Message-ID: <1212689208.29852.8.camel@localhost> On Thu, 2008-06-05 at 12:37 -0500, Ian Foster wrote: > Any reason not to move to the globus.org lists? It has the advantage of > having a "home" not affiliated with a particular institution And the disadvantage of having a "home" associated with a particular institution, namely dev.globus.org. One nice thing about the CI ones is mailman. But I guess we can live without it. > --this may > encourage further contributions. > > Mihael Hategan wrote: > > It seems like one requirement is to have all committers subscribed to > > the @globus.org mailing lists. > > > > While @ci will continue to be our primary mailing lists, in order to > > meet the requirements, I'll do the following: > > > > - move everybody from swift-commit at ci to swift-commit at globus and make > > swift-commit at ci forward to swift-commit at globus. This is so that we don't > > get double posts, but still have the infrastructure primarily based @ci. > > > > - subscribe all committers to the other @globus mailing lists. We'll > > encourage users to use the @ci mailing lists, and discussions initiated > > on @globus should be manually moved to @ci. > > > > Objections?
> > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > From hategan at mcs.anl.gov Thu Jun 5 13:11:08 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 05 Jun 2008 13:11:08 -0500 Subject: [Swift-devel] [Fwd: Re: [incubator-committers] Re: swift project in hibernation] In-Reply-To: <484825BA.7020509@mcs.anl.gov> References: <484807C1.2060902@mcs.anl.gov> <484825BA.7020509@mcs.anl.gov> Message-ID: <1212689468.29852.12.camel@localhost> On Thu, 2008-06-05 at 12:43 -0500, Ian Foster wrote: > I want to point out (probably obvious, but worth mentioning) that the > whole dev.globus "democratic" process is meant to enable negotiation > of different perspectives and needs from different contributing > groups. > > Clearly in the case of U.Chicago people, the ultimate decision on what > is done is not determined democratically--it is determined via a > consensus-based process, to a large extent, but ultimately decided by > Mike based on project funding obligations. As long as that applies. In the purely hypothetical scenario that a majority of the committers stop working at UC, Mike will have no authority. Nor will you. It may also be contradictory to the process if committers had no free will, but were to be vetoable by Mike. In that case, they should not be committers in the first place. So we either want dev.globus.org or we don't. But it doesn't seem like we can do very much about wanting only parts of dev.globus.org and not others. > > Ian. > > > > Ben Clifford wrote: > > On Thu, 5 Jun 2008, Michael Wilde wrote: > > > > > > > If "chair" is someone who maintains the infrastructure it should be a > > > developer. > > > > > > If its someone that makes management decisions and speaks for the project, it > > > should be me. > > > > > > > It seems to me to be neither of those. 
Specifically there are no > > requirements that the chair maintain infrastructure and an explicit > > prohibition on the chair making enhanced-authority decisions (wrt any > > other committer). > > > > However, Jen's correspondence seems to suggest that IMP regards the chair > > as having some other rights and obligations. I don't believe these are > > publicly documented though. > > > > --- > > Each Globus project is required to name a project Chair via some > > process defined by the project's Committers. A project Chair has no > > enhanced authority, but has certain responsibilities relative to the > > function of the Globus Alliance. Specifically: > > * The Chair should generate, on or before March 31st, June 30th, September > > 30th, and December 31st of each year, reports concerning the activities of > > the project during the past quarter, its current status, and future plans. > > * The Chair is responsible for forwarding to the Globus infrastructure > > group requests to add or delete Committers for the project. > > --- > > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From benc at hawaga.org.uk Thu Jun 5 13:00:46 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 5 Jun 2008 18:00:46 +0000 (GMT) Subject: [Swift-devel] maling lists In-Reply-To: <48482453.9070809@mcs.anl.gov> References: <1212686565.29017.12.camel@localhost> <48482453.9070809@mcs.anl.gov> Message-ID: On Thu, 5 Jun 2008, Ian Foster wrote: > Any reason not to move to the globus.org lists? It has the advantage of > having a "home" not affiliated with a particular institution--this may > encourage further contributions. It is a poorer (in my opinion) list management interface; plus moving fragments the archives. Neither is a total disaster, but it's a reduction in quality.
-- From foster at mcs.anl.gov Thu Jun 5 13:33:09 2008 From: foster at mcs.anl.gov (Ian Foster) Date: Thu, 05 Jun 2008 13:33:09 -0500 Subject: [Swift-devel] [Fwd: Re: [incubator-committers] Re: swift project in hibernation] In-Reply-To: <1212689468.29852.12.camel@localhost> References: <484807C1.2060902@mcs.anl.gov> <484825BA.7020509@mcs.anl.gov> <1212689468.29852.12.camel@localhost> Message-ID: <48483165.2020409@mcs.anl.gov> Indeed, if the majority of committers are from elsewhere (which would be a wonderful success), then the direction of the project would tend to be determined by their desires. If those desires ended up being antithetical to ours, we might end up forking the code. But that is all hypothetical (and unlikely). I don't think that I (or Mike) have been restricting free will. My only (periodic) input is that I would like to see more attention given to documenting application requirements and status. Ian. Mihael Hategan wrote: > On Thu, 2008-06-05 at 12:43 -0500, Ian Foster wrote: > >> I want to point out (probably obvious, but worth mentioning) that the >> whole dev.globus "democratic" process is meant to enable negotiation >> of different perspectives and needs from different contributing >> groups. >> >> Clearly in the case of U.Chicago people, the ultimate decision on what >> is done is not determined democratically--it is determined via a >> consensus-based process, to a large extent, but ultimately decided by >> Mike based on project funding obligations. >> > > As long as that applies. In the purely hypothetical scenario that a > majority of the committers stop working at UC, Mike will have no > authority. Nor will you. > > It may also be contradictory to the process if committers had no free > will, but were to be vetoable by Mike. In that case, they should not be > committers in the first place. > > So we either want dev.globus.org or we don't. 
But it doesn't seem like > we can do very much about wanting only parts of dev.globus.org and not > others. > > >> Ian. >> >> >> >> Ben Clifford wrote: >> >>> On Thu, 5 Jun 2008, Michael Wilde wrote: >>> >>> >>> >>>> If "chair" is someone who maintains the infrastructure it should be a >>>> developer. >>>> >>>> If its someone that makes management decisions and speaks for the project, it >>>> should be me. >>>> >>>> >>> It seems to me to be neither of those. Specifically there are no >>> requirements that the chair maintain infrastructure and an explicit >>> prohibition on the chair making enhanced-authority decisions (wrt any >>> other committer). >>> >>> However, Jen's correspondence seems to suggest that IMP regards the chair >>> as having some other rights and obligations. I don't believe these are >>> publicly documented though. >>> >>> --- >>> Each Globus project is required to name a project Chair via some >>> process defined by the project's Committers. A project Chair has no >>> enhanced authority, but has certain responsibilities relative to the >>> function of the Globus Alliance. Specifically: >>> * The Chair should generate, on or before March 31st, June 30th, September >>> 30th, and December 31st of each year, reports concerning the activities of >>> the project during the past quarter, its current status, and future plans. >>> *The Chair is responsible for forwarding to the Globus infrastructure >>> group requests to add or delete Committers for the project. >>> --- >>> >>> >>> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From hategan at mcs.anl.gov Thu Jun 5 13:37:11 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 05 Jun 2008 13:37:11 -0500 Subject: [Swift-devel] maling lists In-Reply-To: References: <1212686565.29017.12.camel@localhost> Message-ID: <1212691031.30510.5.camel@localhost> On Thu, 2008-06-05 at 17:30 +0000, Ben Clifford wrote: > > - subscribe all committers to the other @globus mailing lists. We'll > > encourage users to use the @ci mailing lists, and discussions initiated > > on @globus should be manually moved to @ci. > > I think this is fairly ridiculous. That I can't deny. > > We should not have multiple mailing lists for the same purpose. If we have > globus lists, the CI lists should be shut down and the community and > archives there abandoned or compelled to move. We wouldn't really use the @globus.org mailing lists. We fulfill our requirement (given that we have no choice), and we also make it so that we don't really fulfill the requirement. In other words we have the lists, we do what we're asked, but we tell everybody to use the other ones. And for the few cases when people ask questions on the @globus.org lists, we move the discussion. > From foster at mcs.anl.gov Thu Jun 5 13:40:25 2008 From: foster at mcs.anl.gov (Ian Foster) Date: Thu, 05 Jun 2008 13:40:25 -0500 Subject: [Swift-devel] maling lists In-Reply-To: <1212691031.30510.5.camel@localhost> References: <1212686565.29017.12.camel@localhost> <1212691031.30510.5.camel@localhost> Message-ID: <48483319.3000100@mcs.anl.gov> Can't we just use the globus.org mail lists? The move to globus.org is a valuable outreach action. We get a global platform, increase visibility, maybe get more contributors. We make clear that we are open source, part of a global community. The downsides that are being quoted seem minor. Ian. Mihael Hategan wrote: > On Thu, 2008-06-05 at 17:30 +0000, Ben Clifford wrote: > >>> - subscribe all committers to the other @globus mailing lists.
We'll >>> encourage users to use the @ci mailing lists, and discussions initiated >>> on @globus should be manually moved to @ci. >>> >> I think this is fairly ridiculous. >> > > That I can't deny. > > >> We should not have multiple mailing lists for the same purpose. If we have >> globus lists, the CI lists should be shut down and the community and >> archives there abandoned or compelled to move. >> > > We wouldn't really use the @globus.org mailing lists. We fulfill our > requirement (given that we have no choice), and we also make it so that > we don't really fulfill the requirement. In other words we have the > lists, we do what we're asked, but we tell everybody to use the other > ones. And for the few cases when people ask questions on the @globus.org > lists, we move the discussion. > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From iraicu at cs.uchicago.edu Thu Jun 5 13:39:31 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Thu, 05 Jun 2008 13:39:31 -0500 Subject: [Swift-devel] maling lists In-Reply-To: <1212691031.30510.5.camel@localhost> References: <1212686565.29017.12.camel@localhost> <1212691031.30510.5.camel@localhost> Message-ID: <484832E3.5000304@cs.uchicago.edu> This will be enough for being allowed back as an incubator, but it will never suffice for escalating the project to being a full fledged Globus component. They look at the mailing lists activity at review time, and will judge the community size based on the number of people that post questions and answers. So, your solution to ignore as much as possible the globus mailing list is only a short term solution. Ioan Mihael Hategan wrote: > On Thu, 2008-06-05 at 17:30 +0000, Ben Clifford wrote: > >>> - subscribe all committers to the other @globus mailing lists.
We'll >>> encourage users to use the @ci mailing lists, and discussions initiated >>> on @globus should be manually moved to @ci. >>> >> I think this is fairly ridiculous. >> > > That I can't deny. > > >> We should not have multiple mailing lists for the same purpose. If we have >> globus lists, the CI lists should be shut down and the community and >> archives there abandoned or compelled to move. >> > > We wouldn't really use the @globus.org mailing lists. We fulfill our > requirement (given that we have no choice), and we also make it so that > we don't really fulfill the requirement. In other words we have the > lists, we do what we're asked, but we tell everybody to use the other > ones. And for the few cases when people ask questions on the @globus.org > lists, we move the discussion. > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== -------------- next part -------------- An HTML attachment was scrubbed...
URL: From hategan at mcs.anl.gov Thu Jun 5 13:43:04 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 05 Jun 2008 13:43:04 -0500 Subject: [Swift-devel] [Fwd: Re: [incubator-committers] Re: swift project in hibernation] In-Reply-To: <48483165.2020409@mcs.anl.gov> References: <484807C1.2060902@mcs.anl.gov> <484825BA.7020509@mcs.anl.gov> <1212689468.29852.12.camel@localhost> <48483165.2020409@mcs.anl.gov> Message-ID: <1212691384.30510.12.camel@localhost> On Thu, 2008-06-05 at 13:33 -0500, Ian Foster wrote: > Indeed, if the majority of committers are from elsewhere (which would > be a wonderful success), Not necessarily. The number of committers and the quality of the commits are two separate issues. > then the direction of the project would tend to be determined by > their desires. If those desires ended up being antithetical to ours, > we might end up forking the code. But that is all hypothetical (and > unlikely). > > I don't think that I (or Mike) have been restricting free will. My > only (periodic) input is that I would like to see more attention given > to documenting application requirements and status. I'm not saying you did. Nor that you didn't. The point was that there is a fundamental difference between not having a liberty and not exercising a liberty that you have. > > Ian. > > Mihael Hategan wrote: > > On Thu, 2008-06-05 at 12:43 -0500, Ian Foster wrote: > > > > > I want to point out (probably obvious, but worth mentioning) that the > > > whole dev.globus "democratic" process is meant to enable negotiation > > > of different perspectives and needs from different contributing > > > groups. > > > > > > Clearly in the case of U.Chicago people, the ultimate decision on what > > > is done is not determined democratically--it is determined via a > > > consensus-based process, to a large extent, but ultimately decided by > > > Mike based on project funding obligations. > > > > > > > As long as that applies.
In the purely hypothetical scenario that a > > majority of the committers stop working at UC, Mike will have no > > authority. Nor will you. > > > > It may also be contradictory to the process if committers had no free > > will, but were to be vetoable by Mike. In that case, they should not be > > committers in the first place. > > > > So we either want dev.globus.org or we don't. But it doesn't seem like > > we can do very much about wanting only parts of dev.globus.org and not > > others. > > > > > > > Ian. > > > > > > > > > > > > Ben Clifford wrote: > > > > > > > On Thu, 5 Jun 2008, Michael Wilde wrote: > > > > > > > > > > > > > > > > > If "chair" is someone who maintains the infrastructure it should be a > > > > > developer. > > > > > > > > > > If its someone that makes management decisions and speaks for the project, it > > > > > should be me. > > > > > > > > > > > > > > It seems to me to be neither of those. Specifically there are no > > > > requirements that the chair maintain infrastructure and an explicit > > > > prohibition on the chair making enhanced-authority decisions (wrt any > > > > other committer). > > > > > > > > However, Jen's correspondence seems to suggest that IMP regards the chair > > > > as having some other rights and obligations. I don't believe these are > > > > publicly documented though. > > > > > > > > --- > > > > Each Globus project is required to name a project Chair via some > > > > process defined by the project's Committers. A project Chair has no > > > > enhanced authority, but has certain responsibilities relative to the > > > > function of the Globus Alliance. Specifically: > > > > * The Chair should generate, on or before March 31st, June 30th, September > > > > 30th, and December 31st of each year, reports concerning the activities of > > > > the project during the past quarter, its current status, and future plans. 
> > > > *The Chair is responsible for forwarding to the Globus infrastructure > > > > group requests to add or delete Committers for the project. > > > > --- > > > > > > > > > > > > > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > From hategan at mcs.anl.gov Thu Jun 5 13:46:55 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 05 Jun 2008 13:46:55 -0500 Subject: [Swift-devel] maling lists In-Reply-To: <484832E3.5000304@cs.uchicago.edu> References: <1212686565.29017.12.camel@localhost> <1212691031.30510.5.camel@localhost> <484832E3.5000304@cs.uchicago.edu> Message-ID: <1212691615.30510.17.camel@localhost> On Thu, 2008-06-05 at 13:39 -0500, Ioan Raicu wrote: > This will be enough for being allowed back as an incubator, but it > will never suffice for escalating the project to being a full fledged > Globus component. They look at the mailing lists activity at review > time, and will judge the community size based on the number of people > that post questions and answers. So, your solution to ignore as much > as possible the globus mailing list is only a short term solution. It would seem silly to allow projects to have primary lists other than the ones at @globus.org, yet require that all meaningful activity happens on the ones @globus.org. "Primary" implies "where all meaningful stuff happens". > > Ioan > > Mihael Hategan wrote: > > On Thu, 2008-06-05 at 17:30 +0000, Ben Clifford wrote: > > > > > > - subscribe all committers to the other @globus mailing lists. We'll > > > > encourage users to use the @ci mailing lists, and discussions initiated > > > > on @globus should be manually moved to @ci. > > > > > > > I think this is fairly ridiculous. > > > > > > > That I can't deny. > > > > > > > We should not have multiple mailing lists for the same purpose. 
If we have > > > globus lists, the CI lists should be shut down and the community and > > > archives there abandoned or compelled to move. > > > > > > > ?We wouldn't really use the @globus.org mailing lists. We fulfill our > > requirement (given that we have no choice), and we also make it so that > > we don't really fulfill the requirement. In other words we have the > > lists, we do what we're asked, but we tell everybody to use the other > > ones. And for the few cases when people ask questions on the @globus.org > > lists, we move the discussion. > > > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > -- > =================================================== > Ioan Raicu > Ph.D. Candidate > =================================================== > Distributed Systems Laboratory > Computer Science Department > University of Chicago > 1100 E. 58th Street, Ryerson Hall > Chicago, IL 60637 > =================================================== > Email: iraicu at cs.uchicago.edu > Web: http://www.cs.uchicago.edu/~iraicu > http://dev.globus.org/wiki/Incubator/Falkon > http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page > =================================================== > =================================================== > From iraicu at cs.uchicago.edu Thu Jun 5 13:48:19 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Thu, 05 Jun 2008 13:48:19 -0500 Subject: [Swift-devel] maling lists In-Reply-To: <1212691615.30510.17.camel@localhost> References: <1212686565.29017.12.camel@localhost> <1212691031.30510.5.camel@localhost> <484832E3.5000304@cs.uchicago.edu> <1212691615.30510.17.camel@localhost> Message-ID: <484834F3.1090500@cs.uchicago.edu> Incubator guidelines are probably easier to satisfy than once you are ready to escalate, which should be most incubator's end goal. 
Mihael Hategan wrote: > On Thu, 2008-06-05 at 13:39 -0500, Ioan Raicu wrote: > >> This will be enough for being allowed back as an incubator, but it >> will never suffice for escalating the project to being a full fledged >> Globus component. They look at the mailing lists activity at review >> time, and will judge the community size based on the number of people >> that post questions and answers. So, your solution to ignore as much >> as possible the globus mailing list is only a short term solution. >> > > It would seem silly to allow projects to have primary lists other than > the ones at @globus.org, yet require that all meaningful activity > happens on the ones @globus.org. "Primary" implies "where all meaningful > stuff happens". > > >> Ioan >> >> Mihael Hategan wrote: >> >>> On Thu, 2008-06-05 at 17:30 +0000, Ben Clifford wrote: >>> >>> >>>>> - subscribe all committers to the other @globus mailing lists. We'll >>>>> encourage users to use the @ci mailing lists, and discussions initiated >>>>> on @globus should be manually moved to @ci. >>>>> >>>>> >>>> I think this is fairly ridiculous. >>>> >>>> >>> That I can't deny. >>> >>> >>> >>>> We should not have multiple mailing lists for the same purpose. If we have >>>> globus lists, the CI lists should be shut down and the community and >>>> archives there abandoned or compelled to move. >>>> >>>> >>> ?We wouldn't really use the @globus.org mailing lists. We fulfill our >>> requirement (given that we have no choice), and we also make it so that >>> we don't really fulfill the requirement. In other words we have the >>> lists, we do what we're asked, but we tell everybody to use the other >>> ones. And for the few cases when people ask questions on the @globus.org >>> lists, we move the discussion. 
>>> >>> >>> >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>> >>> >>> >> -- >> =================================================== >> Ioan Raicu >> Ph.D. Candidate >> =================================================== >> Distributed Systems Laboratory >> Computer Science Department >> University of Chicago >> 1100 E. 58th Street, Ryerson Hall >> Chicago, IL 60637 >> =================================================== >> Email: iraicu at cs.uchicago.edu >> Web: http://www.cs.uchicago.edu/~iraicu >> http://dev.globus.org/wiki/Incubator/Falkon >> http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page >> =================================================== >> =================================================== >> >> > > > -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 
58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== From hategan at mcs.anl.gov Thu Jun 5 13:59:34 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 05 Jun 2008 13:59:34 -0500 Subject: [Swift-devel] maling lists In-Reply-To: <484834F3.1090500@cs.uchicago.edu> References: <1212686565.29017.12.camel@localhost> <1212691031.30510.5.camel@localhost> <484832E3.5000304@cs.uchicago.edu> <1212691615.30510.17.camel@localhost> <484834F3.1090500@cs.uchicago.edu> Message-ID: <1212692374.30510.29.camel@localhost> On Thu, 2008-06-05 at 13:48 -0500, Ioan Raicu wrote: > Incubator guidelines are probably easier to satisfy than once you are > ready to escalate, which should be most incubator's end goal. Ah, you trigger an important issue. Have the full project infrastructure requirements been copied from the incubator requirements and not updated, or is it that the "existing infrastructure" leniencies are meant to only be transitional. I guess that's another mail to Jen! > > Mihael Hategan wrote: > > On Thu, 2008-06-05 at 13:39 -0500, Ioan Raicu wrote: > > > >> This will be enough for being allowed back as an incubator, but it > >> will never suffice for escalating the project to being a full fledged > >> Globus component. They look at the mailing lists activity at review > >> time, and will judge the community size based on the number of people > >> that post questions and answers. So, your solution to ignore as much > >> as possible the globus mailing list is only a short term solution. 
> >> > > > > It would seem silly to allow projects to have primary lists other than > > the ones at @globus.org, yet require that all meaningful activity > > happens on the ones @globus.org. "Primary" implies "where all meaningful > > stuff happens". > > > > > >> Ioan > >> > >> Mihael Hategan wrote: > >> > >>> On Thu, 2008-06-05 at 17:30 +0000, Ben Clifford wrote: > >>> > >>> > >>>>> - subscribe all committers to the other @globus mailing lists. We'll > >>>>> encourage users to use the @ci mailing lists, and discussions initiated > >>>>> on @globus should be manually moved to @ci. > >>>>> > >>>>> > >>>> I think this is fairly ridiculous. > >>>> > >>>> > >>> That I can't deny. > >>> > >>> > >>> > >>>> We should not have multiple mailing lists for the same purpose. If we have > >>>> globus lists, the CI lists should be shut down and the community and > >>>> archives there abandoned or compelled to move. > >>>> > >>>> > >>> ?We wouldn't really use the @globus.org mailing lists. We fulfill our > >>> requirement (given that we have no choice), and we also make it so that > >>> we don't really fulfill the requirement. In other words we have the > >>> lists, we do what we're asked, but we tell everybody to use the other > >>> ones. And for the few cases when people ask questions on the @globus.org > >>> lists, we move the discussion. > >>> > >>> > >>> > >>> _______________________________________________ > >>> Swift-devel mailing list > >>> Swift-devel at ci.uchicago.edu > >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >>> > >>> > >>> > >> -- > >> =================================================== > >> Ioan Raicu > >> Ph.D. Candidate > >> =================================================== > >> Distributed Systems Laboratory > >> Computer Science Department > >> University of Chicago > >> 1100 E. 
58th Street, Ryerson Hall > >> Chicago, IL 60637 > >> =================================================== > >> Email: iraicu at cs.uchicago.edu > >> Web: http://www.cs.uchicago.edu/~iraicu > >> http://dev.globus.org/wiki/Incubator/Falkon > >> http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page > >> =================================================== > >> =================================================== > >> > >> > > > > > > > From wilde at mcs.anl.gov Thu Jun 5 14:04:23 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 05 Jun 2008 14:04:23 -0500 Subject: [Swift-devel] maling lists In-Reply-To: <1212688952.29852.4.camel@localhost> References: <1212686565.29017.12.camel@localhost> <4848231E.8090709@mcs.anl.gov> <1212688952.29852.4.camel@localhost> Message-ID: <484838B7.6040003@mcs.anl.gov> It seems there are 2 scenarios: we get out of hibernation, join the process, and life is good. Or we decide down the road that this is not working, and we withdraw. In the first scenario, its best to convert now to dev.globus lists. In the second scenario, we'd have to convert back to some other domain later. I think the first scenario is more likely, so I propose we set a target date to convert, and for the moment, manually mirror the membership from CI to majordomo to comply with the guidelines. I propose that you send out a summary of the infrastructure impact, and we then attempt to reach a decision on dev.globus by a committers vote. I'll echo this on another thread, along with discussion on a review of the committers list. - Mike On 6/5/08 1:02 PM, Mihael Hategan wrote: > On Thu, 2008-06-05 at 12:32 -0500, Michael Wilde wrote: >> It sounds good to me. >> >> Would it be better and feasible though to maintain just one set of list >> memberships, and have the master lists echo to the other set for >> archival purposes? > > If you want @ci to be primary, then no. It is required for all > committers to be subscribed to the @globus mailing lists. 
> > So we either keep both in these conditions or move entirely to @globus. > >> It seems like we have much vested in maintaining svn and bugzilla using >> the current infrastructure, but the email lists seem a bit easier to >> change. And if the CI lists can remain the master, is it OK just to >> forward traffic to the dev.globus lists? >> >> If the dev.globus lists need to be fully populated with members to meet >> dev.globus requirements, then can we transition to using those as the >> sole lists, with minimal impact on current list members? >> >> - Mike >> >> >> >> On 6/5/08 12:22 PM, Mihael Hategan wrote: >>> It seems like one requirement is to have all committers subscribed to >>> the @globus.org mailing lists. >>> >>> While @ci will continue to be our primary mailing lists, in order to >>> meet the requirements, I'll do the following: >>> >>> - move everybody from swift-commit at ci to swift-commit at globus and make >>> swift-commit at ci forward to swift-commit at globus. This is so that we don't >>> get double posts, but still have the infrastructure primarily based @ci. >>> >>> - subscribe all committers to the other @globus mailing lists. We'll >>> encourage users to use the @ci mailing lists, and discussions initiated >>> on @globus should be manually moved to @ci. >>> >>> Objections? 
>>> >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From hategan at mcs.anl.gov Thu Jun 5 14:12:23 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 05 Jun 2008 14:12:23 -0500 Subject: [Swift-devel] maling lists In-Reply-To: <484838B7.6040003@mcs.anl.gov> References: <1212686565.29017.12.camel@localhost> <4848231E.8090709@mcs.anl.gov> <1212688952.29852.4.camel@localhost> <484838B7.6040003@mcs.anl.gov> Message-ID: <1212693143.30510.44.camel@localhost> On Thu, 2008-06-05 at 14:04 -0500, Michael Wilde wrote: > It seems there are 2 scenarios: we get out of hibernation, join the > process, and life is good. Or we decide down the road that this is not > working, and we withdraw. I'd say no more down the road. We decide on this round, based on hands on experience trying to get there, whether we do it or not. I'll put all I can into getting it to work under the assumption that any changes required will be minimal and accepted by everybody. If that fails, so be it. > > In the first scenario, its best to convert now to dev.globus lists. > > In the second scenario, we'd have to convert back to some other domain > later. > > I think the first scenario is more likely, so I propose we set a target > date to convert, and for the moment, manually mirror the membership from > CI to majordomo to comply with the guidelines. > > I propose that you send out a summary of the infrastructure impact, and > we then attempt to reach a decision on dev.globus by a committers vote. > > I'll echo this on another thread, along with discussion on a review of > the committers list. > > - Mike > > > On 6/5/08 1:02 PM, Mihael Hategan wrote: > > On Thu, 2008-06-05 at 12:32 -0500, Michael Wilde wrote: > >> It sounds good to me. 
> >> > >> Would it be better and feasible though to maintain just one set of list > >> memberships, and have the master lists echo to the other set for > >> archival purposes? > > > > If you want @ci to be primary, then no. It is required for all > > committers to be subscribed to the @globus mailing lists. > > > > So we either keep both in these conditions or move entirely to @globus. > > > >> It seems like we have much vested in maintaining svn and bugzilla using > >> the current infrastructure, but the email lists seem a bit easier to > >> change. And if the CI lists can remain the master, is it OK just to > >> forward traffic to the dev.globus lists? > >> > >> If the dev.globus lists need to be fully populated with members to meet > >> dev.globus requirements, then can we transition to using those as the > >> sole lists, with minimal impact on current list members? > >> > >> - Mike > >> > >> > >> > >> On 6/5/08 12:22 PM, Mihael Hategan wrote: > >>> It seems like one requirement is to have all committers subscribed to > >>> the @globus.org mailing lists. > >>> > >>> While @ci will continue to be our primary mailing lists, in order to > >>> meet the requirements, I'll do the following: > >>> > >>> - move everybody from swift-commit at ci to swift-commit at globus and make > >>> swift-commit at ci forward to swift-commit at globus. This is so that we don't > >>> get double posts, but still have the infrastructure primarily based @ci. > >>> > >>> - subscribe all committers to the other @globus mailing lists. We'll > >>> encourage users to use the @ci mailing lists, and discussions initiated > >>> on @globus should be manually moved to @ci. > >>> > >>> Objections? 
> >>> > >>> _______________________________________________ > >>> Swift-devel mailing list > >>> Swift-devel at ci.uchicago.edu > >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > From benc at hawaga.org.uk Thu Jun 5 14:22:42 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 5 Jun 2008 19:22:42 +0000 (GMT) Subject: [Swift-devel] maling lists In-Reply-To: <48483319.3000100@mcs.anl.gov> References: <1212686565.29017.12.camel@localhost> <1212691031.30510.5.camel@localhost> <48483319.3000100@mcs.anl.gov> Message-ID: On Thu, 5 Jun 2008, Ian Foster wrote: > The downsides that are being quoted seem minor. mostly I think thats because the downsides are ones that inconvenience people like me, rather than people like you. do you even use svn or procmail regularly? -- From wilde at mcs.anl.gov Thu Jun 5 15:46:49 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 05 Jun 2008 15:46:49 -0500 Subject: [Swift-devel] [PROPOSAL] Review and adjust Committers list In-Reply-To: <1212616673.14746.4.camel@localhost> References: <1212598787.8492.8.camel@localhost> <1212615395.14746.0.camel@localhost> <1212616673.14746.4.camel@localhost> Message-ID: <484850B9.7050605@mcs.anl.gov> Subject was: Re: [Swift-devel] committers As we move towards meeting the infrastructure and process guidelines of the dev.globus incubator, I propose that we revise the committers list. I propose, based on the people actively involved in the project, that the committers list be changed by vote of the current committers to be: Ben Clifford Ian Foster Mihael Hategan Sarah Kenny Michael Wilde Of this list, all but Sarah are current committers. Sarah has recently joined the group, and her job calls for her to become an active committer. I propose that Nika Nefedova and Beth Cerny be taken off the committers list. Neither Nika nor Beth currently work on the project. 
As a web content developer, Beth may likely contribute in the future, but does not, I feel, need to (nor, I suspect, want to) engage as a contributor. I propose that Ioan Raicu, Tibi Stef-Praun, and Yong Zhao become contributors rather than committers. Their past contributions have been immensely valuable, but I feel that their current role is more appropriately one of contributor. Ioan, Tibi, Yong, if you feel differently, and plan to be an active committer, I'm happy to amend this proposal to reflect your wish. But I am seeking to keep the committers list, which has voting rights in the project, down to the people that are deeply involved. This will be our first formal vote under the dev.globus guidelines: http://dev.globus.org/wiki/Guidelines#Decision_Making My understanding is that we do this:
1) Readers reply to this email with comments.
2) I, as proposer, call a vote with another email.
3) You respond to the vote email. The way you do this is, I think, to reply to the voting email with one of these strings and, in some cases, additional text as indicated:
+1 The action should be performed.
+0 Abstain - I support the action.
-0 Abstain, I don't support the action but I can't help with an alternative.
-1 The action should not be performed and I am offering an explanation or alternative.
I've paraphrased these from the original to reflect what I believe is appropriate for a committers-list change rather than a technical change. My read of the guidelines is that we have 5 days to conduct the vote. I'll send the [VOTE] message out in a few hours, to allow initial comments on this proposal. - Mike On 6/4/08 4:57 PM, Mihael Hategan wrote: > All committers have been added to the swift-commit mailing list. > > If some are wondering what the list admin password is, well, you either > know it already or you don't. > > On Wed, 2008-06-04 at 16:36 -0500, Mihael Hategan wrote: >> Ah, that bug I wasn't CC-ed on. Well, I know about it now.
>> >> On Wed, 2008-06-04 at 21:20 +0000, Ben Clifford wrote: >>> On Wed, 4 Jun 2008, Mihael Hategan wrote: >>> >>>> So we need, I think, to agree on the initial list of committers for the >>>> dev.globus side of things. >>> The initial proposal, listed in globus bug 5300, already contains an >>> initial list of committers: >>> >>> Benjamin Clifford >>> Mihael Hategan >>> Tiberius Stef-Praun Beth >>> Yong Zhao >>> Veronika Nefedova >>> Ian Foster >>> Michael Wilde >>> Ioan Raicu >>> Beth Cerny Patino >>> >>> Any amendments to that should probably be made through the dev.globus >>> voting process. >>> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute University of Chicago and Argonne National Laboratory 5640 S. Ellis Av, Suite 405 Chicago, IL 60637 USA 708-203-9548 From hategan at mcs.anl.gov Thu Jun 5 17:30:10 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 05 Jun 2008 17:30:10 -0500 Subject: [Swift-devel] [PROPOSAL] Review and adjust Committers list In-Reply-To: <484850B9.7050605@mcs.anl.gov> References: <1212598787.8492.8.camel@localhost> <1212615395.14746.0.camel@localhost> <1212616673.14746.4.camel@localhost> <484850B9.7050605@mcs.anl.gov> Message-ID: <1212705010.2002.26.camel@localhost> Right. Committers should probably be only those who actively work on Swift. Which in a sense would probably exclude Ian and make him a contributor instead: "Contributors are the people who write code or documentation patches or contribute positively to Globus in other ways, for example by being active on the developer mailing list, participating in discussions, and/or providing suggestions or criticism." 
In any event, I don't think having write access to the repository necessarily means you have to be a committer. In other words Beth can keep write access to the repository, and so can Ioan, Tibi, and Victor, without having committer status. So the list looks good to me. Mihael From foster at mcs.anl.gov Thu Jun 5 18:01:46 2008 From: foster at mcs.anl.gov (Ian Foster) Date: Thu, 05 Jun 2008 18:01:46 -0500 Subject: [Swift-devel] maling lists In-Reply-To: References: <1212686565.29017.12.camel@localhost> <1212691031.30510.5.camel@localhost> <48483319.3000100@mcs.anl.gov> Message-ID: <4848705A.10909@mcs.anl.gov> Ben: I don't think we are talking about changing SVN, are we? I do use procmail for other mail lists, not this one. Ian. Ben Clifford wrote: > On Thu, 5 Jun 2008, Ian Foster wrote: > > >> The downsides that are being quoted seem minor. >> > > mostly I think thats because the downsides are ones that inconvenience > people like me, rather than people like you. > > do you even use svn or procmail regularly? > > -------------- next part -------------- An HTML attachment was scrubbed...
URL: From wilde at mcs.anl.gov Thu Jun 5 18:18:11 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 05 Jun 2008 18:18:11 -0500 Subject: [Swift-devel] [VOTE] Proposal to adjust Committers list In-Reply-To: <1212616673.14746.4.camel@localhost> References: <1212598787.8492.8.camel@localhost> <1212615395.14746.0.camel@localhost> <1212616673.14746.4.camel@localhost> Message-ID: <48487433.40602@mcs.anl.gov> I'm sending this out for your vote as I indicated in the proposal. There's been one comment in favor and none against. The vote closes in 5 days per my read of the guidelines - same time, June 10. Please respond with one of: +1 The action should be performed. +0 Abstain - I support the action. -0 Abstain, I don't support the action but I can't help with an alternative -1 The action should not be performed and I am offering an explanation or alternative. Current committers, please add to your vote the word "binding". Thanks, Mike --- Proposal text below --- As we move towards meeting the infrastructure and process guidelines of the dev.globus incubator, I propose that we revise the committers list. I propose, based on the people actively involved in the project, that the committers list be changed by vote of the current committers to be: Ben Clifford Ian Foster Mihael Hategan Sarah Kenny Michael Wilde Of this list, all but Sarah are current committers. Sarah has recently joined the group, and her job calls for her to become an active committer. I propose that Nika Nefedova and Beth Cerny be taken off the committers list. Neither Nika nor Beth currently works on the project. As a web content developer, Beth may likely contribute in the future, but does not, I feel, need to (nor, I suspect, want to) engage as a contributor. I propose that Ioan Raicu, Tibi Stef-Praun, and Yong Zhao become contributors rather than committers. Their past contributions have been immensely valuable, but I feel that their current role is more appropriately one of contributor.
Ioan, Tibi, Yong, if you feel differently, and plan to be an active committer, I'm happy to amend this proposal to reflect your wish. But I am seeking to keep the committers list, which has voting rights in the project, down to the people that are deeply involved. This will be our first formal vote under the dev.globus guidelines: http://dev.globus.org/wiki/Guidelines#Decision_Making My understanding is that we do this: 1) readers reply to this email with comments. 2) I as proposer call a vote with another email 3) you respond to the vote email with: The way you do this is - I think - to reply to the voting email with one of these strings and in some cases additional text as indicated: +1 The action should be performed. +0 Abstain - I support the action. -0 Abstain, I don't support the action but I can't help with an alternative -1 The action should not be performed and I am offering an explanation or alternative. I've paraphrased these from the original to reflect what I believe is appropriate for a committers-list change rather than a technical change. My read of the guidelines is that we have 5 days to conduct the vote. I'll send the [VOTE] message out in a few hours, to allow initial comments on this proposal. - Mike On 6/4/08 4:57 PM, Mihael Hategan wrote: > All committers have been added to the swift-commit mailing list. > > If some are wondering what the list admin password is, well, you either > know it already or you don't. > > On Wed, 2008-06-04 at 16:36 -0500, Mihael Hategan wrote: >> Ah, that bug I wasn't CC-ed on. Well, I know about it now. >> >> On Wed, 2008-06-04 at 21:20 +0000, Ben Clifford wrote: >>> On Wed, 4 Jun 2008, Mihael Hategan wrote: >>> >>>> So we need, I think, to agree on the initial list of committers for the >>>> dev.globus side of things.
>>> The initial proposal, listed in globus bug 5300, already contains an >>> initial list of committers: >>> >>> Benjamin Clifford >>> Mihael Hategan >>> Tiberius Stef-Praun Beth >>> Yong Zhao >>> Veronika Nefedova >>> Ian Foster >>> Michael Wilde >>> Ioan Raicu >>> Beth Cerny Patino >>> >>> Any amendments to that should probably be made through the dev.globus >>> voting process. >>> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute University of Chicago and Argonne National Laboratory 5640 S. Ellis Av, Suite 405 Chicago, IL 60637 USA 708-203-9548 From wilde at mcs.anl.gov Thu Jun 5 19:11:42 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 05 Jun 2008 19:11:42 -0500 Subject: [Swift-devel] [VOTE] Proposal to adjust Committers list In-Reply-To: <48487433.40602@mcs.anl.gov> References: <1212598787.8492.8.camel@localhost> <1212615395.14746.0.camel@localhost> <1212616673.14746.4.camel@localhost> <48487433.40602@mcs.anl.gov> Message-ID: <484880BE.7030207@mcs.anl.gov> +1 (binding) - Mike From hategan at mcs.anl.gov Thu Jun 5 19:23:40 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 05 Jun 2008 19:23:40 -0500 Subject: [Swift-devel] [VOTE] Proposal to adjust Committers list In-Reply-To: <48487433.40602@mcs.anl.gov> References: <1212598787.8492.8.camel@localhost> <1212615395.14746.0.camel@localhost> <1212616673.14746.4.camel@localhost> <48487433.40602@mcs.anl.gov> Message-ID: <1212711820.4797.5.camel@localhost> +1 From benc at hawaga.org.uk Fri Jun 6 11:34:03 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 6 Jun 2008 16:34:03 +0000 (GMT) Subject: [Swift-devel] maling lists In-Reply-To: <4848705A.10909@mcs.anl.gov> References: <1212686565.29017.12.camel@localhost> <1212691031.30510.5.camel@localhost> <48483319.3000100@mcs.anl.gov> <4848705A.10909@mcs.anl.gov> Message-ID: On Thu, 5 Jun 2008, Ian Foster wrote: > I don't think we are talking about changing SVN, are we? Fooling round with SVN has been discussed, so pretty much yes.
-- From benc at hawaga.org.uk Fri Jun 6 11:44:58 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 6 Jun 2008 16:44:58 +0000 (GMT) Subject: [Swift-devel] maling lists In-Reply-To: <1212689208.29852.8.camel@localhost> References: <1212686565.29017.12.camel@localhost> <48482453.9070809@mcs.anl.gov> <1212689208.29852.8.camel@localhost> Message-ID: On Thu, 5 Jun 2008, Mihael Hategan wrote: > On Thu, 2008-06-05 at 12:37 -0500, Ian Foster wrote: > > Any reason not to move to the globus.org lists? It has the advantage of > > having a "home" not affiliated with a particular institution > > And the disadvantage of having a "home" associated with a particular > institution, namely dev.globus.org. a particular institution which many people have extremely low brand opinion of, too - I rarely hear people comment "I will never use anything from University of Chicago", as a contrast. -- From benc at hawaga.org.uk Fri Jun 6 11:49:49 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 6 Jun 2008 16:49:49 +0000 (GMT) Subject: [Swift-devel] boolean operators Message-ID: Milena noticed that the boolean operators for AND and OR have been broken for quite a long time. It's interesting that no one noticed this, as a commentary on how people are using the control constructs. I know lots of people use foreach; but this suggests that basically no one is doing anything fancy with booleans (so I think no really complicated ifs or iterates). -- From hategan at mcs.anl.gov Fri Jun 6 12:53:37 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 06 Jun 2008 12:53:37 -0500 Subject: [Swift-devel] maling lists In-Reply-To: <1212691031.30510.5.camel@localhost> References: <1212686565.29017.12.camel@localhost> <1212691031.30510.5.camel@localhost> Message-ID: <1212774817.17166.3.camel@localhost> > > > > We should not have multiple mailing lists for the same purpose.
If we have > > globus lists, the CI lists should be shut down and the community and > > archives there abandoned or compelled to move. > > We wouldn't really use the @globus.org mailing lists. We fulfill our > requirement (given that we have no choice), and we also make it so that > we don't really fulfill the requirement. In other words we have the > lists, we do what we're asked, but we tell everybody to use the other > ones. And for the few cases when people ask questions on the @globus.org > lists, we move the discussion. There could also be an advantage to this: we can tell how many users dev.globus.org generated for us. As far as I can tell, the mailing lists are the only known showstopper. So Ben, 1, 0, -0, -1? > > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From benc at hawaga.org.uk Fri Jun 6 13:00:11 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 6 Jun 2008 18:00:11 +0000 (GMT) Subject: [Swift-devel] maling lists In-Reply-To: <1212774817.17166.3.camel@localhost> References: <1212686565.29017.12.camel@localhost> <1212691031.30510.5.camel@localhost> <1212774817.17166.3.camel@localhost> Message-ID: > > > We should not have multiple mailing lists for the same purpose. If we have > > > globus lists, the CI lists should be shut down and the community and > > > archives there abandoned or compelled to move. > As far as I can tell, the mailing lists are the only known showstopper. > > So Ben, 1, 0, -0, -1? Given that I've already had one person ask me why the swift archives don't exist (based on following links of the dev.globus wiki page that used to link to the 'real' archives but now point somewhere empty) I remain convinced that we should not have multiple mailing lists.
-- From hategan at mcs.anl.gov Fri Jun 6 13:19:24 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 06 Jun 2008 13:19:24 -0500 Subject: [Swift-devel] maling lists In-Reply-To: References: <1212686565.29017.12.camel@localhost> <1212691031.30510.5.camel@localhost> <1212774817.17166.3.camel@localhost> Message-ID: <1212776364.17699.0.camel@localhost> On Fri, 2008-06-06 at 18:00 +0000, Ben Clifford wrote: > > > > We should not have multiple mailing lists for the same purpose. If we have > > > > globus lists, the CI lists should be shut down and the community and > > > > archives there abandoned or compelled to move. > > > As far as I can tell, the mailing lists are the only known showstopper. > > > > So Ben, 1, 0, -0, -1? > > Given that I've already had one person ask me why the swift archives don't > exist (based on following links of the dev.globus wiki page that used to > link to the 'real' archives but now point somewhere empty) I remain > convinced that we should not have multiple mailing lists. Right. That was supposed to change. Take a look at the page now. > From foster at mcs.anl.gov Fri Jun 6 14:58:58 2008 From: foster at mcs.anl.gov (Ian Foster) Date: Fri, 6 Jun 2008 14:58:58 -0500 Subject: [Swift-devel] maling lists In-Reply-To: References: <1212686565.29017.12.camel@localhost> <48482453.9070809@mcs.anl.gov> <1212689208.29852.8.camel@localhost> Message-ID: and of which many people have a very high opinion. I am being lobbied by multiple European groups who want Globus to be part of the European Grid Initiative, "because it just works" (unlike other grid software that they use). The National Institutes of Health is rolling out several Globus based grids for the same reason. Etc. I think we should try not to let personal prejudices get in the way of sensible decision making. Ian. 
On Jun 6, 2008, at 11:44 AM, Ben Clifford wrote: > > On Thu, 5 Jun 2008, Mihael Hategan wrote: > >> On Thu, 2008-06-05 at 12:37 -0500, Ian Foster wrote: >>> Any reason not to move to the globus.org lists? It has the >>> advantage of >>> having a "home" not affiliated with a particular institution >> >> And the disadvantage of having a "home" associated with a particular >> institution, namely dev.globus.org. > > a particular institution which many people have extremely low brand > opinion of, too - I rarely hear people comment "I will never use > anything > from University of Chicago", as a contrast. > > -- > From hategan at mcs.anl.gov Fri Jun 6 15:25:59 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 06 Jun 2008 15:25:59 -0500 Subject: [Swift-devel] maling lists In-Reply-To: References: <1212686565.29017.12.camel@localhost> <48482453.9070809@mcs.anl.gov> <1212689208.29852.8.camel@localhost> Message-ID: <1212783959.20147.14.camel@localhost> On Fri, 2008-06-06 at 14:58 -0500, Ian Foster wrote: > and of which many people have a very high opinion. I am being lobbied > by multiple European groups who want Globus to be part of the European > Grid Initiative, "because it just works" :) I'll just say that it's a dangerous belief that does not drive one to actually make things work should the belief happen to not be quite true. But this is probably the wrong mailing list for the topic. > (unlike other grid software > that they use). The National Institutes of Health is rolling out > several Globus based grids for the same reason. Etc. > > I think we should try not to let personal prejudices get in the way of > sensible decision making. I agree. 
Though I do want to point out that there is a distinction between The Globus Toolkit and dev.globus.org, and that I myself got complaints about the latter which amounted to annoyances toward the fact that fairly senior project managers were being treated like children in the interest of a uniformity that proved to be largely irrelevant. But this is also a subject that probably doesn't belong on this mailing list. Mihael From benc at hawaga.org.uk Fri Jun 6 17:42:53 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 6 Jun 2008 22:42:53 +0000 (GMT) Subject: [Swift-devel] Re: how to put different wrapper behaviour into production In-Reply-To: References: Message-ID: On Wed, 9 Apr 2008, Ben Clifford wrote: > Last week or so I made some patches that change wrapper.sh to copy lots of > stuff around to a(n assumed) worker-local filesystem rather than using the > shared filesystem. I committed a new patch to do this in mainline Swift. Set an environment profile of SWIFT_JOBDIR_PATH to the worker-node local path, and input files will be copied to a job directory made under that path, rather than linked to on the GPFS. -- From benc at hawaga.org.uk Fri Jun 6 17:45:16 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 6 Jun 2008 22:45:16 +0000 (GMT) Subject: [Swift-devel] symlinks on gpfs Message-ID: Whilst playing with the wrapper this week I remembered something I thought before and think I forgot about. It would be interesting to see how symlinks vs hardlinks behave speedwise on GPFS - symlinks will always have an extra layer of indirection to a possibly contended shared directory; hardlinks, I think, would avoid that indirection by linking directly to the file data.
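A rough way to try the symlink vs hardlink comparison, as a minimal sketch and not a real GPFS benchmark: the script below creates one target file, makes a batch of symlinks and a batch of hard links to it in a scratch directory, and times opening and reading each batch. The function names, the link count, and the scratch directory are all assumptions made up for illustration and are not part of wrapper.sh. Run against a local temporary directory, it only shows the mechanics; to say anything about GPFS, the directory would have to sit on the shared filesystem, ideally with many nodes reading at once.

```python
import os
import tempfile
import time


def make_links(workdir, count, link_fn, prefix):
    """Create `count` links to one target file with link_fn (os.symlink or os.link)."""
    target = os.path.join(workdir, "target.dat")
    if not os.path.exists(target):
        with open(target, "wb") as f:
            f.write(b"x" * 4096)
    paths = []
    for i in range(count):
        path = os.path.join(workdir, "%s-%d" % (prefix, i))
        link_fn(target, path)  # both functions take (source, link_name)
        paths.append(path)
    return paths


def time_reads(paths):
    """Return the seconds taken to open and read every path once."""
    start = time.perf_counter()
    for path in paths:
        with open(path, "rb") as f:
            f.read()
    return time.perf_counter() - start


def compare(workdir, count=1000):
    """Time reads through symlinks and through hard links to the same file."""
    sym = make_links(workdir, count, os.symlink, "sym")
    hard = make_links(workdir, count, os.link, "hard")
    return {"symlink": time_reads(sym), "hardlink": time_reads(hard)}


if __name__ == "__main__":
    # Hypothetical usage: replace the temporary directory with a directory
    # on the shared filesystem to measure the case discussed above.
    with tempfile.TemporaryDirectory() as scratch:
        for kind, seconds in compare(scratch).items():
            print("%s: %.4f s" % (kind, seconds))
```

A single-process run like this mostly measures metadata caching on one node; the contention on a shared directory mentioned above would only show up with concurrent readers, so treat any numbers from it as a starting point.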
-- From hategan at mcs.anl.gov Fri Jun 6 19:40:00 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 06 Jun 2008 19:40:00 -0500 Subject: [Swift-devel] maling lists In-Reply-To: <1212692374.30510.29.camel@localhost> References: <1212686565.29017.12.camel@localhost> <1212691031.30510.5.camel@localhost> <484832E3.5000304@cs.uchicago.edu> <1212691615.30510.17.camel@localhost> <484834F3.1090500@cs.uchicago.edu> <1212692374.30510.29.camel@localhost> Message-ID: <1212799200.25717.8.camel@localhost> On Thu, 2008-06-05 at 13:59 -0500, Mihael Hategan wrote: > On Thu, 2008-06-05 at 13:48 -0500, Ioan Raicu wrote: > > Incubator guidelines are probably easier to satisfy than once you are > > ready to escalate, which should be most incubator's end goal. > > Ah, you trigger an important issue. Have the full project infrastructure > requirements been copied from the incubator requirements and not > updated, or is it that the "existing infrastructure" leniencies are > meant to only be transitional. > > I guess that's another mail to Jen! Well that's likely a no. The statement from Charles was "I assume if we're happy to have you field your own source repo during incubation that you would stay in that repo through escalation." Which probably applies to other things. Charles also seems to think that the U of C mailing lists, if open and archived, which they are, would satisfy the conceptual requirements, but the issue needs to be voted on their next meeting. Funny. It seems easier to meddle with incubator rules than to get a project out of hibernation. This would allow us to only set up forwarding from the @globus mailing lists, which we would not advertise, but would be there if somebody assumes there must be a swift-xyz at globus.org mailing list. So I suppose there would be no more objections to such a setting. 
Mihael From bugzilla-daemon at mcs.anl.gov Sat Jun 7 08:06:59 2008 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Sat, 7 Jun 2008 08:06:59 -0500 (CDT) Subject: [Swift-devel] [Bug 142] concurrent mapper does not work when used inside iterate {} block In-Reply-To: Message-ID: <20080607130659.69761164BB@foxtrot.mcs.anl.gov> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=142 benc at hawaga.org.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #1 from benc at hawaga.org.uk 2008-06-07 08:06 ------- fixed in r2052 -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is. From bugzilla-daemon at mcs.anl.gov Sat Jun 7 08:09:06 2008 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Sat, 7 Jun 2008 08:09:06 -0500 (CDT) Subject: [Swift-devel] [Bug 102] workflow failes due to file cache duplicates In-Reply-To: Message-ID: <20080607130906.529E3164BB@foxtrot.mcs.anl.gov> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=102 benc at hawaga.org.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|ASSIGNED |RESOLVED Resolution| |FIXED ------- Comment #2 from benc at hawaga.org.uk 2008-06-07 08:09 ------- Last comment says this is fixed. -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is. 
From bugzilla-daemon at mcs.anl.gov Sat Jun 7 08:10:46 2008 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Sat, 7 Jun 2008 08:10:46 -0500 (CDT) Subject: [Swift-devel] [Bug 136] CLASSPATH construction order In-Reply-To: Message-ID: <20080607131046.9F812164BB@foxtrot.mcs.anl.gov> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=136 benc at hawaga.org.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|ASSIGNED |RESOLVED Resolution| |FIXED ------- Comment #2 from benc at hawaga.org.uk 2008-06-07 08:10 ------- This change works (in the situation where it was causing a problem, which is when Swift is installed at the same time as Pegasus). -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You reported the bug, or are watching the reporter. From bugzilla-daemon at mcs.anl.gov Sat Jun 7 08:12:04 2008 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Sat, 7 Jun 2008 08:12:04 -0500 (CDT) Subject: [Swift-devel] [Bug 140] assorted GRAM4 own-hostname problems In-Reply-To: Message-ID: <20080607131204.F40DF164BB@foxtrot.mcs.anl.gov> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=140 benc at hawaga.org.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|ASSIGNED |RESOLVED Resolution| |FIXED ------- Comment #3 from benc at hawaga.org.uk 2008-06-07 08:12 ------- I haven't encountered hostname related problems such as the below recently, so I think this is fixed to my satisfaction. -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You reported the bug, or are watching the reporter. 
From bugzilla-daemon at mcs.anl.gov Sat Jun 7 08:12:39 2008 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Sat, 7 Jun 2008 08:12:39 -0500 (CDT) Subject: [Swift-devel] [Bug 144] earlier detection of duplicate output file mapping. In-Reply-To: Message-ID: <20080607131239.E5CCB164BB@foxtrot.mcs.anl.gov> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=144 benc at hawaga.org.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Summary|earlier detection of earlier|earlier detection of |duplicate output file |duplicate output file |mapping. |mapping. -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is. From bugzilla-daemon at mcs.anl.gov Sat Jun 7 08:42:34 2008 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Sat, 7 Jun 2008 08:42:34 -0500 (CDT) Subject: [Swift-devel] [Bug 101] fast-failing sites will absorb large numbers of jobs causing runs to fail despite multiple attempts at retrying In-Reply-To: Message-ID: <20080607134234.0A4CE16460@foxtrot.mcs.anl.gov> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=101 benc at hawaga.org.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Summary|failure in site |fast-failing sites will |initialisation appears to |absorb large numbers of jobs |cause job to fail rather |causing runs to fail despite |than be retried elsewhere. |multiple attempts at | |retrying ------- Comment #3 from benc at hawaga.org.uk 2008-06-07 08:42 ------- previous summary was:failure in site initialisation appears to cause job to fail rather than be retried elsewhere. 
-- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is. You reported the bug, or are watching the reporter. From benc at hawaga.org.uk Sat Jun 7 09:09:33 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Sat, 7 Jun 2008 14:09:33 +0000 (GMT) Subject: [Swift-devel] Re: Few easy questions In-Reply-To: <123bf0400806020627u7fa08586wd8d4fd6c595ea02@mail.gmail.com> References: <123bf0400806020627u7fa08586wd8d4fd6c595ea02@mail.gmail.com> Message-ID: Milena asked me the below questions privately: On Mon, 2 Jun 2008, Milena Nikolic wrote: > 1. For COND_EXPR, what can be compared? This covers ==, !=, <, >, <=, >= > for numeric types I guess. Can other types be compared with == and != ? Numeric types should have the equality (== and !=) and ordering (< > <= >=) operators - this should be working now in the codebase. Strings and booleans should probably also work with the equality (== and !=) operators. > 2. In RANGE_EXPR can we have anything else but int? For instance [1.5, 2.5, > 0.1]? Any numeric type should work there. There is a bug related to this, bug 126: http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=126 That is a problem in the runtime, that can be fixed separately from typechecking. -- From hategan at mcs.anl.gov Sat Jun 7 10:56:35 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sat, 07 Jun 2008 10:56:35 -0500 Subject: [Swift-devel] Re: Few easy questions In-Reply-To: References: <123bf0400806020627u7fa08586wd8d4fd6c595ea02@mail.gmail.com> Message-ID: <1212854195.3370.0.camel@localhost> On Sat, 2008-06-07 at 14:09 +0000, Ben Clifford wrote: > Milena asked me the below questions privately: > > On Mon, 2 Jun 2008, Milena Nikolic wrote: > > > 1. For COND_EXPR, what can be compared? This covers ==, !=, <, >, <=, >= > > for numeric types I guess. Can other types be compared with == and != ? 
> > Numeric types should have the equality (== and !=) and ordering (< > <= > >=) operators - this should be working now in the codebase. > > Strings and booleans should probably also work with the equality (== and > !=) operators. And if we stretch it a bit, there's lexicographic order. > > > 2. In RANGE_EXPR can we have anything else but int? For instance [1.5, 2.5, > > 0.1]? > > Any numeric type should work there. > > There is a bug related to this, bug 126: > http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=126 > > That is a problem in the runtime, that can be fixed separately from > typechecking. > From benc at hawaga.org.uk Sat Jun 7 13:39:45 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Sat, 7 Jun 2008 18:39:45 +0000 (GMT) Subject: [Swift-devel] Re: Few easy questions In-Reply-To: <1212854195.3370.0.camel@localhost> References: <123bf0400806020627u7fa08586wd8d4fd6c595ea02@mail.gmail.com> <1212854195.3370.0.camel@localhost> Message-ID: On Sat, 7 Jun 2008, Mihael Hategan wrote: > And if we stretch it a bit, there's lexicographic order. in which case, we'd have orderings on strings and on array of orderables. but I suspect that isn't really necessary for typical swift applications. -- From hategan at mcs.anl.gov Sat Jun 7 14:22:24 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sat, 07 Jun 2008 14:22:24 -0500 Subject: [Swift-devel] Re: Few easy questions In-Reply-To: References: <123bf0400806020627u7fa08586wd8d4fd6c595ea02@mail.gmail.com> <1212854195.3370.0.camel@localhost> Message-ID: <1212866544.10312.1.camel@localhost> On Sat, 2008-06-07 at 18:39 +0000, Ben Clifford wrote: > On Sat, 7 Jun 2008, Mihael Hategan wrote: > > > And if we stretch it a bit, there's lexicographic order. > > in which case, we'd have orderings on strings and on array of orderables. Just strings would probably do. > > but I suspect that isn't really necessary for typical swift applications. 
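The comparison rules discussed in this thread can be written down as a small type-check function. This is a sketch only, not the Swift compiler's code: the function name check_comparison and the type names used as strings are assumptions made for illustration, mixed int/float comparison is assumed to be allowed, and ordering on strings is included on the assumption that the lexicographic-order suggestion is taken up.

```python
NUMERIC = {"int", "float"}  # assumed names for the numeric types
EQUALITY_OPS = {"==", "!="}
ORDERING_OPS = {"<", ">", "<=", ">="}


def check_comparison(op, left, right):
    """Return "boolean" if `left op right` type-checks, else raise TypeError."""
    both_numeric = left in NUMERIC and right in NUMERIC  # assumption: int vs float is fine
    same_type = left == right

    if op in EQUALITY_OPS:
        # Equality: numeric types, plus strings and booleans.
        ok = both_numeric or (same_type and left in {"string", "boolean"})
    elif op in ORDERING_OPS:
        # Ordering: numeric types, plus strings (lexicographic order).
        ok = both_numeric or (same_type and left == "string")
    else:
        raise ValueError(f"not a comparison operator: {op}")

    if not ok:
        raise TypeError(f"cannot apply {op} to {left} and {right}")
    return "boolean"
```

For example, check_comparison("<", "int", "float") is accepted, while check_comparison("<", "boolean", "boolean") raises TypeError because booleans get equality but no ordering.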
From bugzilla-daemon at mcs.anl.gov Sat Jun 7 15:26:22 2008 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Sat, 7 Jun 2008 15:26:22 -0500 (CDT) Subject: [Swift-devel] [Bug 145] Failed to link input file In-Reply-To: Message-ID: <20080607202622.0295816460@foxtrot.mcs.anl.gov> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=145 ------- Comment #1 from benc at hawaga.org.uk 2008-06-07 15:26 ------- > In fact, it did happen before on other sites within OSG except "UCSDT2". Do you mean that the workflow works on every site except UCSDT2? -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at mcs.anl.gov Sat Jun 7 16:20:31 2008 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Sat, 7 Jun 2008 16:20:31 -0500 (CDT) Subject: [Swift-devel] [Bug 145] Failed to link input file In-Reply-To: Message-ID: <20080607212031.80A95164BB@foxtrot.mcs.anl.gov> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=145 ------- Comment #2 from benc at hawaga.org.uk 2008-06-07 16:20 ------- Previously, I've tested this site using fork and condor. Fork worked, condor showed the symptom xi reports. I just put the site definitions that I used into svn in r2055. However, trying today I find that site's ftp server unresponsive. -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From hategan at mcs.anl.gov Sat Jun 7 18:21:49 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sat, 07 Jun 2008 18:21:49 -0500 Subject: [Swift-devel] karajan compiler Message-ID: <1212880909.13612.14.camel@localhost> In the past week I've been playing a little with trying to see how viable such an idea would be. It seems like it is. 
The difficulty was in keeping lightweight threading in. This seems to be doable pretty efficiently with a fair amount of switch statements. Though trying it on real things would probably reveal the need for some adjustments. I also wanted to have proper lexical scoping, in the style of ML. That also seems to be working fairly well. The third part was some type system, and it turns out type inference was also easy to put in. So the end result is something that is statically scoped, but type annotations are optional. In terms of Swift this would translate into an easier implementation of the Swift type system, which is currently quite contrived. Speeds seem to be in the range of about 1.5 orders of magnitude faster (50x) (mostly due to the fact that context switches happen in a more regular fashion). I'll be committing the code in a separate module in cog at some point in the future, so that others can play with it (though I suspect others means Ben), even if it may never see more development. It's, of course, missing a lot of things. Mihael From bugzilla-daemon at mcs.anl.gov Sun Jun 8 09:39:58 2008 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Sun, 8 Jun 2008 09:39:58 -0500 (CDT) Subject: [Swift-devel] [Bug 145] Failed to link input file In-Reply-To: Message-ID: <20080608143958.70744164BB@foxtrot.mcs.anl.gov> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=145 benc at hawaga.org.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |ASSIGNED Component|General |Specific site issues -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From benc at hawaga.org.uk Sun Jun 8 10:52:04 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Sun, 8 Jun 2008 15:52:04 +0000 (GMT) Subject: [Swift-devel] karajan compiler In-Reply-To: <1212880909.13612.14.camel@localhost> References: <1212880909.13612.14.camel@localhost> Message-ID: how much is the source file format and java-library linking changed? (as in how likely is it that an existing .kml file, such as one generated by the Swift compiler, and calling out to Java libraries such as the Swift runtime libraries, will work?) -- From bugzilla-daemon at mcs.anl.gov Sun Jun 8 11:37:12 2008 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Sun, 8 Jun 2008 11:37:12 -0500 (CDT) Subject: [Swift-devel] [Bug 145] Failed to link input file In-Reply-To: Message-ID: <20080608163712.C4312164BB@foxtrot.mcs.anl.gov> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=145 ------- Comment #3 from lixi at uchicago.edu 2008-06-08 11:37 ------- (In reply to comment #1) > > In fact, it did happen before on other sites within OSG except "UCSDT2". > Do you mean that the workflow works on every site except UCSDT2? I mean that the same error (failed to link input file) happened on other sites except UCSDT2 before. -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From hategan at mcs.anl.gov Sun Jun 8 11:46:14 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 08 Jun 2008 11:46:14 -0500 Subject: [Swift-devel] karajan compiler In-Reply-To: References: <1212880909.13612.14.camel@localhost> Message-ID: <1212943574.20518.0.camel@localhost> On Sun, 2008-06-08 at 15:52 +0000, Ben Clifford wrote: > how much is the source file format and java-library linking changed?
(as > in how likely is it that an existing .kml file, such as one generated by > the Swift compiler, and calling out to Java libraries such as the Swift > runtime libraries, will work?) > Likelihoods: 0 and 0. From benc at hawaga.org.uk Sun Jun 8 11:52:56 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Sun, 8 Jun 2008 16:52:56 +0000 (GMT) Subject: [Swift-devel] [VOTE] 0.6 release plan Message-ID: I will be the release manager for Swift 0.6. I'm going to make a release candidate for 0.6 early next week (sometime within the next three days), and hope to release that as 0.6 proper next weekend (maybe 6 days from now). I'm hoping to have a single release candidate, with minor bugs being noted and fixed in 0.7 rather than causing a new release candidate. I'm planning on announcing coasters and replication as experimental features which we encourage interested parties to experiment with and report their experiences. I'm also planning on making the release under the dev.globus public release guidelines. There will be no repository freeze for this release. This release plan is subject to 'Lazy Majority' approval, which means that this plan is automatically approved until/unless someone votes -1. So pretty much you do not need to vote +1 until/unless someone votes -1. If you wish to vote -1, it appears that you should vote -1 specifically for the issues that you disagree with rather than the plan as a whole. -- From foster at mcs.anl.gov Sun Jun 8 19:10:17 2008 From: foster at mcs.anl.gov (Ian Foster) Date: Sun, 8 Jun 2008 19:10:17 -0500 Subject: [Swift-devel] karajan compiler In-Reply-To: <1212880909.13612.14.camel@localhost> References: <1212880909.13612.14.camel@localhost> Message-ID: Mihael: Cool. Have you also looked at whether you could use some existing functional language compiler? We have real experts right here on campus--could be well worth talking to them. Ian.
On Jun 7, 2008, at 6:21 PM, Mihael Hategan wrote: > In the past week I've been playing a little with trying to see how > viable such an idea would be. It seems like it is. > > The difficulty was in keeping lightweight threading in. This seems > to be > doable pretty efficiently with a fair amount of switch statements. > Though trying it on real things would probably reveal the need for > some > adjustments. > > I also wanted to have proper lexical scoping, in the style of ML. That > also seems to be working fairly well. > > The third part was some type system, and it turns out type inference > was > also easy to put in. So the end result is something that is statically > scoped, but type annotations are optional. In terms of Swift this > would > translate into an easier implementation of the Swift type system, > which > is currently quite contrived. > > Speeds seem to be in the range of about 1.5 orders of magnitude faster > (50x) (mostly due to the fact that context switches happen in a more > regular fashion). > > I'll be committing the code in a separate module in cog at some > point in > the future, so that others can play with it (though I suspect others > means Ben), even if it may never see more development. It's, of > course, > missing a lot of things. > > Mihael > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From benc at hawaga.org.uk Sun Jun 8 19:15:38 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 9 Jun 2008 00:15:38 +0000 (GMT) Subject: [Swift-devel] karajan compiler In-Reply-To: References: <1212880909.13612.14.camel@localhost> Message-ID: On Sun, 8 Jun 2008, Ian Foster wrote: > Cool. Have you also looked at whether you could use some existing > functional language compiler? We have real experts right here on > campus--could be well worth talking to them. 
yes, I think it might be interesting to involve language people more in some of this. it's an extremely different space to distributed computing (though I think it is obvious that both myself and Mihael have interests both in the distributed computing and language side of things). -- From hategan at mcs.anl.gov Sun Jun 8 19:18:06 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 08 Jun 2008 19:18:06 -0500 Subject: [Swift-devel] karajan compiler In-Reply-To: References: <1212880909.13612.14.camel@localhost> Message-ID: <1212970686.29677.1.camel@localhost> On Sun, 2008-06-08 at 19:10 -0500, Ian Foster wrote: > Mihael: > > Cool. Have you also looked at whether you could use some existing > functional language compiler? We have real experts right here on > campus--could be well worth talking to them. Yes, I did. Unfortunately very few can use Java libraries and even fewer have lightweight threading. And those seem to be the things that cannot be easily added as an afterthought. > > Ian. > > > On Jun 7, 2008, at 6:21 PM, Mihael Hategan wrote: > > > In the past week I've been playing a little with trying to see how > > viable such an idea would be. It seems like it is. > > > > The difficulty was in keeping lightweight threading in. This seems > > to be > > doable pretty efficiently with a fair amount of switch statements. > > Though trying it on real things would probably reveal the need for > > some > > adjustments. > > > > I also wanted to have proper lexical scoping, in the style of ML. That > > also seems to be working fairly well. > > > > The third part was some type system, and it turns out type inference > > was > > also easy to put in. So the end result is something that is statically > > scoped, but type annotations are optional. In terms of Swift this > > would > > translate into an easier implementation of the Swift type system, > > which > > is currently quite contrived.
> > > > Speeds seem to be in the range of about 1.5 orders of magnitude faster > > (50x) (mostly due to the fact that context switches happen in a more > > regular fashion). > > > > I'll be committing the code in a separate module in cog at some > > point in > > the future, so that others can play with it (though I suspect others > > means Ben), even if it may never see more development. It's, of > > course, > > missing a lot of things. > > > > Mihael > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > From nikolicmilena at gmail.com Mon Jun 9 04:35:54 2008 From: nikolicmilena at gmail.com (Milena Nikolic) Date: Mon, 9 Jun 2008 11:35:54 +0200 Subject: [Swift-devel] Re: Few easy questions In-Reply-To: <1212866544.10312.1.camel@localhost> References: <123bf0400806020627u7fa08586wd8d4fd6c595ea02@mail.gmail.com> <1212854195.3370.0.camel@localhost> <1212866544.10312.1.camel@localhost> Message-ID: <123bf0400806090235u7c3bd96x36e82bc706d26b0b@mail.gmail.com> Ok, so I've done: - equality (== and !=) type check for two operands of *numeric*, *string* and *boolean* types - ordering (<, >, <=, >=) type check for two operands of *numeric* and *string* types I am not sure what ordering on an array of orderables is; if you give me an example, I might type check that too. Thanks, Milena On Sat, Jun 7, 2008 at 9:22 PM, Mihael Hategan wrote: > On Sat, 2008-06-07 at 18:39 +0000, Ben Clifford wrote: > > On Sat, 7 Jun 2008, Mihael Hategan wrote: > > > > > And if we stretch it a bit, there's lexicographic order. > > > > in which case, we'd have orderings on strings and on array of orderables. > > Just strings would probably do. > > > > > but I suspect that isn't really necessary for typical swift applications. > > > -------------- next part -------------- An HTML attachment was scrubbed...
URL: From hategan at mcs.anl.gov Mon Jun 9 09:04:35 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 09 Jun 2008 09:04:35 -0500 Subject: [Swift-devel] Re: Few easy questions In-Reply-To: <123bf0400806090235u7c3bd96x36e82bc706d26b0b@mail.gmail.com> References: <123bf0400806020627u7fa08586wd8d4fd6c595ea02@mail.gmail.com> <1212854195.3370.0.camel@localhost> <1212866544.10312.1.camel@localhost> <123bf0400806090235u7c3bd96x36e82bc706d26b0b@mail.gmail.com> Message-ID: <1213020275.3604.4.camel@localhost> On Mon, 2008-06-09 at 11:35 +0200, Milena Nikolic wrote: > Ok, so I've done: > - equality (== and !=) type check for two operands of numeric, string > and boolean types > - ordering (<, >, <=, >=) type check for two operands of numeric and > string types > > I am not sure what is ordering on array of orderables, if you give me > some example I might type check that too. Presumably a[] < b[] if there is a k for which a[k] < b[k] and a[i] == b[i] for all 0 <= i < k. But I think it's exaggerated to have it. Mihael From bugzilla-daemon at mcs.anl.gov Mon Jun 9 09:17:00 2008 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Mon, 9 Jun 2008 09:17:00 -0500 (CDT) Subject: [Swift-devel] [Bug 145] Failed to link input file In-Reply-To: Message-ID: <20080609141700.82A0B16460@foxtrot.mcs.anl.gov> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=145 lixi at uchicago.edu changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |hategan at mcs.anl.gov ------- Comment #4 from lixi at uchicago.edu 2008-06-09 09:17 ------- (In reply to comment #1) > > In fact, it did happen before on other sites within OSG except "UCSDT2". > Do you mean that the workflow works on every site except UCSDT2? The same thing happened to UCSDT2-B yesterday. 
-- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From hategan at mcs.anl.gov Mon Jun 9 10:41:10 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 09 Jun 2008 10:41:10 -0500 Subject: [Swift-devel] Re: remote file system In-Reply-To: <20080609103401.BBC55501@m4500-03.uchicago.edu> References: <20080609103401.BBC55501@m4500-03.uchicago.edu> Message-ID: <1213026070.5388.3.camel@localhost> I'm routing this through the mailing list. On Mon, 2008-06-09 at 10:34 -0500, lixi at uchicago.edu wrote: > Hi Mihael, > > I found that sometimes the failure of Swift workflow is > caused by the remote file system error. In OSG, except NFS, > GPFS and pvfs, what are other file systems adopted? Those are probably the most widespread ones. But I have no idea what individual sites have out there. You could probably run a sweep of cat /etc/fstab and see what comes up. > How can > I check if they are right for Swift? I think it's somewhat the other way. We need to make Swift work on them. > Now I just run df -h -T > to get the file system status. > > Thanks, > > Xi From benc at hawaga.org.uk Mon Jun 9 10:46:18 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 9 Jun 2008 15:46:18 +0000 (GMT) Subject: [Swift-devel] Re: remote file system In-Reply-To: <1213026070.5388.3.camel@localhost> References: <20080609103401.BBC55501@m4500-03.uchicago.edu> <1213026070.5388.3.camel@localhost> Message-ID: > > I found that sometimes the failure of Swift workflow is > > caused by the remote file system error. In OSG, except NFS, > > GPFS and pvfs, what are other file systems adopted? > > Those are probably the most widespread ones. But I have no idea what > individual sites have out there. You could probably run a sweep of > cat /etc/fstab and see what comes up. 
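If such a sweep of /etc/fstab were collected from the sites, the filesystem types could be tallied with something like the sketch below. This is an assumption-laden illustration: the function name filesystem_types is made up, the sample entries are invented rather than taken from any real site, and it only relies on the usual fstab layout where the third whitespace-separated field is the filesystem type.

```python
from collections import Counter


def filesystem_types(fstab_text):
    """Count filesystem types in text laid out like fstab:
    device, mount point, type, options, dump, pass (whitespace separated)."""
    counts = Counter()
    for line in fstab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) >= 3:
            counts[fields[2]] += 1  # third field is the filesystem type
    return counts


if __name__ == "__main__":
    # Invented sample standing in for the output of `cat /etc/fstab` on a node.
    sample = """\
# device            mount    type  options   dump pass
server:/export/home /home    nfs   defaults  0    0
/dev/gpfs0          /gpfs    gpfs  rw        0    0
"""
    for fs_type, n in filesystem_types(sample).most_common():
        print(fs_type, n)
```

The same function could be fed the text returned by a remote job that runs cat /etc/fstab, one call per site.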
I think I've never seen anything other than those three in live deployment on a grid machine. AFS is used in some places but there are many challenges to using that in a grid environment and I've not seen it used for a cluster-shared fs. I've seen it used more to provide an enterprise/global homedirectory service. -- From benc at hawaga.org.uk Mon Jun 9 11:04:19 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 9 Jun 2008 16:04:19 +0000 (GMT) Subject: [Swift-devel] Re: remote file system In-Reply-To: <1213026070.5388.3.camel@localhost> References: <20080609103401.BBC55501@m4500-03.uchicago.edu> <1213026070.5388.3.camel@localhost> Message-ID: On Mon, 9 Jun 2008, Mihael Hategan wrote: > > How can > > I check if they are right for Swift? > > I think it's somewhat the other way. We need to make Swift work on them. One way to gather information there is to clearly identify each site problem. These two UCSD sites that Xi mentions in bug 146 seem to have similar problems (which, if they use a common shared file system would not surprise me). But I have not seen reports for other sites having consistent problems. Those could be reported on this list or in the bugzilla in the specific site problems component. -- From bugzilla-daemon at mcs.anl.gov Mon Jun 9 11:08:16 2008 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Mon, 9 Jun 2008 11:08:16 -0500 (CDT) Subject: [Swift-devel] [Bug 145] Failed to link input file In-Reply-To: Message-ID: <20080609160816.3B79016460@foxtrot.mcs.anl.gov> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=145 ------- Comment #5 from benc at hawaga.org.uk 2008-06-09 11:08 ------- I can recreate this with UCSDT2-B today, using condor. fork works ok. Site definitions added to tests/sites/broken/osg-ucsdt2-b2-* -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From benc at hawaga.org.uk Mon Jun 9 11:45:56 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 9 Jun 2008 16:45:56 +0000 (GMT) Subject: [Swift-devel] Re: remote file system In-Reply-To: <1213026070.5388.3.camel@localhost> References: <20080609103401.BBC55501@m4500-03.uchicago.edu> <1213026070.5388.3.camel@localhost> Message-ID: actually specifically the sites that Xi has been using look fairly flakey - they often won't accept gridftp connections (as in, the site that just worked with fork for me now doesn't work at all). From the perspective of Xi's work, I think that means that this site should be regarded as "broken" even when it does happen to accept gridftp transfers and job submissions. Perhaps that unable to link input file error, when used on a known-to-be-good workflow, should always be treated for now (in Xi's work) as an indication that the site is not broken. Separately, from an OSG operations perspective, someone (I guess me) could contact the site admin and see what is going on. -- From lixi at uchicago.edu Mon Jun 9 11:50:25 2008 From: lixi at uchicago.edu (lixi at uchicago.edu) Date: Mon, 9 Jun 2008 11:50:25 -0500 (CDT) Subject: [Swift-devel] Re: remote file system Message-ID: <20080609115025.BBC64803@m4500-03.uchicago.edu> Thanks for your information. It seems that I should filter out these two weird sites (UCSDT2 and UCSDT2-B) for experiments. Xi ---- Original message ---- >Date: Mon, 9 Jun 2008 16:45:56 +0000 (GMT) >From: Ben Clifford >Subject: Re: [Swift-devel] Re: remote file system >To: Mihael Hategan >Cc: lixi at uchicago.edu, "swift-devel at ci.uchicago.edu" > >actually specifically the sites that Xi has been using look fairly flakey >- they often won't accept gridftp connections (as in, the site that just >worked with fork for me now doesn't work at all). > >From the perspective of Xi's work, I think that means that this site >should be regarded as "broken" even when it does happen to accept gridftp >transfers and job submissions.
> >Perhaps that unable to link input file error, when used on a >known-to-be-good workflow, should always be treated for now (in Xi's work) >as an indication that the site is not broken. > >Separately, from an OSG operations perspective, someone (I guess me) could >contact the site admin and see what is going on. > >-- From hategan at mcs.anl.gov Mon Jun 9 11:56:18 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 09 Jun 2008 11:56:18 -0500 Subject: [Swift-devel] Re: remote file system In-Reply-To: References: <20080609103401.BBC55501@m4500-03.uchicago.edu> <1213026070.5388.3.camel@localhost> Message-ID: <1213030578.6622.6.camel@localhost> On Mon, 2008-06-09 at 16:45 +0000, Ben Clifford wrote: > actually specifically the sites that Xi has been using look fairly flakey > - they often won't accept gridftp connections (as in, the site that just > worked with fork for me now doesn't work at all). > > From the perspective of Xi's work, I think that means that this site > should be regarded as "broken" even when it does happen to accept gridftp > transfers and job submissions. > > Perhaps that unable to link input file error, when used on a > known-to-be-good workflow, should always be treated for now (in Xi's work) > as an indication that the site is not broken. I'd say this is exactly why the scheduler keeps tabs on sites. I also think, given that I've seen no indication in the past 5 years that the minimum reliability of sites has increased significantly, that we should make sure Swift works despite the occasional broken site. > > Separately, from an OSG operations perspective, someone (I guess me) could > contact the site admin and see what is going on. 
> From hategan at mcs.anl.gov Mon Jun 9 11:58:13 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 09 Jun 2008 11:58:13 -0500 Subject: [Swift-devel] Re: remote file system In-Reply-To: <20080609115025.BBC64803@m4500-03.uchicago.edu> References: <20080609115025.BBC64803@m4500-03.uchicago.edu> Message-ID: <1213030693.6622.9.camel@localhost> On Mon, 2008-06-09 at 11:50 -0500, lixi at uchicago.edu wrote: > Thanks for your information. > > It seems that I should filter out these two weird sites > (UCSDT2 and UCSDT2-B) for experiments. Depends on what you want to achieve. I'd say swift should work in any at-least-one-good-site scenarios. > > Xi > > ---- Original message ---- > >Date: Mon, 9 Jun 2008 16:45:56 +0000 (GMT) > >From: Ben Clifford > >Subject: Re: [Swift-devel] Re: remote file system > >To: Mihael Hategan > >Cc: lixi at uchicago.edu, "swift-devel at ci.uchicago.edu" devel at ci.uchicago.edu> > > > >actually specifically the sites that Xi has been using look > fairly flakey > >- they often won't accept gridftp connections (as in, the > site that just > >worked with fork for me now doesn't work at all). > > > >From the perspective of Xi's work, I think that means that > this site > >should be regarded as "broken" even when it does happen to > accept gridftp > >transfers and job submissions. > > > >Perhaps that unable to link input file error, when used on > a > >known-to-be-good workflow, should always be treated for now > (in Xi's work) > >as an indication that the site is not broken. > > > >Separately, from an OSG operations perspective, someone (I > guess me) could > >contact the site admin and see what is going on. 
> >
> >--

From benc at hawaga.org.uk Mon Jun 9 12:44:18 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 9 Jun 2008 17:44:18 +0000 (GMT)
Subject: [Swift-devel] Re: remote file system
In-Reply-To: <1213030693.6622.9.camel@localhost>
References: <20080609115025.BBC64803@m4500-03.uchicago.edu>
 <1213030693.6622.9.camel@localhost>
Message-ID:

On Mon, 9 Jun 2008, Mihael Hategan wrote:

> Depends on what you want to achieve.

Xi's work at the moment consists of giving extremely strong hints to Swift about how sites are likely to perform. So in the context of her work, any site that exhibits this behaviour is probably best treated as entirely broken.

> I'd say swift should work in any at-least-one-good-site scenarios.

From the swift development side of things (distinct from Xi's work), yes I agree with this. Hopefully the replication functionality added recently will massively improve that, though there are still other problems (for example, fast fail sites bug 101) which I think are not addressed by that.

--

From benc at hawaga.org.uk Mon Jun 9 12:48:45 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 9 Jun 2008 17:48:45 +0000 (GMT)
Subject: [Swift-devel] Re: remote file system
In-Reply-To:
References: <20080609103401.BBC55501@m4500-03.uchicago.edu>
 <1213026070.5388.3.camel@localhost>
Message-ID:

> Perhaps that unable to link input file error, when used on a
> known-to-be-good workflow, should always be treated for now (in Xi's work)
> as an indication that the site is not broken.

Sorry, I meant, should be treated that the site *IS* broken, not *IS NOT* broken.

--

From hategan at mcs.anl.gov Mon Jun 9 13:29:38 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 09 Jun 2008 13:29:38 -0500
Subject: [Swift-devel] [VOTE] 0.6 release plan
In-Reply-To:
References:
Message-ID: <1213036178.8456.1.camel@localhost>

I guess the process is simpler when there are votes.
+1 On Sun, 2008-06-08 at 16:52 +0000, Ben Clifford wrote: > I will be the release manager for Swift 0.6. > > I'm going to make a release candidate for 0.6 early next week (sometime > within the next three days), and hope to release that as 0.6 proper next > weekend (maybe 6 days from now). > > I'm hoping to have a single release candidate, with minor bugs being noted > and fixed in 0.7 rather than causing a new release candidate. > > I'm planning on announcing coasters and replication as experimental > features which we encourage interested parties to experiment with and > report their experiences. > > I'm also planning on making the release under the dev.globus public > release guidelines. > > There will be no repository freeze for this release. > > > This release plan is subject to 'Lazy Majority' approval, which means that > this plan is automatically approved until/unless someone votes -1. So > pretty much you do not need to vote +1 until/unless someone votes -1. If > you wish to vote -1, it appears that you should vote -1 specifically for > the issues that you disagree with rather than the plan as a whole. > From benc at hawaga.org.uk Tue Jun 10 04:57:58 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 10 Jun 2008 09:57:58 +0000 (GMT) Subject: [Swift-devel] Please test swift 0.6-rc1 Message-ID: swift 0.6 rc1 is at: http://www.ci.uchicago.edu/~benc/vdsk-0.6-rc1.tar.gz $ md5sum vdsk-0.6-rc1.tar.gz 757cebfbbc959a9a07ac3ceb2b904707 vdsk-0.6-rc1.tar.gz Please test and vote in the separate release vote thread as appropriate. -- From benc at hawaga.org.uk Tue Jun 10 05:07:40 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 10 Jun 2008 10:07:40 +0000 (GMT) Subject: [Swift-devel] [VOTE] Release Swift 0.6 rc1 as Swift 0.6 Message-ID: This thread is for voting whether Swift 0.6 rc1 should be released as the Swift 0.6 release. The guidelines state that this vote should be held "once the release manager determines that testing is complete". 
Given the relatively informal nature of our previous release testing, I'm calling this vote immediately after putting the release candidate up. This vote is a 'majority approval' vote. Three +1 votes, and more +1 votes than -1 votes are required for this release to happen. Because this is a public release, the vote is not lazy - actual +1 votes are required, rather than silence. By voting +1 on a release vote, you agree to provide ongoing support for this release while it is current (which I guess means until 0.7 comes out). By voting +1 on a vote regarding product code, you assert that you have tested the action on your own equipment. The results of this vote will appear in ~120 hours from now. -- From benc at hawaga.org.uk Tue Jun 10 06:52:12 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 10 Jun 2008 11:52:12 +0000 (GMT) Subject: [Swift-devel] Re: [VOTE] Release Swift 0.6 rc1 as Swift 0.6 In-Reply-To: References: Message-ID: -1 - i forgot to build with coasters. rc2 soon. -- From hategan at mcs.anl.gov Tue Jun 10 09:52:51 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 10 Jun 2008 09:52:51 -0500 Subject: [Swift-devel] Release Swift 0.6 rc1 as Swift 0.6 In-Reply-To: References: Message-ID: <1213109571.30333.1.camel@localhost> On Tue, 2008-06-10 at 10:07 +0000, Ben Clifford wrote: > Three +1 votes, and more +1 votes than -1 votes are required for this > release to happen. > Hmm. So projects like "cog-workflow" who have only 2 committers can never make a release. From benc at hawaga.org.uk Tue Jun 10 09:54:53 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 10 Jun 2008 14:54:53 +0000 (GMT) Subject: [Swift-devel] [VOTE] Release Swift 0.6 rc3 as Swift 0.6 Message-ID: This thread is for voting whether Swift 0.6 rc3 should be released as the Swift 0.6 release. The guidelines state that this vote should be held "once the release manager determines that testing is complete". 
Given the relatively informal nature of our previous release testing, I'm calling this vote immediately after putting the release candidate up. This vote is a 'majority approval' vote. Three +1 votes, and more +1 votes than -1 votes are required for this release to happen. Because this is a public release, the vote is not lazy - actual +1 votes are required, rather than silence. By voting +1 on a release vote, you agree to provide ongoing support for this release while it is current (which I guess means until 0.7 comes out). By voting +1 on a vote regarding product code, you assert that you have tested the action on your own equipment. The results of this vote will appear in ~120 hours from now. -- From hategan at mcs.anl.gov Tue Jun 10 10:05:38 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 10 Jun 2008 10:05:38 -0500 Subject: [Swift-devel] Release Swift 0.6 rc3 as Swift 0.6 In-Reply-To: References: Message-ID: <1213110338.30467.9.camel@localhost> What happened to rc2? On Tue, 2008-06-10 at 14:54 +0000, Ben Clifford wrote: > This thread is for voting whether Swift 0.6 rc3 should be released as the > Swift 0.6 release. > > The guidelines state that this vote should be held "once the release > manager determines that testing is complete". Given the relatively > informal nature of our previous release testing, I'm calling this vote > immediately after putting the release candidate up. > > This vote is a 'majority approval' vote. > > Three +1 votes, and more +1 votes than -1 votes are required for this > release to happen. > > Because this is a public release, the vote is not lazy - actual +1 votes > are required, rather than silence. > > By voting +1 on a release vote, you agree to provide ongoing support for > this release while it is current (which I guess means until 0.7 comes > out). > > By voting +1 on a vote regarding product code, you assert that you have > tested the action on your own equipment. 
> > The results of this vote will appear in ~120 hours from now. > From benc at hawaga.org.uk Tue Jun 10 09:54:18 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 10 Jun 2008 14:54:18 +0000 (GMT) Subject: [Swift-devel] Please test swift 0.6-rc3 Message-ID: swift 0.6 rc3 is at: http://www.ci.uchicago.edu/~benc/vdsk-0.6-rc3.tar.gz $ md5sum vdsk-0.6-rc3.tar.gz 3ff8e5c5221f135dbfcd04689d489439 vdsk-0.6-rc3.tar.gz Please test and vote in the separate release vote thread as appropriate. Differences from previous RCs: rc2: which never left my possession: build with coasters rc3: project profile specification had regressed in coasters - fixed. -- From lixi at uchicago.edu Tue Jun 10 11:20:39 2008 From: lixi at uchicago.edu (lixi at uchicago.edu) Date: Tue, 10 Jun 2008 11:20:39 -0500 (CDT) Subject: [Swift-devel] execution.retries Message-ID: <20080610112039.BBD75729@m4500-03.uchicago.edu> Hi, Just now I ran a workflow, site "OSG_LIGO_MIT" had a GridFTP error for a little while exactly after it was selected for Swift job. So the message "Could not initialize shared directory on OSG_LIGO_MIT" was issued and the whole workflow exited. The log file is on CI: /home/lixi/newswift/latest/score/3500/workflowtest- 20080610-1045-58kc7p6f.log After investigating the log file, I found that this failed job produced a execute event with id of 0-1-307. When it was staging files, a temp GridFTP error on OSG_LIGO_MIT just happened, so 0-1-307 didn't result in any execute2 event. Finally, the whole workflow failed. My understanding is that in Swift, the execution.retries just mean the retrying times for execute2 events, is that right? Then currently how to avoid or handle this kind of error? Is there is a way to do with it in Swift now? Thanks, Xi From mikekubal at yahoo.com Tue Jun 10 11:35:54 2008 From: mikekubal at yahoo.com (Mike Kubal) Date: Tue, 10 Jun 2008 09:35:54 -0700 (PDT) Subject: [Swift-devel] can't find local file? 
Message-ID: <855505.59558.qm@web52306.mail.re2.yahoo.com>

I get the following error message:

Caused by:
        The following output files were not created by the application: status_of_sorting_dock_results.txt

Though the file is created successfully and located within the current working directory, swift seems not able to find it.

Here's the swift code that maps the file name and calls the application that runs on the local host:

file dock_sorting_status <"status_of_sorting_dock_results.txt">;

(dock_sorting_status) = sort_dock_results(dir,keep);

Any thoughts?

Thanks,

Mike

From benc at hawaga.org.uk Tue Jun 10 11:38:14 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 10 Jun 2008 16:38:14 +0000 (GMT)
Subject: [Swift-devel] execution.retries
In-Reply-To: <20080610112039.BBD75729@m4500-03.uchicago.edu>
References: <20080610112039.BBD75729@m4500-03.uchicago.edu>
Message-ID:

In that log file, it looks like there are a lot of attempts to initialise the shared directory (each of which fails, by the looks of it).

Look how many lines there are in the log like this:

2008-06-10 10:48:03,137-0500 INFO vdl:initshareddir START host=OSG_LIGO_MIT

followed closely by:

2008-06-10 10:48:03,196-0500 DEBUG TaskImpl Task(type=FILE_OPERATION, identity=urn:0-1-701-1-1213112750531) setting status to Failed org.globus.cog.abstraction.impl.file.IrrecoverableResourceException: Error communicating with the GridFTP server

It looks like you are suffering from the "fast fail" problem here though - see bug 101.

The site fails rapidly. The scheduler will never go below 2 jobs per site at once, no matter how much it fails.

So, Swift will submit to that site many many times, all of which will fail; and so that site will absorb all the retries for a site.

Pretty much to stop this, the scheduler needs to be able to go below 2-jobs-per-site.
The lower limit could go to 0; however, once a site has gone to 0, its possible that no more jobs will be run and the score will never go up; thus a transient failure might cause a site to be ignored forever. And if every site does this, then eventually you might end up with no sites at all to run. If you would like to experiment, I'll show you where in the source code to change that lower limit. There are a few other ideas being tossed around - I have some, and I think Mihael has some too - to deal with this. -- From hategan at mcs.anl.gov Tue Jun 10 11:39:22 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 10 Jun 2008 11:39:22 -0500 Subject: [Swift-devel] can't find local file? In-Reply-To: <855505.59558.qm@web52306.mail.re2.yahoo.com> References: <855505.59558.qm@web52306.mail.re2.yahoo.com> Message-ID: <1213115962.2099.0.camel@localhost> On Tue, 2008-06-10 at 09:35 -0700, Mike Kubal wrote: > I get the following error message: > > Caused by: > The following output files were not created by the application: status_of_sorting_dock_results.txt > > Though the the file is created successfully and located within the current working directory, though swift seems not able to find it. > > Here's the swift code that maps the file name and calls the application that runs on the local host: > > file dock_sorting_status "status_of_sorting_dock_results.txt">; > > (dock_sorting_status) = sort_dock_results(dir,keep); What does "dir" there do? > > Any thoughts? > > Thanks, > > Mike > > > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From benc at hawaga.org.uk Tue Jun 10 11:40:06 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 10 Jun 2008 16:40:06 +0000 (GMT) Subject: [Swift-devel] can't find local file? 
In-Reply-To: <855505.59558.qm@web52306.mail.re2.yahoo.com> References: <855505.59558.qm@web52306.mail.re2.yahoo.com> Message-ID: can you give a location for a log file of a run exhibiting this and also the complete workflow code? In this: > Though the the file is created successfully and located within the > current working directory, though swift seems not able to find it. what do you mean by 'current working directory' ? the cwd that you invoked swift in, or the cwd that swift ends up running your app in (which is dynamically created elsewhere). -- From lixi at uchicago.edu Tue Jun 10 11:58:52 2008 From: lixi at uchicago.edu (lixi at uchicago.edu) Date: Tue, 10 Jun 2008 11:58:52 -0500 (CDT) Subject: [Swift-devel] execution.retries Message-ID: <20080610115852.BBD80834@m4500-03.uchicago.edu> >Look how many lines there are in the log like this: > >2008-06-10 10:48:03,137-0500 INFO vdl:initshareddir START >host=OSG_LIGO_MIT > >followed closely by: > >2008-06-10 10:48:03,196-0500 DEBUG TaskImpl Task (type=FILE_OPERATION, >identity=u >rn:0-1-701-1-1213112750531) setting status to Failed >org.globus.cog.abstraction. >impl.file.IrrecoverableResourceException: Error communicating with the >GridFTP server Yes, I've seen that. My question is: do these lines mean different execution retries? If yes, can we add another factor here other than the final failureFactor to change the site's score. Then this might prevent more retries submitting to this site again for this single job. >The site fails rapidly. The scheduler will never go below 2 jobs per site >at once, no matter how much it fails. >So, Swift will submit to that site many many times, all of which will >faill; and so that site will absorb all the retries for a site. However during the execution, the site's score could be decreased into negative one which would erase at lease 2 jobs limit? Then to some extent, that site would have no chance of absorbing more retries. 
>Pretty much to stop this, the scheduler needs to be able to go below >2-jobs-per-site. The lower limit could go to 0; however, once a site has >gone to 0, its possible that no more jobs will be run and the score will >never go up; thus a transient failure might cause a site to be ignored >forever. And if every site does this, then eventually you might end up >with no sites at all to run. I agree that the lower limit should not be 0. >If you would like to experiment, I'll show you where in the source code to >change that lower limit. In fact, I know the place to modify the limit. What I don't know is how to use the whole framework to complie and generate new version. Xi From hategan at mcs.anl.gov Tue Jun 10 12:05:00 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 10 Jun 2008 12:05:00 -0500 Subject: [Swift-devel] execution.retries In-Reply-To: <20080610115852.BBD80834@m4500-03.uchicago.edu> References: <20080610115852.BBD80834@m4500-03.uchicago.edu> Message-ID: <1213117500.3370.3.camel@localhost> On Tue, 2008-06-10 at 11:58 -0500, lixi at uchicago.edu wrote: > Yes, I've seen that. My question is: do these lines mean > different execution retries? Yes > If yes, can we add another > factor here other than the final failureFactor to change the > site's score. Then this might prevent more retries > submitting to this site again for this single job. It wouldn't. No matter how small the score, there is a minimum of 2. Which seems wrong. There should be a delay. > > >The site fails rapidly. The scheduler will never go below 2 > jobs per site > >at once, no matter how much it fails. > > >So, Swift will submit to that site many many times, all of > which will > >faill; and so that site will absorb all the retries for a > site. > > However during the execution, the site's score could be > decreased into negative one which would erase at lease 2 > jobs limit? That would be the basic idea. 
If a score is very low, the site should only be considered after a certain delay. > > Then to some extent, that site would have no chance of > absorbing more retries. > From lixi at uchicago.edu Tue Jun 10 12:09:41 2008 From: lixi at uchicago.edu (lixi at uchicago.edu) Date: Tue, 10 Jun 2008 12:09:41 -0500 (CDT) Subject: [Swift-devel] execution.retries Message-ID: <20080610120941.BBD82602@m4500-03.uchicago.edu> >> Yes, I've seen that. My question is: do these lines mean >> different execution retries? > >Yes > Then why were these different retries submitted the same site? Coincidence or certainty? From hategan at mcs.anl.gov Tue Jun 10 12:16:18 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 10 Jun 2008 12:16:18 -0500 Subject: [Swift-devel] execution.retries In-Reply-To: <20080610120941.BBD82602@m4500-03.uchicago.edu> References: <20080610120941.BBD82602@m4500-03.uchicago.edu> Message-ID: <1213118178.3941.0.camel@localhost> On Tue, 2008-06-10 at 12:09 -0500, lixi at uchicago.edu wrote: > >> Yes, I've seen that. My question is: do these lines mean > >> different execution retries? > > > >Yes > > > > Then why were these different retries submitted the same > site? Coincidence or certainty? Ben just explained that. It's the only "free" site. From benc at hawaga.org.uk Tue Jun 10 12:20:08 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 10 Jun 2008 17:20:08 +0000 (GMT) Subject: [Swift-devel] execution.retries In-Reply-To: <20080610120941.BBD82602@m4500-03.uchicago.edu> References: <20080610120941.BBD82602@m4500-03.uchicago.edu> Message-ID: On Tue, 10 Jun 2008, lixi at uchicago.edu wrote: > >> Yes, I've seen that. My question is: do these lines mean > >> different execution retries? > > > >Yes > > Then why were these different retries submitted the same > site? Coincidence or certainty? Say you have two sites. Site A always fails fast. Site B accepts jobs normally. You have three jobs to submit, job J, K, L, which take a long time to run. 
at t=0
We submit jobs randomly to available sites:
Job J is submitted to site A.
Job K is submitted to site B.
Job L is submitted to site B.

t=1
Site B is busy executing job K, and L
Job J fails on site A. We look for somewhere to retry it. Site B has 0
slots free. Site A has 2 slots free. We send the job to site A.

t=2
same happens.

t=3
same happens.

Now we have retried job J three times, and so the workflow ultimately
fails.

t=1000
job K and job L complete successfully.

--

From benc at hawaga.org.uk Tue Jun 10 12:09:54 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 10 Jun 2008 17:09:54 +0000 (GMT)
Subject: [Swift-devel] execution.retries
In-Reply-To: <20080610115852.BBD80834@m4500-03.uchicago.edu>
References: <20080610115852.BBD80834@m4500-03.uchicago.edu>
Message-ID:

On Tue, 10 Jun 2008, lixi at uchicago.edu wrote:

> However during the execution, the site's score could be
> decreased into negative one which would erase at lease 2
> jobs limit?

There are two ways of expressing the site score. One can range from
-infinity to +infinity (approximately). Call this 'score'.

This is then scaled using a complex formula to a value ranging between 2
and the maximum allowed onto that site - this number is then used to
determine how many jobs can run at once on a site.

As the first score goes to -infinity, the second score goes down to 2, but
no lower.
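
A minimal sketch of that scaling in Python, using the constants listed in this message. It assumes that log means the natural logarithm (which makes the scaled value run from 1/T to T), and the function names and the job_throttle argument are illustrative stand-ins, not the actual code in WeightedHost.java:

```python
import math

# Constants as quoted in this message.
T = 100.0
B = 2.0 * math.log(T) / math.pi   # assumption: natural logarithm
C = 0.2


def tscore(score):
    """Squash a raw score in (-inf, +inf) into the range (1/T, T)."""
    return math.exp(B * math.atan(C * score))


def max_jobs(score, job_throttle):
    """Concurrent-job limit for a site; never drops below 2."""
    return 2 + job_throttle * tscore(score)


if __name__ == "__main__":
    # job_throttle=2.0 is an arbitrary illustrative value.
    for s in (-1000.0, -10.0, 0.0, 10.0, 1000.0):
        print(s, round(max_jobs(s, job_throttle=2.0), 3))
```

With these constants a score of 0 gives a scaled value of 1, and even a very negative score leaves the limit just above 2, which is the floor discussed in this thread.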
Look in
cog/modules/karajan//src/org/globus/cog/karajan/scheduler/WeightedHost.java

let T = 100
let B = 2.0 * log(T) / pi
let C = 0.2
let tscore = e^(B * atan(C * score))
let number-of-jobs = 2 + (jobThrottle * tscore)

I think that by editing the definition of isOverloaded() in that file, you
can vary the behaviour (Mihael might comment if that is actually the
method used to determine whether we can submit to a site or not)

--

From lixi at uchicago.edu Tue Jun 10 12:39:00 2008
From: lixi at uchicago.edu (lixi at uchicago.edu)
Date: Tue, 10 Jun 2008 12:39:00 -0500 (CDT)
Subject: [Swift-devel] execution.retries
Message-ID: <20080610123900.BBD86348@m4500-03.uchicago.edu>

>Say you have two sites.
>
>Site A always fails fast.
>Site B accepts jobs normally.
>
>You have three jobs to submit, job J, K, L, which take a long time to
>run.
>
>at t=0
>We submit jobs randomly to available sites:
>Job J is submitted to site A.
>Job K is submitted to site B.
>Job L is submitted to site B.
>
>t=1
>Site B is busy executing job K, and L
>Job J fails on site A. We look for somewhere to retry it. Site B has 0
>slots free. Site A has 2 slots free. We send the job to site A.
>
>t=2
>same happens.
>
>t=3
>same happens.
>
>Now we have retried job J three times, and so the workflow ultimately
>fails.
>
>t=1000
>job K and job L complete successfully.

Thanks for the clear explanation. :)

Xi

From hategan at mcs.anl.gov Tue Jun 10 12:39:56 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 10 Jun 2008 12:39:56 -0500
Subject: [Swift-devel] execution.retries
In-Reply-To:
References: <20080610115852.BBD80834@m4500-03.uchicago.edu>
Message-ID: <1213119596.4476.11.camel@localhost>

On Tue, 2008-06-10 at 17:09 +0000, Ben Clifford wrote:
> On Tue, 10 Jun 2008, lixi at uchicago.edu wrote:
>
> > However during the execution, the site's score could be
> > decreased into negative one which would erase at lease 2
> > jobs limit?
>
> There are two ways of expressing the site score.
One can range from > -infinity to +infinity (approximately). Call this 'score'. > > This is then scaled using a complex formula to a value ranging between 2 > and the maximum allowed onto that site - this number is then used to > determine how many jobs can run at once on a site. > > As the first score goes to -infinity, the second score goes down to 2, but > no lower. > > Look in > cog/modules/karajan//src/org/globus/cog/karajan/scheduler/WeightedHost.java > > let T = 100 > let B = 2.0 * log(T) / pi > let C = 0.2 > let tscore = e^(B * atan(C * score)) > let number-of-jobs = 2 + (jobThrottle * tscore) > > I think that by editing the definition of isOverloaded() in that file, you > can vary the behaviour (Mihael might comment if that is actually the > method used to determine whether we can submit to a site or not) That's the one. However, I think that tscores <1 should be translated into timed rate limitations. So if tscore = 10 means I can submit at most jobThrottle*10 jobs, tscore = 0.1 should mean that I can submit jobs no faster than some_number/tscore seconds. Like an exponential back-off. We'd probably figure some_number by looking at how the score would evolve should all attempts fail, and set a minimum waiting time that we want in the worst case scenario (which will probably be the -10 score limit anyway). > From benc at hawaga.org.uk Tue Jun 10 12:42:42 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 10 Jun 2008 17:42:42 +0000 (GMT) Subject: [Swift-devel] execution.retries In-Reply-To: <1213119596.4476.11.camel@localhost> References: <20080610115852.BBD80834@m4500-03.uchicago.edu> <1213119596.4476.11.camel@localhost> Message-ID: On Tue, 10 Jun 2008, Mihael Hategan wrote: > That's the one. However, I think that tscores <1 should be translated > into timed rate limitations. So if tscore = 10 means I can submit at > most jobThrottle*10 jobs, tscore = 0.1 should mean that I can submit > jobs no faster than some_number/tscore seconds. 
Like an exponential > back-off. heh, I was just writing almost exactly the same email to you. tscore=1 should 1 job slot available tscore < 1 should mean one job slot available some of the time. I'm not sure what the formula for calculating the <1 availability should be, though. It needs to cope with rapidly slowing down in the presence of fast fail, slowing down to the scale of other running jobs (so eg on the scale of minute to hours) without overly slowing down. Some experimentation there will probably help. Related to this, I've been playing with provider-wonky a bit to make it able to exhibit other failure modes such as this fast fail behaviour; but nothing to commit yet. -- From hategan at mcs.anl.gov Tue Jun 10 12:50:49 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 10 Jun 2008 12:50:49 -0500 Subject: [Swift-devel] execution.retries In-Reply-To: References: <20080610115852.BBD80834@m4500-03.uchicago.edu> <1213119596.4476.11.camel@localhost> Message-ID: <1213120249.5211.2.camel@localhost> On Tue, 2008-06-10 at 17:42 +0000, Ben Clifford wrote: > On Tue, 10 Jun 2008, Mihael Hategan wrote: > > > That's the one. However, I think that tscores <1 should be translated > > into timed rate limitations. So if tscore = 10 means I can submit at > > most jobThrottle*10 jobs, tscore = 0.1 should mean that I can submit > > jobs no faster than some_number/tscore seconds. Like an exponential > > back-off. > > heh, I was just writing almost exactly the same email to you. > > tscore=1 should 1 job slot available > > tscore < 1 should mean one job slot available some of the time. > > I'm not sure what the formula for calculating the <1 availability should > be, though. It needs to cope with rapidly slowing down in the presence of > fast fail, slowing down to the scale of other running jobs (so eg on the > scale of minute to hours) without overly slowing down. It's taking a guess at when the site will recover and when other sites will be available. 
I don't think there's a way to know. Which is why it probably should be as exponential as possible. > > Some experimentation there will probably help. > > Related to this, I've been playing with provider-wonky a bit to make it > able to exhibit other failure modes such as this fast fail behaviour; but > nothing to commit yet. > From benc at hawaga.org.uk Tue Jun 10 13:37:55 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 10 Jun 2008 18:37:55 +0000 (GMT) Subject: [Swift-devel] Release Swift 0.6 rc3 as Swift 0.6 In-Reply-To: <1213110338.30467.9.camel@localhost> References: <1213110338.30467.9.camel@localhost> Message-ID: On Tue, 10 Jun 2008, Mihael Hategan wrote: > What happened to rc2? it never left my custody - i labelled it, performed some tests before general release, saw they failed, and so generated a new one. -- From lixi at uchicago.edu Tue Jun 10 14:51:33 2008 From: lixi at uchicago.edu (lixi at uchicago.edu) Date: Tue, 10 Jun 2008 14:51:33 -0500 (CDT) Subject: [Swift-devel] execution.retries Message-ID: <20080610145133.BBE01683@m4500-03.uchicago.edu> >It's taking a guess at when the site will recover and when other sites >will be available. I don't think there's a way to know. Which is why it >probably should be as exponential as possible. > How about restricting the same job to be resubmitted to the same site after an exception. Because I suspect the similar thing would happen after replication is enabled (of course, just my assumption). Let's imagine, if all other sites are overloaded. Would be a copy job submitted to the same unresponsive site? This idea just came into my mind suddenly, if it doesn't make sense, forget it. 
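
A hypothetical sketch of this suggestion in Python: when picking a site for a retry, skip any site on which this job has already failed. The function, its arguments, and the slot counts are invented for illustration and are not part of the Swift scheduler:

```python
def pick_retry_site(free_slots, failed_sites):
    """Pick a site for retrying one job.

    free_slots   -- dict mapping site name to number of free job slots
    failed_sites -- set of site names where this job has already failed

    Returns the first eligible site in name order, or None if every site
    with a free slot has already failed this job (the job would then wait).
    """
    candidates = [
        site
        for site, free in sorted(free_slots.items())
        if free > 0 and site not in failed_sites
    ]
    return candidates[0] if candidates else None


if __name__ == "__main__":
    # Site A fails fast and has free slots; site B is busy with other jobs.
    print(pick_retry_site({"A": 2, "B": 0}, failed_sites={"A"}))
    # Once site B frees a slot, the retry can go there.
    print(pick_retry_site({"A": 2, "B": 1}, failed_sites={"A"}))
```

In the two-site example from earlier in the thread (site A failing fast with 2 free slots, site B full), this returns no site for job J, so the job would wait instead of using up another retry on site A.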
Xi From hategan at mcs.anl.gov Tue Jun 10 16:58:22 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 10 Jun 2008 16:58:22 -0500 Subject: [Swift-devel] Please test swift 0.6-rc3 In-Reply-To: References: Message-ID: <1213135102.22978.1.camel@localhost> Something seems odd in the coaster bootstrap list, but it doesn't seem to be because of a build problem, but rather it having been that way for a while. So I'm curious, did anybody ever successfully run a coaster gt4:gt4 job? On Tue, 2008-06-10 at 14:54 +0000, Ben Clifford wrote: > swift 0.6 rc3 is at: > > http://www.ci.uchicago.edu/~benc/vdsk-0.6-rc3.tar.gz > > $ md5sum vdsk-0.6-rc3.tar.gz > 3ff8e5c5221f135dbfcd04689d489439 vdsk-0.6-rc3.tar.gz > > Please test and vote in the separate release vote thread as appropriate. > > Differences from previous RCs: > > rc2: which never left my possession: build with coasters > rc3: project profile specification had regressed in coasters - fixed. > From benc at hawaga.org.uk Tue Jun 10 17:01:11 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 10 Jun 2008 22:01:11 +0000 (GMT) Subject: [Swift-devel] Please test swift 0.6-rc3 In-Reply-To: <1213135102.22978.1.camel@localhost> References: <1213135102.22978.1.camel@localhost> Message-ID: > So I'm curious, did anybody ever successfully run a coaster gt4:gt4 job? I have problems with the coaster + gram4 site definition that is in tests/sites/coaster/ though the gt2 definitions work there. I thought I'd run it successfully before, but not sure. I was going to note that in my test results later on. It is perhaps not a showstopper, though, given that this is an experimental feature - I'd rather get a release out than fiddle too much. 
--

From benc at hawaga.org.uk Tue Jun 10 18:29:26 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 10 Jun 2008 23:29:26 +0000 (GMT)
Subject: [Swift-devel] [VOTE] Proposal to adjust Committers list
In-Reply-To: <48487433.40602@mcs.anl.gov>
References: <1212598787.8492.8.camel@localhost>
 <1212615395.14746.0.camel@localhost>
 <1212616673.14746.4.camel@localhost>
 <48487433.40602@mcs.anl.gov>
Message-ID:

+1

On a procedural note, I don't believe the following:

> The vote closes in 5 days per my read of the guidelines - same time,
> June 10.

to be mandated by the guidelines, although I suspect it may have been the
original intention of the authors; you appear perfectly within your
liberty to specify that time limit or any other on the vote (as both you
did and I did on my votes).

The 5 day (120 hour) requirement is that a vote result message be posted
within that time if any -1 vote has been made. There appears to me to be
no other treatment within the guidelines of a vote closing.

On Thu, 5 Jun 2008, Michael Wilde wrote:

> I'm sending this out for your vote as I indicated in the proposal.
> There's been one comment in favor and none against.
>
> The vote closes in 5 days per my read of the guidelines - same time, June 10.
>
> Please respond with one of:
> +1 The action should be performed.
> +0 Abstain - I support the action.
> -0 Abstain, I don't support the action but I can't help with an alternative
> -1 The action should not be performed and I am offering an explanation
> or alternative.
>
> Current committers, please add to your vote the word "binding".
>
> Thanks,
>
> Mike
>
> --- Proposal text below ---
>
> As we move towards meeting the infrastructure and process guidelines of
> the dev.globus incubator, I propose that we revise the committers list.
>
> I propose, based on the people actively involved in the project, that
> the committers list be changed by vote of the current committers to be:
>
> Ben Clifford
> Ian Foster
> Mihael Hategan
> Sarah Kenny
> Michael Wilde
>
> Of this list, all but Sarah are current committers. Sarah has recently
> joined the group, and her job calls for her to become an active committer.
>
> I propose that Nika Nefedova and Beth Cerny be taken off the committers
> list. Neither Nika nor Beth currently work on the project. As a web
> content developer, Beth may likely contribute in the future, but does
> not I feel need to (nor I suspect want to) engage as a contributor.
>
> I propose that Ioan Raicu, Tibi Stef-Praun, and Yong Zhao become
> contributors rather than committers. Their past contributions have been
> immensely valuable, but I feel that their current role is more
> appropriately one of contributor. Ioan, Tibi, Yong, if you feel
> differently, and plan to be an active committer, I'm happy to amend this
> proposal to reflect your wish.
>
> But I am seeking to keep the committers list, which has voting rights in
> the project, down to the people that are deeply involved.
>
> This will be our first formal vote under the dev.globus guidelines:
> http://dev.globus.org/wiki/Guidelines#Decision_Making
>
> My understanding is that we do this:
>
> 1) readers reply to this email with comments.
> 2) I as proposer call a vote with another email
> 3) you respond to the vote email with:
>
> The way you do this is - I think - to reply to the voting email with one
> of these strings and in some cases additional text as indicated:
>
> +1 The action should be performed.
> +0 Abstain - I support the action.
> -0 Abstain, I don't support the action but I can't help with an alternative
> -1 The action should not be performed and I am offering an explanation
> or alternative.
>
> I've paraphrased these from the original to reflect what I believe is
> appropriate for a committers-list change rather than a technical change.
>
> My read of the guidelines is that we have 5 days to conduct the vote.
> I'll send the [VOTE] message out in a few hours, to allow initial
> comments on this proposal.
>
> - Mike
>
> On 6/4/08 4:57 PM, Mihael Hategan wrote:
> > All committers have been added to the swift-commit mailing list.
> >
> > If some are wondering what the list admin password is, well, you either
> > know it already or you don't.
> >
> > On Wed, 2008-06-04 at 16:36 -0500, Mihael Hategan wrote:
> > > Ah, that bug I wasn't CC-ed on. Well, I know about it now.
> > >
> > > On Wed, 2008-06-04 at 21:20 +0000, Ben Clifford wrote:
> > > > On Wed, 4 Jun 2008, Mihael Hategan wrote:
> > > >
> > > > > So we need, I think, to agree on the initial list of committers for
> > > > > the dev.globus side of things.
> > > > The initial proposal, listed in globus bug 5300, already contains an
> > > > initial list of committers:
> > > >
> > > > Benjamin Clifford
> > > > Mihael Hategan
> > > > Tiberius Stef-Praun Beth
> > > > Yong Zhao
> > > > Veronika Nefedova
> > > > Ian Foster
> > > > Michael Wilde
> > > > Ioan Raicu
> > > > Beth Cerny Patino
> > > >
> > > > Any amendments to that should probably be made through the dev.globus
> > > > voting process.

From wilde at mcs.anl.gov Tue Jun 10 19:07:58 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 10 Jun 2008 19:07:58 -0500
Subject: [Swift-devel] [VOTE] Proposal to adjust Committers list
In-Reply-To: <48487433.40602@mcs.anl.gov>
References: <1212598787.8492.8.camel@localhost>
 <1212615395.14746.0.camel@localhost>
 <1212616673.14746.4.camel@localhost>
 <48487433.40602@mcs.anl.gov>
Message-ID: <484F175E.3000208@mcs.anl.gov>

This vote is closed.

The tally is +3, from a total of 3 votes cast:

+1 Ben
+1 Mihael
+1 Mike

Given that there were no comments against, the next step seems to be to
adjust the committers list in accordance with this proposal.

If anyone is aware of any other formal step we need to take under
dev.globus rules, please advise.

- Mike

From benc at hawaga.org.uk Tue Jun 10 19:09:51 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 11 Jun 2008 00:09:51 +0000 (GMT)
Subject: [Swift-devel] [VOTE] Proposal to adjust Committers list
In-Reply-To: <484F175E.3000208@mcs.anl.gov>
References: <1212598787.8492.8.camel@localhost>
 <1212615395.14746.0.camel@localhost>
 <1212616673.14746.4.camel@localhost>
 <48487433.40602@mcs.anl.gov>
 <484F175E.3000208@mcs.anl.gov>
Message-ID:

On Tue, 10 Jun 2008, Michael Wilde wrote:

> Given that there were no comments against, the next step seems to be to
> adjust the committers list in accordance with this proposal.

I think that is all that needs to be done.

--

From benc at hawaga.org.uk Tue Jun 10 19:26:25 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 11 Jun 2008 00:26:25 +0000 (GMT)
Subject: [Swift-devel] Re: [VOTE] Release Swift 0.6 rc3 as Swift 0.6
In-Reply-To:
References:
Message-ID:

+1

I've done the following testing:

on my laptop:
 . the tests launched by tests/run, which all pass
 . the tests launched by tests/sites/run-all
   for which all the failures were accounted for either by known site
   problems or by ongoing condor+some obscure quote/whitespace problems.
 . tests/sites/runl-all coaster/
   for which all passed except for the gram4 one, which is what mihael
   commented on earlier

on tg-uc:
 .
first.swift using the PBS local provider -- From hategan at mcs.anl.gov Wed Jun 11 09:30:06 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 11 Jun 2008 09:30:06 -0500 Subject: [Swift-devel] execution.retries In-Reply-To: <20080610145133.BBE01683@m4500-03.uchicago.edu> References: <20080610145133.BBE01683@m4500-03.uchicago.edu> Message-ID: <1213194606.1668.0.camel@localhost> On Tue, 2008-06-10 at 14:51 -0500, lixi at uchicago.edu wrote: > >It's taking a guess at when the site will recover and when > other sites > >will be available. I don't think there's a way to know. > Which is why it > >probably should be as exponential as possible. > > > > How about restricting the same job to be resubmitted to the > same site after an exception. Because I suspect the similar > thing would happen after replication is enabled (of course, > just my assumption). Let's imagine, if all other sites are > overloaded. Would be a copy job submitted to the same > unresponsive site? Replication won't happen if a job fails. > > This idea just came into my mind suddenly, if it doesn't > make sense, forget it. > > Xi From hategan at mcs.anl.gov Wed Jun 11 09:31:01 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 11 Jun 2008 09:31:01 -0500 Subject: [Swift-devel] Please test swift 0.6-rc3 In-Reply-To: References: <1213135102.22978.1.camel@localhost> Message-ID: <1213194661.1668.2.camel@localhost> On Tue, 2008-06-10 at 22:01 +0000, Ben Clifford wrote: > > So I'm curious, did anybody ever successfully run a coaster gt4:gt4 job? > > I have problems with the coaster + gram4 site definition that is in > tests/sites/coaster/ though the gt2 definitions work there. > > I thought I'd run it successfully before, but not sure. > > I was going to note that in my test results later on. It is perhaps not a > showstopper, though, given that this is an experimental feature - I'd > rather get a release out than fiddle too much. I agree there. 
I was trying to identify whether there is a problem. > From lixi at uchicago.edu Wed Jun 11 10:10:50 2008 From: lixi at uchicago.edu (lixi at uchicago.edu) Date: Wed, 11 Jun 2008 10:10:50 -0500 (CDT) Subject: [Swift-devel] execution.retries Message-ID: <20080611101050.BBE75359@m4500-03.uchicago.edu> >> How about restricting the same job to be resubmitted to the >> same site after an exception. Because I suspect the similar >> thing would happen after replication is enabled (of course, >> just my assumption). Let's imagine, if all other sites are >> overloaded. Would be a copy job submitted to the same >> unresponsive site? > >Replication won't happen if a job fails. I'm not saying that a job fails. I think that replicas of job would be submitted after a time threshold because the original job is still not finished. But this might not means that job necessarily fails. In this case, the replica would be possibly submitted to the same site as the original job. From hategan at mcs.anl.gov Wed Jun 11 09:33:53 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 11 Jun 2008 09:33:53 -0500 Subject: [Swift-devel] Re: [VOTE] Release Swift 0.6 rc3 as Swift 0.6 In-Reply-To: References: Message-ID: <1213194833.1668.5.camel@localhost> +1 I tested language-behavior only. But I don't think there is much into me replicating what both the automated tests and you are doing. Value for testing here would be for other people to try their own scripts. On Wed, 2008-06-11 at 00:26 +0000, Ben Clifford wrote: > +1 > > I've done the following testing: > > on my laptop: > . the tests launched by tests/run, which all pass > . the tests launched by tests/sites/run-all > for which all the failures were accounted for either by known site > problems or by ongoing condor+some obscure quote/whitespace problems. > . tests/sites/runl-all coaster/ > for which all passed except for the gram4 one, which is what mihael > commented on earlier > > on tg-uc: > . 
first.swift using the PBS local provider > From hategan at mcs.anl.gov Wed Jun 11 10:59:11 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 11 Jun 2008 10:59:11 -0500 Subject: [Swift-devel] execution.retries In-Reply-To: <20080611101050.BBE75359@m4500-03.uchicago.edu> References: <20080611101050.BBE75359@m4500-03.uchicago.edu> Message-ID: <1213199951.2196.0.camel@localhost> On Wed, 2008-06-11 at 10:10 -0500, lixi at uchicago.edu wrote: > >> How about restricting the same job to be resubmitted to > the > >> same site after an exception. Because I suspect the > similar > >> thing would happen after replication is enabled (of > course, > >> just my assumption). Let's imagine, if all other sites > are > >> overloaded. Would be a copy job submitted to the same > >> unresponsive site? > > > >Replication won't happen if a job fails. > > I'm not saying that a job fails. I think that replicas of > job would be submitted after a time threshold because the > original job is still not finished. But this might not means > that job necessarily fails. In this case, the replica would > be possibly submitted to the same site as the original job. Ah, yes. I see. You are right. This may be the case. From lixi at uchicago.edu Wed Jun 11 14:57:42 2008 From: lixi at uchicago.edu (lixi at uchicago.edu) Date: Wed, 11 Jun 2008 14:57:42 -0500 (CDT) Subject: [Swift-devel] execution.retries Message-ID: <20080611145742.BBF13537@m4500-03.uchicago.edu> >Look how many lines there are in the log like this: > >2008-06-10 10:48:03,137-0500 INFO vdl:initshareddir START >host=OSG_LIGO_MIT > >followed closely by: > >2008-06-10 10:48:03,196-0500 DEBUG TaskImpl Task (type=FILE_OPERATION, >identity=u >rn:0-1-701-1-1213112750531) setting status to Failed >org.globus.cog.abstraction. >impl.file.IrrecoverableResourceException: Error communicating with the >GridFTP server I'm sorry to turn to the old question and ask again. But I'm still confused about it. 
Both of you said that these lines mean different retries. However among these lines in the log file, there is no site selection action which is represented by "WeightedHostScoreScheduler Sorted". Then I wonder what would be included in one try. One try just means trying to do the same operation to the same site or selecting next site (may be another one or the same one) to do the same operation. In log file, what kind of expression implies the beginning or end of one try? Thanks, Xi From hategan at mcs.anl.gov Wed Jun 11 15:43:29 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 11 Jun 2008 15:43:29 -0500 Subject: [Swift-devel] execution.retries In-Reply-To: <20080611145742.BBF13537@m4500-03.uchicago.edu> References: <20080611145742.BBF13537@m4500-03.uchicago.edu> Message-ID: <1213217009.8232.11.camel@localhost> On Wed, 2008-06-11 at 14:57 -0500, lixi at uchicago.edu wrote: > >Look how many lines there are in the log like this: > > > >2008-06-10 10:48:03,137-0500 INFO vdl:initshareddir START > >host=OSG_LIGO_MIT > > > >followed closely by: > > > >2008-06-10 10:48:03,196-0500 DEBUG TaskImpl Task > (type=FILE_OPERATION, > >identity=u > >rn:0-1-701-1-1213112750531) setting status to Failed > >org.globus.cog.abstraction. > >impl.file.IrrecoverableResourceException: Error > communicating with the > >GridFTP server > > I'm sorry to turn to the old question and ask again. But I'm > still confused about it. Both of you said that these lines > mean different retries. However among these lines in the log > file, there is no site selection action which is represented > by "WeightedHostScoreScheduler Sorted". Hmm? 
2008-06-10 10:47:27,429-0500 DEBUG WeightedHostScoreScheduler Releasing contact 7 2008-06-10 10:47:27,430-0500 INFO WeightedHostScoreScheduler Sorted: [OSG_LIGO_MIT:21.822(51.667):2/4] 2008-06-10 10:47:27,430-0500 DEBUG WeightedHostScoreScheduler Rand: 15.78147400479908, sum: 100.07797485652034 2008-06-10 10:47:27,431-0500 DEBUG WeightedHostScoreScheduler Next contact: OSG_LIGO_MIT:21.822(51.667):2/4 That seems to be your only contact. Running cat /home/lixi/newswift/latest/score/3500/workflowtest-20080610-1045-58kc7p6f.log|grep "Next contact: OSG_LIGO_MIT"|wc produces: 4376 30632 469760 So there's 4376 site selections there. If you remove the |wc you can see the evolution of the score. > Then I wonder what > would be included in one try. One try just means trying to > do the same operation to the same site or selecting next site > (may be another one or the same one) to do the same > operation. In log file, what kind of expression implies the > beginning or end of one try? You only seem to have one site there. Re-trying means full re-scheduling (so maybe another site if there is one). There isn't much marking the start of a try besides the scheduler allocating a site. The successful end of a try is represented by "JOB_END". Failed -> "APPLICATION_EXCEPTION". > > Thanks, > > Xi From lixi at uchicago.edu Wed Jun 11 16:02:04 2008 From: lixi at uchicago.edu (lixi at uchicago.edu) Date: Wed, 11 Jun 2008 16:02:04 -0500 (CDT) Subject: [Swift-devel] execution.retries Message-ID: <20080611160204.BBF21732@m4500-03.uchicago.edu> >Hmm? 
>2008-06-10 10:47:27,429-0500 DEBUG WeightedHostScoreScheduler Releasing
>contact 7
>2008-06-10 10:47:27,430-0500 INFO WeightedHostScoreScheduler Sorted:
>[OSG_LIGO_MIT:21.822(51.667):2/4]
>2008-06-10 10:47:27,430-0500 DEBUG WeightedHostScoreScheduler Rand:
>15.78147400479908, sum: 100.07797485652034
>2008-06-10 10:47:27,431-0500 DEBUG WeightedHostScoreScheduler Next
>contact: OSG_LIGO_MIT:21.822(51.667):2/4
>
>
>That seems to be your only contact. Running
>cat /home/lixi/newswift/latest/score/3500/workflowtest-20080610-1045-58kc7p6f.log|grep "Next contact: OSG_LIGO_MIT"|wc
>
>produces: 4376 30632 469760
>
>So there's 4376 site selections there.
>
>If you remove the |wc you can see the evolution of the score.
>
>You only seem to have one site there. Re-trying means full re-scheduling
>(so maybe another site if there is one).
>
>There isn't much marking the start of a try besides the scheduler
>allocating a site. The successful end of a try is represented by
>"JOB_END". Failed -> "APPLICATION_EXCEPTION".
>

Thanks, I see. :)

However, I have another question. Please check another of my log files on CI:
/home/lixi/newswift/test1/workflowtest-20080611-0956-z09bzjs5.log.
In that file:

2008-06-11 09:56:33,899-0500 DEBUG TaskImpl Task (type=FILE_OPERATION, identity=urn:0-0-1-1213196193768) setting status to Submitting
2008-06-11 09:56:35,823-0500 DEBUG TaskImpl Task (type=FILE_OPERATION, identity=urn:0-0-1-1213196193768) setting status to Submitted
2008-06-11 09:56:35,823-0500 DEBUG TaskImpl Task (type=FILE_OPERATION, identity=urn:0-0-1-1213196193768) setting status to Active
2008-06-11 09:56:35,877-0500 DEBUG TaskImpl Task (type=FILE_OPERATION, identity=urn:0-0-1-1213196193768) setting status to Completed
2008-06-11 09:56:35,877-0500 DEBUG WeightedHostScoreScheduler multiplyScore(GLOW:-0.010 (0.994):1/2, 0.01)
2008-06-11 09:56:35,878-0500 DEBUG WeightedHostScoreScheduler Old score: -0.010, new score: 0.000
2008-06-11 09:56:35,878-0500 INFO LateBindingScheduler Task (type=FILE_OPERATION, identity=urn:0-0-1-1213196193768) Completed. Waiting: 0, Running: 0. Heap size: 12M, Heap free: 1M, Max heap: 63M
2008-06-11 09:56:35,885-0500 DEBUG WeightedHostScoreScheduler multiplyScore(GLOW:0.000 (1.000):1/2, -0.2)
2008-06-11 09:56:35,885-0500 DEBUG WeightedHostScoreScheduler Old score: 0.000, new score: -0.200
2008-06-11 09:56:35,896-0500 DEBUG TaskImpl Task (type=FILE_TRANSFER, identity=urn:0-0-1-1213196193771) setting status to Submitting
2008-06-11 09:56:35,896-0500 DEBUG TaskImpl Task (type=FILE_TRANSFER, identity=urn:0-0-1-1213196193771) setting status to Submitted
2008-06-11 09:56:35,897-0500 DEBUG TaskImpl Task (type=FILE_TRANSFER, identity=urn:0-0-1-1213196193771) setting status to Active
2008-06-11 09:56:36,129-0500 DEBUG TaskImpl Task (type=FILE_TRANSFER, identity=urn:0-0-1-1213196193771) setting status to Completed
2008-06-11 09:56:36,129-0500 DEBUG WeightedHostScoreScheduler multiplyScore(GLOW:-0.200 (0.889):1/2, 0.2)
2008-06-11 09:56:36,129-0500 DEBUG WeightedHostScoreScheduler Old score: -0.200, new score: 0.000
2008-06-11 09:56:36,129-0500 INFO LateBindingScheduler Task (type=FILE_TRANSFER, identity=urn:0-0-1-1213196193771) Completed. Waiting: 0, Running: 0. Heap size: 12M, Heap free: 2M, Max heap: 63M
2008-06-11 09:56:36,130-0500 DEBUG WeightedHostScoreScheduler multiplyScore(GLOW:0.000 (1.000):1/2, -0.2)
2008-06-11 09:56:36,130-0500 DEBUG WeightedHostScoreScheduler Old score: 0.000, new score: -0.200
2008-06-11 09:56:36,131-0500 DEBUG TaskImpl Task (type=FILE_TRANSFER, identity=urn:0-0-1-1213196193773) setting status to Submitting
2008-06-11 09:56:36,131-0500 DEBUG TaskImpl Task (type=FILE_TRANSFER, identity=urn:0-0-1-1213196193773) setting status to Submitted
2008-06-11 09:56:36,131-0500 DEBUG TaskImpl Task (type=FILE_TRANSFER, identity=urn:0-0-1-1213196193773) setting status to Active
2008-06-11 09:56:36,345-0500 DEBUG TaskImpl Task (type=FILE_TRANSFER, identity=urn:0-0-1-1213196193773) setting status to Completed
2008-06-11 09:56:36,345-0500 DEBUG WeightedHostScoreScheduler multiplyScore(GLOW:-0.200 (0.889):1/2, 0.2)
2008-06-11 09:56:36,345-0500 DEBUG WeightedHostScoreScheduler Old score: -0.200, new score: 0.000
2008-06-11 09:56:36,346-0500 INFO LateBindingScheduler Task (type=FILE_TRANSFER, identity=urn:0-0-1-1213196193773) Completed. Waiting: 0, Running: 0. Heap size: 12M, Heap free: 1M, Max heap: 63M
2008-06-11 09:56:36,347-0500 DEBUG WeightedHostScoreScheduler multiplyScore(GLOW:0.000 (1.000):1/2, -0.01)
2008-06-11 09:56:36,347-0500 DEBUG WeightedHostScoreScheduler Old score: 0.000, new score: -0.010
2008-06-11 09:56:36,348-0500 DEBUG TaskImpl Task (type=FILE_OPERATION, identity=urn:0-0-1-1213196193775) setting status to Submitting
2008-06-11 09:56:36,348-0500 DEBUG TaskImpl Task (type=FILE_OPERATION, identity=urn:0-0-1-1213196193775) setting status to Submitted
2008-06-11 09:56:36,348-0500 DEBUG TaskImpl Task (type=FILE_OPERATION, identity=urn:0-0-1-1213196193775) setting status to Active
2008-06-11 09:56:36,378-0500 DEBUG TaskImpl Task (type=FILE_OPERATION, identity=urn:0-0-1-1213196193775) setting status to Completed
2008-06-11 09:56:36,378-0500 DEBUG WeightedHostScoreScheduler multiplyScore(GLOW:-0.010 (0.994):1/2, 0.01)
2008-06-11 09:56:36,378-0500 DEBUG WeightedHostScoreScheduler Old score: -0.010, new score: 0.000
2008-06-11 09:56:36,378-0500 INFO LateBindingScheduler Task (type=FILE_OPERATION, identity=urn:0-0-1-1213196193775) Completed. Waiting: 0, Running: 0. Heap size: 12M, Heap free: 1M, Max heap: 63M
2008-06-11 09:56:36,379-0500 DEBUG WeightedHostScoreScheduler multiplyScore(GLOW:0.000 (1.000):1/2, -0.01)
2008-06-11 09:56:36,380-0500 DEBUG WeightedHostScoreScheduler Old score: 0.000, new score: -0.010
2008-06-11 09:56:36,380-0500 DEBUG TaskImpl Task (type=FILE_OPERATION, identity=urn:0-0-1-1213196193777) setting status to Submitting
2008-06-11 09:56:36,380-0500 DEBUG TaskImpl Task (type=FILE_OPERATION, identity=urn:0-0-1-1213196193777) setting status to Submitted
2008-06-11 09:56:36,380-0500 DEBUG TaskImpl Task (type=FILE_OPERATION, identity=urn:0-0-1-1213196193777) setting status to Active
2008-06-11 09:56:36,408-0500 DEBUG TaskImpl Task (type=FILE_OPERATION, identity=urn:0-0-1-1213196193777) setting status to Completed
2008-06-11 09:56:36,408-0500 DEBUG WeightedHostScoreScheduler multiplyScore(GLOW:-0.010 (0.994):1/2, 0.01)
2008-06-11 09:56:36,408-0500 DEBUG WeightedHostScoreScheduler Old score: -0.010, new score: 0.000
2008-06-11 09:56:36,409-0500 INFO LateBindingScheduler Task (type=FILE_OPERATION, identity=urn:0-0-1-1213196193777) Completed. Waiting: 0, Running: 0. Heap size: 12M, Heap free: 1M, Max heap: 63M
2008-06-11 09:56:36,410-0500 DEBUG WeightedHostScoreScheduler multiplyScore(GLOW:0.000 (1.000):1/2, -0.01)
2008-06-11 09:56:36,410-0500 DEBUG WeightedHostScoreScheduler Old score: 0.000, new score: -0.010
2008-06-11 09:56:36,410-0500 DEBUG TaskImpl Task (type=FILE_OPERATION, identity=urn:0-0-1-1213196193779) setting status to Submitting
2008-06-11 09:56:36,410-0500 DEBUG TaskImpl Task (type=FILE_OPERATION, identity=urn:0-0-1-1213196193779) setting status to Submitted
2008-06-11 09:56:36,410-0500 DEBUG TaskImpl Task (type=FILE_OPERATION, identity=urn:0-0-1-1213196193779) setting status to Active
2008-06-11 09:56:36,437-0500 DEBUG TaskImpl Task (type=FILE_OPERATION, identity=urn:0-0-1-1213196193779) setting status to Completed
2008-06-11 09:56:36,437-0500 DEBUG WeightedHostScoreScheduler multiplyScore(GLOW:-0.010 (0.994):1/2, 0.01)
2008-06-11 09:56:36,437-0500 DEBUG WeightedHostScoreScheduler Old score: -0.010, new score: 0.000
2008-06-11 09:56:36,437-0500 INFO LateBindingScheduler Task (type=FILE_OPERATION, identity=urn:0-0-1-1213196193779) Completed. Waiting: 0, Running: 0. Heap size: 12M, Heap free: 1M, Max heap: 63M
2008-06-11 09:56:36,441-0500 INFO vdl:initshareddir END host=GLOW - Done initializing shared directory

It seems that there are multiple FILE_OPERATION and
FILE_TRANSFER for the same job when initializing shared
directory, what does this mean?

From hategan at mcs.anl.gov Wed Jun 11 16:08:07 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 11 Jun 2008 16:08:07 -0500
Subject: [Swift-devel] execution.retries
In-Reply-To: <20080611160204.BBF21732@m4500-03.uchicago.edu>
References: <20080611160204.BBF21732@m4500-03.uchicago.edu>
Message-ID: <1213218487.9847.3.camel@localhost>

On Wed, 2008-06-11 at 16:02 -0500, lixi at uchicago.edu wrote:

> It seems that there are multiple FILE_OPERATION and
> FILE_TRANSFER for the same job when initializing shared
> directory, what does this mean?
Look in libexec/vdl-int.k:

element(initSharedDir, [rhost]
    once(list(rhost, "shared")
        ...
        dir:make(sharedDir, host=rhost)
        transfer(srcdir="{vds.home}/libexec/", srcfile="wrapper.sh", destdir=sharedDir, desthost=rhost)
        transfer(srcdir="{vds.home}/libexec/", srcfile="seq.sh", destdir=sharedDir, desthost=rhost)
        dir:make(dircat(wfdir, "kickstart"), host=rhost)
        dir:make(dircat(wfdir, "status"), host=rhost)
        dir:make(dircat(wfdir, "info"), host=rhost)
        ...
    )
)

So it makes the main directory, transfers wrapper.sh and seq.sh, and
creates the kickstart, status, and info sub-directories.

From benc at hawaga.org.uk Sun Jun 15 07:21:32 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sun, 15 Jun 2008 12:21:32 +0000 (GMT)
Subject: [Swift-devel] slightly easier log file report html generator command
Message-ID:

I just (r2061) put a command in the log-processing code to hopefully make
it easier to generate report webpages for Swift log files.

To use:

svn co https://svn.ci.uchicago.edu/svn/vdl2/log-processing
cd log-processing/bin
export PATH=`pwd`:$PATH

Now you will have a command swift-plot-log that you can run on a log file
like this:

$ swift-plot-log workflowtest-20080529-1145-n44o2cj1.log

You'll get lots of output, most likely with a bunch of errors because that
code is quite lame.

You will also get a directory called report-/ with an index.html in it.
Open that with your favourite web browser. For example, on os x:

$ open report-workflowtest-20080529-1145-n44o2cj1/index.html

Eventually I'd like to move all of this into the main release rather than
it being a separate checkout. But not yet.

--

From benc at hawaga.org.uk Sun Jun 15 10:08:56 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sun, 15 Jun 2008 15:08:56 +0000 (GMT)
Subject: [Swift-devel] [VOTE] Release Swift 0.6 rc3 as Swift 0.6
In-Reply-To:
References:
Message-ID:

On Tue, 10 Jun 2008, Ben Clifford wrote:

> The results of this vote will appear in ~120 hours from now.

That is now.
This action item did not receive majority approval. -- From benc at hawaga.org.uk Sun Jun 15 14:35:26 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Sun, 15 Jun 2008 19:35:26 +0000 (GMT) Subject: [Swift-devel] coaster hello world x 1000 and related. Message-ID: These are some notes on some playing round with coasters and related stuff today - there's not any particular strong point made, but here are details for anyone interested. I tried a 1000 x hello-world run. All of the below were submitted from my laptop on residential DSL in the UK. Most of the below were only measured once. Running through GRAM4 to TG NCSA Mercury, a run takes about 8000 seconds with a jobThrottle of 4. Running with coasters+GRAM2 with a job throttle of 0, to TG NCSA Mercury, this takes about 1500 seconds. Running coasters with direct PBS submission of workers rather than using gram2 to submit the workers leads to various breakages - perhaps a problem with the pbs provider. I'll have to investigate that some more. Then I noticed I was using the gt2 gridftp server for that site (which is GridFTP Server 2.5 (gcc64, 1182369948-63); I switched to the main one which is GridFTP Server 2.5 (gcc64, 1182369948-63) to see how that would change things. It doesn't seem to change much in this case. I also increased the throttle up to 0.2, which is pretty much the maximum that GRAM2 is OK with - the runtime is down to less than 600s - something like 1/13th of the time taken to run in untuned GRAM4 mode. (Note that I expect fiddling with various throttles with GRAM4 would produce some decent speedup) A lot of that 600s is taken up ramping up to threshold. I set the initialScore to 100 so that submission would start at a higher rate rather than ramping up. That didn't have much effect. So at this point, most of the time seems to be taken up with file operations rather than job submission operations. Next I doubled the file transfer and file operation throttles (to 8 and 16 respectively). 
Doubling those throttles lowers the run time after execution a bit, to around 300s; however the realtime of the run went up to 600s, apparently because someone else woke up and started putting stuff in the queue on Mercury.

--

From benc at hawaga.org.uk Sun Jun 15 18:48:56 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sun, 15 Jun 2008 23:48:56 +0000 (GMT)
Subject: [Swift-devel] coaster + fletch-condor command line truncation
Message-ID:

If I submit to fletch using coasters and gt2:fork, my tests run OK.

If I tell coasters to use gt2:condor to submit workers, then I get problems with commandline truncation - 130-fmri has a job in the middle with a bunch of input files.

r2073 adds the two sites files that demonstrate this:

cd tests/sites/

./run-site coaster/fletch-coaster-gram2-gram2-fork.xml

works for me and

./run-site coaster/fletch-coaster-gram2-gram2-condor.xml

does not work for me.

The jobs that fail get run on gwynn.bsd.uchicago.edu most of the time. So I tried running coasters locally on gwynn itself (tests/sites/coaster-local.xml) and that works ok.

I have not dug into this deeply; however at first glance it seems somewhat puzzling.

Changing the mechanism by which coaster workers are submitted should not change the command-line handling of jobs submitted through coasters; at least not from my general understanding of coaster architecture.

Maybe there is something funny happening with different shell or perl versions or something like that.

--

From hategan at mcs.anl.gov Sun Jun 15 18:56:02 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sun, 15 Jun 2008 18:56:02 -0500
Subject: [Swift-devel] coaster + fletch-condor command line truncation
In-Reply-To:
References:
Message-ID: <1213574162.18631.1.camel@localhost>

On Sun, 2008-06-15 at 23:48 +0000, Ben Clifford wrote:
> Maybe there is something funny happening with different shell or perl
> versions or something like that.

Could it be that the condor/gram version installed there is different and fixes some of the quoting problems, resulting in twice-quoted things?

From benc at hawaga.org.uk Sun Jun 15 19:05:15 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 16 Jun 2008 00:05:15 +0000 (GMT)
Subject: [Swift-devel] coaster + fletch-condor command line truncation
In-Reply-To: <1213574162.18631.1.camel@localhost>
References: <1213574162.18631.1.camel@localhost>
Message-ID:

On Sun, 15 Jun 2008, Mihael Hategan wrote:

> May it be that the condor/gram version installed there is different and
> fixes some of the quoting problems resulting in twice quoted things?

The job command lines shouldn't be going anywhere near condor in this setup though, I think.

I added a commandline log to the wrapper. It looks truncated rather than funnily quoted.
_____________________________________________________________________________

command line
_____________________________________________________________________________

touch-mg5926ui -jobdir m -e /bin/touch -out stdout.txt -err stderr.txt -i -d _co
ncurrent/aligned-028fa5e1-234d-45bf-bc3f-d4c3296c1802--array//elt-3.-field|_conc
urrent/aligned-028fa5e1-234d-45bf-bc3f-d4c3296c1802--array//elt-2.-field|_concur
rent/aligned-028fa5e1-234d-45bf-bc3f-d4c3296c1802--array//elt-4.-field|_concurre
nt/aligned-028fa5e1-234d-45bf-bc3f-d4c3296c1802--array//elt-1.-field|_concurrent
/brainatlas-c5b15ce2-fe89-49a7-864c-43ada7185721--field -if _concurrent/aligned-
028fa5e1-234d-45bf-bc3f-d4c3296c1802--array//elt-3.-field/h|_concurrent/aligned-
028fa5e1-234d-45bf-bc3f-d4c3296c1802--array//elt-3.-field/v|_concurrent/aligned-
028fa5e1-234d-45bf-bc3f-d4c3296c1802--array//elt-2.-field/h|_concurrent/aligned-
028fa5e1-234d-45bf-bc3f-d4c3296c1802--array//elt-2.-field/v|_concurrent/aligned-
028fa5e1-234d-45bf-bc3f-d4c3296c1802--array//elt-4.-field/h|_concurrent/aligned-
028fa5e1-234d-45bf-bc3f-d4c3296c1802--array//elt-4.-field/v|_concurrent/aligned-
028fa5e1-234d-45bf-bc3f-d4c3296c1802--array//elt-1.-field/h|_concurrent/aligned-
028fa5e1-234d-45bf-bc3f-d4c3296c1802--array//elt-1.-field/v -of _concurrent/brai
natlas-c5b15ce2-fe89-49a7-864c-43ada7185721--field/h|_concurrent/brainatlas-c5b1
5ce2-fe89-49a7-864c-43ada7185721--field/v -k

--

From foster at mcs.anl.gov Sun Jun 15 19:14:19 2008
From: foster at mcs.anl.gov (Ian Foster)
Date: Sun, 15 Jun 2008 19:14:19 -0500
Subject: [Swift-devel] coaster + fletch-condor command line truncation
In-Reply-To:
References:
Message-ID:

These sites don't support GRAM4?

On Jun 15, 2008, at 6:48 PM, Ben Clifford wrote:

> If I submit to fletch using coasters and gt2:fork, my tests run OK.
>
> If I tell coasters to use gt2:condor to submit workers, then I get
> problems with commandline truncation - 130-fmri has a job in the
> middle with a bunch of input files.

From benc at hawaga.org.uk Sun Jun 15 19:16:37 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 16 Jun 2008 00:16:37 +0000 (GMT)
Subject: [Swift-devel] coaster + fletch-condor command line truncation
In-Reply-To:
References:
Message-ID:

On Sun, 15 Jun 2008, Ian Foster wrote:

> These sites don't support GRAM4?

no.
--

From benc at hawaga.org.uk Sun Jun 15 20:05:59 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 16 Jun 2008 01:05:59 +0000 (GMT)
Subject: [Swift-devel] coaster + fletch-condor command line truncation
In-Reply-To: <1213574162.18631.1.camel@localhost>
References: <1213574162.18631.1.camel@localhost>
Message-ID:

I just ran 130-fmri on the TG Purdue condor pool and I don't see this problem; so I'm more inclined to think there is something environmental about the different ways in which code is being launched on fletch/gwynn.

--

From foster at mcs.anl.gov Sun Jun 15 21:25:25 2008
From: foster at mcs.anl.gov (Ian Foster)
Date: Sun, 15 Jun 2008 21:25:25 -0500
Subject: [Swift-devel] coaster + fletch-condor command line truncation
In-Reply-To:
References: <1213574162.18631.1.camel@localhost>
Message-ID: <165814F6-07BC-40D5-8BCF-6BA0A7245313@mcs.anl.gov>

where are the machines fletch and gwynn?

It would be good to get GRAM4 installed on them.

On Jun 15, 2008, at 8:05 PM, Ben Clifford wrote:

> I just ran 130-fmri on the TG Purdue condor pool and I don't see this
> problem; so I'm more inclined to think there is something environmental
> about the different ways in which code is being launched on fletch/gwynn.

From benc at hawaga.org.uk Mon Jun 16 06:02:15 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 16 Jun 2008 11:02:15 +0000 (GMT)
Subject: [Swift-devel] coaster + fletch-condor command line truncation
In-Reply-To: <165814F6-07BC-40D5-8BCF-6BA0A7245313@mcs.anl.gov>
References: <1213574162.18631.1.camel@localhost> <165814F6-07BC-40D5-8BCF-6BA0A7245313@mcs.anl.gov>
Message-ID:

On Sun, 15 Jun 2008, Ian Foster wrote:

> where are the machines fletch and gwynn?

bsd.uchicago.edu though CI support appears to administer them some.

> It would be good to get GRAM4 installed on them.

If I was to choose where to spend CI support time, I'd rather spend it on teraport than on a small cluster like the bsd one.

--

From foster at mcs.anl.gov Mon Jun 16 07:11:28 2008
From: foster at mcs.anl.gov (Ian Foster)
Date: Mon, 16 Jun 2008 07:11:28 -0500
Subject: [Swift-devel] coaster + fletch-condor command line truncation
In-Reply-To:
References: <1213574162.18631.1.camel@localhost> <165814F6-07BC-40D5-8BCF-6BA0A7245313@mcs.anl.gov>
Message-ID:

Do you mean that TeraPort does not run GRAM4?

On Jun 16, 2008, at 6:02 AM, Ben Clifford wrote:

> On Sun, 15 Jun 2008, Ian Foster wrote:
>
>> where are the machines fletch and gwynn?
>
> bsd.uchicago.edu though CI support appears to administer them some.
>
>> It would be good to get GRAM4 installed on them.
>
> If I was to choose where to spend CI support time, I'd rather spend
> it on teraport than on a small cluster like the bsd one.

From benc at hawaga.org.uk Mon Jun 16 07:13:55 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 16 Jun 2008 12:13:55 +0000 (GMT)
Subject: [Swift-devel] coaster + fletch-condor command line truncation
In-Reply-To:
References: <1213574162.18631.1.camel@localhost> <165814F6-07BC-40D5-8BCF-6BA0A7245313@mcs.anl.gov>
Message-ID:

On Mon, 16 Jun 2008, Ian Foster wrote:

> Do you mean that TeraPort does not run GRAM4?

I can't get it to successfully run jobs through the queueing system. Which means I won't recommend it to other users to attempt to use. Whether you want to interpret that as a yes or a no, I don't know.
--

From bugzilla-daemon at mcs.anl.gov Mon Jun 16 09:52:47 2008
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Mon, 16 Jun 2008 09:52:47 -0500 (CDT)
Subject: [Swift-devel] [Bug 101] fast-failing sites will absorb large numbers of jobs causing runs to fail despite multiple attempts at retrying
In-Reply-To:
Message-ID: <20080616145247.B1333164B9@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=101

------- Comment #4 from benc at hawaga.org.uk 2008-06-16 09:52 -------

Using provider-wonky it is possible to recreate this problem in a local environment.

The following site definitions will give one local executing site that runs with a 5s delay with no failure and another site that will fast-fail all jobs.

Try it against eg tests/language-behaviour/130-fmri

/var/tmp 0 /var/tmp 0

From wilde at mcs.anl.gov Wed Jun 18 10:31:17 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 18 Jun 2008 10:31:17 -0500
Subject: [Swift-devel] PROPOSAL: Mike to be dev.globus project chair
Message-ID: <48592A45.80403@mcs.anl.gov>

dev.globus is still waiting to hear from us who will be chair.

I propose that I remain project chair and that we signify agreement (or not) by a vote, which will follow this email.

We did not vote on chair (or anything else) when we initially joined dev.globus. I propose we do so now.

dev.globus defines the chair role as:

"Each Globus project is required to name a project Chair via some process defined by the project's Committers. A project Chair has no enhanced authority, but has certain responsibilities relative to the function of the Globus Alliance. Specifically:
* The Chair should generate, on or before March 31st, June 30th, September 30th, and December 31st of each year, reports concerning the activities of the project during the past quarter, its current status, and future plans.
* The Chair is responsible for forwarding to the Globus infrastructure group requests to add or delete Committers for the project."

- Mike

-------- Original Message --------
Subject: Re: [Swift-devel] [Fwd: Re: [incubator-committers] Re: swift project in hibernation]
Date: Thu, 5 Jun 2008 17:18:53 +0000 (GMT)
From: Ben Clifford
To: Michael Wilde
CC: swift-devel
References: <484807C1.2060902 at mcs.anl.gov>

On Thu, 5 Jun 2008, Michael Wilde wrote:

> If "chair" is someone who maintains the infrastructure it should be a
> developer.
>
> If it's someone that makes management decisions and speaks for the project, it
> should be me.

It seems to me to be neither of those. Specifically there are no requirements that the chair maintain infrastructure and an explicit prohibition on the chair making enhanced-authority decisions (wrt any other committer).

However, Jen's correspondence seems to suggest that IMP regards the chair as having some other rights and obligations. I don't believe these are publicly documented though.

---
Each Globus project is required to name a project Chair via some process defined by the project's Committers. A project Chair has no enhanced authority, but has certain responsibilities relative to the function of the Globus Alliance. Specifically:
* The Chair should generate, on or before March 31st, June 30th, September 30th, and December 31st of each year, reports concerning the activities of the project during the past quarter, its current status, and future plans.
* The Chair is responsible for forwarding to the Globus infrastructure group requests to add or delete Committers for the project.
---

--

From wilde at mcs.anl.gov Wed Jun 18 10:35:49 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 18 Jun 2008 10:35:49 -0500
Subject: [Swift-devel] [VOTE] Mike to be dev.globus project chair
In-Reply-To: <1212616673.14746.4.camel@localhost>
References: <1212598787.8492.8.camel@localhost> <1212615395.14746.0.camel@localhost> <1212616673.14746.4.camel@localhost>
Message-ID: <48592B55.5060503@mcs.anl.gov>

I'm sending this out for committers to vote as I indicated in the proposal.

If the proposal needs discussion please reply to the proposal thread.

The vote closes at 5:00 PM CDT today, June 18.

Please respond with one of:
+1 The action should be performed.
+0 Abstain - I support the action.
-0 Abstain, I don't support the action but I can't help with an alternative
-1 The action should not be performed and I am offering an explanation or alternative.

Thanks,

Mike

--- Proposal text below ---

dev.globus is still waiting to hear from us who will be chair.

I propose that I remain project chair and that we signify agreement (or not) by a vote, which will follow this email.

We did not vote on chair (or anything else) when we initially joined dev.globus. I propose we do so now.

dev.globus defines the chair role as:

"Each Globus project is required to name a project Chair via some process defined by the project's Committers. A project Chair has no enhanced authority, but has certain responsibilities relative to the function of the Globus Alliance. Specifically:
* The Chair should generate, on or before March 31st, June 30th, September 30th, and December 31st of each year, reports concerning the activities of the project during the past quarter, its current status, and future plans.
* The Chair is responsible for forwarding to the Globus infrastructure group requests to add or delete Committers for the project."
From foster at mcs.anl.gov Wed Jun 18 10:36:31 2008
From: foster at mcs.anl.gov (Ian Foster)
Date: Wed, 18 Jun 2008 10:36:31 -0500
Subject: [Swift-devel] [VOTE] Mike to be dev.globus project chair
In-Reply-To: <48592B55.5060503@mcs.anl.gov>
References: <1212598787.8492.8.camel@localhost> <1212615395.14746.0.camel@localhost> <1212616673.14746.4.camel@localhost> <48592B55.5060503@mcs.anl.gov>
Message-ID:

+1

From wilde at mcs.anl.gov Wed Jun 18 10:39:24 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 18 Jun 2008 10:39:24 -0500
Subject: [Swift-devel] [VOTE] Mike to be dev.globus project chair
In-Reply-To: <48592B55.5060503@mcs.anl.gov>
References: <1212598787.8492.8.camel@localhost> <1212615395.14746.0.camel@localhost> <1212616673.14746.4.camel@localhost> <48592B55.5060503@mcs.anl.gov>
Message-ID: <48592C2C.5010906@mcs.anl.gov>

+1

From tiberius at ci.uchicago.edu Wed Jun 18 10:54:35 2008
From: tiberius at ci.uchicago.edu (Tiberiu Stef-Praun)
Date: Wed, 18 Jun 2008 10:54:35 -0500
Subject: [Swift-devel] [VOTE] Mike to be dev.globus project chair
In-Reply-To: <48592B55.5060503@mcs.anl.gov>
References: <1212598787.8492.8.camel@localhost> <1212615395.14746.0.camel@localhost> <1212616673.14746.4.camel@localhost> <48592B55.5060503@mcs.anl.gov>
Message-ID:

+1

--
Tiberiu (Tibi) Stef-Praun, PhD
Research Staff, Computation Institute
5640 S. Ellis Ave, #405
University of Chicago
http://www-unix.mcs.anl.gov/~tiberius/

From hategan at mcs.anl.gov Wed Jun 18 11:03:58 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 18 Jun 2008 11:03:58 -0500
Subject: [Swift-devel] [VOTE] Mike to be dev.globus project chair
In-Reply-To:
References: <1212598787.8492.8.camel@localhost> <1212615395.14746.0.camel@localhost> <1212616673.14746.4.camel@localhost> <48592B55.5060503@mcs.anl.gov>
Message-ID: <1213805038.20858.0.camel@localhost>

On Wed, 2008-06-18 at 10:54 -0500, Tiberiu Stef-Praun wrote:
> +1

Cute, but I think we voted a slightly different committer list:
Ben Clifford
Ian Foster
Mihael Hategan
Sarah Kenny
Michael Wilde

From hategan at mcs.anl.gov Wed Jun 18 12:07:26 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 18 Jun 2008 12:07:26 -0500
Subject: [Swift-devel] [VOTE] Mike to be dev.globus project chair
In-Reply-To: <48592B55.5060503@mcs.anl.gov>
References: <1212598787.8492.8.camel@localhost> <1212615395.14746.0.camel@localhost> <1212616673.14746.4.camel@localhost> <48592B55.5060503@mcs.anl.gov>
Message-ID: <1213808846.24701.0.camel@localhost>

+1

From wilde at mcs.anl.gov Wed Jun 18 12:07:21 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 18 Jun 2008 12:07:21 -0500
Subject: [Swift-devel] [VOTE] Mike to be dev.globus project chair
In-Reply-To: <1213805038.20858.0.camel@localhost>
References: <1212598787.8492.8.camel@localhost> <1212615395.14746.0.camel@localhost> <1212616673.14746.4.camel@localhost> <48592B55.5060503@mcs.anl.gov> <1213805038.20858.0.camel@localhost>
Message-ID: <485940C9.6080207@mcs.anl.gov>

By dev.globus rules as far as I can tell, everyone on the developer's list can vote but only the committers' votes are binding.

The guidelines ask that committers indicate this with the string "(binding)" after their vote, as in "+1 (binding)".

No one did that on the last vote, so I didn't bother mentioning it this time.

- Mike

On 6/18/08 11:03 AM, Mihael Hategan wrote:
> On Wed, 2008-06-18 at 10:54 -0500, Tiberiu Stef-Praun wrote:
>> +1
>
> Cute, but I think we voted a slightly different committer list:
> Ben Clifford
> Ian Foster
> Mihael Hategan
> Sarah Kenny
> Michael Wilde

From hategan at mcs.anl.gov Wed Jun 18 12:16:42 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 18 Jun 2008 12:16:42 -0500
Subject: [Swift-devel] [VOTE] Mike to be dev.globus project chair
In-Reply-To: <485940C9.6080207@mcs.anl.gov>
References: <1212598787.8492.8.camel@localhost> <1212615395.14746.0.camel@localhost> <1212616673.14746.4.camel@localhost> <48592B55.5060503@mcs.anl.gov> <1213805038.20858.0.camel@localhost> <485940C9.6080207@mcs.anl.gov>
Message-ID: <1213809402.24972.1.camel@localhost>

A good point. So what does a non-binding vote achieve in this case?

On Wed, 2008-06-18 at 12:07 -0500, Michael Wilde wrote:
> By dev.globus rules as far as I can tell, everyone on the developer's
> list can vote but only the committers' votes are binding.
>
> The guidelines ask that committers indicate this with the string
> "(binding)" after their vote, as in "+1 (binding)".
>
> No one did that on the last vote, so I didn't bother mentioning it this time.
>
> - Mike

From hategan at mcs.anl.gov Wed Jun 18 12:18:15 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 18 Jun 2008 12:18:15 -0500
Subject: [Swift-devel] [VOTE] Mike to be dev.globus project chair
In-Reply-To: <1213809402.24972.1.camel@localhost>
References: <1212598787.8492.8.camel@localhost> <1212615395.14746.0.camel@localhost> <1212616673.14746.4.camel@localhost> <48592B55.5060503@mcs.anl.gov> <1213805038.20858.0.camel@localhost> <485940C9.6080207@mcs.anl.gov> <1213809402.24972.1.camel@localhost>
Message-ID: <1213809495.25175.0.camel@localhost>

On Wed, 2008-06-18 at 12:16 -0500, Mihael Hategan wrote:
> A good point. So what does a non-binding vote achieve in this case?

Allow me to rephrase, since non-binding vetoes have a role:
A good point. So what does a non-binding +1 vote achieve in this case?

From wilde at mcs.anl.gov Wed Jun 18 12:19:28 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 18 Jun 2008 12:19:28 -0500
Subject: [Swift-devel] [VOTE] Mike to be dev.globus project chair
In-Reply-To: <1213809495.25175.0.camel@localhost>
References: <1212598787.8492.8.camel@localhost> <1212615395.14746.0.camel@localhost> <1212616673.14746.4.camel@localhost> <48592B55.5060503@mcs.anl.gov> <1213805038.20858.0.camel@localhost> <485940C9.6080207@mcs.anl.gov> <1213809402.24972.1.camel@localhost> <1213809495.25175.0.camel@localhost>
Message-ID: <485943A0.5080501@mcs.anl.gov>

A chance to voice an opinion that gets recorded with the vote, as far as I can tell.

- Mike

On 6/18/08 12:18 PM, Mihael Hategan wrote:
> On Wed, 2008-06-18 at 12:16 -0500, Mihael Hategan wrote:
>> A good point. So what does a non-binding vote achieve in this case?
>
> Allow me to rephrase, since non-binding vetoes have a role:
> A good point. So what does a non-binding +1 vote achieve in this case?
>
>> On Wed, 2008-06-18 at 12:07 -0500, Michael Wilde wrote:
>>> By dev.globus rules as far as I can tell, everyone on the developer's
>>> list can vote but only the committers' votes are binding.
>>>
>>> The guidelines ask that committers indicate this with the string
>>> "(binding)" after their vote, as in "+1 (binding)".
>>>
>>> No one did that on the last vote, so I didn't bother mentioning it this time.
>>> >>> - Mike >>> >>> >>> >>> On 6/18/08 11:03 AM, Mihael Hategan wrote: >>>> On Wed, 2008-06-18 at 10:54 -0500, Tiberiu Stef-Praun wrote: >>>>> +1 >>>> Cute, but I think we voted a slightly different committer list: >>>> Ben Clifford >>>> Ian Foster >>>> Mihael Hategan >>>> Sarah Kenny >>>> Michael Wilde >>>> >>>>> On Wed, Jun 18, 2008 at 10:35 AM, Michael Wilde wrote: >>>>>> Im sending this out for committers to vote as I indicated in the proposal. >>>>>> >>>>>> If the proposal needs discussion please reply to the proposal thread. >>>>>> >>>>>> The vote closes at 5:00 PM CDT today, June 18. >>>>>> >>>>>> Please respond with one of: >>>>>> +1 The action should be performed. >>>>>> +0 Abstain - I support the action. >>>>>> -0 Abstain, I don't support the action but I can't help with an alternative >>>>>> -1 The action should not be performed and I am offering an explanation >>>>>> or alternative. >>>>>> >>>>>> Thanks, >>>>>> >>>>>> Mike >>>>>> >>>>>> --- Proposal text below --- >>>>>> >>>>>> dev.globus is still waiting to hear from us who will be chair. >>>>>> >>>>>> I propose that I remain project chair and that we signify agreement (or not) >>>>>> by a vote, which will follow this email. >>>>>> >>>>>> We did not vote on chair (or anything else) when we initially joined >>>>>> dev.globus. I propose we do so now. >>>>>> >>>>>> dev.globus defines the chair role as: >>>>>> >>>>>> "Each Globus project is required to name a project Chair via some >>>>>> process defined by the project's Committers. A project Chair has no >>>>>> enhanced authority, but has certain responsibilities relative to the >>>>>> function of the Globus Alliance. Specifically: >>>>>> * The Chair should generate, on or before March 31st, June 30th, September >>>>>> 30th, and December 31st of each year, reports concerning the activities of >>>>>> the project during the past quarter, its current status, and future plans. 
>>>>>> *The Chair is responsible for forwarding to the Globus infrastructure >>>>>> group requests to add or delete Committers for the project." >>>>>> >>>>>> _______________________________________________ >>>>>> Swift-devel mailing list >>>>>> Swift-devel at ci.uchicago.edu >>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>>>>> >>>>> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From tiberius at ci.uchicago.edu Wed Jun 18 15:48:49 2008 From: tiberius at ci.uchicago.edu (Tiberiu Stef-Praun) Date: Wed, 18 Jun 2008 15:48:49 -0500 Subject: [Swift-devel] need @extractfloat Message-ID: Hi As in title, I need to use such a function. I did not know where to send the request. Thanks Tibi -- Tiberiu (Tibi) Stef-Praun, PhD Research Staff, Computation Institute 5640 S. Ellis Ave, #405 University of Chicago http://www-unix.mcs.anl.gov/~tiberius/ From hategan at mcs.anl.gov Wed Jun 18 15:53:17 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 18 Jun 2008 15:53:17 -0500 Subject: [Swift-devel] need @extractfloat In-Reply-To: References: Message-ID: <1213822397.32583.0.camel@localhost> Have you looked at readData? http://www.ci.uchicago.edu/swift/guides/userguide.php#procedure.readdata On Wed, 2008-06-18 at 15:48 -0500, Tiberiu Stef-Praun wrote: > Hi > > As in title, I need to use such a function. > I did not know where to send the request. > > Thanks > Tibi > From benc at hawaga.org.uk Wed Jun 18 19:30:06 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 19 Jun 2008 00:30:06 +0000 (GMT) Subject: [Swift-devel] need @extractfloat In-Reply-To: References: Message-ID: On Wed, 18 Jun 2008, Tiberiu Stef-Praun wrote: > As in title, I need to use such a function. The code for extractint looks like it might actually handle floats already. Try it and see what happens. 
-- From benc at hawaga.org.uk Wed Jun 18 19:25:44 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 19 Jun 2008 00:25:44 +0000 (GMT) Subject: [Swift-devel] [VOTE] Mike to be dev.globus project chair In-Reply-To: <485940C9.6080207@mcs.anl.gov> References: <1212598787.8492.8.camel@localhost> <1212615395.14746.0.camel@localhost> <1212616673.14746.4.camel@localhost> <48592B55.5060503@mcs.anl.gov> <1213805038.20858.0.camel@localhost> <485940C9.6080207@mcs.anl.gov> Message-ID: > By dev.globus rules as far as I can tell, everyone on the developer's > list can vote but only the committers votes are binding. The guidelines define no specific process for the election of a chair other than: > Each Globus project is required to name a project Chair via some process > defined by the project's committers. Specifically they do not mandate that either of the electoral mechanisms, 'Majority approval' or 'Consensus approval' are used. Tibi, as a non-committer, does not get to define the process by which a project chair is named. However, he might participate in that process, depending on how it is defined. Should that process be defined (by the committers) to be 'majority approval' then he would be able to participate in the majority approval vote as a 'contributor' and hence a 'member'; in that respect his vote would count in the second clause of the majority approval requirement that there be more +1 votes than -1 votes; it would not count towards the first clause requirement that there be at least three binding +1 votes. > The guidelines ask that committers indicate this with the string > "(binding)" after their vote, as in "+1 (binding)". They make no such request. They assert that a committer *may* add that indication, in order to simplify a tally. My interpretation of the guidelines is that placing such a mark or not placing such a mark does not affect the nature of the particular committer's vote in any way. 
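The two clauses of 'majority approval' described above (at least three binding +1 votes, and more +1 votes than -1 votes overall) can be sketched as a small tally function. This is only an illustration in Python: the Vote record, its field names and the function name are hypothetical and are not taken from the dev.globus guidelines. It assumes that a vote is binding exactly when the voter is a committer, regardless of whether "(binding)" was written after it.

```python
from dataclasses import dataclass


@dataclass
class Vote:
    voter: str
    value: int      # +1 or -1; abstentions (+0 / -0) are recorded as 0
    binding: bool   # assumption: True exactly when the voter is a committer


def majority_approval(votes):
    """Return True if the vote passes both clauses described in the message."""
    # Clause 1: only committers' +1 votes count towards the minimum of three.
    binding_plus = sum(1 for v in votes if v.binding and v.value == 1)
    # Clause 2: every member's vote counts when comparing +1 against -1.
    plus = sum(1 for v in votes if v.value == 1)
    minus = sum(1 for v in votes if v.value == -1)
    return binding_plus >= 3 and plus > minus


# Two committers and one contributor all vote +1: clause 2 holds,
# but clause 1 fails because only two of the +1 votes are binding.
example = [
    Vote("committer-a", 1, True),
    Vote("committer-b", 1, True),
    Vote("contributor-x", 1, False),
]
passed = majority_approval(example)
```

In this sketch a contributor's +1 can help outweigh a -1 under the second clause, but it can never make up for a missing binding +1 under the first.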
On a non-guideline related note:

> Date: Wed, 18 Jun 2008 10:35:49 -0500
> The vote closes at 5:00 PM CDT today, June 18.

That is an extremely short voting window not even covering one 24h period;
given the global nature of Swift development, this is a bad precedent to
set for vote duration.

--

From benc at hawaga.org.uk Thu Jun 19 06:21:25 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 19 Jun 2008 11:21:25 +0000 (GMT)
Subject: [Swift-devel] need @extractfloat
In-Reply-To:
References:
Message-ID:

Here's how to do it both with the apparently buggy-enough to work for you
@extractint or using readData. Using readData is probably the better way.

$ cat double.txt
3.1415

$ cat double.swift
double d = @extractint("double.txt");
trace(d);

$ swift double.swift
Swift svn swift-r2070 (Swift modified locally) cog-r2038

RunID: 20080619-1213-rx5fbjlg
Progress:
SwiftScript trace: 3.1415
Final status:

$ cat double2.swift
float d = readData("double.txt");
trace(d);

$ swift double2.swift
Swift svn swift-r2070 (Swift modified locally) cog-r2038

RunID: 20080619-1220-nbmvzfi0
Progress:
SwiftScript trace: 3.1415
Final status:

--

From wilde at mcs.anl.gov Thu Jun 19 14:47:06 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 19 Jun 2008 14:47:06 -0500
Subject: [Swift-devel] Try coaster on BG/P ?
Message-ID: <485AB7BA.9020504@mcs.anl.gov>

Zhao, Ioan and I have been working with several scientists on running 5
applications on the BG/P over the past month.

We've been using the approach of getting them running under a simple shell
script, then running at large scale under Falkon, and hoping to move them
all to Swift and then to this small group of nearby friendly users as the
final stage.

It would be good to see if Swift Coaster can run on the BG/P now, so we
can try these apps under it.

A change in the past month is that we've exploited the technique of
running the Falkon (java) server on the BG/P IO nodes.
The BG/P UNIX system (ZeptoOS) developers created a generalized startup script under which a user can launch any needed ION services before the user's jobs are started on the compute nodes. This has given very good scaling up to 64K CPU cores. This also may get around the NAT problems that were discussed on this list, and may enable Coaster to work on the BGPs with less change. I *think* that both the login nodes and the compute nodes can reach the ION (and vice versa) on direct networks without NAT, hopefully allowing Coaster UDP communication to work. Coaster is not a high priority for this, but in the next few weeks I hope to start running the apps above under Swift. If Coaster is ready to try on the BGP at that point, I'm eager to test it there. Before that I hope to test it first in an environment where it already works. Ben or Mihael, can you point me to (or write) a README on how to run Coaster? From hategan at mcs.anl.gov Thu Jun 19 14:52:46 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 19 Jun 2008 14:52:46 -0500 Subject: [Swift-devel] Try coaster on BG/P ? In-Reply-To: <485AB7BA.9020504@mcs.anl.gov> References: <485AB7BA.9020504@mcs.anl.gov> Message-ID: <1213905166.19600.1.camel@localhost> On Thu, 2008-06-19 at 14:47 -0500, Michael Wilde wrote: > I *think* that both the login nodes and the compute nodes can reach the > ION (and vice versa) on direct networks without NAT, hopefully allowing > Coaster UDP communication to work. I thought we agreed that UDP is not worth the trouble. > > Coaster is not a high priority for this, but in the next few weeks I > hope to start running the apps above under Swift. If Coaster is ready > to try on the BGP at that point, I'm eager to test it there. Before that > I hope to test it first in an environment where it already works. > > Ben or Mihael, can you point me to (or write) a README on how to run > Coaster? Do you want to run a standalone service? 
From wilde at mcs.anl.gov Thu Jun 19 15:07:08 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 19 Jun 2008 15:07:08 -0500
Subject: [Swift-devel] Try coaster on BG/P ?
In-Reply-To: <1213905166.19600.1.camel@localhost>
References: <485AB7BA.9020504@mcs.anl.gov> <1213905166.19600.1.camel@localhost>
Message-ID: <485ABC6C.4070503@mcs.anl.gov>

On 6/19/08 2:52 PM, Mihael Hategan wrote:
> On Thu, 2008-06-19 at 14:47 -0500, Michael Wilde wrote:
>> I *think* that both the login nodes and the compute nodes can reach the
>> ION (and vice versa) on direct networks without NAT, hopefully allowing
>> Coaster UDP communication to work.
>
> I thought we agreed that UDP is not worth the trouble.

If so, I either forgot or missed that in the email thread. But that's OK
- I have no preference until the choice of protocol proves to be an
issue for scalability or reliability. Have you switched the
implementation to TCP?

>
>> Coaster is not a high priority for this, but in the next few weeks I
>> hope to start running the apps above under Swift. If Coaster is ready
>> to try on the BGP at that point, I'm eager to test it there. Before that
>> I hope to test it first in an environment where it already works.
>>
>> Ben or Mihael, can you point me to (or write) a README on how to run
>> Coaster?
>
> Do you want to run a standalone service?

Do you mean standalone as in "not started automatically by Swift via
GRAM"? If so, I think yes: on the BGP, one mode would be to run one or
more services from some Swift or related Coaster startup script, that
would launch the service(s) - either one, on the login host on which
you're running Swift, or on IO nodes, one per processor-set.
> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From hategan at mcs.anl.gov Thu Jun 19 15:20:55 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 19 Jun 2008 15:20:55 -0500 Subject: [Swift-devel] Try coaster on BG/P ? In-Reply-To: <485ABC6C.4070503@mcs.anl.gov> References: <485AB7BA.9020504@mcs.anl.gov> <1213905166.19600.1.camel@localhost> <485ABC6C.4070503@mcs.anl.gov> Message-ID: <1213906855.20895.3.camel@localhost> On Thu, 2008-06-19 at 15:07 -0500, Michael Wilde wrote: > > On 6/19/08 2:52 PM, Mihael Hategan wrote: > > On Thu, 2008-06-19 at 14:47 -0500, Michael Wilde wrote: > >> I *think* that both the login nodes and the compute nodes can reach the > >> ION (and vice versa) on direct networks without NAT, hopefully allowing > >> Coaster UDP communication to work. > > > > I thought we agreed that UDP is not worth the trouble. > > If so, I either forgot or missed that in the email thread. But thats OK > - I have no preference until the choice of protocol proves to be an > issue for scalability or reliability. Have you switched the > implementation to TCP? Yes. A long time ago. > > > > >> Coaster is not a high priority for this, but in the next few weeks I > >> hope to start running the apps above under Swift. If Coaster is ready > >> to try on the BGP at that point, I'm eager to test it there. Before that > >> I hope to test it first in an environment where it already works. > >> > >> Ben or Mihael, can you point me to (or write) a README on how to run > >> Coaster? > > > > Do you want to run a standalone service? > > Do you mean standalone as in "not started automatically by Swift via > GRAM"? Yes. 
> If so, I think yes: on the BGP, one mode would be to run one or
> more services from some Swift or related Coaster startup script, that
> would launch the service(s) - either one, on the login host on which
> you're running Swift, or on IO nodes, one per processor-set.

I don't think that's documented yet. I'll let you know when I have
something there.

From benc at hawaga.org.uk Thu Jun 19 18:18:26 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 19 Jun 2008 23:18:26 +0000 (GMT)
Subject: [Swift-devel] Try coaster on BG/P ?
In-Reply-To: <485AB7BA.9020504@mcs.anl.gov>
References: <485AB7BA.9020504@mcs.anl.gov>
Message-ID:

On Thu, 19 Jun 2008, Michael Wilde wrote:

> Ben or Mihael, can you point me to (or write) a README on how to run Coaster?

For some 'normal' sites:

Build swift with:

ant -Dwith-provider-coaster=true redist

Then look in
trunk/tests/sites/coaster/tgncsa-hg-coaster-pbs-gram2-gram2.xml
for a sites.xml file that should work if you change the workdirectory to
your own space.

Some of the other site definitions in there will work. Some won't (the
gram4 ones and fletch using condor).

--

From benc at hawaga.org.uk Thu Jun 19 18:19:17 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 19 Jun 2008 23:19:17 +0000 (GMT)
Subject: [Swift-devel] Try coaster on BG/P ?
In-Reply-To: <485ABC6C.4070503@mcs.anl.gov>
References: <485AB7BA.9020504@mcs.anl.gov> <1213905166.19600.1.camel@localhost> <485ABC6C.4070503@mcs.anl.gov>
Message-ID:

Probably would want to launch from somewhere else the coaster head node
job and the coaster workers all in one go, by the sound of how things work
on BG/P. The only mode that I've used coasters in so far doesn't do that -
the coaster head node job launches workers as it perceives they are needed.
Where you're being allocated a big chunk of the machine, it probably makes sense to run coasters on all of that chunk at once. -- From wilde at mcs.anl.gov Thu Jun 19 18:23:54 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 19 Jun 2008 18:23:54 -0500 Subject: [Swift-devel] [VOTE] Mike to be dev.globus project chair In-Reply-To: References: <1212598787.8492.8.camel@localhost> <1212615395.14746.0.camel@localhost> <1212616673.14746.4.camel@localhost> <48592B55.5060503@mcs.anl.gov> <1213805038.20858.0.camel@localhost> <485940C9.6080207@mcs.anl.gov> Message-ID: <485AEA8A.7060708@mcs.anl.gov> On 6/18/08 7:25 PM, Ben Clifford wrote: >> By dev.globus rules as far as I can tell, everyone on the developer's >> list can vote but only the committers votes are binding. > > The guidelines define no specific process for the election of a chair > other than: > >> Each Globus project is required to name a project Chair via some process >> defined by the project's committers. > > Specifically they do not mandate that either of the electoral mechanisms, > 'Majority approval' or 'Consensus approval' are used. Sounds reasonable. I proposed a vote for that process, and the vote was conducted. If I said that the vote was proposed because dev.globus guidelines said we need to vote on this issue, I stand corrected. I dont recall what I wrote - and dont think it was critical. As far as I can see, your statement above is unrelated to the issue of votes being binding. > Tibi, as a non-committer, does not get to define the process by which a > project chair is named. However, he might participate in that process, > depending on how it is defined. 
> > Should that process be defined (by the committers) to be 'majority > approval' then he would be able to participate in the majority approval > vote as a 'contributor' and hence a 'member'; in that respect his vote > would count in the second clause of the majority approval requirement that > there be more +1 votes than -1 votes; it would not count towards the first > clause requirement that there be at least three binding +1 votes. OK, sounds pretty complex, but I *think* it makes sense. I ask that we all be patient with each other as we figure this out. Clearly we can get into all sorts of parliamentary issues, but I hope that common sense prevails. >> The guidelines ask that committers indicate this with the string >> "(binding)" after their vote, as in "+1 (binding)". > > They make no such request. They assert that a committer *may* add that > indication, in order to simplify a tally. My interpretation of the > guidelines is that placing such a mark or not placing such a mark does not > affect the nature of the particular committer's vote in any way. Yes, thats what I was assuming it meant, too. > On a non-guideline related note: > >> Date: Wed, 18 Jun 2008 10:35:49 -0500 >> The vote closes at 5:00 PM CDT today, June 18. > > That is an extremely short voting window not even covering one 24h period; > given the global nature of Swift development, this is a bad precedent to > set for vote duration. Agreed - I apologize - I was trying to rush this through because I thought it was a trivial issue, I didnt want to "declare" my self chair without a vote, and thought we could get it done quickly. I agree with you though, we should not rush a vote. Again - please let common sense prevail. The dev.globus rules were intended to help large projects make decisions. 
In the same way that dev.globus should not interfere with pre-existing infrastructure that works well, we should not let it turn a decision making process that was already working well into something that is suddenly burdensome. If thats happening, its not dev.globus's fault - we're doing it to ourselves. No one in dev.globus management is going to question any decisions our group makes on Swift development, as long as no contributors object. If they do, then thats being counter-productive and we should not accept it. I feel that if we make a good faith and common sense attempt to run an open process, then all should work well and all should be happy. To go out of our way to dicker about rules and procedures, just because dev.globus had the burden of defining such things, is unproductive. - Mike From wilde at mcs.anl.gov Thu Jun 19 18:28:14 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 19 Jun 2008 18:28:14 -0500 Subject: [Swift-devel] Vote results, dev.globus notification, and issues Message-ID: <485AEB8E.6050305@mcs.anl.gov> Since we're still feeling our way through this process, I will wait another 24 hours for comments or objections before notifying dev.globus about 2 decisions: - I will serve as Swift dev.globus chair - we have a revised list of committers (Ben, Ian, Mihael, Sarah, and me) I understand that we are waiting on dev.globus for answers on infrastructure before Mihael can finalize that. From wilde at mcs.anl.gov Thu Jun 19 18:37:43 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 19 Jun 2008 18:37:43 -0500 Subject: [Swift-devel] [VOTE] Release Swift 0.6 rc3 as Swift 0.6 In-Reply-To: References: Message-ID: <485AEDC7.6070603@mcs.anl.gov> On 6/15/08 10:08 AM, Ben Clifford wrote: > On Tue, 10 Jun 2008, Ben Clifford wrote: > >> The results of this vote will appear in ~120 hours from now. > > That is now. > > This action item did not receive majority approval. That was a shame. 
I think the problem was that somehow we (or I, at least) got the
impression that I *MUST* test before I can vote. I think that's where the
problem was.

In the past, you cut releases when you felt they were ready, allowing
people ample time to test and comment. I typically did not have time to
test and comment, and was always happy when you cut a release.

This process can work in one of two ways:

- we cut the committers list down further to just those that are likely
to test and vote. That would be Ben, Mihael and Sarah.

- we allow/encourage votes from me and/or Ian to say "sure, go ahead,
release - I trust you". Which in virtually all cases would be the case,
unless in fact we *did* have knowledge or concern over some code issue,
in which case we'd vote.

I'm in favor of the second approach - which basically provides a minimal
gate - a positive ack - from me saying OK to release, which would give
you the needed majority.

So if you want to call this for a vote again, I will vote +1.

From benc at hawaga.org.uk Thu Jun 19 18:45:22 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 19 Jun 2008 23:45:22 +0000 (GMT)
Subject: [Swift-devel] Try coaster on BG/P ?
In-Reply-To:
References: <485AB7BA.9020504@mcs.anl.gov>
Message-ID:

more notes on running this on a normal site:

Set the jobThrottle parameter for a site based on the mechanism used to
submit coasters - that is, for a site that is submitting through GRAM2,
set the throttle to 0.2, which will limit you to 20 jobs at once, and
likely cause just under 20 coaster workers to run. When GRAM4 works,
should be able to use a jobThrottle of 4 and get 400 workers to run (at
least as far as the coaster-submission side of things).

I have not done any tests (though Mihael might have) about how many
coaster workers can run sensibly at once - most of my testing has been
poking round at the low end of things.
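The two figures quoted above (a throttle of 0.2 for about 20 jobs, a throttle of 4 for about 400) suggest a rule of thumb of roughly 100 concurrent jobs per unit of jobThrottle. The sketch below only encodes that rule of thumb in Python; the function names are hypothetical, and the exact formula Swift applies internally is not stated in this thread and may differ slightly.

```python
def approx_concurrent_jobs(job_throttle):
    """Rough job limit implied by the figures in the message (assumed x100)."""
    return round(job_throttle * 100)


def throttle_for_workers(target_workers):
    """Inverse of the rule of thumb: throttle to aim for a worker count."""
    return target_workers / 100


gram2_limit = approx_concurrent_jobs(0.2)   # the GRAM2 case from the message
gram4_limit = approx_concurrent_jobs(4)     # the GRAM4 case from the message
```

As the message notes, the number of coaster workers actually running is likely to end up just under this limit.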
When you start running jobs, the timings will look a bit different -
you'll see a much longer delay than usual when the first job goes into
execute state, as this is when the coaster master starts up on the remote
site and submits a worker into the queue. But once workers start actually
executing, subsequent job executions will be faster (in as much as there
should be no GRAM latency and no LRM latency).

The basic pattern is that as long as there are more jobs ready to be run
than there are workers, more workers will be started (but will be subject
to LRM delays in starting up).

--

From wilde at mcs.anl.gov Thu Jun 19 18:50:21 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 19 Jun 2008 18:50:21 -0500
Subject: [Swift-devel] Try coaster on BG/P ?
In-Reply-To:
References: <485AB7BA.9020504@mcs.anl.gov> <1213905166.19600.1.camel@localhost> <485ABC6C.4070503@mcs.anl.gov>
Message-ID: <485AF0BD.8020607@mcs.anl.gov>

On 6/19/08 6:19 PM, Ben Clifford wrote:
> Probably would want to launch from somewhere else the coaster head node
> job and the coaster workers all in one go, by the sound of how things work
> on BG/P. The only mode that I've used coasters in so far doesn't do that -
> the coaster head node job launches workers as it perceives they are
> needed. Where you're being allocated a big chunk of the machine, it
> probably makes sense to run coasters on all of that chunk at once.

Yes, makes sense.

The BGP mechanism works as you say - the head-node job starts first, and
when it exits from its main process the BGP startup mechanism launches
the worker-node jobs. The server continues on the head node (the ION in
this case) asynchronously.

What's the form of the coaster head-node job? Is that Java, or some form
of script? (I recall Java, but keep forgetting).

On the BGP, if the worker jobs connect back to the headnode process,
then it may be as easy as just starting them, with the right contact
info to reach the headnode.
Do you think Swift/Coaster will scale well if it had up to 640 servers, one for each p-set of 256 compute cores? Thats one "natural" mode of deployment. Another mode is N servers total, where N can be any small number >=1, but these N servers would run on a login node. The issues seem to be connectivity, reachability (having all the components find out how to reach each other), and then performance (how will the algorithms behave in the different topologies). From benc at hawaga.org.uk Thu Jun 19 18:51:57 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 19 Jun 2008 23:51:57 +0000 (GMT) Subject: [Swift-devel] [VOTE] Release Swift 0.6 rc3 as Swift 0.6 In-Reply-To: <485AEDC7.6070603@mcs.anl.gov> References: <485AEDC7.6070603@mcs.anl.gov> Message-ID: On Thu, 19 Jun 2008, Michael Wilde wrote: > That was a shame. I think the problem was that somehow we (or I, at > least) got the impression that I *MUST* test before I can vote. Yes, that is my impression of the rules too. > In the past, you cut releases when you felt they were ready, allowing > people ample time to test and comment. Right. Also, I tended to make releases without much feedback (as in, I think this rc is suitable for release, and i've not had any negative feedback from anyone even if I also haven't had positive feedback). That's lazy majority approval. For the 0.4 release, that worked quite badly - the test suite, which encapsulates pretty much all of the testing I do, missed out on some pretty obvious bugs. dev.globus makes a specific point that a release cannot be lazy (its possibly the only vote that it explicitly prohibits from being lazy) - that would help to avoid the minor fiasco that was Swift 0.4. I'm fairly happy to make releases the old way, though I would appreciate the additional testing that is compelled by the dev.globus release process. 
-- From hategan at mcs.anl.gov Thu Jun 19 18:55:30 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 19 Jun 2008 18:55:30 -0500 Subject: [Swift-devel] [VOTE] Mike to be dev.globus project chair In-Reply-To: <485AEA8A.7060708@mcs.anl.gov> References: <1212598787.8492.8.camel@localhost> <1212615395.14746.0.camel@localhost> <1212616673.14746.4.camel@localhost> <48592B55.5060503@mcs.anl.gov> <1213805038.20858.0.camel@localhost> <485940C9.6080207@mcs.anl.gov> <485AEA8A.7060708@mcs.anl.gov> Message-ID: <1213919730.29014.3.camel@localhost> > OK, sounds pretty complex, but I *think* it makes sense. I ask that we > all be patient with each other as we figure this out. Clearly we can get > into all sorts of parliamentary issues, but I hope that common sense > prevails. The only problem with common sense is that it doesn't exist when people disagree. > > Again - please let common sense prevail. The dev.globus rules were > intended to help large projects make decisions. In the same way that > dev.globus should not interfere with pre-existing infrastructure that > works well, we should not let it turn a decision making process that was > already working well into something that is suddenly burdensome. > The vote was unnecessary. I'm not aware of anybody (besides perhaps Jen commenting about when and whether project chairs should be changed) had any problem with you being the chair. I find it really odd that there is insistence on the matter. From benc at hawaga.org.uk Thu Jun 19 18:56:13 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 19 Jun 2008 23:56:13 +0000 (GMT) Subject: [Swift-devel] Try coaster on BG/P ? In-Reply-To: <485AF0BD.8020607@mcs.anl.gov> References: <485AB7BA.9020504@mcs.anl.gov> <1213905166.19600.1.camel@localhost> <485ABC6C.4070503@mcs.anl.gov> <485AF0BD.8020607@mcs.anl.gov> Message-ID: On Thu, 19 Jun 2008, Michael Wilde wrote: > Whats the form of the coaster head-node job? Is that Java, or some form of > script? 
(I recall Java, but keep forgetting). java I think, with perl on the workers. > Do you think Swift/Coaster will scale well if it had up to 640 servers, > one for each p-set of 256 compute cores? Thats one "natural" mode of > deployment. At the moment, that would look like 640 sites with 256 compute nodes each. I've never seen Swift run with anything like that number of sites - the biggest is probably Xi's work with tens of OSG sites; I've also not tried coaster out with hundreds of workers (mostly because I don't have an easy way at the moment to get that many workers running on a site). I'm fairly certain some interesting scalability problems would appear running at that scale, both in Swift and in the coaster code. -- From wilde at mcs.anl.gov Thu Jun 19 18:56:50 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 19 Jun 2008 18:56:50 -0500 Subject: [Swift-devel] Try coaster on BG/P ? In-Reply-To: References: <485AB7BA.9020504@mcs.anl.gov> Message-ID: <485AF242.4050005@mcs.anl.gov> Cool - very helpful info. It will still be 1-2 weeks before I can try this. Perhaps Zhao can try it sooner. Re: > The basic pattern is as long as there are more jobs ready to be run than > there are workers, then more workers will be started This is where, as you suggested earlier, for BGP-like-systems would be best if this was static and disabled. What Ioan did in Falkon when he went to the multiple-server architecture is relevant here: the client load-shares among all the servers, round-robin, only sending a job to a server when it knows that the server has a free cpu slot. In this way, no queues build up on the servers, and it avoids having a job wait in any server's queue when a free cpu might be available on some other server. 
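The load-sharing idea described above (round-robin over the servers, sending a job only to a server known to have a free cpu slot, so that waiting jobs stay on the client) can be sketched roughly as follows. This is an illustrative Python sketch, not Falkon's or Coaster's actual code: the Server and RoundRobinDispatcher classes and all their fields are hypothetical names, and it assumes the client is told whenever a job finishes.

```python
from collections import deque


class Server:
    """Hypothetical stand-in for one server and the cpu slots it manages."""

    def __init__(self, name, slots):
        self.name = name
        self.free = slots       # free cpu slots as known to the client
        self.running = []       # jobs currently placed on this server


class RoundRobinDispatcher:
    def __init__(self, servers):
        self.servers = list(servers)
        self.next_index = 0
        self.pending = deque()  # jobs wait here on the client, never on a server

    def submit(self, job):
        self.pending.append(job)
        self._dispatch()

    def job_finished(self, server, job):
        server.running.remove(job)
        server.free += 1
        self._dispatch()

    def _dispatch(self):
        n = len(self.servers)
        while self.pending:
            # Scan each server once, starting where the last placement stopped.
            for offset in range(n):
                idx = (self.next_index + offset) % n
                server = self.servers[idx]
                if server.free > 0:
                    job = self.pending.popleft()
                    server.free -= 1
                    server.running.append(job)
                    self.next_index = (idx + 1) % n
                    break
            else:
                return  # no server has a free slot; keep the jobs pending


s0, s1 = Server("s0", 1), Server("s1", 1)
dispatcher = RoundRobinDispatcher([s0, s1])
for job in ("a", "b", "c"):
    dispatcher.submit(job)
waiting_before = list(dispatcher.pending)   # "c" waits on the client
dispatcher.job_finished(s0, "a")            # a slot frees up, "c" is placed
```

Because a job only leaves the client when some server has a free slot, no per-server queue can build up while another server sits idle, which is the property the message describes.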
- Mike On 6/19/08 6:45 PM, Ben Clifford wrote: > more notes on running this on a normal site: > > Set the jobThrottle parameter for a site based on the mechanism used to > submit coasters - that is, for a site that is submitting through GRAM2, > set the throttle to 0.2, which will limit you to 20 jobs at once, and > likely cause just under 20 coaster workers to run. When GRAM4 works, > should be able to use a jobThrottle of 4 and get 400 workers to run (at > least as far as the coaster-submission side of things). > > I have done any tests (though Mihael might have) about how many coaster > workers can run sensibly at once - most of my testing has been poking > round at the low end of things. > > When you start running jobs, the timings will look a bit different - > you'll see a much longer delay than usual when the first job goes into > execute state, as this is when the coaster master starts up on the remote > site and submits a worker into the queue But once workers start actually > executing, subsequent job executions will be faster (in as much as there > should be no GRAM latency and no LRM latency). > > The basic pattern is as long as there are more jobs ready to be run than > there are workers, then more workers will be started (but will be subject > to LRM delays in starting up). > From hategan at mcs.anl.gov Thu Jun 19 19:00:43 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 19 Jun 2008 19:00:43 -0500 Subject: [Swift-devel] Try coaster on BG/P ? In-Reply-To: <485AF0BD.8020607@mcs.anl.gov> References: <485AB7BA.9020504@mcs.anl.gov> <1213905166.19600.1.camel@localhost> <485ABC6C.4070503@mcs.anl.gov> <485AF0BD.8020607@mcs.anl.gov> Message-ID: <1213920043.29014.8.camel@localhost> On Thu, 2008-06-19 at 18:50 -0500, Michael Wilde wrote: > On 6/19/08 6:19 PM, Ben Clifford wrote: > > Probably would want to launch from somewhere else the coaster head node > > job and the coaster workers all in one go, by the sound of how things work > > on BG/P. 
The only mode that I've used coasters in so far doesn't do that - > > the coaster head node job launches workers as it percieves they are > > needed. Where you're being allocated a big chunk of the machine, it > > probably makes sense to run coasters on all of that chunk at once. > > Yes, makes sense. > > The BGP mechanism works as you say - the head-node job starts first, and > when it exits from its main process the BGP startup mechanism launches > the worker-node jobs. The server continues on the head node (the ION in > this case) asynchronously. > > Whats the form of the coaster head-node job? Is that Java, or some form > of script? (I recall Java, but keep forgetting). Bash forks Java. > > On the BGP, if the worker jobs connect back to the headnode process, > then it may be as easy as just starting them, with the right contact > info to reach the headnode. That's how it works. > > Do you think Swift/Coaster will scale well if it had up to 640 servers, > one for each p-set of 256 compute cores? Thats one "natural" mode of > deployment. Experience will tell. > > Another mode is N servers total, where N can be any small number >=1, > but these N servers would run on a login node. Is there a purpose to that? 'Cause to me it sounds more like trouble than goodness. From wilde at mcs.anl.gov Thu Jun 19 19:06:05 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 19 Jun 2008 19:06:05 -0500 Subject: [Swift-devel] [VOTE] Release Swift 0.6 rc3 as Swift 0.6 In-Reply-To: References: <485AEDC7.6070603@mcs.anl.gov> Message-ID: <485AF46D.3090600@mcs.anl.gov> > I'm fairly happy to make releases the old way, though I would appreciate > the additional testing that is compelled by the dev.globus release > process. I agree completely on the testing need. Lets do that, then. Call for the vote as you did, continue to urge test-before-vote, but accept vote-without-test in the interest of progress. 
Progress-through-breakage is not ideal, but given Swift is quite young and the user community is small, it's better to do that, to push forward with improvements, even if bugs occasionally slip through. You and Mihael always fix things fast, and that has always worked well to date. - Mike On 6/19/08 6:51 PM, Ben Clifford wrote: > On Thu, 19 Jun 2008, Michael Wilde wrote: > >> That was a shame. I think the problem was that somehow we (or I, at >> least) got the impression that I *MUST* test before I can vote. > > Yes, that is my impression of the rules too. > >> In the past, you cut releases when you felt they were ready, allowing >> people ample time to test and comment. > > Right. Also, I tended to make releases without much feedback (as in, I > think this rc is suitable for release, and i've not had any negative > feedback from anyone even if I also haven't had positive feedback). That's > lazy majority approval. For the 0.4 release, that worked quite badly - the > test suite, which encapsulates pretty much all of the testing I do, missed > out on some pretty obvious bugs. > > dev.globus makes a specific point that a release cannot be lazy (it's > possibly the only vote that it explicitly prohibits from being lazy) - > that would help to avoid the minor fiasco that was Swift 0.4. > > I'm fairly happy to make releases the old way, though I would appreciate > the additional testing that is compelled by the dev.globus release > process. > From hategan at mcs.anl.gov Thu Jun 19 19:10:14 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 19 Jun 2008 19:10:14 -0500 Subject: [Swift-devel] Try coaster on BG/P ?
In-Reply-To: <485AF242.4050005@mcs.anl.gov> References: <485AB7BA.9020504@mcs.anl.gov> <485AF242.4050005@mcs.anl.gov> Message-ID: <1213920614.29014.12.camel@localhost> On Thu, 2008-06-19 at 18:56 -0500, Michael Wilde wrote: > What Ioan did in Falkon when he went to the multiple-server architecture > is relevant here: the client load-shares among all the servers, > round-robin, only sending a job to a server when it knows that the > server has a free cpu slot. In this way, no queues build up on the > servers, and it avoids having a job wait in any server's queue when a > free cpu might be available on some other server. > If you have O(1) scheduling, this shouldn't be necessary. It's like i2u2: Don't build a cluster to reduce the odds of triggering a problem. Fix the problem instead. From iraicu at cs.uchicago.edu Thu Jun 19 19:58:43 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Thu, 19 Jun 2008 19:58:43 -0500 Subject: [Swift-devel] Try coaster on BG/P ? In-Reply-To: <1213920614.29014.12.camel@localhost> References: <485AB7BA.9020504@mcs.anl.gov> <485AF242.4050005@mcs.anl.gov> <1213920614.29014.12.camel@localhost> Message-ID: <485B00C3.4080600@cs.uchicago.edu> I am not sure of what problem you are referring to fix? The issue with Falkon, is that there are queues at the service. If a client submits all its jobs to a single service (that only manages 256 CPUs), there could be 639 other services with 160K - 256 CPUs that are left idle (worst case, which wouldn't happen very often, but could still happen towards the ends of runs when there isn't enough work to keep everyone busy). There are only 2 solutions. 
1) never queue anything up at the services, only send tasks from the client to a service when we know there is an available CPU to run that task; this is the approach we took 2) allow tasks to time out after some time, and trigger a resubmit of the same task to another service, and keep doing this until a reply to that task comes back; this seems like it would introduce unnecessarily long delays, and cause load imbalances towards the end of runs when there isn't enough work to keep all busy In essence, there is no problem to solve here, it's just a question of which solution you take, in such a distributed tree-like environment, where you have 1 client, N services, and M workers. N is a value between 1 and 640, and M could be as high as 160K, with a ratio of 1:256 between N:M. Ioan From hategan at mcs.anl.gov Thu Jun 19 20:30:57 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 19 Jun 2008 20:30:57 -0500 Subject: [Swift-devel] Try coaster on BG/P ?
In-Reply-To: <485B00C3.4080600@cs.uchicago.edu> References: <485AB7BA.9020504@mcs.anl.gov> <485AF242.4050005@mcs.anl.gov> <1213920614.29014.12.camel@localhost> <485B00C3.4080600@cs.uchicago.edu> Message-ID: <1213925457.31071.7.camel@localhost> There's probably a misunderstanding. Mike seemed to suggest that, when using BG/P, there should be multiple services in order to distribute load. That I think is a problem. But I also now think he was referring to the case in which multiple clusters are used, in which case what you say applies. We've pretty much discussed this, and (1) is what we would eventually want to achieve with Swift + Coasters. From iraicu at cs.uchicago.edu Thu Jun 19 22:24:10 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Thu, 19 Jun 2008 22:24:10 -0500 Subject: [Swift-devel] Try coaster on BG/P ? In-Reply-To: <1213925457.31071.7.camel@localhost> References: <485AB7BA.9020504@mcs.anl.gov> <485AF242.4050005@mcs.anl.gov> <1213920614.29014.12.camel@localhost> <485B00C3.4080600@cs.uchicago.edu> <1213925457.31071.7.camel@localhost> Message-ID: <485B22DA.9000509@cs.uchicago.edu> Mihael Hategan wrote: > There's probably a misunderstanding. Mike seemed to suggest that, when > using BG/P, there should be multiple services in order to distribute > load. Yes, he was correct. > That I think is a problem. I don't follow. If your goal is to just show that it works at small scales (100s, maybe 1000s of CPUs), you don't need this, but if you want to have any chance of scaling to 160K CPUs, I don't think you'll have many options :( > But I also now think he was referring > to the case in which multiple clusters are used, in which case what you > say applies. You can call them whatever you want.
It is 1 machine, which is composed of 640 P-SETs, each P-SET having an I/O node (i.e. think of a login node from a cluster), and 64 compute nodes (4 CPU cores each node) on a private network behind each I/O node. So, if you want to think of the 160K CPU BG/P as a collection of 640 clusters, you can, but it's really 1 big machine. The trouble with not using the 640 I/O nodes is that 1 (or a few, up to 10) login nodes have to manage 160K CPUs. If you use the I/O nodes as well, then 1 (or up to 10) login nodes can manage 640 I/O nodes, which in turn will each manage 256 CPUs, a breakdown that is certainly more manageable. We have had trouble running Falkon reliably at more than 10K~20K CPUs using a single Falkon service (i.e. running on 1 login node), but when we turned to the hierarchical solution, we have gotten up to 64K CPUs and it worked without any sign of stress or problems. BTW, the trouble we had when managing all CPUs from a single service was probably due to the fact that we were using persistent sockets, and using select to manage 10K+ active sockets. We have an option to run without persistent sockets which will scale better, but we haven't tested this on the BG/P yet as it involves Java on the compute nodes (which we heard works, but we haven't tried it yet). > We've pretty much discussed this, and (1) is what we would > eventually want to achieve with Swift + Coasters. > Right, I agree that option #1 is the desired goal. Ioan -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E.
58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== From hategan at mcs.anl.gov Thu Jun 19 23:01:41 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 19 Jun 2008 23:01:41 -0500 Subject: [Swift-devel] Try coaster on BG/P ? In-Reply-To: <485B22DA.9000509@cs.uchicago.edu> References: <485AB7BA.9020504@mcs.anl.gov> <485AF242.4050005@mcs.anl.gov> <1213920614.29014.12.camel@localhost> <485B00C3.4080600@cs.uchicago.edu> <1213925457.31071.7.camel@localhost> <485B22DA.9000509@cs.uchicago.edu> Message-ID: <1213934501.1194.18.camel@localhost> On Thu, 2008-06-19 at 22:24 -0500, Ioan Raicu wrote: > > > Mihael Hategan wrote: > > There's probably a misunderstanding. Mike seemed to suggest that, when > > using BG/P, there should be multiple services in order to distribute > > load. > Yes, he was correct. > > That I think is a problem. > I don't follow. If your goal is to just show that it works at small > scales (100s, maybe 1000s of CPUs), you don't need this, but if you > want to have any chance of scaling to 160K CPUs, I don't think you'll > have many options :( If your service scales linearly, then splitting it into multiple processes does not help. But now you have more services to maintain. That's because k*n = c*k*(n/c), where k would be your linearity factor. If you have worse, say k*n^2, then dividing makes sense because c*k*((n/c)^2) = k*(n^2)/c, which is better than k*(n^2). The point is that I'd rather spend my time making the algorithm linear than dealing with multiple services.
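The arithmetic behind this argument can be checked with a short sketch. The function names are hypothetical; n (workers), k (cost factor) and c (number of services) are the symbols used in the message, and the cost model itself (k*n or k*n^2) is the message's assumption, not a measurement. For a linear cost, c*k*(n/c) = k*n, so splitting gains nothing; for a quadratic cost, c*k*(n/c)^2 = k*n^2/c, so splitting cuts the total by a factor of c.

```python
def cost_single(n, k, exponent):
    """Cost of one service handling all n workers: k * n**exponent."""
    return k * n ** exponent


def cost_split(n, k, exponent, c):
    """Total cost of c services, each handling n/c workers."""
    return c * k * (n / c) ** exponent


# Example figures from the thread: 160K CPUs, 640 services of 250 CPUs each
# (the thread rounds this to 256 per service).
n, k, c = 160_000, 1.0, 640
linear_gain = cost_single(n, k, 1) / cost_split(n, k, 1, c)      # no gain
quadratic_gain = cost_single(n, k, 2) / cost_split(n, k, 2, c)   # gain of c
```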
Now, of course, as you mention, it may not be possible to do so because the problem is at the networking layer. So we should probably stop talking until we know what the actual bottleneck is. And I mean *know*. Do we? From benc at hawaga.org.uk Fri Jun 20 05:41:32 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 20 Jun 2008 10:41:32 +0000 (GMT) Subject: [Swift-devel] numeric types still Message-ID: Numeric types bother me still. Substantially more so now that my GSoC student Milena has been working on typechecking and has reached the point where the somewhat vague definitions of numerical types and implicit casting are causing trouble. Option A: I'm very much in favour of a single numeric type, represented internally by a java.lang.Double. This is very much in line with what the implementation does at the moment (for example the thread Tibi started the other day, where @extractint actually uses java.lang.Double and can be used in either an int or a float context in SwiftScript). This would change (though not necessarily make worse) the behaviour of integers when trying to go out of the range of what can be represented to the accuracy of 1. With integer representation, max_int+1 would roll round to min_int, those being (respectively) +/- 2^31. With double representation, the largest accurate integer is around 2^52 (which is larger than integer can store, though not as long as if we used a long); after that I think that the problem will be that 2^52 + 1 = 2^52 (so incrementing will get stuck rather than rolling round). (give or take a few orders of magnitude here, and assuming IEEE754 representation) These numbers are sufficiently large to probably not cause problems any time soon in real usage of SwiftScript; should accuracy prove a problem, there is a clear path to move from double to some other more accurate general purpose numerical representation that does not massively change the language (for example, switching to java.math.BigDecimal).
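The two overflow behaviours described above can be illustrated with a small sketch. It assumes Python floats are IEEE 754 doubles (true on mainstream platforms), so they stand in for java.lang.Double; the 32-bit wraparound is simulated, since Python integers do not overflow. The helper names are hypothetical. For IEEE 754 doubles the exact threshold at which incrementing gets stuck is 2^53, which is within the "give or take" of the estimate in the message.

```python
INT_MAX = 2**31 - 1          # largest 32-bit signed integer
INT_MIN = -2**31             # smallest 32-bit signed integer
DOUBLE_EXACT_LIMIT = 2**53   # every integer up to here is exact in a double


def wrap_int32(value):
    """Simulate 32-bit two's-complement wraparound of an integer result."""
    return (value - INT_MIN) % 2**32 + INT_MIN


def increment_is_stuck(x):
    """True if adding 1 to x as a double leaves the value unchanged."""
    return float(x) + 1.0 == float(x)


rolled = wrap_int32(INT_MAX + 1)                     # rolls round to INT_MIN
stuck = increment_is_stuck(DOUBLE_EXACT_LIMIT)       # increment is lost
not_stuck = increment_is_stuck(DOUBLE_EXACT_LIMIT - 1)
```

With integers the counter wraps to a large negative value; with doubles it silently stops advancing, which is the behaviour change the message refers to.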
On the downside, it changes type names - float and int both become 'number' or some such. However, float and int could be deprecated and internally aliased to 'number'. I think that, given the actual use of numbers in SwiftScript (as counters and things like that), a rich numerical type system is not needed; and that this requires only a straightforward code change to implement. Option B: We could instead define when it is appropriate to implicitly cast between different types of numbers and in which direction. I think this is more difficult to implement - it needs someone who is interested in thinking about and defining the casting rules, which I think no one is. I think this also would increase, rather than decrease, the complexity of the code base. As you can probably tell, I am not in favour of option B. I'm going offline this weekend; but Milena is lurking on this list and might even come out of hiding to discuss further. -- From wilde at mcs.anl.gov Fri Jun 20 09:07:55 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 20 Jun 2008 09:07:55 -0500 Subject: [Swift-devel] numeric types still In-Reply-To: References: Message-ID: <485BB9BB.1070401@mcs.anl.gov> I lean to option A as well, as long as integer-like kinds of things work as expected. I also wonder if the solitary base scalar types can be String and File, and all arithmetic operations essentially implemented as if they are external functions - much like /bin/sh users would use expr for all calculations. The type system needs to ensure that all operations on datasets are type correct, and this would make arithmetic follow the same conventions. So just like a user can define "type Foo File"; they could also say "type Int String;", "type Float String;", "type Number String;" etc. And we could use an interpreter like dc, expr, awk, perl, or bash to do all arithmetic ops. Some built-in interpreter would be chosen as the base implementation.
So the language could appear to have a flexible type system, but in fact it would be very simple and regular, and make no distinction between any kinds of numeric types as being special in any way. No implicit casting, etc. - that will remove much potential complexity. This concept may or may not be feasible or desirable - I or others would need to explore it on paper a bit more. But I think we all share the goal of keeping the Swift language as minimalist as possible, retaining its nature as a language with which to invoke chains of external programs. - Mike From hategan at mcs.anl.gov Fri Jun 20 09:52:05 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 20 Jun 2008 09:52:05 -0500 Subject: [Swift-devel] numeric types still In-Reply-To: References: Message-ID: <1213973525.3203.4.camel@localhost> I believe ML uses B and requires explicit conversion. For example, it only defines +:int*int->int and +:real*real->real. So 1 + 0.5 causes an error. You'd have to either do real(1) + 0.5 (or presumably 1.0 + 0.5) or 1 + int(0.5). Which doesn't look bad to me.
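A minimal sketch of how a typechecker could enforce the explicit-conversion rule described above. This is an illustration only: `PLUS_SIGNATURES`, `type_of` and `check_plus` are hypothetical names, not part of Swift or of any ML implementation, and Python's `float()` and `int()` merely stand in for the explicit conversions written as real(1) and int(0.5) in the message.

```python
# The only signatures defined for '+': there is no mixed int/real form.
PLUS_SIGNATURES = {("int", "int"): "int", ("real", "real"): "real"}


def type_of(literal):
    """Map a literal to the type name used by the checker."""
    if isinstance(literal, bool):
        raise TypeError("booleans are not numeric in this sketch")
    if isinstance(literal, int):
        return "int"
    if isinstance(literal, float):
        return "real"
    raise TypeError(f"not a numeric literal: {literal!r}")


def check_plus(left, right):
    """Return the result type of left + right, or reject mixed operands."""
    key = (type_of(left), type_of(right))
    if key not in PLUS_SIGNATURES:
        raise TypeError(
            f"no '+' defined for {key[0]} * {key[1]}; convert explicitly"
        )
    return PLUS_SIGNATURES[key]
```

Under this rule `check_plus(1, 0.5)` is rejected, while `check_plus(float(1), 0.5)` and `check_plus(1, int(0.5))` are accepted, so every conversion is visible in the program text.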
From iraicu at cs.uchicago.edu Fri Jun 20 12:44:35 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Fri, 20 Jun 2008 12:44:35 -0500 Subject: [Swift-devel] Try coaster on BG/P ? In-Reply-To: <1213934501.1194.18.camel@localhost> References: <485AB7BA.9020504@mcs.anl.gov> <485AF242.4050005@mcs.anl.gov> <1213920614.29014.12.camel@localhost> <485B00C3.4080600@cs.uchicago.edu> <1213925457.31071.7.camel@localhost> <485B22DA.9000509@cs.uchicago.edu> <1213934501.1194.18.camel@localhost> Message-ID: <485BEC83.1000002@cs.uchicago.edu> Mihael Hategan wrote: > On Thu, 2008-06-19 at 22:24 -0500, Ioan Raicu wrote: > >> Mihael Hategan wrote: >> >>> There's probably a misunderstanding. Mike seemed to suggest that, when >>> using BG/P, there should be multiple services in order to distribute >>> load. >>> >> Yes, he was correct. >> >>> That I think is a problem. >>> >> I don't follow.
If your goal is to just show that it works at small >> scales (100s, maybe 1000s of CPUs), you don't need this, but if you >> want to have any chance of scaling to 160K CPUs, I don't think you'll >> have many options :( >> > > If your service scales linearly, then splitting it into multiple > processes does not help. But now you have more services to maintain. > That's because k*n = c*k*(n/c), where k would be your linearity factor. > If you have worse, say k*n^2, then dividing makes sense because > c*k*((n/c)^2) = k*(n^2)/c, which is better than k*(n^2). > > The point is that I'd rather spend my time making the algorithm linear > than dealing with multiple services. > > Now, of course, as you mention, it may not be possible to do so because > the problem is at the networking layer. So we should probably stop > talking until we know what the actual bottleneck is. And I mean *know*. > Do we? > For Falkon, it was a networking issue (coupled with the amount of CPU/RAM on the node where the service was running) that was causing one Falkon service to not scale beyond 10K+ CPUs reliably, when using persistent sockets. Note that when not using persistent sockets, as is the case with GT4.0.x WS, we were able to scale to 50K CPUs just fine, but in this case, there were never more than a few 100 TCP connections that the service had to maintain at the same time, which is why it scaled so well. Now, that is not to say that your implementation of Coaster won't scale to 160K CPUs all from 1 service, but from my experience, a server (implemented in Java anyways) using select with 2~4GB of memory and 4 CPU cores will not be able to handle 100K+ concurrent TCP connections that are all active at the same time. Anyways, I never did a thorough study of this to see what part of the networking stack or OS level calls was the problem...
I'd be curious to see how far Coaster will scale with a single service using TCP, so it might be worth running 1 Coaster service on a login node, and trying to see how many CPUs it can manage before running into trouble. Ioan > > -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== From hategan at mcs.anl.gov Fri Jun 20 20:21:39 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 20 Jun 2008 20:21:39 -0500 Subject: [Swift-devel] Re: Issues Re: hibernation status review for swift In-Reply-To: <485C36EB.6000002@mcs.anl.gov> References: <20080620214339.9EA85B000062@sumac.eol.org> <485C2AC5.5060309@mcs.anl.gov> <1214001378.18296.9.camel@localhost> <485C36EB.6000002@mcs.anl.gov> Message-ID: <1214011299.18296.24.camel@localhost> On Fri, 2008-06-20 at 18:02 -0500, Michael Wilde wrote: > >> > >> 2) We have adjusted our list of committers (by vote) and will get those > >> people on the @globus list. > > > > I'd be careful about making such a statement, because I don't think it > > was the consensus. > > I meant that we took a vote on committers and reduced the list of people > designated as committers. > > What you say below pertains to the email routing, correct? The vote was > solely about who is a committer.
I assume you are not disagreeing with > that, and that we should update the wiki page to reflect this: > http://dev.globus.org/wiki/Incubator/Swift#Committers Well, there were two statements separated by "and". The second was inaccurate to my understanding. I replied to your email because it misrepresented what I understood was the desired (from our group) way to proceed. > > > A better statement would probably be: > > - Any message sent to the user at globus and devel at globus will be received > > by all committers. > > - All commits to the SVN will be received by anybody on commit at globus. > > Both of these sound reasonable to me. Do they comply with dev.globus rules? That was one of the questions. Jen's email didn't seem to clarify despite the fact that I asked it in what I think was a very clear way. > > Are these lists now in place and active? The lists are in place, but I did not set up the forwarding. I was waiting for word from the incubator management. However, the way I will proceed is to set up forwarding and let them sort out whether that's a problem or not. > > Do we have open issues regarding email management? Some discussion on > swift-devel suggested (implied?) we should consider making the > @globus.dev lists our official (or sole) lists. Should we discuss doing > that? No infrastructure changes we agreed? > > As it stands now, is the plan (if we dont make further changes) to leave > the primary lists be swift-{user,devel}@ci.uchicago.edu, with some > amount of mirroring of content and membership to the dev.globus lists? The intent, as I understand it, is for the @globus mailing lists to be monitored. Forwarding achieves that, and at least Charles seems to agree with that view. Jen doesn't. My interest was in clarifying which one is the official incubator management position. Bummer! > > What other infrastructure issues need to be resolved and completed? We'll need to have our roadmap open to public comments. 
The CI Trac doesn't quite fulfill that, so we'll need to move it to bugzilla. > > What other answers are we waiting on dev.globus for? Whether I, as a non-chair committer, can at all interact with the incubator management. The answer is, again, fuzzy, but it seems to suggest that I cannot. So good luck! Mihael From benc at hawaga.org.uk Wed Jun 25 08:46:10 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 25 Jun 2008 13:46:10 +0000 (GMT) Subject: [Swift-devel] scheduler changes to deal with fast-failing sites Message-ID: I played around a bit with how the scheduler in Swift (actually in Karajan) deals with low scoring sites. Previously a site would always take at least 2 jobs at once, even in the case of a very poorly scoring site. This causes bug 101 where a site that rapidly fails jobs can eat up all the retries for all the jobs in a SwiftScript program, and cause the SwiftScript program as a whole to fail. The attached patch changes that behaviour. The scoring of well-performing sites is basically the same. Instead of a base of 2 jobs, with more being added according to tscore * jobThrottle, a base of 1 job is now used. This should not cause much change in behaviour for well-performing sites. However, the score can now go below 1 for poorly performing sites. In that case, a delay is enforced between submissions to a particular site. The length of that delay increases exponentially as the site score decreases. A few quick tests with provider-wonky suggest that this does a fairly good job of rapidly eliminating poorly performing sites running locally on my laptop. I'd be interested if anyone (especially Xi) tries this in a real life multi-site situation. In combination with replication to deal with slow-fail sites, I hope that this makes multi-site usage of Swift work much better. 
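A minimal sketch of the back-off described above, assuming the delay formula given later in this thread (t = e^(-score) * 100 ms for a site with a negative score). The function names and the may_submit helper are hypothetical and are not taken from the Karajan source; treating a score of exactly 0 as "no delay" is also an assumption.

```python
import math

BASE_DELAY_MS = 100.0  # constant from the formula quoted in this thread


def submission_delay_ms(score: float) -> float:
    """Minimum gap, in ms, between two submissions to the same site.

    Assumption: sites with score >= 0 are not delayed at all. Sites with
    score < 0 wait e^(-score) * 100 ms, so the gap grows exponentially
    as the score falls.
    """
    if score >= 0:
        return 0.0
    return math.exp(-score) * BASE_DELAY_MS


def may_submit(score: float, now_ms: float, last_submit_ms: float) -> bool:
    """Hypothetical check: has the required gap since the last submission passed?"""
    return now_ms - last_submit_ms >= submission_delay_ms(score)


if __name__ == "__main__":
    # Show how quickly the gap widens as a site's score drops.
    for score in (1.0, -1.0, -3.0, -5.0):
        print(f"score {score:+.1f}: wait {submission_delay_ms(score):.0f} ms")
```

Under this formula a site is never removed outright: it is still tried occasionally, but each further drop in score by 1 multiplies the wait by e.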
The patch, which applies against cog r2056 is at http://www.ci.uchicago.edu/~benc/backoff-less-than-zero-1.patch -- From lixi at uchicago.edu Wed Jun 25 09:42:50 2008 From: lixi at uchicago.edu (lixi at uchicago.edu) Date: Wed, 25 Jun 2008 09:42:50 -0500 (CDT) Subject: [Swift-devel] Re: scheduler changes to deal with fast-failing sites Message-ID: <20080625094250.BBQ11800@m4500-03.uchicago.edu> >I'd be interested if anyone (especially Xi) tries this in a real life >multi-site situation. I just ran a workflow with 51 jobs (50 of them can be run in parallel). It finished successfully and quickly. The log file is on CI: /home/lixi/newswift/test/newversion/workflowtest- 20080625-0921-oi2tnbrd.log Now, a workflow with 501 jobs (500 of them can be run in parallel) is running now. The sites.file includes 12 sites, including two or three poor performance sites. From foster at mcs.anl.gov Wed Jun 25 10:08:32 2008 From: foster at mcs.anl.gov (Ian Foster) Date: Wed, 25 Jun 2008 10:08:32 -0500 Subject: [Swift-devel] Re: scheduler changes to deal with fast-failing sites In-Reply-To: <20080625094250.BBQ11800@m4500-03.uchicago.edu> References: <20080625094250.BBQ11800@m4500-03.uchicago.edu> Message-ID: <5ABD7463-CA04-4F35-AEF3-A82FDC328E54@mcs.anl.gov> Nice! On Jun 25, 2008, at 9:42 AM, wrote: >> I'd be interested if anyone (especially Xi) tries this in a > real life >> multi-site situation. > > I just ran a workflow with 51 jobs (50 of them can be run in > parallel). It finished successfully and quickly. The log > file is on > CI: /home/lixi/newswift/test/newversion/workflowtest- > 20080625-0921-oi2tnbrd.log > > Now, a workflow with 501 jobs (500 of them can be run in > parallel) is running now. The sites.file includes 12 sites, > including two or three poor performance sites. 
> _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Wed Jun 25 10:16:06 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 25 Jun 2008 10:16:06 -0500 Subject: [Swift-devel] scheduler changes to deal with fast-failing sites In-Reply-To: References: Message-ID: <1214406966.23647.1.camel@localhost> On Wed, 2008-06-25 at 13:46 +0000, Ben Clifford wrote: > I played around a bit with how the scheduler in Swift (actually in > Karajan) deals with low scoring sites. > ... > The patch, which applies against cog r2056 is at > http://www.ci.uchicago.edu/~benc/backoff-less-than-zero-1.patch > I don't think there is any reason not to commit this. From lixi at uchicago.edu Wed Jun 25 10:44:10 2008 From: lixi at uchicago.edu (lixi at uchicago.edu) Date: Wed, 25 Jun 2008 10:44:10 -0500 (CDT) Subject: [Swift-devel] Re: scheduler changes to deal with fast-failing sites Message-ID: <20080625104410.BBQ18567@m4500-03.uchicago.edu> >I'd be interested if anyone (especially Xi) tries this in a real life >multi-site situation. Unfortunately, just now the workflow with 501 jobs failed due to: "No status file was found. Check the shared filesystem on CIT_CMS_T2" In fact, this is the most frequent error I encountered so far. I am thinking how to avoid this kind of error for a long time. I tried to check the remote directory using df command and make directory, transfer files, etc. These operations outside of Swift could be done successfully. So I still wonder how to avoid it, or could we think of adapting Swift to such sites as CIT_CMS_T2, MIT_CMS, and so on? >The scoring of well-performing sites is basically the same. Instead of a >base of 2 jobs, with more being added according to tscore * jobThrottle, >instead a base of 1 job is used. This should not cause much change in >behaviour for well-performing sites. 
> >However, the score can now go below 1 for poorly performing sites. In that >case, a delay is enforced between submissions to a particular site. The >length of that delay increases exponentially as the site score decreases. In addition to such improvements, and to filtering out some sites and assigning initial scores, which I've already done, I have been thinking about other methods these days. Currently Swift relies only on "scores" to judge the performance of sites, and those scores are in turn the only metric for site selection. Can we define different states for sites? For example, candidate, frozen, etc. "Candidate" just means that the site can be selected based on its score/tscore. If a site fails, we could designate it as "frozen", at least for the current job, so that it does not eat up more retries. A frozen site could be unfrozen once some condition is met, such as a certain amount of time passing, and then used for other new jobs. Of course, these are just some simple ideas I am considering at the moment. I am going to work out a more detailed and feasible process. Any suggestions are warmly welcome. Thanks, Xi From hategan at mcs.anl.gov Wed Jun 25 11:05:14 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 25 Jun 2008 11:05:14 -0500 Subject: [Swift-devel] Re: scheduler changes to deal with fast-failing sites In-Reply-To: <20080625104410.BBQ18567@m4500-03.uchicago.edu> References: <20080625104410.BBQ18567@m4500-03.uchicago.edu> Message-ID: <1214409914.24613.1.camel@localhost> > In addition such improvements, as well as filtering out some > sites and giving initial scores which I've done, I am > thinking of other methods these days. Now in Swift, we only > rely on "scores" to determine the performance of sites > which are in turn the only metrics for site selection. Can > we set the different states for sites? For example, > candidate, frozen, etc. "Candidate" just means that we could > select site from them based on their scores/Tscores. 
If the > site fails, we could designate it as "frozen", at least for > the current job, avoiding more retries would be eaten up. A > frozen site could be unfrozen for satisfying different > conditions, such as an amount of time later, for other new > jobs. Of course, this is some simple ideas which I'm > thinking now. I am going to give more detailed and feasible > process. Any suggestions are warmly welcome. The current system pretty much does that, though in a slower way, which is desired because occasional errors don't necessarily mean the site is bad. > > Thanks, > > Xi > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Wed Jun 25 15:34:20 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 25 Jun 2008 15:34:20 -0500 Subject: [Swift-devel] [swift-dev] Testing In-Reply-To: <1214424748.24702.11.camel@localhost> References: <1214424748.24702.11.camel@localhost> Message-ID: <1214426060.24702.19.camel@localhost> Please ignore. From hategan at mcs.anl.gov Wed Jun 25 15:50:50 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 25 Jun 2008 15:50:50 -0500 Subject: [Swift-devel] [swift-dev] Testing II Message-ID: <1214427050.24702.31.camel@localhost> Continue ignoring these... From hategan at mcs.anl.gov Wed Jun 25 15:56:03 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 25 Jun 2008 15:56:03 -0500 Subject: [Swift-devel] [swift-dev] End of testing Message-ID: <1214427363.24702.37.camel@localhost> Sorry about that. I was testing forwarding from swift-dev at globus.org to swift-devel at ci.uchicago.edu. You do not need to do anything about this (nor were any of you individually subscribed to swift-dev at globus.org). However, you will get all messages posted to that mailing list, and they will look pretty much like this one. 
There may be some more 'reply-to' configuration that could be done, but given that I don't see a particular benefit to one choice or another, I'll leave it at this for now. Mihael From benc at hawaga.org.uk Wed Jun 25 16:15:07 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 25 Jun 2008 21:15:07 +0000 (GMT) Subject: [Swift-devel] scheduler changes to deal with fast-failing sites In-Reply-To: <1214406966.23647.1.camel@localhost> References: <1214406966.23647.1.camel@localhost> Message-ID: On Wed, 25 Jun 2008, Mihael Hategan wrote: > I don't think there is any reason not to commit this. was hoping for some sanity checking before committing... -- From hategan at mcs.anl.gov Wed Jun 25 16:21:16 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 25 Jun 2008 16:21:16 -0500 Subject: [Swift-devel] scheduler changes to deal with fast-failing sites In-Reply-To: References: <1214406966.23647.1.camel@localhost> Message-ID: <1214428876.26911.0.camel@localhost> On Wed, 2008-06-25 at 21:15 +0000, Ben Clifford wrote: > On Wed, 25 Jun 2008, Mihael Hategan wrote: > > > I don't think there is any reason not to commit this. > > was hoping for some sanity checking before committing... I might clean up some things a bit after, but it looks sane to me. From benc at hawaga.org.uk Wed Jun 25 17:15:14 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 25 Jun 2008 22:15:14 +0000 (GMT) Subject: [Swift-devel] scheduler changes to deal with fast-failing sites In-Reply-To: <1214428876.26911.0.camel@localhost> References: <1214406966.23647.1.camel@localhost> <1214428876.26911.0.camel@localhost> Message-ID: On Wed, 25 Jun 2008, Mihael Hategan wrote: > On Wed, 2008-06-25 at 21:15 +0000, Ben Clifford wrote: > > On Wed, 25 Jun 2008, Mihael Hategan wrote: > > > > > I don't think there is any reason not to commit this. > > > > was hoping for some sanity checking before committing... > > I might clean up some things a bit after, but it looks sane to me. 
Sane as in someone having actually tried using it in a real situation (which is pretty much Xi's domain) rather than sane-by-inspection. I'll commit it, though. -- From benc at hawaga.org.uk Wed Jun 25 17:16:52 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 25 Jun 2008 22:16:52 +0000 (GMT) Subject: [Swift-devel] Re: scheduler changes to deal with fast-failing sites In-Reply-To: <20080625104410.BBQ18567@m4500-03.uchicago.edu> References: <20080625104410.BBQ18567@m4500-03.uchicago.edu> Message-ID: > select site from them based on their scores/Tscores. If the > site fails, we could designate it as "frozen", at least for > the current job, avoiding more retries would be eaten up. A > frozen site could be unfrozen for satisfying different > conditions, such as an amount of time later, That is basically what the patch that I sent does. -- From benc at hawaga.org.uk Wed Jun 25 17:48:03 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 25 Jun 2008 22:48:03 +0000 (GMT) Subject: [Swift-devel] Re: scheduler changes to deal with fast-failing sites In-Reply-To: References: <20080625104410.BBQ18567@m4500-03.uchicago.edu> Message-ID: On Wed, 25 Jun 2008, Ben Clifford wrote: > > select site from them based on their scores/Tscores. If the > > site fails, we could designate it as "frozen", at least for > > the current job, avoiding more retries would be eaten up. A > > frozen site could be unfrozen for satisfying different > > conditions, such as an amount of time later, > > That is basically what the patch that I sent does. which is now in CoG as r2058. svn update in cog to get it. 
-- From benc at hawaga.org.uk Thu Jun 26 08:38:48 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 26 Jun 2008 13:38:48 +0000 (GMT) Subject: [Swift-devel] Re: scheduler changes to deal with fast-failing sites In-Reply-To: References: Message-ID: r2058 results in scheduling hangs, due to the way in which the overloaded host count is cached and not recomputed at the appropriate time. The attached hacky patch demonstrates a different (more correct though less efficient) way of counting the overloaded host count. I can reliably get a hang after about 10 minutes running: cd tests/language-behaviour/ while true; do swift 0755-ext-mapper.swift ; done which goes away with this patch. Patch is at http://www.ci.uchicago.edu/~benc/overload-chk -- From hategan at mcs.anl.gov Thu Jun 26 09:40:54 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 26 Jun 2008 09:40:54 -0500 Subject: [Swift-devel] Re: scheduler changes to deal with fast-failing sites In-Reply-To: References: Message-ID: <1214491254.4576.1.camel@localhost> That particular bit needs to be fast. I'll take a look at it. On Thu, 2008-06-26 at 13:38 +0000, Ben Clifford wrote: > r2058 results in scheduling hangs, due to the way in which the overloaded > host count is cached and not recomputed at the appropriate time. > > The attached hacky patch demonstrates a different (more correct though > less efficient) way of counting the overloaded host count. > > I can reliably get a hang after about 10 minutes running: > > cd tests/language-behaviour/ > while true; do swift 0755-ext-mapper.swift ; done > > which goes away with this patch. 
> > Patch is at http://www.ci.uchicago.edu/~benc/overload-chk > From benc at hawaga.org.uk Thu Jun 26 10:28:22 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 26 Jun 2008 15:28:22 +0000 (GMT) Subject: [Swift-devel] Re: scheduler changes to deal with fast-failing sites In-Reply-To: <20080626093141.BBR03600@m4500-03.uchicago.edu> References: <20080626093141.BBR03600@m4500-03.uchicago.edu> Message-ID: On Thu, 26 Jun 2008, lixi at uchicago.edu wrote: > When you are not very busy, could you please explain the > details of the new improvement of scheduler for me? Previously, a job would be submitted to a site if its load was less than its maximum load; maximum load was calculated as tscore * jobThrottle + 2. tscore ranges between 0 and 100 as score ranges between -inf and +inf. The maximum load will never go below 2, no matter how bad a site is. In the new code, when score > 0 (the site is 'good') then the maximum load is calculated in a very similar way, as tscore * jobThrottle + 1. When score < 0 (the site is 'bad') then a different method is used. There will be delays between job submissions. Two jobs will not be submitted to the same site within t ms of each other, with t = e^(-score) * 100. So what should happen when a site is bad is that it will still sometimes be used, but the longer it remains bad, the longer the delay between attempted submissions will be. -- From hategan at mcs.anl.gov Thu Jun 26 15:54:32 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 26 Jun 2008 15:54:32 -0500 Subject: [Swift-devel] Re: scheduler changes to deal with fast-failing sites In-Reply-To: References: Message-ID: <1214513672.12138.0.camel@localhost> On Thu, 2008-06-26 at 13:38 +0000, Ben Clifford wrote: > r2058 results in scheduling hangs, due to the way in which the overloaded > host count is cached and not recomputed at the appropriate time. 
> > The attached hacky patch demonstrates a different (more correct though > less efficient) way of counting the overloaded host count. > > I can reliably get a hang after about 10 minutes running: > > cd tests/language-behaviour/ > while true; do swift 0755-ext-mapper.swift ; done > > which goes away with this patch. > > Patch is at http://www.ci.uchicago.edu/~benc/overload-chk > How about http://www.mcs.anl.gov/~hategan/overload-chk2 ? From benc at hawaga.org.uk Fri Jun 27 06:30:33 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 27 Jun 2008 11:30:33 +0000 (GMT) Subject: [Swift-devel] Re: scheduler changes to deal with fast-failing sites In-Reply-To: <1214513672.12138.0.camel@localhost> References: <1214513672.12138.0.camel@localhost> Message-ID: On Thu, 26 Jun 2008, Mihael Hategan wrote: > How about http://www.mcs.anl.gov/~hategan/overload-chk2 ? That seems to work. -- From lixi at uchicago.edu Fri Jun 27 10:15:46 2008 From: lixi at uchicago.edu (lixi at uchicago.edu) Date: Fri, 27 Jun 2008 10:15:46 -0500 (CDT) Subject: [Swift-devel] Re: scheduler changes to deal with fast-failing sites Message-ID: <20080627101546.BBS16630@m4500-03.uchicago.edu> >A. Site selection - that has had a number of changes recently that are >designed to help the multi-site case (replication which appeared a month >or so ago; and some scoring behaviour changes which went in a day or so >ago). Prior to these changes, running on a large number of sites was >almost guaranteed to fail. The new behaviour looks like it should be much >more successful, though I have yet to hear of anyone (eg Xi) trying it on >OSG yet. As far as multi-site is concerned, I've already tried the new changes. There are several aspects I've tested so far: 1. With my own calibration results, the generated sites file filters out some "bad" sites in terms of GRAM and GridFTP. In my experiments, this evidently increased the success rate of the whole workflow. 
However, it could not guarantee a completely successful run of every workflow, because some sites produce a shared file system error as follows: Application exception: No status file was found. Check the shared filesystem on hostname This is the error which I don't know how to check in advance. 2. With the replication option enabled, I often encountered a "Multiple mappings pointing to the same file" error, which led to the failure of the whole workflow. I think that I've already reported that error. Since then, I have not received a message notifying me that this problem was resolved. So I disabled the "replication" option in recent experiments. 3. For the latest changes to scoring behaviour, I am continuing to test them. Xi From benc at hawaga.org.uk Fri Jun 27 10:24:57 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 27 Jun 2008 15:24:57 +0000 (GMT) Subject: [Swift-devel] Re: OSG Interest Group meeting (fwd) Message-ID: below discussion happened in an off-list thread but it may be of interest for archival and intra-swift communication purposes. ---------- Forwarded message ---------- Date: Fri, 27 Jun 2008 15:23:55 +0000 (GMT) From: Ben Clifford To: lixi at uchicago.edu Cc: Michael Wilde , Zhengxiong Hou , Jing Tie , Alina Bejan , Zhao Zhang , Ian Foster , Charles Bacon Subject: Re: OSG Interest Group meeting On Fri, 27 Jun 2008, lixi at uchicago.edu wrote: > of whole workflow. However, it could not guarantee > completely successful run of every workflow, because some > sites produce shared file system error as follows: > Application exception: No status file was found. Check the > shared filesystem on hostname > This is the error which I don't know how to check in advance. The recent scoring changes should reduce the damaging effect of such sites - Swift will make some attempts to use them but should rapidly slow down submissions. > 2. 
With replication option enabled, I often > encountered "Multiple mappings pointing to the same file" > error which leaded to the failure of the whole workflow. I > think that I've already reported that error. Since then, I > didn't receive the message notifying the resolution of this > problem. So I disabled "replication" option in recent > experiments. You did, though Mihael made a fix which cured my attempt at reproducing this. (see this: http://mail.ci.uchicago.edu/pipermail/swift-devel/2008-May/003140.html) Please try using replication again. Also, if you have open issues that aren't actively being addressed, please put them in the bugzilla. -- From lixi at uchicago.edu Fri Jun 27 12:23:02 2008 From: lixi at uchicago.edu (lixi at uchicago.edu) Date: Fri, 27 Jun 2008 12:23:02 -0500 (CDT) Subject: [Swift-devel] workflow run with the newest Swift Message-ID: <20080627122302.BBS29893@m4500-03.uchicago.edu> Hi, I'm launching a Swift run at 10:49. The first job of this workflow was submitted to a site. However, it is still waiting for running on the remote site until now according to log file. I enabled the replication option, so I think that it should produce copy job and submit it to other sites. The log file is on CI: /home/lixi/newswift/test/newversion/workflowtest-20080627- 1049-4p7s1tne.log I don't terminate this run yet. Please help to check it, thanks. Xi From benc at hawaga.org.uk Fri Jun 27 12:35:14 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 27 Jun 2008 17:35:14 +0000 (GMT) Subject: [Swift-devel] workflow run with the newest Swift In-Reply-To: <20080627122302.BBS29893@m4500-03.uchicago.edu> References: <20080627122302.BBS29893@m4500-03.uchicago.edu> Message-ID: On Fri, 27 Jun 2008, lixi at uchicago.edu wrote: > I'm launching a Swift run at 10:49. The first job of this > workflow was submitted to a site. However, it is still > waiting for running on the remote site until now according > to log file. 
I enabled the replication option, so I think > that it should produce copy job and submit it to other sites. > > The log file is on CI: > /home/lixi/newswift/test/newversion/workflowtest-20080627- > 1049-4p7s1tne.log > > I don't terminate this run yet. > > Please help to check it, thanks. In the present implementation, replication will never happen until one job has completed. If your first job hits a slow site, then you will be in trouble. It might be useful to have other controls on replication - for example, an initial time to use when a mean time is not known, that might be set to half an hour or something like that. I think you should probably restart that run and see how it works. -- From lixi at uchicago.edu Fri Jun 27 12:53:48 2008 From: lixi at uchicago.edu (lixi at uchicago.edu) Date: Fri, 27 Jun 2008 12:53:48 -0500 (CDT) Subject: [Swift-devel] workflow run with the newest Swift Message-ID: <20080627125348.BBS33039@m4500-03.uchicago.edu> >In the present implementation, replication will never happen until one job >has completed. If your first job hits a slow site, then you will be in >trouble. >It might be useful to have other controls on replication - for example, an >initial time to use when a mean time is not known, that might be set to >half an hour or something like that. I notice that a submission time calculation has been added. How about using a similar approach to start replication? For example, the waiting time could also be recalculated every second after submission. When it reaches the waiting-time throttle, replication should be initiated. 
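The replication control discussed above can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the 30-minute default is the "half an hour" figure floated in the thread, and comparing the wait against the plain mean queue time is an assumption (the real scheduler may scale the mean by some factor).

```python
from typing import Optional, Sequence

DEFAULT_INITIAL_WAIT_S = 30 * 60  # "half an hour", as suggested in the thread


def replication_threshold_s(
    completed_queue_times_s: Sequence[float],
    initial_wait_s: Optional[float] = DEFAULT_INITIAL_WAIT_S,
) -> Optional[float]:
    """Queue time after which a still-waiting job would be replicated.

    With history, use the mean queue time of earlier jobs. Without
    history, fall back to initial_wait_s. Passing None as the fallback
    reproduces the behaviour described in the thread, where no threshold
    exists until the first job has completed.
    """
    if completed_queue_times_s:
        return sum(completed_queue_times_s) / len(completed_queue_times_s)
    return initial_wait_s


def should_replicate(
    waited_s: float,
    completed_queue_times_s: Sequence[float],
    initial_wait_s: Optional[float] = DEFAULT_INITIAL_WAIT_S,
) -> bool:
    """True once a queued job has waited longer than the threshold."""
    threshold = replication_threshold_s(completed_queue_times_s, initial_wait_s)
    return threshold is not None and waited_s > threshold
```

For a first job that has been queued for roughly an hour and a half, should_replicate(5640, [], None) stays False no matter how long the wait gets, while should_replicate(5640, []) with the half-hour default would already have triggered a replica.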
From hategan at mcs.anl.gov Fri Jun 27 13:44:49 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 27 Jun 2008 13:44:49 -0500 Subject: [Swift-devel] workflow run with the newest Swift In-Reply-To: References: <20080627122302.BBS29893@m4500-03.uchicago.edu> Message-ID: <1214592289.5768.0.camel@localhost> On Fri, 2008-06-27 at 17:35 +0000, Ben Clifford wrote: > On Fri, 27 Jun 2008, lixi at uchicago.edu wrote: > > > I'm launching a Swift run at 10:49. The first job of this > > workflow was submitted to a site. However, it is still > > waiting for running on the remote site until now according > > to log file. I enabled the replication option, so I think > > that it should produce copy job and submit it to other sites. > > > > The log file is on CI: > > /home/lixi/newswift/test/newversion/workflowtest-20080627- > > 1049-4p7s1tne.log > > > > I don't terminate this run yet. > > > > Please help to check it, thanks. > > In the present implementation, replication will never happen until one job > has completed. If your first job hits a slow site, then you will be in > trouble. > > It might be useful to have other controls on replication - for example, an > initial time to use when a mean time is not known, that might be set to > half an hour or something like that. Right. I agree. > > I think you should probably restart that run and see how it works. > From hategan at mcs.anl.gov Fri Jun 27 13:46:17 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 27 Jun 2008 13:46:17 -0500 Subject: [Swift-devel] workflow run with the newest Swift In-Reply-To: <20080627125348.BBS33039@m4500-03.uchicago.edu> References: <20080627125348.BBS33039@m4500-03.uchicago.edu> Message-ID: <1214592377.5768.3.camel@localhost> On Fri, 2008-06-27 at 12:53 -0500, lixi at uchicago.edu wrote: > >In the present implementation, replication will never > happen until one job > >has completed. If your first job hits a slow site, then you > will be in > >trouble. 
> > >It might be useful to have other controls on replication - > for example, an > >initial time to use when a mean time is not known, that > might be set to > >half an hour or something like that. > > I notice that a submission time calculation has been added. > How about utilizing similar way to start replication? For > example, waiting time could be also calculated every seconds > after submission. When meeting the time throttle of > waitting, replication should be initiated. It is. It's simply that it doesn't make any assumptions about what the minimum queue time for replication should be, and it uses the average of the previous jobs, an average which obviously doesn't exist for the first job. But there should be a reasonable number there. > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From lixi at uchicago.edu Fri Jun 27 14:46:30 2008 From: lixi at uchicago.edu (lixi at uchicago.edu) Date: Fri, 27 Jun 2008 14:46:30 -0500 (CDT) Subject: [Swift-devel] workflow run with the newest Swift Message-ID: <20080627144630.BBS42462@m4500-03.uchicago.edu> >I think you should probably restart that run and see how it works. Yes, the new run is finished successfully. The execution speed for this run is good. However, a lot of "cached overload count = 13 just generated count = 13 (EEP!)" like information made the execution display on screen is not very friendly. At the beginning, I had thought that jobs weren't executed. After checking the output directory and log file, we could only find many jobs done already. 
From hategan at mcs.anl.gov Fri Jun 27 15:02:45 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 27 Jun 2008 15:02:45 -0500 Subject: [Swift-devel] workflow run with the newest Swift In-Reply-To: <20080627144630.BBS42462@m4500-03.uchicago.edu> References: <20080627144630.BBS42462@m4500-03.uchicago.edu> Message-ID: <1214596965.9379.11.camel@localhost> On Fri, 2008-06-27 at 14:46 -0500, lixi at uchicago.edu wrote: > >I think you should probably restart that run and see how it > works. > > Yes, the new run is finished successfully. The execution > speed for this run is good. However, a lot of "cached > overload count = 13 just generated count = 13 (EEP!)" like > information made the execution display on screen is not very > friendly. At the beginning, I had thought that jobs weren't > executed. After checking the output directory and log file, > we could only find many jobs done already. Ehm, could you try the patch I sent? > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From benc at hawaga.org.uk Fri Jun 27 16:50:34 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 27 Jun 2008 21:50:34 +0000 (GMT) Subject: [Swift-devel] workflow run with the newest Swift In-Reply-To: <20080627144630.BBS42462@m4500-03.uchicago.edu> References: <20080627144630.BBS42462@m4500-03.uchicago.edu> Message-ID: On Fri, 27 Jun 2008, lixi at uchicago.edu wrote: > >I think you should probably restart that run and see how it > works. > > Yes, the new run is finished successfully. The execution > speed for this run is good. However, a lot of "cached > overload count = 13 just generated count = 13 (EEP!)" like > information made the execution display on screen is not very > friendly. yes, that patch was intended more for discussion than real use. Mihael's one, overload-chk2, is better. > At the beginning, I had thought that jobs weren't > executed. 
After checking the output directory and log file, > we could only find many jobs done already. Do you have the log file for this? I would like to see how it worked. -- From lixi at uchicago.edu Fri Jun 27 17:37:47 2008 From: lixi at uchicago.edu (lixi at uchicago.edu) Date: Fri, 27 Jun 2008 17:37:47 -0500 (CDT) Subject: [Swift-devel] workflow run with the newest Swift Message-ID: <20080627173747.BBS54423@m4500-03.uchicago.edu> >Do you have the log file for this? I would like to see how it worked. It is on CI: /home/lixi/newswift/test/newversion/workflowtest-20080627- 1246-aba0l8d9.log Meanwhile, there is another workflow with 4001 jobs ongoing and its log file is under the same directory:workflowtest- 20080627-1440-qf6kbah7.log. From benc at hawaga.org.uk Sun Jun 29 07:20:24 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Sun, 29 Jun 2008 12:20:24 +0000 (GMT) Subject: [Swift-devel] workflow run with the newest Swift In-Reply-To: <20080627173747.BBS54423@m4500-03.uchicago.edu> References: <20080627173747.BBS54423@m4500-03.uchicago.edu> Message-ID: On Fri, 27 Jun 2008, lixi at uchicago.edu wrote: > >Do you have the log file for this? I would like to see how > it worked. > > It is on CI: > /home/lixi/newswift/test/newversion/workflowtest-20080627- > 1246-aba0l8d9.log Some stats for that run: Almost exactly 1h runtime. Peak of 45 jobs actually running on workers at any one time; for most of the hour between 30 and 45 jobs were running on workers. At peak, around 70 jobs were either queued or running. The bulk of jobs took bang on 60s to execute on workers, with a relatively small number taking much less or more (up to 75s for some) - that is plotted in the 'info duration histograph' graph in the below URL. 13 different sites were tried. There is a breakdown of how much each was used in 'sites/success table' in the below URL. The decent sites got 200 or so jobs each. 
The log analyser doesn't deal with replication properly yet, but it looks like around 10% of jobs (about 200) got replicated. Looking at per-site stats, it looks like most sites did not get more than 6 jobs queued/running at once, rather than increasing. One site got 12 though. I'm interested to know what caused that. I think it is your choice of jobThrottle parameter in your sites file generation. My rule of thumb has been 20 jobs at once for GRAM2 and a few hundred at once for GRAM4; are you specifying a lower job throttle in your submissions? All the plots and stats are here: http://www.ci.uchicago.edu/~benc/report-workflowtest-20080627-1246-aba0l8d9/ -- From lixi at uchicago.edu Sun Jun 29 07:36:30 2008 From: lixi at uchicago.edu (lixi at uchicago.edu) Date: Sun, 29 Jun 2008 07:36:30 -0500 (CDT) Subject: [Swift-devel] workflow run with the newest Swift Message-ID: <20080629073630.BBT18474@m4500-03.uchicago.edu> >The bulk of jobs took bang on 60s to execute on workers, with a relatively >small number taking much less or more (up to 75s for some) - that is >plotted in the 'info duration histograph' graph in the below URL. Yes, every successful job should be executed at least for 60s, because there is a command -- "sleep 60" included in the job. >The log analyser doesn't deal with replication properly yet, but it looks >like around 10% of jobs (about 200) got replicated. > >Looking at per-site stats, it looks like most sites did not get more than >6 jobs queued/running at once, rather than increasing. One site got 12 >though. I'm interested to know what caused that. I think it is your choice >of jobThrottle parameter in your sites file generation. My rule of thumb >has been 20 jobs at once for GRAM2 and a few hundred at once for GRAM4; >are you specifying a lower job throttle in your submissions? In this run, I specified 0.05 of job throttle for all the sites. I think that the result of this run is better than past situation with the same workflow size. 
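[A rough sketch of the arithmetic behind the throttle numbers discussed above. It assumes, and the thread itself does not state this, that Swift caps concurrent jobs per site at roughly jobThrottle * 100 + 1 once the site score has saturated. Under that assumption a throttle of 0.05 gives a ceiling of 6, which matches the "not more than 6 jobs queued/running at once" observation; it does not explain the one site that reached 12. The function names are illustrative only and are not part of Swift.]

```python
def max_concurrent_jobs(job_throttle: float) -> int:
    """Approximate per-site job ceiling.

    ASSUMPTION (not taken from this thread): limit = jobThrottle * 100 + 1.
    """
    return int(round(job_throttle * 100)) + 1


def throttle_for(target_jobs: int) -> float:
    """Inverse of the assumed formula: throttle needed for a given ceiling."""
    if target_jobs < 1:
        raise ValueError("target_jobs must be at least 1")
    return (target_jobs - 1) / 100.0


if __name__ == "__main__":
    # Throttle used in the run discussed above.
    print(max_concurrent_jobs(0.05))  # 6
    # Throttle suggested for GRAM2 sites.
    print(max_concurrent_jobs(0.2))   # 21
    # Throttle that would give the "20 jobs at once" rule of thumb.
    print(throttle_for(20))           # 0.19
```

On this reading, the GRAM2 rule of thumb of about 20 jobs at once corresponds to a throttle of roughly 0.19 to 0.2.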
From benc at hawaga.org.uk Sun Jun 29 07:40:06 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Sun, 29 Jun 2008 12:40:06 +0000 (GMT) Subject: [Swift-devel] workflow run with the newest Swift In-Reply-To: <20080629073630.BBT18474@m4500-03.uchicago.edu> References: <20080629073630.BBT18474@m4500-03.uchicago.edu> Message-ID: On Sun, 29 Jun 2008, lixi at uchicago.edu wrote: > In this run, I specified 0.05 of job throttle for all the > sites. OK. For GRAM2 sites, you should be able to specify a throttle of 0.2. -- From wilde at mcs.anl.gov Sun Jun 29 12:26:20 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sun, 29 Jun 2008 12:26:20 -0500 Subject: [Swift-devel] Falkon and Coaster support for MPI Message-ID: <4867C5BC.2080702@mcs.anl.gov> There's been some initial discussion off-list on whether it would be possible to run MPI jobs under Falkon on the BG/P, under its ZeptoOS (Linux) environment which is what is currently supporting Falkon execution. Kamil and Kaz are the main ZeptoOS architects and developers at Argonne, and Kaz is just readying an initial development of MPI support under ZeptoOS. Initial discussion is suggesting that MPI support under Falkon under ZeptoOS is non-trivial. So it seems reasonable to ask what it would take to provide MPI support under Coaster as a future capability. Everyone involved, on ZeptoOS, Falkon and Swift is exceptionally busy. I don't want to create distractions, but do want to bring this up for longer-term discussion. - Mike From benc at hawaga.org.uk Sun Jun 29 12:58:48 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Sun, 29 Jun 2008 17:58:48 +0000 (GMT) Subject: [Swift-devel] Falkon and Coaster support for MPI In-Reply-To: <4867C5BC.2080702@mcs.anl.gov> References: <4867C5BC.2080702@mcs.anl.gov> Message-ID: Do you really mean falkon and/or coaster or do you mean MPI jobs launched from Swift onto BG/P? The implementation of the latter might be completely distinct from Coaster and/or Falkon.
It might be desired to run a specific MPI application on all cores in a particular processor set (or whatever they are called). In such a case, the per-node individual job management that falkon and coaster provide would be almost/entirely irrelevant. I presume there is some existing mechanism for launching an MPI job on every core in a processor set already. It might be that it would be more appropriate for Swift to cause that mechanism to be used, making 'one node' = 'one pset' rather than 'one node' = 'one cpu' (where node is the basic unit that can execute a job). There is a (substantially?) more complicated case of causing one pset to run multiple different MPI jobs simultaneously, with some cores going to one job, and some to another. The above two are (from my perspective) very different use cases; any future discussion should clarify which one is being discussed, rather than being based on the always-vague "I want to use MPI". -- From wilde at mcs.anl.gov Sun Jun 29 13:24:45 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sun, 29 Jun 2008 13:24:45 -0500 Subject: [Swift-devel] Falkon and Coaster support for MPI In-Reply-To: References: <4867C5BC.2080702@mcs.anl.gov> Message-ID: <4867D36D.9090801@mcs.anl.gov> We mean MPI jobs launched from Swift onto BG/P, or directly to Falkon or Coaster to BGP. But in general the desire is to run all work, even workloads with no workflow dependencies, under Swift, for uniformity, site independence, and provenance. The initial discussion here was based on the assumption that a Falkon-like mechanism was required in order to run workloads of many small jobs on the BGP - whether that be through Swift, or directly. (Small meaning 1 to 64 CPUs each and order of a few minutes of runtime each).
I think that assumption is true on the Argonne BGP for two reasons: 1) scheduling policy doesn't allow or favor any user from running > 2 BGP jobs at once, and 2) on the production BGP the production partitions favor large jobs, of 512 to 2048 compute nodes. Running many smaller (eg 16, 32 CPU) Swift jobs doesn't seem like it's going to be an accepted model. That drives these kinds of app needs towards a Falkon/Coaster approach. Recently, IBM circulated info on their "HTC" mode support for the BG/P, which may change the nature of the assumptions above. - Mike On 6/29/08 12:58 PM, Ben Clifford wrote: > Do you really mean falkon and/or coaster or do you mean MPI jobs launched > from Swift onto BG/P? > > The implementation of the latter might be completely distinct from Coaster > and or Falkon. It might be desired to run a specific MPI application on > all cores in a particular processor set (or whatever they are called). In > such a case, the per-node individual job management that falkon and > coaster provide would be almost/entirely irrelevant. > > I presume there is some existing mechanism for launching an MPI job on > every core in a processor set already. > > It might be that it would be more appropriate for Swift to cause that > mechanism to be used, making 'one node' = 'one pset' rather than 'one > node' = 'one cpu' (where node is the basic unit that can execute a job). > > There is a (substantially?) more complicated case of causing one pset to > run multiple different MPI jobs simultaneously, with some cores going to > one job, and some to another. > > The above two are (from my perspective) very different use cases; any > future discussion should clarify which one is being discussed, rather than > being based on the always-vague "I want to use MPI".
> From benc at hawaga.org.uk Sun Jun 29 13:28:49 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Sun, 29 Jun 2008 18:28:49 +0000 (GMT) Subject: [Swift-devel] Falkon and Coaster support for MPI In-Reply-To: <4867D36D.9090801@mcs.anl.gov> References: <4867C5BC.2080702@mcs.anl.gov> <4867D36D.9090801@mcs.anl.gov> Message-ID: On Sun, 29 Jun 2008, Michael Wilde wrote: > The initial discussion here was based on the assumption that a Falkon-like > mechanism was required in order to run workloads of many small jobs on the BGP > - whether that be through Swift, or directly. (Small meaning 1 to 64 CPUs each > and order of a few minutes of runtime each). Is this the sort of workload that the applications that are targeted for MPI use on this machine create? (what are those applications, btw?) > Recently, IBM circulated info on their "HTC" mode support for the BG/P, which > may change the nature of the assumptions above. That would be useful to see. It hasn't circulated to me, though. -- From hategan at mcs.anl.gov Sun Jun 29 13:34:15 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 29 Jun 2008 13:34:15 -0500 Subject: [Swift-devel] Falkon and Coaster support for MPI In-Reply-To: <4867D36D.9090801@mcs.anl.gov> References: <4867C5BC.2080702@mcs.anl.gov> <4867D36D.9090801@mcs.anl.gov> Message-ID: <1214764455.11092.4.camel@localhost> On Sun, 2008-06-29 at 13:24 -0500, Michael Wilde wrote: > We mean MPI jobs launched from Swift onto BG/P, or directly to Falkon or > Coaster to BGP. But in general the desire is to run all work, even > workloads with no workflow dependencies, under Swift, for uniformity, > site independence, and provenance. > > The initial discussion here was based on the assumption that a > Falkon-like mechanism was required in order to run workloads of many > small jobs on the BGP - whether that be through Swift, or directly. > (Small meaning 1 to 64 CPUs each and order of a few minutes of runtime > each).
We should probably make a separation here between the ability of the system to run jobs (of whatever kind the system supports) and the suitability of a certain type of jobs for this system. In principle, there is a local Cobalt provider there, which should allow running jobs on BG/P without the need for an indirect mechanism. Though that may not be suitable for our typical problems. > > I think that assumption is true on the Argonne BGP for two reasons: 1) > scheduling policy doesnt allow or favor any user from running > 2 BGP > jobs at once, There is a big difference between "doesn't allow" and "doesn't favor". From wilde at mcs.anl.gov Sun Jun 29 13:40:07 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sun, 29 Jun 2008 13:40:07 -0500 Subject: [Swift-devel] Falkon and Coaster support for MPI In-Reply-To: References: <4867C5BC.2080702@mcs.anl.gov> <4867D36D.9090801@mcs.anl.gov> Message-ID: <4867D707.4020403@mcs.anl.gov> On 6/29/08 1:28 PM, Ben Clifford wrote: > On Sun, 29 Jun 2008, Michael Wilde wrote: > >> The initial discussion here was based on the assumption that a Falkon-like >> mechanism was required in order to run workloads of many small jobs on the BGP >> - whether that be through Swift, or directly. (Small meaning 1 to 64 CPUs each >> and order of a few minutes of runtime each). > > Is this the sort of workload that the applications that are targetted for > MPI use on this machine create? (what are those applications, btw?) The app that accelerated this discussion is CHARMM for molecular dynamics. CHARMM on BGP with the parameters needed in the use case in question has a long runtime on 1 CPU (24-36 hours), and seems to peak in performance using MPI at 32 CPUS. Runs >6 hours are also not runnable today on the Argonne BGP without a reservation. So the most effective way to use the BGP for some specific CHARMM runs needed by Benoit Roux's group is to run large numbers of multi-hour 32-rank MPI jobs. 
(In the meantime we're looking to run 1-CPU jobs broken into multiple separate time steps and then merged. Extra work, some questions on accuracy and equivalence, but seemingly doable).

>
>
>> Recently, IBM circulated info on their "HTC" mode support for the BG/P, which
>> may change the nature of the assumptions above.
>
> That would be useful to see. Its hasn't circulated to me, though.

http://www.bgconsortium.org/documents/HTC%20WhitePaper%20V2%20050508.pdf

>

From benc at hawaga.org.uk Sun Jun 29 13:52:38 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sun, 29 Jun 2008 18:52:38 +0000 (GMT)
Subject: [Swift-devel] Falkon and Coaster support for MPI
In-Reply-To: <1214764455.11092.4.camel@localhost>
References: <4867C5BC.2080702@mcs.anl.gov> <4867D36D.9090801@mcs.anl.gov> <1214764455.11092.4.camel@localhost>
Message-ID:

I had a brief look at how MPI runs inside PBS.

It looks something like:

1. user requests n nodes to run on PBS
2. PBS allocates those n nodes, and writes a description of them all
   into a per-job (not per-node) $PBSNODEFILE
3. PBS runs the same mpi command-line on all of the nodes, specifying
   $PBSNODEFILE somewhere in that command line.
4. The mpi commandline uses the $PBSNODEFILE to know how to get to all
   the nodes.

It might not be too much change to the coaster code to make it so you can get this behaviour. You'd need to be able to specify a node count (rather than the implicit count of 1 at the moment) and have the coaster manager handle starting everything up at the same time and providing something like $PBSNODEFILE.
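[The four steps above can be sketched roughly as follows. This is a minimal illustration, not coaster or Swift code: it shows what "providing something like $PBSNODEFILE" might involve on the launching side. PBS/Torque exports the file path in the environment variable PBS_NODEFILE (written $PBSNODEFILE in the message above), with one hostname per line per allocated slot. The launcher name `mpirun` and the `-np` / `-machinefile` flags are assumptions that vary between MPI implementations, and `./charmm` with its argument is a hypothetical application command line.]

```python
import os
from collections import Counter


def read_nodefile(path):
    """Return the hostnames listed in a PBS-style node file, one per slot."""
    with open(path) as fh:
        return [line.strip() for line in fh if line.strip()]


def slots_per_host(hosts):
    """Count how many slots each host contributes to the allocation."""
    return dict(Counter(hosts))


def build_mpi_command(nodefile, app_argv, launcher="mpirun"):
    """Build an MPI launch command covering every slot in the node file.

    ASSUMPTION: the launcher accepts -np and -machinefile; check the local
    MPI installation before relying on these flags.
    """
    hosts = read_nodefile(nodefile)
    if not hosts:
        raise ValueError("node file %r lists no hosts" % nodefile)
    return [launcher, "-np", str(len(hosts)),
            "-machinefile", nodefile] + list(app_argv)


if __name__ == "__main__":
    nodefile = os.environ.get("PBS_NODEFILE")
    if nodefile is None:
        print("PBS_NODEFILE is not set; not running inside a PBS job")
    else:
        # './charmm' and 'input.inp' are placeholders for a real application.
        print(" ".join(build_mpi_command(nodefile, ["./charmm", "input.inp"])))
```

A coaster-style manager that wanted to offer the same behaviour would have to write an equivalent file itself once it had gathered the requested number of workers, then start the launcher on one of them.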
-- From benc at hawaga.org.uk Sun Jun 29 14:19:00 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Sun, 29 Jun 2008 19:19:00 +0000 (GMT) Subject: [Swift-devel] Falkon and Coaster support for MPI In-Reply-To: <4867D707.4020403@mcs.anl.gov> References: <4867C5BC.2080702@mcs.anl.gov> <4867D36D.9090801@mcs.anl.gov> <4867D707.4020403@mcs.anl.gov> Message-ID: On Sun, 29 Jun 2008, Michael Wilde wrote: > The app that accelerated this discussion is CHARMM for molecular > dynamics. CHARMM on BGP with the parameters needed in the use case in > question has a long runtime on 1 CPU (24-36 hours), and seems to peak in > performance using MPI at 32 CPUS. Runs >6 hours are also not runnable > today on the Argonne BGP without a reservation. So the most effective > way to use the BGP for some specific CHARMM runs needed by Benoit Roux's > group is to run large numbers of multi-hour 32-rank MPI jobs. OK. This seems like an interesting application. I can spend some time working on this in the next few weeks, as long as you provide me with the following: i) someone who knows how the charmm applications work (eg from Benoit's group) ii) someone who knows how MPI works on the BG/P (maybe Kamil?) -- From iraicu at cs.uchicago.edu Sun Jun 29 14:25:38 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Sun, 29 Jun 2008 14:25:38 -0500 Subject: [Swift-devel] Falkon and Coaster support for MPI In-Reply-To: References: <4867C5BC.2080702@mcs.anl.gov> <4867D36D.9090801@mcs.anl.gov> <1214764455.11092.4.camel@localhost> Message-ID: <4867E1B2.6080604@cs.uchicago.edu> How do you guarantee that N nodes will all start at the same time? At least, in Falkon, this is the largest problem that I am not sure how to address... the Falkon scheduler doesn't know how to handle a task of N processors, it only knows tasks of 1 processor... Lets take an example. Assume you have 256 CPUs free. 
Lets say you get a MPI job for 32 CPUs, the new improved scheduler could simply replicate that single CPU task 32 times, and start the task on 32 CPUs. Now, while the 32 CPU MPI job is running, a 256 CPU MPI job comes in, and needs to be scheduled. The naive replication of 1 ==> 256 tasks doesn't work anymore, as there are no 256 CPUs free, only 224 are free. So, the scheduler needs to be smart enough to wait for this 32 CPU MPI job to finish, before it can launch the 256 CPU job, and make sure no other job will go through before that might use up any of the free CPUs. All this is certainly do-able, as PBS, SGE, Condor, etc... most LRMs can deal with MPI jobs just fine, but its certainly not a trivial addition to Falkon, and its not clear to me how easy or hard it would be in Coaster either, depending on how much effort is placed in the scheduler that feeds jobs to Coaster. Ioan Ben Clifford wrote: > I had a brief look at how MPI runs inside PBS. > > It looks something like: > > 1. user requests n nodes to run on PBS > 2. PBS allocates those n nodes, and writes a description of them all > into a per-job (not per-node) $PBSNODEFILE > 3. PBS runs the same mpi command-line on all of the nodes, specifying > $PBSNODEFILE somewhere in that command line. > 4. The mpi commandline uses the $PBSNODEFILE to know how to get to all > the nodes. > > It might not be too much change to the coaster code to make it so you can > get this behaviour. You'd need to be able to specify a node count (rather > than the implicit count of 1 at the moment) and have the coaster manager > handle starting everything up at the same time and providing something > like $PBSNODEFILE. > > -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 
58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== From benc at hawaga.org.uk Sun Jun 29 14:27:29 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Sun, 29 Jun 2008 19:27:29 +0000 (GMT) Subject: [Swift-devel] Falkon and Coaster support for MPI In-Reply-To: <4867E1B2.6080604@cs.uchicago.edu> References: <4867C5BC.2080702@mcs.anl.gov> <4867D36D.9090801@mcs.anl.gov> <1214764455.11092.4.camel@localhost> <4867E1B2.6080604@cs.uchicago.edu> Message-ID: On Sun, 29 Jun 2008, Ioan Raicu wrote: > How do you guarantee that N nodes will all start at the same time? By writing code to make it so! -- From benc at hawaga.org.uk Sun Jun 29 15:30:55 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Sun, 29 Jun 2008 20:30:55 +0000 (GMT) Subject: [Swift-devel] mpi on swift Message-ID: There is periodic talk of running MPI programs through swift, including some projects expressing desire to do that and some people playing round trying to. I propose that I spend about a week gathering together experiences from anyone who has actually tried to do it (I suspect there will not be much there); figure out myself how to get some basic MPI stuff running in simple situations through swift; write some documentation about such situations; and figure out what cannot be done at the moment that would be useful to address. -- From wilde at mcs.anl.gov Sun Jun 29 15:51:05 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sun, 29 Jun 2008 15:51:05 -0500 Subject: [Swift-devel] mpi on swift In-Reply-To: References: Message-ID: <4867F5B9.4020000@mcs.anl.gov> Previous work: Nika ran MPI climate models, an app called FOAM, under VDS. 
I don't know if the CHARMM runs that Nika did under Swift were MPI or not. I think not, but I don't recall. It's worth looking at those workflows. On 6/29/08 3:30 PM, Ben Clifford wrote: > There is periodic talk of running MPI programs through swift, including > some projects expressing desire to do that and some people playing round > trying to. > > I propose that I spend about a week gathering together experiences from > anyone who has actually tried to do it (I suspect there will not be much > there); figure out myself how to get some basic MPI stuff running in > simple situations through swift; write some documentation about such > situations; and figure out what cannot be done at the moment that would be > useful to address. From wilde at mcs.anl.gov Sun Jun 29 16:07:03 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sun, 29 Jun 2008 16:07:03 -0500 Subject: [Swift-devel] Falkon and Coaster support for MPI In-Reply-To: References: <4867C5BC.2080702@mcs.anl.gov> <4867D36D.9090801@mcs.anl.gov> <4867D707.4020403@mcs.anl.gov> Message-ID: <4867F977.1060500@mcs.anl.gov> Ben, On 6/29/08 2:19 PM, Ben Clifford wrote: > On Sun, 29 Jun 2008, Michael Wilde wrote: > >> The app that accelerated this discussion is CHARMM for molecular >> dynamics. CHARMM on BGP with the parameters needed in the use case in >> question has a long runtime on 1 CPU (24-36 hours), and seems to peak in >> performance using MPI at 32 CPUS. Runs >6 hours are also not runnable >> today on the Argonne BGP without a reservation. So the most effective >> way to use the BGP for some specific CHARMM runs needed by Benoit Roux's >> group is to run large numbers of multi-hour 32-rank MPI jobs. > > OK. > > This seems like an interesting application.
> > I can spend some time working on this in the next few weeks, as long as > you provide me with the following: > > i) someone who knows how the charmm applications work (eg from Benoit's > group) Wei Jiang from Benoit's group spent several hours with me Friday explaining how they run CHARMM. The approach uses scripts that Yuqing Deng, who recently left the group, developed. These are on the Argonne "KBT" cluster. You should get on the Argonne accounts web and request access to that cluster, kbt.mcs.anl.gov. Once there, you can run perl scripts that generate either shell scripts or pbs jobs to do a CHARMM run. It was fairly complex, and I need to transcribe some notes on this. Benoit was going to try to locate slides and/or docs from a talk that Yuqing gave before he left, on how to run his tools. Ray Loy of the ALCF team is working on compiling a non-MPI CHARMM runnable under ZeptoOS on the BGP, and he also wanted to try HTC mode. Nika's latest workflows running CHARMM are also a good reference point. Wei can provide info on how to do a sample run from Yuqing's scripts. > ii) someone who knows how MPI works on the BG/P (maybe Kamil?) MPI on BGP under ZeptoOS is still several weeks away from friendly-user testing capability. Kamil and Kaz (and Pete Beckman) are the Argonne ZeptoOS team. Kaz is working on ZeptoOS MPI and just reached hello-world about a week ago.
- Mike From roux at uchicago.edu Sun Jun 29 16:17:13 2008 From: roux at uchicago.edu (Benoit Roux) Date: Sun, 29 Jun 2008 16:17:13 -0500 Subject: [Swift-devel] Falkon and Coaster support for MPI In-Reply-To: <4867F977.1060500@mcs.anl.gov> References: <4867C5BC.2080702@mcs.anl.gov> <4867D36D.9090801@mcs.anl.gov> <4867D707.4020403@mcs.anl.gov> <4867F977.1060500@mcs.anl.gov> Message-ID: <3A76DA84-B161-4304-97E0-28ED6683A111@uchicago.edu> On Jun 29, 2008, at 4:07 PM, Michael Wilde wrote: > Ben, > > On 6/29/08 2:19 PM, Ben Clifford wrote: >> On Sun, 29 Jun 2008, Michael Wilde wrote: >>> The app that accelerated this discussion is CHARMM for molecular >>> dynamics. CHARMM on BGP with the parameters needed in the use >>> case in question has a long runtime on 1 CPU (24-36 hours), and >>> seems to peak in performance using MPI at 32 CPUS. Runs >6 hours >>> are also not runnable today on the Argonne BGP without a >>> reservation. So the most effective way to use the BGP for some >>> specific CHARMM runs needed by Benoit Roux's group is to run >>> large numbers of multi-hour 32-rank MPI jobs. >> OK. >> This seems like an interesting application. >> I can spend some time working on this in the next few weeks, as >> long as you provide me with the following: >> i) someone who knows how the charmm applications work (eg from >> Benoit's group) > > Wei Jiang from Benoit's group spent several hours with me Friday > explaining how they run CHARMM. > > The approach uses scripts that Yuqing Deng, who recently left the > group, developed. These are on the Argonne "KBT" cluster. You > should get on the Argonne accounts web and request access to that > cluster, kbt.mcs.anl.gov. > > One there, you can run perl scripts that generate either shell > scripts or pbs jobs to do a CHARMM run. It was fairly complex, and > I need to transcribe some notes on this. > > Benoit was going to try to locate slides and/or docs from a talk > that Yuqing gave before he left, on how to run his tools. 
> > Ray Loy of the ALCF team is working on compiling a non-MPI CHARMM > runnable under ZeptoOS on the BGP, and he also wanted to try HTC mode. Note that a non-MPI CHARMM job on BG would be not super fast, and MPI capability would be desirable. > > Nika's latest workflows running CHARMM are also a good reference > point. > > Wei can provide info on how to do a sample run from Yuquing's scripts. > >> ii) someone who knows how MPI works on the BG/P (maybe Kamil?) > > MPI on BGP under ZeptoOS is still several weeks away from friendly- > user testing capability. Kamil and Kaz (and Pete Beckman) are the > Argonne ZeptoOS team. Kaz is working on ZeptoOS MPI and just > reached hello-world about a week ago. > > - Mike -------------- next part -------------- An HTML attachment was scrubbed... URL: From wilde at mcs.anl.gov Sun Jun 29 16:30:58 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sun, 29 Jun 2008 16:30:58 -0500 Subject: [Swift-devel] Falkon and Coaster support for MPI In-Reply-To: <3A76DA84-B161-4304-97E0-28ED6683A111@uchicago.edu> References: <4867C5BC.2080702@mcs.anl.gov> <4867D36D.9090801@mcs.anl.gov> <4867D707.4020403@mcs.anl.gov> <4867F977.1060500@mcs.anl.gov> <3A76DA84-B161-4304-97E0-28ED6683A111@uchicago.edu> Message-ID: <4867FF12.6070609@mcs.anl.gov> > *Note that a non-MPI CHARMM job on BG would be not super fast, and MPI > capability would be desirable.* Right, Benoit. This thread is about initiating some of the work from the Swift/Falkon side needed for the MPI version of CHARMM. It needs to join together with the Zeptos/MPI effort by Kaz and Kamil, but will also be of use in running CHARMM and other apps under Swift in MPI mode on other clusters. - Mike On 6/29/08 4:17 PM, Benoit Roux wrote: > > On Jun 29, 2008, at 4:07 PM, Michael Wilde wrote: > >> Ben, >> >> On 6/29/08 2:19 PM, Ben Clifford wrote: >>> On Sun, 29 Jun 2008, Michael Wilde wrote: >>>> The app that accelerated this discussion is CHARMM for molecular >>>> dynamics. 
CHARMM on BGP with the parameters needed in the use case >>>> in question has a long runtime on 1 CPU (24-36 hours), and seems to >>>> peak in performance using MPI at 32 CPUS. Runs >6 hours are also not >>>> runnable today on the Argonne BGP without a reservation. So the most >>>> effective way to use the BGP for some specific CHARMM runs needed by >>>> Benoit Roux's group is to run large numbers of multi-hour 32-rank >>>> MPI jobs. >>> OK. >>> This seems like an interesting application. >>> I can spend some time working on this in the next few weeks, as long >>> as you provide me with the following: >>> i) someone who knows how the charmm applications work (eg from >>> Benoit's group) >> >> Wei Jiang from Benoit's group spent several hours with me Friday >> explaining how they run CHARMM. >> >> The approach uses scripts that Yuqing Deng, who recently left the >> group, developed. These are on the Argonne "KBT" cluster. You should >> get on the Argonne accounts web and request access to that cluster, >> kbt.mcs.anl.gov. >> >> One there, you can run perl scripts that generate either shell scripts >> or pbs jobs to do a CHARMM run. It was fairly complex, and I need to >> transcribe some notes on this. >> >> Benoit was going to try to locate slides and/or docs from a talk that >> Yuqing gave before he left, on how to run his tools. >> >> *Ray Loy of the ALCF team is working on compiling a non-MPI CHARMM >> runnable under ZeptoOS on the BGP, and he also wanted to try HTC mode.* > > *Note that a non-MPI CHARMM job on BG would be not super fast, and MPI > capability would be desirable.* > >> >> Nika's latest workflows running CHARMM are also a good reference point. >> >> Wei can provide info on how to do a sample run from Yuquing's scripts. >> >>> ii) someone who knows how MPI works on the BG/P (maybe Kamil?) >> >> MPI on BGP under ZeptoOS is still several weeks away from >> friendly-user testing capability. 
Kamil and Kaz (and Pete Beckman) are >> the Argonne ZeptoOS team. Kaz is working on ZeptoOS MPI and just >> reached hello-world about a week ago. >> >> - Mike > From wjiang at mcs.anl.gov Sun Jun 29 18:07:10 2008 From: wjiang at mcs.anl.gov (Wei Jiang) Date: Sun, 29 Jun 2008 18:07:10 -0500 (CDT) Subject: [Swift-devel] Falkon and Coaster support for MPI In-Reply-To: <10593829.813071214780646828.JavaMail.root@zimbra> Message-ID: <28500640.813151214780830669.JavaMail.root@zimbra> Here I am ready to provide any information you want for the sample CHARMM run on BGP or KBT, and the perl script fe.pl, which generates all the PBS scripts/jobs. If necessary, we can meet in ANL or campus some day this week. Wei ----- Original Message ----- From: "Michael Wilde" To: "Benoit Roux" Cc: "Ben Clifford" , "swift-devel" , "Kamil Iskra" , "Kaz Yoshii" , "Wei Jiang" , "Ray Loy" Sent: Sunday, June 29, 2008 4:30:58 PM GMT -06:00 US/Canada Central Subject: Re: [Swift-devel] Falkon and Coaster support for MPI > *Note that a non-MPI CHARMM job on BG would be not super fast, and MPI > capability would be desirable.* Right, Benoit. This thread is about initiating some of the work from the Swift/Falkon side needed for the MPI version of CHARMM. It needs to join together with the Zeptos/MPI effort by Kaz and Kamil, but will also be of use in running CHARMM and other apps under Swift in MPI mode on other clusters. - Mike On 6/29/08 4:17 PM, Benoit Roux wrote: > > On Jun 29, 2008, at 4:07 PM, Michael Wilde wrote: > >> Ben, >> >> On 6/29/08 2:19 PM, Ben Clifford wrote: >>> On Sun, 29 Jun 2008, Michael Wilde wrote: >>>> The app that accelerated this discussion is CHARMM for molecular >>>> dynamics. CHARMM on BGP with the parameters needed in the use case >>>> in question has a long runtime on 1 CPU (24-36 hours), and seems to >>>> peak in performance using MPI at 32 CPUS. Runs >6 hours are also not >>>> runnable today on the Argonne BGP without a reservation. 
So the most >>>> effective way to use the BGP for some specific CHARMM runs needed by >>>> Benoit Roux's group is to run large numbers of multi-hour 32-rank >>>> MPI jobs. >>> OK. >>> This seems like an interesting application. >>> I can spend some time working on this in the next few weeks, as long >>> as you provide me with the following: >>> i) someone who knows how the charmm applications work (eg from >>> Benoit's group) >> >> Wei Jiang from Benoit's group spent several hours with me Friday >> explaining how they run CHARMM. >> >> The approach uses scripts that Yuqing Deng, who recently left the >> group, developed. These are on the Argonne "KBT" cluster. You should >> get on the Argonne accounts web and request access to that cluster, >> kbt.mcs.anl.gov. >> >> One there, you can run perl scripts that generate either shell scripts >> or pbs jobs to do a CHARMM run. It was fairly complex, and I need to >> transcribe some notes on this. >> >> Benoit was going to try to locate slides and/or docs from a talk that >> Yuqing gave before he left, on how to run his tools. >> >> *Ray Loy of the ALCF team is working on compiling a non-MPI CHARMM >> runnable under ZeptoOS on the BGP, and he also wanted to try HTC mode.* > > *Note that a non-MPI CHARMM job on BG would be not super fast, and MPI > capability would be desirable.* > >> >> Nika's latest workflows running CHARMM are also a good reference point. >> >> Wei can provide info on how to do a sample run from Yuquing's scripts. >> >>> ii) someone who knows how MPI works on the BG/P (maybe Kamil?) >> >> MPI on BGP under ZeptoOS is still several weeks away from >> friendly-user testing capability. Kamil and Kaz (and Pete Beckman) are >> the Argonne ZeptoOS team. Kaz is working on ZeptoOS MPI and just >> reached hello-world about a week ago. 
>> >> - Mike > From foster at mcs.anl.gov Mon Jun 30 03:43:45 2008 From: foster at mcs.anl.gov (Ian Foster) Date: Mon, 30 Jun 2008 03:43:45 -0500 Subject: [Swift-devel] Falkon and Coaster support for MPI In-Reply-To: <4867E1B2.6080604@cs.uchicago.edu> References: <4867C5BC.2080702@mcs.anl.gov> <4867D36D.9090801@mcs.anl.gov> <1214764455.11092.4.camel@localhost> <4867E1B2.6080604@cs.uchicago.edu> Message-ID: <037DBF80-8537-42B7-9A93-C46075548B1E@mcs.anl.gov> A few thoughts: 1) It must be straightforward to submit MPI programs from Swift, via the GRAM provider--the only issue is passing the appropriate parameters to the GRAM submission. (I realize this is not completely trivial, but as Ben says, we have done it before.) 2) The challenge is doing this in conjunction with multi-level scheduling (aka Falkon/Coaster/Glideins), which we may require for reasons of feasibility (on BG/P, where I don't think you can request less than a rack?) and/or performance. 3) So my view is that we want to set up Swift to support both modes (1) and (2). 4) Ioan points out that a fully general multi-level scheduling solution with support for multi-CPU jobs may introduce the need for a smarter scheduler than our current FIFO approach. E.g., if we have 256 nodes and a queue with jobs of size {32,256,32,32,32,32,32,32,32,32}, a FIFO strategy would run them in that order, and waste much CPU time. On the other hand, a simple "first-fit" strategy might starve large jobs. I think we should be nervous about getting into the business of implementing scheduler functionality like this. I'd like to advocate that in the short term, we try to make this problem go away by requiring that if an application includes MPI tasks, they all be of the same size. Of course the problem remains that we will probably still have a mix of uniprocessor (P=1) and multiprocessor (P=N) tasks. Again, we could make this problem go away by reserving some nodes for P=1 and some for P=N, at the cost of some inefficiency. 
5) The question has been raised of how to implement (2). One proposal is to adapt coaster to support MPI jobs. I'm a bit concerned that this could be expensive: we already have Falkon running well on BG/P, and given our other commitments to support NSF user communities, putting scarce resources into replicating that work may not be optimal. Regards -- Ian. On Jun 29, 2008, at 2:25 PM, Ioan Raicu wrote: > How do you guarantee that N nodes will all start at the same time? > At least, in Falkon, this is the largest problem that I am not sure > how to address... the Falkon scheduler doesn't know how to handle a > task of N processors, it only knows tasks of 1 processor... > > Lets take an example. Assume you have 256 CPUs free. Lets say you > get a MPI job for 32 CPUs, the new improved scheduler could simply > replicate that single CPU task 32 times, and start the task on 32 > CPUs. Now, while the 32 CPU MPI job is running, a 256 CPU MPI job > comes in, and needs to be scheduled. The naive replication of 1 ==> > 256 tasks doesn't work anymore, as there are no 256 CPUs free, only > 224 are free. So, the scheduler needs to be smart enough to wait > for this 32 CPU MPI job to finish, before it can launch the 256 CPU > job, and make sure no other job will go through before that might > use up any of the free CPUs. All this is certainly do-able, as PBS, > SGE, Condor, etc... most LRMs can deal with MPI jobs just fine, but > its certainly not a trivial addition to Falkon, and its not clear to > me how easy or hard it would be in Coaster either, depending on how > much effort is placed in the scheduler that feeds jobs to Coaster. > > Ioan > > Ben Clifford wrote: >> I had a brief look at how MPI runs inside PBS. >> >> It looks something like: >> >> 1. user requests n nodes to run on PBS >> 2. PBS allocates those n nodes, and writes a description of them >> all into a per-job (not per-node) $PBSNODEFILE >> 3. 
PBS runs the same mpi command-line on all of the nodes, >> specifying $PBSNODEFILE somewhere in that command line. >> 4. The mpi commandline uses the $PBSNODEFILE to know how to get >> to all the nodes. >> >> It might not be too much change to the coaster code to make it so >> you can get this behaviour. You'd need to be able to specify a node >> count (rather than the implicit count of 1 at the moment) and have >> the coaster manager handle starting everything up at the same time >> and providing something like $PBSNODEFILE. >> >> > > -- > =================================================== > Ioan Raicu > Ph.D. Candidate > =================================================== > Distributed Systems Laboratory > Computer Science Department > University of Chicago > 1100 E. 58th Street, Ryerson Hall > Chicago, IL 60637 > =================================================== > Email: iraicu at cs.uchicago.edu > Web: http://www.cs.uchicago.edu/~iraicu > http://dev.globus.org/wiki/Incubator/Falkon > http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page > =================================================== > =================================================== > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From benc at hawaga.org.uk Mon Jun 30 03:54:52 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 30 Jun 2008 08:54:52 +0000 (GMT) Subject: [Swift-devel] Falkon and Coaster support for MPI In-Reply-To: <037DBF80-8537-42B7-9A93-C46075548B1E@mcs.anl.gov> References: <4867C5BC.2080702@mcs.anl.gov> <4867D36D.9090801@mcs.anl.gov> <1214764455.11092.4.camel@localhost> <4867E1B2.6080604@cs.uchicago.edu> <037DBF80-8537-42B7-9A93-C46075548B1E@mcs.anl.gov> Message-ID: On Mon, 30 Jun 2008, Ian Foster wrote: > 1) It must be straightforward to submit MPI programs from Swift, via the GRAM > provider--the only issue is passing the appropriate parameters to the 
GRAM [...] > 2) The challenge is doing this in conjunction with multi-level scheduling (aka > Falkon/Coaster/Glideins), which we may require for reasons of feasibility (on [...] > 3) So my view is that we want to set up Swift to support both modes (1) and > (2). yes. > > 4) Ioan points out that a fully general multi-level scheduling solution with > support for multi-CPU jobs may introduce the need for a smarter scheduler than > our current FIFO approach. E.g., if we have 256 nodes and a queue with jobs of > size {32,256,32,32,32,32,32,32,32,32}, a FIFO strategy would run them in that > order, and waste much CPU time. On the other hand, a simple "first-fit" > strategy might starve large jobs. > > I think we should be nervous about getting into the business of implementing > scheduler functionality like this. yes. certainly the coaster code should not at the moment be getting into doing 'fancy' stuff. > I'd like to advocate that in the short term, we try to make this problem go > away by requiring that if an application includes MPI tasks, they all be of > the same size. right. I pretty much agree with that constraint, or something fairly similar, eg by saying that we will only do first-come-first-serve queueing that probably ties uniform job size strongly to worker efficiency. > 5) The question has been raised of how to implement (2). One proposal is > to adapt coaster to support MPI jobs. I'm a bit concerned that this > could be expensive: we already have Falkon running well on BG/P, and > given our other commitments to support NSF user communities, putting > scarce resources into replicating that work may not be optimal. This is not the thread to debate falkon vs coaster. The development model for that has been extensively debated in the past. 
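[Editorial sketch, not part of the original thread.] The queue Ian describes in point 4 can be replayed with a toy scheduler to see which jobs wait under each policy. The Python below is only an illustration under loud assumptions: every job runs for exactly one time step (the thread gives no runtimes), all jobs are queued at time zero, and the function name `simulate` and the `skip_ahead` flag are made up here. With `skip_ahead=False` the scheduler stops at the first job that does not fit (strict FIFO); with `skip_ahead=True` it keeps scanning for smaller jobs that do fit ("first-fit").

```python
def simulate(job_sizes, total_nodes, skip_ahead):
    """Return the time step at which each job starts.

    Assumption for illustration only: every job runs for exactly one
    time step, so all nodes are free again at the start of each step.
    """
    if any(size > total_nodes for size in job_sizes):
        raise ValueError("a job is larger than the whole allocation")

    waiting = list(range(len(job_sizes)))  # job indices in arrival order
    starts = [None] * len(job_sizes)
    step = 0
    while waiting:
        free = total_nodes
        still_waiting = []
        blocked = False  # set once strict FIFO hits a job that does not fit
        for j in waiting:
            if not blocked and job_sizes[j] <= free:
                free -= job_sizes[j]
                starts[j] = step
            else:
                still_waiting.append(j)
                if not skip_ahead:
                    blocked = True
        waiting = still_waiting
        step += 1
    return starts


queue = [32, 256, 32, 32, 32, 32, 32, 32, 32, 32]

# Strict FIFO: the first 32-node job runs alone while 224 nodes sit idle.
print(simulate(queue, 256, skip_ahead=False))  # [0, 1, 2, 2, 2, 2, 2, 2, 2, 2]

# First-fit: small jobs jump ahead of the 256-node job.
print(simulate(queue, 256, skip_ahead=True))   # [0, 1, 0, 0, 0, 0, 0, 0, 0, 2]

# Uniform job size (the short-term constraint proposed above): FIFO packs fully.
print(simulate([32] * 10, 256, skip_ahead=False))  # [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
```

Under these equal-runtime assumptions both policies happen to finish this particular queue in three steps; what differs is which jobs wait and when the idle nodes occur. Differing runtimes and a continuous stream of arrivals, which is where starvation of the large job would show up, are deliberately not modelled.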
-- From foster at mcs.anl.gov Mon Jun 30 04:00:50 2008 From: foster at mcs.anl.gov (Ian Foster) Date: Mon, 30 Jun 2008 04:00:50 -0500 Subject: [Swift-devel] Falkon and Coaster support for MPI In-Reply-To: References: <4867C5BC.2080702@mcs.anl.gov> <4867D36D.9090801@mcs.anl.gov> <1214764455.11092.4.camel@localhost> <4867E1B2.6080604@cs.uchicago.edu> <037DBF80-8537-42B7-9A93-C46075548B1E@mcs.anl.gov> Message-ID: Ben: I wasn't debating Falkon vs. Coaster, but raising this issue: since the primary motivation for supporting MPI jobs via multi-level scheduling seems to be the BG/P, I think we need to think hard about whether we want to spend scarce Swift/Coaster development resources on making Coaster support MPI on BG/P (a fairly specialized requirement). Ian. On Jun 30, 2008, at 3:54 AM, Ben Clifford wrote: >> 5) The question has been raised of how to implement (2). One >> proposal is >> to adapt coaster to support MPI jobs. I'm a bit concerned that this >> could be expensive: we already have Falkon running well on BG/P, and >> given our other commitments to support NSF user communities, putting >> scarce resources into replicating that work may not be optimal. > > This is not the thread to debate falkon vs coaster. The development > model > for that has been extensively debated in the past. From lixi at uchicago.edu Mon Jun 30 09:03:09 2008 From: lixi at uchicago.edu (lixi at uchicago.edu) Date: Mon, 30 Jun 2008 09:03:09 -0500 (CDT) Subject: [Swift-devel] No response of Swift run Message-ID: <20080630090309.BBT73100@m4500-03.uchicago.edu> Hi, I launched a Swift workflow (including 2001 jobs) at 16:16 yesterday. At 17:20, it returned the results of 2000 jobs, and since then there has been no response. I wonder why? I enabled the replication option.
The log file is very large (more than 1 GB) and is on CI: /home/lixi/newswift/test/newversion/workflowtest-20080629-1616-c4h22j03.log Please check it. Thanks, Xi From hategan at mcs.anl.gov Mon Jun 30 09:23:43 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 30 Jun 2008 09:23:43 -0500 Subject: [Swift-devel] Falkon and Coaster support for MPI In-Reply-To: <037DBF80-8537-42B7-9A93-C46075548B1E@mcs.anl.gov> References: <4867C5BC.2080702@mcs.anl.gov> <4867D36D.9090801@mcs.anl.gov> <1214764455.11092.4.camel@localhost> <4867E1B2.6080604@cs.uchicago.edu> <037DBF80-8537-42B7-9A93-C46075548B1E@mcs.anl.gov> Message-ID: <1214835823.22531.7.camel@localhost> > > I think we should be nervous about getting into the business of > implementing scheduler functionality like this. That's funny. These are the problems we need to solve. We (Swift) are pretty much suffering from nobody else in the grid world having done so. I guess too many people were nervous. From hategan at mcs.anl.gov Mon Jun 30 10:19:50 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 30 Jun 2008 10:19:50 -0500 Subject: [Swift-devel] Falkon and Coaster support for MPI In-Reply-To: <037DBF80-8537-42B7-9A93-C46075548B1E@mcs.anl.gov> References: <4867C5BC.2080702@mcs.anl.gov> <4867D36D.9090801@mcs.anl.gov> <1214764455.11092.4.camel@localhost> <4867E1B2.6080604@cs.uchicago.edu> <037DBF80-8537-42B7-9A93-C46075548B1E@mcs.anl.gov> Message-ID: <1214839190.23321.16.camel@localhost> On Mon, 2008-06-30 at 03:43 -0500, Ian Foster wrote: > 5) The question has been raised of how to implement (2). One proposal > is to adapt coaster to support MPI jobs. I'm a bit concerned that this > could be expensive: we already have Falkon running well on BG/P, and > given our other commitments to support NSF user communities, putting > scarce resources into replicating that work may not be optimal. > As far as I understand from what Ioan says, Falkon doesn't support MPI jobs.
I think the requirement came from Benoit's group realization that running applications on BG/P without MPI is rather slow. Not that I haven't said that. In any event, it seems like such support is necessary for achieving reasonable performance on BG/P. So it pretty much boils down to whether we want to reasonably support BG/P or not. In terms of the scheduling, what we must keep in mind is that coasters/glideins/falkons can be implemented in such a way as to ensure performance is never worse than without them. So if a 256 node job is submitted, requesting 256 nodes from the queuing system and making sure that no other job will get them before the 256 node job, while also submitting the job earlier if a sufficient number of nodes becomes available is never going to be worse than not having the mechanism. This also ensures that starvation does not happen unless the underlying system would have starved the job anyway. From iraicu at cs.uchicago.edu Mon Jun 30 12:18:10 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Mon, 30 Jun 2008 12:18:10 -0500 Subject: [Swift-devel] Falkon and Coaster support for MPI In-Reply-To: <037DBF80-8537-42B7-9A93-C46075548B1E@mcs.anl.gov> References: <4867C5BC.2080702@mcs.anl.gov> <4867D36D.9090801@mcs.anl.gov> <1214764455.11092.4.camel@localhost> <4867E1B2.6080604@cs.uchicago.edu> <037DBF80-8537-42B7-9A93-C46075548B1E@mcs.anl.gov> Message-ID: <48691552.4010002@cs.uchicago.edu> I am just now catching up with the dozens of emails... Ian Foster wrote: > 4) Ioan points out that a fully general multi-level scheduling > solution with support for multi-CPU jobs may introduce the need for a > smarter scheduler than our current FIFO approach. E.g., if we have 256 > nodes and a queue with jobs of size {32,256,32,32,32,32,32,32,32,32}, > a FIFO strategy would run them in that order, and waste much CPU time. > On the other hand, a simple "first-fit" strategy might starve large jobs. This is all true... 
in the case of Falkon, there are further limitations, such as: 32 CPU MPI job starts and runs for 10 min 256 CPU MPI job is ready to run, but not enough CPUs are available; what is easy to do in Falkon is to place the 256 CPU job back in the queue, and process the next one, which is 32 CPUs... and keep doing this until it finds all 256 CPUs free to schedule the 256 CPU MPI job. This means that the order will be {32, ...., 32, 256}... and this is assuming that at some point, the smaller MPI jobs will stop coming, and let the 256 CPU MPI job start, or else the 256 CPU MPI job will run the risk of being starved. The thing that is a bit harder to achieve (in Falkon) is to actually pause all scheduling decisions when it comes to an MPI job that needs more CPUs than are free, to allow enough CPUs to drain and free up to let the larger MPI job go through as fast as possible. Come to think of it, maybe this is not that hard to implement, as we could simply do a blocking wait until enough CPUs are freed up, and run the scheduler in a single-threaded mode to ensure that no other threads can schedule anything else. So, in a way, I guess it's possible to do both of these, probably not at the same time, but configurable at startup time, whether you want to maintain order requirements (and potentially get poor utilization), or whether you can re-order jobs and do a smallest-job-first ordering that will maximize the utilization (but potentially starve large jobs). > > I think we should be nervous about getting into the business of > implementing scheduler functionality like this. That was my impression as well, at least to add this kind of logic to Falkon. In the end, it's probably not as hard as I thought it would be, but with any new code/functionality, there is always the bag of new bugs, so the time investment is certainly not trivial.
> > I'd like to advocate that in the short term, we try to make this > problem go away by requiring that if an application includes MPI > tasks, they all be of the same size. Yes, that would certainly make it easier on the implementation side. Ioan > > -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== From hategan at mcs.anl.gov Mon Jun 30 12:20:30 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 30 Jun 2008 12:20:30 -0500 Subject: [Swift-devel] Falkon and Coaster support for MPI In-Reply-To: <1214839190.23321.16.camel@localhost> References: <4867C5BC.2080702@mcs.anl.gov> <4867D36D.9090801@mcs.anl.gov> <1214764455.11092.4.camel@localhost> <4867E1B2.6080604@cs.uchicago.edu> <037DBF80-8537-42B7-9A93-C46075548B1E@mcs.anl.gov> <1214839190.23321.16.camel@localhost> Message-ID: <1214846430.32672.8.camel@localhost> On Mon, 2008-06-30 at 10:19 -0500, Mihael Hategan wrote: > In terms of the scheduling, what we must keep in mind is that > coasters/glideins/falkons can be implemented in such a way as to ensure > performance is never worse than without them. In fact, one trivial implementation of this would be to forward all MPI/multi-node jobs directly to the queuing system, under the assumption that they are sufficiently large jobs not to benefit much from queue bypasses. 
> > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From benc at hawaga.org.uk Mon Jun 30 12:27:00 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 30 Jun 2008 17:27:00 +0000 (GMT) Subject: [Swift-devel] Falkon and Coaster support for MPI In-Reply-To: <1214846430.32672.8.camel@localhost> References: <4867C5BC.2080702@mcs.anl.gov> <4867D36D.9090801@mcs.anl.gov> <1214764455.11092.4.camel@localhost> <4867E1B2.6080604@cs.uchicago.edu> <037DBF80-8537-42B7-9A93-C46075548B1E@mcs.anl.gov> <1214839190.23321.16.camel@localhost> <1214846430.32672.8.camel@localhost> Message-ID: On Mon, 30 Jun 2008, Mihael Hategan wrote: > In fact, one trivial implementation of this would be to forward all > MPI/multi-node jobs directly to the queuing system, under the assumption > that they are sufficiently large jobs not to benefit much from queue > bypasses. That's something I probed about at the start of this thread (the message beginning "do you want MPI support for coasters or for swift?") but based on the replies to that it sounds like there is a medium-size use case (wrt the size of a BGP processor set) that would benefit from something that can allocate smaller-than-a-pset blocks of nodes for MPI. 
-- From iraicu at cs.uchicago.edu Mon Jun 30 12:31:53 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Mon, 30 Jun 2008 12:31:53 -0500 Subject: [Swift-devel] Falkon and Coaster support for MPI In-Reply-To: <1214835823.22531.7.camel@localhost> References: <4867C5BC.2080702@mcs.anl.gov> <4867D36D.9090801@mcs.anl.gov> <1214764455.11092.4.camel@localhost> <4867E1B2.6080604@cs.uchicago.edu> <037DBF80-8537-42B7-9A93-C46075548B1E@mcs.anl.gov> <1214835823.22531.7.camel@localhost> Message-ID: <48691889.6060505@cs.uchicago.edu> Come on Mihael, quit being so quick to argue :) Adding MPI support might be easy, or hard, but it will certainly add complexity to the scheduler that needs to support MPI. If the user community asks for this feature, we (Swift, Falkon, etc.) would eventually support it, but it's not clear what priority this should take, which partly depends on how hard it is to achieve the desired goal (given all our limited resources and time). Ioan Mihael Hategan wrote: >> I think we should be nervous about getting into the business of >> implementing scheduler functionality like this. >> > > That's funny. These are the problems we need to solve. We (Swift) are > pretty much suffering from nobody else in the grid world having done so. > I guess too many people were nervous. > > > > -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E.
58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== From hategan at mcs.anl.gov Mon Jun 30 12:38:08 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 30 Jun 2008 12:38:08 -0500 Subject: [Swift-devel] Falkon and Coaster support for MPI In-Reply-To: <48691552.4010002@cs.uchicago.edu> References: <4867C5BC.2080702@mcs.anl.gov> <4867D36D.9090801@mcs.anl.gov> <1214764455.11092.4.camel@localhost> <4867E1B2.6080604@cs.uchicago.edu> <037DBF80-8537-42B7-9A93-C46075548B1E@mcs.anl.gov> <48691552.4010002@cs.uchicago.edu> Message-ID: <1214847488.541.8.camel@localhost> On Mon, 2008-06-30 at 12:18 -0500, Ioan Raicu wrote: > I am just now catching up with the dozens of emails... > > Ian Foster wrote: > > 4) Ioan points out that a fully general multi-level scheduling > > solution with support for multi-CPU jobs may introduce the need for a > > smarter scheduler than our current FIFO approach. E.g., if we have 256 > > nodes and a queue with jobs of size {32,256,32,32,32,32,32,32,32,32}, > > a FIFO strategy would run them in that order, and waste much CPU time. > > On the other hand, a simple "first-fit" strategy might starve large jobs. > This is all true... in the case of Falkon, there are further > limitations, such as: > 32 CPU MPI job starts and runs for 10 min > 256 CPU MPI job is ready to run, but not enough CPUs are available; what > is easy in Falkon to do is to place the 256 CPU job back in the queue, > and process the next one, which is 32 CPUs... and keep doing this until > it finds all 256 CPUs free to schedule the 256 CPU MPI job. This means > that the order will be {32, ...., 32, 256}... 
and this is assuming that > at some point, the smaller MPI jobs will stop coming, and let the 256 > CPU MPI job start, or else the 256 CPU MPI job will run the risk of > being starved. This seems like coming up with particularly bad (though simple) scheduling algorithms and then pointing out that they are... bad. I'm not sure what this is supposed to achieve, but I'd rather start by reading some papers on the topic. From iraicu at cs.uchicago.edu Mon Jun 30 12:40:51 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Mon, 30 Jun 2008 12:40:51 -0500 Subject: [Swift-devel] Falkon and Coaster support for MPI In-Reply-To: <1214839190.23321.16.camel@localhost> References: <4867C5BC.2080702@mcs.anl.gov> <4867D36D.9090801@mcs.anl.gov> <1214764455.11092.4.camel@localhost> <4867E1B2.6080604@cs.uchicago.edu> <037DBF80-8537-42B7-9A93-C46075548B1E@mcs.anl.gov> <1214839190.23321.16.camel@localhost> Message-ID: <48691AA3.5090203@cs.uchicago.edu> Right, but in implementing the glide-in scheduler, we will face many of the same challenges more mature LRMs have faced when implementing MPI support. After thinking about supporting MPI in Falkon further, it might not be as hard as I thought it would be, but it will certainly take at least a day of work (initial estimate, could be longer) to get the first prototype ready, and then lots of testing to make sure it works well. Perhaps we can wait until Ben does his 1 week evaluation of MPI support in Coaster, and make a decision once we have a better understanding how much effort it would take in either Coaster or Falkon. Ioan Mihael Hategan wrote: > On Mon, 2008-06-30 at 03:43 -0500, Ian Foster wrote: > > >> 5) The question has been raised of how to implement (2). One proposal >> is to adapt coaster to support MPI jobs. 
I'm a bit concerned that this >> could be expensive: we already have Falkon running well on BG/P, and >> given our other commitments to support NSF user communities, putting >> scarce resources into replicating that work may not be optimal. >> >> > > As far as I understand from what Ioan says, Falkon doesn't support MPI > jobs. I think the requirement came from Benoit's group realization that > running applications on BG/P without MPI is rather slow. Not that I > haven't said that. > > In any event, it seems like such support is necessary for achieving > reasonable performance on BG/P. So it pretty much boils down to whether > we want to reasonably support BG/P or not. > > In terms of the scheduling, what we must keep in mind is that > coasters/glideins/falkons can be implemented in such a way as to ensure > performance is never worse than without them. So if a 256 node job is > submitted, requesting 256 nodes from the queuing system and making sure > that no other job will get them before the 256 node job, while also > submitting the job earlier if a sufficient number of nodes becomes > available is never going to be worse than not having the mechanism. This > also ensures that starvation does not happen unless the underlying > system would have starved the job anyway. > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 
58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== From iraicu at cs.uchicago.edu Mon Jun 30 12:44:35 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Mon, 30 Jun 2008 12:44:35 -0500 Subject: [Swift-devel] Falkon and Coaster support for MPI In-Reply-To: <1214846430.32672.8.camel@localhost> References: <4867C5BC.2080702@mcs.anl.gov> <4867D36D.9090801@mcs.anl.gov> <1214764455.11092.4.camel@localhost> <4867E1B2.6080604@cs.uchicago.edu> <037DBF80-8537-42B7-9A93-C46075548B1E@mcs.anl.gov> <1214839190.23321.16.camel@localhost> <1214846430.32672.8.camel@localhost> Message-ID: <48691B83.1040600@cs.uchicago.edu> Except that there are limits on how many jobs you can have in the Cobalt queue, mostly for policy reasons due to Cobalt not supporting fair-share scheduling. The limits have been 2 jobs per user for a long time, and they have recently been raised to 6 jobs per user. This obviously won't work with the policies in place today. Ioan Mihael Hategan wrote: > On Mon, 2008-06-30 at 10:19 -0500, Mihael Hategan wrote: > > >> In terms of the scheduling, what we must keep in mind is that >> coasters/glideins/falkons can be implemented in such a way as to ensure >> performance is never worse than without them. >> > > In fact, one trivial implementation of this would be to forward all > MPI/multi-node jobs directly to the queuing system, under the assumption > that they are sufficiently large jobs not to benefit much from queue > bypasses. 
> > >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== From hategan at mcs.anl.gov Mon Jun 30 12:46:23 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 30 Jun 2008 12:46:23 -0500 Subject: [Swift-devel] Falkon and Coaster support for MPI In-Reply-To: <48691889.6060505@cs.uchicago.edu> References: <4867C5BC.2080702@mcs.anl.gov> <4867D36D.9090801@mcs.anl.gov> <1214764455.11092.4.camel@localhost> <4867E1B2.6080604@cs.uchicago.edu> <037DBF80-8537-42B7-9A93-C46075548B1E@mcs.anl.gov> <1214835823.22531.7.camel@localhost> <48691889.6060505@cs.uchicago.edu> Message-ID: <1214847983.541.16.camel@localhost> On Mon, 2008-06-30 at 12:31 -0500, Ioan Raicu wrote: > Comon Mihael, quit being so quick to argue :)\ I'm not sure what you mean. Arguments are not a bad thing. They are the mechanism to clarify issues and solve problems. Arguing by exclusively using rhetoric or not following logic much, on the other hand, is probably not that useful. 
I was trying to point out that there are some recurring patterns when moving from the single site batch job pattern to a multi site grid-like pattern, and that there's little infrastructure to support that move. But the problems themselves are there to stay, and won't go away if ignored. > > Adding MPI support might be easy, or hard, but it will certainly add > complexity to the scheduler that needs to support MPI. If the user > community asks for this feature, we (Swift, Falkon, etc) would > eventually support it, but its not clear what priority this should take, > which partly depends on how hard it is to achieve the desired goal > (given all our limited resources and time). > > Ioan > > Mihael Hategan wrote: > >> I think we should be nervous about getting into the business of > >> implementing scheduler functionality like this. > >> > > > > That's funny. These are the problems we need to solve. We (Swift) are > > pretty much suffering from nobody else in the grid world having done so. > > I guess too many people were nervous. > > > > > > > > > From iraicu at cs.uchicago.edu Mon Jun 30 12:46:09 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Mon, 30 Jun 2008 12:46:09 -0500 Subject: [Swift-devel] Falkon and Coaster support for MPI In-Reply-To: <1214846430.32672.8.camel@localhost> References: <4867C5BC.2080702@mcs.anl.gov> <4867D36D.9090801@mcs.anl.gov> <1214764455.11092.4.camel@localhost> <4867E1B2.6080604@cs.uchicago.edu> <037DBF80-8537-42B7-9A93-C46075548B1E@mcs.anl.gov> <1214839190.23321.16.camel@localhost> <1214846430.32672.8.camel@localhost> Message-ID: <48691BE1.9060404@cs.uchicago.edu> Plus, we were talking about 32-CPU MPI jobs (for performance reasons), but there are 256 CPUs in the smallest allocation Cobalt understands, so if we want to run 8 different jobs of 32 CPU MPI jobs in each P-SET of 256 CPUs, then we can't use Cobalt again, and have to support MPI internally in Falkon/Coaster. 
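[Editorial sketch, not part of the original thread.] The arithmetic in the message above (eight 32-CPU MPI jobs inside one 256-CPU P-SET) combined with the per-job node list Ben described for PBS suggests what a multi-level scheduler would have to do internally: carve one allocation into equal blocks and hand each MPI job its own list of slots. The Python below is a minimal sketch of that idea only. The function names, the `slotNNN` naming, and the one-name-per-line file layout are all hypothetical; nothing here is Falkon, Coaster, or Cobalt API.

```python
def carve_blocks(slot_names, block_size):
    """Split one allocation into equal, non-overlapping blocks of slots."""
    if block_size < 1:
        raise ValueError("block_size must be positive")
    if len(slot_names) % block_size != 0:
        raise ValueError("allocation does not divide evenly into blocks")
    return [slot_names[i:i + block_size]
            for i in range(0, len(slot_names), block_size)]


def nodefile_text(block):
    """Render a block as a per-job node list, one slot name per line.

    The layout is an assumption modelled loosely on the PBS-style node
    file mentioned earlier in the thread.
    """
    return "\n".join(block) + "\n"


# Hypothetical names for the 256 CPU slots of one P-SET.
pset = [f"slot{i:03d}" for i in range(256)]
blocks = carve_blocks(pset, 32)

print(len(blocks))              # 8
print(blocks[1][0])             # slot032
print(nodefile_text(blocks[0]).count("\n"))  # 32
```

Requiring the allocation to divide evenly mirrors the "all MPI tasks the same size" constraint proposed earlier in the thread; mixed sizes would need a real packing policy instead of simple slicing.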
Ioan Mihael Hategan wrote: > On Mon, 2008-06-30 at 10:19 -0500, Mihael Hategan wrote: > > >> In terms of the scheduling, what we must keep in mind is that >> coasters/glideins/falkons can be implemented in such a way as to ensure >> performance is never worse than without them. >> > > In fact, one trivial implementation of this would be to forward all > MPI/multi-node jobs directly to the queuing system, under the assumption > that they are sufficiently large jobs not to benefit much from queue > bypasses. > > >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 
58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== From iraicu at cs.uchicago.edu Mon Jun 30 12:57:41 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Mon, 30 Jun 2008 12:57:41 -0500 Subject: [Swift-devel] Falkon and Coaster support for MPI In-Reply-To: <1214847983.541.16.camel@localhost> References: <4867C5BC.2080702@mcs.anl.gov> <4867D36D.9090801@mcs.anl.gov> <1214764455.11092.4.camel@localhost> <4867E1B2.6080604@cs.uchicago.edu> <037DBF80-8537-42B7-9A93-C46075548B1E@mcs.anl.gov> <1214835823.22531.7.camel@localhost> <48691889.6060505@cs.uchicago.edu> <1214847983.541.16.camel@localhost> Message-ID: <48691E95.2020609@cs.uchicago.edu> In the interest of productive arguing :) ... But we have to prioritize our work! I bet we could put a full-time staff member or student on surveying state-of-the-art scheduling strategies and implementing them in Falkon/Coaster, and it would likely take months (maybe more, if we find hurdles that our environment is posing to us that others didn't face due to a more general Linux environment) to do this task well. I am willing to spend days on finding a solution for this, but anything more, and I would have to say that it's better to add MPI support to Coaster (regardless of what I really believe). If the goal is some fancy scheduling algorithm (beyond the simple one I outlined), well I can save myself a few days of work, and give you/Ben the token to start working on MPI support in Coaster! Ioan Mihael Hategan wrote: > On Mon, 2008-06-30 at 12:31 -0500, Ioan Raicu wrote: > >> Comon Mihael, quit being so quick to argue :)\ >> > > I'm not sure what you mean. Arguments are not a bad thing.
They are the > mechanism to clarify issues and solve problems. Arguing by exclusively > using rhetoric or not following logic much, on the other hand, is > probably not that useful. > > I was trying to point out that there are some recurring patterns when > moving from the single site batch job pattern to a multi site grid-like > pattern, and that there's little infrastructure to support that move. > But the problems themselves are there to stay, and won't go away if > ignored. > > >> Adding MPI support might be easy, or hard, but it will certainly add >> complexity to the scheduler that needs to support MPI. If the user >> community asks for this feature, we (Swift, Falkon, etc) would >> eventually support it, but its not clear what priority this should take, >> which partly depends on how hard it is to achieve the desired goal >> (given all our limited resources and time). >> >> Ioan >> >> Mihael Hategan wrote: >> >>>> I think we should be nervous about getting into the business of >>>> implementing scheduler functionality like this. >>>> >>>> >>> That's funny. These are the problems we need to solve. We (Swift) are >>> pretty much suffering from nobody else in the grid world having done so. >>> I guess too many people were nervous. >>> >>> >>> >>> >>> > > > -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 
58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== From hategan at mcs.anl.gov Mon Jun 30 13:01:12 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 30 Jun 2008 13:01:12 -0500 Subject: [Swift-devel] Falkon and Coaster support for MPI In-Reply-To: <48691B83.1040600@cs.uchicago.edu> References: <4867C5BC.2080702@mcs.anl.gov> <4867D36D.9090801@mcs.anl.gov> <1214764455.11092.4.camel@localhost> <4867E1B2.6080604@cs.uchicago.edu> <037DBF80-8537-42B7-9A93-C46075548B1E@mcs.anl.gov> <1214839190.23321.16.camel@localhost> <1214846430.32672.8.camel@localhost> <48691B83.1040600@cs.uchicago.edu> Message-ID: <1214848872.541.33.camel@localhost> Yep, that's crappy. On Mon, 2008-06-30 at 12:44 -0500, Ioan Raicu wrote: > Except that there are limits on how many jobs you can have in the Cobalt > queue, mostly for policy reasons due to Cobalt not supporting fair-share > scheduling. The limits have been 2 jobs per user for a long time, and > they have recently been raised to 6 jobs per user. This obviously won't > work with the policies in place today. > > Ioan > > Mihael Hategan wrote: > > On Mon, 2008-06-30 at 10:19 -0500, Mihael Hategan wrote: > > > > > >> In terms of the scheduling, what we must keep in mind is that > >> coasters/glideins/falkons can be implemented in such a way as to ensure > >> performance is never worse than without them. > >> > > > > In fact, one trivial implementation of this would be to forward all > > MPI/multi-node jobs directly to the queuing system, under the assumption > > that they are sufficiently large jobs not to benefit much from queue > > bypasses. 
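A minimal sketch of the forwarding rule described above, assuming a job carries a node count and an MPI flag. The names `Job`, `route`, `"queue"` and `"coaster"` are hypothetical and for illustration only; they are not taken from the Coaster or Falkon code.

```python
from dataclasses import dataclass


@dataclass
class Job:
    # Hypothetical job description; not an actual Swift/Coaster type.
    name: str
    node_count: int = 1
    uses_mpi: bool = False


def route(job: Job) -> str:
    """Pick a submission path for a job.

    MPI or multi-node jobs go straight to the site's queuing system,
    on the assumption that they are large enough not to gain much from
    bypassing the queue. Everything else goes to the glide-in workers.
    """
    if job.uses_mpi or job.node_count > 1:
        return "queue"
    return "coaster"
```

With this rule, a multi-node job is submitted as it would be without any glide-in layer, so the layer should not make it slower. Per-user job limits on the queue, such as the Cobalt limits mentioned in the reply, would still apply to the forwarded jobs.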
> > > > > >> _______________________________________________ > >> Swift-devel mailing list > >> Swift-devel at ci.uchicago.edu > >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >> > > From hategan at mcs.anl.gov Mon Jun 30 13:04:18 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 30 Jun 2008 13:04:18 -0500 Subject: [Swift-devel] Falkon and Coaster support for MPI In-Reply-To: <48691E95.2020609@cs.uchicago.edu> References: <4867C5BC.2080702@mcs.anl.gov> <4867D36D.9090801@mcs.anl.gov> <1214764455.11092.4.camel@localhost> <4867E1B2.6080604@cs.uchicago.edu> <037DBF80-8537-42B7-9A93-C46075548B1E@mcs.anl.gov> <1214835823.22531.7.camel@localhost> <48691889.6060505@cs.uchicago.edu> <1214847983.541.16.camel@localhost> <48691E95.2020609@cs.uchicago.edu> Message-ID: <1214849058.541.37.camel@localhost> On Mon, 2008-06-30 at 12:57 -0500, Ioan Raicu wrote: > In the interest of productive arguing :) > ... > > But we have to prioritize our work! > > I bet we could put a full time staff/student on surveying the state of > the art scheduling strategies and to implement that in Falkon/Coaster, > and it would likely take months (maybe more, if we find hurdles that our > environment is posing to us that others didn't face due to a more > general Linux environment) to do this task well. That has no basis in anything, so not really "productive arguing". > > I am willing to spend days on finding a solution for this, but anything > more, and I would have to say that its better to add MPI support to > Coaster (regardless of what I really believe). If the goal is some > fancy scheduling algorithm (beyond the simple one I outlined), well I > can save myself a few days of work, and give you/Ben the token to start > working on MPI support in Coaster! Thank you for the token.
I didn't realize I needed one though.
From bugzilla-daemon at mcs.anl.gov Mon Jun 30 13:15:00 2008 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Mon, 30 Jun 2008 13:15:00 -0500 (CDT) Subject: [Swift-devel] [Bug 147] New: swift hangs at faulty mapping Message-ID: http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=147 Summary: swift hangs at faulty mapping Product: Swift Version: unspecified Platform: All OS/Version: Linux Status: NEW Severity: normal Priority: P2 Component: SwiftScript language AssignedTo: benc at hawaga.org.uk ReportedBy: skenny at uchicago.edu I'm trying to use an external mapper. When the mapping does not work, Swift doesn't return an error but hangs like this: RunID: 20080630-1203-lfa9psed Progress: Progress: Progress: Progress: Progress: Progress: Progress: Progress: Progress: Progress: Progress: Progress: Progress: Secondly, the external mapper is being handed extra arguments (specifically, "-waitfor 88000") which are internal Swift args not meant to be passed to the mapper. -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From iraicu at cs.uchicago.edu Mon Jun 30 13:16:26 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Mon, 30 Jun 2008 13:16:26 -0500 Subject: [Swift-devel] Falkon and Coaster support for MPI In-Reply-To: <1214849058.541.37.camel@localhost> References: <4867C5BC.2080702@mcs.anl.gov> <4867D36D.9090801@mcs.anl.gov> <1214764455.11092.4.camel@localhost> <4867E1B2.6080604@cs.uchicago.edu> <037DBF80-8537-42B7-9A93-C46075548B1E@mcs.anl.gov> <1214835823.22531.7.camel@localhost> <48691889.6060505@cs.uchicago.edu> <1214847983.541.16.camel@localhost> <48691E95.2020609@cs.uchicago.edu> <1214849058.541.37.camel@localhost> Message-ID: <486922FA.6080205@cs.uchicago.edu> Mihael Hategan wrote: > On Mon, 2008-06-30 at 12:57 -0500, Ioan Raicu wrote: > > ...
> > Thank you for the token. I didn't realize I needed one though. > I thought this entire thread was about identifying where (Coaster/Falkon) it makes the most sense to add MPI support, given people are really busy, and we don't want to duplicate the effort. -- =================================================== Ioan Raicu Ph.D.
Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== From hategan at mcs.anl.gov Mon Jun 30 14:27:38 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 30 Jun 2008 14:27:38 -0500 Subject: [Swift-devel] Falkon and Coaster support for MPI In-Reply-To: <486922FA.6080205@cs.uchicago.edu> References: <4867C5BC.2080702@mcs.anl.gov> <4867D36D.9090801@mcs.anl.gov> <1214764455.11092.4.camel@localhost> <4867E1B2.6080604@cs.uchicago.edu> <037DBF80-8537-42B7-9A93-C46075548B1E@mcs.anl.gov> <1214835823.22531.7.camel@localhost> <48691889.6060505@cs.uchicago.edu> <1214847983.541.16.camel@localhost> <48691E95.2020609@cs.uchicago.edu> <1214849058.541.37.camel@localhost> <486922FA.6080205@cs.uchicago.edu> Message-ID: <1214854058.2108.40.camel@localhost> On Mon, 2008-06-30 at 13:16 -0500, Ioan Raicu wrote: > > Mihael Hategan wrote: > > On Mon, 2008-06-30 at 12:57 -0500, Ioan Raicu wrote: > > > > ... > > > > Thank you for the token. I didn't realize I needed one though. > > > I thought this entire thread was about identifying where > (Coaster/Falkon) it makes the most sense to add MPI support, given > people are really busy, and we don't want to duplicate the effort. s/where/whether/. Then you said it's hard (using the "bad is bad" argument), and you wouldn't do it, and I said I didn't think it was that hard, and perhaps it could be reasonably done. 
After which you mentioned that it may not be as hard as you thought (at least 1 day of coding on your part), then later suggested that it still shouldn't be done because a random student/staff would likely not be able to do it in a few months. Further down the line, you mentioned you'll give it a try, and then if you fail after a week, I or Ben could take a stab at it. I believe we agreed that BG is a nasty platform. The issues: 1. You really don't have to do anything as far as I'm concerned, especially if you think it's too much for the time you have. 2. You don't get to decide (by giving tokens or otherwise) by yourself when/whether I or Ben should do things. 3. There is no effort duplication. Falkon is your PhD research project. Coasters are what I/we officially develop as production software. I would like to suggest that in the future, we keep falkon and coaster discussions separate. The term "falkon/coasters" should probably only be used to denote the glide-in concept, not as a reference to the implementations. Mihael From iraicu at cs.uchicago.edu Mon Jun 30 14:30:02 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Mon, 30 Jun 2008 14:30:02 -0500 Subject: [Swift-devel] Falkon and Coaster support for MPI In-Reply-To: <1214854058.2108.40.camel@localhost> References: <4867C5BC.2080702@mcs.anl.gov> <4867D36D.9090801@mcs.anl.gov> <1214764455.11092.4.camel@localhost> <4867E1B2.6080604@cs.uchicago.edu> <037DBF80-8537-42B7-9A93-C46075548B1E@mcs.anl.gov> <1214835823.22531.7.camel@localhost> <48691889.6060505@cs.uchicago.edu> <1214847983.541.16.camel@localhost> <48691E95.2020609@cs.uchicago.edu> <1214849058.541.37.camel@localhost> <486922FA.6080205@cs.uchicago.edu> <1214854058.2108.40.camel@localhost> Message-ID: <4869343A.7030303@cs.uchicago.edu> This email thread is over (from my point of view). We can continue it in person, if you'd like. 
Ioan -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E.
58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== From benc at hawaga.org.uk Mon Jun 30 15:56:01 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 30 Jun 2008 20:56:01 +0000 (GMT) Subject: [Swift-devel] Falkon and Coaster support for MPI In-Reply-To: <1214849058.541.37.camel@localhost> References: <4867C5BC.2080702@mcs.anl.gov> <4867D36D.9090801@mcs.anl.gov> <1214764455.11092.4.camel@localhost> <4867E1B2.6080604@cs.uchicago.edu> <037DBF80-8537-42B7-9A93-C46075548B1E@mcs.anl.gov> <1214835823.22531.7.camel@localhost> <48691889.6060505@cs.uchicago.edu> <1214847983.541.16.camel@localhost> <48691E95.2020609@cs.uchicago.edu> <1214849058.541.37.camel@localhost> Message-ID: On Mon, 30 Jun 2008, Mihael Hategan wrote: > Thank you for the token. I didn't realize I needed one though. Its how you simulate an entirely serial system in a largely distributed environment! -- From bugzilla-daemon at mcs.anl.gov Mon Jun 30 20:38:54 2008 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Mon, 30 Jun 2008 20:38:54 -0500 (CDT) Subject: [Swift-devel] [Bug 148] New: regexp mapper substitution doesn't work properly Message-ID: http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=148 Summary: regexp mapper substitution doesn't work properly Product: Swift Version: unspecified Platform: PC OS/Version: Linux Status: NEW Severity: normal Priority: P2 Component: General AssignedTo: hategan at mcs.anl.gov ReportedBy: hategan at mcs.anl.gov The following produces an error: ... countfile c; ... Namely: Could not compile SwiftScript source: line x:72: unexpected char: '1' This is probably introduced by the "\" escaping scheme added at some point. 
Using countfile ... c; ... does not produce an error, but the transformed name has a backslash prefixed. -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You reported the bug, or are watching the reporter. You are the assignee for the bug, or are watching the assignee.