From smartin at mcs.anl.gov Fri Sep 18 08:01:10 2009
From: smartin at mcs.anl.gov (Stuart Martin)
Date: Fri, 18 Sep 2009 08:01:10 -0500
Subject: [Swift-devel] Fwd: [gateways] Fwd: GRAM V5 Alpha 3 now available
References: <4AB30F06.3060107@cct.lsu.edu>
Message-ID:

Mike, Tibi,

Do you have more job runs that you can send to this gram5 alpha3
deployment on queen bee?

http://dev.globus.org/wiki/GRAM/GRAM5#Deployments_2

- LSU Queen Bee cluster
  - https://queenbee.loni-lsu.teragrid.org:2120/jobmanager-fork
  - https://queenbee.loni-lsu.teragrid.org:2120/jobmanager-pbs

Tibi, I think the problem in the past was that the head node was
rebooted and for a while the SEG was not running there. That causes
job state changes to not get propagated, so jobs hang in the pending
state.

-Stu

Begin forwarded message:

> From: Lukasz Lacinski
> Date: September 17, 2009 11:39:34 PM CDT
> To: Stuart Martin
> Cc: gateways at teragrid.org
> Subject: Re: [gateways] Fwd: GRAM V5 Alpha 3 now available
>
> GRAM5 Alpha3 has been installed on Queen Bee, passed successfully
> first tests and is ready for further tests. GRAM5 Alpha3 listens on
> the same port as GRAM5 Alpha2 did before
> (queenbee.loni-lsu.teragrid.org:2120/jobmanager-pbs).
> If you need help in solving any issues, please do not hesitate to
> contact me.
>
> Regards,
> Lukasz
>
> Stuart Martin wrote:
>> FYI
>>
>> Begin forwarded message:
>>
>>> From: Stuart Martin
>>> Date: September 17, 2009 3:48:59 PM CDT
>>> To: GRAM developer, GRAM users, gt-user at lists.globus.org
>>> Cc: Stuart Martin
>>> Subject: GRAM V5 Alpha 3 now available
>>>
>>> Hi All,
>>>
>>> We are happy to make available a new GRAM V5 alpha 3 version for
>>> testing.
>>> http://dev.globus.org/wiki/GRAM/GRAM5#Alpha_3
>>>
>>> New features in this release:
>>> - Support for clients to get the remote application's exit code
>>> - Support for clients to get the version of the job manager
>>>
>>> Alpha 2 was deployed at 2 TeraGrid sites: LSU's Queen Bee and
>>> NCSA's Abe. Thanks to Lukasz Lacinski and Doru Marcusiu. Also,
>>> Alpha 2 got some excellent testing and feedback from Jaime Frey
>>> and Igor Sfiligoi using condor-g. Thanks Jaime and Igor. Igor
>>> ran a number of tests submitting 5000 jobs at a time. Here is a
>>> comment from Igor on the performance comparison to GRAM2:
>>>
>>> "GRAM5 can easily keep a job turnaround of around 1Hz (50 jobs/min).
>>> Compare this to the 7.5 jobs/min of GT2.
>>> Plus, it can start jobs at a 7Hz rate (450 jobs/min); GT2 best
>>> case scenario was 0.5Hz (33 jobs/min).
>>> (note: the comparison is not really apple to apple due to missing
>>> file transfer)"
>>>
>>> We would encourage GRAM users to install and test this new Alpha 3
>>> version. Please send your feedback to gram-dev at globus.org or to
>>> me directly.
>>>
>>> Thanks for your support!
>>>
>>> - GRAM development team
>>
>> _______________________________________________
>> Gateways mailing list
>> Gateways at teragrid.org
>> http://teragrid.org/mailman/listinfo/gateways
>

From hategan at mcs.anl.gov Fri Sep 18 08:26:32 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 18 Sep 2009 08:26:32 -0500
Subject: [Swift-devel] Fwd: [gateways] Fwd: GRAM V5 Alpha 3 now available
In-Reply-To:
References: <4AB30F06.3060107@cct.lsu.edu>
Message-ID: <1253280392.21622.1.camel@localhost>

On Fri, 2009-09-18 at 08:01 -0500, Stuart Martin wrote:
> Mike, Tibi,
>
> Do you have more job runs that you can send to this gram5 alpha3
> deployment on queen bee?
>
> http://dev.globus.org/wiki/GRAM/GRAM5#Deployments_2
>
> - LSU Queen Bee cluster
>   - https://queenbee.loni-lsu.teragrid.org:2120/jobmanager-fork
>   - https://queenbee.loni-lsu.teragrid.org:2120/jobmanager-pbs
>
> Tibi, I think the problem in the past was that the head node was
> rebooted and for a while the SEG was not running there.

That was one of the big problems of WS-GRAM I think. Is there any way to
ensure that the history won't repeat itself with GRAM5?

From smartin at mcs.anl.gov Fri Sep 18 09:32:14 2009
From: smartin at mcs.anl.gov (Stuart Martin)
Date: Fri, 18 Sep 2009 09:32:14 -0500
Subject: [Swift-devel] Fwd: [gateways] Fwd: GRAM V5 Alpha 3 now available
In-Reply-To: <1253280392.21622.1.camel@localhost>
References: <4AB30F06.3060107@cct.lsu.edu> <1253280392.21622.1.camel@localhost>
Message-ID: <2DD5B820-CC1D-4B34-9628-8EA88C82EBE0@mcs.anl.gov>

True. This will be one of the things we'll need to emphasize with
gram5 admin/maintenance. The problem with GRAM4 was not the SEG, but
the entire container going down and no one noticing. The SEG is just
another process that needs to be running on a host. The SEG will need
to be included in the boot/startup script. I think there are various
programs/tools that people use to do this. We will need to make this
loud and clear and also provide an example or two for how to do this.

On Sep 18, 2009, at 8:26 AM, Mihael Hategan wrote:

> On Fri, 2009-09-18 at 08:01 -0500, Stuart Martin wrote:
>> Mike, Tibi,
>>
>> Do you have more job runs that you can send to this gram5 alpha3
>> deployment on queen bee?
>>
>> http://dev.globus.org/wiki/GRAM/GRAM5#Deployments_2
>>
>> - LSU Queen Bee cluster
>>   - https://queenbee.loni-lsu.teragrid.org:2120/jobmanager-fork
>>   - https://queenbee.loni-lsu.teragrid.org:2120/jobmanager-pbs
>>
>> Tibi, I think the problem in the past was that the head node was
>> rebooted and for a while the SEG was not running there.
>
> That was one of the big problems of WS-GRAM I think. Is there any way to
> ensure that the history won't repeat itself with GRAM5?
>
>

From andric at uchicago.edu Mon Sep 21 16:28:28 2009
From: andric at uchicago.edu (Michael Andric)
Date: Mon, 21 Sep 2009 16:28:28 -0500
Subject: [Swift-devel] trouble resuming
Message-ID:

I'm having trouble resuming swift-jobs. When resuming, it goes through
'Initializing' every single job in the workflow and just finishes without
actually picking up where it left off. Below is the swift script.

Thanks
Michael

## type declarations:
type file{}
type Rscript;

## Mediator app declaration:
app (external turn) run_query (string med_args, file config, Rscript code, file Annot){
    Mediator med_args @filename(code) @filename(Annot);
}

## this process sets parameters and calls Mediator:
loop_query(int vert, string user, string db, string host, string query_outline,
           Rscript code, file config, string subject, string h, int beginTS,
           int endTS, file Annot){
    string outPrefix = @strcat("gest_vs_nogest_vert",vert,h);
    string med_args = @strcat("--user ","andric"," --conf ", @filename(config)," --db ", db," --host ", host,
        " --vox ", vert," --subject ", subject," --subquery tsTSVAR"," --begin_ts ",beginTS," --end_ts ",endTS,
        " --query ", query_outline," --r_swift_args ",outPrefix," ",vert," ",h," ",subject,
        " --outprefix ", "FAH_Q", " --r_script ",@filename(code));
    external turnpt = run_query(med_args, config, code, Annot);
}

## needed parameters to use Mediator:
string user = @arg("user");
string db = "HEL";
string host = "tp-neurodb.ci.uchicago.edu";
file config;

## mapping the R code:
Rscript code;
file Annot;

## variables to move across in the foreach loops:
string declarelist[] = ["ss2"];
string hemilist[] = ["rh"];
int vertices[] = [1:155991:1];
#int vertices[] = [0:1:1];

foreach subject in declarelist{
    foreach h in hemilist{
        int beginTS = 0;
        int endTS = 1254;
        string query_outline = @strcat("SELECT SUBQUERY FROM ",subject,"TS_data",h," WHERE subject = '",subject,"' AND vertex=VOX");
        foreach vert in vertices{
            loop_query(vert, user, db, host, query_outline, code, config,
                       subject, h, beginTS, endTS, Annot);
        }
    }
}

From skenny at uchicago.edu Wed Sep 23 02:55:55 2009
From: skenny at uchicago.edu (skenny at uchicago.edu)
Date: Wed, 23 Sep 2009 02:55:55 -0500 (CDT)
Subject: [Swift-devel] trouble resuming
Message-ID: <20090923025555.CCT43329@m4500-02.uchicago.edu>

i think the main issue is that the rlog only contains
thread id's/mappings for files and not externals (even if
that's all you return).

e.g. the rlog will contain something like:

null.!unmapped
null.!unmapped
null.!unmapped
null.!unmapped
null.!unmapped

...

if externals could be logged, i think the code below would
still need to have loop_query return its external in order for
that to work properly...regardless though, i don't *think*
jobs relying entirely on externals can be resumed in swift,
but maybe mihael will tell me i'm wrong and that there's a
magical solution ;)

~sk

---- Original message ----
>Date: Mon, 21 Sep 2009 16:28:28 -0500
>From: Michael Andric
>Subject: [Swift-devel] trouble resuming
>To: swift-user at ci.uchicago.edu, swift-devel at ci.uchicago.edu
>
> I'm having trouble resuming swift-jobs. When
> resuming, it goes through 'Initializing' every
> single job in the workflow and just finishes without
> actually picking up where it left off. Below is
> the swift script.
> Thanks
> Michael
> ## type declarations:
> type file{}
> type Rscript;
> ## Mediator app declaration:
> app (external turn) run_query (string med_args, file config, Rscript code, file Annot){
>     Mediator med_args @filename(code) @filename(Annot);
> }
> ## this process sets parameters and calls Mediator:
> loop_query(int vert, string user, string db, string host, string query_outline,
>            Rscript code, file config, string subject, string h, int beginTS,
>            int endTS, file Annot){
>     string outPrefix = @strcat("gest_vs_nogest_vert",vert,h);
>     string med_args = @strcat("--user ","andric"," --conf ", @filename(config)," --db ", db," --host ", host,
>         " --vox ", vert," --subject ", subject," --subquery tsTSVAR"," --begin_ts ",beginTS," --end_ts ",endTS,
>         " --query ", query_outline," --r_swift_args ",outPrefix," ",vert," ",h," ",subject,
>         " --outprefix ", "FAH_Q", " --r_script ",@filename(code));
>     external turnpt = run_query(med_args, config, code, Annot);
> }
> ## needed parameters to use Mediator:
> string user = @arg("user");
> string db = "HEL";
> string host = "tp-neurodb.ci.uchicago.edu";
> file config;
> ## mapping the R code:
> Rscript code file="Rturning/turnchi_ss2.R">;
> file Annot file="Rturning/resampled_coding_CarStory.txt">;
> ## variables to move across in the foreach loops:
> string declarelist[] = ["ss2"];
> string hemilist[] = ["rh"];
> int vertices[] = [1:155991:1];
> #int vertices[] = [0:1:1];
> foreach subject in declarelist{
>     foreach h in hemilist{
>         int beginTS = 0;
>         int endTS = 1254;
>         string query_outline = @strcat("SELECT SUBQUERY FROM ",subject,"TS_data",h," WHERE subject = '",subject,"' AND vertex=VOX");
>         foreach vert in vertices{
>             loop_query(vert, user, db, host, query_outline, code, config,
>                        subject, h, beginTS, endTS, Annot);
>         }
>     }
> }
>________________
>_______________________________________________
>Swift-devel mailing list
>Swift-devel at ci.uchicago.edu
>http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel

From hategan at mcs.anl.gov Wed Sep 23 14:48:01 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 23 Sep 2009 14:48:01 -0500
Subject: [Swift-devel] trouble resuming
In-Reply-To: <20090923025555.CCT43329@m4500-02.uchicago.edu>
References: <20090923025555.CCT43329@m4500-02.uchicago.edu>
Message-ID: <1253735281.381.3.camel@localhost>

On Wed, 2009-09-23 at 02:55 -0500, skenny at uchicago.edu wrote:
> i think the main issue is that the rlog only contains
> thread id's/mappings for files and not externals (even if
> that's all you return).
>
> e.g. the rlog will contain something like:
>
> null.!unmapped
> null.!unmapped
> null.!unmapped
> null.!unmapped
> null.!unmapped
>
> ...
>
> if externals could be logged, i think the code below would
> still need to have loop_query return its external in order for
> that to work properly...regardless though, i don't *think*
> jobs relying entirely on externals can be resumed in swift,
> but maybe mihael will tell me i'm wrong and that there's a
> magical solution ;)
>

I can't so far see anything major that would prevent externals from
keeping consistency on a run. Externals are a way to tell swift that the
data management for certain data shouldn't be done by swift. Assuming
that said data management is done "properly", it is equivalent to swift
doing it.

So yeah, I think you might be wrong there :)

Now, the implementation, that's another story. I'll have to look into
that.
From skenny at uchicago.edu Fri Sep 25 13:30:14 2009
From: skenny at uchicago.edu (skenny at uchicago.edu)
Date: Fri, 25 Sep 2009 13:30:14 -0500 (CDT)
Subject: [Swift-devel] trouble resuming
Message-ID: <20090925133014.CCW92280@m4500-02.uchicago.edu>

ok, i see what you're saying...it's 'theoretically' possible,
but how to actually tell swift to do it is the tricky bit ;)

don't know if this is helpful for figuring out a way to do so,
but i tried the following:

type file;
type Rscript;
type mxModel;

app (external min) mxModelProcessor(file covMatrix, Rscript mxModProc, int modnum, float weight, string cond, int net)
{
    RInvoke @filename(mxModProc) @filename(covMatrix) modnum weight cond net;
}

file covMatrix;
Rscript mxScript;
external dbdone[];
int totalperms[] = [1:200];
float initweight = .5;
int net = 1;

foreach perm in totalperms{
    dbdone[perm] = mxModelProcessor(covMatrix, mxScript, perm, initweight, "speech", net);
    trace(@dbdone[perm]);
}

in order to test restart, i made the workflow die by deleting
the remote db table it's trying to access while the workflow
was still running. in this case, it looks like nothing is
written to the rlog (w/the exception of its timestamp).

the trace spits out something like this:

SwiftScript trace: _concurrent/dbdone-d664f24e-673d-47e2-bd83-69027de4928a--array//elt-4
SwiftScript trace: _concurrent/dbdone-d664f24e-673d-47e2-bd83-69027de4928a--array/h24//elt-124
SwiftScript trace: _concurrent/dbdone-d664f24e-673d-47e2-bd83-69027de4928a--array/h9//elt-84
SwiftScript trace: _concurrent/dbdone-d664f24e-673d-47e2-bd83-69027de4928a--array//elt-12
...

swift does print a successful 'stage out' for the jobs that
successfully completed.

again, i'm not sure if this is helpful, but thought it was
worth sharing...log attached.
~sk

---- Original message ----
>Date: Wed, 23 Sep 2009 14:48:01 -0500
>From: Mihael Hategan
>Subject: Re: [Swift-devel] trouble resuming
>To: skenny at uchicago.edu
>Cc: Michael Andric, swift-user at ci.uchicago.edu, swift-devel at ci.uchicago.edu
>
>On Wed, 2009-09-23 at 02:55 -0500, skenny at uchicago.edu wrote:
>> i think the main issue is that the rlog only contains
>> thread id's/mappings for files and not externals (even if
>> that's all you return).
>>
>> e.g. the rlog will contain something like:
>>
>> null.!unmapped
>> null.!unmapped
>> null.!unmapped
>> null.!unmapped
>> null.!unmapped
>>
>> ...
>>
>> if externals could be logged, i think the code below would
>> still need to have loop_query return its external in order for
>> that to work properly...regardless though, i don't *think*
>> jobs relying entirely on externals can be resumed in swift,
>> but maybe mihael will tell me i'm wrong and that there's a
>> magical solution ;)
>>
>
>I can't so far see anything major that would prevent externals from
>keeping consistency on a run. Externals are a way to tell swift that the
>data management for certain data shouldn't be done by swift. Assuming
>that said data management is done "properly", it is equivalent to swift
>doing it.
>
>So yeah, I think you might be wrong there :)
>
>Now, the implementation, that's another story. I'll have to look into
>that.
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: semtest-20090925-1316-o4co0x47.log
Type: application/octet-stream
Size: 2475528 bytes
Desc: not available

From hategan at mcs.anl.gov Fri Sep 25 13:57:43 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 25 Sep 2009 13:57:43 -0500
Subject: [Swift-devel] trouble resuming
In-Reply-To: <20090925133014.CCW92280@m4500-02.uchicago.edu>
References: <20090925133014.CCW92280@m4500-02.uchicago.edu>
Message-ID: <1253905063.1765.2.camel@localhost>

On Fri, 2009-09-25 at 13:30 -0500, skenny at uchicago.edu wrote:
> ok, i see what you're saying...it's 'theoretically' possible,
> but how to actually tell swift to do it is the tricky bit ;)

Right. I suspect the problem is that external variables don't have
mappers that implement things properly. I'd file a bug report.

>
> in order to test restart, i made the workflow die by deleting
> the remote db table it's trying to access while the workflow
> was still running.

If you mess with intermediate data, whether external or not, even if
swift resumes, things are going to be in an inconsistent state.

> in this case, it looks like nothing is
> written to the rlog (w/the exception of its timestamp).

Right. Things are only rlogged when the application is successful.
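
The resume behaviour discussed in this thread can be illustrated with a toy model. This is a sketch only, not Swift's implementation: `log_key`, `run` and `resume_demo` are hypothetical names, jobs run one after another instead of concurrently, and the model assumes that a resumed run decides what to skip by looking up each job's logged output name. It takes from the thread only that an entry is written when an application succeeds, and that outputs of type external show up in the rlog as `null.!unmapped`.

```python
"""Toy model of resuming from a restart log (hypothetical, not Swift code)."""

UNMAPPED = "null.!unmapped"  # placeholder the thread reports seeing in the rlog


def log_key(output_name):
    """Name under which a finished job is logged; externals have no mapped name."""
    return output_name if output_name is not None else UNMAPPED


def run(jobs, done_before=frozenset(), fail_on=None):
    """Run jobs in order and return (executed job ids, new log entries).

    jobs        -- list of (job_id, output_name); output_name is None for
                   an external output
    done_before -- log entries loaded from an earlier run (empty on a fresh run)
    fail_on     -- job id at which this run dies, or None
    """
    executed, log = [], []
    for job_id, output_name in jobs:
        key = log_key(output_name)
        if key in done_before:
            continue  # treated as finished in the earlier run
        if job_id == fail_on:
            break  # the run dies; a failed job is never logged
        executed.append(job_id)
        log.append(key)  # logged only after the job succeeded
    return executed, log


def resume_demo(jobs, fail_on):
    """Run once with a failure, then resume; return what each run executed."""
    first, log = run(jobs, fail_on=fail_on)
    second, _ = run(jobs, done_before=frozenset(log))
    return first, second


mapped = [(i, f"out_{i}.dat") for i in range(4)]
external = [(i, None) for i in range(4)]
print(resume_demo(mapped, fail_on=2))
print(resume_demo(external, fail_on=2))
```

With uniquely mapped outputs the resumed run executes only jobs 2 and 3. With externals every entry shares one placeholder, so in this model the resumed run skips every job and simply finishes, which resembles the symptom reported at the start of the thread. Whether Swift's resume code actually behaves this way is not established here.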
From bugzilla-daemon at mcs.anl.gov Fri Sep 25 14:01:43 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Fri, 25 Sep 2009 14:01:43 -0500 (CDT)
Subject: [Swift-devel] [Bug 219] New: variables of type external are not mapped/written to rlog
Message-ID:

https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=219

           Summary: variables of type external are not mapped/written to rlog
           Product: Swift
           Version: unspecified
          Platform: PC
        OS/Version: Windows
            Status: NEW
          Severity: normal
          Priority: P2
         Component: SwiftScript language
        AssignedTo: benc at hawaga.org.uk
        ReportedBy: skenny at uchicago.edu

workflows depending on variables of type external cannot be resumed because
these variables are not logged or mapped properly

--
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.

From wilde at mcs.anl.gov Sat Sep 26 21:57:23 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Sat, 26 Sep 2009 21:57:23 -0500
Subject: [Swift-devel] Re: ranger block scheduling
In-Reply-To:
References:
Message-ID: <4ABED493.2000908@mcs.anl.gov>

Hi Glen,

The coaster block allocation params are listed in the users guide
section on coasters:

http://www.ci.uchicago.edu/swift/guides/userguide.php#coasters

which in turn refers you here for the details of the coaster params:

http://www.ci.uchicago.edu/swift/guides/userguide.php#profile.globus

Note that the maxWallTime setting of the job (eg from tc.data or
sites.xml) affects how jobs get placed into blocks.

You should use the latest svn rev - to get Mihael's latest fixes.

I think we need to write more explanation and provide examples for how
to set all the parameters. I think the defaults use only one worker node
per block, from a quick read.
Mihael, maybe you can provide a few examples of settings that work well
together, or show some common usages, and explain which params you
typically need to set eg on Ranger or other TG sites.
(workersPerNode=16, 8 etc. of course).

- Mike

On 9/26/09 7:35 PM, Glen Hocky wrote:
> hi mike,
> i updated the install script so now new_oops seems to install correctly
> on bgp and ranger
>
> when you get the chance, can you re-forward me a thread about block
> scheduling so i can see if i can get it running under swift
>
> Glen
>

From hategan at mcs.anl.gov Sun Sep 27 11:27:21 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sun, 27 Sep 2009 11:27:21 -0500
Subject: [Swift-devel] Re: ranger block scheduling
In-Reply-To: <4ABED493.2000908@mcs.anl.gov>
References: <4ABED493.2000908@mcs.anl.gov>
Message-ID: <1254068841.23563.9.camel@localhost>

On Sat, 2009-09-26 at 21:57 -0500, Michael Wilde wrote:
> Hi Glen,
>
> The coaster block allocation params are listed in the users guide
> section on coasters:
>
> http://www.ci.uchicago.edu/swift/guides/userguide.php#coasters
>
> which in turn refers you here for the details of the coaster params:
>
> http://www.ci.uchicago.edu/swift/guides/userguide.php#profile.globus
>
> Note that the maxWallTime setting of the job (eg from tc.data or
> sites.xml) affects how jobs get placed into blocks.
>
> You should use the latest svn rev - to get Mihael's latest fixes.
>
> I think we need to write more explanation and provide examples for how
> to set all the parameters. I think the defaults use only one worker node
> per block, from a quick read.

Right.

>
> Mihael, maybe you can provide a few examples of settings that work well
> together, or show some common usages, and explain which params you
> typically need to set eg on Ranger or other TG sites.
> (workersPerNode=16, 8 etc. of course).

Right. On Ranger it's workersPerNode=16.
Depending on the queue you may want to set slots (3 on "development"),
maxNodes (16 on "development") and maxtime (7200 on "development").
Those are the system/queue dependent settings that are needed in order
to prevent jobs that are outside of the queue spec, which would cause
things to fail with the insightful "The job manager detected an invalid
script response" error message.
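
The queue-dependent settings above can be collected in one place and checked before they go into sites.xml. The sketch below is an illustration only: `QUEUE_LIMITS`, `check_settings` and `profile_lines` are hypothetical helpers, the numbers are the ones quoted above for Ranger's "development" queue and are treated here as upper bounds (an assumption), and the `<profile namespace="globus" key="...">` line format is assumed from the user guide's profile section, not taken from this thread.

```python
from xml.sax.saxutils import escape, quoteattr

# Values quoted in the thread for Ranger's "development" queue,
# treated here as upper bounds (an assumption of this sketch).
QUEUE_LIMITS = {
    "development": {"slots": 3, "maxNodes": 16, "maxtime": 7200},
}


def check_settings(queue, settings):
    """Return a list of problems; the list is empty if the settings fit the queue."""
    problems = []
    for key, limit in QUEUE_LIMITS[queue].items():
        value = settings.get(key)
        if value is not None and value > limit:
            problems.append(f"{key}={value} exceeds {queue} limit {limit}")
    return problems


def profile_lines(settings, namespace="globus"):
    """Render each setting as one sites.xml profile entry (assumed format)."""
    return [
        f"<profile namespace={quoteattr(namespace)} key={quoteattr(key)}>"
        f"{escape(str(value))}</profile>"
        for key, value in settings.items()
    ]


ranger_development = {
    "workersPerNode": 16,
    "slots": 3,
    "maxNodes": 16,
    "maxtime": 7200,
}

for problem in check_settings("development", ranger_development):
    print("warning:", problem)
for line in profile_lines(ranger_development):
    print(line)
```

Checking the values up front is meant to catch an out-of-spec request before submission instead of after the "invalid script response" failure; limits for other queues or sites would have to be added to `QUEUE_LIMITS` by hand.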