From benc at hawaga.org.uk Tue Jul 1 03:12:22 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 1 Jul 2008 08:12:22 +0000 (GMT) Subject: [Swift-devel] appalling mail disaster while I was sleeping Message-ID: If you sent mail in the 12 hours preceding this message, please resend it to me. -- From benc at hawaga.org.uk Tue Jul 1 16:11:27 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 1 Jul 2008 21:11:27 +0000 (GMT) Subject: [Swift-devel] multiple worker.sh in the same job Message-ID: Here is one problem with Swift + MPI, with a workaround, that Andriy Fedorov and I have uncovered. I'm interested in anyone's commentary. If you use GRAM with jobtype=mpi, then your job is run through mpirun, and thus executes on each node in the job rather than once. In the case of Swift submitting this way, 'your job' actually means the Swift server-side code, wrapper.sh, not 'your (the user's) job'. This means there are multiple wrapper.sh jobs running, all trying to use the same working directory, input files and output files. Andriy tried making only one of the nodes create output files (e.g. the rank 0 node), and that appears to work in his case, though I think the following is happening: * each worker will link the same input files into the same working directory. if this was a copy, this would be a potentially damaging race condition. as it's a link, I think there is still a race condition there that would cause some of the workers to fail (so perhaps in the presence of any input files at all this won't work - I think Andriy's test case does not have any input files). * I think that all except the rank-0 wrapper script indicate failure (because of missing output files); and the rank-0 wrapper script indicates success. Swift submit-side checks for the success flag before the failure flag, so it regards the job as successful. I think this only works if at least one job succeeds, which pretty much means one job must generate all the output files, rather than different jobs generating different output files. I haven't really tested the above out in great depth, but I think that is what is happening. From a technical perspective, I think the way to address this is to swap the mpirun and wrapper.sh, so that one wrapper.sh runs, and inside that it runs mpirun, which then spawns only the application executables. There you lose the abstraction from GRAM of being able to specify jobtype=mpi; instead you have to know how to do this yourself, and run the job as a normal, not mpi, job from GRAM's perspective. However, in the case of non-GRAM execution mechanisms, that abstraction is not in place anyway. -- From hategan at mcs.anl.gov Tue Jul 1 16:27:58 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 01 Jul 2008 16:27:58 -0500 Subject: [Swift-devel] multiple worker.sh in the same job In-Reply-To: References: Message-ID: <1214947678.23653.9.camel@localhost> This whole thing, I think, applies not only to MPI jobs, but also to any job requesting more than one node. So I think the solution is not to swap mpirun and wrapper.sh, but, along the lines of what Andriy did, perform all the relevant wrapper functions in only one instance and have a barrier right before running the executable as well as right after. How exactly this would be done is a little hazy in my head, but I guess that's what makes it interesting. On Tue, 2008-07-01 at 21:11 +0000, Ben Clifford wrote: > Here is one problem with Swift + MPI, with a workaround, that Andriy Fedorov > and I have uncovered.
I'm interested in anyone's > commentary. > > If you use GRAM with jobtype=mpi, then your job is run through mpirun, and > thus executes on each node in the job rather than once. > > In the case of Swift submitting this way, 'your job' actually means the > Swift server-side code, wrapper.sh, not 'your (the user's) job'. > > This means there are multiple wrapper.sh jobs running, all trying to use > the same working directory, input files and output files. > > Andriy tried making only one of the nodes create output files (e.g. the rank > 0 node), and that appears to work in his case, though I think the > following is happening: > > * each worker will link the same input files into the same working > directory. if this was a copy, this would be a potentially damaging > race condition. as it's a link, I think there is still a race > condition there that would cause some of the workers to fail (so > perhaps in the presence of any input files at all this won't work - I > think Andriy's test case does not have any input files). > > * I think that all except the rank-0 wrapper script indicate failure > (because of missing output files); and the rank-0 wrapper script > indicates success. Swift submit-side checks for the success flag before > the failure flag, so it regards the job as successful. I think this only > works if at least one job succeeds, which pretty much means one job > must generate all the output files, rather than different jobs > generating different output files. > > I haven't really tested the above out in great depth, but I think that is > what is happening. > > From a technical perspective, I think the way to address this is to swap > the mpirun and wrapper.sh, so that one wrapper.sh runs, and inside that it > runs mpirun, which then spawns only the application executables. > > There you lose the abstraction from GRAM of being able to specify > jobtype=mpi; instead you have to know how to do this yourself, and run the > job as a normal, not mpi, job from GRAM's perspective. > > However, in the case of non-GRAM execution mechanisms, that > abstraction is not in place anyway. > From benc at hawaga.org.uk Tue Jul 1 16:41:42 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 1 Jul 2008 21:41:42 +0000 (GMT) Subject: [Swift-devel] multiple worker.sh in the same job In-Reply-To: <1214947678.23653.9.camel@localhost> References: <1214947678.23653.9.camel@localhost> Message-ID: On Tue, 1 Jul 2008, Mihael Hategan wrote: > This whole thing, I think, applies not only to MPI jobs, but also to any > job requesting more than one node. So I think the solution is not to > swap mpirun and wrapper.sh, but, along the lines of what Andriy did, > perform all the relevant wrapper functions in only one instance and have > a barrier right before running the executable as well as right after. > How exactly this would be done is a little hazy in my head, but I guess > that's what makes it interesting. Pretty much, that's what making mpirun the job run by the wrapper script does, at least given how PBS on TG-UC seems to behave: mpirun exposes a single unix executable, and before it runs, no app code has run anywhere, while when it finishes, all app code has finished everywhere. The start and end of that executable are the barriers above. As far as I can tell, PBS, though it allocates a bunch of nodes, only spawns your job on one of them, and informs you of the list of nodes it has allocated to you through the $PBS_NODEFILE environment variable (which mpirun then uses, for example).
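In sketch form, the one-wrapper-plus-mpirun pattern is something like the script below. This is only an illustration of the ordering guarantees - the file names, paths and mpirun flags are made up, not the real wrapper.sh:

#!/bin/bash
# runs exactly once, on the single node that PBS spawns the job on
# stage-in happens before mpirun starts, so no rank can race against it
ln -s /shared/input.dat input.dat                  # hypothetical input
# mpirun is the barrier pair: no app code has run before this line,
# and all app code everywhere has finished when it returns
mpirun -np "$(wc -l < "$PBS_NODEFILE")" \
       -machinefile "$PBS_NODEFILE" /path/to/app   # hypothetical app
# success/failure checking happens once, after every rank is done
test -f output.dat || exit 1                       # hypothetical output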
-- From benc at hawaga.org.uk Tue Jul 1 17:25:14 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 1 Jul 2008 22:25:14 +0000 (GMT) Subject: [Swift-devel] Re: multiple worker.sh in the same job In-Reply-To: References: Message-ID: Here's how I just ran a simple mpi-hello-world with one wrapper.sh that launches MPI inside the wrapper. I would be interested if Andriy could try his app in the style shown below. I think the behaviour is now correct. From a user configuration perspective it is somewhat unpleasant, though. On TG-UC: /home/benc/mpi/a.out is my mpi hello world program /home/benc/mpi/mpi.sh contains: #!/bin/bash echo running launcher on $(hostname) mpirun -np 3 -machinefile $PBS_NODEFILE /home/benc/mpi/a.out On swift submit side (my laptop): tc.data maps mpi to /home/benc/mpi/mpi.sh sites.xml defines: TG-CCR080002N ia64-compute 4 single /home/benc/mpi Note specifically, jobtype=single (which is what causes only a single wrapper.sh to be run, even though 4 nodes will be allocated). mpi.swift contains: $ cat mpi.swift type file; (file o, file e) p() { app { mpi stdout=@filename(o) stderr=@filename(e); } } file mpiout <"mpi.out">; file mpierr <"mpi.err">; (mpiout, mpierr) = p(); so now run the above, and the output of the hello world MPI app (different pieces output by all workers) appears mpi.out, correctly staged back through mpirun and wrapper.sh. -- From benc at hawaga.org.uk Wed Jul 2 04:16:23 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 2 Jul 2008 09:16:23 +0000 (GMT) Subject: [Swift-devel] Re: multiple worker.sh in the same job In-Reply-To: References: Message-ID: On Tue, 1 Jul 2008, Ben Clifford wrote: > Here's how I just ran a simple mpi-hello-world with one wrapper.sh that > launches MPI inside the wrapper. I would be interested if Andriy could try And here is how I did it with swift+gram4 (rather than swift+gram2 which my original message was about). Use the same setup, but with this sites.xml: TG-CCR080002N 3:ia64-compute single /home/benc/mpi -- From lixi at uchicago.edu Wed Jul 2 08:07:49 2008 From: lixi at uchicago.edu (lixi at uchicago.edu) Date: Wed, 2 Jul 2008 08:07:49 -0500 (CDT) Subject: [Swift-devel] Re: No response of Swift run Message-ID: <20080702080749.BBV69776@m4500-03.uchicago.edu> >Hi, > >I launched a Swift workflow (including 2001 jobs) at 16:16 >yesterday. At 17:20, it returned the results of 2000 jobs, >then there is no reponse any more. I wonder why? I enabled >the replication option. > >The log file is very large (more 1G) and is on CI: >/home/lixi/newswift/test/newversion/workflowtest-20080629- >1616-c4h22j03.log > >Please check it, thanks > The similar execution result occurred again. The log file is on CI: /home/lixi/newswift/test/newversion/0701/workflowtest- 20080701-1206-sjuu3cnc.log Thanks, Xi From benc at hawaga.org.uk Wed Jul 2 08:14:17 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 2 Jul 2008 13:14:17 +0000 (GMT) Subject: [Swift-devel] Re: No response of Swift run In-Reply-To: <20080702080749.BBV69776@m4500-03.uchicago.edu> References: <20080702080749.BBV69776@m4500-03.uchicago.edu> Message-ID: cog r2064 and r2065 introduce some changes in the scheduling code which will reduce the size of log files substantially and fix a hanging problem that was introduced with my r2058 scheduler changes. This might or might not fix your problem. I think probably not, but it is worth a try. 
-- From lixi at uchicago.edu Wed Jul 2 08:34:04 2008 From: lixi at uchicago.edu (lixi at uchicago.edu) Date: Wed, 2 Jul 2008 08:34:04 -0500 (CDT) Subject: [Swift-devel] Re: No response of Swift run Message-ID: <20080702083404.BBV71639@m4500-03.uchicago.edu> >cog r2064 and r2065 introduce some changes in the scheduling code which >will reduce the size of log files substantially and fix a hanging problem >that was introduced with my r2058 scheduler changes. > >This might or might not fix your problem. I think probably not, but it is >worth a try. > Thanks, I'll try. In fact, this is the result of Swift svn swift-r2079 cog- r2063. Xi From benc at hawaga.org.uk Wed Jul 2 08:43:00 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 2 Jul 2008 13:43:00 +0000 (GMT) Subject: [Swift-devel] Re: No response of Swift run In-Reply-To: <20080702083404.BBV71639@m4500-03.uchicago.edu> References: <20080702083404.BBV71639@m4500-03.uchicago.edu> Message-ID: On Wed, 2 Jul 2008, lixi at uchicago.edu wrote: > In fact, this is the result of Swift svn swift-r2079 cog- > r2063. Yes, I can see that from the log file. Actually it is r2063 with some changes that you have applied, according to the log file (presumably one of the patches that mihael and I sent earlier that you will not need to use after r2065) In your log workflowtest-20080701-1206-sjuu3cnc, a single task appears to still be in 'Active' state, which is possibly why the run does not end. The task ID for that is 0-1-1550-2-1214932015745. It is a file transfer of some kind. I think to site AGLT2 though the log information is a little vague - probably we should give more information there. -- From benc at hawaga.org.uk Wed Jul 2 09:10:03 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 2 Jul 2008 14:10:03 +0000 (GMT) Subject: [Swift-devel] Re: No response of Swift run In-Reply-To: <20080702083404.BBV71639@m4500-03.uchicago.edu> References: <20080702083404.BBV71639@m4500-03.uchicago.edu> Message-ID: unrelated to your problem: here are log plots: http://www.ci.uchicago.edu/~benc/tmp/report-workflowtest-20080701-1206-sjuu3cnc/ the table: 'sites/success table' gives some quantification of what replication is doing. the columns in that table mean, basically: JOB_SUCCESS - a job ran all the way through APPLICATION_EXCEPTION - a job was attempted but failed JOB_CANCELLED - a job was submitted to the queue, but a replica ran first so this was cancelled. On the big (high success rate) sites, it looks like around a third of submissions end up getting cancelled due to replication. -- From hategan at mcs.anl.gov Wed Jul 2 10:12:44 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 02 Jul 2008 10:12:44 -0500 Subject: [Swift-devel] Re: [Swift-user] Re: No response of Swift run In-Reply-To: <20080702080749.BBV69776@m4500-03.uchicago.edu> References: <20080702080749.BBV69776@m4500-03.uchicago.edu> Message-ID: <1215011564.469.4.camel@localhost> Could you do the following for me: 1. edit dist/vdsk-xyz/bin/swift 2. replace the 'OPTIONS=' line with 'OPTIONS="-Xdebug -Xrunjdwp:transport=dt_socket,address=8888,server=y,suspend=n"' (a single line) (you may need to do this every time you compile swift) 3. then run it again and let me know when it hangs. Don't kill the hanging workflow. Let it hang instead. 4. Also let me know what machine you run this on. On Wed, 2008-07-02 at 08:07 -0500, lixi at uchicago.edu wrote: > >Hi, > > > >I launched a Swift workflow (including 2001 jobs) at 16:16 > >yesterday. 
At 17:20, it returned the results of 2000 jobs, > >then there is no reponse any more. I wonder why? I enabled > >the replication option. > > > >The log file is very large (more 1G) and is on CI: > >/home/lixi/newswift/test/newversion/workflowtest-20080629- > >1616-c4h22j03.log > > > >Please check it, thanks > > > The similar execution result occurred again. The log file is > on CI: > /home/lixi/newswift/test/newversion/0701/workflowtest- > 20080701-1206-sjuu3cnc.log > > Thanks, > > Xi > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user From lixi at uchicago.edu Wed Jul 2 12:22:09 2008 From: lixi at uchicago.edu (lixi at uchicago.edu) Date: Wed, 2 Jul 2008 12:22:09 -0500 (CDT) Subject: [Swift-devel] Re: [Swift-user] Re: No response of Swift run Message-ID: <20080702122209.BBV97884@m4500-03.uchicago.edu> >Could you do the following for me: >1. edit dist/vdsk-xyz/bin/swift >2. replace the 'OPTIONS=' line with 'OPTIONS="-Xdebug >- Xrunjdwp:transport=dt_socket,address=8888,server=y,suspend=n" ' (a >single line) (you may need to do this every time you compile swift) >3. then run it again and let me know when it hangs. Don't kill the >hanging workflow. Let it hang instead. >4. Also let me know what machine you run this on. Now I'm running this workflow again on login.ci.uchicago.edu. Meanwhile, I launched another swift run to test a single site, but I got such error: [lixi at login GLOW]$ swift -sites.file GLOW.sites.xml -tc.file tc.data workflowtest.swift ERROR: transport error 202: bind failed: Address already in use ["transport.c",L41] ERROR: JDWP Transport dt_socket failed to initialize, TRANSPORT_INIT(510) ["debugInit.c",L500] JDWP exit error JVMTI_ERROR_INTERNAL(113): No transports initializedFATAL ERROR in native method: JDWP No transports initialized, jvmtiError=JVMTI_ERROR_INTERNAL(113) Is there something to do with this option? Thanks, Xi From hategan at mcs.anl.gov Wed Jul 2 12:30:52 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 02 Jul 2008 12:30:52 -0500 Subject: [Swift-devel] Re: [Swift-user] Re: No response of Swift run In-Reply-To: <20080702122209.BBV97884@m4500-03.uchicago.edu> References: <20080702122209.BBV97884@m4500-03.uchicago.edu> Message-ID: <1215019852.3631.4.camel@localhost> On Wed, 2008-07-02 at 12:22 -0500, lixi at uchicago.edu wrote: > >Could you do the following for me: > >1. edit dist/vdsk-xyz/bin/swift > >2. replace the 'OPTIONS=' line with 'OPTIONS="-Xdebug > >- > Xrunjdwp:transport=dt_socket,address=8888,server=y,suspend=n" > ' (a > >single line) (you may need to do this every time you > compile swift) > >3. then run it again and let me know when it hangs. Don't > kill the > >hanging workflow. Let it hang instead. > >4. Also let me know what machine you run this on. > > Now I'm running this workflow again on login.ci.uchicago.edu. > > Meanwhile, I launched another swift run to test a single > site, but I got such error: > [lixi at login GLOW]$ swift -sites.file GLOW.sites.xml -tc.file > tc.data workflowtest.swift > ERROR: transport error 202: bind failed: Address already in > use ["transport.c",L41] > ERROR: JDWP Transport dt_socket failed to initialize, > TRANSPORT_INIT(510) ["debugInit.c",L500] > JDWP exit error JVMTI_ERROR_INTERNAL(113): No transports > initializedFATAL ERROR in native method: JDWP No transports > initialized, jvmtiError=JVMTI_ERROR_INTERNAL(113) > > Is there something to do with this option? 
It has everything to do with that option :) As far as I remember, things should continue to run ok (except for the debugger not being started), so you should ignore the error message. If swift doesn't run, then you could make two copies of the swift startup script (say swift-debugger with the option and swift without the option). Then if you want the debugger on, use swift-debugger, and for normal runs, use swift. > > Thanks, > > Xi From lixi at uchicago.edu Wed Jul 2 12:38:29 2008 From: lixi at uchicago.edu (lixi at uchicago.edu) Date: Wed, 2 Jul 2008 12:38:29 -0500 (CDT) Subject: [Swift-devel] Re: [Swift-user] Re: No response of Swift run Message-ID: <20080702123829.BBV99464@m4500-03.uchicago.edu> >> Now I'm running this workflow again on login.ci.uchicago.edu. This workflow with 2001 jobs finished successfully and quickly without hanging up. Then I continue to launch a workflow with 3001 jobs and see the result. >As far as I remember, things should continue to run ok (except for the >debugger not being started), so you should ignore the error message. If >swift doesn't run, then you could make two copies of the swift startup >script (say swift-debugger with the option and swift without the >option). Then if you want the debugger on, use swift- debugger, and for >normal runs, use swift. Do you mean that I could copy swift into swift-debugger (specifying that option). I could choose one of these ways to run swift, e.g: swift first.swift swift-debugger first.swift Then it will invoke the corresponding script. >> >> Thanks, >> >> Xi > From hategan at mcs.anl.gov Wed Jul 2 14:26:49 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 02 Jul 2008 14:26:49 -0500 Subject: [Swift-devel] Re: [Swift-user] Re: No response of Swift run In-Reply-To: <20080702123829.BBV99464@m4500-03.uchicago.edu> References: <20080702123829.BBV99464@m4500-03.uchicago.edu> Message-ID: <1215026809.5659.0.camel@localhost> On Wed, 2008-07-02 at 12:38 -0500, lixi at uchicago.edu wrote: > >> Now I'm running this workflow again on > login.ci.uchicago.edu. > > This workflow with 2001 jobs finished successfully and > quickly without hanging up. Then I continue to launch a > workflow with 3001 jobs and see the result. > > >As far as I remember, things should continue to run ok > (except for the > >debugger not being started), so you should ignore the error > message. If > >swift doesn't run, then you could make two copies of the > swift startup > >script (say swift-debugger with the option and swift > without the > >option). Then if you want the debugger on, use swift- > debugger, and for > >normal runs, use swift. > > Do you mean that I could copy swift into swift-debugger > (specifying that option). I could choose one of these ways > to run swift, e.g: > swift first.swift > swift-debugger first.swift > > Then it will invoke the corresponding script. Yes. > > >> > >> Thanks, > >> > >> Xi > > From lixi at uchicago.edu Thu Jul 3 10:45:44 2008 From: lixi at uchicago.edu (lixi at uchicago.edu) Date: Thu, 3 Jul 2008 10:45:44 -0500 (CDT) Subject: [Swift-devel] The effectiveness of new changes in scheduler Message-ID: <20080703104544.BBW80845@m4500-03.uchicago.edu> Hi, I launched a workflow (including 4001 jobs, the sequential execution time should be more than 4001*60s) yesterday. It finished successfully, although there is a long waiting time (no job submission). 
The log plot is: http://www.ci.uchicago.edu/~lixi/Log/report-workflowtest-20080702-1751-nt031l0f/ I think this result could show that the new scheduler changes can prevent a certain bad site from absorbing all retries and improve the success rate of the whole workflow execution, but might potentially increase the workflow execution time. Some parameters might be tuned to get the right tradeoff between success rate and execution time. I hope that this feedback could help. Xi From hategan at mcs.anl.gov Thu Jul 3 10:58:49 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 03 Jul 2008 10:58:49 -0500 Subject: [Swift-devel] The effectiveness of new changes in scheduler In-Reply-To: <20080703104544.BBW80845@m4500-03.uchicago.edu> References: <20080703104544.BBW80845@m4500-03.uchicago.edu> Message-ID: <1215100729.19239.1.camel@localhost> That doesn't look right. I see no reason for that long pause to happen. On Thu, 2008-07-03 at 10:45 -0500, lixi at uchicago.edu wrote: > Hi, > > I launched a workflow (including 4001 jobs, the sequential > execution time should be more than 4001*60s) yesterday. It > finished successfully, although there is a long waiting time > (no job submission). The log plot is: > http://www.ci.uchicago.edu/~lixi/Log/report-workflowtest-20080702-1751-nt031l0f/ > > I think this result could show that the new scheduler changes > can prevent a certain bad site from absorbing all retries and > improve the success rate of the whole workflow execution, but > might potentially increase the workflow execution time. Some > parameters might be tuned to get the right tradeoff between > success rate and execution time. I hope that this feedback > could help. > > Xi > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From fedorov at cs.wm.edu Thu Jul 3 09:55:52 2008 From: fedorov at cs.wm.edu (Andriy Fedorov) Date: Thu, 3 Jul 2008 10:55:52 -0400 Subject: [Swift-devel] Re: multiple worker.sh in the same job In-Reply-To: References: Message-ID: <82f536810807030755i592d2372nf7b00ac3a9ba60f4@mail.gmail.com> Ben Clifford wrote: > * each worker will link the same input files into the same working > directory. if this was a copy, this would be a potentially damaging > race condition. as it's a link, I think there is still a race > condition there that would cause some of the workers to fail (so > perhaps in the presence of any input files at all this won't work - I > think Andriy's test case does not have any input files). This is correct, I hadn't tried that. So the first thing I tried was to confirm your conjecture. I updated the MPI example to take an input file: hello_mpi3.c ==> #include <stdio.h> #include <mpi.h> int main(int argc, char **argv){ int myrank, size; MPI_Init(&argc, &argv); MPI_Comm_rank(MPI_COMM_WORLD, &myrank); MPI_Comm_size(MPI_COMM_WORLD, &size); fprintf(stderr, "Hello, world from cpu %i (total %i)\n", myrank, size); if(myrank==atoi(argv[1])){ FILE *fIn, *fOut; fIn = fopen(argv[2], "r"); fOut = fopen(argv[3], "w"); char inStr[255]; fscanf(fIn, "%s", &inStr[0]); fprintf(fOut, "File IO: Hello, world from cpu %i (total %i).
The message is:\"%s\"\n", myrank, size, inStr); fclose(fOut); } MPI_Finalize(); return 0; } hello_mpi3.c <== Here's the Swift script: hello_mpi_swift3.swift ==> type messagefile {} (messagefile fOut) greeting(messagefile fIn) { app { hello_mpi "0" @fIn @fOut; } } messagefile outfile <"hello_mpi3.txt">; messagefile infile <"test_input.txt">; outfile = greeting(infile); hello_mpi_swift3.swift <== And didn't change anything in the tc.data (kept jobType=mpi). My tc.data is: UC-GT4 hello_mpi /home/fedorov/local/bin/hello_mpi3_v INSTALLED INTEL32::LINUX GLOBUS::hostCount="4",jobType=mpi,maxWallTime="10",count="4" What I see happening is that the PBS reports job starts, but never finishes (Status is "R"). I don't know what is going on there. I guess it confirms what Ben suggested. I am not the one to explain what exactly is going on though. > Here's how I just ran a simple mpi-hello-world with one wrapper.sh that > launches MPI inside the wrapper. I would be interested if Andriy could try > his app in the style shown below. > I tried this with the test when I have file input, file output at 0 rank, and stderr output at all ranks. Everything works great!!!! Of course, the wrapper has to be updated each time to handle the command line, not perfectly convenient, but the main thing is that it's working. Thanks, Ben!!! -- Andrey > I think the behaviour is now correct. From a user configuration > perspective it is somewhat unpleasant, though. > > On TG-UC: > > /home/benc/mpi/a.out is my mpi hello world program > /home/benc/mpi/mpi.sh contains: > > #!/bin/bash > > echo running launcher on $(hostname) > mpirun -np 3 -machinefile $PBS_NODEFILE /home/benc/mpi/a.out > > > On swift submit side (my laptop): > > tc.data maps mpi to /home/benc/mpi/mpi.sh > > sites.xml defines: > > > > url="tg-grid.uc.teragrid.org/jobmanager-pbs" > major="2" /> > TG-CCR080002N > ia64-compute > 4 > single > /home/benc/mpi > > > Note specifically, jobtype=single (which is what causes only a single > wrapper.sh to be run, even though 4 nodes will be allocated). > > mpi.swift contains: > > $ cat mpi.swift > > type file; > > (file o, file e) p() { > app { > mpi stdout=@filename(o) stderr=@filename(e); > } > } > > file mpiout <"mpi.out">; > file mpierr <"mpi.err">; > > (mpiout, mpierr) = p(); > > > > so now run the above, and the output of the hello world MPI app (different > pieces output by all workers) appears mpi.out, correctly staged back > through mpirun and wrapper.sh. > > -- > From benc at hawaga.org.uk Fri Jul 4 02:27:23 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 4 Jul 2008 07:27:23 +0000 (GMT) Subject: [Swift-devel] The effectiveness of new changes in scheduler In-Reply-To: <1215100729.19239.1.camel@localhost> References: <20080703104544.BBW80845@m4500-03.uchicago.edu> <1215100729.19239.1.camel@localhost> Message-ID: On Thu, 3 Jul 2008, Mihael Hategan wrote: > That doesn't look right. I see no reason for that long pause to happen. Times I've seen that happen in the past are if you suspend the run with ctrl-z or if there is a network failure. 11 hours of such pause looks very wrong. 
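If you want to locate the pause, one rough way is to diff consecutive timestamps in the log. The awk below is only a sketch - it assumes every line starts with a '2008-07-02 17:51:00,123-0500 ...' style stamp (continuation lines such as stack traces will confuse it) and it ignores midnight rollover, so treat its output as a hint:

awk '{
  split($2, t, /[:,]/)                  # HH:MM:SS,millis-zone
  s = t[1]*3600 + t[2]*60 + t[3]
  if (prev != "" && s - prev > 600)     # report gaps over ten minutes
    print "gap of " (s - prev) "s before: " $1, $2
  prev = s
}' workflowtest-20080702-1751-nt031l0f.log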
-- From hategan at mcs.anl.gov Fri Jul 4 03:13:06 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 04 Jul 2008 03:13:06 -0500 Subject: [Swift-devel] The effectiveness of new changes in scheduler In-Reply-To: References: <20080703104544.BBW80845@m4500-03.uchicago.edu> <1215100729.19239.1.camel@localhost> Message-ID: <1215159186.9471.0.camel@localhost> On Fri, 2008-07-04 at 07:27 +0000, Ben Clifford wrote: > > On Thu, 3 Jul 2008, Mihael Hategan wrote: > > > That doesn't look right. I see no reason for that long pause to happen. > > Times I've seen that happen in the past are if you suspend the run with > ctrl-z or if there is a network failure. 11 hours of such pause looks very > wrong. It's precisely what my suspicion was, but that's for Xi to clarify. > From benc at hawaga.org.uk Sun Jul 6 11:41:52 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Sun, 6 Jul 2008 16:41:52 +0000 (GMT) Subject: [Swift-devel] stageout ordering vs restarts Message-ID: At present, stageouts for jobs tend to execute quite late in a run, in as much as when there are other jobs to run, the stageins for those jobs will usually use available file transfer rate-limit load before stageouts happen. I've noticed this before as a user interface quirk - users see GRAM jobs complete on remote sites, but do not see output files appear on the submit side until much much later and sometimes misinterpret that as a failure. However, I think there is an issue here with how restarts work too. Jobs are not recorded as done for the purposes of restart (i.e. will not be re-executed) until stageout has finished. When stageout is happening late, that means in late-stageout situations, lots of work will be done but to the extent that it can be ignored on restarts. So that makes early-stageout behaviour more appealing in some situations - situations in which it is expected that a restart will be necessary, or where it is preferable to have slower job execution in exchange for more stuff marked as done in the restart logs. That is perhaps worth thinking about as part of the project that Ragib is working on. -- From lixi at uchicago.edu Sun Jul 6 12:29:01 2008 From: lixi at uchicago.edu (lixi at uchicago.edu) Date: Sun, 6 Jul 2008 12:29:01 -0500 (CDT) Subject: [Swift-devel] Re: [Swift-user] Re: No response of Swift run Message-ID: <20080706122901.BBY13774@m4500-03.uchicago.edu> >Could you do the following for me: >1. edit dist/vdsk-xyz/bin/swift >2. replace the 'OPTIONS=' line with 'OPTIONS="-Xdebug >- Xrunjdwp:transport=dt_socket,address=8888,server=y,suspend=n" ' (a >single line) (you may need to do this every time you compile swift) >3. then run it again and let me know when it hangs. Don't kill the >hanging workflow. Let it hang instead. >4. Also let me know what machine you run this on. > Today, I ran a workflow with 5001 jobs using swift-debugger, but it finished with error message: ERROR: transport error 202: handshake failed - received >GET http://www< - excepted >JDWP-Handshake< ["transport.c",L41] This is the first time for me to encounter this error. The log file is on CI: /home/lixi/newswift/test/newversion/0706/workflowtest- 20080706-1134-o8s4a3ig.log Thanks, xi From foster at mcs.anl.gov Sun Jul 6 13:32:09 2008 From: foster at mcs.anl.gov (Ian Foster) Date: Sun, 6 Jul 2008 13:32:09 -0500 Subject: [Swift-devel] stageout ordering vs restarts In-Reply-To: References: Message-ID: Ben: That's an interesting observation. Can we control the relative priorities of stage outs and stage ins? Ian. 
On Jul 6, 2008, at 11:41 AM, Ben Clifford wrote: > > At present, stageouts for jobs tend to execute quite late in a run, > in as > much as when there are other jobs to run, the stageins for those > jobs will > usually use available file transfer rate-limit load before stageouts > happen. > > I've noticed this before as a user interface quirk - users see GRAM > jobs > complete on remote sites, but do not see output files appear on the > submit > side until much much later and sometimes misinterpret that as a > failure. > > However, I think there is an issue here with how restarts work too. > Jobs > are not recorded as done for the purposes of restart (i.e. will not be > re-executed) until stageout has finished. > > When stageout is happening late, that means in late-stageout > situations, > lots of work will be done but to the extent that it can be ignored on > restarts. > > So that makes early-stageout behaviour more appealing in some > situations - > situations in which it is expected that a restart will be necessary, > or > where it is preferable to have slower job execution in exchange for > more > stuff marked as done in the restart logs. > > That is perhaps worth thinking about as part of the project that > Ragib is > working on. > > -- > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From benc at hawaga.org.uk Sun Jul 6 16:04:45 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Sun, 6 Jul 2008 21:04:45 +0000 (GMT) Subject: [Swift-devel] stageout ordering vs restarts In-Reply-To: References: Message-ID: On Sun, 6 Jul 2008, Ian Foster wrote: > > When stageout is happening late, that means in late-stageout situations, > > lots of work will be done but to the extent that it can be ignored on > > restarts. That is possibly the worstly phrased paragraph I have ever written in my life. -- From benc at hawaga.org.uk Sun Jul 6 16:47:34 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Sun, 6 Jul 2008 21:47:34 +0000 (GMT) Subject: [Swift-devel] stageout ordering vs restarts In-Reply-To: References: Message-ID: On Sun, 6 Jul 2008, Ian Foster wrote: > Can we control the relative priorities of stage outs and stage ins? Ragib might be looking at some stuff wrt job prioritisation in general in his summer project. As to how the priority is specified in this case, I don't know if some real-number based prioritisation is useful or if (I think more likely) a choice of more discrete prioritisations such as "always stageout first", "always stageout last", "make no priority distinction between stageouts and other jobs" (or more generally, once you have started the first job for a Swift-level procedure call, do you want to get the other jobs done as fast as possible at the expense of overall workflow throughput?) -- From hategan at mcs.anl.gov Sun Jul 6 22:04:36 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 06 Jul 2008 22:04:36 -0500 Subject: [Swift-devel] Re: [Swift-user] Re: No response of Swift run In-Reply-To: <20080706122901.BBY13774@m4500-03.uchicago.edu> References: <20080706122901.BBY13774@m4500-03.uchicago.edu> Message-ID: <1215399876.29501.2.camel@localhost> On Sun, 2008-07-06 at 12:29 -0500, lixi at uchicago.edu wrote: > >Could you do the following for me: > >1. edit dist/vdsk-xyz/bin/swift > >2. 
replace the 'OPTIONS=' line with 'OPTIONS="-Xdebug > >- > Xrunjdwp:transport=dt_socket,address=8888,server=y,suspend=n" > ' (a > >single line) (you may need to do this every time you > compile swift) > >3. then run it again and let me know when it hangs. Don't > kill the > >hanging workflow. Let it hang instead. > >4. Also let me know what machine you run this on. > > > > Today, I ran a workflow with 5001 jobs using swift-debugger, > but it finished with error message: > ERROR: transport error 202: handshake failed - received >GET > http://www< - excepted >JDWP-Handshake< ["transport.c",L41] > > This is the first time for me to encounter this error. The > log file is on > CI: /home/lixi/newswift/test/newversion/0706/workflowtest- > 20080706-1134-o8s4a3ig.log Well, probably somebody was nice enough to portscan that machine while the workflow was running. I guess there isn't any easy solution to this. Maybe somebody else has a better idea. > > Thanks, > > xi From hategan at mcs.anl.gov Sun Jul 6 22:05:54 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 06 Jul 2008 22:05:54 -0500 Subject: [Swift-devel] stageout ordering vs restarts In-Reply-To: References: Message-ID: <1215399954.29501.4.camel@localhost> On Sun, 2008-07-06 at 16:41 +0000, Ben Clifford wrote: > At present, stageouts for jobs tend to execute quite late in a run, in as > much as when there are other jobs to run, the stageins for those jobs will > usually use available file transfer rate-limit load before stageouts > happen. I think it's overall a better strategy. Because the stageouts can than run in parallel with the other jobs as opposed to the other jobs blocking for the previous stageouts to complete. From benc at hawaga.org.uk Mon Jul 7 01:28:10 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 7 Jul 2008 06:28:10 +0000 (GMT) Subject: [Swift-devel] stageout ordering vs restarts In-Reply-To: <1215399954.29501.4.camel@localhost> References: <1215399954.29501.4.camel@localhost> Message-ID: On Sun, 6 Jul 2008, Mihael Hategan wrote: > I think it's overall a better strategy. If you prefer throughput to restartability. > Because the stageouts can than > run in parallel with the other jobs as opposed to the other jobs > blocking for the previous stageouts to complete. -- From benc at hawaga.org.uk Mon Jul 7 04:39:07 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 7 Jul 2008 09:39:07 +0000 (GMT) Subject: [Swift-devel] [VOTE] 0.6 release plan In-Reply-To: References: Message-ID: This is a new attempt to make a 0.6 release. The release plan below is roughly the same as my last abortive attempt at making a release, except that I no longer propose using dev.globus release procedure. I will be the release manager for Swift 0.6. I'm going to make a release candidate for 0.6 sometime within the next three days, and hope to release that as 0.6 proper next weekend (maybe 6 days from now). I'm hoping to have a single release candidate, with minor bugs being noted and fixed in 0.7 rather than causing a new release candidate. I'm planning on announcing coasters and replication as experimental features which we encourage interested parties to experiment with and report their experiences. There will be no repository freeze for this release. This release will use the traditional Swift release process, not the dev.globus release process. A release will be made when the release manager (me) sees fit, rather than based on voting. 
This release plan is subject to 'Lazy Majority' approval, which means that this plan is automatically approved until/unless someone votes -1. So pretty much you do not need to vote +1 until/unless someone votes -1. If you wish to vote -1, it appears that you should vote -1 specifically for the issues that you disagree with rather than the plan as a whole. -- From hategan at mcs.anl.gov Mon Jul 7 09:34:39 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 07 Jul 2008 09:34:39 -0500 Subject: [Swift-devel] stageout ordering vs restarts In-Reply-To: References: <1215399954.29501.4.camel@localhost> Message-ID: <1215441279.686.2.camel@localhost> On Mon, 2008-07-07 at 06:28 +0000, Ben Clifford wrote: > On Sun, 6 Jul 2008, Mihael Hategan wrote: > > > I think it's overall a better strategy. > > If you prefer throughput to restartability. That's as much an issue with staging order as it is with the way restarts work. Jobs could be marked as done with the files on the remote site. > > > Because the stageouts can than > > run in parallel with the other jobs as opposed to the other jobs > > blocking for the previous stageouts to complete. > From skenny at uchicago.edu Mon Jul 7 10:37:57 2008 From: skenny at uchicago.edu (skenny at uchicago.edu) Date: Mon, 7 Jul 2008 10:37:57 -0500 (CDT) Subject: [Swift-devel] trouble submitting to ranger Message-ID: <20080707103757.BHN88503@m4500-02.uchicago.edu> hi all, i'm trying to submit a simple job to the tacc ranger site. i can submit fine with gt2 and gt4, cog and globus-url-copy also works: [skenny at fletch ~]$ globusrun-ws -submit -s -F tg-login.ranger.tacc.teragrid.org -c /bin/hostname Delegating user credentials...Done. Submitting job...Done. Job ID: uuid:db847f4c-4c35-11dd-a1d8-0019d1912829 Termination time: 07/08/2008 15:03 GMT Current job state: Active Current job state: CleanUp-Hold login3.ranger.tacc.utexas.edu Current job state: CleanUp Current job state: Done Destroying job...Done. Cleaning up any delegated credentials...Done. [skenny at fletch ~]$ globus-job-run tg-login.ranger.tacc.teragrid.org /bin/hostname login3.ranger.tacc.utexas.edu [skenny at fletch ~]$ cog-job-submit -p gt4 -e /bin/hostname -s tg-login.ranger.tacc.teragrid.org Unable to find required classes (javax.activation.DataHandler and javax.mail.internet.MimeMultipart). Attachment support is disabled. Submitting task: Task(type=JOB_SUBMISSION, identity=urn:cog-1215443209762) 1215443211137 1215443212355 12154432145271215443216670 Job completed but when i try to run a swift job (a simple script calling 'env') it hangs indefinitely. with this in the log: java.net.ConnectException: Connection refused 2008-07-07 10:21:59,968-0500 DEBUG vdl:execute2 APPLICATION_EXCEPTION jobid=env-kzpkn6vi - Applicatio\ n exception: Cannot submit job: ; nested exception is: java.net.ConnectException: Connection refused here's the sites entry: 0 4 /share/home/00043/tg457040/sidgrid_out/{username} everything from the run is here including the log: /home/skenny/swift/check_env. finally, in case this email wasn't long enough :) Swift svn swift-r2084 cog-r2065 ideas? sarah From benc at hawaga.org.uk Mon Jul 7 10:40:57 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 7 Jul 2008 15:40:57 +0000 (GMT) Subject: [Swift-devel] trouble submitting to ranger In-Reply-To: <20080707103757.BHN88503@m4500-02.uchicago.edu> References: <20080707103757.BHN88503@m4500-02.uchicago.edu> Message-ID: ranger has multiple IP addresses. One of them is refusing connections. 
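You can see the rotation from the submit host with something like the below (illustrative only - I'm assuming the usual round-robin DNS setup for the login name, and that the container listens on the default GT4 port, 8443):

$ host tg-login.ranger.tacc.teragrid.org   # run a few times; note the A records
$ telnet <one-of-those-addresses> 8443     # poke each address to find the bad one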
Try running the cog and swift examples you gave three or four times and see if the behaviour is consistent. -- From wilde at mcs.anl.gov Mon Jul 7 11:02:36 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 07 Jul 2008 11:02:36 -0500 Subject: [Swift-devel] [VOTE] 0.6 release plan In-Reply-To: References: Message-ID: <48723E1C.3060207@mcs.anl.gov> +1 On 7/7/08 4:39 AM, Ben Clifford wrote: > This is a new attempt to make a 0.6 release. The release plan below is > roughly the same as my last abortive attempt at making a release, except > that I no longer propose using dev.globus release procedure. > > I will be the release manager for Swift 0.6. > > I'm going to make a release candidate for 0.6 sometime > within the next three days, and hope to release that as 0.6 proper next > weekend (maybe 6 days from now). > > I'm hoping to have a single release candidate, with minor bugs being noted > and fixed in 0.7 rather than causing a new release candidate. > > I'm planning on announcing coasters and replication as experimental > features which we encourage interested parties to experiment with and > report their experiences. > > There will be no repository freeze for this release. > > This release will use the traditional Swift release process, not the > dev.globus release process. A release will be made when the release > manager (me) sees fit, rather than based on voting. > > This release plan is subject to 'Lazy Majority' approval, which means that > this plan is automatically approved until/unless someone votes -1. So > pretty much you do not need to vote +1 until/unless someone votes -1. If > you wish to vote -1, it appears that you should vote -1 specifically for > the issues that you disagree with rather than the plan as a whole. > From skenny at uchicago.edu Mon Jul 7 11:46:34 2008 From: skenny at uchicago.edu (skenny at uchicago.edu) Date: Mon, 7 Jul 2008 11:46:34 -0500 (CDT) Subject: [Swift-devel] trouble submitting to ranger Message-ID: <20080707114634.BHN96213@m4500-02.uchicago.edu> yep, one completed, one refused: [skenny at gwynn check_env]$ cog-job-submit -p gt4 -e /bin/hostname -s tg-login.ranger.tacc.teragrid.org Unable to find required classes (javax.activation.DataHandler and javax.mail.internet.MimeMultipart). Attachment support is disabled. Submitting task: Task(type=JOB_SUBMISSION, identity=urn:cog-1215449079513) 1215449080280 1215449081168 12154490829651215449093478 Job completed [skenny at gwynn check_env]$ cog-job-submit -p gt4 -e /bin/hostname -s tg-login.ranger.tacc.teragrid.org Unable to find required classes (javax.activation.DataHandler and javax.mail.internet.MimeMultipart). Attachment support is disabled. Submitting task: Task(type=JOB_SUBMISSION, identity=urn:cog-1215449095891) 1215449096655 1215449097286 Submission Exception: Cannot submit job: ; nested exception is: java.net.ConnectException: Connection refused [skenny at gwynn check_env]$ ---- Original message ---- >Date: Mon, 7 Jul 2008 15:40:57 +0000 (GMT) >From: Ben Clifford >Subject: Re: [Swift-devel] trouble submitting to ranger >To: skenny at uchicago.edu >Cc: swift-devel at ci.uchicago.edu > >ranger has multiple IP addresses. One of them is refusing connections. > >Try running the cog and swift examples you gave three or four times and >see if the behaviour is consistent. 
>-- > From benc at hawaga.org.uk Mon Jul 7 11:49:26 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 7 Jul 2008 16:49:26 +0000 (GMT) Subject: [Swift-devel] trouble submitting to ranger In-Reply-To: <20080707114634.BHN96213@m4500-02.uchicago.edu> References: <20080707114634.BHN96213@m4500-02.uchicago.edu> Message-ID: On Mon, 7 Jul 2008, skenny at uchicago.edu wrote: > yep, one completed, one refused: broken site then. -- From skenny at uchicago.edu Mon Jul 7 12:11:06 2008 From: skenny at uchicago.edu (skenny at uchicago.edu) Date: Mon, 7 Jul 2008 12:11:06 -0500 (CDT) Subject: [Swift-devel] trouble submitting to ranger Message-ID: <20080707121106.BHN99491@m4500-02.uchicago.edu> it is weird though that all the times i tried with swift it has not gone thru once...does that mean swift is *somehow* always using the bad ip address? ---- Original message ---- >Date: Mon, 7 Jul 2008 16:49:26 +0000 (GMT) >From: Ben Clifford >Subject: Re: [Swift-devel] trouble submitting to ranger >To: skenny at uchicago.edu >Cc: swift-devel at ci.uchicago.edu > > > >On Mon, 7 Jul 2008, skenny at uchicago.edu wrote: > >> yep, one completed, one refused: > >broken site then. > >-- From hategan at mcs.anl.gov Mon Jul 7 12:50:59 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 07 Jul 2008 12:50:59 -0500 Subject: [Swift-devel] trouble submitting to ranger In-Reply-To: <20080707121106.BHN99491@m4500-02.uchicago.edu> References: <20080707121106.BHN99491@m4500-02.uchicago.edu> Message-ID: <1215453059.3751.2.camel@localhost> On Mon, 2008-07-07 at 12:11 -0500, skenny at uchicago.edu wrote: > it is weird though that all the times How many is "all"? > i tried with swift it > has not gone thru once...does that mean swift is *somehow* > always using the bad ip address? > > ---- Original message ---- > >Date: Mon, 7 Jul 2008 16:49:26 +0000 (GMT) > >From: Ben Clifford > >Subject: Re: [Swift-devel] trouble submitting to ranger > >To: skenny at uchicago.edu > >Cc: swift-devel at ci.uchicago.edu > > > > > > > >On Mon, 7 Jul 2008, skenny at uchicago.edu wrote: > > > >> yep, one completed, one refused: > > > >broken site then. > > > >-- > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From hategan at uchicago.edu Mon Jul 7 22:26:46 2008 From: hategan at uchicago.edu (Mihael Hategan) Date: Mon, 07 Jul 2008 22:26:46 -0500 Subject: [Swift-devel] my MCS account Message-ID: <1215487606.10764.2.camel@localhost> Apparently, and despite the fact that it was "extended" for one week (aka this week), my ANL account expired. So I'm very much likely not going to be able to read any of the email sent to my @mcs account until the issue is solved. Gotta love bureaucracy. Mihael From iraicu at cs.uchicago.edu Thu Jul 10 22:39:02 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Thu, 10 Jul 2008 22:39:02 -0500 Subject: [Swift-devel] [Fwd: IEEE eScience 2008 NEW PAPER DEADLINE: AUGUST 10] Message-ID: <4876D5D6.2040508@cs.uchicago.edu> Here is a good conference... and they just extended the deadline too. Cheers, Ioan -------- Original Message -------- Subject: IEEE eScience 2008 NEW PAPER DEADLINE: AUGUST 10 Date: Thu, 10 Jul 2008 17:46:18 -0400 From: escience at indiana.edu Reply-To: escience at indiana.edu To: IRAICU at CS.UCHICAGO.EDU Please note, the deadline for paper submission has been extended to August 10. 
Organizing committees of the 4th International IEEE Computer Society Technical Committee on Scalable Computing eScience 2008 Conference are now accepting papers and proposals for tutorials; posters, exhibits and demos. The conference is being held in partnership with the Microsoft Research eScience Workshop and is hosted by Indiana University. Conference Dates: December 7-12, 2008 Conference Location: University Place Conference Center, Indiana University/Purdue University (IUPUI) Campus, Indianapolis, IN Submission Deadlines: Papers: August 10, 2008 (EXTENDED) Tutorials: July 20, 2008 Posters, Exhibits and Demos: September 14, 2008 For submission guidelines and more information visit the conference Web site: http://escience2008.iu.edu. Topics of interest cover applications and technologies related to eScience, grid and cloud computing. They include, but are not limited to, the following: * Application development environments * Autonomic, real-time, and self-organizing grids * Cloud computing and storage * Collaborative science models and techniques * Enabling technologies: Internet and Web services * e-Science for applications including physics, biology, astronomy, chemistry, finance, engineering, and the humanities * Grid economy and business models * Problem-solving environments * Programming paradigms and models * Resource management and scheduling * Security challenges for grids and e-Science * Sensor networks and environmental observatories * Service-oriented grid architectures * Virtual instruments and data access management * Virtualization for technical computing * Web 2.0 technology and services for e-Science Sponsors Include: IEEE Computer Society Committee on Scalable Computing Microsoft Research Pervasive Technology Labs at Indiana University Indiana University School of Informatics Louisiana State University Center for Computation and Technology Conference Leadership: General Chairs Geoffrey Fox, Indiana University, United States Dennis Gannon, Indiana University, United States Program Chair Anne Trefethen, University of Oxford, United Kingdom Program Vice-Chair David Wallom, University of Oxford, United Kingdom Workshops Chair Ken Chiu, State University of New York, United States Tutorials Chair Krishna Madhavan, Clemson University, United States Exhibits, Demos, and Posters Chair Daniel S. Katz, Louisiana State University, United States Exhibits, Demos, and Posters Vice-Chair Shantenu Jha, Louisiana State University, United States Education, Diversity, and Broadening Participation Chair Alex Ramirez, Hispanic Association of Colleges and Universities Communication and Outreach Chair Daphne Siefert-Herron, Indiana University, United States Microsoft e-Science Conference Chair Kristin Tolle, Microsoft, United States Conference Manager Therese Miller, Indiana University, United States -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 
58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== From benc at hawaga.org.uk Fri Jul 11 12:39:08 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 11 Jul 2008 17:39:08 +0000 (GMT) Subject: [Swift-devel] swift 0.6-rc4 Message-ID: Swift 0.6rc4 is available from http://www.ci.uchicago.edu/~benc/vdsk-0.6rc4.tar.gz Please test and report. My present target for release is Tuesday. -- From benc at hawaga.org.uk Mon Jul 14 11:37:37 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 14 Jul 2008 16:37:37 +0000 (GMT) Subject: [Swift-devel] too much slow down. Message-ID: With the recent changes made to the scheduler to deal with bad sites in a multisite run, the behaviour in the presence of a single bad site and no good sites seems to be that a run will sit for a very long time rather than the previous behaviour of failing pretty fast. This is perhaps unpleasant, perhaps not; but its a significant change to behaviour. -- From hategan at mcs.anl.gov Mon Jul 14 11:52:22 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 14 Jul 2008 11:52:22 -0500 Subject: [Swift-devel] too much slow down. In-Reply-To: References: Message-ID: <1216054342.9377.4.camel@localhost> On Mon, 2008-07-14 at 16:37 +0000, Ben Clifford wrote: > With the recent changes made to the scheduler to deal with bad sites in a > multisite run, the behaviour in the presence of a single bad site and no > good sites seems to be that a run will sit for a very long time rather > than the previous behaviour of failing pretty fast. > > This is perhaps unpleasant, perhaps not; but its a significant change to > behaviour. Isn't this what we wanted? > From benc at hawaga.org.uk Mon Jul 14 12:00:06 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 14 Jul 2008 17:00:06 +0000 (GMT) Subject: [Swift-devel] too much slow down. In-Reply-To: <1216054342.9377.4.camel@localhost> References: <1216054342.9377.4.camel@localhost> Message-ID: On Mon, 14 Jul 2008, Mihael Hategan wrote: > > This is perhaps unpleasant, perhaps not; but its a significant change to > > behaviour. > > Isn't this what we wanted? In the multisite case, yes. In a single case site, its fairly close to needless hanging around. -- From hategan at mcs.anl.gov Mon Jul 14 12:16:03 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 14 Jul 2008 12:16:03 -0500 Subject: [Swift-devel] too much slow down. In-Reply-To: References: <1216054342.9377.4.camel@localhost> Message-ID: <1216055763.10436.3.camel@localhost> On Mon, 2008-07-14 at 17:00 +0000, Ben Clifford wrote: > On Mon, 14 Jul 2008, Mihael Hategan wrote: > > > > This is perhaps unpleasant, perhaps not; but its a significant change to > > > behaviour. > > > > Isn't this what we wanted? > > In the multisite case, yes. In a single case site, its fairly close to > needless hanging around. Maybe it's a temporary problem? In the single site case all you have is one site. So the question becomes, how does one make the best of that one single site and, before that, what that best even is. 
From foster at mcs.anl.gov Mon Jul 14 12:51:11 2008 From: foster at mcs.anl.gov (Ian Foster) Date: Mon, 14 Jul 2008 12:51:11 -0500 Subject: [Swift-devel] Fwd: Release MRNET 2.0 References: <200807141635.m6EGZoFB003710@asiago.cs.wisc.edu> Message-ID: could potentially be relevant to BG/P work ... Begin forwarded message: > From: Barton Miller > Date: July 14, 2008 11:35:50 AM CDT > To: foster at mcs.anl.gov > Subject: Release MRNET 2.0 > > ********************************************************************* > * Release 2.0 of the MRNet Multicast/Reduction Network * > ********************************************************************* > > The new 2.0 release of MRNet is now available. This release of MRNet > includes a source distribution for Unix/Linux and Windows platforms, > plus pre-built binaries for Windows. The release also includes > associated manuals. > > Improvements made to MRNet since version 1.2 have been incorporated > into > version 2.0. > > MRNet is a customizable, high-performance software infrastructure for > building scalable tools and applications. It supports efficient > multicast and data aggregation functionality using a tree of processes > between the tool's front-end and back-ends. MRNet-based tools may use > these internal processes to distribute many important tool activities, > for example to reduce data analysis time and keep tool front-end loads > manageable. > > More information about MRNet, including downloads for binary and > source, can be found at: > http://www.paradyn.org/mrnet > > Technical papers about MRNet can be found in the 'Middleware' > section at: > http://www.paradyn.org/html/publications-by-category.html > > The new features and fixes include: > > * Fault tolerance and recovery for internal MRNet node failures > * Improved API for examining MRNet topology > * New filter capabilities such as dynamic configuration > * Improved memory management > * Improved support for multi-threaded front-ends and back-ends > * Updated examples, including a sample Makefile > * Numerous bug fixes and enhancements > > MRNet is designed to be a highly-portable system. The source code has > been compiled using both GNU GCC and native platform compilers. We > have successfully tested MRNet components on the following platforms: > - Linux: x86, x86_64, ia64, powerpc64 > - Solaris: sparc32, sparc64 > - AIX 5.2: rs6000 > - Windows: x86 > > Please send questions or comments to paradyn at cs.wisc.edu -------------- next part -------------- An HTML attachment was scrubbed... URL: From iraicu at cs.uchicago.edu Mon Jul 14 13:06:12 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Mon, 14 Jul 2008 13:06:12 -0500 Subject: [Swift-devel] Fwd: Release MRNET 2.0 In-Reply-To: References: <200807141635.m6EGZoFB003710@asiago.cs.wisc.edu> Message-ID: <487B9594.5030800@cs.uchicago.edu> Here are two of their recent papers that are relevant: ftp://ftp.cs.wisc.edu/paradyn/papers/Lee08ScalingSTAT.pdf ftp://ftp.cs.wisc.edu/paradyn/papers/Brim08GroupFile.pdf Still reading them, but good finds! Ioan Ian Foster wrote: > could potentially be relevant to BG/P work ... > > Begin forwarded message: > >> *From: *Barton Miller > >> *Date: *July 14, 2008 11:35:50 AM CDT >> *To: *foster at mcs.anl.gov >> *Subject: **Release MRNET 2.0* >> >> ********************************************************************* >> * Release 2.0 of the MRNet Multicast/Reduction Network * >> ********************************************************************* >> >> The new 2.0 release of MRNet is now available. 
From iraicu at cs.uchicago.edu Mon Jul 14 13:06:12 2008
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Mon, 14 Jul 2008 13:06:12 -0500
Subject: [Swift-devel] Fwd: Release MRNET 2.0
In-Reply-To: 
References: <200807141635.m6EGZoFB003710@asiago.cs.wisc.edu>
Message-ID: <487B9594.5030800@cs.uchicago.edu>

Here are two of their recent papers that are relevant:
ftp://ftp.cs.wisc.edu/paradyn/papers/Lee08ScalingSTAT.pdf
ftp://ftp.cs.wisc.edu/paradyn/papers/Brim08GroupFile.pdf

Still reading them, but good finds!
Ioan

Ian Foster wrote:
> could potentially be relevant to BG/P work ...
>
> Begin forwarded message:
>
>> *From: *Barton Miller
>> *Date: *July 14, 2008 11:35:50 AM CDT
>> *To: *foster at mcs.anl.gov
>> *Subject: **Release MRNET 2.0*
>>
>> The new 2.0 release of MRNet is now available. [...]

-- 
===================================================
Ioan Raicu
Ph.D. Candidate
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
       http://dev.globus.org/wiki/Incubator/Falkon
       http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================
===================================================

From bugzilla-daemon at mcs.anl.gov Mon Jul 14 15:23:38 2008
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Mon, 14 Jul 2008 15:23:38 -0500 (CDT)
Subject: [Swift-devel] [Bug 149] New: Improve readdata() error message
Message-ID: 

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=149

           Summary: Improve readdata() error message
           Product: Swift
           Version: unspecified
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: minor
          Priority: P3
         Component: SwiftScript language
        AssignedTo: hategan at mcs.anl.gov
        ReportedBy: wilde at mcs.anl.gov
                CC: hategan at mcs.anl.gov

The error message is a bit confusing if you didn't realize that readdata
expects whitespace-delimited items in the header (rather than comma
separated). I suggest a slight change from:

Execution failed: File header does not match type.
Expected the following header items (in no particular order): [ligandsfile,
targetlist]. Instead, the header was (again, in no particular order):
[ligandsfile,targetlist]

To:

Execution failed: File header does not match type.
Expected the following header of 2 items, whitespace separated (in any
order): ligandsfile targetlist
Instead, the header contained 1 item: ligandsfile,targetlist

-- 
Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You reported the bug, or are watching the reporter.

From bugzilla-daemon at mcs.anl.gov Mon Jul 14 17:20:27 2008
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Mon, 14 Jul 2008 17:20:27 -0500 (CDT)
Subject: [Swift-devel] [Bug 149] Improve readdata() error message
In-Reply-To: 
Message-ID: <20080714222027.3D38416469@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=149

hategan at mcs.anl.gov changed:

           What        |Removed     |Added
----------------------------------------------------------------------------
           Status      |NEW         |ASSIGNED

------- Comment #1 from hategan at mcs.anl.gov  2008-07-14 17:20 -------
I added an additional check for the sizes and explicitly mentioned the
whitespace separator issue. r2115.

-- 
Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You reported the bug, or are watching the reporter.
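The r2115 change itself isn't quoted in this thread. The following is only a minimal sketch of the kind of check described in the bug, with invented class and method names (Swift's real readdata implementation is not shown here):

    import java.util.Arrays;
    import java.util.List;

    // Hypothetical helper: validate a whitespace-separated file header
    // against the expected field names, producing the improved messages
    // suggested in bug 149 (size check first, then membership check).
    public class HeaderCheck {
        public static void check(String headerLine, List<String> expected) {
            List<String> actual =
                Arrays.asList(headerLine.trim().split("\\s+"));
            if (actual.size() != expected.size()) {
                throw new RuntimeException(
                    "File header does not match type.\n" +
                    "Expected a header of " + expected.size() +
                    " items, whitespace separated (in any order): " +
                    String.join(" ", expected) + "\n" +
                    "Instead, the header contained " + actual.size() +
                    " item(s): " + headerLine.trim());
            }
            if (!actual.containsAll(expected)) {
                throw new RuntimeException(
                    "File header does not match type. Expected: " +
                    expected + ", got: " + actual);
            }
        }
    }

With this, a header of "ligandsfile,targetlist" checked against the two expected fields fails the size check and reports the whitespace requirement explicitly, which is the confusion the report is about.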
From bugzilla-daemon at mcs.anl.gov Tue Jul 15 02:49:50 2008
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Tue, 15 Jul 2008 02:49:50 -0500 (CDT)
Subject: [Swift-devel] [Bug 148] regexp mapper substitution doesn't work properly
In-Reply-To: 
Message-ID: <20080715074950.8F6021646B@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=148

benc at hawaga.org.uk changed:

           What        |Removed     |Added
----------------------------------------------------------------------------
           Status      |NEW         |RESOLVED
        Resolution     |            |FIXED

------- Comment #1 from benc at hawaga.org.uk  2008-07-15 02:49 -------
r2116 and r2117 change the behaviour of \ a bit. Saying \\1 now works.

-- 
Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You reported the bug, or are watching the reporter.
You are the assignee for the bug, or are watching the assignee.

From skenny at uchicago.edu Tue Jul 15 10:35:28 2008
From: skenny at uchicago.edu (skenny at uchicago.edu)
Date: Tue, 15 Jul 2008 10:35:28 -0500 (CDT)
Subject: [Swift-devel] too much slow down.
Message-ID: <20080715103528.BHV48553@m4500-02.uchicago.edu>

so andric and i have been doing lots of runs the past week with the
latest swift. we've definitely noticed a lack of errors from swift.
that is, when it can't get a job thru it hangs...often for hours 'til
we kill it.

yesterday my job hung for about 20min so i killed it and tried running
it with the previous version of swift. right away i got an error saying
that the job was having trouble creating a directory on the remote site
(which was in fact a correct error, there was a problem with the
permissions).

my personal vote would be for faster failures. i guess it's also worth
mentioning that we rarely (read: never) run multi-site...mostly bcs we
need to separate debugging our workflows from debugging our sites :)

---- Original message ----
>Date: Mon, 14 Jul 2008 11:52:22 -0500
>From: Mihael Hategan
>Subject: Re: [Swift-devel] too much slow down.
>To: Ben Clifford
>Cc: swift-devel at ci.uchicago.edu
>
>On Mon, 2008-07-14 at 16:37 +0000, Ben Clifford wrote:
>> With the recent changes made to the scheduler to deal with bad sites in a
>> multisite run, the behaviour in the presence of a single bad site and no
>> good sites seems to be that a run will sit for a very long time rather
>> than the previous behaviour of failing pretty fast.
>>
>> This is perhaps unpleasant, perhaps not; but its a significant change to
>> behaviour.
>
>Isn't this what we wanted?

From benc at hawaga.org.uk Tue Jul 15 10:45:29 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 15 Jul 2008 15:45:29 +0000 (GMT)
Subject: [Swift-devel] too much slow down.
In-Reply-To: 
References: <20080715103528.BHV48553@m4500-02.uchicago.edu>
Message-ID: 

ok. I'll put in a user-adjustable parameter to adjust this which you will
be able to set to get pretty much the previous behaviour back.

On Tue, 15 Jul 2008, Michael Andric wrote:

> ...often for hours 'til we kill it.
>
> and by that, i mean eternity. i've let things go overnight to see what
> would happen and it's still just hanging when i check it in the morning.
>
> [...]

From hategan at mcs.anl.gov Tue Jul 15 10:58:33 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 15 Jul 2008 10:58:33 -0500
Subject: [Swift-devel] too much slow down.
In-Reply-To: 
References: <20080715103528.BHV48553@m4500-02.uchicago.edu>
Message-ID: <1216137513.28701.0.camel@localhost>

We need to classify errors into transients and non-transients.

On Tue, 2008-07-15 at 15:45 +0000, Ben Clifford wrote:
> ok. I'll put in a user-adjustable parameter to adjust this which you will
> be able to set to get pretty much the previous behaviour back.
>
> [...]

From benc at hawaga.org.uk Tue Jul 15 11:02:24 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 15 Jul 2008 16:02:24 +0000 (GMT)
Subject: [Swift-devel] too much slow down.
In-Reply-To: <1216137513.28701.0.camel@localhost>
References: <20080715103528.BHV48553@m4500-02.uchicago.edu> <1216137513.28701.0.camel@localhost>
Message-ID: 

On Tue, 15 Jul 2008, Mihael Hategan wrote:

> We need to classify errors into transients and non-transients.

...

Any error is transient if you wait long enough... pretty much that's what
this delay stuff is doing anyway - trying to avoid transients. But what a
transient is, is subjective. Quite legitimately a 1h site outage could be
a transient or could not be, depending on what you're trying to do.

The choice of constants (of which there are two) used in delay calculation
is poor at the moment in that it appears to be giving ridiculously long
delays.

I think what those values should be will be decided by user taste. Some
would rather have workflows fail instantly (for instantly = <5min) whilst
others might be prepared to wait a week... at the moment, it appears
biased towards the latter.

-- 

From andric at uchicago.edu Tue Jul 15 10:43:48 2008
From: andric at uchicago.edu (Michael Andric)
Date: Tue, 15 Jul 2008 10:43:48 -0500
Subject: [Swift-devel] too much slow down.
In-Reply-To: <20080715103528.BHV48553@m4500-02.uchicago.edu>
References: <20080715103528.BHV48553@m4500-02.uchicago.edu>
Message-ID: 

...often for hours 'til we kill it.

and by that, i mean eternity. i've let things go overnight to see what
would happen and it's still just hanging when i check it in the morning.

On Tue, Jul 15, 2008 at 10:35 AM, wrote:

> so andric and i have been doing lots of runs the past week
> with the latest swift. we've definitely noticed a lack of
> errors from swift. that is, when it can't get a job thru it
> hangs...often for hours 'til we kill it.
>
> [...]

From skenny at uchicago.edu Tue Jul 15 11:14:33 2008
From: skenny at uchicago.edu (skenny at uchicago.edu)
Date: Tue, 15 Jul 2008 11:14:33 -0500 (CDT)
Subject: [Swift-devel] too much slow down.
Message-ID: <20080715111433.BHV54097@m4500-02.uchicago.edu>

so, i get similar behavior when i try to run w/o a valid proxy...swift
hangs. this is an error that, it seems to me, should return immediately
regardless of whether you're running multi or single site (or whether
you're willing to wait a week or not)...

i'm *guessing* this is what mihael means by "transient" vs
"non-transient" errors (?)

---- Original message ----
>Date: Tue, 15 Jul 2008 16:02:24 +0000 (GMT)
>From: Ben Clifford
>Subject: Re: [Swift-devel] too much slow down.
>To: Mihael Hategan
>Cc: Michael Andric , skenny at uchicago.edu, swift-devel at ci.uchicago.edu
>
>[...]

From benc at hawaga.org.uk Tue Jul 15 11:16:21 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 15 Jul 2008 16:16:21 +0000 (GMT)
Subject: [Swift-devel] too much slow down.
In-Reply-To: <20080715111433.BHV54097@m4500-02.uchicago.edu>
References: <20080715111433.BHV54097@m4500-02.uchicago.edu>
Message-ID: 

On Tue, 15 Jul 2008, skenny at uchicago.edu wrote:

> so, i get similar behavior when i try to run w/o a valid
> proxy...swift hangs. this is an error that, it seems to me,
> should return immediately regardless of whether you're running
> multi or single site (or whether you're willing to wait a week
> or not)...

not if you change the remote system clock so that your proxy is now
valid...

-- 

From hategan at mcs.anl.gov Tue Jul 15 11:43:55 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 15 Jul 2008 11:43:55 -0500
Subject: [Swift-devel] too much slow down.
In-Reply-To: 
References: <20080715103528.BHV48553@m4500-02.uchicago.edu> <1216137513.28701.0.camel@localhost>
Message-ID: <1216140235.29049.10.camel@localhost>

On Tue, 2008-07-15 at 16:02 +0000, Ben Clifford wrote:
> On Tue, 15 Jul 2008, Mihael Hategan wrote:
>
> > We need to classify errors into transients and non-transients.
>
> ...
>
> Any error is transient if you wait long enough...

Unless it's a bug in the specification (e.g. incorrect home dir, missing
application, incorrect output files, etc.), in which case it doesn't
matter how much you wait.

> [...]

From hategan at mcs.anl.gov Tue Jul 15 11:44:47 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 15 Jul 2008 11:44:47 -0500
Subject: [Swift-devel] too much slow down.
In-Reply-To: <20080715111433.BHV54097@m4500-02.uchicago.edu>
References: <20080715111433.BHV54097@m4500-02.uchicago.edu>
Message-ID: <1216140287.29049.11.camel@localhost>

On Tue, 2008-07-15 at 11:14 -0500, skenny at uchicago.edu wrote:
> so, i get similar behavior when i try to run w/o a valid
> proxy...swift hangs. this is an error that, it seems to me,
> should return immediately regardless of whether you're running
> multi or single site (or whether you're willing to wait a week
> or not)...
>
> i'm *guessing* this is what mihael means by "transient" vs
> "non-transient" errors (?)

You are guessing correctly.

> [...]
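None of the classification code discussed above existed at this point in the thread. As a rough sketch of the transient/non-transient split plus Ben's user-adjustable cutoff, with all names invented for illustration (this is not Swift's scheduler code):

    // Sketch only: encode the distinction from this thread. Permanent
    // faults (specification bugs, expired proxy) fail fast; transient
    // faults (e.g. a site outage) are retried until a user-set window
    // runs out.
    public class FaultClassifier {

        public enum Kind { TRANSIENT, PERMANENT }

        // User-adjustable ceiling on how long transient faults are
        // retried before the run is failed.
        private final long maxRetryWindowMillis;

        public FaultClassifier(long maxRetryWindowMillis) {
            this.maxRetryWindowMillis = maxRetryWindowMillis;
        }

        // Crude message-based classification; a real implementation
        // would inspect typed exceptions rather than strings.
        public Kind classify(Exception e) {
            String msg = String.valueOf(e.getMessage()).toLowerCase();
            if (msg.contains("proxy") || msg.contains("certificate")
                    || msg.contains("no such application")
                    || msg.contains("missing output")) {
                return Kind.PERMANENT;
            }
            return Kind.TRANSIENT;
        }

        // Fail fast on permanent faults; otherwise retry while the
        // window since the first failure is not exhausted.
        public boolean shouldRetry(Exception e, long firstFailureMillis) {
            if (classify(e) == Kind.PERMANENT) {
                return false;
            }
            long elapsed = System.currentTimeMillis() - firstFailureMillis;
            return elapsed < maxRetryWindowMillis;
        }
    }

Under this framing, skenny's invalid-proxy case classifies as PERMANENT and fails immediately, while a site outage keeps retrying for however long the user sets the window, which matches the "user taste" point above.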
From ragib.morshed at gmail.com Tue Jul 15 11:53:22 2008
From: ragib.morshed at gmail.com (Ragib Morshed)
Date: Tue, 15 Jul 2008 11:53:22 -0500
Subject: [Swift-devel] Fwd: build problem with swift
In-Reply-To: <94e4b8380807150940t21dba026y417a4d5a3dd94eb6@mail.gmail.com>
References: <94e4b8380807150925s3825f7d2n22a3cc8a6867c9fb@mail.gmail.com> <94e4b8380807150940t21dba026y417a4d5a3dd94eb6@mail.gmail.com>
Message-ID: <94e4b8380807150953g7f3f9da1v7837169eed3b6bc8@mail.gmail.com>

Ben suggested forwarding this problem of mine to swift-devel.

---------- Forwarded message ----------
From: Ragib Morshed
Date: Tue, Jul 15, 2008 at 11:40 AM
Subject: Fwd: build problem with swift
To: benc at ci.uchicago.edu

Hi Ben,

I sent this email to Mihael, but he is off somewhere at a conference I
think, and don't know if he will get to it.

I put some code in for the site-affinity thing and swift compiles and runs
the code fine using eclipse. Compilation/building using ant says: *package
org.griphyn.vdl.karajan.lib.cache does not exist.* But it is there and
compiles fine with eclipse.

Do you have any ideas off the top of your head where the problem might be?

Thanks.
-Ragib

---------- Forwarded message ----------
From: Ragib Morshed
Date: Tue, Jul 15, 2008 at 11:25 AM
Subject: build problem with swift
To: Mihael Hategan

Hi,

I have been trying to build swift with the new changes using 'ant dist',
but it gives a compilation error. It compiles and runs fine in eclipse,
but here it says it can't find the package org.griphyn.vdl.karajan.lib.cache
so it cannot recognize the VDLFileCache type. Any ideas why?
Here's the output from building it:

    [javac] Compiling 540 source files to /autonfs/home/rmorshed/Desktop/swiftNew/cog/modules/karajan/build
    [javac] /autonfs/home/rmorshed/Desktop/swiftNew/cog/modules/karajan/src/org/globus/cog/karajan/scheduler/WeightedHostScoreScheduler.java:34: package org.griphyn.vdl.karajan.lib.cache does not exist
    [javac] import org.griphyn.vdl.karajan.lib.cache.VDLFileCache;
    [javac]        ^
    [javac] /autonfs/home/rmorshed/Desktop/swiftNew/cog/modules/karajan/src/org/globus/cog/karajan/scheduler/WeightedHostScoreScheduler.java:230: cannot resolve symbol
    [javac] symbol  : class VDLFileCache
    [javac] location: class org.globus.cog.karajan.scheduler.WeightedHostScoreScheduler
    [javac] VDLFileCache fileCache = (VDLFileCache) t.getConstraint("filecache");
    [javac] ^
    [javac] /autonfs/home/rmorshed/Desktop/swiftNew/cog/modules/karajan/src/org/globus/cog/karajan/scheduler/WeightedHostScoreScheduler.java:230: cannot resolve symbol
    [javac] symbol  : class VDLFileCache
    [javac] location: class org.globus.cog.karajan.scheduler.WeightedHostScoreScheduler
    [javac] VDLFileCache fileCache = (VDLFileCache) t.getConstraint("filecache");
    [javac] ^
    [javac] Note: Some input files use or override a deprecated API.
    [javac] Note: Recompile with -deprecation for details.
    [javac] 3 errors

    BUILD FAILED
    /autonfs/home/rmorshed/Desktop/swiftNew/cog/modules/vdsk/build.xml:73: The following error occurred while executing this line:
    /autonfs/home/rmorshed/Desktop/swiftNew/cog/mbuild.xml:442: The following error occurred while executing this line:
    /autonfs/home/rmorshed/Desktop/swiftNew/cog/mbuild.xml:78: The following error occurred while executing this line:
    /autonfs/home/rmorshed/Desktop/swiftNew/cog/mbuild.xml:51: The following error occurred while executing this line:
    /autonfs/home/rmorshed/Desktop/swiftNew/cog/modules/vdsk/dependencies.xml:4: The following error occurred while executing this line:
    /autonfs/home/rmorshed/Desktop/swiftNew/cog/mbuild.xml:162: The following error occurred while executing this line:
    /autonfs/home/rmorshed/Desktop/swiftNew/cog/mbuild.xml:167: The following error occurred while executing this line:
    /autonfs/home/rmorshed/Desktop/swiftNew/cog/modules/karajan/build.xml:59: The following error occurred while executing this line:
    /autonfs/home/rmorshed/Desktop/swiftNew/cog/mbuild.xml:463: The following error occurred while executing this line:
    /autonfs/home/rmorshed/Desktop/swiftNew/cog/mbuild.xml:227: Compile failed; see the compiler error output for details.

    Total time: 30 seconds

-ragib
-------------- next part --------------
A non-text attachment was scrubbed...
Name: WeightedHostScoreScheduler.java
Type: text/x-java
Size: 16318 bytes
Desc: not available

From benc at hawaga.org.uk Tue Jul 15 12:27:54 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 15 Jul 2008 17:27:54 +0000 (GMT)
Subject: [Swift-devel] Fwd: build problem with swift
In-Reply-To: <94e4b8380807150953g7f3f9da1v7837169eed3b6bc8@mail.gmail.com>
References: <94e4b8380807150925s3825f7d2n22a3cc8a6867c9fb@mail.gmail.com> <94e4b8380807150940t21dba026y417a4d5a3dd94eb6@mail.gmail.com> <94e4b8380807150953g7f3f9da1v7837169eed3b6bc8@mail.gmail.com>
Message-ID: 

VDLCache is in the vdsk module.

The WeightedHost stuff is in karajan, which is a dependency.
You can't use code from module A in module B if module A is dependent on
module B (equivalently, if module B is a prerequisite of module A).

Here A = vdsk, B = karajan.

There's a subclass of the weighted host scheduler that lives in the vdsk
module and extends the functionality in swift-specific ways:
src/org/griphyn/vdl/karajan/VDSAdaptiveScheduler.java

You might be able to add your new functionality there.

If it's building in eclipse, that's because it enforces cog module
dependencies differently (or not at all).
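A rough sketch of the subclass route Ben points at here. The hook method and the small cache interface below are invented for illustration (the real VDSAdaptiveScheduler and VDLFileCache APIs may differ), which is why the extends clause is left commented out so the sketch stands alone:

    package org.griphyn.vdl.karajan;

    // Sketch: keep the Swift-specific cache knowledge in the vdsk
    // module by extending the scheduler there, instead of importing
    // vdsk classes into karajan (the reverse import is what broke
    // the ant build).
    public class AffinityAwareScheduler /* extends VDSAdaptiveScheduler */ {

        // Stand-in for the per-host lookup that the real VDLFileCache
        // provides in org.griphyn.vdl.karajan.lib.cache.
        public interface HostFileCache {
            java.util.Collection<String> pathsOn(String host);
        }

        private final HostFileCache cache;

        public AffinityAwareScheduler(HostFileCache cache) {
            this.cache = cache;
        }

        // Hypothetical hook: bias host scores toward hosts that already
        // hold cached input files. A real implementation would plug into
        // the scheduler's scoring machinery.
        double affinityBonus(String host) {
            return cache.pathsOn(host).isEmpty() ? 0.0 : 1.0;
        }
    }

Because vdsk depends on karajan and not the other way around, this compiles under ant as well as in eclipse.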
On Tue, 15 Jul 2008, Ragib Morshed wrote:

> Ben suggested forwarding this problem of mine to swift-devel.
>
> [...]

From ragib.morshed at gmail.com Tue Jul 15 16:18:20 2008
From: ragib.morshed at gmail.com (Ragib Morshed)
Date: Tue, 15 Jul 2008 16:18:20 -0500
Subject: [Swift-devel] Fwd: build problem with swift
In-Reply-To: 
References: <94e4b8380807150925s3825f7d2n22a3cc8a6867c9fb@mail.gmail.com> <94e4b8380807150940t21dba026y417a4d5a3dd94eb6@mail.gmail.com> <94e4b8380807150953g7f3f9da1v7837169eed3b6bc8@mail.gmail.com>
Message-ID: <94e4b8380807151418i2491ea69if4c4d213fcdc27e7@mail.gmail.com>

Can we change the dependencies such that it builds up the software like in
eclipse?

-ragib

On Tue, Jul 15, 2008 at 12:27 PM, Ben Clifford wrote:

> VDLCache is in the vdsk module.
>
> The WeightedHost stuff is in karajan, which is a dependency.
>
> [...]
From benc at hawaga.org.uk Tue Jul 15 16:24:21 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 15 Jul 2008 21:24:21 +0000 (GMT)
Subject: [Swift-devel] Fwd: build problem with swift
In-Reply-To: <94e4b8380807151418i2491ea69if4c4d213fcdc27e7@mail.gmail.com>
References: <94e4b8380807150925s3825f7d2n22a3cc8a6867c9fb@mail.gmail.com> <94e4b8380807150940t21dba026y417a4d5a3dd94eb6@mail.gmail.com> <94e4b8380807150953g7f3f9da1v7837169eed3b6bc8@mail.gmail.com> <94e4b8380807151418i2491ea69if4c4d213fcdc27e7@mail.gmail.com>
Message-ID: 

On Tue, 15 Jul 2008, Ragib Morshed wrote:

> Can we change the dependencies such that it builds up the software like in
> eclipse?

'we'?

basically, though, no. karajan is a separate piece of code from Swift.

you can hack round in your build tree as much as you want, and if you're
not ever intending for anyone to reuse your code it should be possible
somehow, in some probably unpleasant fashion.

-- 

From hategan at mcs.anl.gov Tue Jul 15 16:37:22 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 15 Jul 2008 16:37:22 -0500
Subject: [Swift-devel] Fwd: build problem with swift
In-Reply-To: <94e4b8380807151418i2491ea69if4c4d213fcdc27e7@mail.gmail.com>
References: <94e4b8380807150925s3825f7d2n22a3cc8a6867c9fb@mail.gmail.com> <94e4b8380807150940t21dba026y417a4d5a3dd94eb6@mail.gmail.com> <94e4b8380807150953g7f3f9da1v7837169eed3b6bc8@mail.gmail.com> <94e4b8380807151418i2491ea69if4c4d213fcdc27e7@mail.gmail.com>
Message-ID: <1216157842.4550.4.camel@localhost>

On Tue, 2008-07-15 at 16:18 -0500, Ragib Morshed wrote:
> Can we change the dependencies such that it builds up the software
> like in eclipse?

No.

So the solution would be to pass a data structure that is available in
both. Let me think a bit.

> [...]
From hategan at mcs.anl.gov Tue Jul 15 16:48:17 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 15 Jul 2008 16:48:17 -0500
Subject: [Swift-devel] Fwd: build problem with swift
In-Reply-To: <1216157842.4550.4.camel@localhost>
References: <94e4b8380807150925s3825f7d2n22a3cc8a6867c9fb@mail.gmail.com> <94e4b8380807150940t21dba026y417a4d5a3dd94eb6@mail.gmail.com> <94e4b8380807150953g7f3f9da1v7837169eed3b6bc8@mail.gmail.com> <94e4b8380807151418i2491ea69if4c4d213fcdc27e7@mail.gmail.com> <1216157842.4550.4.camel@localhost>
Message-ID: <1216158497.4879.4.camel@localhost>

On Tue, 2008-07-15 at 16:37 -0500, Mihael Hategan wrote:
> On Tue, 2008-07-15 at 16:18 -0500, Ragib Morshed wrote:
> > Can we change the dependencies such that it builds up the software
> > like in eclipse?
>
> No.
>
> So the solution would be to pass a data structure that is available in
> both. Let me think a bit.

A Map adapter of course.

I'll commit one shortly.

> [...]
From hategan at mcs.anl.gov Tue Jul 15 16:58:25 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 15 Jul 2008 16:58:25 -0500
Subject: [Swift-devel] Fwd: build problem with swift
In-Reply-To: <1216158497.4879.4.camel@localhost>
References: <94e4b8380807150925s3825f7d2n22a3cc8a6867c9fb@mail.gmail.com> <94e4b8380807150940t21dba026y417a4d5a3dd94eb6@mail.gmail.com> <94e4b8380807150953g7f3f9da1v7837169eed3b6bc8@mail.gmail.com> <94e4b8380807151418i2491ea69if4c4d213fcdc27e7@mail.gmail.com> <1216157842.4550.4.camel@localhost> <1216158497.4879.4.camel@localhost>
Message-ID: <1216159105.4879.14.camel@localhost>

On Tue, 2008-07-15 at 16:48 -0500, Mihael Hategan wrote:
> A Map adapter of course.
>
> I'll commit one shortly.

Ok, so if you do an svn update in vdsk, there will be a CacheMapAdapter
which implements java.util.Map.

You'll use it like this: in JobConstrains, instead of

    tc.addConstraint("filecache", CacheFunction.getCache(stack));

you'd write:

    tc.addConstraint("filecache",
        new CacheMapAdapter(CacheFunction.getCache(stack)));

The get() method of the adapter is mapped to cache.getPaths(). So in the
scheduler code:

    Map cache = (Map) t.getConstraint("filecache");
    ...
    Collection paths = (Collection) cache.get(wh.getHost());

Though this is a bit shady anyway, given that knowledge of all that file
information is pretty Swift specific. But for the purpose of experimenting
with this it will have to do.

> [...]
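The committed CacheMapAdapter itself isn't quoted in the thread. A minimal sketch of the adapter as described above, with the cache reduced to a stand-in interface since only get() matters to the scheduler (the real class wraps VDLFileCache directly):

    import java.util.AbstractMap;
    import java.util.Collections;
    import java.util.Map;
    import java.util.Set;

    // Sketch of the adapter idea: expose the vdsk-side file cache to
    // the karajan-side scheduler through the shared java.util.Map
    // interface, so karajan needs no compile-time dependency on vdsk.
    public class CacheMapAdapter extends AbstractMap<Object, Object> {

        // Stand-in for org.griphyn.vdl.karajan.lib.cache.VDLFileCache;
        // getPaths() is taken from the message above, the rest is assumed.
        public interface FileCache {
            java.util.Collection<?> getPaths(Object host);
        }

        private final FileCache cache;

        public CacheMapAdapter(FileCache cache) {
            this.cache = cache;
        }

        // The only operation the scheduler needs: host -> cached paths.
        public Object get(Object host) {
            return cache.getPaths(host);
        }

        // Not used by the scheduler; AbstractMap requires it.
        public Set<Map.Entry<Object, Object>> entrySet() {
            return Collections.emptySet();
        }
    }

The design point is the one Hategan makes: java.util.Map is visible to both modules, so it acts as the neutral data structure passed across the module boundary, at the cost of losing the type information that the "a bit shady" remark alludes to.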
Compilation/building using ant > > > says: *package > > > > org.griphyn.vdl.karajan.lib.cache does not exist. *But it is > > > there and > > > > compiles fine with eclipse. > > > > > > > > Do you have any ideas on the top of your head where the > > > problem might be? > > > > > > > > Thanks. > > > > -Ragib > > > > > > > > > > > > ---------- Forwarded message ---------- > > > > From: Ragib Morshed > > > > Date: Tue, Jul 15, 2008 at 11:25 AM > > > > Subject: build problem with swift > > > > To: Mihael Hategan > > > > > > > > > > > > Hi, > > > > > > > > I have been trying to build swift with the new changes using > > > 'ant dist', but > > > > it gives compilation error. It compiles and runs fine on > > > eclipse, but here > > > > it says it can't find the package > > > org.gridphyn.vdl.karajan.lib.cache so it > > > > cannot recognize the VDLFileCache type. Any ideas why? > > > > > > > > Here's the output from building it: > > > > > > > > [javac] Compiling 540 source files to > > > > /autonfs/home/rmorshed/Desktop/swiftNew/cog/modules/karajan/build > > > > [javac] > > > > /autonfs/home/rmorshed/Desktop/swiftNew/cog/modules/karajan/src/org/globus/cog/karajan/scheduler/WeightedHostScoreScheduler.java:34: > > > > *package org.griphyn.vdl.karajan.lib.cache does not exist* > > > > [javac] import > > > org.griphyn.vdl.karajan.lib.cache.VDLFileCache; > > > > [javac] ^ > > > > [javac] > > > > /autonfs/home/rmorshed/Desktop/swiftNew/cog/modules/karajan/src/org/globus/cog/karajan/scheduler/WeightedHostScoreScheduler.java:230: > > > > cannot resolve symbol > > > > [javac] symbol : class VDLFileCache > > > > [javac] location: class > > > > org.globus.cog.karajan.scheduler.WeightedHostScoreScheduler > > > > [javac] VDLFileCache fileCache = > > > (VDLFileCache) > > > > t.getConstraint("filecache"); > > > > [javac] ^ > > > > [javac] > > > > /autonfs/home/rmorshed/Desktop/swiftNew/cog/modules/karajan/src/org/globus/cog/karajan/scheduler/WeightedHostScoreScheduler.java:230: > > > > cannot resolve symbol > > > > [javac] symbol : class VDLFileCache > > > > [javac] location: class > > > > org.globus.cog.karajan.scheduler.WeightedHostScoreScheduler > > > > [javac] VDLFileCache fileCache = > > > (VDLFileCache) > > > > t.getConstraint("filecache"); > > > > [javac] ^ > > > > [javac] Note: Some input files use or override a > > > deprecated API. > > > > [javac] Note: Recompile with -deprecation for details. 
> > > > [javac] 3 errors > > > > > > > > BUILD FAILED > > > > /autonfs/home/rmorshed/Desktop/swiftNew/cog/modules/vdsk/build.xml:73: The > > > > following error occurred while executing this line: > > > > /autonfs/home/rmorshed/Desktop/swiftNew/cog/mbuild.xml:442: > > > The following > > > > error occurred while executing this line: > > > > /autonfs/home/rmorshed/Desktop/swiftNew/cog/mbuild.xml:78: > > > The following > > > > error occurred while executing this line: > > > > /autonfs/home/rmorshed/Desktop/swiftNew/cog/mbuild.xml:51: > > > The following > > > > error occurred while executing this line: > > > > /autonfs/home/rmorshed/Desktop/swiftNew/cog/modules/vdsk/dependencies.xml:4: > > > > The following error occurred while executing this line: > > > > /autonfs/home/rmorshed/Desktop/swiftNew/cog/mbuild.xml:162: > > > The following > > > > error occurred while executing this line: > > > > /autonfs/home/rmorshed/Desktop/swiftNew/cog/mbuild.xml:167: > > > The following > > > > error occurred while executing this line: > > > > /autonfs/home/rmorshed/Desktop/swiftNew/cog/modules/karajan/build.xml:59: > > > > The following error occurred while executing this line: > > > > /autonfs/home/rmorshed/Desktop/swiftNew/cog/mbuild.xml:463: > > > The following > > > > error occurred while executing this line: > > > > /autonfs/home/rmorshed/Desktop/swiftNew/cog/mbuild.xml:227: > > > Compile failed; > > > > see the compiler error output for details. > > > > > > > > Total time: 30 seconds > > > > > > > > -ragib > > > > > > > > > > > > > > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From ragib.morshed at gmail.com Tue Jul 15 21:31:36 2008 From: ragib.morshed at gmail.com (Ragib Morshed) Date: Tue, 15 Jul 2008 21:31:36 -0500 Subject: [Swift-devel] Fwd: build problem with swift In-Reply-To: References: <94e4b8380807150925s3825f7d2n22a3cc8a6867c9fb@mail.gmail.com> <94e4b8380807150940t21dba026y417a4d5a3dd94eb6@mail.gmail.com> <94e4b8380807150953g7f3f9da1v7837169eed3b6bc8@mail.gmail.com> <94e4b8380807151418i2491ea69if4c4d213fcdc27e7@mail.gmail.com> Message-ID: <94e4b8380807151931t19fd3c0aw5ba51e620afe3572@mail.gmail.com> > > > > 'we'? > > > basically, though, no. karajan is a separate piece of code than Swift. > > you can hack round in your build tree as much as you want, and if you're > not ever intending for anyone to reuse your code it should be possible > somehow, in some probably unpleasant fashion. > > -- > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ragib.morshed at gmail.com Tue Jul 15 21:34:10 2008 From: ragib.morshed at gmail.com (Ragib Morshed) Date: Tue, 15 Jul 2008 21:34:10 -0500 Subject: [Swift-devel] Fwd: build problem with swift In-Reply-To: <94e4b8380807151931t19fd3c0aw5ba51e620afe3572@mail.gmail.com> References: <94e4b8380807150925s3825f7d2n22a3cc8a6867c9fb@mail.gmail.com> <94e4b8380807150940t21dba026y417a4d5a3dd94eb6@mail.gmail.com> <94e4b8380807150953g7f3f9da1v7837169eed3b6bc8@mail.gmail.com> <94e4b8380807151418i2491ea69if4c4d213fcdc27e7@mail.gmail.com> <94e4b8380807151931t19fd3c0aw5ba51e620afe3572@mail.gmail.com> Message-ID: <94e4b8380807151934l6ce83dd2s606603c9c981062f@mail.gmail.com> sorry for the last empty one, wrongly clicked send. 
> > > 'we'? > > I meant 'I' > >> > >> >> >basically, though, no. karajan is a separate piece of code than Swift. >> > yeah, i don't think so either. thanks. -ragib -------------- next part -------------- An HTML attachment was scrubbed... URL: From ragib.morshed at gmail.com Tue Jul 15 23:46:00 2008 From: ragib.morshed at gmail.com (Ragib Morshed) Date: Tue, 15 Jul 2008 23:46:00 -0500 Subject: [Swift-devel] Fwd: build problem with swift In-Reply-To: <1216159105.4879.14.camel@localhost> References: <94e4b8380807150925s3825f7d2n22a3cc8a6867c9fb@mail.gmail.com> <94e4b8380807150940t21dba026y417a4d5a3dd94eb6@mail.gmail.com> <94e4b8380807150953g7f3f9da1v7837169eed3b6bc8@mail.gmail.com> <94e4b8380807151418i2491ea69if4c4d213fcdc27e7@mail.gmail.com> <1216157842.4550.4.camel@localhost> <1216158497.4879.4.camel@localhost> <1216159105.4879.14.camel@localhost> Message-ID: <94e4b8380807152146g2879ffcfp9d7ec3a58108daa7@mail.gmail.com> It builds successfully and works on helloworld! -ragib On Tue, Jul 15, 2008 at 4:58 PM, Mihael Hategan wrote: > On Tue, 2008-07-15 at 16:48 -0500, Mihael Hategan wrote: > > On Tue, 2008-07-15 at 16:37 -0500, Mihael Hategan wrote: > > > On Tue, 2008-07-15 at 16:18 -0500, Ragib Morshed wrote: > > > > Can we change the dependencies such that it builds up the software > > > > like in eclipse? > > > > > > No. > > > > > > So the solution would be to pass a data structure that is available in > > > both. Let me think a bit. > > > > A Map adapter of course. > > > > I'll commit one shortly. > > Ok, so if you do an svn update in vdsk, there will be a CacheMapAdapter > which implements java.util.Map. > > You'll use it like this: > In JobConstrains, instead of tc.addConstraint("filecache", > CacheFunction.getCache(stack)), you'd write: > tc.addConstraint("filecache", new > CacheMapAdapter(CacheFunction.getCache(stack))); > > The get() method of the adapter is mapped to cache.getPaths(). So in the > scheduler code: > Map cache = (Map) t.getConstraint("filecache"); > ... > Collection paths = (Collection) cache.get(wh.getHost()); > > Though this is a bit shady anyway, given that knowledge of all that file > information is pretty Swift specific. But for the purpose of > experimenting with this it will have to do. > > > > > > > > > > > > > > > > -ragib > > > > > > > > On Tue, Jul 15, 2008 at 12:27 PM, Ben Clifford > > > > wrote: > > > > > > > > > > > > VDLCache is in the vdsk module. > > > > > > > > The WeightedHost stuff is in karajan, which is a dependency. > > > > > > > > You can't use code from module A in module B if module A is > > > > dependent on > > > > module B (equivalently if if module B is a prerequisite of > > > > module A) > > > > > > > > Here A = vdsk, B= karajan > > > > > > > > There's a subclass of the weighted host scheduler that lives > > > > in the vdsk > > > > module and extends the functionality in swift-specific ways: > > > > src//org/griphyn/vdl/karajan/VDSAdaptiveScheduler.java > > > > > > > > You might be able to add your new functionality there. > > > > > > > > If its building in eclipse, that's because it enforces cog > > > > module > > > > dependencies differently (or not at all). > > > > > > > > On Tue, 15 Jul 2008, Ragib Morshed wrote: > > > > > > > > > > > > > > > > > Ben suggested forwarding this problem of mine to > > > > swift-devel. 
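A minimal sketch of the Map-adapter idea Mihael describes above, with invented names: FileCache stands in for VDLFileCache, and the class body is illustrative rather than the committed CacheMapAdapter source. Only the routing of Map.get() to getPaths() is taken from the message.

import java.util.AbstractMap;
import java.util.Collections;
import java.util.Set;

// Stand-in for VDLFileCache; per the message, Map.get() maps to getPaths().
interface FileCache {
    Object getPaths(Object host);
}

// Wraps the Swift-side cache behind java.util.Map so karajan-module code can
// query it without a compile-time dependency on org.griphyn.vdl.* classes.
public class CacheMapSketch extends AbstractMap {
    private final FileCache cache;

    public CacheMapSketch(FileCache cache) {
        this.cache = cache;
    }

    public Object get(Object host) {
        return cache.getPaths(host); // the only operation the scheduler needs
    }

    public Set entrySet() {
        return Collections.EMPTY_SET; // iteration unsupported in this sketch
    }
}

The scheduler side then sees only java.util.Map, as in the quoted usage: Map cache = (Map) t.getConstraint("filecache"); Collection paths = (Collection) cache.get(wh.getHost());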
> > > > > > > > > > ---------- Forwarded message ---------- > > > > > From: Ragib Morshed > > > > > Date: Tue, Jul 15, 2008 at 11:40 AM > > > > > Subject: Fwd: build problem with swift > > > > > To: benc at ci.uchicago.edu > > > > > > > > > > > > > > > Hi Ben, > > > > > > > > > > I sent this email to Mihael, but he is off somewhere at a > > > > conference I > > > > > think, and don't know if he will get to it. > > > > > > > > > > I put some code in for the site-affinity thing and swift > > > > compiles and runs > > > > > the code fine using eclipse. Compilation/building using ant > > > > says: *package > > > > > org.griphyn.vdl.karajan.lib.cache does not exist. *But it > is > > > > there and > > > > > compiles fine with eclipse. > > > > > > > > > > Do you have any ideas on the top of your head where the > > > > problem might be? > > > > > > > > > > Thanks. > > > > > -Ragib > > > > > > > > > > > > > > > ---------- Forwarded message ---------- > > > > > From: Ragib Morshed > > > > > Date: Tue, Jul 15, 2008 at 11:25 AM > > > > > Subject: build problem with swift > > > > > To: Mihael Hategan > > > > > > > > > > > > > > > Hi, > > > > > > > > > > I have been trying to build swift with the new changes > using > > > > 'ant dist', but > > > > > it gives compilation error. It compiles and runs fine on > > > > eclipse, but here > > > > > it says it can't find the package > > > > org.gridphyn.vdl.karajan.lib.cache so it > > > > > cannot recognize the VDLFileCache type. Any ideas why? > > > > > > > > > > Here's the output from building it: > > > > > > > > > > [javac] Compiling 540 source files to > > > > > > /autonfs/home/rmorshed/Desktop/swiftNew/cog/modules/karajan/build > > > > > [javac] > > > > > > /autonfs/home/rmorshed/Desktop/swiftNew/cog/modules/karajan/src/org/globus/cog/karajan/scheduler/WeightedHostScoreScheduler.java:34: > > > > > *package org.griphyn.vdl.karajan.lib.cache does not exist* > > > > > [javac] import > > > > org.griphyn.vdl.karajan.lib.cache.VDLFileCache; > > > > > [javac] ^ > > > > > [javac] > > > > > > /autonfs/home/rmorshed/Desktop/swiftNew/cog/modules/karajan/src/org/globus/cog/karajan/scheduler/WeightedHostScoreScheduler.java:230: > > > > > cannot resolve symbol > > > > > [javac] symbol : class VDLFileCache > > > > > [javac] location: class > > > > > org.globus.cog.karajan.scheduler.WeightedHostScoreScheduler > > > > > [javac] VDLFileCache fileCache = > > > > (VDLFileCache) > > > > > t.getConstraint("filecache"); > > > > > [javac] ^ > > > > > [javac] > > > > > > /autonfs/home/rmorshed/Desktop/swiftNew/cog/modules/karajan/src/org/globus/cog/karajan/scheduler/WeightedHostScoreScheduler.java:230: > > > > > cannot resolve symbol > > > > > [javac] symbol : class VDLFileCache > > > > > [javac] location: class > > > > > org.globus.cog.karajan.scheduler.WeightedHostScoreScheduler > > > > > [javac] VDLFileCache fileCache = > > > > (VDLFileCache) > > > > > t.getConstraint("filecache"); > > > > > [javac] ^ > > > > > [javac] Note: Some input files use or override a > > > > deprecated API. > > > > > [javac] Note: Recompile with -deprecation for details. 
> > > > > [javac] 3 errors > > > > > > > > > > BUILD FAILED > > > > > > /autonfs/home/rmorshed/Desktop/swiftNew/cog/modules/vdsk/build.xml:73: The > > > > > following error occurred while executing this line: > > > > > /autonfs/home/rmorshed/Desktop/swiftNew/cog/mbuild.xml:442: > > > > The following > > > > > error occurred while executing this line: > > > > > /autonfs/home/rmorshed/Desktop/swiftNew/cog/mbuild.xml:78: > > > > The following > > > > > error occurred while executing this line: > > > > > /autonfs/home/rmorshed/Desktop/swiftNew/cog/mbuild.xml:51: > > > > The following > > > > > error occurred while executing this line: > > > > > > /autonfs/home/rmorshed/Desktop/swiftNew/cog/modules/vdsk/dependencies.xml:4: > > > > > The following error occurred while executing this line: > > > > > /autonfs/home/rmorshed/Desktop/swiftNew/cog/mbuild.xml:162: > > > > The following > > > > > error occurred while executing this line: > > > > > /autonfs/home/rmorshed/Desktop/swiftNew/cog/mbuild.xml:167: > > > > The following > > > > > error occurred while executing this line: > > > > > > /autonfs/home/rmorshed/Desktop/swiftNew/cog/modules/karajan/build.xml:59: > > > > > The following error occurred while executing this line: > > > > > /autonfs/home/rmorshed/Desktop/swiftNew/cog/mbuild.xml:463: > > > > The following > > > > > error occurred while executing this line: > > > > > /autonfs/home/rmorshed/Desktop/swiftNew/cog/mbuild.xml:227: > > > > Compile failed; > > > > > see the compiler error output for details. > > > > > > > > > > Total time: 30 seconds > > > > > > > > > > -ragib > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > Swift-devel mailing list > > > > Swift-devel at ci.uchicago.edu > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From benc at hawaga.org.uk Wed Jul 16 02:30:29 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 16 Jul 2008 07:30:29 +0000 (GMT) Subject: [Swift-devel] swift 0.6-rc4 In-Reply-To: References: Message-ID: On Fri, 11 Jul 2008, Ben Clifford wrote: > Swift 0.6rc4 is available from There is enough trouble in the scheduler in rc4 that I will make an rc5 when its fixed. -- From benc at hawaga.org.uk Wed Jul 16 05:34:52 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 16 Jul 2008 10:34:52 +0000 (GMT) Subject: [Swift-devel] scheduler foo Message-ID: Some subset of the below patch changes the behaviour to more what I would expect - more functional but less optimised. It suggests something is awry with comparing an explicit from-fresh overload count with the ongoing cached value. I hope you like output on stderr ;) I'll poke with this a bit more later but I have to work on other stuff right now - the patch looks horrible but you might be interested in seeing where I go tot. http://www.ci.uchicago.edu/~benc/tmp/debug-backofffail-1 -- From benc at hawaga.org.uk Wed Jul 16 12:20:11 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 16 Jul 2008 17:20:11 +0000 (GMT) Subject: [Swift-devel] Re: scheduler foo In-Reply-To: References: Message-ID: On Wed, 16 Jul 2008, Ben Clifford wrote: > Some subset of the below patch changes the behaviour to more what I > would expect - more functional but less optimised. 
http://www.ci.uchicago.edu/~benc/tmp/debug-backofffail-2 is a slightly tidier version. Its still a nasty hack. -- From skenny at uchicago.edu Thu Jul 17 12:09:35 2008 From: skenny at uchicago.edu (skenny at uchicago.edu) Date: Thu, 17 Jul 2008 12:09:35 -0500 (CDT) Subject: [Swift-devel] sge + cog-job-submit == headache Message-ID: <20080717120935.BHY01569@m4500-02.uchicago.edu> hi all, i'm trying to get some swift jobs thru to tacc's ranger but it seems not to play nicely with the sge scheduler. i'm able to get a job thru via globusrun: globusrun -o -r gatekeeper.ranger.tacc.teragrid.org/jobmanager-sge -f test.rsl however, if i submit via cog-job-submit and/or swift the job makes it to the scheduler, the work directories are created but then the job moves, in the scheduler to a state of 'unscheduled' and remains there indefinitely. nothing jumps out at me when i look at the gram logs on the remote end...is there anyone else attempting to or having any success submitting to sge with swift or cog? i've attached the swift log. any ideas greatly appreciated :) thanks sarah -------------- next part -------------- A non-text attachment was scrubbed... Name: env-20080717-1128-fpnetfca.log Type: application/octet-stream Size: 13363 bytes Desc: not available URL: From mikekubal at yahoo.com Sun Jul 20 13:02:40 2008 From: mikekubal at yahoo.com (Mike Kubal) Date: Sun, 20 Jul 2008 11:02:40 -0700 (PDT) Subject: [Swift-devel] gridftp issue with communicado Message-ID: <587778.92361.qm@web52310.mail.re2.yahoo.com> Using the following url: for the gridftp server for NCSA's abe, host communicado fails to return Swift job results files. Files are transferred to Abe successfully and the jobs run to completion. This problem does not occur when running from host terminable. MikeK From hategan at mcs.anl.gov Sun Jul 20 13:15:14 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 20 Jul 2008 13:15:14 -0500 Subject: [Swift-devel] gridftp issue with communicado In-Reply-To: <587778.92361.qm@web52310.mail.re2.yahoo.com> References: <587778.92361.qm@web52310.mail.re2.yahoo.com> Message-ID: <1216577714.27406.1.camel@localhost> On Sun, 2008-07-20 at 11:02 -0700, Mike Kubal wrote: > Using the following url: > > > > for the gridftp server for NCSA's abe, host communicado fails to return Swift job results files. Files are transferred to Abe successfully and the jobs run to completion. This problem does not occur when running from host terminable. I'm not following. Are you saying files can be staged in but not staged out? What is the error message? Do you have a log file? Mihael From benc at hawaga.org.uk Mon Jul 21 02:50:18 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 21 Jul 2008 07:50:18 +0000 (GMT) Subject: [Swift-devel] gridftp issue with communicado In-Reply-To: <587778.92361.qm@web52310.mail.re2.yahoo.com> References: <587778.92361.qm@web52310.mail.re2.yahoo.com> Message-ID: There is a fairly common "my run is not returning its output files" report from people that comes from the fact that Swift does not stage output files back straight away after a job finishes if there are other more important file transfers to do (where more important at the moment means stage-ins for other jobs) In the absence of other information, this would be the first thing I would check. 
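To make that behaviour concrete: with a single shared transfer queue in which stage-ins outrank stage-outs, a finished job's output files wait as long as new stage-ins keep arriving. The toy below is purely illustrative, with invented names; it is not Swift's transfer code.

import java.util.PriorityQueue;

// Toy model: stage-ins are dequeued before stage-outs, so job1's output is
// transferred only after the queued stage-ins for other jobs are done.
public class TransferQueueToy {
    static class Transfer implements Comparable<Transfer> {
        final String file;
        final boolean stageIn;

        Transfer(String file, boolean stageIn) {
            this.file = file;
            this.stageIn = stageIn;
        }

        public int compareTo(Transfer o) {
            // stage-ins sort first
            return Boolean.valueOf(o.stageIn).compareTo(Boolean.valueOf(stageIn));
        }
    }

    public static void main(String[] args) {
        PriorityQueue<Transfer> q = new PriorityQueue<Transfer>();
        q.add(new Transfer("job1.out", false)); // job1 already finished
        q.add(new Transfer("job2.in", true));
        q.add(new Transfer("job3.in", true));
        while (!q.isEmpty()) {
            Transfer t = q.poll();
            System.out.println((t.stageIn ? "stage-in  " : "stage-out ") + t.file);
        }
    }
}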
--

From benc at hawaga.org.uk Mon Jul 21 08:14:02 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 21 Jul 2008 13:14:02 +0000 (GMT)
Subject: [Swift-devel] Re: scheduler foo
In-Reply-To: References: Message-ID:

poking more at this. the approach used in the code at the moment seems to have a few problems:

The first: in situations like the code fragment below, the first overloadedCount adjustment takes away the overloadedness of the host in question, then adds the new overloadedness back. However, it relies on the first checkOverloaded returning the (negative of the) same as what has been added previously - either by a previous checkOverloaded call or by a timer thread.

This does not always seem to happen.

overloadedCount += checkOverloaded(wh, -1);
wh.setScore(newScore);
weightedHosts.put(wh.getHost(), wh);
scores.add(wh);
sum += wh.getTScore();
overloadedCount += checkOverloaded(wh, 1);

Secondly, timer tasks are not killed, so a large number of timer tasks get started up for overloaded sites. This appears to cause some weird behaviour.

I've put in commits that tidy up a few apparent bugs, but the abovementioned behaviour remains. It seems difficult to deal with the above in an O(1) fashion - if moving away from that, an explicit check of each site's overload status (as implemented in the previous patches on this thread) appears to work.

--

From hategan at mcs.anl.gov Mon Jul 21 09:30:59 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 21 Jul 2008 09:30:59 -0500
Subject: [Swift-devel] Re: scheduler foo
In-Reply-To: References: Message-ID: <1216650659.4064.1.camel@localhost>

On Mon, 2008-07-21 at 13:14 +0000, Ben Clifford wrote:
> poking more at this. the approach used in the code at the moment seems to
> have a few problems:
>
> The first: in situations like the code fragment below, the first
> overloadedCount adjustment takes away the overloadedness of the host in
> question, then adds the new overloadedness back. However, it relies on the
> first checkOverloaded returning the (negative of the) same as what has
> been added previously - either by a previous checkOverloaded call or by a
> timer thread.
>
> This does not always seem to happen.
>
> overloadedCount += checkOverloaded(wh, -1);
> wh.setScore(newScore);
> weightedHosts.put(wh.getHost(), wh);
> scores.add(wh);
> sum += wh.getTScore();
> overloadedCount += checkOverloaded(wh, 1);
>
> Secondly, timer tasks are not killed, so a large number of timer tasks get
> started up for overloaded sites.

Oh, I see. They are started multiple times for the same delay.

> This appears to cause some weird
> behaviour.
>
> I've put in commits that tidy up a few apparent bugs, but the
> abovementioned behaviour remains. It seems difficult to deal with the
> above in an O(1) fashion - if moving away from that, an explicit check of
> each site's overload status (as implemented in the previous patches on
> this thread) appears to work.
>

From benc at hawaga.org.uk Mon Jul 21 10:27:32 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 21 Jul 2008 15:27:32 +0000 (GMT)
Subject: [Swift-devel] Re: scheduler foo
In-Reply-To: <1216650659.4064.1.camel@localhost>
References: <1216650659.4064.1.camel@localhost>
Message-ID:

On Mon, 21 Jul 2008, Mihael Hategan wrote:

> > Secondly, timer tasks are not killed, so a large number of timer tasks
> > get started up for overloaded sites.
>
> Oh, I see. They are started multiple times for the same delay.

yes.
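A sketch of the lookup-table bookkeeping under discussion, with invented names; this is not the WeightedHostScoreScheduler source. The idea: keep at most one pending backoff task per host by cancelling the previous one before scheduling the next.

import java.util.HashMap;
import java.util.Map;
import java.util.Timer;
import java.util.TimerTask;

public class BackoffTimerTable {
    private final Timer timer = new Timer(true); // daemon timer thread
    private final Map pending = new HashMap();   // host name -> TimerTask

    public synchronized void schedule(final String host, long delayMs,
                                      final Runnable recheck) {
        TimerTask old = (TimerTask) pending.get(host);
        if (old != null) {
            old.cancel(); // stop tasks piling up for the same overloaded host
        }
        TimerTask task = new TimerTask() {
            public void run() {
                synchronized (BackoffTimerTable.this) {
                    pending.remove(host);
                }
                recheck.run(); // re-evaluate the host's overloaded state
            }
        };
        pending.put(host, task);
        timer.schedule(task, delayMs);
    }
}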
I fiddled a little with cancelling timers before launching the next to see if it made the code better behaved. It does, though I'm still not convinced that its right. Also to do that in the present layout needs some kind of lookup table to figure out which timer to cancel which is increased complexity. -- From hategan at mcs.anl.gov Mon Jul 21 11:58:37 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 21 Jul 2008 11:58:37 -0500 Subject: [Swift-devel] Re: scheduler foo In-Reply-To: References: <1216650659.4064.1.camel@localhost> Message-ID: <1216659517.6481.1.camel@localhost> On Mon, 2008-07-21 at 15:27 +0000, Ben Clifford wrote: > > On Mon, 21 Jul 2008, Mihael Hategan wrote: > > > > Secondly timer tasks are not killed, so a large number of timer tasks get > > > started up for overloaded sites. > > > > Oh, I see. They are started multiple times for the same delay. > > yes. > > I fiddled a little with cancelling timers before launching the next to see > if it made the code better behaved. It does, though I'm still not > convinced that its right. Also to do that in the present layout needs some > kind of lookup table to figure out which timer to cancel which is > increased complexity. Right. The better strategy would be to keep track of waiting hosts using a O(1) op. Let me poke around a bit today and tomorrow. > From zhaozhang at uchicago.edu Mon Jul 21 17:26:24 2008 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Mon, 21 Jul 2008 17:26:24 -0500 Subject: [Swift-devel] A naive run of Falkon+Swift on BGP login node. Message-ID: <48850D10.7050103@uchicago.edu> Hi, I started a test on BGP login nodes, running falkon service and swift on Login6, and a worker on Login2. Good news is I got the output file. Swift return successful. Bad news is there are some problems I don't understand. The swift stdout: /Line 1: zzhang at login6.surveyor:~/swift/etc> swift -sites.file ./sites.xml -tc.file ./tc.data -ip.address 172.17.3.16 first.swift Line 2: Unable to find required classes (javax.activation.DataHandler and javax.mail.internet.MimeMultipart). Attachment support is disabled. Line 3: Swift svn swift-r2140 cog-r2070 Line 4: RunID: 20080721-1713-zkz78kcf Line 5: Progress: Line 6: echo started Line 7: error: Notification(int timeout): socket = new ServerSocket(recvPort); Address already in use Line 8: Waiting for notification for 0 ms Line 9: Received notification with 1 messages Line 10: echo completed Line 11: Final status: Finished successfully:1/ 1. What is the exception in Line 2? is this ignorable or not? 2. What is the error in Line 7? Is it printed by swift or the deef-provider? Is this ignorable or not? 
The following exception from Falkon only occurs when I specify the ip.address property in swift The falkon stdout: /2008-07-21 17:00:46,325 ERROR handler.AddressingHandler [ServiceThread-6,invoke:120] Exception in AddressingHandler AxisFault faultCode: {http://schemas.xmlsoap.org/soap/envelope/}Server.userException faultSubcode: faultString: java.io.IOException: '' For input string: "" faultActor: faultNode: faultDetail: {http://xml.apache.org/axis/}stackTrace:java.io.IOException: '' For input string: "" at org.apache.axis.transport.http.ChunkedInputStream.getChunked(ChunkedInputStream.java:161) at org.apache.axis.transport.http.ChunkedInputStream.read(ChunkedInputStream.java:53) at org.apache.xerces.impl.XMLEntityManager$RewindableInputStream.read(Unknown Source) at org.apache.xerces.impl.XMLEntityManager.setupCurrentEntity(Unknown Source) at org.apache.xerces.impl.XMLVersionDetector.determineDocVersion(Unknown Source) at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source) at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source) at org.apache.xerces.parsers.XMLParser.parse(Unknown Source) at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source) at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source) at org.apache.axis.encoding.DeserializationContext.parse(DeserializationContext.java:227) at org.apache.axis.SOAPPart.getAsSOAPEnvelope(SOAPPart.java:645) at org.apache.axis.Message.getSOAPEnvelope(Message.java:424) at org.apache.axis.message.addressing.handler.AddressingHandler.processServerRequest(AddressingHandler.java:328) at org.globus.wsrf.handlers.AddressingHandler.processServerRequest(AddressingHandler.java:77) at org.apache.axis.message.addressing.handler.AddressingHandler.invoke(AddressingHandler.java:114) at org.apache.axis.strategies.InvocationStrategy.visit(InvocationStrategy.java:32) at org.apache.axis.SimpleChain.doVisiting(SimpleChain.java:118) at org.apache.axis.SimpleChain.invoke(SimpleChain.java:83) at org.apache.axis.server.AxisServer.invoke(AxisServer.java:248) at org.globus.wsrf.container.ServiceThread.doPost(ServiceThread.java:664) at org.globus.wsrf.container.ServiceThread.process(ServiceThread.java:382) at org.globus.wsrf.container.ServiceThread.run(ServiceThread.java:291) {http://xml.apache.org/axis/}hostname:login6 / Ioan, any idea about this? I am also attaching the swift log, could anyone check this to tell if there is a problem there, and most important thing is that if swift is using the IP address I specified in the --ip.address parameter? Thanks so much for the help best wishes zhangzhao -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: first-20080721-1713-zkz78kcf.log URL: From hategan at mcs.anl.gov Mon Jul 21 17:35:06 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 21 Jul 2008 17:35:06 -0500 Subject: [Swift-devel] Re: scheduler foo In-Reply-To: <1216659517.6481.1.camel@localhost> References: <1216650659.4064.1.camel@localhost> <1216659517.6481.1.camel@localhost> Message-ID: <1216679706.18694.6.camel@localhost> On Mon, 2008-07-21 at 11:58 -0500, Mihael Hategan wrote: > On Mon, 2008-07-21 at 15:27 +0000, Ben Clifford wrote: > > > > On Mon, 21 Jul 2008, Mihael Hategan wrote: > > > > > > Secondly timer tasks are not killed, so a large number of timer tasks get > > > > started up for overloaded sites. > > > > > > Oh, I see. They are started multiple times for the same delay. > > > > yes. 
> > > > I fiddled a little with cancelling timers before launching the next to see > > if it made the code better behaved. It does, though I'm still not > > convinced that its right. Also to do that in the present layout needs some > > kind of lookup table to figure out which timer to cancel which is > > increased complexity. > > Right. The better strategy would be to keep track of waiting hosts using > a O(1) op. > > Let me poke around a bit today and tomorrow. So one alternative is to have a separate thread that polls the sites for changes in overloadedness. This is a certain type of constant time and seems to work. I'll test this some more with various setups and see how it pans out. But I noticed that a number of times the site score is affected by failure of tasks that we reasonably expect to fail. Such as the absence of one of the stdout/err files. So I think it's an oversimplification to change the score based on the type of a task. It should probably be coupled with other things. Which I don't like because it breaks abstraction. Which means it's a relatively broken abstraction. > > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From iraicu at cs.uchicago.edu Mon Jul 21 17:36:03 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Mon, 21 Jul 2008 17:36:03 -0500 Subject: [Swift-devel] Re: A naive run of Falkon+Swift on BGP login node. In-Reply-To: <48850D10.7050103@uchicago.edu> References: <48850D10.7050103@uchicago.edu> Message-ID: <48850F53.3010300@cs.uchicago.edu> Zhao Zhang wrote: > Hi, > > I started a test on BGP login nodes, running falkon service and swift > on Login6, and a worker on Login2. > Good news is I got the output file. Swift return successful. Bad news > is there are some problems I don't > understand. > > The swift stdout: > /Line 1: zzhang at login6.surveyor:~/swift/etc> swift -sites.file > ./sites.xml -tc.file ./tc.data -ip.address 172.17.3.16 first.swift > Line 2: Unable to find required classes > (javax.activation.DataHandler and javax.mail.internet.MimeMultipart). > Attachment support is disabled. > Line 3: Swift svn swift-r2140 cog-r2070 > > Line 4: RunID: 20080721-1713-zkz78kcf > Line 5: Progress: > Line 6: echo started > Line 7: error: Notification(int timeout): socket = new > ServerSocket(recvPort); Address already in use > Line 8: Waiting for notification for 0 ms > Line 9: Received notification with 1 messages > Line 10: echo completed > Line 11: Final status: Finished successfully:1/ > > 1. What is the exception in Line 2? is this ignorable or not? This is not a Falkon provider exception, so I don't know. > 2. What is the error in Line 7? Is it printed by swift or the > deef-provider? Is this ignorable or not? > You can ignore this, it should really be just a warning. 
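Returning to the scheduler thread: a sketch, with invented names, of the polling alternative Mihael describes above. The overloaded count is recomputed from scratch on each pass, so the cached value cannot drift, instead of being maintained with incremental +/- adjustments.

import java.util.ArrayList;
import java.util.List;

public class OverloadMonitor implements Runnable {
    public interface Host {
        boolean isOverloaded();
    }

    private final List hosts = new ArrayList(); // of Host
    private volatile int overloadedCount;

    public void addHost(Host h) {
        synchronized (hosts) {
            hosts.add(h);
        }
    }

    public int getOverloadedCount() {
        return overloadedCount;
    }

    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            int count = 0;
            synchronized (hosts) {
                for (int i = 0; i < hosts.size(); i++) {
                    if (((Host) hosts.get(i)).isOverloaded()) {
                        count++;
                    }
                }
            }
            overloadedCount = count; // assigned, never incrementally adjusted
            try {
                Thread.sleep(1000); // one full scan per second
            } catch (InterruptedException e) {
                return;
            }
        }
    }
}

Started once as a daemon thread, this does a fixed amount of work per host per pass and the scheduler just reads getOverloadedCount(); the cost is spread evenly over time rather than being O(1) per event, which matches the "certain type of constant time" caveat above.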
> > > The following exception from Falkon only occurs when I specify the > ip.address property in swift > The falkon stdout: > > /2008-07-21 17:00:46,325 ERROR handler.AddressingHandler > [ServiceThread-6,invoke:120] Exception in AddressingHandler > AxisFault > faultCode: > {http://schemas.xmlsoap.org/soap/envelope/}Server.userException > faultSubcode: > faultString: java.io.IOException: '' For input string: "" > faultActor: > faultNode: > faultDetail: > {http://xml.apache.org/axis/}stackTrace:java.io.IOException: '' > For input string: "" > at > org.apache.axis.transport.http.ChunkedInputStream.getChunked(ChunkedInputStream.java:161) > > at > org.apache.axis.transport.http.ChunkedInputStream.read(ChunkedInputStream.java:53) > > at > org.apache.xerces.impl.XMLEntityManager$RewindableInputStream.read(Unknown > Source) > at > org.apache.xerces.impl.XMLEntityManager.setupCurrentEntity(Unknown > Source) > at > org.apache.xerces.impl.XMLVersionDetector.determineDocVersion(Unknown > Source) > at org.apache.xerces.parsers.XML11Configuration.parse(Unknown > Source) > at org.apache.xerces.parsers.XML11Configuration.parse(Unknown > Source) > at org.apache.xerces.parsers.XMLParser.parse(Unknown Source) > at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown > Source) > at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source) > at > org.apache.axis.encoding.DeserializationContext.parse(DeserializationContext.java:227) > > at org.apache.axis.SOAPPart.getAsSOAPEnvelope(SOAPPart.java:645) > at org.apache.axis.Message.getSOAPEnvelope(Message.java:424) > at > org.apache.axis.message.addressing.handler.AddressingHandler.processServerRequest(AddressingHandler.java:328) > > at > org.globus.wsrf.handlers.AddressingHandler.processServerRequest(AddressingHandler.java:77) > > at > org.apache.axis.message.addressing.handler.AddressingHandler.invoke(AddressingHandler.java:114) > > at > org.apache.axis.strategies.InvocationStrategy.visit(InvocationStrategy.java:32) > > at org.apache.axis.SimpleChain.doVisiting(SimpleChain.java:118) > at org.apache.axis.SimpleChain.invoke(SimpleChain.java:83) > at org.apache.axis.server.AxisServer.invoke(AxisServer.java:248) > at > org.globus.wsrf.container.ServiceThread.doPost(ServiceThread.java:664) > at > org.globus.wsrf.container.ServiceThread.process(ServiceThread.java:382) > at > org.globus.wsrf.container.ServiceThread.run(ServiceThread.java:291) > > {http://xml.apache.org/axis/}hostname:login6 > / > Ioan, any idea about this? Not really sure what is wrong. Try to fix the exception from line 2 first. Also, Falkon is using GT4.0.x, is Swift still on GT4.0.x libs? Ioan > > I am also attaching the swift log, could anyone check this to tell if > there is a problem there, and most important thing > is that if swift is using the IP address I specified in the > --ip.address parameter? > > Thanks so much for the help > > best wishes > zhangzhao -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 
58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== From hategan at mcs.anl.gov Mon Jul 21 17:39:09 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 21 Jul 2008 17:39:09 -0500 Subject: [Swift-devel] A naive run of Falkon+Swift on BGP login node. In-Reply-To: <48850D10.7050103@uchicago.edu> References: <48850D10.7050103@uchicago.edu> Message-ID: <1216679949.18694.10.camel@localhost> On Mon, 2008-07-21 at 17:26 -0500, Zhao Zhang wrote: > Hi, > > I started a test on BGP login nodes, running falkon service and swift on > Login6, and a worker on Login2. > Good news is I got the output file. Swift return successful. Bad news is > there are some problems I don't > understand. > > The swift stdout: > /Line 1: zzhang at login6.surveyor:~/swift/etc> swift -sites.file > ./sites.xml -tc.file ./tc.data -ip.address 172.17.3.16 first.swift > Line 2: Unable to find required classes (javax.activation.DataHandler > and javax.mail.internet.MimeMultipart). Attachment support is disabled. > Line 3: Swift svn swift-r2140 cog-r2070 > > Line 4: RunID: 20080721-1713-zkz78kcf > Line 5: Progress: > Line 6: echo started > Line 7: error: Notification(int timeout): socket = new > ServerSocket(recvPort); Address already in use > Line 8: Waiting for notification for 0 ms > Line 9: Received notification with 1 messages > Line 10: echo completed > Line 11: Final status: Finished successfully:1/ > > 1. What is the exception in Line 2? is this ignorable or not? Yes. It's axis complaining about some missing stuff that is never used in this case. > 2. What is the error in Line 7? Is it printed by swift or the > deef-provider? provider-deef. Do you have another swift instance running by any chance? > Is this ignorable or not? It isn't. It probably means that the falkon notifications won't get to you. > > > > The following exception from Falkon only occurs when I specify the > ip.address property in swift What exactly did you set it to? Mihael From hategan at mcs.anl.gov Mon Jul 21 17:41:36 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 21 Jul 2008 17:41:36 -0500 Subject: [Swift-devel] Re: A naive run of Falkon+Swift on BGP login node. In-Reply-To: <48850F53.3010300@cs.uchicago.edu> References: <48850D10.7050103@uchicago.edu> <48850F53.3010300@cs.uchicago.edu> Message-ID: <1216680096.18694.14.camel@localhost> > > Line 7: error: Notification(int timeout): socket = new > > ServerSocket(recvPort); Address already in use > > Line 8: Waiting for notification for 0 ms > > Line 9: Received notification with 1 messages > > Line 10: echo completed > > Line 11: Final status: Finished successfully:1/ > > > > 1. What is the exception in Line 2? is this ignorable or not? > This is not a Falkon provider exception, so I don't know. > > 2. What is the error in Line 7? Is it printed by swift or the > > deef-provider? Is this ignorable or not? > > > You can ignore this, it should really be just a warning. Oops. Sorry. Nevermind what I said. From iraicu at cs.uchicago.edu Mon Jul 21 17:38:35 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Mon, 21 Jul 2008 17:38:35 -0500 Subject: [Swift-devel] A naive run of Falkon+Swift on BGP login node. 
In-Reply-To: <1216679949.18694.10.camel@localhost> References: <48850D10.7050103@uchicago.edu> <1216679949.18694.10.camel@localhost> Message-ID: <48850FEB.2020108@cs.uchicago.edu> Mihael Hategan wrote: > On Mon, 2008-07-21 at 17:26 -0500, Zhao Zhang wrote: > >> Hi, >> >> I started a test on BGP login nodes, running falkon service and swift on >> Login6, and a worker on Login2. >> Good news is I got the output file. Swift return successful. Bad news is >> there are some problems I don't >> understand. >> >> The swift stdout: >> /Line 1: zzhang at login6.surveyor:~/swift/etc> swift -sites.file >> ./sites.xml -tc.file ./tc.data -ip.address 172.17.3.16 first.swift >> Line 2: Unable to find required classes (javax.activation.DataHandler >> and javax.mail.internet.MimeMultipart). Attachment support is disabled. >> Line 3: Swift svn swift-r2140 cog-r2070 >> >> Line 4: RunID: 20080721-1713-zkz78kcf >> Line 5: Progress: >> Line 6: echo started >> Line 7: error: Notification(int timeout): socket = new >> ServerSocket(recvPort); Address already in use >> Line 8: Waiting for notification for 0 ms >> Line 9: Received notification with 1 messages >> Line 10: echo completed >> Line 11: Final status: Finished successfully:1/ >> >> 1. What is the exception in Line 2? is this ignorable or not? >> > > Yes. It's axis complaining about some missing stuff that is never used > in this case. > > >> 2. What is the error in Line 7? Is it printed by swift or the >> deef-provider? >> > > provider-deef. Do you have another swift instance running by any chance? > > >> Is this ignorable or not? >> > > It isn't. It probably means that the falkon notifications won't get to > you. > This error should just be a warning... as it tries a different port until it finds a good one. It should only print an error when it gives up. So, that is not your problem Zhao, especially as you seem to have run OK, right? Line 11: Final status: Finished successfully:1/ Ioan > >> >> The following exception from Falkon only occurs when I specify the >> ip.address property in swift >> > > What exactly did you set it to? > > Mihael > > > -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Mon Jul 21 17:43:30 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 21 Jul 2008 17:43:30 -0500 Subject: [Swift-devel] Re: A naive run of Falkon+Swift on BGP login node. In-Reply-To: <48850F53.3010300@cs.uchicago.edu> References: <48850D10.7050103@uchicago.edu> <48850F53.3010300@cs.uchicago.edu> Message-ID: <1216680210.18694.17.camel@localhost> On Mon, 2008-07-21 at 17:36 -0500, Ioan Raicu wrote: > > Ioan, any idea about this? > Not really sure what is wrong. Try to fix the exception from line 2 > first. Not the problem. Normally in the wsrf log4j.properties this is masked out. It's the log4j.properties in swift that doesn't. We should change that. > Also, Falkon is using GT4.0.x, is Swift still on GT4.0.x libs? 
Yes. It's still on gt4.0 From hategan at mcs.anl.gov Mon Jul 21 17:44:50 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 21 Jul 2008 17:44:50 -0500 Subject: [Swift-devel] A naive run of Falkon+Swift on BGP login node. In-Reply-To: <48850FEB.2020108@cs.uchicago.edu> References: <48850D10.7050103@uchicago.edu> <1216679949.18694.10.camel@localhost> <48850FEB.2020108@cs.uchicago.edu> Message-ID: <1216680290.20073.0.camel@localhost> On Mon, 2008-07-21 at 17:38 -0500, Ioan Raicu wrote: > > > This error should just be a warning... as it tries a different port > until it finds a good one. It should only print an error when it > gives up. So, that is not your problem Zhao, especially as you seem > to have run OK, right? > > Line 11: Final status: Finished successfully:1/ Yep. Sorry. Spoke without knowing. From iraicu at cs.uchicago.edu Mon Jul 21 17:46:32 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Mon, 21 Jul 2008 17:46:32 -0500 Subject: [Swift-devel] A naive run of Falkon+Swift on BGP login node. In-Reply-To: <1216680290.20073.0.camel@localhost> References: <48850D10.7050103@uchicago.edu> <1216679949.18694.10.camel@localhost> <48850FEB.2020108@cs.uchicago.edu> <1216680290.20073.0.camel@localhost> Message-ID: <488511C8.2000707@cs.uchicago.edu> So Zhao, did it actually work, but you got those two errors and wanted to know what the errors were? If things worked as expected, then you should be fine, you can ignore both of those errors (I think). If things didn't work as expected, then we need to dig deeper to find out why. Ioan Mihael Hategan wrote: > On Mon, 2008-07-21 at 17:38 -0500, Ioan Raicu wrote: > >>> >>> >> This error should just be a warning... as it tries a different port >> until it finds a good one. It should only print an error when it >> gives up. So, that is not your problem Zhao, especially as you seem >> to have run OK, right? >> >> Line 11: Final status: Finished successfully:1/ >> > > Yep. Sorry. Spoke without knowing. > > > > -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== -------------- next part -------------- An HTML attachment was scrubbed... URL: From zhaozhang at uchicago.edu Mon Jul 21 18:04:41 2008 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Mon, 21 Jul 2008 18:04:41 -0500 Subject: [Swift-devel] A naive run of Falkon+Swift on BGP login node. In-Reply-To: <488511C8.2000707@cs.uchicago.edu> References: <48850D10.7050103@uchicago.edu> <1216679949.18694.10.camel@localhost> <48850FEB.2020108@cs.uchicago.edu> <1216680290.20073.0.camel@localhost> <488511C8.2000707@cs.uchicago.edu> Message-ID: <48851609.8050909@uchicago.edu> In this test case, it actually worked. I talked with Mike, and we don't quite understand these 2 things. So I sent them out. After that I started another test. Running, swift on Login Node, falkon service on IO node, and BGexec on CN. 
At the very end of the service log, I got this:

847.985 2 2 25 256 256 0 0 0 0 0 0 0 0.0 2 0 0 0 0 0 0 0 0 0.0 0.0 0 0 100 512 288 512
848.985 2 2 25 256 256 0 0 0 0 0 0 0 0.0 2 0 0 0 0 0 0 0 0 0.0 0.0 0 0 100 512 288 512
849.985 2 2 25 256 256 0 0 0 0 0 0 0 0.0 2 0 0 0 0 0 0 0 0 0.0 0.0 0 0 100 512 287 512
850.985 2 2 25 256 256 0 0 0 0 0 0 0 0.0 2 0 0 0 0 0 0 0 0 0.0 0.0 0 0 100 512 287 512
851.985 2 2 25 256 256 0 0 0 0 0 0 0 0.0 2 0 0 0 0 0 0 0 0 0.0 0.0 0 0 100 512 287 512

This means that we are still suffering the endpoint problem, right?

And from swift stdout:

zzhang at login6.surveyor:~/swift/etc> swift -sites.file ./sites.xml -tc.file ./tc.data -ip.address 172.17.3.16 first.swift
Unable to find required classes (javax.activation.DataHandler and javax.mail.internet.MimeMultipart). Attachment support is disabled.
Swift svn swift-r2140 cog-r2070
RunID: 20080721-1748-m9d39dg9
Progress:
echo started
Progress: Executing:1
Progress: Executing:1
Progress: Executing:1
Progress: Executing:1
Progress: Executing:1
Progress: Executing:1
Progress: Executing:1
Progress: Executing:1
Progress: Executing:1
Progress: Executing:1
Progress: Executing:1
Progress: Executing:1

Swift kept waiting, which means the -ip.address doesn't work as we expected.

zhao

Ioan Raicu wrote:
> So Zhao, did it actually work, but you got those two errors and wanted
> to know what the errors were? If things worked as expected, then you
> should be fine, you can ignore both of those errors (I think). If
> things didn't work as expected, then we need to dig deeper to find out
> why.
>
> Ioan
>
> Mihael Hategan wrote:
>> On Mon, 2008-07-21 at 17:38 -0500, Ioan Raicu wrote:
>>
>>> This error should just be a warning... as it tries a different port
>>> until it finds a good one. It should only print an error when it
>>> gives up. So, that is not your problem Zhao, especially as you seem
>>> to have run OK, right?
>>>
>>> Line 11: Final status: Finished successfully:1/
>>>
>>
>> Yep. Sorry. Spoke without knowing.
>>
>
> --
> ===================================================
> Ioan Raicu
> Ph.D. Candidate
> ===================================================
> Distributed Systems Laboratory
> Computer Science Department
> University of Chicago
> 1100 E. 58th Street, Ryerson Hall
> Chicago, IL 60637
> ===================================================
> Email: iraicu at cs.uchicago.edu
> Web: http://www.cs.uchicago.edu/~iraicu
> http://dev.globus.org/wiki/Incubator/Falkon
> http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
> ===================================================
> ===================================================
>

From iraicu at cs.uchicago.edu Mon Jul 21 18:08:26 2008
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Mon, 21 Jul 2008 18:08:26 -0500
Subject: [Swift-devel] A naive run of Falkon+Swift on BGP login node.
In-Reply-To: <48851609.8050909@uchicago.edu>
References: <48850D10.7050103@uchicago.edu> <1216679949.18694.10.camel@localhost> <48850FEB.2020108@cs.uchicago.edu> <1216680290.20073.0.camel@localhost> <488511C8.2000707@cs.uchicago.edu> <48851609.8050909@uchicago.edu>
Message-ID: <488516EA.1080703@cs.uchicago.edu>

Zhao Zhang wrote:
> In this test case, it actually worked. I talked with Mike, and we
> don't quite understand these 2 things. So I sent them out.
>
> After that I started another test. Running swift on Login Node,
> falkon service on IO node, and BGexec on CN.
> At the very end of the service log, I got his: > 847.985 2 2 25 256 256 0 0 0 0 0 0 0 0.0 2 0 0 0 0 0 0 0 0 0.0 0.0 0 0 > 100 512 288 512 > 848.985 2 2 25 256 256 0 0 0 0 0 0 0 0.0 2 0 0 0 0 0 0 0 0 0.0 0.0 0 0 > 100 512 288 512 > 849.985 2 2 25 256 256 0 0 0 0 0 0 0 0.0 2 0 0 0 0 0 0 0 0 0.0 0.0 0 0 > 100 512 287 512 > 850.985 2 2 25 256 256 0 0 0 0 0 0 0 0.0 2 0 0 0 0 0 0 0 0 0.0 0.0 0 0 > 100 512 287 512 > 851.985 2 2 25 256 256 0 0 0 0 0 0 0 0.0 2 0 0 0 0 0 0 0 0 0.0 0.0 0 0 > 100 512 287 512 \ Right, it can't deliver the 2 tasks, as there would have been a 2 before the 0.0 in the middle. > > This means that we are still suffering the endpoint problem, right? Right! You might want to put some debug statements in the Falkon provider to print the end point IP address, to make sure it is the one you are expecting. Ioan > > And from swift stdout, > zzhang at login6.surveyor:~/swift/etc> swift -sites.file ./sites.xml > -tc.file ./tc.data -ip.address 172.17.3.16 first.swift > Unable to find required classes (javax.activation.DataHandler and > javax.mail.internet.MimeMultipart). Attachment support is disabled. > Swift svn swift-r2140 cog-r2070 > > RunID: 20080721-1748-m9d39dg9 > Progress: > echo started > Progress: Executing:1 > Progress: Executing:1 > Progress: Executing:1 > Progress: Executing:1 > Progress: Executing:1 > Progress: Executing:1 > Progress: Executing:1 > Progress: Executing:1 > Progress: Executing:1 > Progress: Executing:1 > Progress: Executing:1 > Progress: Executing:1 > > Swift kept waiting, which mean the -ip.address doesn't work as we > expexted. > > zhao > > Ioan Raicu wrote: >> So Zhao, did it actually work, but you got those two errors and >> wanted to know what the errors were? If things worked as expected, >> then you should be fine, you can ignore both of those errors (I >> think). If things didn't work as expected, then we need to dig >> deeper to find out why. >> >> Ioan >> >> Mihael Hategan wrote: >>> On Mon, 2008-07-21 at 17:38 -0500, Ioan Raicu wrote: >>> >>>>> >>>> This error should just be a warning... as it tries a different port >>>> until it finds a good one. It should only print an error when it >>>> gives up. So, that is not your problem Zhao, especially as you seem >>>> to have run OK, right? >>>> Line 11: Final status: Finished successfully:1/ >>>> >>> >>> Yep. Sorry. Spoke without knowing. >>> >>> >>> >>> >> >> -- >> =================================================== >> Ioan Raicu >> Ph.D. Candidate >> =================================================== >> Distributed Systems Laboratory >> Computer Science Department >> University of Chicago >> 1100 E. 58th Street, Ryerson Hall >> Chicago, IL 60637 >> =================================================== >> Email: iraicu at cs.uchicago.edu >> Web: http://www.cs.uchicago.edu/~iraicu >> http://dev.globus.org/wiki/Incubator/Falkon >> http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page >> =================================================== >> =================================================== >> >> > -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 
58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== From wilde at mcs.anl.gov Mon Jul 21 18:18:16 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 21 Jul 2008 18:18:16 -0500 Subject: [Swift-devel] A naive run of Falkon+Swift on BGP login node. In-Reply-To: <488516EA.1080703@cs.uchicago.edu> References: <48850D10.7050103@uchicago.edu> <1216679949.18694.10.camel@localhost> <48850FEB.2020108@cs.uchicago.edu> <1216680290.20073.0.camel@localhost> <488511C8.2000707@cs.uchicago.edu> <48851609.8050909@uchicago.edu> <488516EA.1080703@cs.uchicago.edu> Message-ID: <48851938.4070507@mcs.anl.gov> On 7/21/08 6:08 PM, Ioan Raicu wrote: > > > Zhao Zhang wrote: >> In this test case, it actually worked. I talked with Mike, and we >> don't quite understand these 2 things. So I sent them out. >> >> After that I started another test. Running, swift on Login Node, >> falkon service on IO node, and BGexec on CN. >> At the very end of the service log, I got his: >> 847.985 2 2 25 256 256 0 0 0 0 0 0 0 0.0 2 0 0 0 0 0 0 0 0 0.0 0.0 0 0 >> 100 512 288 512 >> 848.985 2 2 25 256 256 0 0 0 0 0 0 0 0.0 2 0 0 0 0 0 0 0 0 0.0 0.0 0 0 >> 100 512 288 512 >> 849.985 2 2 25 256 256 0 0 0 0 0 0 0 0.0 2 0 0 0 0 0 0 0 0 0.0 0.0 0 0 >> 100 512 287 512 >> 850.985 2 2 25 256 256 0 0 0 0 0 0 0 0.0 2 0 0 0 0 0 0 0 0 0.0 0.0 0 0 >> 100 512 287 512 >> 851.985 2 2 25 256 256 0 0 0 0 0 0 0 0.0 2 0 0 0 0 0 0 0 0 0.0 0.0 0 0 >> 100 512 287 512 \ > Right, it can't deliver the 2 tasks, as there would have been a 2 before > the 0.0 in the middle. >> >> This means that we are still suffering the endpoint problem, right? > Right! > > You might want to put some debug statements in the Falkon provider to > print the end point IP address, to make sure it is the one you are > expecting. that debug logging is there, but not sure if or where its getting logged: In src/cog/modules/provider-deef/src/org/globus/cog/abstraction/impl/execution/deef/ResourcePool.java the changed code tries to log as follows: public static String getMachNamePort(Notification userNot){ //String machIP = VDL2Config.getIP(); String machIP = CoGProperties.getDefault().getIPAddress(); String machNamePort = new String (machIP + ":" + userNot.recvPort); logger.debug("WORKER: Machine ID = " + machNamePort); return machNamePort; } Zhao, did you see "WORKER: Machine ID = " in your swift log? - Mike > Ioan >> >> And from swift stdout, >> zzhang at login6.surveyor:~/swift/etc> swift -sites.file ./sites.xml >> -tc.file ./tc.data -ip.address 172.17.3.16 first.swift >> Unable to find required classes (javax.activation.DataHandler and >> javax.mail.internet.MimeMultipart). Attachment support is disabled. >> Swift svn swift-r2140 cog-r2070 >> >> RunID: 20080721-1748-m9d39dg9 >> Progress: >> echo started >> Progress: Executing:1 >> Progress: Executing:1 >> Progress: Executing:1 >> Progress: Executing:1 >> Progress: Executing:1 >> Progress: Executing:1 >> Progress: Executing:1 >> Progress: Executing:1 >> Progress: Executing:1 >> Progress: Executing:1 >> Progress: Executing:1 >> Progress: Executing:1 >> >> Swift kept waiting, which mean the -ip.address doesn't work as we >> expexted. 
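A hypothetical standalone check for Mike's question above: print the address CoG will advertise, to confirm whether swift's -ip.address setting actually reached CoGProperties. The org.globus.common package name is an assumption taken from the Java CoG kit; the quoted ResourcePool code shows only the unqualified class name.

// Assumed import path; CoGProperties.getDefault().getIPAddress() itself is
// the same call used in the getMachNamePort() snippet above.
import org.globus.common.CoGProperties;

public class IpCheck {
    public static void main(String[] args) {
        // Expect 172.17.3.16 here if the flag took effect; anything else means
        // workers would be told to send notifications to the wrong endpoint.
        System.out.println(CoGProperties.getDefault().getIPAddress());
    }
}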
>> >> zhao >> >> Ioan Raicu wrote: >>> So Zhao, did it actually work, but you got those two errors and >>> wanted to know what the errors were? If things worked as expected, >>> then you should be fine, you can ignore both of those errors (I >>> think). If things didn't work as expected, then we need to dig >>> deeper to find out why. >>> >>> Ioan >>> >>> Mihael Hategan wrote: >>>> On Mon, 2008-07-21 at 17:38 -0500, Ioan Raicu wrote: >>>> >>>>>> >>>>> This error should just be a warning... as it tries a different port >>>>> until it finds a good one. It should only print an error when it >>>>> gives up. So, that is not your problem Zhao, especially as you seem >>>>> to have run OK, right? Line 11: Final status: Finished >>>>> successfully:1/ >>>>> >>>> >>>> Yep. Sorry. Spoke without knowing. >>>> >>>> >>>> >>>> >>> >>> -- >>> =================================================== >>> Ioan Raicu >>> Ph.D. Candidate >>> =================================================== >>> Distributed Systems Laboratory >>> Computer Science Department >>> University of Chicago >>> 1100 E. 58th Street, Ryerson Hall >>> Chicago, IL 60637 >>> =================================================== >>> Email: iraicu at cs.uchicago.edu >>> Web: http://www.cs.uchicago.edu/~iraicu >>> http://dev.globus.org/wiki/Incubator/Falkon >>> http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page >>> =================================================== >>> =================================================== >>> >>> >> > From hategan at mcs.anl.gov Mon Jul 21 20:19:00 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 21 Jul 2008 20:19:00 -0500 Subject: [Swift-devel] Re: scheduler foo In-Reply-To: <1216679706.18694.6.camel@localhost> References: <1216650659.4064.1.camel@localhost> <1216659517.6481.1.camel@localhost> <1216679706.18694.6.camel@localhost> Message-ID: <1216689540.23025.0.camel@localhost> On Mon, 2008-07-21 at 17:35 -0500, Mihael Hategan wrote: > > So one alternative is to have a separate thread that polls the sites for > changes in overloadedness. This is a certain type of constant time and > seems to work. I'll test this some more with various setups and see how > it pans out. I committed what I've been working on. cog r2072. From wilde at mcs.anl.gov Tue Jul 22 00:03:42 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 22 Jul 2008 00:03:42 -0500 Subject: [Swift-devel] swift -typecheck gives null pointer execption Message-ID: <48856A2E.30106@mcs.anl.gov> For the attached file ab1.swift, I get the error below. I was trying to track down a different Swift problem for Alina when this occurred. (hence the commented out code) This was in ~wilde/testBLAST where you can find the mappers, which returned: communicado$ ./inmapper [0] /home/wilde/testBLAST/data/one.faa [1] /home/wilde/testBLAST/data/three.faa [2] /home/wilde/testBLAST/data/two.faa communicado$ ./medmapper [0].left /home/wilde/testBLAST/data/one.faa.left [0].right /home/wilde/testBLAST/data/one.faa.right [1].left /home/wilde/testBLAST/data/three.faa.left [1].right /home/wilde/testBLAST/data/three.faa.right [2].left /home/wilde/testBLAST/data/two.faa.left [2].right /home/wilde/testBLAST/data/two.faa.right communicado$ I realize medmapper is wrong and doesnt match the declared fields, but Swift shouldnt give a NPE. 
- Mike communicado$ swift -typecheck ab1.swift Swift svn swift-r2144 cog-r2072 RunID: 20080721-2351-dmm4uzjb Progress: Execution failed: java.lang.NullPointerException at org.griphyn.vdl.karajan.lib.GetFieldSubscript.function(GetFieldSubscript.java:39) at org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:65) at org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192) at org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33) at org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:335) at org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) at org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) at org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:173) at org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:299) at org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) at org.globus.cog.karajan.workflow.nodes.functions.Argument.post(Argument.java:45) at org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192) at org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33) at org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:335) at org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) at org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) at org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:173) at org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:299) at org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:46) at org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:51) at org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:27) at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.executeChildren(AbstractFunction.java:40) at org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63) at org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:240) at org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:281) at org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:393) at org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:332) at org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:227) at org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) at org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) at org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java:69) communicado$ -------------- next part -------------- A non-text attachment was scrubbed... 
Name: ab1.tar.gz Type: application/x-gzip Size: 8230 bytes Desc: not available URL: From bugzilla-daemon at mcs.anl.gov Tue Jul 22 02:19:32 2008 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Tue, 22 Jul 2008 02:19:32 -0500 (CDT) Subject: [Swift-devel] [Bug 150] New: multiple workers on one compute node Message-ID: http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=150 Summary: multiple workers on one compute node Product: Swift Version: unspecified Platform: Macintosh OS/Version: Mac OS Status: NEW Severity: normal Priority: P2 Component: General AssignedTo: benc at hawaga.org.uk ReportedBy: benc at hawaga.org.uk CC: swift-devel at ci.uchicago.edu Investigate running multiple workers on one compute node (probably via coasters) to make use of multiple cores with single-core applications. -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is. From benc at hawaga.org.uk Tue Jul 22 02:23:04 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 22 Jul 2008 07:23:04 +0000 (GMT) Subject: [Swift-devel] swift -typecheck gives null pointer execption In-Reply-To: <48856A2E.30106@mcs.anl.gov> References: <48856A2E.30106@mcs.anl.gov> Message-ID: ok, stick that in bugzilla. In the general area of typechecking, Milena's compile-time typechecking code should be hitting trunk Real Soon Now so the -typecheck option should become less necessary. On Tue, 22 Jul 2008, Michael Wilde wrote: > For the attached file ab1.swift, I get the error below. > I was trying to track down a different Swift problem for Alina when this > occurred. (hence the commented out code) > > This was in ~wilde/testBLAST where you can find the mappers, which returned: > > communicado$ ./inmapper > [0] /home/wilde/testBLAST/data/one.faa > [1] /home/wilde/testBLAST/data/three.faa > [2] /home/wilde/testBLAST/data/two.faa > communicado$ ./medmapper > [0].left /home/wilde/testBLAST/data/one.faa.left > [0].right /home/wilde/testBLAST/data/one.faa.right > [1].left /home/wilde/testBLAST/data/three.faa.left > [1].right /home/wilde/testBLAST/data/three.faa.right > [2].left /home/wilde/testBLAST/data/two.faa.left > [2].right /home/wilde/testBLAST/data/two.faa.right > communicado$ > > I realize medmapper is wrong and doesnt match the declared fields, but Swift > shouldnt give a NPE. 
>
> - Mike
>
>
> communicado$ swift -typecheck ab1.swift
> Swift svn swift-r2144 cog-r2072
>
> RunID: 20080721-2351-dmm4uzjb
> Progress:
> Execution failed:
> java.lang.NullPointerException
> [... stack trace identical to the one above ...]
> communicado$

From bugzilla-daemon at mcs.anl.gov  Tue Jul 22 02:23:36 2008
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Tue, 22 Jul 2008 02:23:36 -0500 (CDT)
Subject: [Swift-devel] [Bug 150] multiple workers on one compute node
In-Reply-To: 
Message-ID: <20080722072336.619801646B@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=150

benc at hawaga.org.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|normal                      |enhancement

-- 
Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.

From benc at hawaga.org.uk  Tue Jul 22 02:56:31 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 22 Jul 2008 07:56:31 +0000 (GMT)
Subject: [Swift-devel] Re: scheduler foo
In-Reply-To: <1216689540.23025.0.camel@localhost>
References: <1216650659.4064.1.camel@localhost> <1216659517.6481.1.camel@localhost>
        <1216679706.18694.6.camel@localhost> <1216689540.23025.0.camel@localhost>
Message-ID: 

On Mon, 21 Jul 2008, Mihael Hategan wrote:

> I committed what I've been working on. cog r2072.

On my laptop, test 085-iterate frequently (5 runs out of 5) gets stuck:
it runs some of the jobs (a different subset each time) and then does not
select a site for the next job for at least a few minutes.

I didn't wait to see if it selects a (the only) site after more than a
couple of minutes.

-- 

From bugzilla-daemon at mcs.anl.gov  Tue Jul 22 09:28:51 2008
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Tue, 22 Jul 2008 09:28:51 -0500 (CDT)
Subject: [Swift-devel] [Bug 151] New: Swift gives null pointer exception
Message-ID: 

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=151

           Summary: Swift gives null pointer exception
           Product: Swift
           Version: unspecified
          Platform: Other
               URL: http://www.ci.uchicago.edu/~wilde/logs/npebug.2008.0722.tar.gz
        OS/Version: Linux
            Status: NEW
          Severity: normal
          Priority: P2
         Component: SwiftScript language
        AssignedTo: benc at hawaga.org.uk
        ReportedBy: wilde at mcs.anl.gov

Running: swift ab1.swift gives an NPE.

This program certainly still has errors; the NPE occurred while debugging
those. The program was run in: communicado:/home/wilde/testBLAST/for_swift/npebug.

All logs and output, including tests of the ext mappers used by this
script, are in the file listed in the URL field.

-- 
Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You reported the bug, or are watching the reporter.
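The failure mode behind bug 151, a mapper whose output does not match the
declared type, surfaces as a bare NullPointerException at
GetFieldSubscript.java:39 (see the trace earlier in the thread). The actual
Swift source is not shown here; the sketch below only illustrates the kind
of hardening being asked for (report what failed instead of letting the
null escape), and every name in it is hypothetical:

    import java.util.Map;

    // Illustration only, not the real GetFieldSubscript code: turn a
    // failed field lookup into a descriptive error instead of an NPE.
    class FieldLookup {
        static Object getFieldSubscript(Map<Object, Object> fields,
                                        Object subscript, String varName) {
            Object field = fields.get(subscript);
            if (field == null) {
                throw new IllegalArgumentException("no element [" + subscript
                        + "] in variable '" + varName
                        + "'; the mapper output may not match the declared type");
            }
            return field;
        }
    }

With a guard like this, the medmapper mismatch above would produce a
one-line diagnostic naming the variable and subscript rather than the
stack trace that was reported.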
From bugzilla-daemon at mcs.anl.gov  Tue Jul 22 09:31:06 2008
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Tue, 22 Jul 2008 09:31:06 -0500 (CDT)
Subject: [Swift-devel] [Bug 151] Swift gives null pointer exception
In-Reply-To: 
Message-ID: <20080722143106.778FC16469@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=151

wilde at mcs.anl.gov changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |wilde at mcs.anl.gov

------- Comment #1 from wilde at mcs.anl.gov  2008-07-22 09:31 -------
Output was:

communicado$ cat swift.out
Swift svn swift-r2144 cog-r2072

RunID: 20080722-0916-yryjih55
Progress:
Execution failed:
java.lang.NullPointerException
        at org.griphyn.vdl.karajan.lib.GetFieldSubscript.function(GetFieldSubscript.java:39)
        [... rest of stack trace identical to the one in the original report ...]
        at org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java:69)
communicado$

-- 
Configure bugmail:
http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.
You reported the bug, or are watching the reporter.

From wilde at mcs.anl.gov  Tue Jul 22 09:33:12 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 22 Jul 2008 09:33:12 -0500
Subject: [Swift-devel] swift -typecheck gives null pointer exception
In-Reply-To: 
References: <48856A2E.30106@mcs.anl.gov>
Message-ID: <4885EFA8.7070303@mcs.anl.gov>

On 7/22/08 2:23 AM, Ben Clifford wrote:
> ok, stick that in bugzilla.

Done. It's bug 151.

> In the general area of typechecking, Milena's compile-time typechecking
> code should be hitting trunk Real Soon Now, so the -typecheck option should
> become less necessary.

It happens without requesting -typecheck, so it could be anywhere. The
logs referenced in bug 151 were run without -typecheck.

- Mike

> On Tue, 22 Jul 2008, Michael Wilde wrote:
>
>> For the attached file ab1.swift, I get the error below.
>> [...]
>> I realize medmapper is wrong and doesnt match the declared fields, but
>> Swift shouldnt give a NPE.
>>
>> - Mike
>>
>>
>> communicado$ swift -typecheck ab1.swift
>> Swift svn swift-r2144 cog-r2072
>>
>> RunID: 20080721-2351-dmm4uzjb
>> Progress:
>> Execution failed:
>> java.lang.NullPointerException
>> [... stack trace identical to the one above ...]
>> communicado$

From hategan at mcs.anl.gov  Tue Jul 22 10:57:44 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 22 Jul 2008 10:57:44 -0500
Subject: [Swift-devel] Re: scheduler foo
In-Reply-To: 
References: <1216650659.4064.1.camel@localhost> <1216659517.6481.1.camel@localhost>
        <1216679706.18694.6.camel@localhost> <1216689540.23025.0.camel@localhost>
Message-ID: <1216742264.28239.0.camel@localhost>
On Tue, 2008-07-22 at 07:56 +0000, Ben Clifford wrote:
> On Mon, 21 Jul 2008, Mihael Hategan wrote:
>
> > I committed what I've been working on. cog r2072.
>
> On my laptop, test 085-iterate frequently (5 runs out of 5) gets stuck:
> it runs some of the jobs (a different subset each time) and then does not
> select a site for the next job for at least a few minutes.
>
> I didn't wait to see if it selects a (the only) site after more than a
> couple of minutes.

Could be the backoff. Anyway, let me know exactly what configuration we're
talking about so that I can reproduce it.

From benc at hawaga.org.uk  Tue Jul 22 10:56:13 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 22 Jul 2008 15:56:13 +0000 (GMT)
Subject: [Swift-devel] Re: scheduler foo
In-Reply-To: <1216742264.28239.0.camel@localhost>
References: <1216650659.4064.1.camel@localhost> <1216659517.6481.1.camel@localhost>
        <1216679706.18694.6.camel@localhost> <1216689540.23025.0.camel@localhost>
        <1216742264.28239.0.camel@localhost>
Message-ID: 

On Tue, 22 Jul 2008, Mihael Hategan wrote:

> Could be the backoff. Anyway, let me know exactly what configuration
> we're talking about so that I can reproduce it.

build swift
put dist/vdsk-*/bin on path
cd tests/language-behaviour
./run 085-iterate

so using a plain local site with no fancy config.

-- 

From zhaozhang at uchicago.edu  Tue Jul 22 11:33:00 2008
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Tue, 22 Jul 2008 11:33:00 -0500
Subject: [Swift-devel] Question regarding ip.address
Message-ID: <48860BBC.4040101@uchicago.edu>

Hi,

I have a question about the following parameter of swift.
1. Is this feature enabled in the current release of swift?
2. If I specify it as "swift -ip.address 172.17.3.16 first.swift", is that
correct?

Thanks
zhao

ip.address
    Valid values: <an IP address>
    Default value: N/A

    The Globus GRAM service uses a callback mechanism to send notifications
    about the status of submitted jobs. The callback mechanism requires
    that the Swift client be reachable from the hosts the GRAM services are
    running on. Normally, Swift can detect the correct IP address of the
    client machine. However, in certain cases (such as the client machine
    having more than one network interface) the automatic detection
    mechanism is not reliable. In such cases, the IP address of the Swift
    client machine can be specified using this property. The value of this
    property must be a numeric address without quotes.

From bugzilla-daemon at mcs.anl.gov  Tue Jul 22 12:12:23 2008
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Tue, 22 Jul 2008 12:12:23 -0500 (CDT)
Subject: [Swift-devel] [Bug 151] Swift gives null pointer exception
In-Reply-To: 
Message-ID: <20080722171223.1110F1646B@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=151

------- Comment #2 from wilde at mcs.anl.gov  2008-07-22 12:12 -------
It seems that if I remove the .xml and .kml files from previous compiles,
this problem goes away, and I get instead the message that I was trying to
debug originally:

Compile error in foreach statement at line 18:
Compile error in procedure invocation at line 19:
        variable a is not writeable in this scope

(the line numbers here dont match the source code in the URL field here
because Ive changed things since then in the course of debugging).

-- 
Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.
You reported the bug, or are watching the reporter.

From benc at hawaga.org.uk  Tue Jul 22 11:41:20 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 22 Jul 2008 16:41:20 +0000 (GMT)
Subject: [Swift-devel] Question regarding ip.address
In-Reply-To: <48860BBC.4040101@uchicago.edu>
References: <48860BBC.4040101@uchicago.edu>
Message-ID: 

On Tue, 22 Jul 2008, Zhao Zhang wrote:

> I have a question about the following parameter of swift.
> 1. Is this feature enabled in the current release of swift?
> 2. If I specify it as "swift -ip.address 172.17.3.16 first.swift", is that
> correct?

I usually set GLOBUS_HOSTNAME in the environment rather than specifying a
commandline parameter; that works for me regularly. eg:

$ export GLOBUS_HOSTNAME=172.17.3.16

If you're adding in non-Swift stuff like provider-deef, I haven't tested
how that responds to the above configuration.

-- 

From iraicu at cs.uchicago.edu  Tue Jul 22 12:17:43 2008
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Tue, 22 Jul 2008 12:17:43 -0500
Subject: [Swift-devel] Question regarding ip.address
In-Reply-To: 
References: <48860BBC.4040101@uchicago.edu>
Message-ID: <48861637.4080408@cs.uchicago.edu>

Zhao,
You just need a way to pass an IP address to the deef provider... so might
as well do:

export GLOBUS_HOSTNAME=172.17.3.16

and then read the environment variable "GLOBUS_HOSTNAME" in the deef
provider and get the IP address that way. This is assuming that the
-ip.address option in Swift doesn't work as you expected.

Ioan

Ben Clifford wrote:
> I usually set GLOBUS_HOSTNAME in the environment rather than specifying a
> commandline parameter; that works for me regularly. eg:
>
> $ export GLOBUS_HOSTNAME=172.17.3.16
>
> [...]

-- 
===================================================
Ioan Raicu
Ph.D. Candidate
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
       http://dev.globus.org/wiki/Incubator/Falkon
       http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================
===================================================

From hategan at mcs.anl.gov  Tue Jul 22 12:24:59 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 22 Jul 2008 12:24:59 -0500
Subject: [Swift-devel] Question regarding ip.address
In-Reply-To: <48861637.4080408@cs.uchicago.edu>
References: <48860BBC.4040101@uchicago.edu> <48861637.4080408@cs.uchicago.edu>
Message-ID: <1216747499.29974.1.camel@localhost>

On Tue, 2008-07-22 at 12:17 -0500, Ioan Raicu wrote:
> Zhao,
> You just need a way to pass an IP address to the deef provider... so
> might as well do:
> export GLOBUS_HOSTNAME=172.17.3.16
> and then read the environment variable "GLOBUS_HOSTNAME" in the deef
> provider and get the IP address that way. This is assuming that the
> -ip.address option in Swift doesn't work as you expected.
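For concreteness, the lookup order being suggested above (an explicit
GLOBUS_HOSTNAME override, falling back to automatic detection) could be
implemented roughly as follows. This is a sketch of the approach only, not
the deef provider's actual code:

    import java.net.InetAddress;
    import java.net.UnknownHostException;

    // Sketch: prefer an explicit GLOBUS_HOSTNAME, else auto-detect.
    class CallbackAddress {
        static String resolve() throws UnknownHostException {
            String env = System.getenv("GLOBUS_HOSTNAME");
            if (env != null && env.length() > 0) {
                return env;  // explicit override from the environment
            }
            // Auto-detection; unreliable on multi-homed hosts, which is
            // exactly the case the ip.address property exists for.
            return InetAddress.getLocalHost().getHostAddress();
        }
    }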
If you're not using the jglobus libraries to start the notification
service, nor are you explicitly querying CoGProperties, then chances are
that Swift option won't work for you.

From zhaozhang at uchicago.edu  Tue Jul 22 12:25:34 2008
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Tue, 22 Jul 2008 12:25:34 -0500
Subject: [Swift-devel] Question regarding ip.address
In-Reply-To: <48861637.4080408@cs.uchicago.edu>
References: <48860BBC.4040101@uchicago.edu> <48861637.4080408@cs.uchicago.edu>
Message-ID: <4886180E.4030908@uchicago.edu>

Thanks, All

Simply setting GLOBUS_HOSTNAME=172.17.3.16 makes everything work. Next, I
am trying to run Swift with multiple falkon services.

zhao

Ioan Raicu wrote:
> Zhao,
> You just need a way to pass an IP address to the deef provider... so
> might as well do:
> export GLOBUS_HOSTNAME=172.17.3.16
> and then read the environment variable "GLOBUS_HOSTNAME" in the deef
> provider and get the IP address that way. This is assuming that the
> -ip.address option in Swift doesn't work as you expected.
>
> [...]

From bugzilla-daemon at mcs.anl.gov  Tue Jul 22 13:04:41 2008
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Tue, 22 Jul 2008 13:04:41 -0500 (CDT)
Subject: [Swift-devel] [Bug 150] multiple workers on one compute node
In-Reply-To: 
Message-ID: <20080722180441.BFA1816469@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=150

------- Comment #1 from wilde at mcs.anl.gov  2008-07-22 13:04 -------
The following note from TeraGrid Support suggests that we might be able to
set this from GRAM, at least from WS-GRAM:

-------- Original Message --------
Subject: Re: Using > 1 CPU per compute node on Abe under GRAM]
Date: Tue, 22 Jul 2008 12:40:16 -0500
From: consult at ncsa.uiuc.edu
To: Michael Wilde

FROM: Estabrook, John
(Concerning ticket No. 159049)

There is no policy nor setting to prevent this; the quoted info at
http://www.globus.org/toolkit/docs/4.0/execution/wsgram/WS_GRAM_Job_Desc_Extensions.html#r-wsgram-extensions-constructs-nodes
shows how to make the request, noting that it requires a current version
of GRAM.

------------------------
John Estabrook
NCSA Consulting Services
------------------------

Michael Wilde writes:
>Hi Help Team,
>
>Can you forward this to the Abe folks at NCSA?
> >We have a user thats trying to get his single-node GRAM jobs to run up >to 8-per-compute-node (ie one-jobs-per-core) on Abe. > >Is this possible, or prevented by some combination of Abe PBS and GRAM >jobmanager setting? > >If its possible, please let us know what setting to use for WS-GRAM and >pre-WS-GRAM, if you could. > >If its not, not to worry - we have other strategies to turn these into >multi-node jobs. > >There's more info in the thread below. > >Thanks, > >Mike -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is. From zhaozhang at uchicago.edu Tue Jul 22 15:05:25 2008 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Tue, 22 Jul 2008 15:05:25 -0500 Subject: [Swift-devel] Question for running swift with 2 psets on BGP Message-ID: <48863D85.7030609@uchicago.edu> Hi, Mihael I run a sleep_10x512 workload with swift + falkon on BGP. Swift send all tasks to the first pset, but never to the second. The plotted log are at http://www.ci.uchicago.edu/~zzhang/report-sleep-20080722-1456-3fnc42b1/ The sites.xml is at http://www.ci.uchicago.edu/~zzhang/sites.xml And the tc.data is at http://www.ci.uchicago.edu/~zzhang/tc.data Thanks for help zhao From hategan at mcs.anl.gov Tue Jul 22 15:17:34 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 22 Jul 2008 15:17:34 -0500 Subject: [Swift-devel] Question for running swift with 2 psets on BGP In-Reply-To: <48863D85.7030609@uchicago.edu> References: <48863D85.7030609@uchicago.edu> Message-ID: <1216757854.18169.3.camel@localhost> On Tue, 2008-07-22 at 15:05 -0500, Zhao Zhang wrote: > Hi, Mihael > > I run a sleep_10x512 workload with swift + falkon on BGP. Swift send all > tasks to the first pset, but never to the second. > The plotted log are at > http://www.ci.uchicago.edu/~zzhang/report-sleep-20080722-1456-3fnc42b1/ It looks to me that jobs are sent to both sites. I'm looking at the scheduler scores graph and the "execute2 tasks coloured by site" graph. The sites/success table seems broken a bit. It lists more jobs ending than the total number of jobs. I suspect, given that in this case the names of the sites overlap, that the count is done using a substring search which, again, in this case, will always match for the first site (i.e. "bgps" < "bgps1"). > The sites.xml is at http://www.ci.uchicago.edu/~zzhang/sites.xml > And the tc.data is at http://www.ci.uchicago.edu/~zzhang/tc.data > > Thanks for help > > zhao > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Tue Jul 22 15:18:58 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 22 Jul 2008 15:18:58 -0500 Subject: [Swift-devel] Re: scheduler foo In-Reply-To: References: <1216650659.4064.1.camel@localhost> <1216659517.6481.1.camel@localhost> <1216679706.18694.6.camel@localhost> <1216689540.23025.0.camel@localhost> <1216742264.28239.0.camel@localhost> Message-ID: <1216757938.18169.6.camel@localhost> On Tue, 2008-07-22 at 15:56 +0000, Ben Clifford wrote: > On Tue, 22 Jul 2008, Mihael Hategan wrote: > > > Could be the backoff. Anyway let me know exactly what configuration > > we're talking about so that I can reproduce it. > > build swift > put dist/vdsk-*/bin on path > cd tests/language-behaviour > ./run 085-iterate > > so using plain local site with no fancy config. 
A bug in the monitor which prevented the overloaded count from being properly updated. I fixed that and ran all the tests locally, which seems to work. > From iraicu at cs.uchicago.edu Tue Jul 22 15:13:11 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Tue, 22 Jul 2008 15:13:11 -0500 Subject: [Swift-devel] Re: Question for running swift with 2 psets on BGP In-Reply-To: <48863D85.7030609@uchicago.edu> References: <48863D85.7030609@uchicago.edu> Message-ID: <48863F57.6070605@cs.uchicago.edu> Zhao, There should be a way to set a maximum number of tasks to have outstanding at any point in time. You should set this to 256 per site, as we have 256 CPUs on each P-SET. This is a setting from Swift that needs to be set. Perhaps, if this number was set higher (say infinity), Swift might have submitted all 512 tasks to 1 P-SET before the Swift scheduler was able to submit any to the second one. What happens if you make the task lengths 60 seconds, and you send 5120 tasks? Do they all queue up on 1 service? Or do they eventually start load balancing across the two services? Ideally, on the BG/P, with multiple sites (services), you don't want to queue anything up, you just want to send enough tasks to keep all CPUs busy, but no tasks in the queues. Mihael, Ben, where would Zhao set this parameter that will allow us to limit the number of outstanding tasks per site? Ioan Zhao Zhang wrote: > Hi, Mihael > > I run a sleep_10x512 workload with swift + falkon on BGP. Swift send > all tasks to the first pset, but never to the second. > The plotted log are at > http://www.ci.uchicago.edu/~zzhang/report-sleep-20080722-1456-3fnc42b1/ > The sites.xml is at http://www.ci.uchicago.edu/~zzhang/sites.xml > And the tc.data is at http://www.ci.uchicago.edu/~zzhang/tc.data > > Thanks for help > > zhao > -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== From iraicu at cs.uchicago.edu Tue Jul 22 15:20:26 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Tue, 22 Jul 2008 15:20:26 -0500 Subject: [Swift-devel] Question for running swift with 2 psets on BGP In-Reply-To: <1216757854.18169.3.camel@localhost> References: <48863D85.7030609@uchicago.edu> <1216757854.18169.3.camel@localhost> Message-ID: <4886410A.8010406@cs.uchicago.edu> So, are you saying that it did work as expected, and it load balanced across the two sites? Who assigns the names "bgps" and "bgps1"? If its the user (i.e. Zhao :), then should we name these strings better? Zhao, what did Falkon report, did both services see tasks? Ioan Mihael Hategan wrote: > On Tue, 2008-07-22 at 15:05 -0500, Zhao Zhang wrote: > >> Hi, Mihael >> >> I run a sleep_10x512 workload with swift + falkon on BGP. Swift send all >> tasks to the first pset, but never to the second. >> The plotted log are at >> http://www.ci.uchicago.edu/~zzhang/report-sleep-20080722-1456-3fnc42b1/ >> > > It looks to me that jobs are sent to both sites. I'm looking at the > scheduler scores graph and the "execute2 tasks coloured by site" graph. 
> > The sites/success table seems broken a bit. It lists more jobs ending > than the total number of jobs. I suspect, given that in this case the > names of the sites overlap, that the count is done using a substring > search which, again, in this case, will always match for the first site > (i.e. "bgps" < "bgps1"). > > >> The sites.xml is at http://www.ci.uchicago.edu/~zzhang/sites.xml >> And the tc.data is at http://www.ci.uchicago.edu/~zzhang/tc.data >> >> Thanks for help >> >> zhao >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== -------------- next part -------------- An HTML attachment was scrubbed... URL: From zhaozhang at uchicago.edu Tue Jul 22 15:24:30 2008 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Tue, 22 Jul 2008 15:24:30 -0500 Subject: [Swift-devel] Question for running swift with 2 psets on BGP In-Reply-To: <4886410A.8010406@cs.uchicago.edu> References: <48863D85.7030609@uchicago.edu> <1216757854.18169.3.camel@localhost> <4886410A.8010406@cs.uchicago.edu> Message-ID: <488641FE.3050808@uchicago.edu> The services are started at 2 IONs, ion-3 and ion-4 on surveyor. I sent a sleep_10x512 workload. Ideally, each service should receive 256 jobs. But in fact, ion-4 receive 512 jobs, while ion-3 didn't receive anything. Yes, I gave the name "bgps" and "bgps1". I am doing another test to avoid this. zhao Ioan Raicu wrote: > So, are you saying that it did work as expected, and it load balanced > across the two sites? Who assigns the names "bgps" and "bgps1"? If > its the user (i.e. Zhao :), then should we name these strings > better? Zhao, what did Falkon report, did both services see tasks? > > Ioan > > Mihael Hategan wrote: >> On Tue, 2008-07-22 at 15:05 -0500, Zhao Zhang wrote: >> >>> Hi, Mihael >>> >>> I run a sleep_10x512 workload with swift + falkon on BGP. Swift send all >>> tasks to the first pset, but never to the second. >>> The plotted log are at >>> http://www.ci.uchicago.edu/~zzhang/report-sleep-20080722-1456-3fnc42b1/ >>> >> >> It looks to me that jobs are sent to both sites. I'm looking at the >> scheduler scores graph and the "execute2 tasks coloured by site" graph. >> >> The sites/success table seems broken a bit. It lists more jobs ending >> than the total number of jobs. I suspect, given that in this case the >> names of the sites overlap, that the count is done using a substring >> search which, again, in this case, will always match for the first site >> (i.e. "bgps" < "bgps1"). 
>> >> >>> The sites.xml is at http://www.ci.uchicago.edu/~zzhang/sites.xml >>> And the tc.data is at http://www.ci.uchicago.edu/~zzhang/tc.data >>> >>> Thanks for help >>> >>> zhao >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>> >> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> >> > > -- > =================================================== > Ioan Raicu > Ph.D. Candidate > =================================================== > Distributed Systems Laboratory > Computer Science Department > University of Chicago > 1100 E. 58th Street, Ryerson Hall > Chicago, IL 60637 > =================================================== > Email: iraicu at cs.uchicago.edu > Web: http://www.cs.uchicago.edu/~iraicu > http://dev.globus.org/wiki/Incubator/Falkon > http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page > =================================================== > =================================================== > > From hategan at mcs.anl.gov Tue Jul 22 15:28:15 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 22 Jul 2008 15:28:15 -0500 Subject: [Swift-devel] Question for running swift with 2 psets on BGP In-Reply-To: <4886410A.8010406@cs.uchicago.edu> References: <48863D85.7030609@uchicago.edu> <1216757854.18169.3.camel@localhost> <4886410A.8010406@cs.uchicago.edu> Message-ID: <1216758495.25379.1.camel@localhost> On Tue, 2008-07-22 at 15:20 -0500, Ioan Raicu wrote: > So, are you saying that it did work as expected, and it load balanced > across the two sites? Pretty much, yes. > Who assigns the names "bgps" and "bgps1"? In this case it would be Zhao. > If its the user (i.e. Zhao :), then should we name these strings > better? I suppose using bgps1 and bgps2 would prevent the triggering of that bug in the stats scripts. From hategan at mcs.anl.gov Tue Jul 22 15:33:22 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 22 Jul 2008 15:33:22 -0500 Subject: [Swift-devel] Question for running swift with 2 psets on BGP In-Reply-To: <488641FE.3050808@uchicago.edu> References: <48863D85.7030609@uchicago.edu> <1216757854.18169.3.camel@localhost> <4886410A.8010406@cs.uchicago.edu> <488641FE.3050808@uchicago.edu> Message-ID: <1216758802.25379.5.camel@localhost> On Tue, 2008-07-22 at 15:24 -0500, Zhao Zhang wrote: > The services are started at 2 IONs, ion-3 and ion-4 on surveyor. I sent > a sleep_10x512 workload. Ideally, each service should receive 256 jobs. > But in fact, ion-4 receive 512 jobs, while ion-3 didn't receive anything. That's not what swift thinks. May it be that the deef provider doesn't work very well with multiple sites? Perhaps because it caches stuff in a way that makes the single-site assumption? From wilde at mcs.anl.gov Tue Jul 22 15:32:12 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 22 Jul 2008 15:32:12 -0500 Subject: [Swift-devel] Question for running swift with 2 psets on BGP In-Reply-To: <1216758495.25379.1.camel@localhost> References: <48863D85.7030609@uchicago.edu> <1216757854.18169.3.camel@localhost> <4886410A.8010406@cs.uchicago.edu> <1216758495.25379.1.camel@localhost> Message-ID: <488643CC.7060901@mcs.anl.gov> Zhao's renaming the sites to avoid any substring problem. 
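The double-counting Mihael describes is easy to demonstrate. The stats
scripts themselves are not shown in this thread; the standalone sketch
below just shows how a substring test credits every "bgps1" job to "bgps"
as well, so the per-site totals exceed the number of jobs:

    // Substring vs. exact matching over overlapping site names.
    class SiteTally {
        public static void main(String[] args) {
            String[] jobSites = { "bgps", "bgps1", "bgps1" };  // 3 jobs
            for (String site : new String[] { "bgps", "bgps1" }) {
                int substr = 0, exact = 0;
                for (String s : jobSites) {
                    if (s.contains(site)) substr++;  // "bgps1" also matches "bgps"
                    if (s.equals(site)) exact++;
                }
                System.out.println(site + ": substring=" + substr
                        + " exact=" + exact);
            }
        }
    }

The substring totals come to 5 for 3 jobs, which is the "more jobs ending
than the total number of jobs" symptom; non-overlapping names (bgps1/bgps2)
sidestep it, and an exact-match count would fix it outright.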
The current data shows a contradiction, though: the stats plot shows nice balancing between sites, but the falkon logs think all traffic went to one site. If the next run shows same, I think we need to a) look in the Swift log and site selection info and b) double check that the two falkon servers are not logging to (or listening to) the same place. - Mike On 7/22/08 3:28 PM, Mihael Hategan wrote: > On Tue, 2008-07-22 at 15:20 -0500, Ioan Raicu wrote: >> So, are you saying that it did work as expected, and it load balanced >> across the two sites? > > Pretty much, yes. > >> Who assigns the names "bgps" and "bgps1"? > > In this case it would be Zhao. > >> If its the user (i.e. Zhao :), then should we name these strings >> better? > > I suppose using bgps1 and bgps2 would prevent the triggering of that bug > in the stats scripts. > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Tue Jul 22 15:37:09 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 22 Jul 2008 15:37:09 -0500 Subject: [Swift-devel] Question for running swift with 2 psets on BGP In-Reply-To: <488643CC.7060901@mcs.anl.gov> References: <48863D85.7030609@uchicago.edu> <1216757854.18169.3.camel@localhost> <4886410A.8010406@cs.uchicago.edu> <1216758495.25379.1.camel@localhost> <488643CC.7060901@mcs.anl.gov> Message-ID: <1216759029.1607.0.camel@localhost> I'd look here: public static synchronized ResourcePool instance(String server, int num) throws InvalidServiceContactException { if (rp == null) { rp = new ResourcePool(); ... On Tue, 2008-07-22 at 15:32 -0500, Michael Wilde wrote: > Zhao's renaming the sites to avoid any substring problem. > > The current data shows a contradiction, though: the stats plot shows > nice balancing between sites, but the falkon logs think all traffic went > to one site. > > If the next run shows same, I think we need to a) look in the Swift log > and site selection info and b) double check that the two falkon servers > are not logging to (or listening to) the same place. > > - Mike > > > On 7/22/08 3:28 PM, Mihael Hategan wrote: > > On Tue, 2008-07-22 at 15:20 -0500, Ioan Raicu wrote: > >> So, are you saying that it did work as expected, and it load balanced > >> across the two sites? > > > > Pretty much, yes. > > > >> Who assigns the names "bgps" and "bgps1"? > > > > In this case it would be Zhao. > > > >> If its the user (i.e. Zhao :), then should we name these strings > >> better? > > > > I suppose using bgps1 and bgps2 would prevent the triggering of that bug > > in the stats scripts. > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From wilde at mcs.anl.gov Tue Jul 22 15:34:15 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 22 Jul 2008 15:34:15 -0500 Subject: [Swift-devel] Question for running swift with 2 psets on BGP In-Reply-To: <1216758802.25379.5.camel@localhost> References: <48863D85.7030609@uchicago.edu> <1216757854.18169.3.camel@localhost> <4886410A.8010406@cs.uchicago.edu> <488641FE.3050808@uchicago.edu> <1216758802.25379.5.camel@localhost> Message-ID: <48864447.6040401@mcs.anl.gov> On 7/22/08 3:33 PM, Mihael Hategan wrote: > On Tue, 2008-07-22 at 15:24 -0500, Zhao Zhang wrote: >> The services are started at 2 IONs, ion-3 and ion-4 on surveyor. I sent >> a sleep_10x512 workload. 
Ideally, each service should receive 256 jobs. >> But in fact, ion-4 receive 512 jobs, while ion-3 didn't receive anything. > > That's not what swift thinks. > > May it be that the deef provider doesn't work very well with multiple > sites? Perhaps because it caches stuff in a way that makes the > single-site assumption? Thats a very likely cause - we will check. Its likely it keeps its config in a global variable, and both instances are sending to the same server. > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From iraicu at cs.uchicago.edu Tue Jul 22 15:34:48 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Tue, 22 Jul 2008 15:34:48 -0500 Subject: [Swift-devel] Question for running swift with 2 psets on BGP In-Reply-To: <1216758802.25379.5.camel@localhost> References: <48863D85.7030609@uchicago.edu> <1216757854.18169.3.camel@localhost> <4886410A.8010406@cs.uchicago.edu> <488641FE.3050808@uchicago.edu> <1216758802.25379.5.camel@localhost> Message-ID: <48864468.4090909@cs.uchicago.edu> Hmmm... the deef provider needs to create a resource (EPR) on a remote service, and then uses this resource to submit tasks. If this resource is static, then I could see how this would be the problem. Ioan Mihael Hategan wrote: > On Tue, 2008-07-22 at 15:24 -0500, Zhao Zhang wrote: > >> The services are started at 2 IONs, ion-3 and ion-4 on surveyor. I sent >> a sleep_10x512 workload. Ideally, each service should receive 256 jobs. >> But in fact, ion-4 receive 512 jobs, while ion-3 didn't receive anything. >> > > That's not what swift thinks. > > May it be that the deef provider doesn't work very well with multiple > sites? Perhaps because it caches stuff in a way that makes the > single-site assumption? > > > > -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== -------------- next part -------------- An HTML attachment was scrubbed... URL: From iraicu at cs.uchicago.edu Tue Jul 22 15:37:03 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Tue, 22 Jul 2008 15:37:03 -0500 Subject: [Swift-devel] Question for running swift with 2 psets on BGP In-Reply-To: <1216759029.1607.0.camel@localhost> References: <48863D85.7030609@uchicago.edu> <1216757854.18169.3.camel@localhost> <4886410A.8010406@cs.uchicago.edu> <1216758495.25379.1.camel@localhost> <488643CC.7060901@mcs.anl.gov> <1216759029.1607.0.camel@localhost> Message-ID: <488644EF.4070806@cs.uchicago.edu> Right, that the "static" declaration would certainly cause trouble :( I don't recall why we made it static, perhaps its simply old code that was never updated, Zhao, can you simply remove the static declaration, and see if it still compiles, and if yes, try it out? Ioan Mihael Hategan wrote: > I'd look here: > public static synchronized ResourcePool instance(String server, int num) > throws InvalidServiceContactException { > if (rp == null) { > rp = new ResourcePool(); > ... 
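If the cached rp in the fragment quoted above really is a static field,
then the first service contact wins and every later site reuses the same
pool, which would match ion-4 getting all 512 jobs. One way to fix that
without breaking existing static call sites is sketched below; it is only
a sketch against the quoted fragment (the rest of the class is not visible
in this thread) and assumes the no-argument ResourcePool constructor shown
there:

    // Inside ResourcePool -- sketch only: keep the factory static, but
    // cache one pool per service contact instead of a single instance.
    private static final java.util.Map<String, ResourcePool> pools =
            new java.util.HashMap<String, ResourcePool>();

    public static synchronized ResourcePool instance(String server, int num)
            throws InvalidServiceContactException {
        ResourcePool rp = pools.get(server);
        if (rp == null) {
            rp = new ResourcePool();      // one pool per Falkon service
            pools.put(server, rp);
        }
        return rp;
    }

Because the method stays static, any existing calls of the form
ResourcePool.instance(server, 1) keep compiling unchanged.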
> > On Tue, 2008-07-22 at 15:32 -0500, Michael Wilde wrote: > >> Zhao's renaming the sites to avoid any substring problem. >> >> The current data shows a contradiction, though: the stats plot shows >> nice balancing between sites, but the falkon logs think all traffic went >> to one site. >> >> If the next run shows same, I think we need to a) look in the Swift log >> and site selection info and b) double check that the two falkon servers >> are not logging to (or listening to) the same place. >> >> - Mike >> >> >> On 7/22/08 3:28 PM, Mihael Hategan wrote: >> >>> On Tue, 2008-07-22 at 15:20 -0500, Ioan Raicu wrote: >>> >>>> So, are you saying that it did work as expected, and it load balanced >>>> across the two sites? >>>> >>> Pretty much, yes. >>> >>> >>>> Who assigns the names "bgps" and "bgps1"? >>>> >>> In this case it would be Zhao. >>> >>> >>>> If its the user (i.e. Zhao :), then should we name these strings >>>> better? >>>> >>> I suppose using bgps1 and bgps2 would prevent the triggering of that bug >>> in the stats scripts. >>> >>> >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>> > > > -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== -------------- next part -------------- An HTML attachment was scrubbed... URL: From zhaozhang at uchicago.edu Tue Jul 22 16:00:46 2008 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Tue, 22 Jul 2008 16:00:46 -0500 Subject: [Swift-devel] Question for running swift with 2 psets on BGP In-Reply-To: <488644EF.4070806@cs.uchicago.edu> References: <48863D85.7030609@uchicago.edu> <1216757854.18169.3.camel@localhost> <4886410A.8010406@cs.uchicago.edu> <1216758495.25379.1.camel@localhost> <488643CC.7060901@mcs.anl.gov> <1216759029.1607.0.camel@localhost> <488644EF.4070806@cs.uchicago.edu> Message-ID: <48864A7E.70003@uchicago.edu> ok, I removed the "static" in the line Mihael pointed out. And recompile it, got the following. zhao compile: [echo] [provider-deef]: COMPILE [mkdir] Created dir: /gpfs/home/falkon/cog/modules/provider-deef/build [javac] Compiling 8 source files to /gpfs/home/falkon/cog/modules/provider-deef/build [javac] /gpfs/home/falkon/cog/modules/provider-deef/src/org/globus/cog/abstraction/impl/execution/deef/JobSubmissionTaskHandler.java:157: non-static method instance(java.lang.String,int) cannot be referenced from a static context [javac] resourcePool = ResourcePool.instance(server, 1); [javac] ^ [javac] Note: /gpfs/home/falkon/cog/modules/provider-deef/src/org/globus/cog/abstraction/impl/execution/deef/JobSubmissionTaskHandler.java uses or overrides a deprecated API. [javac] Note: Recompile with -deprecation for details. 
[javac] 1 error

Ioan Raicu wrote:
> Right, that the "static" declaration would certainly cause trouble :(
> I don't recall why we made it static, perhaps its simply old code that
> was never updated, Zhao, can you simply remove the static declaration,
> and see if it still compiles, and if yes, try it out?
>
> Ioan
>
> Mihael Hategan wrote:
>> I'd look here:
>> public static synchronized ResourcePool instance(String server, int num)
>>         throws InvalidServiceContactException {
>>     if (rp == null) {
>>         rp = new ResourcePool();
>> ...
>>
>> [...]

From zhaozhang at uchicago.edu  Tue Jul 22 21:04:29 2008
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Tue, 22 Jul 2008 21:04:29 -0500
Subject: [Swift-devel] Is there a way to make swift send 256 tasks at a time
Message-ID: <488691AD.3030309@uchicago.edu>

Hi,

For now, I understand how Swift decides how many tasks to send to one site
at a time:

    let T = 100
    let B = 2.0 * log(T) / pi
    let C = 0.2
    let tscore = e^(B * atan(C * score))
    let number-of-jobs = 1 + (jobThrottle * tscore)

T is the initial score.

I am wondering: is there a way for Swift to send a constant number of jobs,
say 256 at a time, to one site? The reason I am asking is that we could
avoid the slow-start period and thus improve the efficiency.
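Plugging numbers into the formula exactly as quoted above (this is the
arithmetic only, not the scheduler's actual code) shows the slow start
being described: with T = 100 the multiplier tscore saturates near 86, so
with jobThrottle = 3 the cap ramps from a handful of jobs up toward
roughly 260:

    // Numeric check of the quoted formula.
    class ThrottleMath {
        public static void main(String[] args) {
            final double T = 100, B = 2.0 * Math.log(T) / Math.PI, C = 0.2;
            final double jobThrottle = 3.0;
            for (double score : new double[] { 1, 10, 50, 85, 100 }) {
                double tscore = Math.exp(B * Math.atan(C * score));
                double jobs = 1 + jobThrottle * tscore;
                System.out.printf("score=%5.1f  tscore=%6.2f  jobs=%6.1f%n",
                        score, tscore, jobs);
            }
        }
    }

At score = 1 the cap is only about 6 jobs, so a site has to earn a high
score before the cap approaches the 256 wanted from the first wave.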
Thanks best wishes zhangzhao From wilde at mcs.anl.gov Tue Jul 22 22:40:35 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 22 Jul 2008 22:40:35 -0500 Subject: [Swift-devel] Re: Is there a way to make swift send 256 tasks at a time In-Reply-To: <488691AD.3030309@uchicago.edu> References: <488691AD.3030309@uchicago.edu> Message-ID: <4886A833.7080302@mcs.anl.gov> Such a mechanism or setting would be nice. Im not sure about the fancy formula below with logs, atans and pi's, though. Ben or Mihael will need to comment. What I see in the Users Guide is simpler: http://www.ci.uchicago.edu/swift/guides/userguide.php#property.throttle.score.job.factor which says: " The Swift scheduler has the ability to limit the number of concurrent jobs allowed on a site based on the performance history of that site. Each site is assigned a score (initially 1), which can increase or decrease based on whether the site yields successful or faulty job runs. The score for a site can take values in the (0.1, 100) interval. The number of allowed jobs is calculated using the following formula: 2 + score*throttle.score.job.factor This means a site will always be allowed at least two concurrent jobs and at most 2 + 100*throttle.score.job.factor. With a default of 4 this means at least 2 jobs and at most 402. This parameter can also be set per site using the jobThrottle profile key in a site catalog entry." Also sec 13, Profiles, says: "jobThrottle - allows the job throttle factor (see Swift property throttle.score.job.factor) to be set per site. initialScore - allows the initial score for rate limiting and site selection to be set to a value other than 0." So while its not as precise a setting as is desired for this case, you can, I think, set initialScore to 100, thereby eliminating the ramp-up period. But, I suspect you want *exactly* 256 to keep all the cores in each pset busy without over-committing jobs. So what I think the best you can do is start with the throttle at 3, and the score at 85 which will give you 3 * 85 + 2 = 257 jobs. If the throttle creeps up for good behavior, then you'll wind up with a little overcomiting, which isnt bad. It wont go over 302 with a throttle of 3. I suspect we could try a patch that prevents the score from changing. Can you tell from Falkon logs as we ramp up testing of this, whether overcomiting is a problem? I suspect at worst it will exacerbate tail-effects as the workflow is winding down and some resources sit idle while others are overcomited. Lastly, note that the fact that the score can drop might eventually prove useful in handling psets that go bad in the middle of long runs. We dont have enough experience yet to see how frequent that will be at 640 psets, or how such errors wil be manifested. - Mike On 7/22/08 9:04 PM, Zhao Zhang wrote: > Hi, > > For now, I understand how swift decide how many tasks send to one site > at a time: > let T = 100 > let B = 2.0 * log(T) / pi > let C = 0.2 > let tscore = e^(B * atan(C * score)) > let number-of-jobs = 1 + (jobThrottle * tscore) > T is initial score. > > I am wondering that , is there a way for swift to set a constant number > of jobs, say 256 at a time to 1 site? > The reason I am asking this is that we could avoid the slow start > period, thus improve the efficiency. 
Thanks > > best wishes > zhangzhao > From benc at hawaga.org.uk Wed Jul 23 04:42:56 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 23 Jul 2008 09:42:56 +0000 (GMT) Subject: [Swift-devel] Re: Is there a way to make swift send 256 tasks at a time In-Reply-To: <4886A833.7080302@mcs.anl.gov> References: <488691AD.3030309@uchicago.edu> <4886A833.7080302@mcs.anl.gov> Message-ID: On Tue, 22 Jul 2008, Michael Wilde wrote: > I suspect we could try a patch that prevents the score from changing. You could, though settings as you described should keep the system saturated just about right anyway; if something breaks in one pset, then allowing swift to stop using that pset will be a feature not a bug in terms of proper run completion... as you note yourself... > Lastly, note that the fact that the score can drop might eventually prove > useful in handling psets that go bad in the middle of long runs. We dont have > enough experience yet to see how frequent that will be at 640 psets, or how > such errors wil be manifested. So I'd be inclined to not interfere with the code; you can use jobThrottle and initialScore to say you want about 256 jobs and to start at full steam, and I think that should suffice. -- From iraicu at cs.uchicago.edu Wed Jul 23 09:12:56 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Wed, 23 Jul 2008 09:12:56 -0500 Subject: [Swift-devel] Re: Is there a way to make swift send 256 tasks at a time In-Reply-To: <4886A833.7080302@mcs.anl.gov> References: <488691AD.3030309@uchicago.edu> <4886A833.7080302@mcs.anl.gov> Message-ID: <48873C68.4030309@cs.uchicago.edu> Michael Wilde wrote: > ... > > But, I suspect you want *exactly* 256 to keep all the cores in each > pset busy without over-committing jobs. So what I think the best you > can do is start with the throttle at 3, and the score at 85 which will > give you 3 * 85 + 2 = 257 jobs. If the throttle creeps up for good > behavior, then you'll wind up with a little overcomiting, which isnt > bad. It wont go over 302 with a throttle of 3. Can the throttle.score.job.factor be set to a float? Then setting it to 2.54 will give you exactly 256, assuming all tasks come back successful. > > I suspect we could try a patch that prevents the score from changing. > > Can you tell from Falkon logs as we ramp up testing of this, whether > overcomiting is a problem? I suspect at worst it will exacerbate > tail-effects as the workflow is winding down and some resources sit > idle while others are overcomited. Yes, it is the tail end of the runs that will be most affected. > > Lastly, note that the fact that the score can drop might eventually > prove useful in handling psets that go bad in the middle of long runs. Right! Ioan -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 
From iraicu at cs.uchicago.edu Wed Jul 23 09:12:56 2008
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Wed, 23 Jul 2008 09:12:56 -0500
Subject: [Swift-devel] Re: Is there a way to make swift send 256 tasks at a time
In-Reply-To: <4886A833.7080302@mcs.anl.gov>
References: <488691AD.3030309@uchicago.edu> <4886A833.7080302@mcs.anl.gov>
Message-ID: <48873C68.4030309@cs.uchicago.edu>

Michael Wilde wrote:
> ...
>
> But, I suspect you want *exactly* 256 to keep all the cores in each
> pset busy without over-committing jobs. So I think the best you can do
> is start with the throttle at 3 and the score at 85, which will give
> you 3 * 85 + 2 = 257 jobs. If the throttle creeps up for good
> behavior, then you'll wind up with a little overcommitting, which
> isn't bad. It won't go over 302 with a throttle of 3.

Can the throttle.score.job.factor be set to a float? Then setting it to
2.54 will give you exactly 256, assuming all tasks come back successful.

> I suspect we could try a patch that prevents the score from changing.
>
> Can you tell from Falkon logs, as we ramp up testing of this, whether
> overcommitting is a problem? I suspect at worst it will exacerbate
> tail-effects as the workflow is winding down and some resources sit
> idle while others are overcommitted.

Yes, it is the tail end of the runs that will be most affected.

> Lastly, note that the fact that the score can drop might eventually
> prove useful in handling psets that go bad in the middle of long runs.

Right!
Ioan

--
===================================================
Ioan Raicu
Ph.D. Candidate
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
       http://dev.globus.org/wiki/Incubator/Falkon
       http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================
===================================================

From wilde at mcs.anl.gov Wed Jul 23 09:36:12 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 23 Jul 2008 09:36:12 -0500
Subject: [Swift-devel] Re: Is there a way to make swift send 256 tasks at a time
In-Reply-To: <48873C68.4030309@cs.uchicago.edu>
References: <488691AD.3030309@uchicago.edu> <4886A833.7080302@mcs.anl.gov> <48873C68.4030309@cs.uchicago.edu>
Message-ID: <488741DC.50106@mcs.anl.gov>

Let's experiment and measure with the current settable properties before
we consider any changes.

Note that we already have a tail effect on most workloads; perhaps the
Swift dynamics will have little effect on that. They might even have a
positive effect by biasing work towards psets that are returning work
faster (due to the fact that they wound up with shorter jobs). I'm not
sure if that happens at the moment.

In real usage, users may request longer provisions and take advantage of
free cycles in the "tails" for real work. And/or they may set
provisioned resources to free up for other users as their own usage
tails off.

- Mike

On 7/23/08 9:12 AM, Ioan Raicu wrote:
> [...]

From zhaozhang at uchicago.edu Thu Jul 24 01:19:14 2008
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Thu, 24 Jul 2008 01:19:14 -0500
Subject: [Swift-devel] 2 rack run with swift
Message-ID: <48881EE2.2060206@uchicago.edu>

Hi, All

I just made a swift run of 16384 sleep_30 tasks on 2 racks, which are
8192 cores. The log is at
http://www.ci.uchicago.edu/~zzhang/report-sleep-20080724-0030-3zbv20j6/

Tomorrow, I will try to make a mars run with swift.

zhao

From wilde at mcs.anl.gov Thu Jul 24 07:57:17 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 24 Jul 2008 07:57:17 -0500
Subject: [Swift-devel] Re: 2 rack run with swift
In-Reply-To: <48881EE2.2060206@uchicago.edu>
References: <48881EE2.2060206@uchicago.edu>
Message-ID: <48887C2D.40503@mcs.anl.gov>

Thanks, Zhao,

This is a great initial snapshot of performance on the new BG/P Falkon
server mechanism (1 server per pset).
It's also the largest Swift run to date I know of in terms of "sites"
(32) and processors used (8192).

From a quick scan of the plots, it seems like we have some tuning to do:

The ideal time for this run would be 120 seconds. It took 600 seconds.
That's in fact "not bad at all" for a first attempt at this scale, and
very reasonable if the job length were longer. 16K jobs in 10 minutes is
pretty good. The nearest real-world Falkon-only run I can compare to is
the 15Kx9 DOCK run, which did 138K jobs in 40 minutes. This run
performed at somewhat under half that rate.

I suspect that the main bottleneck this run is hitting is creation of
job directories on the BGP. As we learned in the past few months of
Falkon-only runs, creation of filesystem objects on GPFS is very
expensive, and creation of two objects within the same parent directory
by > 1 host is extremely expensive in locking contention.

I *think* the plots bear this out, but need more assessment.

I'd like to start by writing down a detailed description of the runtime
file environment and management logic (i.e. job setup by swift and file
management by wrapper.sh). Then look to see which of the options Ben
provided when we last did this, in March, were properly enabled. (Some
may still be un-applied test patches). Then turn on some of the timing
metrics in wrapper.sh to see where time is spent.

I also see that job distribution among servers is pretty good - ranging
from 490 to 600 jobs, but for the most part staying within 10 jobs of
the ideal, 512.

I can't work on this today till our Swift report is done, but can then
turn to it. Ben, once you're done with the SA Grid School, we could use
your help on this. Mihael, as well, if you're interested and able to
help.

For now, I think we know a few steps we can take to measure and improve
things.

- Mike

On 7/24/08 1:19 AM, Zhao Zhang wrote:
> Hi, All
>
> I just made a swift run of 16384 sleep_30 tasks on 2 racks, which are
> 8192 cores. The log is at
> http://www.ci.uchicago.edu/~zzhang/report-sleep-20080724-0030-3zbv20j6/
>
> Tomorrow, I will try to make a mars run with swift.
>
> zhao
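[The directory-creation cost is easy to probe directly. A rough sketch
of such a probe follows; the parent path and iteration count are
assumptions, and this is not part of any Swift run. The idea is to run
the same loop from several nodes at once against one shared parent
directory and compare the elapsed times against a one-node baseline.]

#!/bin/bash
# Crude probe of create contention in a single GPFS parent directory.
# PARENT is an assumed path; run this concurrently from N nodes and
# compare against running it from a single node.
PARENT=/gpfs/scratch/swiftwork/contention-test
mkdir -p "$PARENT"
time for i in $(seq 1 100); do
    mkdir "$PARENT/$(hostname)-$i"   # one filesystem create per pass
done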
From iraicu at cs.uchicago.edu Thu Jul 24 09:24:32 2008
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Thu, 24 Jul 2008 09:24:32 -0500
Subject: [Swift-devel] Re: 2 rack run with swift
In-Reply-To: <48887C2D.40503@mcs.anl.gov>
References: <48881EE2.2060206@uchicago.edu> <48887C2D.40503@mcs.anl.gov>
Message-ID: <488890A0.7010904@cs.uchicago.edu>

Hi,
I did a similar run through Falkon only, and got:

Number of Tasks: 16384
Task Duration: 30 sec
Average Task Execution Time (from Client point of view): 31.851 sec
Number of CPUs: 8192
Startup: 5.185 sec
Execute: 80.656 sec
Ideal time: 60 sec

Swift took some 600 seconds, and had an average per task run time of
240.97 sec. Zhao, was Swift patched up with Ben's 3 patches from
April/May? I am curious what would happen if we throw 256 second tasks
through Swift, at the same 2 rack scale?
Ioan

Michael Wilde wrote:
> Thanks, Zhao,
> [...]

From wilde at mcs.anl.gov Thu Jul 24 09:50:30 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 24 Jul 2008 09:50:30 -0500
Subject: [Swift-devel] Re: 2 rack run with swift
In-Reply-To: <488890A0.7010904@cs.uchicago.edu>
References: <48881EE2.2060206@uchicago.edu> <48887C2D.40503@mcs.anl.gov> <488890A0.7010904@cs.uchicago.edu>
Message-ID: <488896B6.7020206@mcs.anl.gov>

On 7/24/08 9:24 AM, Ioan Raicu wrote:
> Hi,
> I did a similar run through Falkon only, and got: [...]
>
> Swift took some 600 seconds, and had an average per task run time of
> 240.97 sec. Zhao, was Swift patched up with Ben's 3 patches from
> April/May?

No. But one or more of those patches may have been integrated into the
source. Still needs to be enabled. We'll look into this more in the next
few days.
I don't want to spend much time discussing this, though, until we have a
chance to sort through all the issues we already know about: scheduler
parameters, data management, wrapper script settings and patches, GPFS
issues.

Tests at longer job durations are worth doing.

- Mike

> I am curious what would happen if we throw 256 second tasks
> through Swift, at the same 2 rack scale?
> Ioan
>
> [...]

From benc at hawaga.org.uk Thu Jul 24 11:31:16 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 24 Jul 2008 16:31:16 +0000 (GMT)
Subject: [Swift-devel] Re: 2 rack run with swift
In-Reply-To: <48887C2D.40503@mcs.anl.gov>
References: <48881EE2.2060206@uchicago.edu> <48887C2D.40503@mcs.anl.gov>
Message-ID: 

On Thu, 24 Jul 2008, Michael Wilde wrote:

> I *think* the plots bear this out, but need more assessment.

The raw info that is useful for debugging is the -info wrapper log
files, which you'll need to turn on explicitly:

> Then turn on some of the timing metrics in wrapper.sh to see
> where time is spent.
Some of the patches I sent ended up in very different form in the main codebase - for example, the worker-node local job directories. But for any of those runs it would be useful to collect all the wrapper (-info) logs to look at the timings. -- From skenny at uchicago.edu Thu Jul 24 16:50:06 2008 From: skenny at uchicago.edu (skenny at uchicago.edu) Date: Thu, 24 Jul 2008 16:50:06 -0500 (CDT) Subject: [Swift-devel] mystery runs on ucanl & ncsa--warning very long email, sorry! Message-ID: <20080724165006.BIE62980@m4500-02.uchicago.edu> so we've had some odd behavior on a big run recently and having some trouble figuring out exactly what's going on here. it's also worth mentioning that we've had other successful runs with these settings on these same sites. first, tried running on ncsa: 1 2 /usr/projects/tg-community/SIDGrid/sidgrid_out/{username} and then after failing/killing the run was resumed on ucanl64: 1 2 ia64-compute /scratch/gpfs/local/sidgrid_out/{username} the workflow appears ok at first. however we would then get some failures; the retries of the failed jobs that swift submits appeared to work but the failures were keeping the run from ramping up. and eventually andric killed the run bcs there were so many errors and so few jobs running at once (though no clear indication of why). also, on ucanl, even when we kill the workflow the jobs not only remain in the queue but i can't kill them at all even when i own them (ti's looking into this i believe). the log file is pretty long so rather than attach i've put everything from the run here on the ci network: /home/skenny/andric/permFriedman_run2 the individual jobs are given a 300min wallclock limit and generally take about an hour. finally, when jobs fail and/or exceed wallclock on ucanl i get an email from the pbs scheduler. in this case i get the following: PBS Job Id: 1759715.tg-master.uc.teragrid.org Job Name: STDIN Exec host: tg-c054/0 Aborted by PBS Server Job cannot be executed See Administrator for help finally, our big ugly tc.data file can be seen here if that's of use: https://svn.ci.uchicago.edu/svn/vdl2/SwiftApps/SIDGrid/config/tc.data sorry this email is so lengthy! just wanted to give you guys a full picture of what we're seeing. i'm open to any ideas, no matter how outlandish or hacky :) to try and get these running properly. thanks!! sarah From hategan at mcs.anl.gov Thu Jul 24 17:17:14 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 24 Jul 2008 17:17:14 -0500 Subject: [Swift-devel] mystery runs on ucanl & ncsa--warning very long email, sorry! In-Reply-To: <20080724165006.BIE62980@m4500-02.uchicago.edu> References: <20080724165006.BIE62980@m4500-02.uchicago.edu> Message-ID: <1216937834.24122.5.camel@localhost> Strange. It looks like the wrapper script never gets to execute on UCANL. Do you have the logs from the first run? Is On Thu, 2008-07-24 at 16:50 -0500, skenny at uchicago.edu wrote: > so we've had some odd behavior on a big run recently and > having some trouble figuring out exactly what's going on here. > it's also worth mentioning that we've had other successful > runs with these settings on these same sites. 
> [...]

From skenny at uchicago.edu Thu Jul 24 17:21:47 2008
From: skenny at uchicago.edu (skenny at uchicago.edu)
Date: Thu, 24 Jul 2008 17:21:47 -0500 (CDT)
Subject: [Swift-devel] mystery runs on ucanl & ncsa--warning very long email, sorry!
Message-ID: <20080724172147.BIE67389@m4500-02.uchicago.edu>

hmm, i think that's the only log we have from this most recent run.
however, we saw the same behavior on another run to ncsa the week
before. the log is here:

/home/skenny/andric/permFriedman_logs/permFriedman1001-20080702-2300-3r702ylc.log

---- Original message ----
>Date: Thu, 24 Jul 2008 17:17:14 -0500
>From: Mihael Hategan
>Subject: Re: [Swift-devel] mystery runs on ucanl & ncsa--warning very long email, sorry!
>To: skenny at uchicago.edu
>Cc: swift-devel at ci.uchicago.edu, andric
>
>Strange. It looks like the wrapper script never gets to execute on
>UCANL.
>
>Do you have the logs from the first run?
>
>[...]
From hategan at mcs.anl.gov Thu Jul 24 17:30:12 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 24 Jul 2008 17:30:12 -0500
Subject: [Swift-devel] mystery runs on ucanl & ncsa--warning very long email, sorry!
In-Reply-To: <20080724172147.BIE67389@m4500-02.uchicago.edu>
References: <20080724172147.BIE67389@m4500-02.uchicago.edu>
Message-ID: <1216938612.24999.1.camel@localhost>

On Thu, 2008-07-24 at 17:21 -0500, skenny at uchicago.edu wrote:
> hmm, i think that's the only log we have from this most recent
> run. however, we saw the same behavior on another run to ncsa
> the week before. the log is here:
>
> /home/skenny/andric/permFriedman_logs/permFriedman1001-20080702-2300-3r702ylc.log

Not the same. In this case it seems like WS-GRAM requests are getting a
connection reset. I'm not really sure what could cause that, but it's
somewhere at the TCP level.

Anyway, can you run manual jobs on UCANL?

> [...]
From skenny at uchicago.edu Thu Jul 24 17:32:55 2008
From: skenny at uchicago.edu (skenny at uchicago.edu)
Date: Thu, 24 Jul 2008 17:32:55 -0500 (CDT)
Subject: [Swift-devel] mystery runs on ucanl & ncsa--warning very long email, sorry!
Message-ID: <20080724173255.BIE68384@m4500-02.uchicago.edu>

yes (see below) and SOME of the jobs in the workflow do complete when we
submit the whole workflow to ucanl. unfortunately i can't test anything
on ncsa right now 'cause it's down.

[skenny at gwynn mediator]$ globusrun-ws -submit -s -F tg-grid1.uc.teragrid.org -Ft PBS -job-command /bin/hostname
Delegating user credentials...Done.
Submitting job...Done.
Job ID: uuid:3fecbd58-59d0-11dd-8cd1-0019d1912789
Termination time: 07/25/2008 22:31 GMT
Current job state: Pending
Current job state: Active
----------------------------------------
Begin PBS Prologue Thu Jul 24 17:31:26 CDT 2008
Job ID:    1759742.tg-master.uc.teragrid.org
Username:  sidgrid
Group:     allocate
Nodes:     tg-v086
End PBS Prologue Thu Jul 24 17:31:26 CDT 2008
----------------------------------------
tg-v086.uc.teragrid.org
----------------------------------------
Begin PBS Epilogue Thu Jul 24 17:31:29 CDT 2008
Job ID:    1759742.tg-master.uc.teragrid.org
Username:  sidgrid
Group:     allocate
Job Name:  STDIN
Session:   12326
Limits:    nodes=1,walltime=00:15:00
Resources: cput=00:00:00,mem=0kb,vmem=0kb,walltime=00:00:01
Nodes:     tg-v086
End PBS Epilogue Thu Jul 24 17:31:29 CDT 2008
----------------------------------------
Current job state: CleanUp-Hold
Current job state: CleanUp
Current job state: Done
Destroying job...Done.
Cleaning up any delegated credentials...Done.
[skenny at gwynn mediator]$

---- Original message ----
>Date: Thu, 24 Jul 2008 17:30:12 -0500
>From: Mihael Hategan
>Subject: Re: [Swift-devel] mystery runs on ucanl & ncsa--warning very long email, sorry!
>
>[...]
From hategan at mcs.anl.gov Thu Jul 24 17:42:37 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 24 Jul 2008 17:42:37 -0500
Subject: [Swift-devel] mystery runs on ucanl & ncsa--warning very long email, sorry!
In-Reply-To: <20080724173255.BIE68384@m4500-02.uchicago.edu>
References: <20080724173255.BIE68384@m4500-02.uchicago.edu>
Message-ID: <1216939357.25721.4.camel@localhost>

On Thu, 2008-07-24 at 17:32 -0500, skenny at uchicago.edu wrote:
> yes (see below) and SOME of the jobs in the workflow do
> complete when we submit the whole workflow to ucanl.

Indeed. It seems like roughly half of them work and the other half
break. Could this be an ia32/ia64 issue? Like python being compiled for
the wrong platform or something?

> unfortunately i can't test anything on ncsa right now 'cause
> it's down.

It being down would generally prevent swift from being able to run jobs
there. Which is probably what happened the week before.

From skenny at uchicago.edu Thu Jul 24 17:49:40 2008
From: skenny at uchicago.edu (skenny at uchicago.edu)
Date: Thu, 24 Jul 2008 17:49:40 -0500 (CDT)
Subject: [Swift-devel] mystery runs on ucanl & ncsa--warning very long email, sorry!
Message-ID: <20080724174940.BIE69697@m4500-02.uchicago.edu>

>On Thu, 2008-07-24 at 17:32 -0500, skenny at uchicago.edu wrote:
>> yes (see below) and SOME of the jobs in the workflow do
>> complete when we submit the whole workflow to ucanl.
>
>Indeed. It seems like roughly half of them work and the other half
>break. Could this be an ia32/ia64 issue? Like python being compiled for
>the wrong platform or something?

hmm, not quite sure i follow, since we're only sending to ia64 on this
run...how can i test?

>> unfortunately i can't test anything on ncsa right now 'cause
>> it's down.
>
>It being down would generally prevent swift from being able to run jobs
>there.

ha ha, what swift can't run jobs on a site that's down?
lame! heh, actually we've had a couple of runs now where we see the
behavior i described on ncsa--e.g.
a few jobs completing but some failing and an eventual decline. though,
it's true the site's been up and down quite a bit over the past few
weeks so could be indicative of something else wrong entirely.
incidentally, i told them a couple weeks ago i was having trouble
submitting to gram4 so we switched back to gram2 and it *seemed* to be
working...for a while.

well, we're trying on yet another site now so if we see more of the same
we'll know we need to do *something* on our end.

thanks
sarah

From hategan at mcs.anl.gov Thu Jul 24 18:02:01 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 24 Jul 2008 18:02:01 -0500
Subject: [Swift-devel] mystery runs on ucanl & ncsa--warning very long email, sorry!
In-Reply-To: <20080724174940.BIE69697@m4500-02.uchicago.edu>
References: <20080724174940.BIE69697@m4500-02.uchicago.edu>
Message-ID: <1216940521.26673.6.camel@localhost>

On Thu, 2008-07-24 at 17:49 -0500, skenny at uchicago.edu wrote:
> hmm, not quite sure i follow, since we're only sending to ia64
> on this run...how can i test?

Although it would be bash failing, since we don't get to the wrapper
script. I'm thinking instead of /bin/hostname you could try
/bin/bash -c /bin/hostname. Repeatedly. With globusrun-ws.

> ha ha, what swift can't run jobs on a site that's down?

As strange as it may sound, it can't.

> lame! heh, actually we've had a couple of runs now where we
> see the behavior i described on ncsa--e.g. a few jobs
> completing but some failing and an eventual decline. [...]
>
> well, we're trying on yet another site now so if we see more
> of the same we'll know we need to do *something* on our end.

May I (again) suggest not storing all the eggs in one basket if eggs are
the only food you can have for lunch?
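[A repeated test along the lines Mihael suggests could be scripted; a
rough sketch follows. The loop count is an arbitrary assumption, and the
factory host and type are copied from Sarah's transcript above.]

#!/bin/bash
# Resubmit the same trivial job many times and report which runs fail,
# per Mihael's suggestion of repeating /bin/bash -c /bin/hostname.
for i in $(seq 1 20); do
    globusrun-ws -submit -s -F tg-grid1.uc.teragrid.org -Ft PBS \
        -job-command /bin/bash -c /bin/hostname \
        || echo "run $i failed"
done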
From hategan at mcs.anl.gov Thu Jul 24 18:07:35 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 24 Jul 2008 18:07:35 -0500
Subject: [Swift-devel] mystery runs on ucanl & ncsa--warning very long email, sorry!
In-Reply-To: 
References: <20080724174940.BIE69697@m4500-02.uchicago.edu>
Message-ID: <1216940855.26861.2.camel@localhost>

On Thu, 2008-07-24 at 17:57 -0500, Michael Andric wrote:
> it's ucanl (not ncsa) that has been completing a few and declining,

Yes. I got that part.

> e.g.
>
> Progress: Initializing:73 Selecting site:6922 Executing:5 [...]
> Progress: Initializing:73 Selecting site:6916 Executing:5 Finished
> successfully:5 Failed but can retry:1 [...]

Seems time dependent rather than node dependent. Maybe something
happened to it.

> on ncsa, it seems recently to either all-out work or not work.
> yesterday i got 73 jobs 'Finished successfully' on there and then it
> just hung, so i killed it (after letting it hang for a few hours).
> today, i couldn't get it to even start executing (re: the site is
> down).
>
> and this 'new site', it's been sitting at:
>
> Progress: Selecting site:6994 Executing:6
> Progress: Selecting site:6994 Executing:6
> Progress: Selecting site:6994 Executing:6
>
> since 2pm this afternoon, still with nothing finished, no errors, no
> indication of what's going on...
> woo grid computing!

Can you give me more details about "new site"?

From skenny at uchicago.edu Thu Jul 24 18:20:37 2008
From: skenny at uchicago.edu (skenny at uchicago.edu)
Date: Thu, 24 Jul 2008 18:20:37 -0500 (CDT)
Subject: [Swift-devel] mystery runs on ucanl & ncsa--warning very long email, sorry!
Message-ID: <20080724182037.BIE72202@m4500-02.uchicago.edu>

> >Can you give me more details about "new site"?

no, it's a secret ;)

heh, it's bigred at IU, his jobs are sitting in the q. i haven't had a
chance to peek and see if it's legitimately a long q or if something
else is up. it IS very full, but...my job this morning went thru in
<1hr...

From hategan at mcs.anl.gov Thu Jul 24 18:30:12 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 24 Jul 2008 18:30:12 -0500
Subject: [Swift-devel] mystery runs on ucanl & ncsa--warning very long email, sorry!
In-Reply-To: <20080724182037.BIE72202@m4500-02.uchicago.edu>
References: <20080724182037.BIE72202@m4500-02.uchicago.edu>
Message-ID: <1216942212.27366.2.camel@localhost>

On Thu, 2008-07-24 at 18:20 -0500, skenny at uchicago.edu wrote:
> heh, it's bigred at IU, his jobs are sitting in the q. i
> haven't had a chance to peek and see if it's legitimately a
> long q or if something else is up. it IS very full,

Lack of free nodes will probably also prevent Swift from running jobs
there.

From skenny at uchicago.edu Thu Jul 24 18:30:35 2008
From: skenny at uchicago.edu (skenny at uchicago.edu)
Date: Thu, 24 Jul 2008 18:30:35 -0500 (CDT)
Subject: [Swift-devel] mystery runs on ucanl & ncsa--warning very long email, sorry!
Message-ID: <20080724183035.BIE72862@m4500-02.uchicago.edu>

---- Original message ----
>Date: Thu, 24 Jul 2008 18:30:12 -0500
>From: Mihael Hategan
>Subject: Re: [Swift-devel] mystery runs on ucanl & ncsa--warning very long email, sorry!
>
>[...]
>
>Lack of free nodes will probably also prevent Swift from running jobs
>there.

oh, i thought swift could generate nodes magically ;) what i meant was,
we had an issue on ranger where things sat in the q indefinitely and it
turned out to be the redirection of stdout (which you helped me solve)
that was holding it up and not the fact that the q was very full (though
it was).

From hategan at mcs.anl.gov Thu Jul 24 18:39:01 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 24 Jul 2008 18:39:01 -0500
Subject: [Swift-devel] mystery runs on ucanl & ncsa--warning very long email, sorry!
In-Reply-To: <20080724183035.BIE72862@m4500-02.uchicago.edu>
References: <20080724183035.BIE72862@m4500-02.uchicago.edu>
Message-ID: <1216942741.27692.2.camel@localhost>

On Thu, 2008-07-24 at 18:30 -0500, skenny at uchicago.edu wrote:
> oh, i thought swift could generate nodes magically ;) what i
> meant was, we had an issue on ranger where things sat in the q
> indefinitely and it turned out to be the redirection of stdout
> (which you helped me solve) that was holding it up and not the
> fact that the q was very full (though it was).

I know. Sorry. I liked the symmetry :)

Though I suspect, given that bigred does not use SGE, that this isn't
the case here.

From zhaozhang at uchicago.edu Fri Jul 25 01:01:02 2008
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Fri, 25 Jul 2008 01:01:02 -0500
Subject: [Swift-devel] Progress Report of Swift on BGP
Message-ID: <48896C1E.50604@uchicago.edu>

Hi, All

I made 2 small changes to wrapper.sh. For now, instead of creating
everything on GPFS, wrapper.sh only creates the id.success file on GPFS,
leaving the others on the CN's ramdisk. For improvements you could
compare
http://www.ci.uchicago.edu/~zzhang/report-sleep-20080724-0030-3zbv20j6/
and
http://www.ci.uchicago.edu/~zzhang/report-sleep-20080724-1816-x10hf3s1/
Those two logs are both sleep_30x16384 on 8192 cores.

I took stats from 1 pset as an example:

OLD wrapper.sh
=== execute2 duration statistics for bgp0 ===
Total number of events: 501
Shortest event (s): 114.388999938965
Longest event (s): 356.953999996185
Total duration of all events (s): 118968.562002659
Mean event duration (s): 237.462199606105
Standard deviation of event duration (s): 53.4422013666663
Maximum number of events at one time: 311

NEW wrapper.sh
=== execute2 duration statistics for bgp000 ===
Total number of events: 519
Shortest event (s): 52.1200001239777
Longest event (s): 251.282999992371
Total duration of all events (s): 68348.8350019455
Mean event duration (s): 131.693323703171
Standard deviation of event duration (s): 39.0439215239727
Maximum number of events at one time: 321

Comparing the mean event duration, we could see a 2X improvement.

Plus, I did a sleep_600 x 16384 test on 8192 cores with this new
wrapper.sh. Things are perfect. The log is at
http://www.ci.uchicago.edu/~zzhang/report-sleep-20080724-2357-4z97cn3f/
I also paste stats of 1 pset here:

=== execute2 duration statistics for bgp000 ===
Total number of events: 518
Shortest event (s): 601.161000013351
Longest event (s): 1103.05599999428
Total duration of all events (s): 351741.216003895
Mean event duration (s): 679.037096532615
Standard deviation of event duration (s): 114.593631279296
Maximum number of events at one time: 293

The average event duration is 679 seconds, thus we have an 88%
efficiency (600/679).

This is all my work about SWIFT on BGP today.

best wishes
zhangzhao
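[The shape of that change, as a minimal sketch rather than Zhao's actual
patch; the paths, variable names, and marker-file layout are all
assumptions.]

#!/bin/bash
# Sketch: do per-job work in the compute node's ramdisk and touch only
# the per-job success/failure marker on GPFS. Paths are assumptions.
ID=$1; shift
JOBDIR=/dev/shm/swift/$ID          # CN ramdisk job directory
STATUS=/gpfs/swiftwork/status      # shared status directory on GPFS

mkdir -p "$JOBDIR"
cd "$JOBDIR"
if "$@" >stdout.txt 2>stderr.txt; then
    touch "$STATUS/$ID.success"    # the single per-job GPFS create
else
    touch "$STATUS/$ID.error"
fi

[Whether the success marker alone is enough is exactly Ben's question
later in this digest about where the output files end up.]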
From mjandric at gmail.com Thu Jul 24 17:57:42 2008
From: mjandric at gmail.com (Michael Andric)
Date: Thu, 24 Jul 2008 17:57:42 -0500
Subject: [Swift-devel] mystery runs on ucanl & ncsa--warning very long email, sorry!
In-Reply-To: <20080724174940.BIE69697@m4500-02.uchicago.edu>
References: <20080724174940.BIE69697@m4500-02.uchicago.edu>
Message-ID: 

it's ucanl (not ncsa) that has been completing a few and declining, e.g.

Progress: Initializing:73 Selecting site:6922 Executing:5
Mediator completed
Progress: Initializing:73 Selecting site:6922 Stage out:4 Finished successfully:1
Mediator completed
Mediator completed
Mediator completed
Mediator completed
Progress: Initializing:73 Selecting site:6916 Executing:5 Finished successfully:5 Failed but can retry:1
Failed to transfer wrapper log from PermFriedman-20080724-1033-7eg450y8/info/z/ANLUCTERAGRID64
Failed to transfer wrapper log from PermFriedman-20080724-1033-7eg450y8/info/1/ANLUCTERAGRID64
Failed to transfer wrapper log from PermFriedman-20080724-1033-7eg450y8/info/3/ANLUCTERAGRID64
Progress: Initializing:73 Selecting site:6918 Executing:2 Finished successfully:5 Failed but can retry:2
Failed to transfer wrapper log from PermFriedman-20080724-1033-7eg450y8/info/2/ANLUCTERAGRID64
Progress: Initializing:73 Selecting site:6919 Executing:2 Finished successfully:5 Failed but can retry:1
Failed to transfer wrapper log from PermFriedman-20080724-1033-7eg450y8/info/9/ANLUCTERAGRID64
Progress: Initializing:73 Selecting site:6919 Executing:2 Finished successfully:5 Failed but can retry:1
Failed to transfer wrapper log from PermFriedman-20080724-1033-7eg450y8/info/b/ANLUCTERAGRID64
Progress: Initializing:73 Selecting site:6919 Executing:3 Finished successfully:5
Failed to transfer wrapper log from PermFriedman-20080724-1033-7eg450y8/info/d/ANLUCTERAGRID64
Progress: Initializing:73 Selecting site:6919 Executing:2 Finished successfully:5 Failed but can retry:1
Failed to transfer wrapper log from PermFriedman-20080724-1033-7eg450y8/info/f/ANLUCTERAGRID64
Progress: Initializing:73 Selecting site:6919 Executing:2 Finished successfully:5 Failed but can retry:1
Failed to transfer wrapper log from PermFriedman-20080724-1033-7eg450y8/info/h/ANLUCTERAGRID64
Progress: Initializing:73 Selecting site:6919 Executing:2 Finished successfully:5 Failed but can retry:1
Failed to transfer wrapper log from PermFriedman-20080724-1033-7eg450y8/info/j/ANLUCTERAGRID64
Progress: Initializing:73 Selecting site:6919 Executing:3 Finished successfully:5
Progress: Initializing:73 Selecting site:6919 Executing:3 Finished successfully:5

on ncsa, it seems recently to either all-out work or not work. yesterday
i got 73 jobs 'Finished successfully' on there and then it just hung, so
i killed it (after letting it hang for a few hours). today, i couldn't
get it to even start executing (re: the site is down).

and this 'new site', it's been sitting at:

Progress: Selecting site:6994 Executing:6
Progress: Selecting site:6994 Executing:6
Progress: Selecting site:6994 Executing:6

since 2pm this afternoon, still with nothing finished, no errors, no
indication of what's going on...
woo grid computing!

On Thu, Jul 24, 2008 at 5:49 PM, wrote:
> [...]
From benc at hawaga.org.uk Fri Jul 25 05:01:30 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 25 Jul 2008 10:01:30 +0000 (GMT)
Subject: [Swift-devel] Progress Report of Swift on BGP
In-Reply-To: <48896C1E.50604@uchicago.edu>
References: <48896C1E.50604@uchicago.edu>
Message-ID: 

On Fri, 25 Jul 2008, Zhao Zhang wrote:

> I made 2 small changes to wrapper.sh. For now, instead of creating
> everything on GPFS, wrapper.sh only creates the id.success file on
> GPFS, leaving the others on the CN's ramdisk.

Can you indicate what you mean by 'everything'? For real swift runs you
will also need to put the output files on GPFS, for example.

--

From wilde at mcs.anl.gov Fri Jul 25 07:36:21 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 25 Jul 2008 07:36:21 -0500
Subject: [Swift-devel] Re: 2 rack run with swift
In-Reply-To: 
References: <48881EE2.2060206@uchicago.edu> <48887C2D.40503@mcs.anl.gov>
Message-ID: <4889C8C5.4000704@mcs.anl.gov>

Zhao, turning on this log is a good experiment to do next.

On 7/24/08 11:31 AM, Ben Clifford wrote:
> On Thu, 24 Jul 2008, Michael Wilde wrote:
>
>> I *think* the plots bear this out, but need more assessment.
>
> The raw info that is useful for debugging is the -info wrapper log
> files, which you'll need to turn on explicitly [...]
>
> But for any of those runs it would be useful to collect all the
> wrapper (-info) logs to look at the timings.

From wilde at mcs.anl.gov Fri Jul 25 07:41:25 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 25 Jul 2008 07:41:25 -0500
Subject: [Swift-devel] Drop kickstart, adopt wrapper logs?
In-Reply-To: 
References: <334502.67176.qm@web52310.mail.re2.yahoo.com> <487D12A7.1090901@mcs.anl.gov>
Message-ID: <4889C9F5.10903@mcs.anl.gov>

was: Re: Can we copy Swift execution logs to CI network?

Should we make this transition, dropping kickstart, and making the
wrapper log a useful format that becomes part of the Swift runtime
specification?

On 7/15/08 4:18 PM, Ben Clifford wrote:

> ...there's substantial lack of information
> for many runs as we have been tweaking the logs over time (especially
> the worker node logs which take a similar place to kickstart records
> now - giving the actual on-worker cpu usage but are relatively very
> expensive to collect and move around).

From foster at mcs.anl.gov Fri Jul 25 08:01:11 2008
From: foster at mcs.anl.gov (Ian Foster)
Date: Fri, 25 Jul 2008 08:01:11 -0500
Subject: [Swift-devel] Drop kickstart, adopt wrapper logs?
In-Reply-To: <4889C9F5.10903@mcs.anl.gov>
References: <334502.67176.qm@web52310.mail.re2.yahoo.com> <487D12A7.1090901@mcs.anl.gov> <4889C9F5.10903@mcs.anl.gov>
Message-ID: <65B51B17-1DFC-4098-A601-A7E98ABF828C@mcs.anl.gov>

Mike:

What is the status of the integration of kickstart into the Globus
release?

Ian.

On Jul 25, 2008, at 7:41 AM, Michael Wilde wrote:

> was: Re: Can we copy Swift execution logs to CI network?
>
> Should we make this transition, dropping kickstart, and making the
> wrapper log a useful format that becomes part of the Swift runtime
> specification?
>
> [...]

From wilde at mcs.anl.gov Fri Jul 25 08:17:39 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 25 Jul 2008 08:17:39 -0500
Subject: [Swift-devel] Drop kickstart, adopt wrapper logs?
In-Reply-To: <65B51B17-1DFC-4098-A601-A7E98ABF828C@mcs.anl.gov>
References: <334502.67176.qm@web52310.mail.re2.yahoo.com> <487D12A7.1090901@mcs.anl.gov> <4889C9F5.10903@mcs.anl.gov> <65B51B17-1DFC-4098-A601-A7E98ABF828C@mcs.anl.gov>
Message-ID: <4889D273.5000405@mcs.anl.gov>

I'm not aware of any work done in that direction. I would have to hunt
back through emails for the last thing we said we would do.

It's been an often discussed topic, but one that's hard to move forward
on.

The tradeoff is doing something that works well for swift, vs having a
significantly larger and longer discussion on what would work for GRAM
as well (and potentially Condor and ...)

One possibility is that we stay with a kickstart-like architecture where
kickstart is the last thing called to launch the app, and that becomes a
separable component that can be used elsewhere. We do it in a way that
it's useful with GRAM and other systems, and which users can just use.

Then the discussion becomes mostly one of data format. There the two
likely candidates are XML or name/value pairs, possibly in some other
"standard" format, e.g. classads.

Another possibility is to emit an xml doc in a rigid format that doesn't
need an xml parser to process it - sort of n/v pairs in xml. This gets
complicated with multiline and escape issues of course.

It's worth a chat with the GRAM team. It will be a slower, harder route
though. That's a tradeoff.

- Mike

On 7/25/08 8:01 AM, Ian Foster wrote:
> Mike:
>
> What is the status of the integration of kickstart into the Globus
> release?
>
> Ian.
>
> [...]
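[To make the "n/v pairs in xml" idea concrete: a record along those
lines might look like the following. The element and field names and the
values are invented for illustration; this is not an existing kickstart
or wrapper format. The point of the rigid one-value-per-line layout is
that it can be read with grep/cut rather than a full XML parser.]

<invocation-record>
  <v name="host">tg-v086.uc.teragrid.org</v>
  <v name="start">2008-07-25T08:17:39-05:00</v>
  <v name="duration">61.402</v>
  <v name="exitcode">0</v>
  <v name="app">/bin/hostname</v>
</invocation-record>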
From foster at mcs.anl.gov Fri Jul 25 08:20:43 2008
From: foster at mcs.anl.gov (Ian Foster)
Date: Fri, 25 Jul 2008 08:20:43 -0500
Subject: [Swift-devel] Drop kickstart, adopt wrapper logs?
In-Reply-To: <4889D273.5000405@mcs.anl.gov>
References: <334502.67176.qm@web52310.mail.re2.yahoo.com> <487D12A7.1090901@mcs.anl.gov> <4889C9F5.10903@mcs.anl.gov> <65B51B17-1DFC-4098-A601-A7E98ABF828C@mcs.anl.gov> <4889D273.5000405@mcs.anl.gov>
Message-ID: 

Mike:

There was an agreement a year or more ago that the VDS group would work
with the GRAM team on integration. I suspect it got dropped, but I
wanted to check.

Ian.

On Jul 25, 2008, at 8:17 AM, Michael Wilde wrote:
> I'm not aware of any work done in that direction. I would have to hunt
> back through emails for the last thing we said we would do.
> [...]

From wilde at mcs.anl.gov Fri Jul 25 09:47:23 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 25 Jul 2008 09:47:23 -0500
Subject: [Swift-devel] Progress Report of Swift on BGP
In-Reply-To: 
References: <48896C1E.50604@uchicago.edu>
Message-ID: <4889E77B.9020503@mcs.anl.gov>

Nonetheless this sounds very promising, Zhao.

Today let's start documenting the data management logic and data, and
show what optimizations can be done.
- Mike On 7/25/08 5:01 AM, Ben Clifford wrote: > On Fri, 25 Jul 2008, Zhao Zhang wrote: > >> I made 2 small changes to wrapper.sh. For now, instead of creating everything >> on GPFS, wrapper.sh only create that id.success file on GPFS, leaving others >> to CN's ramdisk. > > Can you indicate what you mean by 'everything'? For real swift runs you > will also need to put the output files on GPFS, for example. > From zhaozhang at uchicago.edu Fri Jul 25 09:57:27 2008 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Fri, 25 Jul 2008 09:57:27 -0500 Subject: [Swift-devel] Progress Report of Swift on BGP In-Reply-To: References: <48896C1E.50604@uchicago.edu> Message-ID: <4889E9D7.3040108@uchicago.edu> Hi,Ben By everything, I mean in that run every task is successful, the wrapper.sh works well. You are right, I haven't done much about data, it is on my plan next. best wishes zhangzhao Ben Clifford wrote: > On Fri, 25 Jul 2008, Zhao Zhang wrote: > > >> I made 2 small changes to wrapper.sh. For now, instead of creating everything >> on GPFS, wrapper.sh only create that id.success file on GPFS, leaving others >> to CN's ramdisk. >> > > Can you indicate what you mean by 'everything'? For real swift runs you > will also need to put the output files on GPFS, for example. > > From hategan at mcs.anl.gov Fri Jul 25 10:02:41 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 25 Jul 2008 10:02:41 -0500 Subject: [Swift-devel] Progress Report of Swift on BGP In-Reply-To: <4889E77B.9020503@mcs.anl.gov> References: <48896C1E.50604@uchicago.edu> <4889E77B.9020503@mcs.anl.gov> Message-ID: <1216998161.1315.4.camel@localhost> On Fri, 2008-07-25 at 09:47 -0500, Michael Wilde wrote: > Nonetheless this sounds very promising, Zhao. > > Today lets start documenting the data management logic and data, and > shows what optimizations can be done. Though we should be careful not to confuse "optimization" with "removing functionality". > > - Mike > > On 7/25/08 5:01 AM, Ben Clifford wrote: > > On Fri, 25 Jul 2008, Zhao Zhang wrote: > > > >> I made 2 small changes to wrapper.sh. For now, instead of creating everything > >> on GPFS, wrapper.sh only create that id.success file on GPFS, leaving others > >> to CN's ramdisk. > > > > Can you indicate what you mean by 'everything'? For real swift runs you > > will also need to put the output files on GPFS, for example. > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From benc at hawaga.org.uk Fri Jul 25 10:15:19 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 25 Jul 2008 15:15:19 +0000 (GMT) Subject: [Swift-devel] Re: Drop kickstart, adopt wrapper logs? In-Reply-To: <4889C9F5.10903@mcs.anl.gov> References: <334502.67176.qm@web52310.mail.re2.yahoo.com> <487D12A7.1090901@mcs.anl.gov> <4889C9F5.10903@mcs.anl.gov> Message-ID: pretty much no one uses kickstart with swift at the moment. The wrapper log *is* in a useful format, and is probably as specified as any other part of the runtime environment. > was: Re: Can we copy Swift execution logs to CI network? > > Should we make this transition, dropping kickstart, and making the wrapper log > a useful format that becomes part of the Swift runtime specification? 
> > On 7/15/08 4:18 PM, Ben Clifford wrote: > > > ...there's substantial lack of information > > for many runs as we have been tweaking the logs over time (especially the > > worker node logs which take a similar place to kickstart records now - > > giving the actual on-worker cpu usage but are relatively very expensive to > > collect and move around). > > From benc at hawaga.org.uk Fri Jul 25 10:20:30 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 25 Jul 2008 15:20:30 +0000 (GMT) Subject: [Swift-devel] Drop kickstart, adopt wrapper logs? In-Reply-To: References: <334502.67176.qm@web52310.mail.re2.yahoo.com> <487D12A7.1090901@mcs.anl.gov> <4889C9F5.10903@mcs.anl.gov> <65B51B17-1DFC-4098-A601-A7E98ABF828C@mcs.anl.gov> <4889D273.5000405@mcs.anl.gov> Message-ID: Its on my todo list. Its below the event horizon though (that being the line at which more prioritised things are arriving at a rate such that this never rises and is instead sucked into the black hole of infinity) so it won't get done. On Fri, 25 Jul 2008, Ian Foster wrote: > Mike: > > There was an agreement a year or more ago that the VDS group would work with > the GRAM team on integration. I suspect it got dropped, but I wanted to check. > > Ian. > > On Jul 25, 2008, at 8:17 AM, Michael Wilde wrote: > > > Im not aware of any work done in that direction. > > I would have to hunt back through emails for the last thing we said we would > > do. > > > > Its been an often discussed topic, but one thats hard to move forward on. > > > > The tradeoff is doing something that works well for swift, vs having a > > significantly larger and longer discussion on what would work for GRAM as > > well (and potentially Condor and ...) > > > > One possibility is that we stay with a kickstart like architecture where > > kickstart is the last thing called to launch the app, and that becomes a > > separable component that can be used elsewhere. We do it in a way that its > > useful with GRAM and other systems, and which users can just use. > > > > Then the discussion becomes mostly one of data format. > > There the two likely candidates are XML or name/value pairs, possibly in > > some other "standard" format, eg classads. > > > > Another possibility is to emit an xml doc in a rigid format that doesnt need > > an xml parser to process it - sort of n/v pairs in xml. This gets > > complicated with multiline and escape issues of course. > > > > Its worth a chat with the GRAM team. It will be a slower harder route > > though. Thats a tradeoff. > > > > - Mike > > > > > > On 7/25/08 8:01 AM, Ian Foster wrote: > > > Mike: > > > What is the status of the integration into kickstart into the Globus > > > release? > > > Ian. > > > On Jul 25, 2008, at 7:41 AM, Michael Wilde wrote: > > > > was: Re: Can we copy Swift execution logs to CI network? > > > > > > > > Should we make this transition, dropping kickstart, and making the > > > > wrapper log a useful format that becomes part of the Swift runtime > > > > specification? > > > > > > > > On 7/15/08 4:18 PM, Ben Clifford wrote: > > > > > > > > > ...there's substantial lack of information > > > > > for many runs as we have been tweaking the logs over time (especially > > > > > the worker node logs which take a similar place to kickstart records > > > > > now - giving the actual on-worker cpu usage but are relatively very > > > > > expensive to collect and move around). 
> > > > > > > > _______________________________________________ > > > > Swift-devel mailing list > > > > Swift-devel at ci.uchicago.edu > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From hategan at mcs.anl.gov Fri Jul 25 10:28:45 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 25 Jul 2008 10:28:45 -0500 Subject: [Swift-devel] Re: Drop kickstart, adopt wrapper logs? In-Reply-To: References: <334502.67176.qm@web52310.mail.re2.yahoo.com> <487D12A7.1090901@mcs.anl.gov> <4889C9F5.10903@mcs.anl.gov> Message-ID: <1216999725.1948.3.camel@localhost> In the really long run if coasters turn out to be a workable piece of software, they could support remote logging, which would mean streaming logging data instead of using the filesystem. But for the short term, I agree. We haven't exercised kickstart as much, and the wrapper logs seem to be reasonably useful as a replacement. On Fri, 2008-07-25 at 15:15 +0000, Ben Clifford wrote: > pretty much no one uses kickstart with swift at the moment. The wrapper > log *is* in a useful format, and is probably as specified as any other > part of the runtime environment. > > > > was: Re: Can we copy Swift execution logs to CI network? > > > > Should we make this transition, dropping kickstart, and making the wrapper log > > a useful format that becomes part of the Swift runtime specification? > > > > On 7/15/08 4:18 PM, Ben Clifford wrote: > > > > > ...there's substantial lack of information > > > for many runs as we have been tweaking the logs over time (especially the > > > worker node logs which take a similar place to kickstart records now - > > > giving the actual on-worker cpu usage but are relatively very expensive to > > > collect and move around). > > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From foster at mcs.anl.gov Fri Jul 25 10:25:47 2008 From: foster at mcs.anl.gov (Ian Foster) Date: Fri, 25 Jul 2008 10:25:47 -0500 Subject: [Swift-devel] Drop kickstart, adopt wrapper logs? In-Reply-To: References: <334502.67176.qm@web52310.mail.re2.yahoo.com> <487D12A7.1090901@mcs.anl.gov> <4889C9F5.10903@mcs.anl.gov> <65B51B17-1DFC-4098-A601-A7E98ABF828C@mcs.anl.gov> <4889D273.5000405@mcs.anl.gov> Message-ID: <05FF2E5E-0F37-4324-92DB-20F1005862C5@mcs.anl.gov> nicely put :) On Jul 25, 2008, at 10:20 AM, Ben Clifford wrote: > > Its on my todo list. Its below the event horizon though (that being > the > line at which more prioritised things are arriving at a rate such that > this never rises and is instead sucked into the black hole of > infinity) so > it won't get done. > > On Fri, 25 Jul 2008, Ian Foster wrote: > >> Mike: >> >> There was an agreement a year or more ago that the VDS group would >> work with >> the GRAM team on integration. I suspect it got dropped, but I >> wanted to check. >> >> Ian. >> >> On Jul 25, 2008, at 8:17 AM, Michael Wilde wrote: >> >>> Im not aware of any work done in that direction. >>> I would have to hunt back through emails for the last thing we >>> said we would >>> do. >>> >>> Its been an often discussed topic, but one thats hard to move >>> forward on. >>> >>> The tradeoff is doing something that works well for swift, vs >>> having a >>> significantly larger and longer discussion on what would work for >>> GRAM as >>> well (and potentially Condor and ...) 
>>> >>> One possibility is that we stay with a kickstart like architecture >>> where >>> kickstart is the last thing called to launch the app, and that >>> becomes a >>> separable component that can be used elsewhere. We do it in a way >>> that its >>> useful with GRAM and other systems, and which users can just use. >>> >>> Then the discussion becomes mostly one of data format. >>> There the two likely candidates are XML or name/value pairs, >>> possibly in >>> some other "standard" format, eg classads. >>> >>> Another possibility is to emit an xml doc in a rigid format that >>> doesnt need >>> an xml parser to process it - sort of n/v pairs in xml. This gets >>> complicated with multiline and escape issues of course. >>> >>> Its worth a chat with the GRAM team. It will be a slower harder >>> route >>> though. Thats a tradeoff. >>> >>> - Mike >>> >>> >>> On 7/25/08 8:01 AM, Ian Foster wrote: >>>> Mike: >>>> What is the status of the integration into kickstart into the >>>> Globus >>>> release? >>>> Ian. >>>> On Jul 25, 2008, at 7:41 AM, Michael Wilde wrote: >>>>> was: Re: Can we copy Swift execution logs to CI network? >>>>> >>>>> Should we make this transition, dropping kickstart, and making the >>>>> wrapper log a useful format that becomes part of the Swift runtime >>>>> specification? >>>>> >>>>> On 7/15/08 4:18 PM, Ben Clifford wrote: >>>>> >>>>>> ...there's substantial lack of information >>>>>> for many runs as we have been tweaking the logs over time >>>>>> (especially >>>>>> the worker node logs which take a similar place to kickstart >>>>>> records >>>>>> now - giving the actual on-worker cpu usage but are relatively >>>>>> very >>>>>> expensive to collect and move around). >>>>> >>>>> _______________________________________________ >>>>> Swift-devel mailing list >>>>> Swift-devel at ci.uchicago.edu >>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> From hategan at mcs.anl.gov Fri Jul 25 10:34:09 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 25 Jul 2008 10:34:09 -0500 Subject: [Swift-devel] Drop kickstart, adopt wrapper logs? In-Reply-To: References: <334502.67176.qm@web52310.mail.re2.yahoo.com> <487D12A7.1090901@mcs.anl.gov> <4889C9F5.10903@mcs.anl.gov> <65B51B17-1DFC-4098-A601-A7E98ABF828C@mcs.anl.gov> <4889D273.5000405@mcs.anl.gov> Message-ID: <1217000049.2116.1.camel@localhost> Could be a decent weekend hack for me though. On Fri, 2008-07-25 at 15:20 +0000, Ben Clifford wrote: > Its on my todo list. Its below the event horizon though (that being the > line at which more prioritised things are arriving at a rate such that > this never rises and is instead sucked into the black hole of infinity) so > it won't get done. > > On Fri, 25 Jul 2008, Ian Foster wrote: > > > Mike: > > > > There was an agreement a year or more ago that the VDS group would work with > > the GRAM team on integration. I suspect it got dropped, but I wanted to check. > > > > Ian. > > > > On Jul 25, 2008, at 8:17 AM, Michael Wilde wrote: > > > > > Im not aware of any work done in that direction. > > > I would have to hunt back through emails for the last thing we said we would > > > do. > > > > > > Its been an often discussed topic, but one thats hard to move forward on. > > > > > > The tradeoff is doing something that works well for swift, vs having a > > > significantly larger and longer discussion on what would work for GRAM as > > > well (and potentially Condor and ...) 
> > > > > > One possibility is that we stay with a kickstart like architecture where > > > kickstart is the last thing called to launch the app, and that becomes a > > > separable component that can be used elsewhere. We do it in a way that its > > > useful with GRAM and other systems, and which users can just use. > > > > > > Then the discussion becomes mostly one of data format. > > > There the two likely candidates are XML or name/value pairs, possibly in > > > some other "standard" format, eg classads. > > > > > > Another possibility is to emit an xml doc in a rigid format that doesnt need > > > an xml parser to process it - sort of n/v pairs in xml. This gets > > > complicated with multiline and escape issues of course. > > > > > > Its worth a chat with the GRAM team. It will be a slower harder route > > > though. Thats a tradeoff. > > > > > > - Mike > > > > > > > > > On 7/25/08 8:01 AM, Ian Foster wrote: > > > > Mike: > > > > What is the status of the integration into kickstart into the Globus > > > > release? > > > > Ian. > > > > On Jul 25, 2008, at 7:41 AM, Michael Wilde wrote: > > > > > was: Re: Can we copy Swift execution logs to CI network? > > > > > > > > > > Should we make this transition, dropping kickstart, and making the > > > > > wrapper log a useful format that becomes part of the Swift runtime > > > > > specification? > > > > > > > > > > On 7/15/08 4:18 PM, Ben Clifford wrote: > > > > > > > > > > > ...there's substantial lack of information > > > > > > for many runs as we have been tweaking the logs over time (especially > > > > > > the worker node logs which take a similar place to kickstart records > > > > > > now - giving the actual on-worker cpu usage but are relatively very > > > > > > expensive to collect and move around). > > > > > > > > > > _______________________________________________ > > > > > Swift-devel mailing list > > > > > Swift-devel at ci.uchicago.edu > > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From wilde at mcs.anl.gov Fri Jul 25 10:48:36 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 25 Jul 2008 10:48:36 -0500 Subject: [Swift-devel] Re: Drop kickstart, adopt wrapper logs? In-Reply-To: <1216999725.1948.3.camel@localhost> References: <334502.67176.qm@web52310.mail.re2.yahoo.com> <487D12A7.1090901@mcs.anl.gov> <4889C9F5.10903@mcs.anl.gov> <1216999725.1948.3.camel@localhost> Message-ID: <4889F5D4.4040204@mcs.anl.gov> Im happy with wrapper for now. Regarding use: there is a real need. *I* used kickstart quite a bit when working on apps. It doesnt get used because its not easy and not efficient. A real user need exists right now: the OOPS folding app has memory and cpu usage issues that impact its feasibility. The user (Glen Hocky) needs to compare its run time on different platforms with different compilers, and also needs to tune how many copies run on multicore hosts (bgp, sicortex, Abe/QueenBee on TG, and Ranger - 4, 6, 8, and 16 respectively). So Glen is essentially doing his own logging in the app wrapper, using /usr/bin/time. But this is something we can and should do for the user, simply, naturally, and by default. And make it easy enough that the basic app performance characteristics are reported on every run. Very similar needs exist with DOCK: we have some mysterious app failures that we'd like to correlate with mem consumption. 
And BLAST: I've told Alina to do ad-hoc time/mem recording there as well.

On the BGP we are starting to discuss collective I/O, and have been running apps in a way that returns all their little output files - including ad-hoc time reports - in a single tarball per job.

So I think this is worth discussing and improving.

I'll take a look at what the wrapper is doing, and we can discuss a few options for moving the logs. One of which is that the user bundles it explicitly. Another is that the files are streamed and/or batched. A third is that the info is streamed, as Mihael suggests.

- Mike

On 7/25/08 10:28 AM, Mihael Hategan wrote:
> In the really long run if coasters turn out to be a workable piece of
> software, they could support remote logging, which would mean streaming
> logging data instead of using the filesystem.
>
> But for the short term, I agree. We haven't exercised kickstart as much,
> and the wrapper logs seem to be reasonably useful as a replacement.
>
> On Fri, 2008-07-25 at 15:15 +0000, Ben Clifford wrote:
>> pretty much no one uses kickstart with swift at the moment. The wrapper
>> log *is* in a useful format, and is probably as specified as any other
>> part of the runtime environment.
>>
>>> was: Re: Can we copy Swift execution logs to CI network?
>>>
>>> Should we make this transition, dropping kickstart, and making the wrapper log
>>> a useful format that becomes part of the Swift runtime specification?
>>>
>>> On 7/15/08 4:18 PM, Ben Clifford wrote:
>>>
>>>> ...there's substantial lack of information
>>>> for many runs as we have been tweaking the logs over time (especially the
>>>> worker node logs which take a similar place to kickstart records now -
>>>> giving the actual on-worker cpu usage but are relatively very expensive to
>>>> collect and move around).
>>>
>> _______________________________________________
>> Swift-devel mailing list
>> Swift-devel at ci.uchicago.edu
>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel

From hategan at mcs.anl.gov Fri Jul 25 10:57:50 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 25 Jul 2008 10:57:50 -0500
Subject: [Swift-devel] Re: Drop kickstart, adopt wrapper logs?
In-Reply-To: <4889F5D4.4040204@mcs.anl.gov>
References: <334502.67176.qm@web52310.mail.re2.yahoo.com> <487D12A7.1090901@mcs.anl.gov> <4889C9F5.10903@mcs.anl.gov> <1216999725.1948.3.camel@localhost> <4889F5D4.4040204@mcs.anl.gov>
Message-ID: <1217001470.3627.2.camel@localhost>

On Fri, 2008-07-25 at 10:48 -0500, Michael Wilde wrote:
> Im happy with wrapper for now.
>
> Regarding use: there is a real need. *I* used kickstart quite a bit when
> working on apps. It doesnt get used because its not easy and not efficient.
>
> A real user need exists right now: the OOPS folding app has memory and
> cpu usage issues that impact its feasibility. The user (Glen Hocky)
> needs to compare its run time on different platforms with different
> compilers, and also needs to tune how many copies run on multicore hosts
> (bgp, sicortex, Abe/QueenBee on TG, and Ranger - 4, 6, 8, and 16
> respectively).
>
> So Glen is essentially doing his own logging in the app wrapper, using
> /usr/bin/time. But this is something we can and should do for the user,
> simply, naturally, and by default. And make it easy enough that the
> basic app performance characteristics are reported on every run.

Presumably we could do similar things in the wrapper by looking at /proc
and such. Though Jens would shout "that's not portable!".
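Putting Glen's /usr/bin/time approach and the /proc idea side by side, a sketch of what the wrapper could do by default: prefer GNU time (its -v flag is a GNU extension, not POSIX) and fall back to sampling /proc, which is precisely the Linux-only part Jens would object to. The file name and sampling interval here are arbitrary choices for illustration:

===
#!/bin/sh
# sketch: record cpu/memory usage for one app invocation ("$@")
usage=usage.txt
if /usr/bin/time --version >/dev/null 2>&1; then
    # GNU time gives wall/user/sys plus peak RSS in one shot
    /usr/bin/time -v -o "$usage" "$@"
    rc=$?
else
    # Linux-only fallback: poll VmRSS and keep the high-water mark
    "$@" & pid=$!
    peak=0
    while kill -0 "$pid" 2>/dev/null; do
        rss=$(awk '/^VmRSS:/ {print $2}' "/proc/$pid/status" 2>/dev/null)
        [ -n "$rss" ] && [ "$rss" -gt "$peak" ] && peak=$rss
        sleep 1
    done
    wait "$pid"; rc=$?
    echo "peak_rss_kb=$peak" > "$usage"
fi
echo "exitcode=$rc" >> "$usage"
exit $rc
===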
From wilde at mcs.anl.gov Fri Jul 25 11:06:30 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 25 Jul 2008 11:06:30 -0500
Subject: [Swift-devel] Drop kickstart, adopt wrapper logs?
In-Reply-To: <1217000049.2116.1.camel@localhost>
References: <334502.67176.qm@web52310.mail.re2.yahoo.com> <487D12A7.1090901@mcs.anl.gov> <4889C9F5.10903@mcs.anl.gov> <65B51B17-1DFC-4098-A601-A7E98ABF828C@mcs.anl.gov> <4889D273.5000405@mcs.anl.gov> <1217000049.2116.1.camel@localhost>
Message-ID: <4889FA06.8060100@mcs.anl.gov>

If you do take this off for weekend work (which sounds like a good idea, modulo the weekend part ;) let's first set up a design decision-making process on the issues that have been raised in discussions to date:

- arg passing
- what it captures
  - time, mem
  - env (in the broad sense)
  - file state
- output format(s)
- snapshotting for monitoring an app run
- controls for start, stop, pause
- how it integrates in CoG, GRAM, Condor, other LRMs, Coaster, Falkon, and Swift
- how it relates to "accounting" projects in OSG and TG
- ideas on streaming output
- ideas on batching output
- interaction with a collective I/O toolkit/model
- other stuff???

It's worth looking at what Jens did in "kickstart v2", what he called "k2", in the VDS CVS. It had conventions to deal with passing args in a file stream rather than the command line, to tunnel through what at the time were CLI length limits and arg escape problems in Condor.

I feel this merits a design effort, and propose that we try this as an experiment in doing design in a way that can get group input and do decision making in an effective and pleasing way. I'm thinking email discussion but based on a persistent changing document (bugzilla note, wiki page, or even text in email that we treat as a doc). Something that when we are done discussing and say "go", we have a single place that documents the complete consensus to that point.

- Mike

On 7/25/08 10:34 AM, Mihael Hategan wrote:
> Could be a decent weekend hack for me though.
>
> On Fri, 2008-07-25 at 15:20 +0000, Ben Clifford wrote:
>> Its on my todo list. Its below the event horizon though (that being the
>> line at which more prioritised things are arriving at a rate such that
>> this never rises and is instead sucked into the black hole of infinity) so
>> it won't get done.
>>
>> On Fri, 25 Jul 2008, Ian Foster wrote:
>>
>>> Mike:
>>>
>>> There was an agreement a year or more ago that the VDS group would work with
>>> the GRAM team on integration. I suspect it got dropped, but I wanted to check.
>>>
>>> Ian.
>>>
>>> On Jul 25, 2008, at 8:17 AM, Michael Wilde wrote:
>>>
>>>> Im not aware of any work done in that direction.
>>>> I would have to hunt back through emails for the last thing we said we would
>>>> do.
>>>>
>>>> Its been an often discussed topic, but one thats hard to move forward on.
>>>>
>>>> The tradeoff is doing something that works well for swift, vs having a
>>>> significantly larger and longer discussion on what would work for GRAM as
>>>> well (and potentially Condor and ...)
>>>>
>>>> One possibility is that we stay with a kickstart like architecture where
>>>> kickstart is the last thing called to launch the app, and that becomes a
>>>> separable component that can be used elsewhere. We do it in a way that its
>>>> useful with GRAM and other systems, and which users can just use.
>>>>
>>>> Then the discussion becomes mostly one of data format.
>>>> There the two likely candidates are XML or name/value pairs, possibly in
>>>> some other "standard" format, eg classads.
>>>> >>>> Another possibility is to emit an xml doc in a rigid format that doesnt need >>>> an xml parser to process it - sort of n/v pairs in xml. This gets >>>> complicated with multiline and escape issues of course. >>>> >>>> Its worth a chat with the GRAM team. It will be a slower harder route >>>> though. Thats a tradeoff. >>>> >>>> - Mike >>>> >>>> >>>> On 7/25/08 8:01 AM, Ian Foster wrote: >>>>> Mike: >>>>> What is the status of the integration into kickstart into the Globus >>>>> release? >>>>> Ian. >>>>> On Jul 25, 2008, at 7:41 AM, Michael Wilde wrote: >>>>>> was: Re: Can we copy Swift execution logs to CI network? >>>>>> >>>>>> Should we make this transition, dropping kickstart, and making the >>>>>> wrapper log a useful format that becomes part of the Swift runtime >>>>>> specification? >>>>>> >>>>>> On 7/15/08 4:18 PM, Ben Clifford wrote: >>>>>> >>>>>>> ...there's substantial lack of information >>>>>>> for many runs as we have been tweaking the logs over time (especially >>>>>>> the worker node logs which take a similar place to kickstart records >>>>>>> now - giving the actual on-worker cpu usage but are relatively very >>>>>>> expensive to collect and move around). >>>>>> _______________________________________________ >>>>>> Swift-devel mailing list >>>>>> Swift-devel at ci.uchicago.edu >>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From wilde at mcs.anl.gov Fri Jul 25 11:10:44 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 25 Jul 2008 11:10:44 -0500 Subject: [Swift-devel] Re: Drop kickstart, adopt wrapper logs? In-Reply-To: <1217001470.3627.2.camel@localhost> References: <334502.67176.qm@web52310.mail.re2.yahoo.com> <487D12A7.1090901@mcs.anl.gov> <4889C9F5.10903@mcs.anl.gov> <1216999725.1948.3.camel@localhost> <4889F5D4.4040204@mcs.anl.gov> <1217001470.3627.2.camel@localhost> Message-ID: <4889FB04.8090906@mcs.anl.gov> On 7/25/08 10:57 AM, Mihael Hategan wrote: > On Fri, 2008-07-25 at 10:48 -0500, Michael Wilde wrote: >> Im happy with wrapper for now. >> >> Regarding use: there is a real need. *I* used kickstart quite a bit when >> working on apps. It doesnt get used because its not easy and not efficient. >> >> A real user need exists right now: the OOPS folding app has memory and >> cpu usage issues that impact its feasibility. The user (Glen Hocky) >> needs to compare its run time on different platforms with different >> compilers, and also needs to tune how many copies run on multicore hosts >> (bgp, sicortex, Abe/QueenBee on TG, and Ranger - 4, 6, 8, and 16 >> respectively). >> >> So Glen is essentially doing his own logging in the app wrapper, using >> /usr/bin/time. But this is something we can and should do for the user, >> simply, naturally, and by default. And make it easy enough that the >> basic app performance characteristics are reported on every run. > > Presumably we could do similar things in the wrapper by looking at /proc > and such. Though Jens would shout "that's not portable!". Or even put the time command in the wrapper. I think the time man page makes some murmurs about posix compliance. 
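The posix murmurs in the time man page amount to time -p, which mandates only three lines of output (real, user, sys) and says nothing about memory. A sketch of that portable subset; note the redirection also diverts the app's own stderr into the same file:

===
#!/bin/sh
# POSIX `time -p` prints real/user/sys on stderr; capture it
{ time -p "$@" ; } 2> times.txt
===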
Nested invocations of wrapper-like entities on the worker node are all pretty affordable, it seems. The costly part seems to be moving/accessing/recording the output. btw - a side note here - last I looked at wrapper.sh circa March I found it to be very beautifully written. Im all for building on it. From hategan at mcs.anl.gov Fri Jul 25 11:25:16 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 25 Jul 2008 11:25:16 -0500 Subject: [Swift-devel] Drop kickstart, adopt wrapper logs? In-Reply-To: <4889FA06.8060100@mcs.anl.gov> References: <334502.67176.qm@web52310.mail.re2.yahoo.com> <487D12A7.1090901@mcs.anl.gov> <4889C9F5.10903@mcs.anl.gov> <65B51B17-1DFC-4098-A601-A7E98ABF828C@mcs.anl.gov> <4889D273.5000405@mcs.anl.gov> <1217000049.2116.1.camel@localhost> <4889FA06.8060100@mcs.anl.gov> Message-ID: <1217003116.4952.8.camel@localhost> On Fri, 2008-07-25 at 11:06 -0500, Michael Wilde wrote: > If you do take this off for weekend work (which sounds like a good idea, > modulo the weekend part ;) lets first set a design decision making > process on the issues that have been raised in discussions to date: Well, it was about removing kickstart which is somewhat different from improving the wrapper to be like kickstart. But we should discuss this. > > - arg passing It's there. > - what it captures > - time, mem I suppose regular snapshots of those. > - env (in the broad sense) I think it's there. > - file state ? > - output format(s) ? > - snapshotting for monitoring an app run ? > - controls for start, stop, pause ? > - how it integrates in CoG, GRAM, Condor, other LRMs, Coaster, Falkon, > and Swift It's the wrapper. > - how it relates to "accounting" projects in OSG and TG ? > - ideas on streaming output As in stdout? > - idead on batching output > - interaction with a collective I/O toolkit/model > - other stuff??? > > Its worth looking at what Jens did in "kickstart v2", what he called > "k2", in the VDS CVS. It had conventions to deal with passing args in a > file stream rather than command line to tunnel through what at the time > were cli length limits and arg escape problems in Condor We should try not to make the wrapper a bad substitute for a bad job manager. In other words I don't think it should fix too many problems. > > I feel this merits a design effort, and propose that we try this as an > experiment in doing design in a way that can get group input do decision > making in an effective and pleasing way. So we should first decide what we're talking about: 1. Logging of application stuff (i.e. providing information similar to kickstart) 2. Fixin' problems in Condor and GRAM I think it should only be a focused thing around (1). From benc at hawaga.org.uk Fri Jul 25 11:56:01 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 25 Jul 2008 16:56:01 +0000 (GMT) Subject: [Swift-devel] Re: Drop kickstart, adopt wrapper logs? In-Reply-To: <4889FB04.8090906@mcs.anl.gov> References: <334502.67176.qm@web52310.mail.re2.yahoo.com> <487D12A7.1090901@mcs.anl.gov> <4889C9F5.10903@mcs.anl.gov> <1216999725.1948.3.camel@localhost> <4889F5D4.4040204@mcs.anl.gov> <1217001470.3627.2.camel@localhost> <4889FB04.8090906@mcs.anl.gov> Message-ID: On Fri, 25 Jul 2008, Michael Wilde wrote: > Or even put the time command in the wrapper. I think the time man page makes > some murmurs about posix compliance. The info log files already log job executable start and end times. 
--

From hategan at mcs.anl.gov Fri Jul 25 12:16:30 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 25 Jul 2008 12:16:30 -0500
Subject: [Swift-devel] Re: Drop kickstart, adopt wrapper logs?
In-Reply-To:
References: <334502.67176.qm@web52310.mail.re2.yahoo.com> <487D12A7.1090901@mcs.anl.gov> <4889C9F5.10903@mcs.anl.gov> <1216999725.1948.3.camel@localhost> <4889F5D4.4040204@mcs.anl.gov> <1217001470.3627.2.camel@localhost> <4889FB04.8090906@mcs.anl.gov>
Message-ID: <1217006190.5958.0.camel@localhost>

On Fri, 2008-07-25 at 16:56 +0000, Ben Clifford wrote:
> On Fri, 25 Jul 2008, Michael Wilde wrote:
>
>> Or even put the time command in the wrapper. I think the time man page makes
>> some murmurs about posix compliance.
>
> The info log files already log job executable start and end times.

Though 'time' has this nice distinction between real/user/system time.

From benc at hawaga.org.uk Sun Jul 27 05:18:56 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sun, 27 Jul 2008 10:18:56 +0000 (GMT)
Subject: [Swift-devel] Progress Report of Swift on BGP
In-Reply-To: <1216998161.1315.4.camel@localhost>
References: <48896C1E.50604@uchicago.edu> <4889E77B.9020503@mcs.anl.gov> <1216998161.1315.4.camel@localhost>
Message-ID:

On Fri, 25 Jul 2008, Mihael Hategan wrote:

> Though we should be careful not to confuse "optimization" with "removing
> functionality".

Yes. I think it's very important to be careful about what is changed whilst still claiming something is "Swift".

As a first technical approximation, it would be interesting to check that the tests in tests/language-behaviour all pass after making an "optimisation". If one of those doesn't pass then the change is not an optimisation / improvement of Swift.

That is not to say that there is anything wrong per se with benchmarking how stuff behaves with such changes, but when discussing results it's important to indicate that this is not "Real Swift(tm)".

As a particularly extreme example, here is a misoptimised version of Swift that can run more than Avogadro's number of jobs per second:

===
#!/bin/bash
true
===

--

From benc at hawaga.org.uk Sun Jul 27 06:14:26 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sun, 27 Jul 2008 11:14:26 +0000 (GMT)
Subject: [Swift-devel] Drop kickstart, adopt wrapper logs?
In-Reply-To: <1217003116.4952.8.camel@localhost>
References: <334502.67176.qm@web52310.mail.re2.yahoo.com> <487D12A7.1090901@mcs.anl.gov> <4889C9F5.10903@mcs.anl.gov> <65B51B17-1DFC-4098-A601-A7E98ABF828C@mcs.anl.gov> <4889D273.5000405@mcs.anl.gov> <1217000049.2116.1.camel@localhost> <4889FA06.8060100@mcs.anl.gov> <1217003116.4952.8.camel@localhost>
Message-ID:

On Fri, 25 Jul 2008, Mihael Hategan wrote:

> Well, it was about removing kickstart which is somewhat different from
> improving the wrapper to be like kickstart.

That's not what I was referring to (nor Ian, I think) - the work item was to package kickstart to eventually go in the GT release and be more closely integrated with GRAM.

--

From hategan at mcs.anl.gov Sun Jul 27 10:52:58 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sun, 27 Jul 2008 10:52:58 -0500
Subject: [Swift-devel] Drop kickstart, adopt wrapper logs?
In-Reply-To:
References: <334502.67176.qm@web52310.mail.re2.yahoo.com> <487D12A7.1090901@mcs.anl.gov> <4889C9F5.10903@mcs.anl.gov> <65B51B17-1DFC-4098-A601-A7E98ABF828C@mcs.anl.gov> <4889D273.5000405@mcs.anl.gov> <1217000049.2116.1.camel@localhost> <4889FA06.8060100@mcs.anl.gov> <1217003116.4952.8.camel@localhost>
Message-ID: <1217173978.20432.0.camel@localhost>

On Sun, 2008-07-27 at 11:14 +0000, Ben Clifford wrote:
> On Fri, 25 Jul 2008, Mihael Hategan wrote:
>
>> Well, it was about removing kickstart which is somewhat different from
>> improving the wrapper to be like kickstart.
>
> That's not what I was referring to (nor Ian, I think) - the work item was
> to package kickstart to eventually go in the GT release and be more
> closely integrated with GRAM.

I was replying to:

> Should we make this transition, dropping kickstart, and making the wrapper log
> a useful format that becomes part of the Swift runtime specification?

And also to the subject line.

From bugzilla-daemon at mcs.anl.gov Sun Jul 27 11:27:51 2008
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Sun, 27 Jul 2008 11:27:51 -0500 (CDT)
Subject: [Swift-devel] [Bug 151] Swift gives null pointer exception
In-Reply-To:
Message-ID: <20080727162751.2788016469@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=151

benc at hawaga.org.uk changed:

           What       |Removed |Added
----------------------------------------------------------------------------
           Status     |NEW     |RESOLVED
           Resolution |        |WORKSFORME

------- Comment #3 from benc at hawaga.org.uk 2008-07-27 11:27 -------
The attached code compiles and runs without error for me.

Note that there appears to be no variable 'a' referred to in that source file at all. If you continue to have problems, please attach code and error messages that actually belong together.

--
Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.
You reported the bug, or are watching the reporter.

From wilde at mcs.anl.gov Sun Jul 27 11:52:34 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Sun, 27 Jul 2008 11:52:34 -0500
Subject: [Swift-devel] [Bug 151] Swift gives null pointer exception
In-Reply-To: <20080727162751.2788016469@foxtrot.mcs.anl.gov>
References: <20080727162751.2788016469@foxtrot.mcs.anl.gov>
Message-ID: <488CA7D2.50203@mcs.anl.gov>

On 7/27/08 11:27 AM, bugzilla-daemon at mcs.anl.gov wrote:
> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=151
>
> benc at hawaga.org.uk changed:
>
>            What       |Removed |Added
> ----------------------------------------------------------------------------
>            Status     |NEW     |RESOLVED
>            Resolution |        |WORKSFORME
>
> ------- Comment #3 from benc at hawaga.org.uk 2008-07-27 11:27 -------
> The attached code compiles and runs without error for me.
>
> Note that there appears to be no variable 'a' referred to in that source file
> at all. If you continue to have problems, please attach code and error messages
> that actually belong together.

I obviously thought I did. Perhaps Comment #2 explains what happened? If not I will check.

What's the Swift logic regarding recompilation if it finds an existing .kml file? Does that logic get fooled in some cases?
- Mike ------- Comment #2 From Michael Wilde 2008-07-22 12:12 [reply] ------- It seems that if I remove the .xml and .kml files from previous compiles, this problem goes away, and I get instead the message that I was trying to debug originally: Compile error in foreach statement at line 18: Compile error in procedure invocation at line 19: variable a is not writeable in this scope (the line numbers here dont match the source code in the URL filed here because Ive changed things since then in the course of debugging). ------- Comment #3 From Ben Clifford 2008-07-27 11:27 [reply] ------- The attached code compiles and runs without error for me. Note that there appears to be no variable 'a' referred to in that source file at all. If you continue to have problems, please attach code and error messages that actually belong together. From benc at hawaga.org.uk Sun Jul 27 11:59:35 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Sun, 27 Jul 2008 16:59:35 +0000 (GMT) Subject: [Swift-devel] [Bug 151] Swift gives null pointer exception In-Reply-To: <488CA7D2.50203@mcs.anl.gov> References: <20080727162751.2788016469@foxtrot.mcs.anl.gov> <488CA7D2.50203@mcs.anl.gov> Message-ID: On Sun, 27 Jul 2008, Michael Wilde wrote: > Whats the Swift logic regarding recompilation if it finds an existing .kml > file? Does that logic get fooled in some cases? Its based on the modification dates of the .kml and .swift files. The code is in the compile method of src//org/griphyn/vdl/karajan/Loader.java The two ways in which this commonly breaks are if the Swift version changes (because it doesn't force a recompile when versions change) or if you touch the kml file (eg by loading it into a text editor and causing a save to happen) -- From wilde at mcs.anl.gov Sun Jul 27 14:10:20 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sun, 27 Jul 2008 14:10:20 -0500 Subject: [Swift-devel] Problems running coaster Message-ID: <488CC81C.7030205@mcs.anl.gov> I got errors trying coaster both on the abe site on teragrid and locally. Im using swift rev 2148 For both, I see in the log a message like: DEBUG vdl:execute2 APPLICATION_EXCEPTION jobid=echo-aa3vo3xi - Application exception: Could not submit job Do you see whats wrong here? I will continue to debug in the meantime. Info below. Thanks, - Mike The abe log is *l4.log (letter L), the local one is *ha.log. The local one got a bit further, in that I see in the log the GETs of the jar files in the bootstrap process. The Swift script is: type file; (file t) echo (string s) { app { echo "the string is" s stdout=@filename(t); } } file outfile <"echo_000.txt">; string words[] = ["s000","s001","s002"]; outfile = echo(words[0]); (testing one echo call before I try a loop) The local sites entry is: /home/wilde/swiftwork The abe sites entry is: 4 /u/ac/wilde/swiftwork TG-CCR080002N tc.data has: localhost echo /bin/echo INSTALLED INTEL32::LINUX null ... abe echo /bin/echo INSTALLED INTEL32::LINUX null All the files and logs are attached. -------------- next part -------------- A non-text attachment was scrubbed... 
Name: cprob1.tar.gz Type: application/x-gzip Size: 10080 bytes Desc: not available URL: From wilde at mcs.anl.gov Sun Jul 27 14:13:36 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sun, 27 Jul 2008 14:13:36 -0500 Subject: [Swift-devel] Problems running coaster In-Reply-To: <488CC81C.7030205@mcs.anl.gov> References: <488CC81C.7030205@mcs.anl.gov> Message-ID: <488CC8E0.2050305@mcs.anl.gov> forgot to mention: this was run from communicado:/home/wilde/ctest - Mike On 7/27/08 2:10 PM, Michael Wilde wrote: > I got errors trying coaster both on the abe site on teragrid and locally. > > Im using swift rev 2148 > > For both, I see in the log a message like: > > DEBUG vdl:execute2 APPLICATION_EXCEPTION jobid=echo-aa3vo3xi - > Application exception: Could not submit job > > Do you see whats wrong here? I will continue to debug in the meantime. > > Info below. > > Thanks, > > - Mike > > > > The abe log is *l4.log (letter L), the local one is *ha.log. > > The local one got a bit further, in that I see in the log the GETs of > the jar files in the bootstrap process. > > The Swift script is: > > type file; > > (file t) echo (string s) { > app { > echo "the string is" s stdout=@filename(t); > } > } > file outfile <"echo_000.txt">; > string words[] = ["s000","s001","s002"]; > outfile = echo(words[0]); > > (testing one echo call before I try a loop) > > The local sites entry is: > > > > url="localhost" /> > /home/wilde/swiftwork > > > The abe sites entry is: > > > jobManager="gt2:pbs" /> > 4 > > /u/ac/wilde/swiftwork > TG-CCR080002N > > > > > > > > > > tc.data has: > > localhost echo /bin/echo INSTALLED > INTEL32::LINUX null > ... > abe echo /bin/echo INSTALLED INTEL32::LINUX > null > > All the files and logs are attached. > > > > > > ------------------------------------------------------------------------ > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From benc at hawaga.org.uk Sun Jul 27 14:44:39 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Sun, 27 Jul 2008 19:44:39 +0000 (GMT) Subject: [Swift-devel] Problems running coaster In-Reply-To: <488CC8E0.2050305@mcs.anl.gov> References: <488CC81C.7030205@mcs.anl.gov> <488CC8E0.2050305@mcs.anl.gov> Message-ID: I don't see the original message for this so I can't see the logs. The list software used to filter messages with large attachments (and hopefully still does). What cog version do you have? Something like r2066 fixes a bug with walltimes that was breaking coasters. -- From benc at hawaga.org.uk Sun Jul 27 14:51:32 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Sun, 27 Jul 2008 19:51:32 +0000 (GMT) Subject: [Swift-devel] Problems running coaster In-Reply-To: <488CC81C.7030205@mcs.anl.gov> References: <488CC81C.7030205@mcs.anl.gov> Message-ID: I got the logs eventually. On the Abe log, I see this error: Caused by: org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Task ended before registration was received. STDOUT: This node is in dedicated user mode. The string 'This node is in dedicated user mode.' is coming from something outside of Swift, perhaps the local scheduler getting upset by coasters. Could you submit successfully without coasters at the same time (within a few minutes) of not being able to submit with coasters? 
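Ben's check can be made concrete from the submit side with a plain GRAM2 job through the same PBS jobmanager, bypassing coasters entirely; if this succeeds while the coaster run fails, the problem is in the coaster startup path rather than the site as a whole. The gatekeeper contact string below is a guess at Abe's, not quoted from the logs:

===
# control experiment: GRAM2 + PBS jobmanager, no coasters
globus-job-run grid-abe.ncsa.teragrid.org/jobmanager-pbs /bin/date
===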
For the localhost run, have a look in your home directory root for coaster and coaster worker log files that were generated at the same time as you did that run and send those / look in them. -- From hategan at mcs.anl.gov Sun Jul 27 14:57:07 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 27 Jul 2008 14:57:07 -0500 Subject: [Swift-devel] Problems running coaster In-Reply-To: References: <488CC81C.7030205@mcs.anl.gov> <488CC8E0.2050305@mcs.anl.gov> Message-ID: <1217188627.30700.4.camel@localhost> On Sun, 2008-07-27 at 19:44 +0000, Ben Clifford wrote: > I don't see the original message for this so I can't see the logs. The > list software used to filter messages with large attachments (and > hopefully still does). > > What cog version do you have? Something like r2066 fixes a bug with > walltimes that was breaking coasters. Caused by: org...task.TaskSubmissionException: Could not submit job Caused by: org...task.TaskSubmissionException: Could not start coaster service Caused by: org...task.TaskSubmissionException: Task ended before registration was received. STDOUT: STDERR: Caused by: org...execution.JobException: Job failed with an exit code of 1 Looks like the workers don't quite start properly. There should be some worker logs in ~/workerxyz.log on the remote machine that may provide some details. From benc at hawaga.org.uk Sun Jul 27 14:58:36 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Sun, 27 Jul 2008 19:58:36 +0000 (GMT) Subject: [Swift-devel] multiple coaster workers per node Message-ID: cog svn r2094 introduces a profile property coastersPerNode which allows you to spawn multiple coaster workers on a node. this should allow you to take advantage of sites which have multicore CPUs but allocate the whole node, rather than an individual core, when a job is submitted. When using coasters, add this to the site definition: 5 to get eg 5 workers on each node. -- From tiberius at ci.uchicago.edu Sun Jul 27 15:40:30 2008 From: tiberius at ci.uchicago.edu (Tiberiu Stef-Praun) Date: Sun, 27 Jul 2008 15:40:30 -0500 Subject: [Swift-devel] Some observations Message-ID: I was trying to read into swift the contents of a file which contained a float (e.g. 0.415599405693). It has been suggested that I use readData. If did not work (some error about unable to cast to java.lang.Integer) whatever output type I was using: float x=readData(file); int x=readData(file); string x=readData(file); However, completely unexpectedly, it worked with @extractint(file), and it even returned the correct float value. This is abit confusing, but at least I got my problem solved. Tibi PS: it would be really-really good to have swift work with cygwin. -- Tiberiu (Tibi) Stef-Praun, PhD Computational Sciences Researcher Computation Institute 5640 S. Ellis Ave, #405 University of Chicago http://www-unix.mcs.anl.gov/~tiberius/ From wilde at mcs.anl.gov Sun Jul 27 18:20:59 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sun, 27 Jul 2008 18:20:59 -0500 Subject: [Swift-devel] Problems running coaster In-Reply-To: References: <488CC81C.7030205@mcs.anl.gov> <488CC8E0.2050305@mcs.anl.gov> Message-ID: <488D02DB.7090204@mcs.anl.gov> On 7/27/08 2:44 PM, Ben Clifford wrote: > I don't see the original message for this so I can't see the logs. It seemed to go through - I got the message from the list. But if it didnt get to you, the files are on the CI net at ~wilde/coast/crob1, and the text is below. > The > list software used to filter messages with large attachments (and > hopefully still does). 
the file was 10K bytes > What cog version do you have? 2093 Something like r2066 fixes a bug with > walltimes that was breaking coasters. > -------- Original Message -------- Subject: [Swift-devel] Problems running coaster Date: Sun, 27 Jul 2008 14:10:20 -0500 From: Michael Wilde To: swift-devel I got errors trying coaster both on the abe site on teragrid and locally. Im using swift rev 2148 For both, I see in the log a message like: DEBUG vdl:execute2 APPLICATION_EXCEPTION jobid=echo-aa3vo3xi - Application exception: Could not submit job Do you see whats wrong here? I will continue to debug in the meantime. Info below. Thanks, - Mike The abe log is *l4.log (letter L), the local one is *ha.log. The local one got a bit further, in that I see in the log the GETs of the jar files in the bootstrap process. The Swift script is: type file; (file t) echo (string s) { app { echo "the string is" s stdout=@filename(t); } } file outfile <"echo_000.txt">; string words[] = ["s000","s001","s002"]; outfile = echo(words[0]); (testing one echo call before I try a loop) The local sites entry is: /home/wilde/swiftwork The abe sites entry is: 4 /u/ac/wilde/swiftwork TG-CCR080002N tc.data has: localhost echo /bin/echo INSTALLED INTEL32::LINUX null ... abe echo /bin/echo INSTALLED INTEL32::LINUX null All the files and logs are attached. From wilde at mcs.anl.gov Sun Jul 27 22:50:47 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sun, 27 Jul 2008 22:50:47 -0500 Subject: [Swift-devel] Problems running coaster In-Reply-To: References: <488CC81C.7030205@mcs.anl.gov> Message-ID: <488D4217.2010006@mcs.anl.gov> On 7/27/08 2:51 PM, Ben Clifford wrote: > I got the logs eventually. > > On the Abe log, I see this error: > > Caused by: > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > Task ended before registration was received. > STDOUT: This node is in dedicated user mode. > > > > The string 'This node is in dedicated user mode.' is coming from > something outside of Swift, perhaps the local scheduler getting upset by > coasters. Could you submit successfully without coasters at the same time > (within a few minutes) of not being able to submit with coasters? Yes. Just *before* I tried abe with coasters, I did a simple globus-job-run to its pbs jobmanager. That worked fine. Ive sent ticket to TG Help asking if they recognize the "dedicated" message. Is the coaster server started with any special GRAM attributes, that I could provide to globus-job-run or globusrun to try to re-create the problem? > For the localhost run, have a look in your home directory root for coaster > and coaster worker log files that were generated at the same time as you > did that run and send those / look in them. I found the localhost problem in these logs - I didnt realize I needed a grid proxy for localhost coaster runs. I made one, and that works now. From hategan at mcs.anl.gov Sun Jul 27 23:07:51 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 27 Jul 2008 23:07:51 -0500 Subject: [Swift-devel] Problems running coaster In-Reply-To: <488D4217.2010006@mcs.anl.gov> References: <488CC81C.7030205@mcs.anl.gov> <488D4217.2010006@mcs.anl.gov> Message-ID: <1217218071.19068.2.camel@localhost> There's something I'm missing here. What sites.xml file are you using? On Sun, 2008-07-27 at 22:50 -0500, Michael Wilde wrote: > On 7/27/08 2:51 PM, Ben Clifford wrote: > > I got the logs eventually. 
> > > > On the Abe log, I see this error: > > > > Caused by: > > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > > Task ended before registration was received. > > STDOUT: This node is in dedicated user mode. > > > > > > > > The string 'This node is in dedicated user mode.' is coming from > > something outside of Swift, perhaps the local scheduler getting upset by > > coasters. Could you submit successfully without coasters at the same time > > (within a few minutes) of not being able to submit with coasters? > > Yes. Just *before* I tried abe with coasters, I did a simple > globus-job-run to its pbs jobmanager. That worked fine. > > Ive sent ticket to TG Help asking if they recognize the "dedicated" message. > > Is the coaster server started with any special GRAM attributes, that I > could provide to globus-job-run or globusrun to try to re-create the > problem? > > > For the localhost run, have a look in your home directory root for coaster > > and coaster worker log files that were generated at the same time as you > > did that run and send those / look in them. > > I found the localhost problem in these logs - I didnt realize I needed a > grid proxy for localhost coaster runs. I made one, and that works now. > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From benc at hawaga.org.uk Sun Jul 27 23:14:20 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 28 Jul 2008 04:14:20 +0000 (GMT) Subject: [Swift-devel] Problems running coaster In-Reply-To: <488D4217.2010006@mcs.anl.gov> References: <488CC81C.7030205@mcs.anl.gov> <488D4217.2010006@mcs.anl.gov> Message-ID: On Sun, 27 Jul 2008, Michael Wilde wrote: > I found the localhost problem in these logs - I didnt realize I needed a grid > proxy for localhost coaster runs. I made one, and that works now. Yes, that is poorly reported at the moment. -- From wilde at mcs.anl.gov Sun Jul 27 23:16:22 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sun, 27 Jul 2008 23:16:22 -0500 Subject: [Swift-devel] Problems running coaster In-Reply-To: <1217218071.19068.2.camel@localhost> References: <488CC81C.7030205@mcs.anl.gov> <488D4217.2010006@mcs.anl.gov> <1217218071.19068.2.camel@localhost> Message-ID: <488D4816.2010900@mcs.anl.gov> On 7/27/08 11:07 PM, Mihael Hategan wrote: > There's something I'm missing here. What sites.xml file are you using? 4 /u/ac/wilde/swiftwork TG-MCA01S018 When I run to localhost I get a coaster-boot logfile in my home dir on the submit host (swift host). When I run to abe I dont get such a log. Is there something I can turn on in the coaster bootstrap phase to get more logging? Is there anything special done in the gram request to start the coaster service that is unusual and may not work on abe? - Mike > > On Sun, 2008-07-27 at 22:50 -0500, Michael Wilde wrote: >> On 7/27/08 2:51 PM, Ben Clifford wrote: >>> I got the logs eventually. >>> >>> On the Abe log, I see this error: >>> >>> Caused by: >>> org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: >>> Task ended before registration was received. >>> STDOUT: This node is in dedicated user mode. >>> >>> >>> >>> The string 'This node is in dedicated user mode.' is coming from >>> something outside of Swift, perhaps the local scheduler getting upset by >>> coasters. Could you submit successfully without coasters at the same time >>> (within a few minutes) of not being able to submit with coasters? >> Yes. 
Just *before* I tried abe with coasters, I did a simple >> globus-job-run to its pbs jobmanager. That worked fine. >> >> Ive sent ticket to TG Help asking if they recognize the "dedicated" message. >> >> Is the coaster server started with any special GRAM attributes, that I >> could provide to globus-job-run or globusrun to try to re-create the >> problem? >> >>> For the localhost run, have a look in your home directory root for coaster >>> and coaster worker log files that were generated at the same time as you >>> did that run and send those / look in them. >> I found the localhost problem in these logs - I didnt realize I needed a >> grid proxy for localhost coaster runs. I made one, and that works now. >> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From benc at hawaga.org.uk Sun Jul 27 23:17:04 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 28 Jul 2008 04:17:04 +0000 (GMT) Subject: [Swift-devel] Some observations In-Reply-To: References: Message-ID: can't be *completely* unexpectedly, as I think I told you @extractint would work for this some time ago. our present handling of numerical types is lame. I'll look at tidyups. In the meantime, filing this message as a bug would be a nice thing to do. On Sun, 27 Jul 2008, Tiberiu Stef-Praun wrote: > I was trying to read into swift the contents of a file which contained > a float (e.g. 0.415599405693). > It has been suggested that I use readData. > If did not work (some error about unable to cast to java.lang.Integer) > whatever output type I was using: > float x=readData(file); > int x=readData(file); > string x=readData(file); > > However, completely unexpectedly, it worked with @extractint(file), > and it even returned the correct float value. > > This is abit confusing, but at least I got my problem solved. > > Tibi > > PS: it would be really-really good to have swift work with cygwin. > > From benc at hawaga.org.uk Sun Jul 27 23:19:31 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 28 Jul 2008 04:19:31 +0000 (GMT) Subject: [Swift-devel] Problems running coaster In-Reply-To: <488D4816.2010900@mcs.anl.gov> References: <488CC81C.7030205@mcs.anl.gov> <488D4217.2010006@mcs.anl.gov> <1217218071.19068.2.camel@localhost> <488D4816.2010900@mcs.anl.gov> Message-ID: On Sun, 27 Jul 2008, Michael Wilde wrote: > jobManager="gt2:pbs" /> Try gt2:gt2:pbs -- From benc at hawaga.org.uk Sun Jul 27 23:27:40 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 28 Jul 2008 04:27:40 +0000 (GMT) Subject: [Swift-devel] Problems running coaster In-Reply-To: References: <488CC81C.7030205@mcs.anl.gov> <488D4217.2010006@mcs.anl.gov> <1217218071.19068.2.camel@localhost> <488D4816.2010900@mcs.anl.gov> Message-ID: On Mon, 28 Jul 2008, Ben Clifford wrote: > > jobManager="gt2:pbs" /> > > Try gt2:gt2:pbs In more detail: this field, in the case of condor, encodes a lot of information in a not-so-obvious way: a:b[:c] a = cog provider to use to submit the remote headnode job b = cog provider that the remote headnode job will use to submit workers c = jobmanager to be used by cog provider b There happens to be a cog provider called pbs that doesn't work so well. That is what you were specifying. gt2:gt2:pbs specifies gt2 for both situations, using the pbs jobmanger for the worker node submissions in gram2. 
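Since the archiver stripped the XML out of the sites entries quoted earlier in this thread, here is a sketch of how the three-part string sits in a sites.xml pool entry. The element names follow the usual Swift sites-file layout; the gatekeeper URL is a guess, and the profile key carrying Mike's "4" is shown as coastersPerNode only as an example - his actual key is not recoverable from the archive:

===
<pool handle="abe">
  <!-- a = gt2 submits the remote headnode job; b:c = gt2 with the
       pbs jobmanager for the worker submissions -->
  <execution provider="coaster" jobManager="gt2:gt2:pbs"
             url="grid-abe.ncsa.teragrid.org" />
  <profile namespace="globus" key="coastersPerNode">4</profile>
  <workdirectory>/u/ac/wilde/swiftwork</workdirectory>
  <profile namespace="globus" key="project">TG-MCA01S018</profile>
</pool>
===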
-- From wilde at mcs.anl.gov Mon Jul 28 08:27:51 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 28 Jul 2008 08:27:51 -0500 Subject: [Swift-devel] Problems running coaster In-Reply-To: References: <488CC81C.7030205@mcs.anl.gov> <488D4217.2010006@mcs.anl.gov> <1217218071.19068.2.camel@localhost> <488D4816.2010900@mcs.anl.gov> Message-ID: <488DC957.3070802@mcs.anl.gov> I tried jobManager="gt2:gt2:pbs" but still get the same error. Note that each time this fails I also see this in the log: -- 2008-07-28 08:22:58,853-0500 INFO ServiceManager Service task Task(type=JOB_SUBMISSION, identity=urn:cog-1217251364678) terminated. Removing service. 2008-07-28 08:22:58,853-0500 INFO ServiceManager Service does not appear to be registered with this manager -- Does that indicate a problem? Other notes: There was no coaster log in my home dir on the submit host (communicado). Should there be, for remote execution? Or will that log show up on the remote site, where the coaster service is run? There was no gram log on abe to indicate that a job was started there. It seems like the initial job that should run on abe to start the coaster service is failing. What piece of code creates that job? Full logs are on CI net at ~wilde/coast/run5 sites.xml was: 4 /u/ac/wilde/swiftwork TG-MCA01S018 error was same: 2008-07-28 08:22:46,936-0500 INFO vdl:dostagein START jobid=echo-28cx26xi - Staging in files 2008-07-28 08:22:46,936-0500 INFO vdl:dostagein END jobid=echo-28cx26xi - Staging in finished 2008-07-28 08:22:46,937-0500 DEBUG vdl:execute2 JOB_START jobid=echo-28cx26xi tr=echo arguments=[the string is, s000] tmpdir=ctest-20080728-0822-23q64s0d/jobs/2/echo-28cx26xi host=abe 2008-07-28 08:22:46,952-0500 DEBUG WeightedHostScoreScheduler multiplyScore(abe:0.000(1.000):1/5 overload: 0, -0.2) 2008-07-28 08:22:46,952-0500 DEBUG WeightedHostScoreScheduler Old score: 0.000, new score: -0.200 2008-07-28 08:22:46,957-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, identity=urn:0-1-1-1217251364677) setting status to Submitting 2008-07-28 08:22:46,971-0500 INFO LocalService Started local service: 128.135.125.17:50000 2008-07-28 08:22:46,979-0500 INFO BootstrapService Socket bound. URL is http://128.135.125.17:50001 2008-07-28 08:22:47,008-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, identity=urn:cog-1217251364678) setting status to Submitting 2008-07-28 08:22:48,032-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, identity=urn:cog-1217251364678) setting status to Submitted 2008-07-28 08:22:48,389-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, identity=urn:cog-1217251364678) setting status to Active 2008-07-28 08:22:58,853-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, identity=urn:cog-1217251364678) setting status to Completed 2008-07-28 08:22:58,853-0500 INFO ServiceManager Service task Task(type=JOB_SUBMISSION, identity=urn:cog-1217251364678) terminated. Removing service. 2008-07-28 08:22:58,853-0500 INFO ServiceManager Service does not appear to be registered with this manager 2008-07-28 08:22:59,055-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, identity=urn:0-1-1-1217251364677) setting status to Submitted 2008-07-28 08:22:59,056-0500 DEBUG WeightedHostScoreScheduler Submission time for Task(type=JOB_SUBMISSION, identity=urn:0-1-1-1217251364677): 12098ms. 
Score delta: -0.05947692307692308 2008-07-28 08:22:59,056-0500 DEBUG WeightedHostScoreScheduler multiplyScore(abe:-0.200(0.889):1/4 overload: 0, -0.05947692307692308) 2008-07-28 08:22:59,056-0500 DEBUG WeightedHostScoreScheduler Old score: -0.200, new score: -0.259 2008-07-28 08:22:59,056-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, identity=urn:0-1-1-1217251364677) setting status to Active 2008-07-28 08:22:59,057-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, identity=urn:0-1-1-1217251364677) setting status to Failed Could not submit job 2008-07-28 08:22:59,061-0500 DEBUG vdl:execute2 APPLICATION_EXCEPTION jobid=echo-28cx26xi - Application exception: Could not submit job vdl:execute @ vdl-int.k, line: 395 sys:sequential @ vdl-int.k, line: 387 ... rlog:restartlog @ ctest.kml, line: 66 kernel:project @ ctest.kml, line: 2 ctest-20080728-0822-23q64s0d Caused by: org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Could not submit job Caused by: org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Could not start coaster service Caused by: org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Task ended before registration was received. STDOUT: This node is in dedicated user mode. STDERR: null On 7/27/08 11:27 PM, Ben Clifford wrote: > On Mon, 28 Jul 2008, Ben Clifford wrote: > >>> jobManager="gt2:pbs" /> >> Try gt2:gt2:pbs > > In more detail: this field, in the case of condor, encodes a lot of > information in a not-so-obvious way: > > a:b[:c] > > a = cog provider to use to submit the remote headnode job > b = cog provider that the remote headnode job will use to submit workers > c = jobmanager to be used by cog provider b > > There happens to be a cog provider called pbs that doesn't work so well. > That is what you were specifying. > > gt2:gt2:pbs specifies gt2 for both situations, using the pbs jobmanger for > the worker node submissions in gram2. > From hategan at mcs.anl.gov Mon Jul 28 09:17:06 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 28 Jul 2008 09:17:06 -0500 Subject: [Swift-devel] Problems running coaster In-Reply-To: <488DC957.3070802@mcs.anl.gov> References: <488CC81C.7030205@mcs.anl.gov> <488D4217.2010006@mcs.anl.gov> <1217218071.19068.2.camel@localhost> <488D4816.2010900@mcs.anl.gov> <488DC957.3070802@mcs.anl.gov> Message-ID: <1217254626.19954.0.camel@localhost> On Mon, 2008-07-28 at 08:27 -0500, Michael Wilde wrote: > I tried jobManager="gt2:gt2:pbs" but still get the same error. The error seems to be related to the fork job that tries to start the service. Do they disallow fork jobs? > > Note that each time this fails I also see this in the log: > -- > 2008-07-28 08:22:58,853-0500 INFO ServiceManager Service task > Task(type=JOB_SUBMISSION, identity=urn:cog-1217251364678) terminated. > Removing service. > 2008-07-28 08:22:58,853-0500 INFO ServiceManager Service does not > appear to be registered with this manager > -- > > Does that indicate a problem? > > Other notes: > > There was no coaster log in my home dir on the submit host > (communicado). Should there be, for remote execution? Or will that log > show up on the remote site, where the coaster service is run? > > There was no gram log on abe to indicate that a job was started there. > It seems like the initial job that should run on abe to start the > coaster service is failing. What piece of code creates that job? 
> > Full logs are on CI net at ~wilde/coast/run5 > > sites.xml was: > > > > jobManager="gt2:gt2:pbs" /> > 4 > > /u/ac/wilde/swiftwork > TG-MCA01S018 > > > > > > > > > error was same: > > 2008-07-28 08:22:46,936-0500 INFO vdl:dostagein START > jobid=echo-28cx26xi - Staging in files > 2008-07-28 08:22:46,936-0500 INFO vdl:dostagein END jobid=echo-28cx26xi > - Staging in finished > 2008-07-28 08:22:46,937-0500 DEBUG vdl:execute2 JOB_START > jobid=echo-28cx26xi tr=echo arguments=[the string is, s000] > tmpdir=ctest-20080728-0822-23q64s0d/jobs/2/echo-28cx26xi host=abe > 2008-07-28 08:22:46,952-0500 DEBUG WeightedHostScoreScheduler > multiplyScore(abe:0.000(1.000):1/5 overload: 0, -0.2) > 2008-07-28 08:22:46,952-0500 DEBUG WeightedHostScoreScheduler Old score: > 0.000, new score: -0.200 > 2008-07-28 08:22:46,957-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, > identity=urn:0-1-1-1217251364677) setting status to Submitting > 2008-07-28 08:22:46,971-0500 INFO LocalService Started local service: > 128.135.125.17:50000 > 2008-07-28 08:22:46,979-0500 INFO BootstrapService Socket bound. URL is > http://128.135.125.17:50001 > 2008-07-28 08:22:47,008-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, > identity=urn:cog-1217251364678) setting status to Submitting > 2008-07-28 08:22:48,032-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, > identity=urn:cog-1217251364678) setting status to Submitted > 2008-07-28 08:22:48,389-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, > identity=urn:cog-1217251364678) setting status to Active > 2008-07-28 08:22:58,853-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, > identity=urn:cog-1217251364678) setting status to Completed > 2008-07-28 08:22:58,853-0500 INFO ServiceManager Service task > Task(type=JOB_SUBMISSION, identity=urn:cog-1217251364678) terminated. > Removing service. > 2008-07-28 08:22:58,853-0500 INFO ServiceManager Service does not > appear to be registered with this manager > 2008-07-28 08:22:59,055-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, > identity=urn:0-1-1-1217251364677) setting status to Submitted > 2008-07-28 08:22:59,056-0500 DEBUG WeightedHostScoreScheduler Submission > time for Task(type=JOB_SUBMISSION, identity=urn:0-1-1-1217251364677): > 12098ms. Score delta: -0.05947692307692308 > 2008-07-28 08:22:59,056-0500 DEBUG WeightedHostScoreScheduler > multiplyScore(abe:-0.200(0.889):1/4 overload: 0, -0.05947692307692308) > 2008-07-28 08:22:59,056-0500 DEBUG WeightedHostScoreScheduler Old score: > -0.200, new score: -0.259 > 2008-07-28 08:22:59,056-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, > identity=urn:0-1-1-1217251364677) setting status to Active > 2008-07-28 08:22:59,057-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, > identity=urn:0-1-1-1217251364677) setting status to Failed Could not > submit job > 2008-07-28 08:22:59,061-0500 DEBUG vdl:execute2 APPLICATION_EXCEPTION > jobid=echo-28cx26xi - Application exception: Could not submit job > vdl:execute @ vdl-int.k, line: 395 > sys:sequential @ vdl-int.k, line: 387 > ... > rlog:restartlog @ ctest.kml, line: 66 > kernel:project @ ctest.kml, line: 2 > ctest-20080728-0822-23q64s0d > Caused by: > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > Could not submit job > Caused by: > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > Could not start coaster service > Caused by: > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > Task ended before registration was received. > STDOUT: This node is in dedicated user mode. 
> > STDERR: null > > > On 7/27/08 11:27 PM, Ben Clifford wrote: > > On Mon, 28 Jul 2008, Ben Clifford wrote: > > > >>> jobManager="gt2:pbs" /> > >> Try gt2:gt2:pbs > > > > In more detail: this field, in the case of condor, encodes a lot of > > information in a not-so-obvious way: > > > > a:b[:c] > > > > a = cog provider to use to submit the remote headnode job > > b = cog provider that the remote headnode job will use to submit workers > > c = jobmanager to be used by cog provider b > > > > There happens to be a cog provider called pbs that doesn't work so well. > > That is what you were specifying. > > > > gt2:gt2:pbs specifies gt2 for both situations, using the pbs jobmanger for > > the worker node submissions in gram2. > > From benc at hawaga.org.uk Mon Jul 28 09:31:06 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 28 Jul 2008 14:31:06 +0000 (GMT) Subject: [Swift-devel] Problems running coaster In-Reply-To: <488D4217.2010006@mcs.anl.gov> References: <488CC81C.7030205@mcs.anl.gov> <488D4217.2010006@mcs.anl.gov> Message-ID: On Sun, 27 Jul 2008, Michael Wilde wrote: > Yes. Just *before* I tried abe with coasters, I did a simple globus-job-run to > its pbs jobmanager. That worked fine. try submitting with the following changes to sites.xml: 1. not using coasters, instead using the gt2 provider 2. not specifying tg allocation in sites.xml -- From benc at hawaga.org.uk Mon Jul 28 09:40:20 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 28 Jul 2008 14:40:20 +0000 (GMT) Subject: [Swift-devel] Problems running coaster In-Reply-To: <488D4816.2010900@mcs.anl.gov> References: <488CC81C.7030205@mcs.anl.gov> <488D4217.2010006@mcs.anl.gov> <1217218071.19068.2.camel@localhost> <488D4816.2010900@mcs.anl.gov> Message-ID: On Sun, 27 Jul 2008, Michael Wilde wrote: > TG-MCA01S018 Can you confirm the existence of this project? On TGUC the closest I see is: UC-MCA01S018: Computational Studies of Complex Processes in Biological Macromolecular Systems which is a different ID and possibly UC-specific. -- From benc at hawaga.org.uk Mon Jul 28 09:43:53 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 28 Jul 2008 14:43:53 +0000 (GMT) Subject: [Swift-devel] Problems running coaster In-Reply-To: References: <488CC81C.7030205@mcs.anl.gov> <488D4217.2010006@mcs.anl.gov> <1217218071.19068.2.camel@localhost> <488D4816.2010900@mcs.anl.gov> Message-ID: also, don't: > 4 say that if you are using gram2 (even coasters in gram2) -- From wilde at mcs.anl.gov Mon Jul 28 10:06:28 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 28 Jul 2008 10:06:28 -0500 Subject: [Swift-devel] Problems running coaster In-Reply-To: References: <488CC81C.7030205@mcs.anl.gov> <488D4217.2010006@mcs.anl.gov> <1217218071.19068.2.camel@localhost> <488D4816.2010900@mcs.anl.gov> Message-ID: <488DE074.2000700@mcs.anl.gov> I see the account below on the TG portal when logged in, but I will now test this and your other suggestions. - Mike Simulations of Complex Processes in Biological Macromolecular Systems Project PI: Charge No.: Roux, Benoit TG-MCA01S018 On 7/28/08 9:40 AM, Ben Clifford wrote: > On Sun, 27 Jul 2008, Michael Wilde wrote: > >> TG-MCA01S018 > > Can you confirm the existence of this project? > > On TGUC the closest I see is: > > UC-MCA01S018: Computational Studies of Complex Processes in Biological > Macromolecular Systems > > which is a different ID and possibly UC-specific. 
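(For what it's worth, one crude cross-check I may try - assuming, as on most TG machines, that charge projects show up as Unix groups on the target host; the exact command is a guess on my part:

  globus-job-run grid-abe.ncsa.teragrid.org/jobmanager-fork /usr/bin/id

which would show which project groups the mapped account actually belongs to on abe.)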
From wilde at mcs.anl.gov Mon Jul 28 12:44:02 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 28 Jul 2008 12:44:02 -0500 Subject: [Swift-devel] NCSA-hg servers Message-ID: <488E0562.4070702@mcs.anl.gov> I see that on mercury you specified gt2-gridftp-hg.ncsa.teragrid.org (in the sample test sites file) but did not specify the alternate port 2812 that the TG siteinfo for that system suggests: GridFTP Host Name (Globus 4.0) gridftp-hg.ncsa.teragrid.org GridFTP Port (Globus 4.0) 2811 GridFTP Host Name (Globus 2.4.3) gt2-gridftp-hg.ncsa.teragrid.org GridFTP Port (Globus 2.4.3) 2812 Yet gridftp seems to work as you have it: gt2-gridftp-hg.ncsa.teragrid.org When I tried gridftp-hg.ncsa.teragrid.org, it worked the first time, although with an unexpected, lengthy delay (seemed about 15-30 seconds), but when I retried the same command I got the cert error below. Did you follow up with TG on this to determine the right settings, or whether they have a server problem, or did you just leave it set at whatever worked first? If the latter, I will follow up with TG help to see what the state of that system's grid servers is supposed to be. It's also possible that the recurring cert problem on communicado has come back. - Mike communicado$ globus-url-copy gsiftp://gridftp-hg.ncsa.teragrid.org/etc/group file:///tmp/mwt1 communicado$ globus-url-copy gsiftp://gridftp-hg.ncsa.teragrid.org/etc/group file:///tmp/mwt1 error: globus_ftp_control: gss_init_sec_context failed globus_gsi_gssapi: Error with GSI credential globus_sysconfig: Could not find a valid trusted CA certificates directory globus_sysconfig: Error getting password entry for current user: Error occured for uid: 1031 communicado$ globus-url-copy gsiftp://gridftp-hg.ncsa.teragrid.org/etc/group file:///tmp/mwt1 communicado$ head /tmp/mwt1 communicado$ globus-url-copy gsiftp://gt2-gridftp-hg.ncsa.teragrid.org/etc/group file:///tmp/mwt1 communicado$ head /tmp/mwt1 root:x:0: bin:x:1:daemon daemon:x:2: sys:x:3: tty:x:5: disk:x:6: lp:x:7: www:x:8: kmem:x:9: wheel:x:10: communicado$ globus-url-copy gsiftp://gt2-gridftp-hg.ncsa.teragrid.org/etc/group file:///tmp/mwt1
From wilde at mcs.anl.gov Mon Jul 28 11:52:39 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 28 Jul 2008 11:52:39 -0500 Subject: [Swift-devel] Problems running coaster In-Reply-To: References: <488CC81C.7030205@mcs.anl.gov> <488D4217.2010006@mcs.anl.gov> Message-ID: <488DF957.9000803@mcs.anl.gov> On 7/28/08 9:31 AM, Ben Clifford wrote: > On Sun, 27 Jul 2008, Michael Wilde wrote: > >> Yes. Just *before* I tried abe with coasters, I did a simple globus-job-run to >> its pbs jobmanager. That worked fine. > > try submitting with the following changes to sites.xml: > > 1. not using coasters, instead using the gt2 provider > 2. not specifying tg allocation in sites.xml > Using globus-job-run I see that only 1 account is valid. globus-job-run works to jobmanager-pbs, both with and without a -p option specifying the valid account. Swift using the gt2 provider also works with and without the valid account specified in sites.xml as a globus profile. Swift with the coaster provider and gt2:gt2:pbs fails both with and without the valid account (it gets the same "This node is in dedicated user mode" message). In terms of sites.xml: This fails (using coaster provider): /u/ac/wilde/swiftwork This works (using gt2 provider): /u/ac/wilde/swiftwork I will try the same to a different TG site (starting with UC).
- Mike From hategan at mcs.anl.gov Mon Jul 28 13:08:19 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 28 Jul 2008 13:08:19 -0500 Subject: [Swift-devel] Problems running coaster In-Reply-To: <488DF957.9000803@mcs.anl.gov> References: <488CC81C.7030205@mcs.anl.gov> <488D4217.2010006@mcs.anl.gov> <488DF957.9000803@mcs.anl.gov> Message-ID: <1217268499.2147.1.camel@localhost> On Mon, 2008-07-28 at 11:52 -0500, Michael Wilde wrote: > > On 7/28/08 9:31 AM, Ben Clifford wrote: > > On Sun, 27 Jul 2008, Michael Wilde wrote: > > > >> Yes. Just *before* I tried abe with coasters, I did a simple globus-job-run to > >> its pbs jobmanager. That worked fine. > > > > try submitting with the following changes to sites.xml: > > > > 1. not using coasters, instead using the gt2 provider > > 2. not specifying tg allocation in sites.xml > > > > Using globus-job-run I see that only 1 account is valid. > > globus-job-run works to jobmanager-pbs, both with and without a -p > option specifying the valid account. Can you try a more complex fork job? Like, say, java -help? From wilde at mcs.anl.gov Mon Jul 28 13:13:56 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 28 Jul 2008 13:13:56 -0500 Subject: [Swift-devel] Problems running coaster In-Reply-To: <488DF957.9000803@mcs.anl.gov> References: <488CC81C.7030205@mcs.anl.gov> <488D4217.2010006@mcs.anl.gov> <488DF957.9000803@mcs.anl.gov> Message-ID: <488E0C64.1020106@mcs.anl.gov> I tried NCSA-HG instead (to avoid the issues of the UC two-architecture site). The coaster provider works for me there, using gt2:gt2:pbs, as you reported previously. Same Swift script that fails on abe. So it looks like something in the job specs that is launching coaster for gt2:pbs is not being accepted by abe. I also see that these log messages which I mentioned earlier do not occur on the successful mercury coaster run: -- 2008-07-28 08:22:58,853-0500 INFO ServiceManager Service task Task(type=JOB_SUBMISSION, identity=urn:cog-1217251364678) terminated. Removing service. 2008-07-28 08:22:58,853-0500 INFO ServiceManager Service does not appear to be registered with this manager -- If someone can point me closer to where the swift job that launches the bootstrap script is run from, and how, I can try to reproduce the problem with globus-job-run or globus-run. In the meantime I will hunt for that. - Mike On 7/28/08 11:52 AM, Michael Wilde wrote: > > > On 7/28/08 9:31 AM, Ben Clifford wrote: >> On Sun, 27 Jul 2008, Michael Wilde wrote: >> >>> Yes. Just *before* I tried abe with coasters, I did a simple >>> globus-job-run to >>> its pbs jobmanager. That worked fine. >> >> try submitting with the following changes to sites.xml: >> >> 1. not using coasters, instead using the gt2 provider >> 2. not specifying tg allocation in sites.xml >> > > Using globus-job-run I see that only 1 account is valid. > > globus-job-run works to jobmanager-pbs, both with and without a -p > option specifying the valid account. > > Swift using the gt2 provider also works with and without the valid > account specified in sites.xml as a globus profile. 
> > Swift and the coaster provider with gt2:gt2:pbs fails both without the > valid account (it gets the same "This node is in dedicated user mode" > message) > > In terms of sites.xml: > > This fails (using coaster provider): > > > > jobManager="gt2:gt2:pbs" /> > > /u/ac/wilde/swiftwork > > > > This works (using gt2 provider): > > > > url="grid-abe.ncsa.teragrid.org/jobmanager-pbs" major="2"/> > > /u/ac/wilde/swiftwork > > > > I will try the same to a different TG site (starting with UC). > > - Mike > > From hategan at mcs.anl.gov Mon Jul 28 13:22:18 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 28 Jul 2008 13:22:18 -0500 Subject: [Swift-devel] Problems running coaster In-Reply-To: <488E0C64.1020106@mcs.anl.gov> References: <488CC81C.7030205@mcs.anl.gov> <488D4217.2010006@mcs.anl.gov> <488DF957.9000803@mcs.anl.gov> <488E0C64.1020106@mcs.anl.gov> Message-ID: <1217269338.2401.2.camel@localhost> On Mon, 2008-07-28 at 13:13 -0500, Michael Wilde wrote: > I tried NCSA-HG instead (to avoid the issues of the UC two-architecture > site). > > The coaster provider works for me there, using gt2:gt2:pbs, as you > reported previously. Same Swift script that fails on abe. > > So it looks like something in the job specs that is launching coaster > for gt2:pbs is not being accepted by abe. > > I also see that these log messages which I mentioned earlier do not > occur on the successful mercury coaster run: > > -- > 2008-07-28 08:22:58,853-0500 INFO ServiceManager Service task > Task(type=JOB_SUBMISSION, identity=urn:cog-1217251364678) terminated. > Removing service. > 2008-07-28 08:22:58,853-0500 INFO ServiceManager Service does not > appear to be registered with this manager That's ok. That's produced when there's a status change in the task before it gets to register with the service manager. From hategan at mcs.anl.gov Mon Jul 28 13:24:08 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 28 Jul 2008 13:24:08 -0500 Subject: [Swift-devel] Problems running coaster In-Reply-To: <488E0C64.1020106@mcs.anl.gov> References: <488CC81C.7030205@mcs.anl.gov> <488D4217.2010006@mcs.anl.gov> <488DF957.9000803@mcs.anl.gov> <488E0C64.1020106@mcs.anl.gov> Message-ID: <1217269448.2401.4.camel@localhost> On Mon, 2008-07-28 at 13:13 -0500, Michael Wilde wrote: > I tried NCSA-HG instead (to avoid the issues of the UC two-architecture > site). > > The coaster provider works for me there, using gt2:gt2:pbs, as you > reported previously. Same Swift script that fails on abe. > > So it looks like something in the job specs that is launching coaster > for gt2:pbs is not being accepted by abe. That's wrong. As I mentioned before, it's the gt2:fork part that isn't working. You don't have that specified, but fork is always used to start the coaster service on the head node. > > I also see that these log messages which I mentioned earlier do not > occur on the successful mercury coaster run: > > -- > 2008-07-28 08:22:58,853-0500 INFO ServiceManager Service task > Task(type=JOB_SUBMISSION, identity=urn:cog-1217251364678) terminated. > Removing service. 
> 2008-07-28 08:22:58,853-0500 INFO ServiceManager Service does not > appear to be registered with this manager
From wilde at mcs.anl.gov Mon Jul 28 13:37:38 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 28 Jul 2008 13:37:38 -0500 Subject: [Swift-devel] Problems running coaster In-Reply-To: <1217269448.2401.4.camel@localhost> References: <488CC81C.7030205@mcs.anl.gov> <488D4217.2010006@mcs.anl.gov> <488DF957.9000803@mcs.anl.gov> <488E0C64.1020106@mcs.anl.gov> <1217269448.2401.4.camel@localhost> Message-ID: <488E11F2.7020702@mcs.anl.gov> On 7/28/08 1:24 PM, Mihael Hategan wrote: > On Mon, 2008-07-28 at 13:13 -0500, Michael Wilde wrote: >> I tried NCSA-HG instead (to avoid the issues of the UC two-architecture >> site). >> >> The coaster provider works for me there, using gt2:gt2:pbs, as you >> reported previously. Same Swift script that fails on abe. >> >> So it looks like something in the job specs that is launching coaster >> for gt2:pbs is not being accepted by abe. > > That's wrong. As I mentioned before, it's the gt2:fork part that isn't > working. You don't have that specified, but fork is always used to start > the coaster service on the head node. Right - I realize the error is likely coming from the fork-jobmanager path. I meant that whatever is launching the coaster service for gt2:pbs is failing. What's the closest globus-job-run or globusrun command to what the coaster code is doing to launch the server bootstrap? > >> I also see that these log messages which I mentioned earlier do not >> occur on the successful mercury coaster run: >> >> -- >> 2008-07-28 08:22:58,853-0500 INFO ServiceManager Service task >> Task(type=JOB_SUBMISSION, identity=urn:cog-1217251364678) terminated. >> Removing service. >> 2008-07-28 08:22:58,853-0500 INFO ServiceManager Service does not >> appear to be registered with this manager >
From hategan at mcs.anl.gov Mon Jul 28 13:46:26 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 28 Jul 2008 13:46:26 -0500 Subject: [Swift-devel] Problems running coaster In-Reply-To: <488E11F2.7020702@mcs.anl.gov> References: <488CC81C.7030205@mcs.anl.gov> <488D4217.2010006@mcs.anl.gov> <488DF957.9000803@mcs.anl.gov> <488E0C64.1020106@mcs.anl.gov> <1217269448.2401.4.camel@localhost> <488E11F2.7020702@mcs.anl.gov> Message-ID: <1217270786.3034.3.camel@localhost> On Mon, 2008-07-28 at 13:37 -0500, Michael Wilde wrote: > > That's wrong. As I mentioned before, it's the gt2:fork part that isn't > > working. You don't have that specified, but fork is always used to start > > the coaster service on the head node. > > Right - I realize the error is likely coming from the fork-jobmanager > path. I meant that whatever is launching the coaster service for gt2:pbs > is failing. Sorry for the assertive answer. I was in the middle of something and wanted to keep things short. > > What's the closest globus-job-run or globusrun command to what the > coaster code is doing to launch the server bootstrap? It launches a bash job which then tries to launch java. So, ah! Well, bash -l -c 'the script'. I'm guessing at this point that '-l' might be the problem.
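One quick way to test that guess from the submit side would be to run the same thing through the fork jobmanager with and without -l - a sketch, I haven't actually tried it:

  globus-job-run grid-abe.ncsa.teragrid.org/jobmanager-fork /bin/bash -c /bin/hostname
  globus-job-run grid-abe.ncsa.teragrid.org/jobmanager-fork /bin/bash -l -c /bin/hostname

If the first prints a hostname and the second prints the "dedicated user mode" line, then it's the login-shell profile that's getting in the way.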
From wilde at mcs.anl.gov Mon Jul 28 13:50:22 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 28 Jul 2008 13:50:22 -0500 Subject: [Swift-devel] Problems running coaster In-Reply-To: <1217270786.3034.3.camel@localhost> References: <488CC81C.7030205@mcs.anl.gov> <488D4217.2010006@mcs.anl.gov> <488DF957.9000803@mcs.anl.gov> <488E0C64.1020106@mcs.anl.gov> <1217269448.2401.4.camel@localhost> <488E11F2.7020702@mcs.anl.gov> <1217270786.3034.3.camel@localhost> Message-ID: <488E14EE.2070503@mcs.anl.gov> Cool. I can duplicate the problem like this: communicado$ globus-job-run grid-abe.ncsa.teragrid.org/jobmanager-fork /bin/bash -l -c /bin/hostname This node is in dedicated user mode. communicado$ On 7/28/08 1:46 PM, Mihael Hategan wrote: > On Mon, 2008-07-28 at 13:37 -0500, Michael Wilde wrote: > >>> That's wrong. As I mentioned before, it's the gt2:fork part that isn't >>> working. You don't have that specified, but fork is always used to start >>> the coaster service on the head node. >> Right - I realize the error is likely coming from the fork-jobmanager >> path. I meant that whatever is launching the coaster service for gt2:pbs >> is failing. > > Sorry for the assertive answer. I was in the middle of something and > wanted to keep things short. > >> What's the closest globus-job-run or globusrun command to what the >> coaster code is doing to launch the server bootstrap? > > It launches a bash job which then tries to launch java. So, ah! Well, > bash -l -c 'the script'. I'm guessing at this point that '-l' might be > the problem. >
From hategan at mcs.anl.gov Mon Jul 28 14:14:28 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 28 Jul 2008 14:14:28 -0500 Subject: [Swift-devel] I'm off for the rest of the day Message-ID: <1217272468.3792.3.camel@localhost> 773 807 2892 if there are any urgent issues.
From benc at hawaga.org.uk Mon Jul 28 14:32:05 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 28 Jul 2008 19:32:05 +0000 (GMT) Subject: [Swift-devel] Problems running coaster In-Reply-To: <488E0C64.1020106@mcs.anl.gov> References: <488CC81C.7030205@mcs.anl.gov> <488D4217.2010006@mcs.anl.gov> <488DF957.9000803@mcs.anl.gov> <488E0C64.1020106@mcs.anl.gov> Message-ID: On Mon, 28 Jul 2008, Michael Wilde wrote: > So it looks like something in the job specs that is launching coaster for > gt2:pbs is not being accepted by abe. ok. TeraGrid's unified account system is insufficiently unified for me to be able to access abe, but they are aware of that; if and when I am reunified, I'll try this out myself. --
From wilde at mcs.anl.gov Mon Jul 28 19:27:42 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 28 Jul 2008 19:27:42 -0500 Subject: [Swift-devel] including non-standard providers in Swift builds Message-ID: <488E63FE.3000500@mcs.anl.gov> I'm confused about the correct way to include both deef and coaster providers in a build. Specifically: I want to say something like: ant -Dwith-provider-coaster -Dwith-provider-deef redist 1) Is saying -Dwith-provider-coaster the same as saying -Dwith-provider-coaster=true ? 2) Is the right way to add both providers to put on two separate -D options as above? 3) Does redist clean out the whole previous dist dir contents? 4) If repeatedly building the dist after making source changes, do I need to do any "clean" operations, or just "ant redist" with the right -D's? I ask this because the line I show above does not seem to consistently build both providers. Yesterday I was getting deef but no coaster; today I'm getting coaster but no deef. I haven't tried enough experiments (or reading) to figure this out yet.
From hategan at mcs.anl.gov Mon Jul 28 22:50:12 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 28 Jul 2008 22:50:12 -0500 Subject: [Swift-devel] Problems running coaster In-Reply-To: References: <488CC81C.7030205@mcs.anl.gov> <488D4217.2010006@mcs.anl.gov> <488DF957.9000803@mcs.anl.gov> <488E0C64.1020106@mcs.anl.gov> Message-ID: <1217303412.4347.0.camel@localhost> On Mon, 2008-07-28 at 19:32 +0000, Ben Clifford wrote: > On Mon, 28 Jul 2008, Michael Wilde wrote: > > > So it looks like something in the job specs that is launching coaster for > > gt2:pbs is not being accepted by abe. > > ok. TeraGrid's unified account system is insufficiently unified for me to > be able to access abe, but they are aware of that; if and when I am > reunified, I'll try this out myself. Not to be cynical or anything, but that unified thing: never worked.
From hategan at mcs.anl.gov Mon Jul 28 22:54:35 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 28 Jul 2008 22:54:35 -0500 Subject: [Swift-devel] including non-standard providers in Swift builds In-Reply-To: <488E63FE.3000500@mcs.anl.gov> References: <488E63FE.3000500@mcs.anl.gov> Message-ID: <1217303675.4347.6.camel@localhost> On Mon, 2008-07-28 at 19:27 -0500, Michael Wilde wrote: > I'm confused about the correct way to include both deef and coaster > providers in a build. Specifically: > > I want to say something like: > > ant -Dwith-provider-coaster -Dwith-provider-deef redist That's what you're supposed to say. > > 1) Is saying -Dwith-provider-coaster the same as saying > -Dwith-provider-coaster=true ? According to the Apache Ant semantics, and in what I would consider a convenient yet weird way, yes. > > 2) Is the right way to add both providers to put on two separate -D > options as above? Yes. > > 3) Does redist clean out the whole previous dist dir contents? If it calls distclean (which it seems to do) then it cleans the dist dir contents, the build dir contents, and the build dir contents of all dependents. > > 4) If repeatedly building the dist after making source changes, do I > need to do any "clean" operations, or just "ant redist" with the right -D's? redist implies clean and more. So no. > > I ask this because the line I show above does not seem to consistently > build both providers. Yesterday I was getting deef but no coaster; today > I'm getting coaster but no deef. I haven't tried enough experiments (or > reading) to figure this out yet. Try running ant with -q. It gives you only a very short summary of what happens. Then paste that and we'll see (though probably you'll spot the problem before that). > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
From wilde at mcs.anl.gov Mon Jul 28 23:45:45 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 28 Jul 2008 23:45:45 -0500 Subject: [Swift-devel] Problems running coaster In-Reply-To: <1217303412.4347.0.camel@localhost> References: <488CC81C.7030205@mcs.anl.gov> <488D4217.2010006@mcs.anl.gov> <488DF957.9000803@mcs.anl.gov> <488E0C64.1020106@mcs.anl.gov> <1217303412.4347.0.camel@localhost> Message-ID: <488EA079.4000404@mcs.anl.gov> I've moved on, and put a temp hack in to not use -l and instead run "~/.myetcprofile" if it exists and /etc/profile if it doesn't.
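Roughly, the hack amounts to replacing the login-shell invocation with an explicit profile source in the bootstrap - a sketch only; the real change may differ in detail:

  # run in a plain, non-login bash instead of bash -l
  if [ -f "$HOME/.myetcprofile" ]; then
      . "$HOME/.myetcprofile"
  else
      . /etc/profile
  fi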
.myetcprofile on abe is /etc/profile with the problematic code removed. Now abe gets past the problem and runs bootstrap.sh ok. The sequence runs OK up to the point where the service on abe's headnode receives a message to start a job. At this point, the service on abe seems to hang. Comparing to the message sequence on mercury, which works, I see this: *** mercury: [RequestHandler] DEBUG Channel multiplexer - GSSC-null: HND< 2 SUBMITJOB(identity=1217268111318 executable=/bin/bash directory=/home/ncsa/wilde/swiftwork/ctest-20080728-1301-7c4ok42h arg=shared/wrapper.sh arg=echo-myx2e6xi arg=-jobdir arg=m arg=-e arg=/bin/echo arg=-out arg=echo_s000.txt arg=-err arg=stderr.txt arg=-i arg=-d ar) [ChannelManager] DEBUG Channel multiplexer - Looking up -134779b6:11b6ad597e2:-7fff:3598cb3d:11b6ad597b5:-7fffS [ChannelManager] DEBUG Channel multiplexer - Found -134779b6:11b6ad597e2:-7fff:3598cb3d:11b6ad597b5:-7fffS [RequestHandler] DEBUG Channel multiplexer - GSSC-null: HND> 2 SUBMITJOB(urn:1217268111318-1217268128309-1217268128310) [Replier] DEBUG Worker 1 - Replier(GSSC-null)REPL>: tag = 2, fin = true, datalen = 45, data = urn:1217268111318-1217268128309-1217268128310 [WorkerManager] INFO Coaster Queue Processor - No suitable worker found. Attempting to start a new one. [WorkerManager] INFO Worker Manager - Got allocation request: org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager$AllocationRequest at 151ca803 [WorkerManager] INFO Worker Manager - Starting worker with id=-615912369 and maxwalltime=6060s Worker start provider: gt2 Worker start JM: pbs *** abe: [RequestHandler] DEBUG Channel multiplexer - GSSC-null: HND< 2 SUBMITJOB(identity=1217291444315 executable=/bin/bash directory=/u/ac/wilde/swiftwork/ctest-20080728-1930-m5a70lvc arg=shared/wrapper.sh arg=echo-zc5mt6xi arg=-jobdir arg=z arg=-e arg=/bin/echo arg=-out arg=echo_s000.txt arg=-err arg=stderr.txt arg=-i arg=-d arg= ar) [ChannelManager] DEBUG Channel multiplexer - Looking up 17badc64:11b6c39944a:-7fff:f7c31d:11b6c399416:-7fffS [ChannelManager] DEBUG Channel multiplexer - Found 17badc64:11b6c39944a:-7fff:f7c31d:11b6c399416:-7fffS [RequestHandler] DEBUG Channel multiplexer - GSSC-null: HND> 2 SUBMITJOB(urn:1217291444315-1217291458042-1217291458043) [Replier] DEBUG Worker 1 - Replier(GSSC-null)REPL>: tag = 2, fin = true, datalen = 45, data = urn:1217291444315-1217291458042-1217291458043 [WorkerManager] INFO Coaster Queue Processor - No suitable worker found. Attempting to start a new one. [WorkerManager] INFO Worker Manager - Got allocation request: org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager$AllocationRequest at 95cfbe [AbstractKarajanChannel] DEBUG Channel multiplexer - GSSC-null REQ<: tag = 3, fin = true, err = false, datalen = 15, data = SHUTDOWNSERVICE *** I *think* the SHUTDOWNSERVICE message on abe is coming much later, after abe's service hangs, but I'm not sure. What it looks like to me is that what should happen on abe is this: [WorkerManager] INFO Worker Manager - Got allocation request: org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager$AllocationRequest at 151ca803 [WorkerManager] INFO Worker Manager - Starting worker with id=-615912369 and maxwalltime=6060s but on abe the "Worker Manager - Starting worker" is never seen. Looking at WorkerManager.run() it's hard to see how the "Starting worker" message could *not* show up right after "Got allocation request", but there must be some sequence of events that causes this. Abe is an 8-core system.
Is there perhaps more opportunity for a multi-thread race or deadlock that could cause this? I will insert some more debug logging and try a few more times to see if things hang in this manner every time or not. - Mike ps: client logs with abe server-side boot logs are on CI net in ~wilde/coast/run11 On 7/28/08 10:50 PM, Mihael Hategan wrote: > On Mon, 2008-07-28 at 19:32 +0000, Ben Clifford wrote: >> On Mon, 28 Jul 2008, Michael Wilde wrote: >> >>> So it looks like something in the job specs that is launching coaster for >>> gt2:pbs is not being accepted by abe. >> ok. TeraGrid's unified account system is insufficiently unified for me to >> be able to access abe, but they are aware of that; if and when I am >> reunified, I'll try this out myself. > > Not to be cynical or anything, but that unified thing: never worked. >
From wilde at mcs.anl.gov Tue Jul 29 00:06:42 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 29 Jul 2008 00:06:42 -0500 Subject: [Swift-devel] Problems running coaster In-Reply-To: <488EA079.4000404@mcs.anl.gov> References: <488CC81C.7030205@mcs.anl.gov> <488D4217.2010006@mcs.anl.gov> <488DF957.9000803@mcs.anl.gov> <488E0C64.1020106@mcs.anl.gov> <1217303412.4347.0.camel@localhost> <488EA079.4000404@mcs.anl.gov> Message-ID: <488EA562.9020704@mcs.anl.gov> Hmmm. My debug statement didn't print, but this time the job on abe ran ok. Tomorrow I'll run more tests and see how stable it is there, and why my logging calls never showed up. - Mike On 7/28/08 11:45 PM, Michael Wilde wrote: > I've moved on, and put a temp hack in to not use -l and instead run > "~/.myetcprofile" if it exists and /etc/profile if it doesn't. > > [... mercury vs. abe log comparison snipped; quoted in full upthread ...] > > but on abe the "Worker Manager - Starting worker" is never seen. > > Looking at WorkerManager.run() it's hard to see how the "Starting worker" > message could *not* show up right after "Got allocation request", but > there must be some sequence of events that causes this. > > Abe is an 8-core system. Is there perhaps more opportunity for a > multi-thread race or deadlock that could cause this? > > - Mike > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
From benc at hawaga.org.uk Tue Jul 29 01:45:23 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 29 Jul 2008 06:45:23 +0000 (GMT) Subject: [Swift-devel] including non-standard providers in Swift builds In-Reply-To: <488E63FE.3000500@mcs.anl.gov> References: <488E63FE.3000500@mcs.anl.gov> Message-ID: On Mon, 28 Jul 2008, Michael Wilde wrote: > I ask this because the line I show above does not seem to consistently build > both providers. Yesterday I was getting deef but no coaster; today I'm getting > coaster but no deef. I haven't tried enough experiments (or reading) to figure > this out yet. How did you check whether it was building the providers or not? Watching the build messages? Looking at the lib/ directory in the distribution for the appropriate jars? Trying to run Swift using those providers? --
From benc at hawaga.org.uk Tue Jul 29 04:12:02 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 29 Jul 2008 09:12:02 +0000 (GMT) Subject: [Swift-devel] more compile time type checking Message-ID: I just committed Milena's work on compile-time type checking. Based on what happened last time I made changes to the compile-time sanity checking, there will be some things you do or thought you could do in your programs that will now not work. When you discover such, file a bug or post to this list. --
From benc at hawaga.org.uk Tue Jul 29 05:54:45 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 29 Jul 2008 10:54:45 +0000 (GMT) Subject: [Swift-devel] Re: scheduler foo In-Reply-To: <1216757938.18169.6.camel@localhost> References: <1216650659.4064.1.camel@localhost> <1216659517.6481.1.camel@localhost> <1216679706.18694.6.camel@localhost> <1216689540.23025.0.camel@localhost> <1216742264.28239.0.camel@localhost> <1216757938.18169.6.camel@localhost> Message-ID: On Tue, 22 Jul 2008, Mihael Hategan wrote: > A bug in the monitor which prevented the overloaded count from being > properly updated. I fixed that and ran all the tests locally, which > seems to work. This seems better, though in the process of testing with provider-wonky I've seen some behaviour that looks a little suspicious - I'm not sure if that was caused by me poking round with debugging or not, though. More later. I put two tests in misc/ that you need to have compiled -Dwith-provider-wonky to run, called wonky.sh and wonky80.sh. These run single-site local tests with 90% and 80% success rates of jobs. The 90% test usually finishes fine, which is good. The 80% one gets long delays; sometimes I think because of legitimate long delays caused by the scheduler slowing down, but I think sometimes caused by some race condition making the single site be ignored forever.
--
From benc at hawaga.org.uk Tue Jul 29 06:11:49 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 29 Jul 2008 11:11:49 +0000 (GMT) Subject: [Swift-devel] Re: scheduler foo In-Reply-To: References: <1216650659.4064.1.camel@localhost> <1216659517.6481.1.camel@localhost> <1216679706.18694.6.camel@localhost> <1216689540.23025.0.camel@localhost> <1216742264.28239.0.camel@localhost> <1216757938.18169.6.camel@localhost> Message-ID: so with the patch in www.ci.uchicago.edu:~benc/tmp/dbg2 which should ideally only be adding logging and not perturbing behaviour (though I'm sure it does a little bit), I can repeatedly get wonky80.sh to hang after some tens of jobs have been run with this being reported: Progress: Selecting site:984 Finished successfully:13 Failed but can retry:3 Progress: Selecting site:984 Finished successfully:13 Failed but can retry:3 with the log file looking like this: 2008-07-29 13:05:51,648+0200 INFO vdl:execute END_SUCCESS thread=0-363 tr=touch 2008-07-29 13:05:51,680+0200 DEBUG OverloadedHostMonitor Polling 1 hosts 2008-07-29 13:05:51,681+0200 DEBUG WeightedHost In delay mode. score = -0.7590717948717954 tscore = 0.6429374962443759, maxload=1.0 delay since last used=231ms permitted delay=213ms overloaded=false delay-permitted delay=18 2008-07-29 13:05:52,681+0200 DEBUG OverloadedHostMonitor Polling 0 hosts 2008-07-29 13:05:53,681+0200 DEBUG OverloadedHostMonitor Polling 0 hosts 2008-07-29 13:05:54,681+0200 DEBUG OverloadedHostMonitor Polling 0 hosts etc. This looks wrong to me - with 0 hosts being used for execution and 0 hosts in delay state... --
From benc at hawaga.org.uk Tue Jul 29 06:26:30 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 29 Jul 2008 11:26:30 +0000 (GMT) Subject: [Swift-devel] Re: scheduler foo In-Reply-To: References: <1216650659.4064.1.camel@localhost> <1216659517.6481.1.camel@localhost> <1216679706.18694.6.camel@localhost> <1216689540.23025.0.camel@localhost> <1216742264.28239.0.camel@localhost> <1216757938.18169.6.camel@localhost> Message-ID: with dbg3 from the same location applied on top of dbg2, also giving more logging, I see some weirdness with one of the loops in LateBindingScheduler. After the last task completes (or rather the last task that gets launched) I see very, very tight iterations of the scheduler loop: 2008-07-29 13:17:11,661+0200 DEBUG LateBindingScheduler starting bottom loop 2008-07-29 13:17:11,661+0200 DEBUG LateBindingScheduler out of bottom loop 2008-07-29 13:17:11,661+0200 DEBUG LateBindingScheduler starting bottom loop 2008-07-29 13:17:11,661+0200 DEBUG LateBindingScheduler out of bottom loop 2008-07-29 13:17:11,661+0200 DEBUG LateBindingScheduler starting bottom loop 2008-07-29 13:17:11,661+0200 DEBUG LateBindingScheduler out of bottom loop 2008-07-29 13:17:11,661+0200 DEBUG LateBindingScheduler starting bottom loop 2008-07-29 13:17:11,661+0200 DEBUG LateBindingScheduler out of bottom loop 2008-07-29 13:17:11,661+0200 DEBUG LateBindingScheduler starting bottom loop 2008-07-29 13:17:11,661+0200 DEBUG LateBindingScheduler out of bottom loop 2008-07-29 13:17:11,661+0200 DEBUG LateBindingScheduler starting bottom loop 2008-07-29 13:17:11,661+0200 DEBUG LateBindingScheduler out of bottom loop for about 2 seconds, and then: 2008-07-29 13:17:13,763+0200 DEBUG LateBindingScheduler starting bottom loop 2008-07-29 13:17:13,763+0200 DEBUG LateBindingScheduler In bottom loop 2008-07-29 13:17:14,290+0200 DEBUG OverloadedHostMonitor Polling 0 hosts 2008-07-29 13:17:15,290+0200 DEBUG OverloadedHostMonitor Polling 0 hosts 2008-07-29 13:17:15,763+0200 DEBUG LateBindingScheduler In bottom loop 2008-07-29 13:17:16,291+0200 DEBUG OverloadedHostMonitor Polling 0 hosts so the bottom loop and the host monitor are polling once every second or so, as expected, but not managing to detect that there are hosts available. --
From wilde at mcs.anl.gov Tue Jul 29 07:04:04 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 29 Jul 2008 07:04:04 -0500 Subject: [Swift-devel] including non-standard providers in Swift builds In-Reply-To: References: <488E63FE.3000500@mcs.anl.gov> Message-ID: <488F0734.7060007@mcs.anl.gov> On 7/29/08 1:45 AM, Ben Clifford wrote: > On Mon, 28 Jul 2008, Michael Wilde wrote: > >> I ask this because the line I show above does not seem to consistently build >> both providers. Yesterday I was getting deef but no coaster; today I'm getting >> coaster but no deef. I haven't tried enough experiments (or reading) to figure >> this out yet. > > How did you check whether it was building the providers or not? Watching > the build messages? Looking at the lib/ directory in the distribution for > the appropriate jars? Trying to run Swift using those providers? The latter. Essentially, find | egrep 'deef|coaster' from the root of dist would show no jars of one or the other, and it would give a missing provider message at runtime. I'll try the -v and look at the messages and report back.
From benc at hawaga.org.uk Tue Jul 29 07:30:09 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 29 Jul 2008 12:30:09 +0000 (GMT) Subject: [Swift-devel] including non-standard providers in Swift builds In-Reply-To: <488F0734.7060007@mcs.anl.gov> References: <488E63FE.3000500@mcs.anl.gov> <488F0734.7060007@mcs.anl.gov> Message-ID: On Tue, 29 Jul 2008, Michael Wilde wrote: > The latter. Essentially, find | egrep 'deef|coaster' from the root of dist would > show no jars of one or the other, and it would give a missing provider > message at runtime. > > I'll try the -v and look at the messages and report back. OK. I'll have a look at that and see what happens for me.
Note that if you misspell or otherwise misspecify the option, it's unlikely that you'll get an error message or other indication. --
From wilde at mcs.anl.gov Tue Jul 29 09:29:25 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 29 Jul 2008 09:29:25 -0500 Subject: [Swift-devel] Problems running coaster In-Reply-To: <488EA562.9020704@mcs.anl.gov> References: <488CC81C.7030205@mcs.anl.gov> <488D4217.2010006@mcs.anl.gov> <488DF957.9000803@mcs.anl.gov> <488E0C64.1020106@mcs.anl.gov> <1217303412.4347.0.camel@localhost> <488EA079.4000404@mcs.anl.gov> <488EA562.9020704@mcs.anl.gov> Message-ID: <488F2945.4090600@mcs.anl.gov> I was looking into why my logger.debug statements did not print. I am not sure, but I suspect that the updated jar, loaded into ~/.globus/coasters/cache, was either not placed in the classpath at runtime or was placed after the older copy in the same directory. I have not yet found the logic by which newer classes get loaded to the server, but suspect there may be an issue here. (Or, as usual, pilot error on my part). The class with the updated logging was WorkerManager: [wilde at honest3 cache]$ jar tvf cog-provider-coaster-0.1-a82e2ac11a74fedfadb9a8168a08b6d5.jar | grep WorkerManager 869 Mon Jul 28 19:10:34 CDT 2008 org/globus/cog/abstraction/coaster/service/job/manager/WorkerManager$AllocationRequest.class 15556 Mon Jul 28 19:10:34 CDT 2008 org/globus/cog/abstraction/coaster/service/job/manager/WorkerManager.class [wilde at honest3 cache]$ jar tvf cog-provider-coaster-0.1-d903eecc754a2c97fb5ceaebdce6ccad.jar | grep WorkerManager 869 Mon Jul 28 23:54:24 CDT 2008 org/globus/cog/abstraction/coaster/service/job/manager/WorkerManager$AllocationRequest.class 15963 Mon Jul 28 23:54:24 CDT 2008 org/globus/cog/abstraction/coaster/service/job/manager/WorkerManager.class [wilde at honest3 cache]$ The *ad.jar file has the correct updated class; the *d5.jar file has the original unmodified class. -- If my suspicion about the classpath order is correct, then there is a greater possibility that there may be a race in the job launching code of WorkerManager, as this means that the same code hung once and worked once (I'll test more on abe to investigate). - Mike On 7/29/08 12:06 AM, Michael Wilde wrote: > Hmmm. My debug statement didn't print, but this time the job on abe ran ok. > > Tomorrow I'll run more tests and see how stable it is there, and why my > logging calls never showed up. > > - Mike > > > On 7/28/08 11:45 PM, Michael Wilde wrote: >> I've moved on, and put a temp hack in to not use -l and instead run >> "~/.myetcprofile" if it exists and /etc/profile if it doesn't. >> >> .myetcprofile on abe is /etc/profile with the problematic code removed. >> >> Now abe gets past the problem and runs bootstrap.sh ok. >> >> The sequence runs OK up to the point where the service on abe's >> headnode receives a message to start a job. >> >> At this point, the service on abe seems to hang.
>> >> Comparing to the message sequence on mercury, which works, I see this: >> >> *** mercury: >> >> [RequestHandler] DEBUG Channel multiplexer - GSSC-null: HND< 2 >> SUBMITJOB(identity=1217268111318 >> executable=/bin/bash >> directory=/home/ncsa/wilde/swiftwork/ctest-20080728-1301-7c4ok42h >> arg=shared/wrapper.sh >> arg=echo-myx2e6xi >> arg=-jobdir >> arg=m >> arg=-e >> arg=/bin/echo >> arg=-out >> arg=echo_s000.txt >> arg=-err >> arg=stderr.txt >> arg=-i >> arg=-d >> ar) >> [ChannelManager] DEBUG Channel multiplexer - >> Looking up -134779b6:11b6ad597e2:-7fff:3598cb3d:11b6ad597b5:-7fffS >> [ChannelManager] DEBUG Channel multiplexer - Found >> -134779b6:11b6ad597e2:-7fff:3598cb3d:11b6ad597b5:-7fffS >> [RequestHandler] DEBUG Channel multiplexer - GSSC-null: HND> 2 >> SUBMITJOB(urn:1217268111318-1217268128309-1217268128310) >> [Replier] DEBUG Worker 1 - Replier(GSSC-null)REPL>: tag = 2, fin = >> true, datalen = 45, data = urn:1217268111318-1217268128309-1217268128310 >> [WorkerManager] INFO Coaster Queue Processor - No suitable worker >> found. Attempting to start a new one. >> [WorkerManager] INFO Worker Manager - Got allocation request: >> org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager$AllocationRequest at 151ca803 >> >> [WorkerManager] INFO Worker Manager - Starting worker with >> id=-615912369 and maxwalltime=6060s >> Worker start provider: gt2 >> Worker start JM: pbs >> >> *** abe: >> >> [RequestHandler] DEBUG Channel multiplexer - GSSC-null: HND< 2 >> SUBMITJOB(identity=1217291444315 >> executable=/bin/bash >> directory=/u/ac/wilde/swiftwork/ctest-20080728-1930-m5a70lvc >> arg=shared/wrapper.sh >> arg=echo-zc5mt6xi >> arg=-jobdir >> arg=z >> arg=-e >> arg=/bin/echo >> arg=-out >> arg=echo_s000.txt >> arg=-err >> arg=stderr.txt >> arg=-i >> arg=-d >> arg= >> ar) >> [ChannelManager] DEBUG Channel multiplexer - >> Looking up 17badc64:11b6c39944a:-7fff:f7c31d:11b6c399416:-7fffS >> [ChannelManager] DEBUG Channel multiplexer - Found >> 17badc64:11b6c39944a:-7fff:f7c31d:11b6c399416:-7fffS >> [RequestHandler] DEBUG Channel multiplexer - GSSC-null: HND> 2 >> SUBMITJOB(urn:1217291444315-1217291458042-1217291458043) >> [Replier] DEBUG Worker 1 - Replier(GSSC-null)REPL>: tag = 2, fin = >> true, datalen = 45, data = urn:1217291444315-1217291458042-1217291458043 >> [WorkerManager] INFO Coaster Queue Processor - No suitable worker >> found. Attempting to start a new one. >> [WorkerManager] INFO Worker Manager - Got allocation request: >> org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager$AllocationRequest at 95cfbe >> >> [AbstractKarajanChannel] DEBUG Channel multiplexer - GSSC-null REQ<: >> tag = 3, fin = true, err = false, datalen = 15, data = SHUTDOWNSERVICE >> >> *** >> >> I *think* the SHUTDOWNSERVICE message on abe is coming much later, >> after abe's service hangs, but Im not sure. >> >> What it looks like to me is that what should should happen on abe is >> this: >> >> [WorkerManager] INFO Worker Manager - Got allocation request: >> org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager$AllocationRequest at 151ca803 >> >> [WorkerManager] INFO Worker Manager - Starting worker with >> id=-615912369 and maxwalltime=6060s >> >> but on abe the "Worker Manager - Starting worker" is never seen. >> >> Looking at WorkerManager.run() its hard to see how the "Starting >> worker" message could *not* show up right after "Got allocation >> request", but there must be some sequence of events that causes this. >> >> Abe is an 8-core system. 
>>
>> I will insert some more debug logging and try a few more times to see
>> if things hang in this manner every time or not.
>>
>> - Mike
>>
>> ps client Logs with abe server side boot logs are on CI net in
>> ~wilde/coast/run11
>>
>> On 7/28/08 10:50 PM, Mihael Hategan wrote:
>>> On Mon, 2008-07-28 at 19:32 +0000, Ben Clifford wrote:
>>>> On Mon, 28 Jul 2008, Michael Wilde wrote:
>>>>
>>>>> So it looks like something in the job specs that is launching coaster for
>>>>> gt2:pbs is not being accepted by abe.
>>>> ok. TeraGrid's unified account system is insufficiently unified for
>>>> me to be able to access abe, but they are aware of that; if and when
>>>> I am reunified, I'll try this out myself.
>>>
>>> Not to be cynical or anything, but that unified thing: never worked.

From hategan at mcs.anl.gov Tue Jul 29 09:38:20 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 29 Jul 2008 09:38:20 -0500
Subject: [Swift-devel] Problems running coaster
In-Reply-To: <488F2945.4090600@mcs.anl.gov>
References: <488CC81C.7030205@mcs.anl.gov> <488D4217.2010006@mcs.anl.gov> <488DF957.9000803@mcs.anl.gov> <488E0C64.1020106@mcs.anl.gov> <1217303412.4347.0.camel@localhost> <488EA079.4000404@mcs.anl.gov> <488EA562.9020704@mcs.anl.gov> <488F2945.4090600@mcs.anl.gov>
Message-ID: <1217342300.6215.3.camel@localhost>

There is no order issue. When the service is started, the exact list of jars to be used is supplied rather than "all jars in this directory".

On Tue, 2008-07-29 at 09:29 -0500, Michael Wilde wrote:
> [previous message quoted in full; snipped]
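One way to settle the which-jar question empirically, whatever the classpath mechanics, is to ask the loaded class for its own code source. A minimal sketch using the standard java.security API; the class name is taken from the jar listings above, and it must be run with the same classpath the service is started with:

// Prints the jar the class was actually loaded from.
public class WhichJar {
    public static void main(String[] args) throws Exception {
        Class<?> c = Class.forName(
            "org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager");
        // getCodeSource() can be null for bootstrap classes, but not for an app jar.
        System.out.println(c.getProtectionDomain().getCodeSource().getLocation());
    }
}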
From wilde at mcs.anl.gov Tue Jul 29 09:38:58 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 29 Jul 2008 09:38:58 -0500
Subject: [Swift-devel] Problems running coaster
In-Reply-To: <1217342300.6215.3.camel@localhost>
References: <488CC81C.7030205@mcs.anl.gov> <488D4217.2010006@mcs.anl.gov> <488DF957.9000803@mcs.anl.gov> <488E0C64.1020106@mcs.anl.gov> <1217303412.4347.0.camel@localhost> <488EA079.4000404@mcs.anl.gov> <488EA562.9020704@mcs.anl.gov> <488F2945.4090600@mcs.anl.gov> <1217342300.6215.3.camel@localhost>
Message-ID: <488F2B82.5020107@mcs.anl.gov>

What are some other possibilities for why the logging code didn't work?

I see the logger.debug calls in the .class file. The logger calls were mostly unconditional. Possibly a different code path, but less likely.

I will try clearing the cache and re-running.

- Mike

On 7/29/08 9:38 AM, Mihael Hategan wrote:
> There is no order issue. When the service is started the exact list of
> jars to be used is supplied rather than "all jars in this directory".
> [remainder of the quoted thread snipped]

From hategan at mcs.anl.gov Tue Jul 29 09:47:17 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 29 Jul 2008 09:47:17 -0500
Subject: [Swift-devel] Problems running coaster
In-Reply-To: <488F2B82.5020107@mcs.anl.gov>
References: <488CC81C.7030205@mcs.anl.gov> <488D4217.2010006@mcs.anl.gov> <488DF957.9000803@mcs.anl.gov> <488E0C64.1020106@mcs.anl.gov> <1217303412.4347.0.camel@localhost> <488EA079.4000404@mcs.anl.gov> <488EA562.9020704@mcs.anl.gov> <488F2945.4090600@mcs.anl.gov> <1217342300.6215.3.camel@localhost> <488F2B82.5020107@mcs.anl.gov>
Message-ID: <1217342837.6596.2.camel@localhost>

On Tue, 2008-07-29 at 09:38 -0500, Michael Wilde wrote:
> What are some other possibilities for why the logging code didn't work?
>
> I see the logger.debug calls in the .class file. The logger calls were
> mostly unconditional. Possibly a different code path, but less likely.

I think the issue here is that the remote log4j doesn't exist or is different. It's something I've been meaning to deal with.

> I will try clearing the cache and re-running.
I don't think that will help much. The odds of you having found a collision in MD5 are fairly low.

> - Mike
> [remainder of the quoted thread snipped]
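For reference, log4j 1.x only prints what the configuration it finds at startup tells it to, so a service-side JVM with a missing or stale log4j.properties will silently drop logger.debug output. A sketch of the kind of file that would enable it; where to place it on the remote classpath is exactly the open question here, and the category and pattern below are modeled on the log excerpts in this thread, not taken from a shipped file:

# Assumed service-side log4j.properties; illustrative only.
log4j.rootLogger=INFO, CONSOLE
log4j.appender.CONSOLE=org.apache.log4j.ConsoleAppender
log4j.appender.CONSOLE.layout=org.apache.log4j.PatternLayout
# Produces lines like: [WorkerManager] INFO Worker Manager - Starting worker ...
log4j.appender.CONSOLE.layout.ConversionPattern=[%c{1}] %p %t - %m%n
# Debug only the coaster service classes.
log4j.logger.org.globus.cog.abstraction.coaster=DEBUG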
From benc at hawaga.org.uk Tue Jul 29 11:08:15 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 29 Jul 2008 16:08:15 +0000 (GMT)
Subject: [Swift-devel] Re: scheduler foo
In-Reply-To:
References: <1216650659.4064.1.camel@localhost> <1216659517.6481.1.camel@localhost> <1216679706.18694.6.camel@localhost> <1216689540.23025.0.camel@localhost> <1216742264.28239.0.camel@localhost> <1216757938.18169.6.camel@localhost>
Message-ID:

hmm. I added more logging and there is still something awry with overload counting:

2008-07-29 17:57:29,555+0200 DEBUG WeightedHostSet Adjusted overload count: 0 + -1 = -1

It's reducing the load on a set that thinks it already has no overload... hmm.

--
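Purely as an illustration of the kind of guard that would make this underflow loud instead of silent; the names below are hypothetical, not the actual WeightedHostSet code:

import org.apache.log4j.Logger;

// Hypothetical sketch of a guarded counter; not the real implementation.
class OverloadCounter {
    private static final Logger logger = Logger.getLogger(OverloadCounter.class);
    private int overloadCount = 0;

    synchronized void adjust(int delta) {
        int next = overloadCount + delta;
        if (next < 0) {
            // Going negative means an increment was missed earlier or this
            // decrement is duplicated; log a stack trace to find the caller.
            logger.warn("Adjusted overload count would go negative: "
                    + overloadCount + " + " + delta + " = " + next,
                    new Throwable("caller trace"));
            next = 0;
        }
        overloadCount = next;
    }
}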
From zhaozhang at uchicago.edu Tue Jul 29 12:02:26 2008
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Tue, 29 Jul 2008 12:02:26 -0500
Subject: [Swift-devel] Analysis of wrapper.sh
Message-ID: <488F4D22.4060409@uchicago.edu>

Hi, All

I made this analysis of wrapper.sh. Correct me if there is anything wrong. Thanks.
zhao

SWIFT phase: When swift is started, it creates a directory with the workload name and a random string, something like

/home/zzhang/swift/sleep-20080724-1527-qakbkkcc
|
|____info
|
|____kickstart
|
|____shared
|    |
|    |____wrapper.sh
|    |
|    |____seq.sh
|
|____status

WRAPPER.SH phase

In my test case, the BGexec received a task in this format:

shared/wrapper.sh sleep-l5clzzvi -jobdir l -e /bin/sleep -out stdout.txt -err stderr.txt -i -d -if -of -k -a 600

with the working dir /home/zzhang/swift/sleep-20080724-1527-qakbkkcc. All of the following steps run in that working directory:

1. OPEN wrapper.log
2. CHECK if the jobid ($1, sleep-l5clzzvi) is empty; if empty, exit with 254
3. Get -jobdir as $JOBDIR
4. CHECK if -jobdir (l) is empty; if empty, exit with 254
5. mkdir -p $WFDIR/info/$JOBDIR (mkdir /home/zzhang/swift/sleep-20080724-1527-qakbkkcc/info/l)
6. rm -f "$WFDIR/info/$JOBDIR/${ID}-info" (make a clean $ID-info file; ID=sleep-l5clzzvi)
7. openinfo "$WFDIR/info/$JOBDIR/${ID}-info" (creates the log file /home/zzhang/swift/sleep-20080724-1527-qakbkkcc/info/l/sleep-l5clzzvi-info)
8. PUT "LOG_START" into the log
9. PUT "Wrapper" into the log
10. mkdir -p $WFDIR/status/$JOBDIR (create the status parent dir for the job: mkdir -p /home/zzhang/swift/sleep-20080724-1527-qakbkkcc/status/l)
11. PARSE the arguments (EXEC=/bin/sleep, STDOUT=stdout.txt, STDERR=stderr.txt, STDIN=NULL, DIRS=NULL, INF=NULL, OUTF=NULL, KICKSTART=NULL)
12. CHECK if there are arguments after -a; if empty, exit with 254
13. Change $@ from "-a 600" to "600"
14. CHECK if "$SWIFT_JOBDIR_PATH" is NULL; if not, local copy; if NULL, shared file system: DIR=jobs/$JOBDIR/$ID (DIR=jobs/l/sleep-l5clzzvi)
15. Set PATH
16. PUT all arguments into the log
17. PUT "CREATE_JOBDIR" into the log
18. mkdir -p $DIR (in the working dir: mkdir -p jobs/l/sleep-l5clzzvi); if mkdir fails, exit with 254; if it succeeds, put "Created job directory: $DIR" into the log
19. PUT "CREATE_INPUTDIR" into the log
20. Create all subdirs in $DIR as in $DIRS (create the same tree in jobs/l/sleep-l5clzzvi as in the input file dir)
21. PUT "LINK_INPUTS" into the log
22. CHECK the file system type; on local disk, cp all files in $PWD/shared to $DIR; on a shared file system, create links in $DIR for all files in $PWD/shared
23. PUT "EXECUTE" into the log
24. CHECK if kickstart is enabled; if yes, use kickstart to run the job; if no, run the job from wrapper.sh directly
25. PUT "EXECUTE_DONE" into the log
26. PUT "Job ran successfully" into the log
27. CHECK if the output dir tree is the same as the one in $OUTF; if not, exit with 254; if it is, COPY all output files in $DIR back to $PWD/shared
28. PUT "RM_JOBDIR" into the log
29. rm -rf $DIR (rm -rf jobs/l/sleep-l5clzzvi)
30. PUT "TOUCH_SUCCESS" into the log
31. touch status/${JOBDIR}/${ID}-success (touch status/l/sleep-l5clzzvi-success)
32. PUT "END" into the log
33. closeinfo
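Condensed, the protocol those steps implement is quite short. A simplified bash sketch of the flow, not the real wrapper.sh: argument parsing is elided, and the exact name of the failure flag is assumed rather than taken from the script:

#!/bin/bash
# Simplified sketch of the wrapper.sh flow above; illustrative only.
WFDIR=$PWD
ID=$1; JOBDIR=$2; EXEC=$3; OUT=$4; shift 4
DIR=jobs/$JOBDIR/$ID

fail() {
    echo "$*" >> "$WFDIR/info/$JOBDIR/$ID-info"
    touch "$WFDIR/status/$JOBDIR/$ID-error"   # failure flag polled submit-side (name assumed)
    exit 254
}

mkdir -p "$DIR" || fail "Failed to create job directory"
cd "$DIR" || fail "Failed to enter job directory"
ln -s "$WFDIR"/shared/* . || fail "Failed to link inputs"   # cp instead on local disk
"$EXEC" "$@" > "$OUT" 2> stderr.txt || fail "Application failed"
cp "$OUT" "$WFDIR/shared/" || fail "Failed to stage out $OUT"
cd "$WFDIR" && rm -rf "$DIR"
touch "$WFDIR/status/$JOBDIR/$ID-success"     # success flag polled submit-side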
From wilde at mcs.anl.gov Tue Jul 29 12:27:08 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 29 Jul 2008 12:27:08 -0500
Subject: [Swift-devel] Re: Analysis of wrapper.sh
In-Reply-To: <488F4D22.4060409@uchicago.edu>
References: <488F4D22.4060409@uchicago.edu>
Message-ID: <488F52EC.8030200@mcs.anl.gov>

Thanks, Zhao.

That's a good start. Where I want you to take this (with help from me and others on the team) is to create a detailed description of how data flows in Swift, for use by both end users and developers.

What you show here so far is mainly the wrapper code itself.

I'm looking for a diagram that shows the three main data locations, and explains the important stages in data management during a workflow, and what they mean, why they are done.

The three areas are: the data file's original location when the mapper sees them; the shared dir on each site; the work dir on each compute node.

Examples of questions I'd like this to cover are:

why do we have a shared dir? (Answer: to re-use transferred or generated files within a workflow without re-transferring.)

what's the lifetime of this directory? what in it is persistent vs removed after jobs and/or scripts complete?

when does output come back? Where to?

how are relative vs absolute pathnames handled?

how are URL-prefixed pathnames handled? (gsiftp://, http:// etc?)

which Swift properties affect data management? Same for options in profiles?

how should wrappers be written that reference files installed as part of the application?

what are various ways in which wrappers and apps can utilize worker node disk, today?

what patches that Ben implemented for testing in March-April on the BGP and Sicortex are integrated and which remain patches to be considered for testing and integration?

Some of these questions are more useful and make more sense than others, but this is the general thing I want to get documented.

- Mike

On 7/29/08 12:02 PM, Zhao Zhang wrote:
> Hi, All
> I made this analysis of wrapper.sh. Correct me if there is anything wrong. Thanks.
> [analysis quoted in full; snipped]
From foster at mcs.anl.gov Tue Jul 29 12:29:01 2008
From: foster at mcs.anl.gov (Ian Foster)
Date: Tue, 29 Jul 2008 12:29:01 -0500
Subject: [Swift-devel] Re: Analysis of wrapper.sh
In-Reply-To: <488F52EC.8030200@mcs.anl.gov>
References: <488F4D22.4060409@uchicago.edu> <488F52EC.8030200@mcs.anl.gov>
Message-ID:

Mike, Zhao:

This sounds like a great initiative.

I wonder whether we can use UML to describe some of these things. Carl is pushing on this in a different context, and seems to be finding it useful.

Ian.

On Jul 29, 2008, at 12:27 PM, Michael Wilde wrote:
> [previous message quoted in full; snipped]
From hategan at mcs.anl.gov Tue Jul 29 12:32:02 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 29 Jul 2008 12:32:02 -0500
Subject: [Swift-devel] maybe unrelated
Message-ID: <1217352722.9187.28.camel@localhost>

but it seems like there is a class of functions that can be automatically transformed to be tail recursive. These are functions whose otherwise non-tail call is a fold-able operation.
For example:

base: () -> B
someFunction: (A, B) -> B
someOtherFunction: A -> A

f(params:A):B {
    if (terminationCondition(params)) {
        base()
    }
    else {
        someFunction(params, f(someOtherFunction(params)))
    }
}

can be transformed to

f(params:A):B {
    list = new list of A
    f'(params, list)
    foldl(someFunction, base(), list)
}

f'(params:A, list: list of A) {
    if (!terminationCondition(params)) {
        append(list, params)
        f'(someOtherFunction(params), list)
    }
}

In particular the mighty factorial (A = int, B = int, someOtherFunction = -, someFunction = *):

fact(n:int) {
    list = new list of int
    fact'(n, list)
    foldl(*, 1, list)
}

fact'(n, list) {
    if (!(n==0)) {
        append(list, n)
        fact'(n - 1, list)
    }
}

At least it passes the type checking step ;)

Even if there are side-effects in one or both of the some* functions, this still seems to preserve the right semantics.
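The factorial instance above translates directly into running code. A sketch in Java, with the fold written out as a loop since Java has no built-in foldl:

import java.util.ArrayList;
import java.util.List;

public class Fact {
    // fact'(n, list): the tail-recursive accumulation phase.
    static void factPrime(int n, List<Integer> list) {
        if (n != 0) {
            list.add(n);
            factPrime(n - 1, list);
        }
    }

    // foldl(*, 1, list), written as a loop.
    static int fact(int n) {
        List<Integer> list = new ArrayList<Integer>();
        factPrime(n, list);
        int acc = 1;
        for (int x : list) {
            acc = acc * x;
        }
        return acc;
    }

    public static void main(String[] args) {
        System.out.println(fact(5)); // prints 120
    }
}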
From wilde at mcs.anl.gov Tue Jul 29 12:37:16 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 29 Jul 2008 12:37:16 -0500
Subject: [Swift-devel] Re: Analysis of wrapper.sh
In-Reply-To:
References: <488F4D22.4060409@uchicago.edu> <488F52EC.8030200@mcs.anl.gov>
Message-ID: <488F554C.2030308@mcs.anl.gov>

Out of all the UML diagrams, I find sequence diagrams the most useful. These are the ones where vertical bars represent parties in an interaction, and horizontal lines indicate messages or interactions between those parties. They are often used to describe message protocols.

I would find (simplified) UML sequence diagrams useful in, eg, documenting the logic of how the various parties of Coaster interact, and I guess the same may apply to core Swift data management logic (in the sense that there are multiple parties: Swift, the mappers, the wrapper, and the app).

I once advocated doing this for documenting Falkon's logic.

So, yes, we can try this on both Coaster and data management documentation and see how it works.

- Mike

On 7/29/08 12:29 PM, Ian Foster wrote:
> [previous message quoted in full; snipped]
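As a rough plain-text illustration, the coaster submit path from the logs earlier in this thread might be sketched like this; the first four message names are taken from those logs, while the last two arrows are schematic guesses rather than confirmed protocol:

Client                 Service               WorkerManager          gt2/pbs
  |---SUBMITJOB--------->|                       |                    |
  |<--reply (urn:...)----|                       |                    |
  |                      |--allocation request-->|                    |
  |                      |                       |--start worker----->|
  |                      |                       |<--worker started---|
  |<--job status updates-|                       |                    |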
From benc at hawaga.org.uk Tue Jul 29 12:37:23 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 29 Jul 2008 17:37:23 +0000 (GMT)
Subject: [Swift-devel] Re: Analysis of wrapper.sh
In-Reply-To: <488F52EC.8030200@mcs.anl.gov>
References: <488F4D22.4060409@uchicago.edu> <488F52EC.8030200@mcs.anl.gov>
Message-ID:

in a similar context, plenty of people involved in swift development have had thoughts about how data management might change in the future. It would be useful to define what the application 'contract' is, so as to separate out the accidental features of the wrapper as distinct from that contract.
So (in my mind, though others perhaps see it differently): an application
can expect to be started up in a working directory that is not shared with
any other application start up; and that mapped input files will be
available for read only access within that directory (at top level or in
subdirs, depending on mapping); and mapped output files should be left in
that directory (at top level or in subdirs, depending on mapping).

Applications should not make assumptions about the nature of the file
system (which is how the wrapper can have an option to switch between
working dirs on the worker node or on shared fs). Nor should they
necessarily assume that the wrapper.sh script is the way in which things
get there, or that there is a shared directory at all; for example, if
Falkon's real or future data management features were wired in, Falkon
might handle the movement of files from some submit-side location to
individual application working directories...

Separate from the above is our implementation of that interface, which is
both the wrapper.sh on the worker side and behaviour in the submit-side
Swift code to manage the site-side shared directories.

--

From iraicu at cs.uchicago.edu Tue Jul 29 12:40:44 2008
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Tue, 29 Jul 2008 12:40:44 -0500
Subject: [Swift-devel] Re: Analysis of wrapper.sh
In-Reply-To: <488F554C.2030308@mcs.anl.gov>
References: <488F4D22.4060409@uchicago.edu> <488F52EC.8030200@mcs.anl.gov> <488F554C.2030308@mcs.anl.gov>
Message-ID: <488F561C.3010404@cs.uchicago.edu>

Michael Wilde wrote:
> I once advocated doing this for documenting Falkon's logic.
>
I thought we had a diagram that showed the message exchanges, is this
what you are referring to? Note that each of those numbers corresponds
to a specific message that is being exchanged.

Ioan

> ...
>
> - Mike
>
>

--
===================================================
Ioan Raicu
Ph.D. Candidate
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web: http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================
===================================================

-------------- next part --------------
A non-text attachment was scrubbed...
Name: moz-screenshot-71.jpg
Type: image/jpeg
Size: 19304 bytes
Desc: not available

From iraicu at cs.uchicago.edu Tue Jul 29 12:43:10 2008
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Tue, 29 Jul 2008 12:43:10 -0500
Subject: [Swift-devel] Re: Analysis of wrapper.sh
In-Reply-To:
References: <488F4D22.4060409@uchicago.edu> <488F52EC.8030200@mcs.anl.gov>
Message-ID: <488F56AE.5050002@cs.uchicago.edu>

Ben Clifford wrote:
> ...for example, if
> Falkon's real or future data management features were wired in, Falkon
> might handle the movement of files from some submit-side location to
> individual application working directories...
>
Right, one of these days I'll update the Falkon provider in Swift, and
give the new data management code from Falkon a try!

Ioan

--
===================================================
Ioan Raicu
Ph.D. Candidate
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web: http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================
===================================================

From wilde at mcs.anl.gov Tue Jul 29 13:23:01 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 29 Jul 2008 13:23:01 -0500
Subject: [Swift-devel] Problems running coaster
In-Reply-To: <488EA562.9020704@mcs.anl.gov>
References: <488CC81C.7030205@mcs.anl.gov> <488D4217.2010006@mcs.anl.gov> <488DF957.9000803@mcs.anl.gov> <488E0C64.1020106@mcs.anl.gov> <1217303412.4347.0.camel@localhost> <488EA079.4000404@mcs.anl.gov> <488EA562.9020704@mcs.anl.gov>
Message-ID: <488F6005.3060400@mcs.anl.gov>

Some more details on this issue, which is different than I previously
thought.

Short summary: do we have a /dev/random entropy problem here?

details:

After running many more 1-job tests, I see that they are all working on
Abe. What caused the behavior that I reported below, where the coaster
service *seems* to hang, is in fact a long time delay. It seems like
about 5 minutes between the message sequence below:
--
[WorkerManager] INFO Coaster Queue Processor - No suitable worker
found. Attempting to start a new one.
[WorkerManager] INFO Worker Manager - Got allocation request:
org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager$AllocationRequest at 95cfbe
[WorkerManager] INFO Worker Manager - Starting worker with
id=1391818162 and maxwalltime=6060s
Worker start provider: gt2
Worker start JM: pbs
--
(Timestamps in the coaster-boot*.log files would be useful.)

It seems like about 5 minutes, as I get between 5-6 of these messages on
stdout/err on the client side:
Progress: Executing:1

In those 5 minutes, it doesn't *seem* that the job to start a worker has
been sent to the server, as seen by qstat.

That's what led me last night to think that the server was hung here.

One possibility is that that impression above is falsely created by
buffering and other time lags. I am looking at the log via tail -f, so
if the message "[WorkerManager] INFO Worker Manager - Starting worker"
is buffered, that would give a misleading impression that there was a
delay. This could be coupled with a lag in qstat reporting job
existence, which I've never seen on other PBS systems, but I have seen
curious delays in abe qstat reporting job state changes.

Another possibility is the /dev/random delay in generating an id due to
lack of server entropy. Now *that* would explain things, as it's right
where the delay is occurring:

    private void startWorker(int maxWallTime, Task prototype)
        throws InvalidServiceContactException {
        int id = sr.nextInt(); // <<<<<<<<<<<<<<<<<<<<<<
        if (logger.isInfoEnabled()) {
            logger.info("Starting worker with id=" + id + " and
        }

which uses SecureRandom.getInstance("SHA1PRNG")

This just occurred to me and is perhaps a more likely explanation. Is
this the same scenario that was causing the Swift client to encounter
long delays as it started trivial workflows? How was that eventually
fixed?

I can stub this out with a simple number generator and test. And/or time
SecureRandom in a standalone program.

- Mike

On 7/29/08 12:06 AM, Michael Wilde wrote:
> hmmm. my debug statement didnt print.
but this time the job on abe ran ok. > > Tomorrow I'll run more tests and see how stable it is there, and why my > logging calls never showed up. > > - Mike > > > On 7/28/08 11:45 PM, Michael Wilde wrote: >> Ive moved on, and put a temp hack in to not use -l and instead run >> "~/.myetcprofile" if it exists and /etc/profile if it doesnt. >> >> .myetcprofile on abe is /etc/profile with the problematic code removed. >> >> Now abe gets past the problem and runs bootstrap.sh ok. >> >> The sequence runs OK up to the point where the service on abe's >> headnode receives a message to start a job. >> >> AT this point, the service on abe seems to hang. >> >> Comparing to the message sequence on mercury, which works, I see this: >> >> *** mercury: >> >> [RequestHandler] DEBUG Channel multiplexer - GSSC-null: HND< 2 >> SUBMITJOB(identity=1217268111318 >> executable=/bin/bash >> directory=/home/ncsa/wilde/swiftwork/ctest-20080728-1301-7c4ok42h >> arg=shared/wrapper.sh >> arg=echo-myx2e6xi >> arg=-jobdir >> arg=m >> arg=-e >> arg=/bin/echo >> arg=-out >> arg=echo_s000.txt >> arg=-err >> arg=stderr.txt >> arg=-i >> arg=-d >> ar) >> [ChannelManager] DEBUG Channel multiplexer - >> Looking up -134779b6:11b6ad597e2:-7fff:3598cb3d:11b6ad597b5:-7fffS >> [ChannelManager] DEBUG Channel multiplexer - Found >> -134779b6:11b6ad597e2:-7fff:3598cb3d:11b6ad597b5:-7fffS >> [RequestHandler] DEBUG Channel multiplexer - GSSC-null: HND> 2 >> SUBMITJOB(urn:1217268111318-1217268128309-1217268128310) >> [Replier] DEBUG Worker 1 - Replier(GSSC-null)REPL>: tag = 2, fin = >> true, datalen = 45, data = urn:1217268111318-1217268128309-1217268128310 >> [WorkerManager] INFO Coaster Queue Processor - No suitable worker >> found. Attempting to start a new one. >> [WorkerManager] INFO Worker Manager - Got allocation request: >> org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager$AllocationRequest at 151ca803 >> >> [WorkerManager] INFO Worker Manager - Starting worker with >> id=-615912369 and maxwalltime=6060s >> Worker start provider: gt2 >> Worker start JM: pbs >> >> *** abe: >> >> [RequestHandler] DEBUG Channel multiplexer - GSSC-null: HND< 2 >> SUBMITJOB(identity=1217291444315 >> executable=/bin/bash >> directory=/u/ac/wilde/swiftwork/ctest-20080728-1930-m5a70lvc >> arg=shared/wrapper.sh >> arg=echo-zc5mt6xi >> arg=-jobdir >> arg=z >> arg=-e >> arg=/bin/echo >> arg=-out >> arg=echo_s000.txt >> arg=-err >> arg=stderr.txt >> arg=-i >> arg=-d >> arg= >> ar) >> [ChannelManager] DEBUG Channel multiplexer - >> Looking up 17badc64:11b6c39944a:-7fff:f7c31d:11b6c399416:-7fffS >> [ChannelManager] DEBUG Channel multiplexer - Found >> 17badc64:11b6c39944a:-7fff:f7c31d:11b6c399416:-7fffS >> [RequestHandler] DEBUG Channel multiplexer - GSSC-null: HND> 2 >> SUBMITJOB(urn:1217291444315-1217291458042-1217291458043) >> [Replier] DEBUG Worker 1 - Replier(GSSC-null)REPL>: tag = 2, fin = >> true, datalen = 45, data = urn:1217291444315-1217291458042-1217291458043 >> [WorkerManager] INFO Coaster Queue Processor - No suitable worker >> found. Attempting to start a new one. >> [WorkerManager] INFO Worker Manager - Got allocation request: >> org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager$AllocationRequest at 95cfbe >> >> [AbstractKarajanChannel] DEBUG Channel multiplexer - GSSC-null REQ<: >> tag = 3, fin = true, err = false, datalen = 15, data = SHUTDOWNSERVICE >> >> *** >> >> I *think* the SHUTDOWNSERVICE message on abe is coming much later, >> after abe's service hangs, but Im not sure. 
>>
>> What it looks like to me is that what should happen on abe is
>> this:
>>
>> [WorkerManager] INFO Worker Manager - Got allocation request:
>> org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager$AllocationRequest at 151ca803
>>
>> [WorkerManager] INFO Worker Manager - Starting worker with
>> id=-615912369 and maxwalltime=6060s
>>
>> but on abe the "Worker Manager - Starting worker" is never seen.
>>
>> Looking at WorkerManager.run() its hard to see how the "Starting
>> worker" message could *not* show up right after "Got allocation
>> request", but there must be some sequence of events that causes this.
>>
>> Abe is an 8-core system. Is there perhaps more opportunity for a
>> multi-thread race or deadlock that could cause this?
>>
>> I will insert some more debug logging and try a few more times to see
>> if things hang in this manner every time or not.
>>
>> - Mike
>>
>> ps client Logs with abe server side boot logs are on CI net in
>> ~wilde/coast/run11
>>
>> On 7/28/08 10:50 PM, Mihael Hategan wrote:
>>> On Mon, 2008-07-28 at 19:32 +0000, Ben Clifford wrote:
>>>> On Mon, 28 Jul 2008, Michael Wilde wrote:
>>>>
>>>>> So it looks like something in the job specs that is launching
>>>>> coaster for
>>>>> gt2:pbs is not being accepted by abe.
>>>> ok. TeraGrid's unified account system is insufficiently unified for
>>>> me to be able to access abe, but they are aware of that; if and
>>>> when I am reunified, I'll try this out myself.
>>>
>>> Not to be cynical or anything, but that unified thing: never worked.
>>>
>> _______________________________________________
>> Swift-devel mailing list
>> Swift-devel at ci.uchicago.edu
>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel

From wilde at mcs.anl.gov Tue Jul 29 13:48:41 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 29 Jul 2008 13:48:41 -0500
Subject: [Swift-devel] Problems running coaster
In-Reply-To: <488F6005.3060400@mcs.anl.gov>
References: <488CC81C.7030205@mcs.anl.gov> <488D4217.2010006@mcs.anl.gov> <488DF957.9000803@mcs.anl.gov> <488E0C64.1020106@mcs.anl.gov> <1217303412.4347.0.camel@localhost> <488EA079.4000404@mcs.anl.gov> <488EA562.9020704@mcs.anl.gov> <488F6005.3060400@mcs.anl.gov>
Message-ID: <488F6609.5020005@mcs.anl.gov>

I've confirmed the "long pause" problem with the class IDGenerator
running on the abe login host. I can gen a few thousand random #s
instantly, but once it runs out of entropy, the same trivial program
just hangs:

    // Added to a test copy of IDGenerator.java:
    public static void main(String args[]) {
        System.out.println("Hello World!");
        IDGenerator gen = new IDGenerator();
        for (int i = 0; i < 100; i++) {
            System.out.println("id: " + gen.nextInt());
        }
    }

at first:

    [wilde at honest3 ~/rantest]$ java IDGenerator
    Hello World!
    id: -1080189798
    id: -263746139
    id: 947709574
    ...etc

Then I start another one, and it "just hangs" for several minutes.

    [wilde at honest3 ~/rantest]$ time java IDGenerator | wc -l

^^^ this has been hanging for a few minutes.

I also see that genId() can generate negative numbers. This gives some
log files with -- in their names where others have just -, which is
confusing and can mess up automation scripts.

Should I bugzilla either or both of these?

- Mike

On 7/29/08 1:23 PM, Michael Wilde wrote:
> Some more details on this issue, which is different than I previously
> thought.
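(Editorial aside: the entropy hypothesis above is easy to check from a
shell on the login host. This is a sketch, not from the original thread;
the device paths are standard Linux, and IDGenerator is the test class
from Mike's message above.)

    # /dev/random blocks when the kernel entropy pool is drained;
    # /dev/urandom never blocks:
    time head -c 16 /dev/random  > /dev/null   # can hang for minutes
    time head -c 16 /dev/urandom > /dev/null   # returns immediately

    # watch the pool drain while the test program runs:
    cat /proc/sys/kernel/random/entropy_avail

    # the usual JVM-side workaround: seed SHA1PRNG from urandom.
    # (some Sun JDKs special-case plain file:/dev/urandom back to
    # /dev/random, hence the conventional "/./" spelling)
    java -Djava.security.egd=file:/dev/./urandom IDGenerator

This matches the fix Mihael proposes below, starting the service with
/dev/urandom where available.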
> > Short summary: do we have a /dev/random entropy problem here? > > details: > > After running many more 1-job tests, I see that they are all working on > Abe. What caused the behavior that I reported below, where the coaster > service *seems* to hang, is in fact a long time delay. It seems like > about 5 minutes between the message sequence below: > -- > [WorkerManager] INFO Coaster Queue Processor - No suitable worker > found. Attempting to start a new one. > [WorkerManager] INFO Worker Manager - Got allocation request: > org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager$AllocationRequest at 95cfbe > > > > > [WorkerManager] INFO Worker Manager - Starting worker with > id=1391818162 and maxwalltime=6060s > Worker start provider: gt2 > Worker start JM: pbs > -- > (Timestamps in the coaster-boot*.log files would be useful). > > It seems like about 5 minutes, as I get between 5-6 of these message on > stdout/err on the client side: > Progress: Executing:1 > > In those 5 minutes, it doesnt *seem* that the job to start a worker has > been sent to the server, as seen by qstat. > > Thats what led me last night to think that the sever was hung here. > > One possibility is that that impression above is falsely created by > buffering and other time lags. I am looking at the log via tail -f, so > if the message "[WorkerManager] INFO Worker Manager - Starting worker" > is buffered, that would give a misleading impression that there was a > delay. This could be coupled by a lag in qstat reporting job existence, > which Ive never seen on other PBS systems, but I have seen curious > delays in abe qstat reporting job state changes. > > Another possibility is the /dev/random delay in generating an id due ot > lack of server entropy. Now *that* would explain things, as its right > where the delay is occurring: > > private void startWorker(int maxWallTime, Task prototype) > throws InvalidServiceContactException { > int id = sr.nextInt(); // <<<<<<<<<<<<<<<<<<<<<< > if (logger.isInfoEnabled()) { > logger.info("Starting worker with id=" + id + " and > } > which uses SecureRandom.getInstance("SHA1PRNG") > > This just occurred to me and is perhaps a more likely explanation. Is > this the same scenario that was causing the Swift client to encounter > long delays as it started trivial workflows? How was that eventually > fixed? > > I can stub this out with a simple number generator and test. And/or time > SecureRandom in a standalone program. > > - Mike > > > > > > On 7/29/08 12:06 AM, Michael Wilde wrote: >> hmmm. my debug statement didnt print. but this time the job on abe ran >> ok. >> >> Tomorrow I'll run more tests and see how stable it is there, and why >> my logging calls never showed up. >> >> - Mike >> >> >> On 7/28/08 11:45 PM, Michael Wilde wrote: >>> Ive moved on, and put a temp hack in to not use -l and instead run >>> "~/.myetcprofile" if it exists and /etc/profile if it doesnt. >>> >>> .myetcprofile on abe is /etc/profile with the problematic code removed. >>> >>> Now abe gets past the problem and runs bootstrap.sh ok. >>> >>> The sequence runs OK up to the point where the service on abe's >>> headnode receives a message to start a job. >>> >>> AT this point, the service on abe seems to hang. 
>>> >>> Comparing to the message sequence on mercury, which works, I see this: >>> >>> *** mercury: >>> >>> [RequestHandler] DEBUG Channel multiplexer - GSSC-null: HND< 2 >>> SUBMITJOB(identity=1217268111318 >>> executable=/bin/bash >>> directory=/home/ncsa/wilde/swiftwork/ctest-20080728-1301-7c4ok42h >>> arg=shared/wrapper.sh >>> arg=echo-myx2e6xi >>> arg=-jobdir >>> arg=m >>> arg=-e >>> arg=/bin/echo >>> arg=-out >>> arg=echo_s000.txt >>> arg=-err >>> arg=stderr.txt >>> arg=-i >>> arg=-d >>> ar) >>> [ChannelManager] DEBUG Channel multiplexer - >>> Looking up -134779b6:11b6ad597e2:-7fff:3598cb3d:11b6ad597b5:-7fffS >>> [ChannelManager] DEBUG Channel multiplexer - Found >>> -134779b6:11b6ad597e2:-7fff:3598cb3d:11b6ad597b5:-7fffS >>> [RequestHandler] DEBUG Channel multiplexer - GSSC-null: HND> 2 >>> SUBMITJOB(urn:1217268111318-1217268128309-1217268128310) >>> [Replier] DEBUG Worker 1 - Replier(GSSC-null)REPL>: tag = 2, fin = >>> true, datalen = 45, data = urn:1217268111318-1217268128309-1217268128310 >>> [WorkerManager] INFO Coaster Queue Processor - No suitable worker >>> found. Attempting to start a new one. >>> [WorkerManager] INFO Worker Manager - Got allocation request: >>> org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager$AllocationRequest at 151ca803 >>> >>> [WorkerManager] INFO Worker Manager - Starting worker with >>> id=-615912369 and maxwalltime=6060s >>> Worker start provider: gt2 >>> Worker start JM: pbs >>> >>> *** abe: >>> >>> [RequestHandler] DEBUG Channel multiplexer - GSSC-null: HND< 2 >>> SUBMITJOB(identity=1217291444315 >>> executable=/bin/bash >>> directory=/u/ac/wilde/swiftwork/ctest-20080728-1930-m5a70lvc >>> arg=shared/wrapper.sh >>> arg=echo-zc5mt6xi >>> arg=-jobdir >>> arg=z >>> arg=-e >>> arg=/bin/echo >>> arg=-out >>> arg=echo_s000.txt >>> arg=-err >>> arg=stderr.txt >>> arg=-i >>> arg=-d >>> arg= >>> ar) >>> [ChannelManager] DEBUG Channel multiplexer - >>> Looking up 17badc64:11b6c39944a:-7fff:f7c31d:11b6c399416:-7fffS >>> [ChannelManager] DEBUG Channel multiplexer - Found >>> 17badc64:11b6c39944a:-7fff:f7c31d:11b6c399416:-7fffS >>> [RequestHandler] DEBUG Channel multiplexer - GSSC-null: HND> 2 >>> SUBMITJOB(urn:1217291444315-1217291458042-1217291458043) >>> [Replier] DEBUG Worker 1 - Replier(GSSC-null)REPL>: tag = 2, fin = >>> true, datalen = 45, data = urn:1217291444315-1217291458042-1217291458043 >>> [WorkerManager] INFO Coaster Queue Processor - No suitable worker >>> found. Attempting to start a new one. >>> [WorkerManager] INFO Worker Manager - Got allocation request: >>> org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager$AllocationRequest at 95cfbe >>> >>> [AbstractKarajanChannel] DEBUG Channel multiplexer - GSSC-null REQ<: >>> tag = 3, fin = true, err = false, datalen = 15, data = SHUTDOWNSERVICE >>> >>> *** >>> >>> I *think* the SHUTDOWNSERVICE message on abe is coming much later, >>> after abe's service hangs, but Im not sure. >>> >>> What it looks like to me is that what should should happen on abe is >>> this: >>> >>> [WorkerManager] INFO Worker Manager - Got allocation request: >>> org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager$AllocationRequest at 151ca803 >>> >>> [WorkerManager] INFO Worker Manager - Starting worker with >>> id=-615912369 and maxwalltime=6060s >>> >>> but on abe the "Worker Manager - Starting worker" is never seen. 
>>> >>> Looking at WorkerManager.run() its hard to see how the "Starting >>> worker" message could *not* show up right after "Got allocation >>> request", but there must be some sequence of events that causes this. >>> >>> Abe is an 8-core system. Is there perhaps more opportunity for a >>> multi-thread race or deadlock that could cause this? >>> >>> I will insert some more debug logging and try a few more times to see >>> if thing shang in this manner every time or not. >>> >>> - Mike >>> >>> ps client Logs with abe server side boot logs are on CI net in >>> ~wilde/coast/run11 >>> >>> >>> >>> On 7/28/08 10:50 PM, Mihael Hategan wrote: >>>> On Mon, 2008-07-28 at 19:32 +0000, Ben Clifford wrote: >>>>> On Mon, 28 Jul 2008, Michael Wilde wrote: >>>>> >>>>>> So it looks like something in the job specs that is launching >>>>>> coaster for >>>>>> gt2:pbs is not being accepted by abe. >>>>> ok. TeraGrid's unified account system is insufficiently unified for >>>>> me to be able to access abe, but they are aware of that; if and >>>>> when I am reunified, I'll try this out myself. >>>> >>>> Not to be cynical or anything, but that unified thing: never worked. >>>> >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From hategan at mcs.anl.gov Tue Jul 29 13:57:12 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 29 Jul 2008 13:57:12 -0500 Subject: [Swift-devel] Problems running coaster In-Reply-To: <488F6005.3060400@mcs.anl.gov> References: <488CC81C.7030205@mcs.anl.gov> <488D4217.2010006@mcs.anl.gov> <488DF957.9000803@mcs.anl.gov> <488E0C64.1020106@mcs.anl.gov> <1217303412.4347.0.camel@localhost> <488EA079.4000404@mcs.anl.gov> <488EA562.9020704@mcs.anl.gov> <488F6005.3060400@mcs.anl.gov> Message-ID: <1217357832.10507.1.camel@localhost> On Tue, 2008-07-29 at 13:23 -0500, Michael Wilde wrote: > Another possibility is the /dev/random delay in generating an id due ot > lack of server entropy. Now *that* would explain things, as its right > where the delay is occurring: > > private void startWorker(int maxWallTime, Task prototype) > throws InvalidServiceContactException { > int id = sr.nextInt(); // <<<<<<<<<<<<<<<<<<<<<< > if (logger.isInfoEnabled()) { > logger.info("Starting worker with id=" + id + " and > } > which uses SecureRandom.getInstance("SHA1PRNG") > > This just occurred to me and is perhaps a more likely explanation. Is > this the same scenario that was causing the Swift client to encounter > long delays as it started trivial workflows? How was that eventually fixed? Hmm. Yes. I'll change the bootstrap class to start the service with /dev/urandom instead (if available). > > I can stub this out with a simple number generator and test. And/or time > SecureRandom in a standalone program. > > - Mike > > > > > > On 7/29/08 12:06 AM, Michael Wilde wrote: > > hmmm. my debug statement didnt print. but this time the job on abe ran ok. > > > > Tomorrow I'll run more tests and see how stable it is there, and why my > > logging calls never showed up. > > > > - Mike > > > > > > On 7/28/08 11:45 PM, Michael Wilde wrote: > >> Ive moved on, and put a temp hack in to not use -l and instead run > >> "~/.myetcprofile" if it exists and /etc/profile if it doesnt. 
> >> > >> .myetcprofile on abe is /etc/profile with the problematic code removed. > >> > >> Now abe gets past the problem and runs bootstrap.sh ok. > >> > >> The sequence runs OK up to the point where the service on abe's > >> headnode receives a message to start a job. > >> > >> AT this point, the service on abe seems to hang. > >> > >> Comparing to the message sequence on mercury, which works, I see this: > >> > >> *** mercury: > >> > >> [RequestHandler] DEBUG Channel multiplexer - GSSC-null: HND< 2 > >> SUBMITJOB(identity=1217268111318 > >> executable=/bin/bash > >> directory=/home/ncsa/wilde/swiftwork/ctest-20080728-1301-7c4ok42h > >> arg=shared/wrapper.sh > >> arg=echo-myx2e6xi > >> arg=-jobdir > >> arg=m > >> arg=-e > >> arg=/bin/echo > >> arg=-out > >> arg=echo_s000.txt > >> arg=-err > >> arg=stderr.txt > >> arg=-i > >> arg=-d > >> ar) > >> [ChannelManager] DEBUG Channel multiplexer - > >> Looking up -134779b6:11b6ad597e2:-7fff:3598cb3d:11b6ad597b5:-7fffS > >> [ChannelManager] DEBUG Channel multiplexer - Found > >> -134779b6:11b6ad597e2:-7fff:3598cb3d:11b6ad597b5:-7fffS > >> [RequestHandler] DEBUG Channel multiplexer - GSSC-null: HND> 2 > >> SUBMITJOB(urn:1217268111318-1217268128309-1217268128310) > >> [Replier] DEBUG Worker 1 - Replier(GSSC-null)REPL>: tag = 2, fin = > >> true, datalen = 45, data = urn:1217268111318-1217268128309-1217268128310 > >> [WorkerManager] INFO Coaster Queue Processor - No suitable worker > >> found. Attempting to start a new one. > >> [WorkerManager] INFO Worker Manager - Got allocation request: > >> org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager$AllocationRequest at 151ca803 > >> > >> [WorkerManager] INFO Worker Manager - Starting worker with > >> id=-615912369 and maxwalltime=6060s > >> Worker start provider: gt2 > >> Worker start JM: pbs > >> > >> *** abe: > >> > >> [RequestHandler] DEBUG Channel multiplexer - GSSC-null: HND< 2 > >> SUBMITJOB(identity=1217291444315 > >> executable=/bin/bash > >> directory=/u/ac/wilde/swiftwork/ctest-20080728-1930-m5a70lvc > >> arg=shared/wrapper.sh > >> arg=echo-zc5mt6xi > >> arg=-jobdir > >> arg=z > >> arg=-e > >> arg=/bin/echo > >> arg=-out > >> arg=echo_s000.txt > >> arg=-err > >> arg=stderr.txt > >> arg=-i > >> arg=-d > >> arg= > >> ar) > >> [ChannelManager] DEBUG Channel multiplexer - > >> Looking up 17badc64:11b6c39944a:-7fff:f7c31d:11b6c399416:-7fffS > >> [ChannelManager] DEBUG Channel multiplexer - Found > >> 17badc64:11b6c39944a:-7fff:f7c31d:11b6c399416:-7fffS > >> [RequestHandler] DEBUG Channel multiplexer - GSSC-null: HND> 2 > >> SUBMITJOB(urn:1217291444315-1217291458042-1217291458043) > >> [Replier] DEBUG Worker 1 - Replier(GSSC-null)REPL>: tag = 2, fin = > >> true, datalen = 45, data = urn:1217291444315-1217291458042-1217291458043 > >> [WorkerManager] INFO Coaster Queue Processor - No suitable worker > >> found. Attempting to start a new one. > >> [WorkerManager] INFO Worker Manager - Got allocation request: > >> org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager$AllocationRequest at 95cfbe > >> > >> [AbstractKarajanChannel] DEBUG Channel multiplexer - GSSC-null REQ<: > >> tag = 3, fin = true, err = false, datalen = 15, data = SHUTDOWNSERVICE > >> > >> *** > >> > >> I *think* the SHUTDOWNSERVICE message on abe is coming much later, > >> after abe's service hangs, but Im not sure. 
> >> > >> What it looks like to me is that what should should happen on abe is > >> this: > >> > >> [WorkerManager] INFO Worker Manager - Got allocation request: > >> org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager$AllocationRequest at 151ca803 > >> > >> [WorkerManager] INFO Worker Manager - Starting worker with > >> id=-615912369 and maxwalltime=6060s > >> > >> but on abe the "Worker Manager - Starting worker" is never seen. > >> > >> Looking at WorkerManager.run() its hard to see how the "Starting > >> worker" message could *not* show up right after "Got allocation > >> request", but there must be some sequence of events that causes this. > >> > >> Abe is an 8-core system. Is there perhaps more opportunity for a > >> multi-thread race or deadlock that could cause this? > >> > >> I will insert some more debug logging and try a few more times to see > >> if thing shang in this manner every time or not. > >> > >> - Mike > >> > >> ps client Logs with abe server side boot logs are on CI net in > >> ~wilde/coast/run11 > >> > >> > >> > >> On 7/28/08 10:50 PM, Mihael Hategan wrote: > >>> On Mon, 2008-07-28 at 19:32 +0000, Ben Clifford wrote: > >>>> On Mon, 28 Jul 2008, Michael Wilde wrote: > >>>> > >>>>> So it looks like something in the job specs that is launching > >>>>> coaster for > >>>>> gt2:pbs is not being accepted by abe. > >>>> ok. TeraGrid's unified account system is insufficiently unified for > >>>> me to be able to access abe, but they are aware of that; if and when > >>>> I am reunified, I'll try this out myself. > >>> > >>> Not to be cynical or anything, but that unified thing: never worked. > >>> > >> _______________________________________________ > >> Swift-devel mailing list > >> Swift-devel at ci.uchicago.edu > >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Tue Jul 29 14:00:17 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 29 Jul 2008 14:00:17 -0500 Subject: [Swift-devel] Re: scheduler foo In-Reply-To: References: <1216650659.4064.1.camel@localhost> <1216659517.6481.1.camel@localhost> <1216679706.18694.6.camel@localhost> <1216689540.23025.0.camel@localhost> <1216742264.28239.0.camel@localhost> <1216757938.18169.6.camel@localhost> Message-ID: <1217358017.12370.0.camel@localhost> On Tue, 2008-07-29 at 10:54 +0000, Ben Clifford wrote: > > I put two tests in misc/ that you need to have compiled > -Dwith-provider-wonky to run, called wonky.sh and wonky80.sh > > These run single-site local tests with 90% and 80% success rate of jobs. > The 90% test usually finishes fine, which is good. > > The 80% one gets long delays; sometimes I think because of legitimate long > delays caused by the scheduler slowing down, but I think sometimes caused > by some race condition making the single site be ignored forever. I can see the hanging. I'll poke at it. > > > From skenny at uchicago.edu Tue Jul 29 14:34:28 2008 From: skenny at uchicago.edu (skenny at uchicago.edu) Date: Tue, 29 Jul 2008 14:34:28 -0500 (CDT) Subject: [Swift-devel] mystery runs on ucanl Message-ID: <20080729143428.BII73728@m4500-02.uchicago.edu> >> >> yes (see below) and SOME of the jobs in the workflow do >> >> complete when we submit the whole workflow to ucanl. >> > >> >Indeed. It seems like roughly half of them work and the other >> half >> >break. 
Could this be an ia32/ia64 issue? Like python being >> compiled for >> >the wrong platform or something? well, i thought that sounded pretty likely (apparently some jobs were going to 32-bit machines even though 64 was specified in the sites file). however, i've just sent a batch to the site and am getting failures on 64-bit nodes as well (and on varying nodes, so not just 1 or 2 bum nodes)...because there is still this odd behavior of jobs remaining in the queue even after they've been killed, i'm tempted to blame pbs (gotta blame someone ;) also, i'm getting emails from pbs like this: PBS Job Id: 1759910.tg-master.uc.teragrid.org Job Name: STDIN Exec host: tg-c054/0 Aborted by PBS Server Job cannot be executed See Administrator for help and the swift log simply gives "Failed Error code: 271, ProcessDied" hence, i'm copying help at teragrid on this...if there are any other tests i can run to try and narrow down the bug let me know. i've tried submitting several globusrun-ws jobs but haven't gotten an error that way as of yet. From wilde at mcs.anl.gov Tue Jul 29 14:43:09 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 29 Jul 2008 14:43:09 -0500 Subject: [Swift-devel] mystery runs on ucanl In-Reply-To: <20080729143428.BII73728@m4500-02.uchicago.edu> References: <20080729143428.BII73728@m4500-02.uchicago.edu> Message-ID: <488F72CD.4080305@mcs.anl.gov> On 7/29/08 2:34 PM, skenny at uchicago.edu wrote: >>>>> yes (see below) and SOME of the jobs in the workflow do >>>>> complete when we submit the whole workflow to ucanl. >>>> Indeed. It seems like roughly half of them work and the other >>> half >>>> break. Could this be an ia32/ia64 issue? Like python being >>> compiled for >>>> the wrong platform or something? > > well, i thought that sounded pretty likely (apparently some > jobs were going to 32-bit machines even though 64 was > specified in the sites file). Is it possible that the property was mis-spelled? I recall some issues with this profile attribute in the past, when you first started running Swift last Oct-Nov. > however, i've just sent a batch > to the site and am getting failures on 64-bit nodes as > well (and on varying nodes, so not just 1 or 2 bum > nodes)...because there is still this odd behavior of jobs > remaining in the queue even after they've been killed, i'm > tempted to blame pbs (gotta blame someone ;) also, i'm getting > emails from pbs like this: > > PBS Job Id: 1759910.tg-master.uc.teragrid.org > Job Name: STDIN > Exec host: tg-c054/0 > Aborted by PBS Server > Job cannot be executed > See Administrator for help > > and the swift log simply gives "Failed Error code: 271, > ProcessDied" I also recall some similar issues on UC Teragrid last Nov (2007) as we were preparing Angle runs for SC07. Ti was involved in that debugging and had given us PBS diagnostic commands to capture log data on the problem at the time. Ti, can you recall the details? - Mike > > hence, i'm copying help at teragrid on this...if there are any > other tests i can run to try and narrow down the bug let me > know. i've tried submitting several globusrun-ws jobs but > haven't gotten an error that way as of yet. 
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel

From hategan at mcs.anl.gov Tue Jul 29 14:55:27 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 29 Jul 2008 14:55:27 -0500
Subject: [Swift-devel] mystery runs on ucanl
In-Reply-To: <20080729143428.BII73728@m4500-02.uchicago.edu>
References: <20080729143428.BII73728@m4500-02.uchicago.edu>
Message-ID: <1217361327.13407.1.camel@localhost>

On Tue, 2008-07-29 at 14:34 -0500, skenny at uchicago.edu wrote:
> >> >> yes (see below) and SOME of the jobs in the workflow do
> >> >> complete when we submit the whole workflow to ucanl.
> >> >
> >> >Indeed. It seems like roughly half of them work and the other
> >> half
> >> >break. Could this be an ia32/ia64 issue? Like python being
> >> compiled for
> >> >the wrong platform or something?
>
> well, i thought that sounded pretty likely (apparently some
> jobs were going to 32-bit machines even though 64 was
> specified in the sites file). however, i've just sent a batch
> to the site and am getting failures on 64-bit nodes as
> well (and on varying nodes, so not just 1 or 2 bum
> nodes)...

The same kinds of failures?

> because there is still this odd behavior of jobs
> remaining in the queue even after they've been killed, i'm
> tempted to blame pbs (gotta blame someone ;) also, i'm getting
> emails from pbs like this:
>
> PBS Job Id: 1759910.tg-master.uc.teragrid.org
> Job Name: STDIN
> Exec host: tg-c054/0
> Aborted by PBS Server
> Job cannot be executed
> See Administrator for help
>
> and the swift log simply gives "Failed Error code: 271,
> ProcessDied"

Not the same kind of failures. So we may be dealing with multiple issues
here.

> hence, i'm copying help at teragrid on this...if there are any
> other tests i can run to try and narrow down the bug let me
> know. i've tried submitting several globusrun-ws jobs but
> haven't gotten an error that way as of yet.

From benc at hawaga.org.uk Tue Jul 29 15:14:45 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 29 Jul 2008 20:14:45 +0000 (GMT)
Subject: [Swift-devel] Re: scheduler foo
In-Reply-To: <1217358017.12370.0.camel@localhost>
References: <1216650659.4064.1.camel@localhost> <1216659517.6481.1.camel@localhost> <1216679706.18694.6.camel@localhost> <1216689540.23025.0.camel@localhost> <1216742264.28239.0.camel@localhost> <1216757938.18169.6.camel@localhost> <1217358017.12370.0.camel@localhost>
Message-ID:

I see this:

2008-07-29 18:50:51,941+0200 DEBUG OverloadedHostMonitor Polling 1 hosts
2008-07-29 18:50:51,941+0200 DEBUG WeightedHost In delay mode. score = -1.784646153846153 tscore = 0.3660071750300991, maxload=1.0 delay since last used=1074ms permitted delay=595ms overloaded=false delay-permitted delay=479
2008-07-29 18:50:51,941+0200 DEBUG OverloadedHostMonitor Adjusting overloaded by 1
2008-07-29 18:50:51,941+0200 DEBUG WeightedHostSet Adjusted overload count for 8076068: 1 + 1 = 2

I think that an adjustment done by OverloadedHostMonitor should only ever
be reducing the load, because the only time-dependent load adjustment is
when a host goes from overloaded to not-overloaded after its permitted
delay time.
--

From hategan at mcs.anl.gov Tue Jul 29 15:18:37 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 29 Jul 2008 15:18:37 -0500
Subject: [Swift-devel] Re: scheduler foo
In-Reply-To:
References: <1216650659.4064.1.camel@localhost> <1216659517.6481.1.camel@localhost> <1216679706.18694.6.camel@localhost> <1216689540.23025.0.camel@localhost> <1216742264.28239.0.camel@localhost> <1216757938.18169.6.camel@localhost> <1217358017.12370.0.camel@localhost>
Message-ID: <1217362717.16732.0.camel@localhost>

On Tue, 2008-07-29 at 20:14 +0000, Ben Clifford wrote:
> I see this:
>
> 2008-07-29 18:50:51,941+0200 DEBUG OverloadedHostMonitor Polling 1 hosts
> 2008-07-29 18:50:51,941+0200 DEBUG WeightedHost In delay mode. score =
> -1.784646153846153 tscore = 0.3660071750300991, maxload=1.0 delay since
> last used=1074ms permitted delay=595ms overloaded=false delay-permitted
> delay=479
> 2008-07-29 18:50:51,941+0200 DEBUG OverloadedHostMonitor Adjusting
> overloaded by 1
> 2008-07-29 18:50:51,941+0200 DEBUG WeightedHostSet Adjusted overload count
> for 8076068: 1 + 1 = 2
>
> I think that an adjustment done by OverloadedHostMonitor should only ever
> be reducing the load, because the only time-dependent load adjustment is
> when a host goes from overloaded to not-overloaded after its permitted
> delay time.

Right.

From skenny at uchicago.edu Tue Jul 29 15:19:50 2008
From: skenny at uchicago.edu (skenny at uchicago.edu)
Date: Tue, 29 Jul 2008 15:19:50 -0500 (CDT)
Subject: [Swift-devel] mystery runs on ucanl
Message-ID: <20080729151950.BII79030@m4500-02.uchicago.edu>

>> because there is still this odd behavior of jobs
>> remaining in the queue even after they've been killed, i'm
>> tempted to blame pbs (gotta blame someone ;) also, i'm getting
>> emails from pbs like this:
>>
>> PBS Job Id: 1759910.tg-master.uc.teragrid.org
>> Job Name: STDIN
>> Exec host: tg-c054/0
>> Aborted by PBS Server
>> Job cannot be executed
>> See Administrator for help
>>
>> and the swift log simply gives "Failed Error code: 271,
>> ProcessDied"
>
>Not the same kind of failures. So we may be dealing with multiple issues
>here.

so, in looking back at the pbs notices from a run on 7/23-24 i actually
see about 25 failures indicating tg-c054 as the node, so i may have
jumped the gun on there not being a bum node involved...i'm also seeing
that in the batch i submitted today the failures were either going to
32-bit nodes (which i expect to fail) or to tg-c054...sooo, that 054 is
looking like a culprit for at least some of what we're seeing.

From benc at hawaga.org.uk Tue Jul 29 15:35:05 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 29 Jul 2008 20:35:05 +0000 (GMT)
Subject: [Swift-devel] Re: scheduler foo
In-Reply-To: <1217362717.16732.0.camel@localhost>
References: <1216650659.4064.1.camel@localhost> <1216659517.6481.1.camel@localhost> <1216679706.18694.6.camel@localhost> <1216689540.23025.0.camel@localhost> <1216742264.28239.0.camel@localhost> <1216757938.18169.6.camel@localhost> <1217358017.12370.0.camel@localhost> <1217362717.16732.0.camel@localhost>
Message-ID:

which i suppose happens sometimes when it's time to adjust the load or
score of a site that is already in delay mode...
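(Editorial aside: for anyone chasing the same overload-count drift in
their own runs, the two message types quoted above are easy to pull out
of a Swift log; the log file name here is hypothetical.)

    grep -E 'Adjusting overloaded by|Adjusted overload count' swift-run.log | tail -20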
From hategan at mcs.anl.gov Tue Jul 29 15:38:36 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 29 Jul 2008 15:38:36 -0500 Subject: [Swift-devel] Re: scheduler foo In-Reply-To: References: <1216650659.4064.1.camel@localhost> <1216659517.6481.1.camel@localhost> <1216679706.18694.6.camel@localhost> <1216689540.23025.0.camel@localhost> <1216742264.28239.0.camel@localhost> <1216757938.18169.6.camel@localhost> <1217358017.12370.0.camel@localhost> <1217362717.16732.0.camel@localhost> Message-ID: <1217363916.25884.0.camel@localhost> On Tue, 2008-07-29 at 20:35 +0000, Ben Clifford wrote: > which i suppose happens sometimes when its time to adjust the load or > score of a site that is already in delay mode... That's why I kept it there. From leggett at ci.uchicago.edu Tue Jul 29 15:50:38 2008 From: leggett at ci.uchicago.edu (Ti Leggett) Date: Tue, 29 Jul 2008 15:50:38 -0500 Subject: [Swift-devel] mystery runs on ucanl In-Reply-To: <488F72CD.4080305@mcs.anl.gov> References: <20080729143428.BII73728@m4500-02.uchicago.edu> <488F72CD.4080305@mcs.anl.gov> Message-ID: <6E499FD2-0303-48F3-B5D2-043E22A72B18@ci.uchicago.edu> This looks like you're trying to run ia64 code on an ia32 machine. Double verify that you are in fact requesting the right type of node (ia64-compute for ia64 and ia32-compute for ia32). If you don't, you will arbitrarily be placed on an available node, which could be either architecture. On Jul 29, 2008, at 2:43 PM, Michael Wilde wrote: > > On 7/29/08 2:34 PM, skenny at uchicago.edu wrote: >>>>>> yes (see below) and SOME of the jobs in the workflow do >>>>>> complete when we submit the whole workflow to ucanl. >>>>> Indeed. It seems like roughly half of them work and the other >>>> half >>>>> break. Could this be an ia32/ia64 issue? Like python being >>>> compiled for >>>>> the wrong platform or something? >> well, i thought that sounded pretty likely (apparently some >> jobs were going to 32-bit machines even though 64 was >> specified in the sites file). > > Is it possible that the property was mis-spelled? I recall some > issues with this profile attribute in the past, when you first > started running Swift last Oct-Nov. > >> however, i've just sent a batch >> to the site and am getting failures on 64-bit nodes as >> well (and on varying nodes, so not just 1 or 2 bum >> nodes)...because there is still this odd behavior of jobs >> remaining in the queue even after they've been killed, i'm >> tempted to blame pbs (gotta blame someone ;) also, i'm getting >> emails from pbs like this: >> PBS Job Id: 1759910.tg-master.uc.teragrid.org >> Job Name: STDIN >> Exec host: tg-c054/0 >> Aborted by PBS Server Job cannot be executed >> See Administrator for help >> and the swift log simply gives "Failed Error code: 271, >> ProcessDied" > > I also recall some similar issues on UC Teragrid last Nov (2007) as > we were preparing Angle runs for SC07. Ti was involved in that > debugging and had given us PBS diagnostic commands to capture log > data on the problem at the time. Ti, can you recall the details? > > - Mike > >> hence, i'm copying help at teragrid on this...if there are any >> other tests i can run to try and narrow down the bug let me >> know. i've tried submitting several globusrun-ws jobs but >> haven't gotten an error that way as of yet. 
>> _______________________________________________
>> Swift-devel mailing list
>> Swift-devel at ci.uchicago.edu
>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel

From benc at hawaga.org.uk Tue Jul 29 16:01:31 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 29 Jul 2008 21:01:31 +0000 (GMT)
Subject: [Swift-devel] Re: scheduler foo
In-Reply-To: <1217363916.25884.0.camel@localhost>
References: <1216650659.4064.1.camel@localhost> <1216659517.6481.1.camel@localhost> <1216679706.18694.6.camel@localhost> <1216689540.23025.0.camel@localhost> <1216742264.28239.0.camel@localhost> <1216757938.18169.6.camel@localhost> <1217358017.12370.0.camel@localhost> <1217362717.16732.0.camel@localhost> <1217363916.25884.0.camel@localhost>
Message-ID:

I just started building a test for multiple wonky sites, one good and one
bad. With the scores like this:

2008-07-29 22:55:18,172+0200 INFO WeightedHostScoreScheduler Sorted: [wonkyB:-7.292(0.058):0/1 overload: 0, wonkyA:183.795(92.336):1/1 overload: 0]

I see a similar looking problem eventually - even though wonkyA is very
well scored.

Both of these sites have jobThrottle set to 0 so they'll never go more
than a load of 1 before being overloaded.

--

From skenny at uchicago.edu Tue Jul 29 16:45:59 2008
From: skenny at uchicago.edu (skenny at uchicago.edu)
Date: Tue, 29 Jul 2008 16:45:59 -0500 (CDT)
Subject: [Swift-devel] mystery runs on ucanl
Message-ID: <20080729164559.BII90730@m4500-02.uchicago.edu>

>This looks like you're trying to run ia64 code on an ia32 machine.
>Double verify that you are in fact requesting the right type of node
>(ia64-compute for ia64 and ia32-compute for ia32). If you don't, you
>will arbitrarily be placed on an available node, which could be either
>architecture.

hi ti, i understand that; however the issue seems to be that even when
the job makes it to an ia64 node it dies...in particular it seems
tg-c054 might be problematic. is there a way i can submit directly to
this node to test it?

From hategan at mcs.anl.gov Tue Jul 29 18:15:52 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 29 Jul 2008 18:15:52 -0500
Subject: [Swift-devel] Re: scheduler foo
In-Reply-To:
References: <1216650659.4064.1.camel@localhost> <1216659517.6481.1.camel@localhost> <1216679706.18694.6.camel@localhost> <1216689540.23025.0.camel@localhost> <1216742264.28239.0.camel@localhost> <1216757938.18169.6.camel@localhost> <1217358017.12370.0.camel@localhost> <1217362717.16732.0.camel@localhost> <1217363916.25884.0.camel@localhost>
Message-ID: <1217373352.28446.0.camel@localhost>

We're not thinking this clearly.

There are two possibilities:
1. site was overloaded and it's not any more -> overloadedCount--
2. site wasn't overloaded and it is now -> overloadedCount++

I'll code this.

On Tue, 2008-07-29 at 15:38 -0500, Mihael Hategan wrote:
> On Tue, 2008-07-29 at 20:35 +0000, Ben Clifford wrote:
> > which i suppose happens sometimes when it's time to adjust the load or
> > score of a site that is already in delay mode...
>
> That's why I kept it there.
> > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Tue Jul 29 19:35:10 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 29 Jul 2008 19:35:10 -0500 Subject: [Swift-devel] Re: scheduler foo In-Reply-To: <1217373352.28446.0.camel@localhost> References: <1216650659.4064.1.camel@localhost> <1216659517.6481.1.camel@localhost> <1216679706.18694.6.camel@localhost> <1216689540.23025.0.camel@localhost> <1216742264.28239.0.camel@localhost> <1216757938.18169.6.camel@localhost> <1217358017.12370.0.camel@localhost> <1217362717.16732.0.camel@localhost> <1217363916.25884.0.camel@localhost> <1217373352.28446.0.camel@localhost> Message-ID: <1217378110.28446.6.camel@localhost> I put that in. This should be easier to debug and understand. The problem, however, seems to be that wonky causes, after a while, the score to go to the minimum which in turn causes delays in the order of thousands of seconds. I need to figure out why this is happening, because I don't think it should. On Tue, 2008-07-29 at 18:15 -0500, Mihael Hategan wrote: > We're not thinking this clearly. > > There are two possibilities: > 1. site was overloaded and it's not any more -> overloadedCount-- > 2. site wasn't overloaded and it is now -> overloadedCount++ > > I'll code this. > > On Tue, 2008-07-29 at 15:38 -0500, Mihael Hategan wrote: > > On Tue, 2008-07-29 at 20:35 +0000, Ben Clifford wrote: > > > which i suppose happens sometimes when its time to adjust the load or > > > score of a site that is already in delay mode... > > > > That's why I kept it there. > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Tue Jul 29 19:42:40 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 29 Jul 2008 19:42:40 -0500 Subject: [Swift-devel] Re: scheduler foo In-Reply-To: <1217378110.28446.6.camel@localhost> References: <1216650659.4064.1.camel@localhost> <1216659517.6481.1.camel@localhost> <1216679706.18694.6.camel@localhost> <1216689540.23025.0.camel@localhost> <1216742264.28239.0.camel@localhost> <1216757938.18169.6.camel@localhost> <1217358017.12370.0.camel@localhost> <1217362717.16732.0.camel@localhost> <1217363916.25884.0.camel@localhost> <1217373352.28446.0.camel@localhost> <1217378110.28446.6.camel@localhost> Message-ID: <1217378561.28446.9.camel@localhost> On Tue, 2008-07-29 at 19:35 -0500, Mihael Hategan wrote: > I put that in. This should be easier to debug and understand. > > The problem, however, seems to be that wonky causes, after a while, the > score to go to the minimum which in turn causes delays in the order of > thousands of seconds. I need to figure out why this is happening, > because I don't think it should. Hmm. So if I change max load to be a constant 32, things seem to go fine. 
From hategan at mcs.anl.gov Tue Jul 29 20:14:37 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 29 Jul 2008 20:14:37 -0500 Subject: [Swift-devel] Re: scheduler foo In-Reply-To: <1217378561.28446.9.camel@localhost> References: <1216650659.4064.1.camel@localhost> <1216659517.6481.1.camel@localhost> <1216679706.18694.6.camel@localhost> <1216689540.23025.0.camel@localhost> <1216742264.28239.0.camel@localhost> <1216757938.18169.6.camel@localhost> <1217358017.12370.0.camel@localhost> <1217362717.16732.0.camel@localhost> <1217363916.25884.0.camel@localhost> <1217373352.28446.0.camel@localhost> <1217378110.28446.6.camel@localhost> <1217378561.28446.9.camel@localhost> Message-ID: <1217380477.7960.0.camel@localhost> On Tue, 2008-07-29 at 19:42 -0500, Mihael Hategan wrote: > On Tue, 2008-07-29 at 19:35 -0500, Mihael Hategan wrote: > > I put that in. This should be easier to debug and understand. > > > > The problem, however, seems to be that wonky causes, after a while, the > > score to go to the minimum which in turn causes delays in the order of > > thousands of seconds. I need to figure out why this is happening, > > because I don't think it should. > > Hmm. So if I change max load to be a constant 32, things seem to go > fine. Nevermind that. > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Tue Jul 29 20:32:52 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 29 Jul 2008 20:32:52 -0500 Subject: [Swift-devel] Re: scheduler foo In-Reply-To: <1217380477.7960.0.camel@localhost> References: <1216650659.4064.1.camel@localhost> <1216659517.6481.1.camel@localhost> <1216679706.18694.6.camel@localhost> <1216689540.23025.0.camel@localhost> <1216742264.28239.0.camel@localhost> <1216757938.18169.6.camel@localhost> <1217358017.12370.0.camel@localhost> <1217362717.16732.0.camel@localhost> <1217363916.25884.0.camel@localhost> <1217373352.28446.0.camel@localhost> <1217378110.28446.6.camel@localhost> <1217378561.28446.9.camel@localhost> <1217380477.7960.0.camel@localhost> Message-ID: <1217381572.14271.1.camel@localhost> On Tue, 2008-07-29 at 20:14 -0500, Mihael Hategan wrote: > On Tue, 2008-07-29 at 19:42 -0500, Mihael Hategan wrote: > > On Tue, 2008-07-29 at 19:35 -0500, Mihael Hategan wrote: > > > I put that in. This should be easier to debug and understand. > > > > > > The problem, however, seems to be that wonky causes, after a while, the > > > score to go to the minimum which in turn causes delays in the order of > > > thousands of seconds. I need to figure out why this is happening, > > > because I don't think it should. > > > > Hmm. So if I change max load to be a constant 32, things seem to go > > fine. > > Nevermind that. Ok, so one problem was the delayed score factoring. It could cause, in many cases, a negative score to be repeatedly applied even for successful jobs, causing the score to go very low very quickly. I removed that for now. The behavior seems smoother. 
> > _______________________________________________
> > Swift-devel mailing list
> > Swift-devel at ci.uchicago.edu
> > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel

From wilde at mcs.anl.gov Tue Jul 29 23:52:54 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 29 Jul 2008 23:52:54 -0500
Subject: [Swift-devel] Re: Analysis of wrapper.sh
In-Reply-To:
References: <488F4D22.4060409@uchicago.edu> <488F52EC.8030200@mcs.anl.gov>
Message-ID: <488FF3A6.50702@mcs.anl.gov>

This is a good basis for a model, Ben. We should capture and refine this
in the same doc as the one mentioned above.

On 7/29/08 12:37 PM, Ben Clifford wrote:
> in a similar context, plenty of people involved in swift development have
> had thoughts about how data management might change in the future.
>
> it would be useful to define what the application 'contract' is, so as to
> separate out the accidental features of the wrapper as distinct from that
> contract.

yes

> So (in my mind, though others perhaps see it differently): an application
> can expect to be started up in a working directory that is not shared with
> any other application start up; and that mapped input files will be
> available for read only access within that directory (at top level or in
> subdirs, depending on mapping); and mapped output files should be left in
> that directory (at top level or in subdirs, depending on mapping)

yes. may want to define the requirements that make hard and soft links
possible in the work dir.

> Applications should not make assumptions about the nature of the file
> system (which is how the wrapper can have an option to switch between
> working dirs on the worker node or on shared fs). Nor should they
> necessarily assume that the wrapper.sh script is the way in which things
> get there, or that there is a shared directory at all; for example, if
> Falkon's real or future data management features were wired in, Falkon
> might handle the movement of files from some submit-side location to
> individual application working directories...

all sound good at the moment, certainly on the right track.

> Separate from the above is our implementation of that interface, which is
> both the wrapper.sh on the worker side and behaviour in the submit-side
> Swift code to manage the site-side shared directories.

I've felt that some similar contract is needed between the swift
interpreter and the mappers it calls. There should be an abstract data
model defined by the swift language - that of scalars, files, structs
and arrays; and separately, the mapping of that to files/data-objects on
various storage systems (including perhaps the "shared dirs" above).

I'm in favor of nailing down contracts for the app exec side and the
swift side, and then having families of mappers that implement different
data management strategies. I suspect we may need to generalize the
interaction between the swift "vm" that interprets primitives and the
storage providers that the mappers provide references to.

Not sure how all these fit together but I think we should address this
as we get closer to implementing VDS-style data caching with an RLS-like
catalog and replica model.
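(Editorial aside: a minimal sketch of an application honoring the
contract discussed in this thread; the file names are hypothetical, and
it assumes only a private working directory, readable mapped inputs
beneath it, and outputs left in place.)

    #!/bin/bash
    # runs in a private working directory provided by the execution side
    set -e
    # mapped inputs are readable below $PWD (possibly as links): read,
    # never modify
    tr 'a-z' 'A-Z' < inputs/data.txt > outputs/data-upper.txt
    # mapped outputs are simply left below $PWD; no absolute paths, no
    # cleanup, and no assumption about whether the filesystem is shared
    # or node-local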
- Mike

From wilde at mcs.anl.gov Wed Jul 30 06:42:33 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 30 Jul 2008 06:42:33 -0500
Subject: [Swift-devel] coastersPerNode not recognized by GT2 GRAM
Message-ID: <489053A9.4080906@mcs.anl.gov>

When I use coastersPerNode on Abe I get an error:

2008-07-30 01:06:43,498-0500 DEBUG vdl:execute2 APPLICATION_EXCEPTION
jobid=echo-9g3mu8xi - Application exception: Cannot submit job
org.globus.cog.abstraction.impl.common.task.TaskSubmissionException:
Cannot submit job
at
org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submitSingleJob(JobSubmissionTaskHandler.java:162)
...
Caused by: org.globus.gram.GramException: Parameter not supported

Do the globus namespace options get put into RSL in GRAM2, and do these
need to be valid Globus options?

I assume this option is in the globus namespace because that was a
convenient way of adding per-site options? Or was this meant to be in
the karajan namespace?

I'll dig deeper, but wonder if this has been tested on TeraGrid sites.

- Mike

On 7/27/08 2:58 PM, Ben Clifford wrote:
> cog svn r2094 introduces a profile property coastersPerNode which allows
> you to spawn multiple coaster workers on a node. this should allow you to
> take advantage of sites which have multicore CPUs but allocate the whole
> node, rather than an individual core, when a job is submitted.
>
> When using coasters, add this to the site definition:
>
> <profile namespace="globus" key="coastersPerNode">5</profile>
>
> to get eg 5 workers on each node.
>

From wilde at mcs.anl.gov Wed Jul 30 06:44:37 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 30 Jul 2008 06:44:37 -0500
Subject: [Swift-devel] Clarification of Coaster parameter jobThrottle
In-Reply-To:
References: <488CC81C.7030205@mcs.anl.gov> <488D4217.2010006@mcs.anl.gov> <1217218071.19068.2.camel@localhost> <488D4816.2010900@mcs.anl.gov>
Message-ID: <48905425.40304@mcs.anl.gov>

was: Re: [Swift-devel] Problems running coaster

On 7/28/08 9:43 AM, Ben Clifford wrote:
> also, don't:
>
>> <profile namespace="karajan" key="jobThrottle">4</profile>
>
> say that if you are using gram2 (even coasters in gram2)

What is the reason for this recommendation?

Are you saying (a) don't overload gram2 or (b) something breaks if you
use this parameter at all, or (c) you were just eliminating
possibilities debugging this error (which turned out to be login
exclusion on the headnode in /etc/profile on abe)

Eventually for coasters we *want* to open up the throttles, but perhaps
to throttle as the Falkon load-balancer does: by keeping all allocated
workers on each site fully busy, but not pre-committing jobs to them.
(That would change with data pre-staging but in the absence of that the
strategy of not pre-committing seems to work well).

From benc at hawaga.org.uk Wed Jul 30 08:10:38 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 30 Jul 2008 13:10:38 +0000 (GMT)
Subject: [Swift-devel] Re: Clarification of Coaster parameter jobThrottle
In-Reply-To: <48905425.40304@mcs.anl.gov>
References: <488CC81C.7030205@mcs.anl.gov> <488D4217.2010006@mcs.anl.gov> <1217218071.19068.2.camel@localhost> <488D4816.2010900@mcs.anl.gov> <48905425.40304@mcs.anl.gov>
Message-ID:

On Wed, 30 Jul 2008, Michael Wilde wrote:

> Are you saying (a) dont overload gram2 or (b) something breaks if you use this
> parameter at all, or (c) you were just eliminating possibilities debugging
> this error (which turned out to be login exclusion on the headnode in
> /etc/profile on abe)

(a).
The number of simultaneous jobs sent to coasters by Swift will end up
fairly similar to the number of gram jobs submitted to make coaster
workers run (I think now divided by the coastersPerNode value set in that
other property I implemented the other day).

--

From benc at hawaga.org.uk Wed Jul 30 08:29:11 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 30 Jul 2008 13:29:11 +0000 (GMT)
Subject: [Swift-devel] Re: Analysis of wrapper.sh
In-Reply-To: <488FF3A6.50702@mcs.anl.gov>
References: <488F4D22.4060409@uchicago.edu> <488F52EC.8030200@mcs.anl.gov> <488FF3A6.50702@mcs.anl.gov>
Message-ID:

on a practical note, the existing swift documentation is in docbook and
if any of this work is intended to actually form part of the swift
documentation, it should likely also be in that format.

--

From wilde at mcs.anl.gov Wed Jul 30 08:36:37 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 30 Jul 2008 08:36:37 -0500
Subject: [Swift-devel] Re: Analysis of wrapper.sh
In-Reply-To:
References: <488F4D22.4060409@uchicago.edu> <488F52EC.8030200@mcs.anl.gov> <488FF3A6.50702@mcs.anl.gov>
Message-ID: <48906E65.5020405@mcs.anl.gov>

I agree - that sounds good.

I think you may have already done this, but can you (re)send or post
instructions for what you need to install to run docbook, and the
necessary recipes: how to proof, how to generate pdf vs html, where to
keep the sources, how to add docs to the online pages and nightly builds,
etc.

Last time I did this I spent a long time hunting for the right tools and
update methods. I finally got it working but can't recall how, and I'm
not at all sure I took the best route.

On 7/30/08 8:29 AM, Ben Clifford wrote:
> on a practical note, the existing swift documentation is in docbook and
> if any of this work is intended to actually form part of the swift
> documentation, it should likely also be in that format.

From wilde at mcs.anl.gov Wed Jul 30 08:38:32 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 30 Jul 2008 08:38:32 -0500
Subject: [Swift-devel] Re: Analysis of wrapper.sh
In-Reply-To: <48906E65.5020405@mcs.anl.gov>
References: <488F4D22.4060409@uchicago.edu> <488F52EC.8030200@mcs.anl.gov> <488FF3A6.50702@mcs.anl.gov> <48906E65.5020405@mcs.anl.gov>
Message-ID: <48906ED8.1000907@mcs.anl.gov>

also re docs: we need to start adding some diagrams to the docs,
especially design docs, but user docs as well. what do you suggest for
drawing tool(s)? it would be good to use one simple tool that all
contributors can readily run.

On 7/30/08 8:36 AM, Michael Wilde wrote:
> I agree - that sounds good.
>
> I think you may have already done this, but can you (re)send or post
> instructions for what you need to install to run docbook, and the
> necessary recipes: how to proof, how to generate pdf vs html, where to
> keep the sources, how to add docs to the online pages and nightly
> builds, etc.
>
> Last time I did this I spent a long time hunting for the right tools and
> update methods. I finally got it working but can't recall how, and I'm
> not at all sure I took the best route.
>
> On 7/30/08 8:29 AM, Ben Clifford wrote:
>> on a practical note, the existing swift documentation is in docbook
>> and if any of this work is intended to actually form part of the swift
>> documentation, it should likely also be in that format.
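As a rough illustration of what the source looks like - this is a generic
DocBook skeleton from memory, not necessarily the exact doctype or
conventions the existing Swift docs use (check docs/README and the current
sources for those):

    <article>
      <title>Analysis of wrapper.sh</title>
      <section id="app-contract">
        <title>The application contract</title>
        <para>
          An application can expect to be started in a working directory
          that is not shared with any other application invocation.
        </para>
      </section>
    </article>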
From benc at hawaga.org.uk Wed Jul 30 08:46:55 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 30 Jul 2008 13:46:55 +0000 (GMT)
Subject: [Swift-devel] Re: Analysis of wrapper.sh
In-Reply-To: <48906E65.5020405@mcs.anl.gov>
References: <488F4D22.4060409@uchicago.edu> <488F52EC.8030200@mcs.anl.gov> <488FF3A6.50702@mcs.anl.gov> <48906E65.5020405@mcs.anl.gov>
Message-ID:

On Wed, 30 Jul 2008, Michael Wilde wrote:

> Last time I did this I spent a long time hunting for the right tools and
> update methods. I finally got it working but can't recall how, and I'm
> not at all sure I took the best route.

There is a readme in docs/README. r2160 adds a note about setting it up
on a CI login host.

--

From benc at hawaga.org.uk Wed Jul 30 09:09:12 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 30 Jul 2008 14:09:12 +0000 (GMT)
Subject: [Swift-devel] Re: scheduler foo
In-Reply-To: <1217381572.14271.1.camel@localhost>
References: <1216650659.4064.1.camel@localhost> <1216659517.6481.1.camel@localhost> <1217373352.28446.0.camel@localhost> <1217378110.28446.6.camel@localhost> <1217378561.28446.9.camel@localhost> <1217380477.7960.0.camel@localhost> <1217381572.14271.1.camel@localhost>
Message-ID:

Seems to work better now on my laptop.

With a 90% success rate for jobs, it can sometimes manage to complete the
1000-job test, sometimes not (failing with too many retries, which I'd
expect given that number of jobs and that failure rate).

With an 80% success rate, even the 15-job 130-fmri test takes a long time
to complete.

I'll do more testing, but this seems better.

I think I still want to put in configuration options to allow the delay
when scores are negative to be configured (or disabled entirely, which I
think is sometimes desirable for single-site runs)

--

From wilde at mcs.anl.gov Wed Jul 30 09:42:57 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 30 Jul 2008 09:42:57 -0500
Subject: [Swift-devel] coastersPerNode not recognized by GT2 GRAM
In-Reply-To: <489053A9.4080906@mcs.anl.gov>
References: <489053A9.4080906@mcs.anl.gov>
Message-ID: <48907DF1.8020009@mcs.anl.gov>

I need to set this aside for now, but would appreciate any help in
debugging it.

My sites.xml file is:

<pool handle="...">
  <execution provider="..." url="..." jobManager="gt2:gt2:pbs" />
  <workdirectory>/u/ac/wilde/swiftwork</workdirectory>
  <profile namespace="globus" key="coastersPerNode">8</profile>
</pool>

The logs are on CI net at /home/wilde/coast/run17.

I don't see how this works. Is the code picking up the parameter from the
globus RSL and then passing it to bootstrap.sh to in turn pass it to the
coaster server? It needs to be stripped off the globus profile before the
GT2 job that launches bootstrap.sh is run, right? Else Globus will
complain that it's not valid RSL?

I see the one test case for this in tests/sites/coaster is for localhost.
Was it tested on gt2:gt2:pbs?

- Mike

On 7/30/08 6:42 AM, Michael Wilde wrote:
> When I use coastersPerNode on Abe I get an error:
>
> 2008-07-30 01:06:43,498-0500 DEBUG vdl:execute2 APPLICATION_EXCEPTION
> jobid=echo-9g3mu8xi - Application exception: Cannot submit job
> org.globus.cog.abstraction.impl.common.task.TaskSubmissionException:
> Cannot submit job
>         at org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submitSingleJob(JobSubmissionTaskHandler.java:162)
> ...
> Caused by: org.globus.gram.GramException: Parameter not supported
>
> Do the globus namespace options get put into RSL in GRAM2, and do these
> need to be valid Globus options?
>
> I assume this option is in the globus namespace because that was a
> convenient way of adding per-site options? Or was this meant to be in
> the karajan namespace?
> I'll dig deeper, but wonder if this has been tested on TeraGrid sites.
>
> - Mike
>
> On 7/27/08 2:58 PM, Ben Clifford wrote:
>> cog svn r2094 introduces a profile property coastersPerNode which allows
>> you to spawn multiple coaster workers on a node. this should allow you to
>> take advantage of sites which have multicore CPUs but allocate the whole
>> node, rather than an individual core, when a job is submitted.
>>
>> When using coasters, add this to the site definition:
>>
>>   <profile namespace="globus" key="coastersPerNode">5</profile>
>>
>> to get eg 5 workers on each node.

From wilde at mcs.anl.gov Wed Jul 30 11:05:28 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 30 Jul 2008 11:05:28 -0500
Subject: [Swift-devel] coastersPerNode not recognized by GT2 GRAM
In-Reply-To: <48907DF1.8020009@mcs.anl.gov>
References: <489053A9.4080906@mcs.anl.gov> <48907DF1.8020009@mcs.anl.gov>
Message-ID: <48909148.1010504@mcs.anl.gov>

I'm able to work around this on abe for now by hardcoding coastersPerNode
to 8 and removing the tag from sites.xml. That works.

- Mike

On 7/30/08 9:42 AM, Michael Wilde wrote:
> I need to set this aside for now, but would appreciate any help in
> debugging it.
>
> My sites.xml file is:
>
>   <pool handle="...">
>     <execution provider="..." url="..." jobManager="gt2:gt2:pbs" />
>     <workdirectory>/u/ac/wilde/swiftwork</workdirectory>
>     <profile namespace="globus" key="coastersPerNode">8</profile>
>   </pool>
>
> The logs are on CI net at /home/wilde/coast/run17.
>
> I don't see how this works. Is the code picking up the parameter from the
> globus RSL and then passing it to bootstrap.sh to in turn pass it to the
> coaster server? It needs to be stripped off the globus profile before the
> GT2 job that launches bootstrap.sh is run, right? Else Globus will
> complain that it's not valid RSL?
>
> I see the one test case for this in tests/sites/coaster is for localhost.
> Was it tested on gt2:gt2:pbs?
>
> - Mike
>
> On 7/30/08 6:42 AM, Michael Wilde wrote:
>> When I use coastersPerNode on Abe I get an error:
>>
>> 2008-07-30 01:06:43,498-0500 DEBUG vdl:execute2 APPLICATION_EXCEPTION
>> jobid=echo-9g3mu8xi - Application exception: Cannot submit job
>> org.globus.cog.abstraction.impl.common.task.TaskSubmissionException:
>> Cannot submit job
>>         at org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submitSingleJob(JobSubmissionTaskHandler.java:162)
>> ...
>> Caused by: org.globus.gram.GramException: Parameter not supported
>>
>> Do the globus namespace options get put into RSL in GRAM2, and do these
>> need to be valid Globus options?
>>
>> I assume this option is in the globus namespace because that was a
>> convenient way of adding per-site options? Or was this meant to be in
>> the karajan namespace?
>>
>> I'll dig deeper, but wonder if this has been tested on TeraGrid sites.
>>
>> - Mike
>>
>> On 7/27/08 2:58 PM, Ben Clifford wrote:
>>> cog svn r2094 introduces a profile property coastersPerNode which allows
>>> you to spawn multiple coaster workers on a node. this should allow you to
>>> take advantage of sites which have multicore CPUs but allocate the whole
>>> node, rather than an individual core, when a job is submitted.
>>>
>>> When using coasters, add this to the site definition:
>>>
>>>   <profile namespace="globus" key="coastersPerNode">5</profile>
>>>
>>> to get eg 5 workers on each node.
From benc at hawaga.org.uk Wed Jul 30 11:54:47 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 30 Jul 2008 16:54:47 +0000 (GMT)
Subject: [Swift-devel] coastersPerNode not recognized by GT2 GRAM
In-Reply-To: <48909148.1010504@mcs.anl.gov>
References: <489053A9.4080906@mcs.anl.gov> <48907DF1.8020009@mcs.anl.gov> <48909148.1010504@mcs.anl.gov>
Message-ID:

try cog r2123. i just tested that against ncsa teragrid. it now filters
out that attribute before sending on to gram2.

--

From hategan at mcs.anl.gov Wed Jul 30 18:30:56 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 30 Jul 2008 18:30:56 -0500
Subject: [Swift-devel] Re: scheduler foo
In-Reply-To:
References: <1216650659.4064.1.camel@localhost> <1216659517.6481.1.camel@localhost> <1217373352.28446.0.camel@localhost> <1217378110.28446.6.camel@localhost> <1217378561.28446.9.camel@localhost> <1217380477.7960.0.camel@localhost> <1217381572.14271.1.camel@localhost>
Message-ID: <1217460656.19065.3.camel@localhost>

On Wed, 2008-07-30 at 14:09 +0000, Ben Clifford wrote:
> Seems to work better now on my laptop.
>
> With a 90% success rate for jobs, it can sometimes manage to complete the
> 1000-job test, sometimes not (failing with too many retries, which I'd
> expect given that number of jobs and that failure rate).
>
> With an 80% success rate, even the 15-job 130-fmri test takes a long time
> to complete.

Yes. The scaling back is based on the assertion that it helps, which it
clearly doesn't in this case. However, in a multi-site case it would
make sense.

> I'll do more testing, but this seems better.
>
> I think I still want to put in configuration options to allow the delay
> when scores are negative to be configured (or disabled entirely, which I
> think is sometimes desirable for single-site runs)

Right.

From hategan at mcs.anl.gov Wed Jul 30 18:32:52 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 30 Jul 2008 18:32:52 -0500
Subject: [Swift-devel] coastersPerNode not recognized by GT2 GRAM
In-Reply-To: <48907DF1.8020009@mcs.anl.gov>
References: <489053A9.4080906@mcs.anl.gov> <48907DF1.8020009@mcs.anl.gov>
Message-ID: <1217460772.19065.6.camel@localhost>

On Wed, 2008-07-30 at 09:42 -0500, Michael Wilde wrote:
> I need to set this aside for now, but would appreciate any help in
> debugging it.
>
> My sites.xml file is:
>
>   <pool handle="...">
>     <execution provider="..." url="..." jobManager="gt2:gt2:pbs" />
>     <workdirectory>/u/ac/wilde/swiftwork</workdirectory>
>     <profile namespace="globus" key="coastersPerNode">8</profile>
>   </pool>
>
> The logs are on CI net at /home/wilde/coast/run17.
>
> I don't see how this works. Is the code picking up the parameter from the
> globus RSL and then passing it to bootstrap.sh to in turn pass it to the
> coaster server? It needs to be stripped off the globus profile before the
> GT2 job that launches bootstrap.sh is run, right? Else Globus will
> complain that it's not valid RSL?

Right. Actually the coaster code should explicitly avoid passing that
attribute to gt2. So I consider this a bug.

> I see the one test case for this in tests/sites/coaster is for localhost.
> Was it tested on gt2:gt2:pbs?
> - Mike
>
> On 7/30/08 6:42 AM, Michael Wilde wrote:
>> When I use coastersPerNode on Abe I get an error:
>>
>> 2008-07-30 01:06:43,498-0500 DEBUG vdl:execute2 APPLICATION_EXCEPTION
>> jobid=echo-9g3mu8xi - Application exception: Cannot submit job
>> org.globus.cog.abstraction.impl.common.task.TaskSubmissionException:
>> Cannot submit job
>>         at org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submitSingleJob(JobSubmissionTaskHandler.java:162)
>> ...
>> Caused by: org.globus.gram.GramException: Parameter not supported
>>
>> Do the globus namespace options get put into RSL in GRAM2, and do these
>> need to be valid Globus options?
>>
>> I assume this option is in the globus namespace because that was a
>> convenient way of adding per-site options? Or was this meant to be in
>> the karajan namespace?
>>
>> I'll dig deeper, but wonder if this has been tested on TeraGrid sites.
>>
>> - Mike
>>
>> On 7/27/08 2:58 PM, Ben Clifford wrote:
>>> cog svn r2094 introduces a profile property coastersPerNode which allows
>>> you to spawn multiple coaster workers on a node. this should allow you to
>>> take advantage of sites which have multicore CPUs but allocate the whole
>>> node, rather than an individual core, when a job is submitted.
>>>
>>> When using coasters, add this to the site definition:
>>>
>>>   <profile namespace="globus" key="coastersPerNode">5</profile>
>>>
>>> to get eg 5 workers on each node.

From wilde at mcs.anl.gov Wed Jul 30 18:59:42 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 30 Jul 2008 18:59:42 -0500
Subject: [Swift-devel] coastersPerNode not recognized by GT2 GRAM
In-Reply-To: <1217460772.19065.6.camel@localhost>
References: <489053A9.4080906@mcs.anl.gov> <48907DF1.8020009@mcs.anl.gov> <1217460772.19065.6.camel@localhost>
Message-ID: <4891006E.2010507@mcs.anl.gov>

Ben applied a fix. I will test:

-------- Original Message --------
Subject: Re: [Swift-devel] coastersPerNode not recognized by GT2 GRAM
Date: Wed, 30 Jul 2008 16:54:47 +0000 (GMT)
From: Ben Clifford
To: Michael Wilde
CC: swift-devel
References: <489053A9.4080906 at mcs.anl.gov> <48907DF1.8020009 at mcs.anl.gov> <48909148.1010504 at mcs.anl.gov>

try cog r2123. i just tested that against ncsa teragrid. it now filters
out that attribute before sending on to gram2.

--

On 7/30/08 6:32 PM, Mihael Hategan wrote:
> On Wed, 2008-07-30 at 09:42 -0500, Michael Wilde wrote:
>> I need to set this aside for now, but would appreciate any help in
>> debugging it.
>>
>> My sites.xml file is:
>>
>>   <pool handle="...">
>>     <execution provider="..." url="..." jobManager="gt2:gt2:pbs" />
>>     <workdirectory>/u/ac/wilde/swiftwork</workdirectory>
>>     <profile namespace="globus" key="coastersPerNode">8</profile>
>>   </pool>
>>
>> The logs are on CI net at /home/wilde/coast/run17.
>>
>> I don't see how this works. Is the code picking up the parameter from the
>> globus RSL and then passing it to bootstrap.sh to in turn pass it to the
>> coaster server? It needs to be stripped off the globus profile before the
>> GT2 job that launches bootstrap.sh is run, right? Else Globus will
>> complain that it's not valid RSL?
>
> Right. Actually the coaster code should explicitly avoid passing that
> attribute to gt2. So I consider this a bug.
>> I see the one test case for this in tests/sites/coaster is for localhost.
>> Was it tested on gt2:gt2:pbs?
>>
>> - Mike
>>
>> On 7/30/08 6:42 AM, Michael Wilde wrote:
>>> When I use coastersPerNode on Abe I get an error:
>>>
>>> 2008-07-30 01:06:43,498-0500 DEBUG vdl:execute2 APPLICATION_EXCEPTION
>>> jobid=echo-9g3mu8xi - Application exception: Cannot submit job
>>> org.globus.cog.abstraction.impl.common.task.TaskSubmissionException:
>>> Cannot submit job
>>>         at org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submitSingleJob(JobSubmissionTaskHandler.java:162)
>>> ...
>>> Caused by: org.globus.gram.GramException: Parameter not supported
>>>
>>> Do the globus namespace options get put into RSL in GRAM2, and do these
>>> need to be valid Globus options?
>>>
>>> I assume this option is in the globus namespace because that was a
>>> convenient way of adding per-site options? Or was this meant to be in
>>> the karajan namespace?
>>>
>>> I'll dig deeper, but wonder if this has been tested on TeraGrid sites.
>>>
>>> - Mike
>>>
>>> On 7/27/08 2:58 PM, Ben Clifford wrote:
>>>> cog svn r2094 introduces a profile property coastersPerNode which allows
>>>> you to spawn multiple coaster workers on a node. this should allow you to
>>>> take advantage of sites which have multicore CPUs but allocate the whole
>>>> node, rather than an individual core, when a job is submitted.
>>>>
>>>> When using coasters, add this to the site definition:
>>>>
>>>>   <profile namespace="globus" key="coastersPerNode">5</profile>
>>>>
>>>> to get eg 5 workers on each node.

From hategan at mcs.anl.gov Wed Jul 30 19:05:00 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 30 Jul 2008 19:05:00 -0500
Subject: [Swift-devel] coastersPerNode not recognized by GT2 GRAM
In-Reply-To: <4891006E.2010507@mcs.anl.gov>
References: <489053A9.4080906@mcs.anl.gov> <48907DF1.8020009@mcs.anl.gov> <1217460772.19065.6.camel@localhost> <4891006E.2010507@mcs.anl.gov>
Message-ID: <1217462700.25689.0.camel@localhost>

On Wed, 2008-07-30 at 18:59 -0500, Michael Wilde wrote:
> Ben applied a fix. I will test:

Yeah. Saw that... after sending the email.

> -------- Original Message --------
> Subject: Re: [Swift-devel] coastersPerNode not recognized by GT2 GRAM
> Date: Wed, 30 Jul 2008 16:54:47 +0000 (GMT)
> From: Ben Clifford
> To: Michael Wilde
> CC: swift-devel
> References: <489053A9.4080906 at mcs.anl.gov> <48907DF1.8020009 at mcs.anl.gov> <48909148.1010504 at mcs.anl.gov>
>
> try cog r2123. i just tested that against ncsa teragrid. it now filters
> out that attribute before sending on to gram2.

From benc at hawaga.org.uk Thu Jul 31 08:51:49 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 31 Jul 2008 13:51:49 +0000 (GMT)
Subject: [Swift-devel] Re: Karajan.java not committed with r2161, and should be
In-Reply-To: <123bf0400807310545t6bdd3077oc9919c8f21eff13c@mail.gmail.com>
References: <123bf0400807310545t6bdd3077oc9919c8f21eff13c@mail.gmail.com>
Message-ID:

fixed in r2165

On Thu, 31 Jul 2008, Milena Nikolic wrote:

> I just noticed that Karajan.java is not at r2161 (yesterday's commit), but
> at r2158 (the one from a few days ago). I guess you forgot to commit it.
> You should do it, because some of the tests (which were committed
> yesterday) won't work without it.
>
> Cheers,
> Milena

From benc at hawaga.org.uk Thu Jul 31 09:01:42 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 31 Jul 2008 14:01:42 +0000 (GMT)
Subject: [Swift-devel] Re: scheduler foo
In-Reply-To: <1217460656.19065.3.camel@localhost>
References: <1216650659.4064.1.camel@localhost> <1216659517.6481.1.camel@localhost> <1217373352.28446.0.camel@localhost> <1217378110.28446.6.camel@localhost> <1217378561.28446.9.camel@localhost> <1217380477.7960.0.camel@localhost> <1217381572.14271.1.camel@localhost> <1217460656.19065.3.camel@localhost>
Message-ID:

On Wed, 30 Jul 2008, Mihael Hategan wrote:

> > I think I still want to put in configuration options to allow the delay
> > when scores are negative to be configured (or disabled entirely, which I
> > think is sometimes desirable for single-site runs)
>
> Right.

Might also be useful to change the default based on whether there is one
site or more than one. Though there's a vague principle that adding more
sites should not make a run behave more poorly, and I think that principle
is broken by such a default.

--

From benc at hawaga.org.uk Thu Jul 31 10:38:59 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 31 Jul 2008 15:38:59 +0000 (GMT)
Subject: [Swift-devel] Re: scheduler foo
In-Reply-To: <1217460656.19065.3.camel@localhost>
References: <1216650659.4064.1.camel@localhost> <1216659517.6481.1.camel@localhost> <1217373352.28446.0.camel@localhost> <1217378110.28446.6.camel@localhost> <1217378561.28446.9.camel@localhost> <1217380477.7960.0.camel@localhost> <1217381572.14271.1.camel@localhost> <1217460656.19065.3.camel@localhost>
Message-ID:

On Wed, 30 Jul 2008, Mihael Hategan wrote:

> > I think I still want to put in configuration options to allow the delay
> > when scores are negative to be configured (or disabled entirely, which I
> > think is sometimes desirable for single-site runs)
>
> Right.

That exists now - delayBase in the globus namespace, which can be set per
site. I should probably make it globally configurable too, so that it is
like jobThrottle.

--

From benc at hawaga.org.uk Thu Jul 31 10:39:46 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 31 Jul 2008 15:39:46 +0000 (GMT)
Subject: [Swift-devel] Re: scheduler foo
In-Reply-To:
References: <1216650659.4064.1.camel@localhost> <1216659517.6481.1.camel@localhost> <1217373352.28446.0.camel@localhost> <1217378110.28446.6.camel@localhost> <1217378561.28446.9.camel@localhost> <1217380477.7960.0.camel@localhost> <1217381572.14271.1.camel@localhost> <1217460656.19065.3.camel@localhost>
Message-ID:

On Thu, 31 Jul 2008, Ben Clifford wrote:

> That exists now - delayBase in the globus namespace, which can be set per

karajan namespace even...

--
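Spelled out as a site profile, that would be something along these lines
(the value here is only an illustration; I haven't settled on a sensible
default):

    <!-- base for the per-site retry delay when a site's score goes negative -->
    <profile namespace="karajan" key="delayBase">2</profile>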