From benc at hawaga.org.uk Sun Mar 2 09:25:30 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Sun, 2 Mar 2008 15:25:30 +0000 (GMT) Subject: [Swift-devel] swift 0.4-rc1 Message-ID: I've just made release candidate 1 for swift 0.4. It passes my superficial testing. Please download and test it, ideally with your own big applications. If there are no major problems, this will be released as swift 0.4 sometime Tuesday. http://www.ci.uchicago.edu/~benc/vdsk-0.4-rc1.tar.gz $ md5sum /home/benc/public_html/vdsk-0.4-rc1.tar.gz 90dfd5a91f27a0aea2c0cd56642e7721 /home/benc/public_html/vdsk-0.4-rc1.tar.gz It is Swift SVN r1696 and CoG SVN r1933 From benc at hawaga.org.uk Sun Mar 2 10:56:25 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Sun, 2 Mar 2008 16:56:25 +0000 (GMT) Subject: [Swift-devel] stageout: Expected multiline reply Message-ID: Running from terminable to tg-uc, I get an error at stageout, starting with the below: Caused by: org.globus.ftp.exception.ServerException: Custom message: Could not create MlsxEntry [Nested exception message: Custom message: Expected multiline reply] [Nested exception is org.globus.ftp.exception.FTPException: Custom messa ge: Expected multiline reply] at org.globus.ftp.FTPClient.mlst(FTPClient.java:642) at org.globus.cog.abstraction.impl.file.gridftp.FileResourceImpl.isDirec tory(FileResourceImpl.java:159) The complete log file is http://www.ci.uchicago.edu/~benc/tmp/061-cattwo-20080302-1031-vdualzq0.log This is with 0.4rc1. This is using gram2 to submit to PBS. Other runs from my testing appeared to have worked ok submitting to fork. I'll dig a bit deeper, but I'm not sure what the likely meaning of the above error is. -- From hategan at mcs.anl.gov Sun Mar 2 14:06:24 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 02 Mar 2008 14:06:24 -0600 Subject: [Swift-devel] stageout: Expected multiline reply In-Reply-To: References: Message-ID: <1204488384.7286.0.camel@blabla.mcs.anl.gov> Repeatable or not? 
Consistent or not? On Sun, 2008-03-02 at 16:56 +0000, Ben Clifford wrote: > Running from terminable to tg-uc, I get an error at stageout, starting > with the below: > > Caused by: org.globus.ftp.exception.ServerException: Custom message: > Could not > create MlsxEntry [Nested exception message: Custom message: Expected > multiline > reply] [Nested exception is org.globus.ftp.exception.FTPException: Custom > messa > ge: Expected multiline reply] > at org.globus.ftp.FTPClient.mlst(FTPClient.java:642) > at > org.globus.cog.abstraction.impl.file.gridftp.FileResourceImpl.isDirec > tory(FileResourceImpl.java:159) > > > The complete log file is > http://www.ci.uchicago.edu/~benc/tmp/061-cattwo-20080302-1031-vdualzq0.log > > This is with 0.4rc1. This is using gram2 to submit to PBS. Other runs from > my testing appeared to have worked ok submitting to fork. I'll dig a bit > deeper, but I'm not sure what the likely meaning of the above error is. > From benc at hawaga.org.uk Sun Mar 2 14:08:41 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Sun, 2 Mar 2008 20:08:41 +0000 (GMT) Subject: [Swift-devel] stageout: Expected multiline reply In-Reply-To: <1204488384.7286.0.camel@blabla.mcs.anl.gov> References: <1204488384.7286.0.camel@blabla.mcs.anl.gov> Message-ID: at least today its repeatable and consistent against TGUC using pbs+gram2 but doesn't happen any time I use fork+gram2. some other sites that I've run it against haven't had this problem. I rarely run stuff from terminable so I have no idea if this has been the case for some time or is for one day only. On Sun, 2 Mar 2008, Mihael Hategan wrote: > Repeatable or not? Consistent or not? 
> > On Sun, 2008-03-02 at 16:56 +0000, Ben Clifford wrote: > > Running from terminable to tg-uc, I get an error at stageout, starting > > with the below: > > > > Caused by: org.globus.ftp.exception.ServerException: Custom message: > > Could not > > create MlsxEntry [Nested exception message: Custom message: Expected > > multiline > > reply] [Nested exception is org.globus.ftp.exception.FTPException: Custom > > messa > > ge: Expected multiline reply] > > at org.globus.ftp.FTPClient.mlst(FTPClient.java:642) > > at > > org.globus.cog.abstraction.impl.file.gridftp.FileResourceImpl.isDirec > > tory(FileResourceImpl.java:159) > > > > > > The complete log file is > > http://www.ci.uchicago.edu/~benc/tmp/061-cattwo-20080302-1031-vdualzq0.log > > > > This is with 0.4rc1. This is using gram2 to submit to PBS. Other runs from > > my testing appeared to have worked ok submitting to fork. I'll dig a bit > > deeper, but I'm not sure what the likely meaning of the above error is. > > > > From hategan at mcs.anl.gov Sun Mar 2 14:13:10 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 02 Mar 2008 14:13:10 -0600 Subject: [Swift-devel] stageout: Expected multiline reply In-Reply-To: References: <1204488384.7286.0.camel@blabla.mcs.anl.gov> Message-ID: <1204488790.7286.3.camel@blabla.mcs.anl.gov> So we may either be dealing with a gridftp incompatibility or some strange cog thing. I'll have to debug to be able to tell. On Sun, 2008-03-02 at 20:08 +0000, Ben Clifford wrote: > at least today its repeatable and consistent against TGUC using pbs+gram2 > but doesn't happen any time I use fork+gram2. > > some other sites that I've run it against haven't had this problem. > > I rarely run stuff from terminable so I have no idea if this has been the > case for some time or is for one day only. > > On Sun, 2 Mar 2008, Mihael Hategan wrote: > > > Repeatable or not? Consistent or not? 
> > > > On Sun, 2008-03-02 at 16:56 +0000, Ben Clifford wrote: > > > Running from terminable to tg-uc, I get an error at stageout, starting > > > with the below: > > > > > > Caused by: org.globus.ftp.exception.ServerException: Custom message: > > > Could not > > > create MlsxEntry [Nested exception message: Custom message: Expected > > > multiline > > > reply] [Nested exception is org.globus.ftp.exception.FTPException: Custom > > > messa > > > ge: Expected multiline reply] > > > at org.globus.ftp.FTPClient.mlst(FTPClient.java:642) > > > at > > > org.globus.cog.abstraction.impl.file.gridftp.FileResourceImpl.isDirec > > > tory(FileResourceImpl.java:159) > > > > > > > > > The complete log file is > > > http://www.ci.uchicago.edu/~benc/tmp/061-cattwo-20080302-1031-vdualzq0.log > > > > > > This is with 0.4rc1. This is using gram2 to submit to PBS. Other runs from > > > my testing appeared to have worked ok submitting to fork. I'll dig a bit > > > deeper, but I'm not sure what the likely meaning of the above error is. > > > > > > > > From benc at hawaga.org.uk Sun Mar 2 22:33:51 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 3 Mar 2008 04:33:51 +0000 (GMT) Subject: [Swift-devel] Re: stageout: Expected multiline reply In-Reply-To: References: Message-ID: this doesn't happen now i try it again a few hours later. grr. -- From hategan at mcs.anl.gov Mon Mar 3 05:01:31 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 03 Mar 2008 05:01:31 -0600 Subject: [Swift-devel] Re: stageout: Expected multiline reply In-Reply-To: References: Message-ID: <1204542091.9797.1.camel@blabla.mcs.anl.gov> Maybe it was a glitch in the gridftp server. The error message sounds like it was (i.e. a protocol problem). On Mon, 2008-03-03 at 04:33 +0000, Ben Clifford wrote: > this doesn't happen now i try it again a few hours later. grr. 
From benc at hawaga.org.uk Tue Mar 4 17:33:18 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 4 Mar 2008 23:33:18 +0000 (GMT) Subject: [Swift-devel] Swift log processing code Message-ID: Over the past few months, I've been developing log processing and analysis code that takes in various log files from Swift runs and uses them to make various plots. I have written a small note on how to download and use these tools, so that others can experiment with them: http://www.ci.uchicago.edu/swift/guides/log-processing.php Much of the output is rather rough and poorly documented, however I'm quite happy to explain stuff on these lists if/when people have questions. -- From benc at hawaga.org.uk Wed Mar 5 08:01:49 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 5 Mar 2008 14:01:49 +0000 (GMT) Subject: [Swift-devel] Re: stageout: Expected multiline reply In-Reply-To: References: Message-ID: On Mon, 3 Mar 2008, Ben Clifford wrote: > this doesn't happen now i try it again a few hours later. grr. actually it does still happen most of the time. from terminable to tg-uc using gram2. all the other sites in my testing appear to work ok (those being the sites that I just committed to tests/sites/ in the SVN) - including tguc + gram4 + pbs and teraport + gram2 + pbs. This happens both with 0.4rc1 and yesterday's HEADs -- From hategan at mcs.anl.gov Wed Mar 5 08:13:49 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 05 Mar 2008 08:13:49 -0600 Subject: [Swift-devel] Re: stageout: Expected multiline reply In-Reply-To: References: Message-ID: <1204726429.4180.46.camel@blabla.mcs.anl.gov> Ok, can you enable debug on org.globus.ftp? On Wed, 2008-03-05 at 14:01 +0000, Ben Clifford wrote: > On Mon, 3 Mar 2008, Ben Clifford wrote: > > > this doesn't happen now i try it again a few hours later. grr. > > actually it does still happen most of the time. from terminable to tg-uc > using gram2. 
> all the other sites in my testing appear to work ok (those being the sites > that I just committed to tests/sites/ in the SVN) - including tguc + gram4 > + pbs and teraport + gram2 + pbs. > > This happens both with 0.4rc1 and yesterday's HEADs > > > From benc at hawaga.org.uk Wed Mar 5 08:42:00 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 5 Mar 2008 14:42:00 +0000 (GMT) Subject: [Swift-devel] Re: stageout: Expected multiline reply In-Reply-To: <1204726429.4180.46.camel@blabla.mcs.anl.gov> References: <1204726429.4180.46.camel@blabla.mcs.anl.gov> Message-ID: On Wed, 5 Mar 2008, Mihael Hategan wrote: > Ok, can you enable debug on org.globus.ftp? http://www.ci.uchicago.edu/~benc/tmp/061-cattwo-20080305-0819-28le50uc.log This is running 061-cattwo in tests/language-behaviour/ -- From hategan at mcs.anl.gov Thu Mar 6 05:42:44 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 06 Mar 2008 05:42:44 -0600 Subject: [Swift-devel] Re: stageout: Expected multiline reply In-Reply-To: References: <1204726429.4180.46.camel@blabla.mcs.anl.gov> Message-ID: <1204803764.5265.2.camel@blabla.mcs.anl.gov> There's quite a bit of weirdness there. I'll have to try to reproduce it and dig deeper. On Wed, 2008-03-05 at 14:42 +0000, Ben Clifford wrote: > On Wed, 5 Mar 2008, Mihael Hategan wrote: > > > Ok, can you enable debug on org.globus.ftp? 
> > http://www.ci.uchicago.edu/~benc/tmp/061-cattwo-20080305-0819-28le50uc.log > > This is running 061-cattwo in tests/language-behaviour/ > From benc at hawaga.org.uk Thu Mar 6 11:26:02 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 6 Mar 2008 17:26:02 +0000 (GMT) Subject: [Swift-devel] Failed null Message-ID: I'm seeing plenty of errors during stagein that look like this: 2008-03-06 10:42:02,911-0600 DEBUG TaskImpl Task(type=FILE_TRANSFER, identity=urn:0-624-4-1204821602563) setting status to Submitting 2008-03-06 10:42:02,911-0600 DEBUG TaskImpl Task(type=FILE_TRANSFER, identity=urn:0-624-4-1204821602563) setting status to Submitted 2008-03-06 10:42:02,912-0600 DEBUG TaskImpl Task(type=FILE_TRANSFER, identity=urn:0-624-4-1204821602563) setting status to Active 2008-03-06 10:42:02,937-0600 DEBUG TaskImpl Task(type=FILE_TRANSFER, identity=urn:0-624-4-1204821602563) setting status to Failed null 'null' is not so helpful - also there's nothing indicating which file was being transferred here... 
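[Editor's note: the org.globus.ftp debug output requested earlier in this thread is typically enabled through log4j. This is a sketch only: the logger package name comes from the stack traces above and uses standard log4j property syntax, but where a given Swift install reads its log4j.properties from is an assumption, not confirmed in the thread.]

```properties
# Raise the CoG GridFTP client package to DEBUG so protocol
# traffic (PASV replies, MLST exchanges, etc.) appears in the log.
# The location of this properties file within a Swift install may vary.
log4j.logger.org.globus.ftp=DEBUG
```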
-- From benc at hawaga.org.uk Thu Mar 6 12:50:50 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 6 Mar 2008 18:50:50 +0000 (GMT) Subject: [Swift-devel] Re: Failed null In-Reply-To: References: Message-ID: I got mike kubal to rerun with org.globus.ftp debugging on and I see this: 2008-03-06 12:20:15,173-0600 DEBUG FTPControlChannel Control channel sending: PA SV 2008-03-06 12:20:15,174-0600 DEBUG Reply read 1st line 2008-03-06 12:20:15,180-0600 DEBUG Reply 1st line: 227 Entering Passive Mode (19 2,5,198,208,195,86) 2008-03-06 12:20:15,180-0600 DEBUG FTPControlChannel Control channel received: 2 27 Entering Passive Mode (192,5,198,208,195,86) 2008-03-06 12:20:15,180-0600 DEBUG GridFTPServerFacade hostport: 192.5.198.208 5 0006 2008-03-06 12:20:15,180-0600 DEBUG TransferThreadManager adding new empty socket Box to the socket pool 2008-03-06 12:20:15,180-0600 DEBUG SocketPool adding a free socket 2008-03-06 12:20:15,180-0600 DEBUG TransferThreadManager connecting active socke t 0; total cached sockets = 1 2008-03-06 12:20:15,185-0600 DEBUG TaskThread executing task: org.globus.ftp.dc. GridFTPActiveConnectTask at 851105 2008-03-06 12:20:15,186-0600 DEBUG GridFTPActiveConnectTask connecting new socke t to: 192.5.198.208 50006 2008-03-06 12:20:15,188-0600 DEBUG TaskImpl Task(type=FILE_TRANSFER, identity=ur n:0-114-5-1204827503989) setting status to Failed null not sure what's happening after 'connecting new socket' that makes the provider decide its failed... 
-- From benc at hawaga.org.uk Thu Mar 6 13:05:13 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 6 Mar 2008 19:05:13 +0000 (GMT) Subject: [Swift-devel] hanging without allocating a host Message-ID: In the last couple of runs that mike kubal has made, with a large number of file transfer failures, the workflow eventually hangs with no karajan tasks in progress, and (according to a debug line I just added) apparently waiting for a host to be allocated - this will, I guess, never happen as nothing is happening to change the host scores. bleugh. -- From benc at hawaga.org.uk Thu Mar 6 13:07:03 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 6 Mar 2008 19:07:03 +0000 (GMT) Subject: [Swift-devel] Re: hanging without allocating a host In-Reply-To: References: Message-ID: the most recent log for this is wiggum:/nfs/dsl-homes03/mkubal/IMPDH/Swift_MD_Runs/Fixed_Ligands/ligands/run_MD_pipeline_loop_for_impdh-20080306-1218-g6aci607.lo it has ftp logging turned on and shows the FTP problem, and also you can see it stop at 13:01 after a successful job completion (one of about 5 for the whole run) and hang for a long time. -- From benc at hawaga.org.uk Thu Mar 6 18:10:17 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 7 Mar 2008 00:10:17 +0000 (GMT) Subject: [Swift-devel] Re: swift 0.4-rc1 In-Reply-To: References: Message-ID: On Sun, 2 Mar 2008, Ben Clifford wrote: > If there are no major problems, this will be released as swift 0.4 > sometime Tuesday. The gridftp problems of the past few days are giving me bad vibes so I'm going to wait a while. 
-- From iraicu at cs.uchicago.edu Fri Mar 7 00:03:04 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Fri, 07 Mar 2008 00:03:04 -0600 Subject: [Swift-devel] Re: Question of wrapper.sh In-Reply-To: References: <47BEC111.8030503@mcs.anl.gov> <47BF3593.7030600@mcs.anl.gov> <47BF36EE.90504@cs.uchicago.edu> <47BF377A.6020707@uchicago.edu> <47BF38C8.10507@cs.uchicago.edu> <47BF4D3C.7030100@uchicago.edu> <47BF4FCD.9070904@cs.uchicago.edu> <47C13DEC.4090008@uchicago.edu> <47C1C5CD.5080906@cs.uchicago.edu> <47C1C8C6.1020009@cs.uchicago.edu> <47C2E11F.2070208@mcs.anl.gov> <47C2E483.70005@cs.uchicago.edu> <47C2FB7B.3000901@uchicago.edu> <47C30053.8010608@mcs.anl.gov> <47C3E648.8040101@uchicago.edu> <47C421FC.10405@mcs.anl.gov> <47C429D0.9010109@cs.uchicago.edu> <47C43C40.50702@mcs.anl.gov> <47C43D15.8080608@cs.uchicago.edu> <47C440F9.6090701@mcs.anl.gov> <47C44238.9060007@cs.uchicago.edu> <47C44EC2.4070901@uchicago.edu> <47C479EE.90901@mcs.anl.gov> <47C5039F.8020001@cs.uchicago.edu> <47D07BDC.60406@uchicago.edu> Message-ID: <47D0DA98.6010308@cs.uchicago.edu> Ben Clifford wrote: > you should send questions like this to swift-devel or swift-user list > rather than attempting to compose your own list of likely candidates and > witholding the information from the public archives. > Made the reply to the Swift devel mailing lists... > >> I am trying to dig into the wrapper.sh, disable the log to enhance the >> performance. >> > > Do you have numbers that suggest logging is causing a performance > degradation? By default, Swift is able to do about 5 jobs/sec running over Falkon on 256 CPUs on the BG/P, where each job is a sleep 0. The Falkon command line client can do about 1700 jobs/sec on the same hardware. 9 months ago, I saw Swift go from a few jobs/sec to about 50 jobs/sec by stripping out all logging (i.e. echo "..." >> LOG) from the wrapper script, and by removing the mkdir and symbolic linking. 
Since the mkdir is much improved now, I assume that is not the bottleneck, but doing 10~20 echo to a log file on the shared file system from many nodes at the same time is expensive, which I think is the main bottleneck in the current wrapper script. Once Zhao is done disabling all logging, except for necessary ones, we'll have a better idea of how fast we can go, and if it is necessary to eliminate the mkdir step as well. I think getting about 50 jobs/sec is within reach by streamlining the wrapper.sh script, but I think we'll have to think of ways to push those numbers even higher! > I notice you're using quite an old version of swift iraicu at login1.surveyor:/home/zzhang/cog/modules/vdsk> svn info Path: . URL: https://svn.ci.uchicago.edu/svn/vdl2/trunk Repository Root: https://svn.ci.uchicago.edu/svn/vdl2 Repository UUID: e2bb083e-7f23-0410-b3a8-8253ac9ef6d8 Revision: 1673 Node Kind: directory Schedule: normal Last Changed Author: benc at CI.UCHICAGO.EDU Last Changed Rev: 1670 Last Changed Date: 2008-02-09 12:42:56 -0600 (Sat, 09 Feb 2008) It doesn't seem that old, but we'll update to the latest one before we do more experiments. > (the last > release) - we made substantial log speed improvements subsequent to that. > If you're hitting log file problems here, there is a fair chance that > you'll encounter other scalability problems on the site filesystem that we > also fixed in SVN some months ago. > Right, I know, and I thought we were using a late enough version that had those fixes. Just to be sure, we'll upgrade! > >> One thing I notice is that for each job, correct me if I am wrong, SWIFT >> will make a unique directory with the date and a random string, then >> copy wrapper.sh and other necessary files to that directory. >> > > It should do that one per workflow per site, not per job. > Every job still has a scratch space sandbox, which results in a mkdir, symbolic linking, and finally a cleanup remove dir. I think this is the dir he is referring to. 
BTW, if there would be an easy way to eliminate this entire mkdir part of the wrapper script without breaking anything in Swift, it would be nice. The apps we are dealing with don't need the sandboxing, as we know all input files, and all output files, and we'll never have input as *.fits that might be ambiguous if we don't sandbox. Ioan > >> echo -abc "Hello, world!" stdout=@filename(t); >> > > Put the -abc in quotes: > > echo "-abc" "hello" > > to solve the immediate problem. > > However, note that the command: > echo -abc hello > executes successfully on my linux and os x boxes. > > If you want a job that will fail, try the 'false' command. > > >> RunID: 20080306-1647-4nd1cymf >> Execution failed: >> Variable not found: abc >> > > This is because you did not quote "-abc", so swift is trying to give you > the unary negative value of -abc (just like if you said -abc in Java or > C). > > >> But I still can not find the default working directory of this task. >> Also, I know there is a log file for this wrapper, so it is in the >> working directory, right? >> > > Swift will never have attempted to run the above, because of the above > error. > > >> Another question is, could you give me a simple task description of >> wrapper.sh? So I could invoke wrapper.sh directly without falkon. I got >> a task description before, >> >> 140.221.82.10 : urn:0-195-1203621652641 : EXECUTABLE /bin/bash ARGUEMENTS >> shared/wrapper.sh sleep-1j38kqoi -jobdir 1 -e /bin/sleep -out stdout.txt -err >> stderr.txt -i -d -if -of -k -a 0 >> >> but it is within the working directory, and I don't understand what >> "sleep-1j38kqoi" means. >> > > sleep-1j38kqoi is a job identifier (in Swift internal language, an > execute2 identifier, perhaps) which identifies one attempt to run an > application. This is used to label log files and working directories for > this. > > -- =================================================== Ioan Raicu Ph.D. 
Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== -------------- next part -------------- An HTML attachment was scrubbed... URL: From benc at hawaga.org.uk Fri Mar 7 02:36:40 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 7 Mar 2008 08:36:40 +0000 (GMT) Subject: [Swift-devel] Re: Question of wrapper.sh In-Reply-To: <47D0DA98.6010308@cs.uchicago.edu> References: <47BEC111.8030503@mcs.anl.gov> <47BF38C8.10507@cs.uchicago.edu> <47BF4D3C.7030100@uchicago.edu> <47BF4FCD.9070904@cs.uchicago.edu> <47C13DEC.4090008@uchicago.edu> <47C1C5CD.5080906@cs.uchicago.edu> <47C1C8C6.1020009@cs.uchicago.edu> <47C2E11F.2070208@mcs.anl.gov> <47C2E483.70005@cs.uchicago.edu> <47C2FB7B.3000901@uchicago.edu> <47C30053.8010608@mcs.anl.gov> <47C3E648.8040101@uchicago.edu> <47C421FC.10405@mcs.anl.gov> <47C429D0.9010109@cs.uchicago.edu> <47C43C40.50702@mcs.anl.gov> <47C43D15.8080608@cs.uchicago.edu> <47C440F9.6090701@mcs.anl.gov> <47C44238.9060007@cs.uchicago.edu> <47C44EC2.4070901@uchicago.edu> <47C479EE.90901@mcs.anl.gov> <47C5039F.8020001@cs.uchicago.edu> <47D07BDC.60406@uchicago.edu> <47D0DA98.6010308@cs.uchicago.edu> Message-ID: On Fri, 7 Mar 2008, Ioan Raicu wrote: > > I notice you're using quite an old version of swift > iraicu at login1.surveyor:/home/zzhang/cog/modules/vdsk> svn info > Revision: 1673 Zhao's log reported this: zhaozhang at viper:~/vdsk-0.3/examples/vdsk> swift first.swift Swift v0.3 r1319 (modified locally) which is nowhere near the version that you report - something in the 1600s is a 
reasonable number, but that's not what the log output was. So please clarify which version you actually have poor behaviour for. If it is in the r1600 range, then I'm interested to fix up - but in the r1600s there should be absolutely no cross-node shared log files at all. > Every job still has a scratch space sandbox, which results in a mkdir, > symbolic linking, and finally a cleanup remove dir. I think this is the > dir he is referring to. BTW, if there would be an easy way to eliminate > this entire mkdir part of the wrapper script without breaking anything > in Swift, it would be nice. The apps we are dealing with don't need the > sandboxing, as we know all input files, and all output files, and we'll > never have input as *.fits that might be ambiguous if we don't sandbox. If the jobs touch only those files, I think you can probably eliminate the mkdir, the ln to copy files in and the cp to copy files back out, and run the job in the shared directory directly. However, as each node will then be trying to do stuff to that shared directory, my initial thoughts would be that it wouldn't really change much (perhaps better, perhaps worse). 
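[Editor's note: the sandbox-versus-shared-directory trade-off discussed above can be sketched as plain shell. This is illustrative only, not Swift's actual wrapper code; the directory names and the `tr` stand-in for the application are assumptions.]

```shell
#!/bin/sh
# Illustrative names only (not Swift's real layout).
SHARED="shared"
JOBDIR="$SHARED/jobs/job-123"
mkdir -p "$SHARED"
echo data > "$SHARED/input.txt"

# Sandboxed style: mkdir a per-job dir, link inputs in,
# run there, copy outputs back out, then clean up.
mkdir -p "$JOBDIR"
ln -s "$(pwd)/$SHARED/input.txt" "$JOBDIR/input.txt"
( cd "$JOBDIR" && tr a-z A-Z < input.txt > output.txt )
cp "$JOBDIR/output.txt" "$SHARED/"
rm -rf "$JOBDIR"

# Direct style: when all inputs and outputs are known in advance,
# skip the sandbox and run in the shared directory itself -- at the
# cost of every node touching that one directory concurrently.
( cd "$SHARED" && tr a-z A-Z < input.txt > output2.txt )
```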
-- From benc at hawaga.org.uk Fri Mar 7 02:43:48 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 7 Mar 2008 08:43:48 +0000 (GMT) Subject: [Swift-devel] Re: Question of wrapper.sh In-Reply-To: <47D0DA98.6010308@cs.uchicago.edu> References: <47BEC111.8030503@mcs.anl.gov> <47BF38C8.10507@cs.uchicago.edu> <47BF4D3C.7030100@uchicago.edu> <47BF4FCD.9070904@cs.uchicago.edu> <47C13DEC.4090008@uchicago.edu> <47C1C5CD.5080906@cs.uchicago.edu> <47C1C8C6.1020009@cs.uchicago.edu> <47C2E11F.2070208@mcs.anl.gov> <47C2E483.70005@cs.uchicago.edu> <47C2FB7B.3000901@uchicago.edu> <47C30053.8010608@mcs.anl.gov> <47C3E648.8040101@uchicago.edu> <47C421FC.10405@mcs.anl.gov> <47C429D0.9010109@cs.uchicago.edu> <47C43C40.50702@mcs.anl.gov> <47C43D15.8080608@cs.uchicago.edu> <47C440F9.6090701@mcs.anl.gov> <47C44238.9060007@cs.uchicago.edu> <47C44EC2.4070901@uchicago.edu> <47C479EE.90901@mcs.anl.gov> <47C5039F.8020001@cs.uchicago.edu> <47D07BDC.60406@uchicago.edu> <47D0DA98.6010308@cs.uchicago.edu> Message-ID: On Fri, 7 Mar 2008, Ioan Raicu wrote: > symbolic linking. Since the mkdir is much improved now, I assume that is not > the bottleneck, but doing 10~20 echo to a log file on the shared file system > from many nodes at the same time is expensive, which I think is the main > bottleneck in the current wrapper script. Once Zhao is done disabling all > logging, except for necessary ones, we'll have a better idea of how fast we > can go, and if it is necessary to eliminate the mkdir step as well. When I was playing with this around the time of SC, I put in a bunch of progress logging inside the wrapper script. This adds to the amount of logging that the wrapper does, but gives a many stage breakdown of where the wrapper script is spending its time. Run a bunch of jobs, eg a few thousand, with latest SVN and wrapperlog.always.transfer=true set in swift.properties. You'll get a .d directory, with a bunch of .info files. 
From there I (or you) can graph how each wrapper script spent its time. Ideally there should be a bunch of steps taking almost no time, then the executable, then another bunch of steps taking almost no time; but doing this should reveal wrong behaviour there. Poke me when you have that dump directory and I can have a look. -- From iraicu at cs.uchicago.edu Fri Mar 7 07:40:08 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Fri, 07 Mar 2008 07:40:08 -0600 Subject: [Swift-devel] Re: Question of wrapper.sh In-Reply-To: References: <47BEC111.8030503@mcs.anl.gov> <47BF38C8.10507@cs.uchicago.edu> <47BF4D3C.7030100@uchicago.edu> <47BF4FCD.9070904@cs.uchicago.edu> <47C13DEC.4090008@uchicago.edu> <47C1C5CD.5080906@cs.uchicago.edu> <47C1C8C6.1020009@cs.uchicago.edu> <47C2E11F.2070208@mcs.anl.gov> <47C2E483.70005@cs.uchicago.edu> <47C2FB7B.3000901@uchicago.edu> <47C30053.8010608@mcs.anl.gov> <47C3E648.8040101@uchicago.edu> <47C421FC.10405@mcs.anl.gov> <47C429D0.9010109@cs.uchicago.edu> <47C43C40.50702@mcs.anl.gov> <47C43D15.8080608@cs.uchicago.edu> <47C440F9.6090701@mcs.anl.gov> <47C44238.9060007@cs.uchicago.edu> <47C44EC2.4070901@uchicago.edu> <47C479EE.90901@mcs.anl.gov> <47C5039F.8020001@cs.uchicago.edu> <47D07BDC.60406@uchicago.edu> <47D0DA98.6010308@cs.uchicago.edu> Message-ID: <47D145B8.2030804@cs.uchicago.edu> Ben Clifford wrote: > On Fri, 7 Mar 2008, Ioan Raicu wrote: > > > >>> I notice you're using quite an old version of swift >>> >> iraicu at login1.surveyor:/home/zzhang/cog/modules/vdsk> svn info >> > > >> Revision: 1673 >> > > Zhao's log reported this: > > zhaozhang at viper:~/vdsk-0.3/examples/vdsk> swift first.swift > Swift v0.3 r1319 (modified locally) > Notice that this Swift is from viper, a machine in RI. He might be playing with this to get used to learning about Swift, but that is not being used on the BG/P in any way, as everything we run on the BG/P needs to run from the login nodes. 
Zhao, can you confirm that the runs you are making with Swift on the BG/P are indeed using a recent version? Can you also do an update of Swift on the BG/P to make sure we have the latest? > which is nowhere near the version that you report - something in the 1600s > is a reasonable number, but that's not what the log output was. So please > clarify which version you actually have poor behaviour for. If it is in > the r1600 range, then I'm interested to fix up - but in the r1600s there > should be absolutely no cross-node shared log files at all. > > >> Every job still has a scratch space sandbox, which results in a mkdir, >> symbolic linking, and finally a cleanup remove dir. I think this is the >> dir he is referring to. BTW, if there would be an easy way to eliminate >> this entire mkdir part of the wrapper script without breaking anything >> in Swift, it would be nice. The apps we are dealing with don't need the >> sandboxing, as we know all input files, and all output files, and we'll >> never have input as *.fits that might be ambiguous if we don't sandbox. >> > > If the jobs touch only those files, I think you can probably eliminate the > mkdir, the ln to copy files in and the cp to copy files back out, and run > the job in the shared directory directly. However, as each node will then > be trying to do stuff to that shared directly, my initial thoughts would > be that it wouldn't really change much (perhaps better, perhaps worse). > That is what I thought. We'll try that and see what we get! We'll keep you posted. Ioan -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 
58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== -------------- next part -------------- An HTML attachment was scrubbed... URL: From iraicu at cs.uchicago.edu Fri Mar 7 07:42:40 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Fri, 07 Mar 2008 07:42:40 -0600 Subject: [Swift-devel] Re: Question of wrapper.sh In-Reply-To: References: <47BEC111.8030503@mcs.anl.gov> <47BF38C8.10507@cs.uchicago.edu> <47BF4D3C.7030100@uchicago.edu> <47BF4FCD.9070904@cs.uchicago.edu> <47C13DEC.4090008@uchicago.edu> <47C1C5CD.5080906@cs.uchicago.edu> <47C1C8C6.1020009@cs.uchicago.edu> <47C2E11F.2070208@mcs.anl.gov> <47C2E483.70005@cs.uchicago.edu> <47C2FB7B.3000901@uchicago.edu> <47C30053.8010608@mcs.anl.gov> <47C3E648.8040101@uchicago.edu> <47C421FC.10405@mcs.anl.gov> <47C429D0.9010109@cs.uchicago.edu> <47C43C40.50702@mcs.anl.gov> <47C43D15.8080608@cs.uchicago.edu> <47C440F9.6090701@mcs.anl.gov> <47C44238.9060007@cs.uchicago.edu> <47C44EC2.4070901@uchicago.edu> <47C479EE.90901@mcs.anl.gov> <47C5039F.8020001@cs.uchicago.edu> <47D07BDC.60406@uchicago.edu> <47D0DA98.6010308@cs.uchicago.edu> Message-ID: <47D14650.2000206@cs.uchicago.edu> Ben Clifford wrote: > On Fri, 7 Mar 2008, Ioan Raicu wrote: > > >> symbolic linking. Since the mkdir is much improved now, I assume that is not >> the bottleneck, but doing 10~20 echo to a log file on the shared file system >> from many nodes at the same time is expensive, which I think is the main >> bottleneck in the current wrapper script. Once Zhao is done disabling all >> logging, except for necessary ones, we'll have a better idea of how fast we >> can go, and if it is necessary to eliminate the mkdir step as well. 
>> > > When I was playing with this around the time of SC, I put in a bunch of > progress logging inside the wrapper script. This adds to the amount of > logging that the wrapper does, but gives a many stage breakdown of where > the wrapper script is spending its time. > > Run a bunch of jobs, eg a few thousand, with latest SVN and > wrapperlog.always.transfer=true set in swift.properties. > > You'll get a .d directory, with a bunch of .info files. From there > I (or you) can graph how each wrapper script spent its time. > > Ideally there should be a bunch of steps taking almost no time, then the > executable, then another bunch of steps taking almost no time; but doing > this should reveal wrong behaviour there. > > Poke me when you have that dump directory and I can have a look. > Ideally, we'd want any extra logging outside of the bare minimum to be optional, something that could be turned on or off depending on output level. Maybe you or Mihael could work in such an option in the future, so we could easily disable all logging in the wrapper script if we need to. In the meantime, we'll hack away to it ourselves :) We'll try to do some back to back comparison runs, and save the logs, and let you know where they are for later debugging. Thanks, Ioan -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From benc at hawaga.org.uk Fri Mar 7 09:51:55 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 7 Mar 2008 15:51:55 +0000 (GMT) Subject: [Swift-devel] Re: Question of wrapper.sh In-Reply-To: <47D14650.2000206@cs.uchicago.edu> References: <47BEC111.8030503@mcs.anl.gov> <47C13DEC.4090008@uchicago.edu> <47C1C5CD.5080906@cs.uchicago.edu> <47C1C8C6.1020009@cs.uchicago.edu> <47C2E11F.2070208@mcs.anl.gov> <47C2E483.70005@cs.uchicago.edu> <47C2FB7B.3000901@uchicago.edu> <47C30053.8010608@mcs.anl.gov> <47C3E648.8040101@uchicago.edu> <47C421FC.10405@mcs.anl.gov> <47C429D0.9010109@cs.uchicago.edu> <47C43C40.50702@mcs.anl.gov> <47C43D15.8080608@cs.uchicago.edu> <47C440F9.6090701@mcs.anl.gov> <47C44238.9060007@cs.uchicago.edu> <47C44EC2.4070901@uchicago.edu> <47C479EE.90901@mcs.anl.gov> <47C5039F.8020001@cs.uchicago.edu> <47D07BDC.60406@uchicago.edu> <47D0DA98.6010308@cs.uchicago.edu> <47D14650.2000206@cs.uchicago.edu> Message-ID: On Fri, 7 Mar 2008, Ioan Raicu wrote: > Ideally, we'd want any extra logging outside of the bare minimum to be > optional, something that could be turned on or off depending on output level. > Maybe you or Mihael could work in such an option in the future, so we could > easily disable all logging in the wrapper script if we need to. In the You can perhaps convince me easily (or unconvince me) if you provide the -info files for a large test run. I implemented measurements there in november to provide quantified results for almost exactly this situation. You will considerably enhance this discourse by providing that information now rather than later. 
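The per-stage breakdown Ben describes can be summarized with a small script once the run's .d directory of -info files is in hand. This is only a sketch: the exact layout of the *-info files is not shown in this thread, so the two-column "<epoch-seconds> <stage-name>" line format below is a hypothetical stand-in, as is the `summarize_info` name; adapt the awk to whatever the wrapper actually writes.

```shell
# Print how long each wrapper stage took, given an -info file whose lines
# are assumed (hypothetically) to look like: "<epoch-seconds> <stage-name>".
summarize_info() {
    awk '
        NR > 1 { printf "stage=%s duration=%ds\n", prevstage, $1 - prev }
        { prev = $1; prevstage = $2 }
    ' "$1"
}

# Synthetic demo file standing in for a real jobid-info file:
cat > /tmp/demo-info <<'EOF'
1204888800 LOG_START
1204888801 CREATE_JOBDIR
1204888804 EXECUTE
1204888804 STAGE_OUT
EOF
summarize_info /tmp/demo-info
# Stages that take almost no time show duration=0s; a slow shared
# filesystem would show up as a large gap around the mkdir/logging steps.
```

Ideally most stages print duration=0s with the executable itself accounting for nearly all of the elapsed time; anything else points at wrapper overhead.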
-- From benc at hawaga.org.uk Fri Mar 7 10:41:07 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 7 Mar 2008 16:41:07 +0000 (GMT) Subject: [Swift-devel] Re: Question of wrapper.sh In-Reply-To: <47D16F6D.9070903@uchicago.edu> References: <47BEC111.8030503@mcs.anl.gov> <47C1C8C6.1020009@cs.uchicago.edu> <47C2E11F.2070208@mcs.anl.gov> <47C2E483.70005@cs.uchicago.edu> <47C2FB7B.3000901@uchicago.edu> <47C30053.8010608@mcs.anl.gov> <47C3E648.8040101@uchicago.edu> <47C421FC.10405@mcs.anl.gov> <47C429D0.9010109@cs.uchicago.edu> <47C43C40.50702@mcs.anl.gov> <47C43D15.8080608@cs.uchicago.edu> <47C440F9.6090701@mcs.anl.gov> <47C44238.9060007@cs.uchicago.edu> <47C44EC2.4070901@uchicago.edu> <47C479EE.90901@mcs.anl.gov> <47C5039F.8020001@cs.uchicago.edu> <47D07BDC.60406@uchicago.edu> <47D0DA98.6010308@cs.uchicago.edu> <47D14650.2000206@cs.uchicago.edu> <47D16F6D.9070903@uchicago.edu> Message-ID: On Fri, 7 Mar 2008, Zhao Zhang wrote: > Where do those -info files live, if I run SWIFT locally on a linux box? You will need to use a recent (as in r1715 swift, r1934 cog) build from SVN. 
Edit your swift.properties file so that this setting:

wrapperlog.always.transfer=false

is changed to this:

wrapperlog.always.transfer=true

Then when you run, you will get a log file called:

whatever-20080101-9999-abcdef.log

and a corresponding directory,

whatever-20080101-9999-abcdef.d/

In that directory, you should see one *-info file for each job that is run
(with the * being the jobid, the same as passed on the wrapper.sh command
line that we talked about 12h ago)

--

From zhaozhang at uchicago.edu Fri Mar 7 10:38:05 2008
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Fri, 07 Mar 2008 10:38:05 -0600
Subject: [Swift-devel] Re: Question of wrapper.sh
In-Reply-To: 
References: <47BEC111.8030503@mcs.anl.gov> <47C13DEC.4090008@uchicago.edu> <47C1C5CD.5080906@cs.uchicago.edu> <47C1C8C6.1020009@cs.uchicago.edu> <47C2E11F.2070208@mcs.anl.gov> <47C2E483.70005@cs.uchicago.edu> <47C2FB7B.3000901@uchicago.edu> <47C30053.8010608@mcs.anl.gov> <47C3E648.8040101@uchicago.edu> <47C421FC.10405@mcs.anl.gov> <47C429D0.9010109@cs.uchicago.edu> <47C43C40.50702@mcs.anl.gov> <47C43D15.8080608@cs.uchicago.edu> <47C440F9.6090701@mcs.anl.gov> <47C44238.9060007@cs.uchicago.edu> <47C44EC2.4070901@uchicago.edu> <47C479EE.90901@mcs.anl.gov> <47C5039F.8020001@cs.uchicago.edu> <47D07BDC.60406@uchicago.edu> <47D0DA98.6010308@cs.uchicago.edu> <47D14650.2000206@cs.uchicago.edu>
Message-ID: <47D16F6D.9070903@uchicago.edu>

Hi, Ben

Where do those -info files live, if I run SWIFT locally on a linux box?

zhao

Ben Clifford wrote:
> On Fri, 7 Mar 2008, Ioan Raicu wrote:
>
>> Ideally, we'd want any extra logging outside of the bare minimum to be
>> optional, something that could be turned on or off depending on output level.
>> Maybe you or Mihael could work in such an option in the future, so we could
>> easily disable all logging in the wrapper script if we need to. In the
>
> You can perhaps convince me easily (or unconvince me) if you provide the
> -info files for a large test run.
I implemented measurements there in > november to provide quantified results for almost exactly this situation. > You will considerably enhance this discourse by providing that information > now rather than later. > > From iraicu at cs.uchicago.edu Fri Mar 7 11:16:29 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Fri, 07 Mar 2008 11:16:29 -0600 Subject: [Swift-devel] Re: Question of wrapper.sh In-Reply-To: References: <47BEC111.8030503@mcs.anl.gov> <47C13DEC.4090008@uchicago.edu> <47C1C5CD.5080906@cs.uchicago.edu> <47C1C8C6.1020009@cs.uchicago.edu> <47C2E11F.2070208@mcs.anl.gov> <47C2E483.70005@cs.uchicago.edu> <47C2FB7B.3000901@uchicago.edu> <47C30053.8010608@mcs.anl.gov> <47C3E648.8040101@uchicago.edu> <47C421FC.10405@mcs.anl.gov> <47C429D0.9010109@cs.uchicago.edu> <47C43C40.50702@mcs.anl.gov> <47C43D15.8080608@cs.uchicago.edu> <47C440F9.6090701@mcs.anl.gov> <47C44238.9060007@cs.uchicago.edu> <47C44EC2.4070901@uchicago.edu> <47C479EE.90901@mcs.anl.gov> <47C5039F.8020001@cs.uchicago.edu> <47D07BDC.60406@uchicago.edu> <47D0DA98.6010308@cs.uchicago.edu> <47D14650.2000206@cs.uchicago.edu> Message-ID: <47D1786D.4010609@cs.uchicago.edu> OK, Zhao is working on it, and should get you those logs later today. Ioan Ben Clifford wrote: > On Fri, 7 Mar 2008, Ioan Raicu wrote: > > >> Ideally, we'd want any extra logging outside of the bare minimum to be >> optional, something that could be turned on or off depending on output level. >> Maybe you or Mihael could work in such an option in the future, so we could >> easily disable all logging in the wrapper script if we need to. In the >> > > You can perhaps convince me easily (or unconvince me) if you provide the > -info files for a large test run. I implemented measurements there in > november to provide quantified results for almost exactly this situation. > You will considerably enhance this discourse by providing that information > now rather than later. 
> >

From zhaozhang at uchicago.edu Fri Mar 7 18:18:16 2008
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Fri, 07 Mar 2008 18:18:16 -0600
Subject: [Swift-devel] Re: Question of wrapper.sh
In-Reply-To: 
References: <47BEC111.8030503@mcs.anl.gov> <47C2E11F.2070208@mcs.anl.gov> <47C2E483.70005@cs.uchicago.edu> <47C2FB7B.3000901@uchicago.edu> <47C30053.8010608@mcs.anl.gov> <47C3E648.8040101@uchicago.edu> <47C421FC.10405@mcs.anl.gov> <47C429D0.9010109@cs.uchicago.edu> <47C43C40.50702@mcs.anl.gov> <47C43D15.8080608@cs.uchicago.edu> <47C440F9.6090701@mcs.anl.gov> <47C44238.9060007@cs.uchicago.edu> <47C44EC2.4070901@uchicago.edu> <47C479EE.90901@mcs.anl.gov> <47C5039F.8020001@cs.uchicago.edu> <47D07BDC.60406@uchicago.edu> <47D0DA98.6010308@cs.uchicago.edu> <47D14650.2000206@cs.uchicago.edu> <47D16F6D.9070903@uchicago.edu>
Message-ID: <47D1DB48.4080709@uchicago.edu>

Hi,

I got this problem when I tried to connect Swift and Falkon. I could run
this before with the r1673 version of Swift, but it does not go through
with the r1716 version.
zzhang at login1.surveyor:~/cog/modules/vdsk/dist/vdsk-0.3-dev/examples/vdsk> swift -sites.file ../../etc/sites-BG.xml -tc.file ../../etc/tc-BG.data first.swift
Swift v0.3-dev swift-r1716 cog-r1934

RunID: 20080307-1813-i00h1fia
Progress:
echo started
error: Notification(int timeout): socket = new ServerSocket(recvPort); Address already in use
2008-03-07 18:13:48,122 WARN submitQueue.NonBlockingSubmit [pool-1-thread-1,notifyPreviousQueue:71] Warning: Task handler throws exception and also sets status

the sites-BG.xml is like below

/home/zzhang

along with the tc-BG.data file

# sitename transformation path INSTALLED platform profiles
bgp echo /bin/echo INSTALLED INTEL32::LINUX null
localhost cat /bin/cat INSTALLED INTEL32::LINUX null
localhost ls /bin/ls INSTALLED INTEL32::LINUX null
localhost grep /bin/grep INSTALLED INTEL32::LINUX null
localhost sort /bin/sort INSTALLED INTEL32::LINUX null
localhost paste /bin/paste INSTALLED INTEL32::LINUX null
bgp sleep /bin/sleep INSTALLED INTEL32::LINUX null

zhao

Ben Clifford wrote:
> On Fri, 7 Mar 2008, Zhao Zhang wrote:
>
>> Where do those -info files live, if I run SWIFT locally on a linux box?
>
> You will need to use a recent (as in r1715 swift, r1934 cog) build from
> SVN.
> Edit your swift.properties file so that this setting:
>
> wrapperlog.always.transfer=false
>
> is changed to this:
>
> wrapperlog.always.transfer=true
>
> Then when you run, you will get a log file called:
>
> whatever-20080101-9999-abcdef.log
>
> and a corresponding directory,
>
> whatever-20080101-9999-abcdef.d/
>
> In that directory, you should see one *-info file for each job that is run
> (with the * being the jobid, the same as passed on the wrapper.sh command
> line that we talked about 12h ago)

From benc at hawaga.org.uk Fri Mar 7 18:34:46 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sat, 8 Mar 2008 00:34:46 +0000 (GMT)
Subject: [Swift-devel] Re: Question of wrapper.sh
In-Reply-To: <47D1DB48.4080709@uchicago.edu>
References: <47BEC111.8030503@mcs.anl.gov> <47C2FB7B.3000901@uchicago.edu> <47C30053.8010608@mcs.anl.gov> <47C3E648.8040101@uchicago.edu> <47C421FC.10405@mcs.anl.gov> <47C429D0.9010109@cs.uchicago.edu> <47C43C40.50702@mcs.anl.gov> <47C43D15.8080608@cs.uchicago.edu> <47C440F9.6090701@mcs.anl.gov> <47C44238.9060007@cs.uchicago.edu> <47C44EC2.4070901@uchicago.edu> <47C479EE.90901@mcs.anl.gov> <47C5039F.8020001@cs.uchicago.edu> <47D07BDC.60406@uchicago.edu> <47D0DA98.6010308@cs.uchicago.edu> <47D14650.2000206@cs.uchicago.edu> <47D16F6D.9070903@uchicago.edu> <47D1DB48.4080709@uchicago.edu>
Message-ID: 

On Fri, 7 Mar 2008, Zhao Zhang wrote:

> error: Notification(int timeout): socket = new ServerSocket(recvPort); Address
> already in use

That's an error I'm not familiar with. At a guess, I'd say something like
provider-deef is trying to open a server socket on a manually specified
port (recvPort) that you already have something listening on.

Ioan, does that seem likely?
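One way to test the guess about the "Address already in use" failure is to check whether anything is already listening on the notification port before starting the run. This is only a sketch: the thread never names the actual recvPort value, so port 50001 below is purely illustrative, and the `port_in_use` helper is a made-up name; the netstat output format also varies between systems.

```shell
# Report whether a given TCP port already has a listener, which would
# explain a "new ServerSocket(recvPort): Address already in use" error.
# 50001 is a made-up example port; substitute the real recvPort.
port_in_use() {
    netstat -an 2>/dev/null | grep -Eq "[.:]$1[[:space:]].*LISTEN"
}

if port_in_use 50001; then
    echo "port 50001 is already in use"
else
    echo "port 50001 looks free (or netstat is unavailable)"
fi
```

If the port is taken, either free it, configure a different recvPort, or (as the code apparently already does) let the listener fall back to another port.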
-- From benc at hawaga.org.uk Fri Mar 7 18:40:54 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Sat, 8 Mar 2008 00:40:54 +0000 (GMT) Subject: [Swift-devel] Re: Question of wrapper.sh In-Reply-To: <47D1DB48.4080709@uchicago.edu> References: <47BEC111.8030503@mcs.anl.gov> <47C2FB7B.3000901@uchicago.edu> <47C30053.8010608@mcs.anl.gov> <47C3E648.8040101@uchicago.edu> <47C421FC.10405@mcs.anl.gov> <47C429D0.9010109@cs.uchicago.edu> <47C43C40.50702@mcs.anl.gov> <47C43D15.8080608@cs.uchicago.edu> <47C440F9.6090701@mcs.anl.gov> <47C44238.9060007@cs.uchicago.edu> <47C44EC2.4070901@uchicago.edu> <47C479EE.90901@mcs.anl.gov> <47C5039F.8020001@cs.uchicago.edu> <47D07BDC.60406@uchicago.edu> <47D0DA98.6010308@cs.uchicago.edu> <47D14650.2000206@cs.uchicago.edu> <47D16F6D.9070903@uchicago.edu> <47D1DB48.4080709@uchicago.edu> Message-ID: On Fri, 7 Mar 2008, Zhao Zhang wrote: > error: Notification(int timeout): socket = new ServerSocket(recvPort); Address > already in use > 2008-03-07 18:13:48,122 WARN submitQueue.NonBlockingSubmit > [pool-1-thread-1,notifyPreviousQueue:71] Warning: Task handler throws > exception and also sets status At that place in the log file, do you also get a stack trace? Put the whole log file somewhere i can see it. 
--

From zhaozhang at uchicago.edu Fri Mar 7 19:06:40 2008
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Fri, 07 Mar 2008 19:06:40 -0600
Subject: [Swift-devel] Re: Question of wrapper.sh
In-Reply-To: 
References: <47BEC111.8030503@mcs.anl.gov> <47C30053.8010608@mcs.anl.gov> <47C3E648.8040101@uchicago.edu> <47C421FC.10405@mcs.anl.gov> <47C429D0.9010109@cs.uchicago.edu> <47C43C40.50702@mcs.anl.gov> <47C43D15.8080608@cs.uchicago.edu> <47C440F9.6090701@mcs.anl.gov> <47C44238.9060007@cs.uchicago.edu> <47C44EC2.4070901@uchicago.edu> <47C479EE.90901@mcs.anl.gov> <47C5039F.8020001@cs.uchicago.edu> <47D07BDC.60406@uchicago.edu> <47D0DA98.6010308@cs.uchicago.edu> <47D14650.2000206@cs.uchicago.edu> <47D16F6D.9070903@uchicago.edu> <47D1DB48.4080709@uchicago.edu>
Message-ID: <47D1E6A0.3080805@uchicago.edu>

Hi, Ben

There is nothing in the sleep-20080307-1808-ik0izkmg.d directory, and also
nothing in the rlog file. When I ran it through Falkon before, it still had
that socket error, but that was OK; Swift used another port.

zhao

Ben Clifford wrote:
> On Fri, 7 Mar 2008, Zhao Zhang wrote:
>
>> error: Notification(int timeout): socket = new ServerSocket(recvPort); Address
>> already in use
>> 2008-03-07 18:13:48,122 WARN submitQueue.NonBlockingSubmit
>> [pool-1-thread-1,notifyPreviousQueue:71] Warning: Task handler throws
>> exception and also sets status
>
> At that place in the log file, do you also get a stack trace? Put the
> whole log file somewhere i can see it.
> >

From iraicu at cs.uchicago.edu Fri Mar 7 19:49:09 2008
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Fri, 07 Mar 2008 19:49:09 -0600
Subject: [Swift-devel] Re: Question of wrapper.sh
In-Reply-To: 
References: <47BEC111.8030503@mcs.anl.gov> <47C30053.8010608@mcs.anl.gov> <47C3E648.8040101@uchicago.edu> <47C421FC.10405@mcs.anl.gov> <47C429D0.9010109@cs.uchicago.edu> <47C43C40.50702@mcs.anl.gov> <47C43D15.8080608@cs.uchicago.edu> <47C440F9.6090701@mcs.anl.gov> <47C44238.9060007@cs.uchicago.edu> <47C44EC2.4070901@uchicago.edu> <47C479EE.90901@mcs.anl.gov> <47C5039F.8020001@cs.uchicago.edu> <47D07BDC.60406@uchicago.edu> <47D0DA98.6010308@cs.uchicago.edu> <47D14650.2000206@cs.uchicago.edu> <47D16F6D.9070903@uchicago.edu> <47D1DB48.4080709@uchicago.edu>
Message-ID: <47D1F095.6070207@cs.uchicago.edu>

That should be labeled as a warning... if that port is in use, it will try
another one, so that is not the problem.

Ioan

Ben Clifford wrote:
> On Fri, 7 Mar 2008, Zhao Zhang wrote:
>
>> error: Notification(int timeout): socket = new ServerSocket(recvPort); Address
>> already in use
>
> That's an error I'm not familiar with. At a guess, I'd say something like
> provider-deef is trying to open a server socket on a manually specified
> port (recvPort) that you already have something listening on.
>
> Ioan, does that seem likely?

From benc at hawaga.org.uk Sat Mar 8 00:49:40 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sat, 8 Mar 2008 06:49:40 +0000 (GMT)
Subject: [Swift-devel] Re: Question of wrapper.sh
In-Reply-To: <47D1E6A0.3080805@uchicago.edu>
References: <47BEC111.8030503@mcs.anl.gov> <47C421FC.10405@mcs.anl.gov> <47C429D0.9010109@cs.uchicago.edu> <47C43C40.50702@mcs.anl.gov> <47C43D15.8080608@cs.uchicago.edu> <47C440F9.6090701@mcs.anl.gov> <47C44238.9060007@cs.uchicago.edu> <47C44EC2.4070901@uchicago.edu> <47C479EE.90901@mcs.anl.gov> <47C5039F.8020001@cs.uchicago.edu> <47D07BDC.60406@uchicago.edu> <47D0DA98.6010308@cs.uchicago.edu> <47D14650.2000206@cs.uchicago.edu> <47D16F6D.9070903@uchicago.edu> <47D1DB48.4080709@uchicago.edu> <47D1E6A0.3080805@uchicago.edu>
Message-ID: 

On Fri, 7 Mar 2008, Zhao Zhang wrote:

> Hi, Ben
>
> There is nothing in the sleep-20080307-1808-ik0izkmg.d directory, and also
> nothing in rlog file. When I could run it through falkon before, it still has
> that socket error, it is ok, swift will use another port.

You should get something in the .log file, though. Please send that.
-- From zhaozhang at uchicago.edu Sat Mar 8 17:15:10 2008 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Sat, 08 Mar 2008 17:15:10 -0600 Subject: [Swift-devel] Re: Question of wrapper.sh In-Reply-To: References: <47BEC111.8030503@mcs.anl.gov> <47C421FC.10405@mcs.anl.gov> <47C429D0.9010109@cs.uchicago.edu> <47C43C40.50702@mcs.anl.gov> <47C43D15.8080608@cs.uchicago.edu> <47C440F9.6090701@mcs.anl.gov> <47C44238.9060007@cs.uchicago.edu> <47C44EC2.4070901@uchicago.edu> <47C479EE.90901@mcs.anl.gov> <47C5039F.8020001@cs.uchicago.edu> <47D07BDC.60406@uchicago.edu> <47D0DA98.6010308@cs.uchicago.edu> <47D14650.2000206@cs.uchicago.edu> <47D16F6D.9070903@uchicago.edu> <47D1DB48.4080709@uchicago.edu> <47D1E6A0.3080805@uchicago.edu> Message-ID: <47D31DFE.5060808@uchicago.edu> Hi, Ben I didn't find any .log files in the directory where I run the swift command and the working directory where the task should be executed. zhao Ben Clifford wrote: > On Fri, 7 Mar 2008, Zhao Zhang wrote: > > >> Hi, Ben >> >> There is nothing in the sleep-20080307-1808-ik0izkmg.d directory, and also >> nothing in rlog file. When I could run it through falkon before, it still has >> that socket error, it is ok, swift will use another port. >> > > You should get something in the .log file, though. Please send that. 
> >

From zhaozhang at uchicago.edu Sat Mar 8 17:50:19 2008
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Sat, 08 Mar 2008 17:50:19 -0600
Subject: [Swift-devel] Re: Question of wrapper.sh
In-Reply-To: 
References: <47BEC111.8030503@mcs.anl.gov> <47C421FC.10405@mcs.anl.gov> <47C429D0.9010109@cs.uchicago.edu> <47C43C40.50702@mcs.anl.gov> <47C43D15.8080608@cs.uchicago.edu> <47C440F9.6090701@mcs.anl.gov> <47C44238.9060007@cs.uchicago.edu> <47C44EC2.4070901@uchicago.edu> <47C479EE.90901@mcs.anl.gov> <47C5039F.8020001@cs.uchicago.edu> <47D07BDC.60406@uchicago.edu> <47D0DA98.6010308@cs.uchicago.edu> <47D14650.2000206@cs.uchicago.edu> <47D16F6D.9070903@uchicago.edu> <47D1DB48.4080709@uchicago.edu> <47D1E6A0.3080805@uchicago.edu>
Message-ID: <47D3263B.4080705@uchicago.edu>

Hi, All

It is OK now; I solved that problem. It was a simple typo in the
GPfactoryservice path. Now Swift can send tasks directly to Falkon. Then I
submitted 100 sleep 0 tasks from Swift to Falkon; it took 50 seconds, from
Falkon's point of view, to complete these tasks. The info.tar file contains
the info files for these 100 sleep jobs, and stdout.txt is what I got from
Swift's standard output.

Thanks again for your help.

zhao

Ben Clifford wrote:
> On Fri, 7 Mar 2008, Zhao Zhang wrote:
>
>> Hi, Ben
>>
>> There is nothing in the sleep-20080307-1808-ik0izkmg.d directory, and also
>> nothing in rlog file. When I could run it through falkon before, it still has
>> that socket error, it is ok, swift will use another port.
>
> You should get something in the .log file, though. Please send that.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: info.tar
Type: application/octet-stream
Size: 163840 bytes
Desc: not available
URL: 
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: stdout.txt URL: From zhaozhang at uchicago.edu Sun Mar 9 00:28:27 2008 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Sun, 09 Mar 2008 00:28:27 -0600 Subject: [Swift-devel] Re: Question of wrapper.sh In-Reply-To: References: <47BEC111.8030503@mcs.anl.gov> <47C30053.8010608@mcs.anl.gov> <47C3E648.8040101@uchicago.edu> <47C421FC.10405@mcs.anl.gov> <47C429D0.9010109@cs.uchicago.edu> <47C43C40.50702@mcs.anl.gov> <47C43D15.8080608@cs.uchicago.edu> <47C440F9.6090701@mcs.anl.gov> <47C44238.9060007@cs.uchicago.edu> <47C44EC2.4070901@uchicago.edu> <47C479EE.90901@mcs.anl.gov> <47C5039F.8020001@cs.uchicago.edu> <47D07BDC.60406@uchicago.edu> <47D0DA98.6010308@cs.uchicago.edu> <47D14650.2000206@cs.uchicago.edu> <47D16F6D.9070903@uchicago.edu> <47D1DB48.4080709@uchicago.edu> Message-ID: <47D3838B.3040506@uchicago.edu> ok, here is the info files for the test of 500 sleep 0 on 1 P-SET. zhao Ben Clifford wrote: > On Fri, 7 Mar 2008, Zhao Zhang wrote: > > >> error: Notification(int timeout): socket = new ServerSocket(recvPort); Address >> already in use >> 2008-03-07 18:13:48,122 WARN submitQueue.NonBlockingSubmit >> [pool-1-thread-1,notifyPreviousQueue:71] Warning: Task handler throws >> exception and also sets status >> > > At that place in the log file, do you also get a stack trace? Put the > whole log file somewhere i can see it. > > -------------- next part -------------- A non-text attachment was scrubbed... 
Name: info.tar
Type: application/octet-stream
Size: 778240 bytes
Desc: not available
URL: 

From wilde at mcs.anl.gov Sun Mar 9 16:09:59 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Sun, 09 Mar 2008 16:09:59 -0500
Subject: [Swift-devel] Re: [Swft] Re: Question of wrapper.sh
In-Reply-To: <47D3838B.3040506@uchicago.edu>
References: <47BEC111.8030503@mcs.anl.gov> <47C3E648.8040101@uchicago.edu> <47C421FC.10405@mcs.anl.gov> <47C429D0.9010109@cs.uchicago.edu> <47C43C40.50702@mcs.anl.gov> <47C43D15.8080608@cs.uchicago.edu> <47C440F9.6090701@mcs.anl.gov> <47C44238.9060007@cs.uchicago.edu> <47C44EC2.4070901@uchicago.edu> <47C479EE.90901@mcs.anl.gov> <47C5039F.8020001@cs.uchicago.edu> <47D07BDC.60406@uchicago.edu> <47D0DA98.6010308@cs.uchicago.edu> <47D14650.2000206@cs.uchicago.edu> <47D16F6D.9070903@uchicago.edu> <47D1DB48.4080709@uchicago.edu> <47D3838B.3040506@uchicago.edu>
Message-ID: <47D45227.8020201@mcs.anl.gov>

(taking swft off cc list)

Zhao, I've lost track of what these logs are for. This run is using all 64
compute nodes on 1 pset, right?

I compute these stats for these logs:

nfiles=500 Avg=1.27 secs, Min=0 secs, Max=4 secs, Run Duration=138 secs

So not too much variation in run time, but pretty slow.

- mike

On 3/9/08 12:28 AM, Zhao Zhang wrote:
> ok, here is the info files for the test of 500 sleep 0 on 1 P-SET.
>
> zhao
>
> Ben Clifford wrote:
>> On Fri, 7 Mar 2008, Zhao Zhang wrote:
>>
>>> error: Notification(int timeout): socket = new
>>> ServerSocket(recvPort); Address
>>> already in use
>>> 2008-03-07 18:13:48,122 WARN submitQueue.NonBlockingSubmit
>>> [pool-1-thread-1,notifyPreviousQueue:71] Warning: Task handler throws
>>> exception and also sets status
>>
>> At that place in the log file, do you also get a stack trace? Put the
>> whole log file somewhere i can see it.
From wilde at mcs.anl.gov Sun Mar 9 16:32:16 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Sun, 09 Mar 2008 16:32:16 -0500
Subject: [Swift-devel] Can scp data provider be used with Swift?
Message-ID: <47D45760.4060509@mcs.anl.gov>

We'd like to do a Swift run on the SiCortex machine using Falkon as the
execution provider. At the moment there's no Java on the SiCortex and no
ready access to its filesystem from a Linux host with Java.

Is it feasible to run Swift on a Linux host with Falkon for the job
provider and scp for the data provider? If so, how would this be specified?

From benc at hawaga.org.uk Sun Mar 9 19:29:05 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 10 Mar 2008 00:29:05 +0000 (GMT)
Subject: [Swift-devel] Re: Question of wrapper.sh
In-Reply-To: <47D3838B.3040506@uchicago.edu>
References: <47BEC111.8030503@mcs.anl.gov> <47C421FC.10405@mcs.anl.gov> <47C429D0.9010109@cs.uchicago.edu> <47C43C40.50702@mcs.anl.gov> <47C43D15.8080608@cs.uchicago.edu> <47C440F9.6090701@mcs.anl.gov> <47C44238.9060007@cs.uchicago.edu> <47C44EC2.4070901@uchicago.edu> <47C479EE.90901@mcs.anl.gov> <47C5039F.8020001@cs.uchicago.edu> <47D07BDC.60406@uchicago.edu> <47D0DA98.6010308@cs.uchicago.edu> <47D14650.2000206@cs.uchicago.edu> <47D16F6D.9070903@uchicago.edu> <47D1DB48.4080709@uchicago.edu> <47D3838B.3040506@uchicago.edu>
Message-ID: 

OK, I will look at those. At the same time that you send -info logs, please
can you send the main log file (that is named something like
foo-20080101-000-abcdef.log) as that has interesting information too.

On Sun, 9 Mar 2008, Zhao Zhang wrote:

> ok, here is the info files for the test of 500 sleep 0 on 1 P-SET.
> > zhao
> >
> > Ben Clifford wrote:
> > > On Fri, 7 Mar 2008, Zhao Zhang wrote:
> > >
> > > > error: Notification(int timeout): socket = new ServerSocket(recvPort);
> > > > Address already in use
> > > > 2008-03-07 18:13:48,122 WARN submitQueue.NonBlockingSubmit
> > > > [pool-1-thread-1,notifyPreviousQueue:71] Warning: Task handler throws
> > > > exception and also sets status
> > >
> > > At that place in the log file, do you also get a stack trace? Put the whole
> > > log file somewhere i can see it.

From benc at hawaga.org.uk Sun Mar 9 19:47:09 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 10 Mar 2008 00:47:09 +0000 (GMT)
Subject: [Swift-devel] Re: Question of wrapper.sh
In-Reply-To: 
References: <47BEC111.8030503@mcs.anl.gov> <47C429D0.9010109@cs.uchicago.edu> <47C43C40.50702@mcs.anl.gov> <47C43D15.8080608@cs.uchicago.edu> <47C440F9.6090701@mcs.anl.gov> <47C44238.9060007@cs.uchicago.edu> <47C44EC2.4070901@uchicago.edu> <47C479EE.90901@mcs.anl.gov> <47C5039F.8020001@cs.uchicago.edu> <47D07BDC.60406@uchicago.edu> <47D0DA98.6010308@cs.uchicago.edu> <47D14650.2000206@cs.uchicago.edu> <47D16F6D.9070903@uchicago.edu> <47D1DB48.4080709@uchicago.edu> <47D3838B.3040506@uchicago.edu>
Message-ID: 

Whichever version of date you used, it doesn't support more than 1s
accuracy in its output. That's irksome because it's that subsecond accuracy
that I wanted from these log files.

date on OS X doesn't have that precision; fairly recent (in the past
couple of years at least) GNU coreutils date does. It would be nice if you
could find out whether that is installed and use that instead...
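Ben's point about date can be handled with a probe: GNU coreutils date understands the %N (nanoseconds) conversion, while the BSD date shipped with OS X prints the "N" literally. A sketch of such a probe follows; the `timestamp` function name is just illustrative, not anything from the wrapper script.

```shell
# Emit a subsecond timestamp when the local date supports %N (GNU
# coreutils), and fall back to whole seconds otherwise (e.g. BSD/OS X
# date, which echoes a literal "N" instead of nanoseconds).
timestamp() {
    t=$(date +%s.%N)
    case "$t" in
        *[!0-9.]*) date +%s ;;  # %N unsupported: output contains a literal N
        *)         echo "$t" ;;
    esac
}

timestamp
```

On a system with GNU date this prints something like 1205107200.123456789; on OS X it degrades to whole seconds instead of producing garbage in the -info files.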
--

From iraicu at cs.uchicago.edu Sun Mar 9 19:51:20 2008
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Sun, 09 Mar 2008 19:51:20 -0500
Subject: [Swift-devel] Re: Question of wrapper.sh
In-Reply-To: 
References: <47BEC111.8030503@mcs.anl.gov> <47C43C40.50702@mcs.anl.gov> <47C43D15.8080608@cs.uchicago.edu> <47C440F9.6090701@mcs.anl.gov> <47C44238.9060007@cs.uchicago.edu> <47C44EC2.4070901@uchicago.edu> <47C479EE.90901@mcs.anl.gov> <47C5039F.8020001@cs.uchicago.edu> <47D07BDC.60406@uchicago.edu> <47D0DA98.6010308@cs.uchicago.edu> <47D14650.2000206@cs.uchicago.edu> <47D16F6D.9070903@uchicago.edu> <47D1DB48.4080709@uchicago.edu> <47D3838B.3040506@uchicago.edu>
Message-ID: <47D48608.2060204@cs.uchicago.edu>

Ben,
Those logs are only for 1 CPU, so most things will take less than 1 sec.
In the case where we use 100s of CPUs (a basic scenario for the BG/P),
things will take 10s of seconds, so 1 sec resolution should be OK. Zhao,
did you re-run that test with 256 CPUs? Ben should be looking at those
logs, not the 1 CPU case.

Ioan

Ben Clifford wrote:
> Whichever version of date that you used, it doesn't support more than 1s
> accuracy in its output. That's irksome because its that subsecond accuracy
> that I wanted from these log files.
>
> date on os x doesn't have that precision, fairly recent (in the past
> couple of years at least) GNU coreutils date does. It would be nice if you
> could find if that is installed and use that instead...

From benc at hawaga.org.uk Sun Mar 9 19:54:05 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 10 Mar 2008 00:54:05 +0000 (GMT)
Subject: [Swift-devel] Re: Question of wrapper.sh
In-Reply-To: <47D48608.2060204@cs.uchicago.edu>
References: <47BEC111.8030503@mcs.anl.gov> <47C440F9.6090701@mcs.anl.gov> <47C44238.9060007@cs.uchicago.edu> <47C44EC2.4070901@uchicago.edu> <47C479EE.90901@mcs.anl.gov> <47C5039F.8020001@cs.uchicago.edu> <47D07BDC.60406@uchicago.edu> <47D0DA98.6010308@cs.uchicago.edu> <47D14650.2000206@cs.uchicago.edu> <47D16F6D.9070903@uchicago.edu> <47D1DB48.4080709@uchicago.edu> <47D3838B.3040506@uchicago.edu> <47D48608.2060204@cs.uchicago.edu>
Message-ID: 

On Sun, 9 Mar 2008, Ioan Raicu wrote:

> Those logs are only for 1 CPU, so most things will take less than 1 sec. In
> the case where we use 100s of CPUs (a basic scenario for the BG/P), things
> will take 10s of seconds, so 1 sec resolution should be OK. Zhao, did you
> re-run that test with 256 CPUs? Ben should be looking at those logs, not the
> 1 CPU case.

OK. That will be more interesting then. Zhao, please send the regular .log
file at the same time too.
-- From zhaozhang at uchicago.edu Sun Mar 9 20:32:01 2008 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Sun, 09 Mar 2008 20:32:01 -0500 Subject: [Swift-devel] Re: Question of wrapper.sh In-Reply-To: <47D48608.2060204@cs.uchicago.edu> References: <47BEC111.8030503@mcs.anl.gov> <47C43D15.8080608@cs.uchicago.edu> <47C440F9.6090701@mcs.anl.gov> <47C44238.9060007@cs.uchicago.edu> <47C44EC2.4070901@uchicago.edu> <47C479EE.90901@mcs.anl.gov> <47C5039F.8020001@cs.uchicago.edu> <47D07BDC.60406@uchicago.edu> <47D0DA98.6010308@cs.uchicago.edu> <47D14650.2000206@cs.uchicago.edu> <47D16F6D.9070903@uchicago.edu> <47D1DB48.4080709@uchicago.edu> <47D3838B.3040506@uchicago.edu> <47D48608.2060204@cs.uchicago! .edu> Message-ID: <47D48F91.6040802@uchicago.edu> well, the tar ball I sent in the last email is from 256 cores, are they from the same cpu? By the way, I tried to find the .log files, but there isn't any in the folder where I started the swift script. zhao Ioan Raicu wrote: > Ben, > Those logs are only for 1 CPU, so most things will take less than 1 > sec. In the case where we use 100s of CPUs (a basic scenario for the > BG/P), things will take 10s of seconds, so 1 sec resolution should be > OK. Zhao, did you re-run that test with 256 CPUs? Ben should be > looking at those logs, not the 1 CPU case. > > Ioan > > Ben Clifford wrote: >> Whichever version of date that you used, it doesn't support more than >> 1s accuracy in its output. That's irksome because its that subsecond >> accuracy that I wanted from these log files. >> >> date on os x doesn't have that precision, fairly recent (in the past >> couple of years at least) GNU coreutils date does. It would be nice >> if you could find if that is installed and use that instead... 
From benc at hawaga.org.uk Sun Mar 9 20:51:07 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 10 Mar 2008 01:51:07 +0000 (GMT)
Subject: [Swift-devel] Re: Question of wrapper.sh
In-Reply-To: <47D48F91.6040802@uchicago.edu>
References: <47BEC111.8030503@mcs.anl.gov> <47C44238.9060007@cs.uchicago.edu> <47C44EC2.4070901@uchicago.edu> <47C479EE.90901@mcs.anl.gov> <47C5039F.8020001@cs.uchicago.edu> <47D07BDC.60406@uchicago.edu> <47D0DA98.6010308@cs.uchicago.edu> <47D14650.2000206@cs.uchicago.edu> <47D16F6D.9070903@uchicago.edu> <47D1DB48.4080709@uchicago.edu> <47D3838B.3040506@uchicago.edu> <47D48608.2060204@cs.uchicago.edu> <47D48F91.6040802@uchicago.edu>
Message-ID: 

On Sun, 9 Mar 2008, Zhao Zhang wrote:

> well, the tar ball I sent in the last email is from 256 cores, are they from
> the same cpu? By the way, I tried to find the .log files, but there isn't any
> in the folder where I started the swift script.

How are you building provider-deef? The old way messes up logging. Since
r1525 in December, a way to build that doesn't do this is to build swift
and provider-deef at the same time, by using this command in the vdsk
directory:

ant -Dwith-provider-deef redist

You'll get a warning like this:

[input] Warning! The specified target directory
(/Users/benc/work/cog/modules/swift/../..//modules/swift/dist/swift-0.3-dev)
does not seem to contain a Swift build.

which is not a problem - press return a few times to get the build to
continue.
-- From zhaozhang at uchicago.edu Sun Mar 9 20:55:53 2008 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Sun, 09 Mar 2008 20:55:53 -0500 Subject: [Swift-devel] Re: Question of wrapper.sh In-Reply-To: References: <47BEC111.8030503@mcs.anl.gov> <47C44EC2.4070901@uchicago.edu> <47C479EE.90901@mcs.anl.gov> <47C5039F.8020001@cs.uchicago.edu> <47D07BDC.60406@uchicago.edu> <47D0DA98.6010308@cs.uchicago.edu> <47D14650.2000206@cs.uchicago.edu> <47D16F6D.9070903@uchicago.edu> <47D1DB48.4080709@uchicago.edu> <47D3838B.3040506@uchicago.edu> <47D48608.2060204@cs.uchicago.edu> <47D48F91.6040802@uchicago.edu> Message-ID: <47D49529.1020809@uchicago.edu> Hi, Ben Here is the script that we are using to build the provider #/bin/sh if [ -z "${FALKON_ROOT}" ]; then echo "ERROR: environment variable FALKON_ROOT not defined" 1>&2 return 1 fi if [ ! -d "${FALKON_ROOT}" ]; then echo "ERROR: invalid FALKON_ROOT set: $FALKON_ROOT" 1>&2 return 1 fi cd ${FALKON_ROOT}/cog/modules/provider-deef ant distclean ant -Ddist.dir=../vdsk/dist/vdsk-0.3-dev/ dist So I need to delete the last line and add ant -Dwith-provider-deef -Ddist.dir=../vdsk/dist/vdsk-0.3-dev/ redist right? zhao Ben Clifford wrote: > On Sun, 9 Mar 2008, Zhao Zhang wrote: > > >> well, the tar ball I sent in the last email is from 256 cores, are they from >> the same cpu? By the way, I tried to find the .log files, but there isn't any >> in the folder where I started the swift script. >> > > How are you building provider-deef? The old way messes up logging. Since > r1525 in December, a way to build that doesn't do this is to build swift > and provider-deef at the same time, by using this command in the vdsk > directory: > > ant -Dwith-provider-deef redist > > You'll get a warning like this: > [input] Warning! The specified target directory > (/Users/benc/work/cog/modules/swift/../..//modules/swift/dist/swift-0.3-dev) > does not seem to contain a Swift build. 
> > which is not a problem - press return a few times to get the build to > continue. > > From benc at hawaga.org.uk Sun Mar 9 20:58:51 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 10 Mar 2008 01:58:51 +0000 (GMT) Subject: [Swift-devel] Re: Question of wrapper.sh In-Reply-To: <47D49529.1020809@uchicago.edu> References: <47BEC111.8030503@mcs.anl.gov> <47C5039F.8020001@cs.uchicago.edu> <47D07BDC.60406@uchicago.edu> <47D0DA98.6010308@cs.uchicago.edu> <47D14650.2000206@cs.uchicago.edu> <47D16F6D.9070903@uchicago.edu> <47D1DB48.4080709@uchicago.edu> <47D3838B.3040506@uchicago.edu> <47D48608.2060204@cs.uchicago.edu> <47D48F91.6040802@uchicago.edu> <47D49529.1020809@uchicago.edu> Message-ID: On Sun, 9 Mar 2008, Zhao Zhang wrote: > cd ${FALKON_ROOT}/cog/modules/provider-deef > ant distclean > ant -Ddist.dir=../vdsk/dist/vdsk-0.3-dev/ dist > So I need to delete the last line and add > > ant -Dwith-provider-deef -Ddist.dir=../vdsk/dist/vdsk-0.3-dev/ redist don't run any build command in the provider-deef directory. replace the above three lines with: cd ${FALKON_ROOT}/cog/modules/vdsk/ ant -Dwith-provider-deef redist -- From iraicu at cs.uchicago.edu Mon Mar 10 00:55:03 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Mon, 10 Mar 2008 00:55:03 -0500 Subject: [Swift-devel] Re: Question of wrapper.sh In-Reply-To: <47D48F91.6040802@uchicago.edu> References: <47BEC111.8030503@mcs.anl.gov> <47C440F9.6090701@mcs.anl.gov> <47C44238.9060007@cs.uchicago.edu> <47C44EC2.4070901@uchicago.edu> <47C479EE.90901@mcs.anl.gov> <47C5039F.8020001@cs.uchicago.edu> <47D07BDC.60406@uchicago.edu> <47D0DA98.6010308@cs.uchicago.edu> <47D14650.2000206@cs.uchicago.edu> <47D16F6D.9070903@uchicago.edu> <47D1DB48.4080709@uchicago.edu> <47D3838B.3040506@uchicago.edu> <47D48608.2060204@cs.uchicago.edu> <47D48F91.6040802@uchicago.edu> Message-ID: <47D4CD37.7020006@cs.uchicago.edu> I am a little bit behind here on the emails... 
based on the Falkon logs, it seems that the low throughput we are getting in the latest Swift runs is due to throttling. Where are all the various throttling parameters that we should change, to ensure that Swift submits to Falkon as fast as possible with all available jobs? I assume there is a jobs/sec throttle, a maximum number of outstanding jobs (i.e. falkon queued jobs + running jobs), and maybe others. Thanks, Ioan Zhao Zhang wrote: > well, the tar ball I sent in the last email is from 256 cores, are > they from the same cpu? By the way, I tried to find the .log files, > but there isn't any in the folder where I started the swift script. > > zhao > > Ioan Raicu wrote: >> Ben, >> Those logs are only for 1 CPU, so most things will take less than 1 >> sec. In the case where we use 100s of CPUs (a basic scenario for the >> BG/P), things will take 10s of seconds, so 1 sec resolution should be >> OK. Zhao, did you re-run that test with 256 CPUs? Ben should be >> looking at those logs, not the 1 CPU case. >> >> Ioan >> >> Ben Clifford wrote: >>> Whichever version of date that you used, it doesn't support more >>> than 1s accuracy in its output. That's irksome because its that >>> subsecond accuracy that I wanted from these log files. >>> >>> date on os x doesn't have that precision, fairly recent (in the past >>> couple of years at least) GNU coreutils date does. It would be nice >>> if you could find if that is installed and use that instead... >>> >> > -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 
58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== From benc at hawaga.org.uk Mon Mar 10 01:23:47 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 10 Mar 2008 06:23:47 +0000 (GMT) Subject: [Swift-devel] Re: Question of wrapper.sh In-Reply-To: <47D4CD37.7020006@cs.uchicago.edu> References: <47BEC111.8030503@mcs.anl.gov> <47C44EC2.4070901@uchicago.edu> <47C479EE.90901@mcs.anl.gov> <47C5039F.8020001@cs.uchicago.edu> <47D07BDC.60406@uchicago.edu> <47D0DA98.6010308@cs.uchicago.edu> <47D14650.2000206@cs.uchicago.edu> <47D16F6D.9070903@uchicago.edu> <47D1DB48.4080709@uchicago.edu> <47D3838B.3040506@uchicago.edu> <47D48608.2060204@cs.uchicago.edu> <47D48F91.6040802@uchicago.edu> <47D4CD37.7020006@cs.uchicago.edu> Message-ID: The user guide has a section on properties that you can configure - in here, http://www.ci.uchicago.edu/swift/guides/userguide.php#engineconfiguration pretty much anything with the word 'throttle' in it. If you give me the .log files for runs, I can look at what the rate control stuff is doing. In the past day or so, I've reduced throttle.score.job.factor substantially, to be more appropriate for GRAM2 submission - I suspect for something like Falkon you should make it much higher. It used to be 4 (which means 402 jobs executing at once maximum), but is now 0.2 (20 jobs at once maximum). For Falkon on a large number of CPUs, you probably want to make that higher (maybe number of CPUs divided by about 30) On Mon, 10 Mar 2008, Ioan Raicu wrote: > I am a little bit behind here on the emails... based on the Falkon logs, it > seems that the low throughput we are getting in the latest Swift runs is due > to throttling. 
Where are all the various throttling parameters that we should > change, to ensure that Swift submits to Falkon as fast as possible with all > available jobs? I assume there is a jobs/sec throttle, a maximum number of > outstanding jobs (i.e. falkon queued jobs + running jobs), and maybe others. > Thanks, > Ioan > > Zhao Zhang wrote: > > well, the tar ball I sent in the last email is from 256 cores, are they from > > the same cpu? By the way, I tried to find the .log files, but there isn't > > any in the folder where I started the swift script. > > > > zhao > > > > Ioan Raicu wrote: > > > Ben, > > > Those logs are only for 1 CPU, so most things will take less than 1 sec. > > > In the case where we use 100s of CPUs (a basic scenario for the BG/P), > > > things will take 10s of seconds, so 1 sec resolution should be OK. Zhao, > > > did you re-run that test with 256 CPUs? Ben should be looking at those > > > logs, not the 1 CPU case. > > > > > > Ioan > > > > > > Ben Clifford wrote: > > > > Whichever version of date that you used, it doesn't support more than 1s > > > > accuracy in its output. That's irksome because its that subsecond > > > > accuracy that I wanted from these log files. > > > > > > > > date on os x doesn't have that precision, fairly recent (in the past > > > > couple of years at least) GNU coreutils date does. It would be nice if > > > > you could find if that is installed and use that instead... > > > > > > > > > > > From hategan at mcs.anl.gov Mon Mar 10 07:16:21 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 10 Mar 2008 07:16:21 -0500 Subject: [Swift-devel] Failed null In-Reply-To: References: Message-ID: <1205151381.11504.8.camel@blabla.mcs.anl.gov> Smells of NPE. 
On Thu, 2008-03-06 at 17:26 +0000, Ben Clifford wrote: > I'm seeing plenty of errors during stagein that look like this: > > 2008-03-06 10:42:02,911-0600 DEBUG TaskImpl Task(type=FILE_TRANSFER, > identity=urn:0-624-4-1204821602563) setting status to Submitting > 2008-03-06 10:42:02,911-0600 DEBUG TaskImpl Task(type=FILE_TRANSFER, > identity=urn:0-624-4-1204821602563) setting status to Submitted > 2008-03-06 10:42:02,912-0600 DEBUG TaskImpl Task(type=FILE_TRANSFER, > identity=urn:0-624-4-1204821602563) setting status to Active > 2008-03-06 10:42:02,937-0600 DEBUG TaskImpl Task(type=FILE_TRANSFER, > identity=urn:0-624-4-1204821602563) setting status to Failed null > > 'null' is not so helpful - also there's nothing indicating which file was > attempting to be transferred here... From hategan at mcs.anl.gov Mon Mar 10 07:17:43 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 10 Mar 2008 07:17:43 -0500 Subject: [Swift-devel] Can scp data provider be used with Swift? In-Reply-To: <47D45760.4060509@mcs.anl.gov> References: <47D45760.4060509@mcs.anl.gov> Message-ID: <1205151463.11504.10.camel@blabla.mcs.anl.gov> On Sun, 2008-03-09 at 16:32 -0500, Michael Wilde wrote: > We'd like to do a Swift run on the SiCortex machine using Falkon as the > execution provider. > > At the moment there's no Java on the SiCortex and no ready access to its > filesystem from a Linux host with Java. > > Is it feasible to run Swift on a Linux host with Falkon for the job > provider and scp for the data provider? It's possible. I2U2 does it. > If so, how would this be specified? I'm fuzzy about it at the moment and battery is low... 
> > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From benc at hawaga.org.uk Tue Mar 11 17:26:20 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 11 Mar 2008 22:26:20 +0000 (GMT) Subject: [Swift-devel] Re: Question of wrapper.sh In-Reply-To: References: <47BEC111.8030503@mcs.anl.gov> <47C5039F.8020001@cs.uchicago.edu> <47D07BDC.60406@uchicago.edu> <47D0DA98.6010308@cs.uchicago.edu> <47D14650.2000206@cs.uchicago.edu> <47D16F6D.9070903@uchicago.edu> <47D1DB48.4080709@uchicago.edu> <47D3838B.3040506@uchicago.edu> <47D48608.2060204@cs.uchicago.edu> <47D48F91.6040802@uchicago.edu> <47D4CD37.7020006@cs.uchicago.edu> Message-ID: anything happening with this now? -- From benc at hawaga.org.uk Wed Mar 12 22:32:02 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 13 Mar 2008 03:32:02 +0000 (GMT) Subject: [Swift-devel] swift 0.4-rc2 Message-ID: I have put a second release candidate online: http://www.ci.uchicago.edu/~benc/vdsk-0.4rc2.tar.gz This is from newer versions of the SVNs: swift r1718 and cog r1934. What's changed from rc1: wrapper log stageout; more conservative job throttles; high resolution timestamping in wrapper logs where available; host type selection for GRAM4. As before, if no major bugs appear, I'll release it in a couple of days. Please test. -- From benc at hawaga.org.uk Wed Mar 12 23:40:52 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 13 Mar 2008 04:40:52 +0000 (GMT) Subject: [Swift-devel] more ftp errors running terminable->tg uc Message-ID: In addition to the mlst error reported in the thread 'stageout: Expected multiline reply', today I see a similar-but-different error with other swift configurations running on terminable going to TG UC. The below occurs when I try PBS+GRAM4, fork+gram4, fork+gram2 (with the mlst error occurring with pbs+gram2 as before). 
I do not get this behaviour submitting from terminable to TeraPort. I do get it submitting to the OSG site UCLA_Saxon_Tier3. This is all running the test 130-fmri, which has in the past managed to trigger a bunch of race conditions that the other tests haven't. In general the other test I've been running, 061-cattwo, has been working mostly ok. The specific configurations that I use are in the svn in vdsks/tests/sites/ Server refused performing the request. Custom message: (error code 1) [Nested exception message: Custom message: Unexpected reply: 451 ocurred during retrieve() org.globus.ftp.exception.DataChannelException: setPassive() must match store() and setActive() - retrieve() (error code 2) org.globus.ftp.exception.DataChannelException: setPassive() must match store() and setActive() - retrieve() (error code 2) at org.globus.ftp.extended.GridFTPServerFacade.retrieve(GridFTPServerFacade.java:469) at org.globus.ftp.FTPClient.put(FTPClient.java:1289) at org.globus.cog.abstraction.impl.file.gridftp.old.FileResourceImpl.putFile(FileResourceImpl.java:399) at org.globus.cog.abstraction.impl.fileTransfer.DelegatedFileTransferHandler.doDestination(DelegatedFileTransferHandler.java:356) at org.globus.cog.abstraction.impl.fileTransfer.CachingDelegatedFileTransferHandler.doDestination(CachingDelegatedFileTransferHandler.java:47) at org.globus.cog.abstraction.impl.fileTransfer.DelegatedFileTransferHandler.run(DelegatedFileTransferHandler.java:493) at java.lang.Thread.run(Thread.java:595) ] -- From benc at hawaga.org.uk Thu Mar 13 13:16:08 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 13 Mar 2008 18:16:08 +0000 (GMT) Subject: [Swift-devel] Re: more ftp errors running terminable->tg uc In-Reply-To: References: Message-ID: I also get the previously reported errors when running on tg-login1.uc.teragrid.org - the same sites that don't work from terminable don't work here, and teraport, which does work from terminable does work here. 
-- From skenny at uchicago.edu Fri Mar 14 16:42:30 2008 From: skenny at uchicago.edu (skenny at uchicago.edu) Date: Fri, 14 Mar 2008 16:42:30 -0500 (CDT) Subject: [Swift-devel] Re: misc swift errors Message-ID: <20080314164230.BBU09580@m4500-02.uchicago.edu> so, ben is suggesting that the use of relative paths within the swift script may be the problem here. can you rerun giving the mapper the full path? string inputName=@strcat("/disks/gpfs/fmri/cnari/swift/lhROI7_4p2filter_input/sub.",subject,".block",file,".txt"); ---- Original message ---- >Date: Fri, 14 Mar 2008 14:53:10 -0500 >From: "Uri Hasson" >Subject: Re: misc swift errors >To: skenny at uchicago.edu > >Hi Sarah, > >I'm using the following: >A swift properties file: >/disks/gpfs/fmri/cnari/swift/projects/uhasson/peakfit_project/all_runs/run_0314_lhREG7_f4p2/swift.properties > >and config files in >/disks/gpfs/fmri/cnari/swift/projects/uhasson/peakfit_project/swift.conf > > >On Fri, Mar 14, 2008 at 2:51 PM, wrote: >> which config files (tc.data and sites.xml) are you using for >> this run? >> >> >> >> ---- Original message ---- >> >Date: Fri, 14 Mar 2008 14:30:35 -0500 >> >From: "Uri Hasson" >> >Subject: misc swift errors >> >To: "Sarah Kenny" , "Mihael Hategan" >> >> > >> >Hey SWIFT gurus.. I'm running swift heavy duty and >> encountering some >> >errors I can't track. >> > >> >1) In a log file of a run that's still ongoing there are >> errors on >> >"status files not" found: >> >/disks/gpfs/fmri/cnari/swift/projects/uhasson/peakfit_project/all_runs/run_0314_lhREG8_f4p2/ccf-perm-wf-20080314-1308-g61h5kic.log >> >But the job seems to be continuing... >> > >> >2) another run simply crashed with errors. Log at: >> >/disks/gpfs/fmri/cnari/swift/projects/uhasson/peakfit_project/all_runs/run_0314_lhREG7_f4p2/ccf-perm-wf-20080314-1006-ix8mnzfb.log >> >It says it can't link to a file that exists.. >> > >> >Any ideas -- much appreciated. 
>> > >> >Uri >> From wilde at mcs.anl.gov Fri Mar 14 17:47:04 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 14 Mar 2008 17:47:04 -0500 Subject: [Swift-devel] Re: more ftp errors running terminable->tg uc In-Reply-To: References: Message-ID: <47DB0068.6020305@mcs.anl.gov> Do you have (or need to get) the gridftp group involved in this? Or is this a cog-level error that only Mihael supports at the moment? Is the problem reproducible with globus-url-copy? On 3/13/08 1:16 PM, Ben Clifford wrote: > I also get the previously reported errors when running on > tg-login1.uc.teragrid.org - the same sites that don't work from terminable > don't work here, and teraport, which does work from terminable does work > here. From benc at hawaga.org.uk Fri Mar 14 17:51:50 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 14 Mar 2008 22:51:50 +0000 (GMT) Subject: [Swift-devel] Re: more ftp errors running terminable->tg uc In-Reply-To: <47DB0068.6020305@mcs.anl.gov> References: <47DB0068.6020305@mcs.anl.gov> Message-ID: On Fri, 14 Mar 2008, Michael Wilde wrote: > Do you have (or need to get) the gridftp group involved in this? > > Or is this a cog-level error that only Mihael supports at the moment? > > Is the problem reproducible with globus-url-copy? More report coming soon. Please wait. -- From benc at hawaga.org.uk Fri Mar 14 18:14:23 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 14 Mar 2008 23:14:23 +0000 (GMT) Subject: [Swift-devel] Re: more ftp errors running terminable->tg uc In-Reply-To: References: Message-ID: I have dug some more into this. The cog gridftp provider enables data channel reuse when talking to gridftp servers that report exactly version 2.3. Some of the sites that I am testing against report that version. Some report version 2.5. The sites which are version 2.3 fail to run test workflow '130-fmri' in the tests/language-behaviour directory. The sites which are not 2.3 do not exhibit this problem. 
This happens submitting from both tg-login1.uc.teragrid.org and from terminable.ci.uchicago.edu On terminable: If I change the cog gridftp provider to enable gridftp data channel reuse for version 2.5 too, then the 2.5 sites also break. If I disable data channel reuse entirely (which appears to need a source code change) then all site tests work ok. There are two separate issues here: This needs fixing in general, presumably in cog. At the moment, I'm not particularly inclined to spend large amounts of time learning how the cog ftp provider works when potentially Mihael could look at it. However, it's unclear how much time Mihael has to work on this, given his other projects, and I have no particular belief that it will be fixed any time soon. In a Swift-specific context, I'm happy for data-channel reuse to be turned off for now (eg until someone figures out what is up at the cog level) - it's already not used for any recent gridftp server (i.e. v2.5) such as tg-gridftp.uc.teragrid.org. No one has reported this as a problem in the wild (yet). I suspect test 130-fmri is especially good at exhibiting this problem. I think therefore that this should not be a release-stopper for 0.4; but that should anyone actually come across it in the wild we should rapidly put out a 0.4.1 or a 0.5 with data channel caching disabled. I would appreciate commentary on: i) the above release proposal ii) the likelihood that Mihael will have time to look at this and when that would happen (which is essentially the question - do I have to go learn the guts of the gt2 cog provider?) 
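The gating behaviour Ben describes can be summarised in a few lines. This is a hedged sketch of the logic only: `reuse_enabled` and `disable_reuse` are invented names for illustration, not the CoG provider's actual API.

```shell
# Sketch of the data-channel-reuse gating Ben describes: the cog gridftp
# provider enables reuse only when the server reports exactly version 2.3,
# and Ben's tested workaround is to disable reuse entirely.
disable_reuse=false   # set to true for the "disable entirely" workaround
reuse_enabled() {
    # $1: version string reported by the gridftp server
    [ "$disable_reuse" = false ] && [ "$1" = "2.3" ]
}
reuse_enabled "2.3" && echo "2.3: reuse on (the failing case)"
reuse_enabled "2.5" || echo "2.5: reuse off"
```

Under this reading, enabling reuse for 2.5 servers as well (Ben's experiment) widens the failure, while forcing `disable_reuse` everywhere makes all the site tests pass.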
-- From wilde at mcs.anl.gov Fri Mar 14 18:23:06 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 14 Mar 2008 18:23:06 -0500 Subject: [Swift-devel] Re: more ftp errors running terminable->tg uc In-Reply-To: References: Message-ID: <47DB08DA.7040703@mcs.anl.gov> i) proposal sounds good to me ii) Mihael is on pseudo-vacation (supposed to be real vacation at the moment, but he is being a great guy to help launch an i2u2 release that slipped). So lets wait for Mihael to weigh in. Only thing I can offer is once i2u2 release is live and stable, fix gridftp next, modulo vacation preferences). - Mike On 3/14/08 6:14 PM, Ben Clifford wrote: > I have dug some more into this. > > The cog gridftp provider enables data channel reuse when talking to > gridftp servers that report exactly version 2.3. > > Some of the sites that I am testing against report that version. Some > report version 2.5. > > The sites which are version 2.3 fail to run test workflow '130-fmri' in > the tests/language-behaviour directory. The sites which are not 2.3 do not > exhibit this problem. > > This happens submitting from both tg-login1.uc.teragrid.org and from > terminable.ci.uchicago.edu > > On terminable: > > If I change the cog gridftp provider to enable gridftp data channel reuse > for version 2.5 too, then the 2.5 sites also break. > > If I disable data channel reuse entirely (which appears to need a source > code change) then all site tests work ok. > > There are two separate issues here: > > This needs fixing in general, presumably in cog. At the moment, I'm not > particularly inclined to spend large amounts of time learning how the cog > ftp provider works when potentially mihael could look at it. However, its > unclear how much time mihael has to work on this, given his other projects > and I have no particular belief that it will be fixed any time soon. 
> > In a Swift-specific context, I'm happy for data-channel reuse to be turned > off for now (eg until someone figures out what is up at the cog level) - > its already not used for any recent gridftp server (i.e. v2.5) such as > tg-gridftp.uc.teragrid.org. > > No one has reported this as a problem in the wild (yet). I suspect test > 130-fmri is especially good at exhibiting this problem. > > I think therefore that this should not be a release-stopped for 0.4; but > that should anyone actually come across it in the wild we should rapidly > put out a 0.4.1 or a 0.5 with data channel caching disabled. > > I would appreciate commentary on: > > i) the above release proposal > > ii) the likelihood that Mihael will have time to look at this and when > that would happen (which is essentially the question - do I have to > go learn the guts of the gt2 cog provider?) > From benc at hawaga.org.uk Fri Mar 14 19:42:43 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Sat, 15 Mar 2008 00:42:43 +0000 (GMT) Subject: [Swift-devel] Re: swift 0.4-rc2 In-Reply-To: References: Message-ID: On Thu, 13 Mar 2008, Ben Clifford wrote: > http://www.ci.uchicago.edu/~benc/vdsk-0.4rc2.tar.gz > Please test. please? -- From hategan at mcs.anl.gov Sat Mar 15 04:36:02 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sat, 15 Mar 2008 04:36:02 -0500 Subject: [Swift-devel] Re: more ftp errors running terminable->tg uc In-Reply-To: References: Message-ID: <1205573762.24604.2.camel@blabla.mcs.anl.gov> > I would appreciate commentary on: > > i) the above release proposal > > ii) the likelihood that Mihael will have time to look at this and when > that would happen (which is essentially the question - do I have to > go learn the guts of the gt2 cog provider?) Looking at the plan we have, I've only spent 1 week real time on making transfers faster. Which leaves another week of real time to fix the problems. 
> From wilde at mcs.anl.gov Sat Mar 15 07:34:40 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sat, 15 Mar 2008 07:34:40 -0500 Subject: [Swift-devel] Re: swift 0.4-rc2 In-Reply-To: References: Message-ID: <47DBC260.1010104@mcs.anl.gov> I'm testing this weekend, but at the moment using 1723. I can switch to rc2 when I get a chance. On 3/14/08 7:42 PM, Ben Clifford wrote: > On Thu, 13 Mar 2008, Ben Clifford wrote: > >> http://www.ci.uchicago.edu/~benc/vdsk-0.4rc2.tar.gz >> Please test. > > please? > From wilde at mcs.anl.gov Sun Mar 16 19:48:23 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sun, 16 Mar 2008 19:48:23 -0500 Subject: [Swift-devel] swift-falkon problem Message-ID: <47DDBFD7.2050700@mcs.anl.gov> Ioan, I'm stuck at: RunID: 20080316-1643-g4n8t252 Progress: runam3 started error: Notification(int timeout): socket = new ServerSocket(recvPort); Address already in use error: Notification(int timeout): socket = new ServerSocket(recvPort); Address already in use Waiting for notification for 0 ms Received notification with 1 messages Failed to transfer wrapper log from amps1-20080316-1643-g4n8t252/info/0/sico runam3 failed Execution failed: Exception in runam3: Arguments: [0000, 0.1899, 0.1858] Host: sico Does this look familiar? -- What I'm confused about is: - the deef-provider code that I get with a swift checkout seems to have out of date falkon stubs (I get a runtime error on a missing xml element) - if I grab a FalkonStubs jar from Zhao's bgp swift tree and use it in a newly compiled swift tree, should that work? It seems to get further. It *seems* like swift is reaching Falkon (I can see something in a falkon logfile that looks like swift-generated job ids) but then I'm getting the errors above. The log file doesn't contain any details, just what's below. I'll double-check all my steps and package up the full log file, but wanted to get this out to you before I spend too much more time debugging, hoping someone recognizes the problem. 
I note that I haven't yet found the strings above, like "Waiting for notification" in the swift source tree. Thanks, Mike 2008-03-16 16:43:42,807-0600 INFO vdl:createdirset END jobid=runam3-0cu5avpi - Done initializing directory structure 2008-03-16 16:43:42,809-0600 INFO vdl:dostagein START jobid=runam3-0cu5avpi - Staging in files 2008-03-16 16:43:42,810-0600 INFO vdl:dostagein END jobid=runam3-0cu5avpi - Staging in finished 2008-03-16 16:43:42,812-0600 DEBUG vdl:execute2 JOB_START jobid=runam3-0cu5avpi tr=runam3 arguments=[0000, 0.1899, 0.1858] tmpdir=amps1-20080316-1643-g4n8t252/jobs/0/runam3-0cu5avpi host=sico 2008-03-16 16:43:42,829-0600 DEBUG WeightedHostScoreScheduler multiplyScore(sico:0.000(1.000):1/1000002, -0.2) 2008-03-16 16:43:42,829-0600 DEBUG WeightedHostScoreScheduler Old score: 0.000, new score: -0.200 2008-03-16 16:43:43,693-0600 DEBUG TaskImpl Task(type=JOB_SUBMISSION, identity=urn:0-1-1-1205707420808) setting status to Submitting 2008-03-16 16:43:43,693-0600 DEBUG TaskImpl Task(type=JOB_SUBMISSION, identity=urn:0-1-1-1205707420808) setting status to Submitted 2008-03-16 16:43:43,693-0600 DEBUG WeightedHostScoreScheduler Submission time for Task(type=JOB_SUBMISSION, identity=urn:0-1-1-1205707420808): 0ms. 
Score delta: 0.002564102564102564 2008-03-16 16:43:43,694-0600 DEBUG WeightedHostScoreScheduler multiplyScore(sico:-0.200(0.889):1/889402, 0.002564102564102564) 2008-03-16 16:43:43,694-0600 DEBUG WeightedHostScoreScheduler Old score: -0.200, new score: -0.197 2008-03-16 16:43:43,694-0600 INFO JobSubmissionTaskHandler Job submitted 2008-03-16 16:43:44,213-0600 DEBUG TaskImpl Task(type=JOB_SUBMISSION, identity=urn:0-1-1-1205707420808) setting status to Active 2008-03-16 16:43:44,213-0600 DEBUG TaskImpl Task(type=JOB_SUBMISSION, identity=urn:0-1-1-1205707420808) setting status to Failed 2008-03-16 16:43:44,218-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION jobid=runam3-0cu5avpi - Application exception: Task failed task:execute @ vdl-int.k, line: 386 sys:sequential @ vdl-int.k, line: 378 sys:try @ vdl-int.k, line: 377 task:allocatehost @ vdl-int.k, line: 356 vdl:execute2 @ execute-default.k, line: 23 sys:restartonerror @ execute-default.k, line: 21 sys:sequential @ execute-default.k, line: 19 sys:try @ execute-default.k, line: 18 sys:if @ execute-default.k, line: 17 sys:then @ execute-default.k, line: 16 sys:if @ execute-default.k, line: 15 vdl:execute @ amps1.kml, line: 52 runam3 @ amps1.kml, line: 92 sys:sequential @ amps1.kml, line: 91 sys:parallelfor @ amps1.kml, line: 73 sys:sequential @ amps1.kml, line: 72 doall @ amps1.kml, line: 142 sys:sequential @ amps1.kml, line: 141 sys:parallel @ amps1.kml, line: 131 vdl:mainp @ amps1.kml, line: 130 mainp @ vdl.k, line: 150 vdl:mains @ amps1.kml, line: 128 vdl:mains @ amps1.kml, line: 128 rlog:restartlog @ amps1.kml, line: 126 kernel:project @ amps1.kml, line: 2 amps1-20080316-1643-g4n8t252 From benc at hawaga.org.uk Mon Mar 17 06:05:15 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 17 Mar 2008 11:05:15 +0000 (GMT) Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <47DDBFD7.2050700@mcs.anl.gov> References: <47DDBFD7.2050700@mcs.anl.gov> Message-ID: So from the Swift log you paste, this line: 
2008-03-16 16:43:44,213-0600 DEBUG TaskImpl Task(type=JOB_SUBMISSION, identity=urn:0-1-1-1205707420808) setting status to Active 2008-03-16 16:43:44,213-0600 DEBUG TaskImpl Task(type=JOB_SUBMISSION, identity=urn:0-1-1-1205707420808) setting status to Failed suggests that provider-deef is reporting a failure up to the Swift runtime. There may be some more logs that you can turn on at the provider-deef or falkon layer, but I don't know what the likely ones would be. From a process point-of-view, I'm concerned about this: > - the deef-provider code that I get with a swift checkout seems to have > out > of date falkon stubs (I get a runtime error on a missing xml element) > - if I grab a FalkonStubs jar from Zhao's bgp swift tree and use it in a > newly compiled swift tree, should that work? It seems to get further. If provider-deef needs updating, those updates should be made in SVN; if it doesn't need updating, then you have some other problem. Ioan and Zhao, if you had to update something to make it work, please commit that change. -- From iraicu at cs.uchicago.edu Mon Mar 17 07:43:22 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Mon, 17 Mar 2008 07:43:22 -0500 Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: References: <47DDBFD7.2050700@mcs.anl.gov> Message-ID: <47DE676A.8090604@cs.uchicago.edu> Ben Clifford wrote: > So from the Swift log you paste, this line: > > 2008-03-16 16:43:44,213-0600 DEBUG TaskImpl Task(type=JOB_SUBMISSION, > identity=urn:0-1-1-1205707420808) setting status to Active > 2008-03-16 16:43:44,213-0600 DEBUG TaskImpl Task(type=JOB_SUBMISSION, > identity=urn:0-1-1-1205707420808) setting status to Failed > > suggests that provider-deef is reporting a failure up to the Swift > runtime. There may be some more logs that you can turn on at the > provider-deef or falkon layer, but I don't know what the likely ones would > be. 
> iraicu at viper:~/java/svn/cog/modules/provider-deef/etc> cat log4j.properties.module log4j.logger.org.apache.axis.utils.JavaUtils=ERROR log4j.logger.org.globus.cog.abstraction.impl.execution.deef=DEBUG I have this log4j property, do you have this enabled? This should enable more debug output from the Falkon provider. > From a process point-of-view, I'm concerned about this: > > >> - the deef-provider code that I get with a swift checkout seems to have >> out >> of date falkon stubs (I get a runtime error on a missing xml element) >> - if I grab a FalkonStubs jar from Zhao's bgp swift tree and use it in a >> newly compiled swift tree, should that work? It seems to get further. >> > > If provider-deef needs updating, those updates should be made in SVN; if > it doesn't need updating, then you have some other problem. Ioan and Zhao, > if you had to update something to make it work, please commit that change. > I know we should update SVN, we just haven't gotten around to it. I just updated the stubs in the Swift SVN (R1727). Mike, give it a try again from SVN. Ioan -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== -------------- next part -------------- An HTML attachment was scrubbed... 
URL: 

From iraicu at cs.uchicago.edu Mon Mar 17 08:03:52 2008
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Mon, 17 Mar 2008 08:03:52 -0500
Subject: [Swift-devel] Re: swift-falkon problem
In-Reply-To: <47DDBFD7.2050700@mcs.anl.gov>
References: <47DDBFD7.2050700@mcs.anl.gov>
Message-ID: <47DE6C38.40802@cs.uchicago.edu>

Michael Wilde wrote:
> Ioan,
>
> I'm stuck at:
>
> RunID: 20080316-1643-g4n8t252
> Progress:
> runam3 started
> error: Notification(int timeout): socket = new ServerSocket(recvPort);
> Address already in use
> error: Notification(int timeout): socket = new ServerSocket(recvPort);
> Address already in use

This is just a warning; it's not causing any trouble.

> Waiting for notification for 0 ms
> Received notification with 1 messages

This means that the Falkon service sent back a notification, which means that all went well: it had received a task, attempted to execute it, and returned back a result... but apparently a failed result.

> Failed to transfer wrapper log from
> amps1-20080316-1643-g4n8t252/info/0/sico

I don't understand this error; how is this error text being generated? Falkon only returns back a numeric exit code. Could this be a post-processing error, when Swift couldn't manipulate the local file system or couldn't find some expected files? What exit code does Falkon return for this task, 0, or something else?

> runam3 failed
> Execution failed:
> Exception in runam3:
> Arguments: [0000, 0.1899, 0.1858]
> Host: sico
>
> Does this look familiar?
>
> --
>
> What I'm confused about is:
>
> - the deef-provider code that I get with a swift checkout seems to
> have out of date falkon stubs (I get a runtime error on a missing xml
> element)

I just updated them in SVN.

>
> - if I grab a FalkonStubs jar from Zhao's bgp swift tree and use it in
> a newly compiled swift tree, should that work? It seems to get further.
> > It *seems* like swift is reaching Falkon - I can see something in a > falkon logfile that looks like swift-generated job ids) but then I'm > getting the errors above. We need to figure out if the failure is in executing the tasks in Falkon, or if that is OK, and the error is in Swift not finding some files afterwards. > > The log file doesnt contain any details, just whats below. > > I'll double-check all my steps and package up the full log file, but > wanted to get this out to you before I spend too much more time > debugging, hoping someone recognizes the problem. > > I note that I havent yet found the strings above, like "Waiting for > notification" in the swift source tree. That is from the FalkonStubs.jar (falkon/service/org/globus/GenericPortal/common/Notification.java), so you won't find that. I should probably disable all the logging from FalkonStubs.jar code by default. Once you enable the Falkon provider debug logging, there are more per task logs that get printed... for example, file cog/modules/provider-deef/src/org/globus/cog/abstraction/impl/execution/deef/NotificationThread.java would print "Falkon: waiting for notifications...", and then print the contents of the notification when it received them... 
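The "Address already in use" notices above come from Notification's `new ServerSocket(recvPort)` finding the port already bound. A minimal self-contained sketch of that condition (the class name is hypothetical; this is not the Falkon code itself, just the same failure mode):

```java
import java.io.IOException;
import java.net.BindException;
import java.net.InetAddress;
import java.net.ServerSocket;

public class BindDemo {
    // Returns the message seen when a second listener tries to bind a
    // port that an existing listener already holds.
    static String tryDoubleBind() {
        try (ServerSocket first =
                new ServerSocket(0, 50, InetAddress.getByName("127.0.0.1"))) {
            // Same port, same interface: this bind fails the way the
            // Notification constructor's warning does.
            new ServerSocket(first.getLocalPort(), 50,
                    InetAddress.getByName("127.0.0.1")).close();
            return "second bind unexpectedly succeeded";
        } catch (BindException expected) {
            // Harmless if the code falls back to the listener that is
            // already running, which is why this is only a warning.
            return "Address already in use";
        } catch (IOException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(tryDoubleBind());
    }
}
```

Since the already-bound socket keeps working, code that catches the exception and reuses the existing listener can safely log this and continue.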
Ioan > > Thanks, > > Mike > > > > > 2008-03-16 16:43:42,807-0600 INFO vdl:createdirset END > jobid=runam3-0cu5avpi - Done initializing directory structure > 2008-03-16 16:43:42,809-0600 INFO vdl:dostagein START > jobid=runam3-0cu5avpi - Staging in files > 2008-03-16 16:43:42,810-0600 INFO vdl:dostagein END > jobid=runam3-0cu5avpi - Staging in finished > 2008-03-16 16:43:42,812-0600 DEBUG vdl:execute2 JOB_START > jobid=runam3-0cu5avpi tr=runam3 arguments=[0000, 0.1899, 0.1858] > tmpdir=amps1-20080316-1643-g4n8t252/jobs/0/runam3-0cu5avpi host=sico > 2008-03-16 16:43:42,829-0600 DEBUG WeightedHostScoreScheduler > multiplyScore(sico:0.000(1.000):1/1000002, -0.2) > 2008-03-16 16:43:42,829-0600 DEBUG WeightedHostScoreScheduler Old > score: 0.000, new score: -0.200 > 2008-03-16 16:43:43,693-0600 DEBUG TaskImpl Task(type=JOB_SUBMISSION, > identity=urn:0-1-1-1205707420808) setting status to Submitting > 2008-03-16 16:43:43,693-0600 DEBUG TaskImpl Task(type=JOB_SUBMISSION, > identity=urn:0-1-1-1205707420808) setting status to Submitted > 2008-03-16 16:43:43,693-0600 DEBUG WeightedHostScoreScheduler > Submission time for Task(type=JOB_SUBMISSION, > identity=urn:0-1-1-1205707420808): 0ms. 
Score delta: 0.002564102564102564 > 2008-03-16 16:43:43,694-0600 DEBUG WeightedHostScoreScheduler > multiplyScore(sico:-0.200(0.889):1/889402, 0.002564102564102564) > 2008-03-16 16:43:43,694-0600 DEBUG WeightedHostScoreScheduler Old > score: -0.200, new score: -0.197 > 2008-03-16 16:43:43,694-0600 INFO JobSubmissionTaskHandler Job submitted > 2008-03-16 16:43:44,213-0600 DEBUG TaskImpl Task(type=JOB_SUBMISSION, > identity=urn:0-1-1-1205707420808) setting status to Active > 2008-03-16 16:43:44,213-0600 DEBUG TaskImpl Task(type=JOB_SUBMISSION, > identity=urn:0-1-1-1205707420808) setting status to Failed > 2008-03-16 16:43:44,218-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION > jobid=runam3-0cu5avpi - Application exception: Task failed > task:execute @ vdl-int.k, line: 386 > sys:sequential @ vdl-int.k, line: 378 > sys:try @ vdl-int.k, line: 377 > task:allocatehost @ vdl-int.k, line: 356 > vdl:execute2 @ execute-default.k, line: 23 > sys:restartonerror @ execute-default.k, line: 21 > sys:sequential @ execute-default.k, line: 19 > sys:try @ execute-default.k, line: 18 > sys:if @ execute-default.k, line: 17 > sys:then @ execute-default.k, line: 16 > sys:if @ execute-default.k, line: 15 > vdl:execute @ amps1.kml, line: 52 > runam3 @ amps1.kml, line: 92 > sys:sequential @ amps1.kml, line: 91 > sys:parallelfor @ amps1.kml, line: 73 > sys:sequential @ amps1.kml, line: 72 > doall @ amps1.kml, line: 142 > sys:sequential @ amps1.kml, line: 141 > sys:parallel @ amps1.kml, line: 131 > vdl:mainp @ amps1.kml, line: 130 > mainp @ vdl.k, line: 150 > vdl:mains @ amps1.kml, line: 128 > vdl:mains @ amps1.kml, line: 128 > rlog:restartlog @ amps1.kml, line: 126 > kernel:project @ amps1.kml, line: 2 > amps1-20080316-1643-g4n8t252 > > -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 
58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== From wilde at mcs.anl.gov Mon Mar 17 08:34:28 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 17 Mar 2008 08:34:28 -0500 Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <47DE676A.8090604@cs.uchicago.edu> References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> Message-ID: <47DE7364.20700@mcs.anl.gov> I did a clean checkout to get the latest rev directly on the bblogin machine (previously I copied the code). Strange: this time provider-deef didnt show up in the modules directory. I *thought* last time it did, unless I'm imagining things. Did something just change w.r.t provider-deef, or is my memory faulty? - Mike On 3/17/08 7:43 AM, Ioan Raicu wrote: > > > Ben Clifford wrote: >> So from the Swift log you paste, this line: >> >> 2008-03-16 16:43:44,213-0600 DEBUG TaskImpl Task(type=JOB_SUBMISSION, >> identity=urn:0-1-1-1205707420808) setting status to Active >> 2008-03-16 16:43:44,213-0600 DEBUG TaskImpl Task(type=JOB_SUBMISSION, >> identity=urn:0-1-1-1205707420808) setting status to Failed >> >> suggests that provider-deef is reporting a failure up to the Swift >> runtime. There may be some more logs that you can turn on at the >> provider-deef or falkon layer, but I don't know what the likely ones would >> be. >> > iraicu at viper:~/java/svn/cog/modules/provider-deef/etc> cat > log4j.properties.module > log4j.logger.org.apache.axis.utils.JavaUtils=ERROR > log4j.logger.org.globus.cog.abstraction.impl.execution.deef=DEBUG > > I have this log4j property, do you have this enabled? This should > enable more debug output from the Falkon provider. 
>> >From a process point-of-view, I'm concerned about this: >> >> >>> - the deef-provider code that I get with a swift checkout seems to have >>> out >>> of date falkon stubs (I get a runtime error on a missing xml element) >>> - if I grab a FalkonStubs jar from Zhao's bgp swift tree and use it in a >>> newly compiled swift tree, should that work? It seems to get further. >>> >> >> If provider-deef needs updating, those updates should be made in SVN; if >> it doesn't need updating, then you have some other problem. Ioan and Zhao, >> if you had to update something to make it work, please commit that change. >> > I know we should update SVN, we just haven't gotten around to it. I > just updated the stubs in the Swift SVN (R1727). Mike, give it a try > again from SVN. > > Ioan > > -- > =================================================== > Ioan Raicu > Ph.D. Candidate > =================================================== > Distributed Systems Laboratory > Computer Science Department > University of Chicago > 1100 E. 58th Street, Ryerson Hall > Chicago, IL 60637 > =================================================== > Email: iraicu at cs.uchicago.edu > Web: http://www.cs.uchicago.edu/~iraicu > http://dev.globus.org/wiki/Incubator/Falkon > http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page > =================================================== > =================================================== > > From wilde at mcs.anl.gov Mon Mar 17 10:14:37 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 17 Mar 2008 10:14:37 -0500 Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <47DE7364.20700@mcs.anl.gov> References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DE7364.20700@mcs.anl.gov> Message-ID: <47DE8ADD.3020603@mcs.anl.gov> as i backtrack looking at my svn trees, i see that indeed my memory was wrong and i do need to checkout deef explicitly. 
- mike On 3/17/08 8:34 AM, Michael Wilde wrote: > I did a clean checkout to get the latest rev directly on the bblogin > machine (previously I copied the code). > > Strange: this time provider-deef didnt show up in the modules directory. > I *thought* last time it did, unless I'm imagining things. > > Did something just change w.r.t provider-deef, or is my memory faulty? > > - Mike > > > > On 3/17/08 7:43 AM, Ioan Raicu wrote: >> >> >> Ben Clifford wrote: >>> So from the Swift log you paste, this line: >>> >>> 2008-03-16 16:43:44,213-0600 DEBUG TaskImpl >>> Task(type=JOB_SUBMISSION, identity=urn:0-1-1-1205707420808) >>> setting status to Active 2008-03-16 >>> 16:43:44,213-0600 DEBUG TaskImpl Task(type=JOB_SUBMISSION, >>> identity=urn:0-1-1-1205707420808) setting status to Failed >>> suggests that provider-deef is reporting a failure up to the Swift >>> runtime. There may be some more logs that you can turn on at the >>> provider-deef or falkon layer, but I don't know what the likely ones >>> would be. >>> >> iraicu at viper:~/java/svn/cog/modules/provider-deef/etc> cat >> log4j.properties.module >> log4j.logger.org.apache.axis.utils.JavaUtils=ERROR >> log4j.logger.org.globus.cog.abstraction.impl.execution.deef=DEBUG >> >> I have this log4j property, do you have this enabled? This should >> enable more debug output from the Falkon provider. >>> >From a process point-of-view, I'm concerned about this: >>> >>> >>>> - the deef-provider code that I get with a swift checkout seems to >>>> have out of date falkon stubs (I get a runtime error on a >>>> missing xml element) - if I grab a FalkonStubs jar from >>>> Zhao's bgp swift tree and use it in a newly compiled swift >>>> tree, should that work? It seems to get further. >>> >>> If provider-deef needs updating, those updates should be made in SVN; >>> if it doesn't need updating, then you have some other problem. Ioan >>> and Zhao, if you had to update something to make it work, please >>> commit that change. 
>>> >> I know we should update SVN, we just haven't gotten around to it. I >> just updated the stubs in the Swift SVN (R1727). Mike, give it a try >> again from SVN. >> >> Ioan >> >> -- >> =================================================== >> Ioan Raicu >> Ph.D. Candidate >> =================================================== >> Distributed Systems Laboratory >> Computer Science Department >> University of Chicago >> 1100 E. 58th Street, Ryerson Hall >> Chicago, IL 60637 >> =================================================== >> Email: iraicu at cs.uchicago.edu >> Web: http://www.cs.uchicago.edu/~iraicu >> http://dev.globus.org/wiki/Incubator/Falkon >> http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page >> =================================================== >> =================================================== >> >> > From wilde at mcs.anl.gov Mon Mar 17 10:23:37 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 17 Mar 2008 10:23:37 -0500 Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <47DE8ADD.3020603@mcs.anl.gov> References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DE7364.20700@mcs.anl.gov> <47DE8ADD.3020603@mcs.anl.gov> Message-ID: <47DE8CF9.8060907@mcs.anl.gov> With a clean checkout of 1727 and explicit checkout of provider-deef, falkon works on the sicortex. - mike On 3/17/08 10:14 AM, Michael Wilde wrote: > as i backtrack looking at my svn trees, i see that indeed my memory was > wrong and i do need to checkout deef explicitly. > > - mike > > > On 3/17/08 8:34 AM, Michael Wilde wrote: >> I did a clean checkout to get the latest rev directly on the bblogin >> machine (previously I copied the code). >> >> Strange: this time provider-deef didnt show up in the modules >> directory. I *thought* last time it did, unless I'm imagining things. >> >> Did something just change w.r.t provider-deef, or is my memory faulty? 
>> >> - Mike >> >> >> >> On 3/17/08 7:43 AM, Ioan Raicu wrote: >>> >>> >>> Ben Clifford wrote: >>>> So from the Swift log you paste, this line: >>>> >>>> 2008-03-16 16:43:44,213-0600 DEBUG TaskImpl >>>> Task(type=JOB_SUBMISSION, >>>> identity=urn:0-1-1-1205707420808) setting status to >>>> Active 2008-03-16 16:43:44,213-0600 DEBUG >>>> TaskImpl Task(type=JOB_SUBMISSION, >>>> identity=urn:0-1-1-1205707420808) setting status to Failed suggests >>>> that provider-deef is reporting a failure up to the Swift runtime. >>>> There may be some more logs that you can turn on at the >>>> provider-deef or falkon layer, but I don't know what the likely ones >>>> would be. >>>> >>> iraicu at viper:~/java/svn/cog/modules/provider-deef/etc> cat >>> log4j.properties.module >>> log4j.logger.org.apache.axis.utils.JavaUtils=ERROR >>> log4j.logger.org.globus.cog.abstraction.impl.execution.deef=DEBUG >>> >>> I have this log4j property, do you have this enabled? This should >>> enable more debug output from the Falkon provider. >>>> >From a process point-of-view, I'm concerned about this: >>>> >>>> >>>>> - the deef-provider code that I get with a swift checkout seems to >>>>> have out of date falkon stubs (I get a runtime error on a >>>>> missing xml element) - if I grab a FalkonStubs jar from >>>>> Zhao's bgp swift tree and use it in a newly compiled swift >>>>> tree, should that work? It seems to get further. >>>> >>>> If provider-deef needs updating, those updates should be made in >>>> SVN; if it doesn't need updating, then you have some other problem. >>>> Ioan and Zhao, if you had to update something to make it work, >>>> please commit that change. >>>> >>> I know we should update SVN, we just haven't gotten around to it. I >>> just updated the stubs in the Swift SVN (R1727). Mike, give it a try >>> again from SVN. >>> >>> Ioan >>> >>> -- >>> =================================================== >>> Ioan Raicu >>> Ph.D. 
Candidate
>>> ===================================================
>>> Distributed Systems Laboratory
>>> Computer Science Department
>>> University of Chicago
>>> 1100 E. 58th Street, Ryerson Hall
>>> Chicago, IL 60637
>>> ===================================================
>>> Email: iraicu at cs.uchicago.edu
>>> Web: http://www.cs.uchicago.edu/~iraicu
>>> http://dev.globus.org/wiki/Incubator/Falkon
>>> http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
>>> ===================================================
>>> ===================================================
>>>
>>>
>>
>

From benc at hawaga.org.uk Mon Mar 17 14:16:34 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 17 Mar 2008 19:16:34 +0000 (GMT)
Subject: [Swift-devel] Re: swift-falkon problem
In-Reply-To: <47DE6C38.40802@cs.uchicago.edu>
References: <47DDBFD7.2050700@mcs.anl.gov> <47DE6C38.40802@cs.uchicago.edu>
Message-ID: 

On Mon, 17 Mar 2008, Ioan Raicu wrote:

> > Failed to transfer wrapper log from amps1-20080316-1643-g4n8t252/info/0/sico
> I don't understand this error, how is this error text being generated? Falkon
> only returns back a numeric exit code. Could this error be a post processing
> error when Swift couldn't manipulate the local file system, or it couldn't
> find some expected files? What exit code does Falkon return for this task,
> 0, or something else?

That bit is a follow-on error, so pretty much ignore it - I haven't figured out what the right thing to do is for presenting it to the user. Basically:

1. swift tries to run an executable (using provider-deef, in this case)
2. run of executable fails (i.e. provider-deef is passing back an error)
3. swift tries to stage back the wrapper log to help diagnosis
4. the wrapper log doesn't exist (presumably the wrapper never executed
   that far in the failed executable step 1)
5. swift reports the above error/warning that step 3 failed.

So that line is an error that is a follow-on from step 1 failing.
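Mechanically, steps 3-5 amount to a best-effort copy whose failure should be downgraded to a warning, since the interesting failure already happened at step 1. A sketch of that pattern (the class, method, and file names here are hypothetical, for illustration only; this is not Swift's actual stage-out code):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class WrapperLogFetch {
    // Best-effort stage-back of the wrapper log: a missing log is
    // reported as a warning and false is returned, rather than raising
    // a second error on top of the job failure itself.
    static boolean fetchWrapperLog(Path remoteInfoDir, Path localDir) {
        Path log = remoteInfoDir.resolve("wrapper.log");
        try {
            Files.copy(log, localDir.resolve("wrapper.log"));
            return true;
        } catch (IOException e) {
            // Mirrors the "Failed to transfer wrapper log from ..."
            // message the user sees.
            System.err.println("Failed to transfer wrapper log from " + log);
            return false;
        }
    }
}
```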
> We need to figure out if the failure is in executing the tasks in Falkon, or
> if that is OK, and the error is in Swift not finding some files afterwards.

Provider-deef is reporting that execution failed, so it's not the second of those. But there isn't enough log information in Mike's report to indicate where below the swift/provider-deef interface the error is occurring.

The present swift+falkon joint deployment code still seems screwy enough to not merge in the provider-deef log4j configuration, which is annoying. I'll have a look at that.

--

From benc at hawaga.org.uk Mon Mar 17 14:19:39 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 17 Mar 2008 19:19:39 +0000 (GMT)
Subject: [Swift-devel] Re: swift-falkon problem
In-Reply-To: <47DE676A.8090604@cs.uchicago.edu>
References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu>
Message-ID: 

On Mon, 17 Mar 2008, Ioan Raicu wrote:

> iraicu at viper:~/java/svn/cog/modules/provider-deef/etc> cat
> log4j.properties.module
> log4j.logger.org.apache.axis.utils.JavaUtils=ERROR
> log4j.logger.org.globus.cog.abstraction.impl.execution.deef=DEBUG
>
> I have this log4j property, do you have this enabled? This should
> enable more debug output from the Falkon provider.

If deploying swift with this command: ant -Dwith-provider-deef redist
then it looks like those lines don't get merged in. Mike, add those lines
yourself to your dist/swift-0.3-dev/etc/log4j.properties file.

--

From wilde at mcs.anl.gov Mon Mar 17 14:53:38 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 17 Mar 2008 14:53:38 -0500
Subject: [Swift-devel] Re: swift-falkon problem
In-Reply-To: 
References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu>
Message-ID: <47DECC42.4020500@mcs.anl.gov>

OK, thanks, will do.

My earlier message that "it worked" was premature - my sites file was doing local execution.

I pushed forward past a few other errors and am now stuck as follows.
As far as I can tell I'm now getting the "NFS not syncing" problem.

Swift creates the working dir for the workflow as a local file reference to an NFS-mounted directory. When swift tells falkon to run shared/wrapper.sh, it's not there yet. When I look after the workflow has failed, it is indeed there.

What I'd rather do here is tell swift to use scp rather than direct-file-access as the data provider. Do you know how to do that? Or are there any other data transports to consider? (One alternative is to get gridftp running on sico.)

On 3/17/08 2:19 PM, Ben Clifford wrote:
> On Mon, 17 Mar 2008, Ioan Raicu wrote:
>
>> iraicu at viper:~/java/svn/cog/modules/provider-deef/etc> cat
>> log4j.properties.module
>> log4j.logger.org.apache.axis.utils.JavaUtils=ERROR
>> log4j.logger.org.globus.cog.abstraction.impl.execution.deef=DEBUG
>>
>> I have this log4j property, do you have this enabled? This should
>> enable more debug output from the Falkon provider.
>
> If deploying swift with this command: ant -Dwith-provider-deef redist
> then it looks like those lines don't get merged in. Mike, add those lines
> yourself to your dist/swift-0.3-dev/etc/log4j.properties file.
>

From benc at hawaga.org.uk Mon Mar 17 14:57:07 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 17 Mar 2008 19:57:07 +0000 (GMT)
Subject: [Swift-devel] Re: swift-falkon problem
In-Reply-To: <47DECC42.4020500@mcs.anl.gov>
References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov>
Message-ID: 

What does your filesystem layout look like?

Where are you running swift? And where are you putting your scicortex site directory? On an NFS that is also accessible from your submit machine? If so, what path?
--

From wilde at mcs.anl.gov Mon Mar 17 15:19:23 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 17 Mar 2008 15:19:23 -0500
Subject: [Swift-devel] Re: swift-falkon problem
In-Reply-To: 
References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov>
Message-ID: <47DED24B.8070900@mcs.anl.gov>

Sorry - another mis-diagnosis and incorrect conclusion on my part.

Zhao just told me that we have out of date falkon worker code on the sicortex that is not chdir'ing to the cwd arg of the falkon request.

That explains what I'm seeing. It's being fixed now and checked in.

--

To answer your questions though:

I'm running swift on a linux box, bblogin.mcs.anl.gov. It mounts the sicortex under /sicortex-homes, and I run swift from /sicortex-homes/wilde/amiga/run.

My sites file says:

/home/wilde/swiftwork

and /home/wilde/swiftwork on bblogin is a symlink to /sicortex-homes/wilde/swiftwork, so that when swift writes files to the sicortex dir (e.g. when it creates shared/*) it's using the same pathname that the worker side will use when the job runs. I.e., even though the mount points differ between the swift host and the worker host, symlinks make the workdir appear under the same name on both sides.

If NFS adheres to its close-to-open coherence semantics, this should, I think, work.

My scp-provider question is probably still worth answering and trying if this doesn't work.

- Mike

On 3/17/08 2:57 PM, Ben Clifford wrote:
> what does your filesystem layout look like?
>
> Where are you running swift? And where are you putting your
> scicortex site directory? On an NFS that is also accessible from your
> submit machine? If so, what path?
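The symlink arrangement described above can be checked in isolation: a file written through the link must resolve to the same underlying file as the one at the real mount path. A small sketch that recreates the two views in a local scratch directory (the path names mirror the layout above but everything here is created locally, purely for illustration):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class SymlinkLayout {
    // "mount" stands in for /sicortex-homes/wilde/swiftwork (the real
    // NFS mount) and "link" for the /home/wilde/swiftwork symlink that
    // gives the submit host the same pathname the workers see.
    static boolean sameFileThroughLink(Path scratch) {
        try {
            Path mount = Files.createDirectories(
                    scratch.resolve("sicortex-homes/wilde/swiftwork"));
            Path link = scratch.resolve("swiftwork");
            Files.createSymbolicLink(link, mount);
            // Write through the symlinked name, as swift would when it
            // creates shared/wrapper.sh.
            Files.write(link.resolve("wrapper.sh"), "#!/bin/sh\n".getBytes());
            // Both names must refer to one underlying file.
            return Files.isSameFile(link.resolve("wrapper.sh"),
                    mount.resolve("wrapper.sh"));
        } catch (IOException e) {
            return false;
        }
    }
}
```

This only verifies the naming trick; it says nothing about NFS close-to-open coherence timing, which is the separate question raised above.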
>

From zhaozhang at uchicago.edu Mon Mar 17 15:24:54 2008
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Mon, 17 Mar 2008 15:24:54 -0500
Subject: [Swift-devel] Re: swift-falkon problem
In-Reply-To: <47DED24B.8070900@mcs.anl.gov>
References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov>
Message-ID: <47DED396.4060304@uchicago.edu>

Hi,

The attachment is the BGexec source code to use with both non-swift and swift. To use it, first compile it with "gcc -o BGexec BGexec.c", then invoke it with "./BGexec 127.0.0.1 55000 55001 -debug swift". The last option indicates that we are running BGexec with swift; if not, simply change it to "no".

zhao

Michael Wilde wrote:
> Sorry - another mis-diagnosis and incorrect conclusion on my part.
>
> Zhao just told me that we have out of date falkon worker code on the
> sicortex that is not chdir'ing to the cwd arg of the falkon request.
>
> That explains what I'm seeing. It's being fixed now and checked in.
>
> --
>
> To answer your questions though:
>
> I'm running swift on a linux box bblogin.mcs.anl.gov
>
> It mounts the sicortex under /sicortex-homes
>
> I run swift from /sicortex-homes/wilde/amiga/run
>
> My sites file says:
>
> url="http://140.221.37.30:50001/wsrf/services/GenericPortal/core/WS/GPFactoryService"/>
>
> /home/wilde/swiftwork
>
> and /home/wilde/swiftwork on bblogin is a symlink to
> /sicortex-homes/wilde/swiftwork
>
> so that when swift writes files to the sicortex dir (e.g. when it
> creates shared/*) it's using the same pathname that the worker-side
> will use when the job runs. I.e., even though the mount-points differ
> between the swift host and the worker host, symlinks make the workdir
> appear under the same name on both sides.
>
> If NFS adheres to its close-to-open-coherence semantics, this then
> should I think work.
>
> My scp-provider question is probably still worth answering and trying
> if this doesn't work.
> > - Mike > > > > > On 3/17/08 2:57 PM, Ben Clifford wrote: >> what does your filesystem layout look like? >> >> Where are you running swift? And where are you putting your scicortex >> site directory? On an NFS that is also accessible from your submit >> machine? If so, what path? >> > -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: BGexec.c URL: From iraicu at cs.uchicago.edu Mon Mar 17 15:26:39 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Mon, 17 Mar 2008 15:26:39 -0500 Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <47DED396.4060304@uchicago.edu> References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DED396.4060304@uchicago.edu> Message-ID: <47DED3FF.8040105@cs.uchicago.edu> What does the last option "swift" really do? Is that the chdir? If yes, then it should be a default behavior (rather than an option) as long as the directory field is specified. Do you have SVN permissions to commit changes? If yes, you should commit this. If not, I'll commit it, and you need to get the permissions to commit code to the Falkon SVN (I'll work on this). Ioan Zhao Zhang wrote: > Hi, > > The attachment is the BGexec source code to use with both non-swift > and swift. To use it, first compile it like "gcc -o BGexec BGexec.c", > run invoke it, run "./BGexec 127.0.0.1 55000 55001 -debug swift" the > last option is to indicate that we are running BGexec with swift, if > not simply change it to no. > > zhao > > Michael Wilde wrote: >> Sorry - another mis-diagnosis and incorrect conclusion on my part. >> >> Zhao just told me that we have out of date falkon worker code on the >> sicortex that is not chdir'ing to the cwd arg of the falkon request. >> >> That explains what Im seeing. Its being fixed now and checked in. 
>> >> -- >> >> To answer your questions though: >> >> Im running swift on a linux box bblogin.mcs.anl.gov >> >> It mounts the sicortex under /sicortex-homes >> >> I run swift from /sicortex-homes/wilde/amiga/run >> >> My sites file says: >> >> >> >> > >> url="http://140.221.37.30:50001/wsrf/services/GenericPortal/core/WS/GPFactoryService"/> >> >> /home/wilde/swiftwork >> >> >> and /home/wilde/swiftwork on bblogin is a symlink to >> /sicortex-homes/wilde/swiftwork >> >> so that when swift writes files to the sicortex dir (eg when it >> creates shared/*) its using the same pathname that the worker-side >> will use when the job runs. Ie, even though the mount-points differ >> between the swift host and the worker host, symlinks make the workdir >> appear under same name on both sides. >> >> If NFS adheres to its close-to-open-coherence semantics, this then >> should I think work. >> >> My scp-provider question is probably still worth answering and trying >> if this doesnt work. >> >> - Mike >> >> >> >> >> On 3/17/08 2:57 PM, Ben Clifford wrote: >>> what does your filesystem layout look like? >>> >>> Where are you running swift? And where are you putting your >>> scicortex site directory? On an NFS that is also accessible from >>> your submit machine? If so, what path? >>> >> -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 
58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== From zhaozhang at uchicago.edu Mon Mar 17 15:34:06 2008 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Mon, 17 Mar 2008 15:34:06 -0500 Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <47DED3FF.8040105@cs.uchicago.edu> References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DED396.4060304@uchicago.edu> <47DED3FF.8040105@cs.uchicago.edu> Message-ID: <47DED5BE.50204@uchicago.edu> yep, the swift option only cares about "chdir". I am not sure that I have the SVN permissions to commit. That will be great, if you commit this to the SVN. :-) zhao Ioan Raicu wrote: > What does the last option "swift" really do? Is that the chdir? If > yes, then it should be a default behavior (rather than an option) as > long as the directory field is specified. Do you have SVN permissions > to commit changes? If yes, you should commit this. If not, I'll > commit it, and you need to get the permissions to commit code to the > Falkon SVN (I'll work on this). > > Ioan > > Zhao Zhang wrote: >> Hi, >> >> The attachment is the BGexec source code to use with both non-swift >> and swift. To use it, first compile it like "gcc -o BGexec BGexec.c", >> run invoke it, run "./BGexec 127.0.0.1 55000 55001 -debug swift" the >> last option is to indicate that we are running BGexec with swift, if >> not simply change it to no. >> >> zhao >> >> Michael Wilde wrote: >>> Sorry - another mis-diagnosis and incorrect conclusion on my part. 
>>> >>> Zhao just told me that we have out of date falkon worker code on the >>> sicortex that is not chdir'ing to the cwd arg of the falkon request. >>> >>> That explains what Im seeing. Its being fixed now and checked in. >>> >>> -- >>> >>> To answer your questions though: >>> >>> Im running swift on a linux box bblogin.mcs.anl.gov >>> >>> It mounts the sicortex under /sicortex-homes >>> >>> I run swift from /sicortex-homes/wilde/amiga/run >>> >>> My sites file says: >>> >>> >>> >>> >> >>> url="http://140.221.37.30:50001/wsrf/services/GenericPortal/core/WS/GPFactoryService"/> >>> >>> /home/wilde/swiftwork >>> >>> >>> and /home/wilde/swiftwork on bblogin is a symlink to >>> /sicortex-homes/wilde/swiftwork >>> >>> so that when swift writes files to the sicortex dir (eg when it >>> creates shared/*) its using the same pathname that the worker-side >>> will use when the job runs. Ie, even though the mount-points differ >>> between the swift host and the worker host, symlinks make the >>> workdir appear under same name on both sides. >>> >>> If NFS adheres to its close-to-open-coherence semantics, this then >>> should I think work. >>> >>> My scp-provider question is probably still worth answering and >>> trying if this doesnt work. >>> >>> - Mike >>> >>> >>> >>> >>> On 3/17/08 2:57 PM, Ben Clifford wrote: >>>> what does your filesystem layout look like? >>>> >>>> Where are you running swift? And where are you putting your >>>> scicortex site directory? On an NFS that is also accessible from >>>> your submit machine? If so, what path? >>>> >>> > From benc at hawaga.org.uk Mon Mar 17 16:13:29 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 17 Mar 2008 21:13:29 +0000 (GMT) Subject: [Swift-devel] Swift v0.4 released Message-ID: Swift 0.4 is released. 
You can download it from http://www.ci.uchicago.edu/swift/downloads/ In addition, there are a few pages of release notes detailing the substantial changes since v0.3 here: http://www.ci.uchicago.edu/swift/packages/release-notes-0.4.txt -- From benc at hawaga.org.uk Mon Mar 17 17:01:53 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 17 Mar 2008 22:01:53 +0000 (GMT) Subject: [Swift-devel] google summer of code Message-ID: The Globus Alliance was accepted as a Google summer of code mentor organization. Under that umbrella, interested students can work on Swift-related projects. See http://dev.globus.org/wiki/Google_Summer_of_Code_2008_Ideas for more information - there are a few Swift-related projects listed there, but Google encourage students to also come up with their own. -- From hategan at mcs.anl.gov Mon Mar 17 17:27:52 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 17 Mar 2008 17:27:52 -0500 Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: References: <47DDBFD7.2050700@mcs.anl.gov> <47DE6C38.40802@cs.uchicago.edu> Message-ID: <1205792872.16095.3.camel@blabla.mcs.anl.gov> On Mon, 2008-03-17 at 19:16 +0000, Ben Clifford wrote: > On Mon, 17 Mar 2008, Ioan Raicu wrote: > > > > Failed to transfer wrapper log from amps1-20080316-1643-g4n8t252/info/0/sico > > > I don't understand this error, how is this error text being generated? Falkon > > only returns back a numeric exit code. Could this error be a post-processing > > error when Swift couldn't manipulate the local file system, or it couldn't > > find some expected files? What exit code does Falkon return for this task, > > 0, or something else? > > > That bit is a follow-on error, so pretty much ignore it - I haven't > figured out what the right thing to do is for presenting it to the user. > Basically: > > 1. swift tries to run an executable (using provider-deef, in this case) > 2. run of executable fails (i.e. provider-deef is passing back an error) > 3.
swift tries to stage back the wrapper log to help diagnosis. Should it perhaps be maybe(transfer(wrapper_log)) instead of transfer(wrapper_log)? > 4. the wrapper log doesn't exist (presumably the wrapper never executed > that far in the failed executable step 1) > 5. swift reports the above error/warning that step 3 failed. > > So that line is an error that is a follow-on from step 1 failing. > > > We need to figure out if the failure is in executing the tasks in Falkon, or > > if that is OK, and the error is in Swift not finding some files afterwards. > > Provider-deef is reporting that execution failed. So its not the second > one of those. But there isn't enough log information in mike's report to > indicate where below the swift/provider-deef interface the error is > occurring. > > The present swift+falkon joint deployment code still seems screwy enough > to not merge in the provider-deef log4j command, which is annoying. I'll > have a look at that. > From hategan at mcs.anl.gov Mon Mar 17 17:32:00 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 17 Mar 2008 17:32:00 -0500 Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <47DECC42.4020500@mcs.anl.gov> References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> Message-ID: <1205793120.16095.7.camel@blabla.mcs.anl.gov> On Mon, 2008-03-17 at 14:53 -0500, Michael Wilde wrote: > OK, thanks, will do. > > My earlier message that "it worked" was premature - my sites file was > doing local execution. > > I pushed forward past a few other errors and am now stuck as follows. > > As far as I can tell Im now getting the "NFS not syncing" problem. > > Swift creates the working dir for the workflow as a local file reference > to an NFS-mounted directory. When swift tells falkon to run > shared/wrapper.sh, its not there yet. When I look after the workflow > has failed, it is indeed there.
> > What I'd rather do here is tell swift to use scp rather than > direct-file-access as the data provider. Do you know how to do that? >From one of the working i2u2 sites.xml files: /sandbox/quarkcat/tmp You'll probably need to configure ~/.ssh/auth.defaults: www11.i2u2.org.type=key www11.i2u2.org.username=hategan www11.i2u2.org.key=/home/mike/.ssh/i2u2portal www11.i2u2.org.passphrase=... > Or > are there any other data transports to consider? > > (One alternative is to get gridftp running on sico). > > > > On 3/17/08 2:19 PM, Ben Clifford wrote: > > On Mon, 17 Mar 2008, Ioan Raicu wrote: > > > >> iraicu at viper:~/java/svn/cog/modules/provider-deef/etc> cat > >> log4j.properties.module > >> log4j.logger.org.apache.axis.utils.JavaUtils=ERROR > >> log4j.logger.org.globus.cog.abstraction.impl.execution.deef=DEBUG > >> > >> I have this log4j property, do you have this enabled? This should > >> enable more debug output from the Falkon provider. > > > > If deploying swift with this command: ant -Dwith-provider-deef redist > > then it looks like those lines don't get merged in. Mike, add those lines > > yourself to your dist/swift-0.3-dev/etc/log4j.properties file. > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From benc at hawaga.org.uk Mon Mar 17 20:48:19 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 18 Mar 2008 01:48:19 +0000 (GMT) Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <1205793120.16095.7.camel@blabla.mcs.anl.gov> References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <1205793120.16095.7.camel@blabla.mcs.anl.gov> Message-ID: On Mon, 17 Mar 2008, Mihael Hategan wrote: > You'll probably need to configure ~/.ssh/auth.defaults: > www11.i2u2.org.passphrase=... ick passwords in config files. 
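Ben's objection to plaintext passphrases in config files is the standard one, and a common mitigation (short of moving to key agents) is to refuse to load a credentials file with permissive modes, the way OpenSSH rejects group/world-readable private keys. A hedged sketch in Python — `load_auth_defaults` and its behavior are illustrative only, not the actual cog ssh provider:

```python
import os
import stat

def load_auth_defaults(path):
    """Load key=value pairs from an auth.defaults-style file,
    refusing group- or world-accessible files.

    Illustrative sketch: mirrors OpenSSH's permission check on
    private keys; not the real cog/Swift ssh provider code.
    """
    mode = os.stat(path).st_mode
    if mode & (stat.S_IRWXG | stat.S_IRWXO):
        raise PermissionError(
            "%s must be accessible only by its owner (chmod 600)" % path)
    props = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blanks and comments
            key, _, value = line.partition("=")
            props[key.strip()] = value.strip()
    return props
```

This does not remove the "ick" — the passphrase is still on disk — but it at least fails loudly when the file is readable by anyone else.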
-- From benc at hawaga.org.uk Mon Mar 17 20:57:50 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 18 Mar 2008 01:57:50 +0000 (GMT) Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <1205792872.16095.3.camel@blabla.mcs.anl.gov> References: <47DDBFD7.2050700@mcs.anl.gov> <47DE6C38.40802@cs.uchicago.edu> <1205792872.16095.3.camel@blabla.mcs.anl.gov> Message-ID: On Mon, 17 Mar 2008, Mihael Hategan wrote: > Should it perhaps be maybe(transfer(wrapper_log)) instead of > transfer(wrapper_log)? In other circumstances, though, people get upset that their job status files didn't get transferred back but that there was no visible error. This same UI conflict occurs with kickstart if that is turned on, I think - failed jobs cause visible kickstart transfer errors. Its made more visible here because wrapper logs are always generated. -- From benc at hawaga.org.uk Tue Mar 18 01:27:46 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 18 Mar 2008 06:27:46 +0000 (GMT) Subject: [Swift-devel] install directory change. read this - it will break your build. Message-ID: If you are an SVN user (rather than using mainline Swift releases downloaded in .tar.gz form), read the following and obey the one instruction it contains. The one instruction is: delete your cog/modules/vdsk/dist/ directory Further information about this one instruction which you must obey follows. You do not have to read this: I just committed a change to make the SVN version of swift be 'svn' rather than 0.3-dev. That means that swift will now build to: dist/vdsk-svn instead of dist/vdsk-0.3-dev When you do an svn update and an ant redist, subsequent builds will go into the above directory. However, if you already have a dist/vdsk-0.3-dev directory in place, it will be left there, with the previous version of swift there. This is almost definitely undesirable for you and you should delete that directory. My simplest advice is to remove the entire dist/ directory before making a rebuild.
If you do not delete this directory, you will almost definitely accidentally leave paths pointing at the old build directory, and you will therefore almost definitely experience confusion later on when new functionality and bugs do not appear, and old bugs do not disappear. All of the above references to 'almost definitely' come from experience the last times we've bumped the version number; they will almost definitely cause you trouble if you do not read and act on this mail. I've changed the in-SVN version to 'svn' on the basis that there is enough other information about SVN version numbers to render the '0.n-dev' string pretty useless and no longer worth the above-mentioned trouble every time a release is made. -- From benc at hawaga.org.uk Tue Mar 18 02:18:26 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 18 Mar 2008 07:18:26 +0000 (GMT) Subject: ssh provider doc (was Re: [Swift-devel] Re: swift-falkon problem) In-Reply-To: <1205793120.16095.7.camel@blabla.mcs.anl.gov> References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <1205793120.16095.7.camel@blabla.mcs.anl.gov> Message-ID: I've put basically the below into the userguide in the sites.xml configuration section, alongside notes about using the other providers. On Mon, 17 Mar 2008, Mihael Hategan wrote: > >From one of the working i2u2 sites.xml files: > > > > /sandbox/quarkcat/tmp > > > You'll probably need to configure ~/.ssh/auth.defaults: > www11.i2u2.org.type=key > www11.i2u2.org.username=hategan > www11.i2u2.org.key=/home/mike/.ssh/i2u2portal > www11.i2u2.org.passphrase=...
-- From wilde at mcs.anl.gov Tue Mar 18 09:05:39 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 18 Mar 2008 09:05:39 -0500 Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <47DED24B.8070900@mcs.anl.gov> References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> Message-ID: <47DFCC33.30803@mcs.anl.gov> Moving forward on this: Zhao's update to the falkon worker agent "bgexec" fixed the problem of not finding wrapper.sh on the worker node. With the new bgexec in place, the workflow ran successfully for runs of 1 job and 25 jobs. In a run of 100 jobs I start to see problems: - 89 of 100 jobs produced output data files on shared/ - 89 info files, 60 success files - 29 output files made it back to the swift run directory (amdi.*) All the logs and the server-side runtime directory are on the CI FS at ~benc/swift-logs/wilde/run313 I am debugging this, but if you could take a look Ben that would be great. I will test the jobs locally to ensure that all 100 parameters yield successful output. But the app - a shell around a C program - should yield a zero-length file when the job fails and a single decimal number when it succeeds. This is still running with locally mounted NFS for data access. I will try the ssh approach after I rule out problems in my app. After mis-judging the previous problem as an NFS coherence issue, I dont want to be hasty in prejudging this one. - Mike On 3/17/08 3:19 PM, Michael Wilde wrote: > Sorry - another mis-diagnosis and incorrect conclusion on my part. > > Zhao just told me that we have out of date falkon worker code on the > sicortex that is not chdir'ing to the cwd arg of the falkon request. > > That explains what Im seeing. Its being fixed now and checked in. 
> > -- > > To answer your questions though: > > Im running swift on a linux box bblogin.mcs.anl.gov > > It mounts the sicortex under /sicortex-homes > > I run swift from /sicortex-homes/wilde/amiga/run > > My sites file says: > > > > > url="http://140.221.37.30:50001/wsrf/services/GenericPortal/core/WS/GPFactoryService"/> > > /home/wilde/swiftwork > > > and /home/wilde/swiftwork on bblogin is a symlink to > /sicortex-homes/wilde/swiftwork > > so that when swift writes files to the sicortex dir (eg when it creates > shared/*) its using the same pathname that the worker-side will use when > the job runs. Ie, even though the mount-points differ between the swift > host and the worker host, symlinks make the workdir appear under same > name on both sides. > > If NFS adheres to its close-to-open-coherence semantics, this then > should I think work. > > My scp-provider question is probably still worth answering and trying if > this doesnt work. > > - Mike > > > > > On 3/17/08 2:57 PM, Ben Clifford wrote: >> what does your filesystem layout look like? >> >> Where are you running swift? And where are you putting your scicortex >> site directory? On an NFS that is also accessible from your submit >> machine? If so, what path? >> > From zhaozhang at uchicago.edu Tue Mar 18 12:00:26 2008 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Tue, 18 Mar 2008 12:00:26 -0500 Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <47DFCC33.30803@mcs.anl.gov> References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> Message-ID: <47DFF52A.4050103@uchicago.edu> Hi, Mike Are you running runam4? I think there is a range of the variable we chose, so in my test, I made sure, each input has an output. Not all input data will have results. 
zhao Michael Wilde wrote: > Moving forward on this: > > Zhao's update to the falkon worker agent "bgexec" fixed the problem of > not finding wrapper.sh on the worker node. > > With the new bgexec in place, the workflow ran successfully for runs > of 1 job and 25 jobs. > > In a run of 100 jobs I start to see problems: > > - 89 of 100 jobs produced output data files on shared/ > - 89 info files, 60 success files > - 29 output files made it back to the swift run directory > (amdi.*) > > All the logs and the server-side runtime directory are on the CI FS at > ~benc/swift-logs/wilde/run313 > > I am debugging this, but if you could take a look Ben that would be > great. > > I will test the jobs locally to ensure that all 100 parameters yield > successful output. But the app - a shell around a C program - should > yield a zero-length file when the job fails and a single decimal > number when it succeeds. > > This is still running with locally mounted NFS for data access. > I will try the ssh approach after I rule out problems in my app. > > After mis-judging the previous problem as an NFS coherence issue, I > dont want to be hasty in prejudging this one. > > - Mike > > > > On 3/17/08 3:19 PM, Michael Wilde wrote: >> Sorry - another mis-diagnosis and incorrect conclusion on my part. >> >> Zhao just told me that we have out of date falkon worker code on the >> sicortex that is not chdir'ing to the cwd arg of the falkon request. >> >> That explains what Im seeing. Its being fixed now and checked in. 
>> >> -- >> >> To answer your questions though: >> >> Im running swift on a linux box bblogin.mcs.anl.gov >> >> It mounts the sicortex under /sicortex-homes >> >> I run swift from /sicortex-homes/wilde/amiga/run >> >> My sites file says: >> >> >> >> > >> url="http://140.221.37.30:50001/wsrf/services/GenericPortal/core/WS/GPFactoryService"/> >> >> /home/wilde/swiftwork >> >> >> and /home/wilde/swiftwork on bblogin is a symlink to >> /sicortex-homes/wilde/swiftwork >> >> so that when swift writes files to the sicortex dir (eg when it >> creates shared/*) its using the same pathname that the worker-side >> will use when the job runs. Ie, even though the mount-points differ >> between the swift host and the worker host, symlinks make the workdir >> appear under same name on both sides. >> >> If NFS adheres to its close-to-open-coherence semantics, this then >> should I think work. >> >> My scp-provider question is probably still worth answering and trying >> if this doesnt work. >> >> - Mike >> >> >> >> >> On 3/17/08 2:57 PM, Ben Clifford wrote: >>> what does your filesystem layout look like? >>> >>> Where are you running swift? And where are you putting your >>> scicortex site directory? On an NFS that is also accessible from >>> your submit machine? If so, what path? >>> >> > From benc at hawaga.org.uk Tue Mar 18 13:52:04 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 18 Mar 2008 18:52:04 +0000 (GMT) Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <47DFCC33.30803@mcs.anl.gov> References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> Message-ID: On Tue, 18 Mar 2008, Michael Wilde wrote: > I will test the jobs locally to ensure that all 100 parameters yield > successful output. But the app - a shell around a C program - should yield a > zero-length file when the job fails and a single decimal number when it > succeeds. 
yes, please run exactly the same 100 parameter SwiftScript with the local provider. ideally twice or three times. -- From benc at hawaga.org.uk Tue Mar 18 13:55:37 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 18 Mar 2008 18:55:37 +0000 (GMT) Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> Message-ID: if you want a more tested simple-but-large-numbers of jobs test app, get SwiftApps/badmonkey/ from the SVN. that's what I use. -- From benc at hawaga.org.uk Tue Mar 18 15:57:37 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 18 Mar 2008 20:57:37 +0000 (GMT) Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> Message-ID: I picked the first failed job in the log you sent. Job id 2qbcdypi. I assume that your submit host and the various machines involved have properly synchronised clocks, but I have not checked this beyond seeing that the machine I am logged into has the same time as my laptop. I have labelled the times taken from different system clocks with lettered clock domains just in case they are different. For this job, its running in thread 0-1-88. The karajan level job submission goes through these states (in clock domain A): 23:14:08,196-0600 Submitting 23:14:08,204-0600 Submitted 23:14:14,121-0600 Active 23:14:14,121-0600 Completed Note that the last two - Active and Completed - are the same (within a millisecond). At 23:14:14,189-0600 Swift checks the job status and finds that the success file does not exist.
(This timestamp is in clock domain A) So now I look for the status file myself on the fd filesystem: $ ls --full-time /home/wilde/swiftwork/amps1-20080317-2314-nvn8x1p2/status/2/runam3-2pbcdypi-success -rw-r--r-- 1 wilde mcsz 0 2008-03-18 00:14:19.202382966 -0500 /home/wilde/swiftwork/amps1-20080317-2314-nvn8x1p2/status/2/runam3-2pbcdypi-success (this is in clock domain B) And see that the file does exist, but its timestamp is a full 5 seconds after the job was reported as successful by provider-deef. So now we can look in the info/ directory (next to the status directory) and get run timestamps of the jobs. According to the info log, the job begins running (in clock domain B again) at: 00:14:14.065373000-0500 which corresponds to within about 60ms of the time that provider-deef reported the job as active. However, the execution according to the wrapper log shows that the job did not finish executing until 00:14:19.233438000-0500 (which is when the status file is approximately timestamped). My off-the-cuff hypothesis is, based on the above, that somewhere in provider-deef or below, the execution system is reporting a job as completed as soon as it starts executing, rather than when it actually finishes executing; and that successes with small numbers of jobs have been a race condition that would disappear if those small jobs took a substantially longer time to execute (eg if they had a sleep 30s in them). -- From iraicu at cs.uchicago.edu Tue Mar 18 16:36:21 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Tue, 18 Mar 2008 16:36:21 -0500 Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> Message-ID: <47E035D5.6000804@cs.uchicago.edu> I would say that for Falkon to send a successful exit code at the start of the execution is impossible (unless its a bug that I have never seen before)...
it could certainly send a failed exit code before the task even starts under certain conditions, but if an exit code of 0 is received at Swift, I would say that the task executed on the remote resource, and an exit code 0 was propagated back to Swift. Could a latency of NFS in which one node creates a file/dir and another node requires xxx time (in this case, 5 sec) before it actually sees the file, explain what Mike is seeing? If this is a likely explanation, then the race condition is that the exit code goes from worker to Falkon service to Swift faster than NFS can update its file/dir list, and when Swift checks for the file or dir (probably within 10s of milliseconds) of the job completion, it can't find the file/dir. Are there any counterarguments that would make this hypothesis not possible? Just another hypothesis which might be worth investigating. Ioan Ben Clifford wrote: > I picked the first failed job in the log oyu sent. Job id 2qbcdypi. > > I assume that your submit host and the various machines involved have > properly synchronised clocks, but I have not checked this beyond seeing > that the machine I am logged into has the same time as my laptop. I have > labelled the times taken from different system clocks with lettered clock > domains just in case they are different. > > For this job, its running in thread 0-1-88. > The karajan level job submission goes through these states (in clock > domain A) > 23:14:08,196-0600 Submitting > 23:14:08,204-0600 Submitted > 23:14:14,121-0600 Active > 23:14:14,121-0600 Completed > > Note that the last two - Active and Completed - are the same (within a > millisecond) > > At 23:14:14,189-0600 Swift checks the job status and finds the success > file is not found. 
(This timestamp is in clock domain A) > > So now I look at for the status file myself on the fd filesystem: > > $ ls --full-time > /home/wilde/swiftwork/amps1-20080317-2314-nvn8x1p2/status/2/runam3-2pbcdypi-success > > -rw-r--r-- 1 wilde mcsz 0 2008-03-18 00:14:19.202382966 -0500 > /home/wilde/swiftwork/amps1-20080317-2314-nvn8x1p2/status/2/runam3-2pbcdypi-success > > (this is in clock domain B) > > And see that the file does exist but is a full 5 seconds after the job was > reported as successful by provider-deef. > > So now we can look in the info/ directory (next to the status directory) > and get run time stamps or the jobs. > > According to the info log, the job begins running at: (in clock domain B > again) at: > > 00:14:14.065373000-0500 > > which corresponds within about 60ms of the time that provider-deef > reported the job as active. > However, the execution according to the wrapper log shows that the job did > not finish executing until > > 00:14:19.233438000-0500 > > (which is when the status file is approximately timestamped). > > My off-the-cuff hypothesis is, based on the above, that soemwhere in > provider-deef or below, the execution system is reporting a job as > completed as soon as it starts executing, rather than when it actually > finishes executing; and that successes with small numbers of jobs have > been a race condition that would disappear if those small jobs took a > substantially longer time to execute (eg if they had a sleep 30s in them). > > -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 
58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== From benc at hawaga.org.uk Tue Mar 18 16:45:56 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 18 Mar 2008 21:45:56 +0000 (GMT) Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <47E035D5.6000804@cs.uchicago.edu> References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E035D5.6000804@cs.uchicago.edu> Message-ID: On Tue, 18 Mar 2008, Ioan Raicu wrote: > Could a latency of NFS in which one node creates a > file/dir and another node requires xxx time (in this case, 5 sec) before it > actually sees the file, explain what Mike is seeing? If this is a likely > explanation, then the race condition is that the exit code goes from worker to > Falkon service to Swift faster than NFS can update its file/dir list, and when > Swift checks for the file or dir (probably within 10s of milliseconds) of the > job completion, it can't find the file/dir. Are there any counterarguments > that would make this hypothesis not possible? Just another hypothesis which > might be worth investigating. > According to the timing in the log file, Swift is getting a notification from provider-deef that the job completed before the actual job has even been run to completion on the worker, well before the wrapper even attempts to write out a status file. I'm not accusing this of being a problem inside Falkon - I'm saying I think its happening somewhere below the Swift layer, so it could well be provider-deef, which is probably the most neglected part of this whole stack. 
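The cross-clock-domain arithmetic in Ben's analysis can be checked mechanically, because both timestamp formats carry numeric UTC offsets, so values from the two clock domains compare correctly once parsed as timezone-aware datetimes. A sketch (timestamps copied from the quoted log excerpts; the helper names are mine, not Swift's):

```python
from datetime import datetime

def parse_swift_ts(day, ts):
    # Karajan log stamps look like '23:14:14,121-0600'; `day` supplies
    # the date portion, which that log line format omits.
    return datetime.strptime(day + " " + ts, "%Y-%m-%d %H:%M:%S,%f%z")

def parse_ls_full_time(s):
    # `ls --full-time` prints nanoseconds; %f only accepts microseconds,
    # so truncate the fractional seconds to 6 digits before parsing.
    date, clock, tz = s.split()
    return datetime.strptime(date + " " + clock[:15] + " " + tz,
                             "%Y-%m-%d %H:%M:%S.%f %z")

# Clock domain A: provider-deef reports the job Completed.
completed = parse_swift_ts("2008-03-17", "23:14:14,121-0600")
# Clock domain B: mtime of the success file on the shared filesystem.
status_mtime = parse_ls_full_time("2008-03-18 00:14:19.202382966 -0500")
# Positive gap means the status file appeared AFTER the reported completion.
gap = (status_mtime - completed).total_seconds()
```

With these inputs the gap comes out to about 5.08 seconds, matching the "full 5 seconds" observation even though the two stamps were taken in different clock domains.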
Mike, are you running with those extra debug lines in the log4j configuration? If not, please run again with them turned on. Also Ioan can probably recommend which Falkon logs to keep so we can see what's happening for a job there and approach the problem from the other end of the stack too. -- From wilde at mcs.anl.gov Tue Mar 18 17:12:12 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 18 Mar 2008 17:12:12 -0500 Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E035D5.6000804@cs.uchicago.edu> Message-ID: <47E03E3C.6060905@mcs.anl.gov> I will rerun with log4j settings. Will also try adding the sleep suggested earlier - to see if all jobs then fail. I did re-run the workflow 3X on local, and each time all 100 jobs finished successfully. Also for this dataset, all jobs return data. - Mike On 3/18/08 4:45 PM, Ben Clifford wrote: > On Tue, 18 Mar 2008, Ioan Raicu wrote: > >> Could a latency of NFS in which one node creates a >> file/dir and another node requires xxx time (in this case, 5 sec) before it >> actually sees the file, explain what Mike is seeing? If this is a likely >> explanation, then the race condition is that the exit code goes from worker to >> Falkon service to Swift faster than NFS can update its file/dir list, and when >> Swift checks for the file or dir (probably within 10s of milliseconds) of the >> job completion, it can't find the file/dir. Are there any counterarguments >> that would make this hypothesis not possible? Just another hypothesis which >> might be worth investigating. >> > > According to the timing in the log file, Swift is getting a notification > from provider-deef that the job completed before the actual job has even > been run to completion on the worker, well before the wrapper even > attempts to write out a status file. 
> > I'm not accusing this of being a problem inside Falkon - I'm saying I > think its happening somewhere below the Swift layer, so it could well be > provider-deef, which is probably the most neglected part of this whole > stack. > > Mike, are you running with those extra debug lines in the log4j > configuration? If not, please run again with them turned on. Also Ioan can > probably recommend which Falkon logs to keep so we can see what's > happening for a job there and approach the problem from the other end of > the stack too. > > From iraicu at cs.uchicago.edu Tue Mar 18 17:20:26 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Tue, 18 Mar 2008 17:20:26 -0500 Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E035D5.6000804@cs.uchicago.edu> Message-ID: <47E0402A.8010303@cs.uchicago.edu> The clocks on the two machines that Mike was running on seem to be in sync (less than 1 sec off).
iraicu at bblogin:~/java/svn/falkon$ date Tue Mar 18 17:10:15 CDT 2008 iraicu at scx-m23n6 ~/java/svn/falkon/worker/temp $ date Tue Mar 18 17:10:15 CDT 2008 Mike, here are the logs you need to make sure you capture when running in debug mode: iraicu at viper:~/java/svn/falkon/config> cat Falkon-TCPCore.config GenericPortalWS=falkon_task_submission_history.txt GenericPortalWS_perf_per_sec=falkon_summary.txt GenericPortalWS_taskPerf=falkon_task_perf.txt GenericPortalWS_task=falkon_task_status.txt When running in normal mode (when we know things work fine), we just need iraicu at viper:~/java/svn/falkon/config> cat Falkon-TCPCore.config GenericPortalWS_perf_per_sec=falkon_summary.txt GenericPortalWS_taskPerf=falkon_task_perf.txt In the event that we can't figure out things from the Swift and Falkon service logs, we might have to enable worker side logs as well, which you do from the run.worker-c.sh (or run.worker-c-ram.sh) script(s). Its also possible that the Falkon provider code is doing something funny, but I'd want to see the Falkon logs before we focus on the provider. Ioan Ben Clifford wrote: > On Tue, 18 Mar 2008, Ioan Raicu wrote: > > >> Could a latency of NFS in which one node creates a >> file/dir and another node requires xxx time (in this case, 5 sec) before it >> actually sees the file, explain what Mike is seeing? If this is a likely >> explanation, then the race condition is that the exit code goes from worker to >> Falkon service to Swift faster than NFS can update its file/dir list, and when >> Swift checks for the file or dir (probably within 10s of milliseconds) of the >> job completion, it can't find the file/dir. Are there any counterarguments >> that would make this hypothesis not possible? Just another hypothesis which >> might be worth investigating. 
>> >> > > According to the timing in the log file, Swift is getting a notification > from provider-deef that the job completed before the actual job has even > been run to completion on the worker, well before the wrapper even > attempts to write out a status file. > > I'm not accusing this of being a problem inside Falkon - I'm saying I > think its happening somewhere below the Swift layer, so it could well be > provider-deef, which is probably the most neglected part of this whole > stack. > > Mike, are you running with those extra debug lines in the log4j > configuration? If not, please run again with them turned on. Also Ioan can > probably recommend which Falkon logs to keep so we can see what's > happening for a job there and approach the problem from the other end of > the stack too. > > > -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== From wilde at mcs.anl.gov Tue Mar 18 17:29:01 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 18 Mar 2008 17:29:01 -0500 Subject: [Swift-devel] Why build prompts in redist? Message-ID: <47E0422D.9000707@mcs.anl.gov> When doing an ant redist, I get: dist.dir.warning: ====================================================================================== [input] Warning! The specified target directory (/radix-homes01/wilde/swift/src/cog/modules/vdsk/../..//modules/vdsk/dist/vdsk-svn) does not seem to contain a Swift build. 
[input] Press Return to continue with the build or CTRL+C to abort... [input] ======================================================================================

Is this check really useful? It's inconvenient when you start a build and then walk away from it. You come back and it's waiting on this prompt.

From benc at hawaga.org.uk Tue Mar 18 17:37:01 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 18 Mar 2008 22:37:01 +0000 (GMT) Subject: [Swift-devel] Why build prompts in redist? In-Reply-To: <47E0422D.9000707@mcs.anl.gov> References: <47E0422D.9000707@mcs.anl.gov> Message-ID:

> Is this check really useful? It's inconvenient when you start a build and then
> walk away from it. You come back and it's waiting on this prompt.

It used to be very useful. If you are building with the instructions in provider-deef/README (note that these were updated in the last week or so), then no, it isn't useful.

-- From benc at hawaga.org.uk Tue Mar 18 20:40:12 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 19 Mar 2008 01:40:12 +0000 (GMT) Subject: [Swift-devel] Why build prompts in redist? In-Reply-To: <47E0422D.9000707@mcs.anl.gov> References: <47E0422D.9000707@mcs.anl.gov> Message-ID:

> [input] Warning! The specified target directory
> (/radix-homes01/wilde/swift/src/cog/modules/vdsk/../..//modules/vdsk/dist/vdsk-svn)
> does not seem to contain a Swift build.
> [input] Press Return to continue with the build or CTRL+C to abort...

As of r1738 (to provider-deef) this does not happen any more.
-- From hategan at mcs.anl.gov Wed Mar 19 03:25:16 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 19 Mar 2008 03:25:16 -0500 Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> Message-ID: <1205915116.21170.1.camel@blabla.mcs.anl.gov>

On Tue, 2008-03-18 at 20:57 +0000, Ben Clifford wrote:
> I picked the first failed job in the log you sent. Job id 2qbcdypi.
>
> I assume that your submit host and the various machines involved have
> properly synchronised clocks, but I have not checked this beyond seeing
> that the machine I am logged into has the same time as my laptop. I have
> labelled the times taken from different system clocks with lettered clock
> domains just in case they are different.
>
> For this job, it's running in thread 0-1-88.
> The karajan-level job submission goes through these states (in clock
> domain A):
> 23:14:08,196-0600 Submitting
> 23:14:08,204-0600 Submitted
> 23:14:14,121-0600 Active
> 23:14:14,121-0600 Completed
>
> Note that the last two - Active and Completed - are the same (within a
> millisecond)

That probably means the provider doesn't really set the active state, and it gets filled in when "completed" arrives.
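[The zero-width Active-to-Completed gap discussed above can be checked for mechanically. The sketch below is illustrative only: the timestamp format is taken from the karajan-level log excerpts in this thread, and suspicious_transition is a hypothetical helper, not part of Swift or Falkon.]

```python
from datetime import datetime

# Timestamp format as it appears in the log excerpts above,
# e.g. "23:14:14,121-0600" (time, comma, milliseconds, UTC offset).
FMT = "%H:%M:%S,%f%z"

def parse(ts):
    return datetime.strptime(ts, FMT)

def suspicious_transition(active_ts, completed_ts, threshold_ms=5):
    """Flag jobs whose Active->Completed gap is implausibly small,
    which suggests Active was never reported by the provider and was
    synthesized when Completed arrived."""
    delta = parse(completed_ts) - parse(active_ts)
    return abs(delta.total_seconds() * 1000) < threshold_ms

# The job from this thread: Active and Completed within a millisecond.
print(suspicious_transition("23:14:14,121-0600", "23:14:14,121-0600"))  # True
# Submitted -> Completed, by contrast, is several seconds apart.
print(suspicious_transition("23:14:08,204-0600", "23:14:14,121-0600"))  # False
```

[Running this over all jobs in a run log would show whether every job, or only the failed ones, has the synthesized Active state.]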
From hategan at mcs.anl.gov Wed Mar 19 03:25:50 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 19 Mar 2008 03:25:50 -0500 Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <47E035D5.6000804@cs.uchicago.edu> References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E035D5.6000804@cs.uchicago.edu> Message-ID: <1205915150.21170.3.camel@blabla.mcs.anl.gov> On Tue, 2008-03-18 at 16:36 -0500, Ioan Raicu wrote: > I would say that Falkon to send a successful exit code at the start of > the execution is impossible (unless its a bug that I have never seen > before)... :) Like any new bug? > Ioan > > > Ben Clifford wrote: > > I picked the first failed job in the log oyu sent. Job id 2qbcdypi. > > > > I assume that your submit host and the various machines involved have > > properly synchronised clocks, but I have not checked this beyond seeing > > that the machine I am logged into has the same time as my laptop. I have > > labelled the times taken from different system clocks with lettered clock > > domains just in case they are different. > > > > For this job, its running in thread 0-1-88. > > The karajan level job submission goes through these states (in clock > > domain A) > > 23:14:08,196-0600 Submitting > > 23:14:08,204-0600 Submitted > > 23:14:14,121-0600 Active > > 23:14:14,121-0600 Completed > > > > Note that the last two - Active and Completed - are the same (within a > > millisecond) > > > > At 23:14:14,189-0600 Swift checks the job status and finds the success > > file is not found. 
(This timestamp is in clock domain A)
> >
> > So now I look for the status file myself on the fd filesystem:
> >
> > $ ls --full-time
> > /home/wilde/swiftwork/amps1-20080317-2314-nvn8x1p2/status/2/runam3-2pbcdypi-success
> >
> > -rw-r--r-- 1 wilde mcsz 0 2008-03-18 00:14:19.202382966 -0500
> > /home/wilde/swiftwork/amps1-20080317-2314-nvn8x1p2/status/2/runam3-2pbcdypi-success
> >
> > (this is in clock domain B)
> >
> > And see that the file does exist but is a full 5 seconds after the job was
> > reported as successful by provider-deef.
> >
> > So now we can look in the info/ directory (next to the status directory)
> > and get run time stamps of the jobs.
> >
> > According to the info log, the job begins running (in clock domain B
> > again) at:
> >
> > 00:14:14.065373000-0500
> >
> > which corresponds within about 60ms of the time that provider-deef
> > reported the job as active.
> > However, the execution according to the wrapper log shows that the job did
> > not finish executing until
> >
> > 00:14:19.233438000-0500
> >
> > (which is when the status file is approximately timestamped).
> >
> > My off-the-cuff hypothesis is, based on the above, that somewhere in
> > provider-deef or below, the execution system is reporting a job as
> > completed as soon as it starts executing, rather than when it actually
> > finishes executing; and that successes with small numbers of jobs have
> > been a race condition that would disappear if those small jobs took a
> > substantially longer time to execute (eg if they had a sleep 30s in them).
> >
>
> --
> ===================================================
> Ioan Raicu
> Ph.D. Candidate
> ===================================================
> Distributed Systems Laboratory
> Computer Science Department
> University of Chicago
> 1100 E.
58th Street, Ryerson Hall > Chicago, IL 60637 > =================================================== > Email: iraicu at cs.uchicago.edu > Web: http://www.cs.uchicago.edu/~iraicu > http://dev.globus.org/wiki/Incubator/Falkon > http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page > =================================================== > =================================================== > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From benc at hawaga.org.uk Wed Mar 19 03:33:00 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 19 Mar 2008 08:33:00 +0000 (GMT) Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <1205915116.21170.1.camel@blabla.mcs.anl.gov> References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <1205915116.21170.1.camel@blabla.mcs.anl.gov> Message-ID: On Wed, 19 Mar 2008, Mihael Hategan wrote: > > Note that the last two - Active and Completed - are the same (within a > > millisecond) > > That probably means the provider doesn't really set the active state, > and it gets filled in when "completed" arrives. Indeed the provider doesn't set Active anywhere. But the time of the above events is still many seconds too early. -- From iraicu at cs.uchicago.edu Wed Mar 19 06:15:27 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Wed, 19 Mar 2008 06:15:27 -0500 Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <1205915116.21170.1.camel@blabla.mcs.anl.gov> References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <1205915116.21170.1.camel@blabla.mcs.anl.gov> Message-ID: <47E0F5CF.1090000@cs.uchicago.edu> Right, from what I remember, it never sets the active state. 
The jobs in question probably took less than 1 sec to execute, so seeing 8 seconds between submitted and completed looks fine to me. The fact that the timestamps on the file/dir are later than the time Falkon says the job completed is an indication that either the clocks are not in sync (bblogin and fd-login.mcs are in sync, but what about bblogin and the SiCortex compute nodes?), or NFS did not process the write operation immediately, and under the heavy load of 60 workers all writing at the same time, it took 5 seconds to complete the write operation. Mike, where are the Falkon logs, so we can see what happened from Falkon's point of view?

Ioan

Mihael Hategan wrote:
> On Tue, 2008-03-18 at 20:57 +0000, Ben Clifford wrote:
>
>> I picked the first failed job in the log you sent. Job id 2qbcdypi.
>>
>> I assume that your submit host and the various machines involved have
>> properly synchronised clocks, but I have not checked this beyond seeing
>> that the machine I am logged into has the same time as my laptop. I have
>> labelled the times taken from different system clocks with lettered clock
>> domains just in case they are different.
>>
>> For this job, it's running in thread 0-1-88.
>> The karajan-level job submission goes through these states (in clock
>> domain A):
>> 23:14:08,196-0600 Submitting
>> 23:14:08,204-0600 Submitted
>> 23:14:14,121-0600 Active
>> 23:14:14,121-0600 Completed
>>
>> Note that the last two - Active and Completed - are the same (within a
>> millisecond)
>>
>
> That probably means the provider doesn't really set the active state,
> and it gets filled in when "completed" arrives.
>
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>

-- ===================================================
Ioan Raicu
Ph.D.
Candidate
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web: http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================
===================================================

From wilde at mcs.anl.gov Wed Mar 19 10:45:50 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 19 Mar 2008 10:45:50 -0500 Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> Message-ID: <47E1352E.3070808@mcs.anl.gov>

Following up on Ben's request from the msg below:

> My off-the-cuff hypothesis is, based on the above, that somewhere in
> provider-deef or below, the execution system is reporting a job as
> completed as soon as it starts executing, rather than when it actually
> finishes executing; and that successes with small numbers of jobs have
> been a race condition that would disappear if those small jobs took a
> substantially longer time to execute (eg if they had a sleep 30s in them).

I tested the following:

run314: 100 jobs, 10 workers: all finished OK
run315: 100 jobs, 110 workers: ~80% failed
run316: 100 jobs, 110 workers, sleep 30 in the app: all finished OK

These are in ~benc/swift-logs/wilde. The workdirs are preserved on bblogin/sico - I did not copy them because you need access to the msec timestamps anyways.

I can run these several times each to get more data before we assess the hypothesis, but didn't have time yet. Let me know if that's needed.
I'm cautiously leaning a bit more to the NFS-race theory. I would like to test with scp data transfer. Am also trying to get gridftp compiled there with help from Raj. The build is failing with gpt problems; I think I need Ben or Charles on this.

- Mike

On 3/18/08 3:57 PM, Ben Clifford wrote:
> I picked the first failed job in the log you sent. Job id 2qbcdypi.
>
> I assume that your submit host and the various machines involved have
> properly synchronised clocks, but I have not checked this beyond seeing
> that the machine I am logged into has the same time as my laptop. I have
> labelled the times taken from different system clocks with lettered clock
> domains just in case they are different.
>
> For this job, it's running in thread 0-1-88.
> The karajan-level job submission goes through these states (in clock
> domain A):
> 23:14:08,196-0600 Submitting
> 23:14:08,204-0600 Submitted
> 23:14:14,121-0600 Active
> 23:14:14,121-0600 Completed
>
> Note that the last two - Active and Completed - are the same (within a
> millisecond)
>
> At 23:14:14,189-0600 Swift checks the job status and finds the success
> file is not found. (This timestamp is in clock domain A)
>
> So now I look for the status file myself on the fd filesystem:
>
> $ ls --full-time
> /home/wilde/swiftwork/amps1-20080317-2314-nvn8x1p2/status/2/runam3-2pbcdypi-success
>
> -rw-r--r-- 1 wilde mcsz 0 2008-03-18 00:14:19.202382966 -0500
> /home/wilde/swiftwork/amps1-20080317-2314-nvn8x1p2/status/2/runam3-2pbcdypi-success
>
> (this is in clock domain B)
>
> And see that the file does exist but is a full 5 seconds after the job was
> reported as successful by provider-deef.
>
> So now we can look in the info/ directory (next to the status directory)
> and get run time stamps of the jobs.
>
> According to the info log, the job begins running (in clock domain B
> again) at:
>
> 00:14:14.065373000-0500
>
> which corresponds within about 60ms of the time that provider-deef
> reported the job as active.
> However, the execution according to the wrapper log shows that the job did
> not finish executing until
>
> 00:14:19.233438000-0500
>
> (which is when the status file is approximately timestamped).
>
> My off-the-cuff hypothesis is, based on the above, that somewhere in
> provider-deef or below, the execution system is reporting a job as
> completed as soon as it starts executing, rather than when it actually
> finishes executing; and that successes with small numbers of jobs have
> been a race condition that would disappear if those small jobs took a
> substantially longer time to execute (eg if they had a sleep 30s in them).
>

From benc at hawaga.org.uk Wed Mar 19 11:31:18 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 19 Mar 2008 16:31:18 +0000 (GMT) Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <47E1352E.3070808@mcs.anl.gov> References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E1352E.3070808@mcs.anl.gov> Message-ID:

On Wed, 19 Mar 2008, Michael Wilde wrote:

> run315: 100 jobs, 110 workers: ~80% failed

Do you have the falkon logs for this run?

-- From benc at hawaga.org.uk Wed Mar 19 11:53:39 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 19 Mar 2008 16:53:39 +0000 (GMT) Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <47E1352E.3070808@mcs.anl.gov> References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E1352E.3070808@mcs.anl.gov> Message-ID:

A brief look at run315 shows much closer/overlapping times that are more indicative of something funny at the filesystem level than yesterday's logs.

In run316, where did you put the sleep? In the application code or in the wrapper script?
-- From iraicu at cs.uchicago.edu Wed Mar 19 12:09:25 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Wed, 19 Mar 2008 12:09:25 -0500 Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <47E1352E.3070808@mcs.anl.gov> References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E1352E.3070808@mcs.anl.gov> Message-ID: <47E148C5.3090600@cs.uchicago.edu> Mike, The errors that occur are after the jobs execute, and Swift looks for the successful empty file for each job (in a unique dir per job), right? If Swift were to rely solely on the exit code that Falkon returned for each job, would that not solve your immediate problem? That is not to say that a race condition might not happen elsewhere, but at least it would not happen in such a simple scenario, where you have no data dependencies. For example, if job A (running on compute node X) outputs data 1, and job B (running on compute node Y) reads data 1, and Swift submits job B within milliseconds of job A's completion, its likely that job B might not find data 1 to read. So, relying solely on Falkon exit codes could allow you to run your 100 jobs that have no data dependencies among each other just fine, and will push the race condition to workflows that do have some data dependencies between jobs. Ben, Mihael, is this feasible, to use the Falkon exit code solely to determine the success or failure of a job? 
Ioan Michael Wilde wrote: > Following up on Ben's request from the msg below: > > > My off-the-cuff hypothesis is, based on the above, that soemwhere in > > provider-deef or below, the execution system is reporting a job as > > completed as soon as it starts executing, rather than when it actually > > finishes executing; and that successes with small numbers of jobs have > > been a race condition that would disappear if those small jobs took a > > substantially longer time to execute (eg if they had a sleep 30s in > them). > > > > I tested the following: > > run314: 100 jobs, 10 workers: all finished OK > > run315: 100 jobs, 110 workers: ~80% failed > > run316: 100 jobs, 110 workers, sleep 30 in the app: all finished OK > > These are in ~benc/swift-logs/wilde. The workdirs are preserved on > bblogin/sico - I did not copy them because you need access to the msec > timestamps anyways. > > I can run these several times each to get more data before we assess > the hypothesis, but didnt have time yet. Let me know if thats needed. > > I'm cautiously leaning a bit more to the NFS-race theory. I would like > to test with scp data transfer. Am also trying to get gridftp > compiled there with help from Raj. Build is failing with gpt > problems, I think I need Ben or Charles on this. > > - Mike > > > On 3/18/08 3:57 PM, Ben Clifford wrote: >> I picked the first failed job in the log oyu sent. Job id 2qbcdypi. >> >> I assume that your submit host and the various machines involved have >> properly synchronised clocks, but I have not checked this beyond >> seeing that the machine I am logged into has the same time as my >> laptop. I have labelled the times taken from different system clocks >> with lettered clock domains just in case they are different. >> >> For this job, its running in thread 0-1-88. 
>> The karajan level job submission goes through these states (in clock >> domain A) >> 23:14:08,196-0600 Submitting >> 23:14:08,204-0600 Submitted >> 23:14:14,121-0600 Active >> 23:14:14,121-0600 Completed >> >> Note that the last two - Active and Completed - are the same (within >> a millisecond) >> >> At 23:14:14,189-0600 Swift checks the job status and finds the >> success file is not found. (This timestamp is in clock domain A) >> >> So now I look at for the status file myself on the fd filesystem: >> >> $ ls --full-time >> /home/wilde/swiftwork/amps1-20080317-2314-nvn8x1p2/status/2/runam3-2pbcdypi-success >> >> -rw-r--r-- 1 wilde mcsz 0 2008-03-18 00:14:19.202382966 -0500 >> /home/wilde/swiftwork/amps1-20080317-2314-nvn8x1p2/status/2/runam3-2pbcdypi-success >> >> >> (this is in clock domain B) >> >> And see that the file does exist but is a full 5 seconds after the >> job was reported as successful by provider-deef. >> >> So now we can look in the info/ directory (next to the status >> directory) and get run time stamps or the jobs. >> >> According to the info log, the job begins running at: (in clock >> domain B again) at: >> >> 00:14:14.065373000-0500 >> >> which corresponds within about 60ms of the time that provider-deef >> reported the job as active. >> However, the execution according to the wrapper log shows that the >> job did not finish executing until >> >> 00:14:19.233438000-0500 >> >> (which is when the status file is approximately timestamped). >> >> My off-the-cuff hypothesis is, based on the above, that soemwhere in >> provider-deef or below, the execution system is reporting a job as >> completed as soon as it starts executing, rather than when it >> actually finishes executing; and that successes with small numbers of >> jobs have been a race condition that would disappear if those small >> jobs took a substantially longer time to execute (eg if they had a >> sleep 30s in them). 
>>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>

-- ===================================================
Ioan Raicu
Ph.D. Candidate
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web: http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================
===================================================

From benc at hawaga.org.uk Wed Mar 19 12:20:39 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 19 Mar 2008 17:20:39 +0000 (GMT) Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <47E148C5.3090600@cs.uchicago.edu> References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E1352E.3070808@mcs.anl.gov> <47E148C5.3090600@cs.uchicago.edu> Message-ID:

On Wed, 19 Mar 2008, Ioan Raicu wrote:

> Ben, Mihael, is this feasible, to use the Falkon exit code solely to determine
> the success or failure of a job?

It would be overspecialising for this particular case, I think; and it doesn't solve whatever the fundamental problem is (which, having seen today's results, I now think I probably agree with you is a filesystem race / bad filesystem semantics).
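[If the status file really is subject to an NFS propagation lag, one reader-side mitigation is a bounded poll rather than a single immediate existence check. The sketch below is illustrative only: Swift's actual check lives in the Java/Karajan stack, and wait_for_status_file is a hypothetical helper, not part of any of the code discussed in this thread.]

```python
import os
import time

def wait_for_status_file(path, timeout_s=10.0, poll_s=0.2):
    """Poll for a job's success-status file instead of checking once.

    A bounded retry like this would mask a close-to-open propagation
    lag between the worker node that touches the file and the submit
    host that checks for it, at the cost of up to timeout_s extra
    latency when the job genuinely failed to produce the file.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if os.path.exists(path):
            return True
        time.sleep(poll_s)
    return os.path.exists(path)  # one last look before giving up

# A worker-side writer that creates the file a fraction of a second
# after the completion notification would be picked up by the poll
# rather than reported as a missing status file.
```

[As noted in this thread, this only hides the race; it does not fix whatever coherence problem causes the file to be invisible in the first place.]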
From benc at hawaga.org.uk Wed Mar 19 12:49:59 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 19 Mar 2008 17:49:59 +0000 (GMT) Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <47E1352E.3070808@mcs.anl.gov> References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E1352E.3070808@mcs.anl.gov> Message-ID:

On Wed, 19 Mar 2008, Michael Wilde wrote:

> I'm cautiously leaning a bit more to the NFS-race theory. I would like to test
> with scp data transfer. Am also trying to get gridftp compiled there with
> help from Raj. Build is failing with gpt problems, I think I need Ben or
> Charles on this.

If an underlying NFS race is the problem, using scp or gridftp won't cure that - it may, by virtue of adding latency, make the problem disappear most/all of the time, but that would be by virtue of slowing down access, not any actual fixing of the problem.

If you're deliberately introducing artificial delays, eg by doing the above, there are probably simpler ways (such as hacking a delay into the wrapper script after doing the touch but before exiting).

From wilde at mcs.anl.gov Wed Mar 19 13:26:47 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 19 Mar 2008 13:26:47 -0500 Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E1352E.3070808@mcs.anl.gov> Message-ID: <47E15AE7.2060609@mcs.anl.gov>

On 3/19/08 11:53 AM, Ben Clifford wrote:
> a brief look at run315 shows much closer/overlapping times that are more
> indicative of something funny at the filesystem level than yesterdays
> logs.
>
> in run316, where did you put the sleep? in the application code or in the
> wrapper script?
>

The sleep was the very last statement in the runam3-sleep30 wrapper script.
It's the executable listed in tc.data.

From wilde at mcs.anl.gov Wed Mar 19 13:46:02 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 19 Mar 2008 13:46:02 -0500 Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E1352E.3070808@mcs.anl.gov> Message-ID: <47E15F6A.5080109@mcs.anl.gov>

I was not considering scp and gridftp to introduce artificial delays. The purpose was two-fold:

1) eliminate the need to run swift on a host that mounts the sicortex filesystem, as there is no good host that does on which we can run long-term (we are temporary guests on bblogin). This was the initial reason, before we knew of any problems.

2) for dealing with this race, I thought we could avoid any possible NFS race conditions by writing directly to the filesystem. But I now realize that this won't necessarily help: the scp and gridftp *servers* would not be running on a host that locally mounts the filesystem, and the sicortex worker nodes do NFS mounts themselves.

My (likely outdated) understanding of the NFS protocol was that it's supposed to guarantee close-to-open coherence. Meaning that if two clients want to access a file sequentially, and the writing client closes the file before the reading client opens the file, then NFS is supposed to ensure that the reader correctly sees the existence and content of the file.

If others agree that this should still be the case, then it's worth looking at our code to make sure that this is the case. If it weren't, you'd think that more things would break, but perhaps Falkon exacerbates any problems in that area due to its low latency.

The race as far as I know is between the worker writing and moving result, info, and success status files, and the swift host seeing these, correct?
- Mike On 3/19/08 12:49 PM, Ben Clifford wrote: > On Wed, 19 Mar 2008, Michael Wilde wrote: > >> I'm cautiously leaning a bit more to the NFS-race theory. I would like to test >> with scp data transfer. Am also trying to get gridftp compiled there with >> help from Raj. Build is failing with gpt problems, I think I need Ben or >> Charles on this. > > If an underlying NFS race is the problem, using scp or gridftp won't cure > that - it may, by virtue of adding latency, make the problem disppear > most/all of the time, but that would be by virtue of slowing down access, > not any actual fixing of the problem. > > If you're deliberately introducing artificial delays eg by doing the > above, there are probably simpler ways (such as hacking a delay into the > wrapper script after doing the touch but before exiting) > From hategan at mcs.anl.gov Wed Mar 19 15:48:57 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 19 Mar 2008 15:48:57 -0500 Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <47E148C5.3090600@cs.uchicago.edu> References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E1352E.3070808@mcs.anl.gov> <47E148C5.3090600@cs.uchicago.edu> Message-ID: <1205959738.3410.6.camel@blabla.mcs.anl.gov> If there is a race condition, we need to find it and address it, because it applies to more than just the status file. Mihael On Wed, 2008-03-19 at 12:09 -0500, Ioan Raicu wrote: > Mike, > The errors that occur are after the jobs execute, and Swift looks for > the successful empty file for each job (in a unique dir per job), > right? If Swift were to rely solely on the exit code that Falkon > returned for each job, would that not solve your immediate problem? > That is not to say that a race condition might not happen elsewhere, but > at least it would not happen in such a simple scenario, where you have > no data dependencies. 
For example, if job A (running on compute node X) > outputs data 1, and job B (running on compute node Y) reads data 1, and > Swift submits job B within milliseconds of job A's completion, its > likely that job B might not find data 1 to read. So, relying solely on > Falkon exit codes could allow you to run your 100 jobs that have no data > dependencies among each other just fine, and will push the race > condition to workflows that do have some data dependencies between jobs. > > Ben, Mihael, is this feasible, to use the Falkon exit code solely to > determine the success or failure of a job? > > Ioan > > Michael Wilde wrote: > > Following up on Ben's request from the msg below: > > > > > My off-the-cuff hypothesis is, based on the above, that soemwhere in > > > provider-deef or below, the execution system is reporting a job as > > > completed as soon as it starts executing, rather than when it actually > > > finishes executing; and that successes with small numbers of jobs have > > > been a race condition that would disappear if those small jobs took a > > > substantially longer time to execute (eg if they had a sleep 30s in > > them). > > > > > > > I tested the following: > > > > run314: 100 jobs, 10 workers: all finished OK > > > > run315: 100 jobs, 110 workers: ~80% failed > > > > run316: 100 jobs, 110 workers, sleep 30 in the app: all finished OK > > > > These are in ~benc/swift-logs/wilde. The workdirs are preserved on > > bblogin/sico - I did not copy them because you need access to the msec > > timestamps anyways. > > > > I can run these several times each to get more data before we assess > > the hypothesis, but didnt have time yet. Let me know if thats needed. > > > > I'm cautiously leaning a bit more to the NFS-race theory. I would like > > to test with scp data transfer. Am also trying to get gridftp > > compiled there with help from Raj. Build is failing with gpt > > problems, I think I need Ben or Charles on this. 
> > > > - Mike > > > > > > On 3/18/08 3:57 PM, Ben Clifford wrote: > >> I picked the first failed job in the log oyu sent. Job id 2qbcdypi. > >> > >> I assume that your submit host and the various machines involved have > >> properly synchronised clocks, but I have not checked this beyond > >> seeing that the machine I am logged into has the same time as my > >> laptop. I have labelled the times taken from different system clocks > >> with lettered clock domains just in case they are different. > >> > >> For this job, its running in thread 0-1-88. > >> The karajan level job submission goes through these states (in clock > >> domain A) > >> 23:14:08,196-0600 Submitting > >> 23:14:08,204-0600 Submitted > >> 23:14:14,121-0600 Active > >> 23:14:14,121-0600 Completed > >> > >> Note that the last two - Active and Completed - are the same (within > >> a millisecond) > >> > >> At 23:14:14,189-0600 Swift checks the job status and finds the > >> success file is not found. (This timestamp is in clock domain A) > >> > >> So now I look at for the status file myself on the fd filesystem: > >> > >> $ ls --full-time > >> /home/wilde/swiftwork/amps1-20080317-2314-nvn8x1p2/status/2/runam3-2pbcdypi-success > >> > >> -rw-r--r-- 1 wilde mcsz 0 2008-03-18 00:14:19.202382966 -0500 > >> /home/wilde/swiftwork/amps1-20080317-2314-nvn8x1p2/status/2/runam3-2pbcdypi-success > >> > >> > >> (this is in clock domain B) > >> > >> And see that the file does exist but is a full 5 seconds after the > >> job was reported as successful by provider-deef. > >> > >> So now we can look in the info/ directory (next to the status > >> directory) and get run time stamps or the jobs. > >> > >> According to the info log, the job begins running at: (in clock > >> domain B again) at: > >> > >> 00:14:14.065373000-0500 > >> > >> which corresponds within about 60ms of the time that provider-deef > >> reported the job as active. 
> >> However, the execution according to the wrapper log shows that the > >> job did not finish executing until > >> > >> 00:14:19.233438000-0500 > >> > >> (which is when the status file is approximately timestamped). > >> > >> My off-the-cuff hypothesis is, based on the above, that somewhere in > >> provider-deef or below, the execution system is reporting a job as > >> completed as soon as it starts executing, rather than when it > >> actually finishes executing; and that successes with small numbers of > >> jobs have been a race condition that would disappear if those small > >> jobs took a substantially longer time to execute (e.g. if they had a > >> sleep 30s in them). > >> > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > From benc at hawaga.org.uk Wed Mar 19 16:22:46 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 19 Mar 2008 21:22:46 +0000 (GMT) Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <47E15F6A.5080109@mcs.anl.gov> References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E1352E.3070808@mcs.anl.gov> <47E15F6A.5080109@mcs.anl.gov> Message-ID: On Wed, 19 Mar 2008, Michael Wilde wrote: > My (likely outdated) understanding of the NFS protocol was that it's supposed to > guarantee close-to-open coherence. Meaning that if two clients want to access > a file sequentially, and the writing client closes the file before the reading > client opens the file, then NFS was supposed to ensure that the reader > correctly saw the existence and content of the file. Right.
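The close-to-open contract described here reduces to a simple ordering: the writer's close must complete before the reader's open begins. A minimal sketch of that ordering (file names and paths are made up for illustration; on a single host this trivially holds, and a coherent pair of NFS clients is supposed to give the same result):

```shell
# Hypothetical illustration of close-to-open ordering; paths are made up.
tmp=$(mktemp -d)

# Writer side: write the data and close the file (the redirection opens,
# writes, and closes it before the next command runs).
printf 'data 1\n' > "$tmp/output"

# Reader side: open strictly after the writer's close.  Under
# close-to-open coherence this must see the complete content.
cat "$tmp/output"

rm -r "$tmp"
```

Locally this always prints the full content; the failures in this thread suggest that between two NFS clients with attribute caching, the reader's open can still miss a file the writer has already closed.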
Linux NFS (but this is going back half a decade) had some problem there (I think that caused problems for GRAM2 somewhere, for example) though I do not remember the details; and it was also half a decade ago so has a good chance of being different now. A quick Google did not find anything that immediately applied. I've also still not entirely ruled out a race somewhere in the falkon->provider-deef->swift stack reporting this. > If others agree that this should still be the case, then it's worth > looking at our code to make sure that this is the case. If it wasn't, > you'd think that more things would break, but perhaps Falkon exacerbates > any problems in that area due to its low latency. Indeed, the combination of falkon and local filesystem access is probably getting the time between touching the status file on one node and reading it on another down pretty low compared to other submission and file access protocols. > The race as far as I know is between the worker writing and moving result, > info, and success status files, and the swift host seeing these, correct? That's what your logs look like today. But yesterday had different timings that suggested a different problem. More runs of the kind that failed would be useful, along with the corresponding falkon logs that Ioan listed in a mail in this thread. -- From benc at hawaga.org.uk Wed Mar 19 22:42:46 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 20 Mar 2008 03:42:46 +0000 (GMT) Subject: [Swift-devel] plan for 0.5 release Message-ID: There was a long, long pause between swift 0.3 and swift 0.4, and consequently a bunch of bugs have been discovered. So I'd like to put out a 0.5 sometime in the next couple of weeks to release those bugfixes.
After that, hopefully I will manage to not wait so many months before releasing 0.6. -- From hategan at mcs.anl.gov Thu Mar 20 17:07:23 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 20 Mar 2008 17:07:23 -0500 Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E1352E.3070808@mcs.anl.gov> <47E15F6A.5080109@mcs.anl.gov> Message-ID: <1206050843.4091.9.camel@blabla.mcs.anl.gov> On Wed, 2008-03-19 at 21:22 +0000, Ben Clifford wrote: > On Wed, 19 Mar 2008, Michael Wilde wrote: > > > My (likely outdated) understanding of the NFS protocol was that it's supposed to > > guarantee close-to-open coherence. Meaning that if two clients want to access > > a file sequentially, and the writing client closes the file before the reading > > client opens the file, then NFS was supposed to ensure that the reader > > correctly saw the existence and content of the file. > > Right. > > Linux NFS (but this is going back half a decade) had some problem there (I > think that caused problems for GRAM2 somewhere, for example) though I do > not remember the details; and it was also half a decade ago so has a good > chance of being different now. I seem to remember what looked like an oddity at the time, that the GRAM PBS script was writing a file on the worker node and insisted that the script (and the job) be "done" only when the file was visible on the head node. > > A quick Google did not find anything that immediately applied. > > I've also still not entirely ruled out a race somewhere in the > falkon->provider-deef->swift stack reporting this. > > > If others agree that this should still be the case, then it's worth > > looking at our code to make sure that this is the case.
If it wasn't, > > you'd think that more things would break, but perhaps Falkon exacerbates > > any problems in that area due to its low latency. > > Indeed, the combination of falkon and local filesystem access is probably > getting the time between touching the status file on one node and reading > it on another down pretty low compared to other submission and file access > protocols. > > > The race as far as I know is between the worker writing and moving result, > > info, and success status files, and the swift host seeing these, correct? > > That's what your logs look like today. But yesterday had different timings > that suggested a different problem. > > More runs of the kind that failed would be useful, along with the > corresponding falkon logs that Ioan listed in a mail in this thread. > From iraicu at cs.uchicago.edu Thu Mar 20 17:44:18 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Thu, 20 Mar 2008 17:44:18 -0500 Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <1206050843.4091.9.camel@blabla.mcs.anl.gov> References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E1352E.3070808@mcs.anl.gov> <47E15F6A.5080109@mcs.anl.gov> <1206050843.4091.9.camel@blabla.mcs.anl.gov> Message-ID: <47E2E8C2.2050700@cs.uchicago.edu> If GRAM handles the staging in and out of data, then it's true. Falkon in the way that Swift is using it now does not do any data staging, so I don't see how Falkon can do any further checking on the existence of files, on behalf of jobs. What file would it check for? This would surely involve modifying the API in the falkon provider code, for Swift to tell Falkon what file it needs to verify. If Falkon were to handle the data management, then you are right, Falkon would do all this checking, but currently it just treats Swift jobs as black boxes, and knows nothing about files or directories that need to exist.
Furthermore, the Falkon service could run anywhere (given that firewalls and NATs permit), which further complicates any kind of checking for files on some remote file system. Why could Swift not have a retry mechanism, given that it received a successful exit code, be more persistent in looking for the success or failure file, and if it doesn't exist, to try it again after some small amount of sleep... this would certainly hide (and potentially solve) the race condition, with a persistent enough retry mechanism, wouldn't it? Ioan Mihael Hategan wrote: > On Wed, 2008-03-19 at 21:22 +0000, Ben Clifford wrote: > >> On Wed, 19 Mar 2008, Michael Wilde wrote: >> >> >>> My (likely outdated) understanding of the NFS protocol was that it's supposed to >>> guarantee close-to-open coherence. Meaning that if two clients want to access >>> a file sequentially, and the writing client closes the file before the reading >>> client opens the file, then NFS was supposed to ensure that the reader >>> correctly saw the existence and content of the file. >>> >> Right. >> >> Linux NFS (but this is going back half a decade) had some problem there (I >> think that caused problems for GRAM2 somewhere, for example) though I do >> not remember the details; and it was also half a decade ago so has a good >> chance of being different now. >> > > I seem to remember what looked like an oddity at the time, that the GRAM > PBS script was writing a file on the worker node and insisted that the > script (and the job) be "done" only when the file was visible on the > head node. > > >> A quick Google did not find anything that immediately applied. >> >> I've also still not entirely ruled out a race somewhere in the >> falkon->provider-deef->swift stack reporting this. >> >> >>> If others agree that this should still be the case, then it's worth >>> looking at our code to make sure that this is the case.
If it wasn't, >>> you'd think that more things would break, but perhaps Falkon exacerbates >>> any problems in that area due to its low latency. >>> >> Indeed, the combination of falkon and local filesystem access is probably >> getting the time between touching the status file on one node and reading >> it on another down pretty low compared to other submission and file access >> protocols. >> >> >>> The race as far as I know is between the worker writing and moving result, >>> info, and success status files, and the swift host seeing these, correct? >>> >> That's what your logs look like today. But yesterday had different timings >> that suggested a different problem. >> >> More runs of the kind that failed would be useful, along with the >> corresponding falkon logs that Ioan listed in a mail in this thread. >> >> > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== ===================================================
From benc at hawaga.org.uk Thu Mar 20 18:17:47 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 20 Mar 2008 23:17:47 +0000 (GMT) Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <47E2E8C2.2050700@cs.uchicago.edu> References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E1352E.3070808@mcs.anl.gov> <47E15F6A.5080109@mcs.anl.gov> <1206050843.4091.9.camel@blabla.mcs.anl.gov> <47E2E8C2.2050700@cs.uchicago.edu> Message-ID: On Thu, 20 Mar 2008, Ioan Raicu wrote: > Why could Swift not have a retry mechanism, given that it received a > successful exit code, be more persistent in looking for the success or failure > file, and if it doesn't exist, to try it again after some small amount of > sleep... this would certainly hide (and potentially solve) the race > condition, with a persistent enough retry mechanism, wouldn't it? The goal is not just to find a status file; there is other stuff being written to the shared filesystem and it's not clear that the status files appearing would guarantee that the other files had appeared too. -- From benc at hawaga.org.uk Thu Mar 20 18:23:06 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 20 Mar 2008 23:23:06 +0000 (GMT) Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E1352E.3070808@mcs.anl.gov> <47E15F6A.5080109@mcs.anl.gov> Message-ID: There is a flag for NFS mounts, 'noac', which disables attribute caching on clients, which I think may make the filesystem behave in the desired fashion; however it sounds like it also massively reduces filesystem performance and increases fileserver load. Mike, you might be able to persuade MCS systems to make such a filesystem available.
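For reference, a mount using 'noac' might look like the following; the server name, export path, and the remaining mount options are illustrative assumptions, not taken from this thread:

```shell
# Illustrative only: a hypothetical NFS mount with attribute caching
# disabled.  'noac' forces the client to revalidate file attributes on
# every access, which is why it costs performance and fileserver load.

# /etc/fstab entry:
#   fileserver:/export/home  /home  nfs  rw,hard,intr,noac  0  0

# Or mounted by hand (as root):
#   mount -t nfs -o rw,noac fileserver:/export/home /home
```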
I suspect some multi-second delay after touching the status file and before exiting in the wrapper script is probably the best workaround for now, though. -- From iraicu at cs.uchicago.edu Thu Mar 20 18:26:39 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Thu, 20 Mar 2008 18:26:39 -0500 Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E1352E.3070808@mcs.anl.gov> <47E15F6A.5080109@mcs.anl.gov> <1206050843.4091.9.camel@blabla.mcs.anl.gov> <47E2E8C2.2050700@cs.uchicago.edu> Message-ID: <47E2F2AF.9090601@cs.uchicago.edu> But the status file is written last, all from the same node, so in theory (would have to be tested, or at least verified by someone who knows NFS better than I do), if the status file appears, then the other files would also be there. A year ago, there was no status file... this was added later. What was the main motivator for adding the status file? Was it that you couldn't rely on the provider's exit codes? Or something else? Ioan Ben Clifford wrote: > On Thu, 20 Mar 2008, Ioan Raicu wrote: > > >> Why could Swift not have a retry mechanism, given that it received a >> successful exit code, be more persistent in looking for the success or failure >> file, and if it doesn't exist, to try it again after some small amount of >> sleep... this would certainly hide (and potentially solve) the race >> condition, with a persistent enough retry mechanism, wouldn't it? >> > > The goal is not just to find a status file; there is other stuff being > written to the shared filesystem and it's not clear that the status files > appearing would guarantee that the other files had appeared too. > > -- =================================================== Ioan Raicu Ph.D.
Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== From benc at hawaga.org.uk Thu Mar 20 18:29:13 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 20 Mar 2008 23:29:13 +0000 (GMT) Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <47E2F2AF.9090601@cs.uchicago.edu> References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E1352E.3070808@mcs.anl.gov> <47E15F6A.5080109@mcs.anl.gov> <1206050843.4091.9.camel@blabla.mcs.anl.gov> <47E2E8C2.2050700@cs.uchicago.edu> <47E2F2AF.9090601@cs.uchicago.edu> Message-ID: If there is no status file and we rely on falkon reporting success, then we go to retrieve the last data file that was written out by the job, and 'oh! filesystem race condition, it isn't there...' for the same reasons that the status file isn't there now. On Thu, 20 Mar 2008, Ioan Raicu wrote: > But the status file is written last, all from the same node, so in theory > (would have to be tested, or at least verified by someone who knows NFS better > than I do), if the status file appears, then the other files would also be > there. A year ago, there was no status file... this was added later. What > was the main motivator for adding the status file? Was it that you couldn't > rely on the provider's exit codes? Or something else?
> > Ioan > > Ben Clifford wrote: > > On Thu, 20 Mar 2008, Ioan Raicu wrote: > > > > > Why could Swift not have a retry mechanism, given that it received a > > > successful exit code, be more persistent in looking for the success or > > > failure > > > file, and if it doesn't exist, to try it again after some small amount of > > > sleep... this would certainly hide (and potentially solve) the race > > > condition, with a persistent enough retry mechanism, wouldn't it? > > > > > > > The goal is not just to find a status file; there is other stuff being > > written to the shared filesystem and it's not clear that the status files > > appearing would guarantee that the other files had appeared too. > > > > > > From iraicu at cs.uchicago.edu Thu Mar 20 18:44:25 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Thu, 20 Mar 2008 18:44:25 -0500 Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E1352E.3070808@mcs.anl.gov> <47E15F6A.5080109@mcs.anl.gov> Message-ID: <47E2F6D9.2030007@cs.uchicago.edu> I added a configurable delay in delivering notifications to Swift in the provider code, which Mike still has to test. The deef provider already had a queue for the incoming notifications, so it was not hard to delay these notifications from this queue to Swift. Another approach, which I discussed with Mike, was to do a sync at the end of the wrapper script. From my simple test on a Linux box, it seems that sync is a blocking call, which is exactly what we want!

iraicu at gto:~> time sync
real 0m0.711s
user 0m0.000s
sys 0m0.004s
iraicu at gto:~> time sync
real 0m0.035s
user 0m0.000s
sys 0m0.000s

Mike, could you try adding a sync at the end of the wrapper.sh (and make sure not to have any additional sleeps anywhere else), and see if that helps?
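The tail of such a wrapper could look like the sketch below. The directory layout and file names are hypothetical, not Swift's actual wrapper.sh; the point is only the ordering: output files first, status file last, then sync before the wrapper exits and the completion notification can fire.

```shell
# Hypothetical wrapper tail; layout and file names are illustrative.
jobdir=$(mktemp -d)
mkdir -p "$jobdir/status" "$jobdir/info"

: > "$jobdir/outdata"              # application output file(s)
: > "$jobdir/info/job.info"        # per-job info log
: > "$jobdir/status/job-success"   # status file, written last

# Flush queued writes before exiting, so the submit side does not see
# a success notification ahead of the files themselves.
sync

echo "wrapper done"
rm -r "$jobdir"
```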
Ioan Ben Clifford wrote: > There is a flag for NFS mounts, 'noac', which disables attribute caching on > clients, which I think may make the filesystem behave in the desired > fashion; however it sounds like it also massively reduces filesystem > performance and increases fileserver load. > > Mike, you might be able to persuade MCS systems to make such a filesystem > available. > > I suspect some multi-second delay after touching the status file and > before exiting in the wrapper script is probably the best workaround for > now, though. > > -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== From iraicu at cs.uchicago.edu Thu Mar 20 19:03:24 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Thu, 20 Mar 2008 19:03:24 -0500 Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <47E2F6D9.2030007@cs.uchicago.edu> References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E1352E.3070808@mcs.anl.gov> <47E15F6A.5080109@mcs.anl.gov> <47E2F6D9.2030007@cs.uchicago.edu> Message-ID: <47E2FB4C.50406@cs.uchicago.edu> Here is more on sync: > According to the standard specification (e.g., POSIX.1-2001), sync() > schedules the writes, but may return before the > actual writing is done. However, since version 1.3.20 Linux > does actually wait. (This still does not guarantee data > integrity: modern disks have large caches.)
So, it looks like it might be blocking, but it might depend on the Linux kernel. Anyway, I think it's worth a try, and it seems like a better solution than sleeps. Ioan Ioan Raicu wrote: > I added a configurable delay in delivering notifications to Swift in > the provider code, which Mike still has to test. The deef provider > already had a queue for the incoming notifications, so it was not hard > to delay these notifications from this queue to Swift. > > Another approach, which I discussed with Mike, was to do a sync at the > end of the wrapper script. From my simple test on a Linux box, it > seems that sync is a blocking call, which is exactly what we want! > iraicu at gto:~> time sync > real 0m0.711s > user 0m0.000s > sys 0m0.004s > iraicu at gto:~> time sync > real 0m0.035s > user 0m0.000s > sys 0m0.000s > > Mike, could you try adding a sync at the end of the wrapper.sh (and > make sure not to have any additional sleeps anywhere else), and see if > that helps? > > Ioan > > > Ben Clifford wrote: >> There is a flag for NFS mounts, 'noac', which disables attribute >> caching on clients, which I think may make the filesystem behave in >> the desired fashion; however it sounds like it also massively reduces >> filesystem performance and increases fileserver load. >> >> Mike, you might be able to persuade MCS systems to make such a >> filesystem available. >> >> I suspect some multi-second delay after touching the status file and >> before exiting in the wrapper script is probably the best workaround >> for now, though. >> >> > -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E.
58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== From benc at hawaga.org.uk Thu Mar 20 19:14:42 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 21 Mar 2008 00:14:42 +0000 (GMT) Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <47E2F6D9.2030007@cs.uchicago.edu> References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E1352E.3070808@mcs.anl.gov> <47E15F6A.5080109@mcs.anl.gov> <47E2F6D9.2030007@cs.uchicago.edu> Message-ID: On Thu, 20 Mar 2008, Ioan Raicu wrote: > Mike, could you try adding a sync at the end of the wrapper.sh (and make sure > not to have any additional sleeps anywhere else), and see if that helps? Yeah, that's a good thing to try. I don't really know how sync works wrt NFS, but it's worth trying. -- From hategan at mcs.anl.gov Fri Mar 21 03:18:16 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 21 Mar 2008 03:18:16 -0500 Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <47E2E8C2.2050700@cs.uchicago.edu> References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E1352E.3070808@mcs.anl.gov> <47E15F6A.5080109@mcs.anl.gov> <1206050843.4091.9.camel@blabla.mcs.anl.gov> <47E2E8C2.2050700@cs.uchicago.edu> Message-ID: <1206087497.4572.3.camel@blabla.mcs.anl.gov> On Thu, 2008-03-20 at 17:44 -0500, Ioan Raicu wrote: > If GRAM handles the staging in and out of data, then it's true.
> Falkon in the way that Swift is using it now does not do any data > staging, so I don't see how Falkon can do any further checking on the > existence of files, on behalf of jobs. What file would it check for? Pretty much the file that GRAM checks for: one that it creates after the executable completes. If the filesystem preserves temporal ordering on file availability, then this will guarantee that any files created by the job will be visible. > This would surely involve modifying the API in the falkon provider > code, for Swift to tell Falkon what file it needs to verify. > > If Falkon were to handle the data management, then you are right, > Falkon would do all this checking, but currently it just treats Swift > jobs as black boxes, and knows nothing about files or directories that > need to exist. Furthermore, the Falkon service could run anywhere > (given that firewalls and NATs permit), which further complicates any > kind of checking for files on some remote file system. > > Why could Swift not have a retry mechanism, given that it received a > successful exit code, be more persistent in looking for the success or > failure file, and if it doesn't exist, to try it again after some > small amount of sleep... this would certainly hide (and potentially > solve) the race condition, with a persisitent enough retry mechanism, > wouldn't it? > > Ioan > > Mihael Hategan wrote: > > On Wed, 2008-03-19 at 21:22 +0000, Ben Clifford wrote: > > > > > On Wed, 19 Mar 2008, Michael Wilde wrote: > > > > > > > > > > My (likely outdated) understanding of NFS protocol was that its supposed to > > > > guarantee close-to-open coherence. Meaning that if two clients want to access > > > > a file sequentially, and the writing client closes the file before the reading > > > > client opens the file, then NFS was supposed to ensure that the reader > > > > correctly saw the existence and content of the file. > > > > > > > Right. 
> > > > > > Linux NFS (but this is going back half a decade) had some problem there (I > > > think that caused problems for GRAM2 somewhere, for example) though I do > > > not remember the details; and it was also half a decade ago so has a good > > > chance of being different now. > > > > > > I seem to remember what looked like an oddity at the time, that the GRAM > > PBS script was writing a file on the worker node and insisted that the > > script (and the job) be "done" only when the file was visible on the > > head node. > > > > > > > A quick Google did not find anything that immediately applied. > > > > > > I've also still not entirely ruled out a race somewhere in the > > > falkon->provider-deef->swift stack reporting this. > > > > > > > > > > If others agree that this should still be the case, then it's worth > > > > looking at our code to make sure that this is the case. If it wasn't, > > > > you'd think that more things would break, but perhaps Falkon exacerbates > > > > any problems in that area due to its low latency. > > > > > > > Indeed, the combination of falkon and local filesystem access is probably > > > getting the time between touching the status file on one node and reading > > > it on another down pretty low compared to other submission and file access > > > protocols. > > > > > > > > > > The race as far as I know is between the worker writing and moving result, > > > > info, and success status files, and the swift host seeing these, correct? > > > > > > > That's what your logs look like today. But yesterday had different timings > > > that suggested a different problem. > > > > > > More runs of the kind that failed would be useful, along with the > > > corresponding falkon logs that Ioan listed in a mail in this thread.
> > > > > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > -- > =================================================== > Ioan Raicu > Ph.D. Candidate > =================================================== > Distributed Systems Laboratory > Computer Science Department > University of Chicago > 1100 E. 58th Street, Ryerson Hall > Chicago, IL 60637 > =================================================== > Email: iraicu at cs.uchicago.edu > Web: http://www.cs.uchicago.edu/~iraicu > http://dev.globus.org/wiki/Incubator/Falkon > http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page > =================================================== > =================================================== > From hategan at mcs.anl.gov Fri Mar 21 03:21:25 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 21 Mar 2008 03:21:25 -0500 Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <47E2F2AF.9090601@cs.uchicago.edu> References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E1352E.3070808@mcs.anl.gov> <47E15F6A.5080109@mcs.anl.gov> <1206050843.4091.9.camel@blabla.mcs.anl.gov> <47E2E8C2.2050700@cs.uchicago.edu> <47E2F2AF.9090601@cs.uchicago.edu> Message-ID: <1206087685.4572.7.camel@blabla.mcs.anl.gov> On Thu, 2008-03-20 at 18:26 -0500, Ioan Raicu wrote: > But the status file is written last, all from the same node, so in > theory (would have to be tested, or at least verified by someone who > knows NFS better than I do), if the status file appears, then the > other files would also be there. A year ago, there was no status > file... this was added later. Your assumption is incorrect. 
There was an exit code file written when the application failed, but nothing written when the application succeeded, causing ambiguity when the filesystem settings were wrong. > What was the main motivator for adding the status file? Was it that > you couldn't rely on the provider's exit codes? Or something else? > > Ioan > > Ben Clifford wrote: > > On Thu, 20 Mar 2008, Ioan Raicu wrote: > > > > > > > Why could Swift not have a retry mechanism, given that it received a > > > successful exit code, be more persistent in looking for the success or failure > > > file, and if it doesn't exist, to try it again after some small amount of > > > sleep... this would certainly hide (and potentially solve) the race > > > condition, with a persistent enough retry mechanism, wouldn't it? > > > > > > > The goal is not just to find a status file; there is other stuff being > > written to the shared filesystem and it's not clear that the status files > > appearing would guarantee that the other files had appeared too. > > > > > > -- > =================================================== > Ioan Raicu > Ph.D. Candidate > =================================================== > Distributed Systems Laboratory > Computer Science Department > University of Chicago > 1100 E.
58th Street, Ryerson Hall > Chicago, IL 60637 > =================================================== > Email: iraicu at cs.uchicago.edu > Web: http://www.cs.uchicago.edu/~iraicu > http://dev.globus.org/wiki/Incubator/Falkon > http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page > =================================================== > =================================================== > From wilde at mcs.anl.gov Fri Mar 21 07:12:03 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 21 Mar 2008 07:12:03 -0500 Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E1352E.3070808@mcs.anl.gov> <47E15F6A.5080109@mcs.anl.gov> Message-ID: <47E3A613.2040701@mcs.anl.gov> My latest test on runs of 25, 100, and 1000 jobs seems to indicate that with a sync command at the end of the application script, all job status and data is returned ok every time. (This is somewhat curious, as the info and success files for the current job would not yet be complete at the time, but the sync command affects all other activity on the host, and ensures that at least the currently existing dirs, files and data are synced, or that their sync has started). Without the sync, at the moment, virtually all jobs fail, and almost *no* data is being returned. Out of 3 runs of 1000 jobs, one run returned 2 data files, the other two returned no data files. One 100-job run without sync returned 11 of 100 files. It seems like the most fruitful testing to see if this sync is totally fixing the problem is to do lots more runs. I noted that the bblogin host (from which I run Swift) has no special NFS mount flags, just rw. (I was wondering if they had something on that would affect coherence; seems not).
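If the sync workaround turns out not to be sufficient, the retry mechanism Ioan suggested earlier in the thread amounts to a bounded poll for the status file after the success notification arrives. A sketch, with a hypothetical function name, retry count, and delay (none of this is Swift's actual code):

```shell
# Hypothetical bounded poll for a status file; limits are illustrative.
wait_for_status() {
    _file=$1
    _tries=${2:-10}   # how many times to look before giving up
    _delay=${3:-1}    # seconds to sleep between looks
    while [ "$_tries" -gt 0 ]; do
        [ -e "$_file" ] && return 0   # visible: treat the job as done
        sleep "$_delay"
        _tries=$((_tries - 1))
    done
    return 1                          # still missing: report failure
}

# Example: a file that already exists is found on the first look.
tmp=$(mktemp -d)
: > "$tmp/job-success"
wait_for_status "$tmp/job-success" 5 1 && echo found
rm -r "$tmp"
```

As Ben points out, though, the status file appearing does not by itself guarantee that the job's other output files are visible, so this hides rather than removes the race.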
I did not have a chance to capture the falkon logs in these tests; I will look for the ones Ioan mentioned, and try some runs with those logs captured. The swift logs I did capture are in the CI log dir, wilde/run{317-328}

run317/comment:amps1 100 sico with sync - ran ok
run318/comment:amps1 100 with no sync - died on first error
run319/comment:amps1 without sync - 11 of 100 returned OK
run320/comment:amps1 100 without sync - no data returned ok
run321/comment:amps1 100 without sync - no data returned ok
run322/comment:amps1 100 with sync - all data returned ok
run323/comment:amps1 100 with sync - all data returned ok
run324/comment:amps1 1000 with sync - all data returned ok
run325/comment:amps1 1000 without sync - no data returned ok
run326/comment:amps1 25 without sync - no data returned ok
run327/comment:amps1 100 without sync - 2 data files returned ok
run328/comment:amps1 1000 with sync - all data returned ok

- Mike On 3/20/08 6:23 PM, Ben Clifford wrote: > There is a flag for NFS mounts, 'noac', which disables attribute caching on > clients, which I think may make the filesystem behave in the desired > fashion; however it sounds like it also massively reduces filesystem > performance and fileserver load. > > Mike, you might be able to persuade MCS systems to make such a filesystem > available. > > I suspect some multi-second delay after touching the status file and > before exiting in the wrapper script is probably the best workaround for > now, though.
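Ben's delay-before-exit workaround can be sketched as follows; STATUS_DIR and ID are hypothetical stand-ins for wrapper.sh's real status/${JOBDIR} directory and job id, and the 2-second figure is arbitrary, not a tuned value:

```shell
# Sketch of the suggested workaround: touch the status file, then
# pause a few seconds before exiting so the NFS client has a chance
# to flush it before Swift goes looking for the file.
STATUS_DIR=$(mktemp -d)   # stand-in for status/${JOBDIR}
ID=job0001                # stand-in for the wrapper's job id
touch "$STATUS_DIR/$ID-success"
sleep 2                   # multi-second delay before exit
echo "wrote $STATUS_DIR/$ID-success"
```

This hides, rather than fixes, the coherence race: it just makes it more likely the client has flushed before the job slot is reused.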
> From wilde at mcs.anl.gov Fri Mar 21 08:34:43 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 21 Mar 2008 08:34:43 -0500 Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <47E3A613.2040701@mcs.anl.gov> References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E1352E.3070808@mcs.anl.gov> <47E15F6A.5080109@mcs.anl.gov> <47E3A613.2040701@mcs.anl.gov> Message-ID: <47E3B973.6040503@mcs.anl.gov> Runs 329 and 330 (both in the CI log dir) were run with (I hope) the requested Falkon logs turned on. Note that I turned the deef provider logs on too, but did not yet verify that it was correctly logging. run329 was 9 jobs, no sync. All 9 succeeded. run330 was 25 jobs, no sync. 19 of 25 succeeded, the rest failed. This is starting to confirm a curious pattern: without the sync, workflows with more jobs achieve *fewer* total successful jobs. Here's what I recall from the last few days of testing:

1 job wf: all succeeds
9 job wf: all succeed
25 job wf: 15-20 succeed
100 job wf: 1-2 succeed
1000 job wf: 0 succeed

I don't have enough data to confirm this, but the pattern seems to be present. I am going to set the problem aside for now, until, Ben and Ioan, you have a chance to look at the logs from this morning's test. I'll assume for the moment that the sync "fixes" the problem, and go on to the application tests I need to run, keeping an eye out for anomalies. My goal is to do large-scale tests of AMIGA and DOCK under Swift, reducing wrapper.sh and throttling delays, and doing as much work on local RAM filesystems as possible. Mike On 3/21/08 7:12 AM, Michael Wilde wrote: > My latest tests on runs of 25, 100, and 1000 jobs seem to indicate that > with a sync command at the end of the application script, all job status > and data is returned ok every time.
> > (This is somewhat curious, as the info and success files for the current job > would not yet be complete at the time, but the sync command affects all > other activity on the host, and ensures that at least the currently > existing dirs, files and data are synced, or that their sync has started). > > Without the sync, at the moment, virtually all jobs fail, and almost > *no* data is being returned. Out of 3 runs of 1000 jobs, one run > returned 2 data files, the other two returned no data files. One 100-job > run without sync returned 11 of 100 files. > > It seems like the most fruitful testing to see if this sync is totally > fixing the problem is to do lots more runs. > > I noted that the bblog host (from which I run Swift) has no special NFS > mount flags, just rw. (I was wondering if they had something on that > would affect coherence; seems not). > > I did not have a chance to capture the falkon logs in these tests; I > will look for the ones Ioan mentioned, and try some runs with those logs > captured.
> > The swift logs I did capture are in the CI log dir, wilde/run{317-328}
> 
> run317/comment:amps1 100 sico with sync - ran ok
> run318/comment:amps1 100 with no sync - died on first error
> run319/comment:amps1 without sync - 11 of 100 returned OK
> run320/comment:amps1 100 without sync - no data returned ok
> run321/comment:amps1 100 without sync - no data returned ok
> run322/comment:amps1 100 with sync - all data returned ok
> run323/comment:amps1 100 with sync - all data returned ok
> run324/comment:amps1 1000 with sync - all data returned ok
> run325/comment:amps1 1000 without sync - no data returned ok
> run326/comment:amps1 25 without sync - no data returned ok
> run327/comment:amps1 100 without sync - 2 data files returned ok
> run328/comment:amps1 1000 with sync - all data returned ok
> 
> - Mike > > On 3/20/08 6:23 PM, Ben Clifford wrote: >> There is a flag for NFS mounts, 'noac', which disables attribute caching >> on clients, which I think may make the filesystem behave in the >> desired fashion; however it sounds like it also massively reduces >> filesystem performance and fileserver load. >> >> Mike, you might be able to persuade MCS systems to make such a >> filesystem available. >> >> I suspect some multi-second delay after touching the status file and >> before exiting in the wrapper script is probably the best workaround >> for now, though.
>> > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > From hategan at mcs.anl.gov Fri Mar 21 09:02:30 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 21 Mar 2008 09:02:30 -0500 Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <47E3A613.2040701@mcs.anl.gov> References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E1352E.3070808@mcs.anl.gov> <47E15F6A.5080109@mcs.anl.gov> <47E3A613.2040701@mcs.anl.gov> Message-ID: <1206108150.5100.0.camel@blabla.mcs.anl.gov> On Fri, 2008-03-21 at 07:12 -0500, Michael Wilde wrote: > My latest tests on runs of 25, 100, and 1000 jobs seem to indicate that > with a sync command at the end of the application script, all job status > and data is returned ok every time. Why not put it in the wrapper script at the end? > > (This is somewhat curious, as the info and success files for the current job > would not yet be complete at the time, but the sync command affects all > other activity on the host, and ensures that at least the currently > existing dirs, files and data are synced, or that their sync has started). > > Without the sync, at the moment, virtually all jobs fail, and almost > *no* data is being returned. Out of 3 runs of 1000 jobs, one run > returned 2 data files, the other two returned no data files. One 100-job > run without sync returned 11 of 100 files. > > It seems like the most fruitful testing to see if this sync is totally > fixing the problem is to do lots more runs. > > I noted that the bblog host (from which I run Swift) has no special NFS > mount flags, just rw. (I was wondering if they had something on that > would affect coherence; seems not).
> > I did not have a chance to capture the falkon logs in these tests; I > will look for the ones Ioan mentioned, and try some runs with those logs > captured. > > The swift logs I did capture are in the CI log dir, wilde/run{317-328}
> 
> run317/comment:amps1 100 sico with sync - ran ok
> run318/comment:amps1 100 with no sync - died on first error
> run319/comment:amps1 without sync - 11 of 100 returned OK
> run320/comment:amps1 100 without sync - no data returned ok
> run321/comment:amps1 100 without sync - no data returned ok
> run322/comment:amps1 100 with sync - all data returned ok
> run323/comment:amps1 100 with sync - all data returned ok
> run324/comment:amps1 1000 with sync - all data returned ok
> run325/comment:amps1 1000 without sync - no data returned ok
> run326/comment:amps1 25 without sync - no data returned ok
> run327/comment:amps1 100 without sync - 2 data files returned ok
> run328/comment:amps1 1000 with sync - all data returned ok
> 
> - Mike > > On 3/20/08 6:23 PM, Ben Clifford wrote: > > There is a flag for NFS mounts, 'noac', which disables attribute caching on > > clients, which I think may make the filesystem behave in the desired > > fashion; however it sounds like it also massively reduces filesystem > > performance and fileserver load. > > > > Mike, you might be able to persuade MCS systems to make such a filesystem > > available. > > > > I suspect some multi-second delay after touching the status file and > > before exiting in the wrapper script is probably the best workaround for > > now, though.
> > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From iraicu at cs.uchicago.edu Fri Mar 21 09:38:15 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Fri, 21 Mar 2008 09:38:15 -0500 Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <1206108150.5100.0.camel@blabla.mcs.anl.gov> References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E1352E.3070808@mcs.anl.gov> <47E15F6A.5080109@mcs.anl.gov> <47E3A613.2040701@mcs.anl.gov> <1206108150.5100.0.camel@blabla.mcs.anl.gov> Message-ID: <47E3C857.604@cs.uchicago.edu> I would also think that having the sync at the end of the wrapper.sh after it is done modifying any other files would be the best thing. Ioan Mihael Hategan wrote: > On Fri, 2008-03-21 at 07:12 -0500, Michael Wilde wrote: > >> My latest test on runs of 25, 100, and 1000 jobs seem to indicate that >> with a sync command at the end of the application script, all job status >> and data is returned ok every time. >> > > Why not put it in the wrapper script at the end? > > > -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From benc at hawaga.org.uk Fri Mar 21 09:50:22 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 21 Mar 2008 14:50:22 +0000 (GMT) Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <47E3B973.6040503@mcs.anl.gov> References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E1352E.3070808@mcs.anl.gov> <47E15F6A.5080109@mcs.anl.gov> <47E3A613.2040701@mcs.anl.gov> <47E3B973.6040503@mcs.anl.gov> Message-ID: > This is starting to confirm a curious pattern: without the sync, workflows > with more jobs achieve *fewer* total successful jobs. I can imagine that happening - with more stuff going on, flush to NFS server happens less/slower, because other stuff is happening instead. -- From benc at hawaga.org.uk Sun Mar 23 19:12:05 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 24 Mar 2008 00:12:05 +0000 (GMT) Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <1206108150.5100.0.camel@blabla.mcs.anl.gov> References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E1352E.3070808@mcs.anl.gov> <47E15F6A.5080109@mcs.anl.gov> <47E3A613.2040701@mcs.anl.gov> <1206108150.5100.0.camel@blabla.mcs.anl.gov> Message-ID: On Fri, 21 Mar 2008, Mihael Hategan wrote: > On Fri, 2008-03-21 at 07:12 -0500, Michael Wilde wrote: > > My latest tests on runs of 25, 100, and 1000 jobs seem to indicate that > > with a sync command at the end of the application script, all job status > > and data is returned ok every time. > > Why not put it in the wrapper script at the end? Mike, the attached patch will do that, and will also add logging information so that we can see how long syncs are taking compared to other stages in worker node execution.
cd cog/modules/vdsk
patch -p1 < sync-in-wrapper

--
-------------- next part --------------
Index: swift/libexec/wrapper.sh
===================================================================
--- swift.orig/libexec/wrapper.sh	2008-03-24 08:49:43.000000000 +0900
+++ swift/libexec/wrapper.sh	2008-03-24 08:49:45.000000000 +0900
@@ -240,5 +240,9 @@
 logstate "TOUCH_SUCCESS"
 touch status/${JOBDIR}/${ID}-success
+
+logstate SYNC
+sync
+
 logstate "END"

From wilde at mcs.anl.gov Sun Mar 23 21:59:04 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sun, 23 Mar 2008 21:59:04 -0500 Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E1352E.3070808@mcs.anl.gov> <47E15F6A.5080109@mcs.anl.gov> <47E3A613.2040701@mcs.anl.gov> <1206108150.5100.0.camel@blabla.mcs.anl.gov> Message-ID: <47E718F8.6010800@mcs.anl.gov> Ben, thanks. I've been debugging on this since Friday. I had already moved the sync into wrapper.sh when Mihael first mentioned it. Friday afternoon I moved from a falkon binary drop that Ioan had built for me, to a build that I got from SVN and built myself. When I did that the nature of the problem changed:

- first run after a falkon restart, with the sync in wrapper.sh, worked fine, at various workflow sizes.
- second run would consistently fail with most jobs missing output, status and info files. Turns out the data was going mostly into the previous workflow's workdir.

After much debugging, the problem was found to be bad message formatting in the falkon service, causing the chdir to the workdir to fail. It failed very seldom on the initial workflow, and heavily on subsequent ones. This problem, too, initially looked like NFS incoherence.
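The chdir failure described above silently ran jobs in the previous workflow's workdir; a common defensive idiom is to make the chdir fatal so the job aborts instead of writing into the wrong directory. A minimal sketch (WORKDIR is a made-up stand-in for the per-job directory, not Falkon's actual variable):

```shell
# If the cd fails, abort immediately rather than running the job in
# whatever directory we happen to be in (e.g. the previous run's workdir).
WORKDIR=$(mktemp -d)   # stand-in for the per-job work directory
cd "$WORKDIR" || { echo "cannot chdir to $WORKDIR" >&2; exit 1; }
echo "running in: $(pwd)"

# A bogus directory now produces a detectable failure instead of
# silently continuing in the wrong place:
( cd /no/such/workdir 2>/dev/null ) || echo "chdir failure detected"
```

A failure mode that aborts loudly is much easier to distinguish from NFS incoherence than one that scatters output into a stale directory.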
Since that was fixed, I've been experimenting with workflows of various sizes, and have run several 10, 25, 100, 500, and 1000 job workflows, all without any sync, and without apparent problems. Some mysteries remain, as it's not clear that this message/chdir fix explains the earlier problem. But several Falkon fixes went in as well, so there are too many variables to know with confidence whether the original problem remains. Ioan: I do see that we're losing some workers, so some investigation is needed on the Falkon side. Ben: the swift provenance log records seem excessive: I'll start a thread on that. I'm now going to start performance measurement and tuning on this now that things seem stable enough to do repeatable runs. - Mike On 3/23/08 7:12 PM, Ben Clifford wrote: > On Fri, 21 Mar 2008, Mihael Hategan wrote: > >> On Fri, 2008-03-21 at 07:12 -0500, Michael Wilde wrote: >>> My latest tests on runs of 25, 100, and 1000 jobs seem to indicate that >>> with a sync command at the end of the application script, all job status >>> and data is returned ok every time. >> Why not put it in the wrapper script at the end? > > Mike, the attached patch will do that, and will also add logging > information so that we can see how long syncs are taking compared to other > stages in worker node execution.
> cd cog/modules/vdsk > patch -p1 < sync-in-wrapper > > From benc at hawaga.org.uk Sun Mar 23 22:03:04 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 24 Mar 2008 03:03:04 +0000 (GMT) Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <47E718F8.6010800@mcs.anl.gov> References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E1352E.3070808@mcs.anl.gov> <47E15F6A.5080109@mcs.anl.gov> <47E3A613.2040701@mcs.anl.gov> <1206108150.5100.0.camel@blabla.mcs.anl.gov> <47E718F8.6010800@mcs.anl.gov> Message-ID: On Sun, 23 Mar 2008, Michael Wilde wrote: > Ben: the swift provenance log records seem excessive

which ones?

-- From wilde at mcs.anl.gov Sun Mar 23 22:15:10 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sun, 23 Mar 2008 22:15:10 -0500 Subject: [Swift-devel] Excessive object-closing messages in log Message-ID: <47E71CBE.20000@mcs.anl.gov> Ben, for a small swift script that iterates over a parameter array, I seem to be getting about (N^2)/2 log records regarding object closing. The messages are of the form:

2008-03-23 18:30:36,877-0600 INFO CloseDataset org.griphyn.vdl.mapping.RootDataNode identifier tag:benc at ci.uchicago.edu,2008:swift:dataset:20080323-1828-x14dlldb:720000003246 with no value at dataset=ofile (closed)

For 500 entries, I had about 130K object closing log records; for 1000 entries, over 500K:

bb$ grep -i -c closedataset run{341,342}/amps*.log
run341/amps1-20080323-1828-25xcue89.log:130259
run342/amps1-20080323-1935-su38n0k5.log:510509

341 is 500 jobs, 342 is 1000 jobs. Is this log mechanism supposed to do that? If so, is that practical, as we want to test this script in the range of 1M jobs. run342 is in swift-logs/wilde The script is below.
My properties are:

sitedir.keep=true
lazy.errors=true
execution.retries=0
#kickstart.always.transfer=true
throttle.submit=off
throttle.host.submit=off
throttle.transfers=20
throttle.file.operations=20
throttle.score.job.factor=1000000
sitedir.keep=true

- Mike

type amout;

(amout ofile) runam3 (string id, string dieselLowSLightLLProd, string dieselMedSLightLLProd) {
    app {
        runam3 id dieselLowSLightLLProd dieselMedSLightLLProd;
    }
}

type params {
    string id;
    string dieselLowSLightLLProd;
    string dieselMedSLightLLProd;
};

doall(params p[]) {
    foreach pset in p {
        amout ofile;
        ofile = runam3(pset.id, pset.dieselLowSLightLLProd, pset.dieselMedSLightLLProd);
    }
}

// Main
params p[];
p = readdata("paramlist");
doall(p);

From benc at hawaga.org.uk Mon Mar 24 03:26:40 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 24 Mar 2008 08:26:40 +0000 (GMT) Subject: [Swift-devel] Re: Excessive object-closing messages in log In-Reply-To: <47E71CBE.20000@mcs.anl.gov> References: <47E71CBE.20000@mcs.anl.gov> Message-ID: On Sun, 23 Mar 2008, Michael Wilde wrote: > Ben, for a small swift script that iterates over a parameter array, I seem to > be getting about (N^2)/2 log records regarding object closing. For the script you gave, it should be more like O(N). I'll have a poke around and see what's going on. Also, in your script is there a reason you have a separate doall function rather than putting everything in the top level? This used to be a solution to a closing problem but I think that should have been fixed now. Also, > throttle.score.job.factor=1000000 That is documented as taking 'off', which is probably the effect you are trying to achieve with that value. Does that cause a problem for you?
-- From benc at hawaga.org.uk Mon Mar 24 04:02:51 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 24 Mar 2008 09:02:51 +0000 (GMT) Subject: [Swift-devel] Re: Excessive object-closing messages in log In-Reply-To: <47E71CBE.20000@mcs.anl.gov> References: <47E71CBE.20000@mcs.anl.gov> Message-ID: On Sun, 23 Mar 2008, Michael Wilde wrote: > Ben, for a small swift script that iterates over a parameter array, I seem to > be getting about (N^2)/2 log records regarding object closing. There was a debugging log loop that dumped the entire contents of the dataset closing-tracking cache every iteration of the loop (with that cache growing by one each iteration). This dump is gone as of r1759. -- From benc at hawaga.org.uk Mon Mar 24 04:28:00 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 24 Mar 2008 09:28:00 +0000 (GMT) Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <47E718F8.6010800@mcs.anl.gov> References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E1352E.3070808@mcs.anl.gov> <47E15F6A.5080109@mcs.anl.gov> <47E3A613.2040701@mcs.anl.gov> <1206108150.5100.0.camel@blabla.mcs.anl.gov> <47E718F8.6010800@mcs.anl.gov> Message-ID: On Sun, 23 Mar 2008, Michael Wilde wrote: > I'm now going to start performance measurement and tuning on this now that > things seem stable enough to do repeatable runs. For worker side performance measurement, set wrapperlog.always.transfer=true and copy the resulting *.d directory (which should have the same basename as the logfile) into the log repo. That will give a breakdown of what the worker is doing during the time that falkon says it is executing (which ideally will be almost entirely executing the actual application). 
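The quadratic blow-up Ben removed in r1759 is easy to reproduce: dumping a cache that grows by one entry per iteration logs 1 + 2 + ... + N = N*(N+1)/2 records in total. A small sketch of the counting (N is an arbitrary example size, not from Mike's runs):

```shell
# Why a per-iteration dump of a growing cache gives ~(N^2)/2 records:
# iteration i logs i entries, so the total is 1+2+...+N = N*(N+1)/2.
N=100
size=0; total=0; i=1
while [ "$i" -le "$N" ]; do
  size=$((size + 1))        # cache grows by one dataset per iteration
  total=$((total + size))   # dumping it logs every cached entry again
  i=$((i + 1))
done
echo "records for N=$N: $total"   # N*(N+1)/2 = 5050 for N=100
```

At N=1000 this formula gives 500500 records, which matches the roughly 510K CloseDataset lines Mike counted.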
-- From benc at hawaga.org.uk Mon Mar 24 04:46:20 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 24 Mar 2008 09:46:20 +0000 (GMT) Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E1352E.3070808@mcs.anl.gov> <47E15F6A.5080109@mcs.anl.gov> <47E3A613.2040701@mcs.anl.gov> <1206108150.5100.0.camel@blabla.mcs.anl.gov> <47E718F8.6010800@mcs.anl.gov> Message-ID: you can get plots for your 1000 job run here: http://www.ci.uchicago.edu/~benc/report-amps1-20080323-1935-su38n0k5/ you're hitting the file transfer and file operation limits (that are 20 in your config) once jobs start staging out. There's a weird-looking plateau in graph 'number of execute2 tasks at once:' around 170s .. 200s where no jobs complete for some time. Getting the falkon logs and/or the wrapper (.d) logs would be interesting there. these were generated on my laptop with:

make \
  LOG=/Users/benc/work/everylog/amps1-20080323-1935-su38n0k5.log clean \
  webpage.weights webpage.kara webpage

using the SVN log-processing code. -- From iraicu at cs.uchicago.edu Mon Mar 24 08:46:21 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Mon, 24 Mar 2008 08:46:21 -0500 Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <47E718F8.6010800@mcs.anl.gov> References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E1352E.3070808@mcs.anl.gov> <47E15F6A.5080109@mcs.anl.gov> <47E3A613.2040701@mcs.anl.gov> <1206108150.5100.0.camel@blabla.mcs.anl.gov> <47E718F8.6010800@mcs.anl.gov> Message-ID: <47E7B0AD.6000609@cs.uchicago.edu> Michael Wilde wrote: > > Ioan: I do see that we're losing some workers, so some investigation > is needed on the Falkon side.
> I see that some workers remain in a pending state for about a minute, after which the same number of workers that are in pending state register as new workers, and the number of available workers gets back up to what we had to begin with. I think I removed a few months back the mechanism that went through periodically and cleaned up the pending workers... which is leaving them stuck in a pending state in the logs. I'll try to add that back in. Now, for the real problem of why these workers are ending up in a pending state, and never going to a running state, we need to do more debugging. Ioan > > - Mike > > > On 3/23/08 7:12 PM, Ben Clifford wrote: >> On Fri, 21 Mar 2008, Mihael Hategan wrote: >> >>> On Fri, 2008-03-21 at 07:12 -0500, Michael Wilde wrote: >>>> My latest tests on runs of 25, 100, and 1000 jobs seem to indicate that >>>> with a sync command at the end of the application script, all job >>>> status >>>> and data is returned ok every time. >>> Why not put it in the wrapper script at the end? >> >> Mike, the attached patch will do that, and will also add logging >> information so that we can see how long syncs are taking compared to >> other stages in worker node execution. >> >> cd cog/modules/vdsk >> patch -p1 < sync-in-wrapper >> >> > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E.
58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== From iraicu at cs.uchicago.edu Mon Mar 24 08:54:05 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Mon, 24 Mar 2008 08:54:05 -0500 Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E1352E.3070808@mcs.anl.gov> <47E15F6A.5080109@mcs.anl.gov> <47E3A613.2040701@mcs.anl.gov> <1206108150.5100.0.camel@blabla.mcs.anl.gov> <47E718F8.6010800@mcs.anl.gov> Message-ID: <47E7B27D.6050606@cs.uchicago.edu> I see the plateau, but there are other graphs which seem to go crazy during those periods, such as http://www.ci.uchicago.edu/~benc/report-amps1-20080323-1935-su38n0k5/karatasks.FILE_TRANSFER-total.png http://www.ci.uchicago.edu/~benc/report-amps1-20080323-1935-su38n0k5/karatasks.FILE_OPERATION-total.png Looking at the Falkon logs might reveal more about whether the plateau was due to Falkon or not. Where would I find the Falkon logs that correlate to these graphs? Ioan Ben Clifford wrote: > you can get plots for your 1000 job run here: > > http://www.ci.uchicago.edu/~benc/report-amps1-20080323-1935-su38n0k5/ > > you're hitting the file transfer and file operation limits (that are 20 in > your config) once jobs start staging out. > > There's a weird-looking plateau in graph 'number of execute2 tasks at > once:' around 170s .. 200s where no jobs complete for some time. > > Getting the falkon logs and/or the wrapper (.d) logs would be interesting
> > these were generated on my laptop with: > > make \ > LOG=/Users/benc/work/everylog/amps1-20080323-1935-su38n0k5.log clean \ > webpage.weights webpage.kara webpage > > using the SVN log-processing code. > -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== From wilde at mcs.anl.gov Mon Mar 24 08:59:22 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 24 Mar 2008 08:59:22 -0500 Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <47E7B27D.6050606@cs.uchicago.edu> References: <47DDBFD7.2050700@mcs.anl.gov> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E1352E.3070808@mcs.anl.gov> <47E15F6A.5080109@mcs.anl.gov> <47E3A613.2040701@mcs.anl.gov> <1206108150.5100.0.camel@blabla.mcs.anl.gov> <47E718F8.6010800@mcs.anl.gov> <47E7B27D.6050606@cs.uchicago.edu> Message-ID: <47E7B3BA.9040502@mcs.anl.gov> On 3/24/08 8:54 AM, Ioan Raicu wrote: > I see the plateau, but there are other graphs which seem to go crazy > during those periods, such as > http://www.ci.uchicago.edu/~benc/report-amps1-20080323-1935-su38n0k5/karatasks.FILE_TRANSFER-total.png > > http://www.ci.uchicago.edu/~benc/report-amps1-20080323-1935-su38n0k5/karatasks.FILE_OPERATION-total.png > > > Looking at the Falkon logs might reveal more about whether the plateau was > due to Falkon or not. Where would I find the Falkon logs that correlate > to these graphs?
\ bblogin.mcs.anl.gov:/home/wilde/falkon/logs > > Ioan > > Ben Clifford wrote: >> you can get plots for your 1000 job run here: >> >> http://www.ci.uchicago.edu/~benc/report-amps1-20080323-1935-su38n0k5/ >> >> you're hitting the file transfer and file operation limits (that are >> 20 in your config) once jobs start staging out. >> >> There's a weird-looking plateau in graph 'number of execute2 tasks at >> once:' around 170s .. 200s where no jobs complete for some time. >> >> Getting the falkon logs and/or the wrapper (.d) logs would be >> interesting there. >> >> these were generated on my laptop with: >> >> make \ >> LOG=/Users/benc/work/everylog/amps1-20080323-1935-su38n0k5.log clean \ >> webpage.weights webpage.kara webpage >> >> using the SVN log-processing code. >> > From iraicu at cs.uchicago.edu Mon Mar 24 09:47:41 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Mon, 24 Mar 2008 09:47:41 -0500 Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <47E7B3BA.9040502@mcs.anl.gov> References: <47DDBFD7.2050700@mcs.anl.gov> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E1352E.3070808@mcs.anl.gov> <47E15F6A.5080109@mcs.anl.gov> <47E3A613.2040701@mcs.anl.gov> <1206108150.5100.0.camel@blabla.mcs.anl.gov> <47E718F8.6010800@mcs.anl.gov> <47E7B27D.6050606@cs.uchicago.edu> <47E7B3BA.9040502@mcs.anl.gov> Message-ID: <47E7BF0D.1010104@cs.uchicago.edu> Michael Wilde wrote: > > > On 3/24/08 8:54 AM, Ioan Raicu wrote: >> I see the plateau, but there are other graphs which seem to go crazy >> during those periods, such as >> http://www.ci.uchicago.edu/~benc/report-amps1-20080323-1935-su38n0k5/karatasks.FILE_TRANSFER-total.png >> >> http://www.ci.uchicago.edu/~benc/report-amps1-20080323-1935-su38n0k5/karatasks.FILE_OPERATION-total.png >> >> >> Looking at the Falkon logs might reveal more about whether the plateau was >> due to Falkon or not. Where would I find the Falkon logs that >> correlate to these graphs?
> \ > > bblogin.mcs.anl.gov:/home/wilde/falkon/logs > But which ones? There are about 8 dirs (or something similar) there, and each dir contains multiple runs... how can I tell which log, and which parts of the logs, are the run that is graphed by Ben? >>> From wilde at mcs.anl.gov Mon Mar 24 10:05:16 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 24 Mar 2008 10:05:16 -0500 Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <47E7BF0D.1010104@cs.uchicago.edu> References: <47DDBFD7.2050700@mcs.anl.gov> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E1352E.3070808@mcs.anl.gov> <47E15F6A.5080109@mcs.anl.gov> <47E3A613.2040701@mcs.anl.gov> <1206108150.5100.0.camel@blabla.mcs.anl.gov> <47E718F8.6010800@mcs.anl.gov> <47E7B27D.6050606@cs.uchicago.edu> <47E7B3BA.9040502@mcs.anl.gov> <47E7BF0D.1010104@cs.uchicago.edu> Message-ID: <47E7C32C.8040002@mcs.anl.gov> On 3/24/08 9:47 AM, Ioan Raicu wrote: >>> >>> Looking at the Falkon logs might reveal more about whether the plateau was >>> due to Falkon or not. Where would I find the Falkon logs that >>> correlate to these graphs? >> \ >> >> bblogin.mcs.anl.gov:/home/wilde/falkon/logs >> > But which ones? There are about 8 dirs (or something similar) there, > and each dir contains multiple runs... how can I tell which log, and > which parts of the logs, are the run that is graphed by Ben? >>> > The pointer to that was in this email: On 3/24/08 7:55 AM, Michael Wilde wrote: > > The summary logs are on bblogin in ~wilde/falkon. > > I keep swift run logs in ~wilde/swift/logs/runNNN and copy some of them > to the CI NFS at ~benc/swift-logs/wilde/runNNN. > > I'll start copying the falkon logs to the same place, but for the > previous tests you'll need to locate them separately. > > In the swift log dirs, amps*.log shows the run time (look at first and > last line). Ben pointed out that I'm doing data transfer throttling > which is slowing things down.
I did that intentionally at this stage to > avoid hurting sico NFS. I'll start opening that throttle up. From iraicu at cs.uchicago.edu Mon Mar 24 11:48:16 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Mon, 24 Mar 2008 11:48:16 -0500 Subject: [Swift-devel] Re: swift-falkon problem... plots to explain plateaus... In-Reply-To: <47E7B27D.6050606@cs.uchicago.edu> References: <47DDBFD7.2050700@mcs.anl.gov> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E1352E.3070808@mcs.anl.gov> <47E15F6A.5080109@mcs.anl.gov> <47E3A613.2040701@mcs.anl.gov> <1206108150.5100.0.camel@blabla.mcs.anl.gov> <47E718F8.6010800@mcs.anl.gov> <47E7B27D.6050606@cs.uchicago.edu> Message-ID: <47E7DB50.2020407@cs.uchicago.edu> .OK, here is my analysis of the plateaus, from Falkon's point of view. Notice the per task execution (green) is about 100 seconds per job, where the job is some invocation of the wrapper.sh that Swift sent to Falkon. Things look normal so far. See the 2nd graph for more... This shows that there are 600 workers (600 CPUs), which all get their work within 10 seconds... then they all churn away until about 100 sec when jobs start completing, and new ones get dispatched. At around 132 seconds, the wait queue is empty, and some workers start becoming idle (the red area)... by time 155, the initial 600 jobs that started between time 0 and 10, have completed, and from 155 to 211, the remaining 400 jobs all run to completion; they really only start completing around 190 sec, and all finish by 211. So, the plateau, that is evident here as well, is really when 400 workers are executing 400 jobs in parallel, and since the jobs are taking around 100 sec each to complete, the plateau of 50 seconds is completely normal. See more after the graph... Now the real question is, what is the breakdown of the 100 sec invocation (108.645 sec on average to be exact), how much is due to wrapper.sh, and how much is due to the application itself? 
Mike, can you comment on this? I assume you are running amiga which should have 0.5 sec jobs, right? Ioan Ioan Raicu wrote: > I see the plateau, but there are other graphs which seem to go crazy > during those periods, such as > http://www.ci.uchicago.edu/~benc/report-amps1-20080323-1935-su38n0k5/karatasks.FILE_TRANSFER-total.png > > http://www.ci.uchicago.edu/~benc/report-amps1-20080323-1935-su38n0k5/karatasks.FILE_OPERATION-total.png > > > Looking at the Falkon logs might reveal more about if the plateau was > due to Falkon or not. Where would I find the Falkon logs that > correlate to these graphs? > > Ioan > > Ben Clifford wrote: >> you can get plots for your 1000 job run here: >> >> http://www.ci.uchicago.edu/~benc/report-amps1-20080323-1935-su38n0k5/ >> >> you're hitting the file transfer and file operation limits (that are >> 20 in your config) once jobs start staging out. >> >> There's a wierd looking plateu in graph 'number of execute2 tasks at >> once:' around 170s .. 200s where no jobs complete for some time. >> >> Getting the falkon logs and/or the wrapper (.d) logs would be >> interesting there. >> >> these were generated on my laptop with: >> >> make \ >> LOG=/Users/benc/work/everylog/amps1-20080323-1935-su38n0k5.log clean \ >> webpage.weights webpage.kara webpage >> >> using the SVN log-procesisng code. >> > -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 
58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/jpeg Size: 38387 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/jpeg Size: 47424 bytes Desc: not available URL: From wilde at mcs.anl.gov Mon Mar 24 12:21:14 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 24 Mar 2008 12:21:14 -0500 Subject: [Fwd: Re: [Swift-devel] Re: swift-falkon problem... plots to explain plateaus...] In-Reply-To: <47E7DC3D.6040704@cs.uchicago.edu> References: <47E7DC3D.6040704@cs.uchicago.edu> Message-ID: <47E7E30A.2020208@mcs.anl.gov> > Now the real question is, what is the breakdown of the 100 sec > invocation (108.645 sec on average to be exact), how much is due to > wrapper.sh, and how much is due to the application itself? Mike, can > you comment on this? I assume you are running amiga which should have > 0.5 sec jobs, right? Amiga is about .5 secs and the script that runs (runam3) I think adds another .5 secs (from a quick scan of falkon logs on the actual task run time - but please verify, I think you have all the data from the task log). I suspect, as you and I both agree, that hundreds of short jobs starting in some small interval causes heavy NFS activity. The next round of testing we'll do should start to pick this apart, determine causes and prototype improvements.
- Mike On 3/24/08 11:52 AM, Ioan Raicu wrote: > Not sure if this email made it to the mailing list, due to the larger > size (128KB)... > > Ioan > > ------------------------------------------------------------------------ > > Subject: > Re: [Swift-devel] Re: swift-falkon problem... plots to explain plateaus... > From: > Ioan Raicu > Date: > Mon, 24 Mar 2008 11:48:16 -0500 > To: > Ben Clifford > > To: > Ben Clifford > CC: > swift-devel > > > .OK, here is my analysis of the plateaus, from Falkon's point of view. > > Notice the per task execution (green) is about 100 seconds per job, > where the job is some invocation of the wrapper.sh that Swift sent to > Falkon. Things look normal so far. See the 2nd graph for more... > > > This shows that there are 600 workers (600 CPUs), which all get their > work within 10 seconds... then they all churn away until about 100 sec > when jobs start completing, and new ones get dispatched. At around 132 > seconds, the wait queue is empty, and some workers start becoming idle > (the red area)... by time 155, the initial 600 jobs that started between > time 0 and 10, have completed, and from 155 to 211, the remaining 400 > jobs all run to completion; they really only start completing around 190 > sec, and all finish by 211. So, the plateau, that is evident here as > well, is really when 400 workers are executing 400 jobs in parallel, and > since the jobs are taking around 100 sec each to complete, the plateau > of 50 seconds is completely normal. See more after the graph... > > > Now the real question is, what is the breakdown of the 100 sec > invocation (108.645 sec on average to be exact), how much is due to > wrapper.sh, and how much is due to the application itself? Mike, can > you comment on this? I assume you are running amiga which should have > 0.5 sec jobs, right? 
> > Ioan > > Ioan Raicu wrote: >> I see the plateau, but there are other graphs which seem to go crazy >> during those periods, such as >> http://www.ci.uchicago.edu/~benc/report-amps1-20080323-1935-su38n0k5/karatasks.FILE_TRANSFER-total.png >> >> http://www.ci.uchicago.edu/~benc/report-amps1-20080323-1935-su38n0k5/karatasks.FILE_OPERATION-total.png >> >> >> Looking at the Falkon logs might reveal more about if the plateau was >> due to Falkon or not. Where would I find the Falkon logs that >> correlate to these graphs? >> >> Ioan >> >> Ben Clifford wrote: >>> you can get plots for your 1000 job run here: >>> >>> http://www.ci.uchicago.edu/~benc/report-amps1-20080323-1935-su38n0k5/ >>> >>> you're hitting the file transfer and file operation limits (that are >>> 20 in your config) once jobs start staging out. >>> >>> There's a wierd looking plateu in graph 'number of execute2 tasks at >>> once:' around 170s .. 200s where no jobs complete for some time. >>> >>> Getting the falkon logs and/or the wrapper (.d) logs would be >>> interesting there. >>> >>> these were generated on my laptop with: >>> >>> make \ >>> LOG=/Users/benc/work/everylog/amps1-20080323-1935-su38n0k5.log clean \ >>> webpage.weights webpage.kara webpage >>> >>> using the SVN log-procesisng code. >>> >> > > -- > =================================================== > Ioan Raicu > Ph.D. Candidate > =================================================== > Distributed Systems Laboratory > Computer Science Department > University of Chicago > 1100 E. 
58th Street, Ryerson Hall > Chicago, IL 60637 > =================================================== > Email: iraicu at cs.uchicago.edu > Web: http://www.cs.uchicago.edu/~iraicu > http://dev.globus.org/wiki/Incubator/Falkon > http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page > =================================================== > =================================================== > > > > ------------------------------------------------------------------------ > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From iraicu at cs.uchicago.edu Mon Mar 24 12:36:04 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Mon, 24 Mar 2008 12:36:04 -0500 Subject: [Fwd: Re: [Swift-devel] Re: swift-falkon problem... plots to explain plateaus...] In-Reply-To: <47E7E30A.2020208@mcs.anl.gov> References: <47E7DC3D.6040704@cs.uchicago.edu> <47E7E30A.2020208@mcs.anl.gov> Message-ID: <47E7E684.5070303@cs.uchicago.edu> Michael Wilde wrote: > > Now the real question is, what is the breakdown of the 100 sec > > invocation (108.645 sec on average to be exact), how much is due to > > wrapper.sh, and how much is due to the application itself? Mike, can > > you comment on this? I assume you are running amiga which should have > > 0.5 sec jobs, right? > > Amiga is about .5 secs and teh script that runs (runam3) I think adds > another .5 secs (from a quick scan of falkon logs on the actual task > run time - but please verify, I think you have all the data from the > task log). In the log with 1000 tasks, the shortest job was 72 secs, average 108, and max 170 sec. Is amiga working from RAM, or is it from NFS? If it's from NFS, how big is the input data and script? I thought it was about 10KB? The overall throughput was 6.6 jobs/sec, so that is only 66KB/s, which seems quite small, assuming that each read is done in large chunks, and not a few bytes at a time.
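The back-of-envelope numbers in the message above can be checked directly. All inputs in this sketch (1000 jobs, 600 workers, 108.645 s average task time, ~10KB input per job, 6.6 jobs/s overall throughput) are the figures quoted in the thread, not independent measurements:

```python
# Sanity check of the figures quoted in the thread; nothing here is measured,
# all values come from the emails above.
jobs = 1000
workers = 600
avg_task_s = 108.645

# The jobs run in two waves: 600 dispatched immediately, 400 queued behind them.
waves = -(-jobs // workers)             # ceiling division -> 2
ideal_makespan_s = waves * avg_task_s   # ~217 s; the run observed ~211 s

# Aggregate read rate implied by the quoted throughput and assumed input size.
throughput_jobs_s = 6.6
input_kb_per_job = 10
agg_read_kb_s = throughput_jobs_s * input_kb_per_job   # 66 KB/s, as quoted

print(waves, round(ideal_makespan_s, 2), round(agg_read_kb_s, 1))
```

The two-wave makespan (~217 s) lines up with the observed finish at ~211 s, which supports the reading that the plateau is normal tail behaviour rather than a stall.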
> > I suspect, as you and I both agree, that hundreds of short jobs > starting in some small interval causes heavy NFS activity. Yes, but is the NFS activity due to the app, or due to wrapper.sh? I would replace the amiga app with a sleep 0.5, or sleep 1, just to see if the graph looks much different or not. That will surely isolate the overhead from your app or wrapper.sh. Ioan > The next round of testing we'll do should start to pick this apart, > determine causes and prototype improvements. > > - Mike > > > On 3/24/08 11:52 AM, Ioan Raicu wrote: >> Not sure if this email made it to the mailing list, due to the larger >> size (128KB)... >> >> Ioan >> >> ------------------------------------------------------------------------ >> >> Subject: >> Re: [Swift-devel] Re: swift-falkon problem... plots to explain >> plateaus... >> From: >> Ioan Raicu >> Date: >> Mon, 24 Mar 2008 11:48:16 -0500 >> To: >> Ben Clifford >> >> To: >> Ben Clifford >> CC: >> swift-devel >> >> >> .OK, here is my analysis of the plateaus, from Falkon's point of view. >> >> Notice the per task execution (green) is about 100 seconds per job, >> where the job is some invocation of the wrapper.sh that Swift sent to >> Falkon. Things look normal so far. See the 2nd graph for more... >> >> >> This shows that there are 600 workers (600 CPUs), which all get their >> work within 10 seconds... then they all churn away until about 100 >> sec when jobs start completing, and new ones get dispatched. At >> around 132 seconds, the wait queue is empty, and some workers start >> becoming idle (the red area)... by time 155, the initial 600 jobs >> that started between time 0 and 10, have completed, and from 155 to >> 211, the remaining 400 jobs all run to completion; they really only >> start completing around 190 sec, and all finish by 211. 
So, the >> plateau, that is evident here as well, is really when 400 workers are >> executing 400 jobs in parallel, and since the jobs are taking around >> 100 sec each to complete, the plateau of 50 seconds is completely >> normal. See more after the graph... >> >> >> Now the real question is, what is the breakdown of the 100 sec >> invocation (108.645 sec on average to be exact), how much is due to >> wrapper.sh, and how much is due to the application itself? Mike, can >> you comment on this? I assume you are running amiga which should >> have 0.5 sec jobs, right? >> >> Ioan >> >> Ioan Raicu wrote: >>> I see the plateau, but there are other graphs which seem to go crazy >>> during those periods, such as >>> http://www.ci.uchicago.edu/~benc/report-amps1-20080323-1935-su38n0k5/karatasks.FILE_TRANSFER-total.png >>> >>> http://www.ci.uchicago.edu/~benc/report-amps1-20080323-1935-su38n0k5/karatasks.FILE_OPERATION-total.png >>> >>> >>> Looking at the Falkon logs might reveal more about if the plateau >>> was due to Falkon or not. Where would I find the Falkon logs that >>> correlate to these graphs? >>> >>> Ioan >>> >>> Ben Clifford wrote: >>>> you can get plots for your 1000 job run here: >>>> >>>> http://www.ci.uchicago.edu/~benc/report-amps1-20080323-1935-su38n0k5/ >>>> >>>> you're hitting the file transfer and file operation limits (that >>>> are 20 in your config) once jobs start staging out. >>>> >>>> There's a wierd looking plateu in graph 'number of execute2 tasks >>>> at once:' around 170s .. 200s where no jobs complete for some time. >>>> >>>> Getting the falkon logs and/or the wrapper (.d) logs would be >>>> interesting there. >>>> >>>> these were generated on my laptop with: >>>> >>>> make \ >>>> LOG=/Users/benc/work/everylog/amps1-20080323-1935-su38n0k5.log >>>> clean \ >>>> webpage.weights webpage.kara webpage >>>> >>>> using the SVN log-procesisng code. >>>> >>> >> >> -- >> =================================================== >> Ioan Raicu >> Ph.D. 
Candidate >> =================================================== >> Distributed Systems Laboratory >> Computer Science Department >> University of Chicago >> 1100 E. 58th Street, Ryerson Hall >> Chicago, IL 60637 >> =================================================== >> Email: iraicu at cs.uchicago.edu >> Web: http://www.cs.uchicago.edu/~iraicu >> http://dev.globus.org/wiki/Incubator/Falkon >> http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page >> =================================================== >> =================================================== >> >> >> >> ------------------------------------------------------------------------ >> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== From wilde at mcs.anl.gov Mon Mar 24 12:59:31 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 24 Mar 2008 12:59:31 -0500 Subject: [Fwd: Re: [Swift-devel] Re: swift-falkon problem... plots to explain plateaus...] 
In-Reply-To: <47E7E684.5070303@cs.uchicago.edu> References: <47E7DC3D.6040704@cs.uchicago.edu> <47E7E30A.2020208@mcs.anl.gov> <47E7E684.5070303@cs.uchicago.edu> Message-ID: <47E7EC03.1090609@mcs.anl.gov> On 3/24/08 12:36 PM, Ioan Raicu wrote: > > > Michael Wilde wrote: >> > Now the real question is, what is the breakdown of the 100 sec >> > invocation (108.645 sec on average to be exact), how much is due to >> > wrapper.sh, and how much is due to the application itself? Mike, can >> > you comment on this? I assume you are running amiga which should have >> > 0.5 sec jobs, right? >> >> Amiga is about .5 secs and teh script that runs (runam3) I think adds >> another .5 secs (from a quick scan of falkon logs on the actual task >> run time - but please verify, I think you have all the data from the >> task log). > The log with 1000 tasks, the shortest job was 72 secs, average 108, and > max 170 sec. Is amiga working from RAM, or is it from NFS? If its from > NFS, how big is the input data and script? I thought it was about > 10KB? The overall throughput was 6.6 jobs/sec, so that is only 66KB/s, > which seems quite small, assuming that each read is done in large > chunks, and not a few bytes at a time. >> >> I suspect, as you and I both agree, that hundreds of short jobs >> starting in some small interval causes heavy NFS activity. > Yes, but is the NFS activity due to the app, or due to wrapper.sh? It's due to both: wrapper.sh fetches the app script from NFS, which in turn fetches the app from NFS. Then wrapper.sh does its setup, which causes more (synchronous) NFS activity; then the app output is copied, then fetched back to the run directory. All this is dominated, I suspect, by NFS request overhead, most of which is not data transfer. There's really nothing to discuss regarding this until I get some data from tests. - Mike > > I would replace the amiga app with a sleep 0.5, or sleep 1, just to see > if the graph looks much different or not.
That will surely isolate the > overhead from your app or wrapper.sh. > > Ioan >> The next round of testing we'll do should start to pick this apart, >> determine causes and prototype improvements. >> >> - Mike >> >> >> On 3/24/08 11:52 AM, Ioan Raicu wrote: >>> Not sure if this email made it to the mailing list, due to the larger >>> size (128KB)... >>> >>> Ioan >>> >>> ------------------------------------------------------------------------ >>> >>> Subject: >>> Re: [Swift-devel] Re: swift-falkon problem... plots to explain >>> plateaus... >>> From: >>> Ioan Raicu >>> Date: >>> Mon, 24 Mar 2008 11:48:16 -0500 >>> To: >>> Ben Clifford >>> >>> To: >>> Ben Clifford >>> CC: >>> swift-devel >>> >>> >>> .OK, here is my analysis of the plateaus, from Falkon's point of view. >>> >>> Notice the per task execution (green) is about 100 seconds per job, >>> where the job is some invocation of the wrapper.sh that Swift sent to >>> Falkon. Things look normal so far. See the 2nd graph for more... >>> >>> >>> This shows that there are 600 workers (600 CPUs), which all get their >>> work within 10 seconds... then they all churn away until about 100 >>> sec when jobs start completing, and new ones get dispatched. At >>> around 132 seconds, the wait queue is empty, and some workers start >>> becoming idle (the red area)... by time 155, the initial 600 jobs >>> that started between time 0 and 10, have completed, and from 155 to >>> 211, the remaining 400 jobs all run to completion; they really only >>> start completing around 190 sec, and all finish by 211. So, the >>> plateau, that is evident here as well, is really when 400 workers are >>> executing 400 jobs in parallel, and since the jobs are taking around >>> 100 sec each to complete, the plateau of 50 seconds is completely >>> normal. See more after the graph... 
>>> >>> >>> Now the real question is, what is the breakdown of the 100 sec >>> invocation (108.645 sec on average to be exact), how much is due to >>> wrapper.sh, and how much is due to the application itself? Mike, can >>> you comment on this? I assume you are running amiga which should >>> have 0.5 sec jobs, right? >>> >>> Ioan >>> >>> Ioan Raicu wrote: >>>> I see the plateau, but there are other graphs which seem to go crazy >>>> during those periods, such as >>>> http://www.ci.uchicago.edu/~benc/report-amps1-20080323-1935-su38n0k5/karatasks.FILE_TRANSFER-total.png >>>> >>>> http://www.ci.uchicago.edu/~benc/report-amps1-20080323-1935-su38n0k5/karatasks.FILE_OPERATION-total.png >>>> >>>> >>>> Looking at the Falkon logs might reveal more about if the plateau >>>> was due to Falkon or not. Where would I find the Falkon logs that >>>> correlate to these graphs? >>>> >>>> Ioan >>>> >>>> Ben Clifford wrote: >>>>> you can get plots for your 1000 job run here: >>>>> >>>>> http://www.ci.uchicago.edu/~benc/report-amps1-20080323-1935-su38n0k5/ >>>>> >>>>> you're hitting the file transfer and file operation limits (that >>>>> are 20 in your config) once jobs start staging out. >>>>> >>>>> There's a wierd looking plateu in graph 'number of execute2 tasks >>>>> at once:' around 170s .. 200s where no jobs complete for some time. >>>>> >>>>> Getting the falkon logs and/or the wrapper (.d) logs would be >>>>> interesting there. >>>>> >>>>> these were generated on my laptop with: >>>>> >>>>> make \ >>>>> LOG=/Users/benc/work/everylog/amps1-20080323-1935-su38n0k5.log >>>>> clean \ >>>>> webpage.weights webpage.kara webpage >>>>> >>>>> using the SVN log-procesisng code. >>>>> >>>> >>> >>> -- >>> =================================================== >>> Ioan Raicu >>> Ph.D. Candidate >>> =================================================== >>> Distributed Systems Laboratory >>> Computer Science Department >>> University of Chicago >>> 1100 E. 
58th Street, Ryerson Hall >>> Chicago, IL 60637 >>> =================================================== >>> Email: iraicu at cs.uchicago.edu >>> Web: http://www.cs.uchicago.edu/~iraicu >>> http://dev.globus.org/wiki/Incubator/Falkon >>> http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page >>> =================================================== >>> =================================================== >>> >>> >>> >>> ------------------------------------------------------------------------ >>> >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> > From hategan at mcs.anl.gov Mon Mar 24 13:03:12 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 24 Mar 2008 13:03:12 -0500 Subject: [Fwd: Re: [Swift-devel] Re: swift-falkon problem... plots to explain plateaus...] In-Reply-To: <47E7EC03.1090609@mcs.anl.gov> References: <47E7DC3D.6040704@cs.uchicago.edu> <47E7E30A.2020208@mcs.anl.gov> <47E7E684.5070303@cs.uchicago.edu> <47E7EC03.1090609@mcs.anl.gov> Message-ID: <1206381792.11561.1.camel@blabla.mcs.anl.gov> > Its due to both. wrapper.sh fetches the app script from nfs which > fetches the app from nfs. then wrapper.sh does its setup, which causes > more (synchronous) nfs activity, then the app output is copied, then > fetched back to the run directory. > > All this is dominated I suspect by nfs request overhead, most of which > is not data transfer. > > There's really nothing to discuss regarding this until I get some data > from tests. As far as I can remember, Ben added fairly comprehensive logging to the wrapper. That may shed some light on the issue. 
Mihael From benc at hawaga.org.uk Mon Mar 24 15:16:13 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 24 Mar 2008 20:16:13 +0000 (GMT) Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <47E7B27D.6050606@cs.uchicago.edu> References: <47DDBFD7.2050700@mcs.anl.gov> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E1352E.3070808@mcs.anl.gov> <47E15F6A.5080109@mcs.anl.gov> <47E3A613.2040701@mcs.anl.gov> <1206108150.5100.0.camel@blabla.mcs.anl.gov> <47E718F8.6010800@mcs.anl.gov> <47E7B27D.6050606@cs.uchicago.edu> Message-ID: On Mon, 24 Mar 2008, Ioan Raicu wrote: > I see the plateau, but there are other graphs which seem to go crazy during > those periods, such as > http://www.ci.uchicago.edu/~benc/report-amps1-20080323-1935-su38n0k5/karatasks.FILE_TRANSFER-total.png > http://www.ci.uchicago.edu/~benc/report-amps1-20080323-1935-su38n0k5/karatasks.FILE_OPERATION-total.png > > Looking at the Falkon logs might reveal more about if the plateau was due to > Falkon or not. Where would I find the Falkon logs that correlate to these > graphs? I haven't looked at the falkon logs because I don't have them, but that load corresponds roughly with a bunch of jobs apparently finishing and their results being staged out. As far as I can tell, the workflow looks like minimal stage-in (so not much FILE_TRANSFER load), then a bunch of jobs that take around 120s; those all start finishing around t=120, so now swift is doing lots of file transfer. -- From benc at hawaga.org.uk Mon Mar 24 15:24:13 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 24 Mar 2008 20:24:13 +0000 (GMT) Subject: [Swift-devel] Re: swift-falkon problem... plots to explain plateaus...
In-Reply-To: <47E7DB50.2020407@cs.uchicago.edu> References: <47DDBFD7.2050700@mcs.anl.gov> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E1352E.3070808@mcs.anl.gov> <47E15F6A.5080109@mcs.anl.gov> <47E3A613.2040701@mcs.anl.gov> <1206108150.5100.0.camel@blabla.mcs.anl.gov> <47E718F8.6010800@mcs.anl.gov> <47E7B27D.6050606@cs.uchicago.edu> <47E7DB50.2020407@cs.uchicago.edu> Message-ID: On Mon, 24 Mar 2008, Ioan Raicu wrote: > start completing, and new ones get dispatched. At around 132 seconds, the > wait queue is empty, and some workers start becoming idle (the red area)... by Ideally swift would be keeping the queue full until there is nothing left to send. There shouldn't be two distinct 600 and 400 job bursts. But I guess that may be because the job throttling isn't set to a large enough infinity. > Now the real question is, what is the breakdown of the 100 sec invocation > (108.645 sec on average to be exact), how much is due to wrapper.sh, and how > much is due to the application itself? Mike, can you comment on this? I > assume you are running amiga which should have 0.5 sec jobs, right? running with wrapperlog.always.transfer=true will grab the raw data for this. -- From wilde at mcs.anl.gov Mon Mar 24 17:13:01 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 24 Mar 2008 17:13:01 -0500 Subject: [Swift-devel] Why build prompts in redist? In-Reply-To: References: <47E0422D.9000707@mcs.anl.gov> Message-ID: <47E8276D.1020809@mcs.anl.gov> I just updated to 1756 and I still get same prompt. This is not a big deal, just thought you'd want to know. My command was: ant -Dwith-provider-deef redist in cog/modules/vdsk - Mike dist.dir.warning: [input] [input] ====================================================================================== [input] Warning! The specified target directory (/radix-homes01/wilde/swift/src/cog/modules/vdsk/../..//modules/vdsk/dist/vdsk-svn) does not seem to contain a Swift build. 
[input] Press Return to continue with the build or CTRL+C to abort... [input] ====================================================================================== [input] On 3/18/08 8:40 PM, Ben Clifford wrote: >> [input] Warning! The specified target directory >> (/radix-homes01/wilde/swift/src/cog/modules/vdsk/../..//modules/vdsk/dist/vdsk-svn) >> does not seem to contain a Swift build. >> [input] Press Return to continue with the build or CTRL+C to abort... > > > As of r1738 (to provider-deef) this does not happen any more. > From wilde at mcs.anl.gov Mon Mar 24 17:14:10 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 24 Mar 2008 17:14:10 -0500 Subject: [Swift-devel] Why build prompts in redist? In-Reply-To: <47E8276D.1020809@mcs.anl.gov> References: <47E0422D.9000707@mcs.anl.gov> <47E8276D.1020809@mcs.anl.gov> Message-ID: <47E827B2.1090802@mcs.anl.gov> correction: 1759 On 3/24/08 5:13 PM, Michael Wilde wrote: > I just updated to 1756 and I still get same prompt. This is not a big > deal, just thought you'd want to know. > > My command was: > > ant -Dwith-provider-deef redist > > in cog/modules/vdsk > > - Mike > > > dist.dir.warning: > [input] > [input] > ====================================================================================== > > [input] Warning! The specified target directory > (/radix-homes01/wilde/swift/src/cog/modules/vdsk/../..//modules/vdsk/dist/vdsk-svn) > does not seem to contain a Swift build. > [input] Press Return to continue with the build or CTRL+C to abort... > [input] > ====================================================================================== > > [input] > > > On 3/18/08 8:40 PM, Ben Clifford wrote: >>> [input] Warning! The specified target directory >>> (/radix-homes01/wilde/swift/src/cog/modules/vdsk/../..//modules/vdsk/dist/vdsk-svn) >>> >>> does not seem to contain a Swift build. >>> [input] Press Return to continue with the build or CTRL+C to >>> abort... 
>> >> >> As of r1738 (to provider-deef) this does not happen any more. >> > From hategan at mcs.anl.gov Mon Mar 24 17:20:14 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 24 Mar 2008 17:20:14 -0500 Subject: [Swift-devel] Why build prompts in redist? In-Reply-To: <47E8276D.1020809@mcs.anl.gov> References: <47E0422D.9000707@mcs.anl.gov> <47E8276D.1020809@mcs.anl.gov> Message-ID: <1206397214.25801.1.camel@blabla.mcs.anl.gov> Seems quite obvious, given that the warning was meant to deal with a situation where one would build provider-deef after building swift, whereas the -Dwith-provider-swift trick does precisely the opposite. On Mon, 2008-03-24 at 17:13 -0500, Michael Wilde wrote: > I just updated to 1756 and I still get same prompt. This is not a big > deal, just thought you'd want to know. > > My command was: > > ant -Dwith-provider-deef redist > > in cog/modules/vdsk > > - Mike > > > dist.dir.warning: > [input] > [input] > ====================================================================================== > [input] Warning! The specified target directory > (/radix-homes01/wilde/swift/src/cog/modules/vdsk/../..//modules/vdsk/dist/vdsk-svn) > does not seem to contain a Swift build. > [input] Press Return to continue with the build or CTRL+C to abort... > [input] > ====================================================================================== > [input] > > > On 3/18/08 8:40 PM, Ben Clifford wrote: > >> [input] Warning! The specified target directory > >> (/radix-homes01/wilde/swift/src/cog/modules/vdsk/../..//modules/vdsk/dist/vdsk-svn) > >> does not seem to contain a Swift build. > >> [input] Press Return to continue with the build or CTRL+C to abort... > > > > > > As of r1738 (to provider-deef) this does not happen any more. 
> > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From benc at hawaga.org.uk Mon Mar 24 17:22:30 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 24 Mar 2008 22:22:30 +0000 (GMT) Subject: [Swift-devel] Why build prompts in redist? In-Reply-To: <47E8276D.1020809@mcs.anl.gov> References: <47E0422D.9000707@mcs.anl.gov> <47E8276D.1020809@mcs.anl.gov> Message-ID: can you type svn info in cog/modules/provider-deef ? On Mon, 24 Mar 2008, Michael Wilde wrote: > I just updated to 1756 and I still get same prompt. This is not a big deal, > just thought you'd want to know. > > My command was: > > ant -Dwith-provider-deef redist > > in cog/modules/vdsk > > - Mike > > > dist.dir.warning: > [input] > [input] > ====================================================================================== > [input] Warning! The specified target directory > (/radix-homes01/wilde/swift/src/cog/modules/vdsk/../..//modules/vdsk/dist/vdsk-svn) > does not seem to contain a Swift build. > [input] Press Return to continue with the build or CTRL+C to abort... > [input] > ====================================================================================== > [input] > > > On 3/18/08 8:40 PM, Ben Clifford wrote: > > > [input] Warning! The specified target directory > > > (/radix-homes01/wilde/swift/src/cog/modules/vdsk/../..//modules/vdsk/dist/vdsk-svn) > > > does not seem to contain a Swift build. > > > [input] Press Return to continue with the build or CTRL+C to abort... > > > > > > As of r1738 (to provider-deef) this does not happen any more. > > > > From hategan at mcs.anl.gov Mon Mar 24 17:22:59 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 24 Mar 2008 17:22:59 -0500 Subject: [Swift-devel] Why build prompts in redist? 
In-Reply-To: <1206397214.25801.1.camel@blabla.mcs.anl.gov> References: <47E0422D.9000707@mcs.anl.gov> <47E8276D.1020809@mcs.anl.gov> <1206397214.25801.1.camel@blabla.mcs.anl.gov> Message-ID: <1206397379.26235.0.camel@blabla.mcs.anl.gov> However, given that Ben removed that warning entirely, I'm led to believe that your provider-deef isn't up to date. On Mon, 2008-03-24 at 17:20 -0500, Mihael Hategan wrote: > Seems quite obvious, given that the warning was meant to deal with a > situation where one would build provider-deef after building swift, > whereas the -Dwith-provider-swift trick does precisely the opposite. > > On Mon, 2008-03-24 at 17:13 -0500, Michael Wilde wrote: > > I just updated to 1756 and I still get same prompt. This is not a big > > deal, just thought you'd want to know. > > > > My command was: > > > > ant -Dwith-provider-deef redist > > > > in cog/modules/vdsk > > > > - Mike > > > > > > dist.dir.warning: > > [input] > > [input] > > ====================================================================================== > > [input] Warning! The specified target directory > > (/radix-homes01/wilde/swift/src/cog/modules/vdsk/../..//modules/vdsk/dist/vdsk-svn) > > does not seem to contain a Swift build. > > [input] Press Return to continue with the build or CTRL+C to abort... > > [input] > > ====================================================================================== > > [input] > > > > > > On 3/18/08 8:40 PM, Ben Clifford wrote: > > >> [input] Warning! The specified target directory > > >> (/radix-homes01/wilde/swift/src/cog/modules/vdsk/../..//modules/vdsk/dist/vdsk-svn) > > >> does not seem to contain a Swift build. > > >> [input] Press Return to continue with the build or CTRL+C to abort... > > > > > > > > > As of r1738 (to provider-deef) this does not happen any more.
> > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From benc at hawaga.org.uk Mon Mar 24 18:36:45 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 24 Mar 2008 23:36:45 +0000 (GMT) Subject: [Fwd: Re: [Swift-devel] Re: swift-falkon problem... plots to explain plateaus...] In-Reply-To: <1206381792.11561.1.camel@blabla.mcs.anl.gov> References: <47E7DC3D.6040704@cs.uchicago.edu> <47E7E30A.2020208@mcs.anl.gov> <47E7E684.5070303@cs.uchicago.edu> <47E7EC03.1090609@mcs.anl.gov> <1206381792.11561.1.camel@blabla.mcs.anl.gov> Message-ID: On Mon, 24 Mar 2008, Mihael Hategan wrote: > As far as I can remember, Ben added fairly comprehensive logging to the > wrapper. That may shed some light on the issue. Indeed I did; and that logging information can be sent back to the submit host by enabling wrapperlog.always.transfer=true -- From wilde at mcs.anl.gov Mon Mar 24 19:30:14 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 24 Mar 2008 19:30:14 -0500 Subject: [Swift-devel] Why build prompts in redist? In-Reply-To: <1206397379.26235.0.camel@blabla.mcs.anl.gov> References: <47E0422D.9000707@mcs.anl.gov> <47E8276D.1020809@mcs.anl.gov> <1206397214.25801.1.camel@blabla.mcs.anl.gov> <1206397379.26235.0.camel@blabla.mcs.anl.gov> Message-ID: <47E84796.30307@mcs.anl.gov> It was - thanks. I missed Ben's initial note "As of r1738 (to provider-deef)". That fixed it. - Mike On 3/24/08 5:22 PM, Mihael Hategan wrote: > However, given that Ben removed that warning entirely, causes me to > believe that your provider-deef isn't up to date. 
> > On Mon, 2008-03-24 at 17:20 -0500, Mihael Hategan wrote: >> Seems quite obvious, given that the warning was meant to deal with a >> situation where one would build provider-deef after building swift, >> whereas the -Dwith-provider-swift trick does precisely the opposite. >> >> On Mon, 2008-03-24 at 17:13 -0500, Michael Wilde wrote: >>> I just updated to 1756 and I still get same prompt. This is not a big >>> deal, just thought you'd want to know. >>> >>> My command was: >>> >>> ant -Dwith-provider-deef redist >>> >>> in cog/modules/vdsk >>> >>> - Mike >>> >>> >>> dist.dir.warning: >>> [input] >>> [input] >>> ====================================================================================== >>> [input] Warning! The specified target directory >>> (/radix-homes01/wilde/swift/src/cog/modules/vdsk/../..//modules/vdsk/dist/vdsk-svn) >>> does not seem to contain a Swift build. >>> [input] Press Return to continue with the build or CTRL+C to abort... >>> [input] >>> ====================================================================================== >>> [input] >>> >>> >>> On 3/18/08 8:40 PM, Ben Clifford wrote: >>>>> [input] Warning! The specified target directory >>>>> (/radix-homes01/wilde/swift/src/cog/modules/vdsk/../..//modules/vdsk/dist/vdsk-svn) >>>>> does not seem to contain a Swift build. >>>>> [input] Press Return to continue with the build or CTRL+C to abort... >>>> >>>> As of r1738 (to provider-deef) this does not happen any more. 
>>>> >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> > > From wilde at mcs.anl.gov Mon Mar 24 22:15:38 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 24 Mar 2008 22:15:38 -0500 Subject: [Fwd: Re: [Swift-devel] Re: swift-falkon problem... plots to explain plateaus...] In-Reply-To: References: <47E7DC3D.6040704@cs.uchicago.edu> <47E7E30A.2020208@mcs.anl.gov> <47E7E684.5070303@cs.uchicago.edu> <47E7EC03.1090609@mcs.anl.gov> <1206381792.11561.1.camel@blabla.mcs.anl.gov> Message-ID: <47E86E5A.6080303@mcs.anl.gov> Ben, do you have a script to sum the time spent per step of wrapper.sh, over a set of -info files? On 3/24/08 6:36 PM, Ben Clifford wrote: > On Mon, 24 Mar 2008, Mihael Hategan wrote: > >> As far as I can remember, Ben added fairly comprehensive logging to the >> wrapper. That may shed some light on the issue. > > Indeed I did; and that logging information can be sent back to the submit > host by enabling wrapperlog.always.transfer=true > From iraicu at cs.uchicago.edu Mon Mar 24 22:21:39 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Mon, 24 Mar 2008 22:21:39 -0500 Subject: [Swift-devel] Re: swift-falkon problem... plots to explain plateaus...
In-Reply-To: References: <47DDBFD7.2050700@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E1352E.3070808@mcs.anl.gov> <47E15F6A.5080109@mcs.anl.gov> <47E3A613.2040701@mcs.anl.gov> <1206108150.5100.0.camel@blabla.mcs.anl.gov> <47E718F8.6010800@mcs.anl.gov> <47E7B27D.6050606@cs.uchicago.edu> <47E7DB50.2020407@cs.uchicago.edu> Message-ID: <47E86FC3.80904@cs.uchicago.edu> Ben Clifford wrote: > On Mon, 24 Mar 2008, Ioan Raicu wrote: > > >> start completing, and new ones get dispatched. At around 132 seconds, the >> wait queue is empty, and some workers start becoming idle (the red area)... by >> > > Ideally swift would be keeping the queue full until there is nothing left > to send. There shouldn't be two distinct 600 and 400 job bursts. But I > guess that may be because the job throttling isn't set to a large enough > infinity. > Its not the throttling... Swift sent all 1000 tasks at once (all within the first 10 seconds). There were 600 workers running on 600 CPUs, so 600 (of the 1000) tasks went from the wait queue to the running state, and there were 400 tasks left in the wait queue. After some time, the first round of tasks (the first 600) completed and the second round of tasks (400) went from the wait queue to the running state. So, the two distinct rounds, 600 then 400 is because of the 600 CPUs and 1000 total tasks... Ioan -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 
58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== -------------- next part -------------- An HTML attachment was scrubbed... URL: From wilde at mcs.anl.gov Tue Mar 25 00:28:44 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 25 Mar 2008 00:28:44 -0500 Subject: [Fwd: Re: [Swift-devel] Re: swift-falkon problem... plots to explain plateaus...] In-Reply-To: <47E86E5A.6080303@mcs.anl.gov> References: <47E7DC3D.6040704@cs.uchicago.edu> <47E7E30A.2020208@mcs.anl.gov> <47E7E684.5070303@cs.uchicago.edu> <47E7EC03.1090609@mcs.anl.gov> <1206381792.11561.1.camel@blabla.mcs.anl.gov> <47E86E5A.6080303@mcs.anl.gov> Message-ID: <47E88D8C.4090207@mcs.anl.gov> I eyeballed the wrapperlogs to get a rough idea of what was happening. I ran with wrapperlog saving and no other changes for wf's of 10, 100 and 500 jobs, to see how the exec time grew. At 500 jobs it grew to about 30+ seconds for a core app exec time of about 1 sec. (I'm just recollecting the times as at this point I didn't write much down). First results showed more time spent in the app wrapper than in wrapper.sh. I remedied this by using /tmp as the app-wrapper's working dir, and caching the app binary on /tmp. This brought a 20+ sec app exec time down to about 3 seconds. With this fixed, the total time in wrapper.sh including the app is now about 15 seconds, with 3 being in the app-wrapper itself. The time seems about evenly spread over the several wrapper.sh operations, which is not surprising when 500 wrappers hit NFS all at once.
I then tried 3 more tests: - a run to see if the app-executable caching on /tmp had an effect (it didn't) - a run to see if turning off wrapperlog retrieval had an effect - a run with data operation throttles (both) set to 100 from 10 None of these last three things had a significant effect. Tomorrow I will try some mods to the wrapper script. Turning off wrapper logging in a previous trial yesterday *seemed* to shave 20-30% off the run time. I need to verify this. I'm also going to try to use /tmp for the jobdir and reduce wrapper.sh overhead; also will leave the (tiny) job output on /tmp for later aggregation (will have some swift questions on that). Ben, if you want to look at any of these logs, the runs are in swift-logs/wilde in the order described above (w/comment files): 346: 10 job workflow 347: 100 job wf 348: 500 job wf 349: 500 jobs w/ improved app-wrapper 350: 500 jobs w/ improved app-wrapper & executable on /tmp 351: 500 jobs, wrapperlog saving off 352: 500 jobs, wrapperlog saving off, data throttles at 100 (from 20) All but the first of these should have falkon logs saved as well. I have several ideas on how to proceed, but welcome advice and any discoveries from log analysis. Thanks, Mike On 3/24/08 10:15 PM, Michael Wilde wrote: > Ben, do you have a script to sum the time spent per step of wrapper.sh, > over a set of -info files? > > On 3/24/08 6:36 PM, Ben Clifford wrote: >> On Mon, 24 Mar 2008, Mihael Hategan wrote: >> >>> As far as I can remember, Ben added fairly comprehensive logging to the >>> wrapper. That may shed some light on the issue. >> >> Indeed I did; and that logging information can be sent back to the >> submit host by enabling wrapperlog.always.transfer=true >> > From hategan at mcs.anl.gov Tue Mar 25 03:31:19 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 25 Mar 2008 03:31:19 -0500 Subject: [Fwd: Re: [Swift-devel] Re: swift-falkon problem... plots to explain plateaus...]
In-Reply-To: <47E88D8C.4090207@mcs.anl.gov> References: <47E7DC3D.6040704@cs.uchicago.edu> <47E7E30A.2020208@mcs.anl.gov> <47E7E684.5070303@cs.uchicago.edu> <47E7EC03.1090609@mcs.anl.gov> <1206381792.11561.1.camel@blabla.mcs.anl.gov> <47E86E5A.6080303@mcs.anl.gov> <47E88D8C.4090207@mcs.anl.gov> Message-ID: <1206433879.26701.0.camel@blabla.mcs.anl.gov> On Tue, 2008-03-25 at 00:28 -0500, Michael Wilde wrote: > I eyeballed the wrapperlogs to get a rough idea of what was happening. > > I ran with wrapperlog saving and no other changes for wf's of 10, 100 > and 500 jobs, to see how the exec time grew. At 500 jobs it grew to > about 30+ seconds for a core app exec time of about 1 sec. (Im just > recollecting the times as at this point I didnt write much down). > I would personally like to see those logs. From wilde at mcs.anl.gov Tue Mar 25 08:16:07 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 25 Mar 2008 08:16:07 -0500 Subject: [Fwd: Re: [Swift-devel] Re: swift-falkon problem... plots to explain plateaus...] In-Reply-To: <1206433879.26701.0.camel@blabla.mcs.anl.gov> References: <47E7DC3D.6040704@cs.uchicago.edu> <47E7E30A.2020208@mcs.anl.gov> <47E7E684.5070303@cs.uchicago.edu> <47E7EC03.1090609@mcs.anl.gov> <1206381792.11561.1.camel@blabla.mcs.anl.gov> <47E86E5A.6080303@mcs.anl.gov> <47E88D8C.4090207@mcs.anl.gov> <1206433879.26701.0.camel@blabla.mcs.anl.gov> Message-ID: <47E8FB17.1080501@mcs.anl.gov> On 3/25/08 3:31 AM, Mihael Hategan wrote: > On Tue, 2008-03-25 at 00:28 -0500, Michael Wilde wrote: >> I eyeballed the wrapperlogs to get a rough idea of what was happening. >> >> I ran with wrapperlog saving and no other changes for wf's of 10, 100 >> and 500 jobs, to see how the exec time grew. At 500 jobs it grew to >> about 30+ seconds for a core app exec time of about 1 sec. (Im just >> recollecting the times as at this point I didnt write much down). >> > > I would personally like to see those logs. 
I listed all the runs in the previous mail (below), Mihael. They are on CI NFS at ~benc/swift-logs/wilde/run{345-350}. Let us know what you find. Thanks, - Mike On 3/25/08 12:28 AM, Michael Wilde wrote: > I eyeballed the wrapperlogs to get a rough idea of what was happening. > > I ran with wrapperlog saving and no other changes for wf's of 10, 100 > and 500 jobs, to see how the exec time grew. At 500 jobs it grew to > about 30+ seconds for a core app exec time of about 1 sec. (Im just > recollecting the times as at this point I didnt write much down). > > First results showed more time spent in the app wrapper than in > wrapper.sh. I remedied this by using /tmp as the app-wrapper's working > dir, and caching the app binary on /tmp. This brought a 20+ sec app > exec time down to about 3 seconds. > > With this fixed, the total time in wrapper.sh including the app is now > about 15 seconds, with 3 being in the app-wrapper itself. The time seems > about evenly spread over the several wrapper.sh operations, which is not > surprising when 500 wrappers hit NFS all at once. > > I then tried 3 more tests: > - a run to see if the app-executable caching on /tmp had an effect > (it didnt) > - a run to see if turning of wrapperlog retrieval had an effect > - a run with data operation throttles (both) set to 100 from 10 > > None of these last three things had a significant effect. > > Tomorrow I will try some mods to the wrapper script. Turning off wrapper > logging in a previous trial yesterday *seemed* to shave 20-30% off the > run time. I need to verify this. > > I'm also going to try to use /tmp for the jobdir and reduce wrapper.sh > overhead; also will leave the (tiny) job output on /tmp for later > aggregation (will have some swift questions on that). 
> > Ben, if you want to look at any of these logs, the runs are in > swift-logs/wilde in the order described above (w/comment files): > > 346: 10 job workflow > 347: 100 job wf > 348: 500 job wf > 349: 500 jobs w/ improved app-wrapper > 350: 500 jobs w/ improved app-wrapper & executable on /tmp > 351: 500 jobs, wrapperlog saving off > 352: 500 jobs, wrapperlog saving off, data throttles at 100 (from 20) > > All but the first of these should have falkon logs saved as well. > > I have several ideas on how to proceed, but welcome advice and any > discoveries from log analysis. > > Thanks, > > Mike > > > On 3/24/08 10:15 PM, Michael Wilde wrote: >> Ben, do you have a script to sum the time spent per step of >> wrapper.sh, over a set in -info files? >> >> On 3/24/08 6:36 PM, Ben Clifford wrote: >>> On Mon, 24 Mar 2008, Mihael Hategan wrote: >>> >>>> As far as I can remember, Ben added fairly comprehensive logging to the >>>> wrapper. That may shed some light on the issue. >>> >>> Indeed I did; and that logging information can be sent back to the >>> submit host by enabling wrapperlog.always.transfer=true >>> >> > From benc at hawaga.org.uk Tue Mar 25 08:22:27 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 25 Mar 2008 13:22:27 +0000 (GMT) Subject: [Fwd: Re: [Swift-devel] Re: swift-falkon problem... plots to explain plateaus...] In-Reply-To: <47E86E5A.6080303@mcs.anl.gov> References: <47E7DC3D.6040704@cs.uchicago.edu> <47E7E30A.2020208@mcs.anl.gov> <47E7E684.5070303@cs.uchicago.edu> <47E7EC03.1090609@mcs.anl.gov> <1206381792.11561.1.camel@blabla.mcs.anl.gov> <47E86E5A.6080303@mcs.anl.gov> Message-ID: On Mon, 24 Mar 2008, Michael Wilde wrote: > Ben, do you have a script to sum the time spent per step of wrapper.sh, over a > set in -info files? No. But possibly a similar summary can be given visually by the -info graphs that are in the present log procesing code. Make your wrapper logs online and we can look at them. 
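[Editor's note: no such summing script exists in the thread; below is a hypothetical sketch of one. It ASSUMES each -info line looks like "<epoch-seconds> <EVENT_NAME>" — the real wrapper.sh log format may differ, so the awk field extraction would need adjusting. The sample files are fabricated stand-ins for real -info logs.]

```shell
# Build two sample -info files so the sketch is self-contained.
mkdir -p /tmp/info-demo
printf '100 LOG_START\n105 CREATE_JOBDIR\n107 CREATE_INPUTDIR\n110 EXECUTE_DONE\n' > /tmp/info-demo/job1-info
printf '200 LOG_START\n204 CREATE_JOBDIR\n207 CREATE_INPUTDIR\n209 EXECUTE_DONE\n' > /tmp/info-demo/job2-info

# Credit the gap between consecutive events to the earlier event (the
# step that was running), then total each step across all files.
awk 'FNR == 1 { t = ""; e = "" }
     { if (t != "") total[e] += $1 - t; t = $1; e = $2 }
     END { for (s in total) printf "%-20s %ds\n", s, total[s] }' /tmp/info-demo/*-info
```

Pointed at a directory of retrieved wrapper logs, the largest totals would indicate which wrapper.sh steps dominate.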
-- From hategan at mcs.anl.gov Tue Mar 25 08:34:43 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 25 Mar 2008 08:34:43 -0500 Subject: [Fwd: Re: [Swift-devel] Re: swift-falkon problem... plots to explain plateaus...] In-Reply-To: <47E8FB17.1080501@mcs.anl.gov> References: <47E7DC3D.6040704@cs.uchicago.edu> <47E7E30A.2020208@mcs.anl.gov> <47E7E684.5070303@cs.uchicago.edu> <47E7EC03.1090609@mcs.anl.gov> <1206381792.11561.1.camel@blabla.mcs.anl.gov> <47E86E5A.6080303@mcs.anl.gov> <47E88D8C.4090207@mcs.anl.gov> <1206433879.26701.0.camel@blabla.mcs.anl.gov> <47E8FB17.1080501@mcs.anl.gov> Message-ID: <1206452083.31476.12.camel@blabla.mcs.anl.gov> On Tue, 2008-03-25 at 08:16 -0500, Michael Wilde wrote: > On 3/25/08 3:31 AM, Mihael Hategan wrote: > > On Tue, 2008-03-25 at 00:28 -0500, Michael Wilde wrote: > >> I eyeballed the wrapperlogs to get a rough idea of what was happening. > >> > >> I ran with wrapperlog saving and no other changes for wf's of 10, 100 > >> and 500 jobs, to see how the exec time grew. At 500 jobs it grew to > >> about 30+ seconds for a core app exec time of about 1 sec. (Im just > >> recollecting the times as at this point I didnt write much down). > >> > > > > I would personally like to see those logs. > > I listed all the runs in the previous mail (below), Mihael. They are on > CI NFS at ~benc/swift-logs/wilde/run{345-350}. Sorry about that. > Let us know what you find. > It looks like this: - 5 seconds between LOG_START and CREATE_JOBDIR. Likely hogs: mkdir -p $WFDIR/info/$JOBDIR mkdir -p $WFDIR/status/$JOBDIR and the creation of the info file. - 2.5 seconds between CREATE_JOBDIR and CREATE_INPUTDIR. Likely problem: mkdir -p $DIR (on a very fuzzy note, if one mkdir takes 2.5 seconds, two will take 5, which seems to roughly fit the observed numbers). - 3.5 seconds for COPYING_OUTPUTS - 2.5 seconds for RM_JOBDIR I'd be curious to know how much of the time is actually spent writing to the logs. 
That's because I see one second between EXECUTE_DONE and COPYING_OUTPUTS, a place where the only meaningful things that are done are two log messages. Perhaps it may be useful to run the whole thing through strace -T. Mihael From wilde at mcs.anl.gov Tue Mar 25 08:44:40 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 25 Mar 2008 08:44:40 -0500 Subject: [Fwd: Re: [Swift-devel] Re: swift-falkon problem... plots to explain plateaus...] In-Reply-To: <1206452083.31476.12.camel@blabla.mcs.anl.gov> References: <47E7DC3D.6040704@cs.uchicago.edu> <47E7E30A.2020208@mcs.anl.gov> <47E7E684.5070303@cs.uchicago.edu> <47E7EC03.1090609@mcs.anl.gov> <1206381792.11561.1.camel@blabla.mcs.anl.gov> <47E86E5A.6080303@mcs.anl.gov> <47E88D8C.4090207@mcs.anl.gov> <1206433879.26701.0.camel@blabla.mcs.anl.gov> <47E8FB17.1080501@mcs.anl.gov> <1206452083.31476.12.camel@blabla.mcs.anl.gov> Message-ID: <47E901C8.1060106@mcs.anl.gov> I did runs the day before with a modified wrapper that bypassed the INFO logging. It saved a good amount - I recall about 30% but need to re-check the numbers. Yes, I came to the same conclusion on the mkdirs. I'm looking at reducing these, likely moving the jobdir to /tmp. I think I can do that within the current structure. wrapper.sh is very clear and nicely written. (Ben: yes, eyeballing the log #s was easy and no problem). First thing I want to do, though, is run some large scale tests on our two science workflows, increasing the petro-modelling one (the sub-second application) to a larger runtime through app-level batching. Zhao's latest tests indicate that if we do batches of 40, bringing the jobs from .5 sec to 20 sec, we can saturate the BGP's 4K cores and keep it running efficiently. Given the extra wrapper.sh overhead, I might need to increase that another 10X, but once the app is wrapped in a loop, it makes little difference to the user how big we make that.
Once we get those running nicely at a larger, less brutal job time, I'll come back to wrapper.sh tuning. If you or Ben want to do this in the meantime, though, that would be great. We have the use-local-disk scenario on our development stack anyways - this would be a good time to do it. If I do it, it will be only a prototype for measurement purposes. Mike On 3/25/08 8:34 AM, Mihael Hategan wrote: > On Tue, 2008-03-25 at 08:16 -0500, Michael Wilde wrote: >> On 3/25/08 3:31 AM, Mihael Hategan wrote: >>> On Tue, 2008-03-25 at 00:28 -0500, Michael Wilde wrote: >>>> I eyeballed the wrapperlogs to get a rough idea of what was happening. >>>> >>>> I ran with wrapperlog saving and no other changes for wf's of 10, 100 >>>> and 500 jobs, to see how the exec time grew. At 500 jobs it grew to >>>> about 30+ seconds for a core app exec time of about 1 sec. (Im just >>>> recollecting the times as at this point I didnt write much down). >>>> >>> I would personally like to see those logs. >> I listed all the runs in the previous mail (below), Mihael. They are on >> CI NFS at ~benc/swift-logs/wilde/run{345-350}. > > Sorry about that. > >> Let us know what you find. >> > > It looks like this: > - 5 seconds between LOG_START and CREATE_JOBDIR. Likely hogs: > mkdir -p $WFDIR/info/$JOBDIR > mkdir -p $WFDIR/status/$JOBDIR > and the creation of the info file. > - 2.5 seconds between CREATE_JOBDIR and CREATE_INPUTDIR. Likely problem: > mkdir -p $DIR > (on a very fuzzy note, if one mkdir takes 2.5 seconds, two will take 5, > which seems to roughly fit the observed numbers). > - 3.5 seconds for COPYING_OUTPUTS > - 2.5 seconds for RM_JOBDIR > > I'd be curious to know how much of the time is actually spent writing to > the logs. That's because I see one second between EXECUTE_DONE and > COPYING_OUTPUTS, a place where the only meaningful things that are done > are two log messages. > > Perhaps it may be useful to run the whole thing through strace -T. 
> > Mihael > > From hategan at mcs.anl.gov Tue Mar 25 09:32:54 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 25 Mar 2008 09:32:54 -0500 Subject: [Fwd: Re: [Swift-devel] Re: swift-falkon problem... plots to explain plateaus...] In-Reply-To: <47E901C8.1060106@mcs.anl.gov> References: <47E7DC3D.6040704@cs.uchicago.edu> <47E7E30A.2020208@mcs.anl.gov> <47E7E684.5070303@cs.uchicago.edu> <47E7EC03.1090609@mcs.anl.gov> <1206381792.11561.1.camel@blabla.mcs.anl.gov> <47E86E5A.6080303@mcs.anl.gov> <47E88D8C.4090207@mcs.anl.gov> <1206433879.26701.0.camel@blabla.mcs.anl.gov> <47E8FB17.1080501@mcs.anl.gov> <1206452083.31476.12.camel@blabla.mcs.anl.gov> <47E901C8.1060106@mcs.anl.gov> Message-ID: <1206455574.31476.15.camel@blabla.mcs.anl.gov> Problem may be that, as a quick test shows, bash opens and closes the info file every time a redirect is done. On Tue, 2008-03-25 at 08:44 -0500, Michael Wilde wrote: > I did runs the day before with a modified wrapper that bypassed the INFO > logging. It saved a good amount - I recall about 30% but need to > re-check the numbers. > > Yes, I came to the same conclusion on the mkdirs. Im looking at > reducing these, likely moving the jobdir to /tmp. I think I can do that > within the current structure. wrapper.sh is ver clear and nicely > written. (Ben: yes, eyeballing the log #s was easy and no problem). > > First thing I want to do, though, is run some large scale tests on our > two science workflows, increasing the petro-modelling one (the > sub-second application) to a larger runtime through app-level batching. > > Zhao's latest test indicate that if we do batches of 40, bringing the > jobs from .5 sec to 20 sec, we can saturate the BGP's 4K cores and keep > it running efficiently. Given the extra wrapper.sh overhead, I might > need to increase that another 10X, but once the app is wrapped in a > loop, it makes little difference to the user how big we make that. 
> > The other app is a molecule-docking app, that can be batched similarly. > > Once we get those running nicely at a larger, less brutal job time, I'll > come back to wrapper.sh tuning. If you or Ben want to do this in the > meantime, though, that would be great. We have the use-local-disk > scenario on our development stack anyways - this would be a good time to > do it. If I do it, it will be only a prototype for measurement purposes. > > Mike > > > > > On 3/25/08 8:34 AM, Mihael Hategan wrote: > > On Tue, 2008-03-25 at 08:16 -0500, Michael Wilde wrote: > >> On 3/25/08 3:31 AM, Mihael Hategan wrote: > >>> On Tue, 2008-03-25 at 00:28 -0500, Michael Wilde wrote: > >>>> I eyeballed the wrapperlogs to get a rough idea of what was happening. > >>>> > >>>> I ran with wrapperlog saving and no other changes for wf's of 10, 100 > >>>> and 500 jobs, to see how the exec time grew. At 500 jobs it grew to > >>>> about 30+ seconds for a core app exec time of about 1 sec. (Im just > >>>> recollecting the times as at this point I didnt write much down). > >>>> > >>> I would personally like to see those logs. > >> I listed all the runs in the previous mail (below), Mihael. They are on > >> CI NFS at ~benc/swift-logs/wilde/run{345-350}. > > > > Sorry about that. > > > >> Let us know what you find. > >> > > > > It looks like this: > > - 5 seconds between LOG_START and CREATE_JOBDIR. Likely hogs: > > mkdir -p $WFDIR/info/$JOBDIR > > mkdir -p $WFDIR/status/$JOBDIR > > and the creation of the info file. > > - 2.5 seconds between CREATE_JOBDIR and CREATE_INPUTDIR. Likely problem: > > mkdir -p $DIR > > (on a very fuzzy note, if one mkdir takes 2.5 seconds, two will take 5, > > which seems to roughly fit the observed numbers). > > - 3.5 seconds for COPYING_OUTPUTS > > - 2.5 seconds for RM_JOBDIR > > > > I'd be curious to know how much of the time is actually spent writing to > > the logs. 
That's because I see one second between EXECUTE_DONE and > > COPYING_OUTPUTS, a place where the only meaningful things that are done > > are two log messages. > > > > Perhaps it may be useful to run the whole thing through strace -T. > > > > Mihael > > > > > From wilde at mcs.anl.gov Tue Mar 25 10:06:46 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 25 Mar 2008 10:06:46 -0500 Subject: [Fwd: Re: [Swift-devel] Re: swift-falkon problem... plots to explain plateaus...] In-Reply-To: <1206455574.31476.15.camel@blabla.mcs.anl.gov> References: <47E7DC3D.6040704@cs.uchicago.edu> <47E7E30A.2020208@mcs.anl.gov> <47E7E684.5070303@cs.uchicago.edu> <47E7EC03.1090609@mcs.anl.gov> <1206381792.11561.1.camel@blabla.mcs.anl.gov> <47E86E5A.6080303@mcs.anl.gov> <47E88D8C.4090207@mcs.anl.gov> <1206433879.26701.0.camel@blabla.mcs.anl.gov> <47E8FB17.1080501@mcs.anl.gov> <1206452083.31476.12.camel@blabla.mcs.anl.gov> <47E901C8.1060106@mcs.anl.gov> <1206455574.31476.15.camel@blabla.mcs.anl.gov> Message-ID: <47E91506.2080100@mcs.anl.gov> One thing I'll test is generating the info file on /tmp, and moving it when done to the final job dir. I can see adjusting wrapper.sh to go from very light to very logged with a few increments in the middle that would be most useful. The main option I think we want to leave for users to toggle in common usage, is whether to run the app with its jobdir on local disk, typically below /tmp, or on shared disk. The user would decide based on the job's I/O profile and on local disk space availability. Also, I recall some discussion on the success file. Thats acceptable overhead for all but the tiniest of jobs, but when a BGP is eventually running 100K+ short jobs at once, the rate of success file creation could become a bottleneck. Seems like we could have an option that avoids creating and expecting the success file if that proved useful - need to measure. 
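[Editor's note: the "generate the info file on /tmp, move it when done" idea above can be sketched as follows. All names here — WFDIR, JOBDIR, ID — are placeholders, not wrapper.sh's actual variables, and the shared directory is simulated under /tmp so the sketch is self-contained.]

```shell
WFDIR=/tmp/demo-shared-wf        # stands in for the shared (NFS) workflow dir
JOBDIR=a1
ID=job-0001
mkdir -p "$WFDIR/info/$JOBDIR"

INFO=$(mktemp "/tmp/${ID}-info.XXXXXX")      # info file lives on local disk

log() { echo "$(date +%s) $*" >> "$INFO"; }  # each write hits /tmp, not NFS

log LOG_START
log EXECUTE_DONE

# One shared-filesystem operation at the end, instead of one per log line.
mv "$INFO" "$WFDIR/info/$JOBDIR/${ID}-info"
```

The trade-off is that if the node dies mid-job, the partial log is stranded on local disk instead of being visible on the shared filesystem.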
- Mike On 3/25/08 9:32 AM, Mihael Hategan wrote: > Problem may be that, as a quick test shows, bash opens and closes the > info file every time a redirect is done. > > On Tue, 2008-03-25 at 08:44 -0500, Michael Wilde wrote: >> I did runs the day before with a modified wrapper that bypassed the INFO >> logging. It saved a good amount - I recall about 30% but need to >> re-check the numbers. >> >> Yes, I came to the same conclusion on the mkdirs. Im looking at >> reducing these, likely moving the jobdir to /tmp. I think I can do that >> within the current structure. wrapper.sh is ver clear and nicely >> written. (Ben: yes, eyeballing the log #s was easy and no problem). >> >> First thing I want to do, though, is run some large scale tests on our >> two science workflows, increasing the petro-modelling one (the >> sub-second application) to a larger runtime through app-level batching. >> >> Zhao's latest test indicate that if we do batches of 40, bringing the >> jobs from .5 sec to 20 sec, we can saturate the BGP's 4K cores and keep >> it running efficiently. Given the extra wrapper.sh overhead, I might >> need to increase that another 10X, but once the app is wrapped in a >> loop, it makes little difference to the user how big we make that. >> >> The other app is a molecule-docking app, that can be batched similarly. >> >> Once we get those running nicely at a larger, less brutal job time, I'll >> come back to wrapper.sh tuning. If you or Ben want to do this in the >> meantime, though, that would be great. We have the use-local-disk >> scenario on our development stack anyways - this would be a good time to >> do it. If I do it, it will be only a prototype for measurement purposes. 
>> >> Mike >> >> >> >> >> On 3/25/08 8:34 AM, Mihael Hategan wrote: >>> On Tue, 2008-03-25 at 08:16 -0500, Michael Wilde wrote: >>>> On 3/25/08 3:31 AM, Mihael Hategan wrote: >>>>> On Tue, 2008-03-25 at 00:28 -0500, Michael Wilde wrote: >>>>>> I eyeballed the wrapperlogs to get a rough idea of what was happening. >>>>>> >>>>>> I ran with wrapperlog saving and no other changes for wf's of 10, 100 >>>>>> and 500 jobs, to see how the exec time grew. At 500 jobs it grew to >>>>>> about 30+ seconds for a core app exec time of about 1 sec. (Im just >>>>>> recollecting the times as at this point I didnt write much down). >>>>>> >>>>> I would personally like to see those logs. >>>> I listed all the runs in the previous mail (below), Mihael. They are on >>>> CI NFS at ~benc/swift-logs/wilde/run{345-350}. >>> Sorry about that. >>> >>>> Let us know what you find. >>>> >>> It looks like this: >>> - 5 seconds between LOG_START and CREATE_JOBDIR. Likely hogs: >>> mkdir -p $WFDIR/info/$JOBDIR >>> mkdir -p $WFDIR/status/$JOBDIR >>> and the creation of the info file. >>> - 2.5 seconds between CREATE_JOBDIR and CREATE_INPUTDIR. Likely problem: >>> mkdir -p $DIR >>> (on a very fuzzy note, if one mkdir takes 2.5 seconds, two will take 5, >>> which seems to roughly fit the observed numbers). >>> - 3.5 seconds for COPYING_OUTPUTS >>> - 2.5 seconds for RM_JOBDIR >>> >>> I'd be curious to know how much of the time is actually spent writing to >>> the logs. That's because I see one second between EXECUTE_DONE and >>> COPYING_OUTPUTS, a place where the only meaningful things that are done >>> are two log messages. >>> >>> Perhaps it may be useful to run the whole thing through strace -T. >>> >>> Mihael >>> >>> > > From hategan at mcs.anl.gov Tue Mar 25 10:09:06 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 25 Mar 2008 10:09:06 -0500 Subject: [Fwd: Re: [Swift-devel] Re: swift-falkon problem... plots to explain plateaus...] 
In-Reply-To: <47E91506.2080100@mcs.anl.gov> References: <47E7DC3D.6040704@cs.uchicago.edu> <47E7E30A.2020208@mcs.anl.gov> <47E7E684.5070303@cs.uchicago.edu> <47E7EC03.1090609@mcs.anl.gov> <1206381792.11561.1.camel@blabla.mcs.anl.gov> <47E86E5A.6080303@mcs.anl.gov> <47E88D8C.4090207@mcs.anl.gov> <1206433879.26701.0.camel@blabla.mcs.anl.gov> <47E8FB17.1080501@mcs.anl.gov> <1206452083.31476.12.camel@blabla.mcs.anl.gov> <47E901C8.1060106@mcs.anl.gov> <1206455574.31476.15.camel@blabla.mcs.anl.gov> <47E91506.2080100@mcs.anl.gov> Message-ID: <1206457746.20249.1.camel@blabla.mcs.anl.gov> I just wrote a version of the wrapper that opens the log in a descriptor (so opening happens once). I need to test it first, but I'll commit shortly. On Tue, 2008-03-25 at 10:06 -0500, Michael Wilde wrote: > One thing I'll test is generating the info file on /tmp, and moving it > when done to the final job dir. > > I can see adjusting wrapper.sh to go from very light to very logged with > a few increments in the middle that would be most useful. > > The main option I think we want to leave for users to toggle in common > usage, is whether to run the app with its jobdir on local disk, > typically below /tmp, or on shared disk. The user would decide based on > the job's I/O profile and on local disk space availability. > > Also, I recall some discussion on the success file. Thats acceptable > overhead for all but the tiniest of jobs, but when a BGP is eventually > running 100K+ short jobs at once, the rate of success file creation > could become a bottleneck. Seems like we could have an option that > avoids creating and expecting the success file if that proved useful - > need to measure. > > - Mike > > > On 3/25/08 9:32 AM, Mihael Hategan wrote: > > Problem may be that, as a quick test shows, bash opens and closes the > > info file every time a redirect is done. 
> > > > On Tue, 2008-03-25 at 08:44 -0500, Michael Wilde wrote: > >> I did runs the day before with a modified wrapper that bypassed the INFO > >> logging. It saved a good amount - I recall about 30% but need to > >> re-check the numbers. > >> > >> Yes, I came to the same conclusion on the mkdirs. Im looking at > >> reducing these, likely moving the jobdir to /tmp. I think I can do that > >> within the current structure. wrapper.sh is ver clear and nicely > >> written. (Ben: yes, eyeballing the log #s was easy and no problem). > >> > >> First thing I want to do, though, is run some large scale tests on our > >> two science workflows, increasing the petro-modelling one (the > >> sub-second application) to a larger runtime through app-level batching. > >> > >> Zhao's latest test indicate that if we do batches of 40, bringing the > >> jobs from .5 sec to 20 sec, we can saturate the BGP's 4K cores and keep > >> it running efficiently. Given the extra wrapper.sh overhead, I might > >> need to increase that another 10X, but once the app is wrapped in a > >> loop, it makes little difference to the user how big we make that. > >> > >> The other app is a molecule-docking app, that can be batched similarly. > >> > >> Once we get those running nicely at a larger, less brutal job time, I'll > >> come back to wrapper.sh tuning. If you or Ben want to do this in the > >> meantime, though, that would be great. We have the use-local-disk > >> scenario on our development stack anyways - this would be a good time to > >> do it. If I do it, it will be only a prototype for measurement purposes. > >> > >> Mike > >> > >> > >> > >> > >> On 3/25/08 8:34 AM, Mihael Hategan wrote: > >>> On Tue, 2008-03-25 at 08:16 -0500, Michael Wilde wrote: > >>>> On 3/25/08 3:31 AM, Mihael Hategan wrote: > >>>>> On Tue, 2008-03-25 at 00:28 -0500, Michael Wilde wrote: > >>>>>> I eyeballed the wrapperlogs to get a rough idea of what was happening. 
> >>>>>> > >>>>>> I ran with wrapperlog saving and no other changes for wf's of 10, 100 > >>>>>> and 500 jobs, to see how the exec time grew. At 500 jobs it grew to > >>>>>> about 30+ seconds for a core app exec time of about 1 sec. (Im just > >>>>>> recollecting the times as at this point I didnt write much down). > >>>>>> > >>>>> I would personally like to see those logs. > >>>> I listed all the runs in the previous mail (below), Mihael. They are on > >>>> CI NFS at ~benc/swift-logs/wilde/run{345-350}. > >>> Sorry about that. > >>> > >>>> Let us know what you find. > >>>> > >>> It looks like this: > >>> - 5 seconds between LOG_START and CREATE_JOBDIR. Likely hogs: > >>> mkdir -p $WFDIR/info/$JOBDIR > >>> mkdir -p $WFDIR/status/$JOBDIR > >>> and the creation of the info file. > >>> - 2.5 seconds between CREATE_JOBDIR and CREATE_INPUTDIR. Likely problem: > >>> mkdir -p $DIR > >>> (on a very fuzzy note, if one mkdir takes 2.5 seconds, two will take 5, > >>> which seems to roughly fit the observed numbers). > >>> - 3.5 seconds for COPYING_OUTPUTS > >>> - 2.5 seconds for RM_JOBDIR > >>> > >>> I'd be curious to know how much of the time is actually spent writing to > >>> the logs. That's because I see one second between EXECUTE_DONE and > >>> COPYING_OUTPUTS, a place where the only meaningful things that are done > >>> are two log messages. > >>> > >>> Perhaps it may be useful to run the whole thing through strace -T. > >>> > >>> Mihael > >>> > >>> > > > > > From wilde at mcs.anl.gov Tue Mar 25 10:16:21 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 25 Mar 2008 10:16:21 -0500 Subject: [Fwd: Re: [Swift-devel] Re: swift-falkon problem... plots to explain plateaus...] 
In-Reply-To: <1206457746.20249.1.camel@blabla.mcs.anl.gov> References: <47E7DC3D.6040704@cs.uchicago.edu> <47E7E30A.2020208@mcs.anl.gov> <47E7E684.5070303@cs.uchicago.edu> <47E7EC03.1090609@mcs.anl.gov> <1206381792.11561.1.camel@blabla.mcs.anl.gov> <47E86E5A.6080303@mcs.anl.gov> <47E88D8C.4090207@mcs.anl.gov> <1206433879.26701.0.camel@blabla.mcs.anl.gov> <47E8FB17.1080501@mcs.anl.gov> <1206452083.31476.12.camel@blabla.mcs.anl.gov> <47E901C8.1060106@mcs.anl.gov> <1206455574.31476.15.camel@blabla.mcs.anl.gov> <47E91506.2080100@mcs.anl.gov> <1206457746.20249.1.camel@blabla.mcs.anl.gov> Message-ID: <47E91745.7090909@mcs.anl.gov> Great, thanks Mihael. Thats a useful step. I'll test. - Mike On 3/25/08 10:09 AM, Mihael Hategan wrote: > I just wrote a version of the wrapper that opens the log in a descriptor > (so opening happens once). I need to test it first, but I'll commit > shortly. > > On Tue, 2008-03-25 at 10:06 -0500, Michael Wilde wrote: >> One thing I'll test is generating the info file on /tmp, and moving it >> when done to the final job dir. >> >> I can see adjusting wrapper.sh to go from very light to very logged with >> a few increments in the middle that would be most useful. >> >> The main option I think we want to leave for users to toggle in common >> usage, is whether to run the app with its jobdir on local disk, >> typically below /tmp, or on shared disk. The user would decide based on >> the job's I/O profile and on local disk space availability. >> >> Also, I recall some discussion on the success file. Thats acceptable >> overhead for all but the tiniest of jobs, but when a BGP is eventually >> running 100K+ short jobs at once, the rate of success file creation >> could become a bottleneck. Seems like we could have an option that >> avoids creating and expecting the success file if that proved useful - >> need to measure. 
>> >> - Mike >> >> >> On 3/25/08 9:32 AM, Mihael Hategan wrote: >>> Problem may be that, as a quick test shows, bash opens and closes the >>> info file every time a redirect is done. >>> >>> On Tue, 2008-03-25 at 08:44 -0500, Michael Wilde wrote: >>>> I did runs the day before with a modified wrapper that bypassed the INFO >>>> logging. It saved a good amount - I recall about 30% but need to >>>> re-check the numbers. >>>> >>>> Yes, I came to the same conclusion on the mkdirs. Im looking at >>>> reducing these, likely moving the jobdir to /tmp. I think I can do that >>>> within the current structure. wrapper.sh is ver clear and nicely >>>> written. (Ben: yes, eyeballing the log #s was easy and no problem). >>>> >>>> First thing I want to do, though, is run some large scale tests on our >>>> two science workflows, increasing the petro-modelling one (the >>>> sub-second application) to a larger runtime through app-level batching. >>>> >>>> Zhao's latest test indicate that if we do batches of 40, bringing the >>>> jobs from .5 sec to 20 sec, we can saturate the BGP's 4K cores and keep >>>> it running efficiently. Given the extra wrapper.sh overhead, I might >>>> need to increase that another 10X, but once the app is wrapped in a >>>> loop, it makes little difference to the user how big we make that. >>>> >>>> The other app is a molecule-docking app, that can be batched similarly. >>>> >>>> Once we get those running nicely at a larger, less brutal job time, I'll >>>> come back to wrapper.sh tuning. If you or Ben want to do this in the >>>> meantime, though, that would be great. We have the use-local-disk >>>> scenario on our development stack anyways - this would be a good time to >>>> do it. If I do it, it will be only a prototype for measurement purposes. 
>>>> >>>> Mike >>>> >>>> >>>> >>>> >>>> On 3/25/08 8:34 AM, Mihael Hategan wrote: >>>>> On Tue, 2008-03-25 at 08:16 -0500, Michael Wilde wrote: >>>>>> On 3/25/08 3:31 AM, Mihael Hategan wrote: >>>>>>> On Tue, 2008-03-25 at 00:28 -0500, Michael Wilde wrote: >>>>>>>> I eyeballed the wrapperlogs to get a rough idea of what was happening. >>>>>>>> >>>>>>>> I ran with wrapperlog saving and no other changes for wf's of 10, 100 >>>>>>>> and 500 jobs, to see how the exec time grew. At 500 jobs it grew to >>>>>>>> about 30+ seconds for a core app exec time of about 1 sec. (Im just >>>>>>>> recollecting the times as at this point I didnt write much down). >>>>>>>> >>>>>>> I would personally like to see those logs. >>>>>> I listed all the runs in the previous mail (below), Mihael. They are on >>>>>> CI NFS at ~benc/swift-logs/wilde/run{345-350}. >>>>> Sorry about that. >>>>> >>>>>> Let us know what you find. >>>>>> >>>>> It looks like this: >>>>> - 5 seconds between LOG_START and CREATE_JOBDIR. Likely hogs: >>>>> mkdir -p $WFDIR/info/$JOBDIR >>>>> mkdir -p $WFDIR/status/$JOBDIR >>>>> and the creation of the info file. >>>>> - 2.5 seconds between CREATE_JOBDIR and CREATE_INPUTDIR. Likely problem: >>>>> mkdir -p $DIR >>>>> (on a very fuzzy note, if one mkdir takes 2.5 seconds, two will take 5, >>>>> which seems to roughly fit the observed numbers). >>>>> - 3.5 seconds for COPYING_OUTPUTS >>>>> - 2.5 seconds for RM_JOBDIR >>>>> >>>>> I'd be curious to know how much of the time is actually spent writing to >>>>> the logs. That's because I see one second between EXECUTE_DONE and >>>>> COPYING_OUTPUTS, a place where the only meaningful things that are done >>>>> are two log messages. >>>>> >>>>> Perhaps it may be useful to run the whole thing through strace -T. 
>>>>> >>>>> Mihael >>>>> >>>>> >>> > > From hategan at mcs.anl.gov Tue Mar 25 10:52:27 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 25 Mar 2008 10:52:27 -0500 Subject: [Fwd: Re: [Swift-devel] Re: swift-falkon problem... plots to explain plateaus...] In-Reply-To: <47E91745.7090909@mcs.anl.gov> References: <47E7DC3D.6040704@cs.uchicago.edu> <47E7E30A.2020208@mcs.anl.gov> <47E7E684.5070303@cs.uchicago.edu> <47E7EC03.1090609@mcs.anl.gov> <1206381792.11561.1.camel@blabla.mcs.anl.gov> <47E86E5A.6080303@mcs.anl.gov> <47E88D8C.4090207@mcs.anl.gov> <1206433879.26701.0.camel@blabla.mcs.anl.gov> <47E8FB17.1080501@mcs.anl.gov> <1206452083.31476.12.camel@blabla.mcs.anl.gov> <47E901C8.1060106@mcs.anl.gov> <1206455574.31476.15.camel@blabla.mcs.anl.gov> <47E91506.2080100@mcs.anl.gov> <1206457746.20249.1.camel@blabla.mcs.anl.gov> <47E91745.7090909@mcs.anl.gov> Message-ID: <1206460347.20974.8.camel@blabla.mcs.anl.gov> Done. On Tue, 2008-03-25 at 10:16 -0500, Michael Wilde wrote: > Great, thanks Mihael. Thats a useful step. I'll test. > > - Mike > > On 3/25/08 10:09 AM, Mihael Hategan wrote: > > I just wrote a version of the wrapper that opens the log in a descriptor > > (so opening happens once). I need to test it first, but I'll commit > > shortly. > > > > On Tue, 2008-03-25 at 10:06 -0500, Michael Wilde wrote: > >> One thing I'll test is generating the info file on /tmp, and moving it > >> when done to the final job dir. > >> > >> I can see adjusting wrapper.sh to go from very light to very logged with > >> a few increments in the middle that would be most useful. > >> > >> The main option I think we want to leave for users to toggle in common > >> usage, is whether to run the app with its jobdir on local disk, > >> typically below /tmp, or on shared disk. The user would decide based on > >> the job's I/O profile and on local disk space availability. > >> > >> Also, I recall some discussion on the success file. 
Thats acceptable > >> overhead for all but the tiniest of jobs, but when a BGP is eventually > >> running 100K+ short jobs at once, the rate of success file creation > >> could become a bottleneck. Seems like we could have an option that > >> avoids creating and expecting the success file if that proved useful - > >> need to measure. > >> > >> - Mike > >> > >> > >> On 3/25/08 9:32 AM, Mihael Hategan wrote: > >>> Problem may be that, as a quick test shows, bash opens and closes the > >>> info file every time a redirect is done. > >>> > >>> On Tue, 2008-03-25 at 08:44 -0500, Michael Wilde wrote: > >>>> I did runs the day before with a modified wrapper that bypassed the INFO > >>>> logging. It saved a good amount - I recall about 30% but need to > >>>> re-check the numbers. > >>>> > >>>> Yes, I came to the same conclusion on the mkdirs. Im looking at > >>>> reducing these, likely moving the jobdir to /tmp. I think I can do that > >>>> within the current structure. wrapper.sh is ver clear and nicely > >>>> written. (Ben: yes, eyeballing the log #s was easy and no problem). > >>>> > >>>> First thing I want to do, though, is run some large scale tests on our > >>>> two science workflows, increasing the petro-modelling one (the > >>>> sub-second application) to a larger runtime through app-level batching. > >>>> > >>>> Zhao's latest test indicate that if we do batches of 40, bringing the > >>>> jobs from .5 sec to 20 sec, we can saturate the BGP's 4K cores and keep > >>>> it running efficiently. Given the extra wrapper.sh overhead, I might > >>>> need to increase that another 10X, but once the app is wrapped in a > >>>> loop, it makes little difference to the user how big we make that. > >>>> > >>>> The other app is a molecule-docking app, that can be batched similarly. > >>>> > >>>> Once we get those running nicely at a larger, less brutal job time, I'll > >>>> come back to wrapper.sh tuning. 
If you or Ben want to do this in the > >>>> meantime, though, that would be great. We have the use-local-disk > >>>> scenario on our development stack anyways - this would be a good time to > >>>> do it. If I do it, it will be only a prototype for measurement purposes. > >>>> > >>>> Mike > >>>> > >>>> > >>>> > >>>> > >>>> On 3/25/08 8:34 AM, Mihael Hategan wrote: > >>>>> On Tue, 2008-03-25 at 08:16 -0500, Michael Wilde wrote: > >>>>>> On 3/25/08 3:31 AM, Mihael Hategan wrote: > >>>>>>> On Tue, 2008-03-25 at 00:28 -0500, Michael Wilde wrote: > >>>>>>>> I eyeballed the wrapperlogs to get a rough idea of what was happening. > >>>>>>>> > >>>>>>>> I ran with wrapperlog saving and no other changes for wf's of 10, 100 > >>>>>>>> and 500 jobs, to see how the exec time grew. At 500 jobs it grew to > >>>>>>>> about 30+ seconds for a core app exec time of about 1 sec. (Im just > >>>>>>>> recollecting the times as at this point I didnt write much down). > >>>>>>>> > >>>>>>> I would personally like to see those logs. > >>>>>> I listed all the runs in the previous mail (below), Mihael. They are on > >>>>>> CI NFS at ~benc/swift-logs/wilde/run{345-350}. > >>>>> Sorry about that. > >>>>> > >>>>>> Let us know what you find. > >>>>>> > >>>>> It looks like this: > >>>>> - 5 seconds between LOG_START and CREATE_JOBDIR. Likely hogs: > >>>>> mkdir -p $WFDIR/info/$JOBDIR > >>>>> mkdir -p $WFDIR/status/$JOBDIR > >>>>> and the creation of the info file. > >>>>> - 2.5 seconds between CREATE_JOBDIR and CREATE_INPUTDIR. Likely problem: > >>>>> mkdir -p $DIR > >>>>> (on a very fuzzy note, if one mkdir takes 2.5 seconds, two will take 5, > >>>>> which seems to roughly fit the observed numbers). > >>>>> - 3.5 seconds for COPYING_OUTPUTS > >>>>> - 2.5 seconds for RM_JOBDIR > >>>>> > >>>>> I'd be curious to know how much of the time is actually spent writing to > >>>>> the logs. 
That's because I see one second between EXECUTE_DONE and >>>>> COPYING_OUTPUTS, a place where the only meaningful things that are done >>>>> are two log messages. >>>>> >>>>> Perhaps it may be useful to run the whole thing through strace -T. >>>>> >>>>> Mihael >>>>> >>>>> >>> > > From wilde at mcs.anl.gov Tue Mar 25 11:36:22 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 25 Mar 2008 11:36:22 -0500 Subject: [Swift-devel] Re: How to wait on functions that return no data? In-Reply-To: <1206460871.20974.18.camel@blabla.mcs.anl.gov> References: <47E91036.9070806@mcs.anl.gov> <1206457254.19756.5.camel@blabla.mcs.anl.gov> <47E916E0.2020103@mcs.anl.gov> <1206458624.20974.7.camel@blabla.mcs.anl.gov> <47E91E33.3020700@mcs.anl.gov> <1206460871.20974.18.camel@blabla.mcs.anl.gov> Message-ID: <47E92A06.4090705@mcs.anl.gov> Related to this virtual idea, is it possible to add language semantics where a function defined as returning an object can decide to return "null", in which case it's deemed to be complete but decided not to generate a result? So a foreach that calls 1000 functions could complete when 10 return files and 990 return null? I'm moving this discussion to swift-devel by the way, as it's now talking about future possibilities. From a pure language point of view, we should permit the return of data that can be grouped (batched) into files in arbitrary chunks, determined and optimized by the implementation. Map-reduce tuples seem to work well for this model, and it seems that Swift could encompass it with minimal semantic change to the current language. This petro model app seems to be a good illustration of the use case. The function, the way I'm calling it, is basically z = f(x,y) where x, y, z are floats. To treat it as tuples, the return would be (x,y,z) = f(x,y) - i.e. the return is a triple, so that the reduce step simply merges all the output tuples and plots them. 
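[Editorial note: the reduce step Mike describes - merging per-invocation (x, y, z) tuples for plotting - could, outside Swift, be as simple as concatenating per-job files. File names and layout here are hypothetical illustrations.]

```shell
# Each runam() invocation appends its "x y z" triples to a private file on
# local disk; the collect step concatenates and sorts them for the plot.
OUTDIR=${OUTDIR:-/tmp/tuples.$$}
mkdir -p "$OUTDIR"
printf '1 2 3.5\n' > "$OUTDIR/job1.tuples"   # stand-ins for real model output
printf '1 3 4.0\n' > "$OUTDIR/job2.tuples"

sort -n "$OUTDIR"/*.tuples > "$OUTDIR/merged.tuples"   # the "reduce" merge
```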
(example plot below) - Mike This sweep varied the Low-S Light LL and Med-S Light LL production yields for Diesel fuel and plotted the effect on the Discount Investment: It shows a sweep on $2 and $3 in this line of adj_crude.txt. > -3 Prod_Yields > ... > 3 Diesel $2 $3 $4 $5 $6 $7 The production yield is plotted in: http://www.ci.uchicago.edu/~wilde/psweep1.png On 3/25/08 11:01 AM, Mihael Hategan wrote: > I think there is some confusion here between language and > implementation. > > The language can express the problem just fine. That's why I'm saying > you should change doall() to return an array with all the outputs. > > It's the implementation that behaves in a very poor way if the > applications are very fine grained. You seem to be trying to solve the > problem by: > 1. Doing some magic with the way files are moved around > 2. Convincing Swift that it should work without knowing about data > dependencies, despite the fact that it only works properly if it knows > about all data dependencies. By definition. > > There is some middle ground here. It may be possible to let Swift know > what the data dependencies are, but also prevent it from dealing with > certain files, by marking them as "virtual" (or whatever the term). > > Mihael > > On Tue, 2008-03-25 at 10:45 -0500, Michael Wilde wrote: >> Your view has merits in terms of language purity, but I disagree with it. >> >> This was posed as an academic question, and I think its interesting to >> discuss. >> >> The point here is that there's an application that could best be done by >> batching up its output, and in fact perhaps by using the map-reduce >> representation of tuples for that output. >> >> Its still driven by dataflow and data dependencies, just not the >> simplistic lock-step dependencies that swift implements today. 
>> >> For example, one way to address the problem is to say that batching of >> function calls, the way swift does today, is helpful but ignores the >> problem that small tasks often have small data inputs and outputs, and >> that these should be batched along with the job execution. >> >> That would leave swift language semantics unchanged, but the >> implementation would get more efficient and could handle finer-grained >> tasks. >> >> An even more efficient and interesting approach, fully in keeping with >> the language as it stands today, would be to allow tuples to be >> expressed as inputs and outputs, and to have swift efficiently and >> automatically route (and batch) tuples in and out of jobs. >> >> So I view what I was asking for here as a prototype or exploration of >> that direction. It would be good to test the performance of an >> implementation that streamed output tuples into a subsequent ("reduce") >> stage of processing, before we even consider what the language and/or >> implementation would need to do for such a case. >> >> >> On 3/25/08 10:23 AM, Mihael Hategan wrote: >> ... >> > Don't use Swift then. Seriously. If you don't want to express things in >> > a dataflow oriented way, and are not satisfied with its performance for >> > the given problem, don't use it. >> >> I want to express things as dataflow, with high performance, in Swift. >> >> Mike >> >> >> On 3/25/08 10:23 AM, Mihael Hategan wrote: >>> On Tue, 2008-03-25 at 10:14 -0500, Michael Wilde wrote: >>>>>> In the example below, I want collectResults() to get invoked after all >>>> >> the runam() calls complete in doall(). >>>> > >>>> > results = doall(); >>>> > collectResults(results); >>>> > >>>> > Mihael >>>> >>>> But thats the problem: doall() does not in this example return results. >>> Then it should be fixed. >>> >>>> If it would return an artificial result, how would we get such a return >>>> to wait until all the runam() calls made within the freach() have completed? 
>>>> >>>> Each of the runam() call runs a small model, and in this proposed >>>> scenario would leave those results on a local disk for later collection, >>>> either in a single shared file that many invocations would append to, or >>>> in a set of files. >>> I don't think the solution to performance problems in Swift is to hack >>> stuff like that. >>> >>>> Then collectresults() would run a job that collects all the data when done. >>>> >>>> One approach can be to have collectresults() just run iteratively until >>>> it has collected a sufficient number of results. I.e., to have it not >>>> depend on swift to find out when all the runam() calls have completed. >>>> That might work. >>> Don't use Swift then. Seriously. If you don't want to express things in >>> a dataflow oriented way, and are not satisfied with its performance for >>> the given problem, don't use it. >>> >>> Mihael >>> >>>> - Mike >>>> >>>> >>>> On 3/25/08 10:00 AM, Mihael Hategan wrote: >>>>> On Tue, 2008-03-25 at 09:46 -0500, Michael Wilde wrote: >>>>>> For the petro-model app Im working on, it would be interesting to run >>>>>> the parameter sweep in "map reduce" manner, in which each invocation >>>>>> bites off a portion of the parameter space and processes it, resulting >>>>>> in a set of result tuples. Each run of the model will produce a set of >>>>>> tuples, and when all are done, we want to aggregate and plot the tuples. >>>>>> >>>>>> While with batching this is not strictly needed, it would be interesting >>>>>> to let the model results accumulate on the local filesystem (as in this >>>>>> case they are small) and collect them either at the end of the run, or >>>>>> periodically and perhaps asynchronously during the run. >>>>>> >>>>>> To do this, we'd want to write the model invocation as a swift function >>>>>> with only scalar numeric parameters, and no output. >>>>> That assertion I'm not sure about. 
>>>>> >>>>>> The question is how to call a zero-returns function in a swift foreach() >>>>>> loop, and embed that foreach() in a function that doesnt return until >>>>>> all members of the foreach() have been processed. >>>>> The very notion of "return" as it would appear in a strict language >>>>> doesn't make much sense in Swift, so I'm not quite sure. >>>>> >>>>>> I havent tried to code this yet, because I cant think of a way to >>>>>> express it in swift, due to the data-dependency semantics. >>>>>> >>>>>> In the example below, I want collectResults() to get invoked after all >>>>>> the runam() calls complete in doall(). >>>>> results = doall(); >>>>> collectResults(results); >>>>> >>>>> Mihael >>>>> >>>>>> Anyone have any ideas? >>>>>> >>>>>> This is a low-priority question, just food for thought, as the batched >>>>>> way of running this parameter sweep should be straightforward and efficient. >>>>>> >>>>>> Mike >>>>>> >>>>>> >>>>>> >>>>>> // Amiga-Mars Parameter Sweep >>>>>> >>>>>> type amout; >>>>>> >>>>>> runam (string id , string p1, string p2) // no ret val >>>>>> { >>>>>> app { runam3 id p1 p2 ; } >>>>>> } >>>>>> >>>>>> type params { >>>>>> string id; >>>>>> string p1; >>>>>> string p2; >>>>>> }; >>>>>> >>>>>> doall(params p[]) >>>>>> { >>>>>> foreach pset in p { >>>>>> runam(pset.id, pset.p1, pset.p2); >>>>>> } >>>>>> // waitTillAllDone(); >>>>>> // want to block here till all above finish, >>>>>> // but no data to wait on. any way to >>>>>> // achieve this??? >>>>>> } >>>>>> >>>>>> // Main >>>>>> >>>>>> params p[]; >>>>>> p = readdata("paramlist"); >>>>>> doall(p); >>>>>> amout amdata ; >>>>>> amdata = collectResults(); >>>>>> >>>>>> // ^^^ Want collectresults to run AFTER all runam() calls finish >>>>>> // in the doall() function. 
>>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> _______________________________________________ >>>>>> Swift-user mailing list >>>>>> Swift-user at ci.uchicago.edu >>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user >>>>>> >>> > > From hategan at mcs.anl.gov Tue Mar 25 12:41:54 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 25 Mar 2008 12:41:54 -0500 Subject: [Swift-devel] Re: How to wait on functions that return no data? In-Reply-To: <47E92A06.4090705@mcs.anl.gov> References: <47E91036.9070806@mcs.anl.gov> <1206457254.19756.5.camel@blabla.mcs.anl.gov> <47E916E0.2020103@mcs.anl.gov> <1206458624.20974.7.camel@blabla.mcs.anl.gov> <47E91E33.3020700@mcs.anl.gov> <1206460871.20974.18.camel@blabla.mcs.anl.gov> <47E92A06.4090705@mcs.anl.gov> Message-ID: <1206466914.24237.11.camel@blabla.mcs.anl.gov> On Tue, 2008-03-25 at 11:36 -0500, Michael Wilde wrote: > Related to this virtual idea, is it possible to add language semantics > where a function defined as returning an object can decide to return > "null", in which case its deemed to be complete but decided no to > generate a result? Yes. What I mentioned would be similar. > > So a foreach that calls 1000 functions could complete when 10 return > files and 990 return null? That we don't have. > > I'm moving this discussion to swift-devel by the way, as its now talking > about future possibilities. > > From a pure language point of view, we should permit the return of data > that can be grouped (batched) into files files in arbitrary chunks, > determined and optimized by the implementation. Map-reduce tuples seem > to work well for this model, and it seems that Swift could encompass it > with minimal semantic change to the current language. > > This petro model app seems to be a good illustration of the use case. Before the petro model, we had the Aphasia model which required pretty much the same thing. I.e. for some inputs there was no output. 
> The function the way Im calling it is basically z = f (x,y) where x,y,z > are floats. > > To treat it as tuples, the return would be (x,y,z) = f(x,y) - ie the > return is a triple, so that the reduce step simply merges all the output > tuples and plots them. (example plot below) That's a 2d array. t[x][y] = f(x, y); Or even a simple list of arrays (which we would simulate with an array). Overall, this does not deal with "missing" elements. That's where user exceptions, which we spoke of before, would come in: try { t[x][y] = f(x, y); } catch (MissingValue) { //discard } Mihael > > - Mike > > > This sweep varied the Low-S Light LL and Med-S Light LL production > yields for Diesel fuel and plotted the effect on the Discount Investment: > > It shows a sweep on $2 and $3 in this line of adj_crude.txt. > > > -3 Prod_Yields > > ... > > 3 Diesel $2 $3 $4 $5 $6 $7 > > The production yield is plotted in: > > http://www.ci.uchicago.edu/~wilde/psweep1.png > > > > On 3/25/08 11:01 AM, Mihael Hategan wrote: > > I think there is some confusion here between language and > > implementation. > > > > The language can express the problem just fine. That's why I'm saying > > you should change doall() to return an array with all the outputs. > > > > It's the implementation that behaves in a very poor way if the > > applications are very fine grained. You seem to be trying to solve the > > problem by: > > 1. Doing some magic with the way files are moved around > > 2. Convincing Swift that it should work without knowing about data > > dependencies, despite the fact that it only works properly if it knows > > about all data dependencies. By definition. > > > > There is some middle ground here. It may be possible to let Swift know > > what the data dependencies are, but also prevent it from dealing with > > certain files, by marking them as "virtual" (or whatever the term). 
> > > > Mihael > > > > On Tue, 2008-03-25 at 10:45 -0500, Michael Wilde wrote: > >> Your view has merits in terms of language purity, but I disagree with it. > >> > >> This was posed as an academic question, and I think its interesting to > >> discuss. > >> > >> The point here is that there's an application that could best be done by > >> batching up its output, and in fact perhaps by using the map-reduce > >> representation of tuples for that output. > >> > >> Its still driven by dataflow and data dependencies, just not the > >> simplistic lock-step dependencies that swift implements today. > >> > >> For example, one way to address the problem is to say that batching of > >> function calls, the way swift does today, is helpful but ignores the > >> problem that small tasks often have small data inputs and outputs, and > >> that these should be batched along with the job execution. > >> > >> That would leave swift language semantics unchanged, but the > >> implementation would get more efficient and could handle finer-grained > >> tasks. > >> > >> An even more efficient and interesting approach, fully in keeping with > >> the language as it stands today, would be to allow tuples to be > >> expressed as inputs and outputs, and to have swift efficiently and > >> automatically route (and batch) tuples in and out of jobs. > >> > >> So I view what I was asking for here as a prototype or exploration of > >> that direction. It would be good to test the performance of an > >> implementation that streamed output tuples into a subsequent ("reduce") > >> stage of processing, before we even consider what the language and/or > >> implementation would need to do for such a case. > >> > >> > >> On 3/25/08 10:23 AM, Mihael Hategan wrote: > >> ... > >> > Don't use Swift then. Seriously. If you don't want to express things in > >> > a dataflow oriented way, and are not satisfied with its performance for > >> > the given problem, don't use it. 
> >> > >> I want to express things as dataflow, with high performance, in Swift. > >> > >> Mike > >> > >> > >> On 3/25/08 10:23 AM, Mihael Hategan wrote: > >>> On Tue, 2008-03-25 at 10:14 -0500, Michael Wilde wrote: > >>>>>> In the example below, I want collectResults() to get invoked after all > >>>> >> the runam() calls complete in doall(). > >>>> > > >>>> > results = doall(); > >>>> > collectResults(results); > >>>> > > >>>> > Mihael > >>>> > >>>> But thats the problem: doall() does not in this example return results. > >>> Then it should be fixed. > >>> > >>>> If it would return an artificial result, how would we get such a return > >>>> to wait until all the runam() calls made within the freach() have completed? > >>>> > >>>> Each of the runam() call runs a small model, and in this proposed > >>>> scenario would leave those results on a local disk for later collection, > >>>> either in a single shared file that many invocations would append to, or > >>>> in a set of files. > >>> I don't think the solution to performance problems in Swift is to hack > >>> stuff like that. > >>> > >>>> Then collectresults() would run a job that collects all the data when done. > >>>> > >>>> One approach can be to have collectresults() just run iteratively until > >>>> it has collected a sufficient number of results. I.e., to have it not > >>>> depend on swift to find out when all the runam() calls have completed. > >>>> That might work. > >>> Don't use Swift then. Seriously. If you don't want to express things in > >>> a dataflow oriented way, and are not satisfied with its performance for > >>> the given problem, don't use it. 
> >>> > >>> Mihael > >>> > >>>> - Mike > >>>> > >>>> > >>>> On 3/25/08 10:00 AM, Mihael Hategan wrote: > >>>>> On Tue, 2008-03-25 at 09:46 -0500, Michael Wilde wrote: > >>>>>> For the petro-model app Im working on, it would be interesting to run > >>>>>> the parameter sweep in "map reduce" manner, in which each invocation > >>>>>> bites off a portion of the parameter space and processes it, resulting > >>>>>> in a set of result tuples. Each run of the model will produce a set of > >>>>>> tuples, and when all are done, we want to aggregate and plot the tuples. > >>>>>> > >>>>>> While with batching this is not strictly needed, it would be interesting > >>>>>> to let the model results accumulate on the local filesystem (as in this > >>>>>> case they are small) and collect them either at the end of the run, or > >>>>>> periodically and perhaps asynchronously during the run. > >>>>>> > >>>>>> To do this, we'd want to write the model invocation as a swift function > >>>>>> with only scalar numeric parameters, and no output. > >>>>> That assertion I'm not sure about. > >>>>> > >>>>>> The question is how to call a zero-returns function in a swift foreach() > >>>>>> loop, and embed that foreach() in a function that doesnt return until > >>>>>> all members of the foreach() have been processed. > >>>>> The very notion of "return" as it would appear in a strict language > >>>>> doesn't make much sense in Swift, so I'm not quite sure. > >>>>> > >>>>>> I havent tried to code this yet, because I cant think of a way to > >>>>>> express it in swift, due to the data-dependency semantics. > >>>>>> > >>>>>> In the example below, I want collectResults() to get invoked after all > >>>>>> the runam() calls complete in doall(). > >>>>> results = doall(); > >>>>> collectResults(results); > >>>>> > >>>>> Mihael > >>>>> > >>>>>> Anyone have any ideas? 
> >>>>>> > >>>>>> This is a low-priority question, just food for thought, as the batched > >>>>>> way of running this parameter sweep should be straightforward and efficient. > >>>>>> > >>>>>> Mike > >>>>>> > >>>>>> > >>>>>> > >>>>>> // Amiga-Mars Parameter Sweep > >>>>>> > >>>>>> type amout; > >>>>>> > >>>>>> runam (string id , string p1, string p2) // no ret val > >>>>>> { > >>>>>> app { runam3 id p1 p2 ; } > >>>>>> } > >>>>>> > >>>>>> type params { > >>>>>> string id; > >>>>>> string p1; > >>>>>> string p2; > >>>>>> }; > >>>>>> > >>>>>> doall(params p[]) > >>>>>> { > >>>>>> foreach pset in p { > >>>>>> runam(pset.id, pset.p1, pset.p2); > >>>>>> } > >>>>>> // waitTillAllDone(); > >>>>>> // want to block here till all above finish, > >>>>>> // but no data to wait on. any way to > >>>>>> // achieve this??? > >>>>>> } > >>>>>> > >>>>>> // Main > >>>>>> > >>>>>> params p[]; > >>>>>> p = readdata("paramlist"); > >>>>>> doall(p); > >>>>>> amout amdata ; > >>>>>> amdata = collectResults(); > >>>>>> > >>>>>> // ^^^ Want collectresults to run AFTER all runam() calls finish > >>>>>> // in the doall() function. > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>> _______________________________________________ > >>>>>> Swift-user mailing list > >>>>>> Swift-user at ci.uchicago.edu > >>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user > >>>>>> > >>> > > > > > From iraicu at cs.uchicago.edu Tue Mar 25 15:31:58 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Tue, 25 Mar 2008 15:31:58 -0500 Subject: [Swift-devel] new bugzilla usage for Falkon Message-ID: <47E9613E.40906@cs.uchicago.edu> Hi all, I just started using Bugzilla to keep track of bugs, problems, and new features for Falkon. You can use the following two links for creating new bugs and displaying open bugs. 
* Create new bug: http://bugzilla.globus.org/globus/enter_bug.cgi?product=Falkon * Display open bugs: http://bugzilla.globus.org/globus/buglist.cgi?query_format=specific&order=relevance+desc&bug_status=__open__&product=Falkon&content= As you encounter problems with Falkon, or want new features added, please feel free to use the add new bug form to help keep everything organized. Cheers, Ioan -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== -------------- next part -------------- An HTML attachment was scrubbed... URL: From benc at hawaga.org.uk Tue Mar 25 17:49:07 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 25 Mar 2008 22:49:07 +0000 (GMT) Subject: [Swift-devel] Re: How to wait on functions that return no data? In-Reply-To: <47E92A06.4090705@mcs.anl.gov> References: <47E91036.9070806@mcs.anl.gov> <1206457254.19756.5.camel@blabla.mcs.anl.gov> <47E916E0.2020103@mcs.anl.gov> <1206458624.20974.7.camel@blabla.mcs.anl.gov> <47E91E33.3020700@mcs.anl.gov> <1206460871.20974.18.camel@blabla.mcs.anl.gov> <47E92A06.4090705@mcs.anl.gov> Message-ID: On Tue, 25 Mar 2008, Michael Wilde wrote: > Related to this virtual idea, is it possible to add language semantics where a > function defined as returning an object can decide to return "null", in which > case its deemed to be complete but decided no to generate a result? 
Going the Haskell way, introducing a Maybe type would be that - it's a dataflow rather than control-flow form of exception handling. You declare a type as 'maybe resultfile' and values of that type can be either 'Nothing' or a result file. You could have an array of: (Maybe resultfile)[] where each element is of type 'maybe resultfile' and so can (independent of the other elements) be a file or null. -- From benc at hawaga.org.uk Tue Mar 25 18:04:48 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 25 Mar 2008 23:04:48 +0000 (GMT) Subject: [Swift-devel] Re: How to wait on functions that return no data? In-Reply-To: <47E92A06.4090705@mcs.anl.gov> References: <47E91036.9070806@mcs.anl.gov> <1206457254.19756.5.camel@blabla.mcs.anl.gov> <47E916E0.2020103@mcs.anl.gov> <1206458624.20974.7.camel@blabla.mcs.anl.gov> <47E91E33.3020700@mcs.anl.gov> <1206460871.20974.18.camel@blabla.mcs.anl.gov> <47E92A06.4090705@mcs.anl.gov> Message-ID: On Tue, 25 Mar 2008, Michael Wilde wrote: > From a pure language point of view, we should permit the return of data that > can be grouped (batched) into files in arbitrary chunks, determined and > optimized by the implementation. Map-reduce tuples seem to work well for this > model, and it seems that Swift could encompass it with minimal semantic change > to the current language. For your example, what way do you want to store the data on the remote side - I'm assuming not individual files. The present dataset model should fairly easily accommodate the description of places to store data that aren't files - there's an abstraction in the implementation to help with that at the moment (DSHandle, which is what deals with the difference between in-memory values and on-disk files; and could fairly straightforwardly deal with other storage forms). One of the project ideas I put in for the Google Summer of Code was to play around with this, in fact.
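[Editorial sketch: the 'array of Maybe resultfile' idea above, rendered in Python rather than Swift syntax. run_model() and the file names are illustrative assumptions, not anything from the Swift implementation.]

```python
from typing import List, Optional

# Each array element is independently either a result or 'Nothing'
# (None in Python), and Nothing is a legitimate value, not an error.

def run_model(param: int) -> Optional[str]:
    """Hypothetical model run: returns a result file name, or None
    ('Nothing') when the job completes but produces no result."""
    if param % 2 == 0:
        return "result-%d.dat" % param
    return None

# An array of (Maybe resultfile): some elements are files, some Nothing.
results: List[Optional[str]] = [run_model(p) for p in range(4)]

# A downstream stage can wait on the whole array and skip the Nothings.
collected = [r for r in results if r is not None]
```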
-- From hategan at mcs.anl.gov Tue Mar 25 18:09:44 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 25 Mar 2008 18:09:44 -0500 Subject: [Swift-devel] Re: How to wait on functions that return no data? In-Reply-To: References: <47E91036.9070806@mcs.anl.gov> <1206457254.19756.5.camel@blabla.mcs.anl.gov> <47E916E0.2020103@mcs.anl.gov> <1206458624.20974.7.camel@blabla.mcs.anl.gov> <47E91E33.3020700@mcs.anl.gov> <1206460871.20974.18.camel@blabla.mcs.anl.gov> <47E92A06.4090705@mcs.anl.gov> Message-ID: <1206486584.11261.0.camel@blabla.mcs.anl.gov> On Tue, 2008-03-25 at 22:49 +0000, Ben Clifford wrote: > On Tue, 25 Mar 2008, Michael Wilde wrote: > > > Related to this virtual idea, is it possible to add language semantics where a > > function defined as returning an object can decide to return "null", in which > > case its deemed to be complete but decided no to generate a result? > > Going to haskell way, introducing a Maybe type would be that - its a > dataflow rather than control flow form of exception handling. You declare > a type as 'maybe resultfile' and values of that type can be either > 'Nothing' or a result file. > > You could have an array of: > > (Maybe resultfile)[] > > where each element is of type 'maybe resultfile' and so can (independent > of the other elements) be a file or null. pretty much like try {a[x] = f(x)} catch {} or maybe(a[x] = f(x)). > From benc at hawaga.org.uk Tue Mar 25 18:18:06 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 25 Mar 2008 23:18:06 +0000 (GMT) Subject: [Swift-devel] Re: How to wait on functions that return no data? 
In-Reply-To: <1206486584.11261.0.camel@blabla.mcs.anl.gov> References: <47E91036.9070806@mcs.anl.gov> <1206457254.19756.5.camel@blabla.mcs.anl.gov> <47E916E0.2020103@mcs.anl.gov> <1206458624.20974.7.camel@blabla.mcs.anl.gov> <47E91E33.3020700@mcs.anl.gov> <1206460871.20974.18.camel@blabla.mcs.anl.gov> <47E92A06.4090705@mcs.anl.gov> <1206486584.11261.0.camel@blabla.mcs.anl.gov> Message-ID: On Tue, 25 Mar 2008, Mihael Hategan wrote: > > where each element is of type 'maybe resultfile' and so can (independent > > of the other elements) be a file or null. > > pretty much like try {a[x] = f(x)} catch {} or maybe(a[x] = f(x)). In the array case, sort of, yes. It doesn't compare so well when passing round a single non-array value, though - a = Nothing is different from a not being assigned (yet, or never), which is what syntax like this: try { a = f(x) } catch {} alludes to. The try/catch syntax also alludes to a null response being somehow exceptional, rather than a legitimate return value. -- From benc at hawaga.org.uk Tue Mar 25 22:54:44 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 26 Mar 2008 03:54:44 +0000 (GMT) Subject: [Fwd: Re: [Swift-devel] Re: swift-falkon problem... plots to explain plateaus...] In-Reply-To: <47E88D8C.4090207@mcs.anl.gov> References: <47E7DC3D.6040704@cs.uchicago.edu> <47E7E30A.2020208@mcs.anl.gov> <47E7E684.5070303@cs.uchicago.edu> <47E7EC03.1090609@mcs.anl.gov> <1206381792.11561.1.camel@blabla.mcs.anl.gov> <47E86E5A.6080303@mcs.anl.gov> <47E88D8C.4090207@mcs.anl.gov> Message-ID: On Tue, 25 Mar 2008, Michael Wilde wrote: > also will leave the (tiny) job output on /tmp for later aggregation > (will have some swift questions on that). Note that Swift doesn't have any concept of non-shared filesystem management at the moment - if you want to keep files on a worker-local file system that is not accessible to the entire site, Swift doesn't have any way of getting a job to run somewhere where it can access that same filesystem.
We've talked about worker-local storage management before and agreed that it was both a long and hard thing to make happen. So it's interesting to see your experiences in this field, but it's probably very out of scope for any imminent development work. -- From benc at hawaga.org.uk Tue Mar 25 23:00:42 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 26 Mar 2008 04:00:42 +0000 (GMT) Subject: [Fwd: Re: [Swift-devel] Re: swift-falkon problem... plots to explain plateaus...] In-Reply-To: <47E91506.2080100@mcs.anl.gov> References: <47E7DC3D.6040704@cs.uchicago.edu> <47E7E30A.2020208@mcs.anl.gov> <47E7E684.5070303@cs.uchicago.edu> <47E7EC03.1090609@mcs.anl.gov> <1206381792.11561.1.camel@blabla.mcs.anl.gov> <47E86E5A.6080303@mcs.anl.gov> <47E88D8C.4090207@mcs.anl.gov> <1206433879.26701.0.camel@blabla.mcs.anl.gov> <47E8FB17.1080501@mcs.anl.gov> <1206452083.31476.12.camel@blabla.mcs.anl.gov> <47E901C8.1060106@mcs.anl.gov> <1206455574.31476.15.camel@blabla.mcs.anl.gov> <47E91506.2080100@mcs.anl.gov> Message-ID: On Tue, 25 Mar 2008, Michael Wilde wrote: > > Also, I recall some discussion on the success file. Thats acceptable overhead > > for all but the tiniest of jobs, but when a BGP is eventually running 100K+ > > short jobs at once, the rate of success file creation could become a I think there's an important Swift scoping issue here. Running 100k x 1s jobs is outside the scope of what I expect Swift to be used for any time soon; so I'm leery of spending time optimising out-of-scope applications at the expense of other work.
In-Reply-To: References: <47E91036.9070806@mcs.anl.gov> <1206457254.19756.5.camel@blabla.mcs.anl.gov> <47E916E0.2020103@mcs.anl.gov> <1206458624.20974.7.camel@blabla.mcs.anl.gov> <47E91E33.3020700@mcs.anl.gov> <1206460871.20974.18.camel@blabla.mcs.anl.gov> <47E92A06.4090705@mcs.anl.gov> <1206486584.11261.0.camel@blabla.mcs.anl.gov> Message-ID: <1206525050.11566.3.camel@blabla.mcs.anl.gov> On Tue, 2008-03-25 at 23:18 +0000, Ben Clifford wrote: > On Tue, 25 Mar 2008, Mihael Hategan wrote: > > > > where each element is of type 'maybe resultfile' and so can (independent > > > of the other elements) be a file or null. > > > > pretty much like try {a[x] = f(x)} catch {} or maybe(a[x] = f(x)). > > in the array case, sort of, yes. > > Doesn't compare so well when passing round a single non-array value though > - a = Nothing is different from a not being assigned (yet, or never), > which is what syntax this like: try { a =f(x) } catch {} alludes to. Right. Though one could say catch() {a = Nothing}. > > The try/catch syntax also alludes to a null response being somehow > exceptional, rather than a legitimate return value. > Not the null response, but actually f throwing an exception. From wilde at mcs.anl.gov Wed Mar 26 10:28:20 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 26 Mar 2008 10:28:20 -0500 Subject: [Swift-devel] Re: How to wait on functions that return no data? In-Reply-To: References: <47E91036.9070806@mcs.anl.gov> <1206457254.19756.5.camel@blabla.mcs.anl.gov> <47E916E0.2020103@mcs.anl.gov> <1206458624.20974.7.camel@blabla.mcs.anl.gov> <47E91E33.3020700@mcs.anl.gov> <1206460871.20974.18.camel@blabla.mcs.anl.gov> <47E92A06.4090705@mcs.anl.gov> Message-ID: <47EA6B94.9090609@mcs.anl.gov> Sorry - a long response follows to your simple question: > For your example, what way do you want to store the data on the remote > side - I'm assuming not individual files. 
In this example, a C program takes in 5 data files describing parameters of the petroleum refining process, and models various economic, emission and production yields. We do parameter sweeps by varying a few of these input vars and plotting their effect on an output var. The 5 files are text files, bundled into the application wrapper as shell "here documents" using cat <datafileN. The parameters are inserted into these data files using simple shell var substitution. In the simple tests I'm running now, I vary 2 input vars, and plot one output var. Each run of the model, which takes about 1 sec, takes 3 parameters (id, x, y) from a readdata() file, and puts out a similar line with a 4th column, the z value (id, x, y, z). Id is an int; x, y, z are floats. In the simplest runs, I just run one model per swift job. So id, x and y are provided on the command line, and a single file is produced with the tuple (id, x, y, z). I am now testing a batched version, where the app-wrapper script takes a range of x and y values with increments, and iterates over that range at the specified increments. Each batch results in a single file with all the output tuples for that batch. For this case, this is fine, and is the end of the problem. But I asked about the null values to explore a different approach: where most batches run and just leave their outputs on a local filesystem, concatenated into one file. The nice thing about having output in tuples is that you can batch them in any arbitrary way, and the reduce step can sort and select as needed. I suspect you're not going to like this idea on first consideration. But it's related to ideas on how to leverage map-reduce, as I mentioned earlier, and Ian's suggestion to explore collective operations. Mihael thought my take on this was inelegant and inconsistent with data flow. I think it can be massaged to fit nicely in the model and provide useful capabilities.
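[Editorial sketch of the batched sweep just described: each batch covers a sub-range of (x, y) values and emits one set of (id, x, y, z) tuples, which a reduce step can concatenate in any order. model(), the batch bounds, and the id format are illustrative assumptions, not the real petro-model.]

```python
# Each batch iterates over its (x, y) sub-range and returns one
# "file" worth of (id, x, y, z) tuples.

def model(x, y):
    return x + y  # stand-in for the ~1 sec petro-model run

def run_batch(batch_id, xs, ys):
    tuples = []
    n = 0
    for x in xs:
        for y in ys:
            tuples.append(("%s-%d" % (batch_id, n), x, y, model(x, y)))
            n += 1
    return tuples  # one batch output, batched in an arbitrary chunk

# The reduce step just concatenates batch outputs and sorts/selects.
all_tuples = run_batch("b0", [0.0, 1.0], [0.0, 1.0]) \
           + run_batch("b1", [2.0, 3.0], [0.0, 1.0])
```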
Here's one way I thought it could work with the addition of null/Nothing to Swift. The idea was that most or all invocations of the model jobs would return Nothing, and the actual results would be collected later in large, efficient batches. If an invocation of a wrapper batch returns null, then a later job can go and interrogate the workers to collect the data. One possibility in the Falkon case was that one job would be broadcast to all workers, and collect all files of a desired type. Another approach is that each job ensures that there's a background task running on the worker, which waits for either some accumulation of data or elapsed time, and then transfers what was produced, as a single file. These files would be returned either as results of arbitrary actual model runs, or by a collector job that runs after all the models are complete. But, separate from this data collection operation, a Nothing return has a more direct use. It's handy in cases when you have a large set of short jobs, exploring some parameter space in which results are very sparse. In these cases, it would be nice to have a way to say that a job succeeded but returned null/Nothing. That reduces the need to pass back a large number of files that signify "Nothing" in some inefficient manner. It's also handy for executing jobs that have side effects, and still waiting for them to complete. This gets us to a related issue: If a swift job could efficiently return a set of swift objects without using a file (specifically without placing files back in the shared directory) then many of these apps could work beautifully, by returning strings or numeric objects, possibly as structs and/or arrays, that travel back through the job submission interface rather than getting fetched via the data provider.
If a cluster of jobs could return data efficiently in a single "package" from the cluster, then we could pretty readily do map-reduce in swift, efficiently, in perfect concordance with the current dataflow model. Perhaps this latter approach is the best to consider: I suspect it could be readily implemented, could use a simple file to contain an arbitrary set of swift object return values, possibly in a format similar to that of readdata(). - Mike On 3/25/08 6:04 PM, Ben Clifford wrote: > On Tue, 25 Mar 2008, Michael Wilde wrote: > >> From a pure language point of view, we should permit the return of data that >> can be grouped (batched) into files in arbitrary chunks, determined and >> optimized by the implementation. Map-reduce tuples seem to work well for this >> model, and it seems that Swift could encompass it with minimal semantic change >> to the current language. > > For your example, what way do you want to store the data on the remote > side - I'm assuming not individual files. > > The present dataset model should fairly easily accommodate the description > of places to store data that aren't files - there's an abstraction in the > implementation to help with that at the moment (DSHandle, which is what > deals with the difference between in-memory values and on-disk files; and > could fairly straightforwardly deal with other storage forms). > > One of the project ideas I put in for the Google Summer of Code was to > play around with this, in fact. > From hategan at mcs.anl.gov Wed Mar 26 10:51:33 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 26 Mar 2008 10:51:33 -0500 Subject: [Swift-devel] Re: How to wait on functions that return no data?
In-Reply-To: <47EA6B94.9090609@mcs.anl.gov> References: <47E91036.9070806@mcs.anl.gov> <1206457254.19756.5.camel@blabla.mcs.anl.gov> <47E916E0.2020103@mcs.anl.gov> <1206458624.20974.7.camel@blabla.mcs.anl.gov> <47E91E33.3020700@mcs.anl.gov> <1206460871.20974.18.camel@blabla.mcs.anl.gov> <47E92A06.4090705@mcs.anl.gov> <47EA6B94.9090609@mcs.anl.gov> Message-ID: <1206546694.1119.16.camel@blabla.mcs.anl.gov> > I suspect you're not going to like this idea on first consideration. But > its related to ideas on how to leverage map-reduce, as I mentioned > earlier, and Ian's suggestion to explore collective operations. Mihael > thought my take on this was inelegant and inconsistent with data flow. Somewhat. What I thought you suggested was pretty much "I don't want to write my program as dataflow but I want to implement it in a dataflow language". "And if it doesn't work, then the language should be changed so that I can". [...] > > If a swift job could efficiently return a set of swift objects without > using a file In the context of Globus, it seems a bit difficult. > (specifically without placing files back in the shared > directory) then many of these apps could work beautifully, by returning > strings or numeric objects, possibly as structs and/r arrays, that > travel back through the job submission interface rather than getting > fetched via the data provider. If a cluster of jobs could return data > efficiently in a single "package" from the cluster, then we could pretty > readily do map-reduce in swift, efficiently, in perfect concordance with > the current dataflow model. One more time: we CAN do map-reduce in Swift. Stop saying we can't. Please. It's getting silly. The efficiency issue comes from the fact that the overhead for distributing very very very small tasks across a wide area network is very high compared to the task run time. And in the current Swift implementation it is higher than in the implementation you seem to think of. 
> > Perhaps this later approach is the best to consider: I suspect it could > be readily implemented, could use a simple file to contain an arbitrary > set of swift object return values, possibly in a format similar to that > of readdata(). How is this different from the current scheme (besides the data files being in a different format)? > > - Mike > > > > > > > > On 3/25/08 6:04 PM, Ben Clifford wrote: > > On Tue, 25 Mar 2008, Michael Wilde wrote: > > > >> From a pure language point of view, we should permit the return of data that > >> can be grouped (batched) into files files in arbitrary chunks, determined and > >> optimized by the implementation. Map-reduce tuples seem to work well for this > >> model, and it seems that Swift could encompass it with minimal semantic change > >> to the current language. > > > > For your example, what way do you want to store the data on the remote > > side - I'm assuming not individual files. > > > > The present dataset model should fairly easily accomodate the description > > of places to store data that aren't files - there's an abstraction in the > > implementation to help with that at the moment (DSHandle, which is what > > deals with the difference between in-memory values and on-disk files; and > > could fairly straightforwardly deal with other storage forms). > > > > One of the project ideas I put in for the google summer of code was to > > play around with this, in fact. > > > From liming at mcs.anl.gov Wed Mar 26 15:15:53 2008 From: liming at mcs.anl.gov (Lee Liming) Date: Wed, 26 Mar 2008 14:15:53 -0600 Subject: [Swift-devel] Swift at GlobusWorld Message-ID: <82AE556D-B408-46FB-83BF-152B5D0CE8D0@mcs.anl.gov> Hello, members of the Swift incubator project! I am writing to encourage you to propose a presentation on your work for the GlobusWorld track of the Open Source Grid and Cluster Conference in May. (See www.globus.org for details.) 
The official deadline for presentation proposals has already passed, but members of dev.globus projects and incubators are still encouraged to submit proposals as described on the conference website. This would be a good opportunity to let people know what you are doing in your incubator and how they could make use of it. Hope to see you at the conference, -- Lee From foster at mcs.anl.gov Wed Mar 26 15:25:08 2008 From: foster at mcs.anl.gov (Ian Foster) Date: Wed, 26 Mar 2008 15:25:08 -0500 Subject: [Swift-devel] An interesting article on reproducibility Message-ID: <47EAB124.2040708@mcs.anl.gov> http://www.bepress.com/cgi/viewcontent.cgi?article=1002&context=bioconductor From foster at mcs.anl.gov Wed Mar 26 16:00:39 2008 From: foster at mcs.anl.gov (Ian Foster) Date: Wed, 26 Mar 2008 16:00:39 -0500 Subject: [Swift-devel] Article on Hadoop, etc. Message-ID: <47EAB977.2060808@mcs.anl.gov> http://www.theregister.co.uk/2008/03/26/yahoo_hadoop_summit/ and Dryad: http://research.microsoft.com/research/sv/dryad/ From benc at hawaga.org.uk Thu Mar 27 23:15:04 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 28 Mar 2008 04:15:04 +0000 (GMT) Subject: [Swift-devel] proxy expiration whilst jobs are running through GRAM4 Message-ID: If the user proxy expires whilst a job is running from Swift through GRAM4, that job will hang in the Swift runtime. This is reproducible by running a 5-minute sleep job with a 2-minute proxy. I think (though I haven't looked at GRAM server-side logs to check) what is happening here is that status notifications cannot be delivered because of the expired credential, and the job then sits forever waiting for the notification that will never come. If so, then it would probably be better to refresh the credential if possible, and fail the job if we know that we cannot get notifications because the local proxy has expired.
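[Editorial sketch of the policy suggested above, with hypothetical helper names - the real fix would live in the GRAM4 provider: compare remaining credential lifetime against the expected job duration, refresh when a refresh hook is available, and otherwise fail fast instead of hanging on a notification that can never be delivered.]

```python
import time

def check_credential(expires_at, job_walltime, refresh=None, now=None):
    """Decide what to do before relying on notifications that need a
    valid credential. All names here are illustrative, not Swift API."""
    now = time.time() if now is None else now
    remaining = expires_at - now
    if remaining > job_walltime:
        return "ok"                # proxy outlives the job
    if refresh is not None:
        refresh()                  # e.g. re-delegate a fresh proxy
        return "refreshed"
    return "fail-fast"             # fail the job now rather than hang

# The reported case: a 2-minute proxy and a 5-minute sleep job,
# with no refresh available.
status = check_credential(expires_at=1000 + 120, job_walltime=300, now=1000)
```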
-- From callmeno2 at gmail.com Sun Mar 30 10:32:45 2008 From: callmeno2 at gmail.com (karthik parvathaneni) Date: Sun, 30 Mar 2008 21:02:45 +0530 Subject: [Swift-devel] joining mailing list !! Message-ID: <347f9e8c0803300832r6da1c887lefe20e4f0e2d58fb@mail.gmail.com> HI .. I AM A STUDENT INTERESTING IN FOLLOWING UP WITH THE WORK THATS IN PROGRESS ! .. AND THE PROJECT ITSELF IS CHALLLENGING ... I WOULD LOVE TAKE IT UP ... regards, karthik -------------- next part -------------- An HTML attachment was scrubbed... URL: From bugzilla-daemon at mcs.anl.gov Mon Mar 31 05:22:26 2008 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Mon, 31 Mar 2008 05:22:26 -0500 (CDT) Subject: [Swift-devel] [Bug 127] New: proxy expiration whilst jobs are running through GRAM4 Message-ID: http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=127 Summary: proxy expiration whilst jobs are running through GRAM4 Product: Swift Version: unspecified Platform: All OS/Version: All Status: NEW Severity: normal Priority: P2 Component: General AssignedTo: nobody at mcs.anl.gov ReportedBy: benc at hawaga.org.uk CC: benc at hawaga.org.uk, swift-devel at ci.uchicago.edu If the user proxy expires whilst a job is running from Swift through GRAM4, that job will hang in the swift runtime. This is reproducable by running a 5 minutes sleep job with a 2 minute proxy. I think (though I haven't looked at gram server side logs to check) what is happening here is that status notifications cannot be delivered because of the expired credential; and the job then sits forever waiting for the notification that will never come. If so, then probably it would be better to refresh the credential if possible; and fail the job if we know that we cannot get notifications because the local proxy has expired. -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is. 
From benc at hawaga.org.uk Mon Mar 31 05:23:40 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 31 Mar 2008 10:23:40 +0000 (GMT) Subject: [Swift-devel] Re: proxy expiration whilst jobs are running through GRAM4 In-Reply-To: References: Message-ID: for tracking, I put this in bugzilla as bug 127. -- From foster at mcs.anl.gov Mon Mar 31 08:11:53 2008 From: foster at mcs.anl.gov (Ian Foster) Date: Mon, 31 Mar 2008 08:11:53 -0500 Subject: [Swift-devel] interesting program at NeSC Message-ID: <47F0E319.2020005@mcs.anl.gov> http://wiki.esi.ac.uk/Principles_of_Provenance From iraicu at cs.uchicago.edu Mon Mar 31 10:46:17 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Mon, 31 Mar 2008 10:46:17 -0500 Subject: [Swift-devel] [Fwd: [Dbworld] Final Call for Papers for IEEE T-ASE Special Issue on Scientific Workflow Management and Application] Message-ID: <47F10749.4040405@cs.uchicago.edu> Here is a good journal which has a CFP specific to scientific workflow systems! The paper submission deadline is April 30th. Ioan -------- Original Message -------- Subject: [Dbworld] Final Call for Papers for IEEE T-ASE Special Issue on Scientific Workflow Management and Application Date: Mon, 31 Mar 2008 06:23:38 -0500 From: Jinjun Chen Reply-To: dbworld_owner at yahoo.com To: undisclosed-recipients: ; Final Call for Papers - IEEE T-ASE Special Issue on Scientific Workflow Management and Applications http://www.swinflow.org/si/t-ase.htm. Deadline for submission has been extended to April 30 2008 due to many requests. Details can be referred to http://www.swinflow.org/si/t-ase.htm. Best Wishes, Jinjun _______________________________________________ Please do not post msgs that are not relevant to the database community at large. Go to www.cs.wisc.edu/dbworld for guidelines and posting forms. To unsubscribe, go to https://lists.cs.wisc.edu/mailman/listinfo/dbworld -- =================================================== Ioan Raicu Ph.D. 
Candidate