From kslowikowski at gmail.com Sun May 4 16:45:47 2014
From: kslowikowski at gmail.com (Kamil Slowikowski)
Date: Sun, 4 May 2014 17:45:47 -0400
Subject: [Swift-user] structured_regexp_mapper does not work
Message-ID:

I cannot reproduce the example in the user guide:

http://swift-lang.org/guides/release-0.94/userguide/userguide.html#_structured_regular_expression_mapper

Please see the script and console output below.

1. Is Swift currently being developed?
2. Are there plans to improve the documentation soon?
3. Is it possible to move the documentation to a platform where your users can contribute?
4. Is the project well-funded and staffed?

Thanks,
Kamil

Here is the script: (copied from the guide verbatim)

- - -
$ cat test.swift
type file;
string s[] = ["picture.gif", "hello.gif", "world.gif"];
file f[] <structured_regexp_mapper; source=s, match="(.*)gif", transform="\\1jpg">;
trace(f);
- - -

Here is the output:

- - -
$ swift test.swift
Swift 0.94.1 swift-r7114 cog-r3803

RunID: 20140504-1718-q5qk8b9a
Progress: time: Sun, 04 May 2014 17:18:24 -0400
Execution failed:
f, line 3 had mapping errors
Caused by:
java.lang.NullPointerException
at org.griphyn.vdl.mapping.file.StructuredRegularExpressionMapper.map(StructuredRegularExpressionMapper.java:106)
at org.griphyn.vdl.mapping.RootDataNode.addExisting(RootDataNode.java:173)
at org.griphyn.vdl.mapping.RootDataNode.checkInputs(RootDataNode.java:128)
at org.griphyn.vdl.mapping.RootArrayDataNode.checkInputs(RootArrayDataNode.java:104)
at org.griphyn.vdl.mapping.RootArrayDataNode.innerInit(RootArrayDataNode.java:92)
at org.griphyn.vdl.mapping.RootArrayDataNode.futureModified(RootArrayDataNode.java:113)
at org.griphyn.vdl.karajan.ArrayIndexFutureList$1.run(ArrayIndexFutureList.java:156)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
- - -
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From wilde at anl.gov Sun May 4 18:35:57 2014
From: wilde at anl.gov (Michael Wilde)
Date: Sun, 4 May 2014 18:35:57 -0500
Subject: [Swift-user] structured_regexp_mapper does not work
In-Reply-To:
References:
Message-ID: <5366CEDD.3000206@anl.gov>

On 5/4/14, 4:45 PM, Kamil Slowikowski wrote:
> I cannot reproduce the example in the user guide:
>
> http://swift-lang.org/guides/release-0.94/userguide/userguide.html#_structured_regular_expression_mapper
>
> Please see the script and console output below.

Kamil, thanks for reporting this. I'm not yet certain, but I think the example should be recoded to make the source (in addition to the dest) array be an array of type file rather than of type string, and also to trace @filenames( ) of the dest file array rather than the whole array:

$ cat test2.swift
type file;
string s[] = ["picture.gif", "hello.gif", "world.gif"];
file g[];
file f[] ;
trace(@filenames(f));

$ swift test2.swift
Swift 0.94.1 swift-r7114 cog-r3803

RunID: 20140504-2309-rfdo9614
Progress: time: Sun, 04 May 2014 23:09:21 +0000
SwiftScript trace: {0 = picture.jpg, 1 = hello.jpg, 2 = world.jpg}
Final status: Sun, 04 May 2014 23:09:21 +0000

I filed a bug ticket (#1271) to fix the exception, determine the (currently) desired behavior, and update the User Guide:

https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=1271

>
> 1. Is Swift currently being developed?

Yes, actively.

> 2. Are there plans to improve the documentation soon?

Yes - this need is well-recognized. The User Guide is improving, but rather slowly. I can't provide a date when this will start accelerating.

> 3. Is it possible to move the documentation to a platform where your users
> can contribute?

Yes, that's a good idea. We are planning a move to GitHub.
We were a bit more focused on enabling users to contribute to code, but enabling user documentation contributions more readily is a great idea. Our docs are currently written in AsciiDoc; we'll look for ways to do this, but for now, just mail improvement text to swift-user. (Ideally, please join the swift-user list so that your posts are not blocked for moderation).

> 4. Is the project well-funded and staffed?

We're very grateful for the funding we have, but it's barely enough to support our user base. About 7 people work on Swift, under relatively modest funding. Less than 1.5 FTE is available to support the Swift 0.9 -> 1.0 product trajectory; the rest of the team is funded to do CS research or support specific science application user groups.

We're actively pursuing expanded funding and other means of sustaining Swift, including working more aggressively to grow a sustaining open source community.

- Mike

>
> Thanks,
> Kamil
>
>
> Here is the script: (copied from the guide verbatim)
>
> - - -
> $ cat test.swift
> type file;
> string s[] = ["picture.gif", "hello.gif", "world.gif"];
> file f[] <structured_regexp_mapper; source=s,
> match="(.*)gif",
> transform="\\1jpg">;
> trace(f);
> - - -
>
>
> Here is the output:
>
> - - -
> $ swift test.swift
> Swift 0.94.1 swift-r7114 cog-r3803
>
> RunID: 20140504-1718-q5qk8b9a
> Progress: time: Sun, 04 May 2014 17:18:24 -0400
> Execution failed:
> f, line 3 had mapping errors
> Caused by:
> java.lang.NullPointerException
> at
> org.griphyn.vdl.mapping.file.StructuredRegularExpressionMapper.map(StructuredRegularExpressionMapper.java:106)
> at
> org.griphyn.vdl.mapping.RootDataNode.addExisting(RootDataNode.java:173)
> at
> org.griphyn.vdl.mapping.RootDataNode.checkInputs(RootDataNode.java:128)
> at
> org.griphyn.vdl.mapping.RootArrayDataNode.checkInputs(RootArrayDataNode.java:104)
> at
> org.griphyn.vdl.mapping.RootArrayDataNode.innerInit(RootArrayDataNode.java:92)
> at
>
org.griphyn.vdl.mapping.RootArrayDataNode.futureModified(RootArrayDataNode.java:113) > at > org.griphyn.vdl.karajan.ArrayIndexFutureList$1.run(ArrayIndexFutureList.java:156) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) > at > java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) > at java.util.concurrent.FutureTask.run(FutureTask.java:138) > at > java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) > at java.lang.Thread.run(Thread.java:662) > - - - > > > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user -- Michael Wilde Mathematics and Computer Science Computation Institute Argonne National Laboratory The University of Chicago -------------- next part -------------- An HTML attachment was scrubbed... URL: From wilde at anl.gov Tue May 6 18:05:04 2014 From: wilde at anl.gov (Michael Wilde) Date: Tue, 6 May 2014 18:05:04 -0500 Subject: [Swift-user] Swift crash In-Reply-To: References: Message-ID: <53696AA0.2050608@anl.gov> Bill, since you are running on Open Science Grid with no specific Condor requirements to select specific sites or node types, it looks like some sites that your app tasks are running at do not have a Python environment that works with your wrapper script. This might have something to do with that way you have set up or are using virtualenv. You might want to start by adding requirements tags to specify that your tasks should only run on the UC3 cycle-seeder nodes, make sure everything works correctly there, and then add tags for one additional pool of OSG Connect resources at a time. You can find info on this in the OSG Connect Book. You can also log the site name in your returned .err or .out file, via your wrapper, which will help a lot in debugging. 
Then you can force tasks to the *bad* sites to debug and fix your python environment for those sites.

I can help more on this after Friday. Perhaps others on the Swift team or this list can also provide assistance.

- Mike

On 5/6/14, 5:53 PM, William Catino wrote:
> I have a script that crashed 3 times, then succeeded, with no change.
> When it crashed, the .err file in the data directory contained a message
> about not finding a file whose name included
>
> /usr/lib64/libpython2.6.so.1.0
>
> The attached log file that corresponds to a crash is:
> df-20140506-1724-d6au5ct7.log
> The attached log file that corresponds to a subsequent success is:
> df-20140506-1727-39v1is9b.log
>
> I might have specific files confused, but these files (and others from
> prior crashes) are located in my home directory on OSG, at /home/wcatino/df.
> This configuration is identical to that used when working with Mike today,
> except that the data directory was reduced to 5 files.
>
> I also noticed that my PATH is pointing swift to version 0.94 rather than
> 0.95.
> Perhaps I should change my PATH to fix this.
> The instructions I followed had me append the 0.95 path to PATH, which
> contains a 0.94 path earlier in the PATH string.
>
> Thanks.

--
Michael Wilde
Mathematics and Computer Science          Computation Institute
Argonne National Laboratory               The University of Chicago

From wcatino at gmail.com Mon May 5 15:57:10 2014
From: wcatino at gmail.com (William Catino)
Date: Mon, 5 May 2014 15:57:10 -0500
Subject: [Swift-user] Swift crash
Message-ID:

I submitted a job that contained 1.9 million documents, and got a crash.

I tried this 2 different ways:

First, I submitted files with 10000 filenames in each file - so there were 197 files. This generated a GC error.

Then I tried 1976 files, each containing 1000 filenames. This caused another crash - part of the error message contained the phrase "Heap table out of memory."

I tried 100 files containing a total of 100,000 filenames.
It is running now (for over 15 minutes). Is there any documentation about the various limitations of Swift, especially on OSG: -max number of files to process -max total number of bytes contained in the files processed -max length of command line to app -max number of nodes -max number of slots The swift script looks like this: [wcatino at login01 df]$ cat df.swift type file; type script; // Note that to use `bash` here, it has to be in SWIFT's 'apps' file: app (file df, file err, file out) wrapper (script wrap, script df_script, file all_files[]) { // this uses bash to call the wrapper script with two CLAs: directory (of input files) and target output file df // It also sets stderr and stdout to two passed files, err and out. bash @wrap @df @all_files stderr=@err stdout=@out; } string dir = (@arg("data")); script calc_df <"df.py">; script wrap <"wrapper.sh">; // This grabs all *.txt files in dir: file[] all_docs ; foreach list_of_files, index in all_docs { // Top file is for the results -- if this is not created by our python, swift will fail. file df ; file err ; file out ; // Read the list of files as an array of string which is then passed // to the array mapper as the array of filenames for mapping. string names[] = readData(list_of_files); file all_files[] ; // Do we need to ? // Note we have to pass both dir and doc along with the wrapper script and the actual python script: (df,err,out) = wrapper (wrap, calc_df, all_files); Any help would be appreciated. Sincerely, William Catino, Ph.D. Principal Software Engineer Knowledge Lab | Computation Institute | University of Chicago *wcatino at uchicago.edu * -------------- next part -------------- An HTML attachment was scrubbed... URL: From wcatino at gmail.com Tue May 6 17:53:52 2014 From: wcatino at gmail.com (William Catino) Date: Tue, 6 May 2014 17:53:52 -0500 Subject: [Swift-user] Swift crash Message-ID: I have a script that crashed 3 times, then succeeded, with no change. 
When it crashed, the .err file in the data directory contained a message about not finding a file whose name included /usr/lib64/libpython2.6.so.1.0 The attached log file that corresponds to a crash is: df-20140506-1724-d6au5ct7.log The attached log file that corresponds to a subsequent success is: df-20140506-1727-39v1is9b.log I might have specific files confused, but these files (and others from prior crashes) are located in my home directory on OSG, at /home/wcatino/df. This configuration is identical to that used when working with Mike today, except that the data directory was reduced to 5 files. I also noticed that my PATH is pointing swift to version 0.94 rather than 0.95. Perhaps I should change my PATH to fix this. The instructions I followed had me append the 0.95 path to PATH, which contains a 0.94 path earlier in the PATH string. Thanks. -- William Catino, Ph.D. Principal Software Engineer Knowledge Lab | Computation Institute | University of Chicago wcatino at gmail.com 773.603.9481 -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: df-20140506-1724-d6au5ct7.log Type: application/octet-stream Size: 78136 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: df-20140506-1727-39v1is9b.log Type: application/octet-stream Size: 112626 bytes Desc: not available URL: From iraicu at cs.iit.edu Thu May 8 21:39:20 2014 From: iraicu at cs.iit.edu (Ioan Raicu) Date: Thu, 08 May 2014 21:39:20 -0500 Subject: [Swift-user] Call for Posters: ACM HPDC 2014 -- due May 16th Message-ID: <536C3FD8.7090205@cs.iit.edu> Call for Posters HPDC'14 will feature a poster session that will provide the right environment for lively and informal discussions on various high performance parallel and distributed computing topics. The poster session will be held on Wednesday, June 25, in the late afternoon. 
Participating posters will be selected based on the following criteria: - Submissions must describe new, interesting ideas on any HPDC topics of interest - Submissions can present work in progress, and we strongly encourage the authors to include preliminary experimental results, if available - Student submissions meeting the above criteria will be given preference We invite all potential authors to submit their contribution to this poster session in the form of a two-page PDF extended abstract (we recommend using the ACM Proceedings style, and fonts not smaller than 10 point). Please provide the following information in your PDF file: - Poster title - Author names, affiliations, and email addresses - Note which authors, if any, are students Abstracts must be submitted through email with the Subject "HPDC 2014 Poster Submission" to chandra AT cs DOT umn DOT edu before May 16 2014, 5:00pm EDT. Authors will be notified of acceptance or rejection via e-mail by May 23, 2014. No reviews will be provided. Accepted posters will be published online on the conference website. Details about the poster presentation (e.g., poster size) will be available closer to the conference. For any questions about the submission, selection, and presentation of the accepted posters, please contact the Posters Chair, Abhishek Chandra, University of Minnesota (email: chandra AT cs DOT umn DOT edu), or see http://www.hpdc.org/2014/posters/call-for-posters/ for more information. -- ================================================================= Ioan Raicu, Ph.D. 
Assistant Professor, Illinois Institute of Technology (IIT) Guest Research Faculty, Argonne National Laboratory (ANL) ================================================================= Data-Intensive Distributed Systems Laboratory, CS/IIT Distributed Systems Laboratory, MCS/ANL ================================================================= Editor: IEEE TCC, Springer Cluster, Springer JoCCASA Chair: IEEE/ACM MTAGS, ACM ScienceCloud ================================================================= Cel: 1-847-722-0876 Office: 1-312-567-5704 Email: iraicu at cs.iit.edu Web: http://www.cs.iit.edu/~iraicu/ Web: http://datasys.cs.iit.edu/ LinkedIn: http://www.linkedin.com/in/ioanraicu Google: http://scholar.google.com/citations?user=jE73HYAAAAAJ ================================================================= ================================================================= -------------- next part -------------- An HTML attachment was scrubbed... URL: From wcatino at gmail.com Tue May 6 18:20:31 2014 From: wcatino at gmail.com (William Catino) Date: Tue, 6 May 2014 18:20:31 -0500 Subject: [Swift-user] Swift crash In-Reply-To: <53696AA0.2050608@anl.gov> References: <53696AA0.2050608@anl.gov> Message-ID: Thanks, Mike. On May 6, 2014 6:05 PM, "Michael Wilde" wrote: > Bill, since you are running on Open Science Grid with no specific Condor > requirements to select specific sites or node types, it looks like some > sites that your app tasks are running at do not have a Python environment > that works with your wrapper script. > > This might have something to do with that way you have set up or are using > virtualenv. > > You might want to start by adding requirements tags to specify that your > tasks should only run on the UC3 cycle-seeder nodes, make sure everything > works correctly there, and then add tags for one additional pool of OSG > Connect resources at a time. You can find info on this in the OSG Connect > Book. 
> > You can also log the site name in your returned .err or .out file, via > your wrapper, which will help a lot in debugging. > > Then you can force tasks to the *bad* sites to debug and fix your python > environment for those sites. > > I can help more on this after Friday. Perhaps others on the Swift team or > this list can also provide assistance. > > - Mike > > > On 5/6/14, 5:53 PM, William Catino wrote: > >> I have a script that crashed 3 times, then succeeded, with no change. >> When it crashed, the .err file in the data directory contained a message >> about not finding a file whose name included >> >> /usr/lib64/libpython2.6.so.1.0 >> >> The attached log file that corresponds to a crash is: >> df-20140506-1724-d6au5ct7.log >> The attached log file that corresponds to a subsequent success is: >> df-20140506-1727-39v1is9b.log >> >> I might have specific files confused, but these files (and others from >> prior crashes) are located in my home directory on OSG, at >> /home/wcatino/df. >> This configuration is identical to that used when working with Mike today, >> except that the data directory was reduced to 5 files. >> >> I also noticed that my PATH is pointing swift to version 0.94 rather than >> 0.95. >> Perhaps I should change my PATH to fix this. >> The instructions I followed had me append the 0.95 path to PATH, which >> contains a 0.94 path earlier in the PATH string. >> >> Thanks. >> > > -- > Michael Wilde > Mathematics and Computer Science Computation Institute > Argonne National Laboratory The University of Chicago > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mattshax at gmail.com Sat May 10 11:12:07 2014 From: mattshax at gmail.com (Matthew Shaxted) Date: Sat, 10 May 2014 11:12:07 -0500 Subject: [Swift-user] Trunk/0.95 Build Message-ID: Hi, I'm trying to compile from trunk and/or 0.95 release to learn the new swift.properties method (which seems excellent btw) and having some issues. 
Trunk dist examples give me the strange error below, and 0.95 won't seem to build with the 4.1.10 version of cog.

Is there a precompiled version of 0.95 available? This link appears to be inactive: http://swiftlang.org/packages/swift-0.95.tar.gz

Swift trunk swift-r7850 cog-r3905
RunID: run004
Could not start execution:
Unknown function: swift:field, swift:field @ simple.kml, line: 12:
Unknown function: swift:field

Thanks
Matthew
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From wilde at anl.gov Sat May 10 11:16:37 2014
From: wilde at anl.gov (Michael Wilde)
Date: Sat, 10 May 2014 11:16:37 -0500
Subject: [Swift-user] Trunk/0.95 Build
In-Reply-To:
References:
Message-ID: <536E50E5.9040508@anl.gov>

Matthew, All,

Regarding:

On 5/10/14, 11:12 AM, Matthew Shaxted wrote:
> ...
>
> Is there a precompiled version of 0.95 available? This link appears to be
> inactive: http://swiftlang.org/packages/swift-0.95.tar.gz

Sorry, this file has been replaced by 0.95 release-candidate 5, so use:

http://swiftlang.org/packages/swift-0.95-RC5.tar.gz

We'll update the download info to reflect this.

- Mike

From wilde at anl.gov Sat May 10 11:29:42 2014
From: wilde at anl.gov (Michael Wilde)
Date: Sat, 10 May 2014 11:29:42 -0500
Subject: [Swift-user] Trunk/0.95 Build
In-Reply-To:
References:
Message-ID: <536E53F6.1010905@anl.gov>

On 5/10/14, 11:12 AM, Matthew Shaxted wrote:
...
> Trunk dist examples give me the strange error below, and 0.95 won't seem to
> build with the 4.1.10 version of cog.
>
> Is there a precompiled version of 0.95 available?
This link appears to be > inactive:http://swiftlang.org/packages/swift-0.95.tar.gz > > Swift trunk swift-r7850 cog-r3905 > RunID: run004 > Could not start execution: > Unknown function: swift:field, swift:field @ simple.kml, line: 12: > Unknown function: swift:field I am able to reproduce this on my own trunk build: swift$ swift test3.swift Swift trunk swift-r7850 cog-r3905 RunID: run079 Warning: The @ syntax for function invocation is deprecated Could not start execution: Unknown function: swift:field, swift:field @ test3.kml, line: 12: Unknown function: swift:field swift$ We'll fix and report back. Thanks for the report! - Mike From hategan at mcs.anl.gov Sat May 10 13:08:21 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sat, 10 May 2014 11:08:21 -0700 Subject: [Swift-user] Trunk/0.95 Build In-Reply-To: References: Message-ID: <1399745301.31413.2.camel@echo> On Sat, 2014-05-10 at 11:12 -0500, Matthew Shaxted wrote: [...] > Could not start execution: > Unknown function: swift:field, swift:field @ simple.kml, line: 12: > Unknown function: swift:field Hi, Thanks for reporting this and sorry about that. I was committing quite a big chunk of code last night and a piece got forgotten. I fixed that. Though trunk might have some other rough corners at the moment. Mihael From hategan at mcs.anl.gov Sat May 10 16:26:48 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sat, 10 May 2014 14:26:48 -0700 Subject: [Swift-user] Swift crash In-Reply-To: References: Message-ID: <1399757208.1975.30.camel@echo> Hi, On Mon, 2014-05-05 at 15:57 -0500, William Catino wrote: > I tried 100 files contianing a total of 100,000 filenames. It is running > now (for over 15 minutes). Just wanted to mention that even for relatively short path names you would need on the order of 500MB for the above just to store those file names. So I think most of the problems that you are seeing are due to this being a large problem, rather than some limitation with file staging performance. 
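The list files discussed in this thread (one filename per line, 1000 or 10000 names per file) can be regenerated at a different granularity with a small helper. What follows is only a minimal Python sketch under stated assumptions, not part of Swift: the function names and the list_NNNNN.txt naming are hypothetical, and the full set of names is assumed to fit in an in-memory list.

```python
import os
import tempfile


def chunk(names, size):
    """Yield consecutive slices of `names`, each holding at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(names), size):
        yield names[start:start + size]


def write_list_files(names, size, out_dir):
    """Write one list file per chunk (one filename per line); return the paths."""
    paths = []
    for i, part in enumerate(chunk(names, size)):
        # Hypothetical naming scheme; any scheme the input mapper can match works.
        path = os.path.join(out_dir, "list_%05d.txt" % i)
        with open(path, "w") as handle:
            handle.write("\n".join(part) + "\n")
        paths.append(path)
    return paths


if __name__ == "__main__":
    # Made-up demo data, only to show the call.
    demo_names = ["doc_%d.txt" % n for n in range(2500)]
    with tempfile.TemporaryDirectory() as tmp:
        written = write_list_files(demo_names, 1000, tmp)
        print(len(written), "list files written")
```

A smaller `size` yields more, shorter list files, and therefore more app invocations with a smaller array each.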
Your best bet would be to bias the splitting towards a larger number of application invocations with smaller arrays for each. Then move as much as possible of the app instance information inside a foreach loop and use foreach.max.threads to limit the number of concurrent iterations in that loop to something that is reasonable for the amount of heap given to the swift process. An alternative strategy is to set foreach.max.threads on the order of the amount of actual jobs you expect to be able to run at a time. If those are not viable options, I suggest looking into Swift/T, which has a distributed implementation of the runtime engine and can scale better for large problems. > > > Is there any documentation about the various limitations of Swift, > especially on OSG: > -max number of files to process Unfortunately these numbers depend on a lot of factors and it is hard to come up with something that is universally true. A file in /tmp will result in different memory use than a file in /gpfs/projects/john-d-jones/2014/projectx/version6/swift/input/mixed/, but then it really depends whether this uses a mapper that uses full file names explicitly or a mapper that builds file names procedurally. The complexity of the swift script is also a factor. > -max total number of bytes contained in the files processed This should not be a limitation of swift. Larger files will take more time to transfer, but should not cause any errors. > -max length of command line to app This unfortunately does not depend on swift. There are no limits (beside RAM) to how large an app command line can be. However, the operating system or job submission mechanism can impose their own limits. I believe that Condor has a limit of something like 4096 bytes for the command line. In any event, if this is a problem, and you are submitting jobs directly to condor (instead of coasters), you can use "wrapper.parameter.mode=files" in swift.properties. 
If you are using coasters, the OS will be the limit you are likely to hit first. > -max number of nodes This isn't limited by Swift. Sites may limit this number and it is also possible that for short apps you will not get much speedup after a certain number of nodes. > -max number of slots Again, not limited by Swift, but certain sites/queues may only allow a limited number of jobs in the queue. Mihael From ketan at mcs.anl.gov Sun May 11 19:41:12 2014 From: ketan at mcs.anl.gov (Ketan Maheshwari) Date: Sun, 11 May 2014 19:41:12 -0500 Subject: [Swift-user] Could not convert value to number: 0 Message-ID: Hi, Trying to get a Swift script running from Galaxy but getting this error: Could not convert value to number: 0 I can reproduce this from outside of Galaxy but could not locate the cause. Full error message is: Swift 0.94.1 swift-r7114 cog-r3803 RunID: 20140511-1931-xqay4e2d Progress: time: Sun, 11 May 2014 19:31:38 -0500 Caused by: Could not convert value to number: 0 Caused by: For input string: "0 " Caused by: Could not convert value to number: 0 Caused by: For input string: "0 " Caused by: Could not convert value to number: 0 Caused by: For input string: "0 " Final status: Sun, 11 May 2014 19:31:38 -0500 Failed:3 The following errors have occurred: 1. Could not convert value to number: 0 Caused by: For input string: "0 " (3 times) Attached is the tarball with script and configs. Thanks for any help. Best, Ketan -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: swift-gal.ZQib.tgz Type: application/x-gzip Size: 5482 bytes Desc: not available URL: From tim.g.armstrong at gmail.com Sun May 11 20:45:55 2014 From: tim.g.armstrong at gmail.com (Tim Armstrong) Date: Sun, 11 May 2014 20:45:55 -0500 Subject: [Swift-user] Could not convert value to number: 0 In-Reply-To: References: Message-ID: The problem might be the extra space after execution.retries=0 in the cf file. If that's the case, this is clearly poor error reporting so probably makes sense to file a ticket for this. - Tim On Sun, May 11, 2014 at 7:41 PM, Ketan Maheshwari wrote: > Hi, > > Trying to get a Swift script running from Galaxy but getting this error: > > Could not convert value to number: 0 > > > I can reproduce this from outside of Galaxy but could not locate the cause. > > Full error message is: > > Swift 0.94.1 swift-r7114 cog-r3803 > > RunID: 20140511-1931-xqay4e2d > Progress: time: Sun, 11 May 2014 19:31:38 -0500 > Caused by: Could not convert value to number: 0 > Caused by: For input string: "0 " > > Caused by: Could not convert value to number: 0 > Caused by: For input string: "0 " > > Caused by: Could not convert value to number: 0 > Caused by: For input string: "0 " > > Final status: Sun, 11 May 2014 19:31:38 -0500 Failed:3 > The following errors have occurred: > 1. Could not convert value to number: 0 > Caused by: > For input string: "0 " (3 times) > > > Attached is the tarball with script and configs. > > Thanks for any help. > > Best, > Ketan > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ketan at mcs.anl.gov Sun May 11 21:01:32 2014 From: ketan at mcs.anl.gov (Ketan Maheshwari) Date: Sun, 11 May 2014 21:01:32 -0500 Subject: [Swift-user] Could not convert value to number: 0 In-Reply-To: References: Message-ID: Thanks Tim. 
That was it. Will file a ticket. On Sun, May 11, 2014 at 8:45 PM, Tim Armstrong wrote: > The problem might be the extra space after execution.retries=0 in the cf > file. If that's the case, this is clearly poor error reporting so probably > makes sense to file a ticket for this. > > - Tim > > > On Sun, May 11, 2014 at 7:41 PM, Ketan Maheshwari wrote: > >> Hi, >> >> Trying to get a Swift script running from Galaxy but getting this error: >> >> Could not convert value to number: 0 >> >> >> I can reproduce this from outside of Galaxy but could not locate the >> cause. >> >> Full error message is: >> >> Swift 0.94.1 swift-r7114 cog-r3803 >> >> RunID: 20140511-1931-xqay4e2d >> Progress: time: Sun, 11 May 2014 19:31:38 -0500 >> Caused by: Could not convert value to number: 0 >> Caused by: For input string: "0 " >> >> Caused by: Could not convert value to number: 0 >> Caused by: For input string: "0 " >> >> Caused by: Could not convert value to number: 0 >> Caused by: For input string: "0 " >> >> Final status: Sun, 11 May 2014 19:31:38 -0500 Failed:3 >> The following errors have occurred: >> 1. Could not convert value to number: 0 >> Caused by: >> For input string: "0 " (3 times) >> >> >> Attached is the tarball with script and configs. >> >> Thanks for any help. >> >> Best, >> Ketan >> >> _______________________________________________ >> Swift-user mailing list >> Swift-user at ci.uchicago.edu >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user >> > > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > -------------- next part -------------- An HTML attachment was scrubbed... 
URL:

From bronevetsky1 at llnl.gov Fri May 16 12:35:54 2014
From: bronevetsky1 at llnl.gov (Bronevetsky, Greg)
Date: Fri, 16 May 2014 17:35:54 +0000
Subject: [Swift-user] Resuming jobs when script has changed
Message-ID: <8635C0D1735D2C4BA6E571FD97486FF172EC8E71@PRDEXMBX-05.the-lab.llnl.gov>

I have Swift scripts that scan some portion of a design space and after I have scanned a sub-space of all the possibilities I modify the script to target a different, overlapping portion of the design space. However, it seems that when I do this Swift ignores the fact that I've already computed many of the tasks in the new run when I performed the prior run, and re-executes them redundantly. Is there a way for me to avoid such redundant executions? Can the -resume flag be used in this case?

Greg Bronevetsky
Lawrence Livermore National Lab
(925) 424-5756
bronevetsky at llnl.gov
http://greg.bronevetsky.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From wilde at anl.gov Fri May 16 12:52:50 2014
From: wilde at anl.gov (Michael Wilde)
Date: Fri, 16 May 2014 12:52:50 -0500
Subject: [Swift-user] Resuming jobs when script has changed
In-Reply-To: <8635C0D1735D2C4BA6E571FD97486FF172EC8E71@PRDEXMBX-05.the-lab.llnl.gov>
References: <8635C0D1735D2C4BA6E571FD97486FF172EC8E71@PRDEXMBX-05.the-lab.llnl.gov>
Message-ID: <53765072.9070906@anl.gov>

Hi Greg,

The Swift resume mechanism can't do this, as it's driven by internal variable names within a Swift run, not by external file existence.

The best way to do what you describe below is to write an external input file mapper (e.g. a simple shell or py script) that returns only the files that still need to be produced or processed.
(This is the "ext" mapper: http://swift-lang.org/guides/release-0.94/userguide/userguide.html#_external_mapper ) You can also call a local app to determine what files need to be produced, then use readData() to read those file names into an array, and array_mapper to map an array of remaining work to do. Would one of these approaches meet your needs? - Mike On 5/16/14, 12:35 PM, Bronevetsky, Greg wrote: > I have Swift scripts that scan some portion of a design space and after I have scanned a sub-space of all the possibilities I modify the script to target a different, overlapping portion of the design space. However, it seems that when I do this Swift ignores the fact that I've already computed many of the tasks in the new run when I performed the prior run, and re-executes them redundantly. Is there a way for me to avoid such redundant executions? Can the -resume flag be used in this case? > > Greg Bronevetsky > Lawrence Livermore National Lab > (925) 424-5756 > bronevetsky at llnl.gov > http://greg.bronevetsky.com > > > > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user -- Michael Wilde Mathematics and Computer Science Computation Institute Argonne National Laboratory The University of Chicago -------------- next part -------------- An HTML attachment was scrubbed... URL: From bronevetsky1 at llnl.gov Fri May 16 13:18:35 2014 From: bronevetsky1 at llnl.gov (Bronevetsky, Greg) Date: Fri, 16 May 2014 18:18:35 +0000 Subject: [Swift-user] Resuming jobs when script has changed In-Reply-To: <53765072.9070906@anl.gov> References: <8635C0D1735D2C4BA6E571FD97486FF172EC8E71@PRDEXMBX-05.the-lab.llnl.gov> <53765072.9070906@anl.gov> Message-ID: <8635C0D1735D2C4BA6E571FD97486FF172EC8EBC@PRDEXMBX-05.the-lab.llnl.gov> The external mapper mechanism looks to be appropriate. 
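A minimal sketch of such an external mapper script, in Python, is given here for illustration only. It assumes the mapper executable is expected to print one `[index] filename` line per array element to stdout, as in the User Guide's external mapper example. The function name `pending_mappings` and the convention of passing candidate paths as plain command-line arguments are hypothetical; how Swift actually passes mapper parameters to the executable should be checked against the guide.

```python
import os
import sys


def pending_mappings(candidates, exists=os.path.exists):
    """Return '[i] path' lines for candidates whose file does not exist yet.

    Indices are renumbered from 0 so that the mapped array has no gaps.
    """
    missing = [path for path in candidates if not exists(path)]
    return ["[%d] %s" % (i, path) for i, path in enumerate(missing)]


if __name__ == "__main__":
    # Hypothetical calling convention: each command-line argument is one
    # candidate output path from the design-space scan.
    for line in pending_mappings(sys.argv[1:]):
        print(line)
```

Files produced by an earlier run are dropped from the mapping, so a foreach over the mapped array only visits work that is still outstanding.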
To make sure I understand you, is the idea that I'd replace my current mappers which are ; with an external mapper < ext; exec="script.sh", arg=@strcat("a","b","c")>; Then script.sh would return the string @strcat("a","b","c") if the file with this path does not exist and an empty file if it does.

Greg Bronevetsky
Lawrence Livermore National Lab
(925) 424-5756
bronevetsky at llnl.gov
http://greg.bronevetsky.com

From: swift-user-bounces at ci.uchicago.edu [mailto:swift-user-bounces at ci.uchicago.edu] On Behalf Of Michael Wilde
Sent: Friday, May 16, 2014 10:53 AM
To: swift-user at ci.uchicago.edu
Subject: Re: [Swift-user] Resuming jobs when script has changed

Hi Greg,

The Swift resume mechanism can't do this, as it's driven by internal variable names within a Swift run, not by external file existence.

The best way to do what you describe below is to write an external input file mapper (e.g. a simple shell or py script) that returns only the files that still need to be produced or processed. (This is the "ext" mapper: http://swift-lang.org/guides/release-0.94/userguide/userguide.html#_external_mapper )

You can also call a local app to determine what files need to be produced, then use readData() to read those file names into an array, and array_mapper to map an array of remaining work to do.

Would one of these approaches meet your needs?

- Mike

On 5/16/14, 12:35 PM, Bronevetsky, Greg wrote:
I have Swift scripts that scan some portion of a design space and after I have scanned a sub-space of all the possibilities I modify the script to target a different, overlapping portion of the design space. However, it seems that when I do this Swift ignores the fact that I've already computed many of the tasks in the new run when I performed the prior run, and re-executes them redundantly. Is there a way for me to avoid such redundant executions? Can the -resume flag be used in this case?
Greg Bronevetsky
Lawrence Livermore National Lab
(925) 424-5756
bronevetsky at llnl.gov
http://greg.bronevetsky.com

_______________________________________________
Swift-user mailing list
Swift-user at ci.uchicago.edu
https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user

--
Michael Wilde
Mathematics and Computer Science Computation Institute
Argonne National Laboratory The University of Chicago
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From bronevetsky1 at llnl.gov Sun May 18 17:16:33 2014
From: bronevetsky1 at llnl.gov (Bronevetsky, Greg)
Date: Sun, 18 May 2014 22:16:33 +0000
Subject: [Swift-user] Reduction trees
Message-ID: <8635C0D1735D2C4BA6E571FD97486FF179921DBD@PRDEXMBX-05.the-lab.llnl.gov>

I need to implement a reduction tree in Swift to aggregate the results of many individual runs. I've written the simple algorithm below, which works when all my runs correspond to numbers in a fixed range. However, in reality I have a regular or associative array of files produced by individual runs. How do I adapt the code below to this scenario? I don't see any way to iterate over subsets of array indexes/keys. Thanks!
Greg Bronevetsky
Lawrence Livermore National Lab
(925) 424-5756
bronevetsky at llnl.gov
http://greg.bronevetsky.com

type file;

app (file outFile) gen(string arg)
{
  echo arg stdout=@filename(outFile);
}

app (file summaryFile) catBase(file inFiles[]) {
  cat @filenames(inFiles) stdout=@filename(summaryFile);
}

// Concatenate the text of numbers in range [minR, maxR) (includes minR, not maxR)
(file summaryFile) catTree(int minR, int maxR, int radix, int level) {
  file subFiles[];
  if((maxR-minR) <= radix) {
    //tracef("catTree: leaf: minR=%i, maxR=%i\n", minR, maxR);
    foreach b in [0: (maxR-minR)-1] {
      //tracef("catTree: leaf: b=%i\n", b);
      file curLeafFile ;
      (curLeafFile) = gen(@toString(minR+b));
      subFiles[b] = curLeafFile;
    }
  } else {
    int size=maxR-minR;
    //tracef("catTree: node: minR=%i, maxR=%i\n", minR, maxR);
    foreach b in [0: radix-1] {
      file curNodeFile ;
      int start = minR + (size*b)%/radix;
      int end = minR + (size*(b+1))%/radix;
      //tracef("catTree: node: b=%i [%i - %i]\n", b, start, end);
      (curNodeFile) = catTree(start, end, radix, level+1);
      subFiles[b] = curNodeFile;
    }
  }
  //tracef("catBase: inFiles=%q\n", @filenames(subFiles));
  (summaryFile) = catBase(subFiles);
}

file allFile ;
(allFile) = catTree(0, 31, 2, 0);
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From hategan at mcs.anl.gov Sun May 18 19:04:58 2014
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sun, 18 May 2014 17:04:58 -0700
Subject: [Swift-user] Reduction trees
In-Reply-To: <8635C0D1735D2C4BA6E571FD97486FF179921DBD@PRDEXMBX-05.the-lab.llnl.gov>
References: <8635C0D1735D2C4BA6E571FD97486FF179921DBD@PRDEXMBX-05.the-lab.llnl.gov>
Message-ID: <1400457898.10734.1.camel@echo>

Hi,

Can you be more specific about what you mean by "subsets of keys" below? Specifically, how are these subsets defined?

Mihael

On Sun, 2014-05-18 at 22:16 +0000, Bronevetsky, Greg wrote:
> I need to implement a reduction tree in Swift to aggregate the
> results of many individual runs.
I've written the simple algorithm > below, which works when all my runs correspond to numbers in a fixed > range. However, in reality I have a regular or associative array of > files produced by individual runs. How do I adapt the code below to > this scenario? I don't see any way to iterate over subsets of array > indexes/keys. Thanks! > > Greg Bronevetsky > Lawrence Livermore National Lab > (925) 424-5756 > bronevetsky at llnl.gov > http://greg.bronevetsky.com > > > type file; > > app (file outFile) gen(string arg) > { > echo arg stdout=@filename(outFile); > } > > app (file summaryFile) catBase(file inFiles[]) { > cat @filenames(inFiles) stdout=@filename(summaryFile); > } > > // Concatenate the text of numbers in range [minR - maxR) (includes minR, not maxR) > (file summaryFile) catTree(int minR, int maxR, int radix, int level) { > file subFiles[]; > if((maxR-minR) <= radix) { > //tracef("catTree: leaf: minR=%i, minR=%i\n", minR, maxR); > foreach b in [0: (maxR-minR)-1] { > //tracef("catTree: leaf: b=%i\n", b); > file curLeafFile ; > (curLeafFile) = gen(@toString(minR+b)); > subFiles[b] = curLeafFile; > } > } else { > int size=maxR-minR; > //tracef("catTree: node: minR=%i, minR=%i\n", minR, maxR); > foreach b in [0: radix-1] { > file curNodeFile ; > int start = minR + (size*b)%/radix; > int end = minR + (size*(b+1))%/radix; > //tracef("catTree: node: b=%i [%i - %i]\n", b, start, end); > (curNodeFile) = catTree(start, end, radix, level+1); > subFiles[b] = curNodeFile; > } > } > //tracef("catBase: inFiles=%q\n", @filenames(subFiles)); > (summaryFile) = catBase(subFiles); > } > > > file allFile ; > (allFile) = catTree(0, 31, 2, 0); > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user From bronevetsky1 at llnl.gov Mon May 19 09:09:43 2014 From: bronevetsky1 at llnl.gov (Bronevetsky, Greg) Date: Mon, 19 May 2014 14:09:43 +0000 Subject: 
[Swift-user] Reduction trees In-Reply-To: <1400457898.10734.1.camel@echo> References: <8635C0D1735D2C4BA6E571FD97486FF179921DBD@PRDEXMBX-05.the-lab.llnl.gov> <1400457898.10734.1.camel@echo> Message-ID: <8635C0D1735D2C4BA6E571FD97486FF179922FA6@PRDEXMBX-05.the-lab.llnl.gov> What I mean is that I have an array of files computed by my individual tasks. To do a reduction tree on them I need to create a sub-array of the first 100, the second 100, etc. so that each sub-array can be merged independently. In my code example below I skip this step and simply generate a region of numbers (minR, maxR) for each node in my reduction tree and then do stuff with the numbers. Now I need for these numbers to correspond to indexes in an array of files. Greg Bronevetsky Lawrence Livermore National Lab (925) 424-5756 bronevetsky at llnl.gov http://greg.bronevetsky.com -----Original Message----- From: Mihael Hategan [mailto:hategan at mcs.anl.gov] Sent: Sunday, May 18, 2014 5:05 PM To: Bronevetsky, Greg Cc: swift-user at ci.uchicago.edu Subject: Re: [Swift-user] Reduction trees Hi, Can you be more specific about what you mean by "subsets of keys" below? Specifically, how are these sub sets defined? Mihael On Sun, 2014-05-18 at 22:16 +0000, Bronevetsky, Greg wrote: > I need to implement a reduction three in Swift to aggregate the > results of many individual runs. I've written the simple algorithm > below, which works when all my runs correspond to numbers in a fixed > range. However, in reality I have a regular or associative array of > files produced by individual runs. How do I adapt the code below to > this scenario? I don't see any way to iterate over subsets of array > indexes/keys. Thanks! 
> > Greg Bronevetsky > Lawrence Livermore National Lab > (925) 424-5756 > bronevetsky at llnl.gov > http://greg.bronevetsky.com > > > type file; > > app (file outFile) gen(string arg) > { > echo arg stdout=@filename(outFile); > } > > app (file summaryFile) catBase(file inFiles[]) { > cat @filenames(inFiles) stdout=@filename(summaryFile); } > > // Concatenate the text of numbers in range [minR - maxR) (includes > minR, not maxR) (file summaryFile) catTree(int minR, int maxR, int radix, int level) { > file subFiles[]; > if((maxR-minR) <= radix) { > //tracef("catTree: leaf: minR=%i, minR=%i\n", minR, maxR); > foreach b in [0: (maxR-minR)-1] { > //tracef("catTree: leaf: b=%i\n", b); > file curLeafFile ; > (curLeafFile) = gen(@toString(minR+b)); > subFiles[b] = curLeafFile; > } > } else { > int size=maxR-minR; > //tracef("catTree: node: minR=%i, minR=%i\n", minR, maxR); > foreach b in [0: radix-1] { > file curNodeFile ; > int start = minR + (size*b)%/radix; > int end = minR + (size*(b+1))%/radix; > //tracef("catTree: node: b=%i [%i - %i]\n", b, start, end); > (curNodeFile) = catTree(start, end, radix, level+1); > subFiles[b] = curNodeFile; > } > } > //tracef("catBase: inFiles=%q\n", @filenames(subFiles)); > (summaryFile) = catBase(subFiles); > } > > > file allFile ; > (allFile) = catTree(0, 31, 2, 0); > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user From bronevetsky1 at llnl.gov Mon May 19 12:39:18 2014 From: bronevetsky1 at llnl.gov (Bronevetsky, Greg) Date: Mon, 19 May 2014 17:39:18 +0000 Subject: [Swift-user] Resuming jobs when script has changed In-Reply-To: <53765072.9070906@anl.gov> References: <8635C0D1735D2C4BA6E571FD97486FF172EC8E71@PRDEXMBX-05.the-lab.llnl.gov> <53765072.9070906@anl.gov> Message-ID: <8635C0D1735D2C4BA6E571FD97486FF17992320C@PRDEXMBX-05.the-lab.llnl.gov> Based on your suggestion, I came up with the following code. 
However, I'm having an issue with it. If I use an external mapper I can choose whether to return a file or not based on whether it already exists. However, once I have made the decision not to return a file from the external mapper because it has already been generated, I don't see how to enable subsequent workflow steps to take it as input. In the code below, if dataFile exists but copyFile does not, I get the following error:

Execution failed:
org.griphyn.vdl.mapping.InvalidPathException: Array index '0' not found for f1 of size 0
copyFile, testResume.swift, line 20
copyFile, testResume.swift, line 20

Greg Bronevetsky
Lawrence Livermore National Lab
(925) 424-5756
bronevetsky at llnl.gov
http://greg.bronevetsky.com

Swift:
type file;

string ROOT_PATH="/g/g15/bronevet/apps/swift-0.94.1/examples/test";

app (file outF) writeFile(string message) {
  echo message stdout=@filename(outF);
}

app (file outF) copyFile(file inF) {
  cp @filename(inF) @filename(outF);
}

file f1[] ;
if(@length(f1) == 1) {
  f1[0] = writeFile("hello");
}

file f2[] ;
if(@length(f2) == 1) {
  f2[0] = copyFile(f1[0]);
}

External mapper in Python:
#!/usr/apps/python2.7.3/bin/python

import argparse
import sys
import os

def main(argv):
    parser = argparse.ArgumentParser(description='Merge stats and distances files from experiments.')
    parser.add_argument('-fName', dest='fName', action='store', nargs="+", help='List of files to check for existence')
    args = parser.parse_args()

    for i in range(0, len(args.fName)):
        if(not (os.path.exists(args.fName[i]))):
            print "["+str(i)+"] "+args.fName[i]

if __name__ == "__main__":
    main(sys.argv[1:])

From: swift-user-bounces at ci.uchicago.edu [mailto:swift-user-bounces at ci.uchicago.edu] On Behalf Of Michael Wilde
Sent: Friday, May 16, 2014 10:53 AM
To: swift-user at ci.uchicago.edu
Subject: Re: [Swift-user] Resuming jobs when script has changed

Hi Greg,

The Swift resume mechanism can't do this, as it's driven by internal variable names within a Swift run, not by external file
existence. The best way to do what you describe below is to write an external input file mapper (e.g. a simple shell or py script) that returns only the files that still need to be produced or processed. (This is the "ext" mapper: http://swift-lang.org/guides/release-0.94/userguide/userguide.html#_external_mapper ) You can also call a local app to determine what files need to be produced, then use readData() to read those file names into an array, and array_mapper to map an array of remaining work to do. Would one of these approaches meet your needs? - Mike On 5/16/14, 12:35 PM, Bronevetsky, Greg wrote: I have Swift scripts that scan some portion of a design space and after I have scanned a sub-space of all the possibilities I modify the script to target a different, overlapping portion of the design space. However, it seems that when I do this Swift ignores the fact that I've already computed many of the tasks in the new run when I performed the prior run, and re-executes them redundantly. Is there a way for me to avoid such redundant executions? Can the -resume flag be used in this case? Greg Bronevetsky Lawrence Livermore National Lab (925) 424-5756 bronevetsky at llnl.gov http://greg.bronevetsky.com _______________________________________________ Swift-user mailing list Swift-user at ci.uchicago.edu https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user -- Michael Wilde Mathematics and Computer Science Computation Institute Argonne National Laboratory The University of Chicago -------------- next part -------------- An HTML attachment was scrubbed... 
URL:
From bronevetsky1 at llnl.gov Mon May 19 14:06:24 2014
From: bronevetsky1 at llnl.gov (Bronevetsky, Greg)
Date: Mon, 19 May 2014 19:06:24 +0000
Subject: [Swift-user] Resuming jobs when script has changed
In-Reply-To: <8635C0D1735D2C4BA6E571FD97486FF17992320C@PRDEXMBX-05.the-lab.llnl.gov>
References: <8635C0D1735D2C4BA6E571FD97486FF172EC8E71@PRDEXMBX-05.the-lab.llnl.gov> <53765072.9070906@anl.gov> <8635C0D1735D2C4BA6E571FD97486FF17992320C@PRDEXMBX-05.the-lab.llnl.gov>
Message-ID: <8635C0D1735D2C4BA6E571FD97486FF179923310@PRDEXMBX-05.the-lab.llnl.gov>

I've played around with it some more and put together the following solution, which appears to work. The main issues with this variant are:

- It turns each 1-line procedure call into 4 lines that test if the output already exists.
- I have to explicitly keep track of my working directory to make it possible to test for the existence of files there.
- I need to explicitly copy the files that already exist from my working directory to the temporary directory used by Swift to enable Swift to recognize these files' existence. Then Swift copies them back to their original locations, which is wasteful but at least correct.

If anybody can suggest alternatives that overcome the above issues, I'd be grateful. I expect that mine is a common use-case because without this every small change to a script causes all of its intermediate results to be recomputed, even if the script is large and takes days to compute.
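
To make the bookkeeping above concrete, here is a minimal sketch of the existence check on its own, written in Python 3 rather than SwiftScript. It is an illustration under assumptions, not part of the script in this message: the function name split_by_existence is hypothetical, and the file names dataF and copyF are simply the ones used in this thread. It only partitions the expected outputs into those already present in a working directory and those still to be produced; it does not address the copy into Swift's temporary directory.

```python
import os
import tempfile


def split_by_existence(work_dir, names):
    """Return (done, todo): names already present in work_dir, and names still missing.

    Hypothetical helper for illustration; it is not a Swift or Python library call.
    """
    done, todo = [], []
    for name in names:
        if os.path.exists(os.path.join(work_dir, name)):
            done.append(name)
        else:
            todo.append(name)
    return done, todo


if __name__ == "__main__":
    # Simulate a working directory in which dataF was produced by an earlier run.
    with tempfile.TemporaryDirectory() as work_dir:
        open(os.path.join(work_dir, "dataF"), "w").close()
        done, todo = split_by_existence(work_dir, ["dataF", "copyF"])
        print("already computed:", done)
        print("still to compute:", todo)
```

Keeping both lists matters for the problem described earlier in the thread: the files that already exist still have to be visible to later workflow steps as inputs, so they cannot simply be dropped from the mapping.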
Greg Bronevetsky
Lawrence Livermore National Lab
(925) 424-5756
bronevetsky at llnl.gov
http://greg.bronevetsky.com

type file;

string WORK_PATH="/g/g15/bronevet/apps/swift-0.94.1/examples/test/work";

app (file out) fileExistsApp(string d, string f) {
  fileExists "-fName" @strcat(d, "/", f) stdout=@filename(out);
}

(boolean exists) fileExists(string d, string f) {
  file tmp ;
  tracef("tmp=%M, f=%s\n", tmp, @strcat(d, "/", f));
  (tmp) = fileExistsApp(d, f);
  (exists) = readData(tmp);
  tracef("exists=%s\n", @toString(exists));
}

app (file outF) noop(string d, string f) {
  cp @strcat(d, "/", f) @filename(outF);
}

app (file outF) writeFile(string message) {
  echo message stdout=@filename(outF);
}

app (file outF) copyFile(file inF) {
  cp @filename(inF) @filename(outF);
}

file data ;
if(!fileExists(WORK_PATH, "dataF")) {
  (data) = writeData("hello");
} else {
  (data) = noop(WORK_PATH, "dataF");
}

file copy ;
if(!fileExists(WORK_PATH, "copyF")) {
  (copy) = copyFile(data);
} else {
  (copy) = noop(WORK_PATH, "copyF");
}

From: swift-user-bounces at ci.uchicago.edu [mailto:swift-user-bounces at ci.uchicago.edu] On Behalf Of Bronevetsky, Greg
Sent: Monday, May 19, 2014 10:39 AM
To: Michael Wilde; swift-user at ci.uchicago.edu
Subject: Re: [Swift-user] Resuming jobs when script has changed

Based on your suggestion, I came up with the following code. However, I'm having an issue with it. If I use an external mapper I can choose whether to return a file or not based on whether it already exists. However, once I made a decision not to return a file from the external mapper because it has already been generated, I don't see how to enable subsequent workflow steps to take it as input.
In the code below, if dataFile exists but copyFile does not, I get the following error: Execution failed: org.griphyn.vdl.mapping.InvalidPathException: Array index '0' not found for f1 of size 0 copyFile, testResume.swift, line 20 copyFile, testResume.swift, line 20 Greg Bronevetsky Lawrence Livermore National Lab (925) 424-5756 bronevetsky at llnl.gov http://greg.bronevetsky.com Swift: type file; string ROOT_PATH="/g/g15/bronevet/apps/swift-0.94.1/examples/test"; app (file outF) writeFile(string message) { echo message stdout=@filename(outF); } app (file outF) copyFile(file inF) { cp @filename(inF) @filename(outF); } file f1[] ; if(@length(f1) == 1) { f1[0] = writeFile("hello"); } file f2[] ; if(@length(f2) == 1) { f2[0] = copyFile(f1[0]); } External mapper in Python: #!/usr/apps/python2.7.3/bin/python import argparse import sys import os def main(argv): parser = argparse.ArgumentParser(description='Merge stats and distances files from experiments.') parser.add_argument('-fName', dest='fName', action='store', nargs="+", help='List of files to check for existence') args = parser.parse_args() for i in range(0, len(args.fName)): if(not (os.path.exists(args.fName[i]))): print "["+str(i)+"] "+args.fName[i] if __name__ == "__main__": main(sys.argv[1:]) From: swift-user-bounces at ci.uchicago.edu [mailto:swift-user-bounces at ci.uchicago.edu] On Behalf Of Michael Wilde Sent: Friday, May 16, 2014 10:53 AM To: swift-user at ci.uchicago.edu Subject: Re: [Swift-user] Resuming jobs when script has changed Hi Greg, The Swift resume mechanism can't do this, as its driven by internal variables name within a Swift run, not by external file existence. The best way to do what you describe below is to write an external input file mapper (e.g. a simple shell or py script) that returns only the files that still need to be produced or processed. 
(This is the "ext" mapper: http://swift-lang.org/guides/release-0.94/userguide/userguide.html#_external_mapper ) You can also call a local app to determine what files need to be produced, then use readData() to read those file names into an array, and array_mapper to map an array of remaining work to do. Would one of these approaches meet your needs? - Mike On 5/16/14, 12:35 PM, Bronevetsky, Greg wrote: I have Swift scripts that scan some portion of a design space and after I have scanned a sub-space of all the possibilities I modify the script to target a different, overlapping portion of the design space. However, it seems that when I do this Swift ignores the fact that I've already computed many of the tasks in the new run when I performed the prior run, and re-executes them redundantly. Is there a way for me to avoid such redundant executions? Can the -resume flag be used in this case? Greg Bronevetsky Lawrence Livermore National Lab (925) 424-5756 bronevetsky at llnl.gov http://greg.bronevetsky.com _______________________________________________ Swift-user mailing list Swift-user at ci.uchicago.edu https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user -- Michael Wilde Mathematics and Computer Science Computation Institute Argonne National Laboratory The University of Chicago -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Mon May 19 16:31:10 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 19 May 2014 14:31:10 -0700 Subject: [Swift-user] Reduction trees In-Reply-To: <8635C0D1735D2C4BA6E571FD97486FF179922FA6@PRDEXMBX-05.the-lab.llnl.gov> References: <8635C0D1735D2C4BA6E571FD97486FF179921DBD@PRDEXMBX-05.the-lab.llnl.gov> <1400457898.10734.1.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF179922FA6@PRDEXMBX-05.the-lab.llnl.gov> Message-ID: <1400535070.19707.7.camel@echo> You are correct. There does not seem to be an easy way to slice a sparse array (or an array with non-int keys). 
I've filed an enhancement report to add the relevant feature (https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=1275) In the mean time, I'm trying to see if there might be a hack that can allow you to do what you need. Can you post the code that generates the sparse arrays you mention? I'm trying to picture a solution, but it's hard without a concrete example. Mihael On Mon, 2014-05-19 at 14:09 +0000, Bronevetsky, Greg wrote: > What I mean is that I have an array of files computed by my individual > tasks. To do a reduction tree on them I need to create a sub-array of > the first 100, the second 100, etc. so that each sub-array can be > merged independently. In my code example below I skip this step and > simply generate a region of numbers (minR, maxR) for each node in my > reduction tree and then do stuff with the numbers. Now I need for > these numbers to correspond to indexes in an array of files. > > Greg Bronevetsky > Lawrence Livermore National Lab > (925) 424-5756 > bronevetsky at llnl.gov > http://greg.bronevetsky.com > > -----Original Message----- > From: Mihael Hategan [mailto:hategan at mcs.anl.gov] > Sent: Sunday, May 18, 2014 5:05 PM > To: Bronevetsky, Greg > Cc: swift-user at ci.uchicago.edu > Subject: Re: [Swift-user] Reduction trees > > Hi, > > Can you be more specific about what you mean by "subsets of keys" below? > Specifically, how are these sub sets defined? > > Mihael > > On Sun, 2014-05-18 at 22:16 +0000, Bronevetsky, Greg wrote: > > I need to implement a reduction three in Swift to aggregate the > > results of many individual runs. I've written the simple algorithm > > below, which works when all my runs correspond to numbers in a fixed > > range. However, in reality I have a regular or associative array of > > files produced by individual runs. How do I adapt the code below to > > this scenario? I don't see any way to iterate over subsets of array > > indexes/keys. Thanks! 
> > > > Greg Bronevetsky > > Lawrence Livermore National Lab > > (925) 424-5756 > > bronevetsky at llnl.gov > > http://greg.bronevetsky.com > > > > > > type file; > > > > app (file outFile) gen(string arg) > > { > > echo arg stdout=@filename(outFile); > > } > > > > app (file summaryFile) catBase(file inFiles[]) { > > cat @filenames(inFiles) stdout=@filename(summaryFile); } > > > > // Concatenate the text of numbers in range [minR - maxR) (includes > > minR, not maxR) (file summaryFile) catTree(int minR, int maxR, int radix, int level) { > > file subFiles[]; > > if((maxR-minR) <= radix) { > > //tracef("catTree: leaf: minR=%i, minR=%i\n", minR, maxR); > > foreach b in [0: (maxR-minR)-1] { > > //tracef("catTree: leaf: b=%i\n", b); > > file curLeafFile ; > > (curLeafFile) = gen(@toString(minR+b)); > > subFiles[b] = curLeafFile; > > } > > } else { > > int size=maxR-minR; > > //tracef("catTree: node: minR=%i, minR=%i\n", minR, maxR); > > foreach b in [0: radix-1] { > > file curNodeFile ; > > int start = minR + (size*b)%/radix; > > int end = minR + (size*(b+1))%/radix; > > //tracef("catTree: node: b=%i [%i - %i]\n", b, start, end); > > (curNodeFile) = catTree(start, end, radix, level+1); > > subFiles[b] = curNodeFile; > > } > > } > > //tracef("catBase: inFiles=%q\n", @filenames(subFiles)); > > (summaryFile) = catBase(subFiles); > > } > > > > > > file allFile ; > > (allFile) = catTree(0, 31, 2, 0); > > > > _______________________________________________ > > Swift-user mailing list > > Swift-user at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > From hategan at mcs.anl.gov Mon May 19 16:40:11 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 19 May 2014 14:40:11 -0700 Subject: [Swift-user] Resuming jobs when script has changed In-Reply-To: <8635C0D1735D2C4BA6E571FD97486FF17992320C@PRDEXMBX-05.the-lab.llnl.gov> References: <8635C0D1735D2C4BA6E571FD97486FF172EC8E71@PRDEXMBX-05.the-lab.llnl.gov> <53765072.9070906@anl.gov> 
 <8635C0D1735D2C4BA6E571FD97486FF17992320C@PRDEXMBX-05.the-lab.llnl.gov>
Message-ID: <1400535611.19707.13.camel@echo>

I think the problem is that you are assuming (when you say f1[0]) that the mapper returns at least one element for the array.

I also believe that Mike's suggestion went along the following lines:

file[] list ;
etc.

Also, recent versions of Swift (after 0.94) have an exists() function. E.g.

if (!exists(filename(outputFile))) {
  outputFile = doStuff(...);
}

I have used this latter bit to do something similar to what you are trying to do.

Mihael

On Mon, 2014-05-19 at 17:39 +0000, Bronevetsky, Greg wrote:
> Based on your suggestion, I came up with the following code. However, I'm having an issue with it. If I use an external mapper I can choose whether to return a file or not based on whether it already exists. However, once I made a decision not to return a file from the external mapper because it has already been generated, I don't see how to enable subsequent workflow steps to take it as input.
In the code below, if dataFile exists but copyFile does not, I get the following error: > Execution failed: > org.griphyn.vdl.mapping.InvalidPathException: Array index '0' not found for f1 of size 0 > copyFile, testResume.swift, line 20 > copyFile, testResume.swift, line 20 > > Greg Bronevetsky > Lawrence Livermore National Lab > (925) 424-5756 > bronevetsky at llnl.gov > http://greg.bronevetsky.com > > Swift: > type file; > > string ROOT_PATH="/g/g15/bronevet/apps/swift-0.94.1/examples/test"; > > app (file outF) writeFile(string message) { > echo message stdout=@filename(outF); > } > > app (file outF) copyFile(file inF) { > cp @filename(inF) @filename(outF); > } > > file f1[] ; > if(@length(f1) == 1) { > f1[0] = writeFile("hello"); > } > > file f2[] ; > if(@length(f2) == 1) { > f2[0] = copyFile(f1[0]); > } > External mapper in Python: > #!/usr/apps/python2.7.3/bin/python > > import argparse > import sys > import os > > def main(argv): > parser = argparse.ArgumentParser(description='Merge stats and distances files from experiments.') > parser.add_argument('-fName', dest='fName', action='store', nargs="+", help='List of files to check for existence') > args = parser.parse_args() > > for i in range(0, len(args.fName)): > if(not (os.path.exists(args.fName[i]))): > print "["+str(i)+"] "+args.fName[i] > > if __name__ == "__main__": > main(sys.argv[1:]) > > From: swift-user-bounces at ci.uchicago.edu [mailto:swift-user-bounces at ci.uchicago.edu] On Behalf Of Michael Wilde > Sent: Friday, May 16, 2014 10:53 AM > To: swift-user at ci.uchicago.edu > Subject: Re: [Swift-user] Resuming jobs when script has changed > > Hi Greg, > > The Swift resume mechanism can't do this, as its driven by internal variables name within a Swift run, not by external file existence. > > The best way to do what you describe below is to write an external input file mapper (e.g. a simple shell or py script) that returns only the files that still need to be produced or processed. 
(This is the "ext" mapper: http://swift-lang.org/guides/release-0.94/userguide/userguide.html#_external_mapper ) > > You can also call a local app to determine what files need to be produced, then use readData() to read those file names into an array, and array_mapper to map an array of remaining work to do. > > Would one of these approaches meet your needs? > > - Mike > > On 5/16/14, 12:35 PM, Bronevetsky, Greg wrote: > > I have Swift scripts that scan some portion of a design space and after I have scanned a sub-space of all the possibilities I modify the script to target a different, overlapping portion of the design space. However, it seems that when I do this Swift ignores the fact that I've already computed many of the tasks in the new run when I performed the prior run, and re-executes them redundantly. Is there a way for me to avoid such redundant executions? Can the -resume flag be used in this case? > > > > Greg Bronevetsky > > Lawrence Livermore National Lab > > (925) 424-5756 > > bronevetsky at llnl.gov > > http://greg.bronevetsky.com > > > > > > > > > _______________________________________________ > > Swift-user mailing list > > Swift-user at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > > > -- > > Michael Wilde > > Mathematics and Computer Science Computation Institute > > Argonne National Laboratory The University of Chicago > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user From bronevetsky1 at llnl.gov Mon May 19 16:40:42 2014 From: bronevetsky1 at llnl.gov (Bronevetsky, Greg) Date: Mon, 19 May 2014 21:40:42 +0000 Subject: [Swift-user] Reduction trees In-Reply-To: <1400535070.19707.7.camel@echo> References: <8635C0D1735D2C4BA6E571FD97486FF179921DBD@PRDEXMBX-05.the-lab.llnl.gov> <1400457898.10734.1.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF179922FA6@PRDEXMBX-05.the-lab.llnl.gov> 
<1400535070.19707.7.camel@echo> Message-ID: <8635C0D1735D2C4BA6E571FD97486FF179923482@PRDEXMBX-05.the-lab.llnl.gov> I'm actually not sure how to write such code in Swift so let me describe the task I am interested in. I have an array of files that contain the results of individual computation tasks. I wish to aggregate these files into one summary file. Swift makes it easy to pass the entire array to a single task to aggregate. However, with many files I reach the limit of the command line length constraints and further, the aggregator task takes too long. As such, I need to create a reduction tree, where the first n files are aggregated by one task, the next n by another and so on. Then the results of these aggregation tasks are themselves aggregated and so on until I have a single file that contains the aggregation of all the tasks. The code example below does this when the input data to my tasks is just a number range and the code recursively chops the range into sub-segments, performing an aggregation on each one. However, if the input data is more complex, I don't see a way in Swift to do the same thing. I can't use regular looping constructs to create an array of computation task inputs or array of computation task output files since I don't know how to operate on sub-arrays. I might be able to pull some trick where I take a complex loop nest that generates the computation tasks and push it deep inside the recursion below, forcing the loop nest within each leaf of the recursion to perform just the portion of the iteration space that corresponds to the range bounds of that leaf. However, even if this were expressible in Swift, it would be hard to read. That's all I can think of right now but any other suggestions would be welcome. 
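
The grouping step described above can be sketched independently of Swift. The following is a minimal Python 3 illustration under assumptions, not SwiftScript and not a feature Swift is known to provide: chunks and reduce_tree are hypothetical names, and combine stands in for an aggregation task such as the catBase app from the earlier example. Each level groups the current results into sub-lists of at most n items and combines every group, until a single result is left.

```python
def chunks(items, n):
    """Split items into consecutive groups of at most n elements."""
    return [items[i:i + n] for i in range(0, len(items), n)]


def reduce_tree(items, n, combine):
    """Combine groups of at most n items, level by level, until one result remains.

    combine takes a list of results and returns one result. With a single
    input item, that item is returned as-is and combine is never called.
    """
    if n < 2:
        raise ValueError("group size must be at least 2, or a level never shrinks")
    if not items:
        raise ValueError("nothing to reduce")
    level = list(items)
    while len(level) > 1:
        level = [combine(group) for group in chunks(level, n)]
    return level[0]


if __name__ == "__main__":
    # An associative array of per-run results, keyed by arbitrary strings.
    outputs = {"run-b": "B", "run-a": "A", "run-c": "C"}
    # Fix an order over the keys first, so the groups are well defined.
    ordered = [outputs[key] for key in sorted(outputs)]
    # String concatenation stands in for the real aggregation task.
    print(reduce_tree(ordered, 2, "".join))
```

The only requirement on the input is that its keys can be put into some fixed order, which is why the example sorts them before grouping; the positions in that ordered list then play the role of the minR and maxR bounds in the number-range version.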
Greg Bronevetsky Lawrence Livermore National Lab (925) 424-5756 bronevetsky at llnl.gov http://greg.bronevetsky.com -----Original Message----- From: Mihael Hategan [mailto:hategan at mcs.anl.gov] Sent: Monday, May 19, 2014 2:31 PM To: Bronevetsky, Greg Cc: swift-user at ci.uchicago.edu Subject: Re: [Swift-user] Reduction trees You are correct. There does not seem to be an easy way to slice a sparse array (or an array with non-int keys). I've filed an enhancement report to add the relevant feature (https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=1275) In the mean time, I'm trying to see if there might be a hack that can allow you to do what you need. Can you post the code that generates the sparse arrays you mention? I'm trying to picture a solution, but it's hard without a concrete example. Mihael On Mon, 2014-05-19 at 14:09 +0000, Bronevetsky, Greg wrote: > What I mean is that I have an array of files computed by my individual > tasks. To do a reduction tree on them I need to create a sub-array of > the first 100, the second 100, etc. so that each sub-array can be > merged independently. In my code example below I skip this step and > simply generate a region of numbers (minR, maxR) for each node in my > reduction tree and then do stuff with the numbers. Now I need for > these numbers to correspond to indexes in an array of files. > > Greg Bronevetsky > Lawrence Livermore National Lab > (925) 424-5756 > bronevetsky at llnl.gov > http://greg.bronevetsky.com > > -----Original Message----- > From: Mihael Hategan [mailto:hategan at mcs.anl.gov] > Sent: Sunday, May 18, 2014 5:05 PM > To: Bronevetsky, Greg > Cc: swift-user at ci.uchicago.edu > Subject: Re: [Swift-user] Reduction trees > > Hi, > > Can you be more specific about what you mean by "subsets of keys" below? > Specifically, how are these sub sets defined? 
>
> Mihael
>
> On Sun, 2014-05-18 at 22:16 +0000, Bronevetsky, Greg wrote:
> > I need to implement a reduction tree in Swift to aggregate the
> > results of many individual runs. I've written the simple algorithm
> > below, which works when all my runs correspond to numbers in a fixed
> > range. However, in reality I have a regular or associative array of
> > files produced by individual runs. How do I adapt the code below to
> > this scenario? I don't see any way to iterate over subsets of array
> > indexes/keys. Thanks!
> >
> > Greg Bronevetsky
> > Lawrence Livermore National Lab
> > (925) 424-5756
> > bronevetsky at llnl.gov
> > http://greg.bronevetsky.com
> >
> >
> > type file;
> >
> > app (file outFile) gen(string arg)
> > {
> >     echo arg stdout=@filename(outFile);
> > }
> >
> > app (file summaryFile) catBase(file inFiles[]) {
> >     cat @filenames(inFiles) stdout=@filename(summaryFile);
> > }
> >
> > // Concatenate the text of numbers in range [minR - maxR) (includes minR, not maxR)
> > (file summaryFile) catTree(int minR, int maxR, int radix, int level) {
> >     file subFiles[];
> >     if((maxR-minR) <= radix) {
> >         //tracef("catTree: leaf: minR=%i, maxR=%i\n", minR, maxR);
> >         foreach b in [0: (maxR-minR)-1] {
> >             //tracef("catTree: leaf: b=%i\n", b);
> >             file curLeafFile ;
> >             (curLeafFile) = gen(@toString(minR+b));
> >             subFiles[b] = curLeafFile;
> >         }
> >     } else {
> >         int size=maxR-minR;
> >         //tracef("catTree: node: minR=%i, maxR=%i\n", minR, maxR);
> >         foreach b in [0: radix-1] {
> >             file curNodeFile ;
> >             int start = minR + (size*b)%/radix;
> >             int end = minR + (size*(b+1))%/radix;
> >             //tracef("catTree: node: b=%i [%i - %i]\n", b, start, end);
> >             (curNodeFile) = catTree(start, end, radix, level+1);
> >             subFiles[b] = curNodeFile;
> >         }
> >     }
> >     //tracef("catBase: inFiles=%q\n", @filenames(subFiles));
> >     (summaryFile) = catBase(subFiles);
> > }
> >
> >
> > file allFile ;
> > (allFile) = catTree(0, 31, 2, 0);
> >
> > _______________________________________________
> >
Swift-user mailing list > > Swift-user at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > From hategan at mcs.anl.gov Mon May 19 16:43:39 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 19 May 2014 14:43:39 -0700 Subject: [Swift-user] Resuming jobs when script has changed In-Reply-To: <8635C0D1735D2C4BA6E571FD97486FF179923310@PRDEXMBX-05.the-lab.llnl.gov> References: <8635C0D1735D2C4BA6E571FD97486FF172EC8E71@PRDEXMBX-05.the-lab.llnl.gov> <53765072.9070906@anl.gov> <8635C0D1735D2C4BA6E571FD97486FF17992320C@PRDEXMBX-05.the-lab.llnl.gov> <8635C0D1735D2C4BA6E571FD97486FF179923310@PRDEXMBX-05.the-lab.llnl.gov> Message-ID: <1400535819.19707.16.camel@echo> On Mon, 2014-05-19 at 19:06 +0000, Bronevetsky, Greg wrote: > - I need to explicitly copy the files that already exist from my working directory to temporary directory used by Swift to enable Swift to recognize these files' existence. Then Swift copies them back to their original locations, which is wasteful but at least correct. > If anybody can suggest alternatives that overcome the above issues, I'd be grateful. I expect that mine is a common use-case because without this every small change to a script causes all of its intermediate results to be recomputed, even if the script is large and takes days to compute. If you are using swift > 0.94, you could use the built-in exists() function I mentioned in my previous email. If not, you can make sure that the fileExists app only runs locally (by only having it for localhost in the sites file). Then, you can pass the working directory as a string to it and modify it to use that directory and the file whose existence you are trying to determine to build an absolute path. That would not require any copying. 
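The local-only check described above could be sketched as a small helper that joins the working directory and a file name into an absolute path and prints a Boolean for the script to read back. This is an illustration under assumptions, not code from this thread: it is written in Python rather than shell, and the names `absolute_path` and `exists_flag` are hypothetical.

```python
#!/usr/bin/env python3
"""Hypothetical existence-check helper (a sketch, not part of Swift)."""
import os
import sys


def absolute_path(work_dir, name):
    """Join the working directory and a file name into an absolute path."""
    return os.path.abspath(os.path.join(work_dir, name))


def exists_flag(work_dir, name):
    """Return "true" if the file exists under work_dir, otherwise "false"."""
    return "true" if os.path.isfile(absolute_path(work_dir, name)) else "false"


if __name__ == "__main__" and len(sys.argv) == 3:
    # Assumed usage: exists_helper.py <work_dir> <file_name>
    print(exists_flag(sys.argv[1], sys.argv[2]))
```

Because the helper only inspects the path it is given, nothing has to be staged in or copied for the check itself.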
Mihael From bronevetsky1 at llnl.gov Mon May 19 16:47:29 2014 From: bronevetsky1 at llnl.gov (Bronevetsky, Greg) Date: Mon, 19 May 2014 21:47:29 +0000 Subject: [Swift-user] Resuming jobs when script has changed In-Reply-To: <1400535819.19707.16.camel@echo> References: <8635C0D1735D2C4BA6E571FD97486FF172EC8E71@PRDEXMBX-05.the-lab.llnl.gov> <53765072.9070906@anl.gov> <8635C0D1735D2C4BA6E571FD97486FF17992320C@PRDEXMBX-05.the-lab.llnl.gov> <8635C0D1735D2C4BA6E571FD97486FF179923310@PRDEXMBX-05.the-lab.llnl.gov> <1400535819.19707.16.camel@echo> Message-ID: <8635C0D1735D2C4BA6E571FD97486FF1799234A8@PRDEXMBX-05.the-lab.llnl.gov> I'm using Swift 0.94 so I'll be happy to switch to the built-in exists function. However, I don't think resolves the extra copying issue. The reason why I had to go the way I did is that I had to make sure that regardless of whether a file exists or not, the Swift variable that represents it needs to be the output of some task, otherwise Swift can't track dependencies on it. The only way I could think of to achieve this while, making sure that the file associated with the Swift-visible variable had the right contents, was to copy the data from the original to the Swift-visible file. However Swift then copies the file back to the original since I gave them the same names. Greg Bronevetsky Lawrence Livermore National Lab (925) 424-5756 bronevetsky at llnl.gov http://greg.bronevetsky.com -----Original Message----- From: Mihael Hategan [mailto:hategan at mcs.anl.gov] Sent: Monday, May 19, 2014 2:44 PM To: Bronevetsky, Greg Cc: swift-user at ci.uchicago.edu Subject: Re: [Swift-user] Resuming jobs when script has changed On Mon, 2014-05-19 at 19:06 +0000, Bronevetsky, Greg wrote: > - I need to explicitly copy the files that already exist from my working directory to temporary directory used by Swift to enable Swift to recognize these files' existence. Then Swift copies them back to their original locations, which is wasteful but at least correct. 
> If anybody can suggest alternatives that overcome the above issues, I'd be grateful. I expect that mine is a common use-case because without this every small change to a script causes all of its intermediate results to be recomputed, even if the script is large and takes days to compute. If you are using swift > 0.94, you could use the built-in exists() function I mentioned in my previous email. If not, you can make sure that the fileExists app only runs locally (by only having it for localhost in the sites file). Then, you can pass the working directory as a string to it and modify it to use that directory and the file whose existence you are trying to determine to build an absolute path. That would not require any copying. Mihael From hategan at mcs.anl.gov Mon May 19 16:54:27 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 19 May 2014 14:54:27 -0700 Subject: [Swift-user] Reduction trees In-Reply-To: <8635C0D1735D2C4BA6E571FD97486FF179923482@PRDEXMBX-05.the-lab.llnl.gov> References: <8635C0D1735D2C4BA6E571FD97486FF179921DBD@PRDEXMBX-05.the-lab.llnl.gov> <1400457898.10734.1.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF179922FA6@PRDEXMBX-05.the-lab.llnl.gov> <1400535070.19707.7.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF179923482@PRDEXMBX-05.the-lab.llnl.gov> Message-ID: <1400536467.19707.24.camel@echo> On Mon, 2014-05-19 at 21:40 +0000, Bronevetsky, Greg wrote: > I'm actually not sure how to write such code in Swift so let me > describe the task I am interested in. I have an array of files that > contain the results of individual computation tasks. I wish to > aggregate these files into one summary file. Swift makes it easy to > pass the entire array to a single task to aggregate. However, with > many files I reach the limit of the command line length constraints > and further, the aggregator task takes too long. As such, I need to > create a reduction tree, where the first n files are aggregated by one > task, the next n by another and so on. 
Then the results of these aggregation tasks are themselves aggregated and so on until I have a single file that contains the aggregation of all the tasks.

We have been talking about this issue since the early days of swift. One idea suggested was to be able to declare certain apps as associative reduction steps and do this splitting automatically, but it was speculative and didn't quite materialize.

The command line arguments being too long, that can be addressed by creating a list of files that you can pass to the reduction app instead of doing it on the command line. I.e.:

string[] fnames;
foreach v, k in files {
    fnames[k] = filename(v);
}
file fnamesFile = writeData(fnames);

app (file result) reduce(file[] files, file fnamesFile) {
    reduce filename(fnamesFile) ...;
}

>
> The code example below does this when the input data to my tasks is
> just a number range and the code recursively chops the range into
> sub-segments, performing an aggregation on each one. However, if the
> input data is more complex, I don't see a way in Swift to do the same
> thing. I can't use regular looping constructs to create an array of
> computation task inputs or array of computation task output files
> since I don't know how to operate on sub-arrays. I might be able to
> pull some trick where I take a complex loop nest that generates the
> computation tasks and push it deep inside the recursion below, forcing
> the loop nest within each leaf of the recursion to perform just the
> portion of the iteration space that corresponds to the range bounds of
> that leaf. However, even if this were expressible in Swift, it would
> be hard to read. That's all I can think of right now but any other
> suggestions would be welcome.

Well, ... I can probably write a hackish split function that you can use with 0.94. Give me a few hours.
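The shape of the reduction tree under discussion can be sketched independently of Swift. The Python below is only an illustration: it groups the inputs into chunks of at most `radix` items, reduces each chunk, and repeats on the partial results until one remains. The names `chunks`, `reduce_tree` and `concat` are hypothetical, and `concat` merely stands in for the real aggregation app.

```python
def chunks(items, radix):
    """Split items into consecutive groups of at most radix elements."""
    return [items[i:i + radix] for i in range(0, len(items), radix)]


def reduce_tree(items, radix, combine):
    """Reduce items level by level, passing at most radix inputs per call."""
    if radix < 2:
        raise ValueError("radix must be at least 2")
    if not items:
        raise ValueError("nothing to reduce")
    level = list(items)
    while len(level) > 1:
        # Each group is reduced independently, so one level could run in parallel.
        level = [combine(group) for group in chunks(level, radix)]
    return level[0]


def concat(group):
    """Stand-in for the aggregation task: concatenate the inputs in order."""
    return "".join(group)


if __name__ == "__main__":
    leaves = [str(i) for i in range(10)]
    print(reduce_tree(leaves, 3, concat))
```

Since no call to `combine` ever sees more than `radix` inputs, the argument list of each aggregation step stays bounded no matter how many leaves there are.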
Mihael From bronevetsky1 at llnl.gov Mon May 19 16:57:02 2014 From: bronevetsky1 at llnl.gov (Bronevetsky, Greg) Date: Mon, 19 May 2014 21:57:02 +0000 Subject: [Swift-user] Reduction trees In-Reply-To: <1400536467.19707.24.camel@echo> References: <8635C0D1735D2C4BA6E571FD97486FF179921DBD@PRDEXMBX-05.the-lab.llnl.gov> <1400457898.10734.1.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF179922FA6@PRDEXMBX-05.the-lab.llnl.gov> <1400535070.19707.7.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF179923482@PRDEXMBX-05.the-lab.llnl.gov> <1400536467.19707.24.camel@echo> Message-ID: <8635C0D1735D2C4BA6E571FD97486FF1799234CD@PRDEXMBX-05.the-lab.llnl.gov> Thanks! It will be great to see how this can be accomplished in Swift. Greg Bronevetsky Lawrence Livermore National Lab (925) 424-5756 bronevetsky at llnl.gov http://greg.bronevetsky.com -----Original Message----- From: Mihael Hategan [mailto:hategan at mcs.anl.gov] Sent: Monday, May 19, 2014 2:54 PM To: Bronevetsky, Greg Cc: swift-user at ci.uchicago.edu Subject: Re: [Swift-user] Reduction trees On Mon, 2014-05-19 at 21:40 +0000, Bronevetsky, Greg wrote: > I'm actually not sure how to write such code in Swift so let me > describe the task I am interested in. I have an array of files that > contain the results of individual computation tasks. I wish to > aggregate these files into one summary file. Swift makes it easy to > pass the entire array to a single task to aggregate. However, with > many files I reach the limit of the command line length constraints > and further, the aggregator task takes too long. As such, I need to > create a reduction tree, where the first n files are aggregated by one > task, the next n by another and so on. Then the results of these > aggregation tasks are themselves aggregated and so on until I have a > single file that contains the aggregation of all the tasks. We have been talking about this issue since the early days of swift. 
One idea suggested was to be able to declare certain apps as associative reduction steps and do this splitting automatically, but it was speculative and didn't quite materialize. The command line arguments being too long, that can be addressed by creating a list of files that you can pass to the reduction app instead of doing it on the command line. I.e.: string[] fnames; foreach v, k in files { fnames[k] = filename(v); } file fnamesFile = writeData(fnames); app (file result) reduce(file[] files, file fnamesFile) { reduce filename(fnamesFile) ...; } > > The code example below does this when the input data to my tasks is > just a number range and the code recursively chops the range into > sub-segments, performing an aggregation on each one. However, if the > input data is more complex, I don't see a way in Swift to do the same > thing. I can't use regular looping constructs to create an array of > computation task inputs or array of computation task output files > since I don't know how to operate on sub-arrays. I might be able to > pull some trick where I take a complex loop nest that generates the > computation tasks and push it deep inside the recursion below, forcing > the loop nest within each leaf of the recursion to perform just the > portion of the iteration space that corresponds to the range bounds of > that leaf. However, even if this were expressible in Swift, it would > be hard to read. That's all I can think of right now but any other > suggestions would be welcome. Well, ... I can probably write a hackish split function that you can use with 0.94. Give me a few hours. 
Mihael From hategan at mcs.anl.gov Mon May 19 17:05:44 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 19 May 2014 15:05:44 -0700 Subject: [Swift-user] Resuming jobs when script has changed In-Reply-To: <8635C0D1735D2C4BA6E571FD97486FF1799234A8@PRDEXMBX-05.the-lab.llnl.gov> References: <8635C0D1735D2C4BA6E571FD97486FF172EC8E71@PRDEXMBX-05.the-lab.llnl.gov> <53765072.9070906@anl.gov> <8635C0D1735D2C4BA6E571FD97486FF17992320C@PRDEXMBX-05.the-lab.llnl.gov> <8635C0D1735D2C4BA6E571FD97486FF179923310@PRDEXMBX-05.the-lab.llnl.gov> <1400535819.19707.16.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF1799234A8@PRDEXMBX-05.the-lab.llnl.gov> Message-ID: <1400537144.19707.33.camel@echo> On Mon, 2014-05-19 at 21:47 +0000, Bronevetsky, Greg wrote: > I'm using Swift 0.94 so I'll be happy to switch to the built-in exists > function. However, I don't think resolves the extra copying issue. The > reason why I had to go the way I did is that I had to make sure that > regardless of whether a file exists or not, the Swift variable that > represents it needs to be the output of some task, otherwise Swift > can't track dependencies on it. The only way I could think of to > achieve this while, making sure that the file associated with the > Swift-visible variable had the right contents, was to copy the data > from the original to the Swift-visible file. However Swift then copies > the file back to the original since I gave them the same names. Ok, so you still need the ability to derive things from the existing files that might be different from what was derived before. 
I suggest the following:

file newOrOld; // without a mapper
file old <"/old/work/dir/xyz">;
if (exists(old)) {
    newOrOld = old; // will not do a copy, but alias newOrOld to old
}
else {
    file new <"work/dir/xyz">;
    new = makeNew(...);
    newOrOld = new; // will also not do a copy, but the result will be in work/dir/xyz
}
// now newOrOld is either /old/work/dir/xyz or work/dir/xyz

Mihael

From bronevetsky1 at llnl.gov Mon May 19 17:21:22 2014
From: bronevetsky1 at llnl.gov (Bronevetsky, Greg)
Date: Mon, 19 May 2014 22:21:22 +0000
Subject: [Swift-user] Resuming jobs when script has changed
In-Reply-To: <1400537144.19707.33.camel@echo>
References: <8635C0D1735D2C4BA6E571FD97486FF172EC8E71@PRDEXMBX-05.the-lab.llnl.gov> <53765072.9070906@anl.gov> <8635C0D1735D2C4BA6E571FD97486FF17992320C@PRDEXMBX-05.the-lab.llnl.gov> <8635C0D1735D2C4BA6E571FD97486FF179923310@PRDEXMBX-05.the-lab.llnl.gov> <1400535819.19707.16.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF1799234A8@PRDEXMBX-05.the-lab.llnl.gov> <1400537144.19707.33.camel@echo>
Message-ID: <8635C0D1735D2C4BA6E571FD97486FF17992352E@PRDEXMBX-05.the-lab.llnl.gov>

Thanks, Mihael! I've modified my original code based on your suggestions (new code below) but I'm getting the following error:

Could not start execution
Procedure exists is not defined.

I'm using Swift 0.94.1, which I believe should have the exists function.
Greg Bronevetsky Lawrence Livermore National Lab (925) 424-5756 bronevetsky at llnl.gov http://greg.bronevetsky.com type file; app (file outF) writeFile(string message) { echo message stdout=@filename(outF); } app (file outF) copyFile(file inF) { cp @filename(inF) @filename(outF); } file data; file dataOld ; if(!exists(dataOld)) { (data) = writeData("hello"); } else { data = dataOld; } file copy; file copyOld ; if(!exists(copyOld)) { (copy) = copyFile(data); } else { copy = copyOld; } -----Original Message----- From: Mihael Hategan [mailto:hategan at mcs.anl.gov] Sent: Monday, May 19, 2014 3:06 PM To: Bronevetsky, Greg Cc: swift-user at ci.uchicago.edu Subject: Re: [Swift-user] Resuming jobs when script has changed On Mon, 2014-05-19 at 21:47 +0000, Bronevetsky, Greg wrote: > I'm using Swift 0.94 so I'll be happy to switch to the built-in exists > function. However, I don't think resolves the extra copying issue. The > reason why I had to go the way I did is that I had to make sure that > regardless of whether a file exists or not, the Swift variable that > represents it needs to be the output of some task, otherwise Swift > can't track dependencies on it. The only way I could think of to > achieve this while, making sure that the file associated with the > Swift-visible variable had the right contents, was to copy the data > from the original to the Swift-visible file. However Swift then copies > the file back to the original since I gave them the same names. Ok, so you still need the ability to derive things from the existing files that might be different from what was derived before. 
I suggest the following: file newOrOld; // without a mapper file old <"/old/work/dir/xyz">; if (exists(old) { newOrOld = old; // will not do a copy, but alias newOrOld to old } else { file new <"work/dir/xyz">; new = makeNew(...); newOrOld = new; // will also not do a copy, but the result will be in work/dir/xzy } // now newOrOld is either /old/work/dir/xyz or work/dir/xyz Mihael From hategan at mcs.anl.gov Mon May 19 17:28:24 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 19 May 2014 15:28:24 -0700 Subject: [Swift-user] Resuming jobs when script has changed In-Reply-To: <8635C0D1735D2C4BA6E571FD97486FF17992352E@PRDEXMBX-05.the-lab.llnl.gov> References: <8635C0D1735D2C4BA6E571FD97486FF172EC8E71@PRDEXMBX-05.the-lab.llnl.gov> <53765072.9070906@anl.gov> <8635C0D1735D2C4BA6E571FD97486FF17992320C@PRDEXMBX-05.the-lab.llnl.gov> <8635C0D1735D2C4BA6E571FD97486FF179923310@PRDEXMBX-05.the-lab.llnl.gov> <1400535819.19707.16.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF1799234A8@PRDEXMBX-05.the-lab.llnl.gov> <1400537144.19707.33.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF17992352E@PRDEXMBX-05.the-lab.llnl.gov> Message-ID: <1400538504.19707.38.camel@echo> On Mon, 2014-05-19 at 22:21 +0000, Bronevetsky, Greg wrote: > Thanks, Mihael! I've modified my original code based on your suggestions (new code below) but I'm getting the following error: > Could not start execution > Procedure exists is not defined. Sorry. That's where your implementation of exists for absolute path names goes. I'll send you the relevant code shortly. I'm working on the array splitting issue now. 
Mihael

From hategan at mcs.anl.gov Mon May 19 18:12:23 2014
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 19 May 2014 16:12:23 -0700
Subject: [Swift-user] Reduction trees
In-Reply-To: <1400536467.19707.24.camel@echo>
References: <8635C0D1735D2C4BA6E571FD97486FF179921DBD@PRDEXMBX-05.the-lab.llnl.gov> <1400457898.10734.1.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF179922FA6@PRDEXMBX-05.the-lab.llnl.gov> <1400535070.19707.7.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF179923482@PRDEXMBX-05.the-lab.llnl.gov> <1400536467.19707.24.camel@echo>
Message-ID: <1400541143.19707.43.camel@echo>

On Mon, 2014-05-19 at 14:54 -0700, Mihael Hategan wrote:
> Well, ... I can probably write a hackish split function that you can use
> with 0.94. Give me a few hours.

Ok, here's the basic idea:

(file[][] result) split(file[] array, int n) {
    // need to explicitly keep track of keys since
    // writeData does not write the array keys
    int[auto] keys;
    foreach v, k in array {
        keys << k;
    }
    file kf = writeData(keys);
    file kfs = splitArrayApp(kf, n);
    // make pointers to original keys
    int[][] skeys = readStructured(kfs);
    foreach slice, sliceIndex in skeys {
        foreach index in slice {
            result[sliceIndex][index] = array[index];
        }
    }
}

Attached are full versions of scripts. You will need to add splitArrayApp to tc.data.

Please try to run it and let me know if it works. My development 0.94 might be slightly different from what you have.
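The splitArray helper itself does not survive in this archive. As a rough sketch of what such a helper might do, the Python below reads one key per line and prints the keys in rows of at most n. Everything here is an assumption for illustration: the function names are hypothetical, and the one-key-per-line input and space-separated row output are guesses that would need to be checked against what writeData emits and what readStructured accepts in your Swift version.

```python
#!/usr/bin/env python3
"""Hypothetical key-splitting helper (a sketch; file layouts are assumed)."""
import sys


def split_keys(keys, n):
    """Group keys into consecutive slices of at most n entries."""
    if n < 1:
        raise ValueError("slice size must be at least 1")
    return [keys[i:i + n] for i in range(0, len(keys), n)]


def format_rows(rows):
    """Render one slice per line, keys separated by spaces (assumed layout)."""
    return "\n".join(" ".join(row) for row in rows)


def main(argv):
    # Assumed usage: split_helper.py <keys_file> <n>
    path, n = argv[1], int(argv[2])
    with open(path) as handle:
        keys = [line.strip() for line in handle if line.strip()]
    print(format_rows(split_keys(keys, n)))


if __name__ == "__main__" and len(sys.argv) == 3:
    main(sys.argv)
```

Keeping the slicing in `split_keys` separate from the file handling makes it easy to adjust the input and output layouts once the real formats are known.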
Mihael

-------------- next part --------------
type file;

// ----- Start of split related stuff
app (file outf) splitArrayApp(file inf, int n) {
    splitArrayApp @filename(inf) n stdout=@filename(outf);
}

(file[][] result) split(file[] array, int n) {
    // need to explicitly keep track of keys since
    // writeData does not write the array keys
    int[auto] keys;
    foreach v, k in array {
        keys << k;
    }
    file kf = writeData(keys);
    file kfs = splitArrayApp(kf, n);
    // make pointers to original keys
    int[][] skeys = readStructured(kfs);
    foreach slice, sliceIndex in skeys {
        foreach index in slice {
            result[sliceIndex][index] = array[index];
        }
    }
}
// ----- End of split related stuff

file[] array;

app (file outf) gen(int i) {
    echo i stdout=@filename(outf);
}

foreach i in [1:13] {
    file f ;
    f = gen(i * 3);
    array[i * 2] = f;
}

file[][] split = split(array, 5);

foreach v1, k1 in split {
    foreach v2, k2 in v1 {
        tracef("array[%i][%i] -> %s\n", k1, k2, @filename(v2));
    }
}
-------------- next part --------------
A non-text attachment was scrubbed...
Name: splitArray
Type: text/x-python
Size: 297 bytes
Desc: not available
URL:

From bronevetsky1 at llnl.gov Mon May 19 18:22:53 2014
From: bronevetsky1 at llnl.gov (Bronevetsky, Greg)
Date: Mon, 19 May 2014 23:22:53 +0000
Subject: [Swift-user] Treatment of booleans in Swift
Message-ID: <8635C0D1735D2C4BA6E571FD97486FF1799235D7@PRDEXMBX-05.the-lab.llnl.gov>

I've been working with Booleans recently and I've run into a couple of issues with them that might be a problem for others:

- tracef doesn't have a format option for Booleans, forcing me to convert them to strings. Running @toInt on a Boolean produces: "java.lang.NumberFormatException: For input string: "false"".

- It is not clear how to format files to be read by readData() if the data in question is a Boolean. Based on experimentation I discovered that the correct strings are "True"/"False" rather than "true"/"false" or "0"/"1". It would be useful to document this in the description of readData().
Greg Bronevetsky
Lawrence Livermore National Lab
(925) 424-5756
bronevetsky at llnl.gov
http://greg.bronevetsky.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From hategan at mcs.anl.gov Mon May 19 18:24:47 2014
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 19 May 2014 16:24:47 -0700
Subject: [Swift-user] Resuming jobs when script has changed
In-Reply-To: <1400538504.19707.38.camel@echo>
References: <8635C0D1735D2C4BA6E571FD97486FF172EC8E71@PRDEXMBX-05.the-lab.llnl.gov> <53765072.9070906@anl.gov> <8635C0D1735D2C4BA6E571FD97486FF17992320C@PRDEXMBX-05.the-lab.llnl.gov> <8635C0D1735D2C4BA6E571FD97486FF179923310@PRDEXMBX-05.the-lab.llnl.gov> <1400535819.19707.16.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF1799234A8@PRDEXMBX-05.the-lab.llnl.gov> <1400537144.19707.33.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF17992352E@PRDEXMBX-05.the-lab.llnl.gov> <1400538504.19707.38.camel@echo>
Message-ID: <1400541887.19707.45.camel@echo>

Here's the swift part. It will only work with absolute paths. It can probably be modified to only work with relative paths.

type file;

app (file outf) existsApp(string path) {
    existsApp path stdout=@filename(outf);
}

(boolean result) exists(string path) {
    result = readData(existsApp(path));
}

file there <"/bin/bash">;
file notthere <"/bin/notthere">;

tracef("Y %b\n", exists(@filename(there)));
tracef("N %b\n", exists(@filename(notthere)));

And existsApp:

#!/bin/bash

# add leading / since @filename removes it
# this will not work with swift > 0.94
if [ -f "/$1" ]; then
    echo "true"
else
    echo "false"
fi

On Mon, 2014-05-19 at 15:28 -0700, Mihael Hategan wrote:
> On Mon, 2014-05-19 at 22:21 +0000, Bronevetsky, Greg wrote:
> > Thanks, Mihael! I've modified my original code based on your suggestions (new code below) but I'm getting the following error:
> > Could not start execution
> > Procedure exists is not defined.
>
> Sorry. That's where your implementation of exists for absolute path
> names goes.
I'll send you the relevant code shortly. I'm working on the > array splitting issue now. > > Mihael > > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user From hategan at mcs.anl.gov Mon May 19 18:29:17 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 19 May 2014 16:29:17 -0700 Subject: [Swift-user] Treatment of booleans in Swift In-Reply-To: <8635C0D1735D2C4BA6E571FD97486FF1799235D7@PRDEXMBX-05.the-lab.llnl.gov> References: <8635C0D1735D2C4BA6E571FD97486FF1799235D7@PRDEXMBX-05.the-lab.llnl.gov> Message-ID: <1400542157.19707.49.camel@echo> On Mon, 2014-05-19 at 23:22 +0000, Bronevetsky, Greg wrote: > I've been working with Booleans recently and I've run into a couple of issues with them that might be a problem for others: > > - tracef doesn't have a format option for Booleans, forcing > me to convert them to strings. Running @toInt on a Boolean produces: > "java.lang.NumberFormatException: For input string: "false"". I think the documentation is lying (by omission) about that. Try "%b". > > - It is not clear how to format files to be read by > readData() if the data in question is a Boolean. Based on > experimentation I discovered that the correct strings are > "True"/"False" rather than "true"/"false" or "0"/"1". It would be > useful to document this in the description of readData(). All of "True" and "true" and "TRUE" and "tRuE" (and all other case combinations) should work. 
The implementation uses Java's Boolean.valueOf(String), which calls this:

private static boolean toBoolean(String name) {
    return ((name != null) && name.equalsIgnoreCase("true"));
}

Mihael

From bronevetsky1 at llnl.gov Mon May 19 18:31:11 2014
From: bronevetsky1 at llnl.gov (Bronevetsky, Greg)
Date: Mon, 19 May 2014 23:31:11 +0000
Subject: [Swift-user] Treatment of booleans in Swift
In-Reply-To: <1400542157.19707.49.camel@echo>
References: <8635C0D1735D2C4BA6E571FD97486FF1799235D7@PRDEXMBX-05.the-lab.llnl.gov> <1400542157.19707.49.camel@echo>
Message-ID: <8635C0D1735D2C4BA6E571FD97486FF179923604@PRDEXMBX-05.the-lab.llnl.gov>

> - tracef doesn't have a format option for Booleans, forcing
> me to convert them to strings. Running @toInt on a Boolean produces:
> "java.lang.NumberFormatException: For input string: "false"".

I think the documentation is lying (by omission) about that. Try "%b".

Thanks!

All of "True" and "true" and "TRUE" and "tRuE" (and all other case combinations) should work. The implementation uses Java's Boolean.valueOf(String), which calls this:

private static boolean toBoolean(String name) {
    return ((name != null) && name.equalsIgnoreCase("true"));
}

Great! I must have had another bug that I thought was caused by mis-parsing of file contents.

Greg Bronevetsky
Lawrence Livermore National Lab
(925) 424-5756
bronevetsky at llnl.gov
http://greg.bronevetsky.com

-----Original Message-----
From: Mihael Hategan [mailto:hategan at mcs.anl.gov]
Sent: Monday, May 19, 2014 4:29 PM
To: Bronevetsky, Greg
Cc: swift-user at ci.uchicago.edu
Subject: Re: [Swift-user] Treatment of booleans in Swift

On Mon, 2014-05-19 at 23:22 +0000, Bronevetsky, Greg wrote:
> I've been working with Booleans recently and I've run into a couple of issues with them that might be a problem for others:
>
> - tracef doesn't have a format option for Booleans, forcing
> me to convert them to strings.
Running @toInt on a Boolean produces: > "java.lang.NumberFormatException: For input string: "false"". I think the documentation is lying (by omission) about that. Try "%b". > > - It is not clear how to format files to be read by > readData() if the data in question is a Boolean. Based on > experimentation I discovered that the correct strings are > "True"/"False" rather than "true"/"false" or "0"/"1". It would be > useful to document this in the description of readData(). All of "True" and "true" and "TRUE" and "tRuE" (and all other case combinations) should work. The implementation uses Java's Boolean.valueOf(String), which calls this: private static boolean toBoolean(String name) { return ((name != null) && name.equalsIgnoreCase("true")); } Mihael -------------- next part -------------- An HTML attachment was scrubbed... URL: From bronevetsky1 at llnl.gov Tue May 20 00:30:28 2014 From: bronevetsky1 at llnl.gov (Bronevetsky, Greg) Date: Tue, 20 May 2014 05:30:28 +0000 Subject: [Swift-user] Resuming jobs when script has changed In-Reply-To: <1400541887.19707.45.camel@echo> References: <8635C0D1735D2C4BA6E571FD97486FF172EC8E71@PRDEXMBX-05.the-lab.llnl.gov> <53765072.9070906@anl.gov> <8635C0D1735D2C4BA6E571FD97486FF17992320C@PRDEXMBX-05.the-lab.llnl.gov> <8635C0D1735D2C4BA6E571FD97486FF179923310@PRDEXMBX-05.the-lab.llnl.gov> <1400535819.19707.16.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF1799234A8@PRDEXMBX-05.the-lab.llnl.gov> <1400537144.19707.33.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF17992352E@PRDEXMBX-05.the-lab.llnl.gov> <1400538504.19707.38.camel@echo> <1400541887.19707.45.camel@echo> Message-ID: <8635C0D1735D2C4BA6E571FD97486FF1799236C7@PRDEXMBX-05.the-lab.llnl.gov> Thanks Mihael. I just tried using your operations in my test code (below). It seems to detect whether the files do/don't exist and seems to try to generate the files but they never appear in my directory. 
During a fresh run in an empty directory I get the following output: Swift 0.94.1 swift-r7114 cog-r3803 RunID: 20140519-2230-4lyyh8me Progress: time: Mon, 19 May 2014 22:30:03 -0700 Progress: time: Mon, 19 May 2014 22:30:04 -0700 Stage in:1 Submitting:1 Progress: time: Mon, 19 May 2014 22:30:09 -0700 Active:1 Checking status:1 path=g/g15/bronevet/apps/swift-0.94.1/examples/test/work/dataF, exists=false Generating file://localhost/_concurrent/data-5fbfabc7-8f5d-4240-82b1-4c7ecfdc33aa-path=g/g15/bronevet/apps/swift-0.94.1/examples/test/work/copyF, exists=false Generating file://localhost/_concurrent/copy-44d394f6-53f3-4f5b-9ec9-fa2c19f5fcdf- from file://localhost/_concurrent/data-5fbfabc7-8f5d-4240-82b1-4c7ecfdc33aa-Final status: Mon, 19 May 2014 22:30:09 -0700 Finished successfully:3 Greg Bronevetsky Lawrence Livermore National Lab (925) 424-5756 bronevetsky at llnl.gov http://greg.bronevetsky.com import "common"; type file; string WORK_PATH="/g/g15/bronevet/apps/swift-0.94.1/examples/test/work"; app (file outF) writeFile(string message) { echo message stdout=@filename(outF); } app (file outF) copyFile(file inF) { cp @filename(inF) @filename(outF); } file data; file dataOld ; if(!exists(@strcat(WORK_PATH, "/dataF"))) { tracef("Generating %M\n", data); (data) = writeData("hello"); } else { data = dataOld; } file copy; file copyOld ; if(!exists(@strcat(WORK_PATH, "/copyF"))) { tracef("Generating %M from %M\n", copy, data); (copy) = copyFile(data); } else { copy = copyOld; } -----Original Message----- From: Mihael Hategan [mailto:hategan at mcs.anl.gov] Sent: Monday, May 19, 2014 4:25 PM To: Bronevetsky, Greg Cc: swift-user at ci.uchicago.edu Subject: Re: [Swift-user] Resuming jobs when script has changed Here's the swift part. It will only work with absolute paths. It can probably be modified to only work with relative paths. 
type file;

app (file outf) existsApp(string path) {
    existsApp path stdout=@filename(outf);
}

(boolean result) exists(string path) {
    result = readData(existsApp(path));
}

file there <"/bin/bash">;
file notthere <"/bin/notthere">;

tracef("Y %b\n", exists(@filename(there)));
tracef("N %b\n", exists(@filename(notthere)));

And existsApp:

#!/bin/bash

# add leading / since @filename removes it
# this will not work with swift > 0.94
if [ -f "/$1" ]; then
    echo "true"
else
    echo "false"
fi

On Mon, 2014-05-19 at 15:28 -0700, Mihael Hategan wrote:
> On Mon, 2014-05-19 at 22:21 +0000, Bronevetsky, Greg wrote:
> > Thanks, Mihael! I've modified my original code based on your suggestions (new code below) but I'm getting the following error:
> > Could not start execution
> > Procedure exists is not defined.
>
> Sorry. That's where your implementation of exists for absolute path
> names goes. I'll send you the relevant code shortly. I'm working on
> the array splitting issue now.
>
> Mihael
>
>
> _______________________________________________
> Swift-user mailing list
> Swift-user at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user

From hategan at mcs.anl.gov Tue May 20 01:02:51 2014
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 19 May 2014 23:02:51 -0700
Subject: [Swift-user] Resuming jobs when script has changed
In-Reply-To: <8635C0D1735D2C4BA6E571FD97486FF1799236C7@PRDEXMBX-05.the-lab.llnl.gov>
References: <8635C0D1735D2C4BA6E571FD97486FF172EC8E71@PRDEXMBX-05.the-lab.llnl.gov> <53765072.9070906@anl.gov> <8635C0D1735D2C4BA6E571FD97486FF17992320C@PRDEXMBX-05.the-lab.llnl.gov> <8635C0D1735D2C4BA6E571FD97486FF179923310@PRDEXMBX-05.the-lab.llnl.gov> <1400535819.19707.16.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF1799234A8@PRDEXMBX-05.the-lab.llnl.gov> <1400537144.19707.33.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF17992352E@PRDEXMBX-05.the-lab.llnl.gov> <1400538504.19707.38.camel@echo> <1400541887.19707.45.camel@echo>
<8635C0D1735D2C4BA6E571FD97486FF1799236C7@PRDEXMBX-05.the-lab.llnl.gov> Message-ID: <1400565771.821.4.camel@echo> Sorry. I thought you didn't want the overhead of a copy. If you do want them in your directory, then map "copy" explicitly to the relevant name. In other words: file a <"mydir/a">; file olda <"olddir/a">; a = olda; will copy olddir/a to mydir/a but file a; file olda <"olddir/a">; a = olda; will not. Mihael On Tue, 2014-05-20 at 05:30 +0000, Bronevetsky, Greg wrote: > Thanks Mihael. I just tried using your operations in my test code (below). It seems to detect whether the files do/don't exist and seems to try to generate the files but they never appear in my directory. During a fresh run in an empty directory I get the following output: > > Swift 0.94.1 swift-r7114 cog-r3803 > > RunID: 20140519-2230-4lyyh8me > Progress: time: Mon, 19 May 2014 22:30:03 -0700 > Progress: time: Mon, 19 May 2014 22:30:04 -0700 Stage in:1 Submitting:1 > Progress: time: Mon, 19 May 2014 22:30:09 -0700 Active:1 Checking status:1 > path=g/g15/bronevet/apps/swift-0.94.1/examples/test/work/dataF, exists=false > Generating file://localhost/_concurrent/data-5fbfabc7-8f5d-4240-82b1-4c7ecfdc33aa-path=g/g15/bronevet/apps/swift-0.94.1/examples/test/work/copyF, exists=false > Generating file://localhost/_concurrent/copy-44d394f6-53f3-4f5b-9ec9-fa2c19f5fcdf- from file://localhost/_concurrent/data-5fbfabc7-8f5d-4240-82b1-4c7ecfdc33aa-Final status: Mon, 19 May 2014 22:30:09 -0700 Finished successfully:3 > > Greg Bronevetsky > Lawrence Livermore National Lab > (925) 424-5756 > bronevetsky at llnl.gov > http://greg.bronevetsky.com > > import "common"; > type file; > > string WORK_PATH="/g/g15/bronevet/apps/swift-0.94.1/examples/test/work"; > > app (file outF) writeFile(string message) { > echo message stdout=@filename(outF); > } > > app (file outF) copyFile(file inF) { > cp @filename(inF) @filename(outF); > } > > file data; > file dataOld ; > if(!exists(@strcat(WORK_PATH, "/dataF"))) { > 
tracef("Generating %M\n", data); > (data) = writeData("hello"); > } else > { data = dataOld; } > > file copy; > file copyOld ; > if(!exists(@strcat(WORK_PATH, "/copyF"))) { > tracef("Generating %M from %M\n", copy, data); > (copy) = copyFile(data); > } else > { copy = copyOld; } > > > -----Original Message----- > From: Mihael Hategan [mailto:hategan at mcs.anl.gov] > Sent: Monday, May 19, 2014 4:25 PM > To: Bronevetsky, Greg > Cc: swift-user at ci.uchicago.edu > Subject: Re: [Swift-user] Resuming jobs when script has changed > > Here's the swift part. It will only work with absolute paths. It can probably be modified to only work with relative paths. > > type file; > > app (file outf) existsApp(string path) { > existsApp path stdout=@filename(outf); } > > (boolean result) exists(string path) { > result = readData(existsApp(path)); > } > > file there <"/bin/bash">; > file notthere <"/bin/notthere">; > > tracef("Y %b\n", exists(@filename(there))); tracef("N %b\n", exists(@filename(notthere))); > > And existsApp: > #!/bin/bash > > # add leading / since @filename removes it # this will not work with swift > 0.94 if [ -f "/$1" ]; then > echo "true" > else > echo "false" > fi > > > > On Mon, 2014-05-19 at 15:28 -0700, Mihael Hategan wrote: > > On Mon, 2014-05-19 at 22:21 +0000, Bronevetsky, Greg wrote: > > > Thanks, Mihael! I've modified my original code based on your suggestions (new code below) but I'm getting the following error: > > > Could not start execution > > > Procedure exists is not defined. > > > > Sorry. That's where your implementation of exists for absolute path > > names goes. I'll send you the relevant code shortly. I'm working on > > the array splitting issue now. 
> > > > Mihael > > > > > > _______________________________________________ > > Swift-user mailing list > > Swift-user at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > From bronevetsky1 at llnl.gov Tue May 20 01:18:39 2014 From: bronevetsky1 at llnl.gov (Bronevetsky, Greg) Date: Tue, 20 May 2014 06:18:39 +0000 Subject: [Swift-user] Resuming jobs when script has changed In-Reply-To: <1400565771.821.4.camel@echo> References: <8635C0D1735D2C4BA6E571FD97486FF172EC8E71@PRDEXMBX-05.the-lab.llnl.gov> <53765072.9070906@anl.gov> <8635C0D1735D2C4BA6E571FD97486FF17992320C@PRDEXMBX-05.the-lab.llnl.gov> <8635C0D1735D2C4BA6E571FD97486FF179923310@PRDEXMBX-05.the-lab.llnl.gov> <1400535819.19707.16.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF1799234A8@PRDEXMBX-05.the-lab.llnl.gov> <1400537144.19707.33.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF17992352E@PRDEXMBX-05.the-lab.llnl.gov> <1400538504.19707.38.camel@echo> <1400541887.19707.45.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF1799236C7@PRDEXMBX-05.the-lab.llnl.gov> <1400565771.821.4.camel@echo> Message-ID: <8635C0D1735D2C4BA6E571FD97486FF1799236F0@PRDEXMBX-05.the-lab.llnl.gov> I'm not sure I'm following. In my case mydir/ and olddir/ are the same. I'm trying to run the script in its original directory within my home directory, while making sure not to re-create files that already exist. The issue that I had is that in my original solution I copied files from my home directory to Swift's intermediate directory and then back, overwriting the original file in my home directory with an identical copy. The first copy (home -> Swift intermediate dir) seems unavoidable but the second seems unnecessary. 
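
Outside of Swift, the behaviour being asked for here (generate a file only when it is not already present, otherwise leave the existing copy untouched) can be sketched in a few lines of shell. This is only an illustration of the intent, not something Swift provides or how Swift stages files; the function name generate_if_missing and the temporary directory are made up for the example.

```shell
#!/bin/bash
# Hypothetical helper (not part of Swift): run a generator command only
# when the target file does not exist yet, so existing files are never
# rewritten or copied over.
generate_if_missing() {
    target="$1"
    shift
    if [ -f "$target" ]; then
        echo "reusing $target"
    else
        echo "generating $target"
        "$@" > "$target"
    fi
}

# Small demonstration in a scratch directory.
workdir="$(mktemp -d)"
generate_if_missing "$workdir/dataF" echo hello     # creates the file
generate_if_missing "$workdir/dataF" echo changed   # leaves it alone
cat "$workdir/dataF"
```

The second call finds the file already in place and does not run the generator, so the content written by the first call is what remains.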
Greg Bronevetsky Lawrence Livermore National Lab (925) 424-5756 bronevetsky at llnl.gov http://greg.bronevetsky.com -----Original Message----- From: Mihael Hategan [mailto:hategan at mcs.anl.gov] Sent: Monday, May 19, 2014 11:03 PM To: Bronevetsky, Greg Cc: swift-user at ci.uchicago.edu Subject: Re: [Swift-user] Resuming jobs when script has changed Sorry. I thought you didn't want the overhead of a copy. If you do want them in your directory, then map "copy" explicitly to the relevant name. In other words: file a <"mydir/a">; file olda <"olddir/a">; a = olda; will copy olddir/a to mydir/a but file a; file olda <"olddir/a">; a = olda; will not. Mihael On Tue, 2014-05-20 at 05:30 +0000, Bronevetsky, Greg wrote: > Thanks Mihael. I just tried using your operations in my test code (below). It seems to detect whether the files do/don't exist and seems to try to generate the files but they never appear in my directory. During a fresh run in an empty directory I get the following output: > > Swift 0.94.1 swift-r7114 cog-r3803 > > RunID: 20140519-2230-4lyyh8me > Progress: time: Mon, 19 May 2014 22:30:03 -0700 > Progress: time: Mon, 19 May 2014 22:30:04 -0700 Stage in:1 > Submitting:1 > Progress: time: Mon, 19 May 2014 22:30:09 -0700 Active:1 Checking > status:1 > path=g/g15/bronevet/apps/swift-0.94.1/examples/test/work/dataF, > exists=false Generating > file://localhost/_concurrent/data-5fbfabc7-8f5d-4240-82b1-4c7ecfdc33aa > -path=g/g15/bronevet/apps/swift-0.94.1/examples/test/work/copyF, > exists=false Generating > file://localhost/_concurrent/copy-44d394f6-53f3-4f5b-9ec9-fa2c19f5fcdf > - from > file://localhost/_concurrent/data-5fbfabc7-8f5d-4240-82b1-4c7ecfdc33aa > -Final status: Mon, 19 May 2014 22:30:09 -0700 Finished > successfully:3 > > Greg Bronevetsky > Lawrence Livermore National Lab > (925) 424-5756 > bronevetsky at llnl.gov > http://greg.bronevetsky.com > > import "common"; > type file; > > string > 
WORK_PATH="/g/g15/bronevet/apps/swift-0.94.1/examples/test/work"; > > app (file outF) writeFile(string message) { > echo message stdout=@filename(outF); } > > app (file outF) copyFile(file inF) { > cp @filename(inF) @filename(outF); > } > > file data; > file dataOld ; > if(!exists(@strcat(WORK_PATH, "/dataF"))) { > tracef("Generating %M\n", data); > (data) = writeData("hello"); > } else > { data = dataOld; } > > file copy; > file copyOld ; > if(!exists(@strcat(WORK_PATH, "/copyF"))) { > tracef("Generating %M from %M\n", copy, data); > (copy) = copyFile(data); > } else > { copy = copyOld; } > > > -----Original Message----- > From: Mihael Hategan [mailto:hategan at mcs.anl.gov] > Sent: Monday, May 19, 2014 4:25 PM > To: Bronevetsky, Greg > Cc: swift-user at ci.uchicago.edu > Subject: Re: [Swift-user] Resuming jobs when script has changed > > Here's the swift part. It will only work with absolute paths. It can probably be modified to only work with relative paths. > > type file; > > app (file outf) existsApp(string path) { > existsApp path stdout=@filename(outf); } > > (boolean result) exists(string path) { > result = readData(existsApp(path)); > } > > file there <"/bin/bash">; > file notthere <"/bin/notthere">; > > tracef("Y %b\n", exists(@filename(there))); tracef("N %b\n", > exists(@filename(notthere))); > > And existsApp: > #!/bin/bash > > # add leading / since @filename removes it # this will not work with swift > 0.94 if [ -f "/$1" ]; then > echo "true" > else > echo "false" > fi > > > > On Mon, 2014-05-19 at 15:28 -0700, Mihael Hategan wrote: > > On Mon, 2014-05-19 at 22:21 +0000, Bronevetsky, Greg wrote: > > > Thanks, Mihael! I've modified my original code based on your suggestions (new code below) but I'm getting the following error: > > > Could not start execution > > > Procedure exists is not defined. > > > > Sorry. That's where your implementation of exists for absolute path > > names goes. I'll send you the relevant code shortly. 
I'm working on
> > the array splitting issue now.
> >
> > Mihael
> >
> >
> > _______________________________________________
> > Swift-user mailing list
> > Swift-user at ci.uchicago.edu
> > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user
> >

From hategan at mcs.anl.gov Tue May 20 01:41:03 2014
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 19 May 2014 23:41:03 -0700
Subject: [Swift-user] Resuming jobs when script has changed
In-Reply-To: <8635C0D1735D2C4BA6E571FD97486FF1799236F0@PRDEXMBX-05.the-lab.llnl.gov>
References: <8635C0D1735D2C4BA6E571FD97486FF172EC8E71@PRDEXMBX-05.the-lab.llnl.gov> <53765072.9070906@anl.gov> <8635C0D1735D2C4BA6E571FD97486FF17992320C@PRDEXMBX-05.the-lab.llnl.gov> <8635C0D1735D2C4BA6E571FD97486FF179923310@PRDEXMBX-05.the-lab.llnl.gov> <1400535819.19707.16.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF1799234A8@PRDEXMBX-05.the-lab.llnl.gov> <1400537144.19707.33.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF17992352E@PRDEXMBX-05.the-lab.llnl.gov> <1400538504.19707.38.camel@echo> <1400541887.19707.45.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF1799236C7@PRDEXMBX-05.the-lab.llnl.gov> <1400565771.821.4.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF1799236F0@PRDEXMBX-05.the-lab.llnl.gov>
Message-ID: <1400568063.1116.11.camel@echo>

Hi Greg,

So I think I might be misunderstanding the issue. And I am not talking about what you have right now, but what you want to get to. So let me make some toy example and see if it's anything like what you need.

There is some directory. Let's call it "d". In "d", you run swift. It produces, among other things, some files, let's call them "b??". The basic mechanism for producing the "b??" files is something of this sort (in swift):

file[] b ;
foreach i in INDEX_LIST {
    b[i] = generate(i);
}

Let's say that when you first run swift, you get b00, b05, and b08. These files are subsequently used to do more complex stuff, say "c = f(b[i])".

Now, in a subsequent run, INDEX_LIST is {02, 05, 07}.
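
To make the toy example concrete, here is a rough shell sketch of which "b??" files the second run would have to generate and which it could pick up from the first run. It is plain shell, not Swift, and it says nothing about how Swift itself decides this; the function name plan_generation is invented for the illustration.

```shell
#!/bin/bash
# Illustration only: given the indices produced by an earlier run and the
# INDEX_LIST of a later run, report which b?? files can be reused.
# plan_generation is a hypothetical name, not a Swift function.
plan_generation() {
    existing="$1"   # indices produced by the first run
    wanted="$2"     # INDEX_LIST of the subsequent run
    for i in $wanted; do
        case " $existing " in
            *" $i "*) echo "b$i: reuse" ;;
            *)        echo "b$i: generate" ;;
        esac
    done
}

# First run produced b00, b05, b08; the next run asks for 02, 05, 07.
plan_generation "00 05 08" "02 05 07"
```

With the index lists from the example, only b05 is reported as reusable; b02 and b07 would be generated.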
You would like b05 to not be generated again, but the old one to be used. Is this correct? I'm getting a bit confused by your "swift intermediate directory" and how it relates to this picture. Mihael On Tue, 2014-05-20 at 06:18 +0000, Bronevetsky, Greg wrote: > I'm not sure I'm following. In my case mydir/ and olddir/ are the > same. I'm trying to run the script in its original directory within my > home directory, while making sure not to re-create files that already > exist. The issue that I had is that in my original solution I copied > files from my home directory to Swift's intermediate directory and > then back, overwriting the original file in my home directory with an > identical copy. The first copy (home -> Swift intermediate dir) seems > unavoidable but the second seems unnecessary. > > Greg Bronevetsky > Lawrence Livermore National Lab > (925) 424-5756 > bronevetsky at llnl.gov > http://greg.bronevetsky.com > > > -----Original Message----- > From: Mihael Hategan [mailto:hategan at mcs.anl.gov] > Sent: Monday, May 19, 2014 11:03 PM > To: Bronevetsky, Greg > Cc: swift-user at ci.uchicago.edu > Subject: Re: [Swift-user] Resuming jobs when script has changed > > Sorry. I thought you didn't want the overhead of a copy. > > If you do want them in your directory, then map "copy" explicitly to the relevant name. > > In other words: > > file a <"mydir/a">; > file olda <"olddir/a">; > a = olda; > > will copy olddir/a to mydir/a > > but > > file a; > file olda <"olddir/a">; > a = olda; > > will not. > > Mihael > > > On Tue, 2014-05-20 at 05:30 +0000, Bronevetsky, Greg wrote: > > Thanks Mihael. I just tried using your operations in my test code (below). It seems to detect whether the files do/don't exist and seems to try to generate the files but they never appear in my directory. 
During a fresh run in an empty directory I get the following output: > > > > Swift 0.94.1 swift-r7114 cog-r3803 > > > > RunID: 20140519-2230-4lyyh8me > > Progress: time: Mon, 19 May 2014 22:30:03 -0700 > > Progress: time: Mon, 19 May 2014 22:30:04 -0700 Stage in:1 > > Submitting:1 > > Progress: time: Mon, 19 May 2014 22:30:09 -0700 Active:1 Checking > > status:1 > > path=g/g15/bronevet/apps/swift-0.94.1/examples/test/work/dataF, > > exists=false Generating > > file://localhost/_concurrent/data-5fbfabc7-8f5d-4240-82b1-4c7ecfdc33aa > > -path=g/g15/bronevet/apps/swift-0.94.1/examples/test/work/copyF, > > exists=false Generating > > file://localhost/_concurrent/copy-44d394f6-53f3-4f5b-9ec9-fa2c19f5fcdf > > - from > > file://localhost/_concurrent/data-5fbfabc7-8f5d-4240-82b1-4c7ecfdc33aa > > -Final status: Mon, 19 May 2014 22:30:09 -0700 Finished > > successfully:3 > > > > Greg Bronevetsky > > Lawrence Livermore National Lab > > (925) 424-5756 > > bronevetsky at llnl.gov > > http://greg.bronevetsky.com > > > > import "common"; > > type file; > > > > string > > WORK_PATH="/g/g15/bronevet/apps/swift-0.94.1/examples/test/work"; > > > > app (file outF) writeFile(string message) { > > echo message stdout=@filename(outF); } > > > > app (file outF) copyFile(file inF) { > > cp @filename(inF) @filename(outF); > > } > > > > file data; > > file dataOld ; > > if(!exists(@strcat(WORK_PATH, "/dataF"))) { > > tracef("Generating %M\n", data); > > (data) = writeData("hello"); > > } else > > { data = dataOld; } > > > > file copy; > > file copyOld ; > > if(!exists(@strcat(WORK_PATH, "/copyF"))) { > > tracef("Generating %M from %M\n", copy, data); > > (copy) = copyFile(data); > > } else > > { copy = copyOld; } > > > > > > -----Original Message----- > > From: Mihael Hategan [mailto:hategan at mcs.anl.gov] > > Sent: Monday, May 19, 2014 4:25 PM > > To: Bronevetsky, Greg > > Cc: swift-user at ci.uchicago.edu > > Subject: Re: [Swift-user] Resuming jobs when script has changed > > > > Here's 
the swift part. It will only work with absolute paths. It can probably be modified to only work with relative paths. > > > > type file; > > > > app (file outf) existsApp(string path) { > > existsApp path stdout=@filename(outf); } > > > > (boolean result) exists(string path) { > > result = readData(existsApp(path)); > > } > > > > file there <"/bin/bash">; > > file notthere <"/bin/notthere">; > > > > tracef("Y %b\n", exists(@filename(there))); tracef("N %b\n", > > exists(@filename(notthere))); > > > > And existsApp: > > #!/bin/bash > > > > # add leading / since @filename removes it # this will not work with swift > 0.94 if [ -f "/$1" ]; then > > echo "true" > > else > > echo "false" > > fi > > > > > > > > On Mon, 2014-05-19 at 15:28 -0700, Mihael Hategan wrote: > > > On Mon, 2014-05-19 at 22:21 +0000, Bronevetsky, Greg wrote: > > > > Thanks, Mihael! I've modified my original code based on your suggestions (new code below) but I'm getting the following error: > > > > Could not start execution > > > > Procedure exists is not defined. > > > > > > Sorry. That's where your implementation of exists for absolute path > > > names goes. I'll send you the relevant code shortly. I'm working on > > > the array splitting issue now. > > > > > > Mihael > > > > > > > > > _______________________________________________ > > > Swift-user mailing list > > > Swift-user at ci.uchicago.edu > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > > > > > From bronevetsky1 at llnl.gov Tue May 20 16:10:52 2014 From: bronevetsky1 at llnl.gov (Bronevetsky, Greg) Date: Tue, 20 May 2014 21:10:52 +0000 Subject: [Swift-user] Data transfer error Message-ID: <8635C0D1735D2C4BA6E571FD97486FF179925AFC@PRDEXMBX-05.the-lab.llnl.gov> I sometimes get the following error in my Swift runs: Caused by: Failed to move output file solver_bicg.precond_diag.mtx_nasa1824/mt_fmodel/fm_n/lm_local/allStats to shared directory What causes it and how can I avoid it? 
Greg Bronevetsky Lawrence Livermore National Lab (925) 424-5756 bronevetsky at llnl.gov http://greg.bronevetsky.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From jozik at uchicago.edu Tue May 20 22:48:54 2014 From: jozik at uchicago.edu (Jonathan Ozik) Date: Tue, 20 May 2014 22:48:54 -0500 Subject: [Swift-user] Fixed array vs array mapper Message-ID: I'm having trouble using array mappers for output files. The fixed array mapper seems to work on 0.94.1 but not with trunk. I've put together a minimal case that will hopefully help shed light on the issue. There's a swift script and a simple bash script that creates files. I've tried this both with 0.94.1 and with the trunk as of May 16. Trying the array mapper: With 0.94.1, the execution just hangs at: *** Ready to call sum_obj With the trunk version I get the error: *** Ready to call sum_obj Execution failed: Exception in sh: Arguments: [cs.sh] Host: localhost Directory: minimal_case-run036/jobs/k/sh-kl8esxql exception @ swift-int-staging.k, line: 182 Caused by: exception @ swift-int-staging.k, line: 178 Caused by: null Caused by: org.globus.cog.abstraction.impl.common.task.InvalidProviderException: No 'proxy' provider or alias found. Available providers: [cobalt, ssh-cl, gsiftp, coaster, webdav, slurm, dcache, ssh, gt2, condor, http, coaster-persistent, pbs, ftp, lsf, gsiftp-old, local, sge]. Aliases: condor <-> condorlocal; cobalt <-> cobaltlocal; gsiftp <-> gridftp; pbs <-> pbslocal; gsiftp-old <-> gridftp-old; local <-> file; k:assign @ swift.k, line: 194 Caused by: Exception in sh: Arguments: [cs.sh] Host: localhost Directory: minimal_case-run036/jobs/k/sh-kl8esxql exception @ swift-int-staging.k, line: 182 Caused by: exception @ swift-int-staging.k, line: 178 Caused by: null Caused by: org.globus.cog.abstraction.impl.common.task.InvalidProviderException: No 'proxy' provider or alias found. 
Available providers: [cobalt, ssh-cl, gsiftp, coaster, webdav, slurm, dcache, ssh, gt2, condor, http, coaster-persistent, pbs, ftp, lsf, gsiftp-old, local, sge]. Aliases: condor <-> condorlocal; cobalt <-> cobaltlocal; gsiftp <-> gridftp; pbs <-> pbslocal; gsiftp-old <-> gridftp-old; local <-> file; If I use the fixed array mapper, the 0.94.1 works but the trunk version gives: *** Ready to call sum_obj WARNING: duplicate mapping found: fs (line 18) and fs (line 18) are both used to write to file://localhost/ Execution failed: Exception in sh: Arguments: [cs.sh] Host: localhost Directory: minimal_case-run037/jobs/y/sh-y7bosxql exception @ swift-int-staging.k, line: 182 Caused by: exception @ swift-int-staging.k, line: 178 Caused by: null Caused by: org.globus.cog.abstraction.impl.common.task.InvalidProviderException: No 'proxy' provider or alias found. Available providers: [cobalt, ssh-cl, gsiftp, coaster, webdav, slurm, dcache, ssh, gt2, condor, http, coaster-persistent, pbs, ftp, lsf, gsiftp-old, local, sge]. Aliases: condor <-> condorlocal; cobalt <-> cobaltlocal; gsiftp <-> gridftp; pbs <-> pbslocal; gsiftp-old <-> gridftp-old; local <-> file; k:assign @ swift.k, line: 194 Caused by: Exception in sh: Arguments: [cs.sh] Host: localhost Directory: minimal_case-run037/jobs/y/sh-y7bosxql exception @ swift-int-staging.k, line: 182 Caused by: exception @ swift-int-staging.k, line: 178 Caused by: null Caused by: org.globus.cog.abstraction.impl.common.task.InvalidProviderException: No 'proxy' provider or alias found. Available providers: [cobalt, ssh-cl, gsiftp, coaster, webdav, slurm, dcache, ssh, gt2, condor, http, coaster-persistent, pbs, ftp, lsf, gsiftp-old, local, sge]. Aliases: condor <-> condorlocal; cobalt <-> cobaltlocal; gsiftp <-> gridftp; pbs <-> pbslocal; gsiftp-old <-> gridftp-old; local <-> file; Any thoughts on what might be going wrong would be greatly appreciated! Jonathan -------------- next part -------------- An HTML attachment was scrubbed... 
URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: minimal_case.swift Type: application/octet-stream Size: 582 bytes Desc: not available URL: -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: cs.sh Type: application/octet-stream Size: 91 bytes Desc: not available URL: -------------- next part -------------- An HTML attachment was scrubbed... URL: From bronevetsky1 at llnl.gov Wed May 21 15:50:19 2014 From: bronevetsky1 at llnl.gov (Bronevetsky, Greg) Date: Wed, 21 May 2014 20:50:19 +0000 Subject: [Swift-user] Data transfer error Message-ID: <8635C0D1735D2C4BA6E571FD97486FF179926169@PRDEXMBX-05.the-lab.llnl.gov> Related question: what causes the following error? Caused by: Failed to move output file solver_bicg.precond_diag.mtx_nasa1824/FI/blocks/block_10 to shared directory I see the file in the swift work directory and the path solver_bicg.precond_diag.mtx_nasa1824/FI/blocks/ exists in the directory where I run the script. Greg Bronevetsky Lawrence Livermore National Lab (925) 424-5756 bronevetsky at llnl.gov http://greg.bronevetsky.com From: Bronevetsky, Greg Sent: Tuesday, May 20, 2014 2:11 PM To: swift-user at ci.uchicago.edu Subject: Data transfer error I sometimes get the following error in my Swift runs: Caused by: Failed to move output file solver_bicg.precond_diag.mtx_nasa1824/mt_fmodel/fm_n/lm_local/allStats to shared directory What causes it and how can I avoid it? Greg Bronevetsky Lawrence Livermore National Lab (925) 424-5756 bronevetsky at llnl.gov http://greg.bronevetsky.com -------------- next part -------------- An HTML attachment was scrubbed... 
URL:

From hategan at mcs.anl.gov Wed May 21 16:09:34 2014
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 21 May 2014 14:09:34 -0700
Subject: [Swift-user] Data transfer error
In-Reply-To: <8635C0D1735D2C4BA6E571FD97486FF179926169@PRDEXMBX-05.the-lab.llnl.gov>
References: <8635C0D1735D2C4BA6E571FD97486FF179926169@PRDEXMBX-05.the-lab.llnl.gov>
Message-ID: <1400706574.25214.3.camel@echo>

Hi,

Sorry for the late reply (to your previous mail mentioning this).

I don't know what the answer to your question is. It shouldn't be happening.

However, a directory called --.d should be created by swift. That directory should contain one or more *.info files which may contain a few more details.

Mihael

On Wed, 2014-05-21 at 20:50 +0000, Bronevetsky, Greg wrote:
> Related question: what causes the following error?
> Caused by: Failed to move output file solver_bicg.precond_diag.mtx_nasa1824/FI/blocks/block_10 to shared directory
> I see the file in the swift work directory and the path solver_bicg.precond_diag.mtx_nasa1824/FI/blocks/ exists in the directory where I run the script.
>
> Greg Bronevetsky
> Lawrence Livermore National Lab
> (925) 424-5756
> bronevetsky at llnl.gov
> http://greg.bronevetsky.com
>
> From: Bronevetsky, Greg
> Sent: Tuesday, May 20, 2014 2:11 PM
> To: swift-user at ci.uchicago.edu
> Subject: Data transfer error
>
> I sometimes get the following error in my Swift runs:
> Caused by: Failed to move output file solver_bicg.precond_diag.mtx_nasa1824/mt_fmodel/fm_n/lm_local/allStats to shared directory
> What causes it and how can I avoid it?
> > Greg Bronevetsky > Lawrence Livermore National Lab > (925) 424-5756 > bronevetsky at llnl.gov > http://greg.bronevetsky.com > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user From bronevetsky1 at llnl.gov Wed May 21 18:59:11 2014 From: bronevetsky1 at llnl.gov (Bronevetsky, Greg) Date: Wed, 21 May 2014 23:59:11 +0000 Subject: [Swift-user] Data transfer error In-Reply-To: <1400706574.25214.3.camel@echo> References: <8635C0D1735D2C4BA6E571FD97486FF179926169@PRDEXMBX-05.the-lab.llnl.gov> <1400706574.25214.3.camel@echo> Message-ID: <8635C0D1735D2C4BA6E571FD97486FF17992DC59@PRDEXMBX-05.the-lab.llnl.gov> Where should I look to debug the following error? Caused by: Block task failed: 0521-5404270-000009 Block task ended prematurely Greg Bronevetsky Lawrence Livermore National Lab (925) 424-5756 bronevetsky at llnl.gov http://greg.bronevetsky.com -----Original Message----- From: Mihael Hategan [mailto:hategan at mcs.anl.gov] Sent: Wednesday, May 21, 2014 2:10 PM To: Bronevetsky, Greg Cc: swift-user at ci.uchicago.edu Subject: Re: [Swift-user] Data transfer error Hi, Sorry for the late reply (to your previous mail mentioning this). I don't know what the answer to your question is. It shouldn't be happening. However, a directory called --.d should be created by swift. That directory should contain one or more *.info file which may contain a few more details. Mihael On Wed, 2014-05-21 at 20:50 +0000, Bronevetsky, Greg wrote: > Related question: what causes the following error? > Caused by: Failed to move output file > solver_bicg.precond_diag.mtx_nasa1824/FI/blocks/block_10 to shared directory I see the file in the swift work directory and the path solver_bicg.precond_diag.mtx_nasa1824/FI/blocks/ exists in the directory where I run the script. 
> > Greg Bronevetsky > Lawrence Livermore National Lab > (925) 424-5756 > bronevetsky at llnl.gov > http://greg.bronevetsky.com > > From: Bronevetsky, Greg > Sent: Tuesday, May 20, 2014 2:11 PM > To: swift-user at ci.uchicago.edu > Subject: Data transfer error > > I sometimes get the following error in my Swift runs: > Caused by: Failed to move output file > solver_bicg.precond_diag.mtx_nasa1824/mt_fmodel/fm_n/lm_local/allStats to shared directory What causes it and how can I avoid it? > > Greg Bronevetsky > Lawrence Livermore National Lab > (925) 424-5756 > bronevetsky at llnl.gov > http://greg.bronevetsky.com > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user From hategan at mcs.anl.gov Thu May 22 02:22:36 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 22 May 2014 00:22:36 -0700 Subject: [Swift-user] Data transfer error In-Reply-To: <8635C0D1735D2C4BA6E571FD97486FF17992DC59@PRDEXMBX-05.the-lab.llnl.gov> References: <8635C0D1735D2C4BA6E571FD97486FF179926169@PRDEXMBX-05.the-lab.llnl.gov> <1400706574.25214.3.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF17992DC59@PRDEXMBX-05.the-lab.llnl.gov> Message-ID: <1400743356.29694.6.camel@echo> There are a number of places. I think this got improved a bit post 0.94, but that's another story. Anyway, first place is the swift log (-.log in the directory where you ran swift). The second place, if the previous one fails, is ~/.globus/coasters/*.log. There is yet another place that isn't enabled by default. That's the coaster worker logs. It can be enabled by saying DEBUG in sites.xml. It will produce some additional logs in ~/.globus/coasters/. Please feel free to send any/all these our way. We might be able to quickly spot some obvious problems. Mihael On Wed, 2014-05-21 at 23:59 +0000, Bronevetsky, Greg wrote: > Where should I look to debug the following error? 
> Caused by: Block task failed: 0521-5404270-000009 Block task ended prematurely > > Greg Bronevetsky > Lawrence Livermore National Lab > (925) 424-5756 > bronevetsky at llnl.gov > http://greg.bronevetsky.com > > > -----Original Message----- > From: Mihael Hategan [mailto:hategan at mcs.anl.gov] > Sent: Wednesday, May 21, 2014 2:10 PM > To: Bronevetsky, Greg > Cc: swift-user at ci.uchicago.edu > Subject: Re: [Swift-user] Data transfer error > > Hi, > > Sorry for the late reply (to your previous mail mentioning this). > > I don't know what the answer to your question is. It shouldn't be happening. > > However, a directory called --.d should be created by swift. That directory should contain one or more *.info file which may contain a few more details. > > Mihael > > On Wed, 2014-05-21 at 20:50 +0000, Bronevetsky, Greg wrote: > > Related question: what causes the following error? > > Caused by: Failed to move output file > > solver_bicg.precond_diag.mtx_nasa1824/FI/blocks/block_10 to shared directory I see the file in the swift work directory and the path solver_bicg.precond_diag.mtx_nasa1824/FI/blocks/ exists in the directory where I run the script. > > > > Greg Bronevetsky > > Lawrence Livermore National Lab > > (925) 424-5756 > > bronevetsky at llnl.gov > > http://greg.bronevetsky.com > > > > From: Bronevetsky, Greg > > Sent: Tuesday, May 20, 2014 2:11 PM > > To: swift-user at ci.uchicago.edu > > Subject: Data transfer error > > > > I sometimes get the following error in my Swift runs: > > Caused by: Failed to move output file > > solver_bicg.precond_diag.mtx_nasa1824/mt_fmodel/fm_n/lm_local/allStats to shared directory What causes it and how can I avoid it? 
> >
> > Greg Bronevetsky
> > Lawrence Livermore National Lab
> > (925) 424-5756
> > bronevetsky at llnl.gov
> > http://greg.bronevetsky.com
> >
> > _______________________________________________
> > Swift-user mailing list
> > Swift-user at ci.uchicago.edu
> > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user
> >

From bronevetsky1 at llnl.gov Thu May 22 11:11:24 2014
From: bronevetsky1 at llnl.gov (Bronevetsky, Greg)
Date: Thu, 22 May 2014 16:11:24 +0000
Subject: [Swift-user] Data transfer error
In-Reply-To: <1400743356.29694.6.camel@echo>
References: <8635C0D1735D2C4BA6E571FD97486FF179926169@PRDEXMBX-05.the-lab.llnl.gov> <1400706574.25214.3.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF17992DC59@PRDEXMBX-05.the-lab.llnl.gov> <1400743356.29694.6.camel@echo>
Message-ID: <8635C0D1735D2C4BA6E571FD97486FF179933F77@PRDEXMBX-05.the-lab.llnl.gov>

There are a number of places. I think this got improved a bit post 0.94, but that's another story.

Anyway, first place is the swift log (-.log in the directory where you ran swift).

I'm not seeing much here. There are the periodic warnings like:
2014-05-22 08:53:50,494-0700 INFO RuntimeStats$ProgressTicker Selecting site:39311 Stage in:1 Submitting:1 Stage out:284 Finished successfully:3118 Failed but can retry:646

And occasional errors like:
Block Block task status changed: Failed Exitcode file (/g/g15/bronevet/.globus/scripts/PBS7360428055973706159.submit.exitcode) not found 5 queue polls after the job was reported done

However, I can't see the file in question at the reported path.
Also, when the run finally fails (I ran with -lazy.errors true to keep it going as far as possible) I get the following:

Exception in runModel:
Arguments: [--solver, bicg, --precond, diag, --matrix, nasa1824, --num_runs, 100, --modelType, contModel, --faultModel, n, --locModel, local, --ap, 1e-2, --am, 1e-4, --sp, 1e-2, --sm, 1e-4, --dp, 1e-2, --dm, 1e-4, --mp, 1e-2, --mm, 1e-4, --psp, 1e-2, --psm, 1e-2, --ptsp, 1e-2, --ptsm, 1e-2, --cprob, 1e-5, --exec_time, 5.591000e-03, --stats, modelBlocks/stats.solver_bicg.precond_diag.mtx_nasa1824.mt_contModel.fm_n.lm_local.ap_1e-2.am_1e-4.sp_1e-2.sm_1e-4.dp_1e-2.dm_1e-4.mp_1e-2.mm_1e-4.psp_1e-2.psm_1e-2.ptsp_1e-2.ptsm_1e-2.cprob_1e-5.block_26]
Host: pbatch
Directory: experiments.new-20140522-0841-bhu0vze3/jobs/v/runModel-vvuy90rl
Caused by: Block task failed: 0522-4108580-000002 Block task ended prematurely

I've attached the log.

> The second place, if the previous one fails, is ~/.globus/coasters/*.log.

~/.globus/coasters contains the following files. No logs in my install.

cscript1601720472000314596.pl cscript7039282452425599503.pl cscript8162919165195912014.pl
cscript3466700121560325070.pl cscript747960757439884021.pl cscript8876053012113700888.pl
cscript6877638344534390867.pl cscript7841839259853776419.pl cscript95537409038396166.pl

> There is yet another place that isn't enabled by default. That's the coaster worker logs. It can be enabled by saying DEBUG in sites.xml. It will produce some additional logs in ~/.globus/coasters/.

Done and attached. Please let me know if you see anything.

Greg Bronevetsky
Lawrence Livermore National Lab
(925) 424-5756
bronevetsky at llnl.gov
http://greg.bronevetsky.com

-------------- next part --------------
A non-text attachment was scrubbed...
Name: worker-0522-0109210-000000.log.bz2
Type: application/octet-stream
Size: 546622 bytes
Desc: worker-0522-0109210-000000.log.bz2
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: experiments.new-20140522-0901-mke7zei7.log.bz2
Type: application/octet-stream
Size: 638485 bytes
Desc: experiments.new-20140522-0901-mke7zei7.log.bz2
URL:

From hategan at mcs.anl.gov Fri May 23 14:17:16 2014
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 23 May 2014 12:17:16 -0700
Subject: [Swift-user] Data transfer error
In-Reply-To: <8635C0D1735D2C4BA6E571FD97486FF179933F77@PRDEXMBX-05.the-lab.llnl.gov>
References: <8635C0D1735D2C4BA6E571FD97486FF179926169@PRDEXMBX-05.the-lab.llnl.gov>
 <1400706574.25214.3.camel@echo>
 <8635C0D1735D2C4BA6E571FD97486FF17992DC59@PRDEXMBX-05.the-lab.llnl.gov>
 <1400743356.29694.6.camel@echo>
 <8635C0D1735D2C4BA6E571FD97486FF179933F77@PRDEXMBX-05.the-lab.llnl.gov>
Message-ID: <1400872636.18856.5.camel@echo>

From this run, do you happen to have a log called "worker-0522-0109210-000001.log"?

The missing exit code file errors combined with a missing log from a worker could indicate that the node on which this is running may not have the home directory mounted properly. Said block seems to fail pretty quickly without running any jobs, so I suspect something in its environment isn't quite right. Though bad nodes may be a little hard to track.

It may also be helpful to disable lazy errors until you get things to run reliably.

Mihael

From hategan at mcs.anl.gov Fri May 23 14:21:29 2014
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 23 May 2014 12:21:29 -0700
Subject: [Swift-user] Fixed array vs array mapper
In-Reply-To:
References:
Message-ID: <1400872889.18856.8.camel@echo>

Hi Jonathan,

The "No 'proxy' provider or alias found" comes up when you are using provider staging and you are missing the staging method setting in sites.xml (e.g.
proxy). Can you try to add that to the sites file and see if you still get the failure (or confirm that the setting is already there)? Mihael On Tue, 2014-05-20 at 22:48 -0500, Jonathan Ozik wrote: > I'm having trouble using array mappers for output files. The fixed > array mapper seems to work on 0.94.1 but not with trunk. I've put > together a minimal case that will hopefully help shed light on the > issue. There's a swift script and a simple bash script that creates > files. > I've tried this both with 0.94.1 and with the trunk as of May 16. > > > Trying the array mapper: > With 0.94.1, the execution just hangs at: > *** Ready to call sum_obj > > > With the trunk version I get the error: > *** Ready to call sum_obj > > > > > Execution failed: > Exception in sh: > Arguments: [cs.sh] > Host: localhost > Directory: minimal_case-run036/jobs/k/sh-kl8esxql > exception @ swift-int-staging.k, line: 182 > Caused by: > exception @ swift-int-staging.k, line: 178 > Caused by: null > Caused by: > org.globus.cog.abstraction.impl.common.task.InvalidProviderException: > No 'proxy' provider or alias found. Available providers: [cobalt, > ssh-cl, gsiftp, coaster, webdav, slurm, dcache, ssh, gt2, condor, > http, coaster-persistent, pbs, ftp, lsf, gsiftp-old, local, sge]. > Aliases: condor <-> condorlocal; cobalt <-> cobaltlocal; gsiftp <-> > gridftp; pbs <-> pbslocal; gsiftp-old <-> gridftp-old; local <-> > file; > k:assign @ swift.k, line: 194 > Caused by: Exception in sh: > Arguments: [cs.sh] > Host: localhost > Directory: minimal_case-run036/jobs/k/sh-kl8esxql > exception @ swift-int-staging.k, line: 182 > Caused by: > exception @ swift-int-staging.k, line: 178 > Caused by: null > Caused by: > org.globus.cog.abstraction.impl.common.task.InvalidProviderException: > No 'proxy' provider or alias found. Available providers: [cobalt, > ssh-cl, gsiftp, coaster, webdav, slurm, dcache, ssh, gt2, condor, > http, coaster-persistent, pbs, ftp, lsf, gsiftp-old, local, sge]. 
> Aliases: condor <-> condorlocal; cobalt <-> cobaltlocal; gsiftp <-> > gridftp; pbs <-> pbslocal; gsiftp-old <-> gridftp-old; local <-> > file; > > > > > If I use the fixed array mapper, the 0.94.1 works but the trunk > version gives: > *** Ready to call sum_obj > > > WARNING: duplicate mapping found: > fs (line 18) and fs (line 18) are both used to write to > file://localhost/ > > > Execution failed: > Exception in sh: > Arguments: [cs.sh] > Host: localhost > Directory: minimal_case-run037/jobs/y/sh-y7bosxql > exception @ swift-int-staging.k, line: 182 > Caused by: > exception @ swift-int-staging.k, line: 178 > Caused by: null > Caused by: > org.globus.cog.abstraction.impl.common.task.InvalidProviderException: > No 'proxy' provider or alias found. Available providers: [cobalt, > ssh-cl, gsiftp, coaster, webdav, slurm, dcache, ssh, gt2, condor, > http, coaster-persistent, pbs, ftp, lsf, gsiftp-old, local, sge]. > Aliases: condor <-> condorlocal; cobalt <-> cobaltlocal; gsiftp <-> > gridftp; pbs <-> pbslocal; gsiftp-old <-> gridftp-old; local <-> > file; > k:assign @ swift.k, line: 194 > Caused by: Exception in sh: > Arguments: [cs.sh] > Host: localhost > Directory: minimal_case-run037/jobs/y/sh-y7bosxql > exception @ swift-int-staging.k, line: 182 > Caused by: > exception @ swift-int-staging.k, line: 178 > Caused by: null > Caused by: > org.globus.cog.abstraction.impl.common.task.InvalidProviderException: > No 'proxy' provider or alias found. Available providers: [cobalt, > ssh-cl, gsiftp, coaster, webdav, slurm, dcache, ssh, gt2, condor, > http, coaster-persistent, pbs, ftp, lsf, gsiftp-old, local, sge]. > Aliases: condor <-> condorlocal; cobalt <-> cobaltlocal; gsiftp <-> > gridftp; pbs <-> pbslocal; gsiftp-old <-> gridftp-old; local <-> > file; > > > > > Any thoughts on what might be going wrong would be greatly > appreciated! 
> > > Jonathan > > > > > > > From bronevetsky1 at llnl.gov Fri May 23 14:32:41 2014 From: bronevetsky1 at llnl.gov (Bronevetsky, Greg) Date: Fri, 23 May 2014 19:32:41 +0000 Subject: [Swift-user] Data transfer error In-Reply-To: <1400872636.18856.5.camel@echo> References: <8635C0D1735D2C4BA6E571FD97486FF179926169@PRDEXMBX-05.the-lab.llnl.gov> <1400706574.25214.3.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF17992DC59@PRDEXMBX-05.the-lab.llnl.gov> <1400743356.29694.6.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF179933F77@PRDEXMBX-05.the-lab.llnl.gov> <1400872636.18856.5.camel@echo> Message-ID: <8635C0D1735D2C4BA6E571FD97486FF1799377AB@PRDEXMBX-05.the-lab.llnl.gov> I've now had a little more experience with this and have gotten a partial workaround. Whatever the underlying cause, it seems to happen a lot less when I disable my mechanisms to avoid re-executing tasks that I've already completed. Right now my guess for the root cause is that I'm hitting the Lustre meta-data servers too hard and they're throwing back occasional errors. Specifically, I just got yelled at by our admins about performing thousands of file openings per second. I just did a small run and got some failures. e.g.: Progress: time: Fri, 23 May 2014 12:25:54 -0700 Selecting site:2723 Submitted:216 Active:119 Stage out:16 Finished successfully:58 Failed but can retry:144 However, when I looked at the log files generated when I set workerLoggingLevel to DEBUG as well as the stdout and stderr of the SLURM scripts I didn't find any failures or errors. What should I be looking for? Greg Bronevetsky Lawrence Livermore National Lab (925) 424-5756 bronevetsky at llnl.gov http://greg.bronevetsky.com -----Original Message----- From: Mihael Hategan [mailto:hategan at mcs.anl.gov] Sent: Friday, May 23, 2014 12:17 PM To: Bronevetsky, Greg Cc: swift-user at ci.uchicago.edu Subject: Re: [Swift-user] Data transfer error From this run, do you happen to have a log called "worker-0522-0109210-000001.log"? 
The missing exit code file errors combined with a missing log from a worker could indicate that the node on which this is running may not have the home directory mounted properly. Said block seems to fail pretty quickly without running any jobs, so I suspect something in its environment isn't quite right. Though bad nodes may be a little hard to track. It may also be helpful to disable lazy errors until you get things to run reliably. Mihael On Thu, 2014-05-22 at 16:11 +0000, Bronevetsky, Greg wrote: > There are a number of places. I think this got improved a bit post 0.94, but that's another story. > > > > Anyway, first place is the swift log (-.log in the directory where you ran swift). > > I?m not seeing much here. There are the periodic warnings like: > > 2014-05-22 08:53:50,494-0700 INFO RuntimeStats$ProgressTicker Selecting site:39311 Stage in:1 Submitting:1 Stage out:284 Finished successfully:3118 Failed but can retry:646 > > And occasional errors like: > > Block Block task status changed: Failed Exitcode file > (/g/g15/bronevet/.globus/scripts/PBS7360428055973706159.submit.exitcod > e) not found 5 queue polls after the job was reported done > > However, I can?t see the file in question at the reported path. 
> > > > Also, when the run finally fails (I ran with -lazy.errors true to keep it going as far as possible) I get the following: > > Exception in runModel: > > Arguments: [--solver, bicg, --precond, diag, --matrix, nasa1824, > --num_runs, 100, --modelType, contModel, --faultModel, n, --locModel, > local, --ap, 1e-2, --am, 1e-4, --sp, 1e-2, --sm, 1e-4, --dp, 1e-2, > --dm, 1e-4, --mp, 1e-2, --mm, 1e-4, --psp, 1e-2, --psm, 1e-2, --ptsp, > 1e-2, --ptsm, 1e-2, --cprob, 1e-5, --exec_time, 5.591000e-03, --stats, > modelBlocks/stats.solver_bicg.precond_diag.mtx_nasa1824.mt_contModel.f > m_n.lm_local.ap_1e-2.am_1e-4.sp_1e-2.sm_1e-4.dp_1e-2.dm_1e-4.mp_1e-2.m > m_1e-4.psp_1e-2.psm_1e-2.ptsp_1e-2.ptsm_1e-2.cprob_1e-5.block_26] > > Host: pbatch > > Directory: > experiments.new-20140522-0841-bhu0vze3/jobs/v/runModel-vvuy90rl > > Caused by: Block task failed: 0522-4108580-000002 Block task ended > prematurely > > > > I?ve attached the log. > > > > The second place, if the previous one fails, is ~/.globus/coasters/*.log. > > ~/.globus/coasters contains the following files. No logs in my install. > > cscript1601720472000314596.pl cscript7039282452425599503.pl > cscript8162919165195912014.pl > > cscript3466700121560325070.pl cscript747960757439884021.pl cscript8876053012113700888.pl > > cscript6877638344534390867.pl cscript7841839259853776419.pl > cscript95537409038396166.pl > > > > There is yet another place that isn't enabled by default. That's the coaster worker logs. It can be enabled by saying DEBUG in sites.xml. It will produce some additional logs in ~/.globus/coasters/. > > Done and attached. Please let me know if you see anything. 
> > > > Greg Bronevetsky > > Lawrence Livermore National Lab > > (925) 424-5756 > > bronevetsky at llnl.gov > > http://greg.bronevetsky.com > > > > > > -----Original Message----- > From: Mihael Hategan [mailto:hategan at mcs.anl.gov] > Sent: Thursday, May 22, 2014 12:23 AM > To: Bronevetsky, Greg > Cc: swift-user at ci.uchicago.edu > Subject: Re: [Swift-user] Data transfer error > > > > There are a number of places. I think this got improved a bit post 0.94, but that's another story. > > > > Anyway, first place is the swift log (-.log in the directory where you ran swift). > > > > The second place, if the previous one fails, is ~/.globus/coasters/*.log. > > > > There is yet another place that isn't enabled by default. That's the coaster worker logs. It can be enabled by saying DEBUG in sites.xml. It will produce some additional logs in ~/.globus/coasters/. > > > > Please feel free to send any/all these our way. We might be able to quickly spot some obvious problems. > > > > Mihael > > > > On Wed, 2014-05-21 at 23:59 +0000, Bronevetsky, Greg wrote: > > > Where should I look to debug the following error? > > > Caused by: Block task failed: 0521-5404270-000009 Block > > task ended > > > prematurely > > > > > > Greg Bronevetsky > > > Lawrence Livermore National Lab > > > (925) 424-5756 > > > bronevetsky at llnl.gov > > > http://greg.bronevetsky.com > > > > > > > > > -----Original Message----- > > > From: Mihael Hategan [mailto:hategan at mcs.anl.gov] > > > Sent: Wednesday, May 21, 2014 2:10 PM > > > To: Bronevetsky, Greg > > > Cc: swift-user at ci.uchicago.edu > > > Subject: Re: [Swift-user] Data transfer error > > > > > > Hi, > > > > > > Sorry for the late reply (to your previous mail mentioning this). > > > > > > I don't know what the answer to your question is. It shouldn't be happening. > > > > > > However, a directory called --.d should be created by swift. That directory should contain one or more *.info file which may contain a few more details. 
> > > > > > Mihael > > > > > > On Wed, 2014-05-21 at 20:50 +0000, Bronevetsky, Greg wrote: > > > > Related question: what causes the following error? > > > > Caused by: Failed to move output file > > > > solver_bicg.precond_diag.mtx_nasa1824/FI/blocks/block_10 to shared directory I see the file in the swift work directory and the path solver_bicg.precond_diag.mtx_nasa1824/FI/blocks/ exists in the directory where I run the script. > > > > > > > > Greg Bronevetsky > > > > Lawrence Livermore National Lab > > > > (925) 424-5756 > > > > bronevetsky at llnl.gov > > y at llnl.gov%3cmailto:bronevetsky at llnl.gov>> > > > > http://greg.bronevetsky.com > > > > > > > > From: Bronevetsky, Greg > > > > Sent: Tuesday, May 20, 2014 2:11 PM > > > > To: swift-user at ci.uchicago.edu > > > > Subject: Data transfer error > > > > > > > > I sometimes get the following error in my Swift runs: > > > > Caused by: Failed to move output file > > > > solver_bicg.precond_diag.mtx_nasa1824/mt_fmodel/fm_n/lm_local/allStats to shared directory What causes it and how can I avoid it? 
> > > > > > > > Greg Bronevetsky > > > > Lawrence Livermore National Lab > > > > (925) 424-5756 > > > > bronevetsky at llnl.gov > > y at llnl.gov%3cmailto:bronevetsky at llnl.gov>> > > > > http://greg.bronevetsky.com > > > > > > > > _______________________________________________ > > > > Swift-user mailing list > > > > Swift-user at ci.uchicago.edu > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > > > > > > > > > From bronevetsky1 at llnl.gov Fri May 23 14:52:20 2014 From: bronevetsky1 at llnl.gov (Bronevetsky, Greg) Date: Fri, 23 May 2014 19:52:20 +0000 Subject: [Swift-user] Data transfer error In-Reply-To: <8635C0D1735D2C4BA6E571FD97486FF1799377AB@PRDEXMBX-05.the-lab.llnl.gov> References: <8635C0D1735D2C4BA6E571FD97486FF179926169@PRDEXMBX-05.the-lab.llnl.gov> <1400706574.25214.3.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF17992DC59@PRDEXMBX-05.the-lab.llnl.gov> <1400743356.29694.6.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF179933F77@PRDEXMBX-05.the-lab.llnl.gov> <1400872636.18856.5.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF1799377AB@PRDEXMBX-05.the-lab.llnl.gov> Message-ID: <8635C0D1735D2C4BA6E571FD97486FF1799377D2@PRDEXMBX-05.the-lab.llnl.gov> Looking deeper through the logs I'm noticing the following messages in my -info files: "Job directory mode is: link on shared filesystem" Googling around I noticed that another mode is "local copy". Would running in this mode alleviate pressure on the global scratch file system? If so, how do I use it? Thanks! Greg Bronevetsky Lawrence Livermore National Lab (925) 424-5756 bronevetsky at llnl.gov http://greg.bronevetsky.com -----Original Message----- From: swift-user-bounces at ci.uchicago.edu [mailto:swift-user-bounces at ci.uchicago.edu] On Behalf Of Bronevetsky, Greg Sent: Friday, May 23, 2014 12:33 PM To: Mihael Hategan Cc: swift-user at ci.uchicago.edu Subject: Re: [Swift-user] Data transfer error I've now had a little more experience with this and have gotten a partial workaround. 
Whatever the underlying cause, it seems to happen a lot less when I disable my mechanisms to avoid re-executing tasks that I've already completed. Right now my guess for the root cause is that I'm hitting the Lustre meta-data servers too hard and they're throwing back occasional errors. Specifically, I just got yelled at by our admins about performing thousands of file openings per second. I just did a small run and got some failures. e.g.: Progress: time: Fri, 23 May 2014 12:25:54 -0700 Selecting site:2723 Submitted:216 Active:119 Stage out:16 Finished successfully:58 Failed but can retry:144 However, when I looked at the log files generated when I set workerLoggingLevel to DEBUG as well as the stdout and stderr of the SLURM scripts I didn't find any failures or errors. What should I be looking for? Greg Bronevetsky Lawrence Livermore National Lab (925) 424-5756 bronevetsky at llnl.gov http://greg.bronevetsky.com -----Original Message----- From: Mihael Hategan [mailto:hategan at mcs.anl.gov] Sent: Friday, May 23, 2014 12:17 PM To: Bronevetsky, Greg Cc: swift-user at ci.uchicago.edu Subject: Re: [Swift-user] Data transfer error From this run, do you happen to have a log called "worker-0522-0109210-000001.log"? The missing exit code file errors combined with a missing log from a worker could indicate that the node on which this is running may not have the home directory mounted properly. Said block seems to fail pretty quickly without running any jobs, so I suspect something in its environment isn't quite right. Though bad nodes may be a little hard to track. It may also be helpful to disable lazy errors until you get things to run reliably. Mihael On Thu, 2014-05-22 at 16:11 +0000, Bronevetsky, Greg wrote: > There are a number of places. I think this got improved a bit post 0.94, but that's another story. > > > > Anyway, first place is the swift log (-.log in the directory where you ran swift). > > I?m not seeing much here. 
There are the periodic warnings like: > > 2014-05-22 08:53:50,494-0700 INFO RuntimeStats$ProgressTicker Selecting site:39311 Stage in:1 Submitting:1 Stage out:284 Finished successfully:3118 Failed but can retry:646 > > And occasional errors like: > > Block Block task status changed: Failed Exitcode file > (/g/g15/bronevet/.globus/scripts/PBS7360428055973706159.submit.exitcod > e) not found 5 queue polls after the job was reported done > > However, I can?t see the file in question at the reported path. > > > > Also, when the run finally fails (I ran with -lazy.errors true to keep it going as far as possible) I get the following: > > Exception in runModel: > > Arguments: [--solver, bicg, --precond, diag, --matrix, nasa1824, > --num_runs, 100, --modelType, contModel, --faultModel, n, --locModel, > local, --ap, 1e-2, --am, 1e-4, --sp, 1e-2, --sm, 1e-4, --dp, 1e-2, > --dm, 1e-4, --mp, 1e-2, --mm, 1e-4, --psp, 1e-2, --psm, 1e-2, --ptsp, > 1e-2, --ptsm, 1e-2, --cprob, 1e-5, --exec_time, 5.591000e-03, --stats, > modelBlocks/stats.solver_bicg.precond_diag.mtx_nasa1824.mt_contModel.f > m_n.lm_local.ap_1e-2.am_1e-4.sp_1e-2.sm_1e-4.dp_1e-2.dm_1e-4.mp_1e-2.m > m_1e-4.psp_1e-2.psm_1e-2.ptsp_1e-2.ptsm_1e-2.cprob_1e-5.block_26] > > Host: pbatch > > Directory: > experiments.new-20140522-0841-bhu0vze3/jobs/v/runModel-vvuy90rl > > Caused by: Block task failed: 0522-4108580-000002 Block task ended > prematurely > > > > I?ve attached the log. > > > > The second place, if the previous one fails, is ~/.globus/coasters/*.log. > > ~/.globus/coasters contains the following files. No logs in my install. > > cscript1601720472000314596.pl cscript7039282452425599503.pl > cscript8162919165195912014.pl > > cscript3466700121560325070.pl cscript747960757439884021.pl cscript8876053012113700888.pl > > cscript6877638344534390867.pl cscript7841839259853776419.pl > cscript95537409038396166.pl > > > > There is yet another place that isn't enabled by default. That's the coaster worker logs. 
It can be enabled by saying DEBUG in sites.xml. It will produce some additional logs in ~/.globus/coasters/. > > Done and attached. Please let me know if you see anything. > > > > Greg Bronevetsky > > Lawrence Livermore National Lab > > (925) 424-5756 > > bronevetsky at llnl.gov > > http://greg.bronevetsky.com > > > > > > -----Original Message----- > From: Mihael Hategan [mailto:hategan at mcs.anl.gov] > Sent: Thursday, May 22, 2014 12:23 AM > To: Bronevetsky, Greg > Cc: swift-user at ci.uchicago.edu > Subject: Re: [Swift-user] Data transfer error > > > > There are a number of places. I think this got improved a bit post 0.94, but that's another story. > > > > Anyway, first place is the swift log (-.log in the directory where you ran swift). > > > > The second place, if the previous one fails, is ~/.globus/coasters/*.log. > > > > There is yet another place that isn't enabled by default. That's the coaster worker logs. It can be enabled by saying DEBUG in sites.xml. It will produce some additional logs in ~/.globus/coasters/. > > > > Please feel free to send any/all these our way. We might be able to quickly spot some obvious problems. > > > > Mihael > > > > On Wed, 2014-05-21 at 23:59 +0000, Bronevetsky, Greg wrote: > > > Where should I look to debug the following error? > > > Caused by: Block task failed: 0521-5404270-000009 Block > > task ended > > > prematurely > > > > > > Greg Bronevetsky > > > Lawrence Livermore National Lab > > > (925) 424-5756 > > > bronevetsky at llnl.gov > > > http://greg.bronevetsky.com > > > > > > > > > -----Original Message----- > > > From: Mihael Hategan [mailto:hategan at mcs.anl.gov] > > > Sent: Wednesday, May 21, 2014 2:10 PM > > > To: Bronevetsky, Greg > > > Cc: swift-user at ci.uchicago.edu > > > Subject: Re: [Swift-user] Data transfer error > > > > > > Hi, > > > > > > Sorry for the late reply (to your previous mail mentioning this). > > > > > > I don't know what the answer to your question is. It shouldn't be happening. 
> > > > > > However, a directory called --.d should be created by swift. That directory should contain one or more *.info files which may contain a few more details. > > > > > > Mihael > > > > > > On Wed, 2014-05-21 at 20:50 +0000, Bronevetsky, Greg wrote: > > > > Related question: what causes the following error? > > > > Caused by: Failed to move output file > > > > solver_bicg.precond_diag.mtx_nasa1824/FI/blocks/block_10 to shared directory I see the file in the swift work directory and the path solver_bicg.precond_diag.mtx_nasa1824/FI/blocks/ exists in the directory where I run the script. > > > > > > > > Greg Bronevetsky > > > > Lawrence Livermore National Lab > > > > (925) 424-5756 > > > > bronevetsky at llnl.gov > > > > http://greg.bronevetsky.com > > > > > > > > From: Bronevetsky, Greg > > > > Sent: Tuesday, May 20, 2014 2:11 PM > > > > To: swift-user at ci.uchicago.edu > > > > Subject: Data transfer error > > > > > > > > I sometimes get the following error in my Swift runs: > > > > Caused by: Failed to move output file > > > > solver_bicg.precond_diag.mtx_nasa1824/mt_fmodel/fm_n/lm_local/allStats to shared directory What causes it and how can I avoid it?
> > > > > > > > Greg Bronevetsky > > > > Lawrence Livermore National Lab > > > > (925) 424-5756 > > > > bronevetsky at llnl.gov > > > > http://greg.bronevetsky.com > > > > > > > > _______________________________________________ > > > > Swift-user mailing list > > > > Swift-user at ci.uchicago.edu > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > > > > > > > > > _______________________________________________ Swift-user mailing list Swift-user at ci.uchicago.edu https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user From hategan at mcs.anl.gov Fri May 23 15:23:13 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 23 May 2014 13:23:13 -0700 Subject: [Swift-user] Data transfer error In-Reply-To: <8635C0D1735D2C4BA6E571FD97486FF1799377AB@PRDEXMBX-05.the-lab.llnl.gov> References: <8635C0D1735D2C4BA6E571FD97486FF179926169@PRDEXMBX-05.the-lab.llnl.gov> <1400706574.25214.3.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF17992DC59@PRDEXMBX-05.the-lab.llnl.gov> <1400743356.29694.6.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF179933F77@PRDEXMBX-05.the-lab.llnl.gov> <1400872636.18856.5.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF1799377AB@PRDEXMBX-05.the-lab.llnl.gov> Message-ID: <1400876593.19522.2.camel@echo> On Fri, 2014-05-23 at 19:32 +0000, Bronevetsky, Greg wrote: > I've now had a little more experience with this and have gotten a > partial workaround. Whatever the underlying cause, it seems to happen > a lot less when I disable my mechanisms to avoid re-executing tasks > that I've already completed. Right now my guess for the root cause is > that I'm hitting the Lustre meta-data servers too hard and they're > throwing back occasional errors. That sounds plausible. > Specifically, I just got yelled at by our admins about performing > thousands of file openings per second. :) > > I just did a small run and got some failures.
e.g.: > Progress: time: Fri, 23 May 2014 12:25:54 -0700 Selecting site:2723 Submitted:216 Active:119 Stage out:16 Finished successfully:58 Failed but can retry:144 > > However, when I looked at the log files generated when I set > workerLoggingLevel to DEBUG as well as the stdout and stderr of the > SLURM scripts I didn't find any failures or errors. What should I be > looking for? Those are probably swift-level errors, and the details would be in the swift log (or on stdout once the run finished). Mihael From hategan at mcs.anl.gov Fri May 23 15:26:24 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 23 May 2014 13:26:24 -0700 Subject: [Swift-user] Data transfer error In-Reply-To: <8635C0D1735D2C4BA6E571FD97486FF1799377D2@PRDEXMBX-05.the-lab.llnl.gov> References: <8635C0D1735D2C4BA6E571FD97486FF179926169@PRDEXMBX-05.the-lab.llnl.gov> <1400706574.25214.3.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF17992DC59@PRDEXMBX-05.the-lab.llnl.gov> <1400743356.29694.6.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF179933F77@PRDEXMBX-05.the-lab.llnl.gov> <1400872636.18856.5.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF1799377AB@PRDEXMBX-05.the-lab.llnl.gov> <8635C0D1735D2C4BA6E571FD97486FF1799377D2@PRDEXMBX-05.the-lab.llnl.gov> Message-ID: <1400876784.19522.6.camel@echo> On Fri, 2014-05-23 at 19:52 +0000, Bronevetsky, Greg wrote: > Looking deeper through the logs I'm noticing the following messages in my -info files: > "Job directory mode is: link on shared filesystem" > Googling around I noticed that another mode is "local copy". Would running in this mode alleviate pressure on the global scratch file system? If so, how do I use it? Thanks! Before I answer that question... Do the compute nodes have local storage? 
Mihael From bronevetsky1 at llnl.gov Fri May 23 15:27:37 2014 From: bronevetsky1 at llnl.gov (Bronevetsky, Greg) Date: Fri, 23 May 2014 20:27:37 +0000 Subject: [Swift-user] Data transfer error In-Reply-To: <1400876784.19522.6.camel@echo> References: <8635C0D1735D2C4BA6E571FD97486FF179926169@PRDEXMBX-05.the-lab.llnl.gov> <1400706574.25214.3.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF17992DC59@PRDEXMBX-05.the-lab.llnl.gov> <1400743356.29694.6.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF179933F77@PRDEXMBX-05.the-lab.llnl.gov> <1400872636.18856.5.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF1799377AB@PRDEXMBX-05.the-lab.llnl.gov> <8635C0D1735D2C4BA6E571FD97486FF1799377D2@PRDEXMBX-05.the-lab.llnl.gov> <1400876784.19522.6.camel@echo> Message-ID: <8635C0D1735D2C4BA6E571FD97486FF17993788A@PRDEXMBX-05.the-lab.llnl.gov> Ramdisk mounted at /tmp. Greg Bronevetsky Lawrence Livermore National Lab (925) 424-5756 bronevetsky at llnl.gov http://greg.bronevetsky.com -----Original Message----- From: Mihael Hategan [mailto:hategan at mcs.anl.gov] Sent: Friday, May 23, 2014 1:26 PM To: Bronevetsky, Greg Cc: swift-user at ci.uchicago.edu Subject: Re: [Swift-user] Data transfer error On Fri, 2014-05-23 at 19:52 +0000, Bronevetsky, Greg wrote: > Looking deeper through the logs I'm noticing the following messages in my -info files: > "Job directory mode is: link on shared filesystem" > Googling around I noticed that another mode is "local copy". Would running in this mode alleviate pressure on the global scratch file system? If so, how do I use it? Thanks! Before I answer that question... Do the compute nodes have local storage? 
Mihael From hategan at mcs.anl.gov Fri May 23 15:32:45 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 23 May 2014 13:32:45 -0700 Subject: [Swift-user] Data transfer error In-Reply-To: <8635C0D1735D2C4BA6E571FD97486FF17993788A@PRDEXMBX-05.the-lab.llnl.gov> References: <8635C0D1735D2C4BA6E571FD97486FF179926169@PRDEXMBX-05.the-lab.llnl.gov> <1400706574.25214.3.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF17992DC59@PRDEXMBX-05.the-lab.llnl.gov> <1400743356.29694.6.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF179933F77@PRDEXMBX-05.the-lab.llnl.gov> <1400872636.18856.5.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF1799377AB@PRDEXMBX-05.the-lab.llnl.gov> <8635C0D1735D2C4BA6E571FD97486FF1799377D2@PRDEXMBX-05.the-lab.llnl.gov> <1400876784.19522.6.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF17993788A@PRDEXMBX-05.the-lab.llnl.gov> Message-ID: <1400877165.19522.8.camel@echo> Would it have enough scratch space to run your apps? Mihael On Fri, 2014-05-23 at 20:27 +0000, Bronevetsky, Greg wrote: > Ramdisk mounted at /tmp. > > Greg Bronevetsky > Lawrence Livermore National Lab > (925) 424-5756 > bronevetsky at llnl.gov > http://greg.bronevetsky.com > > > -----Original Message----- > From: Mihael Hategan [mailto:hategan at mcs.anl.gov] > Sent: Friday, May 23, 2014 1:26 PM > To: Bronevetsky, Greg > Cc: swift-user at ci.uchicago.edu > Subject: Re: [Swift-user] Data transfer error > > On Fri, 2014-05-23 at 19:52 +0000, Bronevetsky, Greg wrote: > > Looking deeper through the logs I'm noticing the following messages in my -info files: > > "Job directory mode is: link on shared filesystem" > > Googling around I noticed that another mode is "local copy". Would running in this mode alleviate pressure on the global scratch file system? If so, how do I use it? Thanks! > > Before I answer that question... > > Do the compute nodes have local storage? 
> > Mihael > From bronevetsky1 at llnl.gov Fri May 23 15:33:33 2014 From: bronevetsky1 at llnl.gov (Bronevetsky, Greg) Date: Fri, 23 May 2014 20:33:33 +0000 Subject: [Swift-user] Data transfer error In-Reply-To: <1400877165.19522.8.camel@echo> References: <8635C0D1735D2C4BA6E571FD97486FF179926169@PRDEXMBX-05.the-lab.llnl.gov> <1400706574.25214.3.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF17992DC59@PRDEXMBX-05.the-lab.llnl.gov> <1400743356.29694.6.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF179933F77@PRDEXMBX-05.the-lab.llnl.gov> <1400872636.18856.5.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF1799377AB@PRDEXMBX-05.the-lab.llnl.gov> <8635C0D1735D2C4BA6E571FD97486FF1799377D2@PRDEXMBX-05.the-lab.llnl.gov> <1400876784.19522.6.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF17993788A@PRDEXMBX-05.the-lab.llnl.gov> <1400877165.19522.8.camel@echo> Message-ID: <8635C0D1735D2C4BA6E571FD97486FF1799378EC@PRDEXMBX-05.the-lab.llnl.gov> Yes, at least as long as it is cleaned out at the end of each job so that I don't get memory leaks across jobs. Greg Bronevetsky Lawrence Livermore National Lab (925) 424-5756 bronevetsky at llnl.gov http://greg.bronevetsky.com -----Original Message----- From: Mihael Hategan [mailto:hategan at mcs.anl.gov] Sent: Friday, May 23, 2014 1:33 PM To: Bronevetsky, Greg Cc: swift-user at ci.uchicago.edu Subject: Re: [Swift-user] Data transfer error Would it have enough scratch space to run your apps? Mihael On Fri, 2014-05-23 at 20:27 +0000, Bronevetsky, Greg wrote: > Ramdisk mounted at /tmp. 
> > Greg Bronevetsky > Lawrence Livermore National Lab > (925) 424-5756 > bronevetsky at llnl.gov > http://greg.bronevetsky.com > > > -----Original Message----- > From: Mihael Hategan [mailto:hategan at mcs.anl.gov] > Sent: Friday, May 23, 2014 1:26 PM > To: Bronevetsky, Greg > Cc: swift-user at ci.uchicago.edu > Subject: Re: [Swift-user] Data transfer error > > On Fri, 2014-05-23 at 19:52 +0000, Bronevetsky, Greg wrote: > > Looking deeper through the logs I'm noticing the following messages in my -info files: > > "Job directory mode is: link on shared filesystem" > > Googling around I noticed that another mode is "local copy". Would running in this mode alleviate pressure on the global scratch file system? If so, how do I use it? Thanks! > > Before I answer that question... > > Do the compute nodes have local storage? > > Mihael > From hategan at mcs.anl.gov Fri May 23 15:47:41 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 23 May 2014 13:47:41 -0700 Subject: [Swift-user] Data transfer error In-Reply-To: <8635C0D1735D2C4BA6E571FD97486FF1799378EC@PRDEXMBX-05.the-lab.llnl.gov> References: <8635C0D1735D2C4BA6E571FD97486FF179926169@PRDEXMBX-05.the-lab.llnl.gov> <1400706574.25214.3.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF17992DC59@PRDEXMBX-05.the-lab.llnl.gov> <1400743356.29694.6.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF179933F77@PRDEXMBX-05.the-lab.llnl.gov> <1400872636.18856.5.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF1799377AB@PRDEXMBX-05.the-lab.llnl.gov> <8635C0D1735D2C4BA6E571FD97486FF1799377D2@PRDEXMBX-05.the-lab.llnl.gov> <1400876784.19522.6.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF17993788A@PRDEXMBX-05.the-lab.llnl.gov> <1400877165.19522.8.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF1799378EC@PRDEXMBX-05.the-lab.llnl.gov> Message-ID: <1400878061.19890.8.camel@echo> On Fri, 2014-05-23 at 20:33 +0000, Bronevetsky, Greg wrote: > Yes, at least as long as it is cleaned out at the end of each job so that I don't get memory leaks across 
jobs.

First, the copy-not-link part:
* leave as it is in sites.xml, but add something like: /tmp/swift. This will cause swift to use /tmp/swift to run jobs, and it will copy files to it and back as necessary.

Alternatively, you should be able to avoid hitting the shared FS almost entirely (almost, because the PBS jobs might still use it for scripts and log files):
* set "use.provider.staging=true" in swift.properties
* /tmp/swift in sites.xml
* add file to sites.xml
* compile (or unpack) swift on a local disk on the head node
* also have the directories where you run swift and data files on a local disk on the head node

Mihael

From jozik at uchicago.edu Fri May 23 16:46:02 2014
From: jozik at uchicago.edu (Jonathan Ozik)
Date: Fri, 23 May 2014 16:46:02 -0500
Subject: [Swift-user] Fixed array vs array mapper
In-Reply-To: <1400872889.18856.8.camel@echo>
References: <1400872889.18856.8.camel@echo>
Message-ID:

Thanks Mihael,

It turns out that using the trunk swift.properties file fixed this issue.

Jonathan

On May 23, 2014, at 2:21 PM, Mihael Hategan wrote:

> Hi Jonathan,
>
> The "No 'proxy' provider or alias found" comes up when you are using
> provider staging and you are missing the staging method setting in
> sites.xml (e.g. key="stagingMethod">proxy).
>
> Can you try to add that to the sites file and see if you still get the
> failure (or confirm that the setting is already there)?
>
> Mihael
>
> On Tue, 2014-05-20 at 22:48 -0500, Jonathan Ozik wrote:
>> I'm having trouble using array mappers for output files. The fixed
>> array mapper seems to work on 0.94.1 but not with trunk. I've put
>> together a minimal case that will hopefully help shed light on the
>> issue. There's a swift script and a simple bash script that creates
>> files.
>> I've tried this both with 0.94.1 and with the trunk as of May 16.
>> >> >> Trying the array mapper: >> With 0.94.1, the execution just hangs at: >> *** Ready to call sum_obj >> >> >> With the trunk version I get the error: >> *** Ready to call sum_obj >> >> >> >> >> Execution failed: >> Exception in sh: >> Arguments: [cs.sh] >> Host: localhost >> Directory: minimal_case-run036/jobs/k/sh-kl8esxql >> exception @ swift-int-staging.k, line: 182 >> Caused by: >> exception @ swift-int-staging.k, line: 178 >> Caused by: null >> Caused by: >> org.globus.cog.abstraction.impl.common.task.InvalidProviderException: >> No 'proxy' provider or alias found. Available providers: [cobalt, >> ssh-cl, gsiftp, coaster, webdav, slurm, dcache, ssh, gt2, condor, >> http, coaster-persistent, pbs, ftp, lsf, gsiftp-old, local, sge]. >> Aliases: condor <-> condorlocal; cobalt <-> cobaltlocal; gsiftp <-> >> gridftp; pbs <-> pbslocal; gsiftp-old <-> gridftp-old; local <-> >> file; >> k:assign @ swift.k, line: 194 >> Caused by: Exception in sh: >> Arguments: [cs.sh] >> Host: localhost >> Directory: minimal_case-run036/jobs/k/sh-kl8esxql >> exception @ swift-int-staging.k, line: 182 >> Caused by: >> exception @ swift-int-staging.k, line: 178 >> Caused by: null >> Caused by: >> org.globus.cog.abstraction.impl.common.task.InvalidProviderException: >> No 'proxy' provider or alias found. Available providers: [cobalt, >> ssh-cl, gsiftp, coaster, webdav, slurm, dcache, ssh, gt2, condor, >> http, coaster-persistent, pbs, ftp, lsf, gsiftp-old, local, sge]. 
>> Aliases: condor <-> condorlocal; cobalt <-> cobaltlocal; gsiftp <-> >> gridftp; pbs <-> pbslocal; gsiftp-old <-> gridftp-old; local <-> >> file; >> >> >> >> >> If I use the fixed array mapper, the 0.94.1 works but the trunk >> version gives: >> *** Ready to call sum_obj >> >> >> WARNING: duplicate mapping found: >> fs (line 18) and fs (line 18) are both used to write to >> file://localhost/ >> >> >> Execution failed: >> Exception in sh: >> Arguments: [cs.sh] >> Host: localhost >> Directory: minimal_case-run037/jobs/y/sh-y7bosxql >> exception @ swift-int-staging.k, line: 182 >> Caused by: >> exception @ swift-int-staging.k, line: 178 >> Caused by: null >> Caused by: >> org.globus.cog.abstraction.impl.common.task.InvalidProviderException: >> No 'proxy' provider or alias found. Available providers: [cobalt, >> ssh-cl, gsiftp, coaster, webdav, slurm, dcache, ssh, gt2, condor, >> http, coaster-persistent, pbs, ftp, lsf, gsiftp-old, local, sge]. >> Aliases: condor <-> condorlocal; cobalt <-> cobaltlocal; gsiftp <-> >> gridftp; pbs <-> pbslocal; gsiftp-old <-> gridftp-old; local <-> >> file; >> k:assign @ swift.k, line: 194 >> Caused by: Exception in sh: >> Arguments: [cs.sh] >> Host: localhost >> Directory: minimal_case-run037/jobs/y/sh-y7bosxql >> exception @ swift-int-staging.k, line: 182 >> Caused by: >> exception @ swift-int-staging.k, line: 178 >> Caused by: null >> Caused by: >> org.globus.cog.abstraction.impl.common.task.InvalidProviderException: >> No 'proxy' provider or alias found. Available providers: [cobalt, >> ssh-cl, gsiftp, coaster, webdav, slurm, dcache, ssh, gt2, condor, >> http, coaster-persistent, pbs, ftp, lsf, gsiftp-old, local, sge]. >> Aliases: condor <-> condorlocal; cobalt <-> cobaltlocal; gsiftp <-> >> gridftp; pbs <-> pbslocal; gsiftp-old <-> gridftp-old; local <-> >> file; >> >> >> >> >> Any thoughts on what might be going wrong would be greatly >> appreciated! 
>> >> >> Jonathan >> >> >> >> >> >> >> > > From bronevetsky1 at llnl.gov Tue May 27 17:56:09 2014 From: bronevetsky1 at llnl.gov (Bronevetsky, Greg) Date: Tue, 27 May 2014 22:56:09 +0000 Subject: [Swift-user] Data transfer error In-Reply-To: <1400876593.19522.2.camel@echo> References: <8635C0D1735D2C4BA6E571FD97486FF179926169@PRDEXMBX-05.the-lab.llnl.gov> <1400706574.25214.3.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF17992DC59@PRDEXMBX-05.the-lab.llnl.gov> <1400743356.29694.6.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF179933F77@PRDEXMBX-05.the-lab.llnl.gov> <1400872636.18856.5.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF1799377AB@PRDEXMBX-05.the-lab.llnl.gov> <1400876593.19522.2.camel@echo> Message-ID: <8635C0D1735D2C4BA6E571FD97486FF179938553@PRDEXMBX-05.the-lab.llnl.gov> Mihael, I've been struggling with the runs for the past few days. I've managed to push some of them through but the majority gets so many errors that they appear to stall out. Below is an example of the stdout output from Swift: RunID: 20140527-1533-6dt18a4b Progress: time: Tue, 27 May 2014 15:33:52 -0700 Progress: time: Tue, 27 May 2014 15:33:54 -0700 Stage in:1 Submitted:2 Progress: time: Tue, 27 May 2014 15:33:55 -0700 Active:2 Finished successfully:1 Progress: time: Tue, 27 May 2014 15:33:56 -0700 Initializing:40 Active:2 Finished successfully:1 Progress: time: Tue, 27 May 2014 15:33:57 -0700 Initializing:352 Selecting site:241 Active:2 Finished successfully:1 Progress: time: Tue, 27 May 2014 15:33:58 -0700 Selecting site:1674 Submitting:326 Active:2 Finished successfully:1 Progress: time: Tue, 27 May 2014 15:34:00 -0700 Selecting site:1601 Stage in:2 Submitted:397 Active:2 Finished successfully:1 ... 
Progress: time: Tue, 27 May 2014 15:39:42 -0700 Selecting site:1268 Stage in:30 Submitted:328 Active:31 Finished successfully:3 Failed but can retry:344

Digging through the logs, I've found the following mention of an error in one of my worker logs (attached):
2014/05/27 15:34:56.648 INFO 000000 1401230034461 Staging out /p/lscratche/bronevet/swift/work/psuadeExperiments-20140527-1533-6dt18a4b/jobs/1/runModel-1r0az8rl/wrapper.error (mode = 2).
...
2014/05/27 15:34:56.659 INFO 000000 1401230034458 Job dir total 17
drwx------ 2 bronevet bronevet 7168 May 27 15:34 .
drwx------ 5 bronevet bronevet 7168 May 27 15:34 ..
-rw------- 1 bronevet bronevet 6078 May 27 15:34 _swiftwrap.staging
-rw------- 1 bronevet bronevet 96 May 27 15:34 out.expID_0
-rw------- 1 bronevet bronevet 1199 May 27 15:34 out.solver_bicg.precond_diag.mtx_nasa1824.mt_0.fm_0.lm_0.ap_-1.5515515515515510e-01.am_-2.1171171171171173e+00.psp_-3.0330330330330328e+00.psm_-2.2472472472472473e+00.cprob_1e-10.block_0
-rw------- 1 bronevet bronevet 0 May 27 15:34 stderr.txt
-rw------- 1 bronevet bronevet 103 May 27 15:34 wrapper.error
-rw------- 1 bronevet bronevet 32501 May 27 15:34 wrapper.log

Also, I saw some SLURM stdout files that said that I'm out of space.
cat /g/g15/bronevet/.globus/scripts/Slurm1575966868019932992.submit.stdout
env: write error: No space left on device
df: write error: No space left on device
cat: write error: No space left on device
cat: write error: No space left on device
_swiftwrap.staging: line 45: echo: write error: No space left on device
...
env: write error: No space left on device
df: write error: No space left on device
cat: write error: No space left on device
cat: write error: No space left on device
_swiftwrap.staging: line 45: echo: write error: No space left on device

However, I can't see how this could be since each node has 16GB of RAM (available as RAM or ramdisk). Is there a way to look into this further?
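
One way to look into the "No space left on device" errors, sketched here under assumptions, is to wrap the app in a small shell script that prints disk usage for the scratch location before and after the real command runs, so that the numbers end up in the job's stderr. The function names `report_disk` and `run_with_disk_report` and the `SCRATCH` variable are hypothetical and not part of Swift; `/tmp` is assumed as the default only because the ramdisk is reported to be mounted there.

```shell
#!/bin/bash
# Hypothetical diagnostic wrapper, not a Swift feature.
# SCRATCH is an assumed variable naming the directory to inspect.
SCRATCH="${SCRATCH:-/tmp}"

# Print a labelled snapshot of disk usage to stderr.
# $1 is a free-form label such as "before" or "after".
report_disk() {
    {
        echo "=== disk report: $1 on $(uname -n) at $(date) ==="
        df -k "$SCRATCH" || true                  # free/used space on the scratch FS
        ls -la "$SCRATCH" | head -n 20 || true    # first few entries, to spot leftovers
    } >&2
}

# Run the given command with a disk report on each side,
# and pass the command's exit code through unchanged.
run_with_disk_report() {
    local rc=0
    report_disk before
    "$@" || rc=$?
    report_disk after
    return "$rc"
}

# When called with arguments, treat them as the command to wrap.
if [ "$#" -gt 0 ]; then
    run_with_disk_report "$@"
    exit $?
fi
```

Saved under some name such as `diskwrap.sh` (again hypothetical), it could be invoked as `diskwrap.sh ./myapp arg1 arg2`. The report goes to stderr rather than to a file, since a log file on a full device could not be written.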
Greg Bronevetsky Lawrence Livermore National Lab (925) 424-5756 bronevetsky at llnl.gov http://greg.bronevetsky.com -----Original Message----- From: Mihael Hategan [mailto:hategan at mcs.anl.gov] Sent: Friday, May 23, 2014 1:23 PM To: Bronevetsky, Greg Cc: swift-user at ci.uchicago.edu Subject: Re: [Swift-user] Data transfer error On Fri, 2014-05-23 at 19:32 +0000, Bronevetsky, Greg wrote: > I've now had a little more experience with this and have gotten a > partial workaround. Whatever the underlying cause, it seems to happen > a lot less when I disable my mechanisms to avoid re-executing tasks > that I've already completed. Right now my guess for the root cause is > that I'm hitting the Lustre meta-data servers too hard and they're > throwing back occasional errors. That sounds plausible. > Specifically, I just got yelled at by our admins about performing > thousands of file openings per second. :) > > I just did a small run and got some failures. e.g.: > Progress: time: Fri, 23 May 2014 12:25:54 -0700 Selecting site:2723 > Submitted:216 Active:119 Stage out:16 Finished successfully:58 > Failed but can retry:144 > > However, when I looked at the log files generated when I set > workerLoggingLevel to DEBUG as well as the stdout and stderr of the > SLURM scripts I didn't find any failures or errors. What should I be > looking for? Those are probably swift-level errors, and the details would be in the swift log (or on stdout once the run finished). Mihael -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: worker-0527-3303530-000000.log
Type: application/octet-stream
Size: 3326458 bytes
Desc: worker-0527-3303530-000000.log
URL:

From hategan at mcs.anl.gov Tue May 27 18:46:04 2014
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 27 May 2014 16:46:04 -0700
Subject: [Swift-user] Data transfer error
In-Reply-To: <8635C0D1735D2C4BA6E571FD97486FF179938553@PRDEXMBX-05.the-lab.llnl.gov>
References: <8635C0D1735D2C4BA6E571FD97486FF179926169@PRDEXMBX-05.the-lab.llnl.gov> <1400706574.25214.3.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF17992DC59@PRDEXMBX-05.the-lab.llnl.gov> <1400743356.29694.6.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF179933F77@PRDEXMBX-05.the-lab.llnl.gov> <1400872636.18856.5.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF1799377AB@PRDEXMBX-05.the-lab.llnl.gov> <1400876593.19522.2.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF179938553@PRDEXMBX-05.the-lab.llnl.gov>
Message-ID: <1401234364.13640.7.camel@echo>

On Tue, 2014-05-27 at 22:56 +0000, Bronevetsky, Greg wrote:
[...]
> Progress: time: Tue, 27 May 2014 15:39:42 -0700 Selecting site:1268 Stage in:30 Submitted:328 Active:31 Finished successfully:3 Failed but can retry:344

I would really suggest disabling lazy errors and execution retries until you get things to run.

[...]
> 2014/05/27 15:34:56.648 INFO 000000 1401230034461 Staging out /p/lscratche/bronevet/swift/work/psuadeExperiments-20140527-1533-6dt18a4b/jobs/1/runModel-1r0az8rl/wrapper.error (mode = 2).

Right. It means something went wrong running the app on the compute node. That's a file that is used to send back the exact error.

[...]
> _swiftwrap.staging: line 45: echo: write error: No space left on device
>
> However, I can't see how this could be since each node has 16GB of RAM (available as RAM or ramdisk). Is there a way to look into this further?

Swift doesn't have much in that direction.
The wrapper logs should contain some diagnostic information for failing jobs, but if they fail due to lack of disk space, I can't see how the wrapper log can be written to. What I would suggest is wrapping your app in a script that looks into disk issues (df, ls), and running multiple apps on a single node and hopefully catching a glimpse of what the problem is before all scratch space is exhausted. I think it would be a nice idea to add some node status (mem/disk/cpu) monitors to the swift monitoring interfaces. Mihael From jozik at uchicago.edu Wed May 28 17:01:47 2014 From: jozik at uchicago.edu (Jonathan Ozik) Date: Wed, 28 May 2014 17:01:47 -0500 Subject: [Swift-user] Known issue with stdout and stderr file delays? Message-ID: <3A043140-2D21-4E23-891F-A4103B1712F9@uchicago.edu> Hi all, Has anyone run into issues when trying to direct stdout and stderr to output files and getting the error: Caused by: The following output files were not created by the application: sim.out, err.out It looks like the files are being created too late and not caught when the output is being gathered up? This is using trunk, by the way. Also, in a related question, is there an easy way to redirect stdout and stderr from an app invocation to the regular swift.out without creating specific files? 
Jonathan From bronevetsky1 at llnl.gov Wed May 28 17:54:16 2014 From: bronevetsky1 at llnl.gov (Bronevetsky, Greg) Date: Wed, 28 May 2014 22:54:16 +0000 Subject: [Swift-user] Data transfer error In-Reply-To: <1401234364.13640.7.camel@echo> References: <8635C0D1735D2C4BA6E571FD97486FF179926169@PRDEXMBX-05.the-lab.llnl.gov> <1400706574.25214.3.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF17992DC59@PRDEXMBX-05.the-lab.llnl.gov> <1400743356.29694.6.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF179933F77@PRDEXMBX-05.the-lab.llnl.gov> <1400872636.18856.5.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF1799377AB@PRDEXMBX-05.the-lab.llnl.gov> <1400876593.19522.2.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF179938553@PRDEXMBX-05.the-lab.llnl.gov> <1401234364.13640.7.camel@echo> Message-ID: <8635C0D1735D2C4BA6E571FD97486FF179938CB9@PRDEXMBX-05.the-lab.llnl.gov> Mihael, I ran a few more experiments where I ran a workflow on a single cluster node while monitoring its memory use but I didn't see any issues with it running out of memory since at all times /proc/meminfo reported 22GB out of 24GB free. I've now begun a more focused analysis where I have a simple script that captures the high-level structure of my real script. It first generates a bunch of files, producing additional temporary files and the directories along with the main output file. These files are then reduced using a reduction tree based on the example you sent me. I have not yet gotten the simple script to fail in the same way as the main script but I've noticed a few oddities. First, although my sites file has file and my cf file has use.provider.staging=true, I see that all the intermediate files produced by my tasks are written to the global file system specified in the sites file as /p/lscratche/bronevet/swift_work. How do I force Swift to use node-local storage for this data? Second, when I run as many processes on the one node as there are cores, the script runs but it keeps stalling. 
As you can see below, it processes tasks in batches of 12. However, after a few batches the job is aborted (~6 mins into a 30 min allocation) even though the node appears healthy and does not run out of memory and Swift submits a new job into the batch queue. Why does this happen? Swift 0.94.1 swift-r7114 cog-r3803 RunID: 20140528-1504-1ndui8r0 Progress: time: Wed, 28 May 2014 15:04:36 -0700 Progress: time: Wed, 28 May 2014 15:04:37 -0700 Initializing:258 Progress: time: Wed, 28 May 2014 15:04:38 -0700 Initializing:698 Selecting site:589 Progress: time: Wed, 28 May 2014 15:04:45 -0700 Selecting site:4408 Submitting:3 Progress: time: Wed, 28 May 2014 15:05:06 -0700 Selecting site:4010 Submitted:401 ... Progress: time: Wed, 28 May 2014 15:15:06 -0700 Selecting site:4010 Submitted:401 Progress: time: Wed, 28 May 2014 15:15:13 -0700 Selecting site:4010 Stage in:1 Submitted:400 Progress: time: Wed, 28 May 2014 15:15:21 -0700 Selecting site:4010 Stage in:9 Submitted:392 Progress: time: Wed, 28 May 2014 15:15:22 -0700 Selecting site:4010 Stage in:5 Submitted:389 Active:7 Progress: time: Wed, 28 May 2014 15:15:36 -0700 Selecting site:4010 Submitted:389 Active:12 Progress: time: Wed, 28 May 2014 15:16:06 -0700 Selecting site:4010 Submitted:389 Active:12 Progress: time: Wed, 28 May 2014 15:16:36 -0700 Selecting site:4010 Submitted:389 Active:12 Progress: time: Wed, 28 May 2014 15:17:06 -0700 Selecting site:4010 Submitted:389 Active:12 Progress: time: Wed, 28 May 2014 15:17:36 -0700 Selecting site:4010 Submitted:389 Active:12 Progress: time: Wed, 28 May 2014 15:18:06 -0700 Selecting site:4010 Submitted:389 Active:12 Progress: time: Wed, 28 May 2014 15:18:12 -0700 Selecting site:4010 Submitted:389 Active:11 Stage out:1 Progress: time: Wed, 28 May 2014 15:18:15 -0700 Selecting site:4010 Submitted:389 Active:2 Stage out:10 Progress: time: Wed, 28 May 2014 15:18:23 -0700 Selecting site:4010 Submitted:389 Active:1 Stage out:11 Progress: time: Wed, 28 May 2014 15:18:25 -0700 
Selecting site:3998 Stage in:1 Submitted:400 Finished successfully:12 Progress: time: Wed, 28 May 2014 15:18:26 -0700 Selecting site:3998 Stage in:9 Submitted:392 Finished successfully:12 Progress: time: Wed, 28 May 2014 15:18:36 -0700 Selecting site:3998 Submitted:389 Active:12 Finished successfully:12 Progress: time: Wed, 28 May 2014 15:19:06 -0700 Selecting site:3998 Submitted:389 Active:12 Finished successfully:12 Progress: time: Wed, 28 May 2014 15:19:36 -0700 Selecting site:3998 Submitted:389 Active:12 Finished successfully:12 Progress: time: Wed, 28 May 2014 15:20:06 -0700 Selecting site:3998 Submitted:389 Active:12 Finished successfully:12 Progress: time: Wed, 28 May 2014 15:20:18 -0700 Selecting site:3998 Submitted:389 Active:11 Stage out:1 Finished successfully:12 Progress: time: Wed, 28 May 2014 15:20:19 -0700 Selecting site:3998 Submitted:389 Active:7 Stage out:5 Finished successfully:12 Progress: time: Wed, 28 May 2014 15:20:21 -0700 Selecting site:3998 Submitted:389 Stage out:11 Finished successfully:13 Progress: time: Wed, 28 May 2014 15:20:36 -0700 Selecting site:3986 Submitted:401 Finished successfully:24 Progress: time: Wed, 28 May 2014 15:21:06 -0700 Selecting site:3986 Submitted:401 Finished successfully:24 Progress: time: Wed, 28 May 2014 15:21:36 -0700 Selecting site:3986 Submitted:401 Finished successfully:24 ? Batch allocation released, new request submitted ? 
Progress: time: Wed, 28 May 2014 15:22:06 -0700 Selecting site:3986 Submitted:401 Finished successfully:24 Progress: time: Wed, 28 May 2014 15:22:36 -0700 Selecting site:3986 Submitted:401 Finished successfully:24 Progress: time: Wed, 28 May 2014 15:23:06 -0700 Selecting site:3986 Submitted:401 Finished successfully:24 Progress: time: Wed, 28 May 2014 15:23:18 -0700 Selecting site:3986 Stage in:1 Submitted:400 Finished successfully:24 Progress: time: Wed, 28 May 2014 15:23:20 -0700 Selecting site:3986 Stage in:4 Submitted:397 Finished successfully:24 Progress: time: Wed, 28 May 2014 15:23:24 -0700 Selecting site:3986 Stage in:5 Submitted:396 Finished successfully:24 Progress: time: Wed, 28 May 2014 15:23:36 -0700 Selecting site:3986 Submitted:389 Active:12 Finished successfully:24 Progress: time: Wed, 28 May 2014 15:24:06 -0700 Selecting site:3986 Submitted:389 Active:12 Finished successfully:24 Progress: time: Wed, 28 May 2014 15:24:36 -0700 Selecting site:3986 Submitted:389 Active:12 Finished successfully:24 Progress: time: Wed, 28 May 2014 15:25:06 -0700 Selecting site:3986 Submitted:389 Active:12 Finished successfully:24 Progress: time: Wed, 28 May 2014 15:25:16 -0700 Selecting site:3986 Submitted:389 Active:11 Stage out:1 Finished successfully:24 Progress: time: Wed, 28 May 2014 15:25:26 -0700 Selecting site:3986 Submitted:389 Active:7 Stage out:5 Finished successfully:24 Progress: time: Wed, 28 May 2014 15:25:27 -0700 Selecting site:3986 Submitted:389 Stage out:12 Finished successfully:24 Progress: time: Wed, 28 May 2014 15:25:29 -0700 Selecting site:3975 Stage in:2 Submitted:398 Stage out:1 Finished successfully:35 Progress: time: Wed, 28 May 2014 15:25:32 -0700 Selecting site:3975 Stage in:3 Submitted:397 Stage out:1 Finished successfully:35 Progress: time: Wed, 28 May 2014 15:25:34 -0700 Selecting site:3975 Stage in:4 Submitted:396 Stage out:1 Finished successfully:35 Progress: time: Wed, 28 May 2014 15:25:35 -0700 Selecting site:3974 Stage in:1 
Submitting:1 Submitted:389 Active:10 Finished successfully:36 Progress: time: Wed, 28 May 2014 15:25:36 -0700 Selecting site:3974 Submitted:389 Active:12 Finished successfully:36 Progress: time: Wed, 28 May 2014 15:26:06 -0700 Selecting site:3974 Submitted:389 Active:12 Finished successfully:36 Progress: time: Wed, 28 May 2014 15:26:36 -0700 Selecting site:3974 Submitted:389 Active:12 Finished successfully:36 Progress: time: Wed, 28 May 2014 15:27:01 -0700 Selecting site:3974 Submitted:389 Active:11 Stage out:1 Finished successfully:36 Progress: time: Wed, 28 May 2014 15:27:03 -0700 Selecting site:3974 Submitted:389 Active:2 Stage out:10 Finished successfully:36 Progress: time: Wed, 28 May 2014 15:27:04 -0700 Selecting site:3974 Submitted:389 Active:1 Stage out:11 Finished successfully:36 Progress: time: Wed, 28 May 2014 15:27:06 -0700 Selecting site:3974 Submitted:389 Stage out:12 Finished successfully:36 Progress: time: Wed, 28 May 2014 15:27:08 -0700 Selecting site:3974 Submitted:389 Stage out:11 Finished successfully:37 Progress: time: Wed, 28 May 2014 15:27:36 -0700 Selecting site:3962 Submitted:389 Active:12 Finished successfully:48 Progress: time: Wed, 28 May 2014 15:28:06 -0700 Selecting site:3962 Submitted:389 Active:12 Finished successfully:48 Progress: time: Wed, 28 May 2014 15:28:36 -0700 Selecting site:3962 Submitted:389 Active:12 Finished successfully:48 Progress: time: Wed, 28 May 2014 15:28:47 -0700 Selecting site:3962 Submitted:389 Active:11 Stage out:1 Finished successfully:48 Progress: time: Wed, 28 May 2014 15:29:01 -0700 Selecting site:3962 Submitted:389 Active:8 Stage out:4 Finished successfully:48 Progress: time: Wed, 28 May 2014 15:29:02 -0700 Selecting site:3955 Submitting:1 Submitted:395 Finished successfully:60 Progress: time: Wed, 28 May 2014 15:29:06 -0700 Selecting site:3950 Submitted:401 Finished successfully:60 Progress: time: Wed, 28 May 2014 15:29:36 -0700 Selecting site:3950 Submitted:401 Finished successfully:60 Progress: time: 
Wed, 28 May 2014 15:30:06 -0700 Selecting site:3950 Submitted:401 Finished successfully:60 Progress: time: Wed, 28 May 2014 15:30:36 -0700 Selecting site:3950 Submitted:401 Finished successfully:60 ? Batch allocation released, new request submitted ? Progress: time: Wed, 28 May 2014 15:31:06 -0700 Selecting site:3950 Submitted:401 Finished successfully:60 Progress: time: Wed, 28 May 2014 15:31:36 -0700 Selecting site:3950 Submitted:401 Finished successfully:60 Progress: time: Wed, 28 May 2014 15:31:56 -0700 Selecting site:3950 Stage in:1 Submitted:400 Finished successfully:60 Progress: time: Wed, 28 May 2014 15:32:06 -0700 Selecting site:3950 Submitted:389 Active:12 Finished successfully:60 Progress: time: Wed, 28 May 2014 15:32:36 -0700 Selecting site:3950 Submitted:389 Active:12 Finished successfully:60 Progress: time: Wed, 28 May 2014 15:33:06 -0700 Selecting site:3950 Submitted:389 Active:12 Finished successfully:60 Progress: time: Wed, 28 May 2014 15:33:27 -0700 Selecting site:3950 Submitted:389 Active:11 Stage out:1 Finished successfully:60 Progress: time: Wed, 28 May 2014 15:33:28 -0700 Selecting site:3950 Submitted:389 Active:3 Stage out:9 Finished successfully:60 Progress: time: Wed, 28 May 2014 15:33:31 -0700 Selecting site:3938 Stage in:1 Submitted:400 Finished successfully:72 Progress: time: Wed, 28 May 2014 15:33:36 -0700 Selecting site:3938 Stage in:2 Submitted:399 Finished successfully:72 Progress: time: Wed, 28 May 2014 15:34:06 -0700 Selecting site:3938 Submitted:389 Active:12 Finished successfully:72 Progress: time: Wed, 28 May 2014 15:34:36 -0700 Selecting site:3938 Submitted:389 Active:12 Finished successfully:72 Progress: time: Wed, 28 May 2014 15:35:06 -0700 Selecting site:3938 Submitted:389 Active:12 Finished successfully:72 ? Batch allocation released, new request submitted ? 
Greg Bronevetsky Lawrence Livermore National Lab (925) 424-5756 bronevetsky at llnl.gov http://greg.bronevetsky.com swift -sites.file /g/g15/bronevet/apps/swift-0.94.1/etc/sites.pbatch.sierra.tmp.xml -tc.file ~/code/tmp/sight/apps/linsolve/iml/tc.data -config /g/g15/bronevet/apps/swift-0.94.1/etc/cf.tmp -lazy.errors false ~/code/tmp/sight/apps/linsolve/iml/psuadeExperiments.swift -precond=diag -matrix=nasa1824 -modelType=singleModel -resume psuadeExperiments-20140528-0007-gx1v1o8g.0.rlog Swift 0.94.1 swift-r7114 cog-r3803 RunID: 20140528-1001-vnifrjtf Progress: time: Wed, 28 May 2014 10:01:24 -0700 Tree combinedProgress: time: Wed, 28 May 2014 10:01:54 -0700 Finished in previous run:12 Progress: time: Wed, 28 May 2014 10:02:15 -0700 Initializing:2 Finished in previous run:12 Progress: time: Wed, 28 May 2014 10:02:16 -0700 Initializing:1542 Finished in previous run:39 Progress: time: Wed, 28 May 2014 10:02:17 -0700 Initializing:2627 Selecting site:1532 Finished in previous run:93 Progress: time: Wed, 28 May 2014 10:02:18 -0700 Initializing:10156 Selecting site:3667 Finished in previous run:269 Progress: time: Wed, 28 May 2014 10:02:19 -0700 Initializing:3421 Selecting site:12328 Finished in previous run:269 Progress: time: Wed, 28 May 2014 10:02:21 -0700 Selecting site:15746 Submitting:3 Finished in previous run:269 Progress: time: Wed, 28 May 2014 10:02:23 -0700 Selecting site:15348 Submitting:384 Submitted:17 Finished in previous run:269 Progress: time: Wed, 28 May 2014 10:02:24 -0700 Selecting site:15348 Submitted:401 Finished in previous run:269 Progress: time: Wed, 28 May 2014 10:02:45 -0700 Selecting site:15348 Stage in:1 Submitted:400 Finished in previous run:269 Progress: time: Wed, 28 May 2014 10:02:46 -0700 Selecting site:15348 Stage in:2 Submitted:399 Finished in previous run:269 Progress: time: Wed, 28 May 2014 10:02:52 -0700 Selecting site:15348 Stage in:6 Submitted:395 Finished in previous run:269 Progress: time: Wed, 28 May 2014 10:02:53 -0700 
Selecting site:15348 Stage in:10 Submitted:391 Finished in previous run:269 Progress: time: Wed, 28 May 2014 10:03:24 -0700 Selecting site:15348 Submitted:389 Active:12 Finished in previous run:269 Progress: time: Wed, 28 May 2014 10:03:54 -0700 Selecting site:15348 Submitted:389 Active:12 Finished in previous run:269 Progress: time: Wed, 28 May 2014 10:04:24 -0700 Selecting site:15348 Submitted:389 Active:12 Finished in previous run:269 Progress: time: Wed, 28 May 2014 10:04:54 -0700 Selecting site:15348 Submitted:389 Active:12 Finished in previous run:269 Progress: time: Wed, 28 May 2014 10:05:24 -0700 Selecting site:15348 Submitted:389 Active:12 Finished in previous run:269 Progress: time: Wed, 28 May 2014 10:05:54 -0700 Selecting site:15348 Submitted:389 Active:12 Finished in previous run:269 Progress: time: Wed, 28 May 2014 10:06:24 -0700 Selecting site:15348 Submitted:389 Active:12 Finished in previous run:269 Progress: time: Wed, 28 May 2014 10:06:54 -0700 Selecting site:15348 Submitted:389 Active:12 Finished in previous run:269 Progress: time: Wed, 28 May 2014 10:07:24 -0700 Selecting site:15348 Submitted:389 Active:12 Finished in previous run:269 Progress: time: Wed, 28 May 2014 10:07:54 -0700 Selecting site:15348 Submitted:389 Active:12 Finished in previous run:269 Progress: time: Wed, 28 May 2014 10:08:24 -0700 Selecting site:15348 Submitted:389 Active:12 Finished in previous run:269 Progress: time: Wed, 28 May 2014 10:08:54 -0700 Selecting site:15348 Submitted:389 Active:12 Finished in previous run:269 Progress: time: Wed, 28 May 2014 10:09:24 -0700 Selecting site:15348 Submitted:389 Active:12 Finished in previous run:269 Progress: time: Wed, 28 May 2014 10:09:54 -0700 Selecting site:15348 Submitted:389 Active:12 Finished in previous run:269 Progress: time: Wed, 28 May 2014 10:10:24 -0700 Selecting site:15348 Submitted:389 Active:12 Finished in previous run:269 Progress: time: Wed, 28 May 2014 10:10:54 -0700 Selecting site:15348 Submitted:389 Active:12 
Finished in previous run:269 Progress: time: Wed, 28 May 2014 10:11:24 -0700 Selecting site:15348 Submitted:389 Active:12 Finished in previous run:269 Progress: time: Wed, 28 May 2014 10:11:27 -0700 Selecting site:15348 Submitted:388 Active:13 Finished in previous run:269 Progress: time: Wed, 28 May 2014 10:11:28 -0700 Selecting site:15324 Submitting:23 Submitted:366 Active:12 Finished in previous run:269 Failed but can retry:24 Progress: time: Wed, 28 May 2014 10:11:54 -0700 Selecting site:15324 Submitted:389 Active:12 Finished in previous run:269 Failed but can retry:24 Progress: time: Wed, 28 May 2014 10:12:12 -0700 Selecting site:15324 Submitted:388 Active:13 Finished in previous run:269 Failed but can retry:24 Progress: time: Wed, 28 May 2014 10:12:13 -0700 Selecting site:15300 Submitting:23 Submitted:366 Active:12 Finished in previous run:269 Failed but can retry:48 Progress: time: Wed, 28 May 2014 10:12:24 -0700 Selecting site:15300 Submitted:389 Active:12 Finished in previous run:269 Failed but can retry:48 Progress: time: Wed, 28 May 2014 10:12:54 -0700 Selecting site:15300 Submitted:389 Active:12 Finished in previous run:269 Failed but can retry:48 Progress: time: Wed, 28 May 2014 10:12:58 -0700 Selecting site:15300 Submitted:388 Active:13 Finished in previous run:269 Failed but can retry:48 Progress: time: Wed, 28 May 2014 10:12:59 -0700 Selecting site:15276 Submitting:23 Submitted:366 Active:12 Finished in previous run:269 Failed but can retry:72 Progress: time: Wed, 28 May 2014 10:13:24 -0700 Selecting site:15276 Submitted:389 Active:12 Finished in previous run:269 Failed but can retry:72 Progress: time: Wed, 28 May 2014 10:13:43 -0700 Selecting site:15276 Submitted:388 Active:13 Finished in previous run:269 Failed but can retry:72 Progress: time: Wed, 28 May 2014 10:13:44 -0700 Selecting site:15252 Submitting:23 Submitted:366 Active:12 Finished in previous run:269 Failed but can retry:96 Progress: time: Wed, 28 May 2014 10:13:54 -0700 Selecting 
site:15252 Submitted:389 Active:12 Finished in previous run:269 Failed but can retry:96 Progress: time: Wed, 28 May 2014 10:14:24 -0700 Selecting site:15252 Submitted:389 Active:12 Finished in previous run:269 Failed but can retry:96 Progress: time: Wed, 28 May 2014 10:14:28 -0700 Selecting site:15252 Submitted:388 Active:13 Finished in previous run:269 Failed but can retry:96 Progress: time: Wed, 28 May 2014 10:14:29 -0700 Selecting site:15228 Submitting:23 Submitted:366 Active:12 Finished in previous run:269 Failed but can retry:120 Progress: time: Wed, 28 May 2014 10:14:54 -0700 Selecting site:15228 Submitted:389 Active:12 Finished in previous run:269 Failed but can retry:120 Progress: time: Wed, 28 May 2014 10:15:14 -0700 Selecting site:15228 Submitted:388 Active:13 Finished in previous run:269 Failed but can retry:120 Progress: time: Wed, 28 May 2014 10:15:15 -0700 Selecting site:15204 Submitting:23 Submitted:366 Active:12 Finished in previous run:269 Failed but can retry:144 Progress: time: Wed, 28 May 2014 10:15:24 -0700 Selecting site:15204 Submitted:389 Active:12 Finished in previous run:269 Failed but can retry:144 -----Original Message----- From: Mihael Hategan [mailto:hategan at mcs.anl.gov] Sent: Tuesday, May 27, 2014 4:46 PM To: Bronevetsky, Greg Cc: swift-user at ci.uchicago.edu Subject: Re: [Swift-user] Data transfer error On Tue, 2014-05-27 at 22:56 +0000, Bronevetsky, Greg wrote: [...] > Progress: time: Tue, 27 May 2014 15:39:42 -0700 Selecting site:1268 > Stage in:30 Submitted:328 Active:31 Finished successfully:3 Failed > but can retry:344 I would really suggest disabling lazy errors and execution retries until you get things to run. [...] > 2014/05/27 15:34:56.648 INFO 000000 1401230034461 Staging out /p/lscratche/bronevet/swift/work/psuadeExperiments-20140527-1533-6dt18a4b/jobs/1/runModel-1r0az8rl/wrapper.error (mode = 2). Right. It means something went wrong running the app on the compute node. 
That's a file that is used to send back the exact error. [...] > _swiftwrap.staging: line 45: echo: write error: No space left on > device > > However, I can't see how this could be since each node has 16GB of RAM (available as RAM or ramdisk). Is there a way to look into this further? Swift doesn't have much in that direction. The wrapper logs should contain some diagnostic information for failing jobs, but if they fail due to lack of disk space, I can't see how the wrapper log can be written to. What I would suggest is wrapping your app in a script that looks into disk issues (df, ls), and running multiple apps on a single node and hopefully catching a glimpse of what the problem is before all scratch space is exhausted. I think it would be a nice idea to add some node status (mem/disk/cpu) monitors to the swift monitoring interfaces. Mihael From wilde at anl.gov Wed May 28 18:22:06 2014 From: wilde at anl.gov (Michael Wilde) Date: Wed, 28 May 2014 18:22:06 -0500 Subject: [Swift-user] Known issue with stdout and stderr file delays? In-Reply-To: <3A043140-2D21-4E23-891F-A4103B1712F9@uchicago.edu> References: <3A043140-2D21-4E23-891F-A4103B1712F9@uchicago.edu> Message-ID: <53866F9E.8020206@anl.gov> On 5/28/14, 5:01 PM, Jonathan Ozik wrote: > Hi all, > > Has anyone run into issues when trying to direct stdout and stderr to output files and getting the error: > Caused by: The following output files were not created by the application: sim.out, err.out This seems to be caused by an error in Swift (based on investigating with Jonathan off-list). The bug is that if an app returns an array of files, then any non-array returns from the app are not getting handled correctly. It seems to be a problem in the array-collecting code in the Swift app-launching wrapper, which may be losing its directory context after returning from the collection function.
> > It looks like the files are being created too late and not caught when the output is being gathered up? This is using trunk, by the way. > > Also, in a related question, is there an easy way to redirect stdout and stderr from an app invocation to the regular swift.out without creating specific files? No, not at the moment. Some variations of this capability have been discussed, though. It seems to me to be a useful addition to implement. - Mike > > Jonathan > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user -- Michael Wilde Mathematics and Computer Science Computation Institute Argonne National Laboratory The University of Chicago From hategan at mcs.anl.gov Wed May 28 18:45:38 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 28 May 2014 16:45:38 -0700 Subject: [Swift-user] Data transfer error In-Reply-To: <8635C0D1735D2C4BA6E571FD97486FF179938CB9@PRDEXMBX-05.the-lab.llnl.gov> References: <8635C0D1735D2C4BA6E571FD97486FF179926169@PRDEXMBX-05.the-lab.llnl.gov> <1400706574.25214.3.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF17992DC59@PRDEXMBX-05.the-lab.llnl.gov> <1400743356.29694.6.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF179933F77@PRDEXMBX-05.the-lab.llnl.gov> <1400872636.18856.5.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF1799377AB@PRDEXMBX-05.the-lab.llnl.gov> <1400876593.19522.2.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF179938553@PRDEXMBX-05.the-lab.llnl.gov> <1401234364.13640.7.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF179938CB9@PRDEXMBX-05.the-lab.llnl.gov> Message-ID: <1401320738.12733.3.camel@echo> On Wed, 2014-05-28 at 22:54 +0000, Bronevetsky, Greg wrote: > Mihael, I ran a few more experiments where I ran a workflow on a > single cluster node while monitoring its memory use but I didn't see > any issues with it running out of memory since at all > times /proc/meminfo reported 22GB out of 24GB free. 
The error you were getting previously seemed to indicate that you were running out of *disk* space somewhere, probably on the ramdisk. So maybe the output of 'df' would be better than /proc/meminfo > I've now begun a more focused analysis where I have a simple script > that captures the high-level structure of my real script. It first > generates a bunch of files, producing additional temporary files and > the directories along with the main output file. These files are then > reduced using a reduction tree based on the example you sent me. I > have not yet gotten the simple script to fail in the same way as the > main script but I've noticed a few oddities. > > First, although my sites file has key="stagingMethod">file and my cf file has > use.provider.staging=true, I see that all the intermediate files > produced by my tasks are written to the global file system specified > in the sites file as > /p/lscratche/bronevet/swift_work. How > do I force Swift to use node-local storage for this data? You would have to change to a node-local location. > > Second, when I run as many processes on the one node as there are > cores, the script runs but it keeps stalling. As you can see below, it > processes tasks in batches of 12. However, after a few batches the job > is aborted (~6 mins into a 30 min allocation) even though the node > appears healthy and does not run out of memory and Swift submits a new > job into the batch queue. Why does this happen? Are you specifying a max walltime for the apps? If not, swift assumes 10 minutes. If the first few batches take 21 minutes, and the worker has 30 minutes allocated, it won't be able to fit any other jobs after that. 
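The 'df' suggestion earlier in this message can be tried with a small wrapper around the app. What follows is only a sketch and not part of Swift: the function names report_disk and run_with_disk_report and the SCRATCH_DIR variable are made up for illustration, and it assumes a POSIX shell with df and awk available on the compute node. It prints free space for the scratch location to stderr before and after the real command, and passes the command's exit status through unchanged.

```shell
#!/bin/sh
# Sketch of a diagnostic wrapper; the names below are illustrative, not Swift APIs.

# report_disk DIR: print one line with free space for the filesystem holding DIR.
report_disk() {
    dir=${1:-.}
    # df -P prints one POSIX-format line per filesystem; -k reports 1 KB blocks.
    # Columns: filesystem, total, used, available, capacity, mount point.
    df -Pk "$dir" | awk -v d="$dir" \
        'NR == 2 { printf "disk %s avail_kb=%s used=%s mount=%s\n", d, $4, $5, $6 }'
}

# run_with_disk_report CMD [ARGS...]: run CMD, reporting disk space to stderr
# before and after, and return CMD's exit status unchanged.
run_with_disk_report() {
    report_disk "${SCRATCH_DIR:-/tmp}" >&2
    rc=0
    "$@" || rc=$?
    report_disk "${SCRATCH_DIR:-/tmp}" >&2
    return "$rc"
}

# When this file is used as a wrapper script, pass the real app through.
if [ "$#" -gt 0 ]; then
    run_with_disk_report "$@"
fi
```

Because the report goes to stderr, it should not interfere with an app whose stdout is captured as an output file. Setting SCRATCH_DIR to the ramdisk or work directory in question narrows the report to the filesystem that is suspected of filling up.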
Mihael From bronevetsky1 at llnl.gov Wed May 28 18:48:12 2014 From: bronevetsky1 at llnl.gov (Bronevetsky, Greg) Date: Wed, 28 May 2014 23:48:12 +0000 Subject: [Swift-user] Data transfer error In-Reply-To: <1401320738.12733.3.camel@echo> References: <8635C0D1735D2C4BA6E571FD97486FF179926169@PRDEXMBX-05.the-lab.llnl.gov> <1400706574.25214.3.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF17992DC59@PRDEXMBX-05.the-lab.llnl.gov> <1400743356.29694.6.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF179933F77@PRDEXMBX-05.the-lab.llnl.gov> <1400872636.18856.5.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF1799377AB@PRDEXMBX-05.the-lab.llnl.gov> <1400876593.19522.2.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF179938553@PRDEXMBX-05.the-lab.llnl.gov> <1401234364.13640.7.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF179938CB9@PRDEXMBX-05.the-lab.llnl.gov> <1401320738.12733.3.camel@echo> Message-ID: <8635C0D1735D2C4BA6E571FD97486FF179939D6E@PRDEXMBX-05.the-lab.llnl.gov> Are you specifying a max walltime for the apps? If not, swift assumes 10 minutes. If the first few batches take 21 minutes, and the worker has 30 minutes allocated, it won't be able to fit any other jobs after that. My sites file has the following bounds: 1800 00:24:00 The job typically ended 5-10 min after it started. Greg Bronevetsky Lawrence Livermore National Lab (925) 424-5756 bronevetsky at llnl.gov http://greg.bronevetsky.com -----Original Message----- From: Mihael Hategan [mailto:hategan at mcs.anl.gov] Sent: Wednesday, May 28, 2014 4:46 PM To: Bronevetsky, Greg Cc: swift-user at ci.uchicago.edu Subject: Re: [Swift-user] Data transfer error On Wed, 2014-05-28 at 22:54 +0000, Bronevetsky, Greg wrote: > Mihael, I ran a few more experiments where I ran a workflow on a > single cluster node while monitoring its memory use but I didn't see > any issues with it running out of memory since at all times > /proc/meminfo reported 22GB out of 24GB free. 
The error you were getting previously seemed to indicate that you were running out of *disk* space somewhere, probably on the ramdisk. So maybe the output of 'df' would be better than /proc/meminfo > I've now begun a more focused analysis where I have a simple script > that captures the high-level structure of my real script. It first > generates a bunch of files, producing additional temporary files and > the directories along with the main output file. These files are then > reduced using a reduction tree based on the example you sent me. I > have not yet gotten the simple script to fail in the same way as the > main script but I've noticed a few oddities. > > First, although my sites file has key="stagingMethod">file and my cf file has > use.provider.staging=true, I see that all the intermediate files > produced by my tasks are written to the global file system specified > in the sites file as > /p/lscratche/bronevet/swift_work. How > do I force Swift to use node-local storage for this data? You would have to change to a node-local location. > > Second, when I run as many processes on the one node as there are > cores, the script runs but it keeps stalling. As you can see below, it > processes tasks in batches of 12. However, after a few batches the job > is aborted (~6 mins into a 30 min allocation) even though the node > appears healthy and does not run out of memory and Swift submits a new > job into the batch queue. Why does this happen? Are you specifying a max walltime for the apps? If not, swift assumes 10 minutes. If the first few batches take 21 minutes, and the worker has 30 minutes allocated, it won't be able to fit any other jobs after that. 
Mihael From bronevetsky1 at llnl.gov Wed May 28 18:58:40 2014 From: bronevetsky1 at llnl.gov (Bronevetsky, Greg) Date: Wed, 28 May 2014 23:58:40 +0000 Subject: [Swift-user] Data transfer error In-Reply-To: <1401320738.12733.3.camel@echo> References: <8635C0D1735D2C4BA6E571FD97486FF179926169@PRDEXMBX-05.the-lab.llnl.gov> <1400706574.25214.3.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF17992DC59@PRDEXMBX-05.the-lab.llnl.gov> <1400743356.29694.6.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF179933F77@PRDEXMBX-05.the-lab.llnl.gov> <1400872636.18856.5.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF1799377AB@PRDEXMBX-05.the-lab.llnl.gov> <1400876593.19522.2.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF179938553@PRDEXMBX-05.the-lab.llnl.gov> <1401234364.13640.7.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF179938CB9@PRDEXMBX-05.the-lab.llnl.gov> <1401320738.12733.3.camel@echo> Message-ID: <8635C0D1735D2C4BA6E571FD97486FF179939DD0@PRDEXMBX-05.the-lab.llnl.gov> > /p/lscratche/bronevet/swift_work. How > do I force Swift to use node-local storage for this data? You would have to change to a node-local location. The node-local location is /tmp/${USERNAME}. However, there is no guarantee that this directory will exist when the job starts. How can I get Swift to create it automatically? Greg Bronevetsky Lawrence Livermore National Lab (925) 424-5756 bronevetsky at llnl.gov http://greg.bronevetsky.com -----Original Message----- From: Mihael Hategan [mailto:hategan at mcs.anl.gov] Sent: Wednesday, May 28, 2014 4:46 PM To: Bronevetsky, Greg Cc: swift-user at ci.uchicago.edu Subject: Re: [Swift-user] Data transfer error On Wed, 2014-05-28 at 22:54 +0000, Bronevetsky, Greg wrote: > Mihael, I ran a few more experiments where I ran a workflow on a > single cluster node while monitoring its memory use but I didn't see > any issues with it running out of memory since at all times > /proc/meminfo reported 22GB out of 24GB free. 
The error you were getting previously seemed to indicate that you were running out of *disk* space somewhere, probably on the ramdisk. So maybe the output of 'df' would be better than /proc/meminfo > I've now begun a more focused analysis where I have a simple script > that captures the high-level structure of my real script. It first > generates a bunch of files, producing additional temporary files and > the directories along with the main output file. These files are then > reduced using a reduction tree based on the example you sent me. I > have not yet gotten the simple script to fail in the same way as the > main script but I've noticed a few oddities. > > First, although my sites file has key="stagingMethod">file and my cf file has > use.provider.staging=true, I see that all the intermediate files > produced by my tasks are written to the global file system specified > in the sites file as > /p/lscratche/bronevet/swift_work. How > do I force Swift to use node-local storage for this data? You would have to change to a node-local location. > > Second, when I run as many processes on the one node as there are > cores, the script runs but it keeps stalling. As you can see below, it > processes tasks in batches of 12. However, after a few batches the job > is aborted (~6 mins into a 30 min allocation) even though the node > appears healthy and does not run out of memory and Swift submits a new > job into the batch queue. Why does this happen? Are you specifying a max walltime for the apps? If not, swift assumes 10 minutes. If the first few batches take 21 minutes, and the worker has 30 minutes allocated, it won't be able to fit any other jobs after that. 
Mihael From hategan at mcs.anl.gov Wed May 28 19:11:21 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 28 May 2014 17:11:21 -0700 Subject: [Swift-user] Data transfer error In-Reply-To: <8635C0D1735D2C4BA6E571FD97486FF179939D6E@PRDEXMBX-05.the-lab.llnl.gov> References: <8635C0D1735D2C4BA6E571FD97486FF179926169@PRDEXMBX-05.the-lab.llnl.gov> <1400706574.25214.3.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF17992DC59@PRDEXMBX-05.the-lab.llnl.gov> <1400743356.29694.6.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF179933F77@PRDEXMBX-05.the-lab.llnl.gov> <1400872636.18856.5.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF1799377AB@PRDEXMBX-05.the-lab.llnl.gov> <1400876593.19522.2.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF179938553@PRDEXMBX-05.the-lab.llnl.gov> <1401234364.13640.7.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF179938CB9@PRDEXMBX-05.the-lab.llnl.gov> <1401320738.12733.3.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF179939D6E@PRDEXMBX-05.the-lab.llnl.gov> Message-ID: <1401322281.12733.7.camel@echo> On Wed, 2014-05-28 at 23:48 +0000, Bronevetsky, Greg wrote: > Are you specifying a max walltime for the apps? > > If not, swift assumes 10 minutes. If the first few batches take 21 minutes, and the worker has 30 minutes allocated, it won't be able to fit any other jobs after that. > My sites file has the following bounds: > 1800 We need to fix that one. So that's 30 minutes. > 00:24:00 And that's 24 minutes. So if the worker is left with less than 24 minutes (i.e. after the first 6 minutes), none of the jobs will fit. You might want to lower the app maxwalltime to 10 minutes if that is the actual maximum time the app will take. If 30 minutes is not a hard limit for the queue, increasing that should help. 
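The arithmetic above can be checked with a few lines of shell. This is a sketch of the reasoning only, not of Swift's actual scheduler: hms_to_seconds and job_fits are hypothetical helper names, and the rule that a job fits when the remaining time is at least its walltime ignores any overhead the worker might reserve.

```shell
#!/bin/sh
# Sketch of the walltime reasoning; helper names are illustrative only.

# hms_to_seconds HH:MM:SS: convert a walltime string to seconds.
hms_to_seconds() {
    # awk treats the fields as decimal numbers, so "08" is not read as octal.
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# job_fits ALLOC_SECONDS ELAPSED_SECONDS JOB_WALLTIME_HMS
# Succeeds if the time left in the allocation covers the job's walltime.
job_fits() {
    remaining=$(( $1 - $2 ))
    need=$(hms_to_seconds "$3")
    [ "$remaining" -ge "$need" ]
}

# Values from this thread: a 1800 s allocation and a 00:24:00 app walltime.
for elapsed in 0 300 360 420; do
    if job_fits 1800 "$elapsed" 00:24:00; then
        echo "after ${elapsed}s: a 24-minute job still fits"
    else
        echo "after ${elapsed}s: no 24-minute job fits"
    fi
done
```

With 1800 seconds allocated and a 00:24:00 walltime, only the first six minutes of the allocation can accept new jobs. With a 00:10:00 walltime the same allocation accepts jobs for the first twenty minutes.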
Mihael From hategan at mcs.anl.gov Wed May 28 19:13:24 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 28 May 2014 17:13:24 -0700 Subject: [Swift-user] Data transfer error In-Reply-To: <8635C0D1735D2C4BA6E571FD97486FF179939DD0@PRDEXMBX-05.the-lab.llnl.gov> References: <8635C0D1735D2C4BA6E571FD97486FF179926169@PRDEXMBX-05.the-lab.llnl.gov> <1400706574.25214.3.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF17992DC59@PRDEXMBX-05.the-lab.llnl.gov> <1400743356.29694.6.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF179933F77@PRDEXMBX-05.the-lab.llnl.gov> <1400872636.18856.5.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF1799377AB@PRDEXMBX-05.the-lab.llnl.gov> <1400876593.19522.2.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF179938553@PRDEXMBX-05.the-lab.llnl.gov> <1401234364.13640.7.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF179938CB9@PRDEXMBX-05.the-lab.llnl.gov> <1401320738.12733.3.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF179939DD0@PRDEXMBX-05.the-lab.llnl.gov> Message-ID: <1401322404.13993.0.camel@echo> On Wed, 2014-05-28 at 23:58 +0000, Bronevetsky, Greg wrote: > > /p/lscratche/bronevet/swift_work. How > > do I force Swift to use node-local storage for this data? > You would have to change to a node-local location. > The node-local location is /tmp/${USERNAME}. However, there is no guarantee that this directory will exist when the job starts. How can I get Swift to create it automatically? I believe so, yes. 
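The reply above suggests Swift creates the work directory itself. For anyone who prefers not to rely on that, an app wrapper can make sure the node-local directory exists before the app runs. This is a minimal sketch under that assumption; ensure_scratch is a hypothetical helper, not a Swift feature, and the /tmp default simply follows the /tmp/${USERNAME} location mentioned in the question.

```shell
#!/bin/sh
# Sketch only: ensure_scratch is an illustrative helper, not part of Swift.

# ensure_scratch [BASE]: create BASE/<user> if needed and print its path.
ensure_scratch() {
    # Fall back to id -un when USER is unset, as can happen in batch jobs.
    user=${USER:-$(id -un)}
    base=${1:-/tmp}
    scratch="$base/$user"
    # mkdir -p succeeds whether or not the directory already exists.
    mkdir -p "$scratch" || return 1
    echo "$scratch"
}
```

A wrapper would call it once at the start, for example workdir=$(ensure_scratch /tmp) || exit 1, and then run the app. Since mkdir -p is idempotent, several jobs starting on the same node at once should not conflict over the directory.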
Mihael From bronevetsky1 at llnl.gov Wed May 28 19:15:08 2014 From: bronevetsky1 at llnl.gov (Bronevetsky, Greg) Date: Thu, 29 May 2014 00:15:08 +0000 Subject: [Swift-user] Data transfer error In-Reply-To: <1401322281.12733.7.camel@echo> References: <8635C0D1735D2C4BA6E571FD97486FF179926169@PRDEXMBX-05.the-lab.llnl.gov> <1400706574.25214.3.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF17992DC59@PRDEXMBX-05.the-lab.llnl.gov> <1400743356.29694.6.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF179933F77@PRDEXMBX-05.the-lab.llnl.gov> <1400872636.18856.5.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF1799377AB@PRDEXMBX-05.the-lab.llnl.gov> <1400876593.19522.2.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF179938553@PRDEXMBX-05.the-lab.llnl.gov> <1401234364.13640.7.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF179938CB9@PRDEXMBX-05.the-lab.llnl.gov> <1401320738.12733.3.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF179939D6E@PRDEXMBX-05.the-lab.llnl.gov> <1401322281.12733.7.camel@echo> Message-ID: <8635C0D1735D2C4BA6E571FD97486FF179939E1B@PRDEXMBX-05.the-lab.llnl.gov> Ah, that makes a lot of sense! It might be useful to add an explicit check for this and a warning, since my configuration ends up using just 20% of my available job allocations, but tracking it down is non-trivial since poor efficiency may have many causes. Greg Bronevetsky Lawrence Livermore National Lab (925) 424-5756 bronevetsky at llnl.gov http://greg.bronevetsky.com -----Original Message----- From: Mihael Hategan [mailto:hategan at mcs.anl.gov] Sent: Wednesday, May 28, 2014 5:11 PM To: Bronevetsky, Greg Cc: swift-user at ci.uchicago.edu Subject: Re: [Swift-user] Data transfer error On Wed, 2014-05-28 at 23:48 +0000, Bronevetsky, Greg wrote: > Are you specifying a max walltime for the apps? > > If not, swift assumes 10 minutes. If the first few batches take 21 minutes, and the worker has 30 minutes allocated, it won't be able to fit any other jobs after that.
> My sites file has the following bounds: > 1800 We need to fix that one. So that's 30 minutes. > 00:24:00 And that's 24 minutes. So if the worker is left with less than 24 minutes (i.e. after the first 6 minutes), none of the jobs will fit. You might want to lower the app maxwalltime to 10 minutes if that is the actual maximum time the app will take. If 30 minutes is not a hard limit for the queue, increasing that should help. Mihael From hategan at mcs.anl.gov Wed May 28 19:18:48 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 28 May 2014 17:18:48 -0700 Subject: [Swift-user] Data transfer error In-Reply-To: <8635C0D1735D2C4BA6E571FD97486FF179939E1B@PRDEXMBX-05.the-lab.llnl.gov> References: <8635C0D1735D2C4BA6E571FD97486FF179926169@PRDEXMBX-05.the-lab.llnl.gov> <1400706574.25214.3.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF17992DC59@PRDEXMBX-05.the-lab.llnl.gov> <1400743356.29694.6.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF179933F77@PRDEXMBX-05.the-lab.llnl.gov> <1400872636.18856.5.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF1799377AB@PRDEXMBX-05.the-lab.llnl.gov> <1400876593.19522.2.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF179938553@PRDEXMBX-05.the-lab.llnl.gov> <1401234364.13640.7.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF179938CB9@PRDEXMBX-05.the-lab.llnl.gov> <1401320738.12733.3.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF179939D6E@PRDEXMBX-05.the-lab.llnl.gov> <1401322281.12733.7.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF179939E1B@PRDEXMBX-05.the-lab.llnl.gov> Message-ID: <1401322728.14073.2.camel@echo> On Thu, 2014-05-29 at 00:15 +0000, Bronevetsky, Greg wrote: > Ah, that makes a lot of sense! It might be useful to add an explicit > check for this and a warning since my configuration ends up using just > 20% of my available job allocations but tracking it down it > non-trivial since poor efficiency may have many causes. The workers shut down if they don't have any work, so they won't consume the full 30 minutes. 
However, suggestion noted. Mihael From hategan at mcs.anl.gov Wed May 28 23:09:16 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 28 May 2014 21:09:16 -0700 Subject: [Swift-user] Known issue with stdout and stderr file delays? In-Reply-To: <53866F9E.8020206@anl.gov> References: <3A043140-2D21-4E23-891F-A4103B1712F9@uchicago.edu> <53866F9E.8020206@anl.gov> Message-ID: <1401336556.17821.2.camel@echo> Hi Mike and Jonathan, I'm afraid I need more context. I am unable to reproduce the issue, although I did discover another bug in the way FixedArrayMapper parses strings. Can you please tell me what array mapper we are talking about, what the app signature is, what provider you are using, and whether provider staging is enabled or not. Better yet, please send me the swift log if you can. Mihael On Wed, 2014-05-28 at 18:22 -0500, Michael Wilde wrote: > On 5/28/14, 5:01 PM, Jonathan Ozik wrote: > > Hi all, > > > > Has anyone run into issues when trying to direct stdout and stderr to output files and getting the error: > > Caused by: The following output files were not created by the application: sim.out, err.out > This seems to be caused by an error in Swift (based on investigating > with Jonathan off-list). > > The bug is that if an app returns an array of files, then any non-array > returns from the app are not getting handled correctly. > It seems to be a problem in the array-collecting code in the Swift > app-launching wrapper, which may be loosing its directory context after > returning from the collection function. > > > > It looks like the files are being created too late and not caught when the output is being gathered up? This is using trunk, by the way. > > > > Also, in a related question, is there an easy way to redirect stdout and stderr from an app invocation to the regular swift.out without creating specific files? > No, not at the moment. Some variations of this capability have been > discussed, though. 
It seems to me to be a useful addition to implement. > > - Mike > > > > Jonathan > > > > _______________________________________________ > > Swift-user mailing list > > Swift-user at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > From wilde at anl.gov Wed May 28 23:50:36 2014 From: wilde at anl.gov (Michael Wilde) Date: Wed, 28 May 2014 23:50:36 -0500 Subject: [Swift-user] Known issue with stdout and stderr file delays? In-Reply-To: <1401336556.17821.2.camel@echo> References: <3A043140-2D21-4E23-891F-A4103B1712F9@uchicago.edu> <53866F9E.8020206@anl.gov> <1401336556.17821.2.camel@echo> Message-ID: <5386BC9C.8030308@anl.gov> Here's an example of the problem (what looks to me like a Swift bug in the new array return feature). amap01.swift tries to return a file array and a single file (stdout). Swift is unable to find the mapped stdout file, even though it exists. amap02.swift returns just the stdout file (to show that this part of the test script works). amap03.swift returns just the array. It, too, works. But returning both together fails, suggesting that the swiftwrap code that collects the array prevents any other non-array files from being returned. Tests of Jonathan's script indicated that it didn't matter whether the scalar file being returned was stdout/err or just a plain file. If an array is returned, nothing else seems to be getting returned correctly.
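For illustration only, here is a minimal shell sketch of how a wrapper script could lose its directory context after calling a collection function. It assumes the function reuses a variable name that the caller also relies on; the names OUTDIR, collect_array_leaky and collect_array_scoped are hypothetical and are not taken from the actual swiftwrap code.

```shell
#!/bin/sh
# Hypothetical sketch; names are illustrative, not taken from swiftwrap.

OUTDIR="jobdir"                 # where the caller expects app outputs

collect_array_leaky() {
    # Plain assignment: shell variables are global by default,
    # so this overwrites the caller's OUTDIR.
    OUTDIR="jobdir/out"
}

collect_array_scoped() {
    # 'local' (bash, dash and most sh implementations) keeps the
    # change inside the function.
    local OUTDIR="jobdir/out"
}

collect_array_scoped
AFTER_SCOPED="$OUTDIR"          # caller's value is untouched

collect_array_leaky
AFTER_LEAKY="$OUTDIR"           # caller's value has been replaced

echo "after scoped call: $AFTER_SCOPED"
echo "after leaky call:  $AFTER_LEAKY"
```

If something like this were happening, a scalar output such as out.txt would be looked up relative to the overwritten directory and reported as not created, even though the app did write it.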
Here's the tests: swift$ cat amap01.swift type file; app (file a[], file o) retarr() { sh "-c" "mkdir -p out; echo a >out/a0001.txt; echo a >out/a0002.txt; echo I am stdout" stdout=filename(o); } file a[]; file o<"out.txt">; (a,o) = retarr(); swift$ rm -rf out swift$ swift amap01.swift Swift trunk swift-r7880 cog-r3907 RunID: run010 Progress: Thu, 29 May 2014 04:44:25+0000 Execution failed: Exception in sh: Arguments: [-c, mkdir -p out; echo a >out/a0001.txt; echo a >out/a0002.txt; echo I am stdout] Host: localhost Directory: amap01-run010/jobs/4/sh-4ipn1brl stderr.txt: stdout.txt: I am stdout exception @ swift-int.k, line: 511 Caused by: The following output files were not created by the application: out.txt throw @ swift-int.k, line: 76 k:assign @ swift.k, line: 194 Caused by: Exception in sh: Arguments: [-c, mkdir -p out; echo a >out/a0001.txt; echo a >out/a0002.txt; echo I am stdout] Host: localhost Directory: amap01-run010/jobs/4/sh-4ipn1brl stderr.txt: stdout.txt: I am stdout exception @ swift-int.k, line: 511 Caused by: The following output files were not created by the application: out.txt throw @ swift-int.k, line: 76 swift$ cat amap02.swift type file; app (file a, file o) retarr() { sh "-c" "mkdir -p out; echo a >out/a0001.txt; echo a >out/a0002.txt; echo I am stdout" stdout=filename(o); } file a<"out/a0001.txt">; file o<"out.txt">; (a,o) = retarr(); swift$ rm -rf out swift$ swift amap02.swift Swift trunk swift-r7880 cog-r3907 RunID: run011 Progress: Thu, 29 May 2014 04:44:51+0000 Final status:Thu, 29 May 2014 04:44:51+0000 Finished successfully:1 swift$ ls out a0001.txt swift$ cat out.txt I am stdout swift$ cat amap03.swift type file; app (file a[]) retarr() { sh "-c" "mkdir -p out; echo a >out/a0001.txt; echo a >out/a0002.txt; echo I am stdout" ; } file a[]; a = retarr(); swift$ rm -rf out swift$ swift amap03.swift Swift trunk swift-r7880 cog-r3907 RunID: run012 Progress: Thu, 29 May 2014 04:45:34+0000 Final status:Thu, 29 May 2014 04:45:34+0000 
Finished successfully:1 swift$ ls out a0001.txt a0002.txt swift$ On 5/28/14, 11:09 PM, Mihael Hategan wrote: > Hi Mike and Jonathan, > > I'm afraid I need more context. I am unable to reproduce the issue, > although I did discover another bug in the way FixedArrayMapper parses > strings. > > Can you please tell me what array mapper we are talking about, what the > app signature is, what provider you are using, and whether provider > staging is enabled or not. > > Better yet, please send me the swift log if you can. > > Mihael > > On Wed, 2014-05-28 at 18:22 -0500, Michael Wilde wrote: >> On 5/28/14, 5:01 PM, Jonathan Ozik wrote: >>> Hi all, >>> >>> Has anyone run into issues when trying to direct stdout and stderr to output files and getting the error: >>> Caused by: The following output files were not created by the application: sim.out, err.out >> This seems to be caused by an error in Swift (based on investigating >> with Jonathan off-list). >> >> The bug is that if an app returns an array of files, then any non-array >> returns from the app are not getting handled correctly. >> It seems to be a problem in the array-collecting code in the Swift >> app-launching wrapper, which may be loosing its directory context after >> returning from the collection function. >>> It looks like the files are being created too late and not caught when the output is being gathered up? This is using trunk, by the way. >>> >>> Also, in a related question, is there an easy way to redirect stdout and stderr from an app invocation to the regular swift.out without creating specific files? >> No, not at the moment. Some variations of this capability have been >> discussed, though. It seems to me to be a useful addition to implement. 
>> >> - Mike >>> Jonathan >>> >>> _______________________________________________ >>> Swift-user mailing list >>> Swift-user at ci.uchicago.edu >>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > -- Michael Wilde Mathematics and Computer Science Computation Institute Argonne National Laboratory The University of Chicago From hategan at mcs.anl.gov Thu May 29 04:26:25 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 29 May 2014 02:26:25 -0700 Subject: [Swift-user] Known issue with stdout and stderr file delays? In-Reply-To: <5386BC9C.8030308@anl.gov> References: <3A043140-2D21-4E23-891F-A4103B1712F9@uchicago.edu> <53866F9E.8020206@anl.gov> <1401336556.17821.2.camel@echo> <5386BC9C.8030308@anl.gov> Message-ID: <1401355585.18136.4.camel@echo> Thanks. This was only an issue without provider staging. Anyway, the problem was that the dynamic data collect code was overwriting a variable used to point to the directory where the app output files were. Silly shell scoping rules (or the lack thereof), and silly me for not seeing it. Fix committed to trunk. Mihael On Wed, 2014-05-28 at 23:50 -0500, Michael Wilde wrote: > Here's an example of the problem (what looks to me like a Swift bug in > the new array return feature). > > amap01.swift tries to return a file array and a single file (stdout). > Swift is unable to find the mapped stdout file, even though it exists. > > amap02.swift returns just the stdout file (to show that this part of the > test script works). > > amap03.swift returns just the array. It, too, works. > > But returning both together fails, suggesting that the swiftwrap code > that collects the array prevents any other non-array files from being > returned. > > Tests of Jonathan'sscript indicated that it didnt matter of the scalar > file being returned was stdout/err, or just a plain file. If an array is > returned, nothing else seems to be getting returned correctly. 
> > Here's the tests: > > swift$ cat amap01.swift > type file; > > app (file a[], file o) retarr() > { > sh "-c" "mkdir -p out; echo a >out/a0001.txt; echo a >out/a0002.txt; > echo I am stdout" stdout=filename(o); > } > > file a[]; > file o<"out.txt">; > > (a,o) = retarr(); > swift$ rm -rf out > swift$ swift amap01.swift > Swift trunk swift-r7880 cog-r3907 > RunID: run010 > Progress: Thu, 29 May 2014 04:44:25+0000 > > Execution failed: > Exception in sh: > Arguments: [-c, mkdir -p out; echo a >out/a0001.txt; echo a > >out/a0002.txt; echo I am stdout] > Host: localhost > Directory: amap01-run010/jobs/4/sh-4ipn1brl > stderr.txt: > stdout.txt: I am stdout > exception @ swift-int.k, line: 511 > Caused by: The following output files were not created by the > application: out.txt > throw @ swift-int.k, line: 76 > > k:assign @ swift.k, line: 194 > Caused by: Exception in sh: > Arguments: [-c, mkdir -p out; echo a >out/a0001.txt; echo a > >out/a0002.txt; echo I am stdout] > Host: localhost > Directory: amap01-run010/jobs/4/sh-4ipn1brl > stderr.txt: > stdout.txt: I am stdout > exception @ swift-int.k, line: 511 > Caused by: The following output files were not created by the > application: out.txt > throw @ swift-int.k, line: 76 > > swift$ cat amap02.swift > type file; > > app (file a, file o) retarr() > { > sh "-c" "mkdir -p out; echo a >out/a0001.txt; echo a >out/a0002.txt; > echo I am stdout" stdout=filename(o); > } > > file a<"out/a0001.txt">; > file o<"out.txt">; > > (a,o) = retarr(); > swift$ rm -rf out > swift$ swift amap02.swift > Swift trunk swift-r7880 cog-r3907 > RunID: run011 > Progress: Thu, 29 May 2014 04:44:51+0000 > Final status:Thu, 29 May 2014 04:44:51+0000 Finished successfully:1 > swift$ ls out > a0001.txt > swift$ cat out.txt > I am stdout > swift$ cat amap03.swift > type file; > > app (file a[]) retarr() > { > sh "-c" "mkdir -p out; echo a >out/a0001.txt; echo a >out/a0002.txt; > echo I am stdout" ; > } > > file a[]; > > a = retarr(); > swift$ rm -rf 
out > swift$ swift amap03.swift > Swift trunk swift-r7880 cog-r3907 > RunID: run012 > Progress: Thu, 29 May 2014 04:45:34+0000 > Final status:Thu, 29 May 2014 04:45:34+0000 Finished successfully:1 > swift$ ls out > a0001.txt a0002.txt > swift$ > > > > On 5/28/14, 11:09 PM, Mihael Hategan wrote: > > Hi Mike and Jonathan, > > > > I'm afraid I need more context. I am unable to reproduce the issue, > > although I did discover another bug in the way FixedArrayMapper parses > > strings. > > > > Can you please tell me what array mapper we are talking about, what the > > app signature is, what provider you are using, and whether provider > > staging is enabled or not. > > > > Better yet, please send me the swift log if you can. > > > > Mihael > > > > On Wed, 2014-05-28 at 18:22 -0500, Michael Wilde wrote: > >> On 5/28/14, 5:01 PM, Jonathan Ozik wrote: > >>> Hi all, > >>> > >>> Has anyone run into issues when trying to direct stdout and stderr to output files and getting the error: > >>> Caused by: The following output files were not created by the application: sim.out, err.out > >> This seems to be caused by an error in Swift (based on investigating > >> with Jonathan off-list). > >> > >> The bug is that if an app returns an array of files, then any non-array > >> returns from the app are not getting handled correctly. > >> It seems to be a problem in the array-collecting code in the Swift > >> app-launching wrapper, which may be loosing its directory context after > >> returning from the collection function. > >>> It looks like the files are being created too late and not caught when the output is being gathered up? This is using trunk, by the way. > >>> > >>> Also, in a related question, is there an easy way to redirect stdout and stderr from an app invocation to the regular swift.out without creating specific files? > >> No, not at the moment. Some variations of this capability have been > >> discussed, though. It seems to me to be a useful addition to implement. 
> >> > >> - Mike > >>> Jonathan > >>> > >>> _______________________________________________ > >>> Swift-user mailing list > >>> Swift-user at ci.uchicago.edu > >>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > > From hategan at mcs.anl.gov Thu May 29 04:27:56 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 29 May 2014 02:27:56 -0700 Subject: [Swift-user] Transfer directory structure In-Reply-To: <1396482755.13706.5.camel@echo> References: <1F6C5B44-8853-4D3B-B152-140C7FF31F92@uchicago.edu> <1396403416.19960.2.camel@echo> <94AB5E8C-6EB0-48C9-B645-181D2D89CAA1@uchicago.edu> <1396482755.13706.5.camel@echo> Message-ID: <1401355676.18136.5.camel@echo> This issue should also now be fixed in trunk. Mihael On Wed, 2014-04-02 at 16:52 -0700, Mihael Hategan wrote: > Ooops. Well, it's meant to work. > > Mihael > > On Wed, 2014-04-02 at 11:05 -0500, David Kelly wrote: > > I just did a test with swift/0.95-RC5 and filesys mapper on a directory of > > data I'd like to map, but I'm running into some errors with that. I have > > the following files I'm trying to bring in: > > > > $ find data > > data > > data/foo_a > > data/foo_a/foo_a.txt > > data/foo_a/foo_b.txt > > data/foo_b > > data/foo_b/bar_1 > > data/foo_b/bar_2 > > data/data.txt > > > > My Swift script: > > ----------- > > type file; > > > > app check_files (file inputs[]) > > { > > ls "data/foo_a/foo_a.txt" "data/foo_a/foo_b.txt" "data/foo_b/bar_1" > > "data/foo_b/bar_2" "data/data.txt"; > > } > > > > file inputs[] ; > > foreach i in inputs { > > tracef("%s\n", filename(i)); > > } > > > > check_files(inputs); > > ---------- > > > > Tracef says the filenames are: > > > > data/foo_a.txt > > data/data.txt > > data/foo_b.txt > > > > Which seem to omit the directory structure. 
Then the app fails with: > > > > Execution failed: > > Exception in ls: > > Arguments: [data/foo_a/foo_a.txt, data/foo_a/foo_b.txt, > > data/foo_b/bar_1, data/foo_b/bar_2, data/data.txt] > > Host: westmere > > Directory: filesys_extglob-run001/jobs/h/ls-he7o1nol > > exception @ swift-int.k, line: 530 > > Caused by: null > > Caused by: org.globus.cog.abstraction.impl.file.FileNotFoundException: File > > not found: /home/davidkelly999/tests/filesys_extglob/./data/foo_a.txt > > parallelFor @ swift-int.k, line: 240 > > Caused by: null > > Caused by: org.globus.cog.abstraction.impl.file.FileNotFoundException: File > > not found: /home/davidkelly999/tests/filesys_extglob/./data/foo_a.txt > > > > In the swift work directory, data/data.txt is the only file I see staged in. > > > > > > On Wed, Apr 2, 2014 at 10:40 AM, Jonathan Ozik wrote: > > > > > Thanks David (especially for pointing me to the newer user guide). > > > Mihael, would the "swift/0.95-RC5" version contain the FilesysMapper that > > > accepts the extended glob patterns? > > > > > > Jonathan > > > > > > On Apr 2, 2014, at 10:17 AM, David Kelly wrote: > > > > > > Hi Jonathan, > > > > > > A fairly recent version of 0.95 is available on Midway in the module > > > "swift/0.95-RC5". It should backwards compatible with 0.94. You may see > > > some warnings about deprecated use of @ in front of functions, and you will > > > see logs going into a run directory named run001, run002, etc instead of > > > going to your current working directory. There's more information about > > > this and the other new (and optional) configuration changes at > > > http://swiftlang.org/guides/trunk/userguide/userguide.html#_configuration. > > > Please let me know if you have any issues. > > > > > > Thanks, > > > David > > > > > > > > > On Wed, Apr 2, 2014 at 9:51 AM, Jonathan Ozik wrote: > > > > > >> Yadu, David, Mihael, > > >> > > >> Thanks for your responses. > > >> I'm thinking of using the ext mapper for now. 
Would the trunk/0.95 be > > >> available on Midway? > > >> > > >> Jonathan > > >> > > >> On Apr 1, 2014, at 8:50 PM, Mihael Hategan wrote: > > >> > > >> > At least in trunk/0.95, FilesysMapper accepts extended glob patterns, so > > >> > you should be able to say: > > >> > > > >> > file[] f ; > > >> > > > >> > Mihael > > >> > > > >> > On Tue, 2014-04-01 at 19:49 -0500, David Kelly wrote: > > >> >> I created a ticket about this on Monday because I was running into > > >> similar > > >> >> issues in my scripts. An ext mapper worked for me, but I think this is > > >> a > > >> >> common pattern we can make easier in future releases. > > >> >> https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=1229. > > >> >> > > >> >> > > >> >> On Tue, Apr 1, 2014 at 7:09 PM, Yadu Nand > > >> wrote: > > >> >> > > >> >>> Hi Jonathan, > > >> >>> > > >> >>> What I'd do in this case is map every file under the directory you > > >> want to > > >> >>> send to the compute nodes, into an array and pass > > >> >>> that along to your apps. You can do this mapping using either ext > > >> mapper > > >> >>> or array mappers. > > >> >>> > > >> >>> I have the following dir structure in my folders: > > >> >>> > > >> >>> ./dirs/foo_a/foo_a.txt > > >> >>> ./dirs/foo_a/foo_b.txt > > >> >>> ./dirs/foo_b/bar_1 > > >> >>> ./dirs/foo_b/bar_2 > > >> >>> > > >> >>> Here's my ext mapper (mapper.sh) : > > >> >>> #!/bin/bash > > >> >>> find ./dirs -type f | awk '{printf("[%d] %s\n", NR, $0)}' > > >> >>> > > >> >>> If you are using ext mappers, you would need a script which generates > > >> >>> output in the form [] > > >> >>> Here I use find to just output files and awk to get the right format. > > >> >>> > > >> >>> The swift mapping would be like this: > > >> >>> file array[] ; > > >> >>> > > >> >>> You could also use array mappers to read all files you need from a > > >> file > > >> >>> containing the filenames. I filled filenames.txt with > > >> >>> the names of all files in the folders. 
> > >> >>> > > >> >>> string[] names = readData("filenames.txt"); > > >> >>> file dirmap[] ; > > >> >>> > > >> >>> I've got both cases as examples tarballed here if you'd like to take a > > >> >>> look : http://swift.rcc.uchicago.edu:8042/directory_mapping.tar > > >> >>> > > >> >>> Thanks, > > >> >>> Yadu > > >> >>> > > >> >>> On Tue, Apr 1, 2014 at 8:56 AM, Jonathan Ozik > > >> wrote: > > >> >>> > > >> >>>> Hello all, > > >> >>>> > > >> >>>> Is there a simple way to specify "all files including files in > > >> subfolders > > >> >>>> within a folder" as a file mapper? It looks like the filesys_mapper > > >> does > > >> >>>> get everything within a folder, but has problems if there's a folder > > >> in > > >> >>>> there. > > >> >>>> > > >> >>>> Jonathan > > >> >>>> > > >> >>>> _______________________________________________ > > >> >>>> Swift-user mailing list > > >> >>>> Swift-user at ci.uchicago.edu > > >> >>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > >> >>>> > > >> >>> > > >> >>> > > >> >>> > > >> >>> -- > > >> >>> Yadu Nand B > > >> >>> > > >> >>> > > >> >>> _______________________________________________ > > >> >>> Swift-user mailing list > > >> >>> Swift-user at ci.uchicago.edu > > >> >>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > >> >>> > > >> >> _______________________________________________ > > >> >> Swift-user mailing list > > >> >> Swift-user at ci.uchicago.edu > > >> >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > >> > > > >> > > > >> > > >> > > > > > > > > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user From wilde at anl.gov Thu May 29 09:46:30 2014 From: wilde at anl.gov (Michael Wilde) Date: Thu, 29 May 2014 09:46:30 -0500 Subject: [Swift-user] Known issue with stdout and stderr file delays? 
In-Reply-To: <1401355585.18136.4.camel@echo> References: <3A043140-2D21-4E23-891F-A4103B1712F9@uchicago.edu> <53866F9E.8020206@anl.gov> <1401336556.17821.2.camel@echo> <5386BC9C.8030308@anl.gov> <1401355585.18136.4.camel@echo> Message-ID: <53874846.4000802@anl.gov> Thanks, Mihael. My tests for the two bugs related to app array return mapping now work, on trunk. 1) returning scalar files along with array returns from the same app 2) returning output arrays mapped with simple_mapper without location= argument Yadu, these tests are on midway in ~wilde/swift/lab/arrmapbug/*swift - Mike On 5/29/14, 4:26 AM, Mihael Hategan wrote: > Thanks. This was only an issue without provider staging. > > Anyway, the problem was that the dynamic data collect code was > overwriting a variable used to point to the directory where the app > output files were. Silly shell scoping rules (or the lack thereof), and > silly me for not seeing it. > > Fix committed to trunk. > > Mihael > > On Wed, 2014-05-28 at 23:50 -0500, Michael Wilde wrote: >> Here's an example of the problem (what looks to me like a Swift bug in >> the new array return feature). >> >> amap01.swift tries to return a file array and a single file (stdout). >> Swift is unable to find the mapped stdout file, even though it exists. >> >> amap02.swift returns just the stdout file (to show that this part of the >> test script works). >> >> amap03.swift returns just the array. It, too, works. >> >> But returning both together fails, suggesting that the swiftwrap code >> that collects the array prevents any other non-array files from being >> returned. >> >> Tests of Jonathan'sscript indicated that it didnt matter of the scalar >> file being returned was stdout/err, or just a plain file. If an array is >> returned, nothing else seems to be getting returned correctly. 
>> >> Here's the tests: >> >> swift$ cat amap01.swift >> type file; >> >> app (file a[], file o) retarr() >> { >> sh "-c" "mkdir -p out; echo a >out/a0001.txt; echo a >out/a0002.txt; >> echo I am stdout" stdout=filename(o); >> } >> >> file a[]; >> file o<"out.txt">; >> >> (a,o) = retarr(); >> swift$ rm -rf out >> swift$ swift amap01.swift >> Swift trunk swift-r7880 cog-r3907 >> RunID: run010 >> Progress: Thu, 29 May 2014 04:44:25+0000 >> >> Execution failed: >> Exception in sh: >> Arguments: [-c, mkdir -p out; echo a >out/a0001.txt; echo a >> >out/a0002.txt; echo I am stdout] >> Host: localhost >> Directory: amap01-run010/jobs/4/sh-4ipn1brl >> stderr.txt: >> stdout.txt: I am stdout >> exception @ swift-int.k, line: 511 >> Caused by: The following output files were not created by the >> application: out.txt >> throw @ swift-int.k, line: 76 >> >> k:assign @ swift.k, line: 194 >> Caused by: Exception in sh: >> Arguments: [-c, mkdir -p out; echo a >out/a0001.txt; echo a >> >out/a0002.txt; echo I am stdout] >> Host: localhost >> Directory: amap01-run010/jobs/4/sh-4ipn1brl >> stderr.txt: >> stdout.txt: I am stdout >> exception @ swift-int.k, line: 511 >> Caused by: The following output files were not created by the >> application: out.txt >> throw @ swift-int.k, line: 76 >> >> swift$ cat amap02.swift >> type file; >> >> app (file a, file o) retarr() >> { >> sh "-c" "mkdir -p out; echo a >out/a0001.txt; echo a >out/a0002.txt; >> echo I am stdout" stdout=filename(o); >> } >> >> file a<"out/a0001.txt">; >> file o<"out.txt">; >> >> (a,o) = retarr(); >> swift$ rm -rf out >> swift$ swift amap02.swift >> Swift trunk swift-r7880 cog-r3907 >> RunID: run011 >> Progress: Thu, 29 May 2014 04:44:51+0000 >> Final status:Thu, 29 May 2014 04:44:51+0000 Finished successfully:1 >> swift$ ls out >> a0001.txt >> swift$ cat out.txt >> I am stdout >> swift$ cat amap03.swift >> type file; >> >> app (file a[]) retarr() >> { >> sh "-c" "mkdir -p out; echo a >out/a0001.txt; echo a >out/a0002.txt; 
>> echo I am stdout" ; >> } >> >> file a[]; >> >> a = retarr(); >> swift$ rm -rf out >> swift$ swift amap03.swift >> Swift trunk swift-r7880 cog-r3907 >> RunID: run012 >> Progress: Thu, 29 May 2014 04:45:34+0000 >> Final status:Thu, 29 May 2014 04:45:34+0000 Finished successfully:1 >> swift$ ls out >> a0001.txt a0002.txt >> swift$ >> >> >> >> On 5/28/14, 11:09 PM, Mihael Hategan wrote: >>> Hi Mike and Jonathan, >>> >>> I'm afraid I need more context. I am unable to reproduce the issue, >>> although I did discover another bug in the way FixedArrayMapper parses >>> strings. >>> >>> Can you please tell me what array mapper we are talking about, what the >>> app signature is, what provider you are using, and whether provider >>> staging is enabled or not. >>> >>> Better yet, please send me the swift log if you can. >>> >>> Mihael >>> >>> On Wed, 2014-05-28 at 18:22 -0500, Michael Wilde wrote: >>>> On 5/28/14, 5:01 PM, Jonathan Ozik wrote: >>>>> Hi all, >>>>> >>>>> Has anyone run into issues when trying to direct stdout and stderr to output files and getting the error: >>>>> Caused by: The following output files were not created by the application: sim.out, err.out >>>> This seems to be caused by an error in Swift (based on investigating >>>> with Jonathan off-list). >>>> >>>> The bug is that if an app returns an array of files, then any non-array >>>> returns from the app are not getting handled correctly. >>>> It seems to be a problem in the array-collecting code in the Swift >>>> app-launching wrapper, which may be loosing its directory context after >>>> returning from the collection function. >>>>> It looks like the files are being created too late and not caught when the output is being gathered up? This is using trunk, by the way. >>>>> >>>>> Also, in a related question, is there an easy way to redirect stdout and stderr from an app invocation to the regular swift.out without creating specific files? >>>> No, not at the moment. 
Some variations of this capability have been >>>> discussed, though. It seems to me to be a useful addition to implement. >>>> >>>> - Mike >>>>> Jonathan >>>>> >>>>> _______________________________________________ >>>>> Swift-user mailing list >>>>> Swift-user at ci.uchicago.edu >>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > -- Michael Wilde Mathematics and Computer Science Computation Institute Argonne National Laboratory The University of Chicago From jozik at uchicago.edu Thu May 29 09:51:46 2014 From: jozik at uchicago.edu (Jonathan Ozik) Date: Thu, 29 May 2014 09:51:46 -0500 Subject: [Swift-user] Known issue with stdout and stderr file delays? In-Reply-To: <53874846.4000802@anl.gov> References: <3A043140-2D21-4E23-891F-A4103B1712F9@uchicago.edu> <53866F9E.8020206@anl.gov> <1401336556.17821.2.camel@echo> <5386BC9C.8030308@anl.gov> <1401355585.18136.4.camel@echo> <53874846.4000802@anl.gov> Message-ID: Thank you Mihael. These fixes will be very helpful. Jonathan On May 29, 2014, at 9:46 AM, Michael Wilde wrote: > Thanks, Mihael. My tests for the two bugs related to app array return mapping now work, on trunk. > > 1) returning scalar files along with array returns from the same app > 2) returning output arrays mapped with simple_mapper without location= argument > > Yadu, these tests are on midway in ~wilde/swift/lab/arrmapbug/*swift > > - Mike > > > On 5/29/14, 4:26 AM, Mihael Hategan wrote: >> Thanks. This was only an issue without provider staging. >> >> Anyway, the problem was that the dynamic data collect code was >> overwriting a variable used to point to the directory where the app >> output files were. Silly shell scoping rules (or the lack thereof), and >> silly me for not seeing it. >> >> Fix committed to trunk. >> >> Mihael >> >> On Wed, 2014-05-28 at 23:50 -0500, Michael Wilde wrote: >>> Here's an example of the problem (what looks to me like a Swift bug in >>> the new array return feature). 
>>> >>> amap01.swift tries to return a file array and a single file (stdout). >>> Swift is unable to find the mapped stdout file, even though it exists. >>> >>> amap02.swift returns just the stdout file (to show that this part of the >>> test script works). >>> >>> amap03.swift returns just the array. It, too, works. >>> >>> But returning both together fails, suggesting that the swiftwrap code >>> that collects the array prevents any other non-array files from being >>> returned. >>> >>> Tests of Jonathan'sscript indicated that it didnt matter of the scalar >>> file being returned was stdout/err, or just a plain file. If an array is >>> returned, nothing else seems to be getting returned correctly. >>> >>> Here's the tests: >>> >>> swift$ cat amap01.swift >>> type file; >>> >>> app (file a[], file o) retarr() >>> { >>> sh "-c" "mkdir -p out; echo a >out/a0001.txt; echo a >out/a0002.txt; >>> echo I am stdout" stdout=filename(o); >>> } >>> >>> file a[]; >>> file o<"out.txt">; >>> >>> (a,o) = retarr(); >>> swift$ rm -rf out >>> swift$ swift amap01.swift >>> Swift trunk swift-r7880 cog-r3907 >>> RunID: run010 >>> Progress: Thu, 29 May 2014 04:44:25+0000 >>> >>> Execution failed: >>> Exception in sh: >>> Arguments: [-c, mkdir -p out; echo a >out/a0001.txt; echo a >>> >out/a0002.txt; echo I am stdout] >>> Host: localhost >>> Directory: amap01-run010/jobs/4/sh-4ipn1brl >>> stderr.txt: >>> stdout.txt: I am stdout >>> exception @ swift-int.k, line: 511 >>> Caused by: The following output files were not created by the >>> application: out.txt >>> throw @ swift-int.k, line: 76 >>> >>> k:assign @ swift.k, line: 194 >>> Caused by: Exception in sh: >>> Arguments: [-c, mkdir -p out; echo a >out/a0001.txt; echo a >>> >out/a0002.txt; echo I am stdout] >>> Host: localhost >>> Directory: amap01-run010/jobs/4/sh-4ipn1brl >>> stderr.txt: >>> stdout.txt: I am stdout >>> exception @ swift-int.k, line: 511 >>> Caused by: The following output files were not created by the >>> application: 
out.txt >>> throw @ swift-int.k, line: 76 >>> >>> swift$ cat amap02.swift >>> type file; >>> >>> app (file a, file o) retarr() >>> { >>> sh "-c" "mkdir -p out; echo a >out/a0001.txt; echo a >out/a0002.txt; >>> echo I am stdout" stdout=filename(o); >>> } >>> >>> file a<"out/a0001.txt">; >>> file o<"out.txt">; >>> >>> (a,o) = retarr(); >>> swift$ rm -rf out >>> swift$ swift amap02.swift >>> Swift trunk swift-r7880 cog-r3907 >>> RunID: run011 >>> Progress: Thu, 29 May 2014 04:44:51+0000 >>> Final status:Thu, 29 May 2014 04:44:51+0000 Finished successfully:1 >>> swift$ ls out >>> a0001.txt >>> swift$ cat out.txt >>> I am stdout >>> swift$ cat amap03.swift >>> type file; >>> >>> app (file a[]) retarr() >>> { >>> sh "-c" "mkdir -p out; echo a >out/a0001.txt; echo a >out/a0002.txt; >>> echo I am stdout" ; >>> } >>> >>> file a[]; >>> >>> a = retarr(); >>> swift$ rm -rf out >>> swift$ swift amap03.swift >>> Swift trunk swift-r7880 cog-r3907 >>> RunID: run012 >>> Progress: Thu, 29 May 2014 04:45:34+0000 >>> Final status:Thu, 29 May 2014 04:45:34+0000 Finished successfully:1 >>> swift$ ls out >>> a0001.txt a0002.txt >>> swift$ >>> >>> >>> >>> On 5/28/14, 11:09 PM, Mihael Hategan wrote: >>>> Hi Mike and Jonathan, >>>> >>>> I'm afraid I need more context. I am unable to reproduce the issue, >>>> although I did discover another bug in the way FixedArrayMapper parses >>>> strings. >>>> >>>> Can you please tell me what array mapper we are talking about, what the >>>> app signature is, what provider you are using, and whether provider >>>> staging is enabled or not. >>>> >>>> Better yet, please send me the swift log if you can. 
>>>> >>>> Mihael >>>> >>>> On Wed, 2014-05-28 at 18:22 -0500, Michael Wilde wrote: >>>>> On 5/28/14, 5:01 PM, Jonathan Ozik wrote: >>>>>> Hi all, >>>>>> >>>>>> Has anyone run into issues when trying to direct stdout and stderr to output files and getting the error: >>>>>> Caused by: The following output files were not created by the application: sim.out, err.out >>>>> This seems to be caused by an error in Swift (based on investigating >>>>> with Jonathan off-list). >>>>> >>>>> The bug is that if an app returns an array of files, then any non-array >>>>> returns from the app are not getting handled correctly. >>>>> It seems to be a problem in the array-collecting code in the Swift >>>>> app-launching wrapper, which may be loosing its directory context after >>>>> returning from the collection function. >>>>>> It looks like the files are being created too late and not caught when the output is being gathered up? This is using trunk, by the way. >>>>>> >>>>>> Also, in a related question, is there an easy way to redirect stdout and stderr from an app invocation to the regular swift.out without creating specific files? >>>>> No, not at the moment. Some variations of this capability have been >>>>> discussed, though. It seems to me to be a useful addition to implement. >>>>> >>>>> - Mike >>>>>> Jonathan >>>>>> >>>>>> _______________________________________________ >>>>>> Swift-user mailing list >>>>>> Swift-user at ci.uchicago.edu >>>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user >> > > -- > Michael Wilde > Mathematics and Computer Science Computation Institute > Argonne National Laboratory The University of Chicago > From jozik at uchicago.edu Thu May 29 10:24:03 2014 From: jozik at uchicago.edu (Jonathan Ozik) Date: Thu, 29 May 2014 10:24:03 -0500 Subject: [Swift-user] Concurrent mappers Message-ID: I'm trying to use concurrent mappers to let swift automatically avoid file name conflicts. 
After the run is complete, it looks like those files are not retained, or at least do not appear in the location that I've specified. I see a _concurrent directory but it is empty. If I specify a directory as part of the prefix (e.g., ;) I see an "out" directory inside the _concurrent directory but that directory is empty. Are these files meant to only be temporary? Jonathan From wilde at anl.gov Thu May 29 10:42:28 2014 From: wilde at anl.gov (Michael Wilde) Date: Thu, 29 May 2014 10:42:28 -0500 Subject: [Swift-user] Concurrent mappers In-Reply-To: References: Message-ID: <53875564.3080004@anl.gov> Jonathan, the files created by the concurrent mapper default to being temporary. To make them persist after the swift command terminates, add this to swift.properties: file.gc.enabled=false (This needs to be added to the User Guide. It's described at the moment only in etc/swift.properties): # Files mapped by the concurrent mapper (i.e. when you don't # explicitly specify a mapper) are deleted when they are not # in use any more. This property can be used to prevent # files mapped by the concurrent mapper from being deleted. # # Format: # file.gc.enabled=(true|false) # # Default: true # - Mike On 5/29/14, 10:24 AM, Jonathan Ozik wrote: > I'm trying to use concurrent mappers to let swift automatically avoid file name conflicts. After the run is complete, it looks like those files are not retained, or at least do not appear in the location that I've specified. I see a _concurrent directory but it is empty. If I specify a directory as part of the prefix (e.g., ;) I see an "out" directory inside the _concurrent directory but that directory is empty. > Are these files meant to only be temporary?
> > Jonathan > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user -- Michael Wilde Mathematics and Computer Science Computation Institute Argonne National Laboratory The University of Chicago From jozik at uchicago.edu Thu May 29 11:05:16 2014 From: jozik at uchicago.edu (Jonathan Ozik) Date: Thu, 29 May 2014 11:05:16 -0500 Subject: [Swift-user] Concurrent mappers In-Reply-To: <53875564.3080004@anl.gov> References: <53875564.3080004@anl.gov> Message-ID: <4E0DCF8E-81A7-4A54-93A7-4F56CDC6A8AE@uchicago.edu> Mike, I'm getting: Error: Unknown property file.gc.enabled when I add that to my swift.properties file. This is using trunk. Jonathan On May 29, 2014, at 10:42 AM, Michael Wilde wrote: > Jonathan, the files created by the concurrent mapper default to being > temporary. > > To make them persist after the swift command terminates, add this to > swift.properties: > > file.gc.enabled=false > > (This needs to be added to the User Guide. Its desrcibed at the moment > only in etc/swift.properties): > > # Files mapped by the concurrent mapper (i.e. when you don't > # explicitly specify a mapper) are deleted when they are not > # in use any more. This property can be used to prevent > # files mapped by the concurrent mapper from being deleted. > # > # Format: > # file.gc.enabled=(true|false) > # > # Default: true > # > > - Mike > > > On 5/29/14, 10:24 AM, Jonathan Ozik wrote: >> I'm trying to use concurrent mappers to let swift automatically avoid file name conflicts. After the run is complete, it looks like those files are not retained, or at least do not appear in the location that I've specified. I see a _concurrent directory but it is empty. If I specify a directory as part of the prefix (e.g., ;) I see an "out" directory inside the _concurrent directory but that directory is empty. >> Are these files meant to only be temporary? 
>> >> Jonathan >> _______________________________________________ >> Swift-user mailing list >> Swift-user at ci.uchicago.edu >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > -- > Michael Wilde > Mathematics and Computer Science Computation Institute > Argonne National Laboratory The University of Chicago > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user From davidkelly at uchicago.edu Thu May 29 11:48:30 2014 From: davidkelly at uchicago.edu (David Kelly) Date: Thu, 29 May 2014 11:48:30 -0500 Subject: [Swift-user] Concurrent mappers In-Reply-To: <4E0DCF8E-81A7-4A54-93A7-4F56CDC6A8AE@uchicago.edu> References: <53875564.3080004@anl.gov> <4E0DCF8E-81A7-4A54-93A7-4F56CDC6A8AE@uchicago.edu> Message-ID: Hi Jonathan, I've added file.gc.enabled as an allowable property in the new config mechanism in trunk, and I've added a description to the trunk userguide (should be on the website within an hour or so). Thanks, David On Thu, May 29, 2014 at 11:05 AM, Jonathan Ozik wrote: > Mike, > > I'm getting: > Error: Unknown property file.gc.enabled > > when I add that to my swift.properties file. > > This is using trunk. > > Jonathan > > On May 29, 2014, at 10:42 AM, Michael Wilde wrote: > > > Jonathan, the files created by the concurrent mapper default to being > > temporary. > > > > To make them persist after the swift command terminates, add this to > > swift.properties: > > > > file.gc.enabled=false > > > > (This needs to be added to the User Guide. Its desrcibed at the moment > > only in etc/swift.properties): > > > > # Files mapped by the concurrent mapper (i.e. when you don't > > # explicitly specify a mapper) are deleted when they are not > > # in use any more. This property can be used to prevent > > # files mapped by the concurrent mapper from being deleted. 
> > # > > # Format: > > # file.gc.enabled=(true|false) > > # > > # Default: true > > # > > > > - Mike > > > > > > On 5/29/14, 10:24 AM, Jonathan Ozik wrote: > >> I'm trying to use concurrent mappers to let swift automatically avoid > file name conflicts. After the run is complete, it looks like those files > are not retained, or at least do not appear in the location that I've > specified. I see a _concurrent directory but it is empty. If I specify a > directory as part of the prefix (e.g., > ;) I see an "out" > directory inside the _concurrent directory but that directory is empty. > >> Are these files meant to only be temporary? > >> > >> Jonathan > >> _______________________________________________ > >> Swift-user mailing list > >> Swift-user at ci.uchicago.edu > >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > > > -- > > Michael Wilde > > Mathematics and Computer Science Computation Institute > > Argonne National Laboratory The University of Chicago > > > > _______________________________________________ > > Swift-user mailing list > > Swift-user at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From bronevetsky1 at llnl.gov Thu May 29 17:48:31 2014 From: bronevetsky1 at llnl.gov (Bronevetsky, Greg) Date: Thu, 29 May 2014 22:48:31 +0000 Subject: [Swift-user] Data transfer error In-Reply-To: <1401320738.12733.3.camel@echo> References: <8635C0D1735D2C4BA6E571FD97486FF179926169@PRDEXMBX-05.the-lab.llnl.gov> <1400706574.25214.3.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF17992DC59@PRDEXMBX-05.the-lab.llnl.gov> <1400743356.29694.6.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF179933F77@PRDEXMBX-05.the-lab.llnl.gov> <1400872636.18856.5.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF1799377AB@PRDEXMBX-05.the-lab.llnl.gov> <1400876593.19522.2.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF179938553@PRDEXMBX-05.the-lab.llnl.gov> <1401234364.13640.7.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF179938CB9@PRDEXMBX-05.the-lab.llnl.gov> <1401320738.12733.3.camel@echo> Message-ID: <8635C0D1735D2C4BA6E571FD97486FF17993A5E3@PRDEXMBX-05.the-lab.llnl.gov> I've finally managed to create a reproducer for my problem. I've attached the problematic Swift script and the app script that it calls. The Swift script is a 2-level reduction tree with radix 40. It iterates 40 times and for each iteration performs 40 inner iterations, in which it calls the app to generate an output file and merges these files. The files produced by all the inner iterations are subsequently merged to produce a single file. The app creates 10 temporary directories and 10 temporary files in each directory, but the only output it emits to Swift is what it writes to stdout. Below is the screen output emitted by Swift when the workflow is executed on a single 12-core node, executed from a directory on our Lustre scratch file system and also using this file system as work storage ( in sites file). As you can see, after it finishes processing each task Swift tries to stage them out and on some of them it encounters an error, which causes these tasks to go into the "Failed but can retry" status.
If I reduce the workload below 40x40 tasks and 10x10 files this does not happen, so the issue looks to be related to the amount of stress I put on the file system. I can cause the same behavior to occur on our NFS file system if I increase the size of the workload. I've attached the logs that the run produced in my .globus/coasters and .globus/scripts directories. The stderr.txt files in my jobs/*/* directories were empty and the wrapper.log files contained pretty similar text such as: checking for paramfile no paramfile: using command line arguments Progress 2014-05-29 15:29:52.801545566-0700 LOG_START _____________________________________________________________________________ Wrapper (_swiftwrap.staging) _____________________________________________________________________________ /g/g15/bronevet/apps/swift-0.94.1/examples/test/fileGenTest.py -out file.1.1 -err stderr.txt -i -d -if -of file.1.1 -k -cdmfile -status provider -a 1 PWD=/p/lscratche/bronevet/swift_work/testSwiftErrors-20140529-1528-cv85utlb/jobs/o/fileGenTest-o7b89crl EXEC=/g/g15/bronevet/apps/swift-0.94.1/examples/test/fileGenTest.py STDIN= STDOUT=file.1.1 STDERR=stderr.txt DIRS= INF= OUTF=file.1.1 KICKSTART= ARGS=1 ARGC=1 Progress 2014-05-29 15:29:52.806613446-0700 CREATE_INPUTDIR Progress 2014-05-29 15:29:52.809312133-0700 EXECUTE Please let me know if you need any additional info.
Greg Bronevetsky Lawrence Livermore National Lab (925) 424-5756 bronevetsky at llnl.gov http://greg.bronevetsky.com Swift 0.94.1 swift-r7114 cog-r3803 RunID: 20140529-1528-cv85utlb Progress: time: Thu, 29 May 2014 15:28:51 -0700 Progress: time: Thu, 29 May 2014 15:28:53 -0700 Selecting site:50 Submitting:400 Submitted:1 Progress: time: Thu, 29 May 2014 15:29:21 -0700 Selecting site:50 Submitted:401 Progress: time: Thu, 29 May 2014 15:29:51 -0700 Selecting site:50 Stage in:1 Submitted:400 Progress: time: Thu, 29 May 2014 15:29:52 -0700 Selecting site:50 Stage in:8 Submitted:389 Active:4 Progress: time: Thu, 29 May 2014 15:30:21 -0700 Selecting site:50 Submitted:389 Active:12 Progress: time: Thu, 29 May 2014 15:30:51 -0700 Selecting site:50 Submitted:389 Active:12 Progress: time: Thu, 29 May 2014 15:31:21 -0700 Selecting site:50 Submitted:389 Active:12 Progress: time: Thu, 29 May 2014 15:31:51 -0700 Selecting site:50 Submitted:389 Active:12 Progress: time: Thu, 29 May 2014 15:32:21 -0700 Selecting site:50 Submitted:389 Active:12 Progress: time: Thu, 29 May 2014 15:32:51 -0700 Selecting site:50 Submitted:389 Active:12 Progress: time: Thu, 29 May 2014 15:33:21 -0700 Selecting site:50 Submitted:389 Active:12 Progress: time: Thu, 29 May 2014 15:33:51 -0700 Selecting site:50 Submitted:389 Active:12 Progress: time: Thu, 29 May 2014 15:34:21 -0700 Selecting site:50 Submitted:389 Active:12 Progress: time: Thu, 29 May 2014 15:34:28 -0700 Selecting site:50 Submitted:389 Active:11 Stage out:1 Progress: time: Thu, 29 May 2014 15:34:51 -0700 Selecting site:39 Submitted:389 Active:12 Failed but can retry:11 Progress: time: Thu, 29 May 2014 15:35:21 -0700 Selecting site:39 Submitted:389 Active:12 Failed but can retry:11 Progress: time: Thu, 29 May 2014 15:35:51 -0700 Selecting site:39 Submitted:389 Active:12 Failed but can retry:11 Progress: time: Thu, 29 May 2014 15:36:21 -0700 Selecting site:39 Submitted:389 Active:12 Failed but can retry:11 Progress: time: Thu, 29 May 2014 
15:36:51 -0700 Selecting site:39 Submitted:389 Active:12 Failed but can retry:11 Progress: time: Thu, 29 May 2014 15:37:21 -0700 Selecting site:39 Submitted:389 Active:12 Failed but can retry:11 Progress: time: Thu, 29 May 2014 15:37:51 -0700 Selecting site:39 Submitted:389 Active:12 Failed but can retry:11 Progress: time: Thu, 29 May 2014 15:38:21 -0700 Selecting site:39 Submitted:389 Active:12 Failed but can retry:11 Progress: time: Thu, 29 May 2014 15:38:51 -0700 Selecting site:39 Submitted:389 Active:12 Failed but can retry:11 Progress: time: Thu, 29 May 2014 15:39:21 -0700 Selecting site:39 Submitted:389 Active:12 Failed but can retry:11 Progress: time: Thu, 29 May 2014 15:39:25 -0700 Selecting site:39 Submitted:389 Active:11 Stage out:1 Failed but can retry:11 Progress: time: Thu, 29 May 2014 15:39:27 -0700 Selecting site:39 Submitted:389 Active:9 Stage out:3 Failed but can retry:11 Progress: time: Thu, 29 May 2014 15:39:51 -0700 Selecting site:39 Submitted:389 Active:8 Stage out:4 Failed but can retry:11 Progress: time: Thu, 29 May 2014 15:39:55 -0700 Selecting site:39 Submitted:389 Active:7 Stage out:5 Failed but can retry:11 Progress: time: Thu, 29 May 2014 15:39:57 -0700 Selecting site:39 Submitted:389 Active:4 Stage out:8 Failed but can retry:11 Progress: time: Thu, 29 May 2014 15:39:58 -0700 Selecting site:29 Stage in:3 Submitted:398 Failed but can retry:21 Progress: time: Thu, 29 May 2014 15:40:04 -0700 Selecting site:29 Stage in:5 Submitted:396 Failed but can retry:21 Progress: time: Thu, 29 May 2014 15:40:08 -0700 Selecting site:29 Stage in:8 Submitted:393 Failed but can retry:21 Progress: time: Thu, 29 May 2014 15:40:21 -0700 Selecting site:29 Submitted:389 Active:12 Failed but can retry:21 Progress: time: Thu, 29 May 2014 15:40:51 -0700 Selecting site:29 Submitted:389 Active:12 Failed but can retry:21 Progress: time: Thu, 29 May 2014 15:41:21 -0700 Selecting site:29 Submitted:389 Active:12 Failed but can retry:21 Progress: time: Thu, 29 May 2014 
15:41:51 -0700 Selecting site:29 Submitted:389 Active:12 Failed but can retry:21 Progress: time: Thu, 29 May 2014 15:42:21 -0700 Selecting site:29 Submitted:389 Active:12 Failed but can retry:21 Progress: time: Thu, 29 May 2014 15:42:51 -0700 Selecting site:29 Submitted:389 Active:12 Failed but can retry:21 Progress: time: Thu, 29 May 2014 15:43:21 -0700 Selecting site:29 Submitted:389 Active:12 Failed but can retry:21 Progress: time: Thu, 29 May 2014 15:43:51 -0700 Selecting site:29 Submitted:389 Active:12 Failed but can retry:21 -----Original Message----- From: Mihael Hategan [mailto:hategan at mcs.anl.gov] Sent: Wednesday, May 28, 2014 4:46 PM To: Bronevetsky, Greg Cc: swift-user at ci.uchicago.edu Subject: Re: [Swift-user] Data transfer error On Wed, 2014-05-28 at 22:54 +0000, Bronevetsky, Greg wrote: > Mihael, I ran a few more experiments where I ran a workflow on a > single cluster node while monitoring its memory use but I didn't see > any issues with it running out of memory since at all times > /proc/meminfo reported 22GB out of 24GB free. The error you were getting previously seemed to indicate that you were running out of *disk* space somewhere, probably on the ramdisk. So maybe the output of 'df' would be better than /proc/meminfo > I've now begun a more focused analysis where I have a simple script > that captures the high-level structure of my real script. It first > generates a bunch of files, producing additional temporary files and > the directories along with the main output file. These files are then > reduced using a reduction tree based on the example you sent me. I > have not yet gotten the simple script to fail in the same way as the > main script but I've noticed a few oddities. 
> > First, although my sites file has key="stagingMethod">file and my cf file has > use.provider.staging=true, I see that all the intermediate files > produced by my tasks are written to the global file system specified > in the sites file as > /p/lscratche/bronevet/swift_work. How > do I force Swift to use node-local storage for this data? You would have to change to a node-local location. > > Second, when I run as many processes on the one node as there are > cores, the script runs but it keeps stalling. As you can see below, it > processes tasks in batches of 12. However, after a few batches the job > is aborted (~6 mins into a 30 min allocation) even though the node > appears healthy and does not run out of memory and Swift submits a new > job into the batch queue. Why does this happen? Are you specifying a max walltime for the apps? If not, swift assumes 10 minutes. If the first few batches take 21 minutes, and the worker has 30 minutes allocated, it won't be able to fit any other jobs after that. Mihael -------------- next part -------------- A non-text attachment was scrubbed... Name: fileGenTest.py Type: application/octet-stream Size: 285 bytes Desc: fileGenTest.py URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: testSwiftErrors.swift Type: application/octet-stream Size: 658 bytes Desc: testSwiftErrors.swift URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: worker-0529-2803530-000000.log.bz2 Type: application/octet-stream Size: 313974 bytes Desc: worker-0529-2803530-000000.log.bz2 URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: PBS1386040007442248971.submit.exitcode Type: application/octet-stream Size: 4 bytes Desc: PBS1386040007442248971.submit.exitcode URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: PBS1386040007442248971.submit.stderr Type: application/octet-stream Size: 115 bytes Desc: PBS1386040007442248971.submit.stderr URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: PBS1386040007442248971.submit.stdout Type: application/octet-stream Size: 69 bytes Desc: PBS1386040007442248971.submit.stdout URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: PBS1386040007442248971.submit Type: application/octet-stream Size: 797 bytes Desc: PBS1386040007442248971.submit URL: From hategan at mcs.anl.gov Thu May 29 19:55:37 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 29 May 2014 17:55:37 -0700 Subject: [Swift-user] Data transfer error In-Reply-To: <8635C0D1735D2C4BA6E571FD97486FF17993A5E3@PRDEXMBX-05.the-lab.llnl.gov> References: <8635C0D1735D2C4BA6E571FD97486FF179926169@PRDEXMBX-05.the-lab.llnl.gov> <1400706574.25214.3.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF17992DC59@PRDEXMBX-05.the-lab.llnl.gov> <1400743356.29694.6.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF179933F77@PRDEXMBX-05.the-lab.llnl.gov> <1400872636.18856.5.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF1799377AB@PRDEXMBX-05.the-lab.llnl.gov> <1400876593.19522.2.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF179938553@PRDEXMBX-05.the-lab.llnl.gov> <1401234364.13640.7.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF179938CB9@PRDEXMBX-05.the-lab.llnl.gov> <1401320738.12733.3.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF17993A5E3@PRDEXMBX-05.the-lab.llnl.gov> Message-ID: <1401411337.31590.8.camel@echo> Hi Greg, I think the swift log (that thing in the directory where you invoke swift called *.log) would contain all the relevant information here. 
In any event, it is also there in the worker log: 2014/05/29 15:44:08.741 DEBUG 000000 Checking jobs status (12 active) 2014/05/29 15:44:08.741 DEBUG 000000 1401402532522 Checking pid 24049 2014/05/29 15:44:08.741 DEBUG 000000 1401402532522 walltime exceeded (start: 1401403208.70666, now: 1401403448.74189, maxwalltime: 240); killing ... I believe that what is happening is that as you increase the load, I/O operations on shared disks become slower and slower to the extent that the app walltimes become greater than what you get in small runs and what you have as maxwalltime in sites.xml. The fact that it happens on both Lustre and NFS seems to support this theory. This can be checked by slowly increasing the maxwalltime in sites.xml. If it is not accompanied by a corresponding increase in the scale at which you can run without failure, then we should probably look somewhere else. Mihael On Thu, 2014-05-29 at 22:48 +0000, Bronevetsky, Greg wrote: > I've finally managed to create a reproducer for my problem. [...] From gmgall at lncc.br Fri May 30 09:17:59 2014 From: gmgall at lncc.br (Guilherme Gall) Date: Fri, 30 May 2014 11:17:59 -0300 Subject: [Swift-user] Java version to build and run Swift Message-ID: <53889317.7070708@lncc.br> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Hi all, What version of Java should I use to build and run Swift trunk? I'm getting a NullPointerException trying to run even the simplest script in the examples directory (hello.swift).
Thanks in advance, - -- Guilherme Gall CSR/LNCC GPG Public Key ID: A65ED0D5 -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.14 (GNU/Linux) Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/ iQEcBAEBAgAGBQJTiJMXAAoJEG9WBlOmXtDVYVEH/0KJVXA+u3RZ+j3ATbJlC24e IXeKgOg6apxmPec/w41YvkuUSbWZRaSdpCw+GDMhk/NvgZxLHhbySIRlpsd/94OP zt8vsGWiCMAZNU3Jnol5QEWupWwq3cBEV7Z9wMjeumCm77owdowNSdLnTZYjdtKa tlx3DvnDci+lfJIHXqxNFV+nOcHeiSX2KL2igM6DMFs531D7fgps9tZHZJis87jF cRlPwbFaLjeaxi17GKBGoEduonFlKfT3qkZbQB8yCQPDaA/FufjpvjRHydSmW9kg Oz9+QxFr03hSSbxEQ1+dSRDq6oRyAlKxXK33ZUyGDputMcKOwn928737WbCYhz8= =3ory -----END PGP SIGNATURE----- From jozik at uchicago.edu Fri May 30 09:43:35 2014 From: jozik at uchicago.edu (Jonathan Ozik) Date: Fri, 30 May 2014 09:43:35 -0500 Subject: [Swift-user] Concurrent mappers In-Reply-To: References: <53875564.3080004@anl.gov> <4E0DCF8E-81A7-4A54-93A7-4F56CDC6A8AE@uchicago.edu> Message-ID: <75F004C3-8002-4647-8C71-96ABC43C4061@uchicago.edu> David, Thank you! It's been a great help so far. Jonathan On May 29, 2014, at 11:48 AM, David Kelly wrote: > Hi Jonathan, > > I've added file.gc.enabled as an allowable property in the new config mechanism in trunk, and I've added a description to the trunk userguide (should be on the website within an hour or so). > > Thanks, > David > > > On Thu, May 29, 2014 at 11:05 AM, Jonathan Ozik wrote: > Mike, > > I'm getting: > Error: Unknown property file.gc.enabled > > when I add that to my swift.properties file. > > This is using trunk. > > Jonathan > > On May 29, 2014, at 10:42 AM, Michael Wilde wrote: > > > Jonathan, the files created by the concurrent mapper default to being > > temporary. > > > > To make them persist after the swift command terminates, add this to > > swift.properties: > > > > file.gc.enabled=false > > > > (This needs to be added to the User Guide. Its desrcibed at the moment > > only in etc/swift.properties): > > > > # Files mapped by the concurrent mapper (i.e. 
when you don't > > # explicitly specify a mapper) are deleted when they are not > > # in use any more. This property can be used to prevent > > # files mapped by the concurrent mapper from being deleted. > > # > > # Format: > > # file.gc.enabled=(true|false) > > # > > # Default: true > > # > > > > - Mike > > > > > > On 5/29/14, 10:24 AM, Jonathan Ozik wrote: > >> I'm trying to use concurrent mappers to let swift automatically avoid file name conflicts. After the run is complete, it looks like those files are not retained, or at least do not appear in the location that I've specified. I see a _concurrent directory but it is empty. If I specify a directory as part of the prefix (e.g., ;) I see an "out" directory inside the _concurrent directory but that directory is empty. > >> Are these files meant to only be temporary? > >> > >> Jonathan > >> _______________________________________________ > >> Swift-user mailing list > >> Swift-user at ci.uchicago.edu > >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > > > -- > > Michael Wilde > > Mathematics and Computer Science Computation Institute > > Argonne National Laboratory The University of Chicago > > > > _______________________________________________ > > Swift-user mailing list > > Swift-user at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From bronevetsky1 at llnl.gov Fri May 30 11:24:20 2014 From: bronevetsky1 at llnl.gov (Bronevetsky, Greg) Date: Fri, 30 May 2014 16:24:20 +0000 Subject: [Swift-user] Data transfer error In-Reply-To: <1401411337.31590.8.camel@echo> References: <8635C0D1735D2C4BA6E571FD97486FF179926169@PRDEXMBX-05.the-lab.llnl.gov> <1400706574.25214.3.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF17992DC59@PRDEXMBX-05.the-lab.llnl.gov> <1400743356.29694.6.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF179933F77@PRDEXMBX-05.the-lab.llnl.gov> <1400872636.18856.5.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF1799377AB@PRDEXMBX-05.the-lab.llnl.gov> <1400876593.19522.2.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF179938553@PRDEXMBX-05.the-lab.llnl.gov> <1401234364.13640.7.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF179938CB9@PRDEXMBX-05.the-lab.llnl.gov> <1401320738.12733.3.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF17993A5E3@PRDEXMBX-05.the-lab.llnl.gov> <1401411337.31590.8.camel@echo> Message-ID: <8635C0D1735D2C4BA6E571FD97486FF17993A947@PRDEXMBX-05.the-lab.llnl.gov> I just ran a test where I varied between 1 and 10 minutes. At 1 minute it gave me errors and for larger values it did not. So, assuming that this is the true root cause, how can I resolve it? I can use node-local storage as my . However, when I run my real workload, I'm still getting errors even if I use node-local storage. I'm still following up with our file systems folks but the key issue appears to be the large number of metadata operations that are sent to the shared file system (Lustre or NFS here). Is there a way to reduce that or at least measure it so that I can tell our admins exactly the throughput I need?
Greg Bronevetsky Lawrence Livermore National Lab (925) 424-5756 bronevetsky at llnl.gov http://greg.bronevetsky.com -----Original Message----- From: Mihael Hategan [mailto:hategan at mcs.anl.gov] Sent: Thursday, May 29, 2014 5:56 PM To: Bronevetsky, Greg Cc: swift-user at ci.uchicago.edu Subject: Re: [Swift-user] Data transfer error Hi Greg, I think the swift log (that thing in the directory where you invoke swift called *.log) would contain all the relevant information here. In any event, it is also there in the worker log: 2014/05/29 15:44:08.741 DEBUG 000000 Checking jobs status (12 active) 2014/05/29 15:44:08.741 DEBUG 000000 1401402532522 Checking pid 24049 2014/05/29 15:44:08.741 DEBUG 000000 1401402532522 walltime exceeded (start: 1401403208.70666, now: 1401403448.74189, maxwalltime: 240); killing ... I believe that what is happening is that as you increase the load, I/O operations on shared disks become slower and slower to the extent that the app walltimes become greater than what you get in small runs and what you have as maxwalltime in sites.xml. The fact that it happens on both Lustre and NFS seems to support this theory. This can be checked by slowly increasing the maxwalltime in sites.xml. If it is not accompanied by a corresponding increase in the scale at which you can run without failure, then we should probably look somewhere else. Mihael On Thu, 2014-05-29 at 22:48 +0000, Bronevetsky, Greg wrote: > I've finally managed to create a reproducer for my problem. [...] From ketan at mcs.anl.gov Fri May 30 11:47:33 2014 From: ketan at mcs.anl.gov (Ketan Maheshwari) Date: Fri, 30 May 2014 11:47:33 -0500 Subject: [Swift-user] Java version to build and run Swift In-Reply-To: <53889317.7070708@lncc.br> References: <53889317.7070708@lncc.br> Message-ID: Hi Guilherme, Swift should work with Sun/Oracle Java 1.6+. Which version of Java are you using? Can you send the command line that you are using to invoke swift?
Thanks, Ketan On Fri, May 30, 2014 at 9:17 AM, Guilherme Gall wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > Hi all, > > What version of Java should I use to build and run Swift trunk? > > I'm getting a NullPointerException trying to run even the simplest > script in the examples directory (hello.swift). > > Thanks in advance, > - -- > Guilherme Gall > CSR/LNCC > GPG Public Key ID: A65ED0D5 > -----BEGIN PGP SIGNATURE----- > Version: GnuPG v1.4.14 (GNU/Linux) > Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/ > > iQEcBAEBAgAGBQJTiJMXAAoJEG9WBlOmXtDVYVEH/0KJVXA+u3RZ+j3ATbJlC24e > IXeKgOg6apxmPec/w41YvkuUSbWZRaSdpCw+GDMhk/NvgZxLHhbySIRlpsd/94OP > zt8vsGWiCMAZNU3Jnol5QEWupWwq3cBEV7Z9wMjeumCm77owdowNSdLnTZYjdtKa > tlx3DvnDci+lfJIHXqxNFV+nOcHeiSX2KL2igM6DMFs531D7fgps9tZHZJis87jF > cRlPwbFaLjeaxi17GKBGoEduonFlKfT3qkZbQB8yCQPDaA/FufjpvjRHydSmW9kg > Oz9+QxFr03hSSbxEQ1+dSRDq6oRyAlKxXK33ZUyGDputMcKOwn928737WbCYhz8= > =3ory > -----END PGP SIGNATURE----- > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ketan at mcs.anl.gov Fri May 30 11:54:12 2014 From: ketan at mcs.anl.gov (Ketan Maheshwari) Date: Fri, 30 May 2014 11:54:12 -0500 Subject: [Swift-user] Java version to build and run Swift In-Reply-To: References: <53889317.7070708@lncc.br> Message-ID: PS: Just tested with Oracle java 1.7.07 on My Mac and a Ubuntu machine. Worked for me: $ ~/swift-devel/cog/modules/swift/dist/swift-svn/bin/swift ~/swift-devel/cog/modules/swift/examples/misc/hello.swift Swift trunk swift-r7889 cog-r3907 RunID: run001 Progress: Fri, 30 May 2014 11:52:48-0500 Final status:Fri, 30 May 2014 11:52:48-0500 Finished successfully:1 On Fri, May 30, 2014 at 11:47 AM, Ketan Maheshwari wrote: > Hi Guilherme, > > Swift should work with Sun/Oracle java 1.6+. 
Which version of java are you > using? Can you send the command line for swift invocation that you are > using. > > Thanks, > Ketan > > > On Fri, May 30, 2014 at 9:17 AM, Guilherme Gall wrote: > >> -----BEGIN PGP SIGNED MESSAGE----- >> Hash: SHA1 >> >> Hi all, >> >> What version of Java should I use to build and run Swift trunk? >> >> I'm getting a NullPointerException trying to run even the simplest >> script in the examples directory (hello.swift). >> >> Thanks in advance, >> - -- >> Guilherme Gall >> CSR/LNCC >> GPG Public Key ID: A65ED0D5 >> -----BEGIN PGP SIGNATURE----- >> Version: GnuPG v1.4.14 (GNU/Linux) >> Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/ >> >> iQEcBAEBAgAGBQJTiJMXAAoJEG9WBlOmXtDVYVEH/0KJVXA+u3RZ+j3ATbJlC24e >> IXeKgOg6apxmPec/w41YvkuUSbWZRaSdpCw+GDMhk/NvgZxLHhbySIRlpsd/94OP >> zt8vsGWiCMAZNU3Jnol5QEWupWwq3cBEV7Z9wMjeumCm77owdowNSdLnTZYjdtKa >> tlx3DvnDci+lfJIHXqxNFV+nOcHeiSX2KL2igM6DMFs531D7fgps9tZHZJis87jF >> cRlPwbFaLjeaxi17GKBGoEduonFlKfT3qkZbQB8yCQPDaA/FufjpvjRHydSmW9kg >> Oz9+QxFr03hSSbxEQ1+dSRDq6oRyAlKxXK33ZUyGDputMcKOwn928737WbCYhz8= >> =3ory >> -----END PGP SIGNATURE----- >> _______________________________________________ >> Swift-user mailing list >> Swift-user at ci.uchicago.edu >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Fri May 30 13:19:43 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 30 May 2014 11:19:43 -0700 Subject: [Swift-user] Java version to build and run Swift In-Reply-To: <53889317.7070708@lncc.br> References: <53889317.7070708@lncc.br> Message-ID: <1401473983.16971.7.camel@echo> On Fri, 2014-05-30 at 11:17 -0300, Guilherme Gall wrote: > Hi all, > > What version of Java should I use to build and run Swift trunk? Oracle is the gold standard. The IBM JDK should work. GCJ never did. The version should be >= 1.6. 
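For anyone who wants to check the ">= 1.6" requirement on their own machine, here is a minimal sketch in Python. It is not part of Swift, and the function names are made up for illustration. It assumes the usual `java -version` banner of the form `java version "1.7.0_07"`, and it treats newer banners such as `openjdk version "11.0.2"` as major version 11.

```python
import re


def java_major_version(version_output):
    """Return the Java major version (6, 7, 11, ...) found in
    `java -version` text, or None if no version string is present.
    (Hypothetical helper, not a Swift or JDK API.)"""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first = int(match.group(1))
    second = match.group(2)
    # Old-style strings look like "1.6.0_45": the major version is the
    # second component. New-style strings look like "11.0.2" or "17".
    if first == 1 and second is not None:
        return int(second)
    return first


def meets_swift_minimum(version_output, minimum=6):
    """True if the reported Java version is at least 1.<minimum>."""
    major = java_major_version(version_output)
    return major is not None and major >= minimum
```

Note that `java -version` writes its banner to stderr rather than stdout, so that is the stream to capture before passing the text to `meets_swift_minimum`. The check only covers the version number; it does not tell Oracle, IBM and GCJ runtimes apart.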
> > I'm getting a NullPointerException trying to run even the simplest > script in the examples directory (hello.swift). Trunk may have its moments of instability. However, I cannot reproduce your problem with trunk, so it might be helpful if you posted some more details, such as the log or a stack trace if that is not possible. Mihael From hategan at mcs.anl.gov Fri May 30 14:06:17 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 30 May 2014 12:06:17 -0700 Subject: [Swift-user] Data transfer error In-Reply-To: <8635C0D1735D2C4BA6E571FD97486FF17993A947@PRDEXMBX-05.the-lab.llnl.gov> References: <8635C0D1735D2C4BA6E571FD97486FF179926169@PRDEXMBX-05.the-lab.llnl.gov> <1400706574.25214.3.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF17992DC59@PRDEXMBX-05.the-lab.llnl.gov> <1400743356.29694.6.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF179933F77@PRDEXMBX-05.the-lab.llnl.gov> <1400872636.18856.5.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF1799377AB@PRDEXMBX-05.the-lab.llnl.gov> <1400876593.19522.2.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF179938553@PRDEXMBX-05.the-lab.llnl.gov> <1401234364.13640.7.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF179938CB9@PRDEXMBX-05.the-lab.llnl.gov> <1401320738.12733.3.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF17993A5E3@PRDEXMBX-05.the-lab.llnl.gov> <1401411337.31590.8.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF17993A947@PRDEXMBX-05.the-lab.llnl.gov> Message-ID: <1401476777.16971.20.camel@echo> On Fri, 2014-05-30 at 16:24 +0000, Bronevetsky, Greg wrote: > I just ran a test where I varied key="maxwalltime"> between 1 and 10 minutes. At 1 it gave me errors > and for larger values it did not. So, assuming that this is the true > root cause, how can I resolve it? I can use node-local storage as my > . However, when I run my real workload, I'm still > getting errors even if I use node-local storage. 
Assuming you used /local/disk, there is still some load on the shared filesystem since swift still needs to copy data from it to the scratch directory and back. The only true way of avoiding the shared FS is with provider staging enabled, and having both the swift run directory and the workdirectory on local disk. > I'm still following up with our file systems folks but the key issue > appears to be the large number of meta-data operations that are sent > at the shared file system (Lustre or NFS here). Is there a way to > reduce that or at least measure it so that I can tell our admins > exactly the throughput I need? This is hard to quantify. It is possible to measure the rate of I/O requests using strace, and recent versions of swift have some flags that allow you to strace the worker and its sub-processes. The actual bandwidth, I don't know. Perhaps iotop or something like it, but I have never personally used it to measure disk bandwidth with swift apps on shared FSs. Mihael From hategan at mcs.anl.gov Fri May 30 14:14:15 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 30 May 2014 12:14:15 -0700 Subject: [Swift-user] Data transfer error In-Reply-To: <1401476777.16971.20.camel@echo> References: <8635C0D1735D2C4BA6E571FD97486FF179926169@PRDEXMBX-05.the-lab.llnl.gov> <1400706574.25214.3.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF17992DC59@PRDEXMBX-05.the-lab.llnl.gov> <1400743356.29694.6.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF179933F77@PRDEXMBX-05.the-lab.llnl.gov> <1400872636.18856.5.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF1799377AB@PRDEXMBX-05.the-lab.llnl.gov> <1400876593.19522.2.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF179938553@PRDEXMBX-05.the-lab.llnl.gov> <1401234364.13640.7.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF179938CB9@PRDEXMBX-05.the-lab.llnl.gov> <1401320738.12733.3.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF17993A5E3@PRDEXMBX-05.the-lab.llnl.gov> <1401411337.31590.8.camel@echo> 
<8635C0D1735D2C4BA6E571FD97486FF17993A947@PRDEXMBX-05.the-lab.llnl.gov> <1401476777.16971.20.camel@echo> Message-ID: <1401477255.16971.24.camel@echo> On Fri, 2014-05-30 at 12:06 -0700, Mihael Hategan wrote: > On Fri, 2014-05-30 at 16:24 +0000, Bronevetsky, Greg wrote: > > I just ran a test where I varied > key="maxwalltime"> between 1 and 10 minutes. At 1 it gave me errors > > and for larger values it did not. So, assuming that this is the true > > root cause, how can I resolve it? Alternatively, increase the maxwalltime. Mike Wilde just figured out that I'm about an hour's drive from LLNL, so talking in person might be an option. Mihael From bronevetsky1 at llnl.gov Fri May 30 15:14:31 2014 From: bronevetsky1 at llnl.gov (Bronevetsky, Greg) Date: Fri, 30 May 2014 20:14:31 +0000 Subject: [Swift-user] Data transfer error In-Reply-To: <1401476777.16971.20.camel@echo> References: <8635C0D1735D2C4BA6E571FD97486FF179926169@PRDEXMBX-05.the-lab.llnl.gov> <1400706574.25214.3.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF17992DC59@PRDEXMBX-05.the-lab.llnl.gov> <1400743356.29694.6.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF179933F77@PRDEXMBX-05.the-lab.llnl.gov> <1400872636.18856.5.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF1799377AB@PRDEXMBX-05.the-lab.llnl.gov> <1400876593.19522.2.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF179938553@PRDEXMBX-05.the-lab.llnl.gov> <1401234364.13640.7.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF179938CB9@PRDEXMBX-05.the-lab.llnl.gov> <1401320738.12733.3.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF17993A5E3@PRDEXMBX-05.the-lab.llnl.gov> <1401411337.31590.8.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF17993A947@PRDEXMBX-05.the-lab.llnl.gov> <1401476777.16971.20.camel@echo> Message-ID: <8635C0D1735D2C4BA6E571FD97486FF17993AC92@PRDEXMBX-05.the-lab.llnl.gov> The issues I'm running into seem more related to metadata operations since in Lustre the metadata server is not distributed.
When I used 10 or 20 nodes I was generating thousands of file opens per second, which Lustre cannot deal with. Even when I use node-local storage as scratch I still get timeouts. Is there a way to just track metadata operations? The only true way of avoiding the shared FS is with provider staging enabled, and having both the swift run directory and the workdirectory on local disk. Does this mean that I'd only be able to do single-node runs or is there a way to shuttle data between the node-local storage of different nodes? Greg Bronevetsky Lawrence Livermore National Lab (925) 424-5756 bronevetsky at llnl.gov http://greg.bronevetsky.com -----Original Message----- From: Mihael Hategan [mailto:hategan at mcs.anl.gov] Sent: Friday, May 30, 2014 12:06 PM To: Bronevetsky, Greg Cc: swift-user at ci.uchicago.edu Subject: Re: [Swift-user] Data transfer error On Fri, 2014-05-30 at 16:24 +0000, Bronevetsky, Greg wrote: > I just ran a test where I varied key="maxwalltime"> between 1 and 10 minutes. At 1 it gave me errors > and for larger values it did not. So, assuming that this is the true > root cause, how can I resolve it? I can use node-local storage as my > . However, when I run my real workload, I'm still > getting errors even if I use node-local storage. Assuming you used /local/disk, there is still some load on the shared filesystem since swift still needs to copy data from it to the scratch directory and back. The only true way of avoiding the shared FS is with provider staging enabled, and having both the swift run directory and the workdirectory on local disk. > I'm still following up with our file systems folks but the key issue > appears to be the large number of meta-data operations that are sent > at the shared file system (Lustre or NFS here). Is there a way to > reduce that or at least measure it so that I can tell our admins > exactly the throughput I need? This is hard to quantify. 
It is possible to measure the rate of I/O requests using strace, and recent versions of swift have some flags that allow you to strace the worker and its sub-processes. The actual bandwidth, I don't know. Perhaps iotop or something like it, but I have never personally used it to measure disk bandwidth with swift apps on shared FSs. Mihael From hategan at mcs.anl.gov Fri May 30 17:22:00 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 30 May 2014 15:22:00 -0700 Subject: [Swift-user] Data transfer error In-Reply-To: <8635C0D1735D2C4BA6E571FD97486FF17993AC92@PRDEXMBX-05.the-lab.llnl.gov> References: <8635C0D1735D2C4BA6E571FD97486FF179926169@PRDEXMBX-05.the-lab.llnl.gov> <1400706574.25214.3.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF17992DC59@PRDEXMBX-05.the-lab.llnl.gov> <1400743356.29694.6.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF179933F77@PRDEXMBX-05.the-lab.llnl.gov> <1400872636.18856.5.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF1799377AB@PRDEXMBX-05.the-lab.llnl.gov> <1400876593.19522.2.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF179938553@PRDEXMBX-05.the-lab.llnl.gov> <1401234364.13640.7.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF179938CB9@PRDEXMBX-05.the-lab.llnl.gov> <1401320738.12733.3.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF17993A5E3@PRDEXMBX-05.the-lab.llnl.gov> <1401411337.31590.8.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF17993A947@PRDEXMBX-05.the-lab.llnl.gov> <1401476777.16971.20.camel@echo> <8635C0D1735D2C4BA6E571FD97486FF17993AC92@PRDEXMBX-05.the-lab.llnl.gov> Message-ID: <1401488520.18750.10.camel@echo> On Fri, 2014-05-30 at 20:14 +0000, Bronevetsky, Greg wrote: > The issues I'm running into seem more related to metadata operations > since in Lustre the metadata server is not distributed. When I used 10 > or 20 nodes I was generating thousands of file opens per second, which > Lustre cannot deal with. Even when I use node-local storage as scratch > I still get timeouts. Is there a way to just track metadata > operations? 
Use strace with the appropriate syscalls selected. You'd probably want to do this for a single app, since the strace logs can be large.

> > > The only true way of avoiding the shared FS is with provider staging enabled, and having both the swift run directory and the workdirectory on local disk.
> >
> > Does this mean that I'd only be able to do single-node runs or is there a way to shuttle data between the node-local storage of different nodes?

Swift does that automatically, although it stages files to and from the swift client node*. It is no different in bandwidth consumption than "staging" files to the shared FS storage nodes, but since POSIX semantics don't need to be enforced, metadata operations are significantly faster. If local (ram) disk space on the compute nodes isn't an issue, this scheme significantly improves performance, especially at higher scales.

Mihael

(*) provider.staging.pin.files=true provides additional caching on the compute nodes which can further improve things if multiple apps need the same input file(s).
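
As a rough illustration of the strace suggestion above, the sketch below counts metadata-related syscalls per second from an strace log. It is not part of Swift. It assumes the log was captured for a single app with something like `strace -f -ttt -e trace=file -o app.strace ./app` (where `./app` and `app.strace` are placeholder names), so that each line carries an epoch timestamp and, with -f, a leading PID. The set of syscalls treated as "metadata operations" is an assumption and should be adjusted to whatever the file system admins care about.

```python
import re
import sys
from collections import Counter

# Matches strace lines written with -ttt (epoch timestamps), e.g.
#   12345 1401488520.123456 open("/p/lustre/x", O_RDONLY) = 3
# The leading PID is present only when strace is run with -f and -o.
LINE_RE = re.compile(r'^(?:\[pid\s+\d+\]\s+|\d+\s+)?(\d+)\.\d+\s+(\w+)\(')

# Assumed set of metadata-heavy syscalls; adjust for your workload.
METADATA_CALLS = {
    "open", "openat", "stat", "lstat", "access",
    "unlink", "mkdir", "rmdir", "rename",
}


def metadata_ops_per_second(lines):
    """Return (ops per epoch second, ops per syscall name) for metadata calls.

    Lines that do not start a syscall (for example "<... open resumed>")
    are skipped, so each call is counted once, at the point it was issued.
    """
    per_second = Counter()
    per_call = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue
        second, call = int(match.group(1)), match.group(2)
        if call in METADATA_CALLS:
            per_second[second] += 1
            per_call[call] += 1
    return per_second, per_call


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python count_metadata_ops.py app.strace
    with open(sys.argv[1]) as log:
        per_second, per_call = metadata_ops_per_second(log)
    if per_second:
        print("peak metadata ops/sec:", max(per_second.values()))
    for call, count in per_call.most_common():
        print(call, count)
```

The peak per-second figure from one app, multiplied by the number of concurrent apps, gives only a crude upper estimate of the request rate hitting the metadata server, since apps rarely peak at the same instant.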