From lpesce at uchicago.edu Mon Apr 2 14:51:02 2012 From: lpesce at uchicago.edu (Lorenzo Pesce) Date: Mon, 2 Apr 2012 14:51:02 -0500 Subject: [Swift-user] Java out of memory error Message-ID: <89CEC81B-5720-4A43-A4EF-81C4B112812B@uchicago.edu> Any idea? Swift version 0.93 loaded Swift 0.93 swift-r5483 cog-r3339 RunID: 20120402-1945-mix1mk13 Progress: time: Mon, 02 Apr 2012 19:45:30 +0000 Progress: time: Mon, 02 Apr 2012 19:45:54 +0000 Initializing:1 Progress: time: Mon, 02 Apr 2012 19:45:57 +0000 Initializing:17215 Selecting site:65 Progress: time: Mon, 02 Apr 2012 19:46:01 +0000 Selecting site:17280 Progress: time: Mon, 02 Apr 2012 19:46:31 +0000 Selecting site:17280 Progress: time: Mon, 02 Apr 2012 19:47:04 +0000 Selecting site:17280 Exception in thread "Progress ticker" java.lang.OutOfMemoryError: GC overhead limit exceeded at org.griphyn.vdl.karajan.lib.RuntimeStats$ProgressTicker.getSummary(RuntimeStats.java:170) at org.griphyn.vdl.karajan.lib.RuntimeStats$ProgressTicker.printStates(RuntimeStats.java:194) at org.griphyn.vdl.karajan.lib.RuntimeStats$ProgressTicker.dumpState(RuntimeStats.java:159)^C Exception in thread "Timer-0" java.lang.OutOfMemoryError: GC overhead limit exceeded ^CJava HotSpot(TM) 64-Bit Server VM warning: Exception java.lang.OutOfMemoryError occurred dispatching signal SIGINT to handler- the VM may need to be forcibly terminated From davidk at ci.uchicago.edu Mon Apr 2 14:58:52 2012 From: davidk at ci.uchicago.edu (David Kelly) Date: Mon, 2 Apr 2012 14:58:52 -0500 (CDT) Subject: [Swift-user] Java out of memory error In-Reply-To: <89CEC81B-5720-4A43-A4EF-81C4B112812B@uchicago.edu> Message-ID: <340676230.93085.1333396732080.JavaMail.root@zimbra-mb2.anl.gov> Lorenzo, You can edit the 'swift' shell script and modify this line: HEAPMAX=256M The default of 256M is probably too low.. I set it to a default of 1024M in trunk. David ----- Original Message ----- > From: "Lorenzo Pesce" > To: swift-user at ci.uchicago.edu > Sent: Monday, April 2, 2012 2:51:02 PM > Subject: [Swift-user] Java out of memory error > Any idea? 
> > Swift version 0.93 loaded > Swift 0.93 swift-r5483 cog-r3339 > > RunID: 20120402-1945-mix1mk13 > Progress: time: Mon, 02 Apr 2012 19:45:30 +0000 > Progress: time: Mon, 02 Apr 2012 19:45:54 +0000 Initializing:1 > Progress: time: Mon, 02 Apr 2012 19:45:57 +0000 Initializing:17215 > Selecting site:65 > Progress: time: Mon, 02 Apr 2012 19:46:01 +0000 Selecting site:17280 > Progress: time: Mon, 02 Apr 2012 19:46:31 +0000 Selecting site:17280 > Progress: time: Mon, 02 Apr 2012 19:47:04 +0000 Selecting site:17280 > Exception in thread "Progress ticker" java.lang.OutOfMemoryError: GC > overhead limit exceeded > at > org.griphyn.vdl.karajan.lib.RuntimeStats$ProgressTicker.getSummary(RuntimeStats.java:170) > at > org.griphyn.vdl.karajan.lib.RuntimeStats$ProgressTicker.printStates(RuntimeStats.java:194) > at > org.griphyn.vdl.karajan.lib.RuntimeStats$ProgressTicker.dumpState(RuntimeStats.java:159)^C > > Exception in thread "Timer-0" java.lang.OutOfMemoryError: GC overhead > limit exceeded > ^CJava HotSpot(TM) 64-Bit Server VM warning: Exception > java.lang.OutOfMemoryError occurred dispatching signal SIGINT to > handler- the VM may need to be forcibly terminated > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user From lpesce at uchicago.edu Mon Apr 2 15:02:47 2012 From: lpesce at uchicago.edu (Lorenzo Pesce) Date: Mon, 2 Apr 2012 15:02:47 -0500 Subject: [Swift-user] Java out of memory error In-Reply-To: <340676230.93085.1333396732080.JavaMail.root@zimbra-mb2.anl.gov> References: <340676230.93085.1333396732080.JavaMail.root@zimbra-mb2.anl.gov> Message-ID: Thanks a lot for your reply. > You can edit the 'swift' shell script and modify this line: Forgive my newbie question, but what shell are you talking about here? I am running a script from Beagle and all I have is my swift script (demo_real.swift) , tc, cf and sites files. Can I pass HEAPMAX as an argument? How? > > HEAPMAX=256M > > The default of 256M is probably too low.. I set it to a default of 1024M in trunk. > > David > > ----- Original Message ----- >> From: "Lorenzo Pesce" >> To: swift-user at ci.uchicago.edu >> Sent: Monday, April 2, 2012 2:51:02 PM >> Subject: [Swift-user] Java out of memory error >> Any idea? 
>> >> Swift version 0.93 loaded >> Swift 0.93 swift-r5483 cog-r3339 >> >> RunID: 20120402-1945-mix1mk13 >> Progress: time: Mon, 02 Apr 2012 19:45:30 +0000 >> Progress: time: Mon, 02 Apr 2012 19:45:54 +0000 Initializing:1 >> Progress: time: Mon, 02 Apr 2012 19:45:57 +0000 Initializing:17215 >> Selecting site:65 >> Progress: time: Mon, 02 Apr 2012 19:46:01 +0000 Selecting site:17280 >> Progress: time: Mon, 02 Apr 2012 19:46:31 +0000 Selecting site:17280 >> Progress: time: Mon, 02 Apr 2012 19:47:04 +0000 Selecting site:17280 >> Exception in thread "Progress ticker" java.lang.OutOfMemoryError: GC >> overhead limit exceeded >> at >> org.griphyn.vdl.karajan.lib.RuntimeStats$ProgressTicker.getSummary(RuntimeStats.java:170) >> at >> org.griphyn.vdl.karajan.lib.RuntimeStats$ProgressTicker.printStates(RuntimeStats.java:194) >> at >> org.griphyn.vdl.karajan.lib.RuntimeStats$ProgressTicker.dumpState(RuntimeStats.java:159)^C >> >> Exception in thread "Timer-0" java.lang.OutOfMemoryError: GC overhead >> limit exceeded >> ^CJava HotSpot(TM) 64-Bit Server VM warning: Exception >> java.lang.OutOfMemoryError occurred dispatching signal SIGINT to >> handler- the VM may need to be forcibly terminated >> >> _______________________________________________ >> Swift-user mailing list >> Swift-user at ci.uchicago.edu >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user From wilde at mcs.anl.gov Mon Apr 2 15:04:57 2012 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 2 Apr 2012 15:04:57 -0500 (CDT) Subject: [Swift-user] Java out of memory error In-Reply-To: <340676230.93085.1333396732080.JavaMail.root@zimbra-mb2.anl.gov> Message-ID: <334474773.122887.1333397097096.JavaMail.root@zimbra.anl.gov> I think Lorenzo may be using "module load swift". Can this also be set with an env var: SWIFT_HEAP_MAX=1024M swift ...rest of swift args here - Mike ----- Original Message ----- > From: "David Kelly" > To: "Lorenzo Pesce" > Cc: swift-user at ci.uchicago.edu > Sent: Monday, April 2, 2012 2:58:52 PM > Subject: Re: [Swift-user] Java out of memory error > Lorenzo, > > You can edit the 'swift' shell script and modify this line: > > HEAPMAX=256M > > The default of 256M is probably too low.. I set it to a default of > 1024M in trunk. > > David > > ----- Original Message ----- > > From: "Lorenzo Pesce" > > To: swift-user at ci.uchicago.edu > > Sent: Monday, April 2, 2012 2:51:02 PM > > Subject: [Swift-user] Java out of memory error > > Any idea? 
> > > > Swift version 0.93 loaded > > Swift 0.93 swift-r5483 cog-r3339 > > > > RunID: 20120402-1945-mix1mk13 > > Progress: time: Mon, 02 Apr 2012 19:45:30 +0000 > > Progress: time: Mon, 02 Apr 2012 19:45:54 +0000 Initializing:1 > > Progress: time: Mon, 02 Apr 2012 19:45:57 +0000 Initializing:17215 > > Selecting site:65 > > Progress: time: Mon, 02 Apr 2012 19:46:01 +0000 Selecting site:17280 > > Progress: time: Mon, 02 Apr 2012 19:46:31 +0000 Selecting site:17280 > > Progress: time: Mon, 02 Apr 2012 19:47:04 +0000 Selecting site:17280 > > Exception in thread "Progress ticker" java.lang.OutOfMemoryError: GC > > overhead limit exceeded > > at > > org.griphyn.vdl.karajan.lib.RuntimeStats$ProgressTicker.getSummary(RuntimeStats.java:170) > > at > > org.griphyn.vdl.karajan.lib.RuntimeStats$ProgressTicker.printStates(RuntimeStats.java:194) > > at > > org.griphyn.vdl.karajan.lib.RuntimeStats$ProgressTicker.dumpState(RuntimeStats.java:159)^C > > > > Exception in thread "Timer-0" java.lang.OutOfMemoryError: GC > > overhead > > limit exceeded > > ^CJava HotSpot(TM) 64-Bit Server VM warning: Exception > > java.lang.OutOfMemoryError occurred dispatching signal SIGINT to > > handler- the VM may need to be forcibly terminated > > > > _______________________________________________ > > Swift-user mailing list > > Swift-user at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From jonmon at mcs.anl.gov Mon Apr 2 15:05:35 2012 From: jonmon at mcs.anl.gov (Jonathan Monette) Date: Mon, 2 Apr 2012 15:05:35 -0500 Subject: [Swift-user] Java out of memory error In-Reply-To: <340676230.93085.1333396732080.JavaMail.root@zimbra-mb2.anl.gov> References: <340676230.93085.1333396732080.JavaMail.root@zimbra-mb2.anl.gov> Message-ID: You could also use the SWIFT_HEAP_MAX environment variable. Doing "export SWIFT_HEAP_MAX=1024M" will have the same effect as what David suggested but you do not have to modify the Swift shell script On Apr 2, 2012, at 14:58, David Kelly wrote: > Lorenzo, > > You can edit the 'swift' shell script and modify this line: > > HEAPMAX=256M > > The default of 256M is probably too low.. I set it to a default of 1024M in trunk. > > David > > ----- Original Message ----- >> From: "Lorenzo Pesce" >> To: swift-user at ci.uchicago.edu >> Sent: Monday, April 2, 2012 2:51:02 PM >> Subject: [Swift-user] Java out of memory error >> Any idea? 
>> >> Swift version 0.93 loaded >> Swift 0.93 swift-r5483 cog-r3339 >> >> RunID: 20120402-1945-mix1mk13 >> Progress: time: Mon, 02 Apr 2012 19:45:30 +0000 >> Progress: time: Mon, 02 Apr 2012 19:45:54 +0000 Initializing:1 >> Progress: time: Mon, 02 Apr 2012 19:45:57 +0000 Initializing:17215 >> Selecting site:65 >> Progress: time: Mon, 02 Apr 2012 19:46:01 +0000 Selecting site:17280 >> Progress: time: Mon, 02 Apr 2012 19:46:31 +0000 Selecting site:17280 >> Progress: time: Mon, 02 Apr 2012 19:47:04 +0000 Selecting site:17280 >> Exception in thread "Progress ticker" java.lang.OutOfMemoryError: GC >> overhead limit exceeded >> at >> org.griphyn.vdl.karajan.lib.RuntimeStats$ProgressTicker.getSummary(RuntimeStats.java:170) >> at >> org.griphyn.vdl.karajan.lib.RuntimeStats$ProgressTicker.printStates(RuntimeStats.java:194) >> at >> org.griphyn.vdl.karajan.lib.RuntimeStats$ProgressTicker.dumpState(RuntimeStats.java:159)^C >> >> Exception in thread "Timer-0" java.lang.OutOfMemoryError: GC overhead >> limit exceeded >> ^CJava HotSpot(TM) 64-Bit Server VM warning: Exception >> java.lang.OutOfMemoryError occurred dispatching signal SIGINT to >> handler- the VM may need to be forcibly terminated >> >> _______________________________________________ >> Swift-user mailing list >> Swift-user at ci.uchicago.edu >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user From lpesce at uchicago.edu Mon Apr 2 15:05:38 2012 From: lpesce at uchicago.edu (Lorenzo Pesce) Date: Mon, 2 Apr 2012 15:05:38 -0500 Subject: [Swift-user] Java out of memory error In-Reply-To: References: <340676230.93085.1333396732080.JavaMail.root@zimbra-mb2.anl.gov> Message-ID: <6B3DE6BC-4A44-4E2B-B502-92F0AA65339B@uchicago.edu> I mean, I can change the swift script for every user on Beagle, but I don't think that it would be polite ;-) On Apr 2, 2012, at 3:02 PM, Lorenzo Pesce wrote: > Thanks a lot for your reply. > >> You can edit the 'swift' shell script and modify this line: > > Forgive my newbie question, but what shell are you talking about here? > > I am running a script from Beagle and all I have is my swift script (demo_real.swift) , tc, cf and sites files. > Can I pass HEAPMAX as an argument? How? > > > > >> >> HEAPMAX=256M >> >> The default of 256M is probably too low.. I set it to a default of 1024M in trunk. >> >> David >> >> ----- Original Message ----- >>> From: "Lorenzo Pesce" >>> To: swift-user at ci.uchicago.edu >>> Sent: Monday, April 2, 2012 2:51:02 PM >>> Subject: [Swift-user] Java out of memory error >>> Any idea? 
>>> >>> Swift version 0.93 loaded >>> Swift 0.93 swift-r5483 cog-r3339 >>> >>> RunID: 20120402-1945-mix1mk13 >>> Progress: time: Mon, 02 Apr 2012 19:45:30 +0000 >>> Progress: time: Mon, 02 Apr 2012 19:45:54 +0000 Initializing:1 >>> Progress: time: Mon, 02 Apr 2012 19:45:57 +0000 Initializing:17215 >>> Selecting site:65 >>> Progress: time: Mon, 02 Apr 2012 19:46:01 +0000 Selecting site:17280 >>> Progress: time: Mon, 02 Apr 2012 19:46:31 +0000 Selecting site:17280 >>> Progress: time: Mon, 02 Apr 2012 19:47:04 +0000 Selecting site:17280 >>> Exception in thread "Progress ticker" java.lang.OutOfMemoryError: GC >>> overhead limit exceeded >>> at >>> org.griphyn.vdl.karajan.lib.RuntimeStats$ProgressTicker.getSummary(RuntimeStats.java:170) >>> at >>> org.griphyn.vdl.karajan.lib.RuntimeStats$ProgressTicker.printStates(RuntimeStats.java:194) >>> at >>> org.griphyn.vdl.karajan.lib.RuntimeStats$ProgressTicker.dumpState(RuntimeStats.java:159)^C >>> >>> Exception in thread "Timer-0" java.lang.OutOfMemoryError: GC overhead >>> limit exceeded >>> ^CJava HotSpot(TM) 64-Bit Server VM warning: Exception >>> java.lang.OutOfMemoryError occurred dispatching signal SIGINT to >>> handler- the VM may need to be forcibly terminated >>> >>> _______________________________________________ >>> Swift-user mailing list >>> Swift-user at ci.uchicago.edu >>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user From lpesce at uchicago.edu Mon Apr 2 15:06:00 2012 From: lpesce at uchicago.edu (Lorenzo Pesce) Date: Mon, 2 Apr 2012 15:06:00 -0500 Subject: [Swift-user] Java out of memory error In-Reply-To: <334474773.122887.1333397097096.JavaMail.root@zimbra.anl.gov> References: <334474773.122887.1333397097096.JavaMail.root@zimbra.anl.gov> Message-ID: <90E20CD2-40B5-4AAD-8E63-22615F99FEDC@uchicago.edu> Perfect. Let me try this. On Apr 2, 2012, at 3:04 PM, Michael Wilde wrote: > I think Lorenzo may be using "module load swift". > > Can this also be set with an env var: > > SWIFT_HEAP_MAX=1024M swift ...rest of swift args here > > - Mike > > ----- Original Message ----- >> From: "David Kelly" >> To: "Lorenzo Pesce" >> Cc: swift-user at ci.uchicago.edu >> Sent: Monday, April 2, 2012 2:58:52 PM >> Subject: Re: [Swift-user] Java out of memory error >> Lorenzo, >> >> You can edit the 'swift' shell script and modify this line: >> >> HEAPMAX=256M >> >> The default of 256M is probably too low.. I set it to a default of >> 1024M in trunk. >> >> David >> >> ----- Original Message ----- >>> From: "Lorenzo Pesce" >>> To: swift-user at ci.uchicago.edu >>> Sent: Monday, April 2, 2012 2:51:02 PM >>> Subject: [Swift-user] Java out of memory error >>> Any idea? 
>>> >>> Swift version 0.93 loaded >>> Swift 0.93 swift-r5483 cog-r3339 >>> >>> RunID: 20120402-1945-mix1mk13 >>> Progress: time: Mon, 02 Apr 2012 19:45:30 +0000 >>> Progress: time: Mon, 02 Apr 2012 19:45:54 +0000 Initializing:1 >>> Progress: time: Mon, 02 Apr 2012 19:45:57 +0000 Initializing:17215 >>> Selecting site:65 >>> Progress: time: Mon, 02 Apr 2012 19:46:01 +0000 Selecting site:17280 >>> Progress: time: Mon, 02 Apr 2012 19:46:31 +0000 Selecting site:17280 >>> Progress: time: Mon, 02 Apr 2012 19:47:04 +0000 Selecting site:17280 >>> Exception in thread "Progress ticker" java.lang.OutOfMemoryError: GC >>> overhead limit exceeded >>> at >>> org.griphyn.vdl.karajan.lib.RuntimeStats$ProgressTicker.getSummary(RuntimeStats.java:170) >>> at >>> org.griphyn.vdl.karajan.lib.RuntimeStats$ProgressTicker.printStates(RuntimeStats.java:194) >>> at >>> org.griphyn.vdl.karajan.lib.RuntimeStats$ProgressTicker.dumpState(RuntimeStats.java:159)^C >>> >>> Exception in thread "Timer-0" java.lang.OutOfMemoryError: GC >>> overhead >>> limit exceeded >>> ^CJava HotSpot(TM) 64-Bit Server VM warning: Exception >>> java.lang.OutOfMemoryError occurred dispatching signal SIGINT to >>> handler- the VM may need to be forcibly terminated >>> >>> _______________________________________________ >>> Swift-user mailing list >>> Swift-user at ci.uchicago.edu >>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user >> _______________________________________________ >> Swift-user mailing list >> Swift-user at ci.uchicago.edu >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > From lpesce at uchicago.edu Mon Apr 2 15:09:54 2012 From: lpesce at uchicago.edu (Lorenzo Pesce) Date: Mon, 2 Apr 2012 15:09:54 -0500 Subject: [Swift-user] Java out of memory error In-Reply-To: References: <340676230.93085.1333396732080.JavaMail.root@zimbra-mb2.anl.gov> Message-ID: Perfect. Thanks a million. 20,000 more jobs on Beagle :-) On Apr 2, 2012, at 3:05 PM, Jonathan Monette wrote: > You could also use the SWIFT_HEAP_MAX environment variable. Doing "export SWIFT_HEAP_MAX=1024M" will have the same effect as what David suggested but you do not have to modify the Swift shell script > > On Apr 2, 2012, at 14:58, David Kelly wrote: > >> Lorenzo, >> >> You can edit the 'swift' shell script and modify this line: >> >> HEAPMAX=256M >> >> The default of 256M is probably too low.. I set it to a default of 1024M in trunk. >> >> David >> >> ----- Original Message ----- >>> From: "Lorenzo Pesce" >>> To: swift-user at ci.uchicago.edu >>> Sent: Monday, April 2, 2012 2:51:02 PM >>> Subject: [Swift-user] Java out of memory error >>> Any idea? 
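[Editor's sketch, pulling the suggestions in this thread together.] There are three ways to raise the Swift JVM heap as discussed above. The HEAPMAX/SWIFT_HEAP_MAX names and the 1024M value come straight from the replies; the swift command-line flags and the file names (cf, tc, sites.xml, demo_real.swift) are only assumed from Lorenzo's description and may differ on your installation.

    # Option 1 (needs write access to the Swift install, e.g. the Beagle module):
    #   edit the 'swift' launcher shell script and change
    #       HEAPMAX=256M
    #   to something larger, e.g. HEAPMAX=1024M

    # Option 2: export the environment variable once for the whole session
    export SWIFT_HEAP_MAX=1024M
    swift -config cf -tc.file tc -sites.file sites.xml demo_real.swift

    # Option 3: set it inline for a single run only
    SWIFT_HEAP_MAX=1024M swift -config cf -tc.file tc -sites.file sites.xml demo_real.swift

On a shared installation such as the Beagle module, option 2 or 3 is usually preferable since it requires no change to the installed launcher script.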
>>> >>> Swift version 0.93 loaded >>> Swift 0.93 swift-r5483 cog-r3339 >>> >>> RunID: 20120402-1945-mix1mk13 >>> Progress: time: Mon, 02 Apr 2012 19:45:30 +0000 >>> Progress: time: Mon, 02 Apr 2012 19:45:54 +0000 Initializing:1 >>> Progress: time: Mon, 02 Apr 2012 19:45:57 +0000 Initializing:17215 >>> Selecting site:65 >>> Progress: time: Mon, 02 Apr 2012 19:46:01 +0000 Selecting site:17280 >>> Progress: time: Mon, 02 Apr 2012 19:46:31 +0000 Selecting site:17280 >>> Progress: time: Mon, 02 Apr 2012 19:47:04 +0000 Selecting site:17280 >>> Exception in thread "Progress ticker" java.lang.OutOfMemoryError: GC >>> overhead limit exceeded >>> at >>> org.griphyn.vdl.karajan.lib.RuntimeStats$ProgressTicker.getSummary(RuntimeStats.java:170) >>> at >>> org.griphyn.vdl.karajan.lib.RuntimeStats$ProgressTicker.printStates(RuntimeStats.java:194) >>> at >>> org.griphyn.vdl.karajan.lib.RuntimeStats$ProgressTicker.dumpState(RuntimeStats.java:159)^C >>> >>> Exception in thread "Timer-0" java.lang.OutOfMemoryError: GC overhead >>> limit exceeded >>> ^CJava HotSpot(TM) 64-Bit Server VM warning: Exception >>> java.lang.OutOfMemoryError occurred dispatching signal SIGINT to >>> handler- the VM may need to be forcibly terminated >>> >>> _______________________________________________ >>> Swift-user mailing list >>> Swift-user at ci.uchicago.edu >>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user >> _______________________________________________ >> Swift-user mailing list >> Swift-user at ci.uchicago.edu >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user From lpesce at uchicago.edu Mon Apr 2 15:48:42 2012 From: lpesce at uchicago.edu (Lorenzo Pesce) Date: Mon, 2 Apr 2012 15:48:42 -0500 Subject: [Swift-user] Question about nr of nodes Message-ID: <62F6E6C9-6575-4B3D-AE29-B7D72D14FC0D@uchicago.edu> Nth question from me. Do you know of any implicit of explicit limitation in the number of nodes when one submits a job requesting less than 1 hr on Beagle? I asked for 50 nodes (I think) and I got only two. This is my sites file CI-IBN000103 pbs.aprun;pbs.mpp;depth=24 24 3600 1 1 50 12 10000 /lustre/beagle/GCNet/swift.workdir From jonmon at mcs.anl.gov Mon Apr 2 15:54:10 2012 From: jonmon at mcs.anl.gov (Jonathan Monette) Date: Mon, 2 Apr 2012 15:54:10 -0500 Subject: [Swift-user] Question about nr of nodes In-Reply-To: <62F6E6C9-6575-4B3D-AE29-B7D72D14FC0D@uchicago.edu> References: <62F6E6C9-6575-4B3D-AE29-B7D72D14FC0D@uchicago.edu> Message-ID: So you can very nodeGranulairty. Right now in your sites file you have it set to 1. That means that coasters will try jobs anywhere in the range 1-50 nodes. If you set nodeGranularity to 50, then coasters will submit a job with 50 nodes. On Apr 2, 2012, at 3:48 PM, Lorenzo Pesce wrote: > Nth question from me. > > Do you know of any implicit of explicit limitation in the number of nodes when one submits a job requesting less than 1 hr on Beagle? > > I asked for 50 nodes (I think) and I got only two. 
> > This is my sites file > > > > > > CI-IBN000103 > > pbs.aprun;pbs.mpp;depth=24 > > 24 > 3600 > > 1 > 1 > 50 > > 12 > 10000 > > > > /lustre/beagle/GCNet/swift.workdir > > > > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user From glen842 at uchicago.edu Mon Apr 2 15:57:12 2012 From: glen842 at uchicago.edu (Glen Hocky) Date: Mon, 2 Apr 2012 16:57:12 -0400 Subject: [Swift-user] Question about nr of nodes In-Reply-To: References: <62F6E6C9-6575-4B3D-AE29-B7D72D14FC0D@uchicago.edu> Message-ID: What I used to use was (shell variables should be self explanatory). This should limit your number of nodes to $nodes while putting 1 coaster on each one to run your jobs... Things may have changed in the last few months and this might not work correctly any more... 24 pbs.aprun;pbs.mpp;depth=24 $PPN $TIME $MAXTIME $nodes 1 1 100 100 200.00 10000 On Mon, Apr 2, 2012 at 4:54 PM, Jonathan Monette wrote: > So you can very nodeGranulairty. Right now in your sites file you have it > set to 1. That means that coasters will try jobs anywhere in the range > 1-50 nodes. If you set nodeGranularity to 50, then coasters will submit a > job with 50 nodes. > > On Apr 2, 2012, at 3:48 PM, Lorenzo Pesce wrote: > > > Nth question from me. > > > > Do you know of any implicit of explicit limitation in the number of > nodes when one submits a job requesting less than 1 hr on Beagle? > > > > I asked for 50 nodes (I think) and I got only two. > > > > This is my sites file > > > > > > > > > > > > CI-IBN000103 > > > > key="providerAttributes">pbs.aprun;pbs.mpp;depth=24 > > > > 24 > > 3600 > > > > 1 > > 1 > > 50 > > > > 12 > > 10000 > > > > > > > > /lustre/beagle/GCNet/swift.workdir > > > > > > > > > > _______________________________________________ > > Swift-user mailing list > > Swift-user at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jonmon at mcs.anl.gov Mon Apr 2 17:01:08 2012 From: jonmon at mcs.anl.gov (Jonathan Monette) Date: Mon, 2 Apr 2012 17:01:08 -0500 Subject: [Swift-user] Question about nr of nodes In-Reply-To: References: <62F6E6C9-6575-4B3D-AE29-B7D72D14FC0D@uchicago.edu> Message-ID: <05B840C4-ED15-478B-8F12-E09C3DD612A6@mcs.anl.gov> It all depends on ho you want to shape your jobs. slots: The maximum number of coaster blocks to submit(pbs jobs). maxNodes: The maximum number of nodes in a coaster block nodeGranularity: How much to increment node count by(nodeGranularity <= maxNodes) For instance, the following settings: 1 10 1 will submit a single coaster block at a time to the system with a node count of anywhere between 1-10. By changing nodeGranularity to 10 we will get a single coaster block with 10 nodes. That is the only size block that will be submitted while in the previous example we could get a varying size of coaster blocks. Adjusting slots will provide more active coaster blocks. In Glen's example he will requests "$nodes" number of single node jobs. He could have said the same thing by setting: 1 $nodes $nodes In his example, several single node coaster blocks would be submitted for execution. With the above settings, a single multi-node coaster block would be submitted. 
If the machine is overloaded and there is slow response time, then Glen's approach would probably be better as the scheduler may bias some single node jobs to run over multi-node jobs to keep the entire machine busy. This way progress will be made(even if it is slow progress). Another setting that should be set is: 100 100 This will force the coaster blocks to be exactly the maxTime you asked for. If those are not set coasters dynamically chooses a wall time which is often lower than the time you specified. I hope all this makes sense. We are in the process of improving documentation for Swift in general and this is one of the areas we need to better explain. Please let me know if this all makes sense or anything else I could help with. On Apr 2, 2012, at 3:57 PM, Glen Hocky wrote: > What I used to use was (shell variables should be self explanatory). This should limit your number of nodes to $nodes while putting 1 coaster on each one to run your jobs... > Things may have changed in the last few months and this might not work correctly any more... > > 24 > pbs.aprun;pbs.mpp;depth=24 > > $PPN > $TIME > $MAXTIME > $nodes > 1 > 1 > 100 > 100 > 200.00 > 10000 > > > > On Mon, Apr 2, 2012 at 4:54 PM, Jonathan Monette wrote: > So you can very nodeGranulairty. Right now in your sites file you have it set to 1. That means that coasters will try jobs anywhere in the range 1-50 nodes. If you set nodeGranularity to 50, then coasters will submit a job with 50 nodes. > > On Apr 2, 2012, at 3:48 PM, Lorenzo Pesce wrote: > > > Nth question from me. > > > > Do you know of any implicit of explicit limitation in the number of nodes when one submits a job requesting less than 1 hr on Beagle? > > > > I asked for 50 nodes (I think) and I got only two. > > > > This is my sites file > > > > > > > > > > > > CI-IBN000103 > > > > pbs.aprun;pbs.mpp;depth=24 > > > > 24 > > 3600 > > > > 1 > > 1 > > 50 > > > > 12 > > 10000 > > > > > > > > /lustre/beagle/GCNet/swift.workdir > > > > > > > > > > _______________________________________________ > > Swift-user mailing list > > Swift-user at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > -------------- next part -------------- An HTML attachment was scrubbed... URL: From lpesce at uchicago.edu Tue Apr 3 07:47:50 2012 From: lpesce at uchicago.edu (Lorenzo Pesce) Date: Tue, 3 Apr 2012 07:47:50 -0500 Subject: [Swift-user] Question about nr of nodes In-Reply-To: <05B840C4-ED15-478B-8F12-E09C3DD612A6@mcs.anl.gov> References: <62F6E6C9-6575-4B3D-AE29-B7D72D14FC0D@uchicago.edu> <05B840C4-ED15-478B-8F12-E09C3DD612A6@mcs.anl.gov> Message-ID: Thanks to all of you for you explanations. Now it makes a lot more sense to me. When Beagle is back up again, I will try to put all of this to good use and then finish the matlab/swift wiki (more like start it...) On Apr 2, 2012, at 5:01 PM, Jonathan Monette wrote: > It all depends on ho you want to shape your jobs. > slots: The maximum number of coaster blocks to submit(pbs jobs). > maxNodes: The maximum number of nodes in a coaster block > nodeGranularity: How much to increment node count by(nodeGranularity <= maxNodes) > > For instance, the following settings: > 1 > 10 > 1 > > will submit a single coaster block at a time to the system with a node count of anywhere between 1-10. 
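[Editor's sketch.] Putting Jonathan's explanation together with Lorenzo's request (a single 50-node, roughly one-hour allocation on Beagle, 24 cores per node), the relevant part of the pool entry would look something like the following. This is only an illustration: the project, providerAttributes, work directory and the 12/10000 values (which look like jobThrottle and initialScore) are copied from Lorenzo's sites file above, while the profile key names (jobsPerNode, slots, maxNodes, nodeGranularity, lowOverallocation, highOverallocation) are the usual 0.93 coaster settings, reconstructed here because the XML tags were stripped when the messages were archived; check them against the Swift site-catalog documentation.

    <pool handle="beagle">
      <execution provider="coaster" jobmanager="local:pbs"/>
      <profile namespace="globus" key="project">CI-IBN000103</profile>
      <profile namespace="globus" key="providerAttributes">pbs.aprun;pbs.mpp;depth=24</profile>
      <profile namespace="globus" key="jobsPerNode">24</profile>
      <profile namespace="globus" key="maxTime">3600</profile>
      <profile namespace="globus" key="slots">1</profile>
      <profile namespace="globus" key="nodeGranularity">50</profile>
      <profile namespace="globus" key="maxNodes">50</profile>
      <profile namespace="globus" key="lowOverallocation">100</profile>
      <profile namespace="globus" key="highOverallocation">100</profile>
      <profile namespace="karajan" key="jobThrottle">12</profile>
      <profile namespace="karajan" key="initialScore">10000</profile>
      <workdirectory>/lustre/beagle/GCNet/swift.workdir</workdirectory>
    </pool>

With slots=1 and nodeGranularity equal to maxNodes, coasters can only ever submit one PBS job of exactly 50 nodes, and the two overallocation settings keep that block's walltime at the requested maxTime, as Jonathan describes.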
> > By changing nodeGranularity to 10 we will get a single coaster block with 10 nodes. That is the only size block that will be submitted while in the previous example we could get a varying size of coaster blocks. > > Adjusting slots will provide more active coaster blocks. > > In Glen's example he will requests "$nodes" number of single node jobs. He could have said the same thing by setting: > 1 > $nodes > $nodes > > In his example, several single node coaster blocks would be submitted for execution. With the above settings, a single multi-node coaster block would be submitted. If the machine is overloaded and there is slow response time, then Glen's approach would probably be better as the scheduler may bias some single node jobs to run over multi-node jobs to keep the entire machine busy. This way progress will be made(even if it is slow progress). > > Another setting that should be set is: > 100 > 100 > > This will force the coaster blocks to be exactly the maxTime you asked for. If those are not set coasters dynamically chooses a wall time which is often lower than the time you specified. > > I hope all this makes sense. We are in the process of improving documentation for Swift in general and this is one of the areas we need to better explain. Please let me know if this all makes sense or anything else I could help with. > > > On Apr 2, 2012, at 3:57 PM, Glen Hocky wrote: > >> What I used to use was (shell variables should be self explanatory). This should limit your number of nodes to $nodes while putting 1 coaster on each one to run your jobs... >> Things may have changed in the last few months and this might not work correctly any more... >> >> 24 >> pbs.aprun;pbs.mpp;depth=24 >> >> $PPN >> $TIME >> $MAXTIME >> $nodes >> 1 >> 1 >> 100 >> 100 >> 200.00 >> 10000 >> >> >> >> On Mon, Apr 2, 2012 at 4:54 PM, Jonathan Monette wrote: >> So you can very nodeGranulairty. Right now in your sites file you have it set to 1. That means that coasters will try jobs anywhere in the range 1-50 nodes. If you set nodeGranularity to 50, then coasters will submit a job with 50 nodes. >> >> On Apr 2, 2012, at 3:48 PM, Lorenzo Pesce wrote: >> >> > Nth question from me. >> > >> > Do you know of any implicit of explicit limitation in the number of nodes when one submits a job requesting less than 1 hr on Beagle? >> > >> > I asked for 50 nodes (I think) and I got only two. >> > >> > This is my sites file >> > >> > >> > >> > >> > >> > CI-IBN000103 >> > >> > pbs.aprun;pbs.mpp;depth=24 >> > >> > 24 >> > 3600 >> > >> > 1 >> > 1 >> > 50 >> > >> > 12 >> > 10000 >> > >> > >> > >> > /lustre/beagle/GCNet/swift.workdir >> > >> > >> > >> > >> > _______________________________________________ >> > Swift-user mailing list >> > Swift-user at ci.uchicago.edu >> > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user >> >> _______________________________________________ >> Swift-user mailing list >> Swift-user at ci.uchicago.edu >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mickelso at mcs.anl.gov Thu Apr 5 13:15:28 2012 From: mickelso at mcs.anl.gov (Sheri Mickelson) Date: Thu, 5 Apr 2012 13:15:28 -0500 Subject: [Swift-user] sites.xml option Message-ID: <1265073F-0309-4D75-B159-766651B9D61F@mcs.anl.gov> Hi, I'm trying to run a Swift script that uses coasters and pbs. When the script runs, one of the tasks runs out of memory. 
I can get this large memory task to run if I add something like this to a submission script: #PBS -l pvmem=20GB Is there a way I can pass this option to Swift so it's able to run with more memory? Thanks, Sheri From wozniak at mcs.anl.gov Thu Apr 5 13:25:34 2012 From: wozniak at mcs.anl.gov (Justin M Wozniak) Date: Thu, 5 Apr 2012 13:25:34 -0500 (CDT) Subject: [Swift-user] sites.xml option In-Reply-To: <1265073F-0309-4D75-B159-766651B9D61F@mcs.anl.gov> References: <1265073F-0309-4D75-B159-766651B9D61F@mcs.anl.gov> Message-ID: Yes, please try: pbs.resource_list=pvmem=20GB On Thu, 5 Apr 2012, Sheri Mickelson wrote: > Hi, > > I'm trying to run a Swift script that uses coasters and pbs. When the > script runs, one of the tasks runs out of memory. I can get this > large memory task to run if I add something like this to a submission > script: > > #PBS -l pvmem=20GB > > Is there a way I can pass this option to Swift so it's able to run > with more memory? > > Thanks, Sheri > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > -- Justin M Wozniak From mickelso at mcs.anl.gov Thu Apr 5 13:30:53 2012 From: mickelso at mcs.anl.gov (Sheri Mickelson) Date: Thu, 5 Apr 2012 13:30:53 -0500 Subject: [Swift-user] sites.xml option In-Reply-To: References: <1265073F-0309-4D75-B159-766651B9D61F@mcs.anl.gov> Message-ID: Thanks, that worked. -Sheri On Apr 5, 2012, at 1:25 PM, Justin M Wozniak wrote: > > Yes, please try: > > > pbs.resource_list=pvmem=20GB > > > On Thu, 5 Apr 2012, Sheri Mickelson wrote: > >> Hi, >> >> I'm trying to run a Swift script that uses coasters and pbs. When >> the >> script runs, one of the tasks runs out of memory. I can get this >> large memory task to run if I add something like this to a submission >> script: >> >> #PBS -l pvmem=20GB >> >> Is there a way I can pass this option to Swift so it's able to run >> with more memory? >> >> Thanks, Sheri >> _______________________________________________ >> Swift-user mailing list >> Swift-user at ci.uchicago.edu >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user >> > > -- > Justin M Wozniak From lpesce at uchicago.edu Fri Apr 13 17:18:36 2012 From: lpesce at uchicago.edu (Lorenzo Pesce) Date: Fri, 13 Apr 2012 17:18:36 -0500 Subject: [Swift-user] Error message on Cray XE6 Message-ID: <06C0E895-24C2-4CFF-A239-D647B9F1EA27@uchicago.edu> Hi -- I haven't seen this one before: Can't open perl script "/home/lpesce/.globus/coasters/cscript7176272791806289394.pl": No such file or directory The config of the cray has changed, might this have anything to do with it? I have no idea what perl script is it talking about and why it is looking to home. Thanks a lot, Lorenzo From heather.stoller at gmail.com Thu Apr 5 09:37:33 2012 From: heather.stoller at gmail.com (Heather Stoller) Date: Thu, 5 Apr 2012 07:37:33 -0700 Subject: [Swift-user] MODIS demo Message-ID: Hello, I'm a UC student working with Mike Wilde doing some Swift stuff - at present, I'm trying to run the demo to see what can be seen. 
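[Editor's sketch, referring back to the sites.xml option exchange above.] Justin's pbs.resource_list=pvmem=20GB setting is normally written in sites.xml as a globus-namespace profile inside the pool definition, and it is what produces the "#PBS -l pvmem=20GB" line in the submit script Swift generates. The surrounding pool element is assumed:

    <pool handle="pbs">
      ...
      <profile namespace="globus" key="pbs.resource_list">pvmem=20GB</profile>
      ...
    </pool>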
I get: *^Cheather at ubuntu:~/modis$ ./demo.local urban 10 runid=modis-2012.0405.0704-urban-10-10 Swift 0.93 swift-r5483 cog-r3339 RunID: 20120405-0704-drh1g0ob (input): found 0 files Progress: time: Thu, 05 Apr 2012 07:04:48 -0700 Progress: time: Thu, 05 Apr 2012 07:04:49 -0700 Stage in:19 Submitting:1 Progress: time: Thu, 05 Apr 2012 07:04:50 -0700 Stage in:13 Submitting:1 Active:6 Progress: time: Thu, 05 Apr 2012 07:04:53 -0700 Stage in:10 Submitting:2 Submitted:2 Active:6 Progress: time: Thu, 05 Apr 2012 07:04:54 -0700 Stage in:6 Submitting:1 Submitted:2 Active:9 Checking status:2 Progress: time: Thu, 05 Apr 2012 07:04:55 -0700 Stage in:2 Submitting:2 Submitted:1 Active:9 Checking status:3 Stage out:3 Progress: time: Thu, 05 Apr 2012 07:04:57 -0700 Submitting:1 Submitted:2 Active:9 Checking status:1 Stage out:7 Progress: time: Thu, 05 Apr 2012 07:04:58 -0700 Active:3 Checking status:2 Stage out:7 Finished successfully:7 Failed but can retry:1 Progress: time: Thu, 05 Apr 2012 07:04:59 -0700 Stage in:2 Submitting:2 Submitted:1 Active:3 Stage out:1 Finished successfully:10 Failed but can retry:2 Progress: time: Thu, 05 Apr 2012 07:05:00 -0700 Active:10 Stage out:1 Finished successfully:10 Progress: time: Thu, 05 Apr 2012 07:05:01 -0700 Active:9 Checking status:2 Finished successfully:11 Progress: time: Thu, 05 Apr 2012 07:05:02 -0700 Submitting:1 Active:3 Checking status:3 Stage out:2 Finished successfully:11 Failed but can retry:2 Progress: time: Thu, 05 Apr 2012 07:05:03 -0700 Stage in:2 Active:6 Stage out:1 Finished successfully:11 Failed but can retry:2 Execution failed: Progress: time: Thu, 05 Apr 2012 07:05:04 -0700 Active:7 Checking status:2 Stage out:1 Failed:1 Finished successfully:11 Progress: time: Thu, 05 Apr 2012 07:05:05 -0700 Active:3 Checking status:1 Stage out:3 Failed:4 Finished successfully:11 Progress: time: Thu, 05 Apr 2012 07:05:06 -0700 Failed:11 Finished successfully:11 * lots of "failed but can retry". Does this look right? -- Heather Stoller Computer Science MS Student University of Chicago Cell: 843-290-6711 -------------- next part -------------- An HTML attachment was scrubbed... URL: From davidk at ci.uchicago.edu Fri Apr 13 23:17:01 2012 From: davidk at ci.uchicago.edu (David Kelly) Date: Fri, 13 Apr 2012 23:17:01 -0500 (CDT) Subject: [Swift-user] MODIS demo In-Reply-To: Message-ID: <1884709006.141180.1334377021910.JavaMail.root@zimbra-mb2.anl.gov> Heather, You might want to check the path names in the tc.local file. Since about half the tasks fail, I'm guessing either colormodis or getlanduse is pointing to the wrong place. You can verify this by looking at the directory called modis-2012.d. In there is a list of files that end with -info. Look at the "Wrapper" section of these files and you should find more information about what is causing the failures. David ----- Original Message ----- > From: "Heather Stoller" > To: swift-user at ci.uchicago.edu > Sent: Thursday, April 5, 2012 9:37:33 AM > Subject: [Swift-user] MODIS demo > Hello, > > I'm a UC student working with Mike Wilde doing some Swift stuff - at > present, I'm trying to run the demo to see what can be seen. 
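[Editor's sketch.] A minimal way to follow David's suggestion from the shell; the directory name modis-2012.d comes from his reply, and the grep patterns below are just a starting point rather than specific markers of the wrapper logs.

    cd modis-2012.d
    ls *-info                                    # one wrapper log per app invocation
    grep -il -e error -e "No such file" *-info   # narrow down to the logs of failed tasks
    less $(grep -il -e error -e "No such file" *-info | head -1)   # read one; look for the "Wrapper" section

If the failures are path problems in tc.local, they usually show up here as "command not found" or "No such file or directory" messages.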
I get: > > ^Cheather at ubuntu:~/modis$ ./demo.local urban 10 > runid=modis-2012.0405.0704-urban-10-10 > Swift 0.93 swift-r5483 cog-r3339 > > RunID: 20120405-0704-drh1g0ob > (input): found 0 files > Progress: time: Thu, 05 Apr 2012 07:04:48 -0700 > Progress: time: Thu, 05 Apr 2012 07:04:49 -0700 Stage in:19 > Submitting:1 > Progress: time: Thu, 05 Apr 2012 07:04:50 -0700 Stage in:13 > Submitting:1 Active:6 > Progress: time: Thu, 05 Apr 2012 07:04:53 -0700 Stage in:10 > Submitting:2 Submitted:2 Active:6 > Progress: time: Thu, 05 Apr 2012 07:04:54 -0700 Stage in:6 > Submitting:1 Submitted:2 Active:9 Checking status:2 > Progress: time: Thu, 05 Apr 2012 07:04:55 -0700 Stage in:2 > Submitting:2 Submitted:1 Active:9 Checking status:3 Stage out:3 > Progress: time: Thu, 05 Apr 2012 07:04:57 -0700 Submitting:1 > Submitted:2 Active:9 Checking status:1 Stage out:7 > Progress: time: Thu, 05 Apr 2012 07:04:58 -0700 Active:3 Checking > status:2 Stage out:7 Finished successfully:7 Failed but can retry:1 > Progress: time: Thu, 05 Apr 2012 07:04:59 -0700 Stage in:2 > Submitting:2 Submitted:1 Active:3 Stage out:1 Finished successfully:10 > Failed but can retry:2 > Progress: time: Thu, 05 Apr 2012 07:05:00 -0700 Active:10 Stage out:1 > Finished successfully:10 > Progress: time: Thu, 05 Apr 2012 07:05:01 -0700 Active:9 Checking > status:2 Finished successfully:11 > Progress: time: Thu, 05 Apr 2012 07:05:02 -0700 Submitting:1 Active:3 > Checking status:3 Stage out:2 Finished successfully:11 Failed but can > retry:2 > Progress: time: Thu, 05 Apr 2012 07:05:03 -0700 Stage in:2 Active:6 > Stage out:1 Finished successfully:11 Failed but can retry:2 > Execution failed: > Progress: time: Thu, 05 Apr 2012 07:05:04 -0700 Active:7 Checking > status:2 Stage out:1 Failed:1 Finished successfully:11 > Progress: time: Thu, 05 Apr 2012 07:05:05 -0700 Active:3 Checking > status:1 Stage out:3 Failed:4 Finished successfully:11 > Progress: time: Thu, 05 Apr 2012 07:05:06 -0700 Failed:11 Finished > successfully:11 > > lots of "failed but can retry". Does this look right? > > -- > Heather Stoller > Computer Science MS Student > University of Chicago > Cell: 843-290-6711 > > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user From jonmon at mcs.anl.gov Sat Apr 14 00:10:43 2012 From: jonmon at mcs.anl.gov (Jonathan Monette) Date: Sat, 14 Apr 2012 00:10:43 -0500 Subject: [Swift-user] Error message on Cray XE6 In-Reply-To: <06C0E895-24C2-4CFF-A239-D647B9F1EA27@uchicago.edu> References: <06C0E895-24C2-4CFF-A239-D647B9F1EA27@uchicago.edu> Message-ID: <27BED777-D32A-43D9-9697-1094B033C02F@mcs.anl.gov> The perl script is the worker script that is submitted with PBS. I have not tried to run on Beagle since the maintenance period has ended so I am not exactly sure why the error popped up. One reason could be that the home file system is no longer mounted on the compute nodes. I know they spoke about that being a possibility but not sure they implemented that during the maintenance period. Do you know if the home file system is still mounted on the compute nodes? On Apr 13, 2012, at 17:18, Lorenzo Pesce wrote: > Hi -- > I haven't seen this one before: > > Can't open perl script "/home/lpesce/.globus/coasters/cscript7176272791806289394.pl": No such file or directory > > The config of the cray has changed, might this have anything to do with it? 
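[Editor's sketch.] One quick way to answer Jonathan's question, i.e. to check whether /home (and hence the staged coaster worker script) is still visible from Beagle's compute nodes, is to run a trivial aprun from an interactive PBS session. The walltime and mppwidth values are arbitrary; the path is the one from Lorenzo's error message.

    qsub -I -l mppwidth=24 -l walltime=00:10:00             # interactive session
    aprun -n 1 /bin/ls -ld /home/$USER/.globus/coasters     # fails if /home is not mounted on the compute nodes

If that ls fails, the worker script is being written somewhere the compute nodes cannot read, which would explain the "No such file or directory" error.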
> I have no idea what perl script is it talking about and why it is looking to home. > > Thanks a lot, > > Lorenzo > > > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user From jonmon at mcs.anl.gov Sat Apr 14 00:17:05 2012 From: jonmon at mcs.anl.gov (Jonathan Monette) Date: Sat, 14 Apr 2012 00:17:05 -0500 Subject: [Swift-user] MODIS demo In-Reply-To: <1884709006.141180.1334377021910.JavaMail.root@zimbra-mb2.anl.gov> References: <1884709006.141180.1334377021910.JavaMail.root@zimbra-mb2.anl.gov> Message-ID: <4D712B0A-8F4C-4358-9237-01BFD7A46A26@mcs.anl.gov> I echo David's suggestion but would like to add another. It looks like soft error handling is being used which may not be the best approach when exploring Swift. In the config file you should set the retry count to 0 and set lazy.errors=false. This will cause Swift to fail as soon as the first error is encountered and will provide an error message. This is useful for when you are exploring Swift behavior. On Apr 13, 2012, at 23:17, David Kelly wrote: > Heather, > > You might want to check the path names in the tc.local file. Since about half the tasks fail, I'm guessing either colormodis or getlanduse is pointing to the wrong place. You can verify this by looking at the directory called modis-2012.d. In there is a list of files that end with -info. Look at the "Wrapper" section of these files and you should find more information about what is causing the failures. > > David > > ----- Original Message ----- >> From: "Heather Stoller" >> To: swift-user at ci.uchicago.edu >> Sent: Thursday, April 5, 2012 9:37:33 AM >> Subject: [Swift-user] MODIS demo >> Hello, >> >> I'm a UC student working with Mike Wilde doing some Swift stuff - at >> present, I'm trying to run the demo to see what can be seen. 
I get: >> >> ^Cheather at ubuntu:~/modis$ ./demo.local urban 10 >> runid=modis-2012.0405.0704-urban-10-10 >> Swift 0.93 swift-r5483 cog-r3339 >> >> RunID: 20120405-0704-drh1g0ob >> (input): found 0 files >> Progress: time: Thu, 05 Apr 2012 07:04:48 -0700 >> Progress: time: Thu, 05 Apr 2012 07:04:49 -0700 Stage in:19 >> Submitting:1 >> Progress: time: Thu, 05 Apr 2012 07:04:50 -0700 Stage in:13 >> Submitting:1 Active:6 >> Progress: time: Thu, 05 Apr 2012 07:04:53 -0700 Stage in:10 >> Submitting:2 Submitted:2 Active:6 >> Progress: time: Thu, 05 Apr 2012 07:04:54 -0700 Stage in:6 >> Submitting:1 Submitted:2 Active:9 Checking status:2 >> Progress: time: Thu, 05 Apr 2012 07:04:55 -0700 Stage in:2 >> Submitting:2 Submitted:1 Active:9 Checking status:3 Stage out:3 >> Progress: time: Thu, 05 Apr 2012 07:04:57 -0700 Submitting:1 >> Submitted:2 Active:9 Checking status:1 Stage out:7 >> Progress: time: Thu, 05 Apr 2012 07:04:58 -0700 Active:3 Checking >> status:2 Stage out:7 Finished successfully:7 Failed but can retry:1 >> Progress: time: Thu, 05 Apr 2012 07:04:59 -0700 Stage in:2 >> Submitting:2 Submitted:1 Active:3 Stage out:1 Finished successfully:10 >> Failed but can retry:2 >> Progress: time: Thu, 05 Apr 2012 07:05:00 -0700 Active:10 Stage out:1 >> Finished successfully:10 >> Progress: time: Thu, 05 Apr 2012 07:05:01 -0700 Active:9 Checking >> status:2 Finished successfully:11 >> Progress: time: Thu, 05 Apr 2012 07:05:02 -0700 Submitting:1 Active:3 >> Checking status:3 Stage out:2 Finished successfully:11 Failed but can >> retry:2 >> Progress: time: Thu, 05 Apr 2012 07:05:03 -0700 Stage in:2 Active:6 >> Stage out:1 Finished successfully:11 Failed but can retry:2 >> Execution failed: >> Progress: time: Thu, 05 Apr 2012 07:05:04 -0700 Active:7 Checking >> status:2 Stage out:1 Failed:1 Finished successfully:11 >> Progress: time: Thu, 05 Apr 2012 07:05:05 -0700 Active:3 Checking >> status:1 Stage out:3 Failed:4 Finished successfully:11 >> Progress: time: Thu, 05 Apr 2012 07:05:06 -0700 Failed:11 Finished >> successfully:11 >> >> lots of "failed but can retry". Does this look right? >> >> -- >> Heather Stoller >> Computer Science MS Student >> University of Chicago >> Cell: 843-290-6711 >> >> >> _______________________________________________ >> Swift-user mailing list >> Swift-user at ci.uchicago.edu >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user From iraicu at cs.iit.edu Sat Apr 14 07:12:25 2012 From: iraicu at cs.iit.edu (Ioan Raicu) Date: Sat, 14 Apr 2012 07:12:25 -0500 Subject: [Swift-user] Call for Participation: ACM HPDC 2012 Message-ID: <4F8969A9.5080106@cs.iit.edu> (Please accept our apologies if you receive this message multiple times) **** CALL FOR PARTICIPATION **** *************************************************************** *** ** EARLY REGISTRATION DEADLINE: May 25, 2012 (CET) ** *** *************************************************************** The 21st International ACM Symposium on High-Performance Parallel and Distributed Computing (HPDC'12) Delft University of Technology, Delft, the Netherlands June 18-22, 2012 http://www.hpdc.org/2012 The ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC) is the premier annual conference on the design, the implementation, the evaluation, and the use of parallel and distributed systems for high-end computing. 
HPDC'12 will take place in Delft, the Netherlands, a historical, picturesque city that is less than one hour away from Amsterdam-Schiphol airport. The conference will be held on June 20-22 (Wednesday to Friday 1 PM), with affiliated workshops taking place on June 18-19 (Monday and Tuesday). **** MAIN CONFERENCE FEATURES **** - High-quality single-track paper sessions - Two keynote presentations - Achievement Award talk (new in HPDC!) - Poster session plus conference reception - Seven workshops - Visit to Museum de Prinsenhof in Delft - Conference dinner in the historical place De Prinsenkelder in Delft **** CONFERENCE PROGRAM **** The program of the conference will be posted by mid-april on the conference website. **** KEYNOTE SPEAKERS (titles and abstracts will be posted online) **** Ricardo Bianchini, Rutgers University, USA Mihai Budiu, Microsoft Research, USA **** ACHIEVEMENT AWARD TALK (title and abstract will be posted online) **** Ian Foster, University of Chicago and Argonne National Laboratory, USA **** CALL FOR POSTERS **** HPDC'12 offers conference attendees the opportunity to participate in the poster session on Wednesday afternoon. For details on how to submit a poster, please consult the conference website. **** HPDC 2012 GENERAL CHAIR **** Dick Epema, Delft University of Technology, Delft, the Netherlands **** HPDC 2012 PROGRAM CO-CHAIRS **** Thilo Kielmann, Vrije Universiteit, Amsterdam, the Netherlands Matei Ripeanu, The University of British Columbia, Vancouver, Canada **** HPDC 2012 WORKSHOPS CHAIR **** Alexandru Iosup, Delft University of Technology, Delft, the Netherlands **** HPDC 2012 POSTERS CHAIR **** Ana Varbanescu, Delft University of Technology, Delft, the Netherlands **** EARLY REGISTRATION DEADLINE **** May 25, 2012 (CET) **** VENUE **** The HPDC'12 conference will be held on the campus of Delft University of Technology, which was founded in 1842 by King William II and which is the oldest and largest technical university in the Netherlands. It is well established as one of the leading technical universities in the world. Delft is a small, historical town dating back to the 13th century. Delft has many old buildings and small canals, and it has a lively atmosphere. The city offers a large variety of hotels and restaurants. Many other places of interest (e.g., Amsterdam and The Hague) are within one hour distance of traveling. Traveling to Delft is easy. Delft is close to Amsterdam Schiphol Airport (60 km, 45 min by train), which has direct connections from all major airports in the world. Delft also has excellent train connections to the rest of Europe. -- ================================================================= Ioan Raicu, Ph.D. Assistant Professor, Illinois Institute of Technology (IIT) Guest Research Faculty, Argonne National Laboratory (ANL) ================================================================= Data-Intensive Distributed Systems Laboratory, CS/IIT Distributed Systems Laboratory, MCS/ANL ================================================================= Cel: 1-847-722-0876 Office: 1-312-567-5704 Email: iraicu at cs.iit.edu Web: http://www.cs.iit.edu/~iraicu/ Web: http://datasys.cs.iit.edu/ ================================================================= ================================================================= From iraicu at cs.iit.edu Sat Apr 14 07:12:50 2012 From: iraicu at cs.iit.edu (Ioan Raicu) Date: Sat, 14 Apr 2012 07:12:50 -0500 Subject: [Swift-user] Call for Posters: The 21st Int. ACM Symp. 
on High-Performance Parallel and Distributed Computing (HPDC'12) Message-ID: <4F8969C2.4090009@cs.iit.edu> **** CALL FOR POSTERS **** The 21st International ACM Symposium on High-Performance Parallel and Distributed Computing (HPDC'12) Delft University of Technology, Delft, the Netherlands June 18-22, 2012 http://www.hpdc.org/2012 The ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC) is the premier annual conference on the design, the implementation, the evaluation, and the use of parallel and distributed systems for high-end computing. HPDC'12 will take place in Delft, the Netherlands, a historical, picturesque city that is less than one hour away from Amsterdam-Schiphol airport. The conference will be held on June 20-22 (Wednesday to Friday), with affiliated workshops taking place on June 18-19 (Monday and Tuesday). HPDC'12 will feature a poster session that will provide the right environment for lively and informal discussions on various high performance parallel and distributed computing topics. We invite all potential authors to submit their contribution for this poster session in the form of a two-page PDF abstract (we recommend using the ACM Proceedings style, and fonts not smaller than 10 point). Posters may be accompanied by practical demonstrations. Abstracts must be submitted by sending an email to: hpdc-2012-posters at gmail.com before May 15th 2012, 23:59 CET. Participating posters will be selected based on the following criteria: * Submissions must describe new, interesting ideas on any HPDC topics of interest. * Submissions can present work in progress, but we strongly encourage the authors to include preliminary experimental results, if available. * Student submissions meeting the above criteria will be given preference. Please provide the following information in your PDF file: * Poster title. * Author names, affiliations, and email addresses. * Note which authors, if any, are students. * Indicate if you plan to set up a demo with your poster (the authors and organizers need to agree that the requirements for the demo to function can be met at the site of the poster exhibition). Authors will be notified of acceptance or rejection via e-mail by May 20th, 2012. No reviews will be provided. Posters will be published online on the conference Web site. Each poster will also have an A0 panel in a poster exhibition area, which will also include posters of the HPDC accepted papers. The poster session will be held on Wednesday, June 20, in the late afternoon, and it will start with a poster advertising session, during which the author(s) of each poster will give a very short presentation (2 slides, 1-2 minutes) of their poster. Following these presentations, the poster exhibition will be opened for visiting and, we hope, for fruitful discussions. Therefore, we kindly request at least one author of each poster to be present throughout the entire session. For any questions about the submission, selection, and presentation of the accepted posters, please contact the Poster Session Chair - Ana Lucia Varbanescu, Delft University of Technology, The Netherlands. -- ================================================================= Ioan Raicu, Ph.D. 
Assistant Professor, Illinois Institute of Technology (IIT) Guest Research Faculty, Argonne National Laboratory (ANL) ================================================================= Data-Intensive Distributed Systems Laboratory, CS/IIT Distributed Systems Laboratory, MCS/ANL ================================================================= Cel: 1-847-722-0876 Office: 1-312-567-5704 Email: iraicu at cs.iit.edu Web: http://www.cs.iit.edu/~iraicu/ Web: http://datasys.cs.iit.edu/ ================================================================= ================================================================= From iraicu at cs.iit.edu Sat Apr 14 07:42:32 2012 From: iraicu at cs.iit.edu (Ioan Raicu) Date: Sat, 14 Apr 2012 07:42:32 -0500 Subject: [Swift-user] Call for Participation: Cloud Futures 2012, Berkeley, CA (May 7-8) In-Reply-To: References: Message-ID: <4F8970B8.3030306@cs.iit.edu> *//* *//* */Cloud Futures: Hot Topics in Research and Education/* Berkeley, CA | May 7-8, 2012 http://research.microsoft.com/cloudfutures2012/ The Cloud Futures Workshop series brings together thought leaders from academia, industry, and government to discuss the role of cloud computing across a variety of research and educational areas---including computer science, engineering, Earth sciences, healthcare, humanities, interactive games, life sciences, and social sciences.Presentations, posters and discussions will highlight how new techniques, software platforms, and methods of research and teaching in the cloud may solve distinct challenges arising in those diverse areas. This year's Workshop is being hosted in conjunction with UC Berkeley. Conference Co-Chairs: Michael Franklin, UC Berkeley and Tony Hey, Microsoft Research *Register today! * Program: Day 1 Monday 05/07/2012 09:00 - 10:00 Keynote. 
Science In the Cloud, Joseph Hellerstein, Manager Big Science, Google_ 10:00 - 10:30 Break 10:30 - 12:00 plenary session ?10:30 -- 11:00 Advancing Declarative Query for Data-Intensive Science in the Cloud, Bill Howe, University of Washington ?11:00 -- 11:30 Programming Paradigms for Technical Computing on Clouds and Supercomputers, Geoffrey Fox, Indiana University, Dennis Gannon, Microsoft ?11:30 -- 12:00 Cloud Computing for Fundamental Spatial Operations on Polygonal GIS Data, Sushil Prasad, Dinesh Agarwal, Satish Puri, Xi He, Georgia State University 12:00 - 02:00 Lunch Posters 02:00 - 3:30 Session 1a Education ?02:00 -- 2:30 InstantLab 2.0 - A Platform for Operating System Experiments on Public Cloud Infrastructure, Andreas Polze, Christian Neuhaus, Rehab Alnemr, Lysann Kessler and Frank Schlegel, University of Potsdam ?02:30 -- 3:00 Case Study on Cloud Computing Infusion at a Leading Tertiary Institution in Singapore, Choong Wu Gary Lim, Nanyang Polytechnic ?03:00 -- 3:30 Teaching Web-scale Data Management using Microsoft Azure: POSTECH Experiences, Seung-won Hwang, POSTECH 02:00 -- 3:30 Session 1b Life Sciences ?02:00 -- 2:30 A-Brain: Using the Cloud to Understand the Impact of Genetic Variability on the Brain, Alexandru Costan, Radu Tudoran, Benoit Da Mota, Gabriel Antoniu and Bertrand Thirion, INRIA Rennes and Saclay ?02:30 -- 3:00 Very Large Scale Operon Predictions via Comparative Genomics, Ehsan Tabari, ZhengChang Su, UNC Charlotte ?03:00 -- 3:30 Fast Exploration of the QSAR Model Space with e-Science Central and Windows Azure, Jacek Cala, Hugo Hiden, Simon Woodman, Paul Watson, Newcastle University 03:30 - 04:00 Break 04:00 - 05:30 Session 1c Interactive Services ?04:00 -- 4:30 3D Remote Collaboration Framework for Virtual Cultural Heritage, Yasuhide Okamoto, Gregorij Kurillo, Ruzena Bajcsy University of California, Berkeley, Takeshi Oishi, Katsushi Ikeuchi , University of Tokyo ?04:30 -- 5:00 Interactive 3D Services over Windows Azure, Lukas Kencl, Jiri Danihelka , Czech Technical University , Prague ?05:00 -- 5:30 Microsoft Azure and the Kinect Join the World of Telemedicine to Save Lives, Janet Bailey, Aaron Rothberg, University of Arkansas, Bradley Jensen, Microsoft 04:00 -- 05:30 Session 1d Environmental Applications ?04:00 -- 4:30 Cloud-based Exploration of Complex Ecosystems for Science, Education and Entertainment, Ilmi Yoon, Sangyuk Yoon, Gary Ng, Hunvil Rodrigues, Sonal Mahajan, San Francisco State University, Neo Martinez, Pacific Ecoinformatics Lab ?04:30 -- 5:00 Cloud Computing as a Cyber-Infrastructure for Mass Customization and Collaboration, Kwa-Sur Tam, Virginia Tech ?05:00 -- 5:30 Green Prefab: Civil Engineering Hub In Ms Windows Azure, Furio Barzon, Collaboratorio, Italy 06:00-09:00 Dinner Day 2 Tuesday 05/08/2012 09:00 - 10:00 Keynote, Yousef Khalidi, Distinguished Engineer, Microsoft Corporation, Large Scale Cloud Computing: Opportunities and Challenges 10:00 - 10:30 Break 10:30 - 12:00 Plenary Session ?10:30 -- 11:00 Vision Paper: Towards an Understanding of the Limits of Map-Reduce Computation, Anish Das Sarma, Google Research, Semih Salihogluz, Jeffrey D. Ullman, Stanford University, Foto Afrati, National Technical University Athens ?11:00 -- 11:30 Twister4Azure: Parallel Data Analytics on Azure, Judy Qiu, Thilina Gunarathne, Indiana University ?11:30 -- 12:00 CumuloNimbo: Parallel-Distributed Transactional Processing, Ricardo Jimenez-Peris, Marta Pati?o-Mart?nez, Iv?n Brondino, Universidad Politecnica de Madrid, Jos? 
Pereira, Rui Oliveira, Ricardo Vila?a, University Minho, Bettina Kemme, Yousuf Ahmad, McGill Univ. 12:00 - 02:00 Lunch Posters 02:00 - 3:30 Session 2a Social and Mobile Services ?02:00 -- 2:30 An Efficient Meet-Up Mechanism by Mashing-up Social and Mobile Clouds, Li-Chun Wang, Chia-Yu Lin, Yu-Jia Chen, Yu-Chee Tseng, National Chiao Tung University ?02:30 -- 3:00 Scalable, Secure Analysis of Social Sciences Data on the Azure Platform, Yogesh Simmhan, Litao Den, Alok Kumbhare, Mark Redekopp and Viktor Prasanna, University of Southern California ?03:00 -- 3:30 Remote Software Service for Mobile Clients leveraging Cloud Computing, Chunming Hu, Beihang University 02:00 -- 3:30 Session 2b Computational Models and applications ?02:00 -- 2:30 Enabling cloud interoperability with COMPSs, Daniele Lezzi, Fabrizio Marozzo, Francesc Lordan, Roger Rafanell, Rosa Badia, Barcelona Supercomputer Center, Domenico Talia, University of Calabria ?02:30 -- 3:00 McCloud: Monte Carlo Service in Windows Azure, Rafael Nasser, Karin Breitman, Rubens Sampaio, Americo Cunha, Helio Vieira, PUC-Rio ?03:00 -- 3:30 Experiences using Windows Azure to Calibrate Watershed Models, Marty Humphrey, Norm Beekwilder, University of Virginia, Jon Goodall, Mehmet Ercan, University of South Carolina 03:30 - 04:00 Break 04:00 - 05:30 Panel TBD 05:30 - Close -- ================================================================= Ioan Raicu, Ph.D. Assistant Professor, Illinois Institute of Technology (IIT) Guest Research Faculty, Argonne National Laboratory (ANL) ================================================================= Data-Intensive Distributed Systems Laboratory, CS/IIT Distributed Systems Laboratory, MCS/ANL ================================================================= Cel: 1-847-722-0876 Office: 1-312-567-5704 Email:iraicu at cs.iit.edu Web:http://www.cs.iit.edu/~iraicu/ Web:http://datasys.cs.iit.edu/ ================================================================= ================================================================= -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/jpeg Size: 4462 bytes Desc: not available URL: From iraicu at cs.iit.edu Sat Apr 14 07:55:42 2012 From: iraicu at cs.iit.edu (Ioan Raicu) Date: Sat, 14 Apr 2012 07:55:42 -0500 Subject: [Swift-user] CFP: 13th IEEE/ACM Int. Conf. on Grid Computing (GRID) 2012 Message-ID: <4F8973CE.9020905@cs.iit.edu> Call for papers *Grid 2012: 13th IEEE/ACM International Conference on Grid Computing* Beijing, China September 20-23, 2012 http://grid2012.meepo.org Co-located with ChinaGrid'12 Grid computing enables the sharing of distributed computing and data resources such as processing, network bandwidth and storage capacity to create a cohesive resource environment for executing distributed applications. The Grid conference series is an annual international meeting that brings together a community of researchers, developers, practitioners, and users involved with Grid technology. The objective of the meeting is to serve as both the premier venue for presenting foremost research results in the area and as a forum for introducing and exploring new concepts. In 2012, the Grid conference will come to China for the first time and will be held in Beijing, co-located with ChinaGrid'12. Grid 2012 will have a focus on important and immediate issues that are significantly influencing grid computing. 
Scope Grid 2012 topics of interest include, but are not limited to: * Architecture * Middleware and toolkits * Resource management, scheduling, and runtime environments * Performance modeling and evaluation * Programming models, tools and environments * Metadata, ontologies, and provenance * Cloud computing * Virtualization and grid computing * Scientific workflow * Storage systems and data management * Data-intensive computing and processing * QoS and SLA Negotiation * Applications and experiences in science, engineering, business and society Paper Submission Authors are invited to submit original papers (not published or currently under review for any other conference or journal). Submitted manuscripts should not exceed 8 letter size (8.5 x 11) pages including figures, tables and references using the IEEE format for conference proceedings. Authors should submit the manuscript in PDF format via https://www.easychair.org/conferences/?conf=grid12 All submitted papers will be reviewed by program committee members and selected based on their originality, correctness, technical strength, significance, quality of presentation, and interest and relevance to the conference attendees. Accepted papers will be published in the IEEE categorized conference proceedings and will be made available online through the IEEE Xplore and the CS Digital Library. Go to paper submission page... Important Dates Papers Submission Due: 15 March 2012 Extended to 15 April 2012. Notification of Acceptance: 15 May 2012 Camera Ready Papers Due: 15 June 2012 Committees Organising Committee * *General Co-Chairs:* o Dieter Kranzlmueller, Ludwig-Maximilians-Universit?t, Germany o Weimin Zheng, Tsinghua University, China * *Programme Co-Chairs:* o Rajkumar Buyya, University of Melbourne, Australia o Hai Jin, Huazhong University of Science and Technology, China * *Local Organization Chair:* Yongwei Wu, Tsinghua University, China * *Finance Chair:* Kang Chen, Tsinghua University, China Programme Committee (To be comfirmed) * *Programme Co-Chairs:* o Rajkumar Buyya, University of Melbourne, Australia o Hai Jin, Huazhong University of Science and Technology, China * *Workshop & Poster Chair:* Jinlei Jiang, Tsinghua University, China * *Vice Chair -- Clouds and Virtualisation: * Roger Barga, Microsoft Research * *Vice Chair -- Distributed Production Cyberinfrastructure and Middleware:* Andrew Grimshaw, Univ. of Virginia, US * *Vice Chair -- e-Research and Applications:* Daniel S. Katz, Univ. 
of Chicago & Argonne National Laboratory, US * *Vice Chair -- Tools & Services, Resource Management & Runtime Environments:* Ramin Yahyapour, Dortmund * *Vice Chair -- Distributed Data-Intensive Science and Systems:* Erwin Laure, KTH, Sweeden * *Publishing Chair: *Ran Zheng, Huazhong University of Science and Technology, China * *Publicity Chairs:* o Gilles Fedak, INRIA/LIP, France o Ioan Raicu, Illinois Institute of Technology and Argonne National Laboratory, USA o Xuanhua Shi, Huazhong University of Science and Technology, China * *Program Committe:* o David Abramson, Monash University, Australia o Gabrielle Allen, Louisiana State University, USA o Andreas Aschenbrenner, Austrian Academy of Sciences o David Bader, Georgia Institute of Technology, USA o Rosa Badia, UPC, Spain o Henri Bal, Vrije Universiteit, Netherlands o Chaitanya Baru, San Diego Supercomputer Center, US o Eloisa Bentivegna, Max Planck Institute for Gravitational Physics, Germany o Ignacio Blanquer, Universidad Polit?cnica de Valencia, Spain o Jinlei Jiang, Tsinghua University, China o Neil Chue Hong, EPCC, UK o Marco Danelutto, Universit? di Pisa, Italy o Eva Deelman, ISI, USC , US o Frederic Desprez, INRIA-LIP, France o Jim Dowling, SICS, Sweden o Jaliya Ekanayake, Microsoft Research, US o Erik Elmroth, Ume? University, Sweden o Vangelis Floros, GRNET, Greece o Ian Foster, Univ. of Chicago, US o Patrick Fuhrmann, DESY, DE o Kang Chen, Tsinghua University, China o Rob Gillen, Oak Ridge National Laboratory , US o Marty Humphrey, University of Virginia, US o Jens Jensen, STFC, UK o Kate Keahey, Argonne National Laboratory, US o Thilo Kielmann, Vrije Universiteit, The Netherlands o Bastian Koller, HLRS, Germany o Tevfik Kosar, University at Buffalo, US o Nicolas Kourtellis, University of South Florida, USA o Patricia Kovatch, University of Tennessee, USA o Dieter Kranzlm?ller, Ludwig-Maximilians-Universit?t M?nchen, Germany o Peter Kunszt, SystemsX, Switzerland o Miron Livny, Univ. of Wisconsin, US o Hideo Matsuda, University of Osaka, Japan o Satoshi Matsuoka, Tokyo Institute of Technology, Japan o Jim Myers, Rensselaer Polytechnic Institute, US o Steven Newhouse, EGI, NL o Manish Parashar, Rutgers, USA o Judy Qiu, Indiana University, US o Ioan Raicu, Illinois Institute of Technology and Argonne National Laboratory, USA o Alistair Rendell, Australian National University, Australia o Karolina Sarnowska-Upton, Univ. of Virginia, US o Heiko Schuldt, Basel University, Switzerland o Richard Sinnott, University of Melbourne, Australia o Alan Sill, Texas-Tech, US o Alex Sim, LBL, US o Mark Stillwell, INRIA-Universit? de Lyon-LIP, France o Alan Sussman, University of Maryland, USA o Osamu Tatebe, Tsukuba University, Japan o Domenico Talia, Universit? della Calabria, Italiy o Douglas Thain, University of Notre Dame, US o David Wallom, Oxford University, UK -- ================================================================= Ioan Raicu, Ph.D. 
Assistant Professor, Illinois Institute of Technology (IIT) Guest Research Faculty, Argonne National Laboratory (ANL) ================================================================= Data-Intensive Distributed Systems Laboratory, CS/IIT Distributed Systems Laboratory, MCS/ANL ================================================================= Cel: 1-847-722-0876 Office: 1-312-567-5704 Email: iraicu at cs.iit.edu Web: http://www.cs.iit.edu/~iraicu/ Web: http://datasys.cs.iit.edu/ ================================================================= ================================================================= From lpesce at uchicago.edu Sat Apr 14 08:15:39 2012 From: lpesce at uchicago.edu (Lorenzo Pesce) Date: Sat, 14 Apr 2012 08:15:39 -0500 Subject: [Swift-user] Error message on Cray XE6 In-Reply-To: <27BED777-D32A-43D9-9697-1094B033C02F@mcs.anl.gov> References: <06C0E895-24C2-4CFF-A239-D647B9F1EA27@uchicago.edu> <27BED777-D32A-43D9-9697-1094B033C02F@mcs.anl.gov> Message-ID: <6517FCFD-2A6A-4EDC-84AC-C6968B47E845@uchicago.edu> In principle the access to the /home filesystem should still be there. The only thing I did was to change the cf file to remove some errors I had in it, so that might also be the source of the problem. This is what it looks like now: (BTW, the comments are not mine, I run swift only from lustre) # Whether to transfer the wrappers from the compute nodes # I like to launch from my home dir, but keep everything on # lustre wrapperlog.always.transfer=false #Indicates whether the working directory on the remote site # should be left intact even when a run completes successfully sitedir.keep=true #try only once execution.retries=1 # Attempt to run as much as possible, e.g., ignore non-fatal errors lazy.errors=true # to reduce filesystem access status.mode=provider use.provider.staging=false provider.staging.pin.swiftfiles=false foreach.max.threads=100 provenance.log=false On Apr 14, 2012, at 12:10 AM, Jonathan Monette wrote: > The perl script is the worker script that is submitted with PBS. I have not tried to run on Beagle since the maintenance period has ended so I am not exactly sure why the error popped up. One reason could be that the home file system is no longer mounted on the compute nodes. I know they spoke about that being a possibility but not sure they implemented that during the maintenance period. Do you know if the home file system is still mounted on the compute nodes? > > On Apr 13, 2012, at 17:18, Lorenzo Pesce wrote: > >> Hi -- >> I haven't seen this one before: >> >> Can't open perl script "/home/lpesce/.globus/coasters/cscript7176272791806289394.pl": No such file or directory >> >> The config of the cray has changed, might this have anything to do with it? >> I have no idea what perl script is it talking about and why it is looking to home.
>> >> Thanks a lot, >> >> Lorenzo >> >> >> >> _______________________________________________ >> Swift-user mailing list >> Swift-user at ci.uchicago.edu >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user From wilde at mcs.anl.gov Sat Apr 14 09:58:22 2012 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sat, 14 Apr 2012 09:58:22 -0500 (CDT) Subject: [Swift-user] Error message on Cray XE6 In-Reply-To: <6517FCFD-2A6A-4EDC-84AC-C6968B47E845@uchicago.edu> Message-ID: <1515044644.141298.1334415502139.JavaMail.root@zimbra.anl.gov> /home is no longer mounted by the compute nodes, per the post-maitenance summary: "External filesystem dependencies minimized: Compute nodes and the scheduler should now continue to process and complete jobs without the threat of interference of external filesystem outages. /gpfs/pads is only available on login1 through login5; /home is on login and mom nodes only." So we need to (finally) remove Swift's dependence on $HOME/.globus and $HOME/.globus/scripts in particular. I suggest - since the swift command already needs to write to "." - that we create a scripts/ directory in "." instead of $HOME/.globus. And this should be used by any provider that would have previously created files below .globus. I'll echo this to swift-devel and start a thread there to discuss. Its possible there's already a property to cause scripts/ to be created elsewhere. If not, I think we should make one. I think grouping the scripts created by a run into the current dir, along with the swift log, _concurrent, and (in the conventions I use in my run scripts) swiftwork/. Lorenzo, hopefully we can at least get you a workaround for this soon. You *might* be able to trick swift into doing this by setting HOME=/lustre/beagle/$USER. I already tried a symlink under .globus and that didnt work, as /home is not even readable by the compute nodes, which in this case need to run the coaster worker (.pl) script. - Mike ----- Original Message ----- > From: "Lorenzo Pesce" > To: "Jonathan Monette" > Cc: swift-user at ci.uchicago.edu > Sent: Saturday, April 14, 2012 8:15:39 AM > Subject: Re: [Swift-user] Error message on Cray XE6 > In principle the access to the /home filesystem should still be there. > > The only thing I did was to chance the cf file to remove some errors I > had into it, so that might also be the source of the problem. This is > what it looks like now: > (BTW, the comments are not mine, I run swift only from lustre) > > > # Whether to transfer the wrappers from the compute nodes > # I like to launch from my home dir, but keep everything on > # lustre > wrapperlog.always.transfer=false > > #Indicates whether the working directory on the remote site > # should be left intact even when a run completes successfully > sitedir.keep=true > > #try only once > execution.retries=1 > > # Attempt to run as much as possible, i.g., ignore non-fatal errors > lazy.errors=true > > # to reduce filesystem access > status.mode=provider > > use.provider.staging=false > > provider.staging.pin.swiftfiles=false > > foreach.max.threads=100 > > provenance.log=false > > > > > On Apr 14, 2012, at 12:10 AM, Jonathan Monette wrote: > > > The perl script is the worker script that is submitted with PBS. I > > have not tried to run on Beagle since the maintenance period has > > ended so I am not exactly sure why the error popped up. One reason > > could be that the home file system is no longer mounted on the > > compute nodes. 
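A quick way to settle that question is to run a trivial command on a compute node and see whether /home resolves. The lines below are only a sketch, not something reported in this thread: they assume Beagle's PBS interactive mode and the standard Cray aprun launcher, and the mppwidth and walltime values are placeholders.

# request an interactive allocation, then run a one-task ls on a compute node
qsub -I -l mppwidth=24 -l walltime=00:10:00
aprun -n 1 ls -ld /home/$USER /lustre/beagle/$USER
# if /home is unmounted on the compute nodes, the first path should fail
# while the lustre path lists normally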
I know they spoke about that being a possibility but > > not sure they implemented that during the maintenance period. Do you > > know if the home file system is still mounted on the compute nodes? > > > > On Apr 13, 2012, at 17:18, Lorenzo Pesce > > wrote: > > > >> Hi -- > >> I haven't seen this one before: > >> > >> Can't open perl script > >> "/home/lpesce/.globus/coasters/cscript7176272791806289394.pl": No > >> such file or directory > >> > >> The config of the cray has changed, might this have anything to do > >> with it? > >> I have no idea what perl script is it talking about and why it is > >> looking to home. > >> > >> Thanks a lot, > >> > >> Lorenzo > >> > >> > >> > >> _______________________________________________ > >> Swift-user mailing list > >> Swift-user at ci.uchicago.edu > >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From jonmon at mcs.anl.gov Sat Apr 14 10:02:14 2012 From: jonmon at mcs.anl.gov (Jonathan Monette) Date: Sat, 14 Apr 2012 10:02:14 -0500 Subject: [Swift-user] Error message on Cray XE6 In-Reply-To: <1515044644.141298.1334415502139.JavaMail.root@zimbra.anl.gov> References: <1515044644.141298.1334415502139.JavaMail.root@zimbra.anl.gov> Message-ID: <1193EE13-FCE1-40C7-B2AC-6E6499BB0EA1@mcs.anl.gov> That is an easy fix I believe. I know where the code is so I will change and test. In the mean time could you try something? Try setting user.home= in your config file and try again. On Apr 14, 2012, at 9:58, Michael Wilde wrote: > /home is no longer mounted by the compute nodes, per the post-maitenance summary: > > "External filesystem dependencies minimized: Compute nodes and the scheduler should now continue to process and complete jobs without the threat of interference of external filesystem outages. /gpfs/pads is only available on login1 through login5; /home is on login and mom nodes only." > > So we need to (finally) remove Swift's dependence on $HOME/.globus and $HOME/.globus/scripts in particular. > > I suggest - since the swift command already needs to write to "." - that we create a scripts/ directory in "." instead of $HOME/.globus. And this should be used by any provider that would have previously created files below .globus. > > I'll echo this to swift-devel and start a thread there to discuss. Its possible there's already a property to cause scripts/ to be created elsewhere. If not, I think we should make one. I think grouping the scripts created by a run into the current dir, along with the swift log, _concurrent, and (in the conventions I use in my run scripts) swiftwork/. > > Lorenzo, hopefully we can at least get you a workaround for this soon. > > You *might* be able to trick swift into doing this by setting HOME=/lustre/beagle/$USER. I already tried a symlink under .globus and that didnt work, as /home is not even readable by the compute nodes, which in this case need to run the coaster worker (.pl) script. > > - Mike > > > ----- Original Message ----- >> From: "Lorenzo Pesce" >> To: "Jonathan Monette" >> Cc: swift-user at ci.uchicago.edu >> Sent: Saturday, April 14, 2012 8:15:39 AM >> Subject: Re: [Swift-user] Error message on Cray XE6 >> In principle the access to the /home filesystem should still be there. 
>> >> The only thing I did was to chance the cf file to remove some errors I >> had into it, so that might also be the source of the problem. This is >> what it looks like now: >> (BTW, the comments are not mine, I run swift only from lustre) >> >> >> # Whether to transfer the wrappers from the compute nodes >> # I like to launch from my home dir, but keep everything on >> # lustre >> wrapperlog.always.transfer=false >> >> #Indicates whether the working directory on the remote site >> # should be left intact even when a run completes successfully >> sitedir.keep=true >> >> #try only once >> execution.retries=1 >> >> # Attempt to run as much as possible, i.g., ignore non-fatal errors >> lazy.errors=true >> >> # to reduce filesystem access >> status.mode=provider >> >> use.provider.staging=false >> >> provider.staging.pin.swiftfiles=false >> >> foreach.max.threads=100 >> >> provenance.log=false >> >> >> >> >> On Apr 14, 2012, at 12:10 AM, Jonathan Monette wrote: >> >>> The perl script is the worker script that is submitted with PBS. I >>> have not tried to run on Beagle since the maintenance period has >>> ended so I am not exactly sure why the error popped up. One reason >>> could be that the home file system is no longer mounted on the >>> compute nodes. I know they spoke about that being a possibility but >>> not sure they implemented that during the maintenance period. Do you >>> know if the home file system is still mounted on the compute nodes? >>> >>> On Apr 13, 2012, at 17:18, Lorenzo Pesce >>> wrote: >>> >>>> Hi -- >>>> I haven't seen this one before: >>>> >>>> Can't open perl script >>>> "/home/lpesce/.globus/coasters/cscript7176272791806289394.pl": No >>>> such file or directory >>>> >>>> The config of the cray has changed, might this have anything to do >>>> with it? >>>> I have no idea what perl script is it talking about and why it is >>>> looking to home. >>>> >>>> Thanks a lot, >>>> >>>> Lorenzo >>>> >>>> >>>> >>>> _______________________________________________ >>>> Swift-user mailing list >>>> Swift-user at ci.uchicago.edu >>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user >> >> _______________________________________________ >> Swift-user mailing list >> Swift-user at ci.uchicago.edu >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > From wilde at mcs.anl.gov Sat Apr 14 10:10:00 2012 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sat, 14 Apr 2012 10:10:00 -0500 (CDT) Subject: [Swift-user] Error message on Cray XE6 In-Reply-To: <1193EE13-FCE1-40C7-B2AC-6E6499BB0EA1@mcs.anl.gov> Message-ID: <749553073.141331.1334416200913.JavaMail.root@zimbra.anl.gov> I just tried both setting HOME=/lustre/beagle/wilde and setting user.home to the same thing. Neither works. I think user.home is coming from the Java property, and that doesnt seem to be influenced by the HOME env var. I was about to look if Java can be asked to change home. Maybe by setting a command line arg to Java. - Mike ----- Original Message ----- > From: "Jonathan Monette" > To: "Michael Wilde" > Cc: "Lorenzo Pesce" , swift-user at ci.uchicago.edu > Sent: Saturday, April 14, 2012 10:02:14 AM > Subject: Re: [Swift-user] Error message on Cray XE6 > That is an easy fix I believe. I know where the code is so I will > change and test. > > In the mean time could you try something? Try setting > user.home= > in your config file and try again. 
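For concreteness, the two attempts described in that message would look roughly like the following; both were reported not to work, apparently because user.home comes from the JVM itself rather than from the shell environment or the cf properties file. The script name is a placeholder, and the swift options mirror the ones used later in this thread.

# attempt 1: point the shell HOME at lustre before launching swift
HOME=/lustre/beagle/$USER swift -config cf -tc.file tc -sites.file pbs.xml myscript.swift

# attempt 2: a user.home entry added to the cf properties file
user.home=/lustre/beagle/$USER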
> > On Apr 14, 2012, at 9:58, Michael Wilde wrote: > > > /home is no longer mounted by the compute nodes, per the > > post-maitenance summary: > > > > "External filesystem dependencies minimized: Compute nodes and the > > scheduler should now continue to process and complete jobs without > > the threat of interference of external filesystem outages. > > /gpfs/pads is only available on login1 through login5; /home is on > > login and mom nodes only." > > > > So we need to (finally) remove Swift's dependence on $HOME/.globus > > and $HOME/.globus/scripts in particular. > > > > I suggest - since the swift command already needs to write to "." - > > that we create a scripts/ directory in "." instead of $HOME/.globus. > > And this should be used by any provider that would have previously > > created files below .globus. > > > > I'll echo this to swift-devel and start a thread there to discuss. > > Its possible there's already a property to cause scripts/ to be > > created elsewhere. If not, I think we should make one. I think > > grouping the scripts created by a run into the current dir, along > > with the swift log, _concurrent, and (in the conventions I use in my > > run scripts) swiftwork/. > > > > Lorenzo, hopefully we can at least get you a workaround for this > > soon. > > > > You *might* be able to trick swift into doing this by setting > > HOME=/lustre/beagle/$USER. I already tried a symlink under .globus > > and that didnt work, as /home is not even readable by the compute > > nodes, which in this case need to run the coaster worker (.pl) > > script. > > > > - Mike > > > > > > ----- Original Message ----- > >> From: "Lorenzo Pesce" > >> To: "Jonathan Monette" > >> Cc: swift-user at ci.uchicago.edu > >> Sent: Saturday, April 14, 2012 8:15:39 AM > >> Subject: Re: [Swift-user] Error message on Cray XE6 > >> In principle the access to the /home filesystem should still be > >> there. > >> > >> The only thing I did was to chance the cf file to remove some > >> errors I > >> had into it, so that might also be the source of the problem. This > >> is > >> what it looks like now: > >> (BTW, the comments are not mine, I run swift only from lustre) > >> > >> > >> # Whether to transfer the wrappers from the compute nodes > >> # I like to launch from my home dir, but keep everything on > >> # lustre > >> wrapperlog.always.transfer=false > >> > >> #Indicates whether the working directory on the remote site > >> # should be left intact even when a run completes successfully > >> sitedir.keep=true > >> > >> #try only once > >> execution.retries=1 > >> > >> # Attempt to run as much as possible, i.g., ignore non-fatal errors > >> lazy.errors=true > >> > >> # to reduce filesystem access > >> status.mode=provider > >> > >> use.provider.staging=false > >> > >> provider.staging.pin.swiftfiles=false > >> > >> foreach.max.threads=100 > >> > >> provenance.log=false > >> > >> > >> > >> > >> On Apr 14, 2012, at 12:10 AM, Jonathan Monette wrote: > >> > >>> The perl script is the worker script that is submitted with PBS. I > >>> have not tried to run on Beagle since the maintenance period has > >>> ended so I am not exactly sure why the error popped up. One reason > >>> could be that the home file system is no longer mounted on the > >>> compute nodes. I know they spoke about that being a possibility > >>> but > >>> not sure they implemented that during the maintenance period. Do > >>> you > >>> know if the home file system is still mounted on the compute > >>> nodes? 
> >>> > >>> On Apr 13, 2012, at 17:18, Lorenzo Pesce > >>> wrote: > >>> > >>>> Hi -- > >>>> I haven't seen this one before: > >>>> > >>>> Can't open perl script > >>>> "/home/lpesce/.globus/coasters/cscript7176272791806289394.pl": No > >>>> such file or directory > >>>> > >>>> The config of the cray has changed, might this have anything to > >>>> do > >>>> with it? > >>>> I have no idea what perl script is it talking about and why it is > >>>> looking to home. > >>>> > >>>> Thanks a lot, > >>>> > >>>> Lorenzo > >>>> > >>>> > >>>> > >>>> _______________________________________________ > >>>> Swift-user mailing list > >>>> Swift-user at ci.uchicago.edu > >>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > >> > >> _______________________________________________ > >> Swift-user mailing list > >> Swift-user at ci.uchicago.edu > >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > > > -- > > Michael Wilde > > Computation Institute, University of Chicago > > Mathematics and Computer Science Division > > Argonne National Laboratory > > -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wilde at mcs.anl.gov Sat Apr 14 10:13:40 2012 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sat, 14 Apr 2012 10:13:40 -0500 (CDT) Subject: [Swift-user] Error message on Cray XE6 In-Reply-To: <749553073.141331.1334416200913.JavaMail.root@zimbra.anl.gov> Message-ID: <1531575228.141345.1334416420380.JavaMail.root@zimbra.anl.gov> stackoverflow says this should work: java -Duser.home= Need to get that in via the swift command. - Mike ----- Original Message ----- > From: "Michael Wilde" > To: "Jonathan Monette" > Cc: "Lorenzo Pesce" , swift-user at ci.uchicago.edu > Sent: Saturday, April 14, 2012 10:10:00 AM > Subject: Re: [Swift-user] Error message on Cray XE6 > I just tried both setting HOME=/lustre/beagle/wilde and setting > user.home to the same thing. Neither works. I think user.home is > coming from the Java property, and that doesnt seem to be influenced > by the HOME env var. I was about to look if Java can be asked to > change home. Maybe by setting a command line arg to Java. > > - Mike > > ----- Original Message ----- > > From: "Jonathan Monette" > > To: "Michael Wilde" > > Cc: "Lorenzo Pesce" , > > swift-user at ci.uchicago.edu > > Sent: Saturday, April 14, 2012 10:02:14 AM > > Subject: Re: [Swift-user] Error message on Cray XE6 > > That is an easy fix I believe. I know where the code is so I will > > change and test. > > > > In the mean time could you try something? Try setting > > user.home= > > in your config file and try again. > > > > On Apr 14, 2012, at 9:58, Michael Wilde wrote: > > > > > /home is no longer mounted by the compute nodes, per the > > > post-maitenance summary: > > > > > > "External filesystem dependencies minimized: Compute nodes and the > > > scheduler should now continue to process and complete jobs without > > > the threat of interference of external filesystem outages. > > > /gpfs/pads is only available on login1 through login5; /home is on > > > login and mom nodes only." > > > > > > So we need to (finally) remove Swift's dependence on $HOME/.globus > > > and $HOME/.globus/scripts in particular. > > > > > > I suggest - since the swift command already needs to write to "." > > > - > > > that we create a scripts/ directory in "." instead of > > > $HOME/.globus. 
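The -Duser.home suggestion works because user.home is an ordinary JVM system property that can be overridden at JVM startup. A stand-alone way to see the effect, using nothing but a JDK and a throwaway class (the class below is an illustration, not part of Swift):

cat > PrintHome.java <<'EOF'
public class PrintHome {
    public static void main(String[] args) {
        // print whatever directory the JVM considers the user's home
        System.out.println(System.getProperty("user.home"));
    }
}
EOF
javac PrintHome.java
java PrintHome                                    # normally prints /home/<username>
java -Duser.home=/lustre/beagle/$USER PrintHome   # now prints /lustre/beagle/<username>

Getting that same -D flag onto the java command line that the swift wrapper constructs is what the follow-up message with the bin/swift workaround appears to automate.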
> > > And this should be used by any provider that would have previously > > > created files below .globus. > > > > > > I'll echo this to swift-devel and start a thread there to discuss. > > > Its possible there's already a property to cause scripts/ to be > > > created elsewhere. If not, I think we should make one. I think > > > grouping the scripts created by a run into the current dir, along > > > with the swift log, _concurrent, and (in the conventions I use in > > > my > > > run scripts) swiftwork/. > > > > > > Lorenzo, hopefully we can at least get you a workaround for this > > > soon. > > > > > > You *might* be able to trick swift into doing this by setting > > > HOME=/lustre/beagle/$USER. I already tried a symlink under .globus > > > and that didnt work, as /home is not even readable by the compute > > > nodes, which in this case need to run the coaster worker (.pl) > > > script. > > > > > > - Mike > > > > > > > > > ----- Original Message ----- > > >> From: "Lorenzo Pesce" > > >> To: "Jonathan Monette" > > >> Cc: swift-user at ci.uchicago.edu > > >> Sent: Saturday, April 14, 2012 8:15:39 AM > > >> Subject: Re: [Swift-user] Error message on Cray XE6 > > >> In principle the access to the /home filesystem should still be > > >> there. > > >> > > >> The only thing I did was to chance the cf file to remove some > > >> errors I > > >> had into it, so that might also be the source of the problem. > > >> This > > >> is > > >> what it looks like now: > > >> (BTW, the comments are not mine, I run swift only from lustre) > > >> > > >> > > >> # Whether to transfer the wrappers from the compute nodes > > >> # I like to launch from my home dir, but keep everything on > > >> # lustre > > >> wrapperlog.always.transfer=false > > >> > > >> #Indicates whether the working directory on the remote site > > >> # should be left intact even when a run completes successfully > > >> sitedir.keep=true > > >> > > >> #try only once > > >> execution.retries=1 > > >> > > >> # Attempt to run as much as possible, i.g., ignore non-fatal > > >> errors > > >> lazy.errors=true > > >> > > >> # to reduce filesystem access > > >> status.mode=provider > > >> > > >> use.provider.staging=false > > >> > > >> provider.staging.pin.swiftfiles=false > > >> > > >> foreach.max.threads=100 > > >> > > >> provenance.log=false > > >> > > >> > > >> > > >> > > >> On Apr 14, 2012, at 12:10 AM, Jonathan Monette wrote: > > >> > > >>> The perl script is the worker script that is submitted with PBS. > > >>> I > > >>> have not tried to run on Beagle since the maintenance period has > > >>> ended so I am not exactly sure why the error popped up. One > > >>> reason > > >>> could be that the home file system is no longer mounted on the > > >>> compute nodes. I know they spoke about that being a possibility > > >>> but > > >>> not sure they implemented that during the maintenance period. Do > > >>> you > > >>> know if the home file system is still mounted on the compute > > >>> nodes? > > >>> > > >>> On Apr 13, 2012, at 17:18, Lorenzo Pesce > > >>> wrote: > > >>> > > >>>> Hi -- > > >>>> I haven't seen this one before: > > >>>> > > >>>> Can't open perl script > > >>>> "/home/lpesce/.globus/coasters/cscript7176272791806289394.pl": > > >>>> No > > >>>> such file or directory > > >>>> > > >>>> The config of the cray has changed, might this have anything to > > >>>> do > > >>>> with it? > > >>>> I have no idea what perl script is it talking about and why it > > >>>> is > > >>>> looking to home. 
> > >>>> > > >>>> Thanks a lot, > > >>>> > > >>>> Lorenzo > > >>>> > > >>>> > > >>>> > > >>>> _______________________________________________ > > >>>> Swift-user mailing list > > >>>> Swift-user at ci.uchicago.edu > > >>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > >> > > >> _______________________________________________ > > >> Swift-user mailing list > > >> Swift-user at ci.uchicago.edu > > >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > > > > > -- > > > Michael Wilde > > > Computation Institute, University of Chicago > > > Mathematics and Computer Science Division > > > Argonne National Laboratory > > > > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wilde at mcs.anl.gov Sat Apr 14 10:51:07 2012 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sat, 14 Apr 2012 10:51:07 -0500 (CDT) Subject: [Swift-user] Error message on Cray XE6 In-Reply-To: <1531575228.141345.1334416420380.JavaMail.root@zimbra.anl.gov> Message-ID: <1873029745.141382.1334418667851.JavaMail.root@zimbra.anl.gov> OK, here's a workaround for this problem: You need to add this line to the swift command bin/swift in your Swift release. After: updateOptions "$SWIFT_HOME" "swift.home" Add: updateOptions "$USER_HOME" "user.home" This is near line 92 in the version I tested, Swift trunk swift-r5739 cog-r3368. Then you can do: USER_HOME=/lustre/beagle/wilde swift -config cf -tc.file tc -sites.file pbs.xml catsn.swift -n=1 Lorenzo, if you are using "module load swift" we'll need to update that, or you can copy the swift release directory structure that module load points you to, then modify the swift command there, and put that modified release first in your PATH. We'll work out a way to get something like this into the production module and trunk. I dont know of other systems that are currently affected by this, but Im sure they will come up. - Mike ----- Original Message ----- > From: "Michael Wilde" > To: "Jonathan Monette" > Cc: swift-user at ci.uchicago.edu > Sent: Saturday, April 14, 2012 10:13:40 AM > Subject: Re: [Swift-user] Error message on Cray XE6 > stackoverflow says this should work: > > java -Duser.home= > > Need to get that in via the swift command. > > - Mike > > > ----- Original Message ----- > > From: "Michael Wilde" > > To: "Jonathan Monette" > > Cc: "Lorenzo Pesce" , > > swift-user at ci.uchicago.edu > > Sent: Saturday, April 14, 2012 10:10:00 AM > > Subject: Re: [Swift-user] Error message on Cray XE6 > > I just tried both setting HOME=/lustre/beagle/wilde and setting > > user.home to the same thing. Neither works. I think user.home is > > coming from the Java property, and that doesnt seem to be influenced > > by the HOME env var. I was about to look if Java can be asked to > > change home. Maybe by setting a command line arg to Java. > > > > - Mike > > > > ----- Original Message ----- > > > From: "Jonathan Monette" > > > To: "Michael Wilde" > > > Cc: "Lorenzo Pesce" , > > > swift-user at ci.uchicago.edu > > > Sent: Saturday, April 14, 2012 10:02:14 AM > > > Subject: Re: [Swift-user] Error message on Cray XE6 > > > That is an easy fix I believe. I know where the code is so I will > > > change and test. > > > > > > In the mean time could you try something? Try setting > > > user.home= > > > in your config file and try again. 
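Condensed, that workaround is a one-line addition to the launcher plus an environment variable at invocation time. The two updateOptions lines and the sample command are taken from the message above; the exact line number and the surrounding contents of bin/swift will differ between releases, so treat this as a sketch.

# in bin/swift of your Swift release, after the existing line
#   updateOptions "$SWIFT_HOME" "swift.home"
# add
#   updateOptions "$USER_HOME" "user.home"

# then point USER_HOME at a directory the compute nodes can see when invoking swift
USER_HOME=/lustre/beagle/$USER swift -config cf -tc.file tc -sites.file pbs.xml catsn.swift -n=1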
> > > > > > On Apr 14, 2012, at 9:58, Michael Wilde wrote: > > > > > > > /home is no longer mounted by the compute nodes, per the > > > > post-maitenance summary: > > > > > > > > "External filesystem dependencies minimized: Compute nodes and > > > > the > > > > scheduler should now continue to process and complete jobs > > > > without > > > > the threat of interference of external filesystem outages. > > > > /gpfs/pads is only available on login1 through login5; /home is > > > > on > > > > login and mom nodes only." > > > > > > > > So we need to (finally) remove Swift's dependence on > > > > $HOME/.globus > > > > and $HOME/.globus/scripts in particular. > > > > > > > > I suggest - since the swift command already needs to write to > > > > "." > > > > - > > > > that we create a scripts/ directory in "." instead of > > > > $HOME/.globus. > > > > And this should be used by any provider that would have > > > > previously > > > > created files below .globus. > > > > > > > > I'll echo this to swift-devel and start a thread there to > > > > discuss. > > > > Its possible there's already a property to cause scripts/ to be > > > > created elsewhere. If not, I think we should make one. I think > > > > grouping the scripts created by a run into the current dir, > > > > along > > > > with the swift log, _concurrent, and (in the conventions I use > > > > in > > > > my > > > > run scripts) swiftwork/. > > > > > > > > Lorenzo, hopefully we can at least get you a workaround for this > > > > soon. > > > > > > > > You *might* be able to trick swift into doing this by setting > > > > HOME=/lustre/beagle/$USER. I already tried a symlink under > > > > .globus > > > > and that didnt work, as /home is not even readable by the > > > > compute > > > > nodes, which in this case need to run the coaster worker (.pl) > > > > script. > > > > > > > > - Mike > > > > > > > > > > > > ----- Original Message ----- > > > >> From: "Lorenzo Pesce" > > > >> To: "Jonathan Monette" > > > >> Cc: swift-user at ci.uchicago.edu > > > >> Sent: Saturday, April 14, 2012 8:15:39 AM > > > >> Subject: Re: [Swift-user] Error message on Cray XE6 > > > >> In principle the access to the /home filesystem should still be > > > >> there. > > > >> > > > >> The only thing I did was to chance the cf file to remove some > > > >> errors I > > > >> had into it, so that might also be the source of the problem. > > > >> This > > > >> is > > > >> what it looks like now: > > > >> (BTW, the comments are not mine, I run swift only from lustre) > > > >> > > > >> > > > >> # Whether to transfer the wrappers from the compute nodes > > > >> # I like to launch from my home dir, but keep everything on > > > >> # lustre > > > >> wrapperlog.always.transfer=false > > > >> > > > >> #Indicates whether the working directory on the remote site > > > >> # should be left intact even when a run completes successfully > > > >> sitedir.keep=true > > > >> > > > >> #try only once > > > >> execution.retries=1 > > > >> > > > >> # Attempt to run as much as possible, i.g., ignore non-fatal > > > >> errors > > > >> lazy.errors=true > > > >> > > > >> # to reduce filesystem access > > > >> status.mode=provider > > > >> > > > >> use.provider.staging=false > > > >> > > > >> provider.staging.pin.swiftfiles=false > > > >> > > > >> foreach.max.threads=100 > > > >> > > > >> provenance.log=false > > > >> > > > >> > > > >> > > > >> > > > >> On Apr 14, 2012, at 12:10 AM, Jonathan Monette wrote: > > > >> > > > >>> The perl script is the worker script that is submitted with > > > >>> PBS. 
> > > >>> I > > > >>> have not tried to run on Beagle since the maintenance period > > > >>> has > > > >>> ended so I am not exactly sure why the error popped up. One > > > >>> reason > > > >>> could be that the home file system is no longer mounted on the > > > >>> compute nodes. I know they spoke about that being a > > > >>> possibility > > > >>> but > > > >>> not sure they implemented that during the maintenance period. > > > >>> Do > > > >>> you > > > >>> know if the home file system is still mounted on the compute > > > >>> nodes? > > > >>> > > > >>> On Apr 13, 2012, at 17:18, Lorenzo Pesce > > > >>> wrote: > > > >>> > > > >>>> Hi -- > > > >>>> I haven't seen this one before: > > > >>>> > > > >>>> Can't open perl script > > > >>>> "/home/lpesce/.globus/coasters/cscript7176272791806289394.pl": > > > >>>> No > > > >>>> such file or directory > > > >>>> > > > >>>> The config of the cray has changed, might this have anything > > > >>>> to > > > >>>> do > > > >>>> with it? > > > >>>> I have no idea what perl script is it talking about and why > > > >>>> it > > > >>>> is > > > >>>> looking to home. > > > >>>> > > > >>>> Thanks a lot, > > > >>>> > > > >>>> Lorenzo > > > >>>> > > > >>>> > > > >>>> > > > >>>> _______________________________________________ > > > >>>> Swift-user mailing list > > > >>>> Swift-user at ci.uchicago.edu > > > >>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > > >> > > > >> _______________________________________________ > > > >> Swift-user mailing list > > > >> Swift-user at ci.uchicago.edu > > > >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > > > > > > > -- > > > > Michael Wilde > > > > Computation Institute, University of Chicago > > > > Mathematics and Computer Science Division > > > > Argonne National Laboratory > > > > > > > > -- > > Michael Wilde > > Computation Institute, University of Chicago > > Mathematics and Computer Science Division > > Argonne National Laboratory > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From larsson at uchicago.edu Sat Apr 14 19:02:38 2012 From: larsson at uchicago.edu (Gustav Larsson) Date: Sat, 14 Apr 2012 19:02:38 -0500 Subject: [Swift-user] OS X dependencies Message-ID: Hi, I am trying to run Swift from a source build on OS X (10.8, dev preview 2 / Java 1.6.0_31). The ANT build was successful and I added the swift/bin to my PATH, so that I can run it. My problems however are runtime errors: 1. Exception in thread "main" java.lang.NoClassDefFoundError: org/griphyn/vdl/karajan/Loader So, I went ahead and added karajan/build to my CLASSPATH. Fixed. 2. Two more errors like this, resulting in adding swift/build and util/build as well to my CLASSPATH. Fixed. 3. Exception in thread "main" java.lang.NoClassDefFoundError: com/thoughtworks/xstream/converters/Converter Now, I could probably download xstream and get this to work, but talking to Mike Wilde, he seemed puzzled that I should have to go through this process, so is there something I'm missing? Thanks! 
Regards, Gustav From jonmon at mcs.anl.gov Sat Apr 14 19:20:31 2012 From: jonmon at mcs.anl.gov (Jonathan Monette) Date: Sat, 14 Apr 2012 19:20:31 -0500 Subject: [Swift-user] OS X dependencies In-Reply-To: References: Message-ID: <7B98B317-7D65-44F2-B319-688BEAF29C29@mcs.anl.gov> What source are you building? Trunk? 0.93? Also, are you sure you added the right bin directory to your path? After a build you need to add dist/swift-svn/bin On Apr 14, 2012, at 19:02, Gustav Larsson wrote: > Hi, > > I am trying to run Swift from a source build on OS X (10.8, dev > preview 2 / Java 1.6.0_31). The ANT build was successful and I added > the swift/bin to my PATH, so that I can run it. My problems however > are runtime errors: > > 1. > Exception in thread "main" java.lang.NoClassDefFoundError: > org/griphyn/vdl/karajan/Loader > > So, I went ahead and added karajan/build to my CLASSPATH. Fixed. > > 2. > Two more errors like this, resulting in adding swift/build and > util/build as well to my CLASSPATH. Fixed. > > 3. > Exception in thread "main" java.lang.NoClassDefFoundError: > com/thoughtworks/xstream/converters/Converter > > Now, I could probably download xstream and get this to work, but > talking to Mike Wilde, he seemed puzzled that I should have to go > through this process, so is there something I'm missing? > > Thanks! > > Regards, > Gustav > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user From larsson at uchicago.edu Sat Apr 14 19:37:34 2012 From: larsson at uchicago.edu (Gustav Larsson) Date: Sat, 14 Apr 2012 19:37:34 -0500 Subject: [Swift-user] OS X dependencies In-Reply-To: <7B98B317-7D65-44F2-B319-688BEAF29C29@mcs.anl.gov> References: <7B98B317-7D65-44F2-B319-688BEAF29C29@mcs.anl.gov> Message-ID: Hi Jonathan, Thanks for your quick reply! I tried this both with the trunk and 0.93. Just to clarify, it was /cog/modules/swift/dist/swift-svn/bin that I added to my PATH. Gustav On 2012/04/14, at 19:20, Jonathan Monette wrote: > What source are you building? Trunk? 0.93? > > Also, are you sure you added the right bin directory to your path? After a build you need to add dist/swift-svn/bin > > On Apr 14, 2012, at 19:02, Gustav Larsson wrote: > >> Hi, >> >> I am trying to run Swift from a source build on OS X (10.8, dev >> preview 2 / Java 1.6.0_31). The ANT build was successful and I added >> the swift/bin to my PATH, so that I can run it. My problems however >> are runtime errors: >> >> 1. >> Exception in thread "main" java.lang.NoClassDefFoundError: >> org/griphyn/vdl/karajan/Loader >> >> So, I went ahead and added karajan/build to my CLASSPATH. Fixed. >> >> 2. >> Two more errors like this, resulting in adding swift/build and >> util/build as well to my CLASSPATH. Fixed. >> >> 3. >> Exception in thread "main" java.lang.NoClassDefFoundError: >> com/thoughtworks/xstream/converters/Converter >> >> Now, I could probably download xstream and get this to work, but >> talking to Mike Wilde, he seemed puzzled that I should have to go >> through this process, so is there something I'm missing? >> >> Thanks! 
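The swift launcher is normally expected to assemble its own classpath from the jars shipped with the dist tree, so having to add build directories and xstream to CLASSPATH by hand usually means the dist is incomplete or a different swift is being picked up first on the PATH. Two quick checks, assuming the layout Gustav describes (cog/modules/swift/dist/swift-svn) and assuming the dist's lib directory holds the bundled third-party jars:

cd cog/modules/swift/dist/swift-svn
which swift                 # should resolve to this directory's bin/swift
ls lib | grep -i xstream    # the bundled xstream jar should be listed here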
>> >> Regards, >> Gustav >> _______________________________________________ >> Swift-user mailing list >> Swift-user at ci.uchicago.edu >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user From lpesce at uchicago.edu Sat Apr 14 20:14:43 2012 From: lpesce at uchicago.edu (Lorenzo Pesce) Date: Sat, 14 Apr 2012 20:14:43 -0500 (CDT) Subject: [Swift-user] Error message on Cray XE6 In-Reply-To: <1531575228.141345.1334416420380.JavaMail.root@zimbra.anl.gov> References: <749553073.141331.1334416200913.JavaMail.root@zimbra.anl.gov> <1531575228.141345.1334416420380.JavaMail.root@zimbra.anl.gov> Message-ID: <20120414201443.BII29763@mstore01.uchicago.edu> First thanks a million for all your help. Does it mean that I should hack the swift script (presumably make my own one), or there is a simpler path that I am missing (Java isn't really my thing). Thanks again. ---- Original message ---- >Date: Sat, 14 Apr 2012 10:13:40 -0500 (CDT) >From: Michael Wilde >Subject: Re: [Swift-user] Error message on Cray XE6 >To: Jonathan Monette >Cc: Lorenzo Pesce , swift-user at ci.uchicago.edu > >stackoverflow says this should work: > >java -Duser.home= > >Need to get that in via the swift command. > >- Mike > > >----- Original Message ----- >> From: "Michael Wilde" >> To: "Jonathan Monette" >> Cc: "Lorenzo Pesce" , swift-user at ci.uchicago.edu >> Sent: Saturday, April 14, 2012 10:10:00 AM >> Subject: Re: [Swift-user] Error message on Cray XE6 >> I just tried both setting HOME=/lustre/beagle/wilde and setting >> user.home to the same thing. Neither works. I think user.home is >> coming from the Java property, and that doesnt seem to be influenced >> by the HOME env var. I was about to look if Java can be asked to >> change home. Maybe by setting a command line arg to Java. >> >> - Mike >> >> ----- Original Message ----- >> > From: "Jonathan Monette" >> > To: "Michael Wilde" >> > Cc: "Lorenzo Pesce" , >> > swift-user at ci.uchicago.edu >> > Sent: Saturday, April 14, 2012 10:02:14 AM >> > Subject: Re: [Swift-user] Error message on Cray XE6 >> > That is an easy fix I believe. I know where the code is so I will >> > change and test. >> > >> > In the mean time could you try something? Try setting >> > user.home= >> > in your config file and try again. >> > >> > On Apr 14, 2012, at 9:58, Michael Wilde wrote: >> > >> > > /home is no longer mounted by the compute nodes, per the >> > > post-maitenance summary: >> > > >> > > "External filesystem dependencies minimized: Compute nodes and the >> > > scheduler should now continue to process and complete jobs without >> > > the threat of interference of external filesystem outages. >> > > /gpfs/pads is only available on login1 through login5; /home is on >> > > login and mom nodes only." >> > > >> > > So we need to (finally) remove Swift's dependence on $HOME/.globus >> > > and $HOME/.globus/scripts in particular. >> > > >> > > I suggest - since the swift command already needs to write to "." >> > > - >> > > that we create a scripts/ directory in "." instead of >> > > $HOME/.globus. >> > > And this should be used by any provider that would have previously >> > > created files below .globus. >> > > >> > > I'll echo this to swift-devel and start a thread there to discuss. >> > > Its possible there's already a property to cause scripts/ to be >> > > created elsewhere. If not, I think we should make one. 
I think >> > > grouping the scripts created by a run into the current dir, along >> > > with the swift log, _concurrent, and (in the conventions I use in >> > > my >> > > run scripts) swiftwork/. >> > > >> > > Lorenzo, hopefully we can at least get you a workaround for this >> > > soon. >> > > >> > > You *might* be able to trick swift into doing this by setting >> > > HOME=/lustre/beagle/$USER. I already tried a symlink under .globus >> > > and that didnt work, as /home is not even readable by the compute >> > > nodes, which in this case need to run the coaster worker (.pl) >> > > script. >> > > >> > > - Mike >> > > >> > > >> > > ----- Original Message ----- >> > >> From: "Lorenzo Pesce" >> > >> To: "Jonathan Monette" >> > >> Cc: swift-user at ci.uchicago.edu >> > >> Sent: Saturday, April 14, 2012 8:15:39 AM >> > >> Subject: Re: [Swift-user] Error message on Cray XE6 >> > >> In principle the access to the /home filesystem should still be >> > >> there. >> > >> >> > >> The only thing I did was to chance the cf file to remove some >> > >> errors I >> > >> had into it, so that might also be the source of the problem. >> > >> This >> > >> is >> > >> what it looks like now: >> > >> (BTW, the comments are not mine, I run swift only from lustre) >> > >> >> > >> >> > >> # Whether to transfer the wrappers from the compute nodes >> > >> # I like to launch from my home dir, but keep everything on >> > >> # lustre >> > >> wrapperlog.always.transfer=false >> > >> >> > >> #Indicates whether the working directory on the remote site >> > >> # should be left intact even when a run completes successfully >> > >> sitedir.keep=true >> > >> >> > >> #try only once >> > >> execution.retries=1 >> > >> >> > >> # Attempt to run as much as possible, i.g., ignore non-fatal >> > >> errors >> > >> lazy.errors=true >> > >> >> > >> # to reduce filesystem access >> > >> status.mode=provider >> > >> >> > >> use.provider.staging=false >> > >> >> > >> provider.staging.pin.swiftfiles=false >> > >> >> > >> foreach.max.threads=100 >> > >> >> > >> provenance.log=false >> > >> >> > >> >> > >> >> > >> >> > >> On Apr 14, 2012, at 12:10 AM, Jonathan Monette wrote: >> > >> >> > >>> The perl script is the worker script that is submitted with PBS. >> > >>> I >> > >>> have not tried to run on Beagle since the maintenance period has >> > >>> ended so I am not exactly sure why the error popped up. One >> > >>> reason >> > >>> could be that the home file system is no longer mounted on the >> > >>> compute nodes. I know they spoke about that being a possibility >> > >>> but >> > >>> not sure they implemented that during the maintenance period. Do >> > >>> you >> > >>> know if the home file system is still mounted on the compute >> > >>> nodes? >> > >>> >> > >>> On Apr 13, 2012, at 17:18, Lorenzo Pesce >> > >>> wrote: >> > >>> >> > >>>> Hi -- >> > >>>> I haven't seen this one before: >> > >>>> >> > >>>> Can't open perl script >> > >>>> "/home/lpesce/.globus/coasters/cscript7176272791806289394.pl": >> > >>>> No >> > >>>> such file or directory >> > >>>> >> > >>>> The config of the cray has changed, might this have anything to >> > >>>> do >> > >>>> with it? >> > >>>> I have no idea what perl script is it talking about and why it >> > >>>> is >> > >>>> looking to home. 
>> > >>>> >> > >>>> Thanks a lot, >> > >>>> >> > >>>> Lorenzo >> > >>>> >> > >>>> >> > >>>> >> > >>>> _______________________________________________ >> > >>>> Swift-user mailing list >> > >>>> Swift-user at ci.uchicago.edu >> > >>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user >> > >> >> > >> _______________________________________________ >> > >> Swift-user mailing list >> > >> Swift-user at ci.uchicago.edu >> > >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user >> > > >> > > -- >> > > Michael Wilde >> > > Computation Institute, University of Chicago >> > > Mathematics and Computer Science Division >> > > Argonne National Laboratory >> > > >> >> -- >> Michael Wilde >> Computation Institute, University of Chicago >> Mathematics and Computer Science Division >> Argonne National Laboratory > >-- >Michael Wilde >Computation Institute, University of Chicago >Mathematics and Computer Science Division >Argonne National Laboratory > From lpesce at uchicago.edu Sat Apr 14 20:16:43 2012 From: lpesce at uchicago.edu (Lorenzo Pesce) Date: Sat, 14 Apr 2012 20:16:43 -0500 (CDT) Subject: [Swift-user] Error message on Cray XE6 In-Reply-To: <1873029745.141382.1334418667851.JavaMail.root@zimbra.anl.gov> References: <1531575228.141345.1334416420380.JavaMail.root@zimbra.anl.gov> <1873029745.141382.1334418667851.JavaMail.root@zimbra.anl.gov> Message-ID: <20120414201643.BII29796@mstore01.uchicago.edu> Sorry for missing this reply before. I will try and follow your instructions (not right now, I have spent the day digging and wielding a pickaxe and I am not really feeling like it :-)). ---- Original message ---- >Date: Sat, 14 Apr 2012 10:51:07 -0500 (CDT) >From: Michael Wilde >Subject: Re: [Swift-user] Error message on Cray XE6 >To: Jonathan Monette , Lorenzo Pesce >Cc: swift-user at ci.uchicago.edu > >OK, here's a workaround for this problem: > >You need to add this line to the swift command bin/swift in your Swift release. > >After: > >updateOptions "$SWIFT_HOME" "swift.home" > >Add: > >updateOptions "$USER_HOME" "user.home" > >This is near line 92 in the version I tested, Swift trunk swift-r5739 cog-r3368. > >Then you can do: > >USER_HOME=/lustre/beagle/wilde swift -config cf -tc.file tc -sites.file pbs.xml catsn.swift -n=1 > >Lorenzo, if you are using "module load swift" we'll need to update that, or you can copy the swift release directory structure that module load points you to, then modify the swift command there, and put that modified release first in your PATH. > >We'll work out a way to get something like this into the production module and trunk. I dont know of other systems that are currently affected by this, but Im sure they will come up. > >- Mike > > >----- Original Message ----- >> From: "Michael Wilde" >> To: "Jonathan Monette" >> Cc: swift-user at ci.uchicago.edu >> Sent: Saturday, April 14, 2012 10:13:40 AM >> Subject: Re: [Swift-user] Error message on Cray XE6 >> stackoverflow says this should work: >> >> java -Duser.home= >> >> Need to get that in via the swift command. >> >> - Mike >> >> >> ----- Original Message ----- >> > From: "Michael Wilde" >> > To: "Jonathan Monette" >> > Cc: "Lorenzo Pesce" , >> > swift-user at ci.uchicago.edu >> > Sent: Saturday, April 14, 2012 10:10:00 AM >> > Subject: Re: [Swift-user] Error message on Cray XE6 >> > I just tried both setting HOME=/lustre/beagle/wilde and setting >> > user.home to the same thing. Neither works. 
I think user.home is >> > coming from the Java property, and that doesnt seem to be influenced >> > by the HOME env var. I was about to look if Java can be asked to >> > change home. Maybe by setting a command line arg to Java. >> > >> > - Mike >> > >> > ----- Original Message ----- >> > > From: "Jonathan Monette" >> > > To: "Michael Wilde" >> > > Cc: "Lorenzo Pesce" , >> > > swift-user at ci.uchicago.edu >> > > Sent: Saturday, April 14, 2012 10:02:14 AM >> > > Subject: Re: [Swift-user] Error message on Cray XE6 >> > > That is an easy fix I believe. I know where the code is so I will >> > > change and test. >> > > >> > > In the mean time could you try something? Try setting >> > > user.home= >> > > in your config file and try again. >> > > >> > > On Apr 14, 2012, at 9:58, Michael Wilde wrote: >> > > >> > > > /home is no longer mounted by the compute nodes, per the >> > > > post-maitenance summary: >> > > > >> > > > "External filesystem dependencies minimized: Compute nodes and >> > > > the >> > > > scheduler should now continue to process and complete jobs >> > > > without >> > > > the threat of interference of external filesystem outages. >> > > > /gpfs/pads is only available on login1 through login5; /home is >> > > > on >> > > > login and mom nodes only." >> > > > >> > > > So we need to (finally) remove Swift's dependence on >> > > > $HOME/.globus >> > > > and $HOME/.globus/scripts in particular. >> > > > >> > > > I suggest - since the swift command already needs to write to >> > > > "." >> > > > - >> > > > that we create a scripts/ directory in "." instead of >> > > > $HOME/.globus. >> > > > And this should be used by any provider that would have >> > > > previously >> > > > created files below .globus. >> > > > >> > > > I'll echo this to swift-devel and start a thread there to >> > > > discuss. >> > > > Its possible there's already a property to cause scripts/ to be >> > > > created elsewhere. If not, I think we should make one. I think >> > > > grouping the scripts created by a run into the current dir, >> > > > along >> > > > with the swift log, _concurrent, and (in the conventions I use >> > > > in >> > > > my >> > > > run scripts) swiftwork/. >> > > > >> > > > Lorenzo, hopefully we can at least get you a workaround for this >> > > > soon. >> > > > >> > > > You *might* be able to trick swift into doing this by setting >> > > > HOME=/lustre/beagle/$USER. I already tried a symlink under >> > > > .globus >> > > > and that didnt work, as /home is not even readable by the >> > > > compute >> > > > nodes, which in this case need to run the coaster worker (.pl) >> > > > script. >> > > > >> > > > - Mike >> > > > >> > > > >> > > > ----- Original Message ----- >> > > >> From: "Lorenzo Pesce" >> > > >> To: "Jonathan Monette" >> > > >> Cc: swift-user at ci.uchicago.edu >> > > >> Sent: Saturday, April 14, 2012 8:15:39 AM >> > > >> Subject: Re: [Swift-user] Error message on Cray XE6 >> > > >> In principle the access to the /home filesystem should still be >> > > >> there. >> > > >> >> > > >> The only thing I did was to chance the cf file to remove some >> > > >> errors I >> > > >> had into it, so that might also be the source of the problem. 
>> > > >> This >> > > >> is >> > > >> what it looks like now: >> > > >> (BTW, the comments are not mine, I run swift only from lustre) >> > > >> >> > > >> >> > > >> # Whether to transfer the wrappers from the compute nodes >> > > >> # I like to launch from my home dir, but keep everything on >> > > >> # lustre >> > > >> wrapperlog.always.transfer=false >> > > >> >> > > >> #Indicates whether the working directory on the remote site >> > > >> # should be left intact even when a run completes successfully >> > > >> sitedir.keep=true >> > > >> >> > > >> #try only once >> > > >> execution.retries=1 >> > > >> >> > > >> # Attempt to run as much as possible, i.g., ignore non-fatal >> > > >> errors >> > > >> lazy.errors=true >> > > >> >> > > >> # to reduce filesystem access >> > > >> status.mode=provider >> > > >> >> > > >> use.provider.staging=false >> > > >> >> > > >> provider.staging.pin.swiftfiles=false >> > > >> >> > > >> foreach.max.threads=100 >> > > >> >> > > >> provenance.log=false >> > > >> >> > > >> >> > > >> >> > > >> >> > > >> On Apr 14, 2012, at 12:10 AM, Jonathan Monette wrote: >> > > >> >> > > >>> The perl script is the worker script that is submitted with >> > > >>> PBS. >> > > >>> I >> > > >>> have not tried to run on Beagle since the maintenance period >> > > >>> has >> > > >>> ended so I am not exactly sure why the error popped up. One >> > > >>> reason >> > > >>> could be that the home file system is no longer mounted on the >> > > >>> compute nodes. I know they spoke about that being a >> > > >>> possibility >> > > >>> but >> > > >>> not sure they implemented that during the maintenance period. >> > > >>> Do >> > > >>> you >> > > >>> know if the home file system is still mounted on the compute >> > > >>> nodes? >> > > >>> >> > > >>> On Apr 13, 2012, at 17:18, Lorenzo Pesce >> > > >>> wrote: >> > > >>> >> > > >>>> Hi -- >> > > >>>> I haven't seen this one before: >> > > >>>> >> > > >>>> Can't open perl script >> > > >>>> "/home/lpesce/.globus/coasters/cscript7176272791806289394.pl": >> > > >>>> No >> > > >>>> such file or directory >> > > >>>> >> > > >>>> The config of the cray has changed, might this have anything >> > > >>>> to >> > > >>>> do >> > > >>>> with it? >> > > >>>> I have no idea what perl script is it talking about and why >> > > >>>> it >> > > >>>> is >> > > >>>> looking to home. 
>> > > >>>> >> > > >>>> Thanks a lot, >> > > >>>> >> > > >>>> Lorenzo >> > > >>>> >> > > >>>> >> > > >>>> >> > > >>>> _______________________________________________ >> > > >>>> Swift-user mailing list >> > > >>>> Swift-user at ci.uchicago.edu >> > > >>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user >> > > >> >> > > >> _______________________________________________ >> > > >> Swift-user mailing list >> > > >> Swift-user at ci.uchicago.edu >> > > >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user >> > > > >> > > > -- >> > > > Michael Wilde >> > > > Computation Institute, University of Chicago >> > > > Mathematics and Computer Science Division >> > > > Argonne National Laboratory >> > > > >> > >> > -- >> > Michael Wilde >> > Computation Institute, University of Chicago >> > Mathematics and Computer Science Division >> > Argonne National Laboratory >> >> -- >> Michael Wilde >> Computation Institute, University of Chicago >> Mathematics and Computer Science Division >> Argonne National Laboratory >> >> _______________________________________________ >> Swift-user mailing list >> Swift-user at ci.uchicago.edu >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > >-- >Michael Wilde >Computation Institute, University of Chicago >Mathematics and Computer Science Division >Argonne National Laboratory > From jonmon at mcs.anl.gov Sat Apr 14 20:17:42 2012 From: jonmon at mcs.anl.gov (Jonathan Monette) Date: Sat, 14 Apr 2012 20:17:42 -0500 Subject: [Swift-user] OS X dependencies In-Reply-To: References: <7B98B317-7D65-44F2-B319-688BEAF29C29@mcs.anl.gov> Message-ID: <9823D658-76F3-4CB7-924E-51DA9DFA4E15@mcs.anl.gov> So I can successfully build and execute swift(both 0.93 and trunk) on my personal Mac which has OS X Lion(10.7 I believe). I am not sure why it works with 10.7 and not 10.8. It could be because 10.8 is still under development so it may be missing classes Java needs that comes standard in 10.7? This is just a shot in the dark since I do not have access to 10.8 dev preview. I can debug this a little but not sure how much help I can provide. This sounds like the version of your OS specific issues of issue. Do you have access to a 10.7 version of OS X to try? On Apr 14, 2012, at 19:37, Gustav Larsson wrote: > Hi Jonathan, > > Thanks for your quick reply! > > I tried this both with the trunk and 0.93. > > Just to clarify, it was /cog/modules/swift/dist/swift-svn/bin that I added to my PATH. > > Gustav > > On 2012/04/14, at 19:20, Jonathan Monette wrote: > >> What source are you building? Trunk? 0.93? >> >> Also, are you sure you added the right bin directory to your path? After a build you need to add dist/swift-svn/bin >> >> On Apr 14, 2012, at 19:02, Gustav Larsson wrote: >> >>> Hi, >>> >>> I am trying to run Swift from a source build on OS X (10.8, dev >>> preview 2 / Java 1.6.0_31). The ANT build was successful and I added >>> the swift/bin to my PATH, so that I can run it. My problems however >>> are runtime errors: >>> >>> 1. >>> Exception in thread "main" java.lang.NoClassDefFoundError: >>> org/griphyn/vdl/karajan/Loader >>> >>> So, I went ahead and added karajan/build to my CLASSPATH. Fixed. >>> >>> 2. >>> Two more errors like this, resulting in adding swift/build and >>> util/build as well to my CLASSPATH. Fixed. >>> >>> 3. 
>>> Exception in thread "main" java.lang.NoClassDefFoundError: >>> com/thoughtworks/xstream/converters/Converter >>> >>> Now, I could probably download xstream and get this to work, but >>> talking to Mike Wilde, he seemed puzzled that I should have to go >>> through this process, so is there something I'm missing? >>> >>> Thanks! >>> >>> Regards, >>> Gustav >>> _______________________________________________ >>> Swift-user mailing list >>> Swift-user at ci.uchicago.edu >>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > From larsson at uchicago.edu Sun Apr 15 13:13:44 2012 From: larsson at uchicago.edu (Gustav Larsson) Date: Sun, 15 Apr 2012 13:13:44 -0500 Subject: [Swift-user] OS X dependencies In-Reply-To: <9823D658-76F3-4CB7-924E-51DA9DFA4E15@mcs.anl.gov> References: <7B98B317-7D65-44F2-B319-688BEAF29C29@mcs.anl.gov> <9823D658-76F3-4CB7-924E-51DA9DFA4E15@mcs.anl.gov> Message-ID: Thank you for your help Jonathan. After some attempts of upgrading Java and installing xstream manually (every version I test gets a different runtime error), I think I will have to abandon this problem for now and accept that I can't use this computer for now. I guess we will have to wait and see if this becomes a problem for everyone in the final release of 10.8, or if it's just a beta quirk. I will let to list know if I am able to get it running. Gustav On Sat, Apr 14, 2012 at 8:17 PM, Jonathan Monette wrote: > So I can successfully build and execute swift(both 0.93 and trunk) on my personal Mac which has OS X Lion(10.7 I believe). > > I am not sure why it works with 10.7 and not 10.8. It could be because 10.8 is still under development so it may be missing classes Java needs that comes standard in 10.7? This is just a shot in the dark since I do not have access to 10.8 dev preview. > > I can debug this a little but not sure how much help I can provide. This sounds like the version of your OS specific issues of issue. Do you have access to a 10.7 version of OS X to try? > > On Apr 14, 2012, at 19:37, Gustav Larsson wrote: > >> Hi Jonathan, >> >> Thanks for your quick reply! >> >> I tried this both with the trunk and 0.93. >> >> Just to clarify, it was /cog/modules/swift/dist/swift-svn/bin that I added to my PATH. >> >> Gustav >> >> On 2012/04/14, at 19:20, Jonathan Monette wrote: >> >>> What source are you building? Trunk? 0.93? >>> >>> Also, are you sure you added the right bin directory to your path? After a build you need to add dist/swift-svn/bin >>> >>> On Apr 14, 2012, at 19:02, Gustav Larsson wrote: >>> >>>> Hi, >>>> >>>> I am trying to run Swift from a source build on OS X (10.8, dev >>>> preview 2 / Java 1.6.0_31). The ANT build was successful and I added >>>> the swift/bin to my PATH, so that I can run it. My problems however >>>> are runtime errors: >>>> >>>> 1. >>>> Exception in thread "main" java.lang.NoClassDefFoundError: >>>> org/griphyn/vdl/karajan/Loader >>>> >>>> So, I went ahead and added karajan/build to my CLASSPATH. Fixed. >>>> >>>> 2. >>>> Two more errors like this, resulting in adding swift/build and >>>> util/build as well to my CLASSPATH. Fixed. >>>> >>>> 3. >>>> Exception in thread "main" java.lang.NoClassDefFoundError: >>>> com/thoughtworks/xstream/converters/Converter >>>> >>>> Now, I could probably download xstream and get this to work, but >>>> talking to Mike Wilde, he seemed puzzled that I should have to go >>>> through this process, so is there something I'm missing? >>>> >>>> Thanks! 
>>>> >>>> Regards, >>>> Gustav >>>> _______________________________________________ >>>> Swift-user mailing list >>>> Swift-user at ci.uchicago.edu >>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user >> From hategan at mcs.anl.gov Sun Apr 15 16:05:56 2012 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 15 Apr 2012 14:05:56 -0700 Subject: [Swift-user] OS X dependencies In-Reply-To: References: Message-ID: <1334523956.28828.6.camel@blabla> On Sat, 2012-04-14 at 19:02 -0500, Gustav Larsson wrote: > Hi, > > I am trying to run Swift from a source build on OS X (10.8, dev > preview 2 / Java 1.6.0_31). The ANT build was successful and I added > the swift/bin to my PATH, so that I can run it. My problems however > are runtime errors: Add swift/dist//bin to your path instead. That's where the actual build is. Mihael > > 1. > Exception in thread "main" java.lang.NoClassDefFoundError: > org/griphyn/vdl/karajan/Loader > > So, I went ahead and added karajan/build to my CLASSPATH. Fixed. > > 2. > Two more errors like this, resulting in adding swift/build and > util/build as well to my CLASSPATH. Fixed. > > 3. > Exception in thread "main" java.lang.NoClassDefFoundError: > com/thoughtworks/xstream/converters/Converter > > Now, I could probably download xstream and get this to work, but > talking to Mike Wilde, he seemed puzzled that I should have to go > through this process, so is there something I'm missing? > > Thanks! > > Regards, > Gustav > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user From hategan at mcs.anl.gov Sun Apr 15 16:12:49 2012 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 15 Apr 2012 14:12:49 -0700 Subject: [Swift-user] OS X dependencies In-Reply-To: References: <7B98B317-7D65-44F2-B319-688BEAF29C29@mcs.anl.gov> <9823D658-76F3-4CB7-924E-51DA9DFA4E15@mcs.anl.gov> Message-ID: <1334524369.28828.8.camel@blabla> On Sun, 2012-04-15 at 13:13 -0500, Gustav Larsson wrote: > Thank you for your help Jonathan. After some attempts of upgrading > Java and installing xstream manually (every version I test gets a > different runtime error), I think I will have to abandon this problem > for now and accept that I can't use this computer for now. The xstream jar is already in the swift/cog sources. You should not need to download is separately. Can you run the following: bash -x swift inputFile 2>&1 >swiftout.txt and post swiftout.txt? Mihael From larsson at uchicago.edu Sun Apr 15 18:55:10 2012 From: larsson at uchicago.edu (Gustav Larsson) Date: Sun, 15 Apr 2012 18:55:10 -0500 Subject: [Swift-user] OS X dependencies [Resolved] Message-ID: Thank you Mihael! When arranging the bash -x output file, I happened to spot the error and fixed it. So in the output, I noticed SWIFT_HOME, which I had completely forgotten that I had ever set. I tried removing it from my .bashrc and swift ran without a hitch. On closer inspection, I had set my SWIFT_HOME to cog/modules/swift instead of cog/modules/swift/dist/swift-svn. The bin/swift bash file checks for the required jar files when it tries to automatically set SWIFT_HOME. It might be worth doing this when the user has specified the SWIFT_HOME manually as well. This way this problem can be avoided. I would be happy to implement this change by the way. I'm sorry for all the confusion and thank you for your help! 
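For example, a minimal form of that check might look like the following (the jar name tested here is only illustrative; the real script knows exactly which files it needs):

    # warn if a user-supplied SWIFT_HOME does not point at a built distribution
    if [ -n "$SWIFT_HOME" ] && [ ! -f "$SWIFT_HOME/lib/swift.jar" ]; then
        echo "Warning: SWIFT_HOME=$SWIFT_HOME does not look like a built Swift distribution" >&2
        echo "(expected something like cog/modules/swift/dist/swift-svn)" >&2
    fi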
Gustav On Sun, Apr 15, 2012 at 4:12 PM, Mihael Hategan wrote: > On Sun, 2012-04-15 at 13:13 -0500, Gustav Larsson wrote: >> Thank you for your help Jonathan. After some attempts of upgrading >> Java and installing xstream manually (every version I test gets a >> different runtime error), I think I will have to abandon this problem >> for now and accept that I can't use this computer for now. > > The xstream jar is already in the swift/cog sources. You should not need > to download is separately. > > Can you run the following: > > bash -x swift inputFile 2>&1 >swiftout.txt > > and post swiftout.txt? > > Mihael > > From hategan at mcs.anl.gov Sun Apr 15 18:59:06 2012 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 15 Apr 2012 16:59:06 -0700 Subject: [Swift-user] OS X dependencies [Resolved] In-Reply-To: References: Message-ID: <1334534346.31053.1.camel@blabla> On Sun, 2012-04-15 at 18:55 -0500, Gustav Larsson wrote: > Thank you Mihael! When arranging the bash -x output file, I happened > to spot the error and fixed it. > > So in the output, I noticed SWIFT_HOME, which I had completely > forgotten that I had ever set. I tried removing it from my .bashrc and > swift ran without a hitch. On closer inspection, I had set my > SWIFT_HOME to cog/modules/swift instead of > cog/modules/swift/dist/swift-svn. > > The bin/swift bash file checks for the required jar files when it > tries to automatically set SWIFT_HOME. It might be worth doing this > when the user has specified the SWIFT_HOME manually as well. This way > this problem can be avoided. I would be happy to implement this change > by the way. SWIFT_HOME was added in order to allow a mechanism to override the automatic detection. Perhaps the solution would be to print a warning when SWIFT_HOME is already set. From lpesce at uchicago.edu Tue Apr 17 10:31:06 2012 From: lpesce at uchicago.edu (Lorenzo Pesce) Date: Tue, 17 Apr 2012 10:31:06 -0500 Subject: [Swift-user] Error message on Cray XE6 In-Reply-To: <1873029745.141382.1334418667851.JavaMail.root@zimbra.anl.gov> References: <1873029745.141382.1334418667851.JavaMail.root@zimbra.anl.gov> Message-ID: This is what I did and it seems to work (well... it doesn't explode right away, but the rest are hopefully my own bugs ;-)). all worked fine, but I needed to specify SWIFT_HOME too. ## Swift hack: # Load the working java, plus the path to the swift directory module load java export SWIFT_HEAP_MAX=1024M export PATH=/soft/swift/0.93/bin:$PATH #Fix swift script cp /soft/swift/0.93/bin/swift . after: updateOptions "$SWIFT_HOME" "swift.home" Add: updateOptions "$USER_HOME" "user.home" # Important to put this here, so that the swift hack will preempt the # script in soft export PATH=`pwd`:$PATH Run as: USER_HOME=/lustre/beagle/`whoami` SWIFT_HOME=/soft/swift/0.93/ swift -config cf -tc.file tc -sites.file pbs.xml catsn.swift -n=1 On Apr 14, 2012, at 10:51 AM, Michael Wilde wrote: > OK, here's a workaround for this problem: > > You need to add this line to the swift command bin/swift in your Swift release. > > After: > > updateOptions "$SWIFT_HOME" "swift.home" > > Add: > > updateOptions "$USER_HOME" "user.home" > > This is near line 92 in the version I tested, Swift trunk swift-r5739 cog-r3368. 
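The copy-and-edit step in the recipe above can also be scripted rather than made by hand; a sketch, assuming GNU sed and that the swift.home line occurs exactly once, as it does in the 0.93 script:

    # copy the release's swift command into the run directory and splice in
    # the user.home line right after the swift.home line
    cp /soft/swift/0.93/bin/swift .
    sed -i '/updateOptions "$SWIFT_HOME" "swift.home"/a updateOptions "$USER_HOME" "user.home"' ./swift
    # put the patched copy ahead of the module's copy on the PATH
    export PATH=$PWD:$PATH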
> > Then you can do: > > USER_HOME=/lustre/beagle/wilde swift -config cf -tc.file tc -sites.file pbs.xml catsn.swift -n=1 > > Lorenzo, if you are using "module load swift" we'll need to update that, or you can copy the swift release directory structure that module load points you to, then modify the swift command there, and put that modified release first in your PATH. > > We'll work out a way to get something like this into the production module and trunk. I dont know of other systems that are currently affected by this, but Im sure they will come up. > > - Mike > > > ----- Original Message ----- >> From: "Michael Wilde" >> To: "Jonathan Monette" >> Cc: swift-user at ci.uchicago.edu >> Sent: Saturday, April 14, 2012 10:13:40 AM >> Subject: Re: [Swift-user] Error message on Cray XE6 >> stackoverflow says this should work: >> >> java -Duser.home= >> >> Need to get that in via the swift command. >> >> - Mike >> >> >> ----- Original Message ----- >>> From: "Michael Wilde" >>> To: "Jonathan Monette" >>> Cc: "Lorenzo Pesce" , >>> swift-user at ci.uchicago.edu >>> Sent: Saturday, April 14, 2012 10:10:00 AM >>> Subject: Re: [Swift-user] Error message on Cray XE6 >>> I just tried both setting HOME=/lustre/beagle/wilde and setting >>> user.home to the same thing. Neither works. I think user.home is >>> coming from the Java property, and that doesnt seem to be influenced >>> by the HOME env var. I was about to look if Java can be asked to >>> change home. Maybe by setting a command line arg to Java. >>> >>> - Mike >>> >>> ----- Original Message ----- >>>> From: "Jonathan Monette" >>>> To: "Michael Wilde" >>>> Cc: "Lorenzo Pesce" , >>>> swift-user at ci.uchicago.edu >>>> Sent: Saturday, April 14, 2012 10:02:14 AM >>>> Subject: Re: [Swift-user] Error message on Cray XE6 >>>> That is an easy fix I believe. I know where the code is so I will >>>> change and test. >>>> >>>> In the mean time could you try something? Try setting >>>> user.home= >>>> in your config file and try again. >>>> >>>> On Apr 14, 2012, at 9:58, Michael Wilde wrote: >>>> >>>>> /home is no longer mounted by the compute nodes, per the >>>>> post-maitenance summary: >>>>> >>>>> "External filesystem dependencies minimized: Compute nodes and >>>>> the >>>>> scheduler should now continue to process and complete jobs >>>>> without >>>>> the threat of interference of external filesystem outages. >>>>> /gpfs/pads is only available on login1 through login5; /home is >>>>> on >>>>> login and mom nodes only." >>>>> >>>>> So we need to (finally) remove Swift's dependence on >>>>> $HOME/.globus >>>>> and $HOME/.globus/scripts in particular. >>>>> >>>>> I suggest - since the swift command already needs to write to >>>>> "." >>>>> - >>>>> that we create a scripts/ directory in "." instead of >>>>> $HOME/.globus. >>>>> And this should be used by any provider that would have >>>>> previously >>>>> created files below .globus. >>>>> >>>>> I'll echo this to swift-devel and start a thread there to >>>>> discuss. >>>>> Its possible there's already a property to cause scripts/ to be >>>>> created elsewhere. If not, I think we should make one. I think >>>>> grouping the scripts created by a run into the current dir, >>>>> along >>>>> with the swift log, _concurrent, and (in the conventions I use >>>>> in >>>>> my >>>>> run scripts) swiftwork/. >>>>> >>>>> Lorenzo, hopefully we can at least get you a workaround for this >>>>> soon. >>>>> >>>>> You *might* be able to trick swift into doing this by setting >>>>> HOME=/lustre/beagle/$USER. 
I already tried a symlink under >>>>> .globus >>>>> and that didnt work, as /home is not even readable by the >>>>> compute >>>>> nodes, which in this case need to run the coaster worker (.pl) >>>>> script. >>>>> >>>>> - Mike >>>>> >>>>> >>>>> ----- Original Message ----- >>>>>> From: "Lorenzo Pesce" >>>>>> To: "Jonathan Monette" >>>>>> Cc: swift-user at ci.uchicago.edu >>>>>> Sent: Saturday, April 14, 2012 8:15:39 AM >>>>>> Subject: Re: [Swift-user] Error message on Cray XE6 >>>>>> In principle the access to the /home filesystem should still be >>>>>> there. >>>>>> >>>>>> The only thing I did was to chance the cf file to remove some >>>>>> errors I >>>>>> had into it, so that might also be the source of the problem. >>>>>> This >>>>>> is >>>>>> what it looks like now: >>>>>> (BTW, the comments are not mine, I run swift only from lustre) >>>>>> >>>>>> >>>>>> # Whether to transfer the wrappers from the compute nodes >>>>>> # I like to launch from my home dir, but keep everything on >>>>>> # lustre >>>>>> wrapperlog.always.transfer=false >>>>>> >>>>>> #Indicates whether the working directory on the remote site >>>>>> # should be left intact even when a run completes successfully >>>>>> sitedir.keep=true >>>>>> >>>>>> #try only once >>>>>> execution.retries=1 >>>>>> >>>>>> # Attempt to run as much as possible, i.g., ignore non-fatal >>>>>> errors >>>>>> lazy.errors=true >>>>>> >>>>>> # to reduce filesystem access >>>>>> status.mode=provider >>>>>> >>>>>> use.provider.staging=false >>>>>> >>>>>> provider.staging.pin.swiftfiles=false >>>>>> >>>>>> foreach.max.threads=100 >>>>>> >>>>>> provenance.log=false >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> On Apr 14, 2012, at 12:10 AM, Jonathan Monette wrote: >>>>>> >>>>>>> The perl script is the worker script that is submitted with >>>>>>> PBS. >>>>>>> I >>>>>>> have not tried to run on Beagle since the maintenance period >>>>>>> has >>>>>>> ended so I am not exactly sure why the error popped up. One >>>>>>> reason >>>>>>> could be that the home file system is no longer mounted on the >>>>>>> compute nodes. I know they spoke about that being a >>>>>>> possibility >>>>>>> but >>>>>>> not sure they implemented that during the maintenance period. >>>>>>> Do >>>>>>> you >>>>>>> know if the home file system is still mounted on the compute >>>>>>> nodes? >>>>>>> >>>>>>> On Apr 13, 2012, at 17:18, Lorenzo Pesce >>>>>>> wrote: >>>>>>> >>>>>>>> Hi -- >>>>>>>> I haven't seen this one before: >>>>>>>> >>>>>>>> Can't open perl script >>>>>>>> "/home/lpesce/.globus/coasters/cscript7176272791806289394.pl": >>>>>>>> No >>>>>>>> such file or directory >>>>>>>> >>>>>>>> The config of the cray has changed, might this have anything >>>>>>>> to >>>>>>>> do >>>>>>>> with it? >>>>>>>> I have no idea what perl script is it talking about and why >>>>>>>> it >>>>>>>> is >>>>>>>> looking to home. 
>>>>>>>> >>>>>>>> Thanks a lot, >>>>>>>> >>>>>>>> Lorenzo >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> _______________________________________________ >>>>>>>> Swift-user mailing list >>>>>>>> Swift-user at ci.uchicago.edu >>>>>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user >>>>>> >>>>>> _______________________________________________ >>>>>> Swift-user mailing list >>>>>> Swift-user at ci.uchicago.edu >>>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user >>>>> >>>>> -- >>>>> Michael Wilde >>>>> Computation Institute, University of Chicago >>>>> Mathematics and Computer Science Division >>>>> Argonne National Laboratory >>>>> >>> >>> -- >>> Michael Wilde >>> Computation Institute, University of Chicago >>> Mathematics and Computer Science Division >>> Argonne National Laboratory >> >> -- >> Michael Wilde >> Computation Institute, University of Chicago >> Mathematics and Computer Science Division >> Argonne National Laboratory >> >> _______________________________________________ >> Swift-user mailing list >> Swift-user at ci.uchicago.edu >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > From lpesce at uchicago.edu Tue Apr 17 11:58:52 2012 From: lpesce at uchicago.edu (Lorenzo Pesce) Date: Tue, 17 Apr 2012 11:58:52 -0500 Subject: [Swift-user] Error message on Cray XE6 In-Reply-To: <1873029745.141382.1334418667851.JavaMail.root@zimbra.anl.gov> References: <1873029745.141382.1334418667851.JavaMail.root@zimbra.anl.gov> Message-ID: <30C617E5-E7EA-456F-8F53-F5F32A50FBEC@uchicago.edu> Works great! Is there a way I can ask swift to put me in a specific queue, such as scalability of some reservation? On Apr 14, 2012, at 10:51 AM, Michael Wilde wrote: > OK, here's a workaround for this problem: > > You need to add this line to the swift command bin/swift in your Swift release. > > After: > > updateOptions "$SWIFT_HOME" "swift.home" > > Add: > > updateOptions "$USER_HOME" "user.home" > > This is near line 92 in the version I tested, Swift trunk swift-r5739 cog-r3368. > > Then you can do: > > USER_HOME=/lustre/beagle/wilde swift -config cf -tc.file tc -sites.file pbs.xml catsn.swift -n=1 > > Lorenzo, if you are using "module load swift" we'll need to update that, or you can copy the swift release directory structure that module load points you to, then modify the swift command there, and put that modified release first in your PATH. > > We'll work out a way to get something like this into the production module and trunk. I dont know of other systems that are currently affected by this, but Im sure they will come up. > > - Mike > > > ----- Original Message ----- >> From: "Michael Wilde" >> To: "Jonathan Monette" >> Cc: swift-user at ci.uchicago.edu >> Sent: Saturday, April 14, 2012 10:13:40 AM >> Subject: Re: [Swift-user] Error message on Cray XE6 >> stackoverflow says this should work: >> >> java -Duser.home= >> >> Need to get that in via the swift command. >> >> - Mike >> >> >> ----- Original Message ----- >>> From: "Michael Wilde" >>> To: "Jonathan Monette" >>> Cc: "Lorenzo Pesce" , >>> swift-user at ci.uchicago.edu >>> Sent: Saturday, April 14, 2012 10:10:00 AM >>> Subject: Re: [Swift-user] Error message on Cray XE6 >>> I just tried both setting HOME=/lustre/beagle/wilde and setting >>> user.home to the same thing. Neither works. 
I think user.home is >>> coming from the Java property, and that doesnt seem to be influenced >>> by the HOME env var. I was about to look if Java can be asked to >>> change home. Maybe by setting a command line arg to Java. >>> >>> - Mike >>> >>> ----- Original Message ----- >>>> From: "Jonathan Monette" >>>> To: "Michael Wilde" >>>> Cc: "Lorenzo Pesce" , >>>> swift-user at ci.uchicago.edu >>>> Sent: Saturday, April 14, 2012 10:02:14 AM >>>> Subject: Re: [Swift-user] Error message on Cray XE6 >>>> That is an easy fix I believe. I know where the code is so I will >>>> change and test. >>>> >>>> In the mean time could you try something? Try setting >>>> user.home= >>>> in your config file and try again. >>>> >>>> On Apr 14, 2012, at 9:58, Michael Wilde wrote: >>>> >>>>> /home is no longer mounted by the compute nodes, per the >>>>> post-maitenance summary: >>>>> >>>>> "External filesystem dependencies minimized: Compute nodes and >>>>> the >>>>> scheduler should now continue to process and complete jobs >>>>> without >>>>> the threat of interference of external filesystem outages. >>>>> /gpfs/pads is only available on login1 through login5; /home is >>>>> on >>>>> login and mom nodes only." >>>>> >>>>> So we need to (finally) remove Swift's dependence on >>>>> $HOME/.globus >>>>> and $HOME/.globus/scripts in particular. >>>>> >>>>> I suggest - since the swift command already needs to write to >>>>> "." >>>>> - >>>>> that we create a scripts/ directory in "." instead of >>>>> $HOME/.globus. >>>>> And this should be used by any provider that would have >>>>> previously >>>>> created files below .globus. >>>>> >>>>> I'll echo this to swift-devel and start a thread there to >>>>> discuss. >>>>> Its possible there's already a property to cause scripts/ to be >>>>> created elsewhere. If not, I think we should make one. I think >>>>> grouping the scripts created by a run into the current dir, >>>>> along >>>>> with the swift log, _concurrent, and (in the conventions I use >>>>> in >>>>> my >>>>> run scripts) swiftwork/. >>>>> >>>>> Lorenzo, hopefully we can at least get you a workaround for this >>>>> soon. >>>>> >>>>> You *might* be able to trick swift into doing this by setting >>>>> HOME=/lustre/beagle/$USER. I already tried a symlink under >>>>> .globus >>>>> and that didnt work, as /home is not even readable by the >>>>> compute >>>>> nodes, which in this case need to run the coaster worker (.pl) >>>>> script. >>>>> >>>>> - Mike >>>>> >>>>> >>>>> ----- Original Message ----- >>>>>> From: "Lorenzo Pesce" >>>>>> To: "Jonathan Monette" >>>>>> Cc: swift-user at ci.uchicago.edu >>>>>> Sent: Saturday, April 14, 2012 8:15:39 AM >>>>>> Subject: Re: [Swift-user] Error message on Cray XE6 >>>>>> In principle the access to the /home filesystem should still be >>>>>> there. >>>>>> >>>>>> The only thing I did was to chance the cf file to remove some >>>>>> errors I >>>>>> had into it, so that might also be the source of the problem. 
>>>>>> This >>>>>> is >>>>>> what it looks like now: >>>>>> (BTW, the comments are not mine, I run swift only from lustre) >>>>>> >>>>>> >>>>>> # Whether to transfer the wrappers from the compute nodes >>>>>> # I like to launch from my home dir, but keep everything on >>>>>> # lustre >>>>>> wrapperlog.always.transfer=false >>>>>> >>>>>> #Indicates whether the working directory on the remote site >>>>>> # should be left intact even when a run completes successfully >>>>>> sitedir.keep=true >>>>>> >>>>>> #try only once >>>>>> execution.retries=1 >>>>>> >>>>>> # Attempt to run as much as possible, i.g., ignore non-fatal >>>>>> errors >>>>>> lazy.errors=true >>>>>> >>>>>> # to reduce filesystem access >>>>>> status.mode=provider >>>>>> >>>>>> use.provider.staging=false >>>>>> >>>>>> provider.staging.pin.swiftfiles=false >>>>>> >>>>>> foreach.max.threads=100 >>>>>> >>>>>> provenance.log=false >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> On Apr 14, 2012, at 12:10 AM, Jonathan Monette wrote: >>>>>> >>>>>>> The perl script is the worker script that is submitted with >>>>>>> PBS. >>>>>>> I >>>>>>> have not tried to run on Beagle since the maintenance period >>>>>>> has >>>>>>> ended so I am not exactly sure why the error popped up. One >>>>>>> reason >>>>>>> could be that the home file system is no longer mounted on the >>>>>>> compute nodes. I know they spoke about that being a >>>>>>> possibility >>>>>>> but >>>>>>> not sure they implemented that during the maintenance period. >>>>>>> Do >>>>>>> you >>>>>>> know if the home file system is still mounted on the compute >>>>>>> nodes? >>>>>>> >>>>>>> On Apr 13, 2012, at 17:18, Lorenzo Pesce >>>>>>> wrote: >>>>>>> >>>>>>>> Hi -- >>>>>>>> I haven't seen this one before: >>>>>>>> >>>>>>>> Can't open perl script >>>>>>>> "/home/lpesce/.globus/coasters/cscript7176272791806289394.pl": >>>>>>>> No >>>>>>>> such file or directory >>>>>>>> >>>>>>>> The config of the cray has changed, might this have anything >>>>>>>> to >>>>>>>> do >>>>>>>> with it? >>>>>>>> I have no idea what perl script is it talking about and why >>>>>>>> it >>>>>>>> is >>>>>>>> looking to home. 
>>>>>>>> >>>>>>>> Thanks a lot, >>>>>>>> >>>>>>>> Lorenzo >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> _______________________________________________ >>>>>>>> Swift-user mailing list >>>>>>>> Swift-user at ci.uchicago.edu >>>>>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user >>>>>> >>>>>> _______________________________________________ >>>>>> Swift-user mailing list >>>>>> Swift-user at ci.uchicago.edu >>>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user >>>>> >>>>> -- >>>>> Michael Wilde >>>>> Computation Institute, University of Chicago >>>>> Mathematics and Computer Science Division >>>>> Argonne National Laboratory >>>>> >>> >>> -- >>> Michael Wilde >>> Computation Institute, University of Chicago >>> Mathematics and Computer Science Division >>> Argonne National Laboratory >> >> -- >> Michael Wilde >> Computation Institute, University of Chicago >> Mathematics and Computer Science Division >> Argonne National Laboratory >> >> _______________________________________________ >> Swift-user mailing list >> Swift-user at ci.uchicago.edu >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > From jonmon at mcs.anl.gov Tue Apr 17 12:08:49 2012 From: jonmon at mcs.anl.gov (Jonathan Monette) Date: Tue, 17 Apr 2012 12:08:49 -0500 Subject: [Swift-user] Error message on Cray XE6 In-Reply-To: <30C617E5-E7EA-456F-8F53-F5F32A50FBEC@uchicago.edu> References: <1873029745.141382.1334418667851.JavaMail.root@zimbra.anl.gov> <30C617E5-E7EA-456F-8F53-F5F32A50FBEC@uchicago.edu> Message-ID: <66F7EB49-829F-498B-B896-966DB44478FC@mcs.anl.gov> There is a site file entry for that. scalability You must make certain that the shape of your job fits in the queue you requested. If it does not fit, there is a silent failure and Swift hangs. On Apr 17, 2012, at 11:58, Lorenzo Pesce wrote: > Works great! > > Is there a way I can ask swift to put me in a specific queue, such as scalability of some reservation? > > > > On Apr 14, 2012, at 10:51 AM, Michael Wilde wrote: > >> OK, here's a workaround for this problem: >> >> You need to add this line to the swift command bin/swift in your Swift release. >> >> After: >> >> updateOptions "$SWIFT_HOME" "swift.home" >> >> Add: >> >> updateOptions "$USER_HOME" "user.home" >> >> This is near line 92 in the version I tested, Swift trunk swift-r5739 cog-r3368. >> >> Then you can do: >> >> USER_HOME=/lustre/beagle/wilde swift -config cf -tc.file tc -sites.file pbs.xml catsn.swift -n=1 >> >> Lorenzo, if you are using "module load swift" we'll need to update that, or you can copy the swift release directory structure that module load points you to, then modify the swift command there, and put that modified release first in your PATH. >> >> We'll work out a way to get something like this into the production module and trunk. I dont know of other systems that are currently affected by this, but Im sure they will come up. >> >> - Mike >> >> >> ----- Original Message ----- >>> From: "Michael Wilde" >>> To: "Jonathan Monette" >>> Cc: swift-user at ci.uchicago.edu >>> Sent: Saturday, April 14, 2012 10:13:40 AM >>> Subject: Re: [Swift-user] Error message on Cray XE6 >>> stackoverflow says this should work: >>> >>> java -Duser.home= >>> >>> Need to get that in via the swift command. 
>>> >>> - Mike >>> >>> >>> ----- Original Message ----- >>>> From: "Michael Wilde" >>>> To: "Jonathan Monette" >>>> Cc: "Lorenzo Pesce" , >>>> swift-user at ci.uchicago.edu >>>> Sent: Saturday, April 14, 2012 10:10:00 AM >>>> Subject: Re: [Swift-user] Error message on Cray XE6 >>>> I just tried both setting HOME=/lustre/beagle/wilde and setting >>>> user.home to the same thing. Neither works. I think user.home is >>>> coming from the Java property, and that doesnt seem to be influenced >>>> by the HOME env var. I was about to look if Java can be asked to >>>> change home. Maybe by setting a command line arg to Java. >>>> >>>> - Mike >>>> >>>> ----- Original Message ----- >>>>> From: "Jonathan Monette" >>>>> To: "Michael Wilde" >>>>> Cc: "Lorenzo Pesce" , >>>>> swift-user at ci.uchicago.edu >>>>> Sent: Saturday, April 14, 2012 10:02:14 AM >>>>> Subject: Re: [Swift-user] Error message on Cray XE6 >>>>> That is an easy fix I believe. I know where the code is so I will >>>>> change and test. >>>>> >>>>> In the mean time could you try something? Try setting >>>>> user.home= >>>>> in your config file and try again. >>>>> >>>>> On Apr 14, 2012, at 9:58, Michael Wilde wrote: >>>>> >>>>>> /home is no longer mounted by the compute nodes, per the >>>>>> post-maitenance summary: >>>>>> >>>>>> "External filesystem dependencies minimized: Compute nodes and >>>>>> the >>>>>> scheduler should now continue to process and complete jobs >>>>>> without >>>>>> the threat of interference of external filesystem outages. >>>>>> /gpfs/pads is only available on login1 through login5; /home is >>>>>> on >>>>>> login and mom nodes only." >>>>>> >>>>>> So we need to (finally) remove Swift's dependence on >>>>>> $HOME/.globus >>>>>> and $HOME/.globus/scripts in particular. >>>>>> >>>>>> I suggest - since the swift command already needs to write to >>>>>> "." >>>>>> - >>>>>> that we create a scripts/ directory in "." instead of >>>>>> $HOME/.globus. >>>>>> And this should be used by any provider that would have >>>>>> previously >>>>>> created files below .globus. >>>>>> >>>>>> I'll echo this to swift-devel and start a thread there to >>>>>> discuss. >>>>>> Its possible there's already a property to cause scripts/ to be >>>>>> created elsewhere. If not, I think we should make one. I think >>>>>> grouping the scripts created by a run into the current dir, >>>>>> along >>>>>> with the swift log, _concurrent, and (in the conventions I use >>>>>> in >>>>>> my >>>>>> run scripts) swiftwork/. >>>>>> >>>>>> Lorenzo, hopefully we can at least get you a workaround for this >>>>>> soon. >>>>>> >>>>>> You *might* be able to trick swift into doing this by setting >>>>>> HOME=/lustre/beagle/$USER. I already tried a symlink under >>>>>> .globus >>>>>> and that didnt work, as /home is not even readable by the >>>>>> compute >>>>>> nodes, which in this case need to run the coaster worker (.pl) >>>>>> script. >>>>>> >>>>>> - Mike >>>>>> >>>>>> >>>>>> ----- Original Message ----- >>>>>>> From: "Lorenzo Pesce" >>>>>>> To: "Jonathan Monette" >>>>>>> Cc: swift-user at ci.uchicago.edu >>>>>>> Sent: Saturday, April 14, 2012 8:15:39 AM >>>>>>> Subject: Re: [Swift-user] Error message on Cray XE6 >>>>>>> In principle the access to the /home filesystem should still be >>>>>>> there. >>>>>>> >>>>>>> The only thing I did was to chance the cf file to remove some >>>>>>> errors I >>>>>>> had into it, so that might also be the source of the problem. 
>>>>>>> This >>>>>>> is >>>>>>> what it looks like now: >>>>>>> (BTW, the comments are not mine, I run swift only from lustre) >>>>>>> >>>>>>> >>>>>>> # Whether to transfer the wrappers from the compute nodes >>>>>>> # I like to launch from my home dir, but keep everything on >>>>>>> # lustre >>>>>>> wrapperlog.always.transfer=false >>>>>>> >>>>>>> #Indicates whether the working directory on the remote site >>>>>>> # should be left intact even when a run completes successfully >>>>>>> sitedir.keep=true >>>>>>> >>>>>>> #try only once >>>>>>> execution.retries=1 >>>>>>> >>>>>>> # Attempt to run as much as possible, i.g., ignore non-fatal >>>>>>> errors >>>>>>> lazy.errors=true >>>>>>> >>>>>>> # to reduce filesystem access >>>>>>> status.mode=provider >>>>>>> >>>>>>> use.provider.staging=false >>>>>>> >>>>>>> provider.staging.pin.swiftfiles=false >>>>>>> >>>>>>> foreach.max.threads=100 >>>>>>> >>>>>>> provenance.log=false >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Apr 14, 2012, at 12:10 AM, Jonathan Monette wrote: >>>>>>> >>>>>>>> The perl script is the worker script that is submitted with >>>>>>>> PBS. >>>>>>>> I >>>>>>>> have not tried to run on Beagle since the maintenance period >>>>>>>> has >>>>>>>> ended so I am not exactly sure why the error popped up. One >>>>>>>> reason >>>>>>>> could be that the home file system is no longer mounted on the >>>>>>>> compute nodes. I know they spoke about that being a >>>>>>>> possibility >>>>>>>> but >>>>>>>> not sure they implemented that during the maintenance period. >>>>>>>> Do >>>>>>>> you >>>>>>>> know if the home file system is still mounted on the compute >>>>>>>> nodes? >>>>>>>> >>>>>>>> On Apr 13, 2012, at 17:18, Lorenzo Pesce >>>>>>>> wrote: >>>>>>>> >>>>>>>>> Hi -- >>>>>>>>> I haven't seen this one before: >>>>>>>>> >>>>>>>>> Can't open perl script >>>>>>>>> "/home/lpesce/.globus/coasters/cscript7176272791806289394.pl": >>>>>>>>> No >>>>>>>>> such file or directory >>>>>>>>> >>>>>>>>> The config of the cray has changed, might this have anything >>>>>>>>> to >>>>>>>>> do >>>>>>>>> with it? >>>>>>>>> I have no idea what perl script is it talking about and why >>>>>>>>> it >>>>>>>>> is >>>>>>>>> looking to home. 
>>>>>>>>> >>>>>>>>> Thanks a lot, >>>>>>>>> >>>>>>>>> Lorenzo >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> _______________________________________________ >>>>>>>>> Swift-user mailing list >>>>>>>>> Swift-user at ci.uchicago.edu >>>>>>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user >>>>>>> >>>>>>> _______________________________________________ >>>>>>> Swift-user mailing list >>>>>>> Swift-user at ci.uchicago.edu >>>>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user >>>>>> >>>>>> -- >>>>>> Michael Wilde >>>>>> Computation Institute, University of Chicago >>>>>> Mathematics and Computer Science Division >>>>>> Argonne National Laboratory >>>>>> >>>> >>>> -- >>>> Michael Wilde >>>> Computation Institute, University of Chicago >>>> Mathematics and Computer Science Division >>>> Argonne National Laboratory >>> >>> -- >>> Michael Wilde >>> Computation Institute, University of Chicago >>> Mathematics and Computer Science Division >>> Argonne National Laboratory >>> >>> _______________________________________________ >>> Swift-user mailing list >>> Swift-user at ci.uchicago.edu >>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user >> >> -- >> Michael Wilde >> Computation Institute, University of Chicago >> Mathematics and Computer Science Division >> Argonne National Laboratory >> > From wozniak at mcs.anl.gov Tue Apr 17 12:14:55 2012 From: wozniak at mcs.anl.gov (Justin M Wozniak) Date: Tue, 17 Apr 2012 12:14:55 -0500 Subject: [Swift-user] Error message on Cray XE6 In-Reply-To: <66F7EB49-829F-498B-B896-966DB44478FC@mcs.anl.gov> References: <1873029745.141382.1334418667851.JavaMail.root@zimbra.anl.gov> <30C617E5-E7EA-456F-8F53-F5F32A50FBEC@uchicago.edu> <66F7EB49-829F-498B-B896-966DB44478FC@mcs.anl.gov> Message-ID: <4F8DA50F.2000700@mcs.anl.gov> On 04/17/2012 12:08 PM, Jonathan Monette wrote: For reservations on Beagle, I think you can set profile globus:pbs.resource_list to advres=reservation_name.number The point is to set the pbs.resource_list to whatever needs to go after #PBS -l ... > There is a site file entry for that. > > scalability > > You must make certain that the shape of your job fits in the queue you requested. If it does not fit, there is a silent failure and Swift hangs. > > On Apr 17, 2012, at 11:58, Lorenzo Pesce wrote: > >> Works great! >> >> Is there a way I can ask swift to put me in a specific queue, such as scalability of some reservation? -- Justin M Wozniak From jonmon at mcs.anl.gov Tue Apr 17 12:28:01 2012 From: jonmon at mcs.anl.gov (Jonathan Monette) Date: Tue, 17 Apr 2012 12:28:01 -0500 Subject: [Swift-user] Error message on Cray XE6 In-Reply-To: <4F8DA50F.2000700@mcs.anl.gov> References: <1873029745.141382.1334418667851.JavaMail.root@zimbra.anl.gov> <30C617E5-E7EA-456F-8F53-F5F32A50FBEC@uchicago.edu> <66F7EB49-829F-498B-B896-966DB44478FC@mcs.anl.gov> <4F8DA50F.2000700@mcs.anl.gov> Message-ID: <84EF9C60-0BBD-4B96-9904-F7F21BED9B37@mcs.anl.gov> To use a reservation on Beagle you would need to use the following line: pbs.aprun;pbs.mpp;depth=24;pbs.resource_list=advres=res.id You have to use the pbs.resource_list in the providerAttributes key. I did not using pbs.resource_list as the key though. On Apr 17, 2012, at 12:14 PM, Justin M Wozniak wrote: > On 04/17/2012 12:08 PM, Jonathan Monette wrote: > > For reservations on Beagle, I think you can set profile > globus:pbs.resource_list to advres=reservation_name.number > > The point is to set the pbs.resource_list to whatever needs to go after > #PBS -l ... 
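In sites-file terms that would be a profile entry along these lines (the reservation id is a placeholder):

    <profile namespace="globus" key="pbs.resource_list">advres=res.id</profile>

alongside the queue entry, e.g.:

    <profile namespace="globus" key="queue">scalability</profile>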
> >> There is a site file entry for that. >> >> scalability >> >> You must make certain that the shape of your job fits in the queue you requested. If it does not fit, there is a silent failure and Swift hangs. >> >> On Apr 17, 2012, at 11:58, Lorenzo Pesce wrote: >> >>> Works great! >>> >>> Is there a way I can ask swift to put me in a specific queue, such as scalability of some reservation? > > -- > Justin M Wozniak > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user -------------- next part -------------- An HTML attachment was scrubbed... URL: From wilde at mcs.anl.gov Tue Apr 17 12:38:22 2012 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 17 Apr 2012 12:38:22 -0500 (CDT) Subject: [Swift-user] Error message on Cray XE6 In-Reply-To: <66F7EB49-829F-498B-B896-966DB44478FC@mcs.anl.gov> Message-ID: <80870977.145190.1334684302429.JavaMail.root@zimbra.anl.gov> I think Swift hangs in such cases because PBS silently leaves the job queued even though no current queue can take the job. That seems to be a "feature" of the RM: being able to queue jobs that *might* be runnable in the future iff new queue settings are made. I dont know a good wat for Swift to detect this, but thats something to discuss (and create a ticket for). - Mike ----- Original Message ----- > From: "Jonathan Monette" > To: "Lorenzo Pesce" > Cc: "Michael Wilde" , swift-user at ci.uchicago.edu > Sent: Tuesday, April 17, 2012 12:08:49 PM > Subject: Re: [Swift-user] Error message on Cray XE6 > There is a site file entry for that. > > scalability > > You must make certain that the shape of your job fits in the queue you > requested. If it does not fit, there is a silent failure and Swift > hangs. > > On Apr 17, 2012, at 11:58, Lorenzo Pesce wrote: > > > Works great! > > > > Is there a way I can ask swift to put me in a specific queue, such > > as scalability of some reservation? > > > > > > > > On Apr 14, 2012, at 10:51 AM, Michael Wilde wrote: > > > >> OK, here's a workaround for this problem: > >> > >> You need to add this line to the swift command bin/swift in your > >> Swift release. > >> > >> After: > >> > >> updateOptions "$SWIFT_HOME" "swift.home" > >> > >> Add: > >> > >> updateOptions "$USER_HOME" "user.home" > >> > >> This is near line 92 in the version I tested, Swift trunk > >> swift-r5739 cog-r3368. > >> > >> Then you can do: > >> > >> USER_HOME=/lustre/beagle/wilde swift -config cf -tc.file tc > >> -sites.file pbs.xml catsn.swift -n=1 > >> > >> Lorenzo, if you are using "module load swift" we'll need to update > >> that, or you can copy the swift release directory structure that > >> module load points you to, then modify the swift command there, and > >> put that modified release first in your PATH. > >> > >> We'll work out a way to get something like this into the production > >> module and trunk. I dont know of other systems that are currently > >> affected by this, but Im sure they will come up. > >> > >> - Mike > >> > >> > >> ----- Original Message ----- > >>> From: "Michael Wilde" > >>> To: "Jonathan Monette" > >>> Cc: swift-user at ci.uchicago.edu > >>> Sent: Saturday, April 14, 2012 10:13:40 AM > >>> Subject: Re: [Swift-user] Error message on Cray XE6 > >>> stackoverflow says this should work: > >>> > >>> java -Duser.home= > >>> > >>> Need to get that in via the swift command. 
> >>> > >>> - Mike > >>> > >>> > >>> ----- Original Message ----- > >>>> From: "Michael Wilde" > >>>> To: "Jonathan Monette" > >>>> Cc: "Lorenzo Pesce" , > >>>> swift-user at ci.uchicago.edu > >>>> Sent: Saturday, April 14, 2012 10:10:00 AM > >>>> Subject: Re: [Swift-user] Error message on Cray XE6 > >>>> I just tried both setting HOME=/lustre/beagle/wilde and setting > >>>> user.home to the same thing. Neither works. I think user.home is > >>>> coming from the Java property, and that doesnt seem to be > >>>> influenced > >>>> by the HOME env var. I was about to look if Java can be asked to > >>>> change home. Maybe by setting a command line arg to Java. > >>>> > >>>> - Mike > >>>> > >>>> ----- Original Message ----- > >>>>> From: "Jonathan Monette" > >>>>> To: "Michael Wilde" > >>>>> Cc: "Lorenzo Pesce" , > >>>>> swift-user at ci.uchicago.edu > >>>>> Sent: Saturday, April 14, 2012 10:02:14 AM > >>>>> Subject: Re: [Swift-user] Error message on Cray XE6 > >>>>> That is an easy fix I believe. I know where the code is so I > >>>>> will > >>>>> change and test. > >>>>> > >>>>> In the mean time could you try something? Try setting > >>>>> user.home= > >>>>> in your config file and try again. > >>>>> > >>>>> On Apr 14, 2012, at 9:58, Michael Wilde > >>>>> wrote: > >>>>> > >>>>>> /home is no longer mounted by the compute nodes, per the > >>>>>> post-maitenance summary: > >>>>>> > >>>>>> "External filesystem dependencies minimized: Compute nodes and > >>>>>> the > >>>>>> scheduler should now continue to process and complete jobs > >>>>>> without > >>>>>> the threat of interference of external filesystem outages. > >>>>>> /gpfs/pads is only available on login1 through login5; /home is > >>>>>> on > >>>>>> login and mom nodes only." > >>>>>> > >>>>>> So we need to (finally) remove Swift's dependence on > >>>>>> $HOME/.globus > >>>>>> and $HOME/.globus/scripts in particular. > >>>>>> > >>>>>> I suggest - since the swift command already needs to write to > >>>>>> "." > >>>>>> - > >>>>>> that we create a scripts/ directory in "." instead of > >>>>>> $HOME/.globus. > >>>>>> And this should be used by any provider that would have > >>>>>> previously > >>>>>> created files below .globus. > >>>>>> > >>>>>> I'll echo this to swift-devel and start a thread there to > >>>>>> discuss. > >>>>>> Its possible there's already a property to cause scripts/ to be > >>>>>> created elsewhere. If not, I think we should make one. I think > >>>>>> grouping the scripts created by a run into the current dir, > >>>>>> along > >>>>>> with the swift log, _concurrent, and (in the conventions I use > >>>>>> in > >>>>>> my > >>>>>> run scripts) swiftwork/. > >>>>>> > >>>>>> Lorenzo, hopefully we can at least get you a workaround for > >>>>>> this > >>>>>> soon. > >>>>>> > >>>>>> You *might* be able to trick swift into doing this by setting > >>>>>> HOME=/lustre/beagle/$USER. I already tried a symlink under > >>>>>> .globus > >>>>>> and that didnt work, as /home is not even readable by the > >>>>>> compute > >>>>>> nodes, which in this case need to run the coaster worker (.pl) > >>>>>> script. > >>>>>> > >>>>>> - Mike > >>>>>> > >>>>>> > >>>>>> ----- Original Message ----- > >>>>>>> From: "Lorenzo Pesce" > >>>>>>> To: "Jonathan Monette" > >>>>>>> Cc: swift-user at ci.uchicago.edu > >>>>>>> Sent: Saturday, April 14, 2012 8:15:39 AM > >>>>>>> Subject: Re: [Swift-user] Error message on Cray XE6 > >>>>>>> In principle the access to the /home filesystem should still > >>>>>>> be > >>>>>>> there. 
> >>>>>>> > >>>>>>> The only thing I did was to chance the cf file to remove some > >>>>>>> errors I > >>>>>>> had into it, so that might also be the source of the problem. > >>>>>>> This > >>>>>>> is > >>>>>>> what it looks like now: > >>>>>>> (BTW, the comments are not mine, I run swift only from lustre) > >>>>>>> > >>>>>>> > >>>>>>> # Whether to transfer the wrappers from the compute nodes > >>>>>>> # I like to launch from my home dir, but keep everything on > >>>>>>> # lustre > >>>>>>> wrapperlog.always.transfer=false > >>>>>>> > >>>>>>> #Indicates whether the working directory on the remote site > >>>>>>> # should be left intact even when a run completes successfully > >>>>>>> sitedir.keep=true > >>>>>>> > >>>>>>> #try only once > >>>>>>> execution.retries=1 > >>>>>>> > >>>>>>> # Attempt to run as much as possible, i.g., ignore non-fatal > >>>>>>> errors > >>>>>>> lazy.errors=true > >>>>>>> > >>>>>>> # to reduce filesystem access > >>>>>>> status.mode=provider > >>>>>>> > >>>>>>> use.provider.staging=false > >>>>>>> > >>>>>>> provider.staging.pin.swiftfiles=false > >>>>>>> > >>>>>>> foreach.max.threads=100 > >>>>>>> > >>>>>>> provenance.log=false > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> On Apr 14, 2012, at 12:10 AM, Jonathan Monette wrote: > >>>>>>> > >>>>>>>> The perl script is the worker script that is submitted with > >>>>>>>> PBS. > >>>>>>>> I > >>>>>>>> have not tried to run on Beagle since the maintenance period > >>>>>>>> has > >>>>>>>> ended so I am not exactly sure why the error popped up. One > >>>>>>>> reason > >>>>>>>> could be that the home file system is no longer mounted on > >>>>>>>> the > >>>>>>>> compute nodes. I know they spoke about that being a > >>>>>>>> possibility > >>>>>>>> but > >>>>>>>> not sure they implemented that during the maintenance period. > >>>>>>>> Do > >>>>>>>> you > >>>>>>>> know if the home file system is still mounted on the compute > >>>>>>>> nodes? > >>>>>>>> > >>>>>>>> On Apr 13, 2012, at 17:18, Lorenzo Pesce > >>>>>>>> > >>>>>>>> wrote: > >>>>>>>> > >>>>>>>>> Hi -- > >>>>>>>>> I haven't seen this one before: > >>>>>>>>> > >>>>>>>>> Can't open perl script > >>>>>>>>> "/home/lpesce/.globus/coasters/cscript7176272791806289394.pl": > >>>>>>>>> No > >>>>>>>>> such file or directory > >>>>>>>>> > >>>>>>>>> The config of the cray has changed, might this have anything > >>>>>>>>> to > >>>>>>>>> do > >>>>>>>>> with it? > >>>>>>>>> I have no idea what perl script is it talking about and why > >>>>>>>>> it > >>>>>>>>> is > >>>>>>>>> looking to home. 
> >>>>>>>>> > >>>>>>>>> Thanks a lot, > >>>>>>>>> > >>>>>>>>> Lorenzo > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> _______________________________________________ > >>>>>>>>> Swift-user mailing list > >>>>>>>>> Swift-user at ci.uchicago.edu > >>>>>>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > >>>>>>> > >>>>>>> _______________________________________________ > >>>>>>> Swift-user mailing list > >>>>>>> Swift-user at ci.uchicago.edu > >>>>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > >>>>>> > >>>>>> -- > >>>>>> Michael Wilde > >>>>>> Computation Institute, University of Chicago > >>>>>> Mathematics and Computer Science Division > >>>>>> Argonne National Laboratory > >>>>>> > >>>> > >>>> -- > >>>> Michael Wilde > >>>> Computation Institute, University of Chicago > >>>> Mathematics and Computer Science Division > >>>> Argonne National Laboratory > >>> > >>> -- > >>> Michael Wilde > >>> Computation Institute, University of Chicago > >>> Mathematics and Computer Science Division > >>> Argonne National Laboratory > >>> > >>> _______________________________________________ > >>> Swift-user mailing list > >>> Swift-user at ci.uchicago.edu > >>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > >> > >> -- > >> Michael Wilde > >> Computation Institute, University of Chicago > >> Mathematics and Computer Science Division > >> Argonne National Laboratory > >> > > -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From jonmon at mcs.anl.gov Tue Apr 17 12:49:12 2012 From: jonmon at mcs.anl.gov (Jonathan Monette) Date: Tue, 17 Apr 2012 12:49:12 -0500 Subject: [Swift-user] Error message on Cray XE6 In-Reply-To: <80870977.145190.1334684302429.JavaMail.root@zimbra.anl.gov> References: <80870977.145190.1334684302429.JavaMail.root@zimbra.anl.gov> Message-ID: <4D3C82A5-5FDC-4964-A124-F710D7F04644@mcs.anl.gov> I do not think that is the case where PBS leaves the job queued, maybe on some machines but no on Beagle. When I had a job that did not fit in the scalability queue Swift hung but when checking the log I found a message from qsub saying the job was rejected. There is a bug ticket open for this issue. I will find the log that has the message(or just recreate it) and post the message to the ticket. Swift also hangs(with a qsub message in the log) if you try to submit a PBS job to machine where you no longer have an allocation. I received this message when trying to use Fusion after a long time. On Apr 17, 2012, at 12:38 PM, Michael Wilde wrote: > I think Swift hangs in such cases because PBS silently leaves the job queued even though no current queue can take the job. That seems to be a "feature" of the RM: being able to queue jobs that *might* be runnable in the future iff new queue settings are made. I dont know a good wat for Swift to detect this, but thats something to discuss (and create a ticket for). > > - Mike > > ----- Original Message ----- >> From: "Jonathan Monette" >> To: "Lorenzo Pesce" >> Cc: "Michael Wilde" , swift-user at ci.uchicago.edu >> Sent: Tuesday, April 17, 2012 12:08:49 PM >> Subject: Re: [Swift-user] Error message on Cray XE6 >> There is a site file entry for that. >> >> scalability >> >> You must make certain that the shape of your job fits in the queue you >> requested. If it does not fit, there is a silent failure and Swift >> hangs. >> >> On Apr 17, 2012, at 11:58, Lorenzo Pesce wrote: >> >>> Works great! 
>>> >>> Is there a way I can ask swift to put me in a specific queue, such >>> as scalability of some reservation? >>> >>> >>> >>> On Apr 14, 2012, at 10:51 AM, Michael Wilde wrote: >>> >>>> OK, here's a workaround for this problem: >>>> >>>> You need to add this line to the swift command bin/swift in your >>>> Swift release. >>>> >>>> After: >>>> >>>> updateOptions "$SWIFT_HOME" "swift.home" >>>> >>>> Add: >>>> >>>> updateOptions "$USER_HOME" "user.home" >>>> >>>> This is near line 92 in the version I tested, Swift trunk >>>> swift-r5739 cog-r3368. >>>> >>>> Then you can do: >>>> >>>> USER_HOME=/lustre/beagle/wilde swift -config cf -tc.file tc >>>> -sites.file pbs.xml catsn.swift -n=1 >>>> >>>> Lorenzo, if you are using "module load swift" we'll need to update >>>> that, or you can copy the swift release directory structure that >>>> module load points you to, then modify the swift command there, and >>>> put that modified release first in your PATH. >>>> >>>> We'll work out a way to get something like this into the production >>>> module and trunk. I dont know of other systems that are currently >>>> affected by this, but Im sure they will come up. >>>> >>>> - Mike >>>> >>>> >>>> ----- Original Message ----- >>>>> From: "Michael Wilde" >>>>> To: "Jonathan Monette" >>>>> Cc: swift-user at ci.uchicago.edu >>>>> Sent: Saturday, April 14, 2012 10:13:40 AM >>>>> Subject: Re: [Swift-user] Error message on Cray XE6 >>>>> stackoverflow says this should work: >>>>> >>>>> java -Duser.home= >>>>> >>>>> Need to get that in via the swift command. >>>>> >>>>> - Mike >>>>> >>>>> >>>>> ----- Original Message ----- >>>>>> From: "Michael Wilde" >>>>>> To: "Jonathan Monette" >>>>>> Cc: "Lorenzo Pesce" , >>>>>> swift-user at ci.uchicago.edu >>>>>> Sent: Saturday, April 14, 2012 10:10:00 AM >>>>>> Subject: Re: [Swift-user] Error message on Cray XE6 >>>>>> I just tried both setting HOME=/lustre/beagle/wilde and setting >>>>>> user.home to the same thing. Neither works. I think user.home is >>>>>> coming from the Java property, and that doesnt seem to be >>>>>> influenced >>>>>> by the HOME env var. I was about to look if Java can be asked to >>>>>> change home. Maybe by setting a command line arg to Java. >>>>>> >>>>>> - Mike >>>>>> >>>>>> ----- Original Message ----- >>>>>>> From: "Jonathan Monette" >>>>>>> To: "Michael Wilde" >>>>>>> Cc: "Lorenzo Pesce" , >>>>>>> swift-user at ci.uchicago.edu >>>>>>> Sent: Saturday, April 14, 2012 10:02:14 AM >>>>>>> Subject: Re: [Swift-user] Error message on Cray XE6 >>>>>>> That is an easy fix I believe. I know where the code is so I >>>>>>> will >>>>>>> change and test. >>>>>>> >>>>>>> In the mean time could you try something? Try setting >>>>>>> user.home= >>>>>>> in your config file and try again. >>>>>>> >>>>>>> On Apr 14, 2012, at 9:58, Michael Wilde >>>>>>> wrote: >>>>>>> >>>>>>>> /home is no longer mounted by the compute nodes, per the >>>>>>>> post-maitenance summary: >>>>>>>> >>>>>>>> "External filesystem dependencies minimized: Compute nodes and >>>>>>>> the >>>>>>>> scheduler should now continue to process and complete jobs >>>>>>>> without >>>>>>>> the threat of interference of external filesystem outages. >>>>>>>> /gpfs/pads is only available on login1 through login5; /home is >>>>>>>> on >>>>>>>> login and mom nodes only." >>>>>>>> >>>>>>>> So we need to (finally) remove Swift's dependence on >>>>>>>> $HOME/.globus >>>>>>>> and $HOME/.globus/scripts in particular. >>>>>>>> >>>>>>>> I suggest - since the swift command already needs to write to >>>>>>>> "." 
>>>>>>>> - >>>>>>>> that we create a scripts/ directory in "." instead of >>>>>>>> $HOME/.globus. >>>>>>>> And this should be used by any provider that would have >>>>>>>> previously >>>>>>>> created files below .globus. >>>>>>>> >>>>>>>> I'll echo this to swift-devel and start a thread there to >>>>>>>> discuss. >>>>>>>> Its possible there's already a property to cause scripts/ to be >>>>>>>> created elsewhere. If not, I think we should make one. I think >>>>>>>> grouping the scripts created by a run into the current dir, >>>>>>>> along >>>>>>>> with the swift log, _concurrent, and (in the conventions I use >>>>>>>> in >>>>>>>> my >>>>>>>> run scripts) swiftwork/. >>>>>>>> >>>>>>>> Lorenzo, hopefully we can at least get you a workaround for >>>>>>>> this >>>>>>>> soon. >>>>>>>> >>>>>>>> You *might* be able to trick swift into doing this by setting >>>>>>>> HOME=/lustre/beagle/$USER. I already tried a symlink under >>>>>>>> .globus >>>>>>>> and that didnt work, as /home is not even readable by the >>>>>>>> compute >>>>>>>> nodes, which in this case need to run the coaster worker (.pl) >>>>>>>> script. >>>>>>>> >>>>>>>> - Mike >>>>>>>> >>>>>>>> >>>>>>>> ----- Original Message ----- >>>>>>>>> From: "Lorenzo Pesce" >>>>>>>>> To: "Jonathan Monette" >>>>>>>>> Cc: swift-user at ci.uchicago.edu >>>>>>>>> Sent: Saturday, April 14, 2012 8:15:39 AM >>>>>>>>> Subject: Re: [Swift-user] Error message on Cray XE6 >>>>>>>>> In principle the access to the /home filesystem should still >>>>>>>>> be >>>>>>>>> there. >>>>>>>>> >>>>>>>>> The only thing I did was to chance the cf file to remove some >>>>>>>>> errors I >>>>>>>>> had into it, so that might also be the source of the problem. >>>>>>>>> This >>>>>>>>> is >>>>>>>>> what it looks like now: >>>>>>>>> (BTW, the comments are not mine, I run swift only from lustre) >>>>>>>>> >>>>>>>>> >>>>>>>>> # Whether to transfer the wrappers from the compute nodes >>>>>>>>> # I like to launch from my home dir, but keep everything on >>>>>>>>> # lustre >>>>>>>>> wrapperlog.always.transfer=false >>>>>>>>> >>>>>>>>> #Indicates whether the working directory on the remote site >>>>>>>>> # should be left intact even when a run completes successfully >>>>>>>>> sitedir.keep=true >>>>>>>>> >>>>>>>>> #try only once >>>>>>>>> execution.retries=1 >>>>>>>>> >>>>>>>>> # Attempt to run as much as possible, i.g., ignore non-fatal >>>>>>>>> errors >>>>>>>>> lazy.errors=true >>>>>>>>> >>>>>>>>> # to reduce filesystem access >>>>>>>>> status.mode=provider >>>>>>>>> >>>>>>>>> use.provider.staging=false >>>>>>>>> >>>>>>>>> provider.staging.pin.swiftfiles=false >>>>>>>>> >>>>>>>>> foreach.max.threads=100 >>>>>>>>> >>>>>>>>> provenance.log=false >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> On Apr 14, 2012, at 12:10 AM, Jonathan Monette wrote: >>>>>>>>> >>>>>>>>>> The perl script is the worker script that is submitted with >>>>>>>>>> PBS. >>>>>>>>>> I >>>>>>>>>> have not tried to run on Beagle since the maintenance period >>>>>>>>>> has >>>>>>>>>> ended so I am not exactly sure why the error popped up. One >>>>>>>>>> reason >>>>>>>>>> could be that the home file system is no longer mounted on >>>>>>>>>> the >>>>>>>>>> compute nodes. I know they spoke about that being a >>>>>>>>>> possibility >>>>>>>>>> but >>>>>>>>>> not sure they implemented that during the maintenance period. >>>>>>>>>> Do >>>>>>>>>> you >>>>>>>>>> know if the home file system is still mounted on the compute >>>>>>>>>> nodes? 
>>>>>>>>>> >>>>>>>>>> On Apr 13, 2012, at 17:18, Lorenzo Pesce >>>>>>>>>> >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>>> Hi -- >>>>>>>>>>> I haven't seen this one before: >>>>>>>>>>> >>>>>>>>>>> Can't open perl script >>>>>>>>>>> "/home/lpesce/.globus/coasters/cscript7176272791806289394.pl": >>>>>>>>>>> No >>>>>>>>>>> such file or directory >>>>>>>>>>> >>>>>>>>>>> The config of the cray has changed, might this have anything >>>>>>>>>>> to >>>>>>>>>>> do >>>>>>>>>>> with it? >>>>>>>>>>> I have no idea what perl script is it talking about and why >>>>>>>>>>> it >>>>>>>>>>> is >>>>>>>>>>> looking to home. >>>>>>>>>>> >>>>>>>>>>> Thanks a lot, >>>>>>>>>>> >>>>>>>>>>> Lorenzo >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> _______________________________________________ >>>>>>>>>>> Swift-user mailing list >>>>>>>>>>> Swift-user at ci.uchicago.edu >>>>>>>>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user >>>>>>>>> >>>>>>>>> _______________________________________________ >>>>>>>>> Swift-user mailing list >>>>>>>>> Swift-user at ci.uchicago.edu >>>>>>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user >>>>>>>> >>>>>>>> -- >>>>>>>> Michael Wilde >>>>>>>> Computation Institute, University of Chicago >>>>>>>> Mathematics and Computer Science Division >>>>>>>> Argonne National Laboratory >>>>>>>> >>>>>> >>>>>> -- >>>>>> Michael Wilde >>>>>> Computation Institute, University of Chicago >>>>>> Mathematics and Computer Science Division >>>>>> Argonne National Laboratory >>>>> >>>>> -- >>>>> Michael Wilde >>>>> Computation Institute, University of Chicago >>>>> Mathematics and Computer Science Division >>>>> Argonne National Laboratory >>>>> >>>>> _______________________________________________ >>>>> Swift-user mailing list >>>>> Swift-user at ci.uchicago.edu >>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user >>>> >>>> -- >>>> Michael Wilde >>>> Computation Institute, University of Chicago >>>> Mathematics and Computer Science Division >>>> Argonne National Laboratory >>>> >>> > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > From lpesce at uchicago.edu Tue Apr 17 14:28:19 2012 From: lpesce at uchicago.edu (Lorenzo Pesce) Date: Tue, 17 Apr 2012 14:28:19 -0500 Subject: [Swift-user] Error message on Cray XE6 In-Reply-To: <84EF9C60-0BBD-4B96-9904-F7F21BED9B37@mcs.anl.gov> References: <1873029745.141382.1334418667851.JavaMail.root@zimbra.anl.gov> <30C617E5-E7EA-456F-8F53-F5F32A50FBEC@uchicago.edu> <66F7EB49-829F-498B-B896-966DB44478FC@mcs.anl.gov> <4F8DA50F.2000700@mcs.anl.gov> <84EF9C60-0BBD-4B96-9904-F7F21BED9B37@mcs.anl.gov> Message-ID: <7938B237-9BEB-461E-9E86-028A62CA06E9@uchicago.edu> Thanks a lot. I will add the advres to the script, test it and add it to the swift on Beagle lecture. I have tested the scalability and it seems to work. I set checks for job size and length and set automatic resize to fit into the requested queue (with message to the user about it). On Apr 17, 2012, at 12:28 PM, Jonathan Monette wrote: > To use a reservation on Beagle you would need to use the following line: > > pbs.aprun;pbs.mpp;depth=24;pbs.resource_list=advres=res.id > > You have to use the pbs.resource_list in the providerAttributes key. I did not using pbs.resource_list as the key though. 
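In sites.xml that would presumably mean putting it in the providerAttributes profile of the coaster pool entry, something like the line below, with res.id replaced by the actual reservation ID:

<profile namespace="globus" key="providerAttributes">pbs.aprun;pbs.mpp;depth=24;pbs.resource_list=advres=res.id</profile>
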
> > On Apr 17, 2012, at 12:14 PM, Justin M Wozniak wrote: > >> On 04/17/2012 12:08 PM, Jonathan Monette wrote: >> >> For reservations on Beagle, I think you can set profile >> globus:pbs.resource_list to advres=reservation_name.number >> >> The point is to set the pbs.resource_list to whatever needs to go after >> #PBS -l ... >> >>> There is a site file entry for that. >>> >>> scalability >>> >>> You must make certain that the shape of your job fits in the queue you requested. If it does not fit, there is a silent failure and Swift hangs. >>> >>> On Apr 17, 2012, at 11:58, Lorenzo Pesce wrote: >>> >>>> Works great! >>>> >>>> Is there a way I can ask swift to put me in a specific queue, such as scalability of some reservation? >> >> -- >> Justin M Wozniak >> _______________________________________________ >> Swift-user mailing list >> Swift-user at ci.uchicago.edu >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user -------------- next part -------------- An HTML attachment was scrubbed... URL: From lpesce at uchicago.edu Tue Apr 17 15:27:12 2012 From: lpesce at uchicago.edu (Lorenzo Pesce) Date: Tue, 17 Apr 2012 15:27:12 -0500 Subject: [Swift-user] Question about job sizes Message-ID: <94ADC29B-6F18-41F5-A53D-68F601217014@uchicago.edu> Me again ;-) If the jobs occupy memory and CPU in a predictable way, say they have a parameter i =1:20 where 20 takes 20 time units and 1 takes on. Memory also grows, but not proportionally. Is there a way to tell coasters how to pack jobs in order to not bust the memory? Lorenzo From jonmon at mcs.anl.gov Tue Apr 17 17:59:14 2012 From: jonmon at mcs.anl.gov (Jonathan Monette) Date: Tue, 17 Apr 2012 17:59:14 -0500 Subject: [Swift-user] Question about job sizes In-Reply-To: <94ADC29B-6F18-41F5-A53D-68F601217014@uchicago.edu> References: <94ADC29B-6F18-41F5-A53D-68F601217014@uchicago.edu> Message-ID: I am not sure I understand the question. Are you saying that how to have coasters use less memory for longer running jobs? Or is the question that when there are more concurrently running apps memory is increasing in a non-proportional way? I may be wrong with both interpretations. Could you elaborate a bit more? On Apr 17, 2012, at 15:27, Lorenzo Pesce wrote: > Me again ;-) > > If the jobs occupy memory and CPU in a predictable way, say they have a parameter i =1:20 where 20 takes 20 time units and 1 takes on. Memory also grows, but not proportionally. > Is there a way to tell coasters how to pack jobs in order to not bust the memory? > > Lorenzo > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user From lpesce at uchicago.edu Tue Apr 17 19:11:24 2012 From: lpesce at uchicago.edu (Lorenzo Pesce) Date: Tue, 17 Apr 2012 19:11:24 -0500 Subject: [Swift-user] Question about job sizes In-Reply-To: References: <94ADC29B-6F18-41F5-A53D-68F601217014@uchicago.edu> Message-ID: Sorry for being unclear. > I am not sure I understand the question. Are you saying that how to have coasters use less memory for longer running jobs? No. Some of the jobs will take more memory than other and it is reasonably predictable. I was wondering if I can pass that information to swift to prevent it from sending too many "large memory" jobs to the same node and blow it up. 
Those jobs tend to take longer as well, but that is less important to a degree (would be good to send them in first without overdoing it). I guess that perhaps I could try to force the big ones to go in one per NUMA node right away and let the other ones more or less move in as the slots open up. > Or is the question that when there are more concurrently running apps memory is increasing in a non-proportional way? I may be wrong with both interpretations. Could you elaborate a bit more? > > On Apr 17, 2012, at 15:27, Lorenzo Pesce wrote: > >> Me again ;-) >> >> If the jobs occupy memory and CPU in a predictable way, say they have a parameter i =1:20 where 20 takes 20 time units and 1 takes on. Memory also grows, but not proportionally. >> Is there a way to tell coasters how to pack jobs in order to not bust the memory? >> >> Lorenzo >> _______________________________________________ >> Swift-user mailing list >> Swift-user at ci.uchicago.edu >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user From hategan at mcs.anl.gov Tue Apr 17 19:25:28 2012 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 17 Apr 2012 17:25:28 -0700 Subject: [Swift-user] Error message on Cray XE6 In-Reply-To: <4D3C82A5-5FDC-4964-A124-F710D7F04644@mcs.anl.gov> References: <80870977.145190.1334684302429.JavaMail.root@zimbra.anl.gov> <4D3C82A5-5FDC-4964-A124-F710D7F04644@mcs.anl.gov> Message-ID: <1334708728.19777.2.camel@blabla> Hmm, so if a block job fails, coasters will fail at least one swift job. If this happens enough times, the failure should propagate through the retries and to the user. It might take some time though. So maybe there's a distinction between "hangs" and "takes a lot of time but eventually fails". Mihael On Tue, 2012-04-17 at 12:49 -0500, Jonathan Monette wrote: > I do not think that is the case where PBS leaves the job queued, maybe > on some machines but no on Beagle. When I had a job that did not fit > in the scalability queue Swift hung but when checking the log I found > a message from qsub saying the job was rejected. There is a bug > ticket open for this issue. I will find the log that has the > message(or just recreate it) and post the message to the ticket. > Swift also hangs(with a qsub message in the log) if you try to submit > a PBS job to machine where you no longer have an allocation. I > received this message when trying to use Fusion after a long time. > > On Apr 17, 2012, at 12:38 PM, Michael Wilde wrote: > > > I think Swift hangs in such cases because PBS silently leaves the job queued even though no current queue can take the job. That seems to be a "feature" of the RM: being able to queue jobs that *might* be runnable in the future iff new queue settings are made. I dont know a good wat for Swift to detect this, but thats something to discuss (and create a ticket for). > > > > - Mike > > > > ----- Original Message ----- > >> From: "Jonathan Monette" > >> To: "Lorenzo Pesce" > >> Cc: "Michael Wilde" , swift-user at ci.uchicago.edu > >> Sent: Tuesday, April 17, 2012 12:08:49 PM > >> Subject: Re: [Swift-user] Error message on Cray XE6 > >> There is a site file entry for that. > >> > >> scalability > >> > >> You must make certain that the shape of your job fits in the queue you > >> requested. If it does not fit, there is a silent failure and Swift > >> hangs. > >> > >> On Apr 17, 2012, at 11:58, Lorenzo Pesce wrote: > >> > >>> Works great! > >>> > >>> Is there a way I can ask swift to put me in a specific queue, such > >>> as scalability of some reservation? 
> >>> > >>> > >>> > >>> On Apr 14, 2012, at 10:51 AM, Michael Wilde wrote: > >>> > >>>> OK, here's a workaround for this problem: > >>>> > >>>> You need to add this line to the swift command bin/swift in your > >>>> Swift release. > >>>> > >>>> After: > >>>> > >>>> updateOptions "$SWIFT_HOME" "swift.home" > >>>> > >>>> Add: > >>>> > >>>> updateOptions "$USER_HOME" "user.home" > >>>> > >>>> This is near line 92 in the version I tested, Swift trunk > >>>> swift-r5739 cog-r3368. > >>>> > >>>> Then you can do: > >>>> > >>>> USER_HOME=/lustre/beagle/wilde swift -config cf -tc.file tc > >>>> -sites.file pbs.xml catsn.swift -n=1 > >>>> > >>>> Lorenzo, if you are using "module load swift" we'll need to update > >>>> that, or you can copy the swift release directory structure that > >>>> module load points you to, then modify the swift command there, and > >>>> put that modified release first in your PATH. > >>>> > >>>> We'll work out a way to get something like this into the production > >>>> module and trunk. I dont know of other systems that are currently > >>>> affected by this, but Im sure they will come up. > >>>> > >>>> - Mike > >>>> > >>>> > >>>> ----- Original Message ----- > >>>>> From: "Michael Wilde" > >>>>> To: "Jonathan Monette" > >>>>> Cc: swift-user at ci.uchicago.edu > >>>>> Sent: Saturday, April 14, 2012 10:13:40 AM > >>>>> Subject: Re: [Swift-user] Error message on Cray XE6 > >>>>> stackoverflow says this should work: > >>>>> > >>>>> java -Duser.home= > >>>>> > >>>>> Need to get that in via the swift command. > >>>>> > >>>>> - Mike > >>>>> > >>>>> > >>>>> ----- Original Message ----- > >>>>>> From: "Michael Wilde" > >>>>>> To: "Jonathan Monette" > >>>>>> Cc: "Lorenzo Pesce" , > >>>>>> swift-user at ci.uchicago.edu > >>>>>> Sent: Saturday, April 14, 2012 10:10:00 AM > >>>>>> Subject: Re: [Swift-user] Error message on Cray XE6 > >>>>>> I just tried both setting HOME=/lustre/beagle/wilde and setting > >>>>>> user.home to the same thing. Neither works. I think user.home is > >>>>>> coming from the Java property, and that doesnt seem to be > >>>>>> influenced > >>>>>> by the HOME env var. I was about to look if Java can be asked to > >>>>>> change home. Maybe by setting a command line arg to Java. > >>>>>> > >>>>>> - Mike > >>>>>> > >>>>>> ----- Original Message ----- > >>>>>>> From: "Jonathan Monette" > >>>>>>> To: "Michael Wilde" > >>>>>>> Cc: "Lorenzo Pesce" , > >>>>>>> swift-user at ci.uchicago.edu > >>>>>>> Sent: Saturday, April 14, 2012 10:02:14 AM > >>>>>>> Subject: Re: [Swift-user] Error message on Cray XE6 > >>>>>>> That is an easy fix I believe. I know where the code is so I > >>>>>>> will > >>>>>>> change and test. > >>>>>>> > >>>>>>> In the mean time could you try something? Try setting > >>>>>>> user.home= > >>>>>>> in your config file and try again. > >>>>>>> > >>>>>>> On Apr 14, 2012, at 9:58, Michael Wilde > >>>>>>> wrote: > >>>>>>> > >>>>>>>> /home is no longer mounted by the compute nodes, per the > >>>>>>>> post-maitenance summary: > >>>>>>>> > >>>>>>>> "External filesystem dependencies minimized: Compute nodes and > >>>>>>>> the > >>>>>>>> scheduler should now continue to process and complete jobs > >>>>>>>> without > >>>>>>>> the threat of interference of external filesystem outages. > >>>>>>>> /gpfs/pads is only available on login1 through login5; /home is > >>>>>>>> on > >>>>>>>> login and mom nodes only." > >>>>>>>> > >>>>>>>> So we need to (finally) remove Swift's dependence on > >>>>>>>> $HOME/.globus > >>>>>>>> and $HOME/.globus/scripts in particular. 
> >>>>>>>> > >>>>>>>> I suggest - since the swift command already needs to write to > >>>>>>>> "." > >>>>>>>> - > >>>>>>>> that we create a scripts/ directory in "." instead of > >>>>>>>> $HOME/.globus. > >>>>>>>> And this should be used by any provider that would have > >>>>>>>> previously > >>>>>>>> created files below .globus. > >>>>>>>> > >>>>>>>> I'll echo this to swift-devel and start a thread there to > >>>>>>>> discuss. > >>>>>>>> Its possible there's already a property to cause scripts/ to be > >>>>>>>> created elsewhere. If not, I think we should make one. I think > >>>>>>>> grouping the scripts created by a run into the current dir, > >>>>>>>> along > >>>>>>>> with the swift log, _concurrent, and (in the conventions I use > >>>>>>>> in > >>>>>>>> my > >>>>>>>> run scripts) swiftwork/. > >>>>>>>> > >>>>>>>> Lorenzo, hopefully we can at least get you a workaround for > >>>>>>>> this > >>>>>>>> soon. > >>>>>>>> > >>>>>>>> You *might* be able to trick swift into doing this by setting > >>>>>>>> HOME=/lustre/beagle/$USER. I already tried a symlink under > >>>>>>>> .globus > >>>>>>>> and that didnt work, as /home is not even readable by the > >>>>>>>> compute > >>>>>>>> nodes, which in this case need to run the coaster worker (.pl) > >>>>>>>> script. > >>>>>>>> > >>>>>>>> - Mike > >>>>>>>> > >>>>>>>> > >>>>>>>> ----- Original Message ----- > >>>>>>>>> From: "Lorenzo Pesce" > >>>>>>>>> To: "Jonathan Monette" > >>>>>>>>> Cc: swift-user at ci.uchicago.edu > >>>>>>>>> Sent: Saturday, April 14, 2012 8:15:39 AM > >>>>>>>>> Subject: Re: [Swift-user] Error message on Cray XE6 > >>>>>>>>> In principle the access to the /home filesystem should still > >>>>>>>>> be > >>>>>>>>> there. > >>>>>>>>> > >>>>>>>>> The only thing I did was to chance the cf file to remove some > >>>>>>>>> errors I > >>>>>>>>> had into it, so that might also be the source of the problem. > >>>>>>>>> This > >>>>>>>>> is > >>>>>>>>> what it looks like now: > >>>>>>>>> (BTW, the comments are not mine, I run swift only from lustre) > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> # Whether to transfer the wrappers from the compute nodes > >>>>>>>>> # I like to launch from my home dir, but keep everything on > >>>>>>>>> # lustre > >>>>>>>>> wrapperlog.always.transfer=false > >>>>>>>>> > >>>>>>>>> #Indicates whether the working directory on the remote site > >>>>>>>>> # should be left intact even when a run completes successfully > >>>>>>>>> sitedir.keep=true > >>>>>>>>> > >>>>>>>>> #try only once > >>>>>>>>> execution.retries=1 > >>>>>>>>> > >>>>>>>>> # Attempt to run as much as possible, i.g., ignore non-fatal > >>>>>>>>> errors > >>>>>>>>> lazy.errors=true > >>>>>>>>> > >>>>>>>>> # to reduce filesystem access > >>>>>>>>> status.mode=provider > >>>>>>>>> > >>>>>>>>> use.provider.staging=false > >>>>>>>>> > >>>>>>>>> provider.staging.pin.swiftfiles=false > >>>>>>>>> > >>>>>>>>> foreach.max.threads=100 > >>>>>>>>> > >>>>>>>>> provenance.log=false > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> On Apr 14, 2012, at 12:10 AM, Jonathan Monette wrote: > >>>>>>>>> > >>>>>>>>>> The perl script is the worker script that is submitted with > >>>>>>>>>> PBS. > >>>>>>>>>> I > >>>>>>>>>> have not tried to run on Beagle since the maintenance period > >>>>>>>>>> has > >>>>>>>>>> ended so I am not exactly sure why the error popped up. One > >>>>>>>>>> reason > >>>>>>>>>> could be that the home file system is no longer mounted on > >>>>>>>>>> the > >>>>>>>>>> compute nodes. 
I know they spoke about that being a > >>>>>>>>>> possibility > >>>>>>>>>> but > >>>>>>>>>> not sure they implemented that during the maintenance period. > >>>>>>>>>> Do > >>>>>>>>>> you > >>>>>>>>>> know if the home file system is still mounted on the compute > >>>>>>>>>> nodes? > >>>>>>>>>> > >>>>>>>>>> On Apr 13, 2012, at 17:18, Lorenzo Pesce > >>>>>>>>>> > >>>>>>>>>> wrote: > >>>>>>>>>> > >>>>>>>>>>> Hi -- > >>>>>>>>>>> I haven't seen this one before: > >>>>>>>>>>> > >>>>>>>>>>> Can't open perl script > >>>>>>>>>>> "/home/lpesce/.globus/coasters/cscript7176272791806289394.pl": > >>>>>>>>>>> No > >>>>>>>>>>> such file or directory > >>>>>>>>>>> > >>>>>>>>>>> The config of the cray has changed, might this have anything > >>>>>>>>>>> to > >>>>>>>>>>> do > >>>>>>>>>>> with it? > >>>>>>>>>>> I have no idea what perl script is it talking about and why > >>>>>>>>>>> it > >>>>>>>>>>> is > >>>>>>>>>>> looking to home. > >>>>>>>>>>> > >>>>>>>>>>> Thanks a lot, > >>>>>>>>>>> > >>>>>>>>>>> Lorenzo > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> _______________________________________________ > >>>>>>>>>>> Swift-user mailing list > >>>>>>>>>>> Swift-user at ci.uchicago.edu > >>>>>>>>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > >>>>>>>>> > >>>>>>>>> _______________________________________________ > >>>>>>>>> Swift-user mailing list > >>>>>>>>> Swift-user at ci.uchicago.edu > >>>>>>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > >>>>>>>> > >>>>>>>> -- > >>>>>>>> Michael Wilde > >>>>>>>> Computation Institute, University of Chicago > >>>>>>>> Mathematics and Computer Science Division > >>>>>>>> Argonne National Laboratory > >>>>>>>> > >>>>>> > >>>>>> -- > >>>>>> Michael Wilde > >>>>>> Computation Institute, University of Chicago > >>>>>> Mathematics and Computer Science Division > >>>>>> Argonne National Laboratory > >>>>> > >>>>> -- > >>>>> Michael Wilde > >>>>> Computation Institute, University of Chicago > >>>>> Mathematics and Computer Science Division > >>>>> Argonne National Laboratory > >>>>> > >>>>> _______________________________________________ > >>>>> Swift-user mailing list > >>>>> Swift-user at ci.uchicago.edu > >>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > >>>> > >>>> -- > >>>> Michael Wilde > >>>> Computation Institute, University of Chicago > >>>> Mathematics and Computer Science Division > >>>> Argonne National Laboratory > >>>> > >>> > > > > -- > > Michael Wilde > > Computation Institute, University of Chicago > > Mathematics and Computer Science Division > > Argonne National Laboratory > > > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user From hategan at mcs.anl.gov Tue Apr 17 19:28:18 2012 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 17 Apr 2012 17:28:18 -0700 Subject: [Swift-user] Question about job sizes In-Reply-To: <94ADC29B-6F18-41F5-A53D-68F601217014@uchicago.edu> References: <94ADC29B-6F18-41F5-A53D-68F601217014@uchicago.edu> Message-ID: <1334708898.19777.5.camel@blabla> On Tue, 2012-04-17 at 15:27 -0500, Lorenzo Pesce wrote: > Me again ;-) > > If the jobs occupy memory and CPU in a predictable way, say they have > a parameter i =1:20 where 20 takes 20 time units and 1 takes on. > Memory also grows, but not proportionally. > Is there a way to tell coasters how to pack jobs in order to not bust the memory? Partially. 
Maxwalltime should be the thing you use to specify the maximum cpu time. There is no equivalent for memory, though it seems like a useful feature to have. Mihael From jonmon at mcs.anl.gov Tue Apr 17 19:33:53 2012 From: jonmon at mcs.anl.gov (Jonathan Monette) Date: Tue, 17 Apr 2012 19:33:53 -0500 Subject: [Swift-user] Error message on Cray XE6 In-Reply-To: <1334708728.19777.2.camel@blabla> References: <80870977.145190.1334684302429.JavaMail.root@zimbra.anl.gov> <4D3C82A5-5FDC-4964-A124-F710D7F04644@mcs.anl.gov> <1334708728.19777.2.camel@blabla> Message-ID: <74E26274-9D98-4047-8664-D5B92846AD18@mcs.anl.gov> So that is not what I was witnessing. It seems the scheduler rejected the job(the PBS scheduler) because no jobs showed up under qstat but Swift still showed that jobs were submitted with no failures. If I checked the log I found a message from qsub saying could not submit job. I will reproduce the issue and post what I see. Perhaps this is happening though because the scheduler rejects the job but does not return an error code? On Apr 17, 2012, at 19:25, Mihael Hategan wrote: > Hmm, so if a block job fails, coasters will fail at least one swift job. > If this happens enough times, the failure should propagate through the > retries and to the user. It might take some time though. > > So maybe there's a distinction between "hangs" and "takes a lot of time > but eventually fails". > > Mihael > > On Tue, 2012-04-17 at 12:49 -0500, Jonathan Monette wrote: >> I do not think that is the case where PBS leaves the job queued, maybe >> on some machines but no on Beagle. When I had a job that did not fit >> in the scalability queue Swift hung but when checking the log I found >> a message from qsub saying the job was rejected. There is a bug >> ticket open for this issue. I will find the log that has the >> message(or just recreate it) and post the message to the ticket. >> Swift also hangs(with a qsub message in the log) if you try to submit >> a PBS job to machine where you no longer have an allocation. I >> received this message when trying to use Fusion after a long time. >> >> On Apr 17, 2012, at 12:38 PM, Michael Wilde wrote: >> >>> I think Swift hangs in such cases because PBS silently leaves the job queued even though no current queue can take the job. That seems to be a "feature" of the RM: being able to queue jobs that *might* be runnable in the future iff new queue settings are made. I dont know a good wat for Swift to detect this, but thats something to discuss (and create a ticket for). >>> >>> - Mike >>> >>> ----- Original Message ----- >>>> From: "Jonathan Monette" >>>> To: "Lorenzo Pesce" >>>> Cc: "Michael Wilde" , swift-user at ci.uchicago.edu >>>> Sent: Tuesday, April 17, 2012 12:08:49 PM >>>> Subject: Re: [Swift-user] Error message on Cray XE6 >>>> There is a site file entry for that. >>>> >>>> scalability >>>> >>>> You must make certain that the shape of your job fits in the queue you >>>> requested. If it does not fit, there is a silent failure and Swift >>>> hangs. >>>> >>>> On Apr 17, 2012, at 11:58, Lorenzo Pesce wrote: >>>> >>>>> Works great! >>>>> >>>>> Is there a way I can ask swift to put me in a specific queue, such >>>>> as scalability of some reservation? >>>>> >>>>> >>>>> >>>>> On Apr 14, 2012, at 10:51 AM, Michael Wilde wrote: >>>>> >>>>>> OK, here's a workaround for this problem: >>>>>> >>>>>> You need to add this line to the swift command bin/swift in your >>>>>> Swift release. 
>>>>>> >>>>>> After: >>>>>> >>>>>> updateOptions "$SWIFT_HOME" "swift.home" >>>>>> >>>>>> Add: >>>>>> >>>>>> updateOptions "$USER_HOME" "user.home" >>>>>> >>>>>> This is near line 92 in the version I tested, Swift trunk >>>>>> swift-r5739 cog-r3368. >>>>>> >>>>>> Then you can do: >>>>>> >>>>>> USER_HOME=/lustre/beagle/wilde swift -config cf -tc.file tc >>>>>> -sites.file pbs.xml catsn.swift -n=1 >>>>>> >>>>>> Lorenzo, if you are using "module load swift" we'll need to update >>>>>> that, or you can copy the swift release directory structure that >>>>>> module load points you to, then modify the swift command there, and >>>>>> put that modified release first in your PATH. >>>>>> >>>>>> We'll work out a way to get something like this into the production >>>>>> module and trunk. I dont know of other systems that are currently >>>>>> affected by this, but Im sure they will come up. >>>>>> >>>>>> - Mike >>>>>> >>>>>> >>>>>> ----- Original Message ----- >>>>>>> From: "Michael Wilde" >>>>>>> To: "Jonathan Monette" >>>>>>> Cc: swift-user at ci.uchicago.edu >>>>>>> Sent: Saturday, April 14, 2012 10:13:40 AM >>>>>>> Subject: Re: [Swift-user] Error message on Cray XE6 >>>>>>> stackoverflow says this should work: >>>>>>> >>>>>>> java -Duser.home= >>>>>>> >>>>>>> Need to get that in via the swift command. >>>>>>> >>>>>>> - Mike >>>>>>> >>>>>>> >>>>>>> ----- Original Message ----- >>>>>>>> From: "Michael Wilde" >>>>>>>> To: "Jonathan Monette" >>>>>>>> Cc: "Lorenzo Pesce" , >>>>>>>> swift-user at ci.uchicago.edu >>>>>>>> Sent: Saturday, April 14, 2012 10:10:00 AM >>>>>>>> Subject: Re: [Swift-user] Error message on Cray XE6 >>>>>>>> I just tried both setting HOME=/lustre/beagle/wilde and setting >>>>>>>> user.home to the same thing. Neither works. I think user.home is >>>>>>>> coming from the Java property, and that doesnt seem to be >>>>>>>> influenced >>>>>>>> by the HOME env var. I was about to look if Java can be asked to >>>>>>>> change home. Maybe by setting a command line arg to Java. >>>>>>>> >>>>>>>> - Mike >>>>>>>> >>>>>>>> ----- Original Message ----- >>>>>>>>> From: "Jonathan Monette" >>>>>>>>> To: "Michael Wilde" >>>>>>>>> Cc: "Lorenzo Pesce" , >>>>>>>>> swift-user at ci.uchicago.edu >>>>>>>>> Sent: Saturday, April 14, 2012 10:02:14 AM >>>>>>>>> Subject: Re: [Swift-user] Error message on Cray XE6 >>>>>>>>> That is an easy fix I believe. I know where the code is so I >>>>>>>>> will >>>>>>>>> change and test. >>>>>>>>> >>>>>>>>> In the mean time could you try something? Try setting >>>>>>>>> user.home= >>>>>>>>> in your config file and try again. >>>>>>>>> >>>>>>>>> On Apr 14, 2012, at 9:58, Michael Wilde >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> /home is no longer mounted by the compute nodes, per the >>>>>>>>>> post-maitenance summary: >>>>>>>>>> >>>>>>>>>> "External filesystem dependencies minimized: Compute nodes and >>>>>>>>>> the >>>>>>>>>> scheduler should now continue to process and complete jobs >>>>>>>>>> without >>>>>>>>>> the threat of interference of external filesystem outages. >>>>>>>>>> /gpfs/pads is only available on login1 through login5; /home is >>>>>>>>>> on >>>>>>>>>> login and mom nodes only." >>>>>>>>>> >>>>>>>>>> So we need to (finally) remove Swift's dependence on >>>>>>>>>> $HOME/.globus >>>>>>>>>> and $HOME/.globus/scripts in particular. >>>>>>>>>> >>>>>>>>>> I suggest - since the swift command already needs to write to >>>>>>>>>> "." >>>>>>>>>> - >>>>>>>>>> that we create a scripts/ directory in "." instead of >>>>>>>>>> $HOME/.globus. 
>>>>>>>>>> And this should be used by any provider that would have >>>>>>>>>> previously >>>>>>>>>> created files below .globus. >>>>>>>>>> >>>>>>>>>> I'll echo this to swift-devel and start a thread there to >>>>>>>>>> discuss. >>>>>>>>>> Its possible there's already a property to cause scripts/ to be >>>>>>>>>> created elsewhere. If not, I think we should make one. I think >>>>>>>>>> grouping the scripts created by a run into the current dir, >>>>>>>>>> along >>>>>>>>>> with the swift log, _concurrent, and (in the conventions I use >>>>>>>>>> in >>>>>>>>>> my >>>>>>>>>> run scripts) swiftwork/. >>>>>>>>>> >>>>>>>>>> Lorenzo, hopefully we can at least get you a workaround for >>>>>>>>>> this >>>>>>>>>> soon. >>>>>>>>>> >>>>>>>>>> You *might* be able to trick swift into doing this by setting >>>>>>>>>> HOME=/lustre/beagle/$USER. I already tried a symlink under >>>>>>>>>> .globus >>>>>>>>>> and that didnt work, as /home is not even readable by the >>>>>>>>>> compute >>>>>>>>>> nodes, which in this case need to run the coaster worker (.pl) >>>>>>>>>> script. >>>>>>>>>> >>>>>>>>>> - Mike >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> ----- Original Message ----- >>>>>>>>>>> From: "Lorenzo Pesce" >>>>>>>>>>> To: "Jonathan Monette" >>>>>>>>>>> Cc: swift-user at ci.uchicago.edu >>>>>>>>>>> Sent: Saturday, April 14, 2012 8:15:39 AM >>>>>>>>>>> Subject: Re: [Swift-user] Error message on Cray XE6 >>>>>>>>>>> In principle the access to the /home filesystem should still >>>>>>>>>>> be >>>>>>>>>>> there. >>>>>>>>>>> >>>>>>>>>>> The only thing I did was to chance the cf file to remove some >>>>>>>>>>> errors I >>>>>>>>>>> had into it, so that might also be the source of the problem. >>>>>>>>>>> This >>>>>>>>>>> is >>>>>>>>>>> what it looks like now: >>>>>>>>>>> (BTW, the comments are not mine, I run swift only from lustre) >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> # Whether to transfer the wrappers from the compute nodes >>>>>>>>>>> # I like to launch from my home dir, but keep everything on >>>>>>>>>>> # lustre >>>>>>>>>>> wrapperlog.always.transfer=false >>>>>>>>>>> >>>>>>>>>>> #Indicates whether the working directory on the remote site >>>>>>>>>>> # should be left intact even when a run completes successfully >>>>>>>>>>> sitedir.keep=true >>>>>>>>>>> >>>>>>>>>>> #try only once >>>>>>>>>>> execution.retries=1 >>>>>>>>>>> >>>>>>>>>>> # Attempt to run as much as possible, i.g., ignore non-fatal >>>>>>>>>>> errors >>>>>>>>>>> lazy.errors=true >>>>>>>>>>> >>>>>>>>>>> # to reduce filesystem access >>>>>>>>>>> status.mode=provider >>>>>>>>>>> >>>>>>>>>>> use.provider.staging=false >>>>>>>>>>> >>>>>>>>>>> provider.staging.pin.swiftfiles=false >>>>>>>>>>> >>>>>>>>>>> foreach.max.threads=100 >>>>>>>>>>> >>>>>>>>>>> provenance.log=false >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Apr 14, 2012, at 12:10 AM, Jonathan Monette wrote: >>>>>>>>>>> >>>>>>>>>>>> The perl script is the worker script that is submitted with >>>>>>>>>>>> PBS. >>>>>>>>>>>> I >>>>>>>>>>>> have not tried to run on Beagle since the maintenance period >>>>>>>>>>>> has >>>>>>>>>>>> ended so I am not exactly sure why the error popped up. One >>>>>>>>>>>> reason >>>>>>>>>>>> could be that the home file system is no longer mounted on >>>>>>>>>>>> the >>>>>>>>>>>> compute nodes. I know they spoke about that being a >>>>>>>>>>>> possibility >>>>>>>>>>>> but >>>>>>>>>>>> not sure they implemented that during the maintenance period. 
>>>>>>>>>>>> Do >>>>>>>>>>>> you >>>>>>>>>>>> know if the home file system is still mounted on the compute >>>>>>>>>>>> nodes? >>>>>>>>>>>> >>>>>>>>>>>> On Apr 13, 2012, at 17:18, Lorenzo Pesce >>>>>>>>>>>> >>>>>>>>>>>> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> Hi -- >>>>>>>>>>>>> I haven't seen this one before: >>>>>>>>>>>>> >>>>>>>>>>>>> Can't open perl script >>>>>>>>>>>>> "/home/lpesce/.globus/coasters/cscript7176272791806289394.pl": >>>>>>>>>>>>> No >>>>>>>>>>>>> such file or directory >>>>>>>>>>>>> >>>>>>>>>>>>> The config of the cray has changed, might this have anything >>>>>>>>>>>>> to >>>>>>>>>>>>> do >>>>>>>>>>>>> with it? >>>>>>>>>>>>> I have no idea what perl script is it talking about and why >>>>>>>>>>>>> it >>>>>>>>>>>>> is >>>>>>>>>>>>> looking to home. >>>>>>>>>>>>> >>>>>>>>>>>>> Thanks a lot, >>>>>>>>>>>>> >>>>>>>>>>>>> Lorenzo >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>> Swift-user mailing list >>>>>>>>>>>>> Swift-user at ci.uchicago.edu >>>>>>>>>>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user >>>>>>>>>>> >>>>>>>>>>> _______________________________________________ >>>>>>>>>>> Swift-user mailing list >>>>>>>>>>> Swift-user at ci.uchicago.edu >>>>>>>>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user >>>>>>>>>> >>>>>>>>>> -- >>>>>>>>>> Michael Wilde >>>>>>>>>> Computation Institute, University of Chicago >>>>>>>>>> Mathematics and Computer Science Division >>>>>>>>>> Argonne National Laboratory >>>>>>>>>> >>>>>>>> >>>>>>>> -- >>>>>>>> Michael Wilde >>>>>>>> Computation Institute, University of Chicago >>>>>>>> Mathematics and Computer Science Division >>>>>>>> Argonne National Laboratory >>>>>>> >>>>>>> -- >>>>>>> Michael Wilde >>>>>>> Computation Institute, University of Chicago >>>>>>> Mathematics and Computer Science Division >>>>>>> Argonne National Laboratory >>>>>>> >>>>>>> _______________________________________________ >>>>>>> Swift-user mailing list >>>>>>> Swift-user at ci.uchicago.edu >>>>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user >>>>>> >>>>>> -- >>>>>> Michael Wilde >>>>>> Computation Institute, University of Chicago >>>>>> Mathematics and Computer Science Division >>>>>> Argonne National Laboratory >>>>>> >>>>> >>> >>> -- >>> Michael Wilde >>> Computation Institute, University of Chicago >>> Mathematics and Computer Science Division >>> Argonne National Laboratory >>> >> >> _______________________________________________ >> Swift-user mailing list >> Swift-user at ci.uchicago.edu >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > From jonmon at mcs.anl.gov Tue Apr 17 20:20:03 2012 From: jonmon at mcs.anl.gov (Jonathan Monette) Date: Tue, 17 Apr 2012 20:20:03 -0500 Subject: [Swift-user] Error message on Cray XE6 In-Reply-To: <74E26274-9D98-4047-8664-D5B92846AD18@mcs.anl.gov> References: <80870977.145190.1334684302429.JavaMail.root@zimbra.anl.gov> <4D3C82A5-5FDC-4964-A124-F710D7F04644@mcs.anl.gov> <1334708728.19777.2.camel@blabla> <74E26274-9D98-4047-8664-D5B92846AD18@mcs.anl.gov> Message-ID: So here is the case where there is no allocation available for a user. I am running on Fusion where my allocation has expired. 
Here is what Swift is showing: Swift 0.93 swift-r5483 cog-r3339 RunID: 20120417-2011-kky5yb46 Progress: time: Tue, 17 Apr 2012 20:11:36 -0500 Progress: time: Tue, 17 Apr 2012 20:12:06 -0500 Submitted:2 Progress: time: Tue, 17 Apr 2012 20:12:36 -0500 Submitted:2 Progress: time: Tue, 17 Apr 2012 20:13:06 -0500 Submitted:2 Progress: time: Tue, 17 Apr 2012 20:13:36 -0500 Submitted:2 Progress: time: Tue, 17 Apr 2012 20:14:06 -0500 Submitted:2 Progress: time: Tue, 17 Apr 2012 20:14:36 -0500 Submitted:2 Progress: time: Tue, 17 Apr 2012 20:15:06 -0500 Submitted:2 Progress: time: Tue, 17 Apr 2012 20:15:36 -0500 Submitted:2 Progress: time: Tue, 17 Apr 2012 20:16:06 -0500 Submitted:2 Progress: time: Tue, 17 Apr 2012 20:16:36 -0500 Submitted:2 Progress: time: Tue, 17 Apr 2012 20:17:06 -0500 Submitted:2 Progress: time: Tue, 17 Apr 2012 20:17:36 -0500 Submitted:2 And here is the message in the log: 012-04-17 20:11:37,306-0500 DEBUG AbstractExecutor Wrote PBS script to /homes/jonmon/.globus/scripts/PBS5153949301806424155.submit 2012-04-17 20:11:37,306-0500 DEBUG AbstractExecutor Command line: qsub /homes/jonmon/.globus/scripts/PBS5153949301806424155.submit 2012-04-17 20:11:37,433-0500 DEBUG AbstractExecutor Waiting for output from qsub 2012-04-17 20:11:37,434-0500 DEBUG AbstractExecutor Output from qsub is: "" 2012-04-17 20:11:37,434-0500 DEBUG AbstractExecutor Waiting for output from qsub 2012-04-17 20:11:37,434-0500 DEBUG AbstractExecutor Output from qsub is: "ERROR: Project "startup-jonmon" has no allocation; can't run job." 2012-04-17 20:11:37,434-0500 INFO BlockTaskSubmitter Error submitting block task: Cannot submit job: Could not submit job (qsub reported an exit code of 1). ERROR: Project "startup-jonmon" has no allocation; can't run job. So it shows that qsub failed(with error code 1) but Swift keeps going showing a submitted count of 2 when there is no jobs under qstat -u jonmon I will try and get the case for when no job fits in the the specified queue. I do not think this is high priority but this is definitely something that users should be aware of. On Apr 17, 2012, at 7:33 PM, Jonathan Monette wrote: > So that is not what I was witnessing. It seems the scheduler rejected the job(the PBS scheduler) because no jobs showed up under qstat but Swift still showed that jobs were submitted with no failures. If I checked the log I found a message from qsub saying could not submit job. I will reproduce the issue and post what I see. Perhaps this is happening though because the scheduler rejects the job but does not return an error code? > > On Apr 17, 2012, at 19:25, Mihael Hategan wrote: > >> Hmm, so if a block job fails, coasters will fail at least one swift job. >> If this happens enough times, the failure should propagate through the >> retries and to the user. It might take some time though. >> >> So maybe there's a distinction between "hangs" and "takes a lot of time >> but eventually fails". >> >> Mihael >> >> On Tue, 2012-04-17 at 12:49 -0500, Jonathan Monette wrote: >>> I do not think that is the case where PBS leaves the job queued, maybe >>> on some machines but no on Beagle. When I had a job that did not fit >>> in the scalability queue Swift hung but when checking the log I found >>> a message from qsub saying the job was rejected. There is a bug >>> ticket open for this issue. I will find the log that has the >>> message(or just recreate it) and post the message to the ticket. 
>>> Swift also hangs(with a qsub message in the log) if you try to submit >>> a PBS job to machine where you no longer have an allocation. I >>> received this message when trying to use Fusion after a long time. >>> >>> On Apr 17, 2012, at 12:38 PM, Michael Wilde wrote: >>> >>>> I think Swift hangs in such cases because PBS silently leaves the job queued even though no current queue can take the job. That seems to be a "feature" of the RM: being able to queue jobs that *might* be runnable in the future iff new queue settings are made. I dont know a good wat for Swift to detect this, but thats something to discuss (and create a ticket for). >>>> >>>> - Mike >>>> >>>> ----- Original Message ----- >>>>> From: "Jonathan Monette" >>>>> To: "Lorenzo Pesce" >>>>> Cc: "Michael Wilde" , swift-user at ci.uchicago.edu >>>>> Sent: Tuesday, April 17, 2012 12:08:49 PM >>>>> Subject: Re: [Swift-user] Error message on Cray XE6 >>>>> There is a site file entry for that. >>>>> >>>>> scalability >>>>> >>>>> You must make certain that the shape of your job fits in the queue you >>>>> requested. If it does not fit, there is a silent failure and Swift >>>>> hangs. >>>>> >>>>> On Apr 17, 2012, at 11:58, Lorenzo Pesce wrote: >>>>> >>>>>> Works great! >>>>>> >>>>>> Is there a way I can ask swift to put me in a specific queue, such >>>>>> as scalability of some reservation? >>>>>> >>>>>> >>>>>> >>>>>> On Apr 14, 2012, at 10:51 AM, Michael Wilde wrote: >>>>>> >>>>>>> OK, here's a workaround for this problem: >>>>>>> >>>>>>> You need to add this line to the swift command bin/swift in your >>>>>>> Swift release. >>>>>>> >>>>>>> After: >>>>>>> >>>>>>> updateOptions "$SWIFT_HOME" "swift.home" >>>>>>> >>>>>>> Add: >>>>>>> >>>>>>> updateOptions "$USER_HOME" "user.home" >>>>>>> >>>>>>> This is near line 92 in the version I tested, Swift trunk >>>>>>> swift-r5739 cog-r3368. >>>>>>> >>>>>>> Then you can do: >>>>>>> >>>>>>> USER_HOME=/lustre/beagle/wilde swift -config cf -tc.file tc >>>>>>> -sites.file pbs.xml catsn.swift -n=1 >>>>>>> >>>>>>> Lorenzo, if you are using "module load swift" we'll need to update >>>>>>> that, or you can copy the swift release directory structure that >>>>>>> module load points you to, then modify the swift command there, and >>>>>>> put that modified release first in your PATH. >>>>>>> >>>>>>> We'll work out a way to get something like this into the production >>>>>>> module and trunk. I dont know of other systems that are currently >>>>>>> affected by this, but Im sure they will come up. >>>>>>> >>>>>>> - Mike >>>>>>> >>>>>>> >>>>>>> ----- Original Message ----- >>>>>>>> From: "Michael Wilde" >>>>>>>> To: "Jonathan Monette" >>>>>>>> Cc: swift-user at ci.uchicago.edu >>>>>>>> Sent: Saturday, April 14, 2012 10:13:40 AM >>>>>>>> Subject: Re: [Swift-user] Error message on Cray XE6 >>>>>>>> stackoverflow says this should work: >>>>>>>> >>>>>>>> java -Duser.home= >>>>>>>> >>>>>>>> Need to get that in via the swift command. >>>>>>>> >>>>>>>> - Mike >>>>>>>> >>>>>>>> >>>>>>>> ----- Original Message ----- >>>>>>>>> From: "Michael Wilde" >>>>>>>>> To: "Jonathan Monette" >>>>>>>>> Cc: "Lorenzo Pesce" , >>>>>>>>> swift-user at ci.uchicago.edu >>>>>>>>> Sent: Saturday, April 14, 2012 10:10:00 AM >>>>>>>>> Subject: Re: [Swift-user] Error message on Cray XE6 >>>>>>>>> I just tried both setting HOME=/lustre/beagle/wilde and setting >>>>>>>>> user.home to the same thing. Neither works. 
I think user.home is >>>>>>>>> coming from the Java property, and that doesnt seem to be >>>>>>>>> influenced >>>>>>>>> by the HOME env var. I was about to look if Java can be asked to >>>>>>>>> change home. Maybe by setting a command line arg to Java. >>>>>>>>> >>>>>>>>> - Mike >>>>>>>>> >>>>>>>>> ----- Original Message ----- >>>>>>>>>> From: "Jonathan Monette" >>>>>>>>>> To: "Michael Wilde" >>>>>>>>>> Cc: "Lorenzo Pesce" , >>>>>>>>>> swift-user at ci.uchicago.edu >>>>>>>>>> Sent: Saturday, April 14, 2012 10:02:14 AM >>>>>>>>>> Subject: Re: [Swift-user] Error message on Cray XE6 >>>>>>>>>> That is an easy fix I believe. I know where the code is so I >>>>>>>>>> will >>>>>>>>>> change and test. >>>>>>>>>> >>>>>>>>>> In the mean time could you try something? Try setting >>>>>>>>>> user.home= >>>>>>>>>> in your config file and try again. >>>>>>>>>> >>>>>>>>>> On Apr 14, 2012, at 9:58, Michael Wilde >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>>> /home is no longer mounted by the compute nodes, per the >>>>>>>>>>> post-maitenance summary: >>>>>>>>>>> >>>>>>>>>>> "External filesystem dependencies minimized: Compute nodes and >>>>>>>>>>> the >>>>>>>>>>> scheduler should now continue to process and complete jobs >>>>>>>>>>> without >>>>>>>>>>> the threat of interference of external filesystem outages. >>>>>>>>>>> /gpfs/pads is only available on login1 through login5; /home is >>>>>>>>>>> on >>>>>>>>>>> login and mom nodes only." >>>>>>>>>>> >>>>>>>>>>> So we need to (finally) remove Swift's dependence on >>>>>>>>>>> $HOME/.globus >>>>>>>>>>> and $HOME/.globus/scripts in particular. >>>>>>>>>>> >>>>>>>>>>> I suggest - since the swift command already needs to write to >>>>>>>>>>> "." >>>>>>>>>>> - >>>>>>>>>>> that we create a scripts/ directory in "." instead of >>>>>>>>>>> $HOME/.globus. >>>>>>>>>>> And this should be used by any provider that would have >>>>>>>>>>> previously >>>>>>>>>>> created files below .globus. >>>>>>>>>>> >>>>>>>>>>> I'll echo this to swift-devel and start a thread there to >>>>>>>>>>> discuss. >>>>>>>>>>> Its possible there's already a property to cause scripts/ to be >>>>>>>>>>> created elsewhere. If not, I think we should make one. I think >>>>>>>>>>> grouping the scripts created by a run into the current dir, >>>>>>>>>>> along >>>>>>>>>>> with the swift log, _concurrent, and (in the conventions I use >>>>>>>>>>> in >>>>>>>>>>> my >>>>>>>>>>> run scripts) swiftwork/. >>>>>>>>>>> >>>>>>>>>>> Lorenzo, hopefully we can at least get you a workaround for >>>>>>>>>>> this >>>>>>>>>>> soon. >>>>>>>>>>> >>>>>>>>>>> You *might* be able to trick swift into doing this by setting >>>>>>>>>>> HOME=/lustre/beagle/$USER. I already tried a symlink under >>>>>>>>>>> .globus >>>>>>>>>>> and that didnt work, as /home is not even readable by the >>>>>>>>>>> compute >>>>>>>>>>> nodes, which in this case need to run the coaster worker (.pl) >>>>>>>>>>> script. >>>>>>>>>>> >>>>>>>>>>> - Mike >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> ----- Original Message ----- >>>>>>>>>>>> From: "Lorenzo Pesce" >>>>>>>>>>>> To: "Jonathan Monette" >>>>>>>>>>>> Cc: swift-user at ci.uchicago.edu >>>>>>>>>>>> Sent: Saturday, April 14, 2012 8:15:39 AM >>>>>>>>>>>> Subject: Re: [Swift-user] Error message on Cray XE6 >>>>>>>>>>>> In principle the access to the /home filesystem should still >>>>>>>>>>>> be >>>>>>>>>>>> there. >>>>>>>>>>>> >>>>>>>>>>>> The only thing I did was to chance the cf file to remove some >>>>>>>>>>>> errors I >>>>>>>>>>>> had into it, so that might also be the source of the problem. 
>>>>>>>>>>>> This >>>>>>>>>>>> is >>>>>>>>>>>> what it looks like now: >>>>>>>>>>>> (BTW, the comments are not mine, I run swift only from lustre) >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> # Whether to transfer the wrappers from the compute nodes >>>>>>>>>>>> # I like to launch from my home dir, but keep everything on >>>>>>>>>>>> # lustre >>>>>>>>>>>> wrapperlog.always.transfer=false >>>>>>>>>>>> >>>>>>>>>>>> #Indicates whether the working directory on the remote site >>>>>>>>>>>> # should be left intact even when a run completes successfully >>>>>>>>>>>> sitedir.keep=true >>>>>>>>>>>> >>>>>>>>>>>> #try only once >>>>>>>>>>>> execution.retries=1 >>>>>>>>>>>> >>>>>>>>>>>> # Attempt to run as much as possible, i.g., ignore non-fatal >>>>>>>>>>>> errors >>>>>>>>>>>> lazy.errors=true >>>>>>>>>>>> >>>>>>>>>>>> # to reduce filesystem access >>>>>>>>>>>> status.mode=provider >>>>>>>>>>>> >>>>>>>>>>>> use.provider.staging=false >>>>>>>>>>>> >>>>>>>>>>>> provider.staging.pin.swiftfiles=false >>>>>>>>>>>> >>>>>>>>>>>> foreach.max.threads=100 >>>>>>>>>>>> >>>>>>>>>>>> provenance.log=false >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> On Apr 14, 2012, at 12:10 AM, Jonathan Monette wrote: >>>>>>>>>>>> >>>>>>>>>>>>> The perl script is the worker script that is submitted with >>>>>>>>>>>>> PBS. >>>>>>>>>>>>> I >>>>>>>>>>>>> have not tried to run on Beagle since the maintenance period >>>>>>>>>>>>> has >>>>>>>>>>>>> ended so I am not exactly sure why the error popped up. One >>>>>>>>>>>>> reason >>>>>>>>>>>>> could be that the home file system is no longer mounted on >>>>>>>>>>>>> the >>>>>>>>>>>>> compute nodes. I know they spoke about that being a >>>>>>>>>>>>> possibility >>>>>>>>>>>>> but >>>>>>>>>>>>> not sure they implemented that during the maintenance period. >>>>>>>>>>>>> Do >>>>>>>>>>>>> you >>>>>>>>>>>>> know if the home file system is still mounted on the compute >>>>>>>>>>>>> nodes? >>>>>>>>>>>>> >>>>>>>>>>>>> On Apr 13, 2012, at 17:18, Lorenzo Pesce >>>>>>>>>>>>> >>>>>>>>>>>>> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> Hi -- >>>>>>>>>>>>>> I haven't seen this one before: >>>>>>>>>>>>>> >>>>>>>>>>>>>> Can't open perl script >>>>>>>>>>>>>> "/home/lpesce/.globus/coasters/cscript7176272791806289394.pl": >>>>>>>>>>>>>> No >>>>>>>>>>>>>> such file or directory >>>>>>>>>>>>>> >>>>>>>>>>>>>> The config of the cray has changed, might this have anything >>>>>>>>>>>>>> to >>>>>>>>>>>>>> do >>>>>>>>>>>>>> with it? >>>>>>>>>>>>>> I have no idea what perl script is it talking about and why >>>>>>>>>>>>>> it >>>>>>>>>>>>>> is >>>>>>>>>>>>>> looking to home. 
>>>>>>>>>>>>>> >>>>>>>>>>>>>> Thanks a lot, >>>>>>>>>>>>>> >>>>>>>>>>>>>> Lorenzo >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>>> Swift-user mailing list >>>>>>>>>>>>>> Swift-user at ci.uchicago.edu >>>>>>>>>>>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user >>>>>>>>>>>> >>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>> Swift-user mailing list >>>>>>>>>>>> Swift-user at ci.uchicago.edu >>>>>>>>>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user >>>>>>>>>>> >>>>>>>>>>> -- >>>>>>>>>>> Michael Wilde >>>>>>>>>>> Computation Institute, University of Chicago >>>>>>>>>>> Mathematics and Computer Science Division >>>>>>>>>>> Argonne National Laboratory >>>>>>>>>>> >>>>>>>>> >>>>>>>>> -- >>>>>>>>> Michael Wilde >>>>>>>>> Computation Institute, University of Chicago >>>>>>>>> Mathematics and Computer Science Division >>>>>>>>> Argonne National Laboratory >>>>>>>> >>>>>>>> -- >>>>>>>> Michael Wilde >>>>>>>> Computation Institute, University of Chicago >>>>>>>> Mathematics and Computer Science Division >>>>>>>> Argonne National Laboratory >>>>>>>> >>>>>>>> _______________________________________________ >>>>>>>> Swift-user mailing list >>>>>>>> Swift-user at ci.uchicago.edu >>>>>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user >>>>>>> >>>>>>> -- >>>>>>> Michael Wilde >>>>>>> Computation Institute, University of Chicago >>>>>>> Mathematics and Computer Science Division >>>>>>> Argonne National Laboratory >>>>>>> >>>>>> >>>> >>>> -- >>>> Michael Wilde >>>> Computation Institute, University of Chicago >>>> Mathematics and Computer Science Division >>>> Argonne National Laboratory >>>> >>> >>> _______________________________________________ >>> Swift-user mailing list >>> Swift-user at ci.uchicago.edu >>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user >> >> > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user From lpesce at uchicago.edu Wed Apr 18 20:14:01 2012 From: lpesce at uchicago.edu (Lorenzo Pesce) Date: Wed, 18 Apr 2012 20:14:01 -0500 Subject: [Swift-user] Question about nr of nodes In-Reply-To: <05B840C4-ED15-478B-8F12-E09C3DD612A6@mcs.anl.gov> References: <62F6E6C9-6575-4B3D-AE29-B7D72D14FC0D@uchicago.edu> <05B840C4-ED15-478B-8F12-E09C3DD612A6@mcs.anl.gov> Message-ID: <8B8909DD-6E04-4F94-AD9D-0EBF982F48C1@uchicago.edu> Going over the emails produced during the famous Beagle crash =) (While watching "Word World" with my son, so don't worry if it doesn't make too much sense. By now we got to "Puss in Boots".) > It all depends on ho you want to shape your jobs. > slots: The maximum number of coaster blocks to submit(pbs jobs). Would there be a reason to keep the number of coasters to 1? I would be tempted to put it equal to maxnodes. The issue would then be how to keep the overall number of nodes used to, say, numnodes? Of course I could set a combination of the two, like $nnodes 1 1 Like Glen did. > nodeGranularity: How much to increment node count by(nodeGranularity <= maxNodes) Within a coaster? or either by adding a coaster block or by adding a node to a coaster block? > Adjusting slots will provide more active coaster blocks. > > In Glen's example he will requests "$nodes" number of single node jobs. 
He could have said the same thing by setting: > 1 > $nodes > $nodes > > In his example, several single node coaster blocks would be submitted for execution. With the above settings, a single multi-node coaster block would be submitted. If the machine is overloaded and there is slow response time, then Glen's approach would probably be better as the scheduler may bias some single node jobs to run over multi-node jobs to keep the entire machine busy. This way progress will be made(even if it is slow progress). HIs would be a good approach to make good use of the backfill, wouldn't it? the jobs can use any time and any number of coasters, so they can fit in the holes. Or am I wrong here? > Another setting that should be set is: > 100 > 100 > > This will force the coaster blocks to be exactly the maxTime you asked for. If those are not set coasters dynamically chooses a wall time which is often lower than the time you specified. But would this limit its ability to fit into the backfill? For this job, I can afford having many relatively short coasters (say longer than 20 minutes), but I also want to make sure that I am getting all the "total time I am asking for", but this I mean: totnrnodes*walltime. The idea being that to run all my jobs I need about 2400 core hours, which requires 100 node hours. I don't care too much how it is sliced in this problem. Does it make sense? the more I can fit into the backfill holes, the better. >> $PPN Added to the presentation. >> $TIME >> $MAXTIME I am lost here. Difference? Maxwalltime is what I was talking about like 100 node hours? What I would like to be able to set is a minimum time/size (say, no coaster block shorter than 1/2 hour and smaller than 3 nodes), a increment (I understand that to be granularity, so if blocks need to be at least of 3 nodes for an MPI job, this is what I would set it to, presumably the minimum time would propagate here) and a grand total of CPU time to run all of them >> 200.00 >> 10000 I need to read more about these two. I will look for the documentation and send you feedback about it. I need also to read a lot more about mappers and all the rest... From jonmon at mcs.anl.gov Thu Apr 19 11:04:20 2012 From: jonmon at mcs.anl.gov (Jonathan Monette) Date: Thu, 19 Apr 2012 11:04:20 -0500 Subject: [Swift-user] Question about nr of nodes In-Reply-To: <8B8909DD-6E04-4F94-AD9D-0EBF982F48C1@uchicago.edu> References: <62F6E6C9-6575-4B3D-AE29-B7D72D14FC0D@uchicago.edu> <05B840C4-ED15-478B-8F12-E09C3DD612A6@mcs.anl.gov> <8B8909DD-6E04-4F94-AD9D-0EBF982F48C1@uchicago.edu> Message-ID: On Apr 18, 2012, at 8:14 PM, Lorenzo Pesce wrote: > Going over the emails produced during the famous Beagle crash =) > > (While watching "Word World" with my son, so don't worry if it doesn't make too much sense. By now we got to "Puss in Boots".) > >> It all depends on ho you want to shape your jobs. >> slots: The maximum number of coaster blocks to submit(pbs jobs). > > Would there be a reason to keep the number of coasters to 1? I would be tempted to put it equal to maxnodes. > The issue would then be how to keep the overall number of nodes used to, say, numnodes? > Of course I could set a combination of the two, like > > $nnodes > 1 > 1 > > Like Glen did. > So the above approach will submit $nnodes workers all with 1 node to work with. You could say could say the same thing with: 1 $nnodes $nnodes Swift allows several ways to say the same thing when shaping jobs. However, let's say you need $nnodes for 20 hours. 
However, let's say you need $nnodes nodes for 20 hours. The way Glen did it, there would be $nnodes jobs submitted, all waiting to run for 20 hours. The scheduler may not give you all $nnodes nodes at the same time for 20 hours; you are more than likely to get maybe 1-2 nodes at a time while the others wait for a long time. The other way, keeping slots at 1 and adjusting the other parameters, will have one $nnodes-node job submitted. When that job is chosen to run you will have the full $nnodes nodes at your disposal. How to shape your jobs for Swift takes a bit of tuning and depends on the application and the system itself.

> >> nodeGranularity: How much to increment the node count by (nodeGranularity <= maxNodes) > > Within a coaster block, or by adding a coaster block, or by adding a node to a coaster block?

So if maxNodes is 6 and nodeGranularity is 2: one job could be 2 nodes wide, the next job could be 4 nodes wide, the next job could be 6 nodes wide, the next one could be 2 nodes wide, .... Coasters will submit jobs with node counts in multiples of nodeGranularity, all the way up to maxNodes.

> >> Adjusting slots will provide more active coaster blocks. >> >> In Glen's example he will request "$nodes" number of single-node jobs. He could have said the same thing by setting: >> 1 >> $nodes >> $nodes >> >> In his example, several single-node coaster blocks would be submitted for execution. With the above settings, a single multi-node coaster block would be submitted. If the machine is overloaded and there is slow response time, then Glen's approach would probably be better, as the scheduler may bias some single-node jobs to run ahead of multi-node jobs to keep the entire machine busy. This way progress will be made (even if it is slow progress). > > His would be a good approach to make good use of the backfill, wouldn't it? The jobs can use any time and any number of coaster blocks, so they can fit in the holes. Or am I wrong here?

Yes, you would want to keep jobs small when using the backfill so they can run quickly. So several small jobs should work nicely using backfill.

> > >> Another setting that should be set is: >> 100 >> 100 >> >> This will force the coaster blocks to be exactly the maxTime you asked for. If those are not set, coasters dynamically choose a wall time which is often lower than the time you specified. > > But would this limit its ability to fit into the backfill? For this job, I can afford having many relatively short coaster blocks (say, longer than 20 minutes), but I also want to make sure that I am getting all the "total time I am asking for"; by this I mean: > totnrnodes*walltime. The idea being that to run all my jobs I need about 2400 core hours, which requires 100 node hours. I don't care too much how it is sliced in this problem. Does it make sense? The more I can fit into the backfill holes, the better.

This should not limit backfill as long as the job times are short. Let's say you request maxTime to be 1 minute. Without the overallocation settings, coasters will often submit jobs that are maybe 30 seconds long. With the overallocation settings, coasters will always submit jobs that are 1 minute long. It is just a way to force coasters to use the time you provided and not take that time as a suggestion.

> > >>> $PPN > > Added to the presentation. > >>> $TIME >>> $MAXTIME > > I am lost here. Difference? Maxwalltime is what I was talking about, like 100 node hours? 
> What I would like to be able to set is a minimum time/size (say, no coaster block shorter than 1/2 hour and smaller than 3 nodes), an increment (I understand that to be granularity, so if blocks need to be at least 3 nodes for an MPI job, this is what I would set it to; presumably the minimum time would propagate here) and a grand total of CPU time to run all of them.

maxTime is how long the coaster job (the wall time specified to PBS) lasts. maxwalltime is used by coasters to determine if an app can fit into a coaster block. If you requested that a coaster block has a maxTime of 5 minutes and the maxwalltimes for the apps are 1.5 minutes, then coasters can only fit 3 app calls in 1 coaster block. After 3 app calls the coaster block would be shut down and another job will be submitted to PBS. I do not think there is a feature to set a "minimum" in coasters.

> >>> 200.00 >>> 10000 > > I need to read more about these two. I will look for the documentation and send you feedback about it. > I also need to read a lot more about mappers and all the rest... > > >

From lpesce at uchicago.edu Fri Apr 20 11:43:37 2012
From: lpesce at uchicago.edu (Lorenzo Pesce)
Date: Fri, 20 Apr 2012 11:43:37 -0500
Subject: [Swift-user] Help with resuming swift job
References: Message-ID: <616843DA-762B-4F0C-8241-CDC4F9B396CB@uchicago.edu>

It is me again ;-) I seem to have let too many Beagle users run Swift.... ;-)

Here we have a few different people running from the same filesystem (the project file system; the person who sent the first batch is on a plane right now), and because of some quirks in Beagle's group permissions (namely, they don't work right as far as I can tell), sometimes we need to change file permissions.

I changed the file permissions,
Swift 0.93 swift-r5483 cog-r3339
and sent swift out with -resume:
RunID: 20120420-1622-6siopoz8
Execution failed:
Could not aquire exclusive lock on log file: /lustre/beagle/GCNet/grasping/44neurons_SR0_200ms_5Windows_v1/causal_test-20120420-1433-k1eh3b2a.0.rlog

Is there a lock file that needs to be changed?

The motivation for this is that some of the files failed for two predictable reasons: they took too long and/or they blew the memory. There are also other issues, but they are not relevant at this point because they aren't solved (they need a new optimization, which we did not have time to implement). Usually we send a first batch, with short times and fewer nodes, that does 98% of the work, and resume the remaining ones, hacking the sites.xml file (we welcome better strategies, which most of you have hinted at in the past). This was an attempt to rerun after a crash.

BTW, my post-mortem investigation seems to suggest that one of the users actually killed the swift script by mistake, wrongly changed a privilege in flight, or something like that, as opposed to the script failing or running out of time. Short of torture it does not seem he will confess more than this.

Thanks a million as usual. 
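In case it helps, the resume invocation was more or less the following (typed from memory, so the flags and file names may not be exact):

swift -config cf -sites.file sites.xml -tc.file tc -resume causal_test-20120420-1433-k1eh3b2a.0.rlog causal_test.swift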
Lorenzo From iraicu at cs.iit.edu Fri Apr 20 13:33:47 2012 From: iraicu at cs.iit.edu (Ioan Raicu) Date: Fri, 20 Apr 2012 13:33:47 -0500 Subject: [Swift-user] CFP: IEEE/ACM GRID 2012 -- Deadline extension to 04-25-12 Message-ID: <4F91AC0B.9090806@cs.iit.edu> Call for papers *Grid 2012: 13th IEEE/ACM International Conference on Grid Computing* Beijing, China September 20-23, 2012 http://grid2012.meepo.org Co-located with ChinaGrid'12 Grid computing enables the sharing of distributed computing and data resources such as processing, network bandwidth and storage capacity to create a cohesive resource environment for executing distributed applications. The Grid conference series is an annual international meeting that brings together a community of researchers, developers, practitioners, and users involved with Grid technology. The objective of the meeting is to serve as both the premier venue for presenting foremost research results in the area and as a forum for introducing and exploring new concepts. In 2012, the Grid conference will come to China for the first time and will be held in Beijing, co-located with ChinaGrid'12. Grid 2012 will have a focus on important and immediate issues that are significantly influencing grid computing. Scope Grid 2012 topics of interest include, but are not limited to: * Architecture * Middleware and toolkits * Resource management, scheduling, and runtime environments * Performance modeling and evaluation * Programming models, tools and environments * Metadata, ontologies, and provenance * Cloud computing * Virtualization and grid computing * Scientific workflow * Storage systems and data management * Data-intensive computing and processing * QoS and SLA Negotiation * Applications and experiences in science, engineering, business and society Paper Submission Authors are invited to submit original papers (not published or currently under review for any other conference or journal). Submitted manuscripts should not exceed 8 letter size (8.5 x 11) pages including figures, tables and references using the IEEE format for conference proceedings. Authors should submit the manuscript in PDF format via https://www.easychair.org/conferences/?conf=grid12 All submitted papers will be reviewed by program committee members and selected based on their originality, correctness, technical strength, significance, quality of presentation, and interest and relevance to the conference attendees. Accepted papers will be published in the IEEE categorized conference proceedings and will be made available online through the IEEE Xplore and the CS Digital Library. Go to paper submission page... Important Dates Papers Submission Due: 15 April 2012 Extended to 25 April 2012. Notification of Acceptance: 15 May 2012 Camera Ready Papers Due: 15 June 2012 -- ================================================================= Ioan Raicu, Ph.D. 
Assistant Professor, Illinois Institute of Technology (IIT) Guest Research Faculty, Argonne National Laboratory (ANL) ================================================================= Data-Intensive Distributed Systems Laboratory, CS/IIT Distributed Systems Laboratory, MCS/ANL ================================================================= Cel: 1-847-722-0876 Office: 1-312-567-5704 Email: iraicu at cs.iit.edu Web: http://www.cs.iit.edu/~iraicu/ Web: http://datasys.cs.iit.edu/ ================================================================= ================================================================= -------------- next part -------------- An HTML attachment was scrubbed... URL: From wilde at mcs.anl.gov Sat Apr 21 17:32:37 2012 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sat, 21 Apr 2012 17:32:37 -0500 (CDT) Subject: [Swift-user] Help with resuming swift job In-Reply-To: <616843DA-762B-4F0C-8241-CDC4F9B396CB@uchicago.edu> Message-ID: <1225948167.151213.1335047557000.JavaMail.root@zimbra.anl.gov> Hi Lorenzo, I did a quick check into this problem. As far as I can tell, the .rlog restart file needed to resume this run does not exist. Did it get removed manually, or do you think it got removed by Swift? (Which may remove it when the run completes successfully, but I need to check on that). - Mike ----- Original Message ----- > From: "Lorenzo Pesce" > To: swift-user at ci.uchicago.edu > Sent: Friday, April 20, 2012 11:43:37 AM > Subject: [Swift-user] Help with resuming swift job > It is me again ;-) > > I seem to have let too many beagle users run Swift.... ;-) > > Here we are having a few difference people running from the same > filesystem (the project file system, in this case the person who sent > the first batch is on a plane right now) because of some quirks in > Beagle's group permissions (namely they don't work right as far as I > can tell), sometimes we need to change file permissions: > > I changed the file permission, > Swift 0.93 swift-r5483 cog-r3339 > > and sent swift out with -resume > > RunID: 20120420-1622-6siopoz8 > Execution failed: > Could not aquire exclusive lock on log file: > /lustre/beagle/GCNet/grasping/44neurons_SR0_200ms_5Windows_v1/causal_test-20120420-1433-k1eh3b2a.0.rlog > > Is there a lock file that needs to be changed? > > > The motivation for this is that some of the files failed for two > predictable reasons: they took too long and/or they blew the memory. > There are also other issues, but they are not relevant at this point > because they aren't solved (need of a new optimization, which we did > not have time to implement). Usually we send the first batch, with > short times and fewer nodes that does 98% of the work, and resume the > remaining hacking the sites.xml file (we welcome better strategies > which most of you have hinted to in the past). This was an attempt to > rerun after a crash. > > BTW, my post mortem investigation seems to suggest that one of the > users actually killed the swift script by mistake, wrongly changed a > privilege in flight or something like that as opposed to the script > failing or running out of time. Short of torture it does not seem he > will confess more than this. > > Thanks a million as usual. 
> > Lorenzo > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From lpesce at uchicago.edu Sat Apr 21 18:04:54 2012 From: lpesce at uchicago.edu (Lorenzo Pesce) Date: Sat, 21 Apr 2012 18:04:54 -0500 Subject: [Swift-user] Help with resuming swift job In-Reply-To: <1225948167.151213.1335047557000.JavaMail.root@zimbra.anl.gov> References: <1225948167.151213.1335047557000.JavaMail.root@zimbra.anl.gov> Message-ID: <25536BFD-7A12-4332-BD5E-8E50B5B0969D@uchicago.edu> I plan to make a better investigation of what happened exactly because there were too many people changing things. I will try to reproduce the error. It might be difficult or impossible at this point because I modified the script in order to avoid conflicts. In this dir sequential swift runs were made with identically names tc, cf and sites files, which would have been no problem if people did not run them at the same time or in a chaotic way. Since it seems to be a possibility we modified it and changed the names of the files. I will let you know ASAP if this happens again. On Apr 21, 2012, at 5:32 PM, Michael Wilde wrote: > Hi Lorenzo, > > I did a quick check into this problem. As far as I can tell, the .rlog restart file needed to resume this run does not exist. Did it get removed manually, or do you think it got removed by Swift? (Which may remove it when the run completes successfully, but I need to check on that). > > - Mike > > ----- Original Message ----- >> From: "Lorenzo Pesce" >> To: swift-user at ci.uchicago.edu >> Sent: Friday, April 20, 2012 11:43:37 AM >> Subject: [Swift-user] Help with resuming swift job >> It is me again ;-) >> >> I seem to have let too many beagle users run Swift.... ;-) >> >> Here we are having a few difference people running from the same >> filesystem (the project file system, in this case the person who sent >> the first batch is on a plane right now) because of some quirks in >> Beagle's group permissions (namely they don't work right as far as I >> can tell), sometimes we need to change file permissions: >> >> I changed the file permission, >> Swift 0.93 swift-r5483 cog-r3339 >> >> and sent swift out with -resume >> >> RunID: 20120420-1622-6siopoz8 >> Execution failed: >> Could not aquire exclusive lock on log file: >> /lustre/beagle/GCNet/grasping/44neurons_SR0_200ms_5Windows_v1/causal_test-20120420-1433-k1eh3b2a.0.rlog >> >> Is there a lock file that needs to be changed? >> >> >> The motivation for this is that some of the files failed for two >> predictable reasons: they took too long and/or they blew the memory. >> There are also other issues, but they are not relevant at this point >> because they aren't solved (need of a new optimization, which we did >> not have time to implement). Usually we send the first batch, with >> short times and fewer nodes that does 98% of the work, and resume the >> remaining hacking the sites.xml file (we welcome better strategies >> which most of you have hinted to in the past). This was an attempt to >> rerun after a crash. >> >> BTW, my post mortem investigation seems to suggest that one of the >> users actually killed the swift script by mistake, wrongly changed a >> privilege in flight or something like that as opposed to the script >> failing or running out of time. 
Short of torture it does not seem he >> will confess more than this. >> >> Thanks a million as usual. >> >> Lorenzo >> >> _______________________________________________ >> Swift-user mailing list >> Swift-user at ci.uchicago.edu >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > From lpesce at uchicago.edu Mon Apr 23 15:24:16 2012 From: lpesce at uchicago.edu (Lorenzo Pesce) Date: Mon, 23 Apr 2012 15:24:16 -0500 Subject: [Swift-user] Help with resuming swift job In-Reply-To: <1225948167.151213.1335047557000.JavaMail.root@zimbra.anl.gov> References: <1225948167.151213.1335047557000.JavaMail.root@zimbra.anl.gov> Message-ID: So far the problem could not be reproduced. I suspect that there was a cross fire in the folder. My apologies for raising a false alarm. On Apr 21, 2012, at 5:32 PM, Michael Wilde wrote: > Hi Lorenzo, > > I did a quick check into this problem. As far as I can tell, the .rlog restart file needed to resume this run does not exist. Did it get removed manually, or do you think it got removed by Swift? (Which may remove it when the run completes successfully, but I need to check on that). > > - Mike > > ----- Original Message ----- >> From: "Lorenzo Pesce" >> To: swift-user at ci.uchicago.edu >> Sent: Friday, April 20, 2012 11:43:37 AM >> Subject: [Swift-user] Help with resuming swift job >> It is me again ;-) >> >> I seem to have let too many beagle users run Swift.... ;-) >> >> Here we are having a few difference people running from the same >> filesystem (the project file system, in this case the person who sent >> the first batch is on a plane right now) because of some quirks in >> Beagle's group permissions (namely they don't work right as far as I >> can tell), sometimes we need to change file permissions: >> >> I changed the file permission, >> Swift 0.93 swift-r5483 cog-r3339 >> >> and sent swift out with -resume >> >> RunID: 20120420-1622-6siopoz8 >> Execution failed: >> Could not aquire exclusive lock on log file: >> /lustre/beagle/GCNet/grasping/44neurons_SR0_200ms_5Windows_v1/causal_test-20120420-1433-k1eh3b2a.0.rlog >> >> Is there a lock file that needs to be changed? >> >> >> The motivation for this is that some of the files failed for two >> predictable reasons: they took too long and/or they blew the memory. >> There are also other issues, but they are not relevant at this point >> because they aren't solved (need of a new optimization, which we did >> not have time to implement). Usually we send the first batch, with >> short times and fewer nodes that does 98% of the work, and resume the >> remaining hacking the sites.xml file (we welcome better strategies >> which most of you have hinted to in the past). This was an attempt to >> rerun after a crash. >> >> BTW, my post mortem investigation seems to suggest that one of the >> users actually killed the swift script by mistake, wrongly changed a >> privilege in flight or something like that as opposed to the script >> failing or running out of time. Short of torture it does not seem he >> will confess more than this. >> >> Thanks a million as usual. 
>> >> Lorenzo >> >> _______________________________________________ >> Swift-user mailing list >> Swift-user at ci.uchicago.edu >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > From lpesce at uchicago.edu Mon Apr 23 19:54:13 2012 From: lpesce at uchicago.edu (Lorenzo Pesce) Date: Mon, 23 Apr 2012 19:54:13 -0500 Subject: [Swift-user] https://bitbucket.org/repo/create :: there is no swift as language option! Message-ID: <0DA3D540-C049-40AB-8882-7C155B47CFA2@uchicago.edu> From heather.stoller at gmail.com Tue Apr 24 15:57:40 2012 From: heather.stoller at gmail.com (Heather Stoller) Date: Tue, 24 Apr 2012 15:57:40 -0500 Subject: [Swift-user] Question - csv_mapper Message-ID: Hi Swift Users, I'm a student at UChicago working with Mike Wilde and Kyle Chard at the Computation Institute to try to create a distributed file management system. At present, I have a working script that can take two strings (physical file locations), scp the strings to get their file contents, and merge their contents together. Next, I'd like my script to take two strings (logical file names), transform them to physical file locations by looking them up in a map somewhere, and then scp the transformed strings to get their file contents etc. So I only need to do ONE step at present: look up strings in a map. It's been suggested that I use a csv_mapper for this purpose, so I tried. Below is my swiftscript and attached is my logfile. Could you suggest what might be the matter? There are liable to be very elementary mistakes as I am quite a novice swift user. Thank you kindly, and please do let me know if anything I've mentioned has been unclear. Best regards, Heather type file; type translator { string logical; string physical; } app (file f) echoer(translator t) { echo "The name is" t.logical "and the location is" t.physical stdout=@f; } file outfile <"translation.txt">; translator trans[] ; trace(trans[0].logical); trace(trans[0].physical); trace(trans[1].logical); trace(trans[1].physical); ##################################### # translate.csv # ##################################### logical, physical file1, hstoller at kedzie.cs.uchicago.edu:~/tmp/m file2, hstoller at kedzie.cs.uchicago.edu~/tmp/n -- Heather Stoller -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: csv_mapper_test-20120424-1545-a02ayqtg.log Type: application/octet-stream Size: 4103 bytes Desc: not available URL: From wilde at mcs.anl.gov Wed Apr 25 07:46:41 2012 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 25 Apr 2012 07:46:41 -0500 (CDT) Subject: [Swift-user] Question - csv_mapper In-Reply-To: Message-ID: <477303461.155668.1335358001694.JavaMail.root@zimbra.anl.gov> Hi Heather, You didnt give any indication here of what output the script produced. 
I ran it and got this:

login1$ swift map.swift no sites file specified, setting to default: /home/wilde/swift/rev/trunk/etc/sites.xml Swift trunk swift-r5739 cog-r3368 (cog modified locally) RunID: 20120425-0637-liq6u4s4 Progress: time: Wed, 25 Apr 2012 06:38:02 -0600 SwiftScript trace: trans[0].physical:string - Closed SwiftScript trace: trans[1].physical:string - Closed SwiftScript trace: trans[1].logical:string - Closed SwiftScript trace: trans[0].logical:string - Closed Final status: Wed, 25 Apr 2012 06:38:02 -0600 login1$

What's happening here is not what you expect. First, "mapping" in Swift is only applicable to file-valued types. You can't apply a mapper to variables or fields of type "string", as you tried to do in this script. Second, you tried to trace() the fields logical and physical, but your script never assigned a value to these fields. Even if your mapping had succeeded, it would not have set the *value* of any fields, just the *mapping* of those fields. Values are only set by assignment (i.e., by "="). It's only a quirk of the current Swift version you ran that the mapping in this case did indeed mark the state of the fields "closed" - as if they *were* assigned a value. I consider this a bug in the 0.93 version; I think the "right" thing for Swift to do here would be simply to hang, waiting for a value that will never be set, which will cause Swift to time out and report the hung state.

Let's take offline the question of how to proceed (as it's a class project issue, not a Swift usage issue). I should note that what I think you may want here is either to read in a table of values that constitute your logical/physical mapping, using one of the readData() functions, or to use the new Swift associative arrays (which can be indexed by string values, but are still undocumented so far as I know), or, most likely, to create an app() interface to your external mapping database.

- Mike

----- Original Message ----- > From: "Heather Stoller" > To: swift-user at ci.uchicago.edu > Cc: "Michael Wilde" , "Kyle Chard" > Sent: Tuesday, April 24, 2012 3:57:40 PM > Subject: Question - csv_mapper > Hi Swift Users, > > I'm a student at UChicago working with Mike Wilde and Kyle Chard at > the Computation Institute to try to create a distributed file > management system. At present, I have a working script that can take > two strings (physical file locations), scp the strings to get their > file contents, and merge their contents together. Next, I'd like my > script to take two strings (logical file names), transform them to > physical file locations by looking them up in a map somewhere, and > then scp the transformed strings to get their file contents etc. So I > only need to do ONE step at present: look up strings in a map. It's > been suggested that I use a csv_mapper for this purpose, so I tried. > Below is my swiftscript and attached is my logfile. Could you suggest > what might be the matter? There are liable to be very elementary > mistakes as I am quite a novice swift user. Thank you kindly, and > please do let me know if anything I've mentioned has been unclear. 
> Best regards, > Heather > > type file; > > type translator { > string logical; > string physical; > } > > app (file f) echoer(translator t) { > echo "The name is" t.logical "and the location is" t.physical > stdout=@f; > } > > file outfile <"translation.txt">; > > translator trans[] ; > > trace(trans[0].logical); > trace(trans[0].physical); > trace(trans[1].logical); > trace(trans[1].physical); > > ##################################### > > # translate.csv # > > ##################################### > > logical, physical > file1, hstoller at kedzie.cs.uchicago.edu:~/tmp/m > file2, hstoller at kedzie.cs.uchicago.edu ~/tmp/n > > > -- > Heather Stoller

-- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory

From lpesce at uchicago.edu Thu Apr 26 08:58:55 2012
From: lpesce at uchicago.edu (Lorenzo Pesce)
Date: Thu, 26 Apr 2012 08:58:55 -0500
Subject: [Swift-user] Problem with app using maxTime
References: Message-ID:

From sites.xml: 1800 600 100 100 -----

The goal was to ask PBS for 1/2 hour and tell Swift that the jobs are expected to run in 10 minutes.

The error refers to two quantities (walltime and reserve) that I am not sure about.

The jobs were run in a somewhat chaotic way, so I am not sure whether I can reconstruct what happened.

> Failed to transfer wrapper log for job DemoRealWrapper-1u8ltcqk > Failed to transfer wrapper log for job DemoRealWrapper-7o8ltcqk > Failed to transfer wrapper log for job DemoRealWrapper-zo8ltcqk > Exception in DemoRealWrapper: > Arguments: [/soft/matlab/7.13, 57.0, 18.0, Win2/X.mat, Win2/] > Host: pbs > Directory: demo_real-20120424-0818-cecgng88/jobs/1/DemoRealWrapper-1u8ltcqk > stderr.txt: > > stdout.txt: > > ---- > > Caused by: Job walltime > maxTime - reserve (30:00:00 > 00:29:00) > > Exception in DemoRealWrapper: > Arguments: [/soft/matlab/7.13, 58.0, 18.0, Win3/X.mat, Win3/] > Host: pbs > Directory: demo_real-20120424-0818-cecgng88/jobs/7/DemoRealWrapper-7o8ltcqk > stderr.txt: > > stdout.txt: >

From wilde at mcs.anl.gov Thu Apr 26 09:04:34 2012
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 26 Apr 2012 09:04:34 -0500 (CDT)
Subject: [Swift-user] Problem with app using maxTime
In-Reply-To: Message-ID: <310446459.1841.1335449074190.JavaMail.root@zimbra.anl.gov>

Lorenzo, for historical reasons which should be fixed (or will be hidden with the gensites utility), maxtime is specified in integer seconds, while maxwalltime needs to be specified in hh:mm:ss, PBS style. My first guess here is that Swift interpreted your maxwalltime as hours instead of seconds. So 600 seconds should be specified as 00:10:00. Can you try that before we dig deeper into this one?

Thanks,

- Mike

----- Original Message ----- > From: "Lorenzo Pesce" > To: swift-user at ci.uchicago.edu > Sent: Thursday, April 26, 2012 8:58:55 AM > Subject: [Swift-user] Problem with app using maxTime > From sites.xml: > 1800 > 600 > 100 > 100 > ----- > > The goal was to ask PBS for 1/2 hour and tell Swift that the jobs > are expected to run in 10 minutes. > > The error refers to two quantities (walltime and reserve) that I am > not sure about. > > The jobs were run in a somewhat chaotic way, so I am not sure whether > I can reconstruct what happened. 
> > > > > > > > Failed to transfer wrapper log for job DemoRealWrapper-1u8ltcqk > > Failed to transfer wrapper log for job DemoRealWrapper-7o8ltcqk > > Failed to transfer wrapper log for job DemoRealWrapper-zo8ltcqk > > Exception in DemoRealWrapper: > > Arguments: [/soft/matlab/7.13, 57.0, 18.0, Win2/X.mat, Win2/] > > Host: pbs > > Directory: > > demo_real-20120424-0818-cecgng88/jobs/1/DemoRealWrapper-1u8ltcqk > > stderr.txt: > > > > stdout.txt: > > > > ---- > > > > Caused by: Job walltime > maxTime - reserve (30:00:00 > 00:29:00) > > > > Exception in DemoRealWrapper: > > Arguments: [/soft/matlab/7.13, 58.0, 18.0, Win3/X.mat, Win3/] > > Host: pbs > > Directory: > > demo_real-20120424-0818-cecgng88/jobs/7/DemoRealWrapper-7o8ltcqk > > stderr.txt: > > > > stdout.txt: > > > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From lpesce at uchicago.edu Fri Apr 27 11:47:29 2012 From: lpesce at uchicago.edu (Lorenzo Pesce) Date: Fri, 27 Apr 2012 11:47:29 -0500 Subject: [Swift-user] Problem with app using maxTime In-Reply-To: <310446459.1841.1335449074190.JavaMail.root@zimbra.anl.gov> References: <310446459.1841.1335449074190.JavaMail.root@zimbra.anl.gov> Message-ID: <155C44A2-4D71-4948-AD96-3FCF090E605A@uchicago.edu> Diagnosis was perfect. Jobs are running fine now. I am going to write a helper script for -resume to that the settings will be "automatically" preserved. Thanks. On Apr 26, 2012, at 9:04 AM, Michael Wilde wrote: > Lorenzo, for historical reasons which should be fixed (or will be hidden with the gensites utility), maxtime is specified in integer seconds, while maxwalltime needs to be specified in hh:mm:ss ala say PBS style. > > My first guess here is that Swift interpreted maxwalltime to be hours instead of seconds. so 600 secs should be specified as 00:10:00. Can you try that before we dig deeper into this one? > > Thanks, > > - Mike > > ----- Original Message ----- >> From: "Lorenzo Pesce" >> To: swift-user at ci.uchicago.edu >> Sent: Thursday, April 26, 2012 8:58:55 AM >> Subject: [Swift-user] Problem with app using maxTime >> From sites.xml: >> 1800 >> 600 >> 100 >> 100 >> ----- >> >> The goal was to ask for 1/2 hour to PBS and tell swift that the jobs >> are expected to run in 10 minutes. >> >> The error refers to two quantities (walltime and reserve) that I am >> not sure about. >> >> The jobs were run in a somewhat chaotic way, so I am not sure whether >> I can reconstruct what happened. 
>> >> >> >> >> >> >>> Failed to transfer wrapper log for job DemoRealWrapper-1u8ltcqk >>> Failed to transfer wrapper log for job DemoRealWrapper-7o8ltcqk >>> Failed to transfer wrapper log for job DemoRealWrapper-zo8ltcqk >>> Exception in DemoRealWrapper: >>> Arguments: [/soft/matlab/7.13, 57.0, 18.0, Win2/X.mat, Win2/] >>> Host: pbs >>> Directory: >>> demo_real-20120424-0818-cecgng88/jobs/1/DemoRealWrapper-1u8ltcqk >>> stderr.txt: >>> >>> stdout.txt: >>> >>> ---- >>> >>> Caused by: Job walltime > maxTime - reserve (30:00:00 > 00:29:00) >>> >>> Exception in DemoRealWrapper: >>> Arguments: [/soft/matlab/7.13, 58.0, 18.0, Win3/X.mat, Win3/] >>> Host: pbs >>> Directory: >>> demo_real-20120424-0818-cecgng88/jobs/7/DemoRealWrapper-7o8ltcqk >>> stderr.txt: >>> >>> stdout.txt: >>> >> >> _______________________________________________ >> Swift-user mailing list >> Swift-user at ci.uchicago.edu >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory >