From ketan at mcs.anl.gov Fri Jan 2 17:36:01 2015
From: ketan at mcs.anl.gov (Ketan Maheshwari)
Date: Fri, 2 Jan 2015 17:36:01 -0600
Subject: [Swift-user] minimize logging and progress messages

Hi,

As I am dealing with a multi-day run, is there a way to print progress messages less often (say, once every 10 minutes) and reduce logging, or possibly disable logging altogether?

Thanks,
Ketan

From ketan at mcs.anl.gov Sat Jan 3 14:37:30 2015
From: ketan at mcs.anl.gov (Ketan Maheshwari)
Date: Sat, 3 Jan 2015 14:37:30 -0600
Subject: [Swift-user] trunk error when using two pools

Hi,

I am trying to run a Swift script with two apps, each with a different configuration in terms of job size. For this, I am using two different pools in the conf. At runtime, however, I get the following error:

Execution failed:
Exception in wrf:
Arguments: []
Host: edison2
Directory: wf.edison-run002/jobs/w/wrf-jjtyam2m
exception @ swift-int-staging.k, line: 165
Caused by:
exception @ swift-int-staging.k, line: 160
Caused by: null
Caused by: org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Could not submit job
Caused by: org.globus.cog.coaster.ProtocolException: java.lang.IllegalStateException: A channel already exists for this key: null@id://2
Caused by: org.globus.cog.coaster.RemoteException: java.lang.IllegalStateException: A channel already exists for this key: null@id://2
Caused by: java.lang.IllegalStateException: A channel already exists for this key: null@id://2
    at org.globus.cog.coaster.channels.ChannelManager.registerChannel(ChannelManager.java:128)
    at org.globus.cog.coaster.channels.ChannelManager.registerChannel(ChannelManager.java:140)
    at org.globus.cog.coaster.channels.ChannelManager.registerChannel(ChannelManager.java:136)
    at org.globus.cog.abstraction.coaster.service.ServiceConfigurationHandler.requestComplete(ServiceConfigurationHandler.java:50)
    at org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:112)
    at org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:590)
    at org.globus.cog.coaster.channels.AbstractPipedChannel.actualSend(AbstractPipedChannel.java:101)
    at org.globus.cog.coaster.channels.AbstractPipedChannel$Sender.run(AbstractPipedChannel.java:130)

The rundir for this run, which is on a Cray (NERSC Edison), is attached.

Thanks for any suggestions,
Ketan
-------------- next part --------------
A non-text attachment was scrubbed...
Name: run002.tgz
Type: application/x-gzip
Size: 19851 bytes

From hategan at mcs.anl.gov Sat Jan 3 14:39:23 2015
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sat, 3 Jan 2015 12:39:23 -0800
Subject: [Swift-user] Edison: pbs job starts but Swift unresponsive
Message-ID: <1420317563.5777.0.camel@echo>

On Tue, 2014-12-30 at 12:49 -0600, Ketan Maheshwari wrote:
> Hi Mihael,
>
> It takes about 8-9 minutes after the worker starting (ie. queue showing
> running status) that the Swift progress text shows active status. In the
> active status, one wave of tasks finishes and the status goes back to
> submit state but now no job shows up in the queue.

I see the problem, but I'm not sure what causes it. Can you enable worker logging and send a worker log?

Mihael

From hategan at mcs.anl.gov Sat Jan 3 14:59:23 2015
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sat, 3 Jan 2015 12:59:23 -0800
Subject: [Swift-user] trunk error when using two pools
Message-ID: <1420318763.5777.5.camel@echo>

Hi Ketan,

Use URL: "s1" for the first site and URL: "s2" for the second (or any other two distinct strings).

There is some inconsistent logic in deciding when to start a service vs. when to configure a service, and the trick above reconciles that. I'll see if that can be fixed more cleanly.

Mihael

_______________________________________________
Swift-user mailing list
Swift-user at ci.uchicago.edu
https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user

From hategan at mcs.anl.gov Sat Jan 3 15:16:10 2015
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sat, 3 Jan 2015 13:16:10 -0800
Subject: [Swift-user] trunk error when using two pools
Message-ID: <1420319770.5777.6.camel@echo>

Hi,

There's also now a fix in git that would allow your initial configuration to work properly.

Mihael

From ketan at mcs.anl.gov Sat Jan 3 15:22:43 2015
From: ketan at mcs.anl.gov (Ketan Maheshwari)
Date: Sat, 3 Jan 2015 15:22:43 -0600
Subject: [Swift-user] Edison: pbs job starts but Swift unresponsive

Please find the worker log attached.

Thanks,
Ketan
-------------- next part --------------
A non-text attachment was scrubbed...
Name: worker-0103-5312160-000000.log
Type: application/octet-stream
Size: 1265452 bytes

From hategan at mcs.anl.gov Sat Jan 3 15:53:20 2015
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sat, 3 Jan 2015 13:53:20 -0800
Subject: [Swift-user] Edison: pbs job starts but Swift unresponsive
Message-ID: <1420322000.11251.1.camel@echo>

Is this from the same run? I don't see delays between the jobs completing and the worker being shut down. Can you also post the Swift log that corresponds to this run and confirm that you see the problem in this run?

Mihael

From ketan at mcs.anl.gov Sat Jan 3 16:29:00 2015
From: ketan at mcs.anl.gov (Ketan Maheshwari)
Date: Sat, 3 Jan 2015 16:29:00 -0600
Subject: [Swift-user] Edison: pbs job starts but Swift unresponsive

Yes, this was a different run.

Here are the run directory and worker log for a fresh run, where I see the job in running state for ~9 minutes before the Swift status shows the task as active.

Thanks,
Ketan
-------------- next part --------------
A non-text attachment was scrubbed...
Name: worker-0103-1302110-000000.log
Type: application/octet-stream
Size: 123605 bytes
-------------- next part --------------
A non-text attachment was scrubbed...
Name: run001.tgz
Type: application/x-gzip
Size: 20735 bytes

From hategan at mcs.anl.gov Sat Jan 3 18:09:49 2015
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sat, 3 Jan 2015 16:09:49 -0800
Subject: [Swift-user] Edison: pbs job starts but Swift unresponsive
Message-ID: <1420330189.12041.5.camel@echo>

OK, so I was looking for problems after the first batch of jobs, but there aren't any here.

The 9-minute delay is because workers try all IP addresses that the head node has, and it may take a long time to time out through all of them until a good one is found.

You could force a specific IP address (in your case it's probably 128.55.34.2) using:

128.55.34.2

Mihael

From ketan at mcs.anl.gov Mon Jan 5 17:13:34 2015
From: ketan at mcs.anl.gov (Ketan Maheshwari)
Date: Mon, 5 Jan 2015 17:13:34 -0600
Subject: [Swift-user] Trunk: staging in takes long time

Hi,

I am trying to use Swift trunk on Edison with "local" staging, and staging seems to take a long time. For the same workflow running with 0.95, staging finishes instantly.

Attached is the rundir. Any ideas why staging is taking so long?

Thanks,
--Ketan
-------------- next part --------------
A non-text attachment was scrubbed...
Name: run001.tgz
Type: application/x-gzip
Size: 1935820 bytes

From hategan at mcs.anl.gov Mon Jan 5 23:13:20 2015
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 5 Jan 2015 21:13:20 -0800
Subject: [Swift-user] Trunk: staging in takes long time
Message-ID: <1420521200.18796.1.camel@echo>

On Mon, 2015-01-05 at 17:13 -0600, Ketan Maheshwari wrote:
> Hi,
>
> Trying to use Swift trunk on Edison with "local" staging. It looks like the
> staging takes a long time.

Can you be more specific? How did you conclude that staging takes a long time?

Mihael

From ketan at mcs.anl.gov Tue Jan 6 07:50:48 2015
From: ketan at mcs.anl.gov (Ketan Maheshwari)
Date: Tue, 6 Jan 2015 07:50:48 -0600
Subject: [Swift-user] Trunk: staging in takes long time

The Swift progress message shows "staging in" for more than 10 minutes, with no status change until the job walltime expires.

From ketan at mcs.anl.gov Tue Jan 6 09:47:51 2015
From: ketan at mcs.anl.gov (Ketan Maheshwari)
Date: Tue, 6 Jan 2015 09:47:51 -0600
Subject: [Swift-user] Trunk: submits one pbs job despite larger maxJobs value

Hi,

I am trying to get 20 PBS jobs submitted with Swift trunk, and I set maxJobs to 20 for this purpose. However, at runtime I see that only one job gets submitted.

Attached is the run directory.

Thanks for any suggestions,
Ketan
-------------- next part --------------
A non-text attachment was scrubbed...
Name: run009.tgz
Type: application/x-gzip
Size: 14903 bytes

From ketan at mcs.anl.gov Tue Jan 6 11:23:40 2015
From: ketan at mcs.anl.gov (Ketan Maheshwari)
Date: Tue, 6 Jan 2015 11:23:40 -0600
Subject: [Swift-user] output file array

I am trying trunk for this pattern: a toy application, invoked in a foreach loop, that creates an output file plus stdout and stderr files. The files are mapped into an output directory named with the loop index as a suffix, so that they do not get overwritten:

foreach i in [0:9]{
  file out;
  file err;
  file appout;
  (out, err, appout) = touch_app("Hello");
}

The stdout and stderr files correctly end up in their respective directories, but the app-generated file does not. I see the following error message:

Execution failed:
Exception in t:
Arguments: [Hello]
Host: edison1
Directory: touchafile-run001/jobs/t/t-ffv33r2m
exception @ swift-int-staging.k, line: 165
Caused by: The following output files were not created by the application: outdir4/afile.txt

Any suggestions for fixing this? Attached is the test directory with the sources, the executable and the rundir.

Thanks,
Ketan

On Mon, Dec 22, 2014 at 5:14 PM, Mihael Hategan wrote:
> Hi,
>
> I don't think 0.95 supports dynamic arrays output from apps. You will
> need trunk/0.96 for that.
>
> Mihael
>
> On Mon, 2014-12-22 at 15:37 -0600, Ketan Maheshwari wrote:
> > Hi Mihael,
> >
> > This is with Swift 0.95.
> >
> > Thanks,
> > Ketan
> >
> > On Sun, Dec 21, 2014 at 2:40 PM, Hategan-Marandiuc, Philip M. <
> > hategan at mcs.anl.gov> wrote:
> >
> > > Hi Ketan,
> > >
> > > Sorry for the delay. Is this trunk or 0.95?
> > >
> > > Mihael
> > >
> > > On Wed, 2014-12-17 at 14:26 -0600, Ketan Maheshwari wrote:
> > > > Hi,
> > > >
> > > > I am dealing with a workflow pattern where an app expects multiple output
> > > > files with a pattern.
> > > >
> > > > The app signature is:
> > > >
> > > > app (file[] _wrfout, file _out, file _err) wrf_app (file _wrf_in, file[] _tbl, file[] _ozone, ...)
> > > > {
> > > >   wrf stdout=@_out stderr=@_err;
> > > > }
> > > >
> > > > The _wrfout files are the app result files which follows a pattern: wrfout_*
> > > >
> > > > So, I am invoking the application in a foreach loop as:
> > > >
> > > > foreach i in [0:2]{
> > > >   file[] wrfout
> > > > pattern="wrfout_*">;
> > > >   file wrfstdout
> > > > "/std.out")>;
> > > >   file wrfstderr
> > > > "/std.err")>;
> > > >
> > > >   (wrfout, wrfstdout, wrfstderr) = wrf_app (wrfin, tbl, ozone, tr, data, gribmap, namelist, co2_trans, input_sounding);
> > > > }
> > > >
> > > > The script hangs at runtime with the following messages:
> > > >
> > > > No events in 1s.
> > > > Finding dependency loops...
> > > >
> > > > Waiting threads:
> > > > Thread: R-6-0-4, waiting on wrfout (declared on line 50)
> > > > swift:stageOut, wf.edison, line 134
> > > > swift:execute, wf.edison, line 123
> > > > wrf_app, wf.edison, line 242
> > > >
> > > > Thread: R-6-2-4, waiting on wrfout (declared on line 50)
> > > > swift:stageOut, wf.edison, line 134
> > > > swift:execute, wf.edison, line 123
> > > > wrf_app, wf.edison, line 242
> > > >
> > > > Thread: R-6-1-4, waiting on wrfout (declared on line 50)
> > > > swift:stageOut, wf.edison, line 134
> > > > swift:execute, wf.edison, line 123
> > > > wrf_app, wf.edison, line 242
> > > >
> > > > Any suggestions?
> > > >
> > > > Thanks,
> > > > Ketan
-------------- next part --------------
A non-text attachment was scrubbed...
Name: touchafile.tgz
Type: application/x-gzip
Size: 112033 bytes

From yadudoc1729 at gmail.com Tue Jan 6 11:27:18 2015
From: yadudoc1729 at gmail.com (Yadu Nand)
Date: Tue, 6 Jan 2015 11:27:18 -0600
Subject: [Swift-user] Trunk: submits one pbs job despite larger maxJobs value

Hi Ketan,

The config sets initialParallelTasks and maxParallelTasks to 10, with 24 tasksPerNode. This means that your tasks are throttled at 10, and a single scheduler job can run all 10 of them. Since you have maxJobs set to 20, you should probably set initialParallelTasks and maxParallelTasks to 20 x 24 (= 480), with 24 tasksPerNode, to get 20 jobs/nodes started.
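
The arithmetic behind Yadu's suggestion can be sketched as follows. This is a simplified model, not Swift's actual coaster block-allocation algorithm. It assumes that each scheduler job contributes one node with tasksPerNode task slots, that the task throttle (maxParallelTasks) sets how many slots are requested, and that maxJobs caps the number of jobs. The function name blocks_needed is hypothetical.

```python
import math


def blocks_needed(max_parallel_tasks: int, tasks_per_node: int, max_jobs: int) -> int:
    """Hypothetical estimate of how many scheduler jobs (one node each) are
    requested to host max_parallel_tasks concurrent tasks, capped at max_jobs.
    Simplified model only; not Swift's real block allocator."""
    if tasks_per_node <= 0:
        raise ValueError("tasks_per_node must be positive")
    # Each job/node offers tasks_per_node slots; round up so every task fits.
    wanted = math.ceil(max_parallel_tasks / tasks_per_node)
    # maxJobs is an upper bound, never a target by itself.
    return min(wanted, max_jobs)


# Original config: a throttle of 10 tasks fits in one 24-slot node.
print(blocks_needed(10, 24, 20))       # 1
# Suggested config: 20 x 24 = 480 tasks fills 20 nodes.
print(blocks_needed(20 * 24, 24, 20))  # 20
```

Under this model, raising maxJobs alone never adds jobs. The throttle has to reach at least maxJobs x tasksPerNode before all 20 jobs are requested.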
Here's the config with my modifications https://gist.github.com/yadudoc/8694361f512678357178 Yadu On Tue, Jan 6, 2015 at 9:47 AM, Ketan Maheshwari wrote: > Hi, > > I am trying to get 20 pbs jobs submitted with Swift trunk. I set maxJobs > to 20 for this purpose. > > However, at runtime, I see that only one job gets submitted. > > Attached is the run directory. > > Thanks for any suggestions, > Ketan > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > -- Yadu Nand B -------------- next part -------------- An HTML attachment was scrubbed... URL: From ketan at mcs.anl.gov Tue Jan 6 12:21:26 2015 From: ketan at mcs.anl.gov (Ketan Maheshwari) Date: Tue, 6 Jan 2015 12:21:26 -0600 Subject: [Swift-user] Edison: pbs job starts but Swift unresponsive In-Reply-To: <1420330189.12041.5.camel@echo> References: <1419374265.9912.6.camel@echo> <1420317563.5777.0.camel@echo> <1420322000.11251.1.camel@echo> <1420330189.12041.5.camel@echo> Message-ID: So, I tried with this line in sites file but the run crashes with following error messages: Execution failed: Exception in wrf: Arguments: [] Host: edison2 Directory: wf.edison-run006/jobs/i/wrf-iuas5r2m exception @ swift-int.k, line: 530 Caused by: Block task failed: 0106-1110110-000000 Block task ended prematurely Application 9450632 exit codes: 101, 111 Application 9450632 resources: utime ~25s, stime ~30s, Rss ~8260, inblocks ~425450, outblocks ~28500 + -------------------------------------------------------------------------- + Job name: B0106-1110110-0 + Job Id: 2247186.edique02 + System: edison + Queued Time: Tue Jan 6 10:11:12 2015 + Start Time: Tue Jan 6 10:12:20 2015 + Completion Time: Tue Jan 6 10:12:32 2015 + User: ketan + MOM Host: nid02819 + Queue: debug + Req. 
Resources: mppnodect=25,mppnppn=24,mppwidth=600,walltime=00:29:00 + Used Resources: cput=00:00:00,mem=0kb,vmem=0kb,walltime=00:00:12 + Acct String: m1540 + PBS_O_WORKDIR: /global/u2/k/ketan/wrf + Submit Args: /global/u2/k/ketan/wrf/run006/scripts/PBS4838165627827831510.submit + -------------------------------------------------------------------------- Failed to connect: Network is unreachable at /global/homes/k/ketan/.globus/coasters/cscript3816651147061795773.pl line 1101. Failed to connect: Network is unreachable at /global/homes/k/ketan/.globus/coasters/cscript3816651147061795773.pl line 1101. Failed to connect: Network is unreachable at /global/homes/k/ketan/.globus/coasters/cscript3816651147061795773.pl line 1101. .... .... <> This is a different ip that I found from the previous run's logs in a line like this (due to a different login host) : 2015-01-06 10:01:27,985-0800 INFO MetaChannel MetaChannel [context: worker-6, boundTo: null] binding to TCPChannel [type: server, contact: 128.55.34.27:52189] The rundir is attached. --Ketan On Sat, Jan 3, 2015 at 6:09 PM, Mihael Hategan wrote: > Ok, so I was looking for problems after the first batch of jobs, but > aren't any here. > > The 9 minute delay is because workers try all IP addresses that the head > node has, and it may take a long time to time-out through all of them > until a good one is found. > > You could force a specific IP address (in your case it's probably > 128.55.34.2) using: > > 128.55.34.2 > > Mihael > > On Sat, 2015-01-03 at 16:29 -0600, Ketan Maheshwari wrote: > > Yes, this was a different run. > > > > Here is the run directory and worker log for a fresh run where I see job > in > > running stated for ~9 minutes before Swift status shows task active. > > > > Thanks, > > Ketan > > > > On Sat, Jan 3, 2015 at 3:53 PM, Mihael Hategan > wrote: > > > > > Is this from the same run? I don't see delays between the jobs > > > completing and the worker being shut down. 
Can you also post the swift > > > log that corresponds to this run and confirm that you see the problem > in > > > this run? > > > > > > Mihael > > > > > > On Sat, 2015-01-03 at 15:22 -0600, Ketan Maheshwari wrote: > > > > Please find the workerlog attached. > > > > > > > > Thanks, > > > > Ketan > > > > > > > > On Sat, Jan 3, 2015 at 2:39 PM, Mihael Hategan > > > wrote: > > > > > > > > > On Tue, 2014-12-30 at 12:49 -0600, Ketan Maheshwari wrote: > > > > > > Hi Mihael, > > > > > > > > > > > > It takes about 8-9 minutes after the worker starting (ie. queue > > > showing > > > > > > running status) that the Swift progress text shows active > status. In > > > the > > > > > > active status, one wave of tasks finishes and the status goes > back to > > > > > > submit state but now no job shows up in the queue. > > > > > > > > > > I see the problem, but I'm not sure what causes it. Can you enable > > > > > worker logging and send a worker log? > > > > > > > > > > Mihael > > > > > > > > > > > > > > > _______________________________________________ > > > > > Swift-user mailing list > > > > > Swift-user at ci.uchicago.edu > > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > > > > > > > > > > > > > _______________________________________________ > > > Swift-user mailing list > > > Swift-user at ci.uchicago.edu > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > > > > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: run006.tgz
Type: application/x-gzip
Size: 34828 bytes
Desc: not available
URL:

From hategan at mcs.anl.gov Tue Jan 6 12:36:05 2015
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 6 Jan 2015 10:36:05 -0800
Subject: [Swift-user] Edison: pbs job starts but Swift unresponsive
In-Reply-To:
References: <1419374265.9912.6.camel@echo> <1420317563.5777.0.camel@echo> <1420322000.11251.1.camel@echo> <1420330189.12041.5.camel@echo>
Message-ID: <1420569365.25608.4.camel@echo>

Well, clearly 128.55.34.27 is not working. You need to find the correct
one. I would suggest a combination of ifconfig and looking at worker
logs without an internalHostname set to see what the last IP tried is.

Mihael

On Tue, 2015-01-06 at 12:21 -0600, Ketan Maheshwari wrote:
> So, I tried with this line in sites file but the run crashes with following
> error messages:
>
> Execution failed:
> Exception in wrf:
> Arguments: []
> Host: edison2
> Directory: wf.edison-run006/jobs/i/wrf-iuas5r2m
> exception @ swift-int.k, line: 530
> Caused by: Block task failed: 0106-1110110-000000 Block task ended prematurely
> Application 9450632 exit codes: 101, 111
> Application 9450632 resources: utime ~25s, stime ~30s, Rss ~8260, inblocks ~425450, outblocks ~28500
>
> +--------------------------------------------------------------------------
> + Job name: B0106-1110110-0
> + Job Id: 2247186.edique02
> + System: edison
> + Queued Time: Tue Jan 6 10:11:12 2015
> + Start Time: Tue Jan 6 10:12:20 2015
> + Completion Time: Tue Jan 6 10:12:32 2015
> + User: ketan
> + MOM Host: nid02819
> + Queue: debug
> + Req. Resources: mppnodect=25,mppnppn=24,mppwidth=600,walltime=00:29:00
> + Used Resources: cput=00:00:00,mem=0kb,vmem=0kb,walltime=00:00:12
> + Acct String: m1540
> + PBS_O_WORKDIR: /global/u2/k/ketan/wrf
> + Submit Args: /global/u2/k/ketan/wrf/run006/scripts/PBS4838165627827831510.submit
> +--------------------------------------------------------------------------
>
>
> Failed to connect: Network is unreachable at /global/homes/k/ketan/.globus/coasters/cscript3816651147061795773.pl line 1101.
> Failed to connect: Network is unreachable at /global/homes/k/ketan/.globus/coasters/cscript3816651147061795773.pl line 1101.
> Failed to connect: Network is unreachable at /global/homes/k/ketan/.globus/coasters/cscript3816651147061795773.pl line 1101.
> ....
> ....
> <>
>
> This is a different ip that I found from the previous run's logs in a line
> like this (due to a different login host) :
>
> 2015-01-06 10:01:27,985-0800 INFO MetaChannel MetaChannel [context: worker-6, boundTo: null] binding to TCPChannel [type: server, contact: 128.55.34.27:52189]
>
> The rundir is attached.
>
> --Ketan
>
> On Sat, Jan 3, 2015 at 6:09 PM, Mihael Hategan wrote:
>
> > Ok, so I was looking for problems after the first batch of jobs, but
> > aren't any here.
> >
> > The 9 minute delay is because workers try all IP addresses that the head
> > node has, and it may take a long time to time-out through all of them
> > until a good one is found.
> >
> > You could force a specific IP address (in your case it's probably
> > 128.55.34.2) using:
> >
> > 128.55.34.2
> >
> > Mihael

From wilde at anl.gov Tue Jan 6 12:53:40 2015
From: wilde at anl.gov (Michael Wilde)
Date: Tue, 6 Jan 2015 12:53:40 -0600
Subject: [Swift-user] output file array
In-Reply-To:
References: <1419290056.31348.0.camel@echo>
Message-ID: <54AC2F34.1050208@anl.gov>

Is your app touch_app( ) correctly creating output files of the form
outdirN/afile.txt?

From the error message, I suspect that it is not.

Your app is declared as:

app (file _stdout, file _stderr, file _appout) touch_app(string _instr){
    t _instr stdout=@_stdout stderr=@_stderr;
}

You need to pass filename(_appout) to the app, via its command line, so
that it knows the correct output filename to create. Then you need to
ensure that the app does indeed create that file.

- Mike

On 1/6/15 11:23 AM, Ketan Maheshwari wrote:
> Trying trunk for this pattern.
>
> A toy application invoked over a foreach loop that creates an output
> file, an stdout and an stderr files.
>
> The files are mapped into an output directory named with the loop
> index as suffix so that the files do not get overwritten:
>
> foreach i in [0:9]{
>
> file out;
> file err;
> file appout;
>
> (out, err, appout) = touch_app("Hello");
> }
>
>
> The stdout and stderr files correctly ends up in their respective
> directories but the app generated file does not.
>
> I see following error message:
>
> Execution failed:
> Exception in t:
> Arguments: [Hello]
> Host: edison1
> Directory: touchafile-run001/jobs/t/t-ffv33r2m
> exception @ swift-int-staging.k, line: 165
> Caused by: The following output files were not created by the
> application: outdir4/afile.txt
>
> Any suggestions for fixing this?
>
> Attached is the test directory with sources and executable with rundir.
>
> Thanks,
> Ketan

--
Michael Wilde
Mathematics and Computer Science          Computation Institute
Argonne National Laboratory               The University of Chicago
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From hategan at mcs.anl.gov Tue Jan 6 12:56:01 2015
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 6 Jan 2015 10:56:01 -0800
Subject: [Swift-user] Edison: pbs job starts but Swift unresponsive
In-Reply-To: <1420569365.25608.4.camel@echo>
References: <1419374265.9912.6.camel@echo> <1420317563.5777.0.camel@echo> <1420322000.11251.1.camel@echo> <1420330189.12041.5.camel@echo> <1420569365.25608.4.camel@echo>
Message-ID: <1420570561.26595.0.camel@echo>

This could be a more general solution:
https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=1393

But it probably won't make it into 0.95.

Mihael

On Tue, 2015-01-06 at 10:36 -0800, Mihael Hategan wrote:
> Well, clearly 128.55.34.27 is not working. You need to find the correct
> one. I would suggest a combination of ifconfig and looking at worker
> logs without an internalHostname set to see what the last IP tried is.

From ketan at mcs.anl.gov Tue Jan 6 13:04:09 2015
From: ketan at mcs.anl.gov (Ketan Maheshwari)
Date: Tue, 6 Jan 2015 13:04:09 -0600
Subject: [Swift-user] output file array
In-Reply-To: <54AC2F34.1050208@anl.gov>
References: <1419290056.31348.0.camel@echo> <54AC2F34.1050208@anl.gov>
Message-ID:

Hi Mike,

The app is creating the file, but because of the way the app is invoked,
the filename does not appear on the command line. The app creates the
file at the top level, so it is just afile.txt.

However, since there are many calls to the app, I need to put this file
into a separate directory to avoid it being overwritten. That is why I am
using the directory outdirN.

I was wondering whether the Swift runtime could determine from the app
definition that the file _appout is an expected output, similar to
stdout, and move the file to outdirN (again, similar to stdout).

Thanks,
Ketan
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From wilde at anl.gov Tue Jan 6 13:13:37 2015
From: wilde at anl.gov (Michael Wilde)
Date: Tue, 6 Jan 2015 13:13:37 -0600
Subject: [Swift-user] output file array
In-Reply-To:
References: <1419290056.31348.0.camel@echo> <54AC2F34.1050208@anl.gov>
Message-ID: <54AC33E1.2000701@anl.gov>

On 1/6/15 1:04 PM, Ketan Maheshwari wrote:
> Hi Mike,
>
> The app is creating the file but the way the app is invoked, the
> filename does not appear in the command-line. It creates the file at
> the top level, so just afile.txt.
>
> However, since there are many calls to the app, to avoid this file
> being overwritten, I need to put this file into a separate directory
> which is why I am using directory outdirN.
>
> I was thinking if it is possible for Swift runtime to find from the
> app definition that the file _appout is expected output similar to
> stdout and move the file to the outdirN (again similar to stdout).
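Until a feature like this exists, the app's behaviour can be adapted explicitly. The app used here always writes afile.txt into its working directory and cannot be told where to write. A small wrapper can run the unmodified app and then move afile.txt to whatever path it is given, which lets the Swift app declaration pass filename(_appout) on the command line as Mike suggested earlier. The sketch below is in Python. The function name, the argument order (destination first, then the app command) and the fixed name afile.txt are assumptions for illustration, not part of Swift.

```python
import os
import shutil
import subprocess
import sys


def run_and_collect(cmd, produced_name, dest_path, workdir="."):
    """Run an unmodified app, then move the file it wrote to dest_path.

    cmd           -- the app command line as a list, e.g. ["t", "Hello"]
    produced_name -- the fixed name the app writes (assumed: "afile.txt")
    dest_path     -- the mapped output path Swift expects, e.g. "outdir4/afile.txt";
                     a relative dest_path is resolved against the wrapper's own
                     current directory, not against workdir
    Returns the app's exit code. Raises FileNotFoundError if the app exits
    successfully but did not create produced_name.
    """
    result = subprocess.run(cmd, cwd=workdir)
    if result.returncode != 0:
        return result.returncode  # propagate the app's failure unchanged
    produced = os.path.join(workdir, produced_name)
    if not os.path.exists(produced):
        raise FileNotFoundError(produced)
    # Create outdirN/ if it does not exist yet, then move the file into place.
    parent = os.path.dirname(dest_path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    shutil.move(produced, dest_path)
    return 0


if __name__ == "__main__" and len(sys.argv) >= 3:
    # Hypothetical invocation: wrapper.py <dest_path> <app> [app args...]
    sys.exit(run_and_collect(sys.argv[2:], "afile.txt", sys.argv[1]))
```

It would be invoked as, for example, `wrapper.py outdir4/afile.txt t Hello`. The Swift app body would call the wrapper with filename(_appout) ahead of the original `t _instr` arguments. How the wrapper is registered with Swift is not covered here.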
I think what you're asking for is the ability to declare, for each app,
that its temporary "sandbox" working dir gets saved below the current
working dir in which you're running the swift command, and, further, to
be able to name that directory from the source script.

That *might* be a reasonable feature, but it will need more discussion.
I suggest a bugzilla ticket to capture this as a proposed enhancement.

It seems, however, that the current way of doing this, explicitly, is
simple and sufficient for now.

- Mike

--
Michael Wilde
Mathematics and Computer Science          Computation Institute
Argonne National Laboratory               The University of Chicago
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From ketan at mcs.anl.gov Tue Jan 6 13:30:00 2015
From: ketan at mcs.anl.gov (Ketan Maheshwari)
Date: Tue, 6 Jan 2015 13:30:00 -0600
Subject: [Swift-user] Edison: pbs job starts but Swift unresponsive
In-Reply-To:
References: <1419374265.9912.6.camel@echo> <1420317563.5777.0.camel@echo> <1420322000.11251.1.camel@echo> <1420330189.12041.5.camel@echo>
Message-ID:

Thanks, indeed. I had picked the address from the swift log, thinking it
was the one the worker was able to connect to, but maybe that was not the
case. This time I picked the one that, according to the worker log, had
connected, and that worked.
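The comparison described here can be scripted rather than done by hand: it finds which address the worker actually reached and how often it retried. The Python sketch below makes two assumptions. First, it assumes the relevant worker log lines contain dotted IPv4 addresses; apart from the "Trying other addresses" text Ketan greps for, the worker log format is not shown in this thread, so the parsing is a guess. Second, the function name is hypothetical. The function returns three things: the retry count (the same number `grep | wc -l` gives), the last address mentioned (what Mihael suggested checking), and how often each address appears.

```python
import re
from collections import Counter

# Dotted IPv4 address, e.g. "128.55.34.27" out of "128.55.34.27:52189".
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
# Message text taken from the grep in this thread.
RETRY_MARKER = "Trying other addresses"


def summarize_worker_log(lines):
    """Return (retry_count, last_ip, ip_counts) for an iterable of log lines.

    retry_count -- number of lines containing RETRY_MARKER
    last_ip     -- last IPv4 address mentioned anywhere, or None
    ip_counts   -- Counter of how often each IPv4 address appears
    """
    retries = 0
    last_ip = None
    counts = Counter()
    for line in lines:
        if RETRY_MARKER in line:
            retries += 1
        for ip in IPV4.findall(line):
            counts[ip] += 1
            last_ip = ip
    return retries, last_ip, counts
```

For a real log, it could be called as `with open(path) as f: retries, last_ip, counts = summarize_worker_log(f)`. Treat the result as a hint for choosing internalHostname, not as proof that the address is reachable from the compute nodes.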
I found from the worker log that the worker tries too many times to connect, eg: $ grep 'Trying other addresses' /scratch2/scratchdirs/ketan/workerlogs/worker-0106-5710080-000000.log | wc -l 6334 Wondering if it is possible that worker logic can be amended to detect the responsive address sooner. Thanks, Ketan On Tue, Jan 6, 2015 at 12:36 PM, Hategan-Marandiuc, Philip M. < hategan at mcs.anl.gov> wrote: > Well, clearly 128.55.34.27 is not working. You need to find the correct > one. I would suggest a combination of ifconfig and looking at worker > logs without an internalHostname set to see what the last IP tried is. > > Mihael > > On Tue, 2015-01-06 at 12:21 -0600, Ketan Maheshwari wrote: > > So, I tried with this line in sites file but the run crashes with > following > > error messages: > > > > Execution failed: > > Exception in wrf: > > Arguments: [] > > Host: edison2 > > Directory: wf.edison-run006/jobs/i/wrf-iuas5r2m > > exception @ swift-int.k, line: 530 > > Caused by: Block task failed: 0106-1110110-000000 Block task ended > > prematurely > > Application 9450632 exit codes: 101, 111 > > Application 9450632 resources: utime ~25s, stime ~30s, Rss ~8260, > inblocks > > ~425450, outblocks ~28500 > > > > + > > > -------------------------------------------------------------------------- > > + Job name: B0106-1110110-0 > > + Job Id: 2247186.edique02 > > + System: edison > > + Queued Time: Tue Jan 6 10:11:12 2015 > > + Start Time: Tue Jan 6 10:12:20 2015 > > + Completion Time: Tue Jan 6 10:12:32 2015 > > + User: ketan > > + MOM Host: nid02819 > > + Queue: debug > > + Req. 
Resources: > mppnodect=25,mppnppn=24,mppwidth=600,walltime=00:29:00 > > + Used Resources: cput=00:00:00,mem=0kb,vmem=0kb,walltime=00:00:12 > > + Acct String: m1540 > > + PBS_O_WORKDIR: /global/u2/k/ketan/wrf > > + Submit Args: > > /global/u2/k/ketan/wrf/run006/scripts/PBS4838165627827831510.submit > > + > > > -------------------------------------------------------------------------- > > > > > > Failed to connect: Network is unreachable at > > /global/homes/k/ketan/.globus/coasters/cscript3816651147061795773.pl > line > > 1101. > > Failed to connect: Network is unreachable at > > /global/homes/k/ketan/.globus/coasters/cscript3816651147061795773.pl > line > > 1101. > > Failed to connect: Network is unreachable at > > /global/homes/k/ketan/.globus/coasters/cscript3816651147061795773.pl > line > > 1101. > > .... > > .... > > <> > > > > This is a different ip that I found from the previous run's logs in a > line > > like this (due to a different login host) : > > > > 2015-01-06 10:01:27,985-0800 INFO MetaChannel MetaChannel [context: > > worker-6, boundTo: null] binding to TCPChannel [type: server, contact: > > 128.55.34.27:52189] > > > > The rundir is attached. > > > > --Ketan > > > > On Sat, Jan 3, 2015 at 6:09 PM, Mihael Hategan > wrote: > > > > > Ok, so I was looking for problems after the first batch of jobs, but > > > aren't any here. > > > > > > The 9 minute delay is because workers try all IP addresses that the > head > > > node has, and it may take a long time to time-out through all of them > > > until a good one is found. > > > > > > You could force a specific IP address (in your case it's probably > > > 128.55.34.2) using: > > > > > > key="internalHostname">128.55.34.2 > > > > > > Mihael > > > > > > On Sat, 2015-01-03 at 16:29 -0600, Ketan Maheshwari wrote: > > > > Yes, this was a different run. 
> > > > > > > > Here is the run directory and worker log for a fresh run where I see > job > > > in > > > > running stated for ~9 minutes before Swift status shows task active. > > > > > > > > Thanks, > > > > Ketan > > > > > > > > On Sat, Jan 3, 2015 at 3:53 PM, Mihael Hategan > > > wrote: > > > > > > > > > Is this from the same run? I don't see delays between the jobs > > > > > completing and the worker being shut down. Can you also post the > swift > > > > > log that corresponds to this run and confirm that you see the > problem > > > in > > > > > this run? > > > > > > > > > > Mihael > > > > > > > > > > On Sat, 2015-01-03 at 15:22 -0600, Ketan Maheshwari wrote: > > > > > > Please find the workerlog attached. > > > > > > > > > > > > Thanks, > > > > > > Ketan > > > > > > > > > > > > On Sat, Jan 3, 2015 at 2:39 PM, Mihael Hategan < > hategan at mcs.anl.gov> > > > > > wrote: > > > > > > > > > > > > > On Tue, 2014-12-30 at 12:49 -0600, Ketan Maheshwari wrote: > > > > > > > > Hi Mihael, > > > > > > > > > > > > > > > > It takes about 8-9 minutes after the worker starting (ie. > queue > > > > > showing > > > > > > > > running status) that the Swift progress text shows active > > > status. In > > > > > the > > > > > > > > active status, one wave of tasks finishes and the status goes > > > back to > > > > > > > > submit state but now no job shows up in the queue. > > > > > > > > > > > > > > I see the problem, but I'm not sure what causes it. Can you > enable > > > > > > > worker logging and send a worker log? 
> > > > > > > > Mihael > > > > > > > _______________________________________________ > > > > > > > Swift-user mailing list > > > > > > > Swift-user at ci.uchicago.edu > > > > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user From hategan at mcs.anl.gov Tue Jan 6 13:34:33 2015 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 6 Jan 2015 11:34:33 -0800 Subject: [Swift-user] Edison: pbs job starts but Swift unresponsive In-Reply-To: References: <1419374265.9912.6.camel@echo> <1420317563.5777.0.camel@echo> <1420322000.11251.1.camel@echo> <1420330189.12041.5.camel@echo> Message-ID: <1420572873.27285.2.camel@echo> On Tue, 2015-01-06 at 13:30 -0600, Ketan Maheshwari wrote: > Wondering if it is possible that worker logic can be amended to detect the > responsive address sooner. Aside from reducing the socket timeouts, the service could remember what the right address is from what the first wave of workers manage to connect to. But the first wave of workers would still see the problem. Other than that, I don't know. Maybe.
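The delay discussed above comes from workers trying every address of the head node in turn, each failed attempt waiting out a connect timeout. The following is a minimal Python sketch of the "remember the address that worked" idea. It is not Swift or coaster code: the class `AddressProber`, its `timeout` parameter and the candidate list are hypothetical, and in Mihael's proposal the remembering would happen in the service and be handed to later workers rather than living in a single prober object.

```python
import socket
from typing import Iterable, List, Optional, Tuple

Address = Tuple[str, int]


class AddressProber:
    """Illustrative sketch (not Swift/coasters code): try candidate head-node
    addresses in order and remember the first one that accepts a TCP
    connection, so later probes try it first instead of timing out again."""

    def __init__(self, timeout: float = 1.0) -> None:
        self.timeout = timeout              # per-address connect timeout, seconds
        self.known_good: Optional[Address] = None

    def ordered(self, candidates: Iterable[Address]) -> List[Address]:
        """Return the candidates with the remembered address (if any) moved first."""
        cands = list(candidates)
        if self.known_good in cands:
            cands.remove(self.known_good)
            cands.insert(0, self.known_good)
        return cands

    def probe(self, candidates: Iterable[Address]) -> Optional[Address]:
        """Return the first reachable address, or None if none answer.
        Worst case this blocks for about len(candidates) * timeout seconds."""
        for addr in self.ordered(candidates):
            try:
                with socket.create_connection(addr, timeout=self.timeout):
                    pass                    # connected; close immediately
            except OSError:                 # refused, unreachable, or timed out
                continue
            self.known_good = addr
            return addr
        return None
```

With k unreachable addresses ahead of the good one, the first probe can block for roughly k times the timeout; later probes that reuse `known_good` succeed on the first attempt, which matches Mihael's caveat that the first wave of workers would still see the delay.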
Mihael From karthikeyanb at uchicago.edu Tue Jan 6 14:07:36 2015 From: karthikeyanb at uchicago.edu (Karthikeyan Balasubramanian) Date: Tue, 6 Jan 2015 20:07:36 +0000 Subject: [Swift-user] Migrating issues: Swift 0.94 and 0.95 Message-ID: <8CEB97C36B499F4CB2FA1E00DD06E343449CB232@xm-mbx-07-prod.ad.uchicago.edu> Hi, When I ran my swift application in 0.95, the following issues showed up, and the output file ended up empty. However, the same codes ran successfully in 0.94.1, and generated the expected output. Issue-1: [Error] midway_rev.xml, line 1, col 9: cvc-elt.1: Cannot find the declaration of element 'config' I added the namespace as xmlns="http://www.ci.uchicago.edu/swift/SwiftSites". I am not sure this is the right solution, but the error disappeared afterwards. Issue-2: Warning: The @ syntax for function invocation is deprecated This is persistent and the output results are empty. Essentially, the output is generated by Matlab, which is then redirected by the bash script. Is there an updated syntax for this? Thanks. B.K. -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Tue Jan 6 14:13:31 2015 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 6 Jan 2015 12:13:31 -0800 Subject: [Swift-user] Migrating issues: Swift 0.94 and 0.95 In-Reply-To: <8CEB97C36B499F4CB2FA1E00DD06E343449CB232@xm-mbx-07-prod.ad.uchicago.edu> References: <8CEB97C36B499F4CB2FA1E00DD06E343449CB232@xm-mbx-07-prod.ad.uchicago.edu> Message-ID: <1420575211.27803.3.camel@echo> Hi, None of the errors listed below should affect the functioning of your program. The first comes from a change in the way swift parses sites.xml and your solution is right. The second is a polite note that you don't need to put an at sign before a function name. You can simply remove the "@". However, I cannot tell you why the output is empty. It might help if you posted logs from both the 0.94 and 0.95 runs. 
Mihael On Tue, 2015-01-06 at 20:07 +0000, Karthikeyan Balasubramanian wrote: > Hi, > > When I ran my swift application in 0.95, the following issues showed up, and the output file ended up empty. However, the same codes ran successfully in 0.94.1, and generated the expected output. > > Issue-1: [Error] midway_rev.xml, line 1, col 9: cvc-elt.1: Cannot find the declaration of element 'config' > > I added the namespace as xmlns="http://www.ci.uchicago.edu/swift/SwiftSites". I am not sure this is the right solution, but the error disappeared afterwards. > > Issue-2: Warning: The @ syntax for function invocation is deprecated > > This is persistent and the output results are empty. Essentially, the output is generated by Matlab, which is then redirected by the bash script. Is there an updated syntax for this? > > Thanks. > B.K. > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user From ketan at mcs.anl.gov Thu Jan 8 10:53:37 2015 From: ketan at mcs.anl.gov (Ketan Maheshwari) Date: Thu, 8 Jan 2015 10:53:37 -0600 Subject: [Swift-user] output file array In-Reply-To: <54AC33E1.2000701@anl.gov> References: <1419290056.31348.0.camel@echo> <54AC2F34.1050208@anl.gov> <54AC33E1.2000701@anl.gov> Message-ID: On Tue, Jan 6, 2015 at 1:13 PM, Michael Wilde wrote: > > On 1/6/15 1:04 PM, Ketan Maheshwari wrote: > > Hi Mike, > > The app is creating the file but the way the app is invoked, the > filename does not appear in the command-line. It creates the file at the > top level, so just afile.txt. > > However, since there are many calls to the app, to avoid this file being > overwritten, I need to put this file into a separate directory which is why > I am using directory outdirN. 
> > I was thinking if it is possible for Swift runtime to find from the app > definition that the file _appout is expected output similar to stdout and > move the file to the outdirN (again similar to stdout). > > I think what you're asking for is the ability to declare for each app that > its temporary "sandbox" working dir gets saved below the current working > dir in which you're running the swift command. And, further, to be able to > name that directory from the source script. > > That *might* be a reasonable feature, but will need more discussion. I > suggest a bugzilla ticket to capture this as a proposed enhancement. It > seems however that the current way of doing this, explicitly, is simple and > sufficient for now. > I think it is simpler. Consider the following app: (file _appout) app someapp (string _appin){ appcmd _appin; } file appout <"appout.txt">; (appout) = someapp ("hello"); Swift will bring the "appout.txt" from workdir to the current dir if it is produced by app (even without being specified in commandline). So, I am facing a special case of this pattern where the appout file is mapped into a directory instead of at the toplevel: file appout ; Currently, this pattern gets honored for stdout/stderr files, ie. Swift creates the outdirN and puts the stdout/stderr files into it. > - Mike > > > Thanks, > Ketan > > On Tue, Jan 6, 2015 at 12:53 PM, Michael Wilde wrote: > >> Is your app touch_app( ) correctly creating output files of the form >> outdirN/afile.txt? >> >> From the error message, I suspect that it is not. >> >> Your app is declared as: >> >> app (file _stdout, file _stderr, file _appout) touch_app(string _instr){ >> t _instr stdout=@_stdout stderr=@_stderr; >> } >> >> >> You need to pass filename(_appout) to the app, via its command line, so >> that it knows the correct output filename to create. Then you need to >> ensure that the app does indeed create that file. 
>> >> - Mike >> >> >> On 1/6/15 11:23 AM, Ketan Maheshwari wrote: >> >> Trying trunk for this pattern. >> >> A toy application invoked over a foreach loop that creates an output >> file, a stdout file and a stderr file. >> >> The files are mapped into an output directory named with the loop index >> as suffix so that the files do not get overwritten: >> >> foreach i in [0:9]{ >> >> file out; >> file err; >> file appout; >> >> (out, err, appout) = touch_app("Hello"); >> } >> >> >> The stdout and stderr files correctly end up in their respective >> directories but the app-generated file does not. >> >> I see the following error message: >> >> Execution failed: >> Exception in t: >> Arguments: [Hello] >> Host: edison1 >> Directory: touchafile-run001/jobs/t/t-ffv33r2m >> exception @ swift-int-staging.k, line: 165 >> Caused by: The following output files were not created by the >> application: outdir4/afile.txt >> >> Any suggestions for fixing this? >> >> Attached is the test directory with sources and executable with rundir. >> >> Thanks, >> Ketan >> >> >> On Mon, Dec 22, 2014 at 5:14 PM, Mihael Hategan >> wrote: >> >>> Hi, >>> >>> I don't think 0.95 supports dynamic arrays output from apps. You will >>> need trunk/0.96 for that. >>> >>> Mihael >>> >>> On Mon, 2014-12-22 at 15:37 -0600, Ketan Maheshwari wrote: >>> > Hi Mihael, >>> > >>> > This is with Swift 0.95. >>> > >>> > Thanks, >>> > Ketan >>> > >>> > On Sun, Dec 21, 2014 at 2:40 PM, Hategan-Marandiuc, Philip M. < >>> > hategan at mcs.anl.gov> wrote: >>> > >>> > > Hi Ketan, >>> > > >>> > > Sorry for the delay. Is this trunk or 0.95? >>> > > >>> > > Mihael >>> > > >>> > > On Wed, 2014-12-17 at 14:26 -0600, Ketan Maheshwari wrote: >>> > > > Hi, >>> > > > >>> > > > I am dealing with a workflow pattern where an app expects multiple >>> output >>> > > > files with a pattern.
>>> > > > >>> > > > The app signature is: >>> > > > >>> > > > app (file[] _wrfout, file _out, file _err) wrf_app (file _wrf_in, >>> file[] >>> > > > _tbl, file[] _ozone, ...) >>> > > > { >>> > > > wrf stdout=@_out stderr=@_err; >>> > > > } >>> > > > >>> > > > The _wrfout files are the app result files which follows a pattern: >>> > > wrfout_* >>> > > > >>> > > > So, I am invoking the application in a foreach loop as: >>> > > > >>> > > > foreach i in [0:2]{ >>> > > > file[] wrfout>> > > > pattern="wrfout_*">; >>> > > > file wrfstdout>> > > "/std.out")>; >>> > > > file wrfstderr>> > > "/std.err")>; >>> > > > >>> > > > (wrfout, wrfstdout, wrfstderr) = wrf_app (wrfin, tbl, ozone, >>> tr, data, >>> > > > gribmap, namelist, co2_trans, input_sounding); >>> > > > } >>> > > > >>> > > > The script hangs at runtime with the following messages: >>> > > > >>> > > > No events in 1s. >>> > > > Finding dependency loops... >>> > > > >>> > > > Waiting threads: >>> > > > Thread: R-6-0-4, waiting on wrfout (declared on line 50) >>> > > > swift:stageOut, wf.edison, line 134 >>> > > > swift:execute, wf.edison, line 123 >>> > > > wrf_app, wf.edison, line 242 >>> > > > >>> > > > Thread: R-6-2-4, waiting on wrfout (declared on line 50) >>> > > > swift:stageOut, wf.edison, line 134 >>> > > > swift:execute, wf.edison, line 123 >>> > > > wrf_app, wf.edison, line 242 >>> > > > >>> > > > Thread: R-6-1-4, waiting on wrfout (declared on line 50) >>> > > > swift:stageOut, wf.edison, line 134 >>> > > > swift:execute, wf.edison, line 123 >>> > > > wrf_app, wf.edison, line 242 >>> > > > >>> > > > Any suggestions? 
>>> > > > >>> > > > Thanks, >>> > > > Ketan >> -- >> Michael Wilde >> Mathematics and Computer Science Computation Institute >> Argonne National Laboratory The University of Chicago > -- > Michael Wilde > Mathematics and Computer Science Computation Institute > Argonne National Laboratory The University of Chicago From hategan at mcs.anl.gov Thu Jan 8 12:00:12 2015 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 8 Jan 2015 10:00:12 -0800 Subject: [Swift-user] output file array In-Reply-To: References: <1419290056.31348.0.camel@echo> <54AC2F34.1050208@anl.gov> <54AC33E1.2000701@anl.gov> Message-ID: <1420740012.14377.9.camel@echo> On Thu, 2015-01-08 at 10:53 -0600, Ketan Maheshwari wrote: > file appout ; > > Currently, this pattern gets honored for stdout/stderr files, ie. Swift > creates the outdirN and puts the stdout/stderr files into it. Not automagically!
You pass the stdout/stderr file names to the app command line using stdout= and stderr=. So in all cases swift honors what you tell it to honor. The question is what it is that you really want here:
- is it to automatically create multiple versions of files when an app specifies the same output file and there are multiple invocations?
- or is it for swift to ignore the directory in the remote app output when staging out files (but not locally)?
Mihael From kshldey at gmail.com Fri Jan 9 10:19:07 2015 From: kshldey at gmail.com (kushal kumar dey) Date: Fri, 9 Jan 2015 10:19:07 -0600 Subject: [Swift-user] Query regarding optimal configuration in Swift for running code in Midway. Message-ID: Hi, I am a graduate student at the University of Chicago, and I am currently trying to run a parallel script using Swift-lang on the Midway cluster. I need your help choosing the optimal configuration to run my program. I am running 100 parallel scripts at a time on the "bigmem" machine, and each task takes around 5 hours to complete. The Swift configuration file (Swift is the program I am using to run these parallel scripts) asks me to set the following:
max Nodes per Job:
maxJobs:
tasks per Node:
max Job time: (I am currently setting this to 36 hours)
max Wall time: (I am setting this to 34 hours right now)
max Parallel Task:
Initial Parallel Task:
Can you please let me know what numbers I should put in these options so that I can run my 100 jobs in parallel optimally and, given that each of these tasks takes around 5 hours, whether it would be possible to complete all 100 tasks in 36 hours? If you could give me the optimal configuration, or at least some intuition about how to set it, it would be really helpful. This is the first time I am running jobs in parallel, so I have very little idea about it. Thanks a lot, Kushal -------------- next part -------------- An HTML attachment was scrubbed...
URL: From ketan at mcs.anl.gov Fri Jan 9 11:17:48 2015 From: ketan at mcs.anl.gov (Ketan Maheshwari) Date: Fri, 9 Jan 2015 11:17:48 -0600 Subject: [Swift-user] output file array In-Reply-To: <5d783ff86d6743609722464d64d9c9fd@LUCKMAN.anl.gov> References: <1419290056.31348.0.camel@echo> <54AC2F34.1050208@anl.gov> <54AC33E1.2000701@anl.gov> <5d783ff86d6743609722464d64d9c9fd@LUCKMAN.anl.gov> Message-ID: On Thu, Jan 8, 2015 at 12:00 PM, Hategan-Marandiuc, Philip M. < hategan at mcs.anl.gov> wrote: > On Thu, 2015-01-08 at 10:53 -0600, Ketan Maheshwari wrote: > > > file appout "/appout.txt")>; > > > > Currently, this pattern gets honored for stdout/stderr files, ie. Swift > > creates the outdirN and puts the stdout/stderr files into it. > > Not automagically! You pass the stdout/stderr file names to the app > command line using stdout= and stderr=. > True. However, the scheme is not quite consistent due to the following behavior: -- When specifying stdout and stderr in command line, I do not specify them in directories. Yet, if I map them in non-existent directories, Swift will create those directories and put the stdout/stderr files in them. This does not happen with other files even if they are specified in command line. -- If a file is specified as output file in app definition but does not appear in the command line, Swift will still bring it in as an output. Again, this will happen only if the file is mapped at top level and not in any directory. > > So in all cases swift honors what you tell it to honor. The question is > what is it that you really want here: > - is it to automatically create multiple versions of files when an app > specifies the same output file and there are multiple invocations? > Yes. The app specifies it alright but the executable does not specify it on command line. > - or is it for swift to ignore the directory in the remote app output > when staging out files (but not locally)? 
> > Mihael > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Fri Jan 9 12:37:23 2015 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 9 Jan 2015 10:37:23 -0800 Subject: [Swift-user] output file array In-Reply-To: References: <1419290056.31348.0.camel@echo> <54AC2F34.1050208@anl.gov> <54AC33E1.2000701@anl.gov> <5d783ff86d6743609722464d64d9c9fd@LUCKMAN.anl.gov> Message-ID: <1420828643.3215.12.camel@echo> Hi Ketan, On Fri, 2015-01-09 at 11:17 -0600, Ketan Maheshwari wrote: [...] > > > > Not automagically! You pass the stdout/stderr file names to the app > > command line using stdout= and stderr=. > > > > True. However, the scheme is not quite consistent due to the following > behavior: > > -- When specifying stdout and stderr in command line, I do not specify them > in directories. Yet, if I map them in non-existent directories, Swift will > create those directories and put the stdout/stderr files in them. This does > not happen with other files even if they are specified in command line. I do not think that is true. Swift creates directory structures for all input and output files, whether they are specified on the command line or not. If this isn't working, then we are talking about a bug, but the code was designed to do this from the start. I did a quick check and it seems to work fine for both 0.95 and trunk. > > -- If a file is specified as output file in app definition but does not > appear in the command line, Swift will still bring it in as an output. > Again, this will happen only if the file is mapped at top level and not in > any directory. I do not believe that to be true either. Swift stages in all files that are parameters to and app and stages out all files that are returns from an app. There is no code to treat files differently based on whether they are in sub-directories or not. Again, if this isn't working on a specific version of the code, we may be looking at a bug. 
I checked this, too, and it works on 0.95 and trunk with the simple configurations I tried. > > > > > So in all cases swift honors what you tell it to honor. The question is > > what is it that you really want here: > > - is it to automatically create multiple versions of files when an app > > specifies the same output file and there are multiple invocations? > > > > Yes. The app specifies it alright but the executable does not specify it on > command line. I am not sure how that answers the question above. What I meant was this: foreach i in [0:10] { file outf <"out.txt">; outf = echo(i); } where echo(i) is an app that returns a file. The question was whether you are suggesting that the code above should be valid code. It is not currently because it violates the local consistency rule. In other words, writing the equivalent of this in a shell script will result in out.txt being overwritten, and that results in nondeterminism based on exactly how the echo() invocations are ordered. > > - or is it for swift to ignore the directory in the remote app output > > when staging out files (but not locally)? Mihae From ketan at mcs.anl.gov Fri Jan 9 13:17:27 2015 From: ketan at mcs.anl.gov (Ketan Maheshwari) Date: Fri, 9 Jan 2015 13:17:27 -0600 Subject: [Swift-user] output file array In-Reply-To: <1420828643.3215.12.camel@echo> References: <1419290056.31348.0.camel@echo> <54AC2F34.1050208@anl.gov> <54AC33E1.2000701@anl.gov> <5d783ff86d6743609722464d64d9c9fd@LUCKMAN.anl.gov> <1420828643.3215.12.camel@echo> Message-ID: On Fri, Jan 9, 2015 at 12:37 PM, Mihael Hategan wrote: > Hi Ketan, > > On Fri, 2015-01-09 at 11:17 -0600, Ketan Maheshwari wrote: > [...] > > > > > > Not automagically! You pass the stdout/stderr file names to the app > > > command line using stdout= and stderr=. > > > > > > > True. 
However, the scheme is not quite consistent due to the following > > behavior: > > > > -- When specifying stdout and stderr in command line, I do not specify > them > > in directories. Yet, if I map them in non-existent directories, Swift > will > > create those directories and put the stdout/stderr files in them. This > does > > not happen with other files even if they are specified in command line. > > I do not think that is true. Swift creates directory structures for all > input and output files, whether they are specified on the command line > or not. If this isn't working, then we are talking about a bug, but the > code was designed to do this from the start. I did a quick check and it > seems to work fine for both 0.95 and trunk. > Maybe it is a bug. See the attached tarball, which has example cases of things that work and things that do not. touchafile1.swift illustrates the point above. > > > > > -- If a file is specified as output file in app definition but does not > > appear in the command line, Swift will still bring it in as an output. > > Again, this will happen only if the file is mapped at top level and not > in > > any directory. > > I do not believe that to be true either. Swift stages in all files that > are parameters to and app and stages out all files that are returns from > an app. There is no code to treat files differently based on whether > they are in sub-directories or not. Again, if this isn't working on a > specific version of the code, we may be looking at a bug. I checked > this, too, and it works on 0.95 and trunk with the simple configurations > I tried. > touchafile2.swift illustrates this point. > > > > > > > > > So in all cases swift honors what you tell it to honor. The question is > > > what is it that you really want here: > > > - is it to automatically create multiple versions of files when an app > > > specifies the same output file and there are multiple invocations? > > > > > > > Yes.
The app specifies it alright but the executable does not specify it > on > > command line. > > I am not sure how that answers the question above. What I meant was > this: > > foreach i in [0:10] { > file outf <"out.txt">; > outf = echo(i); > } > > where echo(i) is an app that returns a file. > > The question was whether you are suggesting that the code above should > be valid code. It is not currently because it violates the local > consistency rule. In other words, writing the equivalent of this in a > shell script will result in out.txt being overwritten, and that results > in nondeterminism based on exactly how the echo() invocations are > ordered. > I understand this is not a valid code because successive echo invocations will overwrite the out.txt. This is why I want it to be like this: foreach i in [0:9]{ file outf ; outf = echo(i); } > > > > - or is it for swift to ignore the directory in the remote app output > > > when staging out files (but not locally)? > > Mihae > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: touchafile.tgz Type: application/x-gzip Size: 1066 bytes Desc: not available URL: From hategan at mcs.anl.gov Fri Jan 9 14:20:43 2015 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 9 Jan 2015 12:20:43 -0800 Subject: [Swift-user] output file array In-Reply-To: References: <1419290056.31348.0.camel@echo> <54AC2F34.1050208@anl.gov> <54AC33E1.2000701@anl.gov> <5d783ff86d6743609722464d64d9c9fd@LUCKMAN.anl.gov> <1420828643.3215.12.camel@echo> Message-ID: <1420834843.5057.15.camel@echo> On Fri, 2015-01-09 at 13:17 -0600, Ketan Maheshwari wrote: > > > > I do not think that is true. 
Swift creates directory structures for all > > input and output files, whether they are specified on the command line > > or not. If this isn't working, then we are talking about a bug, but the > > code was designed to do this from the start. I did a quick check and it > > seems to work fine for both 0.95 and trunk. > > > > May be it is a bug. See the attached tarball which have example cases of > things that work and does not. touchafile1.swift illustrates this above > point. > I think it works as expected. You have an app that creates a file and you are specifying that swift should stage out another file that doesn't exist. I'll simplify this a bit, but this is essentially what your first example does: - the executable creates a file, "a.txt" - you have the following swift code: app (file outf) runapp() { myapp; } - the following swift script works: file outf <"a.txt">; outf = runapp(); - the following swift script does not: file outf <"dir/a.txt">; outf = runapp(); The second version is not supposed to work. If you abstract away the notion of directories and give no meaning to "/" as a path separator, your second example is equivalent to saying: file outf <"dir_a.txt">, whereas your application produces "a.txt". In other words your application produces a file, "a.txt" and you are asking swift to stage out a different file, and whether that is "dir_a.txt" or "dir/a.txt" makes no difference. So I think that we are setting up an artificially difficult situation here and then ending up with long debates that get lost in the details. So I suggest that you either change your app to accept file names for outputs or wrap it into a shell script that does that. That wrapper can be tested outside of swift and does not require exact matching like above. I won't go into the second example. It is a variation on the same theme. 
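Mihael's second suggestion, wrapping the application so that the output file name comes from the command line, could look roughly like this. Mihael proposes a shell script; the same idea is sketched here in Python as a hedged illustration, not part of Swift. The script name `wrap.py`, the function `run_and_rename` and its argument order are assumptions. It runs the real executable, then moves the fixed-name file it writes (for example `a.txt`) to whatever path Swift mapped for the output, creating directories such as `outdir3/` as needed.

```python
import os
import shutil
import subprocess
import sys


def run_and_rename(cmd, produced_name, target_path, cwd="."):
    """Run `cmd` in `cwd`, then move the fixed-name file it writes
    (`produced_name`) to `target_path`, creating parent directories.
    Returns 0 on success, the command's exit code if it failed, or 1 if
    the expected file was not produced."""
    rc = subprocess.call(cmd, cwd=cwd)
    if rc != 0:
        return rc
    produced = os.path.join(cwd, produced_name)
    if not os.path.exists(produced):
        print(f"wrapper: {produced_name} was not created", file=sys.stderr)
        return 1
    target = os.path.join(cwd, target_path)
    os.makedirs(os.path.dirname(target) or ".", exist_ok=True)
    shutil.move(produced, target)
    return 0


if __name__ == "__main__":
    # Hypothetical usage: wrap.py TARGET_PATH PRODUCED_NAME COMMAND [ARGS...]
    target_path, produced_name, *command = sys.argv[1:]
    sys.exit(run_and_rename(command, produced_name, target_path))
```

On the Swift side, the app declaration would then put the mapped name on the command line ahead of the real executable, for example via `filename(outf)` as Mike suggests earlier in the thread, so each loop iteration writes to its own mapped path. As Mihael notes, such a wrapper can be tested on its own, outside Swift.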
Mihael From yadudoc1729 at gmail.com Tue Jan 13 13:03:19 2015 From: yadudoc1729 at gmail.com (Yadu Nand) Date: Tue, 13 Jan 2015 13:03:19 -0600 Subject: [Swift-user] Query regarding optimal configuration in Swift for running code in Midway. In-Reply-To: References: Message-ID: Hi Kushal, In the discussion we had off-list, I'd recommended the following changes:
* Set maxWallTime to ~9 hours and maxJobTime to ~36 hours.
* Set initialParallelTasks and maxParallelTasks to the number of app invocations needed.
* Since the application does not use multiple cores, set tasksPerNode to the number of cores on the target machines.
* Add pools for other queues (westmere, sandyb, and bigmem).
* maxNodesPerJob=1
Were you able to complete the runs with these changes? Thanks, Yadu On Fri, Jan 9, 2015 at 10:19 AM, kushal kumar dey wrote: > Hi, > > I am a graduate student in the University of Chicago and I am currently > trying to run a parallel script using Swift-lang on the Midway cluster. I > need your help regarding optimally choosing the configuration to run my > program. I am running 100 parallel scripts at a time in the "bigmem" > machine, and each task takes around 5 hours to complete. I have been > asked in the Swift Configuration file (the program I am using to run these > parallel scripts) to set the following in the configuration file > > max Nodes per Job : > maxJobs: > tasks per Node: > max Job time: (This I am setting as 36 hours currently) > max Wall time : (this i am setting as 34 hours right now) > max Parallel Task > Initial Parallel Task > > Can you please let me know what numbers I should be putting in these > options so that I can optimally run my 100 jobs in parallel and given that > each of these tasks takes around 5 hours, whether it would be possible to > complete all the 100 tasks in 36 hours?
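Whether 100 tasks of about 5 hours each fit into a 36-hour job is mostly arithmetic on concurrent slots. The Python sketch below makes that explicit under idealised assumptions: equal task lengths, single-core tasks (as Yadu notes), and no queue wait or staging time. The function names and the figure of 16 cores per node are illustrative assumptions, not Midway or Swift settings.

```python
import math


def makespan_hours(n_tasks, task_hours, nodes, tasks_per_node):
    """Idealised time to finish n_tasks equal-length, single-core tasks when
    nodes * tasks_per_node of them run at once (ignores queue wait,
    staging, and failures)."""
    slots = nodes * tasks_per_node
    waves = math.ceil(n_tasks / slots)      # rounds of concurrent tasks
    return waves * task_hours


def min_nodes(n_tasks, task_hours, tasks_per_node, deadline_hours):
    """Smallest node count whose idealised makespan fits the deadline."""
    max_waves = int(deadline_hours // task_hours)
    if max_waves < 1:
        raise ValueError("a single task is longer than the deadline")
    slots_needed = math.ceil(n_tasks / max_waves)
    return math.ceil(slots_needed / tasks_per_node)


# Kushal's numbers: 100 tasks of ~5 h each, 36 h budget. The 16 cores per
# node is an assumed value for illustration, not a Midway spec.
print(min_nodes(100, 5, 16, 36))            # → 1
print(makespan_hours(100, 5, 2, 16))        # → 20
```

With one such node, 16 tasks run at a time, so 100 tasks need 7 waves (about 35 hours), which only just fits a 36-hour maxJobTime; two nodes bring it down to 4 waves (about 20 hours). The per-task maxWallTime is a separate estimate/ceiling for each app invocation, and Yadu's ~9 hours leaves headroom over the ~5-hour runtime.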
> > If you could give me the optimal configuration or at least give some > intuition about how to fix it, it would be really helpful..This is the > first time I am running jobs in parallel, so I have very little idea about > it... > > > Thanks a lot > > Kushal > > > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > -- Yadu Nand B -------------- next part -------------- An HTML attachment was scrubbed... URL: From karthikeyanb at uchicago.edu Fri Jan 23 19:47:43 2015 From: karthikeyanb at uchicago.edu (Karthikeyan Balasubramanian) Date: Sat, 24 Jan 2015 01:47:43 +0000 Subject: [Swift-user] Swift-Java error: Block task failed: Connection to worker lost Message-ID: <8CEB97C36B499F4CB2FA1E00DD06E343449D0950@xm-mbx-07-prod.ad.uchicago.edu> Hi, I am encountering the following error and subsequently Swift fails. The code runs and generates results for a while and then throws up this error. Browsing the swift-user forum, it appears that the issue is not completely new. But, I am unable to identify a suitable solution. Caused by: Block task failed: Connection to worker lost java.io.IOException: Broken pipe at sun.nio.ch.FileDispatcherImpl.write0(Native Method) at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) at sun.nio.ch.IOUtil.write(IOUtil.java:65) at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:487) at org.globus.cog.karajan.workflow.service.channels.NIOSender.write(NIOSender.java:168) at org.globus.cog.karajan.workflow.service.channels.NIOSender.run(NIOSender.java:133) Thanks, B.K. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From yadudoc1729 at gmail.com Thu Jan 29 15:35:38 2015 From: yadudoc1729 at gmail.com (Yadu Nand) Date: Thu, 29 Jan 2015 15:35:38 -0600 Subject: [Swift-user] Swift-Java error: Block task failed: Connection to worker lost In-Reply-To: <8CEB97C36B499F4CB2FA1E00DD06E343449D0950@xm-mbx-07-prod.ad.uchicago.edu> References: <8CEB97C36B499F4CB2FA1E00DD06E343449D0950@xm-mbx-07-prod.ad.uchicago.edu> Message-ID: Hi Karthikeyan, Could you send us the runNNN folder from the execution. If you are running swift-0.94, sending a tarball of all the logs from the run would help. Thanks. Yadu On Fri, Jan 23, 2015 at 7:47 PM, Karthikeyan Balasubramanian < karthikeyanb at uchicago.edu> wrote: > Hi, > > I am encountering the following error and subsequently Swift fails. The > code runs and generates results for a while and then throws up this error. > Browsing the swift-user forum, it appears that the issue is not completely > new. But, I am unable to identify a suitable solution. > > Caused by: > Block task failed: Connection to worker lost > java.io.IOException: Broken pipe > at sun.nio.ch.FileDispatcherImpl.write0(Native Method) > at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) > at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) > at sun.nio.ch.IOUtil.write(IOUtil.java:65) > at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:487) > at > org.globus.cog.karajan.workflow.service.channels.NIOSender.write(NIOSender.java:168) > at > org.globus.cog.karajan.workflow.service.channels.NIOSender.run(NIOSender.java:133) > > Thanks, > B.K. > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > -- Yadu Nand B -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From ketan at mcs.anl.gov Fri Jan 30 10:52:30 2015 From: ketan at mcs.anl.gov (Ketan Maheshwari) Date: Fri, 30 Jan 2015 10:52:30 -0600 Subject: [Swift-user] how to keep work dir Message-ID: Hi, In an application run, I need to preserve the workdir after the run has completed successfully. I tried the following options, but none worked:
In config:
sitedir.keep=true
In sites.xml:
true
Any suggestions? Thanks, Ketan From ketan at mcs.anl.gov Sat Jan 31 11:44:14 2015 From: ketan at mcs.anl.gov (Ketan Maheshwari) Date: Sat, 31 Jan 2015 11:44:14 -0600 Subject: [Swift-user] walltime in sites does not match the walltime of submitted job Message-ID: Hi, On Blues, I submit a script with the following sites spec: 03:40:00 1 1 2 single 16 16 2.20 10000 /home/ketan/swift.workdir The maxWalltime is 3 hours 40 minutes. However, the job that gets submitted requests 11 hours of walltime:
$ qstat
bmgt1.lcrc.anl.gov:
                                                            Req'd  Req'd   Elap
Job ID         Username Queue    Jobname    SessID NDS TSKS Memory Time  S Time
-------------- -------- -------- ---------- ------ --- ---- ------ ----- - -----
851234.bmgt1.l ketan    batch    B0131-3911     --  16   32     -- 11:00 Q    --
Any clues? Thanks, Ketan From wilde at anl.gov Sat Jan 31 11:50:19 2015 From: wilde at anl.gov (Michael Wilde) Date: Sat, 31 Jan 2015 11:50:19 -0600 Subject: [Swift-user] walltime in sites does not match the walltime of submitted job In-Reply-To: References: Message-ID: <54CD15DB.7060401@anl.gov> MaxWallTime is the estimate/ceiling you provide Swift for your app tasks for this site. I think that if you don't specify a MaxTime (which puts an upper bound on the wall time of the scheduler jobs which Swift/coasters will submit) then Swift uses what's in its ready-to-run task queue to determine the scheduler job walltime. If you want shorter jobs (e.g.
to not wait so long in the queue) then you should un-comment your maxtime tag in the sites entry and set it to what you want to see. Be aware that you need to consider what your likely app task time will really be as you try to balance these two time limits. - Mike On 1/31/15 11:44 AM, Ketan Maheshwari wrote: > Hi, > > On Blues, I submit a script with the following sites spec: > > > > > > > > 03:40:00 > 1 > 1 > 2 > single > 16 > 16 > 2.20 > 10000 > > /home/ketan/swift.workdir > > > The maxWalltime being 3 hours 40 minutes. However, the job that gets > submitted is with 11 hours of walltime: > > $ qstat > > bmgt1.lcrc.anl.gov : > Req'd Req'd Elap > Job ID Username Queue Jobname SessID NDS TSKS Memory > Time S Time > -------------- -------- -------- ---------- ------ --- ---- ------ > ----- - ----- > 851234.bmgt1.l ketan batch B0131-3911 -- 16 32 -- 11:00 > Q -- > > Any clues? > > Thanks, > Ketan > > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user -- Michael Wilde Mathematics and Computer Science Computation Institute Argonne National Laboratory The University of Chicago -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Sat Jan 31 15:02:04 2015 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sat, 31 Jan 2015 13:02:04 -0800 Subject: [Swift-user] walltime in sites does not match the walltime of submitted job In-Reply-To: <54CD15DB.7060401@anl.gov> References: <54CD15DB.7060401@anl.gov> Message-ID: <1422738124.8326.2.camel@echo> ... although I do not see how 11 hours came into the picture. For a job that size and with the default settings, the coaster job should have at most twice the walltime of the actual job. Ketan, can you send the swift log from this run? 
Mihael

On Sat, 2015-01-31 at 11:50 -0600, Michael Wilde wrote:
> MaxWallTime is the estimate/ceiling you provide Swift for your app tasks
> for this site.
>
> I think that if you don't specify a MaxTime (which puts an upper bound
> on the wall time of the scheduler jobs which Swift/coasters will submit)
> then Swift uses what's in its ready-to-run task queue to determine the
> scheduler job walltime.
>
> If you want shorter jobs (e.g. to not wait so long in the queue) then
> you should un-comment your maxtime tag in the sites entry and set it to
> what you want to see. Be aware that you need to consider what your
> likely app task time will really be as you try to balance these two time
> limits.
>
> - Mike
>
> On 1/31/15 11:44 AM, Ketan Maheshwari wrote:
> > Hi,
> >
> > On Blues, I submit a script with the following sites spec:
> >
> > 03:40:00
> > 1
> > 1
> > 2
> > single
> > 16
> > 16
> > 2.20
> > 10000
> > /home/ketan/swift.workdir
> >
> > The maxWalltime being 3 hours 40 minutes. However, the job that gets
> > submitted is with 11 hours of walltime:
> >
> > $ qstat
> >
> > bmgt1.lcrc.anl.gov:
> >                                                              Req'd  Req'd   Elap
> > Job ID          Username Queue    Jobname    SessID NDS TSKS Memory Time  S Time
> > --------------- -------- -------- ---------- ------ --- ---- ------ ----- - -----
> > 851234.bmgt1.l  ketan    batch    B0131-3911     --  16   32     -- 11:00 Q    --
> >
> > Any clues?
> >
> > Thanks,
> > Ketan
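The walltime numbers in this thread can be checked with a few lines of arithmetic. This is only an illustrative sketch: it assumes Mihael's rule of thumb that, with default settings, a coaster block job should request at most twice the app task's maxWalltime; it takes the 03:40:00 maxWalltime and the 11:00 (HH:MM) qstat request from Ketan's report; and the helper names `to_seconds` and `to_hms` are hypothetical, not part of Swift.

```python
# Illustrative check of the walltimes quoted in this thread.
# Assumption: the "at most twice the task walltime" rule of thumb from
# Mihael's reply; this is NOT Swift's actual block-sizing code.


def to_seconds(walltime: str) -> int:
    """Convert 'HH:MM' or 'HH:MM:SS' to seconds (missing seconds count as 0)."""
    parts = [int(p) for p in walltime.split(":")]
    while len(parts) < 3:
        parts.append(0)
    hours, minutes, seconds = parts
    return hours * 3600 + minutes * 60 + seconds


def to_hms(total: int) -> str:
    """Format a number of seconds as HH:MM:SS."""
    hours, rem = divmod(total, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


max_walltime = "03:40:00"  # maxWalltime from the sites entry
requested = "11:00"        # Req'd Time column reported by qstat (HH:MM)

ceiling = 2 * to_seconds(max_walltime)  # assumed upper bound for the block job
observed = to_seconds(requested)

print("expected ceiling:", to_hms(ceiling))
print("requested       :", to_hms(observed))
print("over ceiling    :", observed > ceiling)
```

Under that assumption the expected ceiling is 07:20:00, so an 11:00:00 request is well above it, which is why the Swift log is needed to see where the larger value came from.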