From hategan at mcs.anl.gov Sat Oct 1 19:19:58 2011
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sat, 01 Oct 2011 17:19:58 -0700
Subject: [Swift-devel] persistent coasters and data staging
In-Reply-To:
References: <1315853779.19354.5.camel@blabla>
 <1315873153.2945.0.camel@blabla>
Message-ID: <1317514798.577.1.camel@blabla>

This should be fixed now in cog r3293.

There were two deadlocks: one that hung stage-ins and one that applied
to stageouts. These were only apparent when all the I/O buffers got
used, so only with relatively large staging activity.

Please test.

Mihael

On Wed, 2011-09-21 at 16:24 -0500, Ketan Maheshwari wrote:
> Hi Mihael,
>
> I tested this fix. It seems that the timeout issue for large-ish data
> and throttle > ~30 persists. I am not sure if this is data staging
> timeout though.
>
> The setup that fails is as follows:
>
> persistent coasters, resource= workers running on OSG
> data size=8MB, 100 data items.
> foreach throttle=40=jobthrottle.
>
> The standard output intermittently shows some activity and then goes
> back to no activity, without any progress on tasks.
>
> Please find the log and stdout/stderr here:
> http://www.ci.uchicago.edu/~ketan/coaster-lab/std.out.err
> http://www.ci.uchicago.edu/~ketan/coaster-lab/catsn-20110921-1535-v0t3gcg5.log
>
> When I tested with small data, 1MB, 2MB, 4MB, it did work. 4MB
> displayed a fat tail behavior though: ~94 tasks completed steadily
> and quickly, while the last 5-6 tasks took disproportionately long.
> The throttle in these cases was <= 30.
>
> Regards,
> Ketan
>
> On Mon, Sep 12, 2011 at 7:19 PM, Mihael Hategan wrote:
> Try now please (cog r3262).
>
> On Mon, 2011-09-12 at 15:56 -0500, Ketan Maheshwari wrote:
> > Mihael,
> >
> > I tried with the new worker.pl, running a 100 task 10MB per task run
> > with throttle set at 100.
> >
> > However, it seems to have failed with the same symptoms of timeout
> > error 521:
> >
> > Caused by: null
> > Caused by:
> > org.globus.cog.abstraction.impl.common.execution.JobException: Job
> > failed with an exit code of 521
> > Progress: time: Mon, 12 Sep 2011 15:45:31 -0500 Submitted:53 Active:1 Failed:46
> > Progress: time: Mon, 12 Sep 2011 15:45:34 -0500 Submitted:53 Active:1 Failed:46
> > Exception in cat:
> > Arguments: [gpfs/pads/swift/ketan/indir10/data0002.txt]
> > Host: grid
> > Directory: catsn-20110912-1521-8jh2gar4/jobs/u/cat-u18visfk
> > - - -
> >
> > Caused by: null
> > Caused by:
> > org.globus.cog.abstraction.impl.common.execution.JobException: Job
> > failed with an exit code of 521
> > Progress: time: Mon, 12 Sep 2011 15:45:45 -0500 Submitted:52 Active:1 Failed:47
> > Exception in cat:
> > Arguments: [gpfs/pads/swift/ketan/indir10/data0014.txt]
> > Host: grid
> > Directory: catsn-20110912-1521-8jh2gar4/jobs/x/cat-x18visfk
> >
> > I had about 107 workers running at the time of these failures.
> >
> > I started seeing the failure messages about 20 minutes into this run.
> >
> > The logs are in http://www.ci.uchicago.edu/~ketan/pack.tgz
> >
> > Regards,
> > Ketan
> >
> > On Mon, Sep 12, 2011 at 1:56 PM, Mihael Hategan wrote:
> > On Mon, 2011-09-12 at 11:58 -0500, Ketan Maheshwari wrote:
> > > After some discussion with Mike, our conclusion from these runs was
> > > that the parallel data transfers are causing timeouts from
> > > worker.pl. Further, we were undecided whether the timeout threshold
> > > is set too aggressively, how it is determined, and whether a change
> > > in that value could resolve the issue.
> >
> > Something like that. Worker.pl would use the time when a file transfer
> > started to determine timeouts. This is undesirable.
> > The purpose of timeouts is to determine whether the other side has
> > stopped from properly following the flow of things. It follows that
> > any kind of activity should reset the timeout... timer.
> >
> > I updated the worker code to deal with the issue in a proper way. But
> > now I need your help. This is perl code, and it needs testing.
> >
> > So can you re-run, first with some simple test that uses coaster
> > staging (just to make sure I didn't mess something up), and then the
> > version of your tests that was most likely to fail?
> >
> > --
> > Ketan
>
> --
> Ketan

From hategan at mcs.anl.gov Sun Oct 2 04:38:30 2011
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sun, 02 Oct 2011 02:38:30 -0700
Subject: [Swift-devel] persistent coasters and data staging
In-Reply-To: <1317514798.577.1.camel@blabla>
References: <1315853779.19354.5.camel@blabla>
 <1315873153.2945.0.camel@blabla>
 <1317514798.577.1.camel@blabla>
Message-ID: <1317548310.5297.2.camel@blabla>

I might have spoken a bit too soon there. There's still a timeout, but
it occurs at higher loads during stageout. That's with proxy mode, so
local (file) mode (i.e. what you should be using on OSG with the service
running on the client node) may not necessarily show the same problem.

On Sat, 2011-10-01 at 17:19 -0700, Mihael Hategan wrote:
> This should be fixed now in cog r3293.
>
> There were two deadlocks. One that hung stage-ins and one that applied
> to stageouts. These were only apparent when all the I/O buffers got
> used, so only with relatively large staging activity.
>
> Please test.
>
> Mihael
>
> On Wed, 2011-09-21 at 16:24 -0500, Ketan Maheshwari wrote:
> > Hi Mihael,
> >
> > I tested this fix. It seems that the timeout issue for large-ish data
> > and throttle > ~30 persists. I am not sure if this is data staging
> > timeout though.
> [...]

From ketancmaheshwari at gmail.com Sun Oct 2 21:27:20 2011
From: ketancmaheshwari at gmail.com (Ketan Maheshwari)
Date: Sun, 2 Oct 2011 21:27:20 -0500
Subject: [Swift-devel] persistent coasters and data staging
In-Reply-To: <1317548310.5297.2.camel@blabla>
References: <1315853779.19354.5.camel@blabla>
 <1315873153.2945.0.camel@blabla>
 <1317514798.577.1.camel@blabla>
 <1317548310.5297.2.camel@blabla>
Message-ID:

Mihael,

So far, I've been using the proxy mode:

proxy

I just tried using the non-proxy (file/local) mode:

The run doesn't progress.
I get the following timeout messages interspersed with stdout status
messages:

Command(2, HEARTBEAT): handling reply timeout; sendReqTime=111002-211133.264, sendTime=111002-211255.655, now=111002-212055.740
Command(2, HEARTBEAT)fault was: Reply timeout
org.globus.cog.karajan.workflow.service.TimeoutException
    at org.globus.cog.karajan.workflow.service.commands.Command.handleTimeout(Command.java:253)
    at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.checkTimeouts(AbstractKarajanChannel.java:131)
    at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.checkTimeouts(AbstractKarajanChannel.java:122)
    at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel$1.run(AbstractKarajanChannel.java:116)
    at java.util.TimerThread.mainLoop(Timer.java:512)
    at java.util.TimerThread.run(Timer.java:462)
Progress: time: Sun, 02 Oct 2011 21:21:25 -0500 Submitting:100

On the other hand, while trying the proxy mode, I did not get any
timeouts; however, 7 out of 100 jobs failed with the following errors:

The following errors have occurred:
1. Task failed: null
org.globus.cog.karajan.workflow.service.channels.ChannelException: Channel died and no contact available
    at org.globus.cog.karajan.workflow.service.channels.ChannelManager.connect(ChannelManager.java:235)
    at org.globus.cog.karajan.workflow.service.channels.ChannelManager.reserveChannel(ChannelManager.java:257)
    at org.globus.cog.karajan.workflow.service.channels.ChannelManager.reserveChannel(ChannelManager.java:227)
    at org.globus.cog.abstraction.coaster.service.job.manager.Node.getChannel(Node.java:125)
    at org.globus.cog.abstraction.coaster.service.job.manager.Cpu.submit(Cpu.java:245)
    at org.globus.cog.abstraction.coaster.service.job.manager.Cpu.launchSequential(Cpu.java:203)
    at org.globus.cog.abstraction.coaster.service.job.manager.Cpu.launch(Cpu.java:189)
    at org.globus.cog.abstraction.coaster.service.job.manager.Cpu.pull(Cpu.java:159)
    at org.globus.cog.abstraction.coaster.service.job.manager.PullThread.run(PullThread.java:98)
 (5 times)
2. Task failed: Connection to worker lost
java.net.SocketException: Broken pipe
    at java.net.SocketOutputStream.socketWrite0(Native Method)
    at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
    at java.net.SocketOutputStream.write(SocketOutputStream.java:124)
    at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.send(AbstractStreamKarajanChannel.java:305)
    at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.run(AbstractStreamKarajanChannel.java:251)
3. Task failed: Connection to worker lost
java.net.SocketException: Connection reset
    at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:96)
    at java.net.SocketOutputStream.write(SocketOutputStream.java:124)
    at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.send(AbstractStreamKarajanChannel.java:305)
    at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.run(AbstractStreamKarajanChannel.java:251)

Regards,
Ketan

On Sun, Oct 2, 2011 at 4:38 AM, Mihael Hategan wrote:
> I might have spoken a bit too soon there. There's still a timeout, but
> it occurs at higher loads during stageout. That's with proxy mode, so
> local (file) mode (i.e. what you should be using on OSG with the service
> running on the client node) may not necessarily show the same problem.
>
> [...]
--
Ketan

From hategan at mcs.anl.gov Sun Oct 2 21:40:44 2011
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sun, 02 Oct 2011 19:40:44 -0700
Subject: [Swift-devel] persistent coasters and data staging
In-Reply-To:
References: <1315853779.19354.5.camel@blabla>
 <1315873153.2945.0.camel@blabla>
 <1317514798.577.1.camel@blabla>
 <1317548310.5297.2.camel@blabla>
Message-ID: <1317609644.8192.4.camel@blabla>

On Sun, 2011-10-02 at 21:27 -0500, Ketan Maheshwari wrote:
> Mihael,
>
> So far, I've been using the proxy mode:
>
> proxy
>
> I just tried using the non-proxy (file/local) mode:
>
> file

And that is not related to the heartbeat error, which I'm not sure why
you're getting.

As for the errors you get in proxy mode, are you sure your workers are
fine?
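
For readers following the timeout discussion in this thread: the 12 Sep
explanation quoted earlier says that worker.pl measured timeouts from
the moment a file transfer started, and that any kind of activity
should reset the timer instead. The sketch below only illustrates that
difference. It is a minimal Python illustration, not the Perl code in
worker.pl; the class name, the method names, and the 120-second limit
are all hypothetical.

```python
class TransferTimer:
    """Tracks one transfer and answers 'has it timed out?' two ways."""

    def __init__(self, limit_seconds, now):
        self.limit = limit_seconds
        self.started = now          # when the transfer began
        self.last_activity = now    # when the peer was last heard from

    def touch(self, now):
        # Any traffic on the transfer (a chunk, an ack) resets the timer.
        self.last_activity = now

    def expired_since_start(self, now):
        # Start-based check: a slow but healthy transfer eventually trips it.
        return now - self.started > self.limit

    def expired_since_activity(self, now):
        # Activity-based check: trips only when the peer has gone quiet.
        return now - self.last_activity > self.limit


# Hypothetical transfer: a chunk arrives every 30 s for 5 minutes,
# and the (assumed) limit is 120 s.
timer = TransferTimer(limit_seconds=120, now=0)
for second in range(30, 301, 30):
    timer.touch(second)

start_based = timer.expired_since_start(300)        # True: 300 s > 120 s
activity_based = timer.expired_since_activity(300)  # False: last chunk just arrived
stalled = timer.expired_since_activity(300 + 121)   # True: 121 s of silence
```

With the start-based check, the slow transfer above is reported as
timed out even though data is still flowing; with the activity-based
check it is only reported once the other side has been silent for
longer than the limit.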
From ketancmaheshwari at gmail.com Mon Oct 3 09:25:22 2011
From: ketancmaheshwari at gmail.com (Ketan Maheshwari)
Date: Mon, 3 Oct 2011 09:25:22 -0500
Subject: [Swift-devel] persistent coasters and data staging
In-Reply-To: <1317609644.8192.4.camel@blabla>
References: <1315853779.19354.5.camel@blabla>
 <1315873153.2945.0.camel@blabla>
 <1317514798.577.1.camel@blabla>
 <1317548310.5297.2.camel@blabla>
 <1317609644.8192.4.camel@blabla>
Message-ID:

Mihael,

On Sun, Oct 2, 2011 at 9:40 PM, Mihael Hategan wrote:
> On Sun, 2011-10-02 at 21:27 -0500, Ketan Maheshwari wrote:
> > Mihael,
> >
> > So far, I've been using the proxy mode:
> >
> > proxy
> >
> > I just tried using the non-proxy (file/local) mode:
> >
> > file

Thanks. However, on using the above file mode, Swift does not seem to be
progressing. On stdout, I see intermittent "Active: 1" lines, but they
disappear and the jobs go back to submitted status.

This happens for about 20 minutes, after which the run starts but with a
high number of failures, with the following message:

Caused by: Task failed: null
org.globus.cog.karajan.workflow.service.channels.ChannelException: Channel died and no contact available
    at org.globus.cog.karajan.workflow.service.channels.ChannelManager.connect(ChannelManager.java:235)
    at org.globus.cog.karajan.workflow.service.channels.ChannelManager.reserveChannel(ChannelManager.java:257)
    at org.globus.cog.karajan.workflow.service.channels.ChannelManager.reserveChannel(ChannelManager.java:227)
    at org.globus.cog.abstraction.coaster.service.job.manager.Node.getChannel(Node.java:125)
    at org.globus.cog.abstraction.coaster.service.job.manager.Cpu.submit(Cpu.java:245)
    at org.globus.cog.abstraction.coaster.service.job.manager.Cpu.launchSequential(Cpu.java:203)
    at org.globus.cog.abstraction.coaster.service.job.manager.Cpu.launch(Cpu.java:189)
    at org.globus.cog.abstraction.coaster.service.job.manager.Cpu.pull(Cpu.java:159)
    at org.globus.cog.abstraction.coaster.service.job.manager.PullThread.run(PullThread.java:98)

On the workers' stdout, I see 59 workers are running:
"*** demandThread: swiftDemand=20 paddedDemand=24 totalRunning=59"

In the worker logs, I do not see any errors except for one worker
which says:

"Failed to register (timeout)"

The log for this run is:
http://www.ci.uchicago.edu/~ketan/catsn-20111003-0901-nd7ta1bb.log

The data size for this run is 10MB per task.

Regards,
Ketan

> And that is not related to the heartbeat error, which I'm not sure why
> you're getting.
>
> As for the errors you get in proxy mode, are you sure your workers are
> fine?

--
Ketan

From dsk at ci.uchicago.edu Mon Oct 3 11:12:30 2011
From: dsk at ci.uchicago.edu (Daniel S. Katz)
Date: Mon, 3 Oct 2011 11:12:30 -0500
Subject: [Swift-devel] Fwd: Pilot jobs in wikipedia
References: <26ECE39E-E575-45CA-8573-F1DF15201E44@ci.uchicago.edu>
Message-ID: <7A813DF6-5A67-4900-8F41-7122E17EB99D@ci.uchicago.edu>

FYI...

Begin forwarded message:

> From: "Daniel S. Katz"
> Date: October 3, 2011 11:11:49 AM CDT
> To: Michael Wilde, Miron Livny, Shantenu Jha, Andre Merzky
> Subject: Pilot jobs in wikipedia
>
> Hi,
>
> I created a pilot job page in wikipedia - I'm not sure if it will
> stick around, or if the editors will dislike it, but please feel free
> to edit, and ask your teams to help make it better.
>
> http://en.wikipedia.org/wiki/Pilot_job
>
> Dan

--
Daniel S. Katz
University of Chicago
(773) 834-7186 (voice)
(773) 834-6818 (fax)
d.katz at ieee.org or dsk at ci.uchicago.edu
http://www.ci.uchicago.edu/~dsk/

From hategan at mcs.anl.gov Mon Oct 3 14:23:17 2011
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 03 Oct 2011 12:23:17 -0700
Subject: [Swift-devel] persistent coasters and data staging
In-Reply-To:
References: <1315853779.19354.5.camel@blabla>
 <1315873153.2945.0.camel@blabla>
 <1317514798.577.1.camel@blabla>
 <1317548310.5297.2.camel@blabla>
 <1317609644.8192.4.camel@blabla>
Message-ID: <1317669797.11525.0.camel@blabla>

Are you running with a standalone coaster service? If yes, can you also
post the service log?

On Mon, 2011-10-03 at 09:25 -0500, Ketan Maheshwari wrote:
> Mihael,
>
> On Sun, Oct 2, 2011 at 9:40 PM, Mihael Hategan wrote:
> > On Sun, 2011-10-02 at 21:27 -0500, Ketan Maheshwari wrote:
> > > Mihael,
> > >
> > > So far, I've been using the proxy mode:
> > >
> > > key="stagingMethod">proxy
> > >
> > > I just tried using the non-proxy (file/local) mode:
> > >
> > > file
>
> Thanks. However, on using the above file mode, Swift does not seem to be
> progressing.
> [...]

From wilde at mcs.anl.gov Tue Oct 4 12:23:35 2011
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 4 Oct 2011 12:23:35 -0500 (CDT)
Subject: [Swift-devel] persistent coasters and data staging
In-Reply-To: <1317669797.11525.0.camel@blabla>
Message-ID: <621997449.64706.1317749015873.JavaMail.root@zimbra.anl.gov>

Mihael, Ketan, David,

Ketan and I reviewed progress yesterday on ExTENCI applications, and
decided that for the moment Ketan will focus on the
coaster-server-per-site+GridFTP configuration.

David, I'd like you to take over the testing and troubleshooting of the
configuration related to this email thread: single coaster server for
all OSG sites, using provider staging.

It seems like the next action was for Ketan to send Mihael the requested
service log. I'm not sure if that was done, or if so what it revealed.

Also, in reviewing this email thread, it wasn't clear to me: Mihael, are
you applying the fixes for this problem in trunk or the 0.93 branch? I
believe that Ketan has been testing with the 0.93 branch.

The other thing that was not clear to me, Mihael, was whether you have
been able to replicate, in your own test setups, the problems that Ketan
is experiencing in talking to OSG sites, or if we're in a mode of
sending you symptoms that you can't replicate and validate the fixes
for.

In order to get sufficient test coverage into the stress-test branch of
the test suite for the symptoms we've been seeing here, could you
provide details on what you have been able to re-create, and how?

David, can you pick up this problem and work to replicate the problems
in reproducible test suite cases, then test the fixes, and then test on
OSG? We can discuss in more detail what that would entail. I was hopeful
that we could recreate the OSG symptoms in a more controlled environment
between a CI lab machine and the MCS compute servers.

Thanks,

- Mike

----- Original Message -----
> From: "Mihael Hategan"
> To: "Ketan Maheshwari"
> Cc: "Swift Devel"
> Sent: Monday, October 3, 2011 2:23:17 PM
> Subject: Re: [Swift-devel] persistent coasters and data staging
> Are you running with a standalone coaster service? If yes, can you
> also post the service log?
>
> [...]

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From hategan at mcs.anl.gov Tue Oct 4 13:35:04 2011
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 04 Oct 2011 11:35:04 -0700
Subject: [Swift-devel] persistent coasters and data staging
In-Reply-To: <621997449.64706.1317749015873.JavaMail.root@zimbra.anl.gov>
References: <621997449.64706.1317749015873.JavaMail.root@zimbra.anl.gov>
Message-ID: <1317753304.18084.2.camel@blabla>

On Tue, 2011-10-04 at 12:23 -0500, Michael Wilde wrote:
> Mihael, Ketan, David,
>
> Ketan and I reviewed progress yesterday on ExTENCI applications, and
> decided that for the moment Ketan will focus on the
> coaster-server-per-site+GridFTP configuration.
>
> David, I'd like you to take over the testing and troubleshooting of
> the configuration related to this email thread: single coaster server
> for all OSG sites, using provider staging.
>
> It seems like the next action was for Ketan to send Mihael the
> requested service log. I'm not sure if that was done, or if so what it
> revealed.
>
> Also, in reviewing this email thread, it wasn't clear to me: Mihael,
> are you applying the fixes for this problem in trunk or 0.93 branch? I
> believe that Ketan has been testing with the 0.93 branch.

I was dealing with the 0.93 branch.

>
> The other thing that was not clear to me, Mihael, was whether you have
> been able to replicate the problems that Ketan is experiencing in
> talking to OSG sites, in your own test setups, or if we're in a mode
> of sending you symptoms that you can't replicate and validate the fixes
> for.

I was able to replicate the original problem and fix a large chunk of
it.

Ketan seems to be running into a different problem, but I suspect it's a
configuration issue of some sort.

[...]

Mihael

From skenny at uchicago.edu Thu Oct 6 17:07:53 2011
From: skenny at uchicago.edu (Sarah Kenny)
Date: Thu, 6 Oct 2011 15:07:53 -0700
Subject: [Swift-devel] gram on ranger
Message-ID:

hey all, i'm trying to submit to gram on ranger using the latest swift
(built from trunk). it fails like so:

Cannot submit job
Caused by:
org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Cannot
submit job
Caused by: org.globus.gram.GramException: Parameter not supported
Cannot submit job

the gram log was saying first that 'jobsPerNode' is not supported so i
changed it to workersPerNode and then it was saying 'maxnodes' is not
supported. here's my sites file:

10000
1
00:15:00
86400
1
256
16way
1
64
normal
TG-DBS080004N

/work/00043/tg457040

thoughts? ideas?

--
Sarah Kenny
Programmer ~ Brain Circuits Laboratory ~ Rm 2224 Bio Sci III
University of California Irvine, Dept.
of Neurology ~ 773-818-8300

From wozniak at mcs.anl.gov Fri Oct 7 10:16:10 2011
From: wozniak at mcs.anl.gov (Justin M Wozniak)
Date: Fri, 7 Oct 2011 10:16:10 -0500 (Central Daylight Time)
Subject: [Swift-devel] [Swift-user] gram on ranger
In-Reply-To:
References:
Message-ID:

Can I take a look at the log?

On Thu, 6 Oct 2011, Sarah Kenny wrote:

> hey all, i'm trying to submit to gram on ranger using the latest swift
> (built from trunk). it fails like so:
>
> Cannot submit job
> Caused by:
> org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Cannot
> submit job
> Caused by: org.globus.gram.GramException: Parameter not supported
> Cannot submit job
>
> the gram log was saying first that 'jobsPerNode' is not supported so i
> changed it to workersPerNode and then it was saying 'maxnodes' is not
> supported. here's my sites file:
>
>
>
> 10000
> 1
> 00:15:00
> 86400
> 1
> 256
> 16way
> 1
> 64
> normal
> TG-DBS080004N
>
>
>
> /work/00043/tg457040
>
>
>
> thoughts? ideas?

--
Justin M Wozniak

From ketancmaheshwari at gmail.com Fri Oct 7 10:19:00 2011
From: ketancmaheshwari at gmail.com (Ketan Maheshwari)
Date: Fri, 7 Oct 2011 10:19:00 -0500
Subject: [Swift-devel] gram on ranger
In-Reply-To:
References:
Message-ID:

Also, could you post the generated submit script. I tested this and it
seems the following line is not honored:

16way

My script is showing "1way" irrespective of what pe I put.

Regards,
Ketan

On Thu, Oct 6, 2011 at 5:07 PM, Sarah Kenny wrote:

> hey all, i'm trying to submit to gram on ranger using the latest swift
> (built from trunk).
it failes like so: > > Cannot submit job > Caused by: > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Cannot > submit job > Caused by: org.globus.gram.GramException: Parameter not supported > Cannot submit job > > the gram log was saying first that 'jobsPerNode' is not supported so i > changed it to workersPerNode and then it was saying 'maxnodes' is not > supported. here's my sites file: > > > > 10000 > 1 > 00:15:00 > 86400 > 1 > 256 > 16way > 1 > 64 > normal > TG-DBS080004N > > > > /work/00043/tg457040 > > > > thoughts? ideas? > > -- > Sarah Kenny > Programmer ~ Brain Circuits Laboratory ~ Rm 2224 Bio Sci III > University of California Irvine, Dept. of Neurology ~ 773-818-8300 > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > -- Ketan -------------- next part -------------- An HTML attachment was scrubbed... URL: From davidk at ci.uchicago.edu Fri Oct 7 10:25:37 2011 From: davidk at ci.uchicago.edu (David Kelly) Date: Fri, 7 Oct 2011 10:25:37 -0500 (CDT) Subject: [Swift-devel] gram on ranger In-Reply-To: Message-ID: <446676502.136713.1318001137284.JavaMail.root@zimbra-mb2.anl.gov> I ran into the same issue with the 'pe' value not being passed correctly when I was doing the provider testing. I created this bug for it: https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=549. I started looking at the code trying to understand why this happens.. I'll try to write a fix for this soon. David ----- Original Message ----- > From: "Ketan Maheshwari" > To: "Sarah Kenny" > Cc: "Swift Devel" , "Swift User" > Sent: Friday, October 7, 2011 10:19:00 AM > Subject: Re: [Swift-devel] gram on ranger > Also, could you post the generated submit script. I tested this and > seems the following line is not honored: > > > 16way > > > My script is showing "1way" irrespective of what pe I put. 
> > Regards, > Ketan > > > On Thu, Oct 6, 2011 at 5:07 PM, Sarah Kenny < skenny at uchicago.edu > > wrote: > > > hey all, i'm trying to submit to gram on ranger using the latest swift > (built from trunk). it failes like so: > > Cannot submit job > Caused by: > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > Cannot submit job > Caused by: org.globus.gram.GramException: Parameter not supported > Cannot submit job > > the gram log was saying first that 'jobsPerNode' is not supported so i > changed it to workersPerNode and then it was saying 'maxnodes' is not > supported. here's my sites file: > > > > 10000 > 1 > 00:15:00 > 86400 > 1 > 256 > 16way > 1 > 64 > normal > TG-DBS080004N > > > > /work/00043/tg457040 > > > > thoughts? ideas? > > -- > Sarah Kenny > Programmer ~ Brain Circuits Laboratory ~ Rm 2224 Bio Sci III > University of California Irvine, Dept. of Neurology ~ 773-818-8300 > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > > -- > Ketan > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel From skenny at uchicago.edu Fri Oct 7 15:13:57 2011 From: skenny at uchicago.edu (Sarah Kenny) Date: Fri, 7 Oct 2011 13:13:57 -0700 Subject: [Swift-devel] [Swift-user] gram on ranger In-Reply-To: References: Message-ID: /home/skenny/swift_logs/dummy-20111005-0126-6575n7x5.log on ci On Fri, Oct 7, 2011 at 8:16 AM, Justin M Wozniak wrote: > > Can I take a look at the log? > > > On Thu, 6 Oct 2011, Sarah Kenny wrote: > > hey all, i'm trying to submit to gram on ranger using the latest swift >> (built from trunk). 
it fails like so:
>>
>> Cannot submit job
>> Caused by:
>> org.globus.cog.abstraction.impl.common.task.TaskSubmissionException:
>> Cannot
>> submit job
>> Caused by: org.globus.gram.GramException: Parameter not supported
>> Cannot submit job
>>
>> the gram log was saying first that 'jobsPerNode' is not supported so i
>> changed it to workersPerNode and then it was saying 'maxnodes' is not
>> supported. here's my sites file:
>>
>>
>>
>> 10000
>> 1
>> 00:15:00
>> 86400
>> 1
>> 256
>> 16way
>> 1
>> 64
>> normal
>> TG-DBS080004N
>>
>>
>>
>> /work/00043/tg457040
>>
>>
>>
>> thoughts? ideas?
>>
>
> --
> Justin M Wozniak
>

--
Sarah Kenny
Programmer ~ Brain Circuits Laboratory ~ Rm 2224 Bio Sci III
University of California Irvine, Dept. of Neurology ~ 773-818-8300

From davidk at ci.uchicago.edu Fri Oct 7 15:51:00 2011
From: davidk at ci.uchicago.edu (David Kelly)
Date: Fri, 7 Oct 2011 15:51:00 -0500 (CDT)
Subject: [Swift-devel] persistent coasters and data staging
In-Reply-To: <1317753304.18084.2.camel@blabla>
Message-ID: <700915792.137425.1318020660764.JavaMail.root@zimbra-mb2.anl.gov>

I wrote a test to try to replicate this issue. I am running the
coaster service on bridled. My workers are running on the MCS servers.
I am using a modified catsn script with 100 files, each exactly 10
megabytes. After about 10 minutes of this, I get a failure in the
logs:

----- Original Message -----
> From: "Mihael Hategan"
> To: "Michael Wilde"
> Cc: "Swift Devel"
> Sent: Tuesday, October 4, 2011 1:35:04 PM
> Subject: Re: [Swift-devel] persistent coasters and data staging
> On Tue, 2011-10-04 at 12:23 -0500, Michael Wilde wrote:
> > Mihael, Ketan, David,
> >
> > Ketan and I reviewed progress yesterday on ExTENCI applications, and
> > decided that for the moment Ketan will focus on the
> > coaster-server-per-site+GridFTP configuration.
> > > > David, I'd like you to take over the testing and troubleshooting of > > the configuration related to this email thread: single coaster > > server > > for all OSG sites, using provider staging. > > > > It seems like the next action was for Ketan to send Mihael the > > requested service log. Im not sure if that was done, or if so what > > it > > revealed. > > > > Also, in reviewing this email thread, it wasnt clear to me: Mihael, > > are you applying the fixes for this problem in trunk or 0.93 branch? > > I > > believe that Ketan has been testing with the 0.93 branch. > > I was dealing with the 0.93 branch. > > > > > The other thing that was not clear to me, Mihael, was whether you > > have > > been able to replicate the problems that Ketan is experiencing in > > talking to OSG sites, in your own test setups, or if we're in a mode > > of sending you symptoms that you cant replicate and validate the > > fixes > > for. > > I was able to replicate the original problem and fix a large chunk of > it. > > Ketan seems to be running into a different problem, but I suspect it's > a > configuration issue of some sort. > > [...] 
>
> Mihael
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel

From davidk at ci.uchicago.edu Fri Oct 7 15:52:46 2011
From: davidk at ci.uchicago.edu (David Kelly)
Date: Fri, 7 Oct 2011 15:52:46 -0500 (CDT)
Subject: [Swift-devel] persistent coasters and data staging
In-Reply-To: <700915792.137425.1318020660764.JavaMail.root@zimbra-mb2.anl.gov>
Message-ID: <1370903272.137429.1318020766420.JavaMail.root@zimbra-mb2.anl.gov>

The exception I'm seeing is:

Caused by: org.globus.cog.abstraction.impl.common.execution.JobException: Job failed with an exit code of 521
    at org.globus.cog.karajan.workflow.nodes.functions.KException.function(KException.java:29)
    at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:27)
    at org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.completed(AbstractSequentialWithArguments.java:194)
    at org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:214)
    at org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58)
    at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:28)
    at org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:29)
    at org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:20)
    at org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63)
    at org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:139)
    at org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:197)
    at org.globus.cog.karajan.workflow.FlowElementWrapper.start(FlowElementWrapper.java:227)
    at org.globus.cog.karajan.workflow.events.EventBus.start(EventBus.java:104)
    at org.globus.cog.karajan.workflow.events.EventTargetPair.run(EventTargetPair.java:40)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:619)

It stops after about 10 minutes or so. The scripts and logs are on /autonfs/gpfs-pads/projects/CI-CCR000013/davidk/coaster-stress-tests

----- Original Message -----
> From: "David Kelly"
> To: "Mihael Hategan"
> Cc: "Swift Devel", "Michael Wilde"
> Sent: Friday, October 7, 2011 3:51:00 PM
> Subject: Re: [Swift-devel] persistent coasters and data staging
> I wrote a test to try to replicate this issue. I am running the
> coaster service on bridled. My workers are running on the MCS servers.
> I am using a modified catsn script with 100 files, each exactly 10
> megabytes. After about 10 minutes of this, I get a failure in the
> logs:
>
>
> ----- Original Message -----
> > From: "Mihael Hategan"
> > To: "Michael Wilde"
> > Cc: "Swift Devel"
> > Sent: Tuesday, October 4, 2011 1:35:04 PM
> > Subject: Re: [Swift-devel] persistent coasters and data staging
> > On Tue, 2011-10-04 at 12:23 -0500, Michael Wilde wrote:
> > > Mihael, Ketan, David,
> > >
> > > Ketan and I reviewed progress yesterday on ExTENCI applications,
> > > and
> > > decided that for the moment Ketan will focus on the
> > > coaster-server-per-site+GridFTP configuration.
> > >
> > > David, I'd like you to take over the testing and troubleshooting
> > > of
> > > the configuration related to this email thread: single coaster
> > > server
> > > for all OSG sites, using provider staging.
> > >
> > > It seems like the next action was for Ketan to send Mihael the
> > > requested service log. I'm not sure if that was done, or if so what
> > > it
> > > revealed.
> > > > > > Also, in reviewing this email thread, it wasnt clear to me: > > > Mihael, > > > are you applying the fixes for this problem in trunk or 0.93 > > > branch? > > > I > > > believe that Ketan has been testing with the 0.93 branch. > > > > I was dealing with the 0.93 branch. > > > > > > > > The other thing that was not clear to me, Mihael, was whether you > > > have > > > been able to replicate the problems that Ketan is experiencing in > > > talking to OSG sites, in your own test setups, or if we're in a > > > mode > > > of sending you symptoms that you cant replicate and validate the > > > fixes > > > for. > > > > I was able to replicate the original problem and fix a large chunk > > of > > it. > > > > Ketan seems to be running into a different problem, but I suspect > > it's > > a > > configuration issue of some sort. > > > > [...] > > > > Mihael > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Sat Oct 8 14:53:46 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sat, 08 Oct 2011 12:53:46 -0700 Subject: [Swift-devel] persistent coasters and data staging In-Reply-To: <1370903272.137429.1318020766420.JavaMail.root@zimbra-mb2.anl.gov> References: <1370903272.137429.1318020766420.JavaMail.root@zimbra-mb2.anl.gov> Message-ID: <1318103626.15941.0.camel@blabla> On Fri, 2011-10-07 at 15:52 -0500, David Kelly wrote: > The exception I'm seeing is: > > Caused by: org.globus.cog.abstraction.impl.common.execution.JobException: Job failed with an exit code of 521 [...] > It stops after about 10 minutes or so. The scripts and logs are on /autonfs/gpfs-pads/projects/CI-CCR000013/davidk/coaster-stress-tests I don't see logs there. 
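For anyone wanting to reproduce a stress test like the one described above (a set of input files of one exact size, e.g. 100 files of 10 megabytes each), a minimal sketch of generating the inputs follows. It assumes Python 3. The helper name `make_inputs`, the target directory, and the `data%04d.txt` naming (modelled on file names that appear in error output earlier in the thread) are all hypothetical and are not taken from the actual test scripts.

```python
import os


def make_inputs(directory, count, size_bytes):
    """Create `count` text files of exactly `size_bytes` bytes in `directory`.

    Hypothetical helper for illustration only; the data%04d.txt naming is
    an assumption, not the layout used by the real tests.
    """
    os.makedirs(directory, exist_ok=True)
    line = b"x" * 63 + b"\n"  # 64-byte line so the files are plain text for cat
    full, rest = divmod(size_bytes, len(line))
    payload = line * full + line[:rest]  # exactly size_bytes bytes
    paths = []
    for i in range(1, count + 1):
        path = os.path.join(directory, "data%04d.txt" % i)
        with open(path, "wb") as f:
            f.write(payload)
        paths.append(path)
    return paths


# Example call (commented out because it writes about 1 GB):
# make_inputs("indir10", 100, 10 * 1024 * 1024)  # assumes 1 MB = 1024 * 1024 bytes
```

Writing the same payload to every file keeps the generator fast; if the staging path under test deduplicates or compresses data, random content would be a better choice.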
From davidk at ci.uchicago.edu Sat Oct 8 19:19:40 2011 From: davidk at ci.uchicago.edu (David Kelly) Date: Sat, 8 Oct 2011 19:19:40 -0500 (CDT) Subject: [Swift-devel] persistent coasters and data staging In-Reply-To: <1318103626.15941.0.camel@blabla> Message-ID: <1319688034.138222.1318119580638.JavaMail.root@zimbra-mb2.anl.gov> Sorry about that, not sure what happened there. Check /autonfs/gpfs-pads/projects/CI-CCR000013/davidk/test2. I modified this test so that the parallelism is better: 8 jobs per node, 10 machines, 80 active tasks at a time. This run used 500 files X 10 megs but seems to have similarly failed. The coaster logs are in logs/. David ----- Original Message ----- > From: "Mihael Hategan" > To: "David Kelly" > Cc: "Swift Devel" , "Michael Wilde" > Sent: Saturday, October 8, 2011 2:53:46 PM > Subject: Re: [Swift-devel] persistent coasters and data staging > On Fri, 2011-10-07 at 15:52 -0500, David Kelly wrote: > > The exception I'm seeing is: > > > > Caused by: > > org.globus.cog.abstraction.impl.common.execution.JobException: Job > > failed with an exit code of 521 > [...] > > It stops after about 10 minutes or so. The scripts and logs are on > > /autonfs/gpfs-pads/projects/CI-CCR000013/davidk/coaster-stress-tests > > I don't see logs there. From davidk at ci.uchicago.edu Mon Oct 10 14:28:55 2011 From: davidk at ci.uchicago.edu (David Kelly) Date: Mon, 10 Oct 2011 14:28:55 -0500 (CDT) Subject: [Swift-devel] [Swift-user] gram on ranger In-Reply-To: Message-ID: <1762711564.139799.1318274935596.JavaMail.root@zimbra-mb2.anl.gov> Sarah, Can you give this another try with the latest 0.93? I made some changes to the coaster and sge providers and was able to get it working with a simple catns script. 
Here is the configuration file I was using: 3600 00:00:03 1 16 16 development 0.9 TG-DBS080004N 16way /share/home/01503/davidkel/swiftwork Thanks, David ----- Original Message ----- > From: "Sarah Kenny" > To: "Justin M Wozniak" > Cc: "Swift Devel" , "Swift User" > Sent: Friday, October 7, 2011 3:13:57 PM > Subject: Re: [Swift-user] gram on ranger > /home/skenny/swift_logs/dummy-20111005-0126-6575n7x5.log > > on ci > > > On Fri, Oct 7, 2011 at 8:16 AM, Justin M Wozniak < wozniak at mcs.anl.gov > > wrote: > > > > Can I take a look at the log? > > > > > On Thu, 6 Oct 2011, Sarah Kenny wrote: > > > > hey all, i'm trying to submit to gram on ranger using the latest swift > (built from trunk). it failes like so: > > Cannot submit job > Caused by: > org.globus.cog.abstraction. impl.common.task. TaskSubmissionException: > Cannot > submit job > Caused by: org.globus.gram.GramException: Parameter not supported > Cannot submit job > > the gram log was saying first that 'jobsPerNode' is not supported so i > changed it to workersPerNode and then it was saying 'maxnodes' is not > supported. here's my sites file: > > > > 10000 > 1 > 00:15:00 > 86400 > 1 > 256 > 16way > 1 > 64 > normal > TG-DBS080004N > > > > /work/00043/ tg457040 > > > > thoughts? ideas? > > -- > Justin M Wozniak > > > > -- > Sarah Kenny > Programmer ~ Brain Circuits Laboratory ~ Rm 2224 Bio Sci III > University of California Irvine, Dept. 
of Neurology ~ 773-818-8300 > > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user From skenny at uchicago.edu Mon Oct 10 15:59:34 2011 From: skenny at uchicago.edu (Sarah Kenny) Date: Mon, 10 Oct 2011 13:59:34 -0700 Subject: [Swift-devel] [Swift-user] gram on ranger In-Reply-To: <1762711564.139799.1318274935596.JavaMail.root@zimbra-mb2.anl.gov> References: <1762711564.139799.1318274935596.JavaMail.root@zimbra-mb2.anl.gov> Message-ID: where's the latest .93, can you send me the checkout link so i can be sure we're on the same page? On Mon, Oct 10, 2011 at 12:28 PM, David Kelly wrote: > Sarah, > > Can you give this another try with the latest 0.93? I made some changes to > the coaster and sge providers and was able to get it working with a simple > catns script. Here is the configuration file I was using: > > > > > > 3600 > 00:00:03 > 1 > 16 > 16 > development > 0.9 > TG-DBS080004N > 16way > /share/home/01503/davidkel/swiftwork > > > > Thanks, > David > > ----- Original Message ----- > > From: "Sarah Kenny" > > To: "Justin M Wozniak" > > Cc: "Swift Devel" , "Swift User" < > swift-user at ci.uchicago.edu> > > Sent: Friday, October 7, 2011 3:13:57 PM > > Subject: Re: [Swift-user] gram on ranger > > /home/skenny/swift_logs/dummy-20111005-0126-6575n7x5.log > > > > on ci > > > > > > On Fri, Oct 7, 2011 at 8:16 AM, Justin M Wozniak < wozniak at mcs.anl.gov > > > wrote: > > > > > > > > Can I take a look at the log? > > > > > > > > > > On Thu, 6 Oct 2011, Sarah Kenny wrote: > > > > > > > > hey all, i'm trying to submit to gram on ranger using the latest swift > > (built from trunk). it failes like so: > > > > Cannot submit job > > Caused by: > > org.globus.cog.abstraction. impl.common.task. 
TaskSubmissionException: > > Cannot > > submit job > > Caused by: org.globus.gram.GramException: Parameter not supported > > Cannot submit job > > > > the gram log was saying first that 'jobsPerNode' is not supported so i > > changed it to workersPerNode and then it was saying 'maxnodes' is not > > supported. here's my sites file: > > > > > > > > 10000 > > 1 > > 00:15:00 > > 86400 > > 1 > > 256 > > 16way > > 1 > > 64 > > normal > > TG-DBS080004N > > > > > > > > /work/00043/ tg457040 > > > > > > > > thoughts? ideas? > > > > -- > > Justin M Wozniak > > > > > > > > -- > > Sarah Kenny > > Programmer ~ Brain Circuits Laboratory ~ Rm 2224 Bio Sci III > > University of California Irvine, Dept. of Neurology ~ 773-818-8300 > > > > > > _______________________________________________ > > Swift-user mailing list > > Swift-user at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > -- Sarah Kenny Programmer ~ Brain Circuits Laboratory ~ Rm 2224 Bio Sci III University of California Irvine, Dept. of Neurology ~ 773-818-8300 -------------- next part -------------- An HTML attachment was scrubbed... URL: From davidk at ci.uchicago.edu Mon Oct 10 16:05:31 2011 From: davidk at ci.uchicago.edu (David Kelly) Date: Mon, 10 Oct 2011 16:05:31 -0500 (CDT) Subject: [Swift-devel] [Swift-user] gram on ranger In-Reply-To: Message-ID: <101294858.140119.1318280731646.JavaMail.root@zimbra-mb2.anl.gov> Sure, it's at cog: https://cogkit.svn.sourceforge.net/svnroot/cogkit/branches/4.1.9/src/cog swift: https://svn.ci.uchicago.edu/svn/vdl2/branches/release-0.93 David ----- Original Message ----- > From: "Sarah Kenny" > To: "David Kelly" > Cc: "Swift Devel" , "Swift User" , "Justin M Wozniak" > > Sent: Monday, October 10, 2011 3:59:34 PM > Subject: Re: [Swift-user] gram on ranger > where's the latest .93, can you send me the checkout link so i can be > sure we're on the same page? 
> > > > On Mon, Oct 10, 2011 at 12:28 PM, David Kelly < davidk at ci.uchicago.edu > > wrote: > > > Sarah, > > Can you give this another try with the latest 0.93? I made some > changes to the coaster and sge providers and was able to get it > working with a simple catns script. Here is the configuration file I > was using: > > > > > > > 3600 > 00:00:03 > 1 > 16 > 16 > development > 0.9 > > TG-DBS080004N > > 16way > /share/home/01503/davidkel/swiftwork > > > > Thanks, > > David > > ----- Original Message ----- > > > From: "Sarah Kenny" < skenny at uchicago.edu > > > To: "Justin M Wozniak" < wozniak at mcs.anl.gov > > > Cc: "Swift Devel" < swift-devel at ci.uchicago.edu >, "Swift User" < > > swift-user at ci.uchicago.edu > > > > > > Sent: Friday, October 7, 2011 3:13:57 PM > > Subject: Re: [Swift-user] gram on ranger > > /home/skenny/swift_logs/dummy-20111005-0126-6575n7x5.log > > > > on ci > > > > > > On Fri, Oct 7, 2011 at 8:16 AM, Justin M Wozniak < > > wozniak at mcs.anl.gov > > > wrote: > > > > > > > > Can I take a look at the log? > > > > > > > > > > On Thu, 6 Oct 2011, Sarah Kenny wrote: > > > > > > > > hey all, i'm trying to submit to gram on ranger using the latest > > swift > > (built from trunk). it failes like so: > > > > Cannot submit job > > Caused by: > > org.globus.cog.abstraction. impl.common.task. > > TaskSubmissionException: > > Cannot > > submit job > > Caused by: org.globus.gram.GramException: Parameter not supported > > Cannot submit job > > > > the gram log was saying first that 'jobsPerNode' is not supported so > > i > > changed it to workersPerNode and then it was saying 'maxnodes' is > > not > > supported. here's my sites file: > > > > > > > > 10000 > > 1 > > 00:15:00 > > 86400 > > 1 > > 256 > > 16way > > 1 > > 64 > > normal > > TG-DBS080004N > > > > > > > > > > /work/00043/ tg457040 > > > > > > > > > thoughts? ideas? 
> > > > -- > > Justin M Wozniak > > > > > > > > -- > > Sarah Kenny > > Programmer ~ Brain Circuits Laboratory ~ Rm 2224 Bio Sci III > > University of California Irvine, Dept. of Neurology ~ 773-818-8300 > > > > > > _______________________________________________ > > Swift-user mailing list > > Swift-user at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > > > -- > Sarah Kenny > Programmer ~ Brain Circuits Laboratory ~ Rm 2224 Bio Sci III > University of California Irvine, Dept. of Neurology ~ 773-818-8300 From skenny at uchicago.edu Mon Oct 10 17:43:04 2011 From: skenny at uchicago.edu (Sarah Kenny) Date: Mon, 10 Oct 2011 15:43:04 -0700 Subject: [Swift-devel] [Swift-user] gram on ranger In-Reply-To: <1762711564.139799.1318274935596.JavaMail.root@zimbra-mb2.anl.gov> References: <1762711564.139799.1318274935596.JavaMail.root@zimbra-mb2.anl.gov> Message-ID: ok, thanks, got in the queue now...also, realized my last run may have been using the old swift. apparently i had SWIFT_HOME set in my env and that overrides the newer swift i had set in my PATH. ~sk On Mon, Oct 10, 2011 at 12:28 PM, David Kelly wrote: > Sarah, > > Can you give this another try with the latest 0.93? I made some changes to > the coaster and sge providers and was able to get it working with a simple > catns script. 
Here is the configuration file I was using: > > > > > > 3600 > 00:00:03 > 1 > 16 > 16 > development > 0.9 > TG-DBS080004N > 16way > /share/home/01503/davidkel/swiftwork > > > > Thanks, > David > > ----- Original Message ----- > > From: "Sarah Kenny" > > To: "Justin M Wozniak" > > Cc: "Swift Devel" , "Swift User" < > swift-user at ci.uchicago.edu> > > Sent: Friday, October 7, 2011 3:13:57 PM > > Subject: Re: [Swift-user] gram on ranger > > /home/skenny/swift_logs/dummy-20111005-0126-6575n7x5.log > > > > on ci > > > > > > On Fri, Oct 7, 2011 at 8:16 AM, Justin M Wozniak < wozniak at mcs.anl.gov > > > wrote: > > > > > > > > Can I take a look at the log? > > > > > > > > > > On Thu, 6 Oct 2011, Sarah Kenny wrote: > > > > > > > > hey all, i'm trying to submit to gram on ranger using the latest swift > > (built from trunk). it failes like so: > > > > Cannot submit job > > Caused by: > > org.globus.cog.abstraction. impl.common.task. TaskSubmissionException: > > Cannot > > submit job > > Caused by: org.globus.gram.GramException: Parameter not supported > > Cannot submit job > > > > the gram log was saying first that 'jobsPerNode' is not supported so i > > changed it to workersPerNode and then it was saying 'maxnodes' is not > > supported. here's my sites file: > > > > > > > > 10000 > > 1 > > 00:15:00 > > 86400 > > 1 > > 256 > > 16way > > 1 > > 64 > > normal > > TG-DBS080004N > > > > > > > > /work/00043/ tg457040 > > > > > > > > thoughts? ideas? > > > > -- > > Justin M Wozniak > > > > > > > > -- > > Sarah Kenny > > Programmer ~ Brain Circuits Laboratory ~ Rm 2224 Bio Sci III > > University of California Irvine, Dept. of Neurology ~ 773-818-8300 > > > > > > _______________________________________________ > > Swift-user mailing list > > Swift-user at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > -- Sarah Kenny Programmer ~ Brain Circuits Laboratory ~ Rm 2224 Bio Sci III University of California Irvine, Dept. 
of Neurology ~ 773-818-8300

From jonmon at mcs.anl.gov Mon Oct 10 18:08:46 2011
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Mon, 10 Oct 2011 18:08:46 -0500
Subject: [Swift-devel] 0.93 application not found error
Message-ID:

Hello,
I am receiving an application not found error in Swift. I am using 0.93.

Swift svn swift-r5216 (swift modified locally) cog-r3296

RunID: 20111010-1758-0gkrrg31
(input): found 10 files
Progress: time: Mon, 10 Oct 2011 17:59:01 -0500
Progress: time: Mon, 10 Oct 2011 17:59:03 -0500 Selecting site:8 Active:1 Failed but can retry:1
Execution failed:
Application /home/jonmon/Library/Swift/apps/SwiftMontage/scripts/mProjectPP_wrap.py failed with an exit code of 127

Here is the tc.data:
localhost mImgtbl /home/jonmon/Library/Montage/bin/mImgtbl INSTALLED INTEL32::LINUX null
localhost mAdd /home/jonmon/Library/Montage/bin/mAdd INSTALLED INTEL32::LINUX null
localhost mJPEG /home/jonmon/Library/Montage/bin/mJPEG INSTALLED INTEL32::LINUX null
localhost mOverlaps /home/jonmon/Library/Montage/bin/mOverlaps INSTALLED INTEL32::LINUX null
localhost mConcatFit /home/jonmon/Library/Montage/bin/mConcatFit INSTALLED INTEL32::LINUX null
localhost mBgModel /home/jonmon/Library/Montage/bin/mBgModel INSTALLED INTEL32::LINUX null
localhost mMakeHdr /home/jonmon/Library/Montage/bin/mMakeHdr INSTALLED INTEL32::LINUX null
localhost Background_list /home/jonmon/Library/Swift/apps/SwiftMontage/scripts/Background_list.py INSTALLED INTEL32::LINUX null
localhost create_status_table /home/jonmon/Library/Swift/apps/SwiftMontage/scripts/create_status_table.py INSTALLED INTEL32::LINUX null
localhost mFitplane_wrap /home/jonmon/Library/Swift/apps/SwiftMontage/scripts/mFitplane_wrap.py INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:00:20"
localhost mDiffFit_wrap /home/jonmon/Library/Swift/apps/SwiftMontage/scripts/mDiffFit_wrap.py INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:00:20"
localhost mDiff_wrap /home/jonmon/Library/Swift/apps/SwiftMontage/scripts/mDiff_wrap.py INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:00:20"
localhost mProjectPP_wrap /home/jonmon/Library/Swift/apps/SwiftMontage/scripts/mProjectPP_wrap.py INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:00:20"
localhost mProject_wrap /home/jonmon/Library/Swift/apps/SwiftMontage/scripts/mProject_wrap.py INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:00:20"
localhost mBackground_wrap /home/jonmon/Library/Swift/apps/SwiftMontage/scripts/mBackground_wrap.py INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:00:20"

Here is the site file:

/gpfs/pads/swift/jonmon/Swift/work/localhost

.05

KEEP

and here is the config file:
execution.retries=0
sitedir.keep=true
status.mode=provider
wrapper.log.always.transfer=true
foreach.maxthreads=1024
wrapper.parameter.mode=files
use.provider.staging=false
provider.staging.pin.swiftfiles=false

The log reports the std error from python as "stderr.txt: /bin/sh: mProjectPP: command not found". I have mProjectPP in my path.
[jonmon at communicado: ~/PADS/Swift/SwiftMontage/m101_tutorial/run.0030]$ which mProjectPP
mProjectPP is /home/jonmon/Library/Montage/bin/mProjectPP

Does anyone know of a reason why this would happen?

From jonmon at mcs.anl.gov Mon Oct 10 18:10:00 2011
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Mon, 10 Oct 2011 18:10:00 -0500
Subject: [Swift-devel] 0.93 application not found error
In-Reply-To:
References:
Message-ID: <3CC2A4F2-F5A5-4693-B07E-4A54D491A9E4@mcs.anl.gov>

Oh... the log for this run is located at http://www.ci.uchicago.edu/~jonmon/logs/application_not_found.log

On Oct 10, 2011, at 6:08 PM, Jonathan Monette wrote:

> Hello,
> I am receiving an application not found error in Swift. I am using 0.93.
> > Swift svn swift-r5216 (swift modified locally) cog-r3296 > > RunID: 20111010-1758-0gkrrg31 > (input): found 10 files > Progress: time: Mon, 10 Oct 2011 17:59:01 -0500 > Progress: time: Mon, 10 Oct 2011 17:59:03 -0500 Selecting site:8 Active:1 Failed but can retry:1 > Execution failed: > Application /home/jonmon/Library/Swift/apps/SwiftMontage/scripts/mProjectPP_wrap.py failed with an exit code of 127 > > Here is the tc.data: > localhost mImgtbl /home/jonmon/Library/Montage/bin/mImgtbl INSTALLED INTEL32::LINUX null > localhost mAdd /home/jonmon/Library/Montage/bin/mAdd INSTALLED INTEL32::LINUX null > localhost mJPEG /home/jonmon/Library/Montage/bin/mJPEG INSTALLED INTEL32::LINUX null > localhost mOverlaps /home/jonmon/Library/Montage/bin/mOverlaps INSTALLED INTEL32::LINUX null > localhost mConcatFit /home/jonmon/Library/Montage/bin/mConcatFit INSTALLED INTEL32::LINUX null > localhost mBgModel /home/jonmon/Library/Montage/bin/mBgModel INSTALLED INTEL32::LINUX null > localhost mMakeHdr /home/jonmon/Library/Montage/bin/mMakeHdr INSTALLED INTEL32::LINUX null > localhost Background_list /home/jonmon/Library/Swift/apps/SwiftMontage/scripts/Background_list.py INSTALLED INTEL32::LINUX null > localhost create_status_table /home/jonmon/Library/Swift/apps/SwiftMontage/scripts/create_status_table.py INSTALLED INTEL32::LINUX null > localhost mFitplane_wrap /home/jonmon/Library/Swift/apps/SwiftMontage/scripts/mFitplane_wrap.py INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:00:20" > localhost mDiffFit_wrap /home/jonmon/Library/Swift/apps/SwiftMontage/scripts/mDiffFit_wrap.py INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:00:20" > localhost mDiff_wrap /home/jonmon/Library/Swift/apps/SwiftMontage/scripts/mDiff_wrap.py INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:00:20" > localhost mProjectPP_wrap /home/jonmon/Library/Swift/apps/SwiftMontage/scripts/mProjectPP_wrap.py INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:00:20" > localhost mProject_wrap 
/home/jonmon/Library/Swift/apps/SwiftMontage/scripts/mProject_wrap.py INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:00:20" > localhost mBackground_wrap /home/jonmon/Library/Swift/apps/SwiftMontage/scripts/mBackground_wrap.py INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:00:20" > > Here is the site file: > > > > > /gpfs/pads/swift/jonmon/Swift/work/localhost > > .05 > > KEEP > > > > and here is the config file: > execution.retries=0 > sitedir.keep=true > status.mode=provider > wrapper.log.always.transfer=true > foreach.maxthreads=1024 > wrapper.parameter.mode=files > use.provider.staging=false > provider.staging.pin.swiftfiles=false > > The log reports the std error from python as "stderr.txt: /bin/sh: mProjectPP: command not found". I have mProjectPP in my path. > [jonmon at communicado: ~/PADS/Swift/SwiftMontage/m101_tutorial/run.0030]$ which mProjectPP > mProjectPP is /home/jonmon/Library/Montage/bin/mProjectPP > > Does anyone know of a reason why this would happen? > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel -------------- next part -------------- An HTML attachment was scrubbed... URL: From iraicu at cs.iit.edu Mon Oct 10 20:59:22 2011 From: iraicu at cs.iit.edu (Ioan Raicu) Date: Mon, 10 Oct 2011 20:59:22 -0500 Subject: [Swift-devel] CFP: First Int. Workshop on Workflow Models, Systems, Services and Applications in the Cloud (CloudFlow) 2012 Message-ID: <4E93A2FA.4040900@cs.iit.edu> First International Workshop on Workflow Models, Systems, Services and Applications in the Cloud (CloudFlow) 2012 To be held in conjunction with the 26th IEEE International Parallel & Distributed Processing Symposium (IPDPS) 2012 Shanghai, China May 21-25, 2012. 
http://www.cloud-uestc.cn/cloudflow/home.html

Overview

Cloud computing is gaining tremendous momentum in both academia and industry, and more and more people are migrating their data and applications into the Cloud. We have observed wide adoption of the MapReduce computing model and the open source Hadoop system for large scale distributed data processing, and a variety of ad hoc mashup techniques that weave together Web applications. However, these are just first steps towards managing complex task and data dependencies in the Cloud, as there are more challenging issues such as large parameter space exploration, data partitioning and distribution, scheduling and optimization, smart reruns, and provenance tracking associated with workflow execution. Cloud needs structured and mature workflow technologies to handle such issues, and vice versa, as Cloud offers unprecedented scalability to workflow systems, and could potentially change the way we perceive and conduct research and experiments. The scale and complexity of the science and data analytics problems that can be handled can be greatly increased on the Cloud, and the on-demand nature of resource allocation on the Cloud will also help improve resource utilization and user experience.

As Cloud computing provides a paradigm-shifting utility-oriented computing model in terms of the unprecedented size of datacenter-level resource pool and the on-demand resource provisioning mechanism, there are lots of challenges in bringing Cloud and workflows together. We need high level languages and computing models for large scale workflow specification; we need to adapt existing workflow architectures into the Cloud, and integrate workflow systems with Cloud infrastructure and resources; we also need to leverage Cloud data storage technologies to efficiently distribute data over a large number of nodes and explore data locality during computation etc. We organize the CloudFlow workshop as a venue for the workflow and Cloud communities to define models and paradigms, present their state-of-the-art work, share their thoughts and experiences, and explore new directions in realizing workflows in the Cloud.

Topics:

We welcome the submission of original work related to the topics listed below, which include (in the context of Cloud):
- Models and Languages for Large Scale Workflow Specification
- Workflow Architecture and Framework
- Large Scale Workflow Systems
- Service Workflow
- Workflow Composition and Orchestration
- Workflow Migration into the Cloud
- Workflow Scheduling and Optimization
- Cloud Middleware in Support of Workflow
- Virtualized Environment
- Workflow Applications and Case Studies
- Performance and Scalability Analysis
- Peta-Scale Data Processing
- Event Processing and Messaging
- Real-Time Analytics
- Provenance

Paper Submission

Authors are invited to submit papers with unpublished, original work. The papers should not exceed 10 single-spaced double-column pages using 10-point size font on 8.5x11 inch pages (IEEE conference style), including figures, tables, and references. Paper submission should be done via the online CMT system, Microsoft's Academic Conference Management Service (https://cmt.research.microsoft.com/CF2012) by midnight January 9th, 2012 Pacific Time. The final format should be in PDF. Proceedings of the workshop will be published by the IEEE Digital Library and distributed at the conference. Selected excellent work may be eligible for additional post-conference publication as journal articles or book chapters. Submission implies the willingness of at least one of the authors to register and present the paper.

Important Dates

Paper submission: January 9th, 2012
Acceptance notification: February 8th, 2012
Final paper due: Feb 21st, 2012

Organization

Workshop Chairs:
Dr. Yong Zhao
University of Electronic Science and Technology of China, China
yongzh04 at gmail.com

Dr. Cui Lin
California State University, Fresno, USA
clin at csufresno.edu

Dr. Shiyong Lu
Wayne State University, USA
shiyong at wayne.edu

Program Chairs:
Dr. Wenhong Tian
University of Electronic Science and Technology of China, China

Dr. Ruini Xue
Tsinghua University, China

Steering Committee
- Dan Katz, University of Chicago, U.S.A.
- Mike Wilde, University of Chicago, U.S.A.
- Ewa Deelman, University of Southern California, U.S.A.
- Tevfik Kosar, University at Buffalo, U.S.A.
- Ilkay Altintas, San Diego Supercomputer Center, U.S.A.
- Ioan Raicu, Illinois Institute of Technology, U.S.A.
- Yogesh Simmhan, University of Southern California, U.S.A.
- Ian Taylor, Cardiff University, U.K.
- Weimin Zheng, Tsinghua University, China
- Hai Jin, Huazhong University of Science and Engineering, China
- Wanchun Dou, Nanjing University, China

Program Committee
- Shawn Bowers, Gonzaga University, U.S.A.
- Douglas Thain, University of Notre Dame, U.S.A.
- Ian Gorton, Pacific Northwest National Laboratory, U.S.A.
- Artem Chebotko, University of Texas at Pan American, U.S.A.
- Paolo Missier, Newcastle University, U.K.
- Paul Groth, University of Amsterdam, the Netherlands
- Zhiming Zhao, University of Amsterdam, the Netherlands
- Marta Mattoso, Federal University of Rio de Janeiro, Brazil
- Wei Tan, IBM T. J. Watson Research Center, U.S.A.
- Jianwu Wang, San Diego Supercomputer Center, U.S.A.
- Ping Yang, Binghamton University, U.S.A.
- Jian Guo, Harvard University, U.S.A.
- Liqiang Wang, University of Wyoming, U.S.A.
- Wenhong Tian, University of Electronic Science and Technology of China, China
- Ruini Xue, Tsinghua University, China
- Jian Cao, Shanghai Jiaotong University, China
- Weisong Shi, Tongji University, China
- Jianxun Liu, Hunan University of Science and Technology, China
- Song Zhang, Chinese Academy of Sciences, China

--
=================================================================
Ioan Raicu, Ph.D.
Assistant Professor, Illinois Institute of Technology (IIT)
Guest Research Faculty, Argonne National Laboratory (ANL)
=================================================================
Data-Intensive Distributed Systems Laboratory, CS/IIT
Distributed Systems Laboratory, MCS/ANL
=================================================================
Cel: 1-847-722-0876
Office: 1-312-567-5704
Email: iraicu at cs.iit.edu
Web: http://www.cs.iit.edu/~iraicu/
Web: http://datasys.cs.iit.edu/
=================================================================
=================================================================

From iraicu at cs.iit.edu Mon Oct 10 21:11:06 2011
From: iraicu at cs.iit.edu (Ioan Raicu)
Date: Mon, 10 Oct 2011 21:11:06 -0500
Subject: [Swift-devel] CFP: 12th IEEE/ACM International Symposium on Cluster, Grid and Cloud Computing (CCGrid 2012)
Message-ID: <4E93A5BA.3060901@cs.iit.edu>

12th IEEE/ACM International Symposium on Cluster, Grid and Cloud Computing (CCGrid 2012)
Ottawa, Canada
May 13-16, 2012
http://www.cloudbus.org/ccgrid2012

CALL FOR PAPERS

Rapid advances in processing, communication and systems/middleware technologies are leading to new paradigms and platforms for computing, ranging from computing Clusters to widely distributed Grid and emerging Clouds. CCGrid is a series of very successful conferences, sponsored by the IEEE Computer Society Technical Committee on Scalable Computing (TCSC) and ACM, with the overarching goal of bringing together international researchers, developers, and users and to provide an international forum to present leading research activities and results on a broad range of topics related to these platforms and paradigms and their applications. The conference features keynotes, technical presentations, posters and research demos, workshops, tutorials, as well as the SCALE challenges featuring live demonstrations.

In 2012, CCGrid will come to Canada for the first time and will be held in Ottawa, the capital city. CCGrid 2012 will have a focus on important and immediate issues that are significantly influencing all aspects of cluster, cloud and grid computing. Topics of interest include, but are not limited to:

* Applications and Experiences: Applications to real and complex problems in science, engineering, business and society; User studies; Experiences with large-scale deployments systems or applications.
* Architecture: System architectures, Design and deployment.
* Autonomic Computing and Cyberinfrastructure: Self managed behavior, models and technologies; Autonomic paradigms and approaches (control-based, bio-inspired, emergent, etc.); Bio-inspired approaches to management; SLA definition and enforcement.
* Performance Modeling and Evaluation: Performance models; Monitoring and evaluation tools, Analysis of system/application performance; Benchmarks and testbeds.
* Programming Models, Systems, and Fault-Tolerant Computing: Programming models for cluster, clouds and grid computing; fault tolerant infrastructure and algorithms; systems software to enable efficient computing.
* Multicore and Accelerator-based Computing: Software and application techniques to utilize multicore architectures and accelerators/heterogeneous computing systems.
* Scheduling and Resource Management: Techniques to schedule jobs and resources on clusters, clouds and grid computing platforms.
* Cloud Computing: Cloud architectures; Software tools and techniques for clouds.

PAPER SUBMISSION

Authors are invited to submit papers electronically. Submitted manuscripts should be structured as technical papers and may not exceed 8 letter size (8.5 x 11) pages including figures, tables and references using the IEEE format for conference proceedings (print area of 6-1/2 inches (16.51 cm) wide by 8-7/8 inches (22.51 cm) high, two-column format with columns 3-1/16 inches (7.85 cm) wide with a 3/8 inch (0.81 cm) space between them, single-spaced 10-point Times fully justified text). Submissions not conforming to these guidelines may be returned without review. Authors should submit the manuscript in PDF format and make sure that the file will print on a printer that uses letter size (8.5 x 11) paper. The official language of the meeting is English. All manuscripts will be reviewed and will be judged on correctness, originality, technical strength, significance, quality of presentation, and interest and relevance to the conference attendees.

Submitted papers must represent original unpublished research that is not currently under review for any other conference or journal. Papers not following these guidelines will be rejected without review and further action may be taken, including (but not limited to) notifications sent to the heads of the institutions of the authors and sponsors of the conference. Submissions received after the due date, exceeding the page limit, or not appropriately structured may not be considered. Authors may contact the conference chairs for more information. The proceedings will be published through the IEEE Computer Society Press, USA and will be made available online through the IEEE Digital Library.

Submission Link: https://www.easychair.org/account/signin.cgi?conf=ccgrid2012

JOURNAL SPECIAL ISSUE

Highly rated Top 6 papers from the CCGrid 2012 conference will be invited to extend for publication in a special issue of the "Future Generation Computer Systems (FGCS)" Journal published by Elsevier Press.

CHAIRS

General Chair
* Shikharesh Majumdar, Carleton University, Canada

Honorary Chair
* Geoffrey Fox, Indiana University, USA

Program Committee Co-Chairs
* Rajkumar Buyya, University of Melbourne, Australia
* Pavan Balaji, Argonne National Laboratory, USA

Program Committee Vice-chairs
* Daniel S. Katz (Applications and Experiences)
* Dhabaleswar K. Panda (Architecture)
* Manish Parashar (Middleware, Autonomic Computing, and Cyberinfrastructure)
* Ahmad Afsahi (Performance Modeling and Analysis)
* Xian-He Sun (Performance Measurement and Evaluation)
* William Gropp (Programming Models, Systems, and Fault-Tolerant computing)
* David Bader (Multicore and Accelerator-based Computing)
* Thomas Fahringer (Scheduling and Resource Management)
* Ignacio Martin Llorente and Madhusudhan Govindaraju (Cloud Computing)

Cyber Co-Chairs
* Anton Beloglazov, The University of Melbourne, Australia
* Suraj Pandey, CSIRO, Australia
* Trevor Gelowsky, Carleton University, Canada

Workshops Co-Chairs
* Marin Litoiu, York University, Canada
* Mukaddim Pathan, Telstra Corporation Limited, Australia

Publicity Chairs
* Helen Karatza, Aristotle University of Thessaloniki, Greece
* Ioan Raicu, Illinois Institute of Technology & Argonne National Labs, USA
* Bruno Schulze, National Laboratory for Scientific Computing, Brazil
* G Subrahmanya VRK Rao, Cognizant Technology Solutions, India

Tutorials Co-Chairs
* Sushil K. Prasad, Georgia State University, USA
* Rob Simmonds, Westgrid, Canada

Doctoral Symposium Co-Chairs
* Carlos Varela, Rensselaer Polytechnic Institute, USA
* Yogesh Simmhan, University of Southern California

Poster and Research Demo Co-Chairs
* Suraj Pandey, CSIRO, Australia

SCALE Challenge Coordinator
* Shantenu Jha, Rutgers and Louisiana State University

Steering Committee
* Henri Bal, Vrije University, The Netherlands
* Pavan Balaji, Argonne National Laboratory, USA
* Rajkumar Buyya, University of Melbourne, Australia (Chair)
* Franck Cappello, University of Paris-Sud, France
* Jack Dongarra, University of Tennessee & ORNL, USA
* Dick Epema, Technical University of Delft, The Netherlands
* Thomas Fahringer, University of Innsbruck, Austria
* Ian Foster, University of Chicago, USA
* Wolfgang Gentzsch, DEISA, Germany
* Hai Jin, Huazhong University of Science & Technology, China
* Craig Lee, The Aerospace Corporation, USA (Co-Chair)
* Laurent Lefevre, INRIA, France
* Geng Lin, Dell Inc., USA
* Manish Parashar, Rutgers: The State University of New Jersey, USA
* Shikharesh Majumdar, Carleton University, Canada
* Satoshi Matsuoka, Tokyo Institute of Technology, Japan
* Omer Rana, Cardiff University, UK
* Paul Roe, Queensland University of Technology, Australia
* Bruno Schulze, LNCC, Brazil
* Nalini Venkatasubramanian, University of California, USA
* Carlos Varela, Rensselaer Polytechnic Institute, USA

IMPORTANT DATES

Papers Due: 25 November 2011
Notification of Acceptance: 30 January 2012
Camera Ready Papers Due: 27 February 2012

Sponsors: IEEE Computer Society (TCSE) & ACM SIGARCH (approval pending)

--
=================================================================
Ioan Raicu, Ph.D.
Assistant Professor, Illinois Institute of Technology (IIT)
Guest Research Faculty, Argonne National Laboratory (ANL)
=================================================================
Data-Intensive Distributed Systems Laboratory, CS/IIT
Distributed Systems Laboratory, MCS/ANL
=================================================================
Cel: 1-847-722-0876
Office: 1-312-567-5704
Email: iraicu at cs.iit.edu
Web: http://www.cs.iit.edu/~iraicu/
Web: http://datasys.cs.iit.edu/
=================================================================
=================================================================

From skenny at uchicago.edu Tue Oct 11 13:32:37 2011
From: skenny at uchicago.edu (Sarah Kenny)
Date: Tue, 11 Oct 2011 11:32:37 -0700
Subject: [Swift-devel] [Swift-user] gram on ranger
In-Reply-To:
References: <1762711564.139799.1318274935596.JavaMail.root@zimbra-mb2.anl.gov>
Message-ID:

so, this workflow completes all the jobs but then just hangs indefinitely at the end...maybe a stray cleanup job?

log is here:

/home/skenny/swift_logs/corr-20111010-2104-fl5yngd9.log

just tweaked the sites file a bit from what david sent me:

28800
00:15:00
1
64
256
normal
1
TG-DBS080004N
16way
10000
/work/00043/tg457040/sidgrid_out/skenny

On Mon, Oct 10, 2011 at 3:43 PM, Sarah Kenny wrote:

> ok, thanks, got in the queue now...also, realized my last run may have been
> using the old swift. apparently i had SWIFT_HOME set in my env and that
> overrides the newer swift i had set in my PATH.
>
> ~sk
>
> On Mon, Oct 10, 2011 at 12:28 PM, David Kelly wrote:
>
>> Sarah,
>>
>> Can you give this another try with the latest 0.93? I made some changes to
>> the coaster and sge providers and was able to get it working with a simple
>> catns script. Here is the configuration file I was using:
>>
>> 3600
>> 00:00:03
>> 1
>> 16
>> 16
>> development
>> 0.9
>> TG-DBS080004N
>> 16way
>> /share/home/01503/davidkel/swiftwork
>>
>> Thanks,
>> David
>>
>> ----- Original Message -----
>> > From: "Sarah Kenny"
>> > To: "Justin M Wozniak"
>> > Cc: "Swift Devel" , "Swift User" < swift-user at ci.uchicago.edu>
>> > Sent: Friday, October 7, 2011 3:13:57 PM
>> > Subject: Re: [Swift-user] gram on ranger
>> > /home/skenny/swift_logs/dummy-20111005-0126-6575n7x5.log
>> >
>> > on ci
>> >
>> > On Fri, Oct 7, 2011 at 8:16 AM, Justin M Wozniak < wozniak at mcs.anl.gov > wrote:
>> >
>> > Can I take a look at the log?
>> >
>> > On Thu, 6 Oct 2011, Sarah Kenny wrote:
>> >
>> > hey all, i'm trying to submit to gram on ranger using the latest swift
>> > (built from trunk). it fails like so:
>> >
>> > Cannot submit job
>> > Caused by:
>> > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Cannot
>> > submit job
>> > Caused by: org.globus.gram.GramException: Parameter not supported
>> > Cannot submit job
>> >
>> > the gram log was saying first that 'jobsPerNode' is not supported so i
>> > changed it to workersPerNode and then it was saying 'maxnodes' is not
>> > supported. here's my sites file:
>> >
>> > 10000
>> > 1
>> > 00:15:00
>> > 86400
>> > 1
>> > 256
>> > 16way
>> > 1
>> > 64
>> > normal
>> > TG-DBS080004N
>> > /work/00043/tg457040
>> >
>> > thoughts? ideas?
>> >
>> > --
>> > Justin M Wozniak
>> >
>> > --
>> > Sarah Kenny
>> > Programmer ~ Brain Circuits Laboratory ~ Rm 2224 Bio Sci III
>> > University of California Irvine, Dept. of Neurology ~ 773-818-8300

--
Sarah Kenny
Programmer ~ Brain Circuits Laboratory ~ Rm 2224 Bio Sci III
University of California Irvine, Dept. of Neurology ~ 773-818-8300

From davidk at ci.uchicago.edu Tue Oct 11 13:49:02 2011
From: davidk at ci.uchicago.edu (David Kelly)
Date: Tue, 11 Oct 2011 13:49:02 -0500 (CDT)
Subject: [Swift-devel] [Swift-user] gram on ranger
In-Reply-To:
Message-ID: <1908023276.141555.1318358942817.JavaMail.root@zimbra-mb2.anl.gov>

That could be it.. maybe a cleanup script is not getting the right parameters and failing. Do you happen to have a copy of the coaster log? Maybe there will be some clues in there.

From skenny at uchicago.edu Tue Oct 11 14:05:34 2011
From: skenny at uchicago.edu (Sarah Kenny)
Date: Tue, 11 Oct 2011 12:05:34 -0700
Subject: [Swift-devel] [Swift-user] gram on ranger
In-Reply-To: <1908023276.141555.1318358942817.JavaMail.root@zimbra-mb2.anl.gov>
References: <1908023276.141555.1318358942817.JavaMail.root@zimbra-mb2.anl.gov>
Message-ID:

On Tue, Oct 11, 2011 at 11:49 AM, David Kelly wrote:

> That could be it.. maybe a cleanup script is not getting the right
> parameters and failing. Do you happen to have a copy of the coaster log?

just put it in /home/skenny/swift_logs

> Maybe there will be some clues in there.

--
Sarah Kenny
Programmer ~ Brain Circuits Laboratory ~ Rm 2224 Bio Sci III
University of California Irvine, Dept. of Neurology ~ 773-818-8300

From wilde at mcs.anl.gov Tue Oct 11 15:28:48 2011
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 11 Oct 2011 15:28:48 -0500
Subject: [Swift-devel] [Swift-user] gram on ranger
In-Reply-To:
References: <1908023276.141555.1318358942817.JavaMail.root@zimbra-mb2.anl.gov>
Message-ID:

We have another example of swift hanging at the end of a ParVis script. I think I reported that on the list. Mihael needs a jstack dump of this along with the swift log.

--
Sent from my mobile device

From jonmon at mcs.anl.gov Tue Oct 11 17:23:31 2011
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Tue, 11 Oct 2011 17:23:31 -0500
Subject: [Swift-devel] Swift command not recognizing options
Message-ID: <79E43E77-AD9D-4572-93FD-0B97350C9EA2@mcs.anl.gov>

Hello,
I do not believe Swift is honoring any options passed to it in trunk.
Here is my command line:
swift -sites.file sites.xml -tc.file tc.data -config swift.properties montage.swift

But the execution goes like this:
[jonmon at communicado: ~/PADS/Swift/SwiftMontage/m101_tutorial/run.0030]$ swift -sites.file sites.xml -tc.file tc.data -config swift.properties montage.swift
Swift version: 0.93
/home/jonmon/Library/Swift/bin/../etc/sites.xml
Swift svn swift-r5219 cog-r3296

RunID: 20111011-1719-std6z8pa
(input): found 10 files
Progress: time: Tue, 11 Oct 2011 17:19:25 -0500
Progress: time: Tue, 11 Oct 2011 17:19:28 -0500 Selecting site:8 Stage in:1 Submitting:1
Progress: time: Tue, 11 Oct 2011 17:19:31 -0500 Selecting site:8 Active:1 Checking status:1
EXCEPTION Exception in mProjectPP_wrap:
Arguments: [-X, raw_dir/raw_image_3.fits, proj_dir/proj_raw_image_3.fits, header.hdr]
Host: localhost
Directory: montage-20111011-1719-std6z8pa/jobs/d/mProjectPP_wrap-dc0bd6hk
stderr.txt:

stdout.txt:

----

Caused by: Swift config property "wrapperlog.always.transfer" not found in Swift configuration [swift.properties]

Execution failed:
EXCEPTION Exception in mProjectPP_wrap:
Arguments: [-X, raw_dir/raw_image_7.fits, proj_dir/proj_raw_image_7.fits, header.hdr]
Host: localhost
Directory: montage-20111011-1719-std6z8pa/jobs/c/mProjectPP_wrap-cc0bd6hk
stderr.txt:

stdout.txt:

----

Caused by: Swift config property "wrapperlog.always.transfer" not found in Swift configuration [swift.properties]

Swift config property "wrapperlog.always.transfer" not found in Swift configuration [swift.properties]

I do not have wrapperlog.always.transfer in my properties file.
I have wrapper.log.always.transfer:
Swift config property "wrapperlog.always.transfer" not found in Swift configuration [swift.properties]
[jonmon at communicado: ~/PADS/Swift/SwiftMontage/m101_tutorial/run.0030]$ cat swift.properties
execution.retries=0
sitedir.keep=true
status.mode=provider
wrapper.log.always.transfer=true
foreach.maxthreads=1024
wrapper.parameter.mode=files
use.provider.staging=false
provider.staging.pin.swiftfiles=false

Did something sneak into the trunk version that doesn't honor the command line options?

From jonmon at mcs.anl.gov Tue Oct 11 17:47:35 2011
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Tue, 11 Oct 2011 17:47:35 -0500
Subject: [Swift-devel] Swift command not recognizing options
In-Reply-To: <79E43E77-AD9D-4572-93FD-0B97350C9EA2@mcs.anl.gov>
References: <79E43E77-AD9D-4572-93FD-0B97350C9EA2@mcs.anl.gov>
Message-ID: <865D535C-147F-4F81-8652-EEEAB324AA3D@mcs.anl.gov>

Ok. It looks like the log and swift is recognizing the site correctly. Not sure why it is printing out the "/home/jonmon/Library/Swift/bin/../etc/sites.xml" business or why the Swift version reports that this is 0.93 when this is trunk.

I changed all the swift.properties files that I know of that Swift could default to, so that the key is now "wrapper.log.always.transfer" and not "wrapperlog.always.transfer". And I am still getting this error saying that I have it set. Is anyone else experiencing issues in trunk similar to this problem?

On Oct 11, 2011, at 5:23 PM, Jonathan Monette wrote:

> Hello,
> I do not believe Swift is honoring any options passed to it in trunk.
Here is my command line: > swift -sites.file sites.xml -tc.file tc.data -config swift.properties montage.swift > > But the execution goes like this: > [jonmon at communicado: ~/PADS/Swift/SwiftMontage/m101_tutorial/run.0030]$ swift -sites.file sites.xml -tc.file tc.data -config swift.properties montage.swift > Swift version: 0.93 > /home/jonmon/Library/Swift/bin/../etc/sites.xml > Swift svn swift-r5219 cog-r3296 > > RunID: 20111011-1719-std6z8pa > (input): found 10 files > Progress: time: Tue, 11 Oct 2011 17:19:25 -0500 > Progress: time: Tue, 11 Oct 2011 17:19:28 -0500 Selecting site:8 Stage in:1 Submitting:1 > Progress: time: Tue, 11 Oct 2011 17:19:31 -0500 Selecting site:8 Active:1 Checking status:1 > EXCEPTION Exception in mProjectPP_wrap: > Arguments: [-X, raw_dir/raw_image_3.fits, proj_dir/proj_raw_image_3.fits, header.hdr] > Host: localhost > Directory: montage-20111011-1719-std6z8pa/jobs/d/mProjectPP_wrap-dc0bd6hk > stderr.txt: > > stdout.txt: > > ---- > > Caused by: Swift config property "wrapperlog.always.transfer" not found in Swift configuration [swift.properties] > > Execution failed: > EXCEPTION Exception in mProjectPP_wrap: > Arguments: [-X, raw_dir/raw_image_7.fits, proj_dir/proj_raw_image_7.fits, header.hdr] > Host: localhost > Directory: montage-20111011-1719-std6z8pa/jobs/c/mProjectPP_wrap-cc0bd6hk > stderr.txt: > > stdout.txt: > > ---- > > Caused by: Swift config property "wrapperlog.always.transfer" not found in Swift configuration [swift.properties] > > Swift config property "wrapperlog.always.transfer" not found in Swift configuration [swift.properties] > > I do not have wrapperlog.always.transfer in my properties file. 
I have wrapper.log.always.transfer:
> Swift config property "wrapperlog.always.transfer" not found in Swift configuration [swift.properties]
> [jonmon at communicado: ~/PADS/Swift/SwiftMontage/m101_tutorial/run.0030]$ cat swift.properties
> execution.retries=0
> sitedir.keep=true
> status.mode=provider
> wrapper.log.always.transfer=true
> foreach.maxthreads=1024
> wrapper.parameter.mode=files
> use.provider.staging=false
> provider.staging.pin.swiftfiles=false
>
> Did something sneak into the trunk version that doesn't honor the command line options?

From hategan at mcs.anl.gov Tue Oct 11 18:17:03 2011
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 11 Oct 2011 16:17:03 -0700
Subject: [Swift-devel] Swift command not recognizing options
In-Reply-To: <79E43E77-AD9D-4572-93FD-0B97350C9EA2@mcs.anl.gov>
References: <79E43E77-AD9D-4572-93FD-0B97350C9EA2@mcs.anl.gov>
Message-ID: <1318375023.2511.0.camel@blabla>

Log please.

On Tue, 2011-10-11 at 17:23 -0500, Jonathan Monette wrote:
> Hello,
> I do not believe Swift is honoring any options passed to it in trunk.
Here is my command line: > swift -sites.file sites.xml -tc.file tc.data -config swift.properties montage.swift > > But the execution goes like this: > [jonmon at communicado: ~/PADS/Swift/SwiftMontage/m101_tutorial/run.0030]$ swift -sites.file sites.xml -tc.file tc.data -config swift.properties montage.swift > Swift version: 0.93 > /home/jonmon/Library/Swift/bin/../etc/sites.xml > Swift svn swift-r5219 cog-r3296 > > RunID: 20111011-1719-std6z8pa > (input): found 10 files > Progress: time: Tue, 11 Oct 2011 17:19:25 -0500 > Progress: time: Tue, 11 Oct 2011 17:19:28 -0500 Selecting site:8 Stage in:1 Submitting:1 > Progress: time: Tue, 11 Oct 2011 17:19:31 -0500 Selecting site:8 Active:1 Checking status:1 > EXCEPTION Exception in mProjectPP_wrap: > Arguments: [-X, raw_dir/raw_image_3.fits, proj_dir/proj_raw_image_3.fits, header.hdr] > Host: localhost > Directory: montage-20111011-1719-std6z8pa/jobs/d/mProjectPP_wrap-dc0bd6hk > stderr.txt: > > stdout.txt: > > ---- > > Caused by: Swift config property "wrapperlog.always.transfer" not found in Swift configuration [swift.properties] > > Execution failed: > EXCEPTION Exception in mProjectPP_wrap: > Arguments: [-X, raw_dir/raw_image_7.fits, proj_dir/proj_raw_image_7.fits, header.hdr] > Host: localhost > Directory: montage-20111011-1719-std6z8pa/jobs/c/mProjectPP_wrap-cc0bd6hk > stderr.txt: > > stdout.txt: > > ---- > > Caused by: Swift config property "wrapperlog.always.transfer" not found in Swift configuration [swift.properties] > > Swift config property "wrapperlog.always.transfer" not found in Swift configuration [swift.properties] > > I do not have wrapperlog.always.transfer in my properties file. 
I have wrapper.log.always.transfer:
> Swift config property "wrapperlog.always.transfer" not found in Swift configuration [swift.properties]
> [jonmon at communicado: ~/PADS/Swift/SwiftMontage/m101_tutorial/run.0030]$ cat swift.properties
> execution.retries=0
> sitedir.keep=true
> status.mode=provider
> wrapper.log.always.transfer=true
> foreach.maxthreads=1024
> wrapper.parameter.mode=files
> use.provider.staging=false
> provider.staging.pin.swiftfiles=false
>
> Did something sneak into the trunk version that doesn't honor the command line options?
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel

From jonmon at mcs.anl.gov Tue Oct 11 18:20:17 2011
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Tue, 11 Oct 2011 18:20:17 -0500
Subject: [Swift-devel] Swift command not recognizing options
In-Reply-To: <1318375023.2511.0.camel@blabla>
References: <79E43E77-AD9D-4572-93FD-0B97350C9EA2@mcs.anl.gov> <1318375023.2511.0.camel@blabla>
Message-ID: <3E366FB5-0AF0-4DE0-A600-F12124087E32@mcs.anl.gov>

Sorry. Thought I sent that info. All the files are located at /home/jonmon/PADS/Swift/SwiftMontage/m101_tutorial/run.0030 on the CI machines.

On Oct 11, 2011, at 6:17 PM, Mihael Hategan wrote:

> Log please.
>
> On Tue, 2011-10-11 at 17:23 -0500, Jonathan Monette wrote:
>> Hello,
>> I do not believe Swift is honoring any options passed to it in trunk.
Here is my command line: >> swift -sites.file sites.xml -tc.file tc.data -config swift.properties montage.swift >> >> But the execution goes like this: >> [jonmon at communicado: ~/PADS/Swift/SwiftMontage/m101_tutorial/run.0030]$ swift -sites.file sites.xml -tc.file tc.data -config swift.properties montage.swift >> Swift version: 0.93 >> /home/jonmon/Library/Swift/bin/../etc/sites.xml >> Swift svn swift-r5219 cog-r3296 >> >> RunID: 20111011-1719-std6z8pa >> (input): found 10 files >> Progress: time: Tue, 11 Oct 2011 17:19:25 -0500 >> Progress: time: Tue, 11 Oct 2011 17:19:28 -0500 Selecting site:8 Stage in:1 Submitting:1 >> Progress: time: Tue, 11 Oct 2011 17:19:31 -0500 Selecting site:8 Active:1 Checking status:1 >> EXCEPTION Exception in mProjectPP_wrap: >> Arguments: [-X, raw_dir/raw_image_3.fits, proj_dir/proj_raw_image_3.fits, header.hdr] >> Host: localhost >> Directory: montage-20111011-1719-std6z8pa/jobs/d/mProjectPP_wrap-dc0bd6hk >> stderr.txt: >> >> stdout.txt: >> >> ---- >> >> Caused by: Swift config property "wrapperlog.always.transfer" not found in Swift configuration [swift.properties] >> >> Execution failed: >> EXCEPTION Exception in mProjectPP_wrap: >> Arguments: [-X, raw_dir/raw_image_7.fits, proj_dir/proj_raw_image_7.fits, header.hdr] >> Host: localhost >> Directory: montage-20111011-1719-std6z8pa/jobs/c/mProjectPP_wrap-cc0bd6hk >> stderr.txt: >> >> stdout.txt: >> >> ---- >> >> Caused by: Swift config property "wrapperlog.always.transfer" not found in Swift configuration [swift.properties] >> >> Swift config property "wrapperlog.always.transfer" not found in Swift configuration [swift.properties] >> >> I do not have wrapperlog.always.transfer in my properties file. 
I have wrapper.log.always.transfer: >> Swift config property "wrapperlog.always.transfer" not found in Swift configuration [swift.properties] >> [jonmon at communicado: ~/PADS/Swift/SwiftMontage/m101_tutorial/run.0030]$ cat swift.properties >> execution.retries=0 >> sitedir.keep=true >> status.mode=provider >> wrapper.log.always.transfer=true >> foreach.maxthreads=1024 >> wrapper.parameter.mode=files >> use.provider.staging=false >> provider.staging.pin.swiftfiles=false >> >> Did something sneak into the trunk version that doesn't honor the command line options? >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > From hategan at mcs.anl.gov Tue Oct 11 18:23:26 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 11 Oct 2011 16:23:26 -0700 Subject: [Swift-devel] [Swift-user] gram on ranger In-Reply-To: References: <1908023276.141555.1318358942817.JavaMail.root@zimbra-mb2.anl.gov> Message-ID: <1318375406.2770.0.camel@blabla> Is this with a persistent coaster service? On Tue, 2011-10-11 at 12:05 -0700, Sarah Kenny wrote: > > > On Tue, Oct 11, 2011 at 11:49 AM, David Kelly > wrote: > > That could be it.. maybe a cleanup script is not getting the > right parameters and failing. Do you happen to have a copy of > the coaster log? > > just put it in /home/skenny/swift_logs > > > Maybe there will be some clues in there. > > ----- Original Message ----- > > From: "Sarah Kenny" > > > To: "David Kelly" > > Cc: "Swift Devel" , "Swift > User" , "Justin M Wozniak" > > > > > Sent: Tuesday, October 11, 2011 1:32:37 PM > > Subject: Re: [Swift-user] gram on ranger > > > so, this workflow completes all the jobs but then just hangs > > indefinitely at the end...maybe a stray cleanup job? 
> > > > log is here: > > > > /home/skenny/swift_logs/corr-20111010-2104-fl5yngd9.log > > > > just tweaked the sites file a bit from what david sent me: > > > > > > > > > > > > > 28800 > > key="maxWallTime">00:15:00 > > 1 > > key="nodeGranularity">64 > > 256 > > normal > > 1 > > key="project">TG-DBS080004N > > 16way > > key="initialScore">10000 > > > /work/00043/tg457040/sidgrid_out/skenny > > > > > > > > > > > > On Mon, Oct 10, 2011 at 3:43 PM, Sarah Kenny < > skenny at uchicago.edu > > > wrote: > > > > > > ok, thanks, got in the queue now...also, realized my last > run may have > > been using the old swift. apparently i had SWIFT_HOME set in > my env > > and that overrides the newer swift i had set in my PATH. > > > > ~sk > > > > > > > > On Mon, Oct 10, 2011 at 12:28 PM, David Kelly < > davidk at ci.uchicago.edu > > > wrote: > > > > > > > > > > > > Sarah, > > > > Can you give this another try with the latest 0.93? I made > some > > changes to the coaster and sge providers and was able to get > it > > working with a simple catns script. Here is the > configuration file I > > was using: > > > > > > > > > > > > > > > 3600 > > key="maxWallTime">00:00:03 > > 1 > > key="nodeGranularity">16 > > 16 > > key="queue">development > > 0.9 > > > > key="project">TG-DBS080004N > > > > 16way > > > /share/home/01503/davidkel/swiftwork > > > > > > > > Thanks, > > > > David > > > > ----- Original Message ----- > > > > > From: "Sarah Kenny" < skenny at uchicago.edu > > > > To: "Justin M Wozniak" < wozniak at mcs.anl.gov > > > > Cc: "Swift Devel" < swift-devel at ci.uchicago.edu >, "Swift > User" < > > > swift-user at ci.uchicago.edu > > > > > > > > > > Sent: Friday, October 7, 2011 3:13:57 PM > > > Subject: Re: [Swift-user] gram on ranger > > > /home/skenny/swift_logs/dummy-20111005-0126-6575n7x5.log > > > > > > on ci > > > > > > > > > On Fri, Oct 7, 2011 at 8:16 AM, Justin M Wozniak < > > > wozniak at mcs.anl.gov > > > > wrote: > > > > > > > > > > > > Can I take a look at the log? 
> > > > > > > > > > > > > > > On Thu, 6 Oct 2011, Sarah Kenny wrote: > > > > > > > > > > > > hey all, i'm trying to submit to gram on ranger using the > latest > > > swift > > > (built from trunk). it failes like so: > > > > > > Cannot submit job > > > Caused by: > > > org.globus.cog.abstraction. impl.common.task. > > > TaskSubmissionException: > > > Cannot > > > submit job > > > Caused by: org.globus.gram.GramException: Parameter not > supported > > > Cannot submit job > > > > > > the gram log was saying first that 'jobsPerNode' is not > supported so > > > i > > > changed it to workersPerNode and then it was saying > 'maxnodes' is > > > not > > > supported. here's my sites file: > > > > > > > > > > > > 10000 profile> > > > 1 > > > 00:15:00 profile> > > > 86400 > > > 1 > > > 256 > > > 16way > > > 1 profile> > > > 64 profile> > > > normal > > > TG-DBS080004N profile> > > > > > > > > url=" > > > gatekeeper.ranger.tacc. teragrid.org "/> > > > > > > > > /work/00043/ tg457040 > > > > > > > > > > > > > > thoughts? ideas? > > > > > > -- > > > Justin M Wozniak > > > > > > > > > > > > -- > > > Sarah Kenny > > > Programmer ~ Brain Circuits Laboratory ~ Rm 2224 Bio Sci > III > > > University of California Irvine, Dept. of Neurology ~ > 773-818-8300 > > > > > > > > > _______________________________________________ > > > Swift-user mailing list > > > Swift-user at ci.uchicago.edu > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > > > > > > > > > > > > > -- > > Sarah Kenny > > Programmer ~ Brain Circuits Laboratory ~ Rm 2224 Bio Sci III > > University of California Irvine, Dept. of Neurology ~ > 773-818-8300 > > > > > > > > > > -- > > Sarah Kenny > > Programmer ~ Brain Circuits Laboratory ~ Rm 2224 Bio Sci III > > University of California Irvine, Dept. of Neurology ~ > 773-818-8300 > > > > > -- > Sarah Kenny > Programmer ~ Brain Circuits Laboratory ~ Rm 2224 Bio Sci III > University of California Irvine, Dept. 
of Neurology ~ 773-818-8300 > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user From skenny at uchicago.edu Tue Oct 11 19:13:21 2011 From: skenny at uchicago.edu (Sarah Kenny) Date: Tue, 11 Oct 2011 17:13:21 -0700 Subject: [Swift-devel] [Swift-user] gram on ranger In-Reply-To: <1318375406.2770.0.camel@blabla> References: <1908023276.141555.1318358942817.JavaMail.root@zimbra-mb2.anl.gov> <1318375406.2770.0.camel@blabla> Message-ID: On Tue, Oct 11, 2011 at 4:23 PM, Mihael Hategan wrote: > Is this with a persistent coaster service? > admittedly i have not used persistent coaster service...should i? i feel like it's documented *somewhere* (?) for now i've tried setting 'sitedir.keep=true' in the config so maybe it won't try to run the cleanup job...we'll see (waiting in q) > > On Tue, 2011-10-11 at 12:05 -0700, Sarah Kenny wrote: > > > > > > On Tue, Oct 11, 2011 at 11:49 AM, David Kelly > > wrote: > > > > That could be it.. maybe a cleanup script is not getting the > > right parameters and failing. Do you happen to have a copy of > > the coaster log? > > > > just put it in /home/skenny/swift_logs > > > > > > Maybe there will be some clues in there. > > > > ----- Original Message ----- > > > From: "Sarah Kenny" > > > > > To: "David Kelly" > > > Cc: "Swift Devel" , "Swift > > User" , "Justin M Wozniak" > > > > > > > > Sent: Tuesday, October 11, 2011 1:32:37 PM > > > Subject: Re: [Swift-user] gram on ranger > > > > > so, this workflow completes all the jobs but then just hangs > > > indefinitely at the end...maybe a stray cleanup job? 
> > > > > > log is here: > > > > > > /home/skenny/swift_logs/corr-20111010-2104-fl5yngd9.log > > > > > > just tweaked the sites file a bit from what david sent me: > > > > > > > > > > > > > > > > > > > > 28800 > > > > key="maxWallTime">00:15:00 > > > 1 > > > > key="nodeGranularity">64 > > > 256 > > > normal > > > 1 > > > > key="project">TG-DBS080004N > > > 16way > > > > key="initialScore">10000 > > > > > > /work/00043/tg457040/sidgrid_out/skenny > > > > > > > > > > > > > > > > > > On Mon, Oct 10, 2011 at 3:43 PM, Sarah Kenny < > > skenny at uchicago.edu > > > > wrote: > > > > > > > > > ok, thanks, got in the queue now...also, realized my last > > run may have > > > been using the old swift. apparently i had SWIFT_HOME set in > > my env > > > and that overrides the newer swift i had set in my PATH. > > > > > > ~sk > > > > > > > > > > > > On Mon, Oct 10, 2011 at 12:28 PM, David Kelly < > > davidk at ci.uchicago.edu > > > > wrote: > > > > > > > > > > > > > > > > > > Sarah, > > > > > > Can you give this another try with the latest 0.93? I made > > some > > > changes to the coaster and sge providers and was able to get > > it > > > working with a simple catns script. 
Here is the > > configuration file I > > > was using: > > > > > > > > > > > > > > > > > > > > > > > 3600 > > > > key="maxWallTime">00:00:03 > > > 1 > > > > key="nodeGranularity">16 > > > 16 > > > > key="queue">development > > > 0.9 > > > > > > > key="project">TG-DBS080004N > > > > > > 16way > > > > > > /share/home/01503/davidkel/swiftwork > > > > > > > > > > > > Thanks, > > > > > > David > > > > > > ----- Original Message ----- > > > > > > > From: "Sarah Kenny" < skenny at uchicago.edu > > > > > To: "Justin M Wozniak" < wozniak at mcs.anl.gov > > > > > Cc: "Swift Devel" < swift-devel at ci.uchicago.edu >, "Swift > > User" < > > > > swift-user at ci.uchicago.edu > > > > > > > > > > > > > > Sent: Friday, October 7, 2011 3:13:57 PM > > > > Subject: Re: [Swift-user] gram on ranger > > > > /home/skenny/swift_logs/dummy-20111005-0126-6575n7x5.log > > > > > > > > on ci > > > > > > > > > > > > On Fri, Oct 7, 2011 at 8:16 AM, Justin M Wozniak < > > > > wozniak at mcs.anl.gov > > > > > wrote: > > > > > > > > > > > > > > > > Can I take a look at the log? > > > > > > > > > > > > > > > > > > > > On Thu, 6 Oct 2011, Sarah Kenny wrote: > > > > > > > > > > > > > > > > hey all, i'm trying to submit to gram on ranger using the > > latest > > > > swift > > > > (built from trunk). it failes like so: > > > > > > > > Cannot submit job > > > > Caused by: > > > > org.globus.cog.abstraction. impl.common.task. > > > > TaskSubmissionException: > > > > Cannot > > > > submit job > > > > Caused by: org.globus.gram.GramException: Parameter not > > supported > > > > Cannot submit job > > > > > > > > the gram log was saying first that 'jobsPerNode' is not > > supported so > > > > i > > > > changed it to workersPerNode and then it was saying > > 'maxnodes' is > > > > not > > > > supported. 
here's my sites file: > > > > > > > > > > > > > > > > 10000 > profile> > > > > 1 > > > > 00:15:00 > profile> > > > > 86400 > > > > 1 > > > > 256 > > > > 16way > > > > 1 > profile> > > > > 64 > profile> > > > > normal > > > > TG-DBS080004N > profile> > > > > > > > > > > > > url=" > > > > gatekeeper.ranger.tacc. teragrid.org "/> > > > > > > > > > > > /work/00043/ tg457040 > > > > > > > > > > > > > > > > > > > thoughts? ideas? > > > > > > > > -- > > > > Justin M Wozniak > > > > > > > > > > > > > > > > -- > > > > Sarah Kenny > > > > Programmer ~ Brain Circuits Laboratory ~ Rm 2224 Bio Sci > > III > > > > University of California Irvine, Dept. of Neurology ~ > > 773-818-8300 > > > > > > > > > > > > _______________________________________________ > > > > Swift-user mailing list > > > > Swift-user at ci.uchicago.edu > > > > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > > > > > > > > > > > > > > > > > > > > -- > > > Sarah Kenny > > > Programmer ~ Brain Circuits Laboratory ~ Rm 2224 Bio Sci III > > > University of California Irvine, Dept. of Neurology ~ > > 773-818-8300 > > > > > > > > > > > > > > > -- > > > Sarah Kenny > > > Programmer ~ Brain Circuits Laboratory ~ Rm 2224 Bio Sci III > > > University of California Irvine, Dept. of Neurology ~ > > 773-818-8300 > > > > > > > > > > -- > > Sarah Kenny > > Programmer ~ Brain Circuits Laboratory ~ Rm 2224 Bio Sci III > > University of California Irvine, Dept. of Neurology ~ 773-818-8300 > > > > _______________________________________________ > > Swift-user mailing list > > Swift-user at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > > -- Sarah Kenny Programmer ~ Brain Circuits Laboratory ~ Rm 2224 Bio Sci III University of California Irvine, Dept. of Neurology ~ 773-818-8300 -------------- next part -------------- An HTML attachment was scrubbed... 
URL:

From wozniak at mcs.anl.gov Wed Oct 12 10:24:31 2011
From: wozniak at mcs.anl.gov (Justin M Wozniak)
Date: Wed, 12 Oct 2011 10:24:31 -0500 (Central Daylight Time)
Subject: [Swift-devel] Swift command not recognizing options
In-Reply-To: <865D535C-147F-4F81-8652-EEEAB324AA3D@mcs.anl.gov>
References: <79E43E77-AD9D-4572-93FD-0B97350C9EA2@mcs.anl.gov> <865D535C-147F-4F81-8652-EEEAB324AA3D@mcs.anl.gov>
Message-ID:

On Tue, 11 Oct 2011, Jonathan Monette wrote:

> Not sure why it is printing out the
> "/home/jonmon/Library/Swift/bin/../etc/sites.xml" business

My mistake.

> or why the Swift version reports that this is 0.93 when this is trunk.

Again, me- this is a new output line based on our discussion that we should include the branch/version number in the output in addition to the svn revision number. I now see that this should be "trunk" if trunk and not just the last release version.

Justin

> On Oct 11, 2011, at 5:23 PM, Jonathan Monette wrote:
>
>> Hello,
>> I do not believe Swift is honoring any options passed to it in trunk.
Here is my command line: >> swift -sites.file sites.xml -tc.file tc.data -config swift.properties montage.swift >> >> But the execution goes like this: >> [jonmon at communicado: ~/PADS/Swift/SwiftMontage/m101_tutorial/run.0030]$ swift -sites.file sites.xml -tc.file tc.data -config swift.properties montage.swift >> Swift version: 0.93 >> /home/jonmon/Library/Swift/bin/../etc/sites.xml >> Swift svn swift-r5219 cog-r3296 >> >> RunID: 20111011-1719-std6z8pa >> (input): found 10 files >> Progress: time: Tue, 11 Oct 2011 17:19:25 -0500 >> Progress: time: Tue, 11 Oct 2011 17:19:28 -0500 Selecting site:8 Stage in:1 Submitting:1 >> Progress: time: Tue, 11 Oct 2011 17:19:31 -0500 Selecting site:8 Active:1 Checking status:1 >> EXCEPTION Exception in mProjectPP_wrap: >> Arguments: [-X, raw_dir/raw_image_3.fits, proj_dir/proj_raw_image_3.fits, header.hdr] >> Host: localhost >> Directory: montage-20111011-1719-std6z8pa/jobs/d/mProjectPP_wrap-dc0bd6hk >> stderr.txt: >> >> stdout.txt: >> >> ---- >> >> Caused by: Swift config property "wrapperlog.always.transfer" not found in Swift configuration [swift.properties] >> >> Execution failed: >> EXCEPTION Exception in mProjectPP_wrap: >> Arguments: [-X, raw_dir/raw_image_7.fits, proj_dir/proj_raw_image_7.fits, header.hdr] >> Host: localhost >> Directory: montage-20111011-1719-std6z8pa/jobs/c/mProjectPP_wrap-cc0bd6hk >> stderr.txt: >> >> stdout.txt: >> >> ---- >> >> Caused by: Swift config property "wrapperlog.always.transfer" not found in Swift configuration [swift.properties] >> >> Swift config property "wrapperlog.always.transfer" not found in Swift configuration [swift.properties] >> >> I do not have wrapperlog.always.transfer in my properties file. 
I have wrapper.log.always.transfer:
>> Swift config property "wrapperlog.always.transfer" not found in Swift configuration [swift.properties]
>> [jonmon at communicado: ~/PADS/Swift/SwiftMontage/m101_tutorial/run.0030]$ cat swift.properties
>> execution.retries=0
>> sitedir.keep=true
>> status.mode=provider
>> wrapper.log.always.transfer=true
>> foreach.maxthreads=1024
>> wrapper.parameter.mode=files
>> use.provider.staging=false
>> provider.staging.pin.swiftfiles=false
>>
>> Did something sneak into the trunk version that doesn't honor the command line options?
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
>

--
Justin M Wozniak

From jonmon at mcs.anl.gov Wed Oct 12 10:38:05 2011
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Wed, 12 Oct 2011 10:38:05 -0500
Subject: [Swift-devel] Swift command not recognizing options
Message-ID: <20111012153748.AB0E31275E@zimbra.anl.gov>

Ok. I found those lines in the loader and changed them in my code but wasn't sure why they were there.

----- Reply message -----
From: "Justin M Wozniak"
Date: Wed, Oct 12, 2011 10:24 am
Subject: [Swift-devel] Swift command not recognizing options
To: "Jonathan Monette"
Cc: "swift-devel Devel"

On Tue, 11 Oct 2011, Jonathan Monette wrote:

> Not sure why it is printing out the
> "/home/jonmon/Library/Swift/bin/../etc/sites.xml" business

My mistake.

> or why the Swift version reports that this is 0.93 when this is trunk.

Again, me- this is a new output line based on our discussion that we should include the branch/version number in the output in addition to the svn revision number. I now see that this should be "trunk" if trunk and not just the last release version.

Justin

> On Oct 11, 2011, at 5:23 PM, Jonathan Monette wrote:
>
>> Hello,
>> I do not believe Swift is honoring any options passed to it in trunk.
Here is my command line: >> swift -sites.file sites.xml -tc.file tc.data -config swift.properties montage.swift >> >> But the execution goes like this: >> [jonmon at communicado: ~/PADS/Swift/SwiftMontage/m101_tutorial/run.0030]$ swift -sites.file sites.xml -tc.file tc.data -config swift.properties montage.swift >> Swift version: 0.93 >> /home/jonmon/Library/Swift/bin/../etc/sites.xml >> Swift svn swift-r5219 cog-r3296 >> >> RunID: 20111011-1719-std6z8pa >> (input): found 10 files >> Progress: time: Tue, 11 Oct 2011 17:19:25 -0500 >> Progress: time: Tue, 11 Oct 2011 17:19:28 -0500 Selecting site:8 Stage in:1 Submitting:1 >> Progress: time: Tue, 11 Oct 2011 17:19:31 -0500 Selecting site:8 Active:1 Checking status:1 >> EXCEPTION Exception in mProjectPP_wrap: >> Arguments: [-X, raw_dir/raw_image_3.fits, proj_dir/proj_raw_image_3.fits, header.hdr] >> Host: localhost >> Directory: montage-20111011-1719-std6z8pa/jobs/d/mProjectPP_wrap-dc0bd6hk >> stderr.txt: >> >> stdout.txt: >> >> ---- >> >> Caused by: Swift config property "wrapperlog.always.transfer" not found in Swift configuration [swift.properties] >> >> Execution failed: >> EXCEPTION Exception in mProjectPP_wrap: >> Arguments: [-X, raw_dir/raw_image_7.fits, proj_dir/proj_raw_image_7.fits, header.hdr] >> Host: localhost >> Directory: montage-20111011-1719-std6z8pa/jobs/c/mProjectPP_wrap-cc0bd6hk >> stderr.txt: >> >> stdout.txt: >> >> ---- >> >> Caused by: Swift config property "wrapperlog.always.transfer" not found in Swift configuration [swift.properties] >> >> Swift config property "wrapperlog.always.transfer" not found in Swift configuration [swift.properties] >> >> I do not have wrapperlog.always.transfer in my properties file. 
I have wrapper.log.always.transfer: >> Swift config property "wrapperlog.always.transfer" not found in Swift configuration [swift.properties] >> [jonmon at communicado: ~/PADS/Swift/SwiftMontage/m101_tutorial/run.0030]$ cat swift.properties >> execution.retries=0 >> sitedir.keep=true >> status.mode=provider >> wrapper.log.always.transfer=true >> foreach.maxthreads=1024 >> wrapper.parameter.mode=files >> use.provider.staging=false >> provider.staging.pin.swiftfiles=false >> >> Did something sneak into the trunk version that doesn't honor the command line options? > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > -- Justin M Wozniak -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Wed Oct 12 14:13:44 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 12 Oct 2011 12:13:44 -0700 Subject: [Swift-devel] [Swift-user] gram on ranger In-Reply-To: References: <1908023276.141555.1318358942817.JavaMail.root@zimbra-mb2.anl.gov> <1318375406.2770.0.camel@blabla> Message-ID: <1318446824.18036.0.camel@blabla> On Tue, 2011-10-11 at 17:13 -0700, Sarah Kenny wrote: > > > On Tue, Oct 11, 2011 at 4:23 PM, Mihael Hategan > wrote: > Is this with a persistent coaster service? > > admittedly i have not used persistent coaster service...should i? No. I was just trying to figure out whether it might be something related to the persistent version. > i feel like it's documented *somewhere* (?) > > for now i've tried setting 'sitedir.keep=true' in the config so maybe > it won't try to run the cleanup job...we'll see (waiting in q) > > > > On Tue, 2011-10-11 at 12:05 -0700, Sarah Kenny wrote: > > > > > > On Tue, Oct 11, 2011 at 11:49 AM, David Kelly > > > wrote: > > > > That could be it.. maybe a cleanup script is not > getting the > > right parameters and failing. 
Do you happen to have > a copy of > > the coaster log? > > > > just put it in /home/skenny/swift_logs > > > > > > Maybe there will be some clues in there. > > > > ----- Original Message ----- > > > From: "Sarah Kenny" > > > > > To: "David Kelly" > > > Cc: "Swift Devel" , > "Swift > > User" , "Justin M > Wozniak" > > > > > > > > Sent: Tuesday, October 11, 2011 1:32:37 PM > > > Subject: Re: [Swift-user] gram on ranger > > > > > so, this workflow completes all the jobs but then > just hangs > > > indefinitely at the end...maybe a stray cleanup > job? > > > > > > log is here: > > > > > > > /home/skenny/swift_logs/corr-20111010-2104-fl5yngd9.log > > > > > > just tweaked the sites file a bit from what david > sent me: > > > > > > > > > > > > url=" > > > gatekeeper.ranger.tacc.teragrid.org "/> > > > > > > > > key="maxtime">28800 > > > > key="maxWallTime">00:15:00 > > > key="jobsPerNode">1 > > > > key="nodeGranularity">64 > > > key="maxNodes">256 > > > key="queue">normal > > > key="jobThrottle">1 > > > > key="project">TG-DBS080004N > > > key="pe">16way > > > > key="initialScore">10000 > > > > > > /work/00043/tg457040/sidgrid_out/skenny > > > > > > > > > > > > > > > > > > On Mon, Oct 10, 2011 at 3:43 PM, Sarah Kenny < > > skenny at uchicago.edu > > > > wrote: > > > > > > > > > ok, thanks, got in the queue now...also, realized > my last > > run may have > > > been using the old swift. apparently i had > SWIFT_HOME set in > > my env > > > and that overrides the newer swift i had set in my > PATH. > > > > > > ~sk > > > > > > > > > > > > On Mon, Oct 10, 2011 at 12:28 PM, David Kelly < > > davidk at ci.uchicago.edu > > > > wrote: > > > > > > > > > > > > > > > > > > Sarah, > > > > > > Can you give this another try with the latest > 0.93? I made > > some > > > changes to the coaster and sge providers and was > able to get > > it > > > working with a simple catns script. 
Here is the > > configuration file I > > > was using: > > > > > > > > > > > > url=" > > > gatekeeper.ranger.tacc.teragrid.org "/> > > > > > > > > > > > key="maxtime">3600 > > > > key="maxWallTime">00:00:03 > > > key="jobsPerNode">1 > > > > key="nodeGranularity">16 > > > key="maxNodes">16 > > > > key="queue">development > > > key="jobThrottle">0.9 > > > > > > > key="project">TG-DBS080004N > > > > > > key="pe">16way > > > > > > /share/home/01503/davidkel/swiftwork > > > > > > > > > > > > Thanks, > > > > > > David > > > > > > ----- Original Message ----- > > > > > > > From: "Sarah Kenny" < skenny at uchicago.edu > > > > > To: "Justin M Wozniak" < wozniak at mcs.anl.gov > > > > > Cc: "Swift Devel" < swift-devel at ci.uchicago.edu > >, "Swift > > User" < > > > > swift-user at ci.uchicago.edu > > > > > > > > > > > > > > Sent: Friday, October 7, 2011 3:13:57 PM > > > > Subject: Re: [Swift-user] gram on ranger > > > > > /home/skenny/swift_logs/dummy-20111005-0126-6575n7x5.log > > > > > > > > on ci > > > > > > > > > > > > On Fri, Oct 7, 2011 at 8:16 AM, Justin M Wozniak > < > > > > wozniak at mcs.anl.gov > > > > > wrote: > > > > > > > > > > > > > > > > Can I take a look at the log? > > > > > > > > > > > > > > > > > > > > On Thu, 6 Oct 2011, Sarah Kenny wrote: > > > > > > > > > > > > > > > > hey all, i'm trying to submit to gram on ranger > using the > > latest > > > > swift > > > > (built from trunk). it failes like so: > > > > > > > > Cannot submit job > > > > Caused by: > > > > org.globus.cog.abstraction. impl.common.task. > > > > TaskSubmissionException: > > > > Cannot > > > > submit job > > > > Caused by: org.globus.gram.GramException: > Parameter not > > supported > > > > Cannot submit job > > > > > > > > the gram log was saying first that 'jobsPerNode' > is not > > supported so > > > > i > > > > changed it to workersPerNode and then it was > saying > > 'maxnodes' is > > > > not > > > > supported. 
here's my sites file: > > > > > > > > > > > > > > > > key="initialScore">10000 > profile> > > > > key="jobThrottle">1 > > > > key="maxWallTime">00:15:00 > profile> > > > > key="maxTime">86400 > > > > key="slots">1 > > > > key="maxNodes">256 > > > > key="pe">16way > > > > key="workersPerNode">1 > profile> > > > > key="nodeGranularity">64 > profile> > > > > key="queue">normal > > > > key="project">TG-DBS080004N > profile> > > > > > > > > > > > jobManager="gt2:gt2:SGE" > > url=" > > > > gatekeeper.ranger.tacc. teragrid.org "/> > > > > > > > > > > > /work/00043/ > tg457040 > > > > > > > > > > > > > > > > > > > thoughts? ideas? > > > > > > > > -- > > > > Justin M Wozniak > > > > > > > > > > > > > > > > -- > > > > Sarah Kenny > > > > Programmer ~ Brain Circuits Laboratory ~ Rm 2224 > Bio Sci > > III > > > > University of California Irvine, Dept. of > Neurology ~ > > 773-818-8300 > > > > > > > > > > > > _______________________________________________ > > > > Swift-user mailing list > > > > Swift-user at ci.uchicago.edu > > > > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > > > > > > > > > > > > > > > > > > > > -- > > > Sarah Kenny > > > Programmer ~ Brain Circuits Laboratory ~ Rm 2224 > Bio Sci III > > > University of California Irvine, Dept. of > Neurology ~ > > 773-818-8300 > > > > > > > > > > > > > > > -- > > > Sarah Kenny > > > Programmer ~ Brain Circuits Laboratory ~ Rm 2224 > Bio Sci III > > > University of California Irvine, Dept. of > Neurology ~ > > 773-818-8300 > > > > > > > > > > -- > > Sarah Kenny > > Programmer ~ Brain Circuits Laboratory ~ Rm 2224 Bio Sci III > > University of California Irvine, Dept. 
of Neurology ~ > 773-818-8300 > > > > _______________________________________________ > > Swift-user mailing list > > Swift-user at ci.uchicago.edu > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > > > > > > -- > Sarah Kenny > Programmer ~ Brain Circuits Laboratory ~ Rm 2224 Bio Sci III > University of California Irvine, Dept. of Neurology ~ 773-818-8300 > From ketancmaheshwari at gmail.com Sun Oct 16 09:54:40 2011 From: ketancmaheshwari at gmail.com (Ketan Maheshwari) Date: Sun, 16 Oct 2011 09:54:40 -0500 Subject: [Swift-devel] Swift gsiftp staging issues on OSG Message-ID: Hello, While running an Extenci workflow on OSG with persistent coasters (multiple coaster services, one per OSG site) and gsiftp staging, I am facing some GridFTP-related issues. Following are some details of the run: A set of 15 OSG sites was selected after testing them for responsiveness ('greensites'). I performed a separate guc test on these sites, which seemed to succeed for each site (200MB roundtrip transfer in 7 mins for all sites). However, while running my workflow from Swift, many of these transfers fail with a variety of errors, most pertaining to the data transfers. I noticed that these transfers fail irrespective of data size (250K - 150M) and also seem to fail intermittently for different sites. The log for this run is here: http://www.mcs.anl.gov/~ketan/postproc-gridftp-20111013-2324-5qzebq16.log I am providing 7G of heap space on the Swift command line, and the host has 50G of total memory. Any ideas? Regards, -- Ketan -------------- next part -------------- An HTML attachment was scrubbed...
URL: From ketancmaheshwari at gmail.com Sun Oct 16 10:16:47 2011 From: ketancmaheshwari at gmail.com (Ketan Maheshwari) Date: Sun, 16 Oct 2011 10:16:47 -0500 Subject: [Swift-devel] Swift gsiftp staging issues on OSG In-Reply-To: References: Message-ID: Filed as bug: 589, https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=589 On Sun, Oct 16, 2011 at 9:54 AM, Ketan Maheshwari < ketancmaheshwari at gmail.com> wrote: > Hello, > > While running an Extenci workflow on OSG with persistent coasters (multiple > coasters services, 1 per OSG site) and gsiftp staging, I am facing some > gridftp related issues. Following are some details of the run: > > A set of 15 OSG sites were selected after testing them for being responsive > ('greensites'). I performed a separate guc test on these sites which seemed > to have succeeded for each site (200MB roundtrip transfer in 7 mins for all > sites). > > However, while running my workflow from Swift, many of these transfers fail > showing a variety of errors, most pertaining to the data transfers. > > I noticed, that these transfers fail irrespective of data sizes (250K - > 150M) and also seems to fail intermittently for different sites. > > The log for this run is here: > http://www.mcs.anl.gov/~ketan/postproc-gridftp-20111013-2324-5qzebq16.log > > I am providing a 7G of Heap space at Swift commandline and the host has 50G > of total memory. > > Any ideas? > > > Regards, > -- > Ketan > > > -- Ketan -------------- next part -------------- An HTML attachment was scrubbed... URL: From wilde at mcs.anl.gov Sun Oct 16 11:15:35 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sun, 16 Oct 2011 11:15:35 -0500 (CDT) Subject: [Swift-devel] Where is latest doc on running Swift on Beagle? Covers OpenMP apps? In-Reply-To: <20110919031410.5BFAF9CCFC@svn.ci.uchicago.edu> Message-ID: <878140598.100000.1318781735942.JavaMail.root@zimbra.anl.gov> David, Ketan, I need to run some things on Beagle, asap. 
Ketan, where is the latest and best documentation for this? I see your edits below to the 0.93 Site Guide. But I don't see that online where I would expect it: http://www.ci.uchicago.edu/swift/wwwdev/guides/release-0.93/siteguide/siteguide.html#_beagle David, is it just that this document is not being correctly pushed to the wwwdev site on a nightly basis? Ketan, is the latest info on running Swift on Beagle now all in the siteguide? Is the info you were putting in the cookbook (I see many commits there) now all consolidated into the Site Guide? And is there a difference in sites.xml settings between 0.93 and trunk? Lastly, which release works best? Second question: I need to run a script that executes many 24-core OpenMP apps. Is the necessary support for this in 0.93? What, if any, declarations do I need other than to say jobsPerNode=1? Glen, are you running OpenMP on Beagle, and if so, what release and sites file are you using? I'm assuming Justin's latest changes to sites.xml are in trunk but not 0.93? If that is correct, is there a corresponding sites file for Beagle for trunk? Thanks, - Mike ----- Forwarded Message ----- From: ketan at ci.uchicago.edu To: swift-commit at ci.uchicago.edu Sent: Sunday, September 18, 2011 10:14:10 PM Subject: [Swift-commit] r5126 - branches/release-0.93/docs/siteguide Author: ketan Date: 2011-09-18 22:14:10 -0500 (Sun, 18 Sep 2011) New Revision: 5126 Modified: branches/release-0.93/docs/siteguide/beagle Log: added content to beagle siteguide Modified: branches/release-0.93/docs/siteguide/beagle =================================================================== --- branches/release-0.93/docs/siteguide/beagle 2011-09-19 02:41:02 UTC (rev 5125) +++ branches/release-0.93/docs/siteguide/beagle 2011-09-19 03:14:10 UTC (rev 5126) @@ -52,9 +52,38 @@ A key factor in scaling up Swift runs on Beagle is to setup the sites.xml parameters.
The following sites.xml parameters must be set to scale that is intended for a large run: - * walltime: The expected walltime for completion of your run. This parameter is accepted in seconds. - * slots: Number of qsub jobs needs to be submitted by swift. This number will determine how many qsubs swift will submit for your run. Typical values range between 40 and 80 for large runs. - * nodegranularity: Determines the number of nodes per job. Total nodes will thus be slots times nodegranularity. This may vary for advanced configurations though. - * maxnodes: Determines the maximum number of nodes a job must pack into its qsub. This parameter determines the largest single job that your run will submit. + * *maxTime* : The expected walltime for completion of your run. This parameter is accepted in seconds. + * *slots* : Number of qsub jobs needs to be submitted by swift. This number will determine how many qsubs swift will submit for your run. Typical values range between 40 and 80 for large runs. + * *nodeGranularity* : Determines the number of nodes per job. Total nodes will thus be slots times nodegranularity. This may vary for advanced configurations though. + * *maxNodes* : Determines the maximum number of nodes a job must pack into its qsub. This parameter determines the largest single job that your run will submit. + * *jobThrottle* : A factor that determines the number of tasks dispatched simultaneously. The intended number of simultaneous tasks must match the number of cores targeted. 
The number of tasks is calculated from the jobThrottle factor is as follows: +---- +Number of Tasks = (JobThrottle x 100) + 1 +---- +Following is an example sites.xml for a 50 slots run with each slot occupying 4 nodes (thus, a 200 node run): + +----- + + + + CI-CCR000013 + + 24:cray:pack + + 24 + 50000 + 50 + 4 + 4 + + 48.00 + 10000 + + + /lustre/beagle/ketan/swift.workdir + + +----- + _______________________________________________ Swift-commit mailing list Swift-commit at ci.uchicago.edu https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-commit -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wilde at mcs.anl.gov Sun Oct 16 11:27:30 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sun, 16 Oct 2011 11:27:30 -0500 (CDT) Subject: [Swift-devel] Where is latest doc on running Swift on Beagle? Covers OpenMP apps? In-Reply-To: Message-ID: <900237985.100011.1318782450325.JavaMail.root@zimbra.anl.gov> Thanks, Glen! Justin, can you check the sites file below? I don't understand the interaction between the parameters OMP_NUM_THREADS, jobsPerNode, and depth. Where is the best documentation on that? Thanks, - Mike ----- Original Message ----- > From: "Glen Hocky" > To: "Michael Wilde" > Cc: "David Kelly" , "ketan" > Sent: Sunday, October 16, 2011 11:18:33 AM > Subject: Re: [Swift-devel] Where is latest doc on running Swift on Beagle? Covers OpenMP apps? > Yes, I'm running and yes I did test openmp a while back. Sites file > follows. I'm using trunk from a few months ago > > "Swift svn swift-r4813 (swift modified locally) cog-r3175" > > > > > > > key="providerAttributes">pbs.aprun;pbs.mpp;depth=24 > 24 > > > $PPN > $TIME > $MAXTIME > $nodes > 1 > 1 > 100 > 100 > 200.00 > 10000 > > > $swiftrundir/swiftwork > > > > > On Sun, Oct 16, 2011 at 12:15 PM, Michael Wilde < wilde at mcs.anl.gov > > wrote: > > > David, Ketan, > > I need to run some things on Beagle, asap.
> > Ketan, where is the latest and best documentation for this? I see your > edits below to the 0.93 Site Guide. But I dont see that online where I > would expect it: > > http://www.ci.uchicago.edu/swift/wwwdev/guides/release-0.93/siteguide/siteguide.html#_beagle > > David, is it just that this document is not being correctly pushed to > the wwwdev site on a nightly basis? > > Ketan, is the latest info on running Swift on Beagle now all in the > siteguide? Is the info you were putting in the cookbook (I see many > commits there) now all consolidated into the Site Guide? And is there > a difference in sites.xml settings between 0.93 and trunk? Lastly, > which release works best? > > Second question: I need to run a script that executes many 24-core > OpenMP apps. Is the necessary support for this in 0.93? What if any > declarations do I need other than to say jobsPerNode=1? Glen, are you > running OpenMP on Beagle and if so what release and sites file are you > using? > > Im assuming Justin's latest changes to sites.xml are in trunk but not > 0.93? If that is correct, is there a corresponding site site for > Beagle for trunk? > > Thanks, > > - Mike > > > ----- Forwarded Message ----- > From: ketan at ci.uchicago.edu > To: swift-commit at ci.uchicago.edu > Sent: Sunday, September 18, 2011 10:14:10 PM > Subject: [Swift-commit] r5126 - branches/release-0.93/docs/siteguide > > Author: ketan > Date: 2011-09-18 22:14:10 -0500 (Sun, 18 Sep 2011) > New Revision: 5126 > > Modified: > branches/release-0.93/docs/siteguide/beagle > Log: > added content to beagle siteguide > > Modified: branches/release-0.93/docs/siteguide/beagle > =================================================================== > --- branches/release-0.93/docs/siteguide/beagle 2011-09-19 02:41:02 > UTC (rev 5125) > +++ branches/release-0.93/docs/siteguide/beagle 2011-09-19 03:14:10 > UTC (rev 5126) > @@ -52,9 +52,38 @@ > A key factor in scaling up Swift runs on Beagle is to setup the > sites.xml parameters. 
> The following sites.xml parameters must be set to scale that is > intended for a large run: > > - * walltime: The expected walltime for completion of your run. This > parameter is accepted in seconds. > - * slots: Number of qsub jobs needs to be submitted by swift. This > number will determine how many qsubs swift will submit for your run. > Typical values range between 40 and 80 for large runs. > - * nodegranularity: Determines the number of nodes per job. Total > nodes will thus be slots times nodegranularity. This may vary for > advanced configurations though. > - * maxnodes: Determines the maximum number of nodes a job must pack > into its qsub. This parameter determines the largest single job that > your run will submit. > + * *maxTime* : The expected walltime for completion of your run. This > parameter is accepted in seconds. > + * *slots* : Number of qsub jobs needs to be submitted by swift. This > number will determine how many qsubs swift will submit for your run. > Typical values range between 40 and 80 for large runs. > + * *nodeGranularity* : Determines the number of nodes per job. Total > nodes will thus be slots times nodegranularity. This may vary for > advanced configurations though. > + * *maxNodes* : Determines the maximum number of nodes a job must > pack into its qsub. This parameter determines the largest single job > that your run will submit. > + * *jobThrottle* : A factor that determines the number of tasks > dispatched simultaneously. The intended number of simultaneous tasks > must match the number of cores targeted. 
The number of tasks is > calculated from the jobThrottle factor is as follows: > > +---- > +Number of Tasks = (JobThrottle x 100) + 1 > +---- > > +Following is an example sites.xml for a 50 slots run with each slot > occupying 4 nodes (thus, a 200 node run): > + > +----- > + > + > + > + CI-CCR000013 > + > + 24:cray:pack > + > + 24 > + 50000 > + 50 > + 4 > + 4 > + > + 48.00 > + 10000 > + > + > + /lustre/beagle/ketan/swift.workdir > + > + > +----- > + > > _______________________________________________ > Swift-commit mailing list > Swift-commit at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-commit > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From hockyg at uchicago.edu Sun Oct 16 11:30:55 2011 From: hockyg at uchicago.edu (Glen Hocky) Date: Sun, 16 Oct 2011 12:30:55 -0400 Subject: [Swift-devel] Where is latest doc on running Swift on Beagle? Covers OpenMP apps? In-Reply-To: <900237985.100011.1318782450325.JavaMail.root@zimbra.anl.gov> References: <900237985.100011.1318782450325.JavaMail.root@zimbra.anl.gov> Message-ID: It's in my run script that creates the actual sites file that I run with. I'm not sure what you would do if you wanted more than 24 cores, so depth stays fixed at 24 (that's an aprun parameter). Then WORKERSPERNODE=$((24/$PPN)), where PPN is how many cores you want per OpenMP app and WORKERSPERNODE is how many OpenMP apps run on each node. So the obvious example: if you want three 8-core OpenMP jobs per node, PPN=8 and WORKERSPERNODE=3. On Sun, Oct 16, 2011 at 12:27 PM, Michael Wilde wrote: > Thanks, Glen!
> > Justin, can you check the sites file below? I dont understand the > interaction between the parameters OMP_NUM_THREDS, jobsPerNode, and depth. > WHere is the best documentation on that? > > Thanks, > > - Mike > > > ----- Original Message ----- > > From: "Glen Hocky" > > To: "Michael Wilde" > > Cc: "David Kelly" , "ketan" < > ketancmaheshwari at gmail.com> > > Sent: Sunday, October 16, 2011 11:18:33 AM > > Subject: Re: [Swift-devel] Where is latest doc on running Swift on > Beagle? Covers OpenMP apps? > > Yes, I'm running and yes I did test openmp a while back. Sites file > > follows. I'm using trunk from a few months ago > > > > "Swift svn swift-r4813 (swift modified locally) cog-r3175" > > > > > > > > > > > > > > > key="providerAttributes">pbs.aprun;pbs.mpp;depth=24 > > 24 > > > > > > $PPN > > $TIME > > $MAXTIME > > $nodes > > 1 > > 1 > > 100 > > 100 > > 200.00 > > 10000 > > > > > > $swiftrundir/swiftwork > > > > > > > > > > On Sun, Oct 16, 2011 at 12:15 PM, Michael Wilde < wilde at mcs.anl.gov > > > wrote: > > > > > > David, Ketan, > > > > I need to run some things on Beagle, asap. > > > > Ketan, where is the latest and best documentation for this? I see your > > edits below to the 0.93 Site Guide. But I dont see that online where I > > would expect it: > > > > > http://www.ci.uchicago.edu/swift/wwwdev/guides/release-0.93/siteguide/siteguide.html#_beagle > > > > David, is it just that this document is not being correctly pushed to > > the wwwdev site on a nightly basis? > > > > Ketan, is the latest info on running Swift on Beagle now all in the > > siteguide? Is the info you were putting in the cookbook (I see many > > commits there) now all consolidated into the Site Guide? And is there > > a difference in sites.xml settings between 0.93 and trunk? Lastly, > > which release works best? > > > > Second question: I need to run a script that executes many 24-core > > OpenMP apps. Is the necessary support for this in 0.93? 
What if any > > declarations do I need other than to say jobsPerNode=1? Glen, are you > > running OpenMP on Beagle and if so what release and sites file are you > > using? > > > > Im assuming Justin's latest changes to sites.xml are in trunk but not > > 0.93? If that is correct, is there a corresponding site site for > > Beagle for trunk? > > > > Thanks, > > > > - Mike > > > > > > ----- Forwarded Message ----- > > From: ketan at ci.uchicago.edu > > To: swift-commit at ci.uchicago.edu > > Sent: Sunday, September 18, 2011 10:14:10 PM > > Subject: [Swift-commit] r5126 - branches/release-0.93/docs/siteguide > > > > Author: ketan > > Date: 2011-09-18 22:14:10 -0500 (Sun, 18 Sep 2011) > > New Revision: 5126 > > > > Modified: > > branches/release-0.93/docs/siteguide/beagle > > Log: > > added content to beagle siteguide > > > > Modified: branches/release-0.93/docs/siteguide/beagle > > =================================================================== > > --- branches/release-0.93/docs/siteguide/beagle 2011-09-19 02:41:02 > > UTC (rev 5125) > > +++ branches/release-0.93/docs/siteguide/beagle 2011-09-19 03:14:10 > > UTC (rev 5126) > > @@ -52,9 +52,38 @@ > > A key factor in scaling up Swift runs on Beagle is to setup the > > sites.xml parameters. > > The following sites.xml parameters must be set to scale that is > > intended for a large run: > > > > - * walltime: The expected walltime for completion of your run. This > > parameter is accepted in seconds. > > - * slots: Number of qsub jobs needs to be submitted by swift. This > > number will determine how many qsubs swift will submit for your run. > > Typical values range between 40 and 80 for large runs. > > - * nodegranularity: Determines the number of nodes per job. Total > > nodes will thus be slots times nodegranularity. This may vary for > > advanced configurations though. > > - * maxnodes: Determines the maximum number of nodes a job must pack > > into its qsub. 
This parameter determines the largest single job that > > your run will submit. > > + * *maxTime* : The expected walltime for completion of your run. This > > parameter is accepted in seconds. > > + * *slots* : Number of qsub jobs needs to be submitted by swift. This > > number will determine how many qsubs swift will submit for your run. > > Typical values range between 40 and 80 for large runs. > > + * *nodeGranularity* : Determines the number of nodes per job. Total > > nodes will thus be slots times nodegranularity. This may vary for > > advanced configurations though. > > + * *maxNodes* : Determines the maximum number of nodes a job must > > pack into its qsub. This parameter determines the largest single job > > that your run will submit. > > + * *jobThrottle* : A factor that determines the number of tasks > > dispatched simultaneously. The intended number of simultaneous tasks > > must match the number of cores targeted. The number of tasks is > > calculated from the jobThrottle factor is as follows: > > > > +---- > > +Number of Tasks = (JobThrottle x 100) + 1 > > +---- > > > > +Following is an example sites.xml for a 50 slots run with each slot > > occupying 4 nodes (thus, a 200 node run): > > + > > +----- > > + > > + > > + > > + CI-CCR000013 > > + > > + 24:cray:pack > > + > > + 24 > > + 50000 > > + 50 > > + 4 > > + 4 > > + > > + 48.00 > > + 10000 > > + > > + > > + /lustre/beagle/ketan/swift.workdir > > + > > + > > +----- > > + > > > > _______________________________________________ > > Swift-commit mailing list > > Swift-commit at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-commit > > > > -- > > Michael Wilde > > Computation Institute, University of Chicago > > Mathematics and Computer Science Division > > Argonne National Laboratory > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > 
https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ketancmaheshwari at gmail.com Sun Oct 16 12:42:23 2011 From: ketancmaheshwari at gmail.com (Ketan Maheshwari) Date: Sun, 16 Oct 2011 12:42:23 -0500 Subject: [Swift-devel] Where is latest doc on running Swift on Beagle? Covers OpenMP apps? In-Reply-To: <878140598.100000.1318781735942.JavaMail.root@zimbra.anl.gov> References: <20110919031410.5BFAF9CCFC@svn.ci.uchicago.edu> <878140598.100000.1318781735942.JavaMail.root@zimbra.anl.gov> Message-ID: On Sun, Oct 16, 2011 at 11:15 AM, Michael Wilde wrote: > David, Ketan, > > I need to run some things on Beagle, asap. > > Ketan, where is the latest and best documentation for this? I see your > edits below to the 0.93 Site Guide. But I dont see that online where I > would expect it: > > > http://www.ci.uchicago.edu/swift/wwwdev/guides/release-0.93/siteguide/siteguide.html#_beagle > > David, is it just that this document is not being correctly pushed to the > wwwdev site on a nightly basis? > That seems to be the case. I have committed a small change just now; maybe that will trigger a doc build. The page you linked is not the latest for Swift on Beagle. See this one, which has documentation for scaling up runs on Beagle: http://www.ci.uchicago.edu/~ketan/swift-docs/release-0.93/siteguide/siteguide.html#_beagle > > Ketan, is the latest info on running Swift on Beagle now all in the > siteguide? Is the info you were putting in the cookbook (I see many commits > there) now all consolidated into the Site Guide? And is there a difference > in sites.xml settings between 0.93 and trunk? Lastly, which release works > best? > Yes, the siteguide for release-0.93 has the latest documentation on running Swift on Beagle.
My cookbook info is all consolidated into the siteguide. There is no difference between the sites files for 0.93 and trunk. Regards, Ketan > Second question: I need to run a script that executes many 24-core OpenMP > apps. Is the necessary support for this in 0.93? What if any declarations > do I need other than to say jobsPerNode=1? Glen, are you running OpenMP on > Beagle and if so what release and sites file are you using? > > Im assuming Justin's latest changes to sites.xml are in trunk but not 0.93? > If that is correct, is there a corresponding site site for Beagle for > trunk? > > Thanks, > > - Mike > > > ----- Forwarded Message ----- > From: ketan at ci.uchicago.edu > To: swift-commit at ci.uchicago.edu > Sent: Sunday, September 18, 2011 10:14:10 PM > Subject: [Swift-commit] r5126 - branches/release-0.93/docs/siteguide > > Author: ketan > Date: 2011-09-18 22:14:10 -0500 (Sun, 18 Sep 2011) > New Revision: 5126 > > Modified: > branches/release-0.93/docs/siteguide/beagle > Log: > added content to beagle siteguide > > Modified: branches/release-0.93/docs/siteguide/beagle > =================================================================== > --- branches/release-0.93/docs/siteguide/beagle 2011-09-19 02:41:02 UTC > (rev 5125) > +++ branches/release-0.93/docs/siteguide/beagle 2011-09-19 03:14:10 UTC > (rev 5126) > @@ -52,9 +52,38 @@ > A key factor in scaling up Swift runs on Beagle is to setup the sites.xml > parameters. > The following sites.xml parameters must be set to scale that is intended > for a large run: > > - * walltime: The expected walltime for completion of your run. This > parameter is accepted in seconds. > - * slots: Number of qsub jobs needs to be submitted by swift. This number > will determine how many qsubs swift will submit for your run. Typical values > range between 40 and 80 for large runs. > - * nodegranularity: Determines the number of nodes per job. Total nodes > will thus be slots times nodegranularity.
This may vary for advanced > configurations though. > - * maxnodes: Determines the maximum number of nodes a job must pack into > its qsub. This parameter determines the largest single job that your run > will submit. > + * *maxTime* : The expected walltime for completion of your run. This > parameter is accepted in seconds. > + * *slots* : Number of qsub jobs needs to be submitted by swift. This > number will determine how many qsubs swift will submit for your run. Typical > values range between 40 and 80 for large runs. > + * *nodeGranularity* : Determines the number of nodes per job. Total nodes > will thus be slots times nodegranularity. This may vary for advanced > configurations though. > + * *maxNodes* : Determines the maximum number of nodes a job must pack > into its qsub. This parameter determines the largest single job that your > run will submit. > + * *jobThrottle* : A factor that determines the number of tasks dispatched > simultaneously. The intended number of simultaneous tasks must match the > number of cores targeted. The number of tasks is calculated from the > jobThrottle factor is as follows: > > +---- > +Number of Tasks = (JobThrottle x 100) + 1 > +---- > > +Following is an example sites.xml for a 50 slots run with each slot > occupying 4 nodes (thus, a 200 node run): > + > +----- > + > + > + > + CI-CCR000013 > + > + 24:cray:pack > + > + 24 > + 50000 > + 50 > + 4 > + 4 > + > + 48.00 > + 10000 > + > + > + /lustre/beagle/ketan/swift.workdir > + > + > +----- > + > > _______________________________________________ > Swift-commit mailing list > Swift-commit at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-commit > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > -- Ketan -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From davidk at ci.uchicago.edu Sun Oct 16 13:46:08 2011 From: davidk at ci.uchicago.edu (David Kelly) Date: Sun, 16 Oct 2011 13:46:08 -0500 (CDT) Subject: [Swift-devel] Where is latest doc on running Swift on Beagle? Covers OpenMP apps? In-Reply-To: Message-ID: <648032314.149160.1318790768294.JavaMail.root@zimbra-mb2.anl.gov> Yep - I was in the process of migrating the automated SVN jobs to the swift user, but it looks like it wasn't running correctly due to filesystem permissions. I am manually running the update now. It should be updated within 15 minutes or so. David ----- Original Message ----- > From: "Ketan Maheshwari" > To: "Michael Wilde" > Cc: "David Kelly" , "Swift Devel" > Sent: Sunday, October 16, 2011 12:42:23 PM > Subject: Re: Where is latest doc on running Swift on Beagle? Covers OpenMP apps? > On Sun, Oct 16, 2011 at 11:15 AM, Michael Wilde < wilde at mcs.anl.gov > > wrote: > > > David, Ketan, > > I need to run some things on Beagle, asap. > > Ketan, where is the latest and best documentation for this? I see your > edits below to the 0.93 Site Guide. But I dont see that online where I > would expect it: > > http://www.ci.uchicago.edu/swift/wwwdev/guides/release-0.93/siteguide/siteguide.html#_beagle > > David, is it just that this document is not being correctly pushed to > the wwwdev site on a nightly basis? > > > > That seems to be the case. I have committed a little change just now, > may be that will trigger a doc build. The link you mentioned is not > the latest for Swift on Beagle. See this one which has documentation > for scaling up runs on Beagle: > > > http://www.ci.uchicago.edu/~ketan/swift-docs/release-0.93/siteguide/siteguide.html#_beagle > > > > Ketan, is the latest info on running Swift on Beagle now all in the > siteguide? Is the info you were putting in the cookbook (I see many > commits there) now all consolidated into the Site Guide? And is there > a difference in sites.xml settings between 0.93 and trunk? 
Lastly, > which release works best? > > > > Yes, the sitesguide for release-0.93 is the latest on Swift Beagle > documentation. My cookbook info is all consolidated on sitesguide. > There is no difference between sites file for 0.93 and trunk. > > > Regards, > Ketan > > > > > Second question: I need to run a script that executes many 24-core > OpenMP apps. Is the necessary support for this in 0.93? What if any > declarations do I need other than to say jobsPerNode=1? Glen, are you > running OpenMP on Beagle and if so what release and sites file are you > using? > > Im assuming Justin's latest changes to sites.xml are in trunk but not > 0.93? If that is correct, is there a corresponding site site for > Beagle for trunk? > > Thanks, > > - Mike > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > > > > > -- > Ketan
From hategan at mcs.anl.gov Sun Oct 16 15:06:39 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 16 Oct 2011 13:06:39 -0700 Subject: [Swift-devel] Swift gsiftp staging issues on OSG In-Reply-To: References: Message-ID: <1318795599.24888.20.camel@blabla> There are craploads of errors in there of all kinds and sorts, but very few of them are actual transfer problems. It looks more like gridftp/filesystem configuration issues. I attached a sorted list of exception. However, this is irrelevant. It looks like so far we keep running this test that clearly doesn't work and hope that it will work. That's silly. We need to figure out each problem individually and fix things one by one. So here's my proposal. We list all the problems that can be seen in that log and try to fix them in order. And we do not re-run the whole thing unless we actually solved at least one problem. Also, we sync periodically on what was done (i.e. we keep a list that we update immediately after something was done about an item). Also, before doing an integration test after a problem is fixed, we do a test for that specific problem/on a specific site only.
There is way too much noise in these big runs and that makes it very hard to see what is happening. So here's a first list: http://www.ci.uchicago.edu/wiki/bin/view/SWFT/OSGTesting On Sun, 2011-10-16 at 09:54 -0500, Ketan Maheshwari wrote: > Hello, > > > While running an Extenci workflow on OSG with persistent coasters > (multiple coasters services, 1 per OSG site) and gsiftp staging, I am > facing some gridftp related issues. Following are some details of the > run: > > > A set of 15 OSG sites were selected after testing them for being > responsive ('greensites'). I performed a separate guc test on these > sites which seemed to have succeeded for each site (200MB roundtrip > transfer in 7 mins for all sites). > > > However, while running my workflow from Swift, many of these transfers > fail showing a variety of errors, most pertaining to the data > transfers. > > > I noticed, that these transfers fail irrespective of data sizes (250K > - 150M) and also seems to fail intermittently for different sites. > > > The log for this run is > here: http://www.mcs.anl.gov/~ketan/postproc-gridftp-20111013-2324-5qzebq16.log > > > I am providing a 7G of Heap space at Swift commandline and the host > has 50G of total memory. > > > Any ideas? > > > > > Regards, > -- > Ketan > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel -------------- next part -------------- A non-text attachment was scrubbed... 
Name: err2.txt.gz Type: application/x-gzip Size: 10001 bytes Desc: not available URL: From ketancmaheshwari at gmail.com Mon Oct 17 08:46:00 2011 From: ketancmaheshwari at gmail.com (Ketan Maheshwari) Date: Mon, 17 Oct 2011 08:46:00 -0500 Subject: [Swift-devel] Swift gsiftp staging issues on OSG In-Reply-To: <1318795599.24888.20.camel@blabla> References: <1318795599.24888.20.camel@blabla> Message-ID: Mihael, I've been updating the wiki page with the test results you listed. So far I tested for 7 OSG sites out of which 2 failed and 5 worked. I've uploaded logs for each test that you can check from the link alongside each test. I'll carry on with further tests. Regards, Ketan On Sun, Oct 16, 2011 at 3:06 PM, Mihael Hategan wrote: > There are craploads of errors in there of all kinds and sorts, but very > few of them are actual transfer problems. It looks more like > gridftp/filesystem configuration issues. > > I attached a sorted list of exception. > > However, this is irrelevant. It looks like so far we keep running this > test that clearly doesn't work and hope that it will work. That's silly. > We need to figure out each problem individually and fix things one by > one. > > So here's my proposal. We list all the problems that can be seen in that > log and try to fix them in order. And we do not re-run the whole thing > unless we actually solved at least one problem. Also, we sync > periodically on what was done (i.e. we keep a list that we update > immediately after something was done about an item). Also, before doing > an integration test after a problem is fixed, we do a test for that > specific problem/on a specific site only. There is way too much noise in > these big runs and that makes it very hard to see what is happening. 
> > So here's a first list: > http://www.ci.uchicago.edu/wiki/bin/view/SWFT/OSGTesting > > > > On Sun, 2011-10-16 at 09:54 -0500, Ketan Maheshwari wrote: > > Hello, > > > > > > While running an Extenci workflow on OSG with persistent coasters > > (multiple coasters services, 1 per OSG site) and gsiftp staging, I am > > facing some gridftp related issues. Following are some details of the > > run: > > > > > > A set of 15 OSG sites were selected after testing them for being > > responsive ('greensites'). I performed a separate guc test on these > > sites which seemed to have succeeded for each site (200MB roundtrip > > transfer in 7 mins for all sites). > > > > > > However, while running my workflow from Swift, many of these transfers > > fail showing a variety of errors, most pertaining to the data > > transfers. > > > > > > I noticed, that these transfers fail irrespective of data sizes (250K > > - 150M) and also seems to fail intermittently for different sites. > > > > > > The log for this run is > > here: > http://www.mcs.anl.gov/~ketan/postproc-gridftp-20111013-2324-5qzebq16.log > > > > > > I am providing a 7G of Heap space at Swift commandline and the host > > has 50G of total memory. > > > > > > Any ideas? > > > > > > > > > > Regards, > > -- > > Ketan > > > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > -- Ketan -------------- next part -------------- An HTML attachment was scrubbed... URL: From wozniak at mcs.anl.gov Mon Oct 17 10:02:15 2011 From: wozniak at mcs.anl.gov (Justin M Wozniak) Date: Mon, 17 Oct 2011 10:02:15 -0500 (Central Daylight Time) Subject: [Swift-devel] Where is latest doc on running Swift on Beagle? Covers OpenMP apps? 
In-Reply-To: <648032314.149160.1318790768294.JavaMail.root@zimbra-mb2.anl.gov> References: <648032314.149160.1318790768294.JavaMail.root@zimbra-mb2.anl.gov> Message-ID: My notes about Beagle are at: https://sites.google.com/site/swiftdevel/sites/pbs/cray and the Beagle sub-page. Let me know if you get stuck on anything. Justin On Sun, 16 Oct 2011, David Kelly wrote: > > Yep - I was in the process of migrating the automated SVN jobs to the swift user, but it looks like it wasn't running correctly due to filesystem permissions. I am manually running the update now. It should be updated within 15 minutes or so. > > David > > ----- Original Message ----- >> From: "Ketan Maheshwari" >> To: "Michael Wilde" >> Cc: "David Kelly" , "Swift Devel" >> Sent: Sunday, October 16, 2011 12:42:23 PM >> Subject: Re: Where is latest doc on running Swift on Beagle? Covers OpenMP apps? >> On Sun, Oct 16, 2011 at 11:15 AM, Michael Wilde < wilde at mcs.anl.gov > >> wrote: >> >> >> David, Ketan, >> >> I need to run some things on Beagle, asap. >> >> Ketan, where is the latest and best documentation for this? I see your >> edits below to the 0.93 Site Guide. But I dont see that online where I >> would expect it: >> >> http://www.ci.uchicago.edu/swift/wwwdev/guides/release-0.93/siteguide/siteguide.html#_beagle >> >> David, is it just that this document is not being correctly pushed to >> the wwwdev site on a nightly basis? >> >> >> >> That seems to be the case. I have committed a little change just now, >> may be that will trigger a doc build. The link you mentioned is not >> the latest for Swift on Beagle. See this one which has documentation >> for scaling up runs on Beagle: >> >> >> http://www.ci.uchicago.edu/~ketan/swift-docs/release-0.93/siteguide/siteguide.html#_beagle >> >> >> >> Ketan, is the latest info on running Swift on Beagle now all in the >> siteguide? Is the info you were putting in the cookbook (I see many >> commits there) now all consolidated into the Site Guide? 
And is there >> a difference in sites.xml settings between 0.93 and trunk? Lastly, >> which release works best? >> >> >> >> Yes, the sitesguide for release-0.93 is the latest on Swift Beagle >> documentation. My cookbook info is all consolidated on sitesguide. >> There is no difference between sites file for 0.93 and trunk. >> >> >> Regards, >> Ketan >> >> >> >> >> Second question: I need to run a script that executes many 24-core >> OpenMP apps. Is the necessary support for this in 0.93? What if any >> declarations do I need other than to say jobsPerNode=1? Glen, are you >> running OpenMP on Beagle and if so what release and sites file are you >> using? >> >> Im assuming Justin's latest changes to sites.xml are in trunk but not >> 0.93? If that is correct, is there a corresponding site site for >> Beagle for trunk? >> >> Thanks, >> >> - Mike >> >> -- >> Michael Wilde >> Computation Institute, University of Chicago >> Mathematics and Computer Science Division >> Argonne National Laboratory >> >> >> >> >> >> -- >> Ketan > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > -- Justin M Wozniak
From wozniak at mcs.anl.gov Mon Oct 17 11:21:13 2011 From: wozniak at mcs.anl.gov (Justin M Wozniak) Date: Mon, 17 Oct 2011 11:21:13 -0500 (Central Daylight Time) Subject: [Swift-devel] Where is latest doc on running Swift on Beagle? Covers OpenMP apps? In-Reply-To: <900237985.100011.1318782450325.JavaMail.root@zimbra.anl.gov> References: <900237985.100011.1318782450325.JavaMail.root@zimbra.anl.gov> Message-ID: I have not tried an OMP job on Beagle. The settings below look good to me. On Sun, 16 Oct 2011, Michael Wilde wrote: > Thanks, Glen! > > Justin, can you check the sites file below? I dont understand the > interaction between the parameters OMP_NUM_THREDS, jobsPerNode, and > depth. WHere is the best documentation on that?
> > Thanks, > > - Mike > > > ----- Original Message ----- >> From: "Glen Hocky" >> To: "Michael Wilde" >> Cc: "David Kelly" , "ketan" >> Sent: Sunday, October 16, 2011 11:18:33 AM >> Subject: Re: [Swift-devel] Where is latest doc on running Swift on Beagle? Covers OpenMP apps? >> Yes, I'm running and yes I did test openmp a while back. Sites file >> follows. I'm using trunk from a few months ago >> >> "Swift svn swift-r4813 (swift modified locally) cog-r3175" >> >> >> >> >> >> >> > key="providerAttributes">pbs.aprun;pbs.mpp;depth=24 >> 24 >> >> >> $PPN >> $TIME >> $MAXTIME >> $nodes >> 1 >> 1 >> 100 >> 100 >> 200.00 >> 10000 >> >> >> $swiftrundir/swiftwork >> >> >> >> >> On Sun, Oct 16, 2011 at 12:15 PM, Michael Wilde < wilde at mcs.anl.gov > >> wrote: >> >> >> David, Ketan, >> >> I need to run some things on Beagle, asap. >> >> Ketan, where is the latest and best documentation for this? I see your >> edits below to the 0.93 Site Guide. But I dont see that online where I >> would expect it: >> >> http://www.ci.uchicago.edu/swift/wwwdev/guides/release-0.93/siteguide/siteguide.html#_beagle >> >> David, is it just that this document is not being correctly pushed to >> the wwwdev site on a nightly basis? >> >> Ketan, is the latest info on running Swift on Beagle now all in the >> siteguide? Is the info you were putting in the cookbook (I see many >> commits there) now all consolidated into the Site Guide? And is there >> a difference in sites.xml settings between 0.93 and trunk? Lastly, >> which release works best? >> >> Second question: I need to run a script that executes many 24-core >> OpenMP apps. Is the necessary support for this in 0.93? What if any >> declarations do I need other than to say jobsPerNode=1? Glen, are you >> running OpenMP on Beagle and if so what release and sites file are you >> using? >> >> Im assuming Justin's latest changes to sites.xml are in trunk but not >> 0.93? 
If that is correct, is there a corresponding site site for >> Beagle for trunk? >> >> Thanks, >> >> - Mike >> >> -- >> Michael Wilde >> Computation Institute, University of Chicago >> Mathematics and Computer Science Division >> Argonne National Laboratory >> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > -- Justin M Wozniak
From wozniak at mcs.anl.gov Mon Oct 17 11:23:02 2011 From: wozniak at mcs.anl.gov (Justin M Wozniak) Date: Mon, 17 Oct 2011 11:23:02 -0500 (Central Daylight Time) Subject: [Swift-devel] Where is latest doc on running Swift on Beagle? Covers OpenMP apps?
In-Reply-To: References: <900237985.100011.1318782450325.JavaMail.root@zimbra.anl.gov> Message-ID: Glen, do you have an extremely simple but relevant OpenMP program that we could stick in the test suite? On Sun, 16 Oct 2011, Glen Hocky wrote: > It's in my run script that creates the actual sites file that I run with. > I'm not sure what you would do if you wanted more than 24 cores, so depth > stays fixed at 24 (that's an aprun parameters). Then > > WORKERSPERNODE=$((24/$PPN)) > > Where PPN is how many cores you want per OPENMP app and then workers per > node says how many OPENMP apps you want to run. So obvious example would be > you want 3 8 core OPENMP jobs, PPN = 8, WORKERSPERNODE=3 > > > On Sun, Oct 16, 2011 at 12:27 PM, Michael Wilde wrote: > >> Thanks, Glen! >> >> Justin, can you check the sites file below? I dont understand the >> interaction between the parameters OMP_NUM_THREDS, jobsPerNode, and depth. >> WHere is the best documentation on that? >> >> Thanks, >> >> - Mike >> >> >> ----- Original Message ----- >>> From: "Glen Hocky" >>> To: "Michael Wilde" >>> Cc: "David Kelly" , "ketan" < >> ketancmaheshwari at gmail.com> >>> Sent: Sunday, October 16, 2011 11:18:33 AM >>> Subject: Re: [Swift-devel] Where is latest doc on running Swift on >> Beagle? Covers OpenMP apps? >>> Yes, I'm running and yes I did test openmp a while back. Sites file >>> follows. I'm using trunk from a few months ago >>> >>> "Swift svn swift-r4813 (swift modified locally) cog-r3175" >>> >>> >>> >>> >>> >>> >>> >> key="providerAttributes">pbs.aprun;pbs.mpp;depth=24 >>> 24 >>> >>> >>> $PPN >>> $TIME >>> $MAXTIME >>> $nodes >>> 1 >>> 1 >>> 100 >>> 100 >>> 200.00 >>> 10000 >>> >>> >>> $swiftrundir/swiftwork >>> >>> >>> >>> >>> On Sun, Oct 16, 2011 at 12:15 PM, Michael Wilde < wilde at mcs.anl.gov > >>> wrote: >>> >>> >>> David, Ketan, >>> >>> I need to run some things on Beagle, asap. >>> >>> Ketan, where is the latest and best documentation for this? 
I see your >>> edits below to the 0.93 Site Guide. But I dont see that online where I >>> would expect it: >>> >>> >> http://www.ci.uchicago.edu/swift/wwwdev/guides/release-0.93/siteguide/siteguide.html#_beagle >>> >>> David, is it just that this document is not being correctly pushed to >>> the wwwdev site on a nightly basis? >>> >>> Ketan, is the latest info on running Swift on Beagle now all in the >>> siteguide? Is the info you were putting in the cookbook (I see many >>> commits there) now all consolidated into the Site Guide? And is there >>> a difference in sites.xml settings between 0.93 and trunk? Lastly, >>> which release works best? >>> >>> Second question: I need to run a script that executes many 24-core >>> OpenMP apps. Is the necessary support for this in 0.93? What if any >>> declarations do I need other than to say jobsPerNode=1? Glen, are you >>> running OpenMP on Beagle and if so what release and sites file are you >>> using? >>> >>> Im assuming Justin's latest changes to sites.xml are in trunk but not >>> 0.93? If that is correct, is there a corresponding site site for >>> Beagle for trunk? 
>>> >>> Thanks, >>> >>> - Mike >>> >>> -- >>> Michael Wilde >>> Computation Institute, University of Chicago >>> Mathematics and Computer Science Division >>> Argonne National Laboratory >>> >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel >> >> -- >> Michael Wilde >> Computation Institute, University of Chicago >> Mathematics and Computer Science Division >> Argonne National Laboratory >> >> > -- Justin M Wozniak
From hockyg at uchicago.edu Mon Oct 17 11:30:17 2011 From: hockyg at uchicago.edu (Glen Hocky) Date: Mon, 17 Oct 2011 12:30:17 -0400 Subject: [Swift-devel] Where
 is latest doc on running Swift on Beagle? Covers OpenMP apps?
In-Reply-To:
References: <900237985.100011.1318782450325.JavaMail.root@zimbra.anl.gov>
Message-ID:

Justin, I'm not sure my program counts as sufficiently simple for this
purpose. I'd be happy to let you include it and get an example set up,
though, if you want to use it anyway. The OpenMP part, which I haven't
been using recently, may need a bit of debugging as well.

Glen

On Mon, Oct 17, 2011 at 12:23 PM, Justin M Wozniak wrote:

>
> Glen, do you have an extremely simple but relevant OpenMP program that we
> could stick in the test suite?
>
>
> On Sun, 16 Oct 2011, Glen Hocky wrote:
>
>> It's in my run script that creates the actual sites file that I run with.
>> I'm not sure what you would do if you wanted more than 24 cores, so depth
>> stays fixed at 24 (that's an aprun parameter). Then
>>
>> WORKERSPERNODE=$((24/$PPN))
>>
>> where PPN is how many cores you want per OpenMP app, and workers per
>> node says how many OpenMP apps you want to run. So the obvious example
>> would be: you want three 8-core OpenMP jobs, so PPN=8 and WORKERSPERNODE=3.
>>
>>
>> On Sun, Oct 16, 2011 at 12:27 PM, Michael Wilde wrote:
>>
>>> Thanks, Glen!
>>>
>>> Justin, can you check the sites file below? I don't understand the
>>> interaction between the parameters OMP_NUM_THREADS, jobsPerNode, and
>>> depth. Where is the best documentation on that?
>>>
>>> Thanks,
>>>
>>> - Mike
>>>
>>>
>>> ----- Original Message -----
>>>> From: "Glen Hocky"
>>>> To: "Michael Wilde"
>>>> Cc: "David Kelly", "ketan" <ketancmaheshwari at gmail.com>
>>>> Sent: Sunday, October 16, 2011 11:18:33 AM
>>>> Subject: Re: [Swift-devel] Where is latest doc on running Swift on Beagle? Covers OpenMP apps?
>>>>
>>>> Yes, I'm running and yes I did test openmp a while back. Sites file
>>>> follows. I'm using trunk from a few months ago
>>>>
>>>> "Swift svn swift-r4813 (swift modified locally) cog-r3175"
>>>>
>>>> key="providerAttributes">pbs.aprun;pbs.mpp;depth=24
>>>> 24
>>>>
>>>> $PPN
>>>> $TIME
>>>> $MAXTIME
>>>> $nodes
>>>> 1
>>>> 1
>>>> 100
>>>> 100
>>>> 200.00
>>>> 10000
>>>>
>>>> $swiftrundir/swiftwork
>>>>
>>>>
>>>> On Sun, Oct 16, 2011 at 12:15 PM, Michael Wilde < wilde at mcs.anl.gov >
>>>> wrote:
>>>>
>>>> David, Ketan,
>>>>
>>>> I need to run some things on Beagle, asap.
>>>>
>>>> Ketan, where is the latest and best documentation for this? I see your
>>>> edits below to the 0.93 Site Guide. But I don't see that online where I
>>>> would expect it:
>>>>
>>>> http://www.ci.uchicago.edu/swift/wwwdev/guides/release-0.93/siteguide/siteguide.html#_beagle
>>>>
>>>> David, is it just that this document is not being correctly pushed to
>>>> the wwwdev site on a nightly basis?
>>>>
>>>> Ketan, is the latest info on running Swift on Beagle now all in the
>>>> siteguide? Is the info you were putting in the cookbook (I see many
>>>> commits there) now all consolidated into the Site Guide? And is there
>>>> a difference in sites.xml settings between 0.93 and trunk? Lastly,
>>>> which release works best?
>>>>
>>>> Second question: I need to run a script that executes many 24-core
>>>> OpenMP apps. Is the necessary support for this in 0.93? What if any
>>>> declarations do I need other than to say jobsPerNode=1? Glen, are you
>>>> running OpenMP on Beagle and if so what release and sites file are you
>>>> using?
>>>>
>>>> I'm assuming Justin's latest changes to sites.xml are in trunk but not
>>>> 0.93? If that is correct, is there a corresponding sites file for
>>>> Beagle for trunk?
>>>>
>>>> Thanks,
>>>>
>>>> - Mike
>>>>
>>>>
>>>> ----- Forwarded Message -----
>>>> From: ketan at ci.uchicago.edu
>>>> To: swift-commit at ci.uchicago.edu
>>>> Sent: Sunday, September 18, 2011 10:14:10 PM
>>>> Subject: [Swift-commit] r5126 - branches/release-0.93/docs/siteguide
>>>>
>>>> Author: ketan
>>>> Date: 2011-09-18 22:14:10 -0500 (Sun, 18 Sep 2011)
>>>> New Revision: 5126
>>>>
>>>> Modified:
>>>> branches/release-0.93/docs/siteguide/beagle
>>>> Log:
>>>> added content to beagle siteguide
>>>>
>>>> Modified: branches/release-0.93/docs/siteguide/beagle
>>>> ===================================================================
>>>> --- branches/release-0.93/docs/siteguide/beagle 2011-09-19 02:41:02 UTC (rev 5125)
>>>> +++ branches/release-0.93/docs/siteguide/beagle 2011-09-19 03:14:10 UTC (rev 5126)
>>>> @@ -52,9 +52,38 @@
>>>> A key factor in scaling up Swift runs on Beagle is to setup the
>>>> sites.xml parameters.
>>>> The following sites.xml parameters must be set to scale that is
>>>> intended for a large run:
>>>>
>>>> - * walltime: The expected walltime for completion of your run. This
>>>> parameter is accepted in seconds.
>>>> - * slots: Number of qsub jobs needs to be submitted by swift. This
>>>> number will determine how many qsubs swift will submit for your run.
>>>> Typical values range between 40 and 80 for large runs.
>>>> - * nodegranularity: Determines the number of nodes per job. Total
>>>> nodes will thus be slots times nodegranularity. This may vary for
>>>> advanced configurations though.
>>>> - * maxnodes: Determines the maximum number of nodes a job must pack
>>>> into its qsub. This parameter determines the largest single job that
>>>> your run will submit.
>>>> + * *maxTime* : The expected walltime for completion of your run. This
>>>> parameter is accepted in seconds.
>>>> + * *slots* : Number of qsub jobs needs to be submitted by swift. This
>>>> number will determine how many qsubs swift will submit for your run.
>>>> Typical values range between 40 and 80 for large runs.
>>>> + * *nodeGranularity* : Determines the number of nodes per job. Total
>>>> nodes will thus be slots times nodegranularity. This may vary for
>>>> advanced configurations though.
>>>> + * *maxNodes* : Determines the maximum number of nodes a job must
>>>> pack into its qsub. This parameter determines the largest single job
>>>> that your run will submit.
>>>> + * *jobThrottle* : A factor that determines the number of tasks
>>>> dispatched simultaneously. The intended number of simultaneous tasks
>>>> must match the number of cores targeted. The number of tasks is
>>>> calculated from the jobThrottle factor is as follows:
>>>>
>>>> +----
>>>> +Number of Tasks = (JobThrottle x 100) + 1
>>>> +----
>>>>
>>>> +Following is an example sites.xml for a 50 slots run with each slot
>>>> occupying 4 nodes (thus, a 200 node run):
>>>> +
>>>> +-----
>>>> +
>>>> +
>>>> +
>>>> + CI-CCR000013
>>>> +
>>>> + 24:cray:pack
>>>> +
>>>> + 24
>>>> + 50000
>>>> + 50
>>>> + 4
>>>> + 4
>>>> +
>>>> + 48.00
>>>> + 10000
>>>> +
>>>> +
>>>> + /lustre/beagle/ketan/swift.workdir
>>>> +
>>>> +
>>>> +-----
>>>> +
>>>>
>>>> --
>>>> Michael Wilde
>>>> Computation Institute, University of Chicago
>>>> Mathematics and Computer Science Division
>>>> Argonne National Laboratory
>>>>
>>>
>>> --
>>> Michael Wilde
>>> Computation Institute, University of Chicago
>>> Mathematics and Computer Science Division
>>> Argonne National Laboratory
>>>
>>
>
> --
> Justin M Wozniak

From wilde at mcs.anl.gov Mon Oct 17 11:52:54 2011
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 17 Oct 2011 11:52:54 -0500 (CDT)
Subject: [Swift-devel] Where is latest doc on running Swift on Beagle? Covers OpenMP apps?
In-Reply-To:
Message-ID: <1155572818.102314.1318870374382.JavaMail.root@zimbra.anl.gov>

I can help write a test case. It's just a for() loop with a #pragma in
front - very simple. If each parallel loop iteration could do
system("sleep N"), we could readily observe that the test is working
and spawning OMP_NUM_THREADS threads and procs.

- Mike

----- Original Message -----
> From: "Glen Hocky"
> To: "Justin M Wozniak"
> Cc: "Michael Wilde", "David Kelly", "ketan", "Swift Devel"
> Sent: Monday, October 17, 2011 11:30:17 AM
> Subject: Re: [Swift-devel] Where is latest doc on running Swift on Beagle? Covers OpenMP apps?
>
> Justin, I'm not sure my program counts as sufficiently simple for this
> purpose. I'd be happy to let you include it and get an example set up,
> though, if you want to use it anyway. The OpenMP part, which I haven't
> been using recently, may need a bit of debugging as well.
>
> Glen

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From wilde at mcs.anl.gov Mon Oct 17 11:59:00 2011
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 17 Oct 2011 11:59:00 -0500 (CDT)
Subject: [Swift-devel] Where is latest doc on running Swift on Beagle? Covers OpenMP apps?
In-Reply-To: <1155572818.102314.1318870374382.JavaMail.root@zimbra.anl.gov>
Message-ID: <395020739.102357.1318870740355.JavaMail.root@zimbra.anl.gov>

By the way, reading through all the examples from Glen, Justin, Ketan and
David, I am still curious how to get coasters to run exactly one OpenMP app
per node, while ensuring that the app has access to all the node's cores.

Also, on Beagle, Glen, do you know if you can compile and execute with the
native gcc module, with all PrgEnv's unloaded? This *seems* to work on the
sandbox node; I am about to try on a compute node.

Lastly: the descriptions of Coaster parameters in the Beagle doc need to be
corrected; they are not quite correct. They should be compared to the
descriptions in the User Guide, which while not very easy to understand, are
to my knowledge correct.
It would be good to revise the User Guide to be clearer (which I think
basically involves more explanation and examples) and then have the
Beagle text reference and/or replicate the User Guide info as appropriate.

- Mike

----- Original Message -----
> From: "Michael Wilde"
> To: "Glen Hocky"
> Cc: "Swift Devel"
> Sent: Monday, October 17, 2011 11:52:54 AM
> Subject: Re: [Swift-devel] Where is latest doc on running Swift on Beagle? Covers OpenMP apps?
>
> I can help write a test case. It's just a for() loop with a #pragma in
> front - very simple. If each parallel loop iteration could do
> system("sleep N"), we could readily observe that the test is working
> and spawning OMP_NUM_THREADS threads and procs.
>
> - Mike

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From hockyg at uchicago.edu Mon Oct 17 13:07:17 2011
From: hockyg at uchicago.edu (Glen Hocky)
Date: Mon, 17 Oct 2011 14:07:17 -0400
Subject: [Swift-devel] Where is latest doc on running Swift on Beagle? Covers OpenMP apps?
In-Reply-To: <395020739.102357.1318870740355.JavaMail.root@zimbra.anl.gov>
References: <1155572818.102314.1318870374382.JavaMail.root@zimbra.anl.gov>
 <395020739.102357.1318870740355.JavaMail.root@zimbra.anl.gov>
Message-ID:

Sorry, I didn't realize that what I said before probably didn't make
sense, because I had something hardcoded.

I believe jobsPerNode=1 will give you just one OpenMP job per node. Then
OMP_NUM_THREADS controls the /max/ number of threads per OpenMP job, but
you can specify a lower number in the code if you want.

I think I'm using native gcc, so I believe the answer to your second
question is yes.

On Mon, Oct 17, 2011 at 12:59 PM, Michael Wilde wrote:

> By the way, reading through all the examples from Glen, Justin, Ketan and
> David, I am still curious how to get coasters to run exactly one OpenMP app
> per node, while ensuring that the app has access to all the node's cores.
>
> Also, on Beagle, Glen, do you know if you can compile and execute with the
> native gcc module, with all PrgEnv's unloaded? This *seems* to work on the
> sandbox node; I am about to try on a compute node.
>
> Lastly: the descriptions of Coaster parameters in the Beagle doc need to be
> corrected; they are not quite correct. They should be compared to the
> descriptions in the User Guide, which while not very easy to understand, are
> to my knowledge correct. It would be good to revise the User Guide to be
> clearer (which I think basically involves more explanation and examples)
> and then have the Beagle text reference and/or replicate the User Guide info
> as appropriate.
>
> - Mike

From wilde at mcs.anl.gov Mon Oct 17 16:55:31 2011
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 17 Oct 2011 16:55:31 -0500 (CDT)
Subject: [Swift-devel] swift module update for Beagle
Message-ID: <1537202546.103932.1318888531728.JavaMail.root@zimbra.anl.gov>

Ketan, do you maintain the swift module on Beagle?

I see that module load swift gives me what it calls "0.92" but what I
think/hope is 0.92.1

Can you or David add modules swift/0.93RC2 and swift/trunk for testing?
(We're adding two new user groups on Beagle this week and I'd like to have them use Swift via modules from the start rather than private builds).

Thanks,

- Mike

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From ketan at mcs.anl.gov Mon Oct 17 16:59:12 2011
From: ketan at mcs.anl.gov (Ketan Maheshwari)
Date: Mon, 17 Oct 2011 16:59:12 -0500 (CDT)
Subject: [Swift-devel] swift module update for Beagle
In-Reply-To: <1537202546.103932.1318888531728.JavaMail.root@zimbra.anl.gov>
Message-ID: <500432590.103942.1318888752966.JavaMail.root@zimbra.anl.gov>

Mike,

The module 0.92 of Beagle is actually 0.92.1; this is the case because, after the version change, one needs to contact the admin (Ti in this case) in order to update a module file. Since the change was minor, he suggested, I overwrite 0.92.1 on 0.92.

I have 0.93 in place on Beagle. We just need to contact him if this would be the RC for Beagle.

Regards,
Ketan

From wilde at mcs.anl.gov Mon Oct 17 17:13:35 2011
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 17 Oct 2011 17:13:35 -0500 (CDT)
Subject: [Swift-devel] swift module update for Beagle
In-Reply-To: <500432590.103942.1318888752966.JavaMail.root@zimbra.anl.gov>
Message-ID: <2113493162.103984.1318889615741.JavaMail.root@zimbra.anl.gov>

> The module 0.92 of Beagle is actually 0.92.1; this is the case
> because, after the version change, one needs to contact the admin (Ti
> in this case) in order to update a module file. Since the change was
> minor, he suggested, I overwrite 0.92.1 on 0.92.

I think moving forward it would be good to create specific module versions so there is no confusion as to what's being executed.

So we should have in retrospect created an 0.92.1 module.

> I have 0.93 in place on Beagle. We just need to contact him if this
> would be the RC for Beagle.

Let's get module versions of 0.93rc2, 0.93, and "trunk" (if we can) in place.

(If there are no objections or counter-proposals)

Can you initiate that, Ketan, and also document on swiftdevel how to manage Beagle modules so that David and others can do this as part of release management?

Thanks,

- Mike

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From ketancmaheshwari at gmail.com Mon Oct 17 17:16:02 2011
From: ketancmaheshwari at gmail.com (Ketan Maheshwari)
Date: Mon, 17 Oct 2011 17:16:02 -0500
Subject: [Swift-devel] swift module update for Beagle
In-Reply-To: <2113493162.103984.1318889615741.JavaMail.root@zimbra.anl.gov>
References: <500432590.103942.1318888752966.JavaMail.root@zimbra.anl.gov> <2113493162.103984.1318889615741.JavaMail.root@zimbra.anl.gov>
Message-ID:

On Mon, Oct 17, 2011 at 5:13 PM, Michael Wilde wrote:

> Can you initiate that, Ketan, and also document on swiftdevel how to manage
> Beagle modules so that David and others can do this as part of release
> management?

Sure, I can do that. Where is 0.93rc2 located btw?

Regards,
Ketan

From wilde at mcs.anl.gov Mon Oct 17 17:25:39 2011
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 17 Oct 2011 17:25:39 -0500 (CDT)
Subject: [Swift-devel] swift module update for Beagle
In-Reply-To:
Message-ID: <1115124507.104024.1318890339025.JavaMail.root@zimbra.anl.gov>

----- Original Message -----
> From: "Ketan Maheshwari"
...
> Can you initiate that, Ketan, and also document on swiftdevel how to
> manage Beagle modules so that David and others can do this as part of
> release management?
>
> Sure, I can do that. Where is 0.93rc2 located btw?

http://www.ci.uchicago.edu/swift/wwwdev/downloads/index.php

David, this page should clearly state *which* RC the download links and buttons refer to.
Thanks,

- Mike

From wilde at mcs.anl.gov Mon Oct 17 18:00:54 2011
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 17 Oct 2011 18:00:54 -0500 (CDT)
Subject: [Swift-devel] OpenMP example for Swift testing
In-Reply-To: <1155572818.102314.1318870374382.JavaMail.root@zimbra.anl.gov>
Message-ID: <1363479628.104133.1318892454991.JavaMail.root@zimbra.anl.gov>

I'm a total newbie to OpenMP (so this may be a flawed example), but this seems to work:

  #include <stdio.h>
  #include <stdlib.h>   /* for system() */

  int main(int argc, char **argv)
  {
    int i;

    /* each thread gets its own copy of the loop variable i */
    #pragma omp parallel for private (i)
    for(i=0; i<10; i++){
      printf("i=%d sleeps\n",i);
      system("sleep 2");
      printf("i=%d wakes\n", i);
    }
    return 0;
  }

make with:

  openmpapp: openmpapp.c
  	gcc -fopenmp -o openmpapp openmpapp.c

And try the same code with the #pragma commented out.

Under OpenMP you can see how the threads behave:

  sandbox$ OMP_NUM_THREADS=2 ./openmpapp
  i=0 sleeps
  i=5 sleeps
  i=5 wakes
  i=6 sleeps
  i=0 wakes
  i=1 sleeps
  i=6 wakes
  i=7 sleeps
  i=1 wakes
  i=2 sleeps
  i=7 wakes
  i=8 sleeps
  i=2 wakes
  i=3 sleeps
  i=8 wakes
  i=9 sleeps
  i=3 wakes
  i=4 sleeps
  i=9 wakes
  i=4 wakes
  sandbox$ OMP_NUM_THREADS=10 ./openmpapp
  i=2 sleeps
  i=9 sleeps
  i=0 sleeps
  i=7 sleeps
  i=8 sleeps
  i=5 sleeps
  i=6 sleeps
  i=4 sleeps
  i=3 sleeps
  i=1 sleeps
  i=9 wakes
  i=5 wakes
  i=8 wakes
  i=4 wakes
  i=6 wakes
  i=3 wakes
  i=1 wakes
  i=0 wakes
  i=2 wakes
  i=7 wakes
  sandbox$

One thing that has me stumped, though, is that asking for 9 threads shows a behavior as if it had 5 threads:

  sandbox$ OMP_NUM_THREADS=9 ./openmpapp
  i=4 sleeps
  i=0 sleeps
  i=8 sleeps
  i=6 sleeps
  i=2 sleeps
  i=8 wakes
  i=9 sleeps
  i=4 wakes
  i=5 sleeps
  i=0 wakes
  i=1 sleeps
  i=6 wakes
  i=7 sleeps
  i=2 wakes
  i=3 sleeps
  i=1 wakes
  i=7 wakes
  i=9 wakes
  i=3 wakes
  i=5 wakes
  sandbox$

I also would not assume that printf() and system() are thread-safe, but at least this is a simple example to start out with for testing if we're getting the right number of cores and threads active under Swift, Coasters, and Cray ALPS/aprun.
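
A rough way to check whether the requested threads really run side by side is to time the program for several thread counts. The sketch below is only an illustration: it assumes the binary is called ./openmpapp as in the make rule above, that `date +%s` is available (GNU and BSD date support it), and that each of the 10 iterations costs about 2 seconds. The helper name expected_seconds is made up for this example.

```shell
#!/bin/sh
# expected_seconds THREADS
# Hypothetical helper: approximate wall time if all threads run concurrently.
# The busiest thread runs ceil(10 / THREADS) iterations of about 2 s each.
expected_seconds() {
    echo $(( 2 * ( (10 + $1 - 1) / $1 ) ))
}

for n in 1 2 5 10; do
    if [ -x ./openmpapp ]; then
        start=$(date +%s)
        OMP_NUM_THREADS=$n ./openmpapp > /dev/null
        end=$(date +%s)
        echo "threads=$n expected~$(expected_seconds "$n")s measured=$(( end - start ))s"
    else
        # Binary not built yet: only print the estimate.
        echo "threads=$n expected~$(expected_seconds "$n")s (./openmpapp not found, skipped)"
    fi
done
```

If the measured time stays near 20 s regardless of OMP_NUM_THREADS, the threads are probably being serialized (for example, pinned to a single core).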
- Mike

----- Original Message -----
> From: "Michael Wilde"
> To: "Glen Hocky"
> Cc: "David Kelly" , "ketan" , "Swift Devel" , "Justin M Wozniak"
> Sent: Monday, October 17, 2011 11:52:54 AM
> Subject: Re: [Swift-devel] Where is latest doc on running Swift on Beagle? Covers OpenMP apps?
> I can help write a test case. Its just a for() loop with a #pragma in
> front - very simple. If each parallel loop iteration could do
> system("sleep N") we could readily observe that the test is working
> and spawning OMP_NUM_THREADS threads and procs.
>
> - Mike
>
> ----- Original Message -----
> > From: "Glen Hocky"
> > To: "Justin M Wozniak"
> > Cc: "Michael Wilde" , "David Kelly" , "ketan" , "Swift Devel"
> > Sent: Monday, October 17, 2011 11:30:17 AM
> > Subject: Re: [Swift-devel] Where is latest doc on running Swift on
> > Beagle? Covers OpenMP apps?
> > Justin, I'm not sure my program counts as sufficiently simple for
> > this purpose. I'd be happy to let you include it and get an example
> > set up though if you want to use it anyway. The open mp part, which I
> > haven't been using recently, may need a bit of debugging as well
> >
> > Glen
> >
> > On Mon, Oct 17, 2011 at 12:23 PM, Justin M Wozniak < wozniak at mcs.anl.gov > wrote:
> >
> > Glen, do you have an extremely simple but relevant OpenMP program
> > that we could stick in the test suite?
> >
> > On Sun, 16 Oct 2011, Glen Hocky wrote:
> >
> > It's in my run script that creates the actual sites file that I run
> > with. I'm not sure what you would do if you wanted more than 24
> > cores, so depth stays fixed at 24 (that's an aprun parameters). Then
> >
> > WORKERSPERNODE=$((24/$PPN))
> >
> > Where PPN is how many cores you want per OPENMP app and then workers
> > per node says how many OPENMP apps you want to run. So obvious
> > example would be you want 3 8 core OPENMP jobs, PPN = 8,
> > WORKERSPERNODE=3

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From hockyg at uchicago.edu Mon Oct 17 18:03:48 2011
From: hockyg at uchicago.edu (Glen Hocky)
Date: Mon, 17 Oct 2011 19:03:48 -0400
Subject: [Swift-devel] OpenMP example for Swift testing
In-Reply-To: <1363479628.104133.1318892454991.JavaMail.root@zimbra.anl.gov>
References: <1155572818.102314.1318870374382.JavaMail.root@zimbra.anl.gov> <1363479628.104133.1318892454991.JavaMail.root@zimbra.anl.gov>
Message-ID:

I think the reason N=9 doesn't work is that there is more than one default upper bound for openmp, one is the env variable OMP_NUM_THREADS the other is the number of cpus. so i'm guessing you're testing on an 8 cpu machine?
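
Another possible explanation for the 9-thread run, offered as an assumption rather than a confirmed diagnosis: if the OpenMP runtime's default static schedule gives each thread one contiguous block of ceil(iterations / threads) iterations, then 10 iterations over 9 threads means blocks of 2, and only 5 threads receive any work. That would match the 9-thread output above, where the first iterations to start are 0, 2, 4, 6 and 8. The OpenMP specification does not fix the exact block sizes when no chunk size is given, so other runtimes or versions may split the loop differently. The helper name busy_threads below is made up for this sketch.

```shell
#!/bin/sh
# busy_threads ITERATIONS THREADS
# Hypothetical helper: how many threads receive work if every thread is
# handed one contiguous block of ceil(ITERATIONS / THREADS) iterations.
busy_threads() {
    iters=$1
    threads=$2
    chunk=$(( (iters + threads - 1) / threads ))   # ceil(iters / threads)
    echo $(( (iters + chunk - 1) / chunk ))        # blocks needed to cover the loop
}

for n in 2 9 10; do
    echo "OMP_NUM_THREADS=$n: $(busy_threads 10 "$n") thread(s) busy"
done
```

For the 10-iteration loop this gives 2, 5 and 10 busy threads for OMP_NUM_THREADS of 2, 9 and 10, which is consistent with all three runs shown earlier.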
Also, a better example will end up being one that prints the thread number as well as the number, otherwise you may not be able to tell if you're getting the right number of threads On Mon, Oct 17, 2011 at 7:00 PM, Michael Wilde wrote: > Im a total newbie to OpenMP (so this may be a flawed example), but this > seems to work: > > #include > > void main(int argc, char **argv) > { > int i; > > #pragma omp parallel for private (i) > for(i=0; i<10; i++){ > printf("i=%d sleeps\n",i); > system("sleep 2"); > printf("i=%d wakes\n", i); > } > } > > make with: > > openmpapp: openmpapp.c > gcc -fopenmp -o openmpapp openmpapp.c > > And try the same code with the #pragma commented out. > > Under OpenMP you can see how the threads behave: > > sandbox$ OMP_NUM_THREADS=2 ./openmpapp > i=0 sleeps > i=5 sleeps > i=5 wakes > i=6 sleeps > i=0 wakes > i=1 sleeps > i=6 wakes > i=7 sleeps > i=1 wakes > i=2 sleeps > i=7 wakes > i=8 sleeps > i=2 wakes > i=3 sleeps > i=8 wakes > i=9 sleeps > i=3 wakes > i=4 sleeps > i=9 wakes > i=4 wakes > sandbox$ OMP_NUM_THREADS=10 ./openmpapp > i=2 sleeps > i=9 sleeps > i=0 sleeps > i=7 sleeps > i=8 sleeps > i=5 sleeps > i=6 sleeps > i=4 sleeps > i=3 sleeps > i=1 sleeps > i=9 wakes > i=5 wakes > i=8 wakes > i=4 wakes > i=6 wakes > i=3 wakes > i=1 wakes > i=0 wakes > i=2 wakes > i=7 wakes > sandbox$ > > One thing that has me stumped, though, is that asking for 9 threads shows a > behavior as if it had 5 threads: > > sandbox$ OMP_NUM_THREADS=9 ./openmpapp > i=4 sleeps > i=0 sleeps > i=8 sleeps > i=6 sleeps > i=2 sleeps > i=8 wakes > i=9 sleeps > i=4 wakes > i=5 sleeps > i=0 wakes > i=1 sleeps > i=6 wakes > i=7 sleeps > i=2 wakes > i=3 sleeps > i=1 wakes > i=7 wakes > i=9 wakes > i=3 wakes > i=5 wakes > sandbox$ > > I also would not assume that printf() and system() are thread-safe, but at > least this is a simple example to start out with for testing if we're > getting the right number of cores and threads active under Swift, Coasters, > and Cray 
ALPS/aprun. > > - Mike > > > ----- Original Message ----- > > From: "Michael Wilde" > > To: "Glen Hocky" > > Cc: "David Kelly" , "ketan" < > ketancmaheshwari at gmail.com>, "Swift Devel" > > , "Justin M Wozniak" > > Sent: Monday, October 17, 2011 11:52:54 AM > > Subject: Re: [Swift-devel] Where is latest doc on running Swift on > Beagle? Covers OpenMP apps? > > I can help write a test case. Its just a for() loop with a #pragma in > > front - very simple. If each parallel loop iteration could do > > system("sleep N") we could readily observe that the test is working > > and spawning OMP_NUM_THREADS threads and procs. > > > > - Mike > > > > ----- Original Message ----- > > > From: "Glen Hocky" > > > To: "Justin M Wozniak" > > > Cc: "Michael Wilde" , "David Kelly" > > > , "ketan" , > > > "Swift Devel" > > > Sent: Monday, October 17, 2011 11:30:17 AM > > > Subject: Re: [Swift-devel] Where is latest doc on running Swift on > > > Beagle? Covers OpenMP apps? > > > Justin, I'm not sure my program counts as sufficiently simple for > > > this > > > purpose. I'd be happy to let you include it and get an example set > > > up > > > though if you want to use it anyway. The open mp part, which I > > > haven't > > > been using recently, may need a bit of debugging as well > > > > > > > > > Glen > > > > > > > > > > > > On Mon, Oct 17, 2011 at 12:23 PM, Justin M Wozniak < > > > wozniak at mcs.anl.gov > wrote: > > > > > > > > > > > > Glen, do you have an extremely simple but relevant OpenMP program > > > that > > > we could stick in the test suite? > > > > > > > > > > > > > > > On Sun, 16 Oct 2011, Glen Hocky wrote: > > > > > > > > > > > > It's in my run script that creates the actual sites file that I run > > > with. > > > I'm not sure what you would do if you wanted more than 24 cores, so > > > depth > > > stays fixed at 24 (that's an aprun parameters). 
Then > > > > > > WORKERSPERNODE=$((24/$PPN)) > > > > > > Where PPN is how many cores you want per OPENMP app and then workers > > > per > > > node says how many OPENMP apps you want to run. So obvious example > > > would be > > > you want 3 8 core OPENMP jobs, PPN = 8, WORKERSPERNODE=3 > > > > > > > > > On Sun, Oct 16, 2011 at 12:27 PM, Michael Wilde < wilde at mcs.anl.gov > > > > > > > wrote: > > > > > > > > > > > > Thanks, Glen! > > > > > > Justin, can you check the sites file below? I dont understand the > > > interaction between the parameters OMP_NUM_THREDS, jobsPerNode, and > > > depth. > > > WHere is the best documentation on that? > > > > > > Thanks, > > > > > > - Mike > > > > > > > > > ----- Original Message ----- > > > > > > > > > From: "Glen Hocky" < hockyg at uchicago.edu > > > > To: "Michael Wilde" < wilde at mcs.anl.gov > > > > Cc: "David Kelly" < davidk at ci.uchicago.edu >, "ketan" < > > > ketancmaheshwari at gmail.com > > > > > > > > > > Sent: Sunday, October 16, 2011 11:18:33 AM > > > Subject: Re: [Swift-devel] Where is latest doc on running Swift on > > > Beagle? Covers OpenMP apps? > > > > > > > > > Yes, I'm running and yes I did test openmp a while back. Sites file > > > follows. I'm using trunk from a few months ago > > > > > > "Swift svn swift-r4813 (swift modified locally) cog-r3175" > > > > > > > > > > > > > > > > > > > > > > > key="providerAttributes">pbs. aprun;pbs.mpp;depth=24 > > > 24 > > > > > > > > > $PPN > > > $TIME > > > $MAXTIME > > > $nodes > > > 1 > > > 1 > > > 100 > > > 100 > > > 200.00 > > > 10000 > > > > > > > > > $swiftrundir/swiftwork > > > > > > > > > > > > > > > On Sun, Oct 16, 2011 at 12:15 PM, Michael Wilde < wilde at mcs.anl.gov > > > > > > > wrote: > > > > > > > > > David, Ketan, > > > > > > I need to run some things on Beagle, asap. > > > > > > Ketan, where is the latest and best documentation for this? I see > > > your > > > edits below to the 0.93 Site Guide. 
But I dont see that online where > > > I > > > would expect it: > > > > > > > > > http://www.ci.uchicago.edu/ swift/wwwdev/guides/release-0. > > > 93/siteguide/siteguide.html#_ beagle > > > > > > > > > > > > David, is it just that this document is not being correctly pushed > > > to > > > the wwwdev site on a nightly basis? > > > > > > Ketan, is the latest info on running Swift on Beagle now all in the > > > siteguide? Is the info you were putting in the cookbook (I see many > > > commits there) now all consolidated into the Site Guide? And is > > > there > > > a difference in sites.xml settings between 0.93 and trunk? Lastly, > > > which release works best? > > > > > > Second question: I need to run a script that executes many 24-core > > > OpenMP apps. Is the necessary support for this in 0.93? What if any > > > declarations do I need other than to say jobsPerNode=1? Glen, are > > > you > > > running OpenMP on Beagle and if so what release and sites file are > > > you > > > using? > > > > > > Im assuming Justin's latest changes to sites.xml are in trunk but > > > not > > > 0.93? If that is correct, is there a corresponding site site for > > > Beagle for trunk? 
> > > > > > Thanks, > > > > > > - Mike > > > > > > > > > ----- Forwarded Message ----- > > > From: ketan at ci.uchicago.edu > > > To: swift-commit at ci.uchicago.edu > > > Sent: Sunday, September 18, 2011 10:14:10 PM > > > Subject: [Swift-commit] r5126 - branches/release-0.93/docs/ > > > siteguide > > > > > > Author: ketan > > > Date: 2011-09-18 22:14:10 -0500 (Sun, 18 Sep 2011) > > > New Revision: 5126 > > > > > > Modified: > > > branches/release-0.93/docs/ siteguide/beagle > > > Log: > > > added content to beagle siteguide > > > > > > Modified: branches/release-0.93/docs/ siteguide/beagle > > > ============================== ============================== > > > ======= > > > --- branches/release-0.93/docs/ siteguide/beagle 2011-09-19 02:41:02 > > > UTC (rev 5125) > > > +++ branches/release-0.93/docs/ siteguide/beagle 2011-09-19 03:14:10 > > > UTC (rev 5126) > > > @@ -52,9 +52,38 @@ > > > A key factor in scaling up Swift runs on Beagle is to setup the > > > sites.xml parameters. > > > The following sites.xml parameters must be set to scale that is > > > intended for a large run: > > > > > > - * walltime: The expected walltime for completion of your run. This > > > parameter is accepted in seconds. > > > - * slots: Number of qsub jobs needs to be submitted by swift. This > > > number will determine how many qsubs swift will submit for your run. > > > Typical values range between 40 and 80 for large runs. > > > - * nodegranularity: Determines the number of nodes per job. Total > > > nodes will thus be slots times nodegranularity. This may vary for > > > advanced configurations though. > > > - * maxnodes: Determines the maximum number of nodes a job must pack > > > into its qsub. This parameter determines the largest single job that > > > your run will submit. > > > + * *maxTime* : The expected walltime for completion of your run. > > > This > > > parameter is accepted in seconds. > > > + * *slots* : Number of qsub jobs needs to be submitted by swift. 
> > > This > > > number will determine how many qsubs swift will submit for your run. > > > Typical values range between 40 and 80 for large runs. > > > + * *nodeGranularity* : Determines the number of nodes per job. > > > Total > > > nodes will thus be slots times nodegranularity. This may vary for > > > advanced configurations though. > > > + * *maxNodes* : Determines the maximum number of nodes a job must > > > pack into its qsub. This parameter determines the largest single job > > > that your run will submit. > > > + * *jobThrottle* : A factor that determines the number of tasks > > > dispatched simultaneously. The intended number of simultaneous tasks > > > must match the number of cores targeted. The number of tasks is > > > calculated from the jobThrottle factor is as follows: > > > > > > +---- > > > +Number of Tasks = (JobThrottle x 100) + 1 > > > +---- > > > > > > +Following is an example sites.xml for a 50 slots run with each slot > > > occupying 4 nodes (thus, a 200 node run): > > > + > > > +----- > > > + > > > + > > > + > > > + CI-CCR000013 > > > + > > > + 24:cray:pack > > > + > > > + 24 > > > + 50000 > > > + 50 > > > + 4 > > > + 4 > > > + > > > + 48.00 > > > + 10000 > > > + > > > + > > > + /lustre/beagle/ketan/swift. 
From davidk at ci.uchicago.edu Mon Oct 17 18:04:42 2011
From: davidk at ci.uchicago.edu (David Kelly)
Date: Mon, 17 Oct 2011 18:04:42 -0500 (CDT)
Subject: [Swift-devel] swift module update for Beagle
In-Reply-To: <1115124507.104024.1318890339025.JavaMail.root@zimbra.anl.gov>
Message-ID: <825700294.151090.1318892682551.JavaMail.root@zimbra-mb2.anl.gov>

I've updated the wwwdev page with the latest revision, 0.93RC3.

David

----- Original Message -----
> From: "Michael Wilde"
> To: "Ketan Maheshwari", "David Kelly"
> Cc: "Swift Devel"
> Sent: Monday, October 17, 2011 5:25:39 PM
> Subject: Re: [Swift-devel] swift module update for Beagle
> ----- Original Message -----
> > From: "Ketan Maheshwari"
> ...
> > Can you initiate that, Ketan, and also document on swiftdevel how to
> > manage Beagle modules so that David and others can do this as part of
> > release management?
> >
> > Sure, I can do that. Where is 0.93rc2 located btw?
>
> http://www.ci.uchicago.edu/swift/wwwdev/downloads/index.php
>
> David, this page should clearly state *which* RC the download links
> and buttons refer to.
>
> Thanks,
>
> - Mike

From hockyg at uchicago.edu Mon Oct 17 18:07:16 2011
From: hockyg at uchicago.edu (Glen Hocky)
Date: Mon, 17 Oct 2011 19:07:16 -0400
Subject: [Swift-devel] OpenMP example for Swift testing
In-Reply-To: <1363479628.104133.1318892454991.JavaMail.root@zimbra.anl.gov>
References: <1155572818.102314.1318870374382.JavaMail.root@zimbra.anl.gov> <1363479628.104133.1318892454991.JavaMail.root@zimbra.anl.gov>
Message-ID:

Sorry, actually I don't understand the N=9 example on rereading. But here
is an example that you might want to combine with:

https://computing.llnl.gov/tutorials/openMP/samples/C/omp_hello.c

With output something like

Hello World from thread = 0
Number of threads = 4
Hello World from thread = 3
Hello World from thread = 1
Hello World from thread = 2

On Mon, Oct 17, 2011 at 7:00 PM, Michael Wilde wrote:

> I'm a total newbie to OpenMP (so this may be a flawed example), but this
> seems to work:
>
> #include <stdio.h>
> #include <stdlib.h>
>
> int main(int argc, char **argv)
> {
>   int i;
>
>   /* each iteration sleeps for 2 seconds so thread activity is visible */
>   #pragma omp parallel for private (i)
>   for(i=0; i<10; i++){
>     printf("i=%d sleeps\n",i);
>     system("sleep 2");
>     printf("i=%d wakes\n", i);
>   }
>   return 0;
> }
>
> make with:
>
> openmpapp: openmpapp.c
>         gcc -fopenmp -o openmpapp openmpapp.c
>
> And try the same code with the #pragma commented out.
>
> Under OpenMP you can see how the threads behave:
>
> sandbox$ OMP_NUM_THREADS=2 ./openmpapp
> i=0 sleeps
> i=5 sleeps
> i=5 wakes
> i=6 sleeps
> i=0 wakes
> i=1 sleeps
> i=6 wakes
> i=7 sleeps
> i=1 wakes
> i=2 sleeps
> i=7 wakes
> i=8 sleeps
> i=2 wakes
> i=3 sleeps
> i=8 wakes
> i=9 sleeps
> i=3 wakes
> i=4 sleeps
> i=9 wakes
> i=4 wakes
> sandbox$ OMP_NUM_THREADS=10 ./openmpapp
> i=2 sleeps
> i=9 sleeps
> i=0 sleeps
> i=7 sleeps
> i=8 sleeps
> i=5 sleeps
> i=6 sleeps
> i=4 sleeps
> i=3 sleeps
> i=1 sleeps
> i=9 wakes
> i=5 wakes
> i=8 wakes
> i=4 wakes
> i=6 wakes
> i=3 wakes
> i=1 wakes
> i=0 wakes
> i=2 wakes
> i=7 wakes
> sandbox$
>
> One thing that has me stumped, though, is that asking for 9 threads shows a
> behavior as if it had 5 threads:
>
> sandbox$ OMP_NUM_THREADS=9 ./openmpapp
> i=4 sleeps
> i=0 sleeps
> i=8 sleeps
> i=6 sleeps
> i=2 sleeps
> i=8 wakes
> i=9 sleeps
> i=4 wakes
> i=5 sleeps
> i=0 wakes
> i=1 sleeps
> i=6 wakes
> i=7 sleeps
> i=2 wakes
> i=3 sleeps
> i=1 wakes
> i=7 wakes
> i=9 wakes
> i=3 wakes
> i=5 wakes
> sandbox$
>
> I also would not assume that printf() and system() are thread-safe, but at
> least this is a simple example to start out with for testing if we're
> getting the right number of cores and threads active under Swift, Coasters,
> and Cray ALPS/aprun.
>
> - Mike
>
>
> ----- Original Message -----
> > From: "Michael Wilde"
> > To: "Glen Hocky"
> > Cc: "David Kelly", "ketan" <ketancmaheshwari at gmail.com>, "Swift Devel", "Justin M Wozniak"
> > Sent: Monday, October 17, 2011 11:52:54 AM
> > Subject: Re: [Swift-devel] Where is latest doc on running Swift on Beagle? Covers OpenMP apps?
> > I can help write a test case. It's just a for() loop with a #pragma in
> > front - very simple. If each parallel loop iteration could do
> > system("sleep N") we could readily observe that the test is working
> > and spawning OMP_NUM_THREADS threads and procs.
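
A possible reading of the 9-thread run quoted above, offered as an assumption that nothing in the thread confirms: with 10 iterations and no schedule clause, a runtime that rounds the static chunk size up to ceil(10/9) = 2 hands out only five chunks, so only five threads receive work. That would match the first five "sleeps" lines of that run (i=0, 2, 4, 6, 8) and also the 2-thread run, where i=0 and i=5 start first. The sketch below only reproduces that arithmetic. The helper names `static_chunk`, `busy_threads`, and `print_partition` are hypothetical, and the rounding rule is assumed rather than taken from any OpenMP runtime's documentation.

```c
#include <assert.h>
#include <stdio.h>

/* Iterations per chunk if the runtime rounds the static chunk size up,
 * i.e. ceil(n_iter / n_threads). ASSUMPTION: this rounding rule is a guess
 * that happens to fit the quoted output. */
int static_chunk(int n_iter, int n_threads)
{
    assert(n_iter >= 0 && n_threads > 0);
    return (n_iter + n_threads - 1) / n_threads;
}

/* Number of threads that receive at least one iteration under that rule. */
int busy_threads(int n_iter, int n_threads)
{
    int chunk = static_chunk(n_iter, n_threads);
    int chunks;

    if (chunk == 0)
        return 0;                      /* no iterations, no busy threads */
    chunks = (n_iter + chunk - 1) / chunk;
    return chunks < n_threads ? chunks : n_threads;
}

/* Print the contiguous iteration range handed to each busy thread. */
void print_partition(int n_iter, int n_threads)
{
    int chunk = static_chunk(n_iter, n_threads);
    int busy = busy_threads(n_iter, n_threads);
    int t;

    for (t = 0; t < busy; t++) {
        int lo = t * chunk;
        int hi = lo + chunk < n_iter ? lo + chunk : n_iter;
        printf("thread %d: i=%d..%d\n", t, lo, hi - 1);
    }
}
```

Calling `print_partition(10, 9)` lists five ranges, `i=0..1` through `i=8..9`. Adding `schedule(static,1)` or `schedule(dynamic)` to the pragma in the test program would be one way to check whether chunking really explains the behavior, since either clause hands out iterations one at a time.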
> >
> > - Mike
> >
> > ----- Original Message -----
> > > From: "Glen Hocky"
> > > To: "Justin M Wozniak"
> > > Cc: "Michael Wilde", "David Kelly", "ketan", "Swift Devel"
> > > Sent: Monday, October 17, 2011 11:30:17 AM
> > > Subject: Re: [Swift-devel] Where is latest doc on running Swift on Beagle? Covers OpenMP apps?
> > > Justin, I'm not sure my program counts as sufficiently simple for
> > > this purpose. I'd be happy to let you include it and get an example
> > > set up though if you want to use it anyway. The OpenMP part, which I
> > > haven't been using recently, may need a bit of debugging as well.
> > >
> > > Glen
> > >
> > > On Mon, Oct 17, 2011 at 12:23 PM, Justin M Wozniak <wozniak at mcs.anl.gov> wrote:
> > >
> > > Glen, do you have an extremely simple but relevant OpenMP program
> > > that we could stick in the test suite?
> > >
> > > On Sun, 16 Oct 2011, Glen Hocky wrote:
> > >
> > > It's in my run script that creates the actual sites file that I run
> > > with. I'm not sure what you would do if you wanted more than 24 cores,
> > > so depth stays fixed at 24 (that's an aprun parameter). Then
> > >
> > > WORKERSPERNODE=$((24/$PPN))
> > >
> > > where PPN is how many cores you want per OpenMP app, and workers per
> > > node says how many OpenMP apps you want to run. So the obvious example
> > > would be: you want 3 8-core OpenMP jobs, PPN=8, WORKERSPERNODE=3
> > >
> > > On Sun, Oct 16, 2011 at 12:27 PM, Michael Wilde <wilde at mcs.anl.gov> wrote:
> > >
> > > Thanks, Glen!
> > >
> > > Justin, can you check the sites file below? I don't understand the
> > > interaction between the parameters OMP_NUM_THREADS, jobsPerNode, and
> > > depth. Where is the best documentation on that?
> > >
> > > Thanks,
> > >
> > > - Mike
> > >
> > > ----- Original Message -----
> > > From: "Glen Hocky" <hockyg at uchicago.edu>
> > > To: "Michael Wilde" <wilde at mcs.anl.gov>
> > > Cc: "David Kelly" <davidk at ci.uchicago.edu>, "ketan" <ketancmaheshwari at gmail.com>
> > > Sent: Sunday, October 16, 2011 11:18:33 AM
> > > Subject: Re: [Swift-devel] Where is latest doc on running Swift on Beagle? Covers OpenMP apps?
> > >
> > > Yes, I'm running and yes I did test OpenMP a while back. Sites file
> > > follows. I'm using trunk from a few months ago
> > >
> > > "Swift svn swift-r4813 (swift modified locally) cog-r3175"
> > >
> > > key="providerAttributes">pbs.aprun;pbs.mpp;depth=24
> > > 24
> > > $PPN
> > > $TIME
> > > $MAXTIME
> > > $nodes
> > > 1
> > > 1
> > > 100
> > > 100
> > > 200.00
> > > 10000
> > > $swiftrundir/swiftwork
> > >
> > > On Sun, Oct 16, 2011 at 12:15 PM, Michael Wilde <wilde at mcs.anl.gov> wrote:
> > >
> > > David, Ketan,
> > >
> > > I need to run some things on Beagle, asap.
> > >
> > > Ketan, where is the latest and best documentation for this? I see
> > > your edits below to the 0.93 Site Guide. But I don't see that online
> > > where I would expect it:
> > >
> > > http://www.ci.uchicago.edu/swift/wwwdev/guides/release-0.93/siteguide/siteguide.html#_beagle
> > >
> > > David, is it just that this document is not being correctly pushed
> > > to the wwwdev site on a nightly basis?
> > >
> > > Ketan, is the latest info on running Swift on Beagle now all in the
> > > siteguide? Is the info you were putting in the cookbook (I see many
> > > commits there) now all consolidated into the Site Guide? And is there
> > > a difference in sites.xml settings between 0.93 and trunk? Lastly,
> > > which release works best?
> > >
> > > Second question: I need to run a script that executes many 24-core
> > > OpenMP apps. Is the necessary support for this in 0.93? What if any
> > > declarations do I need other than to say jobsPerNode=1? Glen, are you
> > > running OpenMP on Beagle and if so what release and sites file are you
> > > using?
> > >
> > > I'm assuming Justin's latest changes to sites.xml are in trunk but
> > > not 0.93? If that is correct, is there a corresponding sites file for
> > > Beagle for trunk?
> > >
> > > Thanks,
> > >
> > > - Mike

From wilde at mcs.anl.gov Mon Oct 17 23:40:57 2011
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 17 Oct 2011 23:40:57 -0500 (CDT)
Subject: [Swift-devel] Where is latest doc on running Swift on Beagle? Covers OpenMP apps?
In-Reply-To:
Message-ID: <602206793.104597.1318912857613.JavaMail.root@zimbra.anl.gov>

Justin, I have a few questions about this page:

"If using Coasters, the count attribute is the number of running
worker.pl scripts (1); this is the number of nodes to use."

What is the "count" attribute?

"The ppn attribute is available but should be left to default to 1."

This is because we are running one program per node, worker.pl, and
giving it (typically) the full "depth" of the node (eg 24 cores) to
manage?

pbs.aprun;pbs.mpp;depth=24

I find this attribute confusing. I assume this means:

pbs.aprun: generate an aprun command for Crays in the pbs submit file.

pbs.mpp: generate Cray-style mppwidth and mppnppn attributes?
Of what value? Can a value other than 1 be specified?

depth= : why does depth not have a pbs.depth prefix?

ppn: how would that be specified?

Leave off ppn only for Cray? What other PBS attributes can be specified,
and how are they processed?

On the PBS page above the Cray page you say:

...
pbs.properties: Adds the value to the end of the "#PBS -l " line. See Fusion for an example use case
pbs.mpp: If value is set, use mppwidth/mppnppn instead of nodes/ppn in PBS submit file
pbs.aprun: If value is set, use aprun-based command line
pbs.resources: Adds the value to a new "#PBS -l" line.

Is the key always providerAttributes?

pbs.mpp: by "if value is set" do you mean "if this string is present in
the tag's value, separated from other strings by semicolons"?

pbs.properties and pbs.resources - are they the same except for where
they are inserted? Are these followed by "pbs -l" attributes such as
"nodes=10:ppn=4"?

I think several examples are needed to be able to understand how to use
(and document) these.

Could you elaborate a bit in the swiftdevel page? Or are these
documented already elsewhere?

Thanks,

- Mike

----- Original Message -----
> From: "Justin M Wozniak"
> To: "David Kelly"
> Cc: "Swift Devel"
> Sent: Monday, October 17, 2011 10:02:15 AM
> Subject: Re: [Swift-devel] Where is latest doc on running Swift on Beagle? Covers OpenMP apps?
> My notes about Beagle are at:
>
> https://sites.google.com/site/swiftdevel/sites/pbs/cray
>
> and the Beagle sub-page.
>
> Let me know if you get stuck on anything.
>
> Justin
>
> On Sun, 16 Oct 2011, David Kelly wrote:
>
> >
> > Yep - I was in the process of migrating the automated SVN jobs to
> > the swift user, but it looks like it wasn't running correctly due to
> > filesystem permissions. I am manually running the update now. It
> > should be updated within 15 minutes or so.
> >
> > David
> >
> > ----- Original Message -----
> >> From: "Ketan Maheshwari"
> >> To: "Michael Wilde"
> >> Cc: "David Kelly", "Swift Devel"
> >> Sent: Sunday, October 16, 2011 12:42:23 PM
> >> Subject: Re: Where is latest doc on running Swift on Beagle? Covers OpenMP apps?
> >> On Sun, Oct 16, 2011 at 11:15 AM, Michael Wilde <wilde at mcs.anl.gov> wrote:
> >>
> >> David, Ketan,
> >>
> >> I need to run some things on Beagle, asap.
> >>
> >> Ketan, where is the latest and best documentation for this? I see
> >> your edits below to the 0.93 Site Guide. But I don't see that online
> >> where I would expect it:
> >>
> >> http://www.ci.uchicago.edu/swift/wwwdev/guides/release-0.93/siteguide/siteguide.html#_beagle
> >>
> >> David, is it just that this document is not being correctly pushed
> >> to the wwwdev site on a nightly basis?
> >>
> >> That seems to be the case. I have committed a little change just
> >> now; maybe that will trigger a doc build. The link you mentioned is
> >> not the latest for Swift on Beagle. See this one, which has
> >> documentation for scaling up runs on Beagle:
> >>
> >> http://www.ci.uchicago.edu/~ketan/swift-docs/release-0.93/siteguide/siteguide.html#_beagle
> >>
> >> Ketan, is the latest info on running Swift on Beagle now all in the
> >> siteguide? Is the info you were putting in the cookbook (I see many
> >> commits there) now all consolidated into the Site Guide? And is there
> >> a difference in sites.xml settings between 0.93 and trunk? Lastly,
> >> which release works best?
> >>
> >> Yes, the sitesguide for release-0.93 is the latest on Swift Beagle
> >> documentation. My cookbook info is all consolidated on sitesguide.
> >> There is no difference between sites file for 0.93 and trunk.
> >>
> >> Regards,
> >> Ketan
> >>
> >> Second question: I need to run a script that executes many 24-core
> >> OpenMP apps. Is the necessary support for this in 0.93? What if any
> >> declarations do I need other than to say jobsPerNode=1? Glen, are you
> >> running OpenMP on Beagle and if so what release and sites file are you
> >> using?
> >>
> >> I'm assuming Justin's latest changes to sites.xml are in trunk but
> >> not 0.93? If that is correct, is there a corresponding sites file for
> >> Beagle for trunk?
> >>
> >> Thanks,
> >>
> >> - Mike
> >>
> >> ----- Forwarded Message -----
> >> From: ketan at ci.uchicago.edu
> >> To: swift-commit at ci.uchicago.edu
> >> Sent: Sunday, September 18, 2011 10:14:10 PM
> >> Subject: [Swift-commit] r5126 - branches/release-0.93/docs/siteguide
> >>
> >> Author: ketan
> >> Date: 2011-09-18 22:14:10 -0500 (Sun, 18 Sep 2011)
> >> New Revision: 5126
> >>
> >> Modified:
> >> branches/release-0.93/docs/siteguide/beagle
> >> Log:
> >> added content to beagle siteguide
> >>
> >> Modified: branches/release-0.93/docs/siteguide/beagle
> >> ===================================================================
> >> --- branches/release-0.93/docs/siteguide/beagle 2011-09-19 02:41:02 UTC (rev 5125)
> >> +++ branches/release-0.93/docs/siteguide/beagle 2011-09-19 03:14:10 UTC (rev 5126)
> >> @@ -52,9 +52,38 @@
> >> A key factor in scaling up Swift runs on Beagle is to setup the sites.xml parameters.
> >> The following sites.xml parameters must be set to scale that is intended for a large run:
> >>
> >> - * walltime: The expected walltime for completion of your run. This parameter is accepted in seconds.
> >> - * slots: Number of qsub jobs needs to be submitted by swift. This number will determine how many qsubs swift will submit for your run. Typical values range between 40 and 80 for large runs.
> >> - * nodegranularity: Determines the number of nodes per job. Total nodes will thus be slots times nodegranularity. This may vary for advanced configurations though.
> >> - * maxnodes: Determines the maximum number of nodes a job must pack into its qsub. This parameter determines the largest single job that your run will submit.
> >> + * *maxTime* : The expected walltime for completion of your run. This parameter is accepted in seconds.
> >> + * *slots* : Number of qsub jobs needs to be submitted by swift. This number will determine how many qsubs swift will submit for your run. Typical values range between 40 and 80 for large runs.
> >> + * *nodeGranularity* : Determines the number of nodes per job. Total nodes will thus be slots times nodegranularity. This may vary for advanced configurations though.
> >> + * *maxNodes* : Determines the maximum number of nodes a job must pack into its qsub. This parameter determines the largest single job that your run will submit.
> >> + * *jobThrottle* : A factor that determines the number of tasks dispatched simultaneously. The intended number of simultaneous tasks must match the number of cores targeted. The number of tasks is calculated from the jobThrottle factor is as follows:
> >>
> >> +----
> >> +Number of Tasks = (JobThrottle x 100) + 1
> >> +----
> >>
> >> +Following is an example sites.xml for a 50 slots run with each slot occupying 4 nodes (thus, a 200 node run):
> >> +
> >> +-----
> >> +
> >> +
> >> +
> >> + CI-CCR000013
> >> +
> >> + 24:cray:pack
> >> +
> >> + 24
> >> + 50000
> >> + 50
> >> + 4
> >> + 4
> >> +
> >> + 48.00
> >> + 10000
> >> +
> >> +
> >> + /lustre/beagle/ketan/swift.workdir
> >> +
> >> +
> >> +-----
> >> +
> >>
> >> _______________________________________________
> >> Swift-commit mailing list
> >> Swift-commit at ci.uchicago.edu
> >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-commit
> >>
> >> --
> >> Michael Wilde
> >> Computation Institute, University of Chicago
> >> Mathematics and Computer Science Division
> >> Argonne National Laboratory
> >>
> >> --
> >> Ketan
> >
>
> --
> Justin M Wozniak

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From wozniak at mcs.anl.gov Tue Oct 18 08:51:15 2011
From: wozniak at mcs.anl.gov (Justin M Wozniak)
Date: Tue, 18 Oct 2011 08:51:15 -0500 (Central Daylight Time)
Subject: [Swift-devel] Where is latest doc on running Swift on Beagle? Covers OpenMP apps?
In-Reply-To: <602206793.104597.1318912857613.JavaMail.root@zimbra.anl.gov> References: <602206793.104597.1318912857613.JavaMail.root@zimbra.anl.gov> Message-ID: On Mon, 17 Oct 2011, Michael Wilde wrote: > Justin, I have a few questions about this page: > > "If using Coasters, the count attribute is the number of running > worker.pl scripts (1); this is the number of nodes to use." > > What is the "count" attribute? That's an internally-used attribute- I will clarify this. > "The ppn attribute is available but should be left to default to 1." > > This is because we are running one program per node, worker.pl, and > giving it (typically) the full "depth" of the node (eg 24 cores) to > manage? Yes. > > pbs.aprun;pbs.mpp;depth=24 > > > I find this attribute confusing. I assume this means: > > pbs.aprun: generate an aprun command for Crays in the pbs submit file. Yes. The provider attributes concept is described here: https://sites.google.com/site/swiftdevel/internals/providers/coasters-provider > pbs.mpp: generate Cray-style mppwidth and mppnppn attributes? > Of what value? Can a value other than 1 be specified? This takes the normal width/ppn attributes and uses them in mppwidth/mppnppn format. > depth= : why does > depth not have a pbs.depth prefix? I thought we would leave the depth unprefixed because other schedulers could conceivably have depth. > ppn: how would that be specified? Use attribute "ppn". I can add a note about this. > Leave off ppn only for Cray? What other PBS attributes can be specified, > and how are they processed? Leave it to default to 1. If you need additional settings, let me know. > On the PBS page above the Cray page you say: > > ... > pbs.properties: Adds the value to the end of the "#PBS -l " line. See Fusion for an example use case > pbs.mpp: If value is set, use mppwidth/mppnppn instead of nodes/ppn in PBS submit file > pbs.aprun: If value is set, use aprun-based command line > pbs.resources: Adds the value to a new "#PBS -l" line. 
> > Is the key always providerAttributes? See the provider attributes page. > pbs.mpp by "if value is set" you mean "if this string is present in the > tags value, separated from other strings by semicolons"? Idem. > pbs.properties and pbs.resources - are the same except for where they > are inserted? Are these followed by a pbs -l attributes such as > "nodes=10:ppn=4"? Yes. > I think several examples are needed to be able to understand how to use > (and document) these. > > Could you elaborate a bit in the swiftdevel page? Or are these > documented already elsewhere? Sure. > > ----- Original Message ----- >> From: "Justin M Wozniak" >> To: "David Kelly" >> Cc: "Swift Devel" >> Sent: Monday, October 17, 2011 10:02:15 AM >> Subject: Re: [Swift-devel] Where is latest doc on running Swift on Beagle? Covers OpenMP apps? >> My notes about Beagle are at: >> >> https://sites.google.com/site/swiftdevel/sites/pbs/cray >> >> and the Beagle sub-page. >> >> Let me know if you get stuck on anything. >> >> Justin >> >> On Sun, 16 Oct 2011, David Kelly wrote: >> >>> >>> Yep - I was in the process of migrating the automated SVN jobs to >>> the swift user, but it looks like it wasn't running correctly due to >>> filesystem permissions. I am manually running the update now. It >>> should be updated within 15 minutes or so. >>> >>> David >>> >>> ----- Original Message ----- >>>> From: "Ketan Maheshwari" >>>> To: "Michael Wilde" >>>> Cc: "David Kelly" , "Swift Devel" >>>> >>>> Sent: Sunday, October 16, 2011 12:42:23 PM >>>> Subject: Re: Where is latest doc on running Swift on Beagle? Covers >>>> OpenMP apps? >>>> On Sun, Oct 16, 2011 at 11:15 AM, Michael Wilde < wilde at mcs.anl.gov >>>>> >>>> wrote: >>>> >>>> >>>> David, Ketan, >>>> >>>> I need to run some things on Beagle, asap. >>>> >>>> Ketan, where is the latest and best documentation for this? I see >>>> your >>>> edits below to the 0.93 Site Guide. 
But I dont see that online >>>> where I >>>> would expect it: >>>> >>>> http://www.ci.uchicago.edu/swift/wwwdev/guides/release-0.93/siteguide/siteguide.html#_beagle >>>> >>>> David, is it just that this document is not being correctly pushed >>>> to >>>> the wwwdev site on a nightly basis? >>>> >>>> >>>> >>>> That seems to be the case. I have committed a little change just >>>> now, >>>> may be that will trigger a doc build. The link you mentioned is not >>>> the latest for Swift on Beagle. See this one which has >>>> documentation >>>> for scaling up runs on Beagle: >>>> >>>> >>>> http://www.ci.uchicago.edu/~ketan/swift-docs/release-0.93/siteguide/siteguide.html#_beagle >>>> >>>> >>>> >>>> Ketan, is the latest info on running Swift on Beagle now all in the >>>> siteguide? Is the info you were putting in the cookbook (I see many >>>> commits there) now all consolidated into the Site Guide? And is >>>> there >>>> a difference in sites.xml settings between 0.93 and trunk? Lastly, >>>> which release works best? >>>> >>>> >>>> >>>> Yes, the sitesguide for release-0.93 is the latest on Swift Beagle >>>> documentation. My cookbook info is all consolidated on sitesguide. >>>> There is no difference between sites file for 0.93 and trunk. >>>> >>>> >>>> Regards, >>>> Ketan >>>> >>>> >>>> >>>> >>>> Second question: I need to run a script that executes many 24-core >>>> OpenMP apps. Is the necessary support for this in 0.93? What if any >>>> declarations do I need other than to say jobsPerNode=1? Glen, are >>>> you >>>> running OpenMP on Beagle and if so what release and sites file are >>>> you >>>> using? >>>> >>>> Im assuming Justin's latest changes to sites.xml are in trunk but >>>> not >>>> 0.93? If that is correct, is there a corresponding site site for >>>> Beagle for trunk? 
>>>> >>>> Thanks, >>>> >>>> - Mike >>>> >>>> >>>> ----- Forwarded Message ----- >>>> From: ketan at ci.uchicago.edu >>>> To: swift-commit at ci.uchicago.edu >>>> Sent: Sunday, September 18, 2011 10:14:10 PM >>>> Subject: [Swift-commit] r5126 - >>>> branches/release-0.93/docs/siteguide >>>> >>>> Author: ketan >>>> Date: 2011-09-18 22:14:10 -0500 (Sun, 18 Sep 2011) >>>> New Revision: 5126 >>>> >>>> Modified: >>>> branches/release-0.93/docs/siteguide/beagle >>>> Log: >>>> added content to beagle siteguide >>>> >>>> Modified: branches/release-0.93/docs/siteguide/beagle >>>> =================================================================== >>>> --- branches/release-0.93/docs/siteguide/beagle 2011-09-19 02:41:02 >>>> UTC (rev 5125) >>>> +++ branches/release-0.93/docs/siteguide/beagle 2011-09-19 03:14:10 >>>> UTC (rev 5126) >>>> @@ -52,9 +52,38 @@ >>>> A key factor in scaling up Swift runs on Beagle is to setup the >>>> sites.xml parameters. >>>> The following sites.xml parameters must be set to scale that is >>>> intended for a large run: >>>> >>>> - * walltime: The expected walltime for completion of your run. >>>> This >>>> parameter is accepted in seconds. >>>> - * slots: Number of qsub jobs needs to be submitted by swift. This >>>> number will determine how many qsubs swift will submit for your >>>> run. >>>> Typical values range between 40 and 80 for large runs. >>>> - * nodegranularity: Determines the number of nodes per job. Total >>>> nodes will thus be slots times nodegranularity. This may vary for >>>> advanced configurations though. >>>> - * maxnodes: Determines the maximum number of nodes a job must >>>> pack >>>> into its qsub. This parameter determines the largest single job >>>> that >>>> your run will submit. >>>> + * *maxTime* : The expected walltime for completion of your run. >>>> This >>>> parameter is accepted in seconds. >>>> + * *slots* : Number of qsub jobs needs to be submitted by swift. 
>>>> This >>>> number will determine how many qsubs swift will submit for your >>>> run. >>>> Typical values range between 40 and 80 for large runs. >>>> + * *nodeGranularity* : Determines the number of nodes per job. >>>> Total >>>> nodes will thus be slots times nodegranularity. This may vary for >>>> advanced configurations though. >>>> + * *maxNodes* : Determines the maximum number of nodes a job must >>>> pack into its qsub. This parameter determines the largest single >>>> job >>>> that your run will submit. >>>> + * *jobThrottle* : A factor that determines the number of tasks >>>> dispatched simultaneously. The intended number of simultaneous >>>> tasks >>>> must match the number of cores targeted. The number of tasks is >>>> calculated from the jobThrottle factor is as follows: >>>> >>>> +---- >>>> +Number of Tasks = (JobThrottle x 100) + 1 >>>> +---- >>>> >>>> +Following is an example sites.xml for a 50 slots run with each >>>> slot >>>> occupying 4 nodes (thus, a 200 node run): >>>> + >>>> +----- >>>> + >>>> + >>>> + >>>> + CI-CCR000013 >>>> + >>>> + 24:cray:pack >>>> + >>>> + 24 >>>> + 50000 >>>> + 50 >>>> + 4 >>>> + 4 >>>> + >>>> + 48.00 >>>> + 10000 >>>> + >>>> + >>>> + >>>> /lustre/beagle/ketan/swift.workdir >>>> + >>>> + >>>> +----- >>>> + >>>> >>>> _______________________________________________ >>>> Swift-commit mailing list >>>> Swift-commit at ci.uchicago.edu >>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-commit >>>> >>>> -- >>>> Michael Wilde >>>> Computation Institute, University of Chicago >>>> Mathematics and Computer Science Division >>>> Argonne National Laboratory >>>> >>>> >>>> >>>> >>>> >>>> -- >>>> Ketan >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel >>> >> >> -- >> Justin M Wozniak >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at 
ci.uchicago.edu
>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
>
>
--
Justin M Wozniak

From ketancmaheshwari at gmail.com Tue Oct 18 11:22:11 2011
From: ketancmaheshwari at gmail.com (Ketan Maheshwari)
Date: Tue, 18 Oct 2011 11:22:11 -0500
Subject: [Swift-devel] swift module update for Beagle
In-Reply-To: <2113493162.103984.1318889615741.JavaMail.root@zimbra.anl.gov>
References: <500432590.103942.1318888752966.JavaMail.root@zimbra.anl.gov> <2113493162.103984.1318889615741.JavaMail.root@zimbra.anl.gov>
Message-ID:

On Mon, Oct 17, 2011 at 5:13 PM, Michael Wilde wrote:

> > The module 0.92 of Beagle is actually 0.92.1; this is the case
> > because, after the version change, one needs to contact the admin (Ti
> > in this case) in order to update a module file. Since the change was
> > minor, he suggested, I overwrite 0.92.1 on 0.92.
>
> I think moving forward it would be good to create specific module versions
> so there is no confusion as to what's being executed.
>
> So we should have in retrospect created an 0.92.1 module.
>
> > I have 0.93 in place on Beagle. We just need to contact him if this
> > would be the RC for Beagle.
>
> Let's get module versions of 0.93rc2, 0.93, and "trunk" (if we can) in
> place.
>
> (If there are no objections or counter-proposals)
>
> Can you initiate that, Ketan, and also document on swiftdevel how to manage
> Beagle modules so that David and others can do this as part of release
> management?
>

This is done here:
https://sites.google.com/site/swiftdevel/release-plans/managing-modules-on-beagle

David, feel free to get back to me if anything is not clear in the points mentioned.

>
> Thanks,
>
> - Mike
>
> > Regards,
> > Ketan
> >
> > ----- Original Message -----
> > From: "Michael Wilde"
> > To: "Ketan Maheshwari" , "David Kelly"
> > Cc: "Swift Devel"
> > Sent: Monday, October 17, 2011 4:55:31 PM
> > Subject: swift module update for Beagle
> >
> > Ketan, do you maintain the swift module on Beagle?
> >
> > I see that module load swift gives me what it calls "0.92" but what I
> > think/hope is 0.92.1
> >
> > Can you or David add modules swift/0.93RC2 and swift/trunk for
> > testing? (We're adding two new user groups on Beagle this week and I'd
> > like to have them use Swift via modules from the start rather than
> > private builds).
> >
> > Thanks,
> >
> > - Mike
> >
> > --
> > Michael Wilde
> > Computation Institute, University of Chicago
> > Mathematics and Computer Science Division
> > Argonne National Laboratory
>
> --
> Michael Wilde
> Computation Institute, University of Chicago
> Mathematics and Computer Science Division
> Argonne National Laboratory
>

--
Ketan

From skenny at uchicago.edu Thu Oct 20 06:07:09 2011
From: skenny at uchicago.edu (Sarah Kenny)
Date: Thu, 20 Oct 2011 04:07:09 -0700
Subject: [Swift-devel] [Swift-user] gram on ranger
In-Reply-To: <1318446824.18036.0.camel@blabla>
References: <1908023276.141555.1318358942817.JavaMail.root@zimbra-mb2.anl.gov> <1318375406.2770.0.camel@blabla> <1318446824.18036.0.camel@blabla>
Message-ID:

hi all, one of our users, anjali (cc'd here) is trying to submit this ~400k job workflow to ranger...thought i'd see if you felt like having a look :)

log is here:
/home/skenny/swift_logs/corr_multisubj-20111018-1321-ihf8hz5g.log

sites file:
7200 00:20:00 1 64 256 development 1.28 TG-DBS080004N 16way 10000 /work/00926/tg459516/swiftwork

On Wed, Oct 12, 2011 at 12:13 PM, Mihael Hategan wrote:

> On Tue, 2011-10-11 at 17:13 -0700, Sarah Kenny wrote:
> >
> > On Tue, Oct 11, 2011 at 4:23 PM, Mihael Hategan wrote:
> > Is this with a persistent coaster service?
> >
> > admittedly i have not used persistent coaster service...should i?
>
> No.
I was just trying to figure out whether it might be something > related to the persistent version. > > > i feel like it's documented *somewhere* (?) > > > > for now i've tried setting 'sitedir.keep=true' in the config so maybe > > it won't try to run the cleanup job...we'll see (waiting in q) > > > > > > > > On Tue, 2011-10-11 at 12:05 -0700, Sarah Kenny wrote: > > > > > > > > > On Tue, Oct 11, 2011 at 11:49 AM, David Kelly > > > > > wrote: > > > > > > That could be it.. maybe a cleanup script is not > > getting the > > > right parameters and failing. Do you happen to have > > a copy of > > > the coaster log? > > > > > > just put it in /home/skenny/swift_logs > > > > > > > > > Maybe there will be some clues in there. > > > > > > ----- Original Message ----- > > > > From: "Sarah Kenny" > > > > > > > To: "David Kelly" > > > > Cc: "Swift Devel" , > > "Swift > > > User" , "Justin M > > Wozniak" > > > > > > > > > > > Sent: Tuesday, October 11, 2011 1:32:37 PM > > > > Subject: Re: [Swift-user] gram on ranger > > > > > > > so, this workflow completes all the jobs but then > > just hangs > > > > indefinitely at the end...maybe a stray cleanup > > job? 
> > > > > > > > log is here: > > > > > > > > > > /home/skenny/swift_logs/corr-20111010-2104-fl5yngd9.log > > > > > > > > just tweaked the sites file a bit from what david > > sent me: > > > > > > > > > > > > > > > > > url=" > > > > gatekeeper.ranger.tacc.teragrid.org "/> > > > > > > > > > > > > key="maxtime">28800 > > > > > > key="maxWallTime">00:15:00 > > > > > key="jobsPerNode">1 > > > > > > key="nodeGranularity">64 > > > > > key="maxNodes">256 > > > > > key="queue">normal > > > > > key="jobThrottle">1 > > > > > > key="project">TG-DBS080004N > > > > > key="pe">16way > > > > > > key="initialScore">10000 > > > > > > > > > > /work/00043/tg457040/sidgrid_out/skenny > > > > > > > > > > > > > > > > > > > > > > > > On Mon, Oct 10, 2011 at 3:43 PM, Sarah Kenny < > > > skenny at uchicago.edu > > > > > wrote: > > > > > > > > > > > > ok, thanks, got in the queue now...also, realized > > my last > > > run may have > > > > been using the old swift. apparently i had > > SWIFT_HOME set in > > > my env > > > > and that overrides the newer swift i had set in my > > PATH. > > > > > > > > ~sk > > > > > > > > > > > > > > > > On Mon, Oct 10, 2011 at 12:28 PM, David Kelly < > > > davidk at ci.uchicago.edu > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > Sarah, > > > > > > > > Can you give this another try with the latest > > 0.93? I made > > > some > > > > changes to the coaster and sge providers and was > > able to get > > > it > > > > working with a simple catns script. 
Here is the > > > configuration file I > > > > was using: > > > > > > > > > > > > > > > > > url=" > > > > gatekeeper.ranger.tacc.teragrid.org "/> > > > > > > > > > > > > > > > > key="maxtime">3600 > > > > > > key="maxWallTime">00:00:03 > > > > > key="jobsPerNode">1 > > > > > > key="nodeGranularity">16 > > > > > key="maxNodes">16 > > > > > > key="queue">development > > > > > key="jobThrottle">0.9 > > > > > > > > > > key="project">TG-DBS080004N > > > > > > > > > key="pe">16way > > > > > > > > > > /share/home/01503/davidkel/swiftwork > > > > > > > > > > > > > > > > Thanks, > > > > > > > > David > > > > > > > > ----- Original Message ----- > > > > > > > > > From: "Sarah Kenny" < skenny at uchicago.edu > > > > > > To: "Justin M Wozniak" < wozniak at mcs.anl.gov > > > > > > Cc: "Swift Devel" < swift-devel at ci.uchicago.edu > > >, "Swift > > > User" < > > > > > swift-user at ci.uchicago.edu > > > > > > > > > > > > > > > > > > Sent: Friday, October 7, 2011 3:13:57 PM > > > > > Subject: Re: [Swift-user] gram on ranger > > > > > > > /home/skenny/swift_logs/dummy-20111005-0126-6575n7x5.log > > > > > > > > > > on ci > > > > > > > > > > > > > > > On Fri, Oct 7, 2011 at 8:16 AM, Justin M Wozniak > > < > > > > > wozniak at mcs.anl.gov > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > Can I take a look at the log? > > > > > > > > > > > > > > > > > > > > > > > > > On Thu, 6 Oct 2011, Sarah Kenny wrote: > > > > > > > > > > > > > > > > > > > > hey all, i'm trying to submit to gram on ranger > > using the > > > latest > > > > > swift > > > > > (built from trunk). it failes like so: > > > > > > > > > > Cannot submit job > > > > > Caused by: > > > > > org.globus.cog.abstraction. impl.common.task. 
> > > > > TaskSubmissionException: > > > > > Cannot > > > > > submit job > > > > > Caused by: org.globus.gram.GramException: > > Parameter not > > > supported > > > > > Cannot submit job > > > > > > > > > > the gram log was saying first that 'jobsPerNode' > > is not > > > supported so > > > > > i > > > > > changed it to workersPerNode and then it was > > saying > > > 'maxnodes' is > > > > > not > > > > > supported. here's my sites file: > > > > > > > > > > > > > > > > > > > > > key="initialScore">10000 > > profile> > > > > > > key="jobThrottle">1 > > > > > > key="maxWallTime">00:15:00 > > profile> > > > > > > key="maxTime">86400 > > > > > > key="slots">1 > > > > > > key="maxNodes">256 > > > > > > key="pe">16way > > > > > > key="workersPerNode">1 > > profile> > > > > > > key="nodeGranularity">64 > > profile> > > > > > > key="queue">normal > > > > > > key="project">TG-DBS080004N > > profile> > > > > > > > > > > > > > > > jobManager="gt2:gt2:SGE" > > > url=" > > > > > gatekeeper.ranger.tacc. teragrid.org "/> > > > > > > > > > > > > > > /work/00043/ > > tg457040 > > > > > > > > > > > > > > > > > > > > > > > > thoughts? ideas? > > > > > > > > > > -- > > > > > Justin M Wozniak > > > > > > > > > > > > > > > > > > > > -- > > > > > Sarah Kenny > > > > > Programmer ~ Brain Circuits Laboratory ~ Rm 2224 > > Bio Sci > > > III > > > > > University of California Irvine, Dept. of > > Neurology ~ > > > 773-818-8300 > > > > > > > > > > > > > > > _______________________________________________ > > > > > Swift-user mailing list > > > > > Swift-user at ci.uchicago.edu > > > > > > > > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > Sarah Kenny > > > > Programmer ~ Brain Circuits Laboratory ~ Rm 2224 > > Bio Sci III > > > > University of California Irvine, Dept. 
of > > Neurology ~ > > > 773-818-8300
> > > >

--
Sarah Kenny
Programmer ~ Brain Circuits Laboratory ~ Rm 2224 Bio Sci III
University of California Irvine, Dept. of Neurology ~ 773-818-8300

From wilde at mcs.anl.gov Thu Oct 20 07:50:39 2011
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 20 Oct 2011 07:50:39 -0500 (CDT)
Subject: [Swift-devel] [Swift-user] gram on ranger
In-Reply-To:
Message-ID: <1630476647.112879.1319115039205.JavaMail.root@zimbra.anl.gov>

Hi Sarah, Anjali,

My initial theory on what's failing in this job is that the Ranger development queue is limited to jobs of 16 nodes or less. (The Ranger User Guide says maxprocs 256 for that queue, and qconf -sq development says slots 16, which agrees). So you need to either change to one of the production queues (normal, long etc) or reduce the values of maxNodes and nodeGranularity.

I would also suggest (unless you have already done this) that you test first on a very small run (like a single RInvoke app call) and then scale up to just a few voxels per dataset before trying such a large run. Have you already tested that?
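The queue-limit theory above can be checked before submitting. The following is only a minimal sketch in Python, not part of Swift: the function names are hypothetical, the 16-node limit is the unconfirmed theory from this message, and it assumes the values 64 and 256 in the posted sites file are nodeGranularity and maxNodes, in the same order as the earlier sites file in this thread. The task-count formula is the one from the Beagle siteguide commit quoted earlier in this archive.

```python
# Hypothetical pre-flight check; none of these names come from Swift itself.

def concurrent_tasks(job_throttle):
    """Siteguide formula: Number of Tasks = (jobThrottle x 100) + 1."""
    # round() guards against floating-point noise such as 0.9 * 100.
    return int(round(job_throttle * 100)) + 1


def block_size_problems(max_nodes, node_granularity, queue_node_limit):
    """Return a message for each setting that exceeds the queue's node limit."""
    problems = []
    if max_nodes > queue_node_limit:
        problems.append(
            "maxNodes=%d exceeds the %d-node queue limit"
            % (max_nodes, queue_node_limit)
        )
    if node_granularity > queue_node_limit:
        problems.append(
            "nodeGranularity=%d exceeds the %d-node queue limit"
            % (node_granularity, queue_node_limit)
        )
    return problems


if __name__ == "__main__":
    # Assumed reading of the posted sites file: nodeGranularity=64,
    # maxNodes=256, jobThrottle=1.28, queue=development (limit assumed 16).
    for problem in block_size_problems(256, 64, 16):
        print(problem)
    print("concurrent tasks:", concurrent_tasks(1.28))
```

If the limit turns out to be counted in cores rather than nodes, the same check applies with the limit expressed in the matching unit.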
Lastly, when reporting problems like this, the swift standard output/err is also very helpful to get a higher-level view of what went wrong.

Swift needs to clearly return errors from the local resource provider, which it doesn't seem to be doing here. I've filed this as bug 593 and assigned it to David.

Please let us know if changing the queue and/or slots resolves the problem. As mentioned in the bug report I think you can set debug=true (or yes?) in the provider-sge.properties file and get swift to preserve the output from SGE in ~/.globus/scripts. (In fact that may already be preserved, I am not sure). Please check there to see if the SGE error is there.

Thanks,

- Mike

----- Original Message -----
> From: "Sarah Kenny"
> To: "Mihael Hategan"
> Cc: "Anjali Raja" , "Swift Devel" , "Swift User"
> Sent: Thursday, October 20, 2011 6:07:09 AM
> Subject: Re: [Swift-devel] [Swift-user] gram on ranger
> hi all, one of our users, anjali (cc'd here) is trying to submit this
> ~400k job workflow to ranger...thought i'd see if you felt like having
> a look :)
>
> log is here:
> /home/skenny/swift_logs/corr_multisubj-20111018-1321-ihf8hz5g.log
>
> sites file:
> 7200 00:20:00 1 64 256 development 1.28 TG-DBS080004N 16way 10000 /work/00926/tg459516/swiftwork

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From ketancmaheshwari at gmail.com Thu Oct 20 09:54:33 2011
From: ketancmaheshwari at gmail.com (Ketan Maheshwari)
Date: Thu, 20 Oct 2011 09:54:33 -0500
Subject: [Swift-devel] [Swift-user] gram on ranger
In-Reply-To: <1630476647.112879.1319115039205.JavaMail.root@zimbra.anl.gov>
References: <1630476647.112879.1319115039205.JavaMail.root@zimbra.anl.gov>
Message-ID:

On Thu, Oct 20, 2011 at 7:50 AM, Michael Wilde wrote:

> Hi Sarah, Anjali,
>
> My initial theory on what's failing in this job is that the Ranger development queue is limited to jobs of 16 nodes or less.
(The Ranger User > Guide says maxprocs 256 for that queue, and qconf -sq development says slots > 16, which agrees). So you need to either change to one of the production > queues (normal, long etc) or reduce the values of maxnode and > nodegranularity. > I have a little confusion here: the desired line in the final pbs script should be : #$ -pe way 256; in order to have 256 procs, however, putting maxnodes=16 on sites.xml results in the following line on pbs: #$ -pe way 16; I understand this number 16/256 is for procs since, when putting 256 with development queue, ranger indeed allows the job to run in development queue. > > I would also suggest (unless you have already done this) that you test > first on a very small run (like a single RInvoke app call) and then scale up > to just a few voxels per dataset before trying such a large run. Have you > already tested that? > > Lastly, when reporting problems like this, the swift standard output/err is > also very helpful to get a higher-level view of what went wrong. > > Swift needs to clearly return errors from the local resource provider, > which it doesnt seem to be doing here. Ive filed this as bug 593 and > assigned to David. > > Please let us know if changing the queue and/or slots resolves the problem. > As mentioned in the bug report I think you can set debug=true (or yes?) in > the provider-sge.properties file and get swift to preserve the output from > SGE in ~/.globus/scripts. (In fact that may already be preserved, I am not > sure). Please check there to see if the SGE error is there. 
> > Thanks, > > - Mike > > > ----- Original Message ----- > > From: "Sarah Kenny" > > To: "Mihael Hategan" > > Cc: "Anjali Raja" , "Swift Devel" < > swift-devel at ci.uchicago.edu>, "Swift User" > > > > Sent: Thursday, October 20, 2011 6:07:09 AM > > Subject: Re: [Swift-devel] [Swift-user] gram on ranger > > hi all, one of our users, anjali (cc'd here) is trying to submit this > > ~400k job workflow to ranger...thought i'd see if you felt like having > > a look :) > > > > log is here: > > /home/skenny/swift_logs/corr_multisubj-20111018-1321-ihf8hz5g.log > > > > sites file: > > > > > > > > > > > > 7200 > > 00:20:00 > > 1 > > 64 > > 256 > > development > > 1.28 > > TG-DBS080004N > > 16way > > 10000 > > /work/00926/tg459516/swiftwork > > > > > > > > > > On Wed, Oct 12, 2011 at 12:13 PM, Mihael Hategan < hategan at mcs.anl.gov > > > wrote: > > > > > > > > On Tue, 2011-10-11 at 17:13 -0700, Sarah Kenny wrote: > > > > > > > > > On Tue, Oct 11, 2011 at 4:23 PM, Mihael Hategan < > > > hategan at mcs.anl.gov > > > > wrote: > > > Is this with a persistent coaster service? > > > > > > admittedly i have not used persistent coaster service...should i? > > > > No. I was just trying to figure out whether it might be something > > related to the persistent version. > > > > > > > > > > > i feel like it's documented *somewhere* (?) > > > > > > for now i've tried setting 'sitedir.keep=true' in the config so > > > maybe > > > it won't try to run the cleanup job...we'll see (waiting in q) > > > > > > > > > > > > On Tue, 2011-10-11 at 12:05 -0700, Sarah Kenny wrote: > > > > > > > > > > > > On Tue, Oct 11, 2011 at 11:49 AM, David Kelly > > > < davidk at ci.uchicago.edu > > > > > wrote: > > > > > > > > That could be it.. maybe a cleanup script is not > > > getting the > > > > right parameters and failing. Do you happen to have > > > a copy of > > > > the coaster log? > > > > > > > > just put it in /home/skenny/swift_logs > > > > > > > > > > > > Maybe there will be some clues in there. 
> > > > > > > > ----- Original Message ----- > > > > > From: "Sarah Kenny" < skenny at uchicago.edu > > > > > > > > > > To: "David Kelly" < davidk at ci.uchicago.edu > > > > > > Cc: "Swift Devel" < swift-devel at ci.uchicago.edu >, > > > "Swift > > > > User" < swift-user at ci.uchicago.edu >, "Justin M > > > Wozniak" > > > > > < wozniak at mcs.anl.gov > > > > > > > > > > Sent: Tuesday, October 11, 2011 1:32:37 PM > > > > > Subject: Re: [Swift-user] gram on ranger > > > > > > > > > so, this workflow completes all the jobs but then > > > just hangs > > > > > indefinitely at the end...maybe a stray cleanup > > > job? > > > > > > > > > > log is here: > > > > > > > > > > > > > /home/skenny/swift_logs/corr-20111010-2104-fl5yngd9.log > > > > > > > > > > just tweaked the sites file a bit from what david > > > sent me: > > > > > > > > > > > > > > > > > > > > > > url=" > > > > > gatekeeper.ranger.tacc.teragrid.org "/> > > > > > > > > > > > > > > > > key="maxtime">28800 > > > > > > > > key="maxWallTime">00:15:00 > > > > > > > key="jobsPerNode">1 > > > > > > > > key="nodeGranularity">64 > > > > > > > key="maxNodes">256 > > > > > > > key="queue">normal > > > > > > > key="jobThrottle">1 > > > > > > > > key="project">TG-DBS080004N > > > > > > > key="pe">16way > > > > > > > > key="initialScore">10000 > > > > > > > > > > > > /work/00043/tg457040/sidgrid_out/skenny > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Mon, Oct 10, 2011 at 3:43 PM, Sarah Kenny < > > > > skenny at uchicago.edu > > > > > > wrote: > > > > > > > > > > > > > > > ok, thanks, got in the queue now...also, realized > > > my last > > > > run may have > > > > > been using the old swift. apparently i had > > > SWIFT_HOME set in > > > > my env > > > > > and that overrides the newer swift i had set in my > > > PATH. 
> > > > > > > > > > ~sk > > > > > > > > > > > > > > > > > > > > On Mon, Oct 10, 2011 at 12:28 PM, David Kelly < > > > > davidk at ci.uchicago.edu > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Sarah, > > > > > > > > > > Can you give this another try with the latest > > > 0.93? I made > > > > some > > > > > changes to the coaster and sge providers and was > > > able to get > > > > it > > > > > working with a simple catns script. Here is the > > > > configuration file I > > > > > was using: > > > > > > > > > > > > > > > > > > > > > > url=" > > > > > gatekeeper.ranger.tacc.teragrid.org "/> > > > > > > > > > > > > > > > > > > > > > key="maxtime">3600 > > > > > > > > key="maxWallTime">00:00:03 > > > > > > > key="jobsPerNode">1 > > > > > > > > key="nodeGranularity">16 > > > > > > > key="maxNodes">16 > > > > > > > > key="queue">development > > > > > > > key="jobThrottle">0.9 > > > > > > > > > > > > > key="project">TG-DBS080004N > > > > > > > > > > > > key="pe">16way > > > > > > > > > > > > /share/home/01503/davidkel/swiftwork > > > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > > David > > > > > > > > > > ----- Original Message ----- > > > > > > > > > > > From: "Sarah Kenny" < skenny at uchicago.edu > > > > > > > To: "Justin M Wozniak" < wozniak at mcs.anl.gov > > > > > > > Cc: "Swift Devel" < swift-devel at ci.uchicago.edu > > > >, "Swift > > > > User" < > > > > > > swift-user at ci.uchicago.edu > > > > > > > > > > > > > > > > > > > > > > Sent: Friday, October 7, 2011 3:13:57 PM > > > > > > Subject: Re: [Swift-user] gram on ranger > > > > > > > > > /home/skenny/swift_logs/dummy-20111005-0126-6575n7x5.log > > > > > > > > > > > > on ci > > > > > > > > > > > > > > > > > > On Fri, Oct 7, 2011 at 8:16 AM, Justin M Wozniak > > > < > > > > > > wozniak at mcs.anl.gov > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > Can I take a look at the log? 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Thu, 6 Oct 2011, Sarah Kenny wrote: > > > > > > > > > > > > > > > > > > > > > > > > hey all, i'm trying to submit to gram on ranger > > > using the > > > > latest > > > > > > swift > > > > > > (built from trunk). it failes like so: > > > > > > > > > > > > Cannot submit job > > > > > > Caused by: > > > > > > org.globus.cog.abstraction. impl.common.task. > > > > > > TaskSubmissionException: > > > > > > Cannot > > > > > > submit job > > > > > > Caused by: org.globus.gram.GramException: > > > Parameter not > > > > supported > > > > > > Cannot submit job > > > > > > > > > > > > the gram log was saying first that 'jobsPerNode' > > > is not > > > > supported so > > > > > > i > > > > > > changed it to workersPerNode and then it was > > > saying > > > > 'maxnodes' is > > > > > > not > > > > > > supported. here's my sites file: > > > > > > > > > > > > > > > > > > > > > > > > > > key="initialScore">10000 > > > profile> > > > > > > > > key="jobThrottle">1 > > > > > > > > key="maxWallTime">00:15:00 > > > profile> > > > > > > > > key="maxTime">86400 > > > > > > > > key="slots">1 > > > > > > > > key="maxNodes">256 > > > > > > > > key="pe">16way > > > > > > > > key="workersPerNode">1 > > > profile> > > > > > > > > key="nodeGranularity">64 > > > profile> > > > > > > > > key="queue">normal > > > > > > > > key="project">TG-DBS080004N > > > profile> > > > > > > > > > > > > > > > > > > > jobManager="gt2:gt2:SGE" > > > > url=" > > > > > > gatekeeper.ranger.tacc. teragrid.org "/> > > > > > > > > > > > > > > > > > /work/00043/ > > > tg457040 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > thoughts? ideas? > > > > > > > > > > > > -- > > > > > > Justin M Wozniak > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > Sarah Kenny > > > > > > Programmer ~ Brain Circuits Laboratory ~ Rm 2224 > > > Bio Sci > > > > III > > > > > > University of California Irvine, Dept. 
of > > > Neurology ~ > > > > 773-818-8300 > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > > > Swift-user mailing list > > > > > > Swift-user at ci.uchicago.edu > > > > > > > > > > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > Sarah Kenny > > > > > Programmer ~ Brain Circuits Laboratory ~ Rm 2224 > > > Bio Sci III > > > > > University of California Irvine, Dept. of > > > Neurology ~ > > > > 773-818-8300 > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > Sarah Kenny > > > > > Programmer ~ Brain Circuits Laboratory ~ Rm 2224 > > > Bio Sci III > > > > > University of California Irvine, Dept. of > > > Neurology ~ > > > > 773-818-8300 > > > > > > > > > > > > > > > > > > > > -- > > > > Sarah Kenny > > > > Programmer ~ Brain Circuits Laboratory ~ Rm 2224 Bio Sci III > > > > University of California Irvine, Dept. of Neurology ~ > > > 773-818-8300 > > > > > > > > _______________________________________________ > > > > Swift-user mailing list > > > > Swift-user at ci.uchicago.edu > > > > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > > > > > > > > > > > > > > > > > > > > -- > > > Sarah Kenny > > > Programmer ~ Brain Circuits Laboratory ~ Rm 2224 Bio Sci III > > > University of California Irvine, Dept. of Neurology ~ 773-818-8300 > > > > > > > > > > > > > > > -- > > Sarah Kenny > > Programmer ~ Brain Circuits Laboratory ~ Rm 2224 Bio Sci III > > University of California Irvine, Dept. 
of Neurology ~ 773-818-8300 > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > -- Ketan

From wilde at mcs.anl.gov Thu Oct 20 10:21:43 2011
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 20 Oct 2011 10:21:43 -0500 (CDT)
Subject: [Swift-devel] [Swift-user] gram on ranger
In-Reply-To:
Message-ID: <988426934.113721.1319124103942.JavaMail.root@zimbra.anl.gov>

Thanks, Ketan.

If I understand you correctly, then I would consider this a Swift bug, in that maxnodes should always mean *nodes*, for every type of resource provider including SGE. Based on what you say, the SGE provider is in this case treating the requested maxnode count as cores (assuming Anjali was running the same Swift revision as you were testing on here).

But then that might not explain the error in the log that Sarah posted. It seems the next step is to try the run on a smaller job (we can test this ourselves), and see if we can replicate and diagnose the error, with SGE submit files and output/error logs.

David, can you do this, since you were working on SGE testing last week? You and Ketan should share what you know about the situation, via swift-devel, as Ketan is also running on Ranger with persistent coasters, I think.
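
The nodes-versus-cores distinction above can be sketched in a few lines of Python. This is only an illustration, not the code of Swift's SGE provider. It assumes 16 cores per node (the figure quoted for Ranger in this thread), treats maxNodes as a node count, and converts it to the core-based slot count on the `#$ -pe` line. The names `pe_directive`, `fits_queue`, and `QUEUE_NODE_LIMITS` are hypothetical, and the only limit entered is the 16-node figure reported for the development queue.

```python
# Hypothetical helpers, not part of Swift or CoG: they illustrate treating
# maxNodes as *nodes* and converting to cores for the SGE "-pe" line.

CORES_PER_NODE = 16  # assumed constant for Ranger, per this thread

# Assumed per-queue node limits; only "development" comes from the thread.
QUEUE_NODE_LIMITS = {"development": 16}


def pe_directive(pe, max_nodes, cores_per_node=CORES_PER_NODE):
    """Return the SGE parallel-environment line with the slot count in cores."""
    if max_nodes < 1:
        raise ValueError("max_nodes must be at least 1")
    return "#$ -pe {} {}".format(pe, max_nodes * cores_per_node)


def fits_queue(queue, max_nodes, limits=QUEUE_NODE_LIMITS):
    """True if the node request is within the queue's limit (unknown queues pass)."""
    limit = limits.get(queue)
    return limit is None or max_nodes <= limit


if __name__ == "__main__":
    print(pe_directive("16way", 16))       # 16 nodes * 16 cores = 256 slots
    print(fits_queue("development", 256))  # 256 nodes exceeds the 16-node limit
```

Under these assumptions, a request of maxNodes=16 with pe=16way yields a 256-slot line, while maxNodes=256 on the development queue is flagged before submission instead of failing at the scheduler.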
Thanks, Mike ----- Original Message ----- > From: "Ketan Maheshwari" > To: "Michael Wilde" > Cc: "Sarah Kenny" , "Anjali Raja" , "Swift Devel" > , "Swift User" > Sent: Thursday, October 20, 2011 9:54:33 AM > Subject: Re: [Swift-devel] [Swift-user] gram on ranger > On Thu, Oct 20, 2011 at 7:50 AM, Michael Wilde < wilde at mcs.anl.gov > > wrote: > > > Hi Sarah, Anjali, > > My initial theory on whats failing in this job is that the Ranger > development queue is limited to jobs of 16 nodes or less. (The Ranger > User Guide says maxprocs 256 for that queue, and qconf -sq development > says slots 16, which agrees). So you need to either change to one of > the production queues (normal, long etc) or reduce the values of > maxnode and nodegranularity. > > > > I have a little confusion here: the desired line in the final pbs > script should be : #$ -pe way 256; in order to have 256 procs, > however, putting maxnodes=16 on sites.xml results in the following > line on pbs: > #$ -pe way 16; > I understand this number 16/256 is for procs since, when putting 256 > with development queue, ranger indeed allows the job to run in > development queue. > > > > I would also suggest (unless you have already done this) that you test > first on a very small run (like a single RInvoke app call) and then > scale up to just a few voxels per dataset before trying such a large > run. Have you already tested that? > > Lastly, when reporting problems like this, the swift standard > output/err is also very helpful to get a higher-level view of what > went wrong. > > Swift needs to clearly return errors from the local resource provider, > which it doesnt seem to be doing here. Ive filed this as bug 593 and > assigned to David. > > Please let us know if changing the queue and/or slots resolves the > problem. As mentioned in the bug report I think you can set debug=true > (or yes?) in the provider-sge.properties file and get swift to > preserve the output from SGE in ~/.globus/scripts. 
(In fact that may > already be preserved, I am not sure). Please check there to see if the > SGE error is there. > > Thanks, > > - Mike > > > > ----- Original Message ----- > > From: "Sarah Kenny" < skenny at uchicago.edu > > > > To: "Mihael Hategan" < hategan at mcs.anl.gov > > > Cc: "Anjali Raja" < anjraja at gmail.com >, "Swift Devel" < > > swift-devel at ci.uchicago.edu >, "Swift User" > > < swift-user at ci.uchicago.edu > > > Sent: Thursday, October 20, 2011 6:07:09 AM > > Subject: Re: [Swift-devel] [Swift-user] gram on ranger > > > hi all, one of our users, anjali (cc'd here) is trying to submit > > this > > ~400k job workflow to ranger...thought i'd see if you felt like > > having > > a look :) > > > > log is here: > > /home/skenny/swift_logs/corr_multisubj-20111018-1321-ihf8hz5g.log > > > > sites file: > > > > > > > > > > > > > > > 7200 > > 00:20:00 > > 1 > > 64 > > 256 > > development > > 1.28 > > TG-DBS080004N > > 16way > > 10000 > > /work/00926/tg459516/swiftwork > > > > > > > > > > On Wed, Oct 12, 2011 at 12:13 PM, Mihael Hategan < > > hategan at mcs.anl.gov > > > wrote: > > > > > > > > On Tue, 2011-10-11 at 17:13 -0700, Sarah Kenny wrote: > > > > > > > > > On Tue, Oct 11, 2011 at 4:23 PM, Mihael Hategan < > > > hategan at mcs.anl.gov > > > > wrote: > > > Is this with a persistent coaster service? > > > > > > admittedly i have not used persistent coaster service...should i? > > > > No. I was just trying to figure out whether it might be something > > related to the persistent version. > > > > > > > > > > > i feel like it's documented *somewhere* (?) > > > > > > for now i've tried setting 'sitedir.keep=true' in the config so > > > maybe > > > it won't try to run the cleanup job...we'll see (waiting in q) > > > > > > > > > > > > On Tue, 2011-10-11 at 12:05 -0700, Sarah Kenny wrote: > > > > > > > > > > > > On Tue, Oct 11, 2011 at 11:49 AM, David Kelly > > > < davidk at ci.uchicago.edu > > > > > wrote: > > > > > > > > That could be it.. 
maybe a cleanup script is not > > > getting the > > > > right parameters and failing. Do you happen to have > > > a copy of > > > > the coaster log? > > > > > > > > just put it in /home/skenny/swift_logs > > > > > > > > > > > > Maybe there will be some clues in there. > > > > > > > > ----- Original Message ----- > > > > > From: "Sarah Kenny" < skenny at uchicago.edu > > > > > > > > > > To: "David Kelly" < davidk at ci.uchicago.edu > > > > > > Cc: "Swift Devel" < swift-devel at ci.uchicago.edu >, > > > "Swift > > > > User" < swift-user at ci.uchicago.edu >, "Justin M > > > Wozniak" > > > > > < wozniak at mcs.anl.gov > > > > > > > > > > Sent: Tuesday, October 11, 2011 1:32:37 PM > > > > > Subject: Re: [Swift-user] gram on ranger > > > > > > > > > so, this workflow completes all the jobs but then > > > just hangs > > > > > indefinitely at the end...maybe a stray cleanup > > > job? > > > > > > > > > > log is here: > > > > > > > > > > > > > /home/skenny/swift_logs/corr-20111010-2104-fl5yngd9.log > > > > > > > > > > just tweaked the sites file a bit from what david > > > sent me: > > > > > > > > > > > > > > > > > > > > > > url=" > > > > > gatekeeper.ranger.tacc.teragrid.org "/> > > > > > > > > > > > > > > > > key="maxtime">28800 > > > > > > > > key="maxWallTime">00:15:00 > > > > > > > key="jobsPerNode">1 > > > > > > > > key="nodeGranularity">64 > > > > > > > key="maxNodes">256 > > > > > > > key="queue">normal > > > > > > > key="jobThrottle">1 > > > > > > > > key="project">TG-DBS080004N > > > > > > > key="pe">16way > > > > > > > > key="initialScore">10000 > > > > > > > > > > > > /work/00043/tg457040/sidgrid_out/skenny > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Mon, Oct 10, 2011 at 3:43 PM, Sarah Kenny < > > > > skenny at uchicago.edu > > > > > > wrote: > > > > > > > > > > > > > > > ok, thanks, got in the queue now...also, realized > > > my last > > > > run may have > > > > > been using the old swift. 
apparently i had > > > SWIFT_HOME set in > > > > my env > > > > > and that overrides the newer swift i had set in my > > > PATH. > > > > > > > > > > ~sk > > > > > > > > > > > > > > > > > > > > On Mon, Oct 10, 2011 at 12:28 PM, David Kelly < > > > > davidk at ci.uchicago.edu > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Sarah, > > > > > > > > > > Can you give this another try with the latest > > > 0.93? I made > > > > some > > > > > changes to the coaster and sge providers and was > > > able to get > > > > it > > > > > working with a simple catns script. Here is the > > > > configuration file I > > > > > was using: > > > > > > > > > > > > > > > > > > > > > > url=" > > > > > gatekeeper.ranger.tacc.teragrid.org "/> > > > > > > > > > > > > > > > > > > > > > key="maxtime">3600 > > > > > > > > key="maxWallTime">00:00:03 > > > > > > > key="jobsPerNode">1 > > > > > > > > key="nodeGranularity">16 > > > > > > > key="maxNodes">16 > > > > > > > > key="queue">development > > > > > > > key="jobThrottle">0.9 > > > > > > > > > > > > > key="project">TG-DBS080004N > > > > > > > > > > > > key="pe">16way > > > > > > > > > > > > /share/home/01503/davidkel/swiftwork > > > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > > David > > > > > > > > > > ----- Original Message ----- > > > > > > > > > > > From: "Sarah Kenny" < skenny at uchicago.edu > > > > > > > To: "Justin M Wozniak" < wozniak at mcs.anl.gov > > > > > > > Cc: "Swift Devel" < swift-devel at ci.uchicago.edu > > > >, "Swift > > > > User" < > > > > > > swift-user at ci.uchicago.edu > > > > > > > > > > > > > > > > > > > > > > Sent: Friday, October 7, 2011 3:13:57 PM > > > > > > Subject: Re: [Swift-user] gram on ranger > > > > > > > > > /home/skenny/swift_logs/dummy-20111005-0126-6575n7x5.log > > > > > > > > > > > > on ci > > > > > > > > > > > > > > > > > > On Fri, Oct 7, 2011 at 8:16 AM, Justin M Wozniak > > > < > > > > > > wozniak at mcs.anl.gov > > > > > > > wrote: > > > > > > > > > > 
> > > > > > > > > > > > > > Can I take a look at the log? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Thu, 6 Oct 2011, Sarah Kenny wrote: > > > > > > > > > > > > > > > > > > > > > > > > hey all, i'm trying to submit to gram on ranger > > > using the > > > > latest > > > > > > swift > > > > > > (built from trunk). it failes like so: > > > > > > > > > > > > Cannot submit job > > > > > > Caused by: > > > > > > org.globus.cog.abstraction. impl.common.task. > > > > > > TaskSubmissionException: > > > > > > Cannot > > > > > > submit job > > > > > > Caused by: org.globus.gram.GramException: > > > Parameter not > > > > supported > > > > > > Cannot submit job > > > > > > > > > > > > the gram log was saying first that 'jobsPerNode' > > > is not > > > > supported so > > > > > > i > > > > > > changed it to workersPerNode and then it was > > > saying > > > > 'maxnodes' is > > > > > > not > > > > > > supported. here's my sites file: > > > > > > > > > > > > > > > > > > > > > > > > > > key="initialScore">10000 > > > profile> > > > > > > > > key="jobThrottle">1 > > > > > > > > key="maxWallTime">00:15:00 > > > profile> > > > > > > > > key="maxTime">86400 > > > > > > > > key="slots">1 > > > > > > > > key="maxNodes">256 > > > > > > > > key="pe">16way > > > > > > > > key="workersPerNode">1 > > > profile> > > > > > > > > key="nodeGranularity">64 > > > profile> > > > > > > > > key="queue">normal > > > > > > > > key="project">TG-DBS080004N > > > profile> > > > > > > > > > > > > > > > > > > > jobManager="gt2:gt2:SGE" > > > > url=" > > > > > > gatekeeper.ranger.tacc. teragrid.org "/> > > > > > > > > > > > > > > > > > /work/00043/ > > > tg457040 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > thoughts? ideas? 
> > > > > > > > > > > > -- > > > > > > Justin M Wozniak > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > Sarah Kenny > > > > > > Programmer ~ Brain Circuits Laboratory ~ Rm 2224 > > > Bio Sci > > > > III > > > > > > University of California Irvine, Dept. of > > > Neurology ~ > > > > 773-818-8300 > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > > > Swift-user mailing list > > > > > > Swift-user at ci.uchicago.edu > > > > > > > > > > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > Sarah Kenny > > > > > Programmer ~ Brain Circuits Laboratory ~ Rm 2224 > > > Bio Sci III > > > > > University of California Irvine, Dept. of > > > Neurology ~ > > > > 773-818-8300 > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > Sarah Kenny > > > > > Programmer ~ Brain Circuits Laboratory ~ Rm 2224 > > > Bio Sci III > > > > > University of California Irvine, Dept. of > > > Neurology ~ > > > > 773-818-8300 > > > > > > > > > > > > > > > > > > > > -- > > > > Sarah Kenny > > > > Programmer ~ Brain Circuits Laboratory ~ Rm 2224 Bio Sci III > > > > University of California Irvine, Dept. of Neurology ~ > > > 773-818-8300 > > > > > > > > _______________________________________________ > > > > Swift-user mailing list > > > > Swift-user at ci.uchicago.edu > > > > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > > > > > > > > > > > > > > > > > > > > -- > > > Sarah Kenny > > > Programmer ~ Brain Circuits Laboratory ~ Rm 2224 Bio Sci III > > > University of California Irvine, Dept. of Neurology ~ 773-818-8300 > > > > > > > > > > > > > > > -- > > Sarah Kenny > > Programmer ~ Brain Circuits Laboratory ~ Rm 2224 Bio Sci III > > University of California Irvine, Dept. 
of Neurology ~ 773-818-8300 > > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > -- > Ketan

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From davidk at ci.uchicago.edu Thu Oct 20 10:37:59 2011
From: davidk at ci.uchicago.edu (David Kelly)
Date: Thu, 20 Oct 2011 10:37:59 -0500 (CDT)
Subject: [Swift-devel] [Swift-user] gram on ranger
In-Reply-To:
Message-ID: <459404377.154944.1319125079761.JavaMail.root@zimbra-mb2.anl.gov>

Ketan,

I think you're right - I believe that line should be:

#$ -pe

I'll add this and do a little more testing to see if I can reproduce Sarah's problem.

David

> I have a little confusion here: the desired line in the final pbs
> script should be : #$ -pe way 256; in order to have 256 procs,
> however, putting maxnodes=16 on sites.xml results in the following
> line on pbs:
> #$ -pe way 16;
> I understand this number 16/256 is for procs since, when putting 256
> with development queue, ranger indeed allows the job to run in
> development queue.
>
>
>
> I would also suggest (unless you have already done this) that you test
> first on a very small run (like a single RInvoke app call) and then
> scale up to just a few voxels per dataset before trying such a large
> run. Have you already tested that?
>
> Lastly, when reporting problems like this, the swift standard
> output/err is also very helpful to get a higher-level view of what
> went wrong.
> > Swift needs to clearly return errors from the local resource provider, > which it doesnt seem to be doing here. Ive filed this as bug 593 and > assigned to David. > > Please let us know if changing the queue and/or slots resolves the > problem. As mentioned in the bug report I think you can set debug=true > (or yes?) in the provider-sge.properties file and get swift to > preserve the output from SGE in ~/.globus/scripts. (In fact that may > already be preserved, I am not sure). Please check there to see if the > SGE error is there. > > Thanks, > > - Mike > > > > ----- Original Message ----- > > From: "Sarah Kenny" < skenny at uchicago.edu > > > > To: "Mihael Hategan" < hategan at mcs.anl.gov > > > Cc: "Anjali Raja" < anjraja at gmail.com >, "Swift Devel" < > > swift-devel at ci.uchicago.edu >, "Swift User" > > < swift-user at ci.uchicago.edu > > > Sent: Thursday, October 20, 2011 6:07:09 AM > > Subject: Re: [Swift-devel] [Swift-user] gram on ranger > > > hi all, one of our users, anjali (cc'd here) is trying to submit > > this > > ~400k job workflow to ranger...thought i'd see if you felt like > > having > > a look :) > > > > log is here: > > /home/skenny/swift_logs/corr_multisubj-20111018-1321-ihf8hz5g.log > > > > sites file: > > > > > > > > > > > > > > > 7200 > > 00:20:00 > > 1 > > 64 > > 256 > > development > > 1.28 > > TG-DBS080004N > > 16way > > 10000 > > /work/00926/tg459516/swiftwork > > > > > > > > > > On Wed, Oct 12, 2011 at 12:13 PM, Mihael Hategan < > > hategan at mcs.anl.gov > > > wrote: > > > > > > > > On Tue, 2011-10-11 at 17:13 -0700, Sarah Kenny wrote: > > > > > > > > > On Tue, Oct 11, 2011 at 4:23 PM, Mihael Hategan < > > > hategan at mcs.anl.gov > > > > wrote: > > > Is this with a persistent coaster service? > > > > > > admittedly i have not used persistent coaster service...should i? > > > > No. I was just trying to figure out whether it might be something > > related to the persistent version. 
> > > > > > > > > > > i feel like it's documented *somewhere* (?) > > > > > > for now i've tried setting 'sitedir.keep=true' in the config so > > > maybe > > > it won't try to run the cleanup job...we'll see (waiting in q) > > > > > > > > > > > > On Tue, 2011-10-11 at 12:05 -0700, Sarah Kenny wrote: > > > > > > > > > > > > On Tue, Oct 11, 2011 at 11:49 AM, David Kelly > > > < davidk at ci.uchicago.edu > > > > > wrote: > > > > > > > > That could be it.. maybe a cleanup script is not > > > getting the > > > > right parameters and failing. Do you happen to have > > > a copy of > > > > the coaster log? > > > > > > > > just put it in /home/skenny/swift_logs > > > > > > > > > > > > Maybe there will be some clues in there. > > > > > > > > ----- Original Message ----- > > > > > From: "Sarah Kenny" < skenny at uchicago.edu > > > > > > > > > > To: "David Kelly" < davidk at ci.uchicago.edu > > > > > > Cc: "Swift Devel" < swift-devel at ci.uchicago.edu >, > > > "Swift > > > > User" < swift-user at ci.uchicago.edu >, "Justin M > > > Wozniak" > > > > > < wozniak at mcs.anl.gov > > > > > > > > > > Sent: Tuesday, October 11, 2011 1:32:37 PM > > > > > Subject: Re: [Swift-user] gram on ranger > > > > > > > > > so, this workflow completes all the jobs but then > > > just hangs > > > > > indefinitely at the end...maybe a stray cleanup > > > job? 
> > > > > > > > > > log is here: > > > > > > > > > > > > > /home/skenny/swift_logs/corr-20111010-2104-fl5yngd9.log > > > > > > > > > > just tweaked the sites file a bit from what david > > > sent me: > > > > > > > > > > > > > > > > > > > > > > url=" > > > > > gatekeeper.ranger.tacc.teragrid.org "/> > > > > > > > > > > > > > > > > key="maxtime">28800 > > > > > > > > key="maxWallTime">00:15:00 > > > > > > > key="jobsPerNode">1 > > > > > > > > key="nodeGranularity">64 > > > > > > > key="maxNodes">256 > > > > > > > key="queue">normal > > > > > > > key="jobThrottle">1 > > > > > > > > key="project">TG-DBS080004N > > > > > > > key="pe">16way > > > > > > > > key="initialScore">10000 > > > > > > > > > > > > /work/00043/tg457040/sidgrid_out/skenny > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Mon, Oct 10, 2011 at 3:43 PM, Sarah Kenny < > > > > skenny at uchicago.edu > > > > > > wrote: > > > > > > > > > > > > > > > ok, thanks, got in the queue now...also, realized > > > my last > > > > run may have > > > > > been using the old swift. apparently i had > > > SWIFT_HOME set in > > > > my env > > > > > and that overrides the newer swift i had set in my > > > PATH. > > > > > > > > > > ~sk > > > > > > > > > > > > > > > > > > > > On Mon, Oct 10, 2011 at 12:28 PM, David Kelly < > > > > davidk at ci.uchicago.edu > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Sarah, > > > > > > > > > > Can you give this another try with the latest > > > 0.93? I made > > > > some > > > > > changes to the coaster and sge providers and was > > > able to get > > > > it > > > > > working with a simple catns script. 
Here is the > > > > configuration file I > > > > > was using: > > > > > > > > > > > > > > > > > > > > > > url=" > > > > > gatekeeper.ranger.tacc.teragrid.org "/> > > > > > > > > > > > > > > > > > > > > > key="maxtime">3600 > > > > > > > > key="maxWallTime">00:00:03 > > > > > > > key="jobsPerNode">1 > > > > > > > > key="nodeGranularity">16 > > > > > > > key="maxNodes">16 > > > > > > > > key="queue">development > > > > > > > key="jobThrottle">0.9 > > > > > > > > > > > > > key="project">TG-DBS080004N > > > > > > > > > > > > key="pe">16way > > > > > > > > > > > > /share/home/01503/davidkel/swiftwork > > > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > > David > > > > > > > > > > ----- Original Message ----- > > > > > > > > > > > From: "Sarah Kenny" < skenny at uchicago.edu > > > > > > > To: "Justin M Wozniak" < wozniak at mcs.anl.gov > > > > > > > Cc: "Swift Devel" < swift-devel at ci.uchicago.edu > > > >, "Swift > > > > User" < > > > > > > swift-user at ci.uchicago.edu > > > > > > > > > > > > > > > > > > > > > > Sent: Friday, October 7, 2011 3:13:57 PM > > > > > > Subject: Re: [Swift-user] gram on ranger > > > > > > > > > /home/skenny/swift_logs/dummy-20111005-0126-6575n7x5.log > > > > > > > > > > > > on ci > > > > > > > > > > > > > > > > > > On Fri, Oct 7, 2011 at 8:16 AM, Justin M Wozniak > > > < > > > > > > wozniak at mcs.anl.gov > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > Can I take a look at the log? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Thu, 6 Oct 2011, Sarah Kenny wrote: > > > > > > > > > > > > > > > > > > > > > > > > hey all, i'm trying to submit to gram on ranger > > > using the > > > > latest > > > > > > swift > > > > > > (built from trunk). it failes like so: > > > > > > > > > > > > Cannot submit job > > > > > > Caused by: > > > > > > org.globus.cog.abstraction. impl.common.task. 
> > > > > > TaskSubmissionException: > > > > > > Cannot > > > > > > submit job > > > > > > Caused by: org.globus.gram.GramException: > > > Parameter not > > > > supported > > > > > > Cannot submit job > > > > > > > > > > > > the gram log was saying first that 'jobsPerNode' > > > is not > > > > supported so > > > > > > i > > > > > > changed it to workersPerNode and then it was > > > saying > > > > 'maxnodes' is > > > > > > not > > > > > > supported. here's my sites file: > > > > > > > > > > > > > > > > > > > > > > > > > > key="initialScore">10000 > > > profile> > > > > > > > > key="jobThrottle">1 > > > > > > > > key="maxWallTime">00:15:00 > > > profile> > > > > > > > > key="maxTime">86400 > > > > > > > > key="slots">1 > > > > > > > > key="maxNodes">256 > > > > > > > > key="pe">16way > > > > > > > > key="workersPerNode">1 > > > profile> > > > > > > > > key="nodeGranularity">64 > > > profile> > > > > > > > > key="queue">normal > > > > > > > > key="project">TG-DBS080004N > > > profile> > > > > > > > > > > > > > > > > > > > jobManager="gt2:gt2:SGE" > > > > url=" > > > > > > gatekeeper.ranger.tacc. teragrid.org "/> > > > > > > > > > > > > > > > > > /work/00043/ > > > tg457040 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > thoughts? ideas? > > > > > > > > > > > > -- > > > > > > Justin M Wozniak > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > Sarah Kenny > > > > > > Programmer ~ Brain Circuits Laboratory ~ Rm 2224 > > > Bio Sci > > > > III > > > > > > University of California Irvine, Dept. 
of > > > Neurology ~ > > > > 773-818-8300 > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > > > Swift-user mailing list > > > > > > Swift-user at ci.uchicago.edu > > > > > > > > > > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > Sarah Kenny > > > > > Programmer ~ Brain Circuits Laboratory ~ Rm 2224 > > > Bio Sci III > > > > > University of California Irvine, Dept. of > > > Neurology ~ > > > > 773-818-8300 > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > Sarah Kenny > > > > > Programmer ~ Brain Circuits Laboratory ~ Rm 2224 > > > Bio Sci III > > > > > University of California Irvine, Dept. of > > > Neurology ~ > > > > 773-818-8300 > > > > > > > > > > > > > > > > > > > > -- > > > > Sarah Kenny > > > > Programmer ~ Brain Circuits Laboratory ~ Rm 2224 Bio Sci III > > > > University of California Irvine, Dept. of Neurology ~ > > > 773-818-8300 > > > > > > > > _______________________________________________ > > > > Swift-user mailing list > > > > Swift-user at ci.uchicago.edu > > > > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > > > > > > > > > > > > > > > > > > > > -- > > > Sarah Kenny > > > Programmer ~ Brain Circuits Laboratory ~ Rm 2224 Bio Sci III > > > University of California Irvine, Dept. of Neurology ~ 773-818-8300 > > > > > > > > > > > > > > > -- > > Sarah Kenny > > Programmer ~ Brain Circuits Laboratory ~ Rm 2224 Bio Sci III > > University of California Irvine, Dept. 
of Neurology ~ 773-818-8300 > > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > -- > Ketan > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel

From ketancmaheshwari at gmail.com Thu Oct 20 10:44:11 2011
From: ketancmaheshwari at gmail.com (Ketan Maheshwari)
Date: Thu, 20 Oct 2011 10:44:11 -0500
Subject: [Swift-devel] [Swift-user] gram on ranger
In-Reply-To: <459404377.154944.1319125079761.JavaMail.root@zimbra-mb2.anl.gov>
References: <459404377.154944.1319125079761.JavaMail.root@zimbra-mb2.anl.gov>
Message-ID:

David,

On Thu, Oct 20, 2011 at 10:37 AM, David Kelly wrote:

> Ketan,
>
> I think you're right - I believe that line should be:
>
> #$ -pe
>

I am not sure it should be maxnodes*nodegranularity, since nodegranularity means the number of nodes to be packed per job. I think it should be maxnodes*corespernode, where corespernode could be a static constant (value 16) for Ranger. I could be wrong here.

Furthermore, it seems the parallel_environment tag of sites.xml is still not being honored. It always defaults to '1way', irrespective of the provided value. I am using 0.93RC3. Maybe we can debug this together.

>
> I'll add this and do a little more testing to see if I can reproduce
> Sarah's problem.
> > David > > > I have a little confusion here: the desired line in the final pbs > > script should be : #$ -pe way 256; in order to have 256 procs, > > however, putting maxnodes=16 on sites.xml results in the following > > line on pbs: > > #$ -pe way 16; > > I understand this number 16/256 is for procs since, when putting 256 > > with development queue, ranger indeed allows the job to run in > > development queue. > > > > > > > > I would also suggest (unless you have already done this) that you test > > first on a very small run (like a single RInvoke app call) and then > > scale up to just a few voxels per dataset before trying such a large > > run. Have you already tested that? > > > > Lastly, when reporting problems like this, the swift standard > > output/err is also very helpful to get a higher-level view of what > > went wrong. > > > > Swift needs to clearly return errors from the local resource provider, > > which it doesnt seem to be doing here. Ive filed this as bug 593 and > > assigned to David. > > > > Please let us know if changing the queue and/or slots resolves the > > problem. As mentioned in the bug report I think you can set debug=true > > (or yes?) in the provider-sge.properties file and get swift to > > preserve the output from SGE in ~/.globus/scripts. (In fact that may > > already be preserved, I am not sure). Please check there to see if the > > SGE error is there. 
> > > > Thanks, > > > > - Mike > > > > > > > > ----- Original Message ----- > > > From: "Sarah Kenny" < skenny at uchicago.edu > > > > > > To: "Mihael Hategan" < hategan at mcs.anl.gov > > > > Cc: "Anjali Raja" < anjraja at gmail.com >, "Swift Devel" < > > > swift-devel at ci.uchicago.edu >, "Swift User" > > > < swift-user at ci.uchicago.edu > > > > Sent: Thursday, October 20, 2011 6:07:09 AM > > > Subject: Re: [Swift-devel] [Swift-user] gram on ranger > > > > > hi all, one of our users, anjali (cc'd here) is trying to submit > > > this > > > ~400k job workflow to ranger...thought i'd see if you felt like > > > having > > > a look :) > > > > > > log is here: > > > /home/skenny/swift_logs/corr_multisubj-20111018-1321-ihf8hz5g.log > > > > > > sites file: > > > > > > > > > > > > > > > > > > > > > > > > 7200 > > > 00:20:00 > > > 1 > > > 64 > > > 256 > > > development > > > 1.28 > > > TG-DBS080004N > > > 16way > > > 10000 > > > /work/00926/tg459516/swiftwork > > > > > > > > > > > > > > > On Wed, Oct 12, 2011 at 12:13 PM, Mihael Hategan < > > > hategan at mcs.anl.gov > > > > wrote: > > > > > > > > > > > > On Tue, 2011-10-11 at 17:13 -0700, Sarah Kenny wrote: > > > > > > > > > > > > On Tue, Oct 11, 2011 at 4:23 PM, Mihael Hategan < > > > > hategan at mcs.anl.gov > > > > > wrote: > > > > Is this with a persistent coaster service? > > > > > > > > admittedly i have not used persistent coaster service...should i? > > > > > > No. I was just trying to figure out whether it might be something > > > related to the persistent version. > > > > > > > > > > > > > > > > i feel like it's documented *somewhere* (?) 
> > > > > > > > for now i've tried setting 'sitedir.keep=true' in the config so > > > > maybe > > > > it won't try to run the cleanup job...we'll see (waiting in q) > > > > > > > > > > > > > > > > On Tue, 2011-10-11 at 12:05 -0700, Sarah Kenny wrote: > > > > > > > > > > > > > > > On Tue, Oct 11, 2011 at 11:49 AM, David Kelly > > > > < davidk at ci.uchicago.edu > > > > > > wrote: > > > > > > > > > > That could be it.. maybe a cleanup script is not > > > > getting the > > > > > right parameters and failing. Do you happen to have > > > > a copy of > > > > > the coaster log? > > > > > > > > > > just put it in /home/skenny/swift_logs > > > > > > > > > > > > > > > Maybe there will be some clues in there. > > > > > > > > > > ----- Original Message ----- > > > > > > From: "Sarah Kenny" < skenny at uchicago.edu > > > > > > > > > > > > To: "David Kelly" < davidk at ci.uchicago.edu > > > > > > > Cc: "Swift Devel" < swift-devel at ci.uchicago.edu >, > > > > "Swift > > > > > User" < swift-user at ci.uchicago.edu >, "Justin M > > > > Wozniak" > > > > > > < wozniak at mcs.anl.gov > > > > > > > > > > > > Sent: Tuesday, October 11, 2011 1:32:37 PM > > > > > > Subject: Re: [Swift-user] gram on ranger > > > > > > > > > > > so, this workflow completes all the jobs but then > > > > just hangs > > > > > > indefinitely at the end...maybe a stray cleanup > > > > job? 
> > > > > > > > > > > > log is here: > > > > > > > > > > > > > > > > /home/skenny/swift_logs/corr-20111010-2104-fl5yngd9.log > > > > > > > > > > > > just tweaked the sites file a bit from what david > > > > sent me: > > > > > > > > > > > > > > > > > > > > > > > > > > > url=" > > > > > > gatekeeper.ranger.tacc.teragrid.org "/> > > > > > > > > > > > > > > > > > > > > key="maxtime">28800 > > > > > > > > > > key="maxWallTime">00:15:00 > > > > > > > > > key="jobsPerNode">1 > > > > > > > > > > key="nodeGranularity">64 > > > > > > > > > key="maxNodes">256 > > > > > > > > > key="queue">normal > > > > > > > > > key="jobThrottle">1 > > > > > > > > > > key="project">TG-DBS080004N > > > > > > > > > key="pe">16way > > > > > > > > > > key="initialScore">10000 > > > > > > > > > > > > > > > > /work/00043/tg457040/sidgrid_out/skenny > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Mon, Oct 10, 2011 at 3:43 PM, Sarah Kenny < > > > > > skenny at uchicago.edu > > > > > > > wrote: > > > > > > > > > > > > > > > > > > ok, thanks, got in the queue now...also, realized > > > > my last > > > > > run may have > > > > > > been using the old swift. apparently i had > > > > SWIFT_HOME set in > > > > > my env > > > > > > and that overrides the newer swift i had set in my > > > > PATH. > > > > > > > > > > > > ~sk > > > > > > > > > > > > > > > > > > > > > > > > On Mon, Oct 10, 2011 at 12:28 PM, David Kelly < > > > > > davidk at ci.uchicago.edu > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Sarah, > > > > > > > > > > > > Can you give this another try with the latest > > > > 0.93? I made > > > > > some > > > > > > changes to the coaster and sge providers and was > > > > able to get > > > > > it > > > > > > working with a simple catns script. 
Here is the > > > > > configuration file I > > > > > > was using: > > > > > > > > > > > > > > > > > > > > > > > > > > > url=" > > > > > > gatekeeper.ranger.tacc.teragrid.org "/> > > > > > > > > > > > > > > > > > > > > > > > > > > key="maxtime">3600 > > > > > > > > > > key="maxWallTime">00:00:03 > > > > > > > > > key="jobsPerNode">1 > > > > > > > > > > key="nodeGranularity">16 > > > > > > > > > key="maxNodes">16 > > > > > > > > > > key="queue">development > > > > > > > > > key="jobThrottle">0.9 > > > > > > > > > > > > > > > > key="project">TG-DBS080004N > > > > > > > > > > > > > > > key="pe">16way > > > > > > > > > > > > > > > /share/home/01503/davidkel/swiftwork > > > > > > > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > > > > David > > > > > > > > > > > > ----- Original Message ----- > > > > > > > > > > > > > From: "Sarah Kenny" < skenny at uchicago.edu > > > > > > > > To: "Justin M Wozniak" < wozniak at mcs.anl.gov > > > > > > > > Cc: "Swift Devel" < swift-devel at ci.uchicago.edu > > > > >, "Swift > > > > > User" < > > > > > > > swift-user at ci.uchicago.edu > > > > > > > > > > > > > > > > > > > > > > > > > > Sent: Friday, October 7, 2011 3:13:57 PM > > > > > > > Subject: Re: [Swift-user] gram on ranger > > > > > > > > > > > /home/skenny/swift_logs/dummy-20111005-0126-6575n7x5.log > > > > > > > > > > > > > > on ci > > > > > > > > > > > > > > > > > > > > > On Fri, Oct 7, 2011 at 8:16 AM, Justin M Wozniak > > > > < > > > > > > > wozniak at mcs.anl.gov > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > Can I take a look at the log? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Thu, 6 Oct 2011, Sarah Kenny wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > hey all, i'm trying to submit to gram on ranger > > > > using the > > > > > latest > > > > > > > swift > > > > > > > (built from trunk). 
it failes like so: > > > > > > > > > > > > > > Cannot submit job > > > > > > > Caused by: > > > > > > > org.globus.cog.abstraction. impl.common.task. > > > > > > > TaskSubmissionException: > > > > > > > Cannot > > > > > > > submit job > > > > > > > Caused by: org.globus.gram.GramException: > > > > Parameter not > > > > > supported > > > > > > > Cannot submit job > > > > > > > > > > > > > > the gram log was saying first that 'jobsPerNode' > > > > is not > > > > > supported so > > > > > > > i > > > > > > > changed it to workersPerNode and then it was > > > > saying > > > > > 'maxnodes' is > > > > > > > not > > > > > > > supported. here's my sites file: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > key="initialScore">10000 > > > > profile> > > > > > > > > > > key="jobThrottle">1 > > > > > > > > > > key="maxWallTime">00:15:00 > > > > profile> > > > > > > > > > > key="maxTime">86400 > > > > > > > > > > key="slots">1 > > > > > > > > > > key="maxNodes">256 > > > > > > > > > > key="pe">16way > > > > > > > > > > key="workersPerNode">1 > > > > profile> > > > > > > > > > > key="nodeGranularity">64 > > > > profile> > > > > > > > > > > key="queue">normal > > > > > > > > > > key="project">TG-DBS080004N > > > > profile> > > > > > > > > > > > > > > > > > > > > > > > jobManager="gt2:gt2:SGE" > > > > > url=" > > > > > > > gatekeeper.ranger.tacc. teragrid.org "/> > > > > > > > > > > > > > > > > > > > > /work/00043/ > > > > tg457040 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > thoughts? ideas? > > > > > > > > > > > > > > -- > > > > > > > Justin M Wozniak > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > Sarah Kenny > > > > > > > Programmer ~ Brain Circuits Laboratory ~ Rm 2224 > > > > Bio Sci > > > > > III > > > > > > > University of California Irvine, Dept. 
of > > > > Neurology ~ > > > > > 773-818-8300 > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > > > > Swift-user mailing list > > > > > > > Swift-user at ci.uchicago.edu > > > > > > > > > > > > > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > Sarah Kenny > > > > > > Programmer ~ Brain Circuits Laboratory ~ Rm 2224 > > > > Bio Sci III > > > > > > University of California Irvine, Dept. of > > > > Neurology ~ > > > > > 773-818-8300 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > Sarah Kenny > > > > > > Programmer ~ Brain Circuits Laboratory ~ Rm 2224 > > > > Bio Sci III > > > > > > University of California Irvine, Dept. of > > > > Neurology ~ > > > > > 773-818-8300 > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > Sarah Kenny > > > > > Programmer ~ Brain Circuits Laboratory ~ Rm 2224 Bio Sci III > > > > > University of California Irvine, Dept. of Neurology ~ > > > > 773-818-8300 > > > > > > > > > > _______________________________________________ > > > > > Swift-user mailing list > > > > > Swift-user at ci.uchicago.edu > > > > > > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > Sarah Kenny > > > > Programmer ~ Brain Circuits Laboratory ~ Rm 2224 Bio Sci III > > > > University of California Irvine, Dept. of Neurology ~ 773-818-8300 > > > > > > > > > > > > > > > > > > > > > > -- > > > Sarah Kenny > > > Programmer ~ Brain Circuits Laboratory ~ Rm 2224 Bio Sci III > > > University of California Irvine, Dept. 
of Neurology ~ 773-818-8300
> > >
> > > _______________________________________________
> > > Swift-devel mailing list
> > > Swift-devel at ci.uchicago.edu
> > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
> >
> > --
> > Michael Wilde
> > Computation Institute, University of Chicago
> > Mathematics and Computer Science Division
> > Argonne National Laboratory
> >
> > --
> > Ketan
>
--
Ketan

From wilde at mcs.anl.gov Thu Oct 20 10:55:52 2011
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 20 Oct 2011 10:55:52 -0500 (CDT)
Subject: [Swift-devel] [Swift-user] gram on ranger
In-Reply-To:
Message-ID: <525723894.113958.1319126152521.JavaMail.root@zimbra.anl.gov>

That second number on the #$ -pe directive should be the total number of cores that the provider wants in the job, and on Ranger with the 16way pe it must be a multiple of 16.

Coasters will request up to "maxNodes" nodes in a given SGE job, and the number requested will always be a multiple of "nodeGranularity". But the number requested can be < maxNodes, depending on how many app() invocations the coaster provider/scheduler has decided to put into a coaster Block. Similarly, the time requested for the block can be < maxTime (but the "overallocation" parameters can influence that and force all blocks to be maxTime seconds "wide").
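
As a rough illustration of the arithmetic described above, the slot count on the `#$ -pe` line can be derived from the number of nodes in a block. This is only a sketch: the helper names `pe_slots` and `pe_line` are hypothetical, the figure of 16 cores per Ranger node is taken from the discussion in this thread, and none of this is the SGE provider's actual code.

```python
# Assumption: 16 cores per Ranger node, as discussed in this thread.
RANGER_CORES_PER_NODE = 16


def pe_slots(nodes, cores_per_node=RANGER_CORES_PER_NODE):
    """Return the total core count for the second field of '#$ -pe <pe> <slots>'."""
    if nodes < 1:
        raise ValueError("a block needs at least one node")
    # Whole nodes are requested, so the result is always a multiple of cores_per_node.
    return nodes * cores_per_node


def pe_line(pe, nodes, cores_per_node=RANGER_CORES_PER_NODE):
    """Format the SGE directive for a block spanning `nodes` nodes (hypothetical helper)."""
    return "#$ -pe %s %d" % (pe, pe_slots(nodes, cores_per_node))


if __name__ == "__main__":
    # A 16-node block with the 16way pe asks for 256 cores.
    print(pe_line("16way", 16))
```

Under these assumptions a 16-node block yields `#$ -pe 16way 256`, which is the line reported as desired elsewhere in the thread.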
- Mike

----- Original Message -----
> From: "Ketan Maheshwari"
> To: "David Kelly"
> Cc: "Anjali Raja", "Swift Devel", "Swift User", "Michael Wilde"
> Sent: Thursday, October 20, 2011 10:44:11 AM
> Subject: Re: [Swift-devel] [Swift-user] gram on ranger
>
> David,
>
> On Thu, Oct 20, 2011 at 10:37 AM, David Kelly < davidk at ci.uchicago.edu > wrote:
>
> Ketan,
>
> I think you're right - I believe that line should be:
>
> #$ -pe
>
> I am not sure it should be maxnodes*nodegranularity, since
> nodegranularity means the nodes-to-be-packed per job. I think this
> should be maxnodes*corespernode, which could be a static constant
> (value 16) for ranger. I could be wrong here, not sure.
>
> Furthermore, it seems the parallel_environment tag of sites.xml is
> still not being honored. It always defaults to '1way', irrespective of
> the provided value. I am using 0.93RC3.
>
> Maybe we can debug this together.
>
> I'll add this and do a little more testing to see if I can reproduce
> Sarah's problem.
>
> David
>
> > I have a little confusion here: the desired line in the final pbs
> > script should be: #$ -pe way 256; in order to have 256 procs.
> > However, putting maxnodes=16 in sites.xml results in the following
> > line in the pbs script:
> > #$ -pe way 16;
> > I understand this number 16/256 is for procs since, when putting 256
> > with development queue, ranger indeed allows the job to run in
> > development queue.
> >
> > I would also suggest (unless you have already done this) that you
> > test first on a very small run (like a single RInvoke app call) and
> > then scale up to just a few voxels per dataset before trying such a
> > large run. Have you already tested that?
> >
> > Lastly, when reporting problems like this, the swift standard
> > output/err is also very helpful to get a higher-level view of what
> > went wrong.
> >
> > Swift needs to clearly return errors from the local resource
> > provider, which it doesn't seem to be doing here. I've filed this as
> > bug 593 and assigned it to David.
> >
> > Please let us know if changing the queue and/or slots resolves the
> > problem. As mentioned in the bug report, I think you can set
> > debug=true (or yes?) in the provider-sge.properties file and get
> > swift to preserve the output from SGE in ~/.globus/scripts. (In fact
> > that may already be preserved, I am not sure.) Please check there to
> > see if the SGE error is there.
> >
> > Thanks,
> >
> > - Mike
> >
> > ----- Original Message -----
> > > From: "Sarah Kenny" < skenny at uchicago.edu >
> > > To: "Mihael Hategan" < hategan at mcs.anl.gov >
> > > Cc: "Anjali Raja" < anjraja at gmail.com >, "Swift Devel" < swift-devel at ci.uchicago.edu >, "Swift User" < swift-user at ci.uchicago.edu >
> > > Sent: Thursday, October 20, 2011 6:07:09 AM
> > > Subject: Re: [Swift-devel] [Swift-user] gram on ranger
> > >
> > > hi all, one of our users, anjali (cc'd here) is trying to submit
> > > this ~400k job workflow to ranger...thought i'd see if you felt
> > > like having a look :)
> > >
> > > log is here:
> > > /home/skenny/swift_logs/corr_multisubj-20111018-1321-ihf8hz5g.log
> > >
> > > sites file:
> > >
> > > 7200
> > > 00:20:00
> > > 1
> > > 64
> > > 256
> > > development
> > > 1.28
> > > TG-DBS080004N
> > > 16way
> > > 10000
> > > /work/00926/tg459516/swiftwork
> > >
> > > On Wed, Oct 12, 2011 at 12:13 PM, Mihael Hategan < hategan at mcs.anl.gov > wrote:
> > >
> > > On Tue, 2011-10-11 at 17:13 -0700, Sarah Kenny wrote:
> > > >
> > > > On Tue, Oct 11, 2011 at 4:23 PM, Mihael Hategan < hategan at mcs.anl.gov > wrote:
> > > > Is this with a persistent coaster service?
> > > > > > > > admittedly i have not used persistent coaster service...should > > > > i? > > > > > > No. I was just trying to figure out whether it might be something > > > related to the persistent version. > > > > > > > > > > > > > > > > i feel like it's documented *somewhere* (?) > > > > > > > > for now i've tried setting 'sitedir.keep=true' in the config so > > > > maybe > > > > it won't try to run the cleanup job...we'll see (waiting in q) > > > > > > > > > > > > > > > > On Tue, 2011-10-11 at 12:05 -0700, Sarah Kenny wrote: > > > > > > > > > > > > > > > On Tue, Oct 11, 2011 at 11:49 AM, David Kelly > > > > < davidk at ci.uchicago.edu > > > > > > wrote: > > > > > > > > > > That could be it.. maybe a cleanup script is not > > > > getting the > > > > > right parameters and failing. Do you happen to have > > > > a copy of > > > > > the coaster log? > > > > > > > > > > just put it in /home/skenny/swift_logs > > > > > > > > > > > > > > > Maybe there will be some clues in there. > > > > > > > > > > ----- Original Message ----- > > > > > > From: "Sarah Kenny" < skenny at uchicago.edu > > > > > > > > > > > > To: "David Kelly" < davidk at ci.uchicago.edu > > > > > > > Cc: "Swift Devel" < swift-devel at ci.uchicago.edu >, > > > > "Swift > > > > > User" < swift-user at ci.uchicago.edu >, "Justin M > > > > Wozniak" > > > > > > < wozniak at mcs.anl.gov > > > > > > > > > > > > Sent: Tuesday, October 11, 2011 1:32:37 PM > > > > > > Subject: Re: [Swift-user] gram on ranger > > > > > > > > > > > so, this workflow completes all the jobs but then > > > > just hangs > > > > > > indefinitely at the end...maybe a stray cleanup > > > > job? 
> > > > > > > > > > > > log is here: > > > > > > > > > > > > > > > > /home/skenny/swift_logs/corr-20111010-2104-fl5yngd9.log > > > > > > > > > > > > just tweaked the sites file a bit from what david > > > > sent me: > > > > > > > > > > > > > > > > > > > > > > > > > > > url=" > > > > > > gatekeeper.ranger.tacc.teragrid.org "/> > > > > > > > > > > > > > > > > > > > > key="maxtime">28800 > > > > > > > > > > key="maxWallTime">00:15:00 > > > > > > > > > key="jobsPerNode">1 > > > > > > > > > > key="nodeGranularity">64 > > > > > > > > > key="maxNodes">256 > > > > > > > > > key="queue">normal > > > > > > > > > key="jobThrottle">1 > > > > > > > > > > key="project">TG-DBS080004N > > > > > > > > > key="pe">16way > > > > > > > > > > key="initialScore">10000 > > > > > > > > > > > > > > > /work/00043/tg457040/sidgrid_out/skenny > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Mon, Oct 10, 2011 at 3:43 PM, Sarah Kenny < > > > > > skenny at uchicago.edu > > > > > > > wrote: > > > > > > > > > > > > > > > > > > ok, thanks, got in the queue now...also, realized > > > > my last > > > > > run may have > > > > > > been using the old swift. apparently i had > > > > SWIFT_HOME set in > > > > > my env > > > > > > and that overrides the newer swift i had set in my > > > > PATH. > > > > > > > > > > > > ~sk > > > > > > > > > > > > > > > > > > > > > > > > On Mon, Oct 10, 2011 at 12:28 PM, David Kelly < > > > > > davidk at ci.uchicago.edu > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Sarah, > > > > > > > > > > > > Can you give this another try with the latest > > > > 0.93? I made > > > > > some > > > > > > changes to the coaster and sge providers and was > > > > able to get > > > > > it > > > > > > working with a simple catns script. 
Here is the > > > > > configuration file I > > > > > > was using: > > > > > > > > > > > > > > > > > > > > > > > > > > > url=" > > > > > > gatekeeper.ranger.tacc.teragrid.org "/> > > > > > > > > > > > > > > > > > > > > > > > > > > key="maxtime">3600 > > > > > > > > > > key="maxWallTime">00:00:03 > > > > > > > > > key="jobsPerNode">1 > > > > > > > > > > key="nodeGranularity">16 > > > > > > > > > key="maxNodes">16 > > > > > > > > > > key="queue">development > > > > > > > > > key="jobThrottle">0.9 > > > > > > > > > > > > > > > > key="project">TG-DBS080004N > > > > > > > > > > > > > > > key="pe">16way > > > > > > > > > > > > > > > /share/home/01503/davidkel/swiftwork > > > > > > > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > > > > David > > > > > > > > > > > > ----- Original Message ----- > > > > > > > > > > > > > From: "Sarah Kenny" < skenny at uchicago.edu > > > > > > > > To: "Justin M Wozniak" < wozniak at mcs.anl.gov > > > > > > > > Cc: "Swift Devel" < swift-devel at ci.uchicago.edu > > > > >, "Swift > > > > > User" < > > > > > > > swift-user at ci.uchicago.edu > > > > > > > > > > > > > > > > > > > > > > > > > > Sent: Friday, October 7, 2011 3:13:57 PM > > > > > > > Subject: Re: [Swift-user] gram on ranger > > > > > > > > > > > /home/skenny/swift_logs/dummy-20111005-0126-6575n7x5.log > > > > > > > > > > > > > > on ci > > > > > > > > > > > > > > > > > > > > > On Fri, Oct 7, 2011 at 8:16 AM, Justin M Wozniak > > > > < > > > > > > > wozniak at mcs.anl.gov > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > Can I take a look at the log? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Thu, 6 Oct 2011, Sarah Kenny wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > hey all, i'm trying to submit to gram on ranger > > > > using the > > > > > latest > > > > > > > swift > > > > > > > (built from trunk). 
it failes like so: > > > > > > > > > > > > > > Cannot submit job > > > > > > > Caused by: > > > > > > > org.globus.cog.abstraction. impl.common.task. > > > > > > > TaskSubmissionException: > > > > > > > Cannot > > > > > > > submit job > > > > > > > Caused by: org.globus.gram.GramException: > > > > Parameter not > > > > > supported > > > > > > > Cannot submit job > > > > > > > > > > > > > > the gram log was saying first that 'jobsPerNode' > > > > is not > > > > > supported so > > > > > > > i > > > > > > > changed it to workersPerNode and then it was > > > > saying > > > > > 'maxnodes' is > > > > > > > not > > > > > > > supported. here's my sites file: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > key="initialScore">10000 > > > > profile> > > > > > > > > > > key="jobThrottle">1 > > > > > > > > > > key="maxWallTime">00:15:00 > > > > profile> > > > > > > > > > > key="maxTime">86400 > > > > > > > > > > key="slots">1 > > > > > > > > > > key="maxNodes">256 > > > > > > > > > > key="pe">16way > > > > > > > > > > key="workersPerNode">1 > > > > profile> > > > > > > > > > > key="nodeGranularity">64 > > > > profile> > > > > > > > > > > key="queue">normal > > > > > > > > > > key="project">TG-DBS080004N > > > > profile> > > > > > > > > > > > > > > > > > > > > > > > jobManager="gt2:gt2:SGE" > > > > > url=" > > > > > > > gatekeeper.ranger.tacc. teragrid.org "/> > > > > > > > > > > > > > > > > > > > > /work/00043/ > > > > tg457040 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > thoughts? ideas? > > > > > > > > > > > > > > -- > > > > > > > Justin M Wozniak > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > Sarah Kenny > > > > > > > Programmer ~ Brain Circuits Laboratory ~ Rm 2224 > > > > Bio Sci > > > > > III > > > > > > > University of California Irvine, Dept. 
of > > > > Neurology ~ > > > > > 773-818-8300 > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > > > > Swift-user mailing list > > > > > > > Swift-user at ci.uchicago.edu > > > > > > > > > > > > > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > Sarah Kenny > > > > > > Programmer ~ Brain Circuits Laboratory ~ Rm 2224 > > > > Bio Sci III > > > > > > University of California Irvine, Dept. of > > > > Neurology ~ > > > > > 773-818-8300 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > Sarah Kenny > > > > > > Programmer ~ Brain Circuits Laboratory ~ Rm 2224 > > > > Bio Sci III > > > > > > University of California Irvine, Dept. of > > > > Neurology ~ > > > > > 773-818-8300 > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > Sarah Kenny > > > > > Programmer ~ Brain Circuits Laboratory ~ Rm 2224 Bio Sci III > > > > > University of California Irvine, Dept. of Neurology ~ > > > > 773-818-8300 > > > > > > > > > > _______________________________________________ > > > > > Swift-user mailing list > > > > > Swift-user at ci.uchicago.edu > > > > > > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > Sarah Kenny > > > > Programmer ~ Brain Circuits Laboratory ~ Rm 2224 Bio Sci III > > > > University of California Irvine, Dept. of Neurology ~ > > > > 773-818-8300 > > > > > > > > > > > > > > > > > > > > > > -- > > > Sarah Kenny > > > Programmer ~ Brain Circuits Laboratory ~ Rm 2224 Bio Sci III > > > University of California Irvine, Dept. 
of Neurology ~ 773-818-8300
> > >
> > > _______________________________________________
> > > Swift-devel mailing list
> > > Swift-devel at ci.uchicago.edu
> > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
> >
> > --
> > Michael Wilde
> > Computation Institute, University of Chicago
> > Mathematics and Computer Science Division
> > Argonne National Laboratory
> >
> > --
> > Ketan
>
> --
> Ketan

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From hategan at mcs.anl.gov Thu Oct 20 12:35:34 2011
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 20 Oct 2011 10:35:34 -0700
Subject: [Swift-devel] [Swift-user] gram on ranger
In-Reply-To: <525723894.113958.1319126152521.JavaMail.root@zimbra.anl.gov>
References: <525723894.113958.1319126152521.JavaMail.root@zimbra.anl.gov>
Message-ID: <1319132134.16281.0.camel@blabla>

On Thu, 2011-10-20 at 10:55 -0500, Michael Wilde wrote:
> That second number on the #$ -pe directive should be the total # of
> cores that the provider wants in the job, and on Ranger with the 16way
> pe must be a multiple of 16.

I think it always has to be a multiple of 16 on Ranger, regardless of pe.

From hategan at mcs.anl.gov Thu Oct 20 12:42:07 2011
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 20 Oct 2011 10:42:07 -0700
Subject: [Swift-devel] [Swift-user] gram on ranger
In-Reply-To:
References: <459404377.154944.1319125079761.JavaMail.root@zimbra-mb2.anl.gov>
Message-ID: <1319132527.16281.4.camel@blabla>

On Thu, 2011-10-20 at 10:44 -0500, Ketan Maheshwari wrote:
> David,
>
> On Thu, Oct 20, 2011 at 10:37 AM, David Kelly wrote:
> Ketan,
>
> I think you're right - I believe that line should be:
>
> #$ -pe
>
> I am not sure it should be maxnodes*nodegranularity, since
> nodegranularity means the nodes-to-be-packed per job. I think this
> should be maxnodes*corespernode, which could be a static constant
> (value 16) for ranger. I could be wrong here, not sure.

I agree that maxnodes * nodegranularity does not make much sense, since that could cause a job to request more than maxnodes nodes.

I think this problem goes back to the idea that providers should interpret "count" as "the number of instances of the executable that I want started" and other parameters should dictate how exactly that count is spread over nodes and cores.

From jonmon at mcs.anl.gov Thu Oct 20 15:04:38 2011
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Thu, 20 Oct 2011 15:04:38 -0500
Subject: [Swift-devel] coasters won't start
Message-ID:

Here is a log saying that the coaster service isn't starting; at least that is what the log is saying. This is on PADS with automatic coasters using 0.93RC3.
http://www.ci.uchicago.edu/~jonmon/logs/coasters_wont_start.log

And here is the coaster log in zipped form:
http://www.ci.uchicago.edu/~jonmon/logs/coasters.tar.gz

All the files used for this run are located in ~jonmon/PADS/Swift/SwiftMontage/m101_tutorial/run.0039 on the ci network.

From davidk at ci.uchicago.edu Thu Oct 20 16:06:15 2011
From: davidk at ci.uchicago.edu (David Kelly)
Date: Thu, 20 Oct 2011 16:06:15 -0500 (CDT)
Subject: [Swift-devel] [Swift-user] gram on ranger
In-Reply-To: <1319132527.16281.4.camel@blabla>
Message-ID: <188408570.155505.1319144775904.JavaMail.root@zimbra-mb2.anl.gov>

> I think this problem goes back to the idea that providers should
> interpret "count" as "the number of instances of the executable that I
> want started" and other parameters should dictate how exactly that
> count is spread over nodes and cores.

Mihael,

In this setup I am using nodeGranularity=16, jobsPerNode=16, and maxNodes=16. The SGE submit file could request anywhere between 16 and 256 cores in multiples of 16.

When I run catsn with -n=2, count is 16.
When I run catsn with -n=20, two SGE submit scripts get created, each with count=16.

Should count=32 in the second case? Am I misunderstanding what 'count' is? Is there any way to get the exact number of applications?

Thanks,
David

From tim.g.armstrong at gmail.com Thu Oct 20 20:33:55 2011
From: tim.g.armstrong at gmail.com (Tim Armstrong)
Date: Thu, 20 Oct 2011 20:33:55 -0500
Subject: [Swift-devel] Strange behaviour of fprintf in Swift 0.93 release candidate
Message-ID:

I've been seeing various strange behaviour from fprintf in Swift 0.93 which didn't occur in Swift 0.92.

In one case, the wrong filename is used. With the following code, instead of printing to a file testfprintf1.out, fprintf instead prints to the file "filename:string = testfprintf1.out - Closed". I've added a bug report for this case:

Bug report: https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=595

    string filename;
    filename = "testfprintf1.out";
    fprintf(filename, "done\n");

I've also been seeing Swift hanging when it is trying to write to a fifo, i.e. with something like the below code, fprintf never fires. I've attached a log if anyone has any ideas.
I've been attempting to replicate the problem with a simpler script before posting a bug report.

    external e;
    e = somefunction();
    fprintf(fifoname, "%kdone\n", e);

- Tim
-------------- next part --------------
A non-text attachment was scrubbed...
Name: rserver-20111020-2017-n8z7bvuc.log
Type: text/x-log
Size: 34861 bytes
Desc: not available
URL:

From hategan at mcs.anl.gov Thu Oct 20 20:49:46 2011
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 20 Oct 2011 18:49:46 -0700
Subject: [Swift-devel] [Swift-user] gram on ranger
In-Reply-To: <188408570.155505.1319144775904.JavaMail.root@zimbra-mb2.anl.gov>
References: <188408570.155505.1319144775904.JavaMail.root@zimbra-mb2.anl.gov>
Message-ID: <1319161786.17554.0.camel@blabla>

On Thu, 2011-10-20 at 16:06 -0500, David Kelly wrote:
> > I think this problem goes back to the idea that providers should
> > interpret "count" as "the number of instances of the executable that I
> > want started" and other parameters should dictate how exactly that
> > count is spread over nodes and cores.
>
> Mihael,
>
> In this setup I am using nodeGranularity=16, jobsPerNode=16, and
> maxNodes=16. The SGE submit file could request anywhere between 16 and
> 256 cores in multiples of 16.
>
> When I run catsn with -n=2, count is 16.
> When I run catsn with -n=20, two SGE submit scripts get created, each with count=16.
>
> Should count=32 in the second case? Am I misunderstanding what 'count' is? Is there any way to get the exact number of applications?

Coasters?

From davidk at ci.uchicago.edu Thu Oct 20 21:03:46 2011
From: davidk at ci.uchicago.edu (David Kelly)
Date: Thu, 20 Oct 2011 21:03:46 -0500 (CDT)
Subject: [Swift-devel] [Swift-user] gram on ranger
In-Reply-To: <1319161786.17554.0.camel@blabla>
Message-ID: <1584784930.155805.1319162626982.JavaMail.root@zimbra-mb2.anl.gov>

Yep, this is using coasters

----- Original Message -----
> From: "Mihael Hategan"
> To: "David Kelly"
> Cc: "Anjali Raja", "Swift Devel", "Swift User", "Ketan Maheshwari"
> Sent: Thursday, October 20, 2011 8:49:46 PM
> Subject: Re: [Swift-devel] [Swift-user] gram on ranger
> On Thu, 2011-10-20 at 16:06 -0500, David Kelly wrote:
> > > I think this problem goes back to the idea that providers should
> > > interpret "count" as "the number of instances of the executable
> > > that I want started" and other parameters should dictate how
> > > exactly that count is spread over nodes and cores.
> >
> > Mihael,
> >
> > In this setup I am using nodeGranularity=16, jobsPerNode=16, and
> > maxNodes=16. The SGE submit file could request anywhere between 16
> > and 256 cores in multiples of 16.
> >
> > When I run catsn with -n=2, count is 16.
> > When I run catsn with -n=20, two SGE submit scripts get created,
> > each with count=16.
> >
> > Should count=32 in the second case? Am I misunderstanding what
> > 'count' is? Is there any way to get the exact number of
> > applications?
>
> Coasters?

From hategan at mcs.anl.gov Thu Oct 20 21:08:46 2011
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 20 Oct 2011 19:08:46 -0700
Subject: [Swift-devel] [Swift-user] gram on ranger
In-Reply-To: <1584784930.155805.1319162626982.JavaMail.root@zimbra-mb2.anl.gov>
References: <1584784930.155805.1319162626982.JavaMail.root@zimbra-mb2.anl.gov>
Message-ID: <1319162926.21652.2.camel@blabla>

On Thu, 2011-10-20 at 21:03 -0500, David Kelly wrote:
> Yep, this is using coasters
>

Then no. Count is whatever the block allocation algorithm decides it should be.
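
The block allocation algorithm itself is not spelled out in this thread. As a hedged sketch only, the constraints stated earlier in the thread (a block requests a multiple of nodeGranularity nodes and never more than maxNodes) can be enumerated as below. The function name `allowed_block_sizes` is hypothetical and this is not the coaster scheduler's code; it only shows which node counts those two settings permit.

```python
def allowed_block_sizes(node_granularity, max_nodes):
    """List the node counts a block may request under the two stated constraints.

    Hypothetical helper: sizes are multiples of node_granularity, capped at max_nodes.
    """
    if node_granularity < 1:
        raise ValueError("node_granularity must be at least 1")
    if max_nodes < node_granularity:
        raise ValueError("max_nodes must be at least node_granularity")
    return list(range(node_granularity, max_nodes + 1, node_granularity))


if __name__ == "__main__":
    # Settings from the sites file with nodeGranularity=64, maxNodes=256.
    print(allowed_block_sizes(64, 256))  # [64, 128, 192, 256]
    # Settings from the setup with nodeGranularity=16, maxNodes=16.
    print(allowed_block_sizes(16, 16))   # [16]
```

Which of these sizes is actually chosen for a given block, and how many blocks are created, is up to the allocation algorithm and is not modelled here.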
> > >
> > > Should count=32 in the second case? Am I misunderstanding what
> > > 'count' is? Is there any way to get the exact number of
> > > applications?
> >
> > Coasters?

From wilde at mcs.anl.gov Fri Oct 21 08:00:44 2011
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 21 Oct 2011 08:00:44 -0500 (CDT)
Subject: [Swift-devel] [Swift-user] gram on ranger
In-Reply-To: <1319162926.21652.2.camel@blabla>
Message-ID: <1676227149.118514.1319202044330.JavaMail.root@zimbra.anl.gov>

David, "whatever the block allocation algorithm decides it should be" is the box-packing algorithm that's mentioned in the UCC 2012 paper and which we discussed late yesterday afternoon. We should document it in the User Guide in an elaborated Coaster section.

I suggested to David that he add (at least temporarily, pending discussion) the coresPerNode site attribute so that, at least for SGE, the provider can set the correct value in the core-count field of the SGE submit file "pe" attribute.

This seems to me to be necessary: we don't want to set the pe core count using jobsPerNode, as users on occasion need to set jobsPerNode to be higher or lower than the number of cores on the node.

- Mike

----- Original Message -----
> From: "Mihael Hategan"
> To: "David Kelly"
> Cc: "Anjali Raja", "Swift Devel", "Swift User"
> Sent: Thursday, October 20, 2011 9:08:46 PM
> Subject: Re: [Swift-devel] [Swift-user] gram on ranger
> On Thu, 2011-10-20 at 21:03 -0500, David Kelly wrote:
> > Yep, this is using coasters
> >
>
> Then no. Count is whatever the block allocation algorithm decides it
> should be.
>
> > > >
> > > > Should count=32 in the second case? Am I misunderstanding what
> > > > 'count' is? Is there any way to get the exact number of
> > > > applications?
> > >
> > > Coasters?
> >
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From jonmon at mcs.anl.gov Fri Oct 21 13:50:37 2011
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Fri, 21 Oct 2011 13:50:37 -0500
Subject: [Swift-devel] coasters won't start
In-Reply-To:
References:
Message-ID:

Anyone have a thought on this? Not sure what is wrong. I can't seem to get coasters registered from PADS or Beagle. The log also specifies a FileNotFoundException when trying to transfer back the wrapper log. Does this have something to do with the problem? I have been assuming that this error was being thrown due to the coaster service not connecting.

On Oct 20, 2011, at 3:04 PM, Jonathan Monette wrote:

> Here is a log saying that the coaster service isn't starting, at least that is what the log is saying. This is with on PADS with automatic coasters using 0.93RC3.
> http://www.ci.uchicago.edu/~jonmon/logs/coasters_wont_start.log
>
> And here is the coaster log in zipped form
> http://www.ci.uchicago.edu/~jonmon/logs/coasters.tar.gz
>
> All the files used for this run are located in ~jonmon/PADS/Swift/SwiftMontage/m101_tutorial/run.0039 on the ci network.
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
URL:

From ketancmaheshwari at gmail.com Fri Oct 21 13:58:57 2011
From: ketancmaheshwari at gmail.com (Ketan Maheshwari)
Date: Fri, 21 Oct 2011 13:58:57 -0500
Subject: [Swift-devel] coasters won't start
In-Reply-To:
References:
Message-ID:

Jon,

There were some changes in the firewall rules in terms of allowed open ports on various ci machines. A long shot, but maybe you want to check on that.

Can you paste your sites.xml so I can take a look and see if I find something?

Ketan

On Fri, Oct 21, 2011 at 1:50 PM, Jonathan Monette wrote:

> Anyone have a thought on this? Not sure what is wrong. I can't seem to
> get coasters registered from PADS or Beagle. The log also specifies a
> FileNotFoundException when trying to transfer back the wrapper log. Does
> this have something to do with the problem? I have been assuming that this
> error was being thrown due to the coaster service not connecting.
>
> On Oct 20, 2011, at 3:04 PM, Jonathan Monette wrote:
>
> Here is a log saying that the coaster service isn't starting, at least that
> is what the log is saying. This is with on PADS with automatic coasters
> using 0.93RC3.
> http://www.ci.uchicago.edu/~jonmon/logs/coasters_wont_start.log
>
> And here is the coaster log in zipped form
> http://www.ci.uchicago.edu/~jonmon/logs/coasters.tar.gz
>
> All the files used for this run are located in
> ~jonmon/PADS/Swift/SwiftMontage/m101_tutorial/run.0039 on the ci network.
>
>
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
>
>
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
>
>

--
Ketan
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From jonmon at mcs.anl.gov Fri Oct 21 14:02:31 2011 From: jonmon at mcs.anl.gov (Jonathan Monette) Date: Fri, 21 Oct 2011 14:02:31 -0500 Subject: [Swift-devel] coasters won't start In-Reply-To: References: Message-ID: <46B70DCF-CA6E-4939-ACA1-4B3F3B43DB80@mcs.anl.gov> Thanks. That was the next thing on my check list to check but wasn't sure how to check this. I wasn't sure how to specify a port range for coasters to use. Does coasters use the GLOBUS_TCP_PORT_RANGE and the GLOBUS_SOURCE_PORT_RANGE environment variables for this? /gpfs/pads/swift/jonmon/Swift/work/localhost .05 KEEP /gpfs/pads/swift/jonmon/Swift/work/pads CI-CCR000013 3600 1 192 1 1 fast 5 10000 KEEP CI-CCR000013 /gpfs/pads/swift/jonmon/Swift/work/beagle 24 pbs.aprun;pbs.mpp;depth=24 24 1000 1 1 1 .63 10000 KEEP On Oct 21, 2011, at 1:58 PM, Ketan Maheshwari wrote: > Jon, > > There were some changes in the firewalls rules in terms of allowed open ports on various ci machines. A long shot, but may be you want to check on that. > > Can you paste your sites.xml and I can take a look if I find something. > > Ketan > > > On Fri, Oct 21, 2011 at 1:50 PM, Jonathan Monette wrote: > Anyone have a thought on this? Not sure what is wrong. I can't seem to get coasters registered from PADS or Beagle. The log also specifies a FileNotFoundException when trying to transfer back the wrapper log. Does this have something to do with the problem? I have been assuming that this error was being thrown due to the coaster service not connecting. > > On Oct 20, 2011, at 3:04 PM, Jonathan Monette wrote: > >> Here is a log saying that the coaster service isn't starting, at least that is what the log is saying. This is with on PADS with automatic coasters using 0.93RC3. 
>> http://www.ci.uchicago.edu/~jonmon/logs/coasters_wont_start.log >> >> And here is the coaster log in zipped form >> http://www.ci.uchicago.edu/~jonmon/logs/coasters.tar.gz >> >> All the files used for this run are located in ~jonmon/PADS/Swift/SwiftMontage/m101_tutorial/run.0039 on the ci network. >> >> >> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > -- > Ketan > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ketancmaheshwari at gmail.com Fri Oct 21 14:07:13 2011 From: ketancmaheshwari at gmail.com (Ketan Maheshwari) Date: Fri, 21 Oct 2011 14:07:13 -0500 Subject: [Swift-devel] coasters won't start In-Reply-To: <46B70DCF-CA6E-4939-ACA1-4B3F3B43DB80@mcs.anl.gov> References: <46B70DCF-CA6E-4939-ACA1-4B3F3B43DB80@mcs.anl.gov> Message-ID: Jon, If you are going from a remote host to pads via ssh, I am more suspicious, it is a firewall issue. I am no expert, but check the port status of the ports coaster service is using running on the remote host and the workers are using to connect back to the service. Since when are you seeing this issue? Ketan On Fri, Oct 21, 2011 at 2:02 PM, Jonathan Monette wrote: > Thanks. That was the next thing on my check list to check but wasn't sure > how to check this. I wasn't sure how to specify a port range for coasters > to use. Does coasters use the GLOBUS_TCP_PORT_RANGE and the > GLOBUS_SOURCE_PORT_RANGE environment variables for this? 
> > > > > > > /gpfs/pads/swift/jonmon/Swift/work/localhost > > .05 > > KEEP > > > > > /gpfs/pads/swift/jonmon/Swift/work/pads > > CI-CCR000013 > 3600 > 1 > 192 > 1 > 1 > fast > > 5 > 10000 > > KEEP > > > > CI-CCR000013 > > > /gpfs/pads/swift/jonmon/Swift/work/beagle > > > 24 > key="providerAttributes">pbs.aprun;pbs.mpp;depth=24 > 24 > 1000 > 1 > 1 > 1 > > .63 > 10000 > > KEEP > > > > On Oct 21, 2011, at 1:58 PM, Ketan Maheshwari wrote: > > Jon, > > There were some changes in the firewalls rules in terms of allowed open > ports on various ci machines. A long shot, but may be you want to check on > that. > > Can you paste your sites.xml and I can take a look if I find something. > > Ketan > > > On Fri, Oct 21, 2011 at 1:50 PM, Jonathan Monette wrote: > >> Anyone have a thought on this? Not sure what is wrong. I can't seem to >> get coasters registered from PADS or Beagle. The log also specifies a >> FileNotFoundException when trying to transfer back the wrapper log. Does >> this have something to do with the problem? I have been assuming that this >> error was being thrown due to the coaster service not connecting. >> >> On Oct 20, 2011, at 3:04 PM, Jonathan Monette wrote: >> >> Here is a log saying that the coaster service isn't starting, at least >> that is what the log is saying. This is with on PADS with automatic >> coasters using 0.93RC3. >> http://www.ci.uchicago.edu/~jonmon/logs/coasters_wont_start.log >> >> And here is the coaster log in zipped form >> http://www.ci.uchicago.edu/~jonmon/logs/coasters.tar.gz >> >> All the files used for this run are located in >> ~jonmon/PADS/Swift/SwiftMontage/m101_tutorial/run.0039 on the ci network. 
>>
>>
>>
>> _______________________________________________
>> Swift-devel mailing list
>> Swift-devel at ci.uchicago.edu
>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
>>
>>
>>
>> _______________________________________________
>> Swift-devel mailing list
>> Swift-devel at ci.uchicago.edu
>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
>>
>>
>
>
> --
> Ketan
>
>
>
>

--
Ketan
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From jonmon at mcs.anl.gov Fri Oct 21 14:10:17 2011
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Fri, 21 Oct 2011 14:10:17 -0500
Subject: [Swift-devel] coasters won't start
In-Reply-To:
References:
Message-ID: <335DA57D-7BF5-480D-B5B6-1CFFE17DBF43@mcs.anl.gov>

Recently. This is with the compiled binaries of the 0.93RC3 release. I just downloaded the tarball from the wwwdev site. And I am not sure how to even probe the ports coasters is trying to use. When running my scripts the service tries to start and fails, at least that is what the log looks like.

On Oct 21, 2011, at 1:58 PM, Ketan Maheshwari wrote:

> Jon,
>
> There were some changes in the firewalls rules in terms of allowed open ports on various ci machines. A long shot, but may be you want to check on that.
>
> Can you paste your sites.xml and I can take a look if I find something.
>
> Ketan
>
>
> On Fri, Oct 21, 2011 at 1:50 PM, Jonathan Monette wrote:
> Anyone have a thought on this? Not sure what is wrong. I can't seem to get coasters registered from PADS or Beagle. The log also specifies a FileNotFoundException when trying to transfer back the wrapper log. Does this have something to do with the problem? I have been assuming that this error was being thrown due to the coaster service not connecting.
>
> On Oct 20, 2011, at 3:04 PM, Jonathan Monette wrote:
>
>> Here is a log saying that the coaster service isn't starting, at least that is what the log is saying.
This is with on PADS with automatic coasters using 0.93RC3. >> http://www.ci.uchicago.edu/~jonmon/logs/coasters_wont_start.log >> >> And here is the coaster log in zipped form >> http://www.ci.uchicago.edu/~jonmon/logs/coasters.tar.gz >> >> All the files used for this run are located in ~jonmon/PADS/Swift/SwiftMontage/m101_tutorial/run.0039 on the ci network. >> >> >> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > -- > Ketan > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From davidk at ci.uchicago.edu Fri Oct 21 14:29:35 2011 From: davidk at ci.uchicago.edu (David Kelly) Date: Fri, 21 Oct 2011 14:29:35 -0500 (CDT) Subject: [Swift-devel] coasters won't start In-Reply-To: <335DA57D-7BF5-480D-B5B6-1CFFE17DBF43@mcs.anl.gov> Message-ID: <959494739.156835.1319225375586.JavaMail.root@zimbra-mb2.anl.gov> You could also try with a known good version (0.92, or a previous version of 0.93 you have had success with). That should narrow down the problem a bit. David ----- Original Message ----- > From: "Jonathan Monette" > To: "Ketan Maheshwari" > Cc: "swift-devel Devel" > Sent: Friday, October 21, 2011 2:10:17 PM > Subject: Re: [Swift-devel] coasters won't start > Recently. This is with the compiled binaries of the 0.93RC3 release. I > jut downloaded the tar ball from the wwwdev site. And I am not sure > how to even probe the ports coasters is trying to use. When running my > scripts the service tries to start and fails, at least that is what > the log looks like. 
> > > > On Oct 21, 2011, at 1:58 PM, Ketan Maheshwari wrote: > > > Jon, > > > There were some changes in the firewalls rules in terms of allowed > open ports on various ci machines. A long shot, but may be you want to > check on that. > > Can you paste your sites.xml and I can take a look if I find > something. > > > Ketan > > > > On Fri, Oct 21, 2011 at 1:50 PM, Jonathan Monette < jonmon at mcs.anl.gov > > wrote: > > > > Anyone have a thought on this? Not sure what is wrong. I can't seem to > get coasters registered from PADS or Beagle. The log also specifies a > FileNotFoundException when trying to transfer back the wrapper log. > Does this have something to do with the problem? I have been assuming > that this error was being thrown due to the coaster service not > connecting. > > > > > > > On Oct 20, 2011, at 3:04 PM, Jonathan Monette wrote: > > > > > > > Here is a log saying that the coaster service isn't starting, at least > that is what the log is saying. This is with on PADS with automatic > coasters using 0.93RC3. > http://www.ci.uchicago.edu/~jonmon/logs/coasters_wont_start.log > > > And here is the coaster log in zipped form > http://www.ci.uchicago.edu/~jonmon/logs/coasters.tar.gz > > > All the files used for this run are located in > ~jonmon/PADS/Swift/SwiftMontage/m101_tutorial/run.0039 on the ci > network. 
> > > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > > -- > Ketan > > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel From jonmon at mcs.anl.gov Fri Oct 21 14:36:20 2011 From: jonmon at mcs.anl.gov (Jonathan Monette) Date: Fri, 21 Oct 2011 14:36:20 -0500 Subject: [Swift-devel] coasters won't start In-Reply-To: <959494739.156835.1319225375586.JavaMail.root@zimbra-mb2.anl.gov> References: <959494739.156835.1319225375586.JavaMail.root@zimbra-mb2.anl.gov> Message-ID: The same version is appearing in 0.92. I know my scripts were working with 0.92 so maybe this is a firewall issue. Does any one know if coasters uses the GLOBUS_*_PORT_RANGE for the open ports to try? If they do I can try setting those appropriate values and try. On Oct 21, 2011, at 2:29 PM, David Kelly wrote: > You could also try with a known good version (0.92, or a previous version of 0.93 you have had success with). That should narrow down the problem a bit. > > David > > ----- Original Message ----- >> From: "Jonathan Monette" >> To: "Ketan Maheshwari" >> Cc: "swift-devel Devel" >> Sent: Friday, October 21, 2011 2:10:17 PM >> Subject: Re: [Swift-devel] coasters won't start >> Recently. This is with the compiled binaries of the 0.93RC3 release. I >> jut downloaded the tar ball from the wwwdev site. And I am not sure >> how to even probe the ports coasters is trying to use. When running my >> scripts the service tries to start and fails, at least that is what >> the log looks like. 
>> >> >> >> On Oct 21, 2011, at 1:58 PM, Ketan Maheshwari wrote: >> >> >> Jon, >> >> >> There were some changes in the firewalls rules in terms of allowed >> open ports on various ci machines. A long shot, but may be you want to >> check on that. >> >> Can you paste your sites.xml and I can take a look if I find >> something. >> >> >> Ketan >> >> >> >> On Fri, Oct 21, 2011 at 1:50 PM, Jonathan Monette < jonmon at mcs.anl.gov >>> wrote: >> >> >> >> Anyone have a thought on this? Not sure what is wrong. I can't seem to >> get coasters registered from PADS or Beagle. The log also specifies a >> FileNotFoundException when trying to transfer back the wrapper log. >> Does this have something to do with the problem? I have been assuming >> that this error was being thrown due to the coaster service not >> connecting. >> >> >> >> >> >> >> On Oct 20, 2011, at 3:04 PM, Jonathan Monette wrote: >> >> >> >> >> >> >> Here is a log saying that the coaster service isn't starting, at least >> that is what the log is saying. This is with on PADS with automatic >> coasters using 0.93RC3. >> http://www.ci.uchicago.edu/~jonmon/logs/coasters_wont_start.log >> >> >> And here is the coaster log in zipped form >> http://www.ci.uchicago.edu/~jonmon/logs/coasters.tar.gz >> >> >> All the files used for this run are located in >> ~jonmon/PADS/Swift/SwiftMontage/m101_tutorial/run.0039 on the ci >> network. 
>> >> >> >> >> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel >> >> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel >> >> >> >> >> >> -- >> Ketan >> >> >> >> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel From ketancmaheshwari at gmail.com Fri Oct 21 15:18:33 2011 From: ketancmaheshwari at gmail.com (Ketan Maheshwari) Date: Fri, 21 Oct 2011 15:18:33 -0500 Subject: [Swift-devel] Swift screencasts Message-ID: Hi, This was brewing in my mind for a while but wasn't sure how well it would be received. The idea is to prepare 'screencasts' of using various features of swift which could be uploaded to youtube or hosted on swift site as a complementary thing to standard documentation. I tried to create one for running Swift on Beagle and uploaded it on youtube: http://www.youtube.com/watch?v=MwgZaJ5bqG4. A bit longish as Beagle queue did not make way for me today. Feedback? Regards, -- Ketan -------------- next part -------------- An HTML attachment was scrubbed... URL: From skenny at uchicago.edu Sat Oct 22 05:57:45 2011 From: skenny at uchicago.edu (Sarah Kenny) Date: Sat, 22 Oct 2011 03:57:45 -0700 Subject: [Swift-devel] [Swift-user] gram on ranger In-Reply-To: <988426934.113721.1319124103942.JavaMail.root@zimbra.anl.gov> References: <988426934.113721.1319124103942.JavaMail.root@zimbra.anl.gov> Message-ID: fyi, this works on a smaller workflow, we've run it several times on a 50k version. On Thu, Oct 20, 2011 at 8:21 AM, Michael Wilde wrote: > Thanks, Ketan. 
If I understand you correctly, then I would consider this a > Swift bug, in that maxnodes should always mean *nodes*, for every type of > resource provider including SGE. Based on what you say, the SGE provider is > in this case treating the requested maxnode count as cores (Assuming Anjali > was running the same Swift revision as you were testing on here). > > But then that might not explain the error in the log that Sarah posted. > > It seems the next step is to try the run on a smaller job (we can test this > ourselves), and see if we can replicate and diagnose the error, with SGE > subit files and output/error logs. > > David, can you do this, since you were working on SGE testing last week? > You and Ketan should share what you know about the situation, via > swift-devel, as Ketan is also running on Ranger with persistent coasters I > think. > > Thanks, > > Mike > > > ----- Original Message ----- > > From: "Ketan Maheshwari" > > To: "Michael Wilde" > > Cc: "Sarah Kenny" , "Anjali Raja" < > anjraja at gmail.com>, "Swift Devel" > > , "Swift User" > > Sent: Thursday, October 20, 2011 9:54:33 AM > > Subject: Re: [Swift-devel] [Swift-user] gram on ranger > > On Thu, Oct 20, 2011 at 7:50 AM, Michael Wilde < wilde at mcs.anl.gov > > > wrote: > > > > > > Hi Sarah, Anjali, > > > > My initial theory on whats failing in this job is that the Ranger > > development queue is limited to jobs of 16 nodes or less. (The Ranger > > User Guide says maxprocs 256 for that queue, and qconf -sq development > > says slots 16, which agrees). So you need to either change to one of > > the production queues (normal, long etc) or reduce the values of > > maxnode and nodegranularity. 
> > > > > > > > I have a little confusion here: the desired line in the final pbs > > script should be : #$ -pe way 256; in order to have 256 procs, > > however, putting maxnodes=16 on sites.xml results in the following > > line on pbs: > > #$ -pe way 16; > > I understand this number 16/256 is for procs since, when putting 256 > > with development queue, ranger indeed allows the job to run in > > development queue. > > > > > > > > I would also suggest (unless you have already done this) that you test > > first on a very small run (like a single RInvoke app call) and then > > scale up to just a few voxels per dataset before trying such a large > > run. Have you already tested that? > > > > Lastly, when reporting problems like this, the swift standard > > output/err is also very helpful to get a higher-level view of what > > went wrong. > > > > Swift needs to clearly return errors from the local resource provider, > > which it doesnt seem to be doing here. Ive filed this as bug 593 and > > assigned to David. > > > > Please let us know if changing the queue and/or slots resolves the > > problem. As mentioned in the bug report I think you can set debug=true > > (or yes?) in the provider-sge.properties file and get swift to > > preserve the output from SGE in ~/.globus/scripts. (In fact that may > > already be preserved, I am not sure). Please check there to see if the > > SGE error is there. 
> > > > Thanks, > > > > - Mike > > > > > > > > ----- Original Message ----- > > > From: "Sarah Kenny" < skenny at uchicago.edu > > > > > > To: "Mihael Hategan" < hategan at mcs.anl.gov > > > > Cc: "Anjali Raja" < anjraja at gmail.com >, "Swift Devel" < > > > swift-devel at ci.uchicago.edu >, "Swift User" > > > < swift-user at ci.uchicago.edu > > > > Sent: Thursday, October 20, 2011 6:07:09 AM > > > Subject: Re: [Swift-devel] [Swift-user] gram on ranger > > > > > hi all, one of our users, anjali (cc'd here) is trying to submit > > > this > > > ~400k job workflow to ranger...thought i'd see if you felt like > > > having > > > a look :) > > > > > > log is here: > > > /home/skenny/swift_logs/corr_multisubj-20111018-1321-ihf8hz5g.log > > > > > > sites file: > > > > > > > > > > > > > > > > > > > > > > > > 7200 > > > 00:20:00 > > > 1 > > > 64 > > > 256 > > > development > > > 1.28 > > > TG-DBS080004N > > > 16way > > > 10000 > > > /work/00926/tg459516/swiftwork > > > > > > > > > > > > > > > On Wed, Oct 12, 2011 at 12:13 PM, Mihael Hategan < > > > hategan at mcs.anl.gov > > > > wrote: > > > > > > > > > > > > On Tue, 2011-10-11 at 17:13 -0700, Sarah Kenny wrote: > > > > > > > > > > > > On Tue, Oct 11, 2011 at 4:23 PM, Mihael Hategan < > > > > hategan at mcs.anl.gov > > > > > wrote: > > > > Is this with a persistent coaster service? > > > > > > > > admittedly i have not used persistent coaster service...should i? > > > > > > No. I was just trying to figure out whether it might be something > > > related to the persistent version. > > > > > > > > > > > > > > > > i feel like it's documented *somewhere* (?) 
> > > > > > > > for now i've tried setting 'sitedir.keep=true' in the config so > > > > maybe > > > > it won't try to run the cleanup job...we'll see (waiting in q) > > > > > > > > > > > > > > > > On Tue, 2011-10-11 at 12:05 -0700, Sarah Kenny wrote: > > > > > > > > > > > > > > > On Tue, Oct 11, 2011 at 11:49 AM, David Kelly > > > > < davidk at ci.uchicago.edu > > > > > > wrote: > > > > > > > > > > That could be it.. maybe a cleanup script is not > > > > getting the > > > > > right parameters and failing. Do you happen to have > > > > a copy of > > > > > the coaster log? > > > > > > > > > > just put it in /home/skenny/swift_logs > > > > > > > > > > > > > > > Maybe there will be some clues in there. > > > > > > > > > > ----- Original Message ----- > > > > > > From: "Sarah Kenny" < skenny at uchicago.edu > > > > > > > > > > > > To: "David Kelly" < davidk at ci.uchicago.edu > > > > > > > Cc: "Swift Devel" < swift-devel at ci.uchicago.edu >, > > > > "Swift > > > > > User" < swift-user at ci.uchicago.edu >, "Justin M > > > > Wozniak" > > > > > > < wozniak at mcs.anl.gov > > > > > > > > > > > > Sent: Tuesday, October 11, 2011 1:32:37 PM > > > > > > Subject: Re: [Swift-user] gram on ranger > > > > > > > > > > > so, this workflow completes all the jobs but then > > > > just hangs > > > > > > indefinitely at the end...maybe a stray cleanup > > > > job? 
> > > > > > > > > > > > log is here: > > > > > > > > > > > > > > > > /home/skenny/swift_logs/corr-20111010-2104-fl5yngd9.log > > > > > > > > > > > > just tweaked the sites file a bit from what david > > > > sent me: > > > > > > > > > > > > > > > > > > > > > > > > > > > url=" > > > > > > gatekeeper.ranger.tacc.teragrid.org "/> > > > > > > > > > > > > > > > > > > > > key="maxtime">28800 > > > > > > > > > > key="maxWallTime">00:15:00 > > > > > > > > > key="jobsPerNode">1 > > > > > > > > > > key="nodeGranularity">64 > > > > > > > > > key="maxNodes">256 > > > > > > > > > key="queue">normal > > > > > > > > > key="jobThrottle">1 > > > > > > > > > > key="project">TG-DBS080004N > > > > > > > > > key="pe">16way > > > > > > > > > > key="initialScore">10000 > > > > > > > > > > > > > > > > /work/00043/tg457040/sidgrid_out/skenny > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Mon, Oct 10, 2011 at 3:43 PM, Sarah Kenny < > > > > > skenny at uchicago.edu > > > > > > > wrote: > > > > > > > > > > > > > > > > > > ok, thanks, got in the queue now...also, realized > > > > my last > > > > > run may have > > > > > > been using the old swift. apparently i had > > > > SWIFT_HOME set in > > > > > my env > > > > > > and that overrides the newer swift i had set in my > > > > PATH. > > > > > > > > > > > > ~sk > > > > > > > > > > > > > > > > > > > > > > > > On Mon, Oct 10, 2011 at 12:28 PM, David Kelly < > > > > > davidk at ci.uchicago.edu > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Sarah, > > > > > > > > > > > > Can you give this another try with the latest > > > > 0.93? I made > > > > > some > > > > > > changes to the coaster and sge providers and was > > > > able to get > > > > > it > > > > > > working with a simple catns script. 
Here is the > > > > > configuration file I > > > > > > was using: > > > > > > > > > > > > > > > > > > > > > > > > > > > url=" > > > > > > gatekeeper.ranger.tacc.teragrid.org "/> > > > > > > > > > > > > > > > > > > > > > > > > > > key="maxtime">3600 > > > > > > > > > > key="maxWallTime">00:00:03 > > > > > > > > > key="jobsPerNode">1 > > > > > > > > > > key="nodeGranularity">16 > > > > > > > > > key="maxNodes">16 > > > > > > > > > > key="queue">development > > > > > > > > > key="jobThrottle">0.9 > > > > > > > > > > > > > > > > key="project">TG-DBS080004N > > > > > > > > > > > > > > > key="pe">16way > > > > > > > > > > > > > > > /share/home/01503/davidkel/swiftwork > > > > > > > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > > > > David > > > > > > > > > > > > ----- Original Message ----- > > > > > > > > > > > > > From: "Sarah Kenny" < skenny at uchicago.edu > > > > > > > > To: "Justin M Wozniak" < wozniak at mcs.anl.gov > > > > > > > > Cc: "Swift Devel" < swift-devel at ci.uchicago.edu > > > > >, "Swift > > > > > User" < > > > > > > > swift-user at ci.uchicago.edu > > > > > > > > > > > > > > > > > > > > > > > > > > Sent: Friday, October 7, 2011 3:13:57 PM > > > > > > > Subject: Re: [Swift-user] gram on ranger > > > > > > > > > > > /home/skenny/swift_logs/dummy-20111005-0126-6575n7x5.log > > > > > > > > > > > > > > on ci > > > > > > > > > > > > > > > > > > > > > On Fri, Oct 7, 2011 at 8:16 AM, Justin M Wozniak > > > > < > > > > > > > wozniak at mcs.anl.gov > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > Can I take a look at the log? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Thu, 6 Oct 2011, Sarah Kenny wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > hey all, i'm trying to submit to gram on ranger > > > > using the > > > > > latest > > > > > > > swift > > > > > > > (built from trunk). 
it failes like so: > > > > > > > > > > > > > > Cannot submit job > > > > > > > Caused by: > > > > > > > org.globus.cog.abstraction. impl.common.task. > > > > > > > TaskSubmissionException: > > > > > > > Cannot > > > > > > > submit job > > > > > > > Caused by: org.globus.gram.GramException: > > > > Parameter not > > > > > supported > > > > > > > Cannot submit job > > > > > > > > > > > > > > the gram log was saying first that 'jobsPerNode' > > > > is not > > > > > supported so > > > > > > > i > > > > > > > changed it to workersPerNode and then it was > > > > saying > > > > > 'maxnodes' is > > > > > > > not > > > > > > > supported. here's my sites file: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > key="initialScore">10000 > > > > profile> > > > > > > > > > > key="jobThrottle">1 > > > > > > > > > > key="maxWallTime">00:15:00 > > > > profile> > > > > > > > > > > key="maxTime">86400 > > > > > > > > > > key="slots">1 > > > > > > > > > > key="maxNodes">256 > > > > > > > > > > key="pe">16way > > > > > > > > > > key="workersPerNode">1 > > > > profile> > > > > > > > > > > key="nodeGranularity">64 > > > > profile> > > > > > > > > > > key="queue">normal > > > > > > > > > > key="project">TG-DBS080004N > > > > profile> > > > > > > > > > > > > > > > > > > > > > > > jobManager="gt2:gt2:SGE" > > > > > url=" > > > > > > > gatekeeper.ranger.tacc. teragrid.org "/> > > > > > > > > > > > > > > > > > > > > /work/00043/ > > > > tg457040 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > thoughts? ideas? > > > > > > > > > > > > > > -- > > > > > > > Justin M Wozniak > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > Sarah Kenny > > > > > > > Programmer ~ Brain Circuits Laboratory ~ Rm 2224 > > > > Bio Sci > > > > > III > > > > > > > University of California Irvine, Dept. 
of > > > > Neurology ~ > > > > > 773-818-8300 > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > > > > Swift-user mailing list > > > > > > > Swift-user at ci.uchicago.edu > > > > > > > > > > > > > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > Sarah Kenny > > > > > > Programmer ~ Brain Circuits Laboratory ~ Rm 2224 > > > > Bio Sci III > > > > > > University of California Irvine, Dept. of > > > > Neurology ~ > > > > > 773-818-8300 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > Sarah Kenny > > > > > > Programmer ~ Brain Circuits Laboratory ~ Rm 2224 > > > > Bio Sci III > > > > > > University of California Irvine, Dept. of > > > > Neurology ~ > > > > > 773-818-8300 > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > Sarah Kenny > > > > > Programmer ~ Brain Circuits Laboratory ~ Rm 2224 Bio Sci III > > > > > University of California Irvine, Dept. of Neurology ~ > > > > 773-818-8300 > > > > > > > > > > _______________________________________________ > > > > > Swift-user mailing list > > > > > Swift-user at ci.uchicago.edu > > > > > > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > Sarah Kenny > > > > Programmer ~ Brain Circuits Laboratory ~ Rm 2224 Bio Sci III > > > > University of California Irvine, Dept. of Neurology ~ 773-818-8300 > > > > > > > > > > > > > > > > > > > > > > -- > > > Sarah Kenny > > > Programmer ~ Brain Circuits Laboratory ~ Rm 2224 Bio Sci III > > > University of California Irvine, Dept. 
of Neurology ~ 773-818-8300 > > > > > > > > > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > -- > > Michael Wilde > > Computation Institute, University of Chicago > > Mathematics and Computer Science Division > > Argonne National Laboratory > > > > > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > > > > > > -- > > Ketan > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > -- Sarah Kenny Programmer ~ Brain Circuits Laboratory ~ Rm 2224 Bio Sci III University of California Irvine, Dept. of Neurology ~ 773-818-8300 -------------- next part -------------- An HTML attachment was scrubbed... URL: From wilde at mcs.anl.gov Sat Oct 22 09:41:17 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sat, 22 Oct 2011 09:41:17 -0500 (CDT) Subject: [Swift-devel] [Swift-user] gram on ranger In-Reply-To: Message-ID: <1923360968.122163.1319294477023.JavaMail.root@zimbra.anl.gov> Sarah, was this 50K version run with the same sites file and Swift version? At any rate, David is correcting some known problems in the SGE provider and increasing the test coverage for it. Once thats done we can try again. In the meantime, if you want to push this forward in parallel, can you try to run again and capture the SGE submit and stdout/err files? 
I'm not 100% sure the following is correct, but I think you can set the SGE provider into debug mode by doing one or both of the following:

etc/provider-sge.properties: add line:

debug=true

(I think this works for the PBS provider and assume it does for SGE; we need to verify.)

Also, the sites/pbs page on the swiftdevel site has this, which *might* also give more debug info for SGE (again, needs to be checked):

# Special functionality: suppresses auto-deletion of PBS submit file
log4j.logger.org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor=DEBUG
log4j.logger.org.globus.cog.abstraction.impl.scheduler.pbs.PBSExecutor=DEBUG

- Mike

----- Original Message -----
> From: "Sarah Kenny"
> To: "Michael Wilde"
> Cc: "Ketan Maheshwari" , "David Kelly" , "Anjali Raja"
> , "Swift Devel" , "Swift User"
> Sent: Saturday, October 22, 2011 5:57:45 AM
> Subject: Re: [Swift-devel] [Swift-user] gram on ranger
> fyi, this works on a smaller workflow, we've run it several times on a
> 50k version.
>
>
> On Thu, Oct 20, 2011 at 8:21 AM, Michael Wilde < wilde at mcs.anl.gov >
> wrote:
>
>
> Thanks, Ketan. If I understand you correctly, then I would consider
> this a Swift bug, in that maxnodes should always mean *nodes*, for
> every type of resource provider including SGE. Based on what you say,
> the SGE provider is in this case treating the requested maxnode count
> as cores (assuming Anjali was running the same Swift revision as you
> were testing on here).
>
> But then that might not explain the error in the log that Sarah
> posted.
>
> It seems the next step is to try the run on a smaller job (we can test
> this ourselves), and see if we can replicate and diagnose the error,
> with SGE submit files and output/error logs.
>
> David, can you do this, since you were working on SGE testing last
> week?
> You and Ketan should share what you know about the situation, via
> swift-devel, as Ketan is also running on Ranger with persistent
> coasters I think.
> > Thanks, > > > Mike > > > ----- Original Message ----- > > > From: "Ketan Maheshwari" < ketancmaheshwari at gmail.com > > > To: "Michael Wilde" < wilde at mcs.anl.gov > > > Cc: "Sarah Kenny" < skenny at uchicago.edu >, "Anjali Raja" < > > anjraja at gmail.com >, "Swift Devel" > > < swift-devel at ci.uchicago.edu >, "Swift User" < > > swift-user at ci.uchicago.edu > > > Sent: Thursday, October 20, 2011 9:54:33 AM > > > > > Subject: Re: [Swift-devel] [Swift-user] gram on ranger > > On Thu, Oct 20, 2011 at 7:50 AM, Michael Wilde < wilde at mcs.anl.gov > > > wrote: > > > > > > Hi Sarah, Anjali, > > > > My initial theory on whats failing in this job is that the Ranger > > development queue is limited to jobs of 16 nodes or less. (The > > Ranger > > User Guide says maxprocs 256 for that queue, and qconf -sq > > development > > says slots 16, which agrees). So you need to either change to one of > > the production queues (normal, long etc) or reduce the values of > > maxnode and nodegranularity. > > > > > > > > I have a little confusion here: the desired line in the final pbs > > script should be : #$ -pe way 256; in order to have 256 procs, > > however, putting maxnodes=16 on sites.xml results in the following > > line on pbs: > > #$ -pe way 16; > > I understand this number 16/256 is for procs since, when putting 256 > > with development queue, ranger indeed allows the job to run in > > development queue. > > > > > > > > I would also suggest (unless you have already done this) that you > > test > > first on a very small run (like a single RInvoke app call) and then > > scale up to just a few voxels per dataset before trying such a large > > run. Have you already tested that? > > > > Lastly, when reporting problems like this, the swift standard > > output/err is also very helpful to get a higher-level view of what > > went wrong. > > > > Swift needs to clearly return errors from the local resource > > provider, > > which it doesnt seem to be doing here. 
Ive filed this as bug 593 and > > assigned to David. > > > > Please let us know if changing the queue and/or slots resolves the > > problem. As mentioned in the bug report I think you can set > > debug=true > > (or yes?) in the provider-sge.properties file and get swift to > > preserve the output from SGE in ~/.globus/scripts. (In fact that may > > already be preserved, I am not sure). Please check there to see if > > the > > SGE error is there. > > > > Thanks, > > > > - Mike > > > > > > > > ----- Original Message ----- > > > From: "Sarah Kenny" < skenny at uchicago.edu > > > > > > To: "Mihael Hategan" < hategan at mcs.anl.gov > > > > Cc: "Anjali Raja" < anjraja at gmail.com >, "Swift Devel" < > > > swift-devel at ci.uchicago.edu >, "Swift User" > > > < swift-user at ci.uchicago.edu > > > > Sent: Thursday, October 20, 2011 6:07:09 AM > > > Subject: Re: [Swift-devel] [Swift-user] gram on ranger > > > > > hi all, one of our users, anjali (cc'd here) is trying to submit > > > this > > > ~400k job workflow to ranger...thought i'd see if you felt like > > > having > > > a look :) > > > > > > log is here: > > > /home/skenny/swift_logs/corr_multisubj-20111018-1321-ihf8hz5g.log > > > > > > sites file: > > > > > > > > > > > > > > > > > > > > > > > > 7200 > > > 00:20:00 > > > 1 > > > 64 > > > 256 > > > development > > > 1.28 > > > TG-DBS080004N > > > 16way > > > 10000 > > > /work/00926/tg459516/swiftwork > > > > > > > > > > > > > > > On Wed, Oct 12, 2011 at 12:13 PM, Mihael Hategan < > > > hategan at mcs.anl.gov > > > > wrote: > > > > > > > > > > > > On Tue, 2011-10-11 at 17:13 -0700, Sarah Kenny wrote: > > > > > > > > > > > > On Tue, Oct 11, 2011 at 4:23 PM, Mihael Hategan < > > > > hategan at mcs.anl.gov > > > > > wrote: > > > > Is this with a persistent coaster service? > > > > > > > > admittedly i have not used persistent coaster service...should > > > > i? > > > > > > No. 
I was just trying to figure out whether it might be something > > > related to the persistent version. > > > > > > > > > > > > > > > > i feel like it's documented *somewhere* (?) > > > > > > > > for now i've tried setting 'sitedir.keep=true' in the config so > > > > maybe > > > > it won't try to run the cleanup job...we'll see (waiting in q) > > > > > > > > > > > > > > > > On Tue, 2011-10-11 at 12:05 -0700, Sarah Kenny wrote: > > > > > > > > > > > > > > > On Tue, Oct 11, 2011 at 11:49 AM, David Kelly > > > > < davidk at ci.uchicago.edu > > > > > > wrote: > > > > > > > > > > That could be it.. maybe a cleanup script is not > > > > getting the > > > > > right parameters and failing. Do you happen to have > > > > a copy of > > > > > the coaster log? > > > > > > > > > > just put it in /home/skenny/swift_logs > > > > > > > > > > > > > > > Maybe there will be some clues in there. > > > > > > > > > > ----- Original Message ----- > > > > > > From: "Sarah Kenny" < skenny at uchicago.edu > > > > > > > > > > > > To: "David Kelly" < davidk at ci.uchicago.edu > > > > > > > Cc: "Swift Devel" < swift-devel at ci.uchicago.edu >, > > > > "Swift > > > > > User" < swift-user at ci.uchicago.edu >, "Justin M > > > > Wozniak" > > > > > > < wozniak at mcs.anl.gov > > > > > > > > > > > > Sent: Tuesday, October 11, 2011 1:32:37 PM > > > > > > Subject: Re: [Swift-user] gram on ranger > > > > > > > > > > > so, this workflow completes all the jobs but then > > > > just hangs > > > > > > indefinitely at the end...maybe a stray cleanup > > > > job? 
> > > > > > > > > > > > log is here: > > > > > > > > > > > > > > > > /home/skenny/swift_logs/corr-20111010-2104-fl5yngd9.log > > > > > > > > > > > > just tweaked the sites file a bit from what david > > > > sent me: > > > > > > > > > > > > > > > > > > > > > > > > > > > url=" > > > > > > gatekeeper.ranger.tacc.teragrid.org "/> > > > > > > > > > > > > > > > > > > > > key="maxtime">28800 > > > > > > > > > > key="maxWallTime">00:15:00 > > > > > > > > > key="jobsPerNode">1 > > > > > > > > > > key="nodeGranularity">64 > > > > > > > > > key="maxNodes">256 > > > > > > > > > key="queue">normal > > > > > > > > > key="jobThrottle">1 > > > > > > > > > > key="project">TG-DBS080004N > > > > > > > > > key="pe">16way > > > > > > > > > > key="initialScore">10000 > > > > > > > > > > > > > > > /work/00043/tg457040/sidgrid_out/skenny > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Mon, Oct 10, 2011 at 3:43 PM, Sarah Kenny < > > > > > skenny at uchicago.edu > > > > > > > wrote: > > > > > > > > > > > > > > > > > > ok, thanks, got in the queue now...also, realized > > > > my last > > > > > run may have > > > > > > been using the old swift. apparently i had > > > > SWIFT_HOME set in > > > > > my env > > > > > > and that overrides the newer swift i had set in my > > > > PATH. > > > > > > > > > > > > ~sk > > > > > > > > > > > > > > > > > > > > > > > > On Mon, Oct 10, 2011 at 12:28 PM, David Kelly < > > > > > davidk at ci.uchicago.edu > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Sarah, > > > > > > > > > > > > Can you give this another try with the latest > > > > 0.93? I made > > > > > some > > > > > > changes to the coaster and sge providers and was > > > > able to get > > > > > it > > > > > > working with a simple catns script. 
Here is the > > > > > configuration file I > > > > > > was using: > > > > > > > > > > > > > > > > > > > > > > > > > > > url=" > > > > > > gatekeeper.ranger.tacc.teragrid.org "/> > > > > > > > > > > > > > > > > > > > > > > > > > > key="maxtime">3600 > > > > > > > > > > key="maxWallTime">00:00:03 > > > > > > > > > key="jobsPerNode">1 > > > > > > > > > > key="nodeGranularity">16 > > > > > > > > > key="maxNodes">16 > > > > > > > > > > key="queue">development > > > > > > > > > key="jobThrottle">0.9 > > > > > > > > > > > > > > > > key="project">TG-DBS080004N > > > > > > > > > > > > > > > key="pe">16way > > > > > > > > > > > > > > > /share/home/01503/davidkel/swiftwork > > > > > > > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > > > > David > > > > > > > > > > > > ----- Original Message ----- > > > > > > > > > > > > > From: "Sarah Kenny" < skenny at uchicago.edu > > > > > > > > To: "Justin M Wozniak" < wozniak at mcs.anl.gov > > > > > > > > Cc: "Swift Devel" < swift-devel at ci.uchicago.edu > > > > >, "Swift > > > > > User" < > > > > > > > swift-user at ci.uchicago.edu > > > > > > > > > > > > > > > > > > > > > > > > > > Sent: Friday, October 7, 2011 3:13:57 PM > > > > > > > Subject: Re: [Swift-user] gram on ranger > > > > > > > > > > > /home/skenny/swift_logs/dummy-20111005-0126-6575n7x5.log > > > > > > > > > > > > > > on ci > > > > > > > > > > > > > > > > > > > > > On Fri, Oct 7, 2011 at 8:16 AM, Justin M Wozniak > > > > < > > > > > > > wozniak at mcs.anl.gov > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > Can I take a look at the log? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Thu, 6 Oct 2011, Sarah Kenny wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > hey all, i'm trying to submit to gram on ranger > > > > using the > > > > > latest > > > > > > > swift > > > > > > > (built from trunk). 
it failes like so: > > > > > > > > > > > > > > Cannot submit job > > > > > > > Caused by: > > > > > > > org.globus.cog.abstraction. impl.common.task. > > > > > > > TaskSubmissionException: > > > > > > > Cannot > > > > > > > submit job > > > > > > > Caused by: org.globus.gram.GramException: > > > > Parameter not > > > > > supported > > > > > > > Cannot submit job > > > > > > > > > > > > > > the gram log was saying first that 'jobsPerNode' > > > > is not > > > > > supported so > > > > > > > i > > > > > > > changed it to workersPerNode and then it was > > > > saying > > > > > 'maxnodes' is > > > > > > > not > > > > > > > supported. here's my sites file: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > key="initialScore">10000 > > > > profile> > > > > > > > > > > key="jobThrottle">1 > > > > > > > > > > key="maxWallTime">00:15:00 > > > > profile> > > > > > > > > > > key="maxTime">86400 > > > > > > > > > > key="slots">1 > > > > > > > > > > key="maxNodes">256 > > > > > > > > > > key="pe">16way > > > > > > > > > > key="workersPerNode">1 > > > > profile> > > > > > > > > > > key="nodeGranularity">64 > > > > profile> > > > > > > > > > > key="queue">normal > > > > > > > > > > key="project">TG-DBS080004N > > > > profile> > > > > > > > > > > > > > > > > > > > > > > > jobManager="gt2:gt2:SGE" > > > > > url=" > > > > > > > gatekeeper.ranger.tacc. teragrid.org "/> > > > > > > > > > > > > > > > > > > > > /work/00043/ > > > > tg457040 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > thoughts? ideas? > > > > > > > > > > > > > > -- > > > > > > > Justin M Wozniak > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > Sarah Kenny > > > > > > > Programmer ~ Brain Circuits Laboratory ~ Rm 2224 > > > > Bio Sci > > > > > III > > > > > > > University of California Irvine, Dept. 
of Neurology ~ 773-818-8300 > > > > > > > > > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > -- > > Michael Wilde > > Computation Institute, University of Chicago > > Mathematics and Computer Science Division > > Argonne National Laboratory > > > > > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > > > > > > -- > > Ketan > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > > > > -- > Sarah Kenny > Programmer ~ Brain Circuits Laboratory ~ Rm 2224 Bio Sci III > University of California Irvine, Dept. of Neurology ~ 773-818-8300 -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From jonmon at mcs.anl.gov Sat Oct 22 21:19:51 2011 From: jonmon at mcs.anl.gov (Jonathan Monette) Date: Sat, 22 Oct 2011 21:19:51 -0500 Subject: [Swift-devel] CI network Message-ID: <7DC4C1E2-78C2-48EC-8C01-2E5B83AEA849@mcs.anl.gov> Hey, Is anyone able to submit jobs remotely to beagle or pads? Remotely as is from communicado, bridled, MCS machines? I seem to be getting ssh problems which I think may have something to do with the recent security tightening and port restrictions. 
From tim.g.armstrong at gmail.com Mon Oct 24 09:23:03 2011
From: tim.g.armstrong at gmail.com (Tim Armstrong)
Date: Mon, 24 Oct 2011 09:23:03 -0500
Subject: [Swift-devel] CI network
In-Reply-To: <7DC4C1E2-78C2-48EC-8C01-2E5B83AEA849@mcs.anl.gov>
References: <7DC4C1E2-78C2-48EC-8C01-2E5B83AEA849@mcs.anl.gov>
Message-ID:

Hi Jon,

Not sure if you're still having problems, but I can log in to PADS and Beagle OK.

On Sat, Oct 22, 2011 at 9:19 PM, Jonathan Monette wrote:
> Hey,
> Is anyone able to submit jobs remotely to beagle or pads? Remotely as is
> from communicado, bridled, MCS machines? I seem to be getting ssh problems
> which I think may have something to do with the recent security tightening
> and port restrictions.
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From jonmon at mcs.anl.gov Mon Oct 24 10:03:56 2011
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Mon, 24 Oct 2011 10:03:56 -0500
Subject: [Swift-devel] CI network
Message-ID: <20111024150343.25957123A6@zimbra.anl.gov>

Logging in wasn't the problem. The connection issues appear when I use Swift to submit jobs remotely from a different machine. I believe the problem is that the CI closed some ports that coasters use to connect back to the Swift service.

----- Reply message -----
From: "Tim Armstrong"
Date: Mon, Oct 24, 2011 9:23 am
Subject: [Swift-devel] CI network
To: "Jonathan Monette"
Cc: "swift-devel Devel"
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From wilde at mcs.anl.gov Tue Oct 25 14:10:04 2011
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 25 Oct 2011 14:10:04 -0500 (CDT)
Subject: [Swift-devel] Swift 0.93 RC3 hangs after all jobs seem to be complete
In-Reply-To:
Message-ID: <847467507.129918.1319569804728.JavaMail.root@zimbra.anl.gov>

Mihael, David,

Can you both report on what you believe the status of this bug is?

I think the subject line here is a bit misleading, in that it seems that a similar thing (i.e., the workflow deadlocks) was happening both at the start and at the end of various scripts, and possibly at intermediate points.

I *think* that Sheri was seeing hangs at the start and in the middle; David was seeing hangs at the end.

Talking to David just now, he reported diagnosing his hang case down to a situation where the coaster scheduler emits a "null" (ill-formed) job to PBS at the tail end of a workflow. He inserted a workaround to ignore (not submit) such "null" jobs. I'm not sure if that was committed, or just tested. David, can you post the details?

Mihael, did you look at the jstack that Sheri attached to the posting below?

Do you have any theories or fixes for this issue or issues? Unless we believe it's resolved, David, please file in bugzilla and attach relevant postings from Sheri, David, and others on this bug.

Thanks,

- Mike

----- Original Message -----
> From: "Sheri Mickelson"
> To: "Mihael Hategan"
> Cc: "Michael Wilde" , "David Kelly"
> Sent: Wednesday, October 12, 2011 10:34:43 AM
> Subject: Re: Swift 0.93 RC3 hangs after all jobs seem to be complete
> I just tried running again on fusion with 0.93RC3 and it hung right
> away.
> It started with "No events in 10s." and then it looks like it hung.
> This was run using coasters and I manually killed it after about 5
> minutes.
> I attached both the log file and the jstack info.
>
> Thanks, Sheri
>
>
>
>
>
> On Oct 7, 2011, at 2:47 PM, Mihael Hategan wrote:
>
> > Yeah, so the hang checker doesn't show anything.
Which means it's > > not a > > swift flow issue. > > > > I would do what Mike says with jstack as soon after the hang checker > > kicks in as possible. > > > > Mihael > > > > On Fri, 2011-10-07 at 12:12 -0500, Michael Wilde wrote: > >> Was: Re: Swift 0.93RC2 is bad - Re: Help on fusion > >> Changed subject so you can see what this is regarding, Mihael. > >> > >> --- > >> > >> Sheri, could you run this again? (Or have you already, and if so, > >> did it run to completion?) > >> > >> What I saw in the log yesterday was that all jobs that were > >> submitted to coasters ran successfully, including all of their data > >> transfers. > >> > >> But I also see that the Swift "hang checker" went off, which > >> indicates that some Java activity was indeed hung. > >> > >> When this happens again, can you run the command "jstack -l PID" > >> where PID is the process of the Swift Java command (which you can > >> best locate by using "ps -u $USER -H" and locate the java process > >> below the swift command). Then send us the jstack output in > >> addition to the associated Swift log. > >> > >> Mihael, in the meantime, can you take a look at the log to see if > >> you can spot any incomplete Swift activities that may be hanging > >> the run? > >> > >> Thanks, > >> > >> - Mike > >> > >> > >> ----- Original Message ----- > >>> From: "Sheri Mickelson" > >>> To: "David Kelly" > >>> Cc: "Michael Wilde" > >>> Sent: Thursday, October 6, 2011 3:23:57 PM > >>> Subject: Re: Swift 0.93RC2 is bad - Re: Help on fusion > >>> Here's the log file. > >>> > >>> > >>> > >>> On Oct 6, 2011, at 3:19 PM, David Kelly wrote: > >>> > >>>> Hi Sheri, > >>>> > >>>> Could you please send the log file so we can take a closer look > >>>> and > >>>> see what's going on there? 
> >>>> > >>>> Thanks, > >>>> David > >>>> > >>>> ----- Original Message ----- > >>>>> From: "Sheri Mickelson" > >>>>> To: "David Kelly" > >>>>> Cc: "Michael Wilde" > >>>>> Sent: Thursday, October 6, 2011 3:07:44 PM > >>>>> Subject: Re: Swift 0.93RC2 is bad - Re: Help on fusion > >>>>> I just tried this version and had a little bit more luck. It > >>>>> looked > >>>>> like everything was running fine, but now it looks like it's > >>>>> hung > >>>>> near > >>>>> the end. I keep getting the message "Finished successfully:66". > >>>>> The > >>>>> message before that was "Checking status:1 Finished > >>>>> successfully:65". > >>>>> > >>>>> Thanks, Sheri > >>>>> > >>>>> On Oct 6, 2011, at 2:14 PM, David Kelly wrote: > >>>>> > >>>>>> > >>>>>> It's been a while since RC2 was created. There have been quite > >>>>>> a > >>>>>> lot > >>>>>> of fixes since then, so I just created a new 0.93 RC3. The > >>>>>> direct > >>>>>> download can be found at: > >>>>>> > >>>>>> http://www.ci.uchicago.edu/swift/packages/swift-0.93RC3.tar.gz > >>>>>> > >>>>>> Hope this helps. > >>>>>> > >>>>>> Thanks, > >>>>>> David > >>>>>> > >>>>>> ----- Original Message ----- > >>>>>>> From: "Michael Wilde" > >>>>>>> To: "Sheri Mickelson" > >>>>>>> Cc: "David Kelly" > >>>>>>> Sent: Thursday, October 6, 2011 12:17:56 PM > >>>>>>> Subject: Swift 0.93RC2 is bad - Re: Help on fusion > >>>>>>> Sheri, > >>>>>>> > >>>>>>> Your AMWG script is failing because the swift-0.93RC2 release > >>>>>>> is > >>>>>>> bad. > >>>>>>> > >>>>>>> The error its showing in the log is this: "2011-10-06 > >>>>>>> 11:46:24,635-0500 DEBUG vdl:execute2 APPLICATION_EXCEPTION > >>>>>>> jobid=ncatted-se54rxgk - Application exception: null > >>>>>>> Caused by: > >>>>>>> org > >>>>>>> .globus > >>>>>>> .cog.abstraction.impl.common.task.TaskSubmissionException: > >>>>>>> lowOverallocation must be < 1.0 (currently 100.0)" > >>>>>>> > >>>>>>> ...which was fixed in SVN for 0.93. 
> >>>>>>> > >>>>>>> Did you load this from a tarball or from SVN? > >>>>>>> > >>>>>>> David, do we have a more recent 0.93 release candidate? > >>>>>>> > >>>>>>> If not, then can you build an 0.93 from SVN? If not, we can do > >>>>>>> that > >>>>>>> for you. I'll start a build in the meantime just in case. > >>>>>>> > >>>>>>> Sorry about this error, Sheri. > >>>>>>> > >>>>>>> - Mike > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> ----- Original Message ----- > >>>>>>>> From: "Sheri Mickelson" > >>>>>>>> To: "Michael Wilde" > >>>>>>>> Sent: Thursday, October 6, 2011 11:52:58 AM > >>>>>>>> Subject: Re: Help on fusion > >>>>>>>> I have everything in > >>>>>>>> /fusion/gpfs/home/mickelso/amwg-swift/svnRepo/swift > >>>>>>>> > >>>>>>>> I believe the pathnames are correct. > >>>>>>>> > >>>>>>>> I have not tried running on localhost. > >>>>>>>> > >>>>>>>> I'm using swift version swift-0.93RC2. > >>>>>>>> > >>>>>>>> I'm not at Argonne today, but will be in tomorrow. > >>>>>>>> > >>>>>>>> -Sheri > >>>>>>>> > >>>>>>>> On Oct 6, 2011, at 11:39 AM, Michael Wilde wrote: > >>>>>>>> > >>>>>>>>> Hi Sheri, > >>>>>>>>> > >>>>>>>>> can you point me to the log, run directory, and work dir of > >>>>>>>>> this > >>>>>>>>> run? > >>>>>>>>> > >>>>>>>>> I trhink we'll need to look into to the log, and the .d > >>>>>>>>> directories, > >>>>>>>>> and possibly the work dir to locate the stdout of the > >>>>>>>>> failing > >>>>>>>>> apps. > >>>>>>>>> > >>>>>>>>> - are the pathnames correct? > >>>>>>>>> > >>>>>>>>> - does the run work on localhost? (ie, are the PBS jobs > >>>>>>>>> running > >>>>>>>>> or > >>>>>>>>> failing)? > >>>>>>>>> > >>>>>>>>> - which Swift rev are you using? > >>>>>>>>> > >>>>>>>>> Are you at Argonne? I can stop by and we can debug. 
> >>>>>>>>> > >>>>>>>>> - Mike > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> ----- Original Message ----- > >>>>>>>>>> From: "Sheri Mickelson" > >>>>>>>>>> To: "Michael Wilde" > >>>>>>>>>> Sent: Thursday, October 6, 2011 10:32:38 AM > >>>>>>>>>> Subject: Help on fusion > >>>>>>>>>> Hi Mike, > >>>>>>>>>> > >>>>>>>>>> The AMWG people at NCAR want to incorporate the swift > >>>>>>>>>> version > >>>>>>>>>> to > >>>>>>>>>> their > >>>>>>>>>> main branch. Rob's at NCAR right now and wants to have this > >>>>>>>>>> done > >>>>>>>>>> as > >>>>>>>>>> soon as possible. I've been working on incorporating the > >>>>>>>>>> changes > >>>>>>>>>> that > >>>>>>>>>> were made in the last release and believe that it's in > >>>>>>>>>> descent > >>>>>>>>>> shape. > >>>>>>>>>> I want to test it on fusion, though, just to make sure I'm > >>>>>>>>>> handling > >>>>>>>>>> the env variables correctly. I'm running into an error when > >>>>>>>>>> I > >>>>>>>>>> run. > >>>>>>>>>> I'm getting "Failed to transfer wrapper log for job > >>>>>>>>>> for > >>>>>>>>>> all > >>>>>>>>>> of > >>>>>>>>>> the app calls. What usually causes this? I'm stuck on where > >>>>>>>>>> to > >>>>>>>>>> look. > >>>>>>>>>> > >>>>>>>>>> Thanks, Sheri > >>>>>>>>> > >>>>>>>>> -- > >>>>>>>>> Michael Wilde > >>>>>>>>> Computation Institute, University of Chicago > >>>>>>>>> Mathematics and Computer Science Division > >>>>>>>>> Argonne National Laboratory > >>>>>>>>> > >>>>>>> > >>>>>>> -- > >>>>>>> Michael Wilde > >>>>>>> Computation Institute, University of Chicago > >>>>>>> Mathematics and Computer Science Division > >>>>>>> Argonne National Laboratory > >> > > > > -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory -------------- next part -------------- A non-text attachment was scrubbed... 
Name: amwg_stats-20111012-1025-qaxyxad6.log
Type: application/octet-stream
Size: 228154 bytes
Desc: not available
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: jstack.out
Type: application/octet-stream
Size: 65175 bytes
Desc: not available
URL:

From davidk at ci.uchicago.edu Wed Oct 26 11:14:26 2011
From: davidk at ci.uchicago.edu (David Kelly)
Date: Wed, 26 Oct 2011 11:14:26 -0500 (CDT)
Subject: [Swift-devel] Swift 0.93 RC3 hangs after all jobs seem to be complete
In-Reply-To: <847467507.129918.1319569804728.JavaMail.root@zimbra.anl.gov>
Message-ID: <1456767810.162441.1319645666935.JavaMail.root@zimbra-mb2.anl.gov>

I think I've found a way to reproduce this. From the test suite, if you run language-behaviour/mappers/075-array-mapper.swift a few times, you'll run into a deadlock which looks very similar to the one Sheri is seeing. Here is the jstack:

http://www.ci.uchicago.edu/~davidk/logs/jstack20111025110620.log

David

----- Original Message -----
> From: "Michael Wilde"
> To: "Mihael Hategan" , "David Kelly"
> Cc: "Swift Devel" , "Sheri Mickelson"
> Sent: Tuesday, October 25, 2011 2:10:04 PM
> Subject: Re: Swift 0.93 RC3 hangs after all jobs seem to be complete
> Mihael, David,
>
> Can you both report on what you believe the status of this bug is?
>
> I think the subject line here is a bit misleading, in that it seems
> that a similar thing (i.e., the workflow deadlocks) was happening both
> at the start and at the end of various scripts, and possibly at
> intermediate points.
>
> I *think* that Sheri was seeing hangs at the start and in the middle;
> David was seeing hangs at the end.
>
> Talking to David just now, he reported diagnosing his hang case down to
> a situation where the coaster scheduler emits a "null" (ill-formed)
> job to PBS at the tail end of a workflow. He inserted a workaround to
> ignore (not submit) such "null" jobs. I'm not sure if that was
> committed, or just tested. David, can you post the details?
> > Mihael, did you look at the jstack that Sheri attached to the posting > below? > > Do you have any theories or fixes for this issue or issues? Unless we > believe its resolved, David, please file in bugzilla and attach > relevant postings from SHeri, David, and others on this bug. > > Thanks, > > - Mike > > > ----- Original Message ----- > > From: "Sheri Mickelson" > > To: "Mihael Hategan" > > Cc: "Michael Wilde" , "David Kelly" > > > > Sent: Wednesday, October 12, 2011 10:34:43 AM > > Subject: Re: Swift 0.93 RC3 hangs after all jobs seem to be complete > > I just tried running again on fusion with 0.93RC3 and it hung right > > away. > > It started with "No events in 10s." and then it looks like it hung. > > This was ran using coasters and I manually killed it after about 5 > > minutes. > > I attached both the log file and the jstack info. > > > > Thanks, Sheri > > > > > > > > > > > > On Oct 7, 2011, at 2:47 PM, Mihael Hategan wrote: > > > > > Yeah, so the hang checker doesn't show anything. Which means it's > > > not a > > > swift flow issue. > > > > > > I would do what Mike says with jstack as soon after the hang > > > checker > > > kicks in as possible. > > > > > > Mihael > > > > > > On Fri, 2011-10-07 at 12:12 -0500, Michael Wilde wrote: > > >> Was: Re: Swift 0.93RC2 is bad - Re: Help on fusion > > >> Changed subject so you can see what this is regarding, Mihael. > > >> > > >> --- > > >> > > >> Sheri, could you run this again? (Or have you already, and if so, > > >> did it run to completion?) > > >> > > >> What I saw in the log yesterday was that all jobs that were > > >> submitted to coasters ran successfully, including all of their > > >> data > > >> transfers. > > >> > > >> But I also see that the Swift "hang checker" went off, which > > >> indicates that some Java activity was indeed hung. 
> > >> > > >> When this happens again, can you run the command "jstack -l PID" > > >> where PID is the process of the Swift Java command (which you can > > >> best locate by using "ps -u $USER -H" and locate the java process > > >> below the swift command). Then send us the jstack output in > > >> addition to the associated Swift log. > > >> > > >> Mihael, in the meantime, can you take a look at the log to see if > > >> you can spot any incomplete Swift activities that may be hanging > > >> the run? > > >> > > >> Thanks, > > >> > > >> - Mike > > >> > > >> > > >> ----- Original Message ----- > > >>> From: "Sheri Mickelson" > > >>> To: "David Kelly" > > >>> Cc: "Michael Wilde" > > >>> Sent: Thursday, October 6, 2011 3:23:57 PM > > >>> Subject: Re: Swift 0.93RC2 is bad - Re: Help on fusion > > >>> Here's the log file. > > >>> > > >>> > > >>> > > >>> On Oct 6, 2011, at 3:19 PM, David Kelly wrote: > > >>> > > >>>> Hi Sheri, > > >>>> > > >>>> Could you please send the log file so we can take a closer look > > >>>> and > > >>>> see what's going on there? > > >>>> > > >>>> Thanks, > > >>>> David > > >>>> > > >>>> ----- Original Message ----- > > >>>>> From: "Sheri Mickelson" > > >>>>> To: "David Kelly" > > >>>>> Cc: "Michael Wilde" > > >>>>> Sent: Thursday, October 6, 2011 3:07:44 PM > > >>>>> Subject: Re: Swift 0.93RC2 is bad - Re: Help on fusion > > >>>>> I just tried this version and had a little bit more luck. It > > >>>>> looked > > >>>>> like everything was running fine, but now it looks like it's > > >>>>> hung > > >>>>> near > > >>>>> the end. I keep getting the message "Finished > > >>>>> successfully:66". > > >>>>> The > > >>>>> message before that was "Checking status:1 Finished > > >>>>> successfully:65". > > >>>>> > > >>>>> Thanks, Sheri > > >>>>> > > >>>>> On Oct 6, 2011, at 2:14 PM, David Kelly wrote: > > >>>>> > > >>>>>> > > >>>>>> It's been a while since RC2 was created. 
There have been > > >>>>>> quite > > >>>>>> a > > >>>>>> lot > > >>>>>> of fixes since then, so I just created a new 0.93 RC3. The > > >>>>>> direct > > >>>>>> download can be found at: > > >>>>>> > > >>>>>> http://www.ci.uchicago.edu/swift/packages/swift-0.93RC3.tar.gz > > >>>>>> > > >>>>>> Hope this helps. > > >>>>>> > > >>>>>> Thanks, > > >>>>>> David > > >>>>>> > > >>>>>> ----- Original Message ----- > > >>>>>>> From: "Michael Wilde" > > >>>>>>> To: "Sheri Mickelson" > > >>>>>>> Cc: "David Kelly" > > >>>>>>> Sent: Thursday, October 6, 2011 12:17:56 PM > > >>>>>>> Subject: Swift 0.93RC2 is bad - Re: Help on fusion > > >>>>>>> Sheri, > > >>>>>>> > > >>>>>>> Your AMWG script is failing because the swift-0.93RC2 > > >>>>>>> release > > >>>>>>> is > > >>>>>>> bad. > > >>>>>>> > > >>>>>>> The error it's showing in the log is this: "2011-10-06 > > >>>>>>> 11:46:24,635-0500 DEBUG vdl:execute2 APPLICATION_EXCEPTION > > >>>>>>> jobid=ncatted-se54rxgk - Application exception: null > > >>>>>>> Caused by: > > >>>>>>> org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > > >>>>>>> lowOverallocation must be < 1.0 (currently 100.0)" > > >>>>>>> > > >>>>>>> ...which was fixed in SVN for 0.93. > > >>>>>>> > > >>>>>>> Did you load this from a tarball or from SVN? > > >>>>>>> > > >>>>>>> David, do we have a more recent 0.93 release candidate? > > >>>>>>> > > >>>>>>> If not, then can you build an 0.93 from SVN? If not, we can > > >>>>>>> do > > >>>>>>> that > > >>>>>>> for you. I'll start a build in the meantime just in case. > > >>>>>>> > > >>>>>>> Sorry about this error, Sheri.
> > >>>>>>> > > >>>>>>> - Mike > > >>>>>>> > > >>>>>>> > > >>>>>>> > > >>>>>>> ----- Original Message ----- > > >>>>>>>> From: "Sheri Mickelson" > > >>>>>>>> To: "Michael Wilde" > > >>>>>>>> Sent: Thursday, October 6, 2011 11:52:58 AM > > >>>>>>>> Subject: Re: Help on fusion > > >>>>>>>> I have everything in > > >>>>>>>> /fusion/gpfs/home/mickelso/amwg-swift/svnRepo/swift > > >>>>>>>> > > >>>>>>>> I believe the pathnames are correct. > > >>>>>>>> > > >>>>>>>> I have not tried running on localhost. > > >>>>>>>> > > >>>>>>>> I'm using swift version swift-0.93RC2. > > >>>>>>>> > > >>>>>>>> I'm not at Argonne today, but will be in tomorrow. > > >>>>>>>> > > >>>>>>>> -Sheri > > >>>>>>>> > > >>>>>>>> On Oct 6, 2011, at 11:39 AM, Michael Wilde wrote: > > >>>>>>>> > > >>>>>>>>> Hi Sheri, > > >>>>>>>>> > > >>>>>>>>> can you point me to the log, run directory, and work dir > > >>>>>>>>> of > > >>>>>>>>> this > > >>>>>>>>> run? > > >>>>>>>>> > > >>>>>>>>> I think we'll need to look into the log, and the .d > > >>>>>>>>> directories, > > >>>>>>>>> and possibly the work dir to locate the stdout of the > > >>>>>>>>> failing > > >>>>>>>>> apps. > > >>>>>>>>> > > >>>>>>>>> - are the pathnames correct? > > >>>>>>>>> > > >>>>>>>>> - does the run work on localhost? (ie, are the PBS jobs > > >>>>>>>>> running > > >>>>>>>>> or > > >>>>>>>>> failing)? > > >>>>>>>>> > > >>>>>>>>> - which Swift rev are you using? > > >>>>>>>>> > > >>>>>>>>> Are you at Argonne? I can stop by and we can debug. > > >>>>>>>>> > > >>>>>>>>> - Mike > > >>>>>>>>> > > >>>>>>>>> > > >>>>>>>>> ----- Original Message ----- > > >>>>>>>>>> From: "Sheri Mickelson" > > >>>>>>>>>> To: "Michael Wilde" > > >>>>>>>>>> Sent: Thursday, October 6, 2011 10:32:38 AM > > >>>>>>>>>> Subject: Help on fusion > > >>>>>>>>>> Hi Mike, > > >>>>>>>>>> > > >>>>>>>>>> The AMWG people at NCAR want to incorporate the swift > > >>>>>>>>>> version > > >>>>>>>>>> to > > >>>>>>>>>> their > > >>>>>>>>>> main branch.
Rob's at NCAR right now and wants to have > > >>>>>>>>>> this > > >>>>>>>>>> done > > >>>>>>>>>> as > > >>>>>>>>>> soon as possible. I've been working on incorporating the > > >>>>>>>>>> changes > > >>>>>>>>>> that > > >>>>>>>>>> were made in the last release and believe that it's in > > >>>>>>>>>> decent > > >>>>>>>>>> shape. > > >>>>>>>>>> I want to test it on fusion, though, just to make sure > > >>>>>>>>>> I'm > > >>>>>>>>>> handling > > >>>>>>>>>> the env variables correctly. I'm running into an error > > >>>>>>>>>> when > > >>>>>>>>>> I > > >>>>>>>>>> run. > > >>>>>>>>>> I'm getting "Failed to transfer wrapper log for job > > >>>>>>>>>> for > > >>>>>>>>>> all > > >>>>>>>>>> of > > >>>>>>>>>> the app calls. What usually causes this? I'm stuck on > > >>>>>>>>>> where > > >>>>>>>>>> to > > >>>>>>>>>> look. > > >>>>>>>>>> > > >>>>>>>>>> Thanks, Sheri > > >>>>>>>>> > > >>>>>>>>> -- > > >>>>>>>>> Michael Wilde > > >>>>>>>>> Computation Institute, University of Chicago > > >>>>>>>>> Mathematics and Computer Science Division > > >>>>>>>>> Argonne National Laboratory > > >>>>>>>>> > > >>>>>>> > > >>>>>>> -- > > >>>>>>> Michael Wilde > > >>>>>>> Computation Institute, University of Chicago > > >>>>>>> Mathematics and Computer Science Division > > >>>>>>> Argonne National Laboratory > > >> > > > > > > > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory From wilde at mcs.anl.gov Fri Oct 28 09:14:35 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 28 Oct 2011 09:14:35 -0500 (CDT) Subject: [Swift-devel] @java() seems to be broken Message-ID: <1481610736.143105.1319811275173.JavaMail.root@zimbra.anl.gov> I mentioned this to Jon last night, Justin, and he was going to look at it, but maybe you can as well: @java() seems to be broken. I'd like to use it for a user's app.
I get the following from 0.93:

sandbox$ cat sin.swift
(float result) sin(float x) {
  result = @java("java.lang.Math", "sin", x);
}

float x = 0.5;
float y = sin(x);

tracef("sin(%p): %p", x, y);
sandbox$ which swift
/home/wilde/swift/src/0.93/cog/modules/swift/dist/swift-svn/bin/swift
sandbox$ swift sin.swift
no sites file specified, setting to default: /home/wilde/swift/src/0.93/cog/modules/swift/dist/swift-svn/etc/sites.xml
Execution failed:
Karajan exception: kernel:variable @ java.xml, line: 4: Unsupported argument: type. Valid arguments are: [name]
sandbox$

Is a quick fix available?

Thanks,

- Mike

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From wilde at mcs.anl.gov Fri Oct 28 09:16:48 2011
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 28 Oct 2011 09:16:48 -0500 (CDT)
Subject: [Swift-devel] @java() seems to be broken
In-Reply-To: <1481610736.143105.1319811275173.JavaMail.root@zimbra.anl.gov>
Message-ID: <988696085.143122.1319811408071.JavaMail.root@zimbra.anl.gov>

Arg, sorry - I just saw your email Jon (missed it on first scan) - will test your fix now. Thanks!!!

- Mike

----- Original Message -----
> From: "Michael Wilde"
> To: "Swift Devel"
> Sent: Friday, October 28, 2011 9:14:35 AM
> Subject: [Swift-devel] @java() seems to be broken
> I mentioned this to Jon last night, Justin, and he was going to look
> at it, but maybe you can as well: @java() seems to be broken. I'd like
> to use it for a user's app.
> > I get the following from 0.93: > > sandbox$ cat sin.swift > (float result) sin(float x) { > result = @java("java.lang.Math", "sin", x); > } > > float x = 0.5; > float y = sin(x); > > tracef("sin(%p): %p", x, y); > sandbox$ which swift > /home/wilde/swift/src/0.93/cog/modules/swift/dist/swift-svn/bin/swift > sandbox$ swift sin.swift > no sites file specified, setting to default: > /home/wilde/swift/src/0.93/cog/modules/swift/dist/swift-svn/etc/sites.xml > Execution failed: > Karajan exception: kernel:variable @ java.xml, line: 4: Unsupported > argument: type. Valid arguments are: [name] > sandbox$ > > Is a quick fix available? > > Thanks, > > - Mike > > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wozniak at mcs.anl.gov Fri Oct 28 11:29:16 2011 From: wozniak at mcs.anl.gov (Justin M Wozniak) Date: Fri, 28 Oct 2011 11:29:16 -0500 (Central Daylight Time) Subject: [Swift-devel] @java() seems to be broken In-Reply-To: <988696085.143122.1319811408071.JavaMail.root@zimbra.anl.gov> References: <988696085.143122.1319811408071.JavaMail.root@zimbra.anl.gov> Message-ID: Jon's fix looks good. On Fri, 28 Oct 2011, Michael Wilde wrote: > Arg, sorry - I just saw your email Jon (missed it on first scan) - will > test your fix now. Thanks!!! > > - Mike > > ----- Original Message ----- >> From: "Michael Wilde" >> To: "Swift Devel" >> Sent: Friday, October 28, 2011 9:14:35 AM >> Subject: [Swift-devel] @java() seems to be broken >> I mentioned this to Jon last night, Justin, and he was going to look >> at it, but maybe you can as well: @java() seems to be broken. 
I'd like >> to use it for a user's app. >> >> I get the following from 0.93: >> >> sandbox$ cat sin.swift >> (float result) sin(float x) { >> result = @java("java.lang.Math", "sin", x); >> } >> >> float x = 0.5; >> float y = sin(x); >> >> tracef("sin(%p): %p", x, y); >> sandbox$ which swift >> /home/wilde/swift/src/0.93/cog/modules/swift/dist/swift-svn/bin/swift >> sandbox$ swift sin.swift >> no sites file specified, setting to default: >> /home/wilde/swift/src/0.93/cog/modules/swift/dist/swift-svn/etc/sites.xml >> Execution failed: >> Karajan exception: kernel:variable @ java.xml, line: 4: Unsupported >> argument: type. Valid arguments are: [name] >> sandbox$ >> >> Is a quick fix available? >> >> Thanks, >> >> - Mike >> >> >> -- >> Michael Wilde >> Computation Institute, University of Chicago >> Mathematics and Computer Science Division >> Argonne National Laboratory >> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > -- Justin M Wozniak From davidk at ci.uchicago.edu Fri Oct 28 12:41:02 2011 From: davidk at ci.uchicago.edu (David Kelly) Date: Fri, 28 Oct 2011 12:41:02 -0500 (CDT) Subject: [Swift-devel] [Swift-user] gram on ranger In-Reply-To: <1319162926.21652.2.camel@blabla> Message-ID: <501704482.165901.1319823662516.JavaMail.root@zimbra-mb2.anl.gov> Just to clarify - when coasters is being used, count represents the number of coaster blocks? Then to get the number of cores to request, I should use count*workersPerNode? What about in the case where coasters is not used? ----- Original Message ----- > From: "Mihael Hategan" > To: "David Kelly" > Cc: "Anjali Raja" , "Swift Devel" , "Swift User" > , "Ketan Maheshwari" > Sent: Thursday, October 20, 2011 9:08:46 PM > Subject: Re: [Swift-devel] [Swift-user] gram on ranger > On Thu, 2011-10-20 at 21:03 -0500, David Kelly wrote: > > Yep, this is using coasters > > > > Then no. 
Count is whatever the block allocation algorithm decides it > should be. > > > > > > > > > Should count=32 in the second case? Am I misunderstanding what > > > > 'count' is? Is there any way to get the exact number of > > > > applications? > > > > > > Coasters? From wozniak at mcs.anl.gov Fri Oct 28 13:02:35 2011 From: wozniak at mcs.anl.gov (Justin M Wozniak) Date: Fri, 28 Oct 2011 13:02:35 -0500 (Central Daylight Time) Subject: [Swift-devel] [Swift-user] gram on ranger In-Reply-To: <501704482.165901.1319823662516.JavaMail.root@zimbra-mb2.anl.gov> References: <501704482.165901.1319823662516.JavaMail.root@zimbra-mb2.anl.gov> Message-ID: I think count is the number of processes. PBSExecutor uses it, that may be a good place to look. In the Coasters context, I think it is the number of invocations of worker.pl . On Fri, 28 Oct 2011, David Kelly wrote: > Just to clarify - when coasters is being used, count represents the > number of coaster blocks? Then to get the number of cores to request, I > should use count*workersPerNode? > > What about in the case where coasters is not used? > > ----- Original Message ----- >> From: "Mihael Hategan" >> To: "David Kelly" >> Cc: "Anjali Raja" , "Swift Devel" , "Swift User" >> , "Ketan Maheshwari" >> Sent: Thursday, October 20, 2011 9:08:46 PM >> Subject: Re: [Swift-devel] [Swift-user] gram on ranger >> On Thu, 2011-10-20 at 21:03 -0500, David Kelly wrote: >>> Yep, this is using coasters >>> >> >> Then no. Count is whatever the block allocation algorithm decides it >> should be. >> >>>>> >>>>> Should count=32 in the second case? Am I misunderstanding what >>>>> 'count' is? Is there any way to get the exact number of >>>>> applications? >>>> >>>> Coasters? 
> _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > -- Justin M Wozniak From hategan at mcs.anl.gov Sat Oct 29 19:58:30 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sat, 29 Oct 2011 17:58:30 -0700 Subject: [Swift-devel] Swift 0.93 RC3 hangs after all jobs seem to be complete In-Reply-To: <1456767810.162441.1319645666935.JavaMail.root@zimbra-mb2.anl.gov> References: <1456767810.162441.1319645666935.JavaMail.root@zimbra-mb2.anl.gov> Message-ID: <1319936310.2688.0.camel@blabla> This deadlock is now fixed (swift r5262). On Wed, 2011-10-26 at 11:14 -0500, David Kelly wrote: > I think I've found a way to reproduce this. From the test suite, if you run language-behaviour/mappers/075-array-mapper.swift a few times, you'll run into a deadlock which looks very similar to the one Sheri is seeing. Here is the jstack: > > http://www.ci.uchicago.edu/~davidk/logs/jstack20111025110620.log > > David > > ----- Original Message ----- > > From: "Michael Wilde" > > To: "Mihael Hategan" , "David Kelly" > > Cc: "Swift Devel" , "Sheri Mickelson" > > Sent: Tuesday, October 25, 2011 2:10:04 PM > > Subject: Re: Swift 0.93 RC3 hangs after all jobs seem to be complete > > Mihael, David, > > > > Can you both report on what you believe the status of this bug is? > > > > I think the subject line here is a bit misleading, in that it seems > > that a similar thing - i.e., the workflow deadlocks - was happening both > > at the start and at the end of various scripts, and possibly at > > intermediate points. > > > > I *think* that Sheri was seeing hangs at the start and in the middle; > > David was seeing hangs at the end. > > > > Talking to David just now he reported diagnosing his hang case down to > > a situation where the coaster scheduler emits a "null" (ill-formed) > > job to PBS at the tail end of a workflow.
He inserted a workaround to > > ignore (not submit) such "null" jobs. I'm not sure if that was > > committed, or just tested. David, can you post the details? > > > > Mihael, did you look at the jstack that Sheri attached to the posting > > below? > > > > Do you have any theories or fixes for this issue or issues? Unless we > > believe it's resolved, David, please file in bugzilla and attach > > relevant postings from Sheri, David, and others on this bug. > > > > Thanks, > > > > - Mike > > > > > > ----- Original Message ----- > > > From: "Sheri Mickelson" > > > To: "Mihael Hategan" > > > Cc: "Michael Wilde" , "David Kelly" > > > > > > Sent: Wednesday, October 12, 2011 10:34:43 AM > > > Subject: Re: Swift 0.93 RC3 hangs after all jobs seem to be complete > > > I just tried running again on fusion with 0.93RC3 and it hung right > > > away. > > > It started with "No events in 10s." and then it looks like it hung. > > > This was run using coasters and I manually killed it after about 5 > > > minutes. > > > I attached both the log file and the jstack info. > > > > > > Thanks, Sheri > > > > -- > > Michael Wilde > > Computation Institute, University of Chicago > > Mathematics and Computer Science Division > > Argonne National Laboratory
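
For reference, the diagnostic step requested in this thread (run "jstack -l PID" against the Java process that sits below the swift command, as soon as the hang checker fires) could be scripted roughly as follows. This is only a sketch: the function names pick_java_pid and capture_jstack and the output file name are made up for illustration, and it assumes a POSIX ps and awk plus a JDK jstack on the PATH. It takes the first java process owned by the user, which may not be the Swift one if several JVMs are running.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative and not part of Swift.

# pick_java_pid: read "PID COMMAND" lines on stdin and print the PID of
# the first line whose command is java (bare name or full path).
pick_java_pid() {
    awk '$2 == "java" || $2 ~ /\/java$/ { print $1; exit }'
}

# capture_jstack: find the user's java process and dump its threads,
# including lock information (-l), into a timestamped file.
capture_jstack() {
    pid=$(ps -u "$USER" -o pid=,comm= | pick_java_pid)
    if [ -z "$pid" ]; then
        echo "no java process found for $USER" >&2
        return 1
    fi
    out="jstack-$(date +%Y%m%d%H%M%S).log"
    jstack -l "$pid" > "$out" && echo "$out"
}

# Usage, once the hang checker has reported a hang:
#   capture_jstack
```

The resulting file would then be sent along with the associated Swift log, as asked for above.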