From wilde at mcs.anl.gov Thu Jul 1 08:23:37 2010 From: wilde at mcs.anl.gov (wilde at mcs.anl.gov) Date: Thu, 1 Jul 2010 08:23:37 -0500 (CDT) Subject: [Swift-devel] CASP jobs hang - seems to be in coaster scheduling In-Reply-To: <26625693.1289731277990343214.JavaMail.root@zimbra> Message-ID: <5935652.1289891277990617500.JavaMail.root@zimbra> [Mihael: help urgently needed on this if possible] Aashish, I see the runs you submitted around 3-4AM this morning in /home/aashish/CASP/{T0608,T0610,T0611} Each of them show a similar problem to what we saw earlier last night with T0608: the script submits 300 jobs to the pads coaster pool, and none of them run. In some of these scripts, the first round of 300 (boostThreader) work fine, but the later round of 300 loops jobs get "stuck". Mihael, can you set aside some time as soon as possible this morning to look at these? These need to be submitted to CASP by 2PM CDT today, so attention to the problem is rather urgent. The scripts are all coming from /home/aashish/RapLoops The swift release is from /home/wilde/swift/src/stable/... In the above directories, you will find all source for scripts, mappers, tc, and sites, as well as all logs. In some of the Tnnnn directories (each one is a protein target for the CASP competition) you will see multiple runs, each with an outN file log of stdout/err and then a run directory for that run with all relevant files. This *looks* like the familiar problem of trying to run an app whose maxwalltime wont fit into any available coaster slot, but the times in tc and sites.xml dont seem to explain that behavior. This script has been running well since May; "slight" changes were made to work around the unavailability of GPFS on PADS this week, but we still cant figure out why these scripts are hanging in this manner. - Mike From wilde at mcs.anl.gov Thu Jul 1 10:23:40 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 1 Jul 2010 10:23:40 -0500 (CDT) Subject: [Swift-devel] CASP jobs hang - seems to be in coaster scheduling In-Reply-To: <5935652.1289891277990617500.JavaMail.root@zimbra> Message-ID: <10232493.1304301277997820587.JavaMail.root@zimbra> Sorry, false alarm - please ignore the request below. The problem was indeed simply requesting a larger maxwalltime than any available coaster maxtime slot. Can this be detected and a clear error message issued, as well as ending the run? - Mike ----- wilde at mcs.anl.gov wrote: > [Mihael: help urgently needed on this if possible] > > Aashish, I see the runs you submitted around 3-4AM this morning in > /home/aashish/CASP/{T0608,T0610,T0611} > > Each of them show a similar problem to what we saw earlier last night > with T0608: the script submits 300 jobs to the pads coaster pool, and > none of them run. > > In some of these scripts, the first round of 300 (boostThreader) work > fine, but the later round of 300 loops jobs get "stuck". > > Mihael, can you set aside some time as soon as possible this morning > to look at these? These need to be submitted to CASP by 2PM CDT today, > so attention to the problem is rather urgent. > > The scripts are all coming from /home/aashish/RapLoops > The swift release is from /home/wilde/swift/src/stable/... > > In the above directories, you will find all source for scripts, > mappers, tc, and sites, as well as all logs. 
In some of the Tnnnn > directories (each one is a protein target for the CASP competition) > you will see multiple runs, each with an outN file log of stdout/err > and then a run directory for that run with all relevant files. > > This *looks* like the familiar problem of trying to run an app whose > maxwalltime wont fit into any available coaster slot, but the times in > tc and sites.xml dont seem to explain that behavior. > > This script has been running well since May; "slight" changes were > made to work around the unavailability of GPFS on PADS this week, but > we still cant figure out why these scripts are hanging in this > manner. > > - Mike > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From hategan at mcs.anl.gov Thu Jul 1 10:49:15 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 01 Jul 2010 10:49:15 -0500 Subject: [Swift-devel] CASP jobs hang - seems to be in coaster scheduling In-Reply-To: <10232493.1304301277997820587.JavaMail.root@zimbra> References: <10232493.1304301277997820587.JavaMail.root@zimbra> Message-ID: <1277999355.16558.0.camel@blabla2.none> On Thu, 2010-07-01 at 10:23 -0500, Michael Wilde wrote: > Sorry, false alarm - please ignore the request below. > > The problem was indeed simply requesting a larger maxwalltime than any available coaster maxtime slot. > > Can this be detected and a clear error message issued, as well as ending the run? I thought it was. I can double-check. > > - Mike > > ----- wilde at mcs.anl.gov wrote: > > > [Mihael: help urgently needed on this if possible] > > > > Aashish, I see the runs you submitted around 3-4AM this morning in > > /home/aashish/CASP/{T0608,T0610,T0611} > > > > Each of them show a similar problem to what we saw earlier last night > > with T0608: the script submits 300 jobs to the pads coaster pool, and > > none of them run. > > > > In some of these scripts, the first round of 300 (boostThreader) work > > fine, but the later round of 300 loops jobs get "stuck". > > > > Mihael, can you set aside some time as soon as possible this morning > > to look at these? These need to be submitted to CASP by 2PM CDT today, > > so attention to the problem is rather urgent. > > > > The scripts are all coming from /home/aashish/RapLoops > > The swift release is from /home/wilde/swift/src/stable/... > > > > In the above directories, you will find all source for scripts, > > mappers, tc, and sites, as well as all logs. In some of the Tnnnn > > directories (each one is a protein target for the CASP competition) > > you will see multiple runs, each with an outN file log of stdout/err > > and then a run directory for that run with all relevant files. > > > > This *looks* like the familiar problem of trying to run an app whose > > maxwalltime wont fit into any available coaster slot, but the times in > > tc and sites.xml dont seem to explain that behavior. > > > > This script has been running well since May; "slight" changes were > > made to work around the unavailability of GPFS on PADS this week, but > > we still cant figure out why these scripts are hanging in this > > manner. 
> > > > - Mike > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From wilde at mcs.anl.gov Thu Jul 1 11:29:14 2010 From: wilde at mcs.anl.gov (wilde at mcs.anl.gov) Date: Thu, 1 Jul 2010 11:29:14 -0500 (CDT) Subject: [Swift-devel] manual coasters In-Reply-To: <21262109.1309551278001249873.JavaMail.root@zimbra> Message-ID: <18576240.1310251278001754685.JavaMail.root@zimbra> Very cool - thanks, Mihael! For the sites entry, do we still use the current format to indicate where the server should start? Eg: key="workerManager">passive fast 10000 .07 /home/wilde/swiftwork Is the full range of provider options available to start the server in passive mode? Will throttling settings be honored? Can we start multiple coaster servers in different places? - Mike ----- "Mihael Hategan" wrote: > Manual coasters are in trunk. I did some limited testing on > localhost. > > The basic idea is that you say key="workerManager">passive in sites.xml. Other than that > you > may want to set workersPerNode, but the other options are useless. > > Then, when swift starts the coaster service, it will print the URL of > that on stderr. > > You carefully dig for worker.pl and then launch it in whatever way > you > like: > > worker.pl > > The blockid can be whatever you want, but it can be used to group > workers in the traditional blocks. The logdir is where you want the > worker logs to go. They are all mandatory. > > When workers connect to the service, the service should start > shipping > jobs to them. When the service is shut down, it will also try to shut > down the workers (they are useless anyway at that point), but it > cannot > control the LRM jobs, so it may fail to do so (or rather said, it is > more likely to fail to do so). > > Mihael > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wilde at mcs.anl.gov Thu Jul 1 11:34:20 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 1 Jul 2010 11:34:20 -0500 (CDT) Subject: [Swift-devel] Coaster problem on BG/P - worker processes dying Message-ID: <30532374.1310611278002060506.JavaMail.root@zimbra> Justin, can you send a brief update to the list on the coaster problem (workers exiting after a few jobs) that is blocking you on the BG/P, and how you are re-working worker logging to debug it? Lets use this thread to discuss and resolve the problem. - Mike From hategan at mcs.anl.gov Thu Jul 1 11:39:59 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 01 Jul 2010 11:39:59 -0500 Subject: [Swift-devel] manual coasters In-Reply-To: <18576240.1310251278001754685.JavaMail.root@zimbra> References: <18576240.1310251278001754685.JavaMail.root@zimbra> Message-ID: <1278002399.17123.0.camel@blabla2.none> On Thu, 2010-07-01 at 11:29 -0500, wilde at mcs.anl.gov wrote: > Very cool - thanks, Mihael! > > For the sites entry, do we still use the current format to indicate where the server should start? Yes. That should work. But "queue=fast" there doesn't do anything. > Eg: > > > > key="workerManager">passive > fast > 10000 > .07 > > /home/wilde/swiftwork > > > Is the full range of provider options available to start the server in passive mode? 
> > Will throttling settings be honored? > > Can we start multiple coaster servers in different places? > > > - Mike > > > ----- "Mihael Hategan" wrote: > > > Manual coasters are in trunk. I did some limited testing on > > localhost. > > > > The basic idea is that you say > key="workerManager">passive in sites.xml. Other than that > > you > > may want to set workersPerNode, but the other options are useless. > > > > Then, when swift starts the coaster service, it will print the URL of > > that on stderr. > > > > You carefully dig for worker.pl and then launch it in whatever way > > you > > like: > > > > worker.pl > > > > The blockid can be whatever you want, but it can be used to group > > workers in the traditional blocks. The logdir is where you want the > > worker logs to go. They are all mandatory. > > > > When workers connect to the service, the service should start > > shipping > > jobs to them. When the service is shut down, it will also try to shut > > down the workers (they are useless anyway at that point), but it > > cannot > > control the LRM jobs, so it may fail to do so (or rather said, it is > > more likely to fail to do so). > > > > Mihael > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From wozniak at mcs.anl.gov Thu Jul 1 11:56:13 2010 From: wozniak at mcs.anl.gov (Justin M Wozniak) Date: Thu, 1 Jul 2010 11:56:13 -0500 (Central Daylight Time) Subject: [Swift-devel] Re: Coaster problem on BG/P - worker processes dying In-Reply-To: <30532374.1310611278002060506.JavaMail.root@zimbra> References: <30532374.1310611278002060506.JavaMail.root@zimbra> Message-ID: On Thu, 1 Jul 2010, Michael Wilde wrote: > Justin, can you send a brief update to the list on the coaster problem > (workers exiting after a few jobs) that is blocking you on the BG/P, and > how you are re-working worker logging to debug it? A paste from a previous email is below (both BG/P systems are down due to cooling issues today). So far, the issue only appears after several thousand jobs run on at least 512 nodes. I'm pretty close to generating the logging I need to track this down. I have broken down the worker logs into one log per worker script... Paste: Running on the Intrepid compute nodes. In the last few runs I've only seen it in the 512 node case (I think this worked at least once), not 256 nodes, but that could be just because this is rare. 
2010-06-18 16:06:29,117-0500 WARN Command Command(2, SUBMITJOB): handling reply timeout; sendReqTime=100618-160429.10 8, sendTime=100618-160429.108, now=100618-160629.117 2010-06-18 16:06:29,117-0500 INFO Command Command(2, SUBMITJOB): re-sending 2010-06-18 16:06:29,117-0500 WARN Command Command(2, SUBMITJOB)fault was: Reply timeout org.globus.cog.karajan.workflow.service.ReplyTimeoutException at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:280) at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:285) at java.util.TimerThread.mainLoop(Timer.java:537) at java.util.TimerThread.run(Timer.java:487) 2010-06-18 16:06:29,118-0500 INFO Command Sending Command(2, SUBMITJOB) on MetaChannel: 855782146 -> SC-0618-370320-0 00000-001756 2010-06-18 16:06:29,119-0500 INFO AbstractStreamKarajanChannel Channel IOException java.net.SocketException: Broken pipe at java.net.SocketOutputStream.socketWrite0(Native Method) at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:105) at java.net.SocketOutputStream.write(SocketOutputStream.java:137) at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.send(AbstractStrea mKar ajanChannel.java:292) at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.run(AbstractStream Kara janChannel.java:244) -- Justin M Wozniak From hategan at mcs.anl.gov Thu Jul 1 12:08:08 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 01 Jul 2010 12:08:08 -0500 Subject: [Swift-devel] Re: Coaster problem on BG/P - worker processes dying In-Reply-To: References: <30532374.1310611278002060506.JavaMail.root@zimbra> Message-ID: <1278004088.17643.1.camel@blabla2.none> That typically is an indication that something went wrong with the worker or the worker connection. It's also possible that the message queues are loaded enough to not be able to process everything in time. The coaster logs have some logging info that displays that information. On Thu, 2010-07-01 at 11:56 -0500, Justin M Wozniak wrote: > On Thu, 1 Jul 2010, Michael Wilde wrote: > > > Justin, can you send a brief update to the list on the coaster problem > > (workers exiting after a few jobs) that is blocking you on the BG/P, and > > how you are re-working worker logging to debug it? > > A paste from a previous email is below (both BG/P systems are down due to > cooling issues today). > > So far, the issue only appears after several thousand jobs run on at least > 512 nodes. > > I'm pretty close to generating the logging I need to track this down. I > have broken down the worker logs into one log per worker script... > > Paste: > > Running on the Intrepid compute nodes. In the last few runs I've only > seen it in the 512 node case (I think this worked at least once), not 256 > nodes, but that could be just because this is rare. 
> > 2010-06-18 16:06:29,117-0500 WARN Command Command(2, SUBMITJOB): handling > reply timeout; > sendReqTime=100618-160429.10 > 8, sendTime=100618-160429.108, now=100618-160629.117 > 2010-06-18 16:06:29,117-0500 INFO Command Command(2, SUBMITJOB): > re-sending > 2010-06-18 16:06:29,117-0500 WARN Command Command(2, SUBMITJOB)fault was: > Reply timeout > org.globus.cog.karajan.workflow.service.ReplyTimeoutException > at > org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:280) > at > org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:285) > at java.util.TimerThread.mainLoop(Timer.java:537) > at java.util.TimerThread.run(Timer.java:487) > 2010-06-18 16:06:29,118-0500 INFO Command Sending Command(2, SUBMITJOB) > on MetaChannel: 855782146 -> > SC-0618-370320-0 > 00000-001756 > 2010-06-18 16:06:29,119-0500 INFO AbstractStreamKarajanChannel Channel > IOException > java.net.SocketException: Broken pipe > at java.net.SocketOutputStream.socketWrite0(Native Method) > at > java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:105) > at java.net.SocketOutputStream.write(SocketOutputStream.java:137) > at > org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.send(AbstractStrea > mKar > ajanChannel.java:292) > at > org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.run(AbstractStream > Kara > janChannel.java:244) > > From zhaozhang at uchicago.edu Thu Jul 1 12:13:02 2010 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Thu, 01 Jul 2010 12:13:02 -0500 Subject: [Swift-devel] Re: Coaster problem on BG/P - worker processes dying In-Reply-To: References: <30532374.1310611278002060506.JavaMail.root@zimbra> Message-ID: <4C2CCC9E.3090904@uchicago.edu> Hi, Justin Is there any chance that each worker is writing log files to GPFS or writing to RAM, then copying to GPFS? Even in the latter case, we used dd instead of cp on zeptoos, cuz with dd we could set the block size while cp is using a lined buffer to dump data to GPFS, which is quite slow. Another suspect would be that we are overwhelming the IO nodes. As far as I remember, coaster is running as a service on each compute node with a TCP connection to the Login Node. The communication between Login Node and CN node is handled by a IP forwarding component in zeptoos. In the tests I did before, the Falkon service is not stable with 1024 nodes connecting to the service each with a TCP connection.Can we login the IO nodes while we see those errors? Anyway, I can't tell anything exactly right now. best zhao Justin M Wozniak wrote: > On Thu, 1 Jul 2010, Michael Wilde wrote: > >> Justin, can you send a brief update to the list on the coaster >> problem (workers exiting after a few jobs) that is blocking you on >> the BG/P, and how you are re-working worker logging to debug it? > > A paste from a previous email is below (both BG/P systems are down due > to cooling issues today). > > So far, the issue only appears after several thousand jobs run on at > least 512 nodes. > > I'm pretty close to generating the logging I need to track this down. > I have broken down the worker logs into one log per worker script... > > Paste: > > Running on the Intrepid compute nodes. In the last few runs I've only > seen it in the 512 node case (I think this worked at least once), not > 256 nodes, but that could be just because this is rare. 
> > 2010-06-18 16:06:29,117-0500 WARN Command Command(2, SUBMITJOB): > handling reply timeout; > sendReqTime=100618-160429.10 > 8, sendTime=100618-160429.108, now=100618-160629.117 > 2010-06-18 16:06:29,117-0500 INFO Command Command(2, SUBMITJOB): > re-sending > 2010-06-18 16:06:29,117-0500 WARN Command Command(2, SUBMITJOB)fault > was: Reply timeout > org.globus.cog.karajan.workflow.service.ReplyTimeoutException > at > org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:280) > > at > org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:285) > > at java.util.TimerThread.mainLoop(Timer.java:537) > at java.util.TimerThread.run(Timer.java:487) > 2010-06-18 16:06:29,118-0500 INFO Command Sending Command(2, > SUBMITJOB) on MetaChannel: 855782146 -> > SC-0618-370320-0 > 00000-001756 > 2010-06-18 16:06:29,119-0500 INFO AbstractStreamKarajanChannel > Channel IOException > java.net.SocketException: Broken pipe > at java.net.SocketOutputStream.socketWrite0(Native Method) > at > java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:105) > at java.net.SocketOutputStream.write(SocketOutputStream.java:137) > at > org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.send(AbstractStrea > > mKar > ajanChannel.java:292) > at > org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.run(AbstractStream > > Kara > janChannel.java:244) > > From hategan at mcs.anl.gov Thu Jul 1 12:18:31 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 01 Jul 2010 12:18:31 -0500 Subject: [Swift-devel] Re: Coaster problem on BG/P - worker processes dying In-Reply-To: <4C2CCC9E.3090904@uchicago.edu> References: <30532374.1310611278002060506.JavaMail.root@zimbra> <4C2CCC9E.3090904@uchicago.edu> Message-ID: <1278004711.17771.3.camel@blabla2.none> On Thu, 2010-07-01 at 12:13 -0500, Zhao Zhang wrote: > Hi, Justin > > Is there any chance that each worker is writing log files to GPFS or > writing to RAM, then copying to GPFS? > Even in the latter case, we used dd instead of cp on zeptoos, cuz with > dd we could set the block size while cp > is using a lined buffer to dump data to GPFS, which is quite slow. Since some time the worker log level is set to WARN (which only produces a message at the start and end) when the number of workers is >= 16. > > Another suspect would be that we are overwhelming the IO nodes. As far > as I remember, coaster is running as a > service on each compute node with a TCP connection to the Login Node. > The communication between Login Node > and CN node is handled by a IP forwarding component in zeptoos. In the > tests I did before, the Falkon service is not > stable with 1024 nodes connecting to the service each with a TCP > connection.Can we login the IO nodes while we > see those errors? Maybe, but then I was able to run with 40k cores while the logging scheme above wasn't enabled. Since then, there was a switch to only one TCP connection per node (regardless of cores) and the very much reduced logging. So I suspect this isn't the problem unless the ZOID NAT got messed up. 
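A minimal sketch of the manual (passive) coaster setup described in the manual-coasters messages above, under stated assumptions: the workerManager=passive key/value is quoted from Mihael's mail, but the "globus" profile namespace, the service URL, the block id, and the log directory below are invented placeholders, and the worker.pl argument order is inferred from his description (the real URL is whatever swift prints on stderr when the coaster service starts).

    # 1. In sites.xml, request passive worker management (key/value from the
    #    thread; the "globus" namespace is an assumption):
    #      <profile namespace="globus" key="workerManager">passive</profile>
    #    Optionally set workersPerNode; per the same mail, the other coaster
    #    options are not used in this mode.
    #
    # 2. Run swift; when the coaster service starts it prints its URL on
    #    stderr. Note the host:port.
    #
    # 3. Start workers by hand, pointing them back at that URL. All three
    #    arguments (service URL, block id, log directory) are mandatory:
    worker.pl http://login1.example.edu:46247 block0 /tmp/worker-logs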
From wozniak at mcs.anl.gov Thu Jul 1 13:20:16 2010 From: wozniak at mcs.anl.gov (Justin M Wozniak) Date: Thu, 1 Jul 2010 13:20:16 -0500 (Central Daylight Time) Subject: [Swift-devel] Re: Coaster problem on BG/P - worker processes dying In-Reply-To: <4C2CCC9E.3090904@uchicago.edu> References: <30532374.1310611278002060506.JavaMail.root@zimbra> <4C2CCC9E.3090904@uchicago.edu> Message-ID: On Thu, 1 Jul 2010, Zhao Zhang wrote: > Is there any chance that each worker is writing log files to GPFS or > writing to RAM, then copying to GPFS? Even in the latter case, we used > dd instead of cp on zeptoos, cuz with dd we could set the block size > while cp is using a lined buffer to dump data to GPFS, which is quite > slow. I have modified the perl script to write directly to a unique file per worker script, directly to GPFS. (Speed is not an issue right now.) > Another suspect would be that we are overwhelming the IO nodes. As far > as I remember, coaster is running as a service on each compute node with > a TCP connection to the Login Node. The communication between Login Node > and CN node is handled by a IP forwarding component in zeptoos. In the > tests I did before, the Falkon service is not stable with 1024 nodes > connecting to the service each with a TCP connection.Can we login the IO > nodes while we see those errors? That seems like a possibility. If I can whittle the problem down to that level we will have something to report to the zepto team. Thanks > Justin M Wozniak wrote: >> On Thu, 1 Jul 2010, Michael Wilde wrote: >> >>> Justin, can you send a brief update to the list on the coaster problem >>> (workers exiting after a few jobs) that is blocking you on the BG/P, and >>> how you are re-working worker logging to debug it? >> >> A paste from a previous email is below (both BG/P systems are down due to >> cooling issues today). >> >> So far, the issue only appears after several thousand jobs run on at least >> 512 nodes. >> >> I'm pretty close to generating the logging I need to track this down. I >> have broken down the worker logs into one log per worker script... >> >> Paste: >> >> Running on the Intrepid compute nodes. In the last few runs I've only seen >> it in the 512 node case (I think this worked at least once), not 256 nodes, >> but that could be just because this is rare. 
>> >> 2010-06-18 16:06:29,117-0500 WARN Command Command(2, SUBMITJOB): handling >> reply timeout; >> sendReqTime=100618-160429.10 >> 8, sendTime=100618-160429.108, now=100618-160629.117 >> 2010-06-18 16:06:29,117-0500 INFO Command Command(2, SUBMITJOB): >> re-sending >> 2010-06-18 16:06:29,117-0500 WARN Command Command(2, SUBMITJOB)fault was: >> Reply timeout >> org.globus.cog.karajan.workflow.service.ReplyTimeoutException >> at >> org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:280) >> at >> org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:285) >> at java.util.TimerThread.mainLoop(Timer.java:537) >> at java.util.TimerThread.run(Timer.java:487) >> 2010-06-18 16:06:29,118-0500 INFO Command Sending Command(2, SUBMITJOB) on >> MetaChannel: 855782146 -> >> SC-0618-370320-0 >> 00000-001756 >> 2010-06-18 16:06:29,119-0500 INFO AbstractStreamKarajanChannel Channel >> IOException >> java.net.SocketException: Broken pipe >> at java.net.SocketOutputStream.socketWrite0(Native Method) >> at >> java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:105) >> at java.net.SocketOutputStream.write(SocketOutputStream.java:137) >> at >> org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.send(AbstractStrea >> mKar >> ajanChannel.java:292) >> at >> org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.run(AbstractStream >> Kara >> janChannel.java:244) >> >> > -- Justin M Wozniak From wozniak at mcs.anl.gov Thu Jul 1 13:22:27 2010 From: wozniak at mcs.anl.gov (Justin M Wozniak) Date: Thu, 1 Jul 2010 13:22:27 -0500 (Central Daylight Time) Subject: [Swift-devel] Re: Coaster problem on BG/P - worker processes dying (fwd) Message-ID: On Thu, 1 Jul 2010, Mihael Hategan wrote: > On Thu, 2010-07-01 at 12:13 -0500, Zhao Zhang wrote: >> Hi, Justin >> >> Is there any chance that each worker is writing log files to GPFS or >> writing to RAM, then copying to GPFS? >> Even in the latter case, we used dd instead of cp on zeptoos, cuz with >> dd we could set the block size while cp >> is using a lined buffer to dump data to GPFS, which is quite slow. > > Since some time the worker log level is set to WARN (which only produces > a message at the start and end) when the number of workers is >= 16. Right, I have made changes there. >> Another suspect would be that we are overwhelming the IO nodes. As far >> as I remember, coaster is running as a >> service on each compute node with a TCP connection to the Login Node. >> The communication between Login Node >> and CN node is handled by a IP forwarding component in zeptoos. In the >> tests I did before, the Falkon service is not >> stable with 1024 nodes connecting to the service each with a TCP >> connection.Can we login the IO nodes while we >> see those errors? > > Maybe, but then I was able to run with 40k cores while the logging > scheme above wasn't enabled. Since then, there was a switch to only one > TCP connection per node (regardless of cores) and the very much reduced > logging. So I suspect this isn't the problem unless the ZOID NAT got > messed up. 
-- Justin M Wozniak From aespinosa at cs.uchicago.edu Thu Jul 1 17:21:35 2010 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Thu, 1 Jul 2010 17:21:35 -0500 Subject: [Swift-devel] manual coasters In-Reply-To: <18576240.1310251278001754685.JavaMail.root@zimbra> References: <21262109.1309551278001249873.JavaMail.root@zimbra> <18576240.1310251278001754685.JavaMail.root@zimbra> Message-ID: So for the pool entry below, where is the serviceURL? the submit host will issue a pbs request for a service host? Thanks, -Allan 2010/7/1 : > Very cool - thanks, Mihael! > > For the sites entry, do we still use the current format to indicate where the server should start? > Eg: > > ? > ? ? > ? ? key="workerManager">passive > ? ?fast > ? ?10000 > ? ?.07 > ? ? > ? ?/home/wilde/swiftwork > ? > > Is the full range of provider options available to start the server in passive mode? > > Will throttling settings be honored? > > Can we start multiple coaster servers in different places? > > > - Mike > > > ----- "Mihael Hategan" wrote: > >> Manual coasters are in trunk. I did some limited testing on >> localhost. >> >> The basic idea is that you say > key="workerManager">passive in sites.xml. Other than that >> you >> may want to set workersPerNode, but the other options are useless. >> >> Then, when swift starts the coaster service, it will print the URL of >> that on stderr. >> >> You carefully dig for worker.pl and then launch it in whatever way >> you >> like: >> >> worker.pl >> >> The blockid can be whatever you want, but it can be used to group >> workers in the traditional blocks. The logdir is where you want the >> worker logs to go. They are all mandatory. >> >> When workers connect to the service, the service should start >> shipping >> jobs to them. When the service is shut down, it will also try to shut >> down the workers (they are useless anyway at that point), but it >> cannot >> control the LRM jobs, so it may fail to do so (or rather said, it is >> more likely to fail to do so). >> >> Mihael >> From wilde at mcs.anl.gov Thu Jul 1 17:50:13 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 1 Jul 2010 17:50:13 -0500 (CDT) Subject: [Swift-devel] manual coasters In-Reply-To: Message-ID: <9982948.1334571278024613578.JavaMail.root@zimbra> My understanding is that with the pool entry below, Swift will start the coaster service by submitting a PBS job. Then the swift command will print the service service URL (host:port ?) on stderr, and you manually start workers, passing that host:port on the command line to connect back to the coaster service. >> Then, when swift starts the coaster service, it will print the URL of >> that on stderr. > >> worker.pl - Mike ----- "Allan Espinosa" wrote: > So for the pool entry below, where is the serviceURL? the submit > host > will issue a pbs request for a service host? > > > Thanks, > -Allan > > 2010/7/1 : > > Very cool - thanks, Mihael! > > > > For the sites entry, do we still use the current format to indicate > where the server should start? > > Eg: > > > > ? > > ? ? > > ? ? > key="workerManager">passive > > ? ?fast > > ? ?10000 > > ? ?.07 > > ? ? > > ? ?/home/wilde/swiftwork > > ? > > > > Is the full range of provider options available to start the server > in passive mode? > > > > Will throttling settings be honored? > > > > Can we start multiple coaster servers in different places? > > > > > > - Mike > > > > > > ----- "Mihael Hategan" wrote: > > > >> Manual coasters are in trunk. I did some limited testing on > >> localhost. 
> >> > >> The basic idea is that you say >> key="workerManager">passive in sites.xml. Other than > that > >> you > >> may want to set workersPerNode, but the other options are useless. > >> > >> Then, when swift starts the coaster service, it will print the URL > of > >> that on stderr. > >> > >> You carefully dig for worker.pl and then launch it in whatever way > >> you > >> like: > >> > >> worker.pl > >> > >> The blockid can be whatever you want, but it can be used to group > >> workers in the traditional blocks. The logdir is where you want > the > >> worker logs to go. They are all mandatory. > >> > >> When workers connect to the service, the service should start > >> shipping > >> jobs to them. When the service is shut down, it will also try to > shut > >> down the workers (they are useless anyway at that point), but it > >> cannot > >> control the LRM jobs, so it may fail to do so (or rather said, it > is > >> more likely to fail to do so). > >> > >> Mihael > >> -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From hategan at mcs.anl.gov Thu Jul 1 19:10:02 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 01 Jul 2010 19:10:02 -0500 Subject: [Swift-devel] manual coasters In-Reply-To: References: <21262109.1309551278001249873.JavaMail.root@zimbra> <18576240.1310251278001754685.JavaMail.root@zimbra> Message-ID: <1278029402.21455.0.camel@blabla2.none> On Thu, 2010-07-01 at 17:21 -0500, Allan Espinosa wrote: > So for the pool entry below, where is the serviceURL? the submit host > will issue a pbs request for a service host? No. The serviceURL is printed by swift on stderr when the service starts. It's mostly the port you care about if you know where it's running. From aespinosa at cs.uchicago.edu Thu Jul 1 19:39:43 2010 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Thu, 1 Jul 2010 19:39:43 -0500 Subject: [Swift-devel] dirname directives ambigious Message-ID: I updated to the latest cog-trunk and swift-trunk today and got these: Caused by: org.globus.cog.karajan.workflow.KarajanRuntimeException: Ambiguous element: dirname. Possible choices: vdl:dirname swiftscript:dirname 2010-07-01 19:37:47,906-0500 INFO EventBus Near Karajan line: dirname @ vdl-int.k, line: 274 Karajan exception: Ambiguous element: dirname. Possible choices: vdl:dirname swiftscript:dirname Caused by: org.globus.cog.karajan.workflow.KarajanRuntimeException: Ambiguous element: dirname. Possible choices: vdl:dirname swiftscript:dirname -Allan -- Allan M. Espinosa PhD student, Computer Science University of Chicago From aespinosa at cs.uchicago.edu Thu Jul 1 23:54:50 2010 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Thu, 1 Jul 2010 23:54:50 -0500 Subject: [Swift-devel] Re: dirname directives ambigious In-Reply-To: References: Message-ID: I changed the line it refers to vdl:dirname() . There's no dirname() function in vdl-int.k that refers to the swiftscript namespace right? -Allan 2010/7/1 Allan Espinosa : > I updated to the latest cog-trunk and swift-trunk today and got these: > > Caused by: org.globus.cog.karajan.workflow.KarajanRuntimeException: > Ambiguous element: dirname. Possible choices: > ? ? ? ?vdl:dirname > ? ? ? ?swiftscript:dirname > 2010-07-01 19:37:47,906-0500 INFO ?EventBus Near Karajan line: dirname > @ vdl-int.k, line: 274 > Karajan exception: Ambiguous element: dirname. Possible choices: > ? ? ? ?vdl:dirname > ? ? ? 
?swiftscript:dirname > Caused by: org.globus.cog.karajan.workflow.KarajanRuntimeException: > Ambiguous element: dirname. Possible choices: > ? ? ? ?vdl:dirname > ? ? ? ?swiftscript:dirname > > > -Allan > > -- > Allan M. Espinosa > PhD student, Computer Science > University of Chicago > -- Allan M. Espinosa PhD student, Computer Science University of Chicago From hategan at mcs.anl.gov Fri Jul 2 00:50:34 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 02 Jul 2010 00:50:34 -0500 Subject: [Swift-devel] Re: dirname directives ambigious In-Reply-To: References: Message-ID: <1278049834.24578.5.camel@blabla2.none> On Thu, 2010-07-01 at 23:54 -0500, Allan Espinosa wrote: > I changed the line it refers to vdl:dirname() . There's no dirname() > function in vdl-int.k that refers to the swiftscript namespace right? Wasn't when I last wrote at it. But that doesn't mean much. It's the purpose that made that change to be that is probably more relevant. From wozniak at mcs.anl.gov Fri Jul 2 11:55:26 2010 From: wozniak at mcs.anl.gov (Justin M Wozniak) Date: Fri, 2 Jul 2010 11:55:26 -0500 (Central Daylight Time) Subject: [Swift-devel] dirname directives ambigious In-Reply-To: References: Message-ID: Hi Allan I introduced this while adding the @dirname() function requested for the Montage application. I just committed a quick fix but you may want to wait until I do some more testing; sorry for the hitch. Justin On Thu, 1 Jul 2010, Allan Espinosa wrote: > I updated to the latest cog-trunk and swift-trunk today and got these: > > Caused by: org.globus.cog.karajan.workflow.KarajanRuntimeException: > Ambiguous element: dirname. Possible choices: > vdl:dirname > swiftscript:dirname > 2010-07-01 19:37:47,906-0500 INFO EventBus Near Karajan line: dirname > @ vdl-int.k, line: 274 > Karajan exception: Ambiguous element: dirname. Possible choices: > vdl:dirname > swiftscript:dirname > Caused by: org.globus.cog.karajan.workflow.KarajanRuntimeException: > Ambiguous element: dirname. Possible choices: > vdl:dirname > swiftscript:dirname > > > -Allan > > -- Justin M Wozniak From wozniak at mcs.anl.gov Fri Jul 2 14:40:23 2010 From: wozniak at mcs.anl.gov (Justin M Wozniak) Date: Fri, 2 Jul 2010 14:40:23 -0500 (CDT) Subject: [Swift-devel] manual coasters In-Reply-To: <1277957571.15423.8.camel@blabla2.none> References: <1277957571.15423.8.camel@blabla2.none> Message-ID: On Wed, 30 Jun 2010, Mihael Hategan wrote: > Manual coasters are in trunk. I did some limited testing on localhost. I'm getting a problem where the original callback URI is null. Is it possible that it is due to this change? Triggered with null pointer at Settings.java:239 Caused by: java.lang.NullPointerException at org.globus.cog.abstraction.coaster.service.job.manager.Settings.setInternalHostname(Settings.java:239) ... 
11 more Failed to set configuration java.lang.IllegalArgumentException: Cannot set: internalHostname to: 172.17.5.144 at org.globus.cog.abstraction.coaster.service.job.manager.Settings.set(Settings.java:437) at org.globus.cog.abstraction.coaster.service.ServiceConfigurationHandler.requestComplete(ServiceConfigurationHandler.java:39) at org.globus.cog.karajan.workflow.service.handlers.RequestHandler.receiveCompleted(RequestHandler.java:84) at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.handleRequest(AbstractKarajanChannel.java:381) at org.globus.cog.karajan.workflow.service.channels.AbstractPipedChannel.actualSend(AbstractPipedChannel.java:79) at org.globus.cog.karajan.workflow.service.channels.AbstractPipedChannel$Sender.run(AbstractPipedChannel.java:108) -- Justin M Wozniak From hategan at mcs.anl.gov Fri Jul 2 14:52:29 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 02 Jul 2010 14:52:29 -0500 Subject: [Swift-devel] manual coasters In-Reply-To: References: <1277957571.15423.8.camel@blabla2.none> Message-ID: <1278100349.27598.0.camel@blabla2.none> On Fri, 2010-07-02 at 14:40 -0500, Justin M Wozniak wrote: > On Wed, 30 Jun 2010, Mihael Hategan wrote: > > > Manual coasters are in trunk. I did some limited testing on localhost. > > I'm getting a problem where the original callback URI is null. Is it > possible that it is due to this change? Likely. The original callback is not supposed to be null. From hategan at mcs.anl.gov Fri Jul 2 14:55:53 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 02 Jul 2010 14:55:53 -0500 Subject: [Swift-devel] manual coasters In-Reply-To: <1278100349.27598.0.camel@blabla2.none> References: <1277957571.15423.8.camel@blabla2.none> <1278100349.27598.0.camel@blabla2.none> Message-ID: <1278100553.27598.1.camel@blabla2.none> On Fri, 2010-07-02 at 14:52 -0500, Mihael Hategan wrote: > On Fri, 2010-07-02 at 14:40 -0500, Justin M Wozniak wrote: > > On Wed, 30 Jun 2010, Mihael Hategan wrote: > > > > > Manual coasters are in trunk. I did some limited testing on localhost. > > > > I'm getting a problem where the original callback URI is null. Is it > > possible that it is due to this change? > > Likely. The original callback is not supposed to be null. > Try r2790. From wozniak at mcs.anl.gov Fri Jul 2 15:06:03 2010 From: wozniak at mcs.anl.gov (Justin M Wozniak) Date: Fri, 2 Jul 2010 15:06:03 -0500 (CDT) Subject: [Swift-devel] manual coasters In-Reply-To: <1278100553.27598.1.camel@blabla2.none> References: <1277957571.15423.8.camel@blabla2.none> <1278100349.27598.0.camel@blabla2.none> <1278100553.27598.1.camel@blabla2.none> Message-ID: On Fri, 2 Jul 2010, Mihael Hategan wrote: > On Fri, 2010-07-02 at 14:52 -0500, Mihael Hategan wrote: >> On Fri, 2010-07-02 at 14:40 -0500, Justin M Wozniak wrote: >>> On Wed, 30 Jun 2010, Mihael Hategan wrote: >>> >>>> Manual coasters are in trunk. I did some limited testing on localhost. >>> >>> I'm getting a problem where the original callback URI is null. Is it >>> possible that it is due to this change? >> >> Likely. The original callback is not supposed to be null. >> > > Try r2790. Works. -- Justin M Wozniak From wozniak at mcs.anl.gov Fri Jul 2 15:06:34 2010 From: wozniak at mcs.anl.gov (Justin M Wozniak) Date: Fri, 2 Jul 2010 15:06:34 -0500 (CDT) Subject: [Swift-devel] dirname directives ambigious In-Reply-To: References: Message-ID: On Fri, 2 Jul 2010, Justin M Wozniak wrote: > I introduced this while adding the @dirname() function requested for > the Montage application. 
I just committed a quick fix but you may want to > wait until I do some more testing; sorry for the hitch. Ok, you can try again now. -- Justin M Wozniak From dk0966 at cs.ship.edu Tue Jul 6 03:33:29 2010 From: dk0966 at cs.ship.edu (David Kelly) Date: Tue, 6 Jul 2010 04:33:29 -0400 Subject: [Swift-devel] Re: Swift configuration interface In-Reply-To: References: <22726276.770581276704176196.JavaMail.root@zimbra> <1277136066.3882.2.camel@blabla2.none> <1277145574.4729.3.camel@blabla2.none> Message-ID: Hello, The newest version of swiftconfig is available through svn at https://svn.ci.uchicago.edu/svn/vdl2/usertools/swift/swiftconfig. New features are the automatic replacement of $HOME within site templates, the ability to add/modify site profiles, and the removal of commands from the translation catalog. Details of how it works is also now documented as POD. "perldoc swiftconfig" will give you all the details (also included as attachment). David -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- NAME swiftconfig - Utility for managing Swift configuration SYNOPSIS swiftconfig [-option value] OVERVIEW The swiftconfig program allows users to configure Swift. It allows for the adding, removing, and modification of remote sites by utilizing a set of standard templates. It also provides a way to quickly add, remove and modify translation catalog entries without having to manually edit files. DESCRIPTION General operations: -add sitename add a site from template -remove site removes a site from sites.xml -remove command removes a command from the catalog -templates display all available sites in template -modify site Specifies the name of a site to modify Translation catalog settings: -host hostname hostname of the translation catalog entry -name name translation name -path path full pathname to location of program -status status installation status (deprecated) -profile setting define the profile value for an entry -tcfile filename explicitly specify a translation file Sites settings: -templatefile file explicitly set the template file to use -sitesfile file explicitly set the sites file to use -gridftp GridFTPURL GridFTP URL -jobuniverse universe job manager universe -joburl URL job manager URL -jobmajor major job mager number -jobminor minor job minor number -directory dir work directory -exprovider name execution provider -exmanager name execution job manager -exurl URL execution URL -key key profile key -value value profile value -namespace name profile namespace EXAMPLES List all templates available for adding: swiftconfig -templates Add a site from template into working sites.xml: swiftconfig -add teraport Modify the work directory of a site: swiftconfig -modify teraport -directory /var/tmp Remove a site: swiftconfig -remove teraport Add a new command to translation catalog: swiftconfig -name convert -path /usr/local/bin/convert Modify an existing command in the translation catalog: swiftconfig -name convert -path /usr/bin/convert Remove a command from the translation catalog: swiftconfig -remove convert CAVEATS Swiftconfig will attempt to automatically determine the location of swift configuration files. It first checks for an environment variable called $SWIFT_HOME. If that is not found, it will look for the location of "swift" in the path, and try to find the configuration files from there. This default behavior can be overwridden by manually specifying the location of files with -templatefile, -sitesfile, and -tcfile. 
The XML library that swiftconfig uses ignores comments in XML. All comments will be stripped from sites.xml as it gets modified. From wilde at mcs.anl.gov Thu Jul 8 19:57:14 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 8 Jul 2010 19:57:14 -0500 (CDT) Subject: [Swift-devel] svn co of cog keeps hanging Message-ID: <1704803.1527991278637034665.JavaMail.root@zimbra> Ive tried 3 times in the past few hours to do an svn co of cog, on 2 different MCS machines (login and vanquish). In each case, the co runs fine for a while (few hundred files?) and then freezes. I could only kill the svn command with a kill -9, and then it left the tree locked in such a way that svn cleanup couldnt clear the lock. I think I've occasionally seen similar problems with checkouts of cog from sourceforge freezing. Is anyone else seeing this problem? Is it common? Any suggested remedies? Thanks, Mike From wilde at mcs.anl.gov Thu Jul 8 20:59:36 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 8 Jul 2010 20:59:36 -0500 (CDT) Subject: [Swift-devel] Re: Import Statement In-Reply-To: <4C367E16.9000805@gmail.com> Message-ID: <19110871.1528721278640776610.JavaMail.root@zimbra> Jon, your idea sounds good to me, unless others with a deeper understanding of Java, Python, etc feel we need different semantics. I think plain pathname-based resolution ala C include files makes sense for Swift. for now. Ben implemented the initial import feature; he, Mihael, and Justin should weigh in as well. We also discussed easing the restriction on where in the source file the import statement can be placed; there may be some reason (ease of implementation?) why its constrained to the start of the file as it is now. - Mike ----- "Jonathan Monette" wrote: > Mike and Justin, > I think I have found out where to change the way the import > statement works. Right now when you import the file has to be in the > > current directory. I would like to change this so that you can > specify > an actual path(relative or absolute path)to the script to be imported. > > But I would like your opinion on how should it look. I am leaning > towards the C style i.e. "src/file". I am open to opinions and > discussion. Maybe the opinion is that it shouldn't be changed. Just > > need more input for the decision. > > -- > Jon > > Computers are incredibly fast, accurate, and stupid. Human beings are > incredibly slow, inaccurate, and brilliant. Together they are powerful > beyond imagination. > - Albert Einstein -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From benc at hawaga.org.uk Fri Jul 9 02:49:43 2010 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 9 Jul 2010 07:49:43 +0000 (GMT) Subject: [Swift-devel] Re: Import Statement In-Reply-To: <19110871.1528721278640776610.JavaMail.root@zimbra> References: <19110871.1528721278640776610.JavaMail.root@zimbra> Message-ID: My original implementation was, I think, to try to get something in the langauge that was better than running .swift files through cpp. There is no module or namespace structure in Swift, so using something based on filenames makes sense. The requirement to have imports right at the start probably comes from me wanting to import the source code of the included file fairly early in processing. If you allow import statements elsewhere, then there is a question of "why?" - what behaviour do you expect to be different? 
If its to allow import statements to appear anywhere, but have them mean the same thing no matter where they appear, then I think it should be fairly straightforward to make them work. I think nested includes won't work correctly in the present implementation: myprog imports useful lib. myprogr imports stdlib. usefullib imports stdlib. I think that will give you duplicate imports which will break things. That's probably not hard to resolve (eg. check if a particular lib has been imported and skip it - don't start doing wierd cpp style ifdefs) -- From wilde at mcs.anl.gov Fri Jul 9 11:08:55 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 9 Jul 2010 11:08:55 -0500 (CDT) Subject: [Swift-devel] Re: Import Statement In-Reply-To: Message-ID: <30925905.1544251278691735337.JavaMail.root@zimbra> ----- "Ben Clifford" wrote: > > The requirement to have imports right at the start probably comes from > me > wanting to import the source code of the included file fairly early in > > processing. If you allow import statements elsewhere, then there is a > > question of "why?" - what behaviour do you expect to be different? > > If its to allow import statements to appear anywhere, but have them > mean > the same thing no matter where they appear, then I think it should be > > fairly straightforward to make them work. It was to enable them to appear anywhere for textual purposes. The semantics of same behavior regardless where they appear sounds good. > I think nested includes won't work correctly in the present > implementation: > > myprog imports useful lib. myprogr imports stdlib. usefullib imports > > stdlib. > > I think that will give you duplicate imports which will break things. > > That's probably not hard to resolve (eg. check if a particular lib has > > been imported and skip it - don't start doing wierd cpp style ifdefs) That seems worth doing at some point. For now Jon's initial enhancement of path names will be useful and sufficient. - Mike From hategan at mcs.anl.gov Fri Jul 9 14:21:43 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 09 Jul 2010 14:21:43 -0500 Subject: [Swift-devel] stuff to do Message-ID: <1278703303.2353.4.camel@blabla2.none> Mike suggested I send out an email to the list to figure out what the group-perceived priorities would be for a bunch of items. 1. make swift core faster 2. test/fix coaster file staging 3. standalone coaster service 4. swift shell The idea is that some recent changes may have shifted the existing priorities. So think of this from the perspective of user/application/publication goals rather than what you think would be "nice to have". Mihael From wilde at mcs.anl.gov Fri Jul 9 15:18:09 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 9 Jul 2010 15:18:09 -0500 (CDT) Subject: [Swift-devel] Swift hanging on array close? Message-ID: <7818201.1556751278706689668.JavaMail.root@zimbra> In the following script, two nested foreach stmts fill a sparse array. @filenames(array) is then passed to an ext mapper. This seems to hang: the ext mapper is never called. This code in on the mcs net in ~wilde/swift/lab/sgflow. Mihael and/or Justin, can you take a look? 
The FL group needs this for a tutorial next week (and has lots more to do, so a quick fix or workaround this afternoon would be very helpful) Thanks, Mike type file; # is integer between 1 and 30 # is in megawatts app (file o) sgflow (int bus, int power) { sgflow bus power stdout=@o; } app (file o) mkgraph (file i) { awk "-f" "/home/turam/tmp/mkgraph.awk" @filename(i) stdout=@o; } app (file o) mktable (file i) { awk "-f" "/home/turam/tmp/SGsplitter.awk" stdin=@i; } file ofiles[] ; string nbus = @arg("nbus","1"); string nplevel = @arg("nplevel", "2"); foreach bus in [1:@toint(nbus)] { foreach plevel in [1:@toint(nplevel)] { # file o; ofiles[bus*@toint(nplevel)+plevel] = sgflow(bus,plevel); } } file i ; # ^^^^ hangs here - mktableinput.sh is never called. # trace(@filenames(ofiles)) also hangs. file otable <"otable.txt">; otable = mktable(i); ---- ofiles.map is hardcoded to return ofile.3 and ofile.4 - this works, and those files get the expected output. -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wozniak at mcs.anl.gov Fri Jul 9 16:40:08 2010 From: wozniak at mcs.anl.gov (Justin M Wozniak) Date: Fri, 9 Jul 2010 16:40:08 -0500 (CDT) Subject: [Swift-devel] Swift hanging on array close? In-Reply-To: <7818201.1556751278706689668.JavaMail.root@zimbra> References: <7818201.1556751278706689668.JavaMail.root@zimbra> Message-ID: Using @filename() should get the job done. Also, note that you need file i[] as an array at the end there. And your cat call in mktableinput.sh may not work as desired. On Fri, 9 Jul 2010, Michael Wilde wrote: > In the following script, two nested foreach stmts fill a sparse array. > @filenames(array) is then passed to an ext mapper. > > This seems to hang: the ext mapper is never called. > > This code in on the mcs net in ~wilde/swift/lab/sgflow. > > Mihael and/or Justin, can you take a look? The FL group needs this for a > tutorial next week (and has lots more to do, so a quick fix or > workaround this afternoon would be very helpful) > > Thanks, > > Mike > > type file; > > # is integer between 1 and 30 > # is in megawatts > > app (file o) sgflow (int bus, int power) > { > sgflow bus power stdout=@o; > } > app (file o) mkgraph (file i) > { > awk "-f" "/home/turam/tmp/mkgraph.awk" @filename(i) stdout=@o; > } > > app (file o) mktable (file i) > { > awk "-f" "/home/turam/tmp/SGsplitter.awk" stdin=@i; > } > > file ofiles[] ; > > string nbus = @arg("nbus","1"); > string nplevel = @arg("nplevel", "2"); > foreach bus in [1:@toint(nbus)] { > foreach plevel in [1:@toint(nplevel)] { > # file o; > ofiles[bus*@toint(nplevel)+plevel] = sgflow(bus,plevel); > } > } > > file i ; > > # ^^^^ hangs here - mktableinput.sh is never called. > # trace(@filenames(ofiles)) also hangs. > > file otable <"otable.txt">; > otable = mktable(i); > > ---- > ofiles.map is hardcoded to return ofile.3 and ofile.4 - this works, and those files get the expected output. 
> > -- Justin M Wozniak From iraicu at cs.uchicago.edu Sun Jul 11 06:47:59 2010 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Sun, 11 Jul 2010 06:47:59 -0500 Subject: [Swift-devel] CFP: The 3rd ACM Workshop on Many-Task Computing on Grids and Supercomputers (MTAGS) 2010, co-located with Supercomputing 2010 -- November 15th, 2010 - New Orleans, LA, USA Message-ID: <4C39AF6F.7000409@cs.uchicago.edu> Call for Papers ------------------------------------------------------------------------------------------------ The 3rd ACM Workshop on Many-Task Computing on Grids and Supercomputers (MTAGS) 2010 http://dsl.cs.uchicago.edu/MTAGS10/ ------------------------------------------------------------------------------------------------ November 15th, 2010 New Orleans, Louisiana, USA Co-located with with IEEE/ACM International Conference for High Performance Computing, Networking, Storage and Analysis (SC10) ================================================================================================ The 3rd workshop on Many-Task Computing on Grids and Supercomputers (MTAGS) will provide the scientific community a dedicated forum for presenting new research, development, and deployment efforts of large-scale many-task computing (MTC) applications on large scale clusters, Grids, Supercomputers, and Cloud Computing infrastructure. MTC, the theme of the workshop encompasses loosely coupled applications, which are generally composed of many tasks (both independent and dependent tasks) to achieve some larger application goal. This workshop will cover challenges that can hamper efficiency and utilization in running applications on large-scale systems, such as local resource manager scalability and granularity, efficient utilization of raw hardware, parallel file system contention and scalability, data management, I/O management, reliability at scale, and application scalability. We welcome paper submissions on all topics related to MTC on large scale systems. Papers will be peer-reviewed, and accepted papers will be published in the workshop proceedings as part of the ACM digital library (pending approval). The workshop will be co-located with the IEEE/ACM Supercomputing 2010 Conference in New Orleans Louisiana on November 15th, 2010. For more information, please see http://dsl.cs.uchicago.edu/MTAGS010/. Scope ------------------------------------------------------------------------------------------------ This workshop will focus on the ability to manage and execute large scale applications on today's largest clusters, Grids, and Supercomputers. Clusters with 50K+ processor cores are now online (e.g. TACC Sun Constellation System - Ranger), Grids (e.g. TeraGrid) with a dozen sites and 100K+ processors, and supercomputers with 150K~200K processors (e.g. IBM BlueGene/P, Cray XT5); furthermore, new supercomputers are scheduled to come online with 300K processor-cores and more than 1M threads (e.g. IBM Blue Waters). Large clusters and supercomputers have traditionally been high performance computing (HPC) systems, as they are efficient at executing tightly coupled parallel jobs within a particular machine with low-latency interconnects; the applications typically use message passing interface (MPI) to achieve the needed inter-process communication. On the other hand, Grids have been the preferred platform for more loosely coupled applications that tend to be managed and executed through workflow systems, commonly known to fit in the high-throughput computing (HTC) paradigm. 
Many-task computing (MTC) aims to bridge the gap between two computing paradigms, HTC and HPC. MTC is reminiscent to HTC, but it differs in the emphasis of using many computing resources over short periods of time to accomplish many computational tasks (i.e. including both dependent and independent tasks), where the primary metrics are measured in seconds (e.g. FLOPS, tasks/s, MB/s I/O rates), as opposed to operations (e.g. jobs) per month. MTC denotes high-performance computations comprising multiple distinct activities, coupled via file system operations. Tasks may be small or large, uniprocessor or multiprocessor, compute-intensive or data-intensive. The set of tasks may be static or dynamic, homogeneous or heterogeneous, loosely coupled or tightly coupled. The aggregate number of tasks, quantity of computing, and volumes of data may be extremely large. MTC includes loosely coupled applications that are generally communication-intensive but not naturally expressed using standard message passing interface commonly found in HPC, drawing attention to the many computations that are heterogeneous but not "happily" parallel. There is more to HPC than tightly coupled MPI, and more to HTC than embarrassingly parallel long running jobs. Like HPC applications, and science itself, applications are becoming increasingly complex opening new doors for many opportunities to apply HPC in new ways if we broaden our perspective. Some applications have just so many simple tasks that managing them is hard. Applications that operate on or produce large amounts of data need sophisticated data management in order to scale. There exist applications that involve many tasks, each composed of tightly coupled MPI tasks. Loosely coupled applications often have dependencies among tasks, and typically use files for inter-process communication. Efficient support for these sorts of applications on existing large scale systems will involve substantial technical challenges and will have big impact on science. Today's existing HPC systems are a viable platform to host MTC applications. However, some challenges arise in large scale applications when run on large scale systems, which can hamper the efficiency and utilization of these large scale systems. These challenges vary from local resource manager scalability and granularity, efficient utilization of the raw hardware, parallel file system contention and scalability, data management, I/O management, reliability at scale, application scalability, and understanding the limitations of the HPC systems in order to identify good candidate MTC applications. Furthermore, the MTC paradigm can be naturally applied to the emerging Cloud Computing paradigm due to its loosely coupled nature, which is being adopted by industry as the next wave of technological advancement to reduce operational costs while improving efficiencies in large scale infrastructures. To see last year's workshop program agenda, and accepted papers and presentations, please see http://dsl.cs.uchicago.edu/MTAGS09/; for the initial workshop we ran in 2008, please see http://dsl.cs.uchicago.edu/MTAGS08/. We also ran a special issue on Many-Task Computing in the IEEE Transactions on Parallel and Distributed Systems (TPDS) which will appear in November 2010, which can be found at http://dsl.cs.uchicago.edu/TPDS_MTC/. We, the workshop organizers, also published two papers that are highly relevant to this workshop. 
One paper is titled "Toward Loosely Coupled Programming on Petascale Systems", and was published in SC08; the second paper is titled "Many-Task Computing for Grids and Supercomputers", which was published in MTAGS08. Topics ------------------------------------------------------------------------------------------------ We invite the submission of original work that is related to the topics below. The papers can be either short (5 pages) position papers, or long (10 pages) research papers. Topics of interest include (in the context of Many-Task Computing): * Compute Resource Management * Scheduling * Job execution frameworks * Local resource manager extensions * Performance evaluation of resource managers in use on large scale systems * Dynamic resource provisioning * Techniques to manage many-core resources and/or GPUs * Challenges and opportunities in running many-task workloads on HPC systems * Challenges and opportunities in running many-task workloads on Cloud Computing infrastructure * Storage architectures and implementations * Distributed file systems * Parallel file systems * Distributed meta-data management * Content distribution systems for large data * Data caching frameworks and techniques * Data management within and across data centers * Data-aware scheduling * Data-intensive computing applications * Eventual-consistency storage usage and management * Programming models and tools * Map-reduce and its generalizations * Many-task computing middleware and applications * Parallel programming frameworks * Ensemble MPI techniques and frameworks * Service-oriented science applications * Large-Scale Workflow Systems * Workflow system performance and scalability analysis * Scalability of workflow systems * Workflow infrastructure and e-Science middleware * Programming Paradigms and Models * Large-Scale Many-Task Applications * High-throughput computing (HTC) applications * Data-intensive applications * Quasi-supercomputing applications, deployments, and experiences * Performance Evaluation * Performance evaluation * Real systems * Simulations * Reliability of large systems Paper Submission and Publication ------------------------------------------------------------------------------------------------ Authors are invited to submit papers with unpublished, original work of not more than 10 pages of double column text using single spaced 10 point size on 8.5 x 11 inch pages, as per ACM 8.5 x 11 manuscript guidelines (http://www.acm.org/publications/instructions_for_proceedings_volumes); document templates can be found at http://www.acm.org/sigs/publications/proceedings-templates. We are also seeking position papers of no more than 5 pages in length. A 250 word abstract (PDF format) must be submitted online at https://cmt.research.microsoft.com/MTAGS2010/ before the deadline of August 25th, 2010 at 11:59PM PST; the final 5/10 page papers in PDF format will be due on September 1st, 2010 at 11:59PM PST. Papers will be peer-reviewed, and accepted papers will be published in the workshop proceedings as part of the ACM digital library (pending approval). Notifications of the paper decisions will be sent out by October 1st, 2010. Selected excellent work may be eligible for additional post-conference publication as journal articles or book chapters; see last year's special issue in the IEEE Transactions on Parallel and Distributed Systems (TPDS) at http://dsl.cs.uchicago.edu/TPDS_MTC/. Submission implies the willingness of at least one of the authors to register and present the paper. 
For more information, please visit http://dsl.cs.uchicago.edu/MTAGS10/. Important Dates ------------------------------------------------------------------------------------------------ * Abstract Due: August 25th, 2010 * Papers Due: September 1st, 2010 * Notification of Acceptance: October 1st, 2010 * Camera Ready Papers Due: November 1st, 2010 * Workshop Date: November 15th, 2010 Committee Members ------------------------------------------------------------------------------------------------ Workshop Chairs * Ioan Raicu, Illinois Institute of Technology * Ian Foster, University of Chicago & Argonne National Laboratory * Yong Zhao, Microsoft Technical Committee * Mihai Budiu, Microsoft Research, USA * Rajkumar Buyya, University of Melbourne, Australia * Alok Choudhary, Northwestern University, USA * Jack Dongara, University of Tennessee, USA * Catalin Dumitrescu, Fermi National Labs, USA * Geoffrey Fox, Indiana University, USA * Robert Grossman, University of Illinois at Chicago, USA * Alexandru Iosup, Delft University of Technology, Netherlands * Florin Isaila, Universidad Carlos III de Madrid, Spain * Daniel Katz, University of Chicago, USA * Tevfik Kosar, Louisiana State University, USA * Zhiling Lan, Illinois Institute of Technology, USA * Ignacio Llorente, Universidad Complutense de Madrid, Spain * Arthur Maccabe, Oak Ridge National Labs, USA * Reagan Moore, University of North Carolina, Chappel Hill, USA * Manish Parashar, Rutgers University, USA * Jose Moreira, IBM Research, USA * Marlon Pierce, Indiana University, USA * Lavanya Ramakrishnan, Lawrence Berkeley National Laboratory, USA * Matei Ripeanu, University of British Columbia, Canada * Alain Roy, University of Wisconsin Madison, USA * Xian-He Sun, Illinois Institute of Technology, USA * Edward Walker, Texas Advanced Computing Center, USA * Mike Wilde, University of Chicago & Argonne National Laboratory, USA * Matthew Woitaszek, The University Coorporation for Atmospheric Research, USA * Ken Yocum, University of California San Diego, USA -- ================================================================= Ioan Raicu, Ph.D. Assistant Professor ================================================================= Computer Science Department Illinois Institute of Technology 10 W. 31st Street Chicago, IL 60616 ================================================================= Cel: 1-847-722-0876 Email: iraicu at iit.edu Web: http://www.eecs.northwestern.edu/~iraicu/ ================================================================= ================================================================= -- ================================================================= Ioan Raicu, Ph.D. 
NSF/CRA Computing Innovation Fellow ================================================================= Center for Ultra-scale Computing and Information Security (CUCIS) Department of Electrical Engineering and Computer Science Northwestern University 2145 Sheridan Rd, Tech M384 Evanston, IL 60208-3118 ================================================================= Cel: 1-847-722-0876 Tel: 1-847-491-8163 Email: iraicu at eecs.northwestern.edu Web: http://www.eecs.northwestern.edu/~iraicu/ https://wiki.cucis.eecs.northwestern.edu/ ================================================================= ================================================================= From wilde at mcs.anl.gov Mon Jul 12 11:52:44 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 12 Jul 2010 11:52:44 -0500 (CDT) Subject: [Swift-devel] stuff to do In-Reply-To: <1278703303.2353.4.camel@blabla2.none> Message-ID: <33415753.14491278953564701.JavaMail.root@zimbra> Here's my view on these: > 2. test/fix coaster file staging This would be useful for both real apps and (I think) for CDM testing. I would do this first. I would then add: 5. Adjustments needed, if any, on multicore handling in PBS and SGE provider. 6. Adjustments and fixes for reliability and logging, if needed, in Condor-G provider. I expect that 5 & 6 would be small tasks, and they are not yet clearly defined. I think that other people could do them. Maybe add: 7. -tui fixes. Seems not to be working so well on recent tests; several of the screens, including the source-code view, seem not to be working. Then: > 1. make swift core faster I would do this second; I think you said you need about 7-10 days to try things and see what can be done, maybe more after that if the exploration suggests things that will take much (re)coding? > 3. standalone coaster service The current manual coasters is proving useful. > 4. swift shell Lets defer (4) for now; if we can instead run swift repeatedly and either have the coaster worker pool re-connect quickly to each new swift, or quickly start new pools within the same cluster job(s), that would suffice for now. Justin, do you want to weigh in on these? Thanks, Mike > The idea is that some recent changes may have shifted the existing > priorities. So think of this from the perspective of > user/application/publication goals rather than what you think would > be > "nice to have". > > Mihael > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wilde at mcs.anl.gov Mon Jul 12 13:50:25 2010 From: wilde at mcs.anl.gov (wilde at mcs.anl.gov) Date: Mon, 12 Jul 2010 13:50:25 -0500 (CDT) Subject: [Swift-devel] swift config files for running on multiple multicore machines In-Reply-To: <29347314.19351278960561692.JavaMail.root@zimbra> Message-ID: <1321903.19401278960625940.JavaMail.root@zimbra> attached. You need to set up an ssh key, and put its passphrase in ~/.ssh/auth.defaults. make sure that auth.defaults is mode 600 (not readable by others) You also need to create a GSI proxy on the submit host, and make sure that X509_CERT_DIR on the target hosts is set to a valid CA certificate dir: export X509_CERT_DIR=/home/wilde/TRUSTEDCA export X509_CADIR=/home/wilde/TRUSTEDCA - Mike -------------- next part -------------- A non-text attachment was scrubbed... 
Name: coasters.xml Type: application/xml Size: 9474 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: tc Type: application/octet-stream Size: 6144 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: auth.defaults.example Type: application/octet-stream Size: 2100 bytes Desc: not available URL: From wilde at mcs.anl.gov Mon Jul 12 20:17:43 2010 From: wilde at mcs.anl.gov (wilde at mcs.anl.gov) Date: Mon, 12 Jul 2010 20:17:43 -0500 (CDT) Subject: [Swift-devel] Re: MCS cluster In-Reply-To: <5658524.32951278983719554.JavaMail.root@zimbra> Message-ID: <17271120.32991278983863922.JavaMail.root@zimbra> ----- "Jonathan Monette" wrote: > Mike, > Why am I not able to submit tasks to the MCS machines from my > laptop? Why does it have to be from another MCS machine? Basically because *I think* the MCS server machines are not visible outside the MCS firewall - you need to ssh into them via a login.mcs host. Now, that's what I *think*, but you should verify by testing and reading the MCS FAQs. If I am correct, you *might* be able to get around this with clever ssh tunneling, but that will also be work to figure out. I cant recall what I did the other day to get around the problems with svn co of cog from sourceforge, but thats another angle you could attack. If symlinked path synonyms dont get in the way, you *might* be able to co a clean cog tree to a CI host and then tar it to the MCS net. Im not sure if the cog-checkout problem is unique to mcs-to-sourceforge, or happens elsewhere. I think I recall it happening on CI hosts as well. Maybe its caused by fast client hosts or networks that cause sourceforge to throttle back (and get hung in the process)??? I'm cc'ing swift-devel here for other ideas. - Mike > Because I > cannot checkout the swift trunk to machines I cannot use the fixes > that > have been commited to run the Montage wrappers. > > -- > Jon > > Computers are incredibly fast, accurate, and stupid. Human beings are > incredibly slow, inaccurate, and brilliant. Together they are powerful > beyond imagination. > - Albert Einstein -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wilde at mcs.anl.gov Tue Jul 13 00:23:03 2010 From: wilde at mcs.anl.gov (wilde at mcs.anl.gov) Date: Tue, 13 Jul 2010 00:23:03 -0500 (CDT) Subject: [Swift-devel] Re: Coaster problems with proxy and gsi explained In-Reply-To: <1794127.34651278990828698.JavaMail.root@zimbra> Message-ID: <32423910.35991278998583868.JavaMail.root@zimbra> For whatever reason, here are the issues: - mcs is running Ubuntu 10.x on these machines, and seems to no longer include any Sun Javas in its .soft options. So I needed to bring in my own Java. - mcs doesnt have Globus or OSG packages, so I needed to bring in my own CA cert dir - the login shell, by default doesnt process .bashrc; the call to .bashrc needs to go in your .profile or similar, and was missing from mine. - after much reading up on bash startup, and fiddling, I concluded that when swift launches coasters with the ssh provider, only the .bashrc runs, not the profile, so one needs to essentially force the .profile to run in this case, or else set PATH and X509_CERT_DIR from .profile. What I did is have .bashrc call .profile if it was not previously run. 
- if either .profile or .bashrc sends anything to stdout, you get this cryptic, mysterious message from swift: --- stomp$ swift -tc.file tc -sites.file crush.xml cat.swift Swift svn swift-r3430 cog-r2798 RunID: 20100713-0010-d3r5x92f Progress: Exception in thread "sftp subsystem 1" java.lang.OutOfMemoryError: Java heap space at com.sshtools.j2ssh.subsystem.SubsystemClient.run(SubsystemClient.java:198) at java.lang.Thread.run(Unknown Source) --- Hence, dont do that :) So, Jon, you may want to look at my .profile and .bashrc, or do whatever is needed to set JAVA and X509_CERT_DIR correctly, until we figure how how to do this all more cleanly. - Lastly, getting a proxy from TeraGrid works just fine: cog-myproxy -S -h myproxy.teragrid.org -p 7514 -l wilde -S anonget The errors I saw when we tried this last were all due to the env var issues above. - Mike ----- "Michael Wilde" wrote: > Mihael, Jon, > > It seems that the problems we were seeing this after noon (in my > tests) was due to a bad .bashrc. > > I have verified the the cog-myproxy method of creating a proxy for > coasters-ssh does in fact work. > > Im still trying to debug what env vars are coming from my .bashrc, why > they are not supplied solely by my .soft, and what in my .bashrc was > causing the failures I was seeing all afternoon. > > But reverting to the simple .bashrc which I was using last Thu (under > the mistaken impression that it had no effect) makes coasters work > again for me, both with a DOEGrids cert and with a proxy made by > cog-myproxy-logon from my TeraGrid NCSA cert. > > - Mike > > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From hategan at mcs.anl.gov Tue Jul 13 00:52:05 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 13 Jul 2010 00:52:05 -0500 Subject: [Swift-devel] Re: Coaster problems with proxy and gsi explained In-Reply-To: <32423910.35991278998583868.JavaMail.root@zimbra> References: <32423910.35991278998583868.JavaMail.root@zimbra> Message-ID: <1279000325.11800.2.camel@blabla2.none> On Tue, 2010-07-13 at 00:23 -0500, wilde at mcs.anl.gov wrote: > For whatever reason, here are the issues: > > - mcs is running Ubuntu 10.x on these machines, and seems to no longer include any Sun Javas in its .soft options. So I needed to bring in my own Java. > > - mcs doesnt have Globus or OSG packages, so I needed to bring in my own CA cert dir > > - the login shell, by default doesnt process .bashrc; the call to .bashrc needs to go in your .profile or similar, and was missing from mine. > > - after much reading up on bash startup, and fiddling, I concluded that when swift launches coasters with the ssh provider, only the .bashrc runs, not the profile, so one needs to essentially force the .profile to run in this case, or else set PATH and X509_CERT_DIR from .profile. What I did is have .bashrc call .profile if it was not previously run. Or put those env vars in sites.xml. In a sense, I would probably recommend that. It seems to be the only "portable" way. 
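A minimal sketch of what that sites.xml route could look like, assuming the usual pool/profile layout. The pool name, host, jobmanager and java path are placeholders; the cert dir and work directory are the ones mentioned earlier in this thread:

    <pool handle="crush">
      <execution provider="coaster" url="crush.mcs.anl.gov" jobmanager="ssh:local"/>
      <profile namespace="env" key="PATH">/home/wilde/jdk1.6.0/bin:/usr/bin:/bin</profile>
      <profile namespace="env" key="X509_CERT_DIR">/home/wilde/TRUSTEDCA</profile>
      <workdirectory>/home/wilde/swiftwork/crush</workdirectory>
    </pool>

Profiles in the env namespace are handed to the job environment by the provider itself, so they do not depend on which shell startup files the remote account happens to source.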
> > - if either .profile or .bashrc sends anything to stdout, you get this cryptic, mysterious message from swift: > --- > stomp$ swift -tc.file tc -sites.file crush.xml cat.swift > Swift svn swift-r3430 cog-r2798 > > RunID: 20100713-0010-d3r5x92f > Progress: > Exception in thread "sftp subsystem 1" java.lang.OutOfMemoryError: Java heap space > at com.sshtools.j2ssh.subsystem.SubsystemClient.run(SubsystemClient.java:198) > at java.lang.Thread.run(Unknown Source) That is funny. In other words a bug. Is there any easy way to reproduce that? > > --- > Hence, dont do that :) But I wanna! From wilde at mcs.anl.gov Wed Jul 14 10:36:38 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 14 Jul 2010 10:36:38 -0500 (CDT) Subject: [Swift-devel] Swift trunk seems broken Message-ID: <21879375.100861279121798319.JavaMail.root@zimbra> I get the error below from a trunk I just updated and built: bri$ swift cats.swift Swift svn swift-r3435 cog-r2799 RunID: 20100714-1033-npcd74t6 Progress: Uncaught exception: java.lang.IncompatibleClassChangeError: Expecting non-static method org.griphyn.vdl.karajan.lib.AbsFileName.filename(Lorg/globus/cog/karajan/stack/VariableStack;)[Ljava/lang/String; in vdl:absfilename @ vdl.k, line: 79 java.lang.IncompatibleClassChangeError: Expecting non-static method org.griphyn.vdl.karajan.lib.AbsFileName.filename(Lorg/globus/cog/karajan/stack/VariableStack;)[Ljava/lang/String; at org.griphyn.vdl.karajan.lib.AbsFileName.function(AbsFileName.java:16) at org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67) at org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192) script is: type file; app (file o) cat (file i) { cat @i stdout=@o; } file out[]; file data<"data.txt">; foreach j in [0:19] { out[j] = cat(data); } From wozniak at mcs.anl.gov Wed Jul 14 10:41:52 2010 From: wozniak at mcs.anl.gov (Justin M Wozniak) Date: Wed, 14 Jul 2010 10:41:52 -0500 (CDT) Subject: [Swift-devel] Re: Swift trunk seems broken In-Reply-To: <21879375.100861279121798319.JavaMail.root@zimbra> References: <21879375.100861279121798319.JavaMail.root@zimbra> Message-ID: First glance: I think IncompatibleClassChangeError means you have to clean and build again. I'll try a few things here... 
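For anyone who hits the same IncompatibleClassChangeError, the clean rebuild usually looks something like this sketch; the checkout path and the dist/swift-svn layout are assumptions about a typical cog+swift tree, not a prescription:

    # rebuild from scratch so stale class files are not mixed with new ones
    cd ~/swift/src/cog/modules/swift
    ant clean
    ant redist
    export PATH=$PWD/dist/swift-svn/bin:$PATH    # pick up the freshly built swift
    swift -version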
On Wed, 14 Jul 2010, Michael Wilde wrote: > I get the error below from a trunk I just updated and built: > > bri$ swift cats.swift > Swift svn swift-r3435 cog-r2799 > > RunID: 20100714-1033-npcd74t6 > Progress: > Uncaught exception: java.lang.IncompatibleClassChangeError: Expecting non-static method org.griphyn.vdl.karajan.lib.AbsFileName.filename(Lorg/globus/cog/karajan/stack/VariableStack;)[Ljava/lang/String; in vdl:absfilename @ vdl.k, line: 79 > java.lang.IncompatibleClassChangeError: Expecting non-static method org.griphyn.vdl.karajan.lib.AbsFileName.filename(Lorg/globus/cog/karajan/stack/VariableStack;)[Ljava/lang/String; > at org.griphyn.vdl.karajan.lib.AbsFileName.function(AbsFileName.java:16) > at org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67) > at org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192) > > script is: > > type file; > > app (file o) cat (file i) > { > cat @i stdout=@o; > } > > file out[]; > file data<"data.txt">; > > foreach j in [0:19] { > out[j] = cat(data); > } > > -- Justin M Wozniak From wilde at mcs.anl.gov Wed Jul 14 12:34:20 2010 From: wilde at mcs.anl.gov (wilde at mcs.anl.gov) Date: Wed, 14 Jul 2010 12:34:20 -0500 (CDT) Subject: [Swift-devel] Re: Swift trunk seems broken In-Reply-To: <17937832.108611279128784224.JavaMail.root@zimbra> Message-ID: <7158009.108651279128860864.JavaMail.root@zimbra> I extracted fresh trunks for cog and swift and rebuilt. Now I get "No 'proxy' provider or alias found". (see below) Why is it looking for a proxy provider? Im expecting it to use the default sites and tc files, and local provider and /bin/cat on localhost. Is this line getting involved? vdl-int-staging.k: stagingMethod := vdl:siteProfile(rhost, "swift:stagingMethod", default="proxy") Justin, Jon, you both said my cats.swift test worked for you. Are you using the default tc and sites files? And does your version ID say Swift svn swift-r3435 cog-r2799? - Mike bri$ swift cat.swift Swift svn swift-r3435 cog-r2799 RunID: 20100714-1226-vlpmmtm7 Progress: Execution failed: Exception in cat: Arguments: [data.txt] Host: localhost Directory: cat-20100714-1226-vlpmmtm7/jobs/7/cat-7eo37tujTODO: outs ---- Caused by: No 'proxy' provider or alias found. Available providers: [cobalt, gsiftp, coaster, dcache, webdav, ssh, gt4, gt2, condor, http, pbs, ftp, gsiftp-old, local]. Aliases: gt4 <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor <-> condorlocal; cobalt <-> cobaltlocal; gsiftp <-> gridftp; pbs <-> pbslocal; gsiftp-old <-> gridftp-old; local <-> file; bri$ cat cat.swift type file; app (file o) cat (file i) { cat @i stdout=@o; } file data<"data.txt">; file out<"out.txt">; out = cat(data); bri$ ----- "Justin M Wozniak" wrote: > First glance: I think IncompatibleClassChangeError means you have to > clean > and build again. I'll try a few things here... 
> > On Wed, 14 Jul 2010, Michael Wilde wrote: > > > I get the error below from a trunk I just updated and built: > > > > bri$ swift cats.swift > > Swift svn swift-r3435 cog-r2799 > > > > RunID: 20100714-1033-npcd74t6 > > Progress: > > Uncaught exception: java.lang.IncompatibleClassChangeError: > Expecting non-static method > org.griphyn.vdl.karajan.lib.AbsFileName.filename(Lorg/globus/cog/karajan/stack/VariableStack;)[Ljava/lang/String; > in vdl:absfilename @ vdl.k, line: 79 > > java.lang.IncompatibleClassChangeError: Expecting non-static method > org.griphyn.vdl.karajan.lib.AbsFileName.filename(Lorg/globus/cog/karajan/stack/VariableStack;)[Ljava/lang/String; > > at > org.griphyn.vdl.karajan.lib.AbsFileName.function(AbsFileName.java:16) > > at > org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67) > > at > org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192) > > > > script is: > > > > type file; > > > > app (file o) cat (file i) > > { > > cat @i stdout=@o; > > } > > > > file out[]; > > file data<"data.txt">; > > > > foreach j in [0:19] { > > out[j] = cat(data); > > } > > > > > > -- > Justin M Wozniak -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wilde at mcs.anl.gov Wed Jul 14 12:52:16 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 14 Jul 2010 12:52:16 -0500 (CDT) Subject: [Swift-devel] Re: Swift trunk seems broken In-Reply-To: <7158009.108651279128860864.JavaMail.root@zimbra> Message-ID: <16929110.109221279129936989.JavaMail.root@zimbra> What seems to be happing is thatI had a ~/.swift/swift.properties file with "use.provider.staging = true". Ive been using that for as long as I can recall. Seems that in some recent rev, provider staging was changed to use a "proxy" provider? I turned off use.provider.staging and now my basic tests work again. - Mike ----- wilde at mcs.anl.gov wrote: > I extracted fresh trunks for cog and swift and rebuilt. Now I get "No > 'proxy' provider or alias found". > (see below) > > Why is it looking for a proxy provider? Im expecting it to use the > default sites and tc files, and local provider and /bin/cat on > localhost. > > Is this line getting involved? > > vdl-int-staging.k: stagingMethod := vdl:siteProfile(rhost, > "swift:stagingMethod", default="proxy") > > Justin, Jon, you both said my cats.swift test worked for you. Are you > using the default tc and sites files? And does your version ID say > Swift svn swift-r3435 cog-r2799? > > - Mike > > > > bri$ swift cat.swift > Swift svn swift-r3435 cog-r2799 > > RunID: 20100714-1226-vlpmmtm7 > Progress: > Execution failed: > Exception in cat: > Arguments: [data.txt] > Host: localhost > Directory: cat-20100714-1226-vlpmmtm7/jobs/7/cat-7eo37tujTODO: outs > ---- > > Caused by: > No 'proxy' provider or alias found. Available providers: [cobalt, > gsiftp, coaster, dcache, webdav, ssh, gt4, gt2, condor, http, pbs, > ftp, gsiftp-old, local]. 
Aliases: gt4 <-> gt4.0.2, gt4.0.1, gt3.9.5, > gt4.0.0; condor <-> condorlocal; cobalt <-> cobaltlocal; gsiftp <-> > gridftp; pbs <-> pbslocal; gsiftp-old <-> gridftp-old; local <-> file; > > bri$ cat cat.swift > type file; > > app (file o) cat (file i) > { > cat @i stdout=@o; > } > > file data<"data.txt">; > file out<"out.txt">; > out = cat(data); > bri$ > > > > ----- "Justin M Wozniak" wrote: > > > First glance: I think IncompatibleClassChangeError means you have > to > > clean > > and build again. I'll try a few things here... > > > > On Wed, 14 Jul 2010, Michael Wilde wrote: > > > > > I get the error below from a trunk I just updated and built: > > > > > > bri$ swift cats.swift > > > Swift svn swift-r3435 cog-r2799 > > > > > > RunID: 20100714-1033-npcd74t6 > > > Progress: > > > Uncaught exception: java.lang.IncompatibleClassChangeError: > > Expecting non-static method > > > org.griphyn.vdl.karajan.lib.AbsFileName.filename(Lorg/globus/cog/karajan/stack/VariableStack;)[Ljava/lang/String; > > in vdl:absfilename @ vdl.k, line: 79 > > > java.lang.IncompatibleClassChangeError: Expecting non-static > method > > > org.griphyn.vdl.karajan.lib.AbsFileName.filename(Lorg/globus/cog/karajan/stack/VariableStack;)[Ljava/lang/String; > > > at > > > org.griphyn.vdl.karajan.lib.AbsFileName.function(AbsFileName.java:16) > > > at > > org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67) > > > at > > > org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192) > > > > > > script is: > > > > > > type file; > > > > > > app (file o) cat (file i) > > > { > > > cat @i stdout=@o; > > > } > > > > > > file out[]; > > > file data<"data.txt">; > > > > > > foreach j in [0:19] { > > > out[j] = cat(data); > > > } > > > > > > > > > > -- > > Justin M Wozniak > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From hategan at mcs.anl.gov Wed Jul 14 12:59:04 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 14 Jul 2010 12:59:04 -0500 Subject: [Swift-devel] Re: Swift trunk seems broken In-Reply-To: <16929110.109221279129936989.JavaMail.root@zimbra> References: <16929110.109221279129936989.JavaMail.root@zimbra> Message-ID: <1279130344.26184.1.camel@blabla2.none> On Wed, 2010-07-14 at 12:52 -0500, Michael Wilde wrote: > What seems to be happing is thatI had a ~/.swift/swift.properties file with "use.provider.staging = true". Ive been using that for as long as I can recall. It's odd that it worked. But it may be that the version you were running ignored that property. > > Seems that in some recent rev, provider staging was changed to use a "proxy" provider? No such thing as a "proxy" provider, but there is a proxy staging method. 
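For later readers, the knob involved here, as a sketch; only use.provider.staging and the swift:stagingMethod profile are confirmed by this thread, so treat anything else as an assumption:

    # ~/.swift/swift.properties
    # Provider staging off (the workaround above). When it is on, the staging
    # method comes from the site's swift:stagingMethod profile, which
    # vdl-int-staging.k defaults to "proxy".
    use.provider.staging=false

With provider staging enabled, the matching sites.xml entry would be a profile in the swift namespace, e.g. <profile namespace="swift" key="stagingMethod">proxy</profile>.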
From aespinosa at cs.uchicago.edu Wed Jul 14 17:47:56 2010 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Wed, 14 Jul 2010 17:47:56 -0500 Subject: [Swift-devel] swift-plot-log broken on trunk Message-ID: I guess some required plots are no longer found in the new logs: $ swift-plot-log sleep-LGU_condor.log execstages.png Log file path is /home/aespinosa/workflows/cybershake/sleep-LGU_condor.log Log is in directory /home/aespinosa/workflows/cybershake Log basename is sleep-LGU_condor Now in directory /tmp/swift-plot-log-TCmEupYGhEd18102 rm -f start-times.data kickstart-times.data start-time.tmp end-time.tmp threads.list tasks.list log *.data *.shifted *.png *.event * .coloured-event *.total *.tmp *.transitions *.last karatasks-type-counts.txt index.html *.lastsummary execstages.plot total.plot col our.plot jobs-sites.html jobs.retrycount.summary kickstart.stats execution-counts.txt site-duration.txt jobs.retrycount sp.plot karatasks.coloured-sorted-event *.cedps *.stats t.inf *.seenstates tmp-* clusterstats trname-summary sites-list.data.nm info-md5sums pse2d-tmp.eip karajan.html falkon.html execute2.html info.html execute.html kickstart.html scheduler.html assorted.html log-to-execute2-transitions < /home/aespinosa/workflows/cybershake/sleep-LGU_condor.log > execute2.transitions compute-t-inf > t.inf < /home/aespinosa/workflows/cybershake/sleep-LGU_condor.log cat execute2.transitions | swap-and-sort | transitions-to-event > execute2.event log-to-dostagein-transitions < /home/aespinosa/workflows/cybershake/sleep-LGU_condor.log > dostagein.transitions cat dostagein.transitions | swap-and-sort | transitions-to-event > dostagein.event log-to-dostageout-transitions < /home/aespinosa/workflows/cybershake/sleep-LGU_condor.log > dostageout.transitions cat dostageout.transitions | swap-and-sort | transitions-to-event > dostageout.event extract-start-time > start-time.tmp execstages-plot Can't parse line 0 last-event-line no previous event cat: workflow.event: No such file or directory gnuplot> plot 'esp.execute2.tmp' with vector arrowstyle 1 title 'execute2', 'esp.dostagein.tmp' with vector arrowstyle 2 title 'dostagein', 'esp.dostageout.tmp' with vector arrowstyle 3 title 'dostageout' ^ "execstages.plot", line 15: no data point found in specified file make: *** [execstages.png] Error 1 -- Allan M. Espinosa PhD student, Computer Science University of Chicago From wilde at mcs.anl.gov Thu Jul 15 11:47:01 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 15 Jul 2010 11:47:01 -0500 (CDT) Subject: [Swift-devel] _swiftwrap logging causes problems for users on shared computer servers Message-ID: <16809659.148551279212421442.JavaMail.root@zimbra> This line in _swiftwrap is causing problems when multiple users are running coasters on the same MCS compute server: -- COMMANDLINE=$@ echo $0 $COMMANDLINE >> /tmp/swiftwrap.out -- It creates a file owned by the user, and causes the next user's jobs to fail (and/or generate a message). This line looks to me like a debugging fossil. Im going to comment out this echo in my test trunk (which several of us are testing from) and see if it has any ill effect. I think we should leave it disabled, but can leave it in as a comment for debugging hints. Need to check when/why it was added. 
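If the echo does turn out to be worth keeping for debugging, one way to avoid the collision would be a per-user file rather than a shared /tmp name; a sketch using the same variables as the snippet above:

    COMMANDLINE=$@
    echo "$0 $COMMANDLINE" >> "${TMPDIR:-/tmp}/swiftwrap.$(id -un).out"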
- Mike -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wozniak at mcs.anl.gov Thu Jul 15 11:57:29 2010 From: wozniak at mcs.anl.gov (Justin M Wozniak) Date: Thu, 15 Jul 2010 11:57:29 -0500 (Central Daylight Time) Subject: [Swift-devel] _swiftwrap logging causes problems for users on shared computer servers In-Reply-To: <16809659.148551279212421442.JavaMail.root@zimbra> References: <16809659.148551279212421442.JavaMail.root@zimbra> Message-ID: This was my mistake- fixed. On Thu, 15 Jul 2010, Michael Wilde wrote: > This line in _swiftwrap is causing problems when multiple users are > running coasters on the same MCS compute server: > > -- > COMMANDLINE=$@ > > echo $0 $COMMANDLINE >> /tmp/swiftwrap.out > -- > > It creates a file owned by the user, and causes the next user's jobs to > fail (and/or generate a message). > > This line looks to me like a debugging fossil. Im going to comment out > this echo in my test trunk (which several of us are testing from) and > see if it has any ill effect. > > I think we should leave it disabled, but can leave it in as a comment > for debugging hints. Need to check when/why it was added. > > - Mike -- Justin M Wozniak From wozniak at mcs.anl.gov Thu Jul 15 13:59:55 2010 From: wozniak at mcs.anl.gov (Justin M Wozniak) Date: Thu, 15 Jul 2010 13:59:55 -0500 (Central Daylight Time) Subject: [Swift-devel] swift-plot-log broken on trunk In-Reply-To: References: Message-ID: Fixed- let me know what happens. On Wed, 14 Jul 2010, Allan Espinosa wrote: > I guess some required plots are no longer found in the new logs: > > $ swift-plot-log sleep-LGU_condor.log execstages.png > Log file path is /home/aespinosa/workflows/cybershake/sleep-LGU_condor.log > Log is in directory /home/aespinosa/workflows/cybershake > Log basename is sleep-LGU_condor > Now in directory /tmp/swift-plot-log-TCmEupYGhEd18102 > rm -f start-times.data kickstart-times.data start-time.tmp > end-time.tmp threads.list tasks.list log *.data *.shifted *.png > *.event * > .coloured-event *.total *.tmp *.transitions *.last > karatasks-type-counts.txt index.html *.lastsummary execstages.plot > total.plot col > our.plot jobs-sites.html jobs.retrycount.summary kickstart.stats > execution-counts.txt site-duration.txt jobs.retrycount sp.plot > karatasks.coloured-sorted-event *.cedps *.stats t.inf *.seenstates > tmp-* clusterstats trname-summary sites-list.data.nm info-md5sums > pse2d-tmp.eip karajan.html falkon.html execute2.html info.html > execute.html kickstart.html scheduler.html assorted.html > log-to-execute2-transitions < > /home/aespinosa/workflows/cybershake/sleep-LGU_condor.log > > execute2.transitions > compute-t-inf > t.inf < > /home/aespinosa/workflows/cybershake/sleep-LGU_condor.log > cat execute2.transitions | swap-and-sort | transitions-to-event > execute2.event > log-to-dostagein-transitions < > /home/aespinosa/workflows/cybershake/sleep-LGU_condor.log > > dostagein.transitions > cat dostagein.transitions | swap-and-sort | transitions-to-event > > dostagein.event > log-to-dostageout-transitions < > /home/aespinosa/workflows/cybershake/sleep-LGU_condor.log > > dostageout.transitions > cat dostageout.transitions | swap-and-sort | transitions-to-event > > dostageout.event > extract-start-time > start-time.tmp > execstages-plot > Can't parse line 0 last-event-line no previous event > > cat: workflow.event: No such file or directory > > gnuplot> plot 'esp.execute2.tmp' with vector arrowstyle 1 title > 
'execute2', 'esp.dostagein.tmp' with vector arrowstyle 2 title > 'dostagein', 'esp.dostageout.tmp' with vector arrowstyle 3 title > 'dostageout' > ^ > "execstages.plot", line 15: no data point found in specified file > > make: *** [execstages.png] Error 1 > > -- Justin M Wozniak From hategan at mcs.anl.gov Thu Jul 15 18:35:49 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 15 Jul 2010 18:35:49 -0500 Subject: [Swift-devel] stuff to do In-Reply-To: <33415753.14491278953564701.JavaMail.root@zimbra> References: <33415753.14491278953564701.JavaMail.root@zimbra> Message-ID: <1279236949.25935.17.camel@blabla2.none> Most of the problems that were obvious with coaster file staging should be fixed now. I ran a few tests for 1024 cat jobs on TP with ssh:pbs with 2-8 workers/node (such that "concurrent" workers are tested) and it consistently seemed fine. I also quickly made a fake provider and I am getting a rate of about 100 j/s. So that seems not to infirm my previous suspicion. On Mon, 2010-07-12 at 11:52 -0500, Michael Wilde wrote: > Here's my view on these: > > > 2. test/fix coaster file staging > > This would be useful for both real apps and (I think) for CDM testing. I would do this first. > > I would then add: > > 5. Adjustments needed, if any, on multicore handling in PBS and SGE provider. > > 6. Adjustments and fixes for reliability and logging, if needed, in Condor-G provider. > > I expect that 5 & 6 would be small tasks, and they are not yet clearly defined. I think that other people could do them. > > Maybe add: > > 7. -tui fixes. Seems not to be working so well on recent tests; several of the screens, including the source-code view, seem not to be working. > > Then: > > > 1. make swift core faster > > I would do this second; I think you said you need about 7-10 days to try things and see what can be done, maybe more after that if the exploration suggests things that will take much (re)coding? > > > 3. standalone coaster service > > The current manual coasters is proving useful. > > 4. swift shell > > Lets defer (4) for now; if we can instead run swift repeatedly and either have the coaster worker pool re-connect quickly to each new swift, or quickly start new pools within the same cluster job(s), that would suffice for now. > > Justin, do you want to weigh in on these? > > Thanks, > > Mike > > > > The idea is that some recent changes may have shifted the existing > > priorities. So think of this from the perspective of > > user/application/publication goals rather than what you think would > > be > > "nice to have". > > > > Mihael > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From aespinosa at cs.uchicago.edu Thu Jul 15 23:32:01 2010 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Thu, 15 Jul 2010 23:32:01 -0500 Subject: [Swift-devel] swift-plot-log broken on trunk In-Reply-To: References: Message-ID: Thanks Justin. I'll try this out when I get another run. With the default logging policy there will be no execute2 statements as they are all in DEBUG level inside vdl-int.k this was the case in my run. -Allan 2010/7/15 Justin M Wozniak : > > Fixed- let me know what happens. 
> > On Wed, 14 Jul 2010, Allan Espinosa wrote: > >> I guess some required plots are no longer found in the new logs: >> >> $ swift-plot-log sleep-LGU_condor.log execstages.png >> Log file path is /home/aespinosa/workflows/cybershake/sleep-LGU_condor.log >> Log is in directory /home/aespinosa/workflows/cybershake >> Log basename is sleep-LGU_condor >> Now in directory /tmp/swift-plot-log-TCmEupYGhEd18102 >> rm -f start-times.data kickstart-times.data start-time.tmp >> end-time.tmp threads.list tasks.list log *.data *.shifted *.png >> *.event * >> .coloured-event *.total *.tmp *.transitions *.last >> karatasks-type-counts.txt index.html *.lastsummary execstages.plot >> total.plot col >> our.plot jobs-sites.html jobs.retrycount.summary kickstart.stats >> execution-counts.txt site-duration.txt jobs.retrycount sp.plot >> karatasks.coloured-sorted-event *.cedps *.stats t.inf *.seenstates >> tmp-* clusterstats ?trname-summary sites-list.data.nm info-md5sums >> pse2d-tmp.eip karajan.html falkon.html execute2.html info.html >> execute.html kickstart.html scheduler.html assorted.html >> log-to-execute2-transitions < >> /home/aespinosa/workflows/cybershake/sleep-LGU_condor.log > >> execute2.transitions >> compute-t-inf > t.inf < >> /home/aespinosa/workflows/cybershake/sleep-LGU_condor.log >> cat execute2.transitions | swap-and-sort | transitions-to-event > >> execute2.event >> log-to-dostagein-transitions < >> /home/aespinosa/workflows/cybershake/sleep-LGU_condor.log > >> dostagein.transitions >> cat dostagein.transitions | swap-and-sort | transitions-to-event > >> dostagein.event >> log-to-dostageout-transitions < >> /home/aespinosa/workflows/cybershake/sleep-LGU_condor.log > >> dostageout.transitions >> cat dostageout.transitions | swap-and-sort | transitions-to-event > >> dostageout.event >> extract-start-time > start-time.tmp >> execstages-plot >> Can't parse line ?0 last-event-line no previous event >> >> cat: workflow.event: No such file or directory >> >> gnuplot> plot 'esp.execute2.tmp' with vector arrowstyle 1 title >> 'execute2', ? ? ?'esp.dostagein.tmp' with vector arrowstyle 2 title >> 'dostagein', ? ? ?'esp.dostageout.tmp' with vector arrowstyle 3 title >> 'dostageout' >> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ^ >> ? ? ? ?"execstages.plot", line 15: no data point found in specified file >> >> make: *** [execstages.png] Error 1 >> >> > > -- > Justin M Wozniak > > -- Allan M. Espinosa PhD student, Computer Science University of Chicago From wilde at mcs.anl.gov Fri Jul 16 15:07:15 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 16 Jul 2010 15:07:15 -0500 (CDT) Subject: [Swift-devel] Swift NMI B&T testing - how to add more users? Message-ID: <22361353.206931279310835989.JavaMail.root@zimbra> Ben, Dennis Touchet, a UTB student, is gearing up to get Swift testing rolling again. Can you provide a few specific pointers? - how to clone the B&T tests you set up so that multiple Swift developers can manage them? Does this require a B&T linux host login that is separate from the B&T web login? (I was unable to log into UW's system with my web login...) - can you comment on the state of the "site" tests in the swift test dir? - any other pointers on testing tat would be useful to Dennis and the devel team? 
Thanks, Mike From wilde at mcs.anl.gov Fri Jul 16 15:10:06 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 16 Jul 2010 15:10:06 -0500 (CDT) Subject: [Swift-devel] Please upgrade communicado to latest CI linux Message-ID: <16682142.207021279311006408.JavaMail.root@zimbra> Hi CI Support, I think you were just waiting for our go-ahead to upgrade communicado. Can you proceed, schedule a time next week, and just notify the two lists above about the upgrade time? Thanks, Mike From wilde at mcs.anl.gov Fri Jul 16 15:15:59 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 16 Jul 2010 15:15:59 -0500 (CDT) Subject: [Swift-devel] Swift NMI B&T testing - how to add more users? In-Reply-To: Message-ID: <29186856.207281279311359233.JavaMail.root@zimbra> [cc'ing swift-devel] Hi Allan, I have no special access to UW or B&T, but sure, feel free to use my name. I *think* that if you follow the general procedure for getting B&T access, that this is the best system for doing your builds. What I dont understand yet is: 1) whether we each need to get linus logins separate from our web logins 2) whether we need to create some kind of "Swift" project for sharing files, run logs, administartive control over swift tests, etc. - Mike ----- "Allan Espinosa" wrote: > Hi Mike, > > On a side note, I applied for UW account yesterday independently > because I needed access to specific type of machine architectures for > building codes for OSG deployments (i.e. SCEC Cybershake) . Will > reapplying under your name expedite the process? > > Thanks, > -Allan > > 2010/7/16 Michael Wilde : > > Ben, > > > > Dennis Touchet, a UTB student, is gearing up to get Swift testing > rolling again. Can you provide a few specific pointers? > > > > - how to clone the B&T tests you set up so that multiple Swift > developers can manage them? ?Does this require a B&T linux host login > that is separate from the B&T web login? (I was unable to log into > UW's system with my web login...) > > > > - can you comment on the state of the "site" tests in the swift test > dir? > > > > - any other pointers on testing tat would be useful to Dennis and > the devel team? > > > > Thanks, > > > > Mike > > -- > Allan M. Espinosa > PhD student, Computer Science > University of Chicago -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From support at ci.uchicago.edu Fri Jul 16 15:37:29 2010 From: support at ci.uchicago.edu (David Forero) Date: Fri, 16 Jul 2010 15:37:29 -0500 Subject: [Swift-devel] [CI Ticketing System #5791] Communicado upgrade In-Reply-To: <16682142.207021279311006408.JavaMail.root@zimbra> References: <16682142.207021279311006408.JavaMail.root@zimbra> Message-ID: Next Tuesday 20 July at 8am we will be taking communicado.ci.uchicago.edu down for an upgrade. Please use bridled.ci.uchicago.edu in its stead. Communicado should be back online by the end of the day If you have any questions, please contact support at ci uchicago.edu. Thank you for your cooperation. 
-- David Forero System Administrator Computation Institute University of Chicago 773-834-4102 From support at ci.uchicago.edu Fri Jul 16 15:37:30 2010 From: support at ci.uchicago.edu (David Forero) Date: Fri, 16 Jul 2010 15:37:30 -0500 Subject: [Swift-devel] [CI Ticketing System #5791] Communicado upgrade In-Reply-To: <16682142.207021279311006408.JavaMail.root@zimbra> References: <16682142.207021279311006408.JavaMail.root@zimbra> Message-ID: Next Tuesday 20 July at 8am we will be taking communicado.ci.uchicago.edu down for an upgrade. Please use bridled.ci.uchicago.edu in its stead. Communicado should be back online by the end of the day If you have any questions, please contact support at ci uchicago.edu. Thank you for your cooperation. -- David Forero System Administrator Computation Institute University of Chicago 773-834-4102 From benc at hawaga.org.uk Sun Jul 18 12:36:48 2010 From: benc at hawaga.org.uk (Ben Clifford) Date: Sun, 18 Jul 2010 17:36:48 +0000 (GMT) Subject: [Swift-devel] Re: Swift NMI B&T testing - how to add more users? In-Reply-To: <22361353.206931279310835989.JavaMail.root@zimbra> References: <22361353.206931279310835989.JavaMail.root@zimbra> Message-ID: > - how to clone the B&T tests you set up so that multiple Swift > developers can manage them? Does this require a B&T linux host login > that is separate from the B&T web login? (I was unable to log into UW's > system with my web login...) I don't know anything about web logins for NMI - I only (as far as I know) had a linux shell login. The way I had it set up, the tests from SVN (at least ones which didn't require credentials) were run regularly. If you wanted to add new tests there then adding them to SVN would cause them to be run on the NMI systems, in the same way as it would cause them to be run by other developers who run the tests themselves on their own systems. My home directory may still be in place on the NMI system, and I think I probably told them that they could share the contents with anyone; if both of those are true, you might be able to find all the scripts I had there. Having multiple people edit files on the NMI machines - I guess the nmi people have (or will create for you) some policy on that. > - can you comment on the state of the "site" tests in the swift test dir? It was hard to get those working reliably enough for them to be useful to run - by that I mean that if one of the local tests failed, then it was usually a problem that had recently been introduced into the swift stack; but a site test failing was often because of a problem with the site. That was a reflection of the difficulty on getting swift running and keeping it running on many different sites. You can look at the script to run the tests. Its probably useful. But the actual site definitions are presumably very rotted. -- From hategan at mcs.anl.gov Mon Jul 19 01:36:25 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 19 Jul 2010 01:36:25 -0500 Subject: [Swift-devel] stuff to do In-Reply-To: <1279236949.25935.17.camel@blabla2.none> References: <33415753.14491278953564701.JavaMail.root@zimbra> <1279236949.25935.17.camel@blabla2.none> Message-ID: <1279521385.23339.1.camel@blabla2.none> On Thu, 2010-07-15 at 18:35 -0500, Mihael Hategan wrote: > > I also quickly made a fake provider and I am getting a rate of about 100 > j/s. So that seems not to infirm my previous suspicion. Well, it turns out that the flushing the restart log to disk takes some time. As in if I remove the call to flush() I can get 800 jobs/s. 
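A minimal sketch of how that flush could be made settable, not the actual Karajan restart-log code; the class, property name and file format here are invented for illustration:

    import java.io.BufferedWriter;
    import java.io.FileWriter;
    import java.io.IOException;

    // Sketch only: gate the per-entry flush behind a property so durability can be
    // traded for throughput (the thread reports ~100 jobs/s with flushing vs ~800 without).
    public class RestartLogSketch {
        private static final boolean FLUSH_EACH_ENTRY =
            Boolean.parseBoolean(System.getProperty("restartlog.flush", "true"));

        private final BufferedWriter out;

        public RestartLogSketch(String path) throws IOException {
            out = new BufferedWriter(new FileWriter(path, true));
        }

        public synchronized void append(String entry) throws IOException {
            out.write(entry);
            out.newLine();
            if (FLUSH_EACH_ENTRY) {
                out.flush();   // safest setting: each completed job is on disk before we move on
            }
        }

        public synchronized void close() throws IOException {
            out.flush();       // always flush once at shutdown
            out.close();
        }
    }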
From benc at hawaga.org.uk Mon Jul 19 04:10:12 2010 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 19 Jul 2010 09:10:12 +0000 (GMT) Subject: [Swift-devel] Re: Swift NMI B&T testing - how to add more users? In-Reply-To: References: <22361353.206931279310835989.JavaMail.root@zimbra> Message-ID: I poked around a bit more. The svn has/had a top level directory nmi-build-test. In there, there is a subdirectory called submit-machine. That contains most/all of the files I had on the NMI build machine. In there: build-hourly was run by cron every hour and contains the logic to do the almost-per-commit tests; and build-daily was run by cron every day and does the several-different-architectures tests. Start by trying to make build-hourly run manually on the NMI machine and I think it should be straightforward to get it running. -- http://www.hawaga.org.uk/ben/ From wilde at mcs.anl.gov Mon Jul 19 09:41:51 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 19 Jul 2010 08:41:51 -0600 (GMT-06:00) Subject: [Swift-devel] stuff to do In-Reply-To: <1279521385.23339.1.camel@blabla2.none> Message-ID: <321209788.22551279550511201.JavaMail.root@zimbra.anl.gov> Way cool. Can you make restart/flush a settable property? - Mike ----- "Mihael Hategan" wrote: > On Thu, 2010-07-15 at 18:35 -0500, Mihael Hategan wrote: > > > > I also quickly made a fake provider and I am getting a rate of about > 100 > > j/s. So that seems not to infirm my previous suspicion. > > Well, it turns out that the flushing the restart log to disk takes > some > time. As in if I remove the call to flush() I can get 800 jobs/s. -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From dk0966 at cs.ship.edu Tue Jul 20 07:00:23 2010 From: dk0966 at cs.ship.edu (David Kelly) Date: Tue, 20 Jul 2010 08:00:23 -0400 Subject: [Swift-devel] Swift shell script and JAVA_HOME Message-ID: Hello, I noticed on login.ci.uchicago.edu that I was not able to launch swift. Even though I had "+java-sun" in ~/.soft, the system was pointing me to gcj. Normally in ubuntu, something like "update-java-alternatives -s java-6-sun" lets you switch JVMs, but as far I know it makes changes system wide (requiring root access) and not on a per-user basis. Then I set $JAVA_HOME to the correct path and it still wouldn't launch. Should the swift shell script test for $JAVA_HOME to determine the correct location? Maybe something like this would work: ### EXECUTE ############ if test -n "$CYGWIN"; then set CLASSPATHSAVE=$CLASSPATH export CLASSPATH="$LOCALCLASSPATH" eval java ${OPTIONS} ${COG_OPTS} ${EXEC} ${CMDLINE} export CLASSPATH=$CLASSPATHSAVE else if [ -n "$JAVA_HOME" ]; then eval $JAVA_HOME/bin/java ${OPTIONS} ${COG_OPTS} -classpath ${LOCALCLASSPATH} ${EXEC} ${CMDLINE} else eval java ${OPTIONS} ${COG_OPTS} -classpath ${LOCALCLASSPATH} ${EXEC} ${CMDLINE} fi fi return_code=$? exit $return_code -------------- next part -------------- An HTML attachment was scrubbed... URL: From wilde at mcs.anl.gov Tue Jul 20 08:43:15 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 20 Jul 2010 07:43:15 -0600 (GMT-06:00) Subject: [Swift-devel] Swift shell script and JAVA_HOME In-Reply-To: Message-ID: <1077109173.60151279633395534.JavaMail.root@zimbra.anl.gov> David, ----- "David Kelly" wrote: > Hello, > > I noticed on login.ci.uchicago.edu that I was not able to launch > swift. Even though I had "+java-sun" in ~/.soft, the system was > pointing me to gcj. 
In my .soft i have +java_sub before @default in .soft. That seems to work on login.ci. Can you try that? > Normally in ubuntu, something like > "update-java-alternatives -s java-6-sun" lets you switch JVMs, but as > far I know it makes changes system wide (requiring root access) and > not on a per-user basis. Then I set $JAVA_HOME to the correct path and > it still wouldn't launch. What error were you getting? Perhaps check if other *JAVA* env vars are still pointing to the wrong Java, eg: login$ env | grep -i java JRE_HOME=/soft/java-1.5.0_06-sun-r1/jre MATLAB_JAVA=/soft/matlab-7.7-r1/java JAVA_BINDIR=/soft/java-1.5.0_06-sun-r1/bin JAVA_HOME=/soft/java-1.5.0_06-sun-r1 SDK_HOME=/soft/java-1.5.0_06-sun-r1 JDK_HOME=/soft/java-1.5.0_06-sun-r1 JAVA_ROOT=/soft/java-1.5.0_06-sun-r1 and make sure that CLASSPATH is *not* set. > Should the swift shell script test for > $JAVA_HOME to determine the correct location? In my experience, Ive always tried to leave JAVA_HOME unset, have no JAVA vars in my env, and make sure that the right Java is in the PATH. I suspect Mihael and/or Justin need to weigh in on whats best; and then we should document that in the user guide under "Running Swift". - Mike > Maybe something like > this would work: > > ### EXECUTE ############ > if test -n "$CYGWIN"; then > set CLASSPATHSAVE=$CLASSPATH > export CLASSPATH="$LOCALCLASSPATH" > eval java ${OPTIONS} ${COG_OPTS} ${EXEC} ${CMDLINE} > export CLASSPATH=$CLASSPATHSAVE > else > if [ -n "$JAVA_HOME" ]; then > eval $JAVA_HOME/bin/java ${OPTIONS} ${COG_OPTS} -classpath > ${LOCALCLASSPATH} ${EXEC} ${CMDLINE} > else > eval java ${OPTIONS} ${COG_OPTS} -classpath ${LOCALCLASSPATH} ${EXEC} > ${CMDLINE} > fi > fi > return_code=$? > > exit $return_code > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wilde at mcs.anl.gov Tue Jul 20 10:20:29 2010 From: wilde at mcs.anl.gov (wilde at mcs.anl.gov) Date: Tue, 20 Jul 2010 09:20:29 -0600 (GMT-06:00) Subject: [Swift-devel] Problems with coaster data provider In-Reply-To: <41072379.65871279639120318.JavaMail.root@zimbra.anl.gov> Message-ID: <1289205052.66121279639229161.JavaMail.root@zimbra.anl.gov> I tried the coaster data provider from MCS host vanquish to crush (2 of the compute servers) via ssh:local and get the error: "org.globus.cog.abstraction.impl.file.FileResourceException: org.globus.cog.karajan.workflow.service.ProtocolException: Unknown command: #!/BIN/BASH" (Full error text below). Has anyone else tried the coaster data provider? My sites file has the single pool: 8 3500 1 1 1 .07 10000 /home/wilde/swiftwork/crush Is that the correct url= value? 
I set these properties: wrapperlog.always.transfer=false sitedir.keep=true execution.retries=0 status.mode=provider The run command, svn version, and full error text on stdout/err is: vanquish$ swift -tc.file tc -sites.file crushds.xml -config cf catsn.swift -n=1 Swift svn swift-r3449 cog-r2816 RunID: 20100720-1006-z1vio8i1 Progress: Progress: Failed:1 Execution failed: Could not initialize shared directory on crush Caused by: org.globus.cog.abstraction.impl.file.FileResourceException: org.globus.cog.karajan.workflow.service.ProtocolException: Unknown command: #!/BIN/BASH # THIS SCRIPT MUST BE INVOKED INSIDE OF BASH, NOT PLAIN SH # NOTE THAT THIS SCRIPT MODIFIES $IFS INFOSECTION() { ...full text of _swiftwrap shows up here, in upper case... # ENSURE WE EXIT WITH A 0 AFTER A SUCCESSFUL EXECUTION EXIT 0 # LOCAL VARIABLES: # MODE: SH # SH-BASIC-OFFSET: 8 # END: Cleaning up... Shutting down service at https://140.221.8.62:59300 Got channel MetaChannel: 2039421489[205498061: {}] -> GSSSChannel-0494700354(1)[205498061: {}] + Done vanquish$ The _swiftwrap file was created in the workdirectory shared/ subdir but has length zero. So presumably it was about to be transferred but the transfer failed. vanquish$ ls -lR /home/wilde/swiftwork/crush/*8i1 /home/wilde/swiftwork/crush/catsn-20100720-1006-z1vio8i1: total 2 drwxr-sr-x 2 wilde mcsz 3 Jul 20 10:06 shared/ /home/wilde/swiftwork/crush/catsn-20100720-1006-z1vio8i1/shared: total 1 -rw-r--r-- 1 wilde mcsz 0 Jul 20 10:06 _swiftwrap vanquish$ ----- "Mihael Hategan" wrote: > Most of the problems that were obvious with coaster file staging > should > be fixed now. I ran a few tests for 1024 cat jobs on TP with ssh:pbs > with 2-8 workers/node (such that "concurrent" workers are tested) and > it > consistently seemed fine. > > I also quickly made a fake provider and I am getting a rate of about > 100 > j/s. So that seems not to infirm my previous suspicion. > > On Mon, 2010-07-12 at 11:52 -0500, Michael Wilde wrote: > > Here's my view on these: > > > > > 2. test/fix coaster file staging > > > > This would be useful for both real apps and (I think) for CDM > testing. I would do this first. > > > > I would then add: > > > > 5. Adjustments needed, if any, on multicore handling in PBS and SGE > provider. > > > > 6. Adjustments and fixes for reliability and logging, if needed, in > Condor-G provider. > > > > I expect that 5 & 6 would be small tasks, and they are not yet > clearly defined. I think that other people could do them. > > > > Maybe add: > > > > 7. -tui fixes. Seems not to be working so well on recent tests; > several of the screens, including the source-code view, seem not to be > working. > > > > Then: > > > > > 1. make swift core faster > > > > I would do this second; I think you said you need about 7-10 days to > try things and see what can be done, maybe more after that if the > exploration suggests things that will take much (re)coding? > > > > > 3. standalone coaster service > > > > The current manual coasters is proving useful. > > > 4. swift shell > > > > Lets defer (4) for now; if we can instead run swift repeatedly and > either have the coaster worker pool re-connect quickly to each new > swift, or quickly start new pools within the same cluster job(s), that > would suffice for now. > > > > Justin, do you want to weigh in on these? > > > > Thanks, > > > > Mike > > > > > > > The idea is that some recent changes may have shifted the > existing > > > priorities. 
So think of this from the perspective of > > > user/application/publication goals rather than what you think > would > > > be > > > "nice to have". > > > > > > Mihael > > > > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From dk0966 at cs.ship.edu Tue Jul 20 10:27:30 2010 From: dk0966 at cs.ship.edu (David Kelly) Date: Tue, 20 Jul 2010 11:27:30 -0400 Subject: [Swift-devel] Swift shell script and JAVA_HOME In-Reply-To: <1077109173.60151279633395534.JavaMail.root@zimbra.anl.gov> References: <1077109173.60151279633395534.JavaMail.root@zimbra.anl.gov> Message-ID: On Tue, Jul 20, 2010 at 9:43 AM, Michael Wilde wrote: In my .soft i have +java_sub before @default in .soft. That seems to work on > login.ci. Can you try that? > > Changing the ordering fixed it for me as well. Thanks. > What error were you getting? > > Perhaps check if other *JAVA* env vars are still pointing to the wrong > Java, eg: > > login$ env | grep -i java > JRE_HOME=/soft/java-1.5.0_06-sun-r1/jre > MATLAB_JAVA=/soft/matlab-7.7-r1/java > JAVA_BINDIR=/soft/java-1.5.0_06-sun-r1/bin > JAVA_HOME=/soft/java-1.5.0_06-sun-r1 > SDK_HOME=/soft/java-1.5.0_06-sun-r1 > JDK_HOME=/soft/java-1.5.0_06-sun-r1 > JAVA_ROOT=/soft/java-1.5.0_06-sun-r1 > > and make sure that CLASSPATH is *not* set. > If I have @default first followed +java_sun, even if all of the JAVA variables correctly pointing to sun java it will use gcj. I think this is due to the ordering of directories in $PATH. Having users adjust their PATH with the location of sun java ahead of directories like /usr/bin is probably the solution. Here are the errors I was getting: $ swift manyparam.swift Warning: -Xmx256M not understood. Ignoring. log4j:ERROR Error occured while converting date. java.lang.IllegalArgumentException: Illegal pattern character at java.text.SimpleDateFormat.format(java.util.Date, java.lang.StringBuffer, java.text.FieldPosition) (/usr/lib/libgcj.so.5.0.0) at java.text.DateFormat.format(java.util.Date) (/usr/lib/libgcj.so.5.0.0) at org.apache.log4j.helpers.PatternParser$DatePatternConverter.convert(org.apache.log4j.spi.LoggingEvent) (Unknown Source) at org.apache.log4j.helpers.PatternConverter.format(java.lang.StringBuffer, org.apache.log4j.spi.LoggingEvent) (Unknown Source) at org.apache.log4j.PatternLayout.format(org.apache.log4j.spi.LoggingEvent) (Unknown Source) at org.apache.log4j.WriterAppender.subAppend(org.apache.log4j.spi.LoggingEvent) (Unknown Source) at org.apache.log4j.WriterAppender.append(org.apache.log4j.spi.LoggingEvent) (Unknown Source) at org.apache.log4j.AppenderSkeleton.doAppend(org.apache.log4j.spi.LoggingEvent) (Unknown Source) at org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(org.apache.log4j.spi.LoggingEvent) (Unknown Source) at org.apache.log4j.Category.callAppenders(org.apache.log4j.spi.LoggingEvent) (Unknown Source) at org.apache.log4j.Category.forcedLog(java.lang.String, org.apache.log4j.Priority, java.lang.Object, java.lang.Throwable) (Unknown Source) at org.apache.log4j.Category.debug(java.lang.Object) (Unknown Source) at org.griphyn.vdl.karajan.Loader.main(java.lang.String[]) (Unknown Source) log4j:ERROR Error occured while converting date. 
java.lang.IllegalArgumentException: Illegal pattern character at java.text.SimpleDateFormat.format(java.util.Date, java.lang.StringBuffer, java.text.FieldPosition) (/usr/lib/libgcj.so.5.0.0) at java.text.DateFormat.format(java.util.Date) (/usr/lib/libgcj.so.5.0.0) at org.apache.log4j.helpers.PatternParser$DatePatternConverter.convert(org.apache.log4j.spi.LoggingEvent) (Unknown Source) at org.apache.log4j.helpers.PatternConverter.format(java.lang.StringBuffer, org.apache.log4j.spi.LoggingEvent) (Unknown Source) at org.apache.log4j.PatternLayout.format(org.apache.log4j.spi.LoggingEvent) (Unknown Source) at org.apache.log4j.WriterAppender.subAppend(org.apache.log4j.spi.LoggingEvent) (Unknown Source) at org.apache.log4j.WriterAppender.append(org.apache.log4j.spi.LoggingEvent) (Unknown Source) at org.apache.log4j.AppenderSkeleton.doAppend(org.apache.log4j.spi.LoggingEvent) (Unknown Source) at org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(org.apache.log4j.spi.LoggingEvent) (Unknown Source) at org.apache.log4j.Category.callAppenders(org.apache.log4j.spi.LoggingEvent) (Unknown Source) at org.apache.log4j.Category.forcedLog(java.lang.String, org.apache.log4j.Priority, java.lang.Object, java.lang.Throwable) (Unknown Source) at org.apache.log4j.Category.debug(java.lang.Object) (Unknown Source) at org.griphyn.vdl.karajan.Loader.main(java.lang.String[]) (Unknown Source) log4j:ERROR Error occured while converting date. java.lang.IllegalArgumentException: Illegal pattern character at java.text.SimpleDateFormat.format(java.util.Date, java.lang.StringBuffer, java.text.FieldPosition) (/usr/lib/libgcj.so.5.0.0) at java.text.DateFormat.format(java.util.Date) (/usr/lib/libgcj.so.5.0.0) at org.apache.log4j.helpers.PatternParser$DatePatternConverter.convert(org.apache.log4j.spi.LoggingEvent) (Unknown Source) at org.apache.log4j.helpers.PatternConverter.format(java.lang.StringBuffer, org.apache.log4j.spi.LoggingEvent) (Unknown Source) at org.apache.log4j.PatternLayout.format(org.apache.log4j.spi.LoggingEvent) (Unknown Source) at org.apache.log4j.WriterAppender.subAppend(org.apache.log4j.spi.LoggingEvent) (Unknown Source) at org.apache.log4j.WriterAppender.append(org.apache.log4j.spi.LoggingEvent) (Unknown Source) at org.apache.log4j.AppenderSkeleton.doAppend(org.apache.log4j.spi.LoggingEvent) (Unknown Source) at org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(org.apache.log4j.spi.LoggingEvent) (Unknown Source) at org.apache.log4j.Category.callAppenders(org.apache.log4j.spi.LoggingEvent) (Unknown Source) at org.apache.log4j.Category.forcedLog(java.lang.String, org.apache.log4j.Priority, java.lang.Object, java.lang.Throwable) (Unknown Source) at org.apache.log4j.Category.debug(java.lang.Object) (Unknown Source) at org.griphyn.vdl.karajan.Loader.compile(java.lang.String) (Unknown Source) at org.griphyn.vdl.karajan.Loader.main(java.lang.String[]) (Unknown Source) log4j:ERROR Error occured while converting date. 
java.lang.IllegalArgumentException: Illegal pattern character at java.text.SimpleDateFormat.format(java.util.Date, java.lang.StringBuffer, java.text.FieldPosition) (/usr/lib/libgcj.so.5.0.0) at java.text.DateFormat.format(java.util.Date) (/usr/lib/libgcj.so.5.0.0) at org.apache.log4j.helpers.PatternParser$DatePatternConverter.convert(org.apache.log4j.spi.LoggingEvent) (Unknown Source) at org.apache.log4j.helpers.PatternConverter.format(java.lang.StringBuffer, org.apache.log4j.spi.LoggingEvent) (Unknown Source) at org.apache.log4j.PatternLayout.format(org.apache.log4j.spi.LoggingEvent) (Unknown Source) at org.apache.log4j.WriterAppender.subAppend(org.apache.log4j.spi.LoggingEvent) (Unknown Source) at org.apache.log4j.WriterAppender.append(org.apache.log4j.spi.LoggingEvent) (Unknown Source) at org.apache.log4j.AppenderSkeleton.doAppend(org.apache.log4j.spi.LoggingEvent) (Unknown Source) at org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(org.apache.log4j.spi.LoggingEvent) (Unknown Source) at org.apache.log4j.Category.callAppenders(org.apache.log4j.spi.LoggingEvent) (Unknown Source) at org.apache.log4j.Category.forcedLog(java.lang.String, org.apache.log4j.Priority, java.lang.Object, java.lang.Throwable) (Unknown Source) at org.apache.log4j.Category.debug(java.lang.Object) (Unknown Source) at org.griphyn.vdl.karajan.Loader.compile(java.lang.String) (Unknown Source) at org.griphyn.vdl.karajan.Loader.main(java.lang.String[]) (Unknown Source) log4j:ERROR Error occured while converting date. java.lang.IllegalArgumentException: Illegal pattern character at java.text.SimpleDateFormat.format(java.util.Date, java.lang.StringBuffer, java.text.FieldPosition) (/usr/lib/libgcj.so.5.0.0) at java.text.DateFormat.format(java.util.Date) (/usr/lib/libgcj.so.5.0.0) at org.apache.log4j.helpers.PatternParser$DatePatternConverter.convert(org.apache.log4j.spi.LoggingEvent) (Unknown Source) at org.apache.log4j.helpers.PatternConverter.format(java.lang.StringBuffer, org.apache.log4j.spi.LoggingEvent) (Unknown Source) at org.apache.log4j.PatternLayout.format(org.apache.log4j.spi.LoggingEvent) (Unknown Source) at org.apache.log4j.WriterAppender.subAppend(org.apache.log4j.spi.LoggingEvent) (Unknown Source) at org.apache.log4j.WriterAppender.append(org.apache.log4j.spi.LoggingEvent) (Unknown Source) at org.apache.log4j.AppenderSkeleton.doAppend(org.apache.log4j.spi.LoggingEvent) (Unknown Source) at org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(org.apache.log4j.spi.LoggingEvent) (Unknown Source) at org.apache.log4j.Category.callAppenders(org.apache.log4j.spi.LoggingEvent) (Unknown Source) at org.apache.log4j.Category.forcedLog(java.lang.String, org.apache.log4j.Priority, java.lang.Object, java.lang.Throwable) (Unknown Source) at org.apache.log4j.Category.debug(java.lang.Object) (Unknown Source) at org.griphyn.vdl.karajan.Loader.compile(java.lang.String) (Unknown Source) at org.griphyn.vdl.karajan.Loader.main(java.lang.String[]) (Unknown Source) log4j:ERROR Error occured while converting date. log4j:ERROR Error occured while converting date. Exception in thread "main" java.lang.NullPointerException *** Got java.lang.NullPointerException while trying to print stack trace. David -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From hategan at mcs.anl.gov Tue Jul 20 12:22:07 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 20 Jul 2010 12:22:07 -0500 Subject: [Swift-devel] Re: Problems with coaster data provider In-Reply-To: <1289205052.66121279639229161.JavaMail.root@zimbra.anl.gov> References: <1289205052.66121279639229161.JavaMail.root@zimbra.anl.gov> Message-ID: <1279646527.16305.0.camel@blabla2.none> That is odd. It looks like all characters for the swift wrapper are in uppercase. On Tue, 2010-07-20 at 09:20 -0600, wilde at mcs.anl.gov wrote: > I tried the coaster data provider from MCS host vanquish to crush (2 of the compute servers) via ssh:local and get the error: > > "org.globus.cog.abstraction.impl.file.FileResourceException: org.globus.cog.karajan.workflow.service.ProtocolException: Unknown command: #!/BIN/BASH" (Full error text below). > > Has anyone else tried the coaster data provider? > > My sites file has the single pool: > > > > 8 > 3500 > 1 > 1 > 1 > > .07 > 10000 > > > /home/wilde/swiftwork/crush > > > Is that the correct url= value? > > I set these properties: > > wrapperlog.always.transfer=false > sitedir.keep=true > execution.retries=0 > status.mode=provider > > The run command, svn version, and full error text on stdout/err is: > > vanquish$ swift -tc.file tc -sites.file crushds.xml -config cf catsn.swift -n=1 > Swift svn swift-r3449 cog-r2816 > > RunID: 20100720-1006-z1vio8i1 > Progress: > Progress: Failed:1 > Execution failed: > Could not initialize shared directory on crush > Caused by: > org.globus.cog.abstraction.impl.file.FileResourceException: org.globus.cog.karajan.workflow.service.ProtocolException: Unknown command: #!/BIN/BASH > # THIS SCRIPT MUST BE INVOKED INSIDE OF BASH, NOT PLAIN SH > # NOTE THAT THIS SCRIPT MODIFIES $IFS > > INFOSECTION() { > > ...full text of _swiftwrap shows up here, in upper case... > > # ENSURE WE EXIT WITH A 0 AFTER A SUCCESSFUL EXECUTION > EXIT 0 > > # LOCAL VARIABLES: > # MODE: SH > # SH-BASIC-OFFSET: 8 > # END: > > Cleaning up... > Shutting down service at https://140.221.8.62:59300 > Got channel MetaChannel: 2039421489[205498061: {}] -> GSSSChannel-0494700354(1)[205498061: {}] > + Done > vanquish$ > > The _swiftwrap file was created in the workdirectory shared/ subdir but has length zero. So presumably it was about to be transferred but the transfer failed. > > vanquish$ ls -lR /home/wilde/swiftwork/crush/*8i1 > /home/wilde/swiftwork/crush/catsn-20100720-1006-z1vio8i1: > total 2 > drwxr-sr-x 2 wilde mcsz 3 Jul 20 10:06 shared/ > > /home/wilde/swiftwork/crush/catsn-20100720-1006-z1vio8i1/shared: > total 1 > -rw-r--r-- 1 wilde mcsz 0 Jul 20 10:06 _swiftwrap > vanquish$ > > ----- "Mihael Hategan" wrote: > > > Most of the problems that were obvious with coaster file staging > > should > > be fixed now. I ran a few tests for 1024 cat jobs on TP with ssh:pbs > > with 2-8 workers/node (such that "concurrent" workers are tested) and > > it > > consistently seemed fine. > > > > I also quickly made a fake provider and I am getting a rate of about > > 100 > > j/s. So that seems not to infirm my previous suspicion. > > > > On Mon, 2010-07-12 at 11:52 -0500, Michael Wilde wrote: > > > Here's my view on these: > > > > > > > 2. test/fix coaster file staging > > > > > > This would be useful for both real apps and (I think) for CDM > > testing. I would do this first. > > > > > > I would then add: > > > > > > 5. Adjustments needed, if any, on multicore handling in PBS and SGE > > provider. > > > > > > 6. 
Adjustments and fixes for reliability and logging, if needed, in > > Condor-G provider. > > > > > > I expect that 5 & 6 would be small tasks, and they are not yet > > clearly defined. I think that other people could do them. > > > > > > Maybe add: > > > > > > 7. -tui fixes. Seems not to be working so well on recent tests; > > several of the screens, including the source-code view, seem not to be > > working. > > > > > > Then: > > > > > > > 1. make swift core faster > > > > > > I would do this second; I think you said you need about 7-10 days to > > try things and see what can be done, maybe more after that if the > > exploration suggests things that will take much (re)coding? > > > > > > > 3. standalone coaster service > > > > > > The current manual coasters is proving useful. > > > > 4. swift shell > > > > > > Lets defer (4) for now; if we can instead run swift repeatedly and > > either have the coaster worker pool re-connect quickly to each new > > swift, or quickly start new pools within the same cluster job(s), that > > would suffice for now. > > > > > > Justin, do you want to weigh in on these? > > > > > > Thanks, > > > > > > Mike > > > > > > > > > > The idea is that some recent changes may have shifted the > > existing > > > > priorities. So think of this from the perspective of > > > > user/application/publication goals rather than what you think > > would > > > > be > > > > "nice to have". > > > > > > > > Mihael > > > > > > > > _______________________________________________ > > > > Swift-devel mailing list > > > > Swift-devel at ci.uchicago.edu > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > From hategan at mcs.anl.gov Tue Jul 20 12:29:25 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 20 Jul 2010 12:29:25 -0500 Subject: [Swift-devel] Re: Problems with coaster data provider In-Reply-To: <1279646527.16305.0.camel@blabla2.none> References: <1289205052.66121279639229161.JavaMail.root@zimbra.anl.gov> <1279646527.16305.0.camel@blabla2.none> Message-ID: <1279646965.16565.1.camel@blabla2.none> On Tue, 2010-07-20 at 12:22 -0500, Mihael Hategan wrote: > That is odd. It looks like all characters for the swift wrapper are in > uppercase. Actually it looks like something is off there. Btw, the coaster provider staging is different from the coaster data provider. If you want the former, say use.provider.staging=true in swift.properties. > > > On Tue, 2010-07-20 at 09:20 -0600, wilde at mcs.anl.gov wrote: > > I tried the coaster data provider from MCS host vanquish to crush (2 of the compute servers) via ssh:local and get the error: > > > > "org.globus.cog.abstraction.impl.file.FileResourceException: org.globus.cog.karajan.workflow.service.ProtocolException: Unknown command: #!/BIN/BASH" (Full error text below). > > > > Has anyone else tried the coaster data provider? > > > > My sites file has the single pool: > > > > > > > > 8 > > 3500 > > 1 > > 1 > > 1 > > > > .07 > > 10000 > > > > > > /home/wilde/swiftwork/crush > > > > > > Is that the correct url= value? 
> > > > I set these properties: > > > > wrapperlog.always.transfer=false > > sitedir.keep=true > > execution.retries=0 > > status.mode=provider > > > > The run command, svn version, and full error text on stdout/err is: > > > > vanquish$ swift -tc.file tc -sites.file crushds.xml -config cf catsn.swift -n=1 > > Swift svn swift-r3449 cog-r2816 > > > > RunID: 20100720-1006-z1vio8i1 > > Progress: > > Progress: Failed:1 > > Execution failed: > > Could not initialize shared directory on crush > > Caused by: > > org.globus.cog.abstraction.impl.file.FileResourceException: org.globus.cog.karajan.workflow.service.ProtocolException: Unknown command: #!/BIN/BASH > > # THIS SCRIPT MUST BE INVOKED INSIDE OF BASH, NOT PLAIN SH > > # NOTE THAT THIS SCRIPT MODIFIES $IFS > > > > INFOSECTION() { > > > > ...full text of _swiftwrap shows up here, in upper case... > > > > # ENSURE WE EXIT WITH A 0 AFTER A SUCCESSFUL EXECUTION > > EXIT 0 > > > > # LOCAL VARIABLES: > > # MODE: SH > > # SH-BASIC-OFFSET: 8 > > # END: > > > > Cleaning up... > > Shutting down service at https://140.221.8.62:59300 > > Got channel MetaChannel: 2039421489[205498061: {}] -> GSSSChannel-0494700354(1)[205498061: {}] > > + Done > > vanquish$ > > > > The _swiftwrap file was created in the workdirectory shared/ subdir but has length zero. So presumably it was about to be transferred but the transfer failed. > > > > vanquish$ ls -lR /home/wilde/swiftwork/crush/*8i1 > > /home/wilde/swiftwork/crush/catsn-20100720-1006-z1vio8i1: > > total 2 > > drwxr-sr-x 2 wilde mcsz 3 Jul 20 10:06 shared/ > > > > /home/wilde/swiftwork/crush/catsn-20100720-1006-z1vio8i1/shared: > > total 1 > > -rw-r--r-- 1 wilde mcsz 0 Jul 20 10:06 _swiftwrap > > vanquish$ > > > > ----- "Mihael Hategan" wrote: > > > > > Most of the problems that were obvious with coaster file staging > > > should > > > be fixed now. I ran a few tests for 1024 cat jobs on TP with ssh:pbs > > > with 2-8 workers/node (such that "concurrent" workers are tested) and > > > it > > > consistently seemed fine. > > > > > > I also quickly made a fake provider and I am getting a rate of about > > > 100 > > > j/s. So that seems not to infirm my previous suspicion. > > > > > > On Mon, 2010-07-12 at 11:52 -0500, Michael Wilde wrote: > > > > Here's my view on these: > > > > > > > > > 2. test/fix coaster file staging > > > > > > > > This would be useful for both real apps and (I think) for CDM > > > testing. I would do this first. > > > > > > > > I would then add: > > > > > > > > 5. Adjustments needed, if any, on multicore handling in PBS and SGE > > > provider. > > > > > > > > 6. Adjustments and fixes for reliability and logging, if needed, in > > > Condor-G provider. > > > > > > > > I expect that 5 & 6 would be small tasks, and they are not yet > > > clearly defined. I think that other people could do them. > > > > > > > > Maybe add: > > > > > > > > 7. -tui fixes. Seems not to be working so well on recent tests; > > > several of the screens, including the source-code view, seem not to be > > > working. > > > > > > > > Then: > > > > > > > > > 1. make swift core faster > > > > > > > > I would do this second; I think you said you need about 7-10 days to > > > try things and see what can be done, maybe more after that if the > > > exploration suggests things that will take much (re)coding? > > > > > > > > > 3. standalone coaster service > > > > > > > > The current manual coasters is proving useful. > > > > > 4. 
swift shell > > > > > > > > Lets defer (4) for now; if we can instead run swift repeatedly and > > > either have the coaster worker pool re-connect quickly to each new > > > swift, or quickly start new pools within the same cluster job(s), that > > > would suffice for now. > > > > > > > > Justin, do you want to weigh in on these? > > > > > > > > Thanks, > > > > > > > > Mike > > > > > > > > > > > > > The idea is that some recent changes may have shifted the > > > existing > > > > > priorities. So think of this from the perspective of > > > > > user/application/publication goals rather than what you think > > > would > > > > > be > > > > > "nice to have". > > > > > > > > > > Mihael > > > > > > > > > > _______________________________________________ > > > > > Swift-devel mailing list > > > > > Swift-devel at ci.uchicago.edu > > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Tue Jul 20 13:47:35 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 20 Jul 2010 13:47:35 -0500 Subject: [Swift-devel] stuff to do In-Reply-To: <321209788.22551279550511201.JavaMail.root@zimbra.anl.gov> References: <321209788.22551279550511201.JavaMail.root@zimbra.anl.gov> Message-ID: <1279651655.17308.2.camel@blabla2.none> I changed it to run in a separate thread and collapse frequent flushes. This way it doesn't require user interaction. It may not work very well in case of power outages, but I don't think that's the most frequent use of the restart log. On Mon, 2010-07-19 at 08:41 -0600, Michael Wilde wrote: > Way cool. Can you make restart/flush a settable property? > > - Mike > > ----- "Mihael Hategan" wrote: > > > On Thu, 2010-07-15 at 18:35 -0500, Mihael Hategan wrote: > > > > > > I also quickly made a fake provider and I am getting a rate of about > > 100 > > > j/s. So that seems not to infirm my previous suspicion. > > > > Well, it turns out that the flushing the restart log to disk takes > > some > > time. As in if I remove the call to flush() I can get 800 jobs/s. > From wozniak at mcs.anl.gov Wed Jul 21 10:49:18 2010 From: wozniak at mcs.anl.gov (Justin M Wozniak) Date: Wed, 21 Jul 2010 10:49:18 -0500 (CDT) Subject: [Swift-devel] GSOC call today Message-ID: I'll be on for the call... -- Justin M Wozniak From wozniak at mcs.anl.gov Mon Jul 26 13:47:27 2010 From: wozniak at mcs.anl.gov (Justin M Wozniak) Date: Mon, 26 Jul 2010 13:47:27 -0500 (Central Daylight Time) Subject: [Swift-devel] MPICH/Coasters Message-ID: Hello I just had a meeting with Pavan to talk about what we can do to run MPI jobs from Coasters given the new MPICH/Hydra features. He's making a few modifications to MPICH to support this and they should be available soon. Background on Hydra: http://wiki.mcs.anl.gov/mpich2/index.php/Using_the_Hydra_Process_Manager#Bootstrap_Servers http://wiki.mcs.anl.gov/mpich2/index.php/Hydra_Process_Management_Framework Here's the basic idea so far: * The CoasterService locally runs an mpiexec; * mpiexec prints a list of (proxy) command lines, then listens; * The CoasterService passes each command-line to a worker; * The worker launches the proxy; * The proxy connects back to mpiexec; * mpiexec and the proxies complete the user job; * mpiexec and the proxies shut down. 
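A rough sketch of the service-side half of that flow, just to make the control sequence concrete -- the mpiexec arguments and the dispatchToWorker() call below are placeholders, not the actual CoasterService code or the final Hydra option names:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

public class ManualMpiLaunchSketch {

    public static void main(String[] args) throws Exception {
        // 1. Start mpiexec locally. The flags here are placeholders; the real
        //    invocation would use whatever option Hydra exposes for "manual"
        //    proxy launching.
        Process mpiexec = new ProcessBuilder("mpiexec", "-n", "4", "./user-mpi-app")
                .redirectErrorStream(true)
                .start();

        // 2. In manual mode, mpiexec prints one proxy command line per node
        //    and then listens for the proxies to connect back.
        BufferedReader out = new BufferedReader(
                new InputStreamReader(mpiexec.getInputStream()));
        List<String> proxyCommands = new ArrayList<String>();
        String line;
        while (proxyCommands.size() < 4 && (line = out.readLine()) != null) {
            proxyCommands.add(line);
        }

        // 3. Hand each proxy command line to a coaster worker, which execs it;
        //    the proxy then connects back to mpiexec on its own.
        for (String cmd : proxyCommands) {
            dispatchToWorker(cmd);
        }

        // 4. mpiexec returns once the proxies and the user job have finished.
        int rc = mpiexec.waitFor();
        System.out.println("mpiexec exited with " + rc);
    }

    // Placeholder: in the real service this would go over the coaster channel
    // to a worker node rather than just being printed.
    private static void dispatchToWorker(String proxyCommandLine) {
        System.out.println("would submit to worker: " + proxyCommandLine);
    }
}

The worker side stays trivial -- it just execs whatever proxy command line it is handed -- which is why this maps so naturally onto the existing coaster workers.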
So, analogous to "manual Coasters", this is "manual MPICH", because Coasters is responsible for launching the proxies. Justin -- Justin M Wozniak
From aespinosa at cs.uchicago.edu Mon Jul 26 14:50:50 2010 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Mon, 26 Jul 2010 14:50:50 -0500 Subject: [Swift-devel] MPICH/Coasters In-Reply-To: References: Message-ID: <20100726195050.GA3204@origin> Sounds like cool stuff. So essentially they modularized the process manager component of the mpich2 implementation (or a lamd daemon / manager) to be able to launch other processes. With this framework, can coasters directly access low-level interconnect interfaces instead of plain (or GSI) sockets? On Mon, Jul 26, 2010 at 01:47:27PM -0500, Justin M Wozniak wrote: > Hello > I just had a meeting with Pavan to talk about what we can do to run > MPI jobs from Coasters given the new MPICH/Hydra features. He's > making a few modifications to MPICH to support this and they should > be available soon. > > Background on Hydra: > > http://wiki.mcs.anl.gov/mpich2/index.php/Using_the_Hydra_Process_Manager#Bootstrap_Servers > http://wiki.mcs.anl.gov/mpich2/index.php/Hydra_Process_Management_Framework > > Here's the basic idea so far: > > * The CoasterService locally runs an mpiexec; > * mpiexec prints a list of (proxy) command lines, then listens; > * The CoasterService passes each command-line to a worker; > * The worker launches the proxy; > * The proxy connects back to mpiexec; > * mpiexec and the proxies complete the user job; > * mpiexec and the proxies shut down. > > So, analogous to "manual Coasters", this is "manual MPICH", because > Coasters is responsible for launching the proxies. > > Justin > > -- > Justin M Wozniak
From hategan at mcs.anl.gov Mon Jul 26 15:21:20 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 26 Jul 2010 15:21:20 -0500 Subject: [Swift-devel] MPICH/Coasters In-Reply-To: <20100726195050.GA3204@origin> References: <20100726195050.GA3204@origin> Message-ID: <1280175680.23112.4.camel@blabla2.none> On Mon, 2010-07-26 at 14:50 -0500, Allan Espinosa wrote: > Sounds like cool stuff. So essentially they modularized the process manager > component of the mpich2 implementation (or a lamd daemon / manager) to be able > to launch other processes. I'm not sure I follow, but I'm guessing the scenario is that you first get some nodes on which you start their bootstrap server after which you can submit various mpi applications on-demand without going through the queuing system again. Right?
From aespinosa at cs.uchicago.edu Mon Jul 26 15:48:42 2010 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Mon, 26 Jul 2010 15:48:42 -0500 Subject: [Swift-devel] MPICH/Coasters In-Reply-To: <1280175680.23112.4.camel@blabla2.none> References: <20100726195050.GA3204@origin> <1280175680.23112.4.camel@blabla2.none> Message-ID: <20100726204842.GB3204@origin> Right. Because it will ask the bootstrap server for the nodes that participate in the MPI_WORLD group of the application. I was talking about how the mpich2 implementation features dynamically growing an MPI_WORLD group using the bootstrap server. On Mon, Jul 26, 2010 at 03:21:20PM -0500, Mihael Hategan wrote: > On Mon, 2010-07-26 at 14:50 -0500, Allan Espinosa wrote: > > Sounds like cool stuff. So essentially they modularized the process manager > > component of the mpich2 implementation (or a lamd daemon / manager) to be able > > to launch other processes.
> > I'm not sure I follow, but I'm guessing the scenario is that you first > get some nodes on which you start their bootstrap server after which you > can submit various mpi applications on-demand without going through the > queuing system again. Right? > > From hategan at mcs.anl.gov Mon Jul 26 20:43:40 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 26 Jul 2010 20:43:40 -0500 Subject: [Swift-devel] more on job throughput Message-ID: <1280195020.29413.4.camel@blabla2.none> Here's a plot of the number of tasks in the various stages that the runtime stats track. This is with 8192 jobs and the fake provider (which does nothing and finishes tasks almost immediately, and which I should probably commit somewhere if anybody else wants to play with this). I also attached the scripts used. You would need to change RuntimeStats to print the stats more often than the 1s default (say something like (MIN,MAX)_PERIOD_MS=100). -------------- next part -------------- A non-text attachment was scrubbed... Name: timings.png Type: image/png Size: 5568 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: timings.tar.gz Type: application/x-compressed-tar Size: 2264 bytes Desc: not available URL: From benc at hawaga.org.uk Tue Jul 27 10:06:56 2010 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 27 Jul 2010 15:06:56 +0000 (GMT) Subject: [Swift-devel] more on job throughput In-Reply-To: <1280195020.29413.4.camel@blabla2.none> References: <1280195020.29413.4.camel@blabla2.none> Message-ID: > Here's a plot of the number of tasks in the various stages that the > runtime stats track. what is the x-axis on that graph? -- From hategan at mcs.anl.gov Tue Jul 27 10:50:26 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 27 Jul 2010 10:50:26 -0500 Subject: [Swift-devel] more on job throughput In-Reply-To: References: <1280195020.29413.4.camel@blabla2.none> Message-ID: <1280245826.31330.0.camel@blabla2.none> On Tue, 2010-07-27 at 15:06 +0000, Ben Clifford wrote: > > Here's a plot of the number of tasks in the various stages that the > > runtime stats track. > > what is the x-axis on that graph? > Time. From benc at hawaga.org.uk Tue Jul 27 10:51:08 2010 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 27 Jul 2010 15:51:08 +0000 (GMT) Subject: [Swift-devel] more on job throughput In-Reply-To: <1280245826.31330.0.camel@blabla2.none> References: <1280195020.29413.4.camel@blabla2.none> <1280245826.31330.0.camel@blabla2.none> Message-ID: > Time. ... units of ... -- From hategan at mcs.anl.gov Tue Jul 27 10:54:45 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 27 Jul 2010 10:54:45 -0500 Subject: [Swift-devel] more on job throughput In-Reply-To: References: <1280195020.29413.4.camel@blabla2.none> <1280245826.31330.0.camel@blabla2.none> Message-ID: <1280246085.31330.1.camel@blabla2.none> On Tue, 2010-07-27 at 15:51 +0000, Ben Clifford wrote: > > Time. > > ... units of ... > milliseconds. From hategan at mcs.anl.gov Tue Jul 27 10:55:53 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 27 Jul 2010 10:55:53 -0500 Subject: [Swift-devel] more on job throughput In-Reply-To: References: <1280195020.29413.4.camel@blabla2.none> Message-ID: <1280246153.31546.0.camel@blabla2.none> On Tue, 2010-07-27 at 15:06 +0000, Ben Clifford wrote: > > Here's a plot of the number of tasks in the various stages that the > > runtime stats track. > > what is the x-axis on that graph? > Good point actually. 
One would also need to print time in RuntimeStats. From wozniak at mcs.anl.gov Tue Jul 27 11:49:38 2010 From: wozniak at mcs.anl.gov (Justin M Wozniak) Date: Tue, 27 Jul 2010 11:49:38 -0500 (Central Daylight Time) Subject: [Swift-devel] MPICH/Coasters In-Reply-To: <20100726204842.GB3204@origin> References: <20100726195050.GA3204@origin> <1280175680.23112.4.camel@blabla2.none> <20100726204842.GB3204@origin> Message-ID: On Mon, 26 Jul 2010, Allan Espinosa wrote: > I was talking about how the mpich2 implementation features dynamically growing > an MPI_WORLD group using the bootstrap server. We did briefly discuss getting MPI-2 stuff going (like the nameserver for MPI_Publish_name()) but I'd like to leave that for future work. I put a figure up at: http://www.ci.uchicago.edu/wiki/bin/view/SWFT/CoastersMpi -- Justin M Wozniak From aespinosa at cs.uchicago.edu Wed Jul 28 14:34:21 2010 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Wed, 28 Jul 2010 14:34:21 -0500 Subject: [Swift-devel] localscheduler (condor/ condorg) breaking on lots of condor jobs Message-ID: <20100728193421.GA11060@origin> Hi, it seems that when there's too many submitted condor jobs, the submit host will start to complain if it opens too many log, stderr, and stdout files: 330 Finished successfully:162 Failed but can retry:927 Failed to transfer wrapper log from sleep-LGU-estimate/info/x on USCMS-FNAL-WC1 Progress: Initializing site shared directory:1 Stage in:2 Submitted:1332 Active:245 Failed:331 Finished successfully:162 Failed but can retry:928 Progress:Failed to cancel job 57445 java.io.IOException: Cannot run program "condor_qedit": java.io.IOException: error=24, Too many open files at java.lang.ProcessBuilder.start(ProcessBuilder.java:460) at java.lang.Runtime.exec(Runtime.java:593) at java.lang.Runtime.exec(Runtime.java:466) at org.globus.cog.abstraction.impl.scheduler.condor.CondorExecutor.cancel(CondorExecutor.java:254) at org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.cancel(AbstractJobSubmissionTaskHandler.java:85) at org.globus.cog.abstraction.impl.common.AbstractTaskHandler.cancel(AbstractTaskHandler.java:70) at org.globus.cog.karajan.scheduler.submitQueue.NonBlockingCancel.run(NonBlockingCancel.java:26) at edu.emory.mathcs.backport.java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:431) at edu.emory.mathcs.backport.java.util.concurrent.FutureTask.run(FutureTask.java:166) at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:643) at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:668) at java.lang.Thread.run(Thread.java:619) Caused by: java.io.IOException: java.io.IOException: error=24, Too many open files at java.lang.UNIXProcess.(UNIXProcess.java:148) at java.lang.ProcessImpl.start(ProcessImpl.java:65) at java.lang.ProcessBuilder.start(ProcessBuilder.java:453) ... 11 more Initializing site shared directory:1 Stage in:1 Submitting:2 Submitted:1332 Active:245 Failed:331 Finished successfully:162 Failed but can retry:927 Progress: Initializing site shared directory:1 Submitting:3 Submitted:1331 Active:245 Failed:331 Finished successfully:162 Failed but can retry:928 This causes jobs to fail. 
Here are the logfile entries that I think are relevant to the failure: 2010-07-28 14:20:07,829-0500 WARN CondorExecutor Failed to cancel job 57026 java.io.IOException: Cannot run program "condor_rm": java.io.IOException: error=24, Too many open files at java.lang.ProcessBuilder.start(ProcessBuilder.java:460) at java.lang.Runtime.exec(Runtime.java:593) at java.lang.Runtime.exec(Runtime.java:466) at org.globus.cog.abstraction.impl.scheduler.condor.CondorExecutor.cancel(CondorExecutor.java:275) at org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.cancel(AbstractJobSubmissionTaskHandler.java:85) at org.globus.cog.abstraction.impl.common.AbstractTaskHandler.cancel(AbstractTaskHandler.java:70) at org.globus.cog.karajan.scheduler.submitQueue.NonBlockingCancel.run(NonBlockingCancel.java:26) at edu.emory.mathcs.backport.java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:431) at edu.emory.mathcs.backport.java.util.concurrent.FutureTask.run(FutureTask.java:166) at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:643) at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:668) at java.lang.Thread.run(Thread.java:619) Caused by: java.io.IOException: java.io.IOException: error=24, Too many open files at java.lang.UNIXProcess.(UNIXProcess.java:148) at java.lang.ProcessImpl.start(ProcessImpl.java:65) at java.lang.ProcessBuilder.start(ProcessBuilder.java:453) ... 11 more 2010-07-28 14:20:07,856-0500 WARN CondorExecutor Failed to cancel job 57106 -Allan From hategan at mcs.anl.gov Wed Jul 28 14:48:36 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 28 Jul 2010 14:48:36 -0500 Subject: [Swift-devel] localscheduler (condor/ condorg) breaking on lots of condor jobs In-Reply-To: <20100728193421.GA11060@origin> References: <20100728193421.GA11060@origin> Message-ID: <1280346516.12761.3.camel@blabla2.none> Yeah. That's why the provider should be updated to use job logs instead of condor_qstat/condor_qedit for figuring out status. That or update limits (and, btw, what does ulimit -a say on that machine)? 
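If it is just the limit, something along these lines (assuming the hard limit on the submit host allows it) would tide you over until the provider is reworked:

# check the current soft and hard limits on open files
ulimit -Sn
ulimit -Hn

# raise the soft limit for this shell; going past the hard limit needs root
# (e.g. an entry in /etc/security/limits.conf)
ulimit -n 4096

# then start swift from the same shell so the JVM inherits the new limit
# (use your usual invocation; this is just an example)
swift -tc.file tc -sites.file sites.xml yourscript.swift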
On Wed, 2010-07-28 at 14:34 -0500, Allan Espinosa wrote: > Hi, > > it seems that when there's too many submitted condor jobs, the submit host will > start to complain if it opens too many log, stderr, and stdout files: > > 330 Finished successfully:162 Failed but can retry:927 > Failed to transfer wrapper log from sleep-LGU-estimate/info/x on USCMS-FNAL-WC1 > Progress: Initializing site shared directory:1 Stage in:2 Submitted:1332 > Active:245 Failed:331 Finished successfully:162 Failed but can retry:928 > Progress:Failed to cancel job 57445 > java.io.IOException: Cannot run program "condor_qedit": java.io.IOException: > error=24, Too many open files > at java.lang.ProcessBuilder.start(ProcessBuilder.java:460) > at java.lang.Runtime.exec(Runtime.java:593) > at java.lang.Runtime.exec(Runtime.java:466) > at > org.globus.cog.abstraction.impl.scheduler.condor.CondorExecutor.cancel(CondorExecutor.java:254) > at > org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.cancel(AbstractJobSubmissionTaskHandler.java:85) > at > org.globus.cog.abstraction.impl.common.AbstractTaskHandler.cancel(AbstractTaskHandler.java:70) > at > org.globus.cog.karajan.scheduler.submitQueue.NonBlockingCancel.run(NonBlockingCancel.java:26) > at > edu.emory.mathcs.backport.java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:431) > at > edu.emory.mathcs.backport.java.util.concurrent.FutureTask.run(FutureTask.java:166) > at > edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:643) > at > edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:668) > at java.lang.Thread.run(Thread.java:619) > Caused by: java.io.IOException: java.io.IOException: error=24, Too many open > files > at java.lang.UNIXProcess.(UNIXProcess.java:148) > at java.lang.ProcessImpl.start(ProcessImpl.java:65) > at java.lang.ProcessBuilder.start(ProcessBuilder.java:453) > ... 11 more > Initializing site shared directory:1 Stage in:1 Submitting:2 Submitted:1332 > Active:245 Failed:331 Finished successfully:162 Failed but can retry:927 > Progress: Initializing site shared directory:1 Submitting:3 Submitted:1331 > Active:245 Failed:331 Finished successfully:162 Failed but can retry:928 > > > This causes jobs to fail. 
Here are the logfile entries that I think are > relevant to the failure: > > 2010-07-28 14:20:07,829-0500 WARN CondorExecutor Failed to cancel job 57026 > java.io.IOException: Cannot run program "condor_rm": java.io.IOException: > error=24, Too many open files > at java.lang.ProcessBuilder.start(ProcessBuilder.java:460) > at java.lang.Runtime.exec(Runtime.java:593) > at java.lang.Runtime.exec(Runtime.java:466) > at > org.globus.cog.abstraction.impl.scheduler.condor.CondorExecutor.cancel(CondorExecutor.java:275) > at > org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.cancel(AbstractJobSubmissionTaskHandler.java:85) > at > org.globus.cog.abstraction.impl.common.AbstractTaskHandler.cancel(AbstractTaskHandler.java:70) > at > org.globus.cog.karajan.scheduler.submitQueue.NonBlockingCancel.run(NonBlockingCancel.java:26) > at > edu.emory.mathcs.backport.java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:431) > at > edu.emory.mathcs.backport.java.util.concurrent.FutureTask.run(FutureTask.java:166) > at > edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:643) > at > edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:668) > at java.lang.Thread.run(Thread.java:619) > Caused by: java.io.IOException: java.io.IOException: error=24, Too many open > files > at java.lang.UNIXProcess.(UNIXProcess.java:148) > at java.lang.ProcessImpl.start(ProcessImpl.java:65) > at java.lang.ProcessBuilder.start(ProcessBuilder.java:453) > ... 11 more > 2010-07-28 14:20:07,856-0500 WARN CondorExecutor Failed to cancel job 57106 > > -Allan > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From aespinosa at cs.uchicago.edu Wed Jul 28 15:00:44 2010 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Wed, 28 Jul 2010 15:00:44 -0500 Subject: [Swift-devel] localscheduler (condor/ condorg) breaking on lots of condor jobs In-Reply-To: <1280346516.12761.3.camel@blabla2.none> References: <20100728193421.GA11060@origin> <1280346516.12761.3.camel@blabla2.none> Message-ID: <20100728200044.GB11060@origin> Ah, only 1024 files. That's why. $ ulimit -a core file size (blocks, -c) 0 data seg size (kbytes, -d) unlimited scheduling priority (-e) 0 file size (blocks, -f) unlimited pending signals (-i) 122880 max locked memory (kbytes, -l) 32 max memory size (kbytes, -m) unlimited open files (-n) 1024 pipe size (512 bytes, -p) 8 POSIX message queues (bytes, -q) 819200 real-time priority (-r) 0 stack size (kbytes, -s) 10240 cpu time (seconds, -t) unlimited max user processes (-u) 122880 virtual memory (kbytes, -v) unlimited file locks (-x) unlimited On Wed, Jul 28, 2010 at 02:48:36PM -0500, Mihael Hategan wrote: > Yeah. That's why the provider should be updated to use job logs instead > of condor_qstat/condor_qedit for figuring out status. > > That or update limits (and, btw, what does ulimit -a say on that > machine)? 
> > On Wed, 2010-07-28 at 14:34 -0500, Allan Espinosa wrote: > > Hi, > > > > it seems that when there's too many submitted condor jobs, the submit host will > > start to complain if it opens too many log, stderr, and stdout files: > > > > 330 Finished successfully:162 Failed but can retry:927 > > Failed to transfer wrapper log from sleep-LGU-estimate/info/x on USCMS-FNAL-WC1 > > Progress: Initializing site shared directory:1 Stage in:2 Submitted:1332 > > Active:245 Failed:331 Finished successfully:162 Failed but can retry:928 > > Progress:Failed to cancel job 57445 > > java.io.IOException: Cannot run program "condor_qedit": java.io.IOException: > > error=24, Too many open files > > at java.lang.ProcessBuilder.start(ProcessBuilder.java:460) > > at java.lang.Runtime.exec(Runtime.java:593)