From wilde at mcs.anl.gov  Thu Jul  1 08:23:37 2010
From: wilde at mcs.anl.gov (wilde at mcs.anl.gov)
Date: Thu, 1 Jul 2010 08:23:37 -0500 (CDT)
Subject: [Swift-devel] CASP jobs hang - seems to be  in coaster scheduling
In-Reply-To: <26625693.1289731277990343214.JavaMail.root@zimbra>
Message-ID: <5935652.1289891277990617500.JavaMail.root@zimbra>

[Mihael: help urgently needed on this if possible]

Aashish, I see the runs you submitted around 3-4AM this morning in /home/aashish/CASP/{T0608,T0610,T0611}

Each of them show a similar problem to what we saw earlier last night with T0608: the script submits 300 jobs to the pads coaster pool, and none of them run.

In some of these scripts, the first round of 300 (boostThreader) work fine, but the later round of 300 loops jobs get "stuck".

Mihael, can you set aside some time as soon as possible this morning to look at these? These need to be submitted to CASP by 2PM CDT today, so attention to the problem is rather urgent.

The scripts are all coming from /home/aashish/RapLoops
The swift release is from /home/wilde/swift/src/stable/...

In the above directories, you will find all source for scripts, mappers, tc, and sites, as well as all logs. In some of the Tnnnn directories (each one is a protein target for the CASP competition) you will see multiple runs, each with an outN file log of stdout/err and then a run directory for that run with all relevant files.

This *looks* like the familiar problem of trying to run an app whose maxwalltime wont fit into any available coaster slot, but the times in tc and sites.xml dont seem to explain that behavior.

This script has been running well since May; "slight" changes were made to work around the unavailability of GPFS on PADS this week, but we still cant figure out why these scripts are hanging in this manner.

- Mike


From wilde at mcs.anl.gov  Thu Jul  1 10:23:40 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 1 Jul 2010 10:23:40 -0500 (CDT)
Subject: [Swift-devel] CASP jobs hang - seems to be  in coaster scheduling
In-Reply-To: <5935652.1289891277990617500.JavaMail.root@zimbra>
Message-ID: <10232493.1304301277997820587.JavaMail.root@zimbra>

Sorry, false alarm - please ignore the request below.

The problem was indeed simply requesting a larger maxwalltime than any available coaster maxtime slot.

Can this be detected and a clear error message issued, as well as ending the run?

- Mike

----- wilde at mcs.anl.gov wrote:

> [Mihael: help urgently needed on this if possible]
> 
> Aashish, I see the runs you submitted around 3-4AM this morning in
> /home/aashish/CASP/{T0608,T0610,T0611}
> 
> Each of them show a similar problem to what we saw earlier last night
> with T0608: the script submits 300 jobs to the pads coaster pool, and
> none of them run.
> 
> In some of these scripts, the first round of 300 (boostThreader) work
> fine, but the later round of 300 loops jobs get "stuck".
> 
> Mihael, can you set aside some time as soon as possible this morning
> to look at these? These need to be submitted to CASP by 2PM CDT today,
> so attention to the problem is rather urgent.
> 
> The scripts are all coming from /home/aashish/RapLoops
> The swift release is from /home/wilde/swift/src/stable/...
> 
> In the above directories, you will find all source for scripts,
> mappers, tc, and sites, as well as all logs. In some of the Tnnnn
> directories (each one is a protein target for the CASP competition)
> you will see multiple runs, each with an outN file log of stdout/err
> and then a run directory for that run with all relevant files.
> 
> This *looks* like the familiar problem of trying to run an app whose
> maxwalltime wont fit into any available coaster slot, but the times in
> tc and sites.xml dont seem to explain that behavior.
> 
> This script has been running well since May; "slight" changes were
> made to work around the unavailability of GPFS on PADS this week, but
> we still cant figure out why these scripts are hanging in this
> manner.
> 
> - Mike
> 
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory


From hategan at mcs.anl.gov  Thu Jul  1 10:49:15 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 01 Jul 2010 10:49:15 -0500
Subject: [Swift-devel] CASP jobs hang - seems to be  in coaster scheduling
In-Reply-To: <10232493.1304301277997820587.JavaMail.root@zimbra>
References: <10232493.1304301277997820587.JavaMail.root@zimbra>
Message-ID: <1277999355.16558.0.camel@blabla2.none>

On Thu, 2010-07-01 at 10:23 -0500, Michael Wilde wrote:
> Sorry, false alarm - please ignore the request below.
> 
> The problem was indeed simply requesting a larger maxwalltime than any available coaster maxtime slot.
> 
> Can this be detected and a clear error message issued, as well as ending the run?

I thought it was. I can double-check.


> 
> - Mike
> 
> ----- wilde at mcs.anl.gov wrote:
> 
> > [Mihael: help urgently needed on this if possible]
> > 
> > Aashish, I see the runs you submitted around 3-4AM this morning in
> > /home/aashish/CASP/{T0608,T0610,T0611}
> > 
> > Each of them show a similar problem to what we saw earlier last night
> > with T0608: the script submits 300 jobs to the pads coaster pool, and
> > none of them run.
> > 
> > In some of these scripts, the first round of 300 (boostThreader) work
> > fine, but the later round of 300 loops jobs get "stuck".
> > 
> > Mihael, can you set aside some time as soon as possible this morning
> > to look at these? These need to be submitted to CASP by 2PM CDT today,
> > so attention to the problem is rather urgent.
> > 
> > The scripts are all coming from /home/aashish/RapLoops
> > The swift release is from /home/wilde/swift/src/stable/...
> > 
> > In the above directories, you will find all source for scripts,
> > mappers, tc, and sites, as well as all logs. In some of the Tnnnn
> > directories (each one is a protein target for the CASP competition)
> > you will see multiple runs, each with an outN file log of stdout/err
> > and then a run directory for that run with all relevant files.
> > 
> > This *looks* like the familiar problem of trying to run an app whose
> > maxwalltime wont fit into any available coaster slot, but the times in
> > tc and sites.xml dont seem to explain that behavior.
> > 
> > This script has been running well since May; "slight" changes were
> > made to work around the unavailability of GPFS on PADS this week, but
> > we still cant figure out why these scripts are hanging in this
> > manner.
> > 
> > - Mike
> > 
> > _______________________________________________
> > Swift-devel mailing list
> > Swift-devel at ci.uchicago.edu
> > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> 


From wilde at mcs.anl.gov  Thu Jul  1 11:29:14 2010
From: wilde at mcs.anl.gov (wilde at mcs.anl.gov)
Date: Thu, 1 Jul 2010 11:29:14 -0500 (CDT)
Subject: [Swift-devel] manual coasters
In-Reply-To: <21262109.1309551278001249873.JavaMail.root@zimbra>
Message-ID: <18576240.1310251278001754685.JavaMail.root@zimbra>

Very cool - thanks, Mihael!

For the sites entry, do we still use the current format to indicate where the server should start?
Eg:

  <pool handle="coasterpool01">
    <execution provider="coaster" url="none" jobManager="pbs"/>
    <profile namespace="globus"> key="workerManager">passive</profile>
    <profile namespace="globus" key="queue">fast</profile>
    <profile namespace="karajan" key="initialScore">10000</profile>
    <profile namespace="karajan" key="jobThrottle">.07</profile>
    <gridftp  url="local://localhost" />
    <workdirectory >/home/wilde/swiftwork</workdirectory>
  </pool>

Is the full range of provider options available to start the server in passive mode?

Will throttling settings be honored?

Can we start multiple coaster servers in different places?


- Mike


----- "Mihael Hategan" <hategan at mcs.anl.gov> wrote:

> Manual coasters are in trunk. I did some limited testing on
> localhost.
> 
> The basic idea is that you say <profile namespace="globus"
> key="workerManager">passive</profile> in sites.xml. Other than that
> you
> may want to set workersPerNode, but the other options are useless.
> 
> Then, when swift starts the coaster service, it will print the URL of
> that on stderr.
> 
> You carefully dig for worker.pl and then launch it in whatever way
> you
> like:
> 
> worker.pl <ServiceURL> <blockid> <logdir>
> 
> The blockid can be whatever you want, but it can be used to group
> workers in the traditional blocks. The logdir is where you want the
> worker logs to go. They are all mandatory.
> 
> When workers connect to the service, the service should start
> shipping
> jobs to them. When the service is shut down, it will also try to shut
> down the workers (they are useless anyway at that point), but it
> cannot
> control the LRM jobs, so it may fail to do so (or rather said, it is
> more likely to fail to do so).
> 
> Mihael
> 
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory


From wilde at mcs.anl.gov  Thu Jul  1 11:34:20 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 1 Jul 2010 11:34:20 -0500 (CDT)
Subject: [Swift-devel] Coaster problem on BG/P - worker processes dying
Message-ID: <30532374.1310611278002060506.JavaMail.root@zimbra>

Justin, can you send a brief update to the list on the coaster problem (workers exiting after a few jobs) that is blocking you on the BG/P, and how you are re-working worker logging to debug it?

Lets use this thread to discuss and resolve the problem.

- Mike


From hategan at mcs.anl.gov  Thu Jul  1 11:39:59 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 01 Jul 2010 11:39:59 -0500
Subject: [Swift-devel] manual coasters
In-Reply-To: <18576240.1310251278001754685.JavaMail.root@zimbra>
References: <18576240.1310251278001754685.JavaMail.root@zimbra>
Message-ID: <1278002399.17123.0.camel@blabla2.none>

On Thu, 2010-07-01 at 11:29 -0500, wilde at mcs.anl.gov wrote:
> Very cool - thanks, Mihael!
> 
> For the sites entry, do we still use the current format to indicate where the server should start?

Yes. That should work. But "queue=fast" there doesn't do anything.

> Eg:
> 
>   <pool handle="coasterpool01">
>     <execution provider="coaster" url="none" jobManager="pbs"/>
>     <profile namespace="globus"> key="workerManager">passive</profile>
>     <profile namespace="globus" key="queue">fast</profile>
>     <profile namespace="karajan" key="initialScore">10000</profile>
>     <profile namespace="karajan" key="jobThrottle">.07</profile>
>     <gridftp  url="local://localhost" />
>     <workdirectory >/home/wilde/swiftwork</workdirectory>
>   </pool>
> 
> Is the full range of provider options available to start the server in passive mode?
> 
> Will throttling settings be honored?
> 
> Can we start multiple coaster servers in different places?
> 
> 
> - Mike
> 
> 
> ----- "Mihael Hategan" <hategan at mcs.anl.gov> wrote:
> 
> > Manual coasters are in trunk. I did some limited testing on
> > localhost.
> > 
> > The basic idea is that you say <profile namespace="globus"
> > key="workerManager">passive</profile> in sites.xml. Other than that
> > you
> > may want to set workersPerNode, but the other options are useless.
> > 
> > Then, when swift starts the coaster service, it will print the URL of
> > that on stderr.
> > 
> > You carefully dig for worker.pl and then launch it in whatever way
> > you
> > like:
> > 
> > worker.pl <ServiceURL> <blockid> <logdir>
> > 
> > The blockid can be whatever you want, but it can be used to group
> > workers in the traditional blocks. The logdir is where you want the
> > worker logs to go. They are all mandatory.
> > 
> > When workers connect to the service, the service should start
> > shipping
> > jobs to them. When the service is shut down, it will also try to shut
> > down the workers (they are useless anyway at that point), but it
> > cannot
> > control the LRM jobs, so it may fail to do so (or rather said, it is
> > more likely to fail to do so).
> > 
> > Mihael
> > 
> > _______________________________________________
> > Swift-devel mailing list
> > Swift-devel at ci.uchicago.edu
> > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> 


From wozniak at mcs.anl.gov  Thu Jul  1 11:56:13 2010
From: wozniak at mcs.anl.gov (Justin M Wozniak)
Date: Thu, 1 Jul 2010 11:56:13 -0500 (Central Daylight Time)
Subject: [Swift-devel] Re: Coaster problem on BG/P - worker processes dying
In-Reply-To: <30532374.1310611278002060506.JavaMail.root@zimbra>
References: <30532374.1310611278002060506.JavaMail.root@zimbra>
Message-ID: <alpine.WNT.2.00.1007011151500.260@justinwozniak>

On Thu, 1 Jul 2010, Michael Wilde wrote:

> Justin, can you send a brief update to the list on the coaster problem 
> (workers exiting after a few jobs) that is blocking you on the BG/P, and 
> how you are re-working worker logging to debug it?

A paste from a previous email is below (both BG/P systems are down due to 
cooling issues today).

So far, the issue only appears after several thousand jobs run on at least 
512 nodes.

I'm pretty close to generating the logging I need to track this down.  I 
have broken down the worker logs into one log per worker script...

Paste:

Running on the Intrepid compute nodes.  In the last few runs I've only 
seen it in the 512 node case (I think this worked at least once), not 256 
nodes, but that could be just because this is rare.

2010-06-18 16:06:29,117-0500 WARN  Command Command(2, SUBMITJOB): handling 
reply timeout;
sendReqTime=100618-160429.10
8, sendTime=100618-160429.108, now=100618-160629.117
2010-06-18 16:06:29,117-0500 INFO  Command Command(2, SUBMITJOB): 
re-sending
2010-06-18 16:06:29,117-0500 WARN  Command Command(2, SUBMITJOB)fault was: 
Reply timeout
org.globus.cog.karajan.workflow.service.ReplyTimeoutException
         at
org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:280)
         at 
org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:285)
         at java.util.TimerThread.mainLoop(Timer.java:537)
         at java.util.TimerThread.run(Timer.java:487)
2010-06-18 16:06:29,118-0500 INFO  Command Sending Command(2, SUBMITJOB) 
on MetaChannel: 855782146 ->
SC-0618-370320-0
00000-001756
2010-06-18 16:06:29,119-0500 INFO  AbstractStreamKarajanChannel Channel 
IOException
java.net.SocketException: Broken pipe
         at java.net.SocketOutputStream.socketWrite0(Native Method)
         at 
java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:105)
         at java.net.SocketOutputStream.write(SocketOutputStream.java:137)
         at
org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.send(AbstractStrea
mKar
ajanChannel.java:292)
         at
org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.run(AbstractStream
Kara
janChannel.java:244)


-- 
Justin M Wozniak


From hategan at mcs.anl.gov  Thu Jul  1 12:08:08 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 01 Jul 2010 12:08:08 -0500
Subject: [Swift-devel] Re: Coaster problem on BG/P - worker processes dying
In-Reply-To: <alpine.WNT.2.00.1007011151500.260@justinwozniak>
References: <30532374.1310611278002060506.JavaMail.root@zimbra>
	<alpine.WNT.2.00.1007011151500.260@justinwozniak>
Message-ID: <1278004088.17643.1.camel@blabla2.none>

That typically is an indication that something went wrong with the
worker or the worker connection. It's also possible that the message
queues are loaded enough to not be able to process everything in time.
The coaster logs have some logging info that displays that information.

On Thu, 2010-07-01 at 11:56 -0500, Justin M Wozniak wrote:
> On Thu, 1 Jul 2010, Michael Wilde wrote:
> 
> > Justin, can you send a brief update to the list on the coaster problem 
> > (workers exiting after a few jobs) that is blocking you on the BG/P, and 
> > how you are re-working worker logging to debug it?
> 
> A paste from a previous email is below (both BG/P systems are down due to 
> cooling issues today).
> 
> So far, the issue only appears after several thousand jobs run on at least 
> 512 nodes.
> 
> I'm pretty close to generating the logging I need to track this down.  I 
> have broken down the worker logs into one log per worker script...
> 
> Paste:
> 
> Running on the Intrepid compute nodes.  In the last few runs I've only 
> seen it in the 512 node case (I think this worked at least once), not 256 
> nodes, but that could be just because this is rare.
> 
> 2010-06-18 16:06:29,117-0500 WARN  Command Command(2, SUBMITJOB): handling 
> reply timeout;
> sendReqTime=100618-160429.10
> 8, sendTime=100618-160429.108, now=100618-160629.117
> 2010-06-18 16:06:29,117-0500 INFO  Command Command(2, SUBMITJOB): 
> re-sending
> 2010-06-18 16:06:29,117-0500 WARN  Command Command(2, SUBMITJOB)fault was: 
> Reply timeout
> org.globus.cog.karajan.workflow.service.ReplyTimeoutException
>          at
> org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:280)
>          at 
> org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:285)
>          at java.util.TimerThread.mainLoop(Timer.java:537)
>          at java.util.TimerThread.run(Timer.java:487)
> 2010-06-18 16:06:29,118-0500 INFO  Command Sending Command(2, SUBMITJOB) 
> on MetaChannel: 855782146 ->
> SC-0618-370320-0
> 00000-001756
> 2010-06-18 16:06:29,119-0500 INFO  AbstractStreamKarajanChannel Channel 
> IOException
> java.net.SocketException: Broken pipe
>          at java.net.SocketOutputStream.socketWrite0(Native Method)
>          at 
> java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:105)
>          at java.net.SocketOutputStream.write(SocketOutputStream.java:137)
>          at
> org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.send(AbstractStrea
> mKar
> ajanChannel.java:292)
>          at
> org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.run(AbstractStream
> Kara
> janChannel.java:244)
> 
> 


From zhaozhang at uchicago.edu  Thu Jul  1 12:13:02 2010
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Thu, 01 Jul 2010 12:13:02 -0500
Subject: [Swift-devel] Re: Coaster problem on BG/P - worker processes dying
In-Reply-To: <alpine.WNT.2.00.1007011151500.260@justinwozniak>
References: <30532374.1310611278002060506.JavaMail.root@zimbra>
	<alpine.WNT.2.00.1007011151500.260@justinwozniak>
Message-ID: <4C2CCC9E.3090904@uchicago.edu>

Hi, Justin

Is there any chance that each worker is writing log files to GPFS or 
writing to RAM, then copying to GPFS?
Even in the latter case, we used dd instead of cp on zeptoos, cuz with 
dd we could set the block size while cp
is using a lined buffer to dump data to GPFS, which is quite slow.

Another suspect would be that we are overwhelming the IO nodes. As far 
as I remember, coaster is running as a
service on each compute node with a TCP connection to the Login Node. 
The communication between Login Node
and CN node is handled by a IP forwarding component in zeptoos. In the 
tests I did before, the Falkon service is not
stable with 1024 nodes connecting to the service each with a TCP 
connection.Can we login the IO nodes while we
see those errors?

Anyway, I can't tell anything exactly right now.

best
zhao

Justin M Wozniak wrote:
> On Thu, 1 Jul 2010, Michael Wilde wrote:
>
>> Justin, can you send a brief update to the list on the coaster 
>> problem (workers exiting after a few jobs) that is blocking you on 
>> the BG/P, and how you are re-working worker logging to debug it?
>
> A paste from a previous email is below (both BG/P systems are down due 
> to cooling issues today).
>
> So far, the issue only appears after several thousand jobs run on at 
> least 512 nodes.
>
> I'm pretty close to generating the logging I need to track this down.  
> I have broken down the worker logs into one log per worker script...
>
> Paste:
>
> Running on the Intrepid compute nodes.  In the last few runs I've only 
> seen it in the 512 node case (I think this worked at least once), not 
> 256 nodes, but that could be just because this is rare.
>
> 2010-06-18 16:06:29,117-0500 WARN  Command Command(2, SUBMITJOB): 
> handling reply timeout;
> sendReqTime=100618-160429.10
> 8, sendTime=100618-160429.108, now=100618-160629.117
> 2010-06-18 16:06:29,117-0500 INFO  Command Command(2, SUBMITJOB): 
> re-sending
> 2010-06-18 16:06:29,117-0500 WARN  Command Command(2, SUBMITJOB)fault 
> was: Reply timeout
> org.globus.cog.karajan.workflow.service.ReplyTimeoutException
>         at
> org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:280) 
>
>         at 
> org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:285) 
>
>         at java.util.TimerThread.mainLoop(Timer.java:537)
>         at java.util.TimerThread.run(Timer.java:487)
> 2010-06-18 16:06:29,118-0500 INFO  Command Sending Command(2, 
> SUBMITJOB) on MetaChannel: 855782146 ->
> SC-0618-370320-0
> 00000-001756
> 2010-06-18 16:06:29,119-0500 INFO  AbstractStreamKarajanChannel 
> Channel IOException
> java.net.SocketException: Broken pipe
>         at java.net.SocketOutputStream.socketWrite0(Native Method)
>         at 
> java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:105)
>         at java.net.SocketOutputStream.write(SocketOutputStream.java:137)
>         at
> org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.send(AbstractStrea 
>
> mKar
> ajanChannel.java:292)
>         at
> org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.run(AbstractStream 
>
> Kara
> janChannel.java:244)
>
>


From hategan at mcs.anl.gov  Thu Jul  1 12:18:31 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 01 Jul 2010 12:18:31 -0500
Subject: [Swift-devel] Re: Coaster problem on BG/P - worker processes dying
In-Reply-To: <4C2CCC9E.3090904@uchicago.edu>
References: <30532374.1310611278002060506.JavaMail.root@zimbra>
	<alpine.WNT.2.00.1007011151500.260@justinwozniak>
	<4C2CCC9E.3090904@uchicago.edu>
Message-ID: <1278004711.17771.3.camel@blabla2.none>

On Thu, 2010-07-01 at 12:13 -0500, Zhao Zhang wrote:
> Hi, Justin
> 
> Is there any chance that each worker is writing log files to GPFS or 
> writing to RAM, then copying to GPFS?
> Even in the latter case, we used dd instead of cp on zeptoos, cuz with 
> dd we could set the block size while cp
> is using a lined buffer to dump data to GPFS, which is quite slow.

Since some time the worker log level is set to WARN (which only produces
a message at the start and end) when the number of workers is >= 16.

> 
> Another suspect would be that we are overwhelming the IO nodes. As far 
> as I remember, coaster is running as a
> service on each compute node with a TCP connection to the Login Node. 
> The communication between Login Node
> and CN node is handled by a IP forwarding component in zeptoos. In the 
> tests I did before, the Falkon service is not
> stable with 1024 nodes connecting to the service each with a TCP 
> connection.Can we login the IO nodes while we
> see those errors?

Maybe, but then I was able to run with 40k cores while the logging
scheme above wasn't enabled. Since then, there was a switch to only one
TCP connection per node (regardless of cores) and the very much reduced
logging. So I suspect this isn't the problem unless the ZOID NAT got
messed up.


From wozniak at mcs.anl.gov  Thu Jul  1 13:20:16 2010
From: wozniak at mcs.anl.gov (Justin M Wozniak)
Date: Thu, 1 Jul 2010 13:20:16 -0500 (Central Daylight Time)
Subject: [Swift-devel] Re: Coaster problem on BG/P - worker processes dying
In-Reply-To: <4C2CCC9E.3090904@uchicago.edu>
References: <30532374.1310611278002060506.JavaMail.root@zimbra>
	<alpine.WNT.2.00.1007011151500.260@justinwozniak>
	<4C2CCC9E.3090904@uchicago.edu>
Message-ID: <alpine.WNT.2.00.1007011316160.1192@justinwozniak>

On Thu, 1 Jul 2010, Zhao Zhang wrote:

> Is there any chance that each worker is writing log files to GPFS or 
> writing to RAM, then copying to GPFS? Even in the latter case, we used 
> dd instead of cp on zeptoos, cuz with dd we could set the block size 
> while cp is using a lined buffer to dump data to GPFS, which is quite 
> slow.

I have modified the perl script to write directly to a unique file per 
worker script, directly to GPFS.  (Speed is not an issue right now.)

> Another suspect would be that we are overwhelming the IO nodes. As far 
> as I remember, coaster is running as a service on each compute node with 
> a TCP connection to the Login Node. The communication between Login Node 
> and CN node is handled by a IP forwarding component in zeptoos. In the 
> tests I did before, the Falkon service is not stable with 1024 nodes 
> connecting to the service each with a TCP connection.Can we login the IO 
> nodes while we see those errors?

That seems like a possibility.  If I can whittle the problem down to that 
level we will have something to report to the zepto team.

 	Thanks

> Justin M Wozniak wrote:
>> On Thu, 1 Jul 2010, Michael Wilde wrote:
>> 
>>> Justin, can you send a brief update to the list on the coaster problem 
>>> (workers exiting after a few jobs) that is blocking you on the BG/P, and 
>>> how you are re-working worker logging to debug it?
>> 
>> A paste from a previous email is below (both BG/P systems are down due to 
>> cooling issues today).
>> 
>> So far, the issue only appears after several thousand jobs run on at least 
>> 512 nodes.
>> 
>> I'm pretty close to generating the logging I need to track this down.  I 
>> have broken down the worker logs into one log per worker script...
>> 
>> Paste:
>> 
>> Running on the Intrepid compute nodes.  In the last few runs I've only seen 
>> it in the 512 node case (I think this worked at least once), not 256 nodes, 
>> but that could be just because this is rare.
>> 
>> 2010-06-18 16:06:29,117-0500 WARN  Command Command(2, SUBMITJOB): handling 
>> reply timeout;
>> sendReqTime=100618-160429.10
>> 8, sendTime=100618-160429.108, now=100618-160629.117
>> 2010-06-18 16:06:29,117-0500 INFO  Command Command(2, SUBMITJOB): 
>> re-sending
>> 2010-06-18 16:06:29,117-0500 WARN  Command Command(2, SUBMITJOB)fault was: 
>> Reply timeout
>> org.globus.cog.karajan.workflow.service.ReplyTimeoutException
>>         at
>> org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:280)
>>         at 
>> org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:285)
>>         at java.util.TimerThread.mainLoop(Timer.java:537)
>>         at java.util.TimerThread.run(Timer.java:487)
>> 2010-06-18 16:06:29,118-0500 INFO  Command Sending Command(2, SUBMITJOB) on 
>> MetaChannel: 855782146 ->
>> SC-0618-370320-0
>> 00000-001756
>> 2010-06-18 16:06:29,119-0500 INFO  AbstractStreamKarajanChannel Channel 
>> IOException
>> java.net.SocketException: Broken pipe
>>         at java.net.SocketOutputStream.socketWrite0(Native Method)
>>         at 
>> java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:105)
>>         at java.net.SocketOutputStream.write(SocketOutputStream.java:137)
>>         at
>> org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.send(AbstractStrea 
>> mKar
>> ajanChannel.java:292)
>>         at
>> org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.run(AbstractStream 
>> Kara
>> janChannel.java:244)
>> 
>> 
>

-- 
Justin M Wozniak


From wozniak at mcs.anl.gov  Thu Jul  1 13:22:27 2010
From: wozniak at mcs.anl.gov (Justin M Wozniak)
Date: Thu, 1 Jul 2010 13:22:27 -0500 (Central Daylight Time)
Subject: [Swift-devel] Re: Coaster problem on BG/P - worker processes
	dying (fwd)
Message-ID: <alpine.WNT.2.00.1007011321520.1192@justinwozniak>

On Thu, 1 Jul 2010, Mihael Hategan wrote:

> On Thu, 2010-07-01 at 12:13 -0500, Zhao Zhang wrote:
>> Hi, Justin
>> 
>> Is there any chance that each worker is writing log files to GPFS or
>> writing to RAM, then copying to GPFS?
>> Even in the latter case, we used dd instead of cp on zeptoos, cuz with
>> dd we could set the block size while cp
>> is using a lined buffer to dump data to GPFS, which is quite slow.
> 
> Since some time the worker log level is set to WARN (which only produces
> a message at the start and end) when the number of workers is >= 16.

Right, I have made changes there.

>> Another suspect would be that we are overwhelming the IO nodes. As far
>> as I remember, coaster is running as a
>> service on each compute node with a TCP connection to the Login Node.
>> The communication between Login Node
>> and CN node is handled by a IP forwarding component in zeptoos. In the
>> tests I did before, the Falkon service is not
>> stable with 1024 nodes connecting to the service each with a TCP
>> connection.Can we login the IO nodes while we
>> see those errors?
> 
> Maybe, but then I was able to run with 40k cores while the logging
> scheme above wasn't enabled. Since then, there was a switch to only one
> TCP connection per node (regardless of cores) and the very much reduced
> logging. So I suspect this isn't the problem unless the ZOID NAT got
> messed up.

-- 
Justin M Wozniak


From aespinosa at cs.uchicago.edu  Thu Jul  1 17:21:35 2010
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Thu, 1 Jul 2010 17:21:35 -0500
Subject: [Swift-devel] manual coasters
In-Reply-To: <18576240.1310251278001754685.JavaMail.root@zimbra>
References: <21262109.1309551278001249873.JavaMail.root@zimbra>
	<18576240.1310251278001754685.JavaMail.root@zimbra>
Message-ID: <AANLkTilvMeVRqi1RxXjTjzhcriRV6sA_8RPag8e3QFz3@mail.gmail.com>

So for the pool entry below, where is the serviceURL?  the submit host
will issue a pbs request for a service host?


Thanks,
-Allan

2010/7/1  <wilde at mcs.anl.gov>:
> Very cool - thanks, Mihael!
>
> For the sites entry, do we still use the current format to indicate where the server should start?
> Eg:
>
> ?<pool handle="coasterpool01">
> ? ?<execution provider="coaster" url="none" jobManager="pbs"/>
> ? ?<profile namespace="globus"> key="workerManager">passive</profile>
> ? ?<profile namespace="globus" key="queue">fast</profile>
> ? ?<profile namespace="karajan" key="initialScore">10000</profile>
> ? ?<profile namespace="karajan" key="jobThrottle">.07</profile>
> ? ?<gridftp ?url="local://localhost" />
> ? ?<workdirectory >/home/wilde/swiftwork</workdirectory>
> ?</pool>
>
> Is the full range of provider options available to start the server in passive mode?
>
> Will throttling settings be honored?
>
> Can we start multiple coaster servers in different places?
>
>
> - Mike
>
>
> ----- "Mihael Hategan" <hategan at mcs.anl.gov> wrote:
>
>> Manual coasters are in trunk. I did some limited testing on
>> localhost.
>>
>> The basic idea is that you say <profile namespace="globus"
>> key="workerManager">passive</profile> in sites.xml. Other than that
>> you
>> may want to set workersPerNode, but the other options are useless.
>>
>> Then, when swift starts the coaster service, it will print the URL of
>> that on stderr.
>>
>> You carefully dig for worker.pl and then launch it in whatever way
>> you
>> like:
>>
>> worker.pl <ServiceURL> <blockid> <logdir>
>>
>> The blockid can be whatever you want, but it can be used to group
>> workers in the traditional blocks. The logdir is where you want the
>> worker logs to go. They are all mandatory.
>>
>> When workers connect to the service, the service should start
>> shipping
>> jobs to them. When the service is shut down, it will also try to shut
>> down the workers (they are useless anyway at that point), but it
>> cannot
>> control the LRM jobs, so it may fail to do so (or rather said, it is
>> more likely to fail to do so).
>>
>> Mihael
>>


From wilde at mcs.anl.gov  Thu Jul  1 17:50:13 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 1 Jul 2010 17:50:13 -0500 (CDT)
Subject: [Swift-devel] manual coasters
In-Reply-To: <AANLkTilvMeVRqi1RxXjTjzhcriRV6sA_8RPag8e3QFz3@mail.gmail.com>
Message-ID: <9982948.1334571278024613578.JavaMail.root@zimbra>

My understanding is that with the pool entry below, Swift will start the coaster service by submitting a PBS job. Then the swift command will print the service service URL (host:port ?) on stderr, and you manually start workers, passing that host:port on the command line to connect back to the coaster service.

>> Then, when swift starts the coaster service, it will print the URL of
>> that on stderr.

> >> worker.pl <ServiceURL> <blockid> <logdir>

- Mike

----- "Allan Espinosa" <aespinosa at cs.uchicago.edu> wrote:

> So for the pool entry below, where is the serviceURL?  the submit
> host
> will issue a pbs request for a service host?
> 
> 
> Thanks,
> -Allan
> 
> 2010/7/1  <wilde at mcs.anl.gov>:
> > Very cool - thanks, Mihael!
> >
> > For the sites entry, do we still use the current format to indicate
> where the server should start?
> > Eg:
> >
> > ?<pool handle="coasterpool01">
> > ? ?<execution provider="coaster" url="none" jobManager="pbs"/>
> > ? ?<profile namespace="globus">
> key="workerManager">passive</profile>
> > ? ?<profile namespace="globus" key="queue">fast</profile>
> > ? ?<profile namespace="karajan" key="initialScore">10000</profile>
> > ? ?<profile namespace="karajan" key="jobThrottle">.07</profile>
> > ? ?<gridftp ?url="local://localhost" />
> > ? ?<workdirectory >/home/wilde/swiftwork</workdirectory>
> > ?</pool>
> >
> > Is the full range of provider options available to start the server
> in passive mode?
> >
> > Will throttling settings be honored?
> >
> > Can we start multiple coaster servers in different places?
> >
> >
> > - Mike
> >
> >
> > ----- "Mihael Hategan" <hategan at mcs.anl.gov> wrote:
> >
> >> Manual coasters are in trunk. I did some limited testing on
> >> localhost.
> >>
> >> The basic idea is that you say <profile namespace="globus"
> >> key="workerManager">passive</profile> in sites.xml. Other than
> that
> >> you
> >> may want to set workersPerNode, but the other options are useless.
> >>
> >> Then, when swift starts the coaster service, it will print the URL
> of
> >> that on stderr.
> >>
> >> You carefully dig for worker.pl and then launch it in whatever way
> >> you
> >> like:
> >>
> >> worker.pl <ServiceURL> <blockid> <logdir>
> >>
> >> The blockid can be whatever you want, but it can be used to group
> >> workers in the traditional blocks. The logdir is where you want
> the
> >> worker logs to go. They are all mandatory.
> >>
> >> When workers connect to the service, the service should start
> >> shipping
> >> jobs to them. When the service is shut down, it will also try to
> shut
> >> down the workers (they are useless anyway at that point), but it
> >> cannot
> >> control the LRM jobs, so it may fail to do so (or rather said, it
> is
> >> more likely to fail to do so).
> >>
> >> Mihael
> >>

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory


From hategan at mcs.anl.gov  Thu Jul  1 19:10:02 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 01 Jul 2010 19:10:02 -0500
Subject: [Swift-devel] manual coasters
In-Reply-To: <AANLkTilvMeVRqi1RxXjTjzhcriRV6sA_8RPag8e3QFz3@mail.gmail.com>
References: <21262109.1309551278001249873.JavaMail.root@zimbra>
	<18576240.1310251278001754685.JavaMail.root@zimbra>
	<AANLkTilvMeVRqi1RxXjTjzhcriRV6sA_8RPag8e3QFz3@mail.gmail.com>
Message-ID: <1278029402.21455.0.camel@blabla2.none>

On Thu, 2010-07-01 at 17:21 -0500, Allan Espinosa wrote:
> So for the pool entry below, where is the serviceURL?  the submit host
> will issue a pbs request for a service host?

No. The serviceURL is printed by swift on stderr when the service
starts. It's mostly the port you care about if you know where it's
running.


From aespinosa at cs.uchicago.edu  Thu Jul  1 19:39:43 2010
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Thu, 1 Jul 2010 19:39:43 -0500
Subject: [Swift-devel] dirname directives ambigious
Message-ID: <AANLkTikwry0nq4M3QGisZ78CUoJoAeE6Z7OkbwbNJWBE@mail.gmail.com>

I updated to the latest cog-trunk and swift-trunk today and got these:

Caused by: org.globus.cog.karajan.workflow.KarajanRuntimeException:
Ambiguous element: dirname. Possible choices:
        vdl:dirname
        swiftscript:dirname
2010-07-01 19:37:47,906-0500 INFO  EventBus Near Karajan line: dirname
@ vdl-int.k, line: 274
Karajan exception: Ambiguous element: dirname. Possible choices:
        vdl:dirname
        swiftscript:dirname
Caused by: org.globus.cog.karajan.workflow.KarajanRuntimeException:
Ambiguous element: dirname. Possible choices:
        vdl:dirname
        swiftscript:dirname


-Allan

-- 
Allan M. Espinosa <http://amespinosa.wordpress.com>
PhD student, Computer Science
University of Chicago <http://people.cs.uchicago.edu/~aespinosa>


From aespinosa at cs.uchicago.edu  Thu Jul  1 23:54:50 2010
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Thu, 1 Jul 2010 23:54:50 -0500
Subject: [Swift-devel] Re: dirname directives ambigious
In-Reply-To: <AANLkTikwry0nq4M3QGisZ78CUoJoAeE6Z7OkbwbNJWBE@mail.gmail.com>
References: <AANLkTikwry0nq4M3QGisZ78CUoJoAeE6Z7OkbwbNJWBE@mail.gmail.com>
Message-ID: <AANLkTil7R-dcZw0TtNCpCl_Utzm9Xj7Kp2TTlwW02eYu@mail.gmail.com>

I changed the line it refers to vdl:dirname() .  There's no dirname()
function in vdl-int.k that refers to the swiftscript namespace right?

-Allan

2010/7/1 Allan Espinosa <aespinosa at cs.uchicago.edu>:
> I updated to the latest cog-trunk and swift-trunk today and got these:
>
> Caused by: org.globus.cog.karajan.workflow.KarajanRuntimeException:
> Ambiguous element: dirname. Possible choices:
> ? ? ? ?vdl:dirname
> ? ? ? ?swiftscript:dirname
> 2010-07-01 19:37:47,906-0500 INFO ?EventBus Near Karajan line: dirname
> @ vdl-int.k, line: 274
> Karajan exception: Ambiguous element: dirname. Possible choices:
> ? ? ? ?vdl:dirname
> ? ? ? ?swiftscript:dirname
> Caused by: org.globus.cog.karajan.workflow.KarajanRuntimeException:
> Ambiguous element: dirname. Possible choices:
> ? ? ? ?vdl:dirname
> ? ? ? ?swiftscript:dirname
>
>
> -Allan
>
> --
> Allan M. Espinosa <http://amespinosa.wordpress.com>
> PhD student, Computer Science
> University of Chicago <http://people.cs.uchicago.edu/~aespinosa>
>


-- 
Allan M. Espinosa <http://amespinosa.wordpress.com>
PhD student, Computer Science
University of Chicago <http://people.cs.uchicago.edu/~aespinosa>


From hategan at mcs.anl.gov  Fri Jul  2 00:50:34 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 02 Jul 2010 00:50:34 -0500
Subject: [Swift-devel] Re: dirname directives ambigious
In-Reply-To: <AANLkTil7R-dcZw0TtNCpCl_Utzm9Xj7Kp2TTlwW02eYu@mail.gmail.com>
References: <AANLkTikwry0nq4M3QGisZ78CUoJoAeE6Z7OkbwbNJWBE@mail.gmail.com>
	<AANLkTil7R-dcZw0TtNCpCl_Utzm9Xj7Kp2TTlwW02eYu@mail.gmail.com>
Message-ID: <1278049834.24578.5.camel@blabla2.none>

On Thu, 2010-07-01 at 23:54 -0500, Allan Espinosa wrote:
> I changed the line it refers to vdl:dirname() .  There's no dirname()
> function in vdl-int.k that refers to the swiftscript namespace right?

Wasn't when I last wrote at it.

But that doesn't mean much. It's the purpose that made that change to be
that is probably more relevant.


From wozniak at mcs.anl.gov  Fri Jul  2 11:55:26 2010
From: wozniak at mcs.anl.gov (Justin M Wozniak)
Date: Fri, 2 Jul 2010 11:55:26 -0500 (Central Daylight Time)
Subject: [Swift-devel] dirname directives ambigious
In-Reply-To: <AANLkTikwry0nq4M3QGisZ78CUoJoAeE6Z7OkbwbNJWBE@mail.gmail.com>
References: <AANLkTikwry0nq4M3QGisZ78CUoJoAeE6Z7OkbwbNJWBE@mail.gmail.com>
Message-ID: <alpine.WNT.2.00.1007021038260.1192@justinwozniak>

Hi Allan
 	I introduced this while adding the @dirname() function requested 
for the Montage application.  I just committed a quick fix but you may 
want to wait until I do some more testing; sorry for the hitch.
 	Justin

On Thu, 1 Jul 2010, Allan Espinosa wrote:

> I updated to the latest cog-trunk and swift-trunk today and got these:
>
> Caused by: org.globus.cog.karajan.workflow.KarajanRuntimeException:
> Ambiguous element: dirname. Possible choices:
>        vdl:dirname
>        swiftscript:dirname
> 2010-07-01 19:37:47,906-0500 INFO  EventBus Near Karajan line: dirname
> @ vdl-int.k, line: 274
> Karajan exception: Ambiguous element: dirname. Possible choices:
>        vdl:dirname
>        swiftscript:dirname
> Caused by: org.globus.cog.karajan.workflow.KarajanRuntimeException:
> Ambiguous element: dirname. Possible choices:
>        vdl:dirname
>        swiftscript:dirname
>
>
> -Allan
>
>

-- 
Justin M Wozniak


From wozniak at mcs.anl.gov  Fri Jul  2 14:40:23 2010
From: wozniak at mcs.anl.gov (Justin M Wozniak)
Date: Fri, 2 Jul 2010 14:40:23 -0500 (CDT)
Subject: [Swift-devel] manual coasters
In-Reply-To: <1277957571.15423.8.camel@blabla2.none>
References: <1277957571.15423.8.camel@blabla2.none>
Message-ID: <alpine.DEB.2.00.1007021436570.2994@wozniak-desktop.mcs.anl.gov>

On Wed, 30 Jun 2010, Mihael Hategan wrote:

> Manual coasters are in trunk. I did some limited testing on localhost.

I'm getting a problem where the original callback URI is null.  Is it 
possible that it is due to this change?

Triggered with null pointer at Settings.java:239

Caused by: java.lang.NullPointerException
 	at 
org.globus.cog.abstraction.coaster.service.job.manager.Settings.setInternalHostname(Settings.java:239)
 	... 11 more
Failed to set configuration
java.lang.IllegalArgumentException: Cannot set: internalHostname to: 
172.17.5.144
 	at 
org.globus.cog.abstraction.coaster.service.job.manager.Settings.set(Settings.java:437)
 	at 
org.globus.cog.abstraction.coaster.service.ServiceConfigurationHandler.requestComplete(ServiceConfigurationHandler.java:39)
 	at 
org.globus.cog.karajan.workflow.service.handlers.RequestHandler.receiveCompleted(RequestHandler.java:84)
 	at 
org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.handleRequest(AbstractKarajanChannel.java:381)
 	at 
org.globus.cog.karajan.workflow.service.channels.AbstractPipedChannel.actualSend(AbstractPipedChannel.java:79)
 	at 
org.globus.cog.karajan.workflow.service.channels.AbstractPipedChannel$Sender.run(AbstractPipedChannel.java:108)

-- 
Justin M Wozniak


From hategan at mcs.anl.gov  Fri Jul  2 14:52:29 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 02 Jul 2010 14:52:29 -0500
Subject: [Swift-devel] manual coasters
In-Reply-To: <alpine.DEB.2.00.1007021436570.2994@wozniak-desktop.mcs.anl.gov>
References: <1277957571.15423.8.camel@blabla2.none>
	<alpine.DEB.2.00.1007021436570.2994@wozniak-desktop.mcs.anl.gov>
Message-ID: <1278100349.27598.0.camel@blabla2.none>

On Fri, 2010-07-02 at 14:40 -0500, Justin M Wozniak wrote:
> On Wed, 30 Jun 2010, Mihael Hategan wrote:
> 
> > Manual coasters are in trunk. I did some limited testing on localhost.
> 
> I'm getting a problem where the original callback URI is null.  Is it 
> possible that it is due to this change?

Likely. The original callback is not supposed to be null.


From hategan at mcs.anl.gov  Fri Jul  2 14:55:53 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 02 Jul 2010 14:55:53 -0500
Subject: [Swift-devel] manual coasters
In-Reply-To: <1278100349.27598.0.camel@blabla2.none>
References: <1277957571.15423.8.camel@blabla2.none>
	<alpine.DEB.2.00.1007021436570.2994@wozniak-desktop.mcs.anl.gov>
	<1278100349.27598.0.camel@blabla2.none>
Message-ID: <1278100553.27598.1.camel@blabla2.none>

On Fri, 2010-07-02 at 14:52 -0500, Mihael Hategan wrote:
> On Fri, 2010-07-02 at 14:40 -0500, Justin M Wozniak wrote:
> > On Wed, 30 Jun 2010, Mihael Hategan wrote:
> > 
> > > Manual coasters are in trunk. I did some limited testing on localhost.
> > 
> > I'm getting a problem where the original callback URI is null.  Is it 
> > possible that it is due to this change?
> 
> Likely. The original callback is not supposed to be null.
> 

Try r2790.


From wozniak at mcs.anl.gov  Fri Jul  2 15:06:03 2010
From: wozniak at mcs.anl.gov (Justin M Wozniak)
Date: Fri, 2 Jul 2010 15:06:03 -0500 (CDT)
Subject: [Swift-devel] manual coasters
In-Reply-To: <1278100553.27598.1.camel@blabla2.none>
References: <1277957571.15423.8.camel@blabla2.none>
	<alpine.DEB.2.00.1007021436570.2994@wozniak-desktop.mcs.anl.gov>
	<1278100349.27598.0.camel@blabla2.none>
	<1278100553.27598.1.camel@blabla2.none>
Message-ID: <alpine.DEB.2.00.1007021505570.2994@wozniak-desktop.mcs.anl.gov>

On Fri, 2 Jul 2010, Mihael Hategan wrote:

> On Fri, 2010-07-02 at 14:52 -0500, Mihael Hategan wrote:
>> On Fri, 2010-07-02 at 14:40 -0500, Justin M Wozniak wrote:
>>> On Wed, 30 Jun 2010, Mihael Hategan wrote:
>>>
>>>> Manual coasters are in trunk. I did some limited testing on localhost.
>>>
>>> I'm getting a problem where the original callback URI is null.  Is it
>>> possible that it is due to this change?
>>
>> Likely. The original callback is not supposed to be null.
>>
>
> Try r2790.

Works.

-- 
Justin M Wozniak


From wozniak at mcs.anl.gov  Fri Jul  2 15:06:34 2010
From: wozniak at mcs.anl.gov (Justin M Wozniak)
Date: Fri, 2 Jul 2010 15:06:34 -0500 (CDT)
Subject: [Swift-devel] dirname directives ambigious
In-Reply-To: <alpine.WNT.2.00.1007021038260.1192@justinwozniak>
References: <AANLkTikwry0nq4M3QGisZ78CUoJoAeE6Z7OkbwbNJWBE@mail.gmail.com>
	<alpine.WNT.2.00.1007021038260.1192@justinwozniak>
Message-ID: <alpine.DEB.2.00.1007021506090.2994@wozniak-desktop.mcs.anl.gov>

On Fri, 2 Jul 2010, Justin M Wozniak wrote:

> 	I introduced this while adding the @dirname() function requested for 
> the Montage application.  I just committed a quick fix but you may want to 
> wait until I do some more testing; sorry for the hitch.

Ok, you can try again now.

-- 
Justin M Wozniak


From dk0966 at cs.ship.edu  Tue Jul  6 03:33:29 2010
From: dk0966 at cs.ship.edu (David Kelly)
Date: Tue, 6 Jul 2010 04:33:29 -0400
Subject: [Swift-devel] Re: Swift configuration interface
In-Reply-To: <AANLkTilOp3vam68FEfAesRjAWsx9_RkqpsfGr2oGclv2@mail.gmail.com>
References: <22726276.770581276704176196.JavaMail.root@zimbra>
	<AANLkTikSkYIvW8ylKX_0olEHZ8zGe2YGQVE3OQS535nB@mail.gmail.com>
	<AANLkTil0DDg0jDgb56QEQiedXB7Np8KmJZOWcGlVMcNA@mail.gmail.com>
	<1277136066.3882.2.camel@blabla2.none>
	<AANLkTimveVxEOiINmiPvCqS4fkFp8SzzddhbmOzMYfSm@mail.gmail.com>
	<1277145574.4729.3.camel@blabla2.none>
	<AANLkTilOp3vam68FEfAesRjAWsx9_RkqpsfGr2oGclv2@mail.gmail.com>
Message-ID: <AANLkTilVaGbj0zPc8SfBauQn7J5YEDHr8oKujFIxEwkp@mail.gmail.com>

Hello,

The newest version of swiftconfig is available through svn at
https://svn.ci.uchicago.edu/svn/vdl2/usertools/swift/swiftconfig.

New features are the automatic replacement of $HOME within site templates,
the ability to add/modify site profiles, and the removal of commands from
the translation catalog.  Details of how it works is also now documented as
POD. "perldoc swiftconfig" will give you all the details (also included as
attachment).

David
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20100706/7878f354/attachment.html>
-------------- next part --------------
NAME
    swiftconfig - Utility for managing Swift configuration

SYNOPSIS
    swiftconfig [-option value]

OVERVIEW
    The swiftconfig program allows users to configure Swift. It allows for
    the adding, removing, and modification of remote sites by utilizing a
    set of standard templates. It also provides a way to quickly add, remove
    and modify translation catalog entries without having to manually edit
    files.

DESCRIPTION
    General operations: 
	-add sitename 		add a site from template 
	-remove site		removes a site from sites.xml 
	-remove command 	removes a command from the catalog 
	-templates 		display all available sites in template 
	-modify site		Specifies the name of a site to modify

    Translation catalog settings: 
	-host hostname		hostname of the translation catalog entry 
	-name name 	 	translation name 
	-path path 	 	full pathname to location of program 
	-status status 	 	installation status (deprecated)
    	-profile setting 	define the profile value for an entry 
	-tcfile filename	explicitly specify a translation file

    Sites settings: 
       -templatefile file 	explicitly set the template file to use 
       -sitesfile file		explicitly set the sites file to use 
       -gridftp GridFTPURL	GridFTP URL 
       -jobuniverse universe 	job manager universe
       -joburl URL 		job manager URL 
       -jobmajor major 		job mager number 
       -jobminor minor		job minor number 
       -directory dir 		work directory 
       -exprovider name         execution provider 
       -exmanager name		execution job manager 
       -exurl URL		execution URL 
       -key key 		profile key 
       -value value 		profile value 
       -namespace name 		profile namespace

EXAMPLES
    List all templates available for adding:
	swiftconfig -templates

    Add a site from template into working sites.xml:
	 swiftconfig -add teraport

    Modify the work directory of a site:
	 swiftconfig -modify teraport -directory /var/tmp

    Remove a site:
	swiftconfig -remove teraport

    Add a new command to translation catalog:
	swiftconfig -name convert -path /usr/local/bin/convert

    Modify an existing command in the translation catalog:
	swiftconfig -name convert -path /usr/bin/convert

    Remove a command from the translation catalog:
	swiftconfig -remove convert

CAVEATS
    Swiftconfig will attempt to automatically determine the location of
    swift configuration files. It first checks for an environment variable
    called $SWIFT_HOME. If that is not found, it will look for the location
    of "swift" in the path, and try to find the configuration files from
    there. This default behavior can be overwridden by manually specifying
    the location of files with -templatefile, -sitesfile, and -tcfile.

    The XML library that swiftconfig uses ignores comments in XML. All
    comments will be stripped from sites.xml as it gets modified.

From wilde at mcs.anl.gov  Thu Jul  8 19:57:14 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 8 Jul 2010 19:57:14 -0500 (CDT)
Subject: [Swift-devel] svn co of cog keeps hanging
Message-ID: <1704803.1527991278637034665.JavaMail.root@zimbra>

Ive tried 3 times in the past few hours to do an svn co of cog, on 2 different MCS machines (login and vanquish).

In each case, the co runs fine for a while (few hundred files?) and then freezes.

I could only kill the svn command with a kill -9, and then it left the tree locked in such a way that svn cleanup couldnt clear the lock.

I think I've occasionally seen similar problems with checkouts of cog from sourceforge freezing.  Is anyone else seeing this problem? Is it common? Any suggested remedies?

Thanks,

Mike


From wilde at mcs.anl.gov  Thu Jul  8 20:59:36 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 8 Jul 2010 20:59:36 -0500 (CDT)
Subject: [Swift-devel] Re: Import Statement
In-Reply-To: <4C367E16.9000805@gmail.com>
Message-ID: <19110871.1528721278640776610.JavaMail.root@zimbra>

Jon, your idea sounds good to me, unless others with a deeper understanding of Java, Python, etc feel we need different semantics.

I think plain pathname-based resolution ala C include files makes sense for Swift. for now.

Ben implemented the initial import feature; he, Mihael, and Justin should weigh in as well.

We also discussed easing the restriction on where in the source file the import statement can be placed; there may be some reason (ease of implementation?) why its constrained to the start of the file as it is now.

- Mike


----- "Jonathan Monette" <jon.monette at gmail.com> wrote:

> Mike and Justin,
>      I think I have found out where to change the way the import 
> statement works.  Right now when you import the file has to be in the
> 
> current directory.  I would like to change this so that you can
> specify 
> an actual path(relative or absolute path)to the script to be imported.
>  
> But I would like your opinion on how should it look.  I am leaning 
> towards the C style i.e. "src/file".  I am open to opinions and 
> discussion.  Maybe the opinion is that it shouldn't be changed.  Just
> 
> need more input for the decision.
> 
> -- 
> Jon
> 
> Computers are incredibly fast, accurate, and stupid. Human beings are
> incredibly slow, inaccurate, and brilliant. Together they are powerful
> beyond imagination.
> - Albert Einstein

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory


From benc at hawaga.org.uk  Fri Jul  9 02:49:43 2010
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 9 Jul 2010 07:49:43 +0000 (GMT)
Subject: [Swift-devel] Re: Import Statement
In-Reply-To: <19110871.1528721278640776610.JavaMail.root@zimbra>
References: <19110871.1528721278640776610.JavaMail.root@zimbra>
Message-ID: <Pine.LNX.4.64.1007090742250.16004@dildano.hawaga.org.uk>


My original implementation was, I think, to try to get something in the 
langauge that was better than running .swift files through cpp.

There is no module or namespace structure in Swift, so using something 
based on filenames makes sense.

The requirement to have imports right at the start probably comes from me 
wanting to import the source code of the included file fairly early in 
processing. If you allow import statements elsewhere, then there is a 
question of "why?" - what behaviour do you expect to be different?

If its to allow import statements to appear anywhere, but have them mean 
the same thing no matter where they appear, then I think it should be 
fairly straightforward to make them work.

I think nested includes won't work correctly in the present 
implementation:

  myprog imports useful lib. myprogr imports stdlib. usefullib imports 
  stdlib.

I think that will give you duplicate imports which will break things. 
That's probably not hard to resolve (eg. check if a particular lib has 
been imported and skip it - don't start doing wierd cpp style ifdefs)

-- 


From wilde at mcs.anl.gov  Fri Jul  9 11:08:55 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 9 Jul 2010 11:08:55 -0500 (CDT)
Subject: [Swift-devel] Re: Import Statement
In-Reply-To: <Pine.LNX.4.64.1007090742250.16004@dildano.hawaga.org.uk>
Message-ID: <30925905.1544251278691735337.JavaMail.root@zimbra>


----- "Ben Clifford" <benc at hawaga.org.uk> wrote:

> 
> The requirement to have imports right at the start probably comes from
> me 
> wanting to import the source code of the included file fairly early in
> 
> processing. If you allow import statements elsewhere, then there is a
> 
> question of "why?" - what behaviour do you expect to be different?
> 
> If its to allow import statements to appear anywhere, but have them
> mean 
> the same thing no matter where they appear, then I think it should be
> 
> fairly straightforward to make them work.

It was to enable them to appear anywhere for textual purposes.
The semantics of same behavior regardless where they appear sounds good.

> I think nested includes won't work correctly in the present 
> implementation:
> 
>   myprog imports useful lib. myprogr imports stdlib. usefullib imports
> 
>   stdlib.
> 
> I think that will give you duplicate imports which will break things.
> 
> That's probably not hard to resolve (eg. check if a particular lib has
> 
> been imported and skip it - don't start doing wierd cpp style ifdefs)

That seems worth doing at some point. For now Jon's initial enhancement of path names will be useful and sufficient.

- Mike


From hategan at mcs.anl.gov  Fri Jul  9 14:21:43 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 09 Jul 2010 14:21:43 -0500
Subject: [Swift-devel] stuff to do
Message-ID: <1278703303.2353.4.camel@blabla2.none>

Mike suggested I send out an email to the list to figure out what the
group-perceived priorities would be for a bunch of items.

1. make swift core faster
2. test/fix coaster file staging
3. standalone coaster service
4. swift shell

The idea is that some recent changes may have shifted the existing
priorities. So think of this from the perspective of
user/application/publication goals rather than what you think would be
"nice to have".

Mihael


From wilde at mcs.anl.gov  Fri Jul  9 15:18:09 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 9 Jul 2010 15:18:09 -0500 (CDT)
Subject: [Swift-devel] Swift hanging on array close?
Message-ID: <7818201.1556751278706689668.JavaMail.root@zimbra>

In the following script, two nested foreach stmts fill a sparse array. @filenames(array) is then passed to an ext mapper.

This seems to hang: the ext mapper is never called.

This code in on the mcs net in ~wilde/swift/lab/sgflow.

Mihael and/or Justin, can you take a look? The FL group needs this for a tutorial next week (and has lots more to do, so a quick fix or workaround this afternoon would be very helpful)

Thanks,

Mike

type file;

# <bus> is integer between 1 and 30
# <power> is in megawatts

app (file o) sgflow (int bus, int power)
{
  sgflow bus power stdout=@o;
}
app (file o) mkgraph (file i)
{
  awk "-f" "/home/turam/tmp/mkgraph.awk" @filename(i) stdout=@o;
}

app (file o) mktable (file i)
{
  awk "-f" "/home/turam/tmp/SGsplitter.awk" stdin=@i;
}

file ofiles[] <ext;exec="ofiles.map">;

string nbus = @arg("nbus","1");
string nplevel = @arg("nplevel", "2");
foreach bus in [1:@toint(nbus)] {
  foreach plevel in [1:@toint(nplevel)] {
    # file o<single_file_mapper; file=@strcat("out.","bus.",bus,".pow.",plevel)>;
    ofiles[bus*@toint(nplevel)+plevel] = sgflow(bus,plevel);
  }
}

file i <ext;exec="mktableinput.sh",o="out.csv",i=@filenames(ofiles)>;

# ^^^^ hangs here - mktableinput.sh is never called.
# trace(@filenames(ofiles)) also hangs.

file otable <"otable.txt">;
otable = mktable(i);

----
ofiles.map is hardcoded to return ofile.3 and ofile.4 - this works, and those files get the expected output.

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory


From wozniak at mcs.anl.gov  Fri Jul  9 16:40:08 2010
From: wozniak at mcs.anl.gov (Justin M Wozniak)
Date: Fri, 9 Jul 2010 16:40:08 -0500 (CDT)
Subject: [Swift-devel] Swift hanging on array close?
In-Reply-To: <7818201.1556751278706689668.JavaMail.root@zimbra>
References: <7818201.1556751278706689668.JavaMail.root@zimbra>
Message-ID: <alpine.DEB.2.00.1007091637280.7496@wozniak-desktop.mcs.anl.gov>


Using @filename() should get the job done.

Also, note that you need file i[] as an array at the end there.  And your 
cat call in mktableinput.sh may not work as desired.

On Fri, 9 Jul 2010, Michael Wilde wrote:

> In the following script, two nested foreach stmts fill a sparse array. 
> @filenames(array) is then passed to an ext mapper.
>
> This seems to hang: the ext mapper is never called.
>
> This code in on the mcs net in ~wilde/swift/lab/sgflow.
>
> Mihael and/or Justin, can you take a look? The FL group needs this for a 
> tutorial next week (and has lots more to do, so a quick fix or 
> workaround this afternoon would be very helpful)
>
> Thanks,
>
> Mike
>
> type file;
>
> # <bus> is integer between 1 and 30
> # <power> is in megawatts
>
> app (file o) sgflow (int bus, int power)
> {
>  sgflow bus power stdout=@o;
> }
> app (file o) mkgraph (file i)
> {
>  awk "-f" "/home/turam/tmp/mkgraph.awk" @filename(i) stdout=@o;
> }
>
> app (file o) mktable (file i)
> {
>  awk "-f" "/home/turam/tmp/SGsplitter.awk" stdin=@i;
> }
>
> file ofiles[] <ext;exec="ofiles.map">;
>
> string nbus = @arg("nbus","1");
> string nplevel = @arg("nplevel", "2");
> foreach bus in [1:@toint(nbus)] {
>  foreach plevel in [1:@toint(nplevel)] {
>    # file o<single_file_mapper; file=@strcat("out.","bus.",bus,".pow.",plevel)>;
>    ofiles[bus*@toint(nplevel)+plevel] = sgflow(bus,plevel);
>  }
> }
>
> file i <ext;exec="mktableinput.sh",o="out.csv",i=@filenames(ofiles)>;
>
> # ^^^^ hangs here - mktableinput.sh is never called.
> # trace(@filenames(ofiles)) also hangs.
>
> file otable <"otable.txt">;
> otable = mktable(i);
>
> ----
> ofiles.map is hardcoded to return ofile.3 and ofile.4 - this works, and those files get the expected output.
>
>

-- 
Justin M Wozniak


From iraicu at cs.uchicago.edu  Sun Jul 11 06:47:59 2010
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Sun, 11 Jul 2010 06:47:59 -0500
Subject: [Swift-devel] CFP: The 3rd ACM Workshop on Many-Task Computing on
 Grids and Supercomputers
 (MTAGS) 2010, co-located with Supercomputing 2010 -- November 15th, 2010
 - New Orleans, LA, USA
Message-ID: <4C39AF6F.7000409@cs.uchicago.edu>

Call for Papers

------------------------------------------------------------------------------------------------
The 3rd ACM Workshop on Many-Task Computing on Grids and Supercomputers (MTAGS) 2010
http://dsl.cs.uchicago.edu/MTAGS10/  
------------------------------------------------------------------------------------------------
November 15th, 2010
New Orleans, Louisiana, USA

Co-located with with IEEE/ACM International Conference for 
High Performance Computing, Networking, Storage and Analysis (SC10) 

================================================================================================
The 3rd workshop on Many-Task Computing on Grids and Supercomputers (MTAGS) will provide 
the scientific community a dedicated forum for presenting new research, development, and 
deployment efforts of large-scale many-task computing (MTC) applications on large scale 
clusters, Grids, Supercomputers, and Cloud Computing infrastructure. MTC, the theme of 
the workshop encompasses loosely coupled applications, which are generally composed of 
many tasks (both independent and dependent tasks) to achieve some larger application 
goal.  This workshop will cover challenges that can hamper efficiency and utilization in 
running applications on large-scale systems, such as local resource manager scalability 
and granularity, efficient utilization of raw hardware, parallel file system contention 
and scalability, data management, I/O management, reliability at scale, and application 
scalability. We welcome paper submissions on all topics related to MTC on large scale 
systems.  Papers will be peer-reviewed, and accepted papers will be published in the 
workshop proceedings as part of the ACM digital library (pending approval). The workshop 
will be co-located with the IEEE/ACM Supercomputing 2010 Conference in New Orleans 
Louisiana on November 15th, 2010. For more information, please see 
http://dsl.cs.uchicago.edu/MTAGS010/.

Scope
------------------------------------------------------------------------------------------------
This workshop will focus on the ability to manage and execute large scale applications 
on today's largest clusters, Grids, and Supercomputers. Clusters with 50K+ processor 
cores are now online (e.g. TACC Sun Constellation System - Ranger), Grids (e.g. TeraGrid) 
with a dozen sites and 100K+ processors, and supercomputers with 150K~200K processors 
(e.g. IBM BlueGene/P, Cray XT5); furthermore, new supercomputers are scheduled to come 
online with 300K processor-cores and more than 1M threads (e.g. IBM Blue Waters). Large 
clusters and supercomputers have traditionally been high performance computing (HPC) 
systems, as they are efficient at executing tightly coupled parallel jobs within a 
particular machine with low-latency interconnects; the applications typically use message 
passing interface (MPI) to achieve the needed inter-process communication. On the other 
hand, Grids have been the preferred platform for more loosely coupled applications that 
tend to be managed and executed through workflow systems, commonly known to fit in the 
high-throughput computing (HTC) paradigm. 

Many-task computing (MTC) aims to bridge the gap between two computing paradigms, HTC and 
HPC. MTC is reminiscent to HTC, but it differs in the emphasis of using many computing 
resources over short periods of time to accomplish many computational tasks (i.e. including 
both dependent and independent tasks), where the primary metrics are measured in seconds 
(e.g. FLOPS, tasks/s, MB/s I/O rates), as opposed to operations (e.g. jobs) per month. MTC 
denotes high-performance computations comprising multiple distinct activities, coupled via 
file system operations. Tasks may be small or large, uniprocessor or multiprocessor, 
compute-intensive or data-intensive. The set of tasks may be static or dynamic, homogeneous 
or heterogeneous, loosely coupled or tightly coupled. The aggregate number of tasks, 
quantity of computing, and volumes of data may be extremely large. MTC includes loosely 
coupled applications that are generally communication-intensive but not naturally expressed 
using standard message passing interface commonly found in HPC, drawing attention to the 
many computations that are heterogeneous but not "happily" parallel.

There is more to HPC than tightly coupled MPI, and more to HTC than embarrassingly parallel 
long running jobs. Like HPC applications, and science itself, applications are becoming 
increasingly complex opening new doors for many opportunities to apply HPC in new ways if 
we broaden our perspective. Some applications have just so many simple tasks that managing 
them is hard. Applications that operate on or produce large amounts of data need 
sophisticated data management in order to scale. There exist applications that involve many 
tasks, each composed of tightly coupled MPI tasks. Loosely coupled applications often have 
dependencies among tasks, and typically use files for inter-process communication. Efficient 
support for these sorts of applications on existing large scale systems will involve 
substantial technical challenges and will have big impact on science.

Today's existing HPC systems are a viable platform to host MTC applications. However, some 
challenges arise in large scale applications when run on large scale systems, which can hamper 
the efficiency and utilization of these large scale systems.  These challenges vary from local 
resource manager scalability and granularity, efficient utilization of the raw hardware, 
parallel file system contention and scalability, data management, I/O management, reliability 
at scale, application scalability, and understanding the limitations of the HPC systems in order 
to identify good candidate MTC applications. Furthermore, the MTC paradigm can be naturally 
applied to the emerging Cloud Computing paradigm due to its loosely coupled nature, which is 
being adopted by industry as the next wave of technological advancement to reduce operational 
costs while improving efficiencies in large scale infrastructures.   

To see last year's workshop program agenda, and accepted papers and presentations, please see 
http://dsl.cs.uchicago.edu/MTAGS09/; for the initial workshop we ran in 2008, please see 
http://dsl.cs.uchicago.edu/MTAGS08/. We also ran a special issue on Many-Task Computing in the 
IEEE Transactions on Parallel and Distributed Systems (TPDS) which will appear in November 2010, 
which can be found at http://dsl.cs.uchicago.edu/TPDS_MTC/. We, the workshop organizers, also 
published two papers that are highly relevant to this workshop. One paper is titled "Toward 
Loosely Coupled Programming on Petascale Systems", and was published in SC08; the second paper 
is titled "Many-Task Computing for Grids and Supercomputers", which was published in MTAGS08.


Topics
------------------------------------------------------------------------------------------------
We invite the submission of original work that is related to the topics below. The papers can be 
either short (5 pages) position papers, or long (10 pages) research papers. Topics of interest 
include (in the context of Many-Task Computing):
* Compute Resource Management 
  * Scheduling
  * Job execution frameworks
  * Local resource manager extensions
  * Performance evaluation of resource managers in use on large scale systems
  * Dynamic resource provisioning
  * Techniques to manage many-core resources and/or GPUs
  * Challenges and opportunities in running many-task workloads on HPC systems
  * Challenges and opportunities in running many-task workloads on Cloud Computing infrastructure
* Storage architectures and implementations
  * Distributed file systems
  * Parallel file systems
  * Distributed meta-data management 
  * Content distribution systems for large data
  * Data caching frameworks and techniques
  * Data management within and across data centers
  * Data-aware scheduling
  * Data-intensive computing applications
  * Eventual-consistency storage usage and management
* Programming models and tools
  * Map-reduce and its generalizations
  * Many-task computing middleware and applications
  * Parallel programming frameworks
  * Ensemble MPI techniques and frameworks
  * Service-oriented science applications
* Large-Scale Workflow Systems
  * Workflow system performance and scalability analysis
  * Scalability of workflow systems
  * Workflow infrastructure and e-Science middleware
  * Programming Paradigms and Models
* Large-Scale Many-Task Applications
  * High-throughput computing (HTC) applications
  * Data-intensive applications
  * Quasi-supercomputing applications, deployments, and experiences
  * Performance Evaluation
* Performance evaluation
  * Real systems
  * Simulations
  * Reliability of large systems


Paper Submission and Publication
------------------------------------------------------------------------------------------------
Authors are invited to submit papers with unpublished, original work of not more than 10 pages of 
double column text using single spaced 10 point size on 8.5 x 11 inch pages, as per ACM 8.5 x 11 
manuscript guidelines (http://www.acm.org/publications/instructions_for_proceedings_volumes); 
document templates can be found at http://www.acm.org/sigs/publications/proceedings-templates. We 
are also seeking position papers of no more than 5 pages in length. A 250 word abstract (PDF 
format) must be submitted online at https://cmt.research.microsoft.com/MTAGS2010/ before the 
deadline of August 25th, 2010 at 11:59PM PST; the final 5/10 page papers in PDF format will be 
due on September 1st, 2010 at 11:59PM PST. Papers will be peer-reviewed, and accepted papers will 
be published in the workshop proceedings as part of the ACM digital library (pending approval). 
Notifications of the paper decisions will be sent out by October 1st, 2010. Selected excellent 
work may be eligible for additional post-conference publication	as journal articles or book 
chapters; see last year's special issue in the IEEE Transactions on Parallel and Distributed 
Systems (TPDS) at http://dsl.cs.uchicago.edu/TPDS_MTC/. Submission implies the willingness of at 
least one of the authors to register and present the paper. For more information, please visit 
http://dsl.cs.uchicago.edu/MTAGS10/. 


Important Dates
------------------------------------------------------------------------------------------------
*	Abstract Due:			    August 25th, 2010
*	Papers Due:			        September 1st, 2010
*	Notification of Acceptance:	October 1st, 2010
*	Camera Ready Papers Due:	November 1st, 2010
*	Workshop Date:			    November 15th, 2010


Committee Members
------------------------------------------------------------------------------------------------
Workshop Chairs
*	Ioan Raicu, Illinois Institute of Technology 
*	Ian Foster, University of Chicago & Argonne National Laboratory
*	Yong Zhao, Microsoft

Technical Committee
*	Mihai Budiu, Microsoft Research, USA
*	Rajkumar Buyya, University of Melbourne, Australia
*	Alok Choudhary, Northwestern University, USA
*	Jack Dongara, University of Tennessee, USA
*	Catalin Dumitrescu, Fermi National Labs, USA
*	Geoffrey Fox, Indiana University, USA
* 	Robert Grossman, University of Illinois at Chicago, USA
*	Alexandru Iosup, Delft University of Technology, Netherlands
*	Florin Isaila, Universidad Carlos III de Madrid, Spain
*	Daniel Katz, University of Chicago, USA
*	Tevfik Kosar, Louisiana State University, USA
*	Zhiling Lan, Illinois Institute of Technology, USA
*	Ignacio Llorente, Universidad Complutense de Madrid, Spain
*	Arthur Maccabe, Oak Ridge National Labs, USA
*	Reagan Moore, University of North Carolina, Chappel Hill, USA
*	Manish Parashar, Rutgers University, USA
*	Jose Moreira, IBM Research, USA
*	Marlon Pierce, Indiana University, USA
*	Lavanya Ramakrishnan, Lawrence Berkeley National Laboratory, USA
*	Matei Ripeanu, University of British Columbia, Canada
*	Alain Roy, University of Wisconsin Madison, USA
*	Xian-He Sun, Illinois Institute of Technology, USA
*	Edward Walker, Texas Advanced Computing Center, USA
*	Mike Wilde, University of Chicago & Argonne National Laboratory, USA
*	Matthew Woitaszek, The University Coorporation for Atmospheric Research, USA
*	Ken Yocum, University of California San Diego, USA


-- 
=================================================================
Ioan Raicu, Ph.D.
Assistant Professor
=================================================================
Computer Science Department
Illinois Institute of Technology
10 W. 31st Street 
Chicago, IL 60616 
=================================================================
Cel:   1-847-722-0876
Email: iraicu at iit.edu
Web:   http://www.eecs.northwestern.edu/~iraicu/
=================================================================
=================================================================


-- 
=================================================================
Ioan Raicu, Ph.D.
NSF/CRA Computing Innovation Fellow
=================================================================
Center for Ultra-scale Computing and Information Security (CUCIS)
Department of Electrical Engineering and Computer Science
Northwestern University
2145 Sheridan Rd, Tech M384 
Evanston, IL 60208-3118
=================================================================
Cel:   1-847-722-0876
Tel:   1-847-491-8163
Email: iraicu at eecs.northwestern.edu
Web:   http://www.eecs.northwestern.edu/~iraicu/
       https://wiki.cucis.eecs.northwestern.edu/
=================================================================
=================================================================


From wilde at mcs.anl.gov  Mon Jul 12 11:52:44 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 12 Jul 2010 11:52:44 -0500 (CDT)
Subject: [Swift-devel] stuff to do
In-Reply-To: <1278703303.2353.4.camel@blabla2.none>
Message-ID: <33415753.14491278953564701.JavaMail.root@zimbra>

Here's my view on these:

> 2. test/fix coaster file staging

This would be useful for both real apps and (I think) for CDM testing. I would do this first.

I would then add:

5. Adjustments needed, if any, on multicore handling in PBS and SGE provider.

6. Adjustments and fixes for reliability and logging, if needed, in Condor-G provider.

I expect that 5 & 6 would be small tasks, and they are not yet clearly defined. I think that other people could do them.

Maybe add:

7. -tui fixes. Seems not to be working so well on recent tests; several of the screens, including the source-code view, seem not to be working.

Then:

> 1. make swift core faster

I would do this second; I think you said you need about 7-10 days to try things and see what can be done, maybe more after that if the exploration suggests things that will take much (re)coding?

> 3. standalone coaster service

The current manual coasters is proving useful. 
> 4. swift shell

Lets defer (4) for now; if we can instead run swift repeatedly and either have the coaster worker pool re-connect quickly to each new swift, or quickly start new pools within the same cluster job(s), that would suffice for now.

Justin, do you want to weigh in on these?

Thanks,

Mike


> The idea is that some recent changes may have shifted the existing
> priorities. So think of this from the perspective of
> user/application/publication goals rather than what you think would
> be
> "nice to have".
> 
> Mihael
> 
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory


From wilde at mcs.anl.gov  Mon Jul 12 13:50:25 2010
From: wilde at mcs.anl.gov (wilde at mcs.anl.gov)
Date: Mon, 12 Jul 2010 13:50:25 -0500 (CDT)
Subject: [Swift-devel] swift config files for running on multiple multicore
	machines
In-Reply-To: <29347314.19351278960561692.JavaMail.root@zimbra>
Message-ID: <1321903.19401278960625940.JavaMail.root@zimbra>

attached.

You need to set up an ssh key, and put its passphrase in ~/.ssh/auth.defaults.

make sure that auth.defaults is mode 600 (not readable by others)

You also need to create a GSI proxy on the submit host, and make sure that X509_CERT_DIR on the target hosts is set to a valid CA certificate dir:

export X509_CERT_DIR=/home/wilde/TRUSTEDCA
export X509_CADIR=/home/wilde/TRUSTEDCA

- Mike

-------------- next part --------------
A non-text attachment was scrubbed...
Name: coasters.xml
Type: application/xml
Size: 9474 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20100712/611a4485/attachment.xml>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: tc
Type: application/octet-stream
Size: 6144 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20100712/611a4485/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: auth.defaults.example
Type: application/octet-stream
Size: 2100 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20100712/611a4485/attachment-0001.obj>

From wilde at mcs.anl.gov  Mon Jul 12 20:17:43 2010
From: wilde at mcs.anl.gov (wilde at mcs.anl.gov)
Date: Mon, 12 Jul 2010 20:17:43 -0500 (CDT)
Subject: [Swift-devel] Re: MCS cluster
In-Reply-To: <5658524.32951278983719554.JavaMail.root@zimbra>
Message-ID: <17271120.32991278983863922.JavaMail.root@zimbra>


----- "Jonathan Monette" <jon.monette at gmail.com> wrote:

> Mike,
>      Why am I not able to submit tasks to the MCS machines from my 
> laptop?  Why does it have to be from another MCS machine?

Basically because *I think* the MCS server machines are not visible outside the MCS firewall - you need to ssh into them via a login.mcs host. 

Now, that's what I *think*, but you should verify by testing and reading the MCS FAQs.

If I am correct, you *might* be able to get around this with clever ssh tunneling, but that will also be work to figure out.

I cant recall what I did the other day to get around the problems with svn co of cog from sourceforge, but thats another angle you could attack.

If symlinked path synonyms dont get in the way, you *might* be able to co a clean cog tree to a CI host and then tar it to the MCS net. Im not sure if the cog-checkout problem is unique to mcs-to-sourceforge, or happens elsewhere. I think I recall it happening on CI hosts as well. Maybe its caused by fast client hosts or networks that cause sourceforge to throttle back (and get hung in the process)???

I'm cc'ing swift-devel here for other ideas.

- Mike

> Because I 
> cannot checkout the swift trunk to machines I cannot use the fixes
> that 
> have been commited to run the Montage wrappers.
> 
> -- 
> Jon
> 
> Computers are incredibly fast, accurate, and stupid. Human beings are
> incredibly slow, inaccurate, and brilliant. Together they are powerful
> beyond imagination.
> - Albert Einstein

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory


From wilde at mcs.anl.gov  Tue Jul 13 00:23:03 2010
From: wilde at mcs.anl.gov (wilde at mcs.anl.gov)
Date: Tue, 13 Jul 2010 00:23:03 -0500 (CDT)
Subject: [Swift-devel] Re: Coaster problems with proxy and gsi explained
In-Reply-To: <1794127.34651278990828698.JavaMail.root@zimbra>
Message-ID: <32423910.35991278998583868.JavaMail.root@zimbra>

For whatever reason, here are the issues:

- mcs is running Ubuntu 10.x on these machines, and seems to no longer include any Sun Javas in its .soft options. So I needed to bring in my own Java.

- mcs doesnt have Globus or OSG packages, so I needed to bring in my own CA cert dir

- the login shell, by default doesnt process .bashrc; the call to .bashrc needs to go in your .profile or similar, and was missing from mine.

- after much reading up on bash startup, and fiddling, I concluded that when swift launches coasters with the ssh provider, only the .bashrc runs, not the profile, so one needs to essentially force the .profile to run in this case, or else set PATH and X509_CERT_DIR from .profile. What I did is have .bashrc call .profile if it was not previously run.

- if either .profile or .bashrc sends anything to stdout, you get this cryptic, mysterious message from swift:
---
stomp$ swift -tc.file tc -sites.file crush.xml cat.swift
Swift svn swift-r3430 cog-r2798

RunID: 20100713-0010-d3r5x92f
Progress:
Exception in thread "sftp subsystem 1" java.lang.OutOfMemoryError: Java heap space
	at com.sshtools.j2ssh.subsystem.SubsystemClient.run(SubsystemClient.java:198)
	at java.lang.Thread.run(Unknown Source)

---
Hence, dont do that :)

So, Jon, you may want to look at my .profile and .bashrc, or do whatever is needed to set JAVA and X509_CERT_DIR correctly, until we figure how how to do this all more cleanly.

- Lastly, getting a proxy from TeraGrid works just fine:

cog-myproxy -S -h myproxy.teragrid.org -p 7514 -l wilde -S anonget

The errors I saw when we tried this last were all due to the env var issues above.

- Mike


----- "Michael Wilde" <wilde at mcs.anl.gov> wrote:

> Mihael, Jon,
> 
> It seems that the problems we were seeing this after noon (in my
> tests) was due to a bad .bashrc.
> 
> I have verified the the cog-myproxy method of creating a proxy for
> coasters-ssh does in fact work.
> 
> Im still trying to debug what env vars are coming from my .bashrc, why
> they are not supplied solely by my .soft, and what in my .bashrc was
> causing the failures I was seeing all afternoon.
> 
> But reverting to the simple .bashrc which I was using last Thu (under
> the mistaken impression that it had no effect) makes coasters work
> again for me, both with a DOEGrids cert and with a proxy made by
> cog-myproxy-logon from my TeraGrid NCSA cert.
> 
> - Mike
> 
> 
> --
> Michael Wilde
> Computation Institute, University of Chicago
> Mathematics and Computer Science Division
> Argonne National Laboratory

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory


From hategan at mcs.anl.gov  Tue Jul 13 00:52:05 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 13 Jul 2010 00:52:05 -0500
Subject: [Swift-devel] Re: Coaster problems with proxy and gsi explained
In-Reply-To: <32423910.35991278998583868.JavaMail.root@zimbra>
References: <32423910.35991278998583868.JavaMail.root@zimbra>
Message-ID: <1279000325.11800.2.camel@blabla2.none>

On Tue, 2010-07-13 at 00:23 -0500, wilde at mcs.anl.gov wrote:
> For whatever reason, here are the issues:
> 
> - mcs is running Ubuntu 10.x on these machines, and seems to no longer include any Sun Javas in its .soft options. So I needed to bring in my own Java.
> 
> - mcs doesnt have Globus or OSG packages, so I needed to bring in my own CA cert dir
> 
> - the login shell, by default doesnt process .bashrc; the call to .bashrc needs to go in your .profile or similar, and was missing from mine.
> 
> - after much reading up on bash startup, and fiddling, I concluded that when swift launches coasters with the ssh provider, only the .bashrc runs, not the profile, so one needs to essentially force the .profile to run in this case, or else set PATH and X509_CERT_DIR from .profile. What I did is have .bashrc call .profile if it was not previously run.

Or put those env vars in sites.xml.

In a sense, I would probably recommend that. It seems to be the only
"portable" way.

> 
> - if either .profile or .bashrc sends anything to stdout, you get this cryptic, mysterious message from swift:
> ---
> stomp$ swift -tc.file tc -sites.file crush.xml cat.swift
> Swift svn swift-r3430 cog-r2798
> 
> RunID: 20100713-0010-d3r5x92f
> Progress:
> Exception in thread "sftp subsystem 1" java.lang.OutOfMemoryError: Java heap space
> 	at com.sshtools.j2ssh.subsystem.SubsystemClient.run(SubsystemClient.java:198)
> 	at java.lang.Thread.run(Unknown Source)

That is funny. In other words a bug. Is there any easy way to reproduce
that?

> 
> ---
> Hence, dont do that :)

But I wanna!


From wilde at mcs.anl.gov  Wed Jul 14 10:36:38 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 14 Jul 2010 10:36:38 -0500 (CDT)
Subject: [Swift-devel] Swift trunk seems broken
Message-ID: <21879375.100861279121798319.JavaMail.root@zimbra>

I get the error below from a trunk I just updated and built:

bri$ swift cats.swift
Swift svn swift-r3435 cog-r2799

RunID: 20100714-1033-npcd74t6
Progress:
Uncaught exception: java.lang.IncompatibleClassChangeError: Expecting non-static method org.griphyn.vdl.karajan.lib.AbsFileName.filename(Lorg/globus/cog/karajan/stack/VariableStack;)[Ljava/lang/String; in vdl:absfilename @ vdl.k, line: 79
java.lang.IncompatibleClassChangeError: Expecting non-static method org.griphyn.vdl.karajan.lib.AbsFileName.filename(Lorg/globus/cog/karajan/stack/VariableStack;)[Ljava/lang/String;
	at org.griphyn.vdl.karajan.lib.AbsFileName.function(AbsFileName.java:16)
	at org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67)
	at org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192)

script is:

type file;

app (file o) cat (file i)
{
  cat @i stdout=@o;
}

file out[];
file data<"data.txt">;

foreach j in [0:19] {
  out[j] = cat(data);
}


From wozniak at mcs.anl.gov  Wed Jul 14 10:41:52 2010
From: wozniak at mcs.anl.gov (Justin M Wozniak)
Date: Wed, 14 Jul 2010 10:41:52 -0500 (CDT)
Subject: [Swift-devel] Re: Swift trunk seems broken
In-Reply-To: <21879375.100861279121798319.JavaMail.root@zimbra>
References: <21879375.100861279121798319.JavaMail.root@zimbra>
Message-ID: <alpine.DEB.2.00.1007141040380.7496@wozniak-desktop.mcs.anl.gov>


First glance: I think IncompatibleClassChangeError means you have to clean 
and build again.  I'll try a few things here...

On Wed, 14 Jul 2010, Michael Wilde wrote:

> I get the error below from a trunk I just updated and built:
>
> bri$ swift cats.swift
> Swift svn swift-r3435 cog-r2799
>
> RunID: 20100714-1033-npcd74t6
> Progress:
> Uncaught exception: java.lang.IncompatibleClassChangeError: Expecting non-static method org.griphyn.vdl.karajan.lib.AbsFileName.filename(Lorg/globus/cog/karajan/stack/VariableStack;)[Ljava/lang/String; in vdl:absfilename @ vdl.k, line: 79
> java.lang.IncompatibleClassChangeError: Expecting non-static method org.griphyn.vdl.karajan.lib.AbsFileName.filename(Lorg/globus/cog/karajan/stack/VariableStack;)[Ljava/lang/String;
> 	at org.griphyn.vdl.karajan.lib.AbsFileName.function(AbsFileName.java:16)
> 	at org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67)
> 	at org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192)
>
> script is:
>
> type file;
>
> app (file o) cat (file i)
> {
>  cat @i stdout=@o;
> }
>
> file out[];
> file data<"data.txt">;
>
> foreach j in [0:19] {
>  out[j] = cat(data);
> }
>
>

-- 
Justin M Wozniak


From wilde at mcs.anl.gov  Wed Jul 14 12:34:20 2010
From: wilde at mcs.anl.gov (wilde at mcs.anl.gov)
Date: Wed, 14 Jul 2010 12:34:20 -0500 (CDT)
Subject: [Swift-devel] Re: Swift trunk seems broken
In-Reply-To: <17937832.108611279128784224.JavaMail.root@zimbra>
Message-ID: <7158009.108651279128860864.JavaMail.root@zimbra>

I extracted fresh trunks for cog and swift and rebuilt. Now I get "No 'proxy' provider or alias found".
(see below)

Why is it looking for a proxy provider? Im expecting it to use the default sites and tc files, and local provider and /bin/cat on localhost.

Is this line getting involved?

vdl-int-staging.k:							stagingMethod := vdl:siteProfile(rhost, "swift:stagingMethod", default="proxy") 	

Justin, Jon, you both said my cats.swift test worked for you. Are you using the default tc and sites files? And does your version ID say Swift svn swift-r3435 cog-r2799?

- Mike


bri$ swift cat.swift
Swift svn swift-r3435 cog-r2799

RunID: 20100714-1226-vlpmmtm7
Progress:
Execution failed:
	Exception in cat:
Arguments: [data.txt]
Host: localhost
Directory: cat-20100714-1226-vlpmmtm7/jobs/7/cat-7eo37tujTODO: outs
----

Caused by:
	No 'proxy' provider or alias found. Available providers: [cobalt, gsiftp, coaster, dcache, webdav, ssh, gt4, gt2, condor, http, pbs, ftp, gsiftp-old, local]. Aliases: gt4 <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor <-> condorlocal; cobalt <-> cobaltlocal; gsiftp <-> gridftp; pbs <-> pbslocal; gsiftp-old <-> gridftp-old; local <-> file; 
bri$ cat cat.swift
type file;

app (file o) cat (file i)
{
  cat @i stdout=@o;
}

file data<"data.txt">;
file out<"out.txt">;
out = cat(data);
bri$ 


----- "Justin M Wozniak" <wozniak at mcs.anl.gov> wrote:

> First glance: I think IncompatibleClassChangeError means you have to
> clean 
> and build again.  I'll try a few things here...
> 
> On Wed, 14 Jul 2010, Michael Wilde wrote:
> 
> > I get the error below from a trunk I just updated and built:
> >
> > bri$ swift cats.swift
> > Swift svn swift-r3435 cog-r2799
> >
> > RunID: 20100714-1033-npcd74t6
> > Progress:
> > Uncaught exception: java.lang.IncompatibleClassChangeError:
> Expecting non-static method
> org.griphyn.vdl.karajan.lib.AbsFileName.filename(Lorg/globus/cog/karajan/stack/VariableStack;)[Ljava/lang/String;
> in vdl:absfilename @ vdl.k, line: 79
> > java.lang.IncompatibleClassChangeError: Expecting non-static method
> org.griphyn.vdl.karajan.lib.AbsFileName.filename(Lorg/globus/cog/karajan/stack/VariableStack;)[Ljava/lang/String;
> > 	at
> org.griphyn.vdl.karajan.lib.AbsFileName.function(AbsFileName.java:16)
> > 	at
> org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67)
> > 	at
> org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192)
> >
> > script is:
> >
> > type file;
> >
> > app (file o) cat (file i)
> > {
> >  cat @i stdout=@o;
> > }
> >
> > file out[];
> > file data<"data.txt">;
> >
> > foreach j in [0:19] {
> >  out[j] = cat(data);
> > }
> >
> >
> 
> -- 
> Justin M Wozniak

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory


From wilde at mcs.anl.gov  Wed Jul 14 12:52:16 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 14 Jul 2010 12:52:16 -0500 (CDT)
Subject: [Swift-devel] Re: Swift trunk seems broken
In-Reply-To: <7158009.108651279128860864.JavaMail.root@zimbra>
Message-ID: <16929110.109221279129936989.JavaMail.root@zimbra>

What seems to be happing is thatI had a ~/.swift/swift.properties file with "use.provider.staging = true". Ive been using that for as long as I can recall.

Seems that in some recent rev, provider staging was changed to use a "proxy" provider?

I turned off use.provider.staging and now my basic tests work again.

- Mike

----- wilde at mcs.anl.gov wrote:

> I extracted fresh trunks for cog and swift and rebuilt. Now I get "No
> 'proxy' provider or alias found".
> (see below)
> 
> Why is it looking for a proxy provider? Im expecting it to use the
> default sites and tc files, and local provider and /bin/cat on
> localhost.
> 
> Is this line getting involved?
> 
> vdl-int-staging.k:							stagingMethod := vdl:siteProfile(rhost,
> "swift:stagingMethod", default="proxy") 	
> 
> Justin, Jon, you both said my cats.swift test worked for you. Are you
> using the default tc and sites files? And does your version ID say
> Swift svn swift-r3435 cog-r2799?
> 
> - Mike
> 
> 
> 
> bri$ swift cat.swift
> Swift svn swift-r3435 cog-r2799
> 
> RunID: 20100714-1226-vlpmmtm7
> Progress:
> Execution failed:
> 	Exception in cat:
> Arguments: [data.txt]
> Host: localhost
> Directory: cat-20100714-1226-vlpmmtm7/jobs/7/cat-7eo37tujTODO: outs
> ----
> 
> Caused by:
> 	No 'proxy' provider or alias found. Available providers: [cobalt,
> gsiftp, coaster, dcache, webdav, ssh, gt4, gt2, condor, http, pbs,
> ftp, gsiftp-old, local]. Aliases: gt4 <-> gt4.0.2, gt4.0.1, gt3.9.5,
> gt4.0.0; condor <-> condorlocal; cobalt <-> cobaltlocal; gsiftp <->
> gridftp; pbs <-> pbslocal; gsiftp-old <-> gridftp-old; local <-> file;
> 
> bri$ cat cat.swift
> type file;
> 
> app (file o) cat (file i)
> {
>   cat @i stdout=@o;
> }
> 
> file data<"data.txt">;
> file out<"out.txt">;
> out = cat(data);
> bri$ 
> 
> 
> 
> ----- "Justin M Wozniak" <wozniak at mcs.anl.gov> wrote:
> 
> > First glance: I think IncompatibleClassChangeError means you have
> to
> > clean 
> > and build again.  I'll try a few things here...
> > 
> > On Wed, 14 Jul 2010, Michael Wilde wrote:
> > 
> > > I get the error below from a trunk I just updated and built:
> > >
> > > bri$ swift cats.swift
> > > Swift svn swift-r3435 cog-r2799
> > >
> > > RunID: 20100714-1033-npcd74t6
> > > Progress:
> > > Uncaught exception: java.lang.IncompatibleClassChangeError:
> > Expecting non-static method
> >
> org.griphyn.vdl.karajan.lib.AbsFileName.filename(Lorg/globus/cog/karajan/stack/VariableStack;)[Ljava/lang/String;
> > in vdl:absfilename @ vdl.k, line: 79
> > > java.lang.IncompatibleClassChangeError: Expecting non-static
> method
> >
> org.griphyn.vdl.karajan.lib.AbsFileName.filename(Lorg/globus/cog/karajan/stack/VariableStack;)[Ljava/lang/String;
> > > 	at
> >
> org.griphyn.vdl.karajan.lib.AbsFileName.function(AbsFileName.java:16)
> > > 	at
> > org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67)
> > > 	at
> >
> org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192)
> > >
> > > script is:
> > >
> > > type file;
> > >
> > > app (file o) cat (file i)
> > > {
> > >  cat @i stdout=@o;
> > > }
> > >
> > > file out[];
> > > file data<"data.txt">;
> > >
> > > foreach j in [0:19] {
> > >  out[j] = cat(data);
> > > }
> > >
> > >
> > 
> > -- 
> > Justin M Wozniak
> 
> -- 
> Michael Wilde
> Computation Institute, University of Chicago
> Mathematics and Computer Science Division
> Argonne National Laboratory
> 
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory


From hategan at mcs.anl.gov  Wed Jul 14 12:59:04 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 14 Jul 2010 12:59:04 -0500
Subject: [Swift-devel] Re: Swift trunk seems broken
In-Reply-To: <16929110.109221279129936989.JavaMail.root@zimbra>
References: <16929110.109221279129936989.JavaMail.root@zimbra>
Message-ID: <1279130344.26184.1.camel@blabla2.none>

On Wed, 2010-07-14 at 12:52 -0500, Michael Wilde wrote:
> What seems to be happing is thatI had a ~/.swift/swift.properties file with "use.provider.staging = true". Ive been using that for as long as I can recall.

It's odd that it worked. But it may be that the version you were running
ignored that property.

> 
> Seems that in some recent rev, provider staging was changed to use a "proxy" provider?

No such thing as a "proxy" provider, but there is a proxy staging
method.


From aespinosa at cs.uchicago.edu  Wed Jul 14 17:47:56 2010
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Wed, 14 Jul 2010 17:47:56 -0500
Subject: [Swift-devel] swift-plot-log broken on trunk
Message-ID: <AANLkTik5ayojDO2AR90DcEKodMeadA11z3qFhyobhipa@mail.gmail.com>

I guess some required plots are no longer found in the new logs:

$ swift-plot-log sleep-LGU_condor.log execstages.png
Log file path is /home/aespinosa/workflows/cybershake/sleep-LGU_condor.log
Log is in directory /home/aespinosa/workflows/cybershake
Log basename is sleep-LGU_condor
Now in directory /tmp/swift-plot-log-TCmEupYGhEd18102
rm -f start-times.data kickstart-times.data start-time.tmp
end-time.tmp threads.list tasks.list log *.data *.shifted *.png
*.event *
.coloured-event *.total *.tmp *.transitions *.last
karatasks-type-counts.txt index.html *.lastsummary execstages.plot
total.plot col
our.plot jobs-sites.html jobs.retrycount.summary kickstart.stats
execution-counts.txt site-duration.txt jobs.retrycount sp.plot
karatasks.coloured-sorted-event *.cedps *.stats t.inf *.seenstates
tmp-* clusterstats  trname-summary sites-list.data.nm info-md5sums
pse2d-tmp.eip karajan.html falkon.html execute2.html info.html
execute.html kickstart.html scheduler.html assorted.html
log-to-execute2-transitions <
/home/aespinosa/workflows/cybershake/sleep-LGU_condor.log >
execute2.transitions
compute-t-inf > t.inf <
/home/aespinosa/workflows/cybershake/sleep-LGU_condor.log
cat execute2.transitions | swap-and-sort | transitions-to-event > execute2.event
log-to-dostagein-transitions <
/home/aespinosa/workflows/cybershake/sleep-LGU_condor.log >
dostagein.transitions
cat dostagein.transitions | swap-and-sort | transitions-to-event >
dostagein.event
log-to-dostageout-transitions <
/home/aespinosa/workflows/cybershake/sleep-LGU_condor.log >
dostageout.transitions
cat dostageout.transitions | swap-and-sort | transitions-to-event >
dostageout.event
extract-start-time > start-time.tmp
execstages-plot
Can't parse line  0 last-event-line no previous event

cat: workflow.event: No such file or directory

gnuplot> plot 'esp.execute2.tmp' with vector arrowstyle 1 title
'execute2',      'esp.dostagein.tmp' with vector arrowstyle 2 title
'dostagein',      'esp.dostageout.tmp' with vector arrowstyle 3 title
'dostageout'
                                                                          ^
         "execstages.plot", line 15: no data point found in specified file

make: *** [execstages.png] Error 1

-- 
Allan M. Espinosa <http://amespinosa.wordpress.com>
PhD student, Computer Science
University of Chicago <http://people.cs.uchicago.edu/~aespinosa>


From wilde at mcs.anl.gov  Thu Jul 15 11:47:01 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 15 Jul 2010 11:47:01 -0500 (CDT)
Subject: [Swift-devel] _swiftwrap logging causes problems for users on
 shared computer servers
Message-ID: <16809659.148551279212421442.JavaMail.root@zimbra>

This line in _swiftwrap is causing problems when multiple users are running coasters on the same MCS compute server:

--
COMMANDLINE=$@

echo $0 $COMMANDLINE >> /tmp/swiftwrap.out
--

It creates a file owned by the user, and causes the next user's jobs to fail (and/or generate a message).

This line looks to me like a debugging fossil. Im going to comment out this echo in my test trunk (which several of us are testing from) and see if it has any ill effect.

I think we should leave it disabled, but can leave it in as a comment for debugging hints. Need to check when/why it was added.

- Mike

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory


From wozniak at mcs.anl.gov  Thu Jul 15 11:57:29 2010
From: wozniak at mcs.anl.gov (Justin M Wozniak)
Date: Thu, 15 Jul 2010 11:57:29 -0500 (Central Daylight Time)
Subject: [Swift-devel] _swiftwrap logging causes problems for users on
	shared computer servers
In-Reply-To: <16809659.148551279212421442.JavaMail.root@zimbra>
References: <16809659.148551279212421442.JavaMail.root@zimbra>
Message-ID: <alpine.WNT.2.00.1007151156330.964@justinwozniak>


This was my mistake- fixed.

On Thu, 15 Jul 2010, Michael Wilde wrote:

> This line in _swiftwrap is causing problems when multiple users are 
> running coasters on the same MCS compute server:
>
> --
> COMMANDLINE=$@
>
> echo $0 $COMMANDLINE >> /tmp/swiftwrap.out
> --
>
> It creates a file owned by the user, and causes the next user's jobs to 
> fail (and/or generate a message).
>
> This line looks to me like a debugging fossil. Im going to comment out 
> this echo in my test trunk (which several of us are testing from) and 
> see if it has any ill effect.
>
> I think we should leave it disabled, but can leave it in as a comment 
> for debugging hints. Need to check when/why it was added.
>
> - Mike


-- 
Justin M Wozniak


From wozniak at mcs.anl.gov  Thu Jul 15 13:59:55 2010
From: wozniak at mcs.anl.gov (Justin M Wozniak)
Date: Thu, 15 Jul 2010 13:59:55 -0500 (Central Daylight Time)
Subject: [Swift-devel] swift-plot-log broken on trunk
In-Reply-To: <AANLkTik5ayojDO2AR90DcEKodMeadA11z3qFhyobhipa@mail.gmail.com>
References: <AANLkTik5ayojDO2AR90DcEKodMeadA11z3qFhyobhipa@mail.gmail.com>
Message-ID: <alpine.WNT.2.00.1007151359400.964@justinwozniak>


Fixed- let me know what happens.

On Wed, 14 Jul 2010, Allan Espinosa wrote:

> I guess some required plots are no longer found in the new logs:
>
> $ swift-plot-log sleep-LGU_condor.log execstages.png
> Log file path is /home/aespinosa/workflows/cybershake/sleep-LGU_condor.log
> Log is in directory /home/aespinosa/workflows/cybershake
> Log basename is sleep-LGU_condor
> Now in directory /tmp/swift-plot-log-TCmEupYGhEd18102
> rm -f start-times.data kickstart-times.data start-time.tmp
> end-time.tmp threads.list tasks.list log *.data *.shifted *.png
> *.event *
> .coloured-event *.total *.tmp *.transitions *.last
> karatasks-type-counts.txt index.html *.lastsummary execstages.plot
> total.plot col
> our.plot jobs-sites.html jobs.retrycount.summary kickstart.stats
> execution-counts.txt site-duration.txt jobs.retrycount sp.plot
> karatasks.coloured-sorted-event *.cedps *.stats t.inf *.seenstates
> tmp-* clusterstats  trname-summary sites-list.data.nm info-md5sums
> pse2d-tmp.eip karajan.html falkon.html execute2.html info.html
> execute.html kickstart.html scheduler.html assorted.html
> log-to-execute2-transitions <
> /home/aespinosa/workflows/cybershake/sleep-LGU_condor.log >
> execute2.transitions
> compute-t-inf > t.inf <
> /home/aespinosa/workflows/cybershake/sleep-LGU_condor.log
> cat execute2.transitions | swap-and-sort | transitions-to-event > execute2.event
> log-to-dostagein-transitions <
> /home/aespinosa/workflows/cybershake/sleep-LGU_condor.log >
> dostagein.transitions
> cat dostagein.transitions | swap-and-sort | transitions-to-event >
> dostagein.event
> log-to-dostageout-transitions <
> /home/aespinosa/workflows/cybershake/sleep-LGU_condor.log >
> dostageout.transitions
> cat dostageout.transitions | swap-and-sort | transitions-to-event >
> dostageout.event
> extract-start-time > start-time.tmp
> execstages-plot
> Can't parse line  0 last-event-line no previous event
>
> cat: workflow.event: No such file or directory
>
> gnuplot> plot 'esp.execute2.tmp' with vector arrowstyle 1 title
> 'execute2',      'esp.dostagein.tmp' with vector arrowstyle 2 title
> 'dostagein',      'esp.dostageout.tmp' with vector arrowstyle 3 title
> 'dostageout'
>                                                                          ^
>         "execstages.plot", line 15: no data point found in specified file
>
> make: *** [execstages.png] Error 1
>
>

-- 
Justin M Wozniak


From hategan at mcs.anl.gov  Thu Jul 15 18:35:49 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 15 Jul 2010 18:35:49 -0500
Subject: [Swift-devel] stuff to do
In-Reply-To: <33415753.14491278953564701.JavaMail.root@zimbra>
References: <33415753.14491278953564701.JavaMail.root@zimbra>
Message-ID: <1279236949.25935.17.camel@blabla2.none>

Most of the problems that were obvious with coaster file staging should
be fixed now. I ran a few tests for 1024 cat jobs on TP with ssh:pbs
with 2-8 workers/node (such that "concurrent" workers are tested) and it
consistently seemed fine.

I also quickly made a fake provider and I am getting a rate of about 100
j/s. So that seems not to infirm my previous suspicion.

On Mon, 2010-07-12 at 11:52 -0500, Michael Wilde wrote:
> Here's my view on these:
> 
> > 2. test/fix coaster file staging
> 
> This would be useful for both real apps and (I think) for CDM testing. I would do this first.
> 
> I would then add:
> 
> 5. Adjustments needed, if any, on multicore handling in PBS and SGE provider.
> 
> 6. Adjustments and fixes for reliability and logging, if needed, in Condor-G provider.
> 
> I expect that 5 & 6 would be small tasks, and they are not yet clearly defined. I think that other people could do them.
> 
> Maybe add:
> 
> 7. -tui fixes. Seems not to be working so well on recent tests; several of the screens, including the source-code view, seem not to be working.
> 
> Then:
> 
> > 1. make swift core faster
> 
> I would do this second; I think you said you need about 7-10 days to try things and see what can be done, maybe more after that if the exploration suggests things that will take much (re)coding?
> 
> > 3. standalone coaster service
> 
> The current manual coasters is proving useful. 
> > 4. swift shell
> 
> Lets defer (4) for now; if we can instead run swift repeatedly and either have the coaster worker pool re-connect quickly to each new swift, or quickly start new pools within the same cluster job(s), that would suffice for now.
> 
> Justin, do you want to weigh in on these?
> 
> Thanks,
> 
> Mike
> 
> 
> > The idea is that some recent changes may have shifted the existing
> > priorities. So think of this from the perspective of
> > user/application/publication goals rather than what you think would
> > be
> > "nice to have".
> > 
> > Mihael
> > 
> > _______________________________________________
> > Swift-devel mailing list
> > Swift-devel at ci.uchicago.edu
> > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> 


From aespinosa at cs.uchicago.edu  Thu Jul 15 23:32:01 2010
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Thu, 15 Jul 2010 23:32:01 -0500
Subject: [Swift-devel] swift-plot-log broken on trunk
In-Reply-To: <alpine.WNT.2.00.1007151359400.964@justinwozniak>
References: <AANLkTik5ayojDO2AR90DcEKodMeadA11z3qFhyobhipa@mail.gmail.com>
	<alpine.WNT.2.00.1007151359400.964@justinwozniak>
Message-ID: <AANLkTilrh93lg7DFabg1oOxvYNSXYZ5slnm2-Xm7GfAk@mail.gmail.com>

Thanks Justin.  I'll try this out when I get another run.

With the default logging policy there will be no execute2 statements
as they are all in DEBUG level inside vdl-int.k  this was the case in
my run.

-Allan

2010/7/15 Justin M Wozniak <wozniak at mcs.anl.gov>:
>
> Fixed- let me know what happens.
>
> On Wed, 14 Jul 2010, Allan Espinosa wrote:
>
>> I guess some required plots are no longer found in the new logs:
>>
>> $ swift-plot-log sleep-LGU_condor.log execstages.png
>> Log file path is /home/aespinosa/workflows/cybershake/sleep-LGU_condor.log
>> Log is in directory /home/aespinosa/workflows/cybershake
>> Log basename is sleep-LGU_condor
>> Now in directory /tmp/swift-plot-log-TCmEupYGhEd18102
>> rm -f start-times.data kickstart-times.data start-time.tmp
>> end-time.tmp threads.list tasks.list log *.data *.shifted *.png
>> *.event *
>> .coloured-event *.total *.tmp *.transitions *.last
>> karatasks-type-counts.txt index.html *.lastsummary execstages.plot
>> total.plot col
>> our.plot jobs-sites.html jobs.retrycount.summary kickstart.stats
>> execution-counts.txt site-duration.txt jobs.retrycount sp.plot
>> karatasks.coloured-sorted-event *.cedps *.stats t.inf *.seenstates
>> tmp-* clusterstats ?trname-summary sites-list.data.nm info-md5sums
>> pse2d-tmp.eip karajan.html falkon.html execute2.html info.html
>> execute.html kickstart.html scheduler.html assorted.html
>> log-to-execute2-transitions <
>> /home/aespinosa/workflows/cybershake/sleep-LGU_condor.log >
>> execute2.transitions
>> compute-t-inf > t.inf <
>> /home/aespinosa/workflows/cybershake/sleep-LGU_condor.log
>> cat execute2.transitions | swap-and-sort | transitions-to-event >
>> execute2.event
>> log-to-dostagein-transitions <
>> /home/aespinosa/workflows/cybershake/sleep-LGU_condor.log >
>> dostagein.transitions
>> cat dostagein.transitions | swap-and-sort | transitions-to-event >
>> dostagein.event
>> log-to-dostageout-transitions <
>> /home/aespinosa/workflows/cybershake/sleep-LGU_condor.log >
>> dostageout.transitions
>> cat dostageout.transitions | swap-and-sort | transitions-to-event >
>> dostageout.event
>> extract-start-time > start-time.tmp
>> execstages-plot
>> Can't parse line ?0 last-event-line no previous event
>>
>> cat: workflow.event: No such file or directory
>>
>> gnuplot> plot 'esp.execute2.tmp' with vector arrowstyle 1 title
>> 'execute2', ? ? ?'esp.dostagein.tmp' with vector arrowstyle 2 title
>> 'dostagein', ? ? ?'esp.dostageout.tmp' with vector arrowstyle 3 title
>> 'dostageout'
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ^
>> ? ? ? ?"execstages.plot", line 15: no data point found in specified file
>>
>> make: *** [execstages.png] Error 1
>>
>>
>
> --
> Justin M Wozniak
>
>


-- 
Allan M. Espinosa <http://amespinosa.wordpress.com>
PhD student, Computer Science
University of Chicago <http://people.cs.uchicago.edu/~aespinosa>


From wilde at mcs.anl.gov  Fri Jul 16 15:07:15 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 16 Jul 2010 15:07:15 -0500 (CDT)
Subject: [Swift-devel] Swift NMI B&T testing - how to add more users?
Message-ID: <22361353.206931279310835989.JavaMail.root@zimbra>

Ben,

Dennis Touchet, a UTB student, is gearing up to get Swift testing rolling again. Can you provide a few specific pointers?

- how to clone the B&T tests you set up so that multiple Swift developers can manage them?  Does this require a B&T linux host login that is separate from the B&T web login? (I was unable to log into UW's system with my web login...)

- can you comment on the state of the "site" tests in the swift test dir?

- any other pointers on testing tat would be useful to Dennis and the devel team?

Thanks,

Mike


From wilde at mcs.anl.gov  Fri Jul 16 15:10:06 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 16 Jul 2010 15:10:06 -0500 (CDT)
Subject: [Swift-devel] Please upgrade communicado to latest CI linux
Message-ID: <16682142.207021279311006408.JavaMail.root@zimbra>

Hi CI Support,

I think you were just waiting for our go-ahead to upgrade communicado.

Can you proceed, schedule a time next week, and just notify the two lists above about the upgrade time?

Thanks,

Mike


From wilde at mcs.anl.gov  Fri Jul 16 15:15:59 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 16 Jul 2010 15:15:59 -0500 (CDT)
Subject: [Swift-devel] Swift NMI B&T testing - how to add more users?
In-Reply-To: <AANLkTildJB-zgsycD42FEc8I7LXiFnnvhqpI3UHUDT30@mail.gmail.com>
Message-ID: <29186856.207281279311359233.JavaMail.root@zimbra>

[cc'ing swift-devel]

Hi Allan,

I have no special access to UW or B&T, but sure, feel free to use my name.

I *think* that if you follow the general procedure for getting B&T access, that this is the best system for doing your builds.

What I dont understand yet is:

1) whether we each need to get linus logins separate from our web logins

2) whether we need to create some kind of "Swift" project for sharing files, run logs, administartive control over swift tests, etc.

- Mike


----- "Allan Espinosa" <aespinosa at cs.uchicago.edu> wrote:

> Hi Mike,
> 
> On a side note, I applied for UW account yesterday independently
> because I needed access to specific type of machine architectures for
> building codes for OSG deployments (i.e. SCEC Cybershake)  .  Will
> reapplying under your name expedite the process?
> 
> Thanks,
> -Allan
> 
> 2010/7/16 Michael Wilde <wilde at mcs.anl.gov>:
> > Ben,
> >
> > Dennis Touchet, a UTB student, is gearing up to get Swift testing
> rolling again. Can you provide a few specific pointers?
> >
> > - how to clone the B&T tests you set up so that multiple Swift
> developers can manage them? ?Does this require a B&T linux host login
> that is separate from the B&T web login? (I was unable to log into
> UW's system with my web login...)
> >
> > - can you comment on the state of the "site" tests in the swift test
> dir?
> >
> > - any other pointers on testing tat would be useful to Dennis and
> the devel team?
> >
> > Thanks,
> >
> > Mike
> 
> -- 
> Allan M. Espinosa <http://amespinosa.wordpress.com>
> PhD student, Computer Science
> University of Chicago <http://people.cs.uchicago.edu/~aespinosa>

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory


From support at ci.uchicago.edu  Fri Jul 16 15:37:29 2010
From: support at ci.uchicago.edu (David Forero)
Date: Fri, 16 Jul 2010 15:37:29 -0500
Subject: [Swift-devel] [CI Ticketing System #5791] Communicado upgrade 
In-Reply-To: <16682142.207021279311006408.JavaMail.root@zimbra>
References: <RT-Ticket-5791@ci.uchicago.edu>
	<16682142.207021279311006408.JavaMail.root@zimbra>
Message-ID: <rt-3.8.2-15745-1279312649-1589.5791-7-0@ci.uchicago.edu>

Next Tuesday 20 July at 8am we will be taking communicado.ci.uchicago.edu down
for an upgrade. Please use bridled.ci.uchicago.edu in its stead. Communicado
should be back online by the end of the day

If you have any questions, please contact support at ci uchicago.edu.

Thank you for your cooperation.

--

David Forero
System Administrator
Computation Institute
University of Chicago
773-834-4102


From support at ci.uchicago.edu  Fri Jul 16 15:37:30 2010
From: support at ci.uchicago.edu (David Forero)
Date: Fri, 16 Jul 2010 15:37:30 -0500
Subject: [Swift-devel] [CI Ticketing System #5791] Communicado upgrade 
In-Reply-To: <16682142.207021279311006408.JavaMail.root@zimbra>
References: <RT-Ticket-5791@ci.uchicago.edu>
	<16682142.207021279311006408.JavaMail.root@zimbra>
Message-ID: <rt-3.8.2-15745-1279312649-806.5791-6-0@ci.uchicago.edu>

Next Tuesday 20 July at 8am we will be taking communicado.ci.uchicago.edu down
for an upgrade. Please use bridled.ci.uchicago.edu in its stead. Communicado
should be back online by the end of the day

If you have any questions, please contact support at ci uchicago.edu.

Thank you for your cooperation.

--

David Forero
System Administrator
Computation Institute
University of Chicago
773-834-4102


From benc at hawaga.org.uk  Sun Jul 18 12:36:48 2010
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sun, 18 Jul 2010 17:36:48 +0000 (GMT)
Subject: [Swift-devel] Re: Swift NMI B&T testing - how to add more users?
In-Reply-To: <22361353.206931279310835989.JavaMail.root@zimbra>
References: <22361353.206931279310835989.JavaMail.root@zimbra>
Message-ID: <Pine.LNX.4.64.1007181730130.13416@dildano.hawaga.org.uk>


> - how to clone the B&T tests you set up so that multiple Swift 
> developers can manage them?  Does this require a B&T linux host login 
> that is separate from the B&T web login? (I was unable to log into UW's 
> system with my web login...)

I don't know anything about web logins for NMI - I only (as far as I know) 
had a linux shell login.

The way I had it set up, the tests from SVN (at least ones which didn't 
require credentials) were run regularly. If you wanted to add new tests 
there then adding them to SVN would cause them to be run on the NMI 
systems, in the same way as it would cause them to be run by other 
developers who run the tests themselves on their own systems.

My home directory may still be in place on the NMI system, and I think I 
probably told them that they could share the contents with anyone; if both 
of those are true, you might be able to find all the scripts I had there.

Having multiple people edit files on the NMI machines - I guess the nmi 
people have (or will create for you) some policy on that.

> - can you comment on the state of the "site" tests in the swift test dir?

It was hard to get those working reliably enough for them to be useful to 
run - by that I mean that if one of the local tests failed, then it was 
usually a problem that had recently been introduced into the swift stack; 
but a site test failing was often because of a problem with the site. That 
was a reflection of the difficulty on getting swift running and keeping it 
running on many different sites.

You can look at the script to run the tests. Its probably useful. But the 
actual site definitions are presumably very rotted.

-- 


From hategan at mcs.anl.gov  Mon Jul 19 01:36:25 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 19 Jul 2010 01:36:25 -0500
Subject: [Swift-devel] stuff to do
In-Reply-To: <1279236949.25935.17.camel@blabla2.none>
References: <33415753.14491278953564701.JavaMail.root@zimbra>
	<1279236949.25935.17.camel@blabla2.none>
Message-ID: <1279521385.23339.1.camel@blabla2.none>

On Thu, 2010-07-15 at 18:35 -0500, Mihael Hategan wrote:
> 
> I also quickly made a fake provider and I am getting a rate of about 100
> j/s. So that seems not to infirm my previous suspicion.

Well, it turns out that the flushing the restart log to disk takes some
time. As in if I remove the call to flush() I can get 800 jobs/s.


From benc at hawaga.org.uk  Mon Jul 19 04:10:12 2010
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 19 Jul 2010 09:10:12 +0000 (GMT)
Subject: [Swift-devel] Re: Swift NMI B&T testing - how to add more users?
In-Reply-To: <Pine.LNX.4.64.1007181730130.13416@dildano.hawaga.org.uk>
References: <22361353.206931279310835989.JavaMail.root@zimbra>
	<Pine.LNX.4.64.1007181730130.13416@dildano.hawaga.org.uk>
Message-ID: <Pine.LNX.4.64.1007190907510.13416@dildano.hawaga.org.uk>


I poked around a bit more. The svn has/had a top level directory 
nmi-build-test. In there, there is a subdirectory called submit-machine. 
That contains most/all of the files I had on the NMI build machine.

In there: build-hourly was run by cron every hour and contains the logic 
to do the almost-per-commit tests; and build-daily was run by cron every 
day and does the several-different-architectures tests.

Start by trying to make build-hourly run manually on the NMI machine and I 
think it should be straightforward to get it running.

-- 
http://www.hawaga.org.uk/ben/


From wilde at mcs.anl.gov  Mon Jul 19 09:41:51 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 19 Jul 2010 08:41:51 -0600 (GMT-06:00)
Subject: [Swift-devel] stuff to do
In-Reply-To: <1279521385.23339.1.camel@blabla2.none>
Message-ID: <321209788.22551279550511201.JavaMail.root@zimbra.anl.gov>

Way cool. Can you make restart/flush a settable property?

- Mike

----- "Mihael Hategan" <hategan at mcs.anl.gov> wrote:

> On Thu, 2010-07-15 at 18:35 -0500, Mihael Hategan wrote:
> > 
> > I also quickly made a fake provider and I am getting a rate of about
> 100
> > j/s. So that seems not to infirm my previous suspicion.
> 
> Well, it turns out that the flushing the restart log to disk takes
> some
> time. As in if I remove the call to flush() I can get 800 jobs/s.

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory


From dk0966 at cs.ship.edu  Tue Jul 20 07:00:23 2010
From: dk0966 at cs.ship.edu (David Kelly)
Date: Tue, 20 Jul 2010 08:00:23 -0400
Subject: [Swift-devel] Swift shell script and JAVA_HOME
Message-ID: <AANLkTikfmuMatQjuUdqORYJX5nqdDzEvy87bQ9U5ySbL@mail.gmail.com>

Hello,

I noticed on login.ci.uchicago.edu that I was not able to launch swift. Even
though I had "+java-sun" in ~/.soft, the system was pointing me to gcj.
Normally in ubuntu, something like "update-java-alternatives -s java-6-sun"
lets you switch JVMs, but as far I know it makes changes system wide
(requiring root access) and not on a per-user basis. Then I set $JAVA_HOME
to the correct path and it still wouldn't launch. Should the swift shell
script test for $JAVA_HOME to determine the correct location? Maybe
something like this would work:

### EXECUTE ############
if test -n "$CYGWIN"; then
    set CLASSPATHSAVE=$CLASSPATH
    export CLASSPATH="$LOCALCLASSPATH"
    eval java ${OPTIONS} ${COG_OPTS} ${EXEC} ${CMDLINE}
    export CLASSPATH=$CLASSPATHSAVE
else
    if [ -n "$JAVA_HOME" ]; then
        eval $JAVA_HOME/bin/java ${OPTIONS} ${COG_OPTS} -classpath
${LOCALCLASSPATH} ${EXEC} ${CMDLINE}
    else
        eval java ${OPTIONS} ${COG_OPTS} -classpath ${LOCALCLASSPATH}
${EXEC} ${CMDLINE}
    fi
fi
return_code=$?

exit $return_code
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20100720/f2634b92/attachment.html>

From wilde at mcs.anl.gov  Tue Jul 20 08:43:15 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 20 Jul 2010 07:43:15 -0600 (GMT-06:00)
Subject: [Swift-devel] Swift shell script and JAVA_HOME
In-Reply-To: <AANLkTikfmuMatQjuUdqORYJX5nqdDzEvy87bQ9U5ySbL@mail.gmail.com>
Message-ID: <1077109173.60151279633395534.JavaMail.root@zimbra.anl.gov>

David,

----- "David Kelly" <dk0966 at cs.ship.edu> wrote:

> Hello,
> 
> I noticed on login.ci.uchicago.edu that I was not able to launch
> swift. Even though I had "+java-sun" in ~/.soft, the system was
> pointing me to gcj.

In my .soft i have +java_sub before @default in .soft. That seems to work on login.ci. Can you try that?

> Normally in ubuntu, something like
> "update-java-alternatives -s java-6-sun" lets you switch JVMs, but as
> far I know it makes changes system wide (requiring root access) and
> not on a per-user basis. Then I set $JAVA_HOME to the correct path and
> it still wouldn't launch.

What error were you getting?

Perhaps check if other *JAVA* env vars are still pointing to the wrong Java, eg:

login$ env | grep -i java
JRE_HOME=/soft/java-1.5.0_06-sun-r1/jre
MATLAB_JAVA=/soft/matlab-7.7-r1/java
JAVA_BINDIR=/soft/java-1.5.0_06-sun-r1/bin
JAVA_HOME=/soft/java-1.5.0_06-sun-r1
SDK_HOME=/soft/java-1.5.0_06-sun-r1
JDK_HOME=/soft/java-1.5.0_06-sun-r1
JAVA_ROOT=/soft/java-1.5.0_06-sun-r1

and make sure that CLASSPATH is *not* set.

> Should the swift shell script test for
> $JAVA_HOME to determine the correct location?

In my experience, Ive always tried to leave JAVA_HOME unset, have no JAVA vars in my env, and make sure that the right Java is in the PATH. 

I suspect Mihael and/or Justin need to weigh in on whats best; and then we should document that in the user guide under "Running Swift".

- Mike

> Maybe something like
> this would work:
> 
> ### EXECUTE ############
> if test -n "$CYGWIN"; then
> set CLASSPATHSAVE=$CLASSPATH
> export CLASSPATH="$LOCALCLASSPATH"
> eval java ${OPTIONS} ${COG_OPTS} ${EXEC} ${CMDLINE}
> export CLASSPATH=$CLASSPATHSAVE
> else
> if [ -n "$JAVA_HOME" ]; then
> eval $JAVA_HOME/bin/java ${OPTIONS} ${COG_OPTS} -classpath
> ${LOCALCLASSPATH} ${EXEC} ${CMDLINE}
> else
> eval java ${OPTIONS} ${COG_OPTS} -classpath ${LOCALCLASSPATH} ${EXEC}
> ${CMDLINE}
> fi
> fi
> return_code=$?
> 
> exit $return_code
> 
> 
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory


From wilde at mcs.anl.gov  Tue Jul 20 10:20:29 2010
From: wilde at mcs.anl.gov (wilde at mcs.anl.gov)
Date: Tue, 20 Jul 2010 09:20:29 -0600 (GMT-06:00)
Subject: [Swift-devel] Problems with coaster data provider
In-Reply-To: <41072379.65871279639120318.JavaMail.root@zimbra.anl.gov>
Message-ID: <1289205052.66121279639229161.JavaMail.root@zimbra.anl.gov>

I tried the coaster data provider from MCS host vanquish to crush (2 of the compute servers) via ssh:local and get the error:

"org.globus.cog.abstraction.impl.file.FileResourceException: org.globus.cog.karajan.workflow.service.ProtocolException: Unknown command: #!/BIN/BASH" (Full error text below).

Has anyone else tried the coaster data provider?

My sites file has the single pool:

  <pool handle="crush">
    <execution provider="coaster" url="crush.mcs.anl.gov" jobmanager="ssh:local"/>
    <profile namespace="globus" key="workersPerNode">8</profile>
    <profile namespace="globus" key="maxTime">3500</profile>
    <profile namespace="globus" key="slots">1</profile>
    <profile namespace="globus" key="nodeGranularity">1</profile>
    <profile namespace="globus" key="maxNodes">1</profile>

    <profile key="jobThrottle" namespace="karajan">.07</profile>
    <profile namespace="karajan" key="initialScore">10000</profile>

    <filesystem provider="coaster" url="ssh://crush.mcs.anl.gov" />
    <workdirectory>/home/wilde/swiftwork/crush</workdirectory>
  </pool>

Is that the correct url= value?

I set these properties:

wrapperlog.always.transfer=false
sitedir.keep=true
execution.retries=0
status.mode=provider

The run command, svn version, and full error text on stdout/err is:

vanquish$ swift -tc.file tc -sites.file crushds.xml -config cf catsn.swift -n=1
Swift svn swift-r3449 cog-r2816

RunID: 20100720-1006-z1vio8i1
Progress:
Progress:  Failed:1
Execution failed:
	Could not initialize shared directory on crush
Caused by:
	org.globus.cog.abstraction.impl.file.FileResourceException: org.globus.cog.karajan.workflow.service.ProtocolException: Unknown command: #!/BIN/BASH
# THIS SCRIPT MUST BE INVOKED INSIDE OF BASH, NOT PLAIN SH
# NOTE THAT THIS SCRIPT MODIFIES $IFS

INFOSECTION() {

...full text of _swiftwrap shows up here, in upper case...

# ENSURE WE EXIT WITH A 0 AFTER A SUCCESSFUL EXECUTION
EXIT 0

# LOCAL VARIABLES: 
# MODE: SH
# SH-BASIC-OFFSET: 8
# END:

Cleaning up...
Shutting down service at https://140.221.8.62:59300
Got channel MetaChannel: 2039421489[205498061: {}] -> GSSSChannel-0494700354(1)[205498061: {}]
+ Done
vanquish$

The _swiftwrap file was created in the workdirectory shared/ subdir but has length zero. So presumably it was about to be transferred but the transfer failed.

vanquish$ ls -lR /home/wilde/swiftwork/crush/*8i1
/home/wilde/swiftwork/crush/catsn-20100720-1006-z1vio8i1:
total 2
drwxr-sr-x 2 wilde mcsz 3 Jul 20 10:06 shared/

/home/wilde/swiftwork/crush/catsn-20100720-1006-z1vio8i1/shared:
total 1
-rw-r--r-- 1 wilde mcsz 0 Jul 20 10:06 _swiftwrap
vanquish$ 

----- "Mihael Hategan" <hategan at mcs.anl.gov> wrote:

> Most of the problems that were obvious with coaster file staging
> should
> be fixed now. I ran a few tests for 1024 cat jobs on TP with ssh:pbs
> with 2-8 workers/node (such that "concurrent" workers are tested) and
> it
> consistently seemed fine.
> 
> I also quickly made a fake provider and I am getting a rate of about
> 100
> j/s. So that seems not to infirm my previous suspicion.
> 
> On Mon, 2010-07-12 at 11:52 -0500, Michael Wilde wrote:
> > Here's my view on these:
> > 
> > > 2. test/fix coaster file staging
> > 
> > This would be useful for both real apps and (I think) for CDM
> testing. I would do this first.
> > 
> > I would then add:
> > 
> > 5. Adjustments needed, if any, on multicore handling in PBS and SGE
> provider.
> > 
> > 6. Adjustments and fixes for reliability and logging, if needed, in
> Condor-G provider.
> > 
> > I expect that 5 & 6 would be small tasks, and they are not yet
> clearly defined. I think that other people could do them.
> > 
> > Maybe add:
> > 
> > 7. -tui fixes. Seems not to be working so well on recent tests;
> several of the screens, including the source-code view, seem not to be
> working.
> > 
> > Then:
> > 
> > > 1. make swift core faster
> > 
> > I would do this second; I think you said you need about 7-10 days to
> try things and see what can be done, maybe more after that if the
> exploration suggests things that will take much (re)coding?
> > 
> > > 3. standalone coaster service
> > 
> > The current manual coasters is proving useful. 
> > > 4. swift shell
> > 
> > Lets defer (4) for now; if we can instead run swift repeatedly and
> either have the coaster worker pool re-connect quickly to each new
> swift, or quickly start new pools within the same cluster job(s), that
> would suffice for now.
> > 
> > Justin, do you want to weigh in on these?
> > 
> > Thanks,
> > 
> > Mike
> > 
> > 
> > > The idea is that some recent changes may have shifted the
> existing
> > > priorities. So think of this from the perspective of
> > > user/application/publication goals rather than what you think
> would
> > > be
> > > "nice to have".
> > > 
> > > Mihael
> > > 
> > > _______________________________________________
> > > Swift-devel mailing list
> > > Swift-devel at ci.uchicago.edu
> > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> >

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory


From dk0966 at cs.ship.edu  Tue Jul 20 10:27:30 2010
From: dk0966 at cs.ship.edu (David Kelly)
Date: Tue, 20 Jul 2010 11:27:30 -0400
Subject: [Swift-devel] Swift shell script and JAVA_HOME
In-Reply-To: <1077109173.60151279633395534.JavaMail.root@zimbra.anl.gov>
References: <AANLkTikfmuMatQjuUdqORYJX5nqdDzEvy87bQ9U5ySbL@mail.gmail.com>
	<1077109173.60151279633395534.JavaMail.root@zimbra.anl.gov>
Message-ID: <AANLkTintTY3_yZiYubfti1LUt4jYBFGuFXUgOLJhRgXy@mail.gmail.com>

On Tue, Jul 20, 2010 at 9:43 AM, Michael Wilde <wilde at mcs.anl.gov> wrote:

In my .soft i have +java_sub before @default in .soft. That seems to work on
> login.ci. Can you try that?
>
> Changing the ordering fixed it for me as well. Thanks.


> What error were you getting?
>

> Perhaps check if other *JAVA* env vars are still pointing to the wrong
> Java, eg:
>
> login$ env | grep -i java
> JRE_HOME=/soft/java-1.5.0_06-sun-r1/jre
> MATLAB_JAVA=/soft/matlab-7.7-r1/java
> JAVA_BINDIR=/soft/java-1.5.0_06-sun-r1/bin
> JAVA_HOME=/soft/java-1.5.0_06-sun-r1
> SDK_HOME=/soft/java-1.5.0_06-sun-r1
> JDK_HOME=/soft/java-1.5.0_06-sun-r1
> JAVA_ROOT=/soft/java-1.5.0_06-sun-r1
>
> and make sure that CLASSPATH is *not* set.
>

If I have @default first followed +java_sun, even if all of the JAVA
variables correctly pointing to sun java it will use gcj. I think this is
due to the ordering of directories in $PATH. Having users adjust their PATH
with the location of sun java ahead of directories like /usr/bin is probably
the solution.

Here are the errors I was getting:

$ swift manyparam.swift
Warning: -Xmx256M not understood. Ignoring.
log4j:ERROR Error occured while converting date.
java.lang.IllegalArgumentException: Illegal pattern character
   at java.text.SimpleDateFormat.format(java.util.Date,
java.lang.StringBuffer, java.text.FieldPosition) (/usr/lib/libgcj.so.5.0.0)
   at java.text.DateFormat.format(java.util.Date) (/usr/lib/libgcj.so.5.0.0)
   at
org.apache.log4j.helpers.PatternParser$DatePatternConverter.convert(org.apache.log4j.spi.LoggingEvent)
(Unknown Source)
   at
org.apache.log4j.helpers.PatternConverter.format(java.lang.StringBuffer,
org.apache.log4j.spi.LoggingEvent) (Unknown Source)
   at
org.apache.log4j.PatternLayout.format(org.apache.log4j.spi.LoggingEvent)
(Unknown Source)
   at
org.apache.log4j.WriterAppender.subAppend(org.apache.log4j.spi.LoggingEvent)
(Unknown Source)
   at
org.apache.log4j.WriterAppender.append(org.apache.log4j.spi.LoggingEvent)
(Unknown Source)
   at
org.apache.log4j.AppenderSkeleton.doAppend(org.apache.log4j.spi.LoggingEvent)
(Unknown Source)
   at
org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(org.apache.log4j.spi.LoggingEvent)
(Unknown Source)
   at
org.apache.log4j.Category.callAppenders(org.apache.log4j.spi.LoggingEvent)
(Unknown Source)
   at org.apache.log4j.Category.forcedLog(java.lang.String,
org.apache.log4j.Priority, java.lang.Object, java.lang.Throwable) (Unknown
Source)
   at org.apache.log4j.Category.debug(java.lang.Object) (Unknown Source)
   at org.griphyn.vdl.karajan.Loader.main(java.lang.String[]) (Unknown
Source)
log4j:ERROR Error occured while converting date.
java.lang.IllegalArgumentException: Illegal pattern character
   at java.text.SimpleDateFormat.format(java.util.Date,
java.lang.StringBuffer, java.text.FieldPosition) (/usr/lib/libgcj.so.5.0.0)
   at java.text.DateFormat.format(java.util.Date) (/usr/lib/libgcj.so.5.0.0)
   at
org.apache.log4j.helpers.PatternParser$DatePatternConverter.convert(org.apache.log4j.spi.LoggingEvent)
(Unknown Source)
   at
org.apache.log4j.helpers.PatternConverter.format(java.lang.StringBuffer,
org.apache.log4j.spi.LoggingEvent) (Unknown Source)
   at
org.apache.log4j.PatternLayout.format(org.apache.log4j.spi.LoggingEvent)
(Unknown Source)
   at
org.apache.log4j.WriterAppender.subAppend(org.apache.log4j.spi.LoggingEvent)
(Unknown Source)
   at
org.apache.log4j.WriterAppender.append(org.apache.log4j.spi.LoggingEvent)
(Unknown Source)
   at
org.apache.log4j.AppenderSkeleton.doAppend(org.apache.log4j.spi.LoggingEvent)
(Unknown Source)
   at
org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(org.apache.log4j.spi.LoggingEvent)
(Unknown Source)
   at
org.apache.log4j.Category.callAppenders(org.apache.log4j.spi.LoggingEvent)
(Unknown Source)
   at org.apache.log4j.Category.forcedLog(java.lang.String,
org.apache.log4j.Priority, java.lang.Object, java.lang.Throwable) (Unknown
Source)
   at org.apache.log4j.Category.debug(java.lang.Object) (Unknown Source)
   at org.griphyn.vdl.karajan.Loader.main(java.lang.String[]) (Unknown
Source)
log4j:ERROR Error occured while converting date.
java.lang.IllegalArgumentException: Illegal pattern character
   at java.text.SimpleDateFormat.format(java.util.Date,
java.lang.StringBuffer, java.text.FieldPosition) (/usr/lib/libgcj.so.5.0.0)
   at java.text.DateFormat.format(java.util.Date) (/usr/lib/libgcj.so.5.0.0)
   at
org.apache.log4j.helpers.PatternParser$DatePatternConverter.convert(org.apache.log4j.spi.LoggingEvent)
(Unknown Source)
   at
org.apache.log4j.helpers.PatternConverter.format(java.lang.StringBuffer,
org.apache.log4j.spi.LoggingEvent) (Unknown Source)
   at
org.apache.log4j.PatternLayout.format(org.apache.log4j.spi.LoggingEvent)
(Unknown Source)
   at
org.apache.log4j.WriterAppender.subAppend(org.apache.log4j.spi.LoggingEvent)
(Unknown Source)
   at
org.apache.log4j.WriterAppender.append(org.apache.log4j.spi.LoggingEvent)
(Unknown Source)
   at
org.apache.log4j.AppenderSkeleton.doAppend(org.apache.log4j.spi.LoggingEvent)
(Unknown Source)
   at
org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(org.apache.log4j.spi.LoggingEvent)
(Unknown Source)
   at
org.apache.log4j.Category.callAppenders(org.apache.log4j.spi.LoggingEvent)
(Unknown Source)
   at org.apache.log4j.Category.forcedLog(java.lang.String,
org.apache.log4j.Priority, java.lang.Object, java.lang.Throwable) (Unknown
Source)
   at org.apache.log4j.Category.debug(java.lang.Object) (Unknown Source)
   at org.griphyn.vdl.karajan.Loader.compile(java.lang.String) (Unknown
Source)
   at org.griphyn.vdl.karajan.Loader.main(java.lang.String[]) (Unknown
Source)
log4j:ERROR Error occured while converting date.
java.lang.IllegalArgumentException: Illegal pattern character
   at java.text.SimpleDateFormat.format(java.util.Date,
java.lang.StringBuffer, java.text.FieldPosition) (/usr/lib/libgcj.so.5.0.0)
   at java.text.DateFormat.format(java.util.Date) (/usr/lib/libgcj.so.5.0.0)
   at
org.apache.log4j.helpers.PatternParser$DatePatternConverter.convert(org.apache.log4j.spi.LoggingEvent)
(Unknown Source)
   at
org.apache.log4j.helpers.PatternConverter.format(java.lang.StringBuffer,
org.apache.log4j.spi.LoggingEvent) (Unknown Source)
   at
org.apache.log4j.PatternLayout.format(org.apache.log4j.spi.LoggingEvent)
(Unknown Source)
   at
org.apache.log4j.WriterAppender.subAppend(org.apache.log4j.spi.LoggingEvent)
(Unknown Source)
   at
org.apache.log4j.WriterAppender.append(org.apache.log4j.spi.LoggingEvent)
(Unknown Source)
   at
org.apache.log4j.AppenderSkeleton.doAppend(org.apache.log4j.spi.LoggingEvent)
(Unknown Source)
   at
org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(org.apache.log4j.spi.LoggingEvent)
(Unknown Source)
   at
org.apache.log4j.Category.callAppenders(org.apache.log4j.spi.LoggingEvent)
(Unknown Source)
   at org.apache.log4j.Category.forcedLog(java.lang.String,
org.apache.log4j.Priority, java.lang.Object, java.lang.Throwable) (Unknown
Source)
   at org.apache.log4j.Category.debug(java.lang.Object) (Unknown Source)
   at org.griphyn.vdl.karajan.Loader.compile(java.lang.String) (Unknown
Source)
   at org.griphyn.vdl.karajan.Loader.main(java.lang.String[]) (Unknown
Source)
log4j:ERROR Error occured while converting date.
java.lang.IllegalArgumentException: Illegal pattern character
   at java.text.SimpleDateFormat.format(java.util.Date,
java.lang.StringBuffer, java.text.FieldPosition) (/usr/lib/libgcj.so.5.0.0)
   at java.text.DateFormat.format(java.util.Date) (/usr/lib/libgcj.so.5.0.0)
   at
org.apache.log4j.helpers.PatternParser$DatePatternConverter.convert(org.apache.log4j.spi.LoggingEvent)
(Unknown Source)
   at
org.apache.log4j.helpers.PatternConverter.format(java.lang.StringBuffer,
org.apache.log4j.spi.LoggingEvent) (Unknown Source)
   at
org.apache.log4j.PatternLayout.format(org.apache.log4j.spi.LoggingEvent)
(Unknown Source)
   at
org.apache.log4j.WriterAppender.subAppend(org.apache.log4j.spi.LoggingEvent)
(Unknown Source)
   at
org.apache.log4j.WriterAppender.append(org.apache.log4j.spi.LoggingEvent)
(Unknown Source)
   at
org.apache.log4j.AppenderSkeleton.doAppend(org.apache.log4j.spi.LoggingEvent)
(Unknown Source)
   at
org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(org.apache.log4j.spi.LoggingEvent)
(Unknown Source)
   at
org.apache.log4j.Category.callAppenders(org.apache.log4j.spi.LoggingEvent)
(Unknown Source)
   at org.apache.log4j.Category.forcedLog(java.lang.String,
org.apache.log4j.Priority, java.lang.Object, java.lang.Throwable) (Unknown
Source)
   at org.apache.log4j.Category.debug(java.lang.Object) (Unknown Source)
   at org.griphyn.vdl.karajan.Loader.compile(java.lang.String) (Unknown
Source)
   at org.griphyn.vdl.karajan.Loader.main(java.lang.String[]) (Unknown
Source)
log4j:ERROR Error occured while converting date.
log4j:ERROR Error occured while converting date.
Exception in thread "main" java.lang.NullPointerException
*** Got java.lang.NullPointerException while trying to print stack trace.

David
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20100720/b752141f/attachment.html>

From hategan at mcs.anl.gov  Tue Jul 20 12:22:07 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 20 Jul 2010 12:22:07 -0500
Subject: [Swift-devel] Re: Problems with coaster data provider
In-Reply-To: <1289205052.66121279639229161.JavaMail.root@zimbra.anl.gov>
References: <1289205052.66121279639229161.JavaMail.root@zimbra.anl.gov>
Message-ID: <1279646527.16305.0.camel@blabla2.none>

That is odd. It looks like all characters for the swift wrapper are in
uppercase.


On Tue, 2010-07-20 at 09:20 -0600, wilde at mcs.anl.gov wrote:
> I tried the coaster data provider from MCS host vanquish to crush (2 of the compute servers) via ssh:local and get the error:
> 
> "org.globus.cog.abstraction.impl.file.FileResourceException: org.globus.cog.karajan.workflow.service.ProtocolException: Unknown command: #!/BIN/BASH" (Full error text below).
> 
> Has anyone else tried the coaster data provider?
> 
> My sites file has the single pool:
> 
>   <pool handle="crush">
>     <execution provider="coaster" url="crush.mcs.anl.gov" jobmanager="ssh:local"/>
>     <profile namespace="globus" key="workersPerNode">8</profile>
>     <profile namespace="globus" key="maxTime">3500</profile>
>     <profile namespace="globus" key="slots">1</profile>
>     <profile namespace="globus" key="nodeGranularity">1</profile>
>     <profile namespace="globus" key="maxNodes">1</profile>
> 
>     <profile key="jobThrottle" namespace="karajan">.07</profile>
>     <profile namespace="karajan" key="initialScore">10000</profile>
> 
>     <filesystem provider="coaster" url="ssh://crush.mcs.anl.gov" />
>     <workdirectory>/home/wilde/swiftwork/crush</workdirectory>
>   </pool>
> 
> Is that the correct url= value?
> 
> I set these properties:
> 
> wrapperlog.always.transfer=false
> sitedir.keep=true
> execution.retries=0
> status.mode=provider
> 
> The run command, svn version, and full error text on stdout/err is:
> 
> vanquish$ swift -tc.file tc -sites.file crushds.xml -config cf catsn.swift -n=1
> Swift svn swift-r3449 cog-r2816
> 
> RunID: 20100720-1006-z1vio8i1
> Progress:
> Progress:  Failed:1
> Execution failed:
> 	Could not initialize shared directory on crush
> Caused by:
> 	org.globus.cog.abstraction.impl.file.FileResourceException: org.globus.cog.karajan.workflow.service.ProtocolException: Unknown command: #!/BIN/BASH
> # THIS SCRIPT MUST BE INVOKED INSIDE OF BASH, NOT PLAIN SH
> # NOTE THAT THIS SCRIPT MODIFIES $IFS
> 
> INFOSECTION() {
> 
> ...full text of _swiftwrap shows up here, in upper case...
> 
> # ENSURE WE EXIT WITH A 0 AFTER A SUCCESSFUL EXECUTION
> EXIT 0
> 
> # LOCAL VARIABLES: 
> # MODE: SH
> # SH-BASIC-OFFSET: 8
> # END:
> 
> Cleaning up...
> Shutting down service at https://140.221.8.62:59300
> Got channel MetaChannel: 2039421489[205498061: {}] -> GSSSChannel-0494700354(1)[205498061: {}]
> + Done
> vanquish$
> 
> The _swiftwrap file was created in the workdirectory shared/ subdir but has length zero. So presumably it was about to be transferred but the transfer failed.
> 
> vanquish$ ls -lR /home/wilde/swiftwork/crush/*8i1
> /home/wilde/swiftwork/crush/catsn-20100720-1006-z1vio8i1:
> total 2
> drwxr-sr-x 2 wilde mcsz 3 Jul 20 10:06 shared/
> 
> /home/wilde/swiftwork/crush/catsn-20100720-1006-z1vio8i1/shared:
> total 1
> -rw-r--r-- 1 wilde mcsz 0 Jul 20 10:06 _swiftwrap
> vanquish$ 
> 
> ----- "Mihael Hategan" <hategan at mcs.anl.gov> wrote:
> 
> > Most of the problems that were obvious with coaster file staging
> > should
> > be fixed now. I ran a few tests for 1024 cat jobs on TP with ssh:pbs
> > with 2-8 workers/node (such that "concurrent" workers are tested) and
> > it
> > consistently seemed fine.
> > 
> > I also quickly made a fake provider and I am getting a rate of about
> > 100
> > j/s. So that seems not to infirm my previous suspicion.
> > 
> > On Mon, 2010-07-12 at 11:52 -0500, Michael Wilde wrote:
> > > Here's my view on these:
> > > 
> > > > 2. test/fix coaster file staging
> > > 
> > > This would be useful for both real apps and (I think) for CDM
> > testing. I would do this first.
> > > 
> > > I would then add:
> > > 
> > > 5. Adjustments needed, if any, on multicore handling in PBS and SGE
> > provider.
> > > 
> > > 6. Adjustments and fixes for reliability and logging, if needed, in
> > Condor-G provider.
> > > 
> > > I expect that 5 & 6 would be small tasks, and they are not yet
> > clearly defined. I think that other people could do them.
> > > 
> > > Maybe add:
> > > 
> > > 7. -tui fixes. Seems not to be working so well on recent tests;
> > several of the screens, including the source-code view, seem not to be
> > working.
> > > 
> > > Then:
> > > 
> > > > 1. make swift core faster
> > > 
> > > I would do this second; I think you said you need about 7-10 days to
> > try things and see what can be done, maybe more after that if the
> > exploration suggests things that will take much (re)coding?
> > > 
> > > > 3. standalone coaster service
> > > 
> > > The current manual coasters is proving useful. 
> > > > 4. swift shell
> > > 
> > > Lets defer (4) for now; if we can instead run swift repeatedly and
> > either have the coaster worker pool re-connect quickly to each new
> > swift, or quickly start new pools within the same cluster job(s), that
> > would suffice for now.
> > > 
> > > Justin, do you want to weigh in on these?
> > > 
> > > Thanks,
> > > 
> > > Mike
> > > 
> > > 
> > > > The idea is that some recent changes may have shifted the
> > existing
> > > > priorities. So think of this from the perspective of
> > > > user/application/publication goals rather than what you think
> > would
> > > > be
> > > > "nice to have".
> > > > 
> > > > Mihael
> > > > 
> > > > _______________________________________________
> > > > Swift-devel mailing list
> > > > Swift-devel at ci.uchicago.edu
> > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> > >
> 


From hategan at mcs.anl.gov  Tue Jul 20 12:29:25 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 20 Jul 2010 12:29:25 -0500
Subject: [Swift-devel] Re: Problems with coaster data provider
In-Reply-To: <1279646527.16305.0.camel@blabla2.none>
References: <1289205052.66121279639229161.JavaMail.root@zimbra.anl.gov>
	<1279646527.16305.0.camel@blabla2.none>
Message-ID: <1279646965.16565.1.camel@blabla2.none>

On Tue, 2010-07-20 at 12:22 -0500, Mihael Hategan wrote:
> That is odd. It looks like all characters for the swift wrapper are in
> uppercase.

Actually it looks like something is off there.

Btw, the coaster provider staging is different from the coaster data
provider. If you want the former, say use.provider.staging=true in
swift.properties.

> 
> 
> On Tue, 2010-07-20 at 09:20 -0600, wilde at mcs.anl.gov wrote:
> > I tried the coaster data provider from MCS host vanquish to crush (2 of the compute servers) via ssh:local and get the error:
> > 
> > "org.globus.cog.abstraction.impl.file.FileResourceException: org.globus.cog.karajan.workflow.service.ProtocolException: Unknown command: #!/BIN/BASH" (Full error text below).
> > 
> > Has anyone else tried the coaster data provider?
> > 
> > My sites file has the single pool:
> > 
> >   <pool handle="crush">
> >     <execution provider="coaster" url="crush.mcs.anl.gov" jobmanager="ssh:local"/>
> >     <profile namespace="globus" key="workersPerNode">8</profile>
> >     <profile namespace="globus" key="maxTime">3500</profile>
> >     <profile namespace="globus" key="slots">1</profile>
> >     <profile namespace="globus" key="nodeGranularity">1</profile>
> >     <profile namespace="globus" key="maxNodes">1</profile>
> > 
> >     <profile key="jobThrottle" namespace="karajan">.07</profile>
> >     <profile namespace="karajan" key="initialScore">10000</profile>
> > 
> >     <filesystem provider="coaster" url="ssh://crush.mcs.anl.gov" />
> >     <workdirectory>/home/wilde/swiftwork/crush</workdirectory>
> >   </pool>
> > 
> > Is that the correct url= value?
> > 
> > I set these properties:
> > 
> > wrapperlog.always.transfer=false
> > sitedir.keep=true
> > execution.retries=0
> > status.mode=provider
> > 
> > The run command, svn version, and full error text on stdout/err is:
> > 
> > vanquish$ swift -tc.file tc -sites.file crushds.xml -config cf catsn.swift -n=1
> > Swift svn swift-r3449 cog-r2816
> > 
> > RunID: 20100720-1006-z1vio8i1
> > Progress:
> > Progress:  Failed:1
> > Execution failed:
> > 	Could not initialize shared directory on crush
> > Caused by:
> > 	org.globus.cog.abstraction.impl.file.FileResourceException: org.globus.cog.karajan.workflow.service.ProtocolException: Unknown command: #!/BIN/BASH
> > # THIS SCRIPT MUST BE INVOKED INSIDE OF BASH, NOT PLAIN SH
> > # NOTE THAT THIS SCRIPT MODIFIES $IFS
> > 
> > INFOSECTION() {
> > 
> > ...full text of _swiftwrap shows up here, in upper case...
> > 
> > # ENSURE WE EXIT WITH A 0 AFTER A SUCCESSFUL EXECUTION
> > EXIT 0
> > 
> > # LOCAL VARIABLES: 
> > # MODE: SH
> > # SH-BASIC-OFFSET: 8
> > # END:
> > 
> > Cleaning up...
> > Shutting down service at https://140.221.8.62:59300
> > Got channel MetaChannel: 2039421489[205498061: {}] -> GSSSChannel-0494700354(1)[205498061: {}]
> > + Done
> > vanquish$
> > 
> > The _swiftwrap file was created in the workdirectory shared/ subdir but has length zero. So presumably it was about to be transferred but the transfer failed.
> > 
> > vanquish$ ls -lR /home/wilde/swiftwork/crush/*8i1
> > /home/wilde/swiftwork/crush/catsn-20100720-1006-z1vio8i1:
> > total 2
> > drwxr-sr-x 2 wilde mcsz 3 Jul 20 10:06 shared/
> > 
> > /home/wilde/swiftwork/crush/catsn-20100720-1006-z1vio8i1/shared:
> > total 1
> > -rw-r--r-- 1 wilde mcsz 0 Jul 20 10:06 _swiftwrap
> > vanquish$ 
> > 
> > ----- "Mihael Hategan" <hategan at mcs.anl.gov> wrote:
> > 
> > > Most of the problems that were obvious with coaster file staging
> > > should
> > > be fixed now. I ran a few tests for 1024 cat jobs on TP with ssh:pbs
> > > with 2-8 workers/node (such that "concurrent" workers are tested) and
> > > it
> > > consistently seemed fine.
> > > 
> > > I also quickly made a fake provider and I am getting a rate of about
> > > 100
> > > j/s. So that seems not to infirm my previous suspicion.
> > > 
> > > On Mon, 2010-07-12 at 11:52 -0500, Michael Wilde wrote:
> > > > Here's my view on these:
> > > > 
> > > > > 2. test/fix coaster file staging
> > > > 
> > > > This would be useful for both real apps and (I think) for CDM
> > > testing. I would do this first.
> > > > 
> > > > I would then add:
> > > > 
> > > > 5. Adjustments needed, if any, on multicore handling in PBS and SGE
> > > provider.
> > > > 
> > > > 6. Adjustments and fixes for reliability and logging, if needed, in
> > > Condor-G provider.
> > > > 
> > > > I expect that 5 & 6 would be small tasks, and they are not yet
> > > clearly defined. I think that other people could do them.
> > > > 
> > > > Maybe add:
> > > > 
> > > > 7. -tui fixes. Seems not to be working so well on recent tests;
> > > several of the screens, including the source-code view, seem not to be
> > > working.
> > > > 
> > > > Then:
> > > > 
> > > > > 1. make swift core faster
> > > > 
> > > > I would do this second; I think you said you need about 7-10 days to
> > > try things and see what can be done, maybe more after that if the
> > > exploration suggests things that will take much (re)coding?
> > > > 
> > > > > 3. standalone coaster service
> > > > 
> > > > The current manual coasters is proving useful. 
> > > > > 4. swift shell
> > > > 
> > > > Lets defer (4) for now; if we can instead run swift repeatedly and
> > > either have the coaster worker pool re-connect quickly to each new
> > > swift, or quickly start new pools within the same cluster job(s), that
> > > would suffice for now.
> > > > 
> > > > Justin, do you want to weigh in on these?
> > > > 
> > > > Thanks,
> > > > 
> > > > Mike
> > > > 
> > > > 
> > > > > The idea is that some recent changes may have shifted the
> > > existing
> > > > > priorities. So think of this from the perspective of
> > > > > user/application/publication goals rather than what you think
> > > would
> > > > > be
> > > > > "nice to have".
> > > > > 
> > > > > Mihael
> > > > > 
> > > > > _______________________________________________
> > > > > Swift-devel mailing list
> > > > > Swift-devel at ci.uchicago.edu
> > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> > > >
> > 
> 
> 
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel


From hategan at mcs.anl.gov  Tue Jul 20 13:47:35 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 20 Jul 2010 13:47:35 -0500
Subject: [Swift-devel] stuff to do
In-Reply-To: <321209788.22551279550511201.JavaMail.root@zimbra.anl.gov>
References: <321209788.22551279550511201.JavaMail.root@zimbra.anl.gov>
Message-ID: <1279651655.17308.2.camel@blabla2.none>

I changed it to run in a separate thread and collapse frequent flushes.
This way it doesn't require user interaction. It may not work very well
in case of power outages, but I don't think that's the most frequent use
of the restart log.

On Mon, 2010-07-19 at 08:41 -0600, Michael Wilde wrote:
> Way cool. Can you make restart/flush a settable property?
> 
> - Mike
> 
> ----- "Mihael Hategan" <hategan at mcs.anl.gov> wrote:
> 
> > On Thu, 2010-07-15 at 18:35 -0500, Mihael Hategan wrote:
> > > 
> > > I also quickly made a fake provider and I am getting a rate of about
> > 100
> > > j/s. So that seems not to infirm my previous suspicion.
> > 
> > Well, it turns out that the flushing the restart log to disk takes
> > some
> > time. As in if I remove the call to flush() I can get 800 jobs/s.
> 


From wozniak at mcs.anl.gov  Wed Jul 21 10:49:18 2010
From: wozniak at mcs.anl.gov (Justin M Wozniak)
Date: Wed, 21 Jul 2010 10:49:18 -0500 (CDT)
Subject: [Swift-devel] GSOC call today
Message-ID: <alpine.DEB.2.00.1007211048380.3403@wozniak-desktop.mcs.anl.gov>


I'll be on for the call...

-- 
Justin M Wozniak


From wozniak at mcs.anl.gov  Mon Jul 26 13:47:27 2010
From: wozniak at mcs.anl.gov (Justin M Wozniak)
Date: Mon, 26 Jul 2010 13:47:27 -0500 (Central Daylight Time)
Subject: [Swift-devel] MPICH/Coasters
Message-ID: <alpine.WNT.2.00.1007261335470.296@justinwozniak>

Hello
 	I just had a meeting with Pavan to talk about what we can do to 
run MPI jobs from Coasters given the new MPICH/Hydra features.  He's 
making a few modifications to MPICH to support this and they should be 
available soon.

Background on Hydra:

http://wiki.mcs.anl.gov/mpich2/index.php/Using_the_Hydra_Process_Manager#Bootstrap_Servers
http://wiki.mcs.anl.gov/mpich2/index.php/Hydra_Process_Management_Framework

Here's the basic idea so far:

* The CoasterService locally runs an mpiexec;
* mpiexec prints a list of (proxy) command lines, then listens;
* The CoasterService passes each command-line to a worker;
* The worker launches the proxy;
* The proxy connects back to mpiexec;
* mpiexec and the proxies complete the user job;
* mpiexec and the proxies shut down.

So, analogous to "manual Coasters", this is "manual MPICH", because 
Coasters is responsible for launching the proxies.

 	Justin

-- 
Justin M Wozniak


From aespinosa at cs.uchicago.edu  Mon Jul 26 14:50:50 2010
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Mon, 26 Jul 2010 14:50:50 -0500
Subject: [Swift-devel] MPICH/Coasters
In-Reply-To: <alpine.WNT.2.00.1007261335470.296@justinwozniak>
References: <alpine.WNT.2.00.1007261335470.296@justinwozniak>
Message-ID: <20100726195050.GA3204@origin>

Sounds like cool stuff.  So essentially they modularized the process manager
component of the mpich2 implementation (or a lamd daemon / manager) to be able
to launch other processes.  

With this framework,  can coasters directly access low-level interconnet
interfaces instead of plain (or GSI) sockets?

On Mon, Jul 26, 2010 at 01:47:27PM -0500, Justin M Wozniak wrote:
> Hello
> 	I just had a meeting with Pavan to talk about what we can do to run
> MPI jobs from Coasters given the new MPICH/Hydra features.  He's
> making a few modifications to MPICH to support this and they should
> be available soon.
> 
> Background on Hydra:
> 
> http://wiki.mcs.anl.gov/mpich2/index.php/Using_the_Hydra_Process_Manager#Bootstrap_Servers
> http://wiki.mcs.anl.gov/mpich2/index.php/Hydra_Process_Management_Framework
> 
> Here's the basic idea so far:
> 
> * The CoasterService locally runs an mpiexec;
> * mpiexec prints a list of (proxy) command lines, then listens;
> * The CoasterService passes each command-line to a worker;
> * The worker launches the proxy;
> * The proxy connects back to mpiexec;
> * mpiexec and the proxies complete the user job;
> * mpiexec and the proxies shut down.
> 
> So, analogous to "manual Coasters", this is "manual MPICH", because
> Coasters is responsible for launching the proxies.
> 
> 	Justin
> 
> -- 
> Justin M Wozniak


From hategan at mcs.anl.gov  Mon Jul 26 15:21:20 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 26 Jul 2010 15:21:20 -0500
Subject: [Swift-devel] MPICH/Coasters
In-Reply-To: <20100726195050.GA3204@origin>
References: <alpine.WNT.2.00.1007261335470.296@justinwozniak>
	<20100726195050.GA3204@origin>
Message-ID: <1280175680.23112.4.camel@blabla2.none>

On Mon, 2010-07-26 at 14:50 -0500, Allan Espinosa wrote:
> Sounds like cool stuff.  So essentially they modularized the process manager
> component of the mpich2 implementation (or a lamd daemon / manager) to be able
> to launch other processes.  

I'm not sure I follow, but I'm guessing the scenario is that you first
get some nodes on which you start their bootstrap server after which you
can submit various mpi applications on-demand without going through the
queuing system again. Right?


From aespinosa at cs.uchicago.edu  Mon Jul 26 15:48:42 2010
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Mon, 26 Jul 2010 15:48:42 -0500
Subject: [Swift-devel] MPICH/Coasters
In-Reply-To: <1280175680.23112.4.camel@blabla2.none>
References: <alpine.WNT.2.00.1007261335470.296@justinwozniak>
	<20100726195050.GA3204@origin>
	<1280175680.23112.4.camel@blabla2.none>
Message-ID: <20100726204842.GB3204@origin>

Right.  Because it will ask the bootstrap server the nodes that participates in
the MPI_WORLD group of the application.  

I was talking about how the mpich2 implementation features dynamically growing
an MPI_WORLD group using the bootstrap server.

On Mon, Jul 26, 2010 at 03:21:20PM -0500, Mihael Hategan wrote:
> On Mon, 2010-07-26 at 14:50 -0500, Allan Espinosa wrote:
> > Sounds like cool stuff.  So essentially they modularized the process manager
> > component of the mpich2 implementation (or a lamd daemon / manager) to be able
> > to launch other processes.  
> 
> I'm not sure I follow, but I'm guessing the scenario is that you first
> get some nodes on which you start their bootstrap server after which you
> can submit various mpi applications on-demand without going through the
> queuing system again. Right?
> 
> 


From hategan at mcs.anl.gov  Mon Jul 26 20:43:40 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 26 Jul 2010 20:43:40 -0500
Subject: [Swift-devel] more on job throughput
Message-ID: <1280195020.29413.4.camel@blabla2.none>

Here's a plot of the number of tasks in the various stages that the
runtime stats track.

This is with 8192 jobs and the fake provider (which does nothing and
finishes tasks almost immediately, and which I should probably commit
somewhere if anybody else wants to play with this).

I also attached the scripts used. You would need to change RuntimeStats
to print the stats more often than the 1s default (say something like
(MIN,MAX)_PERIOD_MS=100).
-------------- next part --------------
A non-text attachment was scrubbed...
Name: timings.png
Type: image/png
Size: 5568 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20100726/ab45be7f/attachment.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: timings.tar.gz
Type: application/x-compressed-tar
Size: 2264 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20100726/ab45be7f/attachment.bin>

From benc at hawaga.org.uk  Tue Jul 27 10:06:56 2010
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 27 Jul 2010 15:06:56 +0000 (GMT)
Subject: [Swift-devel] more on job throughput
In-Reply-To: <1280195020.29413.4.camel@blabla2.none>
References: <1280195020.29413.4.camel@blabla2.none>
Message-ID: <Pine.LNX.4.64.1007271506250.16004@dildano.hawaga.org.uk>


> Here's a plot of the number of tasks in the various stages that the
> runtime stats track.

what is the x-axis on that graph?

-- 


From hategan at mcs.anl.gov  Tue Jul 27 10:50:26 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 27 Jul 2010 10:50:26 -0500
Subject: [Swift-devel] more on job throughput
In-Reply-To: <Pine.LNX.4.64.1007271506250.16004@dildano.hawaga.org.uk>
References: <1280195020.29413.4.camel@blabla2.none>
	<Pine.LNX.4.64.1007271506250.16004@dildano.hawaga.org.uk>
Message-ID: <1280245826.31330.0.camel@blabla2.none>

On Tue, 2010-07-27 at 15:06 +0000, Ben Clifford wrote:
> > Here's a plot of the number of tasks in the various stages that the
> > runtime stats track.
> 
> what is the x-axis on that graph?
> 

Time.


From benc at hawaga.org.uk  Tue Jul 27 10:51:08 2010
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 27 Jul 2010 15:51:08 +0000 (GMT)
Subject: [Swift-devel] more on job throughput
In-Reply-To: <1280245826.31330.0.camel@blabla2.none>
References: <1280195020.29413.4.camel@blabla2.none> 
	<Pine.LNX.4.64.1007271506250.16004@dildano.hawaga.org.uk>
	<1280245826.31330.0.camel@blabla2.none>
Message-ID: <Pine.LNX.4.64.1007271550530.13416@dildano.hawaga.org.uk>


> Time.

 ... units of ...

-- 


From hategan at mcs.anl.gov  Tue Jul 27 10:54:45 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 27 Jul 2010 10:54:45 -0500
Subject: [Swift-devel] more on job throughput
In-Reply-To: <Pine.LNX.4.64.1007271550530.13416@dildano.hawaga.org.uk>
References: <1280195020.29413.4.camel@blabla2.none>
	<Pine.LNX.4.64.1007271506250.16004@dildano.hawaga.org.uk>
	<1280245826.31330.0.camel@blabla2.none>
	<Pine.LNX.4.64.1007271550530.13416@dildano.hawaga.org.uk>
Message-ID: <1280246085.31330.1.camel@blabla2.none>

On Tue, 2010-07-27 at 15:51 +0000, Ben Clifford wrote:
> > Time.
> 
>  ... units of ...
> 

milliseconds.


From hategan at mcs.anl.gov  Tue Jul 27 10:55:53 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 27 Jul 2010 10:55:53 -0500
Subject: [Swift-devel] more on job throughput
In-Reply-To: <Pine.LNX.4.64.1007271506250.16004@dildano.hawaga.org.uk>
References: <1280195020.29413.4.camel@blabla2.none>
	<Pine.LNX.4.64.1007271506250.16004@dildano.hawaga.org.uk>
Message-ID: <1280246153.31546.0.camel@blabla2.none>

On Tue, 2010-07-27 at 15:06 +0000, Ben Clifford wrote:
> > Here's a plot of the number of tasks in the various stages that the
> > runtime stats track.
> 
> what is the x-axis on that graph?
> 

Good point actually. One would also need to print time in RuntimeStats.


From wozniak at mcs.anl.gov  Tue Jul 27 11:49:38 2010
From: wozniak at mcs.anl.gov (Justin M Wozniak)
Date: Tue, 27 Jul 2010 11:49:38 -0500 (Central Daylight Time)
Subject: [Swift-devel] MPICH/Coasters
In-Reply-To: <20100726204842.GB3204@origin>
References: <alpine.WNT.2.00.1007261335470.296@justinwozniak>
	<20100726195050.GA3204@origin>
	<1280175680.23112.4.camel@blabla2.none>
	<20100726204842.GB3204@origin>
Message-ID: <alpine.WNT.2.00.1007271137500.296@justinwozniak>

On Mon, 26 Jul 2010, Allan Espinosa wrote:

> I was talking about how the mpich2 implementation features dynamically growing
> an MPI_WORLD group using the bootstrap server.

We did briefly discuss getting MPI-2 stuff going (like the nameserver for 
MPI_Publish_name()) but I'd like to leave that for future work.

I put a figure up at:

http://www.ci.uchicago.edu/wiki/bin/view/SWFT/CoastersMpi

-- 
Justin M Wozniak


From aespinosa at cs.uchicago.edu  Wed Jul 28 14:34:21 2010
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Wed, 28 Jul 2010 14:34:21 -0500
Subject: [Swift-devel] localscheduler (condor/ condorg) breaking on lots of
	condor jobs
Message-ID: <20100728193421.GA11060@origin>

Hi,

it seems that when there's too many submitted condor jobs, the submit host will
start to complain if it opens too many log, stderr, and stdout files:  

330  Finished successfully:162 Failed but can retry:927
Failed to transfer wrapper log from sleep-LGU-estimate/info/x on USCMS-FNAL-WC1
Progress:  Initializing site shared directory:1  Stage in:2  Submitted:1332
Active:245  Failed:331  Finished successfully:162 Failed but can retry:928
Progress:Failed to cancel job 57445
java.io.IOException: Cannot run program "condor_qedit": java.io.IOException:
error=24, Too many open files
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:460)
        at java.lang.Runtime.exec(Runtime.java:593)
        at java.lang.Runtime.exec(Runtime.java:466)
        at
org.globus.cog.abstraction.impl.scheduler.condor.CondorExecutor.cancel(CondorExecutor.java:254)
        at
org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.cancel(AbstractJobSubmissionTaskHandler.java:85)
        at
org.globus.cog.abstraction.impl.common.AbstractTaskHandler.cancel(AbstractTaskHandler.java:70)
        at
org.globus.cog.karajan.scheduler.submitQueue.NonBlockingCancel.run(NonBlockingCancel.java:26)
        at
edu.emory.mathcs.backport.java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:431)
        at
edu.emory.mathcs.backport.java.util.concurrent.FutureTask.run(FutureTask.java:166)
        at
edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:643)
        at
edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:668)
        at java.lang.Thread.run(Thread.java:619)
Caused by: java.io.IOException: java.io.IOException: error=24, Too many open
files
        at java.lang.UNIXProcess.<init>(UNIXProcess.java:148)
        at java.lang.ProcessImpl.start(ProcessImpl.java:65)
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:453)
        ... 11 more
  Initializing site shared directory:1  Stage in:1  Submitting:2  Submitted:1332
Active:245  Failed:331  Finished successfully:162 Failed but can retry:927
Progress:  Initializing site shared directory:1  Submitting:3  Submitted:1331
Active:245  Failed:331  Finished successfully:162 Failed but can retry:928


This causes jobs to fail.    Here are the logfile entries that I think are
relevant to the failure:

2010-07-28 14:20:07,829-0500 WARN  CondorExecutor Failed to cancel job 57026
java.io.IOException: Cannot run program "condor_rm": java.io.IOException:
error=24, Too many open files
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:460)
        at java.lang.Runtime.exec(Runtime.java:593)
        at java.lang.Runtime.exec(Runtime.java:466)
        at
org.globus.cog.abstraction.impl.scheduler.condor.CondorExecutor.cancel(CondorExecutor.java:275)
        at
org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.cancel(AbstractJobSubmissionTaskHandler.java:85)
        at
org.globus.cog.abstraction.impl.common.AbstractTaskHandler.cancel(AbstractTaskHandler.java:70)
        at
org.globus.cog.karajan.scheduler.submitQueue.NonBlockingCancel.run(NonBlockingCancel.java:26)
        at
edu.emory.mathcs.backport.java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:431)
        at
edu.emory.mathcs.backport.java.util.concurrent.FutureTask.run(FutureTask.java:166)
        at
edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:643)
        at
edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:668)
        at java.lang.Thread.run(Thread.java:619)
Caused by: java.io.IOException: java.io.IOException: error=24, Too many open
files
        at java.lang.UNIXProcess.<init>(UNIXProcess.java:148)
        at java.lang.ProcessImpl.start(ProcessImpl.java:65)
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:453)
        ... 11 more
2010-07-28 14:20:07,856-0500 WARN  CondorExecutor Failed to cancel job 57106

-Allan


From hategan at mcs.anl.gov  Wed Jul 28 14:48:36 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 28 Jul 2010 14:48:36 -0500
Subject: [Swift-devel] localscheduler (condor/ condorg) breaking on
	lots of condor jobs
In-Reply-To: <20100728193421.GA11060@origin>
References: <20100728193421.GA11060@origin>
Message-ID: <1280346516.12761.3.camel@blabla2.none>

Yeah. That's why the provider should be updated to use job logs instead
of condor_qstat/condor_qedit for figuring out status.

That or update limits (and, btw, what does ulimit -a say on that
machine)?

On Wed, 2010-07-28 at 14:34 -0500, Allan Espinosa wrote:
> Hi,
> 
> it seems that when there's too many submitted condor jobs, the submit host will
> start to complain if it opens too many log, stderr, and stdout files:  
> 
> 330  Finished successfully:162 Failed but can retry:927
> Failed to transfer wrapper log from sleep-LGU-estimate/info/x on USCMS-FNAL-WC1
> Progress:  Initializing site shared directory:1  Stage in:2  Submitted:1332
> Active:245  Failed:331  Finished successfully:162 Failed but can retry:928
> Progress:Failed to cancel job 57445
> java.io.IOException: Cannot run program "condor_qedit": java.io.IOException:
> error=24, Too many open files
>         at java.lang.ProcessBuilder.start(ProcessBuilder.java:460)
>         at java.lang.Runtime.exec(Runtime.java:593)
>         at java.lang.Runtime.exec(Runtime.java:466)
>         at
> org.globus.cog.abstraction.impl.scheduler.condor.CondorExecutor.cancel(CondorExecutor.java:254)
>         at
> org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.cancel(AbstractJobSubmissionTaskHandler.java:85)
>         at
> org.globus.cog.abstraction.impl.common.AbstractTaskHandler.cancel(AbstractTaskHandler.java:70)
>         at
> org.globus.cog.karajan.scheduler.submitQueue.NonBlockingCancel.run(NonBlockingCancel.java:26)
>         at
> edu.emory.mathcs.backport.java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:431)
>         at
> edu.emory.mathcs.backport.java.util.concurrent.FutureTask.run(FutureTask.java:166)
>         at
> edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:643)
>         at
> edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:668)
>         at java.lang.Thread.run(Thread.java:619)
> Caused by: java.io.IOException: java.io.IOException: error=24, Too many open
> files
>         at java.lang.UNIXProcess.<init>(UNIXProcess.java:148)
>         at java.lang.ProcessImpl.start(ProcessImpl.java:65)
>         at java.lang.ProcessBuilder.start(ProcessBuilder.java:453)
>         ... 11 more
>   Initializing site shared directory:1  Stage in:1  Submitting:2  Submitted:1332
> Active:245  Failed:331  Finished successfully:162 Failed but can retry:927
> Progress:  Initializing site shared directory:1  Submitting:3  Submitted:1331
> Active:245  Failed:331  Finished successfully:162 Failed but can retry:928
> 
> 
> This causes jobs to fail.    Here are the logfile entries that I think are
> relevant to the failure:
> 
> 2010-07-28 14:20:07,829-0500 WARN  CondorExecutor Failed to cancel job 57026
> java.io.IOException: Cannot run program "condor_rm": java.io.IOException:
> error=24, Too many open files
>         at java.lang.ProcessBuilder.start(ProcessBuilder.java:460)
>         at java.lang.Runtime.exec(Runtime.java:593)
>         at java.lang.Runtime.exec(Runtime.java:466)
>         at
> org.globus.cog.abstraction.impl.scheduler.condor.CondorExecutor.cancel(CondorExecutor.java:275)
>         at
> org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.cancel(AbstractJobSubmissionTaskHandler.java:85)
>         at
> org.globus.cog.abstraction.impl.common.AbstractTaskHandler.cancel(AbstractTaskHandler.java:70)
>         at
> org.globus.cog.karajan.scheduler.submitQueue.NonBlockingCancel.run(NonBlockingCancel.java:26)
>         at
> edu.emory.mathcs.backport.java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:431)
>         at
> edu.emory.mathcs.backport.java.util.concurrent.FutureTask.run(FutureTask.java:166)
>         at
> edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:643)
>         at
> edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:668)
>         at java.lang.Thread.run(Thread.java:619)
> Caused by: java.io.IOException: java.io.IOException: error=24, Too many open
> files
>         at java.lang.UNIXProcess.<init>(UNIXProcess.java:148)
>         at java.lang.ProcessImpl.start(ProcessImpl.java:65)
>         at java.lang.ProcessBuilder.start(ProcessBuilder.java:453)
>         ... 11 more
> 2010-07-28 14:20:07,856-0500 WARN  CondorExecutor Failed to cancel job 57106
> 
> -Allan
> 
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel


From aespinosa at cs.uchicago.edu  Wed Jul 28 15:00:44 2010
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Wed, 28 Jul 2010 15:00:44 -0500
Subject: [Swift-devel] localscheduler (condor/ condorg) breaking on
	lots of condor jobs
In-Reply-To: <1280346516.12761.3.camel@blabla2.none>
References: <20100728193421.GA11060@origin>
	<1280346516.12761.3.camel@blabla2.none>
Message-ID: <20100728200044.GB11060@origin>

Ah, only 1024 files.  That's why.

$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 122880
max locked memory       (kbytes, -l) 32
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 122880
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited


On Wed, Jul 28, 2010 at 02:48:36PM -0500, Mihael Hategan wrote:
> Yeah. That's why the provider should be updated to use job logs instead
> of condor_qstat/condor_qedit for figuring out status.
> 
> That or update limits (and, btw, what does ulimit -a say on that
> machine)?
> 
> On Wed, 2010-07-28 at 14:34 -0500, Allan Espinosa wrote:
> > Hi,
> > 
> > it seems that when there's too many submitted condor jobs, the submit host will
> > start to complain if it opens too many log, stderr, and stdout files:  
> > 
> > 330  Finished successfully:162 Failed but can retry:927
> > Failed to transfer wrapper log from sleep-LGU-estimate/info/x on USCMS-FNAL-WC1
> > Progress:  Initializing site shared directory:1  Stage in:2  Submitted:1332
> > Active:245  Failed:331  Finished successfully:162 Failed but can retry:928
> > Progress:Failed to cancel job 57445
> > java.io.IOException: Cannot run program "condor_qedit": java.io.IOException:
> > error=24, Too many open files
> >         at java.lang.ProcessBuilder.start(ProcessBuilder.java:460)
> >         at java.lang.Runtime.exec(Runtime.java:593)