[Swift-devel] Coasters does only one round of tasks in faster trunk

Michael Wilde wilde at mcs.anl.gov
Wed Jul 10 13:47:05 CDT 2013


Mihael, the 0.94 branch (latest rev) seems to behave the same way on this workflow.

The log excerpt is below. The last set of messages (i.e., "Allocating blocks for a total walltime of: 35s" and "BlockQueueProcessor Jobs in holding queue: 32") keeps repeating, with no new app tasks starting or completing, until I kill the script.

I'll test this again on a different scheduler. The app tasks should run for about 5 seconds each, even though I've used a default maxwalltime of 15 min.

- Mike


2013-07-10 13:32:21,243-0500 DEBUG swift CDM: file://localhost/data.0001.tiny : DEFAULT
2013-07-10 13:32:21,243-0500 DEBUG swift CDM: file://localhost/processpoints.py : DEFAULT
2013-07-10 13:32:21,243-0500 INFO  LateBindingScheduler jobs queued: 437
2013-07-10 13:32:21,243-0500 DEBUG swift CDM: file://localhost/out/seq/seq00313 : DEFAULT
2013-07-10 13:32:21,244-0500 DEBUG swift FILE_STAGE_IN_START file=seq00313 srchost=localhost srcdir=out/seq srcname=seq00313 desthost\
=cluster destdir=paintgrid-20130710-1331-g435v1l3/shared/out/seq provider=file policy=DEFAULT
2013-07-10 13:32:21,248-0500 INFO  LateBindingScheduler jobs queued: 437
2013-07-10 13:32:21,248-0500 DEBUG swift FILE_STAGE_IN_END file=seq00313 srchost=localhost srcdir=out/seq srcname=seq00313 desthost=c\
luster destdir=paintgrid-20130710-1331-g435v1l3/shared/out/seq provider=file
2013-07-10 13:32:21,248-0500 INFO  swift END jobid=python-69som5cl - Staging in finished
2013-07-10 13:32:21,249-0500 DEBUG swift JOB_START jobid=python-69som5cl tr=python arguments=[processpoints.py, data.0001.tiny, out/s\
eq/seq00313, 0.0] tmpdir=paintgrid-20130710-1331-g435v1l3/jobs/6/python-69som5cl host=cluster
2013-07-10 13:32:21,251-0500 INFO  GridExec TASK_DEFINITION: Task(type=JOB_SUBMISSION, identity=urn:0-5-296-1-1-1373481110671) is /bi\
n/bash shared/_swiftwrap python-69som5cl -jobdir 6 -scratch  -e python -out out/out.00313 -err stderr.txt -i -d out/seq|out -if proce\
sspoints.py|data.0001.tiny|out/seq/seq00313 -of out/out.00313 -k  -cdmfile  -status provider -a processpoints.py data.0001.tiny out/s\
eq/seq00313 0.0  
2013-07-10 13:32:21,252-0500 INFO  RequestHandler Handler(tag: 65, SUBMITJOB) unregistering (send)
2013-07-10 13:32:21,312-0500 INFO  BlockQueueProcessor Jobs in holding queue: 32
2013-07-10 13:32:21,312-0500 INFO  BlockQueueProcessor Time estimate for holding queue (seconds): 36
2013-07-10 13:32:21,312-0500 INFO  BlockQueueProcessor Allocating blocks for a total walltime of: 35s
2013-07-10 13:32:22,354-0500 INFO  BlockQueueProcessor Jobs in holding queue: 32
2013-07-10 13:32:22,354-0500 INFO  BlockQueueProcessor Time estimate for holding queue (seconds): 36
2013-07-10 13:32:22,354-0500 INFO  BlockQueueProcessor Allocating blocks for a total walltime of: 35s
2013-07-10 13:32:23,377-0500 INFO  BlockQueueProcessor Jobs in holding queue: 32
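
To spot this pattern without reading the log by eye, here is a minimal Python sketch. It assumes only the message formats visible in the excerpt above ("BlockQueueProcessor Jobs in holding queue: N" and "JOB_START"). The function name stalled_queue_size, the threshold min_repeats, and the "JOB_END" marker are hypothetical choices for illustration and are not taken from Swift itself.

```python
import re

# Matches the holding-queue size reported by BlockQueueProcessor in the log above.
HOLDING = re.compile(r"BlockQueueProcessor Jobs in holding queue: (\d+)")


def stalled_queue_size(lines, min_repeats=3):
    """Return the holding-queue size if the tail of the log looks stalled.

    "Stalled" here means: the same queue size was reported at least
    min_repeats times in a row, with no job activity line in between.
    Returns None otherwise.
    """
    run_value = None   # queue size of the current run of identical reports
    run_length = 0     # how many times in a row it has been reported
    for line in lines:
        match = HOLDING.search(line)
        if match:
            value = int(match.group(1))
            if value == run_value:
                run_length += 1
            else:
                run_value, run_length = value, 1
        elif " JOB_START " in line or " JOB_END " in line:
            # Job activity means tasks are still moving, so reset the run.
            # JOB_END is an assumed marker; only JOB_START appears in the excerpt.
            run_value, run_length = None, 0
    return run_value if run_length >= min_repeats else None


if __name__ == "__main__":
    import sys
    # Usage: python check_stall.py < swift-run.log
    size = stalled_queue_size(sys.stdin)
    print("stalled with %d jobs in holding queue" % size if size is not None
          else "no stall detected")
```

Run against the excerpt above, this should report 32, since the queue size repeats three times after the last JOB_START.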


----- Original Message -----
> From: "Michael Wilde" <wilde at mcs.anl.gov>
> To: "Swift Devel" <swift-devel at ci.uchicago.edu>
> Sent: Wednesday, July 10, 2013 1:09:04 PM
> Subject: [Swift-devel] Coasters does only one round of tasks in faster trunk
> 
> Mihael,
> 
> I've run into a more serious problem: I'm running on a single SGE node
> with 32 cores.
> 
> The swift script has submitted 400+ app() tasks.
> 
> 32 completed, but then coasters doesn't seem to be sending new jobs
> to the node.
> 
> The log is at:
> http://www.ci.uchicago.edu/~wilde/paintgrid-20130710-1252-v9p47hm8.log
> 
> I'll try the same on a 0.94.1 rev.
> 
> - Mike
>    
> 
> 
> ----- Original Message -----
> > From: "Michael Wilde" <wilde at mcs.anl.gov>
> > To: "Swift Devel" <swift-devel at ci.uchicago.edu>
> > Sent: Wednesday, July 10, 2013 12:53:44 PM
> > Subject: Initial tests of new (faster) trunk
> > 
> > Initial tests of the new trunk are working for me, but I'm seeing
> > three odd things:
> > 
> > 1. [Error] sites.beagle.coasters.xml:1:9: cvc-elt.1: Cannot find
> > the declaration of element 'config'.
> >    (as reported in the email below)
> > 
> > 2. {env.HOME} is not interpreted in sites.xml as in prior versions.
> > Swift created a workdir named "$PWD/{env.HOME}/swiftwork"
> > 
> > 3. Progress ticker lines seem to be defaulting to one per second,
> > even with no status changing.
> > 
> > But so far so good - a first test script using the new code is
> > running nicely on SGE.
> > 
> > - Mike
> > 
> > ----- Original Message -----
> > > From: "Mihael Hategan" <hategan at mcs.anl.gov>
> > > To: "David Kelly" <davidk at ci.uchicago.edu>
> > > Cc: swift-support at ci.uchicago.edu, "Michael Wilde"
> > > <wilde at mcs.anl.gov>
> > > Sent: Tuesday, February 19, 2013 2:46:14 PM
> > > Subject: Re: [Swift Support #22699] Fwd: [Swift-devel] First
> > > tests
> > > with swift faster
> > > 
> > > Yeah. The validation fails. You can ignore it for now. I'll fix
> > > in the future. Basically there is code to validate the sites file
> > > against the XML schema, but it fails. It's not a fatal issue
> > > though, and parsing still happens.
> > > 
> > > Mihael
> > > 
> > > On Tue, 2013-02-19 at 14:22 -0600, David Kelly wrote:
> > > > I tried updating from svn and running with the added url tags:
> > > > 
> > > > 
> > > > 
> > > > <config>
> > > > 
> > > > 
> > > > <pool handle="beagle">
> > > > <execution provider="coaster" jobmanager="local:pbs"
> > > > url="localhost"/>
> > > > <profile namespace="globus" key="jobsPerNode">1</profile>
> > > > <profile namespace="globus"
> > > > key="lowOverAllocation">100</profile>
> > > > <profile namespace="globus"
> > > > key="highOverAllocation">100</profile>
> > > > <profile namespace="globus"
> > > > key="providerAttributes">pbs.aprun;pbs.mpp;depth=24</profile>
> > > > <profile namespace="globus" key="maxTime">4000</profile>
> > > > <profile namespace="globus"
> > > > key="maxWallTime">00:05:00</profile>
> > > > <profile namespace="globus"
> > > > key="disableIdleBlockCleanup">true</profile>
> > > > <profile namespace="globus" key="slots">1</profile>
> > > > <profile namespace="globus" key="nodeGranularity">1</profile>
> > > > <profile namespace="globus" key="maxNodes">1</profile>
> > > > <profile namespace="globus" key="queue">batch</profile>
> > > > <profile namespace="karajan" key="jobThrottle">8.00</profile>
> > > > <profile namespace="karajan" key="initialScore">10000</profile>
> > > > <filesystem provider="local" url="localhost" />
> > > > <workdirectory>/lustre/beagle/davidk</workdirectory>
> > > > </pool>
> > > > 
> > > > 
> > > > </config>
> > > > 
> > > > 
> > > > I am seeing this error:
> > > > 
> > > > 
> > > > [Error] sites.beagle.coasters.xml:1:9: cvc-elt.1: Cannot find
> > > > the
> > > > declaration of element 'config'.
> > > > 
> > > > 
> > > > ----- Original Message -----
> > > > 
> > > > 
> > > > From: "Mike Wilde" <swift-support at ci.uchicago.edu>
> > > > Sent: Tuesday, February 19, 2013 1:50:15 PM
> > > > Subject: [Swift Support #22699] Fwd: [Swift-devel] First tests
> > > > with
> > > > swift faster
> > > > 
> > > > 
> > > > Tue Feb 19 13:50:14 2013: Request 22699 was acted upon.
> > > > Transaction: Ticket created by wilde at mcs.anl.gov
> > > > Queue: swift-support
> > > > Subject: Fwd: [Swift-devel] First tests with swift faster
> > > > Owner: Nobody
> > > > Requestors: wilde at ci.uchicago.edu
> > > > Status: new
> > > > Ticket <URL:
> > > > https://rt.ci.uchicago.edu/Ticket/Display.html?id=22699 >
> > > > 
> > > > 
> > > > 
> > > > David, Mihael, Yadu: could one of you try this on Beagle on the
> > > > faster branch?
> > > > 
> > > > Does the faster branch include the PBS support for Beagle?
> > > > 
> > > > It shouldn't be too hard to see what part of the PBS pool def it
> > > > doesn't like.
> > > > 
> > > > - Mike
> > > > 
> > > > ----- Forwarded Message -----
> > > > From: "Lorenzo Pesce" <lpesce at uchicago.edu>
> > > > To: "Swift Devel" <swift-devel at ci.uchicago.edu>
> > > > Sent: Tuesday, February 19, 2013 1:26:20 PM
> > > > Subject: [Swift-devel] First tests with swift faster
> > > > 
> > > > 
> > > > This is the content of the file where we have the first
> > > > complaint
> > > > from swift (see attached):
> > > > 
> > > > 
> > > > <config>
> > > > <pool handle="pbs">
> > > > <execution provider="coaster" jobmanager="local:pbs"/>
> > > > <!-- replace with your project -->
> > > > <profile namespace="globus"
> > > > key="project">CI-DEB000002</profile>
> > > > 
> > > > 
> > > > <profile namespace="globus"
> > > > key="providerAttributes">pbs.aprun;pbs.mpp;depth=24</profile>
> > > > 
> > > > 
> > > > 
> > > > 
> > > > <profile namespace="globus" key="jobsPerNode">24</profile>
> > > > <profile namespace="globus" key="maxTime">172800</profile>
> > > > <profile namespace="globus" key="maxwalltime">0:10:00</profile>
> > > > <profile namespace="globus"
> > > > key="lowOverallocation">100</profile>
> > > > <profile namespace="globus"
> > > > key="highOverallocation">100</profile>
> > > > 
> > > > 
> > > > <profile namespace="globus" key="slots">200</profile>
> > > > <profile namespace="globus" key="nodeGranularity">1</profile>
> > > > <profile namespace="globus" key="maxNodes">1</profile>
> > > > 
> > > > 
> > > > <profile namespace="karajan" key="jobThrottle">47.99</profile>
> > > > <profile namespace="karajan" key="initialScore">10000</profile>
> > > > 
> > > > 
> > > > <filesystem provider="local"/>
> > > > <!-- replace this with your home on lustre -->
> > > > <workdirectory>/lustre/beagle/samseaver/GS/swift.workdir</workdirectory>
> > > > </pool>
> > > > </config>
> > > > 
> > > > 
> > > > Any ideas?
> > > > 
> > > > 
> > > > Begin forwarded message:
> > > > 
> > > > 
> > > > 
> > > > From: Sam Seaver < samseaver at gmail.com >
> > > > 
> > > > Date: February 19, 2013 1:16:28 PM CST
> > > > 
> > > > To: Lorenzo Pesce < lpesce at uchicago.edu >
> > > > 
> > > > Subject: Re: How are things going?
> > > > 
> > > > 
> > > > I got this error. I suspect using the new SWIFT_HOME directory
> > > > means that there's possibly a missing parameter someplace:
> > > > 
> > > > 
> > > > 
> > > > should we resume a previous calculation? [y/N] y
> > > > rlog files displayed in reverse time order
> > > > should I use GS-20130203-0717-jgeppt98.0.rlog ?[y/n]
> > > > y
> > > > Using GS-20130203-0717-jgeppt98.0.rlog
> > > > [Error] GS_sites.xml:1:9: cvc-elt.1: Cannot find the
> > > > declaration
> > > > of
> > > > element 'config'.
> > > > 
> > > > 
> > > > Execution failed:
> > > > Failed to parse site catalog
> > > > swift:siteCatalog @ scheduler.k, line: 31
> > > > Caused by: Invalid pool entry 'pbs':
> > > > swift:siteCatalog @ scheduler.k, line: 31
> > > > Caused by: java.lang.IllegalArgumentException: Missing URL
> > > > at
> > > > org.griphyn.vdl.karajan.lib.SiteCatalog.execution(SiteCatalog.java:173)
> > > > at
> > > > org.griphyn.vdl.karajan.lib.SiteCatalog.pool(SiteCatalog.java:100)
> > > > at
> > > > org.griphyn.vdl.karajan.lib.SiteCatalog.buildResources(SiteCatalog.java:60)
> > > > at
> > > > org.griphyn.vdl.karajan.lib.SiteCatalog.function(SiteCatalog.java:48)
> > > > at
> > > > org.globus.cog.karajan.compiled.nodes.functions.AbstractFunction.runBody(AbstractFunction.java:38)
> > > > at
> > > > org.globus.cog.karajan.compiled.nodes.InternalFunction.run(InternalFunction.java:154)
> > > > at
> > > > org.globus.cog.karajan.compiled.nodes.CompoundNode.runChild(CompoundNode.java:87)
> > > > at
> > > > org.globus.cog.karajan.compiled.nodes.InternalFunction.run(InternalFunction.java:147)
> > > > at
> > > > org.globus.cog.karajan.compiled.nodes.CompoundNode.runChild(CompoundNode.java:87)
> > > > at
> > > > org.globus.cog.karajan.compiled.nodes.InternalFunction.run(InternalFunction.java:147)
> > > > at
> > > > org.globus.cog.karajan.compiled.nodes.CompoundNode.runChild(CompoundNode.java:87)
> > > > at
> > > > org.globus.cog.karajan.compiled.nodes.FramedInternalFunction.run(FramedInternalFunction.java:63)
> > > > at
> > > > org.globus.cog.karajan.compiled.nodes.Import.runBody(Import.java:269)
> > > > at
> > > > org.globus.cog.karajan.compiled.nodes.InternalFunction.run(InternalFunction.java:154)
> > > > at
> > > > org.globus.cog.karajan.compiled.nodes.CompoundNode.runChild(CompoundNode.java:87)
> > > > at
> > > > org.globus.cog.karajan.compiled.nodes.FramedInternalFunction.run(FramedInternalFunction.java:63)
> > > > at org.globus.cog.karajan.compiled.nodes.Main.run(Main.java:79)
> > > > at k.thr.LWThread.run(LWThread.java:243)
> > > > at
> > > > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
> > > > at
> > > > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
> > > > at java.lang.Thread.run(Thread.java:722)
> > > > 
> > > > 
> > > > 
> > > > On Tue, Feb 19, 2013 at 1:13 PM, Sam Seaver
> > > > < samseaver at gmail.com > wrote:
> > > > 
> > > > 
> > > > 
> > > > OK, it got to the point where it really did hang. I'm retrying,
> > > > but with your suggestions. The other three finished fine!
> > > > 
> > > > 
> > > > 
> > > > Progress: time: Tue, 19 Feb 2013 19:08:53 +0000 Selecting
> > > > site:18147 Submitted:174 Active:96 Failed:2 Finished
> > > > successfully:132323 Failed but can retry:183
> > > > Progress: time: Tue, 19 Feb 2013 19:09:23 +0000 Selecting
> > > > site:18147 Submitted:174 Active:96 Failed:2 Finished
> > > > successfully:132323 Failed but can retry:183
> > > > Progress: time: Tue, 19 Feb 2013 19:09:53 +0000 Selecting
> > > > site:18147 Submitted:174 Active:96 Failed:2 Finished
> > > > successfully:132323 Failed but can retry:183
> > > > Progress: time: Tue, 19 Feb 2013 19:10:23 +0000 Selecting
> > > > site:18147 Submitted:174 Active:96 Failed:2 Finished
> > > > successfully:132323 Failed but can retry:183
> > > > Progress: time: Tue, 19 Feb 2013 19:10:53 +0000 Selecting
> > > > site:18147 Submitted:174 Active:96 Failed:2 Finished
> > > > successfully:132323 Failed but can retry:183
> > > > Progress: time: Tue, 19 Feb 2013 19:11:23 +0000 Selecting
> > > > site:18147 Submitted:174 Active:96 Failed:2 Finished
> > > > successfully:132323 Failed but can retry:183
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > On Tue, Feb 19, 2013 at 8:51 AM, Lorenzo Pesce <
> > > > lpesce at uchicago.edu > wrote:
> > > > 
> > > > 
> > > > 
> > > > Hmm...
> > > > 
> > > > 
> > > > foreach.max.threads=100
> > > > 
> > > > 
> > > > maybe you should increase this number a bit and see what
> > > > happens.
> > > > 
> > > > 
> > > > Also, I would try to replace
> > > > 
> > > > 
> > > > SWIFT_HOME=/home/wilde/swift/rev/swift-r6151-cog-r3552
> > > > 
> > > > 
> > > > with
> > > > 
> > > > 
> > > > SWIFT_HOME=/soft/swift/fast
> > > > 
> > > > 
> > > > Keep me posted. Let's get this rolling.
> > > > 
> > > > 
> > > > if it doesn't work, I can redo the packing.
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > On Feb 19, 2013, at 1:07 AM, Sam Seaver wrote:
> > > > 
> > > > 
> > > > 
> > > > Actually, the ten agents job does seem to be stuck in a strange
> > > > loop. It is incrementing the number of jobs that have finished
> > > > successfully, and at a fast pace, but the number of jobs it's
> > > > starting is decrementing much more slowly; it's almost as if it's
> > > > repeatedly attempting the same set of parameters multiple
> > > > times...
> > > > 
> > > > 
> > > > I'll see what it's doing in the morning
> > > > S
> > > > 
> > > > 
> > > > 
> > > > On Tue, Feb 19, 2013 at 1:00 AM, Sam Seaver
> > > > < samseaver at gmail.com > wrote:
> > > > 
> > > > 
> > > > 
> > > > Seems to have worked overall this time!
> > > > 
> > > > 
> > > > I resumed four jobs, each for a different number of agents
> > > > (10, 100, 1000, 10000); it made it easier for me to decide on
> > > > the app time. Two of them have already finished, i.e.:
> > > > 
> > > > 
> > > > 
> > > > Progress: time: Mon, 18 Feb 2013 23:50:12 +0000 Active:4
> > > > Checking
> > > > status:1 Finished in previous run:148098 Finished
> > > > successfully:37897
> > > > Progress: time: Mon, 18 Feb 2013 23:50:15 +0000 Active:2
> > > > Checking
> > > > status:1 Finished in previous run:148098 Finished
> > > > successfully:37899
> > > > Final status: Mon, 18 Feb 2013 23:50:15 +0000 Finished in
> > > > previous
> > > > run:148098 Finished successfully:37902
> > > > 
> > > > 
> > > > and the only one that is showing any failure (50/110000) is the
> > > > ten agents version, which is so short I can understand why, but
> > > > it's still actively trying to run jobs and is actively finishing
> > > > jobs, so that's good.
> > > > 
> > > > 
> > > > Yay!
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > On Mon, Feb 18, 2013 at 1:09 PM, Lorenzo Pesce <
> > > > lpesce at uchicago.edu > wrote:
> > > > 
> > > > 
> > > > 
> > > > Good. Keep me posted, I would really like to solve your
> > > > problems in running on Beagle this week, I wish that Swift
> > > > would have been friendlier.
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > On Feb 18, 2013, at 1:01 PM, Sam Seaver wrote:
> > > > 
> > > > 
> > > > 
> > > > I just resumed the jobs that I'd killed before the system went
> > > > down, let's see how it does. I always did a mini-review of the
> > > > data I've got and it seems to be working as expected.
> > > > 
> > > > 
> > > > 
> > > > On Mon, Feb 18, 2013 at 12:28 PM, Lorenzo Pesce <
> > > > lpesce at uchicago.edu > wrote:
> > > > 
> > > > 
> > > > 
> > > > I have lost track a bit of what's up. I am happy to try and go
> > > > over
> > > > it with you when you are ready.
> > > > 
> > > > 
> > > > Some of the problems of swift might have improved with a new
> > > > version and the new system.
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > On Feb 18, 2013, at 12:22 PM, Sam Seaver wrote:
> > > > 
> > > > 
> > > > 
> > > > They're not, I've not looked since Beagle came back up. Will do
> > > > so
> > > > later today.
> > > > S
> > > > 
> > > > 
> > > > 
> > > > On Mon, Feb 18, 2013 at 12:20 PM, Lorenzo Pesce <
> > > > lpesce at uchicago.edu > wrote:
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > --
> > > > Postdoctoral Fellow
> > > > Mathematics and Computer Science Division
> > > > Argonne National Laboratory
> > > > 9700 S. Cass Avenue
> > > > Argonne, IL 60439
> > > > 
> > > > http://www.linkedin.com/pub/sam-seaver/0/412/168
> > > > samseaver at gmail.com
> > > > (773) 796-7144
> > > > 
> > > > "We shall not cease from exploration
> > > > And the end of all our exploring
> > > > Will be to arrive where we started
> > > > And know the place for the first time."
> > > > --T. S. Eliot
> > > > 
> > > > 
> > > > I tried updating from svn and running with the added url tags:
> > > > 
> > > > 
> > > > <config>
> > > > 
> > > > 
> > > >   <pool handle="beagle">
> > > >     <execution provider="coaster" jobmanager="local:pbs"
> > > > url="localhost"/>
> > > >     <profile namespace="globus" key="jobsPerNode">1</profile>
> > > >     <profile namespace="globus"
> > > >     key="lowOverAllocation">100</profile>
> > > >     <profile namespace="globus"
> > > >     key="highOverAllocation">100</profile>
> > > >     <profile namespace="globus"
> > > > key="providerAttributes">pbs.aprun;pbs.mpp;depth=24</profile>
> > > >     <profile namespace="globus" key="maxTime">4000</profile>
> > > >     <profile namespace="globus"
> > > >     key="maxWallTime">00:05:00</profile>
> > > >     <profile namespace="globus"
> > > > key="disableIdleBlockCleanup">true</profile>
> > > >     <profile namespace="globus" key="slots">1</profile>
> > > >     <profile namespace="globus"
> > > >     key="nodeGranularity">1</profile>
> > > >     <profile namespace="globus" key="maxNodes">1</profile>
> > > >     <profile namespace="globus" key="queue">batch</profile>
> > > >     <profile namespace="karajan"
> > > >     key="jobThrottle">8.00</profile>
> > > >     <profile namespace="karajan"
> > > >     key="initialScore">10000</profile>
> > > >     <filesystem provider="local" url="localhost" />
> > > >     <workdirectory>/lustre/beagle/davidk</workdirectory>
> > > >   </pool>
> > > > 
> > > > 
> > > > </config>
> > > > 
> > > > 
> > > > I am seeing this error:
> > > > 
> > > > 
> > > > [Error] sites.beagle.coasters.xml:1:9: cvc-elt.1: Cannot find
> > > > the
> > > > declaration of element 'config'.
> > > > 
> > > > 
> > > > 
> > > > 
> > > > ______________________________________________________________________
> > > >         From: "Mike Wilde" <swift-support at ci.uchicago.edu>
> > > >         Sent: Tuesday, February 19, 2013 1:50:15 PM
> > > >         Subject: [Swift Support #22699] Fwd: [Swift-devel]
> > > >         First
> > > >         tests
> > > >         with swift faster
> > > >         
> > > >         
> > > >         Tue Feb 19 13:50:14 2013: Request 22699 was acted upon.
> > > >          Transaction: Ticket created by wilde at mcs.anl.gov
> > > >                Queue: swift-support
> > > >              Subject: Fwd: [Swift-devel] First tests with swift
> > > >              faster
> > > >                Owner: Nobody
> > > >           Requestors: wilde at ci.uchicago.edu
> > > >               Status: new
> > > >          Ticket <URL:
> > > >         https://rt.ci.uchicago.edu/Ticket/Display.html?id=22699
> > > >         >
> > > >         
> > > >         
> > > >         
> > > >         David, Mihael, Yadu: could one of you try this on
> > > >         Beagle
> > > >         on
> > > >         the faster branch?
> > > >         
> > > >         Does the faster branch include the PBS support for
> > > >         Beagle?
> > > >         
> > > >         It shouldnt be too hard to see what part of the PBS
> > > >         pool
> > > >         def
> > > >         it doesnt like.
> > > >         
> > > >         - Mike
> > > >         
> > > >         ----- Forwarded Message -----
> > > >         From: "Lorenzo Pesce" <lpesce at uchicago.edu>
> > > >         To: "Swift Devel" <swift-devel at ci.uchicago.edu>
> > > >         Sent: Tuesday, February 19, 2013 1:26:20 PM
> > > >         Subject: [Swift-devel] First tests with swift faster
> > > >         
> > > >         
> > > >         This is the content of the file where we have the first
> > > >         complaint from swift (see attached):
> > > >         
> > > >         
> > > >         <config>
> > > >         <pool handle="pbs">
> > > >         <execution provider="coaster" jobmanager="local:pbs"/>
> > > >         <!-- replace with your project -->
> > > >         <profile namespace="globus"
> > > >         key="project">CI-DEB000002</profile>
> > > >         
> > > >         
> > > >         <profile namespace="globus"
> > > >         key="providerAttributes">pbs.aprun;pbs.mpp;depth=24</profile>
> > > >         
> > > >         
> > > >         
> > > >         
> > > >         <profile namespace="globus"
> > > >         key="jobsPerNode">24</profile>
> > > >         <profile namespace="globus"
> > > >         key="maxTime">172800</profile>
> > > >         <profile namespace="globus"
> > > >         key="maxwalltime">0:10:00</profile>
> > > >         <profile namespace="globus"
> > > >         key="lowOverallocation">100</profile>
> > > >         <profile namespace="globus"
> > > >         key="highOverallocation">100</profile>
> > > >         
> > > >         
> > > >         <profile namespace="globus" key="slots">200</profile>
> > > >         <profile namespace="globus"
> > > >         key="nodeGranularity">1</profile>
> > > >         <profile namespace="globus" key="maxNodes">1</profile>
> > > >         
> > > >         
> > > >         <profile namespace="karajan"
> > > >         key="jobThrottle">47.99</profile>
> > > >         <profile namespace="karajan"
> > > >         key="initialScore">10000</profile>
> > > >         
> > > >         
> > > >         <filesystem provider="local"/>
> > > >         <!-- replace this with your home on lustre -->
> > > >         <workdirectory>/lustre/beagle/samseaver/GS/swift.workdir</workdirectory>
> > > >         </pool>
> > > >         </config>
> > > >         
> > > >         
> > > >         Any ideas?
> > > >         
> > > >         
> > > >         Begin forwarded message:
> > > >         
> > > >         
> > > >         
> > > >         From: Sam Seaver < samseaver at gmail.com >
> > > >         
> > > >         Date: February 19, 2013 1:16:28 PM CST
> > > >         
> > > >         To: Lorenzo Pesce < lpesce at uchicago.edu >
> > > >         
> > > >         Subject: Re: How are things going?
> > > >         
> > > >         
> > > >         I got this error. I suspect using the new SWIFT_HOME
> > > >         directory
> > > >         means that there's possibly a missing parameter
> > > >         someplace:
> > > >         
> > > >         
> > > >         
> > > >         should we resume a previous calculation? [y/N] y
> > > >         rlog files displayed in reverse time order
> > > >         should I use GS-20130203-0717-jgeppt98.0.rlog ?[y/n]
> > > >         y
> > > >         Using GS-20130203-0717-jgeppt98.0.rlog
> > > >         [Error] GS_sites.xml:1:9: cvc-elt.1: Cannot find the
> > > >         declaration of element 'config'.
> > > >         
> > > >         
> > > >         Execution failed:
> > > >         Failed to parse site catalog
> > > >         swift:siteCatalog @ scheduler.k, line: 31
> > > >         Caused by: Invalid pool entry 'pbs':
> > > >         swift:siteCatalog @ scheduler.k, line: 31
> > > >         Caused by: java.lang.IllegalArgumentException: Missing
> > > >         URL
> > > >         at
> > > >         org.griphyn.vdl.karajan.lib.SiteCatalog.execution(SiteCatalog.java:173)
> > > >         at
> > > >         org.griphyn.vdl.karajan.lib.SiteCatalog.pool(SiteCatalog.java:100)
> > > >         at
> > > >         org.griphyn.vdl.karajan.lib.SiteCatalog.buildResources(SiteCatalog.java:60)
> > > >         at
> > > >         org.griphyn.vdl.karajan.lib.SiteCatalog.function(SiteCatalog.java:48)
> > > >         at
> > > >         org.globus.cog.karajan.compiled.nodes.functions.AbstractFunction.runBody(AbstractFunction.java:38)
> > > >         at
> > > >         org.globus.cog.karajan.compiled.nodes.InternalFunction.run(InternalFunction.java:154)
> > > >         at
> > > >         org.globus.cog.karajan.compiled.nodes.CompoundNode.runChild(CompoundNode.java:87)
> > > >         at
> > > >         org.globus.cog.karajan.compiled.nodes.InternalFunction.run(InternalFunction.java:147)
> > > >         at
> > > >         org.globus.cog.karajan.compiled.nodes.CompoundNode.runChild(CompoundNode.java:87)
> > > >         at
> > > >         org.globus.cog.karajan.compiled.nodes.InternalFunction.run(InternalFunction.java:147)
> > > >         at
> > > >         org.globus.cog.karajan.compiled.nodes.CompoundNode.runChild(CompoundNode.java:87)
> > > >         at
> > > >         org.globus.cog.karajan.compiled.nodes.FramedInternalFunction.run(FramedInternalFunction.java:63)
> > > >         at
> > > >         org.globus.cog.karajan.compiled.nodes.Import.runBody(Import.java:269)
> > > >         at
> > > >         org.globus.cog.karajan.compiled.nodes.InternalFunction.run(InternalFunction.java:154)
> > > >         at
> > > >         org.globus.cog.karajan.compiled.nodes.CompoundNode.runChild(CompoundNode.java:87)
> > > >         at
> > > >         org.globus.cog.karajan.compiled.nodes.FramedInternalFunction.run(FramedInternalFunction.java:63)
> > > >         at
> > > >         org.globus.cog.karajan.compiled.nodes.Main.run(Main.java:79)
> > > >         at k.thr.LWThread.run(LWThread.java:243)
> > > >         at
> > > >         java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
> > > >         at java.util.concurrent.ThreadPoolExecutor
> > > >         $Worker.run(ThreadPoolExecutor.java:603)
> > > >         at java.lang.Thread.run(Thread.java:722)
> > > >         
> > > >         
> > > >         
> > > >         On Tue, Feb 19, 2013 at 1:13 PM, Sam Seaver <
> > > >         samseaver at gmail.com > wrote:
> > > >         
> > > >         
> > > >         
> > > >         OK, it got to the point where it really did hang. I'm
> > > >         retrying, but with your suggestions. The other three
> > > >         finished
> > > >         fine!
> > > >         
> > > >         
> > > >         
> > > >         Progress: time: Tue, 19 Feb 2013 19:08:53 +0000
> > > >         Selecting
> > > >         site:18147 Submitted:174 Active:96 Failed:2 Finished
> > > >         successfully:132323 Failed but can retry:183
> > > >         Progress: time: Tue, 19 Feb 2013 19:09:23 +0000
> > > >         Selecting
> > > >         site:18147 Submitted:174 Active:96 Failed:2 Finished
> > > >         successfully:132323 Failed but can retry:183
> > > >         Progress: time: Tue, 19 Feb 2013 19:09:53 +0000
> > > >         Selecting
> > > >         site:18147 Submitted:174 Active:96 Failed:2 Finished
> > > >         successfully:132323 Failed but can retry:183
> > > >         Progress: time: Tue, 19 Feb 2013 19:10:23 +0000
> > > >         Selecting
> > > >         site:18147 Submitted:174 Active:96 Failed:2 Finished
> > > >         successfully:132323 Failed but can retry:183
> > > >         Progress: time: Tue, 19 Feb 2013 19:10:53 +0000
> > > >         Selecting
> > > >         site:18147 Submitted:174 Active:96 Failed:2 Finished
> > > >         successfully:132323 Failed but can retry:183
> > > >         Progress: time: Tue, 19 Feb 2013 19:11:23 +0000
> > > >         Selecting
> > > >         site:18147 Submitted:174 Active:96 Failed:2 Finished
> > > >         successfully:132323 Failed but can retry:183
> > > >         
> > > >         
> > > >         
> > > >         
> > > >         
> > > >         On Tue, Feb 19, 2013 at 8:51 AM, Lorenzo Pesce <
> > > >         lpesce at uchicago.edu > wrote:
> > > >         
> > > >         
> > > >         
> > > >         Hmm...
> > > >         
> > > >         
> > > >         foreach.max.threads=100
> > > >         
> > > >         
> > > >         Maybe you should increase this number a bit and see
> > > >         what happens.
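
For anyone who wants to script the change suggested above, here is a minimal sketch, assuming the setting lives in a plain `key=value` properties file. The helper name `set_property` and the new value 200 are purely illustrative; the thread only says to increase the number "a bit".

```python
def set_property(text, key, value):
    """Return properties text with key set to value, appending it if absent."""
    out, found = [], False
    for line in text.splitlines():
        stripped = line.strip()
        # Leave comment lines alone; rewrite only a live "key=..." entry.
        if not stripped.startswith("#") and stripped.split("=", 1)[0].strip() == key:
            out.append(f"{key}={value}")
            found = True
        else:
            out.append(line)
    if not found:
        out.append(f"{key}={value}")
    return "\n".join(out) + "\n"


# 200 is a hypothetical new value, not one recommended in the thread.
original = "foreach.max.threads=100\n"
print(set_property(original, "foreach.max.threads", 200), end="")
```

Working on the text rather than the file keeps the sketch easy to test; writing the result back to the actual properties file is left to the caller.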
> > > >         
> > > >         
> > > >         Also, I would try to replace
> > > >         
> > > >         
> > > >         SWIFT_HOME=/home/wilde/swift/rev/swift-r6151-cog-r3552
> > > >         
> > > >         
> > > >         with
> > > >         
> > > >         
> > > >         SWIFT_HOME=/soft/swift/fast
> > > >         
> > > >         
> > > >         Keep me posted. Let's get this rolling.
> > > >         
> > > >         
> > > >         If it doesn't work, I can redo the packing.
> > > >         
> > > >         
> > > >         
> > > >         
> > > >         
> > > >         
> > > >         
> > > >         
> > > >         
> > > >         
> > > >         
> > > >         On Feb 19, 2013, at 1:07 AM, Sam Seaver wrote:
> > > >         
> > > >         
> > > >         
> > > >         Actually, the ten-agents job does seem to be stuck in a
> > > >         strange loop. It is incrementing the number of jobs
> > > >         that have finished successfully, and at a fast pace,
> > > >         but the number of jobs it's starting is decrementing
> > > >         much more slowly. It's almost as if it's repeatedly
> > > >         attempting the same set of parameters multiple times...
> > > >         
> > > >         
> > > >         I'll see what it's doing in the morning
> > > >         S
> > > >         
> > > >         
> > > >         
> > > >         On Tue, Feb 19, 2013 at 1:00 AM, Sam Seaver <
> > > >         samseaver at gmail.com > wrote:
> > > >         
> > > >         
> > > >         
> > > >         Seems to have worked overall this time!
> > > >         
> > > >         
> > > >         I resumed four jobs, each for a different number of
> > > >         agents (10, 100, 1000, 10000), which made it easier for
> > > >         me to decide on the app time. Two of them have already
> > > >         finished, i.e.:
> > > >         
> > > >         
> > > >         
> > > >         Progress: time: Mon, 18 Feb 2013 23:50:12 +0000 Active:4 Checking status:1 Finished in previous run:148098 Finished successfully:37897
> > > >         Progress: time: Mon, 18 Feb 2013 23:50:15 +0000 Active:2 Checking status:1 Finished in previous run:148098 Finished successfully:37899
> > > >         Final status: Mon, 18 Feb 2013 23:50:15 +0000 Finished in previous run:148098 Finished successfully:37902
> > > >         
> > > >         
> > > >         The only one showing any failures (50/110000) is the
> > > >         ten-agents version, which is so short that I can
> > > >         understand why, but it's still actively trying to run
> > > >         jobs and is actively finishing jobs, so that's good.
> > > >         
> > > >         
> > > >         Yay!
> > > >         
> > > >         
> > > >         
> > > >         
> > > >         
> > > >         
> > > >         
> > > >         On Mon, Feb 18, 2013 at 1:09 PM, Lorenzo Pesce <
> > > >         lpesce at uchicago.edu > wrote:
> > > >         
> > > >         
> > > >         
> > > >         Good. Keep me posted. I would really like to solve your
> > > >         problems running on Beagle this week. I wish Swift had
> > > >         been friendlier.
> > > >         
> > > >         
> > > >         
> > > >         
> > > >         
> > > >         On Feb 18, 2013, at 1:01 PM, Sam Seaver wrote:
> > > >         
> > > >         
> > > >         
> > > >         I just resumed the jobs that I'd killed before the
> > > >         system went down; let's see how it does. I always did a
> > > >         mini-review of the data I've got and it seems to be
> > > >         working as expected.
> > > >         
> > > >         
> > > >         
> > > >         On Mon, Feb 18, 2013 at 12:28 PM, Lorenzo Pesce <
> > > >         lpesce at uchicago.edu > wrote:
> > > >         
> > > >         
> > > >         
> > > >         I have lost track a bit of what's up. I am happy to try
> > > >         and go over it with you when you are ready.
> > > >         
> > > >         
> > > >         Some of the problems with Swift might have improved
> > > >         with a new version and the new system.
> > > >         
> > > >         
> > > >         
> > > >         
> > > >         
> > > >         
> > > >         
> > > >         On Feb 18, 2013, at 12:22 PM, Sam Seaver wrote:
> > > >         
> > > >         
> > > >         
> > > >         They're not, I've not looked since Beagle came back up.
> > > >         Will do so later today.
> > > >         S
> > > >         
> > > >         
> > > >         
> > > >         On Mon, Feb 18, 2013 at 12:20 PM, Lorenzo Pesce <
> > > >         lpesce at uchicago.edu > wrote:
> > > >         
> > > >         
> > > >         
> > > >         
> > > >         
> > > >         
> > > >         --
> > > >         Postdoctoral Fellow
> > > >         Mathematics and Computer Science Division
> > > >         Argonne National Laboratory
> > > >         9700 S. Cass Avenue
> > > >         Argonne, IL 60439
> > > >         
> > > >         http://www.linkedin.com/pub/sam-seaver/0/412/168
> > > >         samseaver at gmail.com
> > > >         (773) 796-7144
> > > >         
> > > >         "We shall not cease from exploration
> > > >         And the end of all our exploring
> > > >         Will be to arrive where we started
> > > >         And know the place for the first time."
> > > >         --T. S. Eliot
> > > >         
> > > >         _______________________________________________
> > > >         Swift-devel mailing list
> > > >         Swift-devel at ci.uchicago.edu
> > > >         https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
> > > >         
> > > > 
> > > > 
> > > 
> > > 
> > > 
> > 


