[Swift-devel] First tests with swift faster

Lorenzo Pesce lpesce at uchicago.edu
Tue Feb 19 13:26:20 CST 2013


This is the content of the file where we have the first complaint from swift (see attached): 
<config>
  <pool handle="pbs">
    <execution provider="coaster" jobmanager="local:pbs"/>
    <!-- replace with your project -->
    <profile namespace="globus" key="project">CI-DEB000002</profile>

    <profile namespace="globus" key="providerAttributes">pbs.aprun;pbs.mpp;depth=24</profile>


    <profile namespace="globus" key="jobsPerNode">24</profile>
    <profile namespace="globus" key="maxTime">172800</profile>
    <profile namespace="globus" key="maxwalltime">0:10:00</profile>
    <profile namespace="globus" key="lowOverallocation">100</profile>
    <profile namespace="globus" key="highOverallocation">100</profile>

    <profile namespace="globus" key="slots">200</profile>
    <profile namespace="globus" key="nodeGranularity">1</profile>
    <profile namespace="globus" key="maxNodes">1</profile>

    <profile namespace="karajan" key="jobThrottle">47.99</profile>
    <profile namespace="karajan" key="initialScore">10000</profile>

    <filesystem provider="local"/>
    <!-- replace this with your home on lustre -->
    <workdirectory>/lustre/beagle/samseaver/GS/swift.workdir</workdirectory>
  </pool>
</config>

Any ideas?
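
One thing that can be checked mechanically: the trace below reports "Missing URL" from SiteCatalog.execution while parsing pool 'pbs', and the <execution> element above has no url attribute. Whether this Swift build actually requires that attribute is an assumption, not something confirmed in this thread. The sketch below only lists pools whose <execution> element lacks one; the function name and the sample string are hypothetical.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def executions_missing_url(sites_xml):
    """Return handles of pools whose <execution> element has no url attribute.

    Assumption: the 'Missing URL' error refers to this attribute.
    """
    root = ET.fromstring(sites_xml)
    missing = []
    for pool in root.iter():
        if local_name(pool.tag) != "pool":
            continue
        for child in pool.iter():
            if local_name(child.tag) == "execution" and not child.get("url"):
                missing.append(pool.get("handle", "<unnamed>"))
    return missing


# Trimmed copy of the site catalog above (profiles omitted for brevity).
SAMPLE = """<config>
  <pool handle="pbs">
    <execution provider="coaster" jobmanager="local:pbs"/>
    <filesystem provider="local"/>
  </pool>
</config>"""

print(executions_missing_url(SAMPLE))
```

Run against the full GS_sites.xml, this would only confirm or rule out the missing attribute; it does not say what value the attribute should have, and it does not address the separate schema complaint about the 'config' element.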

Begin forwarded message:

> From: Sam Seaver <samseaver at gmail.com>
> Date: February 19, 2013 1:16:28 PM CST
> To: Lorenzo Pesce <lpesce at uchicago.edu>
> Subject: Re: How are things going?
> 
> I got this error.  I suspect using the new SWIFT_HOME directory means that there's possibly a missing parameter someplace:
> 
> should we resume a previous calculation? [y/N] y
> rlog files displayed in reverse time order
> should I use GS-20130203-0717-jgeppt98.0.rlog ?[y/n]
> y
> Using  GS-20130203-0717-jgeppt98.0.rlog
> [Error] GS_sites.xml:1:9: cvc-elt.1: Cannot find the declaration of element 'config'.
> 
> Execution failed:
> Failed to parse site catalog
>         swift:siteCatalog @ scheduler.k, line: 31
> Caused by: Invalid pool entry 'pbs': 
>         swift:siteCatalog @ scheduler.k, line: 31
> Caused by: java.lang.IllegalArgumentException: Missing URL
>         at org.griphyn.vdl.karajan.lib.SiteCatalog.execution(SiteCatalog.java:173)
>         at org.griphyn.vdl.karajan.lib.SiteCatalog.pool(SiteCatalog.java:100)
>         at org.griphyn.vdl.karajan.lib.SiteCatalog.buildResources(SiteCatalog.java:60)
>         at org.griphyn.vdl.karajan.lib.SiteCatalog.function(SiteCatalog.java:48)
>         at org.globus.cog.karajan.compiled.nodes.functions.AbstractFunction.runBody(AbstractFunction.java:38)
>         at org.globus.cog.karajan.compiled.nodes.InternalFunction.run(InternalFunction.java:154)
>         at org.globus.cog.karajan.compiled.nodes.CompoundNode.runChild(CompoundNode.java:87)
>         at org.globus.cog.karajan.compiled.nodes.InternalFunction.run(InternalFunction.java:147)
>         at org.globus.cog.karajan.compiled.nodes.CompoundNode.runChild(CompoundNode.java:87)
>         at org.globus.cog.karajan.compiled.nodes.InternalFunction.run(InternalFunction.java:147)
>         at org.globus.cog.karajan.compiled.nodes.CompoundNode.runChild(CompoundNode.java:87)
>         at org.globus.cog.karajan.compiled.nodes.FramedInternalFunction.run(FramedInternalFunction.java:63)
>         at org.globus.cog.karajan.compiled.nodes.Import.runBody(Import.java:269)
>         at org.globus.cog.karajan.compiled.nodes.InternalFunction.run(InternalFunction.java:154)
>         at org.globus.cog.karajan.compiled.nodes.CompoundNode.runChild(CompoundNode.java:87)
>         at org.globus.cog.karajan.compiled.nodes.FramedInternalFunction.run(FramedInternalFunction.java:63)
>         at org.globus.cog.karajan.compiled.nodes.Main.run(Main.java:79)
>         at k.thr.LWThread.run(LWThread.java:243)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>         at java.lang.Thread.run(Thread.java:722)
> 
> 
> On Tue, Feb 19, 2013 at 1:13 PM, Sam Seaver <samseaver at gmail.com> wrote:
> OK, it got to the point where it really did hang.  I'm retrying, but with your suggestions.  The other three finished fine!
> 
> Progress:  time: Tue, 19 Feb 2013 19:08:53 +0000  Selecting site:18147  Submitted:174  Active:96  Failed:2  Finished successfully:132323  Failed but can retry:183
> Progress:  time: Tue, 19 Feb 2013 19:09:23 +0000  Selecting site:18147  Submitted:174  Active:96  Failed:2  Finished successfully:132323  Failed but can retry:183
> Progress:  time: Tue, 19 Feb 2013 19:09:53 +0000  Selecting site:18147  Submitted:174  Active:96  Failed:2  Finished successfully:132323  Failed but can retry:183
> Progress:  time: Tue, 19 Feb 2013 19:10:23 +0000  Selecting site:18147  Submitted:174  Active:96  Failed:2  Finished successfully:132323  Failed but can retry:183
> Progress:  time: Tue, 19 Feb 2013 19:10:53 +0000  Selecting site:18147  Submitted:174  Active:96  Failed:2  Finished successfully:132323  Failed but can retry:183
> Progress:  time: Tue, 19 Feb 2013 19:11:23 +0000  Selecting site:18147  Submitted:174  Active:96  Failed:2  Finished successfully:132323  Failed but can retry:183
> 
> 
> On Tue, Feb 19, 2013 at 8:51 AM, Lorenzo Pesce <lpesce at uchicago.edu> wrote:
> Hmm... 
> 
> foreach.max.threads=100
> 
> maybe you should increase this number a bit and see what happens.
> 
> Also, I would try to replace
> 
> SWIFT_HOME=/home/wilde/swift/rev/swift-r6151-cog-r3552
> 
> with 
> 
> SWIFT_HOME=/soft/swift/fast
> 
> Keep me posted. Let's get this rolling.
> 
> If it doesn't work, I can redo the packing.
> 
> 
> 
> 
> On Feb 19, 2013, at 1:07 AM, Sam Seaver wrote:
> 
>> Actually, the ten-agents job does seem to be stuck in a strange loop.  It is incrementing the number of jobs that have finished successfully, and at a fast pace, but the number of jobs it's starting is decrementing much more slowly; it's almost as if it's repeatedly attempting the same set of parameters multiple times...
>> 
>> I'll see what it's doing in the morning
>> S
>> 
>> 
>> On Tue, Feb 19, 2013 at 1:00 AM, Sam Seaver <samseaver at gmail.com> wrote:
>> Seems to have worked overall this time!
>> 
>> I resumed four jobs, each for a different number of agents (10, 100, 1000, 10000); that made it easier for me to decide on the app time.  Two of them have already finished, i.e.:
>> 
>> Progress:  time: Mon, 18 Feb 2013 23:50:12 +0000  Active:4  Checking status:1  Finished in previous run:148098  Finished successfully:37897
>> Progress:  time: Mon, 18 Feb 2013 23:50:15 +0000  Active:2  Checking status:1  Finished in previous run:148098  Finished successfully:37899
>> Final status: Mon, 18 Feb 2013 23:50:15 +0000  Finished in previous run:148098  Finished successfully:37902
>> 
>> and the only one that is showing any failures (50/110000) is the ten-agents version, which is so short I can understand why, but it's still actively trying to run jobs and is actively finishing jobs, so that's good.
>> 
>> Yay!
>> 
>> 
>> 
>> On Mon, Feb 18, 2013 at 1:09 PM, Lorenzo Pesce <lpesce at uchicago.edu> wrote:
>> Good. Keep me posted. I would really like to solve your problems running on Beagle this week; I wish Swift had been friendlier.
>> 
>> On Feb 18, 2013, at 1:01 PM, Sam Seaver wrote:
>> 
>>> I just resumed the jobs that I'd killed before the system went down; let's see how it does.  I also did a mini-review of the data I've got and it seems to be working as expected.
>>> 
>>> 
>>> On Mon, Feb 18, 2013 at 12:28 PM, Lorenzo Pesce <lpesce at uchicago.edu> wrote:
>>> I have lost track a bit of what's up. I am happy to try and go over it with you when you are ready.
>>> 
>>> Some of the problems of swift might have improved with a new version and the new system.
>>> 
>>> 
>>> On Feb 18, 2013, at 12:22 PM, Sam Seaver wrote:
>>> 
>>>> They're not, I've not looked since Beagle came back up. Will do so later today.
>>>> S
>>>> 
>>>> 
>>>> On Mon, Feb 18, 2013 at 12:20 PM, Lorenzo Pesce <lpesce at uchicago.edu> wrote:
>>>> 
>>>> 
>>>> 
>>>> -- 
>>>> Postdoctoral Fellow
>>>> Mathematics and Computer Science Division
>>>> Argonne National Laboratory
>>>> 9700 S. Cass Avenue
>>>> Argonne, IL 60439
>>>> 
>>>> http://www.linkedin.com/pub/sam-seaver/0/412/168
>>>> samseaver at gmail.com
>>>> (773) 796-7144
>>>> 
>>>> "We shall not cease from exploration
>>>> And the end of all our exploring
>>>> Will be to arrive where we started
>>>> And know the place for the first time."
>>>>    --T. S. Eliot
>>> 
>> 
> 
