[Swift-devel] Coaster test failed at 86K of 100K jobs

Michael Wilde wilde at mcs.anl.gov
Sat Aug 6 07:38:45 CDT 2011


Mihael,

I rebuilt with that fix. Now I'm getting the error below on runs as small as 1,000 jobs.

Logs are in:  /home/wilde/swiftgrid/test.swift-workers/logs.06

The failing run was *8za.log (catsn-20110806-0728-tpo2b8za.log).

I copied the sites etc files there as well.

com$ ls -lt logs.06
total 1920
-rw-r--r-- 1 wilde ci-users 918172 Aug  6 07:34 swift.log
-rw-r--r-- 1 wilde ci-users    526 Aug  6 07:34 start-grid-service.out
-rw-r--r-- 1 wilde ci-users  11279 Aug  6 07:34 swift-workers.out
-rw-r--r-- 1 wilde ci-users  69555 Aug  6 07:33 condor.log
-rw-r--r-- 1 wilde ci-users 488616 Aug  6 07:28 catsn-20110806-0728-tpo2b8za.log
drwxr-xr-x 2 wilde ci-users      9 Aug  6 07:28 catsn-20110806-0728-tpo2b8za.d/
-rw-r--r-- 1 wilde ci-users    136 Aug  6 07:28 catsn-20110806-0728-tpo2b8za.0.rlog
-rw-r--r-- 1 wilde ci-users 200148 Aug  6 07:28 catsn-20110806-0728-8lecscl7.log
drwxr-xr-x 2 wilde ci-users    102 Aug  6 07:28 catsn-20110806-0728-8lecscl7.d/
-rw-r--r-- 1 wilde ci-users  23388 Aug  6 07:28 catsn-20110806-0728-jvvxoqdg.log
drwxr-xr-x 2 wilde ci-users     12 Aug  6 07:28 catsn-20110806-0728-jvvxoqdg.d/
-rw-r--r-- 1 wilde ci-users   5940 Aug  6 07:28 catsn-20110806-0728-lge9pvy3.log
drwxr-xr-x 2 wilde ci-users      3 Aug  6 07:28 catsn-20110806-0728-lge9pvy3.d/
com$ 


2011-08-06 07:28:46,432-0500 DEBUG vdl:execute2 JOB_START jobid=cat-j2tn42ek tr=cat arguments=[data.txt] tmpdir=catsn-20110806-0728-tpo2b8za/jobs/j/cat-j2tn42ek host=localhost
2011-08-06 07:28:46,432-0500 DEBUG VDL2ExecutionContext org.globus.cog.karajan.workflow.KarajanRuntimeException: Could not convert value to boolean: null
org.globus.cog.karajan.workflow.KarajanRuntimeException: Could not convert value to boolean: null
Caused by: org.globus.cog.karajan.workflow.KarajanRuntimeException: Could not convert value to boolean: null
        at org.globus.cog.karajan.util.TypeUtil.toBoolean(TypeUtil.java:131)
        at org.griphyn.vdl.karajan.lib.Mark.function(Mark.java:30)
        at org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:62)
        at org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.completed(AbstractSequentialWithArguments.java:194)
        at org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:214)
        at org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58)
        at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:28)
        at org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:29)
        at org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:20)
        at org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63)
        at org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:139)
        at org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:197)
        at org.globus.cog.karajan.workflow.FlowElementWrapper.start(FlowElementWrapper.java:227)
        at org.globus.cog.karajan.workflow.events.EventBus.start(EventBus.java:104)
        at org.globus.cog.karajan.workflow.events.EventTargetPair.run(EventTargetPair.java:40)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)
Caused by: org.globus.cog.karajan.workflow.KarajanRuntimeException: Could not convert value to boolean: null
        at org.globus.cog.karajan.util.TypeUtil.toBoolean(TypeUtil.java:127)
        ... 20 more




----- Original Message -----
> From: "Mihael Hategan" <hategan at mcs.anl.gov>
> To: "Michael Wilde" <wilde at mcs.anl.gov>
> Cc: "Swift Devel" <swift-devel at ci.uchicago.edu>
> Sent: Saturday, August 6, 2011 1:02:48 AM
> Subject: Re: [Swift-devel] Coaster test failed at 86K of 100K jobs
> Potential fix is in the 0.93 branch.
> 
> I'm not entirely sure that this was the problem, but it's the only one I
> can see right now.
> 
> The issue is as follows. There is a "special" implementation of a
> CopyOnWriteArrayList in the util module. The standard Java one copies
> the underlying array for EVERY operation that changes the list. This
> guarantees that ongoing iterations will not be messed up by concurrent
> modifications to the list, but is very bad if you have many operations
> that change the list.
> 
> The version in util only does a copy if there is an ongoing iteration on
> a particular underlying array. If no concurrent changes and iterations
> occur, this works at the speed of a normal synchronized list. If
> concurrent changes and iterations occur, there is a copy penalty for
> each iteration (but only once for each iteration). This requires the
> user code to notify the implementation when an iteration is done
> (release).
> 
> The problem was with the way that the lock was implemented. It would be
> increased for every iteration, set to 0 for each mutation operation, and
> decreased if > 0 for a release. That was broken; the following could
> have occurred:
> 
> iteration1start - lock = 1, with array1
> add - lock > 0, copy to array2, lock = 0
> iteration2start - lock = 1, with array2
> iteration1end - lock = 0
> add - lock == 0, add to array2 -> ConcurrentModificationException on iteration2.
> 
> Though I don't see how the usage stats got to iterate twice at the same
> time through stuff.
> 
> Mihael
> 
> 
> On Fri, 2011-08-05 at 22:02 -0700, Mihael Hategan wrote:
> > Amazing how that bug in what would otherwise be a relatively simple
> > class (CopyOnWriteArrayList) has managed to survive so long. Concurrency
> > ain't easy!
> >
> > I'll have a fix committed after I do a bit of testing.
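
For reference, the interleaving Mihael lists above can be replayed on a single thread. The sketch below is only an illustration under assumptions: the class and method names (CowLockDemo, SharedCounterList, PerArrayCounterList, beginIteration) are hypothetical, it is not the code from the util module, and the per-array counter is just one possible repair, not necessarily the fix that went into the 0.93 branch. An ArrayList stands in for the underlying array so that an in-place add during an iteration surfaces as a real ConcurrentModificationException.

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.Iterator;
import java.util.List;

class CowLockDemo {

    /** Hypothetical reconstruction of the broken scheme: one counter shared by all arrays. */
    static class SharedCounterList<T> {
        private List<T> array = new ArrayList<>();
        private int lock = 0;

        synchronized Iterator<T> iterator() {
            lock++;                              // increased for every iteration
            return array.iterator();
        }

        synchronized void release() {
            if (lock > 0) lock--;                // decreased if > 0 on release
        }

        synchronized void add(T item) {
            if (lock > 0) {
                array = new ArrayList<>(array);  // copy: someone is iterating
                lock = 0;                        // set to 0 on mutation
            }
            array.add(item);
        }
    }

    /** One possible repair (an assumption): count readers per underlying array. */
    static class PerArrayCounterList<T> {
        static final class Snapshot<T> {
            final List<T> items;
            int readers = 0;
            Snapshot(List<T> items) { this.items = items; }
        }

        private Snapshot<T> current = new Snapshot<>(new ArrayList<>());

        synchronized Snapshot<T> beginIteration() {
            current.readers++;
            return current;
        }

        synchronized void release(Snapshot<T> s) {
            s.readers--;                         // only affects the array that was iterated
        }

        synchronized void add(T item) {
            if (current.readers > 0) {
                current = new Snapshot<>(new ArrayList<>(current.items));
            }
            current.items.add(item);
        }
    }

    /** Replays the interleaving from the message; true if iteration2 hits a CME. */
    static boolean sharedCounterFails() {
        SharedCounterList<String> list = new SharedCounterList<>();
        list.add("a");
        Iterator<String> it1 = list.iterator();  // iteration1start: lock = 1, array1
        list.add("b");                           // lock > 0: copy to array2, lock = 0
        Iterator<String> it2 = list.iterator();  // iteration2start: lock = 1, array2
        list.release();                          // iteration1end: lock = 0
        list.add("c");                           // lock == 0: array2 changed in place
        try {
            it2.next();
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    /** Same interleaving against the per-array counter; true if iteration2 hits a CME. */
    static boolean perArrayCounterFails() {
        PerArrayCounterList<String> list = new PerArrayCounterList<>();
        list.add("a");
        PerArrayCounterList.Snapshot<String> s1 = list.beginIteration();
        list.add("b");                           // array1 has a reader: copy to array2
        PerArrayCounterList.Snapshot<String> s2 = list.beginIteration();
        Iterator<String> it2 = s2.items.iterator();
        list.release(s1);                        // array2 still has its reader
        list.add("c");                           // array2 has a reader: copy to array3
        try {
            it2.next();
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        } finally {
            list.release(s2);
        }
    }

    public static void main(String[] args) {
        System.out.println("shared counter hits CME:    " + sharedCounterFails());
        System.out.println("per-array counter hits CME: " + perArrayCounterFails());
    }
}
```

With the shared counter, the release from iteration1 wipes out the count that iteration2 had just taken, so the second add goes into array2 in place. Tying the count to the array being iterated avoids that, at the cost of passing the snapshot back on release.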

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory



