[Swift-devel] Coaster test failed at 86K of 100K jobs
Michael Wilde
wilde at mcs.anl.gov
Sat Aug 6 07:38:45 CDT 2011
Mihael,
I rebuilt with that fix. Now I'm getting this error on runs as small as 1,000 jobs.
Logs are in: /home/wilde/swiftgrid/test.swift-workers/logs.06
Failing run was *8za.log
I copied the sites etc files there as well.
com$ ls -lt logs.06
total 1920
-rw-r--r-- 1 wilde ci-users 918172 Aug 6 07:34 swift.log
-rw-r--r-- 1 wilde ci-users 526 Aug 6 07:34 start-grid-service.out
-rw-r--r-- 1 wilde ci-users 11279 Aug 6 07:34 swift-workers.out
-rw-r--r-- 1 wilde ci-users 69555 Aug 6 07:33 condor.log
-rw-r--r-- 1 wilde ci-users 488616 Aug 6 07:28 catsn-20110806-0728-tpo2b8za.log
drwxr-xr-x 2 wilde ci-users 9 Aug 6 07:28 catsn-20110806-0728-tpo2b8za.d/
-rw-r--r-- 1 wilde ci-users 136 Aug 6 07:28 catsn-20110806-0728-tpo2b8za.0.rlog
-rw-r--r-- 1 wilde ci-users 200148 Aug 6 07:28 catsn-20110806-0728-8lecscl7.log
drwxr-xr-x 2 wilde ci-users 102 Aug 6 07:28 catsn-20110806-0728-8lecscl7.d/
-rw-r--r-- 1 wilde ci-users 23388 Aug 6 07:28 catsn-20110806-0728-jvvxoqdg.log
drwxr-xr-x 2 wilde ci-users 12 Aug 6 07:28 catsn-20110806-0728-jvvxoqdg.d/
-rw-r--r-- 1 wilde ci-users 5940 Aug 6 07:28 catsn-20110806-0728-lge9pvy3.log
drwxr-xr-x 2 wilde ci-users 3 Aug 6 07:28 catsn-20110806-0728-lge9pvy3.d/
com$
2011-08-06 07:28:46,432-0500 DEBUG vdl:execute2 JOB_START jobid=cat-j2tn42ek tr=cat arguments=[data.txt] tmpdir=catsn-20110806-0728-tpo2b8za/jobs/j/cat-j2tn42ek host=localhost
2011-08-06 07:28:46,432-0500 DEBUG VDL2ExecutionContext org.globus.cog.karajan.workflow.KarajanRuntimeException: Could not convert value to boolean: null
org.globus.cog.karajan.workflow.KarajanRuntimeException: Could not convert value to boolean: null
Caused by: org.globus.cog.karajan.workflow.KarajanRuntimeException: Could not convert value to boolean: null
at org.globus.cog.karajan.util.TypeUtil.toBoolean(TypeUtil.java:131)
at org.griphyn.vdl.karajan.lib.Mark.function(Mark.java:30)
at org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:62)
at org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.completed(AbstractSequentialWithArguments.java:194)
at org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:214)
at org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58)
at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:28)
at org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:29)
at org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:20)
at org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63)
at org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:139)
at org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:197)
at org.globus.cog.karajan.workflow.FlowElementWrapper.start(FlowElementWrapper.java:227)
at org.globus.cog.karajan.workflow.events.EventBus.start(EventBus.java:104)
at org.globus.cog.karajan.workflow.events.EventTargetPair.run(EventTargetPair.java:40)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)
Caused by: org.globus.cog.karajan.workflow.KarajanRuntimeException: Could not convert value to boolean: null
at org.globus.cog.karajan.util.TypeUtil.toBoolean(TypeUtil.java:127)
... 20 more
----- Original Message -----
> From: "Mihael Hategan" <hategan at mcs.anl.gov>
> To: "Michael Wilde" <wilde at mcs.anl.gov>
> Cc: "Swift Devel" <swift-devel at ci.uchicago.edu>
> Sent: Saturday, August 6, 2011 1:02:48 AM
> Subject: Re: [Swift-devel] Coaster test failed at 86K of 100K jobs
> Potential fix is in the 0.93 branch.
>
> I'm not entirely sure that this was the problem, but it's the only one
> I can see right now.
>
> The issue is as follows. There is a "special" implementation of a
> CopyOnWriteArrayList in the util module. The standard Java one does a
> copy of the underlying array for EVERY operation that changes the list.
> This guarantees that ongoing iterations will not be messed up by
> concurrent modifications to the list, but is very bad if you have many
> operations that change the list.
>
> The version in util only does a copy if there is an ongoing iteration
> on a particular underlying array. If no concurrent changes and
> iterations occur, this works at the speed of a normal synchronized
> list. If concurrent changes and iterations occur, there is a copy
> penalty for each iteration (but only once for each iteration). This
> requires the user code to notify the implementation when an iteration
> is done (release).
>
> The problem was with the way that the lock was implemented. It would
> be increased for every iteration, set to 0 for each mutation operation
> and decreased if > 0 for a release. That was broken; the following
> could have occurred:
>
> iteration1start - lock = 1, with array1
> add - lock > 0, copy to array2, lock = 0
> iteration2start - lock = 1, with array2
> iteration1end - lock = 0
> add - lock == 0, add to array2 -> ConcurrentModificationException on iteration2.
>
> Though I don't see how the usage stats got to iterate twice at the
> same time through stuff.
>
> Mihael
>
>
> On Fri, 2011-08-05 at 22:02 -0700, Mihael Hategan wrote:
> > Amazing how that bug in what would otherwise be a relatively simple
> > class (CopyOnWriteArrayList) has managed to survive so long.
> > Concurrency ain't easy!
> >
> > I'll have a fix committed after I do a bit of testing.
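
For reference, the lock sequence described in the quoted message can be replayed deterministically in a single thread. This is only a minimal sketch of the counter logic as the message describes it, not the actual class from the util module: FlawedLockDemo, FlawedList, startIteration, release and reproduce are hypothetical names chosen for illustration, and a plain ArrayList stands in for the underlying array.

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for the list in the util module; all names are illustrative.
class FlawedLockDemo {

    static class FlawedList {
        private List<String> array = new ArrayList<>();
        private int lock = 0; // one counter shared by every generation of the array (the flaw)

        // Start of an iteration: bump the counter and hand out the current array.
        synchronized List<String> startIteration() {
            lock++;
            return array;
        }

        // End of an iteration: decrease the counter only if it is positive.
        synchronized void release() {
            if (lock > 0) {
                lock--;
            }
        }

        // Mutation: copy only if an iteration appears to be in progress, then reset the counter.
        synchronized void add(String s) {
            if (lock > 0) {
                array = new ArrayList<>(array);
                lock = 0;
            }
            array.add(s);
        }
    }

    // Replays the sequence from the message; returns true if iteration2 is broken.
    static boolean reproduce() {
        FlawedList list = new FlawedList();
        list.add("a");                                            // lock == 0, plain add
        Iterator<String> it1 = list.startIteration().iterator();  // iteration1start: lock = 1, array1
        list.add("b");                                            // lock > 0: copy to array2, lock = 0
        Iterator<String> it2 = list.startIteration().iterator();  // iteration2start: lock = 1, array2
        list.release();                                           // iteration1end: lock = 0
        list.add("c");                                            // lock == 0: adds to array2 in place
        it1.next();                                               // array1 was never touched, so this is fine
        try {
            it2.next();                                           // array2 changed under iteration2
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("iteration2 broken: " + reproduce());
    }
}
```

Running main prints "iteration2 broken: true": the release from iteration1 clears the count that iteration2 had set, so the next add modifies array2 while iteration2 is still using it.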
--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory