[Swift-devel] localscheduler (condor/ condorg) breaking on lots of condor jobs
Mihael Hategan
hategan at mcs.anl.gov
Wed Jul 28 14:48:36 CDT 2010
Yeah. That's why the provider should be updated to use job logs instead
of condor_q/condor_qedit for figuring out status.
That, or raise the open-file limit. By the way, what does ulimit -a say
on that machine?
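
To illustrate the job-log idea, here is a minimal sketch, not the provider's actual code. It assumes every job is submitted with the same "log = <file>" entry in its submit description, so one file carries the events for all jobs and no external process has to be spawned per status check. The class name, method names, and status strings are made up for illustration, and only a few of Condor's event codes are mapped (000 submit, 001 execute, 005 terminated, 009 aborted, 012 held).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: derives job status from a Condor user log
// instead of running a Condor command for each status query.
public class CondorLogStatus {
    // Event header lines start with a 3-digit event code followed by
    // (cluster.proc.subproc), e.g. "005 (57445.000.000) 07/28 14:20:07 ..."
    private static final Pattern EVENT =
        Pattern.compile("^(\\d{3}) \\((\\d+)\\.(\\d+)\\.(\\d+)\\) .*");

    // Partial mapping from event code to an illustrative status name.
    // Codes not listed here are ignored (null).
    public static String statusFor(int code) {
        switch (code) {
            case 0:  return "SUBMITTED";
            case 1:  return "ACTIVE";
            case 5:  return "COMPLETED";
            case 9:  return "CANCELED";
            case 12: return "HELD";
            default: return null;
        }
    }

    // Returns the most recent mapped status per job, keyed by "cluster.proc".
    public static Map<String, String> latestStatus(List<String> logLines) {
        Map<String, String> status = new LinkedHashMap<>();
        for (String line : logLines) {
            Matcher m = EVENT.matcher(line);
            if (!m.matches()) {
                continue; // event body text or the "..." separator
            }
            String s = statusFor(Integer.parseInt(m.group(1)));
            if (s == null) {
                continue; // an event this sketch does not map
            }
            // parseInt drops the zero padding used in the log
            String jobId = Integer.parseInt(m.group(2)) + "."
                         + Integer.parseInt(m.group(3));
            status.put(jobId, s); // later events overwrite earlier ones
        }
        return status;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java CondorLogStatus <condor-user-log>");
            return;
        }
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        for (Map.Entry<String, String> e : latestStatus(lines).entrySet()) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```

A real implementation would tail the file incrementally rather than re-read it, but either way it keeps one descriptor open instead of opening pipes for a new process on each poll.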
On Wed, 2010-07-28 at 14:34 -0500, Allan Espinosa wrote:
> Hi,
>
> It seems that when there are too many submitted condor jobs, the submit host
> starts to complain because it opens too many log, stderr, and stdout files:
>
> 330 Finished successfully:162 Failed but can retry:927
> Failed to transfer wrapper log from sleep-LGU-estimate/info/x on USCMS-FNAL-WC1
> Progress: Initializing site shared directory:1 Stage in:2 Submitted:1332
> Active:245 Failed:331 Finished successfully:162 Failed but can retry:928
> Progress:Failed to cancel job 57445
> java.io.IOException: Cannot run program "condor_qedit": java.io.IOException: error=24, Too many open files
>     at java.lang.ProcessBuilder.start(ProcessBuilder.java:460)
>     at java.lang.Runtime.exec(Runtime.java:593)
>     at java.lang.Runtime.exec(Runtime.java:466)
>     at org.globus.cog.abstraction.impl.scheduler.condor.CondorExecutor.cancel(CondorExecutor.java:254)
>     at org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.cancel(AbstractJobSubmissionTaskHandler.java:85)
>     at org.globus.cog.abstraction.impl.common.AbstractTaskHandler.cancel(AbstractTaskHandler.java:70)
>     at org.globus.cog.karajan.scheduler.submitQueue.NonBlockingCancel.run(NonBlockingCancel.java:26)
>     at edu.emory.mathcs.backport.java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:431)
>     at edu.emory.mathcs.backport.java.util.concurrent.FutureTask.run(FutureTask.java:166)
>     at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:643)
>     at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:668)
>     at java.lang.Thread.run(Thread.java:619)
> Caused by: java.io.IOException: java.io.IOException: error=24, Too many open files
>     at java.lang.UNIXProcess.<init>(UNIXProcess.java:148)
>     at java.lang.ProcessImpl.start(ProcessImpl.java:65)
>     at java.lang.ProcessBuilder.start(ProcessBuilder.java:453)
>     ... 11 more
> Initializing site shared directory:1 Stage in:1 Submitting:2 Submitted:1332
> Active:245 Failed:331 Finished successfully:162 Failed but can retry:927
> Progress: Initializing site shared directory:1 Submitting:3 Submitted:1331
> Active:245 Failed:331 Finished successfully:162 Failed but can retry:928
>
>
> This causes jobs to fail. Here are the logfile entries that I think are
> relevant to the failure:
>
> 2010-07-28 14:20:07,829-0500 WARN CondorExecutor Failed to cancel job 57026
> java.io.IOException: Cannot run program "condor_rm": java.io.IOException: error=24, Too many open files
>     at java.lang.ProcessBuilder.start(ProcessBuilder.java:460)
>     at java.lang.Runtime.exec(Runtime.java:593)
>     at java.lang.Runtime.exec(Runtime.java:466)
>     at org.globus.cog.abstraction.impl.scheduler.condor.CondorExecutor.cancel(CondorExecutor.java:275)
>     at org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.cancel(AbstractJobSubmissionTaskHandler.java:85)
>     at org.globus.cog.abstraction.impl.common.AbstractTaskHandler.cancel(AbstractTaskHandler.java:70)
>     at org.globus.cog.karajan.scheduler.submitQueue.NonBlockingCancel.run(NonBlockingCancel.java:26)
>     at edu.emory.mathcs.backport.java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:431)
>     at edu.emory.mathcs.backport.java.util.concurrent.FutureTask.run(FutureTask.java:166)
>     at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:643)
>     at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:668)
>     at java.lang.Thread.run(Thread.java:619)
> Caused by: java.io.IOException: java.io.IOException: error=24, Too many open files
>     at java.lang.UNIXProcess.<init>(UNIXProcess.java:148)
>     at java.lang.ProcessImpl.start(ProcessImpl.java:65)
>     at java.lang.ProcessBuilder.start(ProcessBuilder.java:453)
>     ... 11 more
> 2010-07-28 14:20:07,856-0500 WARN CondorExecutor Failed to cancel job 57106
>
> -Allan