[Swift-devel] localscheduler (condor/ condorg) breaking on lots of condor jobs
Allan Espinosa
aespinosa at cs.uchicago.edu
Wed Jul 28 14:34:21 CDT 2010
Hi,
It seems that when too many Condor jobs are submitted, the submit host starts to
fail because it has too many log, stderr, and stdout files open:
330 Finished successfully:162 Failed but can retry:927
Failed to transfer wrapper log from sleep-LGU-estimate/info/x on USCMS-FNAL-WC1
Progress: Initializing site shared directory:1 Stage in:2 Submitted:1332
Active:245 Failed:331 Finished successfully:162 Failed but can retry:928
Progress:Failed to cancel job 57445
java.io.IOException: Cannot run program "condor_qedit": java.io.IOException: error=24, Too many open files
at java.lang.ProcessBuilder.start(ProcessBuilder.java:460)
at java.lang.Runtime.exec(Runtime.java:593)
at java.lang.Runtime.exec(Runtime.java:466)
at org.globus.cog.abstraction.impl.scheduler.condor.CondorExecutor.cancel(CondorExecutor.java:254)
at org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.cancel(AbstractJobSubmissionTaskHandler.java:85)
at org.globus.cog.abstraction.impl.common.AbstractTaskHandler.cancel(AbstractTaskHandler.java:70)
at org.globus.cog.karajan.scheduler.submitQueue.NonBlockingCancel.run(NonBlockingCancel.java:26)
at edu.emory.mathcs.backport.java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:431)
at edu.emory.mathcs.backport.java.util.concurrent.FutureTask.run(FutureTask.java:166)
at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:643)
at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:668)
at java.lang.Thread.run(Thread.java:619)
Caused by: java.io.IOException: java.io.IOException: error=24, Too many open files
at java.lang.UNIXProcess.<init>(UNIXProcess.java:148)
at java.lang.ProcessImpl.start(ProcessImpl.java:65)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:453)
... 11 more
Initializing site shared directory:1 Stage in:1 Submitting:2 Submitted:1332
Active:245 Failed:331 Finished successfully:162 Failed but can retry:927
Progress: Initializing site shared directory:1 Submitting:3 Submitted:1331
Active:245 Failed:331 Finished successfully:162 Failed but can retry:928
This causes jobs to fail. Here are the logfile entries that I think are
relevant to the failure:
2010-07-28 14:20:07,829-0500 WARN CondorExecutor Failed to cancel job 57026
java.io.IOException: Cannot run program "condor_rm": java.io.IOException: error=24, Too many open files
at java.lang.ProcessBuilder.start(ProcessBuilder.java:460)
at java.lang.Runtime.exec(Runtime.java:593)
at java.lang.Runtime.exec(Runtime.java:466)
at org.globus.cog.abstraction.impl.scheduler.condor.CondorExecutor.cancel(CondorExecutor.java:275)
at org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.cancel(AbstractJobSubmissionTaskHandler.java:85)
at org.globus.cog.abstraction.impl.common.AbstractTaskHandler.cancel(AbstractTaskHandler.java:70)
at org.globus.cog.karajan.scheduler.submitQueue.NonBlockingCancel.run(NonBlockingCancel.java:26)
at edu.emory.mathcs.backport.java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:431)
at edu.emory.mathcs.backport.java.util.concurrent.FutureTask.run(FutureTask.java:166)
at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:643)
at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:668)
at java.lang.Thread.run(Thread.java:619)
Caused by: java.io.IOException: java.io.IOException: error=24, Too many open files
at java.lang.UNIXProcess.<init>(UNIXProcess.java:148)
at java.lang.ProcessImpl.start(ProcessImpl.java:65)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:453)
... 11 more
2010-07-28 14:20:07,856-0500 WARN CondorExecutor Failed to cancel job 57106
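
For anyone trying to narrow this down, here is a rough diagnostic sketch. It is
not taken from the CoG or Swift code. It shows two things that may be relevant
to error=24: reading the JVM's open and maximum file descriptor counts, and
launching an external command in a way that always closes the three pipes the
child process holds. The class name FdCheck and the methods fdUsage and
runAndClose are hypothetical names made up for illustration, and the use of
"true" as the command assumes a Unix host. Whether CondorExecutor actually
leaks descriptors, or the per-job log files alone exhaust the limit, is not
something this sketch can tell us.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.InputStream;
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

// Hypothetical helper class, not part of CoG or Swift.
class FdCheck {

    /** Returns {open, max} descriptor counts, or null if this JVM does not expose them. */
    static long[] fdUsage() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
            com.sun.management.UnixOperatingSystemMXBean unix =
                (com.sun.management.UnixOperatingSystemMXBean) os;
            return new long[] {
                unix.getOpenFileDescriptorCount(),
                unix.getMaxFileDescriptorCount()
            };
        }
        return null; // non-Unix or non-HotSpot JVM
    }

    /** Runs a command, discards its output, and always releases the child's pipes. */
    static int runAndClose(String... cmd) throws IOException, InterruptedException {
        // Merge stderr into stdout so only one stream has to be drained.
        Process p = new ProcessBuilder(cmd).redirectErrorStream(true).start();
        try {
            p.getOutputStream().close(); // nothing to send to the child's stdin
            InputStream out = p.getInputStream();
            byte[] buf = new byte[4096];
            while (out.read(buf) != -1) {
                // discard the output; a real caller would parse it here
            }
            return p.waitFor();
        } finally {
            // Each Process holds three descriptors until its streams are closed.
            closeQuietly(p.getInputStream());
            closeQuietly(p.getErrorStream());
            closeQuietly(p.getOutputStream());
            p.destroy();
        }
    }

    private static void closeQuietly(Closeable c) {
        try {
            c.close();
        } catch (IOException ignored) {
            // nothing useful to do if closing a pipe fails
        }
    }

    public static void main(String[] args) throws Exception {
        long[] before = fdUsage();
        for (int i = 0; i < 100; i++) {
            runAndClose("true"); // assumes a Unix "true" binary on the PATH
        }
        long[] after = fdUsage();
        if (before != null && after != null) {
            System.out.println("open fds before: " + before[0]
                + ", after: " + after[0] + ", limit: " + after[1]);
        }
    }
}
```

If the open count climbs steadily toward the limit while jobs are submitted,
that would point at descriptors not being released. If it only tracks the
number of active jobs, the per-process limit reported by `ulimit -n` on the
submit host may simply be too low for this many jobs.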
-Allan