[Swift-devel] localscheduler (condor/ condorg) breaking on lots of condor jobs

Allan Espinosa aespinosa at cs.uchicago.edu
Wed Jul 28 14:34:21 CDT 2010


Hi,

It seems that when there are too many submitted Condor jobs, the submit host
starts to complain that it has too many log, stderr, and stdout files open:

330  Finished successfully:162 Failed but can retry:927
Failed to transfer wrapper log from sleep-LGU-estimate/info/x on USCMS-FNAL-WC1
Progress:  Initializing site shared directory:1  Stage in:2  Submitted:1332
Active:245  Failed:331  Finished successfully:162 Failed but can retry:928
Progress:Failed to cancel job 57445
java.io.IOException: Cannot run program "condor_qedit": java.io.IOException: error=24, Too many open files
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:460)
        at java.lang.Runtime.exec(Runtime.java:593)
        at java.lang.Runtime.exec(Runtime.java:466)
        at org.globus.cog.abstraction.impl.scheduler.condor.CondorExecutor.cancel(CondorExecutor.java:254)
        at org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.cancel(AbstractJobSubmissionTaskHandler.java:85)
        at org.globus.cog.abstraction.impl.common.AbstractTaskHandler.cancel(AbstractTaskHandler.java:70)
        at org.globus.cog.karajan.scheduler.submitQueue.NonBlockingCancel.run(NonBlockingCancel.java:26)
        at edu.emory.mathcs.backport.java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:431)
        at edu.emory.mathcs.backport.java.util.concurrent.FutureTask.run(FutureTask.java:166)
        at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:643)
        at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:668)
        at java.lang.Thread.run(Thread.java:619)
Caused by: java.io.IOException: java.io.IOException: error=24, Too many open files
        at java.lang.UNIXProcess.<init>(UNIXProcess.java:148)
        at java.lang.ProcessImpl.start(ProcessImpl.java:65)
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:453)
        ... 11 more
  Initializing site shared directory:1  Stage in:1  Submitting:2  Submitted:1332
Active:245  Failed:331  Finished successfully:162 Failed but can retry:927
Progress:  Initializing site shared directory:1  Submitting:3  Submitted:1331
Active:245  Failed:331  Finished successfully:162 Failed but can retry:928


This causes jobs to fail. Here are the log file entries that I think are
relevant to the failure:

2010-07-28 14:20:07,829-0500 WARN  CondorExecutor Failed to cancel job 57026
java.io.IOException: Cannot run program "condor_rm": java.io.IOException: error=24, Too many open files
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:460)
        at java.lang.Runtime.exec(Runtime.java:593)
        at java.lang.Runtime.exec(Runtime.java:466)
        at org.globus.cog.abstraction.impl.scheduler.condor.CondorExecutor.cancel(CondorExecutor.java:275)
        at org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.cancel(AbstractJobSubmissionTaskHandler.java:85)
        at org.globus.cog.abstraction.impl.common.AbstractTaskHandler.cancel(AbstractTaskHandler.java:70)
        at org.globus.cog.karajan.scheduler.submitQueue.NonBlockingCancel.run(NonBlockingCancel.java:26)
        at edu.emory.mathcs.backport.java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:431)
        at edu.emory.mathcs.backport.java.util.concurrent.FutureTask.run(FutureTask.java:166)
        at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:643)
        at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:668)
        at java.lang.Thread.run(Thread.java:619)
Caused by: java.io.IOException: java.io.IOException: error=24, Too many open files
        at java.lang.UNIXProcess.<init>(UNIXProcess.java:148)
        at java.lang.ProcessImpl.start(ProcessImpl.java:65)
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:453)
        ... 11 more
2010-07-28 14:20:07,856-0500 WARN  CondorExecutor Failed to cancel job 57106
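
To check how close a JVM is to its per-process descriptor limit (error=24 is
EMFILE), something like the following could be run inside the affected process.
This is only a minimal sketch and not part of Swift or CoG: the class name
FdCheck and the helper usedFraction are made up for illustration, and it
assumes a Sun/OpenJDK-style JVM on Unix, where the platform MXBean implements
com.sun.management.UnixOperatingSystemMXBean.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

// Hypothetical helper, not part of Swift/CoG: reports descriptor usage of this JVM.
public class FdCheck {

    /** Fraction of the descriptor limit in use, or -1.0 if the limit is unknown. */
    public static double usedFraction(long open, long max) {
        if (max <= 0) {
            return -1.0;
        }
        return (double) open / (double) max;
    }

    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        // Assumption: only Unix JVMs with the com.sun.management extension expose these counters.
        if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
            com.sun.management.UnixOperatingSystemMXBean unix =
                    (com.sun.management.UnixOperatingSystemMXBean) os;
            long open = unix.getOpenFileDescriptorCount();
            long max = unix.getMaxFileDescriptorCount();
            System.out.printf("open=%d max=%d used=%.2f%n",
                    open, max, usedFraction(open, max));
        } else {
            System.out.println("descriptor counts are not available on this JVM");
        }
    }
}
```

If the open count sits near the maximum while many jobs are in the queue, that
would be consistent with the per-job log, stderr, and stdout files being the
cause; the limit itself is the one reported by "ulimit -n" in the shell that
started the JVM.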

-Allan



