[Swift-devel] localscheduler (condor/ condorg) breaking on lots of condor jobs

Mihael Hategan hategan at mcs.anl.gov
Wed Jul 28 14:48:36 CDT 2010


Yeah. That's why the provider should be updated to use the Condor job logs
instead of polling condor_q/condor_qedit to figure out job status.
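
A minimal sketch of the job-log approach, assuming the classic text format
of the Condor user log (the file named by "log" in the submit description),
where each event starts with a three-digit event code followed by a
(cluster.proc.subproc) id. The class name, the status strings, and the
mapping from event codes to statuses are illustrative assumptions, not the
provider's actual code:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CondorLogStatus {
    // Event header: "NNN (cluster.proc.subproc) date time text".
    private static final Pattern EVENT =
        Pattern.compile("^(\\d{3}) \\((\\d+)\\.(\\d+)\\.\\d+\\) ");

    /** Coarse status for an event code, or null if the event is ignored here. */
    public static String statusFor(int code) {
        switch (code) {
            case 0:  return "SUBMITTED"; // job submitted
            case 1:  return "ACTIVE";    // job executing
            case 4:  return "SUBMITTED"; // evicted: assumed to be back in the queue
            case 5:  return "COMPLETED"; // job terminated
            case 9:  return "CANCELED";  // job aborted
            case 12: return "HELD";      // job held
            case 13: return "SUBMITTED"; // released from hold
            default: return null;        // other events do not change status here
        }
    }

    /** Returns the most recent status seen for each "cluster.proc" job id. */
    public static Map<String, String> latestStatus(List<String> logLines) {
        Map<String, String> status = new LinkedHashMap<>();
        for (String line : logLines) {
            Matcher m = EVENT.matcher(line);
            if (!m.find()) {
                continue; // detail lines and "..." separators
            }
            String s = statusFor(Integer.parseInt(m.group(1)));
            if (s != null) {
                // parseInt drops the zero padding used in the log ids.
                String jobId = Integer.parseInt(m.group(2)) + "."
                             + Integer.parseInt(m.group(3));
                status.put(jobId, s);
            }
        }
        return status;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java CondorLogStatus <user-log-file>");
            return;
        }
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        latestStatus(lines).forEach((id, s) -> System.out.println(id + " " + s));
    }
}
```

Reading a log file this way needs no child process per status check, which
is the point of the suggestion. A real provider would presumably read the
file incrementally rather than re-parse it on every poll.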

That, or raise the limits (and, btw, what does ulimit -a say on that
machine?).
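
To compare against the ulimit output, the JVM can also report its own
descriptor usage. A small sketch, assuming a HotSpot/OpenJDK JVM on a
Unix-like host where com.sun.management.UnixOperatingSystemMXBean is
available; the class name FdUsage is made up for illustration:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

final class FdUsage {
    /**
     * Returns {open, max} file descriptor counts for this JVM process,
     * or null when the platform bean does not expose them.
     */
    public static long[] descriptorCounts() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
            com.sun.management.UnixOperatingSystemMXBean unix =
                (com.sun.management.UnixOperatingSystemMXBean) os;
            return new long[] {
                unix.getOpenFileDescriptorCount(),
                unix.getMaxFileDescriptorCount()
            };
        }
        return null; // not a Unix-like JVM, or the bean is unavailable
    }

    public static void main(String[] args) {
        long[] counts = descriptorCounts();
        if (counts == null) {
            System.out.println("descriptor counts not available on this JVM");
        } else {
            System.out.println("open=" + counts[0] + " max=" + counts[1]);
        }
    }
}
```

Logging the open count shortly before a submit or cancel call would show how
close the process is to the limit when error=24 appears.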

On Wed, 2010-07-28 at 14:34 -0500, Allan Espinosa wrote:
> Hi,
> 
> it seems that when there are too many submitted condor jobs, the submit host
> starts to complain because it opens too many log, stderr, and stdout files:
> 
> 330  Finished successfully:162 Failed but can retry:927
> Failed to transfer wrapper log from sleep-LGU-estimate/info/x on USCMS-FNAL-WC1
> Progress:  Initializing site shared directory:1  Stage in:2  Submitted:1332
> Active:245  Failed:331  Finished successfully:162 Failed but can retry:928
> Progress:Failed to cancel job 57445
> java.io.IOException: Cannot run program "condor_qedit": java.io.IOException:
> error=24, Too many open files
>         at java.lang.ProcessBuilder.start(ProcessBuilder.java:460)
>         at java.lang.Runtime.exec(Runtime.java:593)
>         at java.lang.Runtime.exec(Runtime.java:466)
>         at
> org.globus.cog.abstraction.impl.scheduler.condor.CondorExecutor.cancel(CondorExecutor.java:254)
>         at
> org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.cancel(AbstractJobSubmissionTaskHandler.java:85)
>         at
> org.globus.cog.abstraction.impl.common.AbstractTaskHandler.cancel(AbstractTaskHandler.java:70)
>         at
> org.globus.cog.karajan.scheduler.submitQueue.NonBlockingCancel.run(NonBlockingCancel.java:26)
>         at
> edu.emory.mathcs.backport.java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:431)
>         at
> edu.emory.mathcs.backport.java.util.concurrent.FutureTask.run(FutureTask.java:166)
>         at
> edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:643)
>         at
> edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:668)
>         at java.lang.Thread.run(Thread.java:619)
> Caused by: java.io.IOException: java.io.IOException: error=24, Too many open
> files
>         at java.lang.UNIXProcess.<init>(UNIXProcess.java:148)
>         at java.lang.ProcessImpl.start(ProcessImpl.java:65)
>         at java.lang.ProcessBuilder.start(ProcessBuilder.java:453)
>         ... 11 more
>   Initializing site shared directory:1  Stage in:1  Submitting:2  Submitted:1332
> Active:245  Failed:331  Finished successfully:162 Failed but can retry:927
> Progress:  Initializing site shared directory:1  Submitting:3  Submitted:1331
> Active:245  Failed:331  Finished successfully:162 Failed but can retry:928
> 
> 
> This causes jobs to fail. Here are the logfile entries that I think are
> relevant to the failure:
> 
> 2010-07-28 14:20:07,829-0500 WARN  CondorExecutor Failed to cancel job 57026
> java.io.IOException: Cannot run program "condor_rm": java.io.IOException:
> error=24, Too many open files
>         at java.lang.ProcessBuilder.start(ProcessBuilder.java:460)
>         at java.lang.Runtime.exec(Runtime.java:593)
>         at java.lang.Runtime.exec(Runtime.java:466)
>         at
> org.globus.cog.abstraction.impl.scheduler.condor.CondorExecutor.cancel(CondorExecutor.java:275)
>         at
> org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.cancel(AbstractJobSubmissionTaskHandler.java:85)
>         at
> org.globus.cog.abstraction.impl.common.AbstractTaskHandler.cancel(AbstractTaskHandler.java:70)
>         at
> org.globus.cog.karajan.scheduler.submitQueue.NonBlockingCancel.run(NonBlockingCancel.java:26)
>         at
> edu.emory.mathcs.backport.java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:431)
>         at
> edu.emory.mathcs.backport.java.util.concurrent.FutureTask.run(FutureTask.java:166)
>         at
> edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:643)
>         at
> edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:668)
>         at java.lang.Thread.run(Thread.java:619)
> Caused by: java.io.IOException: java.io.IOException: error=24, Too many open
> files
>         at java.lang.UNIXProcess.<init>(UNIXProcess.java:148)
>         at java.lang.ProcessImpl.start(ProcessImpl.java:65)
>         at java.lang.ProcessBuilder.start(ProcessBuilder.java:453)
>         ... 11 more
> 2010-07-28 14:20:07,856-0500 WARN  CondorExecutor Failed to cancel job 57106
> 
> -Allan
> 
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel




