[Swift-user] Looking for the cause of failure
Mihael Hategan
hategan at mcs.anl.gov
Sat Jan 30 21:46:33 CST 2010
On Sat, 2010-01-30 at 17:36 -0500, Andriy Fedorov wrote:
> 2010-01-30 16:17:22,275-0600 INFO Block Block task status changed:
> Failed The job manager could not stage out a file
> 2010-01-30 16:17:22,275-0600 INFO Block Failed task spec: Job:
> executable: /usr/bin/perl
> arguments: /u/ac/fedorov/.globus/coasters/cscript28331.pl
> http://141.142.68.180:54622 0130-580326-000001
> /u/ac/fedorov/.globus/coasters
> stdout: null
> stderr: null
> directory: null
> batch: false
> redirected: false
> {hostcount=40, maxwalltime=24, count=40, jobtype=multiple}
>
> 2010-01-30 16:17:22,275-0600 WARN Block Worker task failed:
> org.globus.gram.GramException: The job manager could not stage out a file
> at org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.statusChanged(JobSubmissionTaskHandler.java:531)
> at org.globus.gram.GramJob.setStatus(GramJob.java:184)
> at org.globus.gram.GramCallbackHandler.run(CallbackHandler.java:176)
> at java.lang.Thread.run(Thread.java:595)
That in itself is not a failure condition, since the stage-out happens
after the worker job has completed.
>
> And then a longer series of what looks like timeout messages:
>
> 2010-01-30 16:19:40,911-0600 WARN Command Command(3, SHUTDOWN):
> handling reply timeout; sendReqTime=100130-161740.893,
> sendTime=100130-161740.893, now=100130-161940.911
> 2010-01-30 16:19:40,911-0600 WARN Command Command(3, SHUTDOWN)fault
> was: Reply timeout
> org.globus.cog.karajan.workflow.service.ReplyTimeoutException
> at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:269)
> at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:274)
> at java.util.TimerThread.mainLoop(Timer.java:512)
> at java.util.TimerThread.run(Timer.java:462)
That is an indication that the worker didn't respond to a shutdown
command, perhaps because it died previously.
In ~/.globus/coasters you will find a bunch of worker logs. If you can
identify the ones for your run (perhaps based on the timestamps on the
files), they may contain the reason for the failure.
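For example, something along these lines should narrow it down (treat
the file names as guesses; the exact naming varies with the coasters
version):

  # newest files in the coaster directory first
  ls -lt ~/.globus/coasters | head
  # then search the log(s) whose timestamps match the run;
  # <worker log> stands for whichever file(s) you identified
  grep -inE "error|fail|died" ~/.globus/coasters/<worker log>
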
> 2010-01-30 16:19:40,911-0600 WARN Command Command(4, SHUTDOWN):
> handling reply timeout; sendReqTime=100130-161740.893,
> sendTime=100130-161740.893, now=100130-161940.911
> 2010-01-30 16:19:40,911-0600 WARN Command Command(4, SHUTDOWN)fault
> was: Reply timeout
> org.globus.cog.karajan.workflow.service.ReplyTimeoutException
> at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:269)
> at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:274)
> at java.util.TimerThread.mainLoop(Timer.java:512)
> at java.util.TimerThread.run(Timer.java:462)
>
> Can anybody explain what happened? The same workflow ran earlier, but
> with fewer (2) workers per node.
Does it work if you set workers per node back to 2? If so, that would
suggest that the workers-per-node setting is causing the problem, and
that's a stronger statement than "it doesn't work right now".