[Swift-user] Looking for the cause of failure
Mihael Hategan
hategan at mcs.anl.gov
Sat Jan 30 21:46:33 CST 2010
On Sat, 2010-01-30 at 17:36 -0500, Andriy Fedorov wrote:
> 2010-01-30 16:17:22,275-0600 INFO Block Block task status changed:
> Failed The job manager could not stage out a file
> 2010-01-30 16:17:22,275-0600 INFO Block Failed task spec: Job:
> executable: /usr/bin/perl
> arguments: /u/ac/fedorov/.globus/coasters/cscript28331.pl
> http://141.142.68.180:54622 0130-580326-000001
> /u/ac/fedorov/.globus/coasters
> stdout: null
> stderr: null
> directory: null
> batch: false
> redirected: false
> {hostcount=40, maxwalltime=24, count=40, jobtype=multiple}
>
> 2010-01-30 16:17:22,275-0600 WARN Block Worker task failed:
> org.globus.gram.GramException: The job manager could not stage out a file
> at org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.statusChanged(JobSubmissionTaskHandler.java:531)
> at org.globus.gram.GramJob.setStatus(GramJob.java:184)
> at org.globus.gram.GramCallbackHandler.run(CallbackHandler.java:176)
> at java.lang.Thread.run(Thread.java:595)
That in itself is not a failure condition, since the stage-out happens
after the worker job has completed.
>
> And then a longer series of what looks like timeout messages:
>
> 2010-01-30 16:19:40,911-0600 WARN Command Command(3, SHUTDOWN):
> handling reply timeout; sendReqTime=100130-161740.893,
> sendTime=100130-161740.893, now=100130-161940.911
> 2010-01-30 16:19:40,911-0600 WARN Command Command(3, SHUTDOWN)fault
> was: Reply timeout
> org.globus.cog.karajan.workflow.service.ReplyTimeoutException
> at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:269)
> at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:274)
> at java.util.TimerThread.mainLoop(Timer.java:512)
> at java.util.TimerThread.run(Timer.java:462)
That is an indication that the worker didn't respond to a shutdown
command, perhaps because it died previously.
In ~/.globus/coasters you will find a bunch of worker logs. If you can
identify the ones for your run (perhaps based on the timestamps on the
files), they may contain the reason for the failure.
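For example, something along these lines should narrow it down (treat
the file names as guesses; the exact naming varies with the coasters
version):

  # newest files in the coaster directory first
  ls -lt ~/.globus/coasters | head
  # then search the log(s) whose timestamps match the run;
  # <worker log> stands for whichever file(s) you identified
  grep -inE "error|fail|died" ~/.globus/coasters/<worker log>
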
> 2010-01-30 16:19:40,911-0600 WARN Command Command(4, SHUTDOWN):
> handling reply timeout; sendReqTime=100130-161740.893,
> sendTime=100130-161740.893, now=100130-161940.911
> 2010-01-30 16:19:40,911-0600 WARN Command Command(4, SHUTDOWN)fault
> was: Reply timeout
> org.globus.cog.karajan.workflow.service.ReplyTimeoutException
> at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:269)
> at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:274)
> at java.util.TimerThread.mainLoop(Timer.java:512)
> at java.util.TimerThread.run(Timer.java:462)
>
> Can anybody explain what happened? The same workflow ran earlier, but
> with fewer (2) workers per node.
Does it work if you set workers per node back to 2? If so, that would
suggest that the workers-per-node setting is causing the problem, and
that's a stronger statement than "it doesn't work right now".