[Swift-user] Looking for the cause of failure
Andriy Fedorov
fedorov at bwh.harvard.edu
Sat Jan 30 16:36:52 CST 2010
Hi,
I've been running a 1000-job Swift script with the coaster provider. After
successfully executing 998 jobs, I see a continuous stream of messages:
Progress: Submitted:1 Active:1 Finished successfully:998
...
At the same time, there are no jobs in the PBS queue. Looking at
~/.globus/coasters/coasters.log, I found the following error messages
towards the end of the log:
2010-01-30 16:17:22,275-0600 INFO Block Block task status changed:
Failed The job manager could not stage out a file
2010-01-30 16:17:22,275-0600 INFO Block Failed task spec: Job:
executable: /usr/bin/perl
arguments: /u/ac/fedorov/.globus/coasters/cscript28331.pl
http://141.142.68.180:54622 0130-580326-000001
/u/ac/fedorov/.globus/coasters
stdout: null
stderr: null
directory: null
batch: false
redirected: false
{hostcount=40, maxwalltime=24, count=40, jobtype=multiple}
2010-01-30 16:17:22,275-0600 WARN Block Worker task failed:
org.globus.gram.GramException: The job manager could not stage out a file
at org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.statusChanged(JobSubmissionTaskHandler.java:531)
at org.globus.gram.GramJob.setStatus(GramJob.java:184)
at org.globus.gram.GramCallbackHandler.run(CallbackHandler.java:176)
at java.lang.Thread.run(Thread.java:595)
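(If I am reading the failed task spec correctly, the task that failed is the
coaster worker bootstrap itself, i.e. the cscript*.pl wrapper submitted
through GT2 GRAM as a 40-node "multiple" job, rather than one of my
application jobs; I may be misreading it.)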
And then there is a longer series of what look like timeout messages:
2010-01-30 16:19:40,911-0600 WARN Command Command(3, SHUTDOWN):
handling reply timeout; sendReqTime=100130-161740.893,
sendTime=100130-161740.893, now=100130-161940.911
2010-01-30 16:19:40,911-0600 WARN Command Command(3, SHUTDOWN)fault
was: Reply timeout
org.globus.cog.karajan.workflow.service.ReplyTimeoutException
at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:269)
at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:274)
at java.util.TimerThread.mainLoop(Timer.java:512)
at java.util.TimerThread.run(Timer.java:462)
2010-01-30 16:19:40,911-0600 WARN Command Command(4, SHUTDOWN):
handling reply timeout; sendReqTime=100130-161740.893,
sendTime=100130-161740.893, now=100130-161940.911
2010-01-30 16:19:40,911-0600 WARN Command Command(4, SHUTDOWN)fault
was: Reply timeout
org.globus.cog.karajan.workflow.service.ReplyTimeoutException
at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:269)
at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:274)
at java.util.TimerThread.mainLoop(Timer.java:512)
at java.util.TimerThread.run(Timer.java:462)
Can anybody explain what happened? The same workflow ran earlier, but with
fewer workers per node (2 instead of 4).
I am running this on Abe, with Swift svn swift-r3202 cog-r2682; the site
description is:
<pool handle="Abe-GT2-coasters">
  <gridftp url="local://localhost" />
  <execution provider="coaster" jobmanager="gt2:gt2:pbs"
             url="grid-abe.ncsa.teragrid.org" />
  <workdirectory>/u/ac/fedorov/scratch-global/scratch</workdirectory>
  <profile namespace="karajan" key="jobThrottle">2.55</profile>
  <profile namespace="karajan" key="initialScore">10000</profile>
  <profile namespace="globus" key="nodeGranularity">20</profile>
  <profile namespace="globus" key="remoteMonitorEnabled">false</profile>
  <profile namespace="globus" key="parallelism">0.1</profile>
  <profile namespace="globus" key="workersPerNode">4</profile>
  <profile namespace="globus" key="highOverallocation">10</profile>
</pool>
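For what it's worth, if I understand the karajan throttling rules correctly
(my assumption, I may have the formula wrong), the limits implied by these
settings are roughly:

  jobThrottle 2.55                       ->  2.55 * 100 + 1 = 256 concurrent jobs
  nodeGranularity 20 * workersPerNode 4  ->  80 worker slots per 20-node block

so with only 2 of the 1000 jobs remaining, I would not expect any throttle
to be the problem here.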
Thanks
Andriy Fedorov