[Swift-user] Looking for the cause of failure

Andriy Fedorov fedorov at bwh.harvard.edu
Sat Jan 30 16:36:52 CST 2010


Hi,

I've been running a 1000-job Swift script with the coaster provider. After
998 jobs completed successfully, I see a continuous stream of messages:
Progress:  Submitted:1  Active:1  Finished successfully:998
...

At the same time, there are no jobs in the PBS queue. Looking at
~/.globus/coasters/coasters.log, I found the following error messages
toward the end of the log:

2010-01-30 16:17:22,275-0600 INFO  Block Block task status changed:
Failed The job manager could not stage out a file
2010-01-30 16:17:22,275-0600 INFO  Block Failed task spec: Job:
        executable: /usr/bin/perl
        arguments:  /u/ac/fedorov/.globus/coasters/cscript28331.pl
http://141.142.68.180:54622 0130-580326-000001
/u/ac/fedorov/.globus/coasters
        stdout:     null
        stderr:     null
        directory:  null
        batch:      false
        redirected: false
        {hostcount=40, maxwalltime=24, count=40, jobtype=multiple}

2010-01-30 16:17:22,275-0600 WARN  Block Worker task failed:
org.globus.gram.GramException: The job manager could not stage out a file
        at org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.statusChanged(JobSubmissionTaskHandler.java:531)
        at org.globus.gram.GramJob.setStatus(GramJob.java:184)
        at org.globus.gram.GramCallbackHandler.run(CallbackHandler.java:176)
        at java.lang.Thread.run(Thread.java:595)

This is followed by a longer series of what look like timeout messages:

2010-01-30 16:19:40,911-0600 WARN  Command Command(3, SHUTDOWN):
handling reply timeout; sendReqTime=100130-161740.893,
sendTime=100130-161740.893, now=100130-161940.911
2010-01-30 16:19:40,911-0600 WARN  Command Command(3, SHUTDOWN)fault
was: Reply timeout
org.globus.cog.karajan.workflow.service.ReplyTimeoutException
        at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:269)
        at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:274)
        at java.util.TimerThread.mainLoop(Timer.java:512)
        at java.util.TimerThread.run(Timer.java:462)
2010-01-30 16:19:40,911-0600 WARN  Command Command(4, SHUTDOWN):
handling reply timeout; sendReqTime=100130-161740.893,
sendTime=100130-161740.893, now=100130-161940.911
2010-01-30 16:19:40,911-0600 WARN  Command Command(4, SHUTDOWN)fault
was: Reply timeout
org.globus.cog.karajan.workflow.service.ReplyTimeoutException
        at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:269)
        at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:274)
        at java.util.TimerThread.mainLoop(Timer.java:512)
        at java.util.TimerThread.run(Timer.java:462)

Can anybody explain what happened? The same workflow ran earlier, but
with fewer (2) workers per node; see the profile snippet after the site
description below.

I am running this on Abe with Swift svn swift-r3202, cog-r2682. Site description:

<pool handle="Abe-GT2-coasters">
  <gridftp url="local://localhost"/>
  <execution provider="coaster" jobmanager="gt2:gt2:pbs"
             url="grid-abe.ncsa.teragrid.org"/>
  <workdirectory>/u/ac/fedorov/scratch-global/scratch</workdirectory>
  <profile namespace="karajan" key="jobThrottle">2.55</profile>
  <profile namespace="karajan" key="initialScore">10000</profile>
  <profile namespace="globus" key="nodeGranularity">20</profile>
  <profile namespace="globus" key="remoteMonitorEnabled">false</profile>
  <profile namespace="globus" key="parallelism">0.1</profile>
  <profile namespace="globus" key="workersPerNode">4</profile>
  <profile namespace="globus" key="highOverallocation">10</profile>
</pool>
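
As far as I can tell, the only difference from the earlier run is the
workersPerNode profile, which was previously:

  <profile namespace="globus" key="workersPerNode">2</profile>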

Thanks

Andriy Fedorov
