[Swift-user] Looking for the cause of failure

wilde at mcs.anl.gov
Sat Jan 30 18:27:28 CST 2010


Andriy, I need to look at this in more detail. (Mihael is unavailable this week).

But I'm wondering - since you are running Swift on an Abe login host, consider changing:

  <execution provider="coaster" jobmanager="gt2:gt2:pbs"
             url="grid-abe.ncsa.teragrid.org"/>

to:

  <execution provider="coaster" url="none" jobmanager="local:pbs"/>

Also, on Abe, don't you want to set workersPerNode to 8, as its nodes are 8-core hosts?
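For example, something like this (assuming your app jobs are single-threaded, one per core):

  <profile namespace="globus" key="workersPerNode">8</profile>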

You may also want to set the max time of the coaster job (in seconds) to, for example:

  <profile namespace="globus" key="maxtime">7500</profile>

Change "7500" to whatever makes sense for your application. This is somewhat manual, but I suggest setting the time to some value that ensure that the PBS jobs last long enough for your application jobs. That aspect may need  further adjustment.

Lastly, instead of the gridftp tag you can use:

  <filesystem provider="local"/>

But the gridftp tag you have is fine, and equivalent.
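
Putting the above together, a rough sketch of what the pool entry might look like (untested; the handle name is arbitrary, the numbers are just the examples from above and from your own entry, and your remaining profile settings can stay as they are):

  <pool handle="Abe-local-coasters">  <!-- handle name is arbitrary -->
    <filesystem provider="local"/>
    <execution provider="coaster" url="none" jobmanager="local:pbs"/>
    <workdirectory>/u/ac/fedorov/scratch-global/scratch</workdirectory>
    <profile namespace="karajan" key="jobThrottle">2.55</profile>
    <profile namespace="karajan" key="initialScore">10000</profile>
    <profile namespace="globus" key="workersPerNode">8</profile>
    <profile namespace="globus" key="maxtime">7500</profile>
  </pool>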

- Mike


----- "Andriy Fedorov" <fedorov at bwh.harvard.edu> wrote:

> Hi,
> 
> I've been running a 1000-job swift script with the coaster provider.
> After successfully executing 998 jobs, I see a continuous stream of
> messages
> 
> Progress:  Submitted:1  Active:1  Finished successfully:998
> ...
> 
> At the same time, there are no jobs in the PBS queue. Looking at
> ~/.globus/coasters/coasters.log, I found the following error messages
> towards the end of the log:
> 
> 2010-01-30 16:17:22,275-0600 INFO  Block Block task status changed:
> Failed The job manager could not stage out a file
> 2010-01-30 16:17:22,275-0600 INFO  Block Failed task spec: Job:
>         executable: /usr/bin/perl
>         arguments:  /u/ac/fedorov/.globus/coasters/cscript28331.pl
> http://141.142.68.180:54622 0130-580326-000001
> /u/ac/fedorov/.globus/coasters
>         stdout:     null
>         stderr:     null
>         directory:  null
>         batch:      false
>         redirected: false
>         {hostcount=40, maxwalltime=24, count=40, jobtype=multiple}
> 
> 2010-01-30 16:17:22,275-0600 WARN  Block Worker task failed:
> org.globus.gram.GramException: The job manager could not stage out a
> file
>         at
> org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.statusChanged(JobSubmissionTaskHandler.java:531)
>         at org.globus.gram.GramJob.setStatus(GramJob.java:184)
>         at
> org.globus.gram.GramCallbackHandler.run(CallbackHandler.java:176)
>         at java.lang.Thread.run(Thread.java:595)
> 
> And then a longer series of what looks like timeout messages:
> 
> 2010-01-30 16:19:40,911-0600 WARN  Command Command(3, SHUTDOWN):
> handling reply timeout; sendReqTime=100130-161740.893,
> sendTime=100130-161740.893, now=100130-161940.911
> 2010-01-30 16:19:40,911-0600 WARN  Command Command(3, SHUTDOWN)fault
> was: Reply timeout
> org.globus.cog.karajan.workflow.service.ReplyTimeoutException
>         at
> org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:269)
>         at
> org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:274)
>         at java.util.TimerThread.mainLoop(Timer.java:512)
>         at java.util.TimerThread.run(Timer.java:462)
> 2010-01-30 16:19:40,911-0600 WARN  Command Command(4, SHUTDOWN):
> handling reply timeout; sendReqTime=100130-161740.893,
> sendTime=100130-161740.893, now=100130-161940.911
> 2010-01-30 16:19:40,911-0600 WARN  Command Command(4, SHUTDOWN)fault
> was: Reply timeout
> org.globus.cog.karajan.workflow.service.ReplyTimeoutException
>         at
> org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:269)
>         at
> org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:274)
>         at java.util.TimerThread.mainLoop(Timer.java:512)
>         at java.util.TimerThread.run(Timer.java:462)
> 
> Can anybody explain what happened? The same workflow ran earlier, but
> with fewer (2) workers per node.
> 
> I am running this on Abe, Swift svn swift-r3202 cog-r2682, site
> description:
> 
> <pool handle="Abe-GT2-coasters">
>   <gridftp  url="local://localhost" />
>   <execution provider="coaster" jobmanager="gt2:gt2:pbs"
>   url="grid-abe.ncsa.teragrid.org"/>
>   <workdirectory>/u/ac/fedorov/scratch-global/scratch</workdirectory>
>   <profile namespace="karajan" key="jobThrottle">2.55</profile>
>   <profile namespace="karajan" key="initialScore">10000</profile>
>   <profile namespace="globus" key="nodeGranularity">20</profile>
>   <profile namespace="globus"
> key="remoteMonitorEnabled">false</profile>
>   <profile namespace="globus" key="parallelism">0.1</profile>
>   <profile namespace="globus" key="workersPerNode">4</profile>
>   <profile namespace="globus" key="highOverallocation">10</profile>
> </pool>
> 
> Thanks
> 
> Andriy Fedorov
> _______________________________________________
> Swift-user mailing list
> Swift-user at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user
