[Swift-user] Looking for the cause of failure
Andriy Fedorov
fedorov at bwh.harvard.edu
Sat Jan 30 22:07:47 CST 2010
On Sat, Jan 30, 2010 at 19:27, <wilde at mcs.anl.gov> wrote:
> Andriy, I need to look at this in more detail. (Mihael is unavailable this week).
>
> But I'm wondering - since you are running Swift on an abe login host, consider changing:
>
> <execution provider="coaster" jobmanager="gt2:gt2:pbs"
> url="grid-abe.ncsa.teragrid.org"/>
>
> to:
>
> <execution provider="coaster" url="none" jobmanager="local:pbs"/>
>
Michael, thank you for the suggestion -- I will try!
> Also, on abe, don't you want to set workersPerNode to 8, since its nodes are 8-core hosts?
>
Yes, you are right!
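That is, changing the profile to something like this, so one worker
runs per core (a sketch based on my current entry):

  <profile namespace="globus" key="workersPerNode">8</profile>
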
> You may also want to set the max time of the coaster job (in seconds) to, for example:
>
> <profile namespace="globus" key="maxtime">7500</profile>
>
> Change "7500" to whatever makes sense for your application. This is somewhat manual, but I suggest setting the time to some value that ensure that the PBS jobs last long enough for your application jobs. That aspect may need further adjustment.
>
I am not sure about this one. The documentation says maxtime defines
the maximum walltime for a coaster block, and that it is unlimited by
default. It seems to me that setting this parameter could actually
create problems. Can you explain?
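
(For concreteness: if maxtime is in seconds, then 7500 would cap each
coaster block at 7500/60 = 125 minutes of walltime, if I read the
docs right.)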
> Lastly, instead of the gridftp tag you can use:
>
> <filesystem provider="local"/>
>
> But the gridftp tag you have is fine, and equivalent.
>
> - Mike
>
>
> ----- "Andriy Fedorov" <fedorov at bwh.harvard.edu> wrote:
>
>> Hi,
>>
>> I've been running a 1000-job Swift script with the coaster
>> provider. After 998 jobs finished successfully, I see a continuous
>> stream of messages:
>>
>> Progress: Submitted:1 Active:1 Finished successfully:998
>> ...
>>
>> At the same time, there are no jobs in the PBS queue. Looking at
>> ~/.globus/coasters/coasters.log, I found the following error messages
>> towards the end of the log:
>>
>> 2010-01-30 16:17:22,275-0600 INFO Block Block task status changed:
>> Failed The job manager could not stage out a file
>> 2010-01-30 16:17:22,275-0600 INFO Block Failed task spec: Job:
>> executable: /usr/bin/perl
>> arguments: /u/ac/fedorov/.globus/coasters/cscript28331.pl
>> http://141.142.68.180:54622 0130-580326-000001
>> /u/ac/fedorov/.globus/coasters
>> stdout: null
>> stderr: null
>> directory: null
>> batch: false
>> redirected: false
>> {hostcount=40, maxwalltime=24, count=40, jobtype=multiple}
>>
>> 2010-01-30 16:17:22,275-0600 WARN Block Worker task failed:
>> org.globus.gram.GramException: The job manager could not stage out a
>> file
>> at
>> org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.statusChanged(JobSubmissionTaskHandler.java:531)
>> at org.globus.gram.GramJob.setStatus(GramJob.java:184)
>> at
>> org.globus.gram.GramCallbackHandler.run(CallbackHandler.java:176)
>> at java.lang.Thread.run(Thread.java:595)
>>
>> And then a longer series of what looks like timeout messages:
>>
>> 2010-01-30 16:19:40,911-0600 WARN Command Command(3, SHUTDOWN):
>> handling reply timeout; sendReqTime=100130-161740.893,
>> sendTime=100130-161740.893, now=100130-161940.911
>> 2010-01-30 16:19:40,911-0600 WARN Command Command(3, SHUTDOWN)fault
>> was: Reply timeout
>> org.globus.cog.karajan.workflow.service.ReplyTimeoutException
>> at
>> org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:269)
>> at
>> org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:274)
>> at java.util.TimerThread.mainLoop(Timer.java:512)
>> at java.util.TimerThread.run(Timer.java:462)
>> 2010-01-30 16:19:40,911-0600 WARN Command Command(4, SHUTDOWN):
>> handling reply timeout; sendReqTime=100130-161740.893,
>> sendTime=100130-161740.893, now=100130-161940.911
>> 2010-01-30 16:19:40,911-0600 WARN Command Command(4, SHUTDOWN)fault
>> was: Reply timeout
>> org.globus.cog.karajan.workflow.service.ReplyTimeoutException
>> at
>> org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:269)
>> at
>> org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:274)
>> at java.util.TimerThread.mainLoop(Timer.java:512)
>> at java.util.TimerThread.run(Timer.java:462)
>>
>> Can anybody explain what happened? The same workflow ran earlier,
>> but with fewer workers per node (2).
>>
>> I am running this on Abe (Swift svn swift-r3202, cog-r2682); site
>> description:
>>
>> <pool handle="Abe-GT2-coasters">
>> <gridftp url="local://localhost" />
>> <execution provider="coaster" jobmanager="gt2:gt2:pbs"
>> url="grid-abe.ncsa.teragrid.org"/>
>> <workdirectory>/u/ac/fedorov/scratch-global/scratch</workdirectory>
>> <profile namespace="karajan" key="jobThrottle">2.55</profile>
>> <profile namespace="karajan" key="initialScore">10000</profile>
>> <profile namespace="globus" key="nodeGranularity">20</profile>
>> <profile namespace="globus"
>> key="remoteMonitorEnabled">false</profile>
>> <profile namespace="globus" key="parallelism">0.1</profile>
>> <profile namespace="globus" key="workersPerNode">4</profile>
>> <profile namespace="globus" key="highOverallocation">10</profile>
>> </pool>
>>
>> Thanks
>>
>> Andriy Fedorov
>> _______________________________________________
>> Swift-user mailing list
>> Swift-user at ci.uchicago.edu
>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user
>
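Putting your suggestions together, this is the revised pool entry I
plan to try (an untested sketch -- the pool handle is just a
placeholder, and I am leaving out maxtime until I understand it
better):

  <pool handle="Abe-local-coasters">
    <filesystem provider="local"/>
    <execution provider="coaster" url="none" jobmanager="local:pbs"/>
    <workdirectory>/u/ac/fedorov/scratch-global/scratch</workdirectory>
    <profile namespace="karajan" key="jobThrottle">2.55</profile>
    <profile namespace="karajan" key="initialScore">10000</profile>
    <profile namespace="globus" key="nodeGranularity">20</profile>
    <profile namespace="globus" key="remoteMonitorEnabled">false</profile>
    <profile namespace="globus" key="parallelism">0.1</profile>
    <profile namespace="globus" key="workersPerNode">8</profile>
    <profile namespace="globus" key="highOverallocation">10</profile>
  </pool>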