[Swift-user] Looking for the cause of failure
Andriy Fedorov
fedorov at bwh.harvard.edu
Sat Jan 30 22:07:47 CST 2010
On Sat, Jan 30, 2010 at 19:27, <wilde at mcs.anl.gov> wrote:
> Andriy, I need to look at this in more detail. (Mihael is unavailable this week).
>
> But I'm wondering - since you are running Swift on an abe login host, consider changing:
>
> <execution provider="coaster" jobmanager="gt2:gt2:pbs"
> url="grid-abe.ncsa.teragrid.org"/>
>
> to:
>
> <execution provider="coaster" url="none" jobmanager="local:pbs"/>
>
Michael, thank you for the suggestion -- I will try!
> Also, on abe, don't you want to set workersPerNode to 8, since its nodes are 8-core hosts?
>
Yes, you are right!
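That is, changing the profile to something like this, so one worker
runs per core (a sketch based on my current entry):

  <profile namespace="globus" key="workersPerNode">8</profile>
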
> You may also want to set the max time of the coaster job (in seconds) to, for example:
>
> <profile namespace="globus" key="maxtime">7500</profile>
>
> Change "7500" to whatever makes sense for your application. This is somewhat manual, but I suggest setting the time to some value that ensure that the PBS jobs last long enough for your application jobs. That aspect may need further adjustment.
>
I am not sure about this one. The documentation says maxtime defines
the maximum walltime for a coaster block, and that it is unlimited by
default. It seems to me that setting this parameter could actually
create problems. Can you explain?
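
(For concreteness: if maxtime is in seconds, then 7500 would cap each
coaster block at 7500/60 = 125 minutes of walltime, if I read the
docs right.)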
> Lastly, instead of the gridftp tag you can use:
>
> <filesystem provider="local"/>
>
> But the gridftp tag you have is fine, and equivalent.
>
> - Mike
>
>
> ----- "Andriy Fedorov" <fedorov at bwh.harvard.edu> wrote:
>
>> Hi,
>>
>> I've been running a 1000-job Swift script with the coaster
>> provider. After 998 jobs finished successfully, I see a continuous
>> stream of messages:
>>
>> Progress: Submitted:1 Active:1 Finished successfully:998
>> ...
>>
>> At the same time, there are no jobs in the PBS queue. Looking at
>> ~/.globus/coasters/coasters.log, I found the following error messages
>> towards the end of the log:
>>
>> 2010-01-30 16:17:22,275-0600 INFO Block Block task status changed:
>> Failed The job manager could not stage out a file
>> 2010-01-30 16:17:22,275-0600 INFO Block Failed task spec: Job:
>> executable: /usr/bin/perl
>> arguments: /u/ac/fedorov/.globus/coasters/cscript28331.pl
>> http://141.142.68.180:54622 0130-580326-000001
>> /u/ac/fedorov/.globus/coasters
>> stdout: null
>> stderr: null
>> directory: null
>> batch: false
>> redirected: false
>> {hostcount=40, maxwalltime=24, count=40, jobtype=multiple}
>>
>> 2010-01-30 16:17:22,275-0600 WARN Block Worker task failed:
>> org.globus.gram.GramException: The job manager could not stage out a
>> file
>> at
>> org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.statusChanged(JobSubmissionTaskHandler.java:531)
>> at org.globus.gram.GramJob.setStatus(GramJob.java:184)
>> at
>> org.globus.gram.GramCallbackHandler.run(CallbackHandler.java:176)
>> at java.lang.Thread.run(Thread.java:595)
>>
>> And then a longer series of what looks like timeout messages:
>>
>> 2010-01-30 16:19:40,911-0600 WARN Command Command(3, SHUTDOWN):
>> handling reply timeout; sendReqTime=100130-161740.893,
>> sendTime=100130-161740.893, now=100130-161940.911
>> 2010-01-30 16:19:40,911-0600 WARN Command Command(3, SHUTDOWN)fault
>> was: Reply timeout
>> org.globus.cog.karajan.workflow.service.ReplyTimeoutException
>> at
>> org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:269)
>> at
>> org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:274)
>> at java.util.TimerThread.mainLoop(Timer.java:512)
>> at java.util.TimerThread.run(Timer.java:462)
>> 2010-01-30 16:19:40,911-0600 WARN Command Command(4, SHUTDOWN):
>> handling reply timeout; sendReqTime=100130-161740.893,
>> sendTime=100130-161740.893, now=100130-161940.911
>> 2010-01-30 16:19:40,911-0600 WARN Command Command(4, SHUTDOWN)fault
>> was: Reply timeout
>> org.globus.cog.karajan.workflow.service.ReplyTimeoutException
>> at
>> org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:269)
>> at
>> org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:274)
>> at java.util.TimerThread.mainLoop(Timer.java:512)
>> at java.util.TimerThread.run(Timer.java:462)
>>
>> Can anybody explain what happened? The same workflow ran earlier,
>> but with fewer workers per node (2).
>>
>> I am running this on Abe (Swift svn swift-r3202, cog-r2682); site
>> description:
>>
>> <pool handle="Abe-GT2-coasters">
>> <gridftp url="local://localhost" />
>> <execution provider="coaster" jobmanager="gt2:gt2:pbs"
>> url="grid-abe.ncsa.teragrid.org"/>
>> <workdirectory>/u/ac/fedorov/scratch-global/scratch</workdirectory>
>> <profile namespace="karajan" key="jobThrottle">2.55</profile>
>> <profile namespace="karajan" key="initialScore">10000</profile>
>> <profile namespace="globus" key="nodeGranularity">20</profile>
>> <profile namespace="globus"
>> key="remoteMonitorEnabled">false</profile>
>> <profile namespace="globus" key="parallelism">0.1</profile>
>> <profile namespace="globus" key="workersPerNode">4</profile>
>> <profile namespace="globus" key="highOverallocation">10</profile>
>> </pool>
>>
>> Thanks
>>
>> Andriy Fedorov
>> _______________________________________________
>> Swift-user mailing list
>> Swift-user at ci.uchicago.edu
>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user
>
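Putting your suggestions together, this is the revised pool entry I
plan to try (an untested sketch -- the pool handle is just a
placeholder, and I am leaving out maxtime until I understand it
better):

  <pool handle="Abe-local-coasters">
    <filesystem provider="local"/>
    <execution provider="coaster" url="none" jobmanager="local:pbs"/>
    <workdirectory>/u/ac/fedorov/scratch-global/scratch</workdirectory>
    <profile namespace="karajan" key="jobThrottle">2.55</profile>
    <profile namespace="karajan" key="initialScore">10000</profile>
    <profile namespace="globus" key="nodeGranularity">20</profile>
    <profile namespace="globus" key="remoteMonitorEnabled">false</profile>
    <profile namespace="globus" key="parallelism">0.1</profile>
    <profile namespace="globus" key="workersPerNode">8</profile>
    <profile namespace="globus" key="highOverallocation">10</profile>
  </pool>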