[Swift-devel] Persistent coasters on OSG Swift not getting started cores

Ketan Maheshwari ketancmaheshwari at gmail.com
Fri Sep 9 12:03:07 CDT 2011


In addition, I see the following timeout messages starting about two hours
into the workflow run:

org.globus.cog.karajan.workflow.service.ReplyTimeoutException
        at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:288)
        at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:293)
        at java.util.TimerThread.mainLoop(Timer.java:512)
        at java.util.TimerThread.run(Timer.java:462)
Command(2883, HEARTBEAT): handling reply timeout; sendReqTime=110909-115956.899, sendTime=110909-115956.899, now=110909-120156.901
Command(2883, HEARTBEAT)fault was: Reply timeout
org.globus.cog.karajan.workflow.service.ReplyTimeoutException
        at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:288)
        at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:293)
        at java.util.TimerThread.mainLoop(Timer.java:512)
        at java.util.TimerThread.run(Timer.java:462)
Command(2887, HEARTBEAT): handling reply timeout; sendReqTime=110909-120003.463, sendTime=110909-120003.463, now=110909-120203.468
Command(2887, HEARTBEAT)fault was: Reply timeout
org.globus.cog.karajan.workflow.service.ReplyTimeoutException
        at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:288)
        at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:293)
        at java.util.TimerThread.mainLoop(Timer.java:512)
        at java.util.TimerThread.run(Timer.java:462)
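
For reference, the pattern in this trace is a client-side reply timeout: a
timer fires when a command (here a HEARTBEAT) receives no reply within about
two minutes (sendTime and now in the log are ~120 s apart). Below is a
minimal, hypothetical Java sketch of that pattern; the class and method
names are placeholders, not the actual CoG/Karajan implementation.

    import java.util.Timer;
    import java.util.TimerTask;

    // Hypothetical sketch of a reply-timeout mechanism like the one in the
    // trace above: a daemon Timer fires if no reply has arrived in time.
    public class ReplyTimeoutSketch {
        static final long REPLY_TIMEOUT_MS = 120_000; // ~2 min, matching the log timestamps
        static final Timer TIMER = new Timer(true);   // daemon timer thread

        static class Command {
            private final String name;
            private volatile boolean replied = false;

            Command(String name) { this.name = name; }

            void send() {
                // ...the request would be written to the channel here...
                TIMER.schedule(new TimerTask() {
                    public void run() {
                        if (!replied) {
                            handleReplyTimeout();
                        }
                    }
                }, REPLY_TIMEOUT_MS);
            }

            void replyReceived() { replied = true; }

            void handleReplyTimeout() {
                System.err.println("Command(" + name + "): handling reply timeout");
            }
        }

        public static void main(String[] args) throws InterruptedException {
            Command hb = new Command("HEARTBEAT");
            hb.send();                              // no reply ever arrives in this sketch
            Thread.sleep(REPLY_TIMEOUT_MS + 1_000); // keep the JVM alive until the timeout fires
        }
    }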


Regards,
Ketan


On Fri, Sep 9, 2011 at 11:52 AM, Ketan Maheshwari <ketancmaheshwari at gmail.com> wrote:

> Hi Mihael, All,
>
> I am trying to run the DSSAT workflow, a simple one-process, catsn-like
> loop.
>
> The setup on OSG is based on persistent coasters, with the following elements:
>
> 1. A coaster service is started on the head node.
> 2. Workers are started on OSG sites. I am using 11 OSG sites.
> 3. The workers are submitted as Condor jobs that connect back to the
> service running on the head node (a sketch of such a submit description
> follows this list).
> 4. In the current instance, 500 workers have been submitted, of which 280
> are in the running state as of now.
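>
> For illustration, such a worker submit description might look roughly like
> this (the universe, service URL, paths, and worker.pl arguments below are
> placeholders, not copied from the actual run, and the exact worker.pl
> arguments vary by Swift version):
>
>     universe    = vanilla
>     executable  = worker.pl
>     arguments   = http://HEADNODE:PORT 0 /tmp/worker-logs
>     output      = worker.$(Process).out
>     error       = worker.$(Process).err
>     log         = condor.log
>     queue 500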
>
> My throttles (jobThrottle and the foreach throttle) are set to allow 500
> tasks at a time; a sketch of the relevant settings follows.
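>
> The corresponding settings are roughly as sketched here (illustrative
> values; the actual ones are in cf.ps and sites.grid-ps.xml in the tarball):
>
>     # Swift configuration (cf.ps): cap the foreach loop at 500 parallel iterations
>     foreach.max.threads=500
>
>     <!-- sites file: a jobThrottle of 4.99 permits about 500 concurrent jobs,
>          since the limit is roughly jobThrottle * 100 + 1 -->
>     <profile namespace="karajan" key="jobThrottle">4.99</profile>
>     <profile namespace="karajan" key="initialScore">10000</profile>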
>
> However, I am seeing a see-saw pattern in the number of active tasks, with
> a very low peak: the count rises gradually from 0 to about 30, falls back
> to 0, and then climbs to about 30 again.
>
> The logs and sources are at: http://ci.uchicago.edu/~ketan/DSSAT-logs.tgz
>
> This tarball contains the following:
>
> DSSAT-logs/sites.grid-ps.xml
> DSSAT-logs/tc-provider-staging
> DSSAT-logs/cf.ps
> DSSAT-logs/RunDSSAT.swift
>
> Condor and Swift logs:
>
> DSSAT-logs/condor.log
> DSSAT-logs/swift.log
>
> The service's and workers' stdout files:
>
> DSSAT-logs/service-0.out
> DSSAT-logs/swift-workers.out
>
> Three run logs, since the run was resumed twice:
>
> DSSAT-logs/RunDSSAT-20110909-1025-hjcelum9.log
> DSSAT-logs/RunDSSAT-20110909-1030-jjefp0sb.log
> DSSAT-logs/RunDSSAT-20110909-0918-0hk7ign5.log
>
> Any insights would be helpful.
>
> Regards,
> --
> Ketan
>
>
>


-- 
Ketan