In addition,<div><br></div><div>I see the following timeout messages about 2 hours into running the workflow:</div><div><br></div><div><div>org.globus.cog.karajan.workflow.service.ReplyTimeoutException</div><div> at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:288)</div>
<div> at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:293)</div><div> at java.util.TimerThread.mainLoop(Timer.java:512)</div><div> at java.util.TimerThread.run(Timer.java:462)</div>
<div>Command(2883, HEARTBEAT): handling reply timeout; sendReqTime=110909-115956.899, sendTime=110909-115956.899, now=110909-120156.901</div><div>Command(2883, HEARTBEAT)fault was: Reply timeout</div><div>org.globus.cog.karajan.workflow.service.ReplyTimeoutException</div>
<div> at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:288)</div><div> at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:293)</div>
<div> at java.util.TimerThread.mainLoop(Timer.java:512)</div><div> at java.util.TimerThread.run(Timer.java:462)</div><div>Command(2887, HEARTBEAT): handling reply timeout; sendReqTime=110909-120003.463, sendTime=110909-120003.463, now=110909-120203.468</div>
<div>Command(2887, HEARTBEAT)fault was: Reply timeout</div><div>org.globus.cog.karajan.workflow.service.ReplyTimeoutException</div><div> at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:288)</div>
<div> at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:293)</div><div> at java.util.TimerThread.mainLoop(Timer.java:512)</div><div> at java.util.TimerThread.run(Timer.java:462)</div>
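<div><br></div><div>As a rough check on the two HEARTBEAT entries above, the gap between sendTime and now can be computed directly. This is a minimal sketch: the timestamp layout (YYMMDD-HHMMSS.mmm) is an assumption inferred from the log lines, and the helper names are hypothetical, not part of Swift or CoG.</div>

```python
from datetime import datetime

# Assumed layout of the log timestamps: YYMMDD-HHMMSS.mmm
STAMP_FORMAT = "%y%m%d-%H%M%S.%f"


def parse_stamp(text):
    """Parse a timestamp such as '110909-115956.899' (assumed layout)."""
    return datetime.strptime(text, STAMP_FORMAT)


def reply_wait_seconds(send_time, now):
    """Seconds between sending a command and its reply timeout firing."""
    return (parse_stamp(now) - parse_stamp(send_time)).total_seconds()


# (sendTime, now) pairs copied from the two HEARTBEAT entries in the log.
heartbeats = {
    2883: ("110909-115956.899", "110909-120156.901"),
    2887: ("110909-120003.463", "110909-120203.468"),
}

for command_id, (sent, now) in heartbeats.items():
    print(command_id, round(reply_wait_seconds(sent, now), 3))
```

<div>Both entries come out at roughly 120 seconds, which suggests (but does not confirm) that each heartbeat waited about two minutes for a reply before timing out.</div>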
</div><div><br></div><div><br></div><div>Regards,</div><div>Ketan</div><div><br><br><div class="gmail_quote">On Fri, Sep 9, 2011 at 11:52 AM, Ketan Maheshwari <span dir="ltr"><<a href="mailto:ketancmaheshwari@gmail.com">ketancmaheshwari@gmail.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">Hi Mihael, All,<div><br></div><div>I am trying to run the DSSAT workflow, a simple one-process, catsn-like loop.<br clear="all">
<div><br></div><div>The setup on OSG is based on persistent coasters, with the following elements:</div>
<div><br></div><div>1. A coaster service is started on the head node.</div><div>2. Workers are started on OSG sites. I am using 11 OSG sites.</div><div>3. The workers are submitted as Condor jobs, which connect back to the service running on the head node.</div>
<div>4. In the current run, 500 workers were submitted at the start, of which 280 are in the running state as of now.</div><div><br></div><div>My throttles (jobthrottle and the foreach throttle) are set to run 500 tasks at a time.</div>
<div><br></div><div>However, I am seeing a see-saw pattern of active tasks whose peak is very low: the number of active tasks rises gradually from 0 to about 30, then falls from 30 back to 0, and then climbs to 30 again.</div>
<div><br></div><div>The logs and sources are at: <a href="http://ci.uchicago.edu/~ketan/DSSAT-logs.tgz" target="_blank">http://ci.uchicago.edu/~ketan/DSSAT-logs.tgz</a></div><div><br></div><div>This tarball contains the following:</div>
<div><br></div><div><div>DSSAT-logs/sites.grid-ps.xml</div><div>DSSAT-logs/tc-provider-staging</div><div>DSSAT-logs/cf.ps</div><div>DSSAT-logs/RunDSSAT.swift</div><div><br></div>
<div>Condor and Swift logs</div>
<div><br></div><div>DSSAT-logs/condor.log</div></div><div><div><div>DSSAT-logs/swift.log</div><div><br></div><div>Service and worker stdouts</div><div><br></div><div>DSSAT-logs/service-0.out</div></div><div><div>DSSAT-logs/swift-workers.out</div>
</div><div><br></div><div>Three run logs, since the run was resumed twice:</div><div><br></div><div>DSSAT-logs/RunDSSAT-20110909-1025-hjcelum9.log</div><div>DSSAT-logs/RunDSSAT-20110909-1030-jjefp0sb.log</div><div><div>DSSAT-logs/RunDSSAT-20110909-0918-0hk7ign5.log</div>
</div></div><div><br></div><div>Any insights would be helpful.</div><div><br></div><div>Regards,</div>-- <br><font color="#888888">Ketan<br><br><br>
</font></div>
</blockquote></div><br><br clear="all"><div><br></div>-- <br>Ketan<br><br><br>
</div>