<div>Hi,</div><div><br></div>I tried a smaller catsnsleep run: sleep time 60 s, n=20, walltime 2 min.<div><br></div><div>The run did complete, but I saw this partway through (I suppose at the end of the first walltime slot):</div>
<div><br></div><div><div>Command(13, HEARTBEAT): handling reply timeout; sendReqTime=110823-153059.847, sendTime=110823-153059.847, now=110823-153259.860</div><div>Command(13, HEARTBEAT)fault was: Reply timeout</div><div>
org.globus.cog.karajan.workflow.service.ReplyTimeoutException</div><div><span class="Apple-tab-span" style="white-space:pre">     </span>at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:288)</div>
<div><span class="Apple-tab-span" style="white-space:pre">      </span>at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:293)</div><div><span class="Apple-tab-span" style="white-space:pre">    </span>at java.util.TimerThread.mainLoop(Timer.java:512)</div>
<div><span class="Apple-tab-span" style="white-space:pre">      </span>at java.util.TimerThread.run(Timer.java:462)</div><div>Command(13, HEARTBEAT)fault was: Invalid channel: 914784201: {}</div><div>org.globus.cog.karajan.workflow.service.ReplyTimeoutException</div>
<div><span class="Apple-tab-span" style="white-space:pre">      </span>at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:288)</div><div><span class="Apple-tab-span" style="white-space:pre">     </span>at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:293)</div>
<div><span class="Apple-tab-span" style="white-space:pre">      </span>at java.util.TimerThread.mainLoop(Timer.java:512)</div><div><span class="Apple-tab-span" style="white-space:pre">    </span>at java.util.TimerThread.run(Timer.java:462)</div>
<div>Heartbeat failed: Invalid channel: 914784201: {}</div><div>org.globus.cog.karajan.workflow.service.ReplyTimeoutException</div><div><span class="Apple-tab-span" style="white-space:pre">   </span>at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:288)</div>
<div><span class="Apple-tab-span" style="white-space:pre">      </span>at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:293)</div><div><span class="Apple-tab-span" style="white-space:pre">    </span>at java.util.TimerThread.mainLoop(Timer.java:512)</div>
<div><span class="Apple-tab-span" style="white-space:pre">      </span>at java.util.TimerThread.run(Timer.java:462)</div><div><br></div><div>The log is attached. I will try a longish run with more heap memory.</div><div><br></div>
<div><br></div><div>Regards,</div><div>Ketan</div><div><br></div><div><br><div class="gmail_quote">On Tue, Aug 23, 2011 at 3:21 PM, Mihael Hategan <span dir="ltr"><<a href="mailto:hategan@mcs.anl.gov">hategan@mcs.anl.gov</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"><div class="im">On Tue, 2011-08-23 at 14:46 -0500, Michael Wilde wrote:<br>
> Well, I raised that issue, but Ketan claimed that the failure to start more jobs occurs without that message as well.<br>
<br>
</div>Fine. I'll need the log from that then.<br>
<div><div></div><div class="h5"><br>
><br>
> Do you believe that the Out of Mem error is the root cause?<br>
><br>
> Ketan, can you point to logs without the OOM error?<br>
><br>
> Can you re-run the catsn with more memory?<br>
><br>
> And more importantly: can you run a *very small* catsnsleep test where you carefully craft the sleep times and settings to cause one (very short duration) coaster block to time out, and verify that a new block is submitted, that a new job starts in it, and that the script runs to completion?<br>

><br>
> I suggested in the ticket that David do this; can you both discuss and see who is better positioned to do this sooner, so we can decide if we have a blocker here, or just something that needs better configuration and perhaps a note in the user guide telling users what to watch out for in this regard? (For example, I think the user guide does not currently tell users how and when to increase memory.) Nor are we clear enough on the issues around maxtime, maxwalltime, and the sizing of coaster blocks.<br>

><br>
> Thanks,<br>
><br>
> - Mike<br>
><br>
><br>
> ----- Original Message -----<br>
> > From: "Mihael Hategan" <<a href="mailto:hategan@mcs.anl.gov">hategan@mcs.anl.gov</a>><br>
> > To: "Ketan Maheshwari" <<a href="mailto:ketancmaheshwari@gmail.com">ketancmaheshwari@gmail.com</a>><br>
> > Cc: "Michael Wilde" <<a href="mailto:wilde@mcs.anl.gov">wilde@mcs.anl.gov</a>>, "Swift Devel" <<a href="mailto:swift-devel@ci.uchicago.edu">swift-devel@ci.uchicago.edu</a>><br>
> > Sent: Tuesday, August 23, 2011 2:36:00 PM<br>
> > Subject: Re: [Swift-devel] Blocker issue for 0.93: DSSAT script does not complete, 2nd coaster blocks dont start?<br>
> > mike@blabla:~/tmp$ grep "heap" catsn-20110823-1116-94roxc18.log<br>
> > 2011-08-23 11:16:49,904-0500 DEBUG Loader Max heap: 257294336<br>
> > 2011-08-23 11:38:50,957-0500 DEBUG VDL2ExecutionContext<br>
> > java.lang.OutOfMemoryError: Java heap space<br>
> > java.lang.OutOfMemoryError: Java heap space<br>
> > Caused by: java.lang.OutOfMemoryError: Java heap space<br>
> > Caused by: java.lang.OutOfMemoryError: Java heap space<br>
> > java.lang.OutOfMemoryError: Java heap space<br>
> > Caused by: java.lang.OutOfMemoryError: Java heap space<br>
> > Caused by: java.lang.OutOfMemoryError: Java heap space<br>
> ><br>
> ><br>
> > On Tue, 2011-08-23 at 11:47 -0500, Ketan Maheshwari wrote:<br>
> > > Hello Mike,<br>
> > ><br>
> > ><br>
> > > I tried another run with 30K tasks on PADS. This run stopped after<br>
> > > completing 16K+ tasks.<br>
> > ><br>
> > ><br>
> > > The log file is:<br>
> > > <a href="http://www.ci.uchicago.edu/~ketan/catsn-20110823-1116-94roxc18.log" target="_blank">http://www.ci.uchicago.edu/~ketan/catsn-20110823-1116-94roxc18.log</a><br>
> > ><br>
> > ><br>
> > > The exception messages I get are attached with the mail.<br>
> > ><br>
> > ><br>
> > > Looking at the messages, it seems the coasters are unable to restart<br>
> > > the submit block once the walltime has expired for a run.<br>
> > ><br>
> > ><br>
> > > Regards,<br>
> > > Ketan<br>
> > ><br>
> > > On Tue, Aug 23, 2011 at 11:06 AM, Ketan Maheshwari<br>
> > > <<a href="mailto:ketancmaheshwari@gmail.com">ketancmaheshwari@gmail.com</a>> wrote:<br>
> > >         Mike,<br>
> > ><br>
> > ><br>
> > >         This looks like the coaster blocks not restarting issue. I<br>
> > >         can try the same run again and see if this persists.<br>
> > ><br>
> > ><br>
> > >         On Tue, Aug 23, 2011 at 11:04 AM, Michael Wilde<br>
> > >         <<a href="mailto:wilde@mcs.anl.gov">wilde@mcs.anl.gov</a>> wrote:<br>
> > >                 Ketan,<br>
> > ><br>
> > ><br>
> > >                 Should I ask David to try to replicate this problem?<br>
> > ><br>
> > ><br>
> > >                 Did you figure out why your jobs are not starting on<br>
> > >                 PADS?<br>
> > ><br>
> > ><br>
> > >                 - Mike<br>
> > ><br>
> > ><br>
> > ><br>
> > >                 ______________________________________________________<br>
> > >                         From: "Michael Wilde" <<a href="mailto:wilde@mcs.anl.gov">wilde@mcs.anl.gov</a>><br>
> > >                         To: "Ketan Maheshwari"<br>
> > >                         <<a href="mailto:ketancmaheshwari@gmail.com">ketancmaheshwari@gmail.com</a>>, "Mihael<br>
> > >                         Hategan"<br>
> > >                         <<a href="mailto:hategan@mcs.anl.gov">hategan@mcs.anl.gov</a>><br>
> > >                         Cc: "swift-devel Devel"<br>
> > >                         <<a href="mailto:swift-devel@ci.uchicago.edu">swift-devel@ci.uchicago.edu</a>>, "Papia<br>
> > >                         Rizwan"<br>
> > >                         <<a href="mailto:papia.rizwan@gmail.com">papia.rizwan@gmail.com</a>><br>
> > >                         Sent: Monday, August 22, 2011 10:47:56 AM<br>
> > ><br>
> > ><br>
> > >                         Subject: Re: Blocker issue for 0.93: DSSAT<br>
> > >                         script does not complete, 2nd coaster blocks<br>
> > >                         dont start?<br>
> > ><br>
> > >                         Can you try this on PADS using small jobs in<br>
> > >                         the fast queue?<br>
> > ><br>
> > >                         I have not thought this all the way through,<br>
> > >                         but perhaps coasters will honor maxtime and<br>
> > >                         maxwalltime on any coaster block, even if<br>
> > >                         it's<br>
> > >                         not running on a batch scheduler. In that<br>
> > >                         case perhaps you can replicate the problem<br>
> > >                         on<br>
> > >                         the MCS pool or better yet on localhost.<br>
> > ><br>
> > ><br>
> > >                         In these runs, what was the value of<br>
> > >                         the execution.retries and lazy.errors flags?<br>
> > >                          Mihael, do those properties need to be set<br>
> > >                          to<br>
> > >                         >0 and true, respectively, in order for<br>
> > >                         coasters to start new blocks correctly,<br>
> > >                         assuming that in some cases a job will run<br>
> > >                         longer than its maxwalltime?<br>
> > ><br>
> > ><br>
> > >                         - Mike<br>
> > ><br>
> > ><br>
> > ><br>
> > >                         ______________________________________________<br>
> > >                                 From: "Ketan Maheshwari"<br>
> > >                                 <<a href="mailto:ketancmaheshwari@gmail.com">ketancmaheshwari@gmail.com</a>><br>
> > >                                 To: "Michael Wilde"<br>
> > >                                 <<a href="mailto:wilde@mcs.anl.gov">wilde@mcs.anl.gov</a>><br>
> > >                                 Cc: "Papia Rizwan"<br>
> > >                                 <<a href="mailto:papia.rizwan@gmail.com">papia.rizwan@gmail.com</a>>,<br>
> > >                                 "swift-devel<br>
> > >                                 Devel" <<a href="mailto:swift-devel@ci.uchicago.edu">swift-devel@ci.uchicago.edu</a>><br>
> > >                                 Sent: Monday, August 22, 2011<br>
> > >                                 10:32:31<br>
> > >                                 AM<br>
> > >                                 Subject: Re: Blocker issue for 0.93:<br>
> > >                                 DSSAT script does not complete, 2nd<br>
> > >                                 coaster blocks dont start?<br>
> > ><br>
> > >                                 Mike,<br>
> > ><br>
> > ><br>
> > >                                 If I recall correctly, Papia has<br>
> > >                                 always been running her DSSAT app<br>
> > >                                 with<br>
> > >                                 0.92. She has not yet tried with<br>
> > >                                 0.93.<br>
> > >                                 I too tried with 0.92 with her sites<br>
> > >                                 file settings.<br>
> > ><br>
> > ><br>
> > >                                 I once tried it with 0.93 on PADS<br>
> > >                                 but<br>
> > >                                 could never get from the queue into<br>
> > >                                 the running state.<br>
> > ><br>
> > ><br>
> > >                                 I will give another try today as it<br>
> > >                                 might be that PADS was too busy last<br>
> > >                                 week. As I recall Jon was also<br>
> > >                                 struggling to get access.<br>
> > ><br>
> > ><br>
> > >                                 Regards,<br>
> > >                                 Ketan<br>
> > ><br>
> > >                                 On Mon, Aug 22, 2011 at 10:24 AM,<br>
> > >                                 Michael Wilde <<a href="mailto:wilde@mcs.anl.gov">wilde@mcs.anl.gov</a>><br>
> > >                                 wrote:<br>
> > >                                         Papia, Ketan,<br>
> > ><br>
> > >                                         In reviewing 0.93 work<br>
> > >                                         remaining with David, I<br>
> > >                                         remembered this issue.<br>
> > ><br>
> > >                                         You both reported that the<br>
> > >                                         DSSAT application script<br>
> > >                                         doesn't finish on PADS - it<br>
> > >                                         seems not to start the<br>
> > >                                         second<br>
> > >                                         round of coaster blocks that<br>
> > >                                         it needs to complete (as I<br>
> > >                                         recall, but this may not be<br>
> > >                                         correct). This needs to be<br>
> > >                                         researched and filed as a<br>
> > >                                         bug<br>
> > >                                         (or, an error in the sites<br>
> > >                                         spec needs to be identified<br>
> > >                                         and made clear in the site<br>
> > >                                         guide if it turns out to be<br>
> > >                                         the problem).<br>
> > ><br>
> > >                                         Possibly there is an issue<br>
> > >                                         with jobs failing at the end<br>
> > >                                         of the coaster blocks, and<br>
> > >                                         you<br>
> > >                                         don't have the necessary<br>
> > >                                         retry<br>
> > >                                         values set for the PADS<br>
> > >                                         site???<br>
> > ><br>
> > >                                         We need an example run with<br>
> > >                                         logs and full details. Can<br>
> > >                                         you<br>
> > >                                         try to re-create this with a<br>
> > >                                         much smaller initial<br>
> > >                                         allocation, and see if<br>
> > >                                         coasters is transitioning<br>
> > >                                         from<br>
> > >                                         its initial blocks to the<br>
> > >                                         next<br>
> > >                                         blocks?<br>
> > ><br>
> > >                                         Can you give this high prio<br>
> > >                                         for today?<br>
> > ><br>
> > >                                         Thanks,<br>
> > ><br>
> > >                                         - Mike<br>
> > ><br>
> > ><br>
> > ><br>
> > ><br>
> > >                                 --<br>
> > >                                 Ketan<br>
> > ><br>
> > ><br>
> > ><br>
> > ><br>
> > ><br>
> > ><br>
> > >                         --<br>
> > >                         Michael Wilde<br>
> > >                         Computation Institute, University of Chicago<br>
> > >                         Mathematics and Computer Science Division<br>
> > >                         Argonne National Laboratory<br>
> > ><br>
> > ><br>
> > ><br>
> > ><br>
> > ><br>
> > ><br>
> > >                 --<br>
> > >                 Michael Wilde<br>
> > >                 Computation Institute, University of Chicago<br>
> > >                 Mathematics and Computer Science Division<br>
> > >                 Argonne National Laboratory<br>
> > ><br>
> > ><br>
> > ><br>
> > ><br>
> > ><br>
> > ><br>
> > >         --<br>
> > >         Ketan<br>
> > ><br>
> > ><br>
> > ><br>
> > ><br>
> > ><br>
> > ><br>
> > ><br>
> > > --<br>
> > > Ketan<br>
> > ><br>
> > ><br>
> > ><br>
> > > _______________________________________________<br>
> > > Swift-devel mailing list<br>
> > > <a href="mailto:Swift-devel@ci.uchicago.edu">Swift-devel@ci.uchicago.edu</a><br>
> > > <a href="https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel" target="_blank">https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel</a><br>
><br>
<br>
<br>
</div></div></blockquote></div><br><br clear="all"><div><br></div>-- <br>Ketan<br><br><br>
</div></div>