<div>Hi again,</div><div><br></div>Tried a larger run on PADS with a similar sleep time but a larger n parameter. The run seemed to be progressing well (I killed it by mistake), but the log does show some coaster block shutdown and network-related exception messages.<div>
<br></div><div>Attached is the log.</div><div><br></div><div>Regards,</div><div>Ketan</div><div><br></div><div><br><div class="gmail_quote">On Tue, Aug 23, 2011 at 4:05 PM, Mihael Hategan <span dir="ltr"><<a href="mailto:hategan@mcs.anl.gov">hategan@mcs.anl.gov</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">That's benign, but I committed a patch to prevent it from happening in<br>
cog r3237.<br>
<div><div></div><div class="h5"><br>
On Tue, 2011-08-23 at 15:46 -0500, Ketan Maheshwari wrote:<br>
> Hi,<br>
><br>
><br>
> I tried a smaller run, catsnsleep, sleeptime 60sec, n=20,<br>
> walltime=2min<br>
><br>
><br>
> The run indeed completed but I saw this in the middle (I suppose at<br>
> the end of the first walltime slot):<br>
><br>
><br>
> Command(13, HEARTBEAT): handling reply timeout;<br>
> sendReqTime=110823-153059.847, sendTime=110823-153059.847,<br>
> now=110823-153259.860<br>
> Command(13, HEARTBEAT)fault was: Reply timeout<br>
> org.globus.cog.karajan.workflow.service.ReplyTimeoutException<br>
> at<br>
> org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:288)<br>
> at org.globus.cog.karajan.workflow.service.commands.Command<br>
> $Timeout.run(Command.java:293)<br>
> at java.util.TimerThread.mainLoop(Timer.java:512)<br>
> at java.util.TimerThread.run(Timer.java:462)<br>
> Command(13, HEARTBEAT)fault was: Invalid channel: 914784201: {}<br>
> org.globus.cog.karajan.workflow.service.ReplyTimeoutException<br>
> at<br>
> org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:288)<br>
> at org.globus.cog.karajan.workflow.service.commands.Command<br>
> $Timeout.run(Command.java:293)<br>
> at java.util.TimerThread.mainLoop(Timer.java:512)<br>
> at java.util.TimerThread.run(Timer.java:462)<br>
> Heartbeat failed: Invalid channel: 914784201: {}<br>
> org.globus.cog.karajan.workflow.service.ReplyTimeoutException<br>
> at<br>
> org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:288)<br>
> at org.globus.cog.karajan.workflow.service.commands.Command<br>
> $Timeout.run(Command.java:293)<br>
> at java.util.TimerThread.mainLoop(Timer.java:512)<br>
> at java.util.TimerThread.run(Timer.java:462)<br>
><br>
><br>
> The log is attached. I will try a longish run with more heap memory.<br>
><br>
><br>
><br>
><br>
> Regards,<br>
> Ketan<br>
><br>
><br>
><br>
> On Tue, Aug 23, 2011 at 3:21 PM, Mihael Hategan <<a href="mailto:hategan@mcs.anl.gov">hategan@mcs.anl.gov</a>><br>
> wrote:<br>
> On Tue, 2011-08-23 at 14:46 -0500, Michael Wilde wrote:<br>
> > Well, I raised that issue, but Ketan claimed that the<br>
> failure to start more jobs occurs without that message as<br>
> well.<br>
><br>
><br>
> Fine. I'll need the log from that then.<br>
><br>
><br>
> ><br>
> > Do you believe that the Out of Mem error is the root cause?<br>
> ><br>
> > Ketan, can you point to logs without the OOM error?<br>
> ><br>
> > Can you re-run the catsn with more memory?<br>
> ><br>
> > And more importantly: can you run a *very small* catsnsleep<br>
> test where you carefully craft the sleep times and settings to<br>
> cause one (very short duration) coaster block to time out and<br>
> verify that a new block is submitted in a new job and that<br>
> the script runs to completion?<br>
> ><br>
> > I suggested in the ticket that David do this; can you both<br>
> discuss and see who is better positioned to do this sooner, so<br>
> we can decide if we have a blocker here, or just something<br>
> that needs better configuration and perhaps a note in the user<br>
> guide telling users what to watch out for in this regard? (I<br>
> think for example we do not tell how and when to increase<br>
> memory in the user guide, at the moment). Nor are we clear<br>
> enough on the issues around maxtime, maxwalltime, and the<br>
> sizing of coaster blocks.<br>
> ><br>
> > Thanks,<br>
> ><br>
> > - Mike<br>
> ><br>
> ><br>
> > ----- Original Message -----<br>
> > > From: "Mihael Hategan" <<a href="mailto:hategan@mcs.anl.gov">hategan@mcs.anl.gov</a>><br>
> > > To: "Ketan Maheshwari" <<a href="mailto:ketancmaheshwari@gmail.com">ketancmaheshwari@gmail.com</a>><br>
> > > Cc: "Michael Wilde" <<a href="mailto:wilde@mcs.anl.gov">wilde@mcs.anl.gov</a>>, "Swift Devel"<br>
> <<a href="mailto:swift-devel@ci.uchicago.edu">swift-devel@ci.uchicago.edu</a>><br>
> > > Sent: Tuesday, August 23, 2011 2:36:00 PM<br>
> > > Subject: Re: [Swift-devel] Blocker issue for 0.93: DSSAT<br>
> script does not complete, 2nd coaster blocks dont start?<br>
> > > mike@blabla:~/tmp$ grep "heap"<br>
> catsn-20110823-1116-94roxc18.log<br>
> > > 2011-08-23 11:16:49,904-0500 DEBUG Loader Max heap:<br>
> 257294336<br>
> > > 2011-08-23 11:38:50,957-0500 DEBUG VDL2ExecutionContext<br>
> > > java.lang.OutOfMemoryError: Java heap space<br>
> > > java.lang.OutOfMemoryError: Java heap space<br>
> > > Caused by: java.lang.OutOfMemoryError: Java heap space<br>
> > > Caused by: java.lang.OutOfMemoryError: Java heap space<br>
> > > java.lang.OutOfMemoryError: Java heap space<br>
> > > Caused by: java.lang.OutOfMemoryError: Java heap space<br>
> > > Caused by: java.lang.OutOfMemoryError: Java heap space<br>
> > ><br>
> > ><br>
> > > On Tue, 2011-08-23 at 11:47 -0500, Ketan Maheshwari wrote:<br>
> > > > Hello Mike,<br>
> > > ><br>
> > > ><br>
> > > > I tried another run with 30K tasks on PADS. This run<br>
> stopped after<br>
> > > > completing 16K+ tasks.<br>
> > > ><br>
> > > ><br>
> > > > The log file is:<br>
> > > ><br>
> <a href="http://www.ci.uchicago.edu/~ketan/catsn-20110823-1116-94roxc18.log" target="_blank">http://www.ci.uchicago.edu/~ketan/catsn-20110823-1116-94roxc18.log</a><br>
> > > ><br>
> > > ><br>
> > > > The exception messages I get are attached with the mail.<br>
> > > ><br>
> > > ><br>
> > > > Looking at the messages, it seems the coasters are<br>
> unable to restart<br>
> > > > the submit block once the walltime is expired for a run.<br>
> > > ><br>
> > > ><br>
> > > > Regards,<br>
> > > > Ketan<br>
> > > ><br>
> > > > On Tue, Aug 23, 2011 at 11:06 AM, Ketan Maheshwari<br>
> > > > <<a href="mailto:ketancmaheshwari@gmail.com">ketancmaheshwari@gmail.com</a>> wrote:<br>
> > > > Mike,<br>
> > > ><br>
> > > ><br>
> > > > This looks like the coasters blocks not<br>
> restarting issue. I<br>
> > > > can try to run the same run again and see if<br>
> this persists.<br>
> > > ><br>
> > > ><br>
> > > > On Tue, Aug 23, 2011 at 11:04 AM, Michael Wilde<br>
> > > > <<a href="mailto:wilde@mcs.anl.gov">wilde@mcs.anl.gov</a>> wrote:<br>
> > > > Ketan,<br>
> > > ><br>
> > > ><br>
> > > > Should I ask David to try to replicate<br>
> this problem?<br>
> > > ><br>
> > > ><br>
> > > > Did you figure out why your jobs are not<br>
> starting on<br>
> > > > PADS?<br>
> > > ><br>
> > > ><br>
> > > > - Mike<br>
> > > ><br>
> > > ><br>
> > > ><br>
> > > ><br>
> ______________________________________________________<br>
> > > > From: "Michael Wilde"<br>
> <<a href="mailto:wilde@mcs.anl.gov">wilde@mcs.anl.gov</a>><br>
> > > > To: "Ketan Maheshwari"<br>
> > > > <<a href="mailto:ketancmaheshwari@gmail.com">ketancmaheshwari@gmail.com</a>>,<br>
> "Mihael<br>
> > > > Hategan"<br>
> > > > <<a href="mailto:hategan@mcs.anl.gov">hategan@mcs.anl.gov</a>><br>
> > > > Cc: "swift-devel Devel"<br>
> > > > <<a href="mailto:swift-devel@ci.uchicago.edu">swift-devel@ci.uchicago.edu</a>>,<br>
> "Papia<br>
> > > > Rizwan"<br>
> > > > <<a href="mailto:papia.rizwan@gmail.com">papia.rizwan@gmail.com</a>><br>
> > > > Sent: Monday, August 22, 2011<br>
> 10:47:56 AM<br>
> > > ><br>
> > > ><br>
> > > > Subject: Re: Blocker issue for<br>
> 0.93: DSSAT<br>
> > > > script does not complete, 2nd<br>
> coaster blocks<br>
> > > > dont start?<br>
> > > ><br>
> > > > Can you try this on PADS using<br>
> small jobs in<br>
> > > > the fast queue?<br>
> > > ><br>
> > > > I have not thought this all the<br>
> way through,<br>
> > > > but perhaps coasters will honor<br>
> maxtime and<br>
> > > > maxwalltime on any coaster<br>
> block, even if<br>
> > > > its<br>
> > > > not running on a batch<br>
> scheduler. In that<br>
> > > > case perhaps you can replicate<br>
> the problem<br>
> > > > on<br>
> > > > the MCS pool or better yet on<br>
> localhost.<br>
> > > ><br>
> > > ><br>
> > > > In these runs, what was the<br>
> value of<br>
> > > > the execution.retries and<br>
> lazy.errors flags?<br>
> > > > Mihael, do those properties<br>
> need to be set<br>
> > > > to<br>
> > > > >0 and true, respectively, in<br>
> order for<br>
> > > > coasters to start new blocks<br>
> correctly,<br>
> > > > assuming that in some cases a<br>
> job will run<br>
> > > > longer than its maxwalltime?<br>
> > > ><br>
> > > ><br>
> > > > - Mike<br>
> > > ><br>
> > > ><br>
> > > ><br>
> > > ><br>
> ______________________________________________<br>
> > > > From: "Ketan Maheshwari"<br>
> > > ><br>
> <<a href="mailto:ketancmaheshwari@gmail.com">ketancmaheshwari@gmail.com</a>><br>
> > > > To: "Michael Wilde"<br>
> > > > <<a href="mailto:wilde@mcs.anl.gov">wilde@mcs.anl.gov</a>><br>
> > > > Cc: "Papia Rizwan"<br>
> > > ><br>
> <<a href="mailto:papia.rizwan@gmail.com">papia.rizwan@gmail.com</a>>,<br>
> > > > "swift-devel<br>
> > > > Devel"<br>
> <<a href="mailto:swift-devel@ci.uchicago.edu">swift-devel@ci.uchicago.edu</a>><br>
> > > > Sent: Monday, August 22,<br>
> 2011<br>
> > > > 10:32:31<br>
> > > > AM<br>
> > > > Subject: Re: Blocker<br>
> issue for 0.93:<br>
> > > > DSSAT script does not<br>
> complete, 2nd<br>
> > > > coaster blocks dont<br>
> start?<br>
> > > ><br>
> > > > Mike,<br>
> > > ><br>
> > > ><br>
> > > > If I recall correctly,<br>
> Papia has<br>
> > > > always been running her<br>
> DSSAT app<br>
> > > > with<br>
> > > > 0.92. She has not yet<br>
> tried with<br>
> > > > 0.93.<br>
> > > > I too tried with 0.92<br>
> with her sites<br>
> > > > file settings.<br>
> > > ><br>
> > > ><br>
> > > > I once tried it with<br>
> 0.93 on pads<br>
> > > > but<br>
> > > > could never get in the<br>
> running from<br>
> > > > the queue.<br>
> > > ><br>
> > > ><br>
> > > > I will give another try<br>
> today as it<br>
> > > > might be that PADS was<br>
> too busy last<br>
> > > > week. As I recall Jon<br>
> was also<br>
> > > > struggling to get<br>
> access.<br>
> > > ><br>
> > > ><br>
> > > > Regards,<br>
> > > > Ketan<br>
> > > ><br>
> > > > On Mon, Aug 22, 2011 at<br>
> 10:24 AM,<br>
> > > > Michael Wilde<br>
> <<a href="mailto:wilde@mcs.anl.gov">wilde@mcs.anl.gov</a>><br>
> > > > wrote:<br>
> > > > Papia, Ketan,<br>
> > > ><br>
> > > > In reviewing<br>
> 0.93 work<br>
> > > > remaining with<br>
> David, I<br>
> > > > remembered this<br>
> issue.<br>
> > > ><br>
> > > > You both<br>
> reported that the<br>
> > > > DSSAT<br>
> application script<br>
> > > > doesnt finish on<br>
> PADS - it<br>
> > > > seems not to<br>
> start the<br>
> > > > second<br>
> > > > round of coaster<br>
> blocks that<br>
> > > > it needs to<br>
> complete (as I<br>
> > > > recall, but this<br>
> may not be<br>
> > > > correct). This<br>
> needs to be<br>
> > > > researched and<br>
> filed as a<br>
> > > > bug<br>
> > > > (or, an error in<br>
> the sites<br>
> > > > spec needs to be<br>
> identified<br>
> > > > and made clear<br>
> in the site<br>
> > > > guide if it<br>
> turns out to be<br>
> > > > the problem).<br>
> > > ><br>
> > > > Possible there<br>
> is an issue<br>
> > > > with jobs<br>
> failing at the end<br>
> > > > of the coaster<br>
> blocks, and<br>
> > > > you<br>
> > > > dont have the<br>
> necessary<br>
> > > > retry<br>
> > > > values set for<br>
> the PADS<br>
> > > > site???<br>
> > > ><br>
> > > > We need an<br>
> example run with<br>
> > > > logs and full<br>
> details. Can<br>
> > > > you<br>
> > > > try to re-create<br>
> this with a<br>
> > > > much smaller<br>
> initial<br>
> > > > allocation, and<br>
> see if<br>
> > > > coasters is<br>
> transitioning<br>
> > > > from<br>
> > > > its initial<br>
> blocks to the<br>
> > > > next<br>
> > > > blocks?<br>
> > > ><br>
> > > > Can you give<br>
> this high prio<br>
> > > > for today?<br>
> > > ><br>
> > > > Thanks,<br>
> > > ><br>
> > > > - Mike<br>
> > > ><br>
> > > ><br>
> > > ><br>
> > > ><br>
> > > > --<br>
> > > > Ketan<br>
> > > ><br>
> > > ><br>
> > > ><br>
> > > ><br>
> > > ><br>
> > > ><br>
> > > > --<br>
> > > > Michael Wilde<br>
> > > > Computation Institute,<br>
> University of Chicago<br>
> > > > Mathematics and Computer Science<br>
> Division<br>
> > > > Argonne National Laboratory<br>
> > > ><br>
> > > ><br>
> > > ><br>
> > > ><br>
> > > ><br>
> > > ><br>
> > > > --<br>
> > > > Michael Wilde<br>
> > > > Computation Institute, University of<br>
> Chicago<br>
> > > > Mathematics and Computer Science<br>
> Division<br>
> > > > Argonne National Laboratory<br>
> > > ><br>
> > > ><br>
> > > ><br>
> > > ><br>
> > > ><br>
> > > ><br>
> > > > --<br>
> > > > Ketan<br>
> > > ><br>
> > > ><br>
> > > ><br>
> > > ><br>
> > > ><br>
> > > ><br>
> > > ><br>
> > > > --<br>
> > > > Ketan<br>
> > > ><br>
> > > ><br>
> > > ><br>
> > > > _______________________________________________<br>
> > > > Swift-devel mailing list<br>
> > > > <a href="mailto:Swift-devel@ci.uchicago.edu">Swift-devel@ci.uchicago.edu</a><br>
> > > ><br>
> <a href="https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel" target="_blank">https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel</a><br>
> ><br>
><br>
><br>
><br>
><br>
><br>
><br>
><br>
> --<br>
> Ketan<br>
><br>
><br>
><br>
<br>
<br>
</div></div></blockquote></div><br><br clear="all"><div><br></div>-- <br>Ketan<br><br><br>
</div>