<div>Hi again,</div><div><br></div>I tried a larger run on PADS with similar sleep settings but a larger n parameter. The run seemed to be progressing well until I killed it by mistake, but the log does show some coaster block shutdown and network-related exception messages.<div>
<br></div><div>Attached is the log.</div><div><br></div><div>Regards,</div><div>Ketan</div><div><br></div><div><br><div class="gmail_quote">On Tue, Aug 23, 2011 at 4:05 PM, Mihael Hategan <span dir="ltr"><<a href="mailto:hategan@mcs.anl.gov">hategan@mcs.anl.gov</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">That's benign, but I committed a patch to prevent it from happening in<br>
cog r3237.<br>
<div><div></div><div class="h5"><br>
On Tue, 2011-08-23 at 15:46 -0500, Ketan Maheshwari wrote:<br>
> Hi,<br>
><br>
><br>
> I tried a smaller run, catsnsleep, sleeptime 60sec, n=20,<br>
> walltime=2min<br>
><br>
><br>
> The run did complete, but I saw this in the middle (I suppose at<br>
> the end of the first walltime slot):<br>
><br>
><br>
> Command(13, HEARTBEAT): handling reply timeout;<br>
> sendReqTime=110823-153059.847, sendTime=110823-153059.847,<br>
> now=110823-153259.860<br>
> Command(13, HEARTBEAT)fault was: Reply timeout<br>
> org.globus.cog.karajan.workflow.service.ReplyTimeoutException<br>
> at<br>
> org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:288)<br>
> at org.globus.cog.karajan.workflow.service.commands.Command<br>
> $Timeout.run(Command.java:293)<br>
> at java.util.TimerThread.mainLoop(Timer.java:512)<br>
> at java.util.TimerThread.run(Timer.java:462)<br>
> Command(13, HEARTBEAT)fault was: Invalid channel: 914784201: {}<br>
> org.globus.cog.karajan.workflow.service.ReplyTimeoutException<br>
> at<br>
> org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:288)<br>
> at org.globus.cog.karajan.workflow.service.commands.Command<br>
> $Timeout.run(Command.java:293)<br>
> at java.util.TimerThread.mainLoop(Timer.java:512)<br>
> at java.util.TimerThread.run(Timer.java:462)<br>
> Heartbeat failed: Invalid channel: 914784201: {}<br>
> org.globus.cog.karajan.workflow.service.ReplyTimeoutException<br>
> at<br>
> org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:288)<br>
> at org.globus.cog.karajan.workflow.service.commands.Command<br>
> $Timeout.run(Command.java:293)<br>
> at java.util.TimerThread.mainLoop(Timer.java:512)<br>
> at java.util.TimerThread.run(Timer.java:462)<br>
><br>
><br>
> The log is attached. I will try a longish run with more heap memory.<br>
><br>
><br>
><br>
><br>
> Regards,<br>
> Ketan<br>
><br>
><br>
><br>
> On Tue, Aug 23, 2011 at 3:21 PM, Mihael Hategan <<a href="mailto:hategan@mcs.anl.gov">hategan@mcs.anl.gov</a>><br>
> wrote:<br>
>         On Tue, 2011-08-23 at 14:46 -0500, Michael Wilde wrote:<br>
>         > Well, I raised that issue, but Ketan claimed that the<br>
>         failure to start more jobs occurs without that message as<br>
>         well.<br>
><br>
><br>
>         Fine. I'll need the log from that then.<br>
><br>
><br>
>         ><br>
>         > Do you believe that the Out of Mem error is the root cause?<br>
>         ><br>
>         > Ketan, can you point to logs without the OOM error?<br>
>         ><br>
>         > Can you re-run the catsn with more memory?<br>
>         ><br>
>         > And more importantly: can you run a *very small* catsnsleep<br>
>         test where you carefully craft the sleep times and settings to<br>
>         cause one (very short duration) coaster block to time out and<br>
>         verify that a new block is submitted and in new job and that<br>
>         the script runs to completion?<br>
>         ><br>
>         > I suggested in the ticket that David do this; can you both<br>
>         discuss and see who is better positioned to do this sooner, so<br>
>         we can decide if we have a blocker here, or just something<br>
>         that needs better configuration and perhaps a note in the user<br>
>         guide telling users what to watch out for in this regard? (I<br>
>         think for example we do not tell how and when to increase<br>
>         memory in the user guide, at the moment).  Nor are we clear<br>
>         enough on the issues around maxtime, maxwalltime, and the<br>
>         sizing of coaster blocks.<br>
>         ><br>
>         > Thanks,<br>
>         ><br>
>         > - Mike<br>
>         ><br>
>         ><br>
>         > ----- Original Message -----<br>
>         > > From: "Mihael Hategan" <<a href="mailto:hategan@mcs.anl.gov">hategan@mcs.anl.gov</a>><br>
>         > > To: "Ketan Maheshwari" <<a href="mailto:ketancmaheshwari@gmail.com">ketancmaheshwari@gmail.com</a>><br>
>         > > Cc: "Michael Wilde" <<a href="mailto:wilde@mcs.anl.gov">wilde@mcs.anl.gov</a>>, "Swift Devel"<br>
>         <<a href="mailto:swift-devel@ci.uchicago.edu">swift-devel@ci.uchicago.edu</a>><br>
>         > > Sent: Tuesday, August 23, 2011 2:36:00 PM<br>
>         > > Subject: Re: [Swift-devel] Blocker issue for 0.93: DSSAT<br>
>         script does not complete, 2nd coaster blocks dont start?<br>
>         > > mike@blabla:~/tmp$ grep "heap"<br>
>         catsn-20110823-1116-94roxc18.log<br>
>         > > 2011-08-23 11:16:49,904-0500 DEBUG Loader Max heap:<br>
>         257294336<br>
>         > > 2011-08-23 11:38:50,957-0500 DEBUG VDL2ExecutionContext<br>
>         > > java.lang.OutOfMemoryError: Java heap space<br>
>         > > java.lang.OutOfMemoryError: Java heap space<br>
>         > > Caused by: java.lang.OutOfMemoryError: Java heap space<br>
>         > > Caused by: java.lang.OutOfMemoryError: Java heap space<br>
>         > > java.lang.OutOfMemoryError: Java heap space<br>
>         > > Caused by: java.lang.OutOfMemoryError: Java heap space<br>
>         > > Caused by: java.lang.OutOfMemoryError: Java heap space<br>
>         > ><br>
>         > ><br>
>         > > On Tue, 2011-08-23 at 11:47 -0500, Ketan Maheshwari wrote:<br>
>         > > > Hello Mike,<br>
>         > > ><br>
>         > > ><br>
>         > > > I tried another run with 30K tasks on PADS. This run<br>
>         stopped after<br>
>         > > > completing 16K+ tasks.<br>
>         > > ><br>
>         > > ><br>
>         > > > The log file is:<br>
>         > > ><br>
>         <a href="http://www.ci.uchicago.edu/~ketan/catsn-20110823-1116-94roxc18.log" target="_blank">http://www.ci.uchicago.edu/~ketan/catsn-20110823-1116-94roxc18.log</a><br>
>         > > ><br>
>         > > ><br>
>         > > > The exception messages I get are attached with the mail.<br>
>         > > ><br>
>         > > ><br>
>         > > > Looking at the messages, it seems the coasters are<br>
>         unable to restart<br>
>         > > > the submit block once the walltime is expired for a run.<br>
>         > > ><br>
>         > > ><br>
>         > > > Regards,<br>
>         > > > Ketan<br>
>         > > ><br>
>         > > > On Tue, Aug 23, 2011 at 11:06 AM, Ketan Maheshwari<br>
>         > > > <<a href="mailto:ketancmaheshwari@gmail.com">ketancmaheshwari@gmail.com</a>> wrote:<br>
>         > > >         Mike,<br>
>         > > ><br>
>         > > ><br>
>         > > >         This looks like the coaster blocks not<br>
>         restarting issue. I<br>
>         > > >         can try to run the same run again and see if<br>
>         this persists.<br>
>         > > ><br>
>         > > ><br>
>         > > >         On Tue, Aug 23, 2011 at 11:04 AM, Michael Wilde<br>
>         > > >         <<a href="mailto:wilde@mcs.anl.gov">wilde@mcs.anl.gov</a>> wrote:<br>
>         > > >                 Ketan,<br>
>         > > ><br>
>         > > ><br>
>         > > >                 Should I ask David to try to replicate<br>
>         this problem?<br>
>         > > ><br>
>         > > ><br>
>         > > >                 Did you figure out why your jobs are not<br>
>         starting on<br>
>         > > >                 PADS?<br>
>         > > ><br>
>         > > ><br>
>         > > >                 - Mike<br>
>         > > ><br>
>         > > ><br>
>         > > ><br>
>         > > ><br>
>         ______________________________________________________<br>
>         > > >                         From: "Michael Wilde"<br>
>         <<a href="mailto:wilde@mcs.anl.gov">wilde@mcs.anl.gov</a>><br>
>         > > >                         To: "Ketan Maheshwari"<br>
>         > > >                         <<a href="mailto:ketancmaheshwari@gmail.com">ketancmaheshwari@gmail.com</a>>,<br>
>         "Mihael<br>
>         > > >                         Hategan"<br>
>         > > >                         <<a href="mailto:hategan@mcs.anl.gov">hategan@mcs.anl.gov</a>><br>
>         > > >                         Cc: "swift-devel Devel"<br>
>         > > >                         <<a href="mailto:swift-devel@ci.uchicago.edu">swift-devel@ci.uchicago.edu</a>>,<br>
>         "Papia<br>
>         > > >                         Rizwan"<br>
>         > > >                         <<a href="mailto:papia.rizwan@gmail.com">papia.rizwan@gmail.com</a>><br>
>         > > >                         Sent: Monday, August 22, 2011<br>
>         10:47:56 AM<br>
>         > > ><br>
>         > > ><br>
>         > > >                         Subject: Re: Blocker issue for<br>
>         0.93: DSSAT<br>
>         > > >                         script does not complete, 2nd<br>
>         coaster blocks<br>
>         > > >                         dont start?<br>
>         > > ><br>
>         > > >                         Can you try this on PADS using<br>
>         small jobs in<br>
>         > > >                         the fast queue?<br>
>         > > ><br>
>         > > >                         I have not thought this all the<br>
>         way through,<br>
>         > > >                         but perhaps coasters will honor<br>
>         maxtime and<br>
>         > > >                         maxwalltime on any coaster<br>
>         block, even if<br>
>         > > >                         it's<br>
>         > > >                         not running on a batch<br>
>         scheduler. In that<br>
>         > > >                         case perhaps you can replicate<br>
>         the problem<br>
>         > > >                         on<br>
>         > > >                         the MCS pool or better yet on<br>
>         localhost.<br>
>         > > ><br>
>         > > ><br>
>         > > >                         In these runs, what was the<br>
>         value of<br>
>         > > >                         the execution.retries and<br>
>         lazy.errors flags?<br>
>         > > >                          Mihael, do those properties<br>
>         need to be set<br>
>         > > >                          to<br>
>         > > >                         >0 and true, respectively, in<br>
>         order for<br>
>         > > >                         coasters to start new blocks<br>
>         correctly,<br>
>         > > >                         assuming that in some cases a<br>
>         job will run<br>
>         > > >                         longer than its maxwalltime?<br>
>         > > ><br>
>         > > ><br>
>         > > >                         - Mike<br>
>         > > ><br>
>         > > ><br>
>         > > ><br>
>         > > ><br>
>         ______________________________________________<br>
>         > > >                                 From: "Ketan Maheshwari"<br>
>         > > ><br>
>         <<a href="mailto:ketancmaheshwari@gmail.com">ketancmaheshwari@gmail.com</a>><br>
>         > > >                                 To: "Michael Wilde"<br>
>         > > >                                 <<a href="mailto:wilde@mcs.anl.gov">wilde@mcs.anl.gov</a>><br>
>         > > >                                 Cc: "Papia Rizwan"<br>
>         > > ><br>
>         <<a href="mailto:papia.rizwan@gmail.com">papia.rizwan@gmail.com</a>>,<br>
>         > > >                                 "swift-devel<br>
>         > > >                                 Devel"<br>
>         <<a href="mailto:swift-devel@ci.uchicago.edu">swift-devel@ci.uchicago.edu</a>><br>
>         > > >                                 Sent: Monday, August 22,<br>
>         2011<br>
>         > > >                                 10:32:31<br>
>         > > >                                 AM<br>
>         > > >                                 Subject: Re: Blocker<br>
>         issue for 0.93:<br>
>         > > >                                 DSSAT script does not<br>
>         complete, 2nd<br>
>         > > >                                 coaster blocks dont<br>
>         start?<br>
>         > > ><br>
>         > > >                                 Mike,<br>
>         > > ><br>
>         > > ><br>
>         > > >                                 If I recall correctly,<br>
>         Papia has<br>
>         > > >                                 always been running her<br>
>         DSSAT app<br>
>         > > >                                 with<br>
>         > > >                                 0.92. She has not yet<br>
>         tried with<br>
>         > > >                                 0.93.<br>
>         > > >                                 I too tried with 0.92<br>
>         with her sites<br>
>         > > >                                 file settings.<br>
>         > > ><br>
>         > > ><br>
>         > > >                                 I once tried it with<br>
>         0.93 on PADS<br>
>         > > >                                 but<br>
>         > > >                                 could never get it<br>
>         running from<br>
>         > > >                                 the queue.<br>
>         > > ><br>
>         > > ><br>
>         > > >                                 I will give another try<br>
>         today as it<br>
>         > > >                                 might be that PADS was<br>
>         too busy last<br>
>         > > >                                 week. As I recall Jon<br>
>         was also<br>
>         > > >                                 struggling to get<br>
>         access.<br>
>         > > ><br>
>         > > ><br>
>         > > >                                 Regards,<br>
>         > > >                                 Ketan<br>
>         > > ><br>
>         > > >                                 On Mon, Aug 22, 2011 at<br>
>         10:24 AM,<br>
>         > > >                                 Michael Wilde<br>
>         <<a href="mailto:wilde@mcs.anl.gov">wilde@mcs.anl.gov</a>><br>
>         > > >                                 wrote:<br>
>         > > >                                         Papia, Ketan,<br>
>         > > ><br>
>         > > >                                         In reviewing<br>
>         0.93 work<br>
>         > > >                                         remaining with<br>
>         David, I<br>
>         > > >                                         remembered this<br>
>         issue.<br>
>         > > ><br>
>         > > >                                         You both<br>
>         reported that the<br>
>         > > >                                         DSSAT<br>
>         application script<br>
>         > > >                                         doesn't finish on<br>
>         PADS - it<br>
>         > > >                                         seems not to<br>
>         start the<br>
>         > > >                                         second<br>
>         > > >                                         round of coaster<br>
>         blocks that<br>
>         > > >                                         it needs to<br>
>         complete (as I<br>
>         > > >                                         recall, but this<br>
>         may not be<br>
>         > > >                                         correct). This<br>
>         needs to be<br>
>         > > >                                         researched and<br>
>         filed as a<br>
>         > > >                                         bug<br>
>         > > >                                         (or, an error in<br>
>         the sites<br>
>         > > >                                         spec needs to be<br>
>         identified<br>
>         > > >                                         and made clear<br>
>         in the site<br>
>         > > >                                         guide if it<br>
>         turns out to be<br>
>         > > >                                         the problem).<br>
>         > > ><br>
>         > > >                                         Possibly there<br>
>         is an issue<br>
>         > > >                                         with jobs<br>
>         failing at the end<br>
>         > > >                                         of the coaster<br>
>         blocks, and<br>
>         > > >                                         you<br>
>         > > >                                         don't have the<br>
>         necessary<br>
>         > > >                                         retry<br>
>         > > >                                         values set for<br>
>         the PADS<br>
>         > > >                                         site???<br>
>         > > ><br>
>         > > >                                         We need an<br>
>         example run with<br>
>         > > >                                         logs and full<br>
>         details. Can<br>
>         > > >                                         you<br>
>         > > >                                         try to re-create<br>
>         this with a<br>
>         > > >                                         much smaller<br>
>         initial<br>
>         > > >                                         allocation, and<br>
>         see if<br>
>         > > >                                         coasters is<br>
>         transitioning<br>
>         > > >                                         from<br>
>         > > >                                         its initial<br>
>         blocks to the<br>
>         > > >                                         next<br>
>         > > >                                         blocks?<br>
>         > > ><br>
>         > > >                                         Can you give<br>
>         this high prio<br>
>         > > >                                         for today?<br>
>         > > ><br>
>         > > >                                         Thanks,<br>
>         > > ><br>
>         > > >                                         - Mike<br>
>         > > ><br>
>         > > ><br>
>         > > ><br>
>         > > ><br>
>         > > >                                 --<br>
>         > > >                                 Ketan<br>
>         > > ><br>
>         > > ><br>
>         > > ><br>
>         > > ><br>
>         > > ><br>
>         > > ><br>
>         > > >                         --<br>
>         > > >                         Michael Wilde<br>
>         > > >                         Computation Institute,<br>
>         University of Chicago<br>
>         > > >                         Mathematics and Computer Science<br>
>         Division<br>
>         > > >                         Argonne National Laboratory<br>
>         > > ><br>
>         > > ><br>
>         > > ><br>
>         > > ><br>
>         > > ><br>
>         > > ><br>
>         > > >                 --<br>
>         > > >                 Michael Wilde<br>
>         > > >                 Computation Institute, University of<br>
>         Chicago<br>
>         > > >                 Mathematics and Computer Science<br>
>         Division<br>
>         > > >                 Argonne National Laboratory<br>
>         > > ><br>
>         > > ><br>
>         > > ><br>
>         > > ><br>
>         > > ><br>
>         > > ><br>
>         > > >         --<br>
>         > > >         Ketan<br>
>         > > ><br>
>         > > ><br>
>         > > ><br>
>         > > ><br>
>         > > ><br>
>         > > ><br>
>         > > ><br>
>         > > > --<br>
>         > > > Ketan<br>
>         > > ><br>
>         > > ><br>
>         > > ><br>
>         > > > _______________________________________________<br>
>         > > > Swift-devel mailing list<br>
>         > > > <a href="mailto:Swift-devel@ci.uchicago.edu">Swift-devel@ci.uchicago.edu</a><br>
>         > > ><br>
>         <a href="https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel" target="_blank">https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel</a><br>
>         ><br>
><br>
><br>
><br>
><br>
><br>
><br>
><br>
> --<br>
> Ketan<br>
><br>
><br>
><br>
<br>
<br>
</div></div></blockquote></div><br><br clear="all"><div><br></div>-- <br>Ketan<br><br><br>
</div>