<div>Hello Mike,</div><div><br></div>I tried another run with 30K tasks on PADS. This run stopped after completing 16K+ tasks.<div><br></div><div>The log file is: <a href="http://www.ci.uchicago.edu/~ketan/catsn-20110823-1116-94roxc18.log">http://www.ci.uchicago.edu/~ketan/catsn-20110823-1116-94roxc18.log</a></div>
<div><br></div><div>The exception messages I get are attached to this mail.<br><br></div><div>Looking at the messages, it seems the coasters are unable to restart the submit block once the walltime for a run has expired.</div>
<div><br></div><div>Regards,</div><div>Ketan</div><div><br><div class="gmail_quote">On Tue, Aug 23, 2011 at 11:06 AM, Ketan Maheshwari <span dir="ltr"><<a href="mailto:ketancmaheshwari@gmail.com">ketancmaheshwari@gmail.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"><div>Mike,</div><div><br></div>This looks like the issue of coaster blocks not restarting. I can repeat the same run and see whether it persists.<br>
<br><div><div><div></div><div class="h5"><div class="gmail_quote">On Tue, Aug 23, 2011 at 11:04 AM, Michael Wilde <span dir="ltr"><<a href="mailto:wilde@mcs.anl.gov" target="_blank">wilde@mcs.anl.gov</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div><div style="font-family:Times New Roman;font-size:12pt;color:#000000">Ketan,<div><br></div><div>Should I ask David to try to replicate this problem?</div>
<div><br></div><div>Did you figure out why your jobs are not starting on PADS?</div><div><br></div><div>- Mike</div><div><br><br><hr><blockquote style="border-left:2px solid rgb(16, 16, 255);margin-left:5px;padding-left:5px">
<b>From: </b>"Michael Wilde" <<a href="mailto:wilde@mcs.anl.gov" target="_blank">wilde@mcs.anl.gov</a>><br><b>To: </b>"Ketan Maheshwari" <<a href="mailto:ketancmaheshwari@gmail.com" target="_blank">ketancmaheshwari@gmail.com</a>>, "Mihael Hategan" <<a href="mailto:hategan@mcs.anl.gov" target="_blank">hategan@mcs.anl.gov</a>><br>
<b>Cc: </b>"swift-devel Devel" <<a href="mailto:swift-devel@ci.uchicago.edu" target="_blank">swift-devel@ci.uchicago.edu</a>>, "Papia Rizwan" <<a href="mailto:papia.rizwan@gmail.com" target="_blank">papia.rizwan@gmail.com</a>><br>
<b>Sent: </b>Monday, August 22, 2011 10:47:56 AM<div><div></div><div><br><b>Subject: </b>Re: Blocker issue for 0.93: DSSAT script does not complete, 2nd coaster blocks dont start?<br><br><div style="font-family:Times New Roman;font-size:12pt;color:#000000">
Can you try this on PADS using small jobs in the fast queue?<br><br><div>I have not thought this all the way through, but perhaps coasters will honor maxtime and maxwalltime on any coaster block, even if it's not running on a batch scheduler. In that case perhaps you can replicate the problem on the MCS pool or, better yet, on localhost.</div>
<div><br></div><div>In these runs, what was the value of the execution.retries and lazy.errors flags? Mihael, do those properties need to be set to >0 and true, respectively, in order for coasters to start new blocks correctly, assuming that in some cases a job will run longer than its maxwalltime?</div>
<div><br></div><div>- Mike</div><div><div><br></div><hr><blockquote style="border-left:2px solid rgb(16, 16, 255);margin-left:5px;padding-left:5px"><b>From: </b>"Ketan Maheshwari" <<a href="mailto:ketancmaheshwari@gmail.com" target="_blank">ketancmaheshwari@gmail.com</a>><br>
<b>To: </b>"Michael Wilde" <<a href="mailto:wilde@mcs.anl.gov" target="_blank">wilde@mcs.anl.gov</a>><br><b>Cc: </b>"Papia Rizwan" <<a href="mailto:papia.rizwan@gmail.com" target="_blank">papia.rizwan@gmail.com</a>>, "swift-devel Devel" <<a href="mailto:swift-devel@ci.uchicago.edu" target="_blank">swift-devel@ci.uchicago.edu</a>><br>
<b>Sent: </b>Monday, August 22, 2011 10:32:31 AM<br><b>Subject: </b>Re: Blocker issue for 0.93: DSSAT script does not complete, 2nd coaster blocks dont start?<br><br>Mike,<div><br></div><div>If I recall correctly, Papia has always been running her DSSAT app with 0.92. She has not yet tried with 0.93. I too tried with 0.92 with her sites file settings.</div>
<div><br></div><div>I once tried it with 0.93 on PADS, but my jobs never got from the queue into the running state.</div>
<div><br></div><div>I will give it another try today, as PADS may have been too busy last week. As I recall, Jon was also struggling to get access.<br><br></div><div>Regards,</div><div>Ketan</div><div><br><div class="gmail_quote">
On Mon, Aug 22, 2011 at 10:24 AM, Michael Wilde <span dir="ltr"><<a href="mailto:wilde@mcs.anl.gov" target="_blank">wilde@mcs.anl.gov</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
Papia, Ketan,<br>
<br>
In reviewing 0.93 work remaining with David, I remembered this issue.<br>
<br>
You both reported that the DSSAT application script doesn't finish on PADS: it seems not to start the second round of coaster blocks that it needs to complete (as I recall, but this may not be correct). This needs to be researched and filed as a bug (or, if an error in the sites spec turns out to be the problem, that error needs to be identified and made clear in the site guide).<br>
<br>
Possibly there is an issue with jobs failing at the end of the coaster blocks, and you don't have the necessary retry values set for the PADS site?<br>
<br>
We need an example run with logs and full details. Can you try to re-create this with a much smaller initial allocation, and see if coasters is transitioning from its initial blocks to the next blocks?<br>
<br>
Can you give this high priority for today?<br>
<br>
Thanks,<br>
<br>
- Mike<br>
</blockquote></div><br><br clear="all"><div><br></div>-- <br>Ketan<br><br><br>
</div>
</blockquote><br><span><br><br>-- <br><span></span>Michael Wilde<br>Computation Institute, University of Chicago<br>Mathematics and Computer Science Division<br>Argonne National Laboratory<br><span></span><br></span></div>
</div></div></div></blockquote><div><div></div><div><br><span><br><br>-- <br><span name="x"></span>Michael Wilde<br>Computation Institute, University of Chicago<br>Mathematics and Computer Science Division<br>Argonne National Laboratory<br>
<span name="x"></span><br></span></div></div></div></div></div></blockquote></div><br><br clear="all"><div><br></div></div></div>-- <br><font color="#888888">Ketan<br><br><br>
</font></div>
</blockquote></div><br><br clear="all"><div><br></div>-- <br>Ketan<br><br><br>
</div>