[Swift-devel] Blocker issue for 0.93: DSSAT script does not complete, 2nd coaster blocks dont start?

Ketan Maheshwari ketancmaheshwari at gmail.com
Mon Aug 22 21:59:15 CDT 2011


Hi,

I tried a large catsn run with 0.93 on PADS, with the number of tasks set to
100K.

At about 18K-19K, I saw a few error messages: an error on block shutdown
and some replytimeout exceptions.

The run was set up to test coaster block restart, so it ran on the fast
queue with a walltime of 16 minutes.

The log for the run is at:
http://ci.uchicago.edu/~ketan/catsn-20110822-1547-1ajivxte.log

The execution.retries value was 1 for these runs.

Regards,
Ketan


On Mon, Aug 22, 2011 at 10:47 AM, Michael Wilde <wilde at mcs.anl.gov> wrote:

> Can you try this on PADS using small jobs in the fast queue?
>
> I have not thought this all the way through, but perhaps coasters will
> honor maxtime and maxwalltime on any coaster block, even if it's not running
> on a batch scheduler.  In that case perhaps you can replicate the problem on
> the MCS pool or, better yet, on localhost.
>
> In these runs, what was the value of the execution.retries and lazy.errors
> flags?  Mihael, do those properties need to be set to >0 and true,
> respectively, in order for coasters to start new blocks correctly, assuming
> that in some cases a job will run longer than its maxwalltime?
>
> - Mike
>
> ------------------------------
>
> *From: *"Ketan Maheshwari" <ketancmaheshwari at gmail.com>
> *To: *"Michael Wilde" <wilde at mcs.anl.gov>
> *Cc: *"Papia Rizwan" <papia.rizwan at gmail.com>, "swift-devel Devel" <
> swift-devel at ci.uchicago.edu>
> *Sent: *Monday, August 22, 2011 10:32:31 AM
> *Subject: *Re: Blocker issue for 0.93: DSSAT script does not complete, 2nd
> coaster blocks dont start?
>
>
> Mike,
>
> If I recall correctly, Papia has always been running her DSSAT app with
> 0.92. She has not yet tried with 0.93. I too tried with 0.92 with her sites
> file settings.
>
> I once tried it with 0.93 on PADS, but the job never made it out of the
> queue into the running state.
>
> I will give it another try today, as PADS may have been too busy last
> week. As I recall, Jon was also struggling to get access.
>
> Regards,
> Ketan
>
> On Mon, Aug 22, 2011 at 10:24 AM, Michael Wilde <wilde at mcs.anl.gov> wrote:
>
>> Papia, Ketan,
>>
>> In reviewing 0.93 work remaining with David, I remembered this issue.
>>
>> You both reported that the DSSAT application script doesn't finish on PADS
>> - it seems not to start the second round of coaster blocks that it needs to
>> complete (as I recall, but this may not be correct).  This needs to be
>> researched and filed as a bug (or, an error in the sites spec needs to be
>> identified and made clear in the site guide if it turns out to be the
>> problem).
>>
>> Possibly there is an issue with jobs failing at the end of the coaster
>> blocks, and you don't have the necessary retry values set for the PADS
>> site?
>>
>> We need an example run with logs and full details. Can you try to
>> re-create this with a much smaller initial allocation, and see if coasters
>> is transitioning from its initial blocks to the next blocks?
>>
>> Can you give this high priority for today?
>>
>> Thanks,
>>
>> - Mike
>>
>
>
>
> --
> Ketan
>
>
>
>
>
> --
> Michael Wilde
> Computation Institute, University of Chicago
> Mathematics and Computer Science Division
> Argonne National Laboratory
>
>


-- 
Ketan