[Swift-devel] Blocker issue for 0.93: DSSAT script does not complete, 2nd coaster blocks dont start?

Ketan Maheshwari ketancmaheshwari at gmail.com
Tue Aug 23 11:47:48 CDT 2011


Hello Mike,

I tried another run with 30K tasks on PADS. This run stopped after
completing 16K+ tasks.

The log file is:
http://www.ci.uchicago.edu/~ketan/catsn-20110823-1116-94roxc18.log

The exception messages I get are attached to this mail.

Looking at the messages, it seems the coasters are unable to restart the
submit block once the walltime has expired for a run.

Regards,
Ketan

On Tue, Aug 23, 2011 at 11:06 AM, Ketan Maheshwari <
ketancmaheshwari at gmail.com> wrote:

> Mike,
>
> This looks like the issue of coaster blocks not restarting. I can try the
> same run again and see if the problem persists.
>
> On Tue, Aug 23, 2011 at 11:04 AM, Michael Wilde <wilde at mcs.anl.gov> wrote:
>
>> Ketan,
>>
>> Should I ask David to try to replicate this problem?
>>
>> Did you figure out why your jobs are not starting on PADS?
>>
>> - Mike
>>
>>
>> ------------------------------
>>
>> *From: *"Michael Wilde" <wilde at mcs.anl.gov>
>> *To: *"Ketan Maheshwari" <ketancmaheshwari at gmail.com>, "Mihael Hategan" <
>> hategan at mcs.anl.gov>
>> *Cc: *"swift-devel Devel" <swift-devel at ci.uchicago.edu>, "Papia Rizwan" <
>> papia.rizwan at gmail.com>
>> *Sent: *Monday, August 22, 2011 10:47:56 AM
>>
>> *Subject: *Re: Blocker issue for 0.93: DSSAT script does not complete,
>> 2nd coaster blocks dont start?
>>
>> Can you try this on PADS using small jobs in the fast queue?
>>
>> I have not thought this all the way through, but perhaps coasters will
>> honor maxtime and maxwalltime on any coaster block, even if it's not running
>> on a batch scheduler.  In that case perhaps you can replicate the problem on
>> the MCS pool or, better yet, on localhost.
>>
>> In these runs, what was the value of the execution.retries and lazy.errors
>> flags?  Mihael, do those properties need to be set to >0 and true,
>> respectively, in order for coasters to start new blocks correctly, assuming
>> that in some cases a job will run longer than its maxwalltime?
>>
>> - Mike
>>
>> ------------------------------
>>
>> *From: *"Ketan Maheshwari" <ketancmaheshwari at gmail.com>
>> *To: *"Michael Wilde" <wilde at mcs.anl.gov>
>> *Cc: *"Papia Rizwan" <papia.rizwan at gmail.com>, "swift-devel Devel" <
>> swift-devel at ci.uchicago.edu>
>> *Sent: *Monday, August 22, 2011 10:32:31 AM
>> *Subject: *Re: Blocker issue for 0.93: DSSAT script does not complete,
>> 2nd coaster blocks dont start?
>>
>> Mike,
>>
>> If I recall correctly, Papia has always been running her DSSAT app with
>> 0.92. She has not yet tried with 0.93. I too tried with 0.92 with her sites
>> file settings.
>>
>> I once tried it with 0.93 on PADS, but the jobs never moved from the queue
>> to the running state.
>>
>> I will give it another try today, since PADS may have been too busy last
>> week. As I recall, Jon was also struggling to get access.
>>
>> Regards,
>> Ketan
>>
>> On Mon, Aug 22, 2011 at 10:24 AM, Michael Wilde <wilde at mcs.anl.gov> wrote:
>>
>>> Papia, Ketan,
>>>
>>> In reviewing 0.93 work remaining with David, I remembered this issue.
>>>
>>> You both reported that the DSSAT application script doesn't finish on PADS
>>> - it seems not to start the second round of coaster blocks that it needs to
>>> complete (as I recall, but this may not be correct).  This needs to be
>>> researched and filed as a bug (or, an error in the sites spec needs to be
>>> identified and made clear in the site guide if it turns out to be the
>>> problem).
>>>
>>> Possibly there is an issue with jobs failing at the end of the coaster
>>> blocks, and you don't have the necessary retry values set for the PADS
>>> site?
>>>
>>> We need an example run with logs and full details. Can you try to
>>> re-create this with a much smaller initial allocation, and see if coasters
>>> is transitioning from its initial blocks to the next blocks?
>>>
>>> Can you give this high priority today?
>>>
>>> Thanks,
>>>
>>> - Mike
>>>
>>
>>
>>
>> --
>> Ketan
>>
>>
>>
>>
>>
>> --
>> Michael Wilde
>> Computation Institute, University of Chicago
>> Mathematics and Computer Science Division
>> Argonne National Laboratory
>>
>>
>>
>
>
> --
> Ketan
>
>
>


-- 
Ketan
-------------- next part --------------
A non-text attachment was scrubbed...
Name: exceptions.pads
Type: application/octet-stream
Size: 4709 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20110823/46999602/attachment.obj>

