[Swift-devel] Blocker issue for 0.93: DSSAT script does not complete, 2nd coaster blocks dont start?

Michael Wilde wilde at mcs.anl.gov
Mon Aug 22 10:47:56 CDT 2011


Can you try this on PADS using small jobs in the fast queue? 


I have not thought this all the way through, but perhaps coasters will honor maxtime and maxwalltime on any coaster block, even if it's not running on a batch scheduler. In that case, perhaps you can replicate the problem on the MCS pool or, better yet, on localhost. 


In these runs, what was the value of the execution.retries and lazy.errors flags? Mihael, do those properties need to be set to >0 and true, respectively, in order for coasters to start new blocks correctly, assuming that in some cases a job will run longer than its maxwalltime? 


- Mike 


----- Original Message -----


From: "Ketan Maheshwari" <ketancmaheshwari at gmail.com> 
To: "Michael Wilde" <wilde at mcs.anl.gov> 
Cc: "Papia Rizwan" <papia.rizwan at gmail.com>, "swift-devel Devel" <swift-devel at ci.uchicago.edu> 
Sent: Monday, August 22, 2011 10:32:31 AM 
Subject: Re: Blocker issue for 0.93: DSSAT script does not complete, 2nd coaster blocks dont start? 

Mike, 


If I recall correctly, Papia has always been running her DSSAT app with 0.92. She has not yet tried with 0.93. I too tried with 0.92 with her sites file settings. 


I once tried it with 0.93 on PADS, but the job never made it from the queue into the running state. 


I will give another try today as it might be that PADS was too busy last week. As I recall Jon was also struggling to get access. 


Regards, 
Ketan 


On Mon, Aug 22, 2011 at 10:24 AM, Michael Wilde < wilde at mcs.anl.gov > wrote: 


Papia, Ketan, 

In reviewing 0.93 work remaining with David, I remembered this issue. 

You both reported that the DSSAT application script doesn't finish on PADS: it seems not to start the second round of coaster blocks that it needs to complete (as I recall, though this may not be correct). This needs to be researched and filed as a bug (or, if the problem turns out to be an error in the sites spec, that error needs to be identified and made clear in the site guide). 

Possibly there is an issue with jobs failing at the end of the coaster blocks, and you don't have the necessary retry values set for the PADS site? 

We need an example run with logs and full details. Can you try to re-create this with a much smaller initial allocation, and see if coasters is transitioning from its initial blocks to the next blocks? 

Can you give this high priority for today? 

Thanks, 

- Mike 




-- 
Ketan 





-- 
Michael Wilde 
Computation Institute, University of Chicago 
Mathematics and Computer Science Division 
Argonne National Laboratory 
