[Swift-devel] Blocker issue for 0.93: DSSAT script does not complete, 2nd coaster blocks dont start?

Mihael Hategan hategan at mcs.anl.gov
Tue Aug 23 14:36:00 CDT 2011


mike at blabla:~/tmp$ grep "heap" catsn-20110823-1116-94roxc18.log
2011-08-23 11:16:49,904-0500 DEBUG Loader Max heap: 257294336
2011-08-23 11:38:50,957-0500 DEBUG VDL2ExecutionContext
java.lang.OutOfMemoryError: Java heap space
java.lang.OutOfMemoryError: Java heap space
Caused by: java.lang.OutOfMemoryError: Java heap space
Caused by: java.lang.OutOfMemoryError: Java heap space
java.lang.OutOfMemoryError: Java heap space
Caused by: java.lang.OutOfMemoryError: Java heap space
Caused by: java.lang.OutOfMemoryError: Java heap space
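
For reference, the "Max heap" value above is in bytes: 257294336 bytes is about 245 MiB. The same check can be scripted. The following is only a minimal sketch in Python, not part of Swift; the function name summarize_heap and the sample list are hypothetical, and the sample lines are copied from the grep output above.

```python
import re

# Matches the "Max heap: <bytes>" line that the Loader writes at startup.
HEAP_RE = re.compile(r"Max heap: (\d+)")
OOM_MARKER = "java.lang.OutOfMemoryError"


def summarize_heap(lines):
    """Return (max_heap_bytes or None, count of lines mentioning OutOfMemoryError)."""
    max_heap = None
    oom_count = 0
    for line in lines:
        match = HEAP_RE.search(line)
        if match:
            max_heap = int(match.group(1))
        if OOM_MARKER in line:
            oom_count += 1
    return max_heap, oom_count


# Sample input: the lines from the grep output quoted in this message.
sample = [
    "2011-08-23 11:16:49,904-0500 DEBUG Loader Max heap: 257294336",
    "2011-08-23 11:38:50,957-0500 DEBUG VDL2ExecutionContext",
    "java.lang.OutOfMemoryError: Java heap space",
    "java.lang.OutOfMemoryError: Java heap space",
    "Caused by: java.lang.OutOfMemoryError: Java heap space",
    "Caused by: java.lang.OutOfMemoryError: Java heap space",
    "java.lang.OutOfMemoryError: Java heap space",
    "Caused by: java.lang.OutOfMemoryError: Java heap space",
    "Caused by: java.lang.OutOfMemoryError: Java heap space",
]

max_heap, oom_count = summarize_heap(sample)
print(f"max heap: {max_heap / 2**20:.1f} MiB, OOM lines: {oom_count}")
```

To run it against a real log, pass an open file object instead of the sample list, since the function only iterates over lines.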


On Tue, 2011-08-23 at 11:47 -0500, Ketan Maheshwari wrote:
> Hello Mike,
> 
> 
> I tried another run with 30K tasks on PADS. This run stopped after
> completing 16K+ tasks.
> 
> 
> The log file is:
> http://www.ci.uchicago.edu/~ketan/catsn-20110823-1116-94roxc18.log
> 
> 
> The exception messages I get are attached with the mail.
> 
> 
> Looking at the messages, it seems the coasters are unable to restart
> the submit block once the walltime has expired for a run.
> 
> 
> Regards,
> Ketan
> 
> On Tue, Aug 23, 2011 at 11:06 AM, Ketan Maheshwari
> <ketancmaheshwari at gmail.com> wrote:
>         Mike,
>         
>         
>         This looks like the issue of coaster blocks not restarting. I
>         can try the same run again and see if it persists.
>         
>         
>         On Tue, Aug 23, 2011 at 11:04 AM, Michael Wilde
>         <wilde at mcs.anl.gov> wrote:
>                 Ketan,
>                 
>                 
>                 Should I ask David to try to replicate this problem?
>                 
>                 
>                 Did you figure out why your jobs are not starting on
>                 PADS?
>                 
>                 
>                 - Mike
>                 
>                 
>                 
>                 ______________________________________________________
>                         From: "Michael Wilde" <wilde at mcs.anl.gov>
>                         To: "Ketan Maheshwari"
>                         <ketancmaheshwari at gmail.com>, "Mihael Hategan"
>                         <hategan at mcs.anl.gov>
>                         Cc: "swift-devel Devel"
>                         <swift-devel at ci.uchicago.edu>, "Papia Rizwan"
>                         <papia.rizwan at gmail.com>
>                         Sent: Monday, August 22, 2011 10:47:56 AM
>                         
>                         
>                         Subject: Re: Blocker issue for 0.93: DSSAT
>                         script does not complete, 2nd coaster blocks
>                         dont start?
>                         
>                         Can you try this on PADS using small jobs in
>                         the fast queue?
>                         
>                         I have not thought this all the way through,
>                         but perhaps coasters will honor maxtime and
>                         maxwalltime on any coaster block, even if it's
>                         not running on a batch scheduler.  In that
>                         case perhaps you can replicate the problem on
>                         the MCS pool or better yet on localhost.
>                         
>                         
>                         In these runs, what was the value of
>                         the execution.retries and lazy.errors flags?
>                         Mihael, do those properties need to be set to
>                         >0 and true, respectively, in order for
>                         coasters to start new blocks correctly,
>                         assuming that in some cases a job will run
>                         longer than its maxwalltime?
>                         
>                         
>                         - Mike
>                         
>                         
>                         
>                         ______________________________________________
>                                 From: "Ketan Maheshwari"
>                                 <ketancmaheshwari at gmail.com>
>                                 To: "Michael Wilde"
>                                 <wilde at mcs.anl.gov>
>                                 Cc: "Papia Rizwan"
>                                 <papia.rizwan at gmail.com>, "swift-devel
>                                 Devel" <swift-devel at ci.uchicago.edu>
>                                 Sent: Monday, August 22, 2011 10:32:31
>                                 AM
>                                 Subject: Re: Blocker issue for 0.93:
>                                 DSSAT script does not complete, 2nd
>                                 coaster blocks dont start?
>                                 
>                                 Mike,
>                                 
>                                 
>                                 If I recall correctly, Papia has
>                                 always been running her DSSAT app with
>                                 0.92. She has not yet tried with 0.93.
>                                 I too tried with 0.92 with her sites
>                                 file settings.
>                                 
>                                 
>                                 I once tried it with 0.93 on PADS, but
>                                 the jobs never got out of the queue and
>                                 into the running state.
>                                 
>                                 
>                                 I will give another try today as it
>                                 might be that PADS was too busy last
>                                 week. As I recall Jon was also
>                                 struggling to get access.
>                                 
>                                 
>                                 Regards,
>                                 Ketan
>                                 
>                                 On Mon, Aug 22, 2011 at 10:24 AM,
>                                 Michael Wilde <wilde at mcs.anl.gov>
>                                 wrote:
>                                         Papia, Ketan,
>                                         
>                                         In reviewing 0.93 work
>                                         remaining with David, I
>                                         remembered this issue.
>                                         
>                                         You both reported that the
>                                         DSSAT application script
>                                         doesn't finish on PADS - it
>                                         seems not to start the second
>                                         round of coaster blocks that
>                                         it needs to complete (as I
>                                         recall, but this may not be
>                                         correct).  This needs to be
>                                         researched and filed as a bug
>                                         (or, an error in the sites
>                                         spec needs to be identified
>                                         and made clear in the site
>                                         guide if it turns out to be
>                                         the problem).
>                                         
>                                         Possibly there is an issue
>                                         with jobs failing at the end
>                                         of the coaster blocks, and you
>                                         don't have the necessary retry
>                                         values set for the PADS
>                                         site?
>                                         
>                                         We need an example run with
>                                         logs and full details. Can you
>                                         try to re-create this with a
>                                         much smaller initial
>                                         allocation, and see if
>                                         coasters is transitioning from
>                                         its initial blocks to the next
>                                         blocks?
>                                         
>                                         Can you give this high
>                                         priority for today?
>                                         
>                                         Thanks,
>                                         
>                                         - Mike
>                                 
>                                 
>                                 
>                                 
>                                 -- 
>                                 Ketan
>                                 
>                                 
>                                 
>                         
>                         
>                         
>                         -- 
>                         Michael Wilde
>                         Computation Institute, University of Chicago
>                         Mathematics and Computer Science Division
>                         Argonne National Laboratory
>                         
>                         
>                 
>                 
>                 
>                 
>                 -- 
>                 Michael Wilde
>                 Computation Institute, University of Chicago
>                 Mathematics and Computer Science Division
>                 Argonne National Laboratory
>                 
>                 
>         
>         
>         
>         
>         -- 
>         Ketan
>         
>         
>         
> 
> 
> 
> 
> -- 
> Ketan
> 
> 
> 
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel




