[Swift-devel] Blocker issue for 0.93: DSSAT script does not complete, 2nd coaster blocks dont start?
Mihael Hategan
hategan at mcs.anl.gov
Tue Aug 23 14:36:00 CDT 2011
mike at blabla:~/tmp$ grep "heap" catsn-20110823-1116-94roxc18.log
2011-08-23 11:16:49,904-0500 DEBUG Loader Max heap: 257294336
2011-08-23 11:38:50,957-0500 DEBUG VDL2ExecutionContext
java.lang.OutOfMemoryError: Java heap space
java.lang.OutOfMemoryError: Java heap space
Caused by: java.lang.OutOfMemoryError: Java heap space
Caused by: java.lang.OutOfMemoryError: Java heap space
java.lang.OutOfMemoryError: Java heap space
Caused by: java.lang.OutOfMemoryError: Java heap space
Caused by: java.lang.OutOfMemoryError: Java heap space
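For anyone who wants to run the same check on another log, here is a minimal sketch in Python. The function name `summarize_heap` is hypothetical, and the sample lines are simply the grep output above. It reads the "Max heap" value the loader reported and counts the lines that mention OutOfMemoryError. It only restates what the log says and does not diagnose why the heap filled up.

```python
import re

# Matches the heap size (in bytes) reported by the loader at startup.
MAX_HEAP_RE = re.compile(r"Max heap: (\d+)")

def summarize_heap(log_lines):
    """Return (max_heap_bytes or None, number of OutOfMemoryError lines)."""
    max_heap = None
    oom_count = 0
    for line in log_lines:
        match = MAX_HEAP_RE.search(line)
        if match and max_heap is None:
            max_heap = int(match.group(1))  # keep the first reported value
        if "java.lang.OutOfMemoryError" in line:
            oom_count += 1
    return max_heap, oom_count

# Sample input: the grep output quoted in this message.
sample = [
    "2011-08-23 11:16:49,904-0500 DEBUG Loader Max heap: 257294336",
    "2011-08-23 11:38:50,957-0500 DEBUG VDL2ExecutionContext",
    "java.lang.OutOfMemoryError: Java heap space",
    "java.lang.OutOfMemoryError: Java heap space",
    "Caused by: java.lang.OutOfMemoryError: Java heap space",
    "Caused by: java.lang.OutOfMemoryError: Java heap space",
    "java.lang.OutOfMemoryError: Java heap space",
    "Caused by: java.lang.OutOfMemoryError: Java heap space",
    "Caused by: java.lang.OutOfMemoryError: Java heap space",
]

max_heap, oom_count = summarize_heap(sample)
print(f"max heap: {max_heap / 2**20:.1f} MiB, OOM lines: {oom_count}")
```

On a real log, the list of lines would come from opening the file and iterating over it instead of using the inline sample.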
On Tue, 2011-08-23 at 11:47 -0500, Ketan Maheshwari wrote:
> Hello Mike,
>
>
> I tried another run with 30K tasks on PADS. This run stopped after
> completing 16K+ tasks.
>
>
> The log file is:
> http://www.ci.uchicago.edu/~ketan/catsn-20110823-1116-94roxc18.log
>
>
> The exception messages I get are attached to this mail.
>
>
> Looking at the messages, it seems the coasters are unable to restart
> the submit block once the walltime for a run has expired.
>
>
> Regards,
> Ketan
>
> On Tue, Aug 23, 2011 at 11:06 AM, Ketan Maheshwari
> <ketancmaheshwari at gmail.com> wrote:
> Mike,
>
>
> This looks like the issue of coaster blocks not restarting. I
> can try the same run again and see whether it persists.
>
>
> On Tue, Aug 23, 2011 at 11:04 AM, Michael Wilde
> <wilde at mcs.anl.gov> wrote:
> Ketan,
>
>
> Should I ask David to try to replicate this problem?
>
>
> Did you figure out why your jobs are not starting on
> PADS?
>
>
> - Mike
>
>
>
> ______________________________________________________
> From: "Michael Wilde" <wilde at mcs.anl.gov>
> To: "Ketan Maheshwari"
> <ketancmaheshwari at gmail.com>, "Mihael Hategan"
> <hategan at mcs.anl.gov>
> Cc: "swift-devel Devel"
> <swift-devel at ci.uchicago.edu>, "Papia Rizwan"
> <papia.rizwan at gmail.com>
> Sent: Monday, August 22, 2011 10:47:56 AM
>
>
> Subject: Re: Blocker issue for 0.93: DSSAT
> script does not complete, 2nd coaster blocks
> dont start?
>
> Can you try this on PADS using small jobs in
> the fast queue?
>
> I have not thought this all the way through,
> but perhaps coasters will honor maxtime and
> maxwalltime on any coaster block, even if it's
> not running on a batch scheduler. In that
> case perhaps you can replicate the problem on
> the MCS pool or, better yet, on localhost.
>
>
> In these runs, what was the value of
> the execution.retries and lazy.errors flags?
> Mihael, do those properties need to be set to
> >0 and true, respectively, in order for
> coasters to start new blocks correctly,
> assuming that in some cases a job will run
> longer than its maxwalltime?
>
>
> - Mike
>
>
>
> ______________________________________________
> From: "Ketan Maheshwari"
> <ketancmaheshwari at gmail.com>
> To: "Michael Wilde"
> <wilde at mcs.anl.gov>
> Cc: "Papia Rizwan"
> <papia.rizwan at gmail.com>, "swift-devel
> Devel" <swift-devel at ci.uchicago.edu>
> Sent: Monday, August 22, 2011 10:32:31
> AM
> Subject: Re: Blocker issue for 0.93:
> DSSAT script does not complete, 2nd
> coaster blocks dont start?
>
> Mike,
>
>
> If I recall correctly, Papia has
> always been running her DSSAT app with
> 0.92. She has not yet tried with 0.93.
> I too tried with 0.92 with her sites
> file settings.
>
>
> I once tried it with 0.93 on PADS, but the
> jobs never got from the queue into the
> running state.
>
>
> I will give another try today as it
> might be that PADS was too busy last
> week. As I recall Jon was also
> struggling to get access.
>
>
> Regards,
> Ketan
>
> On Mon, Aug 22, 2011 at 10:24 AM,
> Michael Wilde <wilde at mcs.anl.gov>
> wrote:
> Papia, Ketan,
>
> In reviewing 0.93 work
> remaining with David, I
> remembered this issue.
>
> You both reported that the
> DSSAT application script
> doesn't finish on PADS - it
> seems not to start the second
> round of coaster blocks that
> it needs to complete (as I
> recall, but this may not be
> correct). This needs to be
> researched and filed as a bug
> (or, an error in the sites
> spec needs to be identified
> and made clear in the site
> guide if it turns out to be
> the problem).
>
> Possibly there is an issue
> with jobs failing at the end
> of the coaster blocks, and you
> don't have the necessary retry
> values set for the PADS
> site?
>
> We need an example run with
> logs and full details. Can you
> try to re-create this with a
> much smaller initial
> allocation, and see if
> coasters is transitioning from
> its initial blocks to the next
> blocks?
>
> Can you give this high prio
> for today?
>
> Thanks,
>
> - Mike
>
>
>
>
> --
> Ketan
>
>
>
>
>
>
> --
> Michael Wilde
> Computation Institute, University of Chicago
> Mathematics and Computer Science Division
> Argonne National Laboratory