[Swift-devel] Blocker issue for 0.93: DSSAT script does not complete, 2nd coaster blocks dont start?
Ketan Maheshwari
ketancmaheshwari at gmail.com
Tue Aug 23 15:46:04 CDT 2011
Hi,
I tried a smaller run: catsnsleep with sleeptime=60 sec, n=20, walltime=2 min.
The run did complete, but I saw this in the middle (I suppose at the end
of the first walltime slot):
Command(13, HEARTBEAT): handling reply timeout; sendReqTime=110823-153059.847, sendTime=110823-153059.847, now=110823-153259.860
Command(13, HEARTBEAT)fault was: Reply timeout
org.globus.cog.karajan.workflow.service.ReplyTimeoutException
    at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:288)
    at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:293)
    at java.util.TimerThread.mainLoop(Timer.java:512)
    at java.util.TimerThread.run(Timer.java:462)
Command(13, HEARTBEAT)fault was: Invalid channel: 914784201: {}
org.globus.cog.karajan.workflow.service.ReplyTimeoutException
    at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:288)
    at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:293)
    at java.util.TimerThread.mainLoop(Timer.java:512)
    at java.util.TimerThread.run(Timer.java:462)
Heartbeat failed: Invalid channel: 914784201: {}
org.globus.cog.karajan.workflow.service.ReplyTimeoutException
    at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:288)
    at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:293)
    at java.util.TimerThread.mainLoop(Timer.java:512)
    at java.util.TimerThread.run(Timer.java:462)
The log is attached. I will try a longer run with more heap memory.
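
The sendTime and now values in the HEARTBEAT message are roughly two minutes apart. Whether that gap reflects the 2 min walltime or a fixed reply timeout is not established here. A minimal sketch of the arithmetic, assuming the stamps use a yyMMdd-HHmmss.SSS layout (elapsed_seconds is a hypothetical helper for illustration, not part of Swift or CoG):

```python
from datetime import datetime

# Assumed layout of the timestamps in the HEARTBEAT message: yyMMdd-HHmmss.SSS
STAMP_FORMAT = "%y%m%d-%H%M%S.%f"


def elapsed_seconds(start: str, end: str) -> float:
    """Return the number of seconds between two log timestamps."""
    t0 = datetime.strptime(start, STAMP_FORMAT)
    t1 = datetime.strptime(end, STAMP_FORMAT)
    return (t1 - t0).total_seconds()


if __name__ == "__main__":
    send_time = "110823-153059.847"  # sendTime from the HEARTBEAT message
    now = "110823-153259.860"        # now from the same message
    print(f"{elapsed_seconds(send_time, now):.3f} s")
```

Comparing this gap against the configured walltime for other runs may help tell a block expiry apart from an unrelated channel failure.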
Regards,
Ketan
On Tue, Aug 23, 2011 at 3:21 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> On Tue, 2011-08-23 at 14:46 -0500, Michael Wilde wrote:
> > Well, I raised that issue, but Ketan claimed that the failure to start
> > more jobs occurs without that message as well.
>
> Fine. I'll need the log from that then.
>
> >
> > Do you believe that the Out of Mem error is the root cause?
> >
> > Ketan, can you point to logs without the OOM error?
> >
> > Can you re-run the catsn with more memory?
> >
> > And more importantly: can you run a *very small* catsnsleep test where
> > you carefully craft the sleep times and settings to cause one (very short
> > duration) coaster block to time out and verify that a new block is submitted
> > in a new job and that the script runs to completion?
> >
> > I suggested in the ticket that David do this; can you both discuss and
> > see who is better positioned to do this sooner, so we can decide if we have
> > a blocker here, or just something that needs better configuration and
> > perhaps a note in the user guide telling users what to watch out for in this
> > regard? (I think, for example, we do not currently explain in the user guide
> > how and when to increase memory.) Nor are we clear enough on the issues
> > around maxtime, maxwalltime, and the sizing of coaster blocks.
> >
> > Thanks,
> >
> > - Mike
> >
> >
> > ----- Original Message -----
> > > From: "Mihael Hategan" <hategan at mcs.anl.gov>
> > > To: "Ketan Maheshwari" <ketancmaheshwari at gmail.com>
> > > Cc: "Michael Wilde" <wilde at mcs.anl.gov>, "Swift Devel" <swift-devel at ci.uchicago.edu>
> > > Sent: Tuesday, August 23, 2011 2:36:00 PM
> > > Subject: Re: [Swift-devel] Blocker issue for 0.93: DSSAT script does not complete, 2nd coaster blocks dont start?
> > > mike at blabla:~/tmp$ grep "heap" catsn-20110823-1116-94roxc18.log
> > > 2011-08-23 11:16:49,904-0500 DEBUG Loader Max heap: 257294336
> > > 2011-08-23 11:38:50,957-0500 DEBUG VDL2ExecutionContext java.lang.OutOfMemoryError: Java heap space
> > > java.lang.OutOfMemoryError: Java heap space
> > > Caused by: java.lang.OutOfMemoryError: Java heap space
> > > Caused by: java.lang.OutOfMemoryError: Java heap space
> > > java.lang.OutOfMemoryError: Java heap space
> > > Caused by: java.lang.OutOfMemoryError: Java heap space
> > > Caused by: java.lang.OutOfMemoryError: Java heap space
> > >
> > >
> > > On Tue, 2011-08-23 at 11:47 -0500, Ketan Maheshwari wrote:
> > > > Hello Mike,
> > > >
> > > >
> > > > I tried another run with 30K tasks on PADS. This run stopped after
> > > > completing 16K+ tasks.
> > > >
> > > >
> > > > The log file is:
> > > > http://www.ci.uchicago.edu/~ketan/catsn-20110823-1116-94roxc18.log
> > > >
> > > >
> > > > > The exception messages I get are attached to this mail.
> > > > >
> > > > >
> > > > > Looking at the messages, it seems the coasters are unable to restart
> > > > > the submit block once the walltime has expired for a run.
> > > >
> > > >
> > > > Regards,
> > > > Ketan
> > > >
> > > > On Tue, Aug 23, 2011 at 11:06 AM, Ketan Maheshwari
> > > > <ketancmaheshwari at gmail.com> wrote:
> > > > Mike,
> > > >
> > > >
> > > > > This looks like the issue of coaster blocks not restarting. I
> > > > > can repeat the same run and see if it persists.
> > > >
> > > >
> > > > On Tue, Aug 23, 2011 at 11:04 AM, Michael Wilde
> > > > <wilde at mcs.anl.gov> wrote:
> > > > Ketan,
> > > >
> > > >
> > > > Should I ask David to try to replicate this problem?
> > > >
> > > >
> > > > Did you figure out why your jobs are not starting on
> > > > PADS?
> > > >
> > > >
> > > > - Mike
> > > >
> > > >
> > > >
> > > >
> > > > ______________________________________________________
> > > > From: "Michael Wilde" <wilde at mcs.anl.gov>
> > > > To: "Ketan Maheshwari" <ketancmaheshwari at gmail.com>, "Mihael Hategan" <hategan at mcs.anl.gov>
> > > > Cc: "swift-devel Devel" <swift-devel at ci.uchicago.edu>, "Papia Rizwan" <papia.rizwan at gmail.com>
> > > > Sent: Monday, August 22, 2011 10:47:56 AM
> > > > Subject: Re: Blocker issue for 0.93: DSSAT script does not complete, 2nd coaster blocks dont start?
> > > >
> > > > Can you try this on PADS using small jobs in
> > > > the fast queue?
> > > >
> > > > I have not thought this all the way through,
> > > > but perhaps coasters will honor maxtime and
> > > > maxwalltime on any coaster block, even if
> > > > it's not running on a batch scheduler. In that
> > > > case perhaps you can replicate the problem
> > > > on the MCS pool or, better yet, on localhost.
> > > >
> > > >
> > > > In these runs, what was the value of
> > > > the execution.retries and lazy.errors flags?
> > > > Mihael, do those properties need to be set to >0
> > > > and true, respectively, in order for
> > > > coasters to start new blocks correctly,
> > > > assuming that in some cases a job will run
> > > > longer than its maxwalltime?
> > > >
> > > >
> > > > - Mike
> > > >
> > > >
> > > >
> > > >
> > > > ______________________________________________
> > > > From: "Ketan Maheshwari" <ketancmaheshwari at gmail.com>
> > > > To: "Michael Wilde" <wilde at mcs.anl.gov>
> > > > Cc: "Papia Rizwan" <papia.rizwan at gmail.com>, "swift-devel Devel" <swift-devel at ci.uchicago.edu>
> > > > Sent: Monday, August 22, 2011 10:32:31 AM
> > > > Subject: Re: Blocker issue for 0.93: DSSAT script does not complete, 2nd coaster blocks dont start?
> > > >
> > > > Mike,
> > > >
> > > >
> > > > If I recall correctly, Papia has always been running her DSSAT app
> > > > with 0.92. She has not yet tried with 0.93. I too tried with 0.92,
> > > > using her sites file settings.
> > > >
> > > >
> > > > I once tried it with 0.93 on PADS, but my jobs never got out of the
> > > > queue and into the running state.
> > > >
> > > >
> > > > I will give it another try today, as it might be that PADS was too
> > > > busy last week. As I recall, Jon was also struggling to get access.
> > > >
> > > >
> > > > Regards,
> > > > Ketan
> > > >
> > > > On Mon, Aug 22, 2011 at 10:24 AM,
> > > > Michael Wilde <wilde at mcs.anl.gov>
> > > > wrote:
> > > > Papia, Ketan,
> > > >
> > > > In reviewing 0.93 work remaining with David, I remembered this issue.
> > > >
> > > > You both reported that the DSSAT application script doesn't finish on
> > > > PADS - it seems not to start the second round of coaster blocks that
> > > > it needs to complete (as I recall, but this may not be correct). This
> > > > needs to be researched and filed as a bug (or, an error in the sites
> > > > spec needs to be identified and made clear in the site guide if it
> > > > turns out to be the problem).
> > > >
> > > > Possibly there is an issue with jobs failing at the end of the coaster
> > > > blocks, and you don't have the necessary retry values set for the
> > > > PADS site?
> > > >
> > > > We need an example run with logs and full details. Can you try to
> > > > re-create this with a much smaller initial allocation, and see if
> > > > coasters is transitioning from its initial blocks to the next blocks?
> > > >
> > > > Can you give this high priority for today?
> > > >
> > > > Thanks,
> > > >
> > > > - Mike
> > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Ketan
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Michael Wilde
> > > > Computation Institute, University of Chicago
> > > > Mathematics and Computer Science Division
> > > > Argonne National Laboratory
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Michael Wilde
> > > > Computation Institute, University of Chicago
> > > > Mathematics and Computer Science Division
> > > > Argonne National Laboratory
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Ketan
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Ketan
> > > >
> > > >
> > > >
> > > > _______________________________________________
> > > > Swift-devel mailing list
> > > > Swift-devel at ci.uchicago.edu
> > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
> >
>
>
--
Ketan
-------------- next part --------------
A non-text attachment was scrubbed...
Name: catsnsleep-20110823-1527-b6kpd8h9.log
Type: application/octet-stream
Size: 147190 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20110823/860637a2/attachment.obj>