[Swift-devel] Blocker issue for 0.93: DSSAT script does not complete, 2nd coaster blocks dont start?
Ketan Maheshwari
ketancmaheshwari at gmail.com
Tue Aug 23 20:12:08 CDT 2011
Hi again,
I tried a larger run on PADS with a similar sleep time but a larger n.
The run seemed to be progressing well (I killed it by mistake), but the log
does show some coaster block shutdown and network-related exception
messages.
Attached is the log.
Regards,
Ketan
On Tue, Aug 23, 2011 at 4:05 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> That's benign, but I committed a patch to prevent it from happening in
> cog r3237.
>
> On Tue, 2011-08-23 at 15:46 -0500, Ketan Maheshwari wrote:
> > Hi,
> >
> >
> > I tried a smaller run, catsnsleep, sleeptime 60sec, n=20,
> > walltime=2min
> >
> >
> > The run did complete, but I saw this in the middle (I suppose at
> > the end of the first walltime slot):
> >
> >
> > Command(13, HEARTBEAT): handling reply timeout; sendReqTime=110823-153059.847, sendTime=110823-153059.847, now=110823-153259.860
> > Command(13, HEARTBEAT)fault was: Reply timeout
> > org.globus.cog.karajan.workflow.service.ReplyTimeoutException
> >     at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:288)
> >     at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:293)
> >     at java.util.TimerThread.mainLoop(Timer.java:512)
> >     at java.util.TimerThread.run(Timer.java:462)
> > Command(13, HEARTBEAT)fault was: Invalid channel: 914784201: {}
> > org.globus.cog.karajan.workflow.service.ReplyTimeoutException
> >     at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:288)
> >     at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:293)
> >     at java.util.TimerThread.mainLoop(Timer.java:512)
> >     at java.util.TimerThread.run(Timer.java:462)
> > Heartbeat failed: Invalid channel: 914784201: {}
> > org.globus.cog.karajan.workflow.service.ReplyTimeoutException
> >     at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:288)
> >     at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:293)
> >     at java.util.TimerThread.mainLoop(Timer.java:512)
> >     at java.util.TimerThread.run(Timer.java:462)
> >
> >
> > The log is attached. I will try a longish run with more heap memory.
> >
> >
> >
> >
> > Regards,
> > Ketan
> >
> >
> >
> > On Tue, Aug 23, 2011 at 3:21 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> > On Tue, 2011-08-23 at 14:46 -0500, Michael Wilde wrote:
> > > Well, I raised that issue, but Ketan claimed that the failure to start more jobs occurs without that message as well.
> >
> >
> > Fine. I'll need the log from that then.
> >
> >
> > >
> > > Do you believe that the Out of Mem error is the root cause?
> > >
> > > Ketan, can you point to logs without the OOM error?
> > >
> > > Can you re-run the catsn with more memory?
> > >
> > > And more importantly: can you run a *very small* catsnsleep test where you carefully craft the sleep times and settings to cause one (very short duration) coaster block to time out, and verify that a new block is submitted in a new job and that the script runs to completion?
> > >
> > > I suggested in the ticket that David do this; can you both discuss and see who is better positioned to do this sooner, so we can decide if we have a blocker here, or just something that needs better configuration and perhaps a note in the user guide telling users what to watch out for in this regard? (I think, for example, that the user guide does not currently say how and when to increase memory.) Nor are we clear enough on the issues around maxtime, maxwalltime, and the sizing of coaster blocks.
> > >
> > > Thanks,
> > >
> > > - Mike
> > >
> > >
> > > ----- Original Message -----
> > > > From: "Mihael Hategan" <hategan at mcs.anl.gov>
> > > > To: "Ketan Maheshwari" <ketancmaheshwari at gmail.com>
> > > > > Cc: "Michael Wilde" <wilde at mcs.anl.gov>, "Swift Devel" <swift-devel at ci.uchicago.edu>
> > > > > Sent: Tuesday, August 23, 2011 2:36:00 PM
> > > > > Subject: Re: [Swift-devel] Blocker issue for 0.93: DSSAT script does not complete, 2nd coaster blocks dont start?
> > > > > mike at blabla:~/tmp$ grep "heap" catsn-20110823-1116-94roxc18.log
> > > > > 2011-08-23 11:16:49,904-0500 DEBUG Loader Max heap: 257294336
> > > > 2011-08-23 11:38:50,957-0500 DEBUG VDL2ExecutionContext
> > > > java.lang.OutOfMemoryError: Java heap space
> > > > java.lang.OutOfMemoryError: Java heap space
> > > > Caused by: java.lang.OutOfMemoryError: Java heap space
> > > > Caused by: java.lang.OutOfMemoryError: Java heap space
> > > > java.lang.OutOfMemoryError: Java heap space
> > > > Caused by: java.lang.OutOfMemoryError: Java heap space
> > > > Caused by: java.lang.OutOfMemoryError: Java heap space
> > > >
> > > >
> > > > On Tue, 2011-08-23 at 11:47 -0500, Ketan Maheshwari wrote:
> > > > > Hello Mike,
> > > > >
> > > > >
> > > > > > I tried another run with 30K tasks on PADS. This run stopped after
> > > > > > completing 16K+ tasks.
> > > > > >
> > > > > >
> > > > > > The log file is:
> > > > > > http://www.ci.uchicago.edu/~ketan/catsn-20110823-1116-94roxc18.log
> > > > > >
> > > > > >
> > > > > > The exception messages I get are attached to this mail.
> > > > > >
> > > > > >
> > > > > > Looking at the messages, it seems the coasters are unable to restart
> > > > > > the submit block once the walltime has expired for a run.
> > > > >
> > > > >
> > > > > Regards,
> > > > > Ketan
> > > > >
> > > > > > On Tue, Aug 23, 2011 at 11:06 AM, Ketan Maheshwari <ketancmaheshwari at gmail.com> wrote:
> > > > > > Mike,
> > > > > >
> > > > > >
> > > > > > This looks like the issue of coaster blocks not restarting. I
> > > > > > can try the same run again and see if this persists.
> > > > > >
> > > > > >
> > > > > > On Tue, Aug 23, 2011 at 11:04 AM, Michael Wilde <wilde at mcs.anl.gov> wrote:
> > > > > > Ketan,
> > > > > >
> > > > > >
> > > > > > Should I ask David to try to replicate this problem?
> > > > > >
> > > > > >
> > > > > > Did you figure out why your jobs are not starting on PADS?
> > > > >
> > > > >
> > > > > - Mike
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > > ______________________________________________________
> > > > > > From: "Michael Wilde" <wilde at mcs.anl.gov>
> > > > > > To: "Ketan Maheshwari" <ketancmaheshwari at gmail.com>, "Mihael Hategan" <hategan at mcs.anl.gov>
> > > > > > Cc: "swift-devel Devel" <swift-devel at ci.uchicago.edu>, "Papia Rizwan" <papia.rizwan at gmail.com>
> > > > > > Sent: Monday, August 22, 2011 10:47:56 AM
> > > > > > Subject: Re: Blocker issue for 0.93: DSSAT script does not complete, 2nd coaster blocks dont start?
> > > > > >
> > > > > > Can you try this on PADS using small jobs in the fast queue?
> > > > > >
> > > > > > I have not thought this all the way through, but perhaps coasters will honor maxtime and maxwalltime on any coaster block, even if it's not running on a batch scheduler. In that case perhaps you can replicate the problem on the MCS pool or, better yet, on localhost.
> > > > > >
> > > > > >
> > > > > > In these runs, what were the values of the execution.retries and lazy.errors flags? Mihael, do those properties need to be set to >0 and true, respectively, in order for coasters to start new blocks correctly, assuming that in some cases a job will run longer than its maxwalltime?
> > > > >
> > > > >
> > > > > - Mike
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > > ______________________________________________
> > > > > > From: "Ketan Maheshwari" <ketancmaheshwari at gmail.com>
> > > > > > To: "Michael Wilde" <wilde at mcs.anl.gov>
> > > > > > Cc: "Papia Rizwan" <papia.rizwan at gmail.com>, "swift-devel Devel" <swift-devel at ci.uchicago.edu>
> > > > > > Sent: Monday, August 22, 2011 10:32:31 AM
> > > > > > Subject: Re: Blocker issue for 0.93: DSSAT script does not complete, 2nd coaster blocks dont start?
> > > > > >
> > > > > > Mike,
> > > > > >
> > > > > >
> > > > > > If I recall correctly, Papia has always been running her DSSAT app with 0.92. She has not yet tried with 0.93. I too tried with 0.92 with her sites file settings.
> > > > > >
> > > > > >
> > > > > > I once tried it with 0.93 on PADS, but the jobs never got from the queue into the running state.
> > > > > >
> > > > > >
> > > > > > I will give it another try today, as it might be that PADS was too busy last week. As I recall, Jon was also struggling to get access.
> > > > >
> > > > >
> > > > > Regards,
> > > > > Ketan
> > > > >
> > > > > > On Mon, Aug 22, 2011 at 10:24 AM, Michael Wilde <wilde at mcs.anl.gov> wrote:
> > > > > > Papia, Ketan,
> > > > > >
> > > > > > In reviewing 0.93 work remaining with David, I remembered this issue.
> > > > > >
> > > > > > You both reported that the DSSAT application script doesn't finish on PADS - it seems not to start the second round of coaster blocks that it needs to complete (as I recall, but this may not be correct). This needs to be researched and filed as a bug (or, an error in the sites spec needs to be identified and made clear in the site guide if it turns out to be the problem).
> > > > > >
> > > > > > Possibly there is an issue with jobs failing at the end of the coaster blocks, and you don't have the necessary retry values set for the PADS site?
> > > > > >
> > > > > > We need an example run with logs and full details. Can you try to re-create this with a much smaller initial allocation, and see if coasters is transitioning from its initial blocks to the next blocks?
> > > > > >
> > > > > > Can you give this high priority for today?
> > > > >
> > > > > Thanks,
> > > > >
> > > > > - Mike
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Ketan
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Michael Wilde
> > > > > Computation Institute,
> > University of Chicago
> > > > > Mathematics and Computer Science
> > Division
> > > > > Argonne National Laboratory
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Michael Wilde
> > > > > Computation Institute, University of
> > Chicago
> > > > > Mathematics and Computer Science
> > Division
> > > > > Argonne National Laboratory
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Ketan
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Ketan
> > > > >
> > > > >
> > > > >
> > > > > _______________________________________________
> > > > > Swift-devel mailing list
> > > > > Swift-devel at ci.uchicago.edu
> > > > >
> >
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
--
Ketan
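
For triaging logs like the ones discussed in this thread, a small script can count the out-of-memory and reply-timeout lines and report the configured heap size. The following is only a minimal sketch: the function name `summarize_log` and the summary keys are hypothetical, and it assumes each marker appears on a single line of the log, as in Mihael's `grep` output. The marker strings themselves are copied from the excerpts quoted above.

```python
import re

# Marker strings copied from the log excerpts quoted in this thread.
OOM_MARKER = "java.lang.OutOfMemoryError: Java heap space"
TIMEOUT_MARKER = "ReplyTimeoutException"
MAX_HEAP_RE = re.compile(r"Max heap: (\d+)")


def summarize_log(lines):
    """Count OOM and reply-timeout lines; report the max heap in MiB.

    Hypothetical helper for illustration, not part of Swift or CoG.
    """
    summary = {"oom": 0, "reply_timeouts": 0, "max_heap_mib": None}
    for line in lines:
        if OOM_MARKER in line:
            summary["oom"] += 1
        if TIMEOUT_MARKER in line:
            summary["reply_timeouts"] += 1
        match = MAX_HEAP_RE.search(line)
        if match:
            # The log reports the heap size in bytes; convert to MiB.
            summary["max_heap_mib"] = int(match.group(1)) / (1024 * 1024)
    return summary


# Sample lines taken from the excerpts in this thread.
sample = [
    "2011-08-23 11:16:49,904-0500 DEBUG Loader Max heap: 257294336",
    "java.lang.OutOfMemoryError: Java heap space",
    "Caused by: java.lang.OutOfMemoryError: Java heap space",
    "org.globus.cog.karajan.workflow.service.ReplyTimeoutException",
]
print(summarize_log(sample))
```

On the sample lines this reports a max heap of 245.375 MiB (257294336 bytes), two OOM lines, and one reply timeout. To scan a real log, pass an open file object instead of the `sample` list.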
-------------- next part --------------
A non-text attachment was scrubbed...
Name: catsnsleep-20110823-1714-9xk5y0b2.log
Type: application/octet-stream
Size: 2852009 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20110823/17e0ce88/attachment.obj>