[Swift-devel] Blocker issue for 0.93: DSSAT script does not complete, 2nd coaster blocks dont start?
Mihael Hategan
hategan at mcs.anl.gov
Tue Aug 23 23:04:03 CDT 2011
Also benign, but annoying, so I'd like to nail these down. Can you try
r3240?
On Tue, 2011-08-23 at 20:12 -0500, Ketan Maheshwari wrote:
> Hi again,
>
>
> Tried a larger run on PADS with a similar sleep time but a larger n
> parameter. The run seemed to be progressing well (I killed it by
> mistake), but the log does show some coaster block shutdown and
> network-related exception messages.
>
>
> Attached is the log.
>
>
> Regards,
> Ketan
>
>
>
> On Tue, Aug 23, 2011 at 4:05 PM, Mihael Hategan <hategan at mcs.anl.gov>
> wrote:
> That's benign, but I committed a patch to prevent it from happening in
> cog r3237.
>
>
> On Tue, 2011-08-23 at 15:46 -0500, Ketan Maheshwari wrote:
> > Hi,
> >
> >
> > I tried a smaller run, catsnsleep, sleeptime 60sec, n=20,
> > walltime=2min
> >
> >
> > The run indeed completed but I saw this in the middle (I suppose at
> > the end of the first walltime slot):
> >
> >
> > Command(13, HEARTBEAT): handling reply timeout;
> > sendReqTime=110823-153059.847, sendTime=110823-153059.847,
> > now=110823-153259.860
> > Command(13, HEARTBEAT)fault was: Reply timeout
> > org.globus.cog.karajan.workflow.service.ReplyTimeoutException
> >     at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:288)
> >     at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:293)
> >     at java.util.TimerThread.mainLoop(Timer.java:512)
> >     at java.util.TimerThread.run(Timer.java:462)
> > Command(13, HEARTBEAT)fault was: Invalid channel: 914784201: {}
> > org.globus.cog.karajan.workflow.service.ReplyTimeoutException
> >     at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:288)
> >     at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:293)
> >     at java.util.TimerThread.mainLoop(Timer.java:512)
> >     at java.util.TimerThread.run(Timer.java:462)
> > Heartbeat failed: Invalid channel: 914784201: {}
> > org.globus.cog.karajan.workflow.service.ReplyTimeoutException
> >     at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:288)
> >     at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:293)
> >     at java.util.TimerThread.mainLoop(Timer.java:512)
> >     at java.util.TimerThread.run(Timer.java:462)
> >
> >
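For anyone reading the trace above: sendTime and now in the HEARTBEAT message are about 120 seconds apart. That may be the reply timeout interval, or it may simply coincide with the 2min walltime used for this run; the thread does not say which. A minimal Python sketch of the arithmetic, assuming the timestamps use a yymmdd-HHMMSS.mmm layout (inferred from the log line itself, not checked against the cog source); elapsed_seconds is a hypothetical helper name:

```python
from datetime import datetime

# Assumed layout, inferred from the quoted log line: yymmdd-HHMMSS.mmm
LOG_TIME_FORMAT = "%y%m%d-%H%M%S.%f"


def elapsed_seconds(start: str, end: str) -> float:
    """Return the number of seconds between two log-style timestamps."""
    t0 = datetime.strptime(start, LOG_TIME_FORMAT)
    t1 = datetime.strptime(end, LOG_TIME_FORMAT)
    return (t1 - t0).total_seconds()


# Values copied from the HEARTBEAT message quoted above.
send_time = "110823-153059.847"
now = "110823-153259.860"

print(f"{elapsed_seconds(send_time, now):.3f} s between send and timeout")
```

Running the same calculation over every "handling reply timeout" line in a log would show whether the gap is always the same, which would point at a fixed timeout rather than the block walltime.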
> > The log is attached. I will try a longish run with more heap memory.
> >
> >
> >
> >
> > Regards,
> > Ketan
> >
> >
> >
> > On Tue, Aug 23, 2011 at 3:21 PM, Mihael Hategan <hategan at mcs.anl.gov>
> > wrote:
> > On Tue, 2011-08-23 at 14:46 -0500, Michael Wilde wrote:
> > > Well, I raised that issue, but Ketan claimed that the failure to
> > > start more jobs occurs without that message as well.
> >
> > Fine. I'll need the log from that then.
> >
> >
> > >
> > > > Do you believe that the Out of Mem error is the root cause?
> > > >
> > > > Ketan, can you point to logs without the OOM error?
> > > >
> > > > Can you re-run the catsn with more memory?
> > > >
> > > > And more importantly: can you run a *very small* catsnsleep test
> > > > where you carefully craft the sleep times and settings to cause one
> > > > (very short duration) coaster block to time out, and verify that a
> > > > new block is submitted in a new job and that the script runs to
> > > > completion?
> > > >
> > > > I suggested in the ticket that David do this; can you both discuss
> > > > and see who is better positioned to do this sooner, so we can decide
> > > > if we have a blocker here, or just something that needs better
> > > > configuration and perhaps a note in the user guide telling users
> > > > what to watch out for in this regard? (I think, for example, we do
> > > > not tell how and when to increase memory in the user guide at the
> > > > moment.) Nor are we clear enough on the issues around maxtime,
> > > > maxwalltime, and the sizing of coaster blocks.
> > >
> > > Thanks,
> > >
> > > - Mike
> > >
> > >
> > > ----- Original Message -----
> > > > > From: "Mihael Hategan" <hategan at mcs.anl.gov>
> > > > > To: "Ketan Maheshwari" <ketancmaheshwari at gmail.com>
> > > > > Cc: "Michael Wilde" <wilde at mcs.anl.gov>, "Swift Devel" <swift-devel at ci.uchicago.edu>
> > > > > Sent: Tuesday, August 23, 2011 2:36:00 PM
> > > > > Subject: Re: [Swift-devel] Blocker issue for 0.93: DSSAT script does not complete, 2nd coaster blocks dont start?
> > > > > mike at blabla:~/tmp$ grep "heap" catsn-20110823-1116-94roxc18.log
> > > > > 2011-08-23 11:16:49,904-0500 DEBUG Loader Max heap: 257294336
> > > > > 2011-08-23 11:38:50,957-0500 DEBUG VDL2ExecutionContext
> > > > > java.lang.OutOfMemoryError: Java heap space
> > > > > java.lang.OutOfMemoryError: Java heap space
> > > > > Caused by: java.lang.OutOfMemoryError: Java heap space
> > > > > Caused by: java.lang.OutOfMemoryError: Java heap space
> > > > > java.lang.OutOfMemoryError: Java heap space
> > > > > Caused by: java.lang.OutOfMemoryError: Java heap space
> > > > > Caused by: java.lang.OutOfMemoryError: Java heap space
> > > >
> > > >
> > > > > On Tue, 2011-08-23 at 11:47 -0500, Ketan Maheshwari wrote:
> > > > > > Hello Mike,
> > > > > >
> > > > > > I tried another run with 30K tasks on PADS. This run stopped after
> > > > > > completing 16K+ tasks.
> > > > > >
> > > > > > The log file is:
> > > > > > http://www.ci.uchicago.edu/~ketan/catsn-20110823-1116-94roxc18.log
> > > > > >
> > > > > > The exception messages I get are attached with the mail.
> > > > > >
> > > > > > Looking at the messages, it seems the coasters are unable to restart
> > > > > > the submit block once the walltime is expired for a run.
> > > > >
> > > > >
> > > > > Regards,
> > > > > Ketan
> > > > >
> > > > > > On Tue, Aug 23, 2011 at 11:06 AM, Ketan Maheshwari
> > > > > > <ketancmaheshwari at gmail.com> wrote:
> > > > > > Mike,
> > > > > >
> > > > > > This looks like the coaster-blocks-not-restarting issue. I can try
> > > > > > to run the same run again and see if this persists.
> > > > > >
> > > > > > On Tue, Aug 23, 2011 at 11:04 AM, Michael Wilde
> > > > > > <wilde at mcs.anl.gov> wrote:
> > > > > > Ketan,
> > > > > >
> > > > > > Should I ask David to try to replicate this problem?
> > > > > >
> > > > > > Did you figure out why your jobs are not starting on PADS?
> > > > > >
> > > > > > - Mike
> > > > >
> > > > >
> > > > >
> > > > >
> >
> > > > > > ______________________________________________________
> > > > > > From: "Michael Wilde" <wilde at mcs.anl.gov>
> > > > > > To: "Ketan Maheshwari" <ketancmaheshwari at gmail.com>, "Mihael Hategan" <hategan at mcs.anl.gov>
> > > > > > Cc: "swift-devel Devel" <swift-devel at ci.uchicago.edu>, "Papia Rizwan" <papia.rizwan at gmail.com>
> > > > > > Sent: Monday, August 22, 2011 10:47:56 AM
> > > > > > Subject: Re: Blocker issue for 0.93: DSSAT script does not complete, 2nd coaster blocks dont start?
> > > > > >
> > > > > > Can you try this on PADS using small jobs in the fast queue?
> > > > > >
> > > > > > I have not thought this all the way through, but perhaps coasters
> > > > > > will honor maxtime and maxwalltime on any coaster block, even if
> > > > > > it's not running on a batch scheduler. In that case perhaps you can
> > > > > > replicate the problem on the MCS pool or, better yet, on localhost.
> > > > > >
> > > > > > In these runs, what was the value of the execution.retries and
> > > > > > lazy.errors flags? Mihael, do those properties need to be set to >0
> > > > > > and true, respectively, in order for coasters to start new blocks
> > > > > > correctly, assuming that in some cases a job will run longer than
> > > > > > its maxwalltime?
> > > > > >
> > > > > > - Mike
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > > ______________________________________________
> > > > > > From: "Ketan Maheshwari" <ketancmaheshwari at gmail.com>
> > > > > > To: "Michael Wilde" <wilde at mcs.anl.gov>
> > > > > > Cc: "Papia Rizwan" <papia.rizwan at gmail.com>, "swift-devel Devel" <swift-devel at ci.uchicago.edu>
> > > > > > Sent: Monday, August 22, 2011 10:32:31 AM
> > > > > > Subject: Re: Blocker issue for 0.93: DSSAT script does not complete, 2nd coaster blocks dont start?
> > > > > >
> > > > > > Mike,
> > > > > >
> > > > > > If I recall correctly, Papia has always been running her DSSAT app
> > > > > > with 0.92. She has not yet tried with 0.93. I too tried with 0.92
> > > > > > with her sites file settings.
> > > > > >
> > > > > > I once tried it with 0.93 on PADS but could never get from the
> > > > > > queue into the running state.
> > > > > >
> > > > > > I will give it another try today, as it might be that PADS was too
> > > > > > busy last week. As I recall, Jon was also struggling to get access.
> > > > > >
> > > > > > Regards,
> > > > > > Ketan
> > > > >
> > > > > > On Mon, Aug 22, 2011 at 10:24 AM, Michael Wilde
> > > > > > <wilde at mcs.anl.gov> wrote:
> > > > > > Papia, Ketan,
> > > > > >
> > > > > > In reviewing 0.93 work remaining with David, I remembered this
> > > > > > issue.
> > > > > >
> > > > > > You both reported that the DSSAT application script doesn't finish
> > > > > > on PADS - it seems not to start the second round of coaster blocks
> > > > > > that it needs to complete (as I recall, but this may not be
> > > > > > correct). This needs to be researched and filed as a bug (or, an
> > > > > > error in the sites spec needs to be identified and made clear in
> > > > > > the site guide if it turns out to be the problem).
> > > > > >
> > > > > > Possibly there is an issue with jobs failing at the end of the
> > > > > > coaster blocks, and you don't have the necessary retry values set
> > > > > > for the PADS site?
> > > > > >
> > > > > > We need an example run with logs and full details. Can you try to
> > > > > > re-create this with a much smaller initial allocation, and see if
> > > > > > coasters is transitioning from its initial blocks to the next
> > > > > > blocks?
> > > > > >
> > > > > > Can you give this high prio for today?
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > - Mike
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Ketan
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Michael Wilde
> > > > > Computation Institute,
> > University of Chicago
> > > > > Mathematics and
> Computer Science
> > Division
> > > > > Argonne National
> Laboratory
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Michael Wilde
> > > > > Computation Institute,
> University of
> > Chicago
> > > > > Mathematics and Computer
> Science
> > Division
> > > > > Argonne National Laboratory
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Ketan
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Ketan
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > > _______________________________________________
> > > > > > Swift-devel mailing list
> > > > > > Swift-devel at ci.uchicago.edu
> > > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
> > >
> >
> >
> >
> >
> >
> >
> >
> > --
> > Ketan
> >
> >
> >
>
>
>
>
>
>
>
> --
> Ketan
>
>
>