[Swift-devel] Blocker issue for 0.93: DSSAT script does not complete, 2nd coaster blocks dont start?
Mihael Hategan
hategan at mcs.anl.gov
Tue Aug 23 16:05:13 CDT 2011
That's benign, but I committed a patch to prevent it from happening in
cog r3237.
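
For reference, the timestamps on the HEARTBEAT line in the trace quoted below put the gap between sending the request and the timeout firing at almost exactly two minutes. Whether that is the reply timeout itself or just a coincidence with walltime=2min is not established in this thread. A minimal Python sketch of the arithmetic, assuming the stamps are in yymmdd-HHMMSS.mmm form (my reading of the log line, not a documented format); the helper name is hypothetical:

```python
from datetime import datetime

# Assumed layout of the log stamps: yymmdd-HHMMSS.mmm
STAMP_FORMAT = "%y%m%d-%H%M%S.%f"

def seconds_between(start: str, end: str) -> float:
    """Return the elapsed seconds between two log timestamps."""
    t0 = datetime.strptime(start, STAMP_FORMAT)
    t1 = datetime.strptime(end, STAMP_FORMAT)
    return (t1 - t0).total_seconds()

# Values copied from the HEARTBEAT line in the quoted trace.
elapsed = seconds_between("110823-153059.847", "110823-153259.860")
print(f"{elapsed:.3f} s")  # prints "120.013 s"
```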
On Tue, 2011-08-23 at 15:46 -0500, Ketan Maheshwari wrote:
> Hi,
>
>
> I tried a smaller run, catsnsleep, sleeptime 60sec, n=20,
> walltime=2min
>
>
> The run indeed completed but I saw this in the middle (I suppose at
> the end of first walltime slot):
>
>
> Command(13, HEARTBEAT): handling reply timeout; sendReqTime=110823-153059.847, sendTime=110823-153059.847, now=110823-153259.860
> Command(13, HEARTBEAT)fault was: Reply timeout
> org.globus.cog.karajan.workflow.service.ReplyTimeoutException
>     at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:288)
>     at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:293)
>     at java.util.TimerThread.mainLoop(Timer.java:512)
>     at java.util.TimerThread.run(Timer.java:462)
> Command(13, HEARTBEAT)fault was: Invalid channel: 914784201: {}
> org.globus.cog.karajan.workflow.service.ReplyTimeoutException
>     at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:288)
>     at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:293)
>     at java.util.TimerThread.mainLoop(Timer.java:512)
>     at java.util.TimerThread.run(Timer.java:462)
> Heartbeat failed: Invalid channel: 914784201: {}
> org.globus.cog.karajan.workflow.service.ReplyTimeoutException
>     at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:288)
>     at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:293)
>     at java.util.TimerThread.mainLoop(Timer.java:512)
>     at java.util.TimerThread.run(Timer.java:462)
>
>
> The log is attached. I will try a longish run with more heap memory.
>
>
>
>
> Regards,
> Ketan
>
>
>
> On Tue, Aug 23, 2011 at 3:21 PM, Mihael Hategan <hategan at mcs.anl.gov>
> wrote:
> On Tue, 2011-08-23 at 14:46 -0500, Michael Wilde wrote:
> > Well, I raised that issue, but Ketan claimed that the failure to
> > start more jobs occurs without that message as well.
>
> Fine. I'll need the log from that then.
>
> > Do you believe that the Out of Mem error is the root cause?
> >
> > Ketan, can you point to logs without the OOM error?
> >
> > Can you re-run the catsn with more memory?
> >
> > And more importantly: can you run a *very small* catsnsleep test
> > where you carefully craft the sleep times and settings to cause one
> > (very short duration) coaster block to time out and verify that a
> > new block is submitted and in new job and that the script runs to
> > completion?
> >
> > I suggested in the ticket that David do this; can you both discuss
> > and see who is better positioned to do this sooner, so we can decide
> > if we have a blocker here, or just something that needs better
> > configuration and perhaps a note in the user guide telling users
> > what to watch out for in this regard? (I think for example we do not
> > tell how and when to increase memory in the user guide, at the
> > moment). Nor are we clear enough on the issues around maxtime,
> > maxwalltime, and the sizing of coaster blocks.
> >
> > Thanks,
> >
> > - Mike
> >
> > ----- Original Message -----
> > > From: "Mihael Hategan" <hategan at mcs.anl.gov>
> > > To: "Ketan Maheshwari" <ketancmaheshwari at gmail.com>
> > > Cc: "Michael Wilde" <wilde at mcs.anl.gov>,
> > >     "Swift Devel" <swift-devel at ci.uchicago.edu>
> > > Sent: Tuesday, August 23, 2011 2:36:00 PM
> > > Subject: Re: [Swift-devel] Blocker issue for 0.93: DSSAT script
> > >     does not complete, 2nd coaster blocks dont start?
> > >
> > > mike at blabla:~/tmp$ grep "heap" catsn-20110823-1116-94roxc18.log
> > > 2011-08-23 11:16:49,904-0500 DEBUG Loader Max heap: 257294336
> > > 2011-08-23 11:38:50,957-0500 DEBUG VDL2ExecutionContext
> > > java.lang.OutOfMemoryError: Java heap space
> > > java.lang.OutOfMemoryError: Java heap space
> > > Caused by: java.lang.OutOfMemoryError: Java heap space
> > > Caused by: java.lang.OutOfMemoryError: Java heap space
> > > java.lang.OutOfMemoryError: Java heap space
> > > Caused by: java.lang.OutOfMemoryError: Java heap space
> > > Caused by: java.lang.OutOfMemoryError: Java heap space
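
The grep output above shows a max heap of 257294336 bytes (about 245 MiB) followed by repeated OutOfMemoryError lines. A small Python sketch for pulling those two facts out of a log; the function name and return shape are hypothetical, and the line patterns are only assumed from the excerpt above, not from any documented Swift log format:

```python
import re

# Patterns assumed from the grep excerpt, not from a documented log format.
MAX_HEAP_RE = re.compile(r"Max heap:\s*(\d+)")
OOM_MARKER = "java.lang.OutOfMemoryError"

def summarize_heap(lines):
    """Return (max_heap_bytes or None, number of lines mentioning OOM)."""
    max_heap = None
    oom_count = 0
    for line in lines:
        match = MAX_HEAP_RE.search(line)
        if match:
            max_heap = int(match.group(1))
        if OOM_MARKER in line:
            oom_count += 1
    return max_heap, oom_count

# A few lines copied from the excerpt above.
sample = [
    "2011-08-23 11:16:49,904-0500 DEBUG Loader Max heap: 257294336",
    "2011-08-23 11:38:50,957-0500 DEBUG VDL2ExecutionContext",
    "java.lang.OutOfMemoryError: Java heap space",
    "Caused by: java.lang.OutOfMemoryError: Java heap space",
]
heap, ooms = summarize_heap(sample)
print(f"max heap {heap / 2**20:.1f} MiB, {ooms} OOM lines")
```

To run it against a real log, pass an open file object instead of the sample list. If the heap does turn out to be the limit, the JVM option that raises it is -Xmx; how that gets passed through the swift launcher is not covered in this thread.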
> > >
> > >
> > > On Tue, 2011-08-23 at 11:47 -0500, Ketan Maheshwari wrote:
> > > > Hello Mike,
> > > >
> > > >
> > > > I tried another run with 30K tasks on PADS. This run stopped after
> > > > completing 16K+ tasks.
> > > >
> > > > The log file is:
> > > > http://www.ci.uchicago.edu/~ketan/catsn-20110823-1116-94roxc18.log
> > > >
> > > > The exception messages I get are attached with the mail.
> > > >
> > > > Looking at the messages, it seems the coasters are unable to restart
> > > > the submit block once the walltime is expired for a run.
> > > >
> > > >
> > > > Regards,
> > > > Ketan
> > > >
> > > > On Tue, Aug 23, 2011 at 11:06 AM, Ketan Maheshwari
> > > > <ketancmaheshwari at gmail.com> wrote:
> > > > Mike,
> > > >
> > > >
> > > > This looks like the coasters blocks not restarting issue. I can
> > > > try to run the same run again and see if this persists.
> > > >
> > > >
> > > > On Tue, Aug 23, 2011 at 11:04 AM, Michael Wilde
> > > > <wilde at mcs.anl.gov> wrote:
> > > > Ketan,
> > > >
> > > >
> > > > Should I ask David to try to replicate this problem?
> > > >
> > > > Did you figure out why your jobs are not starting on PADS?
> > > >
> > > >
> > > > - Mike
> > > >
> > > >
> > > >
> > > >
> > > > ______________________________________________________
> > > > From: "Michael Wilde" <wilde at mcs.anl.gov>
> > > > To: "Ketan Maheshwari" <ketancmaheshwari at gmail.com>,
> > > >     "Mihael Hategan" <hategan at mcs.anl.gov>
> > > > Cc: "swift-devel Devel" <swift-devel at ci.uchicago.edu>,
> > > >     "Papia Rizwan" <papia.rizwan at gmail.com>
> > > > Sent: Monday, August 22, 2011 10:47:56 AM
> > > > Subject: Re: Blocker issue for 0.93: DSSAT script does not
> > > >     complete, 2nd coaster blocks dont start?
> > > >
> > > > Can you try this on PADS using small jobs in the fast queue?
> > > >
> > > > I have not thought this all the way through, but perhaps coasters
> > > > will honor maxtime and maxwalltime on any coaster block, even if
> > > > it's not running on a batch scheduler. In that case perhaps you can
> > > > replicate the problem on the MCS pool or better yet on localhost.
> > > >
> > > > In these runs, what was the value of the execution.retries and
> > > > lazy.errors flags? Mihael, do those properties need to be set to >0
> > > > and true, respectively, in order for coasters to start new blocks
> > > > correctly, assuming that in some cases a job will run longer than
> > > > its maxwalltime?
> > > >
> > > >
> > > > - Mike
> > > >
> > > >
> > > >
> > > >
> > > > ______________________________________________
> > > > From: "Ketan Maheshwari" <ketancmaheshwari at gmail.com>
> > > > To: "Michael Wilde" <wilde at mcs.anl.gov>
> > > > Cc: "Papia Rizwan" <papia.rizwan at gmail.com>,
> > > >     "swift-devel Devel" <swift-devel at ci.uchicago.edu>
> > > > Sent: Monday, August 22, 2011 10:32:31 AM
> > > > Subject: Re: Blocker issue for 0.93: DSSAT script does not
> > > >     complete, 2nd coaster blocks dont start?
> > > >
> > > > Mike,
> > > >
> > > > If I recall correctly, Papia has always been running her DSSAT app
> > > > with 0.92. She has not yet tried with 0.93. I too tried with 0.92
> > > > with her sites file settings.
> > > >
> > > > I once tried it with 0.93 on pads but could never get in the
> > > > running from the queue.
> > > >
> > > > I will give another try today as it might be that PADS was too busy
> > > > last week. As I recall Jon was also struggling to get access.
> > > >
> > > >
> > > > Regards,
> > > > Ketan
> > > >
> > > > On Mon, Aug 22, 2011 at 10:24 AM, Michael Wilde
> > > > <wilde at mcs.anl.gov> wrote:
> > > > Papia, Ketan,
> > > >
> > > > In reviewing 0.93 work remaining with David, I remembered this
> > > > issue.
> > > >
> > > > You both reported that the DSSAT application script doesn't finish
> > > > on PADS - it seems not to start the second round of coaster blocks
> > > > that it needs to complete (as I recall, but this may not be
> > > > correct). This needs to be researched and filed as a bug (or, an
> > > > error in the sites spec needs to be identified and made clear in
> > > > the site guide if it turns out to be the problem).
> > > >
> > > > Possibly there is an issue with jobs failing at the end of the
> > > > coaster blocks, and you don't have the necessary retry values set
> > > > for the PADS site???
> > > >
> > > > We need an example run with logs and full details. Can you try to
> > > > re-create this with a much smaller initial allocation, and see if
> > > > coasters is transitioning from its initial blocks to the next
> > > > blocks?
> > > >
> > > > Can you give this high prio for today?
> > > >
> > > > Thanks,
> > > >
> > > > - Mike
> > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Ketan
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Michael Wilde
> > > > Computation Institute, University of Chicago
> > > > Mathematics and Computer Science Division
> > > > Argonne National Laboratory
> > > > _______________________________________________
> > > > Swift-devel mailing list
> > > > Swift-devel at ci.uchicago.edu
> > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
> >
>
>
>
>
>
>
>
> --
> Ketan
>
>
>