[Swift-devel] Blocker issue for 0.93: DSSAT script does not complete, 2nd coaster blocks dont start?

Ketan Maheshwari ketancmaheshwari at gmail.com
Tue Aug 23 15:46:04 CDT 2011


Hi,

I tried a smaller run: catsnsleep with sleeptime=60 sec, n=20, walltime=2 min.

The run did complete, but I saw this in the middle (I suppose at the end
of the first walltime slot):

Command(13, HEARTBEAT): handling reply timeout; sendReqTime=110823-153059.847, sendTime=110823-153059.847, now=110823-153259.860
Command(13, HEARTBEAT)fault was: Reply timeout
org.globus.cog.karajan.workflow.service.ReplyTimeoutException
    at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:288)
    at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:293)
    at java.util.TimerThread.mainLoop(Timer.java:512)
    at java.util.TimerThread.run(Timer.java:462)
Command(13, HEARTBEAT)fault was: Invalid channel: 914784201: {}
org.globus.cog.karajan.workflow.service.ReplyTimeoutException
    at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:288)
    at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:293)
    at java.util.TimerThread.mainLoop(Timer.java:512)
    at java.util.TimerThread.run(Timer.java:462)
Heartbeat failed: Invalid channel: 914784201: {}
org.globus.cog.karajan.workflow.service.ReplyTimeoutException
    at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:288)
    at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:293)
    at java.util.TimerThread.mainLoop(Timer.java:512)
    at java.util.TimerThread.run(Timer.java:462)
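
As a quick check on the HEARTBEAT message, the gap between sendTime and now can be computed directly. This is only a sketch: the timestamp layout (yyMMdd-HHmmss.SSS) is inferred from the log text, and the helper name elapsed_seconds is made up for illustration.

```python
from datetime import datetime

# Layout inferred from the log lines (assumption): yyMMdd-HHmmss.SSS
LOG_TIME_FORMAT = "%y%m%d-%H%M%S.%f"


def elapsed_seconds(start: str, end: str) -> float:
    """Return the number of seconds between two log timestamps."""
    t0 = datetime.strptime(start, LOG_TIME_FORMAT)
    t1 = datetime.strptime(end, LOG_TIME_FORMAT)
    return (t1 - t0).total_seconds()


# Values copied from the HEARTBEAT reply-timeout message.
gap = elapsed_seconds("110823-153059.847", "110823-153259.860")
print(f"{gap:.3f} s")  # prints "120.013 s"
```

The gap is about two minutes, the same length as the walltime used in this run. Whether that reflects a fixed reply timeout or is tied to the walltime is not something this log excerpt alone establishes.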

The log is attached. I will try a longish run with more heap memory.
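
For context on the heap size: the earlier catsn log quoted below reports "Max heap: 257294336". Assuming that figure is in bytes (the log does not state the unit), a rough conversion looks like this; the helper name bytes_to_mib is hypothetical.

```python
# Value copied from the "Max heap" line of the earlier catsn log.
# Assumption: the log reports the heap limit in bytes.
MAX_HEAP_BYTES = 257294336


def bytes_to_mib(n: int) -> float:
    """Convert a byte count to mebibytes (1 MiB = 1024 * 1024 bytes)."""
    return n / (1024 * 1024)


print(f"{bytes_to_mib(MAX_HEAP_BYTES):.1f} MiB")  # prints "245.4 MiB"
```

Under that assumption the 30K-task run had roughly 245 MiB of heap available, which is the limit the longer run would need to raise (on a JVM, the standard -Xmx option sets the maximum heap).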


Regards,
Ketan


On Tue, Aug 23, 2011 at 3:21 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:

> On Tue, 2011-08-23 at 14:46 -0500, Michael Wilde wrote:
> > Well, I raised that issue, but Ketan claimed that the failure to start
> > more jobs occurs without that message as well.
>
> Fine. I'll need the log from that then.
>
> >
> > Do you believe that the Out of Mem error is the root cause?
> >
> > Ketan, can you point to logs without the OOM error?
> >
> > Can you re-run the catsn with more memory?
> >
> > And more importantly: can you run a *very small* catsnsleep test where
> > you carefully craft the sleep times and settings to cause one (very short
> > duration) coaster block to time out, and verify that a new block is
> > submitted in a new job and that the script runs to completion?
> >
> > I suggested in the ticket that David do this; can you both discuss and
> > see who is better positioned to do this sooner, so we can decide if we have
> > a blocker here, or just something that needs better configuration and
> > perhaps a note in the user guide telling users what to watch out for in this
> > regard? (I think, for example, that we do not currently say in the user
> > guide how and when to increase memory.) Nor are we clear enough on the
> > issues around maxtime, maxwalltime, and the sizing of coaster blocks.
> >
> > Thanks,
> >
> > - Mike
> >
> >
> > ----- Original Message -----
> > > From: "Mihael Hategan" <hategan at mcs.anl.gov>
> > > To: "Ketan Maheshwari" <ketancmaheshwari at gmail.com>
> > > Cc: "Michael Wilde" <wilde at mcs.anl.gov>, "Swift Devel" <swift-devel at ci.uchicago.edu>
> > > Sent: Tuesday, August 23, 2011 2:36:00 PM
> > > Subject: Re: [Swift-devel] Blocker issue for 0.93: DSSAT script does not complete, 2nd coaster blocks dont start?
> > > mike at blabla:~/tmp$ grep "heap" catsn-20110823-1116-94roxc18.log
> > > 2011-08-23 11:16:49,904-0500 DEBUG Loader Max heap: 257294336
> > > 2011-08-23 11:38:50,957-0500 DEBUG VDL2ExecutionContext
> > > java.lang.OutOfMemoryError: Java heap space
> > > java.lang.OutOfMemoryError: Java heap space
> > > Caused by: java.lang.OutOfMemoryError: Java heap space
> > > Caused by: java.lang.OutOfMemoryError: Java heap space
> > > java.lang.OutOfMemoryError: Java heap space
> > > Caused by: java.lang.OutOfMemoryError: Java heap space
> > > Caused by: java.lang.OutOfMemoryError: Java heap space
> > >
> > >
> > > On Tue, 2011-08-23 at 11:47 -0500, Ketan Maheshwari wrote:
> > > > Hello Mike,
> > > >
> > > >
> > > > I tried another run with 30K tasks on PADS. This run stopped after
> > > > completing 16K+ tasks.
> > > >
> > > >
> > > > The log file is:
> > > > http://www.ci.uchicago.edu/~ketan/catsn-20110823-1116-94roxc18.log
> > > >
> > > >
> > > > The exception messages I get are attached with the mail.
> > > >
> > > >
> > > > Looking at the messages, it seems the coasters are unable to restart
> > > > the submit block once the walltime is expired for a run.
> > > >
> > > >
> > > > Regards,
> > > > Ketan
> > > >
> > > > On Tue, Aug 23, 2011 at 11:06 AM, Ketan Maheshwari
> > > > <ketancmaheshwari at gmail.com> wrote:
> > > >         Mike,
> > > >
> > > >
> > > >         This looks like the issue of coaster blocks not restarting.
> > > >         I can repeat the same run and see if this persists.
> > > >
> > > >
> > > >         On Tue, Aug 23, 2011 at 11:04 AM, Michael Wilde
> > > >         <wilde at mcs.anl.gov> wrote:
> > > >                 Ketan,
> > > >
> > > >
> > > >                 Should I ask David to try to replicate this problem?
> > > >
> > > >
> > > >                 Did you figure out why your jobs are not starting on
> > > >                 PADS?
> > > >
> > > >
> > > >                 - Mike
> > > >
> > > >
> > > >
> > > >
> > > >                         ______________________________________________________
> > > >                         From: "Michael Wilde" <wilde at mcs.anl.gov>
> > > >                         To: "Ketan Maheshwari"
> > > >                         <ketancmaheshwari at gmail.com>, "Mihael
> > > >                         Hategan"
> > > >                         <hategan at mcs.anl.gov>
> > > >                         Cc: "swift-devel Devel"
> > > >                         <swift-devel at ci.uchicago.edu>, "Papia
> > > >                         Rizwan"
> > > >                         <papia.rizwan at gmail.com>
> > > >                         Sent: Monday, August 22, 2011 10:47:56 AM
> > > >
> > > >
> > > >                         Subject: Re: Blocker issue for 0.93: DSSAT
> > > >                         script does not complete, 2nd coaster blocks
> > > >                         dont start?
> > > >
> > > >                         Can you try this on PADS using small jobs in
> > > >                         the fast queue?
> > > >
> > > >                         I have not thought this all the way through,
> > > >                         but perhaps coasters will honor maxtime and
> > > >                         maxwalltime on any coaster block, even if
> > > >                         it's not running on a batch scheduler. In that
> > > >                         case perhaps you can replicate the problem
> > > >                         on the MCS pool or, better yet, on localhost.
> > > >
> > > >
> > > >                         In these runs, what were the values of the
> > > >                         execution.retries and lazy.errors flags?
> > > >                         Mihael, do those properties need to be set
> > > >                         to >0 and true, respectively, in order for
> > > >                         coasters to start new blocks correctly,
> > > >                         assuming that in some cases a job will run
> > > >                         longer than its maxwalltime?
> > > >
> > > >
> > > >                         - Mike
> > > >
> > > >
> > > >
> > > >
> > > >                                 ______________________________________________
> > > >                                 From: "Ketan Maheshwari"
> > > >                                 <ketancmaheshwari at gmail.com>
> > > >                                 To: "Michael Wilde"
> > > >                                 <wilde at mcs.anl.gov>
> > > >                                 Cc: "Papia Rizwan"
> > > >                                 <papia.rizwan at gmail.com>,
> > > >                                 "swift-devel
> > > >                                 Devel" <swift-devel at ci.uchicago.edu>
> > > >                                 Sent: Monday, August 22, 2011
> > > >                                 10:32:31 AM
> > > >                                 Subject: Re: Blocker issue for 0.93:
> > > >                                 DSSAT script does not complete, 2nd
> > > >                                 coaster blocks dont start?
> > > >
> > > >                                 Mike,
> > > >
> > > >
> > > >                                 If I recall correctly, Papia has
> > > >                                 always been running her DSSAT app
> > > >                                 with
> > > >                                 0.92. She has not yet tried with
> > > >                                 0.93.
> > > >                                 I too tried with 0.92 with her sites
> > > >                                 file settings.
> > > >
> > > >
> > > >                                 I once tried it with 0.93 on PADS,
> > > >                                 but the jobs never made it from the
> > > >                                 queue to running.
> > > >
> > > >
> > > >                                 I will give another try today as it
> > > >                                 might be that PADS was too busy last
> > > >                                 week. As I recall Jon was also
> > > >                                 struggling to get access.
> > > >
> > > >
> > > >                                 Regards,
> > > >                                 Ketan
> > > >
> > > >                                 On Mon, Aug 22, 2011 at 10:24 AM,
> > > >                                 Michael Wilde <wilde at mcs.anl.gov>
> > > >                                 wrote:
> > > >                                         Papia, Ketan,
> > > >
> > > >                                         In reviewing 0.93 work
> > > >                                         remaining with David, I
> > > >                                         remembered this issue.
> > > >
> > > >                                         You both reported that the
> > > >                                         DSSAT application script
> > > >                                         doesn't finish on PADS: it
> > > >                                         seems not to start the
> > > >                                         second round of coaster
> > > >                                         blocks that it needs to
> > > >                                         complete (as I recall, but
> > > >                                         this may not be correct).
> > > >                                         This needs to be researched
> > > >                                         and filed as a bug (or, an
> > > >                                         error in the sites spec
> > > >                                         needs to be identified and
> > > >                                         made clear in the site guide
> > > >                                         if it turns out to be the
> > > >                                         problem).
> > > >
> > > >                                         Possibly there is an issue
> > > >                                         with jobs failing at the end
> > > >                                         of the coaster blocks, and
> > > >                                         you don't have the necessary
> > > >                                         retry values set for the
> > > >                                         PADS site?
> > > >
> > > >                                         We need an example run with
> > > >                                         logs and full details. Can
> > > >                                         you
> > > >                                         try to re-create this with a
> > > >                                         much smaller initial
> > > >                                         allocation, and see if
> > > >                                         coasters is transitioning
> > > >                                         from
> > > >                                         its initial blocks to the
> > > >                                         next
> > > >                                         blocks?
> > > >
> > > >                                         Can you give this high prio
> > > >                                         for today?
> > > >
> > > >                                         Thanks,
> > > >
> > > >                                         - Mike
> > > >
> > > >
> > > >
> > > >
> > > >                                 --
> > > >                                 Ketan
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >                         --
> > > >                         Michael Wilde
> > > >                         Computation Institute, University of Chicago
> > > >                         Mathematics and Computer Science Division
> > > >                         Argonne National Laboratory
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >                 --
> > > >                 Michael Wilde
> > > >                 Computation Institute, University of Chicago
> > > >                 Mathematics and Computer Science Division
> > > >                 Argonne National Laboratory
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >         --
> > > >         Ketan
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Ketan
> > > >
> > > >
> > > >
> > > > _______________________________________________
> > > > Swift-devel mailing list
> > > > Swift-devel at ci.uchicago.edu
> > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
> >



-- 
Ketan
-------------- next part --------------
A non-text attachment was scrubbed...
Name: catsnsleep-20110823-1527-b6kpd8h9.log
Type: application/octet-stream
Size: 147190 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20110823/860637a2/attachment.obj>

