[Swift-devel] Blocker issue for 0.93: DSSAT script does not complete, 2nd coaster blocks dont start?

Mihael Hategan hategan at mcs.anl.gov
Tue Aug 23 23:04:03 CDT 2011


Also benign, but annoying, so I'd like to get these ironed out. Can you
try r3240?

On Tue, 2011-08-23 at 20:12 -0500, Ketan Maheshwari wrote:
> Hi again,
> 
> 
> I tried a larger run on PADS with similar sleep but larger n
> parameters. The run seemed to be progressing well (I killed it by
> mistake), but the log does show some coaster block shutdown and
> network-related exception messages.
> 
> 
> Attached is the log.
> 
> 
> Regards,
> Ketan
> 
> 
> 
> On Tue, Aug 23, 2011 at 4:05 PM, Mihael Hategan <hategan at mcs.anl.gov>
> wrote:
>         That's benign, but I committed a patch to prevent it from
>         happening in cog r3237.
>         
>         
>         On Tue, 2011-08-23 at 15:46 -0500, Ketan Maheshwari wrote:
>         > Hi,
>         >
>         >
>         > I tried a smaller run, catsnsleep, sleeptime 60sec, n=20,
>         > walltime=2min
>         >
>         >
>         > The run indeed completed, but I saw this in the middle (I
>         > suppose at the end of the first walltime slot):
>         >
>         >
>         > Command(13, HEARTBEAT): handling reply timeout;
>         > sendReqTime=110823-153059.847, sendTime=110823-153059.847,
>         > now=110823-153259.860
>         > Command(13, HEARTBEAT)fault was: Reply timeout
>         > org.globus.cog.karajan.workflow.service.ReplyTimeoutException
>         > at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:288)
>         > at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:293)
>         > at java.util.TimerThread.mainLoop(Timer.java:512)
>         > at java.util.TimerThread.run(Timer.java:462)
>         > Command(13, HEARTBEAT)fault was: Invalid channel: 914784201: {}
>         > org.globus.cog.karajan.workflow.service.ReplyTimeoutException
>         > at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:288)
>         > at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:293)
>         > at java.util.TimerThread.mainLoop(Timer.java:512)
>         > at java.util.TimerThread.run(Timer.java:462)
>         > Heartbeat failed: Invalid channel: 914784201: {}
>         > org.globus.cog.karajan.workflow.service.ReplyTimeoutException
>         > at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:288)
>         > at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:293)
>         > at java.util.TimerThread.mainLoop(Timer.java:512)
>         > at java.util.TimerThread.run(Timer.java:462)
>         >
>         >
>         > The log is attached. I will try a longish run with more heap
>         > memory.
>         >
>         >
>         >
>         >
>         > Regards,
>         > Ketan
>         >
>         >
>         >
>         > On Tue, Aug 23, 2011 at 3:21 PM, Mihael Hategan <hategan at mcs.anl.gov>
>         > wrote:
>         >         On Tue, 2011-08-23 at 14:46 -0500, Michael Wilde wrote:
>         >         > Well, I raised that issue, but Ketan claimed that the
>         >         > failure to start more jobs occurs without that message as
>         >         > well.
>         >
>         >
>         >         Fine. I'll need the log from that then.
>         >
>         >
>         >         >
>         >         > Do you believe that the Out of Mem error is the root cause?
>         >         >
>         >         > Ketan, can you point to logs without the OOM error?
>         >         >
>         >         > Can you re-run the catsn with more memory?
>         >         >
>         >         > And more importantly: can you run a *very small* catsnsleep
>         >         > test where you carefully craft the sleep times and settings to
>         >         > cause one (very short duration) coaster block to time out, and
>         >         > verify that a new block is submitted as a new job and that
>         >         > the script runs to completion?
>         >         >
>         >         > I suggested in the ticket that David do this; can you both
>         >         > discuss and see who is better positioned to do this sooner, so
>         >         > we can decide if we have a blocker here, or just something
>         >         > that needs better configuration and perhaps a note in the user
>         >         > guide telling users what to watch out for in this regard? (I
>         >         > think, for example, that at the moment we do not tell users how
>         >         > and when to increase memory in the user guide.) Nor are we clear
>         >         > enough on the issues around maxtime, maxwalltime, and the
>         >         > sizing of coaster blocks.
>         >         >
>         >         > Thanks,
>         >         >
>         >         > - Mike
>         >         >
>         >         >
>         >         > ----- Original Message -----
>         >         > > From: "Mihael Hategan" <hategan at mcs.anl.gov>
>         >         > > To: "Ketan Maheshwari" <ketancmaheshwari at gmail.com>
>         >         > > Cc: "Michael Wilde" <wilde at mcs.anl.gov>, "Swift Devel" <swift-devel at ci.uchicago.edu>
>         >         > > Sent: Tuesday, August 23, 2011 2:36:00 PM
>         >         > > Subject: Re: [Swift-devel] Blocker issue for 0.93: DSSAT script does not complete, 2nd coaster blocks dont start?
>         >         > > mike at blabla:~/tmp$ grep "heap" catsn-20110823-1116-94roxc18.log
>         >         > > 2011-08-23 11:16:49,904-0500 DEBUG Loader Max heap: 257294336
>         >         > > 2011-08-23 11:38:50,957-0500 DEBUG VDL2ExecutionContext
>         >         > > java.lang.OutOfMemoryError: Java heap space
>         >         > > java.lang.OutOfMemoryError: Java heap space
>         >         > > Caused by: java.lang.OutOfMemoryError: Java heap space
>         >         > > Caused by: java.lang.OutOfMemoryError: Java heap space
>         >         > > java.lang.OutOfMemoryError: Java heap space
>         >         > > Caused by: java.lang.OutOfMemoryError: Java heap space
>         >         > > Caused by: java.lang.OutOfMemoryError: Java heap space
>         >         > >
>         >         > >
>         >         > > On Tue, 2011-08-23 at 11:47 -0500, Ketan Maheshwari wrote:
>         >         > > > Hello Mike,
>         >         > > >
>         >         > > >
>         >         > > > I tried another run with 30K tasks on PADS. This run stopped after
>         >         > > > completing 16K+ tasks.
>         >         > > >
>         >         > > >
>         >         > > > The log file is:
>         >         > > > http://www.ci.uchicago.edu/~ketan/catsn-20110823-1116-94roxc18.log
>         >         > > >
>         >         > > >
>         >         > > > The exception messages I get are attached to this mail.
>         >         > > >
>         >         > > >
>         >         > > > Looking at the messages, it seems the coasters are unable to restart
>         >         > > > the submit block once the walltime has expired for a run.
>         >         > > >
>         >         > > >
>         >         > > > Regards,
>         >         > > > Ketan
>         >         > > >
>         >         > > > On Tue, Aug 23, 2011 at 11:06 AM, Ketan Maheshwari
>         >         > > > <ketancmaheshwari at gmail.com> wrote:
>         >         > > >         Mike,
>         >         > > >
>         >         > > >
>         >         > > >         This looks like the coaster blocks not restarting issue. I
>         >         > > >         can try the same run again and see if this persists.
>         >         > > >
>         >         > > >
>         >         > > >         On Tue, Aug 23, 2011 at 11:04 AM, Michael Wilde
>         >         > > >         <wilde at mcs.anl.gov> wrote:
>         >         > > >                 Ketan,
>         >         > > >
>         >         > > >
>         >         > > >                 Should I ask David to try to replicate this problem?
>         >         > > >
>         >         > > >
>         >         > > >                 Did you figure out why your jobs are not starting on
>         >         > > >                 PADS?
>         >         > > >
>         >         > > >
>         >         > > >                 - Mike
>         >         > > >
>         >         > > >
>         >         > > >
>         >         > > >
>         >         > > >                         ______________________________________________________
>         >         > > >                         From: "Michael Wilde" <wilde at mcs.anl.gov>
>         >         > > >                         To: "Ketan Maheshwari" <ketancmaheshwari at gmail.com>, "Mihael Hategan" <hategan at mcs.anl.gov>
>         >         > > >                         Cc: "swift-devel Devel" <swift-devel at ci.uchicago.edu>, "Papia Rizwan" <papia.rizwan at gmail.com>
>         >         > > >                         Sent: Monday, August 22, 2011 10:47:56 AM
>         >         > > >                         Subject: Re: Blocker issue for 0.93: DSSAT script does not complete, 2nd coaster blocks dont start?
>         >         > > >
>         >         > > >                         Can you try this on PADS using small jobs in
>         >         > > >                         the fast queue?
>         >         > > >
>         >         > > >                         I have not thought this all the way through,
>         >         > > >                         but perhaps coasters will honor maxtime and
>         >         > > >                         maxwalltime on any coaster block, even if it's
>         >         > > >                         not running on a batch scheduler. In that
>         >         > > >                         case perhaps you can replicate the problem on
>         >         > > >                         the MCS pool or, better yet, on localhost.
>         >         > > >
>         >         > > >
>         >         > > >                         In these runs, what was the value of
>         >         > > >                         the execution.retries and lazy.errors flags?
>         >         > > >                         Mihael, do those properties need to be set
>         >         > > >                         to >0 and true, respectively, in order for
>         >         > > >                         coasters to start new blocks correctly,
>         >         > > >                         assuming that in some cases a job will run
>         >         > > >                         longer than its maxwalltime?
>         >         > > >
>         >         > > >
>         >         > > >                         - Mike
>         >         > > >
>         >         > > >
>         >         > > >
>         >         > > >
>         >         > > >                                 ______________________________________________
>         >         > > >                                 From: "Ketan Maheshwari" <ketancmaheshwari at gmail.com>
>         >         > > >                                 To: "Michael Wilde" <wilde at mcs.anl.gov>
>         >         > > >                                 Cc: "Papia Rizwan" <papia.rizwan at gmail.com>, "swift-devel Devel" <swift-devel at ci.uchicago.edu>
>         >         > > >                                 Sent: Monday, August 22, 2011 10:32:31 AM
>         >         > > >                                 Subject: Re: Blocker issue for 0.93: DSSAT script does not complete, 2nd coaster blocks dont start?
>         >         > > >
>         >         > > >                                 Mike,
>         >         > > >
>         >         > > >
>         >         > > >                                 If I recall correctly, Papia has always been running
>         >         > > >                                 her DSSAT app with 0.92. She has not yet tried with
>         >         > > >                                 0.93. I too tried with 0.92 with her sites file
>         >         > > >                                 settings.
>         >         > > >
>         >         > > >
>         >         > > >                                 I once tried it with 0.93 on PADS, but my jobs could
>         >         > > >                                 never get from the queue into the running state.
>         >         > > >
>         >         > > >
>         >         > > >                                 I will give it another try today, as it might be that
>         >         > > >                                 PADS was too busy last week. As I recall, Jon was also
>         >         > > >                                 struggling to get access.
>         >         > > >
>         >         > > >
>         >         > > >                                 Regards,
>         >         > > >                                 Ketan
>         >         > > >
>         >         > > >                                 On Mon, Aug 22, 2011 at 10:24 AM, Michael Wilde
>         >         > > >                                 <wilde at mcs.anl.gov> wrote:
>         >         > > >                                         Papia, Ketan,
>         >         > > >
>         >         > > >                                         In reviewing 0.93 work remaining with David, I
>         >         > > >                                         remembered this issue.
>         >         > > >
>         >         > > >                                         You both reported that the DSSAT application script
>         >         > > >                                         doesn't finish on PADS - it seems not to start the
>         >         > > >                                         second round of coaster blocks that it needs to
>         >         > > >                                         complete (as I recall, but this may not be correct).
>         >         > > >                                         This needs to be researched and filed as a bug (or,
>         >         > > >                                         an error in the sites spec needs to be identified
>         >         > > >                                         and made clear in the site guide if it turns out to
>         >         > > >                                         be the problem).
>         >         > > >
>         >         > > >                                         Possibly there is an issue with jobs failing at the
>         >         > > >                                         end of the coaster blocks, and you don't have the
>         >         > > >                                         necessary retry values set for the PADS site?
>         >         > > >
>         >         > > >                                         We need an example run with logs and full details.
>         >         > > >                                         Can you try to re-create this with a much smaller
>         >         > > >                                         initial allocation, and see if coasters is
>         >         > > >                                         transitioning from its initial blocks to the next
>         >         > > >                                         blocks?
>         >         > > >
>         >         > > >                                         Can you give this high prio for today?
>         >         > > >
>         >         > > >                                         Thanks,
>         >         > > >
>         >         > > >                                         - Mike
>         >         > > >
>         >         > > >
>         >         > > >
>         >         > > >
>         >         > > >                                 --
>         >         > > >                                 Ketan
>         >         > > >
>         >         > > >
>         >         > > >
>         >         > > >
>         >         > > >
>         >         > > >
>         >         > > >                         --
>         >         > > >                         Michael Wilde
>         >         > > >                         Computation Institute, University of Chicago
>         >         > > >                         Mathematics and Computer Science Division
>         >         > > >                         Argonne National Laboratory
>         >         > > >
>         >         > > >
>         >         > > >
>         >         > > >
>         >         > > >
>         >         > > >
>         >         > > >                 --
>         >         > > >                 Michael Wilde
>         >         > > >                 Computation Institute, University of Chicago
>         >         > > >                 Mathematics and Computer Science Division
>         >         > > >                 Argonne National Laboratory
>         >         > > >
>         >         > > >
>         >         > > >
>         >         > > >
>         >         > > >
>         >         > > >
>         >         > > >         --
>         >         > > >         Ketan
>         >         > > >
>         >         > > >
>         >         > > >
>         >         > > >
>         >         > > >
>         >         > > >
>         >         > > >
>         >         > > > --
>         >         > > > Ketan
>         >         > > >
>         >         > > >
>         >         > > >
>         >         > > >
>         >         > > > _______________________________________________
>         >         > > > Swift-devel mailing list
>         >         > > > Swift-devel at ci.uchicago.edu
>         >         > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
>         >         >
>         >
>         >
>         >
>         >
>         >
>         >
>         >
>         > --
>         > Ketan
>         >
>         >
>         >
>         
>         
>         
> 
> 
> 
> 
> -- 
> Ketan
> 
> 
> 
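
For reference, two numbers quoted in this thread can be checked with a little
arithmetic. The HEARTBEAT message gives sendTime=110823-153059.847 and
now=110823-153259.860, which are roughly 120 seconds apart, and the Loader
line reports a max heap of 257294336 bytes, which is roughly 245 MiB. The
Python sketch below is only an illustration: the timestamp layout
(yyMMdd-HHmmss.SSS) is an assumption inferred from the quoted values, and the
function names are hypothetical, not part of Swift or CoG.

```python
from datetime import datetime

# Assumed layout of the timestamps in the quoted HEARTBEAT message,
# inferred from values such as 110823-153059.847 (yyMMdd-HHmmss.SSS).
# The thread itself does not state the format.
STAMP_FORMAT = "%y%m%d-%H%M%S.%f"


def elapsed_seconds(sent: str, now: str) -> float:
    """Return the seconds between two log timestamps (hypothetical helper)."""
    start = datetime.strptime(sent, STAMP_FORMAT)
    end = datetime.strptime(now, STAMP_FORMAT)
    return (end - start).total_seconds()


def bytes_to_mib(n: int) -> float:
    """Convert a byte count, such as the 'Max heap' log value, to MiB."""
    return n / (1024 * 1024)


if __name__ == "__main__":
    # Values copied from the quoted log excerpts.
    print(elapsed_seconds("110823-153059.847", "110823-153259.860"))
    print(bytes_to_mib(257294336))
```

The same two helpers can be pointed at other sendTime/now pairs or heap values
from a log when comparing runs with different settings.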




