[Swift-devel] Blocker issue for 0.93: DSSAT script does not complete, 2nd coaster blocks dont start?

Mihael Hategan hategan at mcs.anl.gov
Tue Aug 23 16:05:13 CDT 2011


That's benign, but I committed a patch to prevent it from happening in
cog r3237.

On Tue, 2011-08-23 at 15:46 -0500, Ketan Maheshwari wrote:
> Hi,
> 
> 
> I tried a smaller run, catsnsleep, sleeptime 60sec, n=20,
> walltime=2min
> 
> 
> The run indeed completed but I saw this in the middle (I suppose at
> the end of first walltime slot):
> 
> 
> Command(13, HEARTBEAT): handling reply timeout; sendReqTime=110823-153059.847, sendTime=110823-153059.847, now=110823-153259.860
> Command(13, HEARTBEAT)fault was: Reply timeout
> org.globus.cog.karajan.workflow.service.ReplyTimeoutException
>     at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:288)
>     at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:293)
>     at java.util.TimerThread.mainLoop(Timer.java:512)
>     at java.util.TimerThread.run(Timer.java:462)
> Command(13, HEARTBEAT)fault was: Invalid channel: 914784201: {}
> org.globus.cog.karajan.workflow.service.ReplyTimeoutException
>     at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:288)
>     at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:293)
>     at java.util.TimerThread.mainLoop(Timer.java:512)
>     at java.util.TimerThread.run(Timer.java:462)
> Heartbeat failed: Invalid channel: 914784201: {}
> org.globus.cog.karajan.workflow.service.ReplyTimeoutException
>     at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:288)
>     at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:293)
>     at java.util.TimerThread.mainLoop(Timer.java:512)
>     at java.util.TimerThread.run(Timer.java:462)
> 
> 
> The log is attached. I will try a longish run with more heap memory.
> 
> 
> 
> 
> Regards,
> Ketan
> 
> 
> 
> On Tue, Aug 23, 2011 at 3:21 PM, Mihael Hategan <hategan at mcs.anl.gov>
> wrote:
>         On Tue, 2011-08-23 at 14:46 -0500, Michael Wilde wrote:
>         > Well, I raised that issue, but Ketan claimed that the
>         > failure to start more jobs occurs without that message as
>         > well.
>         
>         
>         Fine. I'll need the log from that then.
>         
>         
>         >
>         > Do you believe that the Out of Mem error is the root cause?
>         >
>         > Ketan, can you point to logs without the OOM error?
>         >
>         > Can you re-run the catsn with more memory?
>         >
>         > And more importantly: can you run a *very small* catsnsleep
>         > test where you carefully craft the sleep times and settings to
>         > cause one (very short duration) coaster block to time out and
>         > verify that a new block is submitted and in new job and that
>         > the script runs to completion?
>         >
>         > I suggested in the ticket that David do this; can you both
>         > discuss and see who is better positioned to do this sooner, so
>         > we can decide if we have a blocker here, or just something
>         > that needs better configuration and perhaps a note in the user
>         > guide telling users what to watch out for in this regard? (I
>         > think for example we do not tell how and when to increase
>         > memory in the user guide, at the moment).  Nor are we clear
>         > enough on the issues around maxtime, maxwalltime, and the
>         > sizing of coaster blocks.
>         >
>         > Thanks,
>         >
>         > - Mike
>         >
>         >
>         > ----- Original Message -----
>         > > From: "Mihael Hategan" <hategan at mcs.anl.gov>
>         > > To: "Ketan Maheshwari" <ketancmaheshwari at gmail.com>
>         > > Cc: "Michael Wilde" <wilde at mcs.anl.gov>, "Swift Devel" <swift-devel at ci.uchicago.edu>
>         > > Sent: Tuesday, August 23, 2011 2:36:00 PM
>         > > Subject: Re: [Swift-devel] Blocker issue for 0.93: DSSAT script does not complete, 2nd coaster blocks dont start?
>         > >
>         > > mike at blabla:~/tmp$ grep "heap" catsn-20110823-1116-94roxc18.log
>         > > 2011-08-23 11:16:49,904-0500 DEBUG Loader Max heap: 257294336
>         > > 2011-08-23 11:38:50,957-0500 DEBUG VDL2ExecutionContext
>         > > java.lang.OutOfMemoryError: Java heap space
>         > > java.lang.OutOfMemoryError: Java heap space
>         > > Caused by: java.lang.OutOfMemoryError: Java heap space
>         > > Caused by: java.lang.OutOfMemoryError: Java heap space
>         > > java.lang.OutOfMemoryError: Java heap space
>         > > Caused by: java.lang.OutOfMemoryError: Java heap space
>         > > Caused by: java.lang.OutOfMemoryError: Java heap space
>         > >
>         > >
>         > > On Tue, 2011-08-23 at 11:47 -0500, Ketan Maheshwari wrote:
>         > > > Hello Mike,
>         > > >
>         > > >
>         > > > I tried another run with 30K tasks on PADS. This run stopped after
>         > > > completing 16K+ tasks.
>         > > >
>         > > >
>         > > > The log file is:
>         > > > http://www.ci.uchicago.edu/~ketan/catsn-20110823-1116-94roxc18.log
>         > > >
>         > > >
>         > > > The exception messages I get are attached to this mail.
>         > > >
>         > > >
>         > > > Looking at the messages, it seems the coasters are unable to restart
>         > > > the submit block once the walltime has expired for a run.
>         > > >
>         > > >
>         > > > Regards,
>         > > > Ketan
>         > > >
>         > > > On Tue, Aug 23, 2011 at 11:06 AM, Ketan Maheshwari
>         > > > <ketancmaheshwari at gmail.com> wrote:
>         > > >         Mike,
>         > > >
>         > > >
>         > > >         This looks like the issue of coaster blocks not restarting. I
>         > > >         can try the same run again and see if this persists.
>         > > >
>         > > >
>         > > >         On Tue, Aug 23, 2011 at 11:04 AM, Michael Wilde
>         > > >         <wilde at mcs.anl.gov> wrote:
>         > > >                 Ketan,
>         > > >
>         > > >
>         > > >                 Should I ask David to try to replicate this problem?
>         > > >
>         > > >
>         > > >                 Did you figure out why your jobs are not starting on
>         > > >                 PADS?
>         > > >
>         > > >
>         > > >                 - Mike
>         > > >
>         > > >
>         > > >
>         > > >
>         > > >                 ______________________________________________________
>         > > >                         From: "Michael Wilde" <wilde at mcs.anl.gov>
>         > > >                         To: "Ketan Maheshwari" <ketancmaheshwari at gmail.com>, "Mihael Hategan" <hategan at mcs.anl.gov>
>         > > >                         Cc: "swift-devel Devel" <swift-devel at ci.uchicago.edu>, "Papia Rizwan" <papia.rizwan at gmail.com>
>         > > >                         Sent: Monday, August 22, 2011 10:47:56 AM
>         > > >                         Subject: Re: Blocker issue for 0.93: DSSAT script does not complete, 2nd coaster blocks dont start?
>         > > >
>         > > >                         Can you try this on PADS using small jobs in
>         > > >                         the fast queue?
>         > > >
>         > > >                         I have not thought this all the way through,
>         > > >                         but perhaps coasters will honor maxtime and
>         > > >                         maxwalltime on any coaster block, even if it's
>         > > >                         not running on a batch scheduler. In that
>         > > >                         case perhaps you can replicate the problem on
>         > > >                         the MCS pool or, better yet, on localhost.
>         > > >
>         > > >
>         > > >                         In these runs, what were the values of
>         > > >                         the execution.retries and lazy.errors flags?
>         > > >                         Mihael, do those properties need to be set to
>         > > >                         >0 and true, respectively, in order for
>         > > >                         coasters to start new blocks correctly,
>         > > >                         assuming that in some cases a job will run
>         > > >                         longer than its maxwalltime?
>         > > >
>         > > >
>         > > >                         - Mike
>         > > >
>         > > >
>         > > >
>         > > >
>         > > >                         ______________________________________________
>         > > >                                 From: "Ketan Maheshwari" <ketancmaheshwari at gmail.com>
>         > > >                                 To: "Michael Wilde" <wilde at mcs.anl.gov>
>         > > >                                 Cc: "Papia Rizwan" <papia.rizwan at gmail.com>, "swift-devel Devel" <swift-devel at ci.uchicago.edu>
>         > > >                                 Sent: Monday, August 22, 2011 10:32:31 AM
>         > > >                                 Subject: Re: Blocker issue for 0.93: DSSAT script does not complete, 2nd coaster blocks dont start?
>         > > >
>         > > >                                 Mike,
>         > > >
>         > > >
>         > > >                                 If I recall correctly, Papia has
>         > > >                                 always been running her DSSAT app with
>         > > >                                 0.92. She has not yet tried with 0.93.
>         > > >                                 I too tried with 0.92 with her sites
>         > > >                                 file settings.
>         > > >
>         > > >
>         > > >                                 I once tried it with 0.93 on PADS but
>         > > >                                 could never get from the queue into
>         > > >                                 the running state.
>         > > >
>         > > >
>         > > >                                 I will give it another try today, as it
>         > > >                                 might be that PADS was too busy last
>         > > >                                 week. As I recall, Jon was also
>         > > >                                 struggling to get access.
>         > > >
>         > > >
>         > > >                                 Regards,
>         > > >                                 Ketan
>         > > >
>         > > >                                 On Mon, Aug 22, 2011 at 10:24 AM,
>         > > >                                 Michael Wilde <wilde at mcs.anl.gov> wrote:
>         > > >                                         Papia, Ketan,
>         > > >
>         > > >                                         In reviewing 0.93 work remaining with David, I
>         > > >                                         remembered this issue.
>         > > >
>         > > >                                         You both reported that the DSSAT application script
>         > > >                                         doesn't finish on PADS - it seems not to start the
>         > > >                                         second round of coaster blocks that it needs to
>         > > >                                         complete (as I recall, but this may not be
>         > > >                                         correct). This needs to be researched and filed as a
>         > > >                                         bug (or, an error in the sites spec needs to be
>         > > >                                         identified and made clear in the site guide if it
>         > > >                                         turns out to be the problem).
>         > > >
>         > > >                                         Possibly there is an issue with jobs failing at the
>         > > >                                         end of the coaster blocks, and you don't have the
>         > > >                                         necessary retry values set for the PADS site?
>         > > >
>         > > >                                         We need an example run with logs and full details.
>         > > >                                         Can you try to re-create this with a much smaller
>         > > >                                         initial allocation, and see if coasters is
>         > > >                                         transitioning from its initial blocks to the next
>         > > >                                         blocks?
>         > > >
>         > > >                                         Can you give this high prio for today?
>         > > >
>         > > >                                         Thanks,
>         > > >
>         > > >                                         - Mike
>         > > >
>         > > >
>         > > >
>         > > >
>         > > >                                 --
>         > > >                                 Ketan
>         > > >
>         > > >
>         > > >
>         > > >
>         > > >
>         > > >
>         > > >                         --
>         > > >                         Michael Wilde
>         > > >                         Computation Institute, University of Chicago
>         > > >                         Mathematics and Computer Science Division
>         > > >                         Argonne National Laboratory
>         > > >
>         > > >
>         > > >
>         > > >
>         > > >
>         > > >
>         > > >                 --
>         > > >                 Michael Wilde
>         > > >                 Computation Institute, University of Chicago
>         > > >                 Mathematics and Computer Science Division
>         > > >                 Argonne National Laboratory
>         > > >
>         > > >
>         > > >
>         > > >
>         > > >
>         > > >
>         > > >         --
>         > > >         Ketan
>         > > >
>         > > >
>         > > >
>         > > >
>         > > >
>         > > >
>         > > >
>         > > > --
>         > > > Ketan
>         > > >
>         > > >
>         > > >
>         > > > _______________________________________________
>         > > > Swift-devel mailing list
>         > > > Swift-devel at ci.uchicago.edu
>         > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
>         >
>         
>         
>         
> 
> 
> 
> 
> -- 
> Ketan
> 
> 
> 
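
A note on the HEARTBEAT trace quoted above: the thread does not say which
timer fired, so as a quick sanity check the two timestamps on the
"handling reply timeout" line can be subtracted. This is only a minimal
sketch. It assumes the timestamps use a yymmdd-HHMMSS.mmm layout (inferred
from the values, not documented in this thread), and elapsed_seconds is a
hypothetical helper, not part of Swift or CoG.

```python
from datetime import datetime

# Assumed layout of the timestamps in the log line: yymmdd-HHMMSS.mmm
LOG_TIME_FORMAT = "%y%m%d-%H%M%S.%f"


def elapsed_seconds(send_time: str, now: str) -> float:
    """Return the number of seconds between two log timestamps."""
    start = datetime.strptime(send_time, LOG_TIME_FORMAT)
    end = datetime.strptime(now, LOG_TIME_FORMAT)
    return (end - start).total_seconds()


if __name__ == "__main__":
    # Values copied from the "handling reply timeout" line of the trace.
    print(elapsed_seconds("110823-153059.847", "110823-153259.860"))
```

The gap comes out at roughly 120 seconds. That matches the walltime=2min
used in the run, but a fixed two-minute reply timeout would give the same
figure, so the number alone does not tell the two apart.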




