[Swift-devel] Swift 0.93 RC3 hangs after all jobs seem to be complete

Mihael Hategan hategan at mcs.anl.gov
Tue Nov 1 12:41:09 CDT 2011


jstack -l <pid>
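
A minimal sketch of how that capture could be scripted, assuming jstack from the same JDK that runs Swift is on the PATH. The function names jstack_log_name and capture_jstack are hypothetical helpers, not part of Swift, and the PID still has to be looked up by hand (for example with "ps -u $USER -H", as described further down the thread).

```shell
#!/bin/sh
# Sketch only: save a thread dump from a hung Swift run to a timestamped file.
# jstack_log_name and capture_jstack are hypothetical helper names.

# Build a timestamped file name of the form jstackYYYYMMDDhhmmss.log
jstack_log_name() {
    printf 'jstack%s.log\n' "$(date +%Y%m%d%H%M%S)"
}

# Dump all threads of the given JVM, including lock ownership details (-l),
# and print the name of the file the dump was written to.
capture_jstack() {
    pid=$1
    out=$(jstack_log_name)
    jstack -l "$pid" > "$out" && printf '%s\n' "$out"
}

# Usage: replace 12345 with the PID of the java process below the swift command.
#   capture_jstack 12345
```

Taking the dump as soon as possible after the hang checker fires, and sending it together with the matching Swift log, keeps the thread states close to the moment of the hang.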

On Tue, 2011-11-01 at 12:23 -0500, Tim Armstrong wrote:
> I think I'm seeing a similar deadlock in the latest version of
> Swift.  I'm first going to verify that this is actually happening
> (update Swift, recompile, etc), but what information should I collect
> that would be useful for debugging?
> 
> - Tim
> 
> On Sat, Oct 29, 2011 at 7:58 PM, Mihael Hategan <hategan at mcs.anl.gov>
> wrote:
>         This deadlock is now fixed (swift r5262).
>         
>         
>         On Wed, 2011-10-26 at 11:14 -0500, David Kelly wrote:
>         > I think I've found a way to reproduce this. From the test
>         suite, if you run
>         language-behaviour/mappers/075-array-mapper.swift a few times,
>         you'll run into a deadlock which looks very similar to the one
>         Sheri is seeing. Here is the jstack:
>         >
>         >
>         http://www.ci.uchicago.edu/~davidk/logs/jstack20111025110620.log
>         >
>         > David
>         >
>         > ----- Original Message -----
>         > > From: "Michael Wilde" <wilde at mcs.anl.gov>
>         > > To: "Mihael Hategan" <hategan at mcs.anl.gov>, "David Kelly"
>         <davidk at ci.uchicago.edu>
>         > > Cc: "Swift Devel" <swift-devel at ci.uchicago.edu>, "Sheri
>         Mickelson" <mickelso at mcs.anl.gov>
>         > > Sent: Tuesday, October 25, 2011 2:10:04 PM
>         > > Subject: Re: Swift 0.93 RC3 hangs after all jobs seem to
>         be complete
>         > > Mihael, David,
>         > >
>         > > Can you both report on what you believe the status of this
>         bug is?
>         > >
>         > > I think the subject line here is a bit misleading, in that
>         it seems
>         > > that a similar thing - ie the workflow deadlocks - was
>         happening both
>         > > at the start and at the end of various scripts, and
>         possibly at
>         > > intermediate points.
>         > >
>         > > I *think* that Sheri was seeing hangs at the start and in
>         the middle;
>         > > David was seeing hangs at the end.
>         > >
>         > > Talking to David just now he reported diagnosing his hang
>         case down to
>         > > a situation where the coaster scheduler emits a
>         "null" (ill-formed)
>         > > job to PBS at the tail end of a workflow. He inserted a
>         workaround to
>         > > ignore (not submit) such "null" jobs. I'm not sure if that
>         was
>         > > committed, or just tested. David, can you post the
>         details?
>         > >
>         > > Mihael, did you look at the jstack that Sheri attached to
>         the posting
>         > > below?
>         > >
>         > > Do you have any theories or fixes for this issue or
>         issues? Unless we
>         > > believe it's resolved, David, please file in bugzilla and
>         attach
>         > > relevant postings from Sheri, David, and others on this
>         bug.
>         > >
>         > > Thanks,
>         > >
>         > > - Mike
>         > >
>         > >
>         > > ----- Original Message -----
>         > > > From: "Sheri Mickelson" <mickelso at mcs.anl.gov>
>         > > > To: "Mihael Hategan" <hategan at mcs.anl.gov>
>         > > > Cc: "Michael Wilde" <wilde at mcs.anl.gov>, "David Kelly"
>         > > > <davidk at ci.uchicago.edu>
>         > > > Sent: Wednesday, October 12, 2011 10:34:43 AM
>         > > > Subject: Re: Swift 0.93 RC3 hangs after all jobs seem to
>         be complete
>         > > > I just tried running again on fusion with 0.93RC3 and it
>         hung right
>         > > > away.
>         > > > It started with "No events in 10s." and then it looks
>         like it hung.
>         > > > This was run using coasters and I manually killed it
>         after about 5
>         > > > minutes.
>         > > > I attached both the log file and the jstack info.
>         > > >
>         > > > Thanks, Sheri
>         > > >
>         > > >
>         > > >
>         > > >
>         > > >
>         > > > On Oct 7, 2011, at 2:47 PM, Mihael Hategan wrote:
>         > > >
>         > > > > Yeah, so the hang checker doesn't show anything. Which
>         means it's
>         > > > > not a
>         > > > > swift flow issue.
>         > > > >
>         > > > > I would do what Mike says with jstack as soon after
>         the hang
>         > > > > checker
>         > > > > kicks in as possible.
>         > > > >
>         > > > > Mihael
>         > > > >
>         > > > > On Fri, 2011-10-07 at 12:12 -0500, Michael Wilde
>         wrote:
>         > > > >> Was: Re: Swift 0.93RC2 is bad - Re: Help on fusion
>         > > > >> Changed subject so you can see what this is
>         regarding, Mihael.
>         > > > >>
>         > > > >> ---
>         > > > >>
>         > > > >> Sheri, could you run this again? (Or have you
>         already, and if so,
>         > > > >> did it run to completion?)
>         > > > >>
>         > > > >> What I saw in the log yesterday was that all jobs
>         that were
>         > > > >> submitted to coasters ran successfully, including all
>         of their
>         > > > >> data
>         > > > >> transfers.
>         > > > >>
>         > > > >> But I also see that the Swift "hang checker" went
>         off, which
>         > > > >> indicates that some Java activity was indeed hung.
>         > > > >>
>         > > > >> When this happens again, can you run the command
>         "jstack -l PID"
>         > > > >> where PID is the process of the Swift Java command
>         (which you can
>         > > > >> best locate by using "ps -u $USER -H" and locate the
>         java process
>         > > > >> below the swift command). Then send us the jstack
>         output in
>         > > > >> addition to the associated Swift log.
>         > > > >>
>         > > > >> Mihael, in the meantime, can you take a look at the
>         log to see if
>         > > > >> you can spot any incomplete Swift activities that may
>         be hanging
>         > > > >> the run?
>         > > > >>
>         > > > >> Thanks,
>         > > > >>
>         > > > >> - Mike
>         > > > >>
>         > > > >>
>         > > > >> ----- Original Message -----
>         > > > >>> From: "Sheri Mickelson" <mickelso at mcs.anl.gov>
>         > > > >>> To: "David Kelly" <davidk at ci.uchicago.edu>
>         > > > >>> Cc: "Michael Wilde" <wilde at mcs.anl.gov>
>         > > > >>> Sent: Thursday, October 6, 2011 3:23:57 PM
>         > > > >>> Subject: Re: Swift 0.93RC2 is bad - Re: Help on
>         fusion
>         > > > >>> Here's the log file.
>         > > > >>>
>         > > > >>>
>         > > > >>>
>         > > > >>> On Oct 6, 2011, at 3:19 PM, David Kelly wrote:
>         > > > >>>
>         > > > >>>> Hi Sheri,
>         > > > >>>>
>         > > > >>>> Could you please send the log file so we can take a
>         closer look
>         > > > >>>> and
>         > > > >>>> see what's going on there?
>         > > > >>>>
>         > > > >>>> Thanks,
>         > > > >>>> David
>         > > > >>>>
>         > > > >>>> ----- Original Message -----
>         > > > >>>>> From: "Sheri Mickelson" <mickelso at mcs.anl.gov>
>         > > > >>>>> To: "David Kelly" <davidk at ci.uchicago.edu>
>         > > > >>>>> Cc: "Michael Wilde" <wilde at mcs.anl.gov>
>         > > > >>>>> Sent: Thursday, October 6, 2011 3:07:44 PM
>         > > > >>>>> Subject: Re: Swift 0.93RC2 is bad - Re: Help on
>         fusion
>         > > > >>>>> I just tried this version and had a little bit
>         more luck. It
>         > > > >>>>> looked
>         > > > >>>>> like everything was running fine, but now it looks
>         like it's
>         > > > >>>>> hung
>         > > > >>>>> near
>         > > > >>>>> the end. I keep getting the message "Finished
>         > > > >>>>> successfully:66".
>         > > > >>>>> The
>         > > > >>>>> message before that was "Checking status:1
>         Finished
>         > > > >>>>> successfully:65".
>         > > > >>>>>
>         > > > >>>>> Thanks, Sheri
>         > > > >>>>>
>         > > > >>>>> On Oct 6, 2011, at 2:14 PM, David Kelly wrote:
>         > > > >>>>>
>         > > > >>>>>>
>         > > > >>>>>> It's been a while since RC2 was created. There
>         have been
>         > > > >>>>>> quite
>         > > > >>>>>> a
>         > > > >>>>>> lot
>         > > > >>>>>> of fixes since then, so I just created a new 0.93
>         RC3. The
>         > > > >>>>>> direct
>         > > > >>>>>> download can be found at:
>         > > > >>>>>>
>         > > > >>>>>>
>         http://www.ci.uchicago.edu/swift/packages/swift-0.93RC3.tar.gz
>         > > > >>>>>>
>         > > > >>>>>> Hope this helps.
>         > > > >>>>>>
>         > > > >>>>>> Thanks,
>         > > > >>>>>> David
>         > > > >>>>>>
>         > > > >>>>>> ----- Original Message -----
>         > > > >>>>>>> From: "Michael Wilde" <wilde at mcs.anl.gov>
>         > > > >>>>>>> To: "Sheri Mickelson" <mickelso at mcs.anl.gov>
>         > > > >>>>>>> Cc: "David Kelly" <davidk at ci.uchicago.edu>
>         > > > >>>>>>> Sent: Thursday, October 6, 2011 12:17:56 PM
>         > > > >>>>>>> Subject: Swift 0.93RC2 is bad - Re: Help on
>         fusion
>         > > > >>>>>>> Sheri,
>         > > > >>>>>>>
>         > > > >>>>>>> Your AMWG script is failing because the
>         swift-0.93RC2
>         > > > >>>>>>> release
>         > > > >>>>>>> is
>         > > > >>>>>>> bad.
>         > > > >>>>>>>
>         > > > >>>>>>> The error it's showing in the log is this:
>         "2011-10-06
>         > > > >>>>>>> 11:46:24,635-0500 DEBUG vdl:execute2
>         APPLICATION_EXCEPTION
>         > > > >>>>>>> jobid=ncatted-se54rxgk - Application exception:
>         null
>         > > > >>>>>>> Caused by:
>         > > > >>>>>>> org.globus.cog.abstraction.impl.common.task.TaskSubmissionException:
>         > > > >>>>>>> lowOverallocation must be < 1.0 (currently
>         100.0)"
>         > > > >>>>>>>
>         > > > >>>>>>> ...which was fixed in SVN for 0.93.
>         > > > >>>>>>>
>         > > > >>>>>>> Did you load this from a tarball or from SVN?
>         > > > >>>>>>>
>         > > > >>>>>>> David, do we have a more recent 0.93 release
>         candidate?
>         > > > >>>>>>>
>         > > > >>>>>>> If not, then can you build an 0.93 from SVN? If
>         not, we can
>         > > > >>>>>>> do
>         > > > >>>>>>> that
>         > > > >>>>>>> for you. I'll start a build in the meantime just
>         in case.
>         > > > >>>>>>>
>         > > > >>>>>>> Sorry about this error, Sheri.
>         > > > >>>>>>>
>         > > > >>>>>>> - Mike
>         > > > >>>>>>>
>         > > > >>>>>>>
>         > > > >>>>>>>
>         > > > >>>>>>> ----- Original Message -----
>         > > > >>>>>>>> From: "Sheri Mickelson" <mickelso at mcs.anl.gov>
>         > > > >>>>>>>> To: "Michael Wilde" <wilde at mcs.anl.gov>
>         > > > >>>>>>>> Sent: Thursday, October 6, 2011 11:52:58 AM
>         > > > >>>>>>>> Subject: Re: Help on fusion
>         > > > >>>>>>>> I have everything in
>         > > > >>>>>>>> /fusion/gpfs/home/mickelso/amwg-swift/svnRepo/swift
>         > > > >>>>>>>>
>         > > > >>>>>>>> I believe the pathnames are correct.
>         > > > >>>>>>>>
>         > > > >>>>>>>> I have not tried running on localhost.
>         > > > >>>>>>>>
>         > > > >>>>>>>> I'm using swift version swift-0.93RC2.
>         > > > >>>>>>>>
>         > > > >>>>>>>> I'm not at Argonne today, but will be in
>         tomorrow.
>         > > > >>>>>>>>
>         > > > >>>>>>>> -Sheri
>         > > > >>>>>>>>
>         > > > >>>>>>>> On Oct 6, 2011, at 11:39 AM, Michael Wilde
>         wrote:
>         > > > >>>>>>>>
>         > > > >>>>>>>>> Hi Sheri,
>         > > > >>>>>>>>>
>         > > > >>>>>>>>> can you point me to the log, run directory,
>         and work dir
>         > > > >>>>>>>>> of
>         > > > >>>>>>>>> this
>         > > > >>>>>>>>> run?
>         > > > >>>>>>>>>
>         > > > >>>>>>>>> I think we'll need to look into the log,
>         and the .d
>         > > > >>>>>>>>> directories,
>         > > > >>>>>>>>> and possibly the work dir to locate the stdout
>         of the
>         > > > >>>>>>>>> failing
>         > > > >>>>>>>>> apps.
>         > > > >>>>>>>>>
>         > > > >>>>>>>>> - are the pathnames correct?
>         > > > >>>>>>>>>
>         > > > >>>>>>>>> - does the run work on localhost? (ie, are the
>         PBS jobs
>         > > > >>>>>>>>> running
>         > > > >>>>>>>>> or
>         > > > >>>>>>>>> failing)?
>         > > > >>>>>>>>>
>         > > > >>>>>>>>> - which Swift rev are you using?
>         > > > >>>>>>>>>
>         > > > >>>>>>>>> Are you at Argonne? I can stop by and we can
>         debug.
>         > > > >>>>>>>>>
>         > > > >>>>>>>>> - Mike
>         > > > >>>>>>>>>
>         > > > >>>>>>>>>
>         > > > >>>>>>>>> ----- Original Message -----
>         > > > >>>>>>>>>> From: "Sheri Mickelson"
>         <mickelso at mcs.anl.gov>
>         > > > >>>>>>>>>> To: "Michael Wilde" <wilde at mcs.anl.gov>
>         > > > >>>>>>>>>> Sent: Thursday, October 6, 2011 10:32:38 AM
>         > > > >>>>>>>>>> Subject: Help on fusion
>         > > > >>>>>>>>>> Hi Mike,
>         > > > >>>>>>>>>>
>         > > > >>>>>>>>>> The AMWG people at NCAR want to incorporate
>         the swift
>         > > > >>>>>>>>>> version
>         > > > >>>>>>>>>> into
>         > > > >>>>>>>>>> their
>         > > > >>>>>>>>>> main branch. Rob's at NCAR right now and
>         wants to have
>         > > > >>>>>>>>>> this
>         > > > >>>>>>>>>> done
>         > > > >>>>>>>>>> as
>         > > > >>>>>>>>>> soon as possible. I've been working on
>         incorporating the
>         > > > >>>>>>>>>> changes
>         > > > >>>>>>>>>> that
>         > > > >>>>>>>>>> were made in the last release and believe
>         that it's in
>         > > > >>>>>>>>>> decent
>         > > > >>>>>>>>>> shape.
>         > > > >>>>>>>>>> I want to test it on fusion, though, just to
>         make sure
>         > > > >>>>>>>>>> I'm
>         > > > >>>>>>>>>> handling
>         > > > >>>>>>>>>> the env variables correctly. I'm running into
>         an error
>         > > > >>>>>>>>>> when
>         > > > >>>>>>>>>> I
>         > > > >>>>>>>>>> run.
>         > > > >>>>>>>>>> I'm getting "Failed to transfer wrapper log
>         for job <job>"
>         > > > >>>>>>>>>> for
>         > > > >>>>>>>>>> all
>         > > > >>>>>>>>>> of
>         > > > >>>>>>>>>> the app calls. What usually causes this? I'm
>         stuck on
>         > > > >>>>>>>>>> where
>         > > > >>>>>>>>>> to
>         > > > >>>>>>>>>> look.
>         > > > >>>>>>>>>>
>         > > > >>>>>>>>>> Thanks, Sheri
>         > > > >>>>>>>>>
>         > > > >>>>>>>>> --
>         > > > >>>>>>>>> Michael Wilde
>         > > > >>>>>>>>> Computation Institute, University of Chicago
>         > > > >>>>>>>>> Mathematics and Computer Science Division
>         > > > >>>>>>>>> Argonne National Laboratory
>         > > > >>>>>>>>>
>         > > > >>>>>>>
>         > > > >>>>>>> --
>         > > > >>>>>>> Michael Wilde
>         > > > >>>>>>> Computation Institute, University of Chicago
>         > > > >>>>>>> Mathematics and Computer Science Division
>         > > > >>>>>>> Argonne National Laboratory
>         > > > >>
>         > > > >
>         > > > >
>         > >
>         > > --
>         > > Michael Wilde
>         > > Computation Institute, University of Chicago
>         > > Mathematics and Computer Science Division
>         > > Argonne National Laboratory
>         
>         
>         _______________________________________________
>         Swift-devel mailing list
>         Swift-devel at ci.uchicago.edu
>         https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
>         
> 




