[Swift-devel] Swift 0.93 RC3 hangs after all jobs seem to be complete

David Kelly davidk at ci.uchicago.edu
Wed Oct 26 11:14:26 CDT 2011


I think I've found a way to reproduce this. From the test suite, if you run language-behaviour/mappers/075-array-mapper.swift a few times, you'll run into a deadlock which looks very similar to the one Sheri is seeing. Here is the jstack:

http://www.ci.uchicago.edu/~davidk/logs/jstack20111025110620.log
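
A rough sketch of how the "run it a few times" step could be automated, in case it helps anyone else reproduce this. It reruns a command until one attempt exceeds a time limit, then saves a thread dump before killing the stuck run. The function name, the DUMP_CMD variable, the log file names and the time limit are all made up for illustration, and the use of pgrep to find the java process under the launcher is an assumption about how the swift command starts Java.

```shell
#!/bin/sh
# Sketch only: all names below are hypothetical, not part of Swift.
# DUMP_CMD is the command used to capture a thread dump of a PID.
DUMP_CMD=${DUMP_CMD:-"jstack -l"}

# run_until_hang ATTEMPTS LIMIT_SECONDS COMMAND [ARGS...]
# Returns 0 if every attempt finished in time, 1 if an attempt hung.
run_until_hang() {
    attempts=$1
    limit=$2
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        "$@" &
        pid=$!
        waited=0
        # Poll once per second until the run exits or the limit is reached.
        while kill -0 "$pid" 2>/dev/null && [ "$waited" -lt "$limit" ]; do
            sleep 1
            waited=$((waited + 1))
        done
        if kill -0 "$pid" 2>/dev/null; then
            # Assumption: the launcher may be a wrapper script, so prefer
            # a java child process if one exists, else dump the PID itself.
            target=$(pgrep -P "$pid" java 2>/dev/null | head -n 1)
            [ -n "$target" ] || target=$pid
            echo "attempt $i: still running after ${limit}s, dumping pid $target"
            $DUMP_CMD "$target" > "threaddump-$i.log" 2>&1
            if [ "$target" != "$pid" ]; then
                kill "$target" 2>/dev/null
            fi
            kill "$pid" 2>/dev/null
            wait "$pid" 2>/dev/null
            return 1
        fi
        wait "$pid"
        echo "attempt $i: finished"
        i=$((i + 1))
    done
    return 0
}
```

After sourcing the above, something like "run_until_hang 10 300 swift 075-array-mapper.swift" from the test directory would be one way to call it. The attempt count and the 300 second limit are guesses, and the swift command may need the usual site and tc options for your setup.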

David

----- Original Message -----
> From: "Michael Wilde" <wilde at mcs.anl.gov>
> To: "Mihael Hategan" <hategan at mcs.anl.gov>, "David Kelly" <davidk at ci.uchicago.edu>
> Cc: "Swift Devel" <swift-devel at ci.uchicago.edu>, "Sheri Mickelson" <mickelso at mcs.anl.gov>
> Sent: Tuesday, October 25, 2011 2:10:04 PM
> Subject: Re: Swift 0.93 RC3 hangs after all jobs seem to be complete
> Mihael, David,
> 
> Can you both report on what you believe the status of this bug is?
> 
> I think the subject line here is a bit misleading, in that it seems
> that a similar thing - ie the workflow deadlocks - was happening both
> at the start and at the end of various scripts, and possibly at
> intermediate points.
> 
> I *think* that Sheri was seeing hangs at the start and in the middle;
> David was seeing hangs at the end.
> 
> Talking to David just now he reported diagnosing his hang case down to
> a situation where the coaster scheduler emits a "null" (ill-formed)
> job to PBS at the tail end of a workflow. He inserted a workaround to
> ignore (not submit) such "null" jobs. I'm not sure if that was
> committed, or just tested. David, can you post the details?
> 
> Mihael, did you look at the jstack that Sheri attached to the posting
> below?
> 
> Do you have any theories or fixes for this issue or issues? Unless we
> believe it's resolved, David, please file in bugzilla and attach
> relevant postings from Sheri, David, and others on this bug.
> 
> Thanks,
> 
> - Mike
> 
> 
> ----- Original Message -----
> > From: "Sheri Mickelson" <mickelso at mcs.anl.gov>
> > To: "Mihael Hategan" <hategan at mcs.anl.gov>
> > Cc: "Michael Wilde" <wilde at mcs.anl.gov>, "David Kelly"
> > <davidk at ci.uchicago.edu>
> > Sent: Wednesday, October 12, 2011 10:34:43 AM
> > Subject: Re: Swift 0.93 RC3 hangs after all jobs seem to be complete
> > I just tried running again on fusion with 0.93RC3 and it hung right away.
> > It started with "No events in 10s." and then it looks like it hung.
> > This was run using coasters and I manually killed it after about 5 minutes.
> > I attached both the log file and the jstack info.
> >
> > Thanks, Sheri
> >
> >
> >
> >
> >
> > On Oct 7, 2011, at 2:47 PM, Mihael Hategan wrote:
> >
> > > > Yeah, so the hang checker doesn't show anything. Which means it's
> > > > not a swift flow issue.
> > > >
> > > > I would do what Mike says with jstack as soon after the hang
> > > > checker kicks in as possible.
> > >
> > > Mihael
> > >
> > > On Fri, 2011-10-07 at 12:12 -0500, Michael Wilde wrote:
> > >> Was: Re: Swift 0.93RC2 is bad - Re: Help on fusion
> > >> Changed subject so you can see what this is regarding, Mihael.
> > >>
> > >> ---
> > >>
> > >> Sheri, could you run this again? (Or have you already, and if so,
> > >> did it run to completion?)
> > >>
> > > >> What I saw in the log yesterday was that all jobs that were
> > > >> submitted to coasters ran successfully, including all of their
> > > >> data transfers.
> > >>
> > >> But I also see that the Swift "hang checker" went off, which
> > >> indicates that some Java activity was indeed hung.
> > >>
> > >> When this happens again, can you run the command "jstack -l PID"
> > > >> where PID is the process ID of the Swift Java command (which you can
> > > >> best find by using "ps -u $USER -H" and locating the java process
> > >> below the swift command). Then send us the jstack output in
> > >> addition to the associated Swift log.
> > >>
> > >> Mihael, in the meantime, can you take a look at the log to see if
> > >> you can spot any incomplete Swift activities that may be hanging
> > >> the run?
> > >>
> > >> Thanks,
> > >>
> > >> - Mike
> > >>
> > >>
> > >> ----- Original Message -----
> > >>> From: "Sheri Mickelson" <mickelso at mcs.anl.gov>
> > >>> To: "David Kelly" <davidk at ci.uchicago.edu>
> > >>> Cc: "Michael Wilde" <wilde at mcs.anl.gov>
> > >>> Sent: Thursday, October 6, 2011 3:23:57 PM
> > >>> Subject: Re: Swift 0.93RC2 is bad - Re: Help on fusion
> > >>> Here's the log file.
> > >>>
> > >>>
> > >>>
> > >>> On Oct 6, 2011, at 3:19 PM, David Kelly wrote:
> > >>>
> > >>>> Hi Sheri,
> > >>>>
> > > >>>> Could you please send the log file so we can take a closer look
> > > >>>> and see what's going on there?
> > >>>>
> > >>>> Thanks,
> > >>>> David
> > >>>>
> > >>>> ----- Original Message -----
> > >>>>> From: "Sheri Mickelson" <mickelso at mcs.anl.gov>
> > >>>>> To: "David Kelly" <davidk at ci.uchicago.edu>
> > >>>>> Cc: "Michael Wilde" <wilde at mcs.anl.gov>
> > >>>>> Sent: Thursday, October 6, 2011 3:07:44 PM
> > >>>>> Subject: Re: Swift 0.93RC2 is bad - Re: Help on fusion
> > > >>>>> I just tried this version and had a little bit more luck. It
> > > >>>>> looked like everything was running fine, but now it looks like
> > > >>>>> it's hung near the end. I keep getting the message "Finished
> > > >>>>> successfully:66". The message before that was "Checking
> > > >>>>> status:1 Finished successfully:65".
> > >>>>>
> > >>>>> Thanks, Sheri
> > >>>>>
> > >>>>> On Oct 6, 2011, at 2:14 PM, David Kelly wrote:
> > >>>>>
> > >>>>>>
> > > >>>>>> It's been a while since RC2 was created. There have been quite
> > > >>>>>> a lot of fixes since then, so I just created a new 0.93 RC3.
> > > >>>>>> The direct download can be found at:
> > >>>>>>
> > >>>>>> http://www.ci.uchicago.edu/swift/packages/swift-0.93RC3.tar.gz
> > >>>>>>
> > >>>>>> Hope this helps.
> > >>>>>>
> > >>>>>> Thanks,
> > >>>>>> David
> > >>>>>>
> > >>>>>> ----- Original Message -----
> > >>>>>>> From: "Michael Wilde" <wilde at mcs.anl.gov>
> > >>>>>>> To: "Sheri Mickelson" <mickelso at mcs.anl.gov>
> > >>>>>>> Cc: "David Kelly" <davidk at ci.uchicago.edu>
> > >>>>>>> Sent: Thursday, October 6, 2011 12:17:56 PM
> > >>>>>>> Subject: Swift 0.93RC2 is bad - Re: Help on fusion
> > >>>>>>> Sheri,
> > >>>>>>>
> > > >>>>>>> Your AMWG script is failing because the swift-0.93RC2 release
> > > >>>>>>> is bad.
> > > >>>>>>>
> > > >>>>>>> The error it's showing in the log is this: "2011-10-06
> > > >>>>>>> 11:46:24,635-0500 DEBUG vdl:execute2 APPLICATION_EXCEPTION
> > > >>>>>>> jobid=ncatted-se54rxgk - Application exception: null
> > > >>>>>>> Caused by:
> > > >>>>>>> org.globus.cog.abstraction.impl.common.task.TaskSubmissionException:
> > > >>>>>>> lowOverallocation must be < 1.0 (currently 100.0)"
> > >>>>>>>
> > >>>>>>> ...which was fixed in SVN for 0.93.
> > >>>>>>>
> > >>>>>>> Did you load this from a tarball or from SVN?
> > >>>>>>>
> > >>>>>>> David, do we have a more recent 0.93 release candidate?
> > >>>>>>>
> > > >>>>>>> If not, then can you build an 0.93 from SVN? If not, we can do
> > > >>>>>>> that for you. I'll start a build in the meantime just in case.
> > >>>>>>>
> > >>>>>>> Sorry about this error, Sheri.
> > >>>>>>>
> > >>>>>>> - Mike
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> ----- Original Message -----
> > >>>>>>>> From: "Sheri Mickelson" <mickelso at mcs.anl.gov>
> > >>>>>>>> To: "Michael Wilde" <wilde at mcs.anl.gov>
> > >>>>>>>> Sent: Thursday, October 6, 2011 11:52:58 AM
> > >>>>>>>> Subject: Re: Help on fusion
> > >>>>>>>> I have everything in
> > >>>>>>>> /fusion/gpfs/home/mickelso/amwg-swift/svnRepo/swift
> > >>>>>>>>
> > >>>>>>>> I believe the pathnames are correct.
> > >>>>>>>>
> > >>>>>>>> I have not tried running on localhost.
> > >>>>>>>>
> > >>>>>>>> I'm using swift version swift-0.93RC2.
> > >>>>>>>>
> > >>>>>>>> I'm not at Argonne today, but will be in tomorrow.
> > >>>>>>>>
> > >>>>>>>> -Sheri
> > >>>>>>>>
> > >>>>>>>> On Oct 6, 2011, at 11:39 AM, Michael Wilde wrote:
> > >>>>>>>>
> > >>>>>>>>> Hi Sheri,
> > >>>>>>>>>
> > > >>>>>>>>> can you point me to the log, run directory, and work dir of
> > > >>>>>>>>> this run?
> > > >>>>>>>>>
> > > >>>>>>>>> I think we'll need to look into the log, and the .d
> > > >>>>>>>>> directories, and possibly the work dir to locate the stdout
> > > >>>>>>>>> of the failing apps.
> > > >>>>>>>>>
> > > >>>>>>>>> - are the pathnames correct?
> > > >>>>>>>>>
> > > >>>>>>>>> - does the run work on localhost? (ie, are the PBS jobs
> > > >>>>>>>>> running or failing)?
> > >>>>>>>>>
> > >>>>>>>>> - which Swift rev are you using?
> > >>>>>>>>>
> > >>>>>>>>> Are you at Argonne? I can stop by and we can debug.
> > >>>>>>>>>
> > >>>>>>>>> - Mike
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> ----- Original Message -----
> > >>>>>>>>>> From: "Sheri Mickelson" <mickelso at mcs.anl.gov>
> > >>>>>>>>>> To: "Michael Wilde" <wilde at mcs.anl.gov>
> > >>>>>>>>>> Sent: Thursday, October 6, 2011 10:32:38 AM
> > >>>>>>>>>> Subject: Help on fusion
> > >>>>>>>>>> Hi Mike,
> > >>>>>>>>>>
> > > >>>>>>>>>> The AMWG people at NCAR want to incorporate the swift
> > > >>>>>>>>>> version into their main branch. Rob's at NCAR right now and
> > > >>>>>>>>>> wants to have this done as soon as possible. I've been
> > > >>>>>>>>>> working on incorporating the changes that were made in the
> > > >>>>>>>>>> last release and believe that it's in decent shape.
> > > >>>>>>>>>> I want to test it on fusion, though, just to make sure I'm
> > > >>>>>>>>>> handling the env variables correctly. I'm running into an
> > > >>>>>>>>>> error when I run.
> > > >>>>>>>>>> I'm getting "Failed to transfer wrapper log for job <job>"
> > > >>>>>>>>>> for all of the app calls. What usually causes this? I'm
> > > >>>>>>>>>> stuck on where to look.
> > >>>>>>>>>>
> > >>>>>>>>>> Thanks, Sheri
> > >>>>>>>>>
> > >>>>>>>>> --
> > >>>>>>>>> Michael Wilde
> > >>>>>>>>> Computation Institute, University of Chicago
> > >>>>>>>>> Mathematics and Computer Science Division
> > >>>>>>>>> Argonne National Laboratory
> > >>>>>>>>>
> > >>>>>>>
> > >>>>>>> --
> > >>>>>>> Michael Wilde
> > >>>>>>> Computation Institute, University of Chicago
> > >>>>>>> Mathematics and Computer Science Division
> > >>>>>>> Argonne National Laboratory
> > >>
> > >
> > >
> 
> --
> Michael Wilde
> Computation Institute, University of Chicago
> Mathematics and Computer Science Division
> Argonne National Laboratory


