[Swift-devel] Swift 0.93 RC3 hangs after all jobs seem to be complete

Michael Wilde wilde at mcs.anl.gov
Tue Oct 25 14:10:04 CDT 2011


Mihael, David,

Can you both report on what you believe the status of this bug is?

I think the subject line here is a bit misleading, in that a similar thing (i.e., the workflow deadlocking) seems to have been happening at both the start and the end of various scripts, and possibly at intermediate points.

I *think* that Sheri was seeing hangs at the start and in the middle; David was seeing hangs at the end.

Talking to David just now, he reported diagnosing his hang case down to a situation where the coaster scheduler emits a "null" (ill-formed) job to PBS at the tail end of a workflow. He inserted a workaround to ignore (not submit) such "null" jobs. I'm not sure if that was committed or just tested. David, can you post the details?

Mihael, did you look at the jstack that Sheri attached to the posting below?
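
For reference, the jstack capture described in the quoted message below could be scripted roughly as follows. This is only a sketch: the function name capture_dumps, the output file names, and the default count and interval are made up for illustration, and taking several dumps instead of one is an assumption (comparing dumps taken a few seconds apart can help tell a stuck thread from a slow one). It assumes jstack from the JDK is on the PATH.

```shell
#!/bin/sh
# Hypothetical helper (not part of Swift): write a few "jstack -l" thread
# dumps for one JVM to files in the current directory.
#   $1 = PID of the java process found below the swift command
#   $2 = number of dumps to take (assumed default: 3)
#   $3 = seconds to wait between dumps (assumed default: 10)
capture_dumps() {
    pid=$1
    count=${2:-3}
    interval=${3:-10}
    i=1
    while [ "$i" -le "$count" ]; do
        # -l adds lock information, as requested in the thread
        jstack -l "$pid" > "jstack-$pid-$i.out" 2>&1
        i=$((i + 1))
        # wait only if another dump is still to come
        if [ "$i" -le "$count" ]; then
            sleep "$interval"
        fi
    done
}
```

To use it, find the java process below the swift command with "ps -u $USER -H", then run "capture_dumps PID" with that process ID and send the resulting jstack-PID-N.out files along with the Swift log.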

Do you have any theories or fixes for this issue or issues? Unless we believe it's resolved, David, please file it in Bugzilla and attach the relevant postings from Sheri, David, and others on this bug.

Thanks,

- Mike


----- Original Message -----
> From: "Sheri Mickelson" <mickelso at mcs.anl.gov>
> To: "Mihael Hategan" <hategan at mcs.anl.gov>
> Cc: "Michael Wilde" <wilde at mcs.anl.gov>, "David Kelly" <davidk at ci.uchicago.edu>
> Sent: Wednesday, October 12, 2011 10:34:43 AM
> Subject: Re: Swift 0.93 RC3 hangs after all jobs seem to be complete
> I just tried running again on fusion with 0.93RC3 and it hung right
> away.
> It started with "No events in 10s." and then it looks like it hung.
> This was run using coasters and I manually killed it after about 5
> minutes.
> I attached both the log file and the jstack info.
> 
> Thanks, Sheri
> 
> 
> 
> 
> 
> On Oct 7, 2011, at 2:47 PM, Mihael Hategan wrote:
> 
> > Yeah, so the hang checker doesn't show anything, which means it's
> > not a Swift flow issue.
> >
> > I would do what Mike says with jstack as soon after the hang checker
> > kicks in as possible.
> >
> > Mihael
> >
> > On Fri, 2011-10-07 at 12:12 -0500, Michael Wilde wrote:
> >> Was: Re: Swift 0.93RC2 is bad - Re: Help on fusion
> >> Changed subject so you can see what this is regarding, Mihael.
> >>
> >> ---
> >>
> >> Sheri, could you run this again? (Or have you already, and if so,
> >> did it run to completion?)
> >>
> >> What I saw in the log yesterday was that all jobs that were
> >> submitted to coasters ran successfully, including all of their data
> >> transfers.
> >>
> >> But I also see that the Swift "hang checker" went off, which
> >> indicates that some Java activity was indeed hung.
> >>
> >> When this happens again, can you run the command "jstack -l PID",
> >> where PID is the process ID of the Swift Java command (which you
> >> can best locate by running "ps -u $USER -H" and finding the java
> >> process below the swift command)? Then send us the jstack output
> >> in addition to the associated Swift log.
> >>
> >> Mihael, in the meantime, can you take a look at the log to see if
> >> you can spot any incomplete Swift activities that may be hanging
> >> the run?
> >>
> >> Thanks,
> >>
> >> - Mike
> >>
> >>
> >> ----- Original Message -----
> >>> From: "Sheri Mickelson" <mickelso at mcs.anl.gov>
> >>> To: "David Kelly" <davidk at ci.uchicago.edu>
> >>> Cc: "Michael Wilde" <wilde at mcs.anl.gov>
> >>> Sent: Thursday, October 6, 2011 3:23:57 PM
> >>> Subject: Re: Swift 0.93RC2 is bad - Re: Help on fusion
> >>> Here's the log file.
> >>>
> >>>
> >>>
> >>> On Oct 6, 2011, at 3:19 PM, David Kelly wrote:
> >>>
> >>>> Hi Sheri,
> >>>>
> >>>> Could you please send the log file so we can take a closer look
> >>>> and see what's going on there?
> >>>>
> >>>> Thanks,
> >>>> David
> >>>>
> >>>> ----- Original Message -----
> >>>>> From: "Sheri Mickelson" <mickelso at mcs.anl.gov>
> >>>>> To: "David Kelly" <davidk at ci.uchicago.edu>
> >>>>> Cc: "Michael Wilde" <wilde at mcs.anl.gov>
> >>>>> Sent: Thursday, October 6, 2011 3:07:44 PM
> >>>>> Subject: Re: Swift 0.93RC2 is bad - Re: Help on fusion
> >>>>> I just tried this version and had a little bit more luck. It
> >>>>> looked like everything was running fine, but now it looks like
> >>>>> it's hung near the end. I keep getting the message
> >>>>> "Finished successfully:66". The message before that was
> >>>>> "Checking status:1 Finished successfully:65".
> >>>>>
> >>>>> Thanks, Sheri
> >>>>>
> >>>>> On Oct 6, 2011, at 2:14 PM, David Kelly wrote:
> >>>>>
> >>>>>>
> >>>>>> It's been a while since RC2 was created. There have been quite
> >>>>>> a lot of fixes since then, so I just created a new 0.93 RC3.
> >>>>>> The direct download can be found at:
> >>>>>>
> >>>>>> http://www.ci.uchicago.edu/swift/packages/swift-0.93RC3.tar.gz
> >>>>>>
> >>>>>> Hope this helps.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> David
> >>>>>>
> >>>>>> ----- Original Message -----
> >>>>>>> From: "Michael Wilde" <wilde at mcs.anl.gov>
> >>>>>>> To: "Sheri Mickelson" <mickelso at mcs.anl.gov>
> >>>>>>> Cc: "David Kelly" <davidk at ci.uchicago.edu>
> >>>>>>> Sent: Thursday, October 6, 2011 12:17:56 PM
> >>>>>>> Subject: Swift 0.93RC2 is bad - Re: Help on fusion
> >>>>>>> Sheri,
> >>>>>>>
> >>>>>>> Your AMWG script is failing because the swift-0.93RC2 release
> >>>>>>> is bad.
> >>>>>>>
> >>>>>>> The error it's showing in the log is this: "2011-10-06
> >>>>>>> 11:46:24,635-0500 DEBUG vdl:execute2 APPLICATION_EXCEPTION
> >>>>>>> jobid=ncatted-se54rxgk - Application exception: null
> >>>>>>> Caused by:
> >>>>>>> org.globus.cog.abstraction.impl.common.task.TaskSubmissionException:
> >>>>>>> lowOverallocation must be < 1.0 (currently 100.0)"
> >>>>>>>
> >>>>>>> ...which was fixed in SVN for 0.93.
> >>>>>>>
> >>>>>>> Did you load this from a tarball or from SVN?
> >>>>>>>
> >>>>>>> David, do we have a more recent 0.93 release candidate?
> >>>>>>>
> >>>>>>> If not, then can you build an 0.93 from SVN? If not, we can do
> >>>>>>> that for you. I'll start a build in the meantime just in case.
> >>>>>>>
> >>>>>>> Sorry about this error, Sheri.
> >>>>>>>
> >>>>>>> - Mike
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> ----- Original Message -----
> >>>>>>>> From: "Sheri Mickelson" <mickelso at mcs.anl.gov>
> >>>>>>>> To: "Michael Wilde" <wilde at mcs.anl.gov>
> >>>>>>>> Sent: Thursday, October 6, 2011 11:52:58 AM
> >>>>>>>> Subject: Re: Help on fusion
> >>>>>>>> I have everything in
> >>>>>>>> /fusion/gpfs/home/mickelso/amwg-swift/svnRepo/swift
> >>>>>>>>
> >>>>>>>> I believe the pathnames are correct.
> >>>>>>>>
> >>>>>>>> I have not tried running on localhost.
> >>>>>>>>
> >>>>>>>> I'm using swift version swift-0.93RC2.
> >>>>>>>>
> >>>>>>>> I'm not at Argonne today, but will be in tomorrow.
> >>>>>>>>
> >>>>>>>> -Sheri
> >>>>>>>>
> >>>>>>>> On Oct 6, 2011, at 11:39 AM, Michael Wilde wrote:
> >>>>>>>>
> >>>>>>>>> Hi Sheri,
> >>>>>>>>>
> >>>>>>>>> can you point me to the log, run directory, and work dir of
> >>>>>>>>> this run?
> >>>>>>>>>
> >>>>>>>>> I think we'll need to look into the log, and the .d
> >>>>>>>>> directories, and possibly the work dir to locate the stdout
> >>>>>>>>> of the failing apps.
> >>>>>>>>>
> >>>>>>>>> - are the pathnames correct?
> >>>>>>>>>
> >>>>>>>>> - does the run work on localhost? (i.e., are the PBS jobs
> >>>>>>>>> running or failing?)
> >>>>>>>>>
> >>>>>>>>> - which Swift rev are you using?
> >>>>>>>>>
> >>>>>>>>> Are you at Argonne? I can stop by and we can debug.
> >>>>>>>>>
> >>>>>>>>> - Mike
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> ----- Original Message -----
> >>>>>>>>>> From: "Sheri Mickelson" <mickelso at mcs.anl.gov>
> >>>>>>>>>> To: "Michael Wilde" <wilde at mcs.anl.gov>
> >>>>>>>>>> Sent: Thursday, October 6, 2011 10:32:38 AM
> >>>>>>>>>> Subject: Help on fusion
> >>>>>>>>>> Hi Mike,
> >>>>>>>>>>
> >>>>>>>>>> The AMWG people at NCAR want to incorporate the swift
> >>>>>>>>>> version into their main branch. Rob's at NCAR right now and
> >>>>>>>>>> wants to have this done as soon as possible. I've been
> >>>>>>>>>> working on incorporating the changes that were made in the
> >>>>>>>>>> last release and believe that it's in decent shape. I want
> >>>>>>>>>> to test it on fusion, though, just to make sure I'm handling
> >>>>>>>>>> the env variables correctly. I'm running into an error when
> >>>>>>>>>> I run. I'm getting "Failed to transfer wrapper log for job
> >>>>>>>>>> <job>" for all of the app calls. What usually causes this?
> >>>>>>>>>> I'm stuck on where to look.
> >>>>>>>>>>
> >>>>>>>>>> Thanks, Sheri
> >>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>> Michael Wilde
> >>>>>>>>> Computation Institute, University of Chicago
> >>>>>>>>> Mathematics and Computer Science Division
> >>>>>>>>> Argonne National Laboratory
> >>>>>>>>>
> >>>>>>>
> >>>>>>> --
> >>>>>>> Michael Wilde
> >>>>>>> Computation Institute, University of Chicago
> >>>>>>> Mathematics and Computer Science Division
> >>>>>>> Argonne National Laboratory
> >>
> >
> >

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

-------------- next part --------------
A non-text attachment was scrubbed...
Name: amwg_stats-20111012-1025-qaxyxad6.log
Type: application/octet-stream
Size: 228154 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20111025/8ea18b53/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: jstack.out
Type: application/octet-stream
Size: 65175 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20111025/8ea18b53/attachment-0001.obj>

