I think I'm seeing a similar deadlock in the latest version of Swift. I'm first going to verify that this is actually happening (update Swift, recompile, etc.), but what information should I collect that would be useful for debugging?<br>
<br>- Tim<br><br><div class="gmail_quote">On Sat, Oct 29, 2011 at 7:58 PM, Mihael Hategan <span dir="ltr"><<a href="mailto:hategan@mcs.anl.gov">hategan@mcs.anl.gov</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
This deadlock is now fixed (swift r5262).<br>
<div><div></div><div class="h5"><br>
On Wed, 2011-10-26 at 11:14 -0500, David Kelly wrote:<br>
> I think I've found a way to reproduce this. From the test suite, if you run language-behaviour/mappers/075-array-mapper.swift a few times, you'll run into a deadlock which looks very similar to the one Sheri is seeing. Here is the jstack:<br>
><br>
> <a href="http://www.ci.uchicago.edu/%7Edavidk/logs/jstack20111025110620.log" target="_blank">http://www.ci.uchicago.edu/~davidk/logs/jstack20111025110620.log</a><br>
><br>
> David<br>
><br>
> ----- Original Message -----<br>
> > From: "Michael Wilde" <<a href="mailto:wilde@mcs.anl.gov">wilde@mcs.anl.gov</a>><br>
> > To: "Mihael Hategan" <<a href="mailto:hategan@mcs.anl.gov">hategan@mcs.anl.gov</a>>, "David Kelly" <<a href="mailto:davidk@ci.uchicago.edu">davidk@ci.uchicago.edu</a>><br>
> > Cc: "Swift Devel" <<a href="mailto:swift-devel@ci.uchicago.edu">swift-devel@ci.uchicago.edu</a>>, "Sheri Mickelson" <<a href="mailto:mickelso@mcs.anl.gov">mickelso@mcs.anl.gov</a>><br>
> > Sent: Tuesday, October 25, 2011 2:10:04 PM<br>
> > Subject: Re: Swift 0.93 RC3 hangs after all jobs seem to be complete<br>
> > Mihael, David,<br>
> ><br>
> > Can you both report on what you believe the status of this bug is?<br>
> ><br>
> > I think the subject line here is a bit misleading, in that it seems<br>
> > that a similar thing - i.e. the workflow deadlocks - was happening both<br>
> > at the start and at the end of various scripts, and possibly at<br>
> > intermediate points.<br>
> ><br>
> > I *think* that Sheri was seeing hangs at the start and in the middle;<br>
> > David was seeing hangs at the end.<br>
> ><br>
> > Talking to David just now he reported diagnosing his hang case down to<br>
> > a situation where the coaster scheduler emits a "null" (ill-formed)<br>
> > job to PBS at the tail end of a workflow. He inserted a workaround to<br>
> > ignore (not submit) such "null" jobs. I'm not sure if that was<br>
> > committed, or just tested. David, can you post the details?<br>
> ><br>
> > Mihael, did you look at the jstack that Sheri attached to the posting<br>
> > below?<br>
> ><br>
> > Do you have any theories or fixes for this issue or issues? Unless we<br>
> > believe it's resolved, David, please file in Bugzilla and attach<br>
> > relevant postings from Sheri, David, and others on this bug.<br>
> ><br>
> > Thanks,<br>
> ><br>
> > - Mike<br>
> ><br>
> ><br>
> > ----- Original Message -----<br>
> > > From: "Sheri Mickelson" <<a href="mailto:mickelso@mcs.anl.gov">mickelso@mcs.anl.gov</a>><br>
> > > To: "Mihael Hategan" <<a href="mailto:hategan@mcs.anl.gov">hategan@mcs.anl.gov</a>><br>
> > > Cc: "Michael Wilde" <<a href="mailto:wilde@mcs.anl.gov">wilde@mcs.anl.gov</a>>, "David Kelly"<br>
> > > <<a href="mailto:davidk@ci.uchicago.edu">davidk@ci.uchicago.edu</a>><br>
> > > Sent: Wednesday, October 12, 2011 10:34:43 AM<br>
> > > Subject: Re: Swift 0.93 RC3 hangs after all jobs seem to be complete<br>
> > > I just tried running again on fusion with 0.93RC3 and it hung right away.<br>
> > > It started with "No events in 10s." and then it looks like it hung.<br>
> > > This was run using coasters and I manually killed it after about 5 minutes.<br>
> > > I attached both the log file and the jstack info.<br>
> > ><br>
> > > Thanks, Sheri<br>
> > ><br>
> > ><br>
> > ><br>
> > ><br>
> > ><br>
> > > On Oct 7, 2011, at 2:47 PM, Mihael Hategan wrote:<br>
> > ><br>
> > > > Yeah, so the hang checker doesn't show anything. Which means it's not a<br>
> > > > swift flow issue.<br>
> > > ><br>
> > > > I would do what Mike says with jstack as soon after the hang checker<br>
> > > > kicks in as possible.<br>
> > > ><br>
> > > > Mihael<br>
> > > ><br>
> > > > On Fri, 2011-10-07 at 12:12 -0500, Michael Wilde wrote:<br>
> > > >> Was: Re: Swift 0.93RC2 is bad - Re: Help on fusion<br>
> > > >> Changed subject so you can see what this is regarding, Mihael.<br>
> > > >><br>
> > > >> ---<br>
> > > >><br>
> > > >> Sheri, could you run this again? (Or have you already, and if so,<br>
> > > >> did it run to completion?)<br>
> > > >><br>
> > > >> What I saw in the log yesterday was that all jobs that were<br>
> > > >> submitted to coasters ran successfully, including all of their<br>
> > > >> data transfers.<br>
> > > >><br>
> > > >> But I also see that the Swift "hang checker" went off, which<br>
> > > >> indicates that some Java activity was indeed hung.<br>
> > > >><br>
> > > >> When this happens again, can you run the command "jstack -l PID"<br>
> > > >> where PID is the process of the Swift Java command (which you can<br>
> > > >> best locate by using "ps -u $USER -H" and locate the java process<br>
> > > >> below the swift command). Then send us the jstack output in<br>
> > > >> addition to the associated Swift log.<br>
> > > >><br>
> > > >> Mihael, in the meantime, can you take a look at the log to see if<br>
> > > >> you can spot any incomplete Swift activities that may be hanging<br>
> > > >> the run?<br>
> > > >><br>
> > > >> Thanks,<br>
> > > >><br>
> > > >> - Mike<br>
> > > >><br>
> > > >><br>
> > > >> ----- Original Message -----<br>
> > > >>> From: "Sheri Mickelson" <<a href="mailto:mickelso@mcs.anl.gov">mickelso@mcs.anl.gov</a>><br>
> > > >>> To: "David Kelly" <<a href="mailto:davidk@ci.uchicago.edu">davidk@ci.uchicago.edu</a>><br>
> > > >>> Cc: "Michael Wilde" <<a href="mailto:wilde@mcs.anl.gov">wilde@mcs.anl.gov</a>><br>
> > > >>> Sent: Thursday, October 6, 2011 3:23:57 PM<br>
> > > >>> Subject: Re: Swift 0.93RC2 is bad - Re: Help on fusion<br>
> > > >>> Here's the log file.<br>
> > > >>><br>
> > > >>><br>
> > > >>><br>
> > > >>> On Oct 6, 2011, at 3:19 PM, David Kelly wrote:<br>
> > > >>><br>
> > > >>>> Hi Sheri,<br>
> > > >>>><br>
> > > >>>> Could you please send the log file so we can take a closer look and<br>
> > > >>>> see what's going on there?<br>
> > > >>>><br>
> > > >>>> Thanks,<br>
> > > >>>> David<br>
> > > >>>><br>
> > > >>>> ----- Original Message -----<br>
> > > >>>>> From: "Sheri Mickelson" <<a href="mailto:mickelso@mcs.anl.gov">mickelso@mcs.anl.gov</a>><br>
> > > >>>>> To: "David Kelly" <<a href="mailto:davidk@ci.uchicago.edu">davidk@ci.uchicago.edu</a>><br>
> > > >>>>> Cc: "Michael Wilde" <<a href="mailto:wilde@mcs.anl.gov">wilde@mcs.anl.gov</a>><br>
> > > >>>>> Sent: Thursday, October 6, 2011 3:07:44 PM<br>
> > > >>>>> Subject: Re: Swift 0.93RC2 is bad - Re: Help on fusion<br>
> > > >>>>> I just tried this version and had a little bit more luck. It looked<br>
> > > >>>>> like everything was running fine, but now it looks like it's hung near<br>
> > > >>>>> the end. I keep getting the message "Finished successfully:66". The<br>
> > > >>>>> message before that was "Checking status:1 Finished successfully:65".<br>
> > > >>>>><br>
> > > >>>>> Thanks, Sheri<br>
> > > >>>>><br>
> > > >>>>> On Oct 6, 2011, at 2:14 PM, David Kelly wrote:<br>
> > > >>>>><br>
> > > >>>>>><br>
> > > >>>>>> It's been a while since RC2 was created. There have been quite a lot<br>
> > > >>>>>> of fixes since then, so I just created a new 0.93 RC3. The direct<br>
> > > >>>>>> download can be found at:<br>
> > > >>>>>><br>
> > > >>>>>> <a href="http://www.ci.uchicago.edu/swift/packages/swift-0.93RC3.tar.gz" target="_blank">http://www.ci.uchicago.edu/swift/packages/swift-0.93RC3.tar.gz</a><br>
> > > >>>>>><br>
> > > >>>>>> Hope this helps.<br>
> > > >>>>>><br>
> > > >>>>>> Thanks,<br>
> > > >>>>>> David<br>
> > > >>>>>><br>
> > > >>>>>> ----- Original Message -----<br>
> > > >>>>>>> From: "Michael Wilde" <<a href="mailto:wilde@mcs.anl.gov">wilde@mcs.anl.gov</a>><br>
> > > >>>>>>> To: "Sheri Mickelson" <<a href="mailto:mickelso@mcs.anl.gov">mickelso@mcs.anl.gov</a>><br>
> > > >>>>>>> Cc: "David Kelly" <<a href="mailto:davidk@ci.uchicago.edu">davidk@ci.uchicago.edu</a>><br>
> > > >>>>>>> Sent: Thursday, October 6, 2011 12:17:56 PM<br>
> > > >>>>>>> Subject: Swift 0.93RC2 is bad - Re: Help on fusion<br>
> > > >>>>>>> Sheri,<br>
> > > >>>>>>><br>
> > > >>>>>>> Your AMWG script is failing because the swift-0.93RC2 release is bad.<br>
> > > >>>>>>><br>
> > > >>>>>>> The error it's showing in the log is this: "2011-10-06<br>
> > > >>>>>>> 11:46:24,635-0500 DEBUG vdl:execute2 APPLICATION_EXCEPTION<br>
> > > >>>>>>> jobid=ncatted-se54rxgk - Application exception: null<br>
> > > >>>>>>> Caused by:<br>
> > > >>>>>>> org.globus.cog.abstraction.impl.common.task.TaskSubmissionException:<br>
> > > >>>>>>> lowOverallocation must be < 1.0 (currently 100.0)"<br>
> > > >>>>>>><br>
> > > >>>>>>> ...which was fixed in SVN for 0.93.<br>
> > > >>>>>>><br>
> > > >>>>>>> Did you load this from a tarball or from SVN?<br>
> > > >>>>>>><br>
> > > >>>>>>> David, do we have a more recent 0.93 release candidate?<br>
> > > >>>>>>><br>
> > > >>>>>>> If not, then can you build an 0.93 from SVN? If not, we can do that<br>
> > > >>>>>>> for you. I'll start a build in the meantime just in case.<br>
> > > >>>>>>><br>
> > > >>>>>>> Sorry about this error, Sheri.<br>
> > > >>>>>>><br>
> > > >>>>>>> - Mike<br>
> > > >>>>>>><br>
> > > >>>>>>><br>
> > > >>>>>>><br>
> > > >>>>>>> ----- Original Message -----<br>
> > > >>>>>>>> From: "Sheri Mickelson" <<a href="mailto:mickelso@mcs.anl.gov">mickelso@mcs.anl.gov</a>><br>
> > > >>>>>>>> To: "Michael Wilde" <<a href="mailto:wilde@mcs.anl.gov">wilde@mcs.anl.gov</a>><br>
> > > >>>>>>>> Sent: Thursday, October 6, 2011 11:52:58 AM<br>
> > > >>>>>>>> Subject: Re: Help on fusion<br>
> > > >>>>>>>> I have everything in<br>
> > > >>>>>>>> /fusion/gpfs/home/mickelso/amwg-swift/svnRepo/swift<br>
> > > >>>>>>>><br>
> > > >>>>>>>> I believe the pathnames are correct.<br>
> > > >>>>>>>><br>
> > > >>>>>>>> I have not tried running on localhost.<br>
> > > >>>>>>>><br>
> > > >>>>>>>> I'm using swift version swift-0.93RC2.<br>
> > > >>>>>>>><br>
> > > >>>>>>>> I'm not at Argonne today, but will be in tomorrow.<br>
> > > >>>>>>>><br>
> > > >>>>>>>> -Sheri<br>
> > > >>>>>>>><br>
> > > >>>>>>>> On Oct 6, 2011, at 11:39 AM, Michael Wilde wrote:<br>
> > > >>>>>>>><br>
> > > >>>>>>>>> Hi Sheri,<br>
> > > >>>>>>>>><br>
> > > >>>>>>>>> Can you point me to the log, run directory, and work dir of this run?<br>
> > > >>>>>>>>><br>
> > > >>>>>>>>> I think we'll need to look into the log, and the .d directories,<br>
> > > >>>>>>>>> and possibly the work dir to locate the stdout of the failing apps.<br>
> > > >>>>>>>>><br>
> > > >>>>>>>>> - are the pathnames correct?<br>
> > > >>>>>>>>><br>
> > > >>>>>>>>> - does the run work on localhost (i.e., are the PBS jobs running or<br>
> > > >>>>>>>>> failing)?<br>
> > > >>>>>>>>><br>
> > > >>>>>>>>> - which Swift rev are you using?<br>
> > > >>>>>>>>><br>
> > > >>>>>>>>> Are you at Argonne? I can stop by and we can debug.<br>
> > > >>>>>>>>><br>
> > > >>>>>>>>> - Mike<br>
> > > >>>>>>>>><br>
> > > >>>>>>>>><br>
> > > >>>>>>>>> ----- Original Message -----<br>
> > > >>>>>>>>>> From: "Sheri Mickelson" <<a href="mailto:mickelso@mcs.anl.gov">mickelso@mcs.anl.gov</a>><br>
> > > >>>>>>>>>> To: "Michael Wilde" <<a href="mailto:wilde@mcs.anl.gov">wilde@mcs.anl.gov</a>><br>
> > > >>>>>>>>>> Sent: Thursday, October 6, 2011 10:32:38 AM<br>
> > > >>>>>>>>>> Subject: Help on fusion<br>
> > > >>>>>>>>>> Hi Mike,<br>
> > > >>>>>>>>>><br>
> > > >>>>>>>>>> The AMWG people at NCAR want to incorporate the swift version into their<br>
> > > >>>>>>>>>> main branch. Rob's at NCAR right now and wants to have this done as<br>
> > > >>>>>>>>>> soon as possible. I've been working on incorporating the changes that<br>
> > > >>>>>>>>>> were made in the last release and believe that it's in decent shape.<br>
> > > >>>>>>>>>> I want to test it on fusion, though, just to make sure I'm handling<br>
> > > >>>>>>>>>> the env variables correctly. I'm running into an error when I run.<br>
> > > >>>>>>>>>> I'm getting "Failed to transfer wrapper log for job <job>" for all of<br>
> > > >>>>>>>>>> the app calls. What usually causes this? I'm stuck on where to look.<br>
> > > >>>>>>>>>><br>
> > > >>>>>>>>>> Thanks, Sheri<br>
> > > >>>>>>>>><br>
> > > >>>>>>>>> --<br>
> > > >>>>>>>>> Michael Wilde<br>
> > > >>>>>>>>> Computation Institute, University of Chicago<br>
> > > >>>>>>>>> Mathematics and Computer Science Division<br>
> > > >>>>>>>>> Argonne National Laboratory<br>
> > > >>>>>>>>><br>
> > > >>>>>>><br>
> > > >><br>
> > > ><br>
> > > ><br>
> ><br>
<br>
<br>
</div></div></blockquote></div><br>