[Swift-devel] hang checker updates
Michael Wilde
wilde at mcs.anl.gov
Sat Jul 14 06:41:21 CDT 2012
Mihael, thinking over the PNNL SPH hang: if your new code does indeed print the stack traces of the hanging Swift threads, that would likely identify the deadlock right away, essentially performing the logic that we're trying to do manually by deduction.
So I'll try to get them to test with the new version asap.
In the meantime, can you test against the hang below? This is a simple re-creation of the ParVis deadlock I mentioned in my prior post.
One thing I noticed in the current PNNL incident is that the thread IDs listed by the hang checker are not found anywhere in the log. Often they are, which helps in the diagnosis. So I'm assuming these must be internal functions which are just not logged. I'm wondering whether that will interfere with your new tracing?
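For cross-checking, something like the following Python sketch could report which of the hang checker's thread IDs never show up in a log. It is only an illustration: the function names are made up here, it assumes the report layout shown in the hang-checker output in this thread ("Waiting threads:" followed by one ID per line and a "----" terminator), and it assumes thread IDs appear verbatim in log lines. The sample log lines are hypothetical, not real Swift log output.

```python
import re


def parse_waiting_threads(report):
    """Return the thread IDs listed under 'Waiting threads:' in a hang-checker report."""
    ids = []
    in_section = False
    for line in report.splitlines():
        text = line.strip()
        if text.startswith("Waiting threads:"):
            in_section = True
            continue
        if in_section:
            if text.startswith("----"):
                break  # end of the waiting-threads section
            if re.fullmatch(r"\d+(-\d+)*", text):
                ids.append(text)
    return ids


def split_by_presence(thread_ids, log_text):
    """Split thread IDs into (found, missing) by whole-ID matches in the log text."""
    found, missing = [], []
    for tid in thread_ids:
        # Reject matches that are part of a longer ID, e.g. 0-1-1 inside 0-1-1-3.
        pattern = re.compile(r"(?<![\d-])" + re.escape(tid) + r"(?![\d-])")
        (found if pattern.search(log_text) else missing).append(tid)
    return found, missing


report = """No events in 10s.
Registered futures:
int[] out Open, 2 elements, 1 listeners
----
Waiting threads:
0-1-1
----
"""

# Hypothetical log lines, only to exercise the matching logic.
sample_log = "DEBUG thread=0-1-1-3 started\nINFO thread=0-2 done\n"

waiting = parse_waiting_threads(report)
found, missing = split_by_presence(waiting + ["0-2"], sample_log)
print("found:", found, "missing:", missing)
```

The whole-ID match matters because thread IDs nest: a plain substring search for 0-68-1 would also hit 0-68-1-5-1 and hide the fact that the parent ID itself is never logged.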
Here's the ParVis case.
- Mike
----- Forwarded Message -----
From: "Michael Wilde" <wilde at mcs.anl.gov>
To: "Sheri Mickelson" <mickelso at mcs.anl.gov>
Sent: Saturday, February 18, 2012 12:05:16 PM
Subject: Re: No events in 10s.
Hi Sheri,
A quick update: the good news is that I've been able to re-create what's causing the hang in a few very tiny Swift scripts that show what's happening.
I'm trying to turn those into a "how to avoid this situation" example and suggest how to change your ocean script accordingly to get around this.
If I can't give you a good solution very soon, I'll send some prelim info.
Basically, if you are setting an array's elements *inside* an if() statement, you can't process the array's contents as a whole (i.e., pass it to an app) inside the same if statement block. Instead you need to process it outside the if statement, so that Swift knows that the array is "closed", i.e., completely filled.
Here's an example. I'll try to work up an example in terms of your exact code, to show you a few ways to work around this. In the meantime, I'm sending you what I have in case you want to try something on your own sooner.
Another approach I think works is to fill the array in a function that returns the array as a whole object.
Sorry that it took me so long to get to this. I'll also send something on your sites.xml question for Andy for PBS.
Regards,
- Mike
com$ swift acint.works.swift
no sites file specified, setting to default: /home/wilde/swift/rev/swift-0.93RC4/etc/sites.xml
Swift svn swift-r5277 cog-r3320
RunID: 20120218-1159-o08mzyd6
Progress: time: Sat, 18 Feb 2012 11:59:46 -0600
Final status: time: Sat, 18 Feb 2012 11:59:46 -0600 Finished successfully:1
com$ swift acint.hangs.swift
no sites file specified, setting to default: /home/wilde/swift/rev/swift-0.93RC4/etc/sites.xml
Swift svn swift-r5277 cog-r3320
RunID: 20120218-1159-8kgadrq6
Progress: time: Sat, 18 Feb 2012 11:59:55 -0600
No events in 10s.
Registered futures:
int[] out Open, 2 elements, 1 listeners
----
Waiting threads:
0-1-1
----
com$ cat acint.works.swift
type file;
app (file o) echo(int i[])
{
  echo i stdout=@filename(o);
}
int out[];
file f<"out.txt">;
if ( true ) {
  foreach j in [1:2] {
    out[j] = j;
  }
}
f = echo(out);
com$ cat acint.hangs.swift
type file;
app (file o) echo(int i[])
{
  echo i stdout=@filename(o);
}
int out[];
file f<"out.txt">;
if ( true ) {
  foreach j in [1:2] {
    out[j] = j;
  }
  f = echo(out);
}
com$ diff acint.works.swift acint.hangs.swift
14a15
> f = echo(out);
16c17
< f = echo(out);
---
>
com$ cat out.txt
1 2
com$
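One way to picture why acint.hangs.swift stalls is as a cycle in a "wait-for" graph: the echo call waits for the array to be closed, the array is closed only once the if block finishes, and the if block cannot finish until the echo call inside it does. The Python sketch below models just that reasoning and finds the cycle with a depth-first search. It is not how Swift's hang checker is implemented; the node names and the graph itself are a simplified, hand-built model of the two scripts above.

```python
def find_cycle(waits_for):
    """Return one cycle in a wait-for graph as a list of nodes, or None if acyclic.

    waits_for maps each node to the nodes it is waiting on.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished
    color = {}
    path = []

    def visit(node):
        color[node] = GREY
        path.append(node)
        for dep in waits_for.get(node, ()):
            state = color.get(dep, WHITE)
            if state == GREY:
                # dep is already on the current path, so we have closed a loop.
                return path[path.index(dep):] + [dep]
            if state == WHITE:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return None

    for start in list(waits_for):
        if color.get(start, WHITE) == WHITE:
            cycle = visit(start)
            if cycle:
                return cycle
    return None


# acint.hangs.swift: echo(out) sits inside the if block that fills out[].
hangs = {
    "f = echo(out)": ["out closed"],
    "out closed": ["if block done"],
    "if block done": ["foreach done", "f = echo(out)"],
}

# acint.works.swift: echo(out) is outside the block, so the block only
# waits for the foreach that fills the array.
works = {
    "f = echo(out)": ["out closed"],
    "out closed": ["if block done"],
    "if block done": ["foreach done"],
}

print(find_cycle(hangs))
print(find_cycle(works))
```

For the hanging script this reports the loop echo -> out closed -> if block done -> echo, and for the working script it reports no cycle, which matches the two runs above.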
----- Original Message -----
> From: "Sheri Mickelson" <mickelso at mcs.anl.gov>
> To: "Michael Wilde" <wilde at mcs.anl.gov>
> Sent: Tuesday, February 14, 2012 9:29:26 AM
> Subject: No events in 10s.
> Hi Mike,
>
> I've been trying to sort out an issue that I've been having with my
> ocean Swift code for a couple of days now and I'm stuck. Would you
> have time to give it a quick look?
>
> I've attached both my Swift file and the log file. I'm running local
> with coasters using the Swift 0.93 from the release page.
>
> Here's the exact error I'm seeing:
>
> Progress: time: Tue, 14 Feb 2012 07:51:52 -0700 Finished
> successfully:126
> No events in 10s.
>
> Registered futures:
> file[] ncl_finished Open, 12 elements, 1 listeners
> file psFileList - F/psFileList:file - Open
> file[] mocmYearlyFiles Open, 2 elements, 1 listeners
> file moctsa - F/moctsa:file - Open
> string[] psFiles Open, 0 elements, 1 listeners
> file[] mocaYearlyFiles Open, 2 elements, 1 listeners
> ----
>
> Waiting threads:
> 0-66-1
> 0-68-1-5-1
> 0-68-1-10-1
> 0-77
> 0-78
> 0-68-1-4-1
> 0-68-1-11-1
> 0-68-1-1-1
> 0-68-1-8-1
> 0-68-1-3-1
> 0-68-1-6-1
> 0-68-1-9-1
> 0-68-1-2-1
> 0-68-1-0-1
> 0-68-1-7-1
> 0-64-1
> 0-76
> ----
>
> I think there are two sections that are having problems.
>
> The first one starts at line 327.
>
> I checked my _concurrent directory and I have both of the files that
> ncks_var produced
> yearlyFile-8e533cf7-76aa-4110-ade1-857a13f77134-64-0-0
> yearlyFile-8e533cf7-76aa-4110-ade1-857a13f77134-64-0-1
>
> I also tried changing line 338 to
> mocaYearlyFiles[y] = create_blank_file_File(yearlyFile);
>
> and I get
> _concurrent/mocaYearlyFiles-f872c3e5-7574-4e51-9d69-3a33b9802725--array/
> elt-0 elt-1
>
> I'm only running this on two years of data so there are two files
> produced - one for each year.
>
> The problem is that line 341 never executes
> moctsa = Record_Cat(mocaYearlyFiles);
>
> I have a similar problem with line 368 never executing:
> moctsm = Record_Cat(mocmYearlyFiles);
>
> With this section everything is created and is also in _concurrent.
>
> The code stalls because it's waiting for the above Record_Cat calls to
> continue on and start running the ncl scripts.
>
> Does anything pop out at you? I'm at Argonne all day today if you
> want to stop by.
>
> Thanks, Sheri
--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory
----- Original Message -----
> From: "Michael Wilde" <wilde at mcs.anl.gov>
> To: "Mihael Hategan" <hategan at mcs.anl.gov>
> Cc: "Swift Devel" <swift-devel at ci.uchicago.edu>
> Sent: Saturday, July 14, 2012 6:29:05 AM
> Subject: Re: [Swift-devel] hang checker updates
> This sounds great, Mihael. I'm eager to try it.
>
> In the meantime, can you help diagnose the specific deadlock in the
> PNNL "SPH" script? The deadlock doesn't occur until several hours into
> a large run on their Hopper Cray system, using complex MPI
> applications, so it's not easy for us (or even them) to replicate. But
> it does deadlock on every run they've tried recently.
>
> From the Swift .log we have of one such deadlock, we determined the
> variables that are the likely cause. Now we're trying to determine the
> deadlocking statements by analyzing the source code and the .kml file,
> which tells us where the partial array closes are and where the code
> waits on those closes.
>
> One feature which I *think* would help in this debugging is a source
> code listing that annotates where these closes and waits are inserted.
> Would it be possible to generate that from the .kml file? (Or even a
> listing that gives the lines or expressions where these events take
> place?)
>
> The files for this problem are on the CI net at:
> /home/wilde/swift/support/PNNL.SPH.deadlock.2012.0712
>
> In that dir:
>
> - I extracted the source, hybrid.swift, from the .log.
>
> - open00 gives the open variables in the first hang-checker event in
> the log:
> egrep -i -w
> 'local_output|writeDataOut|sphOutArr|sphOutNameArr|tarfile' *.log
>
> - Khushbu thinks that line 388 is not getting executed as expected
> when the hang occurs:
> (local_forward_dat, gpg_stdout, conca_dat, concb_dat, concc_dat) =
> gpg(writeDataOut, h5part_files, iter, plot);
>
> This suggests that writeDataOut is the open variable blocking this
> statement.
>
> Working backward, writeDataOut is declared and set at lines 356-358:
>
> file writeDataOut <single_file_mapper;file=@strcat("run-", iter,
> "/writeData.out")>;
> trace("file writeDataOut = ", writeDataOut);
> writeDataOut = writeData(sphOutNameArr);
>
> This in turn is blocking on open var sphOutNameArr which in turn is
> possibly blocking on sphOutArr.
>
> In a similar deadlock we debugged in ParVis code, the script made a
> reference to a full array (i.e., passed an array to a function that
> blocked on a complete close of the array) *within* a code block in
> which the array was still open. That is, a partial close could not
> execute because the block had not completed, and the block could not
> complete until the partial close was done. I'm not sure this is the
> same situation, but it's possible. The script has many conditionals,
> which could explain why it doesn't deadlock until long into execution.
>
> If we could trace all the references to the open variables, including
> all partial closes and all waits on those closes, we might be able to
> identify and eliminate the deadlock.
>
> We can clearly see the partial closes and the waits on these in the
> kml, but mapping the KML to source code lines, while possible, is
> tedious and manual as far as I can tell.
>
> Ideally, we could have a tool that does this given the source, the
> kml, and the log. I'm hoping your new trace code either does this or
> comes close. Early next week I'll try to help Karen and Khushbu do
> this, unless you can help them sooner.
>
> Thanks,
>
> - Mike
>
> ----- Original Message -----
> > From: "Mihael Hategan" <hategan at mcs.anl.gov>
> > To: "Swift Devel" <swift-devel at ci.uchicago.edu>
> > Sent: Saturday, July 14, 2012 1:16:47 AM
> > Subject: [Swift-devel] hang checker updates
> > I think Mike requested Swift stack traces in the hang checker
> > instead of cryptic thread ids. That's in now.
> >
> > Also in is a dependency loop detector in the hang checker. It
> > doesn't detect static cycles, but ones that actually cause a hang.
> > I'm not sure how well it works for real life situations, but I can
> > confirm it works for simple things like a = f(b); b = f(a);. Please
> > give it a shot.
> >
> > _______________________________________________
> > Swift-devel mailing list
> > Swift-devel at ci.uchicago.edu
> > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel