[Swift-devel] hang checker updates

Michael Wilde wilde at mcs.anl.gov
Sat Jul 14 06:29:05 CDT 2012


This sounds great, Mihael. I'm eager to try it.

In the meantime, can you help diagnose the specific deadlock in the PNNL "SPH" script? The deadlock doesn't occur until several hours into a large run on their Hopper Cray system, using complex MPI applications, so it's not easy for us (or even them) to replicate. But it does deadlock on every run they've tried recently.

From the Swift .log we have of one such deadlock, we determined the variables that are the likely cause. Now we're trying to determine the deadlocking statements by analyzing the source code and the .kml file, which tells us where the partial array closes are and where the code waits on those closes.

One feature which I *think* would help in this debugging is a source code listing that annotates where these closes and waits are inserted. Would it be possible to generate that from the .kml file? (Or even a listing that gives the lines or expressions where these events take place?)

The files for this problem are on the CI net at:
  /home/wilde/swift/support/PNNL.SPH.deadlock.2012.0712

In that dir:

- I extracted the source, hybrid.swift, from the .log.

- open00 gives the open variables in the first hang-checker event in the log:
  egrep -i -w 'local_output|writeDataOut|sphOutArr|sphOutNameArr|tarfile' *.log

- Khushbu thinks that line 388 is not getting executed as expected when the hang occurs:
      (local_forward_dat, gpg_stdout, conca_dat, concb_dat, concc_dat) = gpg(writeDataOut, h5part_files, iter, plot);

This suggests that writeDataOut is the open variable blocking this statement.

Working backward, writeDataOut is declared and set at lines 356-358:

    file writeDataOut <single_file_mapper;file=@strcat("run-", iter, "/writeData.out")>;
    trace("file writeDataOut = ", writeDataOut);
    writeDataOut = writeData(sphOutNameArr);

That statement in turn blocks on the open variable sphOutNameArr, which in turn may be blocking on sphOutArr.
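
To make the suspected chain easier to reason about, here is a minimal Python sketch that writes it down as a "waits on" map and walks it. The edges are only the guesses described above (the last one is uncertain), and the function name blocking_chain is hypothetical, not part of Swift or the hang checker.

```python
# Suspected "waits on" edges from the analysis above.
# The sphOutNameArr -> sphOutArr edge is only a possibility, not confirmed.
WAITS_ON = {
    "gpg (line 388)": "writeDataOut",
    "writeDataOut": "sphOutNameArr",
    "sphOutNameArr": "sphOutArr",
}


def blocking_chain(start, waits_on):
    """Follow 'waits on' edges from start until they run out or a node repeats."""
    chain = [start]
    seen = {start}
    while chain[-1] in waits_on:
        nxt = waits_on[chain[-1]]
        chain.append(nxt)
        if nxt in seen:
            break  # a repeated node means the chain closes into a loop
        seen.add(nxt)
    return chain


if __name__ == "__main__":
    print(" -> ".join(blocking_chain("gpg (line 388)", WAITS_ON)))
```

If the chain ever ends on a name that already appeared earlier in it, that is a dependency loop of the kind the new detector is meant to report.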

In a similar deadlock we debugged in the ParVis code, the script made a reference to a full array (i.e., passed an array to a function that blocked on a complete close of the array) *within* a code block in which the array was still open. That is, a partial close could not execute because the block had not completed, and the block could not complete until the partial close was done. I'm not sure this is the same situation, but it's possible. The script has many conditionals, which could explain why it doesn't deadlock until long into execution.

If we could trace all the references to the open variables, including all partial closes and all waits on those closes, we might be able to identify and eliminate the deadlock.
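
As a first step toward that, the source-side half can be sketched in a few lines of Python: list every source line that mentions one of the open variables. This is only a rough sketch. The variable names come from the egrep command above, SAMPLE is a three-line stand-in for hybrid.swift (so its line numbers do not match the real file), and find_references is a hypothetical helper. Unlike the egrep -i command, it matches case-sensitively, and it knows nothing about the .kml or where closes are inserted.

```python
import re

# Variable names taken from the egrep command earlier in this message.
OPEN_VARS = ["local_output", "writeDataOut", "sphOutArr", "sphOutNameArr", "tarfile"]

# Stand-in for the real source; in practice, read the text of hybrid.swift instead.
SAMPLE = '''file writeDataOut <single_file_mapper;file=@strcat("run-", iter, "/writeData.out")>;
trace("file writeDataOut = ", writeDataOut);
writeDataOut = writeData(sphOutNameArr);
'''


def find_references(source_text, names):
    """Return {name: [(line_number, stripped_line), ...]} for whole-word matches."""
    patterns = {n: re.compile(r"\b" + re.escape(n) + r"\b") for n in names}
    refs = {n: [] for n in names}
    for lineno, line in enumerate(source_text.splitlines(), start=1):
        for name, pattern in patterns.items():
            if pattern.search(line):
                refs[name].append((lineno, line.strip()))
    return refs


if __name__ == "__main__":
    for name, hits in find_references(SAMPLE, OPEN_VARS).items():
        for lineno, text in hits:
            print(f"{name}:{lineno}: {text}")
```

A listing like this would still have to be matched by hand against the partial closes and waits in the .kml, which is the tedious part.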

We can clearly see the partial closes and the waits on these in the kml, but mapping the KML to source code lines, while possible, is tedious and manual as far as I can tell.

Ideally, we could have a tool that does this given the source, the kml, and the log. I'm hoping your new trace code either does this or comes close. Early next week I'll try to help Karen and Khushbu do this, unless you can help them sooner.

Thanks,

- Mike





----- Original Message -----
> From: "Mihael Hategan" <hategan at mcs.anl.gov>
> To: "Swift Devel" <swift-devel at ci.uchicago.edu>
> Sent: Saturday, July 14, 2012 1:16:47 AM
> Subject: [Swift-devel] hang checker updates
> I think Mike requested swift stack traces in the hang checker instead
> of cryptic thread ids. That's in now.
> 
> Also in is a dependency loop detector in the hang checker. It doesn't
> detect static cycles, but ones that actually cause a hang. I'm not
> sure how well it works for real life situations, but I can confirm it
> works for simple things like a = f(b); b = f(a);. Please give it a shot.
> 
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory



