[Swift-devel] hang checker updates

Mihael Hategan hategan at mcs.anl.gov
Sat Jul 14 14:07:41 CDT 2012


On Sat, 2012-07-14 at 13:55 -0500, Michael Wilde wrote:
> Wow - great analysis!  Is the logic you applied here embedded in the new trace code? (I.e., if users and Swift support folks could get this right off the bat, that would be excellent.)

It doesn't deal with partial closes and array analysis. I'm working on
that.

> 
> I'll forward this to the PNNL folks and see if they have more logs. Everything I got from them so far is in the dir you already have.
> 
> I was slowly going down this chain, but had no clue where to get these
> thread IDs.

You start with thread 0. If you have a parallel(), then each block inside
it gets a new level and a sequential id:

parallel(
  sequential(// happens in thread 0-0
    ...
  )
  foo(b); // happens in thread 0-1
)

Foreach loops also add their own level and use a sequential id for each
iteration. It takes a bit of manual work to look at the kml structure and
figure out where things are. When you have a compound invocation, you can
speed up the process by looking at the first few levels of the thread ID
and comparing them with those in the log.
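
The numbering rules described above can be sketched as a toy model. This is
only an illustration in Python, not the Swift/Karajan implementation: the Node
class, the assign_thread_ids function, and the assumption that ids start at 0
and count only the direct children of a parallel() or foreach are all
hypothetical. Numbering in a real run may differ.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Hypothetical stand-in for one element of the compiled script.
    kind: str                 # "parallel", "foreach", "sequential" or "call"
    label: str = ""           # text to report next to the thread ID
    children: list = field(default_factory=list)


def assign_thread_ids(node, thread="0", out=None):
    """Return (thread_id, label) pairs for every labelled node."""
    if out is None:
        out = []
    if node.label:
        out.append((thread, node.label))
    if node.kind in ("parallel", "foreach"):
        # Assumption: each block (or iteration) gets a new level
        # and a sequential id, starting at 0.
        for i, child in enumerate(node.children):
            assign_thread_ids(child, f"{thread}-{i}", out)
    else:
        # Everything else stays in the thread of its parent.
        for child in node.children:
            assign_thread_ids(child, thread, out)
    return out


# The parallel() example from this message.
tree = Node("parallel", children=[
    Node("sequential", "sequential block"),
    Node("call", "foo(b)"),
])
ids = assign_thread_ids(tree)
print(ids)  # [('0-0', 'sequential block'), ('0-1', 'foo(b)')]
```

A foreach can be modelled the same way, with one child per iteration, so a
call in the second iteration of a foreach that is the first block of a
parallel() would come out as 0-0-1 under these assumptions.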

That should not be needed with the new stack traces, though those traces
might still need some improvement.

>  I only looked at the first hang checker trace, whose thread IDs I
> could not find in the log.  How did you get all the details below?
> (Don't need to answer that now - might be good to put the technique in
> a page of debugging tips and/or the automated tracer...)
> 
> Thanks!
> 
> - Mike
> 
> 
> ----- Original Message -----
> > From: "Mihael Hategan" <hategan at mcs.anl.gov>
> > To: "Michael Wilde" <wilde at mcs.anl.gov>
> > Cc: "Swift Devel" <swift-devel at ci.uchicago.edu>
> > Sent: Saturday, July 14, 2012 1:38:17 PM
> > Subject: Re: [Swift-devel] hang checker updates
> > The waiting threads are as follows:
> > 
> > 0-17-84-2-66 local_output.h5part = sphOutArr
> > 0-17-84-2-73 foreach myfile in local_output.h5part
> > 0-17-84-2-72 output = local_output
> > 0-17-84-2-63 gpg(local_forward_dat, gpg_stdout, conca_dat, concb_dat,
> > concc_dat, writeDataOut, h5part_files, iter, plot)
> > 0-17-84-2-54 trace(writeDataOut)
> > 0-17-84-2-55 writeDataOut = writeData(sphOutNameArr)
> > 0-17-84-2-47-4-3-7 tarfiles[i] = tarfile
> > 
> > 54 waits on writeDataOut which waits on sphOutNameArr
> > 55 waits on sphOutNameArr
> > 63 waits on writeDataOut which waits on sphOutNameArr
> > 66 waits on sphOutArr
> > 72 waits on local_output.h5part which waits on sphOutArr
> > 73 waits on local_output which waits on sphOutArr
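
The wait chains listed above can also be followed mechanically. The following
is a minimal Python sketch, not part of the hang checker: the waits_on table
is transcribed by hand from the list above, and the root_causes function is a
hypothetical helper that walks the "waits on" edges until it reaches names
that the table does not explain any further.

```python
# Thread (last ID component) or variable -> what it waits on,
# copied from the list above.
waits_on = {
    "54": ["writeDataOut"],
    "55": ["sphOutNameArr"],
    "63": ["writeDataOut"],
    "66": ["sphOutArr"],
    "72": ["local_output.h5part"],
    "73": ["local_output"],
    "writeDataOut": ["sphOutNameArr"],
    "local_output.h5part": ["sphOutArr"],
    "local_output": ["sphOutArr"],
}


def root_causes(start, graph):
    """Follow waits-on edges from start; return the names with no further entry."""
    seen, stack, roots = set(), [start], set()
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        seen.add(name)
        deps = graph.get(name, [])
        if not deps and name != start:
            roots.add(name)  # nothing in the table explains this one
        stack.extend(deps)
    return roots


threads = ["54", "55", "63", "66", "72", "73"]
all_roots = set()
for t in threads:
    all_roots |= root_causes(t, waits_on)
print(sorted(all_roots))  # ['sphOutArr', 'sphOutNameArr']
```

Every listed thread ends up at sphOutNameArr or sphOutArr, which is why those
two variables are the ones worth investigating.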
> > 
> > sphOutNameArr and sphOutArr wait on two partial closes: 88043 and 88075.
> > Those are the if (n > NUM_SPH_RUNS) {} (line 250) and the iterate on
> > line 313.
> > 
> > The first one is the problem. In particular:
> > 0-17-84-2-47-4-3-7 tarfiles[4] = tarfile
> > 
> > For some reason tarfile is still open. Since it should be closed by
> > copySph (and all other returns of copySph are closed), I can only
> > conclude that it's a Swift bug.
> > 
> > Do you have a different run (just the log file with hang checker
> > triggered will do) to confirm?
> > 
> > Mihael
> > 
> > On Sat, 2012-07-14 at 11:33 -0500, Michael Wilde wrote:
> > > Sorry, should be readable now.
> > >
> > > - Mike
> > >
> > > ----- Original Message -----
> > > > From: "Mihael Hategan" <hategan at mcs.anl.gov>
> > > > To: "Michael Wilde" <wilde at mcs.anl.gov>
> > > > Cc: "Swift Devel" <swift-devel at ci.uchicago.edu>
> > > > Sent: Saturday, July 14, 2012 11:04:28 AM
> > > > Subject: Re: [Swift-devel] hang checker updates
> > > > On Sat, 2012-07-14 at 06:29 -0500, Michael Wilde wrote:
> > > > > In the meantime, can you help diagnose the specific deadlock in the
> > > > > PNNL "SPH" script?
> > > >
> > > > I can try.
> > > >
> > > > > The files for this problem are on the CI net at:
> > > > >   /home/wilde/swift/support/PNNL.SPH.deadlock.2012.0712
> > > >
> > > > scp: /home/wilde/swift/support/PNNL.SPH.deadlock.2012.0712: Permission denied
> > >
> 




