[Swift-devel] hang checker updates

Michael Wilde wilde at mcs.anl.gov
Sat Jul 14 13:55:26 CDT 2012


Wow - great analysis!  Is the logic you applied here embedded in the new trace code? (I.e., if users and Swift support folks could get this right off the bat, that would be excellent.)

I'll forward this to the PNNL folks and see if they have more logs. Everything I got from them so far is in the dir you already have.

I was slowly going down this chain, but had no clue where to get these thread IDs. I only looked at the first hang checker trace, whose thread IDs I could not find in the log.  How did you get all the details below?  (Don't need to answer that now - might be good to put the technique in a page of debugging tips and/or the automated tracer...)
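
For a debugging-tips page, the chain-following in the quoted analysis below could be sketched roughly like this. This is only an illustration, not Swift's tracer code: the WAITS_ON table is transcribed by hand from the quoted message, and the name root_blockers is made up for the example.

```python
# Waits-on relations copied by hand from the quoted analysis (an assumption:
# nothing here is parsed from a real Swift log). Keys are thread ID suffixes
# or variable names; values list what each one is waiting on.
WAITS_ON = {
    "54": ["writeDataOut"],
    "55": ["sphOutNameArr"],
    "63": ["writeDataOut"],
    "66": ["sphOutArr"],
    "72": ["local_output.h5part"],
    "73": ["local_output"],
    "writeDataOut": ["sphOutNameArr"],
    "local_output.h5part": ["sphOutArr"],
    "local_output": ["sphOutArr"],
    "sphOutNameArr": ["partial close 88043", "partial close 88075"],
    "sphOutArr": ["partial close 88043", "partial close 88075"],
}


def root_blockers(node, graph, seen=None):
    """Follow waits-on edges from node and return the nodes that wait on nothing."""
    if seen is None:
        seen = set()
    if node in seen:
        # Already visited on this walk (shared dependency or a cycle).
        return set()
    seen.add(node)
    targets = graph.get(node, [])
    if not targets:
        # Nothing recorded for this node, so treat it as a root blocker.
        return {node}
    roots = set()
    for target in targets:
        roots |= root_blockers(target, graph, seen)
    return roots


if __name__ == "__main__":
    for thread in ["54", "55", "63", "66", "72", "73"]:
        print(thread, "->", sorted(root_blockers(thread, WAITS_ON)))
```

Every waiting thread resolves to the same two partial closes, which is what narrows the search to those two spots in the script.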

Thanks!

- Mike


----- Original Message -----
> From: "Mihael Hategan" <hategan at mcs.anl.gov>
> To: "Michael Wilde" <wilde at mcs.anl.gov>
> Cc: "Swift Devel" <swift-devel at ci.uchicago.edu>
> Sent: Saturday, July 14, 2012 1:38:17 PM
> Subject: Re: [Swift-devel] hang checker updates
> The waiting threads are as follows:
> 
> 0-17-84-2-66 local_output.h5part = sphOutArr
> 0-17-84-2-73 foreach myfile in local_output.h5part
> 0-17-84-2-72 output = local_output
> 0-17-84-2-63 gpg(local_forward_dat, gpg_stdout, conca_dat, concb_dat,
> concc_dat, writeDataOut, h5part_files, iter, plot)
> 0-17-84-2-54 trace(writeDataOut)
> 0-17-84-2-55 writeDataOut = writeData(sphOutNameArr)
> 0-17-84-2-47-4-3-7 tarfiles[i] = tarfile
> 
> 54 waits on writeDataOut which waits on sphOutNameArr
> 55 waits on sphOutNameArr
> 63 waits on writeDataOut which waits on sphOutNameArr
> 66 waits on sphOutArr
> 72 waits on local_output.h5part which waits on sphOutArr
> 73 waits on local_output which waits on sphOutArr
> 
> sphOutNameArr and sphOutArr wait on two partial closes: 88043 and 88075.
> Those are the if (n > NUM_SPH_RUNS) {} (line 250) and the iterate on line 313.
> 
> The first one is the problem. In particular:
> 0-17-84-2-47-4-3-7 tarfiles[4] = tarfile
> 
> For some reason tarfile is open. Since it should be closed by copySph
> (and all other returns of copySph are closed), I can only conclude that
> it's a Swift bug.
> 
> Do you have a different run (just the log file with hang checker
> triggered will do) to confirm?
> 
> Mihael
> 
> On Sat, 2012-07-14 at 11:33 -0500, Michael Wilde wrote:
> > Sorry, should be readable now.
> >
> > - Mike
> >
> > ----- Original Message -----
> > > From: "Mihael Hategan" <hategan at mcs.anl.gov>
> > > To: "Michael Wilde" <wilde at mcs.anl.gov>
> > > Cc: "Swift Devel" <swift-devel at ci.uchicago.edu>
> > > Sent: Saturday, July 14, 2012 11:04:28 AM
> > > Subject: Re: [Swift-devel] hang checker updates
> > > On Sat, 2012-07-14 at 06:29 -0500, Michael Wilde wrote:
> > > > In the meantime, can you help diagnose the specific deadlock in the
> > > > PNNL "SPH" script?
> > >
> > > I can try.
> > >
> > > > The files for this problem are on the CI net at:
> > > >   /home/wilde/swift/support/PNNL.SPH.deadlock.2012.0712
> > >
> > > scp: /home/wilde/swift/support/PNNL.SPH.deadlock.2012.0712: Permission denied
> >

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory



