[Swift-devel] Re: Need help on race issue in Swift on Montage code

Thu Aug 26 14:06:18 CDT 2010

If looking at log files or output files will help to give a better idea 
of the problem, you can take a look at 
~jonmon/Workspace/Swift/Montage/m101_j_05x05.  This is my 18 image set.  
In that directory there are two run directories.  run.0002 is a run that 
completed the workflow and run.0001 is a run that hung.  In each of 
those directories there is a swift.out file that contains the output to 
the screen that was captured.

Also, Mihael, when the hang occurs, I type v for the inhook you set up 
and I get
Register Futures:

and then nothing.  Does this mean there are no listeners set up and that 
is why it hung?

On 08/26/2010 01:48 PM, wilde at mcs.anl.gov wrote:
> Hi Mihael, Justin,
>
> Long email follows - sorry!
>
> As I mentioned in passing, Jon is stuck on what looks very much like a race condition in Swift/Karajan thread synchronization. (Testing on trunk.)
>
> If Jon runs a Montage problem of size "10" it seems to always complete successfully.
>
> If he runs a problem of size ~1600, it always hangs.
>
> He now has a problem of size 18 that seems to hang a significant percentage of the time (~ 50%???)
>
> Jon is now trying to whittle that size-18 failing example down to a simple example you can run yourself to reproduce the problem.
>
> He knows pretty well what it is hanging on (see below; Jon is trying to package up a failing test case).
>
> The logic is basically:
>
> 1. csv map an array of structures from a csv file that describes the output of the earlier stages of Montage processing
>
> 2. foreach entry in the array of structures (~ 0..34 in the size-18 problem):
>    a. use simple mapper to map 2 files from the struct
>    b. run a montage function "mDiff" on these two files plus one constant hdr file from outside the loop
>
> The program hangs on the foreach loop because (I think, if I have this right) *some* of the mapped dependencies don't seem to be getting set. It's not clear to me whether, in the failing case, *all* the mDiff() calls inside the foreach loop are hanging, or only *some* of them.  Jon: please provide the details and correct me as needed.
>
> Also, we are relying heavily here on tracef("%k") to print the set-state of various variables. If %k is not 100% correct, then all of our assumptions are questionable.
>
> I'm also curious to know what tools we have, or could develop, to help find what is hanging on what in general, as a debugging aid both for users shaking out their app errors and for Swift developers diagnosing a hang that is a Swift bug.
>
> (Jon told me about some ^T command that causes Swift to enter a Karajan debugging mode? I'd like to learn more about that, and how we might make it most useful for end users and for diagnostic info gathering).
>
> In case it's of use, I've pasted below our latest Skype text chat on this problem, which details what we know and what Jon will try next.
>
> Help and guidance on how to proceed would be great!
>
> Thanks,
>
> - Mike
>
> ---
>
> [8/26/10 11:05:04 AM] Jonathan Monette: ... I gathered some stuff to send to devel about the hanging problem just not sure how to word the problem for Mihael and Justin
> [8/26/10 1:00:25 PM] Michael Wilde: Hi, sorry, I missed your last message above
> [8/26/10 1:00:42 PM] Michael Wilde: "just not sure how to word the problem for Mihael and Justin" - how can I help on that?
> [8/26/10 1:01:20 PM] Jonathan Monette: uhhhh......well how do I word the email to describe my problem?
> [8/26/10 1:01:21 PM] Michael Wilde: Seems like you were on the right track yesterday: adding enough traces that the problem can be readily seen;
> [8/26/10 1:01:32 PM] Michael Wilde: able to state what seems to be hanging on what;
> [8/26/10 1:02:04 PM] Michael Wilde: and running enough times to prove that it runs OK some times (and what that traces out as) and then fails to complete other times (and what *that* traces as)
> [8/26/10 1:03:03 PM] Jonathan Monette: yea but how do I word the email? just something like "There is a hanging problem in swift.  Jobs do not submit even though the inputs for the apps are satisfied."
> [8/26/10 1:03:14 PM] Michael Wilde: Then, what you can/should do, is try to start "whittling" down the test case so we can catch the failure in a simple example, that mihael or justin can easily run with minimal setup, to first make the problem happen, and then test their fix
> [8/26/10 1:03:36 PM] Michael Wilde: "There is a hanging problem in swift.  Jobs do not submit even though the inputs for the apps SEEM TO BE satisfied."
> [8/26/10 1:03:58 PM] Michael Wilde: here is the code, here is the output trace, and here are the logs
> [8/26/10 1:04:13 PM] Michael Wilde: I have run this 10 times and it hangs 6 of 10 times
> [8/26/10 1:04:14 PM] Jonathan Monette: ok.  well my simple test script completed a couple of times.  I will run it more to see if I can get it to hang.
> [8/26/10 1:04:14 PM] Michael Wilde: etc
> [8/26/10 1:04:31 PM] Michael Wilde: saving the logs into run.nnnn through run.nnnn+10
> [8/26/10 1:04:54 PM] Michael Wilde: Right, that's exactly what you need to do.
> [8/26/10 1:05:00 PM] Michael Wilde: I usually do this:
> [8/26/10 1:05:14 PM] Michael Wilde: start with the real code; run N times, see what the failure ratio is.
> [8/26/10 1:05:42 PM] Michael Wilde: Then make a complete copy, make sure it still fails, and start stripping it down to the simplest program that still fails.
> [8/26/10 1:05:51 PM] Jonathan Monette: ok.  I'll see if I can get the test script to fail
> [8/26/10 1:06:01 PM] Michael Wilde: wherever possible, replace Montage code with cat/sleep etc
> [8/26/10 1:06:19 PM] Michael Wilde: So this is and needs to be a general Swift development method:
> [8/26/10 1:07:03 PM] Michael Wilde: when we find any error, but *especially* a race, hang, or similar parallelism-related error, we need to isolate it to a test case that can be added to the test suite
> [8/26/10 1:07:23 PM] Michael Wilde: This bug, in particular, Mihael *thinks* he fixed, yet it "came back".
> [8/26/10 1:07:40 PM] Michael Wilde: That's what SE folks mean by "regression testing":
> [8/26/10 1:08:17 PM] Michael Wilde: create a test that ensures that a fix is in place; then run that test forever more, to make sure that bug stays fixed and that nothing similar takes its place ; )
> [8/26/10 1:08:37 PM] Michael Wilde: This is hard work, no doubt about it
> [8/26/10 1:09:01 PM] Michael Wilde: But a simple reliable test case that reproduces a bug is THE most important requirement for fixing the bug
> [8/26/10 1:09:30 PM] Michael Wilde: So you need to swap your user hat for a Swift developer hat, for the moment : )
> [8/26/10 1:09:30 PM] Jonathan Monette: yea.  I am running several tests on the stripped down function I have to see if I can reproduce the error.
> [8/26/10 1:09:39 PM] Jonathan Monette: alright.
> [8/26/10 1:09:48 PM] Jonathan Monette: let's see if I can reproduce the error.
> [8/26/10 1:09:54 PM] Michael Wilde: Now, for this test to go into the test suite, it needs to go into a loop
> [8/26/10 1:10:11 PM] Michael Wilde: Some nasty races only occur 1/100 times, or worse.
> [8/26/10 1:10:34 PM] Michael Wilde: so we can only trust swift when we can run the simple example 100,000 times w/o a hang : (
> [8/26/10 1:11:16 PM] Michael Wilde: Once we have the tests, we can then run torture tests before releases that give us a good assurance of having a reliable product
> [8/26/10 1:11:26 PM] Jonathan Monette: alright.
> [8/26/10 1:11:42 PM] Michael Wilde: Once more, with enthusiasm! :)
> [8/26/10 1:12:04 PM] Jonathan Monette: well I am going to try this test several times with several different input files that increase in size.
> [8/26/10 1:12:30 PM] Jonathan Monette: the problem "always" appears with my larger sets so maybe with the large file it will fail more often
> [8/26/10 1:12:54 PM] Michael Wilde: I'll echo the above dialog to Mihael and Justin so they can pipe in with suggestions, OK? Hopefully to make your life easier and find the problem faster....
> [8/26/10 1:13:29 PM] Jonathan Monette: alright. thanks.  that will help.
> [8/26/10 1:14:01 PM] Michael Wilde: One approach is to strip out the earlier stages and the later stages, and first see if a shorter script with *just* mDiff and the foreach loop will fail. I think it should.
> [8/26/10 1:14:32 PM] Michael Wilde: Then once you have that (ie, a 20 line montage script that fails) you can try to replace Montage with cat
> [8/26/10 1:14:42 PM] Michael Wilde: because really all it is doing,
> [8/26/10 1:15:33 PM] Michael Wilde: is a csv mapping (and you can capture and freeze that) and then a simple foreach loop with just one real app (mDiff), which you can replace with a cat of 3 files to 1 file, right?
>
>    
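
The "run N times, see what the failure ratio is" step from the chat above could be automated roughly as follows. This is only a sketch under assumptions: the function name measure_hang_rate, the timeout value, and the example command and script name in the trailing comment are all hypothetical, and any run that exceeds the timeout is simply counted as a hang. It saves each run's captured output as run.NNNN/swift.out, mirroring the run directories mentioned earlier.

```python
import subprocess
import sys
from pathlib import Path


def measure_hang_rate(cmd, runs, timeout_s, log_root="."):
    """Run `cmd` `runs` times and count the runs that exceed `timeout_s`.

    Combined stdout/stderr of run i is written to
    <log_root>/run.NNNN/swift.out. Returns (hung, runs).
    Assumption: a run that does not finish within `timeout_s` is a hang.
    """
    hung = 0
    for i in range(1, runs + 1):
        run_dir = Path(log_root) / f"run.{i:04d}"
        run_dir.mkdir(parents=True, exist_ok=True)
        with open(run_dir / "swift.out", "wb") as out:
            try:
                subprocess.run(
                    cmd,
                    stdout=out,
                    stderr=subprocess.STDOUT,
                    timeout=timeout_s,
                )
            except subprocess.TimeoutExpired:
                # subprocess.run kills the direct child before raising;
                # processes that child started may need separate cleanup.
                hung += 1
    return hung, runs


# Hypothetical usage (command and script name are placeholders):
#   hung, total = measure_hang_rate(["swift", "mdiff_test.swift"], 10, 600)
#   print(f"hung {hung} of {total} runs")
```

A non-zero exit status is not counted here; only timeouts are. Once the script is stripped down, the same loop could be run with a much larger count to catch races that show up only rarely.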

-- 
Jon

Computers are incredibly fast, accurate, and stupid. Human beings are incredibly slow, inaccurate, and brilliant. Together they are powerful beyond imagination.
- Albert Einstein