[Swift-devel] Need help on race issue in Swift on Montage code

Thu Aug 26 13:48:31 CDT 2010

Hi Mihael, Justin,

Long email follows - sorry!

As I mentioned in passing, Jon is stuck on what looks very much like a race condition in Swift/Karajan thread synchronization. (He is testing on trunk.)

If Jon runs a Montage problem of size "10" it seems to always complete successfully.

If he runs a problem of size ~1600, it always hangs.

He now has a problem of size 18 that seems to hang a significant percentage of the time (~ 50%???)

Jon is now trying to whittle that size-18 failing example down to a simple example you can run yourself to reproduce the problem.

He knows pretty well what it is hanging on (see below; Jon is trying to package up a failing test case).

The logic is basically:

1. csv map an array of structures from a csv file that describes the output of the earlier stages of Montage processing

2. foreach entry in the array of structures (~ 0..34 in the size-18 problem):
  a. use simple mapper to map 2 files from the struct
  b. run a montage function "mDiff" on these two files plus one constant hdr file from outside the loop

The program hangs on the foreach loop because (I think, if I have this right) *some* of the mapped dependencies don't seem to be getting set. It's not clear to me whether, in the failing case, *all* the mDiff() calls inside the foreach loop are hanging, or only *some* of them. Jon: please provide the details and correct me as needed.

Also, we are relying heavily here on tracef("%k") to print the set-state of various variables. If %k is not 100% correct, then all of our assumptions are questionable.

I'm also curious to know what tools we have, or could develop, to help find what is hanging on what. Such a debugging aid would be useful both for users shaking out their app errors and for Swift developers diagnosing a hang that is a Swift bug.

(Jon told me about some ^T command that causes Swift to enter a Karajan debugging mode? I'd like to learn more about that, and how we might make it most useful for end users and for diagnostic info gathering).

In case it's of use, I've pasted below our latest Skype text chat on this problem, which details what we know and what Jon will try next.

Help and guidance on how to proceed would be great!

Thanks,

- Mike

---

[8/26/10 11:05:04 AM] Jonathan Monette: ... I gathered some stuff to send to devel about the hanging problem just not sure how to word the problem for Mihael and Justin
[8/26/10 1:00:25 PM] Michael Wilde: Hi, sorry, I missed your last message above
[8/26/10 1:00:42 PM] Michael Wilde: "just not sure how to word the problem for Mihael and Justin" - how can I help on that?
[8/26/10 1:01:20 PM] Jonathan Monette: uhhhh......well how do I word the email to describe my problem?
[8/26/10 1:01:21 PM] Michael Wilde: Seems like you were on the right track yesterday: adding enough traces that the problem can be readily seen;
[8/26/10 1:01:32 PM] Michael Wilde: able to state what seems to be hanging on what;
[8/26/10 1:02:04 PM] Michael Wilde: and running enough times to prove that it runs OK some times (and what that traces out as) and then fails to complete other times (and what *that* traces as)
[8/26/10 1:03:03 PM] Jonathan Monette: yea but how do I word the email? just something like "There is a hanging problem in swift.  Jobs do not submit even though the inputs for the apps are satisfied."
[8/26/10 1:03:14 PM] Michael Wilde: Then, what you can/should do, is try to start "whittling" down the test case so we can catch the failure in a simple example, that mihael or justin can easily run with minimal setup, to first make the problem happen, and then test their fix
[8/26/10 1:03:36 PM] Michael Wilde: "There is a hanging problem in swift.  Jobs do not submit even though the inputs for the apps SEEM TO BE satisfied."
[8/26/10 1:03:58 PM] Michael Wilde: here is the code, here is the output trace, and here are the logs
[8/26/10 1:04:13 PM] Michael Wilde: I have run this 10 times and it hangs 6 of 10 times
[8/26/10 1:04:14 PM] Jonathan Monette: ok.  well my simple test script completed a couple of times.  I will run it more to see if I can get it to hang.
[8/26/10 1:04:14 PM] Michael Wilde: etc
[8/26/10 1:04:31 PM] Michael Wilde: saving the logs into run.nnnn through run.nnnn+10
[8/26/10 1:04:54 PM] Michael Wilde: Right, thats exactly what you need to do.
[8/26/10 1:05:00 PM] Michael Wilde: I usually do this:
[8/26/10 1:05:14 PM] Michael Wilde: start with the real code; run N times, see what the failure ratio is.
[8/26/10 1:05:42 PM] Michael Wilde: Then make a complete copy, make sure it still fails, and start stripping it down to the simplest program that still fails.
[8/26/10 1:05:51 PM] Jonathan Monette: ok.  ill see if I can get the test script to fail
[8/26/10 1:06:01 PM] Michael Wilde: wherever possible, replace Montage code with cat/sleep etc
[8/26/10 1:06:19 PM] Michael Wilde: So this is and needs to be a general Swift development method:
[8/26/10 1:07:03 PM] Michael Wilde: when we find any error, but *especially* a race, hang, or similar parallelism-related error, we need to isolate it to a test case that can be added to the test suite
[8/26/10 1:07:23 PM] Michael Wilde: This bug, in particular, Mihael *thinks* he fixed, yet it "came back".
[8/26/10 1:07:40 PM] Michael Wilde: Thats what SE folks mean by "regression testing":
[8/26/10 1:08:17 PM] Michael Wilde: create a test that ensures that a fix is in place; then run that test forever more, to make sure that bug stays fixed and that nothing similar takes its place ; )
[8/26/10 1:08:37 PM] Michael Wilde: This is hard work, no doubt about it
[8/26/10 1:09:01 PM] Michael Wilde: But a simple reliable test case that reproduces a bug is THE most important requirement for fixing the bug
[8/26/10 1:09:30 PM] Michael Wilde: So you need to swap your user hat for a Swift developer hat, for the moment : )
[8/26/10 1:09:30 PM] Jonathan Monette: yea.  I am running several tests on the stripped down function I have and see if I can reproduce the error.
[8/26/10 1:09:39 PM] Jonathan Monette: alright.
[8/26/10 1:09:48 PM] Jonathan Monette: lets see if I can reproduce the error.
[8/26/10 1:09:54 PM] Michael Wilde: Now, for this test to go into the test suite, it needs to go into a loop
[8/26/10 1:10:11 PM] Michael Wilde: Some nasty races only occur 1/100 times, or worse.
[8/26/10 1:10:34 PM] Michael Wilde: so we can only trust swift when we can run the simple example 100,000 times w/o a hang : (
[8/26/10 1:11:16 PM] Michael Wilde: Once we have the tests, we can then run torture tests before releases that give us a good assurance of having a reliable product
[8/26/10 1:11:26 PM] Jonathan Monette: alright.
[8/26/10 1:11:42 PM] Michael Wilde: Once more, with enthusiasm! :)
[8/26/10 1:12:04 PM] Jonathan Monette: well I am going to try this test several times with several different input files that increase in size.
[8/26/10 1:12:30 PM] Jonathan Monette: the problem "always" appears with my larger sets so maybe with the large file it will fail more often
[8/26/10 1:12:54 PM] Michael Wilde: I'll echo the above dialog to Mihael and Justin so they can pipe in with suggestions, OK? Hopefully to make your life easier and find the problem faster....
[8/26/10 1:13:29 PM] Jonathan Monette: alright. thanks.  that will help.
[8/26/10 1:14:01 PM] Michael Wilde: One approach is to strip out the earlier stages and the later stages, and first see if a shorter script with *just* mDiff and the foreach loop will fail. I think it should.
[8/26/10 1:14:32 PM] Michael Wilde: Then once you have that (ie, a 20 line montage script that fails) you can try to replace Montage with cat
[8/26/10 1:14:42 PM] Michael Wilde: because really all it is doing,
[8/26/10 1:15:33 PM] Michael Wilde: is a csv mapping (and you can capture and freeze that) and then a simple foreach loop with just one real app (mDiff), which you can replace with a cat of 3 files to 1 file, right?
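
The "run N times, see what the failure ratio is" step from the chat above could be scripted roughly as follows. This is only a sketch in Python, not an existing Swift tool: the function names (run_once, measure, runs_needed), the log file naming, and the idea of treating "no exit within a timeout" as a hang are all assumptions made for illustration. The command to run and the timeout are left to the caller.

```python
import math
import subprocess
from pathlib import Path


def run_once(cmd, log_path, timeout_s):
    """Run cmd once, sending stdout and stderr to log_path.

    Returns "ok" (exit code 0), "failed" (non-zero exit), or "hung"
    (still running after timeout_s seconds; treating that as a hang is
    an assumption). On timeout, subprocess.run kills the direct child
    only, so processes it spawned may need separate cleanup.
    """
    with open(log_path, "wb") as log:
        try:
            result = subprocess.run(
                cmd, stdout=log, stderr=subprocess.STDOUT, timeout=timeout_s
            )
        except subprocess.TimeoutExpired:
            return "hung"
    return "ok" if result.returncode == 0 else "failed"


def measure(cmd, runs, timeout_s, log_dir):
    """Run cmd `runs` times, keeping one log per run (run.0000.log, ...)."""
    log_dir = Path(log_dir)
    log_dir.mkdir(parents=True, exist_ok=True)
    counts = {"ok": 0, "failed": 0, "hung": 0}
    for i in range(runs):
        outcome = run_once(cmd, log_dir / f"run.{i:04d}.log", timeout_s)
        counts[outcome] += 1
    return counts


def runs_needed(hang_probability, confidence):
    """Smallest n such that a bug hitting with the given per-run
    probability shows up at least once in n independent runs with the
    given confidence: 1 - (1 - p)**n >= confidence.
    """
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - hang_probability))


# Example use (the command line and the 600 s timeout are placeholders):
#   counts = measure(["swift", "test.swift"], 10, 600, "logs")
#   print(counts)
```

Under the assumption that runs are independent, runs_needed gives a rough feel for how long such a loop has to run: a hang that shows up 1 time in 100 needs on the order of 300 clean runs before one can be about 95% sure it would have been seen, and rarer races need proportionally more.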

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory