[Swift-devel] Re: Need help on race issue in Swift on Montage code

Thu Aug 26 14:36:20 CDT 2010

Hi, Jon

When I was running mProject in batch style, there was an _area.fits file 
with every .fits file, and I didn't see them in 
jonmon/Workspace/Swift/Montage/m101_j_05x05.
I am not sure if that is a necessary input file, but mProject did demand 
the _area.fits file in my case.

best
zhao

Jonathan Monette wrote:
> If looking at log files or output files will help to give a better 
> idea of the problem, you can take a look at 
> ~jonmon/Workspace/Swift/Montage/m101_j_05x05.  This is my 18 image 
> set.  In that directory there are two run directories.  run.0002 is a 
> run that completed the workflow and run.0001 is a run that hung.  In 
> each of those directories there is a swift.out file that contains the 
> output to the screen that was captured.
>
> Also, Mihael, when the hang occurs, I type v for the inhook you set up 
> and I get
> Register Futures:
>
> and then nothing.  Does this mean there are no listeners set up and 
> that is why it hung?
>
> On 08/26/2010 01:48 PM, wilde at mcs.anl.gov wrote:
>> Hi Mihael, Justin,
>>
>> Long email follows - sorry!
>>
>> As I mentioned in passing, Jon is stuck on what looks very much like a 
>> race condition in Swift/Karajan thread synchronization. 
>> (Testing on trunk)
>>
>> If Jon runs a Montage problem of size "10" it seems to always 
>> complete successfully.
>>
>> If he runs a problem of size ~1600, it always hangs.
>>
>> He now has a problem of size 18 that seems to hang a significant 
>> percentage of the time (~ 50%???)
>>
>> Jon is now trying to whittle that size-18 failing example down to a 
>> simple example you can run yourself to reproduce the problem.
>>
>> He knows pretty well what it is hanging on (see below; Jon is trying 
>> to package up a failing test case).
>>
>> The logic is basically:
>>
>> 1. csv map an array of structures from a csv file that describes the 
>> output of the earlier stages of Montage processing
>>
>> 2. foreach entry in the array of structures (~ 0..34 in the size-18 
>> problem):
>>    a. use simple mapper to map 2 files from the struct
>>    b. run a montage function "mDiff" on these two files plus one 
>> constant hdr file from outside the loop
>>
>> The program hangs on the foreach loop because (I think, if I have 
>> this right) *some* of the mapped dependencies don't seem to be getting 
>> set. It's not clear to me whether, in the failing case, *all* the 
>> mDiff() calls inside the foreach loop are hanging, or *some* of them 
>> are.  Jon: please provide the details and correct me as needed.
>>
>> Also, we are relying heavily here on tracef("%k") to print the 
>> set-state of various variables. If %k is not 100% correct, then all 
>> of our assumptions are questionable.
>>
>> I'm also curious to know what tools we have - or could develop - to 
>> help find, in general, what is hanging on what as a debugging aid, both 
>> for users to shake out their app errors and for Swift developers to 
>> diagnose a hang that is a Swift bug.
>>
>> (Jon told me about some ^T command that causes Swift to enter a 
>> Karajan debugging mode? I'd like to learn more about that, and how we 
>> might make it most useful for end users and for diagnostic info 
>> gathering).
>>
>> In case it's of use, I've pasted below our latest Skype txt chat on this 
>> problem, which details what we know and what Jon will try next.
>>
>> Help and guidance on how to proceed would be great!
>>
>> Thanks,
>>
>> - Mike
>>
>> ---
>>
>> [8/26/10 11:05:04 AM] Jonathan Monette: ... I gathered some stuff to 
>> send to devel about the hanging problem just not sure how to word the 
>> problem for Mihael and Justin
>> [8/26/10 1:00:25 PM] Michael Wilde: Hi, sorry, I missed your last 
>> message above
>> [8/26/10 1:00:42 PM] Michael Wilde: "just not sure how to word the 
>> problem for Mihael and Justin" - how can I help on that?
>> [8/26/10 1:01:20 PM] Jonathan Monette: uhhhh......well how do I word 
>> the email to describe my problem?
>> [8/26/10 1:01:21 PM] Michael Wilde: Seems like you were on the right 
>> track yesterday: adding enough traces that the problem can be readily 
>> seen;
>> [8/26/10 1:01:32 PM] Michael Wilde: able to state what seems to be 
>> hanging on what;
>> [8/26/10 1:02:04 PM] Michael Wilde: and running enough times to prove 
>> that it runs OK some times (and what that traces out as) and then 
>> fails to complete other times (and what *that* traces as)
>> [8/26/10 1:03:03 PM] Jonathan Monette: yea but how do I word the email? 
>> just something like "There is a hanging problem in swift.  Jobs do 
>> not submit even though the inputs for the apps are satisfied."
>> [8/26/10 1:03:14 PM] Michael Wilde: Then, what you can/should do, is 
>> try to start "whittling" down the test case so we can catch the 
>> failure in a simple example, that mihael or justin can easily run 
>> with minimal setup, to first make the problem happen, and then test 
>> their fix
>> [8/26/10 1:03:36 PM] Michael Wilde: "There is a hanging problem in 
>> swift.  Jobs do not submit even though the inputs for the apps SEEM 
>> TO BE satisfied."
>> [8/26/10 1:03:58 PM] Michael Wilde: here is the code, here is the 
>> output trace, and here are the logs
>> [8/26/10 1:04:13 PM] Michael Wilde: I have run this 10 times and it 
>> hangs 6 of 10 times
>> [8/26/10 1:04:14 PM] Jonathan Monette: ok.  well my simple test 
>> script completed a couple of times.  I will run it more to see if I 
>> can get it to hang.
>> [8/26/10 1:04:14 PM] Michael Wilde: etc
>> [8/26/10 1:04:31 PM] Michael Wilde: saving the logs into run.nnnn 
>> through run.nnnn+10
>> [8/26/10 1:04:54 PM] Michael Wilde: Right, that's exactly what you 
>> need to do.
>> [8/26/10 1:05:00 PM] Michael Wilde: I usually do this:
>> [8/26/10 1:05:14 PM] Michael Wilde: start with the real code; run N 
>> times, see what the failure ratio is.
>> [8/26/10 1:05:42 PM] Michael Wilde: Then make a complete copy, make 
>> sure it still fails, and start stripping it down to the simplest 
>> program that still fails.
>> [8/26/10 1:05:51 PM] Jonathan Monette: ok.  I'll see if I can get the 
>> test script to fail
>> [8/26/10 1:06:01 PM] Michael Wilde: wherever possible, replace 
>> Montage code with cat/sleep etc
>> [8/26/10 1:06:19 PM] Michael Wilde: So this is and needs to be a 
>> general Swift development method:
>> [8/26/10 1:07:03 PM] Michael Wilde: when we find any error, but 
>> *especially* a race, hang, or similar parallelism-related error, we 
>> need to isolate it to a test case that can be added to the test suite
>> [8/26/10 1:07:23 PM] Michael Wilde: This bug, in particular, Mihael 
>> *thinks* he fixed, yet it "came back".
>> [8/26/10 1:07:40 PM] Michael Wilde: That's what SE folks mean by 
>> "regression testing":
>> [8/26/10 1:08:17 PM] Michael Wilde: create a test that ensures that a 
>> fix is in place; then run that test forever more, to make sure that 
>> bug stays fixed and that nothing similar takes its place ; )
>> [8/26/10 1:08:37 PM] Michael Wilde: This is hard work, no doubt about it
>> [8/26/10 1:09:01 PM] Michael Wilde: But a simple reliable test case 
>> that reproduces a bug is THE most important requirement for fixing 
>> the bug
>> [8/26/10 1:09:30 PM] Michael Wilde: So you need to swap your user hat 
>> for a Swift developer hat, for the moment : )
>> [8/26/10 1:09:30 PM] Jonathan Monette: yea.  I am running several 
>> tests on the stripped down function I have and see if I can reproduce 
>> the error.
>> [8/26/10 1:09:39 PM] Jonathan Monette: alright.
>> [8/26/10 1:09:48 PM] Jonathan Monette: let's see if I can reproduce 
>> the error.
>> [8/26/10 1:09:54 PM] Michael Wilde: Now, for this test to go into the 
>> test suite, it needs to go into a loop
>> [8/26/10 1:10:11 PM] Michael Wilde: Some nasty races only occur 1/100 
>> times, or worse.
>> [8/26/10 1:10:34 PM] Michael Wilde: so we can only trust swift when 
>> we can run the simple example 100,000 times w/o a hang : (
>> [8/26/10 1:11:16 PM] Michael Wilde: Once we have the tests, we can 
>> then run torture tests before releases that give us a good assurance 
>> of having a reliable product
>> [8/26/10 1:11:26 PM] Jonathan Monette: alright.
>> [8/26/10 1:11:42 PM] Michael Wilde: Once more, with enthusiasm! :)
>> [8/26/10 1:12:04 PM] Jonathan Monette: well I am going to try this 
>> test several times with several different input files that increase 
>> in size.
>> [8/26/10 1:12:30 PM] Jonathan Monette: the problem "always" appears 
>> with my larger sets so maybe with the large file it will fail more often
>> [8/26/10 1:12:54 PM] Michael Wilde: I'll echo the above dialog to 
>> Mihael and Justin so they can pipe in with suggestions, OK? Hopefully 
>> to make your life easier and find the problem faster....
>> [8/26/10 1:13:29 PM] Jonathan Monette: alright. thanks.  that will help.
>> [8/26/10 1:14:01 PM] Michael Wilde: One approach is to strip out the 
>> earlier stages and the later stages, and first see if a shorter 
>> script with *just* mDiff and the foreach loop will fail. I think it 
>> should.
>> [8/26/10 1:14:32 PM] Michael Wilde: Then once you have that (ie, a 20 
>> line montage script that fails) you can try to replace Montage with cat
>> [8/26/10 1:14:42 PM] Michael Wilde: because really all it is doing,
>> [8/26/10 1:15:33 PM] Michael Wilde: is a csv mapping (and you can 
>> capture and freeze that) and then a simple foreach loop with just one 
>> real app (mDiff)  which you can replace with a cat of 3 files to 1 
>> file, right?
>>
>>    
>
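
The repeat-and-count procedure from the quoted chat (run the workflow N 
times, note how many runs hang, and keep each run's captured output in 
its own run.nnnn directory) could be sketched roughly as below. This is 
only a minimal sketch and not part of Swift or its test suite: the 
function name run_repeatedly, the timeout values, and the idea of 
treating "did not finish within the timeout" as "hung" are all 
assumptions made for illustration.

```python
import subprocess
import sys
from pathlib import Path


def run_repeatedly(cmd, runs, timeout_s, log_root):
    """Run cmd `runs` times and return the run numbers that timed out.

    Hypothetical helper: a run is counted as hung when it does not
    finish within timeout_s seconds. The captured screen output of
    run i is saved as <log_root>/run.NNNN/swift.out.
    """
    hung = []
    for i in range(1, runs + 1):
        run_dir = Path(log_root) / f"run.{i:04d}"
        run_dir.mkdir(parents=True, exist_ok=True)
        with open(run_dir / "swift.out", "wb") as out:
            try:
                # subprocess.run kills the child if the timeout expires,
                # then raises TimeoutExpired.
                subprocess.run(cmd, stdout=out, stderr=subprocess.STDOUT,
                               timeout=timeout_s)
            except subprocess.TimeoutExpired:
                hung.append(i)
    return hung


if __name__ == "__main__" and len(sys.argv) > 1:
    # The command to repeat is taken from the command line; the run
    # count and timeout below are placeholder values.
    total = 10
    hung_runs = run_repeatedly(sys.argv[1:], runs=total, timeout_s=600,
                               log_root=".")
    print(f"{len(hung_runs)} of {total} runs hung: {hung_runs}")
```

Saved as, say, repeat.py (a made-up name), it would be invoked with the 
command to repeat as its arguments. Only the direct child process is 
killed on timeout, so anything that child has started itself may need 
separate cleanup.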