[Swift-devel] Re: Need help on race issue in Swift on Montage code

Thu Aug 26 14:38:08 CDT 2010

No.  That file is not necessary since there is a -n option that can be 
passed to several of the functions that simply tells that function to 
ignore those area files.  Those area files are just weights that are 
applied to the projected image.
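
For anyone who wants to check a directory for this, here is a minimal sketch. It assumes the naming pattern Zhao describes (each X.fits image paired with an X_area.fits file); the helper name missing_area_files is made up for illustration and is not part of Montage or Swift.

```python
from pathlib import Path


def missing_area_files(directory):
    """Return sorted names of .fits images that have no matching _area.fits file.

    Assumes the pairing convention X.fits <-> X_area.fits.
    """
    d = Path(directory)
    # Skip the area files themselves; only projected images are of interest.
    images = [p for p in d.glob("*.fits") if not p.name.endswith("_area.fits")]
    return sorted(
        p.name for p in images if not (d / (p.stem + "_area.fits")).exists()
    )
```

An empty list means every image has its area file; a non-empty list names the images that would have to be processed with the area files ignored.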

On 08/26/2010 02:36 PM, Zhao Zhang wrote:
> Hi, Jon
>
> When I was running mProject in a batch style, there was an _area.fits 
> file with every .fits file, and I didn't see them in 
> jonmon/Workspace/Swift/Montage/m101_j_05x05.
> I am not sure if that is a necessary input file, but mProject did 
> demand the _area.fits file in my case.
>
> best
> zhao
>
> Jonathan Monette wrote:
>> If looking at log files or output files will help to give a better 
>> idea of the problem, you can take a look at 
>> ~jonmon/Workspace/Swift/Montage/m101_j_05x05.  This is my 18 image 
>> set.  In that directory there are two run directories.  run.0002 is a 
>> run that completed the workflow and run.0001 is a run that hung.  In 
>> each of those directories there is a swift.out file that contains the 
>> output to the screen that was captured.
>>
>> Also, Mihael, when the hang occurs, I type v for the inhook you set up 
>> and I get
>> Register Futures:
>>
>> and then nothing.  Does this mean there are no listeners set up and 
>> that is why it hung?
>>
>> On 08/26/2010 01:48 PM, wilde at mcs.anl.gov wrote:
>>> Hi Mihael, Justin,
>>>
>>> Long email follows - sorry!
>>>
>>> As I mentioned in passing, Jon is stuck on what looks very much like 
>>> a race condition in Swift/Karajan thread synchronization. 
>>> (Testing on trunk)
>>>
>>> If Jon runs a Montage problem of size "10" it seems to always 
>>> complete successfully.
>>>
>>> If he runs a problem of size ~1600, it always hangs.
>>>
>>> He now has a problem of size 18 that seems to hang a significant 
>>> percentage of the time (~ 50%???)
>>>
>>> Jon is now trying to whittle that size-18 failing example down to a 
>>> simple example you can run yourself to reproduce the problem.
>>>
>>> He knows pretty well what it is hanging on (see below; Jon is trying 
>>> to package up a failing test case).
>>>
>>> The logic is basically:
>>>
>>> 1. csv map an array of structures from a csv file that describes the 
>>> output of the earlier stages of Montage processing
>>>
>>> 2. foreach entry in the array of structures (~ 0..34 in the size-18 
>>> problem):
>>>    a. use simple mapper to map 2 files from the struct
>>>    b. run a montage function "mDiff" on these two files plus one 
>>> constant hdr file from outside the loop
>>>
>>> The program hangs on the foreach loop because (I think, if I have 
>>> this right) *some* of the mapped dependencies don't seem to be 
>>> getting set. It's not clear to me whether, in the failing case, *all* 
>>> the mDiff() calls inside the foreach loop are hanging, or *some* of 
>>> them are.  Jon: please provide the details and correct me as needed.
>>>
>>> Also, we are relying heavily here on tracef("%k") to print the 
>>> set-state of various variables. If %k is not 100% correct, then all 
>>> of our assumptions are questionable.
>>>
>>> I'm also curious to know what tools we have - or could develop - to 
>>> help find what is hanging on what, as a general debugging aid, 
>>> both for users to shake out their app errors and for Swift 
>>> developers to diagnose a hang that is a Swift bug.
>>>
>>> (Jon told me about some ^T command that causes Swift to enter a 
>>> Karajan debugging mode? I'd like to learn more about that, and how 
>>> we might make it most useful for end users and for diagnostic info 
>>> gathering).
>>>
>>> In case it's of use, I've pasted below our latest Skype text chat on 
>>> this problem, which details what we know and what Jon will try next.
>>>
>>> Help and guidance on how to proceed would be great!
>>>
>>> Thanks,
>>>
>>> - Mike
>>>
>>> ---
>>>
>>> [8/26/10 11:05:04 AM] Jonathan Monette: ... I gathered some stuff to 
>>> send to devel about the hanging problem just not sure how to word 
>>> the problem for Mihael and Justin
>>> [8/26/10 1:00:25 PM] Michael Wilde: Hi, sorry, I missed your last 
>>> message above
>>> [8/26/10 1:00:42 PM] Michael Wilde: "just not sure how to word the 
>>> problem for Mihael and Justin" - how can I help on that?
>>> [8/26/10 1:01:20 PM] Jonathan Monette: uhhhh......well how do I word 
>>> the email to describe my problem?
>>> [8/26/10 1:01:21 PM] Michael Wilde: Seems like you were on the right 
>>> track yesterday: adding enough traces that the problem can be 
>>> readily seen;
>>> [8/26/10 1:01:32 PM] Michael Wilde: able to state what seems to be 
>>> hanging on what;
>>> [8/26/10 1:02:04 PM] Michael Wilde: and running enough times to 
>>> prove that it runs OK some times (and what that traces out as) and 
>>> then fails to complete other times (and what *that* traces as)
>>> [8/26/10 1:03:03 PM] Jonathan Monette: yea but how do I word the 
>>> email? just something like "There is a hanging problem in swift.  
>>> Jobs do not submit even though the inputs for the apps are satisfied."
>>> [8/26/10 1:03:14 PM] Michael Wilde: Then, what you can/should do, is 
>>> try to start "whittling" down the test case so we can catch the 
>>> failure in a simple example, that mihael or justin can easily run 
>>> with minimal setup, to first make the problem happen, and then test 
>>> their fix
>>> [8/26/10 1:03:36 PM] Michael Wilde: "There is a hanging problem in 
>>> swift.  Jobs do not submit even though the inputs for the apps SEEM 
>>> TO BE satisfied."
>>> [8/26/10 1:03:58 PM] Michael Wilde: here is the code, here is the 
>>> output trace, and here are the logs
>>> [8/26/10 1:04:13 PM] Michael Wilde: I have run this 10 times and it 
>>> hangs 6 of 10 times
>>> [8/26/10 1:04:14 PM] Jonathan Monette: ok.  well my simple test 
>>> script completed a couple of times.  I will run it more to see if I 
>>> can get it to hang.
>>> [8/26/10 1:04:14 PM] Michael Wilde: etc
>>> [8/26/10 1:04:31 PM] Michael Wilde: saving the logs into run.nnnn 
>>> through run.nnnn+10
>>> [8/26/10 1:04:54 PM] Michael Wilde: Right, that's exactly what you 
>>> need to do.
>>> [8/26/10 1:05:00 PM] Michael Wilde: I usually do this:
>>> [8/26/10 1:05:14 PM] Michael Wilde: start with the real code; run N 
>>> times, see what the failure ratio is.
>>> [8/26/10 1:05:42 PM] Michael Wilde: Then make a complete copy, make 
>>> sure it still fails, and start stripping it down to the simplest 
>>> program that still fails.
>>> [8/26/10 1:05:51 PM] Jonathan Monette: ok.  I'll see if I can get the 
>>> test script to fail
>>> [8/26/10 1:06:01 PM] Michael Wilde: wherever possible, replace 
>>> Montage code with cat/sleep etc
>>> [8/26/10 1:06:19 PM] Michael Wilde: So this is and needs to be a 
>>> general Swift development method:
>>> [8/26/10 1:07:03 PM] Michael Wilde: when we find any error, but 
>>> *especially* a race, hang, or similar parallelism-related error, we 
>>> need to isolate it to a test case that can be added to the test suite
>>> [8/26/10 1:07:23 PM] Michael Wilde: This bug, in particular, Mihael 
>>> *thinks* he fixed, yet it "came back".
>>> [8/26/10 1:07:40 PM] Michael Wilde: That's what SE folks mean by 
>>> "regression testing":
>>> [8/26/10 1:08:17 PM] Michael Wilde: create a test that ensures that 
>>> a fix is in place; then run that test forever more, to make sure 
>>> that bug stays fixed and that nothing similar takes its place ; )
>>> [8/26/10 1:08:37 PM] Michael Wilde: This is hard work, no doubt 
>>> about it
>>> [8/26/10 1:09:01 PM] Michael Wilde: But a simple reliable test case 
>>> that reproduces a bug is THE most important requirement for fixing 
>>> the bug
>>> [8/26/10 1:09:30 PM] Michael Wilde: So you need to swap your user 
>>> hat for a Swift developer hat, for the moment : )
>>> [8/26/10 1:09:30 PM] Jonathan Monette: yea.  I am running several 
>>> tests on the stripped down function I have to see if I can 
>>> reproduce the error.
>>> [8/26/10 1:09:39 PM] Jonathan Monette: alright.
>>> [8/26/10 1:09:48 PM] Jonathan Monette: let's see if I can reproduce 
>>> the error.
>>> [8/26/10 1:09:54 PM] Michael Wilde: Now, for this test to go into 
>>> the test suite, it needs to go into a loop
>>> [8/26/10 1:10:11 PM] Michael Wilde: Some nasty races only occur 
>>> 1/100 times, or worse.
>>> [8/26/10 1:10:34 PM] Michael Wilde: so we can only trust swift when 
>>> we can run the simple example 100,000 times w/o a hang : (
>>> [8/26/10 1:11:16 PM] Michael Wilde: Once we have the tests, we can 
>>> then run torture tests before releases that give us a good assurance 
>>> of having a reliable product
>>> [8/26/10 1:11:26 PM] Jonathan Monette: alright.
>>> [8/26/10 1:11:42 PM] Michael Wilde: Once more, with enthusiasm! :)
>>> [8/26/10 1:12:04 PM] Jonathan Monette: well I am going to try this 
>>> test several times with several different input files that increase 
>>> in size.
>>> [8/26/10 1:12:30 PM] Jonathan Monette: the problem "always" appears 
>>> with my larger sets so maybe with the large file it will fail more 
>>> often
>>> [8/26/10 1:12:54 PM] Michael Wilde: I'll echo the above dialog to 
>>> Mihael and Justin so they can pipe in with suggestions, OK? 
>>> Hopefully to make your life easier and find the problem faster....
>>> [8/26/10 1:13:29 PM] Jonathan Monette: alright. thanks.  that will 
>>> help.
>>> [8/26/10 1:14:01 PM] Michael Wilde: One approach is to strip out the 
>>> earlier stages and the later stages, and first see if a shorter 
>>> script with *just* mDiff and the foreach loop will fail. I think it 
>>> should.
>>> [8/26/10 1:14:32 PM] Michael Wilde: Then once you have that (ie, a 
>>> 20 line montage script that fails) you can try to replace Montage 
>>> with cat
>>> [8/26/10 1:14:42 PM] Michael Wilde: because really all it is doing,
>>> [8/26/10 1:15:33 PM] Michael Wilde: is a csv mapping (and you can 
>>> capture and freeze that) and then a simple foreach loop with just 
>>> one real app (mDiff)  which you can replace with a cat of 3 files 
>>> to 1 file, right?
>>>
>>
>
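
The "run it N times and see what the failure ratio is" step from the chat above could be sketched roughly like this. It is only an illustration under stated assumptions: a run counts as hung when it outlives a timeout you pick, and the names run_once and hang_ratio are hypothetical helpers, not part of Swift.

```python
import subprocess


def run_once(cmd, timeout_s):
    """Run cmd once and classify it as 'ok', 'failed', or 'hung'.

    'hung' only means the run exceeded timeout_s; the timeout is a guess
    the caller supplies, not something Swift reports.
    """
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,  # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return "hung"
    return "ok" if proc.returncode == 0 else "failed"


def hang_ratio(cmd, runs, timeout_s):
    """Repeat cmd `runs` times and count how each run ended."""
    counts = {"ok": 0, "failed": 0, "hung": 0}
    for _ in range(runs):
        counts[run_once(cmd, timeout_s)] += 1
    return counts
```

For example, hang_ratio(["swift", "test.swift"], runs=10, timeout_s=600) would give the counts behind a statement like "it hangs 6 of 10 times"; test.swift and the 600 second limit are placeholders. The timeout kills only the process that was started directly, so any processes it spawned may need to be cleaned up by hand.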

-- 
Jon

Computers are incredibly fast, accurate, and stupid. Human beings are incredibly slow, inaccurate, and brilliant. Together they are powerful beyond imagination.
- Albert Einstein