[Swift-devel] Re: Need help on race issue in Swift on Montage code
Zhao Zhang
zhaozhang at uchicago.edu
Thu Aug 26 14:36:20 CDT 2010
Hi, Jon
When I was running mProject in batch style, there was an _area.fits file
alongside every .fits file, and I don't see them in
~jonmon/Workspace/Swift/Montage/m101_j_05x05.
I am not sure if that is a necessary input file, but mProject did demand
the _area.fits file in my case.
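
A quick way to check for the missing files might look like the sketch
below. It assumes each image name.fits should sit next to a companion
named name_area.fits in the same directory; the function name is made up
for illustration.

```python
from pathlib import Path


def missing_area_files(directory):
    """Return names of .fits files in `directory` with no matching _area.fits."""
    root = Path(directory)
    missing = []
    for fits in sorted(root.glob("*.fits")):
        if fits.name.endswith("_area.fits"):
            continue  # this file is itself an area file, not an image
        # Assumed naming convention: foo.fits -> foo_area.fits
        companion = fits.with_name(fits.stem + "_area.fits")
        if not companion.exists():
            missing.append(fits.name)
    return missing
```

Calling missing_area_files() on the directory holding the projected
images returns an empty list when every image has its area file.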
best
zhao
Jonathan Monette wrote:
> If looking at log files or output files will help to give a better
> idea of the problem, you can take a look at
> ~jonmon/Workspace/Swift/Montage/m101_j_05x05. This is my 18-image
> set. In that directory there are two run directories. run.0002 is a
> run that completed the workflow and run.0001 is a run that hung. In
> each of those directories there is a swift.out file that contains the
> output to the screen that was captured.
>
> Also, Mihael, when the hang occurs, I type v for the inhook you set up
> and I get
> Register Futures:
>
> and then nothing. Does this mean there are no listeners set up and
> that is why it hung?
>
> On 08/26/2010 01:48 PM, wilde at mcs.anl.gov wrote:
>> Hi Mihael, Justin,
>>
>> Long email follows - sorry!
>>
>> As I mentioned in passing, Jon is stuck on what looks very much like
>> a race condition in Swift/Karajan thread synchronization.
>> (Testing on trunk)
>>
>> If Jon runs a Montage problem of size "10" it seems to always
>> complete successfully.
>>
>> If he runs a problem of size ~1600, it always hangs.
>>
>> He now has a problem of size 18 that seems to hang a significant
>> percentage of the time (~ 50%???)
>>
>> Jon is now trying to whittle that size-18 failing example down to a
>> simple example you can run yourself to reproduce the problem.
>>
>> He knows pretty well what it is hanging on (see below; Jon is trying
>> to package up a failing test case).
>>
>> The logic is basically:
>>
>> 1. csv map an array of structures from a csv file that describes the
>> output of the earlier stages of Montage processing
>>
>> 2. foreach entry in the array of structures (~ 0..34 in the size-18
>> problem):
>>    a. use simple mapper to map 2 files from the struct
>>    b. run a montage function "mDiff" on these two files plus one
>>       constant hdr file from outside the loop
>>
>> The program hangs on the foreach loop because (I think, if I have
>> this right) *some* of the mapped dependencies don't seem to be getting
>> set. It's not clear to me whether, in the failing case, *all* the
>> mDiff() calls inside the foreach loop are hanging, or *some* of them
>> are. Jon: please provide the details and correct me as needed.
>>
>> Also, we are relying heavily here on tracef("%k") to print the
>> set-state of various variables. If %k is not 100% correct, then all
>> of our assumptions are questionable.
>>
>> I'm also curious to know what tools we have - or could develop - to
>> help find, in general, what is hanging on what, as a debugging aid
>> both for users to shake out their app errors and for Swift developers
>> to diagnose a hang that is a Swift bug.
>>
>> (Jon told me about some ^T command that causes Swift to enter a
>> Karajan debugging mode? I'd like to learn more about that, and how we
>> might make it most useful for end users and for diagnostic info
>> gathering).
>>
>> In case it's of use, I've pasted below our latest Skype text chat on
>> this problem, which details what we know and what Jon will try next.
>>
>> Help and guidance on how to proceed would be great!
>>
>> Thanks,
>>
>> - Mike
>>
>> ---
>>
>> [8/26/10 11:05:04 AM] Jonathan Monette: ... I gathered some stuff to
>> send to devel about the hanging problem just not sure how to word the
>> problem for Mihael and Justin
>> [8/26/10 1:00:25 PM] Michael Wilde: Hi, sorry, I missed your last
>> message above
>> [8/26/10 1:00:42 PM] Michael Wilde: "just not sure how to word the
>> problem for Mihael and Justin" - how can I help on that?
>> [8/26/10 1:01:20 PM] Jonathan Monette: uhhhh......well how do I word
>> the email to describe my problem?
>> [8/26/10 1:01:21 PM] Michael Wilde: Seems like you were on the right
>> track yesterday: adding enough traces that the problem can be readily
>> seen;
>> [8/26/10 1:01:32 PM] Michael Wilde: able to state what seems to be
>> hanging on what;
>> [8/26/10 1:02:04 PM] Michael Wilde: and running enough times to prove
>> that it runs OK some times (and what that traces out as) and then
>> fails to complete other times (and what *that* traces as)
>> [8/26/10 1:03:03 PM] Jonathan Monette: yea but how do I word the email?
>> just something like "There is a hanging problem in swift. Jobs do
>> not submit even though the inputs for the apps are satisfied."
>> [8/26/10 1:03:14 PM] Michael Wilde: Then, what you can/should do, is
>> try to start "whittling" down the test case so we can catch the
>> failure in a simple example, that mihael or justin can easily run
>> with minimal setup, to first make the problem happen, and then test
>> their fix
>> [8/26/10 1:03:36 PM] Michael Wilde: "There is a hanging problem in
>> swift. Jobs do not submit even though the inputs for the apps SEEM
>> TO BE satisfied."
>> [8/26/10 1:03:58 PM] Michael Wilde: here is the code, here is the
>> output trace, and here are the logs
>> [8/26/10 1:04:13 PM] Michael Wilde: I have run this 10 times and it
>> hangs 6 of 10 times
>> [8/26/10 1:04:14 PM] Jonathan Monette: ok. well my simple test
>> script completed a couple of times. I will run it more to see if I
>> can get it to hang.
>> [8/26/10 1:04:14 PM] Michael Wilde: etc
>> [8/26/10 1:04:31 PM] Michael Wilde: saving the logs into run.nnnn
>> through run.nnnn+10
>> [8/26/10 1:04:54 PM] Michael Wilde: Right, that's exactly what you
>> need to do.
>> [8/26/10 1:05:00 PM] Michael Wilde: I usually do this:
>> [8/26/10 1:05:14 PM] Michael Wilde: start with the real code; run N
>> times, see what the failure ratio is.
>> [8/26/10 1:05:42 PM] Michael Wilde: Then make a complete copy, make
>> sure it still fails, and start stripping it down to the simplest
>> program that still fails.
>> [8/26/10 1:05:51 PM] Jonathan Monette: ok. I'll see if I can get the
>> test script to fail
>> [8/26/10 1:06:01 PM] Michael Wilde: wherever possible, replace
>> Montage code with cat/sleep etc
>> [8/26/10 1:06:19 PM] Michael Wilde: So this is and needs to be a
>> general Swift development method:
>> [8/26/10 1:07:03 PM] Michael Wilde: when we find any error, but
>> *especially* a race, hang, or similar parallelism-related error, we
>> need to isolate it to a test case that can be added to the test suite
>> [8/26/10 1:07:23 PM] Michael Wilde: This bug, in particular, Mihael
>> *thinks* he fixed, yet it "came back".
>> [8/26/10 1:07:40 PM] Michael Wilde: That's what SE folks mean by
>> "regression testing":
>> [8/26/10 1:08:17 PM] Michael Wilde: create a test that ensures that a
>> fix is in place; then run that test forever more, to make sure that
>> bug stays fixed and that nothing similar takes its place ; )
>> [8/26/10 1:08:37 PM] Michael Wilde: This is hard work, no doubt about it
>> [8/26/10 1:09:01 PM] Michael Wilde: But a simple reliable test case
>> that reproduces a bug is THE most important requirement for fixing
>> the bug
>> [8/26/10 1:09:30 PM] Michael Wilde: So you need to swap your user hat
>> for a Swift developer hat, for the moment : )
>> [8/26/10 1:09:30 PM] Jonathan Monette: yea. I am running several
>> tests on the stripped-down function I have to see if I can reproduce
>> the error.
>> [8/26/10 1:09:39 PM] Jonathan Monette: alright.
>> [8/26/10 1:09:48 PM] Jonathan Monette: let's see if I can reproduce
>> the error.
>> [8/26/10 1:09:54 PM] Michael Wilde: Now, for this test to go into the
>> test suite, it needs to go into a loop
>> [8/26/10 1:10:11 PM] Michael Wilde: Some nasty races only occur 1/100
>> times, or worse.
>> [8/26/10 1:10:34 PM] Michael Wilde: so we can only trust swift when
>> we can run the simple example 100,000 times w/o a hang : (
>> [8/26/10 1:11:16 PM] Michael Wilde: Once we have the tests, we can
>> then run torture tests before releases that give us a good assurance
>> of having a reliable product
>> [8/26/10 1:11:26 PM] Jonathan Monette: alright.
>> [8/26/10 1:11:42 PM] Michael Wilde: Once more, with enthusiasm! :)
>> [8/26/10 1:12:04 PM] Jonathan Monette: well I am going to try this
>> test several times with several different input files that increase
>> in size.
>> [8/26/10 1:12:30 PM] Jonathan Monette: the problem "always" appears
>> with my larger sets so maybe with the large file it will fail more often
>> [8/26/10 1:12:54 PM] Michael Wilde: I'll echo the above dialog to
>> Mihael and Justin so they can pipe in with suggestions, OK? Hopefully
>> to make your life easier and find the problem faster....
>> [8/26/10 1:13:29 PM] Jonathan Monette: alright. thanks. that will help.
>> [8/26/10 1:14:01 PM] Michael Wilde: One approach is to strip out the
>> earlier stages and the later stages, and first see if a shorter
>> script with *just* mDiff and the foreach loop will fail. I think it
>> should.
>> [8/26/10 1:14:32 PM] Michael Wilde: Then once you have that (ie, a 20
>> line montage script that fails) you can try to replace Montage with cat
>> [8/26/10 1:14:42 PM] Michael Wilde: because really all it is doing,
>> [8/26/10 1:15:33 PM] Michael Wilde: is a csv mapping (and you can
>> capture and freeze that) and then a simple foreach loop with just one
>> real app (mDiff) which you can replace with a cat of 3 files to 1
>> file, right?
>>
>>
>
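
The repeat-and-count approach from the chat above (run the case N times,
keep each run's output under run.nnnn, see how often it hangs) could be
scripted roughly as follows. This is only a sketch: the function name,
the rule that a run exceeding the timeout counts as a hang, and the
example command in the comment are assumptions, not part of Swift.

```python
import subprocess
from pathlib import Path


def run_trials(cmd, trials, timeout_s, log_root="."):
    """Run `cmd` `trials` times and return (completed, failed, hung).

    Output of run i is captured in <log_root>/run.000i/swift.out.
    A run that exceeds `timeout_s` seconds is killed and counted as hung.
    """
    completed = failed = hung = 0
    for i in range(1, trials + 1):
        run_dir = Path(log_root) / f"run.{i:04d}"
        run_dir.mkdir(parents=True, exist_ok=True)
        with open(run_dir / "swift.out", "wb") as out:
            try:
                result = subprocess.run(
                    cmd, stdout=out, stderr=subprocess.STDOUT, timeout=timeout_s
                )
            except subprocess.TimeoutExpired:
                hung += 1  # assumed: no exit within the timeout means a hang
                continue
        if result.returncode == 0:
            completed += 1
        else:
            failed += 1
    return completed, failed, hung


# Hypothetical usage (the script name is a placeholder):
# print(run_trials(["swift", "mdiff_only.swift"], trials=10, timeout_s=600))
```

The timeout has to be comfortably longer than a normal successful run,
otherwise slow runs get miscounted as hangs. subprocess.run kills only
the process it started, so anything that process spawned may need to be
cleaned up separately.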