[Swift-devel] 244 MolDyn run was successful!
Ioan Raicu
iraicu at cs.uchicago.edu
Sun Aug 12 00:22:04 CDT 2007
Hi,
Here is a quick recap of the 244 MolDyn run we made this weekend...
I have posted the logs and graphs at:
http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-success-8-10-07/
11079 tasks failed with an exit code of -1.
2 tasks failed with an exit code of 127.
Inspecting the logs revealed the infamous stale NFS handle error!
A single machine (192.5.198.37) accounted for all the failed tasks (11081 tasks);
the machine was not completely broken, as it did complete 4 tasks
successfully, although their completion times were considerably higher
than on the other machines.
20836 tasks finished with an exit code of 0.
I was expecting 20497 tasks broken down as follows:
   1 x   1 =     1
   1 x 244 =   244
   1 x 244 =   244
  68 x 244 = 16592
   1 x 244 =   244
  11 x 244 =  2684
   1 x 244 =   244
   1 x 244 =   244
  -----------------
  total       20497
I do not know why there were 339 more tasks than we were expecting.
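As a quick sanity check of that arithmetic (just the numbers from the table above, nothing taken from the actual logs), here is a minimal Python sketch:

# Sanity check of the expected task count; the (count, width) pairs are
# copied from the breakdown above, and 20836 is the observed success count.
stages = [(1, 1), (1, 244), (1, 244), (68, 244),
          (1, 244), (11, 244), (1, 244), (1, 244)]
expected = sum(count * width for count, width in stages)
observed = 20836
print(expected, observed, observed - expected)   # 20497 20836 339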
A close look at the summary graph
(http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-success-8-10-07/summary_graph-med.jpg)
shows that after the large number of failed tasks, the queue length
(blue line) quickly dropped to 0 and then stayed there, with Swift
trickling in only about 100 tasks at a time. For the rest of the
experiment, only about 100 tasks were ever running at once. This is
not the first time we have seen this, and it seems to show up only
when a bad machine fails many tasks; after that, Swift does not
resubmit them quickly, and the jobs only trickle in, never keeping
all the processors busy.
In runs with no bad nodes and no large bursts of failures, this did
not happen, and Swift essentially submitted all independent tasks to
Falkon. I know there is a heuristic within Karajan that is probably
limiting the task submit rate after the large number of failures, but
I think it needs to be tuned to recover from such failure bursts so
that, in time, it again attempts to send more tasks. A good analogy
is TCP congestion control: imagine the window size growing larger and
larger, then a large number of packets being lost and TCP collapsing
its window, but never recovering from the collapse and keeping a
small window for the rest of the connection, even though it could
grow the window again until the next round of lost packets... I
believe the normal behavior should allow Swift to recover and again
submit many tasks to Falkon. If this heuristic cannot easily be
tweaked or made to recover from the "window collapse", could we
disable it when we are running on Falkon at a single site?
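To make the TCP analogy concrete, here is a minimal Python sketch of an AIMD-style (additive-increase, multiplicative-decrease) throttle that collapses on failures but grows back on successes. This is purely illustrative and is not the actual Karajan scheduler code; the class name, constants, and callbacks are my own assumptions.

# Illustrative only: an AIMD-style submit throttle, in the spirit of TCP
# congestion control. Not Karajan's real heuristic.
class AdaptiveThrottle:
    def __init__(self, max_outstanding=1024, min_outstanding=16):
        self.max_outstanding = max_outstanding
        self.min_outstanding = min_outstanding
        self.window = min_outstanding   # cap on concurrently submitted tasks

    def on_success(self):
        # Additive increase: slowly grow the window again after a collapse,
        # so one bad node does not permanently cap the submit rate.
        self.window = min(self.max_outstanding, self.window + 1)

    def on_failure(self):
        # Multiplicative decrease: back off quickly when tasks start failing.
        self.window = max(self.min_outstanding, self.window // 2)

With something like this, a burst of failures from one bad node would halve the window a few times, but every completed task would nudge it back up, so the run would eventually return to full parallelism instead of trickling ~100 tasks at a time until the end.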
BTW, here are the graphs from a previous run in which only the last
few jobs did not finish, due to a bug in the application code:
http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/
In that run, notice that there were no bad nodes causing many tasks
to fail, and Swift submitted many tasks to Falkon and managed to keep
all the processors busy!
I think we can call the 244-mol MolDyn runs a success: both the
current one and the previous one from 7-16-07 that almost finished!
We need to figure out how to control the job throttling better, and
perhaps how to automatically detect this plaguing "stale NFS handle"
problem and contain the damage to significantly fewer task failures.
I also think that increasing the number of retries on Swift's end
should be considered when running over Falkon. Note that a single
worker can fail as many as 1000 tasks per minute, which adds up
quickly given that when the stale NFS handle shows up, it sticks
around for tens of seconds to minutes at a time.
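One way to contain the damage would be to blacklist a worker whose failure rate is clearly abnormal. Here is a rough Python sketch; the threshold, window, and blacklist mechanism are all assumptions on my part, not anything Swift or Falkon does today:

from collections import defaultdict, deque
import time

# A healthy worker should never fail this many tasks this quickly; a node
# with a stale NFS handle can fail on the order of 1000 tasks per minute.
FAILURE_THRESHOLD = 50      # failures within the window before blacklisting
WINDOW_SECONDS = 60         # sliding-window length in seconds

recent_failures = defaultdict(deque)   # host -> timestamps of recent failures
blacklisted = set()

def record_failure(host, now=None):
    """Record one failed task on `host`; blacklist it if failures pile up."""
    now = time.time() if now is None else now
    q = recent_failures[host]
    q.append(now)
    # Drop failures that have fallen out of the sliding window.
    while q and now - q[0] > WINDOW_SECONDS:
        q.popleft()
    if len(q) >= FAILURE_THRESHOLD and host not in blacklisted:
        blacklisted.add(host)
        print("blacklisting %s: %d failures in %ds" % (host, len(q), WINDOW_SECONDS))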
BTW, the run we just made consumed about 1556.9 CPU hours (937.7 used
and 619.2 wasted) in 8.5 hours. In contrast, the 7-16-07 run, which
almost finished and behaved much better since there were no node
failures, consumed about 866.4 CPU hours (866.3 used and 0.1 wasted)
in 4.18 hours.
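For what it's worth, here is the quick arithmetic behind those numbers (the CPU-hour and makespan figures are from above; the efficiency and average-CPU values are my own back-of-the-envelope derivations):

# CPU-hour figures from the message; efficiency and average busy CPUs derived.
runs = {
    "8-10-07": {"used": 937.7, "wasted": 619.2, "makespan_h": 8.5},
    "7-16-07": {"used": 866.3, "wasted": 0.1,   "makespan_h": 4.18},
}

for name, r in runs.items():
    total = r["used"] + r["wasted"]        # total CPU hours consumed
    efficiency = r["used"] / total         # fraction spent on useful work
    avg_cpus = total / r["makespan_h"]     # average processors kept busy
    print("%s: %.1f CPU-h, %.1f%% useful, ~%d CPUs busy on average"
          % (name, total, 100 * efficiency, avg_cpus))

So roughly 60% of the CPU time went to useful work in this run, versus essentially 100% in the 7-16-07 run.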
When Nika comes back from vacation, we can try the real application,
which should consume some 16K CPU hours (service units)! She also has
her own temporary allocation at ANL/UC now, so we can use that!
Ioan
Ioan Raicu wrote:
> I think the workflow finally completed successfully, but there are
> still some oddities in the way the logs look (especially job
> throttling, a few hundred more jobs than I was expecting, etc). At
> least, we have all the output we needed for every molecule!
>
> I'll write up a summary of what happened, and draw up some nice
> graphs, and send it out later today.
>
> Ioan
>
> iraicu at viper:/home/nefedova/alamines> ls fe_* | wc
> 488 488 6832
>