[Swift-devel] Re: [Swift] Q about throttling
Ioan Raicu
iraicu at cs.uchicago.edu
Fri Jun 22 12:17:35 CDT 2007
Here is an excerpt from an email on 6/19.
> > It completed 10998
> > tasks (8402 tasks with an exit code of 0, and 2596 tasks with an exit
> > code of -1 -- aka failed) in 13399 seconds on 200 processors, this
> > was for the 100 molecule run! The failed tasks were all on the same
> > node over several short time intervals (~30 seconds), and were due to
> > a "Stale NFS file handle", probably due to having 200 processes
> > hitting the shared file system at the same time. Note that all these
> > 2596 failed tasks were restarted by Swift and completed successfully
> > on the resubmission. In the end, everything went through, and the run
> > was successful!
In later runs we noticed the same node acting up, taking on the order of
100 times longer than expected to complete some tasks. I suspect this
node has hardware issues, and we should write to help at tg to report
it.
The failed tasks were eventually retried and succeeded, and the whole
run was successful. The question is: why were the 2596 failed tasks
(which were all independent of each other) not resubmitted faster after
they failed? I would have expected these 2596 retried tasks to fill up
the wait queue.
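
A rough back-of-the-envelope sketch of that expectation, using the 2596 failed tasks and 200 processors from the run above and the roughly 200 second task length mentioned in the quoted message below. The function names are hypothetical, and the model assumes uniform task lengths and no dispatch overhead, so it gives an idealized lower bound rather than what Swift actually does:

```python
import math

# Figures taken from the thread; TASK_SECONDS is the approximate value quoted there.
FAILED_TASKS = 2596   # tasks that exited with -1 and were retried
PROCESSORS = 200      # processors used in the run
TASK_SECONDS = 200    # approximate length of each task, in seconds


def ideal_retry_makespan(tasks: int, processors: int, task_seconds: float) -> float:
    """Wall time to finish all retries if they are all queued at once.

    Idealized assumption: every task takes the same time and a free
    processor picks up the next queued task with no delay.
    """
    waves = math.ceil(tasks / processors)
    return waves * task_seconds


def queued_at_start(tasks: int, processors: int) -> int:
    """Tasks still waiting right after each processor has taken one."""
    return max(tasks - processors, 0)


if __name__ == "__main__":
    print(ideal_retry_makespan(FAILED_TASKS, PROCESSORS, TASK_SECONDS))  # 2600
    print(queued_at_start(FAILED_TASKS, PROCESSORS))                     # 2396
```

Under these assumptions, resubmitting everything at once would leave about 2400 tasks sitting in the wait queue and finish the retries in about 13 waves of 200 seconds each.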
Ioan
Ben Clifford wrote:
>
> On Fri, 22 Jun 2007, Ioan Raicu wrote:
>
>> I believe it could have sent out more. For example, there were 2500+ tasks
>> that failed in the middle of those 6800 tasks (which were all independent),
>> so why were the 2500 tasks not resubmitted all at once? They were each about
>> 200 seconds long, so most of them should certainly have shown up in the wait
>> queue.
>>
>
> what kind of failure?
>
>
--
============================================
Ioan Raicu
Ph.D. Student
============================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
============================================
Email: iraicu at cs.uchicago.edu
Web: http://www.cs.uchicago.edu/~iraicu
http://dsl.cs.uchicago.edu/
============================================