[Swift-devel] Re: 244 MolDyn run was successful!

Mihael Hategan hategan at mcs.anl.gov
Sun Aug 12 23:21:31 CDT 2007


> > 
> > Your analogy is incorrect. In this case the score is kept low because
> > jobs keep on failing, even after the throttling kicks in.
> >   
> I would argue against your theory.... the last failed job (#12794)
> failed at 3954 seconds into the experiment, yet the last job (#31917)
> ended at 30600 seconds.

Strange. I've seen jobs failing all throughout the run. Did something
make Falkon stop sending jobs to that broken node? Statistically, every
job would have a 1/total_workers chance of going to the one place where
it shouldn't. Higher if some "good" workers are busy doing actual work.
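
As a rough back-of-the-envelope (the numbers below are made up for
illustration, not taken from this run): with uniform dispatch to idle
workers and a single bad node, the expected number of jobs landing on
that node grows linearly with the number of jobs dispatched.

total_workers = 200        # hypothetical pool size
jobs_submitted = 20000     # hypothetical number of jobs dispatched

# Each job hits the bad worker with probability ~1/total_workers, so:
expected_bad_hits = jobs_submitted / total_workers
print(expected_bad_hits)   # 100.0 -- more if the "good" workers are kept busy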

>   There were no failed jobs in the last 26K+ seconds with 19K+ jobs.
> Now my question is again, why would the score not improve at all

Quantify "at all". Point is it may take quite a few jobs to make up for
~10000 failed ones. The ratio is 1/5 (i.e. it takes 5 successful jobs to
make up for a failed one). Should the probability of jobs failing be
less than 1/5 in a certain time window, the score should increase.
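
A minimal sketch of how such a score behaves, assuming it simply adds a
fixed amount per success and subtracts five times that per failure (an
assumption for illustration; the real Swift scheduler formula may
differ):

SUCCESS_DELTA = 1.0
FAILURE_DELTA = -5.0          # one failure cancels five successes

def update_score(score, succeeded):
    return score + (SUCCESS_DELTA if succeeded else FAILURE_DELTA)

score = 0.0
for _ in range(10000):        # ~10000 failed jobs
    score = update_score(score, succeeded=False)
for _ in range(19000):        # the 19K+ successful jobs that followed
    score = update_score(score, succeeded=True)
print(score)                  # -31000.0: still deep in the hole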

In a setup where jobs are sent to non-busy workers, the system will tend
to produce lots of failed jobs whenever a bad worker fails a job in much
less time than a normal job takes to run. This *IS* why the Swift
scheduler throttles in the beginning: to avoid sending a large number of
jobs to a resource that is broken.
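
For what it's worth, here is a sketch (not the actual Swift code; the
mapping and constants below are invented) of how a score-driven throttle
keeps the job window small for a resource that keeps failing jobs:

def allowed_concurrency(score, base=2, per_point=0.1, cap=256):
    # The concurrent-job window grows with the score, clamped to [1, cap].
    return min(cap, max(1, int(base + per_point * score)))

print(allowed_concurrency(0.0))       # 2  -- cautious start
print(allowed_concurrency(500.0))     # 52 -- window opens as successes accrue
print(allowed_concurrency(-31000.0))  # 1  -- stays clamped after mass failures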

Mihael

>  over such a large span of time and jobs, as the throttling seems to be
> relatively constant throughout the experiment (after the failed jobs).
> 
> Ioan
> > Mihael
> > 
> >   
> > > I believe the normal behavior should allow Swift to recover and again
> > > submit many tasks to Falkon.  If this heuristic cannot be easily
> > > tweaked or made to recover from the "window collapse", could we
> > > disable it when we are running on Falkon at a single site?
> > > 
> > > BTW, here are the graphs from a previous run in which only the last
> > > few jobs didn't finish, due to a bug in the application code:
> > > http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/
> > > In that run, notice that there were no bad nodes causing many tasks
> > > to fail; Swift submitted many tasks to Falkon and managed to keep
> > > all processors busy!
> > > 
> > > I think we can call the 244-mol MolDyn run a success, both the current
> > > run and the previous run from 7-16-07 that almost finished!
> > > 
> > > We need to figure out how to control the job throttling better,
> > > perhaps how to automatically detect this recurring "Stale NFS
> > > handle" problem, and possibly contain the damage to significantly
> > > fewer task failures.  I also think that increasing the # of retries
> > > on Swift's end should be considered when running over Falkon.
> > > Notice that a single worker can fail as many as 1000 tasks per
> > > minute, which is a lot of tasks given that when the stale NFS handle
> > > shows up, it sticks around for tens of seconds to minutes at a time.
> > > 
> > > BTW, the run we just made consumed about 1556.9 CPU hours (937.7
> > > used and 619.2 wasted) in 8.5 hours.  In contrast, the run we made
> > > on 7-16-07, which almost finished and behaved much better since
> > > there were no node failures, consumed about 866.4 CPU hours (866.3
> > > used and 0.1 wasted) in 4.18 hours.
> > > 
> > > When Nika comes back from vacation, we can try the real application,
> > > which should consume some 16K CPU hours (service units)!   She also
> > > has her own temporary allocation at ANL/UC now, so we can use that!
> > > 
> > > Ioan
> > > 
> > > Ioan Raicu wrote: 
> > >     
> > > > I think the workflow finally completed successfully, but there are
> > > > still some oddities in the way the logs look (especially the job
> > > > throttling, a few hundred more jobs than I was expecting, etc.).
> > > > At least we have all the output we needed for every molecule!
> > > > 
> > > > I'll write up a summary of what happened, and draw up some nice
> > > > graphs, and send it out later today.
> > > > 
> > > > Ioan
> > > > 
> > > > iraicu at viper:/home/nefedova/alamines> ls fe_* | wc
> > > >     488     488    6832
> > > > 
> > > >       
> > 
> > 
> >   



