[Swift-devel] Re: scheduler changes to deal with fast-failing sites

lixi at uchicago.edu lixi at uchicago.edu
Wed Jun 25 10:44:10 CDT 2008


>I'd be interested if anyone (especially Xi) tries this in a real life
>multi-site situation.

Unfortunately, the workflow with 501 jobs failed just now with:
"No status file was found. Check the shared filesystem on CIT_CMS_T2"

In fact, this is the most frequent error I have encountered so far, and I
have been thinking about how to avoid it for a long time. I checked the
remote directory with the df command, and tried making directories,
transferring files, etc. All of these operations succeed when done outside
of Swift. So I still wonder how to avoid the error, or whether we could
adapt Swift to sites such as CIT_CMS_T2, MIT_CMS, and so on.

>The scoring of well-performing sites is basically the same. Instead of a
>base of 2 jobs, with more being added according to tscore * jobThrottle,
>instead a base of 1 job is used. This should not cause much change in
>behaviour for well-performing sites.
>
>However, the score can now go below 1 for poorly performing sites. In that
>case, a delay is enforced between submissions to a particular site. The
>length of that delay increases exponentially as the site score decreases.
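
Read literally, the description above amounts to something like the
following sketch. It is written in Python for illustration only and is not
the actual scheduler code: the function names, the rounding, the base_delay
parameter and the exact exponential form are all assumptions, since the
message only says that the base is 1 job and that the delay grows
exponentially as the score drops below 1.

```python
import math


def allowed_jobs(tscore: float, job_throttle: float) -> int:
    """Base of 1 job, with more added according to tscore * jobThrottle.

    Rounding down and clamping at zero are assumptions of this sketch.
    """
    return 1 + max(0, math.floor(tscore * job_throttle))


def submission_delay(score: float, base_delay: float = 1.0) -> float:
    """Delay (in seconds) enforced between submissions to one site.

    No delay while the score is at least 1. Below 1 the delay grows
    exponentially as the score decreases. The formula
    base_delay * (e^(1 - score) - 1) is hypothetical; it was chosen only
    because it is zero at score == 1 and grows exponentially below that.
    """
    if score >= 1.0:
        return 0.0
    return base_delay * math.expm1(1.0 - score)
```

With these assumptions a site whose score stays at or above 1 is never
delayed, while each further drop in score below 1 lengthens the wait
between submissions to that site.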

In addition to such improvements, and to filtering out some sites and
giving initial scores, which I have already done, I have been thinking
about other methods these days. At present Swift relies only on "scores" to
judge the performance of sites, and these are in turn the only metric for
site selection. Could we define different states for sites, for example
candidate, frozen, etc.? "Candidate" just means that the site may be
selected based on its score/Tscore. If a site fails, we could designate it
as "frozen", at least for the current job, so that it does not eat up more
retries. A frozen site could be unfrozen once some condition is satisfied,
such as a certain amount of time having passed, and then be used for other
new jobs. Of course, these are just some simple ideas I am thinking about
now. I am going to work out a more detailed and feasible process. Any
suggestions are warmly welcome.
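
To make the candidate/frozen idea a little more concrete, here is a rough
sketch in Python. It is not Swift code: SiteState, Site, SiteSelector, the
freeze_seconds parameter and the "pick the highest score" rule are all
hypothetical names and choices made for illustration, and time-based
unfreezing is only one of the possible unfreeze conditions.

```python
import time
from dataclasses import dataclass
from enum import Enum


class SiteState(Enum):
    CANDIDATE = "candidate"  # may be selected according to its score
    FROZEN = "frozen"        # temporarily excluded after a failure


@dataclass
class Site:
    name: str
    score: float = 1.0
    state: SiteState = SiteState.CANDIDATE
    frozen_until: float = 0.0


class SiteSelector:
    def __init__(self, sites, freeze_seconds=600.0, clock=time.monotonic):
        self.sites = {s.name: s for s in sites}
        self.freeze_seconds = freeze_seconds  # assumed unfreeze condition
        self.clock = clock

    def report_failure(self, name):
        """Freeze a site so that further retries are not spent on it."""
        site = self.sites[name]
        site.state = SiteState.FROZEN
        site.frozen_until = self.clock() + self.freeze_seconds

    def _thaw_expired(self):
        """Return frozen sites to candidate once their freeze has expired."""
        now = self.clock()
        for site in self.sites.values():
            if site.state is SiteState.FROZEN and now >= site.frozen_until:
                site.state = SiteState.CANDIDATE

    def select(self):
        """Pick the best-scoring candidate site, or None if all are frozen."""
        self._thaw_expired()
        candidates = [s for s in self.sites.values()
                      if s.state is SiteState.CANDIDATE]
        if not candidates:
            return None
        return max(candidates, key=lambda s: s.score)
```

For example, after report_failure("CIT_CMS_T2") the selector skips that
site for freeze_seconds and then considers it again for new jobs, while its
score is left untouched.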

Thanks,

Xi 


