[Swift-devel] jobs that go active forever, and their effect on multisite osg runs

Mats Rynge rynge at renci.org
Mon Dec 15 11:52:13 CST 2008


Ben Clifford wrote:
> During my experimentation last week with pointing Swift at the OSG Engage VO, 
> I repeatedly ran into a problem where jobs aimed at a particular small 
> subset of sites would go into the Active state and then never (for some 
> multiple-hours value of never) be reported as Completed or Failed.
> 
> This was the only site-misbehaviour problem that I encountered which 
> caused Swift runs to not complete and required manual intervention to 
> remove those sites before a run. Other site problems were dealt with by 
> various mechanisms already implemented in Swift (site scoring, 
> replication).
> 
> I'm desirous, then, of some way to get round this problem.
> 
> One approach we discussed previously was making maxwalltime enforced at 
> the client side.
> 

I think you have the same problem in other states of the job cycle. My
last run got stuck at:

Progress:  Selecting site:2 Stage in:1 Finished successfully:268
Initializing site shared directory:1

Log file:
http://www.renci.org/~rynge/swift/logs/osg-20081215-1117-eedjvp3c.log

There is a stack trace at the beginning of the log which may or may not
have anything to do with the stuck jobs.

When we use OSG MatchMaker, we have timeouts for all job states, and
that seems to work well.
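
As a rough sketch of what per-state timeouts could look like on the
client side: this is plain Python, not Swift or MatchMaker code. The
class and function names are hypothetical, the state names are taken
from the progress line above, and the limits are placeholder values.

```python
import time

# Hypothetical per-state limits in seconds. State names follow the Swift
# progress line; the numbers are placeholders, not tuned values.
STATE_TIMEOUTS = {
    "Selecting site": 600,
    "Initializing site shared directory": 600,
    "Stage in": 1800,
    "Active": 4 * 3600,
}


class TrackedJob:
    """Remembers a job's current state and when it entered that state."""

    def __init__(self, job_id, state, now=None):
        self.job_id = job_id
        self.state = state
        self.entered = time.monotonic() if now is None else now

    def set_state(self, state, now=None):
        # Every transition restarts the clock for the new state.
        self.state = state
        self.entered = time.monotonic() if now is None else now


def find_stuck(jobs, now=None, timeouts=STATE_TIMEOUTS):
    """Return jobs that have exceeded the limit for their current state.

    States without an entry in `timeouts` (e.g. finished jobs) are
    never reported.
    """
    now = time.monotonic() if now is None else now
    stuck = []
    for job in jobs:
        limit = timeouts.get(job.state)
        if limit is not None and now - job.entered > limit:
            stuck.append(job)
    return stuck
```

A scheduler loop would call find_stuck() periodically and treat each
returned job as failed, so that the existing site scoring and
replication can take over.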

-- 
Mats Rynge
Renaissance Computing Institute <http://www.renci.org>
