[Swift-devel] Status of coasters

Michael Wilde wilde at mcs.anl.gov
Fri Feb 13 08:59:55 CST 2009


Here's my understanding of status, issues and needs on coasters.

Some side discussion with Mihael on various coaster issues is summarized 
here as well; clarifications welcome.

Work in progress:

- Mihael has a good handle on the bootstrap issues and is working on 
improvements. This is not working in trunk at the moment, but will likely 
be fixed soon. We think this will fix known issues in: command line length 
for Condor; spaces, quotes, newlines and other offending argument 
issues; location of Java and tools (wget/curl and md5sum).

- Still to do on the above: a sites.xml attribute to explicitly specify 
the location of tools, or at least of Java.

- Ben has a patch, still to be integrated, to run the coaster service on 
a worker node. Question: this is only usable when workers have sufficient 
IP access, correct?

- The scalability problem submitting to GT2 GRAM sites still exists. 
Potential solutions are:

-- Service submits workers via PBS (using jobmanager=gt2:pbs). Valid only 
on PBS sites. Not yet tested.

-- Service submits workers via Condor-G (using jobmanager=gt2:condor). 
Mihael feels this requires a new Condor provider, the one in the current 
code base being insufficient and untested (really more of a prototype 
developed by a student).

-- Service submits via WS-GRAM, using 
jobmanager=gt2:gt4:{pbs/condor/sge}. This needs to be tested on sites 
where WS-GRAM is working.
For sites where WS-GRAM is not functional, I suggested we consider 
configuring our own non-root WS-GRAM, ideally using already-installed 
GT4 software, e.g., from the OSG package on OSG and TG sites where it's 
installed. Mihael thought this would be considerable work. I agree, but 
it might be a stable solution with fewer unknowns and support from the 
GRAM group. We can bring in the latest GT4 as needed if that provides a 
better solution than some older installed GT4 which we have no control 
over and which won't change until upcoming releases of, say, OSG or TG 
packages.

Doing the above should then enable large-scale testing of user workflows 
across many OSG and TG sites, without need to throttle back the *number* 
of jobs waiting or running.

Lastly: it seems that a Condor-G provider might be a powerful capability 
(as one configuration option) to be able to submit all Swift jobs via 
Condor-G (e.g., for non-coaster runs as well).  Please comment on the 
value of such a capability.

- Mike


