[Swift-devel] Status of coasters
Michael Wilde
wilde at mcs.anl.gov
Fri Feb 13 08:59:55 CST 2009
Here's my understanding of status, issues and needs on coasters.
Some side discussion with Mihael on various coaster issues is summarized
here as well; clarifications welcome.
Work in progress:
- Mihael has a good handle on the bootstrap issues and is working on
improvements. This is not working in trunk at the moment, but will likely
be fixed soon. We think this will fix known issues in: command line length
for condor; spaces, quotes, newlines and other offending argument
issues; location of Java and tools (wget/curl and md5sum).
- still to do on above: sites.xml attribute to explicitly specify
location of tools, or at least of Java.
- Ben has a patch, still to be integrated, that runs the coaster service
on a worker node. Question: this is only usable when workers have
sufficient IP access, correct?
- The scalability problem submitting to GT2 GRAM sites still exists.
Potential solutions are:
-- Service submits workers via PBS (using jobmanager=gt2:pbs). Valid only
on PBS sites. Not yet tested.
-- Service submits workers via Condor-G (using jobmanager=gt2:condor).
Mihael feels this requires a new Condor provider, the one in the current
code base being insufficient and untested (really more of a prototype
developed by a student).
-- Service submits via WS-GRAM. This would use
jobmanager=gt2:gt4:{pbs/condor/sge}, and needs to be tested on sites
where WS-GRAM is working.
For sites where WS-GRAM is not functional, I suggested we consider
configuring our own non-root WS-GRAM, ideally using already-installed
GT4 software, e.g., from the OSG package on OSG and TG sites where it's
installed. Mihael thought this would be considerable work. I agree, but
it might be a stable solution with fewer unknowns and support from the
GRAM group. We can bring in the latest GT4 as needed if that provides a
better solution than some older installed GT4 which we have no control
over and which won't change until upcoming releases of, say, OSG or TG
packages.
Doing the above should then enable large-scale testing of user workflows
across many OSG and TG sites, without need to throttle back the *number*
of jobs waiting or running.
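
To keep the candidate jobmanager settings above straight, here is a rough sketch that lists them and splits each at the first colon. The reading of the first field as "how the service is started" and the rest as "how the service submits workers" is an assumption drawn from the descriptions in this message, not from Swift's documented grammar, and split_jobmanager is a hypothetical helper, not part of Swift.

```python
# Sketch only: the field meanings are an assumed reading of the
# jobmanager strings quoted in this message, not Swift's documented grammar.

def split_jobmanager(value):
    """Split a jobmanager string at its first colon (hypothetical helper)."""
    head, sep, tail = value.partition(":")
    if not sep or not head or not tail:
        raise ValueError("expected at least two colon-separated fields: %r" % value)
    # Assumption: first field = how the service is started,
    # remainder = how the service submits workers.
    return {"service_via": head, "workers_via": tail}

# The settings named in the list above; {pbs/condor/sge} is expanded by hand.
CANDIDATES = ["gt2:pbs", "gt2:condor"] + [
    "gt2:gt4:" + scheduler for scheduler in ("pbs", "condor", "sge")
]

for candidate in CANDIDATES:
    parts = split_jobmanager(candidate)
    print("%-16s service via %s, workers via %s"
          % (candidate, parts["service_via"], parts["workers_via"]))
```

None of these combinations is claimed to work; per the list above, each still needs testing.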
Lastly: it seems that a Condor-G provider might be a powerful capability
(as one configuration option), letting us submit all Swift jobs via
Condor-G (e.g., for non-coaster runs as well). Please comment on the
value of such a capability.
- Mike