[Swift-devel] Status of coasters
Michael Wilde
wilde at mcs.anl.gov
Fri Feb 13 09:20:45 CST 2009
It's the problem of resource consumption by the jobmanager: the
longstanding problem that the Condor-G GRID_MONITOR addresses; the
problem that requires us to scale back to sending fewer than 20-40 jobs to
any OSG site when we use pre-WS-GRAM.
On 2/13/09 9:17 AM, Ian Foster wrote:
> Mike:
>
> What is the scalability problem WRT GT2 GRAM sites?
>
> Ian.
>
>
> On Feb 13, 2009, at 8:59 AM, Michael Wilde wrote:
>
>> Here's my understanding of status, issues and needs on coasters.
>>
>> Some side discussion with Mihael on various coaster issues is
>> summarized here as well; clarifications welcome.
>>
>> Work in progress:
>>
>> - Mihael has a good handle on the bootstrap issues and is working on
>> improvements. This is not working in trunk at the moment, but will likely
>> be fixed soon. We think this will fix known issues in: command line
>> length for condor; spaces, quotes, newlines and other offending
>> argument issues; location of Java and tools (wget/curl and md5sum).
>>
>> - still to do on above: sites.xml attribute to explicitly specify
>> location of tools, or at least of Java.
>>
>> - Ben has a patch to integrate to run the coaster service on a worker
>> node. Question: this is only usable when workers have sufficient IP
>> access, correct?
>>
>> - The scalability problem submitting to GT2 GRAM sites still exists.
>> Potential solutions are:
>>
>> -- Service submits workers via PBS (using jobmanager=gt2:pbs). Valid
>> only on PBS sites. Not yet tested.
>>
>> -- Service submits workers via Condor-G (using jobmanager=gt2:condor).
>> Mihael feels this requires a new Condor provider, the one in the
>> current code base being insufficient and untested (really more of a
>> prototype developed by a student).
>>
>> -- Service submits via WS-GRAM. This should be tested, on sites where
>> WS-GRAM is working.
>> This would use jobmanager=gt2:gt4:{pbs/condor/sge}, and needs to be
>> tested.
>> For sites where WS-GRAM is not functional, I suggested we consider
>> configuring our own non-root WS-GRAM, ideally using already-installed
>> GT4 software, e.g., from the OSG package on OSG and TG sites where it's
>> installed. Mihael thought this would be considerable work. I agree, but
>> it might be a stable solution with fewer unknowns and support from the
>> GRAM group. We can bring in the latest GT4 as needed if that provides
>> a better solution than some older installed GT4 which we have no
>> control over and which won't change till upcoming releases of, say, OSG
>> or TG packages.
>>
>> Doing the above should then enable large-scale testing of user
>> workflows across many OSG and TG sites, without need to throttle back
>> the *number* of jobs waiting or running.
>>
>> Lastly: it seems that a Condor-G provider might be a powerful
>> capability (as one configuration option) to be able to submit all
>> swift jobs via Condor-G (e.g., for non-coaster runs as well). Please
>> comment on the value of such a capability.
>>
>> - Mike
>> _______________________________________________
>> Swift-devel mailing list
>> Swift-devel at ci.uchicago.edu
>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>