[Swift-devel] Status of coasters

Michael Wilde wilde at mcs.anl.gov
Fri Feb 13 09:20:45 CST 2009


It's the problem of resource consumption by the jobmanager: the 
longstanding problem that the Condor-G GRID_MONITOR addresses; the 
problem that requires that we scale back to send fewer than 20-40 jobs to 
any OSG site when we use pre-WS-GRAM.


On 2/13/09 9:17 AM, Ian Foster wrote:
> Mike:
> 
> What is the scalability problem WRT GT2 GRAM sites?
> 
> Ian.
> 
> 
> On Feb 13, 2009, at 8:59 AM, Michael Wilde wrote:
> 
>> Here's my understanding of status, issues and needs on coasters.
>>
>> Some side discussion with Mihael on various coaster issues is 
>> summarized here as well; clarifications welcome.
>>
>> Work in progress:
>>
>> - Mihael has a good handle on the bootstrap issues and is working on 
>> improvements. This is not working in trunk at the moment but will likely 
>> be fixed soon. We think this will fix known issues in: command line 
>> length for condor; spaces, quotes, newlines and other offending 
>> argument issues; location of Java and tools (wget/curl and md5sum).
>>
>> - Still to do on the above: a sites.xml attribute to explicitly specify 
>> the location of tools, or at least of Java.
>>
>> - Ben has a patch, still to be integrated, to run the coaster service 
>> on a worker node. Question: this is only usable when workers have 
>> sufficient IP access, correct?
>>
>> - The scalability problem submitting to GT2 GRAM sites still exists. 
>> Potential solutions are:
>>
>> -- Service submits workers via PBS (using jobmanager=gt2:pbs). Valid 
>> only on PBS sites. Not yet tested.
>>
>> -- Service submits workers via Condor-G (using jobmanager=gt2:condor). 
>> Mihael feels this requires a new Condor provider (the one in the 
>> current code base being insufficient and untested - really more of a 
>> prototype developed by a student).
>>
>> -- Service submits via WS-GRAM. This should be tested, on sites where 
>> WS-GRAM is working.
>> This would use jobmanager=gt2:gt4:{pbs/condor/sge}, and needs to be 
>> tested.
>> For sites where WS-GRAM is not functional, I suggested we consider 
>> configuring our own non-root WS-GRAM, ideally using already-installed 
>> GT4 software, e.g., from the OSG package on OSG and TG sites where it's 
>> installed. Mihael thought this would be considerable work. I agree, but 
>> it might be a stable solution with fewer unknowns and support from the 
>> GRAM group. We can bring in the latest GT4 as needed if that provides 
>> a better solution than some older installed GT4 which we have no 
>> control over and which won't change until upcoming releases of, say, OSG 
>> or TG packages.
>>
>> Doing the above should then enable large-scale testing of user 
>> workflows across many OSG and TG sites, without need to throttle back 
>> the *number* of jobs waiting or running.
>>
>> Lastly: it seems that a Condor-G provider might be a powerful 
>> capability (as one configuration option) to be able to submit all 
>> Swift jobs via Condor-G (e.g., for non-coaster runs as well).  Please 
>> comment on the value of such a capability.
>>
>> - Mike
>> _______________________________________________
>> Swift-devel mailing list
>> Swift-devel at ci.uchicago.edu
>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> 


