[Swift-devel] Status of coasters
Michael Wilde
wilde at mcs.anl.gov
Fri Feb 13 09:40:08 CST 2009
On 2/13/09 9:27 AM, Ben Clifford wrote:
> On Fri, 13 Feb 2009, Michael Wilde wrote:
>
>> - Ben has a patch, awaiting integration, to run the coaster service on a
>> worker node. Question: this is only usable when workers have sufficient IP
>> access, correct?
>
> Yes. I plan on making this presentable and then committing it. As part of
> that, I should probably document who connects where in coasters with a
> pretty diagram, to aid understanding of what 'sufficient' means.
Very good; I was just thinking of the same diagram, even as design
documentation to help us grok the setup and communication paths for
coasters.
Also: coaster-server-on-workernode has the nice advantage that we don't
run any Swift software on infrastructure nodes like headnodes: less
chance to cause damage, and more power for our workflow. It also gets
around the potential problem that a managed-fork jobmanager will kill
our process for exceeding a walltime limit. Nice philosophy overall.
>> - The scalability problem submitting to GT2 GRAM sites still exists. Potential
>> solutions are:
>>
>> -- Service submits workers via PBS (using jobmanager=gt2:pbs). Valid only
>> on PBS sites. Not yet tested.
>>
>> -- Service submits workers via Condor-G (using jobmanager=gt2:condor).
>> Mihael feels this requires a new Condor provider, as the one in the
>> current code base is insufficient and untested (really more of a
>> prototype developed by a student).
>
> That would be regular Condor, not Condor-G, I think.
It seems it could be either:
- regular Condor, to submit to the local Condor pool
- Condor-G, to submit back through GT2 but with the aid of its GRID_MONITOR
for scalability; this would also be LRM-independent.
(Both quoted jobmanager strings are sketched in sites.xml form below.)
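For reference, the two jobmanager strings quoted above would appear in
sites.xml roughly as follows. This is only a sketch: the handles,
hostnames, and paths are hypothetical, and the condor entry assumes the
improved provider discussed in this thread.

  <!-- Head job through GT2 GRAM; the coaster service then submits
       workers directly to the local PBS queue. -->
  <pool handle="example-pbs">
    <execution provider="coaster" url="gatekeeper.example.edu"
               jobmanager="gt2:pbs"/>
    <gridftp url="gsiftp://gatekeeper.example.edu"/>
    <workdirectory>/home/user/swiftwork</workdirectory>
  </pool>

  <!-- Same pattern for a Condor cluster, contingent on a decent
       Condor provider. -->
  <pool handle="example-condor">
    <execution provider="coaster" url="gatekeeper.example.edu"
               jobmanager="gt2:condor"/>
    <gridftp url="gsiftp://gatekeeper.example.edu"/>
    <workdirectory>/home/user/swiftwork</workdirectory>
  </pool>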
>
> The two above could be summarised as "submit service workers through the
> local LRM using CoG specific providers for that LRM".
>
> The PBS provider seems to be getting a reasonable amount of use recently,
> and I think is also useful in the single-site case where it allows GRAM to
> be avoided entirely.
>
> A decent Condor provider would probably allow something similar for Condor
> based clusters.
>
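As an aside, the GRAM-free single-site case mentioned just above might
look roughly like this in sites.xml (again a sketch, with a hypothetical
handle and path; local:pbs submits the coaster head job straight to the
local PBS queue, bypassing GRAM entirely):

  <pool handle="localpbs">
    <execution provider="coaster" url="none" jobmanager="local:pbs"/>
    <workdirectory>/home/user/swiftwork</workdirectory>
  </pool>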
>> -- Service submits via WS-GRAM. This would use
>> jobmanager=gt2:gt4:{pbs/condor/sge}, and should be tested on sites where
>> WS-GRAM is working.
>
> If gram4.0 is working on a site, is there any reason to use gt2 for the
> head job submission?
No, not at all: we should indeed use WS-GRAM in those cases. In fact, we
should use it wherever possible, i.e., wherever it provides the best
available job execution service.
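In sites.xml terms, that substitution would presumably look like the
following sketch (hostname and port are made up; gt4:pbs here assumes
WS-GRAM for the head-job submission with PBS for the workers):

  <pool handle="example-gt4">
    <execution provider="coaster" url="gatekeeper.example.edu:8443"
               jobmanager="gt4:pbs"/>
    <gridftp url="gsiftp://gatekeeper.example.edu"/>
    <workdirectory>/home/user/swiftwork</workdirectory>
  </pool>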
> It seems to add a dependency on one more service
> (depending on both gram2 and gram4.0) rather than substituting one
> dependency for another (gram4.0 for gram2).
>
>> For sites where WS-GRAM is not functional, I suggested we consider
>> configuring our own non-root WS-GRAM, ideally using already-installed GT4
>> software, e.g., from the OSG package on OSG and TG sites where it's
>> installed. Mihael thought this would be considerable work. I agree, but
>> it might be a stable solution with fewer unknowns and support from the
>> GRAM group. We can bring in the latest GT4 as needed if that provides a
>> better solution than some older installed GT4 which we have no control
>> over and which won't change until upcoming releases of, say, the OSG or
>> TG packages.
>
> I agree that this is considerable work. I think it is not something we
> should pursue.
>
>> Lastly: it seems that a Condor-G provider might be a powerful capability
>> (as one configuration option), allowing all Swift jobs to be submitted
>> via Condor-G (e.g., for non-coaster runs as well). Please comment on the
>> value of such a capability.
>
> I've pondered that before.
>
> Using Condor-G appears to be the officially supported mechanism for
> submitting to OSG in some people's minds; and similarly, using plain GRAM2
> is prohibited in those people's minds.
>
> Using Condor-G would be more in line with some people's views of how jobs
> should properly be submitted to OSG.
>
> Such functionality could fit in as a CoG execution provider (similar to,
> or part of, a plain Condor execution provider), and would not perturb the
> architecture of Swift. Swift runs in such a situation would look a little
> like DAGMan runs, with a management process handling some rate limiting
> and deciding which jobs to run and where, but with the mechanics of
> submission handled by a local Condor.
>
> This approach would necessitate a local Condor installation, but only in
> situations where this approach was used; so this would not perturb
> usability too much, and many places where this would be used already have
> a Condor installation.
>
> So I'm cautiously supportive of this approach.
Excellent, and I agree with your analysis.
I'll draft a priority list for such efforts and then circulate to the group.
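For concreteness, a Condor-G provider along the lines quoted above would
essentially be generating grid-universe submit descriptions like the
following sketch (the gatekeeper, jobmanager, and paths are all made up
for illustration):

  # Route the job back through GT2 GRAM; Condor-G's GRID_MONITOR
  # then handles remote status polling scalably.
  universe      = grid
  grid_resource = gt2 gatekeeper.example.edu/jobmanager-pbs
  executable    = /home/user/swiftwork/shared/wrapper.sh
  arguments     = jobdir-0001
  output        = job-0001.out
  error         = job-0001.err
  log           = swift-condorg.log
  queue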
>
> Specifically, given the two different uses for Condor interfacing
> discussed above, I think it would be useful to investigate making the
> Condor provider decent.
>
Agreed.