[Swift-devel] Status of coasters
Michael Wilde
wilde at mcs.anl.gov
Fri Feb 13 09:40:08 CST 2009
On 2/13/09 9:27 AM, Ben Clifford wrote:
> On Fri, 13 Feb 2009, Michael Wilde wrote:
>
>> - Ben has a patch, awaiting integration, to run the coaster service on a
>> worker node. Question: this is only usable when workers have sufficient IP
>> access, correct?
>
> Yes. I plan on making this presentable and then committing it. As part of
> that, I should probably document who connects where in coasters with a
> pretty diagram, to aid understanding of what 'sufficient' means.
Very good; I was just thinking of the same diagram, even as design
documentation to help us grok the setup and communication paths for
coasters.
Also: coaster-server-on-workernode has the nice advantage that we don't
run any Swift software on infrastructure nodes like headnodes: less
chance to cause damage, and more power for our workflow. It also gets
around the potential problem that a managed-fork jobmanager will kill
our process for exceeding a walltime limit. Nice philosophy overall.
>> - The scalability problem submitting to GT2 GRAM sites still exists. Potential
>> solutions are:
>>
>> -- Service submits workers via PBS (using jobmanager=gt2:pbs). Valid only
>> on PBS sites. Not yet tested.
>>
>> -- Service submits workers via Condor-G (using jobmanager=gt2:condor).
>> Mihael feels this requires a new Condor provider, as the one in the
>> current code base is insufficient and untested (really more of a
>> prototype developed by a student).
>
> That would be regular Condor, not Condor-G, I think.
It seems it could be either:
- regular Condor, to submit to the local Condor pool
- Condor-G, to submit back through GT2 but with the aid of its GRID_MONITOR
for scalability; this would also be LRM-independent.
(Both quoted jobmanager strings are sketched in sites.xml form below.)
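For reference, the two jobmanager strings quoted above would appear in
sites.xml roughly as follows. This is only a sketch: the handles,
hostnames, and paths are hypothetical, and the condor entry assumes the
improved provider discussed in this thread.

  <!-- Head job through GT2 GRAM; the coaster service then submits
       workers directly to the local PBS queue. -->
  <pool handle="example-pbs">
    <execution provider="coaster" url="gatekeeper.example.edu"
               jobmanager="gt2:pbs"/>
    <gridftp url="gsiftp://gatekeeper.example.edu"/>
    <workdirectory>/home/user/swiftwork</workdirectory>
  </pool>

  <!-- Same pattern for a Condor cluster, contingent on a decent
       Condor provider. -->
  <pool handle="example-condor">
    <execution provider="coaster" url="gatekeeper.example.edu"
               jobmanager="gt2:condor"/>
    <gridftp url="gsiftp://gatekeeper.example.edu"/>
    <workdirectory>/home/user/swiftwork</workdirectory>
  </pool>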
>
> The two above could be summarised as "submit service workers through the
> local LRM using CoG specific providers for that LRM".
>
> The PBS provider seems to be getting a reasonable amount of use recently,
> and I think is also useful in the single-site case where it allows GRAM to
> be avoided entirely.
>
> A decent Condor provider would probably allow something similar for Condor
> based clusters.
>
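As an aside, the GRAM-free single-site case mentioned just above might
look roughly like this in sites.xml (again a sketch, with a hypothetical
handle and path; local:pbs submits the coaster head job straight to the
local PBS queue, bypassing GRAM entirely):

  <pool handle="localpbs">
    <execution provider="coaster" url="none" jobmanager="local:pbs"/>
    <workdirectory>/home/user/swiftwork</workdirectory>
  </pool>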
>> -- Service submits via WS-GRAM. This would use
>> jobmanager=gt2:gt4:{pbs/condor/sge}, and should be tested on sites where
>> WS-GRAM is working.
>
> If gram4.0 is working on a site, is there any reason to use gt2 for the
> head job submission?
No, not at all: we should indeed use WS-GRAM in those cases. In fact, we
should use it wherever possible, i.e., wherever it provides the best
available job execution service.
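In sites.xml terms, that substitution would presumably look like the
following sketch (hostname and port are made up; gt4:pbs here assumes
WS-GRAM for the head-job submission with PBS for the workers):

  <pool handle="example-gt4">
    <execution provider="coaster" url="gatekeeper.example.edu:8443"
               jobmanager="gt4:pbs"/>
    <gridftp url="gsiftp://gatekeeper.example.edu"/>
    <workdirectory>/home/user/swiftwork</workdirectory>
  </pool>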
> It seems to add a dependency on one more service
> (depending on both gram2 and gram4.0) rather than substituting one
> dependency for another (gram4.0 for gram2).
>
>> For sites where WS-GRAM is not functional, I suggested we consider
>> configuring our own non-root WS-GRAM, ideally using already-installed GT4
>> software, e.g., from the OSG package on OSG and TG sites where it's
>> installed. Mihael thought this would be considerable work. I agree, but
>> it might be a stable solution with fewer unknowns and support from the
>> GRAM group. We can bring in the latest GT4 as needed if that provides a
>> better solution than some older installed GT4 which we have no control
>> over and which won't change until upcoming releases of, say, the OSG or
>> TG packages.
>
> I agree that this is considerable work. I think it is not something we
> should pursue.
>
>> Lastly: it seems that a Condor-G provider might be a powerful capability
>> (as one configuration option), allowing all Swift jobs to be submitted
>> via Condor-G (e.g., for non-coaster runs as well). Please comment on the
>> value of such a capability.
>
> I've pondered that before.
>
> Using Condor-G appears to be the officially supported mechanism for
> submitting to OSG in some people's minds; and similarly, using plain GRAM2
> is prohibited in those people's minds.
>
> Using Condor-G would be more in line with some people's views of how jobs
> should properly be submitted to OSG.
>
> Such functionality could fit in as a CoG execution provider (similar to,
> or part of, a plain Condor execution provider), and would not perturb the
> architecture of Swift. Swift runs in such a situation would look a little
> like DAGMan runs, with a management process handling some rate limiting
> and deciding which jobs to run and where, but with the mechanics of
> submission handled by a local Condor.
>
> This approach would necessitate a local Condor installation, but only in
> situations where this approach was used; so this would not perturb
> usability too much, and many places where this would be used already have
> a Condor installation.
>
> So I'm cautiously supportive of this approach.
Excellent, and I agree with your analysis.
I'll draft a priority list for such efforts and then circulate to the group.
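For concreteness, a Condor-G provider along the lines quoted above would
essentially be generating grid-universe submit descriptions like the
following sketch (the gatekeeper, jobmanager, and paths are all made up
for illustration):

  # Route the job back through GT2 GRAM; Condor-G's GRID_MONITOR
  # then handles remote status polling scalably.
  universe      = grid
  grid_resource = gt2 gatekeeper.example.edu/jobmanager-pbs
  executable    = /home/user/swiftwork/shared/wrapper.sh
  arguments     = jobdir-0001
  output        = job-0001.out
  error         = job-0001.err
  log           = swift-condorg.log
  queue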
>
> Specifically, given the two different uses for Condor interfacing
> discussed above, I think it would be useful to investigate making the
> Condor provider decent.
>
Agreed.