[Swift-devel] Probing running jobs

Fri Apr 3 08:38:08 CDT 2009

Following up on Mihael's question about a feature I listed in the to-do 
list I proposed for coasters:

On 4/2/09 11:17 PM, Mihael Hategan wrote:
> On Thu, 2009-04-02 at 21:01 -0500, Michael Wilde wrote:
>>>> - some way to probe a job thats running on a coaster?
>>> Define "probe".
>> - ps -f on the running process.
>> - probe its resource usage (/proc, also ps, etc)
>> - ls -lR of its jobdir (as these will more often be on /tmp)
>>
>> We have these needs today; on the BGP under falkon we manually login to 
>> the node, but thats cumbersome: hard to find the node; 2-stage login 
>> process.
>>
>> Low prio, a pipe dream. But theoretically do-able.
> 
> It should be possible (and somewhat interesting) to have a simple shell
> that can execute stuff on the workers while the job is running, so that
> you can issue your own commands.
> 
> The question of how to find the right worker remains. Can you go a bit
> deeper into the details? How do you find the node currently (be as
> specific as you can be)?

In the oops workflow, I recall these cases at the moment:

1) Have my (large set of similar) jobs started?

2) Most jobs have finished. Are the remaining ones hung, or proceeding 
normally but slower for some application- or data-specific reason?

--

For (1), on the BGP, if most or all cores in the partition have apps 
running on them, we pick any core and log in to it. Then, to see what that 
particular app is doing, we tail its log file and compare its progress with 
its CPU time consumption (from ps). Note that its log file is on local 
disk, because we set the "jobdir on local" option of swiftwrapper.

Logging in to a node means finding its IO node IP addr from a Falkon 
dynamic config file, ssh-ing to the ION, then telnetting to an arbitrary 
worker node (these are on 192.168.1.[0-63] private addrs), then running 
ps and tail. If not all the worker nodes in a processor set are busy, 
it's a nuisance to find one that is. If few are busy, it's not practical. 
Overall, this technique is just a spot-check to see "are *any* of my 
jobs running right", i.e. to see if we've (finally) got their arguments 
correct, etc.

(1) is better solved with the same technique needed for (2): given a 
job, find its ION and worker node IPs, and ssh/telnet directly there. 
That tooling does not exist yet but would be straightforward to build. 
On BGP the WNs are not running ssh, hence the additional nuisance of 
telnet.

(2) is theoretically possible, but impractical until we add a few 
scripts to trace from a swift job to the falkon service that's running 
it, and from there to the falkon agent that's running it (again, in the 
BGP case). The data for this exists. So we occasionally need (2) but 
can't do it.

Regarding "question of how to find the right worker" - this starts with 
having some sort of ID for each job that the user can use to go from 
"source code based identity" through job status and then to job 
location. (By "job" here I mean one execution of an app() proc.)

I have not yet looked at your status monitor, but am eager to try it. So 
I don't know if you took any steps in there to correlate a job's proc 
name and args to its status. But that's what I think the user ultimately 
needs and wants.

For example, in oops, the majority of tasks are either of app "runrama" 
or "runoops". They have a mixture of scalar and file args.

I'd like to see in status something sort of like strace, where syscalls 
have potentially long arg lists (when formatted) but there's a canonical 
way to present them in an acceptably compact format, with "..." ellipses 
as needed.
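
To make the compact format concrete, here is a rough Python sketch of 
what I have in mind. Nothing like this exists in swift today; the 
function names, the length limits and the proc(arg, ...) layout are all 
made up for illustration.

```python
def abbreviate(arg, max_len=24):
    """Shorten one argument, keeping its head and tail around '...'."""
    if len(arg) <= max_len:
        return arg
    keep = max(max_len - 3, 0)   # characters left once '...' is counted
    head = keep // 2
    tail = keep - head
    # Slice from an explicit index so that tail == 0 yields an empty string.
    return arg[:head] + "..." + arg[len(arg) - tail:]


def format_invocation(proc, args, max_arg_len=24, max_args=4):
    """Render an app invocation on one line, strace-style."""
    shown = [abbreviate(a, max_arg_len) for a in args[:max_args]]
    if len(args) > max_args:
        shown.append("...")      # remaining arguments are elided
    return "%s(%s)" % (proc, ", ".join(shown))
```

Keeping both the head and the tail of a long file arg matters here, 
since the tail is usually the part that tells similar jobs apart.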

So as app invocations become known to swift, they get IDs starting from 
0 (PID-like, but not wrapping around), and are listed in the progress 
log as:

Job 123 is Proc runrama Args 456 input/prot/.../...00.019.1ubq.pdb etc
Job 123 input transfer OK
Job 123 submitted - teraport/coaster92
Job 124 is
Job 125 is
Job 123 output transfer OK
Job 123 ended OK
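
A rough sketch of the bookkeeping behind such a log, in Python just to 
pin down the idea. The JobLog class and its methods are hypothetical, 
not anything in swift; the only points it is meant to show are that IDs 
come from a counter that never wraps or gets reused, and that the job's 
location is recorded at submit time so it can be looked up later.

```python
import itertools


class JobLog:
    """Hands out job IDs and records progress lines (illustrative only)."""

    def __init__(self):
        self._ids = itertools.count(0)   # 0, 1, 2, ... never reused
        self.jobs = {}                   # job ID -> proc, args, site
        self.lines = []                  # progress log, in order

    def register(self, proc, args):
        """Called when an app invocation first becomes known."""
        job_id = next(self._ids)
        self.jobs[job_id] = {"proc": proc, "args": list(args), "site": None}
        self.lines.append(
            "Job %d is Proc %s Args %s" % (job_id, proc, " ".join(args)))
        return job_id

    def event(self, job_id, text):
        """Append a status line such as 'input transfer OK'."""
        self.lines.append("Job %d %s" % (job_id, text))

    def submitted(self, job_id, site):
        """Record where the job went, so a later probe can find it."""
        self.jobs[job_id]["site"] = site
        self.event(job_id, "submitted - %s" % site)
```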

And then, I can say:

probe 123 "ps -ef | grep runrama; tail -3 /tmp/work/*/*/runrama.log"

(for starters).

So the capability depends on having usable IDs for jobs and coasters, 
maybe more objects, so that the user can specify a job of interest and 
the system can send the user's probe to that job.
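
As a sketch only, the two halves of that could look like the Python 
below: one function that turns a job ID into a probe request using a 
job-to-coaster map, and one that a worker could use to run the command 
and hand back its output. Both function names and the shape of the 
request are assumptions on my part, and how the request actually 
travels to the worker is deliberately left out.

```python
import subprocess


def resolve_probe(job_id, job_sites, command):
    """Build a probe request for a job from a job-ID -> coaster-ID map."""
    site = job_sites.get(job_id)
    if site is None:
        # Job not submitted yet, already ended, or simply unknown.
        raise LookupError("no known location for job %d" % job_id)
    return {"job": job_id, "coaster": site, "command": command}


def run_probe(command, timeout=10):
    """Worker side: run the user's shell command, return (exit code, output)."""
    result = subprocess.run(
        command,
        shell=True,                  # the probe is a shell pipeline by design
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,    # fold stderr into the returned text
        universal_newlines=True,
        timeout=timeout,             # raises TimeoutExpired on a stuck probe
    )
    return result.returncode, result.stdout
```

The timeout is there because a probe is most useful exactly when 
something on the node may be hung, so the probe itself should not be 
able to hang forever.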

Something simple, flexible and shell-like is good to start with so we 
can explore what's needed and ideally create scripts to wrap more 
powerful capabilities.