[Swift-user] Thoughts on a data dependency issue

Fri Feb 13 16:08:00 CST 2009

I'm looking for advice on scripting the following in Swift:

One of the inputs to the DOCK app is a protein description that has been 
run through a time-intensive prep application stage called GRID to 
produce a binary description of the protein structure in a format that 
is machine-architecture specific.

Essentially what we have is the following 2-stage workflow:

   gridfile = grid (protein)
   foreach c, i in compoundList
     dockfile[i] = dock (protein, gridfile, c)

To be conservative, we generate the gridfiles on the same host that the 
DOCK stage of the application will run on. Theoretically we could 
generate one gridfile per architecture, and then send the file for the 
right architecture to the dock app.

I would like to trigger the generation of the "grid" file from the Swift 
script.

This presents two problems:

1) we don't know what machine the dock will run on, and
2) we need a way to name the grid file so that it can be cached for 
later use by multiple docks.

This is not a problem in single-site environments, but if each dock job 
could run on a site having one of several architectures, it's more 
challenging.

Certainly one way is to hide the gridfile and the grid() stage from 
Swift, and that's not a problem.

Another approach is to name the gridfiles with an architecture suffix, 
e.g. 1UBQ.grid.x86, .x86-64, .ppc, etc., and pass the files for all 
archs to each dock() app. That would cause excess data traffic, but not 
too much if they are only moved once. An app wrapper then dynamically 
picks the right arch file.
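
As a rough sketch of what such an app wrapper might do, the Python below 
picks one gridfile out of the full set by matching the local machine type 
against the filename suffix. The mapping table and the function name 
pick_gridfile are assumptions made up for illustration; the real machine 
strings and suffixes would depend on the sites involved.

```python
import platform

# Hypothetical mapping from platform.machine() values to the
# architecture suffixes used in the gridfile names (assumed, not exhaustive).
ARCH_SUFFIX = {
    "x86_64": "x86-64",
    "amd64": "x86-64",
    "i386": "x86",
    "i686": "x86",
    "ppc": "ppc",
    "ppc64": "ppc",
}


def pick_gridfile(candidates, machine=None):
    """Return the candidate whose name ends in .grid.<suffix> for this host.

    `machine` defaults to platform.machine(); it can be passed explicitly
    to check the selection logic for another architecture.
    """
    machine = (machine or platform.machine()).lower()
    suffix = ARCH_SUFFIX.get(machine)
    if suffix is None:
        raise ValueError("no gridfile suffix known for machine %r" % machine)
    for path in candidates:
        if path.endswith(".grid." + suffix):
            return path
    raise FileNotFoundError("no gridfile for architecture %r" % suffix)
```

The wrapper would call pick_gridfile() on the gridfile arguments it was 
handed and pass only the selected file on to the real dock executable.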

An interesting case/feature, perhaps, would be some new Swift option 
that says: don't transfer the file, just pass the filename, and send a 
URI (returned by the mapper) that an app wrapper can fetch dynamically.
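
To make the wrapper side of that idea concrete, here is a minimal Python 
sketch that fetches a file by URI only if it is not already in a local 
cache directory. No such Swift option exists as far as this post knows; 
the function name fetch_once and the cache layout are hypothetical, and 
the sketch does no locking beyond writing to a per-process temporary 
name before renaming.

```python
import os
import shutil
import urllib.request
from urllib.parse import urlparse


def fetch_once(uri, cache_dir):
    """Fetch `uri` into `cache_dir` unless a copy is already cached.

    The cached file is named after the last path component of the URI
    (an assumption; a real scheme might also encode the architecture).
    Returns the local path.
    """
    os.makedirs(cache_dir, exist_ok=True)
    name = os.path.basename(urlparse(uri).path)
    local = os.path.join(cache_dir, name)
    if not os.path.exists(local):
        # Download to a per-process temporary name, then rename, so a
        # partially written file is never mistaken for a cached one.
        tmp = "%s.%d.part" % (local, os.getpid())
        with urllib.request.urlopen(uri) as src, open(tmp, "wb") as dst:
            shutil.copyfileobj(src, dst)
        os.replace(tmp, local)
    return local
```

With something like this, many dock jobs landing on the same host would 
move the gridfile only once, as long as they share the cache directory.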

(This verges on a discussion for swift-devel, if no one sees other 
clever ways to express this.)

This is a low-priority issue, as we work around it by pre-generating the 
grid files or running on just one arch.

- Mike