[Swift-user] Thoughts on a data dependency issue
wilde at mcs.anl.gov
Fri Feb 13 16:08:00 CST 2009
I'm looking for advice on scripting the following in Swift:
One of the inputs to the DOCK app is a protein description that has been
run through a time-intensive prep application stage called GRID to
produce a binary description of the protein structure in a format that
is machine-architecture specific.
Essentially what we have is the following 2-stage workflow:
gridfile = grid(protein);
foreach c, i in compoundList {
    dockfile[i] = dock(protein, gridfile, c);
}
To be conservative, we generate the gridfiles on the same host that the
DOCK stage of the application will run on. Theoretically we could
generate one gridfile per architecture, and then send the right
architecture file to the dock app.
I would like to trigger the generation of the "grid" file from the Swift
script.
This presents two problems:
1) we don't know what machine the dock will run on, and
2) we need a way to name the grid file so that it can be cached for
later use by multiple docks.
This is not a problem in single-site environments, but if each dock job
could run on a site having one of several architectures, it's more
complicated.
Certainly one way is to hide the gridfile and the grid() stage from
Swift, and that's not a problem.
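
As a rough illustration of hiding the grid() stage inside the dock app
wrapper, here is a minimal Python sketch. It assumes a per-host cache
directory and names the cached file by protein and architecture. The
names cached_grid_path, ensure_grid and make_grid are hypothetical, and
make_grid only stands in for whatever actually invokes GRID.

```python
import os
import platform
import tempfile


def cached_grid_path(protein_path, cache_dir, arch=None):
    """Return the cache location of a protein's grid file for one architecture."""
    # Assumption: the architecture string is good enough as a cache key.
    arch = arch or platform.machine()
    name = os.path.splitext(os.path.basename(protein_path))[0]
    return os.path.join(cache_dir, f"{name}.grid.{arch}")


def ensure_grid(protein_path, cache_dir, make_grid, arch=None):
    """Generate the grid file once per (protein, arch) and reuse it afterwards.

    make_grid(protein_path, out_path) is a hypothetical callable standing in
    for the real GRID invocation.
    """
    target = cached_grid_path(protein_path, cache_dir, arch)
    if os.path.exists(target):
        return target  # an earlier dock job on this host already built it

    os.makedirs(cache_dir, exist_ok=True)
    # Write under a temporary name and rename afterwards, so a concurrent
    # dock job never reads a half-written grid file.
    fd, tmp = tempfile.mkstemp(dir=cache_dir, suffix=".tmp")
    os.close(fd)
    try:
        make_grid(protein_path, tmp)
        os.replace(tmp, target)
    finally:
        if os.path.exists(tmp):
            os.remove(tmp)  # only reached when make_grid failed
    return target
```

Two dock jobs starting at the same moment may both run GRID here; the
rename keeps the result consistent, but avoiding the duplicate work would
need a lock, which this sketch leaves out.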
Another approach is to name the gridfiles with an architecture suffix,
e.g. 1UBQ.grid.x86, .x86-64, .ppc, etc., and pass all the archs to each
dock() app. That would cause excess data traffic, but not too bad if
they are only moved once. Then an app wrapper dynamically picks the
right arch file.
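
The wrapper-side selection could look roughly like the Python sketch
below. The mapping from platform.machine() values to the file suffixes is
an assumption made for illustration, as are the names ARCH_SUFFIX,
suffix_for_machine and pick_grid_file; a real deployment would need to
list the machine names it actually sees.

```python
import platform

# Assumed mapping from platform.machine() values to the grid-file suffixes
# used in the naming scheme; extend it for the hosts actually in use.
ARCH_SUFFIX = {
    "x86_64": "x86-64",
    "amd64": "x86-64",
    "i386": "x86",
    "i686": "x86",
    "ppc": "ppc",
}


def suffix_for_machine(machine=None):
    """Translate a machine name into a grid-file suffix."""
    machine = (machine or platform.machine()).lower()
    try:
        return ARCH_SUFFIX[machine]
    except KeyError:
        raise LookupError(f"no grid suffix known for machine {machine!r}")


def pick_grid_file(candidates, arch=None):
    """Return the candidate whose name ends in this host's arch suffix."""
    arch = arch or suffix_for_machine()
    for path in candidates:
        # ".x86" does not match "...grid.x86-64", so suffixes stay distinct.
        if path.endswith("." + arch):
            return path
    raise LookupError(f"no grid file for architecture {arch!r}")
```

The wrapper would call pick_grid_file on the list of grid files it was
handed and pass the result to the dock executable.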
An interesting case/feature, perhaps, is some new Swift option to say
"don't transfer the file, just pass the filename", and send a URI
(returned by the mapper) that an app wrapper can fetch dynamically.
(This verges on a discussion for swift-devel, if no one sees other
clever ways to express this.)
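
For the URI idea, only the wrapper side can be sketched today, since the
Swift option itself does not exist. The Python below assumes the wrapper
receives a URI string and a local destination path; fetch_if_missing is a
hypothetical helper name, and the sketch relies only on the standard
library's urllib.request.

```python
import os
import shutil
import urllib.request


def fetch_if_missing(uri, dest_path):
    """Download uri to dest_path unless a local copy is already present."""
    if os.path.exists(dest_path):
        return dest_path  # reuse the copy fetched by an earlier job

    # Download under a temporary name, then rename, so other jobs never
    # pick up a partially transferred file.
    tmp = dest_path + ".part"
    with urllib.request.urlopen(uri) as src, open(tmp, "wb") as out:
        shutil.copyfileobj(src, out)
    os.replace(tmp, dest_path)
    return dest_path
```

urlopen handles file://, http:// and ftp:// URIs; anything else (GridFTP,
for instance) would need a different transfer tool in the wrapper.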
This is a low-prio issue, as we work around it by pre-generating the
grid files or running on just one arch.