[Swift-user] How to transmit data dynamically on Grid
Michael Wilde
wilde at mcs.anl.gov
Tue Jul 22 22:16:48 CDT 2008
Zhengxiong,
By default, Swift automatically moves your data from a directory on the
submit host (the host on which you run the swift command) to a shared
directory on the execution site, where it's accessed by your job, running
on a worker node in the remote cluster.
This is explained in the User Guide intro:
http://www.ci.uchicago.edu/swift/guides/userguide.php
"SwiftScript programs are dataflow oriented - they are primarily
concerned with processing (possibly large) data files, by invoking
programs to do that processing. Swift handles execution of such programs
on remote sites by choosing sites, handling the staging of input and
output files to and from the chosen sites and remote execution of
program code".
Staging is detailed in section 8: "Invoking an Application from Swift":
http://www.ci.uchicago.edu/swift/guides/userguide.php#id2931120
I think the example shell script you are looking at, "rundock", is
misleading you, because it was written to run under Falkon without
Swift, and hence does some staging between the cluster's shared
filesystem and local worker-node directories.
I would start by dividing the files that DOCK uses into two categories:
1) files that you will declare as inputs or outputs of Swift atomic
procedures, which you should let Swift stage in and out automatically;
and 2) files which can be considered part of the application's install
directory (which can stay on each cluster's shared filesystem with the
application code, or which can be shipped to each site in a preparation
stage).
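To make the two categories concrete, here is a minimal sketch of how a
wrapper in the style of "rundock" could treat them. This is not the
actual rundock script. The function name run_app, the DOCK_HOME
variable, and the file name params.in are all hypothetical, and the
real DOCK invocation is replaced by a placeholder.

```shell
#!/bin/bash
# Sketch only: run_app, DOCK_HOME and params.in are hypothetical names.
# Category 1: files that Swift stages arrive as command-line arguments.
# Category 2: files kept with the application install are located
#             through DOCK_HOME (an assumed environment variable).

run_app() {
    local ligand=$1          # input staged in by Swift
    local result=$2          # output that Swift will stage out
    local app_home=${DOCK_HOME:-}

    if [ -z "$app_home" ]; then
        echo "DOCK_HOME is not set" >&2
        return 1
    fi

    # Fail early if a staged input or an installed file is missing.
    if [ ! -r "$ligand" ]; then
        echo "missing staged input: $ligand" >&2
        return 1
    fi
    if [ ! -r "$app_home/params.in" ]; then
        echo "missing installed file: $app_home/params.in" >&2
        return 1
    fi

    # Placeholder for the real application call: record which files
    # would be used, so that the output file exists for stage-out.
    {
        echo "ligand=$ligand"
        echo "params=$app_home/params.in"
    } > "$result"
}
```

The point of the split is that the wrapper never calls
globus-url-copy itself: the staged files are already in place when it
starts, and the installed files never move during a run.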
In addition, Swift will, within the execution of a script, avoid staging
a file in twice if it can. The User Guide explains this under the
property "caching.algorithm":
"Swift caches files that are staged in on remote resources, and files
that are produced remotely by applications, such that they can be
re-used if needed without being transfered again. However, the amount of
remote file system space to be used for caching can be limited using the
swift:storagesize profile entry in the sites.xml file."
So you could let Swift bring in even large files for you, to the shared
filesystem, and your application wrapper script can cache these in a
persistent application directory on the worker node.
In rundock, you could use this approach by declaring the "receptor"
protein molecule files (grid files and "selected spheres") as Swift
inputs, and letting Swift bring them to the grid site for you.
Lastly, see this note in the Environment Variables section of the users
guide:
"SWIFT_JOBDIR_PATH - set in env namespace profiles. If set, then Swift
will use the path specified here as a worker-node local temporary
directory to copy input files to before running a job. If unset, Swift
will keep input files on the site-shared filesystem. In some cases,
copying to a worker-node local directory can be much faster than having
applications access the site-shared filesystem directly."
You can achieve the same effect of copying data to the local worker-node
disk by doing so explicitly in your application wrapper script
("rundock" in your case). If you know that you will be running many
applications consecutively on the same worker nodes, e.g. because you
are using Coaster or Falkon, then you can do what rundock does on the
BG/P, and cache data in a local directory *between* jobs. But, like
rundock, you need to be careful to avoid races between multiple jobs on
the same node, and must ensure that you can always get your data from
the shared filesystem when it's not already cached locally. Bash
functions in rundock have the locking logic to do this.
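As an illustration of the kind of locking involved, here is a minimal
sketch. It is not the code from rundock. The function name cache_file
and its arguments are hypothetical, and it assumes that mkdir is atomic
on the node-local filesystem, which is what lets a directory serve as a
lock.

```shell
#!/bin/bash
# Sketch only: cache_file is a hypothetical helper, not part of rundock.
# Usage: cache_file SRC CACHE_DIR
# Prints the path of a node-local copy of SRC, copying it from the
# shared filesystem only when it is not already in CACHE_DIR.

cache_file() {
    local src=$1
    local cache_dir=$2
    local dest
    dest="$cache_dir/$(basename "$src")"
    local lock="$dest.lock"
    local status=0

    mkdir -p "$cache_dir" || return 1

    # mkdir either creates the lock directory or fails, so only one
    # job on the node holds the lock at a time; the others wait.
    until mkdir "$lock" 2>/dev/null; do
        sleep 1
    done

    if [ ! -f "$dest" ]; then
        # Copy to a temporary name and rename it afterwards, so that a
        # partially written file never appears under the final name.
        if cp "$src" "$dest.tmp.$$" && mv "$dest.tmp.$$" "$dest"; then
            status=0
        else
            rm -f "$dest.tmp.$$"
            status=1
        fi
    fi

    rmdir "$lock"

    if [ "$status" -eq 0 ]; then
        echo "$dest"
    fi
    return "$status"
}
```

A wrapper could then call something like
local_grid=$(cache_file "$shared_grid" /tmp/dock-cache), where both
names are again made up for the example. Note that this sketch does not
handle a stale lock left behind by a job that was killed while copying;
a real wrapper would need a timeout or a cleanup step for that case.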
Caching on the worker-node disk makes sense for data that will be read
by many jobs, such as the receptor files, as each of these will be read
by 15K jobs.
So there are actually several ways to manage your data.
Let's work out some of these cases, and then document them in the User
Guide for future users, with examples.
- Mike
On 7/22/08 6:33 PM, Zhengxiong Hou wrote:
> Hi,
> I'm using the Swift to execute application jobs on the
> OSG grid sites.
> In the sites.xml file, if the jobmanager is not "fork",
> e.g. url="abitibi.sbgrid.org/jobmanager-condor".
> The job is usually executed on a local computing node,
> which is not the "Gateway node" of the grid site.
> But when executing the job, in the executable command,
> such as a wrapper script "rundock", I want to dynamically
> transmit the input data files from CI to the remote grid
> site by "globus-url-copy". e.g. (
> globus-url-copy gsiftp://communicado.ci.uchicago.edu$ligpath
> file://$work/$ligfile)
> And transmit the results data from remote grid site to CI
> machine, e.g. (globus-url-copy file://$work/result.tar.gz
> gsiftp://communicado.ci.uchicago.edu/home/houzx/dock-
> run/databases/results/abitibi.sbgrid.org-$ligfile-
> result.tar.gz)
> The problem is that the executing computing node is not
> connected to the outside network, so the "globus-url-copy"
> fails! Only using "jobmanager-fork", can it succeed, because
> the job is executed on the Gateway node of the Grid site.
>
> The user may want to use the "jobmanager-condor" to
> execute the jobs. At the same time, according to the
> dynamically selected grid sites of Swift, they also want to
> transmit the input and results data dynamically and
> automatically by "jobmanager-fork". Because it is
> troublesome to "globus-url-copy" the input and results data
> to the remote grid sites manually, if there are large
> amounts of data files.
>
> So, the question is how to implement it in Swift? Maybe
> it's a common problem, but I didn't find it in the documents.
>
> Thanks,
> Zhengxiong