[Swift-user] How to transmit data dynamically on Grid

Tue Jul 22 22:16:48 CDT 2008

Zhengxiong,

By default, Swift automatically moves your data from a directory on the 
submit host (the host on which you run the swift command) to a shared 
directory on the execution site, where it's accessed by your job, running 
on a worker node in the remote cluster.

This is explained in the User Guide intro:

http://www.ci.uchicago.edu/swift/guides/userguide.php

"SwiftScript programs are dataflow oriented - they are primarily 
concerned with processing (possibly large) data files, by invoking 
programs to do that processing. Swift handles execution of such programs 
on remote sites by choosing sites, handling the staging of input and 
output files to and from the chosen sites and remote execution of 
program code".

Staging is detailed in section 8: "Invoking an Application from Swift":
http://www.ci.uchicago.edu/swift/guides/userguide.php#id2931120

I think the example shell script you are looking at, "rundock", is 
misleading you, because it was written to run under Falkon without 
Swift, and hence does some staging between the cluster's shared 
filesystem and local worker-node directories.

I would start by dividing the files that DOCK uses into two categories: 
1) files that you will declare as inputs or outputs of Swift atomic 
procedures, which you should let Swift stage in and out automatically; 
and 2) files which can be considered part of the application's install 
directory (which can stay on each cluster's shared filesystem with the 
application code, or which can be shipped to each site in a preparation 
stage).

In addition, Swift will, within the execution of a script, avoid staging 
a file in twice if it can. The User Guide explains this under the 
property "caching.algorithm":

"Swift caches files that are staged in on remote resources, and files 
that are produced remotely by applications, such that they can be 
re-used if needed without being transfered again. However, the amount of 
remote file system space to be used for caching can be limited using the 
swift:storagesize profile entry in the sites.xml file."

So you could let Swift bring in even large files for you, to the shared 
filesystem, and your application wrapper script can cache these in a 
persistent application directory on the worker node.
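
As a rough illustration of that last point, here is a minimal sketch of a 
helper that a wrapper script could call. It is not taken from rundock: the 
function name cache_copy and the cache directory are hypothetical, and it 
assumes that file names are unique within the cache directory. It copies 
under a temporary name and then renames, so a concurrent job never sees a 
half-written file (although two jobs may occasionally both do the copy).

```shell
#!/bin/bash
# Hypothetical helper: copy a Swift-staged file into a persistent
# node-local directory unless a complete copy is already there.
# Prints the path of the local copy on success.
cache_copy() {
    local src=$1          # file on the site-shared filesystem
    local cache_dir=$2    # persistent directory on the worker node
    local dest="$cache_dir/$(basename "$src")"
    local tmp

    mkdir -p "$cache_dir" || return 1
    if [ ! -f "$dest" ]; then
        # Copy under a temporary name, then rename. A rename within one
        # filesystem is atomic, so other jobs never see a partial file.
        tmp=$(mktemp "$dest.XXXXXX") || return 1
        if ! { cp "$src" "$tmp" && mv "$tmp" "$dest"; }; then
            rm -f "$tmp"
            return 1
        fi
    fi
    echo "$dest"
}
```

A wrapper might then use it as 
receptor=$(cache_copy "$1" /tmp/dock-cache) || receptor=$1 
where $1 is assumed to be the input path handed to the wrapper and 
/tmp/dock-cache is only a placeholder.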

In rundock, you could use this approach by declaring the "receptor" 
protein molecule files (grid files and "selected spheres") as Swift 
inputs, and let Swift bring them to the grid site for you.

Lastly, see this note in the Environment Variables section of the User 
Guide:

"SWIFT_JOBDIR_PATH - set in env namespace profiles. If set, then Swift 
will use the path specified here as a worker-node local temporary 
directory to copy input files to before running a job. If unset, Swift 
will keep input files on the site-shared filesystem. In some cases, 
copying to a worker-node local directory can be much faster than having 
applications access the site-shared filesystem directly."

You can achieve the same effect of copying data to the local worker-node 
disk by doing so explicitly in your application wrapper script 
("rundock" in your case). If you know that you will be running many 
applications consecutively on the same worker nodes, e.g. because you are 
using Coaster or Falkon, then you can do what rundock does on the BG/P 
and cache data in a local directory *between* jobs. But, like rundock, 
you need to be careful to avoid races between multiple jobs on the same 
node, and must ensure that you can always get your data from the shared 
filesystem when it's not already cached there. Bash functions in rundock 
have the locking logic to do this.
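
The pattern described above can be sketched roughly as follows. This is 
not rundock's actual code, just one common way to do it in bash: mkdir 
serves as the lock because it atomically either creates the directory or 
fails. The name cached_path, the .lock and .part suffixes, and the 
60-second default wait are all assumptions made for illustration. If the 
cache cannot be filled, or a lock left behind by a killed job never 
clears, the function falls back to the file on the shared filesystem.

```shell
#!/bin/bash
# Hypothetical helper, not taken from rundock.
# Usage: cached_path SHARED_FILE CACHE_DIR [MAX_WAIT_SECONDS]
# Prints a path the application can read: the node-local copy when one
# is available, otherwise the original file on the shared filesystem.
cached_path() {
    local src=$1 cache_dir=$2 max_wait=${3:-60}
    local dest="$cache_dir/$(basename "$src")"
    local lock="$dest.lock"
    local waited=0

    # If the cache directory cannot be created, use the shared copy.
    mkdir -p "$cache_dir" 2>/dev/null || { echo "$src"; return 0; }

    while [ ! -f "$dest" ]; do
        # Only one job on the node can create the lock directory.
        if mkdir "$lock" 2>/dev/null; then
            if [ ! -f "$dest" ]; then
                # Copy to a .part name first so readers never see a
                # partial file; clean up if either step fails.
                cp "$src" "$dest.part" && mv "$dest.part" "$dest" \
                    || rm -f "$dest.part"
            fi
            rmdir "$lock"
            break
        fi
        # Another job holds the lock: wait, but not forever.
        if [ "$waited" -ge "$max_wait" ]; then
            break
        fi
        sleep 1
        waited=$((waited + 1))
    done

    if [ -f "$dest" ]; then
        echo "$dest"
    else
        echo "$src"    # fall back to the shared filesystem
    fi
}
```

Because the function always prints a usable path, a wrapper can call it 
unconditionally, for example grid=$(cached_path "$1" /tmp/dock-cache), 
where both $1 and /tmp/dock-cache are placeholders.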

Caching on the worker-node disk makes sense for data that will be read 
by many jobs, such as the receptor files: each of these will be read by 
15K jobs.

So there are actually several ways in which to manage your data.
Let's work out some of these cases, and then document them in the User 
Guide for future users, with examples.

- Mike

On 7/22/08 6:33 PM, Zhengxiong Hou wrote:
> Hi,
>    I'm using the Swift to execute application jobs on the 
> OSG grid sites.
>    In the sites.xml file, if the jobmanager is not "fork", 
> e.g. url="abitibi.sbgrid.org/jobmanager-condor". 
>    The job is usually executed on a local computing node, 
> which is not the "Gateway node" of the grid site.
>    But when executing the job, in the executable command, 
> such as a wrapper script "rundock", I want to dynamically 
> transmit the input data files from CI to the remote grid 
> site by "globus-url-copy", e.g.
>   globus-url-copy gsiftp://communicado.ci.uchicago.edu$ligpath file://$work/$ligfile
> And transmit the results data from remote grid site to CI 
> machine, e.g.
>   globus-url-copy file://$work/result.tar.gz gsiftp://communicado.ci.uchicago.edu/home/houzx/dock-run/databases/results/abitibi.sbgrid.org-$ligfile-result.tar.gz
>    The problem is that, the executing computing node is not 
> connected to the outside network, So the "globus-url-copy" 
> fails! Only using "jobmanager-fork", can it succeed, because 
> the job is executed on the Gateway node of the Grid site.
> 
>    The user may want to use the "jobmanager-condor" to 
> execute the jobs. At the same time, according to the 
> dynamically selected grid sites of Swift, they also want to 
> transmit the input and results data dynamically and 
> automatically by "jobmanager-fork". Because it is 
> troublesome to "globus-url-copy" the input and results data 
> to the remote grid sites manually, if there are large 
> amounts of data files.
> 
>    So, the question is how to implement it in Swift? Maybe 
> it's a common problem, but I didn't find it in the documents.
> 
> Thanks,
> Zhengxiong