Datastore architecture / design document

Robert Olson olson at mcs.anl.gov
Thu Dec 19 13:38:47 CST 2002


User operations that result in interactions with the data store:

(A) User enters venue. His client fills up with a list of files and
directories available in the venue.

(B) User double-clicks on a file. The file is downloaded and the
appropriate application is launched on his computer.

(C) User drags a file from his desktop into the file share. The file
is copied to the venue and made available. The list of files in the
venue updates with that new file.

(D) User brings up file properties window for a file. It shows who
created the file, when it was uploaded, and any access properties on
it. User renames the file.

(E) User wants to add a file or directory to his local exported
filestore. He drags the file or directory into the transient files
section in the client GUI.

----

Extended discussion.

(A) Discovery of files.

The venue description returned from the Enter() operation includes
a set of data item descriptions that describe the data objects present
in the venue and in all of the clients' transient data stores (1).

This description includes entries for both files and directories; that
is, the data server supports arbitrary directory trees.

Alternatives for descriptors:

    o full (relative) pathname in each descriptor
    o directories given distinct identities (inode); each file
      has a reference to its directory.

These two aren't actually that different; if you consider the pathname
to the directory as the unique ID the two approaches are similar.

Each file descriptor will look something like this:

      name: name of this file
      directory: full path to containing directory
      owner: DN
      size: size in bytes
      upload_time: date/time
      acl: acl for access to file
      transfer_spec: information required to download this file

The transfer spec contains the information required for a client to
download the file. This will likely be (for GASS-based transfers) the
URL from which the file can be obtained along with the DN of the
identity of the server holding the file.

Each directory descriptor will look like this:

      pathname: full path to directory
      owner: DN
      acl: acl for access to directory
      upload_transfer_spec: information required for upload of file to dir
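
As a minimal sketch, the two descriptors above could be written as
Python dataclasses. The field names follow the lists above; the field
types, the full_path() helper, and the example values are assumptions
made for illustration, since this document does not fix them.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any


@dataclass
class FileDescriptor:
    name: str                # name of this file
    directory: str           # full path to containing directory
    owner: str               # DN of the owner
    size: int                # size in bytes
    upload_time: datetime    # date/time of upload
    acl: Any                 # ACL format is not specified here
    transfer_spec: dict      # information required to download

    def full_path(self) -> str:
        # Hypothetical helper: join the directory and the file name.
        return self.directory.rstrip("/") + "/" + self.name


@dataclass
class DirectoryDescriptor:
    pathname: str               # full path to directory
    owner: str                  # DN of the owner
    acl: Any                    # ACL format is not specified here
    upload_transfer_spec: dict  # information required for upload


# Example values are invented for illustration only.
example_file = FileDescriptor(
    name="report.txt",
    directory="/docs",
    owner="/O=Example/CN=Some User",
    size=1024,
    upload_time=datetime(2002, 12, 19, 13, 38, 47),
    acl=None,
    transfer_spec={"url": "https://venue.example.org/docs/report.txt"},
)
```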

The aggregate description as sent to the client is a depth-first
traversal of the directory structure (so that the client always has
knowledge of directory before it receives the list of files for that
directory).
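
The ordering property described above (a directory always arrives
before its files) can be sketched with a pre-order walk. The tree
layout used here, a dict mapping each directory path to a tuple of
(file names, subdirectory paths), is an assumption chosen to keep the
example short; it is not the datastore's actual representation.

```python
def depth_first_listing(tree, path="/"):
    """Yield entries so that each directory precedes its contents."""
    files, subdirs = tree[path]
    # Announce the directory itself first.
    yield ("dir", path)
    # Then the files that live directly in it.
    for name in files:
        yield ("file", path, name)
    # Then descend into each subdirectory in turn.
    for sub in subdirs:
        yield from depth_first_listing(tree, sub)


# Hypothetical venue contents.
tree = {
    "/": (["a.txt"], ["/docs"]),
    "/docs": (["b.txt"], []),
}
listing = list(depth_first_listing(tree))
```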

Alternatively, the interface could be directory based. Given a
directory name, the server returns the list of files and
subdirectories in that directory.

Data store operations to support this functionality:

RetrieveDirectory(path) => ([files], [directories])

     Retrieve the contents of the directory rooted at <path>.
     Return a tuple containing the list of files in that directory
     and a list of subdirectories of that directory.

     Each entry in the file and directory lists is a descriptor as
     described above.
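
A minimal in-memory sketch of RetrieveDirectory follows. Only the
operation name and its return shape come from this document; the
InMemoryDataStore class, its add_dir/add_file helpers, and the use of
plain dicts as descriptors are assumptions for illustration.

```python
import posixpath


class InMemoryDataStore:
    def __init__(self):
        # Directory path -> directory descriptor (a plain dict here).
        self.dirs = {"/": {"pathname": "/"}}
        # Full file path -> file descriptor (a plain dict here).
        self.files = {}

    def add_dir(self, pathname):
        self.dirs[pathname] = {"pathname": pathname}

    def add_file(self, directory, name):
        full = posixpath.join(directory, name)
        self.files[full] = {"name": name, "directory": directory}

    def RetrieveDirectory(self, path):
        """Return ([files], [directories]) directly under path."""
        if path not in self.dirs:
            raise KeyError(path)
        files = [d for d in self.files.values()
                 if d["directory"] == path]
        # dirname("/") is "/", so exclude the directory itself.
        subdirs = [d for p, d in self.dirs.items()
                   if p != path and posixpath.dirname(p) == path]
        return files, subdirs


store = InMemoryDataStore()
store.add_dir("/docs")
store.add_file("/", "a.txt")
store.add_file("/docs", "b.txt")
root_files, root_dirs = store.RetrieveDirectory("/")
```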

Additional notes:

If files or directories are added, a notification can be sent
asynchronously to clients who have registered for these
notifications. The information in the notification can contain the
descriptor for the file or directory; these descriptors contain all
the information the client needs to use the new file or directory.
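
The registration idea can be sketched as a simple callback list. The
NotificationHub name and its methods are hypothetical, and this sketch
delivers notifications synchronously in-process; a real service would
deliver them asynchronously over the network, which is not shown.

```python
class NotificationHub:
    def __init__(self):
        self._listeners = []

    def register(self, callback):
        # A client registers interest by supplying a callable.
        self._listeners.append(callback)

    def notify_added(self, descriptor):
        # The notification carries the full descriptor, so the
        # client needs no follow-up query to use the new entry.
        for callback in list(self._listeners):
            callback(descriptor)


hub = NotificationHub()
received = []
hub.register(received.append)
hub.notify_added({"name": "slides.ppt", "directory": "/talks"})
```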

Footnotes:

(1) This is the clients-advertise-all-data model. The alternative is
     to require clients to query all other clients for their transient
     data. This has issues with latency and with the possibility that
     inbound connections to clients may be forbidden by site firewalls.

---


(B) File Transfer, Venue to User

Given the transfer spec in the file description, this is
straightforward. It's likely just an HTTP GET or an FTP operation.
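
For the HTTP GET case, a download could look roughly like the sketch
below, which uses only the Python standard library. The "url" key in
the transfer spec is an assumed name, and the sketch makes no attempt
to check the server's DN or to use GASS.

```python
import shutil
import urllib.request


def download(transfer_spec, dest_path):
    """Fetch the file named by transfer_spec["url"] into dest_path."""
    # ASSUMPTION: the transfer spec is a dict with a "url" entry.
    # Verifying the identity (DN) of the server is not shown.
    with urllib.request.urlopen(transfer_spec["url"]) as response, \
            open(dest_path, "wb") as out:
        # Stream the body to disk rather than reading it all at once.
        shutil.copyfileobj(response, out)
    return dest_path
```

Once the download completes, the client can hand dest_path to
whatever application is registered for that file type.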

---

(C) Desktop upload.

The user has specified (perhaps implicitly via the GUI) the directory
into which the file should be uploaded. The file can be either pushed
to the server by the client or pulled from the client by the
server. The interaction with the server may be simpler with a server
pull, but site-local firewall rules may forbid incoming connections to
a client. A server pull also requires the client to act as a server.

Hence, we choose to first define a client-push based mechanism. In the
upload_transfer_spec for a directory the client will find the
information required to effect a transfer. This may be a URL to which
a GASS-based HTTP PUT operation can be performed, or perhaps an FTP URL
to which a put is allowed.
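
A client-push upload over plain HTTP PUT might be prepared as in the
sketch below. The "url" key of the upload_transfer_spec and the
example host name are assumptions, and GASS authentication is not
shown. The function only builds the request; passing the result to
urllib.request.urlopen() would actually send it.

```python
import urllib.parse
import urllib.request


def build_upload_request(upload_transfer_spec, filename, data):
    """Build an HTTP PUT request for a file in the target directory."""
    # ASSUMPTION: the spec is a dict whose "url" names the directory.
    base = upload_transfer_spec["url"].rstrip("/")
    # Percent-encode the file name so spaces etc. are URL-safe.
    url = base + "/" + urllib.parse.quote(filename)
    return urllib.request.Request(url, data=data, method="PUT")


# Hypothetical spec and file contents.
spec = {"url": "http://venue.example.org/files/docs/"}
request = build_upload_request(spec, "my notes.txt", b"hello venue")
```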

---

(D) File properties and directory operations.

The client, upon receiving the file description, has the
metainformation about the file available for display.

File and directory operations, such as renaming, moving, and deletion,
are provided as a family of operations on the datastore service:

FileRename(oldname, newname) => descriptor for newly named file
    Rename the given file.

FileDelete(name) => success/failure
    Delete the given file.

DirCreate(full path) => descriptor for directory
    Create a new directory.

DirRename(oldpath, newpath) => descriptor for directory plus updated
descriptors for all files that have new names. (2)
    Rename a directory.

DirDelete(path)
    Delete a directory. This results in all files below that directory
    also being deleted.

(2) This actually argues for giving files unique IDs and treating the
     directory information as merely advisory.
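
To illustrate why DirRename has to return updated file descriptors
when files are identified by path, here is a small sketch. It assumes
descriptors are plain dicts with "name" and "directory" keys; the
dir_rename function is hypothetical and handles only the file
descriptors, not the directory descriptors themselves.

```python
def dir_rename(files, oldpath, newpath):
    """Rewrite the directory field of every file under oldpath.

    Returns the descriptors that changed, which is what a
    path-based DirRename must send back to the client.
    """
    updated = []
    prefix = oldpath.rstrip("/") + "/"
    for desc in files:
        directory = desc["directory"]
        if directory == oldpath:
            # File sits directly in the renamed directory.
            desc["directory"] = newpath
        elif directory.startswith(prefix):
            # File sits in a subdirectory of the renamed directory.
            rest = directory[len(prefix):]
            desc["directory"] = newpath.rstrip("/") + "/" + rest
        else:
            continue
        updated.append(desc)
    return updated


files = [
    {"name": "a", "directory": "/docs"},
    {"name": "b", "directory": "/docs/old"},
    {"name": "c", "directory": "/docs2"},
]
changed = dir_rename(files, "/docs", "/papers")
```

With unique file IDs, as footnote (2) suggests, a rename would touch
only the directory entry and none of the file descriptors.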

---

(E) Transient filestores.

A transient filestore is one that is provided from a user's personal
machine; that is, it is not a persistent Venue resource.

A transient filestore uses the same core filestore engine that the
venue filestore does; the primary difference lies in the mechanism by
which the filestore and its contents are discovered. Whereas the
location of the Venue's filestore is found in the description of the
Venue itself, the location of a transient filestore is an attribute of
the description of the user who is hosting that transient filestore.

All the same operations apply - file discovery, transfer, renaming,
etc. A transient filestore, however, is likely to have much different
access control policies. For instance, it is likely that a user would
not allow others en masse to have the ability to transfer files to his
machine, or to delete or rename files resident there.
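
One way to picture the difference in policy is the sketch below. The
operation names, the store kinds, and in particular the rule that a
venue store permits every operation are assumptions invented for this
example; this document does not define the ACL format or the actual
policies.

```python
READ_OPS = {"discover", "download"}
WRITE_OPS = {"upload", "rename", "delete"}


def is_allowed(store_kind, operation, requester_dn, owner_dn):
    """Decide whether requester_dn may perform operation."""
    if operation in READ_OPS:
        # Both kinds of store let others see and fetch files.
        return True
    if operation in WRITE_OPS:
        if store_kind == "transient":
            # Only the hosting user may modify his own machine.
            return requester_dn == owner_dn
        # ASSUMPTION: the venue store is open to all participants.
        return True
    raise ValueError("unknown operation: %r" % (operation,))
```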

----------------

Data store implementation notes.

The hierarchical directory of files as presented by the datastore API
may or may not be bound to an actual hierarchical directory of
files. For a Venue datastore, it may be reasonable for that to be the
case. For a transient user-based datastore, it may be reasonable to
present that view, while the files that are being exported in this
manner actually reside at varied places on the user's filesystem
(having arrived in the datastore by being dragged as needed into the
datastore's user interface).
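
The decoupling of the presented hierarchy from the real filesystem
can be sketched as a simple mapping. The VirtualNamespace class and
its method names are hypothetical, and the paths in the example are
invented.

```python
import posixpath


class VirtualNamespace:
    def __init__(self):
        # Virtual path shown to clients -> physical path on disk.
        self._physical = {}

    def export(self, virtual_path, physical_path):
        # Called when the user drags a file into the datastore UI.
        self._physical[virtual_path] = physical_path

    def resolve(self, virtual_path):
        return self._physical[virtual_path]

    def list_directory(self, virtual_dir):
        # The hierarchy exists only in the virtual names.
        return sorted(posixpath.basename(v) for v in self._physical
                      if posixpath.dirname(v) == virtual_dir)


ns = VirtualNamespace()
ns.export("/shared/slides.ppt", "/home/user/talks/slides.ppt")
ns.export("/shared/notes.txt", "/tmp/notes.txt")
```

Here both files appear under /shared to other users even though they
live in unrelated places on the hosting machine.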

The Python implementation of the datastore is split into two main
objects: a DataStore which provides the hierarchical file storage
abstraction and the internal bookkeeping that maps the virtual
filename space to physical files, and a TransferEngine which provides
the functionality required for the actual upload and download of
files.

The DataStore relies on the TransferEngine to provide it with the
transfer_spec portion of the file description, for any given file.
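
The division of labour between the two objects might look like the
sketch below. Only the class names DataStore and TransferEngine come
from this document; every method name, the constructor arguments, and
the URL-based transfer spec are assumptions, not the actual API.

```python
class TransferEngine:
    def __init__(self, base_url):
        self.base_url = base_url.rstrip("/")

    def transfer_spec(self, virtual_path):
        # ASSUMPTION: a spec is just a dict holding a download URL.
        return {"url": self.base_url + virtual_path}


class DataStore:
    def __init__(self, transfer_engine):
        self._engine = transfer_engine
        self._files = {}  # virtual path -> stored metadata

    def add_file(self, virtual_path, owner, size):
        self._files[virtual_path] = {"owner": owner, "size": size}

    def describe(self, virtual_path):
        """Build a file description for the given virtual path."""
        meta = self._files[virtual_path]
        directory, _, name = virtual_path.rpartition("/")
        return {
            "name": name,
            "directory": directory or "/",
            "owner": meta["owner"],
            "size": meta["size"],
            # The DataStore does not know how transfers work; it
            # asks the TransferEngine for this part.
            "transfer_spec": self._engine.transfer_spec(virtual_path),
        }


engine = TransferEngine("http://venue.example.org/data/")
datastore = DataStore(engine)
datastore.add_file("/docs/a.txt", owner="/CN=Some User", size=12)
description = datastore.describe("/docs/a.txt")
```

Keeping the transfer details behind one object means a different
transfer mechanism could be swapped in without touching the
bookkeeping code.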

The DataStore API closely follows the API described in (A) - (D)
above; however, this API is a Python object API rather than a web
services API.



