[Swift-devel] Re: How to wait on functions that return no data?
Michael Wilde
wilde at mcs.anl.gov
Wed Mar 26 10:28:20 CDT 2008
Sorry - a long response follows to your simple question:
> For your example, what way do you want to store the data on the remote
> side - I'm assuming not individual files.
In this example, a C program takes in 5 data files describing parameters
of the petroleum refining process, and models various economic, emission
and production yields. We do parameter sweeps by varying a few of these
input vars and plotting their effect on an output var. The 5 files are
text files, bundled into the application wrapper as shell "here
documents" using cat <<END >datafileN. The parameters are inserted into
these data files using simple shell var substitution.
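A minimal sketch of that here-document technique follows; the variable names (xval, yval) and the datafile contents are invented for illustration, since the real wrapper bundles 5 such files for the refinery model:

```shell
#!/bin/sh
# Hypothetical sketch of the wrapper's here-document approach.
# The swept input vars arrive on the command line (defaults are
# illustrative only).
xval="${1:-12.5}"   # first swept input var
yval="${2:-3.0}"    # second swept input var

# Parameters are inserted via ordinary shell variable substitution
# as the here-document is written out.
cat <<END >datafile1
x_input = $xval
y_input = $yval
END

cat datafile1
```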
In the simple tests I'm running now, I vary 2 input vars, and plot one
output var.
Each run of the model, which takes about 1 sec, reads 3 parameters (id,
x, y) from a readdata() file, and puts out a similar line with a 4th
column, the z value (id, x, y, z). id is an int; x, y, and z are floats.
In the simplest runs, I just run one model per swift job. So id, x and y
are provided on the command line, and a single file is produced with the
tuple (id, x, y, z).
I am now testing a batched version, where the app-wrapper script takes a
range of x and y values with increments, and iterates over that range at
the specified increments. Each batch results in a single file with all
the output tuples for that batch. For this case, that is sufficient, and
the problem ends there.
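The batched wrapper could be sketched roughly like this; the ranges, increments, and the awk stand-in for the model binary are all assumptions for illustration:

```shell
#!/bin/sh
# Hypothetical sketch of the batched app-wrapper: iterate x and y over
# ranges at the given increments, run the model once per point, and
# append each (id, x, y, z) tuple to a single batch output file.
# Ranges are hard-coded here; the real wrapper takes them as arguments.
x0=0; x1=1; dx=0.5
y0=0; y1=1; dy=0.5
id=0
: > batch.out
for x in $(seq "$x0" "$dx" "$x1"); do
  for y in $(seq "$y0" "$dy" "$y1"); do
    # stand-in for the ~1 sec model run; here z is simply x + y
    z=$(awk -v x="$x" -v y="$y" 'BEGIN { printf "%.1f", x + y }')
    echo "$id $x $y $z" >> batch.out
    id=$((id + 1))
  done
done

# report the number of tuples produced (9 for these ranges)
awk 'END { print NR }' batch.out
```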
But I asked about the null values to explore a different approach: where
most batches run and just leave their outputs on a local filesystem,
concatenated into one file. The nice thing about having output in tuples
is that you can batch them in any arbitrary way, and the reduce step can
sort and select as needed.
I suspect you're not going to like this idea on first consideration. But
it's related to ideas on how to leverage map-reduce, as I mentioned
earlier, and Ian's suggestion to explore collective operations. Mihael
thought my take on this was inelegant and inconsistent with data flow. I
think it can be massaged to fit nicely in the model and provide useful
capabilities.
Here's one way I thought it could work with the addition of null/Nothing
to Swift.
The idea was that most or all invocations of the model jobs would return
Nothing, and the actual results would be collected later in large,
efficient batches.
If an invocation of a wrapper batch returns null, then a later job can
go and interrogate the workers to collect the data. One possibility in
the Falkon case was that one job would be broadcast to all workers, and
collect all files of a desired type. Another approach is that each job
ensures that there's a background task running on the worker, which
waits for either some accumulation of data or elapsed time, and then
transfers what was produced, as a single file. These files would be
returned either as results of arbitrary actual model runs, or by a
collector job that runs after all the models are complete.
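A collector job of that kind might be sketched as below; the directory layout, file names, and tuple values are invented, and the sort-by-id step is one possible choice for the reduce side:

```shell
#!/bin/sh
# Hypothetical collector sketch: gather the per-worker tuple files left
# behind on the workers' filesystems (simulated here by a local
# directory) and concatenate them into one result file, sorted by id.
mkdir -p workers
echo "2 1.0 0.0 1.0" > workers/w1.tuples
echo "0 0.0 0.0 0.0" > workers/w2.tuples
echo "1 0.5 0.0 0.5" > workers/w3.tuples

# the reduce step: merge all batches and sort numerically by id
sort -n workers/*.tuples > all.tuples

head -n 1 all.tuples   # prints the tuple with id 0
```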
But, separate from this data collection operation, a Nothing return has
a more direct use. It's handy in cases where you have a large set of
short jobs exploring some parameter space in which results are very
sparse. In those cases, it would be nice to have a way to say that a job
succeeded but returned null/Nothing. That reduces the need to pass back
a large number of files that signify "Nothing" in some inefficient
manner. It's also handy for executing jobs that have side effects while
still waiting for them to complete.
This gets us to a related issue:
If a swift job could efficiently return a set of swift objects without
using a file (specifically without placing files back in the shared
directory) then many of these apps could work beautifully, by returning
strings or numeric objects, possibly as structs and/or arrays, that
travel back through the job submission interface rather than getting
fetched via the data provider. If a cluster of jobs could return data
efficiently in a single "package" from the cluster, then we could pretty
readily do map-reduce in swift, efficiently, in perfect concordance with
the current dataflow model.
Perhaps this latter approach is the best to consider: I suspect it could
be readily implemented, could use a simple file to contain an arbitrary
set of swift object return values, possibly in a format similar to that
of readdata().
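Such a return-value file might look like the sketch below: a header line naming the struct members, then one whitespace-separated row per returned object. That readdata() accepts exactly this layout is an assumption here, not a confirmed detail.

```shell
#!/bin/sh
# Hypothetical return-value file in a readdata()-like layout: header of
# member names, then one row per returned swift object.  The field
# names and values are invented for illustration.
cat <<END >returns.data
id x y z
0 0.0 0.0 0.0
1 0.5 0.0 0.5
END

# count the returned objects (rows after the header)
awk 'NR > 1 { n++ } END { print n }' returns.data
```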
- Mike
On 3/25/08 6:04 PM, Ben Clifford wrote:
> On Tue, 25 Mar 2008, Michael Wilde wrote:
>
>> From a pure language point of view, we should permit the return of data that
>> can be grouped (batched) into files in arbitrary chunks, determined and
>> optimized by the implementation. Map-reduce tuples seem to work well for this
>> model, and it seems that Swift could encompass it with minimal semantic change
>> to the current language.
>
> For your example, what way do you want to store the data on the remote
> side - I'm assuming not individual files.
>
> The present dataset model should fairly easily accommodate the description
> of places to store data that aren't files - there's an abstraction in the
> implementation to help with that at the moment (DSHandle, which is what
> deals with the difference between in-memory values and on-disk files; and
> could fairly straightforwardly deal with other storage forms).
>
> One of the project ideas I put in for the google summer of code was to
> play around with this, in fact.
>