[Swift-devel] Re: How to wait on functions that return no data?

Michael Wilde wilde at mcs.anl.gov
Wed Mar 26 10:28:20 CDT 2008


Sorry - a long response follows to your simple question:

 > For your example, what way do you want to store the data on the remote
 > side - I'm assuming not individual files.

In this example, a C program takes in 5 data files describing parameters 
of the petroleum refining process, and models various economic, 
emissions, and production yields. We do parameter sweeps by varying a few 
of these input vars and plotting their effect on an output var. The 5 
files are text files, bundled into the application wrapper as shell "here 
documents" using cat <<END >datafileN. The parameters are inserted into 
these data files using simple shell variable substitution.
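As a rough illustration, the wrapper pattern might look like the sketch below. This is hypothetical: the variable names and the data-file format are invented here, not taken from the real model.

```shell
#!/bin/sh
# Sketch of the wrapper pattern: the input file is bundled as a shell
# here document, and shell variable substitution inserts the swept
# parameter values. Names and file format are illustrative only.
X=1.5
Y=2.5

cat <<END >datafile1
x_param = $X
y_param = $Y
END
```
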

In the simple tests I'm running now, I vary 2 input vars, and plot one 
output var.

Each run of the model, which takes about 1 sec, reads 3 parameters (id, 
x, y) from a readdata() file, and puts out a similar line with a 4th 
column, the z value (id, x, y, z). Id is an int; x, y, and z are floats.

In the simplest runs, I just run one model per swift job. So id, x and y 
are provided on the command line, and a single file is produced with the 
tuple (id, x, y, z).

I am now testing a batched version, where the app-wrapper script takes a 
range of x and y values with increments, and iterates over that range at 
the specified increments. Each batch results in a single file containing 
all the output tuples for that batch. For this case, that is fine, and 
is the end of the problem.
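The batched wrapper described above could be sketched roughly as follows. This is only an illustration of the iteration structure: the real model is a C executable, so here it is replaced by a placeholder function, and the ranges, increments, and file names are invented.

```shell
#!/bin/sh
# Sketch of a batched app-wrapper: iterate x and y over given ranges at
# fixed increments, run the model once per (x, y) pair, and append each
# (id, x, y, z) tuple to a single batch output file.
model() { echo "$1 $2 $3 0.0"; }   # placeholder; the real model computes z

X0=0; X1=2; XI=1      # x range and increment (illustrative integers)
Y0=0; Y1=2; YI=1      # y range and increment
id=0
: > batch.out          # one output file per batch
x=$X0
while [ "$x" -le "$X1" ]; do
  y=$Y0
  while [ "$y" -le "$Y1" ]; do
    id=$((id + 1))
    model "$id" "$x" "$y" >> batch.out
    y=$((y + YI))
  done
  x=$((x + XI))
done
```
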

But I asked about the null values to explore a different approach: one 
where most batches run and just leave their outputs on a local 
filesystem, concatenated into one file. The nice thing about having the 
output in tuples is that you can batch them in any arbitrary way, and 
the reduce step can sort and select as needed.
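The "sort and select" reduce step over batched tuple files could be sketched like this. The file names and values here are invented for illustration; the point is that batches can be concatenated in any order and then reduced.

```shell
#!/bin/sh
# Sketch of the reduce step: each batch file holds (id, x, y, z) tuples,
# one per line, so batches can be concatenated in any arbitrary order,
# then sorted by id and filtered as needed.
printf '2 0.5 1.0 7.1\n1 0.5 0.5 3.2\n' > batch1.out
printf '3 1.0 0.5 9.9\n' > batch2.out

cat batch1.out batch2.out | sort -n -k1,1 > all.out   # merge and sort by id
awk '$4 > 5.0' all.out > selected.out                 # select tuples with large z
```
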

I suspect you're not going to like this idea on first consideration. But 
it's related to ideas on how to leverage map-reduce, as I mentioned 
earlier, and to Ian's suggestion to explore collective operations. Mihael 
thought my take on this was inelegant and inconsistent with dataflow. I 
think it can be massaged to fit nicely into the model and provide useful 
capabilities.

Here's one way I thought it could work with the addition of null/Nothing 
to Swift.

The idea was that most or all invocations of the model jobs would return 
Nothing, and the actual results would be collected later in large, 
efficient batches.

If an invocation of a wrapper batch returns null, then a later job can 
go and interrogate the workers to collect the data. One possibility in 
the Falkon case was that one job would be broadcast to all workers to 
collect all files of a desired type. Another approach is that each job 
ensures that there's a background task running on the worker, which 
waits for either some accumulation of data or some elapsed time, and 
then transfers what was produced as a single file. These files would be 
returned either as results of arbitrary actual model runs, or by a 
collector job that runs after all the models are complete.

But, separate from this data-collection operation, a Nothing return has 
a more direct use. It's handy when you have a large set of short jobs 
exploring some parameter space in which results are very sparse. In 
those cases, it would be nice to have a way to say that a job succeeded 
but returned null/Nothing. That removes the need to pass back a large 
number of files that signify "nothing" in some inefficient manner.

It's also handy for executing jobs that have side effects while still 
waiting for them to complete.

This gets us to a related issue:

If a Swift job could efficiently return a set of Swift objects without 
using a file (specifically, without placing files back in the shared 
directory), then many of these apps could work beautifully, by returning 
strings or numeric objects, possibly as structs and/or arrays, that 
travel back through the job submission interface rather than getting 
fetched via the data provider. If a cluster of jobs could return data 
efficiently in a single "package" from the cluster, then we could pretty 
readily do map-reduce in Swift, efficiently, in perfect concordance with 
the current dataflow model.

Perhaps this latter approach is the best one to consider: I suspect it 
could be readily implemented, and could use a simple file to contain an 
arbitrary set of Swift object return values, possibly in a format 
similar to that of readdata().
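For illustration, such a return file might use a readdata()-style layout: a whitespace-delimited header line naming the fields, followed by one tuple per row. The column names and values below are invented, not part of any existing format.

```
id  x    y    z
1   0.5  0.5  3.2
2   0.5  1.0  7.1
```

A file like this could carry back an arbitrary array of structs through the submission interface, and a consumer script could read it with the same machinery readdata() uses today.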

- Mike

On 3/25/08 6:04 PM, Ben Clifford wrote:
> On Tue, 25 Mar 2008, Michael Wilde wrote:
> 
>> From a pure language point of view, we should permit the return of data that
>> can be grouped (batched) into files in arbitrary chunks, determined and
>> optimized by the implementation. Map-reduce tuples seem to work well for this
>> model, and it seems that Swift could encompass it with minimal semantic change
>> to the current language.
> 
> For your example, what way do you want to store the data on the remote 
> side - I'm assuming not individual files.
> 
> The present dataset model should fairly easily accommodate the description 
> of places to store data that aren't files - there's an abstraction in the 
> implementation to help with that at the moment (DSHandle, which is what 
> deals with the difference between in-memory values and on-disk files; and 
> could fairly straightforwardly deal with other storage forms).
> 
> One of the project ideas I put in for the google summer of code was to 
> play around with this, in fact.
> 
