[Swift-user] Issue with map reduce step one app to many

Michael Wilde wilde at mcs.anl.gov
Thu May 9 13:28:13 CDT 2013


Hi Lorenzo,

Swift cannot yet map an array of files returned from an app when the number of files is not known before the app runs. We've discussed how to do this and hope to add such semantics in the future.

In the meantime, the two techniques for doing this are:

- return a tar file or similar archive from the app() that creates an unknown number of files

- return a list of files from the app()
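
As a rough sketch of the first technique (the archive approach): the function names and paths below are hypothetical and not part of Swift. The idea is that the app's wrapper packs whatever it produced into one archive, which Swift can map as a single known output file, and a later step unpacks it.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only; not part of Swift.

# pack_outputs DIR ARCHIVE
# Bundle everything under DIR into one tar archive, so the app has a
# single, known output file no matter how many files it created.
pack_outputs() {
    tar -cf "$2" -C "$1" .
}

# unpack_outputs ARCHIVE DIR
# Extract the archive into DIR in a downstream step.
unpack_outputs() {
    mkdir -p "$2" && tar -xf "$1" -C "$2"
}
```

The archive should be written outside the directory being packed, so that tar does not try to include the archive in itself.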

The second technique works very nicely, especially if the entire script is being run on a cluster with a single shared filesystem, like Beagle. In your example, app1() would return the list of files it produces as a single text file, and you would then use that text file to map the array RGinfile[] using, for example, array_mapper.

One way to get app1() to return the desired list of files is by wrapping it in a shell script that does a selective "ls" or "find" on its output directory. Another way, if you really don't want to create a wrapper, is to have app1() return an "external" variable, and then call an app() that uses an sh -c script to find the data.

You'll need to make sure that app1() produces its output files in a persistent, known directory rather than in its temporary Swift-created "job dir" (which is the app's default current working directory when Swift runs it). That's another aspect that's easiest to deal with using a wrapper script around the actual application.
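
A minimal sketch of such a wrapper follows. The function name run_and_list and its argument order are made up for illustration, and it assumes the real application writes its output files into its current working directory and is given absolute input paths. It runs the command inside a persistent output directory instead of the Swift job dir, then writes the matching file names to a list file that the app() can return.

```shell
#!/bin/sh
# Hypothetical wrapper for illustration only; not part of Swift.

# run_and_list OUTDIR LISTFILE PATTERN CMD [ARGS...]
# Run CMD with OUTDIR (a persistent directory) as its working directory,
# then write the absolute paths of the files matching PATTERN, sorted,
# one per line, to LISTFILE.
run_and_list() {
    outdir=$1
    listfile=$2
    pattern=$3
    shift 3
    mkdir -p "$outdir" || return 1
    outdir=$(cd "$outdir" && pwd)          # make the path absolute
    ( cd "$outdir" && "$@" ) || return 1   # the real app writes its files here
    find "$outdir" -type f -name "$pattern" | sort > "$listfile"
}
```

Because the cd happens in a subshell, a relative LISTFILE is still resolved against the directory the wrapper was started in (the job dir), which is where Swift expects the app's declared output to appear.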

I'll try to post an example of this when time permits; another illustration is in the MODIS example program in the 2011 Swift paper from Parallel Computing.

- Mike


----- Original Message -----
> From: "Lorenzo Pesce" <lpesce at uchicago.edu>
> To: "Swift User Discussion List" <swift-user at ci.uchicago.edu>
> Sent: Wednesday, May 8, 2013 2:43:13 PM
> Subject: [Swift-user] Issue with map reduce step one app to many
> 
> This is more or less the step I would do. My problem is that I am not
> sure how do I arrange the return of a set of files without
> connecting them first and I can't connect them since they are not
> made yet.
> I could conceivably create a list first and use that, but I was
> curious to know whether there is a shortcut. The files in the
> intermediate step at least at this point are not important to us and
> don't need to be tracked.
> 
> file inbam;
> file [] RGinfile;
> file [] RGoutfile;
> 
> 
> (RGinfile) app1(inbam);
> 
> for file, idx in RGinfile {
> 
>   (RGoutBAM)=app2 (file);
>   RGoutfile [idx] = RGoutBAM ;
> 
> }
> 
> (BAM) = app3 (RGoutfile);
> 
> _______________________________________________
> Swift-user mailing list
> Swift-user at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user
> 


