[Swift-user] How can one force ordering into swift operations

Lorenzo Pesce lpesce at uchicago.edu
Thu Jan 24 13:09:40 CST 2013


Thanks a lot Mike & all.

Let me try this and get back to you. 

The devil is in the details and in the flood of genome datafiles that will come out of this...
So far my main problem is that there are 10 possible approaches, most of which don't scale to 50,000 files, 12,000 tasks and 30 TB of data :-(



On Jan 24, 2013, at 11:59 AM, Michael Wilde wrote:

> 
> Here's a split-and-process example:
> 
> $ cat SplitAndProcess.swift
> 
> type file;
> 
> app (file flist) split (file i)
> {
>  sh "-c" @strcat("split -l 50 ", @filename(i), " /tmp/segment ; /bin/ls -1 /tmp/segment??") stdout=@filename(flist);
> }
> 
> app (file counts) wc (file i)
> {
>  sh "-c" @strcat("wc ", @filename(i)) stdout=@filename(counts);
> }
> 
> file infile<"infile">;
> 
> string segnames[] = readData(split(infile));
> 
> foreach s,i in segnames {
>  file segment <single_file_mapper; file=s>;
>  string counts = readData(wc(segment));
>  tracef("segment %i is file %s, counts=%s\n", i, s, counts );
> }
> 
> $ wc -l infile
> 
> 460 infile
> 
> $ swift -config cf -tc.file tc -sites.file local.xml SplitAndProcess.swift 
> 
> Warning: Procedure split is deprecated, at 15
> Warning: Procedure wc is deprecated, at 19
> Swift trunk swift-r6151 cog-r3552 (cog modified locally)
> 
> RunID: 20130124-1152-gg21bvq8
> Progress:  time: Thu, 24 Jan 2013 11:52:06 -0600
> Progress:  time: Thu, 24 Jan 2013 11:52:08 -0600  Active:9  Checking status:1  Finished successfully:1
> segment 0 is file /tmp/segmentaa, counts=  50  267 2584 tmp/segmentaa
> segment 5 is file /tmp/segmentaf, counts=  50  597 7284 tmp/segmentaf
> segment 1 is file /tmp/segmentab, counts=  50  350 4196 tmp/segmentab
> segment 8 is file /tmp/segmentai, counts=  50  579 7082 tmp/segmentai
> segment 2 is file /tmp/segmentac, counts=  50  452 4949 tmp/segmentac
> segment 9 is file /tmp/segmentaj, counts= 10  71 835 tmp/segmentaj
> segment 4 is file /tmp/segmentae, counts=  50  490 6093 tmp/segmentae
> segment 3 is file /tmp/segmentad, counts=  50  589 7026 tmp/segmentad
> segment 7 is file /tmp/segmentah, counts=  50  498 6047 tmp/segmentah
> segment 6 is file /tmp/segmentag, counts=  50  591 7046 tmp/segmentag
> Final status: Thu, 24 Jan 2013 11:52:08 -0600  Finished successfully:11
> 
> Note that the script forces the split segments to be written to /tmp; otherwise they would be written to the job directory in which the split() app runs. This is not "location independent", but it works fine when you run split on the local host. You can use $PWD instead of /tmp by passing it into swift (e.g., -cwd=$PWD) and adjusting the script accordingly.
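
A minimal shell sketch of the command the split app runs, with the output directory passed in rather than hard-coded to /tmp. It is plain POSIX shell, not Swift; the function name split_to_dir and its arguments are hypothetical, chosen only for illustration. Only split and ls are taken from the script above.

```shell
#!/bin/sh
# Hypothetical helper: split a file into 50-line segments under a
# caller-chosen directory and print the segment paths, one per line.
split_to_dir() {
    input=$1      # file to split
    outdir=$2     # directory that receives the segments, e.g. "$PWD"
    mkdir -p "$outdir" || return 1
    # Same split invocation as in the script, but with a variable prefix.
    split -l 50 "$input" "$outdir/segment" || return 1
    # split uses two-letter suffixes by default: segmentaa, segmentab, ...
    ls -1 "$outdir"/segment??
}
```

For example, `split_to_dir infile "$PWD/segments" > flist` would leave the list of segment names in flist, which is the role the stdout of split() plays in the script.
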
> 
> - Mike
> 
> ----- Original Message -----
>> From: "Michael Wilde" <wilde at mcs.anl.gov>
>> To: "Daniel S. Katz" <dsk at ci.uchicago.edu>
>> Cc: "Glen Hocky" <hockyg at gmail.com>, "Swift User Discussion List" <swift-user at ci.uchicago.edu>
>> Sent: Thursday, January 24, 2013 10:42:55 AM
>> Subject: Re: [Swift-user] How can one force ordering into swift operations
>> 
>> 
>> A few brief additional tips to help you make progress with this:
>> 
>> - your split app can create and return a single file containing a
>> list of file names
>> 
>> - use readData to read that list into an array; then use one of the
>> array mappers to map the list of files.
>> 
>> Separately: the "flag" Dan suggests can also be done using a variable
>> of type "external" which allows you to do explicit synchronization.
>> It's only honored as a return or an input of an app() function.
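
The list-file pattern in the first two tips can be tried outside Swift. The sketch below is plain POSIX shell, not Swift, and process_list is a hypothetical name: it reads a file of names, one per line (roughly what readData does with the list), and processes each named file in turn. Unlike a Swift foreach, the loop runs sequentially.

```shell
#!/bin/sh
# Hypothetical helper (not part of Swift): consume a list file produced by
# a split step and process every file it names.
process_list() {
    listfile=$1   # text file with one segment path per line
    i=0
    # Read the list line by line; IFS= and -r keep each path verbatim.
    while IFS= read -r name; do
        # Count the lines of the segment; tr strips padding some wc add.
        nlines=$(wc -l < "$name" | tr -d ' ')
        printf 'segment %d is file %s, lines=%s\n' "$i" "$name" "$nlines"
        i=$((i + 1))
    done < "$listfile"
}
```

Because the list file is only read after the step that writes it has returned, the consumer never sees a partial set of segments.
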
>> 
>> - Mike
>> 
>> ----- Original Message -----
>>> From: "Daniel S. Katz" <dsk at ci.uchicago.edu>
>>> To: "Lorenzo Pesce" <lpesce at uchicago.edu>
>>> Cc: "Glen Hocky" <hockyg at gmail.com>, "Swift User Discussion List"
>>> <swift-user at ci.uchicago.edu>
>>> Sent: Thursday, January 24, 2013 10:20:52 AM
>>> Subject: Re: [Swift-user] How can one force ordering into swift
>>> operations
>>> 
>>> 
>>> You could just add an artificial dependency. Make step 1 output a
>>> file "flag" when it is done.
>>> 
>>> 
>>> Make step 2 dependent on file "flag".
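
A plain-shell analogue of the flag idea, as a rough sketch only. In Swift the dependency would be carried by a file variable and ordered by the runtime; here the check is explicit. The names step1, step2, part1 and the working-directory layout are hypothetical.

```shell
#!/bin/sh
# Hypothetical two-step pipeline; all names here are illustrative only.
step1() {
    workdir=$1
    mkdir -p "$workdir" || return 1
    printf 'data\n' > "$workdir/part1" || return 1   # stand-in for real work
    # Write the flag last, so its existence implies step 1 has finished.
    : > "$workdir/flag"
}

step2() {
    workdir=$1
    # Refuse to run until step 1 has produced its flag.
    if [ ! -e "$workdir/flag" ]; then
        echo "step 1 not finished" >&2
        return 1
    fi
    # Only now is it safe to enumerate the files step 1 created.
    ls -1 "$workdir"/part*
}
```

The point of writing the flag last is that step 2 never has to guess how many files step 1 will create; it only needs to know that step 1 is done.
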
>>> 
>>> 
>>> Dan
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On Jan 24, 2013, at 11:13 AM, Lorenzo Pesce <lpesce at uchicago.edu> wrote:
>>> 
>>> 
>>> 
>>> So I return an array of files of unknown size (I don't know how
>>> many files there will be) to the calling swift script?
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On Jan 24, 2013, at 10:08 AM, Glen Hocky wrote:
>>> 
>>> 
>>> 
>>> Lorenzo,
>>> This may not work for your purposes, but a simple solution, similar
>>> to what I do, is to actually do step 1 in the wrapper before the
>>> mapping is done. This guarantees that all files are in place.
>>> 
>>> 
>>> Best,
>>> Glen
>>> 
>>> 
>>> 
>>> On Thu, Jan 24, 2013 at 10:43 AM, Lorenzo Pesce <lpesce at uchicago.edu> wrote:
>>> 
>>> 
>>> I have a simple problem:
>>> step 1: I run an app that splits a file into a group of files, and we
>>> don't know how many there will be.
>>> step 2: I want to map those files using a mapper after the fact.
>>> 
>>> The problem is that the mapper doesn't know that it can't run until
>>> step 1 is done, because it has no input files. How can I tell the
>>> mapper (and, by consequence, what follows it, since those files will
>>> not be there) that it has to wait for step 1 to be finished?
>>> 
>>> Thanks,
>>> 
>>> Lorenzo
>>> _______________________________________________
>>> Swift-user mailing list
>>> Swift-user at ci.uchicago.edu
>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user
>>> 
>>> 
>>> 
>>> 
>>> 
>>> --
>>> 
>>> 
>>> 
>>> 
>>> 
>>> Daniel S. Katz
>>> University of Chicago
>>> (773) 834-7186 (voice)
>>> (773) 834-6818 (fax)
>>> d.katz at ieee.org or dsk at ci.uchicago.edu
>>> http://www.ci.uchicago.edu/~dsk/
>>> 
>>> 
>>> 
>>> 
>>> 



