[Swift-user] sort on large data

Gagan Munisiddha Gowda ggowda at hawk.iit.edu
Mon Oct 20 23:28:33 CDT 2014


Great, Yadu!

Thanks for your help!

Regards,
Gagan
On 21/10/2014 6:33 am, "Yadu Nand Babuji" <yadunand at uchicago.edu> wrote:

>  Hi,
>
> @Jiada, Dongfang,
>
> I've updated the README on the https://github.com/yadudoc/cloud-tutorials
> page with documentation on how to use
> s3fs as a shared filesystem. I've added configs and links to external
> documentation. Please try it, and let me know
> if any of it is unclear or buggy.
>
> I would also appreciate help from anyone in testing this.
>
> @Gagan,
> That was most likely a bug in my scripts, where the user script is
> executed ahead of the installation of s3fs on the worker nodes.
> Please try again, and if you see the same behavior, please let me know.
>
> Thanks,
> Yadu
>
> On 10/19/2014 12:08 AM, Gagan Munisiddha Gowda wrote:
>
>   Hi Yadu,
>
>  I am headed in the same direction, trying to use a shared file system
> (an S3 bucket mounted via s3fs).
>
>  I have set WORKER_INIT_SCRIPT=/path/to/mounts3fs.sh in
> cloud-tutorials/ec2/configs (as mentioned in the tutorial).
>
>  Though I am able to set up the passwd-s3fs file in the desired location
> (using the mounts3fs.sh script), I see that the S3 bucket is not getting
> mounted.
>
>  I have verified the passwd-s3fs file and the mount point, and everything
> seems to be created as expected. One observation, though: the owner of these
> files was 'root', since they were created through setup.sh.
>
>  So I added commands to change the permissions and made 'ubuntu' the owner
> of all related files.
>
>  Even after all these changes, I see that the S3 bucket is still not
> mounted.
>
>  PS: If I connect to the workers and run the s3fs command manually, it
> does mount!
>
> sudo s3fs -o allow_other,gid=1000,use_cache=/home/ubuntu/cache <my-bucket>
> <mount-point>;
>
>  (I tried with and without sudo.)
>
>  Thanks for your help.
>
>
> On Sun, Oct 19, 2014 at 4:43 AM, Yadu Nand Babuji <yadunand at uchicago.edu>
> wrote:
>
>>  Hi Jiada Tu,
>>
>> 1) Here's an example of returning an array of files:
>>
>> type file;
>> app (file outs[]) make_outputs (file script)
>> {
>>     bash @script;
>> }
>>
>> file outputs[] <filesys_mapper; prefix="outputs">;
>> file script       <"make_outputs.sh">; # This script creates a few files with the "outputs" prefix
>> (outputs) = make_outputs(script);
>>
>> 2) The products of a successful task execution must be visible to the
>> headnode (where Swift runs), either
>> - through a shared filesystem (NFS, S3 mounted over s3fs, etc.), or
>> - by being brought back over the network.
>> However, we can reduce the overhead of moving the results to the headnode
>> and then back out to the workers for the reduce stage.
>>
>> I understand that this is part of your assignment, so I will try to
>> answer without getting too specific. At the same time, concepts from
>> Hadoop do not necessarily work directly in this context. So here are some
>> things to consider to get the best performance possible:
>>
>> - Assuming that the texts contain 10K unique words, your sort program
>> will generate a file containing at most 10K lines (at a few tens of bytes
>> per line, that is a few hundred KB, definitely under a MB). Is there any
>> advantage in splitting this into smaller files?
>>
>> - Since the final merge involves tiny files, you could very well do the
>> reduce stage on the headnode and be quite efficient
>>   (you can define the reduce app only for site.local):
>>
>>   sites : [local, cloud-static]
>>   site.local {
>>                 ....
>>                 app.reduce {
>>                         executable : ${env.PWD}/reduce.py
>>                 }
>>   }
>>
>>   site.cloud-static {
>>                 ....
>>                 app.python {
>>                         executable : /usr/bin/python
>>                 }
>>   }
>>
>>  This assumes that you are going to define your sorting app like this:
>>
>>   app (file freqs) sort (file sorting_script, file input) {
>>        python @sorting_script @input @freqs;
>>  }
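>>
>>  For the reduce side, a minimal, illustrative sketch of what the
>> reduce.py named in the site.local config above could look like
>> (assuming, purely as an example, that it is passed the output file
>> first and then the already-sorted per-worker files, and that it only
>> needs to do a k-way merge):
>>
>>   #!/usr/bin/env python
>>   # reduce.py (illustrative sketch): merge already-sorted input files
>>   # into a single sorted output file.
>>   # Usage: reduce.py merged.txt sorted-1 sorted-2 ...
>>   import heapq
>>   import sys
>>
>>   def main():
>>       out_path, in_paths = sys.argv[1], sys.argv[2:]
>>       inputs = [open(p) for p in in_paths]
>>       with open(out_path, 'w') as out:
>>           # heapq.merge streams the k-way merge lazily, so memory use
>>           # stays small even with many input files.
>>           for line in heapq.merge(*inputs):
>>               out.write(line)
>>       for f in inputs:
>>           f.close()
>>
>>   if __name__ == '__main__':
>>       main()
>>
>>  Since the files being merged are tiny, running this on the headnode
>> avoids shipping them back out to the workers.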
>>
>>
>> - The real cost is in having the original text reach the workers. This
>> can be made faster by:
>>     - a better headnode with better network/disk IO (I've measured
>> 140 Mbit/s between m1.medium nodes; c3.8xlarge comes with 975 Mbit/s), or
>>     - using S3 with s3fs and having the swift-workers pull data from S3,
>> which is quite scalable and removes the IO load from the headnode.
>>
>> - Identify the optimal data chunk size for your specific problem.
>> Each chunk of data in this case comes with the overhead of starting
>>   a new remote task, sending the data, and bringing the results back. Note
>> that the result of a wordcount on a file, whether it is 1 MB or 10 GB,
>>   is still at most ~1 MB (with the earlier assumptions).
>>
>> - Keep the data within the same datacenter, for cost as well as
>> performance. By limiting the cluster to US-Oregon we already do this.
>>
>> If you would like to attempt this using S3FS, let me know and I'll be
>> happy to explain it in detail.
>>
>> Thanks,
>> Yadu
>>
>>
>>
>> On 10/18/2014 04:18 PM, Jiada Tu wrote:
>>
>>  I am doing an assignment with Swift to sort large data. The data
>> contains one record (a string) per line. We need to sort the records based
>> on ASCII order. The data is too large to fit in memory.
>>
>>  The large data file is on the head node, and I run the Swift script
>> directly on the head node.
>>
>>  Here's what I plan to do:
>>
>>  1) Split the big file into 64MB files.
>> 2) Let each worker task sort one 64MB file. Say each task calls a
>> "sort.py" (written by me). sort.py will output a list of files, say
>> "sorted-worker1-001; sorted-worker1-002; ...". The first file contains the
>> records starting with 'a', the second those starting with 'b', etc. (A
>> sketch of such a sort.py appears below.)
>> 3) Now we will have all records starting with 'a' in
>> (sorted-worker1-001; sorted-worker2-001; ...), 'b' in
>> (sorted-worker1-002; sorted-worker2-002; ...), and so on. Then I send all
>> the files containing 'a' records to a "reduce" worker task and let it merge
>> them into one single file. Same for 'b', 'c', etc.
>> 4) Now we get 26 files (a-z), each sorted internally.
>>
>>  Basically, what I am doing is simulating map-reduce: step 2 is the map
>> and step 3 is the reduce.
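>>
>>  A rough, illustrative sketch of what such a sort.py could look like
>> (the argument order and file naming here are just placeholders):
>>
>>   #!/usr/bin/env python
>>   # sort.py (illustrative sketch): sort one 64MB chunk and split the
>>   # sorted records into one output file per leading character.
>>   # Usage: sort.py chunk-007 sorted-worker7
>>   import sys
>>
>>   def main():
>>       chunk_path, out_prefix = sys.argv[1], sys.argv[2]
>>       with open(chunk_path) as f:
>>           records = sorted(f.readlines())   # a 64MB chunk fits in memory
>>       outputs = {}   # leading character -> open output file
>>       for rec in records:
>>           key = rec[:1]
>>           if key not in outputs:
>>               outputs[key] = open("%s-%s" % (out_prefix, key), 'w')
>>           outputs[key].write(rec)
>>       for f in outputs.values():
>>           f.close()
>>
>>   if __name__ == '__main__':
>>       main()
>>
>>  Each per-letter output file is itself sorted, so the "reduce" in step 3
>> only has to merge already-sorted files.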
>>
>>  Here come some problems:
>> 1) For step 2, sort.py needs to output a list of files. How can a Swift app
>> function handle a list of outputs?
>>
>>     app (file[] outfiles) sort (file[] infiles) {
>>           sort.py // how to put the output files here?
>>     }
>>
>>  2) As far as I know (I may be wrong), Swift will stage all the output
>> files back to the local disk (here, the head node, since I run the Swift
>> script directly on the head node). So the output files from step 2 will be
>> staged back to the head node first, then staged from the head node to the
>> worker nodes for step 3, and then the 26 files from step 4 will be staged
>> back to the head node. I don't want this, because the network will be a
>> huge bottleneck. Is there any way to tell the "reduce" worker to get data
>> directly from the "map" worker? Maybe a shared file system would help, but
>> is there any way for the user to control the data staging between workers
>> without using a shared file system?
>>
>>  Since I am new to Swift, I may be totally wrong and misunderstanding
>> what Swift does. If so, please correct me.
>>
>>
>>
>>
>>
>> _______________________________________________
>> Swift-user mailing list
>> Swift-user at ci.uchicago.edu
>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user
>>
>
>
>
> --
> Regards,
> Gagan
>
>
>