<div dir="ltr"><div><div><div>Hi Yadu,<br><br></div>I am heading in the same direction: I am trying to use a shared file system (an S3 bucket mounted via S3FS).<br><br></div>I have set: <span style="color:rgb(51,51,51);font-family:'Helvetica Neue',Helvetica,'Segoe UI',Arial,freesans,sans-serif;font-size:16px;font-style:normal;font-variant:normal;font-weight:normal;letter-spacing:normal;line-height:25.6px;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;display:inline!important;float:none;background-color:rgb(255,255,255)"><i><font>WORKER_INIT_SCRIPT=/path/to/mounts3fs.sh in cloud-tutorials/ec2/configs</font></i><span class=""><i><font> (as mentioned in the tutorials)</font></i><br><br></span></span></div><div>Although I am able to set up the passwd-s3fs file in the desired location (using the mounts3fs.sh script), I see that the S3 bucket is not getting mounted.<br><br></div><div>I have verified the passwd-s3fs file and the mount point, and both seem to be created as expected.
However, I noticed that these files were owned by the 'root' user, since they were created through setup.sh.<br><br></div><div>So I added more commands to change the permissions and made 'ubuntu' the owner of all related files.<br><br></div><div>Even after all these changes, the S3 bucket is still not mounted.<br><br></div><div><b>PS: If I connect to the workers and run the s3fs command manually, it does mount!</b><br><br><font size="4">sudo s3fs -o allow_other,gid=1000,use_cache=/home/ubuntu/cache <my-bucket> <mount-point>;</font><br></div><div><div><div><br></div><div>(tried with and without sudo)<br></div><div><br></div><div>Thanks for your help.<br><br></div></div></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Sun, Oct 19, 2014 at 4:43 AM, Yadu Nand Babuji <span dir="ltr"><<a href="mailto:yadunand@uchicago.edu" target="_blank">yadunand@uchicago.edu</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

  <div bgcolor="#FFFFFF" text="#000000">

    Hi Jiada Tu,<br>

    <br>

    1) Here's an example of returning an array of files:<br>

    <br>

    type file;<br>

    app (file outs[]) make_outputs (file script)<br>

    {<br>

        bash @script;<br>

    }<br>

    <br>

    file outputs[] <filesys_mapper; prefix="outputs">;<br>

    file script       <"make_outputs.sh">; # This script creates a few files with "outputs" as the prefix<br>

    (outputs) = make_outputs(script);<br>

    <br>
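    For illustration, a worker script that produces several output files
    sharing the "outputs" prefix could be sketched in Python as follows.
    This is only a sketch under assumptions: the function name
    sort_and_partition, the "prefix-NNN" naming scheme (NNN being the
    character code of the first character) and the one-record-per-line
    input are hypothetical and not part of the tutorial.<br>
    <br>

```python
from collections import defaultdict
from pathlib import Path


def sort_and_partition(input_path, out_dir=".", prefix="outputs"):
    """Sort the lines of one chunk and write one file per leading character.

    The naming scheme "<prefix>-<code>" (code = ord of the first character,
    zero-padded) is an assumption made for this illustration.
    Returns the list of file names written, in sorted order.
    """
    buckets = defaultdict(list)
    with open(input_path) as handle:
        for line in handle:
            record = line.rstrip("\n")
            if record:  # skip empty lines
                buckets[record[0]].append(record)

    written = []
    for first_char in sorted(buckets):
        name = f"{prefix}-{ord(first_char):03d}"
        records = sorted(buckets[first_char])  # plain string (code point) order
        (Path(out_dir) / name).write_text("\n".join(records) + "\n")
        written.append(name)
    return written
```

    Because every file name starts with the same prefix, a mapper
    declared with that prefix can pick the files up as one array.<br>
    <br>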

    2) The products of a successful task execution must be visible to
    the headnode (where Swift runs), either:<br>
    - through a shared filesystem (NFS, S3 mounted over s3fs, etc.), or<br>
    - by being brought back over the network.<br>
    However, we can reduce the overhead of moving the results to the
    headnode and then to the workers for the reduce stage.<br>

     <br>

    I understand that this is part of your assignment, so I will try to
    answer without getting too specific. At the same time, <br>
    concepts from Hadoop do not necessarily carry over directly to this
    context. So here are some things to consider to get<br>
    the best performance possible:<br>

    <br>

    - Assuming that the texts contain 10K unique words, your sort
    program will generate a file containing at most 10K lines<br>
     (which would definitely be under 1 MB). Is there any advantage
    in splitting this into smaller files?<br>

    <br>

    - Since the final merge involves tiny files, you could very well do

    the reduce stage on the headnode and be quite efficient<br>

      (you can define the reduce app only for site:local)<br>

    <br>

      sites : [local, cloud-static]<br>

      site.local {<br>

                    ....<br>

                    app.reduce {<br>

                            executable : ${env.PWD}/reduce.py<br>

                    }    <br>

      }<br>

    <br>

      site.cloud-static {<br>

                    ....<br>

                    app.python {<br>

                            executable : /usr/bin/python<br>

                    }<br>

    <br>

     }<br>

    <br>

     This assumes that you are going to define your sorting app like

    this :<br>

    <br>

      app (file freqs) sort (file sorting_script, file input ) {<br>

           python @sorting_script @input; <br>

     }<br>

      <br>

    <br>
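    As a rough sketch of what such a reduce step on the headnode could
    look like, assuming each per-chunk output file holds one "word count"
    pair per line. The file format and the function names merge_counts
    and write_sorted are assumptions for illustration, not something the
    tutorial prescribes.<br>
    <br>

```python
from collections import Counter


def merge_counts(paths):
    """Sum per-word counts from files with one 'word count' pair per line.

    The two-column, whitespace-separated format is an assumption.
    """
    totals = Counter()
    for path in paths:
        with open(path) as handle:
            for line in handle:
                parts = line.split()
                if len(parts) != 2:
                    continue  # ignore blank or malformed lines
                word, count = parts
                totals[word] += int(count)
    return totals


def write_sorted(totals, out_path):
    """Write the merged totals as 'word count' lines, sorted by word."""
    with open(out_path, "w") as handle:
        for word in sorted(totals):
            handle.write(f"{word} {totals[word]}\n")
```

    Since each input is tiny, holding all totals in memory on the
    headnode is cheap in this scenario.<br>
    <br>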

    - The real cost is in getting the original text to the workers.
    This can be made faster by:<br>
        - Using a better headnode with better network/disk IO (I've measured
    140 Mbit/s between m1.medium nodes; c3.8xlarge comes with 975 Mbit/s)<br>
        - Using S3 with s3fs and having the Swift workers pull data from S3,
    which is pretty scalable and removes the IO load from the headnode.<br>

    <br>
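    To put those two link speeds in perspective, here is a small
    back-of-the-envelope calculation. It is an idealized sketch: the
    10 GB input size is a hypothetical figure, decimal units are assumed,
    and protocol overhead and disk speed are ignored.<br>
    <br>

```python
def transfer_seconds(size_bytes, link_mbit_per_s):
    """Ideal time to move size_bytes over a link of the given rate.

    Ignores protocol overhead, latency and disk IO (an idealization).
    """
    bits = size_bytes * 8
    return bits / (link_mbit_per_s * 1_000_000)


ten_gb = 10 * 10**9  # hypothetical 10 GB input, decimal units
for label, rate in [("m1.medium", 140), ("c3.8xlarge", 975)]:
    print(f"{label}: {transfer_seconds(ten_gb, rate):.0f} s")
```

    Under these assumptions the same data takes roughly 571 s at
    140 Mbit/s and roughly 82 s at 975 Mbit/s.<br>
    <br>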

    - Identify the optimal size of the data chunks for your specific
    problem. Each chunk of data in this case comes with the overhead of
    starting<br>
      a new remote task, sending the data and bringing the results back.
    Note that the result of a wordcount on a file, whether it is 1 MB or
    10 GB,<br>
      is still at most 1 MB (with the earlier assumptions)<br>

    <br>
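    The fixed part of that overhead scales with the number of chunks,
    which is easy to estimate. The sketch below is a toy model only:
    the function names and the per-task overhead are hypothetical, and
    the per-task overhead is something you would have to measure on
    your own cluster.<br>
    <br>

```python
def task_count(total_bytes, chunk_bytes):
    """Number of remote tasks when the input is cut into fixed-size chunks."""
    # Integer ceiling division, so a partial last chunk still counts as a task.
    return (total_bytes + chunk_bytes - 1) // chunk_bytes


def fixed_overhead_seconds(total_bytes, chunk_bytes, per_task_overhead_s):
    """Total startup cost summed over all tasks.

    per_task_overhead_s is a hypothetical, measured value, not a known constant.
    """
    return task_count(total_bytes, chunk_bytes) * per_task_overhead_s
```

    For example, a 10 GB input cut into 64 MB chunks (binary units)
    gives task_count(10 * 1024**3, 64 * 1024**2) == 160 tasks, so the
    fixed cost grows in proportion as the chunks get smaller.<br>
    <br>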

    - Ensure that the data is in the same datacenter as the workers, for
    cost as well as performance. By limiting the cluster to US-Oregon we
    already do this.<br>

    <br>

    If you would like to attempt this using S3FS, let me know; I'll be
    happy to explain that in detail.<br>

    <br>

    Thanks,<br>

    Yadu<div><div class="h5"><br>

    <br>

    <br>

    <div>On 10/18/2014 04:18 PM, Jiada Tu wrote:<br>

    </div>

    </div></div><blockquote type="cite"><div><div class="h5">

      <div dir="ltr">I am doing an assignment with Swift to sort large
        data. The data contains one record (a string) per line. We need
        to sort the records based on their ASCII codes. The data is too
        large to fit in memory.

        <div><br>

        </div>

        <div>The large data file is on the head node, and I run the Swift
          script directly on the head node.</div>

        <div><br>

        </div>

        <div>Here's what I plan to do:</div>

        <div><br>

        </div>

        <div>1) Split the big file into 64MB files.</div>
        <div>2) Let each worker task sort one 64MB file. Say each task
          calls a "sort.py" (written by me). sort.py will output a
          list of files, say: "sorted-worker1-001; sorted-worker1-002;
          ......". The first file contains the records starting with 'a',
          the second those starting with 'b', etc.</div>
        <div>3) Now we will have all records starting with 'a' in
          (sorted-worker1-001; sorted-worker2-001; ...); 'b' in
           (sorted-worker1-002; sorted-worker2-002; ......); ...... Then
          I send all the files containing 'a' records to a "reduce" worker
          task and let it merge these files into one single file. The same
          goes for 'b', 'c', etc.</div>
        <div>4) Now we get 26 files (a-z), each sorted internally.</div>

        <div><br>

        </div>

        <div>Basically, what I am doing is simulating MapReduce: step 2 is
          map and step 3 is reduce.</div>

        <div><br>

        </div>

        <div>Here are some problems:</div>
        <div>1) For step 2, sort.py needs to output a list of files. How
          can a Swift app function handle a list of outputs?</div>

        <div>     </div>

        <div>    app (file[] outfiles) sort (file[] infiles) {</div>

        <div>          sort.py // how to put out files here?</div>

        <div>    }</div>

        <div><br>

        </div>

        <div>2) As far as I know (I may be wrong), Swift will stage all the
          output files back to the local disk (here, the head node,
          since I run the Swift script directly on the head node). So the
          output files of step 2 will be staged back to the head node first,
          then staged from the head node to the worker nodes for step
          3, and then the 26 files from step 4 will be staged back to the
          head node. I don't want this because the network will be a huge
          bottleneck. Is there any way to tell the "reduce" worker to get data
          directly from the "map" worker? Maybe a shared file system would
          help, but is there any way for the user to control the data
          staging between workers without using a shared file system?</div>

        <div><br>

        </div>

        <div>Since I am new to Swift, I may be totally wrong and
          misunderstand what Swift does. If so, please correct me.</div>

        <div><br>

        </div>

        <div><br>

        </div>

      </div>

      <br>


      <br>

      </div></div><pre>_______________________________________________

Swift-user mailing list

<a href="mailto:Swift-user@ci.uchicago.edu" target="_blank">Swift-user@ci.uchicago.edu</a>

<a href="https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user" target="_blank">https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user</a></pre>

    </blockquote>

    <br>

  </div>

</blockquote></div><br><br clear="all"><br>-- <br><div dir="ltr">Regards,<div>Gagan</div></div>

</div>