<p dir="ltr">Great, Yadu!</p>

<p dir="ltr">Thanks for your help!</p>

<p dir="ltr">Regards, <br>

Gagan</p>

<div class="gmail_quote">On 21/10/2014 6:33 am, "Yadu Nand Babuji" &lt;<a href="mailto:yadunand@uchicago.edu">yadunand@uchicago.edu</a>&gt; wrote:<br type="attribution"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

  <div bgcolor="#FFFFFF" text="#000000">

    Hi,<br>

    <br>

    @Jiada, Dongfang,<br>

    <br>

    I've updated the README on the

    <a href="https://github.com/yadudoc/cloud-tutorials" target="_blank">https://github.com/yadudoc/cloud-tutorials</a> page with documentation

    on how to use<br>

    s3fs as a shared filesystem. I've added configs and links to

    external documentation. Please try it, and let me know<br>

    if any of it is unclear or buggy. <br>

    <br>

    I would also appreciate help from anyone in testing this. <br>

    <br>

    @Gagan, <br>

    That was most likely a bug in my scripts, where the user script is

    executed ahead of the installation of s3fs on the worker nodes.<br>

    Please try again, and if you see the same behavior, please let me

    know.<br>

    <br>

    Thanks,<br>

    Yadu<br>

    <br>

    <div>On 10/19/2014 12:08 AM, Gagan

      Munisiddha Gowda wrote:<br>

    </div>

    <blockquote type="cite">

      <div dir="ltr">

        <div>

          <div>

            <div>Hi Yadu,<br>

              <br>

            </div>

            I am working along the same lines: I am trying to use a shared
            file system (an S3 bucket mounted with s3fs).<br>

            <br>

          </div>

          I have set <span><i><font>WORKER_INIT_SCRIPT=/path/to/mounts3fs.sh

                in cloud-tutorials/ec2/configs</font></i><span><i><font>

                  (as mentioned in the tutorials)</font></i><br>

              <br>

            </span></span></div>

        <div>Although I am able to set up the passwd-s3fs file in the
          desired location (using the mounts3fs.sh script), the
          S3 bucket is not getting mounted.<br>

          <br>

        </div>

        <div>I have verified the passwd-s3fs file and the mount point, and
          both seem to be created as expected. One observation, however:
          these files were owned by 'root', since they were
          created through setup.sh.<br>

          <br>

        </div>

        <div>So I added commands to change the permissions and
          make 'ubuntu' the owner of all related files.<br>

          <br>

        </div>

        <div>Even after these changes, the S3 bucket is
          still not mounted.<br>

          <br>

        </div>

        <div><b>PS: If I connect to the workers and run the s3fs command
            manually, it does mount!</b><br>

          <br>

          <font size="4">sudo s3fs -o
            allow_other,gid=1000,use_cache=/home/ubuntu/cache
            &lt;my-bucket&gt; &lt;mount-point&gt;;</font><br>

        </div>


        <div>

          <div>

            <div><br>

            </div>

            <div>(tried with and without sudo)<br>

            </div>

            <div><br>

            </div>

            <div>Thanks for your help.<br>

              <br>

            </div>

          </div>

        </div>

      </div>

      <div class="gmail_extra"><br>

        <div class="gmail_quote">On Sun, Oct 19, 2014 at 4:43 AM, Yadu

          Nand Babuji <span dir="ltr">&lt;<a href="mailto:yadunand@uchicago.edu" target="_blank">yadunand@uchicago.edu</a>&gt;</span>

          wrote:<br>

          <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

            <div bgcolor="#FFFFFF" text="#000000"> Hi Jiada Tu,<br>

              <br>

              1) Here's an example for returning an array of files :<br>

              <br>

              type file;<br>

              app (file outs[]) make_outputs (file script)<br>

              {<br>

                  bash @script;<br>

              }<br>

              <br>

              file outputs[] &lt;filesys_mapper; prefix="outputs"&gt;;<br>
              file script       &lt;"make_outputs.sh"&gt;; # This script creates a few files with "outputs" as the prefix<br>

              (outputs) = make_outputs(script);<br>

              <br>

              2) The products of a successful task execution must be
              visible to the headnode (where Swift runs), either<br>
              - through a shared filesystem (NFS, S3 mounted over s3fs, etc.), or<br>
              - by being brought back over the network.<br>
              However, we can reduce the overhead of moving the results to
              the headnode and then to the workers for the reduce stage.<br>

               <br>

              I understand that this is part of your assignment, so I
              will try to answer without getting too specific. At the
              same time, concepts from Hadoop do not necessarily carry over
              directly to this context. Here are some things to consider to get
              the best performance possible:<br>

              <br>

              - Assuming that the texts contain 10K unique words, your
              sort program will generate a file containing at most 10K
              lines (which would definitely be under 1 MB). Is there any
              advantage in splitting this into smaller files?<br>

              <br>

              - Since the final merge involves tiny files, you could

              very well do the reduce stage on the headnode and be quite

              efficient<br>

                (you can define the reduce app only for site:local)<br>

              <br>

                sites : [local, cloud-static]<br>

                site.local {<br>

                              ....<br>

                              app.reduce {<br>

                                      executable : ${env.PWD}/reduce.py<br>

                              }    <br>

                }<br>

              <br>

                site.cloud-static {<br>

                              ....<br>

                              app.python {<br>

                                      executable : /usr/bin/python<br>

                              }<br>

              <br>

               }<br>

              <br>

               This assumes that you are going to define your sorting

              app like this :<br>

              <br>

                app (file freqs) sort (file sorting_script, file input )

              {<br>

                     python @sorting_script @input; <br>

               }<br>

                <br>

              <br>
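              For illustration only, here is a minimal Python sketch of what a reduce step over such small files could look like. It assumes each input file holds one "word count" pair per line; that format and the names merge_counts and write_counts are hypothetical, not taken from the assignment or from Swift.<br>
<pre>
```python
from collections import Counter


def merge_counts(paths):
    """Sum per-word counts across several small frequency files.

    Assumed (hypothetical) line format: a word, whitespace, an integer count.
    """
    totals = Counter()
    for path in paths:
        with open(path) as handle:
            for line in handle:
                parts = line.split()
                if len(parts) != 2:
                    continue  # skip blank lines and lines without exactly two fields
                word, count = parts
                totals[word] += int(count)
    return totals


def write_counts(totals, out_path):
    """Write the merged counts sorted by word, one 'word count' pair per line."""
    with open(out_path, "w") as out:
        for word in sorted(totals):
            out.write(f"{word} {totals[word]}\n")
```
</pre>
              A reduce.py along these lines would call merge_counts on the list of per-chunk result files and pass the totals to write_counts. Since every input is small, running it on the headnode costs very little compared with moving the original text.<br>
              <br>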

              - The real cost is in getting the original text to the
              workers. This can be made faster by:<br>
              &nbsp;&nbsp;- Using a better headnode with better network/disk IO (I've
              measured 140 Mbit/s between m1.medium nodes; c3.8xlarge
              comes with 975 Mbit/s)<br>
              &nbsp;&nbsp;- Using S3 with s3fs and having the Swift workers pull data
              from S3, which is pretty scalable and removes the IO load
              from the headnode.<br>

              <br>

              - Identify the optimal size for data chunks for your
              specific problem. Each chunk of data in this case comes
              with the overhead of starting a new remote task, sending
              the data and bringing the results back. Note that the result
              of a wordcount on a file, whether it is 1 MB or 10 GB,
              is still at most 1 MB (with the earlier assumptions).<br>

              <br>
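              To experiment with different chunk sizes, a splitter along the following lines could be used. This is only a sketch: the function name split_by_lines, the prefix naming scheme and the max_bytes parameter are hypothetical. It reads one line at a time, so the input never has to fit in memory, and a chunk may overshoot max_bytes by up to one line.<br>
<pre>
```python
def split_by_lines(src_path, prefix, max_bytes):
    """Split src_path into chunks of roughly max_bytes, cutting only at line ends.

    Returns the list of chunk paths written, named prefix0000, prefix0001, ...
    """
    chunk_paths = []
    out = None
    written = 0
    with open(src_path, "rb") as src:
        for line in src:
            # Open a new chunk when none is open yet or the current one is full.
            if out is None or written >= max_bytes:
                if out is not None:
                    out.close()
                path = f"{prefix}{len(chunk_paths):04d}"
                out = open(path, "wb")
                chunk_paths.append(path)
                written = 0
            out.write(line)
            written += len(line)
    if out is not None:
        out.close()
    return chunk_paths
```
</pre>
              Calling it with a few different max_bytes values and timing the whole run is one simple way to see where the per-task overhead stops dominating.<br>
              <br>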

              - Ensure that the data is in the same datacenter as the
              cluster, for cost as well as performance. By limiting the
              cluster to US-Oregon we already do this.<br>

              <br>

              If you would like to attempt this using S3FS, let me know,

              I'll be happy to explain that in detail.<br>

              <br>

              Thanks,<br>

              Yadu

              <div>

                <div><br>

                  <br>

                  <br>

                  <div>On 10/18/2014 04:18 PM, Jiada Tu wrote:<br>

                  </div>

                </div>

              </div>

              <blockquote type="cite">

                <div>

                  <div>

                    <div dir="ltr">I am doing an assignment with Swift
                      to sort large data. The data contains one record
                      (string) per line. We need to sort the records
                      based on ASCII code. The data is too large to fit
                      in memory.

                      <div><br>

                      </div>

                      <div>The large data file is on the head node, and I
                        run the Swift script directly on the head node.</div>

                      <div><br>

                      </div>

                      <div>Here's what I plan to do:</div>

                      <div><br>

                      </div>

                      <div>1) split the big file into 64MB files</div>

                      <div>2) let each worker task sort one 64MB file.
                        Say, each task will call a "sort.py" (written by
                        me). sort.py will output a list of files,
                        say: "sorted-worker1-001; sorted-worker1-002;
                        ......". The first file contains the records
                        starting with 'a', the second those starting with 'b',
                        etc.</div>

                      <div>3) now we will have all records starting with
                        'a' in
                        (sorted-worker1-001;sorted-worker2-001;...); 'b'
                        in (sorted-worker1-002;sorted-worker2-002;
                        ......); ...... Then I send all the files
                        containing 'a' records to a "reduce" worker task
                        and let it merge these files into one single
                        file. The same goes for 'b', 'c', etc.</div>

                      <div>4) now we get 26 files (a-z), each sorted
                        internally.</div>

                      <div><br>

                      </div>

                      <div>Basically, what I am doing is simulating
                        MapReduce: step 2 is map and step 3 is reduce.</div>

                      <div><br>

                      </div>

                      <div>Here are some problems:</div>

                      <div>1) for step 2, sort.py needs to output a list
                        of files. How can a Swift app function handle
                        a list of outputs?</div>

                      <div>     </div>

                      <div>    app (file[] outfiles) sort (file[]

                        infiles) {</div>

                      <div>          sort.py // how to put out files

                        here?</div>

                      <div>    }</div>

                      <div><br>

                      </div>

                      <div>2) As far as I know (I may be wrong), Swift stages
                        all output files back to the local disk (here,
                        the head node, since I run the Swift script
                        directly on it). So the output files from
                        step 2 will first be staged back to the head node,
                        then staged from the head node to the worker nodes
                        for step 3, and finally the 26 files from step 4
                        will be staged back to the head node. I want to avoid
                        this because the network will be a huge bottleneck.
                        Is there any way to tell the "reduce" worker to get
                        data directly from the "map" worker? Maybe a shared
                        file system would help, but is there any way for the
                        user to control data staging between workers
                        without using a shared file system?</div>

                      <div><br>

                      </div>

                      <div>Since I am new to Swift, I may be totally
                        wrong and may be misunderstanding what Swift does. If so,
                        please correct me.</div>

                      <div><br>

                      </div>

                      <div><br>

                      </div>

                    </div>

                    <br>


                    <br>

                  </div>

                </div>

                <pre>_______________________________________________

Swift-user mailing list

<a href="mailto:Swift-user@ci.uchicago.edu" target="_blank">Swift-user@ci.uchicago.edu</a>

<a href="https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user" target="_blank">https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user</a></pre>

              </blockquote>

              <br>

            </div>

            <br>


          </blockquote>

        </div>

        <br>

        <br clear="all">

        <br>

        -- <br>

        <div dir="ltr">Regards,

          <div>Gagan</div>

        </div>

      </div>

    </blockquote>

    <br>

  </div>

</blockquote></div>