<div dir="ltr"><div>Hi Yadu,</div><div><br></div><div>Thanks for your answer! That's really helpful. <br></div><div><br></div><div><br></div><div><br></div><div>I forgot to forward my last question and your answer to the swift-user group, so if anybody has the same confusion or is interested in it, please see below:</div><div><br></div><div><div class="gmail_extra"><div class="gmail_quote">On Sat, Oct 18, 2014 at 11:45 PM, Yadu Nand Babuji <span dir="ltr"><<a href="mailto:yadunand@uchicago.edu" target="_blank">yadunand@uchicago.edu</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
  
    
  
  <div bgcolor="#FFFFFF" text="#000000">
    Hi Jiada,<br>
    <br>
    Please find replies inline :<span class=""><br>
    <br>
    <div>On 10/18/2014 09:30 PM, Jiada Tu wrote:<br>
    </div>
    <blockquote type="cite">
      
      <div dir="ltr">Thanks for your answer, Yadu. I have some more
        questions:
        <div>1) If I understand correctly, the (outputs) will be staged
          back to the head node, right? <br>
        </div>
      </div>
    </blockquote></span>
    Yes, in the current modes that you are using.<span class=""><br>
    <blockquote type="cite">
      <div dir="ltr">
        <div>
          <div>2)</div>
          <div>---------------------------<br>
          </div>
          <div><span style="color:rgb(0,0,0);font-family:arial,sans-serif;font-size:13px">type
              file;</span></div>
          <span style="color:rgb(0,0,0);font-family:arial,sans-serif;font-size:13px">app
            (file outs[]) make_outputs (file script)</span><br style="color:rgb(0,0,0);font-family:arial,sans-serif;font-size:13px">
          <span style="color:rgb(0,0,0);font-family:arial,sans-serif;font-size:13px">{</span><br style="color:rgb(0,0,0);font-family:arial,sans-serif;font-size:13px">
          <span style="color:rgb(0,0,0);font-family:arial,sans-serif;font-size:13px">   
            bash @script;</span><br style="color:rgb(0,0,0);font-family:arial,sans-serif;font-size:13px">
          <span style="color:rgb(0,0,0);font-family:arial,sans-serif;font-size:13px">}</span><br style="color:rgb(0,0,0);font-family:arial,sans-serif;font-size:13px">
          <br style="color:rgb(0,0,0);font-family:arial,sans-serif;font-size:13px">
          <span style="color:rgb(0,0,0);font-family:arial,sans-serif;font-size:13px">file
            outputs[] <filesys_mapper; prefix="outputs">;</span><br style="color:rgb(0,0,0);font-family:arial,sans-serif;font-size:13px">
          <span style="color:rgb(0,0,0);font-family:arial,sans-serif;font-size:13px">file
            script       <"make_outputs.sh">; # This script
            creates a few files with outputs as prefix</span><br style="color:rgb(0,0,0);font-family:arial,sans-serif;font-size:13px">
          <span style="color:rgb(0,0,0);font-family:arial,sans-serif;font-size:13px">(outputs)
            = make_outputs(script);</span>
          <div><span style="color:rgb(0,0,0);font-family:arial,sans-serif;font-size:13px">---------------------------</span></div>
          <div><span style="color:rgb(0,0,0);font-family:arial,sans-serif;font-size:13px"><br>
            </span></div>
          <div><font color="#000000" face="arial, sans-serif">If I have
              some later app function that takes the "outputs" files as
              input, will that app function wait until all possible</font></div>
        </div>
      </div>
    </blockquote>
    <blockquote type="cite">
      <div dir="ltr">
        <div>
          <div><font color="#000000" face="arial, sans-serif"> outputs
              are generated? <br>
            </font></div>
        </div>
      </div>
    </blockquote>
    </span><font face="arial, sans-serif">Yes! Swift is implicitly parallel,
      and the order of execution is based on the </font>availability of
    dependent data items.<span class=""><br>
    <br>
    <blockquote type="cite">
      <div dir="ltr">
        <div>
          <div><font color="#000000" face="arial, sans-serif">For
              example:</font></div>
          <div><font color="#000000" face="arial, sans-serif"><br>
            </font></div>
          <div><span style="color:rgb(0,0,0);font-family:arial,sans-serif;font-size:13px">app
              (file outs[]) final_ouputs (file script, file input[]) </span><font color="#000000" face="arial, sans-serif"><br>
            </font></div>
          <div><span style="color:rgb(0,0,0);font-family:arial,sans-serif;font-size:13px">{</span></div>
          <div><span style="color:rgb(0,0,0);font-family:arial,sans-serif;font-size:13px"> 
                bash @script @filenames(input)</span></div>
          <div><font color="#000000" face="arial, sans-serif">}</font></div>
          <div><font color="#000000" face="arial, sans-serif">foreach i
              in [0:100]</font></div>
          <div><font color="#000000" face="arial, sans-serif">{</font></div>
          <div><font color="#000000" face="arial, sans-serif">    </font><span style="color:rgb(0,0,0);font-family:arial,sans-serif;font-size:13px">file
              outputs[] <filesys_mapper;
              prefix="outputs-"+@tostring(i)+"-">;  </span></div>
          <div><span style="color:rgb(0,0,0);font-family:arial,sans-serif;font-size:13px"> 
                #I know "</span><span style="color:rgb(0,0,0);font-family:arial,sans-serif;font-size:13px">"outputs-"+@tostring(i)+"-"</span><span style="color:rgb(0,0,0);font-family:arial,sans-serif;font-size:13px">"
              may not work, </span><font color="#000000" face="arial,
              sans-serif">please treat it as pseudo-code</font></div>
          <div><span style="color:rgb(0,0,0);font-family:arial,sans-serif;font-size:13px"> 
                </span><span style="color:rgb(0,0,0);font-family:arial,sans-serif;font-size:13px">(outputs)
              = make_outputs(script);</span></div>
          <div><font color="#000000" face="arial, sans-serif">}</font></div>
          <div><font color="#000000" face="arial, sans-serif"><br>
            </font></div>
          <div><font color="#000000" face="arial, sans-serif">file
              script2 <"final_output"> #take multiple input and
              merge them into a file</font></div>
          <div><font color="#000000" face="arial, sans-serif">file
              inputs[] </font><span style="color:rgb(0,0,0);font-family:arial,sans-serif;font-size:13px"><filesys_mapper;
              prefix="outputs", suffix="000"></span></div>
          <div><span style="color:rgb(0,0,0);font-family:arial,sans-serif;font-size:13px">file
              finoutput</span><span style="color:rgb(0,0,0);font-family:arial,sans-serif;font-size:13px"><filesys_mapper;
              prefix="finalouputs-"></span></div>
          <div><span style="color:rgb(0,0,0);font-family:arial,sans-serif;font-size:13px">(finoutput)
              = final_ouputs(script2, inputs)</span></div>
          <div><span style="color:rgb(0,0,0);font-family:arial,sans-serif;font-size:13px"><br>
            </span></div>
          <div><font color="#000000" face="arial, sans-serif">final_outputs
              needs to take some output files from "every" single loop
              iteration as its input (the first iteration may generate "outputs-0-000",
              the second may generate "outputs-1-000", etc.).</font></div>
          <div><font color="#000000" face="arial, sans-serif"><br>
            </font></div>
          <div><font color="#000000" face="arial, sans-serif">So,
              will the final_outputs() task block until all
              make_outputs() tasks finish processing? <br>
            </font></div>
        </div>
      </div>
    </blockquote>
    </span><font face="arial, sans-serif">Yes, final_outputs will block till
      the array that it depends on is closed.<br>
      <br>
    </font><span class="">
    <blockquote type="cite">
      <div dir="ltr">
        <div>
          <div><font color="#000000" face="arial, sans-serif">3)
              Actually, wordCount is our first program, and sort is our
              second program, which earns "extra credit". You gave a
              great answer to another question that I was also confused about.</font></div>
        </div>
        <div><font color="#000000" face="arial, sans-serif"><br>
          </font></div>
      </div>
    </blockquote>
    </span><font face="arial, sans-serif">I hope I did not give too much away
      :)<br>
    </font><span class="">
    <blockquote type="cite">
      <div dir="ltr">
        <div><font color="#000000" face="arial, sans-serif">The output
            of the sort program will be 10GB, so it will not fit in memory.
            That is why I want to split the intermediate files and send
            them to several merge tasks. Each merge task will generate,
            say, a 100MB file. So the result of my sort program will
            be 10GB/100MB=100 files, with file1 holding the smallest
            words and file100 holding the largest words.</font></div>
        <div><font color="#000000" face="arial, sans-serif"><br>
          </font></div>
        <div><font color="#000000" face="arial, sans-serif">From your
            answer, I believe this can be dealt with by using s3fs? So:</font></div>
        <div><span style="color:rgb(0,0,0);font-family:arial,sans-serif">4) </span><font color="#000000" face="arial, sans-serif">Yes, I would like some
            help using s3fs.</font></div>
      </div>
    </blockquote></span>
    Since this is something that would be of general interest, I will
    update the GitHub README page with directions for running Swift <br>
    over s3fs acting as a shared filesystem.<span class=""><br>
    <br>
    <blockquote type="cite">
      <div dir="ltr">
        <div>But are there any other general ways to deal with
          big-file sorting in Swift? Can you give me a hint about what
          they would be? For example, how would you generally deal with this sorting
          problem?</div>
      </div>
    </blockquote></span>
    The simplest strategy I can think of is to split each chunk into say
    100 buckets, and have the corresponding buckets from every chunk
    merge-sorted. <br><div><div class="h5">
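    <div>A minimal sketch of that bucket strategy in Python, assuming every chunk is sorted on its own, split at the same boundary keys, and bucket k from every chunk is then merged with heapq.merge. The function names, the boundary keys, and the in-memory lists are illustrative assumptions, not part of Swift; a real run would stream records from files instead.</div>
    <pre>
```python
import bisect
import heapq


def split_into_buckets(sorted_records, boundaries):
    """Split one sorted chunk into len(boundaries) + 1 sorted buckets.

    A record equal to a boundary key goes into the later bucket.
    """
    buckets = []
    start = 0
    for key in boundaries:
        # Index of the first record that is not smaller than the key.
        end = bisect.bisect_left(sorted_records, key, start)
        buckets.append(sorted_records[start:end])
        start = end
    buckets.append(sorted_records[start:])
    return buckets


def merge_buckets(chunks, boundaries):
    """Sort each chunk, bucket it, then merge matching buckets across chunks."""
    per_chunk = [split_into_buckets(sorted(chunk), boundaries) for chunk in chunks]
    merged = []
    # zip(*per_chunk) groups bucket k of every chunk together.
    for same_bucket in zip(*per_chunk):
        merged.append(list(heapq.merge(*same_bucket)))
    return merged
```
    </pre>
    <div>Because all chunks share the same boundaries, concatenating the merged buckets in order gives the fully sorted data, and each merge only needs to handle one bucket's worth of records at a time.</div>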
    <br>
    <blockquote type="cite">
      <div dir="ltr">
        <div>Thanks,</div>
        <div>Jiada Tu</div>
      </div>
      <div class="gmail_extra"><br>
        <div class="gmail_quote">On Sat, Oct 18, 2014 at 6:13 PM, Yadu
          Nand Babuji <span dir="ltr"><<a href="mailto:yadunand@uchicago.edu" target="_blank">yadunand@uchicago.edu</a>></span>
          wrote:<br>
          <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
            <div bgcolor="#FFFFFF" text="#000000"> Hi Jiada Tu,<br>
              <br>
              1) Here's an example for returning an array of files :<br>
              <br>
              type file;<br>
              app (file outs[]) make_outputs (file script)<br>
              {<br>
                  bash @script;<br>
              }<br>
              <br>
              file outputs[] <filesys_mapper; prefix="outputs">;<br>
              file script       <"make_outputs.sh">; # This script
              creates a few files with outputs as prefix<br>
              (outputs) = make_outputs(script);<br>
              <br>
              2) The products of a successful task execution must be
              visible to the headnode (where Swift runs), either:<br>
              - through a shared filesystem (NFS, S3 mounted over s3fs, etc.), or<br>
              - by being brought back over the network.<br>
              But we can reduce the overhead of moving the results to
              the headnode and then to the workers for the reduce stage.<br>
               <br>
               <br>
              I understand that this is part of your assignment, so I
              will try to answer without getting too specific. At the
              same time, <br>
              concepts from Hadoop do not necessarily work directly in
              this context. So here are some things to consider to get<br>
              the best performance possible:<br>
              <br>
              - Assuming that the texts contain 10K unique words, your
              sort program will generate a file containing at most 10K
              lines<br>
               (which would definitely be under 1 MB). Is there any
              advantage in splitting this into smaller files?<br>
              <br>
              - Since the final merge involves tiny files, you could
              very well do the reduce stage on the headnode and be quite
              efficient<br>
                (you can define the reduce app only for site:local)<br>
              <br>
                sites : [local, cloud-static]<br>
                site.local {<br>
                              ....<br>
                              app.reduce {<br>
                                      executable : ${env.PWD}/reduce.py<br>
                              }    <br>
                }<br>
              <br>
                site.cloud-static {<br>
                              ....<br>
                              app.python {<br>
                                      executable : /usr/bin/python<br>
                              }<br>
              <br>
               }<br>
              <br>
               This assumes that you are going to define your sorting
              app like this :<br>
              <br>
                app (file freqs) sort (file sorting_script, file input )
              {<br>
                     python @sorting_script @input; <br>
               }<br>
                <br>
              <br>
              - The real cost is in having the original text reach the
              workers. This can be made faster by:<br>
                  - Using a better headnode with better network/disk IO (I've
              measured 140 Mbit/s between m1.medium nodes; c3.8xlarge
              comes with 975 Mbit/s)<br>
                  - Using S3 with s3fs and having the Swift workers pull data
              from S3, which is pretty scalable and removes the IO load
              from the headnode.<br>
              <br>
              - Identify the optimal size for data chunks for your
              specific problem. Each chunk of data in this case comes
              with the overhead of starting<br>
                a new remote task, sending the data, and bringing the results
              back. Note that the result of a wordcount on a file,
              whether it is 1 MB or 10 GB,<br>
                is still at most 1 MB (with the earlier assumptions).<br>
              <br>
              - Ensure that the data is in the same datacenter as the workers, for cost
              as well as performance. By limiting the cluster to
              US-Oregon we already do this.<br>
              <br>
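              For a rough sense of scale, here is a small Python calculation of the ideal transfer time at the two link speeds quoted above. The 10 GB input size is only an illustrative assumption, units are decimal, and protocol and disk overheads are ignored, so real transfers will be slower.<br>
              <pre>
```python
def transfer_seconds(size_gigabytes, link_mbit_per_s):
    """Ideal time to move size_gigabytes over a link, ignoring all overhead."""
    # Decimal units: 1 GB = 1000 MB = 8000 Mbit.
    size_megabits = size_gigabytes * 1000 * 8
    return size_megabits / link_mbit_per_s


# Link speeds are the figures quoted in this message; 10 GB is assumed.
for label, rate in [("m1.medium", 140), ("c3.8xlarge", 975)]:
    print(f"{label}: {transfer_seconds(10, rate):.0f} s for 10 GB")
```
              </pre>
              Under these assumptions the headnode needs roughly 571 s on the slower link and roughly 82 s on the faster one just to push the input out once.<br>
              <br>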
              If you would like to attempt this using S3FS, let me know,
              I'll be happy to explain that in detail.<br>
              <br>
              Thanks,<br>
              Yadu
              <div>
                <div><br>
                  <br>
                  <br>
                  <div>On 10/18/2014 04:18 PM, Jiada Tu wrote:<br>
                  </div>
                </div>
              </div>
              <blockquote type="cite">
                <div>
                  <div>
                    <div dir="ltr">I am doing an assignment with Swift
                      to sort large data. The data contains one record
                      (a string) per line. We need to sort the records
                      based on ASCII code. The data is too large to fit
                      in memory.
                      <div><br>
                      </div>
                      <div>The large data file is on the head node, and I
                        run the Swift script directly on the head node.</div>
                      <div><br>
                      </div>
                      <div>Here's what I plan to do:</div>
                      <div><br>
                      </div>
                      <div>1) split the big file into 64MB files</div>
                      <div>2) let each worker task sort one 64MB file.
                        Say, each task will call a "sort.py" (written by
                        me). sort.py will output a list of files,
                        say: "sorted-worker1-001; sorted-worker1-002;
                        ......". The first file contains the records
                        starting with 'a', the second those starting with 'b',
                        etc.</div>
                      <div>3) now we will have all records starting with
                        'a' in
                        (sorted-worker1-001;sorted-worker2-001;...); 'b'
                        in  (sorted-worker1-002;sorted-worker2-002;
                        ......); ...... Then I send all the files
                        containing 'a' records to a "reduce" worker task
                        and let it merge these files into one single
                        file. The same goes for 'b', 'c', etc.</div>
                      <div>4) now we get 26 files (a-z), each sorted
                        internally.</div>
                      <div><br>
                      </div>
                      <div>Basically, what I am doing is simulating
                        MapReduce: step 2 is the map and step 3 is the reduce.</div>
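                      <div><br>
                      </div>
                      <div>A minimal sketch of the map step (step 2) in Python, assuming records are plain strings held in memory. The function name and the dictionary return value are illustrative assumptions; the real sort.py would write each group to its own file, such as the "sorted-worker1-001" files described above.</div>
                      <pre>
```python
from collections import defaultdict


def sort_and_partition(records):
    """Sort one chunk and group its records by their first character.

    Iterating over the sorted chunk keeps every group sorted as well.
    """
    partitions = defaultdict(list)
    for record in sorted(records):
        # record[:1] is the first character, or "" for an empty record.
        partitions[record[:1]].append(record)
    return dict(partitions)
```
                      </pre>
                      <div>Each key of the returned dictionary corresponds to one output file of the map task, and the reduce step for a given key only has to merge already-sorted groups.</div>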
                      <div><br>
                      </div>
                      <div>Here are some problems:</div>
                      <div>1) for step 2, sort.py needs to output a list
                        of files. How can a Swift app function handle a
                        list of outputs?</div>
                      <div>     </div>
                      <div>    app (file[] outfiles) sort (file[]
                        infiles) {</div>
                      <div>          sort.py // how to put out files
                        here?</div>
                      <div>    }</div>
                      <div><br>
                      </div>
                      <div>2) As far as I know (I may be wrong), Swift will stage
                        all the output files back to the local disk (here
                        that is the head node, since I run the Swift script
                        directly on the headnode). So the output files from
                        step 2 will be staged back to the head node first,
                        then staged from the head node to the worker nodes to
                        do step 3, and then the 26 files from step 4 will be staged
                        back to the head node. I don't want this because the
                        network will be a huge bottleneck. Is there any
                        way to tell the "reduce" worker to get data
                        directly from the "map" worker? Maybe a shared file
                        system would help, but is there any way that the user
                        can control the data staging between workers
                        without using a shared file system?</div>
                      <div><br>
                      </div>
                      <div>Since I am new to Swift, I may be totally
                        wrong and misunderstanding what Swift does. If so,
                        please correct me.</div>
                      <div><br>
                      </div>
                      <div><br>
                      </div>
                    </div>
                    <br>
                    <br>
                  </div>
                </div>
                <pre>_______________________________________________
Swift-user mailing list
<a href="mailto:Swift-user@ci.uchicago.edu" target="_blank">Swift-user@ci.uchicago.edu</a>
<a href="https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user" target="_blank">https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user</a></pre>
              </blockquote>
              <br>
            </div>
            <br>
          </blockquote>
        </div>
        <br>
      </div>
    </blockquote>
    <br>
  </div></div></div>

</blockquote></div><br></div></div></div>