[Swift-user] sort on large data
Jiada Tu
jtu3 at hawk.iit.edu
Sat Oct 18 16:18:44 CDT 2014
I am doing an assignment that uses Swift to sort a large data set. The data contains
one record (a string) per line, and we need to sort the records by ASCII
code. The data is too large to fit in memory.
The large data file is on the head node, and I run the Swift script directly on
the head node.
Here's what I plan to do:
1) Split the big file into 64MB files.
2) Let each worker task sort one 64MB file. Say each task calls a
"sort.py" (written by me). sort.py will output a list of files,
say: "sorted-worker1-001; sorted-worker1-002; ......". The first file
contains the records starting with 'a', the second those starting with 'b', etc.
3) Now all records starting with 'a' are in
(sorted-worker1-001; sorted-worker2-001; ...), those starting with 'b' in
(sorted-worker1-002; sorted-worker2-002; ......), and so on. Then I send all
the files that contain 'a' records to a "reduce" worker task and let it merge
these files into one single file. The same goes for 'b', 'c', etc.
4) Now we get 26 files (a-z), each sorted internally.
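For illustration, steps 1 and 2 above could look roughly like the Python below. This is only a sketch under assumptions, not the actual sort.py: the function names (split_file, sort_and_partition), the "chunk-NNN" file names, and the rule that every record starts with a lowercase letter a-z are all hypothetical. It reads records as bytes so that Python's default ordering matches ASCII order, and it numbers the output files 001 for 'a', 002 for 'b', and so on, following the naming in the plan.

```python
import os
from collections import defaultdict


def split_file(path, out_dir, chunk_bytes=64 * 1024 * 1024):
    """Step 1: split `path` into chunks of about `chunk_bytes`, cutting only at line ends."""
    chunk_paths = []
    out = None
    written = 0
    with open(path, "rb") as src:
        for line in src:
            # Start a new chunk for the first line and whenever the current one is full.
            if out is None or written >= chunk_bytes:
                if out is not None:
                    out.close()
                chunk_path = os.path.join(out_dir, "chunk-%03d" % (len(chunk_paths) + 1))
                chunk_paths.append(chunk_path)
                out = open(chunk_path, "wb")
                written = 0
            out.write(line)
            written += len(line)
    if out is not None:
        out.close()
    return chunk_paths


def sort_and_partition(chunk_path, out_dir, worker):
    """Step 2 (map): sort one chunk in memory, then write one file per first letter."""
    with open(chunk_path, "rb") as src:
        # Bytes objects compare by byte value, which is ASCII order.
        records = sorted(line.rstrip(b"\n") for line in src)
    buckets = defaultdict(list)
    for rec in records:
        first = rec[:1]
        # Assumption: every record starts with a lowercase letter a-z.
        if not (b"a" <= first <= b"z"):
            raise ValueError("record does not start with a-z: %r" % rec)
        buckets[first].append(rec)
    out_paths = []
    for first in sorted(buckets):
        index = ord(first) - ord("a") + 1  # 'a' -> 001, 'b' -> 002, ...
        out_path = os.path.join(out_dir, "sorted-%s-%03d" % (worker, index))
        with open(out_path, "wb") as out:
            for rec in buckets[first]:
                out.write(rec + b"\n")
        out_paths.append(out_path)
    return out_paths
```

Note that in this sketch a worker whose chunk has no records for some letter writes no file for that letter, so the number of output files per worker can vary.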
Basically, what I am doing is simulating MapReduce: step 2 is the map and
step 3 is the reduce.
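The reduce step can be sketched in Python as a k-way merge over files that are each already sorted. Again this is only an illustration under assumptions: merge_sorted_files is a hypothetical name, and the inputs are assumed to be newline-terminated records sorted in ASCII order. It uses heapq.merge from the standard library, which reads the inputs lazily, so only about one record per input file is held in memory at a time.

```python
import heapq
from contextlib import ExitStack


def merge_sorted_files(in_paths, out_path):
    """Step 3 (reduce): merge already-sorted files into one sorted file."""
    with ExitStack() as stack:
        sources = [stack.enter_context(open(p, "rb")) for p in in_paths]
        out = stack.enter_context(open(out_path, "wb"))
        # Compare records without the trailing newline, so the order does not
        # depend on how the line terminator sorts against other bytes.
        streams = [(line.rstrip(b"\n") for line in src) for src in sources]
        for rec in heapq.merge(*streams):
            out.write(rec + b"\n")
```

For example, one reduce task for the letter 'a' would call merge_sorted_files on all the "-001" files and write a single merged file for that letter.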
Here are some problems:
1) For step 2, sort.py needs to output a list of files. How can a Swift app
function handle a list of outputs?
app (file[] outfiles) sort (file[] infiles) {
sort.py // how to put out files here?
}
2) As far as I know (I may be wrong), Swift will stage all the output files back to
the local disk (here the head node, since I run the Swift script directly
on the head node). So the output files of step 2 will be staged back to the head
node first, then staged from the head node to the worker nodes for step 3,
and then the 26 files from step 4 will be staged back to the head node. I don't want that,
because the network will be a huge bottleneck. Is there any way to tell the
"reduce" worker to get data directly from the "map" worker? Maybe a shared file
system would help, but is there any way for the user to control the data
staging between workers without using a shared file system?
Since I am new to Swift, I may be totally wrong and misunderstand
what Swift does. If so, please correct me.