[Swift-devel] Swift-issues (PBS+NFS Cluster)

foster at anl.gov foster at anl.gov
Thu May 7 12:10:10 CDT 2009


I'd suggest we want some microbenchmarks too. Then we want to get
running on Amazon and evaluate S3.

Ian -- from mobile

On May 7, 2009, at 11:54 AM, Michael Wilde <wilde at mcs.anl.gov> wrote:

>
>
> On 5/7/09 11:39 AM, Yi Zhu wrote:
>> Michael Wilde wrote:
>>> Very good!
>>>
>>> Now, what kind of tests can you do next?
>> Next, I will try to get Swift running on Amazon EC2.
>>> Can you exercise the cluster with an interesting workflow?
>> Yes. Are there any complex samples or tools I can use (rather than  
>> first.swift) to test Swift performance? Is there any benchmark  
>> available that I can compare against?
>
> I think we can assemble a good benchmark that is also a useful  
> application. I suggest we start with OOPS, the protein folding code.
>
> That way, all the test runs can contribute (a bit) to science by  
> building a catalog of results that the OOPS team can peruse and use.
>
> How many CPU hours (or VM hours) do you have available in EC2  
> through Nimbus? Can we use some of these?
>
>>> How large of a cluster can you assemble in a Nimbus workspace ?
>> Since the VM image I use to test Swift is based on an NFS shared  
>> file system, the performance may not be satisfactory if we have a  
>> large cluster. After I get Swift running on Amazon  
>> EC2, I will try to make a dedicated VM image using GPFS or any  
>> other shared file system you recommend.
>
> I suggest we also look at doing runs that depend little on the  
> shared filesystem. There is a Swift option to put the per-job  
> working dir on a local disk. Further work can reduce shared disk  
> usage even more (but that's "research" work).
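> As a sketch, that option might look like the following in sites.xml.
> The <scratch> element here is an assumption on my part; check the
> Swift site catalog docs for the exact element that points per-job
> work at node-local disk.

```xml
<pool handle="nb_basecluster">
  <gridftp url="gsiftp://tp-x001.ci.uchicago.edu" />
  <execution url="https://tp-x001.ci.uchicago.edu:8443/wsrf/services/ManagedJobFactoryService"
             jobManager="PBS" provider="gt4" />
  <!-- shared staging directory (still on NFS) -->
  <workdirectory>/home/jobrun</workdirectory>
  <!-- hypothetical: per-job working dirs on node-local disk -->
  <scratch>/tmp/jobrun-scratch</scratch>
</pool>
```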
>
> With OOPS, there is very little input data (all of the large input  
> data is part of the app install at the moment). Output data is  
> modest, about 1MB per job, or so. Allan and Zhao are experimenting  
> with ways to batch that up and transfer it back efficiently. That is  
> worth pursuing as a longer term goal.
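> The batching idea can be sketched in a few lines of shell: bundle the
> many ~1MB per-job outputs into one archive so they come back in a
> single transfer (the directory and file names below are hypothetical).

```shell
# Bundle small per-job outputs into one archive for a single transfer
mkdir -p /tmp/job-out
for i in 1 2 3; do
    echo "result $i" > /tmp/job-out/job$i.out
done
tar -czf /tmp/outputs-batch.tar.gz -C /tmp/job-out .
```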
>
> How well / how fast does gridftp work, moving data from a Nimbus VM  
> (at CI, EC2, and elsewhere) back to a CI filesystem? That would be  
> good to test.
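> A minimal sketch of such a timing test (hypothetical paths; assumes a
> valid grid proxy on the VM, and skips the copy where GridFTP tools
> are not installed):

```shell
# Create a ~1 MB test file, roughly the per-job output size noted above
dd if=/dev/zero of=/tmp/out.dat bs=1024 count=1024 2>/dev/null

if command -v globus-url-copy >/dev/null 2>&1; then
    # Time moving the file from the VM back to a CI GridFTP endpoint
    time globus-url-copy -vb \
        file:///tmp/out.dat \
        gsiftp://tp-x001.ci.uchicago.edu/home/jobrun/out.dat
else
    echo "globus-url-copy not installed; skipping transfer"
fi
```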
>
> I wonder if the Swift engine could pull the data back from a GridFTP  
> server running on the worker node?
>
> All of the above is for longer-term research.
>
>>> Can you aggregate VM's from a few different physical clusters into  
>>> one Nimbus workspace?
>> I don't think so. Tim may be able to comment on that.
>>> What's the largest cluster you can assemble with Nimbus?
>> I am not quite sure; I will do some tests on it soon. Since it is an  
>> EC2-like cloud, it should easily be configured as a cluster with  
>> hundreds of nodes. Tim may be able to comment on that.
>>> Can we try putting Matlab on a Nimbus VM and then running it at an  
>>> interesting scale?
>> Yes, I think so.
>
> I will need to locate the pointers we were trying to post on how to  
> get started with Swift and Matlab.
>
>
>>> Do we have any free allocations of EC2 to enable testing of this  
>>> at Amazon?
>> Yes, Tim has already made an AMI on Amazon EC2; my next step is to  
>> get Swift running on it. I will try to get this working by tomorrow.
>
> Awesome.
>
> - Mike
>
>
>>>
>>> - Mike
>>>
>>>
>>> On 5/6/09 10:10 PM, yizhu wrote:
>>>> problem solved.
>>>>
>>>> I migrated all of my work from my laptop to the ci.uchicago.edu  
>>>> server, and now everything works.
>>>>
>>>>
>>>> -Yi
>>>> yizhu wrote:
>>>>> Hi all
>>>>>
>>>>> I tried running swift-0.8 over a Nimbus cloud (PBS+NFS), and  
>>>>> configured sites.xml and tc.data accordingly. [1]
>>>>>
>>>>> When I tried to run "$ swift first.swift", it got stuck in the  
>>>>> "Submitting:1" phase [2] (it kept showing  
>>>>> "Progress: Submitting:1" and never returned).
>>>>>
>>>>> Then I sshed to the pbs_server to check the server log [3] and  
>>>>> found that the job had been enqueued, run, and successfully  
>>>>> dequeued. I also checked the queue status [4] while the job was  
>>>>> running and found that the output_path is "/dev/null", which I  
>>>>> did not expect. (The working directory of Swift is "/home/jobrun".)
>>>>>
>>>>> I think the problem might be that the pbs_server failed to return  
>>>>> the output to the correct path (by the way, what is the  
>>>>> output_path supposed to be? The same as Swift's work directory?),  
>>>>> or does anyone have a better idea?
>>>>>
>>>>>
>>>>> Many Thanks.
>>>>>
>>>>>
>>>>> -Yi
>>>>>
>>>>>
>>>>>
>>>>> --------------------------------------------------
>>>>> [1]----sites.xml
>>>>>
>>>>> <pool handle="nb_basecluster">
>>>>>    <gridftp url="gsiftp://tp-x001.ci.uchicago.edu" />
>>>>>    <execution url="https://tp-x001.ci.uchicago.edu:8443/wsrf/services/ManagedJobFactoryService" jobManager="PBS" provider="gt4" />
>>>>>    <workdirectory >/home/jobrun</workdirectory>
>>>>>  </pool>
>>>>>
>>>>> ----tc.data
>>>>>
>>>>> nb_basecluster    echo     /bin/echo     INSTALLED    INTEL32::LINUX    null
>>>>> nb_basecluster    cat      /bin/cat      INSTALLED    INTEL32::LINUX    null
>>>>> nb_basecluster    ls       /bin/ls       INSTALLED    INTEL32::LINUX    null
>>>>> nb_basecluster    grep     /bin/grep     INSTALLED    INTEL32::LINUX    null
>>>>> nb_basecluster    sort     /bin/sort     INSTALLED    INTEL32::LINUX    null
>>>>> nb_basecluster    paste    /bin/paste    INSTALLED    INTEL32::LINUX    null
>>>>>
>>>>> ---------------------------------------------------------
>>>>> [2]
>>>>> yizhu at ubuntu:~/swift-0.8/examples/swift$ swift -d first.swift
>>>>> Recompilation suppressed.
>>>>> Using sites file: /home/yizhu/swift-0.8/bin/../etc/sites.xml
>>>>> Using tc.data: /home/yizhu/swift-0.8/bin/../etc/tc.data
>>>>> Setting resources to: {nb_basecluster=nb_basecluster}
>>>>> Swift 0.8 swift-r2448 cog-r2261
>>>>>
>>>>> Swift 0.8 swift-r2448 cog-r2261
>>>>>
>>>>> RUNID id=tag:benc at ci.uchicago.edu,2007:swift:run:20090506-1912- 
>>>>> zqd8t5hg
>>>>> RunID: 20090506-1912-zqd8t5hg
>>>>> closed org.griphyn.vdl.mapping.RootDataNode identifier tag:benc at ci.uchicago.edu 
>>>>> ,2008:swift:dataset:20090506-1912-lxp69uu2:720000000001 type  
>>>>> string value=Hello, world! dataset=unnamed SwiftScript value  
>>>>> (closed)
>>>>> ROOTPATH dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912 
>>>>> -lxp69uu2:720000000001 path=$
>>>>> VALUE dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912 
>>>>> -lxp69uu2:720000000001 VALUE=Hello, world!
>>>>> closed org.griphyn.vdl.mapping.RootDataNode identifier tag:benc at ci.uchicago.edu 
>>>>> ,2008:swift:dataset:20090506-1912-lxp69uu2:720000000001 type  
>>>>> string value=Hello, world! dataset=unnamed SwiftScript value  
>>>>> (closed)
>>>>> ROOTPATH dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912 
>>>>> -lxp69uu2:720000000001 path=$
>>>>> VALUE dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912 
>>>>> -lxp69uu2:720000000001 VALUE=Hello, world!
>>>>> NEW id=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912- 
>>>>> lxp69uu2:720000000001
>>>>> Found mapped data org.griphyn.vdl.mapping.RootDataNode  
>>>>> identifier tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912 
>>>>> -lxp69uu2:720000000002 type messagefile with no value at  
>>>>> dataset=outfile (not closed).$
>>>>> NEW id=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912- 
>>>>> lxp69uu2:720000000002
>>>>> Progress:
>>>>> PROCEDURE thread=0 name=greeting
>>>>> PARAM thread=0 direction=output variable=t provenanceid=tag:benc at ci.uchicago.edu 
>>>>> ,2008:swift:dataset:20090506-1912-lxp69uu2:720000000002
>>>>> closed org.griphyn.vdl.mapping.RootDataNode identifier tag:benc at ci.uchicago.edu 
>>>>> ,2008:swift:dataset:20090506-1912-lxp69uu2:720000000003 type  
>>>>> string value=hello.txt dataset=unnamed SwiftScript value (closed)
>>>>> ROOTPATH dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912 
>>>>> -lxp69uu2:720000000003 path=$
>>>>> VALUE dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912 
>>>>> -lxp69uu2:720000000003 VALUE=hello.txt
>>>>> closed org.griphyn.vdl.mapping.RootDataNode identifier tag:benc at ci.uchicago.edu 
>>>>> ,2008:swift:dataset:20090506-1912-lxp69uu2:720000000003 type  
>>>>> string value=hello.txt dataset=unnamed SwiftScript value (closed)
>>>>> ROOTPATH dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912 
>>>>> -lxp69uu2:720000000003 path=$
>>>>> VALUE dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912 
>>>>> -lxp69uu2:720000000003 VALUE=hello.txt
>>>>> START thread=0 tr=echo
>>>>> Sorted: [nb_basecluster:0.000(1.000):0/1 overload: 0]
>>>>> Rand: 0.6583597597672994, sum: 1.0
>>>>> Next contact: nb_basecluster:0.000(1.000):0/1 overload: 0
>>>>> Progress:  Initializing site shared directory:1
>>>>> START host=nb_basecluster - Initializing shared directory
>>>>> multiplyScore(nb_basecluster:0.000(1.000):1/1 overload: 0, -0.01)
>>>>> Old score: 0.000, new score: -0.010
>>>>> No global submit throttle set. Using default (100)
>>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174935)  
>>>>> setting status to Submitting
>>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174935)  
>>>>> setting status to Submitted
>>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174935)  
>>>>> setting status to Active
>>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174935)  
>>>>> setting status to Completed
>>>>> multiplyScore(nb_basecluster:-0.010(0.994):1/1 overload: 0, 0.01)
>>>>> Old score: -0.010, new score: 0.000
>>>>> multiplyScore(nb_basecluster:0.000(1.000):1/1 overload: 0, 0.1)
>>>>> Old score: 0.000, new score: 0.100
>>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174935)  
>>>>> Completed. Waiting: 0, Running: 0. Heap size: 75M, Heap free:  
>>>>> 63M, Max heap: 720M
>>>>> multiplyScore(nb_basecluster:0.100(1.060):1/1 overload: 0, -0.2)
>>>>> Old score: 0.100, new score: -0.100
>>>>> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174938) setting  
>>>>> status to Submitting
>>>>> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174938) setting  
>>>>> status to Submitted
>>>>> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174938) setting  
>>>>> status to Active
>>>>> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174938) setting  
>>>>> status to Completed
>>>>> multiplyScore(nb_basecluster:-0.100(0.943):1/1 overload: 0, 0.2)
>>>>> Old score: -0.100, new score: 0.100
>>>>> multiplyScore(nb_basecluster:0.100(1.060):1/1 overload: 0, 0.1)
>>>>> Old score: 0.100, new score: 0.200
>>>>> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174938)  
>>>>> Completed. Waiting: 0, Running: 0. Heap size: 75M, Heap free:  
>>>>> 60M, Max heap: 720M
>>>>> multiplyScore(nb_basecluster:0.200(1.124):1/1 overload: 0, -0.2)
>>>>> Old score: 0.200, new score: 0.000
>>>>> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174940) setting  
>>>>> status to Submitting
>>>>> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174940) setting  
>>>>> status to Submitted
>>>>> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174940) setting  
>>>>> status to Active
>>>>> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174940) setting  
>>>>> status to Completed
>>>>> multiplyScore(nb_basecluster:0.000(1.000):1/1 overload: 0, 0.2)
>>>>> Old score: 0.000, new score: 0.200
>>>>> multiplyScore(nb_basecluster:0.200(1.124):1/1 overload: 0, 0.1)
>>>>> Old score: 0.200, new score: 0.300
>>>>> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174940)  
>>>>> Completed. Waiting: 0, Running: 0. Heap size: 75M, Heap free:  
>>>>> 59M, Max heap: 720M
>>>>> multiplyScore(nb_basecluster:0.300(1.192):1/1 overload: 0, -0.01)
>>>>> Old score: 0.300, new score: 0.290
>>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174942)  
>>>>> setting status to Submitting
>>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174942)  
>>>>> setting status to Submitted
>>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174942)  
>>>>> setting status to Active
>>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174942)  
>>>>> setting status to Completed
>>>>> multiplyScore(nb_basecluster:0.290(1.185):1/1 overload: 0, 0.01)
>>>>> Old score: 0.290, new score: 0.300
>>>>> multiplyScore(nb_basecluster:0.300(1.192):1/1 overload: 0, 0.1)
>>>>> Old score: 0.300, new score: 0.400
>>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174942)  
>>>>> Completed. Waiting: 0, Running: 0. Heap size: 75M, Heap free:  
>>>>> 59M, Max heap: 720M
>>>>> multiplyScore(nb_basecluster:0.400(1.264):1/1 overload: 0, -0.01)
>>>>> Old score: 0.400, new score: 0.390
>>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174944)  
>>>>> setting status to Submitting
>>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174944)  
>>>>> setting status to Submitted
>>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174944)  
>>>>> setting status to Active
>>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174944)  
>>>>> setting status to Completed
>>>>> multiplyScore(nb_basecluster:0.390(1.256):1/1 overload: 0, 0.01)
>>>>> Old score: 0.390, new score: 0.400
>>>>> multiplyScore(nb_basecluster:0.400(1.264):1/1 overload: 0, 0.1)
>>>>> Old score: 0.400, new score: 0.500
>>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174944)  
>>>>> Completed. Waiting: 0, Running: 0. Heap size: 75M, Heap free:  
>>>>> 59M, Max heap: 720M
>>>>> multiplyScore(nb_basecluster:0.500(1.339):1/1 overload: 0, -0.01)
>>>>> Old score: 0.500, new score: 0.490
>>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174946)  
>>>>> setting status to Submitting
>>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174946)  
>>>>> setting status to Submitted
>>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174946)  
>>>>> setting status to Active
>>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174946)  
>>>>> setting status to Completed
>>>>> multiplyScore(nb_basecluster:0.490(1.332):1/1 overload: 0, 0.01)
>>>>> Old score: 0.490, new score: 0.500
>>>>> multiplyScore(nb_basecluster:0.500(1.339):1/1 overload: 0, 0.1)
>>>>> Old score: 0.500, new score: 0.600
>>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174946)  
>>>>> Completed. Waiting: 0, Running: 0. Heap size: 75M, Heap free:  
>>>>> 59M, Max heap: 720M
>>>>> END host=nb_basecluster - Done initializing shared directory
>>>>> THREAD_ASSOCIATION jobid=echo-kfecpfaj thread=0-1  
>>>>> host=nb_basecluster replicationGroup=jfecpfaj
>>>>> Progress:  Stage in:1
>>>>> START jobid=echo-kfecpfaj host=nb_basecluster - Initializing  
>>>>> directory structure
>>>>> START path= dir=first-20090506-1912-zqd8t5hg/shared - Creating  
>>>>> directory structure
>>>>> multiplyScore(nb_basecluster:0.600(1.419):1/1 overload: 0, -0.01)
>>>>> Old score: 0.600, new score: 0.590
>>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174948)  
>>>>> setting status to Submitting
>>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174948)  
>>>>> setting status to Submitted
>>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174948)  
>>>>> setting status to Active
>>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174948)  
>>>>> setting status to Completed
>>>>> multiplyScore(nb_basecluster:0.590(1.411):1/1 overload: 0, 0.01)
>>>>> Old score: 0.590, new score: 0.600
>>>>> multiplyScore(nb_basecluster:0.600(1.419):1/1 overload: 0, 0.1)
>>>>> Old score: 0.600, new score: 0.700
>>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174948)  
>>>>> Completed. Waiting: 0, Running: 0. Heap size: 75M, Heap free:  
>>>>> 59M, Max heap: 720M
>>>>> END jobid=echo-kfecpfaj - Done initializing directory structure
>>>>> START jobid=echo-kfecpfaj - Staging in files
>>>>> END jobid=echo-kfecpfaj - Staging in finished
>>>>> JOB_START jobid=echo-kfecpfaj tr=echo arguments=[Hello, world!]  
>>>>> tmpdir=first-20090506-1912-zqd8t5hg/jobs/k/echo-kfecpfaj  
>>>>> host=nb_basecluster
>>>>> jobid=echo-kfecpfaj task=Task(type=JOB_SUBMISSION, identity=urn:0-1-1241655174950 
>>>>> )
>>>>> multiplyScore(nb_basecluster:0.700(1.503):1/1 overload: 0, -0.2)
>>>>> Old score: 0.700, new score: 0.500
>>>>> Task(type=JOB_SUBMISSION, identity=urn:0-1-1241655174950)  
>>>>> setting status to Submitting
>>>>> Submitting task: Task(type=JOB_SUBMISSION, identity=urn:0-1-1241655174950 
>>>>> )
>>>>> <startTime name="submission">1241655180260</startTime>
>>>>> <startTime name="createManagedJob">1241655180623</startTime>
>>>>> <endTime name="createManagedJob">1241655181975</endTime>
>>>>> Task submitted: Task(type=JOB_SUBMISSION, identity=urn:0-1-1241655174950 
>>>>> )
>>>>> Progress:  Submitting:1
>>>>>
>>>>> Progress:  Submitting:1
>>>>>
>>>>> Progress:  Submitting:1
>>>>> Progress:  Submitting:1
>>>>> Progress:  Submitting:1
>>>>> Progress:  Submitting:1
>>>>> ^C
>>>>> yizhu at ubuntu:~/swift-0.8/examples/swift$
>>>>>
>>>>> --------------------------------------------------------------
>>>>> [3] see attachment
>>>>> -------------------------------------------------------------
>>>>> [4]
>>>>> tp-x001 torque # qstat -f
>>>>> Job Id: 3.tp-x001.ci.uchicago.edu
>>>>>    Job_Name = STDIN
>>>>>    Job_Owner = jobrun at tp-x001.ci.uchicago.edu
>>>>>    job_state = R
>>>>>    queue = batch
>>>>>    server = tp-x001.ci.uchicago.edu
>>>>>    Checkpoint = u
>>>>>    ctime = Wed May  6 19:22:10 2009
>>>>>    Error_Path = tp-x001.ci.uchicago.edu:/dev/null
>>>>>    exec_host = tp-x002/0
>>>>>    Hold_Types = n
>>>>>    Join_Path = n
>>>>>    Keep_Files = n
>>>>>    Mail_Points = n
>>>>>    mtime = Wed May  6 19:22:10 2009
>>>>>    Output_Path = tp-x001.ci.uchicago.edu:/dev/null
>>>>>    Priority = 0
>>>>>    qtime = Wed May  6 19:22:10 2009
>>>>>    Rerunable = True
>>>>>    Resource_List.neednodes = 1
>>>>>    Resource_List.nodect = 1
>>>>>    Resource_List.nodes = 1
>>>>>    Shell_Path_List = /bin/sh
>>>>>    substate = 40
>>>>>    Variable_List = PBS_O_HOME=/home/jobrun,PBS_O_LOGNAME=jobrun,
>>>>>    PBS_O_PATH=/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/ 
>>>>> local/sb
>>>>>    in:/opt/bin,PBS_O_SHELL=/bin/bash,PBS_SERVER=tp- 
>>>>> x001.ci.uchicago.edu,
>>>>>    PBS_O_HOST=tp-x001.ci.uchicago.edu,
>>>>>    PBS_O_WORKDIR=/home/jobrun/first-20090506-1922-xaandi54,
>>>>>    PBS_O_QUEUE=batch
>>>>>    euser = jobrun
>>>>>    egroup = users
>>>>>    hashname = 3.tp-x001.c
>>>>>    queue_rank = 2
>>>>>    queue_type = E
>>>>>    comment = Job started on Wed May 06 at 19:22
>>>>>    etime = Wed May  6 19:22:10 2009
>>>>>
>>>>> tp-x001 torque #
>>>>>
>>>>>
>>>>> --- 
>>>>> --- 
>>>>> ------------------------------------------------------------------
>>>>>
>>>>> _______________________________________________
>>>>> Swift-devel mailing list
>>>>> Swift-devel at ci.uchicago.edu
>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>>>>
>>>
