[Swift-devel] Swift-issues (PBS+NFS Cluster)
foster at anl.gov
Thu May 7 12:10:10 CDT 2009
I'd suggest we want some microbenchmarks too. Then we want to get
running on Amazon and evaluate S3.
Ian -- from mobile
On May 7, 2009, at 11:54 AM, Michael Wilde <wilde at mcs.anl.gov> wrote:
>
>
> On 5/7/09 11:39 AM, Yi Zhu wrote:
>> Michael Wilde wrote:
>>> Very good!
>>>
>>> Now, what kind of tests can you do next?
>> Next, I will try to get Swift running on Amazon EC2.
>>> Can you exercise the cluster with an interesting workflow?
>> Yes. Is there a more complex sample or tool I can use (rather than
>> first.swift) to test Swift performance? Is there an existing
>> benchmark I can compare against?
>
> I think we can assemble a good benchmark that is also a useful
> application. I suggest we start with OOPS, the protein folding code.
>
> That way, all the test runs can contribute (a bit) to science by
> building a catalog of results that the OOPS team can peruse and use.
>
> How many CPU hours (or VM hours) do you have available in EC2
> through Nimbus? Can we use some of these?
>
>>> How large of a cluster can you assemble in a Nimbus workspace ?
>> Since the VM image I use to test Swift is based on an NFS shared
>> file system, performance may not be satisfactory once we have a
>> large cluster. After I get Swift running on Amazon EC2, I will try
>> to build a dedicated VM image using GPFS or another shared file
>> system you recommend.
>
> I suggest we also look at doing runs that depend little on the
> shared filesystem. There is a Swift option to put the per-job
> working dir on a local disk. Further work can reduce shared disk
> usage further (but that's "research" work).
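> A hypothetical sketch of that direction (the <scratch> element name
> is an assumption on my part; check the docs for the exact option in
> your Swift release): the pool entry could gain a node-local scratch
> path alongside the shared work directory.

```xml
<!-- Hypothetical sites.xml fragment: keep the shared workdirectory
     for staging, but point per-job scratch at node-local disk.
     <scratch> is an assumed element name; verify it against your
     Swift release before relying on it. -->
<pool handle="nb_basecluster">
  <execution url="https://tp-x001.ci.uchicago.edu:8443/wsrf/services/ManagedJobFactoryService" jobManager="PBS" provider="gt4" />
  <workdirectory>/home/jobrun</workdirectory>
  <scratch>/tmp/swift-scratch</scratch>
</pool>
```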
>
> With OOPS, there is very little input data (all of the large input
> data is part of the app install at the moment). Output data is
> modest, about 1MB per job, or so. Allan and Zhao are experimenting
> with ways to batch that up and transfer it back efficiently. That is
> worth pursuing as a longer term goal.
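> To make the batching idea concrete, here is a minimal sketch (not
> Allan and Zhao's actual mechanism, just an illustration) of rolling
> many small per-job outputs into one archive so they come back in a
> single transfer:

```shell
# Sketch: batch many small per-job output files into one tar archive
# so they move in a single transfer instead of one transfer per job.
# The directory layout below is a stand-in for a run's jobs/ tree.
set -e
jobs_dir=$(mktemp -d)
mkdir -p "$jobs_dir/a" "$jobs_dir/b"
echo result-a > "$jobs_dir/a/out.txt"
echo result-b > "$jobs_dir/b/out.txt"

archive="$(mktemp -u).tar.gz"
tar -czf "$archive" -C "$jobs_dir" .      # one archive, relative paths
nfiles=$(tar -tzf "$archive" | grep -c 'out.txt')
echo "batched $nfiles output files into one archive"
```

> At ~1MB per job, a few thousand jobs become one multi-GB transfer,
> which GridFTP handles much better than thousands of small ones.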
>
> How well / how fast does gridftp work, moving data from a Nimbus VM
> (at CI, EC2, and elsewhere) back to a CI filesystem? That would be
> good to test.
>
> I wonder if the Swift engine could pull the data back from a GridFTP
> server running on the worker node?
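> A quick way to get a first number, assuming globus-url-copy is on
> the PATH and a valid proxy is in place (the host and paths below are
> placeholders, and the real transfer line is left commented out):

```shell
# Sketch for timing a GridFTP pull from a Nimbus VM back to a CI
# filesystem. src/dst are placeholder URLs, not real endpoints.
src="gsiftp://tp-x001.ci.uchicago.edu/home/jobrun/outputs.tar.gz"
dst="file:///tmp/outputs.tar.gz"

start=$(date +%s)
# globus-url-copy "$src" "$dst"    # uncomment for the real transfer
end=$(date +%s)
elapsed=$((end - start))           # ~0 here; real once uncommented

# Throughput calculation, shown on sample numbers (100 MB in 8 s):
bytes=104857600
secs=8
mbps=$(awk -v b="$bytes" -v s="$secs" 'BEGIN { printf "%.1f", b/1048576/s }')
echo "throughput: $mbps MB/s"      # prints "throughput: 12.5 MB/s"
```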
>
> All of the above is for longer-term research.
>
>>> Can you aggregate VM's from a few different physical clusters into
>>> one Nimbus workspace?
>> I don't think so. Tim may be able to comment on that.
>>> What's the largest cluster you can assemble with Nimbus?
>> I am not quite sure; I will run some tests on it soon. Since it is
>> an EC2-like cloud, it should be easy to configure a cluster with
>> hundreds of nodes. Tim may be able to comment on that.
>>> Can we try putting Matlab on a Nimbus VM and then running it at an
>>> interesting scale?
>> Yes, I think so.
>
> I will need to locate the pointers we were trying to post on how to
> get started with Swift and Matlab.
>
>
>>> Do we have any free allocations of EC2 to enable testing of this
>>> at Amazon?
>> Yes, Tim already made an AMI on Amazon EC2; my next step is to get
>> Swift running on it. I will try to get this working by tomorrow.
>
> Awesome.
>
> - Mike
>
>
>>>
>>> - Mike
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On 5/6/09 10:10 PM, yizhu wrote:
>>>> problem solved.
>>>>
>>>> I migrated all of my work from my laptop to the ci.uchicago.edu
>>>> server and now everything works.
>>>>
>>>>
>>>> -Yi
>>>> yizhu wrote:
>>>>> Hi all
>>>>>
>>>>> I tried running swift-0.8 over Nimbus Cloud (PBS+NFS), and
>>>>> configured sites.xml and tc.data accordingly.[1]
>>>>>
>>>>> When I ran "$swift first.swift", it got stuck in the
>>>>> "Submitting:1" phase[2]: it kept printing
>>>>> "Progress: Submitting:1" and never returned.
>>>>>
>>>>> Then I sshed into the pbs_server to check the server log[3] and
>>>>> found that the job had been enqueued, run, and successfully
>>>>> dequeued. I also checked the queue status[4] while the job was
>>>>> running and found that the Output_Path is "/dev/null", which I
>>>>> did not expect. (The working directory of Swift is
>>>>> "/home/jobrun".)
>>>>>
>>>>> I think the problem might be that the pbs_server failed to
>>>>> return the output to the correct path (by the way, what is the
>>>>> Output_Path supposed to be? The same as Swift's work directory?),
>>>>> or does anyone have a better idea?
>>>>>
>>>>>
>>>>> Many Thanks.
>>>>>
>>>>>
>>>>> -Yi
>>>>>
>>>>>
>>>>>
>>>>> --------------------------------------------------
>>>>> [1]----sites.xml
>>>>>
>>>>> <pool handle="nb_basecluster">
>>>>> <gridftp url="gsiftp://tp-x001.ci.uchicago.edu" />
>>>>> <execution url="https://tp-x001.ci.uchicago.edu:8443/wsrf/services/ManagedJobFactoryService" jobManager="PBS" provider="gt4" />
>>>>> <workdirectory >/home/jobrun</workdirectory>
>>>>> </pool>
>>>>>
>>>>> ----tc.data
>>>>>
>>>>> nb_basecluster echo /bin/echo INSTALLED INTEL32::LINUX null
>>>>> nb_basecluster cat /bin/cat INSTALLED INTEL32::LINUX null
>>>>> nb_basecluster ls /bin/ls INSTALLED INTEL32::LINUX null
>>>>> nb_basecluster grep /bin/grep INSTALLED INTEL32::LINUX null
>>>>> nb_basecluster sort /bin/sort INSTALLED INTEL32::LINUX null
>>>>> nb_basecluster paste /bin/paste INSTALLED INTEL32::LINUX null
>>>>>
>>>>> ---------------------------------------------------------
>>>>> [2]
>>>>> yizhu at ubuntu:~/swift-0.8/examples/swift$ swift -d first.swift
>>>>> Recompilation suppressed.
>>>>> Using sites file: /home/yizhu/swift-0.8/bin/../etc/sites.xml
>>>>> Using tc.data: /home/yizhu/swift-0.8/bin/../etc/tc.data
>>>>> Setting resources to: {nb_basecluster=nb_basecluster}
>>>>> Swift 0.8 swift-r2448 cog-r2261
>>>>>
>>>>> Swift 0.8 swift-r2448 cog-r2261
>>>>>
>>>>> RUNID id=tag:benc at ci.uchicago.edu,2007:swift:run:20090506-1912-
>>>>> zqd8t5hg
>>>>> RunID: 20090506-1912-zqd8t5hg
>>>>> closed org.griphyn.vdl.mapping.RootDataNode identifier tag:benc at ci.uchicago.edu
>>>>> ,2008:swift:dataset:20090506-1912-lxp69uu2:720000000001 type
>>>>> string value=Hello, world! dataset=unnamed SwiftScript value
>>>>> (closed)
>>>>> ROOTPATH dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912
>>>>> -lxp69uu2:720000000001 path=$
>>>>> VALUE dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912
>>>>> -lxp69uu2:720000000001 VALUE=Hello, world!
>>>>> closed org.griphyn.vdl.mapping.RootDataNode identifier tag:benc at ci.uchicago.edu
>>>>> ,2008:swift:dataset:20090506-1912-lxp69uu2:720000000001 type
>>>>> string value=Hello, world! dataset=unnamed SwiftScript value
>>>>> (closed)
>>>>> ROOTPATH dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912
>>>>> -lxp69uu2:720000000001 path=$
>>>>> VALUE dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912
>>>>> -lxp69uu2:720000000001 VALUE=Hello, world!
>>>>> NEW id=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-
>>>>> lxp69uu2:720000000001
>>>>> Found mapped data org.griphyn.vdl.mapping.RootDataNode
>>>>> identifier tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912
>>>>> -lxp69uu2:720000000002 type messagefile with no value at
>>>>> dataset=outfile (not closed).$
>>>>> NEW id=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-
>>>>> lxp69uu2:720000000002
>>>>> Progress:
>>>>> PROCEDURE thread=0 name=greeting
>>>>> PARAM thread=0 direction=output variable=t provenanceid=tag:benc at ci.uchicago.edu
>>>>> ,2008:swift:dataset:20090506-1912-lxp69uu2:720000000002
>>>>> closed org.griphyn.vdl.mapping.RootDataNode identifier tag:benc at ci.uchicago.edu
>>>>> ,2008:swift:dataset:20090506-1912-lxp69uu2:720000000003 type
>>>>> string value=hello.txt dataset=unnamed SwiftScript value (closed)
>>>>> ROOTPATH dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912
>>>>> -lxp69uu2:720000000003 path=$
>>>>> VALUE dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912
>>>>> -lxp69uu2:720000000003 VALUE=hello.txt
>>>>> closed org.griphyn.vdl.mapping.RootDataNode identifier tag:benc at ci.uchicago.edu
>>>>> ,2008:swift:dataset:20090506-1912-lxp69uu2:720000000003 type
>>>>> string value=hello.txt dataset=unnamed SwiftScript value (closed)
>>>>> ROOTPATH dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912
>>>>> -lxp69uu2:720000000003 path=$
>>>>> VALUE dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912
>>>>> -lxp69uu2:720000000003 VALUE=hello.txt
>>>>> START thread=0 tr=echo
>>>>> Sorted: [nb_basecluster:0.000(1.000):0/1 overload: 0]
>>>>> Rand: 0.6583597597672994, sum: 1.0
>>>>> Next contact: nb_basecluster:0.000(1.000):0/1 overload: 0
>>>>> Progress: Initializing site shared directory:1
>>>>> START host=nb_basecluster - Initializing shared directory
>>>>> multiplyScore(nb_basecluster:0.000(1.000):1/1 overload: 0, -0.01)
>>>>> Old score: 0.000, new score: -0.010
>>>>> No global submit throttle set. Using default (100)
>>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174935)
>>>>> setting status to Submitting
>>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174935)
>>>>> setting status to Submitted
>>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174935)
>>>>> setting status to Active
>>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174935)
>>>>> setting status to Completed
>>>>> multiplyScore(nb_basecluster:-0.010(0.994):1/1 overload: 0, 0.01)
>>>>> Old score: -0.010, new score: 0.000
>>>>> multiplyScore(nb_basecluster:0.000(1.000):1/1 overload: 0, 0.1)
>>>>> Old score: 0.000, new score: 0.100
>>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174935)
>>>>> Completed. Waiting: 0, Running: 0. Heap size: 75M, Heap free:
>>>>> 63M, Max heap: 720M
>>>>> multiplyScore(nb_basecluster:0.100(1.060):1/1 overload: 0, -0.2)
>>>>> Old score: 0.100, new score: -0.100
>>>>> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174938) setting
>>>>> status to Submitting
>>>>> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174938) setting
>>>>> status to Submitted
>>>>> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174938) setting
>>>>> status to Active
>>>>> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174938) setting
>>>>> status to Completed
>>>>> multiplyScore(nb_basecluster:-0.100(0.943):1/1 overload: 0, 0.2)
>>>>> Old score: -0.100, new score: 0.100
>>>>> multiplyScore(nb_basecluster:0.100(1.060):1/1 overload: 0, 0.1)
>>>>> Old score: 0.100, new score: 0.200
>>>>> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174938)
>>>>> Completed. Waiting: 0, Running: 0. Heap size: 75M, Heap free:
>>>>> 60M, Max heap: 720M
>>>>> multiplyScore(nb_basecluster:0.200(1.124):1/1 overload: 0, -0.2)
>>>>> Old score: 0.200, new score: 0.000
>>>>> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174940) setting
>>>>> status to Submitting
>>>>> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174940) setting
>>>>> status to Submitted
>>>>> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174940) setting
>>>>> status to Active
>>>>> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174940) setting
>>>>> status to Completed
>>>>> multiplyScore(nb_basecluster:0.000(1.000):1/1 overload: 0, 0.2)
>>>>> Old score: 0.000, new score: 0.200
>>>>> multiplyScore(nb_basecluster:0.200(1.124):1/1 overload: 0, 0.1)
>>>>> Old score: 0.200, new score: 0.300
>>>>> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174940)
>>>>> Completed. Waiting: 0, Running: 0. Heap size: 75M, Heap free:
>>>>> 59M, Max heap: 720M
>>>>> multiplyScore(nb_basecluster:0.300(1.192):1/1 overload: 0, -0.01)
>>>>> Old score: 0.300, new score: 0.290
>>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174942)
>>>>> setting status to Submitting
>>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174942)
>>>>> setting status to Submitted
>>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174942)
>>>>> setting status to Active
>>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174942)
>>>>> setting status to Completed
>>>>> multiplyScore(nb_basecluster:0.290(1.185):1/1 overload: 0, 0.01)
>>>>> Old score: 0.290, new score: 0.300
>>>>> multiplyScore(nb_basecluster:0.300(1.192):1/1 overload: 0, 0.1)
>>>>> Old score: 0.300, new score: 0.400
>>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174942)
>>>>> Completed. Waiting: 0, Running: 0. Heap size: 75M, Heap free:
>>>>> 59M, Max heap: 720M
>>>>> multiplyScore(nb_basecluster:0.400(1.264):1/1 overload: 0, -0.01)
>>>>> Old score: 0.400, new score: 0.390
>>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174944)
>>>>> setting status to Submitting
>>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174944)
>>>>> setting status to Submitted
>>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174944)
>>>>> setting status to Active
>>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174944)
>>>>> setting status to Completed
>>>>> multiplyScore(nb_basecluster:0.390(1.256):1/1 overload: 0, 0.01)
>>>>> Old score: 0.390, new score: 0.400
>>>>> multiplyScore(nb_basecluster:0.400(1.264):1/1 overload: 0, 0.1)
>>>>> Old score: 0.400, new score: 0.500
>>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174944)
>>>>> Completed. Waiting: 0, Running: 0. Heap size: 75M, Heap free:
>>>>> 59M, Max heap: 720M
>>>>> multiplyScore(nb_basecluster:0.500(1.339):1/1 overload: 0, -0.01)
>>>>> Old score: 0.500, new score: 0.490
>>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174946)
>>>>> setting status to Submitting
>>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174946)
>>>>> setting status to Submitted
>>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174946)
>>>>> setting status to Active
>>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174946)
>>>>> setting status to Completed
>>>>> multiplyScore(nb_basecluster:0.490(1.332):1/1 overload: 0, 0.01)
>>>>> Old score: 0.490, new score: 0.500
>>>>> multiplyScore(nb_basecluster:0.500(1.339):1/1 overload: 0, 0.1)
>>>>> Old score: 0.500, new score: 0.600
>>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174946)
>>>>> Completed. Waiting: 0, Running: 0. Heap size: 75M, Heap free:
>>>>> 59M, Max heap: 720M
>>>>> END host=nb_basecluster - Done initializing shared directory
>>>>> THREAD_ASSOCIATION jobid=echo-kfecpfaj thread=0-1
>>>>> host=nb_basecluster replicationGroup=jfecpfaj
>>>>> Progress: Stage in:1
>>>>> START jobid=echo-kfecpfaj host=nb_basecluster - Initializing
>>>>> directory structure
>>>>> START path= dir=first-20090506-1912-zqd8t5hg/shared - Creating
>>>>> directory structure
>>>>> multiplyScore(nb_basecluster:0.600(1.419):1/1 overload: 0, -0.01)
>>>>> Old score: 0.600, new score: 0.590
>>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174948)
>>>>> setting status to Submitting
>>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174948)
>>>>> setting status to Submitted
>>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174948)
>>>>> setting status to Active
>>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174948)
>>>>> setting status to Completed
>>>>> multiplyScore(nb_basecluster:0.590(1.411):1/1 overload: 0, 0.01)
>>>>> Old score: 0.590, new score: 0.600
>>>>> multiplyScore(nb_basecluster:0.600(1.419):1/1 overload: 0, 0.1)
>>>>> Old score: 0.600, new score: 0.700
>>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174948)
>>>>> Completed. Waiting: 0, Running: 0. Heap size: 75M, Heap free:
>>>>> 59M, Max heap: 720M
>>>>> END jobid=echo-kfecpfaj - Done initializing directory structure
>>>>> START jobid=echo-kfecpfaj - Staging in files
>>>>> END jobid=echo-kfecpfaj - Staging in finished
>>>>> JOB_START jobid=echo-kfecpfaj tr=echo arguments=[Hello, world!]
>>>>> tmpdir=first-20090506-1912-zqd8t5hg/jobs/k/echo-kfecpfaj
>>>>> host=nb_basecluster
>>>>> jobid=echo-kfecpfaj task=Task(type=JOB_SUBMISSION, identity=urn:0-1-1241655174950
>>>>> )
>>>>> multiplyScore(nb_basecluster:0.700(1.503):1/1 overload: 0, -0.2)
>>>>> Old score: 0.700, new score: 0.500
>>>>> Task(type=JOB_SUBMISSION, identity=urn:0-1-1241655174950)
>>>>> setting status to Submitting
>>>>> Submitting task: Task(type=JOB_SUBMISSION, identity=urn:0-1-1241655174950
>>>>> )
>>>>> <startTime name="submission">1241655180260</startTime>
>>>>> <startTime name="createManagedJob">1241655180623</startTime>
>>>>> <endTime name="createManagedJob">1241655181975</endTime
>>>>> Task submitted: Task(type=JOB_SUBMISSION, identity=urn:0-1-1241655174950
>>>>> )
>>>>> Progress: Submitting:1
>>>>>
>>>>> Progress: Submitting:1
>>>>>
>>>>> Progress: Submitting:1
>>>>> Progress: Submitting:1
>>>>> Progress: Submitting:1
>>>>> Progress: Submitting:1
>>>>> ^C
>>>>> yizhu at ubuntu:~/swift-0.8/examples/swift$
>>>>>
>>>>> --------------------------------------------------------------
>>>>> [3] see attachment
>>>>> -------------------------------------------------------------
>>>>> [4]
>>>>> tp-x001 torque # qstat -f
>>>>> Job Id: 3.tp-x001.ci.uchicago.edu
>>>>> Job_Name = STDIN
>>>>> Job_Owner = jobrun at tp-x001.ci.uchicago.edu
>>>>> job_state = R
>>>>> queue = batch
>>>>> server = tp-x001.ci.uchicago.edu
>>>>> Checkpoint = u
>>>>> ctime = Wed May 6 19:22:10 2009
>>>>> Error_Path = tp-x001.ci.uchicago.edu:/dev/null
>>>>> exec_host = tp-x002/0
>>>>> Hold_Types = n
>>>>> Join_Path = n
>>>>> Keep_Files = n
>>>>> Mail_Points = n
>>>>> mtime = Wed May 6 19:22:10 2009
>>>>> Output_Path = tp-x001.ci.uchicago.edu:/dev/null
>>>>> Priority = 0
>>>>> qtime = Wed May 6 19:22:10 2009
>>>>> Rerunable = True
>>>>> Resource_List.neednodes = 1
>>>>> Resource_List.nodect = 1
>>>>> Resource_List.nodes = 1
>>>>> Shell_Path_List = /bin/sh
>>>>> substate = 40
>>>>> Variable_List = PBS_O_HOME=/home/jobrun,PBS_O_LOGNAME=jobrun,
>>>>> PBS_O_PATH=/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/
>>>>> local/sb
>>>>> in:/opt/bin,PBS_O_SHELL=/bin/bash,PBS_SERVER=tp-
>>>>> x001.ci.uchicago.edu,
>>>>> PBS_O_HOST=tp-x001.ci.uchicago.edu,
>>>>> PBS_O_WORKDIR=/home/jobrun/first-20090506-1922-xaandi54,
>>>>> PBS_O_QUEUE=batch
>>>>> euser = jobrun
>>>>> egroup = users
>>>>> hashname = 3.tp-x001.c
>>>>> queue_rank = 2
>>>>> queue_type = E
>>>>> comment = Job started on Wed May 06 at 19:22
>>>>> etime = Wed May 6 19:22:10 2009
>>>>>
>>>>> tp-x001 torque #
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Swift-devel mailing list
>>>>> Swift-devel at ci.uchicago.edu
>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>>>>
>>>