[Swift-devel] Swift-issues (PBS+NFS Cluster)
Yi Zhu
yizhu at cs.uchicago.edu
Thu May 7 11:39:40 CDT 2009
Michael Wilde wrote:
> Very good!
>
> Now, what kind of tests can you do next?
Next, I will try to get Swift running on Amazon EC2.
> Can you exercise the cluster with an interesting workflow?
Yes. Is there a more complex sample or tool I can use (rather than
first.swift) to test Swift performance? Is there a benchmark available
that I can compare against?
> How large of a cluster can you assemble in a Nimbus workspace ?
Since the VM image I use to test Swift is based on an NFS shared file
system, performance may not be satisfactory on a large cluster. Once I
have Swift running on Amazon EC2, I will try to build a dedicated VM
image using GPFS or any other shared file system you recommend.
> Can you aggregate VM's from a few different physical clusters into one
> Nimbus workspace?
I don't think so. Tim may be able to comment on this.
> What's the largest cluster you can assemble with Nimbus?
I am not quite sure; I will run some tests on it soon. Since it is an
EC2-like cloud, it should be easy to configure as a cluster with
hundreds of nodes. Tim may be able to comment on this.
> Can we try putting Matlab on a Nimbus VM and then running it at an
> interesting scale?
Yes, I think so.
> Do we have any free allocations of EC2 to enable testing of this at Amazon?
Yes, Tim has already made an AMI on Amazon EC2; my next step is to get
Swift running on it. I will try to get this working by tomorrow.
>
> - Mike
>
> On 5/6/09 10:10 PM, yizhu wrote:
>> problem solved.
>>
>> I migrated all of my work from my laptop to the ci.uchicago.edu
>> server, and then everything worked.
>>
>>
>> -Yi
>> yizhu wrote:
>>> Hi all
>>>
>>> I tried running swift-0.8 over Nimbus Cloud (PBS+NFS), and
>>> configured sites.xml and tc.data accordingly.[1]
>>>
>>> When I tried to run "$ swift first.swift", it got stuck in the
>>> "Submitting:1" phase[2] (it kept printing "Progress: Submitting:1"
>>> and never returned).
>>>
>>> Then I ssh'ed to pbs_server to check the server log[3] and found that
>>> the job had been enqueued, run, and successfully dequeued. I also
>>> checked the queue status[4] while this job was running and found that
>>> the output_path was "/dev/null", which I did not expect. (The working
>>> directory of Swift is "/home/jobrun".)
>>>
>>> I think the problem might be that pbs_server failed to return the
>>> output to the correct path (by the way, what is the output_path
>>> supposed to be: the same as Swift's work directory?). Or does anyone
>>> have a better idea?
>>>
>>>
>>> Many Thanks.
>>>
>>>
>>> -Yi
>>>
>>>
>>>
>>> --------------------------------------------------
>>> [1]----sites.xml
>>>
>>> <pool handle="nb_basecluster">
>>> <gridftp url="gsiftp://tp-x001.ci.uchicago.edu" />
>>> <execution
>>> url="https://tp-x001.ci.uchicago.edu:8443/wsrf/services/ManagedJobFactoryService"
>>> jobManager="PBS" provider="gt4" />
>>> <workdirectory >/home/jobrun</workdirectory>
>>> </pool>
>>>
>>> ----tc.data
>>>
>>> nb_basecluster echo /bin/echo INSTALLED INTEL32::LINUX null
>>> nb_basecluster cat /bin/cat INSTALLED INTEL32::LINUX null
>>> nb_basecluster ls /bin/ls INSTALLED INTEL32::LINUX null
>>> nb_basecluster grep /bin/grep INSTALLED INTEL32::LINUX null
>>> nb_basecluster sort /bin/sort INSTALLED INTEL32::LINUX null
>>> nb_basecluster paste /bin/paste INSTALLED INTEL32::LINUX null
>>>
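One quick sanity check on a sites.xml / tc.data pair like the one quoted above is to confirm that every site name used in tc.data matches a pool handle in sites.xml. The following is only a minimal Python sketch: the <config> wrapper element, the whitespace-separated tc.data layout (site first), and the helper names pool_handles and tc_sites are assumptions made for illustration, not something taken from Swift itself.

```python
import xml.etree.ElementTree as ET

# Sample inputs copied from the quoted report; the <config> root is assumed.
SITES_XML = """<config>
<pool handle="nb_basecluster">
  <gridftp url="gsiftp://tp-x001.ci.uchicago.edu" />
  <execution
    url="https://tp-x001.ci.uchicago.edu:8443/wsrf/services/ManagedJobFactoryService"
    jobManager="PBS" provider="gt4" />
  <workdirectory>/home/jobrun</workdirectory>
</pool>
</config>"""

TC_DATA = """nb_basecluster echo /bin/echo INSTALLED INTEL32::LINUX null
nb_basecluster cat /bin/cat INSTALLED INTEL32::LINUX null
"""


def pool_handles(xml_text):
    """Return the set of handle attributes on <pool> elements."""
    root = ET.fromstring(xml_text)
    handles = set()
    for element in root.iter():
        # Drop any "{namespace}" prefix so namespaced files also match.
        if element.tag.rsplit("}", 1)[-1] == "pool":
            handles.add(element.get("handle"))
    return handles


def tc_sites(tc_text):
    """Return the set of site names (first field) used in tc.data lines."""
    sites = set()
    for line in tc_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) >= 3:  # site, transformation, path at minimum
            sites.add(fields[0])
    return sites


# Sites referenced in tc.data that have no matching pool in sites.xml.
unmatched = tc_sites(TC_DATA) - pool_handles(SITES_XML)
print("unmatched sites:", sorted(unmatched))
```

For the two files in this report the handles agree, so a mismatch here does not appear to explain the hang.
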
>>> ---------------------------------------------------------
>>> [2]
>>> yizhu at ubuntu:~/swift-0.8/examples/swift$ swift -d first.swift
>>> Recompilation suppressed.
>>> Using sites file: /home/yizhu/swift-0.8/bin/../etc/sites.xml
>>> Using tc.data: /home/yizhu/swift-0.8/bin/../etc/tc.data
>>> Setting resources to: {nb_basecluster=nb_basecluster}
>>> Swift 0.8 swift-r2448 cog-r2261
>>>
>>> Swift 0.8 swift-r2448 cog-r2261
>>>
>>> RUNID id=tag:benc at ci.uchicago.edu,2007:swift:run:20090506-1912-zqd8t5hg
>>> RunID: 20090506-1912-zqd8t5hg
>>> closed org.griphyn.vdl.mapping.RootDataNode identifier
>>> tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000001
>>> type string value=Hello, world! dataset=unnamed SwiftScript value
>>> (closed)
>>> ROOTPATH
>>> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000001
>>> path=$
>>> VALUE
>>> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000001
>>> VALUE=Hello, world!
>>> closed org.griphyn.vdl.mapping.RootDataNode identifier
>>> tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000001
>>> type string value=Hello, world! dataset=unnamed SwiftScript value
>>> (closed)
>>> ROOTPATH
>>> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000001
>>> path=$
>>> VALUE
>>> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000001
>>> VALUE=Hello, world!
>>> NEW
>>> id=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000001
>>>
>>> Found mapped data org.griphyn.vdl.mapping.RootDataNode identifier
>>> tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000002
>>> type messagefile with no value at dataset=outfile (not closed).$
>>> NEW
>>> id=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000002
>>>
>>> Progress:
>>> PROCEDURE thread=0 name=greeting
>>> PARAM thread=0 direction=output variable=t
>>> provenanceid=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000002
>>>
>>> closed org.griphyn.vdl.mapping.RootDataNode identifier
>>> tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000003
>>> type string value=hello.txt dataset=unnamed SwiftScript value (closed)
>>> ROOTPATH
>>> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000003
>>> path=$
>>> VALUE
>>> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000003
>>> VALUE=hello.txt
>>> closed org.griphyn.vdl.mapping.RootDataNode identifier
>>> tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000003
>>> type string value=hello.txt dataset=unnamed SwiftScript value (closed)
>>> ROOTPATH
>>> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000003
>>> path=$
>>> VALUE
>>> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000003
>>> VALUE=hello.txt
>>> START thread=0 tr=echo
>>> Sorted: [nb_basecluster:0.000(1.000):0/1 overload: 0]
>>> Rand: 0.6583597597672994, sum: 1.0
>>> Next contact: nb_basecluster:0.000(1.000):0/1 overload: 0
>>> Progress: Initializing site shared directory:1
>>> START host=nb_basecluster - Initializing shared directory
>>> multiplyScore(nb_basecluster:0.000(1.000):1/1 overload: 0, -0.01)
>>> Old score: 0.000, new score: -0.010
>>> No global submit throttle set. Using default (100)
>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174935) setting
>>> status to Submitting
>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174935) setting
>>> status to Submitted
>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174935) setting
>>> status to Active
>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174935) setting
>>> status to Completed
>>> multiplyScore(nb_basecluster:-0.010(0.994):1/1 overload: 0, 0.01)
>>> Old score: -0.010, new score: 0.000
>>> multiplyScore(nb_basecluster:0.000(1.000):1/1 overload: 0, 0.1)
>>> Old score: 0.000, new score: 0.100
>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174935) Completed.
>>> Waiting: 0, Running: 0. Heap size: 75M, Heap free: 63M, Max heap: 720M
>>> multiplyScore(nb_basecluster:0.100(1.060):1/1 overload: 0, -0.2)
>>> Old score: 0.100, new score: -0.100
>>> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174938) setting
>>> status to Submitting
>>> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174938) setting
>>> status to Submitted
>>> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174938) setting
>>> status to Active
>>> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174938) setting
>>> status to Completed
>>> multiplyScore(nb_basecluster:-0.100(0.943):1/1 overload: 0, 0.2)
>>> Old score: -0.100, new score: 0.100
>>> multiplyScore(nb_basecluster:0.100(1.060):1/1 overload: 0, 0.1)
>>> Old score: 0.100, new score: 0.200
>>> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174938) Completed.
>>> Waiting: 0, Running: 0. Heap size: 75M, Heap free: 60M, Max heap: 720M
>>> multiplyScore(nb_basecluster:0.200(1.124):1/1 overload: 0, -0.2)
>>> Old score: 0.200, new score: 0.000
>>> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174940) setting
>>> status to Submitting
>>> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174940) setting
>>> status to Submitted
>>> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174940) setting
>>> status to Active
>>> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174940) setting
>>> status to Completed
>>> multiplyScore(nb_basecluster:0.000(1.000):1/1 overload: 0, 0.2)
>>> Old score: 0.000, new score: 0.200
>>> multiplyScore(nb_basecluster:0.200(1.124):1/1 overload: 0, 0.1)
>>> Old score: 0.200, new score: 0.300
>>> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174940) Completed.
>>> Waiting: 0, Running: 0. Heap size: 75M, Heap free: 59M, Max heap: 720M
>>> multiplyScore(nb_basecluster:0.300(1.192):1/1 overload: 0, -0.01)
>>> Old score: 0.300, new score: 0.290
>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174942) setting
>>> status to Submitting
>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174942) setting
>>> status to Submitted
>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174942) setting
>>> status to Active
>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174942) setting
>>> status to Completed
>>> multiplyScore(nb_basecluster:0.290(1.185):1/1 overload: 0, 0.01)
>>> Old score: 0.290, new score: 0.300
>>> multiplyScore(nb_basecluster:0.300(1.192):1/1 overload: 0, 0.1)
>>> Old score: 0.300, new score: 0.400
>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174942) Completed.
>>> Waiting: 0, Running: 0. Heap size: 75M, Heap free: 59M, Max heap: 720M
>>> multiplyScore(nb_basecluster:0.400(1.264):1/1 overload: 0, -0.01)
>>> Old score: 0.400, new score: 0.390
>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174944) setting
>>> status to Submitting
>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174944) setting
>>> status to Submitted
>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174944) setting
>>> status to Active
>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174944) setting
>>> status to Completed
>>> multiplyScore(nb_basecluster:0.390(1.256):1/1 overload: 0, 0.01)
>>> Old score: 0.390, new score: 0.400
>>> multiplyScore(nb_basecluster:0.400(1.264):1/1 overload: 0, 0.1)
>>> Old score: 0.400, new score: 0.500
>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174944) Completed.
>>> Waiting: 0, Running: 0. Heap size: 75M, Heap free: 59M, Max heap: 720M
>>> multiplyScore(nb_basecluster:0.500(1.339):1/1 overload: 0, -0.01)
>>> Old score: 0.500, new score: 0.490
>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174946) setting
>>> status to Submitting
>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174946) setting
>>> status to Submitted
>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174946) setting
>>> status to Active
>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174946) setting
>>> status to Completed
>>> multiplyScore(nb_basecluster:0.490(1.332):1/1 overload: 0, 0.01)
>>> Old score: 0.490, new score: 0.500
>>> multiplyScore(nb_basecluster:0.500(1.339):1/1 overload: 0, 0.1)
>>> Old score: 0.500, new score: 0.600
>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174946) Completed.
>>> Waiting: 0, Running: 0. Heap size: 75M, Heap free: 59M, Max heap: 720M
>>> END host=nb_basecluster - Done initializing shared directory
>>> THREAD_ASSOCIATION jobid=echo-kfecpfaj thread=0-1 host=nb_basecluster
>>> replicationGroup=jfecpfaj
>>> Progress: Stage in:1
>>> START jobid=echo-kfecpfaj host=nb_basecluster - Initializing
>>> directory structure
>>> START path= dir=first-20090506-1912-zqd8t5hg/shared - Creating
>>> directory structure
>>> multiplyScore(nb_basecluster:0.600(1.419):1/1 overload: 0, -0.01)
>>> Old score: 0.600, new score: 0.590
>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174948) setting
>>> status to Submitting
>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174948) setting
>>> status to Submitted
>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174948) setting
>>> status to Active
>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174948) setting
>>> status to Completed
>>> multiplyScore(nb_basecluster:0.590(1.411):1/1 overload: 0, 0.01)
>>> Old score: 0.590, new score: 0.600
>>> multiplyScore(nb_basecluster:0.600(1.419):1/1 overload: 0, 0.1)
>>> Old score: 0.600, new score: 0.700
>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174948) Completed.
>>> Waiting: 0, Running: 0. Heap size: 75M, Heap free: 59M, Max heap: 720M
>>> END jobid=echo-kfecpfaj - Done initializing directory structure
>>> START jobid=echo-kfecpfaj - Staging in files
>>> END jobid=echo-kfecpfaj - Staging in finished
>>> JOB_START jobid=echo-kfecpfaj tr=echo arguments=[Hello, world!]
>>> tmpdir=first-20090506-1912-zqd8t5hg/jobs/k/echo-kfecpfaj
>>> host=nb_basecluster
>>> jobid=echo-kfecpfaj task=Task(type=JOB_SUBMISSION,
>>> identity=urn:0-1-1241655174950)
>>> multiplyScore(nb_basecluster:0.700(1.503):1/1 overload: 0, -0.2)
>>> Old score: 0.700, new score: 0.500
>>> Task(type=JOB_SUBMISSION, identity=urn:0-1-1241655174950) setting
>>> status to Submitting
>>> Submitting task: Task(type=JOB_SUBMISSION,
>>> identity=urn:0-1-1241655174950)
>>> <startTime name="submission">1241655180260</startTime>
>>> <startTime name="createManagedJob">1241655180623</startTime>
>>> <endTime name="createManagedJob">1241655181975</endTime>
>>> Task submitted: Task(type=JOB_SUBMISSION,
>>> identity=urn:0-1-1241655174950)
>>> Progress: Submitting:1
>>>
>>> Progress: Submitting:1
>>>
>>> Progress: Submitting:1
>>> Progress: Submitting:1
>>> Progress: Submitting:1
>>> Progress: Submitting:1
>>> ^C
>>> yizhu at ubuntu:~/swift-0.8/examples/swift$
>>>
>>> --------------------------------------------------------------
>>> [3] see attachment
>>> -------------------------------------------------------------
>>> [4]
>>> tp-x001 torque # qstat -f
>>> Job Id: 3.tp-x001.ci.uchicago.edu
>>> Job_Name = STDIN
>>> Job_Owner = jobrun at tp-x001.ci.uchicago.edu
>>> job_state = R
>>> queue = batch
>>> server = tp-x001.ci.uchicago.edu
>>> Checkpoint = u
>>> ctime = Wed May 6 19:22:10 2009
>>> Error_Path = tp-x001.ci.uchicago.edu:/dev/null
>>> exec_host = tp-x002/0
>>> Hold_Types = n
>>> Join_Path = n
>>> Keep_Files = n
>>> Mail_Points = n
>>> mtime = Wed May 6 19:22:10 2009
>>> Output_Path = tp-x001.ci.uchicago.edu:/dev/null
>>> Priority = 0
>>> qtime = Wed May 6 19:22:10 2009
>>> Rerunable = True
>>> Resource_List.neednodes = 1
>>> Resource_List.nodect = 1
>>> Resource_List.nodes = 1
>>> Shell_Path_List = /bin/sh
>>> substate = 40
>>> Variable_List = PBS_O_HOME=/home/jobrun,PBS_O_LOGNAME=jobrun,
>>> PBS_O_PATH=/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sb
>>> in:/opt/bin,PBS_O_SHELL=/bin/bash,PBS_SERVER=tp-x001.ci.uchicago.edu,
>>> PBS_O_HOST=tp-x001.ci.uchicago.edu,
>>> PBS_O_WORKDIR=/home/jobrun/first-20090506-1922-xaandi54,
>>> PBS_O_QUEUE=batch
>>> euser = jobrun
>>> egroup = users
>>> hashname = 3.tp-x001.c
>>> queue_rank = 2
>>> queue_type = E
>>> comment = Job started on Wed May 06 at 19:22
>>> etime = Wed May 6 19:22:10 2009
>>>
>>> tp-x001 torque #
>>>
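The Output_Path / Error_Path check described in the report can be automated when there are many jobs to inspect. The following is a minimal Python sketch that parses text in the "qstat -f" layout shown above and lists which of those two attributes point at /dev/null. The function names parse_qstat_full and discarded_streams are hypothetical, the sample is a shortened copy of the output in [4], and the parser assumes its input contains only qstat output (no shell prompts).

```python
# Shortened copy of the "qstat -f" output quoted in [4].
SAMPLE = """Job Id: 3.tp-x001.ci.uchicago.edu
    Job_Name = STDIN
    job_state = R
    Error_Path = tp-x001.ci.uchicago.edu:/dev/null
    Output_Path = tp-x001.ci.uchicago.edu:/dev/null
    Variable_List = PBS_O_HOME=/home/jobrun,PBS_O_LOGNAME=jobrun,
        PBS_O_QUEUE=batch
"""


def parse_qstat_full(text):
    """Parse qstat -f style text into {job_id: {attribute: value}}."""
    jobs = {}
    attrs = None      # attribute dict of the job currently being read
    last_key = None   # last attribute seen, for wrapped continuation lines
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("Job Id:"):
            job_id = line.split(":", 1)[1].strip()
            attrs = jobs.setdefault(job_id, {})
            last_key = None
        elif attrs is None:
            continue  # ignore anything before the first job header
        elif " = " in line:
            key, _, value = line.partition(" = ")
            attrs[key] = value
            last_key = key
        elif last_key is not None:
            # A wrapped value: append it to the previous attribute.
            attrs[last_key] += line
    return jobs


def discarded_streams(attrs):
    """Return which of Output_Path / Error_Path resolve to /dev/null."""
    flagged = []
    for key in ("Output_Path", "Error_Path"):
        value = attrs.get(key, "")
        # Values look like "host:/path"; compare only the path part.
        if value.rsplit(":", 1)[-1] == "/dev/null":
            flagged.append(key)
    return flagged


for job_id, attrs in parse_qstat_full(SAMPLE).items():
    print(job_id, discarded_streams(attrs))
```

This only reports what qstat shows; it does not establish that a /dev/null output path was the cause of the hang.
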
>>>
>>> ------------------------------------------------------------------------
>>>
>>> _______________________________________________
>>> Swift-devel mailing list
>>> Swift-devel at ci.uchicago.edu
>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>>
>