[Swift-devel] Swift-issues (PBS+NFS Cluster)
Michael Wilde
wilde at mcs.anl.gov
Thu May 7 11:54:06 CDT 2009
On 5/7/09 11:39 AM, Yi Zhu wrote:
> Michael Wilde wrote:
>> Very good!
>>
>> Now, what kind of tests can you do next?
>
> Next, I will try to get Swift running on Amazon EC2.
>
>> Can you exercise the cluster with an interesting workflow?
>
> Yes. Are there any complex samples/tools I can use (rather than
> first.swift) to test Swift performance? Is there any benchmark available
> that I can compare against?
I think we can assemble a good benchmark that is also a useful
application. I suggest we start with OOPS, the protein folding code.
That way, all the test runs can contribute (a bit) to science by
building a catalog of results that the OOPS team can peruse and use.
How many CPU hours (or VM hours) do you have available in EC2 through
Nimbus? Can we use some of these?
>> How large of a cluster can you assemble in a Nimbus workspace ?
>
> Since the VM image I use to test Swift is based on an NFS shared file
> system, performance may not be satisfactory if we have a large cluster.
> After I get Swift running on Amazon EC2, I will try to make a dedicated
> VM image using GPFS or another shared file system you recommend.
I suggest we also look at doing runs that depend little on the shared
filesystem. There is a Swift option to put the per-job working dir on a
local disk. Additional work can reduce shared-disk usage even more (but
that's "research" work).
With OOPS, there is very little input data (all of the large input data
is part of the app install at the moment). Output data is modest, about
1MB per job, or so. Allan and Zhao are experimenting with ways to batch
that up and transfer it back efficiently. That is worth pursuing as a
longer term goal.
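The batching idea could look something like the following on the worker side; the file names and paths here are purely illustrative stand-ins, not the actual OOPS output layout:

```shell
# Illustrative only: create stand-ins for the ~1MB per-job output files.
mkdir -p /tmp/oops-batch && cd /tmp/oops-batch
for i in 1 2 3; do
    echo "result $i" > "job$i.out"   # fake per-job output file
done
# Batch all outputs into a single archive so one transfer replaces many
# small ones, which matters a lot over a WAN link.
tar czf outputs.tar.gz job*.out
tar tzf outputs.tar.gz               # list the archived files to verify
```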
How well / how fast does GridFTP work, moving data from a Nimbus VM (at
CI, EC2, and elsewhere) back to a CI filesystem? That would be good to test.
I wonder if the Swift engine could pull the data back from a GridFTP
server running on the worker node?
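A timing test could be as simple as the command below. The destination endpoint and paths are hypothetical, so the sketch just prints the command rather than running it; `-vb` prints the transfer rate and `-p 4` uses four parallel TCP streams:

```shell
# Hypothetical GridFTP timing test: push the batched archive from a worker
# VM back to CI storage. Endpoint and paths below are made up.
SRC="file:///tmp/outputs.tar.gz"
DST="gsiftp://tp-x001.ci.uchicago.edu/home/jobrun/outputs.tar.gz"
# Print the command so it can be copy-pasted once credentials are in place.
echo "time globus-url-copy -vb -p 4 $SRC $DST"
```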
All of the above is for longer-term research.
>> Can you aggregate VM's from a few different physical clusters into one
>> Nimbus workspace?
>
> I don't think so. Tim may comment on it.
>
>
>> What's the largest cluster you can assemble with Nimbus?
>
> I am not quite sure; I will do some tests on it soon. Since it is an
> EC2-like cloud, it should be easy to configure a cluster with hundreds
> of nodes. Tim may comment on it.
>
>
>> Can we try putting Matlab on a Nimbus VM and then running it at an
>> interesting scale?
>
> Yes, I think so.
I will need to locate the pointers we were trying to post on how to get
started with Swift and Matlab.
>> Do we have any free allocations of EC2 to enable testing of this at
>> Amazon?
>
> Yes, Tim already made an AMI on Amazon EC2; my next step is to get
> Swift running on it. I will try to get this working by tomorrow.
Awesome.
- Mike
>>
>> - Mike
>>
>> On 5/6/09 10:10 PM, yizhu wrote:
>>> problem solved.
>>>
>>> I migrated all of my work from my laptop to the ci.uchicago.edu
>>> server, and then everything works.
>>>
>>>
>>> -Yi
>>> yizhu wrote:
>>>> Hi all
>>>>
>>>> I tried running swift-0.8 over Nimbus Cloud (PBS+NFS), and
>>>> configured sites.xml and tc.data accordingly.[1]
>>>>
>>>> When I tried to run "$swift first.swift", it got stuck in the
>>>> "Submitting:1" phase[2] (it kept showing "Progress: Submitting:1"
>>>> and never returned).
>>>>
>>>> Then I ssh'd to the pbs_server to check the server log[3] and found
>>>> that the job had been enqueued, run, and successfully dequeued. I
>>>> also checked the queue status[4] while the job was running and found
>>>> that the output_path is "/dev/null", which I did not expect. (The
>>>> working directory of Swift is "/home/jobrun".)
>>>>
>>>> I think the problem might be that the pbs_server failed to return
>>>> the output to the correct path (by the way, what is the output_path
>>>> supposed to be, the same working directory as Swift?), or does
>>>> anyone have a better idea?
>>>>
>>>>
>>>> Many Thanks.
>>>>
>>>>
>>>> -Yi
>>>>
>>>>
>>>>
>>>> --------------------------------------------------
>>>> [1]----sites.xml
>>>>
>>>> <pool handle="nb_basecluster">
>>>> <gridftp url="gsiftp://tp-x001.ci.uchicago.edu" />
>>>> <execution
>>>> url="https://tp-x001.ci.uchicago.edu:8443/wsrf/services/ManagedJobFactoryService"
>>>> jobManager="PBS" provider="gt4" />
>>>> <workdirectory>/home/jobrun</workdirectory>
>>>> </pool>
>>>>
>>>> ----tc.data
>>>>
>>>> nb_basecluster  echo   /bin/echo   INSTALLED  INTEL32::LINUX  null
>>>> nb_basecluster  cat    /bin/cat    INSTALLED  INTEL32::LINUX  null
>>>> nb_basecluster  ls     /bin/ls     INSTALLED  INTEL32::LINUX  null
>>>> nb_basecluster  grep   /bin/grep   INSTALLED  INTEL32::LINUX  null
>>>> nb_basecluster  sort   /bin/sort   INSTALLED  INTEL32::LINUX  null
>>>> nb_basecluster  paste  /bin/paste  INSTALLED  INTEL32::LINUX  null
>>>>
>>>> ---------------------------------------------------------
>>>> [2]
>>>> yizhu at ubuntu:~/swift-0.8/examples/swift$ swift -d first.swift
>>>> Recompilation suppressed.
>>>> Using sites file: /home/yizhu/swift-0.8/bin/../etc/sites.xml
>>>> Using tc.data: /home/yizhu/swift-0.8/bin/../etc/tc.data
>>>> Setting resources to: {nb_basecluster=nb_basecluster}
>>>> Swift 0.8 swift-r2448 cog-r2261
>>>>
>>>> Swift 0.8 swift-r2448 cog-r2261
>>>>
>>>> RUNID id=tag:benc at ci.uchicago.edu,2007:swift:run:20090506-1912-zqd8t5hg
>>>> RunID: 20090506-1912-zqd8t5hg
>>>> closed org.griphyn.vdl.mapping.RootDataNode identifier
>>>> tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000001
>>>> type string value=Hello, world! dataset=unnamed SwiftScript value
>>>> (closed)
>>>> ROOTPATH
>>>> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000001
>>>> path=$
>>>> VALUE
>>>> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000001
>>>> VALUE=Hello, world!
>>>> closed org.griphyn.vdl.mapping.RootDataNode identifier
>>>> tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000001
>>>> type string value=Hello, world! dataset=unnamed SwiftScript value
>>>> (closed)
>>>> ROOTPATH
>>>> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000001
>>>> path=$
>>>> VALUE
>>>> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000001
>>>> VALUE=Hello, world!
>>>> NEW
>>>> id=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000001
>>>>
>>>> Found mapped data org.griphyn.vdl.mapping.RootDataNode identifier
>>>> tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000002
>>>> type messagefile with no value at dataset=outfile (not closed).$
>>>> NEW
>>>> id=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000002
>>>>
>>>> Progress:
>>>> PROCEDURE thread=0 name=greeting
>>>> PARAM thread=0 direction=output variable=t
>>>> provenanceid=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000002
>>>>
>>>> closed org.griphyn.vdl.mapping.RootDataNode identifier
>>>> tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000003
>>>> type string value=hello.txt dataset=unnamed SwiftScript value (closed)
>>>> ROOTPATH
>>>> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000003
>>>> path=$
>>>> VALUE
>>>> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000003
>>>> VALUE=hello.txt
>>>> closed org.griphyn.vdl.mapping.RootDataNode identifier
>>>> tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000003
>>>> type string value=hello.txt dataset=unnamed SwiftScript value (closed)
>>>> ROOTPATH
>>>> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000003
>>>> path=$
>>>> VALUE
>>>> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000003
>>>> VALUE=hello.txt
>>>> START thread=0 tr=echo
>>>> Sorted: [nb_basecluster:0.000(1.000):0/1 overload: 0]
>>>> Rand: 0.6583597597672994, sum: 1.0
>>>> Next contact: nb_basecluster:0.000(1.000):0/1 overload: 0
>>>> Progress: Initializing site shared directory:1
>>>> START host=nb_basecluster - Initializing shared directory
>>>> multiplyScore(nb_basecluster:0.000(1.000):1/1 overload: 0, -0.01)
>>>> Old score: 0.000, new score: -0.010
>>>> No global submit throttle set. Using default (100)
>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174935) setting
>>>> status to Submitting
>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174935) setting
>>>> status to Submitted
>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174935) setting
>>>> status to Active
>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174935) setting
>>>> status to Completed
>>>> multiplyScore(nb_basecluster:-0.010(0.994):1/1 overload: 0, 0.01)
>>>> Old score: -0.010, new score: 0.000
>>>> multiplyScore(nb_basecluster:0.000(1.000):1/1 overload: 0, 0.1)
>>>> Old score: 0.000, new score: 0.100
>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174935) Completed.
>>>> Waiting: 0, Running: 0. Heap size: 75M, Heap free: 63M, Max heap: 720M
>>>> multiplyScore(nb_basecluster:0.100(1.060):1/1 overload: 0, -0.2)
>>>> Old score: 0.100, new score: -0.100
>>>> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174938) setting
>>>> status to Submitting
>>>> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174938) setting
>>>> status to Submitted
>>>> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174938) setting
>>>> status to Active
>>>> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174938) setting
>>>> status to Completed
>>>> multiplyScore(nb_basecluster:-0.100(0.943):1/1 overload: 0, 0.2)
>>>> Old score: -0.100, new score: 0.100
>>>> multiplyScore(nb_basecluster:0.100(1.060):1/1 overload: 0, 0.1)
>>>> Old score: 0.100, new score: 0.200
>>>> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174938) Completed.
>>>> Waiting: 0, Running: 0. Heap size: 75M, Heap free: 60M, Max heap: 720M
>>>> multiplyScore(nb_basecluster:0.200(1.124):1/1 overload: 0, -0.2)
>>>> Old score: 0.200, new score: 0.000
>>>> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174940) setting
>>>> status to Submitting
>>>> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174940) setting
>>>> status to Submitted
>>>> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174940) setting
>>>> status to Active
>>>> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174940) setting
>>>> status to Completed
>>>> multiplyScore(nb_basecluster:0.000(1.000):1/1 overload: 0, 0.2)
>>>> Old score: 0.000, new score: 0.200
>>>> multiplyScore(nb_basecluster:0.200(1.124):1/1 overload: 0, 0.1)
>>>> Old score: 0.200, new score: 0.300
>>>> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174940) Completed.
>>>> Waiting: 0, Running: 0. Heap size: 75M, Heap free: 59M, Max heap: 720M
>>>> multiplyScore(nb_basecluster:0.300(1.192):1/1 overload: 0, -0.01)
>>>> Old score: 0.300, new score: 0.290
>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174942) setting
>>>> status to Submitting
>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174942) setting
>>>> status to Submitted
>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174942) setting
>>>> status to Active
>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174942) setting
>>>> status to Completed
>>>> multiplyScore(nb_basecluster:0.290(1.185):1/1 overload: 0, 0.01)
>>>> Old score: 0.290, new score: 0.300
>>>> multiplyScore(nb_basecluster:0.300(1.192):1/1 overload: 0, 0.1)
>>>> Old score: 0.300, new score: 0.400
>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174942) Completed.
>>>> Waiting: 0, Running: 0. Heap size: 75M, Heap free: 59M, Max heap: 720M
>>>> multiplyScore(nb_basecluster:0.400(1.264):1/1 overload: 0, -0.01)
>>>> Old score: 0.400, new score: 0.390
>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174944) setting
>>>> status to Submitting
>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174944) setting
>>>> status to Submitted
>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174944) setting
>>>> status to Active
>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174944) setting
>>>> status to Completed
>>>> multiplyScore(nb_basecluster:0.390(1.256):1/1 overload: 0, 0.01)
>>>> Old score: 0.390, new score: 0.400
>>>> multiplyScore(nb_basecluster:0.400(1.264):1/1 overload: 0, 0.1)
>>>> Old score: 0.400, new score: 0.500
>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174944) Completed.
>>>> Waiting: 0, Running: 0. Heap size: 75M, Heap free: 59M, Max heap: 720M
>>>> multiplyScore(nb_basecluster:0.500(1.339):1/1 overload: 0, -0.01)
>>>> Old score: 0.500, new score: 0.490
>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174946) setting
>>>> status to Submitting
>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174946) setting
>>>> status to Submitted
>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174946) setting
>>>> status to Active
>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174946) setting
>>>> status to Completed
>>>> multiplyScore(nb_basecluster:0.490(1.332):1/1 overload: 0, 0.01)
>>>> Old score: 0.490, new score: 0.500
>>>> multiplyScore(nb_basecluster:0.500(1.339):1/1 overload: 0, 0.1)
>>>> Old score: 0.500, new score: 0.600
>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174946) Completed.
>>>> Waiting: 0, Running: 0. Heap size: 75M, Heap free: 59M, Max heap: 720M
>>>> END host=nb_basecluster - Done initializing shared directory
>>>> THREAD_ASSOCIATION jobid=echo-kfecpfaj thread=0-1
>>>> host=nb_basecluster replicationGroup=jfecpfaj
>>>> Progress: Stage in:1
>>>> START jobid=echo-kfecpfaj host=nb_basecluster - Initializing
>>>> directory structure
>>>> START path= dir=first-20090506-1912-zqd8t5hg/shared - Creating
>>>> directory structure
>>>> multiplyScore(nb_basecluster:0.600(1.419):1/1 overload: 0, -0.01)
>>>> Old score: 0.600, new score: 0.590
>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174948) setting
>>>> status to Submitting
>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174948) setting
>>>> status to Submitted
>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174948) setting
>>>> status to Active
>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174948) setting
>>>> status to Completed
>>>> multiplyScore(nb_basecluster:0.590(1.411):1/1 overload: 0, 0.01)
>>>> Old score: 0.590, new score: 0.600
>>>> multiplyScore(nb_basecluster:0.600(1.419):1/1 overload: 0, 0.1)
>>>> Old score: 0.600, new score: 0.700
>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174948) Completed.
>>>> Waiting: 0, Running: 0. Heap size: 75M, Heap free: 59M, Max heap: 720M
>>>> END jobid=echo-kfecpfaj - Done initializing directory structure
>>>> START jobid=echo-kfecpfaj - Staging in files
>>>> END jobid=echo-kfecpfaj - Staging in finished
>>>> JOB_START jobid=echo-kfecpfaj tr=echo arguments=[Hello, world!]
>>>> tmpdir=first-20090506-1912-zqd8t5hg/jobs/k/echo-kfecpfaj
>>>> host=nb_basecluster
>>>> jobid=echo-kfecpfaj task=Task(type=JOB_SUBMISSION,
>>>> identity=urn:0-1-1241655174950)
>>>> multiplyScore(nb_basecluster:0.700(1.503):1/1 overload: 0, -0.2)
>>>> Old score: 0.700, new score: 0.500
>>>> Task(type=JOB_SUBMISSION, identity=urn:0-1-1241655174950) setting
>>>> status to Submitting
>>>> Submitting task: Task(type=JOB_SUBMISSION,
>>>> identity=urn:0-1-1241655174950)
>>>> <startTime name="submission">1241655180260</startTime>
>>>> <startTime name="createManagedJob">1241655180623</startTime>
>>>> <endTime name="createManagedJob">1241655181975</endTime>
>>>> Task submitted: Task(type=JOB_SUBMISSION,
>>>> identity=urn:0-1-1241655174950)
>>>> Progress: Submitting:1
>>>>
>>>> Progress: Submitting:1
>>>>
>>>> Progress: Submitting:1
>>>> Progress: Submitting:1
>>>> Progress: Submitting:1
>>>> Progress: Submitting:1
>>>> ^C
>>>> yizhu at ubuntu:~/swift-0.8/examples/swift$
>>>>
>>>> --------------------------------------------------------------
>>>> [3] see attachment
>>>> -------------------------------------------------------------
>>>> [4]
>>>> tp-x001 torque # qstat -f
>>>> Job Id: 3.tp-x001.ci.uchicago.edu
>>>> Job_Name = STDIN
>>>> Job_Owner = jobrun at tp-x001.ci.uchicago.edu
>>>> job_state = R
>>>> queue = batch
>>>> server = tp-x001.ci.uchicago.edu
>>>> Checkpoint = u
>>>> ctime = Wed May 6 19:22:10 2009
>>>> Error_Path = tp-x001.ci.uchicago.edu:/dev/null
>>>> exec_host = tp-x002/0
>>>> Hold_Types = n
>>>> Join_Path = n
>>>> Keep_Files = n
>>>> Mail_Points = n
>>>> mtime = Wed May 6 19:22:10 2009
>>>> Output_Path = tp-x001.ci.uchicago.edu:/dev/null
>>>> Priority = 0
>>>> qtime = Wed May 6 19:22:10 2009
>>>> Rerunable = True
>>>> Resource_List.neednodes = 1
>>>> Resource_List.nodect = 1
>>>> Resource_List.nodes = 1
>>>> Shell_Path_List = /bin/sh
>>>> substate = 40
>>>> Variable_List = PBS_O_HOME=/home/jobrun,PBS_O_LOGNAME=jobrun,
>>>>
>>>> PBS_O_PATH=/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sb
>>>>
>>>> in:/opt/bin,PBS_O_SHELL=/bin/bash,PBS_SERVER=tp-x001.ci.uchicago.edu,
>>>> PBS_O_HOST=tp-x001.ci.uchicago.edu,
>>>> PBS_O_WORKDIR=/home/jobrun/first-20090506-1922-xaandi54,
>>>> PBS_O_QUEUE=batch
>>>> euser = jobrun
>>>> egroup = users
>>>> hashname = 3.tp-x001.c
>>>> queue_rank = 2
>>>> queue_type = E
>>>> comment = Job started on Wed May 06 at 19:22
>>>> etime = Wed May 6 19:22:10 2009
>>>>
>>>> tp-x001 torque #
>>>>
>>>>
>>>> ------------------------------------------------------------------------
>>>>
>>>>
>>>> _______________________________________________
>>>> Swift-devel mailing list
>>>> Swift-devel at ci.uchicago.edu
>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>>>