[Swift-devel] Swift-issues (PBS+NFS Cluster)

Michael Wilde wilde at mcs.anl.gov
Thu May 7 11:54:06 CDT 2009



On 5/7/09 11:39 AM, Yi Zhu wrote:
> Michael Wilde wrote:
>> Very good!
>>
>> Now, what kind of tests can you do next?
> 
> Next, I will try to get Swift running on Amazon EC2.
> 
>> Can you exercise the cluster with an interesting workflow?
> 
> Yes. Are there any complex samples/tools I can use (rather than 
> first.swift) to test Swift performance? Is there a benchmark available 
> that I can compare against?

I think we can assemble a good benchmark that is also a useful 
application. I suggest we start with OOPS, the protein folding code.

That way, all the test runs can contribute (a bit) to science by 
building a catalog of results that the OOPS team can peruse and use.

How many CPU hours (or VM hours) do you have available in EC2 through 
Nimbus? Can we use some of these?

>> How large of a cluster can you assemble in a Nimbus workspace ?
> 
> Since the vm-image I use to test Swift is based on an NFS shared file 
> system, the performance may not be satisfactory if we have a 
> large-scale cluster. After I get Swift running on Amazon EC2, I will 
> try to make a dedicated vm-image using GPFS or any other shared file 
> system you recommend.

I suggest we also look at doing runs that depend little on the shared 
filesystem. There is a Swift option to put the per-job working dir on a 
local disk. Further work can reduce shared-disk usage even more (but 
that's "research" work).

With OOPS, there is very little input data (all of the large input data 
is part of the app install at the moment). Output data is modest, about 
1 MB per job. Allan and Zhao are experimenting with ways to batch that 
up and transfer it back efficiently. That is worth pursuing as a 
longer-term goal.

How well and how fast does GridFTP work when moving data from a Nimbus VM 
(at CI, EC2, and elsewhere) back to a CI filesystem? That would be good to test.

I wonder if the Swift engine could pull the data back from a GridFTP 
server running on the worker node?

All of the above is for longer-term research.

>> Can you aggregate VM's from a few different physical clusters into one 
>> Nimbus workspace?
> 
> I don't think so. Tim may comment on it.
> 
> 
>> What's the largest cluster you can assemble with Nimbus?
> 
> I am not quite sure; I will do some tests on it soon. Since it is an 
> EC2-like cloud, it should be easy to configure a cluster with 
> hundreds of nodes. Tim may comment on it.
> 
> 
>> Can we try putting Matlab on a Nimbus VM and then running it at an 
>> interesting scale?
> 
> Yes, I think so.

I will need to locate the pointers we were trying to post on how to get 
started with Swift and Matlab.


>> Do we have any free allocations of EC2 to enable testing of this at 
>> Amazon?
> 
> Yes, Tim has already made an AMI on Amazon EC2; my next step is to get 
> Swift running on it. I will try to get this working by tomorrow.

Awesome.

- Mike


> 
> 
>>
>> - Mike
>>
>> On 5/6/09 10:10 PM, yizhu wrote:
>>> problem solved.
>>>
>>> I migrated all of my work from my laptop to the ci.uchicago.edu server 
>>> and then everything worked.
>>>
>>>
>>> -Yi
>>> yizhu wrote:
>>>> Hi all,
>>>>
>>>> I tried running swift-0.8 over a Nimbus cloud (PBS+NFS), and 
>>>> configured sites.xml and tc.data accordingly. [1]
>>>>
>>>> When I tried to run "$ swift first.swift", it got stuck in the 
>>>> "Submitting:1" phase [2] (it kept printing 
>>>> "Progress: Submitting:1" and never returned).
>>>>
>>>> Then I ssh'd to the pbs_server to check the server log [3] and found 
>>>> that the job had been enqueued, run, and successfully dequeued. I also 
>>>> checked the queue status [4] while the job was running and found that 
>>>> the Output_Path is "/dev/null", which I did not expect. (The 
>>>> working directory of Swift is "/home/jobrun".)
>>>>
>>>> I think the problem might be that the pbs_server failed to return the 
>>>> output to the correct path (by the way, what is the Output_Path 
>>>> supposed to be, the same as Swift's work directory?), or does anyone 
>>>> have a better idea?
>>>>
>>>>
>>>> Many Thanks.
>>>>
>>>>
>>>> -Yi
>>>>
>>>>
>>>>
>>>> --------------------------------------------------
>>>> [1]----sites.xml
>>>>
>>>>  <pool handle="nb_basecluster">
>>>>     <gridftp url="gsiftp://tp-x001.ci.uchicago.edu" />
>>>>     <execution 
>>>> url="https://tp-x001.ci.uchicago.edu:8443/wsrf/services/ManagedJobFactoryService" 
>>>> jobManager="PBS" provider="gt4" />
>>>>     <workdirectory>/home/jobrun</workdirectory>
>>>>   </pool>
>>>>
>>>> ----tc.data
>>>>
>>>> nb_basecluster    echo     /bin/echo     INSTALLED    INTEL32::LINUX    null
>>>> nb_basecluster    cat      /bin/cat      INSTALLED    INTEL32::LINUX    null
>>>> nb_basecluster    ls       /bin/ls       INSTALLED    INTEL32::LINUX    null
>>>> nb_basecluster    grep     /bin/grep     INSTALLED    INTEL32::LINUX    null
>>>> nb_basecluster    sort     /bin/sort     INSTALLED    INTEL32::LINUX    null
>>>> nb_basecluster    paste    /bin/paste    INSTALLED    INTEL32::LINUX    null
>>>>
>>>> ---------------------------------------------------------
>>>> [2]
>>>> yizhu at ubuntu:~/swift-0.8/examples/swift$ swift -d first.swift
>>>> Recompilation suppressed.
>>>> Using sites file: /home/yizhu/swift-0.8/bin/../etc/sites.xml
>>>> Using tc.data: /home/yizhu/swift-0.8/bin/../etc/tc.data
>>>> Setting resources to: {nb_basecluster=nb_basecluster}
>>>> Swift 0.8 swift-r2448 cog-r2261
>>>>
>>>> Swift 0.8 swift-r2448 cog-r2261
>>>>
>>>> RUNID id=tag:benc at ci.uchicago.edu,2007:swift:run:20090506-1912-zqd8t5hg
>>>> RunID: 20090506-1912-zqd8t5hg
>>>> closed org.griphyn.vdl.mapping.RootDataNode identifier 
>>>> tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000001 
>>>> type string value=Hello, world! dataset=unnamed SwiftScript value 
>>>> (closed)
>>>> ROOTPATH 
>>>> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000001 
>>>> path=$
>>>> VALUE 
>>>> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000001 
>>>> VALUE=Hello, world!
>>>> closed org.griphyn.vdl.mapping.RootDataNode identifier 
>>>> tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000001 
>>>> type string value=Hello, world! dataset=unnamed SwiftScript value 
>>>> (closed)
>>>> ROOTPATH 
>>>> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000001 
>>>> path=$
>>>> VALUE 
>>>> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000001 
>>>> VALUE=Hello, world!
>>>> NEW 
>>>> id=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000001 
>>>>
>>>> Found mapped data org.griphyn.vdl.mapping.RootDataNode identifier 
>>>> tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000002 
>>>> type messagefile with no value at dataset=outfile (not closed).$
>>>> NEW 
>>>> id=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000002 
>>>>
>>>> Progress:
>>>> PROCEDURE thread=0 name=greeting
>>>> PARAM thread=0 direction=output variable=t 
>>>> provenanceid=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000002 
>>>>
>>>> closed org.griphyn.vdl.mapping.RootDataNode identifier 
>>>> tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000003 
>>>> type string value=hello.txt dataset=unnamed SwiftScript value (closed)
>>>> ROOTPATH 
>>>> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000003 
>>>> path=$
>>>> VALUE 
>>>> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000003 
>>>> VALUE=hello.txt
>>>> closed org.griphyn.vdl.mapping.RootDataNode identifier 
>>>> tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000003 
>>>> type string value=hello.txt dataset=unnamed SwiftScript value (closed)
>>>> ROOTPATH 
>>>> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000003 
>>>> path=$
>>>> VALUE 
>>>> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000003 
>>>> VALUE=hello.txt
>>>> START thread=0 tr=echo
>>>> Sorted: [nb_basecluster:0.000(1.000):0/1 overload: 0]
>>>> Rand: 0.6583597597672994, sum: 1.0
>>>> Next contact: nb_basecluster:0.000(1.000):0/1 overload: 0
>>>> Progress:  Initializing site shared directory:1
>>>> START host=nb_basecluster - Initializing shared directory
>>>> multiplyScore(nb_basecluster:0.000(1.000):1/1 overload: 0, -0.01)
>>>> Old score: 0.000, new score: -0.010
>>>> No global submit throttle set. Using default (100)
>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174935) setting 
>>>> status to Submitting
>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174935) setting 
>>>> status to Submitted
>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174935) setting 
>>>> status to Active
>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174935) setting 
>>>> status to Completed
>>>> multiplyScore(nb_basecluster:-0.010(0.994):1/1 overload: 0, 0.01)
>>>> Old score: -0.010, new score: 0.000
>>>> multiplyScore(nb_basecluster:0.000(1.000):1/1 overload: 0, 0.1)
>>>> Old score: 0.000, new score: 0.100
>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174935) Completed. 
>>>> Waiting: 0, Running: 0. Heap size: 75M, Heap free: 63M, Max heap: 720M
>>>> multiplyScore(nb_basecluster:0.100(1.060):1/1 overload: 0, -0.2)
>>>> Old score: 0.100, new score: -0.100
>>>> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174938) setting 
>>>> status to Submitting
>>>> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174938) setting 
>>>> status to Submitted
>>>> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174938) setting 
>>>> status to Active
>>>> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174938) setting 
>>>> status to Completed
>>>> multiplyScore(nb_basecluster:-0.100(0.943):1/1 overload: 0, 0.2)
>>>> Old score: -0.100, new score: 0.100
>>>> multiplyScore(nb_basecluster:0.100(1.060):1/1 overload: 0, 0.1)
>>>> Old score: 0.100, new score: 0.200
>>>> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174938) Completed. 
>>>> Waiting: 0, Running: 0. Heap size: 75M, Heap free: 60M, Max heap: 720M
>>>> multiplyScore(nb_basecluster:0.200(1.124):1/1 overload: 0, -0.2)
>>>> Old score: 0.200, new score: 0.000
>>>> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174940) setting 
>>>> status to Submitting
>>>> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174940) setting 
>>>> status to Submitted
>>>> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174940) setting 
>>>> status to Active
>>>> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174940) setting 
>>>> status to Completed
>>>> multiplyScore(nb_basecluster:0.000(1.000):1/1 overload: 0, 0.2)
>>>> Old score: 0.000, new score: 0.200
>>>> multiplyScore(nb_basecluster:0.200(1.124):1/1 overload: 0, 0.1)
>>>> Old score: 0.200, new score: 0.300
>>>> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174940) Completed. 
>>>> Waiting: 0, Running: 0. Heap size: 75M, Heap free: 59M, Max heap: 720M
>>>> multiplyScore(nb_basecluster:0.300(1.192):1/1 overload: 0, -0.01)
>>>> Old score: 0.300, new score: 0.290
>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174942) setting 
>>>> status to Submitting
>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174942) setting 
>>>> status to Submitted
>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174942) setting 
>>>> status to Active
>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174942) setting 
>>>> status to Completed
>>>> multiplyScore(nb_basecluster:0.290(1.185):1/1 overload: 0, 0.01)
>>>> Old score: 0.290, new score: 0.300
>>>> multiplyScore(nb_basecluster:0.300(1.192):1/1 overload: 0, 0.1)
>>>> Old score: 0.300, new score: 0.400
>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174942) Completed. 
>>>> Waiting: 0, Running: 0. Heap size: 75M, Heap free: 59M, Max heap: 720M
>>>> multiplyScore(nb_basecluster:0.400(1.264):1/1 overload: 0, -0.01)
>>>> Old score: 0.400, new score: 0.390
>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174944) setting 
>>>> status to Submitting
>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174944) setting 
>>>> status to Submitted
>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174944) setting 
>>>> status to Active
>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174944) setting 
>>>> status to Completed
>>>> multiplyScore(nb_basecluster:0.390(1.256):1/1 overload: 0, 0.01)
>>>> Old score: 0.390, new score: 0.400
>>>> multiplyScore(nb_basecluster:0.400(1.264):1/1 overload: 0, 0.1)
>>>> Old score: 0.400, new score: 0.500
>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174944) Completed. 
>>>> Waiting: 0, Running: 0. Heap size: 75M, Heap free: 59M, Max heap: 720M
>>>> multiplyScore(nb_basecluster:0.500(1.339):1/1 overload: 0, -0.01)
>>>> Old score: 0.500, new score: 0.490
>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174946) setting 
>>>> status to Submitting
>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174946) setting 
>>>> status to Submitted
>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174946) setting 
>>>> status to Active
>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174946) setting 
>>>> status to Completed
>>>> multiplyScore(nb_basecluster:0.490(1.332):1/1 overload: 0, 0.01)
>>>> Old score: 0.490, new score: 0.500
>>>> multiplyScore(nb_basecluster:0.500(1.339):1/1 overload: 0, 0.1)
>>>> Old score: 0.500, new score: 0.600
>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174946) Completed. 
>>>> Waiting: 0, Running: 0. Heap size: 75M, Heap free: 59M, Max heap: 720M
>>>> END host=nb_basecluster - Done initializing shared directory
>>>> THREAD_ASSOCIATION jobid=echo-kfecpfaj thread=0-1 
>>>> host=nb_basecluster replicationGroup=jfecpfaj
>>>> Progress:  Stage in:1
>>>> START jobid=echo-kfecpfaj host=nb_basecluster - Initializing 
>>>> directory structure
>>>> START path= dir=first-20090506-1912-zqd8t5hg/shared - Creating 
>>>> directory structure
>>>> multiplyScore(nb_basecluster:0.600(1.419):1/1 overload: 0, -0.01)
>>>> Old score: 0.600, new score: 0.590
>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174948) setting 
>>>> status to Submitting
>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174948) setting 
>>>> status to Submitted
>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174948) setting 
>>>> status to Active
>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174948) setting 
>>>> status to Completed
>>>> multiplyScore(nb_basecluster:0.590(1.411):1/1 overload: 0, 0.01)
>>>> Old score: 0.590, new score: 0.600
>>>> multiplyScore(nb_basecluster:0.600(1.419):1/1 overload: 0, 0.1)
>>>> Old score: 0.600, new score: 0.700
>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174948) Completed. 
>>>> Waiting: 0, Running: 0. Heap size: 75M, Heap free: 59M, Max heap: 720M
>>>> END jobid=echo-kfecpfaj - Done initializing directory structure
>>>> START jobid=echo-kfecpfaj - Staging in files
>>>> END jobid=echo-kfecpfaj - Staging in finished
>>>> JOB_START jobid=echo-kfecpfaj tr=echo arguments=[Hello, world!] 
>>>> tmpdir=first-20090506-1912-zqd8t5hg/jobs/k/echo-kfecpfaj 
>>>> host=nb_basecluster
>>>> jobid=echo-kfecpfaj task=Task(type=JOB_SUBMISSION, 
>>>> identity=urn:0-1-1241655174950)
>>>> multiplyScore(nb_basecluster:0.700(1.503):1/1 overload: 0, -0.2)
>>>> Old score: 0.700, new score: 0.500
>>>> Task(type=JOB_SUBMISSION, identity=urn:0-1-1241655174950) setting 
>>>> status to Submitting
>>>> Submitting task: Task(type=JOB_SUBMISSION, 
>>>> identity=urn:0-1-1241655174950)
>>>> <startTime name="submission">1241655180260</startTime>
>>>> <startTime name="createManagedJob">1241655180623</startTime>
>>>> <endTime name="createManagedJob">1241655181975</endTime
>>>> Task submitted: Task(type=JOB_SUBMISSION, 
>>>> identity=urn:0-1-1241655174950)
>>>> Progress:  Submitting:1
>>>>
>>>> Progress:  Submitting:1
>>>>
>>>> Progress:  Submitting:1
>>>> Progress:  Submitting:1
>>>> Progress:  Submitting:1
>>>> Progress:  Submitting:1
>>>> ^C
>>>> yizhu at ubuntu:~/swift-0.8/examples/swift$
>>>>
>>>> --------------------------------------------------------------
>>>> [3] see attachment
>>>> -------------------------------------------------------------
>>>> [4]
>>>> tp-x001 torque # qstat -f
>>>> Job Id: 3.tp-x001.ci.uchicago.edu
>>>>     Job_Name = STDIN
>>>>     Job_Owner = jobrun at tp-x001.ci.uchicago.edu
>>>>     job_state = R
>>>>     queue = batch
>>>>     server = tp-x001.ci.uchicago.edu
>>>>     Checkpoint = u
>>>>     ctime = Wed May  6 19:22:10 2009
>>>>     Error_Path = tp-x001.ci.uchicago.edu:/dev/null
>>>>     exec_host = tp-x002/0
>>>>     Hold_Types = n
>>>>     Join_Path = n
>>>>     Keep_Files = n
>>>>     Mail_Points = n
>>>>     mtime = Wed May  6 19:22:10 2009
>>>>     Output_Path = tp-x001.ci.uchicago.edu:/dev/null
>>>>     Priority = 0
>>>>     qtime = Wed May  6 19:22:10 2009
>>>>     Rerunable = True
>>>>     Resource_List.neednodes = 1
>>>>     Resource_List.nodect = 1
>>>>     Resource_List.nodes = 1
>>>>     Shell_Path_List = /bin/sh
>>>>     substate = 40
>>>>     Variable_List = PBS_O_HOME=/home/jobrun,PBS_O_LOGNAME=jobrun,
>>>>     
>>>> PBS_O_PATH=/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sb
>>>>     
>>>> in:/opt/bin,PBS_O_SHELL=/bin/bash,PBS_SERVER=tp-x001.ci.uchicago.edu,
>>>>     PBS_O_HOST=tp-x001.ci.uchicago.edu,
>>>>     PBS_O_WORKDIR=/home/jobrun/first-20090506-1922-xaandi54,
>>>>     PBS_O_QUEUE=batch
>>>>     euser = jobrun
>>>>     egroup = users
>>>>     hashname = 3.tp-x001.c
>>>>     queue_rank = 2
>>>>     queue_type = E
>>>>     comment = Job started on Wed May 06 at 19:22
>>>>     etime = Wed May  6 19:22:10 2009
>>>>
>>>> tp-x001 torque #
>>>>
>>>>
>>>> ------------------------------------------------------------------------ 
>>>>
>>>>
>>>> _______________________________________________
>>>> Swift-devel mailing list
>>>> Swift-devel at ci.uchicago.edu
>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>>>


