[Swift-devel] Swift-issues (PBS+NFS Cluster)

Ioan Raicu iraicu at cs.uchicago.edu
Thu May 7 12:34:34 CDT 2009


Hi Yi,
Back in late 2007, Catalin, Tim, Borja and I started a side project 
whose goal was to evaluate how well Swift could run on EC2. We got quite 
far, up to the point of running a complex application (MolDyn) at small 
scales of about 10 molecules (100s to 1000s of jobs) on about 
10 EC2 instances. Our environment was 
Swift+Falkon+Workspace_Service+EC2, and our virtual cluster had NFS 
mounted to enable Swift to operate unmodified. At the time, a shared 
file system was a requirement for Swift. Also, the Workspace Service is 
now Nimbus. Unfortunately, Catalin ended up finding a job, and we never 
really finished the project. We never got to the point of comparing the 
real app, MolDyn, between, say, the TeraGrid and EC2, and we never got to 
run real stress tests for throughput, etc. In case you find anything 
useful, here is the wiki we maintained during our work related to 
Swift and EC2: http://dev.globus.org/wiki/Incubator/Falkon/EC2.

This is really exciting work, good luck!

Ioan

Yi Zhu wrote:
> Michael Wilde wrote:
>> Very good!
>>
>> Now, what kind of tests can you do next?
>
> Next, I will try to get Swift running on Amazon EC2.
>
>> Can you exercise the cluster with an interesting workflow?
>
> Yes. Is there any complex sample or tool I can use (rather than 
> first.swift) to test Swift performance? Is there any benchmark 
> available that I can compare against?
>
>> How large of a cluster can you assemble in a Nimbus workspace ?
>
> Since the VM image I use to test Swift is based on an NFS shared file 
> system, the performance may not be satisfactory if we have a 
> large-scale cluster. After I get Swift running on Amazon EC2, I will 
> try to make a dedicated VM image using GPFS or any other shared file 
> system you recommend.
>
>
>> Can you aggregate VM's from a few different physical clusters into 
>> one Nimbus workspace?
>
> I don't think so. Tim may comment on it.
>
>
>> What's the largest cluster you can assemble with Nimbus?
>
> I am not quite sure; I will do some tests on it soon. Since it is an 
> EC2-like cloud, it should easily be configured as a cluster with 
> hundreds of nodes. Tim may comment on it.
>
>
>> Can we try putting Matlab on a Nimbus VM and then running it at an 
>> interesting scale?
>
> Yes, I think so.
>
>> Do we have any free allocations of EC2 to enable testing of this at 
>> Amazon?
>
> Yes, Tim already made an AMI on Amazon EC2; my next step is to get 
> Swift running on it. I will try to get this stuff to work by tomorrow.
>
>
>
>>
>> - Mike
>>
>> On 5/6/09 10:10 PM, yizhu wrote:
>>> problem solved.
>>>
>>> I migrated all of my work from my laptop to the ci.uchicago.edu 
>>> server, and then everything worked.
>>>
>>>
>>> -Yi
>>> yizhu wrote:
>>>> Hi all
>>>>
>>>> I tried running swift-0.8 over the Nimbus cloud (PBS+NFS), and 
>>>> configured sites.xml and tc.data accordingly.[1]
>>>>
>>>> When I tried to run "$swift first.swift", it got stuck in the 
>>>> "Submitting:1" phase[2] (it kept repeatedly showing 
>>>> "Progress:Submitting:1" and never returned).
>>>>
>>>> Then I ssh'd to pbs_server to check the server log[3] and found that 
>>>> the job had been enqueued, run, and successfully dequeued. I also 
>>>> checked the queue status[4] while this job was running and found that 
>>>> the output_path is "/dev/null", which I did not expect. (The 
>>>> working directory of Swift is "/home/jobrun".)
>>>>
>>>> I think the problem might be that the pbs_server failed to return the 
>>>> output to the correct path (btw, what is the output_path supposed to 
>>>> be, the same as the work directory of Swift?). Or does anyone have a 
>>>> better idea?
>>>>
>>>>
>>>> Many Thanks.
>>>>
>>>>
>>>> -Yi
>>>>
>>>>
>>>>
>>>> --------------------------------------------------
>>>> [1]----sites.xml
>>>>
>>>>  <pool handle="nb_basecluster">
>>>>     <gridftp url="gsiftp://tp-x001.ci.uchicago.edu" />
>>>>     <execution 
>>>> url="https://tp-x001.ci.uchicago.edu:8443/wsrf/services/ManagedJobFactoryService" 
>>>> jobManager="PBS" provider="gt4" />
>>>>     <workdirectory >/home/jobrun</workdirectory>
>>>>   </pool>
>>>>
>>>> ----tc.data
>>>>
>>>> nb_basecluster    echo     /bin/echo     INSTALLED    INTEL32::LINUX    null
>>>> nb_basecluster    cat      /bin/cat      INSTALLED    INTEL32::LINUX    null
>>>> nb_basecluster    ls       /bin/ls       INSTALLED    INTEL32::LINUX    null
>>>> nb_basecluster    grep     /bin/grep     INSTALLED    INTEL32::LINUX    null
>>>> nb_basecluster    sort     /bin/sort     INSTALLED    INTEL32::LINUX    null
>>>> nb_basecluster    paste    /bin/paste    INSTALLED    INTEL32::LINUX    null
>>>>
>>>> ---------------------------------------------------------
>>>> [2]
>>>> yizhu at ubuntu:~/swift-0.8/examples/swift$ swift -d first.swift
>>>> Recompilation suppressed.
>>>> Using sites file: /home/yizhu/swift-0.8/bin/../etc/sites.xml
>>>> Using tc.data: /home/yizhu/swift-0.8/bin/../etc/tc.data
>>>> Setting resources to: {nb_basecluster=nb_basecluster}
>>>> Swift 0.8 swift-r2448 cog-r2261
>>>>
>>>> Swift 0.8 swift-r2448 cog-r2261
>>>>
>>>> RUNID 
>>>> id=tag:benc at ci.uchicago.edu,2007:swift:run:20090506-1912-zqd8t5hg
>>>> RunID: 20090506-1912-zqd8t5hg
>>>> closed org.griphyn.vdl.mapping.RootDataNode identifier 
>>>> tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000001 
>>>> type string value=Hello, world! dataset=unnamed SwiftScript value 
>>>> (closed)
>>>> ROOTPATH 
>>>> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000001 
>>>> path=$
>>>> VALUE 
>>>> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000001 
>>>> VALUE=Hello, world!
>>>> closed org.griphyn.vdl.mapping.RootDataNode identifier 
>>>> tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000001 
>>>> type string value=Hello, world! dataset=unnamed SwiftScript value 
>>>> (closed)
>>>> ROOTPATH 
>>>> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000001 
>>>> path=$
>>>> VALUE 
>>>> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000001 
>>>> VALUE=Hello, world!
>>>> NEW 
>>>> id=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000001 
>>>>
>>>> Found mapped data org.griphyn.vdl.mapping.RootDataNode identifier 
>>>> tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000002 
>>>> type messagefile with no value at dataset=outfile (not closed).$
>>>> NEW 
>>>> id=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000002 
>>>>
>>>> Progress:
>>>> PROCEDURE thread=0 name=greeting
>>>> PARAM thread=0 direction=output variable=t 
>>>> provenanceid=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000002 
>>>>
>>>> closed org.griphyn.vdl.mapping.RootDataNode identifier 
>>>> tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000003 
>>>> type string value=hello.txt dataset=unnamed SwiftScript value (closed)
>>>> ROOTPATH 
>>>> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000003 
>>>> path=$
>>>> VALUE 
>>>> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000003 
>>>> VALUE=hello.txt
>>>> closed org.griphyn.vdl.mapping.RootDataNode identifier 
>>>> tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000003 
>>>> type string value=hello.txt dataset=unnamed SwiftScript value (closed)
>>>> ROOTPATH 
>>>> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000003 
>>>> path=$
>>>> VALUE 
>>>> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000003 
>>>> VALUE=hello.txt
>>>> START thread=0 tr=echo
>>>> Sorted: [nb_basecluster:0.000(1.000):0/1 overload: 0]
>>>> Rand: 0.6583597597672994, sum: 1.0
>>>> Next contact: nb_basecluster:0.000(1.000):0/1 overload: 0
>>>> Progress:  Initializing site shared directory:1
>>>> START host=nb_basecluster - Initializing shared directory
>>>> multiplyScore(nb_basecluster:0.000(1.000):1/1 overload: 0, -0.01)
>>>> Old score: 0.000, new score: -0.010
>>>> No global submit throttle set. Using default (100)
>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174935) setting 
>>>> status to Submitting
>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174935) setting 
>>>> status to Submitted
>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174935) setting 
>>>> status to Active
>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174935) setting 
>>>> status to Completed
>>>> multiplyScore(nb_basecluster:-0.010(0.994):1/1 overload: 0, 0.01)
>>>> Old score: -0.010, new score: 0.000
>>>> multiplyScore(nb_basecluster:0.000(1.000):1/1 overload: 0, 0.1)
>>>> Old score: 0.000, new score: 0.100
>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174935) 
>>>> Completed. Waiting: 0, Running: 0. Heap size: 75M, Heap free: 63M, 
>>>> Max heap: 720M
>>>> multiplyScore(nb_basecluster:0.100(1.060):1/1 overload: 0, -0.2)
>>>> Old score: 0.100, new score: -0.100
>>>> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174938) setting 
>>>> status to Submitting
>>>> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174938) setting 
>>>> status to Submitted
>>>> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174938) setting 
>>>> status to Active
>>>> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174938) setting 
>>>> status to Completed
>>>> multiplyScore(nb_basecluster:-0.100(0.943):1/1 overload: 0, 0.2)
>>>> Old score: -0.100, new score: 0.100
>>>> multiplyScore(nb_basecluster:0.100(1.060):1/1 overload: 0, 0.1)
>>>> Old score: 0.100, new score: 0.200
>>>> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174938) Completed. 
>>>> Waiting: 0, Running: 0. Heap size: 75M, Heap free: 60M, Max heap: 720M
>>>> multiplyScore(nb_basecluster:0.200(1.124):1/1 overload: 0, -0.2)
>>>> Old score: 0.200, new score: 0.000
>>>> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174940) setting 
>>>> status to Submitting
>>>> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174940) setting 
>>>> status to Submitted
>>>> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174940) setting 
>>>> status to Active
>>>> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174940) setting 
>>>> status to Completed
>>>> multiplyScore(nb_basecluster:0.000(1.000):1/1 overload: 0, 0.2)
>>>> Old score: 0.000, new score: 0.200
>>>> multiplyScore(nb_basecluster:0.200(1.124):1/1 overload: 0, 0.1)
>>>> Old score: 0.200, new score: 0.300
>>>> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174940) Completed. 
>>>> Waiting: 0, Running: 0. Heap size: 75M, Heap free: 59M, Max heap: 720M
>>>> multiplyScore(nb_basecluster:0.300(1.192):1/1 overload: 0, -0.01)
>>>> Old score: 0.300, new score: 0.290
>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174942) setting 
>>>> status to Submitting
>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174942) setting 
>>>> status to Submitted
>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174942) setting 
>>>> status to Active
>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174942) setting 
>>>> status to Completed
>>>> multiplyScore(nb_basecluster:0.290(1.185):1/1 overload: 0, 0.01)
>>>> Old score: 0.290, new score: 0.300
>>>> multiplyScore(nb_basecluster:0.300(1.192):1/1 overload: 0, 0.1)
>>>> Old score: 0.300, new score: 0.400
>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174942) 
>>>> Completed. Waiting: 0, Running: 0. Heap size: 75M, Heap free: 59M, 
>>>> Max heap: 720M
>>>> multiplyScore(nb_basecluster:0.400(1.264):1/1 overload: 0, -0.01)
>>>> Old score: 0.400, new score: 0.390
>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174944) setting 
>>>> status to Submitting
>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174944) setting 
>>>> status to Submitted
>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174944) setting 
>>>> status to Active
>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174944) setting 
>>>> status to Completed
>>>> multiplyScore(nb_basecluster:0.390(1.256):1/1 overload: 0, 0.01)
>>>> Old score: 0.390, new score: 0.400
>>>> multiplyScore(nb_basecluster:0.400(1.264):1/1 overload: 0, 0.1)
>>>> Old score: 0.400, new score: 0.500
>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174944) 
>>>> Completed. Waiting: 0, Running: 0. Heap size: 75M, Heap free: 59M, 
>>>> Max heap: 720M
>>>> multiplyScore(nb_basecluster:0.500(1.339):1/1 overload: 0, -0.01)
>>>> Old score: 0.500, new score: 0.490
>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174946) setting 
>>>> status to Submitting
>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174946) setting 
>>>> status to Submitted
>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174946) setting 
>>>> status to Active
>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174946) setting 
>>>> status to Completed
>>>> multiplyScore(nb_basecluster:0.490(1.332):1/1 overload: 0, 0.01)
>>>> Old score: 0.490, new score: 0.500
>>>> multiplyScore(nb_basecluster:0.500(1.339):1/1 overload: 0, 0.1)
>>>> Old score: 0.500, new score: 0.600
>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174946) 
>>>> Completed. Waiting: 0, Running: 0. Heap size: 75M, Heap free: 59M, 
>>>> Max heap: 720M
>>>> END host=nb_basecluster - Done initializing shared directory
>>>> THREAD_ASSOCIATION jobid=echo-kfecpfaj thread=0-1 
>>>> host=nb_basecluster replicationGroup=jfecpfaj
>>>> Progress:  Stage in:1
>>>> START jobid=echo-kfecpfaj host=nb_basecluster - Initializing 
>>>> directory structure
>>>> START path= dir=first-20090506-1912-zqd8t5hg/shared - Creating 
>>>> directory structure
>>>> multiplyScore(nb_basecluster:0.600(1.419):1/1 overload: 0, -0.01)
>>>> Old score: 0.600, new score: 0.590
>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174948) setting 
>>>> status to Submitting
>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174948) setting 
>>>> status to Submitted
>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174948) setting 
>>>> status to Active
>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174948) setting 
>>>> status to Completed
>>>> multiplyScore(nb_basecluster:0.590(1.411):1/1 overload: 0, 0.01)
>>>> Old score: 0.590, new score: 0.600
>>>> multiplyScore(nb_basecluster:0.600(1.419):1/1 overload: 0, 0.1)
>>>> Old score: 0.600, new score: 0.700
>>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174948) 
>>>> Completed. Waiting: 0, Running: 0. Heap size: 75M, Heap free: 59M, 
>>>> Max heap: 720M
>>>> END jobid=echo-kfecpfaj - Done initializing directory structure
>>>> START jobid=echo-kfecpfaj - Staging in files
>>>> END jobid=echo-kfecpfaj - Staging in finished
>>>> JOB_START jobid=echo-kfecpfaj tr=echo arguments=[Hello, world!] 
>>>> tmpdir=first-20090506-1912-zqd8t5hg/jobs/k/echo-kfecpfaj 
>>>> host=nb_basecluster
>>>> jobid=echo-kfecpfaj task=Task(type=JOB_SUBMISSION, 
>>>> identity=urn:0-1-1241655174950)
>>>> multiplyScore(nb_basecluster:0.700(1.503):1/1 overload: 0, -0.2)
>>>> Old score: 0.700, new score: 0.500
>>>> Task(type=JOB_SUBMISSION, identity=urn:0-1-1241655174950) setting 
>>>> status to Submitting
>>>> Submitting task: Task(type=JOB_SUBMISSION, 
>>>> identity=urn:0-1-1241655174950)
>>>> <startTime name="submission">1241655180260</startTime>
>>>> <startTime name="createManagedJob">1241655180623</startTime>
>>>> <endTime name="createManagedJob">1241655181975</endTime>
>>>> Task submitted: Task(type=JOB_SUBMISSION, 
>>>> identity=urn:0-1-1241655174950)
>>>> Progress:  Submitting:1
>>>>
>>>> Progress:  Submitting:1
>>>>
>>>> Progress:  Submitting:1
>>>> Progress:  Submitting:1
>>>> Progress:  Submitting:1
>>>> Progress:  Submitting:1
>>>> ^C
>>>> yizhu at ubuntu:~/swift-0.8/examples/swift$
>>>>
>>>> --------------------------------------------------------------
>>>> [3] see attachment
>>>> -------------------------------------------------------------
>>>> [4]
>>>> tp-x001 torque # qstat -f
>>>> Job Id: 3.tp-x001.ci.uchicago.edu
>>>>     Job_Name = STDIN
>>>>     Job_Owner = jobrun at tp-x001.ci.uchicago.edu
>>>>     job_state = R
>>>>     queue = batch
>>>>     server = tp-x001.ci.uchicago.edu
>>>>     Checkpoint = u
>>>>     ctime = Wed May  6 19:22:10 2009
>>>>     Error_Path = tp-x001.ci.uchicago.edu:/dev/null
>>>>     exec_host = tp-x002/0
>>>>     Hold_Types = n
>>>>     Join_Path = n
>>>>     Keep_Files = n
>>>>     Mail_Points = n
>>>>     mtime = Wed May  6 19:22:10 2009
>>>>     Output_Path = tp-x001.ci.uchicago.edu:/dev/null
>>>>     Priority = 0
>>>>     qtime = Wed May  6 19:22:10 2009
>>>>     Rerunable = True
>>>>     Resource_List.neednodes = 1
>>>>     Resource_List.nodect = 1
>>>>     Resource_List.nodes = 1
>>>>     Shell_Path_List = /bin/sh
>>>>     substate = 40
>>>>     Variable_List = PBS_O_HOME=/home/jobrun,PBS_O_LOGNAME=jobrun,
>>>>     PBS_O_PATH=/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sb
>>>>     in:/opt/bin,PBS_O_SHELL=/bin/bash,PBS_SERVER=tp-x001.ci.uchicago.edu,
>>>>     PBS_O_HOST=tp-x001.ci.uchicago.edu,
>>>>     PBS_O_WORKDIR=/home/jobrun/first-20090506-1922-xaandi54,
>>>>     PBS_O_QUEUE=batch
>>>>     euser = jobrun
>>>>     egroup = users
>>>>     hashname = 3.tp-x001.c
>>>>     queue_rank = 2
>>>>     queue_type = E
>>>>     comment = Job started on Wed May 06 at 19:22
>>>>     etime = Wed May  6 19:22:10 2009
>>>>
>>>> tp-x001 torque #
>>>>
>>>>
>>>> ------------------------------------------------------------------------ 
>>>>
>>>>
>>>> _______________________________________________
>>>> Swift-devel mailing list
>>>> Swift-devel at ci.uchicago.edu
>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>>>
>>
>
>
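
For anyone hitting the same symptom, the check Yi did by eye on the
qstat -f output quoted in [4] can be sketched in Python. This is only an
illustration and not part of Swift or Torque: the names parse_qstat_full
and discarded_streams and the trimmed SAMPLE text are made up for the
example, and the parser assumes the "name = value" layout shown in [4],
with wrapped values continued on the following lines. It only reports
which jobs have stdout/stderr pointed at /dev/null; it does not diagnose
why a run stalls.

```python
import re

# Matches an attribute line such as "    Output_Path = host:/dev/null".
ATTR = re.compile(r"^\s+([A-Za-z_][\w.]*) = (.*)$")

# Trimmed copy of the qstat -f output quoted in [4] (illustrative only).
SAMPLE = """\
Job Id: 3.tp-x001.ci.uchicago.edu
    Job_Name = STDIN
    job_state = R
    Error_Path = tp-x001.ci.uchicago.edu:/dev/null
    Output_Path = tp-x001.ci.uchicago.edu:/dev/null
    Variable_List = PBS_O_HOME=/home/jobrun,PBS_O_LOGNAME=jobrun,
    PBS_O_WORKDIR=/home/jobrun/first-20090506-1922-xaandi54,
    PBS_O_QUEUE=batch
    euser = jobrun
"""


def parse_qstat_full(text):
    """Parse qstat -f style text into {job_id: {attribute: value}}."""
    jobs = {}
    current = None   # attribute dict of the job being read
    last_key = None  # attribute that a continuation line belongs to
    for raw in text.splitlines():
        line = raw.rstrip()
        if not line.strip():
            continue
        if line.startswith("Job Id:"):
            current = {}
            jobs[line.split(":", 1)[1].strip()] = current
            last_key = None
            continue
        if current is None:
            continue  # ignore anything before the first job header
        match = ATTR.match(line)
        if match:
            last_key = match.group(1)
            current[last_key] = match.group(2)
        elif last_key is not None:
            # Wrapped value: glue it onto the previous attribute.
            current[last_key] += line.strip()
    return jobs


def discarded_streams(jobs):
    """Return {job_id: [attributes]} whose path part is /dev/null."""
    result = {}
    for job_id, attrs in jobs.items():
        hits = []
        for key in ("Error_Path", "Output_Path"):
            value = attrs.get(key, "")
            # Values look like "host:/path"; compare only the path part.
            if value.split(":", 1)[-1] == "/dev/null":
                hits.append(key)
        if hits:
            result[job_id] = hits
    return result


if __name__ == "__main__":
    parsed = parse_qstat_full(SAMPLE)
    for job_id, keys in discarded_streams(parsed).items():
        print(job_id, "discards", ", ".join(keys))
```

To use it on a live system, the text would come from running qstat -f on
the PBS server and passing its output to parse_qstat_full instead of
SAMPLE.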

-- 
===================================================
Ioan Raicu, Ph.D.
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================
===================================================



